Hi @caterina,
I took a look at your code, and I don't think you are doing anything wrong per se, just that you might have the LMU network misconfigured for your dataset. First and foremost, you should familiarize yourself with the LMU architecture (see this paper, as well as this paper [or this thesis] on the delay network on which the LMU is based).
The LMU operates by trying to find weights on a set of basis functions such that, when evaluated at a specific time, these weighted basis functions "predict" an output time-varying signal. If the LMU network has been trained successfully on the training data, the predicted output signal should match what is expected of the network given some specific input. Even after reading that quick description, or all of the material I posted above, you might still not fully understand how the LMU operates, so I'll use an analogy below (the analogy is not exact, but it should give you enough intuition about the various parameters of the LMU to help you tune it to your needs).
You can think of Fourier coefficients as an analogy to how the LMU weights the (Legendre) basis functions in the LMU network. With Fourier coefficients, one can reconstruct any time-varying signal by applying a weight (the Fourier coefficient) to the corresponding sine function, and then adding all of the weighted sine functions together. The LMU essentially does the same thing, but with the Legendre basis functions. The exact details of why the Legendre basis functions are used (it can be proven mathematically that they are the optimal choice) can be found in the thesis I linked.
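To make the analogy a bit more concrete, here is a small numpy sketch. To be clear, this is not how the LMU computes anything internally (it uses an LTI system to do this online); it just illustrates the "weighted basis functions reconstruct a window of signal" idea, with the weights found by an ordinary least-squares fit:

```python
import numpy as np

theta = 1.0                      # window length, analogous to the LMU's theta
order = 6                        # number of basis functions, analogous to the LMU's order
t = np.linspace(0, theta, 200)   # time points within the window

# An example signal to represent over the window.
signal = np.sin(2 * np.pi * t / theta) + 0.5 * (t / theta)

# Legendre polynomials are defined on [-1, 1], so rescale the window onto that range.
x = 2 * t / theta - 1
basis = np.stack(
    [np.polynomial.legendre.Legendre.basis(i)(x) for i in range(order)], axis=1
)

# The "weights on the basis functions", found here with a least-squares fit
# (much like computing Fourier coefficients in the analogy above).
weights, *_ = np.linalg.lstsq(basis, signal, rcond=None)
reconstruction = basis @ weights

print("max reconstruction error:", np.max(np.abs(reconstruction - signal)))
```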
With that quick summary of how the LMU works out of the way, there are two important parameters to the LMU network:
- Order: the number of degrees (i.e., the number of basis functions) of the LTI system in the LMU. I'll discuss this in a bit.
- Theta: the length of the sliding window of data being represented by the LTI system in the LMU.
Order refers to the number of basis functions used in the LMU. To continue the Fourier analogy: in the Fourier domain, using more basis functions allows you to represent "sharper" (higher-frequency) functions. The order of the LMU has the same effect, and a higher order allows the LMU to represent (and store) quicker changes in the input signal.
Theta refers to the amount of time that is stored within the internal memory of the LMU. Unlike the Fourier domain (where each basis function represents a specific frequency), the basis functions in the LMU are essentially temporal basis functions (signals that change in a known way over a specific window of time). The theta parameter determines how much time the basis functions represent, essentially determining how much temporal information is stored within the LMU network that can subsequently be used in the training (and decoding) process.
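As a quick follow-on to the same (non-LMU) least-squares sketch from above, here is what happens when you fit a sharp feature with different numbers of Legendre basis functions over a fixed window. The reconstruction error drops as the order goes up, which is the intuition behind "higher order = quicker changes can be represented":

```python
import numpy as np

def reconstruction_error(window, order):
    """Least-squares fit `order` Legendre basis functions to one window of data."""
    x = np.linspace(-1, 1, len(window))
    basis = np.stack(
        [np.polynomial.legendre.Legendre.basis(i)(x) for i in range(order)], axis=1
    )
    weights, *_ = np.linalg.lstsq(basis, window, rcond=None)
    return np.sqrt(np.mean((basis @ weights - window) ** 2))

# A window containing one sharp, narrow bump (a "quick change" in the input).
t = np.linspace(0, 1, 784)
window = np.exp(-((t - 0.5) ** 2) / (2 * 0.01**2))

for order in (4, 16, 64, 256):
    print(f"order={order:3d}  RMS error={reconstruction_error(window, order):.4f}")
```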
With these two parameters in mind, let's look at how they apply to the various examples linked in your post:
In the psMNIST example, `theta` was set to the length of the entire input (a flattened 28 × 28 matrix, i.e., 784 timesteps). This makes sense because we want the LMU network to be able to capture all of the information in the entire sequence before making a "prediction" about what the input is. As for `order`, this number was set somewhat arbitrarily, such that the overall number of states in the network matches that of a corresponding LSTM network applied to the same problem.
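For reference, here is roughly where those two numbers end up in code. This is only a minimal sketch assuming the standalone `keras_lmu` package (the example you are following may construct the network differently, and the hidden cell size below is just a placeholder), but the roles of `order` and `theta` are the same:

```python
import tensorflow as tf
import keras_lmu  # assumption: the standalone KerasLMU package is being used

n_pixels = 28 * 28  # psMNIST presents the flattened image one pixel per timestep

lmu_layer = keras_lmu.LMU(
    memory_d=1,        # one input dimension (a single pixel value per timestep)
    order=256,         # number of Legendre basis functions
    theta=n_pixels,    # the memory window spans the entire 784-step sequence
    hidden_cell=tf.keras.layers.SimpleRNNCell(212),  # hidden cell size is a placeholder
)
```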
In your code, you've set `theta` to the entire length of the data input (which is several thousand entries long). That may be what you want to do, although I do notice a cyclical nature to the data. If I'm interpreting it correctly, the measurements are taken daily, after every meal? If that is the case, it may be better (and you will see why in a bit) to use a shorter `theta` that encompasses a subset of these cycles. Looking at the value for `order`, here I can see a potential issue: you've configured the LMU to use only a small number of basis functions relative to the length of the data being represented by the network (i.e., relative to the length of `theta`).
If you think about the consequences of this, imagine spreading that small number of Legendre polynomials over the number of timesteps (data points) specified by `theta`. You'll see that even the highest-order polynomial varies quite slowly, with each of its oscillations spanning many timesteps, which in essence limits the ability of the network to represent and respond to large, rapid changes in the input data (which your data seems to exhibit). As a point of comparison, the LMU network for the psMNIST problem used an order of 256 for a theta of 784. Your dataset is roughly 38700 data points long, so a proportional order (38700 × 256 / 784, using the psMNIST example as a reference) would be about 12637.
Now, you could configure your LMU network to use this new value (of 12637), but I suspect this would make the network take absurdly long to train… Alternatively, you could reduce the value of `theta`, which would in effect increase the ability of the LMU to respond to sharper changes in the input, at the cost of not representing all of the training data within the memory window at once.
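To make those two options concrete, here is a hypothetical sketch, again assuming the `keras_lmu` package. All of the specific numbers (`samples_per_day`, the hidden cell sizes, the order of 256 in the second option) are placeholders for you to replace with values appropriate for your data:

```python
import tensorflow as tf
import keras_lmu  # assumption: the standalone KerasLMU package is being used

samples_per_day = 3  # placeholder: e.g., one measurement after each meal

# Option 1: keep theta spanning the whole sequence, and scale order up with it
# (proportional to the psMNIST example). Likely very slow to train.
big_lmu = keras_lmu.LMU(
    memory_d=1,
    order=12637,   # roughly 38700 * 256 / 784
    theta=38700,
    hidden_cell=tf.keras.layers.SimpleRNNCell(100),  # placeholder hidden cell
)

# Option 2: shrink theta to a window covering a few of the daily cycles, so that
# a modest order can still capture the sharp within-cycle changes.
small_lmu = keras_lmu.LMU(
    memory_d=1,
    order=256,                   # placeholder; tune to your data
    theta=7 * samples_per_day,   # e.g., a one-week window; also a placeholder
    hidden_cell=tf.keras.layers.SimpleRNNCell(100),  # placeholder hidden cell
)
```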
Configuring the LMU network for your specific task is, like all ML approaches, a bit of a black art. However, I hope the description I provided above gives you some idea of how to start tweaking your network to get better results.