Hi @caterina,
I took a look at your code, and I don't think you are doing anything wrong per se, just that you might have the LMU network misconfigured for your dataset. First and foremost, you should familiarize yourself with the LMU architecture (see this paper, as well as this paper [or this thesis] on the delay network on which the LMU is based).
The LMU operates by trying to find weights on a set of basis functions such that, when evaluated at a specific time, these weighted basis functions "predict" an output time-varying signal. If the LMU network has been trained successfully on the training data, the predicted output signal should match what is expected of the network given some specific input. Even after reading that quick description, or all of the material I posted above, you might still not fully understand how the LMU operates, so I'll use an analogy below (the analogy is not exact, but it should give you enough intuition about the various parameters of the LMU to help you tune it to your needs).
You can think of Fourier coefficients as an analogy to how the LMU weights the (Legendre) basis functions in the LMU network. With Fourier coefficients, one can reconstruct any time-varying signal by applying a weight (the Fourier coefficient) to the corresponding sine function, and then adding all of the weighted sine functions together. The LMU essentially does the same thing, but with the Legendre basis functions. The exact details of why the Legendre basis functions are used (it can be proven mathematically that they are the optimal choice) can be found in the thesis I linked.
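To make the analogy a bit more concrete, here is a small numpy sketch. To be clear, this is not how the LMU computes anything internally (it uses an LTI system to do this online); it just illustrates the "weighted basis functions reconstruct a window of signal" idea, with the weights found by an ordinary least-squares fit:

```python
import numpy as np

theta = 1.0                      # window length, analogous to the LMU's theta
order = 6                        # number of basis functions, analogous to the LMU's order
t = np.linspace(0, theta, 200)   # time points within the window

# An example signal to represent over the window.
signal = np.sin(2 * np.pi * t / theta) + 0.5 * (t / theta)

# Legendre polynomials are defined on [-1, 1], so rescale the window onto that range.
x = 2 * t / theta - 1
basis = np.stack(
    [np.polynomial.legendre.Legendre.basis(i)(x) for i in range(order)], axis=1
)

# The "weights on the basis functions", found here with a least-squares fit
# (much like computing Fourier coefficients in the analogy above).
weights, *_ = np.linalg.lstsq(basis, signal, rcond=None)
reconstruction = basis @ weights

print("max reconstruction error:", np.max(np.abs(reconstruction - signal)))
```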
With that quick summary of how the LMU works out of the way, there are two important parameters to the LMU network:
- Order: the number of degrees (i.e., the number of basis functions) of the LTI system in the LMU. I'll discuss this in a bit.
- Theta: the length of the sliding window of data being represented by the LTI system in the LMU.
Order refers to the number of basis functions used in the LMU. To continue the Fourier analogy: in the Fourier domain, using more basis functions allows you to represent "sharper" (higher-frequency) functions. The order of the LMU has the same effect, and a higher order allows the LMU to represent (and store) quicker changes in the input signal.
Theta refers to the amount of time that is stored within the internal memory of the LMU. Unlike the Fourier domain (where each basis function represents a specific frequency), the basis functions in the LMU are essentially temporal basis functions (signals that change in a known way over a specific window of time). The theta parameter determines how much time the basis functions represent, essentially determining how much temporal information is stored within the LMU network that can subsequently be used in the training (and decoding) process.
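As a quick follow-on to the same (non-LMU) least-squares sketch from above, here is what happens when you fit a sharp feature with different numbers of Legendre basis functions over a fixed window. The reconstruction error drops as the order goes up, which is the intuition behind "higher order = quicker changes can be represented":

```python
import numpy as np

def reconstruction_error(window, order):
    """Least-squares fit `order` Legendre basis functions to one window of data."""
    x = np.linspace(-1, 1, len(window))
    basis = np.stack(
        [np.polynomial.legendre.Legendre.basis(i)(x) for i in range(order)], axis=1
    )
    weights, *_ = np.linalg.lstsq(basis, window, rcond=None)
    return np.sqrt(np.mean((basis @ weights - window) ** 2))

# A window containing one sharp, narrow bump (a "quick change" in the input).
t = np.linspace(0, 1, 784)
window = np.exp(-((t - 0.5) ** 2) / (2 * 0.01**2))

for order in (4, 16, 64, 256):
    print(f"order={order:3d}  RMS error={reconstruction_error(window, order):.4f}")
```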
With these two parameters in mind, let's look at how they apply to the various examples linked in your post:
In the psMNIST example, `theta` was set to the length of the entire input (a flattened 28 × 28 matrix, i.e., 784 timesteps). This makes sense because we want the LMU network to be able to capture all of the information in the entire sequence before making a "prediction" about what the input is. As for `order`, this number was set somewhat arbitrarily, such that the overall number of states in the network matches that of a corresponding LSTM network applied to the same problem.
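For reference, here is roughly where those two numbers end up in code. This is only a minimal sketch assuming the standalone `keras_lmu` package (the example you are following may construct the network differently, and the hidden cell size below is just a placeholder), but the roles of `order` and `theta` are the same:

```python
import tensorflow as tf
import keras_lmu  # assumption: the standalone KerasLMU package is being used

n_pixels = 28 * 28  # psMNIST presents the flattened image one pixel per timestep

lmu_layer = keras_lmu.LMU(
    memory_d=1,        # one input dimension (a single pixel value per timestep)
    order=256,         # number of Legendre basis functions
    theta=n_pixels,    # the memory window spans the entire 784-step sequence
    hidden_cell=tf.keras.layers.SimpleRNNCell(212),  # hidden cell size is a placeholder
)
```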
In your code, you've set `theta` to the entire length of the data input (which is several thousand entries long). That may be what you want to do, although I do notice a cyclical nature to the data. If I'm interpreting it correctly, the measurements are taken daily, after every meal? If that is the case, it may be better (and you will see why in a bit) to use a shorter `theta` that encompasses a subset of these cycles. Looking at the value for `order`, here I can see a potential issue: you've configured the LMU to use only a small number of basis functions relative to the length of the data being represented by the network (i.e., relative to the length of `theta`).
If you think about the consequences of this, imagine spreading that small number of Legendre polynomials over the number of timesteps (data points) specified by `theta`. You'll see that even the highest-order polynomial varies quite slowly, with each of its oscillations spanning many timesteps, which in essence limits the ability of the network to represent and respond to large, rapid changes in the input data (which your data seems to exhibit). As a point of comparison, the LMU network for the psMNIST problem used an order of 256 for a theta of 784. Your dataset is roughly 38700 data points long, so a proportional order (38700 × 256 / 784, using the psMNIST example as a reference) would be about 12637.
Now, you could configure your LMU network to use this new value (of 12637), but I suspect this would make the network take absurdly long to train… Alternatively, you could reduce the value of `theta`, which would in effect increase the ability of the LMU to respond to sharper changes in the input, at the cost of not representing all of the training data within the memory window at once.
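To make those two options concrete, here is a hypothetical sketch, again assuming the `keras_lmu` package. All of the specific numbers (`samples_per_day`, the hidden cell sizes, the order of 256 in the second option) are placeholders for you to replace with values appropriate for your data:

```python
import tensorflow as tf
import keras_lmu  # assumption: the standalone KerasLMU package is being used

samples_per_day = 3  # placeholder: e.g., one measurement after each meal

# Option 1: keep theta spanning the whole sequence, and scale order up with it
# (proportional to the psMNIST example). Likely very slow to train.
big_lmu = keras_lmu.LMU(
    memory_d=1,
    order=12637,   # roughly 38700 * 256 / 784
    theta=38700,
    hidden_cell=tf.keras.layers.SimpleRNNCell(100),  # placeholder hidden cell
)

# Option 2: shrink theta to a window covering a few of the daily cycles, so that
# a modest order can still capture the sharp within-cycle changes.
small_lmu = keras_lmu.LMU(
    memory_d=1,
    order=256,                   # placeholder; tune to your data
    theta=7 * samples_per_day,   # e.g., a one-week window; also a placeholder
    hidden_cell=tf.keras.layers.SimpleRNNCell(100),  # placeholder hidden cell
)
```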
Configuring the LMU network for your specific task is, like all ML approaches, a bit of a black art. However, I hope the description I provided above gives you some idea of how to start tweaking your network to get better results.