Time series classification using LMUs (Nengo-dl)


I followed the psmnist example on Nengo-dl and wanted to try using LMUs for a different time series classification problem. I picked the FordA binary classification problem to test my understanding of LMUs.

However, after training my LMU-based network (same structure as the one used in the psmnist example) for 10 epochs (also the same as the psmnist example), I got a final test accuracy of around 50%, with little to no improvement between epochs.

This is the code I used (the first part is from the example I mentioned above):

import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras


def readucr(filename):
    data = np.loadtxt(filename, delimiter="\t")
    y = data[:, 0]
    x = data[:, 1:]
    return x, y.astype(int)

root_url = "https://raw.githubusercontent.com/hfawaz/cd-diagram/master/FordA/"

x_train, y_train = readucr(root_url + "FordA_TRAIN.tsv")
x_test, y_test = readucr(root_url + "FordA_TEST.tsv")

x_train = x_train.reshape((x_train.shape[0], x_train.shape[1], 1))
x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], 1))
num_classes = len(np.unique(y_train))
idx = np.random.permutation(len(x_train))
x_train = x_train[idx]
y_train = y_train[idx]
y_train[y_train == -1] = 0
y_test[y_test == -1] = 0

#####nengo from here########################
import nengo
import nengo_dl
from nengo.utils.filter_design import cont2discrete

class LMUCell(nengo.Network):
    def __init__(self, units, order, theta, input_d, **kwargs):
        super().__init__(**kwargs)

        # compute the A and B matrices according to the LMU's mathematical derivation
        # (see the paper for details)
        Q = np.arange(order, dtype=np.float64)
        R = (2 * Q + 1)[:, None] / theta
        j, i = np.meshgrid(Q, Q)

        A = np.where(i < j, -1, (-1.0) ** (i - j + 1)) * R
        B = (-1.0) ** Q[:, None] * R
        C = np.ones((1, order))
        D = np.zeros((1,))

        A, B, _, _, _ = cont2discrete((A, B, C, D), dt=1.0, method="zoh")

        with self:

            # create objects corresponding to the x/u/m/h variables in the above diagram
            self.x = nengo.Node(size_in=input_d)
            self.u = nengo.Node(size_in=1)
            self.m = nengo.Node(size_in=order)
            self.h = nengo_dl.TensorNode(tf.nn.tanh, shape_in=(units,), pass_time=False)

            # compute u_t from the above diagram. we have removed e_h and e_m as they
            # are not needed in this task.
            nengo.Connection(
                self.x, self.u, transform=np.ones((1, input_d)), synapse=None
            )

            # compute m_t
            # in this implementation we'll make A and B non-trainable, but they
            # could also be optimized in the same way as the other parameters.
            # note that setting synapse=0 (versus synapse=None) adds a one-timestep
            # delay, so we can think of any connections with synapse=0 as representing
            # value_{t-1}.
            conn_A = nengo.Connection(self.m, self.m, transform=A, synapse=0)
            self.config[conn_A].trainable = False
            conn_B = nengo.Connection(self.u, self.m, transform=B, synapse=None)
            self.config[conn_B].trainable = False

            # compute h_t
            nengo.Connection(
                self.x, self.h, transform=nengo_dl.dists.Glorot(), synapse=None
            )
            nengo.Connection(
                self.h, self.h, transform=nengo_dl.dists.Glorot(), synapse=0
            )

y_test = y_test[:,None,None]
y_train = y_train[:,None,None]

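seed = 0  # assumed value; any fixed integer works

with nengo.Network(seed=seed) as net:
    # remove some unnecessary features to speed up the training
    nengo_dl.configure_settings(
        trainable=None,
        stateful=False,
        keep_history=False,
    )
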
    # input node
    inp = nengo.Node(np.zeros(1))

    # lmu cell
    lmu = LMUCell(
    conn = nengo.Connection(inp, lmu.x, synapse=None)
    net.config[conn].trainable = False

    # dense linear readout
    out = nengo.Node(size_in=1)
    nengo.Connection(lmu.h, out, transform=nengo_dl.dists.Glorot(), synapse=None)

    # record output. note that we set keep_history=False above, so this will
    # only record the output on the last timestep (which is all we need
    # on this task)
    p = nengo.Probe(out)

do_training = True

with nengo_dl.Simulator(net, minibatch_size=100, unroll_simulation=25) as sim:

    print(
        "Initial test accuracy: %.2f%%"
        % (sim.evaluate(x_test, y_test, verbose=1)["probe_accuracy"] * 100)
    )

    if do_training:
        sim.fit(x_train, y_train, epochs=10)

    print(
        "Final test accuracy: %.2f%%"
        % (sim.evaluate(x_test, y_test, verbose=1)["probe_accuracy"] * 100)
    )

Could you please help me figure out my mistake here?

Also, kindly suggest some sources for further intuition on LMUs and their hyperparameter tuning.

Hi @hetP, and welcome to the Nengo forums! :smiley:

I played around with your code and have some comments:

Regarding your observation that the test accuracy stays around 50% with little to no improvement between epochs, I believe there are multiple causes:

The data you are using contains two labels, 0, and 1, and I think this is what is intended for this dataset. However, the Nengo network is configured in such a way that it only produces a 1D (scalar) output:

    # dense linear readout
    out = nengo.Node(size_in=1)

I’m not sure if this was intentional, but if your data contains 2 classes, then the nengo.Node here should have size_in=2. After making this change, you will find that the network fails to train, with TensorFlow throwing an error about mismatched array sizes. The cause is that you are using the CategoricalCrossentropy loss function instead of SparseCategoricalCrossentropy. According to the TensorFlow documentation, CategoricalCrossentropy should be used if your labels are “one-hot”, meaning each label is a vector: [1, 0] for the first class or [0, 1] for the second class. That is not the case with your dataset (the label is 0 for the first class and 1 for the second class), so in this instance you want SparseCategoricalCrossentropy (see the TensorFlow documentation for that loss).
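To make the label-format point concrete, here is a small standard-library sketch. It is not TensorFlow's implementation, and the function names and logit values are made up for illustration. It shows that integer labels fed to a sparse cross-entropy give the same loss as one-hot labels fed to a categorical cross-entropy, for a two-unit output:

```python
import math


def softmax(logits):
    """Convert raw network outputs (logits) into class probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_label, logits):
    """Loss for one-hot labels such as [1, 0] or [0, 1]."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


def sparse_categorical_crossentropy(class_index, logits):
    """Loss for integer labels such as 0 or 1 (the format used in this thread)."""
    probs = softmax(logits)
    return -math.log(probs[class_index])


logits = [0.3, 1.2]  # hypothetical output of a size_in=2 readout node
loss_one_hot = categorical_crossentropy([0, 1], logits)
loss_sparse = sparse_categorical_crossentropy(1, logits)
print(loss_one_hot, loss_sparse)  # the two values agree
```

In the NengoDL code, my understanding is that this corresponds to passing tf.losses.SparseCategoricalCrossentropy(from_logits=True) to sim.compile; from_logits=True is an assumption that fits a purely linear readout.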

Making these changes will result in a network where the probe accuracy increases with the number of epochs, but I did notice that no matter what I tried, the validation accuracy was still about 50%. I’m not an expert with using the LMU on various datasets, so perhaps @arvoelke or @Eric can chime in with suggestions on how to improve the network. My thought is that either the LMU is too good and immediately overfitting the data, or that some pre- or post-processing layers (dropout, maybe?) are needed.

There are, unfortunately, not many sources describing how to tune these LMUs. I would, however, refer you to our KerasLMU Python package, which is a TensorFlow-native implementation of the LMU (i.e., no Nengo needed). The KerasLMU documentation includes links to the original LMU paper, as well as an API reference with descriptions of what each of the LMU parameters is meant to do.

Another thing you can do to debug is to run your testing procedure (the sim.evaluate part) on your training data (or at least part of it). If it’s also getting 50% there, that suggests there’s a bug in the testing procedure; otherwise, it sounds like overfitting. In addition to Xuan’s suggestion of dropout layers, the first thing I would try is to simply make the LMU smaller (fewer units, lower order).
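That check can be written down as a small helper. This is only a sketch of the reasoning: the function name, the chance level, and the tolerance are illustrative choices, not anything from NengoDL.

```python
def diagnose(train_eval_acc, test_acc, chance=0.5, tol=0.05):
    """Rough interpretation of accuracies for a binary task.

    train_eval_acc: accuracy from running the *evaluation* code on training data
    test_acc: accuracy from running the same code on the test data
    chance, tol: illustrative thresholds for "about 50%"
    """
    train_at_chance = abs(train_eval_acc - chance) <= tol
    test_at_chance = abs(test_acc - chance) <= tol
    if train_at_chance and test_at_chance:
        # Even data the model was fit on scores at chance.
        return "possible bug in the evaluation procedure"
    if test_at_chance:
        # Training data scores well, unseen data does not.
        return "likely overfitting"
    return "generalizing above chance"


print(diagnose(0.51, 0.50))
print(diagnose(0.97, 0.50))
```

With the simulator from the question, the first argument would come from something like sim.evaluate(x_train[:1000], y_train[:1000], verbose=0)["probe_accuracy"], where the slice of 1000 is just an example size that is a multiple of the minibatch size.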

@arvoelke also took a quick look at your model and had these comments to say:

I think this is just a fundamental problem with the size of the dataset compared to the size of the model. The model has 100k trainable parameters, while the dataset has only 3601 bits of information to memorize (3601 sequences, each with either a single 1 or a 0 label). So even a very small model is still bound to just memorize all of that data. He probably needs to do a lot of data augmentation, dropout, and make the model way (way) smaller.
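As a rough check of that parameter count, here is a back-of-the-envelope sketch. It assumes the sizes from the psMNIST example (units=212, order=256), which are not shown in the post above, and it assumes trainable x->h, h->h, and m->h connections with no bias terms. The helper name is made up for illustration.

```python
def count_lmu_params(units, order, input_d, output_d):
    """Count trainable weights in an LMU cell plus a dense linear readout.

    Assumes trainable x->h, h->h, and m->h connections and no biases.
    The A and B matrices are excluded because they are marked non-trainable,
    as is the fixed x->u connection.
    """
    x_to_h = input_d * units
    h_to_h = units * units
    m_to_h = order * units
    h_to_out = units * output_d
    return x_to_h + h_to_h + m_to_h + h_to_out


# units and order are assumed values taken from the psMNIST example
n_params = count_lmu_params(units=212, order=256, input_d=1, output_d=1)
n_train_labels = 3601  # FordA training sequences, one binary label each
print(n_params)  # 99640 weights under these assumptions
print(n_params / n_train_labels)  # trainable weights per training label
```

Under these assumptions there are roughly 28 trainable weights for every training label, which is consistent with the memorization concern above.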

After some modifications to your code, he also reported:

I got rid of a bunch of connections and made the model way smaller (~8.5k params). It now gets 71.90% test accuracy, which is measurably better than 50%. It would also get better with more training and dropout for sure.

Here is the code he modified: test_forda3.py (6.4 KB)
Looking at the code, the changes he made were:

  • Swapping the neuron activation from tanh to relu.
  • Removing the computation of the hidden state (h_t).

This code should give you a good starting point for further improving the model’s performance.