NengoDL MNIST Tutorial: Activation function used during training

Hello, I am pretty new to Nengo and spiking neural networks in general. I am trying to understand the MNIST classification tutorial. Some of my questions are already answered in one of the topics: “NengoDL MNIST tutorial: some questions about Conv2D TensorNode”. However, I still have a few doubts/questions I would like to clear up.

  1. As far as I understand, we cannot train the SNN model directly because of the non-differentiable (spiking) activation function. The tutorial uses LIF as the neuron model; what kind of approximation of this neuron model is used during training? Is the network trained on SoftLIF?

  2. What about the encoding scheme? I read that no encoding scheme is applied to the input vector; it is fed directly to the network as real values. If I had to apply an encoding scheme, how would it change the network?

I am sorry if these questions are answered already… or if I am asking stupid questions. :grimacing:

Thank you.


Hi @Choozi, and welcome to the forum!

Those are great questions! There may be partial answers to them scattered throughout the forum, but I’ll try to give a high-level overview here.

There are two difficulties in training with spiking LIF neurons. The first is that it’s spiking, and spikes are non-differentiable as you’ve pointed out. To get around that, we just use a rate approximation to the spiking neuron (i.e. if you gave the spiking neuron a constant input current of a particular strength, what would its firing rate be). This method of training on rate functions is sometimes called “ANN-SNN synthesis” in the literature, because we’re training the network as an ANN (with the rate approximation as the nonlinearity), and then using these weights for our spiking network at test time. This is opposed to spike-based methods, which find a way to make the spikes differentiable. There is also an in-between method, which is to use spikes for the forward pass during training, and the rate function for the backward pass. This method is implemented in our KerasSpiking package with the spiking_aware_training option on the SpikingActivation layer.
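To make that concrete, here is a rough standalone sketch of the standard LIF rate curve (just an illustration of the math, not NengoDL's internal code; tau_rc and tau_ref are set to Nengo's default values):

    import numpy as np

    def lif_rate(J, tau_rc=0.02, tau_ref=0.002):
        """Steady-state firing rate of a LIF neuron given constant input current J.

        This is the kind of rate curve that stands in for the spiking
        nonlinearity when training the network as an ANN (the firing
        threshold is normalized to J = 1).
        """
        J = np.asarray(J, dtype=float)
        rates = np.zeros_like(J)
        above = J > 1  # only neurons driven above threshold fire
        rates[above] = 1.0 / (tau_ref + tau_rc * np.log1p(1.0 / (J[above] - 1.0)))
        return rates

    print(lif_rate([0.5, 1.05, 2.0, 10.0]))  # firing rates in Hz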

The other difficulty of training with LIF neurons is that even the rate approximation does not have a continuous derivative. Specifically, its derivative explodes right around the firing threshold. To avoid the potential numerical problems with this, we typically train with the SoftLIF neuron type. However, in some of our work we’ve found that modern optimizers (e.g. Adam) with features like gradient clipping can actually train fine with the standard (non-soft) LIF rate approximation. We haven’t tested this extensively, though.
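For illustration, here is a minimal sketch of the SoftLIF idea (the sigma value is arbitrary, and this is not NengoDL's exact implementation): the hard rectification of the input current is replaced by a softplus, which keeps the derivative finite around the firing threshold.

    import numpy as np

    def softlif_rate(J, tau_rc=0.02, tau_ref=0.002, sigma=0.02):
        """SoftLIF rate: the LIF rate curve with the hard threshold smoothed out.

        The rectified drive max(J - 1, 0) is replaced by the softplus
        sigma * log(1 + exp((J - 1) / sigma)); as sigma -> 0 this
        approaches the standard LIF rate curve.
        """
        J = np.asarray(J, dtype=float)
        z = sigma * np.logaddexp(0.0, (J - 1.0) / sigma)  # smoothed max(J - 1, 0)
        with np.errstate(divide="ignore"):
            return 1.0 / (tau_ref + tau_rc * np.log1p(1.0 / z))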

As for the encoding scheme, the best place to look is at our NengoLoihi deep learning models. For example, in the CIFAR-10 example, you’ll see that when we create the network (code block [5]), our first layer (called “input-layer”) is off-chip. This layer is responsible for encoding the input image into spikes. It’s a convolution layer with a 1x1 (i.e. point-wise) convolution kernel. Essentially, it allows the network to learn how to map each RGB pixel (3 channels) into the output of 4 spiking neurons (the number of output filters for that layer). I like this way of doing things, because it lets the network learn the best way of encoding using the specified number of neurons. You could use other numbers of filters: 6 filters would allow for one “on” neuron and one “off” neuron per RGB channel, which should give a very good encoding; 3 filters would allow only one neuron per channel, which is enough to represent the information, but may mean that some input colours (e.g. black) have low firing rates for all input neurons, and thus the network would respond more slowly to these colours. (This might be okay if your images typically have a light foreground on a dark background, but may be more problematic for dark foreground on light background, for example.)
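As a rough sketch of what such an off-chip encoding layer can look like in Nengo (the image size, filter count, and neuron type below are illustrative, not the exact code from the CIFAR-10 example):

    import numpy as np
    import nengo

    with nengo.Network() as net:
        # Real-valued RGB input: 32 x 32 pixels, 3 channels (flattened)
        inp = nengo.Node(np.zeros(32 * 32 * 3))

        # "Encoding" layer: one spiking neuron per pixel per output filter
        n_filters = 4
        enc = nengo.Ensemble(
            32 * 32 * n_filters, 1, neuron_type=nengo.SpikingRectifiedLinear()
        )

        # 1x1 (pointwise) convolution: learns to map 3 channels -> 4 filters
        nengo.Connection(
            inp,
            enc.neurons,
            transform=nengo.Convolution(
                n_filters=n_filters,
                input_shape=(32, 32, 3),
                kernel_size=(1, 1),
                strides=(1, 1),
            ),
        )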

So basically the encoding scheme just adds another layer, which takes in the real-valued pixels and outputs spikes. No matter how they’re portrayed, this is essentially what any encoding scheme is doing. Really all that differs is how easy this encoding scheme is to compute. Using a 1x1 convolution kernel to map 3 input channels to 4 spiking neurons requires 12 multiplies per pixel; if you have two neurons per channel, you might only need 6 multiplies (one per neuron, since each is only getting input from one channel). Or if you change either of these methods to only multiply by 1 or -1, then you don’t actually need any multiplies at all (you can just add or subtract), which would be easier to implement on fixed-point hardware.
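For instance, the "no multiplies" variant could be written as a fixed weight matrix like this (a purely hypothetical sketch, not code from any of our examples): each input value drives one "on" neuron with weight +1 and one "off" neuron with weight -1.

    import numpy as np

    def onoff_encoding_transform(n_pixels, n_channels=3):
        """Fixed +1/-1 weights: one "on" and one "off" neuron per input value.

        Because every weight is +1 or -1, applying this encoding needs only
        additions and subtractions, which is cheap on fixed-point hardware.
        """
        n_in = n_pixels * n_channels
        W = np.zeros((2 * n_in, n_in))
        W[0::2, :] = np.eye(n_in)   # "on" neurons get weight +1
        W[1::2, :] = -np.eye(n_in)  # "off" neurons get weight -1
        return W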

@Eric Thank you very much for your detailed explanation and for clearing up my doubts. :slight_smile: I will dive deeper into the encoding process used in https://www.nengo.ai/nengo-loihi/examples/cifar10-convnet.html and will get back to you if I have any questions about it.

However, I have some more questions and I would be very happy to get the answers for them. :slight_smile:

  1. Could you please provide a link to a paper on the rate approximation that is used in Nengo?

  2. The input node in the MNIST example, i.e., inp = nengo.Node(np.zeros(28 * 28)), is used to take the input image. Does this np.zeros create a 2D matrix or a 1D array to hold the input image? And why is it zeros?

  3. Is there any way to do time-distributed simulation, just like we do in Keras using TimeDistributed layers? For example, if I have a 24x24 image, I would like to feed it to the network one column (i.e., 24x1) per timestep. Is this somehow possible with NengoDL?

Thank you. :slight_smile:

  1. The paper showing the SoftLIF neuron is here.
  2. The np.zeros(28 * 28) creates a vector (1D array) of 784 elements. It’s just a placeholder to get the right size for the input node. When sim.fit or sim.evaluate is called in that example, you’ll see it is passed a set of images; these are used as the input by overwriting the zeros.
  3. Yes, it is possible to do simulation over multiple timesteps. You’ll notice that both train_images and test_images in that example are 3D arrays. The first axis is the “batch” axis, where each entry corresponds to a different example; the second axis is the time axis, where each entry is what is shown at a subsequent timestep; the final axis is the feature axis, where each element corresponds to a different feature (in this case, pixel). You’ll notice that test_images already has a time axis of 30, where the image is just tiled (i.e. repeated) across the axis. This is because when the network is evaluated on the test data, it is run over time as the spiking network. For training, we just have a single timestep (because we’re training as a rate network), but there’s no reason you can’t do something similar for the training data, to train over time. In your case, if you’ve got a set of n images of 24 timesteps with 24 pixels, you’d want your array to have shape (n, 24, 24).
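To make those shapes concrete, here is a rough sketch (random placeholder arrays stand in for the MNIST data; the variable names just mirror the tutorial):

    import numpy as np

    # Flattened 28x28 images: shape (n_examples, 784)
    train_flat = np.random.rand(100, 28 * 28)
    test_flat = np.random.rand(20, 28 * 28)

    # Training as a rate network: a single timestep per example
    train_images = train_flat[:, None, :]  # shape (100, 1, 784)

    # Testing as a spiking network: tile each image across 30 timesteps
    n_steps = 30
    test_images = np.tile(test_flat[:, None, :], (1, n_steps, 1))  # (20, 30, 784)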

@Eric thank you for your prompt reply.

1. The paper showing the SoftLIF neuron is here.

Perfect. Thank you. :slight_smile:

2. The np.zeros(28 * 28) creates a vector (1D array) of 784 elements. It’s just a placeholder to get the right size for the input node. When sim.fit or sim.evaluate is called in that example, you’ll see it is passed a set of images; these are used as the input by overwriting the zeros.
OK, this is what my understanding was, but I was a bit doubtful and wanted to make sure. Thank you :slight_smile:

3. Yes, it is possible to do simulation over multiple timesteps. You’ll notice that both train_images and test_images in that example are 3D arrays. The first axis is the “batch” axis, where each entry corresponds to a different example; the second axis is the time axis, where each entry is what is shown at a subsequent timestep; the final axis is the feature axis, where each element corresponds to a different feature (in this case, pixel). You’ll notice that test_images already has a time axis of 30, where the image is just tiled (i.e. repeated) across the axis. This is because when the network is evaluated on the test data, it is run over time as the spiking network. For training, we just have a single timestep (because we’re training as a rate network), but there’s no reason you can’t do something similar for the training data, to train over time. In your case, if you’ve got a set of n images of 24 timesteps with 24 pixels, you’d want your array to have shape (n, 24, 24).

Thank you for your explanation. However, I am sorry if I am not understanding it correctly. I do not want to repeat my image during training; what I meant was that instead of feeding the whole 24x24 image at once, I would like to feed only one column per step. For example, if [1 2; 3 4] is an image, I would like to first feed column [1; 3] to the network, then [2; 4], and then make a prediction, something like this…

The method I’ve proposed has no repeating (I’m not calling np.tile or anything like that). It’s simply making sure your data is in the shape (n, 24, 24) when you pass it in to NengoDL. Since the middle axis is the time axis, each one of those columns will be presented for one timestep; the total length of your simulation will be 24 timesteps. If, on the other hand, you wanted to run as a standard ANN, presenting your whole image at once for one timestep, you’d want data with shape (n, 1, 24 * 24).

EDIT: Sorry, I just realized I was presenting rows one at a time, rather than columns. To do columns, just transpose. If you’ve got your images with shape (n, 24, 24), then call e.g. images = np.transpose(images, (0, 2, 1)) to have the columns on the time axis, and the rows on the feature axis.
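Putting that together, a minimal sketch (with random placeholder data in place of your images):

    import numpy as np

    n = 10  # number of example images (placeholder)
    images = np.random.rand(n, 24, 24)  # (examples, rows, columns)

    # Rows on the time axis: row i is presented at timestep i
    row_per_step = images  # already shaped (n, 24, 24)

    # Columns on the time axis: swap the last two axes
    col_per_step = np.transpose(images, (0, 2, 1))  # (n, columns, rows)

    # Whole image at once, for a single timestep (standard ANN-style input)
    all_at_once = images.reshape(n, 1, 24 * 24)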


@Eric thank you for the explanation… :slight_smile:
I have one last question related to this thread for now:

How are the training time/step and prediction time/step calculated?
When I print the time/step during training and during prediction, they are significantly different. And can we control this time/step somehow?

The number of timesteps is controlled by the length of that time axis. So if my data has shape (10, 20, 5) then it will run for 20 timesteps, whereas if it has shape (10, 40, 5), it will run for 40 timesteps. You can have different numbers of timesteps for training and prediction (as in the spiking MNIST example, where we train as an ANN so we just need one timestep during training, but then we run as a spiking network, so we repeat the image for 30 timesteps). But they can also be the same; it just depends on the data you use for training and testing.

If you’ve got a series of images (i.e. a video) and you want to present each image for more than one timestep, you can do something like repeated_images = np.repeat(images, repeats, axis=1), where repeats is the number of times you want to repeat each image (frame).
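For example (the shapes here are just placeholders to show the call):

    import numpy as np

    # A batch of 8 "videos", each with 24 frames of 784 features
    images = np.random.rand(8, 24, 784)

    repeats = 5  # present each frame for 5 timesteps
    repeated_images = np.repeat(images, repeats, axis=1)  # shape (8, 120, 784)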