NengoDL MNIST Tutorial: Activation function used during training

Hello, I am pretty new to Nengo and spiking neural networks in general. I am trying to understand the MNIST classification tutorial. Some of my questions are already answered in one of the topics: “NengoDL MNIST tutorial: some questions about Conv2D TensorNode”. However, I still have some doubts/questions to clear up.

  1. As far as I understand, we cannot train the SNN model directly because of the non-differentiable (spiking) activation function. The tutorial uses LIF as the neuron model; what kind of approximation of this neuron model is used during training? Is the network trained with SoftLIF?

  2. What about the encoding scheme? I read that no encoding scheme is applied to the input vector; it is fed directly to the network as real values. If I had to apply an encoding scheme, how would it change the network?

I am sorry if these questions are answered already… or if I am asking stupid questions. :grimacing:

Thank you.


Hi @Choozi, and welcome to the forum!

Those are great questions! There may be partial answers to them scattered throughout the forum, but I’ll try to give a high-level overview here.

There are two difficulties in training with spiking LIF neurons. The first is that it’s spiking, and spikes are non-differentiable as you’ve pointed out. To get around that, we just use a rate approximation to the spiking neuron (i.e. if you gave the spiking neuron a constant input current of a particular strength, what would its firing rate be). This method of training on rate functions is sometimes called “ANN-SNN synthesis” in the literature, because we’re training the network as an ANN (with the rate approximation as the nonlinearity), and then using these weights for our spiking network at test time. This is opposed to spike-based methods, which find a way to make the spikes differentiable. There is also an in-between method, which is to use spikes for the forward pass during training, and the rate function for the backward pass. This method is implemented in our KerasSpiking package with the spiking_aware_training option on the SpikingActivation layer.
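For example, a rough sketch of that in-between option might look like this (illustrative only, not code from the tutorial; the SpikingActivation layer and spiking_aware_training option come from KerasSpiking, but the layer sizes here are just placeholders):

    import tensorflow as tf
    import keras_spiking

    # input has a time axis: (batch, timesteps, features)
    inp = tf.keras.Input((None, 28 * 28))
    x = tf.keras.layers.Dense(128)(inp)
    # spikes on the forward pass, the rate activation's gradient on the backward pass
    x = keras_spiking.SpikingActivation("relu", spiking_aware_training=True)(x)
    out = tf.keras.layers.Dense(10)(x)
    model = tf.keras.Model(inp, out)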

The other difficulty of training with LIF neurons is that even the rate approximation does not have a continuous derivative. Specifically, its derivative explodes right around the firing threshold. To avoid the potential numerical problems with this, we typically train with the SoftLIF neuron type. However, in some of our work we’ve found that modern optimizers (e.g. Adam) with features like gradient clipping can actually train fine with the standard (non-soft) LIF rate approximation. We haven’t tested this extensively, though.
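As a rough illustration (not the tutorial code), choosing the smoothed rate model for training might look like this, assuming the nengo_dl.SoftLIFRate neuron type from NengoDL; sigma controls how much the rate curve is smoothed around the firing threshold:

    import nengo
    import nengo_dl

    with nengo.Network() as net:
        # smoothed LIF rate approximation for training; smaller sigma is
        # closer to the hard LIF rate curve
        hidden = nengo.Ensemble(
            128, 1, neuron_type=nengo_dl.SoftLIFRate(sigma=0.01)
        )
        # at test time, the same trained weights can be used with spiking
        # neurons by building the network with nengo.LIF() instead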

As for the encoding scheme, the best place to look is at our NengoLoihi deep learning models. For example, in the CIFAR-10 example, you’ll see that when we create the network (code block [5]), our first layer (called “input-layer”) is off-chip. This layer is responsible for encoding the input image into spikes. It’s a convolution layer with a 1x1 (i.e. point-wise) convolution kernel. Essentially, it allows the network to learn how to map each RGB pixel (3 channels) into the output of 4 spiking neurons (the number of output filters for that layer). I like this way of doing things, because it lets the network learn the best way of encoding using the specified number of neurons. You could use other numbers of filters: 6 filters would allow for one “on” neuron and one “off” neuron per RGB channel, which should give a very good encoding; 3 filters would allow only one neuron per channel, which is enough to represent the information, but may mean that some input colours (e.g. black) have low firing rates for all input neurons, and thus the network would respond more slowly to these colours. (This might be okay if your images typically have a light foreground on a dark background, but may be more problematic for dark foreground on light background, for example.)

So basically the encoding scheme just adds another layer, which takes in the real-valued pixels and outputs spikes. No matter how they’re portrayed, this is essentially what any encoding scheme is doing. Really all that differs is how easy this encoding scheme is to compute. Using a 1x1 convolution kernel to map 3 input channels to 4 spiking neurons requires 12 multiplies per pixel; if you have two neurons per channel, you might only need 6 multiplies (one per neuron, since each is only getting input from one channel). Or if you change either of these methods to only multiply by 1 or -1, then you don’t actually need any multiplies at all (you can just add or subtract), which would be easier to implement on fixed-point hardware.
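In Nengo terms, that encoding layer is just a convolutional connection into an ensemble of spiking neurons. A rough sketch of the idea (illustrative shapes, not the exact CIFAR-10 example code, assuming the nengo.Convolution transform):

    import numpy as np
    import nengo

    with nengo.Network() as net:
        # flattened 32x32 RGB image as real-valued input
        inp = nengo.Node(np.zeros(32 * 32 * 3))

        # 4 spiking neurons per pixel do the encoding
        enc = nengo.Ensemble(32 * 32 * 4, 1, neuron_type=nengo.LIF())

        # 1x1 (pointwise) convolution: 3 channels -> 4 filters,
        # i.e. 12 learned weights applied at every pixel
        nengo.Connection(
            inp,
            enc.neurons,
            transform=nengo.Convolution(
                n_filters=4,
                input_shape=(32, 32, 3),
                kernel_size=(1, 1),
            ),
        )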

@Eric Thank you very much for your detailed explanation and for clearing up my doubts. :slight_smile: I will dive deeper into the encoding process used in the CIFAR-10 example (https://www.nengo.ai/nengo-loihi/examples/cifar10-convnet.html), and if I have any questions about it, I will get back to you.

However, I have some more questions, and I would be very happy to get answers to them. :slight_smile:

  1. Could you please provide a link to a paper on the rate approximation that is used in Nengo?

  2. The input node in the MNIST example, i.e., inp = nengo.Node(np.zeros(28 * 28)), is used to take the input image. Does this np.zeros create a 2D matrix or a 1D array to hold the input image? And why is it zeros?

  3. Is there any way to do time-distributed simulation, just like we do in Keras with TimeDistributed layers? For example, if I have a 24x24 image, I would like to feed it to the network one column (i.e., 24x1) per timestep. Is this somehow possible with NengoDL?

Thank you. :slight_smile:

  1. The paper showing the SoftLIF neuron is here.
  2. The np.zeros(28 * 28) creates a vector (1D array) of 784 elements. It’s just a placeholder to get the right size for the input node. When sim.fit or sim.evaluate is called in that example, you’ll see it is passed a set of images; these are used as the input by overwriting the zeros.
  3. Yes, it is possible to do simulation over multiple timesteps. You’ll notice that both train_images and test_images in that example are 3D arrays. The first axis is the “batch” axis, where each entry corresponds to a different example; the second axis is the time axis, where each entry is what is shown at a subsequent timestep; the final axis is the feature axis, where each element corresponds to a different feature (in this case, pixel). You’ll notice that test_images already has a time axis of 30, where the image is just tiled (i.e. repeated) across the axis. This is because when the network is evaluated on the test data, it is run over time as the spiking network. For training, we just have a single timestep (because we’re training as a rate network), but there’s no reason you can’t do something similar for the training data, to train over time. In your case, if you’ve got a set of n images of 24 timesteps with 24 pixels, you’d want your array to have shape (n, 24, 24).
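In code, the shapes described above look something like this (the array names are just illustrative, assuming images is an (n, 28, 28) MNIST array):

    import numpy as np

    # flatten pixels and add a time axis of length 1 for training (rate network)
    train_images = images.reshape((-1, 1, 28 * 28))

    # tile the image over 30 timesteps for running the spiking network
    n_steps = 30
    test_images = np.tile(images.reshape((-1, 1, 28 * 28)), (1, n_steps, 1))

    # for your case: n examples, 24 timesteps, 24 features per step
    # data.shape == (n, 24, 24)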

@Eric thank you for your prompt reply.

1. The paper showing the SoftLIF neuron is here.

Perfect. Thank you. :slight_smile:

2. The np.zeros(28 * 28) creates a vector (1D array) of 784 elements. It’s just a placeholder to get the right size for the input node. When sim.fit or sim.evaluate is called in that example, you’ll see it is passed a set of images; these are used as the input by overwriting the zeros.
Ok, this is what my understanding was… but I was a bit doubtful, so I wanted to make sure… thank you :slight_smile:

3. Yes, it is possible to do simulation over multiple timesteps. You’ll notice that both train_images and test_images in that example are 3D arrays. The first axis is the “batch” axis, where each entry corresponds to a different example; the second axis is the time axis, where each entry is what is shown at a subsequent timestep; the final axis is the feature axis, where each element corresponds to a different feature (in this case, pixel). You’ll notice that test_images already has a time axis of 30, where the image is just tiled (i.e. repeated) across the axis. This is because when the network is evaluated on the test data, it is run over time as the spiking network. For training, we just have a single timestep (because we’re training as a rate network), but there’s no reason you can’t do something similar for the training data, to train over time. In your case, if you’ve got a set of n images of 24 timesteps with 24 pixels, you’d want your array to have shape (n, 24, 24).

Thank you for your explanation. However, I am sorry if I am not understanding it correctly. I do not want to repeat my image during training; what I meant was that instead of feeding the whole 24x24 image at once, I would like to feed only one column per step. For example, if [1 2; 3 4] is an image, I would like to first feed column [1; 3] to the network, then [2; 4], and then make a prediction… something like this.

The method I’ve proposed has no repeating (I’m not calling np.tile or anything like that). It’s simply making sure your data is in the shape (n, 24, 24) when you pass it in to NengoDL. Since the middle axis is the time axis, each one of those columns will be presented for one timestep; the total length of your simulation will be 24 timesteps. If, on the other hand, you wanted to run as a standard ANN, presenting your whole image at once for one timestep, you’d want data with shape (n, 1, 24 * 24).

EDIT: Sorry, I just realized I was presenting rows one at a time, rather than columns. To do columns, just transpose. If you’ve got your images with shape (n, 24, 24), then call e.g. images = np.transpose(images, (0, 2, 1)) to have the columns on the time axis, and the rows on the feature axis.
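Putting those two options together (again just illustrative, with images assumed to be an (n, 24, 24) array):

    import numpy as np

    images = np.random.rand(100, 24, 24)  # placeholder (n, rows, cols) data

    # one column per timestep: time axis = columns, feature axis = rows
    columns_over_time = np.transpose(images, (0, 2, 1))  # (n, 24, 24)

    # or the standard ANN style: whole image in a single timestep
    all_at_once = images.reshape((-1, 1, 24 * 24))        # (n, 1, 576)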


@Eric thank you for the explanation… :slight_smile:
I have one last question related to this thread for now:

How are the training time/step and prediction time/step calculated?
When I print the time/step for training and for prediction, they are significantly different. Can we control this time/step somehow?

The number of timesteps is controlled by the length of that time axis. So if my data has shape (10, 20, 5) then it will run for 20 timesteps, whereas if it has shape (10, 40, 5), it will run for 40 timesteps. You can have different numbers of timesteps for training and prediction (as in the spiking MNIST example, where we train as an ANN so we just need one timestep during training, but then we run as a spiking network, so we repeat the image for 30 timesteps). But they can also be the same; it just depends on the data you use for training and testing.

If you’ve got a series of images (i.e. a video) and you want to present each image for more than one timestep, you can do something like repeated_images = np.repeat(images, repeats, axis=1), where repeats is the number of times you want to repeat each image (frame).

@Eric First of all, sorry for getting back to you so late!
Thank you for the explanation. I understand that, but I wanted to ask something different; let me try to recall… Taking the MNIST example, what I want to know is: the time/step during training is 2ms/step, while in prediction or evaluation it is 9ms/step.
What does this time/step indicate in training and in prediction?
Training:
8/8 [==============================] - 0s 3ms/step - loss: 1.4511 - out_p_loss: 1.4511
Epoch 2/20
8/8 [==============================] - 0s 3ms/step - loss: 1.3605 - out_p_loss: 1.3605
Epoch 3/20
8/8 [==============================] - 0s 3ms/step - loss: 1.3381 - out_p_loss: 1.3381
Epoch 4/20
8/8 [==============================] - 0s 3ms/step - loss: 1.3244 - out_p_loss: 1.3244
Epoch 5/20
8/8 [==============================] - 0s 3ms/step - loss: 1.3140 - out_p_loss: 1.3140

Prediction:

2/2 [==============================] - 0s 22ms/step

Oh, that’s not the timestep. That’s Keras reporting the number of ms required to perform each training/inference step (i.e. running one minibatch of examples through the system). It’s a measure of model speed, and will depend on your system speed and the number of examples you have per minibatch.

@Eric, oh my bad.! Thank you for the clarification.

But I still have some doubts about the timesteps in the temporal sense. For instance, in our example above we are using (n, 24, 24), feeding 24x1 as input to the model at each timestep. As you mentioned, the ms/step is system dependent. But can we somehow control the timestep to a fixed value within the simulation environment? For example, I want the timestep to be 1 ms.

Let's suppose this timestep is 1 ms.

Does this mean my model takes in the whole image over 24 ms?

And why do we need to repeat the images during testing?

My understanding is that once the model is trained, then just like a conventional non-spiking network, it should respond to a single image.

For example, for a 24x24 image, the model takes 24x1 per timestep, and therefore classifies it in 24 steps = 24 ms?

If I repeat the same image, let's say 25 times, then it would take roughly 24 x 25 = 600 ms?

I think I am mixing up some concepts from the non-spiking and spiking versions, but I would greatly appreciate it if you could clarify them.

Thank you!

Often with spiking networks, we have two timescales: the timescale that the neurons operate on (which defaults to 1 ms), and the timescale that the input changes at.
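To tie that to your question about fixing the timestep: the neuron timestep is the simulator's dt, which defaults to 0.001 s (1 ms), while the number of timesteps comes from the length of the time axis of your data. A rough sketch (assuming net and test_data are your network and your (n, 24, 24) data):

    import nengo_dl

    # dt is the neuron timestep (defaults to 0.001 s, i.e. 1 ms)
    with nengo_dl.Simulator(net, dt=0.001, minibatch_size=20) as sim:
        # data shaped (n, 24, 24) -> 24 timesteps -> 24 ms of simulated time
        predictions = sim.predict(test_data)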

With the CIFAR-10 and MNIST examples, we’re working with static images. We choose a presentation_time for which we present each image, typically between 30 and 200 ms. There’s a tradeoff: presenting each image for longer lets the network integrate spikes over a longer period and thus gives better accuracy, while a shorter presentation gives higher throughput and lower latency. There are many other factors that contribute to the choice of presentation time, including the synapses between layers and the firing rates of the neurons.

With inputs that are inherently time-based (e.g. video), there’s this same kind of tradeoff. You could just present each video frame for one neuron timestep, but then your neurons wouldn’t really have time to respond to the content of each image (though if you have a really high camera framerate, like 1000 fps, and thus high temporal correlation between adjacent frames, then you could get good results from just showing each frame for one timestep). Typically what we want to do is extend the frames to span multiple timesteps. As with static images, more timesteps will allow more accurate inference at the cost of throughput and latency.

So what you would then do is take your input video, which is say 30 frames of 32 x 32 pixels each (30, 32, 32), and repeat each frame a certain number of times (say 20), so that your resulting video is (600, 32, 32). This would then be the input to your spiking network.
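In code, that stretching step might look like this (with a placeholder video array):

    import numpy as np

    video = np.random.rand(30, 32, 32)        # 30 frames of 32x32 pixels (placeholder)
    stretched = np.repeat(video, 20, axis=0)  # (600, 32, 32): each frame held for 20 steps

    # add a batch axis and flatten the pixels before passing it to NengoDL
    nengo_input = stretched.reshape((1, 600, 32 * 32))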


@Eric Thank you very much for the detailed explanation. :slight_smile:
It cleared up my doubts/questions. :slight_smile: