Hi @Choozi, and welcome to the forum!
Those are great questions! There may be partial answers to them scattered throughout the forum, but I’ll try to give a high-level overview here.
There are two difficulties in training with spiking LIF neurons. The first is that they spike, and spikes are non-differentiable, as you’ve pointed out. To get around that, we just use a rate approximation to the spiking neuron (i.e. if you gave the spiking neuron a constant input current of a particular strength, what would its firing rate be). This method of training on rate functions is sometimes called “ANN-SNN synthesis” in the literature, because we’re training the network as an ANN (with the rate approximation as the nonlinearity), and then using those weights in our spiking network at test time. This is opposed to spike-based methods, which find a way to make the spikes themselves differentiable. There is also an in-between method, which is to use spikes for the forward pass during training and the rate function for the backward pass. This method is implemented in our KerasSpiking package via the spiking_aware_training option on the SpikingActivation layer.
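For reference, here’s a minimal sketch of what that looks like in KerasSpiking (the layer sizes and the flattened-image input shape are just placeholders I’ve picked for illustration):

```python
import tensorflow as tf
import keras_spiking

# temporal data has shape (batch, time, features); time can be None so the
# model can be run for different numbers of timesteps
inp = tf.keras.Input((None, 784))
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(128))(inp)
# spikes on the forward pass, rate approximation on the backward pass
x = keras_spiking.SpikingActivation("relu", spiking_aware_training=True)(x)
# average the spike trains over time before the readout layer
x = tf.keras.layers.GlobalAveragePooling1D()(x)
out = tf.keras.layers.Dense(10)(x)

model = tf.keras.Model(inp, out)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```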
The other difficulty of training with LIF neurons is that even the rate approximation does not have a continuous derivative. Specifically, the derivative explodes right around the firing threshold. To avoid the potential numerical problems with this, we typically train with the SoftLIF neuron type, which smooths the rate function around the threshold. However, in some of our work we’ve found that modern optimizers (e.g. Adam) that have features like gradient clipping can actually train fine with the standard (non-soft) LIF rate approximation. We haven’t tested this extensively, though.
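To make that concrete, here’s a rough NumPy sketch of the two rate curves (NengoDL provides the soft version as the SoftLIFRate neuron type; the tau_rc/tau_ref values and sigma here are just typical defaults, nothing special):

```python
import numpy as np

tau_rc, tau_ref = 0.02, 0.002  # typical membrane / refractory time constants

def lif_rate(J):
    # standard LIF rate approximation; the derivative blows up as the
    # input current J approaches the firing threshold (J = 1)
    out = np.zeros_like(J)
    above = J > 1
    out[above] = 1.0 / (tau_ref + tau_rc * np.log1p(1.0 / (J[above] - 1)))
    return out

def soft_lif_rate(J, sigma=0.02):
    # SoftLIF: replace the hard threshold with a softplus so the
    # derivative stays bounded near J = 1
    j = sigma * np.logaddexp((J - 1) / sigma, 0.0)
    return 1.0 / (tau_ref + tau_rc * np.log1p(1.0 / j))

J = np.linspace(0.5, 2.0, 7)
print(lif_rate(J))
print(soft_lif_rate(J))
```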
As for the encoding scheme, the best place to look is at our NengoLoihi deep learning examples. For example, in the CIFAR-10 example, you’ll see that when we create the network, our first layer (called “input-layer”) is off-chip. This layer is responsible for encoding the input image into spikes. It’s a convolution layer with a 1x1 (i.e. point-wise) kernel. Essentially, it allows the network to learn how to map each RGB pixel (3 channels) into the output of 4 spiking neurons (the number of output filters for that layer). I like this way of doing things because it lets the network learn the best way of encoding with the specified number of neurons. You could use other numbers of filters: 6 filters would allow one “on” neuron and one “off” neuron per RGB channel, which should give a very good encoding; 3 filters would allow only one neuron per channel, which is enough to represent the information, but may mean that some input colours (e.g. black) produce low firing rates in all input neurons, so the network would respond more slowly to those colours. (This might be okay if your images typically have a light foreground on a dark background, but could be more problematic for a dark foreground on a light background, for example.)
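As a rough sketch (not the exact example code; the image size and filter count just mirror the description above), that encoding layer looks something like this in Nengo:

```python
import numpy as np
import nengo

with nengo.Network() as net:
    # flattened 32x32 RGB image as the input node
    inp = nengo.Node(np.zeros(32 * 32 * 3))

    # 1x1 (point-wise) convolution mapping 3 channels -> 4 filters per pixel
    transform = nengo.Convolution(
        n_filters=4,
        input_shape=(32, 32, 3),
        kernel_size=(1, 1),
        channels_last=True,
    )

    # one spiking neuron per output of the convolution (32 * 32 * 4 neurons);
    # in NengoLoihi this ensemble would be configured to run off-chip
    encoder = nengo.Ensemble(
        transform.output_shape.size, 1, neuron_type=nengo.LIF()
    )
    nengo.Connection(inp, encoder.neurons, transform=transform)
```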
So basically the encoding scheme just adds another layer, which takes in the real-valued pixels and outputs spikes. However an encoding scheme is presented, this is essentially what it’s doing; what really differs is how cheap the encoding is to compute. Using a 1x1 convolution kernel to map 3 input channels to 4 spiking neurons requires 12 multiplies per pixel; if you have two neurons per channel, you might only need 6 multiplies (one per neuron, since each only gets input from one channel). Or if you restrict either of these methods to weights of 1 or -1, then you don’t need any multiplies at all (you can just add or subtract), which is easier to implement on fixed-point hardware.
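For example, a fixed on/off scheme along those lines might look like the following (the 0.5 offset is just an arbitrary choice for illustration):

```python
import numpy as np

def on_off_currents(pixel_rgb, offset=0.5):
    """Map one RGB pixel (values in [0, 1]) to input currents for 6 neurons.

    Each channel drives an "on" neuron (pixel - offset) and an "off" neuron
    (offset - pixel), so all weights are +/-1 and no multiplies are needed.
    """
    pixel_rgb = np.asarray(pixel_rgb)
    return np.concatenate([pixel_rgb - offset, offset - pixel_rgb])

# six signed currents, e.g. approximately [0.4, -0.3, 0.0, -0.4, 0.3, 0.0]
print(on_off_currents([0.9, 0.2, 0.5]))
```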