Is it possible (or does it even make sense) to implement a spiking neural network model in Nengo for dimensionality reduction and reconstruction of a stimulus a la generative modeling? If so, what would that entail? What I mean is something like an autoencoder that could learn a compact, noise-robust representation of a stimulus (for example an image of a digit or segment of speech audio) that could then be expanded to reconstruct the stimulus. Additionally, would it be possible to use this representation as a semantic pointer for recall or to combine these semantic pointers to generate prototypical stimuli for each class?

Finally, if any of the things I mentioned are possible, is there reason to believe that there would be any real benefit to the quality of the dimensionality reduction or reconstruction provided by using a spiking neural network over traditional deep learning?

It would definitely be possible to build something like an autoencoder in spiking neurons. You could use nengo_dl to train the autoencoder; the procedure would look something like this example (but with the structure of the network and training objective changed to an autoencoder setup). And you could connect up that inner compression layer to a larger Nengo model, like a SPA model as you suggest, which would definitely be an interesting thing to explore.

In terms of what the benefits would be, I think the most interesting issue here would be exploring the idea of temporal compression rather than spatial. A traditional autoencoder does spatial compression (i.e., taking a high dimensional vector and condensing it to a lower dimensional vector). We could achieve the same thing with a spiking neural network, but there probably wouldn’t be any particular advantages. However, the unique aspect of a spiking neural network is that we have introduced time into the system, so that the compressed vector isn’t represented instantaneously, but over a period of time. So then we could ask questions about how that representation is compressed temporally (e.g., how many spikes do we need to represent the compressed signal, or how does the length of the compressed signal affect the reconstruction quality). And you could trade off the temporal compression against the spatial compression (e.g., a longer, lower dimensional compression versus a shorter, higher dimensional compression).

I haven’t tried that out myself so can’t say what the results would be, but it’d definitely be interesting!

I saw something about being able to insert a pre-trained TensorFlow model into the network. Do you think that would work here or produce results different than training the network in Nengo with the adapted differentiable spike learning rule (I’d think it would)?

Also, could you explain what you mean by it perhaps being able to learn a compressed version in time?

Let’s say I have speech audio corresponding to a single word that I’ve decomposed into a spectrogram or similar (i.e., a windowed spectral representation over time, time by frequency, or even a continuous filterbank sampled every X samples, equivalent to that time bin). If I trained an autoencoder normally, I could feed it the frequency vector at each time bin as individual training samples, or perhaps the entire spectrogram all at once as a single “spectral image” training sample. The autoencoder would then be able to map a given testing sample to its learned compressed frequency/feature space.

In what way would the autoencoder mapping change with a spiking network? Would it be a matter of the probed neurons’ spiking patterns (extracted from the medial, compressed layer) potentially possessing interesting dynamics that somehow describe the dynamics of the input layer? Also, a general question about Nengo’s spiking networks: would the output of the network simply spike or not spike for a given sample, or would you need to present each sample for the duration it was originally collected over and observe the dynamics over each presentation? I guess the thing I’m trying to wrap my head around is the dimensionality we’re now working in with spiking networks and how exactly time factors in. I obviously don’t expect you to know what the findings would be, but it’d be really helpful for me to understand how the structure of the Nengo spiking network factors into mapping and processing the data.

It sounds to me, in the most general sense, that you would like to learn a dynamical system (i.e., filter). If you feed some input $u(t)$ to some dynamical system, then the state of that system ${\bf x}(t)$ is some vector that represents a temporally-compressed version of the history of that input. Different systems (or different filters) will be better at reconstructing ${\bf x}(t) \mapsto u$, depending on the dynamics of $u(t)$. The dimensionality of this state determines how compressed/reduced it will be. Learning the optimal system that recovers $u$ from ${\bf x}(t)$ in general is a very difficult problem, as it depends on the “dynamical modes” of the input, how far back in time you would like to reconstruct the input, and what error you are minimizing. As a trivial case, a sine wave $u(t)$ can be compressed into two dimensions (i.e., its 2D position on an oscillator). Less trivially, you might have some prior model for these dynamics and work from there.
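The sine-wave case can be sketched in a few lines. This is plain Python rather than a Nengo model, and the frequency, timestep, and function names are illustrative assumptions; it only shows that the current 2D oscillator state is enough to recover earlier values of the input.

```python
import math

def simulate_oscillator(freq_hz, dt, n_steps):
    """Track a sine input with a 2D state x(t) = (sin(wt), cos(wt))."""
    omega = 2.0 * math.pi * freq_hz
    c, s = math.cos(omega * dt), math.sin(omega * dt)
    x = (0.0, 1.0)  # state at t = 0: (sin 0, cos 0)
    states = [x]
    for _ in range(n_steps):
        # Rotate the state by omega * dt (exact discrete-time oscillator update).
        x = (x[0] * c + x[1] * s, x[1] * c - x[0] * s)
        states.append(x)
    return states

def reconstruct_past(state, freq_hz, delay):
    """Recover u(t - delay) from the current 2D state alone."""
    omega = 2.0 * math.pi * freq_hz
    # sin(w(t - d)) = sin(wt) cos(wd) - cos(wt) sin(wd)
    return state[0] * math.cos(omega * delay) - state[1] * math.sin(omega * delay)

# Hypothetical settings: a 2 Hz sine simulated for 1 s at 1 ms resolution.
freq_hz, dt, n_steps = 2.0, 0.001, 1000
states = simulate_oscillator(freq_hz, dt, n_steps)
current = states[-1]  # state at t = 1.0 s

# Compare the reconstruction with the true past input at a few delays.
delays = [0.0, 0.1, 0.25]
max_err = max(
    abs(reconstruct_past(current, freq_hz, d)
        - math.sin(2.0 * math.pi * freq_hz * (1.0 - d)))
    for d in delays
)
```

For inputs that are not a single sinusoid, no two-dimensional state is sufficient in general, which is where the choice of dynamical system and its dimensionality starts to matter.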

We have some work here http://compneuro.uwaterloo.ca/publications/voelker2018.html that shows how you can do this in general for finite-length windows of low-frequency inputs using an SNN trained offline. As the dimensionality increases, the system becomes better at reconstructing the higher-frequency components of the input signal.

That is something you can do with NengoDL, but in that case we’re just using the network exactly the same way as it was defined in TensorFlow (e.g., it isn’t running in spiking neurons or anything like that). That could still be interesting, as you could hook the output of that model up to a neuromorphic model (for example, using the compressed representation as an input into a semantic pointer model, as you mentioned). But the TensorFlow network itself wouldn’t be doing anything that it wouldn’t do if you just simulated that network normally in TensorFlow.

The alternative would be to build and optimize the network in NengoDL. That would let you use different neuron models (e.g., ones with temporal dynamics), and optimize the output of the model over time (more on that below).

The key difference with a spiking network is that in order to observe the output of the network we need to simulate it over time. Technically we could simulate it for a single timestep, but that wouldn’t be very informative. We need to simulate it for some length of time, observe the spikes coming out of the neurons, and use that to reconstruct the output of the network.

So we input something like the frequency vector, simulate the model for t timesteps, and then observe our reconstructed output. In the autoencoder case, we would like that reconstructed output to be the original frequency vector. So then the question is, how should we set the parameters of the network such that, after simulating for t steps, the output looks like the original input vector. That’s an optimization problem we can solve using the same gradient-based optimization methods we would use for a normal autoencoder, but extended over time (i.e., back propagation through time).

So then there are all kinds of interesting questions we could ask about how to efficiently/accurately compress that feature vector into a temporal spike pattern. E.g., we could try to minimize the number of spikes, for a sparse temporal representation. Or we could just try to minimize the reconstruction error, and see what kinds of neural activities that leads to. Or we could explore how the reconstruction accuracy changes as we change t (i.e., as we increase the amount of temporal information, how does that affect the spatial reconstruction accuracy).
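To get a feel for the last of those questions, here is a toy sketch in plain Python. It is not a Nengo or NengoDL model: a single non-leaky integrate-and-fire unit (an assumption made for simplicity) encodes a constant value, and the value is read back as the fraction of timesteps containing a spike. The input value and the choices of t are arbitrary.

```python
def spike_count_estimate(x, n_steps):
    """Encode a constant x in [0, 1] with one integrate-and-fire unit,
    then decode it as the fraction of timesteps that contained a spike."""
    voltage, n_spikes = 0.0, 0
    for _ in range(n_steps):
        voltage += x           # integrate the constant input
        if voltage >= 1.0:     # threshold crossing: emit a spike
            voltage -= 1.0     # keep the remainder for the next step
            n_spikes += 1
    return n_spikes / n_steps

x = 0.3737  # hypothetical value to represent
# Absolute reconstruction error for increasing simulation lengths t.
errors = {t: abs(spike_count_estimate(x, t) - x) for t in (10, 100, 1000)}
```

With this simple rate code the error is bounded by 1/t, so every extra digit of precision costs ten times as many timesteps. A trained network is not restricted to a rate code, which is part of what makes the trade-off worth exploring.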

This is a simplified description and you wouldn’t necessarily need to do things this way, but to give you a rough idea: In a normal autoencoder you input some vector with dimension n and get out a vector with some dimension m < n. When we translate that into a spiking network, we have the same setup (input a vector with dimension n, get out a vector with dimension m), but in this case the output is a vector of m spikes (e.g., 1s or 0s) showing which of our m neurons spiked on that timestep. Then we simulate that network for t timesteps, so we end up with an output array with shape (t, m), showing which neurons spiked on each timestep. Then you’d generally apply some function to that array of spikes (e.g., a lowpass temporal filter), to make it easier to see how those neurons are spiking over time.
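As a rough illustration of the (t, m) spike array and the lowpass step, here is a sketch in plain Python. The non-leaky integrate-and-fire units, the per-neuron drive values, and the filter time constant are all assumptions for the example, not what Nengo would build for you.

```python
import math

def simulate_spikes(inputs, n_steps):
    """Return an (n_steps, m) nested list of 0/1 spikes, one column per neuron."""
    m = len(inputs)
    voltages = [0.0] * m
    spikes = []
    for _ in range(n_steps):
        row = []
        for i in range(m):
            voltages[i] += inputs[i]    # integrate constant drive
            if voltages[i] >= 1.0:      # threshold crossing
                voltages[i] -= 1.0
                row.append(1)
            else:
                row.append(0)
        spikes.append(row)
    return spikes

def lowpass(spikes, tau_steps):
    """Apply a first-order lowpass filter along the time axis."""
    decay = math.exp(-1.0 / tau_steps)
    state = [0.0] * len(spikes[0])
    filtered = []
    for row in spikes:
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, row)]
        filtered.append(state)
    return filtered

drive = [0.05, 0.2, 0.5]  # hypothetical per-step drive for m = 3 neurons
spikes = simulate_spikes(drive, n_steps=200)
rates = lowpass(spikes, tau_steps=20)
shape = (len(spikes), len(spikes[0]))  # (t, m)
```

Each row of `spikes` is one timestep and each column is one neuron; the filtered values in `rates` hover around the fraction of timesteps on which each neuron fires.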

Again, that’s very simplified, and you could vary that in many ways. For example, you may not want to directly use the m spike trains as the represented value, but instead perform a linear mapping into some other d dimensional space. Or, since we’re simulating the model over time, we don’t need to keep our input signal constant. As you mentioned, rather than inputting the frequency vector, we could directly input the temporal audio signal, and try to compress that down into a lower dimensional spiking representation. But anyway, hopefully that gives you a general idea of how things change when you’re working with a spiking network.

Sure, so basically you’re saying that the optimization algorithm is performed across the entire batch of trials presented, and that’s how it factors time into the compressed representation? If I had audio for multiple words, I would need to somehow set each word as a separate batch, or train the network on each word separately, right?

Okay, so then the output dimensionality (assuming the case where each feature corresponds to a single neuron rather than an ensemble of neurons, which I think is what you’re implying) would be identical to a regular autoencoder, right? To get the reconstructed output, we’d then need to decode the spike trains similar to how it was done here (https://www.nengo.ai/nengo/examples/advanced/nef_summary.html). How would this be applied to the intermediate, compressed output or the ultimate, reconstructed output of the generative model? Also, if you need the whole spike train according to how you trained it, would you not be able to reconstruct the output one sample at a time in an online fashion, or over partial, windowed segments?

Thank you for answering all my questions thus far. This is really insightful!

Yep, exactly. The input data for the training has shape (batch_size, n_steps, dimensions), so in your case that would be something like (n_words, word_length, audio_channels).
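A small sketch of that layout, using nested Python lists so it runs without any dependencies (in practice this would be a NumPy array). The zero-padding of shorter words to a common length is an assumption added here, as are the example values and the `pad_batch` helper name.

```python
def pad_batch(words, pad_value=0.0):
    """Stack variable-length words into a (n_words, n_steps, channels) nested list."""
    n_steps = max(len(word) for word in words)
    channels = len(words[0][0])
    batch = []
    for word in words:
        # Assumption: shorter words are padded with silence at the end.
        padding = [[pad_value] * channels for _ in range(n_steps - len(word))]
        batch.append([list(frame) for frame in word] + padding)
    return batch

# Hypothetical data: two "words", each a list of frames with 4 spectral channels.
word_a = [[0.1, 0.2, 0.3, 0.4]] * 5
word_b = [[0.5, 0.6, 0.7, 0.8]] * 3
batch = pad_batch([word_a, word_b])

# (batch_size, n_steps, dimensions) == (n_words, word_length, audio_channels)
batch_shape = (len(batch), len(batch[0]), len(batch[0][0]))
```

Each word is one item along the batch axis, so multiple words can be trained on together rather than one at a time.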

You could do this, but you don’t necessarily have to. You could just use the spike trains themselves as your compressed representation. But if you have some function that you want to apply to your compressed representations, you could indeed use the NEF decoding process to solve for some linear readout weights to apply to those spike trains. I’m not sure what that function would be, so I’d be inclined just to use the spike trains directly, but you may have some ideas for what a useful decoded representation would be.
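For reference, solving for linear readout weights amounts to a regularized least-squares problem. The sketch below is plain Python, not Nengo's solver: it uses hypothetical rectified-linear tuning curves (rates rather than filtered spike trains), an arbitrary target function x², and an arbitrary regularization constant.

```python
def tuning_curves(x):
    """Hypothetical rectified-linear tuning curves for six neurons."""
    intercepts = (-0.5, 0.0, 0.5)
    on = [max(0.0, x - c) for c in intercepts]    # prefer positive x
    off = [max(0.0, -x - c) for c in intercepts]  # prefer negative x
    return on + off

def solve(matrix, rhs):
    """Solve matrix @ w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    w = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(aug[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w

def solve_decoders(activities, targets, reg=1e-3):
    """Regularized least squares: (A^T A + reg * I) d = A^T y."""
    n = len(activities[0])
    gram = [[sum(a[i] * a[j] for a in activities) + (reg if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    proj = [sum(a[i] * y for a, y in zip(activities, targets)) for i in range(n)]
    return solve(gram, proj)

xs = [i / 50.0 - 1.0 for i in range(101)]  # sample points in [-1, 1]
A = [tuning_curves(x) for x in xs]
decoders = solve_decoders(A, [x ** 2 for x in xs])  # hypothetical target: x^2
decoded = [sum(d * a for d, a in zip(decoders, row)) for row in A]
rmse = (sum((y - x ** 2) ** 2 for y, x in zip(decoded, xs)) / len(xs)) ** 0.5
```

The same weights could then be applied to filtered spike trains instead of rates, at the cost of some extra noise from the spiking.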

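All of these layers will be represented as Ensembles in our model, so even though we’ll be optimizing this network with NengoDL, we can still decode our target function from those ensembles in the regular Nengo way, e.g. (with m, d, and my_func standing in for your own values):

```python
import nengo
with nengo.Network() as net:
    compressed_ens = nengo.Ensemble(n_neurons=m, dimensions=m)
    output = nengo.Node(size_in=d)
    # my_func maps the m-dimensional compressed value to d dimensions
    nengo.Connection(compressed_ens, output, function=my_func)
```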

We need the whole input signal as a block when we’re training the model, but when you want to test the performance afterwards you could present input online. The block requirement comes from back propagation through time during training; it isn’t a feature or requirement of the network itself.