[Nengo-DL]: Adding a dynamic custom layer between two Ensembles

In the build model (i.e., the model that the simulator uses), both the function and transform are combined into the connection’s weight matrix. Recall that the function is not a physical entity. It is just used by the connection’s weight solver to solve for the appropriate connection weights that will approximate the function.
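As a rough sketch of what that means in code (the sizes and function here are just placeholders): the decoders solved for the function and the transform both end up folded into the same weight matrix in the built model.

import numpy as np
import nengo

with nengo.Network() as net:
    a = nengo.Ensemble(100, dimensions=1)
    b = nengo.Ensemble(100, dimensions=1)

    # `np.square` is never executed at simulation time; the solver only
    # uses it to find decoders, which are then combined with the
    # transform (2.0) into the connection's weight matrix.
    nengo.Connection(a, b, function=np.square, transform=2.0)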

As for the synapse, it is applied after the connection weights have been applied. However, since both the connection weights and the synapse are linear transforms, mathematically it doesn’t matter in which order they are applied.

Can you provide a minimal network that demonstrates this? Just so that I have context with what network you are building, the parameters you have used, and how you are testing it.

This is correct.

Hello @xchoo,

I suppose you are talking about the convolution operation (between the synapse and the weighted spikes) being implemented as a linear transform, and hence the order doesn’t matter. Right? In traditional linear-transform operations done through matrix multiplication, the order of the matrices does matter… doesn’t it?

With respect to the following:

I suppose you are hinting at the mathematical functions whose decoders can be calculated and thus incorporated into the connection weight matrix along with the transform. Right?

About the following statement:

Let’s say a Convolution transform has to be applied; then, whatever the output of the neurons of the previous Ensemble is, NengoDL first applies the transform and then applies the synapse on the resulting values. Is that right?

Here’s the attached minimal script. As I said earlier, I was a bit confused, and doing this exercise resolved my doubts. After setting scale_firing_rates = 100, I could see that the raw output of data1[conv0_lyr_otpt] (i.e. probed from the Conv0 layer) is the following (Cell 15):

[9.999999 0.       0.       0.       0.       0.       0.       0.
 0.       0.       0.       0.       0.       0.       0.       0.
 0.       0.       0.       0.       0.       0.       0.       0.
 0.       0.       0.       0.       0.       0.       0.       0.
 0.       0.       0.       0.       0.       0.       0.       0.      ]

however, upon multiplying data1[conv0_lyr_otpt] with scale_firing_rates and dt, I see the actual spike amplitude, which is 1, as can be seen below (Cell 17 of the attached investigating_spike_amplitudes.ipynb, 35.5 KB):

[0.99999994 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.        ]

My doubt is pretty much resolved; however, I would like to get some more clarification. I see that the raw output is 10 (or 4) when scale_firing_rates is 100 (or 250) respectively, which is (amplitude / dt) / scale_firing_rates (assuming amplitude is 1 and dt is 0.001).
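Just to double-check the arithmetic (a quick sketch; the values match my runs, modulo float32 rounding):

amplitude = 1.0
dt = 0.001

for scale_firing_rates in (100, 250):
    raw = (amplitude / dt) / scale_firing_rates   # raw probed value
    recovered = raw * scale_firing_rates * dt     # actual spike amplitude
    print(raw, recovered)
# prints: 10.0 1.0
#         4.0  1.0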

Is this the “input” to further computations in the connections, i.e. is the function, transform, or synapse applied to (amplitude / dt) / scale_firing_rates (e.g. 10.0) in the nengo.Connection()? Or are those computations applied to the “spike amplitude” value (e.g. 1) obtained after multiplying (amplitude / dt) / scale_firing_rates by scale_firing_rates and dt?

Yes. That is correct. The synaptic application is a convolution operation and it doesn’t matter if you apply the convolution first, or do the connection weight matrix multiplication first.
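In other words (just a sketch of why the order is irrelevant), writing the synapse as a linear filter $h$ and the connection weights as a matrix $W$:

$$W \, (h * x)(t) = \big(h * (W x)\big)(t),$$

since $W$ does not depend on time and both operations are linear, both orders yield the same result.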

Correct!

Yes

Yes. This is expected behaviour. Remember that the purpose of the scale_firing_rates parameter is to increase the firing rates of the neurons in the network. However, in order to keep the overall “energy” (or information) being transmitted by the network the same in both cases, the amplitudes of the spikes are divided by the scale_firing_rates value. From my other post:

The scale_firing_rates parameter is applied directly to the amplitude of the neuron’s spikes. Similarly (on the input side), the scale_firing_rates parameter is applied directly to the neuron’s gains.

Just to clarify your earlier question: I believe you are conflating two different concepts here. Your original thought was that scale_firing_rates would affect the amplitude by scaling it up. However, the purpose of the scale_firing_rates parameter is to increase (typically) the firing rates of the neurons (by adjusting the neuron gains). The reason the amplitude of the spikes is affected is because we want to keep the overall amount of information being transmitted by the spikes the same (pre- and post-scaling). Since we scaled up the firing rate, the thing to do is to scale down the amplitude, which in effect “cancels” out any additional information gain obtained by increasing the neuron gain.
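For reference, this is roughly where that parameter is set (with model standing in for your Keras model):

import nengo_dl

# Multiplies the neuron gains by 100 and divides the spike amplitudes by
# 100, keeping the overall information transmitted roughly the same.
converter = nengo_dl.Converter(model, scale_firing_rates=100)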

Thanks @xchoo for your response. I see that I am circling back to age-old questions which you have already resolved… probably I am looking at them from different perspectives. I will try to be more careful.

With respect to the following,

My understanding is that the input to the next Ensemble is (amplitude / dt) / scale_firing_rates (e.g. 10.0); however (as you mentioned), since the gain already has scale_firing_rates multiplied into it, upon calculating the current $J = \text{gain} \times \langle e, x \rangle + \text{bias}$, scale_firing_rates automatically gets multiplied (by virtue of the gain) into the input $x$. Thus, effectively, the current is calculated from the spike values (i.e. $1$ or $2$… depending of course upon the neuron type). Right?
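Here is a rough sketch of the cancellation I have in mind (ignoring the bias term, and assuming the spike amplitude $a$ is divided by $s =$ scale_firing_rates while the gain is multiplied by $s$):

$$J = (s \cdot \text{gain}) \left\langle e, \frac{a}{s} \right\rangle + \text{bias} = \text{gain} \, \langle e, a \rangle + \text{bias}.$$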

Another follow up question on

Say the function is non-trivial (i.e. not a max, exponential, sum, etc.) and instead does some calculation in a loop; will the decoders still be calculated? I am having trouble following… how could decoders be calculated that approximate such a non-trivial function (e.g. a sum of products, or a binary search!)? Is it the same process, i.e. after executing the function on some random samples, the outputs are used to calculate the decoders?

On another note, I have a network with Conv -> Conv -> AvgPool -> Conv -> Dense. When I attempt to deploy it on Loihi (the emulator in NengoLoihi), I get the following error: NotImplementedError: Mergeable transforms must be Dense; set remove_passthrough=False. I believe this is due to the presence of the AvgPool layer, which is modelled as a passthrough Node, whose AveragePooling transform is then attempted to be merged with the Convolution transform. And this is not yet supported on Loihi… is it?

BTW, upon setting remove_passthrough=False in the nengo_loihi.Simulator() args, it throws the following error: BuildError: Conv2D transforms not supported for off-chip to on-chip connections where pre is not a Neurons object.

Please let me know your suggestions!

That is correct. Note however that in my previous post explaining this, I specifically chose the ReLU neuron type, because the firing rate is linearly proportional to the input current (i.e., doubling the input current doubles the firing rate, which makes the explanation easier). If you use the LIF neuron however, things become more complicated as the neuron response is not linear, so the scale_firing_rates parameter doesn’t always work the way you think it does (with LIF neurons).

In such instances, Nengo will have a very hard (if not impossible) time trying to solve for decoders. The decoders are solved by evaluating the desired output function at a given set of evaluation points. If the function requires some sort of recursion, then Nengo will solve the decoders for the first pass of the recursion (which is likely not what you want). In order to implement these more complex functions in Nengo, you’ll need to break the function down into simpler bits.
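To make that concrete, here is a minimal sketch (the function and sizes are just placeholders). When this connection is built, the solver calls the function once per evaluation point and fits the decoders to those input/output pairs, so any looping inside the function is only ever “seen” through those sampled outputs:

import numpy as np
import nengo

with nengo.Network() as net:
    a = nengo.Ensemble(100, dimensions=1)
    b = nengo.Ensemble(100, dimensions=1)

    # The solver evaluates the function at each evaluation point and then
    # solves a least-squares problem mapping neuron activities (at those
    # points) to the function outputs. The function itself never runs
    # during the simulation.
    nengo.Connection(
        a, b,
        function=lambda x: x ** 2,
        eval_points=np.random.uniform(-1, 1, size=(750, 1)),
    )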

As an example, if you want to compute a sum of products, you’ll want a bunch of sub-networks to compute the products, and then another ensemble to compute the sum (note: this ensemble can be the ensemble where you want to use the sum, or it can also be a passthrough node).
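A rough sketch of what I mean, using the built-in Product network (neuron counts and dimensions are placeholders):

import nengo

with nengo.Network() as net:
    stim_a = nengo.Node([0.5, -0.3])
    stim_b = nengo.Node([0.8, 0.2])

    # Sub-network that computes the element-wise products.
    prod = nengo.networks.Product(n_neurons=200, dimensions=2)
    nengo.Connection(stim_a, prod.input_a)
    nengo.Connection(stim_b, prod.input_b)

    # Sum the products into a single ensemble (this could also be the
    # ensemble where the sum is actually used, or a passthrough node).
    total = nengo.Ensemble(100, dimensions=1)
    nengo.Connection(prod.output, total, transform=[[1, 1]])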

I’ll have to run some test code to be sure but this is likely the case. In this instance, what you’ll probably want to do is to insert a Dense layer between the AvgPool and Conv layer in order to force an ensemble to be built there. That might work, but I can’t say for sure.
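Purely as a sketch of what I mean (the shapes and layer sizes are placeholders, and I can’t guarantee the converter or NengoLoihi will be happy with it):

import numpy as np
from tensorflow import keras

inp = keras.Input(shape=(28, 28, 1))
x = keras.layers.Conv2D(8, 3, activation="relu")(inp)
x = keras.layers.AveragePooling2D()(x)

# Force an ensemble between the pooling and the next convolution by
# inserting a Dense layer (flatten -> dense -> reshape back to an image).
pooled_shape = tuple(int(d) for d in x.shape[1:])
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(int(np.prod(pooled_shape)), activation="relu")(x)
x = keras.layers.Reshape(pooled_shape)(x)

x = keras.layers.Conv2D(16, 3, activation="relu")(x)
out = keras.layers.Dense(10)(keras.layers.Flatten()(x))
model = keras.Model(inp, out)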

Yeah, NengoLoihi pretty much requires remove_passthrough to always be True. This is because we want to minimize the number of off-to-on-chip connections (and vice versa) since all nodes have to be simulated off-chip.

Try my suggestion of inserting a Dense layer first, and if that still doesn’t work, then I’ll have to ask the NengoLoihi devs on ideas on how to get around this issue. It’ll probably involve some math to calculate the full connection weight matrix for the Conv2D transform (and I don’t know that off the top of my head).

Hello @xchoo, thanks for your response and confirming the computation math with gain. With respect to Nengo dealing with complex functions, I was just curious about them, and your insights are very useful. Indeed, a possible way would be to break down those complex functions to give Nengo an easy time.

With respect to the AvgPooling on NengoLoihi, this is again something I was curious about. In the future, if I prioritize this, I will remember to add a Dense layer between the AveragePooling and Conv layers.

I was actually working with your suggestion to configure the planner to bypass graph optimization and speed up the execution; however, it doesn’t seem to help. Following is the code, where ndl_model is obtained from nengo_dl.Converter().

import nengo_dl
from nengo_dl.graph_optimizer import noop_planner

with ndl_model.net:
    # Disable operator merging to improve compilation time.
    nengo_dl.configure_settings(planner=noop_planner)

# ... (rest of the code elided)

My network has the custom layer as you suggested here, and I even included nengo_dl.configure_settings(planner=noop_planner) within that subnet as well, but nothing helps. Any suggestions?

Without being able to fully profile your network, it’ll be hard to say exactly where the slowdown is happening. You could also try disabling the sorting algorithm to see if that helps? Apart from that, I don’t have much advice. I haven’t built complex models in NengoDL myself, so my experience there is limited. It may also be that you are at a network size where the time it takes to optimize the network is unavoidable.
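If you want to try the sorting suggestion, I believe it would look something like this (same pattern as the planner setting; do double-check the argument name against the NengoDL docs):

import nengo_dl
from nengo_dl.graph_optimizer import noop_order_signals

with ndl_model.net:
    # Skip the signal-sorting stage of the graph optimizer.
    nengo_dl.configure_settings(sorter=noop_order_signals)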

Got your point @xchoo. Thanks for the other suggestion as well; I will try it out. However, yesterday I conducted a few more experiments with the graph optimizer disabled and found that it takes even longer to optimize the network, possibly never ending… (even after 5 hours and 30 minutes it was still in the optimization phase). With the graph optimizer enabled, it still takes time, but a reasonable amount, and it ends up executing fine. So I really don’t know what’s going on with graph optimization right now; probably my network is just genuinely large.

Hello @xchoo, well… I ran into the same problems of NotImplementedError: Mergeable transforms must be Dense; set remove_passthrough=False and BuildError: Conv2D transforms not supported for off-chip to on-chip connections where pre is not a Neurons object. in my custom SNN, which doesn’t even have AveragePooling layers. Following your suggestion, I introduced a Dense layer between my MaxPooling and Conv layers, which completely messed up the feature maps (i.e. kernels of shape (rows, cols) got transformed to (rows, number of neurons in the Dense layer)), and this is not going to help me.

I had some ideas about where the issue could be occurring, so I introduced a LoihiSpikingRectifiedLinear() neuron at the output of each subnet (i.e. connected the output Node to an Ensemble of one neuron) and connected that subnet’s output neuron to the rest of the network. Upon running the SNN on Loihi, it didn’t fail early with the above errors, but after a very, very long compilation it failed again with the same error: NotImplementedError: Mergeable transforms must be Dense; set remove_passthrough=False. This has me confused now about the exact cause of the issue.

Upon running the standalone version of my subnet (i.e. a single such subnet), and connecting the output Node (which becomes a passthrough when deployed on Loihi) to an Ensemble of one neuron, I am able to get the expected output. However, when I try to feed in even a slightly larger number of inputs, the group of subnets fails to run on Loihi, complaining: BuildError: Input axons (74016) exceeded max (4096) in LoihiBlock(<Ensemble "OUT ENS">[0:2:1]).

So, all in all, the standalone versions fail due to exceeding the max number of axons, and the SNNs integrated with them fail due to non-mergeable transforms. Could this be because the standalone subnets are compiled differently when integrated with SNNs? Also, I am now probably getting to the real reason why it takes so long to build: the large number of axons that results when I integrate my subnet with an SNN. Does that sound right to you as well? Since this is unpublished work, I was wondering if I could message/email you the subnet’s architecture for a closer look? Please let me know.

Hmmm… Unfortunately, this is going beyond my depth of knowledge about the NengoLoihi system, so I’m going to refer you to @Eric (our NengoLoihi dev) to see if he can provide some suggestions to get your custom network working.


Hello there, sorry to bother again… I was wondering if there are any suggestions to tackle this issue?

Sorry, I’ve been pretty busy lately.

The Mergeable transforms must be Dense error happens if you’re trying to remove a passthrough node that has a non-dense transform on one of its connections.

Passthrough nodes are nodes that do not transform their inputs in any way (we typically use them for convenience, e.g. so that we can define one point of entry for a network). You can typically identify them because they’ll have size_in set in their arguments (to indicate they can take in inputs), but no output function set.

We try to remove passthrough nodes in NengoLoihi to keep things on-chip (typically, nodes are simulated off-chip, because we want to be able to run arbitrary Python functions, but if they’re passthrough, they can often be removed so that the off-chip step doesn’t happen). If you have a passthrough node and the connections into and out of it both have Dense transforms, we can combine those transforms into a single Dense transform when removing the passthrough node, and put that in its place. However, if one of the transforms is not dense (e.g. a Convolution transform), we can no longer remove the passthrough node.

There is a remove_passthrough=False flag that you can pass to your simulator to not remove passthrough nodes, but then you’ll have a lot of data being passed on and off the chip (which also means it is converted from spikes to floats, and then back into spikes). Furthermore, in the case where you have a Convolution connection coming out of that passthrough node, you’ll also get an error, because Convolution transforms are not supported from Nodes to the chip (this is because we would have to re-encode the float node values into spikes, which takes many neurons per dimension; since Convolution transforms often have high-dimensional inputs, this is typically not what we want).

My suggestion would be to remove all your passthrough nodes, and just do the direct connections manually. Passthrough nodes are nice for convenience, but as you’ve seen, they have some limitations.
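As a minimal sketch of the difference (dimensions are placeholders):

import nengo

with nengo.Network() as net:
    a = nengo.Ensemble(100, dimensions=2)
    b = nengo.Ensemble(100, dimensions=2)

    # Passthrough-node version: `relay` has size_in set and no output
    # function, so it just relays its input.
    # relay = nengo.Node(size_in=2)
    # nengo.Connection(a, relay)
    # nengo.Connection(relay, b)

    # Direct version: connect the ensembles straight to each other, so
    # there is no passthrough node for NengoLoihi to remove.
    nengo.Connection(a, b)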

As for the error of exceeding the max axons, there are two potential cases: NengoLoihi is connecting things differently than you expect, resulting in a blow-up of axons; or everything is working as expected, but you’re legitimately just exceeding the number of input axons available.

When mapping convnets to Loihi, it takes quite a bit of care to ensure that none of the resource constraints are exceeded. If you haven’t already read the NengoLoihi tips and tricks page, it has some good suggestions for splitting things; the section on Reducing synapses and axons might be particularly helpful.

No problem and thanks Eric for looking into it!

WRT the following:

things will get trickier, since I am currently reordering the output passthrough Nodes in my network (e.g. [0, 1, 2, 3, 4, 5, …] → [4, 5, 2, 1, 3, …]). Implementing a direct connection to the neurons of the next Ensemble will not be trivial, I guess. I will look into it.

Also, if this is the case (which I am unable to understand completely):

then would it be resolved if the required code changes were made to NengoLoihi? Some example code here might help me understand your point.

OR

if following is the case:

could I have some example code here as well? E.g. connections between two Ensembles and calculating the number of axons consumed on Loihi? That way I will be able to identify whether my subnet connections are genuinely blowing up the resources.

AFAIK, a neuro-core allows for 4096 input and output axons, and in my limited understanding, it looks like 4 input connections per compartment are allowed, and likewise… 4 output connections from each compartment. The connection’s endpoints have to be neurons. Please correct me if I am wrong anywhere.

The utilization_summary function is helpful for seeing how much of each resource different blocks are using. You might have to turn off validation to avoid errors; I think the only way to do that is by commenting out this line.

For examples of NengoLoihi connecting things differently than you expect, I was thinking something like connecting a node to an on-chip ensemble. If you don’t know how things are working internally, you might not realize that for each Node dimension, NengoLoihi has to create a number of (off-chip) neurons to encode that dimension into spikes. Now, I don’t think that’s the problem in your case (that shouldn’t cause an increase in the number of axons), but that’s the idea.