Techniques for Runtime Efficiency

I want to become more literate with the numerical methods used in Nengo during runtime of a standard simulation (e.g., a communication channel with LIF neurons). What kinds of techniques does Nengo use to increase computational efficiency after building is complete? I’m particularly interested in operations like updating the state variables of a neuron (e.g., voltage) at every time step. If someone (@tcstewar, @arvoelke, @tbekolay) would be willing to sketch a very broad outline of where the Nengo code does something more sophisticated than a naive Python implementation (e.g., smart matrix multiplies, tricks to use a larger $dt$), then reference parts of the code or external literature where readers can find out more for themselves, that’d be much appreciated!


The most important part of the optimization is combining large numbers of small ops into a smaller number of large ops. In the reference nengo simulator this logic is encapsulated in https://github.com/nengo/nengo/blob/master/nengo/builder/optimizer.py, and described in this paper https://www.frontiersin.org/articles/10.3389/fninf.2017.00033/full. Jan could describe that in more detail.

However, if you are really interested in performance, you will want to look at one of the other simulators (NengoOCL or NengoDL). The performance gains you get from moving to, e.g., GPU acceleration will far outweigh any algorithmic optimizations in the reference simulator. For example, see Figures 6 and 7 in this paper https://arxiv.org/abs/1805.11144. If you want to know more about NengoDL I can give you as many details as you want there :wink:

To my knowledge there isn’t much performance optimization in the “raw” simulator other than using NumPy (which relies on optimized linear algebra libraries). There is some optimization in terms of the accuracy of the Euler integration for the LIF neurons, and while this allows for a larger $dt$, you might not see it as an increase in computational efficiency per se.

In between the build and simulation steps is an optimization step (at least for the reference simulator), which might be something you’re interested in given your question. The rough idea is the following: the build step translates the high-level description into a set of basic operations (called “operators”) that perform computation on specific memory segments (called “signals”). Examples of operators include (see the sketch after this list):

  • Copy memory block A to memory block B
  • Multiply the matrix stored in block A by the vector stored in B and write the result to C
  • Perform the LIF math
  • etc.
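
For intuition, here is a minimal hypothetical sketch of that structure (these are not Nengo’s actual classes; see nengo/builder/ for the real ones): signals are named blocks of memory, and each operator compiles into a step function that works on views into them.

```python
import numpy as np

# Hypothetical sketch of signals and operators -- not Nengo's actual classes.
# Signals are named blocks of memory; operators compute on views into them.
signals = {"A": np.ones(100), "B": np.zeros(100)}

class Copy:
    """Operator that copies signal `src` into signal `dst` every timestep."""
    def __init__(self, src, dst):
        self.src, self.dst = src, dst

    def make_step(self, signals):
        src, dst = signals[self.src], signals[self.dst]
        def step():
            dst[...] = src  # in-place copy into the destination block
        return step

operators = [Copy("A", "B")]
steps = [op.make_step(signals) for op in operators]
for _ in range(1000):        # the simulation loop described below
    for step in steps:
        step()
```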

The simulator loops over all operators during a timestep and executes them. This loop happens in Python, which is slow (compared to a loop in C). If we have two copy operators, one copying from A to B and one from C to D, this could also be expressed as a single operator if the memory is organized such that C follows A and D follows B, so that we only do one copy from AC to BD. That saves us one slow Python iteration in the loop. This doesn’t sound like a lot, but there are a lot of operators (thousands in a small model, up to millions or more in larger models), so the total number of operators can be reduced considerably in this fashion (and not just copy operators, but other operators like the matrix-vector multiply too), which also increases speed significantly.
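
As a toy illustration of that merge (hypothetical arrays, not Nengo code):

```python
import numpy as np

# Before merging: two copy operators mean two Python-level calls per timestep.
A, C = np.ones(1000), np.ones(2000)
B, D = np.empty(1000), np.empty(2000)
B[...] = A
D[...] = C

# After merging: A and C live contiguously in one block (likewise B and D),
# so a single copy replaces both operators. The old signals become views.
AC = np.ones(3000)
BD = np.empty(3000)
A_view, C_view = AC[:1000], AC[1000:]  # 'A' and 'C' as views into the block
BD[...] = AC                           # one Python call instead of two
```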

There is also a second (potentially more important) reason why this ends up being faster. One of the slowest things on a modern CPU is accessing memory. Because of that, CPUs have small (a few MB) memory segments built in that are much faster to access and will hold recently requested data. They also try to be intelligent and do something called “pre-fetching”: when code is accessing memory locations i, i+1, i+2, …, it is likely that i+3 is accessed next, so the CPU will already read that memory address before it is requested by the code, while performing other computations. If it is not needed, it is discarded, but if it is needed, it is already there and the slow memory access can be avoided. These two hardware optimizations imply that efficient code should be local, i.e., access memory locations near already-accessed locations, and do the accesses in order if possible.
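
You can observe this directly with a quick, machine-dependent experiment that copies the same amount of data either as many scattered chunks or as one contiguous block:

```python
import numpy as np
import timeit

n, k = 1_000_000, 100
chunks = [np.random.rand(n // k) for _ in range(k)]   # k scattered allocations
out_chunks = [np.empty(n // k) for _ in range(k)]
block = np.random.rand(n)                             # one contiguous block
out_block = np.empty(n)

def many_small():   # k Python iterations over scattered memory
    for src, dst in zip(chunks, out_chunks):
        dst[...] = src

def one_large():    # a single call over contiguous memory
    out_block[...] = block

print(timeit.timeit(many_small, number=1000))
print(timeit.timeit(one_large, number=1000))  # typically substantially faster
```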

The optimization of merging operators achieves exactly that. For example, instead of copying A to B and then some random other memory locations C to D, only one sequential block AC is copied to BD. Thus, at the end of A, the CPU will already have the data at the start of C available.

This is the high-level overview. In practice all of this is a bit more complicated, because

  • Operators need to be executed in the right order (which is ensured with such things as dependency graphs, topological sorting, and transitive closures; see the sketch after this list).
  • Not all memory segments are independent. Most will be read and written by at least two different operators, sometimes more. Some operators might only operate on a subset of a memory segment (a “view”). This introduces a lot of constraints that need to be adhered to.
  • This optimization only makes sense if it doesn’t take too long to perform (otherwise it might be more efficient to just run the slower simulation), but the algorithm is unfortunately $O(n^2)$ (the runtime increases quadratically with the number of operators). Thus, we cut out as many of the checked combinations as possible.
  • Certain data structures (e.g., the transitive closure) need a lot of memory, which can be a problem. There are some interesting things happening to keep the memory consumption down.
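
For the ordering constraint in the first point, the standard tool is a topological sort of the operator dependency graph. A minimal sketch with made-up operator names, using Kahn’s algorithm:

```python
from collections import deque

# Hypothetical dependency graph: op -> ops that must run after it.
# E.g. the matrix-vector multiply reads what the copy writes.
deps = {
    "copy_A_to_B": ["dot_B_x"],
    "dot_B_x": ["lif_step"],
    "lif_step": [],
}

def toposort(graph):
    """Kahn's algorithm: repeatedly emit nodes with no remaining predecessors."""
    in_degree = {node: 0 for node in graph}
    for succs in graph.values():
        for succ in succs:
            in_degree[succ] += 1
    queue = deque(node for node, d in in_degree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in graph[node]:
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                queue.append(succ)
    return order

print(toposort(deps))  # ['copy_A_to_B', 'dot_B_x', 'lif_step']
```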

The full nitty-gritty details are described in this paper (or the code, of course, but that might be hard to understand without the background from the paper).

Also somewhat related is this part of the documentation.

Thanks Jan and Dan. I should maybe have mentioned that the motivation for this post is a guest lecture I’m giving to Sverrir’s 300-level CS course on numerical methods. We were hoping that I could talk about applications of numerical methods in neural simulators. It seems like the major gains in efficiency that Nengo utilizes are related to the grouping of operators that perform various numerical steps, rather than the design of operators targeted at our particular flavor of neural simulations. If I looked into the operators that perform, e.g., matrix multiplies or the LIF math, would I find anything insightful that I could discuss with third-year undergraduates? I’m currently reading the section of the NEURON book on numerical methods and finding some relevant material on temporal/spatial discretization, Euler vs. Crank-Nicolson integration, numerical instability, adaptive timesteps, etc. Am I likely to find anything comparable within the operator math, or do our simulations not benefit from these tricks (due to the simplicity of our point neurons)?

I suppose numerical methods are not just about efficiency, but also accuracy? Nengo uses simple Euler integration (I believe), but it does some things to make the LIF math more accurate (I think one of those might have been a bonus question in the NEF course). You might want to take a look at PRs #511 and #975.
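
For what it’s worth, the accuracy trick looks roughly like this (a paraphrase from memory of the LIF step in nengo/neurons.py; treat the details as approximate and check the source):

```python
import numpy as np

def lif_step(dt, J, voltage, refractory_time, spiked,
             tau_rc=0.02, tau_ref=0.002, min_voltage=0.0):
    """Rough paraphrase of the reference LIF update (see nengo/neurons.py)."""
    # Advance refractory clocks, and work out how much of this step each
    # neuron actually spends integrating (zero while refractory).
    refractory_time -= dt
    delta_t = (dt - refractory_time).clip(0, dt)

    # Exact solution of tau_rc dv/dt = J - v over delta_t, assuming the
    # input current J is held constant across the step (ZOH), rather than
    # a first-order Euler update.
    voltage -= (J - voltage) * np.expm1(-delta_t / tau_rc)

    # Find threshold crossings (threshold normalized to 1); spikes are
    # scaled by 1/dt so that each has unit area.
    spiked_mask = voltage > 1
    spiked[:] = spiked_mask / dt

    # Solve analytically for when the crossing happened inside the step,
    # so the refractory period starts there instead of at a step boundary.
    # This is the sort of correction that tolerates a larger dt.
    t_spike = dt + tau_rc * np.log1p(
        -(voltage[spiked_mask] - 1) / (J[spiked_mask] - 1))
    refractory_time[spiked_mask] = tau_ref + t_spike

    # Clamp runaway negative voltages and reset the neurons that spiked.
    voltage[voltage < min_voltage] = min_voltage
    voltage[spiked_mask] = 0

# Example usage: five neurons driven by a constant suprathreshold current.
n = 5
J = np.full(n, 1.5)
voltage, refractory_time, spiked = np.zeros(n), np.zeros(n), np.zeros(n)
for _ in range(100):
    lif_step(0.001, J, voltage, refractory_time, spiked)
```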


Nengo uses zero-order hold (ZOH) – not Euler’s method – in its discretization of LIF neurons and linear synapse models. ZOH assumes the input stays constant across the time-step, while Euler’s method assumes the derivative of the state remains constant (usually a much stronger assumption, and one more appropriate for nonlinear systems, where there is no closed-form ZOH solution). Note that the portion of the system being discretized is linear for both neurons and synapses. Details on both methods may be found here: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.signal.cont2discrete.html
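
For a concrete comparison, here is a small sketch (using the scipy function linked above) that discretizes the lowpass synapse $1/(\tau s + 1)$ both ways:

```python
import numpy as np
from scipy.signal import cont2discrete

tau, dt = 0.005, 0.001  # synapse time constant and simulation timestep
# Continuous-time state space of the lowpass synapse 1 / (tau*s + 1):
A, B, C, D = [[-1 / tau]], [[1 / tau]], [[1]], [[0]]

for method in ("zoh", "euler"):
    Ad, Bd, *_ = cont2discrete((A, B, C, D), dt, method=method)
    print(method, Ad.item(), Bd.item())

# zoh:   Ad = exp(-dt/tau) ~ 0.8187, Bd ~ 0.1813  (exact under ZOH input)
# euler: Ad = 1 - dt/tau   = 0.8,    Bd = 0.2     (first-order approximation)
```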


One idea that we haven’t spoken about, and that may be appropriate for speeding up models such as Wilson’s (i.e., models that require a very fine time-step for integration – much finer than the Nyquist frequency of the input signal being integrated), is to build a lookup table / cache. This isn’t appropriate for a course on numerical methods, but it may be appropriate for approximating a ZOH solution to a complex nonlinear model in some manner that is fast and accurate, at the cost of additional memory.

Basically, you would precompute a table[input][old_state] => (new_state, output) that approximates the model’s input -> output behaviour, assuming the input is held constant across the time-step (ZOH). You would do this by using some costly method of numerical integration at some finer time-step, and at some level of state+input resolution (which determines your memory/accuracy trade-off). You only need to do this once. Then it’s $O(1)$ time to simulate the model by looking up the closest value in the table, and possibly doing some (spatial) interpolation across nearby values to improve accuracy. I’m not sure if there’s a word for this technique other than lookup / caching / memoization applied in a scenario that trades memory for accuracy (assuming ZOH). (This is also essentially what people are doing when they draw bifurcation diagrams of nonlinear systems.) However, one big potential limitation with this approach is parallelizing it across neurons in the same way that NumPy array operations are (essentially) parallelized.
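
Here is a hypothetical sketch of that idea for a generic 1-D nonlinear model $\dot{v} = f(v, u)$ (the model, grid sizes, and resolutions are all made up for illustration):

```python
import numpy as np

# Made-up 1-D nonlinear model: dv/dt = f(v, u).
def f(v, u):
    return u - v ** 3

dt = 0.001       # coarse simulation timestep
substeps = 100   # fine integration steps used only while building the table
v_grid = np.linspace(-2, 2, 201)  # state resolution (memory/accuracy trade-off)
u_grid = np.linspace(-1, 1, 101)  # input resolution

# Precompute once: for every (u, v_old) grid point, integrate across one dt
# with the input held constant (ZOH), using a costly fine-grained method.
table = np.empty((len(u_grid), len(v_grid)))
for i, u in enumerate(u_grid):
    v = v_grid.copy()
    for _ in range(substeps):
        v += (dt / substeps) * f(v, u)  # vectorized over all of v_grid at once
    table[i] = v

def step(v_old, u):
    """O(1) simulation step: direct index into the uniform grids."""
    i = int(round((u - u_grid[0]) / (u_grid[1] - u_grid[0])))
    j = int(round((v_old - v_grid[0]) / (v_grid[1] - v_grid[0])))
    i = min(max(i, 0), len(u_grid) - 1)   # clamp to the table's range
    j = min(max(j, 0), len(v_grid) - 1)
    return table[i, j]

v = 0.5
for t in range(1000):  # run the model using only table lookups
    v = step(v, np.sin(2 * np.pi * t * dt))
```

Replacing the nearest-neighbour lookup with (bi)linear interpolation across the surrounding grid points would trade a little speed for accuracy.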


There is a simulator called EDLUT that uses the lookup table approach for spiking networks. I’ve never used it personally, but it’s come up when I’ve searched for other simulators.


The thing I found insightful, and that may be relevant, is how the filters are implemented. You can’t just implement the equation directly by convolving the signal, because it’s arriving in real time. You have to do a discretisation, which we’ve implemented in an interesting way (a sketch below). Maybe this is more about digital signal processing?
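
For a flavour of what that looks like, here is a sketch for the lowpass synapse using the ZOH coefficients from earlier in the thread (not Nengo’s actual synapse classes):

```python
import numpy as np

tau, dt = 0.005, 0.001
# ZOH discretization of the lowpass synapse 1 / (tau*s + 1):
a = np.exp(-dt / tau)   # decay of the internal state per step
b = 1 - a               # weighting of the incoming sample

y = 0.0   # filter state, updated in place as samples arrive in real time
for t in range(1000):
    u = np.sin(2 * np.pi * 5 * t * dt)   # stand-in for the incoming signal
    y = a * y + b * u                    # one cheap recurrence per timestep
# Equivalent to convolving with the exponential impulse response, but
# computed online with O(1) work and memory per sample.
```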