Proposed optimizations in the Nengo core

This isn’t exactly an HPC topic; I am doing performance optimizations on my 4-core MacBook. After some profiling, I found some bottlenecks in the core. The fixes are relatively simple, and they reduced simulation time by about 30%.

The branch is at https://github.com/atait/nengo/tree/optimizations-core

I wanted to ask for feedback before opening an issue or PR because I have a very limited understanding of all of the use cases and inner workings of Nengo. Only one nengo-bones test failed (pickling, see below), but those tests might not cover everyone relying on Nengo core.

Also, I only benchmarked against one case, so these changes should be benchmarked individually before going forward to see whether each is worth it. The test case was a network with about 10k Ensembles of 30 neurons apiece and about 8 Connections projecting to other Ensembles.

Stop initializing BSR matrix on every step call

This probably only shows up when you have lots of small Ensembles.
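The general pattern, sketched with hypothetical names (a plain reshape view stands in for the BSR construction, since both keep a reference to the underlying data array rather than copying it):

```python
import numpy as np

def make_step(A_data, x, y):
    # Hypothetical sketch of the hoisting pattern, not Nengo's actual
    # step_dotinc: build the matrix view once when the step function is
    # created, instead of on every call.
    mat_A = A_data.reshape(2, 2)  # stands in for the BSR construction;
                                  # it shares memory with A_data

    def step():
        # in-place updates to A_data remain visible through mat_A, so the
        # hoisted construction still tracks a changing weight signal
        np.dot(mat_A, x, out=y)

    return step

A_data = np.array([1.0, 0.0, 0.0, 1.0])
x = np.array([2.0, 3.0])
y = np.zeros(2)
step = make_step(A_data, x, y)
step()            # y is now [2.0, 3.0]
A_data[1] = 1.0   # mutate the underlying signal in place
step()            # y is now [5.0, 3.0]: mat_A saw the update
```

Whether scipy’s BSR wrapper is guaranteed to behave this way is discussed further down the thread.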

Store SignalDict misses

When a requested key is not in the dict, SignalDict can fall back and allocate a new ndarray. Chances are that whatever requested that key will request it again, so we now keep the fallback array around.
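A minimal sketch of the idea (hypothetical class, not Nengo’s SignalDict; here the key is assumed to encode the shape):

```python
import numpy as np

class MissCachingDict(dict):
    """On a miss, allocate the fallback ndarray once and store it,
    so the (likely) repeat request reuses the same array."""

    def __missing__(self, key):
        name, shape = key        # assume the key carries the shape
        arr = np.zeros(shape)
        self[key] = arr          # cache the miss for next time
        return arr

d = MissCachingDict()
a = d[("bias", (3,))]
b = d[("bias", (3,))]
assert a is b  # second lookup returns the cached array, no reallocation
```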

Potential side effect: this breaks any caller that relies on getting a freshly allocated ndarray every time. That would violate the typical expectations of a dictionary, but I suppose it could happen.

Circumvent np.clip from numpy 1.17

It introduced a significant slowdown: https://github.com/numpy/numpy/issues/14281. The rest of Nengo is very good about numpy dtype handling, so np.clip can be replaced with the underlying ufunc.

Potential issue: if user code is feeding in mismatched dtypes, they might find their way to the ufunc and break it.

Also, I replaced all some_ndarray.clip(a, b) with clip(some_ndarray, a, b), and all clip(some_ndarray, a, None) with np.maximum(some_ndarray, a). This is safe for all user code.
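An equivalent sketch of the replacement, composing the minimum/maximum ufuncs directly (the branch uses the underlying clip ufunc; this version illustrates the same overhead-skipping idea without touching np.core):

```python
import numpy as np

def clip(a, a_min, a_max):
    # Compose the two underlying ufuncs directly, avoiding the extra
    # Python-side dispatch that np.clip gained in numpy 1.17. Assumes
    # a_min and a_max already have dtypes compatible with a (Nengo is
    # careful about this; arbitrary user input might not be).
    return np.minimum(np.maximum(a, a_min), a_max)

x = np.array([-0.5, 0.5, 1.5])
assert np.array_equal(clip(x, 0.0, 1.0), np.clip(x, 0.0, 1.0))
```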

Size 1 LinearFilter using builtin float

Builtin floats outperform single-element numpy arrays. This type of filter is a workhorse in some networks. OneXOneIn, like its sister classes, checks the state-space filter parameters to determine when it is applicable.
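A hedged sketch of the idea (names are mine; Nengo’s OneXOneIn differs in detail): a first-order lowpass step whose state is a builtin float rather than a 1-element ndarray.

```python
import math

def make_lowpass_step(tau, dt, y0=0.0):
    # Discretized first-order lowpass: y <- a*y + b*x per step.
    a = math.exp(-dt / tau)  # state coefficient
    b = 1.0 - a              # input coefficient
    y = y0                   # builtin float, not a numpy array

    def step(x):
        nonlocal y
        y = a * y + b * x    # one scalar multiply-add per step
        return y

    return step

step = make_lowpass_step(tau=0.005, dt=0.001)
out = [step(1.0) for _ in range(5)]  # rises monotonically toward 1.0
```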

Numba-fy the OneX LinearFilter (probably nengo-extras)

Numba can give you about an order-of-magnitude speedup on numpy code. Under the hood, it uses compiler optimizations, cache-friendly code generation, threading, Intel-specific CPU features (if present), etc. With sufficient motivation, one can also tailor its compilation to NVIDIA cards.

This is exactly what you need to do to make full use of a modern computer. The whole point of numba is to make those intricacies as pythonic as possible; however, JIT code must give up some key things:

  • does not pickle natively (test_pickle_sim failed; workaround is possible)
  • does not handle 16 bit floats (test_dtype[16] passes using a workaround)
  • impossible to debug, but it’s easy to turn off the JIT decorator for this
  • exhaustive testing either acquires a hardware-dependent aspect or requires deep trust in the numba developers.

The OneX filter is the basis of the Lowpass filter and, in turn, of the most common Synapse. It’s a major workhorse, and it is pretty straightforward, so the potential for a bug is low.
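Assuming the OneX update has the form y ← a·y + b·x (a sketch, not Nengo’s actual class), a jitted version might look like this, with a no-op fallback so the code still runs when numba is absent:

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(*args, **kwargs):  # no-op fallback: run as plain Python
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

@njit(cache=True)
def onex_step(a, b, x, y):
    # y <- a*y + b*x elementwise, in place, written as a plain loop
    # that numba compiles to machine code
    for i in range(y.size):
        y[i] = a * y[i] + b * x[i]

x = np.ones(4)
y = np.zeros(4)
onex_step(0.8, 0.2, x, y)  # y becomes [0.2, 0.2, 0.2, 0.2]
```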

Perhaps a bigger downside is the additional dependency on numba, so I perceive this one as a component of nengo-extras, if anything.


These are great suggestions - thanks! We’ve looked a bit at Numba related optimizations internally, but haven’t got any code that we are ready to share at the moment. For next steps, I’d suggest opening a PR on Nengo core, as the dev team is interested in incorporating these improvements and it’ll be easier to have more detailed discussions on GitHub.

Ok, thanks for the feedback. I’ll set up a PR over the next couple of days. Everything but numba is passing the test suite; the pickling fix should just be a matter of a few lines somewhere in a __setstate__. The numba one should probably be closed for now – an extra dependency is a big decision – but I’ll put it up in its own PR just in case. Perhaps someone could formulate an implementation as a nengo-extras option.

Here’s a more careful %prun of my lone benchmark. Above, I meant that simulation time dropped to about 30% of the original runtime, not by 30%. The 1D LinearFilter is the clear standout in terms of effect because I have lots of 1D Ensembles.

I imagine there is an internal script somewhere to perform the heavy benchmarking that the devs and user base actually care about (incl. Spaun, SpiNNaker, etc.). That could determine which changes matter and which ones are safe. I of course can’t do that study.


Sounds great! The core dev team should be able to provide some further info on any benchmarks of interest that would be good to keep in mind, so we’ll look forward to the PR!

I think the reason we initialize the BSR matrix inside step_dotinc is in case the data in the matrix is changing (i.e. if you initialize outside the function, and then the A signal changes, I’m not sure if mat_A will update). Since the test suite passes for you, though, this might indicate that we’re never actually using this functionality. We should definitely look into this, though; worst case, we might just need a conditional block here that initializes mat_A outside if A is read-only, and inside if it’s not.

It seems that the BSR matrix just keeps a reference to the data array, so updates should work. The question is whether one should rely on that, as it does not seem to be explicitly documented (though one sort of needs to know this anyway if one wants to change the data array in place without changing the BSR matrix).

Ok, I have made a pull request. https://github.com/nengo/nengo/pull/1629

The pickle failure took me a while to figure out. The one-dimensional filter stored its state as a builtin float, but the initializer changes that state externally, relying on it being an array.

Surprisingly, keeping the array state updated on every loop gave up only a little performance. Now, with tests passing, the time spent on a OneX step dropped to 13% of the original. Without tracking the array state, it was 7.1% of the original. The factor of ~2 is not very important because this is no longer a bottleneck.
(Since the shape of a LinearFilter never changes, it might be possible to do something similar for multi-dimensional Ensembles by telling it what shape to expect.)
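A sketch of the write-back fix (hypothetical names): the step mirrors the shared 1-element array in a builtin float for speed, but syncs it back each loop so external views of the state stay valid.

```python
import numpy as np

def make_onex_step(a, b, state):
    # `state` is the shared 1-element ndarray that the initializer (and
    # pickling) may touch externally.
    y = float(state[0])  # pick up any state set before the run

    def step(x):
        nonlocal y
        y = a * y + b * x
        state[0] = y     # keep the shared array in sync with the float
        return y

    return step

state = np.array([0.5])   # e.g. set by an initializer before the run
step = make_onex_step(0.8, 0.2, state)
step(1.0)
assert state[0] == 0.8 * 0.5 + 0.2 * 1.0  # array tracks the float state
```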

I think this means that about 90% of the time in OneX steps is spent determining the shapes and dtypes of X and the signal and how the result will broadcast, as opposed to the FLOPs and array access. That seems strange, so please post independent verifications on the PR.

Re: BSR initialization: I just went by the tests. Checking for writability would be a great way to minimize the chance of side effects.
