[Performance Issue] Integrating Nengo model into ROS node

Hello everyone:

I’m currently working on a robot project in which we try to feed data (audio data, to be specific) through a Nengo model for processing.
All the data connections seem to work, but here’s the problem:
the Nengo model doesn’t quite keep up with real time.
(OK, actually quite A LOT of lag, about 10 times slower than what we need.)
I understand that running a Nengo model at real-time speed on real-world data was never the main goal or claim of the Nengo project, but we still want to give it a try in our project.

I’m a bit frustrated when trying to improve my work, because I don’t understand the technical details behind the API.
So I’m here to discuss possibilities for improving my implementation, mainly the performance issue. Any advice is welcome.

project source code:

To show the model structure: python model_viz.py
Usage: python MSO_model.py
Example output:

[INFO] [1524815992.008086]: “nengo_mso_model” starts subscribing to “/ipem_module/apm_stream”.
Simulation finished in 0:00:02
ran 1024 steps in 2.232 sec
yet_to_run: 18432 steps
Simulation finished in 0:00:00
ran 1024 steps in 0.773 sec
yet_to_run: 23552 steps
Simulation finished in 0:00:00
ran 1024 steps in 0.773 sec
yet_to_run: 29696 steps
Simulation finished in 0:00:00
ran 1024 steps in 0.765 sec
yet_to_run: 34816 steps
Simulation finished in 0:00:00
ran 1024 steps in 0.769 sec
yet_to_run: 39936 steps

You can see that every batch of 1024 steps takes around 0.7 sec to simulate, while the steps in yet_to_run keep accumulating.

*The ROS message sources are in other packages (binaural_microphone and ipem_module),
so I created a dummy_source.py instead. (You still need ROS installed and the project compiled.)

I’m really grateful for the project and the community. It’s an amazing tool and there’s a lot to learn from it.

Edward Chen

Some questions I’ve come up with so far:

  1. How can I find the bottleneck of the model?
  2. Would restructuring the model improve performance? (i.e. using discrete ensembles?)
  3. I figured out that I could use a nengo.Node object as the output port instead of manually copying and clearing a Probe object. Which is better, though?
  4. …thinking

The main thing I wonder about is whether the bottleneck is in the EnsembleArrays or the Nodes. I’d try running a profiler on your code, and see where you’re spending most of the time. Since I see you’re using nengo_dl, you can follow the instructions here https://www.nengo.ai/nengo-dl/simulator.html#profile.

However, I’m not sure whether or not nengo_dl would be the fastest option in this case (Nodes slow things down, because TensorFlow has to call out to the underlying Python function). So it’d be worth trying it with the default Simulator as well, in which case you’d want to profile with https://docs.python.org/3/library/profile.html.
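
For example, a minimal cProfile sketch with the default simulator could look like this (assuming net is your already-built nengo.Network; the filenames are placeholders):

import cProfile
import pstats

import nengo

with nengo.Simulator(net) as sim:
    profiler = cProfile.Profile()
    profiler.enable()
    sim.run_steps(1024)          # simulate one chunk under the profiler
    profiler.disable()
    profiler.dump_stats("mso_profile.out")

# print the 20 entries with the largest cumulative time
pstats.Stats("mso_profile.out").sort_stats("cumulative").print_stats(20)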

Some other simple things to try off the bat would be removing the Probes, or using a simpler neuron model (e.g. (Spiking)RectifiedLinear).
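
For instance, switching the whole network over to a rate-based neuron model can be done through the config system (just a sketch, not specific to your model):

import nengo

with nengo.Network() as net:
    # use non-spiking rectified-linear neurons everywhere by default
    net.config[nengo.Ensemble].neuron_type = nengo.RectifiedLinear()
    # ... build the rest of the model as usual ...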

That being said, even if we find some performance improvements, I don’t think you’ll ever get a model running in real time with a dt of ~0.0001. You’ve got about 20000 neurons in your model, with all the associated connection weights, which is a lot of computation to be trying to run at 11 kHz. So if it is possible to downsample the audio signal in your model, that would help a lot. Beyond that you’d probably have to simplify the model (using fewer neurons per ensemble, or fewer subchannels/delay values).
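
And for reference, downsampling the audio before it reaches the model could look something like this (a sketch using scipy; the factor of 4 is made up):

import numpy as np
from scipy.signal import decimate

fs = 11025                   # original sample rate (Hz)
factor = 4                   # hypothetical downsampling factor
audio = np.random.randn(fs)  # stand-in for one second of the real audio stream

# low-pass filter and downsample; the model would then run with dt = factor / fs
audio_ds = decimate(audio, factor)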

Thanks, @drasmuss!

There are many informative suggestions here; I’ll try to parameterize these options and produce a performance comparison table.


This code is the vanilla flavor of my model, the most naive and ideal case, so no optimization has been applied to it. My next step is to figure out how to optimize or reduce the model so it can keep up with real-time processing. (Fingers crossed)

One question: are the pass-through nodes of the EnsembleArray input/output still Python objects that may cause a performance hit? Or is the simulator smart enough to just copy the data on the GPU?

Is nengo_dl the fastest option?
As you can see in the code, I’ve listed three simulators there: the reference simulator, the nengo_dl simulator, and the nengo_ocl simulator. The nengo_dl one is the fastest so far.
But I think different implementations may treat data caching differently, so I’m not quite sure which one suits my model best.

Yes, you’re definitely right. The most direct approach would be to cut down the complexity of the model, but I would take that as a last resort.

Edward Chen

No, pass-through nodes won’t give any significant performance hit. It’s just nodes that are executing some function that we need to worry about. And actually re-reading your model, the nodes you have there won’t cause a problem on nengo_dl either, because they don’t depend on input from the simulation (so they can be pre-computed before each run call).

Another option would be to restructure your model to support batched processing. That is, instead of having 7 input nodes and 7 ensemble arrays, one for each delay line, just have one node and one ensemble array, and feed in 7 inputs (with different delays) at the same time. That is only possible in nengo_dl, but I’d expect that to give you a significant performance improvement.
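
A rough sketch of that restructuring (the names, sizes, and the way the delayed copies are built are all placeholders):

import numpy as np
import nengo
import nengo_dl

n_delays = 7    # the 7 delay lines become the minibatch dimension
n_steps = 1024
dims = 40       # hypothetical dimensionality of each input frame

with nengo.Network() as net:
    inp = nengo.Node(np.zeros(dims))                           # one input node
    ens = nengo.networks.EnsembleArray(32, n_ensembles=dims)   # one ensemble array
    nengo.Connection(inp, ens.input, synapse=None)
    probe = nengo.Probe(ens.output)

# one delayed copy of the signal per minibatch item: shape (n_delays, n_steps, dims)
delayed_inputs = np.zeros((n_delays, n_steps, dims))

with nengo_dl.Simulator(net, minibatch_size=n_delays) as sim:
    sim.run_steps(n_steps, input_feeds={inp: delayed_inputs})
    output = sim.data[probe]    # shape (n_delays, n_steps, dims)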

@drasmuss I see, batched processing seems to be a great option! Thanks, I’ll give it a try soon.

An update on my experiments:

  1. With the same model, the nengo_ocl simulator uses around 10% GPU utilization and still runs slower,
    while the nengo_dl simulator uses only 1%~5% GPU utilization but runs faster. Why is this happening?
    This also suggests that the bottleneck isn’t the CPU or GPU, since neither is heavily loaded; maybe memory is the bound.

  2. nengo_dl’s underlying TensorFlow back-end keeps allocating most of the GPU memory even though the model really doesn’t need that much; how can I improve that? This question came up when I realized that I may need multiple Nengo simulations running at the same time. Or should I integrate them into one Nengo simulation?

  3. I’ve tried nengo_dl with profile=True, but it kept failing and I got no profiling information; the error message is below:

(110250, 40)
(110250, 40)
2018-04-29 16:05:03.017467: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn’t compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-29 16:05:03.017486: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn’t compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-29 16:05:03.017491: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn’t compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-04-29 16:05:03.017495: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn’t compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-29 16:05:03.017499: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn’t compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-04-29 16:05:03.103974: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-29 16:05:03.104222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.898
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 6.49GiB
2018-04-29 16:05:03.104235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2018-04-29 16:05:03.104240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2018-04-29 16:05:03.104249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) → (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
Building network
Build finished in 0:00:01
|# Optimizing graph: creating signals | 0:00:00
/home/cnrg-ntu/.local/lib/python2.7/site-packages/nengo_dl/graph_optimizer.py:1132: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
if np.issubdtype(sig.dtype, np.float):
Optimization finished in 0:00:01
Construction finished in 0:00:02
2018-04-29 16:05:08.897177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) → (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
0.0 0
[INFO] [1524989109.258160]: “nengo_mso_model” starts subscribing to “/ipem_module/apm_stream”.
2018-04-29 16:05:09.615810: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Parsing Inputs…
Incomplete shape. (repeated)
Failed to write timeline file: /home/cnrg-ntu/.local/lib/python2.7/site-packages/nengo_dl/…/data/nengo_dl_profile.json
Error: Not found: /home/cnrg-ntu/.local/lib/python2.7/site-packages/nengo_dl/…/data/nengo_dl_profile.json
ran 1024 steps in 0.784 sec, 0 steps yet to run.

With the same model, the nengo_ocl simulator uses around 10% GPU utilization and still runs slower,
while the nengo_dl simulator uses only 1%~5% GPU utilization but runs faster. Why is this happening?
This also suggests that the bottleneck isn’t the CPU or GPU, since neither is heavily loaded; maybe memory is the bound.

If I had to guess, most of the time is being spent in the overhead of launching kernels (this usually doesn’t show up in the profiling information, since it is kind of background meta-processing). Unfortunately there isn’t a lot to be done about that; it’s a fixed cost you pay. The batched processing approach might help though, because you’re combining multiple smaller components into one larger one.

nengo_dl’s underlying TensorFlow back-end keeps allocating most of the GPU memory even though the model really doesn’t need that much; how can I improve that?

This is a TensorFlow configuration option; you can read more about it here: python - How to prevent tensorflow from allocating the totality of a GPU memory? - Stack Overflow. Unfortunately it isn’t super easy to edit those config options at the moment in NengoDL. You’ll have to go into the code, to this line here: nengo-dl/nengo_dl/simulator.py at main · nengo/nengo-dl · GitHub, and add the config options you want. I’ll put it on my TODO list to make that more accessible.
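
For reference, the underlying TensorFlow (1.x) option being set is just the standard session config; this sketch shows the plain TensorFlow side, not a NengoDL API:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory on demand
# or cap the fraction of GPU memory TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.3
sess = tf.Session(config=config)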

I’ve tried nengo_dl with profile=True, but it kept failing and I got no profiling information; the error message is below

If I had to guess, you don’t have write access to the /home/cnrg-ntu/.local/... directory. Two options to get around this: one is to install nengo_dl via pip install nengo-dl --user (the --user flag should install nengo_dl in a directory where you do have write access). Or you can do a developer installation; this might make the most sense, if you’ll need to be editing the code for the config options above.

Great suggestions here! Thanks a lot!

I upgraded nengo-dl by manually installing v0.6.1.dev;
for compatibility I also upgraded to nengo==2.7.0, tensorflow-gpu==1.8.0, and CUDA 9.0.

While installing, I encountered an error: “from pip import get_installed_distributions” doesn’t work in pip==10.0.1, just to let you know. (I downgraded pip to 9.0.1 and it worked fine after that.)

I’m now experimenting with the minibatch feature of nengo-dl; it’s really powerful! (Exciting)

Now I have a question:
My original plan was to use a nengo.Node object as the way of collecting output data.
But as I dug in, I figured out that nengo-dl in minibatch mode isn’t designed to give a distinguishable callback for each batch item, right?

Here’s a quick test:

import nengo
import nengo_dl
import numpy as np

n_steps = 5
data_dim = 1
mini_batch = 3

def out_cb(t, x):
    # called by the simulator once per timestep, once per minibatch item
    print t, x


# shape (mini_batch, n_steps, data_dim); each minibatch item gets a distinct value range
feeds = np.arange(mini_batch*n_steps*data_dim).reshape((mini_batch, n_steps, data_dim)) + np.arange(mini_batch)[:, None, None]


with nengo.Network() as net:
    node = nengo.Node([0], size_in=0, size_out=data_dim)  # input node, overridden by input_feeds
    p = nengo.Probe(node)
    out_node = nengo.Node(output=out_cb, size_in=data_dim, size_out=0)  # output callback node
    nengo.Connection(node, out_node, synapse=None)

with nengo_dl.Simulator(net, minibatch_size=mini_batch) as sim:
    sim.run_steps(n_steps, input_feeds={node: feeds})

    print sim.data[p]

This results in:

0.001 [0.]
0.001 [6.]
0.001 [12.]
0.002 [1.]
0.002 [7.]
0.002 [13.]
0.003 [2.]
0.003 [8.]
0.003 [14.]
0.004 [3.]
0.004 [9.]
0.004 [15.]
0.0050000004 [4.]
0.0050000004 [10.]
0.0050000004 [16.]
Simulation finished in 0:00:00
[[[ 0.]
[ 1.]
[ 2.]
[ 3.]
[ 4.]]

[[ 6.]
[ 7.]
[ 8.]
[ 9.]
[10.]]

[[12.]
[13.]
[14.]
[15.]
[16.]]]

Yes, maybe I could identify batches by the order in which the callback is called, but that’s a bit hack-y and I shouldn’t really count on it.

At this point, it feels like my nengo.Node approach is slowing things down; maybe it is simpler to just let the simulation finish in batch mode and then take the data from a nengo.Probe.
Which way would you prefer?

Yes, you’re right: nengo.Node functions aren’t really aware of the minibatch aspect; they’ll just be called independently for each item in the minibatch. For what it’s worth, I believe the order of the function calls should be deterministic, so you could use that hack you mention (although I’d agree it’s definitely not ideal).
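
A minimal sketch of that hack (hypothetical; it assumes the Node function is called exactly once per minibatch item per timestep, in a fixed order, and handle_output is a made-up per-batch handler):

mini_batch = 3
call_count = [0]   # mutable counter shared with the callback

def out_cb(t, x):
    # infer which minibatch item this call belongs to from the call order
    batch_idx = call_count[0] % mini_batch
    call_count[0] += 1
    handle_output(batch_idx, t, x)   # hypothetical per-batch handler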

Another option would be to use TensorNodes. They work a lot like Nodes, but they are aware of the batching (the TensorNode function will be passed the whole minibatched array of inputs). However, the caveat there is that TensorNode functions have to be written using TensorFlow. So you’d probably have to use tf.py_func inside your TensorNode (which is relatively slow) to handle the communication with ROS, or whatever it is you need the Node to do.
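
A rough sketch of that approach (the exact TensorNode details may differ between nengo_dl versions, and publish_to_ros is a hypothetical callback standing in for your ROS code):

import numpy as np
import tensorflow as tf
import nengo
import nengo_dl

minibatch_size = 7
dims = 40   # hypothetical output dimensionality

def publish_to_ros(x):
    # plain Python/numpy; x has shape (minibatch_size, dims)
    # ... publish x on a ROS topic here ...
    return x.astype(np.float32)

def tensor_func(t, x):
    # x is the whole minibatched tensor; tf.py_func calls back into Python
    y = tf.py_func(publish_to_ros, [x], tf.float32)
    y.set_shape(x.get_shape())
    return y

with nengo.Network() as net:
    src = nengo.Node(np.zeros(dims))   # stand-in for the rest of the model
    out = nengo_dl.TensorNode(tensor_func, size_in=dims)
    nengo.Connection(src, out, synapse=None)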

It would probably also be fine to just use a Probe; I suspect that won’t be a performance bottleneck. If you want a slightly more performant way of accessing the probe data, you can use sim.model.params[my_probe][-1]. That will access just the probe data from the last call to sim.run_steps, rather than concatenating the data from all previous calls to sim.run_steps.
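
In code, the difference looks something like this (my_probe stands in for your probe object):

# all probe data accumulated so far, concatenated across run_steps calls
all_data = sim.data[my_probe]

# only the data recorded during the most recent sim.run_steps call
last_chunk = sim.model.params[my_probe][-1]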

While installing, I encountered an error: “from pip import get_installed_distributions” doesn’t work in pip==10.0.1, just to let you know. (I downgraded pip to 9.0.1 and it worked fine after that.)

Thanks for pointing that out! I’ll get that fixed today.

Just pushed some updates to master, which you can get if you’re doing a developer installation. It fixes the pip 10 issue, but another feature relevant for you is that you can specify session config options with nengo_dl.configure_settings. So for example if you want to change the GPU memory allocation method, as discussed above, you can do

with nengo.Network() as net:
    nengo_dl.configure_settings(session_config={"gpu_options.allow_growth": True})
    <build model>

rather than having to go in and edit the code.

@drasmuss, wow, the ‘nengo_dl.configure_settings’ feature works like a charm! Thanks a lot!

I have now tried every scheme I could to increase the simulation speed.
I think I have reached the limit of the nengo_dl simulator; this conclusion comes from the observations below:

  1. With model_v3, the fastest simulation is around 0.192 sec per 1024 time steps; no matter how I tweak the structure of the model* or the parameters of the simulator*, the speed doesn’t improve. (*reducing delay_values, minibatch_size, n_neurons, unroll_simulation, neuron_type)
    model_v3: https://github.com/Edward-CNRG-NTU/cnrg_ntu_tb3/blob/dev/central_auditory_model/scripts/MSO_model_v3.py
    comparison table: https://docs.google.com/spreadsheets/d/18GDlJjmkA29e2sfICrbdXAZZYLkuBwTjTcaOZgSMsSE/edit?usp=sharing

  2. When I take the probe or output node out, the simulation speed increases a lot and is sufficient for real time.

This leads me to wonder: is the probe or output node a Python callback inside the TensorFlow computation graph, so that there’s a fixed time cost at every time step?
If my assumption is correct, reducing the number of Python callbacks may improve the simulation speed.

Question:

  1. Is it possible to implement an output node that reduces the output data and thus the number of callbacks needed?
    e.g. perform max pooling over several time steps: seven input steps, seven simulation steps, and the output node takes the max value of those seven steps (on the GPU) and only outputs once (one Python callback).
  2. Can models with different simulation dt (sample rate, time resolution) be incorporated into one simulation? (Like how RNN/LSTM models can have different input and output sequence lengths.)

One thing to be careful of is that NengoDL/TensorFlow is smart enough to only run the parts of the model that are necessary. For example, if there is an ensemble of neurons that isn’t connected (even indirectly) to any nodes or probes, then there is no way that the activity of those neurons can impact the output of the model. So NengoDL just won’t simulate those neurons. If you’re seeing a really significant performance increase from removing probes, I’d suspect that is what is happening. Unfortunately I don’t think there’s an easy way to disable that behaviour, it’s part of TensorFlow’s core logic. The probe computations are all in TensorFlow (no python callouts), so they shouldn’t have a large performance impact (relative to other parts of the model).

One thing I noticed when looking at your code is that you don’t really need the ens_arr_add array. Since you just want to linearly combine ens_arr_L and ens_arr_R, you can do that by directly connecting those two ensemble arrays to output_node; the extra layer of neurons isn’t really adding any computational power to the model. That would eliminate a third of your neurons, which should help with the speed.
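
In sketch form (reusing the names from your model; the synapse values are placeholders), that change is just:

# connect the two ensemble arrays straight to the output node; the linear
# combination happens in the connection decoders/transforms
nengo.Connection(ens_arr_L.output, output_node, synapse=None)
nengo.Connection(ens_arr_R.output, output_node, synapse=None)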

Thanks for your help. @drasmuss

ens_arr_add is indeed unnecessary; all the neural computation is happening in the neurons and decoders of the previous ensembles. Good point!

I had suspected that TensorFlow does some magic to optimize the computational graph, and your words confirm it, which means there is not much more I can do in terms of tweaking the simulator.

Hello @drasmuss,

Since my model doesn’t quite keep up with the real-time signal, I’m thinking maybe I can skip ahead in the signal whenever the simulation falls too far behind real time.

My Question is:

  1. Is the simulator, or the nengo_dl simulator specifically, “stateful”? That is, are consecutive time steps related, such that if the input signal skips a few steps, the internal state of the network or simulator will no longer be the same as if no skipping had occurred?
  2. If the simulation is stateful, how does nengo_dl’s batched simulation pass the internal state between batches?

I would be grateful if you could provide some in-depth suggestions. Thanks!

The Nengo/NengoDL simulation is not inherently stateful, but often Nengo models do have stateful elements. The two most common would be spiking neurons (where the internal voltage/refractory period is stateful) and synapses (where the filtered post-synaptic current is stateful).

So, for example, if you ran a model with inputs 0, 1, 2, 3, 4, 5, vs 0, 1, 2, 5, you would get different output on the final timestep if your model includes any of those elements. That being said, often the impact of those previous timesteps is relatively minimal (depending on the structure/parameters of your model), and it tends to decrease over time. So, for example, if we ran a model with input 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 vs 0, 2, 3, 4, 5, 6, 7, 8, 9, 10, the output we get on that final timestep would probably be pretty similar. You’d have to experiment with your particular model to see how impactful that skipping would be.

However, if your model only uses non-stateful neurons (e.g. RectifiedLinear, LIFRate), and doesn’t contain any synaptic filters (synapse=None on all the Connections), then you could skip timesteps without changing the output of the model at all.
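
As a sketch, a completely stateless configuration would look like this, applied as network-level defaults:

import nengo

with nengo.Network() as net:
    # rate neurons: no membrane voltage or refractory state to carry over
    net.config[nengo.Ensemble].neuron_type = nengo.RectifiedLinear()
    # no post-synaptic filtering: nothing is carried between timesteps
    net.config[nengo.Connection].synapse = None
    # ... build model ...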

If the simulation is stateful, how does nengo_dl’s batched simulation pass the internal state between batches?

Each element in the batch has its own state (so, e.g., batch input 32 will always resume from wherever the previous batch element 32 left off, and will not be affected by what was going on in batch 31 at all).