Nengo: poor performance with parallel executions

Hi!

I developed a custom spiking neural architecture with Nengo, and I need to perform parallel (independent) executions of that architecture on a computing cluster (for parameter optimization).

I was surprised to see that when I ran more than 4 parallel simulations, there was no gain at all: each additional simulation increased the execution time of the others proportionally.
I thought the problem might be specific to my machine, so I ran the architecture on several different machines, but the problem remained. The same was true on a computing cluster, where at most 4 simulations could run concurrently on each machine without degrading the simulation speed.

Thus it seems that there is a threshold (4 simulations) that I cannot cross, regardless of the available computing resources. This is very problematic, as I can use only a tiny fraction of the available CPU cores.

Do you know how I can solve this problem? Thank you!

Attached is my test code: (1.8 KB) (1.6 KB) (2.7 KB) (1.1 KB)

Hi @Adri,

Nengo (the vanilla Nengo) uses NumPy to perform all of the neural computations of the network, and if you are running the Nengo simulations independently (i.e., spawning a separate Python process for each simulation), they shouldn’t affect each other too much. How much they do affect each other depends on the CPU architecture you are using and on the OS’s CPU scheduler.
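As a quick illustration of what "independent" means here, one common pattern is to spawn a separate OS process per simulation with `concurrent.futures`. This is a minimal sketch: `run_once` is a hypothetical stand-in for your own entry point (in your real script it would build the Nengo model and run `nengo.Simulator`; a cheap computation is used here so the sketch runs without Nengo installed):

```python
import concurrent.futures
import os

def run_once(seed):
    # Stand-in for building and running one Nengo model, e.g.
    #   with nengo.Simulator(model, seed=seed) as sim:
    #       sim.run(10.0)
    # A cheap computation keeps this sketch runnable without Nengo.
    total = sum(i * i for i in range(100_000))
    return seed, total

if __name__ == "__main__":
    # One OS process per simulation: each run gets its own Python
    # interpreter, so the simulations stay fully independent.
    with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        for seed, _ in pool.map(run_once, range(4)):
            print(f"run {seed} finished")
```

Because each worker is a full process (not a thread), the Python GIL doesn’t serialize the runs, and the OS scheduler is free to place each one on its own core.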

Testing your code on my machine (AMD Ryzen 9 5950X: 16 cores, 32 hardware threads) running Windows, I get the following results:

  • 1 simulation: 2:35
  • 2 simulations: minimum - 2:36, maximum - 2:38, average 2:37
  • 4 simulations: minimum - 2:40, maximum - 2:42, average 2:41
  • 6 simulations: minimum - 2:40, maximum - 2:46, average 2:43.5

Looking at the results above, there is an increase in the time it takes to complete the simulation, but it is not a large one. Also, the increase in simulation time is roughly linear with the number of simulations being run. Looking at the processor usage (in the Windows task manager), I can see the CPU cores constantly being switched in and out of running the processes, so I surmise that the additional parallel simulations are adding extra overhead as more processes are swapped between the cores.

However, I definitely would not classify these results as limiting the simulations to only a small fraction of the CPU cores. On my system, with my OS, running multiple parallel simulations does slow all of them down, but not by a lot (roughly a 1% slowdown per added simulation).

I also performed the same experiment on our compute cluster. We use SLURM to partition the available CPU cores, and unlike Windows, processes are locked to specific cores while they are running (a process doesn’t get swapped to other CPU cores). Here are the results from this experiment:

  • 1 simulation: 2:24
  • 2 simulations: min: 2:24, max: 2:27, avg: 2:25.5
  • 4 simulations: min: 2:24, max: 2:26, avg: 2:25
  • 8 simulations: min: 2:26, max: 2:34, avg: 2:27.75
  • 12 simulations: min: 2:28, max: 2:37, avg: 2:31.75

Once again, while there is some increase in simulation time, running more parallel simulations still nets you more results than running them sequentially, with an increase of only 8 s when going from 1 to 12 parallel simulations. Some increase in simulation time is expected, since the simulations use not just the CPU cores but also system memory and the CPU cache. More parallel processes means more contention for these shared resources, which inevitably slows all of the processes down, but not by much.
And, as before, I wouldn’t consider this a situation where only a small fraction of the CPU cores are usable.
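Incidentally, the core-locking behaviour SLURM gives us can be reproduced manually on Linux with `os.sched_setaffinity`, if you want to test whether pinning helps on your machines. This is a sketch under the assumption you are on Linux (the call doesn’t exist on Windows or macOS):

```python
import os

def pin_to_core(core_id):
    # Linux-only: restrict the calling process (pid 0 = self) to a
    # single CPU core, similar to SLURM's core binding on the cluster.
    os.sched_setaffinity(0, {core_id})

if hasattr(os, "sched_setaffinity"):  # guard for non-Linux platforms
    pin_to_core(0)
    print("running on cores:", os.sched_getaffinity(0))
```

If each simulation process pins itself to a distinct core at startup, the OS can no longer shuffle them around, which removes the swapping overhead I observed in the Windows task manager.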

Note that all of these experiments were performed using the code you provided, without any modifications, and with the latest version of Nengo.

If you are still experiencing the issue you describe, can you post some runtime data (like I did above), as well as the CPU architectures you are testing on?
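For collecting those numbers, something as simple as `time.perf_counter` around your run is enough. Here `run_simulation` is a placeholder for your own entry point, not part of your attached code:

```python
import time

def run_simulation():
    # Placeholder for your actual entry point, e.g. building the model
    # and calling sim.run(...) inside nengo.Simulator(model).
    sum(i * i for i in range(100_000))

start = time.perf_counter()
run_simulation()
elapsed = time.perf_counter() - start
print(f"wall-clock time: {elapsed:.3f} s")
```

Printing the wall-clock time from each parallel process lets us compare min/max/average across runs, the same way I reported the results above.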