Nengo Loihi -- precompute=True is ~10x slower than precompute=False

Hello,

I am running a reservoir-computing model that consists of a single large ensemble (n=2048) on a Loihi chip. The basic architecture: a set of off-chip LIF neurons converts a state signal to spikes (I present the constant signal for 200 timesteps). I send these spikes to the Loihi chip, then use the summed number of spikes from each neuron to compute the control action for one (control) time step. The new state is then presented to the reservoir, and so on. Below is some pseudo-code for the problem:

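import numpy as np
import nengo
import nengo_loihi

# n_ens_neurons, ens_size, W_res, synapse_out, env and num_env_steps are defined elsewhere.
state = env.get_state()  # initial state, so the input Nodes have something to read
spike_collection = np.zeros(ens_size)  # per-neuron sum for the current control step

network = nengo.Network()
with network:
    nengo_loihi.add_params(network)

    # Create the reservoir first so the encoders below can connect to it.
    reservoir = nengo.Ensemble(ens_size, dimensions=1)
    nengo.Connection(reservoir.neurons, reservoir.neurons, transform=W_res)

    encoding_target_ensembles = []
    input_node_target = []

    for i in range(3):
        # ii=i binds the loop index; `state` is looked up on every call.
        input_node_target.append(nengo.Node(lambda t, ii=i: state[ii]))

        encoding_target_ensembles.append(
            nengo.Ensemble(n_ens_neurons, dimensions=1, neuron_type=nengo.LIF())
        )
        network.config[encoding_target_ensembles[i]].on_chip = False  # run encoder off chip

        nengo.Connection(input_node_target[i], encoding_target_ensembles[i])
        nengo.Connection(encoding_target_ensembles[i].neurons, reservoir.neurons)

    def output_log(t, x):
        # In-place add; a bare `spike_collection += x` would raise UnboundLocalError.
        spike_collection[:] += x

    output_node = nengo.Node(output_log, size_in=ens_size)
    nengo.Connection(reservoir.neurons, output_node, synapse=synapse_out)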

sim = nengo_loihi.Simulator(network, precompute=True, target="loihi", hardware_options={"snip_max_spikes_per_step": 350})

for _ in range(num_env_steps):
    state = env.get_state()
    spike_collection *= 0
    sim.run_steps(200)
    action = env.generate_action(spike_collection)
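
The pseudo-code above relies on a module-level accumulator that the node function mutates. A minimal sketch of an alternative, assuming nothing beyond the fact that a Node accepts any callable taking `(t, x)`: wrap the running sum in a small class. `SpikeAccumulator` is a hypothetical name, not part of Nengo, and plain lists are used so the sketch runs without NumPy or hardware.

```python
class SpikeAccumulator:
    """Callable that sums its input vector over one control window."""

    def __init__(self, size):
        self.size = size
        self.total = [0.0] * size

    def __call__(self, t, x):
        # Called once per simulation step with the current input vector.
        for i, value in enumerate(x):
            self.total[i] += float(value)

    def reset(self):
        # Start a new control window.
        self.total = [0.0] * self.size


# Stand-alone check with made-up inputs (no Nengo needed).
acc = SpikeAccumulator(3)
acc(0.001, [1, 0, 1])
acc(0.002, [0, 0, 1])
window_total = list(acc.total)
acc.reset()
```

In the model this would presumably be used as `nengo.Node(acc, size_in=ens_size)`, with `acc.reset()` replacing `spike_collection *= 0` at the start of each control step.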

Everything works, but I am trying to reduce the compute time. I present a constant state for the 200 timesteps, so the input is precomputable. However, with precompute=True each 200-step run takes ~7 seconds, whereas with precompute=False it takes ~1 second. Both options give identical results. Below are some timings:

For precompute=True:

INFO:DRV:  Connecting to 127.0.0.1:40481
INFO:DRV:      Host server up..............Done 0.20s
INFO:DRV:      Encoding axons/synapses.....Done 1.20s
INFO:DRV:      Compiling Embedded snips....Done 0.24s
INFO:DRV:      Compiling MPDS Registers....Done 0.07ms
INFO:DRV:      Booting up..................Done 2.04s
INFO:DRV:      Encoding probes.............Done 4.12ms
INFO:DRV:      Transferring probes.........Done 0.01s
INFO:DRV:      Configuring registers.......Done 2.93s
INFO:DRV:      Transferring spikes.........Done 0.10s
INFO:DRV:      Executing...................Done 7.27s
INFO:DRV:      Processing timeseries.......Done 0.07s
INFO:DRV:  Executor: 200 timesteps.........Done 10.41s
INFO:DRV:      Transferring probes.........Done 0.03ms
INFO:DRV:      Configuring registers.......Done 1.87ms
INFO:DRV:      Transferring spikes.........Done 0.09s
INFO:DRV:      Executing...................Done 7.29s
INFO:DRV:      Processing timeseries.......Done 0.06s
INFO:DRV:  Executor: 200 timesteps.........Done 7.45s
INFO:DRV:      Transferring probes.........Done 0.03ms
INFO:DRV:      Configuring registers.......Done 1.61ms
INFO:DRV:      Transferring spikes.........Done 0.08s
INFO:DRV:      Executing...................Done 7.29s
INFO:DRV:      Processing timeseries.......Done 0.07s
INFO:DRV:  Executor: 200 timesteps.........Done 7.45s
INFO:DRV:      Transferring probes.........Done 0.03ms
INFO:DRV:      Configuring registers.......Done 1.66ms
INFO:DRV:      Transferring spikes.........Done 0.08s
INFO:DRV:      Executing...................Done 7.30s
INFO:DRV:      Processing timeseries.......Done 0.07s

For precompute=False:

INFO:DRV:  Connecting to 127.0.0.1:34145
INFO:DRV:      Host server up..............Done 0.20s
INFO:DRV:      Encoding axons/synapses.....Done 1.21s
INFO:DRV:      Compiling Embedded snips....Done 0.36s
INFO:DRV:      Compiling Host snips........Done 0.49s
INFO:DRV:      Compiling MPDS Registers....Done 0.06ms
INFO:DRV:      Booting up..................Done 2.01s
INFO:DRV:      Encoding probes.............Done 0.04ms
INFO:DRV:      Transferring probes.........Done 1.11ms
INFO:DRV:      Configuring registers.......Done 2.86s
INFO:DRV:      Transferring spikes.........Done 0.02ms
INFO:HST:  Using Kapoho Bay serial number 405
INFO:HST:  Args chip=0 cpu=0 /home/noel/loihi_venv/lib/python3.8/site-packages/nxsdk/driver/compilers/../../../temp/1664287762.3849907/launcher_chip0_lmt0.bin --chips=1 --remote-relay=0 
INFO:HST:  Nx...
INFO:HST:  [Host] Listening for client
INFO:HST:  [Host] Connected to client
INFO:HST:  chip=0 cpu=0 time 100
INFO:HST:  chip=0 cpu=0 time 200
INFO:DRV:      Executing...................Done 0.46s
INFO:DRV:      Processing timeseries.......Done 0.06ms
INFO:DRV:      Transferring probes.........Done 0.05ms
INFO:DRV:      Configuring registers.......Done 1.93ms
INFO:DRV:      Transferring spikes.........Done 0.02ms
INFO:HST:  chip=0 cpu=0 Waited to exit (nonsense sum -13580)
INFO:HST:  chip=0 cpu=0 time 300
INFO:HST:  chip=0 cpu=0 time 400
INFO:DRV:      Executing...................Done 0.46s
INFO:DRV:      Processing timeseries.......Done 0.05ms
INFO:DRV:      Transferring probes.........Done 0.05ms
INFO:DRV:      Configuring registers.......Done 2.00ms
INFO:DRV:      Transferring spikes.........Done 0.02ms
INFO:HST:  chip=0 cpu=0 Waited to exit (nonsense sum -13580)
INFO:HST:  chip=0 cpu=0 time 500
INFO:HST:  chip=0 cpu=0 time 600
INFO:DRV:      Executing...................Done 0.46s
INFO:DRV:      Processing timeseries.......Done 0.04ms
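
To compare the two runs without reading the logs by eye, the per-stage times can be pulled out with a short script. This is only a sketch: it assumes the driver lines keep the `INFO:DRV:  <stage>....Done <value><unit>` shape shown above, and `stage_seconds` is a hypothetical helper name, not part of NxSDK or Nengo.

```python
import re

# Matches driver lines such as
# "INFO:DRV:      Executing...................Done 7.27s"
_LINE = re.compile(
    r"INFO:DRV:\s+(?P<stage>[^.]+?)\.+Done\s+(?P<value>[\d.]+)(?P<unit>ms|s)\s*$"
)


def stage_seconds(log_text, stage="Executing"):
    """Return the duration in seconds of every occurrence of `stage`."""
    times = []
    for line in log_text.splitlines():
        match = _LINE.match(line.strip())
        if match and match.group("stage").strip() == stage:
            value = float(match.group("value"))
            # Normalize milliseconds to seconds.
            times.append(value / 1000.0 if match.group("unit") == "ms" else value)
    return times


# A few lines copied from the precompute=True log above.
sample = """\
INFO:DRV:      Transferring spikes.........Done 0.10s
INFO:DRV:      Executing...................Done 7.27s
INFO:DRV:      Configuring registers.......Done 1.87ms
INFO:DRV:      Executing...................Done 7.29s
"""
executing = stage_seconds(sample)
registers = stage_seconds(sample, "Configuring registers")
```

Running the same helper over both logs makes it easy to see that the gap sits almost entirely in the "Executing" stage.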

Any idea what is going on?

My intuition is that precompute=True should be able to perform better, but I haven’t been able to figure out how. Could the output probe be the issue? The NxSDK docs suggest using a spike streamer or some other snip to retrieve spikes from the chip instead of probes, but I have not been able to figure out how to implement that within the Nengo framework.

Some naive testing suggests it is the probe: if I run the simulation with the output node and its connection commented out, the timing for precompute=False does not change, but the ‘Executing’ time for precompute=True drops from ~7 seconds to ~0.1 seconds.

Does this diagnosis make sense? Or is there another obvious explanation? And if this is correct, any suggestions on how to speed things up? Thanks!
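
For this kind of A/B test (with and without the output node, or with either precompute setting), a small wall-clock harness may be more convenient than reading the driver output. The sketch below assumes only that the simulator object exposes `run_steps(n)`, as in the pseudo-code earlier; `time_run_steps` and `_FakeSim` are hypothetical names used for illustration.

```python
import time


def time_run_steps(sim, n_steps, repeats=3):
    """Return wall-clock seconds for each of `repeats` calls to sim.run_steps(n_steps)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        sim.run_steps(n_steps)
        durations.append(time.perf_counter() - start)
    return durations


class _FakeSim:
    """Stand-in simulator so the helper can be tried without hardware."""

    def __init__(self):
        self.steps = 0

    def run_steps(self, n):
        self.steps += n


fake = _FakeSim()
durations = time_run_steps(fake, 200, repeats=3)
```

The first call is worth reporting separately: in the precompute=True log above, the first 200-step block took 10.41 s in total against 7.45 s for later blocks, because it includes one-time register configuration.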

Messing around with the code a bit more, I ran

board = self.sim.sims["loihi"].nxsdk_board
print(board.monitor.numProbes)

which returned 2048 for precompute=True but 0 for precompute=False. How does Nengo get spikes off the chip when precompute=False?

The code you’re looking for is in this snip here. That’s the template; you can see where we render it here.

It is a bit frustrating that spike probes can be so much slower than snips, since, as we understand it, probes were designed as the “intended” way to get spikes off the chip. I hadn’t realized the difference could be so drastic.