[Nengo Loihi] Strange results for neurocore dynamic power

Hi all,
in the last few days I ran some experiments on power consumption with the Keras-to-Loihi example.
These are my settings:

  • presentation time per sample: 0.03 s -> 30 steps per sample (since the Loihi dt is 0.001 s)
  • num samples: 80
  • tStart = 1, tEnd = num_samples * num_steps_per_sample for the probe interval
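In code, these settings work out as follows (a quick sketch; the variable names are mine, just for illustration):

# Sketch of how the probe interval follows from the settings above.
dt = 0.001                   # Loihi timestep, in seconds
presentation_time = 0.03     # seconds per sample
num_samples = 80
num_steps_per_sample = int(presentation_time / dt)  # 30
t_start = 1
t_end = num_samples * num_steps_per_sample          # 2400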

All the results seem fine except for the neurocore dynamic power.
Here are some of the experiments I ran:

  • When using probe_buff_size = 30 and probe_bin_size = 1, the NC dynamic power is 6.059 mW. Isn't this too low given the number of NCs used (14 of 96, 1 chip)?
  • With the same settings as above (probe_buff_size = 30 and probe_bin_size = 1) but twice the number of steps per sample (60 instead of 30), the NC dynamic power is 0.537 mW. This result looks strange to me, as I would expect that increasing the number of steps would lead to a higher NC dynamic power. Do you agree?
  • When probe_buff_size = 4 * 30 = 120 (the number of timesteps in 4 samples) and probe_bin_size = 1, the NC dynamic power is 3.459 mW. Repeating the inference, I got 2.801 mW. Why the difference?
  • When using probe_buff_size = 240 (8 * 30) and probe_bin_size = 1, the NC dynamic power is -2.334 mW. Why is the power negative?

NOTE: The dynamic power of NC is read using board.energyTimeMonitor.powerProfileStats['power']['core']['dynamic'].
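For completeness, this is roughly how I read that value (a sketch; board is the NxSDK board object exposed by the NengoLoihi simulator):

# Sketch: reading the neurocore dynamic power after a run. "board" is
# the NxSDK board object exposed by the NengoLoihi simulator.
stats = board.energyTimeMonitor.powerProfileStats
nc_dyn_power = stats['power']['core']['dynamic']  # interpreted in mW
print("NC dynamic power: {} mW".format(nc_dyn_power))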

The other power results (NC static power and Lakemont static and dynamic power) seem ok.

This is the notebook I used for the experiments:
https://drive.google.com/file/d/1Qg7lTCWOn-FBKh9VSDWC17COWegLSPL3/view?usp=sharing
Is anything wrong in my code? Could you please help me understand these results?

[EDIT]:

  • NxSDK version: 0.9.9
  • Nengo-Loihi built from source (master branch)
  • The parameters used in my experiments are the same as those used in the Keras-to-Loihi example

Best.

Hi @spiker!

The power measurement utilities available in NengoLoihi are actually not developed by the Nengo team. Rather, NengoLoihi simply provides references to the underlying NxSDK code to allow you to make the NxSDK calls within a NengoLoihi model. That being the case, you’ll have to clarify with Intel what effect changing the different parameters will have on the reported neurocore dynamic power.
You can also refer to the NxSDK power measurement code (I believe the relevant code is in nxsdk/graph/nxenergy_time.py) to see how the neurocore dynamic energy is being calculated.

As for the negative dynamic power readings, we have encountered this before with some of our own development networks. In our experience, fixing this requires reducing the I/O requirement of the code you are running (i.e., getting the board to run things faster). You can get an idea of how much the I/O is slowing down the network execution by measuring the execution time per timestep (there is an energy probe parameter to get this value). If it is roughly real time (i.e., 1 ms for every 1 ms timestep), that should be sufficiently fast. However, in our own experience, anything an order of magnitude or more over that can cause these weird numbers to be reported. In NengoLoihi, one way of speeding up the simulation is to set precompute=True when creating the NengoLoihi simulator object. What this does is "precompute" all of the model's input signals (across the entire desired simulation time) into spikes, which are then loaded onto the board before the simulation starts. However, this method is constrained by the available memory on the board and by the model's network architecture, so setting precompute=True may not always be possible.
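For reference, here is a minimal sketch of what that looks like (assuming net is your Nengo network and that its inputs can be precomputed):

import nengo_loihi

# Minimal sketch: precompute the input spikes for the whole run and load
# them onto the board before execution. "net" is assumed to be your
# Nengo network; this only works if the inputs can be precomputed and
# fit in the board's memory.
with nengo_loihi.Simulator(net, precompute=True) as sim:
    sim.run(1.0)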

I’ll continue to experiment with your code, and if I find anything significant to report or suggest, I’ll make a follow-up post! :smiley:

Hi @xchoo,
many thanks for your reply.

I agree with you, and I believe the negative dynamic power readings are very likely related to the (high) I/O requirements.
I also ran several experiments on event-based data and obtained results similar to those of the MNIST example.
I think the main bottleneck (especially when working with event-based data) is the PresentInput API. My suggestion is to introduce in NengoLoihi a new API to encode data in a sparse way, like the SpikeInputGenerator of NxSDK.
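For context, this is roughly how I currently feed frames (a sketch; test_frames is a hypothetical (num_frames, 5000) array of flattened event frames):

import numpy as np
import nengo

# Sketch of my current (dense) input pipeline: each event frame is
# presented as a dense vector even though most of its entries are zero.
# "test_frames" is a hypothetical (num_frames, 5000) array.
test_frames = np.zeros((40, 5000))

with nengo.Network() as net:
    inp = nengo.Node(nengo.processes.PresentInput(test_frames, presentation_time=0.03))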

By the way, feel free to make a follow-up post if you have any interesting results.

Best.

For event-based data, you’re definitely right that it’s inefficient to put them into frames for PresentInput, and then convert them back to spikes to send to the board.

However, much of your I/O will be governed by how you convert them back to spikes. The default in NengoLoihi is to use the OnOffDecodeNeurons class, which has pairs of on and off neurons with negative intercepts. These negative intercepts mean that if a particular dimension has an input of 0, both neurons for that dimension are firing; this helps us represent values near zero more accurately (since the total firing rate is always non-zero), but in your case it means you’re sending spikes to communicate the many zeros in your data.

To work around this, you could do the encoding yourself. The easiest way to do this is to make an ensemble off-chip to do the encoding, set up the neurons how you want, and then connect the .neurons of this ensemble to something on-chip. Using the .neurons will have NengoLoihi send the spikes from this ensemble, rather than trying to do some sort of encoding itself. Here’s a sketch of how to do that. (You’ll notice I use two ensembles, one for “on” neurons representing positive values in the input, and one for “off” neurons representing negative values. However, since I just have “on” neurons on the chip, I’ve set things up so that negative values inhibit the on-chip neurons. This is just one of many ways to do things. You could put the “on” and “off” neurons together in one ensemble, but then you may want to use Sparse transforms to limit the number of connection weights required.)

import matplotlib.pyplot as plt
import numpy as np
import nengo
import nengo_loihi

x = np.array([[0, 1, 0], [0, 0, 0.5], [0.2, 0, 0]])
input_dim = x.size

with nengo.Network() as net:
    nengo_loihi.set_defaults()
    nengo_loihi.add_params(net)

    node = nengo.Node(x.ravel())

    encode_pos = nengo.Ensemble(
        input_dim,
        dimensions=1,
        gain=nengo.dists.Choice([1000]),
        bias=nengo.dists.Choice([0]),
        neuron_type=nengo.SpikingRectifiedLinear()
    )
    net.config[encode_pos].on_chip = False
    nengo.Connection(node, encode_pos.neurons)
    p_encode_pos = nengo.Probe(encode_pos.neurons)

    encode_neg = nengo.Ensemble(
        input_dim,
        dimensions=1,
        gain=nengo.dists.Choice([1000]),
        bias=nengo.dists.Choice([0]),
        neuron_type=nengo.SpikingRectifiedLinear()
    )
    net.config[encode_neg].on_chip = False
    nengo.Connection(node, encode_neg.neurons, transform=-1)
    p_encode_neg = nengo.Probe(encode_neg.neurons)

    # this is our on-chip target ensemble
    target_ens = nengo.Ensemble(
        input_dim,
        dimensions=1,
        # We need a gain slightly higher than 1 here, because if the gain is 1 then an
        # input spike pushes a neuron to its firing threshold, but not over the firing
        # threshold, and it takes a second spike to push it over, resulting in half the
        # firing rate that we would expect.
        gain=nengo.dists.Choice([1.05]),
        bias=nengo.dists.Choice([0]),
        neuron_type=nengo.SpikingRectifiedLinear()
    )
    nengo.Connection(encode_pos.neurons, target_ens.neurons)
    nengo.Connection(encode_neg.neurons, target_ens.neurons, transform=-1)

    p_target_ens = nengo.Probe(target_ens.neurons)

with nengo_loihi.Simulator(net) as sim:
    sim.run(1.0)

pos_rates = sim.data[p_encode_pos].mean(axis=0).reshape(x.shape)
neg_rates = sim.data[p_encode_neg].mean(axis=0).reshape(x.shape)
target_rates = sim.data[p_target_ens].mean(axis=0).reshape(x.shape)
print("encode_pos rates:")
print(pos_rates)
print("target_ens rates:")
print(target_rates)

plt.subplot(2, 2, 1)
plt.imshow(x)
plt.title("input")
plt.subplot(2, 2, 2)
plt.imshow(target_rates)
plt.title("target ens")
plt.subplot(2, 2, 3)
plt.imshow(pos_rates)
plt.title("encode pos")
plt.subplot(2, 2, 4)
plt.imshow(neg_rates)
plt.title("encode neg")
plt.show()

Hi @Eric,
many thanks for your suggestions.

I was wondering whether your encoding code can be integrated into a model trained with NengoDL (as in the Keras-to-Loihi example).
In particular, considering your code snippet, my idea is to replace the input node of the converted network with the node from your code and connect it to encode_pos and encode_neg as you did. Then I think I should also connect encode_pos and encode_neg to the ensemble of the first convolutional layer (the equivalent of your target ensemble).

I have some more questions:

  • Should I also replace the ensemble of the first Conv layer with the code you used for the target ensemble? That is:
nengo_converter.layers[conv1].ensemble = nengo.Ensemble(
    input_dim,
    dimensions=1,
    gain=nengo.dists.Choice([1.05]),
    bias=nengo.dists.Choice([0]),
    neuron_type=nengo.SpikingRectifiedLinear()
)
  • In my case I have cropped frames of event data of shape (50, 50, 2). Starting from the x, y, t, and p arrays, I create the frames by binning in time to get 40 frames per sample. In the end, my training and test sets have shape (batch_size, 1, data), where batch_size = num_samples * num_frames_per_sample, 1 is the time dimension, and data has size 5000 (50 * 50 * 2) and represents a flattened frame.
    I train the network using NengoDL and save the trained weights, which are then loaded and applied to the network object as done in the Keras-to-Loihi example.
    Now, I have two questions about this setup:
    • Is my representation of event data compatible with your code? It is not clear to me how to feed event data to the Node object in your code, nor what the right representation of such data is …
    • Is the encoding scheme compatible with my training approach? By “compatible” I mean that the encoded test data should have the same meaning as the data used during training.

By the way, I will try to integrate your code into my setup and let you know if I need help or have additional questions.

Thanks in advance for your time and support. I am new to Nengo, and your comments are very helpful for understanding how Nengo works and how to use event data with it.

Best regards.

Just some thoughts about the power measurement results.

I was thinking about the use of precompute=True and I am wondering if it might influence the neurocore activity. You said that when using precompute=True, all of the model’s input signals (across the entire desired simulation time) are precomputed into spikes, which are then loaded onto the board before the simulation starts.
Does this imply that the neurocore activity is somehow reduced (i.e., that the NCs are active for fewer timesteps)?
I am asking because in my experiments I observed quite low power consumption for the neurocores compared to an implementation of the same model using the NxTF API.
Some weeks ago I also noticed (and reported in this post) that the reported spiking time per timestep is zero for all timesteps. I am wondering if this behavior can be related to the precomputation of spikes and a consequently reduced neurocore activity (as the spiking time should represent the time in which the neurocores perform computation, but correct me if I am wrong).

Moreover I also noticed that:

  • Before the simulation, the communication phase with the Lakemont takes a considerable amount of time (8-10 minutes when working with many event-data samples). Is this related to the time needed to precompute the spikes?
  • I noticed in my experiments a significant amount of power attributed to the management and host phases. I believe this is related to the high I/O requirements (since I am feeding event data through PresentInput, which, as you pointed out, is inefficient). Also, do you know whether precomputation affects the management and host times as well?

Best.

@Eric will have to correct me if I’m wrong, but I took a quick glance at the code, and I think you’ll need to use the modified gains (as Eric suggested) in your modified code. I would still give it a test, though, to make sure that it performs as you expect.

Yeah, the event data that you used for training should be compatible with Eric’s code. What you’ll want to do is present 1 frame (i.e., one 50 * 50 * 2 flattened frame) per timestep of the node. Note that one timestep of the Nengo simulation is 1 ms (by default), so you will probably want to present each frame for more than one timestep (i.e., each 40-frame sample would span more than 40 timesteps).
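As a rough sketch of that timing (assuming frames is a (40, 5000) array of flattened 50 * 50 * 2 frames for one sample, shown for 5 timesteps per frame):

import numpy as np
import nengo

# Sketch: present each flattened event frame for several timesteps.
# With 5 timesteps per frame (5 ms at the default dt of 1 ms), one
# 40-frame sample spans 40 * 5 = 200 timesteps.
frames = np.zeros((40, 5000))  # hypothetical sample
steps_per_frame = 5

with nengo.Network() as net:
    inp = nengo.Node(
        nengo.processes.PresentInput(frames, presentation_time=steps_per_frame * 0.001)
    )
    # "inp" would then feed encode_pos / encode_neg as in Eric's sketch.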

Yeah, as long as the test data is formatted the same way as your training data, it should work.

No. From experience, I believe the opposite to be true, actually. With all of the inputs essentially uploaded to the chip, less time is spent waiting on I/O operations for the input, allowing the neurocores to be more active throughout the entire simulation. Of course, you’ll also have to account for the I/O operations of any probes you have in your code (probes need to copy data off the board, and that is an I/O operation), so to maximize neurocore activity, you’ll want to minimize the number of probes in your model, or disable them altogether.

Sort of. The delay is not so much the time needed to precompute the spikes, but rather the time needed to upload the precomputed spikes onto the board (as mentioned before, I/O is rather slow). Increasing the amount of input data increases the number of spikes that need to be uploaded, which increases this delay.

@Eric will have to clarify this, but I think with the precomputation, the management and host times should decrease. But I’m not 100% sure on this.

Hi @xchoo,
even though I haven’t completed the tests on event data using Eric’s code, I also ran some tests on the Keras-to-Loihi example and got interesting new results.

Totally agree; I was able to replicate this behavior as well.

I made a new Jupyter notebook to compare the total latency when using precompute=True vs precompute=False.
You can find the notebook at the following link:

In particular, I ran 5 test cases:

  1. precompute=False, enable_energy_probe=False → latency: 9.49 s
  2. precompute=False, enable_energy_probe=True → I obtained a RuntimeError due to the EnergyProbe settings (probe_buff_size and probe_bin_size). I found that to solve it, the total number of timesteps must be less than or equal to the product probe_buff_size x probe_bin_size (see test case #5 and the sketch after this list).
  3. precompute=True, enable_energy_probe=False → latency: 37.85 s.
  4. precompute=True, enable_energy_probe=True → latency: 41.98 s.
  5. Changing probe_buff_size from 30 to 300 and probe_bin_size from 1 to 10, with precompute=False and enable_energy_probe=True, I managed to run the model on Loihi and got the following latency: 9.98 s.
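For reference, this is a sketch of the energy-probe setup behind these numbers, following the usual NxSDK pattern (board is the NxSDK board object; the exact import paths may vary between NxSDK versions):

from nxsdk.api.enums.api_enums import ProbeParameter
from nxsdk.graph.monitor.probes import PerformanceProbeCondition

# Sketch of the energy probe used in these tests (import paths may vary
# between NxSDK versions). The RuntimeError in test case #2 goes away as
# long as total_timesteps <= probe_buff_size * probe_bin_size.
total_timesteps = 80 * 30  # num_samples * num_steps_per_sample = 2400
probe_condition = PerformanceProbeCondition(
    tStart=1, tEnd=total_timesteps, bufferSize=300, binSize=10
)
e_probe = board.probe(probeType=ProbeParameter.ENERGY, probeCondition=probe_condition)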

According to these tests, it seems that using precompute=False gives a lower total latency.
Latency is computed as follows:

import time

with loihi_sim:
    start_time = time.time()
    loihi_sim.run(run_time)
    end_time = time.time()

latency = end_time - start_time  # wall-clock time of the whole run() call
print("Latency: {:.2f} s".format(latency))

Are you able to replicate my results? What do you think?
Note that you might obtain slightly different latency numbers, but they should be in line with mine (for example, running test case #1 multiple times I obtained latencies of 9.66 s, 9.49 s, and 9.08 s).

Best.

Yes. The numbers you get with your notebook are correct. However, I must point out that the method you are using to calculate the total inference time may be incorrect. When you call loihi_sim.run(), it calls the NxSDK API to compile and run your Nengo code on the Loihi board. However, quite a number of things are done before your code actually runs on the board. You can actually see this process in the NxSDK output (excerpt below):

INFO:DRV:    Host server up..............Done 0.26s
INFO:DRV:    Encoding axons/synapses.....Done 2.54s
INFO:DRV:    Compiling Embedded snips....Done 0.40s
INFO:DRV:    Compiling MPDS Registers....Done 1.13ms
INFO:HST:  Args chip=0 cpu=0 ~/nengo_venv/lib/python3.5/site-packages/nxsdk/driver/compilers/../../../temp/1609256042.9954004/launcher_chip0_lmt0.bin --chips=1 --remote-relay=0 
INFO:DRV:    Booting up..................Done 7.84s
INFO:DRV:    Encoding probes.............Done 1.57ms
INFO:HST:  Lakemont_driver...
INFO:DRV:    Transferring probes.........Done 0.02s
INFO:DRV:    Configuring registers.......Done 0.62s
INFO:DRV:    Transferring spikes.........Done 11.46s
INFO:DRV:    Executing...................Done 1.14s
INFO:DRV:    Processing timeseries.......Done 0.09s
INFO:DRV:  Executor: 2400 timesteps........Done 13.35s
INFO:HST:  chip=0 cpu=0 halted, status=0x0

From the output you can see that NxSDK does things like:

  • Initialize the board
  • Configure probes
  • Transfer probe and spike data (spike data is what is generated when precompute=True)
  • Execute the code on the board
  • Process the resulting data from the simulation

All of those things are done in the loihi_sim.run() call. However, when we measure inference time, we only measure the time it takes to execute the code on the board, which in the example output is this one line:

INFO:DRV:    Executing...................Done 1.14s

So, while the loihi_sim.run() call took 13.35 s in total, only 1.14 s of it was used to run the Nengo model on the board. The majority of those 13 s was used to transfer the spike data to the board (because that’s a lot of information).

To get an accurate execution time for your Nengo code, you’ll want to use this code instead:

print("Total execution time: {:.2f} s".format(e_probe.totalExecutionTime / 1000000))
# e_probe.totalExecutionTime is reported in us (microseconds)

If you do this, I get these results:

  • precompute=True: 1.31s
  • precompute=False: 13.50s

And here you can see the advantage of using precompute=True. Since precompute=True uploads all of the input data onto the board ahead of time, it does not have to waste time on input I/O. In the precompute=False case, the Loihi neurons have to wait for input from the host, which lengthens the execution time to 13.50 s (~10x longer).

I should note that if you look at the e_probe.plotEnergy() graph, you’ll notice that NxSDK only records power data during the execution phase, i.e., only for the time reported in the Executing line. Thus, if you want to do power comparisons, you’ll want to use e_probe.totalExecutionTime instead of timing the entire loihi_sim.run() call.

Hi @xchoo,
many thanks for your comments and explanation of these results.
They were very useful for understanding the difference between precompute=True and precompute=False.
The main advantage of precompute=True is that it significantly speeds up the execution time (compared to precompute=False). However, the disadvantage is that encoding the data into spikes (Lakemont_driver...) and transferring the spike data to the board may take a significant amount of time.
So, considering only the execution time, precompute=True is certainly better; however, considering the total inference time, precompute=False might result in a smaller latency.

By the way, I am also wondering whether the strange neurocore dynamic power results may be related to the low number of power samples collected during the execution time (1.14 s may be too short to collect a significant number of power samples, but I may be wrong) or to the use of probe_bin_size=1.
Moreover, even when using precompute=False, I obtained a negative neurocore dynamic power several times.
In particular, digging deeper, I found that the neurocore dynamic power is computed as:
nc_dyn_pwr = total_power - idle_power - lakemonts_dyn_power
Using the NxSDK API I obtained a total power consumption (total_power) of about 944 mW, an idle power consumption (idle_power) of about 941 mW, and a Lakemont (x86) dynamic power (the power consumed when executing code on the Lakemonts) of about 4 mW. With these numbers, nc_dyn_pwr = 944 - 941 - 4 ≈ -1 mW, so even small measurement errors in the much larger total and idle terms can push the result negative.
Note that the Lakemont dynamic power depends on the Lakemont active ratio (lmtActiveRatio), which seems to be quite high (0.99176 -> about 99%).
You can get all of these power measurements through board.energyTimeMonitor.powerProfileStats.
Hope these results help you better understand the power results.

Anyway, thanks again for your help!

Yes, this is correct.

I would hesitate to refer to the total simulation run time as the “inference time”. Since “inference” refers to when the network is performing the inference task, it happens only during the execution of the network, and thus should be equivalent to the network’s execution time. Also, in the precompute=True case, the spike transfer time dominates the total simulation run time. Because of this, the total simulation run time becomes highly dependent on the type of input being uploaded to the Loihi board. If you change the input to be sparse, I’d expect the spike transfer time to shrink but the network execution time to remain the same. So, if you want a true measure of the time the network spent performing inference, you’ll want to use the execution time rather than the total simulation run time.

I do agree that 1.14 s is a little short for collecting meaningful data from the network. I’d advise at least 15-30 s of execution time. That said, 1.14 s shouldn’t give you wildly wrong results; it might just be affected by startup transients in the power measurement data (from experience comparing the NxSDK energy probe to a physical energy probe, the NxSDK energy probes do lag a little bit, maybe 10 or so timesteps).

Yeah, in the precompute=False case, because so much of the execution time is spent waiting on I/O from the Lakemonts, it is possible that the neurocore dynamic power is drowned out by noise in the Lakemont power measurements, and this can give you strange results like negative dynamic power readings.