[Nengo-DL]: Prediction with batches of test data

Hello all, I have a few general questions, along with a question about how to do prediction on batches of test data with a converted spiking network. I am attaching my script here for downloading and running. The error I am facing in this script is: SimulationError: Number of saved parameters in ./keras_to_snn_params (7) != number of variables in the model (4). I looked around for a bit and found this, but it was probably related to an old bug, as I am using the latest version of nengo-dl.

Nengo version:  3.0.0
NengoDL version:  3.2.0
Tensorflow version:  2.2.0

1> During training of a nengo-dl model I can see it using the GPU, but during prediction it doesn't; it uses the CPU. If nengo-dl must use the CPU in inference mode, can someone show me how to parallelize the prediction over multiple CPUs?

2> My script uses batches for training… how do I change it to enforce regularization of firing rates? In the tutorial, nengo objects are used rather than the layer names (strings) that I pass in the dict for batch training.

3> Is there a way to train and test in the same simulation context (i.e. under one with nengo_dl.Simulator(converter.net, minibatch_size=batch_size) as sim), using rate neurons (tf.nn.relu) while training and spiking neurons while testing?

4> Any tips other than regularization of firing rates during training to improve inference performance would be very much appreciated!

Thanks!

Update-1: If I remove inference_only=True from the nengo-dl model to predict on test data, the script runs further but fails with an InvalidArgumentError (a small part of the traceback):

<ipython-input-10-82d2ac427759> in <module>
     19 
     20   test_data = get_test_batches(batch_size)
---> 21   data = nengo_sim.predict(test_data, steps=10)
     22 
     23   accuracy = get_accuracy(data[nengo_output], test_labels)

which indicates that the way I am passing test data to predict() isn't correct. For a quick look, the following is how I am constructing the test_data generator:

def get_test_batches(batch_size, n_steps=20):
  for i in range(0, test_images.shape[0], batch_size):
    ip = test_images[i:i+batch_size]
    label = test_labels[i:i+batch_size]
    yield(
    {
      "input_1": ip,
      "n_steps": np.ones((batch_size, 1), dtype=np.int32)* n_steps,
      "conv2d.0.bias": np.ones((batch_size, 32, 1), dtype=np.int32),
      "conv2d_1.0.bias": np.ones((batch_size, 64, 1), dtype=np.int32),
      "dense.0.bias": np.ones((batch_size, 10, 1), dtype=np.int32)
    })

Am I missing something? I am constructing it similarly to the get_batches() function mentioned here. This brings me to the next point… can someone please explain the exact use case of inference_only=True? I wasn't able to follow the API documentation very clearly.

Looking at the documentation for predict, it takes an n_steps parameter, but you are using steps, so I'm guessing steps is being passed along through **kwargs and finding no place where it is used, leading to that error.


When you change the options in the Converter you can get a significantly different Nengo network as a result. So when you create your first network with nengo_dl.Converter(tf_keras_model), that network will have a different set of parameters than when you create your second network with nengo_dl.Converter(tf_keras_model, swap_activations={tf.nn.relu: activation}, scale_firing_rates=scale_firing_rates, inference_only=True, max_to_avg_pool=True). So you won't be able to load the parameters from the first network into the second. In particular, the inference_only=True and max_to_avg_pool=True conversions are likely to change the model parameters. Looking at your model, though, I don't think those are necessary, so I would try just taking them out and see if you can load your parameters then.
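Roughly, something like this (a sketch only; the spiking activation here is just a placeholder for whatever your script swaps in):

run_converter = nengo_dl.Converter(
    tf_keras_model,
    # these two options don't change the parameter structure
    swap_activations={tf.nn.relu: nengo.SpikingRectifiedLinear()},
    scale_firing_rates=scale_firing_rates,
    # inference_only / max_to_avg_pool omitted, per the suggestion above
)

with nengo_dl.Simulator(run_converter.net, minibatch_size=batch_size) as sim:
    # parameters saved from the plain nengo_dl.Converter(tf_keras_model) network
    sim.load_params("./keras_to_snn_params")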

It should use the GPU for both training and inference. I have never run into a situation where different devices were used during training vs inference. Do you see this same behaviour when running standard Keras models (not through NengoDL) on your system?

The example you link to also uses batched training. You should be able to use nengo objects or string names in either case. If that is not working let us know.

Yes, this will happen automatically, you don’t need to do anything.

It is not always possible to convert a Keras model to a Nengo model in a way that will produce identical training behaviour. inference_only=True is an option to indicate that you are OK with that (i.e., you understand that the converted model will not be trainable in the same way as the original Keras model). This is often fine, e.g. if you plan on doing all your training in Keras and will only be doing inference in Nengo.
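That workflow looks roughly like this (a sketch only; keras_train_images/keras_train_labels and the training settings are placeholders, not your script's names):

# train the original model in Keras (data in the model's native
# (N, 28, 28, 1) / (N, 10) shapes)
tf_keras_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
)
tf_keras_model.fit(keras_train_images, keras_train_labels, epochs=10)

# the converter copies the trained Keras weights into the Nengo network,
# so no NengoDL training (and no load_params) is needed afterwards
converter = nengo_dl.Converter(
    tf_keras_model,
    swap_activations={tf.nn.relu: nengo.SpikingRectifiedLinear()},
    inference_only=True,
)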


Hello @Brent and @drasmuss, thank you for your inputs. The steps parameter is supposed to be passed to the TF predict function, to run the test data generator steps number of times.

I can see the GPU being used for both training and inference. Earlier I could only see RAM usage increasing, which led me to think that only the CPU was being used. I was able to do the batch testing, but with a different syntax; I am mentioning it below in case others find it useful.

def get_test_batches(batch_size):
  for i in range(0, test_images.shape[0], batch_size):
    ip = test_images[i:i+batch_size]
    yield ip

with nengo_dl.Simulator(converter.net, minibatch_size=batch_size) as nengo_sim:
  nengo_sim.load_params("./keras_to_snn_params")
  
  test_data = get_test_batches(batch_size)
  pred_labels = []
  for data in test_data:
    # repeat each example over n_steps timesteps for the spiking simulation
    tiled_data = np.tile(data, (1, n_steps, 1))
    pred_data = nengo_sim.predict_on_batch({nengo_input: tiled_data})
    for row in pred_data[nengo_output]:
      # classify based on the output at the last timestep
      pred_labels.append(np.argmax(row[-1]))

Note that the batch_size passed to get_test_batches and the minibatch_size passed to the Simulator should be the same, else it throws an error.
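One simple way to guarantee that (a sketch only; it just drops the final partial batch) is:

def get_test_batches(batch_size):
  # only yield full batches, so every batch matches the simulator's minibatch_size
  n_full = (test_images.shape[0] // batch_size) * batch_size
  for i in range(0, n_full, batch_size):
    yield test_images[i:i + batch_size]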

I am now stuck at regularizing firing rates during batch training. I have implemented the following code along similar lines to the tutorial, with a different loss function and using layer names (strings) in place of nengo node objects.

tf_keras_model, inpt, otpt, conv0, conv1 = create_2d_cnn_model((28, 28, 1))
converter = nengo_dl.Converter(tf_keras_model)

print(converter.net.all_nodes)
with converter.net:
    output_p = converter.outputs[otpt]
    conv0_p = nengo.Probe(converter.layers[conv0])
    conv1_p = nengo.Probe(converter.layers[conv1])

batch_size, target_rate = 500, 250
def get_batches(batch_size):
  for i in range(0, train_images.shape[0], batch_size):
    ip = train_images[i:i+batch_size]
    label = train_labels[i:i+batch_size]
    yield ({
        "input_1": ip,
        "n_steps": np.ones((batch_size, 1), dtype=np.int32),
        "conv2d.0.bias": np.ones((batch_size, 32, 1), dtype=np.int32),
        "conv2d_1.0.bias": np.ones((batch_size, 64, 1), dtype=np.int32),
        "dense.0.bias": np.ones((batch_size, 10, 1), dtype=np.int32)
    },
    {
        "dense.0": label,
        "conv2d.0": np.ones((train_labels.shape[0], 1, conv0_p.size_in)) * target_rate,
        "conv2d_1.0": np.ones((train_labels.shape[0], 1, conv1_p.size_in)) * target_rate,
    })

with converter.net:
    output_p = converter.outputs[otpt]
    conv0_p = nengo.Probe(converter.layers[conv0])
    conv1_p = nengo.Probe(converter.layers[conv1])

with nengo_dl.Simulator(converter.net, minibatch_size=batch_size) as sim:
  sim.compile(
        optimizer=tf.keras.optimizers.Adam(lr=1e-3),
        loss={
            output_p: tf.keras.losses.CategoricalCrossentropy(from_logits=True),
            conv0_p: tf.losses.mse,
            conv1_p: tf.losses.mse,
        },
    
        loss_weights={output_p: 1, conv0_p: 1e-3, conv1_p: 1e-3}
    )
  
  for epoch in range(10):
    data_generator = get_batches(batch_size)
    
    sim.fit(
        data_generator, epochs=1, steps_per_epoch=120)
    
    
  sim.save_params("./keras_to_snn_params_regularized")

It first throws some warnings and then gets stuck at some step, which leads to a memory explosion on my system: I can see all 16 GB of RAM + 30 GB of swap being used, and then it ultimately crashes. During the entire event there is no computing activity on the GPU. Even after replacing the string layer names with nengo node objects, the story remains the same.

I am not sure what's wrong; please let me know. I will be happy to share the entire ready-to-execute script if someone needs it. Thanks!

It might be the probes that are eating up all of your memory. I believe by default they will keep track of all of their history, but you only care about the most recent weights. Try this line to disable the history:
nengo_dl.configure_settings(keep_history=False)

Another config I use in my models is:
nengo_dl.configure_settings(stateful=False)
Not sure if it is relevant for your model, but I believe it can speed things up a little if you don’t need state saved between runs.
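Note that both of these need to be called inside the network context, before the Simulator is built; roughly (using the converter.net from your script):

with converter.net:
    nengo_dl.configure_settings(
        keep_history=False,  # probes keep only the most recent timestep
        stateful=False,      # don't carry simulator state between runs
    )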

This page describes the different config options and what they do.

Hey @Brent, I tried your suggestions, but the script still eats up all the memory. It is possibly able to map the network to the GPU, as I can see a slight increase in GPU RAM usage, but it then uses up all the system RAM and swap, after which it crashes. I am linking an independent script here in case someone wishes to execute it.

I guess there is something fundamentally wrong with my script. It has the same code as mentioned in this post.

I believe in this section this should be batch_size, not train_labels.shape[0]. Your generator should just be yielding targets for a single minibatch, but you’re trying to create targets for the whole dataset, which is likely why it is running out of memory. You could also further reduce the memory usage by setting the dtype to something other than the numpy default of float64 (e.g. float32).
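That is, the two firing-rate targets should look something like this (a sketch only; conv0_target/conv1_target are just illustrative names, and the rest of the generator stays the same):

# one target row per example in the minibatch, using float32 instead of the
# numpy default of float64 to further reduce memory usage
conv0_target = np.ones((batch_size, 1, conv0_p.size_in), dtype=np.float32) * target_rate
conv1_target = np.ones((batch_size, 1, conv1_p.size_in), dtype=np.float32) * target_rate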

You are awesome @drasmuss! Silly of me to have overlooked the arguments passed as the dictionary values. BTW, the keys "dense.0", "conv2d.0", "conv2d_1.0" in the dict you quoted should actually be "probe", "probe_1", "probe_2" respectively. So far I have simply been following examples and intuition to build such dictionaries… can you please point me to a source that explains what the string names in these dictionaries should be?

We recommend using objects rather than string names, as they are less confusing. E.g. instead of {"probe": x, "probe_1": y, "probe_2": z} do {output_p: x, conv0_p: y, conv1_p: z}. Then you don’t have to try to keep track of the names of the different objects, because you have the actual objects.

The one problem is that it isn't easy to get those .bias objects, since they aren't directly accessible through the converter.layers data structure. However, if you have the main object (e.g. conv0), then you should be able to get the name of the associated bias object via converter.layers[conv0].label + ".bias".

As to how the names are generated, they are based on the names of the Keras layers. If you set the name attribute on a layer it will use that; otherwise, Keras automatically generates a name (like "conv2d" or "dense"). If there are multiple objects with the same name, Keras appends a number to make them unique (e.g. "_1"). Then the Keras node_id is appended (e.g., the ".0" you see). Finally, in the case that a single Keras layer is converted into multiple Nengo objects, a distinguishing label is added for the different objects (e.g. converting a Keras conv2d layer will result in two objects: conv2d.0, which is the main convolutional output, and conv2d.0.bias, which is the bias node associated with that layer).
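For example, a small sketch of how those names follow from the Keras layer names (the layer names here are arbitrary, just for illustration):

inp = tf.keras.Input(shape=(28, 28, 1), name="my_input")
conv = tf.keras.layers.Conv2D(32, 3, activation=tf.nn.relu, name="my_conv")(inp)
out = tf.keras.layers.Dense(10, name="my_dense")(tf.keras.layers.Flatten()(conv))
model = tf.keras.Model(inp, out)

converter = nengo_dl.Converter(model)
# expect labels like "my_conv.0" (the neural output) and "my_conv.0.bias"
# (the associated bias node) among the converted objects
print([obj.label for obj in converter.net.all_nodes])
print([obj.label for obj in converter.net.all_ensembles])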


Thank you @drasmuss for a detailed explanation!

Just checking… has support for NengoDL training on multiple GPUs been included in the latest version of NengoDL? (For inference, I can at least exploit data parallelism with ray.) Please let me know!

I checked with the NengoDL devs and multi-GPU support is unfortunately still unavailable for NengoDL. Development on NengoDL has been paused (or slowed to a crawl) since our devs are working on other internal projects, so bringing multi-GPU support to NengoDL might take some time.

Got it @xchoo! Thanks for confirming.