Validating NengoDL models during learning changes results

Hello everyone,
This is my first post in this community, and I hope that I am not horribly ignorant of the forum guidelines by posting my question here.

I have been using nengo and nengo_dl in my latest research project, and based on my limited experience so far, I generally think both are great tools. I come from an AI background and have worked on several deep learning projects (primarily PyTorch) in the past. I am hoping to use nengo to develop a spiking network for model predictive control (MPC). In my typical workflow, when I train a model, I also evaluate its performance continuously during the training process on a separate chunk of the dataset. My standard training loop looks something like this (sketched in code after the list):

  1. Initialize the model and training / validation data
  2. Load latest weights, train for an epoch and save the new weights
  3. Load latest weights and evaluate the model
  4. Repeat steps 2 and 3 until some condition is reached (max epochs or loss below X)
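
In NengoDL terms, the kind of loop I have in mind looks roughly like the sketch below. This is only a sketch: net, inp, probe, max_epochs, the data arrays, and the "./params" path are placeholders for my actual setup, and I reuse one optimizer object throughout (more on that below).

```python
import nengo_dl
import tensorflow as tf

opt = tf.optimizers.Adam(learning_rate=1e-3)  # one optimizer object, reused throughout

for epoch in range(max_epochs):
    # step 2: load the latest weights, train for one epoch, save the new weights
    with nengo_dl.Simulator(net, minibatch_size=32) as sim:
        if epoch > 0:
            sim.load_params("./params")
        sim.compile(optimizer=opt, loss=tf.losses.MeanSquaredError())
        sim.fit(x={inp: train_x}, y={probe: train_y}, epochs=1)
        sim.save_params("./params")

    # step 3: load the latest weights and evaluate on the held-out validation data
    with nengo_dl.Simulator(net, minibatch_size=32) as sim:
        sim.load_params("./params")
        sim.compile(loss=tf.losses.MeanSquaredError())
        val_loss = sim.evaluate(x={inp: val_x}, y={probe: val_y})["loss"]
```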

When trying to apply this method to my nengo model, however, I noticed that when I train for some epochs, say 5, and then want to continue training with the same parameters using another simulator object, I start at a higher loss than where I left off. Below, I added an image of these results, where I first train for 50 epochs and then train the same model again for another 50 epochs. These results were obtained with the Adam optimizer; the effect also happens with SGD, but to a lesser extent. It also occurs when I do not do an evaluation (step 3) in between. I think this might have something to do with the state of the optimizer? But I do use the exact same optimizer object for both training steps.

[Image: "weirdness", training loss over two runs of 50 epochs, showing the jump when the second simulator object takes over]

In the end, I would like to be able to look at validation results during training without this influencing the results I get. In other words, I expect the same outcome whether I train once for 100 epochs in a single simulator object, or 100 times for 1 epoch with individual simulator objects. I use separate objects because my current understanding is that they “close” and are not “reusable” in nengo.

I have created a working example that reproduces these results in a Jupyter notebook, and it is available here:

https://drive.google.com/file/d/1vXJj0JD_f6odaUiCQdB_4KP49KJANhAS/view?usp=sharing

I would be very happy about any sort of insight on 1. why this behavior occurs, and 2. how I can avoid it. Any assistance in this regard is highly appreciated. Perhaps other people have run into this or a similar issue before and there is a “standard” learning loop that works in nengo_dl that I am unaware of? The examples given in the official documentation do not seem to cover this; they only show how to save the weights and use them later, which I believe I implemented correctly.

Thank you very much for your help and stay safe,

Justus

Hi @jhuebotter, and welcome to the Nengo forums! :smiley:

I took a quick look at your notebook, and while I haven’t done an in-depth analysis, I can guess at what is happening here. In your notebook, you are using nengo.Ensemble objects for the layers. One of the features of Nengo (and by extension, NengoDL) is that the parameters (gains and biases) of a neural ensemble are randomly generated every time a simulator (nengo.Simulator or nengo_dl.Simulator) object is created.

What this means is that when you create a different simulator object at epoch 50, you are using the same connection weights as in the first simulator object, but the neuron parameters are different. I believe this is the cause of the sudden increase in loss at that point (I’m actually amazed the loss doesn’t jump higher).
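
To make this concrete, here is a small sketch (not taken from your notebook) that builds one unseeded ensemble with two separate simulator objects and compares the generated gains:

```python
import nengo

with nengo.Network() as net:
    ens = nengo.Ensemble(n_neurons=10, dimensions=1)  # no seed set

# each simulator build draws new gain and bias values for the ensemble
with nengo.Simulator(net) as sim1:
    gain1 = sim1.data[ens].gain
with nengo.Simulator(net) as sim2:
    gain2 = sim2.data[ens].gain

print(gain1 - gain2)  # generally nonzero: the two builds differ
```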

To prevent this problem, you can either:

  • Use a seed when creating the Nengo ensemble.
  • Or use predefined gain and bias values when creating the Nengo ensemble (both options are sketched below).
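
Both options look roughly like this (a sketch with arbitrary sizes and values):

```python
import numpy as np
import nengo

with nengo.Network() as net:
    # option 1: fix the random generation with a seed
    # (a seed on the enclosing nengo.Network works as well)
    ens_a = nengo.Ensemble(n_neurons=10, dimensions=1, seed=0)

    # option 2: give explicit gain and bias values, one per neuron
    ens_b = nengo.Ensemble(n_neurons=10, dimensions=1,
                           gain=np.ones(10), bias=np.zeros(10))
```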

Dear @xchoo,
Thank you very much for taking the time to look at my issue. I would like to note that I do use a seed when creating the Nengo ensemble. However, I wanted to double-check whether the parameters of the model indeed remain unchanged when using a new simulator object, and this seems to be the case. For this purpose, I created a slightly adapted version of my code above and hosted it here: https://drive.google.com/file/d/144ifDVs-zwcWXRX025gUMmphxS2K5Za0/view?usp=sharing
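
Roughly, the check I mean looks like the sketch below (not the exact notebook code; net, ens, and conn stand in for my seeded network, one of its ensembles, and a trained connection, and I am assuming sim.data also reflects the parameters loaded with load_params):

```python
import numpy as np
import nengo_dl

with nengo_dl.Simulator(net) as sim1:
    sim1.load_params("./params")
    gain1, bias1, w1 = sim1.data[ens].gain, sim1.data[ens].bias, sim1.data[conn].weights

with nengo_dl.Simulator(net) as sim2:
    sim2.load_params("./params")
    gain2, bias2, w2 = sim2.data[ens].gain, sim2.data[ens].bias, sim2.data[conn].weights

# with a seeded model and the same saved parameters, all of these should be True
print(np.allclose(gain1, gain2), np.allclose(bias1, bias2), np.allclose(w1, w2))
```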

It seems that there is an indication that SGD and Adam are updating different sets of parameters, even though everything else about the two training runs is identical. In my understanding, this should not be the case. Additionally, when evaluating the same model multiple times with separate simulator objects (before or after learning), the loss is always the same, so there seems to be no randomness introduced by the simulator initialization itself.
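
One way I can think of to check which parameters actually change is to diff the trainable weights before and after a single call to fit, roughly as below (a sketch; reaching the variables through sim.keras_model is my assumption of how to inspect them, and net, inp, probe, and the data arrays are placeholders as before):

```python
import numpy as np
import tensorflow as tf
import nengo_dl

def changed_parameter_indices(optimizer):
    """Return the indices of the trainable weight arrays that one
    training epoch with the given optimizer actually modifies."""
    with nengo_dl.Simulator(net, minibatch_size=32) as sim:
        sim.compile(optimizer=optimizer, loss=tf.losses.MeanSquaredError())
        before = [w.numpy().copy() for w in sim.keras_model.trainable_weights]
        sim.fit(x={inp: train_x}, y={probe: train_y}, epochs=1)
        after = [w.numpy() for w in sim.keras_model.trainable_weights]
    return [i for i, (b, a) in enumerate(zip(before, after)) if not np.allclose(b, a)]

print(changed_parameter_indices(tf.optimizers.SGD(learning_rate=1e-3)))
print(changed_parameter_indices(tf.optimizers.Adam(learning_rate=1e-3)))
```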

Both of these issues seem very strange to me: 1. the jump in the loss when continuing learning with a new simulator object, and 2. the fact that Adam and SGD optimize different sets of parameters. They currently prevent me from using NengoDL in my research projects, and I would still appreciate any insight into why this is happening and how to prevent it.

Best wishes,
Justus

I can see how the observations in my last post might not fit the initial problem description well. Please let me know if you think I should post this as a separate thread.
In any case, I would still very much appreciate any input on both the initial problem (the increase in loss with a new simulator) and the second observation (different parameters being updated depending on the choice of optimizer).

Best,
Justus

Hi @jhuebotter,

Just an update on this. I have been investigating your notebook and have been able to replicate the behaviour you are reporting. It seems that the issue only occurs with the Adam optimizer, but I am not familiar enough with TensorFlow to determine why this is the case. I’ve messaged the NengoDL dev about it, but they are currently out of the office. I’ll let you know when they respond.