Optimizers update different parameters

jhuebotter · May 11, 2022, 4:16pm

Hi,

I was debugging some code to check whether a certain parameter actually gets updated during learning, and to my surprise I found that not all standard TensorFlow optimizers update all parameters, even when this is explicitly requested. If I am not mistaken, the following parameters of a model should be learnable when the settings are configured accordingly (nengo_dl.configure_settings(trainable=True)):
encoders, intercepts, max firing rates, bias and gain of a nengo.Ensemble, as well as weights (transform) of a nengo.Connection.

I found that all parameters are updated as desired by the following optimizers: Adam, RMSprop, Adamax, Ftrl, and Nadam.

In contrast, the following optimizers ONLY update the transform of a connection, but not any ensemble parameters: SGD, Adagrad, and Adadelta.

I was unable to find anything about this in the documentation, on this forum, or on GitHub, and was wondering (1) whether this is known and (2) why it occurs.

I have prepared example code here: https://drive.google.com/file/d/144ifDVs-zwcWXRX025gUMmphxS2K5Za0/view?usp=sharing
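
The linked script is not reproduced here. As a rough illustration of this kind of check, the sketch below snapshots each parameter before and after training and reports the largest absolute change. It is plain Python rather than NengoDL code, and the helper name, parameter names, and values are all made up for illustration.

```python
def max_abs_change(before, after):
    """Return the largest absolute element-wise difference per parameter."""
    return {
        name: max(abs(a - b) for a, b in zip(after[name], before[name]))
        for name in before
    }


# Hypothetical snapshots taken before and after training (values are made up).
before = {"gain": [1.0, 2.0], "bias": [0.5, -0.5], "transform": [0.1, 0.2]}
after = {"gain": [1.0000001, 2.0], "bias": [0.5, -0.5], "transform": [0.15, 0.18]}

changes = max_abs_change(before, after)
for name, delta in changes.items():
    # Scientific notation keeps very small updates visible.
    print(f"{name}: max |change| = {delta:.3e}")
```

Reporting the magnitude, rather than testing for equality within a tolerance, separates a parameter that was updated only very slightly from one that did not change at all.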

Any insight would be much appreciated.

Cheers,
Justus

xchoo · May 11, 2022, 5:21pm

Hi @jhuebotter,

From a code perspective, most of the NengoDL simulator calls (e.g., evaluate, fit) are just wrappers around their respective TF functions, so changing the optimizer should still result in all trainable weights being modified. I spoke with the NengoDL devs and confirmed that this is the case.

As to why you are observing this with your code: after discussing it with the devs and playing around with your code a bit more, we believe we have found the reason, which is learning rates. Different optimizers seem to have different learning rate “baselines” (as it were), so the same learning rate value results in different amounts of change in the weights depending on the optimizer. For the SGD, Adagrad and Adadelta optimizers, the changes to the gains and biases were so small that these parameters were essentially unchanged. When I modified the learning rate for each of these optimizers to:

SGD: lr * 10
Adagrad: lr * 100
Adadelta: lr * 2000

I saw changes to both the ensemble parameters and the connection weights. Note that the learning rate scaling is very rough; I basically increased it by an order of magnitude each time to see if it had an effect.
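
To illustrate why the same learning rate can behave so differently, here is a minimal sketch in plain Python (not NengoDL or TensorFlow code) comparing the size of a single textbook SGD step with the first step of textbook Adam. The function names, the gradient values, and the epsilon default are assumptions for illustration; real optimizer implementations differ in their details.

```python
import math


def sgd_step(grad, lr):
    """Plain SGD: the step size is proportional to the gradient."""
    return lr * grad


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-7):
    """First Adam step from zero-initialized moments, with bias correction."""
    m = (1 - beta1) * grad          # first-moment estimate
    v = (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1)         # bias-corrected first moment
    v_hat = v / (1 - beta2)         # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)


lr = 1e-3
for grad in (1.0, 1e-4):  # a "large" and a "small" gradient (made-up values)
    print(
        f"grad={grad:.0e}  "
        f"SGD step={sgd_step(grad, lr):.3e}  "
        f"Adam first step={adam_first_step(grad, lr):.3e}"
    )
```

In this sketch the SGD step shrinks in proportion to the gradient, while the first Adam step stays close to lr regardless of gradient scale. If the gradients reaching the gains and biases are small (an assumption, not something measured in this thread), that would be consistent with those parameters appearing unchanged under SGD until the learning rate is raised.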

jhuebotter · May 11, 2022, 5:52pm

Hi @xchoo,

I have just tried to rerun my code with the changes you suggested and can confirm that this indeed changes the behaviour. Thank you for your kind assistance!

Cheers,
Justus