It is known that “a faster learning rate for PES results in over-fitting to the most recent online example, while a slower learning rate does not learn quickly enough.”
However, I have seen in several places (such as the adaptive control example) that it is sometimes preferred to decrease the learning rate while amplifying the post output.
For example, instead of doing
```python
pre = nengo.Ensemble(100, dimensions=1)
post = nengo.Node(size_in=1)
nengo.Connection(pre, post, learning_rule_type=nengo.PES(1e-4))
```
the learning rate is reduced by a factor of 10 and the output signal is amplified by a factor of 10:
```python
pre = nengo.Ensemble(100, dimensions=1)
# Note: a Node callable with size_in > 0 receives (t, x), not just x
post = nengo.Node(lambda t, x: 10 * x, size_in=1)
nengo.Connection(pre, post, learning_rule_type=nengo.PES(1e-5))
```
When comparing these two, the post signal looks very similar. Intuitively, they are exactly the same model: multiplying the output by x is equivalent to multiplying all the decoder weights by x, which in turn is equivalent to multiplying the learning rate (the weight updates) by x.
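To check that intuition, here is a minimal NumPy sketch (deliberately not using Nengo itself) of a PES-style decoder update driven by an error signal. The update rule `d -= kappa * e * a / n`, the fixed activity vector, and all names here are my own simplification of the learning dynamics, not Nengo's internals; under these assumptions, dividing the learning rate by 10 while multiplying the output gain by 10 gives an identical post trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
a = rng.uniform(0, 1, size=n)   # fixed neuron activities for one input
target = 0.7                    # desired post output

def run(kappa, gain, steps=200):
    """Simulate a PES-style decoder update with a gain applied after decoding."""
    d = np.zeros(n)             # decoders
    ys = []
    for _ in range(steps):
        y = gain * d.dot(a)     # post output (amplifier applied after decoding)
        e = y - target          # error signal fed to the learning rule
        d -= kappa * e * a / n  # PES-style update, scaled by the learning rate
        ys.append(y)
    return np.array(ys)

fast = run(kappa=0.5, gain=1.0)    # analogous to PES(1e-4) with no amplifier
slow = run(kappa=0.05, gain=10.0)  # analogous to PES(1e-5) with a 10x output

print(np.allclose(fast, slow))  # → True: the post signals match
```

Since the error is computed from the (amplified) output in both cases, the effective step on the output is `gain * kappa` in each, so the trajectories coincide up to floating point.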
What are the advantages of the second example? Why was the second option preferred over the first in the adaptive control example?