PES in reinforcement learning

  1. Suppose we have a connection in a network where we apply the PES learning rule using an error signal.
    Now suppose this error signal arrives once every 100 ms (e.g. from an external simulation), defining a 100-timestep epoch, while the network runs at a dt of 1 ms. We want PES to modify the decoders once per epoch, not during all 100 timesteps.

How is this possible? Does it have to do with the pre_tau parameter?
Would it be plausible? Or would it amount to the same thing to apply the same error signal for all 100 timesteps, just with a lower learning rate?

  1. It is stated that PES tries to minimize the error signal provided. Does that refer to absolute values (so a 0 error will not change the decoders, a positive value will change them in one direction, and a negative value in the opposite direction)? Or does a more negative error simply produce less change (so a 0 error would still change the decoders, but less than a positive value and more than a negative one)?
    In other words, can the error in PES be considered both a reward and a punishment in reinforcement-learning terms?

Thank you



I can’t answer all of your questions, but a few things that might help:

  1. To prevent PES from modifying the decoders, you can ensure that the error signal is 0. How exactly to do that might depend on your network structure, but it is common to have a neural ensemble providing the error signal. So you could inhibit that ensemble (e.g., nengo.Connection(no_learning, error.neurons, transform=-2.*np.ones((error.n_neurons, 1)))). If you want the learning to happen for exactly one timestep, you should also set synapse=None on this connection to prevent the inhibition signal from being filtered (however, this might not be biologically plausible, if you care about that).
  2. There is a pull request for Nengo (#1254) that adds the possibility to apply a learning rule only every n timesteps. That might cover your use case, but it requires that the learning rule be applied at perfectly regular, pre-defined intervals.
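The gating idea in point 1 can be sketched outside of Nengo. Below is a simplified stand-in for the PES decoder update (an assumption for illustration, not Nengo's actual implementation): because the update is proportional to the error, zeroing the error on all but one timestep per epoch means the decoders change exactly once per epoch.

```python
import numpy as np

# Simplified PES-style decoder update (sketch, not Nengo's exact rule):
# delta_d = -learning_rate * error * activities
def pes_update(decoders, activities, error, learning_rate=1e-3):
    return decoders - learning_rate * error * activities

rng = np.random.default_rng(0)
activities = rng.random(50)   # filtered presynaptic activities (constant here)
decoders = np.zeros(50)
error = 0.5                   # error value from the external simulation

n_updates = 0
for step in range(300):       # 300 ms at dt = 1 ms
    # Inhibiting the error ensemble drives its output to ~0 on most steps,
    # so the error only "gets through" once per 100-step epoch.
    gated_error = error if step % 100 == 0 else 0.0
    new_decoders = pes_update(decoders, activities, gated_error)
    if not np.array_equal(new_decoders, decoders):
        n_updates += 1
    decoders = new_decoders
# n_updates is 3: the decoders were modified once per epoch, not 300 times
```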

The pre_tau parameter gives the time constant of an exponential lowpass filter applied to the activities of the presynaptic ensemble (if I recall correctly).
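Written out, that filter is a first-order exponential lowpass. Here is a sketch of the standard discrete form (with pre_tau playing the role of tau; this is not Nengo's internal code):

```python
import numpy as np

def lowpass(signal, tau, dt=0.001):
    """Discrete first-order exponential lowpass: y += (dt / tau) * (x - y)."""
    y = 0.0
    out = []
    for x in signal:
        y += (dt / tau) * (x - y)
        out.append(y)
    return np.array(out)

# Step input: the filtered activity approaches the true value with time
# constant tau, so fast fluctuations in the presynaptic activity are smoothed.
filtered = lowpass(np.ones(1000), tau=0.005)   # 1 s of activity at dt = 1 ms
```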

I assume that depends on how quickly your input and error signals are changing. If they change on a slower time course, it should be approximately the same. If they change more quickly, then the information during those timesteps can contribute in a different way.
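For the "lower learning rate" question specifically: with a simplified update of the form delta_d = -lr * error * activities (an assumption for illustration, not Nengo's exact rule), one update at the full rate equals 100 updates at lr / 100 only when the activities and error are constant over the epoch; once they vary within the epoch, the two schemes diverge.

```python
import numpy as np

# Simplified PES-style update (sketch, not Nengo's exact rule)
def pes_update(decoders, activities, error, lr):
    return decoders - lr * error * activities

rng = np.random.default_rng(1)
activities = rng.random(50)
error, lr = 0.3, 1e-3

# One update at the full learning rate
once = pes_update(np.zeros(50), activities, error, lr)

# 100 updates at lr / 100 with the *same* (constant) signals
many = np.zeros(50)
for _ in range(100):
    many = pes_update(many, activities, error, lr / 100)
# For constant signals the two results agree; if activities or error vary
# within the 100-step epoch, they would not.
```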

This one is correct.
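Concretely, with the same simplified sketch of the update (delta_d = -lr * error * activities; not Nengo's exact implementation): a zero error leaves the decoders untouched, while positive and negative errors push them in opposite directions.

```python
import numpy as np

def pes_update(decoders, activities, error, lr=1e-3):
    # Sketch of a PES-style update: the change is proportional to -error
    return decoders - lr * error * activities

activities = np.array([0.2, 0.8, 0.5])
d0 = np.zeros(3)

d_zero = pes_update(d0, activities, 0.0)    # no change
d_pos = pes_update(d0, activities, 0.5)     # pushed one way
d_neg = pes_update(d0, activities, -0.5)    # pushed the opposite way
```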


Thank you very much Jan!


How is it possible to use learning in order to maximize or minimize a parameter? The value of this parameter can be fed back to the learning connection, as an error signal would be, although I don’t know if we can call it that, as it is better described as a reward signal.
The best I could come up with is the following:
Define a best reward value for the parameter, e.g. r_best = 1, which is the best that could ever be achieved in a timestep.
Use that as the target value and define the error as actual - target, where actual is the current value r_current.
That results in an error signal that is always negative, being more negative when r_current is negative and less negative when it is positive (and 0 when r_current = r_best = 1). But that doesn’t seem to work (maybe because it drives the error only one way?)
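A concrete version of that scheme (the names r_best and r_current just mirror the post, not any Nengo API):

```python
def reward_error(r_current, r_best=1.0):
    """Error defined as actual - target: non-positive whenever r_current <= r_best."""
    return r_current - r_best

# A more negative reward gives a more negative error; the error is 0 only
# at the optimum, so the signal only ever drives learning in one direction.
errs = [reward_error(r) for r in (-0.5, 0.0, 0.5, 1.0)]
```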

Any ideas?




I’m not sure how, or whether, you can do that at all with PES. You might need a different learning rule.

There are some publications on reinforcement learning with the NEF that might be helpful (what you are describing sounds somewhat like reinforcement learning to me, but that’s not my area of expertise).