Nengo Summer School RL example

I watched the Nengo Summer School on YouTube and played with the example attached to lecture 7 (the code is in the video description).

I’ve two questions about that example, especially about the error signal:

  1. After running it for a couple of minutes, the agent learned to prefer to stay in place. It chooses to turn right only, and in some cases it just goes back and forth. I would expect it to prefer to move as much as possible, since the reward is higher for higher speed. But it prefers to keep the reward near 0 rather than moving freely and risking getting hit… Any idea why that is?
    I’ve noticed that the error function is almost always 0, no matter what the reward is. It has some peaks when the agent hits a wall, but otherwise the error seems to be 0 whether the reward is 0 or 0.5…

  2. I don’t understand these lines:

    nengo.Connection(bg.output[0], errors.ensembles[0].neurons, transform=np.ones((50,1))*4)    
    nengo.Connection(bg.output[1], errors.ensembles[1].neurons, transform=np.ones((50,1))*4)    
    nengo.Connection(bg.output[2], errors.ensembles[2].neurons, transform=np.ones((50,1))*4)   
    

    Aren’t they equivalent to nengo.Connection(bg.output, errors.input, transform=4)?

The agent’s learning rule is very simplistic, so its behaviour can be a bit strange. What it is learning to do is avoid hitting the walls, so it’s natural for it to spin in a circle, or just go back and forth between two relatively open spots. I discuss some of that behaviour in this forum post as well.

This is a side effect of how the agent is coded. If you probe the output of the BG, you’ll see that it is quite noisy. Since the output of the BG inhibits the errors network, the noisiness of the BG output tends to keep the error signals inhibited most of the time (unless the agent really gets stuck). Thus, the error seems to hover around 0.
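
If you want to look at that signal outside of the GUI, here is a minimal probing sketch. It assumes the example script’s network object is called model and that bg is the BasalGanglia network (as in the code from the video description); the probe itself is not part of the original example:

    import nengo

    with model:
        # Record the (filtered) BG output so it can be inspected after the run
        p_bg = nengo.Probe(bg.output, synapse=0.01)

    with nengo.Simulator(model) as sim:
        sim.run(5.0)

    # sim.data[p_bg] now holds the noisy BG output over time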

You can try to modify the agent to use the “cleaner” signal from the thalamus output by commenting out this code:

    nengo.Connection(bg.output[0], errors.ensembles[0].neurons, transform=np.ones((50, 1)) * 4)
    nengo.Connection(bg.output[1], errors.ensembles[1].neurons, transform=np.ones((50, 1)) * 4)
    nengo.Connection(bg.output[2], errors.ensembles[2].neurons, transform=np.ones((50, 1)) * 4)

and replacing it with this code:

    # Action 0 inhibits errors for actions 1 and 2
    nengo.Connection(thal.output[0], errors.ensembles[1].neurons, transform=np.ones((50, 1)) * -4)
    nengo.Connection(thal.output[0], errors.ensembles[2].neurons, transform=np.ones((50, 1)) * -4)
    # Action 1 inhibits errors for actions 0 and 2
    nengo.Connection(thal.output[1], errors.ensembles[0].neurons, transform=np.ones((50, 1)) * -4)
    nengo.Connection(thal.output[1], errors.ensembles[2].neurons, transform=np.ones((50, 1)) * -4)
    # Action 2 inhibits errors for actions 0 and 1
    nengo.Connection(thal.output[2], errors.ensembles[0].neurons, transform=np.ones((50, 1)) * -4)
    nengo.Connection(thal.output[2], errors.ensembles[1].neurons, transform=np.ones((50, 1)) * -4)

An explanation of the code change:

  • The output of the Thalamus network is cleaner than the output of the BG network: in the Thalamus output, only the “chosen” action is active, whereas the output of the BG network can be quite noisy.
  • The output of the BG network is a negative value, so to inhibit the errors ensembles, the transform is positive (i.e., transform=np.ones((50, 1)) * 4). Conversely, the output of the Thalamus network is a positive value, so to inhibit the errors ensembles, the transform is negative (i.e., transform=np.ones((50, 1)) * -4).
  • The desired inhibition is (as stated by the comment in the code):

inhibit learning for actions not currently chosen (recall BG is high for non-chosen actions)

  • For the output of the BG, in the ideal scenario, the output of the chosen action is 0, and the non-chosen actions are non-zero negative numbers (e.g., -0.8). Thus, to inhibit the non-chosen actions, all you need to do is feed the BG output to the respective (same-action) error ensemble. The output of the Thalamus is, however, inverted: the output of the chosen action is 1, and the non-chosen actions are 0. So, to inhibit the non-chosen actions, you need to connect the output of the chosen action to the error ensembles of the non-chosen actions (i.e., connect action 0 to errors 1 and 2). There is a small numeric sketch of this sign logic right after this list.
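
Here is that numeric sketch. The -0.8 and 1.0 values are the illustrative ideal outputs mentioned above, not actual probed values from the model:

    import numpy as np

    n_neurons = 50
    bg_nonchosen = -0.8    # ideal BG output for a non-chosen action
    thal_chosen = 1.0      # ideal Thalamus output for the chosen action

    # Current added to each error neuron (a negative current inhibits it)
    current_from_bg = np.ones((n_neurons, 1)) * 4 * bg_nonchosen      # -3.2 per neuron
    current_from_thal = np.ones((n_neurons, 1)) * -4 * thal_chosen    # -4.0 per neuron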

If you run the code, the agent does sometimes do a little more moving about, but it can also quite quickly get stuck in a pattern of turning in a circle. As I mentioned in my other post, this is because there is nothing that biases forward movement as the favoured action. Rather, turning (left or right) and not hitting anything is as good as going forward and not hitting anything.

No, they are not. If you look at this connection statement:

nengo.Connection(bg.output[0], errors.ensembles[0].neurons, transform=np.ones((50,1))*4) 

You’ll notice that the connection is made to a .neurons object. This means that the output of the BG is being connected directly to the neurons of errors.ensembles[0]. This is how we implement neural inhibition within Nengo. The direct connection to the neurons of the errors ensembles is also the reason why the transform parameter is a matrix rather than a single scalar value.
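
As a standalone illustration of this inhibition pattern (a sketch with made-up stim and inhibit nodes, not part of the example model):

    import numpy as np
    import nengo

    with nengo.Network() as net:
        stim = nengo.Node(0.5)                                    # value the ensemble should represent
        inhibit = nengo.Node(lambda t: 1.0 if t > 0.5 else 0.0)   # inhibition "switch" that turns on at t = 0.5 s
        ens = nengo.Ensemble(n_neurons=50, dimensions=1)

        nengo.Connection(stim, ens)  # regular NEF connection, through the encoders
        # Direct connection to the neurons: when inhibit outputs 1, every neuron
        # receives a current of -4, silencing the ensemble regardless of stim.
        nengo.Connection(inhibit, ens.neurons, transform=np.ones((50, 1)) * -4)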

If you go back to lectures 2 & 3, recall that there are two different ways of connecting to Nengo ensembles. The first method connects directly to the neurons of an ensemble:

nengo.Connection(input, ens.neurons)

The second method connects to the ensemble itself, and the connection is made through the encoders of the ensemble. This second method is the “NEF” method for connecting to an ensemble (as opposed to the “direct” method for connecting to an ensemble – which bypasses the ensemble encoders):

nengo.Connection(input, ens)

Just to summarize, this code:

nengo.Connection(bg.output, errors.input, transform=4)

would be equivalent to the NEF method for connecting to the ensembles. I.e.,

nengo.Connection(bg.output[0], errors.ensembles[0], transform=4)    
nengo.Connection(bg.output[1], errors.ensembles[1], transform=4)    
nengo.Connection(bg.output[2], errors.ensembles[2], transform=4)   

Thank you for your detailed explanation @xchoo!

I read your post and I still don’t get it. I understand that the agent tries to avoid collisions because the penalty is high. But I don’t understand why it rotates in place (0 speed) or goes back and forth. It gets a positive reward based on its speed, so rotating in place has no penalty but also 0 reward. I would expect the agent to move around in circles to keep its speed high and get a good reward. Going back and forth also involves a lot of actions with 0 reward.

Why does it prefer to avoid the penalty rather than maximize the reward?

Oh, now I understand the difference. BTW, why do we want to bypass the encoders in this case? Why can’t we use the BG (or Thalamus) values to inhibit the errors in the same way that the reward ensemble is connected?

After playing around with the model a bit more, here are some observations I have made that may help you understand the system a bit more. Keep in mind that this agent’s learning setup is super simplistic and can thus have weird outcomes. If you expand the BG network (double-click on the BG box) and display the “value” plot for the input node in the BG network, you’ll get a better understanding of what the agent is “thinking” as it learns this task. As I understand it, this is a summary of how the agent’s learning is set up:

  • The agent has 3 actions: go forward, turn left, and turn right. When the simulation starts, the utilities for these 3 actions are pretty high (0.8, 0.6, and 0.7, respectively).
  • When the agent inevitably crashes into a wall, the action that is being performed is penalized. When an action (any one of the 3) is penalized, the agent tries another action (see the Thalamus output), and since the agent is still crashing into the wall, it’s likely that that action is penalized too.
  • Eventually, all three actions are penalized to such an extent that the BG and Thalamus outputs are basically random (the BG outputs for all 3 actions are close to -1, so the Thalamus just picks an action at random to execute). There is a toy sketch of this utility collapse right after this list.
  • Since going forward has a 1/3 chance of being picked, the agent will most likely just spin on the spot. And, because it is not moving forward, it gets no reward to get it out of this stuck state.
  • Worse still, if you are using the original code (where the BG output inhibits the errors network), when all 3 BG outputs are close to -1, all of the error is inhibited, which means that even if the agent is moving forward, it’s not getting any reward for doing so (because the BG is inhibiting the error signal).
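
Here is the toy sketch mentioned above. It is not the model’s actual learning rule, just plain Python illustrating how repeatedly penalizing whichever action is chosen drives all three utilities to the floor, after which the choice is effectively random:

    import numpy as np

    rng = np.random.default_rng(0)
    utilities = np.array([0.8, 0.6, 0.7])   # initial utilities from the example

    for step in range(200):
        # pick the highest-utility action, with a little noise to break ties
        chosen = int(np.argmax(utilities + rng.normal(scale=0.01, size=3)))
        # the agent keeps crashing, so whatever it does gets penalized
        utilities[chosen] -= 0.05
        utilities = np.clip(utilities, -1.0, 1.0)

    # by now all three utilities sit at the floor (-1), so the noise term
    # decides the action: selection has become effectively random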

One of the causes of this stuck-action issue is that when the agent is penalized for doing an action, the other actions do not get a boost. Thus, the utilities of all three actions can be reduced to such a level that the agent’s action selection is basically random. To fix this, you’ll need to make it such that when the agent is penalized for performing an action, the other actions (i.e., the ones not performed) get rewarded. This change has been made in the NengoFPGA RL learning example, and it performs slightly better than the one in the Nengo examples. If I have time, I will try to port the NengoFPGA version back to the Nengo version and post it here.
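
In the meantime, here is a rough sketch, in plain Python, of the kind of error routing I mean. The function name, the boost factor, and the sign convention are all hypothetical (the NengoFPGA example wires this up with Nengo connections instead):

    import numpy as np

    def route_error(error, chosen, n_actions=3, boost=0.5):
        # Hypothetical error-routing rule (illustrative only). A negative error
        # means the chosen action is being penalized. The chosen action gets the
        # full error, while the non-chosen actions get a fraction of the opposite
        # sign, so penalizing one action nudges the others up instead of letting
        # all three utilities sink together.
        out = np.full(n_actions, -boost * error)  # opposite-sign nudge for non-chosen actions
        out[chosen] = error                       # full error for the chosen action
        return out

    # e.g. the agent crashes while going forward (action 0):
    route_error(-1.0, chosen=0)   # -> array([-1. ,  0.5,  0.5])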

When you use the encoders of an ensemble, the ensemble can represent both positive and negative values (which is what the encoders are meant to do: project the input into a -1 to 1 space). However, the desired behaviour of an inhibited ensemble is that its output should be 0 regardless of the input signal.

Knowing this, let’s explore a few scenarios:

Say the input to the ensemble has a value of 1. If we use the encoders, the “inhibition” value would need to be -1. So far, so good.

Now, let’s consider a negative input. If the input to the ensemble has a value of -1 and we use the encoders, the “inhibition” value needs to be 1 (to bring the output back to 0). Compared to the previous case, this already poses an issue: the inhibition value now depends on what the ensemble is currently representing, so we’d need some mechanism to create the inhibition signal from the output of the ensemble (which can get complex, because now there is a feedback loop).

To avoid this, we bypass the encoders of the ensemble. By doing so, the inhibition signal feeds directly into the input current of the neurons, where a negative value produces a negative current, which inhibits the neurons.

To clarify the role of the encoders, recall that the neurons work with currents. Currents are unidirectional, i.e., a positive input current generates spikes, while a negative input current suppresses spikes. If we want a neuron to represent negative values, this poses an issue, since a negative value would logically cause a negative input current, which wouldn’t produce any spikes. To rectify this, an “encoder” is inserted into the chain like so:

negative input → encoder → input current → neuron

To represent a negative input value, then, we just need to give the neuron an encoder with a negative value. The two negatives cancel each other out and we get a positive input current! :smiley:
To have an ensemble (a population of neurons) represent both negative and positive values, we then need some neurons in the ensemble to have positive encoders, and some to have negative encoders.
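
As a small standalone sketch of that last point (not part of the example model; the sine input is just for illustration):

    import numpy as np
    import nengo

    with nengo.Network() as net:
        stim = nengo.Node(lambda t: np.sin(2 * np.pi * t))  # input swings between -1 and 1
        # Half of the neurons get +1 encoders (they respond to positive inputs),
        # half get -1 encoders (they respond to negative inputs), so together
        # the ensemble can represent values of either sign.
        ens = nengo.Ensemble(
            n_neurons=50,
            dimensions=1,
            encoders=np.vstack([np.ones((25, 1)), -np.ones((25, 1))]),
        )
        nengo.Connection(stim, ens)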

Thank you @xchoo for the explanation! It is much clearer now!