Questions about Reinforcement Learning in Nengo

Hi, everyone:

I am very interested in Nengo and reinforcement learning, but when I run this code, it doesn't seem to work (the agent is always rotating or colliding with walls). As I'm a beginner, can anyone help me?

And can anyone tell me where I can find materials for learning Reinforcement Learning in nengo?

Thanks, ChenLumen.

I tried to change the reward, but it either gets worse or doesn't work at all. Maybe this problem looks naive, but I really want to learn reinforcement learning in Nengo well, and I don't know anyone who is familiar with this area. If someone could point me to materials and introductory code, I would be very grateful.

Hi @mingchenliu,

Yeah, the example you linked is somewhat complex and does some strange things, but those things are explainable.

  • About the agent constantly rotating: when the agent is constantly rotating, it is actually in an ideal state (as far as its reward is concerned). Because the agent is never penalized for constantly turning, it has no “incentive” to do anything else. To fix this, you’d need to penalize (very slightly) both the left and right turn actions when the agent is not moving forward and there’s nothing detected on the forward radar (see the sketch after this list).
  • About the agent getting stuck on a wall: this is also a consequence of the simplistic error rules used for this agent. If you look at the error signal, you’ll notice it is always active. While the agent is moving forward, it keeps learning that moving forward is really good and gradually forgets that bumping into walls is bad. So, by the time it reaches a wall, it has already forgotten that crashing into one is bad, and it does so. You can address this by decreasing the map size or increasing the agent’s movement speed (so the agent encounters a wall before it has completely forgotten that crashing is bad). You can also reduce the forgetting by decreasing the learning rate on the “go forward” action (I used max_speed=30.0 and learning_rate=5e-5 for conn_fwd and it seemed to perform a bit better):
conn_fwd = nengo.Connection(radar, bg.input[0], function=u_fwd, learning_rule_type=nengo.PES(learning_rate=5e-5))
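
For the first point, here is a rough sketch of what a small turn penalty could look like. All of the names (forward_radar, turning, penalty) and the penalty value are placeholders, not part of the original example; you would fold the output of this node into the example’s existing reward/error computation:

import nengo

with nengo.Network() as model:
    forward_radar = nengo.Node(size_in=1)  # ~1 when the path ahead is clear
    turning = nengo.Node(size_in=1)        # ~1 when a turn action is selected

    def penalty_func(t, x):
        clear_ahead, is_turning = x
        # Small negative reward for turning when there is nothing to avoid
        return -0.05 if (clear_ahead > 0.8 and is_turning > 0.5) else 0.0

    penalty = nengo.Node(penalty_func, size_in=2, size_out=1)
    nengo.Connection(forward_radar, penalty[0], synapse=None)
    nengo.Connection(turning, penalty[1], synapse=None)
    # The `penalty` output would then be added into the reward signal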

I should note that this example is very simplistic, and the agent does do the wrong thing most of the time. But, as the simulation progresses (e.g., after 30–40 s of simulation time), you will start to see slight hints that the agent is learning to avoid walls (remember that spinning in a circle is also the best way to avoid all walls).

There are ways to improve the agent’s performance, mostly by altering the error signals being generated. There is a modified version of the code in the NengoFPGA examples where I updated the error signals to make the agent slightly “smarter”. Note that the NengoFPGA example uses a special FpgaPesEnsembleNetwork, but that’s essentially a nengo.Ensemble and a nengo.PES learning rule combined into one network. The relevant modifications to the original example are:

    ...
 
    # Generate the training (error) signal
    def error_func(t, x):
        actions = np.array(x[:3])   # selected actions (from the thalamus output)
        utils = np.array(x[3:6])    # action utilities (input to the basal ganglia)
        r = x[6]                    # reward signal (from movement_node)
        activate = x[7]             # gates/scales the error (learning on or off)

        # Collapse the action values into a one-hot vector: anything below the
        # selection threshold is zeroed, and only the winning action is set to 1
        max_action = max(actions)
        actions[actions < action_threshold] = 0
        actions[actions != max_action] = 0
        actions[actions == max_action] = 1

        # The selected action's error term is (utility - reward); the other
        # actions' error term is (utility - 1). Both are scaled by (1 - r) ** 5,
        # so the error shrinks as the reward approaches 1.
        return activate * (
            np.multiply(actions, (utils - r) * (1 - r) ** 5)
            + np.multiply((1 - actions), (utils - 1) * (1 - r) ** 5)
        )

    errors = nengo.Node(error_func, size_in=8, size_out=3)
    nengo.Connection(thal.output, errors[:3])
    nengo.Connection(bg.input, errors[3:6])
    nengo.Connection(movement_node, errors[6])

    ...
    
    # Connect output of error computation to learning rules
    nengo.Connection(errors[0], conn_fwd.learning_rule)
    nengo.Connection(errors[1], conn_left.learning_rule)
    nengo.Connection(errors[2], conn_right.learning_rule)

Unfortunately, learning to do reinforcement learning (in general) is a slow process that requires a lot of trial and error. For learning how to do it in Nengo, you’ll want to first familiarize yourself with exactly how the nengo.PES learning rule works, since this is the most common on-line learning rule used in Nengo. I’d advise you to create a simple network, and play around with it. Observe what happens when you change the inputs in a simple learned communication channel network and think about how that would affect something like the critter model.
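
To get a feel for nengo.PES, here’s a minimal sketch of a learned communication channel (the ensemble sizes, input signal, and learning rate are just illustrative choices, not taken from any particular example):

import numpy as np
import nengo

with nengo.Network() as model:
    stim = nengo.Node(lambda t: np.sin(2 * np.pi * t))
    pre = nengo.Ensemble(100, dimensions=1)
    post = nengo.Ensemble(100, dimensions=1)
    nengo.Connection(stim, pre)

    # Start the connection off computing nothing useful
    conn = nengo.Connection(
        pre, post, function=lambda x: [0],
        learning_rule_type=nengo.PES(learning_rate=1e-4),
    )

    # Error = post - stim; PES adjusts the decoders to drive this toward zero
    error = nengo.Ensemble(100, dimensions=1)
    nengo.Connection(post, error)
    nengo.Connection(stim, error, transform=-1)
    nengo.Connection(error, conn.learning_rule)

with nengo.Simulator(model) as sim:
    sim.run(10.0)

Try swapping the input signal or flipping the sign of the error connection and watch how the learned function changes; the same intuition carries over to the critter model’s error signals.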

Also, since you are playing around with the critter model, you’ll want to familiarize yourself with how the basal ganglia network works, since it forms the basis of how the critter makes “decisions”.
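
As a quick illustration of the basal ganglia’s role (the utility values here are arbitrary), it suppresses all but the highest-utility action, and the thalamus cleans its output up into a roughly one-hot selection:

import nengo

with nengo.Network() as model:
    # Three candidate actions with fixed utilities; index 0 should win
    utilities = nengo.Node([0.8, 0.4, 0.6])
    bg = nengo.networks.BasalGanglia(dimensions=3)
    thal = nengo.networks.Thalamus(dimensions=3)
    nengo.Connection(utilities, bg.input)
    nengo.Connection(bg.output, thal.input)
    # thal.output ends up near 1 for the selected action and near 0 for the rest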

Finally, if you search the CNRG (computational neuroscience research group – these are the folks that work with the NEF and Nengo for research) publications page for “reinforcement”, there are several theses and papers exploring how to use Nengo to do reinforcement learning. :slight_smile:

Oh, thanks! @xchoo
I started learning about deep Q-learning, and I found this code. My question is: why does q_node calculate the next state’s Q value? I think it’s the Q value of the previous state. As far as I know, traditional Q-learning with ANNs stores the previous state, action, and reward (like s, a, r, s’). Can you help me? Or do you have a better example of Q-learning with Nengo? I wonder if it’s related to slow_tau and fast_tau?

If I understand the code (that you linked) correctly, q_node is not computing the next state’s Q value. It’s calculating the Q value of the previous state and previous action (i.e., Q(s’, a’)). There is even a comment about it in the code:

nengo.Connection(q_node, learning_conn.learning_rule, transform=-1, synapse=fast_tau)  # 0.9*Q(s',a') + r
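
In case it helps with the slow_tau / fast_tau question: a common trick in Nengo RL models is to subtract a slowly filtered copy of the Q value from a quickly filtered one, so the slow copy effectively stands in for the previous state/action. Here’s a rough sketch of that pattern (all names and values are illustrative, not taken from the code you linked):

import nengo

fast_tau = 0.01  # short synapse: roughly the current Q value
slow_tau = 0.1   # long synapse: a lagged copy, standing in for the previous Q value
gamma = 0.9      # discount factor

with nengo.Network() as model:
    q_ens = nengo.Ensemble(200, dimensions=1)  # represents the Q value over time
    reward = nengo.Node(size_in=1)             # external reward signal r
    td_error = nengo.Node(size_in=1)           # receives r + gamma*Q(current) - Q(previous)

    nengo.Connection(reward, td_error, synapse=fast_tau)
    nengo.Connection(q_ens, td_error, transform=gamma, synapse=fast_tau)
    nengo.Connection(q_ens, td_error, transform=-1, synapse=slow_tau)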

@drasmuss (Daniel Rasmussen) spearheaded the development of reinforcement learning implementations in Nengo while he was pursuing his PhD. I’d recommend starting out with his work. His PhD thesis is available here, and he has some code in it, but it’s written for an older version of Nengo and will need to be updated. He also has a more easily digestible paper here that summarizes some of the work he did for his PhD.

There’s also a discussion here about building custom environments for Q learning agents, if you are interested in that.

Thanks! What I knew before was the state and action of the next step. After reading Daniel Rasmussen’s thesis, I realized that it was the state and action of the previous step. Now I understand.