How to build a custom UI environment for reinforcement learning in Nengo?

Hello,

I am currently learning this material, so my questions may be a bit vague.
I have seen a few examples here with custom UIs for reinforcement learning, but I am not clear on how these visualisations are made or how they are wired up. Is there a tutorial or some documentation on how to build a custom UI in Nengo?

I have seen examples where rewards and states are represented using nodes, fed into ensembles, an error is computed, and PES learning adjusts the tuning curves according to the reward function. But some examples use a cortex -> basal ganglia -> thalamus loop while others use just simple spiking neurons, so I am confused about how to decide on or design the computational graph for a custom RL environment.
Another question: how do I wire a standard RL environment like the Gym CartPole to a Nengo-based RL algorithm?
Is there any Nengo RL example with multi-agent systems?
Does Nengo build a static computational graph to perform its neural computations?

I have figured out answers to most of my questions here. example5 and example6 are quite useful for understanding, but I am still not clear on how Q(s,a) (the action-value function) or V(s) (the value function) is calculated using neurons in these examples.

with model:
    # node that applies movement commands to the simulated agent body
    movement = nengo.Node(move, size_in=2)

    # grid-world visualisation shown in the Nengo GUI
    env = ccm.ui.nengo.GridNode(world, dt=0.005)

    # radar sensor readings from the environment
    stim_radar = nengo.Node(detect)

    radar = nengo.Ensemble(n_neurons=50, dimensions=3, radius=4, seed=2,
                           noise=nengo.processes.WhiteSignal(10, 0.1, rms=1))
    nengo.Connection(stim_radar, radar)

    # simple Braitenberg-style controller: speed from the centre beam,
    # turning from the difference between the side beams
    def braiten(x):
        turn = x[2] - x[0]
        spd = x[1] - 0.5
        return spd, turn
    nengo.Connection(radar, movement, function=braiten)

    # agent position and heading, normalised to roughly [-1, 1]
    def position_func(t):
        return (body.x / world.width * 2 - 1,
                1 - body.y / world.height * 2,
                body.dir / world.directions)
    position = nengo.Node(position_func)

    # ensemble representing the current state
    state = nengo.Ensemble(100, 3)
    nengo.Connection(position, state, synapse=None)

    # reward delivered by the grid cell the agent is currently on
    reward = nengo.Node(lambda t: body.cell.reward)

    tau = 0.1
    value = nengo.Ensemble(n_neurons=50, dimensions=1)

    # learned connection from state to value, initialised to output 0
    learn_conn = nengo.Connection(state, value, function=lambda x: 0,
                                  learning_rule_type=nengo.PES(learning_rate=1e-4,
                                                               pre_tau=tau))
    nengo.Connection(reward, learn_conn.learning_rule,
                     transform=-1, synapse=tau)
    nengo.Connection(value, learn_conn.learning_rule,
                     transform=-0.9, synapse=0.01)
    nengo.Connection(value, learn_conn.learning_rule,
                     transform=1, synapse=tau)

In the above code, the last four connections seem to be calculating the value function, but I am not sure how. Could someone explain how the neurons approximate the value (or action-value) function in this code?

The first statement

learn_conn = nengo.Connection(state, value, function=lambda x: 0,
                                  learning_rule_type=nengo.PES(learning_rate=1e-4,
                                                               pre_tau=tau))

creates the connection that will approximate the value function (a mapping from the ensemble representing the state to the 1D value ensemble representing the scalar value of that state). The decoded function is initialized to be zero everywhere. The next three connections

    nengo.Connection(reward, learn_conn.learning_rule, 
                     transform=-1, synapse=tau)
    nengo.Connection(value, learn_conn.learning_rule, 
                     transform=-0.9, synapse=0.01)
    nengo.Connection(value, learn_conn.learning_rule, 
                     transform=1, synapse=tau)

are setting up the standard TD error formula error = reward + discount * V(s') - V(s), where the three connections compute the reward, discount * V(s'), and V(s) terms, in that order. (In this case the sign is flipped, giving -error = -reward - discount * V(s') + V(s), but it's the same idea.)

Note that V(s) and V(s') are computed by applying different temporal filters to the value signal (the synapse argument). Roughly speaking, we can think of this as adding a delay to the signal (that's not strictly true, but it works for the sake of explanation). Applying different temporal filters (delays) to the value signal gives us approximations of that signal from different points in time, which is the key to computing the TD error.
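As a rough standalone illustration (not part of the model, and keeping in mind that a lowpass synapse is not a true delay), filtering the same signal with two different time constants gives a "recent" view and a "lagged" view of it, which is what the connections with synapse=0.01 and synapse=tau are exploiting:

    import numpy as np
    import nengo

    dt = 0.001
    t = np.arange(0, 1, dt)
    value_signal = np.sin(2 * np.pi * t)  # stand-in for the decoded value output

    # short time constant: stays close to the current value (plays the role of V(s'))
    fast = nengo.Lowpass(0.01).filt(value_signal, dt=dt)

    # longer time constant: lags further behind (plays the role of V(s))
    slow = nengo.Lowpass(0.1).filt(value_signal, dt=dt)

    # the signal the three connections feed into the PES rule is then roughly
    # -reward - 0.9 * fast + slow, i.e. the negated TD error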


Hi Drasmuss,

So I tried to replicate the idea you explained above, but I used the Gym CartPole environment for testing.

import nengo
import numpy as np

import gym

model=nengo.Network()

env = gym.make('CartPole-v0').env

class EnvironmentInterface(object):
    """Wraps the Gym environment so that Nengo nodes can drive it."""

    def __init__(self, env, stepSize=5):
        self.env = env
        self.n_actions = env.action_space.n
        self.state_dim = env.observation_space.shape[0]
        self.t = 0
        self.stepsize = stepSize  # take an environment step every `stepsize` ms of simulation
        self.output = np.zeros(self.n_actions)
        self.state = env.reset()
        self.reward = 0
        self.current_action = 0

    def take_action(self, action):
        self.state, self.reward, self.done, _ = self.env.step(action)
        if self.done:
            # self.reward = -2
            self.state = self.env.reset()

    def get_reward(self, t):
        return self.reward

    def sensor(self, t):
        return self.state

    def step(self, t, x):
        # every `stepsize` ms, pick the action with the highest Q value and apply it
        if int(t * 1000) % self.stepsize == 0:
            self.current_action = np.argmax(x)
            self.take_action(self.current_action)

    def calculate_Q(self, t, x):
        # one ms after each environment step, store 0.9 * max_a Q(s', a) + r
        # at the index of the action that was just taken
        if int(t * 1000) % self.stepsize == 1:
            qmax = x[np.argmax(x)]
            op = np.zeros(self.n_actions)
            op[self.current_action] = 0.9 * qmax + self.reward
            self.output = op

        return self.output
    
        
    def step2(self,t,x):
        if int(t*1000) == 1:
            print("STARTING")
        
        if int(t * 1000)%self.stepsize == 0:
            qs = self.output[5:]
            self.current_action = np.argmax(qs)

            self.take_action(self.current_action)
        
        elif int(t * 1000) % self.stepsize == 1:
            qvals = x
            qmax = qvals[np.argmax(qvals)]

            c_output = np.zeros(self.n_actions)
            c_output[self.current_action] = qvals[self.current_action]

            f_output = np.zeros(self.n_actions)
            f_output[self.current_action] = 0.9*qmax + self.reward

            self.output = np.concatenate((c_output,f_output,qvals))
        
        return self.output

            
        
tau = 0.01

fast_tau = 0
slow_tau = 0.01
n_action = 2

envI = EnvironmentInterface(env)

state_dimensions = envI.state_dim
n_actions = envI.n_actions

with model:
    sensor = nengo.Node(envI.sensor)
    reward = nengo.Node(envI.get_reward)

    sensor_net = nengo.Ensemble(n_neurons=1000, dimensions=envI.state_dim, radius=4)

    nengo.Connection(sensor, sensor_net)

    action_net = nengo.Ensemble(n_neurons=1000, dimensions=envI.n_actions, radius=4)

    learning_conn = nengo.Connection(sensor_net, action_net, function=lambda x: [0, 0],
                                     learning_rule_type=nengo.PES(1e-3, pre_tau=slow_tau),
                                     synapse=tau)

    q_node = nengo.Node(envI.calculate_Q, size_in=2, size_out=2)

    step_node = nengo.Node(envI.step, size_in=2)

    nengo.Connection(action_net, step_node, synapse=fast_tau)

    nengo.Connection(action_net, q_node, synapse=tau)

    nengo.Connection(q_node, learning_conn.learning_rule, transform=1, synapse=fast_tau)  # 0.9*Q(s',a') + r

    nengo.Connection(action_net, learning_conn.learning_rule, transform=-1, synapse=slow_tau)  # Q(s,a)

But it looks like my implementation has some flaw: the Q value of one action becomes very high, and the CartPole episodes never last beyond a few steps.

Is my implementation of the action-value function correct?

I haven’t looked at everything in detail, but at first glance it looks like your error is reversed (you have reward + 0.9 * Q(s', a') - Q(s, a), rather than -reward - 0.9 * Q(s', a') + Q(s, a) as in the example).

What is the advantage of having the negative reward (-reward - 0.9 * Q(s', a') + Q(s, a))?

It’s not an advantage, it just depends on how the learning rule works. Some rules adjust the weights like w += error * ... and some adjust them like w -= error * ..., so you need to make sure that the error you’re passing to your learning rule matches what that rule expects, given how you would like the weights to change. In the case of PES, I believe you are currently passing in the error with the sign flipped.
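For reference, in the Nengo learning examples the error signal fed to PES is typically computed as (actual output - target), along these lines (post, target, and learn_conn here are placeholder names, not objects from your model):

    # error = actual - target; PES then drives this difference toward zero
    error = nengo.Node(size_in=1)
    nengo.Connection(post, error)                    # + actual output
    nengo.Connection(target, error, transform=-1)    # - desired output
    nengo.Connection(error, learn_conn.learning_rule)

In the TD setting the "target" is reward + 0.9 * Q(s', a') and the "actual" is Q(s, a), which is why the error in the example ends up as Q(s, a) - reward - 0.9 * Q(s', a') rather than the conventionally written TD error.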

Thanks Drasmuss.
My code seems to be working now. I didn’t get performance equivalent to DQN, but I managed to balance the cart-pole for at least a few seconds.

Now I have a doubt about the basal ganglia.
I am reading this paper to understand the basal ganglia (BG) and learning. My confusion is about how the utility values between cortex and BG are calculated, and how they are corrected through learning.

So do I have to again calculate (-reward - 0.9 * Q(s', a') + Q(s, a)) in a learning node, which will then correct the utility values between BG and cortex?

You can think of decision making in the TD reinforcement learning context as two steps: computing action values, and choosing an action based on those values. The part you have been doing so far in your model is computing the action values; the action selection was being done in the EnvironmentInterface (self.current_action = np.argmax(x)). The role of the basal ganglia is essentially to do that argmax calculation, but in neurons. So none of the rest of your model needs to change; you would just add a new component (the basal ganglia), connect your action values (represented in the action ensemble) to the input of the basal ganglia/thalamus, and then connect the output of that (which now represents the selected action) back to the environment.
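A rough, untested sketch of what that could look like on top of your existing model, using the built-in nengo.networks.BasalGanglia and nengo.networks.Thalamus networks (this would replace the direct action_net -> step_node connection; also note the basal ganglia model is often described as working best with utilities roughly in the 0.3-1 range, so the Q values may need scaling/shifting):

    with model:
        bg = nengo.networks.BasalGanglia(dimensions=n_actions)
        thal = nengo.networks.Thalamus(dimensions=n_actions)

        # action values (utilities) go into the basal ganglia
        nengo.Connection(action_net, bg.input, synapse=tau)

        # the thalamus cleans the basal ganglia output up into a
        # roughly one-hot selection signal
        nengo.Connection(bg.output, thal.input)

        # the selected action, rather than the raw Q values,
        # now drives the environment
        nengo.Connection(thal.output, step_node, synapse=fast_tau)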

But the thalamus output has the same dimensionality as the input to the BG; it just amplifies the index of whichever action has the maximum Q value. So I feel I still have to do an argmax on the output of the thalamus?

Yes, that is correct; it isn’t exactly an argmax operation. You can think of it as an argmax that indicates the maximum by returning a vector of zeros with a 1 at the argmax position (e.g. [0, 0, 1, 0]). So you would need to call np.argmax on that if you want to turn it into an integer index (e.g. 2).
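On the environment side, the existing step method can then stay essentially the same; only the interpretation of x changes, for example:

    def step(self, t, x):
        # x is now the thalamus output, e.g. approximately [1, 0] or [0, 1];
        # argmax recovers the selected action as an integer index
        if int(t * 1000) % self.stepsize == 0:
            self.current_action = int(np.argmax(x))
            self.take_action(self.current_action)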