Reinforcement Learning in Nengo


#1

Hi everyone.

Recently I’ve been looking into learning mechanisms in neuroscience and in Nengo,
but I’m not quite there yet; I’m still in the exploring phase.

I’ve followed the examples on supervised and unsupervised learning in the newly uploaded Nengo 2.4 documentation.
If I understood them correctly, they show how a neural network establishes a functional connection through the PES and/or BCM rules at the lower (neuroscience) level.
But if we don’t need to observe the online learning of the connection, we can simply use the connection solver provided by the Nengo simulator to determine the synaptic weights so that the network performs the desired transform or function.

Now I’m more interested in learning behavior at the cognitive level, since it’s more applicable to solving real-world tasks, for example letting a robot learn by trial and error.
I noticed that there’s an example that uses the Voja rule to perform SPA-level learning, which is really interesting, but it doesn’t seem to cover learning behavior in general or, say, reinforcement learning.

So, I would like to see more suggestions on modeling learning behavior with Nengo.
Any discussion is welcome!

Edward Chen


#2

Hi Edward,

There has definitely been a good deal of work on reinforcement learning in Nengo.

These papers are probably a good place to start; they describe a model of bandit learning, the simplest RL case.
http://compneuro.uwaterloo.ca/publications/bekolay2011.html
http://compneuro.uwaterloo.ca/publications/stewart2012a.html

These papers extend those ideas to full temporal-difference RL, as well as hierarchical reinforcement learning.
http://compneuro.uwaterloo.ca/publications/rasmussen2013b.html
http://compneuro.uwaterloo.ca/publications/Rasmussen2014b.html
http://compneuro.uwaterloo.ca/publications/Rasmussen2014c.html

If you want to see some code examples, check out https://github.com/tcstewar/nengo_learning_examples (learning2, learning5, and learning6 are all RL-related, in roughly increasing order of complexity).


#3

Wow, great!
Thanks for your fast response.
I’ll definitely dig into that!

Edward Chen


#4

Hi @drasmuss

I encountered an error while running learning5 and learning6.

error message:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nengo_gui/page.py", line 220, in execute
    exec(compiled, code_locals)
  File "nengo_gui_compiled", line 28, in <module>
  File "/home/cnrg-ntu/nengo_gui_project/nengo_learning_examples/ccm/__init__.py", line 1, in <module>
    from model import Model,log_everything
ImportError: No module named 'model'

P.S. I have downloaded CCMSuite and it is still not working. Did I miss something?

Edward Chen


#5

Ah, it looks like ccmsuite is only compatible with Python 2.7, not 3.5. You could install Python 2.7, or we might be able to persuade @tcstewar to push an update for ccmsuite if he has time.
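
For anyone hitting the same ImportError: a line like "from model import Model" inside a package's __init__.py is an implicit relative import, which Python 2 allowed and Python 3 does not. The sketch below reproduces that behaviour with a throwaway package; demo_pkg, demo_model, and try_import are made-up names for illustration, and this is not necessarily how ccmsuite itself was (or will be) fixed.

```python
import importlib
import os
import sys
import tempfile

def try_import(init_source):
    """Build a throwaway package and report whether importing it works."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = os.path.join(root, "demo_pkg")  # hypothetical package name
        os.makedirs(pkg_dir)
        # A sibling module inside the package, like ccm/model.py
        with open(os.path.join(pkg_dir, "demo_model.py"), "w") as f:
            f.write("class Model:\n    pass\n")
        # The package __init__.py with the import style under test
        with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
            f.write(init_source)
        sys.path.insert(0, root)
        importlib.invalidate_caches()
        try:
            importlib.import_module("demo_pkg")
            return "ok"
        except ImportError as exc:
            return "ImportError: %s" % exc
        finally:
            # Clean up so repeated calls start from a fresh state
            sys.path.remove(root)
            for name in ("demo_pkg", "demo_pkg.demo_model"):
                sys.modules.pop(name, None)

# Python 2 style (implicit relative import)
print(try_import("from demo_model import Model\n"))
# Explicit relative import, valid on both Python 2.7 and Python 3
print(try_import("from .demo_model import Model\n"))
```

On Python 3 the first call should report an ImportError and the second should succeed, which is why installing Python 2.7 works around the problem.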


#6

Okay, I see.

Thanks for your reply :)

Edward Chen


#7

I really need to update ccmsuite to fix this, and to fix the install bug in it (right now it only works if you do a “python setup.py develop”, but not if you do “python setup.py install”). But I’ve now had 2 people in the last 2 weeks run into problems with it, so I’ll take a look and see if I can fix this…


#8

Those two examples should be working in Python 3 now!

You’ll have to grab the latest version of ccmsuite again, and there are two small changes to the learning5 and learning6 scripts. I’ve updated them in the repository, but if you just want to make the changes yourself, it’s just changing

world = ccm.lib.grid.World(Cell, map=mymap, directions=4)

to

world = ccm.lib.cellular.World(Cell, map=mymap, directions=4)


#9

Building on the utility learning examples (the learning5 and learning6 scripts) and trying to use CCM Suite:

Let’s imagine we have a certain cell in the map designated by a letter as in the 6th example.

class Cell(ccm.lib.grid.Cell):
    def color(self):
        return 'black' if self.wall else None
    def load(self, char):
        if char == '#':
            self.wall = True
        if char == 'f':
            self.reward = 10

As I understand from reading the code, the detect function detects obstacles, such as walls.

  1. How could we define a function that returns the distance to f, ideally to the nearest f? The detect function detects obstacles in discrete directions. What if we wanted the distance to f explicitly?
  2. Is there a diffusion function implemented in CCM? For example, if the reward f emits an odor that diffuses in the world?
  3. Does the agent in example 5 eventually learn? Is it sensible to expect the utilities to stabilize in such an example?
  4. Suppose we use the same network as in example 5 (an agent that can move and turn) in a world with a specific f point, and define the reward as positive when coming closer to f and negative when moving away from it. a. Would the agent learn to approach f? b. Would the utilities stabilize? (I wouldn’t think so, but maybe I missed something.)

#10

I modified the example 5 script.
This is the code (the #### map is not reproduced well here, but it is 65x28).
But when I try to run it, it never stops loading…

Any help?

import nengo
import numpy as np
import math
import ccm.lib.grid
import ccm.lib.continuous
import ccm.ui.nengo

mymap = """
#################################################################

#################################################################

"""

class Cell(ccm.lib.grid.Cell):
    def color(self):
        return 'black' if self.wall else None
    def load(self, char):
        if char == '#':
            self.wall = True

world = ccm.lib.cellular.World(Cell, map=mymap, directions=4)

body = ccm.lib.continuous.Body()
food = ccm.lib.continuous.Body()

world.add(body, x=1, y=3, dir=2)
world.add(food, x=48, y=9, dir=2)

model = nengo.Network(seed=8)
with model:
    # Apply the selected speed and rotation to the body
    def move(t, x):
        speed, rotation = x
        dt = 0.001
        max_speed = 20.0
        max_rotate = 10.0
        body.turn(rotation * dt * max_rotate)
        body.go_forward(speed * dt * max_speed)

    movement = nengo.Node(move, size_in=2)

    odor_change = nengo.Node(size_in=1, label='reward')

    env = ccm.ui.nengo.GridNode(world, dt=0.005)

    # Normalized straight-line distance from the body to the food
    def distance2f(t):
        dis = math.sqrt((food.x-body.x)**2 + (food.y-body.y)**2)
        max_distance = world.width + world.height
        return dis/max_distance
    odor = nengo.Node(distance2f)

    odor_neurons = nengo.Ensemble(n_neurons=50, dimensions=1)
    nengo.Connection(odor, odor_neurons)

    # Slowly filtered copy of the odor; odor_change = odor - odor_memory
    odor_memory = nengo.Ensemble(n_neurons=50, dimensions=1)
    nengo.Connection(odor_neurons, odor_memory, transform=1, synapse=0.3)
    nengo.Connection(odor_memory, odor_change, transform=-1, synapse=0.01)
    nengo.Connection(odor_neurons, odor_change, transform=1, synapse=0.01)

    bg = nengo.networks.actionselection.BasalGanglia(3)
    thal = nengo.networks.actionselection.Thalamus(3)
    nengo.Connection(bg.output, thal.input)

    # Initial (constant) utilities for each action
    def u_fwd(x):
        return 0.8
    def u_left(x):
        return 0.6
    def u_right(x):
        return 0.7
    conn_fwd = nengo.Connection(odor_neurons, bg.input[0], function=u_fwd, learning_rule_type=nengo.PES())
    conn_left = nengo.Connection(odor_neurons, bg.input[1], function=u_left, learning_rule_type=nengo.PES())
    conn_right = nengo.Connection(odor_neurons, bg.input[2], function=u_right, learning_rule_type=nengo.PES())

    # Map the selected action onto (speed, rotation)
    nengo.Connection(thal.output[0], movement, transform=[[1],[0]])
    nengo.Connection(thal.output[1], movement, transform=[[0],[1]])
    nengo.Connection(thal.output[2], movement, transform=[[0],[-1]])

    # Error signals for the three learned connections
    errors = nengo.networks.EnsembleArray(n_neurons=50, n_ensembles=3)
    nengo.Connection(odor_change, errors.input, transform=-np.ones((3,1)))
    nengo.Connection(bg.output[0], errors.ensembles[0].neurons, transform=np.ones((50,1))*4)
    nengo.Connection(bg.output[1], errors.ensembles[1].neurons, transform=np.ones((50,1))*4)
    nengo.Connection(bg.output[2], errors.ensembles[2].neurons, transform=np.ones((50,1))*4)
    nengo.Connection(bg.input, errors.input, transform=1)

    nengo.Connection(errors.ensembles[0], conn_fwd.learning_rule)
    nengo.Connection(errors.ensembles[1], conn_left.learning_rule)
    nengo.Connection(errors.ensembles[2], conn_right.learning_rule)

#11

Hmm, very strange… The script seems to run for me. Can you try closing Nengo, restarting it, and running the file again?


#12

We’d have to write that function ourselves. You can do something like this:

for row in world.grid:
    for cell in row:
        if cell.is_food:  # create this attribute by adding "self.is_food = char == 'f'" to load
            # do computation here to figure out distance to agent,
            # using cell.x, cell.y, body.x, and body.y
            pass
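
A self-contained sketch of that loop, for illustration only. It uses a plain list-of-lists grid as a stand-in for world.grid, and GridCell, build_grid, and nearest_food_distance are hypothetical names, not part of CCMSuite. With the real world you would keep the same loop and read cell.x, cell.y, body.x, and body.y instead.

```python
import math

class GridCell:
    """Stand-in for a CCM grid cell (hypothetical, not the CCMSuite class)."""
    def __init__(self, x, y, char):
        self.x = x
        self.y = y
        self.wall = char == '#'
        self.is_food = char == 'f'

def build_grid(text):
    """Turn a text map into rows of GridCell objects."""
    rows = [line for line in text.splitlines() if line]
    return [[GridCell(x, y, ch) for x, ch in enumerate(row)]
            for y, row in enumerate(rows)]

def nearest_food_distance(grid, agent_x, agent_y):
    """Euclidean distance to the closest food cell, or None if there is none."""
    best = None
    for row in grid:
        for cell in row:
            if cell.is_food:
                d = math.hypot(cell.x - agent_x, cell.y - agent_y)
                if best is None or d < best:
                    best = d
    return best

demo_map = """
#######
#  f  #
#     #
#f    #
#######
"""
grid = build_grid(demo_map)
print(nearest_food_distance(grid, 1, 2))
```

Scanning every cell on every call is simple but slow on a large map; caching the list of food cells once after loading would be the obvious refinement.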

Nope, that’d also have to be defined yourself.
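
If you want to try it, a very rough sketch of such a diffusion step could look like this. It works on plain nested lists rather than on CCM objects; the diffuse function, its rate parameter, and the choice to pin the source cells at 1.0 are all assumptions made for the example.

```python
def diffuse(odor, walls, sources, rate=0.2, steps=1):
    """Spread odor between 4-connected open cells; source cells stay at 1.0.

    odor and walls are lists of rows; sources is a list of (x, y) tuples.
    Keep rate at or below 0.25 so the explicit update stays stable.
    """
    height, width = len(odor), len(odor[0])
    odor = [row[:] for row in odor]
    for sx, sy in sources:
        odor[sy][sx] = 1.0
    for _ in range(steps):
        new = [row[:] for row in odor]
        for y in range(height):
            for x in range(width):
                if walls[y][x]:
                    continue  # odor does not enter walls
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < width and 0 <= ny < height and not walls[ny][nx]:
                        new[y][x] += rate * (odor[ny][nx] - odor[y][x])
        for sx, sy in sources:
            new[sy][sx] = 1.0  # the food keeps emitting
        odor = new
    return odor

# A 3x7 room: walls on the border, one open corridor, food at x=1
walls = [[True] * 7,
         [True] + [False] * 5 + [True],
         [True] * 7]
odor = [[0.0] * 7 for _ in range(3)]
odor = diffuse(odor, walls, sources=[(1, 1)], steps=20)
print([round(v, 3) for v in odor[1]])
```

The agent could then read the odor value of the cell it is standing on (or the one in front of it) instead of computing a distance directly.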

Eventually it does sort of learn, since in that basic case there are sensible stable utility values. Or, more accurately, there are sensible utility functions: the agent is not learning a single static utility value for each action; rather, it is learning a function that maps from state to utility. So the utility of going forward would be high if there are no obstacles in front of it, but low if there are obstacles.

However, there’s a huge thing missing from this model: exploration. Right now, it’s only ever trying to perform what it thinks is the best action. So that’s a bit problematic. It turns out that in this simple case, we get a bit of exploration due to neuron spike noise, and that’s sufficient for this task, but I’m pretty sure that in more complex tasks we’d want some richer exploration system.
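
As one possible illustration of what an explicit exploration mechanism could look like (this is plain Python, not something the example scripts do, and epsilon_greedy is a name made up for this sketch), here is the textbook epsilon-greedy rule applied to a list of utilities:

```python
import random

def epsilon_greedy(utilities, epsilon=0.1, rng=random):
    """Pick the best action most of the time, a random one otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(utilities))  # explore
    # exploit: index of the largest utility
    return max(range(len(utilities)), key=lambda i: utilities[i])

rng = random.Random(0)
utilities = [0.8, 0.6, 0.7]  # forward, left, right
choices = [epsilon_greedy(utilities, epsilon=0.2, rng=rng) for _ in range(1000)]
print([choices.count(i) for i in range(3)])
```

In a spiking model the same effect would more likely come from injecting noise into the utilities before the basal ganglia than from an explicit random draw.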

[quote="bagjohn, post:9, topic:287"]
If we use the same network as in example 5 (agent that can move and turn) in a world where there is a specific f point and form the reward as positive when coming closer to f and negative when going away from it a. Would the agent learn to approach f? b. Would the utilities stabilize (I wouldn’t think so but maybe I missed sthing)
[/quote]

Hmm… good question… I think that might work, since that is such a simple task, and since it’s possible to define utility functions that do that task. The way I’d try going about it is first try defining the function yourself, rather than getting it to learn it. By that I mean take these functions here:

def u_fwd(x):
    return 0.8
def u_left(x):
    return 0.6
def u_right(x):
    return 0.7

and modify them to be something like this:

def u_fwd(x):
    # if the food is in front of me, then return 1.0, else 0.0
    raise NotImplementedError
def u_left(x):
    # if the food is to my left, then return 1.0, else 0.0
    raise NotImplementedError
def u_right(x):
    # if the food is to my right, then return 1.0, else 0.0
    raise NotImplementedError

If that behaviour is possible to hand-design like that, then it should maybe be possible for the agent to learn it. But I always like trying to do it manually first, to make sure I’m not trying to get the system to learn something impossible.
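
To make that concrete, here is one way the hand-designed versions might look. It assumes the state x carries a single value, the bearing to the food in radians relative to the agent's heading (positive meaning left, with standard y-up axes). That encoding, the 45-degree threshold, and the bearing_to_food helper are assumptions for this sketch, not what the example scripts feed into the network.

```python
import math

THRESHOLD = math.pi / 4  # food within 45 degrees counts as "in front" (arbitrary)

def bearing_to_food(body_x, body_y, heading, food_x, food_y):
    """Angle to the food relative to the heading, wrapped to [-pi, pi]."""
    angle = math.atan2(food_y - body_y, food_x - body_x) - heading
    return math.atan2(math.sin(angle), math.cos(angle))

def u_fwd(x):
    # food roughly straight ahead
    return 1.0 if abs(x[0]) <= THRESHOLD else 0.0

def u_left(x):
    # food to the left of the heading
    return 1.0 if x[0] > THRESHOLD else 0.0

def u_right(x):
    # food to the right of the heading
    return 1.0 if x[0] < -THRESHOLD else 0.0

state = [bearing_to_food(0.0, 0.0, 0.0, 0.0, 5.0)]  # food directly to the left
print(u_fwd(state), u_left(state), u_right(state))
```

If the agent reaches the food with these fixed functions, the next step would be to start from constant utilities again and see whether learning recovers something similar.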


#13

No luck in the Nengo GUI. Even after restarting it, the model never loads.

But when I copy the code into a Jupyter notebook and add

with nengo.Simulator(model, dt=0.001) as sim:
    sim.run(5)
    t = sim.trange()
    print("done")

it completes the simulation in 18 seconds. For more timesteps, it takes forever.

Any ideas?


#14

That is really strange… It runs fine for me that way as well. How many more timesteps did you try? Also, what operating system are you running and what version of Python?


#15

Hello Terry!

I don’t know how it happened, but it is now working in the GUI as well.
It is very slow, though.

Anyway, I changed the code a bit. I would like some feedback if you have the time.

  • The Body senses its distance from the Food (odor Node). It does this through an “antenna”, which is always the grid cell in front of it, so even if it only rotates, the input changes accordingly. The further from the food, the lower the odor.

  • The input (distance2f) is fed to the BG as in your example.

  • The input is also given to an Odor Memory ensemble through a slow synapse. From odor and odor_memory we get an odor_change signal (the reward Node). It is negative when moving towards the food and positive when moving away from it (odor_memory - odor).

  • The reward is fed to the error node along with the BG input, as in your example. The BG input is weighted positively, the reward negatively.
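
A rough sketch of how a change signal of this kind behaves, ignoring neurons entirely and treating the slow synapse as an ideal first-order low-pass filter (both are simplifying assumptions, and lowpass_step and simulate_change are names invented here). For a steadily rising odor, the gap between the odor and its filtered copy settles at roughly slope × tau, so a short time constant gives a small signal.

```python
def lowpass_step(state, target, dt, tau):
    """One Euler step of a first-order low-pass filter."""
    return state + (dt / tau) * (target - state)

def simulate_change(slope, tau, dt=0.001, t_end=2.0):
    """Return odor minus its filtered copy after a steady ramp of the odor."""
    odor = 0.5
    memory = odor  # start the memory at the current odor value
    for _ in range(int(round(t_end / dt))):
        odor += slope * dt
        memory = lowpass_step(memory, odor, dt, tau)
    return odor - memory

# Odor rising by 0.2 per second, with the two time constants used in this thread
for tau in (0.05, 0.3):
    print(tau, simulate_change(slope=0.2, tau=tau))
```

The sign convention here is odor minus memory; flipping it only flips the sign of the result, not its size.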

The agent doesn’t seem to learn anything.

Issues:

  1. You said that the PES rule is used in such an example to correlate the utilities of actions with states. In this case that is not possible, because the state at each moment is just a single value, and the utility of an action depends on the history. Trying to follow your hint and do it manually, I realize I would do something like this:
    If the signal increases, continue what you are doing.
    If the signal decreases and you were moving, turn.
    If the signal decreases and you were turning left, turn right and sometimes move.
    If the signal is stable, try everything randomly.
    But this takes previous actions and signal history into account.
  2. Let’s say we wanted to feed the actions just executed back to the BG (thalamic output to BG input, along with the signal) and use the rule to modulate those synapses. Would that be reasonable?
  3. Two technical issues. First, in the model I attach, I want the reward to be more sensitive; it barely moves throughout the simulation. I tried using large transform factors (-10 for odor, +10 for odor_memory). Is that reasonable?
  4. Second, how do I get rid of the initialization artifact? The odor Node jumps from 0 to its correct value at the beginning of the simulation.
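
For reference, the manual rule from issue 1 can be written down as a small plain-Python function. The names next_action, tol, and move_prob are made up for this sketch, and two details are guesses because the rule leaves them open: the turn direction after moving is picked at random, and the "was turning right" case mirrors the "was turning left" one.

```python
import random

ACTIONS = ("forward", "left", "right")

def next_action(prev_action, signal_change, rng=random, tol=1e-3, move_prob=0.2):
    """Choose the next action from the previous action and the signal change."""
    if signal_change > tol:
        return prev_action  # signal increases: continue what you are doing
    if signal_change < -tol:
        if prev_action == "forward":
            return rng.choice(("left", "right"))  # was moving: turn (direction assumed random)
        if prev_action == "left":
            return "forward" if rng.random() < move_prob else "right"
        # was turning right: mirrored case (assumption, not in the rule above)
        return "forward" if rng.random() < move_prob else "left"
    return rng.choice(ACTIONS)  # signal is stable: try everything randomly

rng = random.Random(1)
print(next_action("forward", 0.05, rng))
print(next_action("forward", -0.05, rng))
print(next_action("left", -0.05, rng))
```

Both arguments are history (the previous action and the change in the signal), which is exactly the information the single odor value does not carry.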

If you have time I would really appreciate your comments!

Thanks!

Panos

import nengo
import numpy as np
import math

import ccm.lib.grid
import ccm.lib.continuous
import ccm.ui.nengo

mymap = """
#################################################################

f

#################################################################

"""

class Cell(ccm.lib.grid.Cell):
    def color(self):
        return 'black' if self.wall else None
    def load(self, char):
        if char == '#':
            self.wall = True
        if char == 'f':
            self.reward = 10

world = ccm.lib.cellular.World(Cell, map=mymap, directions=8)

body = ccm.lib.continuous.Body()
food = ccm.lib.continuous.Body()

world.add(body, x=20, y=10, dir=2)
world.add(food, x=48, y=9, dir=2)

model = nengo.Network(seed=8)
with model:
    # Apply the selected speed and rotation to the body
    def move(t, x):
        speed, rotation = x
        dt = 0.001
        max_speed = 20.0
        max_rotate = 10.0
        body.turn(rotation * dt * max_rotate)
        body.go_forward(speed * dt * max_speed)

    movement = nengo.Node(move, size_in=2)

    odor_change = nengo.Node(size_in=1, label='reward')

    env = ccm.ui.nengo.GridNode(world, dt=0.005)

    # Odor sensed at the "antenna": the grid cell directly in front of the body
    def distance2f(t):
        (dx, dy) = world.get_offset_in_direction(body.x, body.y, int(body.dir))
        (x2, y2) = (body.x + dx, body.y + dy)
        # Fall back to the body position if the antenna leaves the map
        if x2 < 0:
            x2 = body.x
        if y2 < 0:
            y2 = body.y
        if x2 >= world.width:
            x2 = body.x
        if y2 >= world.height:
            y2 = body.y
        dis = math.sqrt((food.x-x2)**2 + (food.y-y2)**2)
        max_distance = math.sqrt(world.width**2 + world.height**2)
        # 1 at the food, decreasing with distance
        return (1 - dis/max_distance)
    odor = nengo.Node(distance2f)
    #print(antenna_pos[1])
    odor_neurons = nengo.Ensemble(n_neurons=50, dimensions=1)
    nengo.Connection(odor, odor_neurons)

    # odor_change = 10 * (odor_memory - odor)
    odor_memory = nengo.Ensemble(n_neurons=50, dimensions=1)
    nengo.Connection(odor_neurons, odor_memory, transform=1, synapse=0.05)
    nengo.Connection(odor_memory, odor_change, transform=10, synapse=0.01)
    nengo.Connection(odor_neurons, odor_change, transform=-10, synapse=0.01)

    #reward_bias = nengo.Node(1)
    #nengo.Connection(reward_bias, odor_change, transform = -0.5, synapse=0.01)

    bg = nengo.networks.actionselection.BasalGanglia(3)
    thal = nengo.networks.actionselection.Thalamus(3)
    nengo.Connection(bg.output, thal.input)

    # Initial (constant) utilities for each action
    def u_fwd(x):
        return 0.8
    def u_left(x):
        return 0.7
    def u_right(x):
        return 0.6
    conn_fwd = nengo.Connection(odor_neurons, bg.input[0], function=u_fwd, learning_rule_type=nengo.PES())
    conn_left = nengo.Connection(odor_neurons, bg.input[1], function=u_left, learning_rule_type=nengo.PES())
    conn_right = nengo.Connection(odor_neurons, bg.input[2], function=u_right, learning_rule_type=nengo.PES())

    #nengo.Connection(thal.output[0], bg.input[0])
    #nengo.Connection(thal.output[1], bg.input[1])
    #nengo.Connection(thal.output[2], bg.input[2])

    # Map the selected action onto (speed, rotation)
    nengo.Connection(thal.output[0], movement, transform=[[1],[0]])
    nengo.Connection(thal.output[1], movement, transform=[[0],[1]])
    nengo.Connection(thal.output[2], movement, transform=[[0],[-1]])

    # Error signals for the three learned connections
    errors = nengo.networks.EnsembleArray(n_neurons=50, n_ensembles=3)
    nengo.Connection(odor_change, errors.input, transform=-np.ones((3,1)))
    nengo.Connection(bg.output[0], errors.ensembles[0].neurons, transform=np.ones((50,1))*4)
    nengo.Connection(bg.output[1], errors.ensembles[1].neurons, transform=np.ones((50,1))*4)
    nengo.Connection(bg.output[2], errors.ensembles[2].neurons, transform=np.ones((50,1))*4)
    nengo.Connection(bg.input, errors.input, transform=1)

    nengo.Connection(errors.ensembles[0], conn_fwd.learning_rule)
    nengo.Connection(errors.ensembles[1], conn_left.learning_rule)
    nengo.Connection(errors.ensembles[2], conn_right.learning_rule)