Using Nengo core for continuous-space reinforcement learning

I’m trying to learn how to use reinforcement learning with Nengo. I’ve seen several examples, but all of them use discrete action and state spaces. What about continuous action spaces?

Are there any examples or papers that discuss continuous-action-space RL with Nengo?

I don’t think there are any Nengo examples (at least none maintained by the Nengo dev team) that explicitly use a continuous action space. There is one example that comes close, though. This NengoFPGA RL example has an agent that uses discrete actions (go forward, go left, go right), but these actions are combined as a weighted sum, so the agent operates with an effectively continuous action space.
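To give a rough idea of what that weighted-sum trick looks like in "regular" Nengo, here is a minimal sketch. This is not the actual NengoFPGA code; the action-to-velocity mapping and the ensemble sizes below are made up for illustration:

```python
import numpy as np
import nengo

# Hypothetical mapping from the three discrete actions to a
# (linear velocity, angular velocity) command; the real NengoFPGA
# example uses its own action definitions.
ACTION_VECTORS = np.array([
    [1.0,  0.0],   # go forward
    [0.5,  1.0],   # go left
    [0.5, -1.0],   # go right
])

with nengo.Network() as model:
    # Utility of each discrete action, as computed by the agent
    utilities = nengo.Ensemble(n_neurons=300, dimensions=3)

    # The continuous motor command is the utility-weighted sum of the
    # action vectors, so the agent is not restricted to picking just
    # one of the three discrete actions.
    motor = nengo.Node(size_in=2)
    nengo.Connection(utilities, motor, transform=ACTION_VECTORS.T)
```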

Note that the NengoFPGA example does contain some FPGA-specific code, but it can be converted back to “regular” Nengo code quite easily. See this documentation for an explanation.

Yeah, I’m aware of that example. I was thinking about something similar to an actor-critic approach…

I don’t believe we have any actor-critic examples, unfortunately.

I see. I’m looking for more advanced reinforcement learning. The example doesn’t work well for more complex, “real-life” problems (I tried).

The Nengo models are very interesting, especially for real-time learning. But PES is a supervised learning rule, which means someone or something has to provide the “right action” in real time. That isn’t ideal, since if I already had such a solution, I wouldn’t need to learn it :slight_smile:
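To make this concrete, here is a minimal sketch of what PES learning requires. The teacher node below is just a placeholder to show that something has to supply the “right action” for the error to be computed:

```python
import numpy as np
import nengo

with nengo.Network() as model:
    state = nengo.Ensemble(n_neurons=200, dimensions=2)
    action = nengo.Ensemble(n_neurons=100, dimensions=1)

    # The decoders of this connection are what PES adapts
    conn = nengo.Connection(
        state, action,
        function=lambda x: 0.0,  # start by doing nothing
        learning_rule_type=nengo.PES(learning_rate=1e-4),
    )

    # PES needs an explicit error signal (actual - target), so something
    # has to provide the "right action"; here it is a made-up teacher node.
    teacher = nengo.Node(lambda t: np.sin(t))
    error = nengo.Ensemble(n_neurons=100, dimensions=1)
    nengo.Connection(action, error)
    nengo.Connection(teacher, error, transform=-1)
    nengo.Connection(error, conn.learning_rule)
```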

The really exciting scenario is reinforcement learning in real time, where the model learns to act in the environment by “itself”. But the only example isn’t very robust, so I’m wondering whether there are more advanced reinforcement learning examples.

The question you are posing is effectively a research question (and one that’s probably being worked on somewhere), so it’s a little outside the scope of this forum. As a starting point, however, I can recommend checking out the publications from the CNRG lab at the University of Waterloo (link) to see if there is anything relevant to your question. Publications by Daniel probably have the most relevance, though Terry might have some as well.

I see. I actually didn’t mean to ask a research question; I was looking for more advanced and common reinforcement learning techniques, such as Q-learning, actor-critic, etc., implemented in Nengo.

The NengoFPGA example is nice, but it has some problems that have been discussed on this forum, such as quickly forgetting that hitting the wall is a bad thing, and most of the time it ends up just spinning in place (not driven by positive reward).

Actually, I’m surprised that this is an active research question. I will take a look at the publications you suggested. Thank you!

Hello, I’m a researcher in the CNRG, and I’m using Nengo to build RL agents that use Q-learning, actor-critic, etc. My projects are still a work in progress, and nothing has been published yet, but perhaps I can answer some questions.

Unfortunately, I don’t work with continuous action spaces. A colleague of mine in the CNRG, @yaffa, is currently developing methods for continuous action spaces using Spatial Semantic Pointers.


That’s interesting! Can’t wait to see your work :slight_smile:

Hi @nrofis,
I’m a graduate student and currently also working on related questions. In my particular case, I am more interested in providing a target state to the model than a (potentially sparse) reward signal. I guess that puts my work outside the classical RL domain, but the algorithms we try to implement in SNNs, currently using Nengo and Nengo DL, are similar. Learning here is driven by the difference between a target state and the currently observed state.

An initial prototype works well with PES, but there are some heavy assumptions to be made in this case, because the error signal (and hence the components it is derived from) needs the same dimensionality as the output variable to be learned. To make this more concrete: if a policy network outputs 3 continuous actions, then the error signal used to change these actions in real time must also have a size of exactly 3. This also effectively confines learning to a single layer of the policy “network”, which can be restrictive.

I am currently working on translating this idea to a DL domain where these limitations do not hold, though of course learning is then offline, in batches, based on BPTT, etc. It will still be a while until any of this is done, tested, and published, but that is planned for the coming months. :slight_smile:
