Supervised learning

Trying to follow up on this discussion

What Eric suggests (using a set of input/training and output/target vectors when defining the function on a connection) is a way to achieve certain responses from the post-ensemble when it receives certain stimuli from the pre-ensemble.
In what way does Nengo achieve this? What optimization procedure does it follow?
Is this a way to achieve supervised learning? Is it possible to control this procedure?

More explicitly, let’s assume that we have 5 different stimulus vectors (decoded presynaptically) which we want to associate with 5 different target vectors (decoded postsynaptically) through the pre-post connection. BUT we want to train the connection differentially on these examples:
- Define a total amount of time (or a total number of training examples) that the training will last
- Define a probability distribution that dedicates the appropriate training time to each training example (e.g. [0.2, 0, 0.3, 0.1, 0.4])
How would we achieve this?

The big picture is to combine supervised learning / training as a form of replay experience (during embryogeny or during sleep) with actual reinforcement learning during an agent simulation.

Any comments/suggestions/ideas are welcome

Interesting questions! The way that Nengo optimizes connections (whether they have explicit input/output training examples, or if you pass a function) is usually through least-squares minimization. We first determine how all of the neurons in the pre-ensemble respond to the given input, then solve for a least-squares optimal set of weights to give us the outputs we want when the neural responses are weighted by those weights and summed.

I say that we usually use least-squares minimization because the way in which you solve for decoders can be modified; the Connection object has a solver parameter to change how decoding weights are solved for. The solvers contain the logic to optimize the weights, and can be seen here. However, it’s kind of difficult to understand that file without knowing a lot of the rest of Nengo, so we have also provided a more minimal description of how Nengo works here; ctrl+f for the compute_decoder function to see how we solve for decoders.
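To make the least-squares step concrete, here is a toy sketch in plain NumPy. It is illustrative only: the tuning curves, shapes, and variable names are made up, and Nengo’s actual solvers additionally regularize and account for neural noise. The idea is the same as compute_decoder: collect the neurons’ activities at a set of evaluation points, then solve a least-squares problem for the decoders.

```python
import numpy as np

# Toy sketch of least-squares decoder solving. Tuning curves and
# names are made up; Nengo's real solvers also regularize.
rng = np.random.default_rng(0)
n_neurons, n_points = 20, 100

eval_points = np.linspace(-1, 1, n_points)[:, None]      # (100, 1) inputs
encoders = rng.choice([-1.0, 1.0], size=(n_neurons, 1))  # preferred directions
gains = rng.uniform(0.5, 2.0, size=n_neurons)
biases = rng.uniform(-1.0, 1.0, size=n_neurons)

# Activities: rectified-linear response of each neuron at each eval point
A = np.maximum(0, gains * (eval_points @ encoders.T) + biases)  # (100, 20)

# Target outputs for the function f(x) = x ** 2
targets = eval_points ** 2

# Least-squares optimal decoders: minimize ||A @ decoders - targets||^2
decoders, *_ = np.linalg.lstsq(A, targets, rcond=None)

rmse = np.sqrt(np.mean((A @ decoders - targets) ** 2))
```

With enough neurons the reconstruction error is small; adding duplicate evaluation points (as discussed below) simply reweights rows of this least-squares problem.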

To answer your questions directly, you can think of this offline optimization as a type of supervised learning, sure. However, it’s an offline learning procedure that results in static weights that do not change during the simulation. You could instead use the PES rule to do supervised learning online if you want that.

However, you don’t have to resort to online learning to modify the optimization process. I think the easiest way to achieve the results you want is to modify the inputs and outputs.

Say you had 5 input / output pairs. If you want some pairs to be decoded more accurately than others, you can add additional instances of those input / output pairs. Since the objective of the optimization procedure is to minimize the total squared error over all input / output pairs, including the same pair twice doubles its contribution to the total error, and so the weights that result from the optimization will be more likely to decode well for that particular input / output pair than for the others.

So, for the specific example you gave of weighting the 5 pairs [0.2, 0, 0.3, 0.1, 0.4], you could do something like:

import numpy as np
import nengo

# 5 input / output pairs (1-dimensional here; use shape (5, d) for d dimensions)
inputs = np.random.rand(5, 1)
outputs = np.random.rand(5, 1)

weighted_inputs = []
weighted_outputs = []
# integer counts proportional to the weighting [0.2, 0, 0.3, 0.1, 0.4]
for i, count in enumerate([2, 0, 3, 1, 4]):
    weighted_inputs.extend([inputs[i] for _ in range(count)])
    weighted_outputs.extend([outputs[i] for _ in range(count)])

nengo.Connection(pre, post,
    eval_points=weighted_inputs, function=weighted_outputs)

To do the general case of a probability distribution, you’ll have to do some thinking about how best to generalize that snippet above, but it’s definitely doable.
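One way to generalize the snippet (a sketch, not the only approach): draw a fixed budget of training examples from the pairs according to the probability distribution, e.g. with np.random.choice, and pass the sampled arrays as eval_points / function. The sizes and dimensions below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 input / output pairs (2-dimensional here; dimensions are illustrative)
inputs = rng.uniform(-1, 1, size=(5, 2))
outputs = rng.uniform(-1, 1, size=(5, 2))

# Probability distribution over the pairs, as in the original question
probs = np.array([0.2, 0.0, 0.3, 0.1, 0.4])

# Draw a fixed budget of training examples according to the distribution
n_samples = 500
idx = rng.choice(len(inputs), size=n_samples, p=probs)

weighted_inputs = inputs[idx]
weighted_outputs = outputs[idx]

# These arrays could then be passed to the connection, e.g.:
# nengo.Connection(pre, post,
#                  eval_points=weighted_inputs, function=weighted_outputs)
```

Pairs with zero probability are never sampled and so contribute nothing to the optimization.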


That was swift and crystal clear. Thanks Trevor!

Questions:

  1. You say "it’s an offline learning procedure that results in static weights that do not change during the simulation". Does that mean that after the weights are set, no learning is possible? Because if that is the case, it is not what I need.
  2. How is it possible then to leave the weights of a connection random, so that you can perform learning afterwards? Or is it obligatory to define a function? If you just define a baseline output (e.g. using the 0 vector as the single evaluation point and target outcome), are the weights still not modifiable?
  3. Is there an example model for training a connection (e.g. by defining training data and target)? What function was used to set the initial weights?

Thanks
Panos

Sounds like you’re more interested in the online learning networks rather than offline optimization. The approach to online learning in Nengo is a bit more involved than in other neural network systems because we operate in continuous time and require that error signals be computed within the network itself.

The communication learning example is a good place to start for making supervised learning networks. The square learning example might also be helpful as it uses an input signal that changes from one value to another after a certain length of time (which is how you could achieve the weighted learning you asked about in your original post). There are a few other learning examples as well.

The main difference between what you want to achieve and what these learning examples do is in how you calculate the error signal. In these networks, we calculate it based on some function (communication channel, product, square, etc). In your network, you would calculate it based on the input / output pairs.
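For intuition about what the PES rule does with that error signal, here is a simplified NumPy sketch of the decoder update. It assumes constant activities and omits Nengo’s per-neuron gain scaling and synaptic filtering, so it is a caricature of the rule, not Nengo’s implementation.

```python
import numpy as np

# Simplified sketch of a PES-style decoder update (assumptions:
# constant activities, no filtering, no per-neuron gain scaling).
rng = np.random.default_rng(0)
n_neurons, dims = 50, 2

decoders = np.zeros((n_neurons, dims))
activities = rng.uniform(0, 1, size=n_neurons)   # stand-in filtered rates
target = np.array([0.5, -0.3])                   # desired decoded output

kappa = 1e-3                                     # learning rate
for _ in range(5000):
    decoded = activities @ decoders              # current decoded output
    error = decoded - target                     # error signal
    # PES: adjust each neuron's decoder against the error,
    # in proportion to that neuron's activity
    decoders -= kappa * np.outer(activities, error)
```

In your case, the error at each moment would be the difference between the current decoded output and the target paired with the current input.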


Yes you are right!
I think the associative memory learning example using the Voja rule is the most appropriate.

Some basic questions:

1.a. When using nengo.Connection(ensemble, conn.learning_rule) how exactly does the output of the ensemble modulate the learning rule?
1.b. When using nengo.Connection(ensemble, conn.learning_rule, synapse=0.05) the synapse is again the time constant for the spike current towards the learning rule?

2.a. When using nengo.Probe(ensemble, synapse=None) the synapse is the time constant of the spike current towards Probe? So Probe can be considered a node for scripting purposes?
2.b. When using nengo.Probe(conn.learning_rule, 'scaled_encoders'), is 'scaled_encoders' a built-in attribute? What other built-in attributes are there to probe on connections?

  3. When running the script in the web-based Nengo I get the
    Warning: Simulators cannot be manually run inside nengo_gui (line 145)
    and I cannot see any plots although they have been set up in the script. Do I have to terminate the sim somehow to see the plots (which are scripted after sim.run)?

  4. Is the web-based Nengo the only one available now?

Thank you

Concerning the associative learning example using the Voja learning rule :

Correct me if I am wrong: the Voja rule modulates the encoders of the postsynaptic ensemble to focus on the specific decoded patterns of the presynaptic ensemble. Therefore we create neurons tuned to specific patterns (as shown in the plots).
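For intuition, the encoder update that the Voja rule performs can be sketched in plain NumPy. This is a simplification: the thresholded activity model below is made up, and Nengo additionally scales by neuron gains and filters the activities.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, d = 30, 3                      # made-up sizes for illustration

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

encoders = normalize(rng.normal(size=(n_neurons, d)))
encoders_init = encoders.copy()
x = normalize(rng.normal(size=d))         # the presented (decoded) pattern

threshold = 0.3                           # stand-in for the intercept
sim_before = encoders @ x
active0 = sim_before > threshold          # neurons initially responding to x

kappa = 0.1
for _ in range(500):
    # illustrative activity model: thresholded similarity to the input
    a = np.maximum(0, encoders @ x - threshold)
    # Voja update: delta_e_i = kappa * a_i * (x - e_i)
    encoders += kappa * a[:, None] * (x - encoders)

# Active neurons' encoders are pulled toward x; neurons whose
# similarity is below the threshold never move
sim_after = encoders @ x
```

This shows the "attraction" behavior: only neurons that already respond to a pattern become tuned to it, which is why the intercepts matter so much below.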

Let’s say that, as in the example, we initially present 5 patterns with the Voja learning rule always active, and for a time period long enough that we accomplish encoder modulation. If afterwards a new unknown pattern is presented to such an ensemble:

  1. How will it be initially decoded? Is it based on its similarity to the previously presented patterns, or do we have no way of knowing (as it will depend on all the different encoders)?

  2. Will it be able to attract neurons to represent it as well? Or is it a case of overfitting, where the ensemble is tuned irreversibly to a set of initial patterns without the plasticity to learn new ones?

  3. If you desired to set up an associative memory where some patterns are initially learned to a certain extent (as in the example) but the memory is still plastic enough to learn new ones, what would you use?

  4. It is mentioned that we have to define the intercept of the neurons by this equation:
    intercept = (np.dot(keys, keys.T) - np.eye(num_items)).flatten().max()
    But if we want the memory to be able to recognize unknown patterns, the whole keys array will not be known initially. Will this be an issue?

First, it depends on which neurons become active. This in turn depends on which ‘patterns’ have been presented so far, the initial distribution of the encoders, and the chosen intercepts (thresholds). This can be well understood in terms of dot-products between each key and encoder, their corresponding thresholds, and the order of presentation.
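That dot-product picture can be sketched numerically: with a thresholded activity model, a neuron can respond to a key only if the key–encoder dot product exceeds its intercept. This is an illustrative sketch with made-up sizes, not the example's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_neurons, n_keys = 16, 200, 5     # made-up sizes for illustration

def unit_vectors(n, d, rng):
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

encoders = unit_vectors(n_neurons, d, rng)
keys = unit_vectors(n_keys, d, rng)
intercept = 0.5                       # shared threshold, as in the example

# Neuron j can respond to key i only if their dot product exceeds
# the intercept; each row shows which neurons a key would recruit
active = keys @ encoders.T > intercept           # (n_keys, n_neurons) bool

# With a high intercept, each key recruits only a small subset of neurons
fraction_active = active.mean(axis=1)
```

Which neurons a new key activates, and hence how it is initially decoded, falls directly out of this matrix of dot products and thresholds.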

Second, it depends on what the decoders of those neurons are. This in turn depends on what they were initialized to (the initially decoded function), and any changes that may have occurred if those neurons were previously active during PES learning.

This all depends on how you’ve understood and set everything up. With large intercepts and sufficiently many neurons there will always be unused neurons to attract. But the scaling may not be ideal due to the curse of dimensionality. This is related to unsolved problems in mathematics, including sphere packing and the kissing number problem. Last year I helped show in this paper with the SpiNNaker group that we can use knowledge about 24-dimensional spaces to make this scale to 196,560 patterns tiled along the Leech lattice (also see here). Higher dimensions are not well understood.

Back to your question, if new keys are presented near old ones, then those old neurons will be attracted (and their decoded value will be the same as it was before).

At a meta level, this was the simplest encoder learning rule that we could think of that would be useful in certain situations. More elaborate variants could be explored to help in other situations. For instance, you could add noise to each dimension to convert the point of attraction into a sphere of attraction. This would help keep some of the encoders available to be moved elsewhere without completely destroying their previous association. You could also add some sort of random walk or repelling mechanism to support slow and heterogeneous forgetting. It’s hard to know what’s best without details of the problem.

I would use larger intercepts and more neurons. If scaling becomes difficult, refer to the above answers.

It is also worth mentioning that these learning rules are set up to handle the “worst case”, in the sense that nearby keys can have arbitrarily unrelated values (i.e., a nonlinear map). If you know that there are linear relationships between keys and values for sets of active neurons, then it should be possible to exploit this knowledge to improve scalability. This is something that @jgosmann has been looking into very recently for a model of associative learning in the hippocampus.

That should be the minimum allowable intercept to avoid catastrophic forgetting (a.k.a. “interference”). If the intercepts are greater than the maximum pairwise dot product between all patterns, then this is a necessary and sufficient condition to guarantee that none of your encoders move back and forth between different keys. But maybe that is something you might want in some situations or for certain nearby keys. The intercepts are also set the same only for convenience / in the absence of prior knowledge.
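The condition can be computed directly from the expression in the example: the minimum safe intercept is the maximum dot product between distinct keys. A small sketch with made-up keys:

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, d = 5, 16                  # made-up sizes for illustration

keys = rng.normal(size=(num_items, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# Subtracting the identity zeroes the diagonal, so the max is taken
# over dot products between *distinct* keys (each self-dot is 1)
intercept = (np.dot(keys, keys.T) - np.eye(num_items)).flatten().max()

# Any intercept above this value ensures no encoder can be pulled
# back and forth between two different keys
```

If the keys are not known in advance, this exact quantity cannot be computed, which is where the statistical priors mentioned below come in.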

There should always be something that you know about your keys array. For example, for uniformly distributed high-dimensional vectors we know the probability that the dot-product of two random vectors exceeds a given threshold, as well as the inverse of this relationship (see this code and this technical report). In general you can do something similar any time you have some statistical prior.

Thanks for the great set of questions. Sorry that the situation isn’t as “clean” as you might want. 🙂
