I was looking at the heteroassociative memory example here and have a few questions.

I’ve noticed that there are two learning rules, Voja and PES, which are used to learn new associations. It appears that Voja modifies the encoders so that the neurons’ intercepts/preferred directions become well suited to the keys, and then, for those specific keys, PES learns the decoders that compute the weight transform needed to produce the values. Is my understanding correct?

So the Voja rule is, in a sense, learning the keys: it modifies the encoders so that neurons cluster around the inputs (keys), and from each cluster PES learns the decoders that map it to a value? When learning the decoders, how does the weight change avoid destructively interfering with previously learned key-value mappings?
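
To check my understanding, here is a toy NumPy sketch of how I picture the two rules interacting. This is not Nengo's actual implementation: the rectified "tuning curve", the learning rates, and the key/value choices are all simplified assumptions on my part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthogonal unit-vector keys and the scalar values to associate with them
key1, key2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
val1, val2 = 0.5, -0.5

n_neurons, intercept = 50, 0.5
voja_lr, pes_lr = 0.5, 0.05  # made-up learning rates
encoders = rng.normal(size=(n_neurons, 2))
encoders /= np.linalg.norm(encoders, axis=1, keepdims=True)
decoders = np.zeros(n_neurons)

def activity(x):
    # Simplified rectified tuning curve: nonzero only above the intercept
    return np.maximum(encoders @ x - intercept, 0.0)

def learn(key, value, steps=300):
    global encoders, decoders
    for _ in range(steps):
        a = activity(key)
        # Voja-style update: only the active encoders move toward the key
        encoders += voja_lr * a[:, None] * (key - encoders)
        encoders /= np.linalg.norm(encoders, axis=1, keepdims=True)
        # PES-style update: decoder change proportional to activity and error
        error = decoders @ a - value
        decoders -= pes_lr * error * a

learn(key1, val1)
recall1_before = decoders @ activity(key1)
learn(key2, val2)
recall1_after = decoders @ activity(key1)
recall2 = decoders @ activity(key2)
print(recall1_before, recall1_after, recall2)
```

In this sketch the two keys end up activating disjoint sets of neurons, so learning the second association leaves the first recall untouched, which seems to hint at an answer to my last question, but I'd like confirmation.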

Could some additional insight be provided into this:

> An important quantity is the largest dot-product between all pairs of keys, since a neuron’s intercept should not go below this value if it’s positioned between these two keys. Otherwise, the neuron will move back and forth between encoding those two inputs.
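
The failure mode described there can be reproduced in a small NumPy sketch (a simplified rectified tuning curve and a made-up learning rate, not Nengo's actual update): with an intercept below the dot product of two similar keys, an encoder that has already settled on one key keeps getting dragged toward the other; above it, the learned encoder is stable.

```python
import numpy as np

# Two similar keys whose dot product is 0.8
theta = np.arccos(0.8)
k1 = np.array([1.0, 0.0])
k2 = np.array([np.cos(theta), np.sin(theta)])

def alternate_keys(c, steps=100, lr=0.5):
    """Present k2 and k1 alternately to a neuron whose encoder starts at k1."""
    e = k1.copy()
    history = []
    for i in range(steps):
        x = k2 if i % 2 == 0 else k1
        a = max(e @ x - c, 0.0)   # active only above the intercept c
        e = e + lr * a * (x - e)  # Voja-style pull toward the active key
        e = e / np.linalg.norm(e)
        history.append(e @ k1)
    return e, history

# Intercept below k1.k2 = 0.8: both keys activate the neuron,
# so its encoder is dragged back and forth between them
e_low, hist_low = alternate_keys(c=0.5)

# Intercept above k1.k2: once the encoder sits at k1, k2 can no
# longer activate the neuron, so the learned encoder stays put
e_high, hist_high = alternate_keys(c=0.9)

print(e_low @ k1, e_high @ k1)
```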

The approach we take here is to enforce a sparse representation. Note that Voja only updates the encoders of the active neurons, that is, those for which $\mathbf{e}^T \mathbf{x} \ge c$, where $\mathbf{e}$ is the neuron’s encoder, $\mathbf{x}$ is the input (key), and $c$ is the neuron’s “intercept” on its tuning curve. By setting $c$ appropriately, the representation is sparse enough that an encoder can move towards at most one key. If a different key is presented after the first is learned, then we would like it to be dissimilar enough that the neuron does not become active (that is, $\mathbf{x}_1^T \mathbf{x}_2 < c$). Hopefully this answers your third question as well.
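
To make "setting $c$ appropriately" concrete: the largest dot product over all pairs of keys gives a lower bound, since any intercept above it satisfies $\mathbf{x}_1^T \mathbf{x}_2 < c$ for every pair. A NumPy sketch with made-up random keys (the 0.1 margin is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# A made-up set of unit-vector keys in D dimensions
n_keys, D = 10, 32
keys = rng.normal(size=(n_keys, D))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# Largest dot product over all distinct pairs of keys
dots = keys @ keys.T
np.fill_diagonal(dots, -np.inf)
max_pairwise = dots.max()

# Any intercept strictly above this value guarantees that a neuron
# whose encoder sits at one key cannot also be activated by another key
intercept = max_pairwise + 0.1
print(max_pairwise, intercept)
```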

Some more detailed information about reasonable distributions of intercepts can be found in this technical report as well as the documentation for nengo.dists.CosineSimilarity. Figuring out the right way to get a scalable sparse representation is really the crucial problem in general – one that is connected to the so-called “curse of dimensionality”. The approach here makes some headway, but it is still a difficult problem to solve in general (without prior knowledge), and it belongs to an active area of research known as “one-shot learning”.
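
The connection to dimensionality can be seen empirically. As I understand it, nengo.dists.CosineSimilarity(d) is the exact distribution of the dot product between two random unit vectors in d dimensions; the sketch below just samples that quantity directly rather than using Nengo:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_samples(d, n=20000):
    """Empirical dot products between pairs of random unit vectors in d dims."""
    u = rng.normal(size=(n, d))
    v = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.sum(u * v, axis=1)

stds = {d: cosine_samples(d).std() for d in (2, 16, 128)}
# Random vectors concentrate near orthogonality as dimensionality grows,
# so in high dimensions even a small positive intercept yields sparsity
print(stds)
```

The spread shrinks roughly like $1/\sqrt{d}$, which is why intercepts chosen via CosineSimilarity depend on the dimensionality.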

Ah, I think that makes sense. I need to do a bit more digging; it’s starting to sound partly similar to the way SPA vectors are organized so that they can be reliably extracted by an auto-associative memory or a cleanup memory. So in a way, it seems almost natural to use SPA vectors as the keys and/or values, but with learning applied? If that is the case, it is quite helpful to see learning applied to SPA concepts, though it is unclear to me how it performs on vectors that are bound or collected together, particularly since I haven’t seen many examples (if there are any) of SPA and learning rules really mixed together.