I was wondering why one would use the dot product as a measure of similarity, which is the default in spa.similarity(), rather than the cosine similarity.
When creating the semantic pointers (SPs) that make up a vocabulary, the spa.similarity() function is used to calculate the similarity between a newly made SP and the existing ones, and the new SP is accepted only if its similarity to the existing SPs stays below a pre-defined threshold (max_similarity). Given that the cosine similarity of two vectors is calculated as cos(u, v) = dot(u, v) / ( ||u|| ||v|| ), and that all SPs created with the vocab.create_pointer() function are made to have an L2-norm (length) of 1, the dot product and the cosine similarity give the same value for vocabulary items.
However, when we actually probe a vector representation from an SPA network/module, it seems to me there is no guarantee, nor is it at all likely, that this vector representation will have an L2-norm of 1. In that case, when we compare this vector to an SPA vocabulary item, the dot product and the cosine similarity will not give the same value.
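To make the difference concrete, here is a minimal NumPy sketch. The vectors are made up for illustration (a random unit vector standing in for a vocabulary item, and a scaled copy standing in for a probed output); nothing here calls Nengo itself.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
d = 64

# Hypothetical vocabulary item, normalized to unit length.
a = rng.standard_normal(d)
a /= np.linalg.norm(a)

# Hypothetical "probed" output: same direction as `a`, but with norm 1.5.
x = 1.5 * a

# For unit vectors the two measures agree (both 1, up to floating-point error).
dot_unit, cos_unit = np.dot(a, a), cosine(a, a)

# For the scaled vector they differ: the dot product reports 1.5,
# while the cosine similarity still reports 1.
dot_scaled, cos_scaled = np.dot(x, a), cosine(x, a)
```

So with the angle held fixed, the dot product grows linearly with the magnitude of the probed vector, which is exactly the effect I am asking about.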
Now, given that the initial similarity check at the creation of a vocabulary judges only the angle between vectors and not their magnitude, why would one use the dot product as a measure of similarity over the cosine similarity when performing analysis? That is, using the SPA, do we believe that a vector x1 is more similar to a vocabulary item if its magnitude is bigger, when the angle with the vocabulary item is the same?
There are a few reasons why normalize=False is the default:
By normalizing you lose some information about the “strength” (norm) of the vector. This might be relevant to understand network behavior and see if it is behaving properly. For example, if the norm of the vector is very small, this might be an indication that the network is not working properly, but after normalizing, everything could seem to be OK. Also, if you aim to produce approximately normalized vectors in the network, not normalizing gives you a better idea of how well you are achieving this. That being said, there are certainly situations where you want to normalize for the comparison, which is why the normalize argument is provided.
Normalization is hard to implement in neurons (without making additional assumptions, e.g. about the input range). This makes the default of spa.similarity closer to what could be achieved in a neural network.
When creating a vocabulary, Semantic Pointers are created with unit length (or approximately unit length) so that normalization should not make a difference (or almost no difference).
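As a rough illustration of the first point, here is a small NumPy sketch with made-up data (the decaying `data` array is a hypothetical stand-in for probe output, not something produced by Nengo). The output keeps pointing in the right direction while its strength fades:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target Semantic Pointer with unit length.
target = rng.standard_normal(32)
target /= np.linalg.norm(target)

# Hypothetical probe data: the output points at `target` but decays toward zero.
scales = np.array([1.0, 0.5, 0.1, 0.01])
data = scales[:, None] * target              # shape (4, 32)

# The dot product follows the decay, so a fading representation is visible.
dot = data @ target

# The normalized similarity stays at 1 throughout and hides the decay.
cos = dot / np.linalg.norm(data, axis=1)
```

Here the unnormalized similarity would show that something is going wrong, whereas the normalized one looks perfect at every time step.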
Hi @ChielWijs, and welcome back to the Nengo forums!
Here are my thoughts regarding dot products, cosine similarity and the SPA…
This is absolutely the case. When you go to plot a graph using the spa.similarity() function, the default is to use the dot product, and not the cosine similarity. And it is often the case that, due to noise introduced by neurons and whatnot, the dot product and cosine similarity do not return the same value.
While it’s true that the analysis uses the dot product instead of the cosine similarity, there are several reasons we tend to do this as the default:
As @jgosmann mentioned, you lose the magnitude information when you normalize the SPA vectors.
The dot product is easier to compute (just a matrix multiplication vs needing a division). Because of this, implementing it in a neural network is fairly simple since you can embed the dot product in the connection weights (if you are comparing to a static vector), or use the Product network to build a dot-product network (if you are comparing two variable vectors).
For higher dimensional space (relative to the vocabulary size), SPA vectors tend to be more and more orthogonal to each other as the dimensionality of the vector increases. This means that when it comes to comparing vectors to each other, even if one vector exceeds the expected magnitude of 1, it doesn’t make too much of a difference if the expected dot product (or cosine similarity) w.r.t. the other vector is close to 0 (because they are orthogonal).
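The orthogonality point can be checked numerically. This is a sketch under the assumption that vectors are drawn as random Gaussian vectors and then normalized; the helper name `mean_abs_similarity` is made up for this example, and the real vocabulary generation may differ in its details.

```python
import numpy as np

def mean_abs_similarity(d, n_vectors=50, seed=0):
    """Average |dot product| between distinct random unit vectors of dimension d."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_vectors, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # unit length per row
    sims = v @ v.T                                  # all pairwise dot products
    off_diag = sims[~np.eye(n_vectors, dtype=bool)] # drop self-similarities
    return np.abs(off_diag).mean()

low = mean_abs_similarity(16)    # noticeably away from 0
high = mean_abs_similarity(512)  # much closer to 0
```

Since the similarity to unrelated vocabulary items sits near 0 in high dimensions, multiplying it by a magnitude somewhat different from 1 still leaves it near 0.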
The “expected magnitude of 1” property holds true for newly created vectors, as well as vectors that have been bound together (although, repeated bindings can break this expectation). Where it doesn’t hold true is when you start doing superpositions (vector additions), but if your SPA algebra doesn’t have a lot of additions, then the expected magnitude of the binding operations is roughly 1. (Note: for unitary SPA vectors, even with repeated bindings, the vector magnitude will always remain 1, except in the case of superpositions).
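Here is a small NumPy sketch of these magnitude properties. It assumes binding is circular convolution (computed via the FFT) and that a unitary vector is one whose Fourier coefficients all have magnitude 1; the helper names `unit_vector`, `bind`, and `make_unitary` are made up for this example and are not the Nengo API.

```python
import numpy as np

def unit_vector(d, rng):
    """Random Gaussian vector scaled to unit length."""
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def bind(a, b):
    """Circular convolution computed in the Fourier domain."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def make_unitary(v):
    """Rescale every Fourier coefficient of v to magnitude 1."""
    f = np.fft.rfft(v)
    return np.fft.irfft(f / np.abs(f), n=len(v))

rng = np.random.default_rng(0)
d = 512
a, b = unit_vector(d, rng), unit_vector(d, rng)

norm_bound = np.linalg.norm(bind(a, b))  # roughly 1, but not exactly
norm_sum = np.linalg.norm(a + b)         # roughly sqrt(2) for near-orthogonal a, b

# Repeatedly binding a unitary vector with itself keeps the norm at 1.
u = make_unitary(unit_vector(d, rng))
p = u
for _ in range(10):
    p = bind(p, u)
norm_unitary = np.linalg.norm(p)
```

The superposition is the case to watch: adding two near-orthogonal unit vectors gives a norm near sqrt(2), so its dot product with either component stays near 1 while its cosine similarity drops to about 0.71.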
I should also note that when working with the SPA and neural networks, it must be kept in mind that neurons tend to make things a little “loosey-goosey” (i.e., not mathematically exact). That being the case, building networks that function even with a little bit of slop is good practice (one good way to do so is to ensure that the vectors are distinct enough that the network doesn’t have to work super hard to distinguish them), and in such a case, using the dot product (instead of the “exact” cosine similarity) is reasonable. Although… if you do need precision in your networks, it is possible (though difficult) to implement a normalization network to normalize vectors before computing a similarity metric. I discuss normalization networks in my PhD thesis (section 4.4).
I see. Given the third point that @xchoo makes, and if one chooses to base their analysis (e.g. the average similarity of the network output and target semantic pointer) on a large number of trials/datapoints, I would think this “false sense” of similarity should not influence that analysis too much when the vocab dimensionality is large enough.
For my master’s thesis I will compare the accuracy of memory recall (or to be more precise, the difference in similarity between the presented stimulus SP and the model output) for two different declarative memory models. Given that these are quite different (in both size and components), I think that this is one of those cases where normalization is needed, as magnitude might give a false sense that one model outperforms the other.
An interesting point, and one that I will look into further, thanks.
Thanks, I have just started work on my master’s thesis, so I expect that this won’t be the last you see of me.
For my use, I only need the spa.similarity() function to compare my model output to a vocabulary, so no need to implement it in the network itself, as I will just be using a probe to collect the signal, and analyse it after simulation.
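For reference, this is roughly the kind of offline comparison I have in mind, written as a plain NumPy sketch. The function `similarity_to_vocab` is my own hypothetical helper and not the actual spa.similarity() implementation; `data` stands for probed output of shape (timesteps, dimensions) and `vocab_vectors` for the vocabulary vectors of shape (items, dimensions).

```python
import numpy as np

def similarity_to_vocab(data, vocab_vectors, normalize=False):
    """Similarity of each row of `data` to each vocabulary vector.

    Returns an array of shape (timesteps, items): dot products by default,
    cosine similarities if `normalize` is True.
    """
    data = np.asarray(data, dtype=float)
    vocab_vectors = np.asarray(vocab_vectors, dtype=float)
    sims = data @ vocab_vectors.T                     # one matrix multiplication
    if normalize:
        data_norms = np.linalg.norm(data, axis=1, keepdims=True)  # (T, 1)
        vocab_norms = np.linalg.norm(vocab_vectors, axis=1)       # (N,)
        # Guard against division by zero for all-zero rows.
        sims = sims / np.maximum(data_norms * vocab_norms, 1e-12)
    return sims

rng = np.random.default_rng(0)
vocab_vectors = rng.standard_normal((3, 64))
vocab_vectors /= np.linalg.norm(vocab_vectors, axis=1, keepdims=True)

# Two hypothetical model outputs with the same direction but different strength.
model_a = 0.6 * vocab_vectors[[0]]
model_b = 1.2 * vocab_vectors[[0]]

raw_a = similarity_to_vocab(model_a, vocab_vectors)[0, 0]
raw_b = similarity_to_vocab(model_b, vocab_vectors)[0, 0]
cos_a = similarity_to_vocab(model_a, vocab_vectors, normalize=True)[0, 0]
cos_b = similarity_to_vocab(model_b, vocab_vectors, normalize=True)[0, 0]
```

With the raw dot product, the second model would look twice as good as the first, while the normalized values are identical, which is why I plan to normalize when comparing the two models.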