What Is Cosine Similarity Anyway?

Sep 14, 2019 00:00 · 1753 words · 9 minute read nlp math explainer

In my previous post about story similarity in the work of H.P. Lovecraft, I used cosine similarity to measure relatedness between documents. A trigonometric function seems like a strange way of working with documents. Why not use something simpler, like the Jaccard index? But there’s a good (and intuitive) reason, I promise. In a lot of NLP tutorials (and also in stats/ML in general, to be fair), formulas and measures are presented as if they were natural laws. I want to describe the motivation for each and explain their relationship.

Jaccard index

A basic way of looking at document similarity is to count how many words two documents share. Formally, the overlap is the intersection of the two documents when considered as sets of words: \(S_{overlap}(\mathbf{a}, \mathbf{b}) = |\mathbf{a} \cap \mathbf{b}|\)

The Jaccard index (also called intersection-over-union) has a pretty straightforward motivation: we want to see how much overlap there is between two documents. We consider the documents as two sets of words. The intersection of two sets contains the items present in both sets (the overlap of a Venn diagram), while the union of two sets consists of the items present in either set. Represented mathematically, the Jaccard index is \[S_{Jaccard}(\mathbf{A},\mathbf{B}) = J(\mathbf{A},\mathbf{B}) = \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}\]

Clearly the Jaccard index doesn’t consider any sort of weighting of words. It’s simply counting and taking a ratio. We’re also not considering any words that neither document has. Jaccard ranges from 0 (no overlap) to 1 (‘total’ overlap). A Jaccard index of 1 doesn’t mean that the documents are identical: the overlap at \(J(A,B)=1\) is ‘total’ in the sense that both documents contain the same set of words. The phrases “butter not guns” and “guns not butter” would have the same Jaccard index.
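To make that concrete, here’s a minimal sketch in Python (the tokenization is just a whitespace split, which is an oversimplification):

```python
def jaccard(doc_a, doc_b):
    """Jaccard index of two documents treated as sets of words."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

# Same set of words, so the index is 1.0 despite the different word order.
print(jaccard("butter not guns", "guns not butter"))  # 1.0
# Two of the four words in the union are shared.
print(jaccard("guns and butter", "butter not guns"))  # 0.5
```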

Cosine Similarity

The other common measure of document similarity is the **cosine similarity**. With cosine similarity we’re not thinking about sets anymore, we’re thinking about vectors in a vector space. A vector ‘points’ from the origin in the amounts specified by its elements. So the vector [1, 5, 3] in a 3D space points 1 unit along the x axis, 5 along the y, and 3 along the z. Using vector spaces makes our similarity measure more complicated, but it has a major advantage: instead of looking at Yes/No term membership in documents, we can now weight words by importance with procedures like tf-idf. Beyond being able to use numbers instead of Yes/No indicators of set membership, we can bring in a lot of math to understand what’s going on. The cosine feels like a natural choice for a similarity measure: it ranges from -1 to 1, and gets larger when the angle in question is smaller. We know that the cosine of the angle between two vectors can be computed from the vectors themselves: \[S_{cosine}(\mathbf{a}, \mathbf{b}) = \cos \theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\| \mathbf{a} \| \| \mathbf{b} \|}\]

Even though vector spaces are more ‘mathy’ than the count-and-divide conception of Jaccard similarity, the concept of what we’re describing is still the same: similar documents have more terms in common. Let’s describe what’s going on. The dot product of two vectors describes the extent to which the vectors point in the same direction. It is defined as \(\mathbf{a} \cdot \mathbf{b} = \sum \mathbf{a}_i \mathbf{b}_i\). A dot product is a scalar, and can be positive or negative. If the dot product is zero, the vectors are perpendicular (properly: they are orthogonal) and don’t point in the same direction at all. If positive, they point in the same general direction. If negative, they point in opposite directions. The numerator of cosine similarity is a bare dot product. The norm of a vector describes its length or magnitude: how ‘far’ it is pointing. In standard Euclidean spaces we compute the norm as \(\| \mathbf{a} \| = \sqrt{\sum \mathbf{a}_{i}^{2}}\). Norms are always non-negative scalar values.
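Here’s a quick NumPy sketch of those two pieces (the vectors are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 5.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

dot = np.dot(a, b)                # 1*2 + 5*0 + 3*4 = 14.0
norm_a = np.sqrt(np.sum(a ** 2))  # the definition above; same as np.linalg.norm(a)
norm_b = np.linalg.norm(b)

print(dot, norm_a, norm_b)        # 14.0, ~5.92, ~4.47
```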

To understand what cosine similarity is doing, we need to consider both parts of the equation. The numerator can only increase when a term is present in both documents, and it goes up by the product of the two weightings. So this part of the equation can be understood as a kind of numerical intersection, where more strongly weighted words increase its value. The denominator of the cosine similarity is the product of the two vectors’ norms. Dividing a vector by its norm rescales its components so that its length is 1. This allows us to compare the documents on equal footing. Cosine similarity doesn’t care how far each document points, it cares whether they point in the same direction.
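Putting the two parts together, a bare-bones version might look like this (a sketch, not a production implementation; real document vectors would come from word counts or tf-idf weights):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 5.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

print(cosine_similarity(a, b))       # ~0.529
# Scaling a vector doesn't change the direction it points,
# so the similarity stays the same.
print(cosine_similarity(10 * a, b))  # ~0.529
```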

Cosine similarity ranges from 0 (unrelatedness) to 1 (complete relatedness) if you have only non-negative values in your document vectors. If you’re using a system that allows negative values in your vectors (dimensionality reduction via PCA or LSA might do this), it can range from -1 to 1[1]. A negative cosine similarity indicates that the vectors are pointing in opposite directions and can be seen as ‘anti’-similarity.
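For example (another toy sketch with invented two-dimensional vectors, repeating the `cosine_similarity` function from above so it stands alone):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])
print(cosine_similarity(a, -a))                     # -1.0: opposite directions
print(cosine_similarity(a, np.array([-2.0, 1.0])))  #  0.0: orthogonal
```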

Connection

We have two pretty differently formulated measures here. To see the connection, let’s express Jaccard similarity in a similar form to cosine similarity.

Instead of a set of words, let’s consider the document as a vector. The vector is as long as the number of terms in our corpus, and we’ll say that the value for a term is 1 if the term is in the document and 0 if it isn’t. This 0/1 vector is called a bit array. Now we need to describe the intersection in vector terms. Well, the dot product of two bit arrays has to be the number of items that are non-zero in both arrays. That’s the intersection! So \(\mathbf{a} \cdot \mathbf{b}\) will do.
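A quick check in Python (the five-term vocabulary and the documents are invented for the example):

```python
import numpy as np

vocab = ["butter", "guns", "not", "and", "roses"]

def bit_array(doc):
    """1 if the vocabulary term appears in the document, 0 otherwise."""
    words = set(doc.split())
    return np.array([1 if term in words else 0 for term in vocab])

a = bit_array("butter not guns")  # [1, 1, 1, 0, 0]
b = bit_array("guns and roses")   # [0, 1, 0, 1, 1]

# The dot product counts terms present in both documents: just "guns".
print(np.dot(a, b))                                                          # 1
print(len(set("butter not guns".split()) & set("guns and roses".split())))   # 1
```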

How do we get the size of the union? We can do that with norms. Looking back at our definition of a norm, taking the norm of a bit array gives us the square root of the number of non-zero elements. So we’ll just square the norm, and that’s the number of terms in the document. If we add the squared norms for the two vectors we’ll have counted every term in both documents. But there’s a hitch: we’ve counted the shared words twice. So we’ll also need to subtract out the intersection. Combined with the numerator, we have a new definition for Jaccard of \[J(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\| \mathbf{a} \|^2 + \| \mathbf{b} \|^2 - \mathbf{a} \cdot \mathbf{b}}\]
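And the vector form gives the same answer as counting sets (continuing the toy vocabulary from above):

```python
import numpy as np

def jaccard_vector(a, b):
    """Jaccard index from 0/1 document vectors."""
    dot = np.dot(a, b)
    return dot / (np.dot(a, a) + np.dot(b, b) - dot)

# Bit arrays over the made-up vocabulary ["butter", "guns", "not", "and", "roses"]
a = np.array([1, 1, 1, 0, 0])  # "butter not guns"
b = np.array([0, 1, 0, 1, 1])  # "guns and roses"

# 1 shared term out of 5 in the union, whichever way we count.
print(jaccard_vector(a, b))                                 # 0.2
print(1 / len({"butter", "not", "guns", "and", "roses"}))   # 0.2
```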

So, the main thrust of cosine and Jaccard similarities is the same: we use the dot product of two vectors as a measure of similarity, and scale it by the size of the vectors we’re comparing. Jaccard and cosine similarities differ in their denominators. Cosine scales by the product of the two vectors’ magnitudes, while Jaccard uses the sum of their squared magnitudes (less the shared terms). It’s the same idea, but the choice reflects their motivations. Jaccard’s sum is the clear choice for a similarity measure developed for counting. Cosine similarity has a geometric flavor, so it chooses the vector norm, which lets you interpret the similarity as the cosine of the angle between the two vectors[2]. The denominator of Jaccard can grow more rapidly than that of cosine similarity, potentially understating similarities where they exist in large documents[3][4].


  1. This should be reminiscent of Pearson’s correlation coefficient (\(r\)), which also ranges between 1 (perfect positive correlation) and -1 (perfect negative correlation). Pearson’s \(r\) is mathematically equivalent to cosine similarity when the vectors are centered.

  2. This has a major benefit. Many algorithms work with distances and not similarities. The expression \(1 - S_{cosine}(\mathbf{a},\mathbf{b})\) is not a proper metric, as it doesn’t respect the triangle inequality. By undoing the cosine (taking the arccosine of the similarity), you can create a proper angular distance metric.

  3. The Cauchy-Schwarz inequality guarantees \(|\mathbf{a} \cdot \mathbf{b}| \leq \|\mathbf{a} \| \| \mathbf{b} \|\). With \(\|\mathbf{a} \| \| \mathbf{b} \|\) as the upper possible bound for the dot product, we now just need to show that \(\|\mathbf{a} \|^2 + \| \mathbf{b} \|^2 - \| \mathbf{a} \| \| \mathbf{b} \| \geq \| \mathbf{a} \| \| \mathbf{b} \|\). The values of norms and dot products are just scalars, so we can apply basic algebra to this problem. Let \(x\) and \(y\) represent the norms of \(\mathbf{a}\) and \(\mathbf{b}\) respectively. We know that no real number is negative when squared, therefore \((x-y)^2 \geq 0\). This expands to \(x^2 - 2xy + y^2 \geq 0\), and therefore both of these inequalities hold: \(x^2 + y^2 \geq 2xy\) and \(x^2 + y^2 - xy \geq xy\). So even at the point where Cauchy-Schwarz says we can have our largest dot product, the Jaccard denominator can only equal the cosine denominator, never fall below it.

  4. Since our denominator involves a dot product, it may not be obvious that you won’t get a situation where the dot product between the vectors is so negative that both the numerator and denominator are negative, which would give you a misleading positive value for the function as a whole. But you won’t. In the previous footnote we noted that Cauchy-Schwarz guarantees \(|\mathbf{a} \cdot \mathbf{b}| \leq \|\mathbf{a} \| \| \mathbf{b} \|\), and showed that \(\|\mathbf{a} \|^2 + \| \mathbf{b} \|^2 \geq 2 \| \mathbf{a} \| \| \mathbf{b} \|\). Together, these mean the denominator \(\|\mathbf{a} \|^2 + \| \mathbf{b} \|^2 - \mathbf{a} \cdot \mathbf{b}\) is at least \(\| \mathbf{a} \| \| \mathbf{b} \|\), which is never negative.
