## Euclidean, Manhattan and Cosine Distance Measures in C#

Euclidean, Manhattan and Cosine Distance Measures can be used for calculating document dissimilarity. Since similarity is the inverse of a dissimilarity measure, they can also be used to calculate document similarity. For document similarity the calculations are based on Frequency Distributions. See here for a comparison between Bag of Words and Frequency Distributions and here for using Jaccard Similarity with a Bag of Words.

The calculation starts with a frequency distribution for words in a number of documents. For example:

A n-dimensional space is then created, with a dimension for each of the words. In the above example a dimension will be created for “Cat”, “Mouse”, “Dog” and “Rat”, so it’s a four dimensioned space. Then, each document is plotted in this space. The following diagram shows the plot for just two of the four dimensions with the Euclidean Distance (the shortest distance between two points):

The Manhattan distance is the sum of the lengths of the rectangle formed by the two points:

Finally, the Cosine distance is the angle subtended at the origin between the two documents. A value of 0 degrees represents identical documents and 90 degrees dissimilar documents. Note that this distance is based on the relative frequency of words in a document. A document with, say, twice as many occurrences of all words compared to another document will be regarded as identical.

For a full description of these distance measures see [1], including details on their calculation.

The Euclidean and Manhattan distances are specific examples of a more general Lr-Norm distance measure. The ‘r’ refers to a power term, and for Manhattan this is 1 and for Euclidean it’s 2. Therefore a single class can be used to implement both:

public class LrNorm { /// <summary> /// Returns Euclidean distance between frequency distributions of two lists /// </summary> /// <typeparam name="T">Type of the item, e.g. int or string</typeparam> /// <param name="l1">First list of items</param> /// <param name="l2">Second list of items</param> /// <returns>Distance, 0 - identical</returns> public static double Euclidean<T>(List<T> l1, List<T> l2) { return DoLrNorm(l1, l2, 2); } /// <summary> /// Returns Manhattan distance between frequency distributions of two lists /// </summary> /// <typeparam name="T">Type of the item, e.g. int or string</typeparam> /// <param name="l1">First list of items</param> /// <param name="l2">Second list of items</param> /// <returns>Distance, 0 - identical</returns> public static double Manhattan<T>(List<T> l1, List<T> l2) { return DoLrNorm(l1, l2, 1); } /// <summary> /// Returns LrNorm distance between frequency distributions of two lists /// </summary> /// <typeparam name="T">Type of the item, e.g. int or string</typeparam> /// <param name="l1">First list of items</param> /// <param name="l2">Second list of items</param> /// <param name="r">Power to use 2 = Euclidean, 1 = Manhattan</param> /// <returns>Distance, 0 - identical</returns> public static double DoLrNorm<T>(List<T> l1, List<T> l2, int r) { // find distinct list of values from both lists. List<T> dvs = FrequencyDist<T>.GetDistinctValues(l1, l2); // create frequency distributions aligned to list of descrete values FrequencyDist<T> fd1 = new FrequencyDist<T>(l1, dvs); FrequencyDist<T> fd2 = new FrequencyDist<T>(l2, dvs); if (fd1.ItemFreq.Count != fd2.ItemFreq.Count) { throw new Exception("Lists of different length for LrNorm calculation"); } double sumsq = 0.0; for (int i = 0; i < fd1.ItemFreq.Count; i++) { if (!EqualityComparer<T>.Default.Equals(fd1.ItemFreq.Values[i].value, fd2.ItemFreq.Values[i].value)) throw new Exception("Mismatched values in frequency distribution for LrNorm calculation"); if (r == 1) // Manhattan optimization { sumsq += Math.Abs((fd1.ItemFreq.Values[i].count - fd2.ItemFreq.Values[i].count)); } else { sumsq += Math.Pow((double)Math.Abs((fd1.ItemFreq.Values[i].count - fd2.ItemFreq.Values[i].count)), r); } } if (r == 1) // Manhattan optimization { return sumsq; } else { return Math.Pow(sumsq, 1.0 / r); } } }

The following code shows how to use the methods in this class:

double LrNormUT(int r) { // Sample from Page 92 "Mining of Massive Datasets" List l1 = new List(); List l2 = new List(); l1.Add(1); l1.Add(1); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(2); l2.Add(2); l2.Add(2); l2.Add(2); return LrNorm.DoLrNorm(l1, l2, r); } [TestMethod] public void Euclidean1() { double dist = LrNormUT(2); Assert.AreEqual(5, (int)dist); Console.WriteLine("d:" + dist); } [TestMethod] public void Manhattan() { double dist = LrNormUT(1); Assert.AreEqual(7, (int)dist); Console.WriteLine("d:" + dist); }

The following class calculates the Cosine Distance:

/// <summary> /// Calculate cosine distance between two vectors /// </summary> public class Cosine { /// <summary> /// Calculates the distance between frequency distributions calculated from lists of items /// </summary> /// <typeparam name="T">Type of the list item, e.g. int or string</typeparam> /// <param name="l1">First list of items</param> /// <param name="l2">Second list of items</param> /// <returns>Distance in degrees. 90 is totally different, 0 exactly the same</returns> public static double Distance<T>(List<T> l1, List<T> l2) { if (l1.Count() == 0 || l2.Count() == 0) { throw new Exception("Cosine Distance: lists cannot be zero length"); } // find distinct list of items from two lists, used to align frequency distributions from two lists List<T> dvs = FrequencyDist<T>.GetDistinctValues(l1, l2); // calculate frequency distributions for each list. FrequencyDist<T> fd1 = new FrequencyDist<T>(l1, dvs); FrequencyDist<T> fd2 = new FrequencyDist<T>(l2, dvs); if(fd1.ItemFreq.Count() != fd2.ItemFreq.Count) { throw new Exception("Cosine Distance: Frequency count vectors must be same length"); } double dotProduct = 0.0; double l2norm1 = 0.0; double l2norm2 = 0.0; for(int i = 0; i < fd1.ItemFreq.Values.Count(); i++) { if (!EqualityComparer<T>.Default.Equals(fd1.ItemFreq.Values[i].value, fd2.ItemFreq.Values[i].value)) throw new Exception("Mismatched values in frequency distribution for Cosine distance calculation"); dotProduct += fd1.ItemFreq.Values[i].count * fd2.ItemFreq.Values[i].count; l2norm1 += fd1.ItemFreq.Values[i].count * fd1.ItemFreq.Values[i].count; l2norm2 += fd2.ItemFreq.Values[i].count * fd2.ItemFreq.Values[i].count; } double cos = dotProduct / (Math.Sqrt(l2norm1) * Math.Sqrt(l2norm2)); // convert cosine value to radians then to degrees return Math.Acos(cos) * 180.0 / Math.PI; } }

Here are some methods that show how to use this class:

double tol = 0.00001; [TestMethod] public void CosineSimple() { List<int> l1 = new List<int>(); List<int> l2 = new List<int>(); l1.Add(1); l1.Add(1); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(2); l2.Add(2); l2.Add(2); l2.Add(2); double dist = Cosine.Distance(l1, l2); Console.WriteLine(dist); Assert.AreEqual(40.3645365730974, dist, tol); }

**Reference:** [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.90. See http://infolab.stanford.edu/~ullman/mmds.html

[…] Euclidean, Manhattan and Cosine Distance Measures in C# […]

Summary – Document and String Similarity/Distances Measures in C# | Nick Grattan's BlogJuly 10, 2014 at 6:21 pm

Not that aways from what you’ve posted, but

(1) Is there a reason you formed the Lr norm to only accept integer values for r? and why isn’t there a check for r>1? As when: 0 < r =1 (i.e. a double instead of an int)

(2) In the final step of the norm, you *may* way to include a simple test to see if the norm is 0 before applying the expensive Math.pow function

(3) For the Cosine Class

(3.1) When I implement, I generally will test to see if there is any non-zero common indexes. If not, the IP = 0 –> COS (theta) = 0 –> 0 = Pi/2. Depending on the implementation, this can speed things up

(3.2) When you computed the final Cos value, should you not do a check on the l1norm and l2norm to see they are both non-zero. This alternatively could be done at the start to check to see if either are the zero vectors.

Really enjoyed reading this page (came about from a random google search). I look forward to reading more of your stuff.

Lastly, have you considered implementations that take the Frequency Distributions, convert them to Discrete Probability Measures and then implement Statistical Distance Measures? this is quite common in the IR / NLP literature.

David GaleaAugust 9, 2017 at 5:28 am

* 0 < r < 1

David GaleaAugust 9, 2017 at 5:29 am