Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Posts Tagged ‘Document Similarity Distance Measures Euclidean Manhattan Cosine C#

Euclidean, Manhattan and Cosine Distance Measures in C#

with 3 comments

Euclidean, Manhattan and Cosine Distance Measures can be used for calculating document dissimilarity. Since similarity is the inverse of a dissimilarity measure, they can also be used to calculate document similarity. For document similarity the calculations are based on Frequency Distributions. See here for a comparison between Bag of Words and Frequency Distributions and here for using Jaccard Similarity with a Bag of Words.

The calculation starts with a frequency distribution for words in a number of documents. For example:

Freq Dist

A n-dimensional space is then created, with a dimension for each of the words. In the above example a dimension will be created for “Cat”, “Mouse”, “Dog” and “Rat”, so it’s a four dimensioned space. Then, each document is plotted in this space. The following diagram shows the plot for just two of the four dimensions with the Euclidean Distance (the shortest distance between two points):

EuclideanDist

The Manhattan distance is the sum of the lengths of the rectangle formed by the two points:

ManhattanDist

Finally, the Cosine distance is the angle subtended at the origin between the two documents. A value of 0 degrees represents identical documents and 90 degrees dissimilar documents. Note that this distance is based on the relative frequency of words in a document. A document with, say, twice as many occurrences of all words compared to another document will be regarded as identical.

CoSineDist

For a full description of these distance measures see [1], including details on their calculation.

The Euclidean and Manhattan distances are specific examples of a more general Lr-Norm distance measure. The ‘r’ refers to a power term, and for Manhattan this is 1 and for Euclidean it’s 2. Therefore a single class can be used to implement both:

    public class LrNorm
    {
        /// <summary>
        /// Returns Euclidean distance between frequency distributions of two lists
        /// </summary>
        /// <typeparam name="T">Type of the item, e.g. int or string</typeparam>
        /// <param name="l1">First list of items</param>
        /// <param name="l2">Second list of items</param>
        /// <returns>Distance, 0 - identical</returns>
        public static double Euclidean<T>(List<T> l1, List<T> l2)
        {
            return DoLrNorm(l1, l2, 2);
        }

        /// <summary>
        /// Returns Manhattan distance between frequency distributions of two lists
        /// </summary>
        /// <typeparam name="T">Type of the item, e.g. int or string</typeparam>
        /// <param name="l1">First list of items</param>
        /// <param name="l2">Second list of items</param>
        /// <returns>Distance, 0 - identical</returns>
        public static double Manhattan<T>(List<T> l1, List<T> l2)
        {
            return DoLrNorm(l1, l2, 1);
        }

        /// <summary>
        /// Returns LrNorm distance between frequency distributions of two lists
        /// </summary>
        /// <typeparam name="T">Type of the item, e.g. int or string</typeparam>
        /// <param name="l1">First list of items</param>
        /// <param name="l2">Second list of items</param>
        /// <param name="r">Power to use 2 = Euclidean, 1 = Manhattan</param>
        /// <returns>Distance, 0 - identical</returns>
        public static double DoLrNorm<T>(List<T> l1, List<T> l2, int r)
        {
            // find distinct list of values from both lists.
            List<T> dvs = FrequencyDist<T>.GetDistinctValues(l1, l2);

            // create frequency distributions aligned to list of descrete values
            FrequencyDist<T> fd1 = new FrequencyDist<T>(l1, dvs);
            FrequencyDist<T> fd2 = new FrequencyDist<T>(l2, dvs);

            if (fd1.ItemFreq.Count != fd2.ItemFreq.Count)
            {
                throw new Exception("Lists of different length for LrNorm calculation");
            }
            double sumsq = 0.0;
            for (int i = 0; i < fd1.ItemFreq.Count; i++)
            {
                if (!EqualityComparer<T>.Default.Equals(fd1.ItemFreq.Values[i].value, fd2.ItemFreq.Values[i].value))
                    throw new Exception("Mismatched values in frequency distribution for LrNorm calculation");

                if (r == 1)   // Manhattan optimization
                {
                    sumsq += Math.Abs((fd1.ItemFreq.Values[i].count - fd2.ItemFreq.Values[i].count));
                }
                else
                {
                    sumsq += Math.Pow((double)Math.Abs((fd1.ItemFreq.Values[i].count - fd2.ItemFreq.Values[i].count)), r);
                }
            }
            if (r == 1)    // Manhattan optimization
            {
                return sumsq;
            }
            else
            {
                return Math.Pow(sumsq, 1.0 / r);
            }
        }
    }

The following code shows how to use the methods in this class:

       double LrNormUT(int r)
        {
            // Sample from Page 92 "Mining of Massive Datasets"
            List l1 = new List();
            List l2 = new List();

            l1.Add(1);
            l1.Add(1);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);

            l2.Add(1);
            l2.Add(1);
            l2.Add(1);
            l2.Add(1);
            l2.Add(1);
            l2.Add(1);
            l2.Add(2);
            l2.Add(2);
            l2.Add(2);
            l2.Add(2);

            return LrNorm.DoLrNorm(l1, l2, r);
        }
        [TestMethod]
        public void Euclidean1()
        {
            double dist = LrNormUT(2);
            Assert.AreEqual(5, (int)dist);
            Console.WriteLine("d:" + dist);
        }

        [TestMethod]
        public void Manhattan()
        {
            double dist = LrNormUT(1);
            Assert.AreEqual(7, (int)dist);
            Console.WriteLine("d:" + dist);
        }

The following class calculates the Cosine Distance:

    /// <summary>
    /// Calculate cosine distance between two vectors
    /// </summary>
    public class Cosine
    {
        /// <summary>
        /// Calculates the distance between frequency distributions calculated from lists of items
        /// </summary>
        /// <typeparam name="T">Type of the list item, e.g. int or string</typeparam>
        /// <param name="l1">First list of items</param>
        /// <param name="l2">Second list of items</param>
        /// <returns>Distance in degrees. 90 is totally different, 0 exactly the same</returns>
        public static double Distance<T>(List<T> l1, List<T> l2)
        {
            if (l1.Count() == 0 || l2.Count() == 0)
            {
                throw new Exception("Cosine Distance: lists cannot be zero length");
            }

            // find distinct list of items from two lists, used to align frequency distributions from two lists
            List<T> dvs = FrequencyDist<T>.GetDistinctValues(l1, l2);
            // calculate frequency distributions for each list.
            FrequencyDist<T> fd1 = new FrequencyDist<T>(l1, dvs);
            FrequencyDist<T> fd2 = new FrequencyDist<T>(l2, dvs);

            if(fd1.ItemFreq.Count() != fd2.ItemFreq.Count)
            {
                throw new Exception("Cosine Distance: Frequency count vectors must be same length");
            }
            double dotProduct = 0.0;
            double l2norm1 = 0.0;
            double l2norm2 = 0.0;
            for(int i = 0; i < fd1.ItemFreq.Values.Count(); i++)
            {
                if (!EqualityComparer<T>.Default.Equals(fd1.ItemFreq.Values[i].value, fd2.ItemFreq.Values[i].value))
                    throw new Exception("Mismatched values in frequency distribution for Cosine distance calculation");

                dotProduct += fd1.ItemFreq.Values[i].count * fd2.ItemFreq.Values[i].count;
                l2norm1 += fd1.ItemFreq.Values[i].count * fd1.ItemFreq.Values[i].count;
                l2norm2 += fd2.ItemFreq.Values[i].count * fd2.ItemFreq.Values[i].count;
            }
            double cos = dotProduct / (Math.Sqrt(l2norm1) * Math.Sqrt(l2norm2));
            // convert cosine value to radians then to degrees
            return Math.Acos(cos) * 180.0 / Math.PI;
        }
    }

Here are some methods that show how to use this class:

        double tol = 0.00001;

       [TestMethod]
        public void CosineSimple()
        {
            List<int> l1 = new List<int>();
            List<int> l2 = new List<int>();

            l1.Add(1);
            l1.Add(1);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);
            l1.Add(2);

            l2.Add(1);
            l2.Add(1);
            l2.Add(1);
            l2.Add(1);
            l2.Add(1);
            l2.Add(1);
            l2.Add(2);
            l2.Add(2);
            l2.Add(2);
            l2.Add(2);
            double dist = Cosine.Distance(l1, l2);
            Console.WriteLine(dist);
            Assert.AreEqual(40.3645365730974, dist, tol);
        }

Reference: [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011.  P.90. See http://infolab.stanford.edu/~ullman/mmds.html

Advertisements

Written by Nick Grattan

June 10, 2014 at 7:45 am