Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Posts Tagged ‘Document Similarity Frequency Distribution Bag of Words

Bag of Words and Frequency Distributions in C#

with 5 comments

The simplest way of representing a document is the “Bag of Words”. This is the list of unique words used in a document. It is therefore a simple present/not present indicator for all words in the vocabulary and does not take into account the occurrence frequency of these words nor the order of the words.

The Bag of Words is used by the Jaccard Similarity measure for document similarity. If two documents have the same set of words then they are deemed identical, and if they have no common words they are completely different. This similarity measure takes no account of the relative length of the two documents being compared.

Bag of Words Similarity

In C# a Bag of Words can be represented by a generic List<>. The list type can either be a string (in which case it’s the actual word) or an integer (where the integer is a lookup into a dictionary). The latter is more efficient because the word is stored just once as a string and the integer lookup (4 bytes) is most likely to be shorter than the word itself. The C# in this blog post creates a dictionary, some Bag of Words and calculates the Jaccard index for documents.

Frequency Distributions record not only the words in a document but also the frequency with which they occur. However, like Bags of Words, no account is taken of the order of words in the document. These frequency distributions can be compared and used to assess the similarity between two or more documents.

Freq Dist

These techniques generally calculate the distance between two documents. A distance measure is the inverse of similarity. Common techniques are Euclidean, Manhattan and Cosine distances.

The following C# class manages Frequency Distributions. It’s a generic class, and so can use strings (the words themselves) or integers (for lookups into a dictionary).

    /// <summary>
    /// Manages Frequency Distributions for items of type T
    /// </summary>
    /// <typeparam name="T">Type for item</typeparam>
    public class FrequencyDist<T>
    {
        /// <summary>
        /// Construct Frequency Distribution for the given list of items
        /// </summary>
        /// <param name="li">List of items to calculate for</param>
        public FrequencyDist(List<T> li)
        {
            CalcFreqDist(li);
        }

        /// <summary>
        /// Construct Frequency Distribution for the given list of items, across all keys in itemValues
        /// </summary>
        /// <param name="li">List of items to calculate for</param>
        /// <param name="itemValues">Entire list of itemValues to include in the frequency distribution</param>
        public FrequencyDist(List<T> li, List<T> itemValues)
        {
            CalcFreqDist(li);
            // add items to frequency distribution that are in itemValues but missing from the frequency distribution
            foreach (var v in itemValues)
            {
                if(!ItemFreq.Keys.Contains(v))
                {
                    ItemFreq.Add(v, new Item { value = v, count = 0 });
                }
            }
            // check that all values in li are in the itemValues list
            foreach(var v in li)
            {
                if (!itemValues.Contains(v))
                    throw new Exception(string.Format("FrequencyDist: Value in list for frequency distribution not in supplied list of values: '{0}'.", v));
            }
        }

        /// <summary>
        /// Calculate the frequency distribution for the values in list
        /// </summary>
        /// <param name="li">List of items to calculate for</param>
        void CalcFreqDist(List<T> li)
        {
            itemFreq = new SortedList<T,Item>((from item in li
                   group item by item into theGroup
                   select new Item { value = theGroup.FirstOrDefault(), count = theGroup.Count() }).ToDictionary(q => q.value, q => q));
        }
        SortedList<T, Item> itemFreq = new SortedList<T, Item>();

        /// <summary>
        /// Getter for the Item Frequency list
        /// </summary>
        public SortedList<T, Item> ItemFreq { get { return itemFreq; } }

        public int Freq(T value)
        {
            if(itemFreq.Keys.Contains(value))
            {
                return itemFreq[value].count;
            }
            else
            {
                return 0;
            }
        }

        /// <summary>
        /// Returns the list of distinct values between two lists
        /// </summary>
        /// <param name="l1"></param>
        /// <param name="l2"></param>
        /// <returns></returns>
        public static List<T> GetDistinctValues(List<T> l1, List<T> l2)
        {
            return l1.Concat(l2).ToList().Distinct().ToList();
        }

        /// <summary>
        /// Manages a count of items (int, string etc) for frequency counts
        /// </summary>
        /// <typeparam name="T">The type for item</typeparam>
        public class Item
        {
            /// <summary>
            /// The value of the item, e.g. int or string
            /// </summary>
            public T value { get; set; }
            /// <summary>
            /// The count of the item
            /// </summary>
            public int count { get; set; }
        }
    }

The following code shows how this class can be used:

            List<string> li = new List<string>();
            li.Add("One");
            li.Add("Two");
            li.Add("Two");
            li.Add("Three");
            li.Add("Three");
            li.Add("Three");
            FrequencyDist<string> cs = new FrequencyDist<string>(li);
            foreach (var v in cs.ItemFreq.Values)
            {
                Console.WriteLine(v.value + " : " + v.count);
            }

The output from executing this code is:

One : 1
Three : 3
Two : 2

Written by Nick Grattan

June 9, 2014 at 6:51 pm