Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Bag of Words and Frequency Distributions in C#

with 5 comments

The simplest way of representing a document is the “Bag of Words”. This is the list of unique words used in a document. It is therefore a simple present/not present indicator for all words in the vocabulary and does not take into account the occurrence frequency of these words nor the order of the words.

The Bag of Words is used by the Jaccard Similarity measure for document similarity. If two documents have the same set of words then they are deemed identical, and if they have no common words they are completely different. This similarity measure takes no account of the relative length of the two documents being compared.

Bag of Words Similarity

In C# a Bag of Words can be represented by a generic List<>. The list type can either be a string (in which case it’s the actual word) or an integer (where the integer is a lookup into a dictionary). The latter is more efficient because the word is stored just once as a string and the integer lookup (4 bytes) is most likely to be shorter than the word itself. The C# in this blog post creates a dictionary, some Bag of Words and calculates the Jaccard index for documents.

Frequency Distributions record not only the words in a document but also the frequency with which they occur. However, like Bags of Words, no account is taken of the order of words in the document. These frequency distributions can be compared and used to assess the similarity between two or more documents.

Freq Dist

These techniques generally calculate the distance between two documents. A distance measure is the inverse of similarity. Common techniques are Euclidean, Manhattan and Cosine distances.

The following C# class manages Frequency Distributions. It’s a generic class, and so can use strings (the words themselves) or integers (for lookups into a dictionary).

    /// <summary>
    /// Manages Frequency Distributions for items of type T
    /// </summary>
    /// <typeparam name="T">Type for item</typeparam>
    public class FrequencyDist<T>
    {
        /// <summary>
        /// Construct Frequency Distribution for the given list of items
        /// </summary>
        /// <param name="li">List of items to calculate for</param>
        public FrequencyDist(List<T> li)
        {
            CalcFreqDist(li);
        }

        /// <summary>
        /// Construct Frequency Distribution for the given list of items, across all keys in itemValues
        /// </summary>
        /// <param name="li">List of items to calculate for</param>
        /// <param name="itemValues">Entire list of itemValues to include in the frequency distribution</param>
        public FrequencyDist(List<T> li, List<T> itemValues)
        {
            CalcFreqDist(li);
            // add items to frequency distribution that are in itemValues but missing from the frequency distribution
            foreach (var v in itemValues)
            {
                if(!ItemFreq.Keys.Contains(v))
                {
                    ItemFreq.Add(v, new Item { value = v, count = 0 });
                }
            }
            // check that all values in li are in the itemValues list
            foreach(var v in li)
            {
                if (!itemValues.Contains(v))
                    throw new Exception(string.Format("FrequencyDist: Value in list for frequency distribution not in supplied list of values: '{0}'.", v));
            }
        }

        /// <summary>
        /// Calculate the frequency distribution for the values in list
        /// </summary>
        /// <param name="li">List of items to calculate for</param>
        void CalcFreqDist(List<T> li)
        {
            itemFreq = new SortedList<T,Item>((from item in li
                   group item by item into theGroup
                   select new Item { value = theGroup.FirstOrDefault(), count = theGroup.Count() }).ToDictionary(q => q.value, q => q));
        }
        SortedList<T, Item> itemFreq = new SortedList<T, Item>();

        /// <summary>
        /// Getter for the Item Frequency list
        /// </summary>
        public SortedList<T, Item> ItemFreq { get { return itemFreq; } }

        public int Freq(T value)
        {
            if(itemFreq.Keys.Contains(value))
            {
                return itemFreq[value].count;
            }
            else
            {
                return 0;
            }
        }

        /// <summary>
        /// Returns the list of distinct values between two lists
        /// </summary>
        /// <param name="l1"></param>
        /// <param name="l2"></param>
        /// <returns></returns>
        public static List<T> GetDistinctValues(List<T> l1, List<T> l2)
        {
            return l1.Concat(l2).ToList().Distinct().ToList();
        }

        /// <summary>
        /// Manages a count of items (int, string etc) for frequency counts
        /// </summary>
        /// <typeparam name="T">The type for item</typeparam>
        public class Item
        {
            /// <summary>
            /// The value of the item, e.g. int or string
            /// </summary>
            public T value { get; set; }
            /// <summary>
            /// The count of the item
            /// </summary>
            public int count { get; set; }
        }
    }

The following code shows how this class can be used:

            List<string> li = new List<string>();
            li.Add("One");
            li.Add("Two");
            li.Add("Two");
            li.Add("Three");
            li.Add("Three");
            li.Add("Three");
            FrequencyDist<string> cs = new FrequencyDist<string>(li);
            foreach (var v in cs.ItemFreq.Values)
            {
                Console.WriteLine(v.value + " : " + v.count);
            }

The output from executing this code is:

One : 1
Three : 3
Two : 2
Advertisements

Written by Nick Grattan

June 9, 2014 at 6:51 pm

5 Responses

Subscribe to comments with RSS.

  1. […] For a comparison between Bag of Words and Frequency Distributions see here. […]

  2. […] similarity. For document similarity the calculations are based on Frequency Distributions. See here for a comparison between Bag of Words and Frequency Distributions and here for using Jaccard […]

  3. […] Bag of Words and Frequency Distributions in C# […]

  4. Jaccard can be used easily for relative frequency distributions.

    schillingklaus

    May 17, 2015 at 6:58 am

  5. Nice class. btw, I modified it as I want to test it with a sort of Moving Window of Bag of Words. So, I put a data members

    List aList;

    int maxLength;

    and added a method

    public void Update(object item){
    aList.Add ((T)item);
    if (aList.Count > maxLength) {
    aList.RemoveAt (0);
    }
    CalcFreqDist(aList);
    }

    works sometimes but occasionally I get a “System.ArgumentException has been thrown element already exists” when CalcFreqDist is called….

    Marc Nunes

    September 18, 2015 at 9:41 am


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: