Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Jaccard Similarity Index for measuring Document Similarity

with 6 comments

The Jaccard Similarity Index can be used to represent the similarity between two documents. A value “0” means the documents are completely dissimilar, “1” that they are identical,  and values between 0 and 1 representing a degree of similarity.

The Jaccard index is defined as the size of the intersection of the two documents divided by the size of the union of the two documents.

Of course, we need to decide how to represent the two documents as sets. Of the different ways this can be done, we might:

  1. Use a Bag Of Words, which is the list of unique words in the document, with no frequency count.
  2. A Bag of Words with Stop Words removed (common language words like “the”, “at” for English).
  3. Sets that take into account the frequency of words in the document.
  4. k-Shingles which are fixed k length overlapping runs of characters extracted from the document.
  5. n-grams which are fixed length n word runs. For example, a 2-gram are overlapping pairs of words taken from a document.

A simple, efficient, way of representing a bag of words is to create a dictionary of all words and assign a unique integer value to each word. Then, the bag of words for a document is simply the unique list of integer values for the words in the document, e.g. List<int>.

For a comparison between Bag of Words and Frequency Distributions see here.

Since sets are not ordered, a bag of words represented like this takes no account of the order in which words appear in the document.

Here is a C# class to calculate the Jaccard index:


public class Jaccard
 {

 public static double Calc(HashSet<int> hs1, HashSet<int> hs2)
 {
      return ((double)hs1.Intersect(hs2).Count() / (double)hs1.Union(hs2).Count());
 }

 public static double Calc(List<int> ls1, List<int> ls2)
 {
      HashSet<int> hs1 = new HashSet<int>(ls1);
      HashSet<int> hs2 = new HashSet<int>(ls2);
      return Calc(hs1, hs2);
 }
 }

Notice how the List<int> objects are converted to HashSet<int>, since the HashSet<> class has methods for performing set calculations.

This C# code shows some tests for this class:


public void JaccardTest1()
 {
      Dictionary<int, string> wordDict = new Dictionary<int, string>();
      wordDict.Add(1, "Word1");
      wordDict.Add(2, "Word2");
      wordDict.Add(3, "Word3");
      wordDict.Add(4, "Word4");

      List<int> doc1 = new List<int>();
      doc1.Add(2);
      doc1.Add(3);
      doc1.Add(4);
      doc1.Add(2);

      List<int> doc2 = new List<int>();
      doc2.Add(1);
      doc2.Add(5);
      doc2.Add(4);
      doc2.Add(2);

      List<int> doc3 = new List<int>();
      doc3.Add(1);

      Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc2));
      Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc1));
      Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc3));
}

This creates three documents:

  • Doc1: “Word2”, “Word3”, “Word4”, “Word2”.
  • Doc2: “Word1”, “Word5”, “Word4”, “Word2”.
  • Doc3: “Word1”.

The output is:

Jaccard: 0.4
Jaccard: 1
Jaccard: 0

For the first result, the total number of unique words (union) in “Doc1” and “Doc2” is 5, and the number of shared words (intersection) is 2. This gives the Jaccard Similarity of 2.0 / 5.0 = 0.4.

The second result shows that a document compared to itself gives a value of 1 (completely similar) and the third test shows a value of 0 for completely dissimilar documents.

Reference: “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011.  P.75. See http://infolab.stanford.edu/~ullman/mmds.html

Advertisements

Written by Nick Grattan

February 18, 2014 at 9:42 pm

6 Responses

Subscribe to comments with RSS.

  1. […] MinHash is different – the technique is designed to ensure that two similar items generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the documents they were generated from. This relationship approximates to the Jaccard Similarity. […]

  2. […] shown how Jaccard Similarity Coefficient can be used to measure similarity between documents when these documents are represented by a bag […]

  3. […] Bag of Words is used by the Jaccard Similarity measure for document similarity. If two documents have the same set of words then they are deemed […]

  4. […] Distributions. See here for a comparison between Bag of Words and Frequency Distributions and here for using Jaccard Similarity with a Bag of […]

  5. […] Jaccard Similarity Index for measuring Document Similarity […]

  6. […] the below C# code (modified from here), I seem to get a different […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: