Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Summary – Document and String Similarity/Distances Measures in C#

leave a comment »

MachinesThis post summarizes a number of recent posts on this blog showing how to calculate similarity and distance measures in C# for documents and strings.

First, documents can be represented by a “Bag of Words” (a list of the unique words in a document) or a “Frequency Distribution” (a list of the unique words in a document together with the occurrence frequency).

Bag of Words and Frequency Distributions in C#

The simplest similarity measure covered in this series is the Jaccard Similarity measure. This uses a bag of words and compares the number of common words between two documents with the overall number of words. This does not take into account the relative frequency nor the order of the words in the two documents.

Jaccard Similarity Index for measuring Document Similarity

Three measures using Frequency Distributions are described. In all these cases a n-dimensional space is created from the Frequency Distribution with a dimension for each word in the documents being compared. They are:

  1. Euclidean Distance – The shortest distance between two documents in the Frequency Distribution space.
  2. Manhattan Distance – The sum of all the sides in the hyper rectangle formed around two documents in the Frequency Distribution Space.
  3. Cosine Distance – The cosine of the angle subtended at the origin between two documents in the Frequency Distribution Space.

These measures take into account the word frequency, but Cosine Distance cannot distinguish between documents where the relative frequency of words is the same (rather than the absolute frequency). They do not take into account the order of the words in the documents.

Euclidean, Manhattan and Cosine Distance Measures in C#

The final measure is the Levenshtein Minimum Edit Distance. This measure aligns two documents and calculates the number of inserts, deletes or substitutions that are required to change the first document into the second document, which may not necessarily be the same length. This measure takes into account the words, the frequency of words and the order of words in the document.

Levenshtein Minimum Edit Distance in C#

By using MinHash and Locality Sensitivity Hashing, similar documents can be identified very, very efficiently – these techniques are related to the Jaccard Similarity Index.

MinHash for Document Fingerprinting in C#

Locality Sensitivity Hashing for finding similar documents in C#

 

Written by Nick Grattan

July 10, 2014 at 6:21 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: