## Archive for the ‘**Search / Information Retrieval**’ Category

## Summary – Document and String Similarity/Distances Measures in C#

This post summarizes a number of recent posts on this blog showing how to calculate similarity and distance measures in C# for documents and strings.

First, documents can be represented by a “Bag of Words” (a list of the unique words in a document) or a “Frequency Distribution” (a list of the unique words in a document together with the occurrence frequency).

Bag of Words and Frequency Distributions in C#

The simplest similarity measure covered in this series is the Jaccard Similarity measure. This uses a bag of words and compares the number of common words between two documents with the overall number of words. This does not take into account the relative frequency nor the order of the words in the two documents.

Jaccard Similarity Index for measuring Document Similarity

Three measures using Frequency Distributions are described. In all these cases a n-dimensional space is created from the Frequency Distribution with a dimension for each word in the documents being compared. They are:

- Euclidean Distance – The shortest distance between two documents in the Frequency Distribution space.
- Manhattan Distance – The sum of all the sides in the hyper rectangle formed around two documents in the Frequency Distribution Space.
- Cosine Distance – The cosine of the angle subtended at the origin between two documents in the Frequency Distribution Space.

These measures take into account the word frequency, but Cosine Distance cannot distinguish between documents where the relative frequency of words is the same (rather than the absolute frequency). They do not take into account the order of the words in the documents.

Euclidean, Manhattan and Cosine Distance Measures in C#

The final measure is the Levenshtein Minimum Edit Distance. This measure aligns two documents and calculates the number of inserts, deletes or substitutions that are required to change the first document into the second document, which may not necessarily be the same length. This measure takes into account the words, the frequency of words and the order of words in the document.

Levenshtein Minimum Edit Distance in C#

By using MinHash and Locality Sensitivity Hashing, similar documents can be identified very, very efficiently – these techniques are related to the Jaccard Similarity Index.

MinHash for Document Fingerprinting in C#

Locality Sensitivity Hashing for finding similar documents in C#

## Levenshtein Minimum Edit Distance in C#

From Lesk[1] p.254 – “The **Levenstein,** or **edit distance** , defined between two strings of not necessarily equal length, is the minimum number of ‘edit operations’ required to change one string into the other. An edit operation is a deletion, insertion or alteration [substitution] of a single character in either sequence “. Thus the edit distance between two strings or documents takes into account not only the relative frequency of characters/words but the position as well. Strings can be aligned too. For example, here’s an alignment of two nucleotide sequences where ‘-‘ represents an insertion:

ag-tcc cgctca

For these two strings the edit distance is 3 (2 substitutions and 1 insertion/deletion). In the case above the substitutions and inserts/deletes (“indels”) have the same weight. Often, substitutions are given a weight of 2 and indels 1 resulting in an edit distance of 5 for these strings. Substitutions are really an insert with a delete, hence the double weight.

The edit distance calculation uses Dynamic Programming. The algorithm is well described in Jurafsky & Martin[2] p.107. and summarized in these PowerPoint slides. This class implements the edit distance algorithm and text alignment in C#:

/// <summary> /// Calculates the minimum edit distance (Levenshtein) between two lists of items /// </summary> /// <typeparam name="T">Type of item (e.g. characters or string)</typeparam> public class MinEditDistance<T> { public MinEditDistance(List<T> list1, List<T> list2) { InsCost = 1; DelCost = 1; SubCost = 2; List1 = list1; List2 = list2; } List<T> List1; List<T> List2; Distance[,] D; // The costs of inserts, deletes, substitutions. Normally run SubCost with 2, others at 1 public int InsCost { get; set; } public int DelCost { get; set; } public int SubCost { get; set; } /// <summary> /// Calculates the minimum edit distance using weights in InsCost etc. /// </summary> /// <returns></returns> public int CalcMinEditDistance() { if (List1 == null || List2 == null) { throw new ArgumentNullException(); } if (List1.Count == 0 || List2.Count == 0) { throw new ArgumentException("Zero length list"); } // prepend dummy value List1.Insert(0, default(T)); List2.Insert(0, default(T)); int lenList1 = List1.Count; int lenList2 = List2.Count; D = new Distance[lenList1, lenList2]; for (int i = 0; i < lenList1; i++) { D[i, 0].val = DelCost * i; D[i, 0].backTrace = Direction.Left; } for (int j = 0; j < lenList2; j++) { D[0, j].val = InsCost * j; D[0, j].backTrace = Direction.Up; } for (int i = 1; i < lenList1; i++) for (int j = 1; j < lenList2; j++) { int d1 = D[i - 1, j].val + DelCost; int d2 = D[i, j - 1].val + InsCost; int d3 = (EqualityComparer<T>.Default.Equals(List1[i], List2[j])) ? D[i - 1, j - 1].val : D[i - 1, j - 1].val + SubCost; D[i, j].val = Math.Min(d1, Math.Min(d2, d3)); // back trace if (D[i, j].val == d1) { D[i, j].backTrace |= Direction.Left; } if (D[i, j].val == d2) { D[i, j].backTrace |= Direction.Up; } if (D[i, j].val == d3) { D[i, j].backTrace |= Direction.Diag; } } return D[lenList1 - 1, lenList2 - 1].val; } /// <summary> /// Returns alignment strings. The default value for T indicates an insertion or deletion in the appropriate string. align1 and align2 will /// have the same length padded with default(T) regardless of the input string lengths. /// </summary> /// <param name="align1">First string alignment</param> /// <param name="align2">Second string alignment</param> public void Alignment(out List<T> align1, out List<T> align2) { if (D == null) { throw new Exception("Distance matrix is null"); } int i = List1.Count - 1; int j = List2.Count - 1; align1 = new List<T>(); align2 = new List<T>(); while (i > 0 || j > 0) { Direction dir = D[i, j].backTrace; int dVal, dDiag = int.MaxValue, dLeft = int.MaxValue, dUp = int.MaxValue; if ((dir & Direction.Diag) == Direction.Diag) { // always favour diagonal as this is a match on items dDiag = -1; } if ((dir & Direction.Up) == Direction.Up) { dUp = D[i, j - 1].val; } if ((dir & Direction.Left) == Direction.Left) { dLeft = D[i - 1, j].val; } dVal = Math.Min(dDiag, Math.Min(dLeft, dUp)); if (dVal == dDiag) { align1.Add(List1[i]); align2.Add(List2[j]); i--; j--; } else if (dVal == dUp) { align1.Add(default(T)); align2.Add(List2[j]); j--; } else if (dVal == dLeft) { align1.Add(List1[i]); align2.Add(default(T)); i--; } } align1.Reverse(); align2.Reverse(); } /// <summary> /// Writes out the "D" matrix showing the edit distances and the back tracing directions /// </summary> public void Write() { int lenList1 = List1.Count; int lenList2 = List2.Count; Console.Write(" "); for (int i = 0; i < lenList1; i++) { if (i == 0) Console.Write(string.Format("{0, 6}", "*")); else Console.Write(string.Format("{0, 6}", List1[i])); } Console.WriteLine(); for (int j = 0; j < lenList2; j++) { if (j == 0) Console.Write(string.Format("{0, 6}", "*")); else Console.Write(string.Format("{0, 6}", List2[j])); for (int i = 0; i < lenList1; i++) { Console.Write(string.Format("{0,6:###}", D[i, j].val)); } Console.WriteLine(); Console.Write(" "); for (int i = 0; i < lenList1; i++) { WriteBackTrace(D[i, j].backTrace); } Console.WriteLine(); Console.WriteLine(); } } /// <summary> /// Writes out one backtrace item /// </summary> /// <param name="d"></param> void WriteBackTrace(Direction d) { string s = string.Empty; s += ((d & Direction.Diag) == Direction.Diag) ? "\u2196" : ""; s += ((d & Direction.Up) == Direction.Up) ? "\u2191" : ""; s += ((d & Direction.Left) == Direction.Left) ? "\u2190" : ""; Console.Write(string.Format("{0,6}", s)); } /// <summary> /// Represents one cell in the 'D' matrix /// </summary> public struct Distance { public int val; public Direction backTrace; } /// <summary> /// Direction(s) for one cell in the 'D' matrix. /// </summary> [FlagsAttribute] public enum Direction { None = 0, Diag = 1, Left = 2, Up = 4 } }

The following code shows how this class can be used:

// example from Speach & Language Processing, Jurafsky & Martin P. 108, cites Krushkal (1983) [TestMethod] public void LevenshteinAltDistance() { string s1 = "EXECUTION"; string s2 = "INTENTION"; int m = Align(s1, s2, 2, 1, 1, "*EXECUTION", "INTE*NTION"); Assert.AreEqual(8, m); } private int Align(string s1, string s2, int subCost = 1, int delCost = 1, int insCost = 1, string alignment1 = null, string alignment2 = null) { List<char> l1 = s1.ToList(); List<char> l2 = s2.ToList(); MinEditDistance<char> med = new MinEditDistance<char>(l1, l2); List<char> a1, a2; med.SubCost = subCost; med.DelCost = delCost; med.InsCost = insCost; int m = med.CalcMinEditDistance(); Console.WriteLine("Min edit distance: " + m); med.Write(); med.Alignment(out a1, out a2); foreach (char c in a1) { if (c == default(char)) Console.Write("*"); else Console.Write(c); } Console.WriteLine(); foreach (char c in a2) { if (c == default(char)) Console.Write("*"); else Console.Write(c); } Console.WriteLine(); if (alignment1 != null && alignment2 != null) { Assert.AreEqual(alignment1, new string(a1.ToArray()).Replace('\0', '*')); Assert.AreEqual(alignment2, new string(a2.ToArray()).Replace('\0', '*')); } return m; }

The output from this test reports the edit distance to be 8 and the alignment is:

*EXECUTION INTE*NTION

The “D” matrix with backtrack information is also displayed:

Note that there may be several different possible alignments since backtracking allows multiple routes through the matrix. This web site http://odur.let.rug.nl/kleiweg/lev/ provides an online tool for calculating the edit distance.

[1] “Introduction to Bioinfomatics”, Arthur M. Lesk, 3rd Edition 2008, Oxford University Press

[2] “Speech and Language Processing” D.Jurafsky, J.Martin, 2nd Edition 2009, Prentice Hall

## Microsoft Azure to offer Machine Learning Services – Preview July 2014

” Microsoft Azure Machine Learning combines power of comprehensive machine learning with benefits of cloud”

“Machine learning today is usually self-managed and on premises, requiring the training and expertise of data scientists. However, data scientists are in short supply, commercial software licenses can be expensive and popular programming languages for statistical computing have a steep learning curve. Even if a business could overcome these hurdles, deploying new machine learning models in production systems often requires months of engineering investment. Scaling, managing and monitoring these production systems requires the capabilities of a very sophisticated engineering organization, which few enterprises have today.

Microsoft Azure Machine Learning, a fully-managed cloud service for building predictive analytics solutions, helps overcome the challenges most businesses have in deploying and using machine learning. How? By delivering a comprehensive machine learning service that has all the benefits of the cloud. In mere hours, with Azure ML, customers and partners can build data-driven applications to predict, forecast and change future outcomes – a process that previously took weeks and months.”

See this blog post from: Joseph Sirosh, Corporate Vice President of Machine Learning at Microsoft

## Euclidean, Manhattan and Cosine Distance Measures in C#

Euclidean, Manhattan and Cosine Distance Measures can be used for calculating document dissimilarity. Since similarity is the inverse of a dissimilarity measure, they can also be used to calculate document similarity. For document similarity the calculations are based on Frequency Distributions. See here for a comparison between Bag of Words and Frequency Distributions and here for using Jaccard Similarity with a Bag of Words.

The calculation starts with a frequency distribution for words in a number of documents. For example:

A n-dimensional space is then created, with a dimension for each of the words. In the above example a dimension will be created for “Cat”, “Mouse”, “Dog” and “Rat”, so it’s a four dimensioned space. Then, each document is plotted in this space. The following diagram shows the plot for just two of the four dimensions with the Euclidean Distance (the shortest distance between two points):

The Manhattan distance is the sum of the lengths of the rectangle formed by the two points:

Finally, the Cosine distance is the angle subtended at the origin between the two documents. A value of 0 degrees represents identical documents and 90 degrees dissimilar documents. Note that this distance is based on the relative frequency of words in a document. A document with, say, twice as many occurrences of all words compared to another document will be regarded as identical.

For a full description of these distance measures see [1], including details on their calculation.

The Euclidean and Manhattan distances are specific examples of a more general Lr-Norm distance measure. The ‘r’ refers to a power term, and for Manhattan this is 1 and for Euclidean it’s 2. Therefore a single class can be used to implement both:

public class LrNorm { /// <summary> /// Returns Euclidean distance between frequency distributions of two lists /// </summary> /// <typeparam name="T">Type of the item, e.g. int or string</typeparam> /// <param name="l1">First list of items</param> /// <param name="l2">Second list of items</param> /// <returns>Distance, 0 - identical</returns> public static double Euclidean<T>(List<T> l1, List<T> l2) { return DoLrNorm(l1, l2, 2); } /// <summary> /// Returns Manhattan distance between frequency distributions of two lists /// </summary> /// <typeparam name="T">Type of the item, e.g. int or string</typeparam> /// <param name="l1">First list of items</param> /// <param name="l2">Second list of items</param> /// <returns>Distance, 0 - identical</returns> public static double Manhattan<T>(List<T> l1, List<T> l2) { return DoLrNorm(l1, l2, 1); } /// <summary> /// Returns LrNorm distance between frequency distributions of two lists /// </summary> /// <typeparam name="T">Type of the item, e.g. int or string</typeparam> /// <param name="l1">First list of items</param> /// <param name="l2">Second list of items</param> /// <param name="r">Power to use 2 = Euclidean, 1 = Manhattan</param> /// <returns>Distance, 0 - identical</returns> public static double DoLrNorm<T>(List<T> l1, List<T> l2, int r) { // find distinct list of values from both lists. List<T> dvs = FrequencyDist<T>.GetDistinctValues(l1, l2); // create frequency distributions aligned to list of descrete values FrequencyDist<T> fd1 = new FrequencyDist<T>(l1, dvs); FrequencyDist<T> fd2 = new FrequencyDist<T>(l2, dvs); if (fd1.ItemFreq.Count != fd2.ItemFreq.Count) { throw new Exception("Lists of different length for LrNorm calculation"); } double sumsq = 0.0; for (int i = 0; i < fd1.ItemFreq.Count; i++) { if (!EqualityComparer<T>.Default.Equals(fd1.ItemFreq.Values[i].value, fd2.ItemFreq.Values[i].value)) throw new Exception("Mismatched values in frequency distribution for LrNorm calculation"); if (r == 1) // Manhattan optimization { sumsq += Math.Abs((fd1.ItemFreq.Values[i].count - fd2.ItemFreq.Values[i].count)); } else { sumsq += Math.Pow((double)Math.Abs((fd1.ItemFreq.Values[i].count - fd2.ItemFreq.Values[i].count)), r); } } if (r == 1) // Manhattan optimization { return sumsq; } else { return Math.Pow(sumsq, 1.0 / r); } } }

The following code shows how to use the methods in this class:

double LrNormUT(int r) { // Sample from Page 92 "Mining of Massive Datasets" List l1 = new List(); List l2 = new List(); l1.Add(1); l1.Add(1); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(2); l2.Add(2); l2.Add(2); l2.Add(2); return LrNorm.DoLrNorm(l1, l2, r); } [TestMethod] public void Euclidean1() { double dist = LrNormUT(2); Assert.AreEqual(5, (int)dist); Console.WriteLine("d:" + dist); } [TestMethod] public void Manhattan() { double dist = LrNormUT(1); Assert.AreEqual(7, (int)dist); Console.WriteLine("d:" + dist); }

The following class calculates the Cosine Distance:

/// <summary> /// Calculate cosine distance between two vectors /// </summary> public class Cosine { /// <summary> /// Calculates the distance between frequency distributions calculated from lists of items /// </summary> /// <typeparam name="T">Type of the list item, e.g. int or string</typeparam> /// <param name="l1">First list of items</param> /// <param name="l2">Second list of items</param> /// <returns>Distance in degrees. 90 is totally different, 0 exactly the same</returns> public static double Distance<T>(List<T> l1, List<T> l2) { if (l1.Count() == 0 || l2.Count() == 0) { throw new Exception("Cosine Distance: lists cannot be zero length"); } // find distinct list of items from two lists, used to align frequency distributions from two lists List<T> dvs = FrequencyDist<T>.GetDistinctValues(l1, l2); // calculate frequency distributions for each list. FrequencyDist<T> fd1 = new FrequencyDist<T>(l1, dvs); FrequencyDist<T> fd2 = new FrequencyDist<T>(l2, dvs); if(fd1.ItemFreq.Count() != fd2.ItemFreq.Count) { throw new Exception("Cosine Distance: Frequency count vectors must be same length"); } double dotProduct = 0.0; double l2norm1 = 0.0; double l2norm2 = 0.0; for(int i = 0; i < fd1.ItemFreq.Values.Count(); i++) { if (!EqualityComparer<T>.Default.Equals(fd1.ItemFreq.Values[i].value, fd2.ItemFreq.Values[i].value)) throw new Exception("Mismatched values in frequency distribution for Cosine distance calculation"); dotProduct += fd1.ItemFreq.Values[i].count * fd2.ItemFreq.Values[i].count; l2norm1 += fd1.ItemFreq.Values[i].count * fd1.ItemFreq.Values[i].count; l2norm2 += fd2.ItemFreq.Values[i].count * fd2.ItemFreq.Values[i].count; } double cos = dotProduct / (Math.Sqrt(l2norm1) * Math.Sqrt(l2norm2)); // convert cosine value to radians then to degrees return Math.Acos(cos) * 180.0 / Math.PI; } }

Here are some methods that show how to use this class:

double tol = 0.00001; [TestMethod] public void CosineSimple() { List<int> l1 = new List<int>(); List<int> l2 = new List<int>(); l1.Add(1); l1.Add(1); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l1.Add(2); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(1); l2.Add(2); l2.Add(2); l2.Add(2); l2.Add(2); double dist = Cosine.Distance(l1, l2); Console.WriteLine(dist); Assert.AreEqual(40.3645365730974, dist, tol); }

**Reference:** [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.90. See http://infolab.stanford.edu/~ullman/mmds.html

## Bag of Words and Frequency Distributions in C#

The simplest way of representing a document is the “Bag of Words”. This is the list of unique words used in a document. It is therefore a simple present/not present indicator for all words in the vocabulary and does not take into account the occurrence frequency of these words nor the order of the words.

The Bag of Words is used by the Jaccard Similarity measure for document similarity. If two documents have the same set of words then they are deemed identical, and if they have no common words they are completely different. This similarity measure takes no account of the relative length of the two documents being compared.

In C# a Bag of Words can be represented by a generic List. The list type can either be a string (in which case it’s the actual word) or an integer (where the integer is a lookup into a dictionary). The latter is more efficient because the word is stored just once as a string and the integer lookup (4 bytes) is most likely to be shorter than the word itself. The C# in this blog post creates a dictionary, some Bag of Words and calculates the Jaccard index for documents.

**Frequency Distributions** record not only the words in a document but also the frequency with which they occur. However, like Bags of Words, no account is taken of the order of words in the document. These frequency distributions can be compared and used to assess the similarity between two or more documents.

These techniques generally calculate the distance between two documents. A distance measure is the inverse of similarity. Common techniques are Euclidean, Manhattan and Cosine distances.

The following C# class manages Frequency Distributions. It’s a generic class, and so can use strings (the words themselves) or integers (for lookups into a dictionary).

/// <summary> /// Manages Frequency Distributions for items of type T /// </summary> /// <typeparam name="T">Type for item</typeparam> public class FrequencyDist<T> { /// <summary> /// Construct Frequency Distribution for the given list of items /// </summary> /// <param name="li">List of items to calculate for</param> public FrequencyDist(List<T> li) { CalcFreqDist(li); } /// <summary> /// Construct Frequency Distribution for the given list of items, across all keys in itemValues /// </summary> /// <param name="li">List of items to calculate for</param> /// <param name="itemValues">Entire list of itemValues to include in the frequency distribution</param> public FrequencyDist(List<T> li, List<T> itemValues) { CalcFreqDist(li); // add items to frequency distribution that are in itemValues but missing from the frequency distribution foreach (var v in itemValues) { if(!ItemFreq.Keys.Contains(v)) { ItemFreq.Add(v, new Item { value = v, count = 0 }); } } // check that all values in li are in the itemValues list foreach(var v in li) { if (!itemValues.Contains(v)) throw new Exception(string.Format("FrequencyDist: Value in list for frequency distribution not in supplied list of values: '{0}'.", v)); } } /// <summary> /// Calculate the frequency distribution for the values in list /// </summary> /// <param name="li">List of items to calculate for</param> void CalcFreqDist(List<T> li) { itemFreq = new SortedList<T,Item>((from item in li group item by item into theGroup select new Item { value = theGroup.FirstOrDefault(), count = theGroup.Count() }).ToDictionary(q => q.value, q => q)); } SortedList<T, Item> itemFreq = new SortedList<T, Item>(); /// <summary> /// Getter for the Item Frequency list /// </summary> public SortedList<T, Item> ItemFreq { get { return itemFreq; } } public int Freq(T value) { if(itemFreq.Keys.Contains(value)) { return itemFreq[value].count; } else { return 0; } } /// <summary> /// Returns the list of distinct values between two lists /// </summary> /// <param name="l1"></param> /// <param name="l2"></param> /// <returns></returns> public static List<T> GetDistinctValues(List<T> l1, List<T> l2) { return l1.Concat(l2).ToList().Distinct().ToList(); } /// <summary> /// Manages a count of items (int, string etc) for frequency counts /// </summary> /// <typeparam name="T">The type for item</typeparam> public class Item { /// <summary> /// The value of the item, e.g. int or string /// </summary> public T value { get; set; } /// <summary> /// The count of the item /// </summary> public int count { get; set; } } }

The following code shows how this class can be used:

List<string> li = new List<string>(); li.Add("One"); li.Add("Two"); li.Add("Two"); li.Add("Three"); li.Add("Three"); li.Add("Three"); FrequencyDist<string> cs = new FrequencyDist<string>(li); foreach (var v in cs.ItemFreq.Values) { Console.WriteLine(v.value + " : " + v.count); }

The output from executing this code is:

One : 1 Three : 3 Two : 2

## LSH for Finding Similar Documents from a large number of Documents in C#

We’ve shown how Jaccard Similarity Coefficient can be used to measure similarity between documents when these documents are represented by a bag of words. MinHash can be used to create document fingerprints which represent the document in a small number of bytes, and can also be used to estimate document similarity using Jaccard Similarity Coefficient.

The next problem is how to find similar documents in a large collection of documents. Imagine you have 12,000 documents and you want to find pairs of most similar documents. The brute-force approach is to compare each document with each other document, using MinHash and Jaccard Similarity Coefficient. Assuming each comparison takes 10 milliseconds this operation will take around 16 days, since each document is compared with each other document.

A faster solution is to use Locality Sensitive Hashing (LSH). This takes the MinHash values for documents and hashes the MinHash values so they hash into the same bucket if they are similar. Buckets are organized into bands, with a number of documents per band (20 is a commonly used value) Documents in the same bucket in a band are then candidates for being similar documents. The algorithm is described fully in [1].

This class implements the LSH algorithm in C#.

public class LSH { Dictionary<int, HashSet<int>> m_lshBuckets = new Dictionary<int, HashSet<int>>(); int[,] minHashes; int numBands; int numHashFunctions; int rowsPerBand; int numSets; List<SortedList<int, List<int>>> lshBuckets = new List<SortedList<int, List<int>>>(); public LSH(int[,] minHashes, int rowsPerBand) { this.minHashes = minHashes; numHashFunctions = minHashes.GetUpperBound(1) + 1; numSets = minHashes.GetUpperBound(0) + 1; this.rowsPerBand = rowsPerBand; this.numBands = numHashFunctions / rowsPerBand; } public void Calc() { int thisHash = 0; for (int b = 0; b < numBands; b++) { var thisSL = new SortedList<int, List<int>>(); for (int s = 0; s < numSets; s++) { int hashValue = 0; for (int th = thisHash; th < thisHash + rowsPerBand; th++) { hashValue = unchecked(hashValue * 1174247 + minHashes[s, th]); } if (!thisSL.ContainsKey(hashValue)) { thisSL.Add(hashValue, new List<int>()); } thisSL[hashValue].Add(s); } thisHash += rowsPerBand; var copy = new SortedList<int, List<int>>(); foreach(int ic in thisSL.Keys) { if (thisSL[ic].Count() > 1) { copy.Add(ic, thisSL[ic]); } } lshBuckets.Add(copy); } } public List<int> GetNearest(int n) { List<int> nearest = new List<int>(); foreach (SortedList<int, List<int>> b in lshBuckets) { foreach (List<int> li in b.Values) { if (li.Contains(n)) { nearest.AddRange(li); break; } } } nearest = nearest.Distinct().ToList(); nearest.Remove(n); // remove the document itself return nearest; } }

The constructor in this class is passed:

- A two dimensional array with a column for each MinHash and a row for each document.
- Number of rows per band (a typical value would be 20).

There are two stages to using this LSH class:

- Calc(): This calculates the assignment of documents to buckets and bands. This is done once.
- GetNearest(int n): For the nth document (row) this finds the list candidate similar documents, returned as a list of row numbers.

Here’s a unit test to use this class:

[TestMethod] public void CheckLSHAll() { string lang = "en"; SectionBO.MinHashForSections mhfs = SectionBO.GetMinHash(4, 100, lang, 1); LSH lsh = new LSH(mhfs.hashes, 20); lsh.Calc(); for (int n = 0; n < mhfs.hashes.GetUpperBound(0); n++) { List<int> nearest = lsh.GetNearest(n); } }

This code loads MinHash values for documents from a database (the parameters don’t matter). “mhfs.hashes” is an int[,] array with columns for the hash values (100 in this case) and a row for each document (about 12,000). It then finds similar documents for all 12,000 documents.

How long does this take to execute? The Calc() stage took** 265 Milliseconds**, and the searching for similar documents for all 12,000 individual documents took** 948 Millisecond**s when single threaded. Somewhat shorter than 16 days 🙂

*Important: *Both MinHash similarity testing and finding similar documents with LSH are *estimations* – they will result in Type I errors (false positive, documents that are reported similar, but are not similar) and Type II Errors (false negatives, documents that are similar but are not reported as similar). Type I errors are easy to eliminate – just go back to the MinHash values or the bag of words associated with the document and calculate the similarity. Type II errors cannot be easily rectified without resorting to the brute-force approach.

Note that the time to compute LSH with MinHash is not dependent on the size of the documents, only on the number of documents and number of MinHash functions used.

MinHash can be used in conjunction with Locality Sensitive Hashing (LSH) to compare large numbers of documents very efficiently. See here for an implementation in C#: https://nickgrattan.wordpress.com/2014/03/03/lsh-for-finding-similar-documents-from-a-large-number-of-documents-in-c

**Reference:** [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.102. See http://infolab.stanford.edu/~ullman/mmds.html

## MinHash for Document Fingerprinting in C#

Most hashing techniques are designed to create hash values that differ significantly for similar items, such as documents. For example, two documents that are very similar will often generate hash values that are very different.

MinHash is different – the technique is designed to ensure that two similar items generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the documents they were generated from. This relationship approximates to the Jaccard Similarity.

The MinHash algorithm involves creating a number of hash values for the document using different hash algorithms. Assuming that 100 different hash algorithms are used and each hash values is a four-byte integer value, the entire MinHash can be stored in 400 bytes.

MinHashes alone can be used to *estimate* the similarity of two documents without reference to the content of the documents. They are therefore “document fingerprints” or “document signatures”. The size of document fingerprint is determined by the number of hashes used. A typical number of hashes is around 100, so the total size of the finger print is around 400 bytes regardless of the original size of the document.

There are three stages to creating and using MinHashes:

- Generate the Hash functions.
- Create the MinHash fingerprints for the documents to be compared.
- Use the Jaccard Similarity coefficient using the MinHash fingerprints to test the similarity between documents.

The following class implements all three stages:

public class MinHash { // Constructor passed universe size and number of hash functions public MinHash(int universeSize, int numHashFunctions) { this.numHashFunctions = numHashFunctions; // number of bits to store the universe int u = BitsForUniverse(universeSize); GenerateHashFunctions(u); } private int numHashFunctions; // Returns number of hash functions defined for this instance public int NumHashFunctions { get { return numHashFunctions; } } public delegate uint Hash(int toHash); private Hash[] hashFunctions; // Public access to hash functions public Hash[] HashFunctions { get { return hashFunctions; } } // Generates the Universal Random Hash functions // http://en.wikipedia.org/wiki/Universal_hashing private void GenerateHashFunctions(int u) { hashFunctions = new Hash[numHashFunctions]; // will get the same hash functions each time since the same random number seed is used Random r = new Random(10); for (int i = 0; i < numHashFunctions; i++) { uint a = 0; // parameter a is an odd positive while (a % 1 == 1 || a <= 0) a = (uint)r.Next(); uint b = 0; int maxb = 1 << u; // parameter b must be greater than zero and less than universe size while (b <= 0) b = (uint)r.Next(maxb); hashFunctions[i] = x => QHash(x, a, b, u); } } // Returns the number of bits needed to store the universe public int BitsForUniverse(int universeSize) { return (int)Math.Truncate(Math.Log((double)universeSize, 2.0)) + 1; } // Universal hash function with two parameters a and b, and universe size in bits private static uint QHash(int x, uint a, uint b, int u) { return (a * (uint)x + b) >> (32 - u); } // Returns the list of min hashes for the given set of word Ids public List<uint> GetMinHash(List<int> wordIds) { uint[] minHashes = new uint[numHashFunctions]; for (int h = 0; h < numHashFunctions; h++) { minHashes[h] = int.MaxValue; } foreach (int id in wordIds) { for (int h = 0; h < numHashFunctions; h++) { uint hash = hashFunctions[h](id); minHashes[h] = Math.Min(minHashes[h], hash); } } return minHashes.ToList(); } // Calculates the similarity of two lists of min hash values. Approximately Numerically equivilant to Jaccard Similarity public double Similarity(List<uint> l1, List<uint> l2) { Jaccard jac = new Jaccard(); return Jaccard.Calc(l1, l2); } }

**Generate Hash Functions:**

The Hash functions are calculated once, and the functions then applied to all documents to be compared. In this implementation, Universal Hashing is used in the method GenerateHashFunctions. This method is pased the number of bits to represent the “universe size” (u) – for document similarity the universe size is the number of distinct words expected across all documents (the vocabulary). For each hash function two random numbers are used – a and b. Since the random function is seeded with the same value this method will always generate the same Hash functions. Lambda functions are created to implement these hash functions using QHash – this function is passed “x” (the value to be hashed) and a, b and u.

**Create the MinHash fingerprints:**

The MinHash algorithm is described in [1] and is not repeated here. The algorithm is implemented in the function GetMinHash. This method is passed a list of word integer identifiers – each word in the vocabulary is assumed to have been assigned a unique integer value, and these values are common across all documents. The function returns a list of MinHash values – the number of MinHash values is equal to the number of hash functions selected earlier.

**Testing Document Similarity using Jaccard Coefficient**

The MinHash values obtained from the previous step can be used *estimate* the similarity of two documents. This is done in the function Similarity which is passed a list of MinHash values for each of the two documents being compared. The function returns a value between 0 (completely different) to 1.0 (exactly the same).

This unit test shows how to use this class:

[TestMethod] public void MinHashFunc1() { List inums1 = new List(); inums1.Add(10); inums1.Add(8); inums1.Add(11); inums1.Add(13); inums1.Add(2); inums1.Add(17); inums1.Add(3); inums1.Add(1); inums1.Add(19); inums1.Add(11); MinHash mh = new MinHash(20, 100); List hvs1 = mh.GetMinHash(inums1).ToList(); List inums2 = new List(); inums2.Add(1); inums2.Add(2); inums2.Add(5); inums2.Add(9); inums2.Add(12); inums2.Add(17); inums2.Add(13); inums2.Add(11); inums2.Add(9); inums2.Add(10); List hvs2 = mh.GetMinHash(inums2).ToList(); Console.WriteLine(); Console.WriteLine("Estimated similarity: " + mh.Similarity(hvs1, hvs2)); Console.WriteLine("Jaccard similarity: " + Jaccard.Calc(inums1, inums2)); }

Because the number of “words” in each “document” is small in this unit test, the estimate of document similarity obtained using from the Jaccard Coefficient is not very accurate. However, much better results are obtained with even quite small documents.

The benefits of using MinHash are most significant when many document similarities are being calculated. In this case, the MinHash values for a document can be calculated when a document is created or updated and stored as metadata against the document.

**Reference:** [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.96. See http://infolab.stanford.edu/~ullman/mmds.html

## Jaccard Similarity Index for measuring Document Similarity

The Jaccard Similarity Index can be used to represent the similarity between two documents. A value “0” means the documents are completely dissimilar, “1” that they are identical, and values between 0 and 1 representing a degree of similarity.

The Jaccard index is defined as the size of the intersection of the two documents divided by the size of the union of the two documents.

Of course, we need to decide how to represent the two documents as sets. Of the different ways this can be done, we might:

- Use a
*Bag Of Words,*which is the list of unique words in the document, with no frequency count. - A Bag of Words with
*Stop Words*removed (common language words like “the”, “at” for English). - Sets that take into account the frequency of words in the document.
*k-Shingles*which are fixed k length overlapping runs of characters extracted from the document.*n-grams*which are fixed length n word runs. For example, a 2-gram are overlapping pairs of words taken from a document.

A simple, efficient, way of representing a bag of words is to create a dictionary of all words and assign a unique integer value to each word. Then, the bag of words for a document is simply the unique list of integer values for the words in the document, e.g. List.

For a comparison between Bag of Words and Frequency Distributions see here.

Since sets are not ordered, a bag of words represented like this takes no account of the order in which words appear in the document.

Here is a C# class to calculate the Jaccard index:

public class Jaccard { public static double Calc(HashSet<int> hs1, HashSet<int> hs2) { return ((double)hs1.Intersect(hs2).Count() / (double)hs1.Union(hs2).Count()); } public static double Calc(List<int> ls1, List<int> ls2) { HashSet<int> hs1 = new HashSet<int>(ls1); HashSet<int> hs2 = new HashSet<int>(ls2); return Calc(hs1, hs2); } }

Notice how the List objects are converted to HashSet, since the HashSet<> class has methods for performing set calculations.

This C# code shows some tests for this class:

public void JaccardTest1() { Dictionary<int, string> wordDict = new Dictionary<int, string>(); wordDict.Add(1, "Word1"); wordDict.Add(2, "Word2"); wordDict.Add(3, "Word3"); wordDict.Add(4, "Word4"); List<int> doc1 = new List<int>(); doc1.Add(2); doc1.Add(3); doc1.Add(4); doc1.Add(2); List<int> doc2 = new List<int>(); doc2.Add(1); doc2.Add(5); doc2.Add(4); doc2.Add(2); List<int> doc3 = new List<int>(); doc3.Add(1); Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc2)); Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc1)); Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc3)); }

This creates three documents:

- Doc1: “Word2”, “Word3”, “Word4”, “Word2”.
- Doc2: “Word1”, “Word5”, “Word4”, “Word2”.
- Doc3: “Word1”.

The output is:

Jaccard: 0.4 Jaccard: 1<span data-mce-type="bookmark" id="mce_SELREST_start" data-mce-style="overflow:hidden;line-height:0" style="overflow:hidden;line-height:0" ></span> Jaccard: 0

For the first result, the total number of unique words (union) in “Doc1” and “Doc2” is 5, and the number of shared words (intersection) is 2. This gives the Jaccard Similarity of 2.0 / 5.0 = 0.4.

The second result shows that a document compared to itself gives a value of 1 (completely similar) and the third test shows a value of 0 for completely dissimilar documents.

**Reference:** “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.75. See http://infolab.stanford.edu/~ullman/mmds.html

## Language Identification

Language Identification is the process by which the language a document is written in is determined. It is a Natural Language Processing topic with a long history with varied approaches including:

- Using common words (or stop words) which are unique for particular languages.
- Using N-Gram solutions to work out the probability of adjacent words in different languages.
- Using character frequency and probability distributions

Techniques typically work well for longer documents, but become challenged with short pieces of text and documents that contain text in multiple languages. Documents containing Chinese text can be difficult as tokenization is problematic. Microsoft FAST Search in SharePoint automatically determines the language of a document. Searches can be performed to return documents of a particular language:

In this case, the “DetectedLanguage” managed property is used to return just document in French. (From document: Linguistics Features in SharePoint Server 2010 for Search and FAST Search Server 2010 for SharePoint).

There are cases, though, when language identification needs to be applied to text from sources other than documents. For example, when your code is processing text entered by a user. Since Windows 7 / Windows Server 2008 R2 Microsoft have distributed the “Extended Linguistic Service”, and one of the services is language detection. See here.

The services uses an unpublished algorithm. Calling this COM service from C# takes a bit of work, but luckily the “Windows API Code Pack for Microsoft.NET Framework” provides wrapper classes. Using the ExtendedLinguisticService project allows language detection through code like this code (modified from the book “Professional Windows 7 Development Guide” by John Paul Mueller):

string text = "Test String"; MappingService ms = new MappingService(MappingAvailableServices.LanguageDetection); using (MappingPropertyBag mpb = ms.RecognizeText(text, null)) { String[] languages = mpb.GetResultRanges()[0].FormatData(new StringArrayFormatter()); if (languages == null || languages.Length == 0) { Console.WriteLine(" FAIL "); } else { Console.WriteLine(" " + languages[0]); } }

The Extended Linguistic Service returns a list of identified languages with the most probable first in the list. Unfortunately it does not return a probability or confidence indicator. The Extended Linguistic Service is very fast. For example, classifying 15,876 text files containing over 1.18 GB of pure text took around 3 minutes, compared to 22 minutes for a .NET language identifier I wrote based on common stop words.

## Taxonomies and Ontologies

The terms “Taxonomy” and “Ontology” are often used inter-changeably and are often confused. However, the terms to express different concepts.

**Taxonomy**is about*classification,*and so represents “Is-A” relationships. For example, in the zoological world, a domestic cat (“Felis catus”) is a member of the family “Felidae” which itself is a member of the order “Carnivora”. Such taxonomies are typically, but not always, hierarchical. An object can exist simultaneously in many different taxonomies. So, a cat also belongs to the group “predators”, which would also include the insect “praying mantis”.**Ontology**is about the concepts and relationships that can exist for an object or a group of objects[1]. For example, the “Part-Of” (“Part Holonym” [2]) relationship is used to describe the parts of a car (a wheel is part of a car). Therefore, a taxonomy is a type of ontology by this definition.

SharePoint 2007 introduced the managed metadata service to allow the definition of taxonomies to be used for classifying items and documents through metadata columns. Companies are encouraged to define their own or use industry standard taxonomies for classifying documents across the organization to ensure standardization and improve searchability.

Less work has been done in integrating ontologies within SharePoint, although progress by a number of third-party software vendors is being made. “WordNet” [3] provides a rich source of generic ontological information using the English language, and codifies many types of relationships between nouns, verbs, adjectives and adverbs using “cognitive synonyms” (synsets). Vertical market ontologies are now being built, such as for financial governance by the “GRC3 – Governance, Risk and Compliance Competence Centre” at the University of Cork, Ireland (http://www.ucc.ie). Integration of such ontologies in SharePoint will be the next step in improving search, leading to the possibility of useful question-answering systems.

[1] What is an Ontology? http://www-ksl.stanford.edu/kst/what-is-an-ontology.html

[2] Speech and Language Processing, Jurafsky & Martin, 2nd Pearson International Edition p.653

[3] Princeton University “About WordNet.” WordNet. Princeton University. 2010. http://wordnet.princeton.edu