## Posts Tagged ‘**Document Similarity**’

## LSH for Finding Similar Documents from a large number of Documents in C#

We’ve shown how Jaccard Similarity Coefficient can be used to measure similarity between documents when these documents are represented by a bag of words. MinHash can be used to create document fingerprints which represent the document in a small number of bytes, and can also be used to estimate document similarity using Jaccard Similarity Coefficient.

The next problem is how to find similar documents in a large collection of documents. Imagine you have 12,000 documents and you want to find pairs of most similar documents. The brute-force approach is to compare each document with each other document, using MinHash and Jaccard Similarity Coefficient. Assuming each comparison takes 10 milliseconds this operation will take around 16 days, since each document is compared with each other document.

A faster solution is to use Locality Sensitive Hashing (LSH). This takes the MinHash values for documents and hashes the MinHash values so they hash into the same bucket if they are similar. Buckets are organized into bands, with a number of documents per band (20 is a commonly used value) Documents in the same bucket in a band are then candidates for being similar documents. The algorithm is described fully in [1].

This class implements the LSH algorithm in C#.

public class LSH { Dictionary<int, HashSet<int>> m_lshBuckets = new Dictionary<int, HashSet<int>>(); int[,] minHashes; int numBands; int numHashFunctions; int rowsPerBand; int numSets; List<SortedList<int, List<int>>> lshBuckets = new List<SortedList<int, List<int>>>(); public LSH(int[,] minHashes, int rowsPerBand) { this.minHashes = minHashes; numHashFunctions = minHashes.GetUpperBound(1) + 1; numSets = minHashes.GetUpperBound(0) + 1; this.rowsPerBand = rowsPerBand; this.numBands = numHashFunctions / rowsPerBand; } public void Calc() { int thisHash = 0; for (int b = 0; b < numBands; b++) { var thisSL = new SortedList<int, List<int>>(); for (int s = 0; s < numSets; s++) { int hashValue = 0; for (int th = thisHash; th < thisHash + rowsPerBand; th++) { hashValue = unchecked(hashValue * 1174247 + minHashes[s, th]); } if (!thisSL.ContainsKey(hashValue)) { thisSL.Add(hashValue, new List<int>()); } thisSL[hashValue].Add(s); } thisHash += rowsPerBand; var copy = new SortedList<int, List<int>>(); foreach(int ic in thisSL.Keys) { if (thisSL[ic].Count() > 1) { copy.Add(ic, thisSL[ic]); } } lshBuckets.Add(copy); } } public List<int> GetNearest(int n) { List<int> nearest = new List<int>(); foreach (SortedList<int, List<int>> b in lshBuckets) { foreach (List<int> li in b.Values) { if (li.Contains(n)) { nearest.AddRange(li); break; } } } nearest = nearest.Distinct().ToList(); nearest.Remove(n); // remove the document itself return nearest; } }

The constructor in this class is passed:

- A two dimensional array with a column for each MinHash and a row for each document.
- Number of rows per band (a typical value would be 20).

There are two stages to using this LSH class:

- Calc(): This calculates the assignment of documents to buckets and bands. This is done once.
- GetNearest(int n): For the nth document (row) this finds the list candidate similar documents, returned as a list of row numbers.

Here’s a unit test to use this class:

[TestMethod] public void CheckLSHAll() { string lang = "en"; SectionBO.MinHashForSections mhfs = SectionBO.GetMinHash(4, 100, lang, 1); LSH lsh = new LSH(mhfs.hashes, 20); lsh.Calc(); for (int n = 0; n < mhfs.hashes.GetUpperBound(0); n++) { List<int> nearest = lsh.GetNearest(n); } }

This code loads MinHash values for documents from a database (the parameters don’t matter). “mhfs.hashes” is an int[,] array with columns for the hash values (100 in this case) and a row for each document (about 12,000). It then finds similar documents for all 12,000 documents.

How long does this take to execute? The Calc() stage took** 265 Milliseconds**, and the searching for similar documents for all 12,000 individual documents took** 948 Millisecond**s when single threaded. Somewhat shorter than 16 days 🙂

*Important: *Both MinHash similarity testing and finding similar documents with LSH are *estimations* – they will result in Type I errors (false positive, documents that are reported similar, but are not similar) and Type II Errors (false negatives, documents that are similar but are not reported as similar). Type I errors are easy to eliminate – just go back to the MinHash values or the bag of words associated with the document and calculate the similarity. Type II errors cannot be easily rectified without resorting to the brute-force approach.

Note that the time to compute LSH with MinHash is not dependent on the size of the documents, only on the number of documents and number of MinHash functions used.

MinHash can be used in conjunction with Locality Sensitive Hashing (LSH) to compare large numbers of documents very efficiently. See here for an implementation in C#: https://nickgrattan.wordpress.com/2014/03/03/lsh-for-finding-similar-documents-from-a-large-number-of-documents-in-c

**Reference:** [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.102. See http://infolab.stanford.edu/~ullman/mmds.html

## MinHash for Document Fingerprinting in C#

Most hashing techniques are designed to create hash values that differ significantly for similar items, such as documents. For example, two documents that are very similar will often generate hash values that are very different.

MinHash is different – the technique is designed to ensure that two similar items generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the documents they were generated from. This relationship approximates to the Jaccard Similarity.

The MinHash algorithm involves creating a number of hash values for the document using different hash algorithms. Assuming that 100 different hash algorithms are used and each hash values is a four-byte integer value, the entire MinHash can be stored in 400 bytes.

MinHashes alone can be used to *estimate* the similarity of two documents without reference to the content of the documents. They are therefore “document fingerprints” or “document signatures”. The size of document fingerprint is determined by the number of hashes used. A typical number of hashes is around 100, so the total size of the finger print is around 400 bytes regardless of the original size of the document.

There are three stages to creating and using MinHashes:

- Generate the Hash functions.
- Create the MinHash fingerprints for the documents to be compared.
- Use the Jaccard Similarity coefficient using the MinHash fingerprints to test the similarity between documents.

The following class implements all three stages:

public class MinHash { // Constructor passed universe size and number of hash functions public MinHash(int universeSize, int numHashFunctions) { this.numHashFunctions = numHashFunctions; // number of bits to store the universe int u = BitsForUniverse(universeSize); GenerateHashFunctions(u); } private int numHashFunctions; // Returns number of hash functions defined for this instance public int NumHashFunctions { get { return numHashFunctions; } } public delegate uint Hash(int toHash); private Hash[] hashFunctions; // Public access to hash functions public Hash[] HashFunctions { get { return hashFunctions; } } // Generates the Universal Random Hash functions // http://en.wikipedia.org/wiki/Universal_hashing private void GenerateHashFunctions(int u) { hashFunctions = new Hash[numHashFunctions]; // will get the same hash functions each time since the same random number seed is used Random r = new Random(10); for (int i = 0; i < numHashFunctions; i++) { uint a = 0; // parameter a is an odd positive while (a % 1 == 1 || a <= 0) a = (uint)r.Next(); uint b = 0; int maxb = 1 << u; // parameter b must be greater than zero and less than universe size while (b <= 0) b = (uint)r.Next(maxb); hashFunctions[i] = x => QHash(x, a, b, u); } } // Returns the number of bits needed to store the universe public int BitsForUniverse(int universeSize) { return (int)Math.Truncate(Math.Log((double)universeSize, 2.0)) + 1; } // Universal hash function with two parameters a and b, and universe size in bits private static uint QHash(int x, uint a, uint b, int u) { return (a * (uint)x + b) >> (32 - u); } // Returns the list of min hashes for the given set of word Ids public List<uint> GetMinHash(List<int> wordIds) { uint[] minHashes = new uint[numHashFunctions]; for (int h = 0; h < numHashFunctions; h++) { minHashes[h] = int.MaxValue; } foreach (int id in wordIds) { for (int h = 0; h < numHashFunctions; h++) { uint hash = hashFunctions[h](id); minHashes[h] = Math.Min(minHashes[h], hash); } } return minHashes.ToList(); } // Calculates the similarity of two lists of min hash values. Approximately Numerically equivilant to Jaccard Similarity public double Similarity(List<uint> l1, List<uint> l2) { Jaccard jac = new Jaccard(); return Jaccard.Calc(l1, l2); } }

**Generate Hash Functions:**

The Hash functions are calculated once, and the functions then applied to all documents to be compared. In this implementation, Universal Hashing is used in the method GenerateHashFunctions. This method is pased the number of bits to represent the “universe size” (u) – for document similarity the universe size is the number of distinct words expected across all documents (the vocabulary). For each hash function two random numbers are used – a and b. Since the random function is seeded with the same value this method will always generate the same Hash functions. Lambda functions are created to implement these hash functions using QHash – this function is passed “x” (the value to be hashed) and a, b and u.

**Create the MinHash fingerprints:**

The MinHash algorithm is described in [1] and is not repeated here. The algorithm is implemented in the function GetMinHash. This method is passed a list of word integer identifiers – each word in the vocabulary is assumed to have been assigned a unique integer value, and these values are common across all documents. The function returns a list of MinHash values – the number of MinHash values is equal to the number of hash functions selected earlier.

**Testing Document Similarity using Jaccard Coefficient**

The MinHash values obtained from the previous step can be used *estimate* the similarity of two documents. This is done in the function Similarity which is passed a list of MinHash values for each of the two documents being compared. The function returns a value between 0 (completely different) to 1.0 (exactly the same).

This unit test shows how to use this class:

[TestMethod] public void MinHashFunc1() { List inums1 = new List(); inums1.Add(10); inums1.Add(8); inums1.Add(11); inums1.Add(13); inums1.Add(2); inums1.Add(17); inums1.Add(3); inums1.Add(1); inums1.Add(19); inums1.Add(11); MinHash mh = new MinHash(20, 100); List hvs1 = mh.GetMinHash(inums1).ToList(); List inums2 = new List(); inums2.Add(1); inums2.Add(2); inums2.Add(5); inums2.Add(9); inums2.Add(12); inums2.Add(17); inums2.Add(13); inums2.Add(11); inums2.Add(9); inums2.Add(10); List hvs2 = mh.GetMinHash(inums2).ToList(); Console.WriteLine(); Console.WriteLine("Estimated similarity: " + mh.Similarity(hvs1, hvs2)); Console.WriteLine("Jaccard similarity: " + Jaccard.Calc(inums1, inums2)); }

Because the number of “words” in each “document” is small in this unit test, the estimate of document similarity obtained using from the Jaccard Coefficient is not very accurate. However, much better results are obtained with even quite small documents.

The benefits of using MinHash are most significant when many document similarities are being calculated. In this case, the MinHash values for a document can be calculated when a document is created or updated and stored as metadata against the document.

**Reference:** [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.96. See http://infolab.stanford.edu/~ullman/mmds.html

## Jaccard Similarity Index for measuring Document Similarity

The Jaccard Similarity Index can be used to represent the similarity between two documents. A value “0” means the documents are completely dissimilar, “1” that they are identical, and values between 0 and 1 representing a degree of similarity.

The Jaccard index is defined as the size of the intersection of the two documents divided by the size of the union of the two documents.

Of course, we need to decide how to represent the two documents as sets. Of the different ways this can be done, we might:

- Use a
*Bag Of Words,*which is the list of unique words in the document, with no frequency count. - A Bag of Words with
*Stop Words*removed (common language words like “the”, “at” for English). - Sets that take into account the frequency of words in the document.
*k-Shingles*which are fixed k length overlapping runs of characters extracted from the document.*n-grams*which are fixed length n word runs. For example, a 2-gram are overlapping pairs of words taken from a document.

A simple, efficient, way of representing a bag of words is to create a dictionary of all words and assign a unique integer value to each word. Then, the bag of words for a document is simply the unique list of integer values for the words in the document, e.g. List.

For a comparison between Bag of Words and Frequency Distributions see here.

Since sets are not ordered, a bag of words represented like this takes no account of the order in which words appear in the document.

Here is a C# class to calculate the Jaccard index:

public class Jaccard { public static double Calc(HashSet<int> hs1, HashSet<int> hs2) { return ((double)hs1.Intersect(hs2).Count() / (double)hs1.Union(hs2).Count()); } public static double Calc(List<int> ls1, List<int> ls2) { HashSet<int> hs1 = new HashSet<int>(ls1); HashSet<int> hs2 = new HashSet<int>(ls2); return Calc(hs1, hs2); } }

Notice how the List objects are converted to HashSet, since the HashSet<> class has methods for performing set calculations.

This C# code shows some tests for this class:

public void JaccardTest1() { Dictionary<int, string> wordDict = new Dictionary<int, string>(); wordDict.Add(1, "Word1"); wordDict.Add(2, "Word2"); wordDict.Add(3, "Word3"); wordDict.Add(4, "Word4"); List<int> doc1 = new List<int>(); doc1.Add(2); doc1.Add(3); doc1.Add(4); doc1.Add(2); List<int> doc2 = new List<int>(); doc2.Add(1); doc2.Add(5); doc2.Add(4); doc2.Add(2); List<int> doc3 = new List<int>(); doc3.Add(1); Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc2)); Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc1)); Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc3)); }

This creates three documents:

- Doc1: “Word2”, “Word3”, “Word4”, “Word2”.
- Doc2: “Word1”, “Word5”, “Word4”, “Word2”.
- Doc3: “Word1”.

The output is:

Jaccard: 0.4 Jaccard: 1<span data-mce-type="bookmark" id="mce_SELREST_start" data-mce-style="overflow:hidden;line-height:0" style="overflow:hidden;line-height:0" ></span> Jaccard: 0

For the first result, the total number of unique words (union) in “Doc1” and “Doc2” is 5, and the number of shared words (intersection) is 2. This gives the Jaccard Similarity of 2.0 / 5.0 = 0.4.

The second result shows that a document compared to itself gives a value of 1 (completely similar) and the third test shows a value of 0 for completely dissimilar documents.

**Reference:** “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.75. See http://infolab.stanford.edu/~ullman/mmds.html