Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Posts Tagged ‘Jaccard

LSH for Finding Similar Documents from a large number of Documents in C#

with 3 comments


We’ve shown how Jaccard Similarity Coefficient can be used to measure similarity between documents when these documents are represented by a bag of words. MinHash can be used to create document fingerprints which represent the document in a small number of bytes, and can also be used to estimate document similarity using Jaccard Similarity Coefficient.

The next problem is how to find similar documents in a large collection of documents. Imagine you have 12,000 documents and you want to find pairs of most similar documents. The brute-force approach is to compare each document with each other document, using MinHash and Jaccard Similarity Coefficient. Assuming each comparison takes 10 milliseconds this operation will take around 16 days, since each document is compared with each other document.

A faster solution is to use Locality Sensitive Hashing (LSH). This takes the MinHash values for documents and hashes the MinHash values so they hash into the same bucket if they are similar. Buckets are organized into bands, with a number of documents per band (20 is a commonly used value) Documents in the same bucket in a band are then candidates for being similar documents. The algorithm is described fully in [1].

This class implements the LSH algorithm in C#.

    public class LSH
        Dictionary<int, HashSet<int>> m_lshBuckets = new Dictionary<int, HashSet<int>>();

        int[,] minHashes;
        int numBands;
        int numHashFunctions;
        int rowsPerBand;
        int numSets;
        List<SortedList<int, List<int>>> lshBuckets = new List<SortedList<int, List<int>>>();

        public LSH(int[,] minHashes, int rowsPerBand)
            this.minHashes = minHashes;
            numHashFunctions = minHashes.GetUpperBound(1) + 1;
            numSets = minHashes.GetUpperBound(0) + 1;
            this.rowsPerBand = rowsPerBand;
            this.numBands = numHashFunctions / rowsPerBand;

        public void Calc()
            int thisHash = 0;

            for (int b = 0; b < numBands; b++)
                var thisSL = new SortedList<int, List<int>>();
                for (int s = 0; s < numSets; s++)
                    int hashValue = 0;
                    for (int th = thisHash; th < thisHash + rowsPerBand; th++)
                        hashValue = unchecked(hashValue * 1174247 + minHashes[s, th]);
                    if (!thisSL.ContainsKey(hashValue))
                        thisSL.Add(hashValue, new List<int>());
                thisHash += rowsPerBand;
                var copy = new SortedList<int, List<int>>();
                foreach(int ic in thisSL.Keys)
                    if (thisSL[ic].Count() > 1)
                        copy.Add(ic, thisSL[ic]);

        public List<int> GetNearest(int n)
            List<int> nearest = new List<int>();
            foreach (SortedList<int, List<int>> b in lshBuckets)
                foreach (List<int> li in b.Values)
                    if (li.Contains(n))
            nearest = nearest.Distinct().ToList();
            nearest.Remove(n);  // remove the document itself
            return nearest;

The constructor in this class is passed:

  • A two dimensional array with a column for each MinHash and a row for each document.
  • Number of rows per band (a typical value would be 20).

There are two stages to using this LSH class:

  1. Calc(): This calculates the assignment of documents to buckets and bands. This is done once.
  2. GetNearest(int n): For the nth document (row) this finds the list candidate similar documents, returned as a list of row numbers.

Here’s a unit test to use this class:

        public void CheckLSHAll()
            string lang = "en";
            SectionBO.MinHashForSections mhfs = SectionBO.GetMinHash(4, 100, lang, 1);
            LSH lsh = new LSH(mhfs.hashes, 20);
            for (int n = 0; n < mhfs.hashes.GetUpperBound(0); n++)
                List<int> nearest = lsh.GetNearest(n);

This code loads MinHash values for documents from a database (the parameters don’t matter). “mhfs.hashes” is an int[,] array with columns for the hash values (100 in this case) and a row for each document (about 12,000). It then finds similar documents for all 12,000 documents.

How long does this take to execute? The Calc() stage took 265 Milliseconds, and the searching for similar documents for all 12,000 individual documents took 948 Milliseconds when single threaded. Somewhat shorter than 16 days 🙂

Important: Both MinHash similarity testing and finding similar documents with LSH are estimations – they will result in Type I errors (false positive, documents that are reported similar, but are not similar) and Type II Errors (false negatives, documents that are similar but are not reported as similar). Type I errors are easy to eliminate – just go back to the MinHash values or the bag of words associated with the document and calculate the similarity. Type II errors cannot be easily rectified without resorting to the brute-force approach.

Note that the time to compute LSH with MinHash is not dependent on the size of the documents, only on the number of documents and number of MinHash functions used.

MinHash can be used in conjunction with Locality Sensitive Hashing (LSH) to compare large numbers of documents very efficiently. See here for an implementation in C#:

Reference: [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011.  P.102. See

Written by Nick Grattan

March 3, 2014 at 9:04 pm