Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

MinHash for Document Fingerprinting in C#

with 4 comments

Most hashing techniques are designed to create hash values that differ significantly for similar items, such as documents. For example, two documents that are very similar will often generate hash values that are very different.

MinHash is different – the technique is designed to ensure that two similar items generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the documents they were generated from. This relationship approximates to the Jaccard Similarity.

The MinHash algorithm involves creating a number of hash values for the document using different hash algorithms.  Assuming that 100 different hash algorithms are used and each hash values is a four-byte integer value, the entire MinHash can be stored in 400 bytes.

MinHashes alone can be used to estimate the similarity of two documents without reference to the content of the documents. They are therefore “document fingerprints” or “document signatures”. The size of document fingerprint is determined by the number of hashes used. A typical number of hashes is around 100, so the total size of the finger print is around 400 bytes regardless of the original size of the document.

There are three stages to creating and using MinHashes:

  1. Generate the Hash functions.
  2. Create the MinHash fingerprints for the documents to be compared.
  3. Use the Jaccard Similarity coefficient using the MinHash fingerprints to test the similarity between documents.

The following class implements all three stages:

    public class MinHash
    {
        // Constructor passed universe size and number of hash functions
        public MinHash(int universeSize, int numHashFunctions)
        {
            this.numHashFunctions = numHashFunctions;
            // number of bits to store the universe
            int u = BitsForUniverse(universeSize);
            GenerateHashFunctions(u);
        }

        private int numHashFunctions;

        // Returns number of hash functions defined for this instance
        public int NumHashFunctions
        {
            get { return numHashFunctions; }
        }

        public delegate uint Hash(int toHash);
        private Hash[] hashFunctions;

        // Public access to hash functions
        public Hash[] HashFunctions
        {
            get { return hashFunctions; }
        }

        // Generates the Universal Random Hash functions
        // http://en.wikipedia.org/wiki/Universal_hashing
        private void GenerateHashFunctions(int u)
        {
            hashFunctions = new Hash[numHashFunctions];

            // will get the same hash functions each time since the same random number seed is used
            Random r = new Random(10);
            for (int i = 0; i < numHashFunctions; i++)
            {
                uint a = 0;
                // parameter a is an odd positive
                while (a % 1 == 1 || a <= 0)
                    a = (uint)r.Next();
                uint b = 0;
                int maxb = 1 << u;
                // parameter b must be greater than zero and less than universe size
                while (b <= 0)
                    b = (uint)r.Next(maxb);
                hashFunctions[i] = x => QHash(x, a, b, u);
            }
        }

        // Returns the number of bits needed to store the universe
        public int BitsForUniverse(int universeSize)
        {
            return (int)Math.Truncate(Math.Log((double)universeSize, 2.0)) + 1;
        }

        // Universal hash function with two parameters a and b, and universe size in bits
        private static uint QHash(int x, uint a, uint b, int u)
        {
            return (a * (uint)x + b) >> (32 - u);
        }

        // Returns the list of min hashes for the given set of word Ids
        public List<uint> GetMinHash(List<int> wordIds)
        {
            uint[] minHashes = new uint[numHashFunctions];
            for (int h = 0; h < numHashFunctions; h++)
            {
                minHashes[h] = int.MaxValue;
            }
            foreach (int id in wordIds)
            {
                for (int h = 0; h < numHashFunctions; h++)
                {
                    uint hash = hashFunctions[h](id);
                    minHashes[h] = Math.Min(minHashes[h], hash);
                }
            }
            return minHashes.ToList();
        }

        // Calculates the similarity of two lists of min hash values. Approximately Numerically equivilant to Jaccard Similarity
        public double Similarity(List<uint> l1, List<uint> l2)
        {
            Jaccard jac = new Jaccard();
            return Jaccard.Calc(l1, l2);
        }
    }

Generate Hash Functions:

The Hash functions are calculated once, and the functions then applied to all documents to be compared. In this implementation, Universal Hashing is used in the method GenerateHashFunctions. This method is pased the number of bits to represent the “universe size” (u) – for document similarity the universe size is the number of distinct words expected across all documents (the vocabulary). For each hash function two random numbers are used – a and b. Since the random function is seeded with the same value this method will always generate the same Hash functions. Lambda functions are created to implement these hash functions using QHash – this function is passed “x” (the value to be hashed) and a, b and u.

Create the MinHash fingerprints:

The MinHash algorithm is described in [1] and is not repeated here. The algorithm is implemented in the function GetMinHash. This method is passed a list of word integer identifiers – each word in the vocabulary is assumed to have been assigned a unique integer value, and these values are common across all documents. The function returns a list of MinHash values – the number of MinHash values is equal to the number of hash functions selected earlier.

Testing Document Similarity using Jaccard Coefficient

The MinHash values obtained from the previous step can be used estimate the similarity of two documents. This is done in the function Similarity which is passed a list of MinHash values for each of the two documents being compared. The function returns a value between 0 (completely different) to 1.0 (exactly the same).

This unit test shows how to use this class:

        [TestMethod]
        public void MinHashFunc1()
        {
            List inums1 = new List();
            inums1.Add(10);
            inums1.Add(8);
            inums1.Add(11);
            inums1.Add(13);
            inums1.Add(2);
            inums1.Add(17);
            inums1.Add(3);
            inums1.Add(1);
            inums1.Add(19);
            inums1.Add(11);
            MinHash mh = new MinHash(20, 100);
            List hvs1 = mh.GetMinHash(inums1).ToList();

            List inums2 = new List();
            inums2.Add(1);
            inums2.Add(2);
            inums2.Add(5);
            inums2.Add(9);
            inums2.Add(12);
            inums2.Add(17);
            inums2.Add(13);
            inums2.Add(11);
            inums2.Add(9);
            inums2.Add(10);

            List hvs2 = mh.GetMinHash(inums2).ToList();
            Console.WriteLine();
            Console.WriteLine("Estimated similarity: " + mh.Similarity(hvs1, hvs2));
            Console.WriteLine("Jaccard similarity: " + Jaccard.Calc(inums1, inums2));
        }

Because the number of “words” in each “document” is small in this unit test, the estimate of document similarity obtained using from the Jaccard Coefficient is not very accurate. However, much better results are obtained with even quite small documents.

The benefits of using MinHash are most significant when many document similarities are being calculated. In this case, the MinHash values for a document can be calculated when a document is created or updated and stored as metadata against the document.

Reference: [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011.  P.96. See http://infolab.stanford.edu/~ullman/mmds.html

Advertisements

Written by Nick Grattan

February 23, 2014 at 3:02 pm

4 Responses

Subscribe to comments with RSS.

  1. […] to measure similarity between documents when these documents are represented by a bag of words. MinHash can be used to create document fingerprints which represent the document in a small number of […]

  2. […] MinHash for Document Fingerprinting in C# […]

  3. in this code here:

    // parameter a is an odd positive
    while (a % 1 == 1 || a =0)

    ?

    taylor

    February 15, 2015 at 1:04 am


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: