## Posts Tagged ‘**Jaccard Similarity**’

## MinHash for Document Fingerprinting in C#

Most hashing techniques are designed to create hash values that differ significantly for similar items, such as documents. For example, two documents that are very similar will often generate hash values that are very different.

MinHash is different – the technique is designed to ensure that two similar items generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the documents they were generated from. This relationship approximates to the Jaccard Similarity.

The MinHash algorithm involves creating a number of hash values for the document using different hash algorithms. Assuming that 100 different hash algorithms are used and each hash values is a four-byte integer value, the entire MinHash can be stored in 400 bytes.

MinHashes alone can be used to *estimate* the similarity of two documents without reference to the content of the documents. They are therefore “document fingerprints” or “document signatures”. The size of document fingerprint is determined by the number of hashes used. A typical number of hashes is around 100, so the total size of the finger print is around 400 bytes regardless of the original size of the document.

There are three stages to creating and using MinHashes:

- Generate the Hash functions.
- Create the MinHash fingerprints for the documents to be compared.
- Use the Jaccard Similarity coefficient using the MinHash fingerprints to test the similarity between documents.

The following class implements all three stages:

public class MinHash { // Constructor passed universe size and number of hash functions public MinHash(int universeSize, int numHashFunctions) { this.numHashFunctions = numHashFunctions; // number of bits to store the universe int u = BitsForUniverse(universeSize); GenerateHashFunctions(u); } private int numHashFunctions; // Returns number of hash functions defined for this instance public int NumHashFunctions { get { return numHashFunctions; } } public delegate uint Hash(int toHash); private Hash[] hashFunctions; // Public access to hash functions public Hash[] HashFunctions { get { return hashFunctions; } } // Generates the Universal Random Hash functions // http://en.wikipedia.org/wiki/Universal_hashing private void GenerateHashFunctions(int u) { hashFunctions = new Hash[numHashFunctions]; // will get the same hash functions each time since the same random number seed is used Random r = new Random(10); for (int i = 0; i < numHashFunctions; i++) { uint a = 0; // parameter a is an odd positive while (a % 1 == 1 || a <= 0) a = (uint)r.Next(); uint b = 0; int maxb = 1 << u; // parameter b must be greater than zero and less than universe size while (b <= 0) b = (uint)r.Next(maxb); hashFunctions[i] = x => QHash(x, a, b, u); } } // Returns the number of bits needed to store the universe public int BitsForUniverse(int universeSize) { return (int)Math.Truncate(Math.Log((double)universeSize, 2.0)) + 1; } // Universal hash function with two parameters a and b, and universe size in bits private static uint QHash(int x, uint a, uint b, int u) { return (a * (uint)x + b) >> (32 - u); } // Returns the list of min hashes for the given set of word Ids public List<uint> GetMinHash(List<int> wordIds) { uint[] minHashes = new uint[numHashFunctions]; for (int h = 0; h < numHashFunctions; h++) { minHashes[h] = int.MaxValue; } foreach (int id in wordIds) { for (int h = 0; h < numHashFunctions; h++) { uint hash = hashFunctions[h](id); minHashes[h] = Math.Min(minHashes[h], hash); } } return minHashes.ToList(); } // Calculates the similarity of two lists of min hash values. Approximately Numerically equivilant to Jaccard Similarity public double Similarity(List<uint> l1, List<uint> l2) { Jaccard jac = new Jaccard(); return Jaccard.Calc(l1, l2); } }

**Generate Hash Functions:**

The Hash functions are calculated once, and the functions then applied to all documents to be compared. In this implementation, Universal Hashing is used in the method GenerateHashFunctions. This method is pased the number of bits to represent the “universe size” (u) – for document similarity the universe size is the number of distinct words expected across all documents (the vocabulary). For each hash function two random numbers are used – a and b. Since the random function is seeded with the same value this method will always generate the same Hash functions. Lambda functions are created to implement these hash functions using QHash – this function is passed “x” (the value to be hashed) and a, b and u.

**Create the MinHash fingerprints:**

The MinHash algorithm is described in [1] and is not repeated here. The algorithm is implemented in the function GetMinHash. This method is passed a list of word integer identifiers – each word in the vocabulary is assumed to have been assigned a unique integer value, and these values are common across all documents. The function returns a list of MinHash values – the number of MinHash values is equal to the number of hash functions selected earlier.

**Testing Document Similarity using Jaccard Coefficient**

The MinHash values obtained from the previous step can be used *estimate* the similarity of two documents. This is done in the function Similarity which is passed a list of MinHash values for each of the two documents being compared. The function returns a value between 0 (completely different) to 1.0 (exactly the same).

This unit test shows how to use this class:

[TestMethod] public void MinHashFunc1() { List inums1 = new List(); inums1.Add(10); inums1.Add(8); inums1.Add(11); inums1.Add(13); inums1.Add(2); inums1.Add(17); inums1.Add(3); inums1.Add(1); inums1.Add(19); inums1.Add(11); MinHash mh = new MinHash(20, 100); List hvs1 = mh.GetMinHash(inums1).ToList(); List inums2 = new List(); inums2.Add(1); inums2.Add(2); inums2.Add(5); inums2.Add(9); inums2.Add(12); inums2.Add(17); inums2.Add(13); inums2.Add(11); inums2.Add(9); inums2.Add(10); List hvs2 = mh.GetMinHash(inums2).ToList(); Console.WriteLine(); Console.WriteLine("Estimated similarity: " + mh.Similarity(hvs1, hvs2)); Console.WriteLine("Jaccard similarity: " + Jaccard.Calc(inums1, inums2)); }

Because the number of “words” in each “document” is small in this unit test, the estimate of document similarity obtained using from the Jaccard Coefficient is not very accurate. However, much better results are obtained with even quite small documents.

The benefits of using MinHash are most significant when many document similarities are being calculated. In this case, the MinHash values for a document can be calculated when a document is created or updated and stored as metadata against the document.

**Reference:** [1] “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011. P.96. See http://infolab.stanford.edu/~ullman/mmds.html