Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

MinHash for Document Fingerprinting in C#

with 4 comments


Most hashing techniques are designed to create hash values that differ significantly for similar items, such as documents. For example, two documents that are very similar will often generate hash values that are very different.

MinHash is different – the technique is designed to ensure that two similar items generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the documents they were generated from. This relationship approximates to the Jaccard Similarity.

The MinHash algorithm involves creating a number of hash values for the document using different hash algorithms. Assuming that 100 different hash functions are used and each hash value is a four-byte integer, the entire MinHash can be stored in 400 bytes.

MinHashes alone can be used to estimate the similarity of two documents without reference to the content of the documents. They are therefore “document fingerprints” or “document signatures”. The size of a document fingerprint is determined by the number of hashes used. A typical number of hashes is around 100, so the total size of the fingerprint is around 400 bytes regardless of the original size of the document.

There are three stages to creating and using MinHashes:

  1. Generate the Hash functions.
  2. Create the MinHash fingerprints for the documents to be compared.
  3. Use the Jaccard similarity coefficient on the MinHash fingerprints to test the similarity between documents.

The following class implements all three stages:

    public class MinHash
    {
        // Constructor passed universe size and number of hash functions
        public MinHash(int universeSize, int numHashFunctions)
        {
            this.numHashFunctions = numHashFunctions;
            // number of bits to store the universe
            int u = BitsForUniverse(universeSize);
            GenerateHashFunctions(u);
        }

        private int numHashFunctions;

        // Returns number of hash functions defined for this instance
        public int NumHashFunctions
        {
            get { return numHashFunctions; }
        }

        public delegate uint Hash(int toHash);
        private Hash[] hashFunctions;

        // Public access to hash functions
        public Hash[] HashFunctions
        {
            get { return hashFunctions; }
        }

        // Generates the Universal Random Hash functions
        // http://en.wikipedia.org/wiki/Universal_hashing
        private void GenerateHashFunctions(int u)
        {
            hashFunctions = new Hash[numHashFunctions];

            // will get the same hash functions each time since the same random number seed is used
            Random r = new Random(10);
            for (int i = 0; i < numHashFunctions; i++)
            {
                uint a = 0;
                // parameter a must be an odd positive integer
                while (a % 2 == 0)
                    a = (uint)r.Next();
                uint b = 0;
                int maxb = 1 << u;
                // parameter b must be greater than zero and less than universe size
                while (b == 0)
                    b = (uint)r.Next(maxb);
                hashFunctions[i] = x => QHash(x, a, b, u);
            }
        }

        // Returns the number of bits needed to store the universe
        public int BitsForUniverse(int universeSize)
        {
            return (int)Math.Truncate(Math.Log((double)universeSize, 2.0)) + 1;
        }

        // Universal hash function with two parameters a and b, and universe size in bits
        private static uint QHash(int x, uint a, uint b, int u)
        {
            return (a * (uint)x + b) >> (32 - u);
        }

        // Returns the list of min hashes for the given set of word Ids
        public List<uint> GetMinHash(List<int> wordIds)
        {
            uint[] minHashes = new uint[numHashFunctions];
            for (int h = 0; h < numHashFunctions; h++)
            {
                minHashes[h] = uint.MaxValue;
            }
            foreach (int id in wordIds)
            {
                for (int h = 0; h < numHashFunctions; h++)
                {
                    uint hash = hashFunctions[h](id);
                    minHashes[h] = Math.Min(minHashes[h], hash);
                }
            }
            return minHashes.ToList();
        }

        // Calculates the similarity of two lists of MinHash values; approximately equivalent to the Jaccard similarity
        public double Similarity(List<uint> l1, List<uint> l2)
        {
            return Jaccard.Calc(l1, l2);
        }
    }

Generate Hash Functions:

The hash functions are calculated once, and the functions are then applied to all documents to be compared. In this implementation, Universal Hashing is used in the method GenerateHashFunctions. This method is passed the number of bits needed to represent the “universe size” (u) – for document similarity the universe size is the number of distinct words expected across all documents (the vocabulary). For each hash function two random numbers, a and b, are generated. Since the random number generator is seeded with the same value, this method will always generate the same hash functions. Lambda functions are created to implement these hash functions using QHash – this function is passed “x” (the value to be hashed) together with a, b and u.
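As a standalone illustration of this hash family (the parameter values below are arbitrary examples of my own, not taken from the class above), each function maps an integer x into the range [0, 2^u) using a multiplier a (odd) and an offset b:

```csharp
using System;

// Standalone sketch of the universal hash family h(x) = (a*x + b) >> (32 - u).
// The parameters below are arbitrary examples: a must be odd, b positive.
public static class UniversalHashDemo
{
    public static uint QHash(int x, uint a, uint b, int u)
    {
        // uint arithmetic wraps on overflow, which is intended here
        return (a * (uint)x + b) >> (32 - u);
    }

    public static void Main()
    {
        int u = 5;                        // a universe of size 20 fits in 5 bits
        uint a = 2654435761u, b = 17u;    // one member of the family
        for (int x = 0; x < 20; x++)
        {
            // every hash lands in [0, 2^u), i.e. [0, 32) here
            Console.WriteLine(x + " -> " + QHash(x, a, b, u));
        }
    }
}
```

Because a and b are fixed once chosen, the function is deterministic: the same x always hashes to the same value, which is what allows fingerprints computed at different times to be compared.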

Create the MinHash fingerprints:

The MinHash algorithm is described in [1] and is not repeated here. The algorithm is implemented in the function GetMinHash. This method is passed a list of word integer identifiers – each word in the vocabulary is assumed to have been assigned a unique integer value, and these values are common across all documents. The function returns a list of MinHash values – the number of MinHash values is equal to the number of hash functions selected earlier.

Testing Document Similarity using Jaccard Coefficient

The MinHash values obtained from the previous step can be used to estimate the similarity of two documents. This is done in the function Similarity, which is passed a list of MinHash values for each of the two documents being compared. The function returns a value between 0 (completely different) and 1.0 (exactly the same).

This unit test shows how to use this class:

        [TestMethod]
        public void MinHashFunc1()
        {
            List<int> inums1 = new List<int>();
            inums1.Add(10);
            inums1.Add(8);
            inums1.Add(11);
            inums1.Add(13);
            inums1.Add(2);
            inums1.Add(17);
            inums1.Add(3);
            inums1.Add(1);
            inums1.Add(19);
            inums1.Add(11);
            MinHash mh = new MinHash(20, 100);
            List<uint> hvs1 = mh.GetMinHash(inums1);

            List<int> inums2 = new List<int>();
            inums2.Add(1);
            inums2.Add(2);
            inums2.Add(5);
            inums2.Add(9);
            inums2.Add(12);
            inums2.Add(17);
            inums2.Add(13);
            inums2.Add(11);
            inums2.Add(9);
            inums2.Add(10);

            List<uint> hvs2 = mh.GetMinHash(inums2);
            Console.WriteLine();
            Console.WriteLine("Estimated similarity: " + mh.Similarity(hvs1, hvs2));
            Console.WriteLine("Jaccard similarity: " + Jaccard.Calc(inums1, inums2));
        }

Because the number of “words” in each “document” is small in this unit test, the estimate of document similarity obtained from the MinHash fingerprints is not very accurate. However, much better results are obtained with even quite small real documents.

The benefits of using MinHash are most significant when many document similarities are being calculated. In this case, the MinHash values for a document can be calculated when the document is created or updated and stored as metadata against the document.
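As a sketch of that pattern (the class and member names here are my own invention, not from the original post), fingerprints can be computed once per document and cached, so later comparisons never touch the document text. The fingerprinting function is passed in as a delegate, so any implementation – such as the MinHash class above – can be plugged in:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch: cache one fingerprint per document id so that pairwise
// comparisons are done on fingerprints alone. All names are hypothetical.
public class FingerprintStore
{
    private readonly Func<List<int>, List<uint>> fingerprint;
    private readonly Dictionary<string, List<uint>> cache =
        new Dictionary<string, List<uint>>();

    public FingerprintStore(Func<List<int>, List<uint>> fingerprint)
    {
        this.fingerprint = fingerprint;
    }

    // Call whenever a document is created or updated.
    public void Upsert(string docId, List<int> wordIds)
    {
        cache[docId] = fingerprint(wordIds);
    }

    // Retrieve the stored fingerprint, e.g. to pass to MinHash.Similarity.
    public List<uint> Get(string docId)
    {
        return cache[docId];
    }
}
```

With the class above this could be wired up as `var store = new FingerprintStore(ids => mh.GetMinHash(ids));`, where `mh` is a MinHash instance.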

Reference: [1] “Mining of Massive Datasets” by A. Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press, 2011, p.96. See http://infolab.stanford.edu/~ullman/mmds.html

Written by Nick Grattan

February 23, 2014 at 3:02 pm

Jaccard Similarity Index for measuring Document Similarity

with 6 comments


The Jaccard Similarity Index can be used to represent the similarity between two documents. A value of “0” means the documents are completely dissimilar, “1” means they are identical, and values between 0 and 1 represent a degree of similarity.

The Jaccard index is defined as the size of the intersection of the two documents divided by the size of the union of the two documents.

Of course, we need to decide how to represent the two documents as sets. Of the different ways this can be done, we might:

  1. Use a Bag Of Words, which is the list of unique words in the document, with no frequency count.
  2. A Bag of Words with Stop Words removed (common language words like “the”, “at” for English).
  3. Sets that take into account the frequency of words in the document.
  4. k-Shingles, which are overlapping runs of k characters extracted from the document.
  5. n-grams, which are fixed-length runs of n words. For example, 2-grams are overlapping pairs of words taken from a document.
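For example, a minimal k-shingle extractor might look like this (a sketch of my own; practical implementations usually hash each shingle down to an integer to save space):

```csharp
using System;
using System.Collections.Generic;

// Sketch of k-shingling: the set of all overlapping k-character
// substrings of a document.
public static class Shingler
{
    public static HashSet<string> KShingles(string text, int k)
    {
        var shingles = new HashSet<string>();
        for (int i = 0; i + k <= text.Length; i++)
        {
            shingles.Add(text.Substring(i, k));
        }
        return shingles;
    }
}
```

For instance, `KShingles("abcab", 3)` yields the set { "abc", "bca", "cab" }.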

A simple, efficient way of representing a bag of words is to create a dictionary of all words and assign a unique integer value to each word. Then, the bag of words for a document is simply the unique list of integer values for the words in the document, e.g. List<int>.
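A sketch of that scheme (the `Vocabulary` class and its very naive tokenization are illustrations of my own, not from the original post):

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch: a shared word dictionary that turns each document
// into the unique, sorted list of integer word ids -- a bag of words
// without frequency counts. Tokenization here is deliberately naive.
public class Vocabulary
{
    private readonly Dictionary<string, int> wordIds =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

    public List<int> ToBagOfWords(string document)
    {
        var bag = new SortedSet<int>();
        foreach (string word in document.Split(
            new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int id;
            if (!wordIds.TryGetValue(word, out id))
            {
                id = wordIds.Count;   // assign the next unused id
                wordIds[word] = id;
            }
            bag.Add(id);
        }
        return new List<int>(bag);
    }
}
```

Because ids are assigned from one shared dictionary, the same word receives the same id in every document, so “the cat sat on the mat” maps to five ids – the repeated “the” appears only once in the set.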

For a comparison between Bag of Words and Frequency Distributions see here.

Since sets are not ordered, a bag of words represented like this takes no account of the order in which words appear in the document.

Here is a C# class to calculate the Jaccard index:


    public class Jaccard
    {
        public static double Calc<T>(HashSet<T> hs1, HashSet<T> hs2)
        {
            return (double)hs1.Intersect(hs2).Count() / (double)hs1.Union(hs2).Count();
        }

        public static double Calc<T>(List<T> ls1, List<T> ls2)
        {
            HashSet<T> hs1 = new HashSet<T>(ls1);
            HashSet<T> hs2 = new HashSet<T>(ls2);
            return Calc(hs1, hs2);
        }
    }

Notice how the List objects are converted to HashSet, since the HashSet<> class has methods for performing set calculations.

This C# code shows some tests for this class:


        public void JaccardTest1()
        {
            Dictionary<int, string> wordDict = new Dictionary<int, string>();
            wordDict.Add(1, "Word1");
            wordDict.Add(2, "Word2");
            wordDict.Add(3, "Word3");
            wordDict.Add(4, "Word4");

            List<int> doc1 = new List<int>();
            doc1.Add(2);
            doc1.Add(3);
            doc1.Add(4);
            doc1.Add(2);

            List<int> doc2 = new List<int>();
            doc2.Add(1);
            doc2.Add(5);
            doc2.Add(4);
            doc2.Add(2);

            List<int> doc3 = new List<int>();
            doc3.Add(1);

            Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc2));
            Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc1));
            Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc3));
        }

This creates three documents:

  • Doc1: “Word2”, “Word3”, “Word4”, “Word2”.
  • Doc2: “Word1”, “Word5”, “Word4”, “Word2”.
  • Doc3: “Word1”.

The output is:

Jaccard: 0.4
Jaccard: 1
Jaccard: 0

For the first result, the total number of unique words (union) in “Doc1” and “Doc2” is 5, and the number of shared words (intersection) is 2. This gives the Jaccard Similarity of 2.0 / 5.0 = 0.4.

The second result shows that a document compared to itself gives a value of 1 (completely similar) and the third test shows a value of 0 for completely dissimilar documents.

Reference: “Mining of Massive Datasets” by A. Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press, 2011, p.75. See http://infolab.stanford.edu/~ullman/mmds.html

Written by Nick Grattan

February 18, 2014 at 9:42 pm

Language Identification

with 6 comments

Language Identification is the process of determining the language in which a document is written. It is a Natural Language Processing topic with a long history and varied approaches, including:

  1. Using common words (or stop words) which are unique for particular languages.
  2. Using N-Gram solutions to work out the probability of adjacent words in different languages.
  3. Using character frequency and probability distributions.

Techniques typically work well for longer documents but struggle with short pieces of text and documents that contain text in multiple languages. Documents containing Chinese text can be difficult because tokenization is problematic. Microsoft FAST Search in SharePoint automatically determines the language of a document. Searches can then be performed to return documents of a particular language:

Language Specific Search

In this case, the “DetectedLanguage” managed property is used to return just documents in French. (From the document: Linguistics Features in SharePoint Server 2010 for Search and FAST Search Server 2010 for SharePoint.)

There are cases, though, when language identification needs to be applied to text from sources other than documents – for example, when your code is processing text entered by a user. Since Windows 7 / Windows Server 2008 R2, Microsoft has distributed the “Extended Linguistic Services”, and one of the services is language detection. See here.

The service uses an unpublished algorithm. Calling this COM service from C# takes a bit of work, but luckily the “Windows API Code Pack for Microsoft .NET Framework” provides wrapper classes. Using the ExtendedLinguisticService project allows language detection through code like this (modified from the book “Professional Windows 7 Development Guide” by John Paul Mueller):

string text = "Test String";

MappingService ms = new MappingService(MappingAvailableServices.LanguageDetection);
using (MappingPropertyBag mpb = ms.RecognizeText(text, null))
{
    String[] languages = mpb.GetResultRanges()[0].FormatData(new StringArrayFormatter());
    if (languages == null || languages.Length == 0)
    {
        Console.WriteLine(" FAIL ");
    }
    else
    {
        Console.WriteLine(" " + languages[0]);
    }
}

The Extended Linguistic Service returns a list of identified languages, with the most probable first in the list. Unfortunately, it does not return a probability or confidence indicator. The Extended Linguistic Service is very fast: classifying 15,876 text files containing over 1.18 GB of pure text took around 3 minutes, compared to 22 minutes for a .NET language identifier I wrote based on common stop words.
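A minimal sketch of the stop-word approach mentioned above (the word lists here are tiny examples of my own; a practical identifier needs far larger lists and smarter tokenization):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of stop-word language identification: the language whose common
// words occur most often in the text wins. Word lists are abbreviated.
public static class StopWordLanguageId
{
    private static readonly Dictionary<string, HashSet<string>> StopWords =
        new Dictionary<string, HashSet<string>>
    {
        { "en", new HashSet<string> { "the", "and", "of", "to", "at" } },
        { "fr", new HashSet<string> { "le", "la", "et", "de", "les" } },
        { "de", new HashSet<string> { "der", "die", "und", "das", "ein" } },
    };

    public static string Identify(string text)
    {
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);
        // count stop-word hits per language and pick the highest
        return StopWords
            .OrderByDescending(kv => words.Count(w => kv.Value.Contains(w)))
            .First().Key;
    }
}
```

For example, `Identify("the cat and the dog")` returns "en". The simple hit count is part of why this approach is slower than the Extended Linguistic Service: every word is tested against every language's list.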

Written by Nick Grattan

May 31, 2013 at 12:27 pm

Developer’s Toolset for #SP2013 – Add MVC and mix in Razor!

leave a comment »

The clear message from Microsoft for SharePoint 2013 development is a move to the app model. Here’s an introduction: http://msdn.microsoft.com/en-us/library/jj164084.aspx. SharePoint 2013 apps move your code assemblies out of the SharePoint platform and into separate web applications. There are two reasons for this:

1. A requirement for a more stable SharePoint platform. If you’re running SharePoint across your organization you don’t want badly written applications impacting reliability.

2. In Office 365 you can only add assemblies in sandboxed solutions. However, because of the very constrained subset of the SharePoint API available to such solutions, their functionality is limited. Now, Office 365 will use functionality provided by SharePoint apps.

The move to SharePoint 2013 brings SharePoint developers fully back into the ASP.NET development fold. In particular, MVC with Razor greatly simplifies web application development. You will be using the SharePoint Client Object Model to access SharePoint objects from your app.

Here are some links that help you get started with building SharePoint 2013 apps:

Written by Nick Grattan

December 8, 2012 at 4:28 pm

Disappointing – SharePoint 2013 single server install failures

leave a comment »

SharePoint 2013 is now at RTM and available for preview. It’s disappointing that a default, single server install has issues that occur often and predictably, even on a clean Windows 2012 Server install. Of course, we expect errors and problems, especially with early release software. However, the default install should just work at this stage of the product lifecycle.

There are several blog posts that describe how to correct these issues. This is one of the better ones: http://squidpip.com/blog/sharepoint-2013-preview-configuration/.

Also, there’s a deployment guide from Microsoft which can be found here: http://www.microsoft.com/en-us/download/details.aspx?id=30384 . Be prepared though, it’s 675 pages long!

Written by Nick Grattan

November 28, 2012 at 2:11 pm

Posted in SharePoint 2013


Assembly deployment to “bin” virtual folder deprecated in SharePoint 2013

leave a comment »

For some time now deploying assemblies in the “bin” folder in virtual folders under inetpub has been frowned on in SharePoint. Well, it’s now deprecated in SharePoint 2013 and an attempt to do this will cause Install-SPSolution to fail.

If you really, really, have to you can use the -FullTrustBinDeployment parameter to allow this behavior. However, you should really look to deploy assemblies to the GAC.

Written by Nick Grattan

November 21, 2012 at 6:07 pm

Posted in SharePoint 2013


SharePoint 2013 – Links get Promoted!

with 11 comments

SharePoint 2013 now has links that animate when the user moves the mouse over them:

These are implemented by creating a “Promoted Links” list. You can create your own promoted links list:

  • Select Settings + Add an app
  • Click the Promoted Links app (which is actually a list template):

  • Enter a name for the new list and click Create:

Once the list is created, items can be added to define each of the promoted links:

You can see that the item has a title, a URL for the image, a description shown when the icon is moused over, and the target URL. The Promoted Links list has a “Tiles” view that will display the items in the list using the animated images:

And when moused over the image animates:

Of course, the Tiles view can be displayed in a Web Part that can be included on any ASPX page. So it’s easy to get a Metro-style feel in your applications.

Written by Nick Grattan

July 30, 2012 at 7:27 pm

Ontology – An aside

leave a comment »

Of course, ontology is not a new study (http://en.wikipedia.org/wiki/Ontology). The opening paragraphs of Aristotle’s “The History of Animals” (written c. 350 BCE) clearly show this:

Of the parts of animals some are simple: to wit, all such as divide
into parts uniform with themselves, as flesh into flesh; others are
composite, such as divide into parts not uniform with themselves,
as, for instance, the hand does not divide into hands nor the face
into faces. 

And of such as these, some are called not parts merely, but limbs
or members. Such are those parts that, while entire in themselves,
have within themselves other diverse parts: as for instance, the head,
foot, hand, the arm as a whole, the chest; for these are all in themselves
entire parts, and there are other diverse parts belonging to them.

Translated by D’Arcy Wentworth Thompson (http://classics.mit.edu/Aristotle/history_anim.mb.txt)

Written by Nick Grattan

July 24, 2012 at 8:17 pm

Posted in Uncategorized


Taxonomies and Ontologies

leave a comment »

The terms “Taxonomy” and “Ontology” are often used interchangeably and are often confused. However, the terms express different concepts.

  • Taxonomy is about classification, and so represents “Is-A” relationships. For example, in the zoological world, a domestic cat (“Felis catus”) is a member of the family “Felidae” which itself is a member of the order “Carnivora”. Such taxonomies are typically, but not always, hierarchical. An object can exist simultaneously in many different taxonomies. So, a cat also belongs to the group “predators”, which would also include the insect “praying mantis”.
  • Ontology is about the concepts and relationships that can exist for an object or a group of objects[1]. For example, the “Part-Of” (“Part Holonym” [2]) relationship is used to describe the parts of a car (a wheel is part of a car). Therefore, a taxonomy is a type of ontology by this definition.

SharePoint 2010 introduced the managed metadata service to allow the definition of taxonomies to be used for classifying items and documents through metadata columns. Companies are encouraged to define their own, or use industry-standard, taxonomies for classifying documents across the organization to ensure standardization and improve searchability.

Less work has been done in integrating ontologies within SharePoint, although progress is being made by a number of third-party software vendors. “WordNet” [3] provides a rich source of generic ontological information for the English language, and codifies many types of relationships between nouns, verbs, adjectives and adverbs using “cognitive synonyms” (synsets). Vertical market ontologies are now being built, such as for financial governance by the “GRC3 – Governance, Risk and Compliance Competence Centre” at University College Cork, Ireland (http://www.ucc.ie). Integration of such ontologies in SharePoint will be the next step in improving search, leading to the possibility of useful question-answering systems.

[1] What is an Ontology? http://www-ksl.stanford.edu/kst/what-is-an-ontology.html

[2] Speech and Language Processing, Jurafsky & Martin, 2nd Pearson International Edition p.653

[3] Princeton University “About WordNet.” WordNet. Princeton University. 2010.  http://wordnet.princeton.edu 

Written by Nick Grattan

July 24, 2012 at 7:21 pm

Sign in as Different User and SharePoint 2013

with 45 comments

It’s been noted that the “Sign in as a Different User” menu command is missing in SharePoint 2013 (e.g., see http://dirkvandenberghe.com/2012/07/18/sharepoint-2013-login-as-a-different-user.html).

One suggestion for a fix can be found here: http://www.little.org/blog/2012/07/17/launch-your-web-browser-as-another-user/.

This “Sign in as Different User” menu item is very useful when testing applications, but it can lead to problems, especially when opening documents in, say, Microsoft Word. So, it may be for these reasons that the option has been removed in SharePoint 2013.

You can add the menu item back in, but I would suggest only doing this on test or development SharePoint servers. To do this, repeat this edit on all servers in your SharePoint farm:

  • Locate the file \15\TEMPLATE\CONTROLTEMPLATES\Welcome.ascx and open in a text editor.
  • Add the following element before the existing element with the id of “ID_RequestAccess”:

<SharePoint:MenuItemTemplate runat="server" ID="ID_LoginAsDifferentUser"
 Text="<%$Resources:wss,personalactions_loginasdifferentuser%>"
 Description="<%$Resources:wss,personalactions_loginasdifferentuserdescription%>"
 MenuGroupId="100"
 Sequence="100"
 UseShortId="true"
 />

  • Save the file.

Now the menu item will be displayed:

Update 16-Jul-2013: As noted by a number of authors (and correctly so), modifying files in the hive is not generally good practice. It’s presented here as a quick (and dirty!) solution. Look through the comments against this post for some other solutions that you might want to try too.

Written by Nick Grattan

July 23, 2012 at 4:24 pm