Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Archive for the ‘Search / Information Retrieval’ Category

Jaccard Similarity Index for measuring Document Similarity

with 6 comments

ForwardTo

The Jaccard Similarity Index can be used to represent the similarity between two documents. A value “0” means the documents are completely dissimilar, “1” that they are identical,  and values between 0 and 1 representing a degree of similarity.

The Jaccard index is defined as the size of the intersection of the two documents divided by the size of the union of the two documents.

Of course, we need to decide how to represent the two documents as sets. Of the different ways this can be done, we might:

  1. Use a Bag Of Words, which is the list of unique words in the document, with no frequency count.
  2. A Bag of Words with Stop Words removed (common language words like “the”, “at” for English).
  3. Sets that take into account the frequency of words in the document.
  4. k-Shingles which are fixed k length overlapping runs of characters extracted from the document.
  5. n-grams which are fixed length n word runs. For example, a 2-gram are overlapping pairs of words taken from a document.

A simple, efficient, way of representing a bag of words is to create a dictionary of all words and assign a unique integer value to each word. Then, the bag of words for a document is simply the unique list of integer values for the words in the document, e.g. List.

For a comparison between Bag of Words and Frequency Distributions see here.

Since sets are not ordered, a bag of words represented like this takes no account of the order in which words appear in the document.

Here is a C# class to calculate the Jaccard index:


public class Jaccard
 {

 public static double Calc(HashSet<int> hs1, HashSet<int> hs2)
 {
      return ((double)hs1.Intersect(hs2).Count() / (double)hs1.Union(hs2).Count());
 }

 public static double Calc(List<int> ls1, List<int> ls2)
 {
      HashSet<int> hs1 = new HashSet<int>(ls1);
      HashSet<int> hs2 = new HashSet<int>(ls2);
      return Calc(hs1, hs2);
 }
 }

Notice how the List objects are converted to HashSet, since the HashSet<> class has methods for performing set calculations.

This C# code shows some tests for this class:


public void JaccardTest1()
 {
      Dictionary<int, string> wordDict = new Dictionary<int, string>();
      wordDict.Add(1, "Word1");
      wordDict.Add(2, "Word2");
      wordDict.Add(3, "Word3");
      wordDict.Add(4, "Word4");

      List<int> doc1 = new List<int>();
      doc1.Add(2);
      doc1.Add(3);
      doc1.Add(4);
      doc1.Add(2);

      List<int> doc2 = new List<int>();
      doc2.Add(1);
      doc2.Add(5);
      doc2.Add(4);
      doc2.Add(2);

      List<int> doc3 = new List<int>();
      doc3.Add(1);

      Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc2));
      Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc1));
      Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc3));
}

This creates three documents:

  • Doc1: “Word2”, “Word3”, “Word4”, “Word2”.
  • Doc2: “Word1”, “Word5”, “Word4”, “Word2”.
  • Doc3: “Word1”.

The output is:

Jaccard: 0.4
Jaccard: 1<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>
Jaccard: 0

For the first result, the total number of unique words (union) in “Doc1” and “Doc2” is 5, and the number of shared words (intersection) is 2. This gives the Jaccard Similarity of 2.0 / 5.0 = 0.4.

The second result shows that a document compared to itself gives a value of 1 (completely similar) and the third test shows a value of 0 for completely dissimilar documents.

Reference: “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011.  P.75. See http://infolab.stanford.edu/~ullman/mmds.html

Advertisements

Written by Nick Grattan

February 18, 2014 at 9:42 pm

Language Identification

with 6 comments

Language Identification is the process by which the language a document is written in is determined. It is a Natural Language Processing topic with a long history with varied approaches including:

  1. Using common words (or stop words) which are unique for particular languages.
  2. Using N-Gram solutions to work out the probability of adjacent words in different languages.
  3. Using character frequency and probability distributions

Techniques typically work well for longer documents, but become challenged with short pieces of text and documents that contain text in multiple languages. Documents containing Chinese text can be difficult as tokenization is problematic. Microsoft FAST Search in SharePoint automatically determines the language of a document. Searches can be performed to return documents of a particular language:

Language Specific Search

In this case, the “DetectedLanguage” managed property is used to return just document in French. (From document: Linguistics Features in SharePoint Server 2010 for Search and FAST Search Server 2010 for SharePoint).

There are cases, though, when language identification needs to be applied to text from sources other than documents. For example, when your code is processing text entered by a user. Since Windows 7 / Windows Server 2008 R2 Microsoft have distributed the “Extended Linguistic Service”, and one of the services is language detection. See here.

The services uses an unpublished algorithm. Calling this COM service from C# takes a bit of work, but luckily the “Windows API Code Pack for Microsoft.NET Framework” provides wrapper classes. Using the ExtendedLinguisticService project allows language detection through code like this code (modified from the book “Professional Windows 7 Development Guide” by John Paul Mueller):

string text = "Test String";

MappingService ms = new MappingService(MappingAvailableServices.LanguageDetection);
using (MappingPropertyBag mpb = ms.RecognizeText(text, null))
{
  String[] languages = mpb.GetResultRanges()[0].FormatData(new StringArrayFormatter());
  if (languages == null || languages.Length == 0)
  {
     Console.WriteLine(" FAIL ");
  }
  else
  {
  Console.WriteLine(" " + languages[0]);
  }
  }

The Extended Linguistic Service returns a list of identified languages with the most probable first in the list. Unfortunately it does not return a probability or confidence indicator. The Extended Linguistic Service is very fast. For example, classifying 15,876 text files containing over 1.18 GB of pure text took around 3 minutes, compared to 22 minutes for a .NET language identifier I wrote based on common stop words.

Written by Nick Grattan

May 31, 2013 at 12:27 pm

Taxonomies and Ontologies

leave a comment »

The terms “Taxonomy” and “Ontology” are often used inter-changeably and are often confused. However, the terms to express different concepts.

  • Taxonomy is about classification, and so represents “Is-A” relationships. For example, in the zoological world, a domestic cat (“Felis catus”) is a member of the family “Felidae” which itself is a member of the order “Carnivora”. Such taxonomies are typically, but not always, hierarchical. An object can exist simultaneously in many different taxonomies. So, a cat also belongs to the group “predators”, which would also include the insect “praying mantis”.
  • Ontology is about the concepts and relationships that can exist for an object or a group of objects[1]. For example, the “Part-Of” (“Part Holonym” [2]) relationship is used to describe the parts of a car (a wheel is part of a car). Therefore, a taxonomy is a type of ontology by this definition.

SharePoint 2007 introduced the managed metadata service to allow the definition of taxonomies to be used for classifying items and documents through metadata columns. Companies are encouraged to define their own or use industry standard taxonomies for classifying documents across the organization to ensure standardization and improve searchability.

Less work has been done in integrating ontologies within SharePoint, although progress by a number of third-party software vendors is being made. “WordNet” [3] provides a rich source of generic ontological information using the English language, and codifies many types of relationships between nouns, verbs, adjectives and adverbs using “cognitive synonyms” (synsets). Vertical market ontologies are now being built, such as for financial governance by the “GRC3 – Governance, Risk and Compliance Competence Centre” at the University of Cork, Ireland (http://www.ucc.ie). Integration of such ontologies in SharePoint will be the next step in improving search, leading to the possibility of useful question-answering systems.

[1] What is an Ontology? http://www-ksl.stanford.edu/kst/what-is-an-ontology.html

[2] Speech and Language Processing, Jurafsky & Martin, 2nd Pearson International Edition p.653

[3] Princeton University “About WordNet.” WordNet. Princeton University. 2010.  http://wordnet.princeton.edu 

Written by Nick Grattan

July 24, 2012 at 7:21 pm

Precision, Recall and SharePoint Search

leave a comment »

Precision and recall are two commonly used measures of search performance:

  • Precision measures the accuracy with which documents are returned from the overall set of relevant documents.
  • Recall measures the number of documents returned from the overall set of relevant documents.

With SharePoint search users typically enter a number of words or phrases. Search then returns matching documents ordered by some relevancy criteria.  In the default case each word or phrase is combined with an AND operator. This maximizes precision but minimizes recall.

Alternatively, the words or phrases can be combined with an OR operator. Documents are returned that have at least one word or phrase from the search criteria. This maximizes recall but minimizes precision.

When precision is low, lots of irrelevant documents may be returned and when recall is low, relevant documents may not be returned.  Between these extremes there is little common ground.

Measurement of recall and precision require a document corpus where the relevant documents are identified for a set of queries. Examples of such datasets include TREC (Text Retrieval Conference, http://trec.nist.gov/) and OHSUMED (http://ir.ohsu.edu/ohsumed/, which is also used in a TREC stream).

The OHSUMED corpus consists of 348,566 documents containing references and abstracts to medical journals extracted from MEDLINE (http://www.nlm.nih.gov/pubs/factsheets/medline.html). There are 106 queries for which there are 16,140 query-document pairs assessed as being definitely or possibly relevant. These judgements were made by qualified physicians or librarians. Queries take the following form:

.I  5
.B
58 yo with cancer and hypercalcemia
.W
effectiveness of etidronate in treating hypercalcemia of malignancy

This includes the following information:

  • I: The query number (5).
  • B: The prognosis of a presenting patient.
  • W: The query that a physician used to return relevant documents.

To test precision and recall for relevant documents, the OHSUMED corpus was loaded into a SharePoint document library. The documents were saved as HTML documents in the library. Two tests setups were used:

  • Documents loaded into Microsoft SharePoint 2010 Foundation and indexed using “SharePoint Foundation Search V4”.
  •  Documents loaded into Microsoft SharePoint 2010 Enterprise Server and indexed using Microsoft FAST.

Then, for both test setups, the 106 queries (“W” above) where executed as if the user had typed the query into the SharePoint “search” box, and the returned documents analysed. For both test results, none of the definitely or possibly relevant documents were returned, although some other documents were returned for most of the queries.

So, in these tests both precision and recall are 0. This illustrates the gulf between search queries and question answering information retrieval systems. SharePoint search will, by default, only return documents which contain all the words or phrases in the search query (the AND operator is used), although stop words like “of” and “in” are generally ignored.

The definitely or possibly relevant documents identified by OHSUMED do not contain all words in the search query for various reasons. For example, relevant documents may contain synonyms of words used in the search query. Precision and recall can be improved by adding synonyms, but for a large domain such as medical research, this is a substantial undertaking.

Search engines such as Google now recognize questions and use question answering information retrieval techniques as well as indexed queries. In this example, the question is answered first, followed by matching documents:

  1. Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED:  An interactive retrieval evaluation and new large test collection for research, Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.
  2. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008

Written by Nick Grattan

June 28, 2012 at 5:34 pm

Microsoft FAST Search Installation Guide

leave a comment »

Microsoft FAST Search is becoming both more popular and relevant – rumor has it that it will be the only search and index option for SharePoint 2013 Standard/Enterprise.

Here’s a good installation guide: http://blog.concurrency.com/sharepoint/fast/

To get document preview you’ll need to install Microsoft Office Web Apps. Here’s a guide for this: http://technet.microsoft.com/library/ff431687(office.14).aspx#bkmk_ins_exis_sa

Written by Nick Grattan

May 30, 2012 at 1:49 pm