Archive for the ‘Search / Information Retrieval’ Category
Language Identification
Language Identification is the process by which the language a document is written in is determined. It is a Natural Language Processing topic with a long history with varied approaches including:
- Using common words (or stop words) which are unique for particular languages.
- Using N-Gram solutions to work out the probability of adjacent words in different languages.
- Using character frequency and probability distributions
Techniques typically work well for longer documents, but become challenged with short pieces of text and documents that contain text in multiple languages. Documents containing Chinese text can be difficult as tokenization is problematic. Microsoft FAST Search in SharePoint automatically determines the language of a document. Searches can be performed to return documents of a particular language:

In this case, the “DetectedLanguage” managed property is used to return just document in French. (From document: Linguistics Features in SharePoint Server 2010 for Search and FAST Search Server 2010 for SharePoint).
There are cases, though, when language identification needs to be applied to text from sources other than documents. For example, when your code is processing text entered by a user. Since Windows 7 / Windows Server 2008 R2 Microsoft have distributed the “Extended Linguistic Service”, and one of the services is language detection. See here.
The services uses an unpublished algorithm. Calling this COM service from C# takes a bit of work, but luckily the “Windows API Code Pack for Microsoft.NET Framework” provides wrapper classes. Using the ExtendedLinguisticService project allows language detection through code like this code (modified from the book “Professional Windows 7 Development Guide” by John Paul Mueller):
string text = "Test String";
MappingService ms = new MappingService(MappingAvailableServices.LanguageDetection);
using (MappingPropertyBag mpb = ms.RecognizeText(text, null))
{
String[] languages = mpb.GetResultRanges()[0].FormatData(new StringArrayFormatter());
if (languages == null || languages.Length == 0)
{
Console.WriteLine(" FAIL ");
}
else
{
Console.WriteLine(" " + languages[0]);
}
}
The Extended Linguistic Service returns a list of identified languages with the most probable first in the list. Unfortunately it does not return a probability or confidence indicator. The Extended Linguistic Service is very fast. For example, classifying 15,876 text files containing over 1.18 GB of pure text took around 3 minutes, compared to 22 minutes for a .NET language identifier I wrote based on common stop words.
Taxonomies and Ontologies
The terms “Taxonomy” and “Ontology” are often used inter-changeably and are often confused. However, the terms to express different concepts.
- Taxonomy is about classification, and so represents “Is-A” relationships. For example, in the zoological world, a domestic cat (“Felis catus”) is a member of the family “Felidae” which itself is a member of the order “Carnivora”. Such taxonomies are typically, but not always, hierarchical. An object can exist simultaneously in many different taxonomies. So, a cat also belongs to the group “predators”, which would also include the insect “praying mantis”.
- Ontology is about the concepts and relationships that can exist for an object or a group of objects[1]. For example, the “Part-Of” (“Part Holonym” [2]) relationship is used to describe the parts of a car (a wheel is part of a car). Therefore, a taxonomy is a type of ontology by this definition.
SharePoint 2007 introduced the managed metadata service to allow the definition of taxonomies to be used for classifying items and documents through metadata columns. Companies are encouraged to define their own or use industry standard taxonomies for classifying documents across the organization to ensure standardization and improve searchability.
Less work has been done in integrating ontologies within SharePoint, although progress by a number of third-party software vendors is being made. “WordNet” [3] provides a rich source of generic ontological information using the English language, and codifies many types of relationships between nouns, verbs, adjectives and adverbs using “cognitive synonyms” (synsets). Vertical market ontologies are now being built, such as for financial governance by the “GRC3 – Governance, Risk and Compliance Competence Centre” at the University of Cork, Ireland (http://www.ucc.ie). Integration of such ontologies in SharePoint will be the next step in improving search, leading to the possibility of useful question-answering systems.
[1] What is an Ontology?
http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
[2] Speech and Language Processing, Jurafsky & Martin, 2nd Pearson International Edition p.653
[3] Princeton University “About WordNet.” WordNet. Princeton University. 2010.
http://wordnet.princeton.edu
Precision, Recall and SharePoint Search
Precision and recall are two commonly used measures of search performance:
- Precision measures the accuracy with which documents are returned from the overall set of relevant documents.
- Recall measures the number of documents returned from the overall set of relevant documents.
With SharePoint search users typically enter a number of words or phrases. Search then returns matching documents ordered by some relevancy criteria. In the default case each word or phrase is combined with an AND operator. This maximizes precision but minimizes recall.
Alternatively, the words or phrases can be combined with an OR operator. Documents are returned that have at least one word or phrase from the search criteria. This maximizes recall but minimizes precision.
When precision is low, lots of irrelevant documents may be returned and when recall is low, relevant documents may not be returned. Between these extremes there is little common ground.
Measurement of recall and precision require a document corpus where the relevant documents are identified for a set of queries. Examples of such datasets include TREC (Text Retrieval Conference,
http://trec.nist.gov/
) and OHSUMED (
http://ir.ohsu.edu/ohsumed/
, which is also used in a TREC stream).
The OHSUMED corpus consists of 348,566 documents containing references and abstracts to medical journals extracted from MEDLINE (
http://www.nlm.nih.gov/pubs/factsheets/medline.html
). There are 106 queries for which there are 16,140 query-document pairs assessed as being definitely or possibly relevant. These judgements were made by qualified physicians or librarians. Queries take the following form:
.I 5 .B 58 yo with cancer and hypercalcemia .W effectiveness of etidronate in treating hypercalcemia of malignancy
This includes the following information:
- I: The query number (5).
- B: The prognosis of a presenting patient.
- W: The query that a physician used to return relevant documents.
To test precision and recall for relevant documents, the OHSUMED corpus was loaded into a SharePoint document library. The documents were saved as HTML documents in the library. Two tests setups were used:
- Documents loaded into Microsoft SharePoint 2010 Foundation and indexed using “SharePoint Foundation Search V4”.
- Documents loaded into Microsoft SharePoint 2010 Enterprise Server and indexed using Microsoft FAST.
Then, for both test setups, the 106 queries (“W” above) where executed as if the user had typed the query into the SharePoint “search” box, and the returned documents analysed. For both test results, none of the definitely or possibly relevant documents were returned, although some other documents were returned for most of the queries.
So, in these tests both precision and recall are 0. This illustrates the gulf between search queries and question answering information retrieval systems. SharePoint search will, by default, only return documents which contain all the words or phrases in the search query (the AND operator is used), although stop words like “of” and “in” are generally ignored.
The definitely or possibly relevant documents identified by OHSUMED do not contain all words in the search query for various reasons. For example, relevant documents may contain synonyms of words used in the search query. Precision and recall can be improved by adding synonyms, but for a large domain such as medical research, this is a substantial undertaking.
Search engines such as Google now recognize questions and use question answering information retrieval techniques as well as indexed queries. In this example, the question is answered first, followed by matching documents:
- Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive retrieval evaluation and new large test collection for research, Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008
Microsoft FAST Search Installation Guide
Microsoft FAST Search is becoming both more popular and relevant - rumor has it that it will be the only search and index option for SharePoint 2013 Standard/Enterprise.
Here’s a good installation guide:
http://blog.concurrency.com/sharepoint/fast/
To get document preview you’ll need to install Microsoft Office Web Apps. Here’s a guide for this:
http://technet.microsoft.com/library/ff431687(office.14).aspx#bkmk_ins_exis_sa
