Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Precision, Recall and SharePoint Search


Precision and recall are two commonly used measures of search performance:

  • Precision measures the fraction of returned documents that are relevant.
  • Recall measures the fraction of the overall set of relevant documents that is returned.
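
As a minimal sketch of these two measures, the Python function below computes both for a single query from sets of document ids. The function name, the document ids and the convention of returning 0.0 for an empty denominator are assumptions made for illustration only.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids of the documents the search engine returned.
    relevant: ids of the documents judged relevant for the query.
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    # Assumed convention: an empty denominator yields 0.0.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 judged relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d9"})
print(p, r)  # precision is 2/4, recall is 2/3
```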

With SharePoint search, users typically enter a number of words or phrases. Search then returns matching documents ordered by some relevancy criteria. In the default case, each word or phrase is combined with an AND operator. This tends to maximize precision at the expense of recall.

Alternatively, the words or phrases can be combined with an OR operator. Documents are returned that have at least one word or phrase from the search criteria. This tends to maximize recall at the expense of precision.

When precision is low, many irrelevant documents may be returned; when recall is low, relevant documents may be missed. Between these extremes there is little common ground.
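
The trade-off can be seen on a toy corpus. The sketch below is not how SharePoint evaluates queries; it is a simple whole-word matcher, and the four documents, the relevance judgements and the function names are all made up for illustration.

```python
def search(corpus, terms, mode="AND"):
    """Return ids of documents containing all (AND) or any (OR) of the terms."""
    combine = all if mode == "AND" else any
    results = set()
    for doc_id, text in corpus.items():
        tokens = set(text.lower().split())
        if combine(term in tokens for term in terms):
            results.add(doc_id)
    return results


def precision_recall(returned, relevant):
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical corpus and relevance judgements.
corpus = {
    "d1": "etidronate treats hypercalcemia of malignancy",
    "d2": "hypercalcemia in cancer patients",
    "d3": "etidronate dosage in osteoporosis",
    "d4": "knee surgery outcomes",
}
relevant = {"d1", "d2"}
terms = ["etidronate", "hypercalcemia"]

and_hits = search(corpus, terms, "AND")  # only d1 contains both terms
or_hits = search(corpus, terms, "OR")    # d1, d2 and d3 contain at least one

print("AND:", precision_recall(and_hits, relevant))  # precision 1.0, recall 0.5
print("OR: ", precision_recall(or_hits, relevant))   # precision 2/3, recall 1.0
```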

Measurement of recall and precision requires a document corpus where the relevant documents are identified for a set of queries. Examples of such datasets include TREC (Text Retrieval Conference, http://trec.nist.gov/) and OHSUMED (http://ir.ohsu.edu/ohsumed/, which is also used in a TREC stream).

The OHSUMED corpus consists of 348,566 documents containing references to, and abstracts from, medical journals, extracted from MEDLINE (http://www.nlm.nih.gov/pubs/factsheets/medline.html). There are 106 queries for which there are 16,140 query-document pairs assessed as being definitely or possibly relevant. These judgements were made by qualified physicians or librarians. Queries take the following form:

.I  5
.B
58 yo with cancer and hypercalcemia
.W
effectiveness of etidronate in treating hypercalcemia of malignancy

This includes the following information:

  • I: The query number (5).
  • B: A brief description of the presenting patient.
  • W: The query that a physician used to return relevant documents.
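
A query file in this layout can be read with a few lines of Python. This is a sketch that assumes only the three field markers shown above (.I, .B and .W) and that field text may span several lines; the function name and the dictionary layout are illustrative choices.

```python
def parse_queries(text):
    """Parse OHSUMED-style query records into a list of dicts."""
    queries, current, field = [], None, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(".I"):
            # A new record starts; the query number follows the marker.
            current = {"I": int(line[2:].strip()), "B": "", "W": ""}
            queries.append(current)
            field = None
        elif line in (".B", ".W"):
            field = line[1]  # subsequent lines belong to this field
        elif current is not None and field is not None:
            current[field] = (current[field] + " " + line).strip()
    return queries


sample = """.I  5
.B
58 yo with cancer and hypercalcemia
.W
effectiveness of etidronate in treating hypercalcemia of malignancy
"""

queries = parse_queries(sample)
print(queries[0]["I"], "|", queries[0]["W"])
```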

To test precision and recall for relevant documents, the OHSUMED corpus was loaded into a SharePoint document library. The documents were saved as HTML documents in the library. Two test setups were used:

  • Documents loaded into Microsoft SharePoint 2010 Foundation and indexed using “SharePoint Foundation Search V4”.
  • Documents loaded into Microsoft SharePoint 2010 Enterprise Server and indexed using Microsoft FAST.

Then, for both test setups, the 106 queries (“W” above) were executed as if the user had typed the query into the SharePoint “search” box, and the returned documents analysed. In both setups, none of the definitely or possibly relevant documents were returned, although some other documents were returned for most of the queries.
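
An evaluation loop of this kind might look like the sketch below. It is not the code used for these tests: `run_query` stands in for whatever submits the query text to the search engine, and the averaging over queries, the ids and the stub search function are assumptions for illustration.

```python
def evaluate(queries, judgements, run_query):
    """Average precision and recall over a set of queries.

    queries:    {query_id: query_text}
    judgements: {query_id: set of relevant document ids}
    run_query:  hypothetical callable, query_text -> iterable of document ids
    """
    precisions, recalls = [], []
    for query_id, text in queries.items():
        returned = set(run_query(text))
        relevant = judgements.get(query_id, set())
        hits = returned & relevant
        precisions.append(len(hits) / len(returned) if returned else 0.0)
        recalls.append(len(hits) / len(relevant) if relevant else 0.0)
    n = len(queries)
    if n == 0:
        return 0.0, 0.0
    return sum(precisions) / n, sum(recalls) / n


# Stub engine that returns documents, but never a relevant one.
def stub_search(text):
    return ["x1", "x2"]


queries = {5: "effectiveness of etidronate in treating hypercalcemia of malignancy"}
judgements = {5: {"doc-a", "doc-b"}}  # made-up document ids
print(evaluate(queries, judgements, stub_search))  # (0.0, 0.0)
```

With documents returned but none of them relevant, both averages come out as zero.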

So, in these tests both precision and recall are 0. This illustrates the gulf between keyword search queries and question-answering information retrieval systems. SharePoint search will, by default, only return documents that contain all the words or phrases in the search query (the AND operator is used), although stop words like “of” and “in” are generally ignored.

The definitely or possibly relevant documents identified by OHSUMED do not contain all words in the search query for various reasons. For example, relevant documents may contain synonyms of words used in the search query. Precision and recall can be improved by adding synonyms, but for a large domain such as medical research, this is a substantial undertaking.
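
The effect of synonyms on an AND query can be sketched as follows. This does not use SharePoint's own thesaurus mechanism; the synonym table, the document text and the function names are hypothetical, and each query term is simply widened to a group of alternatives, any one of which satisfies that term.

```python
# Hypothetical, hand-built synonym table.
SYNONYMS = {
    "malignancy": {"malignancy", "cancer", "tumour"},
    "hypercalcemia": {"hypercalcemia", "hypercalcaemia"},
}


def expand(terms):
    """Replace each term with the group of words accepted in its place."""
    return [SYNONYMS.get(term, {term}) for term in terms]


def matches(text, term_groups):
    """AND across groups, OR within a group."""
    tokens = set(text.lower().split())
    return all(bool(tokens & group) for group in term_groups)


doc = "etidronate for cancer related hypercalcaemia"
terms = ["etidronate", "hypercalcemia", "malignancy"]

plain = matches(doc, [{t} for t in terms])  # exact words only
expanded = matches(doc, expand(terms))      # synonyms accepted
print(plain, expanded)  # False True
```

Building and maintaining a table like `SYNONYMS` for a whole domain is the costly part.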

Search engines such as Google now recognize questions and use question-answering information retrieval techniques as well as indexed queries. For a query phrased as a question, the answer is shown first, followed by matching documents.

  1. Hersh WR, Buckley C, Leone TJ, Hickam DH. OHSUMED: An interactive retrieval evaluation and new large test collection for research. Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.
  2. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge University Press, 2008.

Written by Nick Grattan

June 28, 2012 at 5:34 pm
