Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Language Identification

with 6 comments

Language Identification is the process by which the language a document is written in is determined. It is a Natural Language Processing topic with a long history with varied approaches including:

  1. Using common words (or stop words) which are unique for particular languages.
  2. Using N-Gram solutions to work out the probability of adjacent words in different languages.
  3. Using character frequency and probability distributions

Techniques typically work well for longer documents, but become challenged with short pieces of text and documents that contain text in multiple languages. Documents containing Chinese text can be difficult as tokenization is problematic. Microsoft FAST Search in SharePoint automatically determines the language of a document. Searches can be performed to return documents of a particular language:

Language Specific Search

In this case, the “DetectedLanguage” managed property is used to return just document in French. (From document: Linguistics Features in SharePoint Server 2010 for Search and FAST Search Server 2010 for SharePoint).

There are cases, though, when language identification needs to be applied to text from sources other than documents. For example, when your code is processing text entered by a user. Since Windows 7 / Windows Server 2008 R2 Microsoft have distributed the “Extended Linguistic Service”, and one of the services is language detection. See here.

The services uses an unpublished algorithm. Calling this COM service from C# takes a bit of work, but luckily the “Windows API Code Pack for Microsoft.NET Framework” provides wrapper classes. Using the ExtendedLinguisticService project allows language detection through code like this code (modified from the book “Professional Windows 7 Development Guide” by John Paul Mueller):

string text = "Test String";

MappingService ms = new MappingService(MappingAvailableServices.LanguageDetection);
using (MappingPropertyBag mpb = ms.RecognizeText(text, null))
  String[] languages = mpb.GetResultRanges()[0].FormatData(new StringArrayFormatter());
  if (languages == null || languages.Length == 0)
     Console.WriteLine(" FAIL ");
  Console.WriteLine(" " + languages[0]);

The Extended Linguistic Service returns a list of identified languages with the most probable first in the list. Unfortunately it does not return a probability or confidence indicator. The Extended Linguistic Service is very fast. For example, classifying 15,876 text files containing over 1.18 GB of pure text took around 3 minutes, compared to 22 minutes for a .NET language identifier I wrote based on common stop words.

Written by Nick Grattan

May 31, 2013 at 12:27 pm

6 Responses

Subscribe to comments with RSS.

  1. Dear Nick,

    Do you know, for SP2013, the language identification will mark a document as any ONE language or multiple language? I have tested on a document with mulitple language and it can be searched by different language perference. But this test is not true for another sample document. I want to make sure how Microsoft design it.



    April 15, 2014 at 4:12 am

  2. […] is an awesome resource on how to use DetectedLanguage: Also, see all the available languages […]

  3. […] is an awesome resource on how to use DetectedLanguage: Also, see all the available languages […]

  4. […] Nick Grattan has an awesome blog post about this. […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: