Language Identification
Language identification is the process of determining the language a document is written in. It is a long-established Natural Language Processing topic, with varied approaches including:
- Using common words (or stop words) that are characteristic of particular languages.
- Using N-gram models to estimate the probability of adjacent characters or words in different languages.
- Using character frequency and probability distributions.
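As a rough illustration of the N-gram and character-frequency ideas above, here is a minimal C# sketch that builds character trigram counts and picks the reference language with the highest cosine similarity. The class name, method names and the idea of passing in small reference profiles are my own assumptions for illustration; a practical identifier would train its profiles on far more text and would probably use a more careful scoring scheme.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only: names and scoring are assumptions, not a published algorithm.
public static class TrigramLanguageGuesser
{
    // Count overlapping character trigrams in the lower-cased text.
    public static Dictionary<string, int> Profile(string text)
    {
        var counts = new Dictionary<string, int>();
        string s = text.ToLowerInvariant();
        for (int i = 0; i + 3 <= s.Length; i++)
        {
            string gram = s.Substring(i, 3);
            int n;
            counts.TryGetValue(gram, out n);   // n stays 0 when the trigram is new
            counts[gram] = n + 1;
        }
        return counts;
    }

    // Cosine similarity between two trigram count vectors (0 when either is empty).
    public static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = 0;
        foreach (var kv in a)
        {
            int other;
            if (b.TryGetValue(kv.Key, out other)) dot += (double)kv.Value * other;
        }
        double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    // Return the key of the reference profile most similar to the text.
    public static string Guess(string text,
        IDictionary<string, Dictionary<string, int>> references)
    {
        Dictionary<string, int> profile = Profile(text);
        return references
            .OrderByDescending(r => Cosine(profile, r.Value))
            .First().Key;
    }
}
```

To use it, build one profile per language with Profile() from sample text in that language, put the profiles in a dictionary keyed by language code, and call Guess() with the text to classify. Very short inputs give few trigrams, which is one reason these techniques struggle with short text.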
These techniques typically work well for longer documents, but struggle with short pieces of text and with documents that contain text in multiple languages. Documents containing Chinese text can be difficult because tokenization is problematic.
Microsoft FAST Search in SharePoint automatically determines the language of a document. Searches can be performed to return documents of a particular language:
In this case, the “DetectedLanguage” managed property is used to return only documents in French. (From document: Linguistics Features in SharePoint Server 2010 for Search and FAST Search Server 2010 for SharePoint).
There are cases, though, when language identification needs to be applied to text from sources other than documents, for example when your code is processing text entered by a user. Since Windows 7 / Windows Server 2008 R2, Microsoft has distributed “Extended Linguistic Services”, and one of these services is language detection.
The service uses an unpublished algorithm. Calling this COM service from C# takes a bit of work, but luckily the “Windows API Code Pack for Microsoft .NET Framework” provides wrapper classes. The ExtendedLinguisticServices project in the Code Pack allows language detection through code like this (modified from the book “Professional Windows 7 Development Guide” by John Paul Mueller):
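using System;
using Microsoft.WindowsAPICodePack.ExtendedLinguisticServices;

string text = "Test String";
// Create a mapping service bound to the language detection service.
MappingService ms = new MappingService(MappingAvailableServices.LanguageDetection);
// The property bag wraps native resources, so dispose of it when done.
using (MappingPropertyBag mpb = ms.RecognizeText(text, null))
{
    // Candidate languages for the first recognized range.
    string[] languages = mpb.GetResultRanges()[0].FormatData(new StringArrayFormatter());
    if (languages == null || languages.Length == 0)
    {
        Console.WriteLine(" FAIL ");
    }
    else
    {
        Console.WriteLine(" " + languages[0]);
    }
}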
The language detection service returns a list of identified languages, with the most probable first in the list. Unfortunately, it does not return a probability or confidence indicator. The service is very fast. For example, classifying 15,876 text files containing over 1.18 GB of pure text took around 3 minutes, compared to 22 minutes for a .NET language identifier I wrote based on common stop words.
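For comparison, a stop-word based identifier can be sketched in a few lines of C#. This is not the identifier timed above. It is a minimal sketch under my own assumptions: the class name, the tiny word lists and the scoring (the fraction of tokens that are stop words in a language) are all hypothetical. The score is only a crude stand-in for a confidence indicator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only: word lists and scoring are assumptions.
public static class StopWordLanguageGuesser
{
    // Tiny illustrative stop-word lists; a real identifier would use far larger ones.
    static readonly Dictionary<string, HashSet<string>> StopWords =
        new Dictionary<string, HashSet<string>>
        {
            { "en", new HashSet<string> { "the", "and", "of", "is", "in", "to" } },
            { "fr", new HashSet<string> { "le", "la", "et", "les", "des", "est" } },
            { "de", new HashSet<string> { "der", "die", "und", "das", "ist", "nicht" } },
        };

    static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    // Returns the best-scoring language code and its score: the fraction of
    // tokens that are stop words in that language. The key is null when no
    // stop word from any list was found.
    public static KeyValuePair<string, double> Guess(string text)
    {
        string[] tokens = text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        string best = null;
        double bestScore = 0;
        foreach (var lang in StopWords)
        {
            int hits = tokens.Count(t => lang.Value.Contains(t));
            double score = tokens.Length == 0 ? 0 : (double)hits / tokens.Length;
            if (score > bestScore)
            {
                best = lang.Key;
                bestScore = score;
            }
        }
        return new KeyValuePair<string, double>(best, bestScore);
    }
}
```

Calling StopWordLanguageGuesser.Guess(text) gives a language code in Key and the score in Value. Splitting on whitespace like this will not work for Chinese text, which is the tokenization problem mentioned earlier.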
Dear Nick,
Do you know whether, in SP2013, language identification marks a document with only ONE language or with multiple languages? I tested a document containing multiple languages, and it could be found using different language preferences. But this was not the case for another sample document. I want to understand how Microsoft designed it.
Thanks.
Mark
Mark
April 15, 2014 at 4:12 am
Hi Mark,
Thanks for your comment. To be honest, I don’t know. This blog may provide some answers, even though it’s for 2010: http://blogs.msdn.com/b/ravipriya_de_alwis/archive/2013/03/28/language-detection-and-fallback-language-for-fast-search-for-sharepoint-2010.aspx.
Regards, Nick
Nick Grattan
April 15, 2014 at 10:23 am