Nick Grattan's Blog

About Microsoft SharePoint, .NET, Natural Language Processing and Machine Learning

Archive for the ‘SharePoint 2010’ Category

Language Identification

Language identification is the process of determining which language a document is written in. It is a Natural Language Processing topic with a long history and varied approaches, including:

  1. Using common words (or stop words) which are unique for particular languages.
  2. Using N-Gram solutions to work out the probability of adjacent words in different languages.
  3. Using character frequency and probability distributions.
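As a rough illustration of the first approach, here is a minimal C# sketch that counts stop-word hits per language and picks the language with the most hits. The class name, method name and the tiny word lists are all made up for this example; a usable identifier would need much larger lists and many more languages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordLanguageGuesser
{
    // Tiny illustrative stop-word lists (assumption: real lists are far larger).
    private static readonly Dictionary<string, HashSet<string>> StopWords =
        new Dictionary<string, HashSet<string>>
        {
            { "en", new HashSet<string> { "the", "and", "of", "is", "in", "to" } },
            { "fr", new HashSet<string> { "le", "la", "les", "et", "est", "dans" } },
            { "de", new HashSet<string> { "der", "die", "das", "und", "ist", "nicht" } }
        };

    // Returns the language code with the most stop-word hits,
    // or null when no stop word from any list occurs in the text.
    public static string Guess(string text)
    {
        string[] tokens = text.ToLowerInvariant().Split(
            new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' },
            StringSplitOptions.RemoveEmptyEntries);

        string best = null;
        int bestHits = 0;
        foreach (KeyValuePair<string, HashSet<string>> entry in StopWords)
        {
            int hits = tokens.Count(t => entry.Value.Contains(t));
            if (hits > bestHits)
            {
                best = entry.Key;
                bestHits = hits;
            }
        }
        return best;
    }
}
```

Calling StopWordLanguageGuesser.Guess("Le chat est dans le jardin") picks French because only the French list has any hits. With a very short string there may be no hits at all, in which case the method returns null.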

These techniques typically work well for longer documents, but struggle with short pieces of text and with documents that contain text in multiple languages. Documents containing Chinese text can be difficult because tokenization is problematic.

Microsoft FAST Search in SharePoint automatically determines the language of a document. Searches can then be restricted to documents in a particular language:

Language Specific Search

In this case, the “DetectedLanguage” managed property is used to return only documents in French. (From the document: Linguistics Features in SharePoint Server 2010 for Search and FAST Search Server 2010 for SharePoint).

There are cases, though, when language identification needs to be applied to text from sources other than documents, for example when your code is processing text entered by a user. Since Windows 7 / Windows Server 2008 R2, Microsoft has distributed the “Extended Linguistic Service”, and one of its services is language detection.

The service uses an unpublished algorithm. Calling this COM service from C# takes a bit of work, but luckily the “Windows API Code Pack for Microsoft .NET Framework” provides wrapper classes. Using the ExtendedLinguisticService project allows language detection through code like this (modified from the book “Professional Windows 7 Development Guide” by John Paul Mueller):

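// Requires: using System;
//           using Microsoft.WindowsAPICodePack.ExtendedLinguisticServices;
string text = "Test String";

// Bind to the language detection service and run it over the text.
MappingService ms = new MappingService(MappingAvailableServices.LanguageDetection);
using (MappingPropertyBag mpb = ms.RecognizeText(text, null))
{
    // Candidate languages for the first result range, most probable first.
    String[] languages = mpb.GetResultRanges()[0].FormatData(new StringArrayFormatter());
    if (languages == null || languages.Length == 0)
    {
        Console.WriteLine(" FAIL ");
    }
    else
    {
        Console.WriteLine(" " + languages[0]);
    }
}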

The Extended Linguistic Service returns a list of identified languages with the most probable first in the list. Unfortunately, it does not return a probability or confidence indicator. The Extended Linguistic Service is very fast. For example, classifying 15,876 text files containing over 1.18 GB of pure text took around 3 minutes, compared to 22 minutes for a .NET language identifier I wrote based on common stop words, which makes it roughly seven times faster.

Written by Nick Grattan

May 31, 2013 at 12:27 pm

Taxonomies and Ontologies

The terms “Taxonomy” and “Ontology” are often used interchangeably and are often confused. However, the terms express different concepts.

  • Taxonomy is about classification, and so represents “Is-A” relationships. For example, in the zoological world, a domestic cat (“Felis catus”) is a member of the family “Felidae” which itself is a member of the order “Carnivora”. Such taxonomies are typically, but not always, hierarchical. An object can exist simultaneously in many different taxonomies. So, a cat also belongs to the group “predators”, which would also include the insect “praying mantis”.
  • Ontology is about the concepts and relationships that can exist for an object or a group of objects[1]. For example, the “Part-Of” (“Part Holonym” [2]) relationship is used to describe the parts of a car (a wheel is part of a car). Therefore, a taxonomy is a type of ontology by this definition.
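To make the distinction concrete, here is a small C# sketch that stores relationships as (subject, relation, object) triples. The “Is-A” triples on their own form the taxonomy, while the full set, including “Part-Of”, is the ontology. The Ontology class and its methods are invented for this illustration and are not part of SharePoint or any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Ontology
{
    // Triples of (subject, relation, object), e.g. ("wheel", "Part-Of", "car").
    private readonly List<Tuple<string, string, string>> triples =
        new List<Tuple<string, string, string>>();

    public void Add(string subject, string relation, string obj)
    {
        triples.Add(Tuple.Create(subject, relation, obj));
    }

    // Objects directly related to the subject by the given relation.
    public IEnumerable<string> Related(string subject, string relation)
    {
        return triples
            .Where(t => t.Item1 == subject && t.Item2 == relation)
            .Select(t => t.Item3);
    }

    // Follows one relation transitively, e.g. "Is-A" up through a taxonomy.
    public List<string> Ancestors(string subject, string relation)
    {
        var result = new List<string>();
        var pending = new Queue<string>();
        pending.Enqueue(subject);
        while (pending.Count > 0)
        {
            foreach (string parent in Related(pending.Dequeue(), relation))
            {
                if (!result.Contains(parent))
                {
                    result.Add(parent);
                    pending.Enqueue(parent);
                }
            }
        }
        return result;
    }

    // Populates the examples used in the text above.
    public static Ontology Example()
    {
        var o = new Ontology();
        o.Add("Felis catus", "Is-A", "Felidae");
        o.Add("Felidae", "Is-A", "Carnivora");
        o.Add("Felis catus", "Is-A", "predator");
        o.Add("praying mantis", "Is-A", "predator");
        o.Add("wheel", "Part-Of", "car");
        return o;
    }
}
```

Asking Ontology.Example().Ancestors("Felis catus", "Is-A") walks both taxonomies the cat belongs to (the zoological one and the “predators” group), whereas Related("wheel", "Part-Of") answers a question that a pure taxonomy cannot express.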

SharePoint 2010 introduced the Managed Metadata Service to allow the definition of taxonomies for classifying items and documents through metadata columns. Companies are encouraged to define their own taxonomies, or use industry-standard ones, to classify documents consistently across the organization and improve searchability.

Less work has been done on integrating ontologies within SharePoint, although a number of third-party software vendors are making progress. “WordNet” [3] provides a rich source of generic ontological information for the English language, and codifies many types of relationships between nouns, verbs, adjectives and adverbs using “cognitive synonyms” (synsets). Vertical market ontologies are now being built, such as for financial governance by the “GRC3 – Governance, Risk and Compliance Competence Centre” at University College Cork, Ireland (http://www.ucc.ie). Integration of such ontologies in SharePoint will be the next step in improving search, leading to the possibility of useful question-answering systems.

[1] What is an Ontology? http://www-ksl.stanford.edu/kst/what-is-an-ontology.html

[2] Speech and Language Processing, Jurafsky & Martin, 2nd Pearson International Edition p.653

[3] Princeton University “About WordNet.” WordNet. Princeton University. 2010.  http://wordnet.princeton.edu 

Written by Nick Grattan

July 24, 2012 at 7:21 pm

Exception in MergeAspSiteMapFiles for “SharePoint 2010 Products Configuration Wizard” (PSConfig)

Recently, I found that a SharePoint farm failed to update when running the “SharePoint 2010 Products Configuration Wizard” or when running PSConfig directly. The PSConfig log file in the 14 hive contained the following exception information:

07/12/2012 16:18:22  9  ERR                  Failed to install the application content files.
An exception of type System.Xml.XmlException was thrown.  Additional exception information: Name cannot begin with the '\' character, hexadecimal value 0x5C. Line 5, position 2.
System.Xml.XmlException: Name cannot begin with the '\' character, hexadecimal value 0x5C. Line 5, position 2.
   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.ParseQName(Boolean isQName, Int32 startOffset, Int32& colonPos)
   at System.Xml.XmlTextReaderImpl.ParseElement()
   at System.Xml.XmlTextReaderImpl.ParseElementContent()
   at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
   at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
   at System.Xml.XmlDocument.Load(XmlReader reader)
   at System.Xml.XmlDocument.Load(String filename)
   at Microsoft.SharePoint.Administration.SPAspSiteMapFile.MergeAspSiteMapFiles(XmlDocument xmldocSiteMap, String strSrcFilePath, String strMergeFilePattern)
   at Microsoft.SharePoint.Administration.SPAspSiteMapFile.Copy(String strSrcDir, String strSrcLeaf, String strDestDir, Boolean bMergeMaps, Boolean bBackupExistingFile)
   at Microsoft.SharePoint.Administration.SPAdministrationWebApplication.CopyAdminAppDomainDirectories(DirectoryInfo virtualDirectoryPath, OverwriteSetting overwrite)
   at Microsoft.SharePoint.Administration.SPWebService.ApplyApplicationContentToLocalServer()
   at Microsoft.SharePoint.PostSetupConfiguration.ApplicationContentTask.Run()
   at Microsoft.SharePoint.PostSetupConfiguration.TaskThread.ExecuteTask()
07/12/2012 16:18:22  9  INF                  Entering function TaskDriver.NotifyTaskSummary

I suspected that this was caused by an application I had deployed to the SharePoint farm. However, it was not obvious what in the application was causing the problem.

The key to finding the problem was the call to ApplyApplicationContentToLocalServer in the stack trace. The documentation for this method indicates that it is responsible for merging all the site map files in the 14 hive and placing them in the _app_bin folder for each web application in the appropriate inetpub folders. These sitemap files are XML documents with “.sitemap.” somewhere in the name.

This post is useful in understanding this process: http://sharepointinterface.com/tag/features/

Using this information I managed to find the sitemap file that was causing the problem: the XML document was invalid since it contained a “\” rather than a “/”. And yes, it was installed by the application I had deployed.
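A search like this can be automated. The following is a minimal sketch, not what I actually ran at the time: it walks a folder, tries to load every file with “.sitemap.” in its name as XML, and reports the ones that fail. The SiteMapChecker class and its method are hypothetical names for this example, and the folder to scan is passed in by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SiteMapChecker
{
    // Returns one message per "*.sitemap.*" file under root that is not well-formed XML.
    public static List<string> FindInvalidSiteMaps(string root)
    {
        var problems = new List<string>();
        foreach (string path in Directory.GetFiles(root, "*.sitemap.*", SearchOption.AllDirectories))
        {
            try
            {
                // Same call that appears in the PSConfig stack trace.
                new XmlDocument().Load(path);
            }
            catch (XmlException ex)
            {
                // The message includes the line and position of the error.
                problems.Add(path + ": " + ex.Message);
            }
        }
        return problems;
    }
}
```

On a default installation the 14 hive is at C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14, so passing its TEMPLATE folder as the root would be a reasonable starting point. Adjust the path if SharePoint is installed elsewhere.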

Now, what I find interesting is that an issue in a site map deployed by an application can cause the SharePoint 2010 Products Configuration Wizard to fail, and therefore stop any updates from being applied. Not only interesting, but worrying too.

Written by Nick Grattan

July 16, 2012 at 7:49 pm

Forms Solutions for SharePoint 2010

We have probably all experienced using Microsoft InfoPath 2010 for generating forms, both using the client application and the forms server. Here are some other forms solutions you might take a look at:

I haven’t evaluated or compared these products, but hopefully this will provide you with a starting point if you’re looking for that type of thing. Let me know if there are others to add.

Written by Nick Grattan

May 29, 2012 at 12:37 pm

Posted in SharePoint 2010

SharePoint 2010 – Provisioning User Profile Synchronization

Configuring AD synchronization with SharePoint 2010 can be problematic. Here’s a great post on how to configure the services, including Forefront Identity Manager:

http://blogs.msdn.com/b/russmax/archive/2010/03/20/sharepoint-2010-provisioning-user-profile-synchronization.aspx

Written by Nick Grattan

November 3, 2011 at 9:12 am

SharePoint 2010 Foundation, Server Service Pack 1 for Download

Here’s information on SP1 for SharePoint 2010 Foundation http://support.microsoft.com/kb/2460058. Download from here: http://www.microsoft.com/download/en/details.aspx?id=26640

And here’s information on SP1 for SharePoint Server (Enterprise and Standard): http://support.microsoft.com/kb/2460045. Download from: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=26623

 

Written by Nick Grattan

July 1, 2011 at 12:53 pm

SharePoint Diagnostic Studio

Talking of tools (see Favorite SharePoint Development Tools), the SharePoint Diagnostic Studio from Microsoft provides SharePoint performance monitoring functionality. This tool is particularly useful for consolidating the ULS records across servers in a SharePoint server farm. It’s part of the Microsoft SharePoint 2010 Administration Toolkit v2.0.

While useful, the user interface is not particularly polished or easy to use. In particular, the SharePoint Diagnostic Studio must be run with the language set to “US-English”, otherwise date formats will be presented incorrectly and date/time filtering will not work.

Overall Description: http://sharepoint.microsoft.com/blog/Pages/BlogPost.aspx?pID=971

Download from: http://www.microsoft.com/downloads/en/details.aspx?FamilyID=718447d8-0814-427a-81c3-c9c3d84c456e&displaylang=en

Documentation: http://technet.microsoft.com/en-us/library/hh144782.aspx

Written by Nick Grattan

July 1, 2011 at 12:41 pm
