Extended Linguistic Services (ELS): another cryptic name for some pretty powerful capabilities. ELS is a platform supporting language processing with a consistent service-based API for language detection and transliteration.
So what’s in it for you? Today, many of us are working in a very diverse and global economy, and assuming all the work you do in documents, e-mail, etc. is tied to a single culture (your Region and Language settings) is becoming less realistic. ELS provides a way for developers to respond more richly to who is using their application at any given time.
Start typing Spanish in a document processing system, for instance, and the application could change the menus to Spanish, cue up the Spanish version of the help file, etc.: a kind of localization on the fly. Consumer-centric media applications could automatically display news or top music and movies for the detected country. It’s all about applications being more aware of who is using them and, as we’ll see with the Sensor and Location API, where they are being used.
As a native API, the framework is fairly compact. It’s implemented via a static (elscore.lib) or dynamic (elscore.dll) library versus exposing a COM interface, as many other Windows 7 features do. There are actually only a half-dozen methods in the API, providing a lightweight and consistent way to access any existing or future service. As with most of the other Windows 7 APIs, the Windows API Code Pack provides a managed wrapper for those of us writing C# or Visual Basic code.
The general flow for engaging the framework is to
That’s it! Any service is invoked the same way.
So what services are available? With Windows 7 there are three ‘in-the-box’: Language Detection, Script Detection, and Transliteration. These services are identified by these same non-localizable strings, so you can use them (or alternatively a GUID) to locate the desired service among those enumerated in Step 1 above.
When text recognition occurs in the Language Detection service, the output (in the property bag) is a list of languages that are matches for the text, in order of confidence. The languages are generally represented by their 2-letter (neutral) identifiers, like en for English, nl for Dutch, etc. In some cases the full identifier is used, for instance to distinguish traditional Chinese (zh-Hant) from simplified Chinese (zh-Hans).
Script detection refers to identifying the script in which the characters of the submitted text are written. For instance, “Ich bin ein Berliner” is German (language: de), but the script is Latin. You can find a list of all scripts and the number of characters within each on The Unicode Consortium site. Note, there are two special script types:
Qaai, which comprises ‘inherited’ characters; these are combining characters like circumflexes and umlauts
Zyyy, which comprises ‘shared’ characters, those that are regularly used in multiple scripts
With the Script Detection service, characters in these two special classifications are subsumed by the previous script range, or by the first script range if they are leading characters in the string. In other words, the Script Detection service will not return Qaai or Zyyy as a script type.
The output that is returned is a set of ranges indicating the starting and ending index of each script type as well as the name of the script identified for that range.
The following string, for instance, yields two ranges: Latn (Latin) from character 1 to 15 and Cyrl (Cyrillic) from characters 16 to 21.
This is English. АБВГД.
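To make the range-merging rule a bit more concrete, here is a rough Python sketch of the idea. It does not call ELS at all, and everything in it is an assumption made for illustration: the names script_of and detect_script_ranges are hypothetical, the script lookup is a crude stand-in based on Unicode character names rather than real script data, and the indices are 0-based and inclusive, which may not match what the service itself reports.

```python
import unicodedata

# Hypothetical, partial lookup from Unicode character-name prefixes to
# ISO 15924 script codes. Real script data lives in the Unicode Scripts.txt
# file; this shortcut is only good enough for a demonstration.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "DEVANAGARI": "Deva",
    "BENGALI": "Beng",
}


def script_of(ch):
    """Return a script code, or None for inherited/shared characters."""
    if unicodedata.category(ch).startswith("M"):
        return None  # combining marks: rough stand-in for Qaai
    prefix = unicodedata.name(ch, "").split(" ", 1)[0]
    return NAME_PREFIX_TO_SCRIPT.get(prefix)  # None: rough stand-in for Zyyy


def detect_script_ranges(text):
    """Return (start, end, script) tuples, 0-based and inclusive."""
    ranges = []
    pending_start = None  # index of the first leading character with no script
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            if ranges:
                # Subsume into the previous range.
                start, _, prev = ranges[-1]
                ranges[-1] = (start, i, prev)
            elif pending_start is None:
                # Leading characters wait for the first real range.
                pending_start = i
        elif ranges and ranges[-1][2] == script:
            # Same script as before: extend the current range.
            ranges[-1] = (ranges[-1][0], i, script)
        else:
            # New script: the first range also absorbs any leading characters.
            start = pending_start if not ranges and pending_start is not None else i
            ranges.append((start, i, script))
    return ranges


print(detect_script_ranges("  Hello, мир!"))
# [(0, 8, 'Latn'), (9, 12, 'Cyrl')]
```

Notice that the leading spaces, the comma, and the exclamation mark never show up as ranges of their own; they are folded into the neighboring Latin or Cyrillic range, which is the behavior described above. A string with no scripted characters at all simply yields an empty list in this sketch.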
The goal of this service – actually a category of services – should be self-evident. You provide an input string of a certain script type, and the service transliterates it, that is, it performs a character-by-character mapping from the source to the destination script type.
These are the specific transliteration services available now in Windows 7:
Traditional to Simplified Chinese (zh-Hant to zh-Hans)
Simplified to Traditional Chinese (zh-Hans to zh-Hant)
Malayalam to Latin
Cyrillic to Latin
Bengali to Latin
Devanagari to Latin
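If “character-by-character mapping” sounds abstract, the following Python sketch shows the general idea for a Cyrillic to Latin conversion. It is not how ELS does it: the function name transliterate is hypothetical, and the tiny mapping table is an assumption made up for this example, not the table the Windows service uses.

```python
# Illustrative, partial Cyrillic-to-Latin table. This is an assumption for
# the sketch, not the mapping the Windows transliteration service applies.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "и": "i", "м": "m", "р": "r",
}


def transliterate(text, table=CYRILLIC_TO_LATIN):
    """Map each character through the table; unknown characters pass through."""
    full = dict(table)
    # Derive upper-case entries from the lower-case ones.
    for src, dst in table.items():
        full.setdefault(src.upper(), dst.upper())
    return text.translate(str.maketrans(full))


print(transliterate("АБВГД"))  # ABVGD
print(transliterate("мир!"))   # mir!
```

Characters that are missing from the table, such as the exclamation mark, are left untouched. That is a choice made for this sketch rather than a statement about how the real services treat unmapped input.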
Check out the Extended Linguistic Services samples in the Windows API Code Pack. You’ll find two solutions. One is a simple console application that automatically exercises all three services. The other is a Windows Forms application in which you can enter text for transliteration (or pull it from a file). I think the screen shot from that sample application below pretty much says it all, wouldn’t you agree?