Normalization and Microsoft -- what's the story?

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


"Normalization" is a word that has been way too overloaded. To date I have heard of four specific uses since I came to Microsoft:

  • It is used by some folks on the SQL Server team when they talk about string comparisons
  • It is used by the NLS collation APIs for the same reason (the NORM_* flags), in reference to potentially ignorable differences in string comparison
  • Robert A. Wlodarczyk has used the term string normalization to refer to the broader type of search that others call fuzzy search
  • Unicode has a technical report entitled Unicode Normalization Forms that defines a technique for folding out differences in equivalent sequences

Looking at definitions for the word from dictionary.com:

  • To make normal, especially to cause to conform to a standard or norm
  • To make (a text or language) regular and consistent, especially with respect to spelling or style
  • To remove strains and reduce coarse crystalline structures in (metal), especially by heating and cooling
  • Reduction to a standard or normal state

I guess it's really only Unicode that is using the word correctly, though maybe that "string normalization" usage is not too far off. Ah well, bad on the folks over in SQL Server (they have the advantage of it not being splattered all over their public header files and documentation -- only their customer contacts tend to hear about it, and only when it comes up). And of course bad on those of us working on NLS. Oops!

But moving past the terminology issue, does collation as an operation support the concepts being described in Unicode normalization? In other words, does the CompareString API consider U+00c5 (Å, LATIN CAPITAL LETTER A WITH RING ABOVE) to be equivalent to U+0041 U+030a (Å, LATIN CAPITAL LETTER A + COMBINING RING ABOVE)?

The answer is that it does. But not because Unicode Normalization was being used.
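
To make the question concrete, here is a minimal sketch of what "equivalent" means for those two spellings. It uses Python's standard unicodedata module as a stand-in for Unicode Normalization itself; it does not call CompareString and says nothing about how the NLS tables get to the same answer. The helper name canonically_equal is made up for this illustration.

```python
import unicodedata

precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after putting both into form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A code-point-by-code-point comparison sees two different strings...
binary_equal = precomposed == decomposed

# ...while a comparison after normalization sees one and the same text.
normalized_equal = canonically_equal(precomposed, decomposed)

print(binary_equal, normalized_equal)
```

A linguistic comparison can reach the same verdict without ever normalizing, simply by giving both spellings the same weights.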

Like yesterday's blog entry, which pointed out how the UCA was written up many years after Microsoft (and others) were doing the work, Unicode Normalization was first proposed as a DRAFT in the Spring of 1998. It was not until the Summer of 1999 that it was given the status of a Technical Report. The collation functions (CompareString and LCMapString) had been supporting this type of operation for many years before that.

It is worth mentioning that support is not perfect. As Microsoft Typography discovered when they first ran their font validation tools on their existing fonts, and as I discovered when I ran MSKLC's keyboard validation on Microsoft's existing keyboard layouts, once there was a concerted effort to validate everything that collation does against all of the various Unicode operations such as Unicode Normalization, there were holes found. The holes fall into two basic categories:

  • in the Arabic presentation forms (where many precomposed ligatures are not considered to be equal to their combined forms)
  • in Korean Old Hangul (where Jamo sequences are given weights that are calculated to be close but not identical to their 'combined' Hangul forms)
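
For a feel of what Unicode Normalization does with these two kinds of characters, here is a small sketch, again using Python's unicodedata rather than anything in NLS. The characters are examples chosen for this illustration, not the ones from the validation runs: one Arabic presentation-form ligature, and a modern Hangul syllable (picked only because it has a precomposed form to compare against; it does not reproduce the Old Hangul case).

```python
import unicodedata

lam_alef_ligature = "\ufefb"        # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
lam_alef_sequence = "\u0644\u0627"  # ARABIC LETTER LAM + ARABIC LETTER ALEF

# The ligature only has a compatibility decomposition, so the canonical
# forms (C and D) leave it alone and only the K forms fold it.
nfc_folds_ligature = unicodedata.normalize("NFC", lam_alef_ligature) == lam_alef_sequence
nfkc_folds_ligature = unicodedata.normalize("NFKC", lam_alef_ligature) == lam_alef_sequence

jamo_sequence = "\u1100\u1161"      # HANGUL CHOSEONG KIYEOK + HANGUL JUNGSEONG A
syllable = "\uac00"                 # HANGUL SYLLABLE GA

# Conjoining Jamo compose canonically into the precomposed syllable.
jamo_composes = unicodedata.normalize("NFC", jamo_sequence) == syllable

print(nfc_folds_ligature, nfkc_folds_ligature, jamo_composes)
```

So whether a given pair counts as "equal" depends on which normalization form is the yardstick, before collation weights even enter the picture.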

I was once again struck by the differences that a 'linguistic' approach can lead to when compared to a 'technical' one, even when nominally working to service the same (or similar) customers. In the above two categories, there were specific reasons for the differences relating to customer feedback. In a few cases it will make sense to pick up some of these differences (several of the Arabic ligatures, for example), but for the most part those differences will likely stay as there is no compelling customer scenario to try to pick up all of the differences.

There is another (some would say more relevant) way in which Microsoft supports Unicode Normalization -- in the FoldString API, with its MAP_PRECOMPOSED, MAP_COMPOSITE, and MAP_FOLDCZONE flags. While the tables that are used for these various foldings are a bit out of date, their goal is obviously the same. In the long run it completely makes sense to update this information.
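
As a rough guide to what those three flags are aiming at, the sketch below pairs each one with the Unicode normalization form that seems closest in spirit. The pairing is an assumption made for illustration, not a statement about the FoldString tables (which, as noted, are out of date and will differ in places), and fold_like is a hypothetical helper that never touches the Win32 API.

```python
import unicodedata

# Assumed, approximate correspondence between FoldString flags and
# Unicode normalization forms. Illustration only.
ROUGH_ANALOGUE = {
    "MAP_PRECOMPOSED": "NFC",   # base + combining mark -> precomposed character
    "MAP_COMPOSITE": "NFD",     # precomposed character -> base + combining mark
    "MAP_FOLDCZONE": "NFKC",    # compatibility zone character -> standard equivalent
}


def fold_like(flag: str, text: str) -> str:
    """Hypothetical helper: apply the normalization form paired with a flag name."""
    return unicodedata.normalize(ROUGH_ANALOGUE[flag], text)


composed = fold_like("MAP_PRECOMPOSED", "A\u030a")   # expected to give U+00C5
decomposed = fold_like("MAP_COMPOSITE", "\u00c5")    # expected to give U+0041 U+030A
folded = fold_like("MAP_FOLDCZONE", "\uff21")        # FULLWIDTH LATIN CAPITAL LETTER A

print([hex(ord(c)) for c in composed + decomposed + folded])
```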

Which is not to say that Unicode Normalization is not important -- it is. As a standard it has been picked up by the IETF, the W3C, and many others. Folks working with the Whidbey beta have already seen String.IsNormalized and String.Normalize in their Intellisense, and folks who have seen the PDC or later builds of Longhorn have noticed IsNormalized and Normalize APIs. There may well be people who will work to make sure text is properly put into a specific Unicode normalization form prior to storing or transmitting it.

But since all of the methods of text input on Windows tend to use the same normalization form already (form C), and since it obviously is an operation that takes some time and space to do no matter how fast it is, it is an optional task that can be used when it is important to do so. Especially since it mostly works already, anyway.
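
For those who do want to guarantee a form before storing or transmitting text, the usual shape is "check first, convert only if needed". The sketch below shows that shape with Python's unicodedata.is_normalized and unicodedata.normalize (Python 3.8 or later), which play roughly the role that String.IsNormalized and String.Normalize play in Whidbey. The function name ensure_form_c is invented for this example.

```python
import unicodedata


def ensure_form_c(text: str) -> str:
    """Hypothetical helper: return text in form C, converting only when needed."""
    # Most input is expected to be in form C already, so check before
    # building a new string.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


already_c = ensure_form_c("\u00c5ngstr\u00f6m")     # precomposed input, returned as-is
converted = ensure_form_c("A\u030angstro\u0308m")   # decomposed input, gets composed

print(already_c == converted)
```

Since most text is expected to arrive in form C anyway, the check is the common path and the conversion is the exception.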

(This was an area I walked into dismayed that we do not support the Unicode Standard but walked out of pleased at how much we suck at the job of not supporting it. Usually when I am not supporting something I do a much better job of failing to meet expectations!)

So do we support Unicode normalization today? Well, sort of. At least in FoldString and in collation. Not 100% but we hit all of the common scenarios.

Do we plan to support it in the future products we release? Absolutely. Folks watching the betas can see it now (and the rest can find it in public web searches, which surprised me maybe more than it should have).

 

Blog - Comment List
  • Hi Kaplan,

    You didn't mention the database normalization in first para :)
    In relational database design, the process of organizing data to minimize redundancy. Normalization usually involves dividing a database into two or more tables and defining relationships between the tables.
  • Yes, that is an excellent point! That would be yet another usage (and one that I dealt with in my former, pre-Microsoft life).

    But I limited myself to meanings that I have been exposed to since I came to Microsoft, and none of them brought that one up.

    Plus, I figured it did not really help the point of the article much since database normalization is even further from the dictionary definitions. :-)
  • This question came up a few times at TechEd in Orlando, and recently it was asked again in the newsgroups...
  • Someone going by the handle AC asked me via email:

    You have mentioned that Google has trouble with...
  • (Apologies to Stanley Kubrick, of course!) It was almost the very first blog post I ever wrote, back

  • Regular reader Jan Kučera, in response to Stripping is an interesting job (aka On the meaning of meaningless,

  • Pingback from  Some words on Unicode in Windows (Delphi, .NET, APIs, etc) « The Wiert Corner – irregular stream of Wiert stuff
