Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
I am still finding my way around this whole thing, so this policy will probably change.
For now, the policy is that I check each comment to make sure it has some relevance to the topic; if it does then I will let it go up.
If it's spam (and there has been some of that already, don't these people have lives?) then it will be deleted without explanation.
If it's just off-topic (and I can be broad in my definition of off-topic since I have often caused threads to drift) then I will send it back to the person posting with info on what I am doing if I am able to determine who to contact. Mostly I'll just let it be posted unless it's really random.
If it's a report of a typo then I will leave it there until I correct the problem (after which I will probably delete it). If it's a more substantial correction than a simple typo then I will most likely leave it up unless doing so would irreparably harm the impression that I am right about everything important. :-)
Addendum 13 December 2004: Generic "Great Site!" comments that have no actual content but are intended only to increase the visibility of an unrelated site will be rejected with extreme prejudice.
Addendum 22 December 2004: At times the documentation problems, bugs, and design flaws to which I point might cause people to misunderstand my motives. I will therefore try to make them clear. I believe that the Microsoft internationalization functionality is basically superior to everything else out there. I am thus very pro-Microsoft and confident enough about that functionality that I believe it is not a mistake to "expose" the information about problems here. However, no one should ever mistake that approach and believe that I am somehow inviting Microsoft-bashers to have a platform upon which they can do their bashing. If this applies to you then you probably know who you are, but just in case you can ask yourself if you prefer Monopolysoft or Micro$oft to Microsoft or M$ to MS; if your answer is "yes" then you are probably mistaking this blog for a place for MS bashing....
Addendum 23 April 2005: I am going to experimentally only moderate comments that are anonymous to see if I can still keep a handle on the spam problem. I'll let you know if this causes any problems....
'Evil date parsing' has quite an ignoble history. Rooted in COM (which was itself rooted in older versions of Visual Basic), conversions from string to date had the simple job of making a string into a date, no matter what the cost. The benefits are obvious, but the problems range from performance issues to the cost of getting bad data by improper parsing. The latter is in fact why the parsing was considered evil for many purposes, because when the format is dd/mm/yy there are just as many people who wanted "01/13/2008" to fail as who wanted it to succeed. The fact that it would succeed simply clouded the issue of the meaning of "01/06/2004", since the difference between January 6th and June 1st is obvious and frightening for some applications.
The .NET Framework's DateTime.Parse method has goals not unlike those from older products, and because of this it often suffers from the same problems. This road leads to less performant code that, for every customer who loves that it parsed their date, will find another who is unhappy that it missed an entirely reasonable format. They did solve some of the evil problems, but there were still plenty more that came out. By trying to work for everyone, the method ends up with four groups of customers:
ParseExact, on the other hand, takes the exact formats specified in the DateTimeFormatInfo object and uses them -- it uses nothing else. There is no forgiveness for data that does not match, and the issue of whether or not gratuitous spaces should be forgiven makes for an interesting argument in the hallways of some buildings at Microsoft. Its goal is more along the lines of "I gave you the format, now I will give you the strings in that format; just do the freaking job." This makes it faster and more exact as a semantic, and as such it is much better suited when the flexibility of the other method is not desired. As a veteran of bugs implicit in evil date parsing, I am quite fond of a method with none of the problems it can cause.
In order to protect your own code, you may want to consider using ParseExact when you can, to help avoid those other problems. Flexibility is great when you need it, but when you don't it's better not to risk the problems that are the price of flexibility....
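To make the difference concrete, here is a minimal sketch in Python rather than .NET. It assumes that datetime.strptime is a fair stand-in for an exact-format parser; it is not ParseExact and its forgiveness rules differ in the details (it tolerates missing leading zeros, for example). The helper name parse_exact is made up for this illustration.

```python
from datetime import datetime

def parse_exact(text, fmt):
    """Hypothetical helper: return a datetime only if text fits fmt, else None."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

# The same string means different dates under different formats, so the
# caller states the format instead of letting a parser guess.
as_day_first = parse_exact("01/06/2004", "%d/%m/%Y")    # 1 June 2004
as_month_first = parse_exact("01/06/2004", "%m/%d/%Y")  # 6 January 2004

# With a day-first format there is no month 13, so this string is rejected
# rather than quietly reinterpreted as January 13th.
rejected = parse_exact("01/13/2008", "%d/%m/%Y")        # None
```

The point is only that the caller, not the parser, decides which reading of "01/06/2004" is the right one.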
This posting will try to clear up some of the problems in documentation and info regarding keyboards, since there is plenty left in those things to be confusing and there is no need to throw bewildering terms into the mix. Future posts will build on this one, so if you already know about keyboards you might be able to skip it (though I would not advise it!). It is not really a glossary since it is not alphabetized; the order is arbitrary based on either when I thought of a term to add or when I thought dramatic effect could be increased.
That's all I can think of at the moment, but I am sure I will be updating this topic any time I think of something else.
Robert A. Heinlein told a story in his book Expanded Universe back in 1980 (bear with me, I promise I'll be making a point eventually):
A few years ago, I was visited by an astronomer, quite young and brilliant. He claimed to be a long-time reader of my fiction and his conversation proved it. I was telling him about a time I needed a synergistic orbit from Earth to a 24-hour station; I told him the story it was in, he was familiar with the scene, mentioned having read the book in grammar school.
This orbit is similar in appearance to a cometary interplanet transfer but is in fact a series of compromises in order to arrive in step with the space station; elapsed time is an unsmooth integral not to be found in Hudson's Manual but it can be solved by the methods used on the Siacci empiricals for atmosphere ballistic: numerical integration.
I'm married to a woman who knows more math, history, and languages than I do. This should teach me humility (and sometimes does, for a few minutes). Her brain is a great help to me professionally. I was telling this young scientist how we obtained yards of butcher paper, then each of us worked three days, independently, solved the problem and checked each other -- then the answer disappeared into *one* line of *one* paragraph (SPACE CADET) but the effort had been worthwhile since it controlled what I could do dramatically in that sequence.
Dr Whoosis said "But *why* didn't you just shove it through a computer?"
I blinked at him. Then said slowly, gently, "My dear boy--" (I don't usually call PH.D.'s in hardcore sciences "My dear boy"--they impress me. But this was a special case.) "My dear boy... this was *1947*."
It took him some moments to get it, then he blushed....
It's a story that comes into my head every time I get a question these days that proves the person asking is not thinking about the fact that the passage of events has an influence on what is possible. Nowhere is this more true than in the subject of this posting -- people who wonder why Microsoft does not support the Unicode Collation Algorithm. People notice that Windows seems to have a similar framework and they assume that both of them use the same "default table" that works as a basis for all collations (in other words they assume that Microsoft's is based on the Unicode sort weight tables).
The truth is quite different. Unicode's weights have been a part of the UCA, which was first a DRAFT Unicode Technical Report in March of 1997. It did not lose its DRAFT status until November of 1999 and was not a Unicode Technical Standard until August of 1999.
Windows, on the other hand, has had its architecture and its default table in place since NT 3.1 shipped, over a decade ago. How could it be based on the Unicode sort weight tables, which did not exist at that time even in draft form? The temptation to respond to the person asking with a "My dear boy..." (or "My dear girl...") is at times overwhelming!
As to the extra functionality, I'll just say that the past 15 years have seen a lot of language support being added to Windows, and the expertise that has been applied to its collation support is truly amazing. It's a daunting functionality to work on at times given how well it has performed over the years. :-)
From a philosophical perspective, collation in Windows has always been based primarily on the linguistic data that is at its core -- the technical issues have always been driven by the data, not the other way around. I think this is a unique strength of the implementation that allows it to outperform others across a range of languages that is also (in my opinion) far superior. The tables were certainly built up with an entirely different linguistic and development philosophy, and ignoring my opinions about which is better, the data of either one would really be a poor fit for the other.
It is of note (well, to me at least!) that at the last two Unicode Technical Committee meetings several decisions were made which will cause future versions of the UCA's default table to behave more like Microsoft's. This is not because it's Microsoft's way (we give advice about principles for the UCA but really do not innovate for it since we are not using it to come up with innovations) but because one of the authors of the UCA suggested tweaks to the UCA behavior based on expert advice and user feedback. I guess that means we had the right idea, huh? :-)
I get this question on a regular basis -- people wonder if I know that Korean shows up in a random order. People expect this to be the case when the Korean LCID (0x0412, or 1042 in decimal) is not passed, but when it is they expect things to be in some kind of correct order, and as far as they can tell, they are not.
My first big question (to help set expectations properly) is to ask what order they expect to be used. They almost always have trouble explaining what they thought would happen. But usually with a little help they make it through this part. The order is based on the most common Hangul pronunciation of the character, whether it is Hangul or Hanja (the Korean name for Han ideographs).
At this point they start to see some pattern to the results but it still seems random within a particular pronunciation.
My second big question is to ask how they are getting an order. This varies -- if they are a developer then they are calling LCMapString or CompareString (or some API that calls CompareString); otherwise, if it's in an application (Access or Excel or whatever) then they do not know what API is being called (but usually I know what the application is doing). Or they are using managed code and the CompareInfo or SortKey classes are being used. The problem is usually that they are passing the NORM_IGNORENONSPACE flag (or the CompareOptions.IgnoreNonSpace flag in managed code). And herein is where the problem lies....
You see, modern Korean Hangul and Hanja do not have the notion of non-spacing characters (like ˆˇˉ˘˙˚˛˜˝ diacritics seen in Latin), so that part of the "collation weight space" is used for Hanja that most commonly have the same Hangul pronunciation. Telling the API or method to ignore this weight is basically asking it to treat (for example) such ideographs as 渴, 噶, and 鶡 as randomly sorting together in a non-deterministic way, since the API will report that the two strings are equal. The same thing happens to letters with diacritics in Latin -- passing the flag will cause (for example) A, Ā, Ă, Ą, À, Á, Â, Ã, Ä, and Å to sort together randomly in the same sort of non-deterministic way.
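The Latin half of that can be mimicked in a few lines of Python. This is only a rough sketch: it assumes that stripping Unicode combining marks is a reasonable stand-in for ignoring the secondary weight, it does not use the Windows tables at all, and it will not reproduce the Hanja case (which depends on those tables). The helper name strip_marks is invented here.

```python
import unicodedata

def strip_marks(text):
    """Hypothetical helper: decompose text and drop combining (non-spacing) marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

letters = ["\u00c5", "A", "\u00c4", "\u00c1"]   # Å, A, Ä, Á

# Every letter collapses to the same key, so a comparison that ignores
# the marks reports all of them as equal.
keys = {strip_marks(ch) for ch in letters}

# Python's sort is stable, so ties simply keep their input order: the
# "sorted" result depends on how the data happened to arrive.
order_one = sorted(["\u00c5", "A"], key=strip_marks)
order_two = sorted(["A", "\u00c5"], key=strip_marks)
```

Both calls are "correctly sorted" as far as the key is concerned, which is exactly why the output looks random to the person reading it.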
Ah, the light begins to dawn -- the order was not actually so random as they thought. They should not be passing this flag!
The next question they ask is "what is so special about Korean that it is the only language where this is done?" The answer to this is simple: it's not.
The names for the flags NORM_IGNORENONSPACE/NORM_IGNORECASE (or IgnoreNonSpace/IgnoreCase) are really misleading, since they were really meant to refer to Latin script diacritics and case, where 'A' and 'Ă' need to be weighted as having a secondary difference and 'A' and 'a' as having a tertiary difference. There are a myriad of languages that have the same linguistic need to express secondary and tertiary differences, so they need to use those types of weights. Passing the flags can harm the accuracy/specificity of Thai, Hindi, Tamil, Telugu, Arabic, Hebrew, and a whole lot more.
The final question that they ask is "But I am doing searches and I need to pass those flags for the Latin/Cyrillic/Greek text. How can I have the results work properly in those searches without finding the wrong characters for these other languages?" The answer here is to use the flags to do the search operation but then order them without using the flags. If you do this then you will get the longer list of candidates, but just as in typical web searches the closest matches will appear higher on the list!
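Here is a small Python sketch of that two-step idea. It is an analogy, not the Windows API: loose_key is a made-up stand-in for comparing with the ignore flags, and "exact matches first" is a crude simplification of ordering the candidates with a full-strength comparison.

```python
import unicodedata

def loose_key(text):
    """Hypothetical stand-in for comparing with the ignore flags set."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def search(query, candidates):
    """Find matches with the loose key, then rank them without it."""
    hits = [c for c in candidates if loose_key(c) == loose_key(query)]
    # Exact matches sort first; the remaining hits keep their input order
    # because Python's sort is stable.
    return sorted(hits, key=lambda c: c != query)

results = search("resume", ["R\u00e9sum\u00e9", "resume", "r\u00e9sum\u00e9", "ros\u00e9"])
# results: ["resume", "Résumé", "résumé"]
```

The loose pass decides what gets into the list; the strict pass decides where in the list it lands.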
This is a word that has been way too overloaded. To date I have heard of four specific uses since I have come to Microsoft:
Looking at definitions for the word from dictionary.com:
I guess it's really only Unicode that is using the word correctly, though maybe that "string normalization" usage is not too far off. Ah well, bad on the folks over in SQL Server (they have the advantage of it not being splattered all over their public header files and documentation -- only their customer contacts tend to hear about it, and only when it comes up). And of course on those of us working on NLS. Oops!
But moving past the terminology issue, does collation as an operation support the concepts being described in Unicode normalization? In other words, does the CompareString API consider U+00c5 (Å, LATIN CAPITAL LETTER A WITH RING ABOVE) to be equivalent to U+0041 U+030a (Å, LATIN CAPITAL LETTER A + COMBINING RING ABOVE)?
The answer is that it does. But not because Unicode Normalization was being used.
Like yesterday's blog entry, which pointed out how the UCA was written up many years after Microsoft (and others) were doing the work, Unicode Normalization was first proposed as a DRAFT in the Spring of 1998. It was not until the Summer of 1999 that it was given the status of a Technical Report. The collation functions (CompareString and LCMapString) had been supporting this type of operation for many years before that.
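For anyone who wants to see the two spellings of Å side by side, here is a short Python sketch using the standard unicodedata module. It shows Unicode normalization itself, not what CompareString does internally.

```python
import unicodedata

precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# A plain code point comparison sees two different strings.
raw_equal = precomposed == decomposed

# Normalizing to form C composes the pair; form D decomposes the single
# character. Either way the two spellings end up identical.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
```

Here raw_equal is False while nfc_equal and nfd_equal are both True, which is the equivalence the question above is asking about.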
It is worth mentioning that support is not perfect. As Microsoft Typography discovered when they first ran their font validation tools on their existing fonts, and as I discovered when I ran MSKLC's keyboard validation on Microsoft's existing keyboard layouts, once there was a concerted effort to validate everything that collation does against all of the various Unicode operations such as Unicode Normalization, there were holes found. The holes fall into two basic categories:
I was once again struck by the differences that a 'linguistic' approach can lead to when compared to a 'technical' one, even when nominally working to service the same (or similar) customers. In the above two categories, there were specific reasons for the differences relating to customer feedback. In a few cases it will make sense to pick up some of these differences (several of the Arabic ligatures, for example), but for the most part those differences will likely stay as there is no compelling customer scenario to try to pick up all of the differences.
There is another (some would say more relevant) way in which Microsoft supports Unicode Normalization -- in the FoldString API, with its MAP_PRECOMPOSED, MAP_COMPOSITE, and MAP_FOLDCZONE flags. While the tables that are used for these various foldings are a bit out of date, their goal is obviously the same. In the long run it completely makes sense to update this information.
Which is not to say that Unicode Normalization is not important -- it is. As a standard it has been picked up by the IETF, the W3C, and many others. Folks working with the Whidbey beta have already seen String.IsNormalized and String.Normalize in their Intellisense, and folks who have seen the PDC or later builds of Longhorn have noticed IsNormalized and Normalize APIs. There may well be people who will work to make sure text is properly put into a specific Unicode normalization form prior to storing or transmitting it.
But since all of the methods of text input on Windows tend to use the same normalization form already (form C), and since it obviously is an operation that takes some time and space to do no matter how fast it is, it is an optional task that can be used when it is important to do so. Especially since it mostly works already, anyway.
(This was an area I walked into dismayed that we do not support the Unicode Standard but walked out of pleased at how much we suck at the job of not supporting it. Usually when I am not supporting something I do a much better job of failing to meet expectations!)
So do we support Unicode normalization today? Well, sort of. At least in FoldString and in collation. Not 100% but we hit all of the common scenarios.
Do we plan to support it in the future products we release? Absolutely. Folks watching the betas can see it now (and the rest can find it in public web searches, which surprised me maybe more than it should have).
If you are using the Microsoft Layer for Unicode on Windows 95/98/Me Systems and the Microsoft Foundation Classes (MFC), there are a few things you need to know about!
There are two bugs in MFC 6.0 that you will have to fix and rebuild (both bugs are fixed in MFC 7.0). One only applies to people compiling their own private MFC DLL, and the other applies to all MFC usage. In addition, there is a third problem if you are using the MFC DLLs that requires you to rebuild the C runtime (CRT) as well (applies to both 6.0 and 7.0).
AFFECTS both STATIC MFC and private DLL: If you look in CEditView (inside of viewedit.cpp) there are three member functions that contain code that does an #ifndef _UNICODE in them: they are CEditView::ReadFromArchive, CEditView::LockBuffer, and CEditView::UnlockBuffer. You should remove the #ifndefs (but leave the code within them!). They were making an assumption that you could never be running a Unicode MFC on Win9x (but of course, it turns out they are wrong, given the existence of MSLU!).
AFFECTS THE CUSTOM MFC DLL CASE ONLY: If you look in the RawDllMain function in dllinit.cpp, they have some more code, this time wrapped in #ifdef _UNICODE. It will cause the MFC DLL to fail anytime you try to use it on Win9x. You should remove this block of code too, and then rebuild the MFC DLL as per the info in TN033 (making sure you have it use MSLU when you build it, of course!).
IF YOU ARE REBUILDING THE MFC DLL: Unfortunately, you will also have to rebuild the CRT dll (msvcrt.dll) as well, due to the use of several CRT functions that rely on Unicode APIs, when called by the Unicode MFC. This is true in both 6.0 and 7.0.
There is one additional problem that can occur if you are using AfxWndProc (MFC's main, shared window proc wrapper) as an actual wndproc in any of your windows. You see, MFC has code in it so that if AfxWndProc is called and is told that the wndproc to follow up with is AfxWndProc, it notices that it is being asked to call itself and forwards to DefWindowProc instead.
Unfortunately, MSLU breaks this code by having its own proc be the one that shows up. MFC has no way of detecting this case so it calls the MSLU proc which calls AfxWndProc which calls the MSLU proc, etc., until the stack overflows. By using either DefWindowProc or your own proc yourself, you avoid the stack overflow.
If you are using the Microsoft Layer for Unicode on Windows 95/98/Me Systems in a project that uses ATL or WTL, there are some things you need to do to make it work.
Avoid the _ATL_MIN_CRT macro -- this macro appears to be incompatible with MSLU.
Problems with garbage text in window title bars -- This is a problem with the usage of ::DefWindowProc and ::CallWindowProc in ATL and WTL. The way to correct it is to add the following code at the very start of your program:
// Resolve UNICOWS's thunk (required)
::DefWindowProc (NULL, 0, 0, 0);
Please note that the above issue has been addressed in WTL 7.0 and thus only applies to earlier versions of WTL.