November, 2004

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Comment policy....

    I am still finding my way around this whole thing, so this policy will probably change.

    For now, the policy is that I check to make sure a comment has some relevance to the topic; if it does then I will let it go up.

    If it's spam (and there has been some of that already, don't these people have lives?) then it will be deleted without explanation.

    If it's just off-topic (and I can be broad in my definition of off-topic, since I have often caused threads to drift) then I will send it back to the person posting with info on what I am doing, if I am able to determine who to contact. Mostly I'll just let it be posted unless it's really random.

    If it's a report of a typo then I will leave it there until I correct the problem (after which I will probably delete it). If it's a more substantial correction than a simple typo then I will most likely leave it up unless doing so would irreparably harm the impression that I am right about everything important. :-)

    Addendum 13 December 2004: Generic "Great Site!" comments that have no actual content but are intended only to increase the visibility of an unrelated site will be rejected with extreme prejudice.

    Addendum 22 December 2004: At times the documentation problems, bugs, and design flaws to which I point might cause people to misunderstand my motives. I will therefore try to make them clear. I believe that the Microsoft internationalization functionality is basically superior to everything else out there. I am thus very pro-Microsoft and confident enough about that functionality that I believe it is not a mistake to "expose" the information about problems here. However, no one should ever mistake that approach and believe that I am somehow inviting Microsoft-bashers to have a platform upon which they can do their bashing. If this applies to you then you probably know who you are, but just in case you can ask yourself if you prefer Monopolysoft or Micro$oft to Microsoft or M$ to MS; if your answer is "yes" then you are probably mistaking this blog for a place for MS bashing....

    Addendum 23 April 2005: I am going to experimentally only moderate comments that are anonymous to see if I can still keep a handle on the spam problem. I'll let you know if this causes any problems....

  • Sorting it all Out

    'Evil date parsing', Parse, and ParseExact

    'Evil date parsing' has quite an ignoble history. Rooted in COM (which was itself rooted in older versions of Visual Basic), conversions from string to date had the simple job of making a string into a date, no matter what the cost. The benefits are obvious, but the problems range from performance issues to the cost of getting bad data through improper parsing. The latter is in fact why the parsing was considered evil for many purposes: when the format is dd/mm/yy, there are just as many people who wanted "01/13/2008" to fail as who wanted it to succeed. The fact that it would succeed simply clouded the issue of the meaning of "01/06/2004", since the difference between January 6th and June 1st is obvious and frightening for some applications.

    The .NET Framework's DateTime.Parse method has goals not unlike those of older products, and because of this it often suffers from the same problems. This road leads to less performant code, and for every customer who loves that it parsed their date there will be another who is unhappy that it missed an entirely reasonable format. They did solve some of the evil problems, but there were still plenty more that came out. By trying to work for everyone, the method ends up with four groups of customers:

    1. those who are unhappy,
    2. those who are happy,
    3. those who are happy now but will be unhappy when they find out how screwed up their data is due to incorrect parsing, and
    4. those who are unhappy now because code that used to work in VB5 or VB6 no longer works since the parsing changed.

    ParseExact, on the other hand, takes the exact formats specified in the DateTimeFormatInfo object and uses them -- it uses nothing else. There is no forgiveness for data that does not match, and the issue of whether or not gratuitous spaces should be forgiven makes for an interesting argument in the hallways of some buildings at Microsoft. Its goal is more along the lines of "I gave you the format, now I will give you the strings in that format; just do the freaking job." This makes it faster and more exact as a semantic, and as such it is much better suited when the flexibility of the other method is not desired. As a veteran of bugs implicit in evil date parsing, I am quite fond of a method with none of the problems it can cause.

    In order to protect your own code, you may want to consider using ParseExact when you can, to help avoid those other problems. Flexibility is great when you need it, but when you don't, it's better not to risk the problems that are the price of flexibility....
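
    To make the "exact" idea concrete, here is a minimal C++ sketch of a parser that accepts one format and nothing else. It is not the .NET ParseExact API: the names Date and parse_exact_ddmmyyyy are hypothetical, and the single accepted format (dd/MM/yyyy with no stray whitespace) is an assumption picked for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical value type for this sketch.
struct Date {
    int year;
    int month;
    int day;
};

// Gregorian leap-year rule.
static bool is_leap_year(int year) {
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

// Days in the given month; the caller must pass a month in 1..12.
static int days_in_month(int year, int month) {
    assert(month >= 1 && month <= 12);
    static const int kDays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    return (month == 2 && is_leap_year(year)) ? 29 : kDays[month - 1];
}

// Accepts exactly "dd/MM/yyyy": ten characters, digits and slashes in fixed
// positions, no surrounding whitespace. Anything else is rejected rather
// than guessed at.
bool parse_exact_ddmmyyyy(const std::string& text, Date& out) {
    if (text.size() != 10 || text[2] != '/' || text[5] != '/') return false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i == 2 || i == 5) continue;
        if (!std::isdigit(static_cast<unsigned char>(text[i]))) return false;
    }
    const int day = std::stoi(text.substr(0, 2));
    const int month = std::stoi(text.substr(3, 2));
    const int year = std::stoi(text.substr(6, 4));
    if (month < 1 || month > 12) return false;
    if (day < 1 || day > days_in_month(year, month)) return false;
    out = Date{year, month, day};
    return true;
}
```

    With this sketch, "01/06/2004" parses as 1 June 2004, while "01/13/2008", " 01/06/2004", and "1/6/2004" are all rejected. That is the trade described above: no surprises, but no forgiveness either.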

  • Sorting it all Out

    Some keyboarding terms

    This posting will try to clear up some of the problems in documentation and info regarding keyboards, since there is plenty left in those things to be confusing and there is no need to throw bewildering terms into the mix. Future posts will build on this one, so if you already know about keyboards you might be able to skip it (though I would not advise it!). It is not really a glossary since it is not alphabetized; the order is arbitrary, based on either when I thought of a term to add or when I thought dramatic effect could be increased.

    • LCID -- a locale identifier. Traditionally pronounced like "El-Sid". It is a value that has no real meaning at all to keyboards but some people who ought to know better seem to think it does. Those people are actually thinking of LANGID, another entry in the glossary. LCIDs are only 32 bits, a fact that sucks for reasons I'll talk about another day, but it's twice as much as keyboards use, anyway.
    • LANGID -- a language identifier. Traditionally not pronounced like "Lan-Gid"; instead "Lang-Eye-Dee" is preferred. This is the bottom WORD of the DWORD that makes up an LCID and essentially represents a number that signifies a unique language/region/script combination. This combination is what is needed for keyboards, which are intended to be typical input methods for such combinations. Obviously, looking ahead to the world of custom cultures and custom locales, this may not be the best design. But it's a little too late to change now....
    • Layout ID -- a layout identifier. This has no special pronunciation, probably because calling something a "Lid" would sound dumb. These numbers are used to help the USER SUBSYSTEM manage the situation when multiple keyboards use the same LANGID (something that happens for many keyboards that ship with Windows). Each keyboard layout using the same LANGID after the first must (1) have one and (2) it must not duplicate one that is already assigned on the system. Failure on either of these points will cause a layout to not be properly selected.
    • KLID -- a keyboard layout identifier. Traditionally pronounced "Kay-El-Eye-Dee" because some people in the USA get very uptight about certain homonyms (you can catch me slipping on this point from time to time). It's also sometimes called the input locale identifier since the name for HKL has been updated (see the HKL definition for info on why that is incorrect, since the HKL is for something different). The KLID can be retrieved for the currently selected keyboard layout in a thread through the GetKeyboardLayoutName API (note the pwszKLID parameter), though that is not true of any other selected or installed keyboard layout. Every keyboard layout on the system has one of these. Each KLID is 32 bits (thus 8 hex digits), and they can all be found in the registry as the subkeys under HKLM\SYSTEM\CurrentControlSet\Control\Keyboard Layouts\. The bottom half of the KLID is a LANGID, and the top half is something device-specific. By convention, the first hex digit is usually as follows:
      • 0 -- Most keyboard layouts
      • A -- Keyboard layouts defined by MSKLC
      • D -- Some non-CJK input methods that have been defined by the Text Services Framework (note: reported to me; I have never seen one of these!)
      • E -- CJK input methods, also known as IMEs
    • HKL -- a handle to a keyboard layout, traditionally pronounced "Āch-Kay-El". The terminology folks have pretty aggressively tried to call this an "input locale identifier" despite the obvious problems that it has nothing to do with locales and that it is not the same value as the actual identifier (the KLID). The HKL in actuality is the handle to an input method. Although defined as a handle, only the lower 32 bits are currently used. Of those 32 bits, the bottom 16 bits represent a LANGID, and the top 16 bits represent a value defined by the USER SUBSYSTEM which helps to uniquely identify an installed keyboard layout. This is crucial since any keyboard layout can be installed more than once (by installing it under different languages, which helps user operations like spell checking).
    • MKLC -- see MSKLC. The only people who call it MKLC are the ones who object to MSKLC since it's not a true acronym (Microsoft being a single word). None of the user interface or documentation calls it this, so it's unbelievable that it is getting such a large entry in this glossary, but people outside of Microsoft use it all the time.
    • MSKLC -- the Microsoft Keyboard Layout Creator, traditionally pronounced "Em-Es-Kay-El-See" by people not taken in by the MKLC arguments. It is a tool released by Microsoft which allows someone to build a custom keyboard layout and build a setup package to install it on Windows NT 4.0, 2000, XP, or Server 2003. The help file contains a ton of information about the best way to design keyboard layouts that work well on Windows. I am the developer on it and love to hear feedback any time people have it, since there is always room for improvement.
    • IME -- Input method editor, traditionally pronounced "Eye-Em-Eee". An IME is an engine that converts keystrokes into phonetic or ideographic characters. It is a commonly used abstraction that allows a keyboard with only 100 or so keystrokes to be able to support character sets that contain up to 20,000 or more ideographs. There are old samples in the Platform SDK using the Input Method Manager (IMM) APIs, but today most IMEs written by Microsoft use the much more approachable Text Services Framework.
    • Supported keyboard layout -- this is an odd bit of terminology that actually means a keyboard layout is defined on the system. It may not be currently selectable by a user (e.g., if it's a Thai keyboard layout and Thai/complex script support is not enabled). It can also be an IME or a speech-to-text converter, so DEFINED INPUT METHOD might have been a better term. This terminology is slowly being removed from documentation and it's not entirely clear what is replacing it.
    • Installed keyboard layout -- another odd bit of terminology, it means a keyboard layout that a user has selected. It can also be an IME or a speech-to-text converter, so SELECTED INPUT METHOD might have been a better term. This terminology is slowly being removed from documentation and it's not entirely clear what is replacing it.
    • Scan code -- The numeric value given to each physical key on a keyboard; the scan code is a hardware-dependent number that identifies the key. Scan codes have a fixed position on the physical keyboard, regardless of the keyboard layout chosen by the user.
    • Virtual Key -- Also called the VK, the code that is given by the Windows USER subsystem to represent a keystroke. It is mapped from a scan code by using the keyboard layout definition and is thus entirely dependent on the user's chosen layout. The reason for this is the [unfortunate, IMHO] choice to use e.g. VK_A for the 'A' key, which meant that on keyboards that put 'A' in a different position the VK would have to be moved.
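
    The bit layouts described in the LCID, LANGID, KLID, and HKL entries above can be sketched in portable C++. This is plain integer arithmetic and calls no Windows API; the names langid_from_lcid, Halves, split_halves, split_klid, klid_kind, and glossary_examples are hypothetical, and the sample values are only illustrative inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// An LCID is a DWORD whose bottom WORD is the LANGID.
std::uint16_t langid_from_lcid(std::uint32_t lcid) {
    return static_cast<std::uint16_t>(lcid & 0xFFFFu);
}

// Hypothetical holder for the two 16-bit halves of a KLID or of the low
// 32 bits of an HKL.
struct Halves {
    std::uint16_t high;    // device-specific (KLID) or USER-defined (HKL)
    std::uint16_t langid;  // bottom 16 bits
};

Halves split_halves(std::uint32_t value) {
    Halves h;
    h.high = static_cast<std::uint16_t>(value >> 16);
    h.langid = static_cast<std::uint16_t>(value & 0xFFFFu);
    return h;
}

// Parses an 8-hex-digit KLID string such as "00000409" into its halves.
// Returns false if the text is not exactly eight hex digits.
bool split_klid(const std::string& klid, Halves& out) {
    if (klid.size() != 8) return false;
    std::uint32_t value = 0;
    for (char c : klid) {
        std::uint32_t digit;
        if (c >= '0' && c <= '9') digit = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') digit = static_cast<std::uint32_t>(c - 'A' + 10);
        else return false;
        value = (value << 4) | digit;
    }
    out = split_halves(value);
    return true;
}

// Labels a KLID by its first hex digit, following the convention listed above.
std::string klid_kind(const Halves& klid) {
    switch (klid.high >> 12) {
        case 0x0: return "most keyboard layouts";
        case 0xA: return "layout defined by MSKLC";
        case 0xD: return "non-CJK input method (Text Services Framework)";
        case 0xE: return "CJK input method (IME)";
        default:  return "not covered by the convention";
    }
}

// Usage sketch: each check restates a fact from the glossary entries.
void glossary_examples() {
    assert(langid_from_lcid(0x00000412u) == 0x0412);

    Halves klid = {0, 0};
    const bool parsed = split_klid("a0000409", klid);
    assert(parsed);
    (void)parsed;
    assert(klid.langid == 0x0409);
    assert(klid_kind(klid) == "layout defined by MSKLC");

    // Low 32 bits of an HKL: LANGID below, USER-defined value above.
    const Halves hkl = split_halves(0x04090409u);
    assert(hkl.langid == 0x0409 && hkl.high == 0x0409);
}
```

    Note that the sketch only takes values apart. It says nothing about how the USER subsystem chooses the top half of an HKL, which is not something you should try to predict.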

    That's all I can think of at the moment, but I am sure I will be updating this topic any time I think of something else.

  • Sorting it all Out

    Microsoft does not use the Unicode Collation Algorithm

    Robert A. Heinlein told a story in his book Expanded Universe back in 1980 (bear with me, I promise I'll be making a point eventually):

    A few years ago, I was visited by an astronomer, quite young and brilliant. He claimed to be a long-time reader of my fiction and his conversation proved it. I was telling him about a time I needed a synergistic orbit from Earth to a 24-hour station; I told him the story it was in, he was familiar with the scene, mentioned having read the book in grammar school.

    This orbit is similar in appearance to a cometary interplanet transfer but is in fact a series of compromises in order to arrive in step with the space station; elapsed time is an unsmooth integral not to be found in Hudson's Manual but it can be solved by the methods used on the Siacci empiricals for atmosphere ballistic: numerical integration.

    I'm married to a woman who knows more math, history, and languages than I do. This should teach me humility (and sometimes does, for a few minutes). Her brain is a great help to me professionally. I was telling this young scientist how we obtained yards of butcher paper, then each of us worked three days, independently, solved the problem and checked each other -- then the answer disappeared into *one* line of *one* paragraph (SPACE CADET) but the effort had been worthwhile since it controlled what I could do dramatically in that sequence.

    Dr Whoosis said "But *why* didn't you just shove it through a computer?"

    I blinked at him. Then said slowly, gently, "My dear boy--" (I don't usually call PH.D.'s in hardcore sciences "My dear boy"--they impress me. But this was a special case.) "My dear boy... this was *1947*."

    It took him some moments to get it, then he blushed....

    It's a story that comes into my head every time I get a question these days that proves the person asking is not thinking about the fact that the passage of events has an influence on what is possible. Nowhere is this greater than in the subject of this posting -- people who wonder why Microsoft does not support the Unicode Collation Algorithm. People notice that Windows seems to have a similar framework and they assume that both of them use the same "default table" that works as a basis for all collations (in other words, they assume that Microsoft's implementation is based on the Unicode sort weight tables).

    The truth is quite different. Unicode's weights have been a part of the UCA,  which was first a DRAFT Unicode Technical Report in March of 1997. It did not lose its DRAFT status until November of 1999 and was not a Unicode Technical Standard until August of 1999.

    Windows, on the other hand, has had its architecture and its default table in place since NT 3.1 shipped, over a decade ago. How could it be based on the Unicode sort weight tables, which did not exist at that time even in draft form? The temptation to respond to the person asking with a "My dear boy..." (or "My dear girl...") is at times overwhelming!

    As to the extra functionality, I'll just say that the past 15 years have seen a lot of language support being added to Windows, and the expertise that has been applied to its collation support is truly amazing. It's a daunting functionality to work on at times given how well it has performed over the years. :-)

    From a philosophical perspective, collation in Windows has always been based primarily on the linguistic data that is at its core -- the technical issues have always been driven by the data, not the other way around. I think this is a unique strength of the implementation that allows it to outperform others across a range of languages that is also (in my opinion) far superior. The tables were certainly built up with an entirely different linguistic and development philosophy, and ignoring my opinions about which is better, the data of either one would really be a poor fit for the other.

    It is of note (well, to me at least!) that at the last two Unicode Technical Committee meetings several decisions were made which will cause future versions of the UCA's default table to behave more like Microsoft's. This is not because it's Microsoft's way (we give advice about principles for the UCA but really do not innovate for it since we are not using it to come up with innovations) but because one of the authors of the UCA suggested tweaks to the UCA behavior based on expert advice and user feedback. I guess that means we had the right idea, huh? :-)

  • Sorting it all Out

    They ask me "why is my Korean text in random order?"

    I get this question on a regular basis -- people wonder if I know that Korean shows up in a random order. People expect this to be the case when the Korean LCID (0x0412, or 1042 in decimal) is not passed, but when it is they expect things to be in some kind of correct order, and as far as they can tell, they are not.

    My first big question (to help set expectations properly) is to ask what order they expect to be used. They almost always have trouble explaining what they thought would happen. But usually with a little help they make it through this part. The order is based on the most common Hangul pronunciation of the character, whether it is Hangul or Hanja (the Korean name for Han ideographs).

    At this point they start to see some pattern to the results but it still seems random within a particular pronunciation.

    My second big question is to ask how they are getting an order. This varies -- if they are a developer then they are calling LCMapString or CompareString (or some API that calls CompareString); otherwise, if it's in an application (Access or Excel or whatever), then they do not know what API is being called (but usually I know what the application is doing). Or they are using managed code and the CompareInfo or SortKey classes are being used. The problem is usually that they are passing the NORM_IGNORENONSPACE flag (or the CompareOptions.IgnoreNonSpace flag in managed code). And herein is where the problem lies....

    You see, modern Korean Hangul and Hanja do not have the notion of non-spacing characters (like ˆˇˉ˘˙˚˛˜˝ diacritics seen in Latin), so that part of the "collation weight space" is used for Hanja that most commonly have the same Hangul pronunciation. Telling the API or method to ignore this weight is basically asking it to treat (for example) such ideographs as 渴, 噶, and 鶡 as randomly sorting together in a non-deterministic way, since the API will report that the two strings are equal. The same thing happens to letters with diacritics in Latin -- passing the flag will cause (for example) A, Ā, Ă, Ą, À, Á, Â, Ã, Ä, and Å to sort together randomly in the same sort of non-deterministic way.

    Ah, the light begins to dawn -- the order was not actually so random as they thought. They should not be passing this flag!
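
    A toy model shows why the results look random. In this C++ sketch each character carries a primary and a secondary weight, and a flag tells the comparison to skip the secondary one. The weight values, the names Weighted, kIgnoreSecondary, and compare_weights, and the relative order chosen for the two Hanja are all invented for illustration; this is not the real Windows weight table or the CompareString API.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-character weights. The numbers are invented; real
// collation tables are far larger and are not reproduced here.
struct Weighted {
    std::string text;  // the character, as UTF-8
    int primary;       // e.g. the shared Hangul pronunciation, or the base letter
    int secondary;     // e.g. which Hanja, or which diacritic
};

enum CompareFlags {
    kCompareAll = 0,
    kIgnoreSecondary = 1  // plays the role of an "ignore non-spacing" flag
};

// Three-way comparison: negative, zero, or positive.
int compare_weights(const Weighted& a, const Weighted& b, int flags) {
    if (a.primary != b.primary) return a.primary < b.primary ? -1 : 1;
    if ((flags & kIgnoreSecondary) == 0 && a.secondary != b.secondary) {
        return a.secondary < b.secondary ? -1 : 1;
    }
    return 0;  // equal as far as the requested levels can tell
}

// Usage sketch with invented weights.
void ignore_secondary_examples() {
    const Weighted hanja_a{"渴", 500, 1};
    const Weighted hanja_b{"噶", 500, 2};
    const Weighted latin_plain{"A", 10, 0};
    const Weighted latin_ring{"Å", 10, 5};

    // With every level considered, the order is well defined.
    assert(compare_weights(hanja_a, hanja_b, kCompareAll) < 0);
    assert(compare_weights(latin_plain, latin_ring, kCompareAll) < 0);

    // With the secondary level ignored, each pair compares as equal, so a
    // sort is free to leave them in any relative order.
    assert(compare_weights(hanja_a, hanja_b, kIgnoreSecondary) == 0);
    assert(compare_weights(latin_plain, latin_ring, kIgnoreSecondary) == 0);
}
```

    Once two strings compare as equal, whatever order they end up in is an accident of the sort routine and the input order, which is exactly the "random" behavior people report.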

    The next question they ask is "what is so special about Korean that it is the only language where this is done?" The answer to this is simple: it's not.

    The names for the flags NORM_IGNORENONSPACE/NORM_IGNORECASE (or IgnoreNonSpace/IgnoreCase) are really misleading, since they were really meant to refer to Latin script diacritics and case, where the difference between 'A' and 'Ă' needs to be weighed as secondary and the difference between 'A' and 'a' as tertiary. There are a myriad of languages that have the same linguistic need to express secondary and tertiary differences, so they need to use those types of weights. Passing the flags can harm the accuracy/specificity of Thai, Hindi, Tamil, Telugu, Arabic, Hebrew, and a whole lot more.

    The final question that they ask is "But I am doing searches and I need to pass those flags for the Latin/Cyrillic/Greek text. How can I have the results work properly in those searches without finding the wrong characters for these other languages?" The answer here is to use the flags to do the search operation but then order them without using the flags. If you do this then you will get the longer list of candidates, but just as in typical web searches the closest matches will appear higher on the list!
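
    The "search with the flags, order without them" advice can be sketched with the same kind of toy weights. Everything here is hypothetical (the Entry type, the weight numbers, and the function names), and the loose and strict comparisons merely stand in for calls made with and without the ignore flags.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical entry with invented collation weights.
struct Entry {
    std::string text;  // UTF-8
    int primary;       // base letter
    int secondary;     // diacritic
};

// Stand-in for a comparison made WITH the ignore flag: secondary is skipped.
bool loose_equal(const Entry& a, const Entry& b) {
    return a.primary == b.primary;
}

// Stand-in for a comparison made WITHOUT the flag: every level counts.
bool strict_less(const Entry& a, const Entry& b) {
    if (a.primary != b.primary) return a.primary < b.primary;
    return a.secondary < b.secondary;
}

// Step 1: collect candidates loosely. Step 2: order them strictly.
std::vector<Entry> search_then_order(const std::vector<Entry>& items,
                                     const Entry& query) {
    std::vector<Entry> hits;
    for (const Entry& item : items) {
        if (loose_equal(item, query)) hits.push_back(item);
    }
    std::sort(hits.begin(), hits.end(), strict_less);
    return hits;
}

// Usage sketch: searching for plain "A" finds the accented forms too, and
// the strict ordering puts the plain letter first.
void search_examples() {
    const std::vector<Entry> items = {
        {"Ä", 10, 3}, {"B", 20, 0}, {"A", 10, 0}, {"Á", 10, 2}};
    const Entry query = {"A", 10, 0};

    const std::vector<Entry> hits = search_then_order(items, query);
    assert(hits.size() == 3);
    assert(hits[0].text == "A");
    assert(hits[1].text == "Á");
    assert(hits[2].text == "Ä");
}
```

    The point of splitting the two steps is that the loose comparison decides only membership in the result list, so it never gets the chance to scramble the order.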

  • Sorting it all Out

    Normalization and Microsoft -- what's the story?

    This is a word that has been way too overloaded. To date I have heard of four specific uses since I have come to Microsoft:

    • It is used by some folks on the SQL Server team when they talk about string comparisons
    • It is used by the NLS collation APIs for the same reason (the NORM_* flags), in reference to potentially ignorable differences in string comparison
    • Robert A. Wlodarczyk has used the term string normalization to refer to the broader type of search that others call fuzzy searches
    • Unicode has a technical report entitled Unicode Normalization Forms that defines a technique for folding out differences in equivalent sequences

    Looking at definitions for the word from dictionary.com:

    • To make normal, especially to cause to conform to a standard or norm
    • To make (a text or language) regular and consistent, especially with respect to spelling or style
    • To remove strains and reduce coarse crystalline structures in (metal), especially by heating and cooling
    • Reduction to a standard or normal state

    I guess it's really only Unicode that is using the word correctly, though maybe that "string normalization" usage is not too far off. Ah well, bad on the folks over in SQL Server (they have the advantage of it not being splattered all over their public header files and documentation -- only their customer contacts tend to hear about it, and only when it comes up). And of course on those of us working on NLS. Oops!

    But moving past the terminology issue, does collation as an operation support the concepts being described in Unicode normalization? In other words, does the CompareString API consider U+00c5 (Å, LATIN CAPITAL LETTER A WITH RING ABOVE) to be equivalent to U+0041 U+030a (Å, LATIN CAPITAL LETTER A + COMBINING RING ABOVE)?

    The answer is that it does. But not because Unicode Normalization was being used.

    Like yesterday's blog entry, which pointed out how the UCA was written up many years after Microsoft (and others) were doing the work, Unicode Normalization was first proposed as a DRAFT in the Spring of 1998. It was not until the Summer of 1999 that it was given the status of a Technical Report. The collation functions (CompareString and LCMapString) had been supporting this type of operation for many years before that.
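
    For readers who have not met the idea before, here is a toy C++ sketch of what "equivalent sequences" means for the Å example. It composes base-plus-combining-mark pairs from a three-entry table and then compares. The table, the names ComposePair, compose_toy, and equivalent_toy, and the approach itself are illustrative assumptions; this is neither the full Unicode algorithm nor a description of how CompareString reaches the same answer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One hand-written composition rule: base + combining mark -> precomposed.
struct ComposePair {
    char32_t base;
    char32_t mark;
    char32_t composed;
};

// A tiny illustrative table; the real data covers far more pairs.
static const ComposePair kPairs[] = {
    {0x0041, 0x030A, 0x00C5},  // A + COMBINING RING ABOVE -> Å
    {0x0061, 0x030A, 0x00E5},  // a + COMBINING RING ABOVE -> å
    {0x0065, 0x0301, 0x00E9},  // e + COMBINING ACUTE ACCENT -> é
};

// Replaces every base+mark pair found in the table with its precomposed
// character and copies everything else unchanged.
std::u32string compose_toy(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool merged = false;
        if (i + 1 < in.size()) {
            for (const ComposePair& pair : kPairs) {
                if (in[i] == pair.base && in[i + 1] == pair.mark) {
                    out.push_back(pair.composed);
                    ++i;  // skip the combining mark that was just consumed
                    merged = true;
                    break;
                }
            }
        }
        if (!merged) out.push_back(in[i]);
    }
    return out;
}

// Two strings are "equivalent" here if they compose to the same sequence.
bool equivalent_toy(const std::u32string& a, const std::u32string& b) {
    return compose_toy(a) == compose_toy(b);
}

// Usage sketch: the two spellings of Å differ code point by code point but
// are equivalent once composed.
void equivalence_examples() {
    const std::u32string precomposed = U"\u00C5";
    const std::u32string sequence = U"A\u030A";
    assert(precomposed != sequence);
    assert(equivalent_toy(precomposed, sequence));
}
```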

    It is worth mentioning that support is not perfect. As Microsoft Typography discovered when they first ran their font validation tools on their existing fonts, and as I discovered when I ran MSKLC's keyboard validation on Microsoft's existing keyboard layouts, once there was a concerted effort to validate everything that collation does against all of the various Unicode operations such as Unicode Normalization, there were holes found. The holes fall into two basic categories:

    • in the Arabic presentation forms (where many precomposed ligatures are not considered to be equal to their combined forms)
    • in Korean Old Hangul (where Jamo sequences are given weights that are calculated to be close but not identical to their 'combined' Hangul forms)

    I was once again struck by the differences that a 'linguistic' approach can lead to when compared to a 'technical' one, even when nominally working to service the same (or similar) customers. In the above two categories, there were specific reasons for the differences relating to customer feedback. In a few cases it will make sense to pick up some of these differences (several of the Arabic ligatures, for example), but for the most part those differences will likely stay as there is no compelling customer scenario to try to pick up all of the differences.

    There is another (some would say more relevant) way in which Microsoft supports Unicode Normalization -- in the FoldString API, with its MAP_PRECOMPOSED, MAP_COMPOSITE, MAP_FOLDCZONE flags. While the tables that are used for these various foldings are a bit out of date, their goal is obviously the same. In the long run it completely makes sense to update this information.

    Which is not to say that Unicode Normalization is not important -- it is. As a standard it has been picked up by the IETF, the W3C, and many others. Folks working with the Whidbey beta have already seen String.IsNormalized and String.Normalize in their Intellisense, and folks who have seen the PDC or later builds of Longhorn have noticed IsNormalized and Normalize APIs. There may well be people who will work to make sure text is properly put into a specific Unicode normalization form prior to storing or transmitting it.

    But since all of the methods of text input on Windows tend to use the same normalization form already (form C), and since it obviously is an operation that takes some time and space to do no matter how fast it is, it is an optional task that can be used when it is important to do so. Especially since it mostly works already, anyway.

    (This was an area I walked into dismayed that we do not support the Unicode Standard but walked out of pleased at how much we suck at the job of not supporting it. Usually when I am not supporting something I do a much better job of failing to meet expectations!)

    So do we support Unicode normalization today? Well, sort of. At least in FoldString and in collation. Not 100% but we hit all of the common scenarios.

    Do we plan to support it in the future products we release? Absolutely. Folks watching the betas can see it now (and the rest can find it in public web searches, which surprised me maybe more than it should have).

     

  • Sorting it all Out

    If you are using MFC 6.0 or 7.0 and you want to use MSLU...

    If you are using the Microsoft Layer for Unicode on Windows 95/98/Me Systems and the Microsoft Foundation Classes (MFC), there are a few things you need to know about!


    There are two bugs in MFC 6.0 that you will have to fix and rebuild (both bugs are fixed in MFC 7.0). One only applies to people compiling their own private MFC DLL, and the other applies to all MFC usage. In addition, there is a third problem if you are using the MFC DLLs that requires you to rebuild the C runtime (CRT) as well (applies to both 6.0 and 7.0).

    1. AFFECTS BOTH STATIC MFC AND PRIVATE DLL: If you look in CEditView (inside of viewedit.cpp) there are three member functions that contain code that does an #ifndef _UNICODE in them: they are CEditView::ReadFromArchive, CEditView::LockBuffer, and CEditView::UnlockBuffer. You should remove the #ifndefs (but leave the code within them!). They were making an assumption that you could never be running a Unicode MFC on Win9x (but of course, it turns out they are wrong, given the existence of MSLU!).


    2. AFFECTS THE CUSTOM MFC DLL CASE ONLY: If you look in the RawDllMain function in dllinit.cpp, they have some more code, this time wrapped in #ifdef _UNICODE. It will cause the MFC DLL to fail anytime you try to use it on Win9x. You should remove this block of code too, and then rebuild the MFC DLL as per the info in TN033 (making sure you have it use MSLU when you build it, of course!).

    3. IF YOU ARE REBUILDING THE MFC DLL: Unfortunately, you will also have to rebuild the CRT dll (msvcrt.dll) as well, due to the use of several CRT functions that rely on Unicode APIs, when called by the Unicode MFC. This is true in both 6.0 and 7.0.


    There is one additional problem that can occur if you are using AfxWndProc (MFC's main, shared window proc wrapper) as an actual wndproc in any of your windows. You see, MFC has code in it so that if AfxWndProc is called and is told that the wndproc to follow up with is AfxWndProc, it notices that it is being asked to call itself and forwards to DefWindowProc instead.

    Unfortunately, MSLU breaks this code by having its own proc be the one that shows up. MFC has no way of detecting this case, so it calls the MSLU proc, which calls AfxWndProc, which calls the MSLU proc, etc., until the stack overflows. By using either DefWindowProc or your own proc as the wndproc instead, you avoid the stack overflow.
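
    The shape of that loop can be modeled in a few lines of portable C++. This is a simulation only: framework_proc stands in for AfxWndProc, wrapper_proc for the MSLU proc, and default_proc for DefWindowProc. All of the names, the Window struct, and the depth limit that stands in for a real stack overflow are hypothetical, and none of this is actual MFC or MSLU code.

```cpp
#include <cassert>

struct Window;
typedef int (*Proc)(const Window& window, int depth);

// Hypothetical window record: just the proc to "follow up with".
struct Window {
    Proc super_proc;
};

const int kDepthLimit = 1000;  // stand-in for running out of stack
const int kHandled = 0;
const int kOverflow = -1;

// Stand-in for DefWindowProc: always terminates the chain.
int default_proc(const Window&, int) { return kHandled; }

int framework_proc(const Window& window, int depth);

// Stand-in for the MSLU proc: a different address that forwards straight
// back to the framework proc.
int wrapper_proc(const Window& window, int depth) {
    return framework_proc(window, depth + 1);
}

// Stand-in for AfxWndProc: it protects itself only against being asked to
// call its own address.
int framework_proc(const Window& window, int depth) {
    if (depth >= kDepthLimit) return kOverflow;  // real code would crash here
    if (window.super_proc == framework_proc) {
        return default_proc(window, depth);      // the self-detection check
    }
    return window.super_proc(window, depth + 1);
}

// Usage sketch.
void recursion_examples() {
    const Window direct = {framework_proc};
    const Window wrapped = {wrapper_proc};
    const Window own = {default_proc};

    assert(framework_proc(direct, 0) == kHandled);    // check fires
    assert(framework_proc(wrapped, 0) == kOverflow);  // check is bypassed
    assert(framework_proc(own, 0) == kHandled);       // the workaround
}
```

    The self-detection check compares addresses, so any wrapper that substitutes a different address defeats it even though the call still ends up back in the same place.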

  • Sorting it all Out

    Using MSLU with ATL/WTL

    If you are using the Microsoft Layer for Unicode on Windows 95/98/Me Systems in a project that uses ATL or WTL, there are some things you need to do to make it work.

    • Avoid the _ATL_MIN_CRT macro -- this macro appears to be incompatible with MSLU.

    • Problems with garbage text in window title bars -- This is a problem with the usage of ::DefWindowProc and ::CallWindowProc in ATL and WTL. The way to correct it is to add the following code at the very start of your program:

      // Resolve UNICOWS's thunk (required) 
      ::DefWindowProc (NULL, 0, 0, 0);
      Tim Smith explained it best:
      "The problem is that if you create an ATL window prior to ::DefWindowProc being called, then m_pfnSuperWindowProc points to the thunk [in the loader] and not the resolved address. Then when ATL passes m_pfnSuperWindowProc into ::CallWindowProc as part of the WM_SETTEXT message, MSLU doesn't realize that it is being passed [its own] ::DefWindowProc and thus does an extra level of text conversion. By invoking ::DefWindowProc at the start of the program, then when ATL creates a window and stores the address of ::DefWindowProc in m_pfnSuperWindowProc, it is storing the address of the MSLU routine that the MSLU ::CallWindowProc realizes does not need conversion. In general, if you are using ATL/WTL, just add that line of code at the start of your program and be done with it. It also should be added to any DLL that uses ATL windows [which have the same issue]."

      Please note that the above issue has been addressed in WTL 7.0 and thus only applies to earlier versions of WTL.
