
June, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    String.Compare is for sissies (not for people who want SQLCLR consistency)

    Yesterday at the TechEd booth in which I was sitting, someone was asking me how to get the comparisons in the .NET Framework to be consistent with the ones in SQL Server 2005.

    This is a lot harder than it has to be, unfortunately!

    First we will start with the constants used by Windows from winnls.h in the Platform SDK:

    #define NORM_IGNORECASE       0x00000001  // ignore case
    #define NORM_IGNORENONSPACE   0x00000002  // ignore nonspacing chars
    #define NORM_IGNORESYMBOLS    0x00000004  // ignore symbols
    #define NORM_IGNOREKANATYPE   0x00010000  // ignore kanatype
    #define NORM_IGNOREWIDTH      0x00020000  // ignore width
    #define SORT_STRINGSORT       0x00001000  // use string sort method

    These definitions almost match the ones used by the .NET Framework, in the CompareOptions enumeration:

    CompareOptions.IgnoreCase              1   (0x00000001)
    CompareOptions.IgnoreKanaType          8   (0x00000008)
    CompareOptions.IgnoreNonSpace          2   (0x00000002)
    CompareOptions.IgnoreSymbols           4   (0x00000004)
    CompareOptions.IgnoreWidth            16   (0x00000010)
    CompareOptions.None                    0   (0x00000000)
    CompareOptions.Ordinal        1073741824   (0x40000000)
    CompareOptions.StringSort      536870912   (0x20000000)

    You can see where the differences are: NORM_IGNOREWIDTH/CompareOptions.IgnoreWidth, NORM_IGNOREKANATYPE/CompareOptions.IgnoreKanaType, and SORT_STRINGSORT/CompareOptions.StringSort. Let us keep the first two of these differences in mind; they will become important soon.

    In both Shiloh (SQL Server 2000) and Yukon (SQL Server 2005), the very useful COLLATIONPROPERTY has two very cool attributes you can grab:

    • LCID -- returns the LCID you would use in Windows
    • ComparisonStyle -- returns the flags you would pass (they call it the Windows comparison style)

    Thus if I run the following query:

    SELECT 
        name,
        COLLATIONPROPERTY(name, 'CodePage') as CodePage,
        COLLATIONPROPERTY(name, 'LCID') as LCID,
        COLLATIONPROPERTY(name, 'ComparisonStyle') as ComparisonStyle,
        description
    FROM ::fn_helpcollations()

    You get back a nice big table that has the exact information for the code page used in WideCharToMultiByte/MultiByteToWideChar calls, and the LCID/dwCmpFlags values to use in a CompareString call (excerpt here, I do not want to duplicate the whole 1029-row table Yukon returns!):

    Collation name       CP   LCID Flags   Description
    Albanian_BIN         1250 1052 0       Albanian, binary sort
    Albanian_BIN2        1250 1052 0       Albanian, binary code point comparison sort
    Albanian_CI_AI       1250 1052 196611  Albanian, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
    Albanian_CI_AI_WS    1250 1052 65539   Albanian, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
    Albanian_CI_AI_KS    1250 1052 131075  Albanian, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
    Albanian_CI_AI_KS_WS 1250 1052 3       Albanian, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
    Albanian_CI_AS       1250 1052 196609  Albanian, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
    Albanian_CI_AS_WS    1250 1052 65537   Albanian, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
    Albanian_CI_AS_KS    1250 1052 131073  Albanian, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
    Albanian_CI_AS_KS_WS 1250 1052 1       Albanian, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
    Albanian_CS_AI       1250 1052 196610  Albanian, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
    Albanian_CS_AI_WS    1250 1052 65538   Albanian, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
    Albanian_CS_AI_KS    1250 1052 131074  Albanian, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
    Albanian_CS_AI_KS_WS 1250 1052 2       Albanian, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
    Albanian_CS_AS       1250 1052 196608  Albanian, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
    Albanian_CS_AS_WS    1250 1052 65536   Albanian, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
    Albanian_CS_AS_KS    1250 1052 131072  Albanian, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
    Albanian_CS_AS_KS_WS 1250 1052 0       Albanian, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive

    I could not find an easy T-SQL way to make the flag values show up as hexadecimal. But the mappings would be:

    Collation name       Flags     
    Albanian_BIN         0x00000000
    Albanian_BIN2        0x00000000
    Albanian_CI_AI       0x00030003
    Albanian_CI_AI_WS    0x00010003
    Albanian_CI_AI_KS    0x00020003
    Albanian_CI_AI_KS_WS 0x00000003
    Albanian_CI_AS       0x00030001
    Albanian_CI_AS_WS    0x00010001
    Albanian_CI_AS_KS    0x00020001
    Albanian_CI_AS_KS_WS 0x00000001
    Albanian_CS_AI       0x00030002
    Albanian_CS_AI_WS    0x00010002
    Albanian_CS_AI_KS    0x00020002
    Albanian_CS_AI_KS_WS 0x00000002
    Albanian_CS_AS       0x00030000
    Albanian_CS_AS_WS    0x00010000
    Albanian_CS_AS_KS    0x00020000
    Albanian_CS_AS_KS_WS 0x00000000
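As a rough illustration of how the hexadecimal values in the table come about, here is a small Python sketch that formats a decimal ComparisonStyle value as hex and lists which of the winnls.h flags quoted earlier are set. The function name describe_style is made up for this example; only the flag values come from the header.

```python
# Flag values as quoted from winnls.h earlier in the post.
NORM_FLAGS = {
    "NORM_IGNORECASE": 0x00000001,
    "NORM_IGNORENONSPACE": 0x00000002,
    "NORM_IGNORESYMBOLS": 0x00000004,
    "NORM_IGNOREKANATYPE": 0x00010000,
    "NORM_IGNOREWIDTH": 0x00020000,
    "SORT_STRINGSORT": 0x00001000,
}


def describe_style(style):
    """Return (hex string, names of the flags set) for a ComparisonStyle value.

    describe_style is a hypothetical helper, not part of any SQL Server or
    Windows API.
    """
    names = [name for name, bit in NORM_FLAGS.items() if style & bit]
    return f"0x{style:08x}", names


# Albanian_CI_AI reports 196611 in the decimal table.
print(describe_style(196611))
```

Running it over the decimal column of the earlier table should reproduce the hex column, assuming the values were copied correctly.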

    Updated query to give the hex values for LCID and ComparisonStyle directly (thank you, James Todd!):

    SELECT
        name,
        COLLATIONPROPERTY(name, 'CodePage') as CodePage,
        CONVERT(binary(4), COLLATIONPROPERTY(name, 'LCID')) as LCID,
        CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
        description
    FROM ::fn_helpcollations()

    So, if you take that flag value, convert all of the 0x00020000 to 0x00000010 (and 0x00010000 to 0x00000008!), you then have a value you can plug into the CompareInfo.Compare method.
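The conversion just described is only a little bit shuffling. A minimal Python sketch follows, assuming the constant values listed earlier in this post; the function name to_compare_options is hypothetical. The SORT_STRINGSORT mapping is included only because it is the third difference noted above, and it is an assumption that a collation would ever report that bit.

```python
# Windows comparison-style bits (winnls.h) and their CompareOptions
# counterparts, using the values quoted earlier in the post.
NORM_IGNOREKANATYPE = 0x00010000
NORM_IGNOREWIDTH = 0x00020000
SORT_STRINGSORT = 0x00001000

IGNORE_KANA_TYPE = 0x00000008   # CompareOptions.IgnoreKanaType
IGNORE_WIDTH = 0x00000010       # CompareOptions.IgnoreWidth
STRING_SORT = 0x20000000        # CompareOptions.StringSort


def to_compare_options(style):
    """Translate a ComparisonStyle value into a CompareOptions value.

    Hypothetical helper for illustration only.
    """
    # IgnoreCase, IgnoreNonSpace and IgnoreSymbols share the same low bits.
    options = style & 0x00000007
    if style & NORM_IGNOREKANATYPE:
        options |= IGNORE_KANA_TYPE
    if style & NORM_IGNOREWIDTH:
        options |= IGNORE_WIDTH
    if style & SORT_STRINGSORT:
        options |= STRING_SORT
    return options


# Albanian_CI_AI (0x00030003) maps to IgnoreCase | IgnoreNonSpace |
# IgnoreKanaType | IgnoreWidth.
print(hex(to_compare_options(0x00030003)))
```

The resulting integer is what you would cast to CompareOptions before handing it to CompareInfo.Compare.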

    But none of this will work with String.Compare!

    If you want behavior between SQL Server and the CLR to have any chance of parity, you must use the methods off of CompareInfo.

    Parity will seldom be 100%, because all of the SQL Server 2000 collations are based on the tables that shipped with Windows 2000 (just after Beta 1, before Beta 2), and Whidbey actually uses the same tables as Windows Server 2003. You will get much closer with the *_90 collations that were added in Yukon, though.

    A lot of dancing around and converting and munging to get parity -- too much, I think. There is definitely room for people to improve the experience here in future releases!

     

    This post brought to you by "Ǿ" (U+01fe, a.k.a. LATIN CAPITAL LETTER O WITH STROKE AND ACUTE)

  • Sorting it all Out

    Font substitution and linking #3

    Earlier posts about font linking, substitution and fallback were:

    Font substitution and linking #1 (About font substitution)
    Font substitution and linking #2 (About industrial strength font linking, with MLang)

    Anyway, at the end of last month, M.W. Grossman asked me the following question:

    Unicode is so much fun. My company has paid me to read through most of your writing today (thanks for the Steve Taylor tip) but haven't found anything that gets into what I'm looking for, namely the method used to determine which font is selected and how to influence this.

    My problem lies in a mixed CJK system. The won, yuan and yen characters are being displayed differently depending on the client OS (Win2K, CHT/CHS/KOR/JPN). In JPN, everything displays well but both CHT and CHS mangle all currencies but the yuan and dollar. All machines have a number of CJK fonts in addition to MS Mincho and clearly there's some font substitution going on, but how and where? I can't be the only one wondering about this but I can't find anything on the subject beyond references to "MS Shell Dlg 2".

    An interesting problem, one that is rooted in the issues of Han unification that I have discussed previously.

    What it comes down to is that there are many characters which can have four different possible looks:

    • Japanese will default to using MS UI Gothic (fallback to PMingLiU, then SimSun, then Gulim)
    • Korean will default to using Gulim (fallback to PMingLiU, then MS UI Gothic, then SimSun)
    • Simplified Chinese will default to using SimSun (fallback to PMingLiU, then MS UI Gothic, then Batang)
    • Traditional Chinese will default to using PMingLiU (fallback to SimSun, then MS Mincho, then Batang)

    Now there are nuances, like when you install HKSCS stuff for Hong Kong or GB18030 stuff for China, and other random things that I'll talk about some other time. But the above four are the main font linking rules.

    The change depends on the setting of the NLS default system locale (meaning of course that this is a setting that is not just for non-Unicode programs!). It affects the behavior of several things, according to the rules spelled out above for each group -- linking to the fonts in that given order:

    • The preferred font for each group (for EA system locales)
    • Lucida Sans Unicode
    • Microsoft Sans Serif (not MS Sans Serif!)
    • Tahoma

    This is the most common Windows XP/Server 2003 list; it changes a bit between versions depending on default UI fonts -- mainly by adding new entries but leaving the old ones intact. Your mileage may vary, especially as some third party font installs will often try to modify these settings, often with success but occasionally with messed up registry settings....

    Now I have gone on for quite a while here, without explaining what this linking does. The essential task performed is to give the operating system a list of fonts to look in and an order to look in them when the requested character cannot be found in the original font. Given that both Han unification and typographic traditions will dictate a specific look, feel, and glyph design to each font, the list is chosen so that a correct choice for the default system locale is most likely to give an intuitive look to the text that is shown on Windows.
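To make the "list of fonts and an order to look in them" idea concrete, here is a toy Python model of the lookup. It is only a sketch: the chains are the four listed above, but the coverage data is invented for the example and the function pick_font is hypothetical, not anything GDI exposes.

```python
# Link chains per East Asian system locale group, as listed above.
LINK_CHAINS = {
    "Japanese": ["MS UI Gothic", "PMingLiU", "SimSun", "Gulim"],
    "Korean": ["Gulim", "PMingLiU", "MS UI Gothic", "SimSun"],
    "Simplified Chinese": ["SimSun", "PMingLiU", "MS UI Gothic", "Batang"],
    "Traditional Chinese": ["PMingLiU", "SimSun", "MS Mincho", "Batang"],
}


def pick_font(ch, base_font, locale_group, coverage):
    """Return the first font that has a glyph for ch, or None.

    coverage maps a font name to the set of characters it can draw; in this
    sketch it is made-up data standing in for the real font files.
    """
    for font in [base_font] + LINK_CHAINS[locale_group]:
        if ch in coverage.get(font, ()):
            return font
    return None


# Invented coverage: Tahoma lacks the ideograph, the CJK fonts have it.
coverage = {
    "Tahoma": set("abc"),
    "MS UI Gothic": {"\u4e00"},
    "SimSun": {"\u4e00"},
}
print(pick_font("\u4e00", "Tahoma", "Japanese", coverage))
print(pick_font("\u4e00", "Tahoma", "Simplified Chinese", coverage))
```

The same character ends up in a different font depending on the locale group, which is the effect described in the question that started this post.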

    The actual mapping is found in the registry; it is seen in the values under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink. A change requires a new windowstation, which for all shipping versions of Windows means either a reboot (which the default system locale requires anyway) or a new Terminal Services session.

    Of course, the user can always override these things if he wants to by explicitly choosing the 'right' default system locale; the developer can override these things if she needs to p.r.n. by choosing the fonts to use, explicitly. But in both cases, the user and the developer will often just use the default.

    But if you are a developer who is wanting to influence the selection here, PLEASE consider explicitly choosing a font rather than changing a system wide setting....

     

    This post brought to you by "一" (U+4e00, a.k.a. the first CJK Unified Ideograph)
    One of the few ideographs used by all languages that use Han/Kanji/Hanja but that does not tend to vary too much in style....

  • Sorting it all Out

    Is it Macau or is it Macao?

    People get confused sometimes about the name of this place, but whether you call it:

    • Haojing'ao (壕鏡澳 "Trench Inlet")
    • Liandao (蓮島 "Lotus Island")
    • Xiangshan'ao (香山澳 "Fragrant mountain Island")

    According to some sites (like the U.S. Library of Congress), its official name used to be Macao when it was a Portuguese territory but after the reversion to China the official name became Macau (China : Special Administrative Region) and in conversations about it in its former territorial status the name Macau is now preferred in most contexts.

    But other sources (such as the Macao SAR Government Portal) seem to prefer Macao and Macao Special Administrative Region.

    At the risk of disrespecting Congress I am going to side with the actual Macao government site, and not just because they have updated more recently. :-)

    Last year, while looking into what we had for East Asian collation support in Windows, I noticed a curious fact. The default system code page is 950, which is the Traditional Chinese code page, yet the collation choices were:

    • 0x00001404 uses the PRC Pinyin-based pronunciation sorting tables
      As if it were using MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_MACAU), SORT_CHINESE_PRCP)
    • 0x00021404 uses the PRC-based stroke order
      As if it were using MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_MACAU), SORT_CHINESE_PRC)
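For anyone who wants to check the arithmetic behind those two LCIDs, here is a Python sketch of what the MAKELANGID and MAKELCID macros compute. The constant values are the ones I remember from winnt.h; the Python function names are simply stand-ins for the macros.

```python
# Values from winnt.h (to the best of my knowledge).
LANG_CHINESE = 0x04
SUBLANG_CHINESE_MACAU = 0x05
SORT_CHINESE_PRCP = 0x0   # PRC Chinese phonetic (Pinyin) order
SORT_CHINESE_PRC = 0x2    # PRC Chinese stroke count order


def make_lang_id(primary, sublang):
    """Equivalent of MAKELANGID: sublanguage in the upper bits of a WORD."""
    return (sublang << 10) | primary


def make_lcid(lang_id, sort_id):
    """Equivalent of MAKELCID: sort ID above the 16-bit language ID."""
    return (sort_id << 16) | lang_id


macau = make_lang_id(LANG_CHINESE, SUBLANG_CHINESE_MACAU)
print(f"0x{make_lcid(macau, SORT_CHINESE_PRCP):08x}")
print(f"0x{make_lcid(macau, SORT_CHINESE_PRC):08x}")
```

Going the other way, lcid >> 16 recovers the sort ID and lcid & 0xffff recovers the language ID.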

    I figured this was a bug but it seemed odd that no one ever reported it, if it were. Hmmm, strange....

    So, although I was confused I decided to see if I could find out what was going on. I talked to several native speakers and people now living in the US who were either from Macao or who had visited there for an extended period, and learned that even though the Traditional forms are still used more often than the Simplified ones, in recent years the Simplified forms have seen more usage.

    More importantly, however, people in Macao often do use a pronunciation sort and it is Pinyin-esque (English is a productive language, so I can make that word up!). Many people in Macao learn the Bopomofo pronunciations but they do not use them in their daily lives. Thus the PRC Simplified Pinyin may not be perfect but it will be closer to the order that a native speaker would expect than the Bopomofo order, even if not all of the ideographs are on the list.

    I was unable to find any data on a Macao-specific Pinyin ordering, but I do know that the PRC government has expressed interest in getting Pinyin pronunciations for many more ideographs, including Traditional forms. Perhaps one of the motivations behind such a move is indeed to help support people in Hong Kong and Macao! Certainly it is the case that a Pinyin-based IME is more useful to most native speakers in Macao than a Bopomofo one would be.

    Maybe something Cantonese would be most useful, but that is a story for another day....

     

    This post brought to you by "序" (U+5e8f, a.k.a. an ideograph meaning sequence or series)

  • Sorting it all Out

    A quick look at Whidbey's TextRenderer

    Over the weekend, TheMuuj mentioned in a comment:

    As far as I know, there are new classes in Whidbey for drawing text with GDI (as a result of GDI+'s questionable screen rendering in some cases). Are these based on DrawText?

    This is not exactly the reason. There are two basic problems that come into play with GDI+:

    • There are some performance issues caused by the somewhat stateless nature of GDI+, where device contexts would be set and then the original restored after each call.
    • The shaping engines for international text have been updated many times for Windows/Uniscribe and for Avalon, but have not been updated for GDI+, which causes international rendering support for new languages to not have the same level of quality.

    The object is the System.Windows.Forms.TextRenderer class, which has two methods that can be used to render text using GDI/Uniscribe rather than GDI+:

    • DrawText - Draws the specified text at the specified location using the specified device context, font, and color.
    • MeasureText - Measures the specified text when drawn with the specified font.

    This class and these methods (which have several different overloads) allow WinForms to support new languages as the OS support is added. For example, the ELK support in Windows XP SP2 added font and rendering support to the operating system for Bengali and Malayalam, but versions 1.0 and 1.1 of the .NET Framework would not render these scripts properly, even when the right font was being used. However, version 2.0 (Whidbey) will be able to properly support these scripts whenever the OS can support them....

    I have not personally experienced the performance issues but have been told by people who have that the support can also be very useful. I am more of a "language support" guy myself, though. :-)

     

    This post brought to you by "আ" (U+0986, BENGALI LETTER AA)
    (A letter that was happy to see proper rendering support of Bengali and Assamese conjuncts in managed code using XP SP2 and Whidbey!)

  • Sorting it all Out

    Once more into the palindrome

    Back in April, in a series of posts, I talked about many cool features in Whidbey, from comparison (been there all along) to text elements (been there all along, better in Whidbey) to normalization (new in Whidbey), all in the context of one of the most important features that any computer program can possibly support -- palindrome detection:

    Looking for that internationally savvy palindrome checker....
    Where did the new StringInfo stuff come from?

    Normalization vs. .NET text elements

    It was all very cool stuff, and a lot of fun to do.

    And then I blew it.

    In that last post, I said:

    Now, Maurits went on in his Channel 9 posting to discuss sort elements, or cases when two or more characters are to be given a single sort weight (kind of the opposite of an expansion like these ligatures). However, in my opinion these are not really suitable for a palindrome detection algorithm, as I don't think they are usually treated as letters except in the case where they are also treated as unique text elements (the case covered by the StringInfo code).

    Any native speakers of languages with such constructs as the Spanish ch and the Hungarian dzs who think they should or should not be treated as a unit in trying to detect palindromosity should feel free to leave a comment to that effect. Also, if any of my colleagues in the GIFT group agree or disagree here (and they are reading this!) they are invited to do the same (or stop me in the hall and accost me with this information!).

    Well, the good news is that no one accosted me.

    The bad news is that some people posted that I was wrong, and colleagues informed me that I was wrong. Those 'sort elements' (for lack of a better term) would be taken as a unit, just as the text elements are.

    Damn. I hate being wrong.

    Especially when I pointed out that there was no good way to pick up these 'sort element' compressions, where two or more UTF-16 code points are given the weight of a single element. We have the data, sure. But we do not currently expose it in a convenient way to answer this particular question.

    And to top it all off, there is no good way to reverse a sort key, though the sort key obviously contains the information about when these compressions (known in the Unicode Collation Algorithm as 'contractions') occur.

    I stewed on this for a bit today, after finally deciding to tackle this post. There had to be some way short of a new Win32 or .NET function to make this work.

    So I closed my eyes and told myself to pretend I was interviewing for a job at Microsoft. :-)

    Then something suddenly occurred to me!

    We do not care about reversing or undoing the sort key; we only care about comparing the pieces of each sort key! And we can do that, we have the bytes, so why not just compare them?

    Let's take a hideously contrived string to test for palindromodity: abddzsCÅbÅCdzsdzsba

    The code points are:

    0061 0062 0064 0064 007a 0073 0043 00c5 0062 0041 030a 0043 0064 007a 0073 0064 007a 0073 0062 0061

    Note especially the Hungarian dzs double compression ('ddzs' == 'dzsdzs', so 0064 0064 007a 0073 weighs the same as 0064 007a 0073 0064 007a 0073) and the A Ring in both precomposed (00c5) and composite (0041 030a) forms.

    We grab the sort key:

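    using System.Globalization;

    // \u00c5 is the precomposed A Ring; A\u030a is the composite form
    string st = "abddzsC\u00c5bA\u030aCdzsdzsba";
    CompareInfo ci = CompareInfo.GetCompareInfo("hu-HU");
    byte[] rgbyt = ci.GetSortKey(st, CompareOptions.None).KeyData;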

    and then take a look at what those bytes look like:

    0e 02 0e 09 0e 1e 0e 1e 0e 0a 0e 02 0e 09 0e 02 0e 0a 0e 1e 0e 1e 0e 09 0e 02 01 02 02 02 02 02 1a 02 1a 01 02 02 02 02 12 12 02 12 12 01 01 00

    Let's split it into the various pieces: 

    UW: 0e 02 0e 09 0e 1e 0e 1e 0e 0a 0e 02 0e 09 0e 02 0e 0a 0e 1e 0e 1e 0e 09 0e 02 01
    DW: 02 02 02 02 02 1a 02 1a 01
    CW: 02 02 02 02 12 12 02 12 12 01
    SW: 01 00

    Now the rules are very simple for the detection:

    1. Strip out the 00 and 01 byte values, those are just sentinels;
    2. The UW values will always be two bytes per sort element, so the UW piece will always contain an even number of bytes;
    3. The DW/CW/SW weights can be prefixed by 02 byte values. You can ignore them, and assume that identical weights exist in suffix form if the weight ends early (02 is the minimal weight and is just used to pad the piece of the sort key when we are looking further into the string).

    So, looking at our sort key:

    UW: 0e 02 0e 09 0e 1e 0e 1e 0e 0a 0e 02 0e 09 0e 02 0e 0a 0e 1e 0e 1e 0e 09 0e 02
    DW: 02 02 02 02 02 1a 02 1a 02 02 02 02 02
    CW: 02 02 02 02 12 12 02 12 12 02 02 02 02
    SW:

    Yes folks, we have a palindrome! And the method to do so is using code that has been around since NT 3.1.
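For the curious, here is one way the three rules above might be applied in code. This is a Python sketch rather than anything official: it takes the raw sort key bytes (however you obtained them), assumes 01 only ever appears as a separator and 00 only as the terminator, assumes one DW byte and one CW byte per sort element, and ignores the SW piece (empty in this example). The names split_pieces and is_palindrome_key are invented for the sketch.

```python
def split_pieces(sort_key):
    """Split a sort key into its UW, DW and CW pieces on the 01 sentinels."""
    body = sort_key.rstrip(b"\x00")          # drop the 00 terminator
    pieces = body.split(b"\x01")
    pieces += [b""] * (3 - len(pieces))      # tolerate very short keys
    return pieces[0], pieces[1], pieces[2]


def is_palindrome_key(sort_key):
    """Check the UW, DW and CW pieces for symmetry, per the rules above."""
    uw, dw, cw = split_pieces(sort_key)
    if len(uw) % 2:
        raise ValueError("UW piece should have an even number of bytes")
    # Rule 2: two UW bytes per sort element.
    elements = [uw[i:i + 2] for i in range(0, len(uw), 2)]
    count = len(elements)

    def padded(weights):
        # Rule 3: a piece that ends early is implicitly followed by 02s.
        return weights + b"\x02" * (count - len(weights))

    return (elements == elements[::-1]
            and padded(dw) == padded(dw)[::-1]
            and padded(cw) == padded(cw)[::-1])


# The sort key bytes shown above for the contrived Hungarian string.
key = bytes.fromhex(
    "0e 02 0e 09 0e 1e 0e 1e 0e 0a 0e 02 0e 09 0e 02 0e 0a 0e 1e 0e 1e "
    "0e 09 0e 02 01 02 02 02 02 02 1a 02 1a 01 02 02 02 02 12 12 02 12 12 "
    "01 01 00"
)
print(is_palindrome_key(key))
```

Whether you want the DW and CW pieces to count at all (that is, whether case and diacritics should matter to a palindrome) is a design choice; dropping those two checks gives the looser test.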

    Now if you really want to, you can normalize the string to take care of some of the issues brought up in the earlier posts.

    The code to sort through the byte array and verify the results is left as an exercise for the reader. If someone had figured this out in an interview, they would have impressed the hell out of me; I am temporarily way too impressed with myself for thinking of it.... :-)

    I will do some more "sort key cracking" another day, if there is interest. I have only ever known of one team inside or outside of Microsoft that has ever had a scenario that required cracking sort keys, but perhaps I have just not talked to everyone yet. I have run across many people who have found it to be an interesting area, even if they could not find an immediate use for the knowledge.

     

    This post brought to you by "﴿" (U+fd3f, a.k.a. ORNATE RIGHT PARENTHESIS)

  • Sorting it all Out

    Line breaking, according to DrawText

    The Win32 DrawText function, and its more full-featured cousin DrawTextEx, have been around for a long time. They both have a simple stated set of purposes:

    ...draws formatted text in the specified rectangle. It formats the text according to the specified method (expanding tabs, justifying characters, breaking lines, and so forth).

    Now I have talked about word breaking in the past, and obviously they are related (where else would you break lines but on valid word breaks?). But the DrawText/DrawTextEx functions are from an earlier time -- a time before complex scripts, or good integration of Unicode character properties, or the real existence of mature Unicode character properties.

    But let's take a look at its offerings, via the various flags you can specify that affect the word break behavior:

    DT_WORDBREAK - Breaks words. Lines are automatically broken between words if a word extends past the edge of the rectangle specified by the lprc parameter. A carriage return-line feed sequence also breaks the line.

    DT_NOFULLWIDTHCHARBREAK - Prevents a line break at a DBCS (double-wide character string), so that the line-breaking rule is equivalent to SBCS strings. For example, this can be used in Korean windows, for more readability of icon labels. This value has no effect unless DT_WORDBREAK is specified.

    Huh?

    What the first one is trying to say is that by default, the text will just keep going and then when the border is reached it will start the new line and possibly break right in the middle. But if you pass the DT_WORDBREAK flag, then you are saying to make the breaks at the boundaries of words in the text. Which is pretty much what people expect (and what controls like EDIT already do themselves).

    The second flag was added after many user complaints about the Windows 95/NT 4.0 behavior that treats the position after each CJK ideograph as a potential word break opportunity. This new flag says to treat CJK the same way everything else is treated -- look for the spaces as the word break opportunities.
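Here is a deliberately simplified Python model of the difference between the two behaviors. It is not GDI's actual algorithm: it treats only U+0020 as a space, uses the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as a stand-in for "CJK ideograph", and the function name break_opportunities is made up.

```python
def is_cjk_ideograph(ch):
    # Simplification: only the basic CJK Unified Ideographs block.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def break_opportunities(text, no_fullwidth_char_break=False):
    """Return the indexes at which a new line would be allowed to start.

    no_fullwidth_char_break models DT_NOFULLWIDTHCHARBREAK: when True, only
    spaces provide break opportunities.
    """
    positions = []
    for i, ch in enumerate(text[:-1]):
        if ch == " ":
            positions.append(i + 1)
        elif is_cjk_ideograph(ch) and not no_fullwidth_char_break:
            positions.append(i + 1)
    return positions


sample = "ab \u6f22\u5b57cd"
print(break_opportunities(sample))
print(break_opportunities(sample, no_fullwidth_char_break=True))
```

With the flag off, every ideograph contributes a break opportunity; with it on, only the single space does.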

    Of course you may expect more than just U+0020 to be handled when I say space. But most of the ones you would expect on such a list would not be there.

    Interestingly, all of the following are also treated as word breaking opportunities in East Asian text:

    • ! (U+0021, a.k.a. EXCLAMATION MARK)
    • · (U+00b7, a.k.a. MIDDLE DOT)
    • ˇ (U+02c7, a.k.a. CARON; note this is the modifier letter version)
    • ˉ (U+02c9, a.k.a. MODIFIER LETTER MACRON)
    • – (U+2013, a.k.a. EN DASH)
    • ″ (U+2033, a.k.a. DOUBLE PRIME)
    • ℃ (U+2103, a.k.a. DEGREE CELSIUS)
    • ∶ (U+2236, a.k.a. RATIO)
    • ╴ (U+2574, a.k.a. BOX DRAWINGS LIGHT LEFT)
    • various word breaking characters in small variants, Fullwidth ASCII, Halfwidth Katakana, and Fullwidth symbols

    Obviously the functionality in DrawText and DrawTextEx is not quite up to Uniscribe standards when it comes to complex scripts. But you know how I feel about NLS API behavior changes? Well, this is core GDI behavior, and both they and MS Typography have to worry about even the most minute changes in behavior of their functions once something has shipped. Because you never know who is relying on it. A small change in word break behavior could make the page count of a document double or worse, so even sensible changes can only be made via new flags (or in the case of complex scripts via new functions).

     

    This post brought to you by " " (U+0020, a.k.a. SPACE)

  • Sorting it all Out

    Consoling people about their troubles with the console.

    This is not one of those fun posts where I get to talk about exciting new features. Instead I am going to answer some questions about CMD, the console (kŏn'sōl'), and now I am going to try and console (kən-sōl') the people with questions, since they will probably not care for the answers. :-(

    Several hundred posts ago (back in the end of 2004), Per Bergland asked (in the Suggestion Box):

    This may have been asked (and answered) before, but I find it such a shame that cmd.exe can't execute a bat/cmd file in unicode (UTF16). Since Notepad doesn't do "OEM", I find myself using the DOS EDIT text editor to fix up national characters such as our Swedish å,ä and ö.

    Hey, cmd can *read* unicode and even *write* using the "/U" switch, so why can't it read & execute a file containing Unicode?

    You wouldn't happen to know this, would you?

    I can't answer this one with authority since I don't own CMD.EXE. In fact, I am not even sure who does these days. But I do know that it is not easy to get major feature work done in this area, in that codebase. The whole point of the Monad project (read more about it in posts here) is to get away from all of the backcompat issues that keep people from wanting to touch the code to make changes. The last time I checked it out, the plan was to support Unicode files, though.

    For the legacy case, I have been in the habit of using Word and choosing the code page to save a file to as plain text as a way to get the files in the right format, and I have tried to lobby the owners of Notepad to consider adding another "Save As..." option for the OEM code page, but I have not gotten much traction on that (or on my other request for that list, the UTF-8 without BOM choice). Though if I had to guess which was more likely to be seen in the future, I would guess that they would be quicker to add features to Notepad than to the console....

    Then, moving on into January, KJK::Hyperion asked (also in the Suggestion Box):

    Console windows support Unicode, but they necessarily have a number of limitations, having to support the OEM charset and being limited to monospace fonts (which, I've seen, rules out composed characters and some special spacing characters). How is this handled internally? especially, how is Japanese handled, with its mixture of half-width and full-width characters? and how are valid fonts chosen?

    See above for some answers. For wanting to have your own font choice, you can pick any monospace or essentially monospace font and then set one or both of the following registry values:

    KEY == HKEY_CURRENT_USER\Console, ValueName == FaceName, Value == <whatever font you like>

    KEY == HKEY_CURRENT_USER\Console, ValueName == FontFamily, Value == <0x50 for Decorative, 0x40 for Script, 0x30 for Modern, 0x20 for Swiss, or 0x10 for Roman; these are the hexadecimal FF_* font family values>.

    Now when I say essentially monospace above, the reason for that is that none of the CJK fonts are true monospaced fonts. They all (even the bitmap fonts) have the halfwidth characters taking up half as much space as the fullwidth ones, though.

    Most recently, Denis Bider asked (first of Larry Osterman, then of me directly):

    In our company, we observed the following apparent inconsistency in cmd.exe.

    If you execute cmd /?, you get this help text:

    /A - Causes the output of internal commands to a pipe or file to be ANSI

    /U - Causes the output of internal commands to a pipe or file to be Unicode

    But the fact is, the output of cmd /A is not actually ANSI. It is in the OEM code page. For example, if I try cmd /A echo csz > file.txt, and then try to open file.txt in Notepad (which uses ANSI), I get garbage.

    Lots of other command line utilities (like those in Cygwin) actually use ANSI. So this is a problem - characters get corrupted across pipe boundaries; files get interpreted in incompatible ways.

    From a user's perspective, it seems somewhat logical to expect that if the /A flag description says it will produce ANSI, it should produce ANSI; not OEM.

    What do you think? Is this intentional or is it a problem?

    Well, since it is behavior that has been around for several versions, I would hesitate to call it a bug. I will run it up the flagpole here, but I assume the "fix" will be to just fix up the text in that help. Which is really all that they could do, since changing the behavior of the flag would break who knows how many scripts (well, if we made the change, we would know -- from all of the people complaining about the behavior change!).
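To see the kind of garbage being described, here is a small Python sketch that writes text in an OEM code page and reads it back as ANSI. The choice of cp850 as the OEM code page and cp1252 as the ANSI one is an assumption (a typical Western European pairing), and the Swedish letters are borrowed from the earlier question; the helper name is made up.

```python
def misread_oem_as_ansi(text, oem="cp850", ansi="cp1252"):
    """Encode text the way an OEM-code-page writer would, then decode it
    the way an ANSI reader (such as Notepad) would.

    The default code pages are assumptions for a Western European system.
    """
    return text.encode(oem).decode(ansi)


original = "\u00e5\u00e4\u00f6"          # å, ä and ö
print(original.encode("cp850").hex(" "))  # the bytes the writer produced
print(misread_oem_as_ansi(original))      # what the ANSI reader shows
```

The bytes are perfectly valid in both code pages; they just mean different characters in each, which is why nothing fails loudly and you simply get the wrong letters.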

    In the meantime though, I can recommend chcp.com, a nice little utility that will either display the active OEM code page in the console (if run with no parameters), or allow you to change that code page. You can look at some documentation on it here. Note that when you run this utility, it reports back the code page as the "Active code page". Not a 100% solution, but as good as the console will really allow.

    Did I mention that you may want to take a look at Monad? :-)

     

    This post brought to you by "〷" (U+3037, a.k.a. IDEOGRAPHIC TELEGRAPH LINE FEED SEPARATOR SYMBOL)

  • Sorting it all Out

    LCMapString's *other* job

    To me, the NLS API function LCMapString has a full-time job, one that is crucial to the fundamental fabric of Windows -- sort key generation. I know it is crucial because if I accidentally mess up the tables then there are components that are unable to even let Windows finish booting up due to fear of corruption of information!

    Clearly the order is important, and the ability to create indexes that use this order is therefore equally important. It is no accident that the name of this blog is Sorting It All Out, you know. Keeping it all in some order so that no matter how complex it is, you can eventually work your way through it is (to me) a hideously important operation that is crucial to all of our work. And not just our international work -- if we mess up the simple ABCs then how can we ever hope to handle really complex operations?

    However, just as an artist often has to wait tables or bag groceries in order to pay her rent, LCMapString has been forced to take on some side work. :-)

    (Now I know that is a completely revisionist way of looking at things and I know it's not really how things are. But it's a more convenient way of looking at things for me, so like the Bohr model for the atom I am going to let the "not entirely accurate" model stand, since it is a useful way of looking at things!)

    Anyway, I thought I'd take a look at some of the other work this conversion function does....

    I hint at uses of a few of these conversions in my post A few of the gotchas of CompareString but now I am going to try to lay it all out, once and for all.

    As a side note, Julie Bennett once told me that she thought this function was kind of a hack, not because it wasn't useful but because it did too much, all in one place. It really wanted to be separate functions. Which is I guess a hazard of taking on too many part-time jobs -- no one knows what your actual occupation is!

    Here are the additional flags and what they do:

    LCMAP_BYTEREV -- a very useful little conversion, whether used by itself or with any of the other results -- it will reverse the bytes in each word of a string. As the Platform SDK topic indicates, for example, if you pass in 0x3450 0x4822 the result is 0x5034 0x2248. The conversion is the equivalent of lpDestStr[ich] = MAKEWORD( HIBYTE(lpSrcStr[ich]), LOBYTE(lpSrcStr[ich]) ) across the whole string. Good honest side work for our girl!
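
    For anyone who wants to see that byte swap outside of Windows, here is a minimal sketch in Python. It does not call LCMapString; the helper name byte_reverse is made up for illustration, and it simply applies the same high/low byte exchange to a list of 16-bit values.

```python
def byte_reverse(units):
    """Swap the high and low byte of each 16-bit value (illustrative helper)."""
    return [((u & 0xFF) << 8) | ((u >> 8) & 0xFF) for u in units]


# The Platform SDK example values: 0x3450 0x4822.
swapped = byte_reverse([0x3450, 0x4822])
print([hex(u) for u in swapped])
```

    Running the swap twice gives back the original values, which is why the same operation converts between the two UTF-16 byte orders in either direction.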

    LCMAP_FULLWIDTH -- Converts each character to a full width one every time it encounters a half width one (all other characters pass through unchanged). Thus ﾎ (U+ff8e a.k.a. HALFWIDTH KATAKANA LETTER HO) becomes ホ (U+30db a.k.a. KATAKANA LETTER HO). In the legacy Japanese code page, full width characters took up twice as much space (the half width characters were in the 'high ansi' range of the code page, greater than 0x7f but less than 0xff, whereas the full width ones were double-byte characters). As a convention the full width ones are twice as wide -- the same width as the ideographs -- and according to some people are considered to be less aesthetically pleasing. In Unicode obviously both sets take up two bytes, but the typographic tradition continues to this day, and therefore this is no longer really just a 'legacy code page' issue -- it is a real typographic difference.

    LCMAP_HALFWIDTH -- Converts each character to a half width one every time it encounters a full width one (all other characters pass through unchanged). Thus ワ (U+30ef a.k.a. KATAKANA LETTER WA) becomes ﾜ (U+ff9c a.k.a. HALFWIDTH KATAKANA LETTER WA). See the previous LCMAP_FULLWIDTH text for more information on the difference between them.
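
    The two width conversions can be approximated without Windows by leaning on the Unicode Character Database. The sketch below builds its tables from the <narrow> and <wide> compatibility decompositions exposed by Python's unicodedata module. That data source is an assumption on my part, not the table LCMapString uses, so the coverage may differ, and the function names to_fullwidth and to_halfwidth are hypothetical.

```python
import unicodedata

# Build one-to-one width tables from Unicode compatibility decompositions.
# Assumption: this approximates, but is not, the table Windows uses.
NARROW_TO_WIDE = {}
WIDE_TO_NARROW = {}
for cp in range(0x10000):
    parts = unicodedata.decomposition(chr(cp)).split()
    if len(parts) != 2:
        continue
    tag, other = parts[0], chr(int(parts[1], 16))
    if tag == "<narrow>":      # chr(cp) is the half width form of `other`
        NARROW_TO_WIDE[chr(cp)] = other
        WIDE_TO_NARROW[other] = chr(cp)
    elif tag == "<wide>":      # chr(cp) is the full width form of `other`
        NARROW_TO_WIDE[other] = chr(cp)
        WIDE_TO_NARROW[chr(cp)] = other


def to_fullwidth(text):
    """Map each half width character to its full width twin; pass others through."""
    return "".join(NARROW_TO_WIDE.get(ch, ch) for ch in text)


def to_halfwidth(text):
    """Map each full width character to its half width twin; pass others through."""
    return "".join(WIDE_TO_NARROW.get(ch, ch) for ch in text)


print(to_fullwidth("\uff8e") == "\u30db")   # HALFWIDTH KATAKANA HO -> KATAKANA HO
print(to_halfwidth("\u30ef") == "\uff9c")   # KATAKANA WA -> HALFWIDTH KATAKANA WA
```

    Because the tables are strictly one character to one character, this sketch leaves voiced kana that need a separate half width sound mark untouched.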

    LCMAP_HIRAGANA -- Converts each Katakana character to the equivalent Hiragana one (all other characters pass through unchanged). Thus ヅ (U+30c5 a.k.a. KATAKANA LETTER DU) becomes づ (U+3065 a.k.a. HIRAGANA LETTER DU). The more literal meaning of Hiragana is "smooth kana." The differences between Hiragana and Katakana are a bit beyond the scope of this blog post, but there is a fascinating Wikipedia article on it that covers the topic, and includes the poem Iroha-uta ("Song of colours"). This poem comes from the 10th century, and in a very cool way uses every hiragana once (and proves to me that Hiragana is more lyrically suited than English with its 'the quick brown fox jumps over the lazy dog' nonsense!):

    いろはにほへと    Iro ha nihohe to    Even if colours have sweet perfume
    ちりぬるを    chirinuru wo    eventually they fade away
    わかよたれそ    wakayo tare so    What in this world
    つねならむ    tsune naramu    is eternal?
    うゐのおくやま    uwi no okuyama    The deep mountains of vanity
    けふこえて    kefu koete    I cross them today
    あさきゆめみし    asaki yume mishi    renouncing the superficial dreams
    ゑひもせすね    wehi mo sesu ne    not giving in to their madness any more

    LCMAP_KATAKANA -- Converts each Hiragana character to the equivalent Katakana one (all other characters pass through unchanged). Thus ま (U+307e a.k.a. HIRAGANA LETTER MA) becomes マ (U+30de a.k.a. KATAKANA LETTER MA). The more literal meaning of Katakana is "partial kana." Again, the differences between Hiragana and Katakana are a bit beyond the scope of this blog post, but there is a fascinating Wikipedia article on this script too that covers the topic. The article does mention one important difference in usage between the two scripts:

    Katakana spelling differs slightly from hiragana. While hiragana spells long vowels with the addition of a second vowel kana, katakana uses a vowel extender mark. This mark is a short line following the direction of the text (horizontal in horizontal text, vertical in columns).

    Neither the Hiragana nor the Katakana conversions in Windows extend to cover this particular convention, though it is fascinating to contemplate doing so some day, in some kind of extension to the "linguistic casing" notion I'll talk about in a bit. Interesting feature idea, if it truly is the convention. :-)
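
    Both examples above (U+30c5 to U+3065, U+307e to U+30de) differ by exactly 0x60, which is how the two kana blocks line up in Unicode. Here is a rough sketch built on that offset. The ranges U+3041..U+3096 and U+30A1..U+30F6 are my assumption about which letters pair up; the post does not say which characters the Windows tables cover, and the function names are hypothetical.

```python
KANA_OFFSET = 0x60                      # KATAKANA LETTER x minus HIRAGANA LETTER x
HIRAGANA_RANGE = (0x3041, 0x3096)       # assumed range of paired letters
KATAKANA_RANGE = (0x30A1, 0x30F6)


def to_hiragana(text):
    """Map Katakana letters to Hiragana; everything else passes through."""
    lo, hi = KATAKANA_RANGE
    return "".join(chr(ord(ch) - KANA_OFFSET) if lo <= ord(ch) <= hi else ch
                   for ch in text)


def to_katakana(text):
    """Map Hiragana letters to Katakana; everything else passes through."""
    lo, hi = HIRAGANA_RANGE
    return "".join(chr(ord(ch) + KANA_OFFSET) if lo <= ord(ch) <= hi else ch
                   for ch in text)


print(to_hiragana("\u30c5") == "\u3065")   # KATAKANA DU -> HIRAGANA DU
print(to_katakana("\u307e") == "\u30de")   # HIRAGANA MA -> KATAKANA MA
```

    Like the Windows conversions described above, this leaves the vowel extender mark (U+30fc) alone rather than trying to respell long vowels.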

    LCMAP_UPPERCASE -- Maps lowercase characters to uppercase characters, passing through other characters unchanged. Thus ç (U+00e7, a.k.a. LATIN SMALL LETTER C WITH CEDILLA) becomes Ç (U+00c7 a.k.a. LATIN CAPITAL LETTER C WITH CEDILLA). It can be modified with the LCMAP_LINGUISTIC_CASING flag, which enables a whole bunch of new scenarios, discussed when I asked (then answered) the question What does "linguistic casing" mean. It also plays a fundamental role in the life of all but the "C" locale CRT casing operations and functions like CharUpper, such that although technically I do not own those functions, I basically own those functions (isn't emphasis a wonderful thing? <grin>). Note that none of these wrappers uses the LCMAP_LINGUISTIC_CASING flag, which means that unless they are calling LCMapStringA there is absolutely no effect whatsoever based on the locale, and all claims to the contrary in both PSDK and CRT documentation are in the long, slow process of being fixed. The last word that I have to say about uppercasing is Georgian.

    LCMAP_LOWERCASE -- Maps uppercase characters to lowercase characters, passing through other characters unchanged. Thus Ħ (U+0126, a.k.a. LATIN CAPITAL LETTER H WITH STROKE) becomes ħ (U+0127, a.k.a. LATIN SMALL LETTER H WITH STROKE). It can be modified with the LCMAP_LINGUISTIC_CASING flag, which enables a whole bunch of new scenarios, discussed when I asked (then answered) the question What does "linguistic casing" mean. It also plays a fundamental role in the life of all but the "C" locale CRT casing operations and functions like CharLower, such that although technically I do not own those functions, I basically own those functions (isn't emphasis a wonderful thing? <grin>). Note that none of these wrappers uses the LCMAP_LINGUISTIC_CASING flag, which means that unless they are calling LCMapStringA there is absolutely no effect whatsoever based on the locale, and all claims to the contrary in both PSDK and CRT documentation are in the long, slow process of being fixed. The last word I have to say about lowercasing is Sigma.
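
    To make the "one character in, one character out" nature of these two mappings concrete, here is a small sketch. It borrows Python's own Unicode case tables, which are an assumption and not the tables Windows ships, and it does not attempt to model LCMAP_LINGUISTIC_CASING at all. The names simple_upper and simple_lower are hypothetical.

```python
def simple_upper(text):
    """Uppercase character by character, keeping only one-to-one mappings."""
    out = []
    for ch in text:
        up = ch.upper()
        # Skip mappings that would change the length (Python maps some
        # characters to several); such characters pass through unchanged.
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def simple_lower(text):
    """Lowercase character by character, keeping only one-to-one mappings."""
    out = []
    for ch in text:
        low = ch.lower()
        out.append(low if len(low) == 1 else ch)
    return "".join(out)


print(simple_upper("\u00e7") == "\u00c7")   # c with cedilla
print(simple_lower("\u0126") == "\u0127")   # H with stroke
```

    The result always has the same length as the input, which is the property that lets this kind of mapping be done in place.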

    LCMAP_SIMPLIFIED_CHINESE -- Maps traditional Chinese characters to simplified Chinese, passing through other characters unchanged. Thus 樂 (U+6a02) becomes 乐 (U+4e50). The dictionary used for this mapping is small (only 2,620 ideographs) and has not been updated since the feature was added in NT 4.0 (it was originally added at the request of people in Office, who actually ended up going with their own more sophisticated dictionary solution in Word that does a better job with the sometimes complicated mapping). Now although casing, width, and Kana mappings can all be done in place, this is not allowed for traditional->simplified Chinese mappings, even though the same restrictions (always the same length, etc.) apply here -- if any NLS testers who are reading this want to put in a bug, I'll see what I can do about fixing that!

    LCMAP_TRADITIONAL_CHINESE -- Maps simplified Chinese characters to traditional Chinese, passing through other characters unchanged. Thus 侩 (U+4fa9) becomes 儈 (U+5108). The dictionary used for this mapping is even smaller (only 2,191 ideographs) since there are many times that several traditional Chinese ideographs will map to one simplified ideograph (thus these two flags are not 100% reversible versions of each other). The table has not been updated since the LCMAP_SIMPLIFIED_CHINESE one was. Same problems with in-place update apply here -- if any NLS testers who are reading this want to put in a bug, I'll resolve it as a duplicate of the other bug I was suggesting, above!
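
    The "not 100% reversible" point is easy to see with a toy table. The sketch below uses a hypothetical four-entry dictionary: the two pairs from this post, plus one well-known case (my own addition, not taken from the Windows table) where two traditional ideographs share a single simplified form. It is nowhere near the 2,620-entry table described above.

```python
# Hypothetical miniature traditional -> simplified table.
TRAD_TO_SIMP = {
    "\u6a02": "\u4e50",   # HAPPY
    "\u5108": "\u4fa9",   # BROKER
    "\u767c": "\u53d1",   # two traditional ideographs ...
    "\u9aee": "\u53d1",   # ... that share one simplified form
}

# The reverse table has to pick one answer per simplified ideograph,
# so it ends up smaller than the forward table.
SIMP_TO_TRAD = {}
for trad, simp in TRAD_TO_SIMP.items():
    SIMP_TO_TRAD.setdefault(simp, trad)


def to_simplified(text):
    """Map known traditional ideographs; everything else passes through."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def to_traditional(text):
    """Map known simplified ideographs; everything else passes through."""
    return "".join(SIMP_TO_TRAD.get(ch, ch) for ch in text)


print(len(TRAD_TO_SIMP), len(SIMP_TO_TRAD))
print(to_traditional(to_simplified("\u9aee")) == "\u9aee")
```

    The second print shows the round trip failing for the ideograph that lost out when the reverse table was built, which is the same reason the real reverse dictionary is the smaller of the two.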

    Ok, that is probably enough for today. Tip your server (she may have subroutines to support). Enjoy the veal!

     

    This post brought to you by "ﾎホワﾜヅづまマçÇĦħ乐樂侩儈" (U+ff8e U+30db U+30ef U+ff9c U+30c5 U+3065 U+307e U+30de U+00e7 U+00c7 U+0126 U+0127 U+4e50 U+6a02 U+4fa9 U+5108 a.k.a. HALFWIDTH KATAKANA LETTER HO, KATAKANA LETTER HO, KATAKANA LETTER WA, HALFWIDTH KATAKANA LETTER WA, KATAKANA LETTER DU, HIRAGANA LETTER DU, HIRAGANA LETTER MA, KATAKANA LETTER MA, LATIN SMALL LETTER C WITH CEDILLA, LATIN CAPITAL LETTER C WITH CEDILLA, LATIN CAPITAL LETTER H WITH STROKE, LATIN SMALL LETTER H WITH STROKE, HAPPY, HAPPY, BROKER, BROKER)
    (A group of characters that were happy to be asked to help showcase the technology behind LCMapString!)

  • Sorting it all Out

    The New String recommendations

    • 11 Comments

    Dave Fetterman reported yesterday on the Official Guidance: New Recommendations for Strings in .NET 2.0 (full paper here).

    Now this is a paper whose recommendations I think are incredibly important (some of them were I daresay inspired by things I have been saying here about invariant versus ordinal and using uppercasing!). And I think at the core of those recommendations is a principle that applies to all code that is written, managed and unmanaged, in any version of any product. So I don't want people to think "I'm not using Whidbey, so this does not apply to me."

    That core principle? Stated simply....

    Use appropriate comparison methods.

    It simply makes no sense to use the wrong method, ever.

    Now as I have mentioned before, I am a bigger fan of what I call the 'vertical method' (different flag values in a single method or function) as opposed to the 'horizontal method' (different methods or functions). I find it easier to explain, easier to document, and easier for a user to know what to call. I explain this in part for unmanaged code in my post Similar descriptions does not mean similar methodologies and some day soon I will follow up with a managed version of that post.

    But it is important no matter whether you are a 'vertical' person or a 'horizontal' one to choose appropriate methods to compare based on your actual scenario.

     

    This post brought to you by "๚" (U+0e5a, a.k.a. THAI CHARACTER ANGKHANKHU)

  • Sorting it all Out

    It's been almost four years, Sherry....

    • 17 Comments

    I was looking through email today trying to find a particular one, and ran across one that slayed me. This is a story I have never really told anyone in full before, for reasons that I cannot fully explain. It was not really a love story or anything like that. But it was something of a defining friendship for me, and maybe I was just holding on to the story to keep it whole, for me....

    I was reminded of the whole thing two nights ago at the "Women in Technology" Birds of a Feather that Julie Lerman gave. I even talked about my friend for a bit and I think my voice was steady.

    You see, four years ago, I was sending a nervous email to someone who had once been a good friend of mine from back when I lived in Connecticut.

    Dr. Sherry Apple, who years before had been a neurosurgical resident at Hartford Hospital, and I had an interesting relationship. Almost 20 years senior to me, she looked at this somewhat lost young man with some dreams that did not look like they were going anywhere, as she herself sat in a neurosurgical residency program headed by someone who did not feel women belonged in neurosurgery at all, especially not this mouthy blonde from the South. I guess we were kindred spirits in a way, both feeling like we were being held back from what we wanted, whether by random or by not-so-random circumstances in life. And I helped her study for her boards even way back then, while daring to dream of maybe even taking them myself some day. I still know my cranial nerves, and I still crack that Manter and Gatz from time to time, remembering those days.

    Those days when she helped me believe in myself again, at a time when I truly needed someone to help me do that.

    When I left Hartford, we lost touch for a few years. But when I needed to have a radiofrequency rhizotomy done for trigeminal neuralgia, I asked Charles E. Poletti, M.D. if he would do it, and Sherry asked me (almost shyly!) if I would be okay with her scrubbing in for the procedure. I told her I would not want it any other way....

    It is an interesting surgery, where they use a short acting anesthesia (Brevital) since the neurosurgeon has to wake you so you can answer questions about sensation as they work to deaden a small part of the trigeminal nerve without destroying too much of it. By the report of both the doctor and one of the nurses who was there, when I started acting up, Sherry holding my hand was all I needed to keep calm (I was a bit too doped up to remember it, but she admitted it reluctantly later when confronted with the story). I did well post-op and flew home, and we lost touch again for a while...

    Suddenly I came across a random article about that same mouthy blonde, who was neither radio announcer nor singer nor guitarist (though she had been each of those things, over the years!). She was a neurosurgeon in practice in West Virginia! She was the president of WINS (Women in Neurosurgery) and I felt a pride that I have seldom felt before or since, knowing that at least a few times that she wanted to give up I helped convince her to stick with it. And she made it!

    I sent her a tentative email, afraid she may not remember who I was -- I have known several neurosurgeons who tend to not remember the people who knew them when they were less than done with their training. Or maybe afraid that she would remember and be disappointed since I had so clearly taken a different path, miles away from medicine.

    But such fears were unfounded in this case -- she not only remembered me but had just been thinking about me in relation to something she was working on. And we were soon emailing back and forth frantically, as she asked if I would be willing to help her find or maybe even write something for her that would help her create Kaplan-Meier curves for a study she was wanting to work on (none of the things she had found seemed to fit exactly). I readily said yes, and after it became clear that she was correct -- none of the packages I saw either would do exactly what she was looking for -- I started working on something to do the job....

    The last email I have from her was dated May 24th, 2001. I had almost finished putting together an initial prototype of the software we had discussed and even had a reservation to visit and show it to her, for the middle of the summer. A ticket that as it turns out I was never to need.

    In July of 2001, Sherry was boating with her husband in Upstate New York when the boat caught a wave, ejected her, and then crashed into her. She was killed in this terrible accident, in what was not even her 50th year.

    When I heard what had happened, I actually cried. Not because it was so sad (though frankly, it was). I cried because she had accomplished her dream, she outlasted that department chief who wanted her out, she was the only female neurosurgeon in the state where she practiced and had made quite a name for herself with some of the amazing surgeries she had performed, and the tender care she gave to her patients. It felt unfair to me that she had so little time to live the dream that she had so clearly and finally accomplished. Hadn't she earned the time?

    It has been nearly four years since that day. And whatever else I learned from her, I am proud that she did not ever give up, and it is one of the reasons that I won't. Because I want her to be able to be proud of me, no matter where she is now.

    (for Sherry)

  • Sorting it all Out

    A few words about end-user-defined characters

    • 6 Comments

    EUDC stands for End User Defined Characters. The Platform SDK defines them simply:

    End user defined characters (EUDC) are customized characters that users install for viewing and printing documents. They enable users to form names and other words using characters that are not available in standard screen and printer fonts. These characters are available only in Asian-language versions of the system.

    I think I will quote a bit more from the Platform SDK; there is not much so I may as well use more of it....

    An end-user-defined character is always associated with a double-byte character set (DBCS) and a TrueType font. Applications identify the specified character by using the character's assigned DBCS character value, and the system uses this value to locate shape and style information in a corresponding TrueType font. The shape and style information specifies how the character is drawn on the screen or printed page.

    The DBCS character values that can be assigned depend on the specified character set. Each set has at least one range of reserved values for use as end-user-defined characters. The system or applications explicitly define these ranges by setting appropriate values under the EUDCCodeRange registry key. Each character set is identified by a unique code-page number.

    To create an end-user-defined character, the user chooses a character value that is within the specified range and adds the shape and style information to the TrueType font in the entry that corresponds to that character value. Users create the shape and style information using an EUDC editor or by purchasing end-user-defined font packages from font vendors. Any DBCS TrueType font can contain end-user-defined characters. The font is called a separate EUDC font if it contains only end-user-defined characters. The font is an integrated EUDC font if it contains standard characters as well as end-user-defined characters.

    Separate EUDC fonts are said to be either font-aware or font-unaware. A font-unaware font is designed to be a general purpose font that can be used with fonts of different font styles and of different implementations, such as GDI raster, WIFE, device, and TrueType fonts. A font-aware font is designed for use with a specific TrueType font.

    The system default EUDC font is a font-unaware font that the system automatically associates with all DBCS fonts except those TrueType fonts that have explicitly associated font-aware fonts. Applications set the system default EUDC font by setting the value of the SystemDefaultEUDCFont name under the EUDC registry key. Similarly, applications can associate font-aware fonts with corresponding TrueType fonts by specifying a font name and associated font file under the EUDC key. Separate EUDC fonts cannot be associated with integrated EUDC fonts.

    The system hides the system default EUDC and font-aware fonts. This means applications cannot enumerate or otherwise examine these fonts using GDI functions. Applications, such as EUDC editors and Control Panel, must use the registry entries to add, modify, and delete EUDC fonts.

    End-user-defined characters can also be used in Unicode-enabled applications. The reserved ranges for each character set are mapped to corresponding values in the Unicode private use area (values 0xE000 and higher).

    Of course not a lot is said about the actual usage here, beyond the fact that East Asian versions of Windows allow you to create these custom characters.

    But they are hugely important to many users in East Asia who need to define a custom look to an ideograph (or support a particular ideograph at all in some cases -- there are indeed characters that are not yet added to Unicode). The ability to add Han characters via the EUDC Editor is a pretty important feature for those who need it. I'll talk more about end-user-defined characters soon....
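
    Since the Platform SDK text above says the reserved ranges land in the Unicode private use area, one quick way to spot text that depends on such characters is to test for that area. This is only a sketch: it uses Python's unicodedata general category rather than any Windows API, and the helper names are made up for illustration.

```python
import unicodedata


def is_private_use(ch):
    """True if ch has the Unicode general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def find_private_use(text):
    """Return (index, code point) pairs for private use characters in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_private_use(ch)]


sample = "\u4e39\ue000"   # an encoded ideograph followed by U+E000
print(find_private_use(sample))
```

    A hit only says the code point is private use; what it looks like still depends entirely on which EUDC font is associated on that machine.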

     

    This post brought to you by "丹" (U+4e39, a CJK Unified Ideograph)

  • Sorting it all Out

    Every character has a story #11: U+???? (The Invisible Letter)

    • 7 Comments

    The story today is about a character that has not been encoded in Unicode as of yet. In fact, it was brought to the Unicode Technical Committee (UTC) once already and was turned down. But it had enough of a history behind it that I thought it might be worth having a story told....

    It all started with some great work from the Council for Information Technology (CINTEC), the Information and Communication Technology Agency of Sri Lanka (ICTA), and others on a document entitled SRI LANKA STANDARD SINHALA CHARACTER CODE FOR INFORMATION INTERCHANGE. The second version of the document that we saw was submitted to the UTC at the 99th UTC meeting, held in Toronto on June 15-18, 2004. The document was very well put together.

    If I recall, the only real point of contention that came up was a description in the document of how to handle a vowel sign (which is a combining character) without a preceding consonant (to act as a base):

    A vowel sign without an associated consonant may be displayed by preceding it with a zero-width non-joiner (zwnj) character. e.g. ා = 200C 0DCF (zwnj + ා).

    Subsequently there were several examples given doing this. There were several people who did not like this idea, as it really did not fit with the conventional usage of U+200c (ZERO WIDTH NON-JOINER).

    This is obviously not a new requirement; for as long as there have been combining characters, there has been a need to describe how to display them when no base character was present. There have always been two Unicode code points that have been recommended for this purpose:

    • U+0020 (SPACE)
    • U+00a0 (NO-BREAK SPACE)

    However, for some time the SPACE has been a problematic choice here, due to the way standards such as HTML allow the removal of spaces preceding or following text, and it was a huge burden on formatters and parsers to have to handle this scenario and recognize that a SPACE character being used as a base character was not extraneous....

    So the feedback was set to be given to the Sri Lankan NB to commend them for the excellent proposal with one suggested change -- to use U+00a0 rather than U+200c to work as the base.
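
    As a concrete picture of that suggestion, here is a tiny sketch that builds the "isolated vowel sign" sequence with U+00a0 as the base. The helper name isolated_mark and the check on the Unicode general category are my own illustration and not anything taken from the Sri Lankan document or from the UTC feedback.

```python
import unicodedata

NBSP = "\u00a0"   # NO-BREAK SPACE, the suggested base character


def isolated_mark(mark, base=NBSP):
    """Return base + mark so a combining mark can be shown on its own."""
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining mark, got U+%04X" % ord(mark))
    return base + mark


# The vowel sign from the quoted example, U+0DCF, on a NO-BREAK SPACE base.
sequence = isolated_mark("\u0dcf")
print(" ".join("%04X" % ord(ch) for ch in sequence))
```

    Unlike a plain SPACE, the NO-BREAK SPACE is much less likely to be collapsed or trimmed by the kind of whitespace handling described above.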

    The thorny issue of what to do with U+0020 was talked about for a bit -- clearly some text would have to be drafted explaining that while in some situations it represents a reasonable choice, in others it was a really problematic one due to the spaces being removed, and then we moved on to other issues.

    And then, on the way to the WG2 meeting, Michael Everson, after consulting with Peter Constable, Rick McGowan, and Ken Whistler, put together a document.

    The name? Proposal to add INVISIBLE LETTER to the UCS.

    The idea was a dedicated code point to act as an invisible base character for precisely this situation. And there was even some art, provided by Mr. Everson:

    What is that smudge in the middle of the invisible whatever? Hmmm, let's look at 200%.

    Hmmmm. Still can't make it out. But it does look like more than a smudge. Let's take a look at 400%.

    Clearly text saying something. My eyesight is 20/20 OS and 20/25 OD, but maybe I am just tired. Let's blow it up to 800% and see what this says.

    Geez Mr. Everson, I guess you really are a geek! I'm not sure I would have immortalized it in a proposal (or if I did I probably would not say anything until it was too late).

    Anyway, it went to WG2 as an FYI (I was not there, but I heard that Mr. Everson had hinted that the text said something and the chair (Mr. Mike Ksar) did what I did and blew up the image on the overhead. He was not amused at the time. 2 x :-)

    At the next UTC meeting (#100) in Redmond, WA on August 10-13, the proposal was reviewed and the decision was made to make it a Public Review Issue (PRI), #41.

    Then at the next UTC meeting (#101) in Cupertino, CA on November 15-18, a motion to accept the encoding of INVISIBLE LETTER failed (3 for, 6 against, and 1.5 abstained).

    And then, the next motion, to add the INVISIBLE LETTER to the list of rejected characters, also failed (5 for, 0.5 against, 5 abstained).

    I won't comment on that set of votes other than to note that most decisions these days in the UTC are actually passed by consensus, not via a motion and a vote (for example at that meeting there were 37 consensus decisions and 7 motions). This was a pretty amazingly contentious issue! Maybe we were all blowing off some steam? :-)

    Personally, I would have rather it had been a small Invisible Jet, instead. Ms. Bennett has long pined for a Wonder Woman analogue for ✈ (U+2708, a.k.a. AIRPLANE). What better location for an invisible jet than hidden in an invisible letter?

    Or we could just go back to the actual work....

    So, worthy of a story, right? :-)

     

    This post brought to you by "✈" (U+2708, a.k.a. AIRPLANE)

  • Sorting it all Out

    More on locales in SQL Server

    • 1 Comments

    Fascinating post about Locales in SQL Server, one that went into much more detail than I did at TechEd last week. It is from the new SQLCLR Integration blog and is definitely worth a read or three...

    They gave a really good description of not only the problems in the old Microsoft C Runtime (CRT) locale model where anyone in the process could modify the locale and change the functionality of a broad range of settings, but of the very cool CRT solution in Whidbey -- functions where you can pass the locale you had set. And they top it off by saying that they pretty much try to keep the locale set to be the "C" locale (I have posted about that beast and its effect on the meaning of case insensitivity previously).

    The interesting idea (in my opinion) is that they do not try to keep the CRT locale in sync with the SQL collation setting. Which in a way makes sense; after all, which would they choose to be similar to? The server's? Kind of useless and weird when the database's did not match. Better not to jump into a fight with a model when one cannot win it. :-)

    It also talks about the CLR and its notion of locales (CurrentCulture and CurrentUICulture) and how they have a slightly different solution there. SQL Server leaves the CurrentCulture alone, so it is subject to similar issues as the old CRT locale, with one important difference and one very eerie similarity.

    The difference is that developers who change it only affect the thread they are in, which is obviously not the same as that CRT-wide setting. The similarity is that they leave the CurrentCulture and CurrentUICulture alone, subject to the server's settings. Since these settings also may or may not match the SQL Server settings, one could run into all kinds of strange incompatibilities, even beyond the ones I talk about in my String.Compare and SQLCLR post (which has somehow made it quickly to the top of a lot of different web searches, and not because of its use of the word sissy; people really seem to care about SQLCLR and keeping the two sides of the partnership as compatible as possible!).

    I probably would have made the push to try to make the SQL Server and the CLR locales be the same, especially given the issues with wanting similar comparisons, but I can understand that it does not affect everyone and since there is a 'tax' one has to pay to have them in sync in every single worker thread used by SQL Server, it really is unreasonable to make everyone pay that tax. Plus the same issues of which collation would be best to use come up. So such a push would fail, with good reasons.

    But in any individual usage of the CLR within SQL Server, there is a method to help you keep in sync with SQL Server's settings (whichever ones you like), which I talked about in Orlando and will talk about in Amsterdam: a GetCultureInfo override. It is a great addition to the already cool feature Mike discusses. More on this another day....

    Plus there is that String.Compare post that talks about how, for serious CLR work, you probably want to be using CompareInfo.Compare anyway, for maximum SQL/CLR compatibility of collation results.

    But this post by the SQLCLR folks raises some additional issues, and underscores the need to keep the differences between all of these different technology models in mind. Very cool!

  • Sorting it all Out

    Anyone having trouble with MSKLC on an x64 machine?

    • 9 Comments

    The other day, Andreas Henriksson asked me

    I had no idea where to post bug-reports, and you seem to be the author that's why I'm contacting you in person.

    I recently bought a new computer with Windows XP x64-edition, since then I've been trying to get my custom keyboard layout working on it. The layout I've previously generated on "regular" XP didn't work (they where claimed to be invalid when switched to), and neighter did installing MSKLC on x64 ("Missing .NET Framework", which if I'm not completely wrong comes preinstalled on x64 as well as 2003 and all newer Windows-versions.).

    It would be great if you could comment on it on your blog so I'd know for sure, or even better if the bug could be fixed. :)

    When I first got this email, I'll admit I panicked a bit. It ought to work, after all. So I went over to talk to the main tester for MSKLC -- the one talked about when I mentioned that international test is an art (and why there are few fine artists). 'Kanya,' I asked, 'could this really be happening?' She had never seen a problem like this before, but she gave it a try and had no problems, installing or building the layouts.

    Which is not to say there is not a limitation here -- the tool only creates keyboard layouts that sit in an i386 directory for a reason -- if it built IA64 or AMD64 or x64, then it would have directories for the other platform, too. That is definitely something that will have to wait for the next version, sorry!

  • Sorting it all Out

    Knock knock! Who's there? Kana! Kana Who?

    • 11 Comments

    You Kana wonder how we order Japanese strings? :-)

    Some time yesterday, one of the testers over on the Shell team was curious about how collation works for the Japanese alphabet. The discussion was an interesting one, so I thought I would post the summary of all the information we talked about (with some examples for each interesting distinction) here.

    Note that this behavior relates to what is done on Windows (as well as SQL Server, Office, Windows CE, Active Directory, and every Microsoft product that either calls our APIs or uses our data). Your mileage for other platforms may certainly vary!

    Ok, on Windows, the Japanese Kana all sort in an implementation of the GoJuOn order, with the following principles:

    • Traditional order for all letters
      e.g. ア before イ (U+30a2 aka KATAKANA LETTER A before U+30a4 aka KATAKANA LETTER I)
    • Halfwidth sorts before fullwidth
      e.g. ｱ before ア (U+ff71 aka HALFWIDTH KATAKANA LETTER A before U+30a2 aka KATAKANA LETTER A)
    • Small sorts before regular size
      e.g. ぁ before あ (U+3041 aka HIRAGANA LETTER SMALL A before U+3042 HIRAGANA LETTER A)
    • Katakana sorts before Hiragana
      e.g. ア before あ (U+30a2 aka KATAKANA LETTER A before U+3042 aka HIRAGANA LETTER A)
    • Some almost lookalikes sort right after their near twins
      e.g. う before ゔ (U+3046 aka HIRAGANA LETTER U before U+3094 aka HIRAGANA LETTER VU)
    • Circled comes after every other type of a letter
      e.g. ㋐ after あ (U+32d0 aka CIRCLED KATAKANA A after U+3042 aka HIRAGANA LETTER A and all other letter A's)

    When you combine all these rules together, the order you get for the vowels would be:

    ｧァｱアぁあ㋐ｨィｲイぃい㋑ｩゥｳウぅうヴゔ㋒ｪェｴエぇえ㋓ｫォｵオぉお㋔
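
    The way those rules stack up can be sketched with a small toy model. This is only an illustration: the function name toy_kana_key and the tuple layout are made up for this example, and the numbers are not the weights Windows actually stores. It only handles the vowel kana shown above, and it leans on Python's unicodedata character names to pick out each distinction.

```python
import unicodedata

def toy_kana_key(ch):
    """Return an illustrative sort key for one vowel kana (not real Windows weights)."""
    parts = unicodedata.name(ch).split()
    vowel = "AIUEO".index(parts[-1][-1])      # traditional vowel order; VU files under U
    if parts[0] == "CIRCLED":
        mark = 2                              # circled forms come after everything else
    elif parts[-1] == "VU":
        mark = 1                              # near twins come right after the plain letters
    else:
        mark = 0
    kana = 1 if "HIRAGANA" in parts else 0    # Katakana before Hiragana
    size = 0 if "SMALL" in parts else 1       # small before regular size
    width = 0 if "HALFWIDTH" in parts else 1  # halfwidth before fullwidth
    return (vowel, mark, kana, size, width)

# The A and U variants in the order listed above, written as escapes so the
# halfwidth forms stay unambiguous.
a_variants = "\uff67\u30a1\uff71\u30a2\u3041\u3042\u32d0"
u_variants = "\uff69\u30a5\uff73\u30a6\u3045\u3046\u30f4\u3094\u32d2"

# Start from the reversed order and let the toy key put everything back.
ordered = "".join(sorted(reversed(a_variants + u_variants), key=toy_kana_key))
print(ordered)
```

    Putting the kana type ahead of size and width in the tuple is what makes both Katakana sizes land ahead of the small Hiragana; that ordering of the fields is read off the list above, not taken from any Windows documentation.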

    And then the other important things to note (changes in red 1 June 2005 7:50am):

    • A Cho-On (prolonged sound mark) lengthens a vowel, so it duplicates the weight of the preceding character and subtracts a bit (so it sorts before)
      e.g. ぎー before ぎぎ (U+304e U+30fc aka HIRAGANA LETTER GI, KATAKANA-HIRAGANA PROLONGED SOUND MARK before two HIRAGANA LETTER GI)
    • An iteration mark duplicates a letter, so it duplicates the weight and adds a little bit (so it sorts right after)
      e.g. きき before き々 (U+304d U+304d aka HIRAGANA LETTER KI; HIRAGANA LETTER KI before U+304d U+3005 HIRAGANA LETTER KI; IDEOGRAPHIC ITERATION MARK)
    • With the judicious use of flags, you can ignore any or all of the distinctions, so that some or all of the letters within any one of the following color groups compare as equal:

    ｧァｱアぁあ㋐ｨィｲイぃい㋑ｩゥｳウぅうヴゔ㋒ｪェｴエぇえ㋓ｫォｵオぉお㋔

    In other words, everything on the same line below can be made to seem equal; everything on a different line cannot.

    HALFWIDTH KATAKANA LETTER SMALL A; KATAKANA LETTER SMALL A; HALFWIDTH KATAKANA LETTER A; KATAKANA LETTER A; HIRAGANA LETTER SMALL A; HIRAGANA LETTER A; CIRCLED KATAKANA A
    HALFWIDTH KATAKANA LETTER SMALL I; KATAKANA LETTER SMALL I; HALFWIDTH KATAKANA LETTER I; KATAKANA LETTER I; HIRAGANA LETTER SMALL I; HIRAGANA LETTER I; CIRCLED KATAKANA I
    HALFWIDTH KATAKANA LETTER SMALL U; KATAKANA LETTER SMALL U; HALFWIDTH KATAKANA LETTER U; KATAKANA LETTER U; HIRAGANA LETTER SMALL U; HIRAGANA LETTER U; KATAKANA LETTER VU; HIRAGANA LETTER VU; CIRCLED KATAKANA U
    HALFWIDTH KATAKANA LETTER SMALL E; KATAKANA LETTER SMALL E; HALFWIDTH KATAKANA LETTER E; KATAKANA LETTER E; HIRAGANA LETTER SMALL E; HIRAGANA LETTER E; CIRCLED KATAKANA E
    HALFWIDTH KATAKANA LETTER SMALL O; KATAKANA LETTER SMALL O; HALFWIDTH KATAKANA LETTER O; KATAKANA LETTER O; HIRAGANA LETTER SMALL O; HIRAGANA LETTER O; CIRCLED KATAKANA O

    How do the flags affect all this? Well....

    • NORM_IGNORENONSPACE removes the distinction of the circled letters and distinctions with letters like U and VU
    • NORM_IGNORECASE removes the distinction of the small versus regular letters;
    • NORM_IGNOREKANATYPE removes the distinction between Hiragana and Katakana;
    • NORM_IGNOREWIDTH removes the distinction between halfwidth and fullwidth.
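
    To make the effect of each flag concrete, here is a rough sketch that treats each distinction as one field of a toy key and lets a flag blank out its field. The flag values are the ones from winnls.h; everything else (the names toy_key and toy_equal, the tuple layout) is hypothetical and only models the vowel kana discussed here. It does not call any Windows API.

```python
import unicodedata

# Flag values as defined in winnls.h.
NORM_IGNORECASE = 0x00000001
NORM_IGNORENONSPACE = 0x00000002
NORM_IGNOREKANATYPE = 0x00010000
NORM_IGNOREWIDTH = 0x00020000

def toy_key(ch, flags=0):
    """Illustrative key for one vowel kana; each flag erases one distinction."""
    parts = unicodedata.name(ch).split()
    vowel = "AIUEO".index(parts[-1][-1])
    mark = 2 if parts[0] == "CIRCLED" else 1 if parts[-1] == "VU" else 0
    kana = 1 if "HIRAGANA" in parts else 0
    size = 0 if "SMALL" in parts else 1
    width = 0 if "HALFWIDTH" in parts else 1
    if flags & NORM_IGNORENONSPACE:
        mark = 0    # circled letters and U versus VU
    if flags & NORM_IGNORECASE:
        size = 0    # small versus regular
    if flags & NORM_IGNOREKANATYPE:
        kana = 0    # Hiragana versus Katakana
    if flags & NORM_IGNOREWIDTH:
        width = 0   # halfwidth versus fullwidth
    return (vowel, mark, kana, size, width)

def toy_equal(a, b, flags=0):
    """True when the two kana compare as equal under the given flags."""
    return toy_key(a, flags) == toy_key(b, flags)

ALL_FLAGS = (NORM_IGNORECASE | NORM_IGNORENONSPACE |
             NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH)

print(toy_equal("\u3041", "\u3042"))                   # small a vs regular a, no flags
print(toy_equal("\u3041", "\u3042", NORM_IGNORECASE))  # same pair, size ignored
print(toy_equal("\uff67", "\u32d0", ALL_FLAGS))        # halfwidth small a vs circled a
```

    Note that the vowel field is never erased, which matches the point that letters on different lines can never be made equal.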

    Now obviously Windows file names are "case insensitive", but we do not consider the "small" Kana and the "regular" Kana to be case pairs (no one does, usually including native speakers). So you can have both of them in file names in the same directory, but you cannot use both in names that are otherwise the same in (for example) an Active Directory installation. In fact, since all four flags are passed for AD, you cannot use any of the letters within the colored groups together in the same AD namespace.

    Ignoring something with these flags in this context means "treat them all as equal" -- which means you will have a non-deterministic ordering any time you have a big list with many of these variants comparing as equal. In my opinion, a deterministic order is always better, and not just because I try to be an orderly guy. :-)
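
    Here is a small sketch of what that means in practice. The "ignore everything" key below is a toy stand-in (it keeps only each kana's vowel), and breaking ties by raw code point is just one assumption about how you might restore a stable order if you really must ignore the distinctions; it is not something Windows does for you. The names folded_key and deterministic_key are made up for this example.

```python
import unicodedata

def folded_key(text):
    """Toy 'all four flags' key: keep only the vowel of each kana."""
    return tuple(unicodedata.name(ch).split()[-1][-1] for ch in text)

def deterministic_key(text):
    """Folded key first, then raw code points as an (assumed) tie-breaker."""
    return (folded_key(text), tuple(ord(ch) for ch in text))

# Hiragana a-i, Katakana a-i, and halfwidth Katakana a-i: all "equal" once folded.
names = ["\u3042\u3044", "\u30a2\u30a4", "\uff71\uff72"]

# With only the folded key, the result simply mirrors whatever order came in.
by_fold_1 = sorted(names, key=folded_key)
by_fold_2 = sorted(reversed(names), key=folded_key)
print(by_fold_1 == by_fold_2)

# With the tie-breaker, the input order no longer matters.
stable_1 = sorted(names, key=deterministic_key)
stable_2 = sorted(reversed(names), key=deterministic_key)
print(stable_1 == stable_2)
```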

    But your mileage may vary, of course!

    Now the Kanji are not sorted in pronunciation order, because as I mentioned back in December of last year, there is no pronunciation-based sort for Japanese on Windows. But if you have entered the pronunciation information and are sorting by it (the way that, for example, an address book might choose to do) then this order will be respected. Note that name readings (nanori'yomi) are sometimes (perhaps often) entirely individual and do not match any of the kun'yomi or on'yomi with which a given ideograph may be commonly associated. So such a feature makes a lot of sense if you know how all the names are pronounced; if not (for example in a large company address book) you may want an alternate way to search for names that you may know only by characters and not by pronunciation.

     

    This post brought to you by "ヰ" (U+30f0, a.k.a. KATAKANA LETTER WI)
