January, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Do they not even *use* Automatic Updates?!?

    • 29 Comments

    I have been reading posts from people all over the internet who hate that Microsoft may, at some point in the future, limit Windows Update to legal copies of Windows (Automatic Updates would be the only option for everyone else) with the Windows Genuine Advantage program (more info in the Windows Genuine Advantage FAQ).

    Many are on the bandwagon, from Greg Hughes to Mitch Wagner to a hundred of whoever your favorites are; everyone is talking about how evil Microsoft is for something that it has not even done yet.

    Most think Microsoft is being irresponsible by not patching these machines. Those people do not even realize that all security patches and Service Packs are still available via Automatic Update, even for illegal copies of Windows. This acts as a convincing proof to the theory that you do not need to know how to read in order to know how to write.

    The gist of the typical argument of those who are smart enough to at least recognize the "Automatic Updates" option is that people who pirate software will not choose to automatically update since they would be afraid that Microsoft would shut them down remotely for not being a legal user of Windows. They would rather use Windows Update where they have the choice for what they will or will not install.

    But have these people even used Automatic Updates before? Have they even looked at the dialog?

    Well, let's look at it now, shall we? Here it is, both in Windows XPSP2 and Windows Server 2003:

        

    Notice how I have them set -- the XPSP2 box will automatically update every day at 3:00am, and the Server 2003 box will simply let me know if there are updates and then let me know again before installing. Is that a hint as to why I think these people have not used the feature?

    Notice how both of them have an option to look at the updates previously declined (currently disabled, I do not tend to refuse updates!)? Is that another hint?

    Look at all of the options I have here!

    People have total control over whether they install the security updates or not. Even if they are using a pirated version of Windows! The same choice they have in Windows Update for Critical Updates and Service Packs. If they are willing to use the latter, then why would the former be less appealing?

    Wouldn't using Automatic Updates lead to a safer internet for all users since it does not require an explicit visit to a web site to get patches installed? The only reason I do not install automatically on my Server 2003 boxes is that I may be building something and would prefer to control when I install. It is still very cool to get the reminder that there is something to install, and I am a huge fan of that sort of feature.

    So what are these people complaining about, exactly?

  • Sorting it all Out

    FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)

    • 9 Comments

    Last Friday, Jochen Kalmbach, in response to A little bit about the new CharUnicodeInfo class, asked the following:

    By the way: is there some equivalent to FoldString, especially "MAP_PRECOMPOSED" and "MAP_COMPOSITE"? Neither StringInfo nor TextInfo provide such a function, or?

    My answer was:

    The .NET Framework has something even better than FoldString here -- I'll post on it tomorrow....

    But I got busy this weekend and never got around to posting the answer to the question. Sorry about that! I'll do it now (I hope Jochen did not give up on me in the interim!).

    The description of FoldString from the Platform SDK: The FoldString function maps one string to another, performing a specified transformation option.

    There are many different supported transformations:

    MAP_FOLDCZONE
       Fold compatibility zone characters into standard Unicode equivalents. For information about compatibility zone characters, see the following Remarks section.

    MAP_FOLDDIGITS
       Map all digits to Unicode characters 0 through 9.

    MAP_PRECOMPOSED
       Map accented characters to precomposed characters, in which the accent and base character are combined into a single character value. This value cannot be combined with MAP_COMPOSITE.

    MAP_COMPOSITE
       Map accented characters to composite characters, in which the accent and base character are represented by two character values. This value cannot be combined with MAP_PRECOMPOSED.

    MAP_EXPAND_LIGATURES
       Expand all ligature characters so that they are represented by their two-character equivalent. For example, the ligature 'æ' expands to the two characters 'a' and 'e'. This value cannot be combined with MAP_PRECOMPOSED or MAP_COMPOSITE.

    Digit folding functionality is covered by the methods I described in CharUnicodeInfo, especially GetDecimalDigitValue. Some of the other methods will do an even fuller job, supporting many of the non-decimal digit numbers, which FoldString never handled....
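    For readers who want to see the idea of digit folding outside of Windows, here is a minimal sketch in Python. It is only an analogue: it uses Python's unicodedata module rather than FoldString or CharUnicodeInfo, and the fold_digits name is made up for this illustration.

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII 0-9 equivalent.

    Hypothetical helper for illustration; not FoldString itself.
    """
    out = []
    for ch in text:
        # decimal() returns the supplied default (-1 here) when the
        # character has no decimal digit value.
        value = unicodedata.decimal(ch, -1)
        out.append(str(value) if value >= 0 else ch)
    return "".join(out)


# Arabic-Indic digits three, four, five and fullwidth digits one, two.
print(fold_digits("\u0663\u0664\u0665 and \uff11\uff12"))
```

    Characters that are numbers but not decimal digits (SUPERSCRIPT TWO, for example) are left alone by this sketch, which is the kind of case the fuller methods are meant to cover.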

    The ligature functionality does not really exist right now, though ligatures do work well in comparisons, whenever they need to.

    But the other three mapping types see new life in Whidbey, with tables that cover the Unicode 4.0 version of normalization, as described in UAX #15, UNICODE NORMALIZATION FORMS.

    How does it work? Well, in the Whidbey release of the .NET Framework, two new methods were added to System.String:

    bool IsNormalized(NormalizationForm normalizationForm)

    string Normalize(NormalizationForm normalizationForm)

    The functionality of the methods is obvious enough from the names -- the first checks if the string is in a specified normalization form, and the second puts it in a specified form.

    The enumeration with the forms (NormalizationForm) has four members:

    public enum NormalizationForm
    {
        FormC    = 1,
        FormD    = 2,
        FormKC   = 5,
        FormKD   = 6
    }

    The normalization forms, which are described much more fully in the UAX#15 spec, have easy analogues to their FoldString counterparts:

    FormC      MAP_PRECOMPOSED
    FormD      MAP_COMPOSITE
    FormKC     MAP_PRECOMPOSED | MAP_FOLDCZONE
    FormKD     MAP_COMPOSITE | MAP_FOLDCZONE

    In fact the only real difference is that FoldString only does part of the job, because the FoldString tables do not have all of the mappings that are in Unicode, a point I discussed previously. But these normalization methods do. So you can do all the mapping you need to in order to take equivalent forms of the same string and put them into one consistent form.

    Since the "default" method used in most situations is Form C, there are also overloads of the two methods with no NormalizationForm parameter that use Form C automatically. In many cases, that is the one you may want to use. Making Form C the "default" normalization form is not an arbitrary decision -- almost all of the keyboards that ship in Windows input text in Form C already (though of course keyboards created by MSKLC, being user-created, can be in whatever form).

    Another thing to keep in mind is that text may not be in any of these forms -- for example an arbitrary string like õĥµ¨ (U+00f5 U+0068 U+0302 U+00b5 U+00a8). This string combines a precomposed character, a composite character, and two characters with compatibility decompositions (the MICRO SIGN and the DIAERESIS). It is therefore not in any one form at all. Thus this string would see an IsNormalized return of false for all forms. But it can be normalized to return the appropriate result for each of them:

    õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form C  ---> õĥµ¨ (00f5 0125 00b5 00a8)
    õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form D  ---> õĥµ¨ (006f 0303 0068 0302 00b5 00a8)
    õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form KC --> õĥμ ̈  (00f5 0125 03bc 0020 0308)
    õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form KD --> õĥμ ̈  (006f 0303 0068 0302 03bc 0020 0308)

    Ideally they would always compare as being equal even if the forms are different, but this is definitely not a 100% of the time result, as I pointed out a few months ago when I answered the question Normalization and Microsoft -- whats the story? Therefore normalization is the one way you can use to make sure that you will always get the right comparison, especially in some cases that may not ever be fully supported in comparison, like "ﷺ" (U+fdfa, a.k.a. ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM), which decomposes to:

    صلى الله عليه وسلم

    (0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648 0633 0644 0645)

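    (Since most fonts do not support U+fdfa, if you can see the string above then it points to at least one time that normalization Form KD helped out for a lot of people! The decomposition of U+fdfa is a compatibility decomposition, so Form D alone would leave it untouched.)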
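    The four forms can be tried out without a Whidbey build. The sketch below uses Python's unicodedata module as a stand-in for String.Normalize and String.IsNormalized; the FORMS mapping and the helper names are assumptions made for this illustration, not part of the .NET Framework.

```python
import unicodedata

# Assumed mapping from the .NET NormalizationForm member names in this post
# to the form names that Python's unicodedata module expects.
FORMS = {"FormC": "NFC", "FormD": "NFD", "FormKC": "NFKC", "FormKD": "NFKD"}


def code_points(text):
    """Render a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04x}" for ch in text)


def normalize_all(text):
    """Return the text normalized to each of the four forms."""
    return {name: unicodedata.normalize(form, text) for name, form in FORMS.items()}


# The mixed-form sample string discussed in the post.
sample = "\u00f5\u0068\u0302\u00b5\u00a8"

for name, result in normalize_all(sample).items():
    # is_normalized requires Python 3.8 or later.
    was_normalized = unicodedata.is_normalized(FORMS[name], sample)
    print(f"{name:6} already normalized: {was_normalized}  ->  {code_points(result)}")

# U+fdfa only has a compatibility decomposition, so only the K forms expand it.
print(code_points(unicodedata.normalize("NFKD", "\ufdfa")))
```

    The Python module ships with its own version of the Unicode data, so results for characters added or changed after Unicode 4.0 may differ from what Whidbey returns.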

    You can also see the Beta documentation for the IsNormalized method, the Normalize method, and the NormalizationForm enumeration.

     

    This post brought to you by "ﷻ" (U+fdfb, a.k.a. ARABIC LIGATURE JALLAJALALOUHOU)
    A ligature that decomposes to "جل جلاله" or
    062c 0644 0020 062c 0644 0627 0644 0647.

     

  • Sorting it all Out

    We broke CharNext/CharPrev (or, bugs found through blogging?)

    • 14 Comments

    (special thanks to James for pointing out this bug)

    It is amazing how sometimes one can be so busy trying to make a point that one can miss the point.

    A few days ago, I pointed out that CharNext(ch) != ch+1, a lot of the time.

    That ought to be true. It is true if you are running Windows NT 3.51, Windows NT 4.0, or Windows 2000.

    But in XP, things seem to have changed a bit.

    It used to be that if one took combining characters like U+0308 (COMBINING DIAERESIS) and passed them to the GetStringTypeW or GetStringTypeEx APIs with the CT_CTYPE3 dwInfoType, it would return (C3_NONSPACING | C3_DIACRITIC). If you look at the Platform SDK topics for these APIs, the types are defined as follows:

    Name             Value     Meaning
    C3_NONSPACING    0x0001    Nonspacing mark.
    C3_DIACRITIC     0x0002    Diacritic nonspacing mark.

    Starting with Windows XP and continuing on with Windows Server 2003, it now just returns C3_DIACRITIC. Looking at the definitions, this makes sense -- C3_DIACRITIC claims it is for nonspacing marks, too. So the relevant part of the change is:

    1. There used to be no characters marked with just C3_DIACRITIC.
    2. There are no characters that are marked with just C3_NONSPACING now (there used to be several).

    This would all be fine given the above definitions (well, not really -- but we'll let that lie for a bit). The problem is that the CharNext and CharPrev APIs are relying on that C3_NONSPACING definition to figure out when to skip characters.
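    To make the "skip characters" idea concrete, here is a rough sketch in Python of stepping to the next user-perceived character by skipping nonspacing marks. It is not how CharNext is implemented: it assumes the Unicode general category Mn (via unicodedata) in place of the C3_NONSPACING data, and next_char_index is a made-up name.

```python
import unicodedata


def next_char_index(text: str, index: int) -> int:
    """Return the index just past the character at `index` and any
    nonspacing marks (general category Mn) that follow it.

    Hypothetical helper for illustration only.
    """
    if index >= len(text):
        return len(text)
    index += 1
    # Keep moving while the following characters are nonspacing marks.
    while index < len(text) and unicodedata.category(text[index]) == "Mn":
        index += 1
    return index


text = "a\u0308b"  # "a" + COMBINING DIAERESIS + "b"
print(next_char_index(text, 0))  # steps over both the "a" and the diaeresis
```

    If the data that marks U+0308 as nonspacing goes missing, a loop like this one stops after the "a", which is the kind of failure described above.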

    I'm not sure what scares me more -- that this bug has been around since October of 2000, or that it was found due to a blog post that I might not have thought to do had not someone suggested it to me.

    I'll see about making sure this bug gets put in on Monday.

    So, between this one and the one I found myself (described in the answer to Guess #3 in Why I don't like the IsTextUnicode API), two longstanding bugs in Windows have been found through the act of blogging.

    This answers the question I posted in OT -- They taste like chicken, don't they? once and for all. Blogging may annoy me, but that is not really relevant anymore. It helps me make the product better. So I think I'd better keep doing it....

    Scoble, you reading this? :-)

     

    This post sponsored by all 792 of the nonspacing marks in Unicode

  • Sorting it all Out

    Why I don't like the IsTextUnicode API

    • 12 Comments

    The IsTextUnicode API has been around since NT 3.5, according to the Platform SDK histories. According to the PSDK, its purpose is as follows:

    The IsTextUnicode function determines whether a buffer is likely to contain a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi.

    It then goes on to describe the many different tests that it can do when the appropriate flags are passed:

    IS_TEXT_UNICODE_ASCII16
       The text is Unicode, and contains only zero-extended ASCII values/characters.

    IS_TEXT_UNICODE_REVERSE_ASCII16
       Same as the preceding, except that the Unicode text is byte-reversed.

    IS_TEXT_UNICODE_STATISTICS
       The text is probably Unicode, with the determination made by applying statistical analysis. Absolute certainty is not guaranteed. See the following Remarks section.

    IS_TEXT_UNICODE_REVERSE_STATISTICS
       Same as the preceding, except that the probably-Unicode text is byte-reversed.

    IS_TEXT_UNICODE_CONTROLS
       The text contains Unicode representations of one or more of these nonprinting characters: RETURN, LINEFEED, SPACE, CJK_SPACE, TAB.

    IS_TEXT_UNICODE_REVERSE_CONTROLS
       Same as the preceding, except that the Unicode characters are byte-reversed.

    IS_TEXT_UNICODE_BUFFER_TOO_SMALL
       There are too few characters in the buffer for meaningful analysis (fewer than two bytes).

    IS_TEXT_UNICODE_SIGNATURE
       The text contains the Unicode byte-order mark (BOM) 0xFEFF as its first character.

    IS_TEXT_UNICODE_REVERSE_SIGNATURE
       The text contains the Unicode byte-reversed byte-order mark (Reverse BOM) 0xFFFE as its first character.

    IS_TEXT_UNICODE_ILLEGAL_CHARS
       The text contains one of these Unicode-illegal characters: embedded Reverse BOM, UNICODE_NUL, CRLF (packed into one WORD), or 0xFFFF.

    IS_TEXT_UNICODE_ODD_LENGTH
       The number of characters in the string is odd. A string of odd length cannot (by definition) be Unicode text.

    IS_TEXT_UNICODE_NULL_BYTES
       The text contains null bytes, which indicate non-ASCII text.

    IS_TEXT_UNICODE_UNICODE_MASK
       This flag constant is a combination of IS_TEXT_UNICODE_ASCII16, IS_TEXT_UNICODE_STATISTICS, IS_TEXT_UNICODE_CONTROLS, IS_TEXT_UNICODE_SIGNATURE.

    IS_TEXT_UNICODE_REVERSE_MASK
       This flag constant is a combination of IS_TEXT_UNICODE_REVERSE_ASCII16, IS_TEXT_UNICODE_REVERSE_STATISTICS, IS_TEXT_UNICODE_REVERSE_CONTROLS, IS_TEXT_UNICODE_REVERSE_SIGNATURE.

    IS_TEXT_UNICODE_NOT_UNICODE_MASK
       This flag constant is a combination of IS_TEXT_UNICODE_ILLEGAL_CHARS, IS_TEXT_UNICODE_ODD_LENGTH, and two currently unused bit flags.

    IS_TEXT_UNICODE_NOT_ASCII_MASK
       This flag constant is a combination of IS_TEXT_UNICODE_NULL_BYTES and three currently unused bit flags.

    Sound impressive and interesting enough yet?
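    A few of the deterministic tests in that list are simple enough to sketch. The Python below is a guess at their spirit based only on the PSDK descriptions quoted above, not the actual Windows implementation; it assumes the buffer is read as little-endian 16-bit words (as on Windows) and reports flag names as strings rather than real bit values.

```python
def simple_unicode_tests(buffer: bytes) -> set:
    """Apply a handful of the documented deterministic tests to a buffer.

    Hypothetical helper: returns the names of the flags that would be set.
    """
    results = set()
    if len(buffer) < 2:
        results.add("IS_TEXT_UNICODE_BUFFER_TOO_SMALL")
        return results
    if len(buffer) % 2:
        results.add("IS_TEXT_UNICODE_ODD_LENGTH")
    # Read the first 16-bit word the way a little-endian machine would.
    first = int.from_bytes(buffer[:2], "little")
    if first == 0xFEFF:
        results.add("IS_TEXT_UNICODE_SIGNATURE")
    elif first == 0xFFFE:
        results.add("IS_TEXT_UNICODE_REVERSE_SIGNATURE")
    if 0 in buffer:
        results.add("IS_TEXT_UNICODE_NULL_BYTES")
    return results


print(sorted(simple_unicode_tests("\ufeffhi".encode("utf-16-le"))))
print(sorted(simple_unicode_tests("\ufeffhi".encode("utf-16-be"))))
```

    The statistical tests are the interesting part of the real API, and nothing here attempts them.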

    A bit of trivia -- the code for a flag that used to be documented (IS_TEXT_UNICODE_DBCS_LEADBYTE) is still there (and it is still in the header file, obviously -- the PSDK never breaks people like that). But the flag does not work well, so it is probably just as well that it is not documented any more. I highly recommend not passing it, and ignoring it when it is returned. The flag is not dangerous or anything; it's just not too terribly useful for its intended purpose (detecting text that is actually DBCS).

    As I mentioned, the API has been around since NT 3.5. It was written by someone else, outside of the NLS team (such as it was in those days). That is fairly cool since there was not as much Unicode awareness/acceptance back then as there is now....

    In those heady days when, to most developers, Unicode was little more than a foreign word that translated to "twice the memory and space required for strings", this function was mostly used as a way to know when to call WideCharToMultiByte to convert strings out of Unicode [1], and there were very few callers even for that not-so-noble purpose. NT 4.0 did not see much of a usage explosion, although Windows 2000 did, where the number of callers throughout the entire Windows source tree just about tripled (to 65 or so callers). Not much movement on the caller side in XP or Server 2003, either. I don't mind this fact much, given why it mostly seemed to be used.

    Some time between XP and Server 2003, I did add it to MSLU, as a nice gesture to developers who were frustrated by NT-only APIs [2].

    Nevertheless, as the title of this post indicates, I don't like the IsTextUnicode API.

    You may think you know why -- go ahead, I'll give you three guesses.

    Guess #1: Because I do not own it?

    Sorry, that's not it -- but your opinion about my ego is noted. :-)  Strike one!

    I'll give you a hint.

    Hint#1: Look at the Platform SDK description (I'll add emphasis to enhance the hint):

    The IsTextUnicode function determines whether a buffer is likely to contain a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi.

    Guess #2: Excuse me, I meant because the NLS team does not own it?

    Hmm, sorry. I figured that was what you meant the first time. Strike Two!

    I'll give you another hint.

    Hint #2: There has only been one substantive change made to this API from the time of its creation until Server 2003 shipped -- a const was added to the lpBuffer parameter.

    Got it now? Think carefully now, this is your last guess.

    Guess #3: Because it considers "CRLF (packed into one WORD)" to be illegal, even though U+0d0a is MALAYALAM LETTER UU?

    Ooh, good one -- that looks like a bug in the IS_TEXT_UNICODE_ILLEGAL_CHARS flag detection. Even cooler that you properly figured out the byte reversal issue. Or maybe you did not notice that part, since both that ASCII CRLF packed into a WORD and the character would reverse on little-endian systems to look like 0x0a0d in memory, and if you did not allow for byte reversal you would have been right then anyway.

    Given the support for Malayalam described previously in the post Lions and tigers and bearsELKs, Oh my!, this is kind of embarrassing. Or maybe given the fact that the code point has been allocated since Unicode 1.1 (according to DerivedAge.txt) which was released in June of 1993 (according to enumeratedversions.html), this is particularly embarrassing. Though that does make the comment over its use in the API source pretty amusing:

                //  The following is not currently a Unicode character
                //  but is expected to show up accidentally when reading
                //  in ASCII files which use CRLF on a little endian machine.

    If you think about it, most UTF-16 big endian files would be from other operating systems and have just a CR or just an LF for their line breaks, even if they were just ASCII. I guess we know why there is no big-endian check for illegal characters? :-)  Makes the whole IS_TEXT_UNICODE_ILLEGAL_CHARS check weird even if it were not totally busted anyway.

    For MSLU fans, yes I ported this bug there as well, though not on purpose. Sorry about that, I am not used to reading code points as reversed bytes....

    Of course, since I did not know about this problem before, it can't be why I started this post not liking the API. Hell, if not for this imaginary conversation I put together, I still wouldn't know about it. Lucky for everyone that I have displayed this psychological dysfunction in public and thus cannot be further embarrassed by reporting the bug on it, right? Strike 3!

    Or we could call it a foul tip, since you found a decade-old bug and all. Ok, it is still Strike 2. :-)

    One more hint:

    Hint #3: There has been no change to this API's underlying mechanics since at least NT 3.51 (and probably since the original NT 3.5 release).

    Any more guesses?

    Guess #4: Because it only seems to test the first 256 bytes, no matter how big of a string I pass?

    Well, no. I never cared too much for that one, even before I came to Microsoft. But I never really found a file where it made a difference. It would be nice if someone were to change this, but I wouldn't lose any sleep over it -- so it's definitely not a reason to dislike an API. Strike 3!

    Ok, I'll just tell you now. Because as an API intended to verify whether a string is following a standard, it wins an award for its obtusitality. Why on earth would the following not have been added, over the years if not in the initial release?

    IS_TEXT_UNICODE_UNPAIRED_SURROGATES
       Since it is invalid to have a high surrogate without a low surrogate following it, or a low surrogate not preceded by a high surrogate, why not detect such non-conformant cases?

    IS_TEXT_REVERSE_UNICODE_ILLEGAL_CHARS
       It seems only fair to round out the checks for UTF-16BE by including the reverse version of this flag, doesn't it?

    IS_TEXT_UNICODE_INVALID_FOR_4_00
       Obviously new flags could be added for each major version -- what better way to check for what is invalid than to check against an official "valid" list?

    IS_TEXT_UNICODE_INVALID_SCRIPT_USAGE
       There are all kinds of sequences that would indicate bad usage, from combining marks from one script used in an unrelated script to illegal sequences to text with invalid ordering per the canonical combining classes, and so on.

    IS_TEXT_UNICODE_VALID_UTF8_PER_RFC2279
       The initial description of UTF-8 in RFC 2279, which I think is the method used by Notepad [3].

    IS_TEXT_UNICODE_VALID_UTF8_PER_UNICODE
       The more strict definition of UTF-8, which disallows surrogate code sequences and non-shortest forms.

    IS_TEXT_UNICODE_VALID_UTF32 / IS_TEXT_UNICODE_VALID_REVERSE_UTF32
       These flags could be combined with some of the older signature detection flags if a UTF-32 LE or BE signature is found.

    IS_TEXT_UNICODE_UCS2_32 / IS_TEXT_UNICODE_REVERSE_UCS2_32
       Analogous to the IS_TEXT_UNICODE_ASCII16/IS_TEXT_UNICODE_REVERSE_ASCII16 flags, they would detect UTF-32 that looks like it could all be represented as UTF-16 without needing surrogate pairs.
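    The unpaired surrogate check from the wish list above is small enough to sketch. This Python version is only an illustration of what such a test might look like; the function name and the byteorder parameter are assumptions, and nothing like it exists in IsTextUnicode today.

```python
def has_unpaired_surrogates(buffer: bytes, byteorder: str = "little") -> bool:
    """Report whether a UTF-16 buffer contains a surrogate without its partner.

    Hypothetical helper; byteorder is "little" or "big".
    """
    if len(buffer) % 2:
        raise ValueError("UTF-16 text must have an even number of bytes")
    units = [int.from_bytes(buffer[i:i + 2], byteorder)
             for i in range(0, len(buffer), 2)]
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return True
        i += 1
    return False


print(has_unpaired_surrogates("\U0001d11e".encode("utf-16-le")))  # valid pair
print(has_unpaired_surrogates(b"\x00\xd8a\x00"))  # high surrogate, then "a"
```

    Passing "big" for byteorder covers the byte-reversed case, in the same spirit as the existing REVERSE flags.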

    You get the idea -- Unicode is a dynamic standard, getting more interesting and more complicated all the time, not just for its own sake but in how the platform uses it. How can an API which is written a decade ago and never updated, whose job is to ask "is this flipping buffer full of Unicode text?" ever hope to keep up with such a standard?

     

    1 - Notepad being a noteworthy exception to this rule, since it used the API to try to detect when a text file was Unicode without a BOM.

    2 - Similar to why BeginUpdateResource, UpdateResource, and EndUpdateResource were added, though I must admit that for the *UpdateResource APIs it was mainly due to the fact that former MSFTie Matt Curland did all the work to make the functions Win9x-friendly. :-)

    3 - These are the rules that have been used by MultiByteToWideChar in later years. Ironically, the MultiByteToWideChar API is used by Notepad to convert files that it detected as UTF-8 by using RFC 2279 rules, meaning that any illegal sequences will be dropped without so much as a warning. Better keep those CESU-8 files away from recent enough versions of Notepad!

     

    This post sponsored by our much maligned little brother "ഊ" (U+0d0a, a.k.a. MALAYALAM LETTER UU)
    Who, like the rest of the Malayalam script, felt very supported by XPSP2, only to find out that the IsTextUnicode API did not share that opinion....

  • Sorting it all Out

    Every character has a story #5 (U+262b FARSI SYMBOL)

    • 4 Comments

    This character has an interesting history. As noted by Roozbeh Pournader:

    Neither Farsi, nor a symbol. In real life, it is the official emblem of the goverment of the Islamic Republic of Iran.

    Technically that would make it a logo and thus not a suitable candidate for encoding. But Roozbeh also noted:

    Exactly. The funny fact is that it has been in Unicode since 1.0...

    Truer words have ne'er been spoken. Luckily Ken Whistler stepped in to help explain the inconsistency:

    And in Unicode 1.0 it was called "SYMBOL OF IRAN", which was closer to your description of its use. It was WG2 that insisted on renaming it "FARSI SYMBOL" to get "IRAN" out of the name...

    P.S. I can feel another "Every Character Has a Story" story coming on...

    Of course this does seem to violate the stability rules, which claim that once a character is encoded, its name will not be changed. Luckily Ken once again stepped up to explain:

    Ancient history. Hundreds -- maybe thousands -- of Unicode 1.0 character names were changed in 1993 for Unicode 1.1 as part of the merger between the repertoires of Unicode and ISO/IEC 10646-1:1993. (The Great Compromise) The gory details of all the changes can be found in UTR #4, The Unicode Standard, Version 1.1. It was *after* that point (which was *very* painful for some people) that we put in place the never change a character name rule.

    The whole reason for having a Unicode 1.0 Name field in the UnicodeData.txt file was to track that name change.

    Now of course UTR #4 has been superseded and is not available online, though one would probably not learn much of interest since most of the fun/interesting parts about "The Great Compromise" are in the history and stories from those who were there, and that is not really captured. Think of it as being like the book of Acts in the New Testament -- many of the stories that would (in my very humble opinion) be really interesting about that particular period of time were not recorded, because the processes of change and compromise always tend to record information that speaks much more kindly about the experience than those who are there would themselves recall if you sat them down and bought them a beer....

    Anyway, back to U+262b. Roozbeh gave some more information in a different thread:

    ...U+262B, the so-called FARSI SYMBOL, which is nothing but the official symbol of the (government of) Islamic Republic of Iran, with no known usage but this. It was specifically designed in 1979 or 1980 for this purpose, and also appears in the flag of the Islamic Republic of Iran adopted at the same time.

    One insteresting [sic] point is that it is not Farsi (Persian) in any way! It is a logo form of the Arabic word "Allah", also encoded at U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM. Another interesting point is no one remembers exactly how it has got into Unicode! It has been there since the Unicode 1.0 days, so the source is definitely not an Iranian representative in SC2.

    Another interesting point is that when the very final session for approving a very recent Iranian national standard, defining a minimum subset of Unicode for Persian information interchange, was being held, the committee experts voted for removing this character from the optional characters list (characters which need not be supported but their use should be according to the text if they are), telling that it's really not a character, but a logo: "It's not used in text, but just in letterheads".

    Is anyone collecting notes to write that "Every Character Has a Story"  book some time? It's a good case for such a research! ;)

    When the idea of that "Every Character Has a Story" book was being floated around, I remember suggesting a subtitle of "The Dark Underbelly of Unicode". Amazing how easy it is to get there when you look into the history of some characters....

    And to date, no one (as far as I know) has come forward purporting to know how the "SYMBOL OF IRAN" was added to Unicode 1.0 (who proposed it, or why). Its source remains a mystery to this day....

     

    This post brought to you by who else but "☫" (U+262b, a.k.a. FARSI SYMBOL, a.k.a. SYMBOL OF IRAN in Unicode 1.0)

  • Sorting it all Out

    Sorry folks, MSKLC cannot trap CONTROL+ALT+DELETE

    • 3 Comments

    A few days ago, Larry Osterman pointed out Why is Control-Alt-Delete the secure attention sequence (SAS)?

    It is funny but one popular topic that comes up in supporting MSKLC is people wanting to be able to develop a keyboard layout that blocks the keystroke combination. So they are just looking for a version of MSKLC that does not have the DELETE key disabled because the other two keys are there and they are so close to their goal....

    Sorry, but they are not close. A keyboard layout cannot be made to take away this functionality. CTRL+ALT+DEL is still the one safe combination, even if the DELETE key were enabled in the user interface of MSKLC.

    Of course, there is a dialog that Outlook puts up when it feels the need to reauthenticate (network hiccups?) which I never type into since the spoofing potential is so obvious. But I am sure most people do type in their credentials again anyway and would consider me to be paranoid. Maybe they should recommend in the dialog that the user hit CTRL+ALT+DEL and type in their password there rather than trying to prompt for it directly?

    The problem with trying to make a system foolproof is that the designers will always underestimate the ingenuity of complete fools. Unfortunately, those with evil intent do not underestimate complete fools; they thrive on such people....

    This post sponsored by "¶" (U+00b6, PILCROW SIGN)

  • Sorting it all Out

    A little bit about the new CharUnicodeInfo class

    • 15 Comments

    CharUnicodeInfo is a new class that is being added to Whidbey. It has one very straightforward job -- pick up property information from the Unicode Character Database. But there is a lot of data there!

    The name provides the proper balance between being appropriately descriptive and showing up near the System.Char struct in the Object Browser.

    It is much more functional than the FoldString API's MAP_FOLDDIGITS (discussed a little bit yesterday), which simply maps digits from various scripts to 0 - 9. And it carries much more information than System.Char struct methods like Char.IsWhiteSpace and Char.IsPunctuation. Plus it is entirely based on Unicode character properties and has none of the backwards compatibility issues of the methods off of the Char struct (e.g. having to consider some characters as white space because other programs used to do so in their parsing). Pure Unicode all the way, baby!

    Now you can hardly call a class a secret when even simple searches in Google and MSN find over 100 pages about it. But I'll try to give a rundown of some of its basic functionality....

    Here is a list of some of the methods this new class contains:

    GetDecimalDigitValue -- as the name implies, returns the actual value this character has as a decimal digit (or -1 if it is not a decimal digit at all). This is ever so much more useful than Char.IsDigit, which only returns a simple yes/no answer to the question! In official terms, it returns the value of Unicode's Numeric Type/Numeric Value fields whenever the Unicode category is Nd (Number, Decimal).

    GetDigitValue -- For all those cases where a character is in fact a digit even if it is not just between 0 and 9, the GetDigitValue method can retrieve those values.

    GetNumericValue -- For the times that it is a number but may not even be a digit (such as fractional values), this method returns a numeric representation.

    GetBidiCategory -- There are many possible categories that describe the behavior of a character in bidirectional contexts, and every character falls into one of them: LeftToRight (L), LeftToRightEmbedding (LRE), LeftToRightOverride (LRO), RightToLeft (R), RightToLeftArabic (AL), RightToLeftEmbedding (RLE), RightToLeftOverride (RLO), PopDirectionalFormat (PDF), EuropeanNumber (EN), EuropeanNumberSeparator (ES), EuropeanNumberTerminator (ET), ArabicNumber (AN), CommonNumberSeparator (CS), NonSpacingMark (NSM), BoundaryNeutral (BN), ParagraphSeparator (B), SegmentSeparator (S), Whitespace (WS), and OtherNeutrals (ON) -- all members of the BidiCategory enumeration.

    GetUnicodeCategory -- Arguably the most elemental property, a character's General Category (one per character) really defines what a character is. Possible values are UppercaseLetter (Lu), LowercaseLetter (Ll), TitlecaseLetter (Lt), ModifierLetter (Lm), OtherLetter (Lo), NonSpacingMark (Mn), SpacingCombiningMark (Mc),  EnclosingMark (Me), DecimalDigitNumber (Nd), LetterNumber (Nl), OtherNumber (No), SpaceSeparator (Zs), LineSeparator (Zl), ParagraphSeparator (Zp), Control (Cc), Format (Cf), Surrogate (Cs), PrivateUse (Co), ConnectorPunctuation (Pc), DashPunctuation (Pd), OpenPunctuation (Ps), ClosePunctuation (Pe), InitialQuotePunctuation (Pi), FinalQuotePunctuation (Pf), OtherPunctuation (Po), MathSymbol (Sm), CurrencySymbol (Sc), ModifierSymbol (Sk), OtherSymbol (So), and OtherNotAssigned (Cn). And every one of them is a member of the UnicodeCategory enumeration.
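
    To get a feel for the kind of data these methods expose, here is a rough sketch using Python's standard unicodedata module, which reads the same Unicode Character Database fields. It is not CharUnicodeInfo itself; the describe helper is a made-up name, and the -1 defaults are my choice to mirror the "-1 if it is not a digit" behavior described above.

```python
import unicodedata

def describe(ch):
    """Collect the Unicode Character Database properties discussed above."""
    return {
        "decimal": unicodedata.decimal(ch, -1),    # -1 when not a decimal digit
        "digit": unicodedata.digit(ch, -1),        # -1 when it has no digit value
        "numeric": unicodedata.numeric(ch, -1.0),  # -1.0 when it has no numeric value
        "category": unicodedata.category(ch),      # General Category, e.g. "Nd"
        "bidi": unicodedata.bidirectional(ch),     # bidirectional class, e.g. "AN"
    }

# ASCII seven, Arabic-Indic three, superscript two, vulgar fraction one half, letter A
for ch in ("7", "\u0663", "\u00b2", "\u00bd", "A"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

    The superscript two is the interesting one: it has a digit value and a numeric value but no decimal digit value, which is exactly the distinction the three methods above are drawing.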

    Note that every one of these methods has two overloads -- one that accepts a single System.Char, and the other which takes a System.String and an index value. The latter case is for dealing with supplementary characters, which are made up of a high and low surrogate (also known as a surrogate pair).
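
    For anyone who has not run into surrogate pairs before, the arithmetic is small enough to sketch. The Python below is only an illustration (to_surrogate_pair is a hypothetical helper, and Python strings hold whole code points rather than UTF-16 code units), but it shows why a single 16-bit Char cannot carry a supplementary character and a string plus an index can.

```python
import unicodedata

def to_surrogate_pair(ch):
    """Split a supplementary character into its UTF-16 high and low surrogates."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("not a supplementary character")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

ch = "\U0001D7CE"  # MATHEMATICAL BOLD DIGIT ZERO, a decimal digit outside the BMP
high, low = to_surrogate_pair(ch)
print(f"U+{ord(ch):X} -> {high:04X} {low:04X}")
print(unicodedata.category(ch), unicodedata.decimal(ch))
```

    Neither half of the pair has a digit value on its own; the property lookup only makes sense once both code units are read together.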

    Who knows what the future may bring to this class? The possibilities are endless, as the data that sits behind Unicode allows sophisticated text processing engines to use these properties in exciting ways. All written using the .NET Framework. Speaking as someone charged with writing tools such as MSKLC in the .NET Framework, I plan to try and be one of CharUnicodeInfo's best and most appreciative customers in the months and years to come. :-)

     

    This post brought to you by the many Unicode Character Categories....

  • Sorting it all Out

    Notepad folds digits like origami birds

    • 14 Comments

    Chris Walker mentioned to me yesterday something I did not know about Notepad -- that it uses the FoldString API with the MAP_FOLDDIGITS flag. This takes all of the digits in Unicode and folds them down into regular old zero to nine for anything you type into the "Goto line" dialog (you can get to it by typing <CONTROL+G> or choosing "Edit|Go To..." from the menu).

    Note that there is really no connection to origami, the art of paper folding, in this post. But I started playing with this and decided it was really kind of cool.  And then I was thinking about how Notepad looked like a big sheet of paper and how the piece of paper was folding digits, and how funny it was that paper would fold rather than being folded. Then I put http://origami.com/ into my browser and the title took on a life of its own....

    Now this whole digit folding in Notepad is very much a "stealth" feature, since it is not in the online help or any documentation that I was able to find. At first this annoyed me but then I thought about how hard it would be to document this feature to people who do not realize that some languages have their own digits. Or that it would probably not be necessary to document it for those who do (since people who do use them would probably try it anyway and just be pleasantly surprised that it works!). It is not a feature that is going to specifically convince people to buy Windows and I cannot see anyone truly believing that Windows is more internationalized because of this feature. So perhaps the fact that it is not documented is not too bad (though I am happy to point it out now).

    Obviously it is most useful to someone who uses some other language that has its own set of digits like Arabic or Thai, but you can play with it too by grabbing some of the digits below into the clipboard and watching the "Goto Line" functionality work its magic.

    It is probably most fun for when you do not have the fonts available -- since you will just see square boxes for those entries. Yet the API does not discriminate based on your machine's available fonts, so neither does Notepad. :-)
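
    If you want to see the idea outside of Notepad, the folding can be approximated in a few lines of Python. This is a sketch under an assumption: it folds every character that the Unicode Character Database gives a digit value, which may not be exactly the set of characters that MAP_FOLDDIGITS handles. The fold_digits name is made up for the example.

```python
import unicodedata

def fold_digits(text):
    """Replace every character that has a Unicode digit value with ASCII 0-9."""
    out = []
    for ch in text:
        value = unicodedata.digit(ch, None)  # None when the character is not a digit
        out.append(str(value) if value is not None else ch)
    return "".join(out)

print(fold_digits("\u0661\u0662\u0663"))  # Arabic-Indic one, two, three
print(fold_digits("line \u0e54\u0e52"))   # Thai four, two
```

    Like the API, this never looks at fonts; it works purely from character properties, so digits that show up as square boxes still fold.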

    If you have no patience for this sort of thing, then I apologize. You can stop by tomorrow and I'll try to catch your interest....

    UNICODE SCRIPTS USED BELOW FOR EACH NUMBER:

    • Arabic-Indic Digits
    • Eastern Arabic-Indic Digits
    • Devanagari Digits
    • Bengali Digits
    • Gurmukhi Digits
    • Gujarati Digits
    • Oriya Digits
    • Telugu Digits
    • Kannada Digits
    • Malayalam Digits
    • Thai Digits
    • Lao Digits
    • Superscript Digits
    • Subscript Digits
    • Circled Digits
    • Fullwidth Digits
    • Tamil Digits (no zero -- though one is being added to Unicode 4.1 and I stuck the codepoint in for now)
    • Parenthesized Digits (no zero)
    • Period Digits (no zero)
    • Dingbat Negative Circled Digits (no zero)
    • Dingbat Circled Sans-Serif Digits (no zero)
    • Dingbat Negative Circled Sans-Serif Digits (no zero)
    • "ASCII" Digits

    See how easy it is to get a program to recognize different digits? :-)

    ٠۰०০੦૦୦౦೦൦๐໐⁰₀⓪０0   (0660 06f0 0966 09e6 0a66 0ae6 0b66 0c66 0ce6 0d66 0e50 0ed0 2070 2080 24ea ff10 0030)
    ١۱१১੧૧୧౧೧൧๑໑¹₁①１௧⑴⒈❶➀➊1   (0661 06f1 0967 09e7 0a67 0ae7 0b67 0c67 0ce7 0d67 0e51 0ed1 00b9 2081 2460 ff11 0be7 2474 2488 2776 2780 278a 0031)
    ٢۲२২੨૨୨౨೨൨๒໒²₂②２௨⑵⒉❷➁➋2   (0662 06f2 0968 09e8 0a68 0ae8 0b68 0c68 0ce8 0d68 0e52 0ed2 00b2 2082 2461 ff12 0be8 2475 2489 2777 2781 278b 0032)
    ٣۳३৩੩૩୩౩೩൩๓໓³₃③３௩⑶⒊❸➂➌3   (0663 06f3 0969 09e9 0a69 0ae9 0b69 0c69 0ce9 0d69 0e53 0ed3 00b3 2083 2462 ff13 0be9 2476 248a 2778 2782 278c 0033)
    ٤۴४৪੪૪୪౪೪൪๔໔⁴₄④４௪⑷⒋❹➃➍4   (0664 06f4 096a 09ea 0a6a 0aea 0b6a 0c6a 0cea 0d6a 0e54 0ed4 2074 2084 2463 ff14 0bea 2477 248b 2779 2783 278d 0034)
    ٥۵५৫੫૫୫౫೫൫๕໕⁵₅⑤５௫⑸⒌❺➄➎5   (0665 06f5 096b 09eb 0a6b 0aeb 0b6b 0c6b 0ceb 0d6b 0e55 0ed5 2075 2085 2464 ff15 0beb 2478 248c 277a 2784 278e 0035)
    ٦۶६৬੬૬୬౬೬൬๖໖⁶₆⑥６௬⑹⒍❻➅➏6   (0666 06f6 096c 09ec 0a6c 0aec 0b6c 0c6c 0cec 0d6c 0e56 0ed6 2076 2086 2465 ff16 0bec 2479 248d 277b 2785 278f 0036)
    ٧۷७৭੭૭୭౭೭൭๗໗⁷₇⑦７௭⑺⒎❼➆➐7   (0667 06f7 096d 09ed 0a6d 0aed 0b6d 0c6d 0ced 0d6d 0e57 0ed7 2077 2087 2466 ff17 0bed 247a 248e 277c 2786 2790 0037)
    ٨۸८৮੮૮୮౮೮൮๘໘⁸₈⑧８௮⑻⒏❽➇➑8   (0668 06f8 096e 09ee 0a6e 0aee 0b6e 0c6e 0cee 0d6e 0e58 0ed8 2078 2088 2467 ff18 0bee 247b 248f 277d 2787 2791 0038)
    ٩۹९৯੯૯୯౯೯൯๙໙⁹₉⑨９௯⑼⒐❾➈➒9   (0669 06f9 096f 09ef 0a6f 0aef 0b6f 0c6f 0cef 0d6f 0e59 0ed9 2079 2089 2468 ff19 0bef 247c 2490 277e 2788 2792 0039)


     

    This post is sponsored by all of the numbers listed above (although those in the second row did point out that they are #1 quite a bit before agreeing!)

  • Sorting it all Out

    Sometimes, localization is a four letter word

    • 1 Comments

    As the guy from The Princess Bride said, I do not think that word means what you think it means.... you ever find yourself feeling that way? Localization is one of those words, mainly because people mix up basic terms and assume they mean what other terms mean in this area. But thankfully people like Larry Osterman not only don't have this problem, but they can point out the people who do, which he proved in his post What is localization anyway?

    And no, Larry. You were not stepping on my toes; people saying smart things should never feel like they can't say them. :-)

  • Sorting it all Out

    Lions and tigers and <strike>bears</strike>ELKs, Oh my!

    • 24 Comments

    In Fall 2004, Cathy Wissink and I were in San Jose at the Unicode Technical Committee meeting (being held at Apple) along with 20+ of our colleagues from various companies involved with internationalization. We spoke at the IMUG (International Mac User's Group) meeting one evening, giving a much longer version of the talk that has been done before at both prior Internationalization and Unicode Conferences and at the Microsoft Global Development & Deployment Conference. Things were a little bit closer to shipping so more could be said, and since we were given more time we were definitely allowed to say more.

    The title of the talk? Windows for the Rest of the World -- Customizing Windows for Emerging Markets. This post will contain a few slides of the content from that talk. :-)

    One thing we talked about quite a bit was locales and how long it took to get them added. Some stats:

    • Windows NT 3.51 – 45 locales
    • Windows 95 – 105 locales
    • Windows NT 4.0 – 105 locales
    • Windows 98 – 114 locales
    • Windows 2000 – 125 locales
    • Windows Millennium Edition – 114 locales
    • Windows XP – 135 locales
    • Windows Server 2003 – 135 locales

    These numbers are only impressive when one ignores how many languages and cultures around the world are not being covered. We then pointed out the problem with the traditional methods we have been using to add NLS data:

    • The last few years have seen an explosion of demand for new cultural data
    • Procedure has only allowed for additions to data at the time of a major system release
    • In order to add traditional NLS data to Windows, we need:
      • Strong business case
      • Adequate resources (research/development/test)
      • Carefully reviewed data
        • Scrubbed for geopolitical issues
        • Standards conformant
        • Culturally appropriate
      • Testing
    • It therefore takes too long to enable new languages for Windows!

    The presentation got into detail about a lot of the things that we are doing to try to help here, some of which I have talked about before (like MSKLC), and others that I will likely cover in future posts. But for now I will talk about one of the many things GIFT is doing to help with the issues above: ELKs!

    ELK stands for Enabling Language Kit. These useful beasts will (on a per locale basis) install as needed any or all of the following:

    • New locale information
    • New keyboard(s)
    • New sort(s)
    • New font(s)
    • New shaping engine(s)
    • Underlying infrastructure (setup, registry, .INF file changes)

    Obviously some (like locale information) always had to be done, but others (like fonts or shaping engines) were only required for a few.

    Lest you are afraid at this point that ELKs are typical vaporware that is never actually shipped, Microsoft Windows XP Service Pack 2 ships with 25 new ELK locales! Those locales are:

    • Bengali - India
    • Croatian - Bosnia and Herzegovina
    • Bosnian - Bosnia and Herzegovina
    • Serbian - Bosnia and Herzegovina (Latin)
    • Serbian - Bosnia and Herzegovina (Cyrillic)
    • Welsh - United Kingdom (more info in English, in Welsh)
    • Maori - New Zealand
    • Malayalam - India
    • Maltese - Malta
    • Quechua - Bolivia
    • Quechua - Ecuador
    • Quechua - Peru
    • Setswana / Tswana - South Africa
    • isiXhosa / Xhosa - South Africa
    • isiZulu / Zulu - South Africa
    • Sesotho sa Leboa / Northern Sotho - South Africa
    • Northern Sami - Norway
    • Northern Sami - Sweden
    • Northern Sami - Finland
    • Lule Sami - Norway
    • Lule Sami - Sweden
    • Southern Sami - Norway
    • Southern Sami - Sweden
    • Skolt Sami - Finland
    • Inari Sami - Finland

    (I swear that this list was even more impressive when it was done with PowerPoint animations, showing up one item at a time!)

    Definitely not vaporware -- you can install XP SP2 and see support for all of these locales today. And things will continue on in the future!

    And like I said, there are a lot of other items discussed in the presentation, which will be covered in future posts. It's all about getting out of the way...

    This post sponsored by "ᕣ" (U+1563, CANADIAN SYLLABICS N-CREE THII)

  • Sorting it all Out

    No, not all programmers speak English.

    • 7 Comments

    Yesterday, Brad Abrams asked Do all programmers speak english?

    The answer, which I learned in part from the volunteer efforts to translate/localize large parts of the Trigeminal website, is no. Not all of them do.

    I learned (and continue to learn!) many things from that site, because all of the following are true:

    1. Some people from other countries send me mail and talk about how cool the localized site is.
    2. Other people not only do that but point out typos and better words to use in translations.
    3. Still other people want to help by translating some pages themselves.
    4. Most people send me email in their language, asking for assistance with a technical problem!

    The latter reached its high point a few years back when someone from a Microsoft subsidiary sent me email, fascinated that a consulting company he had never heard of was apparently doing business right near him. He wanted to know where my office was located, since my address was not on the site! [1]

    To this day I get email from developers now and again in français, हिन्दी, deutsch, Ελληνικά, Português, தமிழ், Română, עברית, svenska, ภาษาไทย, Español, 日本語, Nederlands, Български, فارسى, and sometimes even ქართული (whether it is for reason #1, #2, #3, or #4 often varies with language but the patterns are consistent!). I have even at times had whole email conversations with me typing in English and them typing in their native language (mostly when it is a Latin script language!), and since we share a common subject matter we can often understand each other without requiring Babel Fish Translation for every message.

    One sentiment I have often heard from people in Sweden and the Netherlands (where I have given presentations and even had translations done of the PowerPoint slides!) is that many developers find knowledge of English to be invaluable as not all books and articles are translated (and some that are translated are not done as well as they might be). Also, many developers assume that the English version of a program will be less buggy (though thankfully I think this impression is at least starting to fade). Along those lines, Francesco Balena told me a few years ago that even though he writes books for Microsoft Press in English, he does the translations into Italian himself (which I would imagine reduces the cost in addition to giving better control over quality!).

    The one time that developers are most likely to use their native language is in comments, even if their function and variable names are in English. And I suspect that as more and more developers from other countries break into the international space that this will become more and more common.

    I will close with a link to the 'Provincial' page, which sums up the issue in as cheeky of a manner as I can muster. :-)

    1 - He was less thrilled after I pointed out that not only was I not in France, but that neither was my translator, who is a native of Québec! He suddenly changed from #1 to #2 and started pointing out typos that existed on the French pages, something that is always easier to forgive when the author is from the same place!

     

    This post sponsored by "ચ" (U+0a9a, a.k.a. GUJARATI LETTER CA)

  • Sorting it all Out

    Speaking Engagement.NET

    • 5 Comments

    I promised I would let people know when I was speaking next....

    I will be giving a talk at the .NET Developer's Association on February 14, 2005 in Redmond, WA, USA. I'll be talking about a lot of the exciting globalization features that are new in Whidbey. If you are local (or traveling to Seattle) and interested, you can get more info about the meeting at: 

          http://www.netda.net/Event/EventNewsletter.asp?EventDate=2/14/2005

    You do have to be a member for the raffle, but anyone can show up for the presentation. And in case I owe you a book from the raffle that was done the last time I spoke, I will bring them with me!

    Now it is Valentine's Day, so if you need me to write a note so your S.O. will understand why you have to be home a little late that night then let me know and I'll see what I can do. (Or you can bring him/her along, though a technical presentation may not be the best Valentine's Day "date" environment.)

  • Sorting it all Out

    CAPSLOCK only sometimes equals ShiftLock

    • 4 Comments

    Tor Lillqvist suggested I talk a little about the difference between CAPSLOCK and ShiftLock --

    Something you might want to blog about some day: the difference between "CapsLock" and "ShiftLock". Especially determining programmatically whether the "VK_CAPITAL" key acts as CapsLock or ShiftLock for a certain keyboard layout (input locale identifier). (Something I needed to do recently, for somewhat complex reasons)

    Currently I do it by checking with ToAsciiEx(). I select a key that generates different characters with and without VK_SHIFT, and where the result from the shifted key isn't simply the uppercase equivalent of the unshifted one. I.e. typically my code would select a key from the "digit" row.

    I then check again with VK_CAPITAL toggled (and VK_SHIFT off), and if the resulting char is the same that just VK_SHIFT produced, I know that VK_CAPITAL should act as ShiftLock.

    Well, of course I would always suggest using ToUnicodeEx instead of ToAsciiEx.... :-)

    Of course it's not that simple since it is a per-key setting (as you point out, usually the CAPSLOCK key does not cause the number keys to display their shift states). And then come features like SGCAPS that take over the CAPSLOCK key in a big way. Check out the Hebrew keyboard some time in MSKLC and you'll see what I mean -- it just wreaks havoc with the notion of the key labelled CAPSLOCK locking the 'CAPS' state. The price of adding two new shift states.

    The setting also gets weird when you add dead keys to the mix -- anytime you call ToUnicode or ToUnicodeEx and the key is a dead key you will have that weird stateful situation which will affect the next call in the same thread. Though of course people who author keyboard layouts with dead keys that span multiple shift states probably do not like their users too much -- that's a hard typing task!

    The setting in question here is actually available in MSKLC. We argued about the name of the setting for some time before finally settling on the "caps = shift" option. I had voted for "caps == shift" but was outvoted by the program managers 2 to 1 -- maybe I should have invited some other developers?

    Anyway, the MSKLC setting is actually automatically applied when the characters are cased variants of each other, and it is used in validation -- if the two characters are case pairs yet this option is not enabled (or if they are not case pairs yet the option IS enabled), you get a warning. But this is okay if it is intentional -- usually, it's not, though. After MSKLC was available we found out how often Microsoft had done this setting improperly in the built-in keyboards in Windows!
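
    The validation rule is simple enough to sketch. To be clear, this is not MSKLC's code: the function names are hypothetical, and the case-pair test here (the shifted character is the uppercase form of a different base character) is my assumption about what "cased variants of each other" means in practice.

```python
def is_case_pair(base, shifted):
    """True when the shifted character is the uppercase form of a distinct base character."""
    return base != shifted and base.upper() == shifted

def caps_equals_shift_warning(base, shifted, option_enabled):
    """Return a warning string when the option disagrees with the characters, else None."""
    pair = is_case_pair(base, shifted)
    if pair and not option_enabled:
        return "case pair without caps = shift"
    if option_enabled and not pair:
        return "caps = shift on a non-case pair"
    return None

print(caps_equals_shift_warning("a", "A", option_enabled=False))
print(caps_equals_shift_warning("1", "!", option_enabled=True))
```

    Both calls above produce a warning; the quiet cases are a/A with the option on and 1/! with it off.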

    So, the summary -- for the CAPSLOCK to act as a ShiftLock, one must explicitly ask it to be so in the keyboard layout -- and often the layout author does not do this (even if that was probably the intention).

    This post brought to you by "±" (U+00b1, a.k.a. PLUS-MINUS SIGN)

  • Sorting it all Out

    Challenges behind MSLU: the loader! (Part 1)

    • 5 Comments

    (This post is based on one that was sent to the microsoft.public.platformsdk.mslayerforunicode newsgroup back in June of 2001)

    It all started simply enough: the Windows division decided to fund a project for a Unicode Layer for Win9x to help people be able to fully use the NT functionality that had been around all the time and was getting better each version. Obviously, anything that was going to slow down applications on WinNT/Win2K/WinXP was unacceptable. So there had to be a way to make sure this layer (MSLU) did not slow down a Unicode application running on the NT platform.

    We started with the DELAYLOAD technology, introduced in Visual C++ 6.0. The purpose of delay loading is (according to MSDN):

    "The Visual C++ linker now supports the delayed loading of DLLs. This relieves you of the need to use the Win32 SDK functions LoadLibrary and GetProcAddress to implement DLL delayed loading."

    All well and good, right? Well, we quickly found another side effect: if you were never going to call a function at all, then you could delay the loading of the DLL till hell froze over!

    And then we quickly found yet another cool side effect: if you used their capability to override the loading via a hook function, you could redirect the effort and be sure on the NT platform to call the original API instead of the delayloaded entry point in your other DLL. Cool, right?

    Well, we thought so. :-)

    Of course, we quickly ran into problems with this approach. Developers might want to use the delayload stuff themselves, for their own DLLs. They might even have custom hooks already! It would not be too friendly to take over a technology and make people choose which functionality they preferred: delay loading or Unicode on Win9x.

    Luckily, Microsoft has a lot of smart people in it. After working with one of the long-time devs on the Windows team (Bryan Tuttle) and on the linker team over in VC++ (the "father of the Microsoft linker" and original developer of delay load, Dan Spalding), we had a new solution: the MSLU loader!

    (and for the record, all of the credit goes to Bryan and Dan; everything cool I did with the solution was due to the fact that they provided the solution in the first place!)

    Now delay load works by dynamically creating the stubs at compile time that are needed for the program, but we did not need to duplicate *that* functionality since the APIs being wrapped are a static list -- no need for dynamic work here. So the MSLU loader is basically contained in the .LIB file and contains all of the information for every single function that is wrapped or stubbed. That is the reason why a DLL that is less than 200K ended up with a .LIB file that is over 2 MB -- because it contains a lot of information (it also contains all of the failure stubs so that if for some reason the API cannot be called due to low memory, etc., the failure case would be handled gracefully).
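
    The real loader lives in linker-level stubs, which I cannot show here, but the dispatch idea can be sketched as an analogy. Everything in the Python below is hypothetical (the LazyApi class, the native table, the wrapper loader callback); it only illustrates the three outcomes described above: call the original API on NT, load the wrapper lazily elsewhere, and fall back to a failure stub if loading fails.

```python
class LazyApi:
    """Resolve an API on first call: native on NT, wrapper elsewhere, failure stub last."""

    def __init__(self, name, is_nt, native_table, load_wrappers, failure_value):
        self.name = name
        self.is_nt = is_nt
        self.native_table = native_table    # stands in for the OS's own Unicode exports
        self.load_wrappers = load_wrappers  # stands in for loading the wrapper DLL
        self.failure_value = failure_value
        self._target = None                 # nothing is resolved until the first call

    def _resolve(self):
        if self.is_nt:
            return self.native_table[self.name]  # the wrapper is never loaded at all
        try:
            return self.load_wrappers()[self.name]
        except (OSError, KeyError):
            # Failure stub: ignore the arguments and report failure gracefully.
            return lambda *args: self.failure_value

    def __call__(self, *args):
        if self._target is None:
            self._target = self._resolve()
        return self._target(*args)
```

    Note that the failure stub only needs the return value, not the API's name, which hints at why so many of the real stubs end up identical.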

    Of course, the size might freak people out: who wants to add 2 MB to their binary's size? Don't worry, that's not a problem, either. Remember that most APIs are pretty much identical as far as the linker is concerned, even in failure stubs: functions that take three parameters and return FALSE look the same whether it's AddMonitorW, EnumDateFormatsW, or SetLocaleInfoW. What this means is that your debug build (no optimizations) will be a little bit bigger since you are going to have a lot of data that is identical for separate APIs, but if you do any kind of optimization (either for size or for speed) then all of these identical bits of code will be merged such that even large, complex applications only add a little bit of size.

    More MSLU loader trivia -- neither 'dumpbin /imports' nor Steve Miller's Dependency Walker can detect the loader, so you have to look for it indirectly. How's that? Well, it is fairly easy to see if every single Unicode API that takes string parameters suddenly disappears from Dependency Walker's tree. You can even use such a technique to look into incorrect unicows.lib integration -- if those imports are on the list then it means you have your .LIB files out of order.

    Does it all sound hideously complex? Well, a lot of work got done in the design so that all of the actual complexity is hidden, and all you have to do is link to the .LIB file, which gives you the MSLU loader free and clear (literally free, since it comes with the downloadable Platform SDK!). It made for a very interesting bit of technology (a custom delay loader) hidden away inside another interesting bit of technology (a Unicode layer for Win9x).

    All of it would have been impossible without some of the really smart people in VC++ and Windows. :-)

  • Sorting it all Out

    In Tamil -- sometimes, they are digits; other times, just numbers

    • 18 Comments

    Early last year, Raymond Chen talked about how Char.IsDigit matches more than just 0 through 9 and later last year I talked about Crossing the DIGITal divide. But in both cases the conversation is limited to digits, and not the wide world of numbers which includes a lot more than just different ways of saying 0123456789. 

    The distinction between digits and numbers in Unicode is an important one, since the formatting and parsing of numeric values is highly dependent on whether a number acts like the ASCII digits 0 - 9 or not.

    Now the bulk of the modern number systems use the same Arabic-Indic system conventions to which software developers are accustomed, but others do exist, some of which still see use today.

    As an example people can relate to, most of us are aware of the Roman numeral system where there is no Zero and you sometimes have to use a lot of addition and subtraction in a deterministic manner (such that any time a smaller number comes before a larger one, the smaller one is subtracted; otherwise if they are the same value or the larger one comes first, it is added). Thus Ⅰ is one, Ⅲ is three, Ⅳ is 4, Ⅴ is 5, and so on. Although it is not used too much, it is still commonly seen in the credits of movies and television shows for the copyright date (e.g. MCMLXXXIX for 1989). Many people who are not used to Roman numerals breathed a sigh of relief at the year 2000 since MM is so much easier to read....
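
    The add-or-subtract rule in that paragraph translates almost word for word into code. Here is a minimal Python sketch, assuming plain ASCII letters as input and doing no validation of malformed numerals; roman_to_int is just a name picked for the example.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(text):
    """Add each numeral, subtracting it instead when a larger one follows."""
    total = 0
    for i, ch in enumerate(text):
        value = ROMAN_VALUES[ch]
        following = ROMAN_VALUES[text[i + 1]] if i + 1 < len(text) else 0
        total += -value if value < following else value
    return total

print(roman_to_int("MCMLXXXIX"))  # 1989
print(roman_to_int("MM"))         # 2000
```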

    It is of note that the Roman Numerals are encoded in Unicode even though they can all be represented as existing letters. The primary reason for this is that there are character properties associated with each encoded character, and these properties are used by many implementations of Unicode to get actual work done. Therefore, the letter V (U+0056, LATIN CAPITAL LETTER V) has a General Category of Lu (Letter, Uppercase) while Ⅴ (U+2164, ROMAN NUMERAL FIVE) has a General Category of Nl (Number, Letter).
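
    You can check that property difference for yourself. The snippet below uses Python's unicodedata module as a convenient window onto the Unicode Character Database; it is just a lookup, with nothing assumed beyond the two code points named above.

```python
import unicodedata

for ch in ("V", "\u2164"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # General Category
        unicodedata.numeric(ch, None),  # None when there is no numeric value
    )
```

    The letter has no numeric value at all, while the Roman numeral carries the value 5 along with its Nl category.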

    And yes, even that claim falls apart a little since the hexadecimal digits ABCDEF are not separately encoded for reasons of backwards compatibility with decades of existing practice on computers which is not the case with Roman numerals. Even the argument for having encoded the Roman numerals is a little specious since for the most part they have not been encoded and when they are the style never seems to be consistent typographically. Though YMMV since you may have better fonts than I do! Try "ⅯⅭⅯⅬⅩⅩⅩⅨ" for the test....

    All of this goes to show that Unicode is a very complex standard. In the end, Unicode can always do what it needs to do without fear of the occasional contradiction, since there will always be some precedent with which to be consistent. :-)

    Ethiopic numbers are based on a different alternative system, one that can really wreak havoc with a formatting/parsing architecture like that in Windows or the .NET Framework if you try to bring Ethiopic data in without writing code to do the work (just like with Roman numerals). I'll talk about Ethiopic numbers another time....

    Yet another system, the one I will talk about here, is that of Tamil numerals. It is an additive and positional system (unlike Roman numerals, there is no subtraction involved) that has no zero but includes characters for 10, 100, and 1000.

    In the traditional system the number 3,782 would be represented as ௩௲௭௱௮௰௨ (literally Three-Thousand(s)-Seven-Hundred(s)-Eight-Ten(s)-Two, or மூன்று-ஆயிரத்து-எழு-நூற்று-எண்-பத்து-இரண்டு in Tamil).

    At least since the early 1800s, however, usage of the Tamil numerals as digits has been more and more common. Thus the number 3,782 would often be represented as ௩௭௮௨ (literally 3782).

    The following table gives a bunch of different numbers and how they are represented in both the older, more traditional style and in the "modern" style where they act as digits. Note that the table is treating U+0be6 as TAMIL DIGIT ZERO even though it is not being added to Unicode until version 4.1. Up until now the ASCII DIGIT ZERO was used as needed, as I do in the table below for display purposes, and if you want to represent these numbers before Unicode 4.1 is released you should likely use U+0030 (DIGIT ZERO). The modern Tamil column uses the LOCALE_SGROUPING setting of Tamil....

    Number (Arabic-Indic digits) | old style Tamil | modern Tamil | old style Tamil code points | modern Tamil code points
    0 | (not available) | 0 | (not available) | 0be6
    1 | ௧ | ௧ | 0be7 | 0be7
    2 | ௨ | ௨ | 0be8 | 0be8
    3 | ௩ | ௩ | 0be9 | 0be9
    4 | ௪ | ௪ | 0bea | 0bea
    5 | ௫ | ௫ | 0beb | 0beb
    6 | ௬ | ௬ | 0bec | 0bec
    7 | ௭ | ௭ | 0bed | 0bed
    8 | ௮ | ௮ | 0bee | 0bee
    9 | ௯ | ௯ | 0bef | 0bef
    10 | ௰ | ௧0 | 0bf0 | 0be7 0be6
    11 | ௰௧ | ௧௧ | 0bf0 0be7 | 0be7 0be7
    12 | ௰௨ | ௧௨ | 0bf0 0be8 | 0be7 0be8
    13 | ௰௩ | ௧௩ | 0bf0 0be9 | 0be7 0be9
    14 | ௰௪ | ௧௪ | 0bf0 0bea | 0be7 0bea
    15 | ௰௫ | ௧௫ | 0bf0 0beb | 0be7 0beb
    16 | ௰௬ | ௧௬ | 0bf0 0bec | 0be7 0bec
    17 | ௰௭ | ௧௭ | 0bf0 0bed | 0be7 0bed
    18 | ௰௮ | ௧௮ | 0bf0 0bee | 0be7 0bee
    19 | ௰௯ | ௧௯ | 0bf0 0bef | 0be7 0bef
    100 | ௱ | ௧00 | 0bf1 | 0be7 0be6 0be6
    156 | ௱௫௰௬ | ௧௫௬ | 0bf1 0beb 0bf0 0bec | 0be7 0beb 0bec
    200 | ௨௱ | ௨00 | 0be8 0bf1 | 0be8 0be6 0be6
    300 | ௩௱ | ௩00 | 0be9 0bf1 | 0be9 0be6 0be6
    1,000 | ௲ | ௧,000 | 0bf2 | 0be7 0be6 0be6 0be6
    1,001 | ௲௧ | ௧,00௧ | 0bf2 0be7 | 0be7 0be6 0be6 0be7
    1,040 | ௲௪௰ | ௧,0௪0 | 0bf2 0bea 0bf0 | 0be7 0be6 0bea 0be6
    8,000 | ௮௲ | ௮,000 | 0bee 0bf2 | 0bee 0be6 0be6 0be6
    10,000 | ௰௲ | ௧0,000 | 0bf0 0bf2 | 0be7 0be6 0be6 0be6 0be6
    70,000 | ௭௰௲ | ௭0,000 | 0bed 0bf0 0bf2 | 0bed 0be6 0be6 0be6 0be6
    90,000 | ௯௰௲ | ௯0,000 | 0bef 0bf0 0bf2 | 0bef 0be6 0be6 0be6 0be6
    100,000 [1] | ௱௲ | ௧,00,000 | 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6
    800,000 | ௮௱௲ | ௮,00,000 | 0bee 0bf1 0bf2 | 0bee 0be6 0be6 0be6 0be6 0be6
    1,000,000 [2] | ௰௱௲ | ௧0,00,000 | 0bf0 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6
    9,000,000 | ௯௰௱௲ | ௯0,00,000 | 0bef 0bf0 0bf1 0bf2 | 0bef 0be6 0be6 0be6 0be6 0be6 0be6
    10,000,000 [3] | ௱௱௲ | ௧,00,00,000 | 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6
    100,000,000 [4] | ௰௱௱௲ | ௧0,00,00,000 | 0bf0 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
    1,000,000,000 [5] | ௱௱௱௲ | ௧,00,00,00,000 | 0bf1 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
    10,000,000,000 [6] | ௲௱௱௲ | ௧0,00,00,00,000 | 0bf2 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
    100,000,000,000 [7] | ௰௲௱௱௲ | ௧,00,00,00,00,000 | 0bf0 0bf2 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
    1,000,000,000,000 [8] | ௱௲௱௱௲ | ௧0,00,00,00,00,000 | 0bf1 0bf2 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
    100,000,000,000,000 [9] | ௱௱௲௱௱௲ | ௧0,00,00,00,00,00,000 | 0bf1 0bf1 0bf2 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6

    1 - a.k.a. Lakh
    2 - a.k.a. 10 Lakhs
    3 - a.k.a. crore
    4 - a.k.a. 10 crore
    5 - a.k.a. 100 crore
    6 - a.k.a. thousand crore
    7 - a.k.a. 10 thousand crore
    8 - a.k.a. lakh crore
    9 - a.k.a. crore crore
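
    The modern column is easy to generate, so here is a small Python sketch of it. The grouping is an assumption on my part, inferred from the table rows (one group of three digits, then groups of two, which is what a LOCALE_SGROUPING value of "3;2;0" would mean), and the zero is left as ASCII just as in the table. The function names are made up.

```python
# Index 0 is ASCII zero, used for display as in the table; 1-9 are U+0BE7..U+0BEF.
TAMIL_DIGITS = "0\u0be7\u0be8\u0be9\u0bea\u0beb\u0bec\u0bed\u0bee\u0bef"

def group_3_then_2(digits):
    """Insert commas: the last three digits form one group, then groups of two."""
    head, tail = digits[:-3], digits[-3:]
    groups = [tail]
    while head:
        groups.insert(0, head[-2:])
        head = head[:-2]
    return ",".join(groups)

def modern_tamil(n):
    """Format a non-negative integer with Tamil digits and 3-then-2 grouping."""
    grouped = group_3_then_2(str(n))
    return "".join(TAMIL_DIGITS[int(c)] if c.isdigit() else c for c in grouped)

print(modern_tamil(100000))  # one lakh
print(modern_tamil(10**7))   # one crore
```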

    Some examples of both types of usage:

    • Modern practice, using Tamil digits for chapter numbers: mozi varalARu, by munucAmi varatarAcan, published by The South India Saiva Siddhanta Works Publishing Society, Tinnevelly, Limited, November 1954, p. 357-358 (page numbers from 14th Edition, December 1996).
    • Traditional practice, using the older format (and the source for large parts of the table above!): iniya tamiz ilakkaNam by yokisri cuttAnan~ta pAratiyAr, published by Kavitha Publications, p. 201-204. (you can see the scanned source of some of it here).

    Note that the traditional form is not currently handled by any code in either Windows or the .NET Framework, though it is sometimes seen in even modern contexts such as calendars. The system is not too complicated and figuring out the algorithm to parse or format with it seems like the sort of thing that would make an interesting Microsoft interview question. Though perhaps I will post some potential solutions another day....
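
    As a starting point for anyone who wants to try, here is a hedged Python sketch of a formatter for the traditional style. It is not anything that ships in Windows or the .NET Framework, the names are invented, and it only covers 1 through 999,999. I have checked it against the table rows in that range and against the 3,782 example; how it composes values that the table does not show (11,000, say) is my extrapolation rather than something the sources confirm.

```python
DIGITS = " \u0be7\u0be8\u0be9\u0bea\u0beb\u0bec\u0bed\u0bee\u0bef"  # index 1-9; no zero
TEN, HUNDRED, THOUSAND = "\u0bf0", "\u0bf1", "\u0bf2"

def below_thousand(n):
    """Format 0-999: each count precedes its unit sign; a count of one is implicit."""
    out = ""
    for unit, sign in ((100, HUNDRED), (10, TEN)):
        count, n = divmod(n, unit)
        if count:
            out += (DIGITS[count] if count > 1 else "") + sign
    return out + (DIGITS[n] if n else "")

def traditional_tamil(n):
    """Format 1-999,999 as (count of thousands)(thousand sign)(remainder)."""
    if not 1 <= n <= 999_999:
        raise ValueError("sketch covers 1 through 999,999 only")
    thousands, rest = divmod(n, 1000)
    out = ""
    if thousands:
        out += (below_thousand(thousands) if thousands > 1 else "") + THOUSAND
    return out + below_thousand(rest)

print(traditional_tamil(3782))
print(traditional_tamil(156))
```

    Past 999,999 the table switches to lakh and crore combinations, so a simple "count of thousands" recursion no longer matches; that part (and parsing) is left as the interview question.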

     

    Special thanks to Sivaraj Doddannan, Dr. N. Ganesan, and Working Group 02 of INFITT (of which they are both members) for helping to dig up the excellent resources for Tamil numbers. INFITT (International Forum for Information Technology in Tamil) is a liaison member of Unicode and has been instrumental in providing character addition and usage reports to help finish up the Tamil block in Unicode.

     

    This post brought to you by "௧௨௩௪௫௬௭௮௯" (U+0be7 - U+0bef, a.k.a. TAMIL DIGIT ONE - TAMIL DIGIT NINE)
    and they all welcome their new compadre U+0be6, which is coming soon to a Unicode near you!
