July, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    It isn't really more secure in most cases

    • 20 Comments

    Sometimes, you go to log in to your Windows box. You see this friendly dialog (all screen shots c/o Virtual PC):

    (adjust for your Windows version, of course!)

    You do the requested three-fingered salute, and type in your password:

    And this is the point where I cringe. Because for at least 2-3 attempts, the next dialog I will see is:

    It is because my typing sucks, not just for speed but for accuracy. And so for me (sitting in a private office where no one can read my screen), the only benefit to having asterisks or black circles or whatever is that I just may screw up the password enough times that someone thinks I am a hacker trying to break in.

    Now this is not just a Windows thing; every operating system does this. The belief that the password I am typing (and I obviously must at least know what I am typing) is more secure because it is obfuscated is everywhere.

    But to be frank, from an accessibility standpoint I would prefer that they just made the password visible in the dialog. That way I can see when I messed up and avoid worrying security people who probably have real issues to deal with.

    Now I am sure there are worries about screen scraping programs and other such things, but something remembering my keystrokes seems pretty devastating too. But we think our way around those things usually. So why not make it an accessibility option to make the password visible, for those people whose typing on a scale of 1 to lame has reached the point of lame?

    Sorry for the rant, but it took me much longer to log on today than it ought to have....

  • Sorting it all Out

    Show me the [small]money!

    • 19 Comments

    SQL Server's currency data types have some interesting international features. And some of the intricacies of those features have some interesting international implications. I figured as long as we were here I could talk about some of them....

    The money and smallmoney topic in MSDN gives the basics of the datatypes:

    money

    Monetary data values from -2^63 (-922,337,203,685,477.5808) through
    2^63 - 1 (+922,337,203,685,477.5807), with accuracy to a ten-thousandth of a monetary unit. Storage size is 8 bytes.

    smallmoney

    Monetary data values from -214,748.3648 through +214,748.3647, with accuracy to a ten-thousandth of a monetary unit. Storage size is 4 bytes.

    Both types are pretty much scaled integer types rather than floating point values (the latter would freak out a lot of people when it comes to money, so it makes sense to build the types this way). Though of course if you need more than four decimal places it is recommended that you use the decimal data type (there are apparently currencies for which it is recommended to store more than four decimal places to help with complex calculations).
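
    To make the "scaled integer" idea concrete, here is a minimal sketch in Python. It is only an illustration of the arithmetic, not SQL Server's actual implementation; the names SCALE, to_scaled, and from_scaled are made up for this example, and the limits are simply the 8-byte and 4-byte signed integer ranges implied by the storage sizes above.

```python
from decimal import Decimal

# Assumption for illustration: the value is stored as a signed integer
# counting ten-thousandths of a monetary unit.
SCALE = 10_000

MONEY_MIN, MONEY_MAX = -2**63, 2**63 - 1            # 8-byte signed integer
SMALLMONEY_MIN, SMALLMONEY_MAX = -2**31, 2**31 - 1  # 4-byte signed integer


def to_scaled(amount: str) -> int:
    """Convert a plain decimal string to an integer count of ten-thousandths."""
    scaled = Decimal(amount) * SCALE
    if scaled != scaled.to_integral_value():
        raise ValueError("more than four decimal places")
    return int(scaled)


def from_scaled(units: int) -> Decimal:
    """Convert an integer count of ten-thousandths back to a decimal amount."""
    return Decimal(units) / SCALE


print(from_scaled(MONEY_MAX))
print(from_scaled(SMALLMONEY_MIN))
# Integer arithmetic stays exact where binary floating point would not:
print(to_scaled("0.1") + to_scaled("0.2") == to_scaled("0.3"))
```

    Dividing the integer limits by 10,000 gives back the documented ranges, which is a quick way to check that the description of the types as scaled integers holds up.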

    You also cannot include currency grouping separators (commas for en-US) unless you pass the value as a string -- in which case you will want to be sure that the currency grouping and decimal separators all match the language of the session you are in. I usually like to put in straight numbers and not worry about dependencies on the language settings, myself.

    But the really interesting information is in the topic entitled Using Monetary Data. What this datatype allows is any currency symbol to be put in front of the number, even if it is not in a string (enclosed by single quotes) in a Transact-SQL clause. Basically, all of the following currency signs are supported:

    Codepoint   Symbol   Name
    U+0024      $        DOLLAR SIGN
    U+00a3      £        POUND SIGN
    U+00a4      ¤        CURRENCY SIGN
    U+00a5      ¥        YEN SIGN
    U+09f2      ৲        BENGALI RUPEE MARK
    U+09f3      ৳        BENGALI RUPEE SIGN
    U+0e3f      ฿        THAI CURRENCY SYMBOL BAHT
    U+20a1      ₡        COLON SIGN
    U+20a2      ₢        CRUZEIRO SIGN
    U+20a3      ₣        FRENCH FRANC SIGN
    U+20a4      ₤        LIRA SIGN
    U+20a6      ₦        NAIRA SIGN
    U+20a7      ₧        PESETA SIGN
    U+20a8      ₨        RUPEE SIGN
    U+20a9      ₩        WON SIGN
    U+20aa      ₪        NEW SHEQEL SIGN
    U+20ab      ₫        DONG SIGN
    U+20ac      €        EURO SIGN

    As a bit of trivia, if you look at the Using Monetary Data topic, it has the same table as above, sorted in code point order, with one exception: the Euro (U+20ac) is placed just before U+20a1. The reason for this is that once upon a time (in the original Books Online topic that shipped with the initial release of SQL Server 2000), the documentation listed U+20a0 as the euro.

    Now the code in SQL Server did not do this (since that was not really the euro), and if you tried to use U+20a0 (₠, a.k.a. EURO-CURRENCY SIGN) as a currency sign in a money or smallmoney column, it would not work.

    When they finally fixed the documentation, it was I suppose easier to update the table by updating the two entries without moving stuff around in the table....

    Interestingly, I just looked in SQL Server 2005 Books Online and this table has not been updated there, either to add new entries or to fix that one ordering issue. Oops. But that is kind of minor, no sense worrying about that....

    Now for the real problems -- you knew there would be real problems, didn't you? :-)

    There are 22 characters in the Currency Symbols block (only 11 of which SQL Server recognizes in this case). Most importantly, there are 41 characters in the Sc (Symbol, Currency) general category (only 18 of which SQL Server recognizes in this case). For both of these you can look to the links to see the list of currency symbols....

    I would be a lot happier if SQL Server were looking for a return of UnicodeCategory.CurrencySymbol from CharUnicodeInfo.GetUnicodeCategory or some other convenient way of getting the currency symbols and treating them that way, of course.
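
    As a rough illustration of that category-based approach, here is a sketch in Python rather than .NET. It uses the standard unicodedata module, whose category function plays roughly the role that CharUnicodeInfo.GetUnicodeCategory does; the helper names are made up for this example. How many characters it finds depends on the version of the Unicode Character Database bundled with your Python, so I will not quote a count.

```python
import sys
import unicodedata


def is_currency_symbol(ch: str) -> bool:
    """True if the character's general category is Sc (Symbol, Currency)."""
    return unicodedata.category(ch) == "Sc"


def currency_symbols() -> list[str]:
    """Every code point in the running Python's Unicode data with category Sc."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Sc"]


symbols = currency_symbols()
print(len(symbols), "characters in category Sc")
print(is_currency_symbol("\u20ac"))  # EURO SIGN
print(is_currency_symbol("\u20a0"))  # EURO-CURRENCY SIGN
```

    A check like this picks up new currency symbols whenever the underlying Unicode data is updated, instead of depending on a hard-coded list.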

    Or alternately, it would be cool if they removed some of the items that no longer really exist since they have converted to the euro now, and maybe added some more in.

    However, I will now take a step back and not ask for those features just yet....

    Note that you can just insert your currency values with any of these currency symbols in front of them. And the values will be inserted. As Is. Which may not be what you want if you deal with €100 vs. ¥100 vs. ₩100 for example (since €100 is about ¥13,541 or ₩126,178 by today's fix!).

    Now that currency symbol is not stored, either -- the currency's identity is eliminated after the insert. Fill in your own disaster sequence on this one -- and make sure to be careful of what you insert in your application....

    Now I have worked with the Cloanto Currency Server several times, and would highly recommend them to people who would want to deal with different types of currencies and do conversions. It is pretty cool having the results at your fingertips and available through both automation and .NET, too. Very cool stuff (I have been a loyal user since I first tried it back in 1999 while I was writing my book -- there is even a sample on the book's CD!).

    Anyway, the best practice for SQL Server is to just keep the money and smallmoney columns with a single currency, or store the currency type in another field. It will keep you from doing something you did not intend to do with the database. Big mistakes (where big is defined as scope of effect; the actual mistake is usually a small design issue) in these sorts of columns are the surest way to find oneself looking for another job....

     

    This post brought to you by "₠" (U+20a0, a.k.a. EURO-CURRENCY SIGN)

  • Sorting it all Out

    Getting intermediate forms

    • 19 Comments

    Unicode has a certain complexity to it that can at times be challenging.

    Let's take for example U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE. Here is what it looks like (how good will depend on your OS and browser support!):

    Now obviously that is pretty fully precomposed (in Unicode Normalization Form C). If it is fully decomposed, we get U+0065 U+0302 U+0303, a.k.a. LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE. Here is what it looks like (again, how good will depend on your OS and browser support!):

    ễ

    And here is where the problems come in. Because between these two extremes lies a third case: U+00ea U+0303 a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX + COMBINING TILDE. Here is what it looks like (again, how good will depend on your OS and browser support!):

    ễ

    Now if you convert that third case to NFC you will get the first case, and to NFD you will get the second. How does that happen?

    Well, the rules for normalization are that you have to keep on performing the composition or decomposition until you can't anymore.

    So, there are two ways to get the information of that last case:

    1. You can cart around the decomposition info from the Unicode Character Database so you can get it all yourself.
    2. You can take the NFD string and start converting to NFC with one additional character at a time, thus:

    Step 1:   Convert the string to NFD; we now have: U+0065 U+0302 U+0303

    Step 2:   U+0065 + U+0302 to NFC == U+00ea; we now also have U+00ea U+0303

    Step 3:   U+00ea + U+0303 to NFC == U+1ec5; we now also have U+1ec5
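
    The steps above can be sketched in a few lines of Python using the standard unicodedata module. This is just the same quick and dirty idea written down, under the assumption that the input is a single base character plus its combining marks; the function name intermediate_forms is made up for this example.

```python
import unicodedata


def intermediate_forms(text: str) -> list[str]:
    """Return the equivalent forms found by composing the NFD string
    one additional character at a time, from fully decomposed to fully composed."""
    nfd = unicodedata.normalize("NFD", text)
    forms = []
    for k in range(1, len(nfd) + 1):
        # Compose only the first k characters and leave the rest decomposed.
        candidate = unicodedata.normalize("NFC", nfd[:k]) + nfd[k:]
        if candidate not in forms:
            forms.append(candidate)
    return forms


for form in intermediate_forms("\u1ec5"):
    print(" ".join(f"U+{ord(ch):04x}" for ch in form))
```

    For U+1ec5 this walks through the same three forms as the steps: U+0065 U+0302 U+0303, then U+00ea U+0303, then U+1ec5.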

    Now this is not what I would call a perfect algorithm by any stretch of the imagination. But it is a quick and dirty way to get the information on a bunch of equal forms.

    But it certainly leaves open the question of whether the operating system and/or the .NET Framework should expose this information at some point....

     

    This post brought to you by "ễ" (U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE)

  • Sorting it all Out

    The so-called Ultimate Keyboard

    • 17 Comments

    It all started on Friday, July 15th, at 2:43am. I received the following mail:

    Hi Michael

    I read your blog everyday, and I really enjoy it. Normally I would never contact someone who's blog I read (nor any other celebrity) but I saw this and I thought "I wonder if Michael Kaplan has seen this" I didn't act on it for 24hrs but now I still think you would like to know about it. Anyway enough waffling here is the link

    http://www.artlebedev.com/portfolio/optimus/

    If you have already seen this or you don't care then I'm sorry to have wasted your time.

    Andy

    Well, I'm not a celebrity or anything like that. I had not yet at that point seen the info on this keyboard.

    Roughly 53 minutes later, someone had posted it to the Unicode List -- with the link and not much else.

    I'll give a smattering of the feedback after that....

    JC Helary asked:

    And what about languages that work with syllable input ?

    First letter input: the keyboard displays something like qwerty, second letter input: the keyboard changes to what is possible in combination with the first letter ?!?!?

    JC Helary

    ps: the concept itself is very interesting, also the fact that the keyboard was a Mac one.

    And then Don Osborn stated:

    ... and all of this without footpedals or levers.

    The implications are significant and not only for extended and non-Lestern scripts. Once you are no longer constrained by what is painted on the keys at the factory, a lot of possibilities are opened up in addition to facilitating multilingual use of any given keyboard.

    Ultimately a post-QWERTY world? Well if the keyboard is not dedicated to one layout, even users of one language are not obliged to learn and stay with the legacy system. I.e., it could facilitate learning and use of alternative layouts such as Dvorak for English without requiring a hardware change away from the legacy layout.

    Don Osborn
    Bisharat.net

    I weighed in at this point, myself, with a point I have mentioned here:

    All of this implies that one cannot have a keyboard layout that does not match the letters painted on the keys. Which is of course false (although some people have trouble with the concept, even though they have no problem with shift states that may vary from those letters....).

    To which Don replied:

    Hi Michael. I agree the problem with the painted keys is not what's under them of course, but what one is in the user's head (or isn't, or can't be kept in it...). Remembering which key was reassigned or what such and such key means with Alt/Ctrl, etc., may be easy when it's your own setup. But if you have more than one keyboard, or one with a lot of modifications, or an unfamiliar layout, or are a person altogether new to keyboards, it may be difficult to remember or keep track of. Hence the LED keys can be a great aid.

    Re Dvorak, yes one can select it easily. But if you are learning typing, it's a whole lot easier if you can check the keys (or at least don't have the "wrong" layout looking back at you). One thing going against movement to Dvorak - not that I have a thing about that but some do - is that it's not ideal to learn/use a layout different than what you see, and no one will buy a Dvorak over a QWERTY since so many others have learned the latter. The LED key keyboard in effect levels the field because the keyboard can visually represent any layout you select.

    The changing visual key sequences idea that Jean-Christophe asked about is another interesting dimension.

    Don

    JC Helary then responded:

    I was specifically thinking about Japanese input. And I am curious how that would be implemented. Of course, it is not so much the sillabic part of the input that is a problem (there are Japanese kb layouts that assign one syllable to one key, even most Japanese users type with combinations of qwerty  letters.) It is rather the fact that the qwerty input part is converted 2 times: first time when the second letter is associated to the first to produce a Japanese kana, second when the kana (on its own or associated with other kanas) is converted to a kanji.

    JC Helary

    If you understand how the IME works, then you will understand how at this point, the wheels have started to come off the wagon. Yes, it is true that you enter Kana and then a candidate list comes up. But the list expects you to use arrow keys or similar methods to select items. There is never a time that the changing keyboard will have an opportunity to show the candidates because the various letter keys on the keyboard are not where one selects the Kanji -- that is where one types more kana. I am not going to knock this keyboard just yet, but let's not give it powers it cannot possess!

    Don then put the conversation back on track:

    Another possibility that came to mind was Ethiopic.

    In any event the potential to readily view - on the keys themselves - alternative layouts, extended and non-Latin characters, and to have it respond to input in cases like you describe could be a great equalizer in perceptions of scripts and their value/utility. Of course it's a practical thing, first and foremost, with specific advantages to many users, but it's also a matter of psychology (visual information being so critical in our perceptions). And it's in the latter that the impact could be revolutionary. (Of course it's also possible that other technologies will overtake it in terms of reshaping how we input text in various scripts, but I won't take that tangent...)

    Anyway, this is the last I'll post on this topic until we hear more from the keyboard designers or someone who has a chance to evaluate the prototype.

    Don

    MJ then added an interesting tidbit but gave no explanation:

    Dear all,
    We are already using such keyboard for Bangla for 18 years.
    Thanks
    MJ

    George W. Gerrity then weighed in:

    I have been waiting for such a keyboard for over 20 years. I even tried to get one with LED matrices on keys, manufactured to my specification about 15 years ago, with no luck (at least, at a cost I could afford).

    I am surprised that there is even some uncertainty as to how such a keyboard might be used for non-latin scripts. Consider, for instance, its use with Simplified (mainland) and Traditional (Taiwan) Chinese and Japanese: In all three cases, there exist four or five commonly-used input methods designed to be used with a standard QWERTY keyboard, that depend on remapping keys. The only alternatives are expensive, huge (largely unstandardised) keyboards originally designed for typesetters.

    Users who normally enter text in Chinese and Japanese have no trouble with the remapping because usage patterns develop for touch typing just as for users of QWERTY keyboards for Latin characters. The problem arises for users like myself, who occasionally want to enter Chinese, Hebrew, Greek, etc, and simply can't remember which keys go with which method.

    Having such a keyboard, whose visual indicators can be programmed to change with input method and with shift, opt and alt keys, would be a real boon.

    George

    Ken Whistler had some additional thoughts about problems with trying to use this keyboard, especially with IMEs:

    The whole concept of moving your hands *off* the keys to *read* the keys, and then back on the keys to key your input -- particularly for complicated IME's that change the state of input as you go -- strikes me as at best un-ergonomic and at worst frustrating and inefficient.

    And Raymond Mercier extended this thought:

    We are more bound by the tactile experience of the keys than we imagine, even those of us who are not real touch typists. I tried a new keyboard recently (mis-designed by Belkin) that had the unfortunate characteristic that the character appeared on the screen only after the finger was lifted from the key, instead of appearing on the downstroke of the key. It was quite impossible to type at normal speed. That experience rather suggests to me that normal typing would be very frustrated by having to pay so much visual attention to the keys.

    And Gregg Reynolds responded to these thoughts:

    Depends on how often you need to do it.  I can touch type on an Arabic keyboard, but there are always those rarely-used keys that I can't quite remember.  In that case, it would be a big time-saver if the keyboard could change dynamically from English to Arabic.

    I think it would be hugely useful for learning e.g. Vi.  Assuming that is that it can be programmed to do something like:  hold down shft-ctl-z, and little arrows pop up on the h/j/k/etc keys.  Or in emacs, hold down some key combo and some kind of mnemonics show up indicating e.g. ctl-f for forward-char.  Much easier than trying to remember where such info is in online help.

    If they can keep the price point somewhere in the vicinity of ordinary keyboards they will own the corporate marketplace, or at least that segment whose workers use multiple languages.  I have the Arabic keyboard memorized, so I don't generally need the letters printed, but
    most of my colleagues covet such a keyboard, even if it only uses stickers.  They could just print out an image of an Arabic keyboard and pin it next to the monitor, but they want the info on the keys.

    I would think a dynamic keyboard display could do the same (in principle).

    Now in the past I have mentioned the cool soft keyboard that the Tablet PC version of Windows XP has, which repaints its "keys" based on the chosen keyboard layout. It works quite well with the idea Gregg is talking about.

    It really is impossible to know how a keyboard will work here, though I suspect that many of the problems related to fingers having to be lifted to see the characters are really just the tip of the iceberg. But before I get into that, Suzanne E. McCarthy covered some thoughts on this new keyboard in her blog (abecedaria):

    The Optimus Keyboard

    Optimus Update

    A Post QWERTY World (contains many of the same comments I quoted above)

    Now, where was I? Oh yes, the tip of the iceberg.

    The code that one must write to interrogate a keyboard layout using Win32 keyboarding APIs in even a single shift state is non-trivial. Now add to that a need to keep running that code on a per-thread basis (since the input language is indeed a per-thread setting across the whole operating system), and the fact that this would have to be user-mode code (the kernel APIs to do all of these interrogations do not really exist). Therefore, to work reliably on Windows, one would probably require a low level keyboard hook that would grab any keystroke made (to know when to change for shift states or dead keys).

    Yikes, the performance implications of having to do all of that on a per-thread basis any time one switches to different UI threads are somewhat frightening. Though the code complexity issues also seem rather formidable. I actually know a few things about the code needed here (I had to write it for MSKLC's File|Load from existing... functionality. How else to interrogate a layout?). But the notion of having to do this as often as the hardware would have to (and having to integrate interim states like dead keys and possibly even typographical ligatures if one were truly adventurous!) is quite a challenging one.

    One day I might go through the full process here to help give some appreciation of the level of work. :-)

    And this is just for Windows. The work for any supported platform would likely not match other platforms.

    Perhaps there are plans to do this differently -- perhaps interrogating the keyboard layout DLL directly. Not impossible but also not trivial. And hard to manage across all of the possible keyboard layouts.

    My prediction (well, my guess) about the future of the keyboard would be that it will not end up being completely dynamic. It will just show the base state (like the hardware keyboard does now), and it will switch when the regular keyboard layout does. A lot like what the Tablet PC keyboard does now, which gives a good balance of the performance versus functionality issues.

     

    This post brought to you by "ᄧ" (U+1127, a.k.a. HANGUL CHOSEONG PIEUP-CIEUC)

  • Sorting it all Out

    Typing in random Unicode code points redux

    • 14 Comments

    It was about two months ago when I pointed out a method for typing in random Unicode code points using the Unicode IME.

    Well, Andrew over at http://www.fileformat.info (the cool provider of my Unicode character links!) has been getting feedback on this issue for a long time, such that his How to enter Unicode characters into Microsoft Windows page is by his own report the page that receives the most feedback.

    Anyway, he wrote a small program that will make the entry easier -- the UnicodeInput Utility.

    Handier in a lot of ways than that IME and with an easy mechanism for launch, it is a good solution for a generic answer to the question of how to enter potentially random code points. Check it out! :-)

  • Sorting it all Out

    Is the Optimus keyboard just a myth?

    • 11 Comments

    From my mailbox in response to The so-called Ultimate Keyboard:

    I think most of the Western auditory didn't get Optimus concept correctly. It is only a concept, it is not existent physicallly yet. The idea was created and then it was instantly rendered using their numerous 3D resources (they have a plenty of designers, animators, and even the straight artists in the team).

    Artemy Lebedev accumulated the powerful team of many famous Russian web-designers/creativity minds. And they have a long history of a "designing" various fanny/joy artefacts (http://www.artlebedev.ru/portfolio/id/). It could end up with the physical things for sell. But they don't have enginnering stuff in the team, and I'm very sceptical they have any engineering plan here, besides the original idea. Will glad to get corrected here.

    I would be too. But given the original skeptics I cited and my own doubts, I can only say that this keyboard is seeming less and less likely with each passing day....

  • Sorting it all Out

    Why doesn't FoldString take an LCID?

    • 11 Comments

    I actually recall asking Julie Bennett this question (why doesn't FoldString take an LCID?) a few years ago, and her answer was that none of the foldings that the function did were locale-dependent.

    Now that is true for some of the foldings:

    • MAP_FOLDCZONE (Fold compatibility zone characters into standard Unicode equivalents.)
    • MAP_PRECOMPOSED (Map accented characters to precomposed characters, in which the accent and base character are combined into a single character value.)
    • MAP_COMPOSITE (Map accented characters to composite characters, in which the accent and base character are represented by two character values.)

    Now it would be nice if MAP_FOLDDIGITS (Map all digits to Unicode characters 0 through 9) were a reversible operation. And the only way it ever could be would be to have an LCID parameter to specify what to map it to. Though as a workaround you could use the GetLocaleInfo function with the LCID you wanted to use and the LOCALE_SNATIVEDIGITS flag to map them. So this is not the end of the world.
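
    Here is a small Python sketch of why the folding is a one-way street without extra information. It does not call FoldString or GetLocaleInfo; fold_digits and to_native_digits are hypothetical helpers, and the ten-character native_digits string is assumed to stand in for the kind of value LOCALE_SNATIVEDIGITS gives you (the native equivalents of 0 through 9, in order).

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Map every decimal digit (general category Nd) to ASCII 0 through 9."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )


def to_native_digits(text: str, native_digits: str) -> str:
    """Map ASCII digits to the ten native digits supplied by the caller."""
    if len(native_digits) != 10:
        raise ValueError("expected exactly ten native digits")
    table = {ord(str(i)): native_digits[i] for i in range(10)}
    return text.translate(table)


# Arabic-Indic digits U+0660..U+0669, supplied by the caller (the "locale" part).
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))

folded = fold_digits("\u0661\u0662\u0663")
print(folded)
print(to_native_digits(folded, ARABIC_INDIC) == "\u0661\u0662\u0663")
```

    Folding needs no locale because every digit has exactly one ASCII equivalent, but going back requires someone to say which set of native digits is wanted, which is exactly the information an LCID would carry.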

    However, that last flag, MAP_EXPAND_LIGATURES, is the tough one.

    The flag has a simple job -- it expands all ligature characters so that they are represented by their two-character equivalent. For example, the ligature 'æ' expands to the two characters 'a' and 'e'.

    And this flag works by consulting the very same table that the collation functions use to find these characters that expand. Generally this will all work.

    EXCEPT there are many languages where æ is considered to not be one of those characters to expand. This includes Danish, Norwegian, and Icelandic. And if FoldString took an LCID, then the exact list it has at its disposal could change as needed for each circumstance.

    Unfortunately, there really is no good way to do this language-specific type of thing right now. Which means we may be expected to try and tackle the issue in some future release....

     

    This post brought to you by "æ" (U+00e6, a.k.a. LATIN SMALL LETTER AE)

  • Sorting it all Out

    A Microsoft convention for compressions in sorting

    • 10 Comments

    The other day, a developer named Stephanie sent me an email about compressions (these are used in collation when two or more characters are given a single sort weight -- the Unicode Collation Algorithm calls their analogous construction a contraction, in part to avoid confusion with other meanings of the term compression that are described in Unicode). She had just read Dr. International's description of the difference between Traditional and Modern Spanish here, and asked:

    I did some experimentation and found that I saw the described results for CH, Ch, and ch, although the article only mentions CH. In any case, cH is not included. Can you explain these two discrepencies?

    Also, why wouldn't one of these be an alternate sort?

    Stephanie, you are right -- for every compression we define for a cased script we handle the UU, UL, and LL forms, but we skip the LU form.

    This was originally a point of confusion for me as well, but Cathy Wissink set me straight back in the early days when she pointed out to me that words may be ALL CAPS or they may be all lowercase and they may be Initial caps, but in most languages there is no pattern that puts capital letters in the middle of text that is otherwise not capitalized. The convention we use for compressions is designed to take this reality into account and handle the expected cases while discarding the one that is unexpected.
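
    The convention is easy to express in code. Here is a minimal Python sketch that lists the case forms covered for a compression; it only illustrates the convention as described above and is not the actual sorting table code. The helper names are made up for this example.

```python
def compression_forms(compression: str) -> list[str]:
    """Return the ALL CAPS, Initial caps, and all lowercase forms of a compression."""
    lower = compression.lower()
    return [lower.upper(), lower.capitalize(), lower]


def is_compression(text: str, compression: str) -> bool:
    """True if text is one of the handled case forms of the compression."""
    return text in compression_forms(compression)


print(compression_forms("ch"))
print(is_compression("Ch", "ch"))
print(is_compression("cH", "ch"))  # the LU form is deliberately not covered
```

    For the Traditional Spanish "ch" this yields CH, Ch, and ch, and leaves cH to be treated as two separate letters.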

    The Dr. International article isn't wrong here, though. I will often speak of a compression by just naming the one form when I mean all three forms; it is just a convenient way to express what compressions exist for a language, or a particular sort within a language.

    As to your final concern, I agree with you -- there ought to be an alternate sort used here. I actually even pointed this out in the past (described here). The truth is that alternate sorts did not exist then. They were added specifically in the postmortem over handling this issue with Spanish!

     

    This post brought to you by "ש" (U+05e9, a.k.a. HEBREW LETTER SHIN)

  • Sorting it all Out

    All code page architectures are created equal

    • 10 Comments

    Yes, I said it -- all code page architectures are created equal. But in the most Orwellian sense, some are more equal than others....

    First I will digress into a favorite Ogden Nash poem of mine, which is very short. I pretty much memorized it:

    Let's talk about eggs:
    Eggs have no legs.
    Let's talk about chickens:
    Chickens do have legs.
    The plot thickens --
    eggs come from chickens!
    But they have no legs under 'em
    What a conundrum!

    Why this poem popped into my head may become apparent shortly. If not then it is still a nice poem (Ogden Nash at his finest!).

    Anyway....

    If you look at the official, sanctioned encoding architectures owned by the GIFT team, there are three of them:

    • The Win32 NLS API model, used by the unmanaged universe and which sports a very C-focused model;
    • The MLang model, used by Internet Explorer and which sports a COM-based model;
    • The .NET Framework model, used by the managed universe and which sports a managed code model.

    (there is a fourth model for Kernel mode and the Rtl* functions that can be used in both kernel and user mode, but I will cover that another day -- for my purposes here just consider it for now like Win32 but more limited!)

    If these were three entirely separate models, it all might be easier. However:

    • For MLang, many code pages call the Win32 code, occasionally in edge cases returning HRESULTs that in many cases cause exceptions to be thrown
    • There are several code pages which have bugs in edge cases in Win32 that were fixed in MLang
    • The 1.0 and 1.1 versions of the .NET Framework code are thin wrappers around the Win32 code (maybe sometimes using MLang to try and work around bugs)
    • The 2.0 managed code started over and tried to fix many of the problems in the two other models (along the way becoming smaller and dare I say a bit faster!), yet in many ways based on the original work

    Talk about conundrums -- these three models are so interrelated even though there are so many times that their behavior differs that I doubt anyone will ever be able to sort out the behavioral differences.

    It represents complex pieces of code in three code bases written across nine versions of Windows, three versions of IE, and three versions of the BCL, using unmanaged, managed, and COM based code. It is very hard to figure out what is a bug to fix, what is a bug we are stuck with for backcompat reasons, and what is an intentional feature that only looks like a bug because the behavior was not documented well enough. You can get a headache trying to figure it out sometimes (and many have!).

    So what does it all mean?

    Well, as Shawn Steele, the owner of the bulk of this complex set of code bases likes to say, people ought to just be using Unicode. And Shawn is spot on here -- the more complex the code page work you do, the more likely you are to run into problems with the use.

    Now I do not include UTF-8 (or even UTF-32 in the .NET Framework) with the rest of those code pages, since it is a Unicode encoding form and all, but just about everything else ought to be a "use if you have to convert something, but then once it is converted stop using!" model.
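
    A tiny Python sketch of that "convert once, then stop using it" model follows. It uses Python's own codec tables (not the Win32, MLang, or .NET ones discussed here), and to_unicode is a made-up helper name; the point is only that the same byte means different things until you commit to a code page and move to Unicode.

```python
def to_unicode(raw: bytes, code_page: str) -> str:
    """Decode legacy bytes once, at the boundary; work in Unicode afterwards."""
    return raw.decode(code_page)


legacy = b"\x80"  # one byte whose meaning depends entirely on the code page

as_western = to_unicode(legacy, "cp1252")   # Windows Western European
as_cyrillic = to_unicode(legacy, "cp1251")  # Windows Cyrillic

print(f"U+{ord(as_western):04x}", f"U+{ord(as_cyrillic):04x}")

# Once in Unicode, a Unicode encoding form such as UTF-8 round-trips safely.
print(as_western.encode("utf-8").decode("utf-8") == as_western)
```

    After the one conversion at the edge, nothing downstream has to know or care which code page the data started in.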

    But please just try to use Unicode, like the operating system and the .NET Framework prefer, and were basically designed for....

     

    This post brought to you by "ೡ" (U+0ce1, a.k.a. KANNADA LETTER VOCALIC LL)

  • Sorting it all Out

    There is no such thing as a surrogate character (dammit!)

    • 8 Comments

    The title of this post, including the parenthetical note, is something that people associated with the Unicode Standard have to tell people all the time (of course generally people only say that parenthetical note to themselves, and really only because they have to say it so many times!).

    The issue is clear in both the Unicode Glossary:

    Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.

    and the Unicode FAQ:

    Q: Are surrogate characters the same as supplementary characters?

    A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

    There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point).

    In fact, if you look to the Unicode Roadmap, each plane has its own name:

    • Plane 0: BMP (Basic Multilingual Plane)
    • Plane 1: SMP (Supplementary Multilingual Plane)
    • Plane 2: SIP (Supplementary Ideographic Plane)
    • Plane 14: SSP (Supplementary Special-Use Plane)

    They are supplementary characters, one and all. They are not surrogate characters. Truly.
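
    If it helps to see the two ranges side by side, here is a small Python sketch of the distinction the FAQ is drawing. The function names are mine (they are not from any library), and the arithmetic is just the standard UTF-16 mapping from one supplementary code point to a pair of surrogate code points.

```python
from typing import Tuple


def is_surrogate_code_point(cp: int) -> bool:
    """True for code points in U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_supplementary_code_point(cp: int) -> bool:
    """True for code points in U+10000..U+10FFFF."""
    return 0x10000 <= cp <= 0x10FFFF


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Return the (high, low) surrogate code points that represent
    a supplementary code point in UTF-16."""
    if not is_supplementary_code_point(cp):
        raise ValueError("not a supplementary code point: U+%04X" % cp)
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# U+1D11E MUSICAL SYMBOL G CLEF is a supplementary character in Plane 1.
high, low = to_surrogate_pair(0x1D11E)
print("U+%04X U+%04X" % (high, low))
```

    One supplementary character goes in and two surrogate code points come out, and neither of those two is a character on its own.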

    This is easy, right?

    Of course even the clearest intention will not always find itself communicated properly, which is why the Char.IsSurrogate method will have text like "Indicates whether a Unicode character is categorized as a surrogate character" or when the Windows CE docs say "For sorting, all surrogate pairs are treated as two Unicode code points. Surrogates are sorted after other Unicode code points, but before the PUA (private user area). Sorting for a standalone surrogate character (that is, either the high or low character is missing) is not supported.". I do mind the not-entirely-accurate statement about the collation, but I will talk about that another day!

    I do not mind the surrogate character usage like that in the previous paragraph so much, as it is a more benign error -- when people say surrogate character in this context, they mean to say surrogate code point. It is a harmless error, and it even shows up as a NULL glyph as if it were a character of some sort, and we can just fix the documentation language at some point (hopefully soon, but I will not lose sleep if they do not).

    The real problem case is when they try to equate the term surrogate character with the term surrogate pair. It gets worse when they compound it by naming a method that way, like the XmlWriter.WriteSurrogateCharEntity method, whose docs, in addition to the evil method name, say things like:

    When overridden in a derived class, generates and writes the surrogate character entity for the surrogate character pair.

    This is a bit harder to fix (not the doc portion, but the method name, which obviously cannot be removed).

    But we'll figure something out. Eventually.

    Until then, please remember what the title of this post is telling you -- there is no such thing as a surrogate character!

     

    This post brought to you by U+D800, the first surrogate code point -- not a surrogate character!
    (This code point has come to terms with his lack of character-ness, but has mentioned that the fact that no one else has may put him into therapy)

  • Sorting it all Out

    New in Vista Beta 1: FindNLSString (an 'internationalized' strstr)

    • 8 Comments

    This is an example of the kind of features that we in NLS can add to a product -- not as fancy as transparency and other cool Vista stuff that gets all of the press coverage. But there is a certain class of people, a class with a big overlap with those who read this blog, who may find it to be quite interesting. I am not going to leak anything that is not available in the legitimate beta that may be in your hands right now (or could be some time soon!), so don't get too excited. But there are some very cool features that are going into Windows Vista that may be fun for geeks like me, so consider this the first of many such notices. :-)

    The strstr function has been a part of the C Runtime for ages. Its simple job is explained in the docs: "Returns a pointer to the first occurrence of a search string in a string."

    But of course that function (or its Unicode cousin, wcsstr) would never do any of the interesting fun things that CompareString is so famous for, from ligature equivalences (U+00e6 æ being equal to the letters ae for most locales) to Unicode canonical equivalences and more.

    So for Vista we have added an NLS version of this long-existing functionality -- the FindNLSString function!

    The Vista Beta 1 SDK will be available soon, so consider this a marketing preview of the new function. :-)

    If you are a developer who has already picked up Beta 1 of Vista off of the MSDN servers, this function is exported from kernel32.dll and gives you all of the functionality of the managed methods off of CompareInfo (i.e. IsPrefix, IsSuffix, IndexOf, and LastIndexOf).

    The new FindNLSString has one extra bit of functionality that neither wcsstr nor those managed methods have ever had before -- an OUT param that will allow the caller to find out the length of the string that was found (which may not be the same size as the search string!). Now if you think about what the FindNLSString function may be used for (a good example is someone using the ReplaceText common dialog to replace one string with another), what better way to mess up an operation than to not know the length of the string that was actually found? I mean, it is all well and good for the Unicode standard to say that U+00e5 (LATIN SMALL LETTER A WITH RING ABOVE) is canonically equivalent to U+0061 U+030a (LATIN SMALL LETTER A + COMBINING RING ABOVE), but if your replace operation starts improperly detecting the subset then it will not be a very effective replace operation, now will it? :-)
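
    To make the "found length" point concrete, here is a rough Python sketch. It is emphatically not FindNLSString: find_equivalent and replace_first are hypothetical helpers of my own, and they only know about Unicode canonical equivalence (via NFD normalization), not the locale-sensitive things CompareString does, such as treating æ as equal to ae. Treat it as an illustration of why the caller needs the length back.

```python
import unicodedata
from typing import Optional, Tuple


def find_equivalent(haystack: str, needle: str) -> Optional[Tuple[int, int]]:
    """Return (index, matched_length) of the first substring of haystack
    that is canonically equivalent to needle, or None if there is none."""
    target = unicodedata.normalize("NFD", needle)
    if not target:
        return None
    for start in range(len(haystack)):
        # Do not begin a match in the middle of a combining sequence.
        if start > 0 and unicodedata.combining(haystack[start]):
            continue
        # NFD never shortens text, so a match cannot be longer than target.
        max_end = min(len(haystack), start + len(target))
        for end in range(start + 1, max_end + 1):
            # Do not stop just before a combining mark; that would be
            # "improperly detecting the subset".
            if end < len(haystack) and unicodedata.combining(haystack[end]):
                continue
            if unicodedata.normalize("NFD", haystack[start:end]) == target:
                return start, end - start
    return None


def replace_first(haystack: str, needle: str, replacement: str) -> str:
    """Replace the first canonically equivalent match, using the matched
    length rather than len(needle) to decide how much text to remove."""
    found = find_equivalent(haystack, needle)
    if found is None:
        return haystack
    index, length = found
    return haystack[:index] + replacement + haystack[index + length:]


# The search string is one code point, but the match is two code points long.
print(find_equivalent("bla\u030a!", "\u00e5"))
print(replace_first("bla\u030a!", "\u00e5", "o"))
```

    Had replace_first used len(needle) instead of the matched length, it would have removed only the "a" and left a stray combining ring sitting on top of the replacement text.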

    Now one feature that has not been added is that there are no separate 'A' and 'W' functions -- there is just one Unicode version, without decoration. The trend that started in Windows Server 2003 with IsNLSDefinedString to only add Unicode versions of functions clearly looks to be the way things will be going forward for NLS. If you are not using Unicode, then you will want to realize that you are not going to see some of the features coming out in products.

    One obvious question.... why not just call the function FindString to go along with CompareString, LCMapString, FoldString, and so on?

    Well, I did try to do that, and managed to break our private build with the change since there were so many cases of internal functions in components and utilities and Platform SDK samples named FindString. Maybe if we had reserved the name 15 years ago, we'd be all set. But even if I changed all of those cases, it is obviously something that would be a problem for users as well. Anything that is in our source code once is in customers' code hundreds of times, and I don't even want to think about how many times FindString would show up there. Calling it FindNLSString keeps that overlap from being a problem....

     

    This post brought to you by "å" (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)
    A letter that is anxiously awaiting Vista Beta 1 so that all of its different normalization forms can finally be considered equal!

  • Sorting it all Out

    The ideographic 'myth' ? Well...

    • 8 Comments

    I like Suzanne's blog, abecedaria. Even when I don't agree with what she posts, I do find myself thinking. :-)

    Her very recent post, The Ideographic Myth (named after a chapter in a book), discusses the concept championed by the book's author John DeFrancis that Chinese is not an ideographic script.

    DISCLAIMER: Now I do not speak, read, or write Chinese with any degree of fluency. So all of the following may be taken with a grain of salt if you do not wish to buy the arguments I'll put forward. My expertise is in this case definitely more toward implementations of collation that look at the script across various languages. And comparing it against many languages that use other scripts such as Latin, Cyrillic, Arabic, etc.

    Even if one is to claim that historically we are not looking at an ideographic script, let us put that aside for a moment and look at modern Chinese. Even people from neighboring villages will find their speech to be mutually unintelligible while their writing is not. And of the over 70,000 Han in Unicode, many have no known pronunciation -- they are historical characters from millennia ago. Some are even what you would call "mistakes" since the intent was a different Han but the artistic equivalent of a typo from millennia ago has survived in scholarly work. That would tend to underscore the symbolic nature of the script, at least to me.

    But beyond all that, looking at differences between Cantonese and Mandarin, the different pronunciations in Pinyin vs. Bopomofo, the differences in pronunciation between uses of the same Han when one is looking at them as [Chinese] Han vs. [Korean] Hanja vs. [Japanese] Kanji, and the multiple pronunciations in each of these languages and varieties within language for the same Han -- how can I possibly look at it as anything but an ideographic script?

    Pinyin (拼音) itself is a relatively recent innovation (in the context of Han, which stretches back millennia, Pinyin is a simplification of pronunciation that dates to the middle of last century). Even Bopomofo (注音符號) only dates back to the beginning of the last century. Prior to that, one had perhaps worse lack of intelligibility of spoken language without even the attempt to standardize pronunciation -- how can one look at that situation and not assume a symbolic, ideographic nature to the script that is used in several of the most important languages on the third rock from the sun?

    So I will take a sign (标) and apply to it my will (志) to try to determine its meaning. And from that I seem to have found a symbol (标志). Well, more than one symbol, in this case....

     

    This post brought to you by "ʨ" (U+02a8, a.k.a. LATIN SMALL LETTER TC DIGRAPH WITH CURL)

  • Sorting it all Out

    A subkultur iz a shprakh mit an armey un a flot

    • 8 Comments

    The title of this post is inspired by a quote from Max Weinreich, a Yiddish linguist -- A shprakh iz a dialekt mit an armey un a flot. I think it can be understood by many without knowledge of Yiddish, especially if they know German (as German-knowledgeable Cathy likes to tell me, in a lot of ways Yiddish is like 16th century German with Hebrew letters). I knew what it meant but I don't know any German at all. Basically it can be translated as "A language is a dialect with an army and a navy."

    He was speaking somewhat ironically when he said this, since obviously Yiddish has neither but nobody would presume to call it a dialect at this point.

    But it does raise an interesting question about one of the difficulties of creating locales -- what would be the location of a Yiddish locale if one were to be added? There isn't one (though I think it might be fun to call it Yiddish - Shtetl, I doubt that would get past the lawyers!). And then of course we would need a Yiddish - Shtetl (Latin) and a Yiddish - Shtetl (Hebrew) to account for the fact that both scripts are used in these times. And the question of what to do with collation is a fascinating one for the Latin script (though fairly obvious for the Hebrew script one).

    Thus my modified quote, to cover the Windows requirement for cultures and locales as they are defined -- A subkultur iz a shprakh mit an armey un a flot (a culture is a language with an army and a navy). :-)

    Or using the Hebrew script for the Yiddish phrase, something like:

    אײ סובקולטור איז אַ שפּראַך מיט אַן אַרמײ און אַ פֿלאָט

    The same problem exists for Esperanto, and really any language that crosses so many borders and lacks a specific origin location. It is just too hard to figure how they fit into the model of locales that Microsoft ships in Windows and the .NET Framework.

    This is one of the REAL benefits to both opening it all up and getting out of the way, since the difficulties that Microsoft would run into in trying to define a specific locale should not block an individual customer or even a community of customers from defining one that they would like to use.

     

    This post brought to you by  "װ" (U+05f0, HEBREW LIGATURE YIDDISH DOUBLE VAV)

  • Sorting it all Out

    Every character has a story #12: U+2071 (SUPERSCRIPT LATIN SMALL LETTER I)

    • 8 Comments

    This entire post below was authored by Ken Whistler and posted to the Unicode List at 7:37pm on July 20, 2005. I sit and watch Ken with awe and wonder at times like this. :-)

    (Asmus Freytag inspired this post when he stated: We could create a new series UCN (Unicode Character Notes) that are numbered by code point and each address a single character. Two hours and ten minutes later, Ken had written the following text)

    Hmmmm...

    UCN #2071

    By Ken Whistler, character historian

    U+2071 SUPERSCRIPT LATIN SMALL LETTER I

    This character, while it might at first seem mundane and ordinary, has a colorful and amazing history of its own.

    It is one of only two *letters* to be encoded among the original block of superscripts and subscripts -- sharing this honor with U+207F SUPERSCRIPT LATIN SMALL LETTER N -- but its route into the superscripts and subscripts block was entirely distinct from the little superscript n. (See UCN #207F.) Unlike superscript n, which gained its location by virtue of its association with the venerable Code Page 437 of IBM PC fame, superscript i had no such code page association, and was much later to the scene, having a Unicode derived age of 3.2, instead of 1.0.

    Furthermore, unlike many another new character, U+2071 is nearly unique in that it went into a code point that had its own complex history, *before* U+2071 was actually encoded.

    "Whatever could this mean?" you might say, and one could well expect confusion over such a concept as this, but here is how the matter stands. Careful examination of the superscripts and subscripts block and its history will demonstrate that superscript zero is encoded as U+2070 SUPERSCRIPT DIGIT ZERO (see UCN #2070) and that superscript four is encoded as U+2074 SUPERSCRIPT DIGIT FOUR (see UCN #2074) -- the association of the digit value with the last digit of the code point is not random or by happenstance, by the way. However, the expected extrapolation from this pattern would be that U+2071 would be SUPERSCRIPT DIGIT ONE. It is not, of course, because that particular character had the rare luck to be included in ISO 8859-1, whereby it gained first-chart character status as U+00B9 SUPERSCRIPT DIGIT ONE. (See UCN #00B9 for the full story on that "one".) As a result of this unique status, the code point \U2071 was the very first instance of an occasional device seen elsewhere throughout the standard: the systematic gap blind cross reference. As early as the publication of the Founding Book (The Unicode Standard, Version 1.0), the reserved code point at 0x2071 is shown with the now famous original convention: x (superscript digit one --> 00B9)

    This pattern gapping and blind cross-referencing was the occasion of considerable discussion, and has resulted in much confusion down through the years about the standard. And \U2071 was the very *first* code point to use this convention, so can be seen as the archetypal instance of this phenomenon. Amazingly, the first code point using this convention referred to a character which itself denoted one!

    But of course that is not the end of the story of U+2071. Unlike the code point \U2072, which to this day continues  to maintain its pattern gap blind cross reference, although in the modern formulation: --> 00B2 ² superscript two (see UCN #2072 for details), U+2071 now actually has an encoded character which supersedes the earlier blind cross reference. While not unique in that status, U+2071 is among a very small, but highly august class of code points to be able to make this claim. (See UCN #0600 for a similar story with its own, uniquely Middle-Eastern flavor.)

    Nor does the tale end there, of course. There was a deep and impassioned argument about the proper emplacement of SUPERSCRIPT LATIN SMALL LETTER I, once it became clear that the importunings of the mathematical community could no longer be ignored and that all mathematical symbols, no matter how obscure, should be given their due in the standard. Now, superscript i was hardly obscure, of course -- it is commonly seen in mathematical treatises, but the usual assumption had been that superscript forms, whether of numbers, digits, or other symbols, should simply be represented as styled variants of existing characters. Superscript i, however, escaped that generalization by appearing in SGML entity lists, whose crossmapping imperative pushed it over into the realm of repertoire required for character encoding.

    Once that consensus had been reached, however, the committees were still at sixes and sevens, as it were, about the placement of superscript i. One faction fiercely argued for colonizing a hitherto untouched column and encoding it as U+2090. (See UCN #2090, which is a relatively short Unicode Character Note, but which remarks on this brief encounter with fame for that code point, rendering it a much more lively read than the downright dull UCN #2091.)

    Another faction argued that the committees should observe the sanctity of prescriptions of pattern gapping and follow the precedent of "Adding Things at the End of the List", without creating *new* unexplained gaps, and so argued for U+208F. (See UCN #208F.) That faction also argued that this placement would serendipitously place superscript i in immediate chart proximity to its venerable antecedent, U+207F SUPERSCRIPT LATIN SMALL LETTER N. However, their argument was fatally weakened by the inability to convince anyone of the felicity of adding a *super*script letter to the end of a list of *sub*script digits and punctuation.

    The third faction argued for what amounted to no less than a shocking act of character integration -- breaking the colorblind cross reference barrier by inserting a superscript letter into what had formerly been a segregated area, reserved for digits only, *despite* the fact that the only unaccounted for digit that could move into the neighborhood was already living on the good side of town, as it were, at U+00B9. After a long argument, this faction prevailed. And effectively, U+2071 SUPERSCRIPT LATIN SMALL LETTER I became the Jackie Robinson of the Unicode Standard, forever shattering the segregationist exclusionary practices that had prevented such characters from moving into code points that had established blind cross-references.

    In a further curious coincidence, U+2071 SUPERSCRIPT LATIN SMALL LETTER I bears more than a passing glyphic resemblance, at first glance at least, to U+00B9 SUPERSCRIPT DIGIT ONE. This means that newcomers to the standard often do a double-take when they view the superscripts and subscripts chart, as the character that they *expect* to see after U+2070 is a superscript one, and unless they look carefully, they might be fooled into thinking that U+2071 actually *is* a superscript one. In this respect, U+2071 SUPERSCRIPT LATIN SMALL LETTER I has an additional unique status in the entire standard, of serving as a visual ghost of a vanished blind cross-reference to a character that appears almost the same as itself. No other character has this status, even among the small group of other characters that have crossed the cross reference barrier to appear in those formerly reserved code points. Some editors have argued that this ghostly and implicit graphic cross-reference should be finally acknowledged fully and demystified a bit by adding what would now be an explicit cross-reference to U+00B9. But how that might turn out, of course, is a matter only of current speculation and a topic for a future version of this Unicode Character Note.

    Another thing worth mentioning about U+2071 SUPERSCRIPT LATIN SMALL LETTER I has a bearing on the fabulous collection of stories related to the Phonetic Extensions block, U+1D00..U+1D7F. In particular, U+2071 SUPERSCRIPT LATIN SMALL LETTER I is used not only in mathematical contexts, but also appears as a modifier letter in the Uralic Phonetic Alphabet. It would have been proposed as a character among that collection, except that the mathematicians got their proposal in and processed first. This accounts for why there is a U+1D4D MODIFIER LETTER SMALL G (see UCN #1D4D) and a U+1D4F MODIFIER LETTER SMALL K (see UCN #1D4F), both associated with the UPA collection of Unicode derived age 4.0, as well as the much more venerable U+02B0 MODIFIER LETTER SMALL H (see UCN #02B0) and U+02B2 MODIFIER LETTER SMALL J (see UCN #02B2), both associated with IPA and other phonetic collections of Unicode derived age 1.0, but astoundingly, there is no MODIFIER LETTER SMALL I in the standard! Thus U+2071 SUPERSCRIPT LATIN SMALL LETTER I shares with U+207F SUPERSCRIPT LATIN SMALL LETTER N the status of being the only Latin modifier letters in the standard named for their decomposition tag, rather than their modifier letter status. Strange but true!

    So not only is U+2071 the Jackie Robinson of Unicode characters -- it also stands as one of the prime exemplars of the principle that you cannot derive all character properties from inspection of character names, nor assume that all characters in related groups of characters will have names constructed by identical patterns.

    U+2071 SUPERSCRIPT LATIN SMALL LETTER I should also figure prominently in any list of hard-to-find characters, precisely because its history confounds so many expectations regarding where, exactly, one should search for it if casually perusing the charts or attempting to access it for input.

    Vital Statistics:

    2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;0;L;<super> 0069;;;;N;;;;;

    For further details, consult the Unicode Character Database.
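
    (A note from the blog side, not part of Ken's text: for anyone who wants to pick that "Vital Statistics" line apart, here is a small Python sketch that splits it into the fifteen semicolon-separated fields of a UnicodeData.txt record. The dictionary keys are labels of my own choosing, loosely following the field descriptions in the Unicode Character Database documentation, and the values are exactly the ones quoted above, which may differ from what later versions of the standard list.)

```python
# Descriptive labels for the 15 fields of a UnicodeData.txt record.
# The key names are illustrative choices, not official identifiers.
FIELD_NAMES = [
    "code_point", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
]


def parse_unicode_data_line(line: str) -> dict:
    """Split one UnicodeData.txt record into a dict keyed by field label."""
    values = line.strip().split(";")
    if len(values) != len(FIELD_NAMES):
        raise ValueError("expected %d fields, got %d"
                         % (len(FIELD_NAMES), len(values)))
    return dict(zip(FIELD_NAMES, values))


record = parse_unicode_data_line(
    "2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;0;L;<super> 0069;;;;N;;;;;"
)

# Print only the fields that actually carry a value.
for key, value in record.items():
    if value:
        print(key, "=", value)
```

    The decomposition field is where the <super> tag mentioned earlier lives: the character is described as a superscript-styled U+0069, even though nothing in its name calls it a modifier letter.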

    SUPERSCRIPT LATIN SMALL LETTER I is also identified as IBM GCGID LI011000, where it is named in the documentation as "i Small superscript".

  • Sorting it all Out

    Kristin Connell at Paragon tonight

    • 7 Comments

    I first saw Kristin Connell when she opened for Jim Boggia in the Green Room, and I even bought her CD Second Chances there since I enjoyed several of the songs she played, and she said most of them were on the CD. I had her autograph it, too -- why not? :-)

    Anyway, this talented lady is at Paragon in Queen Anne tonight, where the food is great and the entertainment is free even when it is really good, like when it is Kristin. Highly recommended!

    I never did post the story about the CDs that night. Kristin opened the Green Room show (where they have a nice scooter-friendly elevator) and afterwards I bought the CD. I had only $25 on me, but the CD was just $15 so I figured I would go to the bank tomorrow.

    But then Jim got onstage and did a great show (he even closed with that hilarious Prince imitation I had heard about but never seen). I wanted to ask him to do Mr. Harris (an Aimee Mann song that he once got up on stage with Aimee to do, the night after I had to leave town), but I lost my nerve and did not ask while he was onstage. After the show I started to tell him this and it turns out he remembered the drive out to the show and Mr. Harris the next night and everything (he even remembered me which amazed me even though he did not remember my name, it was still very cool!).

    I decided to buy his new CD (Safe in Sound), too -- even though his management had sent me one already (I did some stuff with flyers for them), I wanted to give one to a coworker. As a rule I like to make sure more money goes to the artists than to the store, so this seemed perfect. But then I remembered that I only had $10 on me. :-(

    But then I found a 5 euro note in the wallet, which is technically worth more than $5 (exchange rate being what it is). Would they accept it? I have known cab drivers to refuse them here, not realizing what they were worth (the ignorance about some things in this country is staggering!).

    As it turns out, he had no problem taking the euro. :-)

    I kind of wanted to get rid of the euros anyway, since I am back in the US now. But I'll bet you he won't forget the guy who bought his album in Seattle with euros, even if he still doesn't remember my name. Though in fairness he remembered the name of every female in the car from the original trip. Which is probably more gentlemanly than remembering the gentlemen's names, in any case.

    Anyway, hope to see you at the show tonight. I promise to give more notice about this sort of thing in the future (nobody reads this blog on the weekends, right? <grin>).
