Blog - Title

May, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Not a Birds of a Feather, but something for TechEd 2005

    Something special for IT Pros (and maybe for some developers!)

    I will be doing miniature versions of talks I have done before on keyboards -- how they work, how to develop them, how to deploy them. This will happen in Dr. International's clinic (located in the Community Cabana across the hallway from the Developer Tools’ track cabana) during one of the many times I am there, for anyone who is interested.

    If you are one of the interested parties, let me know and I will try to get some times posted. Or maybe I should sign up for some of those GrokTalk slots?

    Scott, what do you think? :-)

  • Sorting it all Out

    When complex scripts are not too complex

    Raymond Chen did a post yesterday entitled You can't simulate keyboard input with PostMessage.

    He did touch on the complicated language issues, saying:

    First of all, keyboard input is a more complicated matter than those who imprinted on the English keyboard realize. Languages with accent marks have dead keys, Far East languages have a variety of Input Method Editors, and I have no idea how complex script languages handle input. There's more to typing a character than just pressing a key.

    This is a subject I have covered a little bit before. The fact is that these APIs in the USER subsystem (from ToUnicode to SendInput and so on) all keep a certain amount of state: not just the state of the shift keys that Raymond mentioned, but also state related to when you have typed a dead key (so that the system knows, when you type the next key, whether the dead key table contains the combination you have just typed). There is even an MSDN topic, entitled About Keyboard Input, that helps describe some of this complex process.
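    To make the dead key state a little more concrete, here is a minimal sketch in Python. It is not how USER implements anything; the table contents, the KeyboardState class, and the type_keys helper are all hypothetical names invented for illustration. The point is only that the same keystroke produces different text depending on what came before it.

```python
# Hypothetical dead-key table: (dead key, next key) -> composed character.
# A real keyboard layout carries a much larger table of this kind.
DEAD_KEY_TABLE = {
    ("´", "a"): "á",
    ("´", "e"): "é",
    ("`", "a"): "à",
}
DEAD_KEYS = {"´", "`"}


class KeyboardState:
    """Toy model of the state that survives between keystrokes."""

    def __init__(self):
        self.pending_dead_key = None

    def press(self, key):
        """Return the text produced by this keystroke (possibly empty)."""
        if self.pending_dead_key is None:
            if key in DEAD_KEYS:
                # Remember the dead key; nothing is produced yet.
                self.pending_dead_key = key
                return ""
            return key
        dead, self.pending_dead_key = self.pending_dead_key, None
        composed = DEAD_KEY_TABLE.get((dead, key))
        if composed is not None:
            return composed
        # No combination in the table: emit both characters separately.
        return dead + key


def type_keys(keys):
    """Feed a sequence of keystrokes through one fresh keyboard state."""
    state = KeyboardState()
    return "".join(state.press(k) for k in keys)
```

    Anything that fakes keystrokes without going through that state (posting messages directly, for example) has no way of knowing whether a dead key is pending.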

    And once you get into IMEs, the complex rules related to state that the IME must keep really boggle the imagination. It is probably easier to use the Input Method Manager (IMM) APIs to try to get input through an IME than to try and fake keystrokes....

    Luckily, the one issue that is not really all that complex is complex scripts, because the actual issues that make them complex (bidirectional text, contextual shaping, line breaking, and illegal sequence checking) are all related to what happens to the text after you have typed the keystrokes -- the font linking and the rendering.

    Well, that and knowing what to type, of course! Being able to have words even look like they belong together in languages like Thai and Hindi and Tamil really requires either knowing the language or memorizing keystrokes. Which is the same as when dealing with IMEs (in my book I had a chapter that talked about keystroke combinations you could use to test IMEs; it was a lot of fun and I still get positive feedback, enough that I may start posting examples of stuff like that soon).

    (Let me know if that sort of thing seems like it might be interesting -- I never know what people will find engaging here!)

     

    This post brought to you by "ஂ" (U+0b82, a.k.a. TAMIL SIGN ANUSVARA)
    (which is annotated in Unicode as "not used in Tamil" though several people who deal with Sanskrit in Tamil would beg to differ!)

  • Sorting it all Out

    The GrokTalk meme spreads

    Scott Hanselman posted about GrokTalk, and explained what it is all about:

    The deal is this: We've all sat through some pretty lousy technical sessions at conferences. For the most part, sessions at TechEd are filled with good information, but every once in a while you sit through 75 minutes in order to "grok" something that could have been explained in 10 minutes.

    We thought it'd be interesting if we put together three days of presentations that were only 10 minutes long! Just the facts, just the technology, in a short format. We'll see presentations from folks you may have seen speak before like Scott Stanfield, Carl Franklin, Billy Hollis, myself, Juval Lowy, as well as a few from out of town, or the other side of that world that you may not have had the opportunity to see.

    Of course, Mike Gunderloy pointed out one flaw:

    Frankly one reason I stopped going to Tech Ed was because too many of the 75-minute talks only had 10 minutes of content. Of course if these things are going to end up on the Web anyhow I still don't have any reason to go.

    Good point. The downside of ideas that come from outside of the group of people who are trying to up attendance. :-)

    But since it was announced after TechEd sold out it could not have been about that anyway. But it is about making things more available to more people. And that makes it a good idea.

    I do think it is a very good idea, I plan to try and drop in on some of it if I have a chance.

    But kind of like all of the times I think of great interview questions that I cannot use since there is not enough time to set them up, it is hard to do what I do in ten minute bites. :-)

    Unless you think I could start doing some of these blog posts. Some of them could be done in 10 minutes (although without the links to other posts there could be some trouble trying to use some of them effectively!).

    In any case, if you are going to TechEd in Orlando, I'd definitely check them out.

    And be sure to stop by Dr. International's clinic (located in the Community Cabana across the hallway from the Developer Tools’ track cabana). I will be in there for a lot of the time that I am in Orlando....

  • Sorting it all Out

    Hats off to David Beaver

    Or should I say, Háts off to David Beaver? :-)

    Over on the Language Log, David noticed some interesting issues with Google search in his post entitled PASS THE HÁT.

    The basic issue comes up with Google's HYPHEN-MINUS operator. According to their documentation on negative terms, it behaves as follows:

    If your search term has more than one meaning (bass, for example, could refer to fishing or music) you can focus your search by putting a minus sign ("-") in front of words related to the meaning you want to avoid.

    For example, here's how you'd find pages about bass-heavy lakes, but not bass-heavy music:

    bass -music

    Note: when you include a negative term in your search, be sure to include a space before the minus sign.

    They make it sound simple, don't they? What David noted is that it is not simple, since the interaction is not as simple as additive and subtractive terms. The interaction related to the use of diacritics or even just the plain words is really quite fascinating.

    Knowing what I do about collation and indexes, I could spend a lot of time in this area, trying to reverse engineer both the algorithm and the indexes being used by looking at the results. But it is not quite that interesting -- I have my collation implementation to be thinking about, after all. :-)

    But it is still fascinating to contemplate. My favorite is also the one David seems to like best:

    Or achete -achete: infantile as I am, I really like this one since it produces 1,890,000 hits, while Google helpfully suggests the alternative acheter -acheter, which produces no hits, surely a new record of bad performance for a search enhancing feature.

    It could add fascinating twists to GoogleFight, at least....

    You can try playing with other operators, and wonder why a -a gets over 44,000,000 hits while +a -a gets none (especially since the plus sign is meant to force words that are usually ignored, which a usually is)....

  • Sorting it all Out

    Doing something with TechEd slides

    In years past, I had seen shows at the Front Row Theatre in Cleveland (it is no longer around, the Rock and Roll Hall of Fame is on the site where it used to be). I was struck at the time by the way that the performer would be facing different parts of the audience at different times. So you did not always have the performers facing you, but everyone had them for some of the time....

    Anyway, in years past when doing technical presentations in places like Stockholm and Amsterdam, I would usually work to get my slides localized -- of course in a bilingual form so I could still read them! :-)

    (Also, when speaking in London I would try to get rid of the Americanisms when I "localised" them.)

    I was thinking about maybe trying to do the same thing for TechEd in July, but I realized that a lot of the attendees will be coming from all over Europe, so trying to get them localized into Dutch may not really capture as much.

    So I thought about that long-gone venue and started toying with the idea of trying to get different slides done in different languages, all over Europe.

    It would be roughly analogous to letting lots of people see some of the show localized, whether it was text in Nederlands, Frysk, Deutsch, français, ελληνικά, español, suomi, Magyar, íslenska, italiano, norsk, polski, Português, română, русский, hrvatski, slovenčina, shqipe, svenska, Türkçe, українська, Беларускі, slovenski, eesti, bosanski, latviešu, lietuvių, euskara, македонски, srpski, Elsässisch, Occitan, Corsu, brezhoneg, hornjoserbšćina, Lëtzebuergesch, Rumantsch, Cymraeg, åarjelsaemiengiele, or any other language across Europe.

    So, does it sound interesting? Comments welcome!

    Do you speak any of these languages and would you like to help out? You can send a piece of email to me at michkap -at- microsoft.com (munge in the obvious way!).

  • Sorting it all Out

    Pass the string, please

    A little over a month ago, I was talking about how SetLocaleInfo really stinks. Buried in my tirade was the germ of a question that people have asked in the past -- why is there no setting analogue to the LOCALE_RETURN_NUMBER flag used by functions like GetLocaleInfo, a sort of LOCALE_SPECIFY_NUMBER flag for the numeric fields? They are all small integers, so there is a simple enough method and datatype for such a flag to use.

    Of course that is kind of syntactic sugar for SetLocaleInfo. The functions where the problem looks worse for us are GetCurrencyFormat and GetNumberFormat. They both have the following text for their lpValue parameters:

    lpValue
    [in] Pointer to a null-terminated string containing the number string to format.

    This string can contain only the following characters:

    • Characters '0' through '9'.
    • One decimal point (dot) if the number is a floating-point value.
    • A minus sign in the first character position if the number is a negative value.

    All other characters are invalid. The function returns an error if the string pointed to by lpValue deviates from these rules.

    So not only is there no way to pass an actual number to these functions (you must pass a string), but you also have to use a simple string that is not even vaguely internationally aware. It is not even a good US string, since you would have to pass long numbers without grouping separators, something that is very unnatural. Yuck!
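    The documented rules for lpValue are simple enough to sketch as a check. This is a Python illustration of the three rules quoted above, not the validation code inside GetNumberFormat or GetCurrencyFormat; the function name is hypothetical, and requiring at least one digit is my own assumption since the quoted text does not say what happens with an empty string.

```python
import re

# Optional leading minus, digits '0' through '9', and at most one dot.
_NUMBER_STRING = re.compile(r"-?[0-9]*\.?[0-9]*")


def is_valid_number_string(value: str) -> bool:
    """Check a string against the documented lpValue character rules."""
    if not _NUMBER_STRING.fullmatch(value):
        return False
    # Assumption: a string with no digits at all is rejected.
    return any(ch in "0123456789" for ch in value)
```

    Note that "1,234" fails the check even though it is how most people in the US would write the number, which is exactly the complaint.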

    But if you think about it, this does make sense. Would you want a function that would accept input one moment and then potentially fail with that same input moments later after a change to settings? Not to mention the performance impact of having to insert complex parsing logic into the function based on those settings, giving developers a very difficult to debug problem in the most common cases.

    Of course one could expect that the function could use the same format as is expected in the locale passed in, but if the developer was able to format a string with that locale's settings, she would not need the function, right?

    Now the fact that there is no LOCALE_SPECIFY_NUMBER type flag to pass is a bit stranger since it's not like this function gives the benefits of passing a string, right?

    But before you agree with my strawman question, consider for a moment what datatype you would use.

    Both functions will accept outrageously large numbers, much larger than any common C-style type that is available under Windows. And both will accept fractional values that are 100% precise -- way more precise than any floating point type that is available. You can start imagining complex schemes with different flags for different types of numbers, I suppose. And all that work and the functions would become hideously complex for customers to call and for us to maintain, all to provide less functionality than the current simple string allows!
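    A quick way to see the datatype problem, sketched in Python rather than C (Python's float is an IEEE double, so it stands in for the floating point types in question, and Decimal stands in for "a string of digits taken at face value"). The sample value is made up for illustration.

```python
from decimal import Decimal

# A made-up value with 30 integer digits and 30 fractional digits.
big_fraction = "123456789012345678901234567890.123456789012345678901234567890"

# Parsed exactly, digit for digit, the way a plain string carries it.
as_decimal = Decimal(big_fraction)

# Squeezed into a double: only about 15 to 17 significant digits survive.
as_float = float(big_fraction)

exact_round_trip = str(as_decimal) == big_fraction
float_is_exact = Decimal(as_float) == as_decimal

# The integer part alone does not fit in an unsigned 64-bit integer.
fits_in_uint64 = int(as_decimal) <= 2**64 - 1
```

    Here exact_round_trip is True while float_is_exact and fits_in_uint64 are both False, which is the argument for the string in a nutshell.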

    Sometimes the simple answers are the best ones. If you ask me, both GetCurrencyFormat and GetNumberFormat fall in that category, and on the whole I am happy with how these functions work....

     

    This post brought to you by "Z" (U+005a, LATIN CAPITAL LETTER Z)
    (All of the other letters were still asleep, and when they were saying Zzzzzzz, the Z woke up, thinking someone was calling her name!)

  • Sorting it all Out

    What was I saying?

    What was I saying? I can't remember.

    Maybe I will look up on Adi Oltean's blog how much memory my version of Windows can have.

    Good thing I remembered where my RSS reader was! :-)

  • Sorting it all Out

    Second time's a charm!

    (a tiny little bit of technical content in this post, but it's in the second half and may not be worth the trouble sifting through this stream of consciousness blathering!)

    About a month ago in my post Getting Enough Exercise? I talked about my exciting scooter ride to Typhoon! that saw me pushing my scooter for a lot of the way back home.

    Well, they fixed the problem in the Victory's motor a few weeks ago but I was still wary about taking it out again.

    But the weather yesterday and today was simply way too stellar to ignore. I had to get out there. So I put on shorts and my lucky T-shirt (more on why it's my lucky shirt another time, it's a great story involving a concert in Malmo and a bizarre car repair accident that ruined the shirt I had on), and headed out.

    I made it there and back, with no problem at all. It was awesome!

    And I think I realized why I enjoy the idea of scooting around. It is pretty much the same reason I used to like running (back in the days when I was able to do that). It gives me a chance to keep my lower level systems busy handling the mundane tasks like watching traffic and steering, and I can disconnect the higher level systems to just think about stuff.

    I swear I came up with ideas for fixing a bug assigned to me and thoughts on how to approach debugging two others. I ran through ideas for things to say during the various talks I'm doing and am really looking forward to doing them. I thought about my busy schedule over the next few months -- a TechEd in Orlando, a wedding in Seattle, a TechEd in Amsterdam, two B'nai Mitzvot in Sugarland, a trip to Edmonton to talk about Unicode stuff to two different groups of people and a likely gig in Houston to do the same while I am in Sugarland, interesting bugs to look at and several interesting projects to work on.

    I remember when summer was a time for vacation!

    Then I was thinking about the last time I was scooting to Typhoon. I drove into Redmond Town Center, did some shopping, and scooted over to the restaurant. On the way there I saw an ex-girlfriend (well, technically ex-fiancee, I guess that promotes her to the ex-girlfriend!), and I really didn't say much more than hello. I thought about it later, wondering why I did not stop and ask how she was doing, how was life, how were her kids, how was the job, and so on. But I really did not do any of that. No special reason, I was not trying to avoid it as far as I know. I mean, I had the scooter so it was not a worry about not being able to stand around very long. Maybe I subconsciously was just trying to not get back into it all. I don't think I was rude, and it's not like she was asking about me or anything. So maybe she had the same idea.

    Or maybe I just look more pathetic in the scooter than I used to with the cane! :-)

    Well, even if that is true, I think I will keep on scooting. I did not realize how much I missed that special processing time I used to get. It's hard to explain, but it is the kind of thing you can't do when you are relaxing or trying to sleep or whatever. I guess I have also managed to do it on a beach in Little Cayman a few times, but that was by literally thinking about nothing at all and hitting some sort of Zen state where awareness creeps in while you bake in the sun and watch the water, but never in any other time or place. Since I don't have as much time to get away like that, the scooter trips can make a great substitute....

    Anyway, back to this current trip.

    I was thinking a little bit about CompareString/LCMapString (and their managed cousins CompareInfo/SortKey), and the promise to make sure that the results of string comparison and sort key comparison are consistent.

    It occurred to me that I do not even remember the last time there was a bug on the sort key side of the equation, but I do remember the last who-knows-how-many issues that have come up on the string side. The hazard of trying to take shortcuts, I guess -- you can easily find yourself getting tagged out trying to slide into second base, if you know what I mean. And that function is pretty much a goulash of shortcuts that try to maximize performance with a goal of a 0% sacrifice in fidelity.
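    The consistency promise is easy to state as a test, even if it is hard to keep. The sketch below is Python and uses stand-ins throughout: sort_key, compare_strings, and shortcut_compare are hypothetical toy functions built on case folding, not CompareString or LCMapString, and the "shortcut" (lowercasing instead of full case folding) is just an invented example of a fast path that quietly disagrees with the sort keys.

```python
from itertools import product


def sort_key(s):
    """Stand-in sort key: full case folding, then the raw string as a tie-break."""
    return (s.casefold(), s)


def _compare(a, b, fold):
    """Three-way compare on folded text, falling back to the raw strings."""
    fa, fb = fold(a), fold(b)
    if fa != fb:
        return -1 if fa < fb else 1
    return (a > b) - (a < b)


def compare_strings(a, b):
    """Comparison that uses the same folding as the sort key."""
    return _compare(a, b, str.casefold)


def shortcut_compare(a, b):
    """A 'cheaper' comparison that lowercases instead of case folding."""
    return _compare(a, b, str.lower)


def find_inconsistencies(compare, words):
    """Return every pair where compare() disagrees with comparing sort keys."""
    bad = []
    for a, b in product(words, repeat=2):
        ka, kb = sort_key(a), sort_key(b)
        expected = (ka > kb) - (ka < kb)
        if compare(a, b) != expected:
            bad.append((a, b))
    return bad


WORDS = ["a", "A", "b", "straße", "strassf", "STRASSE"]
consistent = find_inconsistencies(compare_strings, WORDS)
broken = find_inconsistencies(shortcut_compare, WORDS)
```

    With these words, consistent comes back empty while broken contains the pair "straße"/"strassf" in both orders: the obvious cases all pass, and it takes an edge case to show where the shortcut went wrong.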

    Now admittedly the bugs are pretty much all edge cases. But the obvious cases, one takes for granted -- it is like the Talmud. It is in the minutiae that you can draw solid boundaries so that everything else will fall within or without in deterministically defined ways....

    Now somewhere deep in the code of that function, there is a simple skeleton of a design, but over time the various complexities and bug fixes and features and workarounds have conspired to make it a very daunting code base. Enough so that most people on the team, who are usually happy to load balance bugs, will hesitate rather than pick up the sorting ones. It's just a rough code base to jump into. Of course comments have not always been maintained as well as they could have been when small changes have been made -- I really ought to try to clean some of that up, and make sure the comments match the code (and that they exist where they ought to!). It may even help inspire new insights. :-)

    And now I will go eat some more noodles. If you made it this far, thanks for making it through my blatherings. :-)

     

    All the characters fell asleep by the time I got to Little Cayman, so none were awake to take the sponsorship. I will try to wake them up for a post later....

  • Sorting it all Out

    Time zones and locations and keyboards, oh my!

    Yesterday, Manip posted a rant about the fact that the three settings in the title of my post today are not integrated well (in response to a post by Larry Osterman, who gets good comments even to his non-posts!). The rant:

    I'm using this misc story to post a rant about Windows Setup.

    Why oh why are countries and time zones + keyboard layouts not cross linked?

    First Windows Setup asks me my nationality, I respond 'United Kingdom'... Setup is *SO* smart that it doesn't change the default time zone to GMT & UK Keyboard Layout; instead I have to manually select all three.

    Now I absolutely agree that not *ALL* people who select UK will want to use GMT and a UK keyboard layout but it could at least make those the defaults for the vast majority of the people that do each time...

    In case you were thinking that this is proof that you should read the comments in people's posts, this probably wouldn't do the trick, because Larry sent me email about this before I had a chance to look at it. :-)

    Now let's talk about this rant for a bit. :-)

    Now starting in Windows XP and continuing into Windows Server 2003, a user locale setting will give you a keyboard, automatically. You can choose to add more if you like, or even remove ones automatically added, and with the exception of a single bug that we may hear about in a comment to this very post it all works quite well. Note that in prior versions this feature did not exist, so for Windows 2000 you would actually have to complain about four non-integrated settings!

    However, the time zone setting and the location setting are not integrated with that user locale/keyboard choice.

    I have spoken in the past (like in GEOID -- The LCIDs maligned little brother....) about the fact that the location and user locale settings are not well integrated, and the fact that not enough people are using the location setting anyway. And the inertia that keeps developers from making such changes is not helped by the fact that the GEO API does not exist on Windows 2000, either. But eventually, with the help of that eager PM I talked about in the cited GEO post, I am sure that issue will work itself out.

    And once upon a time, there was some effort to integrate the time zones and user locales as well, but that particular integration no longer really happens -- and it would honestly be better if it integrated with the GEO settings anyway. Especially since there is no documented, queryable mapping between locales and time zones, but a call to the GetGeoInfo function with a SYSGEOTYPE of GEO_TIMEZONES will return an array of all of the time zones in the selected location.

    Note that there are many locations that span multiple time zones, so there will still have to be some tweaking done, but that is not unreasonable (I am sure there are both people who would demand that Microsoft integrate GPS tracking to pinpoint the location and people who would decry such an "evil Microsoft conspiracy" to track users as immoral; I will not give my opinion either way!).
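    The kind of defaulting being asked for can be sketched in a few lines of Python. Everything here is hypothetical: the table is hand-written illustration data rather than anything returned by GetGeoInfo, and default_time_zone is an invented helper. It only shows the shape of the logic, where a location with one time zone gets a default and a location with several still needs the user to choose.

```python
# Hypothetical location -> time zone data, for illustration only.
TIME_ZONES_BY_LOCATION = {
    "United Kingdom": ["GMT Standard Time"],
    "United States": [
        "Eastern Standard Time",
        "Central Standard Time",
        "Mountain Standard Time",
        "Pacific Standard Time",
    ],
}


def default_time_zone(location):
    """Return the obvious default zone, or None if the user has to choose."""
    zones = TIME_ZONES_BY_LOCATION.get(location, [])
    if len(zones) == 1:
        return zones[0]
    # Zero or several candidates: no safe default, so ask.
    return None


def candidate_time_zones(location):
    """Return the zones worth offering first for this location."""
    return list(TIME_ZONES_BY_LOCATION.get(location, []))
```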

    So in summary, not only can the LOCALE/GEO and GEO/TIMEZONE connections be made, but there are people actively trying to get them made. If Manip can wait until we get all of our ducks in a row, then the integration that the rant truly pines for will happen. Hang in there! :-)

     

    This post brought to you by "±" (U+00b1, a.k.a. PLUS-MINUS SIGN)

  • Sorting it all Out

    When do time zones and cultural settings get updated?

    Yesterday, Jeff Parker posted a comment to a non-post about time zones and backcompat from Larry Osterman (Larry's blog is so cool that he has ideas pop up even when he only posts about the fact that he could not put up anything substantive since he was working on an internal presentation!). Jeff's comment was:

    Hey I was thinking of something about your backwards compatibility and how long should you keep API's. Maybe when you get time you could elaborate on another situation with that. This year Indiana has voted to go along with Daylight Savings Time. Where previously they did not. Microsoft even has a Eastern (Indiana) Time Zone. Now would this go away? Why would you keep it? And more importantly are they going to patch it since there is no longer an Indiana time zone. What does Microsoft do if an API specifically affects a culture and the culture changes.

    Just a suggestion, something I am curious about. Since I went to Purdue I know a lot about the old Indiana time. When I heard they were changing I was wondering how a shift like this would affect API's and do they still then remain valid.

    In case you were thinking that this is proof that you should read the comments in people's posts, this post would do the trick, since I saw it before Larry sent me email about the other issue. :-)

    The issue of backcompat and time zones that Jeff brings up is an interesting one (and the news that Indiana plans to join the rest of the country is fascinating, now if we could just get Arizona to follow suit!). In this specific case, lots of people never even knew about the Indiana rules until an episode of The West Wing had some fun with it. But for the time zones in Windows, the principles are easier to decipher:

    • We update any time the time zones update -- the rules are such that correctness is more important than consistency. There are applications that do not use the APIs to match the behavior, but the average user will usually blame the application, not the operating system (in fact they might wish that the OS had easier updating methodology!).
    • Once the Indiana time zone is the same as its surrounding area, the fact that it has a separate time zone 'slot' is more of a cosmetic issue. It cannot ever be removed in existing products even if the settings in the zone are updated. After all, while time zones change, the zone itself is a setting on people's machines and removing it can cause a lot more problems than it would solve. There is no hurry to do it, if you ask me; given the contention surrounding the issue, it may end up getting reversed at some point anyway (say if Troy Woodruff loses his seat and legislators' remorse sets in, parts of the old rules could find their way back in Indiana!).

    Now what to do in future versions is a different story -- it is easy enough to migrate people when they upgrade, when/if you need to. There were, after all, at least 75 different time zones the last time I had cause to look at them all, last year. More get added from time to time, both for good reasons and bad (I will not make judgments or cast aspersions by giving you my own opinions on categorizations here -- let's just say that neither common sense nor maturity always figure into official policies or requests!).

    The issue of supporting a time zone "slot" past its useful life as a distinction for the sake of backwards compatibility is a fascinatingly difficult one, and a decision that I am glad I do not have to make.

    For myself, I usually do not change my time zone settings even when I travel -- it is easier for me to just do a little math in my head; if everything goes wrong and I miscalculate, maybe I can get out of a few meetings! :-)

    Now the principles here also very much apply to locale/culture settings. They must be updated to match cultural expectations. If you have code that either does not query it or code that assumes it will not change, the code is just wrong....

     

    This post brought to you by "¿" (U+00bf, a.k.a. INVERTED QUESTION MARK)

  • Sorting it all Out

    Fixed my SQL Server CTP14 woes!

    So I am setting up various Virtual PC images for SQL Server Shiloh and Yukon demos for my upcoming TechEd presentations. I was installing on different platforms, also -- because hard drive space is cheap and variety is the spice of life. :-)

    Suddenly I was seeing an alert with the following text in it, but not with all installs:

    TITLE: Microsoft SQL Server 2005 CTP Setup
    ----------------------------------------
    SQL Server Setup could not connect to the database service for server configuration. The error was: [Microsoft][SQL Native Client]Shared Memory Provider: Connection was terminated [1236]. Refer to server error logs and setup logs for more information. For details on how to view setup logs, see "How to View Setup Log Files" in SQL Server Books Online.

    For help, click: http://go.microsoft.com/fwlink?LinkID=20476&ProdName=Microsoft%20SQL%20Server&ProdVer=9.00.1116&EvtSrc=setup.rll&EvtID=29545&EvtType=lib%5Codbc_connection.cpp@Do_sqlScript@OdbcConnection::connect@x800704d4
    ----------------------------------------
    BUTTONS:
    &Retry
    Cancel
    ----------------------------------------

    Ugh, what to do?

    Thankfully, Mike Epprecht posted about this problem, and its solution, in a very helpful post.

    Back on track. Thanks, Mike! :-)

  • Sorting it all Out

    What isn't in the default install for NLS

    Ram Mallika mentioned in the suggestion box:

    Can you please talk about the issues in an out-of-the-box support for complex scripts and languages, instead of making it as a supplemental language installation? Also, What can we expect on this issue on the next Windows (longhorn)?

    I am sure you are aware of the immense benefits etc.

    Thanks much

    [It is a follow-up on one of your comments in
    Even every version of XP Home is fully internationalized.... ]

    Now, the sad truth is that out of the box, the important multilanguage settings (all language groups in Windows 2000, and East Asian/Complex Script support in Windows XP/Server 2003) are not installed by default.

    There are two reasons for this:

    1) Installing the East Asian Language Support means installing the EA fonts and input methods. And those are a lot of files. Someone decided that the "size of the default install" was an important metric, which is I suppose the same order of magnitude as the logic by which most cars are sold with tiny temp tires as their spares.

    2) Installing Complex Script Support means turning on Uniscribe, which does cause a minor performance issue (there is a minor hit to opening up your machine to all that Unicode has to offer!). Someone decided that the "default install" for the US English product should not have to take that performance hit, although of course that US English product is the one most likely to be shipped to any place that does not have a localized version of its own. Hopefully they turn on that support for those scenarios (for obvious reasons they do the right thing for many of the localized SKUs -- if they did not then you wouldn't be able to see the UI in some cases!).

    Over the past five years I have met with, spoken to, consulted for, been bushwhacked by, partnered with, and presented to a lot of different customers about international issues. And this is an issue that has come up again and again with consultants and large organizations and governments and sysadmins and people who do not always have the administrative permissions required to install these components....

    Obviously this is a battle that is worth taking up, right? :-)

    Plans for Longhorn are something that I can't really talk about yet, but it will not be very long before that is not true. So keep your eyes here, once Longhorn features are all fair game, there are a lot of exciting things going on in the area of globalization support and typography and MUI. And that is work that I am pretty excited to be doing, if you know what I mean. So hang in there a little bit longer. :-)

     

    This post brought to you by "ᾜ" (U+1f9c, a.k.a. GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI)

  • Sorting it all Out

    The last word on the FINAL SIGMA

    Back in the beginning of April, I explained about the one scenario where casing does not need to roundtrip in .NET -- the Greek final sigma.

    Anyway, the day before yesterday I got an email from someone who had been reading my blog and was looking at all of the one-way mappings that are in the linguistic tables (accessed with the LCMAP_LINGUISTIC_CASING flag, which I have discussed previously). He was wondering why that FINAL SIGMA could not be put into the linguistic tables since it is a one-way mapping.

    A fair question, one I thought worthy of a post. :-)

    If you are a native speaker of Greek, then you know that both ς (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA) and σ (U+03c3, a.k.a. GREEK SMALL LETTER SIGMA) do indeed uppercase to Σ (U+03a3, a.k.a. GREEK CAPITAL LETTER SIGMA). But if we added this character to the linguistic table, then suddenly ς would never be uppercased by the CharUpper/CharUpperBuff functions, or by the default call to the LCMapString function with the LCMAP_UPPERCASE flag.

    Obviously that would not be a good thing.

    Try to imagine how you would feel if attempting to uppercase the string hello came out as HELLo. Wouldn't you consider it a bug? Especially if it used to come out with the HELLO you were expecting? You might be thinking about telling the platform GooDBYE, if you know what I mean.

    Of course ideally the functions would notice whether the Σ was at the end of a word and then decide whether to use ς or σ, depending. But LCMapString does not really look beyond the character level here, so until it does that would not really be an option.

    Though of course a more sophisticated application might work to provide results beyond the character boundary. Though I do not envy such programs; the boundary for them becomes quite fuzzy if you have non-Greek characters after the ς. Does that count as a new word or doesn't it? That is the kind of question where an API can never win -- no matter which way it goes, there will be some people who do not like the answer.
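
    To make the one-way mapping a bit more concrete, here is a minimal sketch. It is written in Python rather than against the Win32 functions named above (an assumption made purely for illustration; nothing here says how CharUpper or LCMapString behave). Python's str.upper() maps both sigmas to Σ, and its str.lower() happens to look at neighboring characters when choosing between ς and σ, which is one possible answer to the "is it a new word?" question and not necessarily the right one:

```python
FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
SMALL_SIGMA = "\u03c3"    # GREEK SMALL LETTER SIGMA
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA


def code_points(text):
    """Render a string as U+XXXX values so the output is easy to read anywhere."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Uppercasing is the easy direction: both lowercase forms map to the same capital.
upper_of_final = FINAL_SIGMA.upper()
upper_of_small = SMALL_SIGMA.upper()

# Lowercasing is where context matters. Python picks the final form only when
# the capital sigma follows a cased letter and no cased letter comes after it.
word = "\u039f\u0394\u039f\u03a3"       # Greek capitals: omicron delta omicron sigma
lower_word = word.lower()               # sigma is word-final here
lower_alone = CAPITAL_SIGMA.lower()     # no context at all
lower_mixed = (word + "X").lower()      # a Latin letter follows the sigma

# With no context available, the round trip cannot bring the final form back.
round_trip = FINAL_SIGMA.upper().lower()

print("upper(final sigma):", code_points(upper_of_final))
print("upper(small sigma):", code_points(upper_of_small))
print("lower(word):       ", code_points(lower_word))
print("lower(sigma alone):", code_points(lower_alone))
print("lower(word + X):   ", code_points(lower_mixed))
print("final -> up -> low:", code_points(round_trip))
```

    The last line is the whole problem in miniature: once ς has become Σ, a character-by-character mapping has no way to know which lowercase form it started as.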

    Anyway, that is why ς is not uppercased only in the linguistic table. Because there are too many cases where the results simply don't make sense, at least not as things are implemented currently....

     

    This post brought to you by "ς" (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)
    A character that wonders whether Unicode would have been simpler if it did not exist as an independent entity, and fonts could then decide whether to make it a "final" form or not....

  • Sorting it all Out

    You may want to rethink your choice of UTF, #3 (Platform?)

    • 0 Comments

    Ok, by now you know the drill -- I am comparing the various ways of expressing text in Unicode.

    In prior posts I have talked about the issues related to size and to speed. However, both of those posts were working in a theoretical vacuum that was independent of the world in which the code would have to live.

    This may be suitable for the internal engine of your component, but once your component has to talk to the world outside, the need to take into account the environment on which the code must rest (in other words, the platform) becomes important.

    What is crucial here is that the fastest and best encoding to use for these communications is the "native" type of the platform.

    The key is to match that encoding form, whatever it may be.

    If you do, then all of the native APIs of that platform are available to you, and by minimizing the number of conversions you maximize performance while minimizing the chance of logical errors corrupting data.

    If you are running against a Windows platform, then that means you are using UTF-16. Period.

    If you are a web service that has to deal with Internet protocols like SOAP and such (or if your platform's Unicode support story happens through UTF-8) then your best bet may be UTF-8.

    And if you are running on a UNIX box that uses those four byte code points then UTF-32 is really your only good option.
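
    As a rough illustration of what "matching the encoding form" means at the byte level, the sketch below encodes one sample string in each of the three forms. The sample text and the choice of the little-endian Python codec names (used so that no byte order mark is added) are assumptions for the example, not anything a particular platform requires:

```python
# Eight code points: "caf", e-acute, space, PARTIAL DIFFERENTIAL, space,
# and MUSICAL SYMBOL G CLEF (which lives outside the BMP).
text = "caf\u00e9 \u2202 \U0001d11e"

# Python codec names for each form; "-le" avoids an automatic byte order mark.
forms = {
    "UTF-8": "utf-8",
    "UTF-16": "utf-16-le",
    "UTF-32": "utf-32-le",
}

# The same text, three different byte counts.
sizes = {name: len(text.encode(codec)) for name, codec in forms.items()}

# Every form is lossless, so the cost of a mismatch is the conversion itself
# (and the chance of getting it wrong), not lost data.
round_trips = all(
    text.encode(codec).decode(codec) == text for codec in forms.values()
)

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
print("all forms round-trip:", round_trips)
```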

    Now remember that this only refers to those external communications.

    If you do extensive string processing in an existing application, then it will often make more sense to leave it in that alternate form and just make sure to use the platform type when that communication is required. You may also find certain operations to be much more cumbersome in the areas of string processing, formatting, and parsing. If that is the case, then your internal engine might be using UTF-16 or UTF-32, even if the underlying platform's default type is not.

    This may also be the case if you create cross-platform libraries -- it may prove to be an unmaintainable mess to try to make the underlying identity of strings change on different platform compiles, but it would make perfect sense for the library to use one for all of its internal code....
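
    One way to picture that split between the internal engine and the outside world is a library that converts exactly once on the way in and once on the way out. This is only a sketch: the platform labels, the function names, and the little-endian byte order are all hypothetical, and Python's own str type stands in for "the one internal form":

```python
# Hypothetical platform labels mapped to the encoding each one is assumed
# to speak natively (little-endian chosen arbitrarily for the example).
NATIVE_ENCODING = {
    "windows": "utf-16-le",
    "web": "utf-8",
    "unix-utf32": "utf-32-le",
}


def from_platform(data: bytes, platform: str) -> str:
    """Convert platform bytes into the library's single internal form."""
    return data.decode(NATIVE_ENCODING[platform])


def to_platform(text: str, platform: str) -> bytes:
    """Convert the internal form back into what the platform expects."""
    return text.encode(NATIVE_ENCODING[platform])


def normalize_title(text: str) -> str:
    """Stand-in for the library's real string processing."""
    return " ".join(text.split()).title()


def handle_request(data: bytes, platform: str) -> bytes:
    text = from_platform(data, platform)   # one conversion in
    result = normalize_title(text)         # all work happens in one form
    return to_platform(result, platform)   # one conversion out


reply = handle_request("  sorting  it all out ".encode("utf-16-le"), "windows")
print(reply.decode("utf-16-le"))
```

    Only the two boundary functions know about the platform, so the string processing in the middle never has to change from one platform compile to the next.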

     

    This post brought to you by "∂" (U+2202, a.k.a. PARTIAL DIFFERENTIAL)

  • Sorting it all Out

    Encoding scheme, encoding form, or other

    • 2 Comments

    No one ever accused the Universal Character Set of being simple.

    Just short of 100,000 characters, many different scripts and languages, all sorts of complex scripts.

    Unicode is downright hard, sometimes.

    If you asked me, that is the biggest reason for Microsoft to just call it Unicode, rather than Unicode Transformation Format-16 bit, Little Endian. Because that keeps it a bit simpler for people who do not have to care about that level of detail.

    And now I am going to ruin all that for a bit.

    Who am I? The party pooper! :-)

    Now first of all, there are three different Unicode forms: UTF-8, UTF-16, and UTF-32. Those are the only recognized legal Unicode forms. No matter how many times UTF-9 is published as an RFC, its April 1st publish date will always give it away. I have been comparing various issues between the Unicode forms like the size and the speed over the past few days. The Unicode forms are actually good descriptions of the way Unicode is represented.

    The next level takes us into the way that those forms are actually stored on disk or in memory if you are literally looking at byte entries. These are known as the Unicode schemes and there are five of them:

    • UTF-8
    • UTF-16BE (UTF-16, Big Endian)
    • UTF-16LE (UTF-16, Little Endian)
    • UTF-32BE (UTF-32, Big Endian)
    • UTF-32LE (UTF-32, Little Endian)
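
    To see what the five entries above look like as actual bytes, here is a small sketch that serializes a single character under each one. The choice of U+20AC (EURO SIGN) as the sample and the helper name are assumptions for illustration; the codec names are the ones Python ships with:

```python
# Python codec names corresponding to the five schemes listed above.
SCHEMES = {
    "UTF-8": "utf-8",
    "UTF-16BE": "utf-16-be",
    "UTF-16LE": "utf-16-le",
    "UTF-32BE": "utf-32-be",
    "UTF-32LE": "utf-32-le",
}


def scheme_bytes(ch):
    """Return the byte sequence for ch under each scheme as a hex string."""
    return {
        name: " ".join(f"{b:02x}" for b in ch.encode(codec))
        for name, codec in SCHEMES.items()
    }


euro = scheme_bytes("\u20ac")  # U+20AC EURO SIGN
for name, hex_bytes in euro.items():
    print(f"{name:9} {hex_bytes}")
```

    The BE and LE rows hold the same code units with the bytes swapped, while UTF-8 has only one row because it is a sequence of single bytes with no byte order to argue about.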

    Now I have talked about the whole Endian thing in the past, and there will be one of those Jeff Foxworthy-esque You may want to rethink your choice of UTF posts soon that talks about Endian issues, RSN (real soon now).

    For now, you can just realize that Unicode is legislating that a USHORT and a UINT have different byte orders on different platforms, similar to the way that a government could legislate gravity if they wanted to -- they are recognizing what platforms do and just trying to formally describe them. :-)
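
    The "legislating gravity" point can be checked directly: the BE and LE schemes are just what you get when a 16-bit or 32-bit integer is written out in big-endian or little-endian order. The sketch below uses Python's struct module, with its H and I format codes standing in for USHORT and UINT (that correspondence is an assumption for the example):

```python
import struct
import sys

code_point = 0x20AC  # EURO SIGN, which is a single UTF-16 code unit

# A 16-bit unsigned value written in each byte order.
ushort_big = struct.pack(">H", code_point)
ushort_little = struct.pack("<H", code_point)

# A 32-bit unsigned value written in each byte order.
uint_big = struct.pack(">I", code_point)
uint_little = struct.pack("<I", code_point)

# The encoding schemes produce exactly those bytes.
matches = (
    ushort_big == "\u20ac".encode("utf-16-be")
    and ushort_little == "\u20ac".encode("utf-16-le")
    and uint_big == "\u20ac".encode("utf-32-be")
    and uint_little == "\u20ac".encode("utf-32-le")
)

# What this particular machine would write if left to its own devices.
native_ushort = struct.pack("=H", code_point)
native_scheme = "UTF-16LE" if sys.byteorder == "little" else "UTF-16BE"

print("schemes match integer byte orders:", matches)
print("this machine's native 16-bit layout corresponds to", native_scheme)
```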

    Now there is also CESU-8, which I briefly discussed when I was talking about size and speed a few days ago. It is not an encoding form and it is not, strictly speaking, an encoding scheme in the formal recognized sense like the five entries above. Although to make life more confusing for everyone, the full name of CESU-8 is Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). Its status is defined in the summary of the Technical Report:

    This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended for internal use within systems processing Unicode in order to provide an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation. It is not intended nor recommended as an encoding used for open information exchange. The Unicode Consortium, does not encourage the use of CESU-8, but does recognize the existence of data in this encoding and supplies this technical report to clearly define the format and to distinguish it from UTF-8. This encoding does not replace or amend the definition of UTF-8.

    So far it has not done well in contrast to the official Unicode forms, and that will likely not improve in future posts. Sorry, but those are the breaks....
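
    For the curious, the difference from UTF-8 only shows up for supplementary characters: CESU-8 encodes each UTF-16 surrogate as its own three-byte sequence. The sketch below is a hand-rolled illustration, not a real codec; cesu8_encode is a hypothetical helper (Python has no built-in CESU-8 codec), and it leans on the "surrogatepass" error handler to emit the surrogate bytes:

```python
def cesu8_encode(text):
    """Encode text as CESU-8: UTF-8 for the BMP, surrogate pairs beyond it."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into the UTF-16 surrogate pair, then encode each half
            # as a three-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


bmp_char = "\uff5e"         # FULLWIDTH TILDE, near the top of the BMP
supp_char = "\U00010400"    # DESERET CAPITAL LETTER LONG I, supplementary

utf8_bytes = supp_char.encode("utf-8")
cesu8_bytes = cesu8_encode(supp_char)

# Binary sort order: does the supplementary character come first?
utf16_order = supp_char.encode("utf-16-be") < bmp_char.encode("utf-16-be")
utf8_order = supp_char.encode("utf-8") < bmp_char.encode("utf-8")
cesu8_order = cesu8_encode(supp_char) < cesu8_encode(bmp_char)

print("UTF-8: ", utf8_bytes.hex())
print("CESU-8:", cesu8_bytes.hex())
print("supplementary sorts first (UTF-16, UTF-8, CESU-8):",
      utf16_order, utf8_order, cesu8_order)
```

    The ordering check is the "preserves UTF-16 binary collation" claim from the summary: CESU-8 agrees with UTF-16 about which character sorts first, and real UTF-8 does not.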

    Now there is also UTF-EBCDIC, which, to once again help confuse all, is described in the summary as an "EBCDIC Friendly Unicode (or UCS) Transformation Format" (confusing because it is yet another UTF that in this case is called neither a form nor a scheme!). Luckily the scope section defines where it ought to be used: "Neither UTF-EBCDIC nor its intermediate form called UTF-8-Mod in this technical report, are intended to be used in open interchange environments. It is useful in homogeneous EBCDIC systems and networks". Which kind of says it all.

    Now both CESU-8 and UTF-EBCDIC should probably have been Unicode Technical Notes, and some people have even pointed that out. You can tell people "my dear boy..." when you explain to them that UTNs did not formally exist when these two proposals were either approved or on track to be approved. Maybe next time?

    In any case, Microsoft does not support either of these formats, as they are not intended for interchange, and none of its internal processes use them. Nothing personal, they are just really not MS formats.

     

    This post is brought to you by "۩" (U+06e9, a.k.a. ARABIC PLACE OF SAJDAH)
