
I have not officially decided whether to laugh or cry about this fascinating way to celebrate the encoding of U+1e9e:

But I have no plans to get these to wear as cufflinks at Jenny and Alex's wedding the weekend after next.

You can read about it (in German) here. If you have even a meager knowledge of German, the comments have a lot of hilarity in them. :-)

Anyway, previous, less celebratory blogs about this letter:

  • September 2005: Every character has a story #15: CAPITAL SHARP S (not encoded)
  • May 2007: Every character has a story #26: CAPITAL SHARP S (might be encoded?)
  • August 2007: Every character has a story #28: U+1e9e (CAPITAL SHARP S)
  • February 2008: The idea has to do more than just make sense to me (aka How S-Sharp are *you* feeling today?)
  • April 2008: Kind of ironic how Germany seems so okay with Capital *Letter* punishment, huh?
    Of course, keep in mind that for most of the German-speaking world this still isn't a letter.

    And it is not in code page 1252 so non-Unicode applications won't like it much either....

    Not that I am bitter about this or anything. :-)

     

    (Hat tip to Simon Daniels)

     

    This post brought to you by ß and ẞ (U+00df and U+1e9e, LATIN SMALL LETTER SHARP S and LATIN CAPITAL LETTER SHARP S)

    The mail that came on the other day was:

    We’re seeing an issue that doesn’t seem correct to me, but I just wanted to confirm. In debugging the issue, the developer found that operating system shows the special genitive month name for Hebrew locale and Hebrew lunar calendar only.  Does the Hebrew lunar calendar even have special genitive month names? We don’t think so. 

    Can you shed any light on why Windows would show the special genitive month name only  when the user locale is Hebrew and calendar is Hebrew lunar?  Is there something we’re calling incorrectly? A Windows bug? ???

    Thanks so much for your help,

    The repro steps were then provided, but those aren't too relevant here. :-)

    First, and most importantly, Hebrew does not have genitive month names, or any kind of genitive form that would impact dates. In fact, looking at the Wikipedia on the Hebrew language:

    Hebrew grammar is partly analytic, expressing such forms as dative, ablative, and accusative using prepositional particles rather than grammatical cases. However, inflection plays a decisive role in the formation of the verbs and nouns. E.g. nouns have a construct state, called "smikhut", to denote the relationship of "belonging to": this is the converse of the genitive case of more inflected languages. Words in smikhut are often combined with hyphens. In modern speech, the use of the construct is sometimes interchangeable with the preposition "shel", meaning "of". There are many cases, however, where older declined forms are retained (especially in idiomatic expressions and the like), and "person"-enclitics are widely used to "decline" prepositions.

    There is also a link to more information that explains there is (in a sense) a genitive case -- but if you look at how it is constructed you will see why dates in general and month names in particular are not affected.

    The interesting issue here is that even though

    • people on one side were pretty insistent in explaining that Hebrew had no genitive case here
    • there were people on the other side who were equally insistent that there must be one

    since they were seeing the same apparent net effect -- different format strings used in calls to GetDateFormat with the same date values returning different month names.

    Interesting conclusion, right? Not terribly accurate, but still kind of astute!

    Now I have talked about genitive forms a lot in the past:

  • What the %$#! are genitive dates? (25 December 2004)
  • Don't roll your own GetDateFormat (15 July 2005)
  • Genitive dates, revisited (10 November 2005)
  • Do genitive dates always work properly for Greek? (10 November 2005)
  • Any Sami speakers reading this blog? :-) (11 November 2005)
  • One last post about genitive dates (12 November 2005)
  • Practical Uses for Replacement Cultures/Locales (20 March 2006)
  • It may not always end with ի (08 April 2006)
  • It may not always end with ის or ისა, either (09 April 2006)
  • A re-genitive post (04 August 2007)
    And I wonder whether one or more of them contributed here. In all of those posts I have never gotten too deep into how they are constructed in language.

    I probably will talk about this at some point, though not today, since it is not relevant to the problem this team was seeing. :-)

    The problem here had a very different cause....

    I'll give you a hint what it is.

    Consider the blog .NET is too busy being consistent with Windows to be consistent with itself...., noting that this is not the actual problem here, though the root cause is similar.

    You see, in that blog, the discussion is about how the calendar chosen by the user in Regional and Language Options is causing the calendar being used by the CurrentCulture in the .NET Framework to not be the new, cool default calendar in .NET. The user preference overrides things.

    The bug in the reported Hebrew case was that the month names were being retrieved in several different ways, including calls to GetDateFormat with format strings like "MMMMdd" vs. "MMMM" (which on the surface is a great way to detect genitive month name usage!). And one other small difference:

    The use of LOCALE_NOUSEROVERRIDE in some cases but not others.

    That simple difference, combined with (from the original report) "when the user locale is Hebrew and calendar is Hebrew lunar", means that when the LOCALE_NOUSEROVERRIDE flag is passed the Gregorian month comes out; when the flag is not, the Hebrew lunar month comes out.

    Just like the user wants!

    This will give us דצמבר (December) versus כסלו (Kislev) and so on.

    Our "non-genitive" form. :-)

    The actual code difference between the two cases feels quite natural, because after all there are performance benefits to using LOCALE_NOUSEROVERRIDE when you can get away with it.

    Who would say no to reasonable performance enhancements?

    And we don't tend to think about things like month names changing, so it seems like a great candidate for optimization. Only in this case, it isn't, so much. :-(

    You can even see it in managed code in those same circumstances where the user locale choices cause the data to be handled/interpreted differently. Just like what was described in .NET is too busy being consistent with Windows to be consistent with itself....

    And if you want to avoid those different month names, remember to call the functions more consistently!

    This blog brought to you by כ (U+05db, aka HEBREW LETTER KAF)

    I can't help feeling that Outlook HoliDAZE thing again (and I just added that word to my spell-checking dictionary -- I think we've not seen the last of this!).

    So this last Monday was Whit Monday for some people, as Outlook let me know:

    Interesting.

    Okay, that takes care of the Roman Catholics. How about the Eastern Orthodox?

    Now we don't have to totally panic here. As Wikipedia describes:

    In the Eastern Orthodox Church Whit Monday is known as "Monday of the Holy Spirit" or "Day of the Holy Spirit" and is the first day of the afterfeast of Pentecost, being dedicated specifically to the honor of God the Holy Spirit.

    So it looks like they have it for Greece, at least. So not all of the locations are lost here -- just most of them. :-(

    Of course it gets a little worse when you look for Whit Friday:

    Again I can't help feeling  like calendars in Outlook are once again letting me down here....

     

    This blog brought to you by 𐤹 (U+10939, aka LYDIAN LETTER C)

    I have talked about WinZip and Zip in the past, in particular their odd relationship with Unicode, in blogs like:

    Now if you spend some time in the comments in that Zipping up Unicode file names blog, several people talked about how the ZIP format itself has extensions to handle Unicode. Of course without implementers that is quite the theoretical point. If you know what I mean.
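
    For reference, one of those extensions is a single flag bit in each entry's header: bit 11 of the "general purpose bit flag" (the "language encoding flag") says that the entry's file name and comment are stored as UTF-8. Here is a minimal sketch of checking that one bit in a local file header. The function name is made up for illustration, and nothing here says which mechanism any particular Zip tool actually uses:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when `p` starts with a ZIP local file
// header whose general purpose bit 11 (language encoding flag) is set,
// meaning the entry's file name and comment are stored as UTF-8.
bool zip_entry_name_is_utf8(const unsigned char* p, std::size_t n) {
    if (n < 8) return false;                       // need signature + version + flags
    // Local file header signature 0x04034b50, stored little-endian: "PK\3\4"
    if (p[0] != 0x50 || p[1] != 0x4B || p[2] != 0x03 || p[3] != 0x04) return false;
    // Bytes 4-5: version needed to extract. Bytes 6-7: general purpose bit flag.
    const std::uint16_t flags = static_cast<std::uint16_t>(p[6] | (p[7] << 8));
    return (flags & 0x0800) != 0;                  // bit 11
}
```

    A reader that sees the bit clear has to fall back to guessing a code page for the name, which is exactly the problem those comments were about.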

    But do you know what?

    Well, the gloves are off now, baby!

    In my INBOX yesterday, from the folks over at WinZip:

    We are pleased to inform you that we have just released WinZip 11.2, the second update to our most recent major release, WinZip 11.

    WinZip 11.2 takes WinZip and the Zip compression standard to a wider global audience by integrating the Zip format’s newly-added Unicode support, allowing accurate rendering of international characters in file names when using the same Zip file on computers with different code pages. WinZip 11.2 changes include:

    • Unicode support to ensure proper display of international characters for file names in a Zip file
    • Integrated support for LHA (.LHA and .LZH support)
    • Removal of support for DOS-based, third-party programs such as ARJ and ARC
    • Minor bug fixes and enhancements

    If you need assistance with the operation of the software, or have questions or comments about WinZip products, please contact us.

    Your use of WinZip is governed by the terms of the WinZip License.

    Thank you,
    WinZip Computing

    This is completely and entirely awesome.

    Folks over in Windows now have a job to do for next version -- and I have some email to send tomorrow.... :-)

    I also have to talk to the site license folks, too. More mail. But a good kind....

    WinZip, you rock!

     

    This post brought to you by Ž (U+017d, a.k.a. LATIN CAPITAL LETTER Z WITH CARON)
    (An entirely zippable character in the new version of WinZip!)

    Randall Munroe over on xkcd has done it again....

    A Better Idea

    In the words of Miss Stephanie H., there are no words. :-)

    Thanks for this one, Randall. YOMAK, big time!

     

    This blog brought to you by ䷗ (U+4dd7, aka HEXAGRAM FOR RETURN)

    So I got email from a developer colleague the other day.

    It was that same developer who had previously impressed me -- as described in Expertise isn't always everything (aka When the one who is learning teaches us something important) -- and kept me from embarrassing myself by unknowingly repeating the same topic again -- as described in Font size scaling -- GDI vs. GDI+.

    Due to the positive nature of these two interactions (and several later ones by other members of her team), I like to pay attention when one of those mails hits my inbox. You know, just in case....

    And this mail turned out to be no exception. :-)

    The mail read:

    Hello Michael,

    We are running into a problem with vertical text and I am hoping you might be able to shed some light on it.  Please take a look at the attached files – the pdf file is what we expect to see in terms of content, the TIF file is what we get.

    The TIF file was generated on a Windows 2003 server with Office 12 installed. The problem is caused by @ in front of font family. The font used in this case is Tahoma but the problem repro’s with other fonts too.

    The interesting part is that we  don’t have this problem on other computers. I tried on Vista, XP and even on another Windows 2003 server too. It works just fine. I compared the versions of GDI and Uniscribe dll’s on the two Windows 2003 servers and they are the same.

    The ScriptShape and ScriptPlace methods succeed and the glyphs and their widths look good. I suspect this is more of a drawing problem.

    Any clue what might go wrong here? Any system settings that I should compare between the two computers?

    Thank you,

    The  PDF file looked something like this (shown here as a somewhat reduced image of what it looked like):

    And the image file looked really different:

     

    Ick.

    I was at a loss for a moment -- it isn't like I had ever seen that kind of a problem before.

    But then I thought about the lesson learned from Expertise isn't always everything and the fact that they were pushing past the traditional boundaries of vertical text where people had previously been comfortable.

    My response was understated. I reasoned that it worked on one machine and not another, yet file versions all looked the same, and so:

    Hey, long time no mail! :-)

    The biggest thing I'd check would be whether Complex Script support was turned on or not -- when it is not, the binaries for Uniscribe are still around but the hooks that let GDI call back into them aren't active.

    I admit that at the time I was stalling a bit, as I wanted to look into this with some other people too but I didn't want to suggest nothing at all in the meantime.

    To be honest, I did not suspect that this would actually help, but I knew it was vaguely possible that it might (for the reason given -- the world of calling Uniscribe directly and sometimes indirectly has strange behavior when complex script support is turned off and most indirect hooks from GDI to Uniscribe are not doing anything).

    Turns out that it fixed the problem!

    Now for the follow-up we talked about a bunch of other issues like whether they should always make sure the support is enabled, what that checkbox actually does in the way of settings, and so on. She realized it would make sense to understand more about the interaction between GDI and Uniscribe.

    I will follow up on this topic in the future -- lots of other people would also like to understand this area better, beyond this one team that keeps finding issues no one ever imagined. Those issues must nevertheless be figured out, in many cases documented, and in some cases fixed when bugs are found (these are Destry bugs of the highest order!).

    Especially in this weird case -- where vertical text does not traditionally go through Uniscribe, and the only Uniscribe connection here is that they were technically calling through the low-level Uniscribe APIs to get the work done.

    Explaining how it ended up solving the problem?

    That one managed to stump everyone I asked.

    Though no one was truly surprised that it worked.

    Which I guess just goes to show you how complicated the interaction in the text stack is.

    Especially when running non-complex vertical (and thus GDI-handled) text through Uniscribe when complex script support (and thus GDI to Uniscribe connection via LPK) is turned off.

    Yet another reason to be thankful that this support is always turned on now, starting in Vista? :-)

    This blog brought to you by ឣ (U+17a3, aka KHMER INDEPENDENT VOWEL QAQ)

    It is a commonly reported issue in Windows and in many components that run upon it. A recent report can be seen on the Connect site, here:

    Description:
    There is a conversion problem in the c/c++ runtime library. Turkish characters ı/I and i/İ are converted incorrectly to upper/lower case.

    Code that reproduces the issue:

    #include "stdafx.h"
    #include <string>
    #include <xlocale>
    #include <iostream>
    #include <algorithm>
    #include <functional>
    int _tmain(int argc, _TCHAR* argv[]) {
        std::locale a = std::locale::locale();
        std::locale::global (std::locale("Turkish"));
        std::locale b = std::locale::locale();
        std::cout << a.name().c_str() << " -> " << b.name().c_str() << std::endl;
        std::wstring lowerCase = _T("ığüşiöç");
        std::wstring upperCase = _T("IĞÜŞİÖÇ");
        std::wstring upperResult, lowerResult;

        upperResult.resize(lowerCase.length());
        lowerResult.resize(lowerCase.length());
        std::transform(lowerCase.begin(), lowerCase.end(), upperResult.begin(), towupper);
        std::transform(upperCase.begin(), upperCase.end(), lowerResult.begin(), towlower);

        std::wcout << lowerCase << std::endl;
        std::wcout << lowerResult << std::endl;
        std::wcout << upperCase << std::endl;
        std::wcout << upperResult << std::endl;
        if (upperCase != upperResult || lowerCase != lowerResult) {
            std::cout << "Conversion failed" << std::endl;
        }
        return 0;
    }

    Observed Results:

        C -> Turkish_Turkey.1254
        ığüşiöç
        iğüşİöç
        IĞÜŞİÖÇ
        ıĞÜŞIÖÇ
        Conversion failed

    Expected Results:

        C -> Turkish_Turkey.1254
        ığüşiöç
        ığüşiöç
        IĞÜŞİÖÇ
        IĞÜŞİÖÇ

    The issue boils down to the very simple fact that the C runtime's casing functions are being used underneath this code, and LCMapString is being called underneath that.

    They are passing the Turkish locale to LCMapString, but they are not passing the LCMAP_LINGUISTIC_CASING flag, which means that Turkic case tables are not being used.
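
    The Turkic-specific part of those tables is tiny: dotted i pairs with İ (U+0130), and dotless ı (U+0131) pairs with I. A minimal sketch of just that rule follows. The helper names are hypothetical (they are not CRT or Win32 functions), and everything outside those four characters simply defers to towupper/towlower:

```cpp
#include <cwctype>

// Code points for the two letters that ASCII does not cover.
constexpr wchar_t kCapitalDottedI = 0x0130; // İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
constexpr wchar_t kSmallDotlessI  = 0x0131; // ı  LATIN SMALL LETTER DOTLESS I

// Hypothetical helper: uppercase with the Turkic i/I rule applied first.
wchar_t turkic_toupper(wchar_t c) {
    if (c == L'i')            return kCapitalDottedI; // i -> İ
    if (c == kSmallDotlessI)  return L'I';            // ı -> I
    return static_cast<wchar_t>(std::towupper(c));    // everything else: default mapping
}

// Hypothetical helper: lowercase with the Turkic i/I rule applied first.
wchar_t turkic_tolower(wchar_t c) {
    if (c == L'I')            return kSmallDotlessI;  // I -> ı
    if (c == kCapitalDottedI) return L'i';            // İ -> i
    return static_cast<wchar_t>(std::towlower(c));
}
```

    With this rule both directions round-trip, which is what the "Expected Results" in the bug report are asking for.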

    On the surface, there is an easy fix -- just make LCMAP_LINGUISTIC_CASING get passed here, right?

    Though it is of course not that simple or it would have been fixed years ago, and I wouldn't be blogging about it here....

    I'll point toward two blog posts from the end of 2004:

    Especially that second one, which points out the two things that the LCMAP_LINGUISTIC_CASING flag does:

    1. You get the right behavior for Turkic locales like Turkish and Azeri;
    2. You get a bunch of one-way mappings on all locales, e.g. U+03f1 (Greek Rho Symbol) will uppercase to U+03a1 (Capital Greek Rho), which will lowercase to U+03c1 (Small Greek Rho).

    Now #1 is the "fix" for this bug, sure. It even kind of goes along with the C/C++ standards in this regard, e.g. 7.25.3.1.1.3 and 7.25.3.1.2.3 of C99:

    7.25.3.1.1.3 (the towlower function): If the argument is a wide character for which iswupper is true and there are one or more corresponding wide characters, as specified by the current locale, for which iswlower is true, the towlower function returns one of the corresponding wide characters (always the same one for any given locale); otherwise, the argument is returned unchanged.

    7.25.3.1.2.3 (the towupper function): If the argument is a wide character for which iswlower is true and there are one or more corresponding wide characters, as specified by the current locale, for which iswupper is true, the towupper function returns one of the corresponding wide characters (always the same one for any given locale); otherwise, the argument is returned unchanged.

    It is the second thing the flag does, adding all of the following other mappings, that makes all of this messier.

    Those other mappings are:

    Uppercase (all locales other than Azeri and Turkish):

    U+0131 --> U+0049 (LATIN SMALL LETTER DOTLESS I --> LATIN CAPITAL LETTER I)
    U+01c5 --> U+01c4 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON --> LATIN CAPITAL LETTER DZ WITH CARON)
    U+01c8 --> U+01c7 (LATIN CAPITAL LETTER L WITH SMALL LETTER J --> LATIN CAPITAL LETTER LJ)
    U+01cb --> U+01ca (LATIN CAPITAL LETTER N WITH SMALL LETTER J --> LATIN CAPITAL LETTER NJ)
    U+01f2 --> U+01f1 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z --> LATIN CAPITAL LETTER DZ)
    U+0390 --> U+03aa (GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS --> GREEK CAPITAL LETTER IOTA WITH DIALYTIKA)
    U+03b0 --> U+03ab (GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS --> GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA)
    U+03d0 --> U+0392 (GREEK BETA SYMBOL --> GREEK CAPITAL LETTER BETA)
    U+03d1 --> U+0398 (GREEK THETA SYMBOL --> GREEK CAPITAL LETTER THETA)
    U+03d5 --> U+03a6 (GREEK PHI SYMBOL --> GREEK CAPITAL LETTER PHI)
    U+03d6 --> U+03a0 (GREEK PI SYMBOL --> GREEK CAPITAL LETTER PI)
    U+03f0 --> U+039a (GREEK KAPPA SYMBOL --> GREEK CAPITAL LETTER KAPPA)
    U+03f1 --> U+03a1 (GREEK RHO SYMBOL --> GREEK CAPITAL LETTER RHO)

    Lowercase (all locales other than Azeri and Turkish):

    U+0130 --> U+0069 (LATIN CAPITAL LETTER I WITH DOT ABOVE --> LATIN SMALL LETTER I)
    U+01c5 --> U+01c6 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON --> LATIN SMALL LETTER DZ WITH CARON)
    U+01c8 --> U+01c9 (LATIN CAPITAL LETTER L WITH SMALL LETTER J --> LATIN SMALL LETTER LJ)
    U+01cb --> U+01cc (LATIN CAPITAL LETTER N WITH SMALL LETTER J --> LATIN SMALL LETTER NJ)
    U+01f2 --> U+01f3 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z --> LATIN SMALL LETTER DZ)
    U+03d2 --> U+03c5 (GREEK UPSILON WITH HOOK SYMBOL --> GREEK SMALL LETTER UPSILON)
    U+03d3 --> U+03cd (GREEK UPSILON WITH ACUTE AND HOOK SYMBOL --> GREEK SMALL LETTER UPSILON WITH TONOS)
    U+03d4 --> U+03cb (GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL --> GREEK SMALL LETTER UPSILON WITH DIALYTIKA)

    Now these somewhat random behaviors exist in all of the case mappings that happen in .NET except for the ones based on the InvariantCulture, which ends up as the source for a lot of unexpected behavior that pops up from time to time, and not only due to the fact that as lists go they are incomplete....

    You can look at them and probably guess some of the problems they can cause with these strange one-way conversions!
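
    To see one of those problems in miniature, here is a sketch that hard-codes a handful of entries from the uppercase list plus the ordinary Greek pairs they collide with, and then checks whether a character survives an upper-then-lower round trip. The function names and the tiny tables are made up for illustration; they are not the real Windows or .NET tables:

```cpp
#include <unordered_map>

// Hypothetical stand-in for "uppercase with linguistic casing": a few of the
// one-way symbol mappings from the list, plus the ordinary letter pairs.
char32_t linguistic_upper(char32_t c) {
    static const std::unordered_map<char32_t, char32_t> table = {
        {0x03D0, 0x0392}, // GREEK BETA SYMBOL   -> CAPITAL BETA
        {0x03D1, 0x0398}, // GREEK THETA SYMBOL  -> CAPITAL THETA
        {0x03F0, 0x039A}, // GREEK KAPPA SYMBOL  -> CAPITAL KAPPA
        {0x03F1, 0x03A1}, // GREEK RHO SYMBOL    -> CAPITAL RHO
        {0x03B2, 0x0392}, // small beta          -> CAPITAL BETA
        {0x03C1, 0x03A1}, // small rho           -> CAPITAL RHO
    };
    const auto it = table.find(c);
    return it == table.end() ? c : it->second;
}

// Hypothetical stand-in for lowercasing: capitals go back to the ordinary
// small letter, never to the symbol form.
char32_t simple_lower(char32_t c) {
    static const std::unordered_map<char32_t, char32_t> table = {
        {0x0392, 0x03B2}, {0x0398, 0x03B8}, {0x039A, 0x03BA}, {0x03A1, 0x03C1},
    };
    const auto it = table.find(c);
    return it == table.end() ? c : it->second;
}

// True when uppercasing and then lowercasing gives back the original.
bool round_trips(char32_t c) {
    return simple_lower(linguistic_upper(c)) == c;
}
```

    Small rho (U+03c1) round-trips; the rho symbol (U+03f1) comes back as U+03c1, so any code that assumes casing is reversible quietly changes the text.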

    But adding this to the CRT's behavior would essentially be adding these non-reversible transformations to almost every call. Which is not really desirable behavior, in some people's minds....

    The real question would be whether this bug would be considered more reasonable to fix if Win32 supported a more granular kind of functionality than LCMAP_LINGUISTIC_CASING provides -- a way that would keep the linguistically useful separate from the random "rehabilitate symbols" crap and all of the rest....

     

    This blog brought to you by all of the above cited Unicode characters...

    I have talked in the past about my feelings regarding the REPLACEMENT CHARACTER in blogs such as The torrents of U+fffd (aka When security and conformance trump compatibility and reality).

    And I have even mentioned in the A less intelligent strain of blog spam blog how spam attempts mostly seemed equal but there were specific blogs that seemed to attract more of them, finally showing in the blog Microsoft is giving this character nada weight but lotsa importance that one blog in particular (Getting the real (localized) name of the keyboard) seemed quite susceptible to spam containing lots of U+fffd in it.

    Even though no other blog seemed to be.

    Now generally it takes the splog/spam a bit of time to start hitting a post. It has to be live for at least a few weeks and more commonly a few months before it starts getting hit.

    But suddenly my blog from this time yesterday (Why Bengali keyboards can't be found on XP 64 bit) broke all of those rules:

    Here is a screenshot looking at them:

    Now I am curious what these two blogs:

    could possibly have in common with each other that is different from the other 2540-some blogs in Sorting it all Out.

    What makes them stand out in particular? What are the spam/splog sites targeting?

    The sites they try to point to provide no patterns here, and at this point I believe they are not actually relevant.

    Now I speak fluent notdef glyph as well as anyone, and to be honest better than most.

    And I instinctively feel that the actual information that they were trying to encode, whether it was intended for phishing purposes or not, would provide some insight into what they were trying to do. And the loss of that information, due to the UTF-16, UTF-8 & UTF-32 update to conform with Unicode 5.0's security concerns, means I will never be able to understand what the attempted attack vector was.
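
    To make the complaint concrete: a conformant decoder has to turn every ill-formed sequence into U+fffd, and once it has done so the original bytes are gone. Here is a minimal sketch of such a UTF-8 decoder. It is an illustration only, not the code the blog platform or Windows runs, the function name is made up, and emitting one U+fffd per offending byte is just one of the policies a decoder can choose:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into code points, substituting U+FFFD
// for every byte that is not part of a well-formed sequence.
std::u32string decode_utf8_lossy(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // expected sequence length, 0 = invalid lead byte
        char32_t cp = 0;       // code point being assembled
        char32_t lowest = 0;   // smallest value this length may encode
        if (b0 < 0x80)                { len = 1; cp = b0;        lowest = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;       // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (ok && (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); i += 1; }    // the original byte is discarded
    }
    return out;
}
```

    With this sketch the overlong sequence C0 AF and the stray bytes FE FF both decode to the same two U+fffd characters, so nothing downstream can tell them apart.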

    Obviously there is no problem catching this particular kind of spam; it has never gotten through at all. So it could hardly be said to be a very effective attack vector.

    And with no more than 2-10 instances being sent to each blog per day, it is hardly the most common form of spam that fails to make it through the filters on Sorting it all Out.

    But I am very curious about what the hell links up these two blogs to this particular mechanism and feel that the conformance changes have robbed me of an effective way to ever find out what the vector may be here!

    Is this so crazy?

     

    This post brought to you by (U+fffd, a.k.a. REPLACEMENT CHARACTER)

    There has been a lot of recent buzz based on the Moving to Unicode 5.1 post in the googleblog written by Mark Davis.

    Enough that people keep sending me email asking if I had seen it, much of that traffic being there because I haven't blogged about it myself yet....

    One of the most interesting points in the blog was about the Uptick in native Unicode webpages:

    Just last December there was an interesting milestone on the web. For the first time, we found that Unicode was the most frequent encoding found on web pages, overtaking both ASCII and Western European encodings—and by coincidence, within 10 days of one another. What's more impressive than simply overtaking them is the speed with which this happened; take a look at the blue line in this graph.

    You can see a long-term decline in pages encoded in ASCII (unaccented letters A through Z). More recently, there's been a significant drop in the use of encodings covering only Western European letters (ASCII and a few accented letters like Ä, Ç, and Ø). We're seeing similar declines in other language-specific encodings. Unicode, on the other hand, is showing a sharp increase in usage.

    This is based on our indexing of web pages, and thus may vary somewhat from what other search engines find. However, the trends are pretty clear, and the continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover.

    The fact that we are really only looking at percentages here might make it easier for a company like Google to make resource allocation decisions based on web content encoding that they have to index.

    But conclusions about growth (as the title of the art implies) might be a bit much, since there are other crucial factors here which Google has access to but which are not provided in this quick example -- factors like

    • the raw number of pages with each encoding, and
    • the parts of the world where traffic is coming from, and
    • the correlation (if any) between encoding and part of the world, and
    • the type of web page (static HTML, blog, whatever), and
    • other trends in the data that can be measured

    would make it much easier to assess whether we are looking at an interesting phenomenon or an uptick in the number of web sites created by a particular tool used all over China or India, or an uptick in the number of blogs created in blogspot, or whatever.

    I just don't want to draw conclusions about a multi-dimensional issue based on looking at just one dimension -- the encoding of pages (because while that is perfectly reasonable if one is looking at what encodings to put resources in, it is not as interesting for making conclusions about the overall web -- since the reasons may have to do with entirely different issues).

    This blog brought to you by ಝ (U+0c9d, aka KANNADA LETTER JHA)

    One of the interesting benefits of reading this blog is that it can provide answers to questions that you run across later.

    For example, I got mail the other day that was forwarding a question from someone:

    A friend was happily using the Bengali language keyboard on XP SP2 (32 bit) until he moved to the 64 bit version on a new box (more RAM).   He’s able to see most other Indian lang keyboards except the Bengali one.  He found kbdinben.dll was missing in the 64 bit box.    He tried to grab the dll from the 32 bit box and manually registering it, which didn’t resolve the issue.

    Anyone faced a similar issue, know a workaround or know someone in the XP team that can help?

    There are actually several different issues going on here.

    The first one is of course that keyboard DLLs are not really subject to regsvr32.exe for manual registration.

    Though this doesn't matter since 32-bit keyboard DLLs won't work on a 64-bit platform (people may recall how this is what led to the MSKLC 1.4 update).

    And of course it is a violation of the license agreement to copy the DLLs around anyway, though the fact that they won't work does seem like the biggest deterrent here. :-)

    Note the licensing issue would also apply to fonts!

    And there is the fact that one of the tricks of XP 64 bit is that it was built out of the Server 2003 source tree, not the XP one.

    Remember back in December of 2005 when I mentioned that ELKs aren't roaming where the servers are?

    It's funny, I never mentioned it but it was right after I posted this that I was told that contrary to prior discussion, there were going to be additions here. If only they had told me, I probably would have worded that post a little differently. Though after the blog was published anything I wrote would have sounded like a possible announcement, so I decided to keep quiet!

    Then in April of 2007 in Don't forget to reboot, please, in the context of describing a bug I casually mentioned

    Well, I guess you could blame it on Service Pack 1 of Server 2003....

    Remember
    ELKs aren't roaming where the servers are?

    Well, they fixed that and added a whole bunch of ELKs to Server 2003.

    That means they added a whole bunch of registry keys saying that these locales were present and an updated locale.nls that contains the data for those locales.

    People didn't really notice that I didn't really get deeper into it --  I didn't give the list of ones that were added.

    I'll do that now:

    • Croatian - Bosnia and Herzegovina
    • Bosnian - Bosnia and Herzegovina (Latin)
    • Bosnian - Bosnia and Herzegovina (Cyrillic) 
    • Filipino (Philippines) 
    • Frisian (Netherlands) 
    • Inuktitut (Latin, Canada) 
    • Irish (Ireland) 
    • Luxembourgish (Luxembourg) 
    • Maltese - Malta
    • Maori - New Zealand
    • Mapudungun (Chile) 
    • Mohawk (Mohawk) 
    • Nepali (Nepal) 
    • Pashto (Afghanistan) 
    • Quechua - Bolivia
    • Quechua - Ecuador
    • Quechua - Peru
    • Romansh (Switzerland)
    • Sami, Inari - Finland
    • Sami, Lule - Norway
    • Sami, Lule - Sweden
    • Sami, Northern - Finland
    • Sami, Northern - Norway
    • Sami, Northern - Sweden
    • Sami, Skolt - Finland
    • Sami, Southern - Norway
    • Sami, Southern - Sweden
    • Serbian - Bosnia and Herzegovina (Latin)
    • Serbian - Bosnia and Herzegovina (Cyrillic)  
    • Sesotho sa Leboa - South Africa
    • Setswana - South Africa
    • Welsh - United Kingdom
    • Xhosa - South Africa
    • Zulu - South Africa

    Impressive list?

    Well, yes -- everything from XP I had mentioned before in Lions and tigers and bearsELKs, Oh my! and ELK stampede!, with the exception of two entries:

    • Bengali - India
    • Malayalam - India

    The reason these two were left out? Well, there was no Uniscribe language update in SP1, so there was no good way to add support for the shaping. So with both of them not able to be supported, adding keyboards and fonts for them wouldn't make that much sense.

    But now you know why there is no Bengali keyboard layout on Server 2003 SP1.

    And from there you know why it does not exist on XP 64-bit, either.

    Now using MSKLC 1.4 you can create the keyboard (you can even base it on the Bengali keyboard from an XP or Vista platform where it works), but this does not give you the locale, the font, or the Uniscribe support -- so you do have other work to do beyond just building a keyboard layout. So this could be a case where even providing the answer may not help to enable the situation to work.

    Sorry about that.

    But if nothing else it finally got me to print that ELK list from Server 2003 SP1!

    This blog brought to you by ব (U+09ac, aka BENGALI LETTER BA)

    In my blog Disabilities in the workplace, Mary suggested in the comments:

    I'd like to see a discussion for those of us with non-mobility disabilities. How about mild autism? Just try and get through the first 5 minutes of an interview and make a positive impression when you can only lip read for the first 5 minutes!

    Now there's a challenge! Do you say "I'm sorry, I didn't see what you said" and become "that deaf over 40 chick" - when you're not deaf. Or do you disclose the disability and get bumped? Or do you not disclose it, try and bumble through and get bumped for being slow to respond and comprehend?

    A very good point, and one that I only have limited experience with myself, to which I can add info from colleagues who suffer from problems like mild autism and Tourette syndrome.

    The experience I do have myself comes from MS symptoms that put me in similar situations -- like visual defects so severe that I couldn't see too much beyond right in front of me.

    That particular problem has only happened to me two times, but both times I avoided the situation, and canceled the scheduled meetings. It seemed prudent due to my inability to see people who would be there, which would be a very noticeable symptom.

    But maybe that points to what I would do if there were no way to avoid it -- there is really no other way than to be upfront with the situation -- if there was no way to hide it, I'd choose Mary's "choice B" and disclose.

    These days, it seems like people tend to be pretty sensitive about even the hint of discrimination. So I would be upfront about what I can't do but then launch immediately into what I can. And let the chips fall where they may.

    For an interview, I'd make sure to walk in with them knowing that information, and (since there is no way they would know what to do or how to handle it), I'd try to make sure the interview was structured in a way that I could handle it (a more extensive version of what I have done in interviews when standing at the white board wasn't feasible).

    The risk of negative bias or even prejudice is there, but giving people the opportunity to think the worst of the situation is not really such a great idea even then, so there is no downside unique to disclosure. In situations like Tourette's or Asperger's that have come up in television shows either not too long ago or recently (in L.A. Law and Boston Legal, respectively), there is even perhaps some common basis for contrast -- like what about one's own case is different from the ones popularized on television.

    I actually only know of one time when I was definitely directly discriminated against in word or deed, and that person felt guilt enough that they went out of their way to correct the situation afterward.

    Discrimination is a very rough thing, in any form and against really anything. And the odds are obviously much better in an employment situation than an interviewing one (since it is much more in the former that people have a vested interest in getting along), but what else can one really do?

     

    This blog brought to you by Գ (U+0533, aka ARMENIAN CAPITAL LETTER GIM)

    No, this is not a post about abortion. I am talking about NULL termination, a way to end strings that is legal in all fifty states and that is widely used by software that is in regular use by people all over the world regardless of their views about the other kind of termination. So it is a universally safe topic, even if this disclaimeresque introduction isn't, necessarily.

    There are many patterns to calling functions within the NLS API, most (but not all) of which have the following rules:

    • You can pass a NULL buffer and a length of 0 to get back the exact required length, plus one for the terminating NULL.
    • You can pass a buffer of the size above or larger to have the buffer filled, with the return being the exact length plus one for the terminating NULL.
    • You can pass a buffer of just the required length without the terminating NULL and get the buffer filled, with the return being the exact length with no terminating NULL.

    This behavior can easily get confusing.

    It is even a security topic, since a filled buffer without a terminating NULL can lead to real bugs if it is not used properly.

    But there are times when it is exactly what you want.

    Say for example if you are using LCMapString to do case conversion and you are doing it inline. Inserting a random NULL in the middle of a string with existing content that one wishes to preserve is seldom a good idea.

    Other calls can also require this behavior.

    But it does explain the following calls and results:

    LCMapString(LOCALE_USER_DEFAULT, LCMAP_FULLWIDTH, L"\u00c4\u0170", -1, NULL, 0)
    return value: 3

    LCMapString(LOCALE_USER_DEFAULT, LCMAP_FULLWIDTH, L"\u00c4\u0170", -1, wz, 3)
    return value: 3
    wz value: L"\u00c4\u0170\u0000"

    LCMapString(LOCALE_USER_DEFAULT, LCMAP_FULLWIDTH, L"\u00c4\u0170", -1, wz, 10)
    return value: 3
    wz value: L"\u00c4\u0170\u0000"

    LCMapString(LOCALE_USER_DEFAULT, LCMAP_FULLWIDTH, L"\u00c4\u0170", 2, wz, 3)
    return value: 2
    wz value: L"\u00c4\u0170"

    LCMapString(LOCALE_USER_DEFAULT, LCMAP_FULLWIDTH, L"\u00c4\u0170", 2, wz, 2)
    return value: 2
    wz value: L"\u00c4\u0170"

    (It helps to know that L"\u00c4\u0170" will have an implicit NULL at the end when one is looking at it from inside the function!)

    Now in this case the two characters:

    Ä (U+00c4, aka LATIN CAPITAL LETTER A WITH DIAERESIS)

    Ű (U+0170, aka LATIN CAPITAL LETTER U WITH DOUBLE ACUTE)

    have no full-width equivalents and thus will pass through the function with this flag, completely unchanged. Which provides an ideal opportunity to see all of this behavior, under the microscope so to speak....
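
    The five calls above can be modeled in a few lines. This is only a sketch in Python and not the real API: nls_style_copy is a hypothetical stand-in that copies its input unchanged (as LCMAP_FULLWIDTH does for these two characters), and the ValueError is an assumed substitute for the failure the real function reports when the buffer is too small.

```python
def nls_style_copy(src, cch_src, cch_dest):
    """Hypothetical model of the length rules; not the real LCMapString.

    src is the text without its terminator. cch_src == -1 means the input
    is NULL-terminated, so the terminator is treated as part of the input.
    Returns (count, buffer); buffer is None for a pure length query.
    """
    if cch_src == -1:
        data = src + "\0"        # the terminator travels with the string
    else:
        data = src[:cch_src]     # explicit length: no terminator is added
    required = len(data)
    if cch_dest == 0:
        return required, None    # query only: report the required size
    if cch_dest < required:
        raise ValueError("insufficient buffer")  # assumed failure signal
    return required, data


text = "\u00c4\u0170"
for cch_src, cch_dest in [(-1, 0), (-1, 3), (-1, 10), (2, 3), (2, 2)]:
    count, buffer = nls_style_copy(text, cch_src, cch_dest)
    print(cch_src, cch_dest, count, ascii(buffer))
```

    In this model the return value and the presence of the terminator depend only on whether the input length was -1 or explicit, which is the pattern the five results show.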

     

    This blog sponsored by those two characters along for the ride, above

    Nothing technical whatsoever folks; if this bothers you then please just get over it or go away! :-)

    I have been in the habit lately of talking women out of the idea of having a relationship with me.

    I know how that sounds, and I took the time to record myself speaking the words and then playing them back. So I quite literally know how that sounds.

    Being not entirely hideous though rather far from significantly attractive (see the picture over on the right if you doubt me), I have no reasonable explanation for this recent trend, but I am doing what I can to curtail it as much as possible.

    Nicely.

    To give a recent example or two, not too long ago I talked Samantha out of a crush when she tried to get me to go with her and her friends to see Vampire Weekend, which was quite ironically a show I had gotten tickets to for someone from another non-relationship, someone I was friends with [1].

    And this last week I did see The B-52's with Claire (I have been enjoying the group's new album -- which made me regret turning down earlier ticket offers, so she had good timing -- and the pre-condition that this was not to be considered a date was accepted and confirmed before we arrived at the venue).

    There wasn't much opportunity to talk about anything other than the band in general and Kate Pierson in particular, but she did want to talk at some point, so we went to dinner the next night.

    This was also not a date. But she wouldn't let me pay for my ticket so she did manage to take advantage of my sense of fair play. :-)

    We went to that sushi place near Uwajimaya down the street from Microsoft, and I managed to convince her that it was just a case of temporary emotional myopia before the check came. She believed me, I think because I can be very convincing.

    We then ended up having a fascinating conversation about the obsession men have with women from puberty onward. She was talking about something she had seen on E! about the twenty-five most memorable swimsuit moments. After she asked me what I thought was on the list (I had not seen the show), I named:

    • Phoebe Cates in Fast Times at Ridgemont High, in that Judge Reinhold fantasy sequence where she unsnaps the top;
    • Bo Derek in 10, in that one piece suit from a Dudley Moore fantasy scene where they run toward each other;
    • Jacqueline Bisset in The Deep, where she wore just bikini bottoms and a white T-shirt;
    • Farrah Fawcett in that 1976 poster that they originally didn't want to pay her for since the agreement was for a bikini;
    • Raquel Welch in One Million Years B.C., in that bikini made of animal skins;
    • Carrie Fisher in Star Wars: Return of the Jedi, in that metallic bikini;
    • Cheryl Tiegs in that 1978 SI swimsuit edition shot of the see-through white fishnet bathing suit;
    • Denise Richards and Neve Campbell in Wild Things, for their "fight and then make out scene" in the pool.

    There weren't others I could think of as really being iconic, so I ran out of steam at that point. I think this was the order I named them in though I didn't write them down so I am not entirely sure.

    She asked me where I thought they ranked, and I honestly wasn't sure (other than thinking the Phoebe Cates one was probably #1).

    I don't know if you have seen this special (I hadn't, and still haven't yet), but the fact that all of them were supposedly on the list is how she proved to me that men can be pigs, and the fact that the Cates shot was indeed #1 was proof that men are pigs.

    Her theory is that all men can name some or all of these items. I suggested it be called the Homo-Sus hypothesis, which takes the genus of both men and pigs.

    Now I don't think any of this actually constitutes proof, but since for the most part men are kind of pigs anyway and I didn't want to volunteer examples that I think would constitute proof, it seems silly to argue the previously submitted evidence -- we weren't in court or anything.

    Anyway, with the meal long over we went to our cars and parted ways. She is really a very nice person (actually all three of them are) who any normal man would be happy to go out with. The key is to find one of those normal men (a concept for which I am technically under-qualified!).

     

    1 - Interestingly enough, after the show, both of them were teasing me about having met and chatted for a while until I just got annoyed with the games of youngsters. At which point they both said they were kidding. Since then they have left it open so I have no idea if they met or not, really. One day one of them may actually tell me, but I'm not going to worry about it (I just don't enjoy people teasing me about stuff!).

     

    This blog brought to you by ⾗ (U+2f97, aka KANGXI RADICAL PIG)

    Recently while paying attention to The Unicode List I was once again reminded why I don't pay more attention to The Unicode List. :-)

    Specifically it was a thread started by Andreas Prilop:

    I refer to
     
    http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
     http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

    In ISO-8859-1, code position 0x90 is mapped to U+0090.
    In Windows-1252, code position 0x90 is listed as "undefined".

    Why are they treated differently? International Standard ISO/IEC 8859-1 does *not* define code position 0x90. So it might also be listed as "undefined".

    Or, for purely practical reasons, 0x90 in Windows-1252 might also be mapped to U+0090.

    This different behaviour for undefined code positions may occasionally cause trouble - please see
     
    http://lists.w3.org/Archives/Public/www-validator/2008Apr/
     http://lists.w3.org/Archives/Public/www-validator/2008May/
    Thread "Fallback to UTF-8".

    Richard Wordingham then responded in a message that probably should have resolved the issue (had it not been The Unicode List):

    0x90 is defined in the IANA version of ISO-8859-1, which calls up the description in RFC1345.  In a web context, I believe the IANA definition should take precedence over ISO/IEC.

    On the other hand, Windows-1252 might be extended again and assign a meaning to 0x90, so it is probably better not to map any Unicode codepoint to that value.

    > Or, for purely practical reasons, 0x90 in Windows-1252 might
    > also be mapped to U+0090.


    Which is reported to be what Windows *currently* actually does.

    And Unicode cool guy Ken Whistler put in some thoughts here as well:

    > > Why are they treated differently?

    Different theory by the maintainers of the two sets of files.

    I am the most recent maintainer of record for the 8859-X mapping files posted on the Unicode website. For those I follow the consensus of the UTC that mappings for control code points in the 8859-X family of ASCII-derived encodings to/from Unicode is least problematical if 0x00 <--> U+0000, 0x01 <--> U+0001, etc. This is, in fact, the way that almost all commercial conversions handle the control code conversions for 8859-X character sets.

    Since 8859-1.TXT and the other mapping tables posted on the Unicode website are intended to provide practical *mapping* guidelines for implementers, it would be pedantic in the extreme (and counterproductive) to post them up as documentation of the 8859-X standards *without* the control code mappings.

    The Microsoft mapping tables are contributed by and maintained by Microsoft, and follow Microsoft standards practice for table definition. 0x00..0x1F are mapped through to U+0000..U+001F, but because most Microsoft code pages contain graphic characters in the 0x80..0x9F range, those characters are mapped, but unassigned code points are simply left #UNDEFINED, as is also the case for Microsoft double-byte code pages. This allows a distinction to be made between that status and #DBCS LEAD BYTE values.

    In practice, of course, when actually implementing conversion tables from Microsoft code pages to/from Unicode, nearly all commercial implementations, including Microsoft's, map undefined values in the 0x80..0x9F range (for non-DBCS code pages) to the corresponding Unicode U+0080..U+009F control code character, rather than to U+FFFD.

    > > International Standard ISO/IEC 8859-1 does *not* define
    > > code position 0x90. So it might also be listed as "undefined".
    >
    > 0x90 is defined in the IANA version of ISO-8859-1, which calls up the
    > description in RFC1345.  In a web context, I believe the IANA definition
    > should take precedence over ISO/IEC.


    While I agree with the conclusion that for web usage, mappings that map through control codes rather than treating them as undefined is the correct thing to do -- I do so for different reasons.

    RFC 1345 is *extremely* dated. It is from 1992, and refers to prepublication versions of 10646. The first edition of 10646 wasn't even published until 1993, and at that point we are talking about a Unicode 1.1-level repertoire. The character mnemonic table in RFC 1345 is full of errors, and the mapping tables for various charsets at the end of RFC 1345 have not been updated to track the updates of the 8859 standards nor the updates in mapping practice for some charsets that resulted from extensions to 10646.

    >
    > On the other hand, Windows-1252 might be extended again and assign a meaning
    > to 0x90, so it is probably better not to map any Unicode codepoint to that
    > value.

    I disagree. If you do not map U+0090 to 0x90 for Windows-1252, all you are doing is ensuring an interoperability bug both with Windows and with other commercial applications doing conversions.

    After that, David Starner and Doug Ewell made contributions pointing out that if one was truly expecting C1 control characters like U+0090 to be in cp1252 then one probably had bigger problems with one's data, anyway.

    Mark Davis pointed out that ICU does indeed map 0x90 to U+0090 and vice versa, since "in ICU we always go by what people do, and not what they say.... Windows itself maps 0x90 to U+0090."

    Good to know! :-)

    Andreas came back with one more contribution to explain the reasoning behind the original concern:

    The problem was/is: What to do when a byte 0x90 is found in a file that has

    (a) erroneously charset=ISO-8859-1
    (b) erroneously charset=Windows-1252
    (c) no encoding/charset at all specified

    Surprisingly, the W3C validator gives up with Windows-1252 but does perform a check with ISO-8859-1.

    See the test document  
    http://www.unics.uni-hannover.de/nhtcapri/test.htm and follow the links "Validate as ISO-8859-1" and "Validate as Windows-1252".

    The validation report with Windows-1252 would be more helpful, in my opinion, if 0x90 in cp1252 is mapped to something - to U+0090 or whatever.

    And Richard made the final contribution to date in response to the "surprise" in that last message from Andreas:

    It's not surprising at all.  These charsets designations have the *IANA* definitions, which are not necessarily identical to international (e.g. ISO-8859 series) or national (e.g. TIS-620) standards.  Thus 0x90 is undefined for Windows-1252 but merely an illegal character for HTML in the IANA definition of ISO-8859-1.

    It is funny how these things go, really.

    Though this one was mercifully short, at least. :-)

    I was surprised (though not surprised enough to extend the thread by volunteering the information!) that even better than the referenced:

    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

    or the nicer, more graphical though functionally equivalent:

    http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx

    there is another reference to look at, one also hosted on the Unicode site:

    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

    This file, along with the rest of the mappings at

    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/

    comes directly from Microsoft.

    And while they were provided primarily to officially document in public the "best fit" mappings in these code pages, they have additional benefits, as they are the literal source files used to build the code page data files, with no lines removed.

    In particular, you can see entries like the following in the cp1252 file:

    0x81 0x0081
    0x8d 0x008d
    0x8f 0x008f
    0x90 0x0090
    0x9d 0x009d

    Now compare that with what the file they referenced had:

    0x81       #UNDEFINED
    0x8D       #UNDEFINED
    0x8F       #UNDEFINED
    0x90       #UNDEFINED
    0x9D       #UNDEFINED

    I think it seems much nicer, and it definitely matches the assertion that several people made about the de facto behavior of the code page.
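
    The two styles of table are easy to see side by side in any converter that offers both. As a minimal sketch, assuming Python's built-in codecs: its "latin-1" codec maps every byte straight through, while its "cp1252" codec (to the best of my knowledge) treats the five bytes above as undefined, in the style of CP1252.TXT rather than the best fit file. The decode_byte helper is a hypothetical name used only for illustration.

```python
# The five byte values listed as #UNDEFINED in CP1252.TXT.
UNDEFINED_IN_CP1252_TXT = [0x81, 0x8D, 0x8F, 0x90, 0x9D]


def decode_byte(value, encoding):
    """Return the code point one byte decodes to, or None if it is rejected."""
    try:
        return ord(bytes([value]).decode(encoding))
    except UnicodeDecodeError:
        return None


for value in UNDEFINED_IN_CP1252_TXT:
    as_latin1 = decode_byte(value, "latin-1")
    as_cp1252 = decode_byte(value, "cp1252")
    print(f"0x{value:02x}  latin-1: {as_latin1}  cp1252: {as_cp1252}")
```

    A converter built from the best fit file would instead be expected to return the matching C1 control code point for each of these bytes.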

    These files also define all the "best fit" mappings, so that (for example) the cp1252 file lists all of the following:

    0x0041 0x41 ;Latin Capital Letter A
    0x0100 0x41 ;Latin Capital Letter A With Macron
    0x0102 0x41 ;Latin Capital Letter A With Breve
    0x0104 0x41 ;Latin Capital Letter A With Ogonek
    0x01cd 0x41 ;Latin Capital Letter A With Caron
    0x01de 0x41 ;Latin Capital Letter A With Diaeresis And Macron
    0xff21 0x41 ;Fullwidth Latin Capital Letter A

    which means that every one of these characters maps to somewhere, not just the one (U+0041) that can round-trip. It is why the WCTABLE has 698 entries even though the MBTABLE only has 256. :-)
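
    To make the many-to-one nature of those entries concrete, here is a small sketch that parses just the seven lines quoted above and uses them to convert a string. It assumes the "code point, byte, ;comment" layout shown in the excerpt. The function names and the choice of "?" (0x3F) as the fallback byte are illustrative assumptions, not something taken from the file.

```python
SAMPLE_WCTABLE = """\
0x0041 0x41 ;Latin Capital Letter A
0x0100 0x41 ;Latin Capital Letter A With Macron
0x0102 0x41 ;Latin Capital Letter A With Breve
0x0104 0x41 ;Latin Capital Letter A With Ogonek
0x01cd 0x41 ;Latin Capital Letter A With Caron
0x01de 0x41 ;Latin Capital Letter A With Diaeresis And Macron
0xff21 0x41 ;Fullwidth Latin Capital Letter A
"""


def parse_wctable(text):
    """Build {code point: byte} from lines shaped like the excerpt."""
    table = {}
    for line in text.splitlines():
        entry, _, _comment = line.partition(";")
        fields = entry.split()
        if len(fields) != 2:
            continue  # skip anything that is not a two-column entry
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def best_fit_encode(s, table, fallback=0x3F):
    """Convert each character via the table; unknown ones become the fallback."""
    return bytes(table.get(ord(ch), fallback) for ch in s)


wctable = parse_wctable(SAMPLE_WCTABLE)
print(len(wctable), "code points map to", len(set(wctable.values())), "byte")
print(best_fit_encode("A\u0100\uff21", wctable))  # b'AAA'
```

    Only the first entry survives a trip back through the byte 0x41; the other six are one-way.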

    Maybe I should have volunteered the information. :-)

    But The Unicode List just plain scares me.

    Even more than clowns do, and you know how creepy clowns are....

    Now I have talked about best fit mappings in Windows code pages in all of the following blogs in this Blog: