November, 2010

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Suddenly, in a bit more time than a blink of an eye, "standards support" becomes "less i18n support"

    Over the years I've had a lot to say about Digit Substitution, the feature so widely used in so many of the bidirectional scripts and the scripts of South Asia:

    and so on.

    I think this is most of them.

    Most of these blogs have talked about the times that Digit Substitution isn't doing what you might expect, or isn't doing what the people designing it intended, or violates the claims and/or implied behavior in the documentation, or is just plain broken.

    But when one considers how long this feature has been around, it really seems unlikely that anyone could ever simply dump the functionality and act like it isn't there or doesn't exist, 1984 "Oceania has always been at war with Eurasia" style, right?

    Actually, as it turns out, this kind of assumption would be wrong.

    In a bold push to prove how conformant the last two versions of Internet Explorer truly are -- IE 8 (shipping) and IE 9 (widely available in beta form) -- support for Digit Substitution is not there so much, any more.

    Because there is no such feature in the HTML standard (HTML5 or any other version).

    Other browsers like Firefox don't support the idea, either (since their installer still uses the user locale to localize the installer UI, they are not a model I trust for international support, even outside the lameness that is the international support in the HTML they follow, but alas I digress).

    Now you can specify you want things the old way that follows Regional and Language Options and its Digit Substitution settings with a meta tag, like the below HTML snippet:

    <!DOCTYPE HTML>
    <html>
    <head>
        <meta http-equiv="X-UA-Compatible" content="IE=7" >
    </head>
    <body>
    <p>0123456789</p>
    </body>
    </html>

    About that content value -- 5 or 7 will give you Digit Substitution, and 8 or 9 will not.

    Note that if you don't have that meta tag (most pages won't) then it will usually use the same as the browser version, though there are interesting and not-very-well-documented exceptions I found, like in HTML files opened from local or network paths, for example. I can't claim I have found anything reliable except when I specify things. I guess "Unspecified behavior seems to be unspecified" might provide us all with a bit of irony here....

    On this blog, where I have very little control over the per-page meta tags and headers, I cannot really control what this behavior will be on pages.

    I had better show this a little, so here follows some art....

    First set the Format to Arabic (Saudi Arabia) and hit the Additional settings button:

    When that other dialog comes up, set the Use native digits control to "National":

    hit OK out of both dialogs, and then the fun begins!

    You can see the difference between when that compatibility setting is 5:

    and when it is 8:

    You clearly see Digit Substitution in Internet Explorer being tied to the support of version-specific behavior and standards mode and all the rest of the work in IE.
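
    As a rough illustration of what the feature does when set to "National" for Arabic (Saudi Arabia), here is a minimal Python sketch. It assumes the national digits are the Arabic-Indic digits U+0660 through U+0669, and the name substitute_digits is made up for illustration. The real Windows behavior (the "Context" setting, for one) is more involved than a one-to-one mapping at display time.

```python
# Hypothetical sketch: map ASCII digits to Arabic-Indic digits (U+0660..U+0669)
# at display time, leaving the stored text untouched.
ARABIC_INDIC = {ord("0") + i: chr(0x0660 + i) for i in range(10)}


def substitute_digits(text, table=ARABIC_INDIC):
    """Return a display copy of text with ASCII digits replaced by national digits."""
    return text.translate(table)


stored = "0123456789"                  # what the page actually contains
displayed = substitute_digits(stored)  # what a substituting renderer would draw
print(displayed)
```

    The point of the sketch is that the page content never changes; only the rendering does, which is why a compatibility setting alone can turn the behavior on or off.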

    Suddenly, in a bit more time than a blink of an eye, "standards support" becomes "less Internationalization support"....

    There are some related features, like the "list-style-type" style that can be applied to ordered lists (the OL element), but those require specific opt-in by the author of the web content, and the user's preferences have no impact upon them (other than by choosing a browser that doesn't support a given "list-style-type", since the full list available in each browser varies, I mean).

    I am still deciding how I feel about all of this.

    Part of me feels okay about this, given all of the weirdnesses I have been pointing out for years. There are clearly some very real flaws with this feature.

    But on balance I consider the following:

    • the Digit Substitution behavior has been around in one form or another on Windows for over 15 years, and
    • many of the keyboards (e.g. all 3 Arabic keyboards, Persian, Sinhala, Urdu, and Uyghur) of these various languages with alternate digits don't have the digits on them since this feature makes it seem like they are there, and
    • the Arabic Windows code page (cp1256) itself doesn't have digits (for pretty much the same reasons), and
    • users can have between 5 and 15 years of experience with typing ASCII digits and getting their National ones, with no idea how to type them in any other way even if the keyboard has the keys assigned somewhere on it, and
    • the formatting code in Windows does not ever even use National digits because this feature keeps it from having to, and
    • neither the formatting nor the parsing code in .Net have ever even used National digits (for much the same reasons), and
    • users have been dealing with the good and the bad of this feature all along;

    and suddenly I don't feel so good.

    Given all of the consequences implicit in the above, having this change in the latest version of Internet Explorer and in a public beta of the next version, when no conversations happened among so many of the stakeholders of the functionality, seems to me to be a little unfortunate....

    Especially with web apps becoming more and more popular as they become more sophisticated and able to give richer experiences, losing this particular rich experience may not be so pleasant for some users. Users who LIKE that support.

    But maybe that is just me being oversensitive, as one of those stakeholders.

    Maybe Windows and the .Net Framework should beef up their parsing and formatting support to work in this new world where you may no longer get your numbers, even if you ask for them....
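
    To make "parsing and formatting support" concrete, here is a small Python sketch of what handling National digits directly could look like. It is only an illustration under stated assumptions: the function names parse_digits and format_digits are hypothetical, only non-negative integers are handled, and nothing here reflects how Windows or the .Net Framework actually work.

```python
import unicodedata


def parse_digits(text):
    """Parse a non-negative integer written with any Unicode decimal digits."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for anything that is not a decimal digit.
        value = value * 10 + unicodedata.decimal(ch)
    return value


def format_digits(value, zero="\u0660"):
    """Format a non-negative integer using the digit set that starts at `zero`."""
    return "".join(chr(ord(zero) + int(d)) for d in str(value))


national = format_digits(2010)       # Arabic-Indic digits by default
round_trip = parse_digits(national)  # back to the integer 2010
```

    Passing a different zero (for example "\u06f0" for the Extended Arabic-Indic digits used for Persian) selects another digit set without changing the logic.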

    Note to .Net and Windows globalization people (you know who you are!): you may be hearing from me sometime soon about this!

  • Sorting it all Out

    At long last, it's Sinhala time!

    THE WINDOWS 7 SINHALA LANGUAGE INTERFACE PACK IS LIVE!

    Click here to download the Sinhala Windows 7 LIP via the Microsoft.com Download Center.

    Please note that the Sinhala Windows 7 LIP can only be installed on a system that runs an English client version of Windows 7. It is available to download for both 32-bit and 64-bit systems.

    The Sinhala Windows 7 LIP is produced as part of the Local Language Program sponsored by Public Sector.

    A LITTLE BACKGROUND INFORMATION ON SINHALA

    NUMBER OF SPEAKERS

    19 million

    NAME IN THE LANGUAGE ITSELF:

    සිංහල

    Sinhala is spoken by more than 17 million people in Sri Lanka where it is also one of the two official languages (Tamil being the other one).

    FUN FACTS:

    • Sinhala contains loan words of Portuguese, Dutch and English origin, which reflects Sri Lanka's colonial history. English was the official language from 1833 to 1958 (that is, until 10 years after independence).
    • There is a crucial distinction between "high" and "low" Sinhala: The colloquial form used in everyday life is quite different from the literary form which is also spoken in formal environments and in the media.
    • The differences are so big that the two forms could be considered two different languages. Not only is the vocabulary very distinct, but also the grammar: In the colloquial version there are no inflected forms for verbs, for example. This is a little like having King Edward's English used on MSNBC and slang in everyday conversations.
    • There are no sub-clauses - comparable constructions are created with participles and adjectives. Therefore a phrase like "The man who gives information on LIP languages" becomes "The information on LIP languages giving man".
    • There are no prepositions but only postpositions. "under the book" would literally retranslate as "book under".

    CLASSIFICATION:

    Sinhala is written in its own script, which resembles the scripts of the south Indian Dravidian languages due to the rounded shape of its characters.

    Click here for more information on the Sinhala language and here for more information on the Sinhala script.

    MICROSOFT-SPECIFIC STUFF:

    The Microsoft Sinhala keyboard story is a story that any sane person would go 1000 miles out of their way to avoid having to discuss, but it seems ridiculous to me to say nothing about it.

    Like pretty much all of the languages of South Asia, the inbox input story on Windows is not great, and it was Sinhala in particular that made me realize a sad truth: any time someone inside Microsoft tells someone else inside Microsoft that, rather than jumping in to support something, "we need to consider the entire end-to-end scenario", what they are really saying is "we aren't going to fix it this version. Try again next time, and for now go away." Given the "end-to-end" premise this makes sense, since in the middle of a product cycle it is too late to devote enough planning resources to get a huge issue investigated and addressed.

    The flaw in the logic is that the "end-to-end" solution (or the keyboard specific variation, where it was suggested that "we must consider the entire input stack in a more holistic way") is not truly needed. It really isn't here, since we already have a Text Services Framework. All that is needed is a plug-in. Think of it this way: someone came up with the "end-to-end" humvee and now all that we need is to put a hubcap on it. Spending too much time talking about "end-to-end" makes one sound like a "rear end".

    Thankfully, there are third party solutions in the meantime that do solve the problem, using the Sinhala National standard. Some even use Microsoft's own Text Services Framework, which is why I am grateful for a rich third party developer ecosystem. It helps keep customers unblocked while we take our time doing the right thing.

    I will continue to keep trying to solve this particular problem and there are others who are trying to do the same. We aren't really as bad as the Sinhala input story makes us look.

    I'm sorry.

    But the Sinhala LIP itself is still pretty cool. :-)

    Enjoy!

  • Sorting it all Out

    Please don't feed the list

    I really try to avoid The Unicode List whenever possible.

    It isn't just because it is full of time-wasting rabble rousers. Though it is, so that is part of it.

    And it isn't just because it is full of people who by most reasonable measures are bats*** crazy. Though it is, so that is part of it.

    It is largely because it brings out the worst in everyone.

    I mean, even the smart people say things that make you wonder if they put their brains in a blind trust while they wrote the mail that makes you simply shake your head and sigh.

    Thankfully the not-quite-as-smart people remind me that they can get it wrong better than any momentary lapse that one of the smart ones might have....

    Like a conversation about non-Unicode web pages and the Windows clipboard:

    People started going on and on about the need to have the right system locale to see the non-Unicode pages correctly.

    Did they try it? Do they use the Internet on their Windows machines? Do they even run Windows?

    INTERNET EXPLORER supports Unicode.

    Every legacy web page that isn't lying about its encoding says what its encoding is. And IE, being a Unicode application, converts the page.

    So if you can see it right then the data is already Unicode and copy/paste is a pure Unicode operation.

    Hell, even if you can't see it right then the data is still Unicode and you can change the encoding to make it right. Copy/paste is still Unicode.

    The DEFAULT SYSTEM LOCALE is pretty much beside the point and has nothing to do with anything going on here.
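
    The argument above can be sketched in a few lines of Python. This is only an illustration of the idea, not of how IE is implemented: decode_page is a hypothetical helper, and the sample bytes stand in for a legacy page that declares charset=windows-1256.

```python
def decode_page(raw_bytes, declared_charset):
    """Convert the bytes of a legacy page to Unicode text using its declared encoding."""
    return raw_bytes.decode(declared_charset)


# Stand-in for a non-Unicode page that declares charset=windows-1256.
raw = "مرحبا".encode("windows-1256")

text = decode_page(raw, "windows-1256")   # right label: correct Unicode text
wrong = decode_page(raw, "windows-1252")  # wrong label: mojibake, but still Unicode text

# "Changing the encoding" just means decoding the same bytes again with the right label.
fixed = decode_page(raw, "windows-1256")
```

    Nothing in the sketch consults a system locale; the only inputs are the bytes and the encoding label, and whatever reaches the clipboard afterwards is already Unicode text.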

    I am guessing they just aren't running Windows so they are using phrases they have heard before.

    Over and over, it never gets any better.

    It really is best to avoid The Unicode List.....

  • Sorting it all Out

    I [will have] told you so! Well, perhaps too late (all things considered)...

    The year was 2004.

    The Blog you are reading now had just a few blogs in it.

    And I wrote a blog titled Microsoft does not use the Unicode Collation Algorithm.

    The year was 2008.

    Thousands of blogs had been added to this Blog since that earlier blog.

    And I wrote a blog titled Microsoft still does not use the UCA; the converse is also true.

    Nothing has changed, it is all still true.

    Though over the years as these two different implementations worked to cover this single large space, their functionality has overlapped and each implementation has often in its efforts to do the right thing not paid enough heed when the other implementation had already realized that a particular solution was a bad idea for one reason or another.

    Now by itself this does not mean that it would necessarily be a mistake to solve the problem in that particular way -- at times there are underlying architectural reasons why the differences exist and there is not much reason to try and change those differences.

    With all that said, part of a recent release to the Unicode Announcements alias struck me as interesting. The text of the announcement read in part:

    Mountain View, CA, USA – October 29, 2010 – The new version of Unicode Technical Standard #10, Unicode Collation Algorithm (UCA), has been updated for Unicode Version 6.0, adding support for 2,088 characters in sorting, searching, and matching. Also in this release new data files for support of the Unicode Common Locale Data Repository (CLDR), which provides customization for different languages.

    Reorderable Categories. The data files for CLDR order characters strictly by certain major categories. This allows programmers to parametrically reorder these groups of characters to put them in the desired order for different languages. For example, numbers can be ordered after letters, or Cyrillic before Latin. The reorderable categories are:

    whitespace, punctuation, general symbols, currency symbols, and numbers, then Latin, Greek, Coptic, Cyrillic, ..., Egyptian Hieroglyphs, and finally, CJK.
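
    To give a feel for what "parametrically reorder these groups" means, here is a toy Python sketch. It is emphatically not the Unicode Collation Algorithm: the grouping function is a crude stand-in based on character names, ordering within a group is by plain code point, and the names group_of and make_key are made up for illustration.

```python
import unicodedata


def group_of(ch):
    """Very rough stand-in for a collation category; real UCA data is far richer."""
    if ch.isspace():
        return "whitespace"
    if ch.isdigit():
        return "numbers"
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script):
            return script.capitalize()
    return "other"


def make_key(group_order):
    """Build a sort key that ranks characters by group first, then by code point."""
    rank = {group: i for i, group in enumerate(group_order)}

    def key(s):
        # Groups missing from group_order sort after all listed groups.
        return [(rank.get(group_of(ch), len(rank)), ord(ch)) for ch in s]

    return key


words = ["b", "1", "я", "a"]
numbers_first = sorted(words, key=make_key(["numbers", "Latin", "Greek", "Cyrillic"]))
cyrillic_first = sorted(words, key=make_key(["Cyrillic", "Latin", "Greek", "numbers"]))
```

    With the first order the digit sorts ahead of the letters; with the second, the Cyrillic letter comes first and the digit last, which mirrors the two examples given in the announcement.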

    Reorderable Categories.

    Microsoft did something like that years ago.

    Not a configurable system to do it, but an explicit change for one sort.

    You may remember reading about it, in one or more of the following blogs:

    As that last blog pointed out, we removed the customization because we ultimately deemed it to be not such a great idea.

    Now perhaps what is being done differently here will make it not such a big deal that the latest version of Unicode added a flexible architectural feature that Microsoft started to realize was a bad idea at least eight years ago and finally removed from its implementation four years ago.

    I don't know, since I haven't looked beyond the announcement itself.

    Of course I have no way of knowing whether the issue was mentioned by any of the Microsoft representatives present (I wasn't there, and no one who was mentioned it to me until after everything was done and I was pointed at the announcement mail in a generic sense when everyone was).

    Not to mention that to be honest I don't think very many of the support issues and problems that came up with this Korean "feature" (here or in Korea) ever made it much into the public, either. We barely even documented it, except in one oblique doc comment that no one understood.

    And those three blogs of mine.

    Since Microsoft currently uses neither the Unicode Collation Algorithm nor the CLDR tailorings of it, I don't have too much of a specific business reason to do much more here.

    But I figure I can mention it here, at least.

    If people using this new flexible algorithm start running into strange complications, compatibility issues, or other problems....just remember I told everyone so, right here. Even if I did so just a bit too late.....

  • Sorting it all Out

    Header files are the wrong place to be less than helpful

    A lot of the people who work on the absolute latest version of Windows should have a lot more respect and consideration for the other, previous versions -- including the latest shipping version.

    Not all of them, mind you. There are those who care about those other versions a lot no matter what they are working on.

    A lot of them ought to, though.

    It was late last month when someone was asking:

    I have a requirement to enumerate the list of all country/regions in localized form depending on the current OS locale, just like the way they are shown when the OS is installed. Can someone point me to the API?

    Now when a question like this is asked by someone inside of Microsoft, there are levels upon levels of things to consider that aren't an issue with a question asked by someone outside of Microsoft.

    Like do they literally want to do it the way setup does it (e.g. maybe code running at the same time? Think Easy Transfer Wizard type stuff!) or is equivalent functionality acceptable?

    The lists setup builds have every item that will be in the new OS after it is installed, so it can't call the currently installed OS.

    Is the person asking for possibly internal, undocumented functions? If so then are they on the Windows team?

    We can't give out internal Windows stuff to folks not on Windows.

    Or is it someone in Product Support or Consulting Services or some other customer facing org, asking on behalf of a customer -- or code being written for a customer?

    I won't explain why this changes things, but it should be obvious.

    No worries this time, they just wanted equivalent functionality, publicly documented. So it can be the same answer I would give everyone out in the world too. If I were answering the question.

    I didn't answer the question then, someone else did:

    EnumSystemLocalesEx and GetLocaleInfoEx with LOCALE_SLOCALIZEDCOUNTRYNAME.  LOCALE_SLOCALIZEDCOUNTRYNAME is only available on Windows 7 and later I’m afraid.  I’m hoping by “current OS locale” you mean the users UI language (and not User Locale or System Locale).

    I am not going to answer it now, I'll let the answer given stand.

    However, in true passive/aggressive form, I am going to criticize the answer a bit. :-)

    I won't criticize the third sentence ("I’m hoping by “current OS locale” you mean the users UI language (and not User Locale or System Locale)."), that is good level-setting about using the right locale among the many choices -- in this case the UI language. I do that all the time!

    But those first two sentences. Let's chat a bit.....

    First of course the first sentence ("EnumSystemLocalesEx and GetLocaleInfoEx with LOCALE_SLOCALIZEDCOUNTRYNAME."):

    I am a huge fan of EnumSystemLocalesEx over EnumSystemLocales, and GetLocaleInfoEx over GetLocaleInfo. After all, I have been the person fearlessly saying LCIDs Suck both internally and externally before we even had all of the functions that could be used instead of them. I was saying it back when people were arguing about whether new functions were needed, as part of the reason to do some of the work!

    But I am aware of the fact that sometimes, in fact most times, developers everywhere in the world other than the Windows team have to consider the need to support versions of the operating system older than Vista.

    So while I have no problem pushing one solution over the other, I like to provide a bit more context.

    It isn't like the docs help. If you go to the list of Windows National Language Support functions, the table lays them out as:

    EnumSystemLocales Enumerates the locales that are either installed on or supported by an operating system.
    EnumSystemLocalesEx Enumerates the locales that are either installed on or supported by an operating system.

    Boy, way to help point people in the right direction there! Like maybe the Ex function could mention "using standards conformant locale names" or whatever. Or maybe put all the old functions in a separate table after the first big list of the most up-to-date functions? Something....

    As I pointed out in To Ex or not to Ex? THAT is the question., unless you are intimately familiar with the two functions then in most cases you won't know which one to use.

    Okay, you get my point.

    Now let's talk about that second sentence, "LOCALE_SLOCALIZEDCOUNTRYNAME is only available on Windows 7 and later I’m afraid.".

    Hmmmm.

    Let's take a look at WinNls.h, they've done some shuffling here:

    //
    // These are the various forms of the name of the locale:
    //
    #define LOCALE_SLOCALIZEDDISPLAYNAME  0x00000002   // localized name of locale, eg "German (Germany)" in UI language
    #if (WINVER >= _WIN32_WINNT_WIN7)
    #define LOCALE_SENGLISHDISPLAYNAME    0x00000072   // Display name (language + country/region usually) in English, eg "German (Germany)"
    #define LOCALE_SNATIVEDISPLAYNAME     0x00000073   // Display name in native locale language, eg "Deutsch (Deutschland)
    #endif //(WINVER >= _WIN32_WINNT_WIN7)

    #if (WINVER >= _WIN32_WINNT_VISTA)
    #define LOCALE_SLOCALIZEDLANGUAGENAME 0x0000006f   // Language Display Name for a language, eg "German" in UI language
    #endif //(WINVER >= _WIN32_WINNT_VISTA)
    #define LOCALE_SENGLISHLANGUAGENAME   0x00001001   // English name of language, eg "German"
    #define LOCALE_SNATIVELANGUAGENAME    0x00000004   // native name of language, eg "Deutsch"

    #define LOCALE_SLOCALIZEDCOUNTRYNAME  0x00000006   // localized name of country/region, eg "Germany" in UI language
    #define LOCALE_SENGLISHCOUNTRYNAME    0x00001002   // English name of country/region, eg "Germany"
    #define LOCALE_SNATIVECOUNTRYNAME     0x00000008   // native name of country/region, eg "Deutschland"

    //
    // Legacy labels for the locale name values
    //
    #define LOCALE_SLANGUAGE              0x00000002   // localized name of locale, eg "German (Germany)" in UI language
    #if (WINVER >= _WIN32_WINNT_VISTA)
    #define LOCALE_SLANGDISPLAYNAME       0x0000006f   // Language Display Name for a language, eg "German" in UI language
    #endif //(WINVER >= _WIN32_WINNT_VISTA)
    #define LOCALE_SENGLANGUAGE           0x00001001   // English name of language, eg "German"
    #define LOCALE_SNATIVELANGNAME        0x00000004   // native name of language, eg "Deutsch"
    #define LOCALE_SCOUNTRY               0x00000006   // localized name of country/region, eg "Germany" in UI language
    #define LOCALE_SENGCOUNTRY            0x00001002   // English name of country/region, eg "Germany"
    #define LOCALE_SNATIVECTRYNAME        0x00000008   // native name of country/region, eg "Deutschland"

    // Additional LCTypes
    #define LOCALE_ILANGUAGE              0x00000001   // language id, LOCALE_SNAME preferred

    #define LOCALE_SABBREVLANGNAME        0x00000003   // arbitrary abbreviated language name, LOCALE_SISO639LANGNAME preferred

    #define LOCALE_ICOUNTRY               0x00000005   // country/region code, eg 1, LOCALE_SISO3166CTRYNAME may be more useful.
    #define LOCALE_SABBREVCTRYNAME        0x00000007   // arbitrary abbreviated country/region name, LOCALE_SISO3166CTRYNAME preferred

    Okay, speaking literally the answer is incorrect; LOCALE_SLOCALIZEDCOUNTRYNAME is under #if (WINVER >= _WIN32_WINNT_VISTA) defines, not #if (WINVER >= _WIN32_WINNT_WIN7) defines.

    But my problem is a bit more subtle than that.

    Like the fact that LOCALE_SLOCALIZEDCOUNTRYNAME has the same value as LOCALE_SCOUNTRY and therefore has been around for a very very long time.

    There isn't much good reason for version guarding here on the constants since this isn't about version-specific functionality; providing consistent naming for these constants was done to make them easier to use on all versions because knowing that LOCALE_SLANGUAGE and LOCALE_SCOUNTRY are the localized names is kind of obscure, and having the full names makes it easier to understand.

    Personally, because of the pointless version guarding my usual recommendation would be to use LOCALE_SCOUNTRY, and not the "easier" LOCALE_SLOCALIZEDCOUNTRYNAME constant that is actually harder due to this arbitrary baggage added in.
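
    The overlap between the new names and the legacy labels is easy to check mechanically. The following Python sketch simply copies the values from the WinNls.h excerpt above into two tables and looks up which legacy label shares a value with a given new name; the helper name legacy_alias is made up for illustration.

```python
# Values copied from the WinNls.h excerpt above.
NEW_NAMES = {
    "LOCALE_SLOCALIZEDDISPLAYNAME": 0x00000002,
    "LOCALE_SENGLISHDISPLAYNAME": 0x00000072,
    "LOCALE_SNATIVEDISPLAYNAME": 0x00000073,
    "LOCALE_SLOCALIZEDLANGUAGENAME": 0x0000006F,
    "LOCALE_SENGLISHLANGUAGENAME": 0x00001001,
    "LOCALE_SNATIVELANGUAGENAME": 0x00000004,
    "LOCALE_SLOCALIZEDCOUNTRYNAME": 0x00000006,
    "LOCALE_SENGLISHCOUNTRYNAME": 0x00001002,
    "LOCALE_SNATIVECOUNTRYNAME": 0x00000008,
}

LEGACY_NAMES = {
    "LOCALE_SLANGUAGE": 0x00000002,
    "LOCALE_SLANGDISPLAYNAME": 0x0000006F,
    "LOCALE_SENGLANGUAGE": 0x00001001,
    "LOCALE_SNATIVELANGNAME": 0x00000004,
    "LOCALE_SCOUNTRY": 0x00000006,
    "LOCALE_SENGCOUNTRY": 0x00001002,
    "LOCALE_SNATIVECTRYNAME": 0x00000008,
}


def legacy_alias(new_name):
    """Return the legacy labels that share a value with the given new name."""
    value = NEW_NAMES[new_name]
    return [name for name, v in LEGACY_NAMES.items() if v == value]


# New names with no legacy label carrying the same value.
truly_new = [name for name in NEW_NAMES if not legacy_alias(name)]
```

    Going by the excerpt, only the two display-name constants with values 0x72 and 0x73 end up in truly_new; every other new name is a second spelling of a value that already had a legacy label.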

    I wonder if removing the cruft is an appcompat risk now. It probably is, unfortunately.

    Now there are other pieces there in the latest file I find troubling, like the way LOCALE_SABBREVLANGNAME is an "arbitrary abbreviated language name, LOCALE_SISO639LANGNAME preferred".

    After all the time I have spent dealing with the myriad of issues around LOCALE_SABBREVLANGNAME and the blogs I have written about it like LOCALE_SABBREVLANGNAME is so not an ISO-639 code and LOCALE_SABBREVLANGNAME is more than just an ISO-639 code, I can say that these codes that uniquely identify locales and are the central method of recognizably identifying keyboards that have taken hundreds of hours of my professional life are not arbitrary. And I find myself almost personally offended that this push to use standards is willing to do so at the expense of functionality and of history.

    Some of it my own history!

    I'm hardly Jesus, but I'd rather not be denied, all things being equal.

    Regular long-time readers may remember Is it Hangul? or Hangeul? or Han'gŭl? or what?, where no effort is made to try and force a political issue, even when something is added for some interesting political reasons.

    This is still true, by the way, of the Korean charset constant.

    I guess if you aren't on the NLS team you don't over-think the issues so much. :-)

    Look, I was once a developer who considered the header files to be the only documentation worth looking at, and I do not mind editing with an eye to an agenda as long as the fundamental goal of usefulness isn't compromised. Comments are always a welcome addition, they truly are. As are version defines that avoid adding unintended version dependencies.

    Thinking of prior blogs like A way better model for features, part 2 that covers the flaws in the downlevel library for NLS functions, the fact that they are messing up the down-level functionality in the header file at least has the benefit of consistency.

    All in all, I can't say I'm a fan of the new header file; it is better to not include information or version defines than to make the file either harder to use or harder to learn from....

  • Sorting it all Out

    Fruit is your alarm clock? You may be getting up late today or tomorrow...

    I didn't have any problems waking up at the time I thought I would be waking up.

    You may not have been so lucky.

    Though I had a few things in my favor:

    • My own blog from last week (iOS <= 4.1: ±1 hour from savoir faire?) gave me, and perhaps you, ample warning, and
    • I don't use my mobile phone as my alarm clock anyway, and
    • I don't have an iPhone

    But perhaps you were not as lucky.

    I also had some celebrities warning me about the problem in case I did have an iPhone and it was my alarm clock and I forgot about that blog last week:

    Thanks Alyssa! :-)

    A part of me was hoping that Apple would get a hotfix out sooner since pulling in the iOS 4.2 update sooner would have been so much harder. But I guess they decided they would rather just take the hit and have a little widespread bit of trouble in their biggest (or at least their loudest!) market.

    Perhaps this will get Apple into that same mode about taking these issues more seriously that they have largely been able to avoid, which I think would be a good thing, for everyone. Even though they are technically the competition, I haven't ever really viewed them in exactly that way, and even if I did I'd rather generate interest on the merits of my products, not depending on the screwups or mistakes of others.

    But that's just me, I know that opinions will vary.

    Anyway, hope you didn't find yourself up an hour late, and hope you have removed your recurring alarm for tomorrow if you have one, like the instructions request.

    And finally, I hope the iOS 4.2 update doesn't brick your phone (a friend of mine said the 4.1 update had that effect on her phone, a rather abrupt way to prove to her in light of these DST issues the truth of the old truism better late than never)....

  • Sorting it all Out

    Please stop using this turd. And if you are an MS employee: stop using this turd. Now.

    Don't use GDI+.

    Please.

    I'll repeat.

    Don't use GDI+.

    Truly.

    It is a terrible thing to use.

    Well, strictly speaking, it may be perfect if you like to have hanging bugs in your code that may never be fixed.

    And if you are looking for mirroring support that is totally inadequate, GDI+ can deliver.

    I could go on.

    I could point out the latest problem someone pointed out -- that after all that has been found out in supporting many of the African languages like Yoruba, Igbo, and Hausa, the simple fact that the Latin script can at times be a complex script is one that GDI+ is blissfully unaware of.

    So was NT 4.0, so I suppose I shouldn't worry about it too much.

    And if you don't do Bidi or anything complex even, then maybe you can go on using this technology for your text rendering.

    But if you are internal to the company, then since Microsoft now supports languages like Yoruba, Igbo, and Hausa, the rules are different. And it is quite embarrassing to say you support a language but find out that some of your components don't support it. All because those few errant components were written with GDI+.

    If you work for Microsoft, I can be even more colorful about this topic. Let me know where and when and I will present for your group's leadership that use of GDI+ is akin to assault with intent to maim [text].

    People really ought to stop using this turd. Truly....

  • Sorting it all Out

    We will call it Filipino, even though there they will just call it Tagalog (and figure we talk to the gov't too much)

    THE WINDOWS 7 FILIPINO LANGUAGE INTERFACE PACK IS LIVE!

    Click here to download the Filipino Windows 7 LIP via the Microsoft.com Download Center.

    Please note that the Filipino Windows 7 LIP can only be installed on a system that runs an English client version of Windows 7. It is available to download for both 32-bit and 64-bit systems.

    The Filipino Windows 7 LIP is produced as part of the Local Language Program sponsored by Public Sector.

    A LITTLE BACKGROUND INFORMATION ON FILIPINO

    NUMBER OF SPEAKERS

    25 million native speakers; 60 million speakers worldwide

    NAME IN THE LANGUAGE ITSELF:

    Filipino

    Filipino is (together with English) the national language of the Philippines, as stated in Article XIV, Section 6 of the 1987 constitution of the country. The constitution declares it an evolving language that shall be "developed and enriched on the basis of existing Philippine and other languages". It is regulated by the Komisyon sa Wikang Filipino (Commission on the Filipino Language). In reality, Filipino is heavily based on Tagalog, an Austronesian language spoken by about 22 million people natively (mostly on the island of Luzon). Even amongst linguists there is some confusion about the exact relationship between Tagalog and Filipino. Three different views exist on the issue:

    • Filipino is Tagalog - the language names are basically synonyms. This is the view that most Filipinos have.
    • Filipino is a language in which all Philippine languages are being amalgamated, with English and Spanish as additional possible vocabulary sources. This is what the constitution wants, but it is hard to achieve because of the heterogeneity of the Philippine languages.
    • Filipino is the Tagalog dialect spoken in the metropolitan area of Manila, the major melting pot of the country's ethnic groups. It has many loanwords from English and borrowings from other Philippine languages.

    It is easy to imagine how different people can look at the same language in these three very different ways....

    CLASSIFICATION:

    Filipino is hard to classify given all of the above, but Tagalog is a member of the Austronesian language family to which, for example, also Malay, Indonesian, Tongan, Cebuano, Tausug, and Maori belong.

    SCRIPT:

    Filipino is written in the Latin script with the addition of two letters Ñ and Ng.

    This can be contrasted with the older Tagalog script, which has not been in use for centuries. The older script is in Unicode, though (see chart here).

     Click here for more information on the Filipino language.

    MICROSOFT SPECIFIC AND MICHAEL KAPLAN SPECIFIC:

    One of my first contacts with the team that would become Windows International had to do with a consulting job for a company that wanted Microsoft to assign several different locale identifiers (LCIDs) for languages they were supporting. One of them was Tagalog, and although Cathy (the person I did not know but was directed to speak to) refused to create an LCID for Tagalog, she did assign one for Filipino. This was the one I used, and they shook their heads since their understanding was the same as that of most of the Tagalog speakers in the Philippines.

    Cathy did not assign LCIDs for Cebuano or Tausug, despite my request. And though we became great friends later, her image of me as the crazy guy with crazy language requests and my image of her as the wiatch who refused to give me my LCIDs were a part of our mythology for some time.

    Enjoy!

  • Sorting it all Out

    It may not be the best idea to think of Luxembourgish as "German with an army". :-)

    • 2 Comments

    THE WINDOWS 7 LUXEMBOURGISH LANGUAGE INTERFACE PACK IS LIVE!

    Click here to download the Luxembourgish Windows 7 LIP via the Microsoft.com Download Center.

    Please note that the Luxembourgish Windows 7 LIP can only be installed on a system that runs an English client version of Windows 7. It is available in both 32-bit and 64-bit versions.

    The Luxembourgish Windows 7 LIP is produced as part of the Local Language Program sponsored by Public Sector.

    A LITTLE BACKGROUND INFORMATION ON LUXEMBOURGISH:

    NUMBER OF SPEAKERS

    ~300,000

    NAME IN THE LANGUAGE ITSELF:

    Lëtzebuergesch  

    Luxembourgish is spoken in the small Western European country of Luxembourg, where it has been an official language since 1984 (together with French and German). Though it is so closely related to German (see "Classification") that German speakers should have no major problems understanding Luxembourgish, the differences between the two languages in terms of grammar are considerable. Luxembourgish has also borrowed many words from French (from merci for thank you to Prabbeli from parapluie for umbrella).

    INTERESTING FACT:

    Though Luxembourg is a founding member of the European Union, Luxembourgish is not an official language of the EU. In Luxembourg itself laws are not published in Luxembourgish either.

    CLASSIFICATION:

    Strictly linguistically speaking, Luxembourgish is a West Central German dialect, but due to its standing it can be considered a language of its own (as the saying goes, "A language is a dialect with an army"). As a Germanic language, Luxembourgish belongs to the family of Indo-European languages.

    SCRIPT:

    Luxembourgish is written in Latin script. There are four special characters: é, ä, ë and ü.

    Click here for more information on the Luxembourgish language.

    Enjoy!

  • Sorting it all Out

    Y can't Z Undo, exactly?

    • 3 Comments

    People sometimes ask me how many languages I speak, since I seem to talk about so many different languages all the time.

    Given my incomplete knowledge of so many languages (including English!) I usually answer 0.6 languages.

    But in truth between the smattering of so many languages I have learned bits and pieces of over the years, I find myself able to read a lot more mail than I ever would have thought possible.

    So last night when Dimiter forwarded a mail to me accidentally, and it was entirely in Bulgarian, I took it in stride:

    Subject: bug

    Когато си пусне човек кирилицата на Bulgarian Phonetic (не Bulgarian Phonetic Traditional, а новата подредба дето предложиха БАН и която идваше по default с
    Vista) ctrl+z спира да работи във всички програми от Notepad до Office. Много досадно. Reproduce-ва се на всички компютри 100% и не става на другите клавиатурни подредби като en, bds и phonetic traditional.

    Айде ако ти пука report-вай го. Аз мисля, че достатъчно опити направих.

    It took nearly four minutes before he sent another mail with the translation, apologizing for having had me on the mail (he was just looking for my email address and did not mean to have me on the reply!). By then I had already checked out the mail.

    Actually, there were enough terms I knew and English terms (like the names of keyboard layouts on Windows) that I had already translated the mail in my head, and knew what the problem was!

    Dimiter provided a confirmation that I was right, four minutes later as I mentioned:

    apparently CTRL+Z doesn't work on Bulgarian (Phonetic). It doesn't repro with any other layout (including Phonetic Traditional), so it's a specific bug.

    This is technically not a bug, actually.

    But I should explain what is going on here.

    First let's load up the Bulgarian Phonetic keyboard layout

    in that tool of 1000 uses, MSKLC (Microsoft Keyboard Layout Creator):

    Now of course one knows where the "Z" key is on the keyboard: it is the key that in MSKLC is just above the Left Alt key (the one just to the right of it is the "102" key not present on some keyboards).

    You know, this key:

    Though if you hover over that key, you will see that the key is not what you think it is!

    Try it!

    VK_Y?

    What's that doing way over there?

    Thank goodness for a hidden feature in MSKLC that will make finding VK_Z much easier.

    Just hit the "Z" key on your keyboard and MSKLC will select the key, which turns out to be this one:

    The one under the 6.

    Hover over to confirm:

    Wow, weird.

    So, the "Y" and "Z" seem reversed.

    Where have I seen that before?

    Oh yeah, I remember.

    On the German keyboard layout:

    Whoever put together the Bulgarian Phonetic keyboard layout must have started from the German keyboard layout to get its slightly different view of the mapping between scan codes and virtual keys.
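    To make the consequence concrete, here is a minimal sketch of why Ctrl+Z stops being Undo. The scan codes (0x15, 0x2C) and virtual key values (VK_Y = 0x59, VK_Z = 0x5A) are the real ones, but the LayoutSketch class, its two tables, and the IsUndo check are hypothetical names of mine, standing in for whatever a given application actually does when it looks only at the virtual key:

```csharp
using System;
using System.Collections.Generic;

static class LayoutSketch
{
    // Win32 virtual-key values for the Y and Z keys.
    public const int VK_Y = 0x59;
    public const int VK_Z = 0x5A;

    // PC scan codes for the two physical keys discussed in the post.
    public const int ScanUnderSix = 0x15;     // the key under the 6
    public const int ScanAboveLeftAlt = 0x2C; // the key just above Left Alt

    // Hypothetical, two-entry models of a layout's scan code -> virtual key table.
    public static readonly Dictionary<int, int> UsStyle = new Dictionary<int, int>
    {
        { ScanUnderSix, VK_Y },
        { ScanAboveLeftAlt, VK_Z },
    };

    // German-style tables swap the two, which is what Bulgarian Phonetic appears to do.
    public static readonly Dictionary<int, int> GermanStyle = new Dictionary<int, int>
    {
        { ScanUnderSix, VK_Z },
        { ScanAboveLeftAlt, VK_Y },
    };

    // Assumed application logic: Undo is "Ctrl plus whatever key reports VK_Z".
    public static bool IsUndo(Dictionary<int, int> layout, int scanCode, bool ctrlDown)
    {
        return ctrlDown && layout[scanCode] == VK_Z;
    }

    static void Main()
    {
        // Same physical key (above Left Alt), two different answers.
        Console.WriteLine(IsUndo(UsStyle, ScanAboveLeftAlt, true));     // True
        Console.WriteLine(IsUndo(GermanStyle, ScanAboveLeftAlt, true)); // False
        // Under the swapped table, Undo lives on the key under the 6 instead.
        Console.WriteLine(IsUndo(GermanStyle, ScanUnderSix, true));     // True
    }
}
```

    Nothing in the sketch looks at the letter printed on the key or the character the layout types, which is the whole problem.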

    This kind of takes us full circle to Raymond Chen's Why are the keyboard scan codes for digits off by one? (which was in turn a riff off my Off by one what, exactly?):

    Of course, if the original keyboard designers had started counting from the lower left corner, like all right-thinking mathematically-inclined people, then this sort-of-coincidence would never have happened. The scan codes for the digits would have been 2E through 37, and nobody would have thought anything of it.

    Further to this point about if we had gotten a different design of these key assignments, had the Virtual Key values not been based on the letters in the US keyboard, then languages that used the same Latin Script letters (e.g. French and Spanish and Portuguese and German) would not have felt compelled to move the VK values around when letters moved around. This movement is something that keyboards that use other scripts generally leave in the basic US keyboard positions and do not get in the habit of doing, so the only people who run into problems are:

    • Anyone who tries to use a software keyboard layout for a different Latin script language that doesn't match the letters printed on the hardware faces, and
    • Anyone who tries to use a software keyboard layout whose scan code to virtual key mapping is based on one of those other keyboards.

    Obviously a program could figure out the exact letter on the key and tell you what is for these cases, but almost no programs do.

    And we can't change the layout now that it has shipped this way. The only way it could ever be "fixed" is to deprecate the old layout and add yet another Bulgarian layout -- one with identical behavior in the letters it types for most users since the keyboard shortcuts apparently are not that commonly used in Bulgarian (which is why Microsoft had never heard this bug report previously despite the keyboard layout having existed since Vista was in beta, and called Longhorn).

    If I still owned MSKLC (or if the current owners ever planned to ship a version again, a concept that seems less and less likely all the time!) then I'd also suggest detecting this as a validation issue for all non-Latin script keyboard layouts (and perhaps a separate rule for Latin script language keyboards to match the VK values for roughly the same reason). But that is just something I think about in idle moments.

    This kind of issue with the keyboard layout itself may well just be a permanent small confusion for the Bulgarian Phonetic keyboard layout, with just this blog (and perhaps a future KB article?) describing the full story....

  • Sorting it all Out

    iOS <= 4.1: ±1 hour from savoir faire?

    • 5 Comments

    If you have an iPhone, which I assume most of you do, you may have heard about the problems with the alarm apparently not handling Daylight Savings Time in Europe correctly.

    If you are in Europe, you may have woken up an hour late.

    If not you can read about it in articles like the one from CNN (iPhone gives Europe extra hour of sleep) or PC World (iPhone Alarm Bug Has Europe Waking 1 Hour Late). Or one of the others.

    There are lots of others.

    It's funny, people I know were sending emails around as that article from CNN was making the rounds. This one is typical:

    It’s always interesting to me how anything negative in the Apple Ecosystem gets turned into a positive by the media – instead of ‘iPhone makes Europe late for work’; it’s "iPhone Gives Europe Extra hour of Sleep".

    He hadn't seen the other article yet!

    Now that is a tempting way to respond, but it is off the mark a little bit.

    Really the difference in response between when Microsoft has problems with time zones versus when Apple does has a lot less to do with any kind of "Apple is cooler" or "Steve Jobs reality distortion field" than people are giving credit for.

    It is about the two different kinds of customer situations we are talking about -- the typical Microsoft PC user vs. the typical Apple iPhone user.

    Actually, that can sometimes even be the same person, so let me restate that:

    It is about the two different kinds of customer situations we are talking about -- the typical Microsoft PC use vs. the typical Apple iPhone use.

    I mean, what is the worst that happens if your clock is an hour off? Nothing, really.

    Hell, even if you are still using your alarm clock you have an excuse not to go into work on time. Just tell them you have an iPhone.

    Supposedly the update (in iOS 4.2) may not make it to phones in the US before our DST transition. So you will have that same excuse, very soon!

    Even if you work for Microsoft (just tell them you have to wait for your Windows Phone before you can change over).

    When you look at the two companies, we are talking about the same kinds of bugs. Just like with us, I'll bet there have been and will continue to be iPhone users with the same problems in other time zones that no one notices. It is only in these "bug" time zones, where all the press focus is, that the huge issue gets pointed out.

    But no one is running their business on an iPhone. There aren't even nearly as many running their business on Macs (unless they are running Windows through Boot Camp).

    You can get away with maybe not having an update to fix a time zone bug out by November even if you knew about the bug in October if you aren't breaking your work computers. You don't need dedicated teams making sure time zone updates are handled ASAP, and you can even miss the same bug in another country a week later.

    Apple did have a strategy to take on the business world, and it nearly killed them. They know how to avoid death (keep Steve Jobs!) and go with strengths.

    I like my MacBookPro, it's my second favorite machine, even when it is running Windows 7 (even Windows thinks the Apple hardware is cooler, that should tell us something!). But I won't make it my main work machine when it is booted into anything other than Windows.

    I'd rather have missed appointments be blamed on a more natural set of targets -- Outlook or Exchange Server, not the operating system. :-)

    Anyway, enjoy your hours of extra sleep, iPhone users. Just keep acting like the bug is biting you and you can milk this for a few weeks, I'd imagine. Even after the update comes out ("it won't install on my phone, what the hell?" is a great affirmative defense)....

  • Sorting it all Out

    The consequences of being unintuitive and nonconformant

    • 5 Comments

    It was just days ago that Weijiang asked one of those questions that comes along every now and again that makes the way things work on Microsoft platforms (known to some as The Way Things Work™) seem a little off.

    Weijiang's question was:

    Hi. I met a unexpected problem when using string.IndexOf. The following code demonstrates the problem:

    string r = "\ufffd\ufffd\ufffd\ufffd";
    string tar = "a";
    Console.WriteLine(tar.IndexOf(r));


    Can you guess the output? The output is 0, which is very weird for me. Can someone explain why? Because this has broken my program, which assume if a.IndexOf(b)>= 0 Then a.Length >= b.Length.

    Thanks

    Now this question had built into it the opportunity to both correct the question and shame the technology at the same time. Usually I would never turn such an opportunity down!

    But before I really had an opportunity to craft a response, Pavel beat me to the punch with a very well-thought-out reply:

    First of all, s.IndexOf(“”) always returns 0 for a non-empty s, for obvious reasons.

    Going from there, your assumption about relation between IndexOf and Length is generally incorrect, even for characters other than U+FFFD. For example, U+2060 (word joiner) and U+00AD (soft hyphen) are also treated as “empty”, and thus string “\u2060\u00ad” would be treated the same as “” for the purposes of culture-sensitive comparisons, which is where your result comes from. This is also the case for Equals, CompareTo and other String methods so long as you request culture-sensitive comparison. It just so happens that IndexOf does so by default, while e.g. Equals and Contains do not.

    Generally speaking, this is consistent with user expectations, since those characters are not part of the “semantic load” of the string as far as user is concerned. E.g. consider user copy-pasting a string with a soft hyphen (which, unless the line break occurs, is not observable to him) into the search dialog. He’d be quite surprised if your app says that it couldn’t find it, while he can clearly see it in the text!

    On the other hand, String.Length is a simple counter of 16-bit code units (i.e. chars), without any special treatment to some over others. If you want to match that with IndexOf, use the overload which takes StringComparison.Ordinal to explicitly request code unit comparison.

    The difference in defaults is certainly quite error-provoking, though. So much so that many .NET coding styles require always requesting either culture-sensitive or ordinal comparison explicitly (by using StringComparison or CultureInfo) for methods which permit both, even when the requested mode is the default for that particular method.
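    As a small sketch of the advice above (my own illustration, not Pavel's code), here is the original snippet with the comparison mode spelled out. The IndexOfSketch class and OrdinalIndex helper are hypothetical names; String.IndexOf(String, StringComparison) is the real overload. I only claim an expected value for the ordinal call, since the culture-sensitive result depends on the collation data the runtime is using:

```csharp
using System;

static class IndexOfSketch
{
    // Hypothetical helper: search by 16-bit code units, the same units String.Length counts.
    public static int OrdinalIndex(string source, string value)
    {
        return source.IndexOf(value, StringComparison.Ordinal);
    }

    static void Main()
    {
        string r = "\ufffd\ufffd\ufffd\ufffd";
        string tar = "a";

        // Ordinal: a one-unit string cannot contain a four-unit string.
        Console.WriteLine(OrdinalIndex(tar, r)); // -1

        // Culture-sensitive, which is what IndexOf(string) asks for by default.
        // The question above reports 0 here; no value is asserted in this sketch.
        Console.WriteLine(tar.IndexOf(r, StringComparison.CurrentCulture));

        // With ordinal matching, a hit really does imply source.Length >= value.Length.
        int hit = OrdinalIndex("banana", "nan");
        Console.WriteLine(hit);                              // 2
        Console.WriteLine("banana".Length >= "nan".Length);  // True
    }
}
```

    Being explicit at every call site is noisier, but it makes the intent visible to whoever has to debug it later.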

    Now in addition to his excellent examples, there are other things I would point out.

    Like the fact that the original assumption "if a.IndexOf(b)>= 0 Then a.Length >= b.Length" was broken to begin with. All one needs is examples like "\u00e5".IndexOf("\u0061\u030a") to disprove that assumption (that is a-ring, and a plus combining ring, for those who don't speak Unicode code points).
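    For those who would rather see it than take my word for it, here is a minimal sketch of that counterexample. The RingSketch class and Decompose helper are hypothetical names; String.Normalize and NormalizationForm.FormD are the real .NET APIs. Again I only state outputs for the parts that do not depend on collation data:

```csharp
using System;
using System.Text;

static class RingSketch
{
    // Hypothetical helper: canonical decomposition (NFD) of a string.
    public static string Decompose(string s)
    {
        return s.Normalize(NormalizationForm.FormD);
    }

    static void Main()
    {
        string source = "\u00e5";       // a-ring, precomposed
        string value = "\u0061\u030a";  // a plus combining ring

        Console.WriteLine(source.Length); // 1
        Console.WriteLine(value.Length);  // 2

        // Code unit by code unit, the two strings have nothing in common.
        Console.WriteLine(source.IndexOf(value, StringComparison.Ordinal)); // -1

        // Yet they are canonically equivalent: decomposing one yields the other.
        Console.WriteLine(Decompose(source) == value); // True

        // A culture-sensitive search is expected to honor that equivalence, which
        // would mean a match in a string shorter than the one being searched for.
        // No expected value is claimed here.
        Console.WriteLine(source.IndexOf(value, StringComparison.InvariantCulture));
    }
}
```

    The lengths 1 and 2 are the point: the string that is found and the string being looked for need not be the same size.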

    In fact, that is why all the extra work went into FindNLSString to provide not only a return value like IndexOf returns but also to return the length of the found string -- since one cannot make assumptions about the length of the found string based on the length of the string one is trying to find. The extra support in FindNLSString points to a real hole in the scenario of usefulness of several potential ways one might want to utilize culturally sensitive comparisons -- a limitation that still exists in .Net even in the latest version.

    Given the lack of this support in earlier languages like Java, it isn't that surprising that .Net hasn't considered it too important to add.

    After all, me alone clamoring for something is generally not a good enough reason to do anything, since I clamor for so many things. :-)

    But I digress....

    Afterwards, I sent Pavel some mail complimenting him on his response, and he replied that he did feel like his answer was perhaps a little incomplete:

    I dodged the original question somewhat, since I didn’t explain why U+FFFD specifically is treated as a “noncharacter” (newspeak seems highly appropriate here somehow). And that’s because I don’t know, and don’t have any good guess as to why. Logically speaking, it’s “something we didn’t know how to handle”, so whether it was meaningful to the end user or not, we do not know. It would seem that, by considering it unimportant, we’re making a wild guess there.

    A dodge? He may have been harder on himself than he had to be.

    If a person is on trial for a crime and the prosecutor's evidence relies on an illegal search, then it may be a technicality to get the case thrown out (and therefore ignore the fact that the crime may have been committed), but I wouldn't consider it a dodge.

    I find it kind of cool that Pavel didn't have a good guess as to why the behavior is what it is, since when I originally did the work in FindNLSString my goal was based on my own (naive) notions of intuitive behavior, many of which were different from the behavior eventually supported by the function -- on the basis of the need to match the .Net functionality (the function was added to support synthetic locales in .Net, so that behavior matching was considered pretty crucial from a scenario perspective).

    It turns out that most of the actual behavior was done for the sake of expectational behavior based on behavior in Java, since it was assumed that lots of the .Net developers may have once been Java developers. A lot like the way DOS behavior was so often CP/M based (something I found helpful when I moved from CP/M on an Osborne 1 to DOS on a PC all those years ago).

    For me the chain of evidence runs dry, for two reasons:

    1. I took the word of others that many of the edge case behaviors in .Net in general and C# in particular had a Java basis, as that fact is much more told-truth to me than truth;
    2. I do not know the reasons why Java might have chosen to behave as it does, so I can't answer the next anticipated question ("Well then why does Java do it that way?").

    The original question, focusing on U+FFFD (REPLACEMENT CHARACTER), hits issues I have often discussed in the past in other blogs:

    And it is interesting how all those various connections occurred to me after the original question -- how it all ties together on a bunch of different, not entirely intuitive design decisions that now have far-ranging consequences on function/method behavior and security....
