May, 2011

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    f y cn rd ths, thn cd tht strps yr vwls my nt bther y s mch....

    Relatively common practices can often be dead wrong.

    Take the following code snippet for example (names munged to protect the guilty):

    BOOL CSillyBackwardsCollection::IsValidAlias(__in LPCWSTR pszAlias)
    {
        while ((*pszAlias != chNullTerminator) && (MAX_NAME_LENGTH > index))
        {
            if ( !iswalnum(*pszAlias) && (*pszAlias != L'_') && (*pszAlias != L'-') )
            {
                // not a valid identifier
                return FALSE;

    Okay, now it is obvious that there are few cases where every single code point in Unicode should be treated as valid for a name, identifier, or alias.

    And it is great that this code is using Unicode - it has been a long time coming and it is good to see more and more people doing this, by default and automatically.

    But iswalnum?

    iswalnum?

    iswalnum?

    Geez.

    This Microsoft CRT function does not (in any version whatsoever) follow the latest best practices that Unicode suggests in either Unicode Technical Standard #18: Unicode Regular Expressions or Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax. That means there are all kinds of perfectly valid Unicode characters, needed for many of the languages and scripts covered by Unicode, that this code will reject.

    See the title of this blog? There are languages and scripts that will be impacted the same way in their letters. Don't forget the lessons of blogs like Is Kana 'alphabetic' ? Depends on who you ask....; there are languages that are completely taken out of the running here!

    Now a part of me wants to blame this code snippet.

    But just a small part.

    Because a much bigger part of me sees that the biggest problem is the need to overhaul all of the Microsoft Visual C Runtime CTYPE (character type) functions to be in line with both UTS #18 and UAX #31.

    For example, see this table for a much better way to classify Unicode characters. The distance between this and what the CRT uses is huge!

    The CRT needs to grow up and embrace Unicode in all of its uses rather than just using the schemes cobbled together for POSIX compatibility back before Unicode was anywhere significant....
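    To make the point concrete, here is a minimal sketch of the same kind of check with the character test taken out of the CRT's hands. Everything named here is hypothetical (`CodePointRange`, `kSampleLetterRanges`, `IsIdentifierChar`, and this version of `IsValidAlias`), and the range table is only a tiny hand-picked sample, not the real ID_Start/ID_Continue data from UAX #31, which real code would get from a Unicode library. Rejecting empty and over-long names is also my assumption, since the snippet above is cut off before it shows what it does there.

```cpp
#include <cassert>
#include <cstddef>

// One inclusive range of code points.
struct CodePointRange { char32_t first; char32_t last; };

// Hypothetical sample only: a few ranges of letters and digits.
// This is NOT the full UAX #31 identifier data.
constexpr CodePointRange kSampleLetterRanges[] = {
    {U'0', U'9'},
    {U'A', U'Z'},
    {U'a', U'z'},
    {0x00C0, 0x00D6},   // Latin-1 letters (skips U+00D7 MULTIPLICATION SIGN)
    {0x00D8, 0x00F6},   // Latin-1 letters (skips U+00F7 DIVISION SIGN)
    {0x00F8, 0x00FF},
    {0x3041, 0x3096},   // Hiragana letters
    {0x30A1, 0x30FA},   // Katakana letters
    {0x4E00, 0x9FFF},   // CJK Unified Ideographs block
};

// True if cp is '_' or '-' (as in the original snippet) or falls in the sample table.
bool IsIdentifierChar(char32_t cp) {
    if (cp == U'_' || cp == U'-') return true;
    for (const CodePointRange& r : kSampleLetterRanges) {
        if (cp >= r.first && cp <= r.last) return true;
    }
    return false;
}

// Assumed policy: non-empty, at most maxLength code points, every one accepted above.
bool IsValidAlias(const char32_t* alias, std::size_t maxLength) {
    assert(alias != nullptr);
    std::size_t length = 0;
    for (; alias[length] != U'\0'; ++length) {
        if (length >= maxLength) return false;            // too long
        if (!IsIdentifierChar(alias[length])) return false;
    }
    return length > 0;
}
```

    With a table like this, a kana name passes because the table says so, not because of whatever the CRT happens to think "alphanumeric" means.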

    Note to folks who own the CRT -- my schedule is up to date if you want to discuss the needs of UTS #18 and UAX #31 further! :-)

  • Sorting it all Out

    Semijazzed about semilight!

    So the other day, in response to To True Boldly Go Where No Font...(yada yada yada), Quppa commented:

    What weight does Segoe UI Semilight (found in WP7 and maybe Windows 8) correspond to?

    I don't know what the latter product is, never heard of it before. :-)

    But the former product doesn't have a Segoe UI Semilight in it or on it.

    Now Windows Phone 7 has a Segoe WP Semilight, and the other fonts in the Segoe WP * series seem by all appearances to be honest subsets of the Segoe UI * series of fonts in Windows 7.

    So perhaps there is something to this idea.

    Though it puts things in a weird place.

    I mean, thinking back to the weight names in this table from LOGFONT->lfWeight:

    Value           Weight
    FW_DONTCARE     0
    FW_THIN         100
    FW_EXTRALIGHT   200
    FW_ULTRALIGHT   200
    FW_LIGHT        300
    FW_NORMAL       400
    FW_REGULAR      400
    FW_MEDIUM       500
    FW_SEMIBOLD     600
    FW_DEMIBOLD     600
    FW_BOLD         700
    FW_EXTRABOLD    800
    FW_ULTRABOLD    800
    FW_HEAVY        900
    FW_BLACK        900

    Now in pretty much every context that counts, semilight would be less light than light.

    Just like semibold in this table is less bold than bold -- thus FW_BOLD is 700, FW_SEMIBOLD is 600, and FW_REGULAR is 400.

    But on the other side, FW_LIGHT is 300 and FW_REGULAR is 400, so I guess some theoretical future FW_SEMILIGHT may well be 350 (or some other number between 300 and 400).

    Of course it does seem like the scale is a bit off here if you compare what the Semi* and Demi* meanings are on the bold side. But these are just descriptive words that act as approximations, not mathematical concepts with all the rigor that would imply.
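    The arithmetic behind that guess is small enough to sketch. The constants below are local copies of the values in the table above, redeclared so the code builds without the Windows headers; `kSemilight` and `FractionFromRegular` are hypothetical names, and 350 is only the guessed midpoint, not a value Windows defines.

```cpp
#include <cassert>

// Local copies of LOGFONT lfWeight values from the table above.
constexpr int kLight    = 300;  // FW_LIGHT
constexpr int kRegular  = 400;  // FW_REGULAR
constexpr int kSemibold = 600;  // FW_SEMIBOLD
constexpr int kBold     = 700;  // FW_BOLD

// Hypothetical: a future FW_SEMILIGHT guessed as the midpoint of Light and Regular.
constexpr int kSemilight = (kLight + kRegular) / 2;
static_assert(kSemilight == 350, "midpoint of 300 and 400");

// How far `weight` sits on the way from Regular toward `extreme`
// (0.0 means Regular, 1.0 means the extreme weight itself).
inline double FractionFromRegular(int weight, int extreme) {
    assert(extreme != kRegular);  // avoid dividing by zero
    return static_cast<double>(weight - kRegular) / (extreme - kRegular);
}
```

    With these numbers Semibold sits two thirds of the way from Regular to Bold, while the guessed Semilight sits only halfway from Regular to Light, which is the mismatch in the scale.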

    Perhaps if I had sophisticated font file tools on my WP7 device I could find out exactly what weight Segoe WP Semilight claimed to be, as perhaps that would cause it all to make sense.

    Or I suppose I could just get it from the Design Templates for Windows Phone 7 and find out more directly....


    Okay, clearly Segoe WP Semilight sits between Segoe WP Light and Segoe WP.

    Now just to keep me from feeling too guilty for breaking future code from anyone who does their own defines, please be sure to wrap it all up as follows if you agree with my logic and want to use it sooner:

    #ifndef FW_SEMILIGHT
    #define FW_SEMILIGHT 350
    #endif // FW_SEMILIGHT

    Okay, now I feel better.... :-)

    I am truly at a minimum semi-jazzed about the Semilight weight (full jazz would require all the hinting work to be done!).

  • Sorting it all Out

    Relationships (according to Science!)

    Based on a Facebook insight in the middle of commenting on one of Hande's statuses from yesterday morning -- an insight I had once before, though that prior use was for utility, explaining it with a relationship, rather than for entertainment as it was yesterday and maybe is today....

    It starts with the shapes, the curves. I believe it was Andy "Slipstick" Libby who first suggested the shapes where one could find elegant fifth order functions - he was including the time variable. So we (the boys, I mean) watch those shapes. We want something here, we just have no idea what it is exactly.

    In other words, it is about GEOMETRY.

    Then, it really is just a numbers game. I mean, sometimes one has to be turned down by Tammy and Stephanie and Melissa before one gets to meet Arlene and then reacquaint oneself with Tammy, later on in the numbers game when things start to add up.

    In other words, it is about the MATH.

    Along the way one sometimes finds that things really click, that there is a real connection that takes hold and refuses to let go. It's excitement, it's attraction.

    In other words, it is about CHEMISTRY.

    The next step leads to learning more about each other, about beliefs and history and values. Because if you find you have enough in common then sooner or later it is going to get more private and intimate.

    In other words, it is about PHYSICS.

    {perhaps at this point an importance of "hard" sciences pun -- thanks Dave! -- would be appropriate while of course being simultaneously inappropriate}

    Next, if one wants to be with someone forever (imagine Erin Gray as Wilma Deering asking "Oh Buck Rogers, does that mean we'll be together for always?"), there will be marriage and sometimes there will even be perpetuation of the species -- also known as children.

    In other words, it is about BIOLOGY.

    So if you ever were wondering why people put so much emphasis on the hard sciences and their importance, now you know -- IT'S ALL ABOUT SCIENCE!

  • Sorting it all Out

    MUI and MFC live together in perfect harmony...

    I'm not really in touch with my old group the way I used to be.

    Even though I still work with them from time to time and find myself in meetings with them occasionally, the connection just doesn't occur to people to use.

    I mean for example, it was about 3.5 years ago, in blogs like

    where Erik Fortune (and I suppose I) promised a sample of how to integrate MUI with MFC, as an example of how to use MFC within another resource technology framework.

    It is one of the only "guest post" blogs I have ever published, although once when someone in a group I used to be in accused me of using the Blog to gain extra visibility I offered anyone the ability to do a guest blog if they wanted. No one took me up on the offer.

    I'm digressing again....

    Anyway, back to those blogs from 3.5 years ago, the ones with the promise.

    I kind of lost track of the whole thing, to be honest.

    But it turns out they did put the sample up eventually, they just didn't tell me about it....

    You can find it right here: Implementing MUI fallback with MFC

    Sorry I didn't point to it sooner, I didn't know! We just aren't in contact like we used to be.

    Now these are the kind of samples that can make MUI easier to approach for the many people who already have solutions of some type in place. In fact, if I have any objection, it is that they didn't push harder to build it into MFC directly, but that's a much bigger resource issue, obviously....

    Anyway, enjoy!

  • Sorting it all Out

    If it's overtly over-applied and overarching, it may be an overreach (e.g. in nOrway!)

    I will admit that, when I consider all the differences between me and the members of my family, there are occasionally times when I wonder whether I was somehow switched at the hospital or something like that.

    Idle thoughts that I quickly discard, given how traumatic it would be to take one who had been believed to be a parent and change the identification.

    Given that, it is slightly unfortunate that the parent/child metaphor is used in Windows locales and .NET Framework cultures.

    Especially in light of an issue that came up not too long ago....

    Isabella's question was, to my line of thinking, profoundly reasonable:

    I noticed one undocumented behavior change while debugging. The symptom is the parent culture of “nb-no” changes across 2 .Net versions.

    Any idea about this? Is it by design or a bug?

    CultureInfo culture = CultureInfo.GetCultureInfo("nb-no");
    Console.WriteLine("Parent name: {0}, LCID:{1:X}", culture.Parent.Name, culture.Parent.LCID);

    Target Framework: 3.0 / 3.5:        Parent name: no, LCID:14
    Target Framework: 4.0 :               Parent name: nb, LCID:7C14

    Thanks,
    Isabella

    All of this is a side effect of a big change that happened in the Windows 7/.NET Framework 4.0 timeframe.
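    A side note on the two LCIDs in that output, 14 and 7C14: both carry the same primary language ID and differ only in the sublanguage bits. The sketch below redoes the arithmetic of the `PRIMARYLANGID` and `SUBLANGID` macros from the Windows headers as plain functions, so it builds anywhere; the function names `PrimaryLangId` and `SubLangId` are mine, not part of any API.

```cpp
#include <cassert>
#include <cstdint>

// The low 16 bits of an LCID are the language ID: its low 10 bits are the
// primary language, its high 6 bits are the sublanguage.
constexpr std::uint16_t PrimaryLangId(std::uint32_t lcid) {
    return static_cast<std::uint16_t>(lcid & 0x3FF);
}

constexpr std::uint16_t SubLangId(std::uint32_t lcid) {
    return static_cast<std::uint16_t>((lcid & 0xFFFF) >> 10);
}

// "no" (0x0014) and "nb" (0x7C14) from the output above share primary language 0x14.
static_assert(PrimaryLangId(0x0014) == 0x14, "no");
static_assert(PrimaryLangId(0x7C14) == 0x14, "nb");
static_assert(SubLangId(0x0014) == 0x00, "no uses sublanguage 0");
static_assert(SubLangId(0x7C14) == 0x1F, "nb uses sublanguage 0x1F");
```

    The same split applied to nb-NO (0x0414) and nn-NO (0x0814) gives primary language 0x14 again, with sublanguages 1 and 2.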

    Perhaps undocumented was not the most accurate term for Isabella to use here; either under-documented or poorly documented would have been better, in this case.

    The documentation, such as it is, can be found in the white paper entitled Microsoft .NET Framework 4: What is New in Globalization:

    Renovating Globalization Information

    In the real world, the globalization information is constantly changing because of cultural developments in the local markets, because of new standards which update the culture sensitive information frequently, or because Microsoft finds more accurate information about different markets or expands into more markets. Microsoft .NET Framework 4 supports a minimum of 354 cultures compared to a minimum of 203 cultures in the previous release. Many of those cultures are neutrals that were added to complete the parent chain to the root neutral culture.

    This "insertion of neutrals" is the "justification" for the change here, though as justifications go I have a few qualms here.

    But then I have other qualms with this white paper, as I discussed previously in my evisceration of the breaking change in the KeyboardLayoutId in Reporting one casualty in the operation; luckily it was the stupidest member of the unit. :-)

    Anyway, let me explain my new qualms, and I will let my Norwegian friend Kim (of See Kim. See Kim run. See Kim run setup and find a bug! fame) tell me if I am right or wrong. His review is a great help to me, since more and more often I see Kim as my own personal ambassador to Norway. This is in no small part due to the fact that plans for the SIAO embassy to Norway in Stjørdalshalsen (a town in Nord-Trøndelag county) have been put on hiatus while I try to figure out how to pronounce Stjørdalshalsen, and to put some distance between me and the original plans to locate the embassy in the nearby village of Hell, a less than brilliant marketing plan to deal with people who would from time to time tell me to Go To Hell, Michael! The whole affair can be attributed to the influence and impact of attractive Norwegian women who befriend me and say things that sound clever while we are both drinking. Thankfully I was not told to Go To Hell in that context....

    Anyway, that was quite a digression! Back to the topic....

    If you look at the white paper, it gives a great example of the very good reason for the additional neutrals that were added, namely the cross-script child to parent relationships:

    For example, three Inuktitut neutrals were added to the already existing cultures Inuktitut (Syllabics, Canada) and Inuktitut (Latin, Canada) as shown in the following table.

    Culture Display Name            Culture Name   LCID
    Inuktitut                       iu             0x005D
    Inuktitut (Syllabics)           iu-Cans        0x785D
    Inuktitut (Syllabics, Canada)   iu-Cans-CA     0x045D
    Inuktitut (Latin)               iu-Latn        0x7C5D
    Inuktitut (Latin, Canada)       iu-Latn-CA     0x085D

    Table – Example of many new neutrals

    Now in resource fallback (the main scenario of the chain) the older chain was really something of a liability since it was crossing the script boundary a little too freely, in this case and many others.

    But in the case of Bokmål vs. Nynorsk, we aren't talking about script differences, country differences, or language differences. We are talking about two theoretical extremes of a dialect-based difference that spans the country of Norway and only Norway, and in actuality the old fallback behavior isn't really bad in any linguistic or market-based sense -- a neutral Nynorsk doesn't really seem to serve much of a real purpose....
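    For anyone who wants to see the two chains side by side, here is a rough sketch of a parent-chain walk. It is not how the .NET Framework implements `CultureInfo.Parent`; `ParentMap`, `FallbackChain`, and the two tables are hypothetical. The entries come from Isabella's output and the white paper's table, except the `nb` to `no` link in the 4.0 table, which is my assumption since neither source states it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Maps a culture name to its parent; an empty parent means the invariant root.
using ParentMap = std::map<std::string, std::string>;

// 3.0 / 3.5 behavior, Norwegian entries only (from Isabella's output).
const ParentMap kNet35Parents = {
    {"nb-NO", "no"}, {"no", ""},
};

// 4.0 behavior (from Isabella's output and the Inuktitut table).
const ParentMap kNet4Parents = {
    {"iu-Cans-CA", "iu-Cans"}, {"iu-Cans", "iu"},
    {"iu-Latn-CA", "iu-Latn"}, {"iu-Latn", "iu"},
    {"iu", ""},
    {"nb-NO", "nb"},
    {"nb", "no"},   // assumption: not stated in the post
    {"no", ""},
};

// Returns the cultures a resource lookup would try, most specific first.
std::vector<std::string> FallbackChain(const ParentMap& parents, std::string culture) {
    std::vector<std::string> chain;
    // The size check guards against a cycle in a malformed table.
    while (!culture.empty() && chain.size() <= parents.size()) {
        chain.push_back(culture);
        auto it = parents.find(culture);
        if (it == parents.end()) break;   // unknown culture: stop here
        culture = it->second;
    }
    return chain;
}
```

    In the 4.0 table a Latin-script Inuktitut lookup passes through iu-Latn before reaching iu, so it stays inside its script for one more step. The Norwegian chain just gains an extra stop at nb.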

    In fact, it really looks like a bit of technical over-reaching (an overarching change being over-applied).

    Clearly intentional, and therefore by design, I guess.

    Though given how traumatic it is to find out that whoever you thought your parent was turns out not to be, perhaps these edge cases should not have been shifted so quickly.

    Perhaps not such a good application of the design?

  • Sorting it all Out

    To True Boldly Go Where No Font...(yada yada yada)

    One of the interesting features discussed in articles like Script and Font Support in Windows and What’s New for International Customers in Windows 7 is True Bold support.

    For example, that second link even includes an example, which admittedly might have been picked to be more striking:

    For instance, for bold texts the true bold font provides better text display than the simulated bold.

    This is an example of text display in simulated bold versus true bold:

     

    But it doesn't spend a lot of real time explaining why it's a good thing, does it?

    Let's start by comparing it to the non-bold font, with the same words (here in Windows 7 vs. Vista):

    Can you see the differences? Aren't they a bit more striking, perhaps even with a great insight as to how it happened, when you can see the base font that was being thickened? :-)

    The basic problem comes in the algorithmic "thickening" done -- whether by GDI, GDI+, WPF, WPF/DWrite, DWrite, or Silverlight.

    As good as it may or may not be, it can never be as good as when an actual typography expert is doing the work explicitly.

    Though I may not be the only one who wonders why OpenType doesn't define features that act as "hints" for algorithmic Bold or Italics. That too would let typographers support a better story than the lowest level algorithmic bolding supported now!

    This idea is not entirely new -- this is why so many fonts support regular, bold, italic, and bold-italic versions!

    Of course some of the most well-known fonts do not -- like Microsoft Sans Serif:

    Now the lack of true bolds and true italics for this font has a direct impact on fine typography when you use italic or bold.

    But when you look at core fonts like Microsoft Sans Serif that always support large parts of the core scripts of the latest version of Unicode, the actual cost of creating bold, italic, and bold-italic versions can start to get bigger and bigger.

    And the heavy push toward using Segoe UI in Windows, rather than the UI fonts of prior versions (Tahoma and Microsoft Sans Serif), makes a lot of sense in terms of investment.

    Just as providing true bolds makes a lot more sense for all of these scripts like the Indics that aren't in the core fonts -- since they are the fonts the system goes to when you use the core fonts and ask for these other scripts.

    Though the story doesn't end there....

    Let's talk about Segoe UI, a font that was born with bold, italic, and bold-italic variants.

    And also more:

    Okay, so in addition to the gang of four that we expect in fonts, they added two more weights:

    • Segoe UI Light
    • Segoe UI Semibold

    And they didn't just pick up these names in a thesaurus; these weights have been around for a while, as this table from LOGFONT->lfWeight "hints" at:

    Value           Weight
    FW_DONTCARE     0
    FW_THIN         100
    FW_EXTRALIGHT   200
    FW_ULTRALIGHT   200
    FW_LIGHT        300
    FW_NORMAL       400
    FW_REGULAR      400
    FW_MEDIUM       500
    FW_SEMIBOLD     600
    FW_DEMIBOLD     600
    FW_BOLD         700
    FW_EXTRABOLD    800
    FW_ULTRABOLD    800
    FW_HEAVY        900
    FW_BLACK        900

    Though of course by not providing hinted versions of these two additional fonts (as people complain about in the comments of blogs like this one), it makes them less useful overall:

    I do not believe that the Segoe UI Light and Segoe UI Semibold weights have gone through the same rigorous hinting procedures as the original Segoe UI styles.

    For example, it looks like counter control is not being used on the lower case letters b, d, p, and q. Look how their rounded overshoots and counter sizes are larger than they should be at normal "UI" font ppems. The numeral 1 in the Light weight and capital I in the Semibold weight also tends to protrude beyond the cap height at typical "UI" font ppems. I believe this may be because they are modifications of the outlines used in the Segoe marketing fonts to better match the style of the UI fonts.

    The Light and Semibold weights also do not provide the same script and language support as the Regular and Bold weights. They have fewer glyphs overall (comparable to the original italic styles, which have no need for Arabic scripts, however). The OpenType feature sets are also divergent.

    Developers should not presume to use these new alternative weights interchangeably with the originals. Depending on the users' display settings and locales, the users may experience a less than optimal result.

    My guess is that these fonts were included mainly to render the text "Windows 7" in the fashion of "(semibold)Windows (light)7", to provide a consistent look with branding assets.

    Perhaps these weights will receive more attention (and especially delta hinting for typical UI ppems) before Windows 7 ships.

    Spoiler: they didn't. :-)

    And it goes without saying that also not providing either a Segoe UI Light Italic or a Segoe UI Semibold Italic limits their usefulness further -- since now they will just be using algorithmic italics (which can at times be even more destructive of text than algorithmic bold).

    So it is funny how all three axes:

    • true bold/true italic
    • additional weights
    • OpenType hinting

    are all done in the name of fine typography, yet there is so little overlap between the three of them....

    There are also other uses for true italics, like the language specific issues I have discussed in blogs like You say ĭtalics, I say ītalics. It is much more complicated in Cyrillic.... for language-specific Cyrillic script issues and When the font is the boss of you for taking care of some Japanese-specific issues.

    For the latter case, note that even people who object to the modified behavior of "ignoring" the italics request just suggested providing oblique support via other means -- no one who understands the language objects to the notion of using True Italics to provide an experience superior to Algorithmic Italics.

    Now as typography in Windows and within other Microsoft products improves, the cost and even the time issues can obviously keep every single improvement from happening to every single font.

    Though I do hope the grid gets to be filled in a little more completely, and with a little bit more overlap between features, than we see currently.

    Don't you? :-)

    And as a final point, and not for nothing, I'll point you at On why it's a bad thing to choose font information by name only and point out that more than ever you don't want to set fonts by name -- it leaves you in a really really bad place if you do (unless you don't mind your font selections being ignored, I mean!)....

  • Sorting it all Out

    There's no "I" in IDN, part 3: There's no "I" in DIY, either!

    Prior blogs in this series:

    As I started deciding what I was going to write in this third part, I thought back to the initial guidance.

    And how both initial parts strongly encouraged people to call the appropriate functions, since calling the wrong ones is never such a great idea.

    But looking at the current landscape, for all intents and purposes the only Microsoft application that works properly here (most of the time) is Internet Explorer. And even it gets the answer wrong some of the time (a fact only discovered a few months ago, even though the bug has apparently existed since Internet Explorer 7's RTW -- some specific scenario where IE takes a URL already in UTF-8 and converts it to UTF-8 again, with the expected cottage-cheese-inducing results).
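    For anyone who has not seen what a second UTF-8 pass does to text, here is a small sketch. It assumes the second pass treats each incoming byte as a Latin-1 character and encodes it again; I am not claiming that is exactly what IE does internally, and `EncodeBytesAsUtf8` is a hypothetical helper, not a Windows API.

```cpp
#include <cassert>
#include <string>

// Treats every byte of `bytes` as a Latin-1 code point (U+0000..U+00FF)
// and encodes it as UTF-8. Feeding it text that is ALREADY UTF-8 produces
// the classic double-encoded garbage.
std::string EncodeBytesAsUtf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));                 // ASCII stays one byte
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));   // lead byte
            out.push_back(static_cast<char>(0x80 | (b & 0x3F))); // trail byte
        }
    }
    return out;
}
```

    The letter é is C3 A9 in UTF-8; pushed through this a second time it becomes C3 83 C2 A9, which is the cottage cheese in question. Pure ASCII passes through untouched.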

    For all applications, there are several different possibilities for any Internationalized Domain Name access:

    • INTERNET (everything is registered as Punycode)
      • If you convert the URL to Punycode
        • Everything works!
      • If you don't convert the URL to Punycode
        • Everything fails!
    • INTRANET (everything is [usually] registered as UTF-8)
      • If you're calling the right functions
        • Everything works!
      • If you're not calling the right functions
        • Everything fails!
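    Written as code, the tree above is about as small as it looks. This is only a sketch under the assumptions in the list (Internet names registered as Punycode, intranet names usually as UTF-8); `Zone`, `HostForm`, `IsAscii`, and `ChooseHostForm` are hypothetical names, the actual Punycode conversion is not shown, and how a caller decides which zone applies is left entirely outside it.

```cpp
#include <cassert>
#include <string>

enum class Zone { Internet, Intranet };
enum class HostForm { Unchanged, Punycode, Utf8 };

// True when every byte of the host name is plain ASCII.
bool IsAscii(const std::string& host) {
    for (unsigned char c : host) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Picks the form a host name should be sent in, following the tree above.
HostForm ChooseHostForm(Zone zone, const std::string& hostUtf8) {
    if (IsAscii(hostUtf8)) return HostForm::Unchanged;   // nothing to convert
    return zone == Zone::Internet ? HostForm::Punycode   // registered as Punycode
                                  : HostForm::Utf8;      // usually registered as UTF-8
}
```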

    Now this little tree of options looks awfully simple, here on this blog.

    But unfortunately it isn't simple at all -- because there is a real lack of a standard method to distinguish Internet from Intranet at the application level.

    And IE is a messy clusterf*** of meaning trying to make sense of it, as KB articles like 2028170: Enabling "Automatically Detect Intranet Network" on a domain member computer will enable all the three Intranet Options automatically hint at:

    By default in Internet Explorer 7 and Internet Explorer 8, "Automatically Detect Intranet Network" will be enabled to automatically control the following three options of Intranet detection: 

    Include all local (intranet) sites not listed in other zones

    Include all sites that bypass the proxy server

    Include all network paths (UNCs)
    If "Automatically Detect Intranet Network" is enabled on a computer that is a member of a domain, all three of these options will be enabled regardless of what is configured through domain policy.

    It is marked as by design, and the resolution section gives complex instructions explaining how to override these settings through administrative templates.

    Not for the meek!

    Of course the engineer immediately jumps to an obvious question: why not require Punycode names to be assigned as aliases for even the INTRANET scenario?

    Such a requirement would have made everything easier, for everybody!

    But you have to consider the original reason for Punycode: to support IDN without requiring the existing infrastructure to change.

    Well, the Intranet has been working just fine on UTF-8 and in some cases even with legacy code pages, and trying to force a change in all those cases would break a lot of people.

    Though I still believe publishing it as a recommendation would really have made life easier for a lot of people, it was not to be. In the current situation such INTRANET Punycode server names represent an easy way to test that the Punycode scenario works for anyone smart enough to convert to it for later INTERNET testing. Given the role that proxy servers play, testing resolution within an organization would be quite difficult without using Punycode resolution within the INTRANET.

    For now, if you use IE's zone-based support (or one of the few others like that inherent in AD/Exchange), you too will work as well as they do.

    Or you can DIY (Do It Yourself), which can be pretty complicated, depending on your scenario.

    There are many who feel that this is not good enough in the long run, that Windows and .NET both need to try to make this easier. So people look to the future with interest.

    Some of those people are doing nothing until they see some signs of future improvements, while others are doing the things like those from Part 1 and Part 2 of this series, and keeping those endpoints in discrete bits of their code so that if they eventually still need to add a DIY piece they can add it, but just in case the system underneath comes up with the right answers everything might just work.

    I think that latter group will be in the best position to later either do nothing or in the worst case do a small amount of work.

    Realistically, few will do their own work to do what IE/AD/et al. are doing here. There really is no "I" in DIY, since it will take a lot of work and definitely implies a "we" with a lot of developers and testers!

  • Sorting it all Out

    The Locales of Windows 7, all divvied up

    Someone asked me yesterday if there were lists for:

    1. all the locales for which we did full localizations;
    2. all the locales for which we did partial (Language Interface Pack) localizations;
    3. all the locales supported by Windows that fall in neither of these categories.

    Well, they knew that #1 and #2 were probably somewhere, but they did not know about #3.

    I was a little bored, so I assembled the lists for Windows 7.

    A few caveats:

    • There is no programmatic way to get this info, unless you include reading these three HTML tables into your code;
    • I narrowly skirt geopolitical and SKU/LIP labelling issues by using the "first locales" by LCID and always using the names of the locales;
    • Many of the items in the third table are technically "covered" by other localizations that exist; how good that coverage is in the opinion of customers will of course vary;
    • The fourth table, the various "reserved" locales not supported in Windows found in the [MS-LCID] Protocol doc, is not here (for several reasons, none of which I'm going to get into today).
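    If you did go the "read the tables into your code" route from the first caveat, a row parser might look roughly like the sketch below. It assumes each row arrives as one line of text with a four-digit hex LCID, then the locale name, then everything else; `LocaleRow` and `ParseLocaleRow` are hypothetical names. The display name and native name are left unsplit on purpose, since plain text gives no reliable boundary between them.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>

struct LocaleRow {
    std::uint32_t lcid = 0;
    std::string name;   // e.g. "ar-SA"
    std::string rest;   // display name and native name, left unsplit
};

// Parses one table row; returns false for headers or malformed lines.
bool ParseLocaleRow(const std::string& line, LocaleRow& row) {
    std::istringstream in(line);
    std::string lcidText;
    if (!(in >> lcidText >> row.name)) return false;

    // The LCID column is exactly four hex digits in these tables.
    if (lcidText.size() != 4) return false;
    for (unsigned char c : lcidText) {
        if (!std::isxdigit(c)) return false;
    }
    row.lcid = static_cast<std::uint32_t>(std::stoul(lcidText, nullptr, 16));

    row.rest.clear();
    std::getline(in >> std::ws, row.rest);   // may stay empty if nothing follows
    return true;
}
```

    The header row fails the hex check and is skipped, which is usually what you want.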

    So here are a bunch of lists:

    Table #1: The locales representing languages into which Windows 7 localizes:

    LCID Name Display Name Native Name
    0401 ar-SA Arabic (Saudi Arabia) العربية (المملكة العربية السعودية)
    0402 bg-BG Bulgarian (Bulgaria) български (България)
    0804 zh-CN Chinese (People's Republic of China) 中文(中华人民共和国)
    0404 zh-TW Chinese (Taiwan) 中文(台灣)
    041a hr-HR Croatian (Croatia) hrvatski (Hrvatska)
    0405 cs-CZ Czech (Czech Republic) čeština (Česká republika)
    0406 da-DK Danish (Denmark) dansk (Danmark)
    0413 nl-NL Dutch (Netherlands) Nederlands (Nederland)
    0409 en-US English (United States) English (United States)
    0425 et-EE Estonian (Estonia) eesti (Eesti)
    040b fi-FI Finnish (Finland) suomi (Suomi)
    040c fr-FR French (France) français (France)
    0407 de-DE German (Germany) Deutsch (Deutschland)
    0408 el-GR Greek (Greece) ελληνικά (Ελλάδα)
    040d he-IL Hebrew (Israel) עברית (ישראל)
    040e hu-HU Hungarian (Hungary) magyar (Magyarország)
    0410 it-IT Italian (Italy) italiano (Italia)
    0411 ja-JP Japanese (Japan) 日本語 (日本)
    0412 ko-KR Korean (Korea) 한국어 (대한민국)
    0426 lv-LV Latvian (Latvia) latviešu (Latvija)
    0427 lt-LT Lithuanian (Lithuania) lietuvių (Lietuva)
    0414 nb-NO Norwegian, Bokmål (Norway) norsk, bokmål (Norge)
    0415 pl-PL Polish (Poland) polski (Polska)
    0416 pt-BR Portuguese (Brazil) Português (Brasil)
    0816 pt-PT Portuguese (Portugal) português (Portugal)
    0418 ro-RO Romanian (Romania) română (România)
    0419 ru-RU Russian (Russia) русский (Россия)
    081a sr-Latn-CS Serbian (Latin, Serbia and Montenegro (Former)) srpski (Srbija i Crna Gora (Prethodno))
    041b sk-SK Slovak (Slovakia) slovenčina (Slovenská republika)
    0424 sl-SI Slovenian (Slovenia) slovenski (Slovenija)
    0c0a es-ES Spanish (Spain) español (España)
    041d sv-SE Swedish (Sweden) svenska (Sverige)
    041e th-TH Thai (Thailand) ไทย (ไทย)
    041f tr-TR Turkish (Turkey) Türkçe (Türkiye)
    0422 uk-UA Ukrainian (Ukraine) україньска (Україна)

     

    Table #2: The locales representing languages for which Windows creates Language Interface Packs, aka LIPs:

    LCID Name Display Name Native Name
    0436 af-ZA Afrikaans (South Africa) Afrikaans (Suid Afrika)
    041c sq-AL Albanian (Albania) shqipe (Shqipëria)
    045e am-ET Amharic (Ethiopia) አማርኛ (ኢትዮጵያ)
    042b hy-AM Armenian (Armenia) Հայերեն (Հայաստան)
    044d as-IN Assamese (India) অসমীয়া (ভাৰত)
    042c az-Latn-AZ Azeri (Latin, Azerbaijan) Azərbaycan­ılı (Azərbaycanca)
    042d eu-ES Basque (Basque) euskara (euskara)
    0845 bn-BD Bengali (Bangladesh) বাংলা (বাংলাদেশ)
    0445 bn-IN Bengali (India) বাংলা (ভারত)
    201a bs-Cyrl-BA Bosnian (Cyrillic, Bosnia and Herzegovina) босански (Босна и Херцеговина)
    141a bs-Latn-BA Bosnian (Latin, Bosnia and Herzegovina) bosanski (Bosna i Hercegovina)
    0403 ca-ES Catalan (Catalan) català (català)
    048c prs-AF Dari (Afghanistan) درى (افغانستان)
    0464 fil-PH Filipino (Philippines) Filipino (Pilipinas)
    0456 gl-ES Galician (Galician) galego (galego)
    0437 ka-GE Georgian (Georgia) ქართული (საქართველო)
    0447 gu-IN Gujarati (India) ગુજરાતી (ભારત)
    0468 ha-Latn-NG Hausa (Latin, Nigeria) Hausa (Nigeria)
    0439 hi-IN Hindi (India) हिंदी (भारत)
    040f is-IS Icelandic (Iceland) íslenska (Ísland)
    0470 ig-NG Igbo (Nigeria) Igbo (Nigeria)
    0421 id-ID Indonesian (Indonesia) Bahasa Indonesia (Indonesia)
    085d iu-Latn-CA Inuktitut (Latin, Canada) Inuktitut (kanata)
    083c ga-IE Irish (Ireland) Gaeilge (Éire)
    0434 xh-ZA isiXhosa (South Africa) isiXhosa (uMzantsi Afrika)
    0435 zu-ZA isiZulu (South Africa) isiZulu (iNingizimu Afrika)
    044b kn-IN Kannada (India) ಕನ್ನಡ (ಭಾರತ)
    043f kk-KZ Kazakh (Kazakhstan) Қазақ (Қазақстан)
    0453 km-KH Khmer (Cambodia) ខ្មែរ (កម្ពុជា)
    0441 sw-KE Kiswahili (Kenya) Kiswahili (Kenya)
    0457 kok-IN Konkani (India) कोंकणी (भारत)
    0440 ky-KG Kyrgyz (Kyrgyzstan) Кыргыз (Кыргызстан)
    046e lb-LU Luxembourgish (Luxembourg) Lëtzebuergesch (Luxembourg)
    042f mk-MK Macedonian (Former Yugoslav Republic of Macedonia) македонски јазик (Македонија)
    083e ms-BN Malay (Brunei Darussalam) Bahasa Malaysia (Brunei Darussalam)
    043e ms-MY Malay (Malaysia) Bahasa Malaysia (Malaysia)
    044c ml-IN Malayalam (India) മലയാളം (ഭാരതം)
    043a mt-MT Maltese (Malta) Malti (Malta)
    0481 mi-NZ Maori (New Zealand) Reo Māori (Aotearoa)
    044e mr-IN Marathi (India) मराठी (भारत)
    0450 mn-MN Mongolian (Cyrillic, Mongolia) Монгол хэл (Монгол улс)
    0461 ne-NP Nepali (Nepal) नेपाली (नेपाल)
    0814 nn-NO Norwegian, Nynorsk (Norway) norsk, nynorsk (Noreg)
    0448 or-IN Oriya (India) ଓଡ଼ିଆ (ଭାରତ)
    0429 fa-IR Persian (Iran) فارسى (ايران)
    0446 pa-IN Punjabi (India) ਪੰਜਾਬੀ (ਭਾਰਤ)
    046b quz-BO Quechua (Bolivia) runasimi (Bolivia Suyu)
    0c1a sr-Cyrl-CS Serbian (Cyrillic, Serbia and Montenegro (Former)) српски (Србија и Црна Гора (Претходно))
    046c nso-ZA Sesotho sa Leboa (South Africa) Sesotho sa Leboa (Afrika Borwa)
    0432 tn-ZA Setswana (South Africa) Setswana (Aforika Borwa)
    045b si-LK Sinhala (Sri Lanka) සිංහ (ශ්‍රී ලංකා)
    0449 ta-IN Tamil (India) தமிழ் (இந்தியா)
    0444 tt-RU Tatar (Russia) Татар (Россия)
    044a te-IN Telugu (India) తెలుగు (భారత దేశం)
    0442 tk-TM Turkmen (Turkmenistan) türkmençe (Türkmenistan)
    0420 ur-PK Urdu (Islamic Republic of Pakistan) اُردو (پاکستان)
    0443 uz-Latn-UZ Uzbek (Latin, Uzbekistan) U'zbek (U'zbekiston Respublikasi)
    042a vi-VN Vietnamese (Vietnam) Tiếng Việt (Việt Nam)
    046a yo-NG Yoruba (Nigeria) Yoruba (Nigeria)

     

    Table #3: Locales whose identifiers are not directly associated with any localizations of Windows, even if a related identifier might make for one representing a suitable localization:

    0484 gsw-FR Alsatian (France) Elsässisch (Frànkrisch)
    1401 ar-DZ Arabic (Algeria) العربية (الجزائر)
    3c01 ar-BH Arabic (Bahrain) العربية (البحرين)
    0c01 ar-EG Arabic (Egypt) العربية (مصر)
    0801 ar-IQ Arabic (Iraq) العربية (العراق)
    2c01 ar-JO Arabic (Jordan) العربية (الأردن)
    3401 ar-KW Arabic (Kuwait) العربية (الكويت)
    3001 ar-LB Arabic (Lebanon) العربية (لبنان)
    1001 ar-LY Arabic (Libya) العربية (ليبيا)
    1801 ar-MA Arabic (Morocco) العربية (المملكة المغربية)
    2001 ar-OM Arabic (Oman) العربية (عمان)
    4001 ar-QA Arabic (Qatar) العربية (قطر)
    2801 ar-SY Arabic (Syria) العربية (سوريا)
    1c01 ar-TN Arabic (Tunisia) العربية (تونس)
    3801 ar-AE Arabic (U.A.E.) العربية (الإمارات العربية المتحدة)
    2401 ar-YE Arabic (Yemen) العربية (اليمن)
    082c az-Cyrl-AZ Azeri (Cyrillic, Azerbaijan) Азәрбајҹан (Азәрбајҹан)
    046d ba-RU Bashkir (Russia) Башҡорт (Россия)
    0423 be-BY Belarusian (Belarus) Беларускі (Беларусь)
    047e br-FR Breton (France) brezhoneg (Frañs)
    0c04 zh-HK Chinese (Hong Kong S.A.R.) 中文(香港特别行政區)
    1404 zh-MO Chinese (Macao S.A.R.) 中文(澳門特别行政區)
    1004 zh-SG Chinese (Singapore) 中文(新加坡)
    0483 co-FR Corsican (France) Corsu (France)
    101a hr-BA Croatian (Latin, Bosnia and Herzegovina) hrvatski (Bosna i Hercegovina)
    0465 dv-MV Divehi (Maldives) ދިވެހިބަސް (ދިވެހި ރާއްޖެ)
    0813 nl-BE Dutch (Belgium) Nederlands (België)
    0c09 en-AU English (Australia) English (Australia)
    2809 en-BZ English (Belize) English (Belize)
    1009 en-CA English (Canada) English (Canada)
    2409 en-029 English (Caribbean) English (Caribbean)
    4009 en-IN English (India) English (India)
    1809 en-IE English (Ireland) English (Eire)
    2009 en-JM English (Jamaica) English (Jamaica)
    4409 en-MY English (Malaysia) English (Malaysia)
    1409 en-NZ English (New Zealand) English (New Zealand)
    3409 en-PH English (Republic of the Philippines) English (Philippines)
    4809 en-SG English (Singapore) English (Singapore)
    1c09 en-ZA English (South Africa) English (South Africa)
    2c09 en-TT English (Trinidad and Tobago) English (Trinidad y Tobago)
    0809 en-GB English (United Kingdom) English (United Kingdom)
    3009 en-ZW English (Zimbabwe) English (Zimbabwe)
    0438 fo-FO Faroese (Faroe Islands) føroyskt (Føroyar)
    080c fr-BE French (Belgium) français (Belgique)
    0c0c fr-CA French (Canada) français (Canada)
    140c fr-LU French (Luxembourg) français (Luxembourg)
    180c fr-MC French (Principality of Monaco) français (Principauté de Monaco)
    100c fr-CH French (Switzerland) français (Suisse)
    0462 fy-NL Frisian (Netherlands) Frysk (Nederlân)
    0c07 de-AT German (Austria) Deutsch (Österreich)
    1407 de-LI German (Liechtenstein) Deutsch (Liechtenstein)
    1007 de-LU German (Luxembourg) Deutsch (Luxemburg)
    0807 de-CH German (Switzerland) Deutsch (Schweiz)
    046f kl-GL Greenlandic (Greenland) kalaallisut (Kalaallit Nunaat)
    045d iu-Cans-CA Inuktitut (Syllabics, Canada) ᐃᓄᒃᑎᑐᑦ (ᑲᓇᑕ)
    0810 it-CH Italian (Switzerland) italiano (Svizzera)
    0486 qut-GT K'iche (Guatemala) K'iche (Guatemala)
    0487 rw-RW Kinyarwanda (Rwanda) Kinyarwanda (Rwanda)
    0454 lo-LA Lao (Lao P.D.R.) ລາວ (ສ.ປ.ປ. ລາວ)
    082e dsb-DE Lower Sorbian (Germany) dolnoserbšćina (Nimska)
    047a arn-CL Mapudungun (Chile) Mapudungun (Chile)
    047c moh-CA Mohawk (Mohawk) Kanien'kéha (Canada)
    0850 mn-Mong-CN Mongolian (Traditional Mongolian, PRC) ᠮᠤᠨᠭᠭᠤᠯ ᠬᠡᠯᠡ (ᠪᠦᠭᠦᠳᠡ ᠨᠠᠢᠷᠠᠮᠳᠠᠬᠤ ᠳᠤᠮᠳᠠᠳᠤ ᠠᠷᠠᠳ ᠣᠯᠣᠰ)
    0482 oc-FR Occitan (France) Occitan (França)
    0463 ps-AF Pashto (Afghanistan) پښتو (افغانستان)
    086b quz-EC Quechua (Ecuador) runasimi (Ecuador Suyu)
    0c6b quz-PE Quechua (Peru) runasimi (Peru Suyu)
    0417 rm-CH Romansh (Switzerland) Rumantsch (Svizra)
    243b smn-FI Sami, Inari (Finland) sämikielâ (Suomâ)
    103b smj-NO Sami, Lule (Norway) julevusámegiella (Vuodna)
    143b smj-SE Sami, Lule (Sweden) julevusámegiella (Svierik)
    0c3b se-FI Sami, Northern (Finland) davvisámegiella (Suopma)
    043b se-NO Sami, Northern (Norway) davvisámegiella (Norga)
    083b se-SE Sami, Northern (Sweden) davvisámegiella (Ruoŧŧa)
    203b sms-FI Sami, Skolt (Finland) sääm´ǩiõll (Lää´ddjânnam)
    183b sma-NO Sami, Southern (Norway) åarjelsaemiengiele (Nöörje)
    1c3b sma-SE Sami, Southern (Sweden) åarjelsaemiengiele (Sveerje)
    044f sa-IN Sanskrit (India) संस्कृत (भारतम्)
    0491 gd-GB Scottish Gaelic (United Kingdom) Gàidhlig (an Rìoghachd Aonaichte)
    1c1a sr-Cyrl-BA Serbian (Cyrillic, Bosnia and Herzegovina) српски (Босна и Херцеговина)
    301a sr-Cyrl-ME Serbian (Cyrillic, Montenegro) српски (Црна Гора)
    281a sr-Cyrl-RS Serbian (Cyrillic, Serbia) српски (Србија)
    181a sr-Latn-BA Serbian (Latin, Bosnia and Herzegovina) srpski (Bosna i Hercegovina)
    2c1a sr-Latn-ME Serbian (Latin, Montenegro) srpski (Crna Gora)
    241a sr-Latn-RS Serbian (Latin, Serbia) srpski (Srbija)
    2c0a es-AR Spanish (Argentina) Español (Argentina)
    400a es-BO Spanish (Bolivia) Español (Bolivia)
    340a es-CL Spanish (Chile) Español (Chile)
    240a es-CO Spanish (Colombia) Español (Colombia)
    140a es-CR Spanish (Costa Rica) Español (Costa Rica)
    1c0a es-DO Spanish (Dominican Republic) Español (República Dominicana)
    300a es-EC Spanish (Ecuador) Español (Ecuador)
    440a es-SV Spanish (El Salvador) Español (El Salvador)
    100a es-GT Spanish (Guatemala) Español (Guatemala)
    480a es-HN Spanish (Honduras) Español (Honduras)
    080a es-MX Spanish (Mexico) Español (México)
    4c0a es-NI Spanish (Nicaragua) Español (Nicaragua)
    180a es-PA Spanish (Panama) Español (Panamá)
    3c0a es-PY Spanish (Paraguay) Español (Paraguay)
    280a es-PE Spanish (Peru) Español (Perú)
    500a es-PR Spanish (Puerto Rico) Español (Puerto Rico)
    540a es-US Spanish (United States) Español (Estados Unidos)
    380a es-UY Spanish (Uruguay) Español (Uruguay)
    200a es-VE Spanish (Venezuela) Español (Republica Bolivariana de Venezuela)
    081d sv-FI Swedish (Finland) svenska (Finland)
    045a syr-SY Syriac (Syria) ܣܘܪܝܝܐ (سوريا)
    0428 tg-Cyrl-TJ Tajik (Cyrillic, Tajikistan) Тоҷикӣ (Тоҷикистон)
    085f tzm-Latn-DZ Tamazight (Latin, Algeria) Tamazight (Djazaïr)
    0451 bo-CN Tibetan (PRC) བོད་ཡིག (ཀྲུང་ཧྭ་མི་དམངས་སྤྱི་མཐུན་རྒྱལ་ཁབ།)
    042e hsb-DE Upper Sorbian (Germany) hornjoserbšćina (Němska)
    0480 ug-CN Uyghur (PRC) ئۇيغۇرچە (جۇڭخۇا خەلق جۇمھۇرىيىتى)
    0843 uz-Cyrl-UZ Uzbek (Cyrillic, Uzbekistan) Ўзбек (Ўзбекистон)
    0452 cy-GB Welsh (United Kingdom) Cymraeg (y Deyrnas Unedig)
    0488 wo-SN Wolof (Senegal) Wolof (Sénégal)
    0485 sah-RU Yakut (Russia) саха (Россия)
    0478 ii-CN Yi (PRC) ꆈꌠꁱꂷ (ꍏꉸꏓꂱꇭꉼꇩ)
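    The hexadecimal identifiers in these tables are not arbitrary. As a rough sketch (the helper name `split_lcid` is made up for illustration, and it assumes the usual Windows layout of a language identifier: primary language in the low 10 bits, sublanguage in the 6 bits above that), you can see why so many of the rows above count as "related" identifiers:

```python
def split_lcid(lcid):
    """Split a 16-bit language identifier into (primary language, sublanguage)."""
    primary = lcid & 0x3FF          # low 10 bits: the language itself
    sublang = (lcid >> 10) & 0x3F   # next 6 bits: the regional/script variant
    return primary, sublang

# A few identifiers copied from the tables above.
samples = {
    "ar-DZ": 0x1401,
    "ar-EG": 0x0C01,
    "ar-IQ": 0x0801,
    "bs-Cyrl-BA": 0x201A,
    "sr-Cyrl-BA": 0x1C1A,
}

for name, lcid in samples.items():
    primary, sublang = split_lcid(lcid)
    print(f"{name}: primary=0x{primary:02x} sublang=0x{sublang:02x}")
```

    All of the Arabic rows share one primary language value and differ only in the sublanguage, and the Bosnian and Serbian rows share a primary value as well.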

     

    Enjoy!

  • Sorting it all Out

    Philosophical about enablement, the impossible, and the infeasible

    Feeling kind of philosophical....

    So the other day, I was authoring a keyboard.

    Wait, that isn't quite what I was doing.

    I was doing some keyboard authoring work, to enhance a keyboard someone else had authored.

    Good, that's closer to what it was.

    This was not random meandering around with vague, non-enunciated purpose, mind you.

    There was some definite purpose to the enhancements.

    Mostly along the lines of the fuzzy corners of the stuff you can't do in MSKLC.

    Stuff related to features I covered in Chain Chain Chain, Chain of Dead Keys and working around interesting limitations not unlike the ones I mentioned in [Pretty much] All the things you can't do with SGCAPS, and why (more on this another day).

    It is funny how all of this came about, actually.

    I was essentially working on keyboards the same way Cathy did way over a decade ago -- editing text files by hand. Because the architecture supports some things the tool does not.

    I mean, there is a functionality, one that has been around for just shy of two decades. One whose support is summed up by 1.02 header files spread across the SDK and the WDK (mostly the WDK). 

    And yes, I just calculated that 0.02 part.

    It is functionality that nearly everyone makes use of in their applications, yet almost no one ever tried to create with it in any way other than what Microsoft provided (despite the admittedly understated presence of samples).

    When no one (outside of a few people in Windows) uses a piece of functionality to do such work, the potential feature set beyond what is actually used becomes somewhat irrelevant.

    Everything is "impossible" to do.

    Then a tool (MSKLC) gets created that makes a specific subset of features available.

    And it is only features the tool doesn't support that become "impossible".

    Which is all a little silly (from my point of view, at least!) since the features in question weren't ever impossible. Perhaps in both cases they just weren't feasible for most people to do (mostly since we never told anyone how to do them!).

    So I was sitting at a computer doing the impossible, by which I mean infeasible, defined by an arbitrary line I created myself by the nature of the things the tool doesn't do.

    A line extended when a new release of the tool was made (ref: Not the coolest, but).

    And a line fortified when ownership of the tool was transferred to a team that to date has never shipped a new release.

    I was essentially asked the other day by someone about that fortified line:

    I am not at all familiar with MSKLC, but seeing you talk about it so much, I am wondering if anybody ever considered making it open-source, like WiX, MfcMapi or similar projects? Wouldn’t that be the best outcome for everybody?

    It probably would, if you asked me. Which I suppose he did, huh? :-)

    But while CodePlex would be a great solution for not only Microsoft Keyboard Layout Creator but its sister tool Microsoft Locale Builder, CodePlex requires a knowledgeable owner of a project to "sponsor" and "support" it. And there is apparently a dearth of such people able to get buy-in from their respective managements to do said sponsoring/supporting.

    If they were there, I would want to be a contributor to both (probably MSKLC more than MSLB!), which I suppose some would consider to have a fairly defocusing kind of effect on my actual work. They might be correct about that, so I can't fault their logic.

    It certainly wouldn't fit in anyone's current mandates.

    Except some actual customers, and standards bodies, and perhaps governments.

    Thus my feeling all philosophical....

  • Sorting it all Out

    Every character has a story #33: GREEK ANO TELEIA (U+0387)

    Unicode characters are a bit like actual characters, like actual people, in real life.

    Some of them are majestic and are revered.

    Others are fun and whimsical.

    Still others are lookalikes -- perhaps even celebrity lookalikes.

    And there are a few that have perhaps been put through a bad situation, improperly categorized in a way that will affect them for the rest of their lives.

    The scars may not be visible, but they'll always be there.

    ·

    That is U+0387, aka GREEK ANO TELEIA.

    This character went through a recent interesting thread on the "Unicore" list related to its canonical equivalence with another character: U+00B7, aka MIDDLE DOT.

    Let's look at those properties, from UnicodeData.txt:

    00B7;MIDDLE DOT      ;Po;0;ON;    ;;;;N;;;;;

    0387;GREEK ANO TELEIA;Po;0;ON;00B7;;;;N;;;;;
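    The 00B7 sitting in the decomposition field of the second line is the whole story. Here is a small sketch using Python's standard `unicodedata` module (Python is just a convenient way to poke at the Unicode Character Database; the variable names are mine) showing what that one field does to U+0387 under every normalization form:

```python
import unicodedata

ANO_TELEIA = "\u0387"   # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"   # MIDDLE DOT

# The same fields shown in the UnicodeData.txt lines: name, category, decomposition.
for ch in (MIDDLE_DOT, ANO_TELEIA):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        repr(unicodedata.decomposition(ch)),
    )

# A canonical singleton is replaced in every normalization form,
# composed or decomposed, canonical or compatibility.
normalized = {
    form: unicodedata.normalize(form, ANO_TELEIA)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, result in normalized.items():
    print(form, f"U+{ord(result):04X}")
```

    Once text has been through any of the four forms, there is no U+0387 left to find.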

     Sites like this one had some interesting comments though:

    njstram:

    Canonical Equivalence Issues for Greek Punctuation. Some commonly used Greek punctuation marks are encoded in the Greek and Coptic block, but are canonical equivalents to generic punctuation marks encoded in the C0 Controls and Basic Latin block, because they are indistinguishable in shape. Thus, U+037E ";" GREEK QUESTION MARK is canonically equivalent to U+003B ";" SEMICOLON, and U+0387 "·" GREEK ANO TELEIA is canonically equivalent to U+00B7 "·" MIDDLE DOT. In these cases, as for other canonical singletons, the preferred form is the character that the canonical singletons are mapped to, namely U+003B and U+00B7 respectively. Those are the characters that will appear in any normalized form of Unicode text, even when used in Greek text as Greek punctuation. Text segmentation algorithms need to be aware of this issue, as the kinds of text units delimited by a semicolon or a middle dot in Greek text will typically  differ from those in Latin text.

    The character properties for U+00B7 MIDDLE DOT are particularly problematical, in part because of identifier issues for that character. There is no guarantee that all of its properties will align exactly with U+0387 GREEK ANO TELEIA itself, because the latter were established based on the more limited function of the middle dot in Greek as a delimiting punctuation mark.

    John Hudson:

    There are also possible glyph design discrepancies in these canonical equivalences. The Greek ano teleia is properly placed near the top of the non-ascending lowercase letters (the x-height, in Latin type terminology), roughly equivalent to the height of the top dot on the colon. The middle dot is aligned to the optical centre of the x-height, i.e. lower. Also, in all-caps settings, the ano teleia rises to align near the top of the capitals, even further from the height of the middle dot. This was a very poorly considered canonical equivalence.

     

    Anyway, ignoring these issues, in Greek the ANO TELEIA is not used in the middle of a word or term, whereas the MIDDLE DOT often is (right now you may see lots of

    Tech·Ed

    information, for example). Because U+0387 is not the preferred form, Unicode normalization will pick U+00B7, and the distinction is lost unless one keeps track of the "Greek-ness" of the text.

    But in a way both characters suffer here a bit since implementations must often make assumptions that may be invalid for some text.

    At Microsoft, it has some interesting issues:

    • In collation: It was not even in the collation tables at all until Vista (meaning it had no weight and couldn't be found), and once there it was given identical weight to the MIDDLE DOT.
    • In codepages: It does not appear in any code page, even as a best fit character in cp 1253 where it would perhaps have made sense.
    • In fonts: Some of the above comments covered this -- many believe it is not represented exactly as it should be since the two characters usually look the same.
    • In keyboards: The only keyboard it exists on is the Greek Polytonic keyboard, where it has been since Windows 2000.
    • In identifiers: Both characters are now allowed in identifiers, so you would be able to use Tech·Ed in a variable name!
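    The identifier point in the list above is easy to try out. This is only a sketch, and it leans on an assumption: that Python's own identifier rule (which is built on the UAX #31 XID_Start/XID_Continue properties and NFKC-normalizes identifiers) is a fair stand-in for "allowed in identifiers" in general. Other languages and tools may well decide differently.

```python
import unicodedata

with_middle_dot = "Tech\u00b7Ed"   # Tech·Ed written with U+00B7 MIDDLE DOT
with_ano_teleia = "Tech\u0387Ed"   # the same thing written with U+0387 GREEK ANO TELEIA

# Both characters may continue an identifier...
middle_dot_ok = with_middle_dot.isidentifier()
ano_teleia_ok = with_ano_teleia.isidentifier()

# ...but neither may start one.
leading_dot_ok = "\u00b7Ed".isidentifier()

# After normalization the two spellings are the same identifier.
same_after_nfkc = unicodedata.normalize("NFKC", with_ano_teleia) == with_middle_dot

print(middle_dot_ok, ano_teleia_ok, leading_dot_ok, same_after_nfkc)
```

    So the two spellings are not two different variable names; normalization folds them together here too.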

    The most important issue related to the equivalence is that it can't ever be changed. So ten years from now I fully expect someone to send mail asking for a change....

  • Sorting it all Out

    How do I find out what state you're in?

    Over in the Suggestion Box, John asked:

    I'm trying to figure out how to get (and possibly change) the IME state of the active window, like the language bar does.  Assume for the moment that I only need to support TSF.  Is there a reasonable (a.k.a., will probably continue to work in the future) way to accomplish this?

    The only thing I've been able to find in the TSF docs that seems potentially of interest is ITfLangBarMgr::GetThreadLangBarItemMgr.  I'm somewhat concerned that the only documentation MSDN provides for this function is "Should not be used."  I've tried calling it, but it only seems to work if I pass the thread ID of my own thread; otherwise I get "The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)".  Can this function be made to work, is there a better way, or is there no good way to do this?

    Now the Language Bar does its magic in what most reasonable people would consider to be an extreme way -- it puts itself in every thread (well, technically every threadgroup) so that it can make changes from any input thread, if need be.

    Many "internal" Text Services Framework methods like ITfLangBarMgr::GetThreadLangBarItemMgr work differently within one's own thread than outside of it, because they are helper classes and methods designed to be used by the code that lives in every threadgroup to keep the central code behind the language bar up to date.

    Others, such as this one, are actually not internal so much as meant to be used by actual text services like IMEs that need to do things to update the language bar based on their own state. Thus it won't really work very well unless you can manage to be in each threadgroup that might need the information.

    However, this does not mean there is no hope!

    For starters there is the IMM32 way (i.e. ImmGetConversionStatus and ImmSetConversionStatus) that can help here....

    But let's look more to the services TSF provides for Applications, specifically the Language Bar services. Although the examples on that page talking about speech are misleadingly unhelpful, the GUID_COMPARTMENT_* links from that page point to the Predefined Compartments page which has several promising compartments like GUID_COMPARTMENT_KEYBOARD_INPUTMODE_CONVERSION and GUID_COMPARTMENT_KEYBOARD_INPUTMODE_SENTENCE, and these compartments are actually about as useful as they sound and more so. :-)

    At some point I'll have to put together a sample using these, though Eric Brown in this blog points to a sample showing how they can be used from within a text service. I'll just put the idea of an application style sample on my list of things to do....

  • Sorting it all Out

    A difference that makes no difference makes a blog

    One of the most interesting things about digit substitution is the weird cases.

    Like if you look at the relevant fields you get from GetLocaleInfo or GetLocaleInfoEx:

    LOCALE_SNATIVEDIGITS:

    Native equivalents of ASCII 0 through 9. The maximum number of characters allowed for this string is eleven, including a terminating null character. For example, Arabic uses "٠١٢٣٤٥ ٦٧٨٩". See also LOCALE_IDIGITSUBSTITUTION.

    LOCALE_IDIGITSUBSTITUTION:

    Value   Meaning
    0       Context-based substitution. Digits are displayed based on the previous text in the same output. European digits follow Latin scripts, Arabic-Indic digits follow Arabic text, and other national digits follow text written in various other scripts. When there is no preceding text, the locale and the displayed reading order determine digit substitution, as shown in the following table.
            Locale       Reading order    Digits used
            Arabic       Right-to-left    Arabic-Indic
            Thai         Left-to-right    Thai digits
            All others   Any              No substitution used
    1       No substitution used. Full Unicode compatibility.
    2       Native digit substitution. National shapes are displayed according to LOCALE_SNATIVEDIGITS.

    So basically LOCALE_SNATIVEDIGITS can be some native set of digits.

    And LOCALE_IDIGITSUBSTITUTION decides whether to always use 0123456789 (which happens when the value is 1), to always be LOCALE_SNATIVEDIGITS (which happens when the value is 2), or to sometimes be one and sometimes another (which happens when the value is 0, for some locales).

    Of course these settings fall down any time LOCALE_SNATIVEDIGITS is "0123456789" and LOCALE_IDIGITSUBSTITUTION is 0 or 2 -- since they basically ask the system to replace 0123456789 with 0123456789.
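    To make the interaction concrete, here is a toy model in Python. It is emphatically not how Windows implements any of this: `is_noop` and `substitute` are hypothetical helpers, the two arguments merely stand in for the LOCALE_SNATIVEDIGITS and LOCALE_IDIGITSUBSTITUTION values, and the context-based case (0) is deliberately left unmodelled except where it is trivially a no-op.

```python
ASCII_DIGITS = "0123456789"

def is_noop(native_digits, substitution):
    """True when the settings ask to replace 0123456789 with 0123456789."""
    return substitution in (0, 2) and native_digits == ASCII_DIGITS

def substitute(text, native_digits, substitution):
    """Toy digit substitution for the 'never' (1) and 'always native' (2) cases."""
    if len(native_digits) != 10:
        raise ValueError("native_digits must contain exactly ten characters")
    if substitution == 1 or is_noop(native_digits, substitution):
        return text                      # nothing to do, so do nothing
    if substitution == 2:
        return text.translate(str.maketrans(ASCII_DIGITS, native_digits))
    # Value 0 depends on surrounding text and reading order; not modelled here.
    raise NotImplementedError("context-based substitution is not modelled")

# The Arabic-Indic digits U+0660..U+0669, built from code points.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))

print(substitute("May 2011", ARABIC_INDIC, 2))   # digits replaced
print(substitute("May 2011", ASCII_DIGITS, 2))   # the no-op case
```

    The early return in `substitute` is the interesting bit: when the native digits are already 0 through 9, there is no work worth doing.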

    Oops.

    Now of course you can set it this way yourself in Regional and Language Options.

    And every version of Windows has shipped with data like this in anywhere from 2 to 6 locales.

    Here's some managed code that builds a list:


    using System;
    using System.Text;
    using System.Globalization;
    using System.Runtime.InteropServices;

    public class Test {
        public static void Main() {
            foreach(CultureInfo ci in CultureInfo.GetCultures(CultureTypes.SpecificCultures)) {
                // Ask for the digit substitution setting as a number rather than a string.
                uint uDS;
                if(GetLocaleInfoW((uint)ci.LCID, LOCALE_IDIGITSUBSTITUTION | LOCALE_RETURN_NUMBER, out uDS, 4) == 0) {
                    continue;   // the call failed for this locale, so skip it
                }
                // Only 0 (context-based) and 2 (native) ever ask for substitution.
                if(uDS==0 || uDS==2) {
                    // Ten digits plus the terminating null.
                    StringBuilder sb = new StringBuilder(11);
                    GetLocaleInfoW((uint)ci.LCID, LOCALE_SNATIVEDIGITS, sb, 11);
                    // Native digits that are already 0-9 make the substitution a no-op.
                    if(sb.ToString().Equals("0123456789")) {
                        Console.WriteLine("{0}\tIDIGITSUBSTITUTION=={1}\t SNATIVEDIGITS=={2}", ci.Name, uDS, sb.ToString());
                    }
                }
            }
        }

        const uint LOCALE_RETURN_NUMBER = 0x20000000;
        const uint LOCALE_IDIGITSUBSTITUTION = 0x1014;
        const uint LOCALE_SNATIVEDIGITS = 0x13;

        // Overload that retrieves string data.
        [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true, CallingConvention=CallingConvention.StdCall)]
        private static extern int GetLocaleInfoW(uint Locale, uint LCType, StringBuilder lpLCData, int cchData);

        // Overload that retrieves numeric data (used with LOCALE_RETURN_NUMBER).
        [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true, CallingConvention=CallingConvention.StdCall)]
        private static extern int GetLocaleInfoW(uint Locale, uint LCType, out uint lpLCData, int cchData);
    }


    You can run this on various versions of Windows.

    Like in XP SP2, the list is ky-KG, mn-MN, ar-LY, ar-DZ, ar-MA, and ar-TN.

    Or in Windows Server 2008, where the improved list is ky-KG and mn-MN.

    Or in Windows 7, where after a small backslide the list is en-US, ar-LY, ar-DZ, ar-MA, and ar-TN.

    Ultimately there are two problems here -- one to do with theoretical data purity (the data just seems wrong), and the other to do with data performance (asking the system to do processing that isn't necessary can have performance impact).

    Though in practice, since a user can set it the same way in Regional and Language Options, I'd rather that the system just determined when the operation would be a no-op (this scenario) and just stopped processing. Since then everyone will benefit, including any user with the wrong settings, any custom locale with the wrong settings, and any future data with the wrong settings (the latter is a reasonable supposition since every version has been wrong in at least a few cases so far!).

    Even if this optimization is not happening (it may be!) and even if it never happens, the "wrong" data doesn't lead to wrong results.

    Thus my conclusion:

    You see, as any cat will tell you, curiosity never killed anything other than a few hours.

    And a difference that makes no difference? It makes a blog.

  • Sorting it all Out

    You can choose to *not* impose a "Non-English Tax" on your feature. Or you can be a jerk. Whichever....

    This entire blog you are reading today is going to be tinged with irony.

    You see, I am writing it in English.

    Because English is the language I know best, even as I try to learn other languages.

    If things like proper grammar and spelling and such count against English knowledge, then some might argue I don't even know English!

    But anyway, it all relates to a comment somebody had the other day in one of those perpetual threads on the topic of whether to localize exception text:

    Exception text shouldn’t be localized, nor should it be user-friendly.

    The text of exceptions should always be in one language (most likely English, but generally the spoken language the main development team uses). This allows developers who use your API to search the internet for that text. It doesn’t help if a French developer hits the exception and gets his own translated message.

    The exception text should be descriptive enough to help developers figure out the issue, but not too descriptive to give away details of your system. Exception text should never be displayed to the user (unless you are writing an API framework and your user is the developer who uses it).

    This comment highlights the two assumptions I hate most about the Redmond-based development at Microsoft, especially lately in managed code with their de facto jargon-adding exception names:

    1. The assumption that a readable and understandable exception message is a bad thing, and
    2. The assumption that the "search the Internet" scenario is a valid reason to only ever use English for exceptions

    People who espouse this complete and utter crap fail to understand that these assumptions not only help to perpetuate each other (e.g. the reason people most commonly search the Internet is that they can't understand the English even if they are native speakers!), but are also just not good development practice. Just like an iPhone application that pinched people in the hand would not make it easier to use an iPhone.

    Note to developers: if they are searching the Internet for your exception text, it is probably because THEY DIDN'T UNDERSTAND IT, regardless of the language it is in.

    Now the world is full of developers, and lots of them were not born speaking English. Many of them learn English (some more successfully than others, especially with the bizarre dialect of English being used here!), but the many people who struggle to learn English so they can learn another language -- like C# -- can find their job made a lot easier with a good localization effort so the info can be in French or Japanese or whatever.

    And the world of people who started in English or learned to speak it are not going to lose the ability to get information in English -- and they are not qualified to judge those who have not yet, or cannot. So they should just frigging relax and support the effort to get more people involved.

    These English speaking developers who create things in English only that most people can't understand even if they are fluent English speakers should momentarily consider how bad they are at communicating to their audience - and how easily a small change in attitude would help others, while not hurting anyone else.

    Speaking English is great - in some countries it is a required first step to prosperity. But the reason for that is the number of places that do the same thing these developers are doing.

    In the end, there is no harm to making anything more widely available and useful to people.

    But at the same time, there is no shame in allowing people who don't speak English to understand and benefit from your efforts, and frankly there is more than a mere skosh of shame in requiring it of them, you know?

    So perhaps if you are a developer who uses the .NET Framework that made so many issues in computer programming easier to deal with? You just might choose to not impose an English tax on your work....

    Or you might be a jerk.

    Whichever.

    The biggest irony of this blog is of course that this blog, and this Blog, are all in English, and thus due to my own limitations, I impose an English tax on my readers. But you can do better than me!

  • Sorting it all Out

    ALL OF US should be GAY about being able to follow a STRAIGHT path to who we want to be with

    I was thinking about language this weekend, as I was spending time with several different groups of friends.

    In fact, I was thinking about my friends.

    Some of them are "straight" and some of them are "gay".

    Now I put both terms in quotes because the terms are what I was thinking about.

    Before they were co-opted, they had two specific and non-controversial meanings. Looking them up on Wiktionary:

    • straight -- Not crooked or bent; having a constant direction throughout its length (this definition dates from the 14th century!)
    • gay -- Happy, joyful, lively (listed as a dated definition)

    And now that they have been co-opted to become a part of the issue of sexuality:

    • straight -- Heterosexual (listed as a colloquial definition)
    • gay -- Homosexual

    But of course people can be happy, joyful, or lively no matter what their sexual orientation. Or they can find themselves unhappy, joyless, and lifeless if they are struggling with figuring out their orientation, finding a companion of the appropriate gender, or dealing with any kind of prejudice as they spend time on this our third rock from the sun while they deal with those prejudices and the consequences thereof.

    And although prejudice against people who are homosexual can make their lives more complicated, I want to believe this will not be true forever. So although prejudice in our society might in theory make the path a person travels toward heterosexuality a "straight" line (which, in Euclidean-esque conceptual spaces such as the one we live in, is the shortest distance between two points), I believe the prejudices will be overcome and that won't be the case. And even now while prejudice exists, many will rightly feel that the shortest distance between the two points is always the most honest one -- and thus if one is homosexual then admitting it is the "straightest" line in some sense.

    Is it fear of being homosexual that causes Wiktionary to consider the "non-homosexual" definition to be dated?

    What issue (and whose?) causes Wiktionary to consider the "heterosexual" definition of straight to be colloquial?

    The entire issue of sexual orientation and sexual identity is of course now so twisted up in religion and politics and fear that the fact that it twisted up a few words hardly seems important compared to all of the lives it has twisted up.

    Being both Jewish and handicapped I have had to deal with prejudice in my life at various times, but I lack the context to fully understand what this particular prejudice feels like, other than what others have told me. And I would rather tear my teeth out than refuse to defend a friend of mine.

    But no matter what, from a linguistic standpoint I have to start with the words.

    In my view, the way we have twisted the words up and made them casualties of prejudice is pretty damn 𝐪𝐮𝐞𝐞𝐫¹!

     

    1 - I mean to use the word 𝐪𝐮𝐞𝐞𝐫 here in the 'old-fashioned' sense of weird, odd, different, or slightly unwell. And don't even get me started on the problems with that word, or the unrelated word slut, either!

  • Sorting it all Out

    A digit by any other name can be just as geeky

    Yesterday when I blogged It will take putting NADS out in front to make a difference, there were some unexpected consequences to it.

    As an example, I happened to post a link to the Digit Shapes MSDN topic, and I got back an earful about issues with the topic.

    Like the fact that it perhaps plays fast and loose with terminology.

    Perhaps blaming an MSDN help topic for the problems here is a bit much though -- since the problems pre-date MSDN, Microsoft, computers, and even calculators by quite a piece!

    When we look at the "ASCII digits" 0123456789 we are talking about a digit system that arose in India sometime between 500 B.C.E. and 500 C.E. It went from there to various Arabic mathematicians. Because of that they are usually referred to as Arabic digits, or sometimes even Western Arabic or Arabic-Indic digits.

    There is a choice bit of irony in ever thinking of these as Arabic-Indic digits (as some people do, based on where they came from), since meanwhile the folks who spoke Arabic and who were mathematicians were usually using the Eastern Arabic digits ٠١٢٣٤٥٦٧٨٩, often thought of as Hindi digits or Indian digits, though within Unicode known by their names as the Arabic-Indic digits.

    Confused yet? :-)

    Meanwhile, throughout India there were other digit systems in use that were none of these, like the Devanagari digits ०१२३४५६७८९, in which you can see where some of the shapes of the European digits likely came from.

    So in the Western world we use Arabic digits, while in the Arab world they use Hindi digits, while in India (arguably the birthplace of Hinduism) they use the Devanagari and other digits.
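
    For the code-minded, here is a minimal sketch of how those three digit systems sit in Unicode, whatever people happen to call them. The code points and names are the ones the Unicode Standard assigns (U+0030 DIGIT ZERO, U+0660 ARABIC-INDIC DIGIT ZERO, U+0966 DEVANAGARI DIGIT ZERO); the DigitBlock table and the DigitValue and DigitBlockName helpers are hypothetical names made up for illustration, not anything from the Digit Shapes topic or the Windows API, and the table covers only the three systems mentioned above. It assumes a C++17 compiler.

```cpp
#include <optional>
#include <string_view>

// Hypothetical illustration table: the code point of "zero" for each of the
// three digit systems discussed above, plus the Unicode name prefix for them.
struct DigitBlock {
    char32_t zero;          // code point of the digit zero in this system
    std::string_view name;  // prefix of the Unicode character names
};

constexpr DigitBlock kDigitBlocks[] = {
    {0x0030, "DIGIT"},               // 0123456789 ("ASCII"/"Arabic" digits)
    {0x0660, "ARABIC-INDIC DIGIT"},  // "Eastern Arabic"/"Hindi" digits
    {0x0966, "DEVANAGARI DIGIT"},    // Devanagari digits
};

// Returns 0..9 if cp is a decimal digit in one of the three systems above,
// or std::nullopt for anything else (including digits of other scripts).
constexpr std::optional<int> DigitValue(char32_t cp) {
    for (const DigitBlock& block : kDigitBlocks) {
        if (cp >= block.zero && cp <= block.zero + 9) {
            return static_cast<int>(cp - block.zero);
        }
    }
    return std::nullopt;
}

// Returns the Unicode name prefix for the system cp belongs to, or an empty
// string_view if cp is not one of the digits covered by the table.
constexpr std::string_view DigitBlockName(char32_t cp) {
    for (const DigitBlock& block : kDigitBlocks) {
        if (cp >= block.zero && cp <= block.zero + 9) {
            return block.name;
        }
    }
    return {};
}
```

    All three systems carry the same ten values, so DigitValue maps U+0037, U+0667, and U+096D to the same number; only the shapes (and the arguments about what to call them) differ.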

    The names of many of these various digits obviously tend to be based on their source rather than where they sit now, while the source location kind of moved on elsewhere with their own numbers -- making the names feel almost like anachronisms, of a sort. Well, maybe not anachronisms in the conventional sense but certainly items [mis]named in a way that is akin to an anachronism (a chronological inconsistency) -- maybe a geographical inconsistency?

    Probably it would be easier to simply include the digits the way I do above while putting the various names in quotes so as not to claim any one of them is right since they are all riddled with inconsistencies anyway.

    Now there were other issues pointed out of a more technical nature than technicalities with the names of the digits, but I'll get into those problems another day....
