February, 2011

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    There's no "I" in IDN, part 1: If you're not Unicode, you're just wrong!

    • 17 Comments

    International domain names -- one of those times that we really are all in this together, a time that "I don't have time to fix this" really isn't a good answer.

    I figured I should talk about that for a bit....

    So anyway, the question I got from a rather anxious developer via email the other day was:

    I have a lot of code that depends on functions like getaddrinfo, getnameinfo, gethostbyname, and gethostbyaddr. How do I get them to support internationalized domain names?

    The answer is both simple and complicated.

    Complicated because the answer could (in theory) be very different depending on whether the server is on the intranet (where one would use UTF-8) or the Internet (where one would use Punycode).
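    To make the UTF-8 versus Punycode distinction concrete, here is a minimal sketch in Python (used purely for illustration; the hostname is made up). It relies on Python's built-in "idna" codec, which follows the older IDNA 2003 rules, so treat it as a rough picture of the two forms rather than what any particular resolver sends.

```python
# Hypothetical hostname, chosen only for illustration.
hostname = "b\u00fccher.example"  # "bücher.example"

# Intranet-style form: the name as raw UTF-8 bytes.
utf8_form = hostname.encode("utf-8")

# Internet-style form: each non-ASCII label becomes a Punycode "xn--" label.
# Python's built-in "idna" codec implements IDNA 2003.
punycode_form = hostname.encode("idna")

print(utf8_form)      # b'b\xc3\xbccher.example'
print(punycode_form)  # b'xn--bcher-kva.example'
```

    The same name ends up as two quite different byte strings, which is why code has to know which kind of server it is talking to.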

    And complicated because there isn't a whole lot of infrastructure to have the system figure out which is which and which to use in native code (the managed story is a little better here but it has its own pitfalls; I will cover those another day).

    For now I'll just talk about the intranet story (the Internet story will be for another another day).

    The most important step, one that is pretty much universally a good design practice for many reasons but especially here, is to move off the non-Unicode functions like the ones our anxious developer named. If one has anything outside of ANSI (or even ASCII in some cases), the Unicode (or UTF-8) versions are required here, as the following table points out:

    Function you shouldn't use    Function you should be using instead
    DnsQuery_A                    DnsQuery_W (or DnsQuery_UTF8)
    DnsValidateName_A             DnsValidateName_W (or DnsValidateName_UTF8)
    DnsNameCompare_A              DnsNameCompare_W (or DnsNameCompare_UTF8)
    DnsHostnameToComputerNameA    DnsHostnameToComputerNameW
    getaddrinfo                   GetAddrInfoW
    GetAddrInfoA                  GetAddrInfoW
    getnameinfo                   GetNameInfoW
    GetNameInfoA                  GetNameInfoW
    GetAddrInfoExA                GetAddrInfoExW
    gethostbyname                 GetAddrInfoW
    gethostbyaddr                 GetNameInfoW
    WSAAsyncGetHostByName         GetAddrInfoW
    WSAAsyncGetHostByAddr         GetNameInfoW
    WSALookupServiceBeginA        WSALookupServiceBeginW

    Now as luck would have it, deciding whether to use the "W" version of the function or the UTF-8 version (for the functions that support both) is pretty simple -- just use whichever format you have the text in already.

    And as further luck would have it, for just about all of the functions on this list, the replacement is easy and straightforward for the call itself. Of course you may need to move the code to use Unicode, and it's important to not just convert it from the CP_ACP or whatnot (otherwise you haven't really fixed anything!), but that's not too bad.
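    The warning about converting from the ANSI code page can be illustrated with a small Python sketch. It assumes Windows-1252 as a stand-in for CP_ACP on a Western system, and the hostname is hypothetical. Once the name has passed through the ANSI string, handing the result to a "W" function cannot bring the original characters back.

```python
# Hypothetical hostname containing Greek letters.
name = "\u0391\u03bb\u03ad\u03be\u03b1\u03bd\u03b4\u03c1\u03bf\u03c2.example"

# Assumption: cp1252 stands in for the system ANSI code page (CP_ACP).
# Simulate text that lived in an ANSI string before being widened to Unicode.
through_ansi = name.encode("cp1252", errors="replace").decode("cp1252")

print(through_ansi)          # the Greek letters come back as question marks
print(through_ansi == name)  # False: the damage happened before the "W" call
```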

    You can think of this first step as the most obvious part of all of the work. I'll get into some of the more complicated aspects in the future, with maybe some additional fun details related to Active Directory to make things really interesting (that will be on yet another another day -- or with a topic like AD more than one other day!).

    Now once you start getting into the EAI side (i.e. the email side) it gets both insanely simple and insanely complicated too. But eventually, on some other another other day (once again multiple other days, most likely), I'll hit that topic too.

  • Sorting it all Out

    A design flaw not being fixed is not a bug. And it's not "By Design", either.

    • 13 Comments

    There are situations where a design decision made by Windows can cause a problem that some may reasonably consider to be a bug while the team owning the functionality might resolve the issue as by design.

    Interestingly, neither is technically right in many of these cases. Not if an honest assessment is made of the intent of the product and the described behavior.

    The problem may be caused by a design flaw.

    The fact that the owning team decides it is by design in this case means that it is a design flaw that is not going to be addressed.

    I'll give a couple of examples of this pattern to make it clearer....

    You may remember my If you had gotten there first, you might have staked your claim too! blog, where I explained how the first language installed with a copy of Windows has special privileges in terms of the language of the directory names, the language of account names like the Administrator account, and so on.

    Now if you have other languages installed via various Language Packs, then you may install a Language Pack which, had it been installed first, might have called the Administrator account Administrateur. But now, since you installed English first, it will be named Administrator and it won't change.

    However, documentation that discusses the Administrator account that you may look up in help will follow the default user interface language currently set.

    And it will talk about Administrateur.

    This is the design flaw that I described in Administrator vs. Administrateur, et. al..

    Could it be fixed? Sure it could. By a complicated help authoring scheme that did the actual account name lookup based on well-known SIDs and RIDs and inserted that name into the help content, this problem where help content refers to things that are not actually true in a reasonable MUI scenario would be solved.

    As a bonus, it would fix cases where the account is renamed, as well. Which might be interesting.

    But the latter could have security consequences and really all that work is just generally rejected as too much effort for too little gain.

    And although the documentation could in theory be edited to try to cover this case, it is a widespread issue that would almost certainly make everything over-complicated and confusing.

    So everything is left as is, and this design flaw is considered by design.

    Even though any reasonable person looking at documentation that looks wrong will consider it a bug.

    See? :-)

    I'll give you another example....

    Say you are on another language version of Windows. Let's say the Italian version of Windows 7. On a machine you decided to name italianmachine just for simplicity.

    The "C:\Program Files" directory will actually be "C:\Programmi" on this Italian installation.

    Now if you are like me, the machine next to this Italian one might actually be a Brazilian Portuguese copy of Windows 7.

    And on that machine the "C:\Program Files" directory will actually be "C:\Arquivos de Programas".

    If you used the UNC path to look at this remote path, it will not be "\\brazilianmachine\C$\Arquivos de Programas".

    It will be "\\brazilianmachine\C$\Programmi".

    Yikes!

    Just because you're on an Italian machine doesn't mean you expect the whole world to be Italian, right?

    But there is a good reason for this behavior.

    The reason that happens is that if you look at the desktop.ini in that directory on that Brazilian machine, it will say:


    [.ShellClassInfo]
    LocalizedResourceName=@%SystemRoot%\system32\shell32.dll,-21781

    and since the Italian copy of Windows you are running will have no problems resolving that path and finding that file, it will look up the directory name by loading that string on its own machine instead of the remote one. And of course it will get the Italian string....
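    To see why the browsing machine ends up supplying the string, here is a rough Python sketch of the lookup. The "@path,-id" parsing is just my reading of the value shown above, and both helper functions are hypothetical; the shell's own resolution is certainly more involved. The point is that the environment variable is expanded with the local machine's values.

```python
import re

def parse_indirect_string(value):
    """Split '@<path>,-<id>' into (path, resource_id)."""
    match = re.fullmatch(r"@(.+),-(\d+)", value)
    if match is None:
        raise ValueError(f"not an indirect string: {value!r}")
    return match.group(1), int(match.group(2))

def expand_vars(path, env):
    """Expand %NAME% references using the supplied environment (case-insensitive)."""
    lookup = {key.lower(): val for key, val in env.items()}
    return re.sub(
        r"%([^%]+)%",
        lambda m: lookup.get(m.group(1).lower(), m.group(0)),
        path,
    )

value = r"@%SystemRoot%\system32\shell32.dll,-21781"
path, resource_id = parse_indirect_string(value)

# Assumed environment of the machine doing the browsing, not the remote one.
local_env = {"SystemRoot": r"C:\Windows"}
print(expand_vars(path, local_env), resource_id)
```

    Nothing in the value says which machine %SystemRoot% belongs to, so the local copy of shell32.dll, with its local language resources, is the one that gets loaded.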

    Now perhaps if you are a developer you might be thinking about ways you could fix this problem. You could

    • since it is using environment variables use WMI to get the remote machine's environment variables, and
    • resolve the local path on a remote machine to get the remote machine's directory, and
    • load the remote file to get the string

    for example. But is that a great plan for good performance? Is there hope that it would even be readable in all cases, as compared to just making it Italian?

    In the end, this behavior which is easy to consider to be a bug and which the core team might consider to be by design is yet again a design flaw that is not worth the cost to address.

    If you work for Microsoft, a more honest resolution for this kind of bug is Won't Fix, not By Design.

    Or, if you are one of those users of Mac or Linux who hates Microsoft, you could (in both cases) just see it as pointing out that Microsoft software is buggy.... :-)

  • Sorting it all Out

    UTF-8 default isn't in the latest Notepad, either

    • 11 Comments

    Allow me to intentionally misquote a West Wing episode I enjoyed:

    Every once in a while, every once in a while, there's a bug report with an absolute right and an absolute wrong, but those reports almost always include blue screens of death. Other than that, there aren't very many un-nuanced bug reports in writing a blog that's way too big for ten words. I'm the author of Sorting it all Out, not the author of the readers who agree with me.

    That was fun. :-)

    It is true that I often start blogs with "simple questions" which turn out to have complicated answers.

    And when the answers are simple they usually are simple in a bad way: like the word NO.

    This is one of those blogs....

    The question, as you have probably gathered, was simple:

    Customer wants to automatically use UTF-8 when saving files with Notepad instead of ANSI by default.

    The answer is indeed that no, it isn't possible. This default is hard-coded into Notepad.

    They made the decision in 1993 when Notepad was added to NT 3.1, and have stuck to their guns -- even after UTF-8 support was added in 1998-1999.

    Sorry.

    Now as a workaround, you could try the following:

    1. Create your own app to start every text file rather than starting them in Notepad directly;
    2. Add the UTF-8 BOM to this otherwise empty file you create;
    3. Hope they never start a new file in Notepad or fail to use your "Wolves' Highway" application.
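    Step 2 of the workaround above might look something like this minimal Python sketch. The file name is hypothetical, and the demo writes to a temporary directory; a real launcher would then open the file in Notepad, which is not shown here.

```python
import codecs
import os
import tempfile

def create_utf8_template(path):
    """Create an otherwise empty file that starts with the UTF-8 BOM."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8)  # the three bytes EF BB BF

# Hypothetical file name, placed in a temporary directory for the demo.
target = os.path.join(tempfile.mkdtemp(), "new-note.txt")
create_utf8_template(target)

with open(target, "rb") as f:
    print(f.read())  # b'\xef\xbb\xbf'
```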

    But in the end, there is no way to keep Chloe on the Wolves' Highway. That might be why she was shot and killed by ranchers.

    And why users will seldom follow the directions here, either....

  • Sorting it all Out

    speaking with an accent, conceptually

    • 8 Comments

    Blogs like When the roof got raised, and why and Number format and currency format are not always the same and Why does the percent stuff have so many restrictions? (the former two talking about the growing pains involved in extending locale support as new languages brought new requirements years ago, and the latter talking about a limitation documented here that is architecturally fixed in Windows 7 and may one day get its data fixed if we are lucky) point out that NLS is a reactive business.

    We have something out there, it turns out to not be enough, and so things are changed. Enhanced. Stretched. Modified.

    Other times, it is silly to touch things at all. There are times when a language has a similar concept that is different enough that trying to "fix" it to work within the existing support just makes no sense.

    Like for one thing, consider LOCALE_S1159 and LOCALE_S2359, the per-locale AM and PM indicators.

    In a language like Bangla (ref: Even in India, the language is actually known as Bangla (not Bengali)), we have the following set in the locale:

    LOCALE_S1159          পুর্বাহ্ন

    LOCALE_S2359          অপরাহ্ন

    If you know Bangla you might see the problem here.

    Let's look at these two words in the larger context in which they exist:

    Time period    Word       When
    Dawn           ভোর        03:00 to 06:59
    Morning        সকাল       07:00 to 11:59
    Noon           দুপুর       12:00 to 14:59
    Afternoon      বিকেল      15:00 to 17:59
    Evening        সন্ধ্যা      18:00 to 19:59
    Night          রাত        20:00 to 02:59

    This is a multi-part problem, of course.

    Now in general terms, for someone in Bengal or a Bangla-speaking part of Assam or Bangladesh, a word from that table along with a time is the kind of thing one would want in a time format.

    One would not generally do so much with AM or PM after the time in these places.

    I emailed with friend Omi Azad about it for a bit and he confirmed that the use of these terms would simply be more intuitive; forcing everyone into the 12 hour clock we use with these two less than perfect terms is far from ideal.

    The folks in India and Bangladesh are not alone here, either -- Malay has a similar issue (they would use pg for the morning, tgh for 12 to 4pm, ptg for 4-7pm, and mlm for after 7pm) which has the same problem when it comes to adding it to our time format notions.

    By its very nature this would be a much bigger change, making the architectural investments to support:

    • the notion of other time periods that a locale might commonly use in date formats than just the AM/PM divisions between noon and midnight;
    • the way to have the number of time periods vary;
    • the way to define the start time for each;
    • the way to identify the labels for each of these time periods.
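    The list above can be sketched as plain data plus a lookup. This is only a toy model in Python, not anything NLS does: the period table is the Bangla one from earlier in this blog, while the function names and the output layout (label followed by a 12-hour time) are my own assumptions.

```python
# (start hour, label) pairs from the table above, sorted by start hour.
# The number of periods and their start times are just data, so both can
# vary per locale.
BANGLA_PERIODS = [
    (3, "ভোর"),      # Dawn
    (7, "সকাল"),     # Morning
    (12, "দুপুর"),    # Noon
    (15, "বিকেল"),   # Afternoon
    (18, "সন্ধ্যা"),   # Evening
    (20, "রাত"),     # Night (wraps past midnight to 02:59)
]

def period_label(hour, periods):
    """Return the label of the period containing the given hour (0-23)."""
    # Hours before the first start time belong to the last period (wrap-around).
    label = periods[-1][1]
    for start, name in periods:
        if hour >= start:
            label = name
    return label

def format_time(hour, minute, periods):
    """Format as '<label> h:mm' on a 12-hour clock (the layout is an assumption)."""
    return f"{period_label(hour, periods)} {hour % 12 or 12}:{minute:02d}"

print(format_time(16, 30, BANGLA_PERIODS))
```

    Swapping in a four-entry table for Malay would need no code changes, which is really the whole architectural point: the hard part is the data, not the lookup.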

    Here in the US we have such terms though I can't say I'd expect them in a formatted time string.

    Even after confirming with Ben and Shihab and Omi and Goldie that some or all of these terms are used, it is still not entirely clear to me whether they would be expected in a long time format, or whether instead this conceptual jump is due to Bangla people moving to the nearest conceptual analogue that they have to our AM/PM and identifying it, since AM/PM wouldn't naturally occur to them if it isn't exactly how they would look at the world.

    But since a similar construct is used in the US and other places, this new architecture would make sense, as would going out and trying to get all the data for it across all those locales.

    Though obviously this would be pretty unlikely at this point.

    Bengalis who wanted such a mechanism for time formatting are probably going to have to keep writing their own code, alongside a 24-hour clock.

    Or go back in time 10-15 years and make the case then, of course.

    Okay, let's assume that change is not going to be heading our way.

    There is another problem, one I was running into while researching this in my elementary "learning Bengali" books, and which Omi pointed out in those AM/PM strings when I started describing my troubles: a problem that appears to exist in our Bangla fonts. In his words:

    হ ্ ন is currently হ্ন but has to be হ্ণ
    হ ্ ণ is currently হ্ণ but has to be হ্ন

    So when the font is fixed they will look like পুর্বাহ্ণ & অপরাহ্ণ

    So the idea is that the HNA and HNNA conjuncts in the Bengali fonts are perhaps reversed?

    If he's right that would explain the trouble I was having.

    I was going to check with Goldie too, but she is in Mexico and asking her to be typing in Bengali script seems like a little much. I'll wait til she gets back to ask her....

    In the meantime, I'm wondering how many people might be typing words the wrong way to get the right appearance, and how much that might muck around with search in the meantime.
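    The search worry is easier to see at the code point level. A small Python sketch (the describe helper is just an illustrative name) lists what is actually stored for the two conjuncts; whatever a font draws, the underlying sequences differ in their last letter, and that is what search compares.

```python
import unicodedata

def describe(text):
    """List the code point and Unicode name of every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

hna = "\u09B9\u09CD\u09A8"   # HA + VIRAMA + NA
hnna = "\u09B9\u09CD\u09A3"  # HA + VIRAMA + NNA

for line in describe(hna) + describe(hnna):
    print(line)

# Text typed "the wrong way to look right" will not match the correct spelling.
print(hna == hnna)  # False
```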

    This had me thinking about an extensive discussion I had six years ago with someone from Ethiopia about the fact that they did not have time zones but they had a different notion that they used to describe time that amounted to something with many of the same effects related to how they thought of time compared to when the sun was up (given that Ethiopia is reportedly the hottest place in the world year round I can easily imagine they would have such a mechanism!).

    Maybe I'll ask Scott Hanselman if he has any thoughts about that issue.

    And now I am wondering how much of the data in our locales is trying to map what people want onto an architecture imperfect for representing what people use -- causing our locales to kind of "speak with an accent" the way a person might speak with an accent because he is using the phonemes he grew up with while speaking a language with different phonemes....

  • Sorting it all Out

    Viva Valencia!

    • 8 Comments

    THE WORLD TOUR OF WINDOWS 7 CAPTIONS LANGUAGE INTERFACE PACKS STARTS WITH THIS RELEASE BEING ANNOUNCED!

    It starts with the Valencian CLIP (Captions Language Interface Pack).

    Now this is also where it ends, as Valencian is the only Windows 7 CLIP being released.

    About CLIPs:

    The Microsoft Captions Language Interface Pack (CLIP) is a simple language translation solution that uses tooltip captions to display results.  Use CLIP as a language aid, to see translations in your own language, update results in your own native tongue or maybe use it as a learning tool.

    In a pure "count of words" sense, a CLIP has fewer localized words in it than a LIP (language interface pack).

    About this CLIP:

    This sort of thing cannot be done in a vacuum; the Valencian CLIP was done in collaboration with the Valencian Government, for both Microsoft Office 2010 and Microsoft Windows 7. You can get it from right here.

    About the Launch:

    I wasn't able to be there, so I could never do it justice. But you can check out CLIP optimiza Windows 7 y Office 2010 al valenciano for starters if you want to hear more about it. It seems like it was pretty awesome, actually!

    About Valencian:

    Valencian is the traditional and official name of the Catalan language in the Valencian Community. There are dialectical differences from standard Catalan, as well as (especially in the context of technology) terminological differences. Under the Valencian Statute of Autonomy, the Acadèmia Valenciana de la Llengua has been established as Valencian's regulator. It is frequently spoken of as a separate language, the llengua valenciana, though opposition to the use of standard Catalan occurs primarily among those who do not regularly use the language.

    Valencian/Catalan, like the closely related Occitan, has a long literary tradition, especially Late Medieval and Renaissance. One of the most outstanding works of all Catalan and Valencian literature is the romance Tirant lo Blanch, written by the Valencian knight and poet Joanot Martorell.

    Click here for more info on Valencian.

    Valencian in standards:

    In the standards world, ISO-639-1 does not recognize it (suggesting ca which is the code for Catalan). ISO-639-2 doesn't recognize it either (suggesting cat which is again the code for Catalan). And ISO-639-3 recommends cat for either Catalan or Valencian. However, if you look at BCP-47 it allows for the IANA-registered variant subtag of valencia, which means one could use ca-valencia or ca-ES-valencia to distinguish it from Catalan's ca or ca-ES if one needed to while being conformant to the standard.
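    For anyone wondering how those tags break down, here is a deliberately simplified Python sketch. The function is hypothetical and only handles the language[-region][-variant] shapes mentioned above; real BCP 47 parsing also has to deal with scripts, extensions, and private-use subtags.

```python
def split_language_tag(tag):
    """Split a simple BCP 47 tag into language, region and variants.

    Simplified: only handles shapes such as "ca", "ca-ES", "ca-valencia"
    and "ca-ES-valencia".
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    region = None
    variants = []
    for sub in subtags[1:]:
        # Regions are two letters or three digits; anything else here is
        # treated as a variant.
        is_region = (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit())
        if is_region and region is None and not variants:
            region = sub.upper()
        else:
            variants.append(sub.lower())
    return {"language": language, "region": region, "variants": variants}

for tag in ("ca", "ca-ES", "ca-valencia", "ca-ES-valencia"):
    print(tag, split_language_tag(tag))
```

    Every one of these tags still reports ca as the language, which matches the ISO 639 position; the valencia variant is what carries the distinction.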

    If this seems complicated, then you probably haven't had to study the official status of languages within Spain or the EU, believe me. Or spent much time looking into the various Valencian language controversies....

    Enjoy!

    THE WORLD TOUR OF WINDOWS 7 CAPTIONS LANGUAGE INTERFACE PACKS ENDS WITH THIS RELEASE BEING ANNOUNCED!

  • Sorting it all Out

    You didn't expect to be able to read any email on any device, did you?

    • 8 Comments

    Over in the Suggestion Box, Geoffrey Coram asked:

    I'm the "lead developer" for the e-mail application nPOPuk.  I recently added some code to help a Russian user on a Windows CE machine: apparently, the KOI8-RU codepage is not installed on WinCE, so messages with charset="koi8-ru" were thoroughly corrupted.

    So now I'm thinking about my app, which comes in Unicode and ANSI versions, and wondering:

    1) Is there an easy way for the ANSI version to tell the user, "hey, you typed a Unicode character in the message body"?  I typed some Cyrillic in the window, and when the app sent WM_GETTEXT, it got back a bunch of question marks.

    2) The user can select the charset (UTF-8, ISO-8859-1, KOI8-R, etc.) for sending the message; is there an easy way to tell the user, "hey, there are characters in your message that aren't available in the charset you selected"?

    I suppose I could (a) stop compiling the ANSI version and (b) force all messages to be UTF-8, but that seems draconian.

    This is one to take one piece at a time.

    At least in the body of email messages, the ability to have text that uses different random encodings that the mail client will support has been a long-standing principle.

    Not every message coming to an email client is limited to a single encoding that it understands.

    Of course whether one extracts the content via RTF functions or HTML functions or some other means will largely depend on the client, though generally HTML seems to be the one that all mail clients support to some extent.

    Although in theory you could support an email using any encoding using HTML, in fact there is a device-based limit in mobile devices since a given device may not be able to convert/parse/display every encoding.

    If you are using Platform Builder to build an image for a device, you may have more flexibility here but even that only goes so far. Messages using KOI8-RU and other such code pages can suffer here, and there really isn't a good answer in the platform (though if it is a limited number of code pages one could just ship the tables for a few others)....

    As to that second question, in order to read it in an app, you can just try to convert it to Unicode one way or another. If you are unable to do that conversion, you can definitely warn the user that you are unable to parse the text....
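    For the sending side of Geoffrey's second question (characters that are not available in the charset the user selected), the same "just try the conversion" idea applies in reverse. Here is a minimal Python sketch; the function name and sample message are made up, and KOI8-R is used because Python ships that codec.

```python
def unencodable_chars(text, charset):
    """Return the distinct characters of text that charset cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

# Sample message: Cyrillic "Привет" followed by "café".
message = "\u041f\u0440\u0438\u0432\u0435\u0442, caf\u00e9"

for charset in ("utf-8", "iso-8859-1", "koi8-r"):
    print(charset, unencodable_chars(message, charset))
```

    An empty list means the message is safe to send in that charset; anything else is exactly the list of characters to warn the user about.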

  • Sorting it all Out

    When good functions [seem to] go bad

    • 7 Comments

    In many cases it isn't the question that is complicated; it is the impact of surrounding features that makes the answers so complicated!

    A while back, the question that was asked was:

    Hello,

    I’m trying to get the file version for mshtml.dll at runtime, when I call GetFileVersionInfo (from shell\osshell\version\filever.cpp) with either the full path (C:\windows\system32\mshtml.dll) or just the binary name, it always gives me: 8.00.7600.16385 (win7_rtm.090713-1255) – the IE8 RTM version, regardless of which TP (test pass, think of it as a service pack for IE) is on the machine.

    According to the filever tool, the version I actually have on my machine is:

    >filever /v C:\Windows\System32\mshtml.dll
    <snip>
            FileVersion             8.00.7600.16625 (win7_gdr.100629-1617)

    How can I get GetFileVersionInfo give me this version?

    Thanks!

    Okay, the way the question was asked was complicated this time, too. :-)

    But at its simplest level the question was just "how do I get the version number?" since the wrong answer was (apparently) being returned.

    Anyone want to take a guess as to what might be going on, what might cause a[n apparent] lie to be told here?

    Hint: Ask yourself why I might care about the answer here as a way to figure out what the answer might be....

  • Sorting it all Out

    In Nigeria? With these three LIPs out, maybe Windows 7 was your idea!

    • 6 Comments

    Allusion to the Windows 7 commercials is as obvious as it is intentional!

    THE WINDOWS 7 LANGUAGE INTERFACE PACKS FOR NIGERIA ARE LIVE!
     
    Click on the Hausa, Igbo, and/or Yoruba links to download them via the Microsoft.com Download Center.
      
    Please note that the Nigerian Windows 7 LIPs can only be installed on a system that runs an English client version of Windows 7.   They are available to download for both 32-bit and 64-bit systems.

    The Nigerian Windows 7 LIPs are produced as part of the Local Language Program sponsored by Public Sector.

    A LITTLE BACKGROUND INFORMATION ON THE NIGERIAN LANGUAGES

    NUMBER OF SPEAKERS:

    Hausa: 24 million native speakers, 15 million second-language
    Igbo: 18 million
    Yoruba: 20 million, 2 million second-language

    PREDOMINANT DISTRIBUTION IN NIGERIA:

    SOME FUN FACTS:

    • Non-native pronunciation of Hausa differs vastly from native pronunciation by way of key omissions of implosive and ejective consonants present in native Hausa dialects. This creates confusion among non-native and native Hausa speakers, because non-native Hausa speakers do not differentiate between the pronunciation of words like daidai (correct) and ɗaiɗai (one-by-one) in non-native Hausa. 
    • Many names in Igbo are actually fusions of older original words and phrases. For example, one Igbo word for vegetable leaves is akwükwö nri, which literally means "leaves for eating" or "vegetables". Green leaves are called akwükwö ndu, because ndu means "life".
    • In Yoruba there are no differences for the singular and plural. The context decides whether a word denotes singular or plural.

    CLASSIFICATION:

    • Hausa is the most widely spoken of the Chadic languages, within which it is classified in the West Chadic subgroup.  The Chadic languages are part of the Afro-Asiatic language family. 
    • Igbo and Yoruba are members of the Benue-Congo subfamily of the Niger-Congo language family. The Benue-Congo subfamily includes the widely spoken Bantu languages of Eastern and Southern Africa.

    See classification information for: Hausa  Igbo  Yoruba

    SCRIPT:

    • Hausa is written with a modified Latin script called Bokò. Three consonants were added to the basic Latin alphabet; a fourth consonant  is used in Niger. The Bokò alphabet was introduced by the British at the beginning of the 20th century and replaced the modified Arabic script that had been used before. This script, called Àjami, is still occasionally used, especially in religious writing and for poetry.
    • Igbo is written in the Latin script; the official orthography is known as Onwu. The Igbo people first used Nsibidi ideograms invented by the neighboring Ekoi people for writing. 
    • Yoruba uses the Latin alphabet, modified by the use of the digraph gb and the letters ẹ, ọ and ṣ.  Three diacritics are used on vowels and syllabic nasal consonants to indicate the tones used in Yoruba: an acute accent (´) for the high tone, a grave accent (`) for the low tone, and an optional macron (¯) for the middle tone (which is more often left unmarked).

    See script information for: Hausa  Igbo  Yoruba

    Enjoy!

  • Sorting it all Out

    You might almost say that Gmail got þwned

    • 5 Comments

    So it all started when Michael Everson started a new blog:

    http://þorn.info

    This is not a porn site.

    Because that's not LATIN SMALL LETTER P there. It's LATIN SMALL LETTER THORN.

    Aren't IDNs awesome? :-)

    Anyway, David Starner noted:

    On Sun, Feb 6, 2011 at 3:14 PM, Michael Everson <everson@evertype.com> wrote:
    > Pleased to announce a new blog, http://žorn.info

    And yet even today, Unicode email is reliably unreliable.

    Interesting!

    Others identified the problem; it's one that has been discussed before.

    Michael was using Apple Mail to send the email.

    Apple Mail encoded the mail as being ISO-8859-1. This is reasonable since every character in the email can fit in that code page.

    And David was using gmail for his email client.

    And this is where the wheels came off the wagon.

    Alexandros (Αλέξανδρος) explained it most fully:

    The original message was correctly tagged as ISO-8859-1, but it looks
    like both people responding saw it interpreted as ISO-8859-13. Judging
    from the Message-IDs, both seem to be posting from Gmail, so this must be
    an example of Google's encoding guessing, which has been discussed here
    in the past: since many web pages and mail messages in other encodings
    are mistagged as ISO-8859-1, Google uses various heuristics which are
    easy to go wrong when there's only a few non-ASCII characters in the
    text.

    As I recall, posting in UTF-8 makes the problem go away, although it's
    hard to find fault with Apple Mail for going with the most conservative
    and appropriate encoding for the content (i.e. ISO-8859-1).
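    The mix-up Alexandros describes is easy to reproduce. In this small Python sketch the sender encodes as ISO-8859-1 and the receiver decodes the same bytes as ISO-8859-13; how Gmail actually arrives at its guess is not modeled here, only the effect of guessing wrong.

```python
original = "http://\u00feorn.info"  # "http://þorn.info"

# What the sender put on the wire: ISO-8859-1 bytes (thorn is the single byte 0xFE).
wire_bytes = original.encode("iso-8859-1")

# What a receiver sees if it decides the bytes are ISO-8859-13 instead.
misread = wire_bytes.decode("iso-8859-13")

print(misread)  # http://žorn.info
```

    One byte, two code pages, two different letters: with only a single non-ASCII character in the whole message there is very little for a heuristic to go on.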

    I personally don't care for Google's behavior here.

    There are certainly programs that get things wrong. Hell, the company I work for produced a version of FrontPage that had a serial inability to properly tag and use 8859-1 and CP1252.

    A lot of what Alexandros was saying came from things said by Mark Davis of Google in the past. Mark has explained at length about all the data they work from and how much is tainted (incorrectly tagged).

    But to be honest, this behavior? I find it to be disturbing.

    Remember, this is a world that Google looks at the stats and detects that over 50% of the web is Unicode (as I discussed in >50% of the web is Unicode? Meh, I say. Meh.).

    So maybe it is time for Google, which officially suggests people use Unicode to get correct results, and which works in a world where most modern clients produce correct results anyway, to start shifting over to being more trusting of the huge number of clients out here who aren't getting this wrong.

    They are supposed to be running rings around everyone with the ability to use the whole Internet as a corpus.

    So perhaps Google isn't being pwned by this problem, but this clear willingness to trust an algorithm that distrusts others (and that minimal investigation shows to be a detectable phenomenon, albeit one which Google doesn't go so far as to detect) is proving to be a bit of a þ (thorn) in their side....

  • Sorting it all Out

    Windows the Enabler

    • 5 Comments

    I had someone ask me

    I don’t have a big picture on the whole process of enabling a new language, even though I know a little here or there. For my learning purpose, do you have something that I can read to teach myself?

    Now there really isn't an explicit single place that I know of where such info is kept, so I took a similar item on my "blog request list" and promoted it to right now so you can read the response right here. :-)

    Now these steps are going to be described in a narrative since that is how my blogs often work, but the actual process is done by different people and the order often reflects the built-in multitasking that any multi-person project can bring to the mix. So don't think of this blog as providing an ordered recipe or directions.

    Here we go!


    STEP ONE: The Reading

    The most basic level of enablement is the display of text, which means a font that has the glyphs for the language's letters in it.

    I wouldn't really claim that the language was fully enabled or anything, but if I can read documents in it when I explicitly choose that font then it is a good first step.

    And this step enables the reading, which is great. But the next step is another crucial one on the way to full enablement:

    STEP TWO: The Writing

    Put simply, there needs to be a keyboard or an IME.

    Other methods exist like handwriting recognition and speech recognition, but those tend to show up much later in the lifetime of a language's support in computers. So for present purposes we can assume a keyboard or an IME.

    The quality of the input method is a judgment that I would usually make on a more global basis (since if it is a part of Windows it is available for use to a frightening number of people), but for the purposes of this blog on language enablement, I'll say that the perceived quality of an input method to an individual customer is directly proportional to how easy they find it to use.

    I'll get into that issue further another day; I just mention it here so people can keep in mind how much the fundamental process of language enablement is sabotaged at its root if any of these first three steps is messed up.

    Of course it is also worth noting that these first two steps can be done by anyone, without even getting real help from Windows. Microsoft and many third parties have provided tools to help with both fonts and keyboards.

    But in the context of Microsoft being the one doing the enabling, we should start talking about the things Microsoft can do for enabling a language that goes beyond these things.

    STEP THREE: Underlying Rendering Support

    Even that basic display needs a lot more behind it to do seemingly simple scenarios automatically like

    • looking at web pages or
    • seeing the characters without explicitly choosing the font or
    • seeing proper shaping of complex scripts

    So the proper rendering support via Uniscribe and DWrite and in some cases GDI font linking is important. The only thing cooler for language display than having a good font is not having to choose that font explicitly, so skipping this third step is ill advised.
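
    To make the "not having to choose that font explicitly" idea concrete, here is a minimal sketch of a font-linking style fallback. It is not how Uniscribe, DWrite, or GDI actually implement it; the font names, the coverage sets, and the link chain are all hypothetical and exist only for illustration.

```python
# Hypothetical coverage data: font name -> set of code points it has glyphs for.
FONT_COVERAGE = {
    "BaseUIFont": set(range(0x0020, 0x007F)),        # Basic Latin only
    "LinkedCyrillic": set(range(0x0400, 0x0500)),    # Cyrillic block
    "LinkedMongolian": set(range(0x1800, 0x18B0)),   # Mongolian block
}

# Hypothetical link chain: fonts to try, in order, when the base font lacks a glyph.
FONT_LINKS = {"BaseUIFont": ["LinkedCyrillic", "LinkedMongolian"]}


def pick_font(ch, base_font):
    """Return the first font in the chain that covers ch, or None if none does."""
    for name in [base_font] + FONT_LINKS.get(base_font, []):
        if ord(ch) in FONT_COVERAGE[name]:
            return name
    return None


def split_into_runs(text, base_font):
    """Group consecutive characters that resolve to the same font."""
    runs = []
    for ch in text:
        font = pick_font(ch, base_font)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs
```

    The caller only ever names the base font; the per-character lookup does the rest, which is the whole point of this step.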

    Obviously there are other little items in this step like adding the Unicode character names to Character Map that don't affect rendering per se but definitely make working with fonts and characters easier.

    Additionally, once the next two steps are done, additional rendering support can be expanded to handle features like digit substitution or font linking based on system locale or writing system differences implicit in different locales, and so on. All of the various pieces have to be in place for these last few fancy items to work properly, no matter how much of the work is done earlier in preparation for when the step appears to people using the system.

    STEP FOUR: Underlying Script Support in NLS

    This step may already have been done ages ago if the script was already supported and all the requisite characters have their properties in the OS tables, but often times new languages that Windows has never supported might require whole scripts or specific individual characters that have just been added to Unicode to be added to the system as well.

    It is easy, but important, to have this support.

    STEP FIVE: Underlying locale support in NLS

    I have a colleague who is grimacing at the way I have added "locales" to a discussion of "language enablement" but the next part of the enablement process involves sorting and date formats and calendars and language names, and so on. And all of these items are stored on a per locale basis. So that person is likely just going to have to suck it up and get over it.

    STEP SIX: The Localization

    Now this step has many substeps within it, but for the moment we'll treat it as one big chunk.

    Once all of this support is there, people can start seeing the user interface itself making use of the enabled language!


    Ok, so there we go.

    Now looking at the steps I gave:

    1. The Reading
    2. The Writing
    3. Underlying Rendering Support
    4. Underlying Script Support in NLS
    5. Underlying locale support in NLS
    6. The Localization

    At which point would you declare a language to be enabled?

    There are several teams that consider enablement to happen once their work is done, especially when other steps aren't planned.

    In most cases Microsoft in general won't claim they have enabled a language unless a supportable chunk of steps 1-5 are present.

    But how much support is relative: not every language is intended to go through every step.

    There are even languages that were originally intended to go through the whole series of steps that ran into problems along the way; at that point all of the support can be yanked out but in many cases the partial support will be left in and shored up so that proper support is what will be seen.

    If you look at Windows you can probably find languages essentially stopped at each of these steps.

    In fact, anyone who can name one language that only goes so far as each step will win the prize today!

  • Sorting it all Out

    What do Frank Burns from M*A*S*H and Windows Server 2008 R2 have in common?

    • 4 Comments

    The truism that explaining a joke allows you to get it, though it doesn't give you the funny, has a lot of truth to it.

    In this case, if you didn't watch M*A*S*H you may have trouble discerning what is behind the riddle of the title of this blog.

    It involved an episode where Margaret "Hot Lips" Houlihan, while drunk, reveals:

    I probably shouldn't be telling you this, but Frank Burns is a lipless wonder.

    And as with all jokes that must be explained, the people who now get it don't think it is very funny.

    You can think of the issue as an extension of ELKs aren't roaming where the servers are from over five years ago.

    Because although it is true that ELKs -- the infrastructure data added to XP to support the creation of LIPs -- were [mostly] added to Windows Server 2003 (I gave the full list in Why Bengali keyboards can't be found on XP 64 bit), the truth is that all of this ELK insertion work into what essentially represented the server code base, cool as it was, did represent a specific type of ELK, one not found in nature.

    A LIP-less ELK.

    We don't release LIPs for Windows server products.

    We get questions on this all the time, questions like this one:

    Good morning all,

    My customer has the following question about a language pack for Windows server 2008 R2 that does not appear on the list of supported language pack in the technet link below. Is this list authoritative or is there another way I can find this language pack for them. Any assistance would be greatly appreciated.

    I am having no luck finding an o.s. language pack for Windows Server 2008 R2 Hindi language.  Am I overlooking something?
     
    I also looked out here and it doesn’t look like there is one for Window Server 2008 R2 operating system:
    http://technet.microsoft.com/en-us/library/dd744369(WS.10).aspx

    Regards

    This person got a quick answer, even if it was not the answer being sought:

    WS08R2 didn’t release a LP in Hindi.

    Here are the languages we did release LPs for:
    http://www.microsoft.com/downloads/details.aspx?FamilyId=03831393-eef7-48a5-a69f-0ce72b883df2&displaylang=en

    English, German, Japanese, French, Spanish, Chinese Simplified, Chinese Traditional, Korean, Portuguese (Brazil), Russian, Portuguese (Portugal), Dutch, Swedish, Polish, Turkish, Czech, Hungarian, Arabic, Danish, Norwegian, Finnish, Hebrew, Greek, Thai, Ukrainian, Romanian, Slovakian, Slovenian, Croatian, Serbian Latin, Bulgarian, Lithuanian, Latvian, Estonian

    Now the answer was slightly off target since it was about Language Packs, not LIPs. But as we know everyone makes that mistake (for more info on that issue, see When terminology affects satisfaction and Out of touch? No, just out of scope...), so I won't dwell on that part, so much. But the truth is that for those LP languages, server resources are translated.

    Of course, the truth is that from a customer standpoint, this separate treatment of client and server is one that is seldom fathomed. It appears to be arbitrary and pointless, especially in a world where there are people who tend to use the server product as their client.

    At one time, the standard developer desktop for Office developers was indeed Windows Server 2003, a fact that at one point got in the way of the ability of Office to work with ELKs and try out LIPs. I have no idea if an updated policy like this still exists over in Office, but either way it serves to underscore the fact that this is a real phenomenon, even inside Microsoft. And one that can affect job productivity, no less!

    Now there are many problems with the notion of a LIP on a server.

    First of all, a principle behind LIPs of "translating the most visible UI" fails in interesting ways since many visible UI pieces that are server-specific aren't translated.

    And then there is the scenario.

    If you don't accept the "server as client" scenario's premise, then it is quite easy to consider the principle of making computers more available to people who don't speak one of the major languages we handle fully (like English or German or Japanese) to be way out of scope for the server family of products.

    But maybe it is an incorrect assessment to deny the premise.

    I mean, I often run the server as a client because I don't tend to go for the fancy client features like Aero and glass and UI transitions, and those features are off by default in server. So it takes less time to get the machine in the state I want it if I start from server.

    I don't know for sure but that probably biases me on the issue. Does anyone else have an opinion about the "running the server as a client" scenario?

  • Sorting it all Out

    Why I don't like the JapaneseCalendar class #1: Respecting (or at least admitting) the history....

    • 4 Comments

    Ever since I wrote About the Y1C problem, which really isn't too much of a problem (except maybe in North Korea)...when I said I'd probably say a word about the Japanese calendar, I knew I'd eventually be saying something about the Japanese calendar.

    The fact is that the Microsoft implementation of the Japanese Imperial calendar has always bothered me. For many different reasons.

    Thus this new series....

    Now my dislike is not because of the principal issue I mentioned in Long live the Emperor, a not-as-uncommon-as-it-should-be complaint I hear from non-Japanese developers about the calendar. Despite being the type of nerds who would never truly take violent rhetoric to the level of suggesting Microsoft predict the date of the death and/or end of rule of the Emperor, they nevertheless unknowingly advocate for the Emperor to leave office.

    The fact that the problem continues to crop up is a reflection on the people making the complaint, not of the calendar itself.

    And also not because of the principal issue I mentioned in Y oh Y does YYYY sometimes mean YY, you ask?, because although it can sometimes be confusing, the hypothetical need to support eras of over 100 years is proven wrong rather readily.

    In both cases, perceived inconsistencies that in fact mirror actual usage of the calendar do not bother me; they only bother people too caught up in the technical issues to understand or even notice those usage issues.

    And yet the problems I do have with this calendar are, in their own way, kind of brought up in both articles.

    For a hint, I'll suggest thinking about the frustrations I have with the Umm-Al Qurah calendar of Saudi Arabia that I discuss in Long term planning is not always done.

    If one recognizes some of the limitations of using a religious calendar like the Hijri one to deal with civil matters to such a degree that you create a whole new calendar like Saudi Arabia did here, it seems almost irresponsible to have the period it supports be so short that common scenarios like machine certificates and such will fail when you try to use such a calendar.

    In the case of the Japanese calendar, my concern is not really on Japan though. It is on Microsoft.

    It just seems ridiculous that when we have millennia of attested data (and even more unattested, legendary data) on the Japanese monarchs (e.g. see the List of Japanese monarchs in Wikipedia) that we limit ourselves to the last four eras (平成, 昭和, 大正, and 明治).

    Why wouldn't we stretch this out a little further than that?

    Like at least to where the historical Gregorian years are reasonable to use....

    I mean, it is more than just ironic that you can use the GregorianCalendar to refer to dates that exist before the calendar existed yet you cannot use the JapaneseCalendar to refer to dates that were well within its range.

    The design of the calendar, however, which considers the current era to be #4, leaves no architectural room for the up to 121 previous eras for the up to 117 previous rulers.

    I suppose they could work around this by taking advantage of the fact that a System.Int32 is being used for the era-related methods and use negative numbers for the ones prior to "era #1" though this is less than ideal for a bunch of reasons. If I ever chose to fix this bug I'd create a new class and do it right from scratch.
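
    To show why the numbering paints the class into a corner, here is a small table-driven sketch. It is written in Python rather than .NET and is not the JapaneseCalendar implementation; the era numbers follow the "current era is #4" scheme described above, and the start dates are the commonly documented Gregorian ones as I understand them, so treat them as assumptions.

```python
from datetime import date

# (era number, romanized name, first Gregorian day of the era), newest first.
# Assumed start dates; era 1 is the earliest one this table knows about.
ERAS = [
    (4, "Heisei", date(1989, 1, 8)),
    (3, "Showa", date(1926, 12, 25)),
    (2, "Taisho", date(1912, 7, 30)),
    (1, "Meiji", date(1868, 9, 8)),
]


def to_era(d):
    """Map a Gregorian date to (era number, era name, year within the era)."""
    for number, name, start in ERAS:
        if d >= start:
            # Era years are counted from 1, starting in the era's first calendar year.
            return number, name, d.year - start.year + 1
    # Nothing sits below era 1, so earlier dates simply cannot be expressed.
    raise ValueError("date is earlier than era 1; no era number is available")
```

    With positive numbers already starting at 1 for the oldest supported era, the only places left for anything earlier are zero and the negative numbers, which is exactly the awkward workaround mentioned above.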

    But there are other reasons I dislike our implementation of the JapaneseCalendar class, which I will continue to talk about in part 2....

  • Sorting it all Out

    Yes, they did 64-bit for the Mongolian LIP, too!

    • 4 Comments

    THE MONGOLIAN (CYRILLIC) LANGUAGE INTERFACE PACK FOR WINDOWS 7 IS LIVE!

    You can download the file for the 32-bit version or the 64-bit version.

    Like the Turkmen LIP, it does not currently have a download page for reasons that are not the fault of any of the people I respect and which I don't feel like getting into....

    The Mongolian Windows 7 LIP is produced as part of the Local Language Program sponsored by Public Sector.

    A LITTLE BACKGROUND INFORMATION ON MONGOLIAN:

    Number of Speakers:

    Cyrillic script: ~2,500,000
    Mongolian script: ~3,000,000 (difficult to obtain reliable figures on it)

    Name in the language itself:

    Монгол хэл

    Khalkha Mongolian is the National language of Mongolia and is known to 90% of the population. It is now written using the Cyrillic alphabet, although in the past it was written using the Mongolian script. An official reintroduction of the old script was planned for 1994, but this has not yet taken place as older generations encountered difficulties (also, the lack of consistent and widespread support on computers has also had some impact here).

    Fun Fact:

    There is a tradition of giving names with unpleasant qualities to children born to a couple whose previous children have died, in the belief that the unpleasant name will mislead evil spirits seeking to steal the child. Muunokhoi 'Vicious Dog' may seem a strange name, but Mongolians have traditionally been given such taboo names to avoid misfortune and confuse evil spirits. Other examples include Nekhii 'Sheepskin', Nergüi 'No Name', Medekhgüi, 'I Don't Know', Khünbish 'Not A Human Being', Khenbish 'Nobody', Ogtbish 'Not At All', Enebish 'Not This One', Terbish 'Not That One'. This tradition is one that is also familiar in other cultures such as in ethnic Judaism.

    Click here for more information on the Mongolian Language.

    Classification:

    Mongolian belongs to the Mongolic languages. The delimitation of the Mongolian language within Mongolic is a much disputed theoretical problem, one whose resolution would probably require a set of comparable linguistic criteria for all major varieties.

    Click here for more information on the Mongolian classification.

    Script:

    As previously stated, in Mongolia (a.k.a. Outer Mongolia) the Cyrillic script is used, while in China (a.k.a. Inner Mongolia) the traditional Mongolian script is used. The literacy rates have been much higher in Mongolia since the switch to use the Cyrillic script, due to a large push to increase literacy. The fundamental change (from a "top to bottom, right to left" script to a "left to right, top to bottom" script) and the fact that the latter is so much easier to support in technology may be having the same problems as other "formerly vertical scripts" have seen, though sufficient formal study is currently lacking.

    Click here for more information on the use of the Cyrillic script with Mongolian, and click here for more information on the use of the Mongolian script.

    Microsoft-specific:

    Mongolian is yet another locale for which we made the wrong decision in its LOCALE_SNAME/CultureInfo.Name value. By choosing mn-MN rather than mn-Cyrl-MN, we were left, when adding another Mongolian locale, with deciding whether to name it with a consistent yet different-script mn-CN or an inconsistent but more accurate mn-Mong-CN. We went with the latter, but in retrospect we should have added the script since more than one script is used for the language. Lesson learned. :-)
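
    The naming inconsistency is easy to see if you pull the names apart. The helper below is a deliberately simplified sketch, not a full BCP 47 parser and not what Windows or .NET do internally; the function name is made up for illustration.

```python
def parse_locale_name(name):
    """Split a locale name into (language, script, region).

    Simplified assumption: a 4-letter subtag is a script, a 2-letter subtag
    after the language is a region, and anything else is ignored.
    """
    parts = name.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()
        elif len(part) == 2 and part.isalpha():
            region = part.upper()
    return language, script, region


# The two Mongolian names discussed above, plus the one that was not chosen.
for name in ("mn-MN", "mn-Mong-CN", "mn-Cyrl-MN"):
    print(name, parse_locale_name(name))
```

    Only one of the two shipped names tells you which script it is in; for mn-MN the script has to be known out of band.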

    I spoke about some of the technological challenges previously in Looking at life a bit more vertically, for a moment.... As with other scripts that have large technical challenges in user interface usage (e.g. Tibetan), I find myself uncomfortable with the impact that technology may well be having on the long term directions of growth. In the case of Mongolian, the fact that (for example) every version of Microsoft Access from the last ten years with the Mongolian Baiti font that now ships in Windows can do things like the following for both display and input controls:

    may start helping the future of Mongolian where it is applicable, even if the UI issues in Windows itself are not addressed.

    Enjoy!

  • Sorting it all Out

    Turkmen! (for both Turkmen and women of Turkmenistan)

    • 3 Comments

    THE WINDOWS 7 TURKMEN LANGUAGE INTERFACE PACK IS LIVE!

    You can download the 32-bit version right over here and the 64-bit version right over here.

    It does not currently have a download page for reasons that are not the fault of any of the people I respect and which I don't feel like getting into....

    It can be installed on Windows 7 SP1 (you must have SP1 installed!) with either Russian or English resources, and either 32-bit or 64-bit (just pick the right download, of course!).

    The Turkmen Windows 7 LIP is produced as part of the Local Language Program sponsored by Public Sector.

    A LITTLE BACKGROUND INFORMATION ON TURKMEN:

    Number of Speakers:

    4 million

    Name in the language itself:

    türkmençe

    Turkmen is the national language of Turkmenistan. It is spoken by approximately 3 million people in Turkmenistan, and by approximately 380,000 in northwestern Afghanistan and 500,000 in northeastern Iran.

    Fun Fact:

    Like other Turkic languages, Turkmen is characterized by vowel harmony. In general, words of native origin consist either entirely of front vowels (inçe çekimli sesler) or entirely of back vowels (ýogyn çekimli sesler). Prefixes and suffixes reflect this harmony, taking different forms depending on the word to which they are attached.

    Click here for more information about the Turkmen language.

    Classification:

    Turkmen belongs to the group of South Turkic languages in the Turkic branch of the Altaic language family. It shares this group with Turkish and Azerbaijani.

    Click here for more information about Turkmen classification.

    Script:

    Turkmen only started to appear in writing at the beginning of the 20th century, when it was written with the Arabic script. Between 1928 and 1940 it was written with the Latin alphabet, and from 1940 it was written with the Cyrillic alphabet. Since Turkmenistan declared independence in 1991, Turkmen has been written with a version of the Latin alphabet based on Turkish.

    Click here for more information about the Turkmen script.

    Microsoft-specific:

    It is unclear whether it was intentional or not, but despite being based on the Turkish alphabet, the Turkmen locale on Windows does not do Turkic casing. If this is wrong, someone should tell us so we can fix it some day.
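
    For anyone unsure what "Turkic casing" means in practice, here is a minimal sketch of the difference. It illustrates the Turkish-style i/İ and ı/I pairing only; it is not the Windows casing tables, and the two helper names are hypothetical.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # ı LATIN SMALL LETTER DOTLESS I


def turkic_upper(text):
    """Uppercase with the Turkic pairing: i -> İ and ı -> I."""
    return text.translate({ord("i"): DOTTED_CAPITAL_I,
                           ord(DOTLESS_SMALL_I): "I"}).upper()


def turkic_lower(text):
    """Lowercase with the Turkic pairing: I -> ı and İ -> i."""
    return text.translate({ord("I"): DOTLESS_SMALL_I,
                           ord(DOTTED_CAPITAL_I): "i"}).lower()


# Default (non-Turkic) casing pairs i with I; Turkic casing does not.
print("i".upper(), turkic_upper("i"))
print("I".lower(), turkic_lower("I"))
```

    A locale that "does not do Turkic casing" behaves like the first value on each printed line rather than the second.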

    Enjoy!

  • Sorting it all Out

    Whatever happened to EUROCONVERT?

    • 3 Comments

    The question that came to me via the Contacting Michael... link was:

    What happened to all of the Euro conversion precision work? Why hasn't it been updated?

    Ah, life without questions you wish no one ever asked would be so very very dull, would it not? :-)

    Let me start by saying that I am not a lawyer, and that my knowledge of the currency market and the conversion thereof would shame me in front of my friend Monica were it not for the fact that she already knows I am retarded in such matters.

    In fact my primary skill in these matters is in not bouncing checks!

    So while you may be reading this for amusement, entertainment, or education, if there are legal or compliance reasons that you are seeking this information you should consider making contact with someone who knows what they are talking about when it comes to financial matters as a part of their professional responsibilities.

    With all that said, let me explain....


    You see, when the Euro first came out there was a lot of concern about all of the various pieces of financial information stored in programs like Excel and Access.

    And given the varying exchange rates between the different currencies, there was more than a little stress surrounding the way one would do the conversion between these currencies and the Euro.

    Thus in Excel the EUROCONVERT function was added (and in Access an object model method was added that makes the same underlying call).

    Its description:

    Converts a number to euros, converts a number from euros to a euro member
    currency, or converts a number from one euro member currency to another by
    using the euro as an intermediary (triangulation). The currencies available for
    conversion are those of European Union (EU) members that have adopted the
    euro. The function uses fixed conversion rates that are established by the EU.

    The function signature:

    EUROCONVERT(number,source,target,full_precision,triangulation_precision)

    You can get info on the parameters and how they work here in the docs. It goes on for some time about all it does relating to EU law and such.
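
    For a feel of what triangulation through the euro looks like, here is a rough Python sketch. It is not Excel's or Access's code; the parameter names simply mirror the signature above, the rate table holds just three of the fixed rates, and treating triangulation_precision as a count of significant digits on the intermediate euro value is my reading of the docs, so consider it an assumption.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Fixed conversion rates: units of national currency per 1 euro (subset only).
RATES = {
    "EUR": Decimal("1"),
    "DEM": Decimal("1.95583"),
    "FRF": Decimal("6.55957"),
    "ITL": Decimal("1936.27"),
}

# Decimal places used when rounding a final result in each currency.
DECIMALS = {"EUR": 2, "DEM": 2, "FRF": 2, "ITL": 0}


def euro_convert(number, source, target,
                 full_precision=False, triangulation_precision=None):
    """Convert between euro-area currencies using the euro as intermediary."""
    euros = Decimal(str(number)) / RATES[source]
    triangulating = source != "EUR" and target != "EUR"
    if triangulating and triangulation_precision is not None:
        # Round the intermediate euro amount to N significant digits (assumption).
        ctx = Context(prec=triangulation_precision, rounding=ROUND_HALF_UP)
        euros = ctx.plus(euros)
    result = euros * RATES[target]
    if full_precision:
        return result
    # Otherwise round to the target currency's usual number of decimal places.
    quantum = Decimal(1).scaleb(-DECIMALS[target])
    return result.quantize(quantum, rounding=ROUND_HALF_UP)
```

    Note how much of this is per-currency table data (rates and decimal places); that is the part that has to be kept up to date every time a currency joins.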

    Now what are the important points to keep in mind?

    • This information has largely been removed from docs in later versions of Excel;
    • It was removed entirely from Project;
    • It was removed from the docs in Access;
    • The list of currencies has not been updated in the docs since that time;
    • Presumably it was not updated to support other currencies themselves, either.

    So how can a method presumably added for compliance be so marginalized by a Microsoft product, if Microsoft is not trying to get in trouble with the European Union?

    A funny thing happened with all of these rules, you know.

    Everybody hated them; they were insanely complicated and difficult to maintain.

    Well, maybe not everybody. I mean, I don't know every single person.

    But the people I talked to hated it all.

    So the EU put together some formal guidelines that would scale better. Pages like this one explain it well, and given the way it stresses significant figures (rather than decimal points) it gets away from the need to define the above info on decimal places to hang on to and such.

    Now where does this leave the tool -- is it bogus now?

    Actually, no. The old rules were applications of analogous principles, thus the same results would be returned either way. It's just that one way requires one to carry around a lot of per-currency table entries and get new entries as new currencies were added. The new method requires none of that....

    You can think of it as the classic difference between an algorithmic vs. a table-based approach!

    So the fact that EUROCONVERT/EuroConvert have not been updated but have been marginalized is really okay here. The EU moved on to doing things in a better way.

    The above describes how the situation was explained to me; if it is wrong then my sin is in who I chose to believe. :-)
