February, 2010

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Silly money equivalency games work both ways (aka Making your localizer's life easier, Part 3)

    As series go, this one is not happening nearly as fast as I might have liked.

    The first one, How many ways can a developer say 'File Not Found?' (aka Making your localizer's life easier, Part 1), happened at the end of December, 2007.

    The follow-up, We're back and we're embarrassing ourselves? (aka Making your localizer's life easier, Part 2), didn't get made public until the end of February, 2008.

    Now we are to part three, and it took two years to get there.

    This seems like too slow of a pace for a regular series. I'll see what I can do about that....

    Anyway, if memory serves it was Shakespeare who said

    Brevity is the Soul of Wit.

    Now in modern times we have such busy lives that we often take a specific subset of the meaning here, along the lines of

    Don't Waste My Time.

    and when a software developer finds themselves doing the same thing over and over again, they feel like they are doing something wrong, something inefficient.

    Like if they are putting a word, like say Music into several places in the user interface, it offends some sense of developer tidiness to have the word repeated over and over, perhaps in separate binaries, loaded over and over from these different places.

    If they stumbled across part 1 of this very series they might ask themselves How many ways can a developer say 'File Not Found?' and feel silly saying a simple word over and over again, the same way. The word is the same every time - Music. What could be simpler than that?

    They may even be thinking about all those reminders of the cost per word per language to localize, and think they might be saving Microsoft a few dollars if they just have the resource once and load it into those various places.

    Occasionally, a true geek will calculate it. You know, take their salary, divide it into the time spent doing this little exercise, and compare it with that per word per language cost to figure out if they literally saved Microsoft some money that afternoon.

    On a Friday? I could totally see that happening.

    Of course there is a small problem here: a problem with this "improvement" the developer has worked into the user interface by saving all the repeats of the word Music.

    The problem is that the developer is dead wrong.

    Regular reader Mihai actually spoke to this issue a bit in a comment to that very first blog in the series:

    It is a good idea to make a difference between "reuse" (or duplication) and "consistency."

    Duplication is good, consistency is good, reuse is bad.

    Duplication: to have the same string in several places.

    Consistency: to have all instances be the same.

    Reuse: to merge all identical strings into one and use that one (usually "to save money").

    It is good to have the same string repeated. You need to say "Print" 50 times? Then have it 50 times! This gives the freedom to the translator to do what it's right for his language. Costs more? Maybe. But it will cost even more if you merge them (work), then you have a bug filed saying that titles and buttons need different translations.

    The challenge is to keep strings separate and consistent in the same time.

    Usually translated versions end up being more consistent, because are done by linguists, with access to the full set of strings, with proper tools, then edited with consistency as one of the point to check.

    The English version is just bunch of strings put together by various programmers, some of them with English as a second language, usually without a spell-check, and over a long period of time.

    and it is a valid point.

    A fix for even that problem with the zillion ways to say "File not found" can easily be to just make sure the same words are used for all of them. How is a user well served by needing to look at so many ways of saying the same thing?

    and getting back to our Music example for a moment.

    And let's look at the Croatian language pack for Windows 7:

    We'll ignore the fact that the word glazbe isn't capitalized; the problem here is that it should be glazba, or actually Glazba in this case.

    There are actually other parts of the user interface where Glazbe, or even glazbe, might be appropriate.

    Oh, and also that radnu površinu in the dialog above should probably be radna površin in this case, and radna površina in others.

    You get the point.

    Now would you like to guess how many times each of these identical strings appears in the resources?

    Once each.

    Now I have talked about the capitalization issue and the need to let localization be flexible about it before, so I won't harp on it now.

    But the fact that there is the need to express the same word in different ways in some languages even if not in English? That I will harp on here.

    I will however ask you to recall that geek who figured out that he saved the company a few bucks by sharing that string....

    If you factor in the PR hit when people started complaining about the bad grammar that makes the product look bad (which they did because it did), plus the cost to investigate the cause, fix it, test the fix, and then localize it properly (in other languages too, since those are new strings to be added now), then even if that developer gave back his equivalent salary for a week the score might still not be even.

    Not that he should be charged; it is just that silly money equivalency games work both ways!

    Of course the original notion of wanting to avoid the extra words to get translated has some merit, it is just that the developer does not have the context to understand all of the cases where those strings that are the same strings may not really be the same strings. And no matter how costly it is to translate the same word, it is much more costly to have to fix this kind of problem later when it happens....
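
    To make the duplication-versus-reuse point concrete, here is a minimal sketch in Python. The resource keys and the assignment of each Croatian form to a particular key are hypothetical and only for illustration; the forms themselves (Glazba, glazbe) are the ones mentioned above. The idea is to keep one entry per place in the user interface, and to use a check (rather than a shared resource) to keep the English consistent.

```python
from collections import defaultdict

# English source strings: the same word is deliberately duplicated,
# one entry per UI context. The keys are hypothetical.
ENGLISH = {
    "library.title.music": "Music",
    "dialog.location.music": "Music",
}

# Croatian: because the entries are separate, each one can take the form
# its context needs. Which form belongs to which key is an assumption here.
CROATIAN = {
    "library.title.music": "Glazba",
    "dialog.location.music": "glazbe",
}


def term_of(key):
    """Hypothetical convention: the last dotted segment names the term."""
    return key.rsplit(".", 1)[-1]


def inconsistent_terms(strings):
    """Return the terms whose duplicated entries do not all share one text."""
    texts_by_term = defaultdict(set)
    for key, text in strings.items():
        texts_by_term[term_of(key)].add(text)
    return {term: texts for term, texts in texts_by_term.items() if len(texts) > 1}


# English stays consistent without being merged into a single resource.
print(inconsistent_terms(ENGLISH))
# Croatian differs per context, which is exactly what the translator needs.
print(inconsistent_terms(CROATIAN))
```

    A check like this would only be run against the source language; differences in a translation are expected and are the reason the entries are kept apart.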

  • Sorting it all Out

    Love those Vietnamese LIPs out there

    The Vietnamese Language Interface Pack for Windows 7 is now live!

    And what is more, in contrast with Vista, which I talked about in A serious lack of overlap with 64, the Vietnamese LIP is available in both 32-bit and 64-bit versions!

    You can get it right here....

    Anyway, here is some of that background information on Vietnamese:

    NUMBER OF SPEAKERS:  ~80 million

    NAME IN THE LANGUAGE ITSELF:  tiếng Việt

    ABOUT VIETNAMESE:   Vietnamese is the official language of Vietnam where it is spoken by the vast majority of the population. It is also spoken by about 3 million Vietnamese abroad, 1 million of which live in the United States. Vietnamese is an analytical (isolating) language which means that it does not use inflectional markers to indicate the grammatical role of a word in a sentence. Instead word order is used to express grammatical function. Vietnamese is also a tonal language. The standard dialect of Vietnamese, that of Hanoi, has 6 different tones which are represented in the script (while most other Vietnamese dialects recognize only 5 tones). The word "ma", for example, can have very different meanings depending on the tone: ma (mid falling) means ghost, má (high rising) mother, mà (low falling) but, mã (high rising glottalised) horse.

    FUN FACTS: French colonial rule has left traces in form of several loanwords like đầm (from madame, madam), ga (from gare, station), xi-măng (from ciment, cement), pho mat (from fromage, cheese) or banh (from pain, bread).

    CLASSIFICATION: Vietnamese is widely agreed to be an Austro-Asiatic language, specifically a member of the group of Mon-Khmer languages spoken in Indo-China. For a long time, it had been considered a member of the Sino-Tibetan family because of the influence of China's culture and language over two millennia, but the classification has been deemed wrong since the 1950s.

    SCRIPT: The easily recognizable Vietnamese alphabet, the quốc ngữ script, is based on the Latin alphabet, with several diacritics added. Vietnamese was written in variants of the Chinese script, chữ nôm and chữ nho, from the 13th century on, but as early as the 16th century, the Latin alphabet was used to transcribe Vietnamese: Portuguese missionaries used it for teaching and evangelization. The French Jesuit Alexandre de Rhodes, who worked in the country between 1624 and 1644, built upon these efforts and contributed largely to the development of the modern script.

    Enjoy!

  • Sorting it all Out

    There is no "I" in "Uyghur". Oh. Um. Well, except in the Windows Language Bar....

    On a sunny afternoon on July 15th, 2006 my blog Uighur or Uyghur? was unleashed on an unsuspecting populace.

    Even more so than many of the similar issues that have come up, the legislative efforts alone were impressive.

    And while the information was not finalized and communicated officially in time for Vista, it was indeed in time for Windows 7, thus:

    Yea, got it!

    Well, not entirely, actually. :-(

    If you add the keyboard, as I mentioned yesterday, you start to see the other half of the problem:

    See that UI? There is still an "I" there. Thus in the keyboard list:

    and once it is selected:

    Hey, why is the legacy keyboard the one listed if I asked for both of them? Weird.

    Let's fix that:

    and select that other one, the one with the right letter:

    Ok, that is better.

    Now this remaining issue is the one I described in other blogs.

    Blogs like LOCALE_SABBREVLANGNAME is so not an ISO-639 code, which talks about the issue generally, and LOCALE_SABBREVLANGNAME is more than just an ISO-639 code, which talks about Uyghur more specifically and how the decision ended up working there.

    This is obviously a less than perfect solution since now this artifact of UI - Uyghur is with us, with not much more than some blogs of mine to explain why. But changing the ISO-639 code is a huge problem for a lot of people affecting a lot more areas than just this one, and changing the mechanism the Language Bar uses to represent languages would also have much wider impact than fixing this one case.

    But we learn to live with issues like DE - German and ES - Spanish and EL - Greek and more recently FA - Persian. So it is not like this is without precedent (for the record, the first three of those four examples are due to native name spelling and the fourth is an issue similar in many ways to the Uyghur one).

    And the German one can help make a funny joke in presentations when there are Germans in the room. :-)

    It is really just people like Thomas Milo and me who seem to express unhappiness about issues like this one. Sigh...

  • Sorting it all Out

    The inappropriate nature of getting the Feh out of Uighur, Windows 7 edition

    No big news today, this is an update less sensational than the original story.

    The equivalent story in newspaper terms would be buried on page 20, even if the original story was front page headlines.

    On the other hand, my front page is like page 20 already, so I guess this whole plan is doomed to failure.

    Oh well.

    Anyway, just a quick follow-up on Fight the Future? (#1 of ??), aka The inappropriate nature of getting the Feh out of Uighur.

    They fixed it in Windows 7.

    Let's look at the keyboard list:

    Uyghur and Uyghur (Legacy)?

    Oh well, at least they changed the spelling of Uighur to be Uyghur.

    Maybe I'll write about that another day.

    Let's look at the previews:

    Oh wait, that won't help, it was happening in the SHIFT state.

    We'll go over to MSKLC instead.

    They're in the list there, too:

    Let's see what they look like:

    And there you have it, they fixed it.

    They didn't take my advice about equating those two characters in the collation tables, to help with the migration of any data that might have been typed by anyone in Vista (or by anyone who upgrades, since they'll still have their old, installed keyboard).

    The old layout couldn't be edited (like I mentioned in the second part here), though if it had been me I might have relaxed the rules slightly for this case, given how similar the two characters look in some fonts and all.

    But oh well, at least the bug is fixed, right? :-)

  • Sorting it all Out

    I have not, generally speaking, been a "flasher" (for the last few years at least!)

    This is not a post about flashing people while telling them how to be more world ready with their code!

    This is, however, a post about Adobe's Flash.

    Not about the whole Apple/Adobe HTML5/Flash thing that everyone else is writing about that is so controversial.

    As fascinating as that may be for some.

    For the record, none of my friends at Apple, Adobe, or Microsoft find that stuff to be something fun to engage on, but clearly there are people somewhere having a good time with it. I don't personally know any of them as far as I know.

    Anyway, this is about something my friend Mihai at Adobe told me about.

    You see, for as long as I have known about/dealt with Flash, what I have found most frustrating was the lack of globalization support.

    It drove me nuts, in fact, any time I had to deal with it (some high profile web sites I was helping people with had major issues because of this lack!).

    But no longer.

    Enter flash.globalization!

    This is a really amazing set of classes. Check it out:

    Class                Description
    Collator             The Collator class provides locale-sensitive string comparison capabilities.
    CollatorMode         The CollatorMode class enumerates constant values that govern the behavior of string comparisons performed by a Collator object.
    CurrencyFormatter    The CurrencyFormatter class provides locale-sensitive formatting and parsing of currency values.
    CurrencyParseResult  A data structure that represents a currency amount and currency symbol or string that were extracted by parsing a currency value.
    DateTimeFormatter    The DateTimeFormatter class provides locale-sensitive formatting for Date objects and access to localized date field names.
    DateTimeNameContext  The DateTimeNameContext class enumerates constant values representing the formatting context in which a month name or weekday name will be used.
    DateTimeNameStyle    The DateTimeNameStyle class enumerates constants that control the length of the month names and weekday names that are used when formatting dates.
    DateTimeStyle        Enumerates constants that determine a locale-specific date and time formatting pattern.
    LastOperationStatus  The LastOperationStatus class enumerates constant values that represent the status of the most recent globalization service operation.
    LocaleID             The LocaleID class provides methods for parsing and using locale ID names.
    NationalDigitsType   The NationalDigitsType class enumerates constants that indicate digit sets used by the NumberFormatter class.
    NumberFormatter      The NumberFormatter class provides locale-sensitive formatting and parsing of numeric values.
    NumberParseResult    A data structure that holds information about a number that was extracted by parsing a string.
    StringTools          The StringTools class provides locale-sensitive case conversion methods.

    Mihai gave me some of the skinny on the internals, and in some ways it is similar to Silverlight as it uses the underlying platform it is running on for a lot of the support (as I talked about Silverlight doing back in blogs like Shine a Little [Silver]Light). Though in the case of flash.globalization, if the platform lacks a solid story then it falls back to ICU. This does not apply to Windows, where the existing support is used.

    It supports custom cultures/custom locales!

    It is all name based, using the RFC!

    Not even the newest version of Office (due out soon?) can say that, and they have an XML-based standard and over a decade of notice about doing that....

    And there were other interesting bonuses I found out from Mihai when talking about flash.globalization. Like remember LOCALE_SDECIMAL? Quite a character! Or two. Or three... from the other day? The one that broke the new Windows Calculator and has caused problems for Excel and other parts of Microsoft products?

    Well flash.globalization can handle multiple characters in those locale fields like LOCALE_SDECIMAL and LOCALE_STHOUSAND and so forth. In parsing and formatting.
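
    To show what handling multi-character separators involves, here is a minimal sketch in Python. It is not the flash.globalization implementation; the function names and the example separator strings are made up for illustration, and it assumes the decimal separator does not occur inside the grouping separator. The point is simply that both formatting and parsing treat each separator as a string rather than as a single character.

```python
def format_number(value, decimals, decimal_sep, group_sep, group_size=3):
    """Format a number using separators that may be longer than one character."""
    sign = "-" if value < 0 else ""
    whole, _, frac = f"{abs(value):.{decimals}f}".partition(".")
    groups = []
    # Peel digit groups off the right-hand side of the integer part.
    while len(whole) > group_size:
        groups.insert(0, whole[-group_size:])
        whole = whole[:-group_size]
    groups.insert(0, whole)
    text = sign + group_sep.join(groups)
    if decimals > 0:
        text += decimal_sep + frac
    return text


def parse_number(text, decimal_sep, group_sep):
    """Parse text produced with the same (possibly multi-character) separators."""
    text = text.strip()
    negative = text.startswith("-")
    if negative:
        text = text[1:]
    whole, found, frac = text.partition(decimal_sep)
    whole = whole.replace(group_sep, "")
    if not whole.isdigit() or (found and not frac.isdigit()):
        raise ValueError(f"not a number in this format: {text!r}")
    value = float(whole + "." + (frac or "0"))
    return -value if negative else value


# Hypothetical two-character separators, purely for demonstration.
formatted = format_number(1234567.5, 2, decimal_sep="::", group_sep="__")
print(formatted)
print(parse_number(formatted, decimal_sep="::", group_sep="__"))
```

    Code that assumes a separator is exactly one character long tends to break at the indexing step, which is the kind of failure described above.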

    I'll need to check on things like LOCALE_SDATE and LOCALE_STIME to make sure they work too, though I must admit I am optimistic that if there is a bug I won't be fielding questions about whether such things really happen in shipping locales. No offense to people who might be inclined to do that, of course! :-)

    Here is another reference if you are hungry for information on it:

    The flash.globalization package in Flash Player: Cultural diversity without complexity

    Some very cool information here about using it for those who might be interested.

    Anyway, thinking about a third party component that is able to consume so much of the functionality that I have been working with, on, in, and around for so long to improve the story on a platform that used to not do any of this is really really exciting!

  • Sorting it all Out

    The game is over, people!

    NOTEPAD adds a BOM (Byte Order Mark) when you save a file in the UTF-8 encoding.

    Always[1].

    You'd think that since Windows Notepad has been doing this for over 319680000 seconds[2], and that the combined usage of Windows 2000[3], Windows XP, Windows Server 2003, Windows XP 64-bit, Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2 is so high that it may well blow your mind to calculate the number, that people would have gotten over this by now.

    But no.

    As recently as yesterday[4], people were making comments again in that Why are UTF-8 encoded Unix shell scripts *ever* written or edited in Notepad? blog, the one where I officially suggested that these people who don't like the Notepad behavior of inserting a BOM in front of UTF-8 files had a simple remedy:

    STOP USING WINDOWS NOTEPAD!

    Yet for some reason people are still arguing it.

    Please give up, it is over. If you were in a contest or duel for this[5], then you have lost the contest, been bested in the duel. The game is over[6].

    A long time ago, someone decided that:

    • if your file was 100% ASCII[7] and
    • you chose to save it as UTF-8 and
    • you opened the file up again and
    • added some >0x007f character and
    • later saved again that

    you should not be prompted[8] in a way like this:

    This file contains characters in Unicode format which will be lost if you save this file as an ANSI encoded text file. To keep the Unicode information, click Cancel below and then select one of the Unicode options from the Encoding drop down list. Continue?

    and so that was the way the feature was coded.

    Game over.

    There is probably an alt.i.hate.microsoft newsgroup somewhere on USENET that would be happy to hear your complaint on the matter.

    But the world has moved on.

    And Notepad (the apparent premier tool of UNIX shell script authors throughout the world) has let down a segment of customers who could have updated whatever is reading the scripts in less than a day, rather than complaining about this on and off for the last ~3700[9] plus days.
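
    For anyone wondering what "updating whatever is reading the scripts" might look like, here is a minimal sketch in Python. The function names are mine, chosen for illustration. It relies on the standard "utf-8-sig" codec, which drops a leading BOM when one is present and otherwise behaves like plain UTF-8.

```python
import codecs


def has_utf8_bom(raw):
    """Report whether the bytes start with the UTF-8 encoded BOM (EF BB BF)."""
    return raw.startswith(codecs.BOM_UTF8)


def read_text_tolerating_bom(raw):
    """Decode UTF-8 bytes, silently dropping a leading BOM if there is one."""
    return raw.decode("utf-8-sig")


script = "#!/bin/sh\necho hello\n"
saved_by_notepad = codecs.BOM_UTF8 + script.encode("utf-8")
saved_without_bom = script.encode("utf-8")

print(has_utf8_bom(saved_by_notepad), has_utf8_bom(saved_without_bom))
# Both inputs decode to the same text.
print(read_text_tolerating_bom(saved_by_notepad) == read_text_tolerating_bom(saved_without_bom))
```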

    Your sacrifice is appreciated.

    But please, go home now.

    P.S. Isn't there some tool on UNIX that does this correctly[10]?

    P.P.S. I will not include a screenshot of my private Notepad; I'm not trying to tease you here that badly....

     

    1 - Well, not on the private Notepad I build from time to time from the Windows source, but that one is not one that is released to the public.
    2 - Over ten years, give or take
    3 - Where this first started happening.
    4 - The day before today.
    5 - Which none of you were, who are you kidding?
    6 - Even more over than the Canadians in that game last night.
    7 - Which ironically, most UNIX shell scripts are.
    8 - This is a cool feature too, by the way.
    9 - Over ten years, give or take.
    10 - By your definition of "correctness", at least - a BOM-less UTF-8 save.

  • Sorting it all Out

    Will someone take up the job of Calendar support in .Net, please?

    Calendar support in Microsoft products.

    Sigh....

    Within the first few months of this Blog, I wrote Calendars on Win32 -- Not all there yet and Calendars on Win32 -- just there for show...., followed just 8 days later by the first managed foray into the topic (Calendars.NET -- new platform, new issues).

    Occasionally there was good news for Saudi Arabia or for expatriate Iranians or India or whatever.

    But by and large the question I first posed to my very first manager in Windows almost a decade ago ("When can we make calendar support not stink?") remains largely unanswered.

    Whether one targets the native side or the managed side, the fix would involve a hell of a lot of work - largely a rewrite for the former, largely a new class and a major refactoring for the latter (with the even larger question of how to support custom calendars that would have to somehow work in both environments left aside for another day).

    Obviously I can't describe what the rewrite would need on the Windows side all that effectively since I can't really describe too much about the internals as they stand. So we'll table that one for now; I tend to doubt that there is much hope for that work being done, to be honest.

    The preceding paragraph is my personal opinion about a kind of political situation, not at all a real technical issue.

    But for the managed world, I hold out more hope since there is a team over in the Developer Division that occasionally does some major work in this area.

    The biggest piece of the work would be the refactoring - the current organization between DateTimeFormat, the various Calendar classes, and the parsing and formatting arms of DateTime is very poorly suited to the job of adding new calendars (you can use .Net Reflector to see some of this, or you can probably just take it as read). It is the huge dependency each class has on internal knowledge of the fixed tables in Microsoft's data that makes this such a Herculean task.

    Here is my humble suggestion for someone who wanted to tackle the Calendar issue for .Net, as a step by step task-list:

    1. Come up with a new name for the new Calendar class; surprisingly, coming up with an acceptable name might be the hardest part of this entire exercise!
    2. Make sure you have the complete understanding of your calendar figured out - how to convert any date between it and the Gregorian calendar, from soup to nuts. You'll need this.
    3. Decide if you want to support more than one language to express the dates/times in, and if you do then add a UICulture name sort of thing to support it. This part is optional.
    4. Write a method that will take a DateTime and a standard or custom date/time format string, and format it into a string using your calendar's logic. This will be used by a ToString method later.
    5. Write a method that will take a) a string and b) a standard or custom date/time format string and do an exact, literal parse into a DateTime using your calendar's logic. This will be used by a ParseExact method later.
    6. Write a method that will take a string and do a "parse under any circumstances if it is in any way possible" into a DateTime using your calendar's logic. This will be used by a Parse method later.
    7. If your calendar has other day names/month names and so forth then fill out all the information for a DateTimeFormat kind of thing that has those names in it.
    8. What to do here depends on whether you are internal to Microsoft with the authority to change the BCL (Base Class Libraries) of .Net or not:
      1. If you can make the changes in .Net, make sure that the existing DateTime, DateTimeOffset, CultureInfo, and other classes are modified to be able to consume your new calendar class with the new cool name figured out in step 1. Call me if you need any help with this!
      2. If you cannot make the changes in .Net, then make sure the methods in steps 4-6 above are easy to call and release it as is. You are done. Call me if you want someone to help try it out when you're done for an interesting calendar!
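
    As a rough illustration of steps 2, 4, and 5, here is a minimal sketch in Python rather than .Net. Every name in it is hypothetical. The stand-in calendar is deliberately trivial (Gregorian months and days with a shifted year number, in the style of the Thai Buddhist era, which runs 543 years ahead), and the format strings support only the three tokens yyyy, MM, and dd. A real calendar would replace the two conversion methods and leave the formatting and exact-parsing logic alone.

```python
import re
from datetime import date


class YearOffsetCalendar:
    """Stand-in calendar: Gregorian months and days, shifted year number."""

    def __init__(self, year_offset):
        self.year_offset = year_offset

    def to_fields(self, d):
        """Step 2, one direction: Gregorian date to (year, month, day)."""
        return (d.year + self.year_offset, d.month, d.day)

    def from_fields(self, year, month, day):
        """Step 2, other direction: (year, month, day) to Gregorian date."""
        return date(year - self.year_offset, month, day)


# token -> (index into the fields tuple, zero-padded width)
TOKENS = {"yyyy": (0, 4), "MM": (1, 2), "dd": (2, 2)}
TOKEN_RE = re.compile("yyyy|MM|dd")


def format_date(calendar, d, fmt):
    """Step 4: format a date with the calendar's own year/month/day."""
    fields = calendar.to_fields(d)

    def render(match):
        index, width = TOKENS[match.group(0)]
        return str(fields[index]).zfill(width)

    return TOKEN_RE.sub(render, fmt)


def parse_exact(calendar, text, fmt):
    """Step 5: exact, literal parse of text against the format string."""
    pattern, order, pos = "", [], 0
    for match in TOKEN_RE.finditer(fmt):
        index, width = TOKENS[match.group(0)]
        pattern += re.escape(fmt[pos:match.start()]) + r"(\d{%d})" % width
        order.append(index)
        pos = match.end()
    pattern += re.escape(fmt[pos:])
    found = re.fullmatch(pattern, text)
    if found is None:
        raise ValueError(f"{text!r} does not match format {fmt!r}")
    fields = [None, None, None]
    for index, digits in zip(order, found.groups()):
        fields[index] = int(digits)
    if None in fields:
        raise ValueError("format must contain yyyy, MM and dd")
    return calendar.from_fields(*fields)


buddhist_era = YearOffsetCalendar(543)
print(format_date(buddhist_era, date(2010, 2, 28), "dd/MM/yyyy"))
print(parse_exact(buddhist_era, "28/02/2553", "dd/MM/yyyy"))
```

    The lenient parse of step 6 is left out on purpose; it is the part where most of the judgment calls live.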

    Now this also "solves" the custom calendar issue for managed code, one way or the other. It leaves the Win32 case broken but that would definitely require resources elsewhere (I would love to see it but I'm not going to rely on it).

    For the record this process above is just a more fully fleshed out version of the thing I first suggested almost five years ago in Calendars.NET -- new platform, new issues, but I'm hoping that the expanded information will inspire someone (or multiple someones!) to do the work and tell other people once it's done.

    Plus note that if you do the Ethiopic calendar I mentioned the other day in The road not traveled (or, more to the point, the road not built) for Amharic, you can use the Ethiopic number system since you are the one doing all the parsing and formatting anyway - try exposing those Parse and Format methods too and help prove how easy it would have been for them to do it!

    As I said above, let me know if you start doing all this. I'd love to see someone pick up the torch on this one, since no one seems to be doing it yet, and my former call to passion (opening it all up and getting out of the way) leaves me really wishing more could be done here, by someone. By anyone....

  • Sorting it all Out

    If you ask the average person "Which comes first, '=' or '_' ?" they will stare at you blankly. With good reason.

    Some questions can really take you back, you know?

    Like the other day, when someone asked the following question on a programming alias:

    I am using string.CompareTo to compare two strings like “status=abc” and “status_includes=abc”. The result indicates “status=abc” is greater than “status_includes=abc”. However, on ASCII table ‘=’ is before ‘_’.  Did I misunderstand string.CompareTo?

    Reminds me of the old days!

    Now of course if you ask the average person on the street whether "A" comes before "B" you will get a reasonably consistent answer.

    But most of them, when asked to give their opinion on whether "=" comes before "_", will simply give you a blank look.

    And let's face it, they are right.

    Clearly the kind of very technical people (aka Geeks or Nerds) who would even say things about the "ASCII table" and the order of things in it are a subset of all of the people in the world. Writing stuff to be intuitive for them rather than for everyone else in the world wouldn't make a lot of sense.

    Being a geek who looks at sorting and thinks about the default behavior that they believe should match the ASCII table, that is obscure enough that you might almost class it as the professional equivalent of a fetish. :-)

    Now by default, the method in question uses CurrentCulture for comparisons, but for the record these two characters will sort the same in every culture, including InvariantCulture, because none of them change the handling here.

    You can actually go to the protocol docs to look at the source weights used (hint: see [MS-UCODEREF] in particular if you are one of those types of people!) but I'll save the normal type people among you some time and meaningfully subset the big table here:

    ;------------------------------------------------------------------------------------------------
    ;Windows NT 4.0 through Windows Server 2003 Sorting Weight Table
    ;This file contains detailed character weight specifications that permit consistent sorting and
    ;comparison of Unicode strings.  The data is not used by itself but is used as one of the
    ;inputs to the comparison algorithm.
    ;------------------------------------------------------------------------------------------------
    ...
    ...
    ...
    0x0038 12 162 2 2 ;Digit Eight
    0x0039 12 180 2 2 ;Digit Nine
    0x003a 7 55 2 2 ;Colon
    0x003b 7 58 2 2 ;Semicolon
    0x003c 8 14 2 2 ;Less-Than Sign
    0x003d 8 18 2 2 ;Equals Sign
    0x003e 8 20 2 2 ;Greater-Than Sign
    0x003f 7 60 2 2 ;Question Mark
    0x0040 7 62 2 2 ;Commercial At
    0x0041 14 2 2 18 ;Latin Capital Letter A
    ...
    ...
    ...
    0x005a 14 169 2 18 ;Latin Capital Letter Z
    0x005b 7 63 2 2 ;Opening Square Bracket
    0x005c 7 65 2 2 ;Backslash
    0x005d 7 66 2 2 ;Closing Square Bracket
    0x005e 7 67 2 2 ;Spacing Circumflex
    0x005f 7 68 2 2 ;Spacing Underscore
    0x0060 7 72 2 2 ;Spacing Grave
    0x0061 14 2 2 2 ;Latin Small Letter A
    ...
    ...
    ...

    And there you have it.

    Letters have a "SCRIPT MEMBER" of >= 14, while regular punctuation tends to be 7, and mathematical stuff tends to be 8.

    And given those groupings, even a Nerd would treat U+003d (aka EQUALS SIGN) as a mathematical sign and U+005F (aka LOW LINE, aka Spacing Underscore) as being in general punctuation.
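
    Here is a minimal sketch in Python of why the two orders disagree. It only compares the first two columns of the table excerpt above (script member, then the next weight) and ignores the remaining columns, so it is a simplification and not the actual comparison algorithm; the function names are mine.

```python
# (script member, next weight) pairs copied from the table excerpt above.
WEIGHTS = {
    "<": (8, 14),
    "=": (8, 18),
    ">": (8, 20),
    "^": (7, 67),
    "_": (7, 68),
    "`": (7, 72),
}


def weight_key(ch):
    """Sort key based on the table: script member first, then weight."""
    return WEIGHTS[ch]


def ordinal_key(ch):
    """Sort key based on the code point alone (the "ASCII table" order)."""
    return ord(ch)


by_weight = sorted("=_", key=weight_key)
by_ordinal = sorted("=_", key=ordinal_key)

print(by_weight)   # underscore first: script member 7 sorts before 8
print(by_ordinal)  # equals sign first: 0x3d is less than 0x5f
```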

    The decision here of how to group them, whether to group them, and in what order to group them (by choosing a number, one chooses an order) is to some degree arbitrary but now has been present for so long that it is not really going to change.

    Until the Geek Culture is created; clearly that locale will use the Ordinal sort anyway, which is not only more intuitive but will actually tend to be faster!

    For the original question, the developer was pointed to the overload that accepts a StringComparer and was able to pass StringComparer.Ordinal to solve the problem. I personally would have recommended splitting these strings into name/value pairs (even without knowing what they are, they clearly are pairs) and then sorting the names -- which would give the right answer for all cases, in both the existing Cultures and the faux one I posited!
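
    The name/value suggestion can be sketched like this in Python, assuming the first "=" in each string is the delimiter (an assumption on my part; the original question does not say). Once the names are compared on their own, "=" is never compared against "_", so the relative order of those two characters stops mattering.

```python
def split_pair(item):
    """Split "name=value" at the first equals sign (assumed delimiter)."""
    name, _, value = item.partition("=")
    return name, value


items = ["status_includes=abc", "status=abc"]

# Sort on the name alone; the delimiter takes no part in the comparison.
pairs = sorted((split_pair(item) for item in items), key=lambda pair: pair[0])

print(pairs)
```

    Here "status" sorts ahead of "status_includes" simply because it is a prefix of it.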

  • Sorting it all Out

    The road not traveled (or, more to the point, the road not built) for Amharic

    I've written about both Amharic language issues and Ethiopic (Fidel) script issues a few times over the years.

    Like these, for example:

    Today I'm going to delve into something I mentioned in that fourth link, which has some relation to the first link.

    You see, the Amharic locale on Windows is missing something.

    Something big.

    It is missing the support for formatting the numbering system in GetNumberFormat[Ex], and both that numbering system and the Ethiopic calendar system in GetDateFormat[Ex] and the various calendaring functions (informants knowledgeable about Amharic assure us that the numbers are still used, as is the calendar). Probably in times too - there is a difference in time handling that is not the same as time zones which would amount to a need to change the formatted times, at least.

    In the end, neither was done in Vista for the locale, and the one spec that was written (during Vista) on what it would take to add that support was not picked up and implemented for Windows 7 (I do not know why, this was after I left the team).

    Now what this means in practical terms is that the locale is missing a crucial element that is known to still be used by people who might reasonably be expected to choose that locale.

    As the Wikipedia article on the Ethiopian calendar states:

    The Ethiopian calendar (Amharic: የኢትዮጵያ ዘመን አቆጣጠር yä'Ityoṗṗya zämän aḳoṭaṭär), also called the Ge'ez calendar, is the principal calendar used in Ethiopia and also serves as the liturgical calendar for Christians in Eritrea belonging to the Eritrean Orthodox Tewahdo Church, Eastern Catholic Church and Lutheran Evangelical Church of Eritrea. It is based on the older Alexandrian or Coptic calendar, which in turn derives from the Egyptian calendar, but like the Julian calendar, it adds a leap day every four years without exception, and begins the year on August 29 or August 30 in the Julian calendar. A seven to eight year gap between the Ethiopian and Gregorian calendars results from alternate calculations in determining the date of the Annunciation of Jesus.

    Like the Coptic calendar, the Ethiopian calendar has twelve months of 30 days each plus five or six epagomenal days, which comprise a thirteenth month. The Ethiopian months begin on the same days as those of the Coptic calendar, but their names are in Ge'ez. The sixth epagomenal day is added every four years without exception on August 29 of the Julian calendar, six months before the Julian leap day. Thus the first day of the Ethiopian year, 1 Mäskäräm, for years between 1901 and 2099 (inclusive), is usually September 11 (Gregorian), but falls on September 12 in years before the Gregorian leap year.

    The current year according to the Ethiopian calendar is 2002, which began on September 11, 2009 AD of the Gregorian calendar. The year 2003 will begin on September 11, 2010.
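
    The arithmetic in that description is regular enough to sketch. What follows is an illustration in Python, not anything Windows or .NET ships: the function name is hypothetical, and the epoch constant (Julian Day Number 1724220 for the day before 1 Mäskäräm of year 1) is the commonly cited value for this era, assumed here rather than taken from the article.

```python
from datetime import date

# Julian Day Number of the day before 1 Maskaram, year 1 (assumed epoch).
ETHIOPIAN_EPOCH_JDN = 1724220
# Difference between a Julian Day Number and Python's proleptic Gregorian ordinal.
JDN_MINUS_ORDINAL = 1721425


def ethiopian_to_gregorian(year, month, day):
    """Convert an Ethiopian calendar date to a datetime.date (Gregorian)."""
    # Twelve 30-day months, then a 13th month of 5 days (6 when year % 4 == 3).
    if month == 13:
        last_day = 6 if year % 4 == 3 else 5
    else:
        last_day = 30
    if not (1 <= month <= 13 and 1 <= day <= last_day):
        raise ValueError("not a valid Ethiopian date")
    # 365 days per elapsed year, plus one leap day for every fourth year.
    jdn = (ETHIOPIAN_EPOCH_JDN + 365 * (year - 1) + year // 4
           + 30 * (month - 1) + day)
    return date.fromordinal(jdn - JDN_MINUS_ORDINAL)


new_year_2002 = ethiopian_to_gregorian(2002, 1, 1)
new_year_2004 = ethiopian_to_gregorian(2004, 1, 1)
```

    With these constants, 1 Mäskäräm 2002 maps to September 11, 2009, matching the quoted article, and 1 Mäskäräm 2004 maps to September 12, 2011, the year before a Gregorian leap year.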

    Anyway, this support does seem pretty important. And while the calendar support would be fairly straightforward (and managed code supports a Julian calendar even if native code doesn't), most reports confirmed that the numbers used in dates also used the other numbering system.

    And tackling that number system is just something that no one seemed terribly inclined to do, especially since it would conventionally be expected to be handled via digit substitution in fonts, something that would not work since they are really not the same kinds of numbers. So this would really just be kept to the formatting functions. And the parsing ones too if this made its way into managed code....
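
    To see why digit substitution cannot work, here is a small Python sketch limited to the values 1 through 99 (a deliberate restriction; larger values involve the hundred and ten-thousand signs and grouping rules that are not attempted here). The function name is hypothetical. It relies only on the Unicode layout: U+1369 to U+1371 are the Ethiopic digits one through nine, and U+1372 to U+137A are the Ethiopic numbers ten through ninety.

```python
# Ethiopic digits one..nine and Ethiopic numbers ten..ninety, by code point.
ONES = [chr(0x1369 + i) for i in range(9)]
TENS = [chr(0x1372 + i) for i in range(9)]


def to_ethiopic_1_to_99(n):
    """Write 1..99 in Ethiopic numerals (sketch; larger values not handled)."""
    if not 1 <= n <= 99:
        raise ValueError("this sketch only covers 1 through 99")
    tens, ones = divmod(n, 10)
    # There is no zero: a missing tens or ones place is simply not written.
    return (TENS[tens - 1] if tens else "") + (ONES[ones - 1] if ones else "")


twenty = to_ethiopic_1_to_99(20)   # one character, not "2" followed by "0"
eleven = to_ethiopic_1_to_99(11)   # ten-sign followed by one-sign
```

    Twenty comes out as a single character and there is nothing to substitute for the ASCII "0", so a font-level swap of one glyph per digit has nothing to map to.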

    So Amharic would be the locale that never got completed. Well, not yet, anyway.

    Given the implementation decisions in Windows 7 like the one I mentioned in We do seem to be short on time... (Windows 7 edition), finishing up that support seems less than likely (especially post reorg though even pre-reorg this was not done).

    Conceptually this troubles me, because it is quite possible that technology could start to change the way people look at their language and their culture. The decision to not support these things may well make them less supportable in country too, eventually.

    It is troubling to me to think that if that happens I will have been a part of it. A part that perhaps didn't fight hard enough (even though I know that these arguments would not have been accepted when balanced against resources).

    I may have earned my salary that day, but I didn't earn my respect....

    Now at some point one of the people I know (like Scott, or Daniel, or one of the other Amharic speakers in the group of "people I've met") will read this.

    They may tell me that it is not so dire as I paint it (which might make me feel better).

    Or they may tell me that incomplete support here will not have the same impact on culture that spell-checkers in Word have on spelling regularization/reform (which might also make me feel better since details make the reassurance more plausible).

    Or they may tell me that other platforms are also to blame (which might make me feel better still since blame shared is guilt divided).

    Or they may tell me that I am spot on in the long run (which will make me feel worse since opinion plus authority often goes that way).

    Or they may remind me that cultures change in response to technology of their own volition so it is unreasonable to blame technology (which might make me feel better since I'll remember that Japanese vertical writing's lessened prevalence was much more the choice of people in Japan to speed technology adoption than anything technology drove specifically).

    Or they may tell me I am thinking too much (a common charge today so I'll check for an echo).

    I think I'm a little depressed now though.

  • Sorting it all Out

    The residual self image can be powerful, but the revised self image can knock your socks off

    • 5 Comments

    This is one of those posts that has nothing to do with anything that I usually tend to write about here. One that makes me wish I had the foresight to add a special "OnTopic" tag to put on blogs that I wanted to show up in the front page of the GoGlobal Feed. I should write some code to scroll through every blog in this Blog (there are almost 3000 of them now¹) and the ones with tags beyond Potpourri and iBot and Multiple Sclerosis would get the new OnTopic tag added, while the remainder would get the OffTopic tag, instead.

    Maybe I really should dig into the code to do all that with the tags²; I'm pretty sure the capability exists. For now, this post will have scrolled off within a few days; I'll look into the tagging thing soon. I really, really, really don't want two separate blogs.

    Anyway.

    I was reading the Esquire article by Chris Jones (Roger Ebert: The Essential Man).

    It really moved me. I mean, really.

    I have redefined my work and my career and my life several times due to my disease and yet in reading what Roger Ebert has done I feel like a mere dilettante.

    There was one part in particular that moved me. It became a "quote of the day" for Twitter, which became a status for Facebook. At the bottom of page 5 of the online article:

    ...Ebert and Chaz go out for dinner, to one of their favorite places, the University Club of Chicago. Hidden inside another skyscraper, there's a great Gothic room, all stone arches and stained glass. The room is filled mostly with people with white hair — there has been a big push to find younger members to fill in the growing spaces in the membership ranks — and they nod and wave at him and Chaz. They're given a table in the middle of the room.

    Ebert silently declines all entreaties from the fussy waiters. Food arrives only for Chaz and a friend who joins them. Ebert writes them notes, tearing pages from his spiral notepad, tapping his fingers together for his words to be read aloud. Everyone smiles and laughs about old stories. More and more, that's how Ebert lives these days, through memories, of what things used to feel like and sound like and taste like. When his friend suddenly apologizes for eating in front of him, for talking about the buttered scallops and how the cream and the fish and the wine combine to make a kind of delicate smoke, Ebert shakes his head. He begins to write and tears a note from the spiral.

    No, no, it reads. You're eating for me.

    The emphasis in red is mine, but the underline is not.

    That notion is one I carry with me a lot these days. In so many of the things I can't do anymore, from churning out dozens of KLOCs to skiing to roller skating to ice skating to dancing to running to even walking some days. Things that I used to love, things that I can close my eyes and see happening still, things that I either know I will never do again like I once did (despite expensive lessons in some cases!) or know that I will at best never do as well.

    And when people I know are doing those things, it can be hard sometimes.

    That is a particularly bitter thought, one that I would really prefer to be unworthy of me, yet sometimes I sink to those particular depths.

    And yet other times, when I move past that bitterness, it feels like that desire to see people I know, I like, I care about do those things so I can enjoy them is a very real feeling.

    They aren't necessarily doing it for me, mind you. They are doing those things for themselves and though I don't fancy myself a voyeur I love to watch it and hang on the descriptions of the things I don't see myself.

    It is like living vicariously through them.

    The girl I am going out with is a great example of this. Not a perfect one (she doesn't write code, after all!). But she skis and runs and I love to watch the status updates and listen to her descriptions of what happens later.

    We dance, me on my iBot in a very pale hint of what I used to be able to do, and nevertheless I see her move and see that she enjoys what is happening as much as I do, and I can easily imagine that it is all happening in a way where I am doing so much more. But by the look on her face she is not troubled by my lack, and sometimes it seems almost like she too is imagining these things even though she never has seen them before³.

    At work I can do a training or write a prototype or call a meeting or write an email that can alter a direction on something that in my view might not have been happening the best way it could have. And when I meet new people they will regularly tell me (after making sure that I am the one who writes that blog⁴) that it was helpful or important or essential or life saving for some project that they were working on. And I know that those dozens of KLOCs I'm not writing are now millions or hundreds of millions of KLOCs being written by others for which I was a muse for a moment.

    Like that thing from The Matrix where people see themselves in the matrix as they imagine they are - their residual self image and so on. I still have that. And I think about that list of things that in bitterness I decry my disease has taken from me, and I watch them doing all these things and it is like I am doing them still. Only this revised self image is even better, in many cases because the view is so much wider than it ever could have been were I still doing it the way I was.

    When I think about all of that, I realize I love my life more than I ever did when I was actually doing all of the stuff that I miss doing so much. As the title suggests, my residual self image can be powerful but the revised self image can knock my socks off. And not just because it is balancing on two wheels....

     

    1 - I briefly even had more than Raymond on this server, though my blogging sabbatical, plus the fact that I usually write just one a day while he writes more, probably has him on top again!
    2 - My manager would probably like that too....
    3 - There is a particular dearth of people who know me from when I could dance, so she is hardly alone in that category.
    4 - By which I mean this blog, of course!

  • Sorting it all Out

    IsSortable IsFixed in CLR v4. Well, kind of....

    • 0 Comments

    It really has been over three years since I first wrote the blog IsSortable() == false? Well, sometimes it may be lying.... about the method that didn't work so well with Windows only cultures.

    At the time, I talked to the testers and developers and program managers involved, and they all assured me this was being tracked for a fix in the first version where a fix was possible (i.e. no "red bits" issues).

    And that would have been CLR v.4. Coming soon to a programmer near you.

    They have fixed the problem there.

    Not by adding the instance specific method I suggested, but by updating and reportedly sharing the tables between Windows and .Net.

    I do not know if I fully understand the total plan on the sharing thing; an MVP I know explained it to me but I am not sure he knew exactly how to describe the solution either.

    I suppose it will not be until new versions of Windows and .Net (5.0?) are out that everyone interested (mainly me, perhaps?) can be sure that the problem has been fixed for more than just this one version. For the time being, I'll say fixed for now. :-)

  • Sorting it all Out

    Inappropriate use leads to problems, in non-technical areas too

    • 0 Comments

    There was another odd, if somewhat off-topic, example of the phenomenon described in If it was not intended for that, don't do it. No, really. Stop. Now. Please? -- and rather than showing up in the relevant, topical circles like Future compat is not back compat and wishing can't make it so...., it showed up in a much more serious, legal way.

    It happened when friend and twitter follower/followee Lily Jang mentioned yesterday on Facebook and Twitter:

    Poll: 3 teens face a judge on *child porn* charges for allegedly forwarding a picture of a naked classmate on their cell phones. If convicted of the felony, they would remain on the state sex offender registry for life. Does the punishment fit the crime ? Other thoughts. Will read on-air.

    On-air? Oh yes, it is that Lily Jang¹. :-)

    Anyway, we are really looking at the same problem, just what can happen over in legal circles.

    I mean, sometimes a particular type of crime can be happening.

    And obviously we want to catch the criminals so the crime can be stopped.

    That is the intent.

    Laws are then formed, which will properly punish the criminal and protect everyone else from the impact of their misdeeds.

    Now I don't think there are a whole lot of people in favor of child abuse. I know I am not, and I can't imagine anyone I know being anything other than not in favor.

    But clearly the laws involving registries that sex offenders must be listed in are there to deal with the fact that there are known and documented problems with such offenders becoming repeat offenders. This has been found to be true and psychologists and others have thoughts on why that might be the case.

    The intent here makes sense, and whether or not one has troubles with any of the ways such registries are formed, used, or misused, few will disagree that this intent is at least trying to help mitigate a known problem.

    But applying it to this other case -- of a child taking a picture of a classmate naked and forwarding it to others -- while perhaps violating the letter of the law, is certainly not violating the spirit of it, the INTENT of it.

    It is someone seeing a solution to one problem and trying to apply it to another. One for which it may not really be so great. Like that Future compat is not back compat problem but a lot more serious, with much more terrible consequences for some kids -- yes, kids.

    Plus it is unlikely in the extreme that the law enforcement officers that would make use of the registries or the psychologists who see certain behaviors as hard to control would suspect that these kids would truly be a danger to others.

    It seems like a misuse of the law -- treating the law as a sword rather than a shield as it is generally intended here (to protect people).

    Now this is an extreme case obviously (and I hope that the media coverage here can have some impact/influence on the issue), but there are probably many other similar cases where a law can be misused.

    Just the way a feature in software can be? Perhaps we could think of those over-eager software developers as violating the rights of users to not be subjected to the results of bad design decisions? Of course, with the crime not causing problems as bad, the punishment could hardly be as severe. But maybe checkin privileges could be changed or something....

    For the law, the only defense we usually have is the common sense of the prosecutor. But if the pressures on them or their bias or their misunderstanding of the intent is large enough, even their problem space can have this problem.

     

    1 - By which I mean, that one.

  • Sorting it all Out

    Knock knock! Who's there? Kana! Kana Who? I Kana got something wrong!

    • 1 Comments

    Sometimes I look back at prior blogs from this Blog and am really quite happy with what they say.

    Other times, I am less impressed.

    You may have the same feeling if you are a regular reader. :-)

    Now still other times I feel great at the time but then have reason to not feel as great later.

    Today I am going to talk about one of the times from that third category.

    The blog? From over four and a half years ago (Knock knock! Who's there? Kana! Kana Who?).

    This blog involved a huge discussion with a lot of different people about the terminological, technical, linguistic, and collationary features of Kana in Japanese, and was finally reviewed by people both here and in Japan.

    Unfortunately, it was kinda wrong when it described the exact weight differences between the various Kana.

    The question came from Miles, the man who took on the role of the child who pointed out that the emperor was wearing no clothes:

    We have some bugs with respect to sorting and matching katakana small letters. I’ve been trying to figure out what the desired sort/match semantics should be, but that is not trivial.

    1. according to http://blogs.msdn.com/michkap/archive/2005/06/01/423711.aspx katakana small letters should compare equal to their non small equivalents when NORM_IGNORECASE is used.
    2. but calling CompareStringEx with various flags I get the following

    NORM_IGNORECASE:              U+30e3  <  U+30e4
    LINGUISTIC_IGNORECASE:        U+30e3  <  U+30e4
    NORM_LINGUISTIC_CASING:       U+30e3  <  U+30e4
    NORM_IGNORENONSPACE:          U+30e3  =  U+30e4
    LINGUISTIC_IGNOREDIACRITIC:   U+30e3  <  U+30e4

    So, at least on Win2k8, it looks like Katakana Small letters are diacritic variants of their respective big equivalents.

    Which of these two behaviors should SQL try to imitate ?

    Thanks,
    Miles

    Oops.

    That was my first thought.

    Now the middle test of the five he tried was not relevant; NORM_LINGUISTIC_CASING has no impact whatsoever on Kana (it fixes the Turkic problem described here).

    The other results made no sense though. Not if that blog of mine, the one everyone reviewed, the one I reviewed, the one no one has questioned in nearly five years, was right.

    Better check this one out....

    First let's look at the weights for the seven Kana A's:

    U+ff67 HALFWIDTH KATAKANA LETTER SMALL A 22 02 01    01 01 c4 ff 02 c4 ff c4 ff 01 00
    U+30a1 KATAKANA LETTER SMALL A           22 02 01    01 01 c4 ff 02 c4 ff ff    01 00
    U+ff71 HALFWIDTH KATAKANA LETTER A       22 02 01    01 01 ff 02 c4 ff c4 ff    01 00
    U+30a2 KATAKANA LETTER A                 22 02 01    01 01 ff 02 c4 ff ff       01 00
    U+3041 HIRAGANA LETTER SMALL A           22 02 01    01 01 c4 ff 02 ff ff       01 00
    U+3042 HIRAGANA LETTER A                 22 02 01    01 01 ff 02 ff ff          01 00
    U+32d0 CIRCLED KATAKANA A                22 02 01 ee 01 01 ff 02 c4 ff ff       01 00

    Hmmm.

    Circled Katakana looks to be a diacritic (DW) difference, none of them look to be a case (CW) difference, and all of them duke it out in the special (SW) area.
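
    That reading of the table can be reproduced mechanically. The Python sketch below is only an illustration: it assumes the layout the rows above appear to follow, namely fields separated by 0x01 bytes, a final 0x00 terminator, and the first four fields being the Unicode, diacritic, case, and special weights in that order. It also assumes no weight byte is itself 0x01, which holds for these rows. split_sort_key is a hypothetical helper, not a Windows API.

```python
def split_sort_key(hex_bytes):
    """Split a hex dump of a sort key into UW/DW/CW/SW fields (sketch)."""
    data = bytes.fromhex("".join(hex_bytes.split()))
    if not data.endswith(b"\x00"):
        raise ValueError("expected a 0x00 terminator")
    # Drop the terminator, then cut at each 0x01 field separator.
    fields = data[:-1].split(b"\x01")
    if len(fields) < 4:
        raise ValueError("expected at least four fields")
    names = ("UW", "DW", "CW", "SW")
    return {name: " ".join(f"{b:02x}" for b in field)
            for name, field in zip(names, fields)}


small_a = split_sort_key("22 02 01    01 01 c4 ff 02 c4 ff ff    01 00")
circled_a = split_sort_key("22 02 01 ee 01 01 ff 02 c4 ff ff       01 00")
```

    For U+30a1 the DW and CW fields come out empty and everything distinctive sits in SW; for U+32d0 the DW field is "ee", which is the one diacritic difference in the table.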

    Now of course NORM_IGNOREWIDTH will muck with some of the information in here if you pass that flag; it will make the first and second items on the list look identical, and also the third and fourth look identical.

    Of course these seem to behave differently than other fullwidth characters and halfwidth counterparts, like I described back in A&P of Sort Keys, part 7 (aka You're very thin now, but I can still recognize you).

    But my hopes that something had changed in Windows 7 were dashed; the behavior is the same, and it turns out that A&P of Sort Keys, part 7 also has some minor differences to explain in regard to Kana behaving differently in weights than other differently "widthed" characters do.

    Plus perhaps A&P of Sort Keys, part 10 (aka I've Kana wanted to start talking about Japanese) was a bit too confident about that first blog....

    The really interesting thing about both "bugs" in the blog, which I discovered during this deeper dive for the blog you are reading now, is that neither is borne out by the actual raw values in the weight tables; the difference happens in the way the raw weights are read in the "special case" of Kana.

    My review, which was mainly of comparing the relative sorting of the characters and the raw weights, was complete enough to give me confidence but not complete enough for me to have deserved said confidence....

  • Sorting it all Out

    LOCALE_SDECIMAL? Quite a character! Or two. Or three...

    • 0 Comments

    One of the very first bugs I ever fixed as a consultant for a customer was in an application that printed financial reports. It ran on Windows 95 and NT 4.0, which were the main versions of Windows people were on then (the move from Win 3.1[1] on the client and NT 3.5[1] on the client (if not the server) was pretty complete with the customers I was dealing with).

    The application had problems any time a user made the decimal separator more than one character. Like you can do still in Windows 7 just as you could in every prior version of Windows:

    I looked at the documentation surrounding the LOCALE_SDECIMAL constant (here in its most recent form):

    Character(s) used for the decimal separator, for example, "." in "3.14" or "," in "3,14". The maximum number of characters allowed for this string is four, including a terminating null character.

    Now some things are different here (it used to not have a dedicated topic, as the line of text was a part of the larger Locale Information Constants topic, and it did not use to mention that although it allowed four characters, one of them had to be the NULL), but the basic limitations have not changed since that time.

    I fixed the bug in the code, by the way.

    It was clear (to me at least, back then) that there was no use fighting the documented abilities of an operating system. Not supporting this configuration is a bug in the application, after all.
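
    The customer's code is not shown here, so what follows is only a hedged Python sketch of the general shape of such a fix: treat the separator as a string of up to three characters rather than as a single character. Both helper names and the ",," separator are made up for illustration.

```python
def format_decimal(value, digits, sdecimal):
    """Format value with a decimal separator that may be 1 to 3 characters."""
    # The "f" format always uses "." as its separator, so split on that
    # and splice in whatever separator string the user has configured.
    whole, _, frac = f"{value:.{digits}f}".partition(".")
    return whole + sdecimal + frac if frac else whole


def parse_decimal(text, sdecimal):
    """Parse text that uses the given (possibly multi-character) separator."""
    # partition() matches the whole separator string, not just its first
    # character, which is the step single-character code gets wrong.
    whole, sep, frac = text.partition(sdecimal)
    return float(whole + "." + frac) if sep else float(whole)


formatted = format_decimal(3.14, 2, ",,")   # hypothetical two-character separator
round_trip = parse_decimal(formatted, ",,")
```

    The typical broken version indexes the separator as sdecimal[0] or stores it in a single char, which works for every default locale and fails the moment a user types a second character into the setting.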

    Now for just about as long as this feature has been around, there have been applications that have been tripping over it.

    Not every application, mind you.

    Just some of them.

    In fact, the fancy Calculator in the new version of Windows is just such an application, one that sees its decimal handling broken when you make such a change.

    Darn, and after all the work they did to support user preferences, to trip on this obscure one (which despite being around for so long is still not widely known).

    Now, you may remember how I have talked, in prior blogs like Walking off the end of the eighth bit, about the pseudo locales like qps-ploc that test out extreme cases that are otherwise hard to test for. Well it just so happens that qps-ploc has two characters in its decimal separator to make this feature easier to test out and make sure it does not break you.

    So you can look at this as your chance to find the bug that even folks at Microsoft missed, and fix it in your own code.

    And just remember that this feature has now been around for over a decade and a half, so if it breaks your application then it is your application that is broken. And has been for some time.

    Just FYI, but curiosity over whether some existing locale has ever used this feature is just an act of desperation to avoid doing the job. :-)

    Pseudo just made it easier for you to find the bug!

  • Sorting it all Out

    The real problem(s) with all of these console "fallback" discussions

    • 3 Comments

    It seems like these days, you can't swing a cat around here without hitting a bunch of people talking about the console and fallback on bidirectional languages (particularly Arabic and Hebrew).

    This is something that people are all over for native console apps, and managed console apps as well (even externally to customers, in Blogs like Dina's Developing Arabic applications should be easy! with blogs like Console doesn’t display Arabic, or in blogs of mine like this one, where a managed code developer was asking about calls to SetThreadPreferredUILanguages(MUI_CONSOLE_FILTER, NULL, NULL)).

    These various theoretically cat-injuring communications center around the MUI story for Arabic and Hebrew in most cases, and the importance of making sure the right fallback is used (which essentially amounts to either English or French) any time one launches a console application and needs a better source for resources since Arabic and Hebrew won't work there.

    But there is a problem to all of this.

    Well, two problems actually. I just thought of another problem.

    The first one is that this is not only a problem with Arabic and Hebrew; it applies to lots of other complex script languages - from the code page based ones like Thai to the Unicode-only ones like Hindi/Laotian/Khmer/Bengali/etc. None of them work in the console and all of them require fallback.

    Now the problem with suggesting it is only an Arabic/Hebrew problem does not mean the suggested solution won't work for these other locales; it will. But by framing the question only in terms of those two, no one thinks about the problem with the others. And anyone who is starting with one of these other languages in mind will be lost.

    I don't think this is a problem in Dina's case (her Blog's name is Developing Arabic applications should be easy!, after all). But just about everyone else is guilty of ignoring the fact that this needs to be called out for a lot of other languages.

    Now there is a simple scheme for getting the underlying data - a combination of calling GetLocaleInfoEx with the LOCALE_SCONSOLEFALLBACKNAME constant and then doing some code page logic to make sure that the underlying code page can support the language, as the LOCALE_SCONSOLEFALLBACKNAME topic describes but does not define explicitly:

    Note In general, applications should not make direct use of LOCALE_SCONSOLEFALLBACKNAME data. To determine what language resources to use in a console window, an application should call either SetThreadUILanguage or SetThreadPreferredUILanguages. These functions use the console fallback data as a factor in choosing a language that is legible in the console, but it is not the sole determinant. In particular, the console is limited to displaying characters from a single code page. For example, el-GR for Greek (Greece) is a valid console language, but if the current console code page is Latin-1 (code page 1252) the console displays Greek text mostly as a series of character-not-found symbols.

    If the language corresponding to this locale is supported in the console, the value is the same as that for LOCALE_SNAME, that is, the locale itself can be used for console display. However, the console cannot display languages that can be rendered only with Uniscribe. For example, the console cannot display Arabic or the various Indic languages. Therefore, the LOCALE_SCONSOLEFALLBACKNAME value for locales corresponding to these languages is different from the value for LOCALE_SNAME.

    For predefined locales, if the fallback value is different from the value for the locale itself, the value for the neutral locale is used. A specific locale is associated with both a language and a country/region, while a neutral locale is associated with a language but is not associated with any country/region. For example, ar-SA falls back to "en", not to "en-US". This policy of using neutral locales is implemented consistently for predefined locales and is strongly recommended for custom locales. However, the policy is not enforced. For a custom locale, your application can use a specific locale instead of a neutral locale as a fallback.

    Note None of the functions described in Calling the "Locale Name" Functions accept neutral locales as inputs. Thus LOCALE_SCONSOLEFALLBACKNAME data is of very limited use. In particular, neither GetLocaleInfo nor GetLocaleInfoEx accepts neutral locales as inputs.

    Now we get to the second problem: although this topic, which has big "do not use" info around it, has some of the best conceptual documentation on the rules for what works and what does not, no one is providing good code here to help. And most of the samples out there, including the ones others are using, are wrong anyway in small ways (since one can change the code page of a console with tools like chcp and fool the code).
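
    The "code page logic" half of that scheme is the part that is easy to sketch. The Python below is an illustration, not a replacement for the Win32 calls: fits_code_page is a hypothetical helper that leans on Python's own cpNNNN codecs, and it takes the code page as a parameter on the assumption that the caller queries the console's current code page at run time (instead of assuming a fixed one, which is how chcp fools code).

```python
def fits_code_page(text, code_page):
    """Return True if every character in text exists in the given code page.

    Raises LookupError if Python has no codec for that code page number.
    """
    try:
        text.encode(f"cp{code_page}")
    except UnicodeEncodeError:
        return False
    return True


greek = "Ελληνικά"
greek_in_1252 = fits_code_page(greek, 1252)   # the Latin-1 case from the quoted topic
greek_in_1253 = fits_code_page(greek, 1253)   # the Windows Greek code page

# Passing this test is necessary but not sufficient: Arabic letters exist in
# code page 1256, yet the console still cannot shape or order them.
arabic_in_1256 = fits_code_page("مرحبا", 1256)
```

    So a check like this can rule a language out for the console, but it cannot rule a complex script language in; that still needs the fallback data.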

    And I just noticed what might be a third problem, this time in that MSDN topic. Aren't neutrals supported in GetLocaleInfo[Ex] in Windows 7?

    Beyond that, there is a fourth problem that I just thought of.

    None of the suggested functions or code samples or conceptual topics solve any real world problem.

    Running all of these various solutions will keep you from, for example, loading Arabic resources in your console application -- which really would not ever be able to happen anyway since no one who knows anything about the console would ever pay someone to localize their application to Arabic!

    This makes the code kind of pointless for almost all cases except maybe a few like Arabic Morocco, which falls back to French if it is there; most others go to English, which is usually the ultimate fallback anyway.

    But doing all this work for the sake of Morocco can be worthwhile as an exercise (or as a reality if you ship software there!).

    But beyond that, since in the majority of these cases the user's default locale will be the same as their UI language, and since we tell people 'til the cows come home to always use the date formatting functions/methods/properties rather than rolling their own, this means that most of those strings will be printing out the same characters the console was determined to not be able to support in the first place.

    So you end up right back where you started, for Khmer and Arabic and Bengali and Lao and Hindi and Thai and Sinhalese and all the others.

    Hell, every time the user locale fails that same "is it supported by the code page?" test, you will fail. Which could be the case for almost any language other than English.

    Using a similar fallback logic for the user locale/CurrentCulture is not something that currently exists. And even if you write your own you are hampered by the fact that the typical en-us fallback is such a bad match for the rest of the world with different decimal separators/day-month order issues/collation support/etc. that you will likely create as much confusion without the question marks as with them.

    Perhaps these are the fourth and fifth problems: the fourth is the lack of description of the above problem, and the fifth is the lack of any kind of solution....

    But I will tell you one thing.

    I'll be doing a presentation on console issues in the near future for some Windows developers and testers, and I promise you that the issues in this blog will be some of the main ones pointed out there, with some real suggested solutions to the full problem instead of just the half-solutions ignoring the two... three... four... five problems I mention here.

    And also talking about the sixth problem: where Powershell, particularly the graphical Powershell, fits in here (and breaks the assumptions of almost every other solution that used to work some of the time!).

    Perhaps the seventh problem is that no one is talking about that issue either.

    I'll perhaps have some blogs after all of that is done, too. For the folks following here. :-)
