October, 2011

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    The evolving Story of Locale Support, part 0 (The introduction)

    Two of the funnest parts of speaking at this year's Internationalization and Unicode Conference were:

    1) I got to talk about some of the things I do now as a part of my work.

    2) I got to talk about some of the cool things that are a part of the upcoming version of Windows.

    I imagine talking about some of the more interesting facets of these two things might keep me busy for some time!

    You can think of this as the pre-blog for many of the future blogs discussing the very features that you can see more of if you have the //Build Developer Preview.

    One important issue to note is that although both the presentation and discussion over the coming months will in some ways seem quite organized, a lot of the actual work was much more disjointed, based on the needs of many different teams and partners and customers.

    The apparent organization has much more to do with applying a consistent set of rules and principles across a true Hungarian Goulash of different requests and features and bug reports....

    Also, while I'll be able to speak of some of the rules we're following as obvious, if I simply look at many of the items as I did just a few years ago, I can promise you that we learned a lot of the core methods that define the nature of this critical sliver of the job in this very version. Many of the lessons learned are in retrospect obvious and would have been invaluable to know five years ago, and I am completely prepared to forgive every single member of the original NLS team for not recognizing these "obvious truths" since they weren't terribly obvious to me either, at the time!

    I'll also set down some really important ground rules:

    I will not be talking about any of this work as innovative.

    While I am not judging the individual uses of The "I" Word by many people involved with Windows 8, it is hard not to feel like the word is being at least overused if not misused. So I'm going to emphasize the common sense nature of much of the work, and since the only thing innovative about it is being right where I and others used to be wrong, it feels inappropriate to use that word in the context of what I'll be discussing.

    I am not going to talk about creating experiences or sharing experiences or anything like that.

    Perhaps the nature of this Blog is in some sense a form of marketing of International and World-Ready features, but my view of some of the basic facets of locales and keyboards and formatting and parsing and collation does not allow me to re-purpose the way I talk about these things. It may be a lack of creativity on my part, but it would feel fake to me to talk about things using these particular "buzzwords". I think I'm a bit too grounded in what I think of as "the experience of what feels broken" to really feel comfortable talking that way.

    Perhaps these ground rules cause my presentation to feel very different than many others, even those given by others at the recent IUC. I don't find them to be insincere, but I know I'd feel insincere if I tried to emulate them....

    One last rule:

    I have neither inclination nor desire to violate either non-disclosure agreements or marketing news cycles related to Windows 8.

    This last rule seems obvious, but I don't want anyone to misunderstand my intent here, or what I want to accomplish. Any time I talk about stuff you haven't heard before, it is only due to the fact that they are doing other things right now, not because I am disclosing anything that you couldn't have found yourself by spelunking through the //Build Developer Preview, or eventually the Beta.

    I hope you enjoy the things I talk about, and the things that I point out. My goal is to enjoy the trip, and further I hope you enjoy it, too.

    Okay, we can now let the games begin!

  • Sorting it all Out

    There's no "I" in IDN, part 11: There's no place like ::1, not even 127.0.0.1!

    Previous parts in this series:

     I have a T-shirt with the caption:

    There's no place like 127.0.0.1

    which is kind of fun, and a nice way to help distinguish geeks in some situations.

    It makes the topic today even more topical, which I guess just makes it a better topic?

    The limitations I mentioned in Part 10 above relate to the way the hope of Unicode support in DNS is dashed by the requirements of NetBIOS, and more to the point by the way the DNS and NetBIOS names are kept in sync by a convention that many people depend on.

    That limitation is a pretty bitter one, since the whole point of IDN is really to allow names that support a significant chunk of Unicode, yet the "default" name for just about everything you might do with a machine is largely kept in the world of code pages.

    That's ugly.
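    To make the gap concrete, here is a minimal Python sketch, not anything Windows itself runs. It assumes the OEM code page is 437 (the real CP_OEMCP depends on the system locale), and the helper names fits_code_page and to_ace are made up for illustration. It leans on Python's built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
def fits_code_page(name, code_page="cp437"):
    """Return True if every character of name exists in the given code page.

    cp437 stands in for CP_OEMCP here; that is an assumption for the example.
    """
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def to_ace(label):
    """Convert a Unicode label to its ASCII-compatible (xn--) form."""
    return label.encode("idna").decode("ascii")


if __name__ == "__main__":
    for label in ("b\u00fccher", "\u65e5\u672c\u8a9e"):
        print(label, fits_code_page(label), to_ace(label))
```

    The first label fits in the assumed code page and the second does not, yet both have a perfectly good IDN form. That is the mismatch in a nutshell.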

    Moving forward, it is easy to assume that backcompat requirements will keep the situation from ever changing.

    Thankfully that assumption ignores the one way we can make sure we can get out from under Microsoft's implementation of NetBIOS's ugly limitations.

    IPv6

    (gratuitous embedding of the song performed by the secret-wg in the closing plenary of the RIPE 55 conference follows)

     There is one very cool thing about IPv6.

    Well actually, there are a buttload of cool things about IPv6, but most of those things aren't relevant to this Blog and only one is relevant to this blog.

    It is the fact that the standard defines no direct relationship to NBT, aka NetBIOS over TCP/IP, aka NetBIOS.

    So, unless Microsoft specifically adds such a relationship, and thereby snatches defeat from the jaws of victory, an IPv6-only world is a DNS-only world.

    And thus an IPv6 only machine does not need to run NetBIOS -- or at least not connect the two together.

    And the machine name no longer has to be dependent on the default system locale, or more specifically the CP_OEMCP.

    Now there's "many a slip twixt a cup and a lip", as they say.

    But one of the new commitments I will now have as a part of my larger IDN responsibilities is to make sure no one adds back that dependency....

  • Sorting it all Out

    I can't _____ [an MS rant]

    All personal stuff, MS stuff, iBot stuff. Skip as needed....

    I can't snowboard anymore.

    I really only did it a couple of times, and honestly I usually preferred to just ski if I was going to be on the slopes. But still, I can't do it anymore.

    I can't ski anymore.

    Skiing is something I genuinely miss. I miss it enough that the mitigatory methods I can use to go down the hill feel like hollow echoes of what once was. A few people now and again have almost convinced me to try it -- heavy emphasis on the ALMOST. I was really little more than a very advanced amateur. But I owned it. Well, at least until I didn't.

    I can't run anymore.

    I used to run fairly often. It was a great way for me to decompress if I was angry or frustrated or in any way keyed up. Sometimes I wonder how I get rid of all that baggage now. I guess I just blog now instead. But I think running was more fun.

    I can't perform Neurodiagnostic tests like EEG/Evoked Potentials or Cerebrovascular tests like Carotid Doppler studies anymore.

    Something I did when I was young. I had a specialty in both intra-operative tests (unconscious victims but fine motor skills required) and pediatric tests (I had a "no chloroform" rule since it could hurt the test results; this meant I had to be able to charm small children enough to paste electrodes to their heads with collodion, a noxious scent in the best of circumstances!). I still have the charm, but the fine motor skills are beyond me.

    I can't draw blood anymore.

    Once upon a time I used to do phlebotomy and was certified to do so, in fact. It was a sideline, working for neurologists and neurosurgeons and sticking needles in those kids to check their Dilantin levels. Luckily the charm still holds, but unluckily I can't palpate a vein or properly place the needle even if I could.

    I can't ice skate or roller skate anymore.

    Not that I did either professionally, but it was fun to do. Ain't gonna happen now, because....

    I can't balance myself anymore.

    I remember the first day I felt out of control. I was just coming out of Suz's 2nd floor apartment, and halfway down the stairs I was just barely able to keep from falling the rest of the way. This "acute onset" feature has been a part of my life for over a decade.

    I can't skateboard anymore.

    Never something I was a superhero doing, but I could do an Ollie and all manner of flips, and grinding was not beyond me. To this day, when I pass by the many "skateboard punks" of Seattle who will say the iBot is "so tight that it's sick!" I can either deferentially point out that there was no way I could do a heelflip with it, or with false bravado say that if I had my custom board I could do a kickflip that would blow their minds.

    I can't walk anymore.

    This took a lot longer to happen, and was long obviated by using a cane. But it is something where having trouble balancing simply got worse and worse....

    I can't dance anymore.

    During my senior year of high school, I had an epiphany about how lame one's dancing is if one is never trained in how to do it. So I took dance lessons, and after quite a few I could foxtrot and waltz and swing dance and mashed potato and salsa and merengue and twist and tango and so on. Until eventually I couldn't. In some ways I miss this, a skill that I trained at so hard, the most. A victim of my new clumsiness.

    I can't write with pen or pencil anymore.

    My handwriting has never been stellar, but it has been steadily getting worse -- one of my earliest symptoms. Now it's so awful that even for simple events like the blind auction for charity at Seattle Works SWANK I could barely scrawl out my bid number. I joked that next year I'd bring a pricing gun, but who am I kidding? I couldn't aim something like that. Next year I'll just have to talk to someone in Seattle Works about how to make it work...

    I can't type fast (or at all) without software assistance anymore.

    Another sideline was transcription, and when I think about the steady and measurable decline of skill from a high point of 180 words a minute with no errors to 20 words a minute with 5-6 errors, it's never clearer in my mind that this is a degenerative illness.


    I could go on, but you get the point.

    This random assortment of things that for either personal or professional reasons I can't do anymore are basically just the price of my disease.

    Some of these problems, I overcome with the iBot, while other parts I overcome with computer software.

    But MS has had its price....

  • Sorting it all Out

    Never doubt that a program like Notepad can change the world. Indeed, it is one of the only things that ever has!

    The question came in just the other day:

    ...customer says that when they specify a page as “UTF-8N”, it doesn’t render correctly on IE, but does on other browsers.  I searched for “UTF-8N” and found references only on Japanese-language pages.  The one English-language reference I found is this one where someone claimed that “UTF-8N” is simply UTF-8 without a BOM.  Is that true?  (How could anyone expect such a scheme to work?)  Do we explicitly not support “UTF-8N”?  Do we differ from competitors in this regard?

    Peter pointed out what UTF-8N is:

    UTF-8N was a proprietary designation that once appeared in some IBM documentation. It is not a valid charset identifier for use in HTML or XML. Valid charset identifiers are registered with IANA and can be found in this registry page:

    http://www.iana.org/assignments/character-sets

    This is specified, for instance, in section 5.2 of the HTML 4.01 spec:

    http://www.w3.org/TR/1999/REC-html401-19991224/charset.html

    Now in the original thread it was NaseerBatt who pointed out the same meaning, without mentioning the "non-standard nature" of it:

    UTF-8 without BOM is UTF-8N .Do you mean there no in-built mechanism or hack in C#/.NET that helps to detect this encoding (UTF-8N)?

    Now the issue of the UTF-8 BOM is an interesting one that Microsoft has been in the center of since the beta of Windows 2000, where Notepad changed the world in its simple decision to always emit the Byte Order Mark (BOM) any time you save a file as UTF-8.

    It made life easier for many people working on other Microsoft products, since it is faster to work with the first few bytes of a file than to validate the encoding of the entire file.
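    The difference in effort is easy to see in a small Python sketch. This is only an illustration of the two approaches, not how any Microsoft product does it, and the function names are invented for the example.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Cheap check: look only at the first three bytes (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """Expensive check: decode the whole buffer to see if it is well-formed."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "h\u1e25".encode("utf-8")
    without_bom = "h\u1e25".encode("utf-8")
    print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
    print(is_valid_utf8(without_bom), is_valid_utf8(b"\xfe"))
    # The "utf-8-sig" codec accepts both forms and drops the BOM if present.
    print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```

    The first function touches three bytes no matter how big the file is; the second has to walk the whole thing.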

    Though outside of Microsoft, it isn't as well thought of.

    So why do some other browsers seem to support it but not IE?

    Well, remember that you can always have UTF-8 without a BOM -- in fact many people often prefer it.

    Now there are two ways to deal with an HTML document that has a non-official "UTF-8N" encoding tag:

    • You can treat the file as some default encoding like ISO-8859-1 or Windows CP1252, or the CP_ACP, or
    • You can go through the various encoding detection steps that you would if there was no encoding tag.

    There are strong philosophical issues underlying the choice of approaches, including the one that happens to appear to be used by IE in this case: if the text is mislabeled, don't try to do any further work....

    In the end, the best answer is probably to just not tag text inappropriately....

    I won't suggest inserting a BOM in front of UTF-8 unless you're sure you won't start complaining about why that's a bad idea.

  • Sorting it all Out

    Every rose has its Þ....

    Over on the Unicore List, the question was a familiar one:

    I am converting text in an ANSI-encoded document to Unicode using 
    search and replace in Notepad on Windows Vista SP2. The source 
    document contains text in the 8-bit CSX+ encoding for Indic 
    transliteration. A chart of the CSX+ encoding is available at: 
    http://homepage.ntlworld.com/richard.wordingham/10646/CSX+.htm

    In the CSX+ encoding, ASCII 254 'þ' represents the character 'ḥ' 
    (h-underdot). When I perform a search and replace of ASCII 254 'þ' to 
    Unicode 'ḥ' U+1E25 LATIN SMALL LETTER H WITH DOT BELOW, the operation 
    not only converts all instances of 'þ' to 'ḥ', but also all instances 
    of 'th' to 'ḥ'. For example, the word 'rathaþ' gets caught in the 
    replace and is changed to 'raḥaḥ'.

    This is rather unexpected behavior. I would consider this an error, 
    but perhaps a very well-intentioned one, given that the phonetic 
    representation of 'þ' in Old English is in fact /th/.

    Is there some internal Windows mechanism that treats ASCII 254 þ as 
    being canonically equivalent to 'th'? Or, perhaps is the equivalent 
    rule the dastardly deed of some Old English language enthusiast turned 
    techie? :)

    Best,
    Anshuman

    Yet another "misuse" of Notepad beyond the old UTF-8 BOM? :-)

    This one is kind of my fault, too.... not directly since I am not the one who changed Notepad, but I am the one who added the function and then pushed them to use it in Notepad (fixing the problems I pointed out in blogs like When Notepad's Find doesn't and The fallacy of comparing out of context and so on more than half a decade ago).

    In Vista, bringing FindNLSString brought the full power of Windows collation to the Find/Replace capabilities of Notepad.

    So all of the various Unicode canonical forms will always be equal and so on.

    This is a good thing.

    Unfortunately for Anshuman, it also brings our EXPANSIONS along.

    In particular, the following two entries:

    0x00de 0x0054 0x0048 ;TH
    0x00fe 0x0074 0x0068 ;th

    The only locale whose sorting negates this equivalence is Icelandic.
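    For anyone who wants to see the effect without Notepad, here is a toy Python sketch. It is emphatically not FindNLSString or the Windows collation tables; it just hard-codes the two expansion entries above and does a naive search and replace that treats a character and its expansion as equal. The names EXPANSIONS, expand, and replace_with_expansions are made up for the illustration.

```python
# The two expansion entries quoted above: thorn expands to "th".
EXPANSIONS = {"\u00de": "TH", "\u00fe": "th"}


def expand(text):
    """Replace each character with its expansion, if it has one."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text)


def replace_with_expansions(text, needle, replacement):
    """Replace every span of text whose expansion equals the needle's expansion."""
    key = expand(needle)
    out = []
    i = 0
    while i < len(text):
        matched = False
        # A source span can never be longer than its expansion.
        for j in range(i + 1, min(len(text), i + len(key)) + 1):
            if expand(text[i:j]) == key:
                out.append(replacement)
                i = j
                matched = True
                break
        if not matched:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    word = "ratha\u00fe"
    print(word.replace("\u00fe", "\u1e25"))                    # ordinal replace
    print(replace_with_expansions(word, "\u00fe", "\u1e25"))   # expansion-aware
```

    The ordinal replace only touches the final thorn, while the expansion-aware one also eats the "th" in the middle, which is exactly what Anshuman saw.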

    Perhaps if one is running on Vista or later, switching to an Icelandic user locale (aka "Standards and Formats") will provide a workaround for the Thorn in your side.

    Well, this one, at least!

  • Sorting it all Out

    Improving genitive. Or not.... (part 2): Explaining the point of Part 1

    Previously, in Improving genitive. Or not.... (part 1), I didn't say very much.

    But I suppose in a way I didn't have to, if I wanted you to read between the lines!

    By pointing to an example of a specific limitation in the way that the "detect if genitive is required" algorithm works, one can craft a format string that is mis-detected as requiring the genitive form.

    And the supposition I made in that blog was that no prior format ran into this problem in any locale we shipped and thus it was previously a non-issue.

    I mean, a custom format could cause it, but such customizations are really really really really really really really rare.

    And the fact that no one ever reported it before supports my supposition.

    My further supposition was that once a format existed that runs into the problem, that no one wanted to fix the algorithm to make it better at detecting the need for genitive months.

    I think this second supposition is in some measure more provably true since the algorithm is largely unchanged from the original one written back in 1993.

    The first supposition might be incorrect if people had been complaining but we never heard about it, right? :-)

    I guess I'm making the argument that the algorithm should be improved. The current test is (essentially - I'm not going to post source but for people with access who want to follow along it is in base\Win32\winnls\nlslib\datetime.c) that if:

    • the locale defines genitive month names, and
    • the format has a month or an abbreviated month, and
    • a numeric day is "near" that month or abbreviated month 

    then it decides you're genitive.
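    A rough Python sketch of a test shaped like the three bullets above follows. It is not the code in datetime.c. In particular, the meaning of "near" is my assumption for the example (the day token is the picture token right next to the month token, ignoring separators and quoted literals), and the names picture_tokens and wants_genitive are hypothetical.

```python
import re

# Quoted literals, or runs of the picture letters d, M, y, g.
TOKEN = re.compile(r"'[^']*'|d+|M+|y+|g+")


def picture_tokens(fmt):
    """Return the picture tokens of a date format, skipping quoted literals."""
    return [t for t in TOKEN.findall(fmt) if not t.startswith("'")]


def wants_genitive(fmt, locale_has_genitive):
    """Guess whether the genitive month name should be used for fmt."""
    if not locale_has_genitive:
        return False
    tokens = picture_tokens(fmt)
    for i, tok in enumerate(tokens):
        if tok in ("MMM", "MMMM"):
            # Assumed definition of "near": the immediately adjacent tokens.
            neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
            if any(n in ("d", "dd") for n in neighbours):
                return True
    return False


if __name__ == "__main__":
    for fmt in ("d MMMM yyyy", "MMMM yyyy", "dddd, MMMM yyyy", "d. MMMM"):
        print(fmt, wants_genitive(fmt, True))
```

    Note that a test like this looks straight through punctuation: "d. MMMM" counts as "near" even though a period sits between the two tokens.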

    Okay, so Latvian trips them up here with its format.

    They have a period, which is an obvious way to say "new sentence" to a small Latvian child, though this algorithm that is old enough to buy liquor in some jurisdictions can't tell the difference when the two items near each other span a sentence boundary.

    Of course that character is sometimes also a date separator, too, so it is easy to criticize but much harder to fix!

    Ultimately no one ever considered whether to tweak the algorithm to better catch this case.

    It seems like it would be a worthy bug to fix at some point, right? :-)

  • Sorting it all Out

    Yet another time they messed up. Respectfully.

    So, the movement from LCIDs to names continues apace (ref Your LCID sucks and It is true that your LCID sucks, but your LANGID sucks more).

    Of course not everyone looks at LCIDs the same way.

    Thus Decimal vs. hexadecimal LCIDs, backcompat, and being weird, for example.

    Now I know Office and DevDiv and SQL Server tend to use decimal values for LCIDs and LANGIDs.

    Which I do not.

    And ordinarily I am not judgmental about alternate world views.

    Again, I'm not.

    Though in this case I think they are definitely wrong.

    I realized just a few days ago when Murray told me that over in Office they had to change the size of some data structures that they were storing LCIDs as strings.

    Because they were going off the end of the four character buffer they were using to store them in with some of the bigger LANGID values.

    If they were just using hexadecimal digits then the buffer would have been enough.

    Oops!

    On the other hand, they could have stored them as numbers and it all would have just worked irregardless. :-)
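    The arithmetic is simple enough to check in a few lines of Python. This is just a sketch of the digit counts for 16-bit LANGID values, not Office's code, and the helper names are invented for the example.

```python
def decimal_width(langid):
    """Number of characters needed to write a LANGID in decimal."""
    return len(str(langid))


def hex_width(langid):
    """Number of characters needed to write a LANGID in hexadecimal."""
    return len(format(langid, "X"))


if __name__ == "__main__":
    for langid in (0x0409, 0x2809, 0xFFFF):
        print(f"0x{langid:04X}", decimal_width(langid), hex_width(langid))
```

    Any 16-bit value fits in four hexadecimal digits, but anything from 10000 (0x2710) up needs five decimal digits, which is where a four character buffer runs out.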

  • Sorting it all Out

    You think *your* characters have stories? Let me tell you a character story....

    Today's blog's title should be read with a Groucho Marx accent....

    Over on the Unicode List, Andreas Prilop asked:

    There are three so-called "Yiddish digraphs" in Unicode:
      U+05F0   wawayim
      U+05F1   waw yod
      U+05F2   yodayim

    What is specifically Yiddish about these digraphs?
    They can be used in the same way in Hebrew.
    But this isn't done. Why not?

    http://he.wikipedia.org/wiki/%F8%E9%E9_%F7%E5%F8%F6%E5%E5%E9%E9%EC
    http://he.wikipedia.org/wiki/%F8%D6_%F7%E5%F8%F6%D4%D6%EC

    Why should Yiddish be written with special digraphs
    but Hebrew with sequences of two letters?

    But even in Yiddish, the digraphs are not really used:

    http://yi.wikipedia.org/wiki/%F8%F2%F7%E9%E0%E5%E5%E9%F7
    http://yi.wikipedia.org/wiki/%F8%F2%F7%E9%E0%D4%E9%F7


    The Unicode Standard says:
    | ... to distinguish the digraph double vav from an occurrence
    | of a consonantal vav followed by a vocalic vav.

    By that reasoning you would need an English digraph "sh"
    to distinguish "sh" in "***" from "s-h" in ***hole. ;-)

     Ah yes, the Yiddish digraphs!

    Code point Character Name
    U+05f0 װ HEBREW LIGATURE YIDDISH DOUBLE VAV
    U+05f1 ױ HEBREW LIGATURE YIDDISH VAV YOD
    U+05f2 ײ HEBREW LIGATURE YIDDISH DOUBLE YOD

    Lots of people jumped in and the consensus was along the lines of "I'm not sure, but I think it's legacy".

    Thankfully the guy who should be writing the "Every Character Has a Story" book jumped in to add some surety:

    On 10/19/2011 12:08 PM, Mark E. Shoulson wrote:
    > I think the issue here is (probably) a matter of legacy encodings,
    > though someone else would need to confirm that.

    O.k., as self-appointed historian of the standard, I guess I need to be
    the one to answer that. ;-)

    The Yiddish digraphs were added to the basic set of Hebrew letters for
    Unicode 1.0 on behalf of the Research Libraries Group, for compatibility
    with their existing usage on the Research Libraries Information Network
    (RLIN).

    Digging very deep in the old mailbox, I located email from Joan Aliprand
    of the Research Libraries Group, dating from July 11, 1991 confirming
    this, and noting that "I pushed very hard for inclusion of the Yiddish
    digraphs tsvey vovn and tsvey yudn."

    It is my recollection that the 3rd digraph was added during the
    discussion of
    the addition of those two.

    At any rate, there is your legacy encoding source for these. Whether or not
    the digraphs are used in *current* Yiddish data (or would even be
    recommended for such use) is not relevant to reasons for the original
    inclusion.

    --Ken

    And there we go -- every digraph has a story, too!

    This blog is sponsored by our three Hebrew/Yiddish Digraph friends....

  • Sorting it all Out

    You don't have to hang out with Mary-Kate/Ashley to feel exposed by an Olson connection!

    Insert gratuitous shot of the Olson twins, who this blog is not about, here:

    Several people in the industry tend to bemoan a common pattern whereby we at Microsoft always have to do our own thing.

    We do it in collation. We do it in locales. We do it in time zones.

    For the most part, the reason we do our own thing is that there is nothing there we can use. So we build our own.

    At that point, we usually don't tend to turn around and share it, which is why so often other things are created.

    And no one wants to throw out what we did before since it built up on what people are using and depending on.

    We just can't see the upside to just throwing it all away to pick up the UCA. Or CLDR. Or the Olson data.

    And then we are left with people throughout the world sighing at having to deal with the "Microsoft" case and the "everyone else" case....

    Well, except not this week.

    There was a panic on the Internet late last week.

    It came from the following mail:

     From: "Olson, Arthur David (NIH/NCI) [E]" <olsona@dc37a.nci.nih.gov>
     Subject: Civil suit; ftp shutdown; mailing list shutdown
     Date: October 6, 2011 8:16:02 AM PDT
     To: "'tz@elsie.nci.nih.gov'" <tz@lecserver.nci.nih.gov>
     Reply-To: tz@elsie.nci.nih.gov
     
    A civil suit was filed on September 30 in federal court in Boston; I'm a
    defendant; the case involves the time zone database.
     
    The ftp server at elsie.nci.nih.gov has been shut down.
     
    The mailing list will be shut down after this message.
     
    Electronic mail can be sent to me at arthurdavidolson@gmail.com.
     
    I hope there will be better news shortly.
     
                                     --ado

    This news, that the Olson database was being taken offline due to a lawsuit, caused something of a panic in the i18n space, everywhere except for one place.

    That one place?

    Microsoft.

    Now I don't know how everyone in Microsoft who had a stake in time zones reacted, but I know exactly what I felt.

    Relief.

    Relief that this lawsuit didn't relate to us.

    Perhaps a mild dash of vindication was kind of flowing through there as well -- relief that our "less open" thing/process had at least one thing going for it...

    Well, at least insofar as "not a target in a lawsuit" can be considered a thing, that is.

    So now the story unfolds and blog posts like this one talk about the suit.

    And of course there is the expected discussion on Slashdot.

    I expect it will all resolve itself, eventually.

    But until then, the Unicode Standard folks in the CLDR may be kicking themselves for embracing Olson data, rather than following standards like SQL (where folks like Jim Melton have been long against referencing Olson data, and they have a huge "I told you so" feeling). Kind of like the implied one I have for people inside MS who thought a move to Olson was inevitable.

    All I know is that one doesn't have to hang out with Mary-Kate and Ashley to feel exposed by an Olson connection, these days....

  • Sorting it all Out

    A joke, and the exception that proves the rule....

    Okay, first the joke. My dad sent it to me....

    An Englishman, a Scotsman, an Irishman, a Welshman, a Latvian, a Turk, a German, an Indian, several Americans (including a southerner, a New Englander, and a Californian) an Argentinean, a Dane, an Australian, a Slovakian, an Egyptian, a Japanese, a Moroccan, a Frenchman, a New Zealander, a Spaniard, a Russian, a Guatemalan, a Colombian, a Pakistani, a Malaysian, a Croatian, an Uzbek, a Cypriot, a Pole, a Lithuanian, a Chinese, a Sri Lankan, a Lebanese, a Cayman Islander, a Ugandan, a Vietnamese, a Korean, a Uruguayan, a Czech, an Icelander, a Mexican, a Finn, a Honduran, a Panamanian, an Andorran, an Israeli, a Venezuelan, a Fijian, a Peruvian, an Estonian, a Brazilian, a Portuguese, a Liechtensteiner, a Mongolian, a Hungarian, a Canadian, a Moldovan, a Haitian, a Norfolk Islander, a Macedonian, a Bolivian, a Cook Islander, a Tajikistani, a Samoan, an Armenian, an Aruban, an Albanian, a Greenlander, a Micronesian, a Virgin Islander, a Georgian, a Bahaman, a Belarusian, a Cuban, a Tongan, a Cambodian, a Qatari, an Azerbaijani, a Romanian, a Chilean, a Kyrgyzstani, a Jamaican, a Filipino, a Ukrainian, a Dutchman, an Ecuadorian, a Costa Rican, a Swede, a Bulgarian, a Serb, a Swiss, a Greek, a Belgian, a Singaporean, an Italian, a Norwegian and 47 Africans walk into a fine restaurant....

    The maître d' scrutinizes the group one by one and bars their entrance saying, "Sorry, you can't come in here without a Thai."

    :-)

    So, now that you have Thai on your mind....

    I wrote it just the other day near the end of June.

    An irresistible force walks into an immovable object (aka the Thai that binds us), I mean.

    You know, the thing about the PUA character hidden on the Thai Pattachote keyboard:

    After due and careful consideration, the campaign to remove the PUA looks like it will win over the rule about changing a keyboard.

    But the important question then is what to put there instead....

    Like there is the link to the Oracle site (http://download.oracle.com/docs/cd/E19253-01/817-2521/asian.supported.locales-246/index.html) that appears to be Phinthu -- U+0e3a.

    Or Peter found another possibility:

    This site, which sells keycap stickers, shows a Pattachote layout with basic keys plus three extras not found on all hardware. One has nikahit with nothing on its shifted state. There isn’t any phintu or lakkangyao, though both are very rarely used characters....I’d probably replace the PUA code point with lakkangyao: I think there’s more likelihood of someone typing that than phintu. E.g., you need to type it if you’re enumerating the complete alphabet, but not phintu.

    Fair enough, perhaps lakkangyao doesn't sound unreasonable.

    I'll run it up the flagpole and see if anyone salutes it....

  • Sorting it all Out

    Improving genitive. Or not.... (part 4): On [not] finding the smoking gun

    Previous posts in the series are Improving genitive. Or not.... (part 1) and Improving genitive. Or not.... (part 2): Explaining the point of Part 1. and Improving genitive. Or not.... (part 3): The hazards of "off label" usage

    Now in response to that third part, Van commented:

    So I guess the question is why we can't just leave the genitive month algorithm as is, but incorporate two new month name variables - an invariant nominative and genitive - that will implement regardless of or contrary to the genitive month algorithm. I mean, if your old assumptions don't work anymore, keep the old stuff around for when it's needed and implement a new way that will work, no?

    Okay, so in that one question there are actually several ideas:

    • The idea of not touching the current support at all;
    • The idea of providing explicit, unambiguous ways to get either form without any special code;
    • The idea of providing a new way to fix the problem given these tools, while keeping the old support in place.

    Obviously point #1 is what we have been doing to date. Unfortunately.

    Point #2, if the eventual fix was to be done by Microsoft, has no specific purpose since all that data is already available. The exception would be if the goal was to give customers or ISVs an easier way to do their own parsing and formatting, which also has little point since that is possible now by formatting just "MMMM" vs. "d MMMM".

    The central idea of trying to fix the bug would be to extend the current support, in ways that do not break expectations.

    It seems unlikely that they will do that, given the wide variety of uses that are happening (which is a lot of what this series is about). Changes would have to exist for both parsing and formatting, and perhaps the only hope of that happening is some future version where there are compelling scenarios to fix.

    Up until now, the judge has consistently dismissed the case due to lack of substantive evidence of a problem that needs fixing.

    Put simply, there has to be a "smoking gun" to make that happen.

    At some point, I'll talk more about what probably would be required here, were such a campaign to be launched. At this point, I'm not launching one, or even advocating it happen. But it can be a useful exercise to walk through what would be needed to get resources allocated....

  • Sorting it all Out

    The unused case case (i.e. the case of the unused case)

    So the thing about Digit Substitution is that it is generally an all or nothing feature.

    You either have some special digits and you make the change to support them, or you have no special digits and you choose to not support them.

    Oh yeah, and sometimes you have no special digits yet you change to them anyway.

    That third case is kind of a bug, previously described in A difference that makes no difference makes a blog.

    Ok, so you have these two settings:

    LOCALE_SNATIVEDIGITS:

    Native equivalents of ASCII 0 through 9. The maximum number of characters allowed for this string is eleven, including a terminating null character. For example, Arabic uses "٠١٢٣٤٥٦٧٨٩". See also LOCALE_IDIGITSUBSTITUTION.

    LOCALE_IDIGITSUBSTITUTION:

    Value   Meaning
    0       Context-based substitution. Digits are displayed based on the previous text in the same output. European digits follow Latin scripts, Arabic-Indic digits follow Arabic text, and other national digits follow text written in various other scripts. When there is no preceding text, the locale and the displayed reading order determine digit substitution, as shown in the following table.
                Locale       Reading order    Digits used
                Arabic       Right-to-left    Arabic-Indic
                Thai         Left-to-right    Thai digits
                All others   Any              No substitution used
    1       No substitution used. Full Unicode compatibility.
    2       Native digit substitution. National shapes are displayed according to LOCALE_SNATIVEDIGITS.

     and the three options:

    1. LOCALE_SNATIVEDIGITS are there, and LOCALE_IDIGITSUBSTITUTION has them turned on!
    2. LOCALE_SNATIVEDIGITS are not there, but LOCALE_IDIGITSUBSTITUTION has them turned off anyway, so who cares?
    3. LOCALE_SNATIVEDIGITS are not there, but LOCALE_IDIGITSUBSTITUTION has them turned on -- a classic no-op bug.
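
    The three options above can be sketched in a few lines of Python. This is a toy model, not how Windows does it: the `substitute_digits` helper is hypothetical, "no special digits" is assumed to mean that the native digit string is just the ASCII digits, and context-based substitution (value 0) is deliberately left unmodeled.

```python
ASCII_DIGITS = "0123456789"

# Values mirror the LOCALE_IDIGITSUBSTITUTION table quoted above.
CONTEXT, NONE, NATIVE = 0, 1, 2


def substitute_digits(text, native_digits, mode):
    """Toy model of digit substitution (hypothetical helper, not a Windows API)."""
    if len(native_digits) != 10:
        raise ValueError("expected exactly ten native digits")
    if mode == NONE:
        return text
    if mode == NATIVE:
        return text.translate(str.maketrans(ASCII_DIGITS, native_digits))
    # Context-based substitution depends on preceding text and reading
    # order, which this sketch does not attempt to model.
    raise NotImplementedError("context-based substitution is not modeled here")


# Arabic-Indic digits U+0660 through U+0669, built from code points.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))

case1 = substitute_digits("2011", ARABIC_INDIC, NATIVE)  # digits exist, turned on
case2 = substitute_digits("2011", ASCII_DIGITS, NONE)    # no digits, turned off
case3 = substitute_digits("2011", ASCII_DIGITS, NATIVE)  # no digits, turned on: a no-op
```

    In this model case 2 and case 3 produce identical output, which is what makes the third option a difference that makes no difference.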

    I assume many of my readers know enough about logic to suddenly see the case we've never gotten into before....

    Interesting? I'll tell you, I think so.

    If we did cover this case, where would we use it?

    Would it be a bug like the third case?

    Discuss amongst yourselves; tomorrow I'll jump in with my thoughts....

     

     

     

  • Sorting it all Out

    Improving genitive. Or not.... (part 3): The hazards of "off label" usage

    • 3 Comments

    Previous posts in the series are Improving genitive. Or not.... (part 1) and Improving genitive. Or not.... (part 2): Explaining the point of Part 1.

    I thought I'd do something different today.

    Now all the way back in 2004, near the end of the second month of this Blog, I wrote What the %$#! are genitive dates?, which was remarkable for several reasons:

    • I lied about my grades (I actually got an "A" in grammar and I did learn about genitive case years ago);
    • I am pretty sure it was the first time the genitive month names feature was described on an official Microsoft property;
    • I explained how most people didn't understand the feature, which is why it was never documented before.

    The first and third parts are connected -- despite the fact that I did well on every test in school that explained the genitive case, the fact is that I really was unable to usefully do anything with it until I learned something about a language that had a separate spelling for the separate case (Russian). It just seemed too theoretical, you know?

    My experience with reflexive verbs was the same; until I learned about them for Hebrew, passing tests didn't prove I understood the concepts. It only proved that I took tests well.

    Many people have described their introduction similarly, so I know it wasn't just me....

    Anyway, there are several locales representing languages that, like English, did not have spelling changes associated with genitive month names.

    Some of these locales found another use for the "genitive months" feature.

    Like if they wanted the month name on its own to be capitalized but in a sentence they wanted it to be lower-cased.

    Locales like Portuguese, for example.

    And as you can imagine, this "off label" usage of the Windows "genitive months" feature doesn't always work as people would perhaps wish for.

    It is even occasionally reported as a bug, this "inconsistent" capitalization.

    Just as genitive months don't always work as customers might hope, this "off-label" use of the feature has significant limitations, too. Since it wasn't designed for the scenario, etc....
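
    To see why the capitalization can look "inconsistent", here is a rough Python sketch of the off-label setup. Everything in it is an assumption for illustration: the `format_pt` helper, the one-month tables, and the rule that the "genitive" slot is picked only when a day token is present. It makes no claim about how any particular Windows version actually behaves for Portuguese.

```python
import re

# Hypothetical data: the regular slot holds the capitalized name, and the
# "genitive" slot is used off-label to hold the lower-case name.
STANDALONE = {10: "Outubro"}
GENITIVE_SLOT = {10: "outubro"}

# Quoted literals first, then the longest picture tokens.
TOKEN = re.compile(r"'[^']*'|yyyy|MMMM|dd|d")


def format_pt(picture, year, month, day):
    """Toy formatter: the "genitive" slot is chosen only when a day token is present."""
    has_day = any(t in ("d", "dd") for t in TOKEN.findall(picture))

    def expand(match):
        token = match.group(0)
        if token.startswith("'"):
            return token[1:-1]  # literal text with the quotes stripped
        if token == "yyyy":
            return f"{year:04d}"
        if token == "MMMM":
            return GENITIVE_SLOT[month] if has_day else STANDALONE[month]
        if token == "dd":
            return f"{day:02d}"
        return str(day)  # token == "d"

    return TOKEN.sub(expand, picture)


long_date = format_pt("d 'de' MMMM 'de' yyyy", 2011, 10, 5)
month_year = format_pt("MMMM 'de' yyyy", 2011, 10, 5)
```

    With a day in the format the month comes out lower-cased as intended, but a month-and-year format with no day falls back to the capitalized name even though it sits inside a phrase. That is the kind of result that tends to get reported as a bug.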

    But just as in the pharmaceutical industry, it can make sense to figure out whether these other usages can be helpful too, and whether the "prescription" can be tailored to work better. If the Dev team had known about this alternate usage, they wouldn't have invented the next Rogaine or Viagra (conceptually speaking, I mean - we are just talking about software!), but they would have had yet another reason to try and make it work better!

  • Sorting it all Out

    On [not] being what's next [yet!]

    • 2 Comments

    Over in the Suggestion Box, long time reader and friend Ted asked:

    Metro/Windows 8 - actually I'm kinda surprised that neither Metro nor WinRT show up here yet (at least when I use the search).  There must be something interesting relating to i18n to talk about in these areas.  At least that's what it looks like (judging from the BUILD conference). 

    Well then how about Visual Studio vNext for Win32 desktop - like in msvcr110 they've finally moved to using locale names instead of locale IDs internally.

    Thinking back to the earlier days of this Blog and how there were so many things I talked about here first before anyone else, it may seem odd that I am taking a back seat now as Windows 8 is now available in the form of the Developer Preview.

    Okay, maybe it is a little odd.

    But since then I've grown up a bit!

    Perhaps it's a subconscious reaction to the fact that I had received mail from Frank Shaw (VP in charge of corporate communications) since then. Nothing accusatory or anything, which might have been cause for a freak out I suppose. But even worse, it represented intelligent questions. This is worse because it implies I am a little on the radar, which makes being a little less of a cowboy a responsible idea.

    Or maybe it is that Chris Capossela who I first knew back in 1997 when he was a junior program manager (and he owned the Access Setup Wizard, which I was the dev on) is now the company's Chief Marketing Officer -- and Frank's boss. No directives once again, but the idea of being a little more responsible just comes to mind because I'd rather not be mucking in the message and randomizing him or anyone who works for him.

    More likely there are good explicit reasons to stand back and let things unfold as my division President Steven Sinofsky is blogging himself and there aren't necessarily good reasons to be writing about things before there is enough to write about that people can use. I mean, I'm hardly against sliding in the mustard, but I need some context in which to slide it.

    Another big difference -- we aren't talking about slow platform evolution like adding a new function or even a few new functions - we are talking about a new platform whose best practices are best to talk about after people are confident they have a good first cut of what is best.

    Not to mention we are in that phase where DCRs and such can easily happen as small gaps in scenarios are identified. While such gaps might be natural for me to cover here ordinarily, it would require me to be a jerk to use the Blog as a way to try to force DCRs to happen in a certain way. And perhaps I really would like to not be a jerk if I can avoid it.

    I have had a role in Windows 8 related to locales and keyboards, and at the upcoming Internationalization and Unicode conference in Santa Clara, I'll be talking about some of those things -- things that are available now if you have the Developer Preview, though I'll be able to discuss things in a more focused way since there isn't much of an externally available roadmap.

    I won't be talking so much about Metro or WinRT since that's pretty far afield of my topics, but there are many things that are natural results of the way things have been working and how things should be working.

    Everyone else is excited about that new platform -- and they should be. It is pretty cool! But even these "nominally less cool" pieces are pretty interesting.

    I'm pretty excited about all of it, because some of it is really cool stuff. Perhaps not cool compared to modern touch based apps, but cool from the point of view of things I think are important. And that some of you probably think are important too....

    There are even some potential lessons for those who use and/or contribute to and/or manage the Unicode CLDR project, from their older brother who was doing this before they were even an idea -- lessons that in some cases we learned the hard way. Perhaps we can spare them some extra pain.

    But Metro and WinRT are largely topics for the future, as the final shape of both of them and of their best practices are defined and it makes sense to start putting my spin and take on things.

    So stay tuned. And if you will be at the IUC say hi and let me know this blog was one of the reasons you came! :-)

  • Sorting it all Out

    Improving genitive. Or not.... (part 1)

    • 4 Comments

    Genitive month name usage is a feature that has been in Windows for a while.

    But the original feature hasn't completely done the trick. I mean, in the long run.

    Put simply, we didn't "grow" the feature the way we were growing other areas when new locales challenged our older perceptions.

    Like we had a maximum documented size for a field in a locale but then added a new locale that needs a bigger value?

    That has happened many times.

    But then when you consider genitive date support, what about bugs like the one in Latvian. Genitive. Oops, specifically?

    Now in looking at the earlier versions of Windows when the whole "genitive months" feature was added, it appears that the first versions of it worked properly.

    But over time we ended up with new locales, and in some cases those locales commonly used formats that were harder to detect.

    Ideally the algorithm would be tweaked (this was my #1 in that earlier blog).

    But as we get further and further out and no one reports the bugs, we lose track of things. And it is easy to have people give us the two different data items -- the genitive month names and the various formats -- and not pay as much attention to the interaction between them....

    As a feature I don't expect our architectural support to get worse, but I worry that we don't invest enough to keep up with new issues in the existing support. Like this bug....

    I have a few other issues I'm curious and/or concerned about here, which I'll cover in future parts. Stay tuned!
