
July, 2006

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    What is it about an autograph?

    • 2 Comments

    Absolutely nothing technical here!

    There is something nice about having something with an autograph. What is the fascination with that sort of thing, anyway?

    The other day a friend of mine had an autographed picture of Liz Phair he had to get rid of. He had won it in a contest, but since

    • it was that picture on her Liz Phair album behind the guitar, and
    • he was married, and
    • he wanted to stay married

    there was not a whole lot that he could do with it. So I said a little Apocalypse Now moment and took it -- what the hell else was I going to do.... :-)

    And no, I didn't put it up in my office, I understand why that is a bad idea.

    Though between you and me, I'm not sure what the actual problem is. She may not even actually be naked, and even if she is it's not like we actually know for sure. Since we can't see anything. I mean in addition to what might actually be clothes she is essentially wearing a freaking guitar. Which would mean she is more dressed than someone who couldn't find an excuse to have a guitar in front of them (I know people would be looking at me pretty oddly were I to wear a guitar to work!).

    And the look on her face, I'm not sure what it is supposed to mean, but I don't think it is quite as suggestive a look as some people think. It is more aloof than that. Steve Taylor had a line describing that look in the song Tale O' the Twister -- "A barstool yawn to a stuttered come on, it's a dirt road rut, she said 'button up mister'" -- if you know what I mean. Kind of a "I'm looking for someone who's all that, too bad you're not it" or somesuch.

    Of course it leaves room for some delusional dissembling on the part of fans -- so someone could look at it and think like maybe she is being aloof and she thinks the person she is looking at is all that. But I mean really, how often does that happen? If the answer is even remotely close to ever, it is pretty certain that it is never a fan who really gets that kind of attention....

    Kathleen Edwards once signed a copy of Failer and at the top of what she wrote, she put my name down as "Mike." I wasn't gonna correct it or anything, but I actually remember being a little sad. But then the next night, when I gave her a copy of "Back To Me" to sign, she signed it "Michael" and actually remembered that she had put "Mike" on the other album and said she was sorry about that. Which was undeniably cool since I had not introduced myself the second night at all. That one she wrote "Mike" on is something I'll sorta treasure now.

    Though not as much as the picture she took with me in Dublin after a show there, with the copy of the signed CD for someone else (she wanted to give me proof that she had actually signed it, I swear neither of us was actually drunk at the time). She even stuck her tongue out, for reasons that you had to be there to understand. It was kind of related to a joke she told at the time, and I'll admit that the reason you had to be there is that I can't remember what the joke was. But it really did make sense at the time.

    And then on Saturday I had Vienna Teng sign a copy of her latest CD, after the show. And Julie had her sign one too. I was struck later by how cool it was that Vienna did not write the exact same message on both of them. Add to that the fact that she has an incredible voice and a CS degree from Stanford and it becomes really obvious that if she played in Seattle more often, I'd be at those shows....

    Now Julie did not fully understand why I didn't ask Duncan Sheik to sign a copy of his CD, especially since I had one there. But it kind of felt weird to me being a "total fan" in that case -- cause I'm gonna do some work for him and it just seems better to not be quite the slobbering fan. She didn't quite see the difference, so I guess it's hard to explain.

    I did forget to ask him if it was the same Mark. Something to remember to ask him for next time.

    That reminds me that I ought to do a quick review of the show, it was a lot of fun, and as it is in most cases the meta-show was also once again fascinating. :-)

    Like I said, there is just something nice about having something with an autograph. I'm not sure what it is, or why it seems to make a difference. But it does....

  • Sorting it all Out

    I before E, except after C...

    • 20 Comments

    Now how does that saying go?

    I before E, except after C,
    Or when it sounds like 'A', like in Neighbor and Weigh,
    Or when it sounds like 'Ear', like in the word Weird,
    Unless it sounds like 'Eek', like in Duncan Sheik!

    Ok, I added that last part in. But in my defense I do have that built-in appreciation for singer/songwriters, and it is how his name is spelled. But what do you expect me to do, when Duncan Sheik is playing in Seattle tonight? :-)

    That's right, he will be at Chop Suey, playing with Vienna Teng. Doors open at 9pm.

    You can get tickets here (from TicketWeb), and if you happen to see me scooting along then feel free to say hi and mention if you heard about the show reading here! :-)

    If I get up the nerve, I'll ask Duncan if the 'Mark Liberman' in the "Thanks to:" section of the liner notes in his newest album (White Limousine) is the Penn linguist I've mentioned in the past....

  • Sorting it all Out

    Now serving: International recruits

    • 3 Comments

    A few years back before I came to work for Microsoft full-time, I was at a VBITS in the Speaker's Lounge. I was talking to Jim Fawcette, asking him about how the international subscription base worked compared to the US one, both in methods and percentages.

    I was a little disturbed by what he told me: how small the percentage of international subscribers was (way out of line with Microsoft's sales of the software that the magazines were writing about!), and how the whole model was other magazines essentially buying the content, with no real attempt to go after the markets directly.

    I pointed out that if Fawcette's percentages did not match Microsoft's in the various markets, then he ought to make sure people thought of that as a failure to find all of the people who were willing to give them money. But he really wasn't buying the argument; it was clear that it was not a central part of the model -- what was international was not always being treated as a core part of where everything is -- the developers, the customers, the products, the passion.

    (This all might be different now, like I said it was a few years ago. But I remember my frustration at the viewpoint!)

    I was thinking about this again the other day for a not entirely analogous situation. You see, there is a question I get from time to time about jobs at Microsoft from people who are not in the United States of America. I always am sure to redirect them appropriately to someone who is more qualified to help them out....

    But it was great to stumble the other day across this post, in which Heather points out that the Microsoft Careers site now starts off with a dropdown giving a large list of locations.

    It does feel better to know that international recruiting for positions throughout the world is something that is no longer a "local" concern but a global interest. :-)

    Very cool!

  • Sorting it all Out

    Is that what Microsoft's thinking?

    • 1 Comments

    Some people seem to think that this is what Microsoft is thinking when the issue of those new ads from Apple comes up.

    But on the whole, I think this is.

    :-)

  • Sorting it all Out

    'Localizable' is not always 'Internationalized'

    • 1 Comments

    The other day, developer Ellen sent me mail about the Soft-Keyboard, Spinner and List Scroller sample WPF controls for Media Center, asking if I felt like the move to get developers paying more attention to internationalization in software had turned a corner.

    I took a look at the sample, and at the accompanying docs on building the localized versions. It is easy to look at the English screenshot and the localized one and feel like some important localization work has been done.

    But of course the project really misses the point of covering soft keyboards and proper internationalization, since it does not cover most of the myriad issues that keyboards for many languages raise, from additional shift states to dead keys.

    Or, lest we forget, there are all of the additional keys that have to be present on some keyboards (even if they are soft) to handle IME-type functionality. Like our "sample" above?

    And when you get right down to it, is an MCE remote control really the expected model for rapid text input anywhere?

    But ignoring that point for a moment, Ellen has actually noticed something important here, a phenomenon that should perhaps be a clue for all of us.

    What encourages people to provide "easy localization" in their samples?

    Basically, it is the ability to plug it in simply. If it is just a 'cookie cutter' feature, then it is easy to add. And so people will add it.

    Unfortunately, it leads to a bigger problem -- and that is that LOCALIZATION is much more than what the sample provided, which is a limited form of LOCALIZABILITY that is not really able to cover the core internationalization requirements of most markets into which one might wish to localize.

    The real problem is that there are no "cookie cutter" solutions to that problem. So I suspect we will see more and more samples that give us a slice or two rather than the whole loaf. :-(

     

    This post brought to you by ᡂ (U+1842, a.k.a. MONGOLIAN LETTER CHI)

  • Sorting it all Out

    The download you requested is unavailable.

    • 2 Comments

    Remember earlier this month when I was talking about the Update to the mitigation tools for IDN security problems?

    Well, it turns out that Wikipedia has an external link to the tools in its Internationalized Domain Names article.

    Unfortunately, they are pointing at the old download site of the 1.0 tools I posted about originally. A link that no longer resolves. :-(

    Luckily the link is far enough down on the list that people haven't noticed, mostly. In fact if Sergey hadn't mentioned it to me I wouldn't have noticed (not being a paid Wikipedia fact checker does cut into the amount of time I could really spend on such a project!).

    I am sure it will all work itself out, and someone will update the link shortly to point to the correct link.

    And then also, just as hopefully (and so that no one takes this as a pure Wikipedia bashing blog post!), Microsoft won't move the link next time they do an update. :-)

     

    This post brought to you by о (U+043e, a.k.a. CYRILLIC SMALL LETTER O)

  • Sorting it all Out

    When will we support Rongo-Rongo?

    • 20 Comments

    A few years back, John McConnell gave a day 2 keynote at the 26th Internationalization and Unicode Conference, entitled The Windows Language Roadmap or When Do We Get Rongo-Rongo?.

    The subtitle, in a bold tradition that was subsequently taken up by this very blog you are now reading, had little to do with the actual presentation, but provided an interesting title and a fun story that cannot be found in the slides (leaving people who did not attend the talk wondering what it was all about, just as with the moose at the end of the presentation!).

    (He did give a slightly longer version of the talk at the 2004 Global Development and Deployment Conference, and that one has the advantage of a video version of the presentation being online for your enjoyment. :-)

    Anyway, for your reading pleasure I will (with John's permission) provide the transcript of the story below, but it is right at the beginning of the video and definitely worth listening to in John's unique storytelling style if you have the time (since I did not include a laughtrack it's the only way you can find out where the crowd was amused!).

    Enjoy!

    I've had several people ask me about the title of this talk "When do we get to Rongo-Rongo?". Some people thought I made up the name. I'll explain, it has a little bit of a personal history.

    One of the very first projects I had when I was still a developer involved in globalization was back in the mid-80s. It was for a very large customer whose name I can't mention, but they're in Langley, Virginia.

    The assignment I had was to support bidirectional text; technically the documents supported left-to-right, it did not support bidirectional. So I understood and worked with people who understood bidirectional text and I was able to work that out.

    But being the ambitious little nerd that I was, I went off to a library and I decided I would find out more about writing systems. Because I knew vaguely that East Asian text was written vertically and I thought, 'well maybe I should generalize my code so I can support vertical writing.'

    So the library was a wonderful resource. I found out about ancient Greek writing, which (I'll probably say this wrong) Boustrophedon, where they would write one line going one way and then the next line would start there and go back. And that was very appealing to me.

    But then even better was Rongo-Rongo, which it sounds like it's made up by teenagers or something, but it was a language used on Easter Island, or I shouldn't say language, a writing system on Easter Island.

    I believe there's only like 120, some small number of samples. They are on these large round disks. It has never been fully deciphered.

    But the thing that was really wonderful about it is it's written sort of like Boustrophedon, but when you get to the second line, rather than just going backwards, it actually turns upside down.

    So this really put me into a fever, writing the code.

    So, unfortunately in that particular coding assignment I ultimately concluded that I couldn't support Rongo-Rongo -- the performance hit was just a little too great.

    And so, when I delivered the software to the salespeople they said "What languages does it support?" and I said "It'll support anything except Rongo-Rongo."

    I said this as sort of a joke, but about a month later we had the version two requirements, which said that "Version two must support Rongo-Rongo."

    So ever since that experience it's been the goal at the end of the rainbow, it's where we will eventually get to before I retire....

    The full presentation talks about ELKs and LIPs and lots of the other things I talk about here, and is worth a listen, in my opinion. :-)

    So here is a quick and dirty Q&A:

    Q: What company was John working for back in the mid-80s?

    A: He was working for DEC at the time, though the contract was for that customer in Langley, Virginia. 

    Q: Does Unicode support Rongo-Rongo?

    A: Rongo-Rongo is not yet encoded in Unicode.

    Q: Does Vista support it?

    A: The first step that Windows requires when it comes to language support is support within Unicode (after that we can get into fonts and shaping engines and such), so given the answer to the previous question, the answer to this one would also be no.

    Q: Will Microsoft ever support Rongo-Rongo?

    A: It is worth noting that John has not retired yet, so who knows what the future holds? It is still at the end of the rainbow....

     

    This post brought to you by ༃ (U+0f03, a.k.a. TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA)

  • Sorting it all Out

    The Cantonese IME (not for input of characters from Canton, Ohio)

    • 64 Comments

    Last month I was talking about how Feature ideas don't always turn out to be good ones. And I mentioned how I'd probably talk about other cases in the future.

    What can I say besides welcome to the future. :-)

    In Vista, from the time when it was just Longhorn, there has been enhanced collation support for all of the CJK locales. The stroke count sorts and Mandarin pronunciation (both Pinyin and Bopomofo) sorts all covered more characters, the Korean Hangul pronunciation sort was enhanced too, and the Japanese locale got a new alternate sort to cover everything in JIS X 0213. Basically a lot of work was done.

    But there was one area that was not covered that was really bothering me -- there was no support for a Cantonese sort of any kind.

    "But isn't Cantonese," you might ask, "a spoken dialect, not a written one?"

    The Wikipedia article Written Cantonese gives a good answer to this question in its introduction:

    Written Cantonese refers to the written language used to write colloquial standard Cantonese using Chinese characters.

    Cantonese is usually referred to as a spoken variant, and not as a written variant. Spoken vernacular Cantonese is different from standard written Chinese, which is essentially formal Standard Mandarin in written form. Written Chinese spoken word for word in Cantonese sounds overly formal and distant. As a result, the necessity of having a written script which matched the spoken language increased over time. This resulted in the formation of additional Chinese characters to complement the existing characters. Many of these represent phonological sounds not present in Mandarin. A good source for well documented written Cantonese words can be found in the scripts for Cantonese drama and Cantonese opera.

    With the advent of the computer and standardization of character sets specifically for Cantonese, many printed materials in predominantly Cantonese spoken areas of the world are written to cater to their population with these written Cantonese characters. As a result, mainstream media such as newspapers and magazines have become progressively less conservative and more colloquial in their dissemination of ideas. Generally speaking, some of the older generation of Cantonese speakers regard this trend as a step "backwards" and away from tradition. This tension between the "old" and "new" is a reflection of a transition that is taking place in the Cantonese speaking population.

    And if you look at the major population centers with people who use Cantonese, there are clear efforts to support this development among many of the native speakers (and writers) of Cantonese.

    There are some cultural issues that even I was faced with when doing research here that I will discuss further in a follow-up post....

    Of course one of the big problems has been that there are multiple romanizations used to represent the pronunciations, and unfortunately they are often used in the same lists (like phonebooks in Macau and elsewhere that allow people to simply enter the pronunciation). How can you hope to sort the phone book consistently if the people providing the pronunciations have different ideas of how even identical pronunciations are to be represented?

    But lots of work has been done to try to help with this issue, for example the Jyutping system produced by the Linguistic Society of Hong Kong (LSHK). And many people have been trying to use it -- for example the government of the Hong Kong SAR's Chinese Language Interface Advisory Committee (CLIAC) has produced the Cantonese Pronunciation List of the Characters for Computers, a huge set of data providing Cantonese "Pinyin-esque" style pronunciations for much of the Hong Kong Supplemental Character Set (HKSCS).

    When I first saw that we would have a list of over 30,000 ideographs and their pronunciations, I was excited -- perhaps this data could be used to provide a Cantonese sort for the people in Hong Kong and elsewhere who wanted it?

    But unfortunately, while there is much that is good about Jyutping, it has one liability at present, one that it shares with Yale and other romanization systems: and that is that there are several romanization systems. And there is not yet one that is ubiquitous.

    Another problem that exists is that for the 30,764 unique ideographs given pronunciations in the CLIAC-provided doc, there are less than 2,000 unique pronunciations (less than 700 if you do not include the tone values).
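
    To make those two counts concrete, here is a minimal Python sketch of how they could be tallied. The handful of entries is an illustrative sample picked for this example (it is not taken from the CLIAC data), and strip_tone is a hypothetical helper that assumes the tone is written as a trailing digit, as it is in Jyutping.

```python
# Illustrative sample only: a few characters with Jyutping readings.
# The real CLIAC list is far larger and is not reproduced here.
SAMPLE = {
    "詩": "si1",
    "絲": "si1",
    "史": "si2",
    "試": "si3",
    "時": "si4",
    "市": "si5",
    "事": "si6",
}


def strip_tone(syllable):
    """Drop the trailing tone digit(s) from a reading (hypothetical helper)."""
    return syllable.rstrip("0123456789")


def pronunciation_counts(readings):
    """Return (characters, unique readings with tone, unique readings without tone)."""
    with_tone = set(readings.values())
    without_tone = {strip_tone(reading) for reading in with_tone}
    return len(readings), len(with_tone), len(without_tone)


if __name__ == "__main__":
    print(pronunciation_counts(SAMPLE))
```

    Even in this tiny sample the number of distinct readings is smaller than the number of characters, and smaller again once the tones are dropped, which is the same squeeze the full list shows at scale.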

    And yet another problem is in the decision about tones -- some number the tones in Cantonese at nine, while others claim that three of these are unimportant distinctions and that there are only six to worry about. So it is not just different romanization systems, which vary enough with place names like Canton and Guangzhou coming from the same word, but even if people agree on the romanization they may differ on their opinion of the tones (with some believing that tones 7, 8, and 9 actually fold into 1, 3, and 6 respectively).
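
    The folding of tones 7, 8, and 9 into 1, 3, and 6 is easy to sketch in Python. This assumes readings carry a single trailing tone digit; fold_tone and the sample readings are hypothetical and only illustrate the mapping described above, not any shipping data.

```python
# Mapping described in the text: tones 7, 8, 9 fold into 1, 3, 6.
FOLDED_TONES = {"7": "1", "8": "3", "9": "6"}


def fold_tone(syllable):
    """Rewrite a nine-tone reading in six-tone form (hypothetical helper)."""
    if not syllable or not syllable[-1].isdigit():
        return syllable  # no tone digit, nothing to fold
    base, tone = syllable[:-1], syllable[-1]
    return base + FOLDED_TONES.get(tone, tone)


if __name__ == "__main__":
    for reading in ("sik7", "sek8", "sik9", "si4"):
        print(reading, "->", fold_tone(reading))
```

    Two lists that disagree only on the tone count can be compared after running both through a fold like this one, though that by itself does nothing about differences in the romanization.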

    And the final problem, there is not yet a clear and established standard on how to break ties -- once you decide which Han have the same pronunciation, how do you decide which one comes first?
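
    Just to show what a tie-break rule would have to look like, here is a Python sketch that orders entries by toneless syllable, then tone, then Unicode code point. That last step is purely an assumption made for the example (as noted above, there is no established standard), and the function names and sample entries are hypothetical.

```python
def split_reading(reading):
    """Split a reading like 'si1' into ('si', 1); tone 0 means no tone given."""
    base = reading.rstrip("0123456789")
    tone = reading[len(base):]
    return base, int(tone) if tone else 0


def sort_key(entry):
    """Key for a (character, reading) pair; expects a single character."""
    character, reading = entry
    base, tone = split_reading(reading)
    # Assumed tie-break: code point order among identical readings.
    return (base, tone, ord(character))


def cantonese_sorted(entries):
    """Sort (character, reading) pairs using the assumed rules above."""
    return sorted(entries, key=sort_key)


if __name__ == "__main__":
    sample = [("事", "si6"), ("詩", "si1"), ("史", "si2"), ("絲", "si1")]
    for character, reading in cantonese_sorted(sample):
        print(character, reading)
```

    Code point order is convenient because it is always available, but it carries no linguistic meaning, so a real sort would more likely want something like stroke count as the final key.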

    There was just not enough of a consensus yet to try to push ahead in Windows with providing such a sort. Because Microsoft has no interest in dictating language policy; we just want to identify it so that we can represent things the way customers would like them.

    But this now brings us to input methods.

    Like I said way back in December of 2004, IMEs have it easy. In this case because (if for no other reason) if you identify a rich new source of pronunciations you can simply add them to the IME if you like them. Or you can provide different IMEs using the different systems, too (assuming you have enough data!).

    Anyway, enough of the backstory, right? Let's get to the IME, like I said I would!

    The steps are the same as they were with the Unicode IME. Just grab the file from here (871 kb) or you can grab the zipped version here (144 kb).

    1) Copy the text file to \Program Files\Windows NT\TableTextService on your Vista machine (if the "Program Files" on your machine is another language, use that directory, do not create a new one!).

    2) Open an elevated command prompt and navigate to that directory.

    3) Run the following from that command prompt:

    rundll32 TableTextService.dll RegisterProfile TableTextServiceCantonese.txt

    4) Say OK to the dialog that comes up verifying you want to install it:

    You can now add the Chinese Hong Kong Cantonese IME to the Chinese (Hong Kong S.A.R.) locale by going through the following steps that are illustrated here.

    Now like the Unicode IME this is a sample, and further this is a work in progress. There are lots of things I would like to do to tweak settings here, such as how (or whether) the list should be sorted.

    (And if I find other huge caches of Cantonese pronunciations in other romanizations I might even see whether they could be productively combined.)

    And like I said, in an upcoming post I will talk about many of the cultural issues I ran across while doing the research here -- they are fascinating!

     

    This post brought to you by 䕫 (U+2f9b2, an Extension B ideograph in HKSCS with a Jyutping pronunciation of kwai4)

  • Sorting it all Out

    The name of the enum is KeysEx, dammit

    • 12 Comments

    When I read Geoffrey K. Pullum's PowerGenItalia and PenisLand, I was once again struck by how funny it is that some people can take the KeysEx enumeration and read the name differently.

    I have gotten the same grief for other class extensions that use the Ex suffix, like ControlsEx (which was actually a filename for a file that contained the extended definition of several controls). The fact that all of the file names were lowercased probably did not help.

    Now I can understand the situations that URLs can get into (where case insensitivity requires the equivalence with unintended terms), and I can forgive those who stumble on a filename that has been munged and get the wrong idea.
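
    The difference that casing makes can be sketched in a few lines of Python. With the capitals intact there is exactly one split; once the name is lowercased, the word boundaries have to be guessed from a dictionary. Both helpers and the tiny word list are hypothetical, made up for this illustration.

```python
import re


def split_pascal_case(name):
    """Split an identifier at its capital letters, e.g. KeysEx -> Keys + Ex."""
    return re.findall(r"[A-Z][a-z0-9]*", name)


def segmentations(text, words):
    """List every way to carve lowercased text into words from a given set."""
    if not text:
        return [[]]
    found = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            for rest in segmentations(text[end:], words):
                found.append([head] + rest)
    return found


# Tiny hypothetical word list, just enough to show the ambiguity.
WORDS = {"key", "keys", "ex", "sex"}

if __name__ == "__main__":
    print(split_pascal_case("KeysEx"))
    print(segmentations("keysex", WORDS))
```

    The first call has only one possible answer, while the second returns more than one reading of the same letters, which is the situation a lowercased URL or filename puts the reader in.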

    But obviously there was no munging of KeysEx, so this is a situation where the people who have contacted me are suggesting that the name itself is suggestively ambiguous even with the capitalization intact.

    This is a whole different problem.

    But what do you think causes it? Is it the explosion of internet slang like in this post:

    OMG LOLOLOLO U SUK!!!!!!!11 Translation : You suck!
    OMFG R U SERIUS??? <<PERSON>> IS SUCH A N00B!!!1 : Are you serious ? <<person>> is such a noob!
    CAN I HAV SUM FREE STUFF PLZ???? : Can I have some free stuff, please?
    give me mony or i kill j00!! : Give me money or I will kill you!
    im ur gf nub........giv me free stuff!!!1! : I'm your girlfriend, noob! Give me free stuff!

    that causes readers to develop whole new filters that simply ignore clues like capitalization?

    Are we to assume that the Capitalization Styles in programming languages are ineffectual since they are not being parsed by programmers?

    Do people just have dirty minds?

    Or is there something else going on here?

    [Update 8:36am] At least (as Raymond proves), some developers still use the clues that casing provides -- although a bit of language knowledge can at times interfere!

     

    This post brought to you by ዽ (U+12fd, a.k.a. ETHIOPIC SYLLABLE DDE)

  • Sorting it all Out

    It's great to have stuff in the top 10 (sort of)

    • 0 Comments

    I was looking at the MSDN home page and I noticed the Top 10 Downloads list:

    See that MSLU link? Cool! Not bad for a technically unsupported tool!

    Of course it was a little sad that MSKLC was not in the top 10; in fact, it wasn't even in the top 100:

    but on the other hand, at least the Microsoft Layer for Unicode is on the top 10 list. I was curious if it was really #4 like that main page list implied, so I thought I'd check its rank:

    So #36 is in the top 10. Must be that new Microsoft math or something, that rounds way the freak down!

    If MSLU weren't there, I'd assume it was a marketing thing, given the prominence of the .NET Framework downloads and all. It seems like they are using some bizarre algorithm, kind of like the one they use to elect presidents in the United States? :-)

    I am just kidding, it is probably the difference between "developer" downloads and all types, with the developer downloads on the MSDN site.

    Oh well, it is just nice to know people are still interested....

    Someone even pointed out to me that there was even a Wikipedia article about MSLU, which even links here to this blog (to the MSLU category). It was interesting to read more about the open source projects replacing both the LIB and the DLL, imitation being the sincerest form of flattery and all. :-)

    Nothing about MSKLC in Wikipedia, though. Sigh....

    (and no, this is not a hint to anyone to write such an article. I am a much bigger fan of these things generically working themselves out without direct involvement using an almost Trekian devotion to a non-interference directive!)

    Though in any case, they were both cool projects to work on. So I'm not gonna complain. It was just nice to see one of them in the sorta top 10!

     

    This post brought to you by ௰ (U+0bf0, a.k.a. TAMIL NUMBER TEN)

  • Sorting it all Out

    I bless the rains down in Afrika[ans]

    • 17 Comments

    Ok, if I could get اردو, മലയാളം, Qhichwa Simi, فارسی, isiZulu, ಕನ್ನಡ, नेपाली, Lëtzebuergisch, कोंकणी, Setswana, বাংলা, తెలుగు, and ਪੰਜਾਬੀ to all move over a little bit.

    Because the Afrikaans Language Interface Pack is now available!

    Some info about Afrikaans:

    Number of speakers: 4 million

    Name in the language itself: Afrikaans

    Afrikaans is one of the 11 official languages of South Africa and is spoken mainly in the western one-third of the country, especially the Northern Cape and Western Cape provinces. After Zulu and Xhosa, Afrikaans has the third-largest language community in South Africa. It is also spoken in the neighboring country of Namibia.

    The language is a heritage of the Dutch colonization of areas of today’s South Africa from the 17th century onwards – which is why it was originally known as “Kaaps-Hollands” (“Cape Dutch”). It gained loanwords from languages of other settlers (mainly English, French and German) and the surrounding African people and underwent grammatical simplification and some phonetic changes. Afrikaans became a literary language about a century ago after it had been a spoken language only, and it replaced Standard Dutch (which had been the written language until then) officially in 1925.

    Fun facts:

    • Afrikaans is the only language that has its own monument: The Afrikaans Language Monument (Afrikaanse Taalmonument) is located near Paarl in the Western Cape Province and was completed in 1975.
    • A famous English loan word from Afrikaans is trek (as in Star Trek). It means long, hard journey.
    • Afrikaans has a double negative like the French ne ... pas which is used with composite verb forms: Hy het niks gedoen nie is literally He has nothing done not.

    Classification:

    Afrikaans is principally derived from the same 16th-century Dutch dialect that led to modern Dutch and is closely related to that language. Both belong to the West Germanic branch of the Indo-European language family. Afrikaans is the youngest Germanic language.

    Script:

    Latin script, 26 letters (like in English), with c, q, x, and z rarely being used. There are special characters: è, é, ê, ë, î, ï, ô, û

    Enjoy!

     

    This post brought to you by A (U+0041, a.k.a. LATIN CAPITAL LETTER A)

  • Sorting it all Out

    Is East Asian support installed?

    • 6 Comments

    Michiel Salters asked in the Suggestion Box:

    MLang font linking works without us asking, it's just too nice. Really too nice, because MLang isn't telling us it failed. And in fact, we can't figure out that it's failing.

    So what's the problem? Most of our users won't have east asian fonts installed. Our application needs to know this. However, due to font linking MLang is rendering our text. It just renders them as a set of boxes.

    The first 10 algorithms we tried to distinguish machines with and without east asian fonts have failed. Among the obvious:

    • CreateFont() with CHINESEBIG5_CHARSET: works even without east asian fonts installed.
    • Call GetGlyphIndices for the font created for CHINESEBIG5_CHARSET: suggests that the font created has real glyphs (i.e. U+4E00 and U+4E01 have different glyphs).
    • Use IMLangFontLink2 and call GetScriptFontInfo: MSDN documentation is incomplete here, and it just returns font names. We can't hardcode every font name out there.
    • Compare IMLangFontLink2::GetFontCodePages with IMLangFontLink2::GetCharCodePages() for U+4e00.
    • Create a font with IMLangFontLink2::MapFont, passing U+4E00 to get a chinese font (fails straight away).
    • Get the chinese codepage using GetCharCodePages for U+4E00, then create a font for the codepage using MapFont. That does work, but it also works if there is in fact no font installed.
    • Create the font via that chinese codepage, then get the glyphs with GetGlyphIndices for U+4e00 and U+4e01. Now we were getting pretty annoyed. If there are no east asian fonts installed, we still get a font, GetGlyphIndices tells us the glyphs for U+4e00 and U+4e01 differ, yet all we get is boxes (obviously).
    • The final lie was IMLangFontLink2::GetFontUnicodeRanges. If there are no east asian fonts installed, it will in fact return a range that contains all 20.000+ CJK ideographs. Apparently that means 20.000 different(!) box glyphs.

    We currently see two options. The real east asian fonts are incomplete. We can just hardcode a codepoint that is currently missing, and check if IMLangFontLink2::GetFontUnicodeRanges claims support. If so, we assume it's lying about all code points. Of course, if Vista adds that code point, or the user installs a font which adds that missing code point, this assumption is wrong.

    The only solution we haven't tried, and we really can't imagine we have to resort to this, is rendering U+4e00 to a bitmap and seeing if this renders a box (two vertical lines connected by two horizontal lines)

    So, where is the BOOL IsThereAFontFor(wchar_t) function hidden? And did I mention we'd want it on Windows ME/IE5 as well?

    Part of the problem here is simply stated -- MLang is not all about answering whether support is not there -- it is all about trying to make the support work.

    And of course the GDI problem (like using CreateFont with the CHINESEBIG5_CHARSET) is similar -- as a function it will do its best to create the requested font but if the final answer is not perfect, GDI does not spend too much time weeping. If you catch my drift.

    If you really need to know whether East Asian support is installed, the best way to do it is to use the NLS IsValidLocale function with the LCID_INSTALLED flag -- it is the way that you know whether the OS thinks that all of the supporting files are there....

    Be sure to pick the right LCID on downlevel systems, since everything prior to XP was a bit more granular about language support. It is no big deal, just make sure you pick the LCID that best captures the support you are looking for, that's all. :-)
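    As a rough sketch of that check (Python via ctypes, purely for brevity): the helper name is_locale_installed and the handful of LCIDs listed are my own picks for illustration, not a recommendation of which LCID best captures the support you need. On anything other than Windows the helper simply returns None, since there is no NLS API to ask.

```python
import sys
import ctypes

# dwFlags value for IsValidLocale: is the locale installed on this machine?
LCID_INSTALLED = 0x00000001

# Illustrative picks only, not an exhaustive list of East Asian locales.
EAST_ASIAN_LCIDS = {
    "Chinese (Traditional, Taiwan)": 0x0404,
    "Japanese (Japan)": 0x0411,
    "Korean (Korea)": 0x0412,
    "Chinese (Simplified, PRC)": 0x0804,
}


def is_locale_installed(lcid):
    """Return True/False from IsValidLocale, or None when not on Windows."""
    if sys.platform != "win32":
        return None  # hypothetical fallback: nothing to query here
    return bool(ctypes.windll.kernel32.IsValidLocale(lcid, LCID_INSTALLED))


if __name__ == "__main__":
    for name, lcid in EAST_ASIAN_LCIDS.items():
        print(f"{name} (0x{lcid:04x}): {is_locale_installed(lcid)}")
```

    A FALSE answer here only tells you what the OS thinks about its own supporting files; it says nothing about fonts that some other product may have dropped on the machine.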

    I can't say too much about the IE5/Windows Me situation -- I mean, since as I said in Is MSLU Still Supported?, Windows 98 and Windows Me aren't supported anymore. Though if you truly want to go beyond the OS answer when it says FALSE you can see if you have the IE language support -- either by using the model described in Installable language components in Internet Explorer or by looking for the specific fonts that it installs per language....

    Generally, I find this whole scenario pretty unconvincing. Maybe that is just me.

    But we are talking about users who don't install East Asian support. Yet these users expect a program to magically support CJK ideographs. Users who, further, will refuse to understand that they need to install the EA support if they are seeing NULL GLYPHS instead of ideographs.

    When I was consulting, I never had problems pointing out this particular requirement; I have even helped write up the justification to companies that were not always willing to allow the users permissions to install stuff like this. It is just not the sort of problem that I have found a huge lack of understanding about.

    But if that fails, you can just blame Microsoft. It's not like everyone else doesn't do that. :-)

     

    This post brought to you by ៘ (U+17d8, a.k.a. KHMER SIGN BEYYAL)

  • Sorting it all Out

    Our non-Unicode heritage

    • 6 Comments

    Apologies for the small George Carlin riff in italics below, it is based on the Civil War bit he did during his New Jersey HBO special back in the early 90s. I lack the budget to have Mr. Carlin do a Podcast saying this bit, so please use your imagination to get the full effect!

    The first version of Windows (1.0) shipped back in 1985, and it didn't have all that much in the way of impressively compelling international support. There were other good reasons for nobody to buy it, so most people probably didn't notice the lack.

    Anyway, about seven years after the first version was released, seven years later, Windows NT 3.1 was shown at the PDC in San Francisco. And it supported Unicode.

    Not so you'd really notice it, of course.

    Just sort of 'on paper.'

    Of course now, fourteen years later, and Microsoft is planning on shipping Windows Vista, a fully Unicode operating system.

    But not so you'd really notice it.

    Because we still have these components that don't support Unicode.

    Components who figure a code page is a really keen way to encode.

    And the developers study the encoding carefully, and they try to improve on the strategies and the tactics to increase the component's utility. In case we have to go through writing new non-Unicode support some time. [sarcasm]

    In fact, some of these components actually get used in top of the line applications and they go out and shoot for the moon with the features they provide.

    You know what I say? Use live ammunition, would you please?

    That was fun. :-)

    Anyway, let's get down to it.

    One of those components, I mentioned briefly in this post: wininet.dll.

    It came up because we had an interest in changing the defaults for the NtfsAllowExtendedCharacterIn8dot3Name setting, documented as:

    Specifies whether the characters from the extended character set, including diacritic characters, can be used in short file names using the 8.3 naming convention on NTFS volumes.

    Value    Meaning
    0        On NTFS volumes, file names using the 8.3 naming convention are limited to the standard ASCII character set (minus any reserved values).
    1        On NTFS volumes, file names using the 8.3 naming convention may use extended characters.

    This entry does not exist in the registry by default. You can add it by using the registry editor Regedit.exe.
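    If you would rather peek at the value from code than from Regedit, here is a minimal sketch. The key path (HKLM\SYSTEM\CurrentControlSet\Control\FileSystem) is not given in the topic quoted above, so treat it as my assumption about where the NTFS settings live; the function name is hypothetical too. It returns None when the entry is absent (the default) or when not running on Windows.

```python
import sys

# Assumed location of the NTFS file system settings; not stated in the quoted topic.
FILESYSTEM_KEY = r"SYSTEM\CurrentControlSet\Control\FileSystem"
VALUE_NAME = "NtfsAllowExtendedCharacterIn8dot3Name"


def read_extended_8dot3_setting():
    """Return the setting as an int, or None if absent or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FILESYSTEM_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return int(value)
    except FileNotFoundError:
        # The entry does not exist by default.
        return None


if __name__ == "__main__":
    print(VALUE_NAME, "=", read_extended_8dot3_setting())
```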

    Of course what is not mentioned in that informational topic is that years ago it was decided that this value should be set anytime the default system locale was Chinese, Japanese, or Korean (and unset anytime it wasn't).

    There are several problems here --

    1. It is mildly inconvenient, as we are trying to reduce the number of dependencies on the system locale
    2. We are also (generally) trying to reduce odd interactions like this one, which are so hard to track
    3. The old logic actually stomps on the preferences of anyone who actually uses the setting, any time the locale is changed

    After talking with various partners and knowledgeable people in the file system and the various markets, we tried just setting it always and being done with it. Unicode had been around for some time, maybe it was "time to cut the cord" (the exact words of one of the file system architects).

    In fact, if you have Beta 2 of Vista then that is what you have on your install.

    Everything was going great until we found out that that one several-year old baby still had the cord attached. :-(

    The wininet cache (that is used to basically cache everything that various processes including IE use accessing the internet) does not support Unicode, since wininet.dll doesn't (wininet supports a Unicode interface that converts anything you throw at it, but that is more or less it).

    Now for a page on the web it would not be too noticeable; after all, if a cache item of an internet access cannot be reached, then it just wouldn't get used -- you just go right to the internet. Unfortunately if you have a user name that isn't on your default system code page then the path to the cache itself is broken. So you fail even trying to get to it to fail -- so basically you lose Internet Explorer.
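    To get a feel for which user names fall "off" a given code page, here is a small sketch. It leans on Python's built-in codecs (cp1252, cp932), which are strict and are only an approximation of what the Windows conversion functions do (those may substitute a best-fit character or a question mark rather than fail), and the function name and sample names are made up for illustration.

```python
def fits_code_page(name, code_page):
    """Return True if every character of name can be encoded in code_page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Hypothetical user names; cp1252 is Western European, cp932 is Japanese.
    for user in ["michael", "Zoë", "太郎"]:
        print(user, fits_code_page(user, "cp1252"), fits_code_page(user, "cp932"))
```

    A user name that fails this kind of test against the default system code page is the kind that leaves a non-Unicode component unable to even spell the path to its own cache.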

    Oops.

    Anyway, no worries, even though no beta customers had reported the problem, there was no sense waiting for a report -- clearly this was a big enough regression that it had to be fixed.

    The change has been reverted for future builds, so that wininet.dll's lack of Unicode support (and, incidentally, the Windows non-Unicode heritage!) is preserved for another version.

    Though I suppose it means that there aren't a whole lot of Windows user names off the default system code page that are used on CJK system locale machines. Or if there are then those customers probably don't try to use IE much. Since they are as broken in the prior versions as they will be in the new version.

    And of course the people who use that NtfsDisable8dot3NameCreation setting to block the creation of short file names are probably not going to be too happy either if they have long user names or names with characters off the default system code page, for roughly the same reason.

    In the end I am not really too worried about it since both ANSI support and short file name support on NTFS are there for backwards compatibility. So I suppose the overlap is consistent enough that people are not hitting this particular bug much.

    But it is a story that I have been shaking my head about since the problem was identified....

    This post brought to you by ඤ (U+0da4, a.k.a. SINHALA LETTER TAALUJA NAASIKYAYA)

  • Sorting it all Out

    Blogging is stupid, sometimes

    • 2 Comments

    I was asked by a friend if I had an opinion about Rory's Blogging is Stupid post.

    I am probably the wrong person to ask, truth be told.

    I have actually had people tell me that they stopped reading this blog because of a specific post, like this one. I have had people suggest I should be summarily fired for another one. And I have had people tell me I am offensive and rude for other posts.

    At least two people have told me that they more or less can't stand this blog and they only read it for the same reason they used to read my CIS messages and/or newsgroup posts -- because they want to see what I will say next.

    I am reminded of that Bobcat Goldthwait joke I quoted before as a way to officially respond to most of these criticisms. I can't comment on the "being fired" part other than to say that I guess my employers didn't agree since I was not, in fact, fired....

    On the whole I agree with Rory's post (except the part about cleaning myself with a cat). The subject matter of this blog is still stuff I find interesting. And a lot of that is in the work that I do and that my colleagues do. Some of it also about music I enjoy, and some is about MS, and the rest is just random stuff that catches my eye.

    I am glad that I posted about Multiple Sclerosis and my various trials and tribulations related to it. There are people who have said I am crazy to have done so and even crazier to keep doing so, but I guess I don't see it that way. If you do, and if you are offended by the posts, then I guess I have missed out on that part-time "role model" job, but it probably would violate my moonlighting agreement anyway.

    For the small subset of readers who are in prison and have no control over what pages are brought up in the browser, just tell the warden I am secretly giving you the location for the file that is hidden in your cellblock so that they should go to another blog before they find themselves with a record number of breakouts.

    For everyone else, if you are offended by a post, the best thing to do is not read it. If I stop being me then there isn't very much point to it, you know?

    And if you are offended by the whole blog then what the hell are you doing here? There is lots of stuff that is interesting that you can try instead....

  • Sorting it all Out

    Our highly internationalized OS uses DPI, aka Dots Per In-.... um, never mind!

    • 13 Comments

    Windows has an impressive story when it comes to both internationalization and localization. I mean, sure there are occasionally things that I don't like, but most of those bugs get fixed and I hope they will all eventually be addressed one way or another.

    But there is one issue that in the minds of many irrefutably shows that Windows is a product from a US company....

    It is in the whole issue of the DPI (a.k.a. Dots Per Inch) setting.

    Even looking at Vista (under the theory that the longer we wait, the more that has been fixed!):

    If you click on that Custom DPI button, you are shown a nice ruler that is clearly using inches:

    Of course generally DPI is meant to be used in the case of printers, where it really is referring to literal inches, an issue that is difficult to quantify if you don't really know much about inches (for those of you who are more metrically inclined, we are talking about 2.54 cm).

    The use of inches when we are talking about the screen as we are here is one step sillier, since we are actually talking about screen inches. This is a concept not really found in nature since they are not actually inches anyway. So you can play with the setting to try and see what it does:

    That ruler and the font sample, perhaps due to people at Microsoft being self consciously aware of the fact that the whole setting is stupidly described, try to put it in terms that people will better understand the consequences of.

    Given that, why bother involving inches here specifically? Or if nothing else, there is that Measurement System setting in Regional and Language Options that one can customize:

    People have on occasion taken issue with the contrast there being one of U.S. versus Metric. I think that there are only like three countries that have not embraced the metric system in an official capacity: the USA, Liberia, and Myanmar. Which is definitely a list that we want to be a part of, right? (eye roll implied)

    Okay, so with the whole setting so poorly described, there is no reason to not simply at a minimum embrace this setting in Regional and Language Options. In fact, this is exactly what Windows Journal (available in Vista) does. If you take the Options dialog for Windows Journal:

    Clicking on that Display Measurements... button will also show a ruler, but the ruler will depend on that Regional and Language Options setting to decide whether to show:

    or

    with literal instructions involving getting your own ruler and putting it against the ruler on the screen. Which makes it even cooler that they use the RLO setting to decide what kind of ruler to show, if you ask me. :-)

    Other attempts are made if you look at Access, at VB, and at WinForms, where the scale measurement can either follow Regional and Language Options or can have the scale changed to use twips or pixels instead.
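    For anyone who has not had to juggle these units, the arithmetic is simple enough to sketch. A twip is 1/1440 of a (logical) inch and an inch is 2.54 cm; the function names and the 96 DPI default below are just my choices for illustration.

```python
TWIPS_PER_INCH = 1440  # a twip is 1/20 of a point, and there are 72 points per inch
CM_PER_INCH = 2.54


def twips_to_pixels(twips, dpi=96):
    """Convert twips to pixels at the given dots-per-inch setting."""
    return twips * dpi / TWIPS_PER_INCH


def pixels_to_twips(pixels, dpi=96):
    """Convert pixels to twips at the given dots-per-inch setting."""
    return pixels * TWIPS_PER_INCH / dpi


def dots_per_cm(dpi):
    """Express a dots-per-inch setting as dots per centimeter."""
    return dpi / CM_PER_INCH


if __name__ == "__main__":
    for dpi in (96, 120):
        print(dpi, "DPI:", pixels_to_twips(1, dpi), "twips per pixel,",
              round(dots_per_cm(dpi), 2), "dots per cm")
```

    None of which makes a screen inch any more of a real inch, of course; it just moves the fiction into a different unit.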

    It is all a valuable attempt to redeem the DPI (or at least the description of it), but I would go a step further if it were up to me....

    I recall being right near Bill Vaughn in front of the crowd in San Francisco when he made the impromptu suggestion of the infamous "Visual Fred" tag for Visual Basic.NET:

    I think Visual Basic .NET is not bad — it’s just different. In my opinion, it would have been far easier on everyone if they had just called it something else — anything else. Visual Fred would have been better. That way developers would know that it’s just another language.

    The whole issue of DPI would be much easier if we followed Bill's example -- just got rid of the inches and talked about DPF -- Dots Per Fred. And we simply defined Fred as some arbitrary computer screen measurement. We could if nothing else get rid of use of a term that we shouldn't be using anyway for screen attributes!

     

    This post brought to you by ၍ (U+104d, a.k.a. MYANMAR SYMBOL COMPLETED)
