Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
Last month I was talking about how Feature ideas don't always turn out to be good ones. And I mentioned how I'd probably talk about other cases in the future.
What can I say besides welcome to the future. :-)
In Vista, from the time when it was just Longhorn, there has been enhanced collation support for all of the CJK locales. The stroke count sorts and Mandarin pronunciation (both Pinyin and Bopomofo) sorts all covered more characters, the Korean Hangul pronunciation sort was enhanced too, and the Japanese locale got a new alternate sort to cover everything in JIS X 0213. Basically a lot of work was done.
But there was one area that was not covered that was really bothering me -- there was no support for a Cantonese sort of any kind.
"But isn't Cantonese," you might ask, "a spoken dialect, not a written one?"
The Wikipedia article Written Cantonese gives a good answer to this question in its introduction:
Written Cantonese refers to the written language used to write colloquial standard Cantonese using Chinese characters.
Cantonese is usually referred to as a spoken variant, and not as a written variant. Spoken vernacular Cantonese is different from standard written Chinese, which is essentially formal Standard Mandarin in written form. Written Chinese spoken word for word in Cantonese sounds overly formal and distant. As a result, the necessity of having a written script which matched the spoken language increased over time. This resulted in the formation of additional Chinese characters to complement the existing characters. Many of these represent phonological sounds not present in Mandarin. A good source for well documented written Cantonese words can be found in the scripts for Cantonese drama and Cantonese opera.
With the advent of the computer and standardization of character sets specifically for Cantonese, many printed materials in predominantly Cantonese spoken areas of the world are written to cater to their population with these written Cantonese characters. As a result, mainstream media such as newspapers and magazines have become progressively less conservative and more colloquial in their dissemination of ideas. Generally speaking, some of the older generation of Cantonese speakers regard this trend as a step "backwards" and away from tradition. This tension between the "old" and "new" is a reflection of a transition that is taking place in the Cantonese speaking population.
And if you look at the major population centers with people who use Cantonese, there are clear efforts to support this development among many of the native speakers (and writers) of Cantonese.
There are some cultural issues that even I was faced with when doing research here that I will discuss further in a follow-up post....
Of course one of the big problems has been that there are multiple romanizations used to represent the pronunciations, and unfortunately they are often used in the same lists (like phonebooks in Macau and elsewhere that allow people to simply enter the pronunciation). How can you hope to sort the phone book consistently if the people providing the pronunciations have different ideas of how even identical pronunciations are to be represented?
But lots of work has been done to try to help with this issue, for example the Jyutping system produced by the Linguistic Society of Hong Kong (LSHK). And many people have been trying to use it -- for example the government of the Hong Kong SAR's Chinese Language Interface Advisory Committee (CLIAC) has produced the Cantonese Pronunciation List of the Characters for Computers, a huge set of data providing Cantonese "Pinyin-esque" style pronunciations for much of the Hong Kong Supplemental Character Set (HKSCS).
When I first saw that we would have a list of over 30,000 ideographs and their pronunciations, I was excited -- perhaps this data could be used to provide a Cantonese sort for the people in Hong Kong and elsewhere who wanted it?
But unfortunately, while there is much that is good about Jyutping, it has one liability at present, one that it shares with Yale and every other romanization system: there are several such systems in use, and not one of them is ubiquitous yet.
Another problem is that for the 30,764 unique ideographs given pronunciations in the CLIAC-provided doc, there are fewer than 2,000 unique pronunciations (fewer than 700 if you do not include the tone values).
And yet another problem is in the decision about tones -- some number the tones in Cantonese at nine, while others claim that three of these are unimportant distinctions and that there are only six to worry about. So it is not just different romanization systems, which vary enough with place names like Canton and Guangzhou coming from the same word, but even if people agree on the romanization they may differ in their opinion of the tones (with some believing that tones 7, 8, and 9 actually fold into 1, 3, and 6 respectively).
And the final problem, there is not yet a clear and established standard on how to break ties -- once you decide which Han have the same pronunciation, how do you decide which one comes first?
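To make the last few problems concrete, here is a minimal sketch of what a pronunciation-based sort key might look like. It is not what Windows does or would do. The function names, the sample readings, and above all the tie-breaker (raw code point order) are assumptions made purely for illustration; the tone folding follows the 7/8/9 to 1/3/6 view mentioned above.

```python
import re

# Six-tone view: tones 7, 8, 9 fold into 1, 3, 6 respectively.
TONE_FOLD = {7: 1, 8: 3, 9: 6}

def split_syllable(reading):
    """Split a reading such as 'kwai4' into ('kwai', 4)."""
    match = re.fullmatch(r"([a-z]+)([1-9])", reading)
    if match is None:
        raise ValueError(f"unexpected reading: {reading!r}")
    return match.group(1), int(match.group(2))

def sort_key(entry, fold_tones=True):
    """Order by syllable, then tone, then code point.

    The code point tie-breaker is an assumption: as noted above, there is
    no established standard for ordering Han with the same pronunciation.
    """
    char, reading = entry
    syllable, tone = split_syllable(reading)
    if fold_tones:
        tone = TONE_FOLD.get(tone, tone)
    return (syllable, tone, ord(char))

def count_unique(entries):
    """Return (unique readings with tones, unique readings without tones)."""
    with_tones = {reading for _, reading in entries}
    without_tones = {split_syllable(reading)[0] for _, reading in entries}
    return len(with_tones), len(without_tones)

# Illustrative sample only (not taken from the CLIAC list); two readings
# deliberately use nine-tone numbering to show the folding.
entries = [("食", "sik9"), ("史", "si2"), ("色", "sik7"), ("詩", "si1"), ("葵", "kwai4")]
ordered = [char for char, _ in sorted(entries, key=sort_key)]
```

Even in this toy version every contentious decision shows up as a line of code: which romanization the readings use, whether `fold_tones` is on, and what goes in the last slot of the key.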
There was just not enough of a consensus yet to try to push ahead in Windows with providing such a sort. Because Microsoft has no interest in dictating language policy; we just want to identify it so that we can represent things the way customers would like them.
But this now brings us to input methods.
Like I said way back in December of 2004, IMEs have it easy. In this case because (if for no other reason) if you identify a rich new source of pronunciations you can simply add them to the IME if you like them. Or you can provide different IMEs using the different systems, too (assuming you have enough data!).
Anyway, enough of the backstory, right? Let's get to the IME, like I said I would!
The steps are the same as they were with the Unicode IME. Just grab the file from here (871 kb) or you can grab the zipped version here (144 kb).
1) Copy the text file to \Program Files\Windows NT\TableTextService on your Vista machine (if the "Program Files" on your machine is another language, use that directory, do not create a new one!).
2) Open an elevated command prompt and navigate to that directory.
3) Run the following from that command prompt:
rundll32 TableTextService.dll RegisterProfile TableTextServiceCantonese.txt
4) Say OK to the dialog that comes up verifying you want to install it:
You can now add the Chinese Hong Kong Cantonese IME to the Chinese (Hong Kong S.A.R.) locale by going through the following steps that are illustrated here.
Now like the Unicode IME this is a sample, and further this is a work in progress. There are lots of settings I would like to tweak here, such as how (or whether) the list should be sorted.
(And if I find other huge caches of Cantonese pronunciations in other romanizations I might even see whether they could be productively combined.)
And like I said, in an upcoming post I will talk about many of the cultural issues I ran across while doing the research here -- they are fascinating!
This post brought to you by 䕫 (U+2f9b2, an Extension B ideograph in HKSCS with a Jyutping pronunciation of kwai4)
Now how does that saying go?
I before E, except after C,
Or when it sound like 'A', like in Neighbor and Weigh,
Or when it sound like 'Ear', like in the word Weird,
Unless it sound like 'Eek', like in Duncan Sheik!
Ok, I added that last part in. But in my defense I do have that built-in appreciation for singer/songwriters, and it is how his name is spelled. But what do you expect me to do, when Duncan Sheik is playing in Seattle tonight? :-)
That's right, he will be at Chop Suey, playing with Vienna Teng. Doors open at 9pm.
You can get tickets here (from TicketWeb), and if you happen to see me scooting along then feel free to say hi and mention if you heard about the show reading here! :-)
If I get up the nerve, I'll ask Duncan if the 'Mark Liberman' in the "Thanks to:" section of the liner notes in his newest album (White Limousine) is the Penn linguist I've mentioned in the past....
It hearkens back to that Persian? Or Farsi? post I did back in May of this year, and indeed some very similar issues about it exist. The question is whether the name of the language is Uighur or Uyghur....
The simplest answer (were I and all my readers to actually be the simplest people) would be to go to GoogleFight.com and discover that the web thinks of it as Uighur by a 3.3 to 1 margin. But we aren't the simplest people, or at least we try not to be. :-)
A slightly less simple answer is that of course it is not -- the name of the language is actually ئۇيغۇر, so that arguing about the English transliteration of it makes about as much sense as arguing about the best way to spell the transliteration of חנכה or معمر القذافي. In other words, it does not make very much sense....
Of course, one could also be really pedantic and claim that since the use of the Arabic script for the language is a relatively recent development, even the transliteration is a translation of the original. As Omniglot points out:
Uyghur was originally written with the Orkhon alphabet, a runiform script derived from or inspired by the Sogdian script, which was ultimately derived from the Aramaic script.
From the 16th century until the early 20th century, Uyghur was written with a version of the Arabic alphabet known as 'Chagatai'. During the 20th century a number of versions of the Latin and Cyrillic alphabets were adopted to write Uyghur in different Uyghur-speaking regions. However the Latin alphabet was unpopular and in 1987 the Arabic script was reinstated as the official script for Uyghur in China.
But such arguments are ultimately unconvincing, because in truth we do not really make up language name spellings by using such pure standards. The argument of needing the Aramaic spelling to get the "real" name becomes a clear case of reductio ad absurdum, and an argument we can discard.
As I pointed out in the later Persian? Or Farsi? Redux post, the argument there is really a transliteration for فارسی vs. a usage of a much older word for the language, in a much older civilization -- a bit like the argument of using the original Aramaic above! And while using "Farsi" for "Persian" may be like calling Spanish "Español" in English, anyone who watches Dora the Explorer (even I have a niece, you know!) may find that more and more common to be doing anyway. So the whole "connotation preference" argument seems much more reasonable and honest -- and the decision can be based on which connotation is generally preferred.
So what about Uighur and Uyghur? Neither of them holds much in the way of an obvious connotation preference, at least in English, right? And neither really has a common form used in English words -- the name of this language, which is pronounced in English much like "wee-girl" without the L at the end, is not something that one can easily glean from either spelling, and both look somewhat unnatural given how uncommon the forms are in English. You could make the official spelling in English Weegir and make folks in spelling bees that much happier. :-)
Of course it seems pretty common to keep words out of spelling bees when one could make a reasonable case that the spelling the officials expect might cause an international incident, so we are spared that whole issue in any case.
Now if you look at the language and its Turkic roots (or maybe more accurately branches), the Uyghur/Uygur spelling is more satisfying in Turkish, from the standpoint of both orthography and phonology (not to mention avoiding the violation of Turkish vowel harmony that Uighur/Uigur would be guilty of). Of course this too is not really an argument for the English spelling of the language, either.
It does appear that the government in China prefers the Uyghur spelling in many of its communications, which if it were consistent and broad based would probably be more convincing, at least in terms of a "Letting the largest person in the room settle the argument that was not all that important anyway" kind of resolution. But it seems inconsistently applied there too, plus more often than not it is just 维吾尔语 there, anyway.
So is there one that is better? I guess I can see it both ways, and really have a hard time claiming it is the most pressing issue related to the Xinjiang Uyghur Autonomous Region of China. Like many others, I am inclined to lack the energy to fight about which is better to use.
Vista, in the current builds I am looking at, uses Uighur, which if nothing else has the benefit of connection with the three letter ISO 639 code (uig) even though the two-letter code (ug) can obviously go either way. I suspect that this is the sort of thing that could easily change a bit between versions or not based on the passionate feelings of customers about the LOCALE_SENGLANGUAGE, just as Farsi/Persian has managed to do.
Or people might create custom locales to fix what they reasonably see as our mistakes, as I pointed out in Determining (and correcting) locale settings.
(Note: after I wrote all the above, I found a Wikipedia talk page that covers the very same issue, though it too has trouble coming to conclusions on the best spelling to choose -- if nothing else the article let me correct one point I had wrong in my initial text!)
This post brought to you by ئ (U+0626, a.k.a. ARABIC LETTER YEH WITH HAMZA ABOVE)
You may have read the first post I wrote about this, titled About the Fonts folder in Windows, Part 1 (aka What are we talking about?). I won't say that it is all downhill from there, but this post is definitely going to see us spending a little time in the valley....
You see, there is one thing about the Fonts folder that just about everyone can agree about -- the lovers of Microsoft, the haters of it, and everyone in between.
And that is that they can't stand the UI for adding fonts.
There are even instructions for it, though those instructions do not include pictures (probably put together by someone who was embarrassed about the whole thing).
Luckily the folks on the Typography team include a little art here. It is true that they haven't updated the instructions for the last few versions. But that is okay -- because the UI has not been updated either. :-)
And if you want art, the folks at Adobe give instructions with art here and here, the first link having the extra info on using the Adobe Type Manager for all of those Type 1 fonts (ATM is not needed for the later versions).
Most people I talk to never even knew this ridiculous dialog that looks like a reject they forgot to update from Windows 3.1 even existed; they simply open the Fonts folder and drag the font files to it. Most people don't even realize there is a right-click menu option for fonts to install them, either.
And lots of other people put up sites explaining how much cooler all of this is on a Mac (like this X vs. VP one). I won't comment except to say that there are some times that Microsoft appears to be clueless about UI. And this is one of them.
But the reason for the dialog existing was explained by the owner of it not too long ago:
The current install font dialog presents a list of fonts that are present in a directory to install and it lists them by font name. These come in roughly 3 categories:
1) individual fonts - truetype, opentype, raster - eg foo.ttf. There is a 1-1 mapping of fonts to fonts that get displayed in the install font dialog. The only way to get the name of the font is by opening the font file and reading the font name from it - the file names almost never contain a good enough name, and the file names aren't localized.
2) .inf installs - you can specify a catalog of fonts that span media in a .inf file. There is a 1-many mapping of .inf files to available fonts to install, the only way to generate the list of fonts is to parse the .inf file.
3) Type 1 fonts, eg foo.pf* - There is a 2-1 mapping of font files to fonts, each type 1 font has 2 files (sometimes 3) associated with it, at a minimum a .pfb and .pfm file.
There isn't another dialog in the system that can take such a set of input and produce the proper list of items. Many people say this should just use the common file open dialog, however unfortunately the requirements imposed by 1-3 above aren't supported in any way by the common dialogs, its a very unusual case.
Now I am going to say (speaking just for me and not at all for Microsoft or anyone who owns the dialog or the folder or even the functionality) that this explanation does not really ring true for me.
I mean, a common file open dialog can handle both multiple file types (.INF, .TTF, .PFB, .PFM, etc.) and multiple files at the same time. If the user doesn't select all the files that are needed for a Type 1 font, then an error can be put up, just as if they had selected a corrupted file or one that is already installed.
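As a rough sketch of the kind of check I mean (not the actual Fonts folder code, and with the function name, the extension sets, and the error strings all made up for illustration), validating a multi-file selection could look something like this. It only covers the "is the Type 1 pair complete?" part; parsing .inf catalogs and reading the real font names out of the files are left out.

```python
from pathlib import PureWindowsPath

# Assumed extension sets for this sketch; a real installer would know more.
STANDALONE = {".ttf", ".otf", ".ttc", ".fon"}
TYPE1_REQUIRED = {".pfb", ".pfm"}

def check_selection(paths):
    """Return (installable, errors) for a list of selected file paths."""
    installable, errors = [], []
    type1_parts = {}  # base path (lowercased) -> set of extensions seen
    for raw in paths:
        path = PureWindowsPath(raw)
        ext = path.suffix.lower()
        if ext in STANDALONE:
            installable.append(raw)
        elif ext in TYPE1_REQUIRED:
            base = str(path.with_suffix("")).lower()
            type1_parts.setdefault(base, set()).add(ext)
        else:
            errors.append(f"{raw}: not a recognized font file")
    # A Type 1 font is only installable when both halves were selected.
    for base, found in sorted(type1_parts.items()):
        missing = TYPE1_REQUIRED - found
        if missing:
            errors.append(f"{base}: Type 1 font is missing {', '.join(sorted(missing))}")
        else:
            installable.append(base)
    return installable, errors

installable, errors = check_selection(["foo.ttf", "bar.pfb", "bar.pfm", "baz.pfb"])
```

Here `foo.ttf` and the complete `bar` pair come back as installable, while `baz` is reported as missing its .pfm file, which is exactly the sort of error a dialog could put up.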
Sure it's a complicated little bit of code, I won't argue about that. And a tester would probably find it to be a target rich environment when looking for bugs. But it isn't impossible; it's just intricate.
So it seems like an update here ought to be possible. If that were the only reason, I mean.
But in the end, I think the real story is a bit more involved -- one of those weird situations where the people who own the functionality (the font folks, the rendering folks, the other font folks) are not the owners of the UI. And they don't have the resources to take the ownership. And the folks who own it don't really have a ton of resources either, so this turns into one of those sacrificial features that can get cut each version (on one version because there is no time left and on the next because we already shipped it before, right?).
When everything is said and done, the story here is not nearly as impressive as it probably ought to be. Or even as impressive as it could be. It really does scream for someone to take all that info I put in the first post about what is involved with the install and just put something together, doesn't it? :-)
The story continues, in both more impressive and less impressive ways. Stay tuned....
This post brought to you by F (U+ff26, a.k.a. FULLWIDTH LATIN CAPITAL LETTER F)
A few years back, John McConnell gave a day 2 keynote at the 26th Internationalization and Unicode Conference, entitled The Windows Language Roadmap or When Do We Get Rongo-Rongo?.
The subtitle, in a bold tradition that was subsequently taken up by this very blog you are now reading, had little to do with the actual presentation, but provided an interesting title and a fun story that cannot be found in the slides (leaving people who did not attend the talk wondering what it was all about, just as with the moose at the end of the presentation!).
(He did give a slightly longer version of the talk at the 2004 Global Development and Deployment Conference, and a video of that presentation is available online for your enjoyment. :-)
Anyway, for your reading pleasure I will (with John's permission) provide the transcript of the story below, but it is right at the beginning of the video and definitely worth listening to in John's unique storytelling style if you have the time (since I did not include a laughtrack it's the only way you can find out where the crowd was amused!).
Enjoy!
I've had several people ask me about the title of this talk "When do we get to Rongo-Rongo?". Some people thought I made up the name. I'll explain, it has a little bit of a personal history.
One of the very first projects I had when I was still a developer involved in globalization was back in the mid-80s. It was for a very large customer whose name I can't mention, but they're in Langley, Virginia.
The assignment I had was to support bidirectional text; technically the documents supported left-to-right, it did not support bidirectional. So I understood and worked with people who understood bidirectional text and I was able to work that out.
But being the ambitious little nerd that I was, I went off to a library and I decided I would find out more about writing systems. Because I knew vaguely that East Asian text was written vertically and I thought, 'well maybe I should generalize my code so I can support vertical writing.'
So the library was a wonderful resource. I found out about ancient Greek writing, which (I'll probably say this wrong) Boustrophedon, where they would write one line going one way and then the next line would start there and go back. And that was very appealing to me.
But then even better was Rongo-Rongo, which it sounds like it's made up by teenagers or something, but it was a language used on Easter Island, or I shouldn't say language, a writing system on Easter Island.
I believe there's only like 120, some small number of samples. They are on these large round disks. It has never been fully deciphered.
But the thing that was really wonderful about it is it's written sort of like Boustrophedon, but when you get to the second line, rather than just going backwards, it actually turns upside down.
So this really put me into a fever, writing the code.
So, unfortunately in that particular coding assignment I ultimately concluded that I couldn't support Rongo-Rongo -- the performance hit was just a little too great.
And so, when I delivered the software to the salespeople they said "What languages does it support?" and I said "It'll support anything except Rongo-Rongo."
I said this as sort of a joke, but about a month later we had the version two requirements, which said that "Version two must support Rongo-Rongo."
So ever since that experience it's been the goal at the end of the rainbow, it's where we will eventually get to before I retire....
The full presentation talks about ELKs and LIPs and lots of the other things I talk about here, and is worth a listen, in my opinion. :-)
So here is a quick and dirty Q&A:
Q: What company was John working for back in mid 80's?
A: He was working for DEC at the time, though the contract was for that customer in Langley, Virginia.
Q: Does Unicode support Rongo-Rongo?
A: Rongo-Rongo is not yet encoded in Unicode.
Q: Does Vista support it?
A: The first step that Windows requires when it comes to language support is support within Unicode (after that we can get into fonts and shaping engines and such), so given the answer to the previous question, the answer to this one would also be no.
Q: Will Microsoft ever support Rongo-Rongo?
A: It is worth noting that John has not retired yet, so who knows what the future holds? It is still at the end of the rainbow....
This post brought to you by ༃ (U+0f03, a.k.a. TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA)
I have previously (like in Typing in random Unicode code points) talked about using the Unicode IME as a way to input random Unicode code points by their code point values. It was therefore something of a shock to me when Lionel Fourquaux mentioned a few months ago that:
I have been unable to find the Unicode IME in Vista (build 5308). Was it removed? If so, why? I'm missing it.
I looked, and yes, this IME had been removed! :-(
I talked to some of the folks on the Input Method Editor (IME) team and on the Text Services Framework (TSF) team, and the removal was intentional. There was a concerted effort to assist with maintainability of the IMEs to make sure that they were using TSF rather than the old IMM mechanisms, and since this IME was kind of built in to all of the other ones, there just wasn't a strong advantage to having a standalone one that did the same thing....
And of course just having something mentioned here in this blog is not a compelling enough reason, either. :-(
However, after working with some of the folks on the TSF team I have a beta version that may eventually be a great sample of a new type of TSF TIP that is entirely text based, which can be used to replace the Unicode IME.
With the permission of that group, I am making this beta available to those who want to try it out. All you need to participate is a somewhat recent build of Vista and to be a reader of this blog (how else will you know about it?).
Don't forget the disclaimer text: "Postings are provided as is with no warranties, and confer no rights." To that I would add the same text in Hebrew is something like "הודעה זו מסופקת "כפי שהיא" ללא כל אחריות או חיובים, ואינה נותנת לך זכויות כלשהן".
For those of you who are using Vista who want to try it out, you can download the 1.6 mb text file here (or the 166 kb zipped version here!), then follow these instructions:
rundll32 TableTextService.dll RegisterProfile TableTextServiceUnicode.txt
You can now add the Unicode IME to the Chinese (Taiwan) locale by going through the following steps that are illustrated here.
And that is all you have to do. Now, when you type in an application like Notepad, you can switch to the IME and type in code point values like 00E5 or 0951 or whatever, and the character will be typed!
Now even as a sample this is a work in progress. For example, I want to change some of the configuration settings so you can see the code point you are in the middle of typing.
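If you want to double-check which character a given value should produce, the mapping itself is simple enough to sketch outside the IME. The helper below is just an illustration of the hex-digits-to-character conversion (the function name is made up, and it has nothing to do with the table file format the IME actually uses):

```python
def code_point_to_char(hex_digits):
    """Convert hex digits such as '00E5' into the corresponding character."""
    value = int(hex_digits, 16)
    if not 0 <= value <= 0x10FFFF:
        raise ValueError(f"U+{hex_digits} is outside the Unicode range")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError(f"U+{hex_digits} is a surrogate, not a character")
    return chr(value)

a_ring = code_point_to_char("00E5")   # LATIN SMALL LETTER A WITH RING ABOVE
udatta = code_point_to_char("0951")   # DEVANAGARI STRESS SIGN UDATTA
```

Surrogate values are rejected because on their own they are not characters, even though they fall inside the numeric range.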
And I want it to be able to be available for other locales too, for obvious reasons.
And this IME will not work in the console, sorry about that. I don't think that particular limitation can be addressed, so pretend you never saw this post before. :-)
You get the idea -- it is a beta!
There will be documentation on the format of the text file that will be available eventually, and I will probably blog about it a bit too in the future (plus maybe share some of my other samples, such as a Cantonese IME I have been working on).
But if you are using a recent build of Vista and want to try out this sample, a meager attempt at a replacement version of the Unicode IME from prior versions, then please do, especially if you want to post feedback about it here.
Enjoy!
Back when I posted about the Punjabi and Telugu Language Interface Packs (It's not Telugu [తెలుగు] Tubbies (and it's certainly not Punjabi [ਪੰਜਾਬੀ] Tubbies!)) I got some interesting comments, like one from Amar:
Dude! you know lot about Indian Languages.
and another from Seshagiri:
Its amazing to see your grasp on the languages or should I say fonts.
But as I pointed out at the time, I know a little bit about languages but the bulk of these interesting facts that I have been posting about LIP languages have actually come from Soren Eberhardt, a Software Localization Engineer who is one of the core folks involved with the LIP program at Microsoft.
It is actually Soren, through a combination of what he knows and what he learns while doing research into the localization work that goes into Language Interface Packs, who is responsible for getting these interesting facts about languages together when various LIPs have been released.
So I just wanted to make sure I expressed my gratitude for this hard work, and make sure to give credit where credit is due for the info about Urdu, Tswana, Luxembourgish, Quechua, Persian, Telugu, Punjabi, Zulu, Kannada, Nepali, Konkani, Bengali, and Malayalam.
Thanks, Soren!
This post brought to you by ട (U+0d1f, a.k.a. MALAYALAM LETTER TTA)
Chris was thinking about how keyboard layouts work on Windows, and suggested:
Windows allows a user to set up multiple keyboards within an input language, and switch between them either by a keyboard shortcut or by clicking an icon on the taskbar. However, when the keyboard layout is changed, the change is only applied to the current window. This behaviour is extremely confusing to users, as the characters that appear as they type then vary from window to window. By default, using the keyboard shortcut or clicking the taskbar icon to select another keyboard, the new keyboard should be used in all windows. If it is necessary to keep the old behaviour, this should be done via a (non-default) check-box on the "Language Bar" tab of the "Text Services and Input Languages" applet.
It is an interesting idea, and also a difficult one. The list of installed keyboards (which actually means the list of enabled keyboards) is a per user setting on a machine. And, as Chris mentioned, every language can have more than one keyboard attached to it.
So, let's say that I have English and Korean enabled, with multiple input methods on each:
If I look at the Language Bar, I have the option of choosing the language:
and then also the keyboard layout under the language once it is selected:
Thankfully, the selection mechanism is smart enough to remember the selection in each language. So that if I
It is smart enough to remember that I switched to that specific IME. This is a good thing.
Now we move to the complicated part -- the fact that the actual, specific combination of language and layout is a per thread setting. This has been the design for a long time and it usually leads to two potentially non-intuitive behaviors:
Now that first point is longstanding behavior that is often relied on by people (including me!), and although it is not immediately intuitive it does kind of make sense, especially when one finds oneself e.g. typing a document in a particular language and not expecting that to lead to changes in one's email program....
The second point is the one that Chris is talking about. Any time one changes the keyboard layout for a language, isn't it more likely that one would expect that switch to be applied to all other UI threads in the session that use this language?
Clearly there are exceptions here (plus there is a legacy expectation that some people may be used to), but that is also why Chris is suggesting some kind of configurable switch. I tend to agree with Chris that the change here is more likely to meet the intuitive expectations of customers, don't you?
This kind of change is complicated for someone to do themselves -- there would need to be a call to SetWindowsHookEx to inject a DLL into every thread, and that DLL would run code to check GetKeyboardLayout. If the languages (in the LOWORD) match and the devices (in the HIWORD) do not, it would then call ActivateKeyboardLayout to make the switch.
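The comparison at the heart of that check is small enough to sketch. This is only the decision logic, not the hook or the DLL injection, and the function names and sample HKL values are my own assumptions for illustration:

```python
def loword(value: int) -> int:
    """Low 16 bits of an HKL: the language identifier."""
    return value & 0xFFFF


def hiword(value: int) -> int:
    """High 16 bits of an HKL: the device (layout) part."""
    return (value >> 16) & 0xFFFF


def needs_layout_switch(thread_hkl: int, new_hkl: int) -> bool:
    """True when the thread uses the same language as the newly chosen
    layout but a different device, i.e. the case where a call to
    ActivateKeyboardLayout would be wanted."""
    same_language = loword(thread_hkl) == loword(new_hkl)
    same_device = hiword(thread_hkl) == hiword(new_hkl)
    return same_language and not same_device


# Illustrative values only (hypothetical layouts for the example).
ENGLISH_DEFAULT = 0x04090409
ENGLISH_OTHER = 0xF0020409
KOREAN_DEFAULT = 0x04120412

print(needs_layout_switch(ENGLISH_DEFAULT, ENGLISH_OTHER))  # same language, new layout
print(needs_layout_switch(KOREAN_DEFAULT, ENGLISH_OTHER))   # different language
```

A thread sitting in Korean is left alone, which preserves the per-language memory described earlier.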
Yuck.
But if you look at the Language Bar and the Text Services Framework, it already has a little something running in every thread, managing this information. So it is in a much better position to architect the kind of cross-thread, cross-process communication that might be required here.
Not an easy change, by any means. But an excellent feature request that is definitely worth considering for a future version of Windows. Good idea, Chris! :-)
This post brought to you by ђ (U+0452, a.k.a. CYRILLIC SMALL LETTER DJE)
The other day, Raymond Chen posted Pidls and monikers do roughly the same thing, just backwards.
And in that post he had the following text (emphasis mine):
When operating with the Windows shell, you will almost certainly find yourself at some point working with a pointer to an item ID list, known also as a "pidl" (rhymes with "middle").
When I saw it, I was thinking about how he could have said "pidl" (rhymes with "MIDL") instead. You know, MIDL, the Microsoft Interface Definition Language.
Luckily this was just a momentary thing. I realized that it was unlikely that someone would know how MIDL was pronounced but not know how PIDL was pronounced. They both rhyme with "middle."
But I thought about how most of these industry acronyms tend to have listed the pronunciation in various glossaries. Once in a while there will be helpful text like Raymond's above or Bruce McKinney's Hardcore Visual Basic explanation of how to say GUID:
NOTE: Let me indulge readers from my part of the world by describing the pronunciation of GUID as geoduck without the "uck." Those of you who don’t know a geoduck from a mallard can just say "Goo-Id."
But most of the time we let people guess, and that is why half of the people say SEQUEL while the other half say ESS-QUE-ELL. Or why almost everyone says OLAY while a few people say OH-ELL-EEE.
The Platform SDK will religiously make sure to call the first mention National Language Support (NLS) or Graphic Device Interface (GDI). Those are just pronounced with the letters, but why does it waste time talking about Small Computer System Interface (SCSI) rather than telling us it rhymes with scuzzy? You know, like advice we can use?
And I have heard people tell stories about the way people mispronounce things -- even though there is no real central communication of what the pronunciation ought to be. So we don't tell people how to pronounce it (other than by just pronouncing it ourselves when it comes up), yet we silently judge them for their choice.
We are no better with globalization support -- we say ebb-cid-ick for EBCDIC and laugh at the people who say EEE-BEE-CEE-DEE-EYE-CEE. And so on.
On the other hand, I do the same thing when I am introduced to people -- I tell them my name is Michael. And then sometimes they ask whether I prefer Michael or Mike or ???? and I say whatever. But I think I do silently judge people who ignore the initial name I used, sometimes.
So is it a test we do, to see who is paying attention?
In the new HBO show Lucky Louie, Louie blew a weekend that they had off because Louie decided to get her flowers, which ordinarily would be quite sweet but not so much when he gets her red roses again after she had told him she did not want to get red roses, that she really did not like them. Now obviously the stakes are not as high in these other situations, but I think we want to know people are listening to us.
There is more to having a conversation than learning to shut the hell up until the other person is done talking and spending the whole time thinking of what you will say when they are done. In fact there are two more things, specifically:
Could the thing we seem to do with acronyms be part of the same type of test? And if so, isn't the joke on us, since (for example) a lot more people read here than will ever hear a word of what I say? And I am not going to add an audio track to the blog (I can see it now, SIAO on Tape!).
Note that even recognizing the fundamental silliness of how we (and how I) act, nothing will probably change here. Because if I give the pronunciation every time, I am doing a remedial reading course. And if I say what the acronym stands for each time, I am like Platform SDK West.
But I will try not to judge in the future on the pronunciations. I'll remember the trivia question Triumph the Insult Comic Dog asked the Star Wars fans camped out before Attack of the Clones opened:
Triumph: What substance was Han Solo frozen in?
Star Wars Fans: Carbonite!
Triumph: No, I'm sorry, that is incorrect. The correct answer is 'Who gives a fuck?'
This will help remind me that the fewer of these acronyms you know how to pronounce, the cooler you probably are. We geeks need to get over ourselves, you know?
And from now on, I'll tell people I prefer Michael when they ask. And then I'll feel more justified in silently judging the ones who say Mike since they took the time to ask and then ignored the answer.
But the folks who talk about ATM machines and Very VIP People are still toast in my book. Because everyone knows how silly that is....
In the spirit of the milk bet, I got the following invite yesterday:
As some of you have heard by now, there is something known as a Saltine challenge. The goal is to eat 7 saltines in 60 seconds. If you're interested and well-hydrated, join us in ##/#### tomorrow :)
Cheers,
Sushmita
Group morale events have clearly reached a new high (or perhaps a new low, it is all a matter of perspective!).
Yes, there is indeed The Saltine Challenge, which you can read a bit about here and here and here and here (or do your own searches, it is a popular bit of folklore). Obviously some of those on the NLS and MST teams plan to outdo the original Nabisco ad campaign and do seven crackers, not just six....
For those participating, Mike (another Mike, not the Mike who was formerly ours) explains the chunking strategy:
Alright, I'll give up the goods on the chunking method now:
Basically, if you eat three, then two, then one, it seems quite a bit easier to accomplish the feat. The idea is that by getting three out of the way in one swoop while you're still somewhat wet, you're slaying half the beast right away. Doing one right after the other is bad because you're pretty much bone dry after only the first cracker, and doing all six right away is bad because that's just too much cracker to break down.
Incidentally, when I did seven, it was with a 4-3 strategy. Still haven't been able to do a 4-4 yet though. And I wonder why my mouth is so scratched up this week...
He was eventually able to do eight saltines, using a 3-3-2 strategy.
And then Mike (the first Mike, the one who was formerly ours) took a break from producing humorous and not-for-public-sharing build break haiku poems to provide the following little ditty:
oodles of saltines
all stuffed into my dry mouth
I so want to puke
Of course, I think it might be fun to try to combine the two bets somehow -- though milk and cookies is a more conventional combination than milk and crackers. :-)
Got milk?
In the latest comment to the post that keeps on going (Behind 'How to break Windows Notepad'), Sanjay Vyas asks:
Not all combinations of 4-3-3-5 will produce it. For example, Bush hid the truth does not work while Bush hid the facts does. Any explanation?
Well, let us look at the two strings. First:
Bush hid the facts
0042 0075 0073 0068 0020 0068 0069 0064 0020 0074 0068 0065 0020 0066 0061 0063 0074 0073
becomes:
畂桳栠摩琠敨映捡獴
7542 6873 6820 6469 7420 6568 6620 6361 7374
The obvious question is why the second string does not do the same thing. Why does:
Bush hid the truth
0042 0075 0073 0068 0020 0068 0069 0064 0020 0074 0068 0065 0020 0074 0072 0075 0074 0068
not become the analogous:
畂桳栠摩琠敨琠畲桴
7542 6873 6820 6469 7420 6568 7420 7572 6874
exactly?
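The byte pairing behind both of those ideograph strings is easy to reproduce. Here is a minimal sketch that takes the single-byte text and reinterprets each pair of bytes as one little-endian UTF-16 code unit. It only shows what the text looks like once it has been mistaken for Unicode; it says nothing about how the guess itself is made, and the helper name is my own:

```python
def misread_as_utf16le(ansi_text: str) -> str:
    """Encode as single bytes, then decode the same bytes as UTF-16LE."""
    return ansi_text.encode("ascii").decode("utf-16-le")


def code_units(text: str) -> list[str]:
    """Four-digit hex value of each character, as in the listings above."""
    return [f"{ord(ch):04x}" for ch in text]


facts = misread_as_utf16le("Bush hid the facts")
truth = misread_as_utf16le("Bush hid the truth")

print(facts, " ".join(code_units(facts)))
print(truth, " ".join(code_units(truth)))
```

Both strings have 18 bytes, so each one pairs up cleanly into nine code units; a string with an odd byte count would fail to decode in this sketch.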
Neither string is very useful from a meaning standpoint, so we can dispel conspiracy theories involving both Japan and China right away (thankfully!).
If you run both bits of text through IsTextUnicode running all tests, the first one returns TRUE and only returns IS_TEXT_UNICODE_STATISTICS, which means it only won the statistical tests.
In a comment to the digg thread, neko asked (but no one answered):
I wonder if this will give the same result if you run notepad.exe in wine? Does wine emulate the dodgy isTextUnicode() behaviour as well?
I am not sure, though I am inclined to doubt that they are running the same statistical tests. A bit of Google spelunking suggests that the code is here somewhere, but I decided not to look myself. Hopefully it is not a dead link. Someone can tell me later if I was right. :-)
I won't post the actual code for IsTextUnicode -- no sense getting in that kind of trouble, and even if that did not matter it is kind of embarrassing code....
(As a side note, the results on Windows vary a little bit depending on the default system locale, since the function looks a bit at lead bytes and at whether the ratio of bytes in the string to lead bytes, according to a DBCS default system code page, is 2:1. In theory that means the results could vary some on a Chinese, Japanese, or Korean system....)
The tests (on Windows) are rather arbitrary and consist of a few parts, but the biggest piece is a comparison of the fluctuation between high bytes and low bytes, and it is the difference between various high bytes and low bytes that the second string is failing on.
I honestly see no good reason for it to return TRUE here in either string, though I ran into problems even trying to fix this bug with CRLF due to it breaking a use of the function in detecting a Unicode JScript file, so changing the tests that the function does here is probably a no-no.
Though it really is not a conspiracy with Microsoft making critical comments on the president's use of TRUTH or FACTS, I doubt I will have much luck convincing people of that.
To me the more interesting conclusion here is that the passing of an arbitrary Unicode String like "畂桳栠摩琠敨琠畲桴" to IsTextUnicode returns FALSE due to these statistical tests. Once again, this is just not a function I like (or trust!) very much.
But it is not some kind of easter egg. Truly, it is just a dumb algorithm!
This post brought to you by 桴 (U+6874, a CJK ideograph)
Just moments ago, Sergey asked in the Suggestion Box:
Hello, Michael! Wouldn't it be great to be able to set UTF-8 as a multibyte code page in Windows? What do you think?
Well yes, I think it would be great. :-)
Of course (in the spirit of RAH) I think it would be great if the lion could lay down with the lamb. Though I'd lay odds that only one of them would be getting up later....
I hint at some of the problems in this post and then talk about it more directly in the comments in this one.
Short version -- it can't happen.
Sorry, Sergey.... :-(
This post brought to you by ড় (U+09dc, a.k.a. BENGALI LETTER RRA)
Francisco Moraes posted in the Suggestion Box:
This isn't much of a suggestion but more of a question: I have a program that layouts glyphs from fonts to display on the screen. This all works great until I try to print, because some printers will substitute the font being used and print garbage. Is there a way to avoid the font substitution when using glyphs instead of characters or a better alternative? Francisco
It is hard to specifically answer the question without knowing what program is being used, but I can talk about the feature, and that may help. :-)
The option in the user interface goes under different names, but is usually under the Advanced options of the Print dialog and has some name related to TrueType fonts. It is hard to get more specific than that since it is UI that is provided by the printer driver, but here are some examples:
It is interesting of course that only two options are given in most cases, since the option is based on the dmTTOption member of the DEVMODE data structure. It is documented as follows:
Specifies how TrueType fonts should be printed. This member can be one of the following values.
Value: Meaning
DMTT_BITMAP: Prints TrueType fonts as graphics. This is the default action for dot-matrix printers.
DMTT_DOWNLOAD: Downloads TrueType fonts as soft fonts. This is the default action for Hewlett-Packard printers that use Printer Control Language (PCL).
DMTT_DOWNLOAD_OUTLINE: Windows 95/98/Me, Windows NT 4.0 and later: Downloads TrueType fonts as outline soft fonts.
DMTT_SUBDEV: Substitutes device fonts for TrueType fonts. This is the default action for PostScript printers.
The thing is that this whole structure in general and this member specifically are trying to cover all of the possible available options in a wide variety of different printers, so for a given printer they will not all be available, usually.
But that DMTT_BITMAP value is equivalent to the Print TrueType as Graphics option, which keeps those device fonts from being used when the printer thinks it might know better.
Now in many situations those device fonts are considered to be very important, but in many international and multilingual text scenarios they are not so good -- since they are usually either not available for the font you need or the device font is a subset of the font on the computer. So printing TrueType as graphics is usually the best option....
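To make the choice concrete, here is a small sketch that models the four dmTTOption values and picks out the one that lets the printer substitute its own fonts. The numeric values are as I recall them from wingdi.h and should be checked against the header before being relied on; the enum and helper names are hypothetical, and nothing here talks to a real printer driver:

```python
from enum import IntEnum


class TrueTypeOption(IntEnum):
    """dmTTOption values (numbers assumed from wingdi.h; verify before use)."""
    DMTT_BITMAP = 1            # print TrueType fonts as graphics
    DMTT_DOWNLOAD = 2          # download TrueType fonts as soft fonts
    DMTT_SUBDEV = 3            # substitute device fonts for TrueType fonts
    DMTT_DOWNLOAD_OUTLINE = 4  # download TrueType fonts as outline soft fonts


def may_substitute_device_fonts(option: TrueTypeOption) -> bool:
    """Only DMTT_SUBDEV hands the font choice over to the printer."""
    return option is TrueTypeOption.DMTT_SUBDEV


for option in TrueTypeOption:
    print(option.name, may_substitute_device_fonts(option))
```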
This post brought to you by ༔ (U+0f14, a.k.a. TIBETAN MARK GTER TSHEG)
Walter asks:
Hi,
It seems the time format in Control Panel Regional and Language Options is only mapped to .NET DateTime.ToString(“T”), i.e. the DateTimeFormatInfo.LongTimePattern.
When using Thread.CurrentThread.CurrentCulture.DateTimeFormat, the LongTimePattern will be changed according to the Control Panel, while the ShortTimePattern is not changed.
But I noticed that the Taskbar’s clock in the notification area (systray) will be changed according to the Control Panel, while keeping the Short Time format. Is there any way to do this conversion in code?
BTW, the Short Date and Long Date in Regional and Language Options map to .NET DateTimeFormatInfo’s ShortDatePattern and LongDatePattern perfectly.
Thanks.
Regards,
Walter
Technically it is not the systray, but I won't quibble; I have Raymond Chen to quibble for me. :-)
Some may immediately think of when Ivan Petrov was asking about Customizing the SHORT time format at the end of last year.
But this question is a little different, since Walter is noticing that there is an obviously unmanaged component that seems to be following the current locale yet also seems to clearly be a short time. Whassup with that?
Well, the secret is in the GetTimeFormat function, to which you can pass any of the following dwFlags values:
Value: Meaning
LOCALE_NOUSEROVERRIDE: Format the string using the system default time format for the specified locale. If this flag is not set, the function formats the string using any user overrides to the default time format for the locale. This flag can only be set if lpFormat is set to a null pointer.
LOCALE_USE_CP_ACP: Use the operating system ANSI code page instead of the locale code page for string translation. See Code Page Identifiers for a list of code pages.
TIME_NOMINUTESORSECONDS: Do not use minutes or seconds.
TIME_NOSECONDS: Do not use seconds.
TIME_NOTIMEMARKER: Do not use a time marker.
TIME_FORCE24HOURFORMAT: Always use a 24-hour time format.
If you use the TIME_NOSECONDS flag, you will essentially get a formatted short time, even though there is no short time pattern available via GetLocaleInfo.
As the GetTimeFormat documentation states, the separator will be removed when the seconds are. :-)
You can also use TIME_NOTIMEMARKER if you'd like here, though in the case of the time value that Walter was thinking about, the AM/PM time marker will usually be there unless the locale uses a 24-hour clock.
So it is not so hard to make your own short times that match the current locale....
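For managed code that only has the long time pattern to work with, the same idea can be imitated on the pattern string itself. The sketch below drops the seconds field together with the separator in front of it. It is an approximation of the behavior described above, not what GetTimeFormat does internally; the function name and the sample patterns are assumptions, and quoted literals that happen to contain an "s" are not handled:

```python
import re


def short_time_pattern(long_time_pattern: str) -> str:
    """Remove the seconds field (s or ss) and any separator characters
    immediately before it, leaving hours, minutes, and the time marker."""
    return re.sub(r"[^A-Za-z']*s+", "", long_time_pattern)


# Sample long time patterns (illustrative, not read from a real locale).
for pattern in ("h:mm:ss tt", "HH:mm:ss", "H.mm.ss"):
    print(pattern, "->", short_time_pattern(pattern))
```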
This post brought to you by : (U+003a, a.k.a. COLON)
In his usual charming style, regular newsgroup contributor Norman Diamond posted the following to the microsoft.public.win32.programmer.international and microsoft.public.word.international features newsgroups:
My hard disk has a file, whose path will likely be wrapped by Outlook Express:
C:\Program Files\Windows CE Tools\wce500\Windows Mobile 5.0 Smartphone SDK\Samples\CPP\Win32\Mapirule\readme.txt
Among useful bits of information are found the following:
> Client痴 transport (SMS, ActiveSync, POP3) arrives.
and:
> where <clsid> represents the COM object痴 class ID GUID
That deserves an award for being self-descriptive. A maker of such things as an operating system for Windows Mobile 5.0 Smartphones, an SDK for the same, and compilers that can target the same, just knew that they didn't have to use Unicode for documents like this because the ANSI code page would get the message across just fine. The ANSI code page is of course the one used by Notepad on desktop Windows systems such as XP, which defaults to code page 932 as delivered by Microsoft and preinstalled on PCs. The word 痴 really does describe the process that led to displaying the word 痴.
One might wonder if a maker of tons of documentation on how to use Unicode might want to learn how to use Notepad to save a .txt file in Unicode encoding so that this documentation file might provide information using some unknown characters other than 痴. A certain company which is known for being 痴 might have the ability to teach them. But a certain company which is known for being 痴 might not want to learn from them.
He does have a way with words (see the sponsor tag line for the 痴 (U+75f4) ideograph if you don't know the meaning of 痴 and want a better understanding of Norman's humor!).
It is important to look beyond the words for a moment, to see what we are talking about here. :-)
Those who read Behind 'How to break Windows Notepad' might have a hint of what is going on here -- we are looking at one of those "misunderstanding the characters in a file" problems. Though in this case it is not an ANSI file being mistaken for a Unicode one; it is an ANSI file in one code page on a machine whose CP_ACP is a different code page....
Now I will be the first to admit that it seems foolish to rely on something as fragile as the default system code page for the readme file of a sample.
If you convert 痴 to cp 932, you get 0x9273, which, had it been treated as cp 1252, would be ’s, the clitic that is used in English to indicate a possessive. Thus the actual strings would be:
> Client’s transport (SMS, ActiveSync, POP3) arrives.
and:
> where <clsid> represents the COM object’s class ID GUID
where ’s is actually 0x92 and 0x73, which becomes U+2019 and U+0073 via cp 1252.
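The round trip is easy to confirm with any pair of codecs for the two code pages. Here is a minimal sketch using Python's cp1252 and cp932 codecs, which I am assuming behave the same way as the Windows code pages for these particular bytes:

```python
intended = "Client\u2019s"           # what the author typed: Client’s

raw_bytes = intended.encode("cp1252")  # how the file was saved
misread = raw_bytes.decode("cp932")    # how a cp 932 machine reads it

print(raw_bytes[-2:].hex())            # the two bytes behind ’s
print(misread)                         # what Norman saw

# Going the other way recovers the original text.
recovered = misread.encode("cp932").decode("cp1252")
print(recovered)
```

Since 0x92 is a valid lead byte in cp 932, it swallows the "s" that follows it, which is why two characters collapse into one ideograph.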
Now, Microsoft Word will commonly take ' (U+0027) and autocorrect it to ’ (U+2019).
No keyboard that Microsoft releases sticks U+2019 in a file, so it really looks like the problem is that the text was edited in Word at some point. That it became 痴 is just a real bit of irony that helped Norman point out an ironic sort of a bug and helped me provide a good Unicode Lame List story. :-)
This post brought to you by 痴 (U+75f4, a.k.a. a CJK Unified Ideograph meaning foolish, stupid, dumb, silly)