Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
The title for this post actually comes from the SortKey help topic:
Each character in a string is given several categories of sort weights, including script, alphabetic, case, and diacritic weights. A sort key serves as the repository of these weights for a particular string. For example, a sort key might contain a string of alphabetic weights, followed by a string of case weights, and so on. SortKey is equivalent to the Windows API method LCMapString with the LCMAP_SORTKEY flag. However, unlike LCMapString, the sort keys for English characters precede the sort keys for Korean characters.
Someone asked me what the hell that text refers to!
Well, a decision was made back in the early days of Windows (that incidentally many have had cause to regret) to cause ideographs for Korean to be sorted in front of all of the other letters (including the Latin script letters of English). This code exists on all of the Windows NT-based platforms and on the Windows CE platforms, but when the time came to support sort keys on the .NET Framework a decision was made to explicitly not do this.
There is no real linguistic basis for either behavior; it's arbitrary either way.
However, since the .NET Compact Framework uses the WinCE OS tables to do its work, the WinCE results will differ from those of the .NET Framework everywhere else.
It is worth mentioning that the text in the SortKey topic is a little confusing, since it does not make clear that this only happens for the Korean LCID. And since the Windows behavior is not completely documented, it does not cover the fact that neither Extension A nor Extension B ideographs are supported by it (though at present none of them are given intentional Korean-specific weight, a fact that will change in future versions).
In any case, that's why Korean has the Korean coming first. Though when you consider the fact that this affects over 20,000 ideographs, the image of someone asking if they could cut in line at the supermarket "because they only have 20,000 items" is a little scary. :-)
This post brought to you by "ᇴ" (U+11f4, a.k.a. HANGUL JONGSEONG KAPYEOUNPHIEUPH)
A few months ago, I was talking with a customer who simply could not understand the sorting results she was seeing (in this case in a table in MS Word 2003). She distilled it down to a small repro; basically she took a small list of words:
word    meaning
cote    dimension
côte    coast
coté    with dimensions
côté    side
(at this point I knew both the language and what was causing her to have problems, and you may know too!)
What she noticed was that if she marked the left column as being French text (she tried several French choices, including France and Canada), the order was like this: cote, côte, coté, côté
while if the column was marked as being English then the results looked like this: cote, coté, côte, côté
She could not understand any sorting rules that would explain the way that the words were sorting in the "French" table.
So I talked with her about the Académie française and how they have a specific preference related to the way letters with diacritics are to be sorted (I also mentioned incidentally that I thought they had abolished the use of the circumflex in the early 1990s in many words, but honestly I did not know if it applied to these two words that use the SMALL O WITH CIRCUMFLEX!).
The specific rule I was talking about here is that diacritics are evaluated from right to left rather than from left to right. Thus côte comes before coté, rather than after it as it does in languages like English that evaluate them from left to right, because the word côte has no ACUTE on the "e" at the end of the word while coté does. In English and most other languages, the evaluation starts on the left and therefore the CIRCUMFLEX or lack thereof on the "o" is the controlling factor in ordering.
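To make the right-to-left rule concrete, here is a minimal Python sketch. It is a toy model and not what Windows or Word actually does: the helper names (split_weights, english_key, french_key) are made up for this example, and ranking the marks by code point is an assumption that happens to be good enough for these four words. The only point is that reversing the diacritic weights before comparing them flips the middle two words.

```python
import unicodedata

def split_weights(word):
    """Split a word into base letters and the marks attached to each letter.

    Toy model: the base letter stands in for the primary weight and the
    combining marks stand in for the secondary (diacritic) weight.
    """
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += ch          # attach the mark to the preceding letter
        else:
            primary.append(ch.lower())
            secondary.append("")
    return primary, secondary

def english_key(word):
    # Diacritics are compared from left to right.
    primary, secondary = split_weights(word)
    return (primary, secondary)

def french_key(word):
    # Diacritics are compared from right to left: reverse the secondary weights.
    primary, secondary = split_weights(word)
    return (primary, secondary[::-1])

words = ["côté", "coté", "côte", "cote"]
english_order = sorted(words, key=english_key)
french_order = sorted(words, key=french_key)
```

With these definitions english_order comes out as cote, coté, côte, côté and french_order as cote, côte, coté, côté, which is the difference the customer was seeing.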
You can see it described in the French sort order example in Appendix D of the first edition of Developing International Software for Windows 95 and Windows NT.
This particular rule is interesting in that in all of the native French speakers with whom I have spoken, I never found anyone who could explain the rule to me. In their defense they were pretty much all aware that there were special rules used in dictionaries, but if you think about it there would seldom be a time that one could not find the word one wanted in a dictionary that used this rule. After living with the language for a lifetime, I am sure things like this are simply understood subconsciously when they occur. This phenomenon is common in almost all languages and they pretty much all have rules that native speakers understand even if the speakers cannot articulate the rules.
Another interesting factoid about language that can be seen here has to do with the fact that this use of "reverse diacritics" is seen in every French locale supported by Windows. It is fascinating to see the influence that the "mother country" of a language can sometimes have on changes that are made to other places where it is spoken.
When changes are made, whether by longtime organizations such as the Académie or by direct legislation, other countries will in many cases tend to pick up those changes. To me, the reasons behind such language reforms spreading this way are fascinating to contemplate. It is certainly not any kind of sovereignty or true language "ownership" issue (and in future posts I may discuss specific cases in other languages where changes were at times intentionally not picked up!).
But I am at times amazed at the way that people will appear to see language as transcending the petty things. It's the kind of behavior that makes me interested in linguistic issues. :-)
This post brought to you by
I have been noticing more and more web sites lately that contain an interesting type of ad. Have you seen any of the following?
In each of these you are playing a game where you are given an "easy" task (provided you don't believe that Pamela Anderson is a pop diva, of course!). All you have to do is click in a specific small area of the screen making up a small part of an image that is not in the tab index, has no accelerator, has no other keyboard access, and is thus generally unreachable to people who have accessibility issues and are unable to use the mouse.
I must admit that there is a part of me that gets mad any time I see inaccessible UI. It's unfair, it's thoughtless, and it discriminates. It sucks.
I am also forced to admit that there is another, larger part of me that is a little jealous of people who can be spared the stupidity of these ads. It's almost like someone is saying to them "I know you have it tough, so we won't force you to deal with the pop-up hell our company wants to give everyone else on the web dumb enough to click here."
All of that is probably crap -- if the lifeless wingnuts who design these Beelzebub's hors d'oeuvres and serve them up on web sites that are so desperate to have the Internet make some money for them finally that they will pimp for the video game system target practice contests were to recognize that there was a whole crowd of suckers who would give them money, then they'd be packaging up a system to deliver the con to them.
Luckily they do not seem to be all too bright.
A sort key is basically an array of bytes. The intention of the sort key is to make for faster comparisons of strings, so that if you compare the sort key values for two strings you will get the same results as comparing the strings themselves. They abstract out all of the irrelevant data (for example, if you use NORM_IGNORECASE or CompareOptions.IgnoreCase then the binary sort keys for "AAAA", "AaAa", and "aaaa" will all be identical). As such, sort keys make a great basis for an index of string values, like you would have in a database engine.
But how are they structured to allow this to happen?
They have the same architecture in both managed code (via the SortKey class) and unmanaged code (via LCMapString with the LCMAP_SORTKEY flag). The structure is described in the LCMapString topic in the Platform SDK:
[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x00
Note that the sort key is null-terminated. This is true regardless of the value of cchSrc. Also note that, even if some of the sort weights are absent from the sort key, due to the presence of one or more ignore flags in dwMapFlags, the 0x01 separators and the 0x00 terminator are still present.
The reason for this structure is that the primary weights (called the Unicode weights, above) need to take priority over secondary weights (called Diacritic weights, above), which themselves have to take priority over the tertiary weights (called Case weights, above), and so forth. In this way, all of the following examples are true when using the invariant locale/culture, as described in the last post:
AAAA < AAAB (primary difference)
AAAA < AÃAA (secondary difference)
aaaa < AAAA (tertiary difference)
AÃAA < AAAB (primary difference, secondary difference ignored)
AAAA < aaaB (primary difference, tertiary difference ignored)
aaaaab < aaab (primary difference, length and tertiary difference ignored)
AAA < aaà (secondary difference, special width and tertiary difference ignored)
And so forth. For that to work, the four different categories need to be kept separate and each one needs to be put in the sort key in its entirety, and if any type of weight is ignored then that whole section will be empty.
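Here is a rough Python sketch of the four-section layout described above. The weight values are invented for illustration and are not the real Windows weights; toy_sort_key and its two flags are hypothetical names, the toy only handles the letters a to z, it treats every diacritic the same, and it leaves the special weights section empty. What it does try to get right is the shape: all primary weights, 0x01, all diacritic weights, 0x01, all case weights, 0x01, the special weights, 0x00.

```python
import unicodedata

SEPARATOR = b"\x01"
TERMINATOR = b"\x00"

def toy_sort_key(text, ignore_case=False, ignore_diacritics=False):
    """Build a toy sort key: primary 0x01 diacritic 0x01 case 0x01 special 0x00.

    Every weight is >= 2 so it can never collide with the 0x01 separators
    or the 0x00 terminator (an assumption of this sketch).
    """
    primary, diacritic, case = bytearray(), bytearray(), bytearray()
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if diacritic:
                diacritic[-1] = 3        # toy weight: "this letter carries a mark"
            continue
        base = ch.lower()
        if len(base) != 1 or not ("a" <= base <= "z"):
            raise ValueError("toy sort key only handles the letters a-z")
        primary.append(ord(base) - ord("a") + 2)
        diacritic.append(2)              # toy weight: "no mark"
        case.append(3 if ch.isupper() else 2)
    if ignore_diacritics:
        diacritic = bytearray()          # section is empty, separator stays
    if ignore_case:
        case = bytearray()               # section is empty, separator stays
    special = b""                        # not modeled in this sketch
    return (bytes(primary) + SEPARATOR + bytes(diacritic) + SEPARATOR
            + bytes(case) + SEPARATOR + special + TERMINATOR)
```

Comparing the resulting byte strings reproduces the first six examples in the list above, and with ignore_case=True the keys for "AAAA", "AaAa", and "aaaa" come out identical. Because the separator is smaller than any weight, a shorter string also sorts before a longer one that starts with it.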
You can take the sort key, this structured array of bytes, and use it as an index for the string. Comparisons of two byte arrays will always be faster than comparing the strings themselves.
This of course assumes that the sort keys are pre-calculated, like in an index. If they are not, and you are looking at the difference between calculating and then comparing the sort key values for two strings versus comparing the strings themselves, the string comparison will almost certainly be faster. The reason for that is that the sort key calculation involves analyzing the information of the entire string (and still does not include the actual comparison) whereas string comparisons will exit as soon as they can come up with an answer to the question of which one comes first.
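The "pre-calculated, like in an index" case can be sketched in a few lines of Python. Everything here is an assumption made for illustration: stand_in_key is just case folding plus UTF-8 encoding, which is nowhere near a real sort key, and SortKeyIndex is a made-up class. The point is only that each key is computed once when the index is built, and after that every lookup is plain byte comparison.

```python
import bisect

def stand_in_key(text):
    # Assumption: a crude stand-in for a real sort key (case-insensitive only).
    return text.casefold().encode("utf-8")

class SortKeyIndex:
    """Keeps strings ordered by a precomputed key, the way an index would."""

    def __init__(self, strings):
        # The expensive part (key calculation) happens once per string, here.
        self._entries = sorted((stand_in_key(s), s) for s in strings)
        self._keys = [key for key, _ in self._entries]

    def find(self, text):
        # Lookups only compare byte arrays via binary search.
        key = stand_in_key(text)
        lo = bisect.bisect_left(self._keys, key)
        hi = bisect.bisect_right(self._keys, key)
        return [s for _, s in self._entries[lo:hi]]

index = SortKeyIndex(["banana", "Apple", "apple", "cherry"])
matches = index.find("APPLE")
```

Here matches ends up as ["Apple", "apple"]. If you only ever compare two strings once, building keys like this is wasted work, which is the trade-off described above.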
I was doing a presentation a few years ago and it occurred to me that looking at direct string comparison versus sort key calculation/comparison was like looking at the "retail" version of collation vs. the "wholesale" one. Only some people in the crowd felt it was an illuminating analogy, and I once again learned that I should not blurt out "good ideas that suddenly occur to me" when I am in the middle of a presentation. :-)
One last thought -- no, there is not an Ordinal type of sort key. Because Ordinal comparisons are already done in a binary manner!
This post brought to you by "р" (U+0440, a.k.a. CYRILLIC SMALL LETTER ER)
There is a great deal of confusion surrounding the meaning of these two different things in the .NET Framework, and when to use each. If you have suffered, are suffering, or think you may suffer in the future from such confusion, then read on!
(Otherwise, I guess you can go away and come back another time)
The invariant culture's direct ancestor is the invariant locale. Officially added to the Windows source tree at 10:23am on May 12, 2001, its intention was not to be used as an actual locale (which would explain why no locale data was added until a month later; until then no one was using it in GetLocaleInfo!).
Originally, LOCALE_INVARIANT had just one noble purpose -- to allow one to use CompareString (and LCMapString with the LCMAP_SORTKEY flag) in a way that would only use the "Default" Windows sorting table as mentioned a little bit here and especially here. The results, as that second article mentioned, would not vary when the user or system locale settings did; they would be invariant within that installation of Windows.
The data was added for this locale a month later, as I said, for obvious reasons -- if you have an LCID that one function considers to be valid, you must have a very good reason if another will not. And it cannot duplicate any other locale, either. Much weird data was added so that no one would be tempted to try to act like they spoke a language called "Invariant" and then all was good.
Note that these string comparisons still had much linguistic value -- half of the locales in Windows use that default table, so an invariant sort would not only avoid varying, it would also look right to a lot of the world.
The .NET Framework had similar requirements (with the additional need for invariant parsing/formatting support) and thus CultureInfo.InvariantCulture was created. As with the locale, any string comparisons made with InvariantCulture's CompareInfo object would have linguistic validity in a lot of places, and would not vary within that installation of the .NET Framework.
So everyone had what they needed, right?
Well, no.
A bunch of people wanted a method of doing a more binary type of comparison, instead of one that would be based on the "linguistically appropriate" approach given a particular culture1.
The difference between what we had and what they wanted was akin to the difference between the C Runtime's strcoll/wcscoll versus strcmp/wcscmp (in the CRT documentation they refer to the difference as being locale based versus lexicographic).
The other advantage to such a "lexicographic" comparison is that it would be faster since a simple binary comparison of the code point values was being used.
To meet this need, the notion of an Ordinal sort was added and an Ordinal member was added to the CompareOptions enumeration. Selecting it would ignore all of those cultural collation features and give you a binary sort that would also, incidentally, not vary.
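The contrast between the two kinds of comparison can be sketched in Python. This is an illustration under stated assumptions, not the .NET implementation: ordinal_key compares code point values (the same for BMP characters as comparing UTF-16 code units, which is what .NET strings are made of), and toy_linguistic_key is a made-up stand-in for culture-aware collation that only knows "letters first, then lowercase before uppercase".

```python
def ordinal_key(text):
    # Binary comparison: nothing but the numeric code point values.
    return [ord(ch) for ch in text]

def toy_linguistic_key(text):
    # Assumption: a crude stand-in for culture-aware collation.
    # Letters are compared ignoring case first; case only breaks ties,
    # with lowercase ahead of uppercase.
    return (text.casefold(), [ch.isupper() for ch in text])

words = ["apple", "Zebra", "Apple", "zebra"]
ordinal_order = sorted(words, key=ordinal_key)
linguistic_order = sorted(words, key=toy_linguistic_key)

# Capital omega (U+03A9) has the smaller code point, so it comes first.
omegas = sorted(["\u03c9", "\u03a9"], key=ordinal_key)
```

The ordinal sort gives Apple, Zebra, apple, zebra (every uppercase ASCII letter has a smaller value than every lowercase one), while the toy linguistic sort gives apple, Apple, zebra, Zebra.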
The only remaining problem at this point is that there were now two useful ways to do these different "niche" types of comparisons, but neither name really jumps out at the developers who are looking for such solutions.
That problem remains to this day, though every single time I speak at a conference or answer a question in a newsgroup or get someone to look at posts like this one, then there is at least one less developer who has this problem. Maybe this time it is you? :-)
Now the story does not end here; many people have wanted to do things in a case-insensitive way. Of course if you wanted a case-insensitive invariant comparison then you could have done that all along -- just use the InvariantCulture's CompareInfo methods with the CompareOptions.IgnoreCase flag passed in. Easy!
But some people wanted a case-insensitive ordinal comparison?!?
Now the closet linguist in me shudders at this concept since a casing operation is essentially a linguistic one while an ordinal one is specifically not -- it's lexicographic.
So people are asking for a linguistic non-linguistic support, a request that for me brings to mind the comedian Steven Wright's dog2.
However, the technical half of me understands the need and so I got over my linguistic fetish as one of my colleagues on the BCL team worked in Whidbey to add a new OrdinalIgnoreCase member to the CompareOptions enumeration.
The behavior is basically to do the casing operation using the default casing tables prior to doing the binary comparison. This feature has been in the "Whidbey" version of the .NET Framework for some time (first checked into the source code tree on February 7, 2003), so you can try it out today if you have just about any build of Whidbey underfoot.
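In Python terms, "do the casing operation, then do the binary comparison" might look like the sketch below. Two assumptions to flag loudly: I am using upper-casing as the casing operation, and Python's str.upper() uses its own Unicode case mappings rather than the Windows or .NET casing tables, so edge cases (for example "ß") will not match what the framework does. The function name is made up for this example.

```python
def ordinal_ignore_case_compare(a, b):
    """Return -1, 0, or 1 after case-mapping both strings.

    Sketch only: upper-case both sides, then compare code point by code point.
    """
    upper_a, upper_b = a.upper(), b.upper()
    return (upper_a > upper_b) - (upper_a < upper_b)

same = ordinal_ignore_case_compare("file.TXT", "FILE.txt")    # 0
before = ordinal_ignore_case_compare("apple", "Banana")       # -1
plain_ordinal = ("apple" > "Banana") - ("apple" < "Banana")   # 1
```

Without the casing step, "apple" sorts after "Banana" because lowercase "a" has a larger code point than uppercase "B"; with it, the comparison is "APPLE" against "BANANA" and the order flips.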
Hopefully this post will help clear up some of the confusion about these two interesting comparison types.
1 - What can I say? Some people are Некультурные (uncultured) though not in the culturally offensive sense.
2 - Steven Wright claimed to have named his dog Stay so that he could call out "Come here, Stay! Come here, Stay!" and watch the dog walk toward him in a stuttery fashion.
This post brought to you by "Ω" (U+03a9, GREEK CAPITAL LETTER OMEGA)
I talked to Omega just before this post went live. She said that as the last letter in the Greek alphabet (who was pretty much always therefore last in the queue), she understood the cost of keeping letters in order. Any performance benefit is a good one, to her mind. Especially since a binary sort would let her come before her little sister (U+03c9, GREEK SMALL LETTER OMEGA) for once.
My New Year's Resolutions for 2005:
People who know me will affirm that I was doing this stuff already -- so it should be easy to keep doing it. :-)
This post brought to you by "F" (U+0046, LATIN CAPITAL LETTER F)
I will not attempt to guess why the letter F would want to sponsor here, but he seemed to feel it was fairly important.
I am going to take these two questions out of order because (a) locales existed before cultures did, (b) neutral locales set the stage for neutral cultures, and (c) I think it may help us look less lame. Though that third reason is probably just naive optimism on my part....
To see what Windows does with neutral locales, you can look at the documented behavior of ConvertDefaultLocale. Basically, if you pass a neutral like LANG_ENGLISH then it will return the equivalent of MAKELANGID(LANG_ENGLISH, SUBLANG_DEFAULT) thus 0x009 becomes 0x0409, 0x01a becomes 0x041a, and so forth. Easy, huh?
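The arithmetic behind those numbers is simple enough to sketch in Python. MAKELANGID packs the sublanguage into the bits above the low 10 bits that hold the primary language, and SUBLANG_DEFAULT is 0x01. The convert_neutral function below is a made-up toy that models only the neutral-to-default step; it ignores the sort ID bits of an LCID and the special default locale values that the real ConvertDefaultLocale also handles.

```python
SUBLANG_NEUTRAL = 0x00
SUBLANG_DEFAULT = 0x01
LANG_ENGLISH = 0x09

def make_lang_id(primary, sublang):
    # Same bit layout as the MAKELANGID macro: sublanguage above 10 bits of primary.
    return (sublang << 10) | primary

def primary_lang_id(lang_id):
    return lang_id & 0x3FF

def sub_lang_id(lang_id):
    return lang_id >> 10

def convert_neutral(lang_id):
    """Toy model: a neutral ID becomes the same language with SUBLANG_DEFAULT."""
    if sub_lang_id(lang_id) == SUBLANG_NEUTRAL:
        return make_lang_id(primary_lang_id(lang_id), SUBLANG_DEFAULT)
    return lang_id

english = convert_neutral(LANG_ENGLISH)   # 0x0409
other = convert_neutral(0x1A)             # 0x041a
unchanged = convert_neutral(0x0409)       # already specific, left alone
```

So 0x0009 turns into 0x0409 simply because 0x01 shifted left by 10 is 0x0400.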
This ConvertDefaultLocale function calls an internal routine to do its work; the same routine is called by every NLS function, too. Which is a long way around to say that neutral locales do not exist to the Win32 NLS APIs.
Now there is one use for them in Win32 -- resource loading. You can use neutral LCIDs either to more accurately tag resources or to provide an easy fallback mechanism. Of course, if you want to put names on them then you cannot use GetLocaleInfo since asking for the LOCALE_SLANGUAGE of LANG_ENGLISH will give you "English (United States)" which is probably not what you wanted.
(In fact, I wonder what Visual Studio's resource editor does for its strings for these neutrals -- it must have its own strings somewhere, hard coded? Ick!)
In retrospect, it might have been a better idea to not do things that way, but it has shipped this way for at least the last ten versions of Windows. So we are kinda stuck with it.
Anyway, that's neutral locales -- at best, they are tolerated. But you can't really do much with them. Using that ConvertDefaultLocale-ish behavior can actually get you unexpected results sometimes, too. More on this another day.
This brings us to neutral cultures....
In the .NET Framework, a neutral CultureInfo mostly does not do this weird implied LCID fallback thing. There is actual data behind these cultures that you can query and use -- and you can get back the actual names and everything. It also does a great job on the resource loading fallback -- using the CultureInfo object's Parent property. The Parent property is not based on LCID tricks, either -- it's actual planned data for each culture. Obviously much cooler and a bit more thought out.
There is even a CreateSpecificCulture method on CultureInfo that does the same sort of thing as ConvertDefaultLocale, creating a specific culture from a neutral one.
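The Parent-based fallback for resources can be modeled with a couple of dictionaries in Python. None of this is the .NET resource loader: the PARENTS table, the RESOURCES data, and lookup are all made up for the sketch. The one detail borrowed from the framework is that the invariant culture's name is the empty string, which sits at the top of every chain.

```python
# Hypothetical parent data: specific -> neutral -> invariant ("").
PARENTS = {"en-US": "en", "en-GB": "en", "fr-CA": "fr", "en": "", "fr": "", "": None}

# Hypothetical resources; most cultures only override a few strings.
RESOURCES = {
    "": {"greeting": "Hello", "farewell": "Goodbye"},
    "fr": {"greeting": "Bonjour"},
    "fr-CA": {"farewell": "Salut"},
}

def lookup(culture, key):
    """Walk from the requested culture up through its parents until the key is found."""
    current = culture
    while current is not None:
        table = RESOURCES.get(current, {})
        if key in table:
            return table[key]
        current = PARENTS.get(current)
    raise KeyError(key)

from_neutral = lookup("fr-CA", "greeting")     # found on the neutral "fr"
from_specific = lookup("fr-CA", "farewell")    # found on "fr-CA" itself
from_invariant = lookup("en-US", "greeting")   # falls all the way back to ""
```

The neutral culture earns its keep in the middle of the chain: strings shared by every French locale live in one place, and only the regional differences need their own entries.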
A lot of you probably noticed where I said "mostly" in that last paragraph (an occupational hazard of having readers who can be as cynical as I am!). Unfortunately, that weird LCID-esque fallback behavior still basically happens for collation and encoding via the culture's CompareInfo and TextInfo objects. Which is not such a big deal, and it is really necessary since both of those objects need the context of a specific culture.
In retrospect, it might have been a better idea to not do things that way -- CompareInfo and TextInfo should not have been made available (as happens for other objects like the associated DateTimeFormatInfo and NumberFormatInfo), but it has shipped this way for two versions so we are kinda stuck with it.
One important difference that distinguishes them from neutral locales is that one could create a class that is derived from CultureInfo that contains language-specific information which would make more sense in neutral cultures, which is really a fancy way of saying "language-only cultures" (which is itself kind of a fancy way of saying something).
There are also some odd situations with the LCID property of the CultureInfo, but that's a separate issue. More on this another day, too.
So, that's neutral cultures. Mostly not useful, except for resources, or in the same way that they are useful in Win32 (by pretending it's a specific culture), or for potential extensibility either by Microsoft or by developers in the future.
Three steps forward, one step back? :-)
This post brought to you by "♎" (U+264e, a.k.a. LIBRA)
The quote in the title is an allusion to a quote from the Simpsons. Now the ANY key quote is not my favorite quote from Homer J. Simpson (actually, this is), but it's in the top five.
And it popped into my head as I prepared to talk about the AltGR key, which is a key that does not exist on the keyboard upon which I am typing right now.
Anyway....
In a comment from a couple of days ago, Norman Diamond suggested that AltGr stood for Alternate Graphics. That's fine as far as it goes, but only the first two words of his post suggest what AltGR stands for, and the words do not really explain what it means, though he talks a bit about what it does. So the question remains, where does ALTGR come from as a term?
Well, it does indeed stand for Alternate Graphics (or Alternative Grafiken). The original intended purpose of it was to have an easy way to get at the table-based graphical characters that were so handy to use in a console application, located on the right side of the spacebar.
It was, however, used quite actively for keyboards that needed extra keys (and there is no layout I know of today in Windows that supports the graphical characters except by accident in consoles, and they do not use ALTGR to get there). This does not apply to the US English keyboard hardware, so they just put a RIGHT ALT key there which will actually act as if it were the ALTGR key any time you switch to a layout that makes use of this extra shift state. Note that this extra shift state is also available by hitting <Ctrl>+<Alt>, but that's more work to type. So having a single key to type instead is much cooler.
Of course, this can cause problems since sometimes people make shortcuts using <Ctrl>+<Alt>, which screws with what people might want to actually do with a keyboard layout. In fact, Raymond Chen talked about Why Ctrl+Alt shouldn't be used as a shortcut modifier last March, explaining this fact.
I would extend Raymond's very good advice to anybody who uses MSKLC to create custom keyboards (note that MSKLC warns about assigning <Ctrl>+<Shift> to keyboards since many shortcuts are assigned there). Or people who use Word to create shortcuts. Or people on the Microsoft Word team who created tons of "useful" shortcuts that do not mind stomping on what a keyboard layout may have assigned to a keystroke combination1. The key is to think about the keyboards and/or the shortcuts you create in the larger context of where you may either step on others or be stepped upon by them.
And if you create a custom keyboard with MSKLC, consider putting one of the graphical BOX DRAWING characters in the ALTGR state somewhere, so that you can be one of the cool people that makes the AltGR key meaningful again. It's like having an easter egg in software, but with an important recreational purpose!
1 - Every few months I start looking at the Word object model and its KeyBindings collection and related trivia to create a Word Add-in that will listen for keyboard changes and any time a WM_INPUTLANGCHANGE notification is received it would remove the Word shortcuts that conflicted with actual keyboard assignments. I find the undone project I was working on, get into it for a few hours, and then realize that this is something that the Word team or the Office team ought to put together and build into the product. So I send off some mails and they agree with me and then it seems to go nowhere. A few months later it starts over again. Maybe one day one of us will have a finished solution for this problem. :-)
This post brought to you by "╦" (U+2566, a.k.a. BOX DRAWINGS DOUBLE DOWN AND HORIZONTAL).
The competition in the BOX DRAWING block of Unicode to do a topical sponsorship of the post was fierce; it was finally chosen by the drawing of lots, in order to avoid violence.
In the future, an effort will be made to woo "appropriate sponsorship" from Unicode characters based on actual relevance to the specific post. Otherwise, it's like a celebrity endorsement for a product that the famous person does not use -- and I hate that.
I have a box of candy in my office, and the two methods that seem most effective in keeping me between 161 and 165 pounds are to (a) try to fill the candy box with candy I do not like but other people do, and (b) not spend too much time at work and instead work from home. The second method is important since I also keep a large fridge filled with Limonata and if I am there too much I'll gain weight even without the candy and because other people like to fill the box, too. The first is important because I probably buy the most and I can dilute these other nice people.
Every once in a while I will see something new and I'll buy a bag of something that I have never seen before because I am curious what the hell it is. Most of them will be eaten before I get around to trying one and then I'll finally figure out if I liked it or not.
Have you ever had a particular kind of candy that you could not tell if you liked by trying it?
And have you ever found that trying it again later that day or the next day you still weren't sure?
It's probably just me.
Anyway, I haven't decided yet if I like blogs.
I mean, I have been doing this one for two months. I have been trying to post something relevant every day for a little over a month of that. And at least four of those posts have been interesting. It's hard to sort out in the stats the people who link to you from the people who are just subscribing to hundreds of blogs and are thus not really reading any of this, but I think there are at least 20 people who read it now and again.
Yet, I can't tell if blogs annoy me yet.
I may not know until I can't think of something to post, and at that point I will be annoyed so it probably won't be fair to take that impression as an indication of how I feel about a whole technology.
Robert Scoble likes them a lot, but I have known him since he worked for Jim Fawcette and he always thinks some technology is the coolest thing in the world, and it's always non-coincidentally what he is very involved with at the time. I'm not criticizing that, since if nothing else it proves he has convictions and believes in what he is doing/saying - which is a good thing. But it's not a good way to get independent advice on what is cool. After all, he used to love the Offramp, too (and probably still does!), and I never really did, even when I used to post there....
Maybe the whole blog opinion thing will come to me one day when I don't suspect it.
Maybe blogs just taste like chicken. And I kind of like chicken. Though not every day....
On the other hand, maybe blogs taste like Limonata. In that case, you can expect about a case of posts every week!
This post brought to you by "✌" (U+270c, a.k.a. VICTORY HAND)
(From the Suggestion Box)
When people start looking at East Asian languages, they notice that most of the regions have a sort based on pronunciation: Korea has a sort based on the Hangul pronunciation of the Hangul and Hanja codepoints, Taiwan has one based on the pronunciation in Bopomofo order, and China has one based on the Pinyin pronunciation. They notice that there is one major region missing from this list -- Japan. They wonder why Japanese is not given the benefits of such a sort. Isn't the Japanese market important to Microsoft?
(I have been asked this very question, sometimes that very way, in email)
There are two answers to this question, one long and one short.
The short answer is that there is a pronunciation based sort in Windows. Simply pass any Hiragana or Katakana to CompareString or LCMapString/LCMAP_SORTKEY and you will see everything collate properly. It even works in all locales; one does not even need to pass the Japanese LCID, 0x0411, to see it happen. The world is in the proper アアあイイいウウうエエえオオお order (in the traditional AIUEO order, Halfwidth Katakana followed by Fullwidth Katakana followed by Hiragana). What more could one want?
Of course the answer to that question is in the long answer -- people want to know how to get the Kanji (the Han ideographs) to sort in this order, too.
The answer to this question is that there is no such sort. To explain why, let's look at how the Korean/Chinese/Taiwanese regional sorts are done. In all three of them, there are often ideographs that have multiple potential pronunciations, based on context (just as exists in English for words like Polish the language versus polish the furniture cleaner). This would make pronunciation-based sorts impossible except for the fact that the most common pronunciation is determined and then that is the one that is used when multiple pronunciations exist.
Admittedly this is not a perfect solution, but short of a computer that can actually read the text, there is not much more that can be done (although I am sure one could imagine interesting dictionary-based ways to approximate things -- I have, and they fall under the heading of 'clever' even when they are not really practical).
Now let's look at the situation with Japanese.
There are three different types of pronunciations, called readings (on, kun, and nanori), and individual Kanji can have one, two, or all three of these (and in most cases at least the first two). They can also have more than one of each! The third reading type (nanori) is for names, and there is in most cases no way to know what it is without being told (this is in fact how phonebooks work -- someone gives the pronunciation in Kana to the phone company or list creator).
Given all of that, there is no way to even guess what the most common pronunciation is, even if the data were available, without giving users results that seem wrong or confusing to them. Because even though one could craft an algorithm that could make intelligent guesses at which type of reading is meant, there is no way to make something at least as likely to be correct as the other East Asian languages, especially given that what is probably the most common need for such a sort (lists of names) would require a separate field for the pronunciation.
And this is indeed the best solution for such a situation -- a separate field containing the pronunciation. It works quite well, and I would encourage any application that wants to do a pronunciation-based sort to try this method.
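To make the separate-field idea concrete, here is a minimal sketch in Python. The record type, the field names, and the three sample entries are all illustrative assumptions, and plain code point order of the hiragana stands in for a real Japanese collation, which it only approximates.

```python
from dataclasses import dataclass

@dataclass
class Contact:
    name: str     # display form, written in Kanji
    reading: str  # pronunciation supplied with the name, in hiragana

# Hypothetical sample data: each reading was given by the person, not guessed
contacts = [
    Contact("田中", "たなか"),
    Contact("山田", "やまだ"),
    Contact("佐藤", "さとう"),
]

def sort_by_reading(items):
    # Sort on the pronunciation field, never on the Kanji themselves.
    # Code point order is only a rough stand-in for real kana collation.
    return sorted(items, key=lambda c: c.reading)

ordered = [c.name for c in sort_by_reading(contacts)]
print(ordered)
```

Sorting the same list on the Kanji field would give a different order, one that has nothing to do with how the names are pronounced.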
In theory, this is something an application can do when a name is typed when the IME mode is based on pronunciation; this is the one time that the pronunciation information is present without it being queried separately -- during the composition phase. As far as I know, this is not something that is done right now (if I am mistaken feel free to let me know!). It would be exceedingly difficult to do with the IME APIs and Windows messages as they are (and it is nearly impossible in the .NET Framework since the appropriate events are not even exposed).
This post brought to you by "ㄎ" (U+310e, a.k.a. BOPOMOFO LETTER K)
People looking at wingdi.h often notice the following constant definitions, somewhere around line 1292:
#define HANGEUL_CHARSET 129
#define HANGUL_CHARSET 129
and they wonder -- which is the right one? People usually assume that one is the older and the other is preferred.
Well, it's a funny question. The short answer is that it does not matter. They are just simple #defines and they end up being the same value anyway.
The longer answer may be of interest to some, so I'll give that too. :-)
Back in the late 1930s, George McCune and Edwin Reischauer put together a system to represent 한글 (Han'gŭl) in a romanized form using the Latin script. This system (after many years of being used around the world) became the official romanized form used by South Korea from 1984 until 2000 (a very good summary of the system can be found here). In that form the first syllable 한 (Han') is combined with 글 (gŭl) to produce Han'gŭl. People would often skip all accents/diacritics and thus Hangul is the most common way people saw it (especially in identifiers like constants which cannot contain diacritics, but also in general usage). The problem with the information that is lost in the names is a real one, however, and for many years people struggled with an imperfect system.
Then, starting in the mid 1990s, work began on a romanization system that would not have all of those diacritics, and although much of this new standard was communicated earlier, in the year 2000 it was officially published as the official system by South Korea. In that standard the 'ŭ' is actually represented by 'eu', and thus the official romanization of 한글 becomes Hangeul. Given that change, there was really no good reason not to add a HANGEUL_CHARSET constant.
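The 'ŭ' versus 'eu' difference can be shown with a tiny sketch in Python. The tables below are my own and deliberately partial: they cover only the jamo found in 한글, they map the initial ㄱ to "g" because that is how it reads in this position, and they leave out the McCune-Reischauer apostrophe. This is nowhere near a complete romanizer for either system.

```python
import unicodedata

# Partial tables: only the conjoining jamo needed for 한글
INITIAL = {"\u1112": "h", "\u1100": "g"}
FINAL = {"\u11ab": "n", "\u11af": "l"}
VOWEL_MR = {"\u1161": "a", "\u1173": "\u016d"}  # McCune-Reischauer: ŭ
VOWEL_RR = {"\u1161": "a", "\u1173": "eu"}      # Revised Romanization: eu

def romanize(text, vowels):
    # NFD splits each precomposed syllable into its conjoining jamo
    out = []
    for jamo in unicodedata.normalize("NFD", text):
        out.append(INITIAL.get(jamo) or vowels.get(jamo) or FINAL[jamo])
    return "".join(out)

print(romanize("한글", VOWEL_MR))
print(romanize("한글", VOWEL_RR))
```

The only entry that differs between the two vowel tables is the one for ㅡ, which is the whole difference between the two constant names.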
Now there have been some criticisms of the "Revised Romanization of Korean" both inside and outside of Korea (summarized on the government site here with all of the changes) and its ability to properly represent Korean in a completely reversible form, and thus there are people outside of Korea who continue to work with the original McCune-Reischauer romanization. Of course, the existing constant (HANGUL_CHARSET) could not be removed anyway without breaking existing code. Also, the constant is not really mentioned explicitly in documentation much, since the *_CHARSET constants are not used much in the world of modern font linking and fallback. In the end, it was just easier to leave it in as is, but add the new constant so that people could use the "new" name if they wanted it.
Korean as a language is best represented by using actual Hangeul syllables rather than the romanized form anyway, so neither form really should affect much other than situations where you need to describe the language in English. Given the poor reversibility, the best way to store Korean text is to not try to romanize it at all if one can avoid it. Other solutions to this problem have been proposed, such as the Korean Romanization for Data Applications (KORDA), but there has not been a high demand for this solution in Windows or the .NET Framework since there are no APIs that would make good use of transliterated forms and collation is not really set up to support it either.
A final piece of the puzzle is what happens in North Korea. Essentially, the original McCune-Reischauer form is used for romanization, but the name 조선글 (Chosŏn'gŭl) is preferred. However, the preferred ordering for Jamos in North Korea (and thus by extension for the full syllables that are made up of Jamos) is different than that of South Korea. Therefore, the expected sort for North Korea is not directly available since there is no North Korean locale support in Windows or the .NET Framework, although proper rendering will be achievable if one has appropriate fonts.
This post brought to you by "ᅅ" (U+1145, a.k.a. HANGUL CHOSEONG IEUNG-SIOS)
Now as those who know me personally will attest I am not a linguist. But I often cannot help but act as if I know something about it. Yet you can tell I am not a linguist (or someone who ever got better than a B+ in grammar!) as I am about to go out on a limb and describe something based on what I think is meant. The reader is therefore warned! :-)
So, the question is more politely "what are genitive dates?". Well, to answer that question, we'll first start with the dictionary definition of genitive. We'll go with the very first definition since it describes the intended usage:
Adj - Of, relating to, or being the grammatical case expressing possession, measurement, or source
In this particular situation, it has to do with that 'possessive' usage.
When in English (in the US) you say 'December 25' aloud you usually say "December twenty-fifth", and this is really a shorter form of "the twenty-fifth of December". It's not the traditional way one thinks of a possessive (after all, December does not "own" the twenty-fifth day in the same sense that one would talk about 'my dignity' or 'the dignity of me' -- before this posting, of course!). But there is a possessive usage going on here, and in some sense December does indeed own 31 days, the twenty-fifth being one of them (while poor February, the 90-pound weakling of the calendar, owns a mere 28.25!).
Anyway, this form of "December" is the genitive form.
In English, 'December' on its own and the genitive forms such as those above are the same words, which is why you may never have learned about most of this in grade school (well, I did not in my grade school -- we could just always blame Beachwood elementary schools if everyone else learned this stuff). But this is not true of all languages. In Czech, for example, the twelfth month on its own is 'prosinec' and the genitive form is 'prosince'. In Greek the difference is 'Δεκέμβριος' versus 'Δεκεμβρίου'; in Polish it's 'grudzień' versus 'grudnia'. In Russian it's 'декабрь' versus 'декабря'. And so on for Belarusian (aka Byelorussian), Ukrainian, Slovak, Latvian, Lithuanian, and others.
Lest any of my English speaking colleagues find this too confusing, they should probably consider trying to explain the differences between the words 'I' and 'me' and their genitive forms 'my' and 'mine' and when each is used, plus the capitalization of 'I'. They will be busy for a long time trying to summarize that. Japanese has numerous ways to do counting which vary with the thing being counted, yet many other things that are simpler, like gender-neutral honorifics. Honestly, I suspect that every language to a non-native speaker has some things that are easier and some other things that are harder, yet a native speaker just handles them without even thinking. So perhaps we can forgive those with a different word used in genitive dates since the languages have done nothing wrong? :-)
At this point, one might be looking at the LOCALE_SMONTHNAME* flags in the Locale Information used by GetLocaleInfo, or the MonthNames array in the .NET Framework's DateTimeFormatInfo, and wondering why I am going on about genitive dates when it looks like Windows does not support them. But if any of my readers have used Czech or any of those other languages then they can attest that GetDateFormat and the various formatting and parsing functions in the .NET Framework support them quite well; there is simply no method to obtain the raw data. It's one of those cool stealth features which speakers of other languages do not have and thus do not understand and do not expect, while speakers of those languages do not really think about it since everything seems to be working. This has been working properly in Microsoft products since NT 4.0 and Windows 95 and probably earlier, and has been in every version of the .NET Framework that has ever shipped.
In fact, the upcoming version of the .NET Framework (code name Whidbey) includes new properties to set and retrieve the genitive form of the month names, so it is no longer really a "hidden" feature anyway. And it was never really hidden to be difficult; it was more that it is very hard to describe to anyone who does not use different forms for months. And things that are hard to document make stuff more confusing for everyone.
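To show what a formatter has to do, here is a minimal sketch in Python. It is not how Windows or the .NET Framework implement it. It uses only the December forms quoted above, the language tags and function name are my own, and the bare "day month" layout is a simplification of what each locale really prints.

```python
# Month 12 only, using the forms quoted in this post; a real table has all twelve
MONTH_NAMES = {
    "cs": {"nominative": "prosinec", "genitive": "prosince"},
    "pl": {"nominative": "grudzień", "genitive": "grudnia"},
    "ru": {"nominative": "декабрь", "genitive": "декабря"},
}

def format_december(lang, day=None):
    forms = MONTH_NAMES[lang]
    if day is None:
        # Month standing on its own: use the plain (nominative) name
        return forms["nominative"]
    # Month "owning" a day: use the genitive name
    return f"{day} {forms['genitive']}"

print(format_december("pl"))
print(format_december("pl", 25))
```

A caller who builds date strings by hand from the plain month name array gets the first form in both cases, which is exactly the mistake the built-in functions avoid.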
If nothing else, it is yet another reason to use the built-in functions and methods for formatting and parsing rather than trying to write one's own!
This post brought to you by "ᠲ" (U+1832, a.k.a. MONGOLIAN LETTER TA)
This post is one of a series that I will be putting up over the next few weeks about international and non-international issues surrounding keyboards, MSKLC, and accessibility. Through them I will deal with issues important to developers in their application, issues important to keyboard authors in their works, and issues important in the use of MSKLC (and MSKLC's own accessibility triumphs and failures).
The first article is closely related to this one and is about shortcuts. In this article I promise that I will do plenty of riffing on the differences between the two.
If you read the first post (link above) then you know about how weird shortcuts can get in international scenarios.
But think for the moment about how they use the basic keystrokes and do not care what is under them in the way of characters.
Accelerators, on the other hand, use the menu architecture. In fact the virtual key name for the ALT key is VK_MENU, which says a lot about the original intent of these keys.
Note: Most European keyboard layouts hijack the ALT key on the right side of the keyboard to get a new shift state. It is usually labeled ALTGR for reasons that I will explain another day. :-)
A good example of accelerators is <Left Alt>+<f> to open up the File menu or <Left Alt>+<d> to select the "Destroy" button on the dialog. These accelerators do not use the keystrokes like shortcuts do; they use the actual characters that show up in a WM_CHAR message if the application passes the WM_KEYDOWN message to DefWindowProc.
The upshot of this difference is that they are 100% subject to both localization of the menu items to control the choices and also the keyboard layout choice, which may or may not contain the letters in question. This means that anytime the user's keyboard layout does not contain the characters used in the accelerators, those accelerators cannot be reached from the keyboard.
Now other controls can have accelerators on them, and they are often considered wonderful accessibility helpers for that reason. Anyone who has ever tried to look at a web page and had to scroll through dozens or even hundreds of controls and URLs knows that tabbing between controls is not a very usable option.
Note that any time the keyboard layout language does not match the product's localization these items will not work. CJK (Chinese/Japanese/Korean) is the one partial exception (for what it's worth): because you cannot type Kanji in a single keystroke, the localization effort usually includes ASCII letters for accelerators. Every other language can have problems here, and even CJK and English will have problems if you switch to a keyboard layout that does not have the ASCII letters in it (you can verify this by opening Notepad on an English Windows, switching to a Hindi (or any other language) keyboard, and then trying to access accelerators -- they will not work).
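A toy model of why this happens, in Python. This is not the Windows menu code. The layout tables are tiny fragments, the Russian entries are my own illustrative assumption rather than anything stated above, and the function names are made up for the sketch.

```python
# Each layout maps a virtual key to the character it types.
# Fragments only; the Russian entries are an illustrative assumption.
LAYOUTS = {
    "US": {"VK_F": "f", "VK_D": "d"},
    "Russian": {"VK_F": "\u0430", "VK_D": "\u0432"},  # а, в
}

def mnemonic(label):
    # "&File" -> "f": the character after '&' is the accelerator
    return label[label.index("&") + 1].lower()

def find_accelerator(labels, layout, vk):
    # Accelerators compare the typed character, not the key itself
    typed = LAYOUTS[layout][vk]
    for label in labels:
        if mnemonic(label) == typed:
            return label
    return None

labels = ["&File", "&Destroy"]
print(find_accelerator(labels, "US", "VK_F"))
print(find_accelerator(labels, "Russian", "VK_F"))
```

The same physical key is pressed both times; only the character it produces changes, and that is enough to make the accelerator unreachable.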
On the whole, this means that accelerators almost always have potential to suck from the standpoint of accessibility. While shortcuts only suck until people learn where they are in other languages, the accelerators will always be broken when you switch languages and should thus never be used to meet accessibility requirements since they can easily be made unavailable.
Now shortcuts and accelerators do both have some things in common -- for example, they both cannot have Han on them, and they also cannot really have dead keys or ligatures (in the keyboard sense -- two or more UTF-16 code points -- not in the typographical sense).
Yesterday I mentioned how all of the great resources that list out shortcuts tend to lump shortcuts and accelerators together. This is okay a lot of the time since in many cases the default keyboard matches the localized language of the application. But this is easily broken any time you switch keyboards to type in another language!
Note that this also points to a suggestion for keyboards that you build yourself using MSKLC -- you might want to consider whether there is a sensible place to put the letters of the language into which the products being used are most often localized, so that one can still use the common accelerators in those applications. This may not be feasible, but if it is then users who prefer to use keyboard accessibility methods will probably be very thankful.
This post brought to you by "࿇" (U+0fc7, a.k.a. TIBETAN SYMBOL RDO RJE RGYA GRAM)
The West Wing episode that was on last night was actually a rerun of last year's "Christmas" episode, entitled Abu el Banat. I have actually gotten a few emails to both accounts and several people clicking on the Contact link on this page asking about it though.
They ask about the statistic that was floated around during the episode (where an assisted suicide issue was being forced by the head of the DEA in a state that allows assisted suicide) about how one in five people who asks for assisted suicide has multiple sclerosis, and whether the president wants to avoid getting into it because people will bring up the MS issue. They wonder if the one-in-five statistic has any basis in fact (I will assume that most of them are not wondering if I am suicidal. I'm not, by the way. It's not like I am a dentist[1] or something).
It is true that of the 93 suicides with which Dr. Jack Kevorkian assisted, 20 of the people had multiple sclerosis. It is a fact that even the MS Society decided to investigate, since they really have to look into any claim about MS, even silly ones like mercury amalgam fillings when 20/20 claims there is a connection. The results of the investigation are interesting. Though of course the results are skewed toward an MS landscape without the CRAB+T drugs (Copaxone, Rebif, Avonex, Betaseron, or the new Tysabri) or other drugs like Novantrone, it is unclear how many of those looking into assisted suicide are eligible for any of those drugs anyway. So maybe the results are accurate as they are.
But I guess a bigger issue is what is really going on here. What does being put in a situation where one feels helpless do to a person?
As I often like to point out to people: MS is not a death sentence, but it is a life sentence. And there is no parole. It is easy to spend a lot of time reminiscing about all of the things one cannot do anymore, but that is only particularly effective when one is younger, not when one is older (and everyone else is noticing what they cannot do anymore). Clearly it is not the case that one in five people who feel helpless to control their life circumstances has MS, so what makes MS so special here?
I try not to bother complaining anymore about what I used to be able to do because invariably the people I complain to point out that they could never do that anyway. Whether it's memory or dancing or running or keeping in shape, people don't seem to realize that I never compare myself to others, I compare myself to me. What is depressing is that even allowing for age it seems like I am on a whole new curve now which takes me down faster than I would have otherwise gone down. Of course people seem to have their eyes cross during that conversation and even I have to admit that it sounds stupid when I try to say it out loud. Which probably means it scores pretty highly on a scale of 1 to lame.
Does having a reason (MS) beyond the generic one everybody has (getting older) somehow make the situation feel more hopeless? A depressing thought, but somehow less depressing once it is explicitly stated that way, at least to me. Maybe because once I know there is a reason for it, I can move on to the next thing.
So, I have good control over most of my life and a good understanding of the things that I do not. And though the West Wing-touted statistic is apparently a true one, it does not impact me. Because as far as I can tell it's more for people who do not realize that we all have the uncertainty of life, and that being a bit less uncertain is not something to be depressed about.
1 - The 2000 movie The Whole Nine Yards had a lot of fun with the fact that Matthew Perry's character Nicholas "Oz" Oseransky was a dentist, with several characters pointing out that they read somewhere that dentists are prone to suicide. While I will joke about it for dramatic effect in my blog, I will point out what Dr. Jerry Gordon did, which is that it's not really true. Suggestions that the one out of five statistic is somehow related to the four out of five dentists who recommend Trident gum for their patients who chew gum are simply too silly to get into (though I guess I just did).
This post brought to you by "Ừ" (U+1eea, a.k.a. LATIN CAPITAL LETTER U WITH HORN AND GRAVE), which, after being turned down by Sesame Street as having too much on its mind to be a good sponsor, and despondent over the success of its unencumbered cousin "U", will apparently take work anywhere -- even disreputable web sites like this one
This post will be one of a series that I will be putting up over the next few weeks about international and non-international issues surrounding keyboards, MSKLC, and accessibility. Through them I will deal with issues important to developers in their application, issues important to keyboard authors in their works, and issues important in the use of MSKLC (and MSKLC's own accessibility triumphs and failures).
As it is the first post, there will likely be questions in people's minds that relate to items that will show up in future posts; I promise I will be flattered that I was able to inspire such issues before covering them!
This is a huge area. Talking about keyboards, MSKLC, accessibility, surrounding international issues.... the list of topics not being covered may have been shorter....
Anyway, let's start with shortcuts.
These little beasties are a crucial way for some to get at core functionality. While some people prefer to use Edit|Cut, Edit|Copy, and Edit|Paste from the menus, many others live by <CONTROL + X>, <CONTROL + C>, and <CONTROL + V> for the functionality all across Windows. This is so true that a developer who wants to change the functionality from Cut/Copy/Paste to eXchange/Contribute/Verify will find that they face an uphill battle, fighting spinal reflexes and muscle memory.
There are other common shortcuts such as <CONTROL + A> for "Select all", <CONTROL + Z> for "Undo" and <CONTROL + Y> for "Redo", or the other magical editing shortcuts like <CONTROL + I> for italics, <CONTROL + B> for bold, and <CONTROL + U> for underline. Users are dismayed when they do not work if it seems like they ought to (like in a text editor -- I tend to define an "immature editor" as one that does not have these definitions!), and one of the best ways to upset users is to have that functionality and put it on some other shortcut. Or even worse, to have no shortcut at all!
(There are other shortcuts that you can find out about by looking for the article entitled "Windows keyboard shortcuts overview" in Windows Help, but I am not referring to most of those; I am referring to application shortcuts that a developer might put in their application. But I highly recommend that other article since it has a lot of useful shortcuts in it and is very useful to users.)
There is an even cooler page from the Accessibility folks entitled Keyboard Assistance that covers many of the applications that Microsoft ships beyond Windows. In fact there is only one part of that page which is not so great, which I will point out in a moment. I am not making up the importance of shortcuts and accessibility. Ignoring the fact that they host this site, they host articles like the one from Sara Ford[1] entitled Testing for Accessibility which mentions shortcuts quite prominently in the accessibility testing that an application should do here. Incidentally, Sara's article rightfully does not try to separate shortcuts (the topic of this post) from accelerators (a topic coming soon). Why do I say rightfully? Because users do not usually distinguish and thus it is not entirely sensible for documentation to do so. Hell, I wouldn't separate them if I did not have to in order to describe the technical issues behind them that are the root of so many problems....
Anyway, these shortcuts are a boon to people who have difficulty using the mouse and they open up worlds for people who want to get at that functionality. I love them even when I have no problems with the mouse because I do not have to break up keyboard usage (and I sometimes do have problems with the mouse, though for me the keyboard kind of sucks then too - but that's another story). This makes them a wonderful usability adjunct for people who have problems with other means of performing common actions, as long as they know about them, which is the reason that the commonly known ones are more likely to be used. I also must come back to Sara's article, which has the coolest quote about accessibility in the world:
One final note about keyboard accessibility is that the more ways the user can perform a given task, the more accessible the functionality is.
This quote can even be extended beyond keyboard accessibility, so I will be coming back to it in future posts since it was a guiding principle in the UI design of MSKLC. The more ways there are to perform a task, the more accessible the functionality is.
But as I mentioned, there is a spot where things fall down. Here it is....
Of course, they suck a little bit for languages that do not use the Latin script, where keys for X, C, V, A, Z, Y, I, B, U, and others are not present. I say "a little bit" since folks who buy a keyboard still have access to those shortcuts but they are under other keys. And who would instinctively know that <CONTROL + ฟ> on a keyboard with the Thai Kedmanee text layout printed on it would mean "Select All", that <CONTROL + न> on a keyboard with the Hindi Traditional layout printed on it would mean "Paste", or that <CONTROL + ਬ> on a keyboard with the Punjabi layout printed on it would mean "Undo"? Some might actually make sense, but the odds of most of them meaning anything at all are low.
On the other hand, things can also suck if you switch to another Latin script layout that does not match the letters on the keys on your actual keyboard. You may well be dismayed to see what looks on the hardware like <CONTROL + A> actually quit from your application rather than selecting all of the text when you switch to the French layout (since VK_A and VK_Q are switched). It may annoy you tremendously that the undo and redo functionality in what looks on the hardware like <CONTROL + Z> and <CONTROL + Y> seems to be swapped just because you switch to the German layout (where VK_Y and VK_Z are switched).
Speaking as a non-speaker of both the Latin and non-Latin category of languages, I am hard pressed to give an opinion on which is worse other than to say that I would think it was worse if it were my language. Dammit.
MSKLC actually comes in handy here -- you can load an existing keyboard layout and hover over the different keys to find out which key sits under each of these different letters. (Of course at that point we run into MSKLC usability issues, but that's a different issue, for another post!)
But why does that happen, exactly?
These items are not subject to localization since they are virtual key (VK_*) based and they can be subject to key repositioning when a keyboard layout repositions keys (for example the French layout switches the positions of "VK_A" and "VK_Q" which means that if I switch to the French keyboard on my US keyboard hardware I have to know that the key that has "A" printed on it is really going to mean QUIT if I hit it with the control key). Thankfully, outside of the Latin script languages there is very little moving around of the VK_* values.
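Here is a toy model of that repositioning, in Python. It is only a sketch: the swap tables list just the French and German swaps mentioned in this post and nothing else those layouts change, the quit binding belongs to a hypothetical application, and the function name is made up.

```python
# Shortcuts are keyed on virtual keys, never on characters.
# The "quit" binding is a hypothetical application's choice.
SHORTCUTS = {"VK_A": "select all", "VK_Q": "quit", "VK_Z": "undo", "VK_Y": "redo"}

# Virtual-key swaps a layout applies, relative to US key-top legends.
# Partial: only the swaps mentioned in this post are listed.
VK_SWAPS = {
    "US": {},
    "French": {"VK_A": "VK_Q", "VK_Q": "VK_A"},
    "German": {"VK_Y": "VK_Z", "VK_Z": "VK_Y"},
}

def control_plus(printed_key, layout):
    # printed_key is what the US hardware shows on the key top
    vk = VK_SWAPS[layout].get(printed_key, printed_key)
    return SHORTCUTS.get(vk)

print(control_plus("VK_A", "US"))
print(control_plus("VK_A", "French"))
print(control_plus("VK_Z", "German"))
```

The shortcut table never changes; only the route from the printed key to the virtual key does, which is why the surprise is invisible to the application.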
This is a place where Chinese, Japanese, and Korean language keyboards do just as well (which is to say poorly). They keep the English positioning on the VK values even when a non-English keyboard layout is used (like in Japanese and Korean), which puts them with the rest of the non-Latin script languages. They regularly include the shortcut key in their user interface, on menu items and such, which is fairly cool. Of course this does not help if you do not know where the keys are, since they may not be printed accurately to the physical keyboard, but this makes them no worse off than every other language....
MSKLC does not offer direct customization here for where the virtual keys go; you can only indirectly change them by changing what scan codes go with what keys. (This limitation has an interesting history and reason which borders on and approaches yet does not completely reach a justification -- see a future post for more info!)
Note that beyond VK-repositioning there is no other consequence of letter assignment since localization does not affect the letter to be used (other than the fact that people may not be able to find them) and there is no dependence on anything outside of ASCII/VK stuff. Luckily, most other layouts do not move VKs around but a lot of Europe and elsewhere do so there is no 100% clean solution here.
So the place where shortcuts fall down is when the keyboard layout is switched so that it is different than the keys on the keyboard. Because then the discoverability of the shortcuts is significantly lower.
I just feel like every keyboard in another language should come with a huge mapping of the shortcuts, which is to say the VK_* assignments, to the key on the keyboard that represents it. Wouldn't that be a cool feature to have available when you buy a keyboard? Or even better, any time you select a new keyboard layout, since that is where the biggest accessibility problem lies. This may not be practical but it seems like it would be handy! Well, that or the hypothetical keyboard whose key tops change letters when you change the layout[2].
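Such a mapping is easy to imagine as data. The following Python sketch builds a small cheat sheet from the three key tops mentioned earlier in this post; the table layout and function name are my own, and each layout entry is a one-key fragment rather than a real layout definition.

```python
# Key-top characters taken from the examples earlier in this post.
# Each entry is a one-key fragment, not a full layout.
KEY_TOPS = {
    "Thai Kedmanee": {"VK_A": "ฟ"},
    "Hindi Traditional": {"VK_V": "न"},
    "Punjabi": {"VK_Z": "ਬ"},
}
SHORTCUTS = {"VK_A": "Select All", "VK_V": "Paste", "VK_Z": "Undo"}

def cheat_sheet(layout):
    lines = []
    for vk, action in SHORTCUTS.items():
        top = KEY_TOPS[layout].get(vk)
        if top is not None:  # skip keys this partial table does not cover
            lines.append(f"<CONTROL + {top}> = {action}")
    return lines

for name in KEY_TOPS:
    print(name, cheat_sheet(name))
```

With full layout tables the same loop would print the whole sheet for whichever layout was just selected.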
Addendum 23 Dec 2004 11:10am: The Japanese subsidiary has provided a good translation for some of these shortcuts here.
For developers, it is important to understand one more issue (assuming that your heads have not yet exploded). That is the fact that shortcuts are processed at the point of the WM_KEYDOWN message. This is very different from accelerators (which I will talk about another day) and is the main reason that there is no localization of the shortcuts -- they are processed before the virtual key is ever translated into a character. So while it may be useful to give documentation about how <CONTROL + ф> will make the text bold with the Bulgarian keyboard layout, it really is <CONTROL + B> that is doing the work. Windows never gets to the point where VK_B gets converted to U+0444 (CYRILLIC SMALL LETTER EF).
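The ordering can be modeled in a few lines of Python. This is a toy handler, not the Windows message loop: the function name, the return values, and the one-entry layout table are all assumptions made for the sketch.

```python
BULGARIAN = {"VK_B": "\u0444"}  # ф, per the example in this post

def on_key_down(vk, control_down, layout):
    # Step 1: the shortcut check happens on the raw virtual key
    if control_down and vk == "VK_B":
        return ("command", "bold")
    # Step 2: only otherwise is the key translated into a character
    return ("char", layout.get(vk))

print(on_key_down("VK_B", True, BULGARIAN))
print(on_key_down("VK_B", False, BULGARIAN))
```

Because the first branch returns before the layout is ever consulted, swapping in a different layout table cannot change which command runs.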
So, how to summarize this mess? I just looked over this post again after I was done writing it and realized it was a bit like the movie The Survivors[3]. There is no way to summarize it.
Except maybe to say that keyboard shortcuts are cool. Although they kinda suck sometimes, internationally....
1 - Sara's blog is an awesome resource that is on the short list of blogs I read, and not only because I screwed up accessibility too much on the most visible piece of UI I have ever created. It's incredibly insightful, and she and I are fellow travelers in that we are both passionate about areas of which it is often tough to convince people of the importance.
2 - Although I am not aware of any commercially available keyboards that do this, it is a design architecture that is wonderfully implemented on a project that Wei Wu, myself, and several TabletPC developers put to good use for the first update to the TabletPC soft keyboard, which has the job of emulating keyboard hardware. Changing the layout changes what shows up in the soft keyboard, which is such a huge leap forward in the usability of everything except for shortcuts that it amazes me. It is of course not helpful for shortcuts since the universal documentation still talks about <CONTROL + X> for cut, etc. and the soft keyboard cannot change any of that.
3 - The Survivors (starring Robin Williams, Walter Matthau, and Jerry Reed). A fun movie. I remember an interview with Jerry Reed where he was asked to summarize the movie and he said it was impossible to summarize -- the point of the movie kept changing every ten minutes.
This post brought to you by "ウ" (U+ff73, a.k.a. HALFWIDTH KATAKANA LETTER U).