Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Yes, I said it. IMEs (Input Method Editors) have it easy. And I will say that even though I have only ever built them myself from the samples in the Platform SDK or the ones already in Windows. Even though I have only ever really worked at building keyboards, which cover such a small fraction of the characters that IMEs do that it makes me look like the kid on the beach building sand castles next to Buckingham Palace.
Kind of cheeky to make such a claim, huh?
Nevertheless, I will make it. Because I am not talking about the design or the development part of it (which obviously has just as much chance to be intuitive/useful or not as any other project involving user interface and user interaction). I am talking about it from a data management and data usage standpoint. And the questions that the data can answer.
An IME is an attempt to take the small number of keys found on a regular keyboard and map them to some subset of the entire set of CJK ideographs, Kana, Bopomofo, Hangul, and Jamo characters in Unicode. The basis of the mapping varies, depending on language and user preference. It could be based on pronunciation, on the code point number, on the count of strokes, or on the radical. The user can then commit the choice the IME offers and it will be entered into the application.
And here is where we get to the part that makes me (as the owner of collation support in Windows and the .NET Framework) jealous, that makes me say the IME folks have it easy.
Because if that first choice is not what the user was looking for, then they get a list of alternate candidates that meet the same criteria as that keystroke or set of keystrokes. The list can be ordered by some collected data that tells the IME which candidate is more likely to be the right choice.
Ameliorated are the problems I discussed a few days ago about ideographs that have more than one pronunciation, because they can all be there, an entry for each pronunciation.
Mitigated are the problems I mentioned about pronunciations that can apply to many different characters, because each additional character can show up in the candidate list.
And Gone is the need to answer the question of equality that is so central to the CompareString and LCMapString APIs, the CompareInfo and SortKey classes -- because the question is no longer "are they equal, my liege?" or "which is ordered first, sire?". Instead, it's "what's on the list, dude?"
And it just struck me that it feels like a much cooler question, if you ask me.
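To make the "what's on the list?" question concrete, here is a minimal sketch of a pronunciation-keyed candidate lookup. Everything in it is an assumption for illustration: the names `ENTRIES`, `build_index`, and `candidates` are hypothetical, and the usage counts are invented, not measured. It is not how any shipping IME is implemented.

```python
from collections import defaultdict

# Hypothetical data: (reading, character, usage count).
# The counts are invented for illustration; a real IME would rely on
# collected frequency data.
ENTRIES = [
    ("ma", "吗", 90),
    ("ma", "妈", 70),
    ("ma", "马", 60),
    ("ma", "麻", 20),
    ("xing", "行", 80),   # 行 has more than one reading, so it gets
    ("xing", "星", 50),
    ("hang", "行", 40),   # one entry per reading
    ("hang", "航", 30),
]


def build_index(entries):
    """Group (character, count) pairs under each reading."""
    index = defaultdict(list)
    for reading, char, count in entries:
        index[reading].append((char, count))
    return index


def candidates(index, reading):
    """Return every character for a reading, most frequent first."""
    ranked = sorted(index.get(reading, []), key=lambda pair: -pair[1])
    return [char for char, _ in ranked]


INDEX = build_index(ENTRIES)
print(candidates(INDEX, "xing"))
print(candidates(INDEX, "hang"))
```

Note that 行 simply appears on both lists, and several characters sharing a reading simply share a list. Nothing here ever has to decide whether two strings are equal or which one comes first in any absolute sense.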
Of course I was immediately reminded by colleagues that this is only cooler when it is the question that one is wanting to ask. If Jessica needs an order for a list of strings or Wendy needs to answer the question of equality or Molly needs to build indexes for her database, then the question that the IME works so hard to answer is not nearly as cool.
I was also reminded of something else I know but sometimes forget (which is good, because the remembering part humbles me a bit -- there are many people who feel I need that!). A question's coolness cannot be judged solely by the ease or even the possibility of a good answer. Some of the coolest questions in the world do not have answers yet. Some have answers that seem much simpler or even dumber than the original questions. Some are brilliant even if the initial question seems at first glance to be dumb or trite.
So why am I jealous of the IME folks? Because getting a satisfactory answer to their question is a more tractable problem for Korean, Japanese, and Chinese (both Mandarin and Cantonese), when compared to the questions that the technologies I work on ask.
For me, a function that is smart enough to order multiple characters with the same pronunciation is easy -- I just plug in the rules for whatever mechanism acts as the tie-breaker. However, the function to take a character with multiple possible pronunciations and choose the "right" one for a pronunciation-based sort is a lot harder. Under current art, one needs to add one's own pronunciation data a la Ruby (or other annotation mechanism).
Perhaps surrounding text could provide the context, if it exists -- but what if it does not? Also, a machine being able to choose the "right" pronunciation based on such context is really the first half of the machine translation problem -- to know how to treat the data, the machine must first be able to in some sense understand what is meant.
Are there answers? Well, not in Windows or the .NET Framework today. But there is an understanding of the desired functionality. There may even be thoughts about avenues to attempt a solution. One day.
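A small sketch of the easy half, under loud assumptions: each string arrives with a reading that the caller supplied (the Ruby-style annotation), readings are compared by plain code point order, and the tie-breaker is the code point order of the original text. The name `sort_by_reading` is hypothetical, and neither ordering is what Windows or the .NET Framework actually uses; they just keep the example self-contained.

```python
def sort_by_reading(annotated):
    """Sort (text, reading) pairs by reading, then by text as tie-breaker.

    Both comparisons use raw code point order, which stands in for a
    real collation and a real tie-breaking rule (stroke count, radical...).
    """
    ordered = sorted(annotated, key=lambda pair: (pair[1], pair[0]))
    return [text for text, _ in ordered]


# The caller must provide the reading; the code cannot work out which
# pronunciation was meant for a given string.
places_and_things = [
    ("東京", "とうきょう"),
    ("箸", "はし"),
    ("京都", "きょうと"),
    ("橋", "はし"),      # same reading as 箸, so the tie-breaker decides
    ("大阪", "おおさか"),
]

print(sort_by_reading(places_and_things))
```

The hard half is exactly what this sketch dodges: take away the second element of each pair and there is nothing left for the function to sort on.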
But at the very least, I thought that this post might make a good quick introduction to the problem.
This post brought to you by "ฬ" (U+0e2c, a.k.a. THAI CHARACTER LO CHULA)