The best way to process Unicode input is to make somebody else do it

Andrew M asks via the Suggestion Box:

I was hoping you could address how to properly code Unicode character input. It seems like a lot of applications don't support it correctly.

I'm not sure I understand the question, but the answer is pretty easy: Don't do it!

Text input is hard. It should be left to the professionals. This means you should use controls such as the standard edit control and the rich edit control. Properly converting keystrokes to characters involves not just the shift state, but the management of various input method editors, some of which are quite complicated. For example, the IME Pad lets the user draw a Chinese character with the mouse (or if you're lucky, the stylus), and then it will take the result and try to figure out which character you were trying to write and generate the appropriate Unicode character.

Other IMEs will generate provisional conversions of phonetic text into Unicode characters, and as more input is received, they can go back and revise their previous guesses based on subsequent input. You definitely don't want to get involved in this. Just leave it to the professionals.

Postscript: For those who have never used a phonetic IME, here's how a hypothetical English phonetic IME might work. Let's pretend there's an English phonetic keyboard with keys labeled with various phonemes. (Instead of IPA characters, I will use traditional American phonetics.)

You type    Result
ә           Uh
t           Ut
ĕ           A te
n           A 10
sh          A 10 sh
ә           A 10 sha
n           A tension
o           A tension o
l           Attention all

Notice how the IME keeps updating its guess as to what you're trying to type as better information becomes available. The text is underlined since it is all provisional. During the input, you can hit the left-arrow to go back to any part of the provisional text, hit the down-arrow, and see a list of alternatives, at which point you can override the guess with the correct answer. For example, if you really wanted to write "A tension all", you would arrow back to the word "Attention", hit the down-arrow, and select "A tension" from the menu. Eventually, you reach the end of a phrase or sentence, look over the provisional text, and after making any necessary corrections, you hit Enter, at which point the text is committed into the edit control and a new string of provisional text begins.
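To make the provisional-versus-committed distinction concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not how any real IME or edit control is implemented: the `EditBuffer` class, its method names, and the square brackets standing in for the underline are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EditBuffer:
    """Toy edit control: committed text plus one provisional run (hypothetical)."""
    committed: str = ""
    provisional: str = ""

    def update_composition(self, guess: str) -> None:
        # Each new guess replaces the entire provisional run, so the IME is
        # free to revise earlier words as more input arrives.
        self.provisional = guess

    def override(self, guess: str, choice: str) -> None:
        # The user arrows back to one guessed word and picks an alternative.
        self.provisional = self.provisional.replace(guess, choice, 1)

    def commit(self) -> None:
        # Enter: the provisional run becomes permanent and a new run begins.
        self.committed += self.provisional
        self.provisional = ""

    def render(self) -> str:
        # Brackets stand in for the underline that marks provisional text.
        if self.provisional:
            return f"{self.committed}[{self.provisional}]"
        return self.committed


# The successive guesses from the table above, in order.
GUESSES = ["Uh", "Ut", "A te", "A 10", "A 10 sh", "A 10 sha",
           "A tension", "A tension o", "Attention all"]


def replay(guesses):
    """Feed each guess to a fresh buffer, then commit; return every rendering."""
    buf = EditBuffer()
    frames = []
    for guess in guesses:
        buf.update_composition(guess)
        frames.append(buf.render())
    buf.commit()
    frames.append(buf.render())
    return frames


if __name__ == "__main__":
    for frame in replay(GUESSES):
        print(frame)
```

The point of the sketch is that the application never sees individual keystrokes turn into individual characters. It sees a provisional run that can change wholesale at any moment, and only the committed text is safe to treat as input.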

Published Monday, October 22, 2007 7:00 AM by oldnewthing
Comments

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 10:23 AM by Psa

Slightly off-topic, but do you know if there's any end-user documentation for the IMEs that come with Windows?

I'm learning Mandarin and it took me ages to work out that you need to type "v" in the IME to get a pinyin "ü" (that's supposed to be a u with a diaeresis if it doesn't show up).

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 10:30 AM by mmmh

Is it always possible?

For example, is it possible to implement a complex control like the VS2005 syntax-coloring edit control using only the basic edit box and/or the rich edit control?

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 11:05 AM by Tom

If it was always possible to hand off work to other people, we wouldn't need to write much code, would we?  Of course the edit and RTF controls aren't suitable for every possible situation.  But you should definitely try to use them *if possible*.

I had to add proper IME support to someone else's custom edit control not too long ago and it wasn't that hard.  I paid attention to WM_IME_COMPOSITION and used the Imm*() API.

Here's the example I used as a guide - IME support in the context of a game:

http://web.archive.org/web/20061109141509/http://www.libsdl.org/pipermail/sdl/2002-October/049962.html

# Who will do it for me with a redirected stream and ReadConsoleInput

Monday, October 22, 2007 11:30 AM by Guillaume

In one console application, I use ReadConsoleInput to handle all things Unicode for me. Works fine.

But if a stream is redirected in my application, stdin is not a console anymore and I lose ReadConsoleInput and all the Unicode goodies it provides (most of which I wasn't even aware of, mind you).

Any tips on how to handle this?

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 12:10 PM by Jules

"the answer is pretty easy: Don't do it!"

Good advice, but it's clearly not always possible to follow it.  Many applications with non-trivial user interfaces will require something more advanced than either of these controls will handle (e.g. automatic text formatting, graphical variations like visible whitespace, etc.).

As an example, one application I intend to write in the near future will require a text editor with automatic highlighting (like a syntax highlighting editor) combined with support for simple text formatting (e.g. choice of a few predefined font styles, underline and italics, first line of paragraph indents, etc.)

This seems to me to be beyond the capabilities of the existing controls.  The project will be distributed as shareware and will likely not earn a huge amount of cash, so third party controls seem to be out.  This means I *need* to write something that will allow the user to enter text.

So how do I do this?  Frankly, not being in the slightest bit familiar with IME, I don't have a clue.

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 12:59 PM by Eric Brown

Jules:

I would *strongly* consider using the richedit control for your text editor control; preferably richedit 4.1, as it has full Text Services Framework support (the supported way to implement IMEs).  If you have more questions, contact me via my blog.

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 3:22 PM by Triangle

What happened to the Raymond Chen who said "Programming is hard because nobody said it would be easy" ?

[No sense making something harder than it needs to be. -Raymond]

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 3:46 PM by Triangle

[No sense making something harder than it needs to be. -Raymond]

What about the shell and COM you mentioned yesterday, wouldn't it be simpler to allow objects created by one thread to be used by other threads?

[You're saying it's just as easy to write a free-threaded object as a multi-threaded object? My experience suggests otherwise. -Raymond]

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 4:06 PM by Not a nitpicker

Actually, it is just as easy to write a free-threaded object as it is to write a multi-threaded object.  Is it safe to assume you meant apartment-threaded instead of multi-threaded?

[Right, sorry. free-threaded vs. single-threaded. -Raymond]

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 4:08 PM by Triangle

[You're saying it's just as easy to write a free-threaded object as a multi-threaded object? My experience suggests otherwise. -Raymond]

I mean using the object - not implementing it.

[So you believe it should be more important to make using shell extensions easier at the expense of making it harder to write them. It's a balance we've already discussed a few years ago; no point rehashing it. -Raymond]

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 5:12 PM by Andreas Sikkema

This looks very much like the way T9 input on mobile phones seems to work.

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 5:41 PM by MS

"This looks very much like the way T9 input on mobile phones seems to work."

It's the same story on some of the BlackBerry phones I've done a lot of coding for.  The predictive input they use is pretty efficient if you take the time to learn it.  Using the standard edit controls there gains you all of this for free; of course, since it's a completely closed Java solution, you generally can't write new input methods.

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 5:51 PM by Henry Boehlert

I'll probably be dead wrong but last time I checked, EDIT and RICHEDIT_CLASS didn't provide support for huge files, e.g. windowing/virtualization, maybe like WC_LISTVIEW does.

Whenever I needed to wade through weeks of logs and traces, I came darn close to trying to write one myself.

It's not so frequent anymore, though, now that we have multi-core and dirt-cheap RAM and I can keep working while the editor is catching breath. (Will notepad.exe ever support /3G?)

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 5:54 PM by brian

Some people think they can do a better job than the programmers at Microsoft.  When I meet people like that I say "Well you obviously can't, 'cause if you could you'd be working for them."

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 7:49 PM by Sven Groot

There are legitimate cases for writing your own text editors. Are the Visual Studio editor and Word not examples of that? How do they deal with input, then?

The nice thing about the IME is that it sends you window messages as it's composing the string to let you know what it's doing. This has been very useful to me in one instance. :)

[Of all the times I've been asked this question, I have yet to find someone who was intending to write a text editor. If you want to write a text editor, then you get to learn about the IME messages. -Raymond]

# re: The best way to process Unicode input is to make somebody else do it

Monday, October 22, 2007 8:32 PM by Dewi Morgan

From the question, I understood it as "how do you deal with an input stream that may or may not contain Unicode data?", not "how do you deal with key inputs that should map to Unicode output?" - but it was a woolly question.

I strongly agree, if the question meant what Raymond interpreted it to, that you should avoid it like the plague. In Java 1.1 I tried, really hard, to do rich text (as a superset of unicode). After months of getting it wrong, with one bug popping up whenever I squished another, I retired from the fray, defeated, and used a Java 1.2 Swing component instead.

# re: The best way to process Unicode input is to make somebody else do it

Tuesday, October 23, 2007 8:00 AM by e

>> When I meet people like that I say "Well you obviously can't, 'cause if you could you'd be working for them."

Pretty stupid comment. First, because MS, given its size, naturally has a good number of bright heads as well as a much greater number of "standard-skilled programmers".

Second, you take for granted that everyone's dream is working for a big corporation in Redmond, as opposed to "working for a smaller company (and all that this implies)", "working for a company closer to your home/family", "creating your own startup" or any combination of these and other factors.

# re: The best way to process Unicode input is to make somebody else do it

Tuesday, October 23, 2007 9:13 AM by Developers, developers, developers

MS has a long tradition of easing the work of (external) developers. Why should this be an exception?

# re: The best way to process Unicode input is to make somebody else do it

Wednesday, October 24, 2007 12:00 AM by Anony Moose

Based on the "if you can do better than X then you would work for X" theory, all companies (including both Microsoft and Apple) are the worst company, because if the developers at company Y could do better than the ones working for company X then they would work for that company because company X is always the best company in the universe for all values of X. You know, logic is a funny thing.

# re: The best way to process Unicode input is to make somebody else do it

Wednesday, October 24, 2007 5:02 AM by bob

brian:

Some people think they can do a better job than the programmers at Microsoft.  When I meet people like that I say "Well you obviously can't, 'cause if you could you'd be working for them."

@brian:

Then, Brian, we must all assume that no other good programmer exists except the ones who work or once worked for Microsoft. Now, do you honestly believe that? Seeing Microsoft products over the years, I have strong doubts ;-)

# I agree, the best way to process Unicode input is indeed to make somebody else do it

Friday, October 26, 2007 11:13 AM by Sorting It All Out

I saw Raymond Chen's The best way to process Unicode input is to make somebody else do it and I wholeheartedly
