CharNext(ch) != ch+1, a lot of the time

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

Earlier today, Raymond Chen sent me a piece of email that mentioned an important point for developers who iterate a string one character at a time. It's a lot more interesting than what I was going to post, so I'll do the boring one later or tomorrow (or if I am smart I'll just give it a miss entirely), and I'll do this one now instead.

It has to do with the CharNext and CharPrev APIs.

The email had a title similar to the one of this posting. Basically he was pointing out that CharNextW(p) != p+1. And he is right, they are not equal, a lot of the time.

The same issue that applies to CharNext(ch) not being the same as ch+1 applies to CharPrev(ch) and ch-1. Put more verbosely, this is because incrementing and decrementing a character pointer within a string is not nearly as functional as these APIs can be.

The reasons for this are many and as I said they apply to both CharPrev and CharNext (and incidentally also CharNextExA and CharPrevExA). They include:

  • When not dealing with Unicode, strings on CJK (Chinese, Japanese, and Korean) systems have many characters that take up two bytes (including all Han/Kanji/Hanja, all Hangul and all full-width non-Kanji). Using simple byte increments will mean one is moving through half a character with each iteration.
  • When dealing with Unicode, strings that use combining characters like å (U+0061 U+030a, a plus combining ring) or ü (U+0075 U+0308, u plus combining umlaut) take up two code points per character, so once again one will be moving through half a character with each iteration.
  • Stretching to languages like Vietnamese, where double diacritics are commonly seen, characters such as ộ (U+006f U+0323 U+0302, o plus combining dot below plus combining circumflex) take up three code points, so one is dealing with only a third of a character with each iteration!
  • When one uses Unicode supplementary characters such as U+21532 (an Extension B ideograph) where a Unicode string on Windows will actually be represented by a surrogate pair (U+d845 U+dd32) one will actually once again be dealing with half a character.

I crossed that last item off the list because CharNext and CharPrev do not currently handle surrogate pairs properly, so one will be just as badly off using the APIs as one would be doing simple pointer arithmetic.
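For anyone who wants to check the surrogate pair arithmetic in that last item, here is a minimal sketch in portable C. The function names `to_surrogates` and `from_surrogates` are made up for illustration and are not Windows APIs; the arithmetic itself is the standard UTF-16 encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a supplementary code point
   (U+10000..U+10FFFF) into a UTF-16 high/low surrogate pair. */
void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                 /* 20-bit offset */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Hypothetical helper: recombine a valid surrogate pair into a code point. */
uint32_t from_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000
         + (((uint32_t)high - 0xD800) << 10)
         + ((uint32_t)low - 0xDC00);
}
```

Feeding U+21532 to `to_surrogates` gives the pair U+d845 U+dd32 mentioned above, which is why a pointer that advances one WCHAR at a time lands in the middle of the character.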

(Never fear, I will be looking into seeing if I can do something about this for the future now that someone reminded me!) [1]

How do they do this work? Well, CharNextA/CharPrevA use the IsDBCSLeadByte API on appropriate platforms to determine if the byte is a lead byte in a lead byte/trail byte pair, and CharNextW/CharPrevW use the GetStringTypeW API to figure out if a character is a non-spacing character like a ring or a diaeresis.
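To make the idea concrete, here is a rough portable sketch of the two strategies. It is not the actual Windows implementation. `is_lead_byte` is a hypothetical stand-in for IsDBCSLeadByte that is hard-coded to the code page 932 (Shift-JIS) lead byte ranges, and `is_nonspacing` is a hypothetical stand-in for the GetStringTypeW check that only knows about the U+0300..U+036F combining diacritics block. The names `next_char_a` and `next_char_w` are likewise made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for IsDBCSLeadByte, code page 932 only. */
int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Assumption: stand-in for a non-spacing test; covers only
   the Combining Diacritical Marks block (U+0300..U+036F). */
int is_nonspacing(uint16_t u)
{
    return u >= 0x0300 && u <= 0x036F;
}

/* Byte string: treat a lead byte plus its trail byte as one step. */
const char *next_char_a(const char *p)
{
    assert(p != NULL);
    if (*p == '\0')
        return p;                      /* stay on the terminator */
    if (is_lead_byte((unsigned char)*p) && p[1] != '\0')
        return p + 2;                  /* lead byte + trail byte */
    return p + 1;                      /* single-byte character */
}

/* UTF-16 string: step over one code unit, then over any
   non-spacing marks that follow it. */
const uint16_t *next_char_w(const uint16_t *p)
{
    assert(p != NULL);
    if (*p == 0)
        return p;                      /* stay on the terminator */
    ++p;
    while (*p != 0 && is_nonspacing(*p))
        ++p;
    return p;
}
```

With the string U+006f U+0323 U+0302 U+0061 U+030a, `next_char_w` moves three code units on the first call and two on the second, where `p+1` would stop inside each character. Note that this sketch, like the list above after the crossing-out, does nothing special for surrogate pairs.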

Other APIs that do an even better job can be found in Uniscribe. The ScriptBreak API will return an array of SCRIPT_LOGATTR structs, and that structure's fCharStop member, when set, indicates that this is a valid place for a cursor to jump to. When such a cursor jump is valid, it indicates that you have moved forward by one "character" in the sense that a user might think of a character. And therefore if you use Uniscribe you will handle this job properly, even for supplementary characters.

Uniscribe is a useful enough library that I will talk more about it in a future post, maybe even with some sample code for easier operations (like this one). Uniscribe is behind most of the support of international text on Windows.

It was ported to Windows CE 5.0 as well, though it is described in documentation about the CE Platform Builder, which implies (to me) that it might only be included on SKUs that are built with it for shipment to places that require it. Those who know more about Windows CE and Uniscribe should feel free to contact me with more info so I can sound more intelligent the next time I talk about it!

This is a much easier task in managed code, where you can use the StringInfo class and its GetTextElementEnumerator method (along with the GetTextElement method of the enumerator it returns), which allow for easy iteration of a string. You can also use its ParseCombiningCharacters method to get an array of integers representing the same character boundaries represented by Uniscribe's SCRIPT_LOGATTR.fCharStop member.
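The shape of that result, an array of indexes where each user-visible character starts, can be imitated in a few lines of portable C. This is only a sketch of the idea and not how ParseCombiningCharacters or ScriptBreak are implemented: the function `element_starts` and its helpers are hypothetical, and the combining mark test is an assumption that covers only U+0300..U+036F.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: only the U+0300..U+036F block counts as combining. */
int is_combining_mark(uint16_t u) { return u >= 0x0300 && u <= 0x036F; }
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: write the index of every text element start
   into starts[] (which must have room for len entries) and return
   how many were written. */
size_t element_starts(const uint16_t *s, size_t len, size_t *starts)
{
    assert(s != NULL && starts != NULL);
    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        starts[count++] = i;
        /* The base is either a surrogate pair (two units) or one unit. */
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1]))
            i += 2;
        else
            i += 1;
        /* Any combining marks that follow belong to the same element. */
        while (i < len && is_combining_mark(s[i]))
            ++i;
    }
    return count;
}
```

For the nine code units U+0061 U+030a U+0075 U+0308 U+006f U+0323 U+0302 U+d845 U+dd32, this reports four elements starting at indexes 0, 2, 4 and 7. A cursor that only ever lands on those indexes never stops in the middle of a character.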

For those of you who are still awake and reading, I will point out one annoying issue I discovered while typing in the å, ü, and ộ characters earlier in this post. Problems? Well...

  1. The editor in the .Text blog system that I type in did not handle this properly and moved past 1/2 or 1/3 of the character with each arrow keypress.
  2. When the character was being selected it either selected the whole character and then appeared to do nothing or appeared to do nothing and then selected the whole character, depending on whether I moved from left-to-right or right-to-left with the cursor.
  3. The deletion behavior was just as dismal. I don't even want to talk about it. 

I also saw the same behavior in plain old EDIT text boxes in Internet Explorer, so it looks like .Text is off the hook. And even though IE does use Uniscribe, they forgot to implement some of the possible features that library provides [2]. :-(

If someone wanted to try this in other browsers (cough! Firefox or Opera), I'd be curious about the results; the IE results were pretty disappointing to me. Try selecting the string "åüộåüộåüộåüộ" and pasting it into the browser wherever you see an EDIT box to test, or just try putting the cursor into the INPUT control and running it through the text right here:

and let me know what you see (unless you see 20 other people have already responded!). Be sure to mention the browser and version (I am using IE 6.0.3790.0 on Windows Server 2003 with all of the latest updates).

(All that talk about Uniscribe reminds me I want to talk about MLang at some point too -- I've added MLang and Uniscribe to my list of things to talk about!)

1 - It first started when I was working on MSLU, but even to this day I have not gotten fully used to being able to say stuff like this. I'll try to write more about it another time, cuz it is kind of cool.
2 - It works fine in Notepad and Word and VS.Net, all of which use Uniscribe.

 

This post brought to you by "å" (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)
A character that is downright snooty about the fact that it has none of the problems mentioned in this article.
Despite the fact that a circle almost bigger than its body is super-glued to its head, something that would make me feel at least a little self-conscious.

Blog - Comment List
  • Firefox's job is arguably only slightly better:

    - Simple arrow key navigation (left and right arrows) also requires 2 or 3 presses to move one char forward/backward, but the caret doesn't move at all during the intermediate moves.

    - the back key behaves the same way as IE6.

    - Selection (Ctrl+ left/right) goes through 1/2 or 1/3 of char but the difference is that the (un)selected parts of the characters are moved (half-way) besides the 'main' character to show that they are included or not.

    Version : Ouch... 0.9 :-(
  • Safari 1.2.4 handles it fine (if you are interested in other OSs) - it didn't move half a character etc like IE 6 SP2 on XPSP2 did.
  • Be nice to the å, Swedes are very affectionate towards this character :-)
  • I work with a Swede who told me as much; hopefully my joke under the sponsorship will not be taken as too offensive (if so it will be taken down, as I would not want "å" to retract its sponsorship).

    I guess I could use the combining form as a stunt double for the precomposed form as a way around the problem, then it would be less likely to be offensive to it, since a stunt double rarely worries about the same things as the celebrity....

    Or maybe this whole post is silly, too. Scratch the maybe! :-)
  • Same problems with Opera 7.54.
  • Re: Windows CE:

    Platform Builder is the tool for OEMs to build their own custom platforms. It's entirely up to the OEM to decide what to put in. Evangelism may be required for OEMs producing PDA-like platforms.

    I would expect Microsoft's own Windows Mobile platforms (Pocket PC and Smartphone) to include Uniscribe in future versions. Microsoft dictates which CE components are included in a Windows Mobile platform image. The OEM has responsibility for the OEM Adaptation Layer to adapt the platform to the specific hardware. At least, that's my understanding.
  • Thanks Mike,

    Yep, seeing it in the Platform Builder means people have to end up choosing if it is a worthwhile addition to the platform. Getting them to do it can be a challenge since the space on the device is at such a premium....
  • Firefox 1.0 is no better. First Shift+Right selects the 'a' (displaying the 'a' selected and the ring shifted right and unselected), second Shift+Right selects the ring and it snaps back into place. Deleting is (pardon me) intuitive: first Backspace deletes the circumflex, second the combining dot below, and third the o. One can also delete the letter first, then the combining diacritics move to the previous letter. They, strangely, don’t combine but are positioned left to right.
  • Btw, the IE address bar (and google's search bar) get it right. My suspicion is that's because they're standard windows edit controls, while the IE (and firefox) edit controls aren't real windows controls.
  • Well, there goes my even-odd trick for Kanji-backspace :-).

    For what it's worth, the reason Safari, on Mac OS X, gets this right is essentially the same reason the IE address bar gets it right. Safari uses ATSUI (Apple Text Services for Unicode Imaging).

    Unfortunately, Mac OS X doesn't have a Uniscribe equivalent. One either eats the whole ATSUI pie, or one has to try and figure out character break points on one's own.

    If I may offer a suggestion, I think some developers might well be interested in how to use Uniscribe to do context-based glyph substitution (e.g. Arabic).
  • Rick -- check out the "Suggest a topic for me!" link at the top of the page....
  • Interesting - at home your example renders correctly, while at work it didn't. I get boxes (the 'missing character' symbol) behind the o rather than the diacritics. Both machines run XP SP2.

    One difference is that I use ClearType on the home machine, but I've just turned that off and it's still OK. I ought to have the same fonts on both...

    Ah! Some stupid program has overwritten Arial with a version from what looks like Win95! Whatever it was has also trashed Times, Times Bold and Symbol.
  • Does CharNextW only recognize combining characters, or does it use the UAX29 grapheme cluster algorithm, or something else that's locale-specific?

    I've been considering the impact of jamo syllables upon this and debating whether to treat each jamo as a grapheme cluster for my "character-wise iterator", or treat a syllable cluster according to 3.12 as a cluster.

    And yeah, Firefox 1.0 treats it boneheadedly, both in the input box and in the address bar.

    As Larry alluded to, there are so many elements on any given page that creating a child window for everything would exhaust handle space easily, so IE has its own set of "windowless controls" that it uses for everything inside a rendered page.
  • It is not using anything out of Unicode, it is using the GetStringTypeW API's results to decide what is a non-spacing character. Unicode may choose to respect its elders since GetStringTypeW predates that UAX (and most of the others!).

    Looking at the data, Jamos are not treated as combining characters. They are pretty much treated as:

    CT_CTYPE1: C1_ALPHA | C1_DEFINED
    CT_CTYPE2: C2_LEFTTORIGHT
    CT_CTYPE3: C3_ALPHA

    But the CTYPE data comes from Unicode and has for a while now, and Unicode does not define them as combining either; it calls them Lo (Letter, Other).

    And I do see what IE is doing, but even in their own custom controls they could use Uniscribe to define the cursor/arrow/selection/deletion behavior....
  • I could perhaps argue that the way it's handled in IE is by design. When I select the text with the mouse, it selects a whole character at a time (i.e. base char + combining chars) and I can delete the whole thing by just hitting backspace or delete.

    When using the keyboard, it moves through the 1/2 or 1/3 of the character, which basically lets me edit the combining characters in-place. Maybe it's just a personal thing, but I think that's not entirely a bad way of doing it.

    You're right in that it's not how Visual Studio does it - once you create the full character, it treats all two or three code points as a single character, but I can see how this could be viewed as "by design".