Previous blogs in this series of blogs on this Blog:
It was way back in January of 2005 that I first mentioned that CharNext(ch) != ch+1 a lot of the time, explaining that simply incrementing a character pointer was not all that the CharNext function did.
Then James pointed out in a comment (here) that, as it turns out, on Windows XP and Server 2003, that kind of was all that CharNext was doing.
Remember how I talked, in this post, about the way that even though NLS
did not own some of these USER functions, we pretty much "owned" them
since we control their behavior?
Well, this is one of those functions.
Details on the breakage cause were described soon thereafter in We broke CharNext/CharPrev (or, bugs found through blogging?).
So the decision was made to fix the bug (restoring the old
functionality from NT <= Windows 2000) and at the same time to look
at the other common complaint, the one related to surrogate pairs.
Vista, therefore, supports the old functionality and took steps to add the new functionality (not splitting surrogate pairs).
There was a problem, though.
The code in there swapped the check for high and low surrogates such
that it was always skipping the high surrogate and always returning the
low surrogate -- which is exactly the opposite of the behavior you want.
Now no one found the bug because, as it turns out, the tested case
(and admittedly the most common scenario for the function?), which is "a more
linguistically appropriate lstrlenW based on user character principles", will still work here, even though a single call will return the wrong result when faced with a high surrogate.
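To make that concrete, here is a minimal sketch in C of how a swapped surrogate check can give the wrong answer on a single call and still produce the right count in a loop. This is only a model built from the description above, not the actual Windows code: next_intended, next_swapped, and count_steps are hypothetical names, the strings are NUL-terminated arrays of uint16_t code units, and combining characters are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UTF-16 surrogate range checks (plain range tests, no Windows headers). */
static int is_high_surrogate(uint16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static int is_low_surrogate(uint16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

/* Hypothetical model of the intended step: advance one code unit and, if
   that lands on a low surrogate (the second half of a pair), step past it. */
static const uint16_t *next_intended(const uint16_t *p) {
    assert(p != NULL);
    if (*p == 0) return p;          /* never walk past the terminator */
    ++p;
    if (is_low_surrogate(*p)) ++p;
    return p;
}

/* The same step with the check swapped (an assumption about the bug's
   shape): it skips the high surrogate and returns the low one. */
static const uint16_t *next_swapped(const uint16_t *p) {
    assert(p != NULL);
    if (*p == 0) return p;
    ++p;
    if (is_high_surrogate(*p)) ++p;
    return p;
}

typedef const uint16_t *(*step_fn)(const uint16_t *);

/* The "linguistic lstrlenW" pattern: count how many steps reach the end. */
static size_t count_steps(const uint16_t *s, step_fn next) {
    size_t n = 0;
    for (const uint16_t *p = s; *p != 0; p = next(p)) ++n;
    return n;
}
```

For a buffer holding 'a', the pair 0xD835 0xDC00 (U+1D400), and 'b', the first call to next_swapped returns a pointer to the low surrogate while next_intended returns a pointer to the high surrogate, yet count_steps reports three with either one. That is the kind of test that passes while the bug sits there unnoticed.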
Here is where it gets more complicated.
What happens in the next version, and/or possibly in the next service pack of Vista/Server 2008?
There are many choices:
1) Do we give up on the supplementary character detection given the
mistake and just bring it back to the <= Windows 2000 level behavior
that properly handles combining characters?
2) Do we fix the bug with supplementary characters so that both they and the combining characters case will work?
3) Do we give up on both and go back to the XP level behavior, which
even though it was a regression from prior versions does represent a
very popular platform?
4) Do we give up on trying to do anything here and just leave it
broken as it is now, and perhaps in some unknown future version (it is
a bit late in the cycle to start designing all new features) look into
all new solutions to the problem(s) once they are identified?
Now these four choices, due to the way the code is written and under
the principle of minimal change, are technically listed in order from
most difficult to least difficult. Though really the difficulty involved
is not that great for any of the four options, so it does not provide
very much insight into a triage process.
In terms of platform popularity, I don't think there are many people
outside of fans of the Windows "Mojave" commercials who would claim
that XP isn't the most popular platform -- which does suggest that #3
is worth considering, at least.
My personal preference would be #2 since it is "the right thing to
do", though when you have behavior that has been changing every few
versions it might perhaps be better to take some time to think about the
backward compatibility issues before concluding that "the right thing
to do" and "the best thing to do" are necessarily the same.
The fact that so few people noticed the bug suggests that either
- no one is paying attention to/relying on the results, or
- the function is being used for a "linguistic character" version of lstrlenW.
And obviously in both of those cases #2 would not represent a breaking change.
But let's assume that a certain number of developers have noticed
the odd behavior and chosen to work around it in their own code. Plenty
of people do that, and many of them are either too cynical to report
the bug or don't know of a good way to make the report. Or they just don't like Microsoft -- it happens.
The ones who just decide the function is unreliable and write their
own can be removed from our consideration here, since even though they
may be right, they will not be broken if the behavior changes. So we'll
leave them aside for a moment.
If we don't want to break the people who found the bug and worked
around it, we'd have to assume that they were essentially detecting the
case where CharNext or CharPrev incorrectly returns a high surrogate value (whether using the IS_HIGH_SURROGATE macro or simple range checking or whatever), and then doing an additional increment/decrement in those cases.
Perhaps they feel that their code was a really good idea since it
will even "fix" prior versions like NT 4.0 or Windows 2000 or XP and
thus they feel they are "future-proof" since no right-thinking
developer would break against every version (note that none of the
above solutions do that!).
Now if history is to be a guide, people might not do the full job
here -- they might not be detecting the errant cases like unpaired
surrogates or multiple high surrogates, so it might just be a blind
one-WCHAR increment/decrement.
And some might go even farther and validate that a valid surrogate
pair exists, which is not something Windows necessarily does but isn't
unreasonable here.
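For the curious, the sort of validation such a careful caller might add could look like the sketch below. It is an illustration only: first_unpaired_surrogate is a hypothetical helper, not a Windows API, and it assumes a NUL-terminated buffer of uint16_t code units. It flags both errant cases mentioned above, a surrogate with no partner and a run of multiple high surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-16 surrogate range checks (plain range tests, no Windows headers). */
static int is_high_surrogate(uint16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static int is_low_surrogate(uint16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

/* Hypothetical helper: returns the index of the first surrogate that is
   not part of a well-formed high+low pair, or -1 if there is none. */
static ptrdiff_t first_unpaired_surrogate(const uint16_t *s) {
    for (size_t i = 0; s[i] != 0; ++i) {
        if (is_high_surrogate(s[i])) {
            /* s[i] is not the terminator, so reading s[i + 1] is safe. */
            if (!is_low_surrogate(s[i + 1])) return (ptrdiff_t)i;
            ++i;                      /* skip the low half of a valid pair */
        } else if (is_low_surrogate(s[i])) {
            return (ptrdiff_t)i;      /* low surrogate with no high before it */
        }
    }
    return -1;
}
```

A caller who runs a check like this before trusting a pointer adjustment is doing more than the blind one-WCHAR nudge, at the cost of a pass over the string.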
But note that even in all of the above potential circumstances, the full fix described above in #2 is still entirely safe since the function would never return a character that was a high surrogate.
On balance, my gut feeling was that #2 would be the best thing to do
(in the next version of Windows and possibly even in future Vista/Server
2008 service packs), mainly on the basis that it is the right thing to
do. It also appears to be the best solution for technical reasons.
I mean, as UTF-16 detection mechanisms go, the best that can be said about CharNext and CharPrev is that they [sometimes, in some versions] work. Which is not saying much, but is saying something, at least. It is better at least in the abstract to improve with each version, in my opinion....
Though perhaps others would analyze the situation and circumstances and come to a different conclusion.
What would you do?
This post brought to you by å (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)
A character that is downright snooty about the fact that it has none of
the problems mentioned in this article. Despite the fact that a circle
almost bigger than its body is super-glued to its head, something that
would make me feel at least a little self-conscious.