Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
The NLS API function IsNLSDefinedString is an exercise in social engineering within software.
Perhaps I should explain what the hell I am talking about. :-)
This function takes a string and essentially gives you a judgment about whether this string is one you can pass to the collation functions in the NLS API and expect to have something along the lines of reasonable, supportable results.
The process is simple. It enumerates every UTF-16 code unit in the string, and uses the following tests to make its decision.
Clearly, this is not a linguistic judgment, since the conditions are easily stated. Every UTF-16 code unit in the string:
Calling a string that does not pass this test INVALID has interesting consequences, since it means that IsNLSDefinedString is not just returning whether to expect determinism in collation function results. If that were the case, then only point #1 would be needed.
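To make the shape of that judgment concrete, here is a rough Python approximation of the tests this post discusses (defined, not private use, not an unpaired surrogate). It is only a sketch: the name `looks_nls_defined` is made up for illustration, and Python's `unicodedata` general categories stand in for the NLS collation tables, which are versioned separately and will not agree with it on every code point.

```python
import unicodedata

# General categories this sketch treats as "not defined":
#   Cn = unassigned, Co = private use, Cs = surrogate.
# ASSUMPTION: this is a stand-in for the NLS collation tables,
# not what the NLS API actually consults.
_REJECTED_CATEGORIES = {"Cn", "Co", "Cs"}


def looks_nls_defined(text: str) -> bool:
    """Return False if any code point is unassigned, private use, or a surrogate.

    A Python str holds code points rather than UTF-16 code units, so any
    surrogate code point that shows up in one is treated here as unpaired.
    """
    return all(
        unicodedata.category(ch) not in _REJECTED_CATEGORIES for ch in text
    )
```

Something like `looks_nls_defined("na\u00efve")` passes, while a string containing U+E000 or a lone U+D800 does not.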
Two questions come up at this point:
Question #1: Why judge the PUA so harshly, if NLS collation functions will return deterministic results?
The issue here is that the private use area has no real context or meaning beyond that created by private agreement. Therefore, there is no way that NLS collation functions can treat such a string as being valid, since its meaning is unknown to the operating system.
So IsNLSDefinedString makes sure that situations that require an answer to the question of determinism are not given false answers based on strings that do not have a known, valid value.
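For reference, the private use ranges themselves are fixed by the Unicode Standard, so spotting them is trivial. The sketch below is illustrative only: the helper names are hypothetical, and whether NLS treats the two supplementary private use planes exactly like the BMP range is an assumption on my part, not something stated above.

```python
# Private use ranges as defined by the Unicode Standard.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(code_point: int) -> bool:
    """True if the code point falls in any Unicode private use range."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


def first_private_use_index(text: str) -> int:
    """Index of the first private use code point in text, or -1 if none."""
    for index, ch in enumerate(text):
        if is_private_use(ord(ch)):
            return index
    return -1
```

An application that has its own private agreement about these code points knows what they mean; the operating system does not, which is the whole point.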
Question #2: Why judge unpaired surrogate code points so harshly, if NLS collation functions will return deterministic results?
The issue here is that an unpaired surrogate is given the same status in Unicode as an undefined code point, so IsNLSDefinedString returns FALSE here just as it would for any other undefined code point.
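Since the check walks UTF-16 code units, finding an unpaired surrogate comes down to making sure every high surrogate (0xD800 to 0xDBFF) is immediately followed by a low surrogate (0xDC00 to 0xDFFF), and that no low surrogate shows up on its own. A minimal sketch, with hypothetical helper names and no claim that NLS is implemented this way:

```python
from typing import List, Sequence


def utf16_code_units(text: str) -> List[int]:
    """Encode text as UTF-16 and return the code units as integers.

    "surrogatepass" lets lone surrogates in the str through unchanged.
    """
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def has_unpaired_surrogate(units: Sequence[int]) -> bool:
    """True if any surrogate code unit is not part of a high+low pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False
```

A well-formed supplementary character such as U+1F600 comes out as the pair 0xD83D 0xDE00 and passes; swap the two units, or drop one of them, and the scan flags the string.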
So if you use IsNLSDefinedString, you are being influenced to do certain things with your application to make sure that these "undesirable" code units are not treated as being valid.
A very geeky form of social engineering, as NLS tries to make the character "neighborhood" a nicer place for the other characters to live!
Could this be expanded in the future to take care of other sequences such as too many diacritics and other potential undesirables? Well, perhaps -- in a new major version only though, of course -- but the line so far has been drawn to differentiate between what has clear meaning in Unicode vs. what does not; it is unclear whether it makes sense in the long run to extend the coverage to handle implementation-specific limits....
This post brought to you by U+00ad, a.k.a. SOFT HYPHEN