Keeping out the undesirables?

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


The NLS API function IsNLSDefinedString is an exercise in social engineering within software.

Perhaps I should explain what the hell I am talking about. :-)

This function takes a string and essentially gives you a judgment about whether this string is one you can pass to the collation functions in the NLS API and expect to have something along the lines of reasonable, supportable results.

The process is simple. It enumerates every UTF-16 code unit in the string, and uses the following tests to make its decision.

  • Does the code unit have weight?
    • If the answer is NO, then is the code unit on a small list of characters that are considered valid despite their weightlessness, like U+00ad (SOFT HYPHEN)?
      • If the answer to that question is NO, then return FALSE -- this is an undefined code unit as far as the operating system knows.
      • If the answer to that question is YES, then continue the test.
    • If the answer is YES, then continue the test.
  • Is the code unit in the PUA (Private Use Area) of Unicode?
    • If the answer is YES, then return FALSE.
    • If the answer is NO, then continue the test.
  • Is the code unit a low surrogate?
    • If the answer is YES, then return FALSE -- an unpaired surrogate code unit was found.
    • If the answer is NO, then continue the test.
  • Is the code unit a high surrogate?
    • If the answer is YES, then is the next code unit a low surrogate?
      • If the answer is YES, then skip one additional code unit and continue the test.
      • If the answer is NO, then return FALSE -- an unpaired surrogate code unit was found.
    • If the answer is NO, then continue the test.
  • If you made it to this point, then proceed to the next code unit. If you are at the end of the string then return TRUE.
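
The steps above can be sketched in a few lines of Python. This is only an illustration of the decision procedure as described in this post, not the actual NLS implementation: the real weight tables are internal to Windows, so the has_weight predicate, the VALID_WEIGHTLESS set (which holds only the one example named above), and the toy_has_weight function are all hypothetical stand-ins. The PUA check covers only the BMP range U+E000..U+F8FF, on the assumption that the test is applied per code unit.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for the short list of weightless-but-valid code units.
# The post names only U+00AD (SOFT HYPHEN) as an example.
VALID_WEIGHTLESS = frozenset({0x00AD})


def to_utf16_units(text: str) -> list[int]:
    """Split a Python str into UTF-16 code units, keeping lone surrogates."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_defined_string(units: Sequence[int], has_weight: Callable[[int], bool]) -> bool:
    """Apply the tests from the list above to each UTF-16 code unit in turn."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        # Test 1: weight, or membership in the weightless-but-valid list.
        if not has_weight(unit) and unit not in VALID_WEIGHTLESS:
            return False
        # Test 2: BMP Private Use Area.
        if 0xE000 <= unit <= 0xF8FF:
            return False
        # Test 3: a low surrogate seen here has no preceding high surrogate.
        if 0xDC00 <= unit <= 0xDFFF:
            return False
        # Test 4: a high surrogate must be followed by a low surrogate.
        if 0xD800 <= unit <= 0xDBFF:
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 1  # skip the low half of the pair
            else:
                return False
        i += 1
    return True


def toy_has_weight(unit: int) -> bool:
    """Made-up weight table: everything has weight except two code units."""
    return unit not in {0x00AD, 0xFFFF}
```

With the toy predicate, is_defined_string(to_utf16_units("a\u00adb"), toy_has_weight) passes because SOFT HYPHEN is on the exception list, while a string containing U+E000 or a lone surrogate fails.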

Clearly, this is not a linguistic judgment, since the conditions are easily stated. Every UTF-16 code unit in the string:

  1. Has weight or is on a small list of valid weightless code units;
  2. Is not in the PUA;
  3. Is not an unpaired surrogate.

Calling a string that does not pass this test INVALID has interesting consequences, since it means that IsNLSDefinedString is not just reporting whether to expect determinism in collation function results. If that were the case, then only point #1 would be needed.

Two questions come up at this point:

Question #1: Why judge the PUA so harshly, if NLS collation functions will return deterministic results?

The issue here is that the private use area has no real context or meaning beyond that created by private agreement. Therefore, there is no way that NLS collation functions can treat such a string as being valid, since its meaning is unknown to the operating system.

So IsNLSDefinedString makes sure that situations that require an answer to the question of determinism are not given false answers based on strings that do not have a known, valid value.

Question #2: Why judge unpaired surrogate code points so harshly, if NLS collation functions will return deterministic results?

The issue here is that an unpaired surrogate is given the same status in Unicode as an undefined code point, so IsNLSDefinedString returns FALSE here just as it would for any other undefined code point.

So if you use IsNLSDefinedString, you are being influenced to do certain things with your application to make sure that these "undesirable" code units are not treated as being valid.

A very geeky form of social engineering, as NLS tries to make the character "neighborhood" a nicer place for the other characters to live!

Could this be expanded in the future to take care of other sequences such as too many diacritics and other potential undesirables? Well, perhaps -- in a new major version only though, of course -- but the line so far has been drawn to differentiate between what has clear meaning in Unicode vs. what does not; it is unclear whether it makes sense in the long run to extend the coverage to handle implementation-specific limits....

 

This post brought to you by U+00ad, a.k.a. SOFT HYPHEN

Blog - Comment List
  • If I want to limit my Active Directory to (say) Unicode 3.1, can I pass that in the NLSVERSIONINFO object?  Do the NLS version numbers correspond to the Unicode releases?

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_Versioning.asp