Similar descriptions does not mean similar methodologies

Sorting it all Out
Michael Kaplan's random stuff of dubious value


The other day, I had to take a look at the various unmanaged case-insensitive string comparison functions. I thought I would post the comparison/contrast information.

First the locale sensitive functions:

  • CompareStringW (kernel32.dll) -- the mother of all of the functions below, you can choose the locale, the flags, and whether the strings are counted or null-terminated. Embedded nulls are allowed.
  • lstrcmpiW (user32.dll) -- assumes null-terminated strings, then calls CompareStringW with the NORM_IGNORECASE flag and the thread locale (if that fails then it tries again with the system locale; in the unlikely event both fail, it uses a call to _wcsicmp).
  • _wcsicoll (CRT) -- assumes null-terminated strings. If using the "C" locale, does an ASCII (A to Z) ToLowercase followed by a binary compare; otherwise it calls CompareStringW with the LCID of the CRT locale and the SORT_STRINGSORT and NORM_IGNORECASE flags.
  • _wcsnicoll (CRT) -- takes one count parameter for both strings, but will also exit on an embedded null. If using the "C" locale, does an ASCII (A to Z) ToLowercase followed by a binary compare; otherwise it calls CompareStringW with the LCID of the CRT locale and the SORT_STRINGSORT and NORM_IGNORECASE flags (note that using just one count parameter will break compressions on locales that use them and expansions on all locales).
  • StrCmpIW (shlwapi.dll) -- assumes null-terminated strings, then calls CompareStringW with the NORM_IGNORECASE flag and the thread locale (if that fails then it tries again with the system locale). Manages to look a lot like lstrcmpiW, though not completely so in rare scenarios.
  • StrCmpNIW (shlwapi.dll) -- takes one count parameter for both strings, but will also exit on an embedded null. It calls CompareStringW with the thread locale and the NORM_IGNORECASE flag (note that using just one count parameter will break compressions on locales that use them and expansions on all locales). Manages to look a lot like a hybrid of lstrcmpiW and _wcsnicoll.
  • StrCmpLogicalW (shlwapi.dll) -- does linguistic comparisons using the thread locale (falling back to the system locale on failure), cleverly wrapping multiple calls to CompareStringW to support treating the 0123456789 digits as numbers.
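
The "digits as numbers" idea in that last bullet can be sketched in portable C. This is not the shlwapi code: logical_compare is a hypothetical name, and an ASCII-only lowercase fold stands in for the CompareStringW calls that the real function makes on the non-digit pieces. Treating leading zeros as insignificant is also an assumption of this sketch.

```c
#include <stddef.h>
#include <wchar.h>

/* ASCII-only lowercase; a stand-in for the linguistic comparison. */
static wchar_t fold_ascii(wchar_t c) {
    return (c >= L'A' && c <= L'Z') ? (wchar_t)(c - L'A' + L'a') : c;
}

static int is_digit(wchar_t c) {
    return c >= L'0' && c <= L'9';
}

/* Hypothetical helper: returns <0, 0 or >0, comparing runs of
   0-9 by numeric value and everything else character by character. */
int logical_compare(const wchar_t *a, const wchar_t *b) {
    while (*a && *b) {
        if (is_digit(*a) && is_digit(*b)) {
            size_t la = 0, lb = 0, i;
            /* Leading zeros do not change the value (assumption). */
            while (*a == L'0') a++;
            while (*b == L'0') b++;
            /* Measure both digit runs; the longer run is the larger number. */
            while (is_digit(a[la])) la++;
            while (is_digit(b[lb])) lb++;
            if (la != lb) return la < lb ? -1 : 1;
            /* Same length: the first differing digit decides. */
            for (i = 0; i < la; i++) {
                if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
            }
            a += la;
            b += lb;
        } else {
            wchar_t ca = fold_ascii(*a), cb = fold_ascii(*b);
            if (ca != cb) return ca < cb ? -1 : 1;
            a++;
            b++;
        }
    }
    if (*a) return 1;   /* a has text left over, so it sorts after b */
    if (*b) return -1;  /* b has text left over, so it sorts after a */
    return 0;
}
```

With this sketch, "file2.txt" sorts before "file10.txt", where a plain character-by-character comparison would put "file10.txt" first.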

And now the locale insensitive functions:

  • RtlCompareUnicodeString (ntdll.dll) -- taking lengths in its UNICODE_STRING parameters (and allowing embedded nulls), it converts characters to uppercase and then does a binary comparison on them. This comparison matches what a lot of the operating system does for many of its objects (most of which use this very function!).
  • _wcsicmp (CRT) -- assumes null-terminated strings. If using the "C" locale, on each character it does an ASCII (A to Z) ToLowercase followed by a binary compare; otherwise on each character it does a full ToLowercase followed by a binary compare.
  • _wcsnicmp (CRT) -- takes one count parameter for both strings, but will also exit on an embedded null. If using the "C" locale, on each character it does an ASCII (A to Z) ToLowercase followed by a binary compare; otherwise on each character it does a full ToLowercase followed by a binary compare.
  • StrCmpICW (shlwapi.dll) -- assumes null-terminated strings. On each character it does an ASCII (A to Z) ToLowercase followed by a binary compare. It matches the "C" locale behavior of _wcsicmp, which of course does not match the OS behavior at all.
  • StrCmpNICW (shlwapi.dll) -- takes one count parameter for both strings, but will also exit on an embedded null. On each character it does an ASCII (A to Z) ToLowercase followed by a binary compare. It matches the "C" locale behavior of _wcsnicmp, which of course does not match the OS behavior at all.
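
To make the "ASCII (A to Z) ToLowercase followed by a binary compare" behavior concrete, here is a minimal portable sketch. The name ascii_icmp is hypothetical and the code is not taken from the CRT or shlwapi; it only illustrates the technique the bullets describe for null-terminated strings.

```c
#include <wchar.h>

/* Lowercase only A to Z; every other code unit is left alone. */
static wchar_t ascii_lower(wchar_t c) {
    return (c >= L'A' && c <= L'Z') ? (wchar_t)(c - L'A' + L'a') : c;
}

/* Hypothetical helper: returns <0, 0 or >0 for null-terminated strings. */
int ascii_icmp(const wchar_t *a, const wchar_t *b) {
    for (;; a++, b++) {
        wchar_t ca = ascii_lower(*a), cb = ascii_lower(*b);
        if (ca != cb) return ca < cb ? -1 : 1;  /* binary compare after folding */
        if (ca == L'\0') return 0;              /* both strings ended together */
    }
}
```

Under this kind of folding "README" and "readme" compare equal, but "École" and "école" do not, because U+00C9 and U+00E9 are outside A to Z and are compared as raw code units.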

A few interesting points about these functions:

1) According to comments in the SHLWAPI source, many of them were initially added because the CRT and user32 counterparts were not supported on earlier versions of Win9x. Kind of ironic when you note the small behavior differences between them all, huh?

2) Given the Georgian casing issue, it is a little sad that almost all of these functions that convert prior to comparison use a lowercasing operation when so much of the core OS uses uppercasing. Especially given how often people use the functions to emulate the OS behavior for tidier validation messages. Luckily, the amount of data in Khutsuri is small so the inconsistency is not often noticed.
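
Why the direction of the conversion matters can be shown with a toy example. The sketch below does not reproduce the Georgian data or any OS casing table; it hard-codes two simple case mappings from the Unicode Character Database (U+017F LATIN SMALL LETTER LONG S uppercases to "S", U+212A KELVIN SIGN lowercases to "k"), and all function names are hypothetical.

```c
#include <wchar.h>

/* Toy uppercase: A to Z plus one non-ASCII mapping (illustrative only). */
wchar_t to_upper_sketch(wchar_t c) {
    if (c >= L'a' && c <= L'z') return (wchar_t)(c - L'a' + L'A');
    if (c == 0x017F) return L'S';  /* long s uppercases to S */
    return c;                      /* KELVIN SIGN is already uppercase */
}

/* Toy lowercase: A to Z plus one non-ASCII mapping (illustrative only). */
wchar_t to_lower_sketch(wchar_t c) {
    if (c >= L'A' && c <= L'Z') return (wchar_t)(c - L'A' + L'a');
    if (c == 0x212A) return L'k';  /* KELVIN SIGN lowercases to k */
    return c;                      /* long s is already lowercase */
}

typedef wchar_t (*fold_fn)(wchar_t);

/* Convert each character with the given fold, then binary compare. */
int fold_compare(const wchar_t *a, const wchar_t *b, fold_fn fold) {
    for (;; a++, b++) {
        wchar_t ca = fold(*a), cb = fold(*b);
        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == L'\0') return 0;
    }
}
```

With the uppercase fold, U+017F equals "s" but U+212A does not equal "k"; with the lowercase fold it is the other way around. Two comparers that both claim to ignore case can therefore disagree on the same pair of strings.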

3) Am I the only person who thinks it is weird that _wcsicmp and _wcsnicmp have locale-specific behaviors, especially such really weird ones? They doc this a bit I guess, but until I looked at the code I would never have guessed.

4) CompareStringW is definitely the king of the linguistic comparison -- everyone else is either (a) calling our function, (b) doing the job wrong, or (c) both!

There is no king (nor good heir apparent) for the non-linguistic comparison right now in unmanaged code, like I talk about here.

Yes, I am still thinking about it. :-) 

The situation is kind of like when you have a vacancy in management and a lot of "wannabe" replacements (like these other functions), none of whom really fit the bill and none of whom can get the job done themselves. If you know what I mean....

 

This post brought to you by "ς" (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)

Blog - Comment List
  • All for the sake of comparing 2 strings. Whatever happened to good old strcmp? :-)
  • Ah, remember my criteria -- Unicode, case insensitive. The strcmp function (intrinsic or CRT) is neither of those. :-)
  • Don't worry, I'll point out the explosion of methods and overrides in managed code soon. I hinted at them in http://blogs.msdn.com/michkap/archive/2005/04/14/408116.aspx

    :-)
  • Well, to be honest, I prefer lots of overloads to lots of differently-named functions. At least with overloads you can look in the same place for all the documentation whereas with differently-named functions, you've got to rely on the documentation to include pointers to all the other possible variants.

    Still, one function that can do it all would be best of all, even if I have to write my own little wrappers for my own special cases. At least then I can follow my own standards, rather than trying to remember the difference between RtlCompareUnicodeString, StrCmpNIW and lstrcmpiW for example...
  • Well, me too.

    But I prefer fewer functions with fewer overrides best of all -- with lots of intuitive enumerations, which intellisense also helps with....
  • # StrCmpLogicalW (shlwapi.dll) -- does linguistic comparisons using the thread locale (falling back to the system locale on failure), cleverly wrapping multiple calls to CompareStringW to upport treating the 0123456789 digits as numbers.
    ^

    to support;)
  • Good catch -- fixed now. :-)
  • Dave Fetterman reported yesterday on the Official Guidance: New Recommendations for Strings in .NET...
  • It was a little over a month ago that I pointed out that Similar descriptions does not mean similar methodologies,...
  • Fascinating post about Locales in SQL Server, one that went into much more detail than I did at...
  • Hi. I'm trying to use CompareStringW to compare some WideStrings and I need to compare them case-sensitively. However, I always get them compared case-insensitively. I did NOT set the "NORM_IGNORECASE" flag.
    So, when I sort strings "France", "Portugal" and "other", I want the result to be either

    France
    Portugal
    other

    or

    other
    France
    Portugal

    but what I get is

    France
    other
    Portugal

    cuz when I compare "France" and "Portugal", the result is 1 (this is correct), comparing "other" and "Portugal" gives 1 (that's correct, too), but comparing "France" and "other" also gives 1 (incorrect, should be 3).
    It's interesting that when I call CompareStringW on "portugal" and "Portugal" the result I get is not 2, but 1. It looks like this function does a case-insensitive comparison, and only if the compared strings don't differ (case-insensitively) does it look at the case.
    Is there a way to make the CompareStringW function not ignore the case?
    I am using locale MAKELCID(MAKELANGID(LANG_CZECH, SUBLANG_DEFAULT), SORT_DEFAULT), but it behaves exactly in the same way even if I set it to MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_DEFAULT), SORT_DEFAULT).
  • Hi Nazgul, See my post 'What it means to be case insensitive' at http://blogs.msdn.com/michkap/archive/2005/06/16/429667.aspx to understand what is meant here. There is no NLS function that does what you want here, and it would certainly not be an 'ignore case' since that is the opposite of what you are doing -- you are not only *not* ignoring case, you are going out of your way to pay attention to it in non-intuitive ways! :-)
  • Rob commented in the Suggestion Box:

    It might be worthwhile to address this article: http://www.codeproject.com/buglist/comparenocase.asp...