Back in August in the post Double compressions -- Hungarian goulash? I described how double compressions worked in Windows and the .NET Framework.

It can indeed be a complicated feature to support, and not just for the reasons I explicitly stated: you not only need to make text pieces like ddzs equivalent to dzsdzs, you also have to treat both strings as if they contained two sort elements (more on sort elements here and here).

It is not so hard to do though, and we have supported it for a long time without people complaining about the support.

It turns out that the truth is even more complicated than that, though!

You see, in the language there are two behaviors that are supposed to be captured:

  1. As I said, ddzs should be treated as equivalent to dzsdzs when doing comparisons;
  2. Additionally, ddzs should be sorted as if it were d + dzs rather than dzsdzs.
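To make the two behaviors concrete, here is a minimal sketch in Python. It is only a toy model: the `COMPRESSIONS` table and the `split_elements` helper are hypothetical names made up for illustration, and this is not how Windows or the .NET Framework actually implement Hungarian collation. It only shows how the same text could be split into sort elements in two different ways.

```python
# Toy model only: the table and helper below are hypothetical illustrations,
# not the Windows or .NET implementation of Hungarian collation.

# Hungarian multi-letter sort elements, longest first so "dzs" wins over "dz".
COMPRESSIONS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]


def split_elements(text, expand_doubles):
    """Split text into sort elements.

    With expand_doubles=True a double compression such as "ddzs" becomes
    ["dzs", "dzs"] (the equality view); with expand_doubles=False it becomes
    ["d", "dzs"] (the sorting view).
    """
    elements = []
    i = 0
    while i < len(text):
        # A double compression is a compression with its first letter doubled,
        # for example "d" + "dzs" or "s" + "sz".
        doubled = next(
            (c for c in COMPRESSIONS if text.startswith(c[0] + c, i)), None
        )
        if doubled is not None:
            if expand_doubles:
                elements += [doubled, doubled]
            else:
                elements += [doubled[0], doubled]
            i += 1 + len(doubled)
            continue
        # Otherwise take a plain compression if one starts here, else one letter.
        single = next((c for c in COMPRESSIONS if text.startswith(c, i)), None)
        if single is not None:
            elements.append(single)
            i += len(single)
        else:
            elements.append(text[i])
            i += 1
    return elements


def toy_equal_elements(a, b):
    """Behavior 1: compare with double compressions fully expanded."""
    return split_elements(a, True) == split_elements(b, True)
```

In this sketch `toy_equal_elements("ddzs", "dzsdzs")` is `True`, while `split_elements("ddzs", False)` gives `["d", "dzs"]` and `split_elements("dzsdzs", False)` gives `["dzs", "dzs"]`, so the two strings would land in different places in a sorted list. A single comparison function cannot give both answers at once, which is the whole problem.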

As I have said previously though, comparison is sorting on Windows. For linguistic purposes, both are done through the same basic functions, such as CompareString.

In order to support these two different operations, you would need an additional EqualString function to answer the linguistic absolute equality question, while CompareString still gives the different answer for collation. And the behavior of EqualString would almost always be identical to the behavior of CompareString returning CSTR_EQUAL, with the only exceptions being that:

  • cases like the one above in Hungarian double compressions could in theory be supported;
  • it could be a bit faster since it no longer has to detect which string comes first; a difference at any weight level would cause the function to return its result immediately.

Note that there is really not as good a reason for having both an RtlCompareUnicodeString and an RtlEqualString in ntdll.dll, because both of those functions return immediately when any difference is found. Although we could argue about the speed difference between "the difference of two numbers not being zero" and "two numbers not being equal" in compiled code, it is nowhere near the order of magnitude of difference in speed you would see in a linguistic function that could stop on the first character and return FALSE when comparing Abcdefg and abcdefĝ rather than needing to walk the whole string to know it should return CSTR_LESS_THAN.
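The early-exit argument can be sketched with a toy three-level comparison in Python. Everything here is an assumption made for illustration: the `weights`, `toy_compare` and `toy_equal` names are hypothetical, the weights are derived crudely from Unicode decomposition, and the real CompareString works from its own weight tables rather than anything like this. The point is only to count how many characters each approach has to look at.

```python
import unicodedata


def weights(ch):
    """Return toy (primary, secondary, tertiary) weights for one character.

    primary   = base letter, case-folded
    secondary = combining marks from the NFD decomposition ("" if none)
    tertiary  = 1 for an uppercase base letter, 0 otherwise
    """
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    return base.lower(), decomposed[1:], 1 if base.isupper() else 0


def toy_compare(a, b):
    """Three-way compare; returns (result, characters_examined).

    Only a primary difference allows an early exit. A secondary or tertiary
    difference has to be remembered, because a more important difference may
    still show up later in the string.
    """
    first_diff = {1: 0, 2: 0}  # first secondary / tertiary difference seen
    examined = 0
    for ca, cb in zip(a, b):
        examined += 1
        wa, wb = weights(ca), weights(cb)
        if wa[0] != wb[0]:
            return (-1 if wa[0] < wb[0] else 1), examined
        for level in (1, 2):
            if first_diff[level] == 0 and wa[level] != wb[level]:
                first_diff[level] = -1 if wa[level] < wb[level] else 1
    if len(a) != len(b):
        return (-1 if len(a) < len(b) else 1), examined
    # Secondary differences outrank tertiary ones.
    return (first_diff[1] or first_diff[2]), examined


def toy_equal(a, b):
    """Equality only; returns (is_equal, characters_examined).

    Any difference at any level settles the question immediately.
    """
    examined = 0
    for ca, cb in zip(a, b):
        examined += 1
        if weights(ca) != weights(cb):
            return False, examined
    return len(a) == len(b), examined
```

With these toy weights, `toy_compare("Abcdefg", "abcdefĝ")` returns `(-1, 7)`: the case difference on the first character is outranked by the diacritic difference on the last one, so all seven characters have to be examined. `toy_equal("Abcdefg", "abcdefĝ")` returns `(False, 1)`, since the very first character already differs.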

One unfortunate side effect of this post and of talking about a theoretical EqualString is that the more I type, the more I think it might actually be a useful function to have available, given the large number of times that one might really prefer to answer the absolute identity question rather than the "which one comes first" question.

Though in principle, in most cases absolute identity is only important in binary/ordinal comparisons, such as looking at filenames and other symbolic identifiers, and not in linguistic ones. This issue with Hungarian double compressions is a great example of an exception to that principle, of course.

It is interesting to speculate why complaints have never been escalated by the Hungarian users of Windows, since our collation is about sorting rather than absolute identity, which makes the behavior technically incorrect. I suspect it is one of the following, though:

  • The number of real world situations where such string comparisons might return different results could be reasonably small;
  • Hungarian customers may simply accept and enjoy the identity-type behavior in both situations;
  • Those customers may actually be resigned to the situation;
  • Some other reason I do not understand.

Truth be told, I am hoping it is mostly the first and/or second of these four options. :-)

Though I must say that language issues such as this one fascinate me and sometimes frighten me, as I think about how the behavior on Windows can shape user experiences and expectations across a culture. It definitely encourages me to try hard to do right by the language and not make decisions that would negatively impact a language or its usage!

 

This post brought to you by "ĝ" (U+011d, a.k.a. LATIN SMALL LETTER G WITH CIRCUMFLEX)