Sorting it all Out: Michael Kaplan's random stuff of dubious value
This post is about a not entirely intuitive fact about how collation is implemented in Microsoft products. It affects the results of both CompareString and LCMapString in Windows, of the CompareInfo and SortKey classes in the .NET Framework, and of products like Jet and SQL Server.
To help show what is happening under the covers, I will use the sort keys.
We'll use the letters A (U+0041, LATIN CAPITAL LETTER A) and Ą (U+0104, a.k.a. LATIN CAPITAL LETTER A WITH OGONEK), as well as their lowercase counterparts.
When getting sort keys using the default table (LOCALE_INVARIANT), the weights look like the following:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 12 01 01 00
ą  U+0105  0E 02 01 1B 01 02 01 01 00
Ą  U+0104  0E 02 01 1B 01 12 01 01 00
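To make the structure of these keys easier to see, here is a minimal Python sketch that splits one of the sort keys quoted above into its weight fields. It assumes the layout visible in this post's keys: four fields (Unicode, diacritic, case, special) separated by 01 bytes, followed by a closing 00. The helper name `split_sort_key` is hypothetical and is not part of any Windows or .NET API.

```python
def split_sort_key(hex_key: str) -> dict:
    """Split a space-separated hex sort key into its weight fields.

    Assumption: fields are separated by 01 bytes and the key ends in 00,
    as in the keys shown in this post.
    """
    data = bytes.fromhex(hex_key)
    if not data or data[-1] != 0x00:
        raise ValueError("sort key must end with the 00 terminator")
    # Drop the terminator, then cut at every 01 separator.
    fields = data[:-1].split(b"\x01")
    names = ("unicode", "diacritic", "case", "special")
    return {
        name: " ".join(f"{b:02X}" for b in field)
        for name, field in zip(names, fields)
    }


if __name__ == "__main__":
    # U+0104 in the default table: diacritic weight 1B, case weight 12.
    print(split_sort_key("0E 02 01 1B 01 12 01 01 00"))
    # U+0061 in the default table: no diacritic weight at all.
    print(split_sort_key("0E 02 01 01 02 01 01 00"))
```

An empty string for a field means the key carries no weight at that level, which is the case for the diacritic field of plain a and A.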
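Note the Unicode weights (0E 02, before the first 01 separator), the diacritic weights (1B, between the first and second separators) and the case weights (02 or 12, between the second and third separators). Now when we ignore case: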
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 02 01 01 00
ą  U+0105  0E 02 01 1B 01 02 01 01 00
Ą  U+0104  0E 02 01 1B 01 02 01 01 00
And when we ignore diacritics:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 12 01 01 00
ą  U+0105  0E 02 01 01 02 01 01 00
Ą  U+0104  0E 02 01 01 12 01 01 00
And then we ignore both:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 02 01 01 00
ą  U+0105  0E 02 01 01 02 01 01 00
Ą  U+0104  0E 02 01 01 02 01 01 00
Clearly, in the default table LATIN CAPITAL LETTER A WITH OGONEK is little more than a LATIN CAPITAL LETTER A with a hook on its foot. A small diacritic weight is added to show that it is still primarily a LATIN CAPITAL LETTER A. And the act of ignoring the diacritic gives results identical to those you get when the diacritic was never there in the first place: you can see it right in the weights.
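The effect of the ignore flags on the default-table keys can be reproduced with a small model. This is only a sketch that mimics what the keys quoted in this post show for single letters (the diacritic field is emptied, the case field falls back to 02); it is not how Windows computes sort keys, and the names `INVARIANT_KEYS` and `ignore_weights` are hypothetical.

```python
# Default-table (LOCALE_INVARIANT) keys, copied from this post.
INVARIANT_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "A": "0E 02 01 01 12 01 01 00",
    "ą": "0E 02 01 1B 01 02 01 01 00",
    "Ą": "0E 02 01 1B 01 12 01 01 00",
}


def ignore_weights(hex_key, ignore_case=False, ignore_nonspace=False):
    """Model the ignore flags on a single-letter sort key.

    Assumption: fields are separated by 01 and the key ends in 00.
    """
    data = bytes.fromhex(hex_key)
    fields = data[:-1].split(b"\x01")
    if ignore_nonspace:
        fields[1] = b""        # drop the diacritic weight
    if ignore_case:
        fields[2] = b"\x02"    # case weight seen on lowercase keys here
    rebuilt = b"\x01".join(fields) + b"\x00"
    return " ".join(f"{b:02X}" for b in rebuilt)


if __name__ == "__main__":
    for letter, key in INVARIANT_KEYS.items():
        both = ignore_weights(key, ignore_case=True, ignore_nonspace=True)
        print(letter, both)  # every line shows 0E 02 01 01 02 01 01 00
```

With both flags set, all four letters collapse to the key of plain a, which matches the last table above.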
Now, how about when we move to Polish, LCID 0x00000415? In Polish, LATIN CAPITAL LETTER A WITH OGONEK is a letter with a unique Unicode weight, and this causes a difference in the results:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 12 01 01 00
ą  U+0105  0E 04 01 01 02 01 01 00
Ą  U+0104  0E 04 01 01 12 01 01 00
Do you see what happened here? Since in Polish LATIN CAPITAL LETTER A WITH OGONEK has a unique Unicode weight, ignoring the case weight has a predictable effect:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 02 01 01 00
ą  U+0105  0E 04 01 01 02 01 01 00
Ą  U+0104  0E 04 01 01 02 01 01 00
And ignoring the diacritic weight has no effect whatsoever, since there is no diacritic weight to ignore: the sort keys stay exactly as in the unmodified Polish table above.
So the net effect is that for Polish, passing a NORM_IGNORENONSPACE flag in Windows, a CompareOptions.IgnoreNonSpace in the .NET Framework, or a collation in SQL Server such as Polish_CI_AI (Polish, case insensitive, accent insensitive) will never see LATIN CAPITAL LETTER A WITH OGONEK as a LATIN CAPITAL LETTER A, because Polish does not give the letter a diacritic weight.
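The difference between the two tables can be checked directly on the keys quoted in this post. The sketch below strips the diacritic field from each key and compares the results; it is a model of the behavior described here under the assumption that fields are separated by 01 bytes, not a call into Windows, .NET, or SQL Server. All names in it are hypothetical.

```python
# Keys for a and ą copied from this post (default table vs. Polish).
INVARIANT_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "ą": "0E 02 01 1B 01 02 01 01 00",
}
POLISH_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "ą": "0E 04 01 01 02 01 01 00",
}


def strip_diacritic_weight(hex_key: str) -> bytes:
    """Return the key with its diacritic field (second field) emptied."""
    data = bytes.fromhex(hex_key)
    fields = data[:-1].split(b"\x01")
    fields[1] = b""
    return b"\x01".join(fields) + b"\x00"


def equal_ignoring_diacritics(table: dict, first: str, second: str) -> bool:
    """Compare two letters of a table as if diacritics were ignored."""
    return strip_diacritic_weight(table[first]) == strip_diacritic_weight(table[second])


if __name__ == "__main__":
    print(equal_ignoring_diacritics(INVARIANT_KEYS, "a", "ą"))  # True
    print(equal_ignoring_diacritics(POLISH_KEYS, "a", "ą"))     # False
```

In the Polish table the difference lives in the Unicode weight (0E 04 versus 0E 02), so removing the diacritic field changes nothing and the two letters stay distinct.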
This is a common issue, whether you look at å (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE) in Swedish, Č (U+010c, a.k.a. LATIN CAPITAL LETTER C WITH CARON) in Slovenian, or any of the other hundreds of examples that exist in supported collations. The key is that in each case you must consider not only whether the character appears to have a diacritic on it but also how the language is looking at the string.
This post brought to you by "Ą" (U+0104, a.k.a. LATIN CAPITAL LETTER A WITH OGONEK)