Earlier I posted about Giving a character a new identity (by giving it some secondary weight).

Now that post, while true, only tells part of the story.

Now I am going to tell the other part....

Take the following code and you may be able to see where I am going before you even look at the results:

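// Assumes: using System; using System.Globalization; using System.Threading;
CompareInfo ci = CompareInfo.GetCompareInfo("ja-JP");
string st1 = "\u30ef\u3099"; // ワ + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK, displayed as ヷ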

Thread.CurrentThread.CurrentCulture = new CultureInfo("ja-JP");
Console.WriteLine(ci.IndexOf(st1, "ワ"));
Console.WriteLine(ci.IndexOf(st1, 'ワ'));
Console.WriteLine(st1.IndexOf("ワ"));
Console.WriteLine(st1.IndexOf('ワ'));

string st2 = "\u0061\u030a";
Console.WriteLine(ci.IndexOf(st2, "a"));
Console.WriteLine(ci.IndexOf(st2, 'a'));
Console.WriteLine(st2.IndexOf("a"));
Console.WriteLine(st2.IndexOf('a'));

The results? They will be:

-1
-1
-1
0
-1
-1
-1
0

So what's the problem? Why does System.String.IndexOf(Char) behave differently than System.String.IndexOf(String), System.Globalization.CompareInfo.IndexOf(String, Char), and System.Globalization.CompareInfo.IndexOf(String, String), anyway?

Well, setting aside my disdain for all of the System.String shortcuts to globalization functionality, which make the real linguistic features of the System.Globalization namespace that much harder for developers both inside and outside of Microsoft to find (never mind the extra confusion caused by the incomplete flags they add), there is the fact that the System.String "shortcut" methods often contain actual shortcuts: attempts to be more performant by avoiding calls to the "slower" globalization methods.

So this particular issue can be looked at as an over-optimization, a case where developers assumed that they would not need to call the "slower" method in this situation.
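To make the difference concrete, here is a minimal sketch that spells out the two kinds of search through the StringComparison overloads of String.IndexOf. The class name is just for illustration. The ordinal search looks only at UTF-16 code units, so it finds the leading 'a' at index 0, which is exactly what the Char overload reported above. The culture-sensitive result depends on the collation data behind the current culture, so I am not annotating it; in the runs above it was -1.

```csharp
using System;

class OrdinalVersusCulture
{
    static void Main()
    {
        string decomposed = "\u0061\u030a"; // a + COMBINING RING ABOVE

        // Ordinal: compares code units only, so the leading 'a' matches at index 0.
        Console.WriteLine(decomposed.IndexOf("a", StringComparison.Ordinal));

        // Culture-sensitive: uses the current culture's collation rules,
        // which treat the base letter plus combining mark as one unit.
        Console.WriteLine(decomposed.IndexOf("a", StringComparison.CurrentCulture));
    }
}
```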

Were they wrong?

Well, in my view, yes. All of these shortcut methods are just plain bad if they ever do anything other than call the real methods in the System.Globalization namespace. Anything else makes for less maintainable code that has to be modified in multiple places whenever there is a change or a problem to fix, and it makes it harder for testers to track down all of those places and verify correct behavior in each one.
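As a rough sketch of what "just call the real method" could look like from the caller's side, here is a hypothetical helper (CultureIndexOf is my own name, not anything in the framework) that forwards a Char search straight to CompareInfo.IndexOf(String, Char) for the current culture, with no shortcut in between:

```csharp
using System;
using System.Globalization;

static class StringSearch
{
    // Hypothetical helper, not part of the framework: always defers to the
    // culture-sensitive search instead of scanning code units directly.
    public static int CultureIndexOf(this string source, char value)
    {
        return CultureInfo.CurrentCulture.CompareInfo.IndexOf(source, value);
    }
}

// Usage: int index = "\u0061\u030a".CultureIndexOf('a');
```

With a helper like this, the Char and String searches go through the same code path, so they cannot disagree with each other.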

Of course, I suppose that by now fixing the errant method would be, in some people's minds, a breaking change.

So let's make it more interesting and raise the stakes:

CompareInfo ci = CompareInfo.GetCompareInfo("sv-SE");
string st1 = "\u00e5";

Thread.CurrentThread.CurrentCulture = new CultureInfo("sv-SE");
Console.WriteLine(ci.IndexOf(st1, "a"));
Console.WriteLine(ci.IndexOf(st1, 'a'));
Console.WriteLine(st1.IndexOf("a"));
Console.WriteLine(st1.IndexOf('a'));

string st2 = "\u0061\u030a";
Console.WriteLine(ci.IndexOf(st2, "a"));
Console.WriteLine(ci.IndexOf(st2, 'a'));
Console.WriteLine(st2.IndexOf("a"));
Console.WriteLine(st2.IndexOf('a'));

The results here? You know, in this Swedish "A-Ring" case?

-1
-1
-1
-1
-1
-1
-1
0

So, that over-optimization is causing behavior differences between strings that are canonically equivalent in Unicode, to wit LATIN SMALL LETTER A WITH RING ABOVE (U+00E5) versus LATIN SMALL LETTER A + COMBINING RING ABOVE (U+0061 U+030A).

And that is a bug, suggesting that just taking out this over-optimization case might be in everyone's best interests....
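If you want to see the canonical equivalence for yourself, here is a small sketch (the class name is just for illustration) using String.Normalize. The two spellings differ as raw code units, become identical once both are normalized to the same form, and yet the Char overload of IndexOf only finds 'a' in one of them. It should print False, True, -1 and 0, in that order.

```csharp
using System;
using System.Text;

class CanonicalEquivalence
{
    static void Main()
    {
        string precomposed = "\u00e5";      // LATIN SMALL LETTER A WITH RING ABOVE
        string decomposed = "\u0061\u030a"; // LATIN SMALL LETTER A + COMBINING RING ABOVE

        // Different code units, so an ordinal comparison sees two different strings.
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal));

        // Normalizing both to the same form (NFC here) shows they are the same text.
        Console.WriteLine(precomposed.Normalize(NormalizationForm.FormC) ==
                          decomposed.Normalize(NormalizationForm.FormC));

        // The Char overload finds 'a' only in the decomposed spelling.
        Console.WriteLine(precomposed.IndexOf('a'));
        Console.WriteLine(decomposed.IndexOf('a'));
    }
}
```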

(Using the Swedish or Japanese culture as above is not required; it just makes the weirdness look worse. The bug is there either way.)

 

This post brought to you by å (U+00E5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)