Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
There are all kinds of interesting things in the Unicode standard. For example, the block of characters from U+A000 to U+A48F is for representing syllables in the "Yi script". Apparently it is a writing system for the Yi languages of southwestern China, said to have been developed during the Tang Dynasty.
A string drawn from this block has an unusual property; the string consists of just two characters, both the same: a repetition of character U+A0A2:
string s = "ꂢꂢ";
Or, if your browser can't hack the Yi script, that's the equivalent of the C# program fragment:
string s = "\uA0A2\uA0A2";
What curious property does this string have?
I'll leave some hints in the comments, and post the answer next time.
UPDATE: A couple of people have figured it out, so don't read the comments too far if you don't want to be spoiled. I'll post a follow-up article on Friday.
Hint #1: The curious property is platform-dependent; you'll want to be using a 32-bit version of CLR v4.
Hint #2: The curious property is also a property of a much more commonly-used string.
It has the same hash code as String.Empty?
Wow, you are fast!
Hint #3: You are on the right track; there is an even more curious property this string has which is related to its hash code being equal to that of String.Empty. -- Eric
"s.ToUpper() == s.ToLower()" is true. Though that's not that curious.
Indeed, I think all strings in Chinese-style languages have this property. - Eric
I am not seeing any curious property either. Although I did notice that it doesn't have a case difference like [ICR].
Is it something to do with byte order marks? Something like it matches an empty string in a different encoding with byte order marks?
Bingo to Eyal - in a 32-bit process, s.GetHashCode() == string.Empty.GetHashCode(), but not in an x64 process. I would guess this is a lesson in depending on hash codes? :-)
I'm at work right now and only have (easy) access to a 2008 R2 machine, ergo a 64-bit CLR. I'll give it a look when I get home.
Well, Eyal is correct: on the x86 v4 CLR, "\uA0A2\uA0A2".GetHashCode() == "".GetHashCode(). Though technically that doesn't meet the criterion of hint number 2. Unless the curious property that "\uA0A2\uA0A2" shares with a much more commonly used string (i.e. string.Empty) is 'having the hash code 757602046'. But I don't know, I just don't find that property all that curious.
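For anyone who wants to check the reported collision locally, here is a minimal sketch. The class name HashCheck and the helper SameHash are made up for illustration; they are not framework APIs. The result is expected to vary: per the comments above it shows up in a 32-bit CLR v4 process, and newer .NET runtimes randomize string hash codes per process, so the sketch deliberately prints what it finds rather than assuming an answer.

```csharp
using System;

static class HashCheck
{
    // Hypothetical helper: true when both strings report the same hash code
    // in the current process. Assumes neither argument is null.
    public static bool SameHash(string a, string b)
    {
        return a.GetHashCode() == b.GetHashCode();
    }

    static void Main()
    {
        string s = "\uA0A2\uA0A2";

        // The property is platform-dependent, so report the environment first.
        Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
        Console.WriteLine("Runtime version: " + Environment.Version);

        Console.WriteLine("s.GetHashCode():            " + s.GetHashCode());
        Console.WriteLine("string.Empty.GetHashCode(): " + string.Empty.GetHashCode());
        Console.WriteLine("Same hash code? " + SameHash(s, string.Empty));
    }
}
```

To try the 32-bit case, the project would need to target x86 rather than Any CPU on a 64-bit machine.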
As far as I can tell the hash codes also match on x64.
Collisions like this happen in real systems all the time when using GetHashCode(). However, a lot of the framework's comparing and sorting infrastructure, LINQ for example, depends on it.
For me this model of equality is (in some scenarios) broken. I would like to see a much improved hashing algorithm, with a lower collision probability, implemented in future versions of .NET.
Hashcodes are meant to collide! Not a problem depending on them! :)
Maybe the curious property is "sharing a hash code with string.Empty." Interestingly, Eric only seems to claim that the property is curious on s, but not on the "much more commonly-used string."
Hash codes are not meant to express equality (which is a good reason why that would be a broken equality model), but they surely can express inequality.
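To make that last point concrete, here is a small sketch. DefinitelyDifferent is a hypothetical helper, not something in the framework, and it assumes non-null arguments. Differing hash codes prove two strings are unequal; matching hash codes prove nothing, so an actual comparison is still needed.

```csharp
using System;

static class HashInequality
{
    // Hypothetical helper. Equal strings always produce equal hash codes
    // within one process, so different hash codes imply different strings.
    // The converse does not hold: a 'false' result here means "don't know".
    public static bool DefinitelyDifferent(string a, string b)
    {
        return a.GetHashCode() != b.GetHashCode();
    }

    static void Main()
    {
        string s = "\uA0A2\uA0A2";

        // Equal strings can never be reported as definitely different.
        Console.WriteLine(DefinitelyDifferent("abc", "abc")); // False

        // Platform-dependent: may be False even though the strings differ.
        Console.WriteLine(DefinitelyDifferent(s, string.Empty));

        // The real equality check always gives the right answer.
        Console.WriteLine(s == string.Empty); // False
    }
}
```

This is roughly how hash-based collections use hash codes: as a cheap filter first, followed by a real equality check whenever the hash codes match.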
It's the shortest (in code points) string whose hash code matches? Or the only two-code-point string to match?
The smallest (if treated as an unsigned number) legal UTF-16 string which shares the hashcode?
Or is it the fact that it's a palindrome (and string.Empty is essentially a palindrome in the root case), and the only such palindrome to share the hash code?