Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
There are all kinds of interesting things in the Unicode standard. For example, the block of characters from U+A000 to U+A48F is for representing syllables in the "Yi script". Apparently it is a Chinese language writing system developed during the Tang Dynasty.
A string drawn from this block has an unusual property; the string consists of just two characters, both the same: a repetition of character U+A0A2:
string s = "ꂢꂢ";
Or, if your browser can't hack the Yi script, that's the equivalent of the C# program fragment:
string s = "\uA0A2\uA0A2";
What curious property does this string have?
I'll leave some hints in the comments, and post the answer next time..
UPDATE: A couple people have figured it out, so don't read the comments too far if you don't want to be spoiled. I'll post a follow-up article on Friday.
Could it be that s.GetHashCode() == (s + s).GetHashCode()?
You're so close! -- Eric
> Could it be that s.GetHashCode() == (s + s).GetHashCode()?
More than that - any number of repetitions of this string have the same hashcode.
Give the man a cigar! Nicely done. -- Eric
I wonder if it has anything to do with string interning? Although I played around with String.Intern(...) on the 32-bit CLR v4 and didn't come to any definitive conclusion.
Eric, is it that if s is concatenated with itself any number of times (>= 0), the hash code of the resulting string will have the same hash code as s? If so, indeed an interesting property.
Yes! - Eric
Is this special property of this string (hash code equals empty string no matter how many times it is concatenated to itself) a freaky happenstance? Or is it by design? Does the underlying hashing function some how use "\uA0A2\uA0A2" in the hash code calculation? Or does it happen to be that these values somehow have fallen into an algorithmic crack?
This is hardly by design. Hash algorithm likely just reads the string in 4 byte pieces (2 chars), and 0xA0A2A0A2 is a fixpoint of the function it applies.
> > Could it be that s.GetHashCode() == (s + s).GetHashCode()?
> More than that - any number of repetitions of this string have the same hashcode.
And the much more common string that shares this property is of course String.Empty (where it is much less interesting).
=> the first thing I did was look at the answer :(
@Diego: It can express inequality it BOTH hashcode are generated on the same machine with the same .net framework version in the same mode (x84/x64).
@Eugene: the hashcode function clearly contains some kind of internal state, but if you include that state it's not a general fixpoint. For instance "abcd".GetHashCode() != "abcdꂢꂢ".GetHashCode(). On the other hand, for any 2-letter string s it appears that (s+"ꂢꂢ").GetHashCode() == s.GetHashCode(); even though s+"ꂢꂢꂢꂢ" does not; which seems to indicate that some parts of the internal state don't immediately affect the output but do after later iterations.
This kind of hashcode property might be exploitable to form a denial of service attack; it's a bit far-fetched, but vaguely reminiscent (though not nearly as easily exploitable) of the php/java floating point parsing bug of a while back...
The interesting behavior is (chiefly) function of two things, one is the specific magic number used to intialize the hash computation (352654597 or 0x15051505) and the second is that the hash calculation is formed of two such calculations, both entirely independent (one looking at even pairs of characters one looking at odd pairs) which are then merged at the end (independent of length).
The function involved for the "shuffle existing bits, bring in more bits from the source" is such that there is a magic input pair where the shuffle: (((num << 5) + num) + (num >> 0x1b)) followed by the bringing in of the bits (a simple xor) results in a no-op.
For any reasonable implementation this is an offshoot of the very desirable property that you do not want to bias the hash function output while wanting to keep the function fast.
The string.Empty case of course doesn't go through this shuffle then augment process at all, so it it a no-op by virtue of not actually happening :).
Thus, for strings with a sequence of 4 special characters from the start of the string till the suffix that matters will result in a collision.
This does indeed allow for targeted performance attacks (if you worry about this use a hash function not vulnerable to this that trades a little speed for being much harder to attack (or use a cryptographic hash like SHA-1 or better if you really care).
simply including the length in the calculation directly (rather than as a by product of the iteration count) would recuce the simplicity of the attack, but at the cost of bias in the output bits without quite a strong avalanche phase (since the string lengths encountered in the real world will tend to be very clustered and small). Essentially the better the avalanche the less easy it would be to target too.
the phonetic pronunciation of this is (IPA) m̥o pronounced (very roughly) like hmo (but with the hm bit voiceless)
Exploiting this would thus be a "hmOhmO" attack. Not really catchy so I can't imagine it would be popular :)
@Luc Bos : Me, too. The second thing I did was peek at the reference source out of curiosity. I think my favorite line in the reference source is hash1 ^= ThisAssembly.DailyBuildNumber . For context:
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
That makes me smile.
I suppose another curious property is that it looks like an M.C. Escher drawing of impossible eyeglasses.
This is not an interesting property of the string, but of the hash function.