Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team.
Today is the fifteenth anniversary of my first day of full time work here at Microsoft. Hard to believe it has been a decade and a half of writing developer tools. I am tremendously fortunate to be able to work with such a great team on such a great toolset for such great customers. I'm looking forward to the next decade and a half!
Now back to our regularly scheduled puzzling.
There are two curious properties of the string I mentioned last time. The first somewhat curious property is that this string has the same hash code as the empty string. A reader posted a comment to that effect within half an hour of the blog post going up.
Of course, there are only four billion possible hash codes and way more than four billion possible strings, so there have got to be collisions in there somewhere. There are infinitely many strings that have this property; this just happens to be a particularly short one.
What is much more interesting is that any number of concatenations of this string yields a string that also has the same hash code as the empty string! A number of readers discovered this property.
I hinted that the curious property was one shared by a common string. And indeed, the property of having the same hash code for arbitrarily many concatenations is also shared with the empty string. Ha ha!
I noted in my first hint that this property was platform-dependent; you need to be using a non-debug 32-bit version of CLR v4. As we document on the MSDN page, the string hashing algorithm is subject to change at any time, and in fact has changed in the past. The debug version of the CLR changes its string hashing algorithm every day, to ensure that none of our internal infrastructure depends on the hash code algorithm remaining stable.
What gives this particular string this bizarre property?
The internals of the hashing algorithm are pretty straightforward: basically, we start with two integers in a particular state, and then mutate those integers in some way based on successive bytes of the string data. For efficiency, the algorithm considers two characters (four bytes) of string data at a time. The transformation consists of some bit shifts and adds and whatnot, as you'd expect.
You can think of the hashing algorithm as a finite state machine with 64 bits of state. Each four bytes of input from the string changes the 64 bits of state. The final hashed result is derived from those 64 bits of state at the point where the input stream ends.
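The general shape can be sketched in a few lines. This is a toy model only: the mixing step, the handling of odd-length tails, and the way the two lanes combine are illustrative stand-ins, not the shipping CLR code.

```python
MASK = 0xFFFFFFFF  # model 32-bit wraparound arithmetic

def step(h, block):
    # One state transition: shift-and-add mix of the state, then XOR in
    # a four-byte block of string data. Illustrative, not the CLR's mix.
    return ((((h << 5) + h + (h >> 27)) & MASK) ^ block) & MASK

def toy_string_hash(s):
    # Two 32-bit lanes (64 bits of state) fed alternating blocks, each
    # block being two UTF-16 code units packed little-endian.
    h1 = h2 = 0x15051505          # start state; both lanes begin equal
    data = s.encode('utf-16-le')
    blocks = [int.from_bytes(data[i:i + 4].ljust(4, b'\0'), 'little')
              for i in range(0, len(data), 4)]
    for i, b in enumerate(blocks):
        if i % 2 == 0:
            h1 = step(h1, b)
        else:
            h2 = step(h2, b)
    return (h1 + h2 * 1566083941) & MASK  # fold 64 bits down to 32
```

Note that the empty string never runs the loop at all, so its hash is just the folded start state; any input that keeps the lanes parked in the start state must produce that same value.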
This particular sequence of four bytes - A0 A2 A0 A2 - has the property that it transforms the start state bits into themselves. As long as that input keeps coming in, the internals of the hash calculation never get out of the start state, and therefore must always output the same hash code as that of the empty string.
If you have a function f, and a value x such that f(x) = x, the value x is called a "fixed point" of f. What I've found here is technically not exactly a fixed point of the state transition function, but it has the same sort of flavour as a fixed point. Fixed points in various forms come up a lot in computer science and mathematics; they have many interesting and useful properties which I might go into in later fabulous adventures. Today though I want to think a bit about the security implications.
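The flavour of this is easy to reproduce in miniature. With an illustrative mixing step (not the CLR's), if each transition is "mix the state, then XOR in the input block", then a block that sends the start state back onto itself can be read off directly:

```python
MASK = 0xFFFFFFFF

def mix(h):
    # some 32-bit mixing step (illustrative constants)
    return ((h << 5) + h + (h >> 27)) & MASK

def transition(h, block):
    # hash state transition: mix the state, then XOR in the input block
    return mix(h) ^ block

START = 0x15051505

# Because the block is XORed in *after* the mix, the block that maps
# the start state back onto itself is simply:
magic_block = mix(START) ^ START

state = START
for _ in range(1000):            # feed the magic block over and over...
    state = transition(state, magic_block)
assert state == START            # ...and the state never moves
```

Strictly, `magic_block` is chosen so that `START` is a fixed point of the one-block transition, which is the "same sort of flavour" mentioned above.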
Essentially what we have here is a way of constructing arbitrarily many strings that all have the same hash code. (*) If a hostile client can feed many such strings into a hash table on a benign server, the attacker can essentially mount a denial-of-service attack against the server that uses the hash table. The hash table will be unable to hash the strings into different buckets, and therefore it will get slower, and slower, and slower.
It's not much of an attack, but it is an attack. (**)
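To see how badly collisions hurt, here is a toy separate-chaining hash table (not how the real Dictionary&lt;K,V&gt; is implemented) that counts key comparisons:

```python
class ToyHashTable:
    """Minimal separate-chaining hash table that counts key comparisons,
    to show how forced collisions degrade each insert to a linear scan."""
    def __init__(self, hash_fn, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]
        self.hash_fn = hash_fn
        self.comparisons = 0

    def insert(self, key, value):
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            self.comparisons += 1
            if k == key:
                bucket[i] = (key, value)  # replace existing entry
                return
        bucket.append((key, value))

good = ToyHashTable(hash_fn=hash)        # ordinary spread of hash codes
bad = ToyHashTable(hash_fn=lambda k: 0)  # adversary: every key collides

keys = ["key%d" % i for i in range(500)]
for table in (good, bad):
    for k in keys:
        table.insert(k, None)

# 'bad' put all 500 keys into one bucket, so inserting key i scanned
# the i earlier keys: 0 + 1 + ... + 499 = 124750 comparisons in total.
```

The attacker doesn't need the table to break; it just needs every operation to cost time proportional to everything inserted so far.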
As always, when thinking about how to mitigate attacks, think about "defense in depth". If you think you might be vulnerable to this sort of attack, make the attacker do several impossible things, not just one, to pull off the attack. For example:
* Monitor inputs to ensure that one user isn't sending an unusually huge amount of data for processing; throttle them back if they are.
* Filter inputs for unusual characters; attacks on hash tables are likely to involve out-of-the-ordinary, specially crafted data. Most people aren't going to be sending a database strings full of Yi Script characters.
* Monitor system performance, looking for slowdowns that might be indicative of DoS attacks.
* Build a custom hash table that uses its own string hashing algorithm. When you make a new hash table, randomize the starting state and the transformation of the hash.
And so on. Of course, this is just a sketch of a few ideas; if you are genuinely worried about such an attack, it is quite tricky to actually design and implement an attack-resistant hash table. Consult an expert. (I am not an expert on this!)
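As a sketch of that last bullet — a per-table randomized start state — here is a minimal Python version (illustrative mixing step, emphatically not a production hash):

```python
import os

class SeededStringHasher:
    """String hasher whose start state is randomized per instance, so an
    attacker cannot precompute a set of colliding inputs in advance."""
    def __init__(self):
        # fresh random 32-bit start state for every table
        self.seed = int.from_bytes(os.urandom(4), 'little')

    def hash(self, s):
        h = self.seed
        for ch in s:
            # illustrative shift-and-add mixing step, masked to 32 bits
            h = (((h << 5) + h) ^ ord(ch)) & 0xFFFFFFFF
        return h

a, b = SeededStringHasher(), SeededStringHasher()
# Each hasher is self-consistent, so it works fine inside one table...
assert a.hash("hello") == a.hash("hello")
# ...but two hashers almost certainly disagree, so a collision set
# crafted against one table is useless against another.
```

The design point: the codes only ever need to be consistent within the lifetime of one table, so nothing is lost by randomizing them per table.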
(*) Of course there are plenty of ways to come up with hash collisions; like I said before, there are infinitely many hash collisions since there are only four billion possible hash codes. This is just a particularly cheap and easy way to come up with hash collisions.
(**) And of course an attack by hostile clients on a slow hash table is just a special case of "attacking" yourself by accidentally hitting a slow code path due to a buggy design, as I learned the hard way.
The ability to concatenate this pattern to any string without changing the hash code brought back some memories of datalinks back in the 1970s (although the two situations are not really related except by mental association). For various electrical reasons it was "a bad idea" to let the line idle in either the "high" or "low" state; many different protocols (including Manchester encoding) were used, but quite a few used bit patterns that would not impact the "live" data. Much of the time, finding these patterns was very similar to "solving puzzles"...
Happy anniversary, Eric! Hopefully the next decade and a half will provide adventures that are just as fabulous as the last decade and a half!
The person in the last post who found the code in the reference source that changes the string hashcode algorithm daily in internal builds made me wonder.
Is there any actual reason why the CLR shouldn't simply initialize a static variable in every AppDomain at AppDomain startup time (or perhaps once per process instead), set it to a random number, and use that as the initial state in the string hash algorithm? If that were done, then nobody would ever be able to complain about the hashing algorithm changing, because it changes every time you run a program. Bugs caused by incorrectly assuming the hashcode algorithm was guaranteed to be consistent would be much rarer if this were done.
And obviously the BCL team understood this, by building this protection into their debug builds. Why not build in the same protection for external developers too?
@Stuart - Wouldn't that make it effectively impossible to serialize a dictionary for later or distributed use? The MSDN documentation calls out that the behavior of GetHashCode "might change from one version of the common language runtime to another" -- not that it will change so that subsequent runs of the same program may return different results.
@Chris - that's the point, isn't it? People serialize hashcodes for later or distributed use, and then screw up when the version of their software running on .NET 4 can't read the file that was saved from the .NET 2 version. Or their distributed software fails when some of the nodes are 32bit and some are 64bit.
(I have to assume that the serialized version of a dictionary doesn't depend on the hashing function, though. That would be stupid, and the BCL programmers aren't stupid.)
The whole point is that it's not actually legal to serialize hashcodes because you can't assume the hashing function will stay the same. But having it stay the same in practice 90% of the time leads people to tempt fate in exactly that way. Better make it change every time so that people will know not to depend on it.
I'd be wary of giving the "Filter inputs for unusual characters" advice. That's the sort of advice that leads people to write code that disallows non-alphanumeric characters in password fields (attempting to prevent an unlikely attack ends up making dictionary attacks trivial) or rejects valid email addresses (such that somebody can't enter their actual email).
Considering the vanishingly-small odds of you being qualified to determine which characters at http://unicode.org/charts/ are actually "unusual", it's probably better to not attempt the cleverness.
@Stuart, it seems Dictionary<K,V> is serialized as an array of KeyValuePair<K,V> (plus some additional data, like the comparer used). So it should work even when the hash changes.
Right; you want to serialize a dictionary as its *logical* structure, a list of pairs, not as its *internal* structure, a list items grouped by hash code. Getting that wrong has major negative consequences. The VB team back in the day - like, early 1990s - serialized part of the project state by simply dumping the bucket lists of some internal symbol table hash tables out to disk, and as a result ended up in a situation where for backwards compat reasons they could *never* change their string hashing algorithm, even when they had to add features like supporting Unicode identifiers in source. It was a total pain in the rear, let me tell you. That lousy hash algorithm was baked into the product for *years*. -- Eric
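The "list of pairs" shape described here can be sketched in a few lines (a minimal illustration; real serializers also handle typing, comparers, versioning, and so on):

```python
import json

def dump_table(d):
    """Serialize the *logical* content of a table: a plain list of
    key/value pairs, independent of any hash function or bucket layout."""
    return json.dumps(sorted(d.items()))

def load_table(s):
    """Rebuild the table; whatever hash algorithm the current runtime
    uses lays the entries out into buckets afresh."""
    return dict(json.loads(s))

round_tripped = load_table(dump_table({"banana": 2, "apple": 1}))
assert round_tripped == {"apple": 1, "banana": 2}
```

Because the wire format never mentions hash codes, the hashing algorithm is free to change between the writer and the reader.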
Happy anniversary, Eric!
Keep up the good work here at your blog, I'm sure all of us find reading it always a pleasurable experience :)
Your adventures have been nothing less than fabulous.
Other hashing algorithms (MD5, etc) use the length of the consumed data as an additional state transform, which would alleviate this problem.
Stuart: If every process hashed strings differently, how would you build something like a distributed hash table? You'd have to write your own hash code instead of being able to rely on the fast one built into the system.
As Jason suggested, it's probably a good idea to hash the length too, and it doesn't seem like it would be too hard -- just start the hash one int earlier. In my tests it adds only 1-2 nanoseconds per call.
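A sketch of the length idea (illustrative constants and mixing step, not the CLR's): mix the length in as if the hash had started one block earlier, and a repeated block that fixes the default start state no longer collides with the plain hash.

```python
MASK = 0xFFFFFFFF
SEED = 0x15051505  # illustrative start state, not a CLR contract

def mix(h, block):
    # one illustrative shift-and-add transition step
    return ((((h << 5) + h) & MASK) ^ block) & MASK

def hash_blocks(blocks, seed=SEED):
    h = seed
    for b in blocks:
        h = mix(h, b)
    return h

def hash_blocks_with_length(blocks):
    # "Start the hash one int earlier": mix the input length in first,
    # so the effective start state depends on the length.
    return hash_blocks(blocks, seed=mix(SEED, len(blocks)))

# A block that maps the start state onto itself collides for every
# repetition count under the plain hash...
magic = mix(SEED, 0) ^ SEED
assert all(hash_blocks([magic] * n) == SEED for n in range(1, 6))

# ...but once the length perturbs the start state, the trick no longer
# pins the state there:
assert hash_blocks_with_length([magic]) != hash_blocks([magic])
```

This doesn't remove collisions in general — it can't — but it does break the specific "any number of concatenations" pattern.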
Does anybody know why the 64-bit CLR (v4.0) uses a different hash algorithm than the 32-bit version? It seems the 32-bit version hashes 2 characters at a time and looks at the whole string, while the 64-bit version hashes only 1 character at a time and stops when it sees a NUL embedded within the string. The 64-bit hash algorithm is not only slower, but has pathological cases (all strings with a given prefix up to a NUL hash to the same value).
My own quick benchmarks show that when compiled for x64, the 32-bit hash function should be faster than the current 64-bit hash function except for strings of length 1 and those with embedded NULs.
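The pathological case described above is easy to model with a toy one-character-at-a-time hash that stops at a NUL (a guess at the shape of the described behavior, not the actual 64-bit CLR code):

```python
def hash64_style(s):
    """Toy model of the described 64-bit behavior: hash one character
    at a time and stop at the first embedded NUL, as C-style code
    scanning a NUL-terminated buffer might. Illustrative constants."""
    h = 5381
    for ch in s:
        if ch == '\0':
            break
        h = ((h * 33) ^ ord(ch)) & 0xFFFFFFFFFFFFFFFF
    return h

# Pathological case: every string sharing a prefix up to a NUL collides,
# no matter what follows the NUL.
assert hash64_style("abc\0xyz") == hash64_style("abc\0qqq") == hash64_style("abc")
```

Since embedded NULs are perfectly legal in .NET strings, stopping at one discards all the entropy after it.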
@Gabe: I'm not really familiar with distributed hash tables or what they're used for, but surely they still have all those problems in the world that exists now? Unless I'm misunderstanding what "distributed" means in this context, some of the nodes might be running on 64bit and some on 32bit. And presumably you'd want to allow nodes to be upgraded to new versions of your software (and hence new versions of the CLR) gradually rather than all at once.
Congratulations Mr Lippert !
I hope you will be even more successful in the future.
"The VB team back in the day - like, early 1990s - serialized part of the project state by simply dumping the bucket lists of some internal symbol table hash tables out to disk"
Don't the entries still have to contain the keys? (and thus one could write a deserializer that ignores this structure, just iterating over these "bad buckets" and importing every entry into the new table) - I would think that resolving collisions, rehashing to a different size, etc, would be impossible otherwise.
@Gabe: I think both of these improvements deserve an entry on Connect.
The 64-bit hash algorithm is just one entry in a long list of features that could take better advantage of x64 CPUs or that are inferior to the 32-bit JIT/CLR. While overall the 64-bit version certainly isn't worse, almost every time I happen to look at the code generated by the JIT, I see potential for improvement.