Home-grown hash functions considered harmful...

About 7 years ago, when I was a high-minded computer science senior, a lowly computer science freshman buddy mentioned the trouble he was having with his homework. He had to design a hash table, and he was getting too many collisions for his design to be accepted for the assignment. I suggested that he always stick with a prime number of buckets. He tried that and -- like magic -- his collision rate went down dramatically. He asked me why it worked, and (being a member of the CS elite) I just scoffed. Isn't it obvious? Prime numbers are just better!

Now that I've been taken down a notch or two (six years of real industry experience will do that), I've learned that, contrary to what common practice suggests, it really isn't obvious. I've also learned that magic solutions to hard problems should be examined with suspicion.

The rest of this post is probably pretty boring, so here's the quick summary:

  1. If you have to write a hash function for something important (e.g. a .NET GetHashCode method or a C++ hash traits class), don't design your own. Steal one that has been well-tested. One of my favorites for speed, simplicity, and quality is the "Jenkins One-At-A-Time hash" (aka joaat hash - "one-at-a-time" refers to one byte hashed per iteration, though the algorithm is probably still mostly ok if you use a 16-bit char per iteration instead). You can find a good description of it and many others here. For more information, see the Wikipedia entry on hash tables.
  2. If you see any locally-developed hash table implementation that requires a prime number of buckets, you might want to track down the author and suggest a better hash function that doesn't place a prime number constraint on the number of buckets. Prime numbers are so 20th century! (Keep in mind that the hash table author might be aware of better ways but is remaining backwards-compatible with previous releases or is protecting the clients of the hash table from their own bad hash functions.)

The Goal: Fast Lookup

(This is review for those who aren't familiar with hash tables.) 

A common task (in both real life and computer programming) is looking things up. Given input "A" (the "key"), I want to find a corresponding value "B" (the "value"). Think finding a word in a dictionary (given a word, I want to find the definition) or finding the page of a book given a page number.

If the key is easily enumerable (for the purpose of this discussion, enumerable means I am able to easily map each key onto a unique non-negative integer) and all valid keys are densely packed into a small range in the enumeration, a numbered list works quite well. For example, if the key is an integer, and valid keys are all in the range 200 to 300, and all possible keys from 200 to 300 are valid (have associated values), then the input is easily enumerable and the enumeration is dense. For computer programs, the logical data structure for this situation might be an array or a vector, mapping each key (200..300) to an array index (0..100) by subtracting 200 and rejecting any key outside of the range 200 to 300.
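Here is a rough C++ sketch of that dense case. The names (DenseTable, kMinKey, kMaxKey) are made up for illustration, and I'm assuming string values just to have something to store:

```cpp
#include <array>
#include <string>

// Hypothetical example: valid keys are exactly the integers 200..300.
constexpr int kMinKey = 200;
constexpr int kMaxKey = 300;

class DenseTable {
public:
    // Stores a value; returns false if the key is outside 200..300.
    bool set(int key, const std::string& value) {
        if (key < kMinKey || key > kMaxKey) return false;
        slots_[key - kMinKey] = value;  // key 200 -> index 0, key 300 -> index 100
        return true;
    }

    // Constant-time lookup; returns nullptr for keys outside 200..300.
    const std::string* find(int key) const {
        if (key < kMinKey || key > kMaxKey) return nullptr;
        return &slots_[key - kMinKey];
    }

private:
    std::array<std::string, kMaxKey - kMinKey + 1> slots_{};  // 101 slots
};
```

Every slot gets used, so nothing is wasted. That stops being true as soon as the keys are sparse.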

If the key is not easy to enumerate or if the enumeration is not dense, looking up the value might be more difficult. If the key is enumerable but not dense (e.g. a string with up to 256 Unicode characters), you could theoretically create an array with indexes corresponding to all possible inputs, but unfortunately, most of the slots in the array will be empty and wasted. In the case of 256 Unicode characters, at one bit per value, your array of values will require more bits of storage than there are atoms in the known universe.

One good solution is to make a list of items sorted by the associated key. The list can then be searched using a binary search. This works quite well for a large class of problems, and assuming that the list allows for constant-time lookup of item a (where a is a number referring to the a'th item), finding an arbitrary item is O(log N) (where N is the number of items in the list). (To allow quick insertions and deletions, the list is typically implemented as a tree, but that's just an implementation detail.) The STL std::map class uses this method.
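As a small sketch of the sorted-list approach, here is a binary search over a vector of key/value pairs using std::lower_bound. The Entry alias and find_sorted name are mine, and the vector is assumed to already be sorted by key:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;  // (key, value)

// Finds the value for `key` in a vector that is sorted by key.
// Returns nullptr if the key is absent. std::lower_bound does the
// binary search, so this takes O(log N) comparisons.
const std::string* find_sorted(const std::vector<Entry>& sorted,
                               const std::string& key) {
    auto it = std::lower_bound(
        sorted.begin(), sorted.end(), key,
        [](const Entry& e, const std::string& k) { return e.first < k; });
    if (it == sorted.end() || it->first != key) return nullptr;
    return &it->second;
}
```

In real code you would probably just use std::map and let it keep things sorted for you.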

Another good solution is to split the set of values into many small groups or "buckets" based on the key. Each possible key always maps to the same bucket. Now if I want to find an item, I don't have to search the whole list -- I just have to search through the items in the correct bucket. Assuming a good mapping that evenly distributes items among buckets and a number of buckets proportional to the number of items, this method can allow us to find an item (or know that it is not present) in O(1) time. This method is often called the hash table.
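To make the bucket idea concrete, here is a bare-bones sketch. BucketTable is a made-up name, std::hash stands in for whatever key-to-integer function you prefer, and there is no resizing, so treat it as an illustration rather than something to ship:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class BucketTable {
public:
    // num_buckets must be greater than zero.
    explicit BucketTable(std::size_t num_buckets) : buckets_(num_buckets) {}

    // Inserts the key, or overwrites the value if the key already exists.
    void insert(const std::string& key, int value) {
        auto& bucket = buckets_[index_of(key)];
        for (auto& kv : bucket) {
            if (kv.first == key) { kv.second = value; return; }
        }
        bucket.emplace_back(key, value);
    }

    // Searches only the one bucket the key maps to.
    // The returned pointer is only valid until the next insert.
    const int* find(const std::string& key) const {
        const auto& bucket = buckets_[index_of(key)];
        for (const auto& kv : bucket) {
            if (kv.first == key) return &kv.second;
        }
        return nullptr;
    }

private:
    // The same key always lands in the same bucket.
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::vector<std::pair<std::string, int>>> buckets_;
};
```

The lookup is only fast if index_of spreads the keys evenly. If everything lands in one bucket, find degenerates into a linear search.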

The Challenge: Hash Functions

The mapping of input keys to bucket numbers is called the hash mapping. A good hash mapping will quickly map any input key A to an integer m from 0..M-1. It will do this in a way that evenly distributes the items in the hash table so that most of the buckets have about the same number of items. An ideal hash mapping will result in each bucket having the same number of items. "Perfect" hashing is where there are N items, N = M buckets, and each bucket contains exactly one item. In general, perfection is only possible when the items are known before the hash mapping is chosen.

A hash function is (for this discussion) a mapping of input keys to non-negative integers. This is an important step of a hash mapping. Given a hash function, the hash mapping can be completed by performing any mapping of non-negative integers onto the range 0..M-1. This is often done by dividing the result of the hash function by M and taking the remainder.
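The two steps can be sketched like this. toy_hash is a deliberately simplistic placeholder I made up (it just sums the bytes), not a recommendation; the point is only to show the split between the hash function and the reduction to 0..M-1:

```cpp
#include <cstdint>
#include <string>

// Step 1, the hash function: key -> non-negative integer.
// Placeholder only: summing bytes is a poor hash (anagrams collide).
std::uint32_t toy_hash(const std::string& key) {
    std::uint32_t h = 0;
    for (unsigned char c : key) h += c;
    return h;
}

// Step 2, completing the hash mapping: integer -> bucket in 0..M-1,
// done here by dividing by M and taking the remainder.
// num_buckets (M) must be greater than zero.
std::uint32_t bucket_for(const std::string& key, std::uint32_t num_buckets) {
    return toy_hash(key) % num_buckets;
}
```

For example, toy_hash("ab") is 97 + 98 = 195, so with 10 buckets the key "ab" goes to bucket 5. So does "ba", which is exactly the kind of flaw the rest of this post is about.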

Assuming that we have to pick a hash function ahead of time, desirable qualities include speed (minimize the time spent turning a key into its hash), simplicity (don't waste space in the memory of the computer, be easy to understand and debug for the benefit of the developer), quality (evenly distribute typical inputs into available buckets), and flexibility (work with any kind of input and any number of buckets).

Developing good hash functions is hard. In fact, if you have to pick a hash function before the items are known, it is provably impossible to avoid the worst-case scenario where all items end up in the same bucket and your search time becomes O(N). However, assuming that your inputs are not generated by a malicious attacker with knowledge of your hash function, it is possible to come up with some good hash functions that are fast, simple, flexible, and of high quality. It just isn't easy, especially because literature and common practice are full of examples of poorly designed hash functions.

Bad hash functions have persisted partly because it really doesn't matter much for many tasks. A fairly bad hash function might make your algorithm perform somewhat more slowly, but as long as your hash table correctly handles the collision cases, you might not even notice. In many cases where N is small or bounded, even the worst case of O(N) might not be a problem.

In addition, while good hash functions are hard and require careful analysis, in most cases, it is relatively easy to randomly pick some unsigned integer operations (set unsigned int x = something, then for each byte b of input set x = x (OP) b) and come up with a hash function that looks "pretty random". The problems don't show up right away, so the code gets checked in. In the rare case that the hash table ends up being a performance bottleneck, a liberal application of prime-number sauce makes the worst issues go away.

The Magic: Prime Numbers

A hash function designed "at random" without careful analysis will almost certainly have a very uneven distribution for the likely set of keys. Often there will be periodic patterns in the mapping of input to output. Inputs following a pattern will result in hash codes that follow some other pattern. The ideal hash function would evenly distribute all keys among all buckets (no matter how the keys might be related), so any common pattern leading to an uneven distribution is a flaw. Sometimes the flaw will go unnoticed, but sometimes the flaw will cause uneven bucket distribution leading to possible performance issues.

For example (strawman, I know, but not entirely far-fetched), assume I invent a hash function with the flaw that the result of the hash function is always a multiple of 48 if all inputs are ASCII digits (48..57). This might be ok for some item sets, but it suddenly becomes a problem when the items are "all zip codes in the US" with 96,000 buckets. Instead of O(1), the lookup becomes O(48) with 48 items in every 48th bucket and 0 items in all others.

So let's say I get a bug assigned to me to fix the performance issue. I determine that the problem could not possibly be with my hash algorithm -- I designed it at random, therefore the output must be perfectly random! (Not true, by the way.)

I play around with some data sets, and notice that the problem is really bad when the number of buckets is a nice round number, but seems to go away when I use a more "random" number of buckets. This is because given an input that is divisible by 48, the remainder when dividing by a nice round number is likely to preserve my hash function's distribution flaw:

M (the number of buckets) = 96000 = 48 x 2000 

h (the output of my hash function) = 59259216

h mod M = (48 x 1234567) mod (48 x 2000) = 48 x (1234567 mod 2000) = 48 x 567 = 27216. (The output of my hash mapping - still divisible by 48!)

On the other hand, dividing by a number that doesn't have any prime factors in common with the period of my distribution flaw (48 = 2x2x2x2x3) will do a nice job of hiding the flaw. Prime numbers don't have any prime factors other than themselves, so they magically hide a number of periodic hash function flaws.
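If you want to see the effect for yourself, here is a quick sketch that simulates the strawman flaw. It feeds hash values 0, 48, 96, ... through the remainder step and counts how many distinct buckets get used. The function name is made up, and 96001 is just a convenient bucket count that is odd and not divisible by 3, so it shares no factors with 48:

```cpp
#include <cstdint>
#include <vector>

// Simulates a flawed hash whose outputs are all multiples of `period`:
// reduces 0, period, 2*period, ... modulo num_buckets and returns the
// number of distinct buckets that receive at least one item.
std::uint64_t count_used_buckets(std::uint64_t period,
                                 std::uint64_t num_items,
                                 std::uint64_t num_buckets) {
    std::vector<bool> used(num_buckets, false);
    std::uint64_t count = 0;
    for (std::uint64_t k = 0; k < num_items; ++k) {
        std::uint64_t bucket = (k * period) % num_buckets;
        if (!used[bucket]) {
            used[bucket] = true;
            ++count;
        }
    }
    return count;
}
```

With 96,000 items, count_used_buckets(48, 96000, 96000) comes out to 2000 (only every 48th bucket is ever used), while count_used_buckets(48, 96000, 96001) comes out to 96000 (every item gets its own bucket). The hash function is exactly as flawed in both cases; the second bucket count just hides it.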

Not really understanding the logic, I do a few searches through the available literature (i.e. Google) and see references to prime numbers in relation to hash functions. I try a few prime numbers for the number of buckets, and it looks like all of the distribution issues magically go away. So I resolve the bug as "somebody else's fault", indicating that the client needs to use a prime number for the number of buckets.

This is inconvenient for the client, but it solves the problem, so it gets written into the documentation for my hash table. The prime-number-of-buckets idiom continues to propagate. (Oh my!)

The Right Way: Steal

While prime numbers smooth over a number of flaws, they really don't solve the problem. The problem is that poorly developed hash functions aren't well distributed. Using a prime number of buckets eliminates one class of issues (which seems to be the most common class), making the performance acceptable in more cases, but other issues remain. In addition, requiring a prime number of buckets causes some inconvenience -- the client now needs a table of prime numbers, hash computations require a division operation, hash table resizing becomes more complex, etc.

It turns out that with some careful design, it is possible to create hash functions that do not contain these periodic flaws. These high-quality hash functions are very good at producing even distributions over nearly any number of buckets given nearly any input. The result is that there need be no hash-function-induced constraint on the number of buckets.

Since careful design of hash functions is not something most developers (myself included) know how to do, the best advice I have is this: steal. The best code is the code you don't have to write yourself. There are many good hash algorithms out there that have undergone extensive analysis and have excellent distributions for nearly all kinds of input. Some are even extremely simple. All of them will almost certainly knock the socks off of anything you can develop on your own in terms of even distributions. Most of them will run more quickly than your home-grown solution. And they're free.

Here is a very simple hash function (Jenkins One-At-A-Time) that is better in almost every way than anything I saw in college (assume ub4 is a typedef for a 4-byte unsigned integer):

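#include <cstdint>

typedef std::uint32_t ub4;  // 4-byte unsigned integer

// Jenkins one-at-a-time hash of len bytes starting at key.
// Pass a previous result as `hash` to chain several calls together.
ub4 one_at_a_time(const char *key, ub4 len, ub4 hash = 0)
{
  ub4   i;
  for (i=0; i<len; ++i)
  {
    // Cast so bytes >= 0x80 are not sign-extended where char is signed.
    hash += static_cast<unsigned char>(key[i]);
    hash += (hash << 10);
    hash ^= (hash >> 6);
  }
  // Final mixing after all bytes have been consumed.
  hash += (hash << 3);
  hash ^= (hash >> 11);
  hash += (hash << 15);
  return hash;
}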

You can easily use this for any size of hash table by dividing and taking the remainder. Even better, use it with power-of-two hash tables by masking off the top bits of the hash. You can hash multiple parts of a structure by calling the function repeatedly and chaining the result each time.
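Here is a sketch of both tricks: masking for a power-of-two table, and chaining across the fields of a structure. The Person struct and the helper names are hypothetical, the hash is repeated so the snippet stands on its own, and I'm assuming the bytes are read as unsigned:

```cpp
#include <cstdint>
#include <cstring>

typedef std::uint32_t ub4;  // 4-byte unsigned integer

// Jenkins one-at-a-time hash; `hash` lets the caller chain calls.
ub4 one_at_a_time(const char *key, ub4 len, ub4 hash = 0)
{
  for (ub4 i = 0; i < len; ++i)
  {
    hash += static_cast<unsigned char>(key[i]);
    hash += (hash << 10);
    hash ^= (hash >> 6);
  }
  hash += (hash << 3);
  hash ^= (hash >> 11);
  hash += (hash << 15);
  return hash;
}

// Hypothetical record with two fields that both belong in the key.
struct Person {
  const char *first_name;
  const char *last_name;
};

// Chaining: the hash of the first field seeds the hash of the second.
ub4 hash_person(const Person& p)
{
  ub4 h = one_at_a_time(p.first_name,
                        static_cast<ub4>(std::strlen(p.first_name)));
  return one_at_a_time(p.last_name,
                       static_cast<ub4>(std::strlen(p.last_name)), h);
}

// Power-of-two table: keep the low bits, mask off the top ones.
// num_buckets must be a power of two (e.g. 1024).
ub4 bucket_pow2(ub4 hash, ub4 num_buckets)
{
  return hash & (num_buckets - 1);
}
```

For example, one_at_a_time("a", 1) is 0xca2e9442, which lands in bucket 66 of a 1024-bucket table. No division and no table of primes required.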