Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
All of this stuff about high dimensional geometry has been known for quite a while. The novel bit that this paper is all about is how to actually build a fast index for searching a high-dimensional space where the query has a strong likelihood of being junk. The idea that these smart Microsoft researchers came up with is called Redundant Bit Vectors, and it goes like this:
Suppose we've got a 100 dimensional space of 8-byte floats, with one million records to search. Each record point has a "tolerance hypercube" constructed around it -- if a query point falls into that cube, then the record is a likely candidate for a match. Before we do any searches we build an index – building the index can take as long as we like so long as searching it is fast and the index is small.
To build the index, divide every dimension up into non-overlapping "buckets”. The algorithm for choosing the right number of buckets and the best bucket sizes isn't particularly interesting, so without loss of generality let's just assume that you've got four equally sized buckets for each dimension and move on. We've got 100 dimensions, four buckets per dimension, that's 400 buckets. Now associate each bucket with a one-million-bit integer. That’s a big integer – what integer do we pick?
Consider the third bucket for the 57th dimension. There’s a million-bit integer associated with that bucket. Set the nth bit of the integer to 1 if the nth database record's tolerance hypercube overlaps any part of the third bucket, considering only the 57th dimension, 0 if it does not. That is, this enormous integer answers the question “if the 57th dimension of the query is in the third bucket, what are all the database entries that we can immediately eliminate?” The ones that we can eliminate have zeros in their position in the integer.
Once we build those 400 million-bit integers, doing a query is a snap. First, figure out which 100 buckets the query point is in, one bucket per dimension. That gives us 100 one-million-bit integers. Binary AND those 100 integers together, and the million-bit integer that results has bits on only for those points which have tolerance hypercubes that overlap the query bucket in all 100 dimensions.
If the bucket sizes are chosen well then that should be a tiny number of points. Ideally, each of the four buckets per dimension would match only 25% of the points, but that’s not likely – maybe the tolerance rectangles are large enough and the dimensional data are interrelated enough that each bucket only eliminates 50% of the points. But even if that sub-optimal situation arises, each AND eliminates another 50%.
The first million-bit integer for the first dimension gets us down to 500000 candidate points, the second down to 250000, the third 125000, and by the time we get to the twentieth dimension we should be down to only a handful of points, if any. Detecting junk is very fast as the result will quickly go to all zeros. If it isn't junk and you discover a handful of match candidates then you can pull out the old Cartesian distance metric to check whether the query point is really a good match for any of them.
Obviously there are no chips that can AND together one hundred integers of a million bits each – most chips do it in 32 or 64 bit chunks, two at a time. But by being clever about how the bits are stored you can ensure good cache locality, so that the chip is almost always ANDing together portions of the integers which are close together in memory. And by being clever about keeping track of when there are huge long runs of zeros in the result as it is being computed, you can make the ANDing process faster by skipping regions that you know are going to go to zero.
Is this index small enough to keep in main memory? 400 million-bit integers is 50 MB. That’s a big chunk of memory, but certainly more reasonable than keeping the entire 800 MB database in memory.
Note that this is still a linear-time algorithm – if we’ve got n records then we’ve got to work with an n-bit integer, and that will take O(n) time. But because AND is such a fast operation compared to loading records into memory and computing Cartesian distances, in practice this is a winning algorithm.
Well, that was entertaining. Next time we'll get back down to earth -- I'll answer a few recent and not-so-recent questions about scripting.
This is an interesting idea. Personally I've written a nearest-neighbor search system based on approximate k-d tree search called "Best Bin First" from a paper by David Lowe. It'd be interesting to know how they compare. It may actually worthwhile for me to implement this algorithm (or perhaps LSH) for the system I was doing (it was comparing scale-invariant features in images), since junk queries were definitely a problem for me (once the nearest neighbor is far enough away, I don't really care anymore, even though these queries are the ones taking the longest most of the time).
Thanks for the great article!