Google Desktop On The Cheap Part One: Nope, That Doesn't Work

  • Comments 11

So, how many files are we talking about here? 

  • I've got over thirty thousand source code files containing over seven million unique words. There are pretty much two kinds of words -- words that occur in only one, two, or a dozen files, and words that occur in almost every file ("Microsoft", "next", and so on).
  • An index consists of a searchable list of words. Each word is mapped to its postings list -- the list of all the documents in which the word is found. (A postings list could also contain information about where in the document the word is found, but in general, these documents are so short that it doesn't really matter. If we need to know precisely where the word is, it's fast enough to just search that document again from scratch. We want to make finding the document in the first place cheaper.)
  • Building the index can take some time -- it can take hours, certainly. Days, no.
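To make that structure concrete, here is a toy sketch of an index as a map from words to postings lists. The file names and words are invented for illustration:

```python
# A toy inverted index: each word maps to its postings list -- the list
# of documents containing that word. File names here are made up.
index = {
    "microsoft": ["app.cpp", "main.cpp", "util.cpp"],  # occurs in nearly every file
    "quuxify":   ["util.cpp"],                         # occurs in just one file
}

def lookup(word):
    # a missing word simply has an empty postings list
    return index.get(word, [])
```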

Like I mentioned last time, it's that last step that's a doozy. Building the index efficiently is going to be the hard problem here.

Here's an idea that doesn't work -- but why it doesn't work is important.

As you know from my previous article on counting unique words in a document, I like making clever use of hash tables. Hash tables are (in theory…) O(1) for each insert, so building a hash table mapping n words to their posting lists should be O(n). Searching a hash table is also O(1), so we're all set, right?

Of course, the combination of a working set that might exceed the virtual memory space and keeping all that stuff in memory in a script is going to tax the garbage collector, which was not designed for such tasks. Hmm.

We could keep the hash table buckets on disk, and just read in each bucket as we needed it. Imagine if we had, say, 1024 files, each containing roughly 7000 words and their posting lists. The files could be named x000, x001, … x3FF. All the words in, say, x001 are those words which, when hashed to a ten-bit hash, give 001. We could then easily sort each index file alphabetically to make searching within it faster. 

The search algorithm pseudocode looks something like this:

hash the target word
load the sorted index file for that hash into memory
binary-search the sorted index file for the target word
if found, display its posting list
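A minimal runnable sketch of that search, in Python for illustration. It assumes bucket files in the format described above, with one sorted "word, tab, comma-separated postings" line per word; zlib.crc32 stands in for the ten-bit hash so that bucket assignment is stable from one run to the next (Python's built-in hash() is randomized per process). The names here are all invented:

```python
import bisect
import zlib

def bucket_name(word, bits=10):
    # stable ten-bit hash -> bucket file name x000 .. x3ff
    return "x%03x" % (zlib.crc32(word.encode()) & ((1 << bits) - 1))

def search(word):
    # load the sorted index file for the word's hash bucket
    try:
        with open(bucket_name(word)) as f:
            lines = f.read().splitlines()   # ~7000 sorted lines: small
    except FileNotFoundError:
        return []
    words = [line.split("\t", 1)[0] for line in lines]
    i = bisect.bisect_left(words, word)     # the binary search
    if i < len(words) and words[i] == word:
        return lines[i].split("\t", 1)[1].split(",")   # its posting list
    return []
```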

Hashing is fast, and let's assume that the index files are small enough that loading them into an array in memory is also fast. Binary-searching the roughly 7000 (≈ n / 1024) items in the list takes about 13 comparisons, and we're done. This is pretty darn quick -- one hash, one file load, a dozen string comparisons. 

How do we get here though? The index building algorithm goes like this pseudocode:

for each file
      extract the word list for this file
      for each word in the word list
            hash the word
            add the word and the file name to the appropriate index file 
      next
next
for each index file
      sort index file alphabetically
next
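Here's a runnable Python sketch of that pseudocode, purely for illustration. As above, zlib.crc32 stands in for the ten-bit hash, and the function names are invented. Note that the first loop faithfully reopens a bucket file for every word -- that per-word append is exactly the behavior whose disk cost we're about to analyze:

```python
import zlib

def bucket_name(word, bits=10):
    # stable ten-bit hash -> bucket file name x000 .. x3ff
    return "x%03x" % (zlib.crc32(word.encode()) & ((1 << bits) - 1))

def build_index(files, read_words):
    touched = set()
    # first loop: append each word/file pair to its index file
    for fname in files:
        for word in set(read_words(fname)):   # one posting per file
            bname = bucket_name(word)
            with open(bname, "a") as out:
                out.write("%s\t%s\n" % (word, fname))
            touched.add(bname)
    # second loop: sort each index file alphabetically, merging each
    # word's lines into a single postings line
    for bname in touched:
        postings = {}
        with open(bname) as f:
            for line in f.read().splitlines():
                word, src = line.split("\t", 1)
                postings.setdefault(word, []).append(src)
        with open(bname, "w") as f:
            for word in sorted(postings):
                f.write("%s\t%s\n" % (word, ",".join(postings[word])))
```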

(There are more optimizations we could perform, like how to efficiently compress the posting list, but we'll come back to that later.)

What's the order of this thing? Let's assume for the moment that the cost of opening the source file and extracting a list of words from it is "cheap". (We'll also return to this handwave in a later entry!) We have in total n words in all source files. Hashing and adding each word to the end of its index file is O(1) per word, so that's O(n) altogether for the first loop.

Sorting the hash files using a comparison sort comes down to O(n log n) if you work out the math. (We pick up some advantage from having to only sort 1024 small chunks of the table at a time rather than the whole table, but there's no asymptotic advantage.) Clearly that dominates the linear first loop.

It seems from our analysis so far that, unsurprisingly, the sorting is the expensive part, not the hashing.

We could eliminate that. What if we went to a 20-bit hash, a million index files, and a linear search? Then the cost of building the index is still O(n), the sorting cost goes away, and the search gets cheaper. Of course, then we're slamming the file system with a million tiny files, but that's another story. In general, let's say that there are b index buckets.
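A quick sanity check on the arithmetic behind that variant:

```python
# With a 20-bit hash there are about a million buckets, so each bucket
# holds only a handful of the seven million words -- a linear scan of
# one bucket is trivial, no sorting required.
n = 7_000_000        # total words
b = 2 ** 20          # index buckets for a 20-bit hash
assert b == 1_048_576
assert n // b == 6   # roughly half a dozen words per bucket
```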

Unfortunately, neither of these techniques works very well, and the analysis above is completely bogus. In the real world, the asymptotic analysis of processor time is irrelevant because it discounts an important fact: disk seeks are way more expensive than processor time. Disk access involves physically moving a robot arm over a spinning hunk of iron-coated aluminum, as opposed to juggling a few electrons at the speed of light.

Let's look at this from the point of view of disk accesses. What does building the index do in this algorithm? Well, odds are good that every new word that comes in will go to a different hash bucket. That's the whole point of a good hash algorithm, after all. That means that every time we add the word to the appropriate index file, we open the file, seek the disk head to the end of the file, write bits to disk, close the file, and repeat -- seek to a new location, open the file, write bits…

The hardware is pretty smart about keeping in-memory caches of what's going on, but eventually we're going to hit the same problem we're trying to avoid -- the cache will fill up, and we'll spend all our time running around the disk putting bits on platters. There are O(n) seeks!  Seven million words, some number of files -- on average, there will be O(b) other files opened and closed before any given index file is opened the next time! That's going to slam any cache, and probably fragment the disk while we're at it.

The sorting, on the other hand, opens a file, reads it all in, sorts it, writes it all out, repeat. The disk access is highly efficient -- there are O(b) seeks, period -- and the processor cost of sorting a 7000 word file is not huge compared to each seek.

External hash tables aren't the answer. And, for pretty much exactly the same reason, balanced multiway trees (aka "b-trees") don't work either. The O(n) disk seeks in the first part thoroughly dominate the O(b) seeks in the sort. The processor time is largely irrelevant. 

Therefore, any indexing algorithm that we come up with should have the property that it allows large sections of the index to be written to a small number of files at a time.

Next time, we'll suss out an algorithm that does fewer disk seeks during the index building.

  • Have you looked into b+ trees? I'll admit no use of them myself, but I hear it's what SQL Server uses to contain indexes. Loosely (and I'll admit my understanding is very limited!) instead of storing a single key/value pair in a node, you use large nodes containing many keys or key/value pairs. Values are only stored (sorted) in the lowest-level leaf nodes. Key nodes contain keys in sorted order with pointers to child nodes (which may be key or value nodes).

    When searching, you start by looking in the root node. If your search key is less than or equal to the first key in the root node, you look in the first child node. If it's greater than the first and less than or equal to the second key, look in the second child. And so on until greater than the p-1th key, where it's in the pth child. If this child is a value node, you search for the key in this node. Otherwise, you follow the rules above to get to the next child.

    This means that you do, worst case, log[p](N) [I think!] I/Os to find the item, assuming that you read a whole node from the disk in one go. SQL Server uses 1 node == 1 page, where SQL Server's pages are 8KB at present, although it's complicated by the fact that SQL Server supports variable-length keys, so it can't support a constant number of keys per page. Inserts and deletes are obviously pretty complicated.

    A reference I understood (*sheepish grin*): http://www.cs.msstate.edu/~cs2314/global/BTreeAnimation/index.html. I searched for "B+ tree" on Google.
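The descent the commenter describes can be sketched like this -- a purely illustrative toy, not SQL Server's actual layout. Nodes are plain dicts, and bisect picks the child whose key range covers the search key ("less than or equal to the first key, look in the first child," and so on):

```python
import bisect

def bplus_search(node, key):
    # walk internal nodes down to the leaf level
    while not node["leaf"]:
        # keys <= node["keys"][i] live in node["children"][i]
        i = bisect.bisect_left(node["keys"], key)
        node = node["children"][i]
    # leaf node: keys and values are stored sorted, side by side
    i = bisect.bisect_left(node["keys"], key)
    if i < len(node["keys"]) and node["keys"][i] == key:
        return node["values"][i]
    return None
```

Each loop iteration is one node -- that is, one page read -- which is where the log[p](N) I/O bound comes from.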
  • See the third-to-last paragraph of today's entry, where I say

    for pretty much exactly the same reason, balanced multiway trees (aka "b-trees") don't work either.

    As you mention, it's O(log[p] n) disk I/Os per search, and if p is, say, 1000, then that's a billion records within three page reads.

    (Incidentally, the CPU time is O(p log[p] n) for the algorithm you describe, O(log n) if you write a more clever intra-bucket search. But the CPU time is irrelevant compared to the disk time.)

    Didn't I just say that a hash table solution has O(1) disk i/os per search? Sure, log[1000] n isn't much bigger than 1 for reasonable values of n, but clearly hash tables beat btrees for search. Why would anyone use btrees at all?

    Because hash tables do great when searching for a particular item, but bad for returning a set of related items in order. It's all very well to say "what is the value of key foo", and have it be really quick, but if the question is "what key comes after foo in the database?" hash tables do badly.

    Hash tables work because they deliberately DELOCALIZE data. They take data that are similar and make it cheaper to store them by putting a whole bunch of dissimilar things together into one bucket. Delocalizing your data is the price you pay for O(1) access -- the thing that gets you the O(1) access is that you are no longer relying on the ability to compare one item to another, and therefore sorting goes out the window.

    Btrees on the other hand keep huge wodges of highly similar data close together on the same page. Btrees encourage massive data locality. This is why databases are so good at giving you the "next ten records" -- because the next ten records are in the same disk page.

    But building a btree index sucks even worse than building a hash table index. Remember, we had one disk seek per word for the hash table solution, giving a total number of disk seeks of O(n). With a btree, insertion takes O(log[p] n) disk seeks, for a total of O(n log[p] n) seeks!

    Suppose I had my seven million records stored in a btree with average branching factor of 1000. That would be on average three seeks per word inserted, for a total of 21 million seeks, far worse than the 7 million with the hash algorithm.
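That arithmetic checks out:

```python
import math

n, p = 7_000_000, 1000          # records and b-tree branching factor
# each insert costs about log base p of n disk seeks
seeks_per_insert = math.ceil(math.log(n, p))   # log_1000(7 million) ≈ 2.3
assert seeks_per_insert == 3
assert seeks_per_insert * n == 21_000_000      # vs. 7 million for the hash table
```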
  • If we could guarantee that the btree insertions were in key-sorted order, then it gets way cheaper. But the fact that the data isn't sorted in the first place is why we're building an index!

    One potential optimization would be to "batch up" the insertions in memory, sort them in memory, insert them, and write a paging system that does not flush pages to disk immediately when they're dirty.

    You can also keep pages close to the root in memory, which helps.

    Tricks like that improve the situation somewhat, but as the btree gets big, odds are good that inserting fairly large sorted data sets is going to hit many pages. And if there are a thousand second-level nodes, it's expensive to keep them all in memory!


  • Just out of curiosity, have you investigated how GDS or Lookout create their indexes?
  • Lookout uses the .Net port of the Apache open-source Lucene indexer, so you can download the source over here: http://sourceforge.net/projects/nlucene/

    Text indexes that are intended to support fuzzy searches do things like word stemming and soundex compression, which means that you end up with multiple keys on each entry. So you end up with a combination of approaches. The goal is to try to reduce the number of disk reads while searching the most common cases. (one, two, three word searches, searches with one misspelled word, etc)
  • With regards to nlucene, Novell's mono port of Lucene (beagle) may be more up to date and converts to 'native' .NET if you remove 5 lines of logging code.
  • For the first attempt I would probably just use BerkeleyDB or something like that; it should be heavily optimized, so I would expect it to work much better than something I can come up with in a reasonable time. If that was too slow, I would try to "abuse" the fact that some words occur quite frequently and some very rarely. So I'd process just part of the data using an in-memory hash, stopping before the hash gets too big, then move the less frequent words to an on-disk hash, keeping only the top N% of the words seen so far in memory. So for the frequent words I'd check and update the in-memory hash; for the less frequent ones I'd have to go to the disk.

    Also you said you only want to know the documents containing the words, so you do not care whether a word may be found several times in a file, right? It may be beneficial to keep a small in-memory hash of the words seen in the current file and skip those you've already processed.
  • You might also want to consider adding position / proximity to the information you keep about each word in a document.

    It is a common search method to use an 'inverted index' which contains all significant words in all documents. Look up Swish-E for some background...

    - h.
  • I monitored Google Desktop's disk usage; when I enter phrases like "is or and" it reads a small amount of data from disk, say 2MB. It seems it knows the result before reading!!!

    My source files are about 5GB of text, in which there should be about 10 million occurrences of "is".

    Does google use a super compressed, say 4 bit, hashing algorithm?
  • My source files are about 1.21 jigowatts of text.

    ONE POINT TWENTY ONE JIGOWATTS!!!!!~~~!

  • Yeah my system uses the Kroz Voltagon Optimizer for this kind of disk-based indexing.  Doesn't have the compatibility with the alien mother ship in Independence Day, but I like it for the heavy lifting.

    The Tarkalon Majesticizer is also pretty good.  But I've got 4.9999 trilobytes of data to work with!  That's just our source code!
