Google Desktop On The Cheap, Part Zero


Before I get started with my next series, two quick notes.

First, at long last I had dinner with famed JOS poster (and east coast technical evangelist for my product, VSTO) Philo the other night, as he happened to be in Redmond. I'm pleased to report that he's as witty, urbane and opinionated in real life as online, and I'm not just saying that because he expensed my salmon. Thanks! And blog more!

Second, liorean is thinking about putting together some sort of serialized-codeDOM-like markup language to describe JScript programs. We chatted about it a bit last week and I'm still not 100% sure what it's for, but it sure looks interesting. Anyone who has more free time than I do and wants to check it out, contact him here.

OK, onward.

I've said before that we designed the script languages to make it easy to quickly write simple scripts to automate web page functionality, choose what HTML to serve up in ASP, or write better batch files, but that they were explicitly not designed to be fully-fledged high-performance production languages for solving really complex or data-intensive problems.

I stand by that statement. However, that's not to say that it's impossible to develop some pretty heavy-duty software in script. Rather, it's to say that if you're going to choose script come hell or high water, you need to understand its performance characteristics and choose your algorithms carefully.

There's a problem I frequently have at work. I have billions of bytes of source code on my machine, and sometimes I really want to know where a particular function is defined, but for whatever reason I don't know where the source code is and I'm not able to debug into it. There are several possible solutions to this problem:

· grep/findstr/etc -- do a linear search of the billions of bytes for the particular string. Trouble is, this is extremely time consuming.

· install Microsoft's index server. That would require me to understand how it all works, and it seems to be tied to web servers and all kinds of goo I don't want to mess with just to grep the sources. And where's the fun in that?

· Google Desktop looks like it does exactly what I want. However, I didn't hear about it until after I had already written much of the code I'll be talking about over the next few entries, and besides, it doesn't install on my machine because it is incompatible with… my firewall. Huh? Why is a text indexer incompatible with my network security? Weird. I'm not touching that thing until they sort that out.

· You guessed it -- write my own indexing system in script, and then blog about it.

Easy, right? I mean, I already blogged about a very similar problem: how to efficiently find the words that occur most/least often in a body of text. We could solve the indexing problem the same way. Recall that my algorithm to solve that problem was basically to build a series of hash tables mapping words to the number of times they occurred. We could just modify that algorithm to map from words to a list of the files in which they occur. Then the search becomes incredibly fast -- you just ask the hash table for the list of files that contain the keyword you're looking for, and you're done.
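
To make that concrete, here's a minimal sketch of that naive approach in JScript under the Windows Script Host. The indexFile and search functions, and the \w+ notion of a "word", are inventions for this example, not the code I'll actually be building on:

// Naive in-memory index: maps each word onto the set of files containing it.
var fso = new ActiveXObject("Scripting.FileSystemObject");
var index = {};

function indexFile(path)
{
    var stream = fso.OpenTextFile(path, 1);  // 1 = ForReading
    var text = stream.AtEndOfStream ? "" : stream.ReadAll();
    stream.Close();
    var words = text.match(/\w+/g);
    if (words == null)
        return;
    for (var i = 0; i < words.length; ++i)
    {
        // The "w$" prefix keeps words like "toString" from colliding
        // with members that every JScript object already has.
        var key = "w$" + words[i].toLowerCase();
        if (index[key] == null)
            index[key] = {};
        index[key][path] = true;
    }
}

function search(word)
{
    var files = index["w$" + word.toLowerCase()];
    var result = [];
    if (files != null)
        for (var path in files)
            result.push(path);
    return result;
}

Call indexFile once per file up front, and every search after that is just a single hash table lookup.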

The trouble is, you very quickly discover that this solution does not scale to hundreds of millions of words in hundreds of thousands of files. It's just too big. The hash table wouldn't fit into the virtual memory space. And the script engine hash tables (dictionary, or any JScript object) were never designed to handle that kind of load. The garbage collection in JScript alone would destroy performance after the first few hundred thousand words.

Another thing you quickly run into is that doing fast searches on the index is the comparatively easy part once you've got it; creating the index in the first place, and updating it as files change, is the hard problem. One of the first things that springs to mind is that the index is itself going to be huge, and should probably spend most of its life on disk. Another is that the files which make up the index should be relatively small, so that we can process them individually without having to load the whole world into garbage-collected memory. Basically, the index needs to be "paged" rather than monolithic, just as virtual memory is paged in and out of physical memory as it is needed, and the index pages need to be organized in such a way that we touch very few of them to get to the page we want.
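
Just to give a flavour of what "paged" might mean in practice, here's one rough sketch -- an illustration only, not the scheme I'm actually going to use: hash every word into one of a fixed number of small page files on disk, so that looking a word up means opening exactly one small file. The folder name, the page count and the one-line-per-word page format are all made up for the example.

// Illustration only: each word hashes to one of 256 small "page" files,
// so a lookup reads one small file instead of one giant index.
var fso = new ActiveXObject("Scripting.FileSystemObject");
var INDEX_FOLDER = "C:\\index\\";   // made-up location
var PAGE_COUNT = 256;

function pageFile(word)
{
    // A cheap string hash; anything that spreads words evenly will do.
    var hash = 0;
    for (var i = 0; i < word.length; ++i)
        hash = (hash * 31 + word.charCodeAt(i)) % PAGE_COUNT;
    return INDEX_FOLDER + "page" + hash + ".txt";
}

function lookup(word)
{
    word = word.toLowerCase();
    var path = pageFile(word);
    if (!fso.FileExists(path))
        return [];
    var stream = fso.OpenTextFile(path, 1);  // 1 = ForReading
    var result = [];
    while (!stream.AtEndOfStream)
    {
        // Each page line is "word<tab>file1|file2|..."
        var parts = stream.ReadLine().split("\t");
        if (parts.length == 2 && parts[0] == word)
        {
            result = parts[1].split("|");
            break;
        }
    }
    stream.Close();
    return result;
}

A lookup now touches one small file no matter how big the index gets; the hard part, as I said, is building and updating those pages efficiently, which is where the trees and the external sorts come in.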

I've been kicking around a bunch of different ways to solve this problem -- multiway balanced trees, binary/multiway external mergesort, etc -- and I thought I might blog about some of my experiments over the next few weeks. I'm still heads-down at work (and I've started working on another book!) so this will likely be updated pretty sporadically.

  • interesting.

    we used to run swish-e over our fairly large codebase (wrapped by a script called "fish"), and i kept the habit up until recently for my own files (of course, i'm hoping that tiger will have the appropriate hooks in it such that i can get everything else for free). the command line tool approach was really nice - i.e.

    gnuclient `fish haddock roe`

    would bring up all source containing the words haddock and roe - less than a second over >2M lines of source (actually, if you were doing a trivial search it was usually for obscenities :-). generating the index itself was fairly trivial (cron for reindex nightly, and no need for a webserver).

    another killer was using virtual folders in VM (under xemacs) - keeping three years of mail in a single file did suck (~30s to load and display) but being able to put down arbitrary regexp filters over all mail in under a second was more than worth the startup pain (especially when your session stays up for months at a time).

    more recently, keeping notes in a local wiki has helped sort my head out. and version controlling my home directory, too.

    someone needs to integrate all this cool stuff; as you say, an app which needs to wander off over the network unnecessarily can go whistle...

    or maybe i should get out my duct tape and push myself back to the future - as was.
  • Note Re Firewall:

    Your firewall client software is out of date.

    Getting the latest version solved the problem for me.
  • If I had the faintest idea how to do that at work, I would.
  • Larry, I put Google Desktop on a test box at work (just for S&G, didn't have a real need for it) and apparently it runs a local web server because I see stuff like http://127.0.0.1:4344/blahblah when I use its web interface. That might be the cause of your firewall conflicts.

    Hopefully they only accept HTTP requests from localhost....
  • Sorry, you're Eric, not Larry. That's what I get for reading too many blogs late at night. ;)
  • Try turning off the "homing in" options in your preferences page. Your firewall could be preventing that.
  • I've been indexing my desktops with Microsoft Index Server for more than 5 years now. It's a bit of a ball-ache to configure the first time, with creating a search page and so forth, but once it's running It Just Works.

    Which isn't to say I won't be interested in your JScript indexing experience.
  • AFAIK, Google Desktop uses an LSP (probably to hook the Google page and put the desktop results there). It's incompatible with a lot of products which use LSPs. The incompatibility list includes the ISA Server Client.
  • If you get tired of coding you might try Appletware's Data Detective:
    http://appletsite.com/
    This thing was blazing fast when I last used it a couple of years ago. I think it was at least partly written in assembler.
  • "gnuclient `fish haddock roe`"

    Actually I had the salmon.

    I've got another problem with the Google Desktop - apparently nobody at Google has multiple drives, since it only installs on the system drive, and wants a GB of disk space.

    Philo
  • We both had the salmon -- you had the Chinook salmon and Alaskan halibut duet, I had the char-grilled salmon with edamame risotto.

    Mmm, Anthony's.
  • If you use Outlook, then you should install a great add-in called Lookout. It's available from Microsoft! :-)
  • From what I understand of the Google desktop, they do all that hacky stuff but for a good reason -- when you search on www.google.com, they will intercept the search results before IE displays them, and inject local results as well. This lets you see local results *without* ever sending anything to the Google servers.

    Personally, I think it's an ugly hack and they should have just had a "click here to check local results" button, but at least they made an effort not to have a massive privacy / security hole in their software. Kudos.