Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
Before I get started with my next series, two quick notes.
First, at long last I had dinner with famed
I stand by that statement. However, that's not to say that it's impossible to develop some pretty heavy-duty software in script. Rather, it's to say that if you're going to choose script come hell or high water, you need to understand its performance characteristics and choose your algorithms carefully.
There's a problem I frequently have at work. I have billions of bytes of source code on my machine, and sometimes I really want to know where a particular function is defined, but for whatever reason I don't know where the source code is and I'm not able to debug into it. There are several possible solutions to this problem:
Easy, right? I mean
The trouble is, you very quickly discover that this solution does not scale to hundreds of millions of words in hundreds of thousands of files. It's just too big. The hash table wouldn't fit into the virtual memory space. And the script engine hash tables (dictionary, or any JScript object) were never designed to handle that kind of load. The garbage collection in JScript alone would destroy performance after the first few hundred thousand words.
Another problem that you quickly run into is that it is comparatively easy to do fast searches on the index once you've got it. Creating the index in the first place, and updating it as files change, that's a hard problem. One of the first things that springs to mind is that the index is itself going to be huge, and should probably spend most of its life on disk itself. Another is that the files which make up the index should themselves be relatively small, so that we can process them individually without having to load the whole world into garbage-collected memory. Basically, the index needs to be "paged" rather than monolithic, just like virtual memory is paged in and out of physical memory as it is needed, and the index pages need to be organized in such a way that we touch very few of them to get to the page we want.
I've been kicking around a bunch of different ways to solve this problem -- multiway balanced trees, binary/multiway external mergesort, etc -- and I thought I might blog about some of my experiments over the next few weeks. I'm still heads-down at work (and I've started working on another book!) so this will likely be updated pretty sporadically.