Welcome to MSDN Blogs Sign in | Join | Help

Developing for Developers

Tools, techniques, and theory for measuring and improving the power and performance of developers and their code
P-complete and the limits of parallelization

We're entering an era where CPU clock speeds will soon cease to scale upwards and instead CPU manufacturers are planning to put more and more independent cores on a chip.  Intel plans to release an 80-core chip within 5 years. Consequently the research community is setting their eye on the manifold barriers to writing correct and efficient highly-parallel programs. Will we be able to continue to get the same boosts in computational performance that we have been getting through raw clock speed?

You might think it's just a matter of building effective tools for producing programs that can take advantage of highly parallel platforms, and training programmers to take advantage of them. But there are complexity theory results to show that there are some problems that, under widely accepted assumptions, fundamentally cannot be parallelized, and this is the theory of P-completeness. For these problems, it may be literally impossible to get any significant benefit out of throwing additional cores at them - such problems are called inherently sequential. More to the point, however long it takes your computer to solve these problems right now, that's about how fast it's going to stay for a long time.

The complexity class NC (Nick's Class, named after Nick Pippenger by Steven Cook) is the set of problems that can be solved in polylogarithmic time (O(logk n) time for some integer k) by a machine with polynomially many processors. For example, given a list of n numbers, the typical way to add them up on a single processor would be with a loop, as in this C# snippet:

    public static int AddList(List<int> list) {
        int sum = 0;
        foreach (int i in list) {
            sum += i;
        }
        return sum;
    }

But now suppose we had n/2 processors. Then each processor can add just two of the numbers, producing a list of n/2 numbers with the same sum. We add these in pairs as well, continuing until just one number remains, the sum. This fan-in algorithm requires only O(log n) time, placing the problem in NC (or technically the decision problem version of it).

We can use similar ideas to parallelize many algorithms. For example, the best known sequential algorithms for matrix multiplication requires time O(n2.376) for an n × n matrix, but given O(n3) processors, we can solve it in O(log n) time: for each entry, we assign n processors to compute the products that are added to form that entry, then use the fan-in procedure above to add them together. This can still be done in O(log n) time. There are efficient parallel algorithms for other common problems too, like sorting, string matching, and algorithms in graph theory and computational geometry.

In real life of course we don't have such a polynomial number of processors, so this is in some ways an impractical definition; all it really tells us is that if we have many processors, we can solve the problem dramatically faster. Practical parallel algorithms pay more attention to speedup - how many times faster the algorithm is with k processors than with one. Effective speedup requires a good way of dynamically decomposing the problem. But speedup is difficult to characterize formally in a machine-independent way, so we'll stick with the definition of NC.

The famous P = NP problem asks whether all problems whose solutions can be verified in polynomial time, can also be solved in polynomial time. In this problem, the problems in P are considered "feasible to solve", and most computer scientists believe P ≠ NP - that is, that some problems which we can feasibly verify cannot be solved feasibly. We believe this in part because we have shown that many important problems are NP-hard. To say a problem is NP-hard means there is a polynomial time algorithm that can take any problem in NP and convert it into an instance of that problem; consequently, if an NP-hard problem can be solved in polynomial time, so can any problem in NP.

But when you have a large number of processors, the question turns around: we no longer consider P feasible, because any algorithm using just one CPU would face a dramatic slowdown compared to algorithms that take advantage of all of them. Now NC algorithms are considered feasible, and we ask: can all problems with efficient sequential solutions be highly parallelized? That is, is NC = P?

Just as with P = NP, nobody knows. However, just as there are NP-hard problems, there are P-hard problems. A P-hard problem is a problem where there is an NC algorithm (running in polylogarithmic time on a polynomial number of processors) to convert any problem in P into an instance of that problem. If we had an NC algorithm to solve a P-hard problem, it would imply that NC = P, providing a constructive and mechanical way of parallelizing any problem we can solve sequentially in polynomial time. Just as most researchers believe P ≠ NP, most researchers believe NC ≠ P, which implies that no P-hard problem has an NC algorithm to solve it.

One of the most important P-complete problems is linear programming: given a linear target function in terms of some variables, and a number of linear inequalities expressing constraints between those variables, find the optimal (maximum) value of the target function. It finds uses in operations research, microeconomics, and business management, and it would be great if we could exploit large number of cores to tackle larger instances of these problems. But no one has found an efficient parallel algorithm for this problem.

Another important example is the circuit-value problem: given a digital logic circuit and its inputs, compute the output of the circuit. There is a trivial linear time solution in the number of gates: continually get the next gate that you know both inputs of and compute its output, until you're done. It would seem that certain gates don't influence each other and be computed in parallel. But because the wires can express complex interdependencies between values and can create long chains of dependencies, there's no clear way to parallelize a general circuit. Besides obvious applications to circuit design and simulation, many problems can be solved by families of circuits, and if we can evaluate them quickly we could solve these problems. But not only has no one has found a general way of parallelizing such evaluation, they've shown that even if we restrict the circuits so that wires can't cross, or such that only the inputs can be negated, the problem remains P-complete, and so difficult to parallelize (these are the planar and monotone circuit-value problems, respectively).

The first problem ever shown to be NP-complete, as well as an important problem in practice, is the boolean satisfiability problem: given a formula using AND, OR, and NOT, with a number of variables set to either true or false, determine if there is some way of setting the variables to make the whole formula evaluate to true. The P-complete version of this problem is called Horn satifiability.

Any boolean formula can be reduced efficiently to a form where it represents a three-layer circuit: a literal is either a variable or a negated variable; a clause is the OR of some literals; and an entire formula is the AND of some clauses. A "Horn clause" is a clause where all of the variables are negated except possibly one. This provides an efficient solution to the problem, since Horn clauses can be rewritten as implications (not(x1) or not(x2) or x3 is the same thing as (x1 and x2) implies x3). Formulas made up of Horn clauses appear in the context of logical deduction systems, such as the resolution systems used to model problems in AI. Again, no one has found an effective way to parallelize finding solutions for these formulas.

In the world of P = NP, people have long been studying ways to "get around" the infeasibility of finding a polynomial-time solution for NP-hard problems, such as approximation algorithms, special cases, fixed-parameter tractable solutions, and so on. We could ask the same questions about the NC = P question: could we find an approximate solution much more quickly by using all our cores, instead of running a classical algorithm to find an exact solution on just one? Are there special cases of these problems which are parallelizable? But comparatively little investigation has been done into this area, although some examples exist (like Karger and Motwani's "An NC Algorithm For Minimum Cuts").

Finally, I should mention that P-complete is relevant not only to the study of what problems can be efficiently parallelized, but also to what problems can be solved in a small amount of space. I'll leave this for another time though.

Posted Friday, September 07, 2007 8:11 PM by dcoetzee | 3 Comments

Filed under:

Robin's theorem

Most computer scientists are familiar with the P = NP problem, which asks essentially whether we can verify more problems in polynomial time than we can solve. So fundamentally does complexity theory hinge on this result that the Clay Mathematics Institute has labelled it one of their seven Millennium Prize Problems, and will award $1,000,000 to the researchers that solve it.

But there are six other Millennium Prize problems that are often less accessible to computer scientists, requiring background in areas such as topology and complex analysis. One of the most important of these is the Riemann Hypothesis, which is normally stated in terms of complex zeros of the Riemann Zeta function. Here, we will describe a recent characterization of the Riemann Hypothesis more suited to a computer scientist's background based on growth of integer functions.

Consider the sum-of-divisors function σ(n) that sums up the divisors of the positive integer n. For example, σ(15) = 1 + 3 + 5 + 15 = 24. If you've studied asymptotic notation, a natural question to ask is: how fast does σ(n) grow?

It's clear that since 1 and n always divides n, σ(n) ≥ n + 1, so σ(n) is Ω(n). But σ(n) cannot be O(n), because σ(n)/n takes on arbitrarily large values. To see this, consider n equal to k primorial, the product of the first k primes. Because σ(n)/n is multiplicative, σ(n)/n will equal the product of σ(p)/p = (1+p)/p = (1 + 1/p) for each prime factor. But if you expand out this product, you get a sum including 1/p for each prime factor, and the sums of the reciprocals of the primes, also known as the harmonic series of primes, diverges. But the harmonic series of primes grows quite slowly, and is O(ln ln n).

From this we might conjecture that σ(n) grows at a rate of n ln ln n. Gronwall's theorem, proven in 1913, shows that in fact:

lim supn→∞ σ(n) / (n ln ln n) = eγ.

Here, e and γ are both constants (the latter is the Euler-Mascheroni constant), so eγ is a constant, equal to about 1.781. This is somewhat stronger than the statement that σ(n) is O(n ln ln n) - the lim sup actually says that for any ε > 0, some tail of the sequence σ(n) / (n ln ln n) must be bounded above by eγ + ε.

But instead of a lim sup, what we'd really like is to have a hard bound; we'd like to say:

σ(n) / (n ln ln n) < eγ for all n.

This is false, and a quick search identifies 27 small counterexamples: 2, 3, 4, 5, 6, 8, 9, 10, 12, 16, 18, 20, 24, 30, 36, 48, 60, 72, 84, 120, 180, 240, 360, 720, 840, 2520, and 5040. However, no search to date has located any larger counterexamples. From this we might conjecture:

σ(n) / (n ln ln n) < eγ for all n > 5040.

In 1984, Guy Robin showed that in fact, this statement is true if and only if the Riemann hypothesis is true. This is Robin's theorem. Thus, the Riemann hypothesis could, in theory, be proven false by exhibiting a single positive integer n > 5040 such that σ(n) > eγ (n ln ln n). Because most mathematicians believe the Riemann hypothesis is true (just as most computer scientists believe P ≠ NP) it's more likely that the above statement is true, and a proof of this fact would win you $1,000,000.

Posted Monday, July 16, 2007 2:12 PM by dcoetzee | 2 Comments

Filed under:

Cache-oblivious data structures

In most data structure and algorithms classes, the model used for basic analysis is the traditional RAM model: we assume that we have a large, random-access array of memory, and count the number of simple reads/writes needed to perform the algorithm. For example, selection sort takes about n(n-1)/2 reads and 2n writes to sort an array of n numbers. This model was relatively accurate up through the 1980s, when CPUs were slow enough and RAM small enough that reading and writing the RAM directly didn't create a significant bottleneck. Nowadays, however, typical desktop CPUs possess deep cache hierarchies - at least a sizable L1 and L2 cache - to prevent runtime from being dominated by RAM accesses. Data structures and algorithms that efficiently exploit the cache can perform dramatically more quickly than those that don't, and our analysis needs to take this into account.

For example, suppose you have a list of bytes stored in both an array and a linked list. Provided the linked list nodes are allocated at random positions, iterating over the array will be dramatically faster, exhibiting optimal locality of reference, while iterating over the linked list will trigger a cache miss for every element. We can formalize this: assume a cache line - the size of the block read into the cache at one time - has size B bytes. Then traversing the linked list requires n cache misses, while traversing the array requires precisely ceiling(n/B) cache misses, a speedup of B times.

It is possible to design data structures that specifically take advantage of the cache line size B. For example, in a previous post I described an unrolled linked list data structure where each node fills up a cache line, achieving similar cache performance to arrays while still allowing efficient insertions and deletions. Such a data structure is termed "cache-aware", because it explicitly takes knowledge about the cache model into account in its construction.

Unfortunately, this approach has a compelling disadvantage: one has to tune the data structure based on the cache line size B. This is a problem for many reasons: first of all, programs, even those compiled to native machine code, are often run on many different processors with different cache line sizes. A Pentium 4 can run the same machine code as a Pentium II, but has a very different cache model. But even when running on a fixed machine, these simple cache-tuned data structures can't take advantage of multilevel caches, because each one has a different size and a different block size. One can attempt to build a data structure for a specific cache hierarchy, but this is even less robust against a change of machine.

In 1999, in his masters thesis, Harold Prokop came up with an interesting solution to this problem: the idea of a cache-oblivious algorithm. This does not mean that the algorithm does not take advantage of the cache; to the contrary, it does so quite effectively. What it means is that the algorithm does not need to know the cache line size; it works effectively for all cache line sizes B simultaneously. This allows the algorithm to robustly exploit caches across many machines without machine-specific tuning. More importantly, it allows the algorithm to exploit multilevel caches effectively: because it works for all B, it applies between each two adjacent levels of cache, and so every level is well-utilized.

In fact, the multilevel advantage extends beyond simply taking advantage of both the L1 and L2 caches - such algorithms perform well even when the storage hierarchy is expanded to encompass much slower media such as slow memory, Flash, compressed memory, disks, or even network resources, without knowledge of block sizes or access patterns. For example, cache-oblivious B-trees can be used to implement huge database indexes that simultaneously minimize disk reads and cache misses.

Note that just because the algorithm doesn't depend on B doesn't mean the analysis of the algorithm doesn't depend on B. Let's take an example: iterating over a simple array. As noted earlier, this requires about n/B cache misses, which is optimal. But neither the array's structure nor the iteration algorithm explicitly take B into account. Consequently, it works for all B. In a multilevel cache, all data prefetched into every cache is used. This is the theoretical benefit of cache-oblivious algorithms: we can reason about the algorithm using a very simple two-level cache model, and it automatically generalizes to a complex, deep cache hierarchy with no additional work.

While cache-oblivious algorithms are clearly useful, at first it's not clear that there even exist any other than simple array iteration. Thankfully, extensive recent research has revealed cache-oblivious data structures and algorithms for a multitude of practical problems: searching (binary trees), sorting, associative arrays, FFT computation, matrix multiplication and transposition, priority queues, shortest path, query processing, orthogonal range searching, and more emerging every year. Practical comparison studies have shown a significant performance gain on real processors compared to traditional algorithms, although carefully-tuned cache-aware algorithms still enjoy an advantage on the specific machines they are tuned for (see e.g. [1][2]).

Many of the cache-oblivious data structures and algorithms that have been published are relatively complex, but here I'll describe a simple one just to give you a feel for it. This discussion is based on parts of this paper.

Consider the following simple search problem: we have a static list of records, and we wish to find the record with a particular key. Traditionally, this problem is solved with either an array and binary search, or a binary search tree. Both of these approaches exhibit dismal cache behavior. Database applications rely on B-trees, which group several records into each block. This greatly improves performance, but requires knowledge of the block size, and does not work across all levels of the cache hierarchy.

The key is the van Emde Boas layout, named after the van Emde Boas tree data structure conceived in 1977 by Peter van Emde Boas. Suppose for simplicity that the number of elements in a power of 2. We create a complete binary tree containing our records (where by "complete" we mean that all leaves are at the same level). This tree will have height h = log2n.

Take a look at the first h/2 levels of the tree. This subtree has 2h/2 leaves, each itself the root of a tree of height h/2. In the van Emde Boas layout, we first lay out the subtree of height h/2 rooted at the root, followed by the subtrees rooted at each leaf of this tree from left to right. Each subtree is recursively laid out using the van Emde Boas layout. An example diagram is shown in Figure 1 of the paper.

The analysis proceeds like this: at each step of the recursion, the size of the subtrees being laid out is the square root of the size of the subtrees laid out at the previous step. Consequently, as we recurse, at some point we will be laying out subtrees between sqrt(B) and B in size (let's call it C). Each of these subtrees can be retrieved in a single memory transfer, and this layout partitions the tree into subtrees of this size.

The search procedure works like this: we access the root, which pulls in the first log2C levels of the tree. We have only cache hits until we access a leaf of the tree, which pulls in the next log2C levels of the tree rooted at that leaf. This continues until we reach the bottom of the tree. Because only 1 out of every log2C accesses causes a cache miss, the total number of misses is log2n/log2C = logCn. Because C is at least sqrt(B), log2C is at least (log2B)/2, and the total number of misses is at most 2logBn.

We made one small assumption above: that each subtree is aligned with the cache block boundaries; if this is not true, each subtree can require two accesses, bringing the final maximum to 4logBn. Probabilistic analysis shows that with a random starting point in memory and sufficiently large block size, the real constant is closer to 2.

This is the same asymptotic performance as B-trees, and is faster than ordinary binary trees by a factor of about log2n/2logBn = (log2B)/2. For example, for a disk block size of 2048 elements, this would be a speedup of about 6.5 times. The advantage is more than theoretical: to quote the paper, "preliminary experiments have shown, surprisingly, that CO [cache-oblivious] B-trees can outperform traditional B-trees, sometimes by factors of more than 2".

But as practical as the research is in cache-oblivious algorithms, many applications and libraries have yet to take advantage of them. The data structures supplied with .NET, Java, Lisp, and so on are not cache-oblivious. We need to start putting this research into practice and reaping the benefits. I hope this gives you some insight into cache-oblivious algorithms, and I hope you will consider taking advantage of it the next time you develop a product whose performance critically depends on memory access patterns.

Posted Tuesday, June 12, 2007 1:54 PM by dcoetzee | 9 Comments

Filed under:

K-nearest neighbor spatial search

I apologize to everyone for the hiatus - I realise a post is more than a little overdue and will try to be more regular in the future. I appreciate the support that I've received for the blog and continuing it.

Consider the following problem: you're starting a website for a chain of restaurants, and you want to have a "Locations" feature where a person can enter their address and get a list of the 10 outlets closest to their location, in order of distance from their location (as the crow flies, for simplicity). Moreover, each time they hit the "next" button, they should get the next 10 locations in order of distance.

More abstractly, say you have a list of locations on a 2D map, each specified as a point with an x and y coordinate. You can do any sort of preprocessing you like of this list of points. Then I ask you to efficiently answer: given some point, which is not necessarily one of the points in the list, produce a list of the k points closest to that point, in order of their distance from that point.

This problem, called the k-nearest neighbor problem, is very well-studied and has a variety of solutions. Here we describe one very simple but effective incremental algorithm first proposed by Hjaltason and Samet in their "Ranking in Spatial Databases" in 1995 (Citeseer). A spatial database is a database specialized to perform this type of query over its (potentially very large) data set.

Consider the following very simple but inefficient algorithm: we take each point, one at a time, and add it to a priority queue with its key equal to the distance from the search point. We then extract minimum repeatedly to obtain the points in order of their distance from the search point. Once we've extracted the first k, we're done, and with a good data structure this requires only O(k log n) time. Unfortunately, even with the best priority queue data structures based on Fibonacci heaps, adding all n points requires amortized linear (O(n)) time and space, which for a very large number of points as one would expect to find in a spatial database is prohibitive.

One approach to narrowing down the points we have to examine is to split the space into larger regions. The idea is that points close to the search point should be found in regions close to the search point. In more detail, we first divide the space into two rectangular regions, each containing some of the points. Each of these regions is, in turn, subdivided into two rectangular regions, and we continue in this manner until all the resulting regions have some small constant number of points in them. The result is a tree of regions, where each node is a subregion of its parent node, and each point lies in one of the leaves. It really doesn't matter how we choose to divide up the space - the authors used PMR quadtrees, but you could use kd-trees, simple quadtrees, or whatever you like. All that matters is that in the end each leaf region contains only a small number of points.

Creating the regions requires examining all the points, but the key is that this part does not depend on the search point, and so only has to be done once in preprocessing. The regions may require rebalancing as points are added or removed, but this can be done efficiently as well.

However, there's a trick to this: if the search point lies near the boundary of a region, its closest points may actually lie in a different, adjacent region. How can we correctly locate the nearest point without spuriously gathering all the points in all the regions around the point?

Here's the key insight: our priority queue will contain two types of objects, single points and regions. The key of each point is set to its distance from the search point, while the key of each region is set to the minimum distance of that region from the search point. Each time we extract a point from the queue, we output it; each time we extract a region from the queue, we expand it by adding its subregions to the queue, or if it's a leaf region, the points in that region. Because the region's key is set to the minimum distance of that region from the search point, any point that is outputted before the region is expanded is closer to the search point than any point in the region. This ensures correctness.

k-nearest neighbor spatial search example

For example, consider the search problem shown to the right. There are 9 points, and we wish to find the 5 closest points to the big red search point in order. The space is recursively subdivided into regions as shown in the tree.

For the purpose of discussion we will refer to each region by the set of points it contains. Initially, the queue contains only the large region containing all points.

Current queue in order: {1,2,3,4,5,6,7,8,9}

This contains the red search point, so its minimum distance from the search point (its key) is zero. We remove it from the queue and add its two children {1,3,4,8} and {2,5,6,7,9}. The region {2,5,6,7,9} again contains the red search point and so has key zero. The minimum distance to the region {1,3,4,8} is shown by the red line extending out of the red point and to the left and is larger than zero.

Current queue in order: {2,5,6,7,9}, {1,3,4,8}

We extract and expand {2,5,6,7,9}, getting leaf region {2} and region {5,6,7,9}. The distance to region {2} is shown by the red line extending upward from the red point, which is clearly longer than the distance to region {1,3,4,8}. The region {5,6,7,9} has distance zero.

Current queue in order: {5,6,7,9}, {1,3,4,8}, {2}

We extract and expand {5,6,7,9} to get the regions {5,9} and {6,7}. The distance to region {6,7} is shown by the red line extending right from the red point, the longest distance so far encountered. The region {5,9} contains the red point and has distance zero.

Current queue in order: {5,9}, {1,3,4,8}, {2}, {6,7}

Next, we extract and expand region {5,9}. This is a leaf region, so we add the points 5, 9 to the queue. Point 5 is only slightly closer to the red point than region {2}, but farther than region {1,3,4,8}. Point 9 is farther than everything so far encountered.

Current queue in order: {1,3,4,8}, 5, {2}, {6,7}, 9

We extract and expand {1,3,4,8} to get {1,3,4}, at the same distance as {1,3,4,8}, and {8}, the distance of which is shown by the diagonal red line.

Current queue in order: {1,3,4}, {8}, 5, {2}, {6,7}, 9

We extract and expand {1,3,4}, which is a leaf region, and add the three points. Point 4 is closest, point 3 lies between {6,7} and 9 in distance, and point 1 is slightly closer than point 9.

Current queue in order: 4, {8}, 5, {2}, {6,7}, 3, 1, 9

We extract and output point 4, the closest point. We expand region {8} and add point 8, which is slightly farther than point 3.

Current queue in order: 5, {2}, {6,7}, 3, 8, 1, 9

We extract and output point 5. We expand region {2} and add point 2, which is slightly closer than point 8.

Current queue in order: {6,7}, 3, 2, 8, 1, 9

We extract and expand region {6,7}, adding points 6 and 7. Point 6 is farthest yet and point 7 even farther.

Current queue in order: 3, 2, 8, 1, 9, 6, 7

We've now expanded all regions and the remaining points in the queue are the remaining points not yet outputted in order.

The nice thing about this particular algorithm is that it's fully incremental: we could have stopped at any time once we had enough points, and pick up later when we want more. This makes it quite appealing in database queries, where you don't really know how many points you need in advance, because you might want to filter the results according to additional attributes. In practice, in a typical spatial database, one only needs to expand a small fraction of the regions in order to find a small number of closest points. The idea can be extended into higher-dimensional regions, spatial objects other than points such as areas and line segments, and even more general problems where one can easily compute the minimum key of a subtree of objects.

I hope you found this informative and I appreciate your comments.

Posted Thursday, June 07, 2007 5:23 PM by dcoetzee | 7 Comments

Functional list processing in C# 2.0 with anonymous delegates

One of the benefits of functional languages is their great flexibility in list manipulation, which enables them to express certain computations concisely that would require one or more verbose loops in procedural languages. Many of the features that functional programmers take for granted such as first class functions, persistent data structures, and garbage collection were included in Lisp partly for the purpose of convenient list processing (Lisp stands for "LISt Processing").

Although C# is not a functional language, in C# 2.0 and the accompanying .NET Framework 2.0, a number of constructs and library calls have been added that allow generic List objects to be manipulated in a more functional way. The most important of these is anonymous delegates, which are the same construct that functional programmers call closures. We start out with the following method, which removes all elements from a list that match a given condition, passed in as a delegate:

    delegate bool Predicate<T>(T arg);

    static void RemoveAll<T>(List<T> list, Predicate<T> condition)
    {
        int i=0;
        while (i < list.Count)
        {
            if (condition(list[i]))
                list.RemoveAt(i);
            else
                i++;
        }
    }

The use of the word "predicate" here comes from the mathematical predicate, which is just a function that yields a boolean true/false value.

Now say we have a list of integers, and we want to remove all the multiples of x, where x is a given number. In C# 1.0, using RemoveAll for this is so no easy task, requiring a temporary object:

    class RemoveMultiplesOfHelper
    {
        private int x;
        public RemoveMultiplesOfHelper(int x)
        {
            this.x = x;
        }
        public bool IsMultiple(int n)
        {
            return (n % x) == 0;
        }
    }

    static void RemoveMultiplesOf(List<int> list, int x)
    {
        RemoveAll<int>(list, new RemoveMultiplesOfHelper(x).IsMultiple);
    }

The advantages of reusing RemoveAll are important: it's tricky to iterate through a list while removing elements from it, and programmers often introduce subtle bugs while attempting to do so. But this code has serious problems too: it's a hassle to write, it's verbose, the "temporary" class isn't very reusable and is difficult to name, and the code is spread all over the place, making the flow of execution difficult to follow. Because of these issues, many procedural programmers opt to reinvent the wheel instead.

Anonymous delegates solve all these problems. An anonymous delegate allows you to write code right in the middle of a method which has access to all the local variables in that method, and then package it up and pass it to another method to execute later. Here's our example using anonymous delegates:

    static void RemoveMultiplesOf(List<int> list, int x)
    {
        RemoveAll<int>(list, delegate(int y) { return (y % x) == 0; });
    }

In this much briefer implementation, the delegate "captures" the local value of x at the instant of its creation, then later RemoveAll uses it in computations.

Another handy trick you can do with anonymous delegates is take a method signature that doesn't quite fit an existing delegate and translate it into one that does. For example, suppose you're sorting a List using the overload of the Sort method that takes a System.Comparison delegate. Now say you're sorting a list of strings, and you have a string comparison method that takes a boolean determining whether the comparison is case-sensitive:

    delegate int Comparison<T>(T left, T right);
    void Sort<T>(Comparison<T> comparison);
    int StringCompare(string left, string right, bool caseSensitive);

At first you appear to be in quite a pickle, as the signature of the StringCompare method does not exactly match the delegate signature Comparison<string>. Anonymous delegates make it easy to overcome this problem:

    static void SortStrings(List<string> a, bool caseSensitive)
    {
        a.Sort(delegate(string left, string right) {
            return StringCompare(left, right, caseSensitive);
        });
    }

We simultaneously match the desired delegate signature and push the decision of whether to sort case-sensitively or not to the caller, where it belongs, all with very little new code.

The advantages of anonymous delegates make functions like our Sort and RemoveAll considerably more useful. Consequently, a number of such functions were included with the generic List class in .NET 2.0:

  • RemoveAll(Predicate<T>): Like our RemoveAll implementation above, removes all elements from a list matching a predicate.
  • Sort(Comparison<T>): Mentioned above, this is a Sort method that takes a delegate. Sorting with anonymous delegates is very handy for unusual sorting orders; for example, the following sorts a list of integers in reverse order:
    list.Sort(delegate(int x, int y) { return y.CompareTo(x); });
    
  • Find(Predicate<T>): Finds the first element in a list satisfying a predicate. Handy for replacing linear searches. For example, this single line of code identifies the first file in the current directory modified in the last hour, or yields null if none:
    new List<string>(Directory.GetFiles(Directory.GetCurrentDirectory())).Find(
        delegate(string path) { return File.GetLastWriteTime(path) >=
                                       DateTime.Now - new TimeSpan(1, 0, 0); });
    
  • Exists(Predicate<T>): Determines whether a list contains an element satisfying a predicate. Besides its added readability in situations where you only care about the existence of a value, it also has another use: since "Find" returns the default value for the type T to indicate "not found", you need to use Exists for predicates that might match null or zero, such as:
    bool containsEven = list.Exists(delegate(int x) { return (x % 2) == 0; });
    
  • TrueForAll(Predicate<T>): Similar to Exists, but determines whether the predicate is satisfied by all elements of the list. For example, this line of code might run some tests and see if they all pass:
    if (tests.TrueForAll(delegate(Test t) { t.Run(); return t.SuccessCode == 0; })
    
  • FindAll(Predicate<T>): Creates a new list containing all elements satisfying a predicate. In functional languages this is sometimes called select or filter, as we're "filtering out" the parts of the list not satisfying the predicate. Here's an example which, in a single line, gets a list of all processes using more than k threads:
    List<Process> x = new List<Process>(Process.GetProcesses()).FindAll(
        delegate(Process p) { return p.Threads.Count > k; });
    
  • FindIndex(Predicate<T>): Just like Find, except that instead of returning the first value that satisfies the predicate, it returns its index in the list. For example, this expression yields the index of the first alphabetical character in a string:
    new List(s.ToCharArray()).FindIndex(Char.IsLetter)
  • FindLastIndex(Predicate<T>): Same as FindIndex, except that it finds the last occurrence satisfying the predicate.
  • ForEach(Predicate<T>): Simply executes the given delegate on each member of the list. Although just using a foreach loop is cleaner than using ForEach with an anonymous delegate, ForEach is still handy if you already have a named delegate, as in this example that deletes all the files in the current directory:
    new List<string>(Directory.GetFiles(Directory.GetCurrentDirectory())).ForEach(File.Delete);
    
  • ConvertAll(Predicate<T>): Perhaps the most powerful of the delegate-based List methods, ConvertAll feeds each element of the list through a conversion function, producing a new list. In functional languages this is often called map because, like a mathematical function, it "maps" one set to another. Here's a simple example which produces a list of strings that are the lowercased versions of the strings in the original input list:
    list.ConvertAll(delegate(string s) { return s.ToLower(); });
    

What's more, besides being individually useful, complex transformations can be achieved by chaining these methods together in clever ways. For example, here's two lines of code that produce a list of all files larger than a given size in a given directory:

    static List<string> GetBigFiles(string directory, int bigLength)
    {
        List<string> paths = new List<string>(Directory.GetFiles(directory));
        return paths.ConvertAll<FileStream>( File.OpenRead )
                    .FindAll( delegate(FileStream f) { return f.Length >= bigLength; } )
                    .ConvertAll<string>( delegate(FileStream f) { return f.Name; } );
    }

The first ConvertAll opens all the files by using the static library method "File.OpenRead" as the converter. The FindAll filters out just the FileStream objects corresponding to large files. The final ConvertAll extracts the filenames from each stream. Each of these "list filters" is independently reusable. At first this way of programming may seem unusual or cumbersome, but just compare the size of the above method to a procedural implementation.

Don't forget also that you can write your own methods taking delegates to reap more benefits from anonymous delegates. A frequently useful application is for error handling and debugging code. Suppose you find yourself writing several methods that all look like this:

    void foo()
    {
        Debug.Write("entering foo()");
        try
        {
            // Do some stuff
        }
        catch (Exception e)
        {
            // Do some stuff
            throw;
        }
        Debug.Write("exiting foo()");
    }

It's not clear how to factor out the two "Do some stuff" parts, since they're nested right in the middle of the method. Once again, anonymous delegates are the answer. We can write a method like this taking delegates for the parts that get filled in:

    delegate void DoAction();
    delegate bool ProcessException(Exception e);
    void DebugInvoke(string name, DoAction action,
                     ProcessException processException)
    {
        Debug.Write("entering " + name);
        try
        {
            a();
        }
        catch (Exception e)
        {
            if (processException(e))
            {
                throw;
            }
        }
        Debug.Write("exiting " + name);
    }

Now we can mix-and-match try bodies and catch bodies at will, and get the tracing for free:

    void foo(int x, int y)
    {
        int z = 0;
        DebugInvoke("addFoo", delegate() { z = x + y; }, OverflowHandler);
        DebugInvoke("addBar", delegate() { z = z + x; },
                    delegate(Exception e) { Debug.Assert(false); return true; });
        DebugInvoke("addBaz", delegate() { y = z + z; }, OverflowHandler);
    }

    static bool OverflowHandler(Exception e)
    {
        if (e is OverflowException)
        {
            Debug.Write("Encountered overflow");
            return true;
        }
        return false;
    }

Unfortunately, anonymous delegate syntax, while much more concise than the alternative, is still a bit verbose, requiring not only the word "delegate" but full argument lists. One of the goals of the next version of C# is to encourage the use of these functional idioms, partly by introducing a new closure syntax that will take advantage of type inference to overcome this syntactic overhead.

Just remember to keep this rule of thumb in mind: if you're iterating over a list, and the body of the foreach is pretty small, chances are you don't need to. See if you can do the same thing more concisely and with less risk using a combination of the above methods and anonymous delegates (and amaze your coworkers). I hope you found this article helpful.

Posted Friday, June 30, 2006 3:16 PM by dcoetzee | 10 Comments

Factoring large numbers with quadratic sieve

Today I'm going to talk about how the quadratic sieve factoring algorithm works, giving a comprehensive description assuming knowledge of only basic university-level mathematics.

The foundation of the most popular public-key cryptography algorithm in use today, RSA, rests on the difficulty of factoring large integers. When keys are generated, efficient algorithms are used to generate two very large prime numbers and multiply them together. The person who generated the keys knows these two numbers, but everyone else only knows the product. The product contains enough information to encrypt a message to the person; the two primes allow the recipient to decrypt it. There is no known way to decrypt it without using the primes, but by factoring, we can extract the two prime factors from the product and break the encryption.

At the time that RSA was invented in 1977, factoring integers with as few as 80 decimal digits was intractable; all known algorithms were either too slow or required the number to have a special form. This made even small, 256-bit keys relatively secure. The first major breakthrough was quadratic sieve, a relatively simple factoring algorithm invented by Carl Pomerance in 1981, which can factor numbers up to 100 digits and more. It's still the best known method for numbers under 110 digits or so; for larger numbers, the general number field sieve (GNFS) is now used. However, the general number field sieve is extremely complicated, and requires extensive explanation and background for even the most basic implementation. However, GNFS is based on the same fundamental ideas as quadratic sieve, so if factoring the largest numbers in the world is your goal, this is the place to start.

We'll begin by addressing a few problems that at first glance have nothing to do with factoring, then assemble them into a working algorithm. I won't be motivating them first, but trust me - they're important.

Finding a subset of integers whose product is a square

Suppose I give you a set of integers and I ask you to find a subset of those integers whose product is a square, if one exists. For example, given the set {10, 24, 35, 52, 54, 78}, the product 24×52×78 is 97344 = 3122. The brute-force solution, trying every subset, is too expensive because there are an exponential number of subsets.

We'll take a different approach based on prime factorizations and linear algebra. First, we factor each of the input numbers into prime factors; for now we will assume that these numbers are easy to factor. For the above example set, we get:

10 = 2 × 5
24 = 23 × 3
35 = 5 × 7
52 = 22 × 13
54 = 2 × 33
78 = 2 × 3 × 13

When you multiply two numbers written as prime factorizations, you simply add the exponents of the primes used. For example, the exponent of 2 in 24×52×78 is 6, because it's 3 in 24, 2 in 52, and 1 in 78. A number is a square if and only if all the exponents in its prime factorization are even. Suppose we write the above factorizations as vectors, where the kth entry corresponds to the exponent of the kth prime number. We get:

[1 0 1 0 0 0]
[3 1 0 0 0 0]
[0 0 1 1 0 0]
[2 0 0 0 0 1]
[1 3 0 0 0 0]
[1 1 0 0 0 1]

Now, multiplying numbers is as simple as adding vectors. If we add rows 2, 4, and 6, we get [6 2 0 0 0 2], which has all even exponents and so must be a square. In more familiar terms, we want the last bit of each entry in the sum to be zero. But in this case, we don't need to store all of the numbers above, only the last bit of each number. This gives us the following:

[1 0 1 0 0 0]
[1 1 0 0 0 0]
[0 0 1 1 0 0]
[0 0 0 0 0 1]
[1 1 0 0 0 0]
[1 1 0 0 0 1]

Moreover, since we're only interested in last bits, we can perform all our addition using one-bit integers with wraparound semantics (in other words, mod 2). If we add rows 2, 4, and 6 in this way, we get [0 0 0 0 0 0 0], the zero vector. In fact, all squares correspond to the zero vector.

Let's rephrase this as a matrix equation problem. If we transpose the above matrix, so that rows become columns, we get this:

[1 1 0 0 1 1]
[0 1 0 0 1 1]
[1 0 1 0 0 0]
[0 0 1 0 0 0]
[0 0 0 0 0 0]
[0 0 0 1 0 1]

Call this matrix A. If we multiply A by the vector [0 1 0 1 0 1], using one-bit integer arithmetic, we get the zero vector. This tells us precisely which numbers we need to multiply to get a square. So, our goal is to find a nonzero vector x such that Ax=0 (remember, all arithmetic here is with one-bit integers).

If you've had a course in linear algebra, this problem should look very familiar; it's the problem of finding the null space of a matrix, the set of vectors such that Ax=0. The problem can be solved using row reduction (Gaussian elimination). We row reduce the matrix, and then assign values to the free variables in a way that gives us a nonzero solution. The other variables will be determined by these values and the matrix. You probably studied this problem using rational numbers, not one-bit integers, but it turns out row reduction works just as well for these. For example, if we add row 1 to row 3 in the above matrix, we get the following:

[1 1 0 0 1 1]
[0 1 0 0 1 1]
[0 1 1 0 1 1]
[0 0 1 0 0 0]
[0 0 0 0 0 0]
[0 0 0 1 0 1]

Completing the row reduction, we eventually end up with this matrix:

[1 0 0 0 0 0]
[0 1 0 0 1 1]
[0 0 1 0 0 0]
[0 0 0 1 0 1]
[0 0 0 0 0 0]
[0 0 0 0 0 0]

If we turn this back into a system of equations and rearrange, we get this:

x1 = 0
x2 = −x5x6
x3 = 0
x4 = −x6

Suppose we choose x5=0, x6=1. From the above equations, it follows that the first four vectors have the values 0, 1, 0, and 1 (remember, one-bit integer arithmetic). This gives us our final vector, [0 1 0 1 0 1]. If we were to choose x5=1 and x6=0 instead, we'd get a different solution: [0 1 0 0 1 0], corresponding to 24×54 = 1296 = 362.

Moreover, a theorem of linear algebra tells us precisely how many input numbers we need to guarantee that a square can be found: as long as we have more columns than rows, the null space is guaranteed to be nontrivial, so that we have a nonzero solution. In other words, we just need more numbers than prime factors used by those numbers. As this case shows, though, this isn't a necessary condition.

The one remaining problem with this method is that if one of the numbers in our set happens to have very large factors, our matrix will have a large number of rows, which requires a lot of storage and makes row reduction inefficient. To avoid this, we require that the input numbers are B-smooth, meaning that they only have small factors less than some integer B. This also makes them easy to factor.

Fermat's method: factoring using a difference of squares

You might be wondering what squares have to do with factoring. The connection is the very simple factorization method known as Fermat's method. Although not efficient in general, it embodies the same basic idea as quadratic sieve and works great for numbers with factors close to their square root.

The idea is to find two numbers a and b such that a2b2 = n, the number we wish to factor. If we can do this, simple algebra tells us that (a+b)(ab) = n. If we're lucky, this is a nontrivial factorization of n; if we're not so lucky, one of them is 1 and the other is n.

The concept behind Fermat's algorithm is to search for an integer a such that a2n is a square. If we find such an a, it follows that:

a2−(a2-n) = n

Hence we have a difference of squares equal to n. The search is a straightforward linear search: we begin with the ceiling of the square root of n, the smallest possible number such that a2n is positive, and increment a until a2n becomes a square. If this ever happens, we try to factor n as (a−sqrt(a2n))(a+sqrt(a2n)); if the factorization is trivial, we continue incrementing a.

Here's an example of Fermat's method from Wikipedia. Let n=5959; a starts out at 78. The numbers 782−5959 and 792−5959 are not squares, but 802−5959=441=212. Hence (80-21)(80+21)=5959, and this gives the nontrivial factorization 59×101=5959.

The reason Fermat's method is slow is because simply performing a linear search of all possible a hoping that we'll hit one with a2n square is a poor strategy − there just aren't that many squares around to hit. A better way of going about it is to proactively compute an a having this property (actually a similar property).

The key is to notice that if we take a number of a2n values, none of which are squares themselves, and multiply them, we may get a square, say S. Let A be the product of the corresponding values of a. Basic algebra shows that A2S is a multiple of n. Hence, (A−sqrt(S))(A+sqrt(S)) is a factorization of some multiple of n; in other words, at least one of these shares a factor with n. By computing the greatest common divisor (GCD) of each with n using Euclid's algorithm, we can identify this factor. Again, it may be trivial (just n itself); if so we try again with a different square.

All that remains is, given a list of a2n values, to find a subset whose product is a square. But this is precisely an instance of the problem discussed in the last section. Unfortunately, recall that that the method we came up with there is not efficient for numbers with large factors; the matrix becomes too large. What do we do? We simply throw away numbers with large factors! Theoretical results show that there are a fairly large number of values in the sequence a2n that are smooth (recall that smooth numbers have only small factors). This gives us a new factoring method that works pretty well up to a point.

For example, consider the number 90283. If we start a at 301 and increment it up to 360 while computing a2n, we get the following values:

318, 921, 1526, 2133, 2742, 3353, 3966, 4581, 5198, 5817, 6438, 7061, 7686, 8313, 8942, 9573, 10206, 10841, 11478, 12117, 12758, 13401, 14046, 14693, 15342, 15993, 16646, 17301, 17958, 18617, 19278, 19941, 20606, 21273, 21942, 22613, 23286, 23961, 24638, 25317, 25998, 26681, 27366, 28053, 28742, 29433, 30126, 30821, 31518, 32217, 32918, 33621, 34326, 35033, 35742, 36453, 37166, 37881, 38598, 39317

None of these are squares (the first square occurs at a=398); however, if we factor each value we will discover that 7 of these values have no factor larger than 43:

6438, 10206, 16646, 19278, 19941, 30821, 35742

If we take these 7 values and feed them to the algorithm described in the last section, it finds a square: 19278×19941×30821×35742 = 423481541612104836 = 6507545942. The corresponding original a were 331, 332, 348, and 355, and their product is 13576057680. Now, we can factor the number:

(13576057680−650754594)(13576057680+650754594) = 12925303086 × 14226812274 is a multiple of 90283
GCD(90283, 12925303086) = 137
GCD(90283, 14226812274) = 659
137 × 659 = 90283.

Making it faster: sieving for smooth numbers

The factorization algorithm above is considerably better than Fermat's algorithm, but if we try to scale up the size of number we factor, we quickly encounter a bottleneck: finding the smooth numbers in the sequence. Only 7 of the 60 values we computed in our last example were 43-smooth (actually we were lucky to get a square with so few vectors). As the size of the number that we're factoring grows, so does the size of the numbers in the sequence, and the proportion of smooth numbers rapidly shrinks. Although finding smooth numbers doesn't require completely factoring every number in the sequence (we only have to test primes up to the smoothness limit), it's still too expensive to test every number in the sequence this way.

The key is to observe that the prime factors of the sequence a2n follow a predictable sequence. Let's take a look at the prime factorizations of the first ten or so numbers in our example sequence above:

318 = 2×3×53
921 = 3×307
1526 = 2×7×109
2133 = 33×79
2742 = 2×3×457
3353 = 7×479
3966 = 2×3×661
4581 = 32×509
5198 = 2×23×113
5817 = 3×7×277

The most obvious pattern is that every other number is even, beginning with the first one. This should be no surprise, since we're effectively adding 2a+1 to get each new number, which is always odd. Also, you'll notice that the first and second numbers are divisible by 3, as are the fourth and fifth, the seventh and eigth, and so on. If you look at the larger list, you'll notice similar patterns for larger primes; for example, the 3rd and 6th numbers are divisible by 7, and every 7th number after each of them as well. And, mysteriously, not one number in our entire sequence is divisible by 5!

So what's going on? The answer involves what number theorists call quadratic residues. A number a is called a quadratic residue mod p if there is some square S such that Sa is divisible by p. Half of all numbers are quadratic residues mod p, regardless of the value of p, and there's a simple formula for determining whether or not a particular number is: just take a, raise it to the power (p−1)/2, and then take the remainder after division by p. Then a is a quadratic residue mod p if and only if the answer is 1. Although this computation seems to involve very large values, in fact we can compute it quite quickly using exponentiation by squaring combined with frequent remainder operations.

This explains why none of our values are divisible by 5. If we compute 90283(5-1)/2 mod 5, we get 4, which is not 1 (remember that 90283 is our original n to be factored). Thus, there is no square such that Sn is divisible by 5; but all numbers in our sequence have this form. In practice, this means we can compute just once ahead of time which factors may occur in the sequence (primes p such that n is a quadratic residue mod p), and ignore all others.

For our next mystery, why is it that given a number in the sequence divisible by p, every pth number after that is also divisible by p? Well, simple algebra shows that if a2n=kp, then:

(a+p)2n = (a2n)+p(2a+p) = kp+p(2a+p).

But this doesn't explain why it always seems to be the case that there are exactly two different initial values of a such that a2n is divisible by p (with the exception of p=2). For example, in our sequence above the 3rd and 6th values were divisible by 7. The answer again is quadratic residues: it can be shown that the modular equation x2≡y (mod p) has exactly two solutions (if it has any), and in fact there is an efficient algorithm for computing these two solutions called the Shanks-Tonelli algorithm. I won't go into it here since it requires some background in number theory, but for small primes it isn't really needed; it suffices to test the first p numbers to see which are divisible by p. For larger primes, it becomes important to avoid this expensive scan.

Recall the Sieve of Eratosthenes, an algorithm for locating prime numbers. It starts with a list of numbers, then crosses off all numbers not divisible by 2 except 2, then does the same for 3, 5, and so on until it's done. The numbers that remain must be prime. When attempting to find a list of prime numbers, this strategy is much more efficient than running even the most advanced primality test on each number individually.

We take a similar strategy here: we begin with a table of the original values in the sequence. We then visit all the numbers divisible by 2 and divide out a factor of 2. We do the same for each power of 2 up to the size of the sequence. We then do the same for every other prime up to our smoothness bound (43 in our example). In the end, the smooth numbers and only the smooth numbers will have become 1. Since we visit less and less list elements as the prime factor increases, the overall work is much less. For example, here's our original list from the above example:

318, 921, 1526, 2133, 2742, 3353, 3966, 4581, 5198, 5817, 6438, 7061, 7686, 8313, 8942, 9573, 10206, 10841, 11478, 12117, 12758, 13401, 14046, 14693, 15342, 15993, 16646, 17301, 17958, 18617, 19278, 19941, 20606, 21273, 21942, 22613, 23286, 23961, 24638, 25317, 25998, 26681, 27366, 28053, 28742, 29433, 30126, 30821, 31518, 32217, 32918, 33621, 34326, 35033, 35742, 36453, 37166, 37881, 38598, 39317

We visit elements 1, 3, 5, and so on, dividing out 2. Here's the list after this first pass is complete:

159, 921, 763, 2133, 1371, 3353, 1983, 4581, 2599, 5817, 3219, 7061, 3843, 8313, 4471, 9573, 5103, 10841, 5739, 12117, 6379, 13401, 7023, 14693, 7671, 15993, 8323, 17301, 8979, 18617, 9639, 19941, 10303, 21273, 10971, 22613, 11643, 23961, 12319, 25317, 12999, 26681, 13683, 28053, 14371, 29433, 15063, 30821, 15759, 32217, 16459, 33621, 17163, 35033, 17871, 36453, 18583, 37881, 19299, 39317

Here it is after dividing out the prime factors 3, 5, 7, 11, 13, and 17:

53, 307, 109, 79, 457, 479, 661, 509, 2599, 277, 1073, 7061, 61, 163, 263, 3191, 1, 10841, 1913, 577, 6379, 1489, 2341, 2099, 2557, 1777, 1189, 5767, 2993, 18617, 1, 23, 10303, 1013, 1219, 22613, 3881, 163, 12319, 2813, 619, 26681, 4561, 1039, 2053, 9811, 5021, 37, 103, 10739, 16459, 1601, 1907, 35033, 851, 12151, 18583, 1403, 919, 39317

We see a couple 1s have already appeared; these are 17-smooth numbers. When we get all the way up through 43, we have:

53, 307, 109, 79, 457, 479, 661, 509, 113, 277, 1, 307, 61, 163, 263, 3191, 1, 293, 1913, 577, 6379, 1489, 2341, 2099, 2557, 1777, 1, 5767, 73, 18617, 1, 1, 10303, 1013, 53, 22613, 3881, 163, 12319, 97, 619, 26681, 4561, 1039, 2053, 9811, 5021, 1, 103, 10739, 16459, 1601, 1907, 35033, 1, 419, 18583, 61, 919, 39317

We see several numbers set to 53 or 61; these would be smooth if we raised our bound a little bit.

This sieving process is where quadratic sieve gets its name from. This drastically decreases the overall work needed to find a sufficient number of smooth numbers, making it practical for very large numbers. This basic implementation could probably handle numbers up to 50-60 digits.

Improvements and optimizations

Quadratic sieve admits a number of "bells and whistles" to dramatically improve its runtime in practice. We mention only a few of the most important ones here.

The simple row reduction method of Gaussian elimination is not able to accomodate the very large smoothness bounds needed to factor large numbers, which often range in the millions, mostly due to space limitations; such matrices, if stored explicitly, would require trillions of bits. However, this method is wasteful, because most of the entries in the matrix are zero (they must be; each number has no more than log2n prime factors). Instead of using an actual two-dimensional array, we can just keep a list for each column that lists the positions of the 1 bits in that column. We then use a method well-suited to reducing sparse matrices such as the Lanczos algorithm. This still requires a fair amount of space; it's common to use block algorithms that work on small portions of the matrix at one time, storing the rest of the matrix on disk. The matrix step is notoriously difficult to parallelize and for large problems is often done on a single high-performance supercomputer.

The most expensive step by far is the sieving, which can require scanning billions of numbers to locate the needed smooth numbers. A common trick is to only track the approximate logarithm of each number, usually in fixed-point arithmetic. Then, when visiting each number, instead of performing an expensive division we only have to subtract. This introduces a bit of rounding error into the algorithm, but that's okay; by rounding consistently in the correct direction, we can ensure that we don't miss any smooth numbers and only capture a few spurious numbers that we can quickly check and reject. Because the logarithms of small primes are small, and require visiting more numbers than any others, primes like 2 and 3 are often dropped altogether.

Another problem is that a2n grows fairly quickly; because smaller numbers are more likely to be smooth, we get diminishing returns as we scan higher in the sequence. To get around this, we scan values of not just the sequence a2n but also a number of similar sequences such as (Ca+b)2n for suitable constants C, b. This variation is called the multiple polynomial quadratic sieve, since each of these sequences can be seen as the values of polynomial in a.

Finally, although the matrix step does not admit simple parallelization due to many data dependencies, the sieving step is perfectly suited to massive parallelization. Each processor or machine simply takes a portion of the sequence to scan for smooth numbers by itself, returning the small quantity of smooth numbers that it discovers to a central processor. As soon as the central processor has accumulated enough smooth numbers, it asks all the workers to stop. In the multiple polynomial variant, it's common to assign some of the polynomials to each machine.

One peculiar idea for massively parallelizing the sieving step, invented by Adi Shamir, is to use not computers but a specially constructed sieving device based on light emitters and sensors that he calls TWINKLE. The concept is that we have a light for each prime number whose intensity is proportional to the logarithm of that prime. Each light turns on just two times every p cycles, corresponding to the two square roots of n mod p. A sensor senses the combined intensity of all the lights together, and if this is close enough to the logarithm of the current value, that value is a smooth number candidate.

Conclusion

I hope this gives you all some insight into the workings of one of the most powerful factoring algorithms available, and into how unexpected algorithms and mathematics can be applied to familiar problems. Please leave any feedback, whether clarifications, corrections, complaints, or encouragement. Thanks for reading.

Posted Monday, June 19, 2006 12:13 PM by dcoetzee | 10 Comments

Color quantization

If you've ever done work with Web graphics, I'm sure that at some point you reduced an image with thousands of colors, such as a screenshot or photograph, to an image with only 256 colors, for example to store the image in the GIF format. Remarkably, the reduced image is usually very visually similar to the original image, despite the massive loss of information. Most paint programs are capable of performing this transformation, and operating systems such as Windows will perform the transformation on-the-fly when the user is running a video mode with only 256 colors.  But how does it work?

In the literature, this problem is called the color quantization problem -- generally, the word "quantization" refers to any process that maps several values to a single representative value. In this case, the values that we're mapping are colors. There are two main steps to color quantization:

  1. Generate a good palette
  2. Dither

Generate a good palette

First we'll talk about generating a good palette. The simplest approach is to use a fixed palette for all images. We can map pixels in the source image to pixels in the reduced image by simply choosing the closest color. What do we mean by "closest"? Well, we can identify each color with a point in three-dimensional space whose coordinates are the red, green, and blue values of the color.  Then, we consider two colors to be close if their points are close in this three-dimensional space.  This isn't ideal, but it's very simple.

The fixed palette approach has the advantage that all images using the same palette can be rendered on the same screen simultaneously; unfortunately, it severely compromises image quality. Making this even worse is that typical fixed palettes, such as the Windows standard palette and the web colors palette, are chosen as uniform palettes, meaning that the colors are evenly spaced throughout RGB space. This is a terrible way to construct a palette, as humans can distinguish much finer color differences in some areas of RGB space than others.

Below, we show a photograph of a rose taken by Stan Shebs, but modified to remove its blue channel. This means that all of its colors lie in a single plane, the plane blue = 0. That color plane is shown in the graph below.  We also show a 16-color optimized palette found by Adobe Photoshop - each blue point is a palette color, and each region shows the section of the color space mapping to that palette color. As you can see, the palette is far from uniform, and there are many colors that the image doesn't even use.

From this picture, we can see that generating a good palette is essentially a point clustering problem. We want to identify clusters in the color space that contain a large number of points close together and assign a single palette entry to those colors.

One of the most popular algorithms for this problem is called the median cut algorithm. It was invented by Paul Heckburt in 1980 in his paper "Color Image Quantization for Frame Buffer Display" and is still in wide use today. The algorithm follows just a few basic steps:

  1. Create a block or box surrounding the points; it should be just large enough to contain them all.
  2. While the number of blocks is less than the desired palette size:
    1. Find the block with the longest side out of all blocks.
    2. Cut this block into two blocks along its longest side. Cut it in such a way that half of the points inside the block fall into each of the two new blocks (that is, we split it through the median, thus the name).
    3. Shrink these two new boxes so that they just barely contain their points.
  3. Average the points in each box to obtain the final set of palette colors.

Despite a number of more sophisticated modern algorithms, this strikingly simple approach is adequate for generating 256 color palettes for most images. You can read about a C++ implemention of the median cut algorithm at my wiki.

Dither

If the palette is large enough and of high enough quality, Reducing by nearest color is usually sufficient. If not, however, distinct artifacts will be visible, such as banding, where what were once smooth gradients become striped regions (see this image; look closely at the cat's neck). One effective technique for dealing with these palette limitations is called dithering, which mixes dots of different colors to simulate a larger number of colors and smooth out gradients. For example, if you mix a grid of red and blue dots, the result will appear purple. Dithering is also commonly used in printing, where for example a grayscale image can be simulated with great accuracy using a higher-resolution dithered black and white image.

The trick to dithering is to effectively simulate the original color in each area of the image while avoiding telltale artifacts formed by poor dithering patterns; see ordered dithering, commonly used in newspapers, as an example of a dithering scheme with clearly visible artifacts. One of the most popular dithering schemes to this day is the Floyd-Steinberg algorithm, a classic dithering algorithm falling into the general class of error diffusion dithering algorithms.

The basic concept behind error diffusion is to begin with nearest color matching, but whenever we can't match a pixel color exactly, the error between the original color and the palletized color is distributed into adjacent pixels not yet visited. In Floyd-Steinberg, the pixel to the right receives 7/16 of the error, the pixel below 5/16, the pixel below and to the left 3/16, and the pixel below and to the right 1/16. When these pixels are visited, this added error affects the palette entry selected for them, leading to an average colour over that portion of the image close to the actual average color of the pixels in that area.

What's great about Floyd-Steinberg is that it generalizes easily to reducing full 24-bit color images to palettes of any size. The only modification we need to make is that we perform the error distribution step for each of the three channels. The palette entry is still chosen by nearest color. You can read about a C implementation of Floyd-Steinberg dithering on my wiki.

This isn't the end of the story of course - there is ongoing research and all kinds of fancy variations on these approaches, many of which can produce much nicer results than these classical algorithms. But at least now the next time you see a GIF or 8-bit PNG file on the web, or use a computer in 256-color mode, you can appreciate the algorithms under the cover that helped make your images look good under these severe limitations.

Posted Tuesday, May 09, 2006 5:57 PM by dcoetzee | 8 Comments

How does JPEG actually work?

JPEG is an image encoding designed to compress photographs and similar images effectively, often 5 to 15 times over a raw bitmap format. It's a lossy format that exploits properties of human vision to eliminate information that is difficult to distinguish. You see JPEGs all the time, and if you've seen a JPEG compressed with a low quality level, you have a good idea of the kinds of visual artifacts that emerge when JPEG compression is pushed to its limits. But how does it really work, and how does this cause JPEG artifacts?

The JPEG standard defines a number of different encoding options, and there are a number of container file formats defined other than the usual JFIF, but I'm going to discuss a single popular JPEG encoding method used by many applications. This encoding combines three different completely independent methods of image compression: downsampling the chrominance channels, quantization of the discrete cosine transformation, and entropy coding.

Method 1: Downsampling the chrominance channels

The most obvious way to make an image file smaller is to make the image smaller - for example, a 100×100 bitmap is 1/4 the size of a 200×200 bitmap. You can expand the image back out to the original size when viewing it by simply upsampling it. This isn't a great compression method, since it causes clearly visible artifacts, usually blocky or blurry effects, especially around detailed, sharp areas.

A different approach is instead of downsampling the entire image, to only downscale certain channels of the image. For example, the eye's red-green sensors are more numerous and sensitive than its blue-yellow sensors, so if we shrink just the blue channel and later re-expand it, the effect is much less noticable. But the RGB color space is not our only choice - images can be encoded in many color systems, and for each of these the channels we choose to downsample will have different visual effects. This experiment by Eric Setton of Stanford University demonstrates the effect on image quality of downsampling various channels in various color systems.

The results are visually obvious: we are sensitive to red and green, but even more sensitive to intensity or luminance, the relative brightness of pixels compared to other nearby pixels. This is expressed by the Y component of the YCbCr system and the V component of the HSV system. Any downsampling of this information is visually disastrous. On the other hand, downsampling the chrominance, expressed by the Cb and Cr components of the YCbCr system, has almost no visual effect (Cb and Cr stand for "chrominance blue" and "chrominance red" - see a sample image broken up into YCbCr channels in the middle of this page). For this reason, typical JPEG files convert images to the YCbCr color space, and then downsample both the Cb and Cr channels by a factor of 2 in both width and height. Further processing of the image proceeds on each of the three channels independently.

This is the method in JPEG encoding largely responsible for the artifact known as "color bleeding", where along sharp edges the color on one side can "bleed" onto the other side. This is because the chrominance channels, which express the color of pixels, have had each block of 4 pixels averaged into a single color, and some of these blocks cross the sharp edge.

Method 2: Quantization of the discrete cosine transformation

Normally, we view an image as simply a grid of pixels, and if we alter a pixel or group of pixels dramatically, the effect will be visually obvious. A different approach to storing images is to come up with a set of "basis" images which can be added together in various combinations to form the image we want. JPEG encodes each 8-by-8 square of pixels using the following basis image set of 64 images:

We assign each of these images a relative weight determining how much influence it has on the image. This weight can be positive or negative (when negative, the basis image is subtracted). Taken together, these 64 weights can describe any 8×8 image at all, and they describe the image precisely. This example at Wikipedia shows an 8×8 image, its original pixel array, and its new 8×8 array of weights after being converted using the discrete cosine transform (which I'll discuss later).

But what's the point of transforming 64 pixel values into 64 weights? For one thing, the resulting data may exhibit more structure, and so be easier to compress. For example, with blocks from typical photographs, the basis images in the upper left above have large weights and the ones in the lower right have small weights.

But to store the weights, which are real numbers, we have to encode them somehow. The simplest encoding is to just convert them to integers between -k and k by scaling and rounding them. This works, but there's a tradeoff in our choice of k: if it's too big, the encoded image will take too much space, but if it's too small, the rounding might visibly distort the image, since it's effectively altering the weights.

A more effective approach is to use a different k for each basis image. This works well because the visual difference caused by variations in the weights of different basis images is different. The eye can detect small variations of the weights of the images in the upper-left, but can overlook much larger variations in the weights of the images in the lower-right, so we can get away with smaller k and larger rounding error for these weights.

In practice, rather than deal with expensive floating-point numbers, we round the initial weights to integers and later divide them by fixed factors called the quantization factors using integer division. During decoding, we multiply them by these factors again. The upper-left images have small quantization factors, so that little information is lost in this process, while the lower-right images have larger factors. This example shows a commonly used matrix of quantization factors, one for each basis image, and the result of quantizing the sample image using them. Observe how much simpler the matrix becomes - it now contains a large number of entries that are small or zero, making it much easier to compress. This example shows how the weight values expand back out during decoding - notice that although some of the weights are now different, the visual difference is almost undetectable.

Quantization is the primary source of JPEG artifacts. Because the images in the lower-right tend to have the largest quantization divisors, JPEG artifacts will tend to resemble combinations of these images. The matrix of quantization factors can be directly controlled by altering the JPEG's "quality level", which scales its values up or down.

Quantization compresses the image in two important ways: one, it limits the effective range of the weights, decreasing the number of bits required to represent them. Two, many of the weights become identical or zero, improving compression in the third step, entropy coding.

But this discussion leaves one important question unanswered: how do we take an arbitrary square of 8×8 pixels and decompose it into a combination of the basis images? As it turns out, we specifically chose the images above in a way to make this easy. These images are formed by cosine waves of various frequencies in the x and y directions, and a transformation called the discrete cosine transform from Fourier transformation theory allows us to rapidly decompose a signal into these parts in only O(n log n) time. Because of the importance of JPEGs, over the years very fast and clever specialised solutions for 8×8 blocks have been created in both software and hardware. This is also the reason we use small, constant-sized blocks instead of processing the image as a whole: the cosine transform step doesn't need to scale up to large bitmaps, so the transformation as a whole can run in linear (O(n)) time.

Method 3: Entropy coding

Unlike the other steps, entropy coding is lossless. If you've studied the simple compression techniques used in ZIP files and BMP files, the techniques of entropy coding should come as no surprise.

We start by turning our 8×8 grid of integers into a linear sequence of integers. The obvious row-major and column-major orders are not ideal here, since we expect there to be many zeros in the lower-right area. Instead, we start with the top-left corner and zig-zag in a diagonal pattern across the matrix, going back and forth until we reach the lower-right corner.

Next, we apply the same run-length encoding algorithm used by GIF files to condense sequences of identical integer values. In areas where the values are small, we can expect many such runs, particularly runs of zeros. We use a special code to say "all the remaining values are zero", which comes in handy.

Finally, we apply Huffman compression to the remaining sequence, which is a standard lossless compression technique that encodes frequently used values using less bits than infrequently used values.

Conclusion

And that's it! For each YCbCr channel, for each 8×8 block, we perform these transformations, and we store the results in a bitstream. This data is wrapped up in a JFIF container giving metadata about the image, and the file goes out ready for web browsers to view, completely unaware of all the data that has been lost. Just don't go sending JPEGs into space and expect aliens to see them the same way.

Finally, no discussion of the JPEG encoding would be complete without noting that JPEG is based on research ideas that are now very old and largely supplanted. The more modern JPEG 2000 encoding based on wavelets can be pushed much farther before visible artifacts appear, and includes a surprisingly efficient lossless mode. Of course, all this improvement is rather moot until common applications actually support it.

Posted Wednesday, April 12, 2006 4:33 PM by dcoetzee | 4 Comments

Encoding correctness in types

This article will discuss effective use of types to catch some common problems at compile time not normally found by the typechecker.

The typechecker is the most important tool in the programmer's arsenal. Types enable you to structure code in a more maintainable way and prevent entire classes of errors from ever occurring at runtime, many of which can be insidiously difficult to debug. If you've written anything extensive in assembly language, you can understand the frustration of dealing with misinterpretations of data leading to corruption of program state. Researchers like to say "well-typed programs never go wrong".

But what does "go wrong" mean? One answer is: "more than you think". By making effective use of the type system to encode invariants about your program, you can use the compiler to catch more errors at compile time that you currently catch at run time.

A simple example is SQL injection, where insertion of user-supplied data into a SQL statement string can enable attacks where the user modifies the nature of the statement with carefully chosen data. At first the solution seems simple: double your apostrophes to escape them, and then the value cannot be misinterpreted. This can be trickier than sounds, though, since in a complex program you might not be able to identify which values come from an untrusted source, directly or indirectly, which are safely produced from trusted sources, and which are untrusted data that has already been properly escaped. These three types of data travel from method to method, following a winding path that obscures their source. The compiler is of no help, since it sees them all as just strings.

The solution, called "tainting" in languages like Perl and Ruby, is to make them not "just strings", but to create two types of data: tainted data, which is unsanitized data derived from an untrusted source, and untainted data, which is either derived from a trusted source or already sanitized. For our example, suppose we're performing our queries with SqlDataReader. We create a wrapper class for it which has an identical interface except that wherever SqlDataReader takes a string, our wrapper takes a SafeSqlString. This new type is simply a wrapper for string that can only be generated in a few strictly limited ways:

  • static SafeSqlString AssumeSafe(string): Used for trusted data, just wraps it
  • static SafeSqlString Escape(string): Escapes apostrophes; used for untrusted data.
  • SafeSqlString operator+(SafeSqlString): Concatenates two SafeSqlStrings
  • static SafeSqlString Format(SafeSqlString format, params object[] args): Takes a safe format string and a list of arguments of safe types such as SafeSqlString, int, and double, and produces a new SafeSqlString.

What does this buy us? Now, as long as we use our new SqlDataReader wrapper for all database access, we are essentially immune to SQL injection: the only way it could sneak in is through AssumeSafe(), which is obviously dangerous and makes the code easy to review. As an added bonus, double escaping can never happen: the data only gets escaped once as it changes from string to SafeSqlString, and Escape() cannot be applied to SafeSqlString.

Because these types flow through the program, we can specify that a particular method always returns a SafeSqlString, or assumes that its arguments are SafeSqlStrings, or assumes that some private variable is always safe, and enforce this statically instead of by convention. This is important, because it allows us to use AssumeSafe() at a point where it's obvious that the string is safe, as opposed to five calls up the call stack where we really have no idea who else might be calling us.

What else can we encode in types? All sorts of things! Here are some more examples:

  • We could have a type PositiveInt for positive integers, which will throw a runtime error if constructed with a nonpositive integer. Now, if a method receives a PositiveInt, it can safely pass it on to another method expecting a positive integer without any additional runtime checks. This trick works with any object where you want to enforce a complex runtime property on an object but only want to check it once (or only when it might change). In unmanaged code, it's particularly useful for checking buffer/string lengths.
  • Some binary tree algorithms require that every node has zero or two children, never one. We can represent this using a subclass hierarchy, where the abstract base class is TreeNode and the subclasses are Leaf and InternalNode. Leaf contains no data at all, while InternalNode contains a piece of data and two (non-null) references to TreeNodes.
  • There are many other properties of data structures that can be encoded in types. For example, in a red-black tree, red nodes cannot have red children. We can enforce this statically by creating a subclass hierarchy where BlackNode objects contain references to the abstract base class Node, while RedNode objects contain only references to BlackNode objects. Similarly, the root must be black, so we just use a BlackNode reference for the root.
  • If you're writing a sorting algorithm, it might not be clear that the array you get back contains a rearrangement of the same elements as the array you passed in. There's no quick way to check this, even at runtime. On the other hand, if instead of an array you pass in an object whose only valid non-read-only operation is to exchange two elements, then this becomes statically obvious.

In short, wherever you find yourself wanting to keep track of a static property as it flows through a program, types are a great way of doing so - use them to catch errors sooner.

In programming languages with more sophisticated type systems, you can encode even more information in types. Taking this to the extreme are dependent types, a relatively new system of types based on values that are extremely difficult to typecheck but so powerful that they can express many correctness properties of real programs. A number of research languages such as Dependent ML and Epigram already implement dependent types. With the power of systems like these, maybe in a few years the primitive C-based type systems of C# and C++ will look a lot like assembly language does now. But before we can take advantage of new technology, let's start making better use of what we have - encode your program's properties using types.

Posted Monday, March 27, 2006 1:52 PM by dcoetzee | 0 Comments

LiteratePrograms wiki

Hi everybody. This is a short post, but I just wanted to tell you all about a new wiki I created called LiteratePrograms, based on an extension of the same MediaWiki software used by Wikipedia.

Some of you read my earlier post about literate programming and some of the advantages it has for easy-to-read exhibition of code. On LiteratePrograms, every article is a literate program: it is simultaneously a document and a piece of code which can be downloaded, compiled, and run. The "download code" tab at the top of every article runs the noweb literate programming tool which extracts code segments from the discussion, puts them back in a form that can be compiled, and zips them up in a package for download. Additionally, the wiki markup, editable by anyone, allows rich documentation including mathematical formulas and diagrams. Finally, all content is released under the liberal MIT/X11 non-copyleft license, so anybody can use it, even for commercial purposes. You can see a sample of LP in action at Insertion sort (C, simple).

Recently featured at Lambda the Ultimate (thanks Allan!), LP has attracted a lot of attention and has a couple dozen articles and users now. The users have some pretty diverse interests in languages and subject areas that I would never have anticipated (see Pi with Machin's formula (Python), Hello World (occam)).

I invite your suggestions for and participation in and the site - in particular, we're really lacking in Windows and .NET programs and programmers at the moment, and I know you guys can help. See How to write an article and take a look at the existing articles to learn about the markup syntax. Please ask me any questions you have. Thanks everyone.

Posted Friday, March 03, 2006 5:54 PM by dcoetzee | 0 Comments

Dependency tracking in builds

Happy Valentine's Day, everyone! Today I'm going to talk about the important problem of successfully tracking dependencies in builds in an automatic manner.

In short, there are two kinds of build systems: the ones where all dependencies are tracked successfully and incremental rebuilding is available at all times, and the ones where you have to rebuild everything whenever you change anything remotely significant. There is nothing inbetween - as soon as a build system fails to track dependencies in one circumstance, developers will become paranoid and insist on costly full rebuilds in the future. After all, the consequences of a partial rebuild are insidiously difficult to diagnose, resulting in broken builds, symbol mismatches, exceptions while loading assemblies, mixtures of interacting new and old code, and other nonsense that we don't want to waste our time debugging. This means that a dependency tracking system has to be very close to perfectly reliable, which can make it very difficult to design and implement successfully.

Although the problem is a very old one, with tools like GNU make dating back decades, I believe in most project environments dependency tracking is not solved to the level of reliability that is needed for this kind of trust. Behind all of these failed dependency-tracking systems is one cardinal sin:

Dependencies cannot be maintained by hand.

I'm not merely saying maintaining dependencies by hand is difficult or even error-prone, but that it's damn near impossible. In C/C++ in particular, adding a #include statement to a header file implicitly adds a dependency to every source file including that header file directly or indirectly. Even worse is removing a #include, in which case some of the files depending on it might no longer require it, but then some might, depending on what they directly or indirectly include. Waste all the time you want on this - you won't get it right. Even in C#, Java, or Visual Basic, which thankfully have no concept of header and source files and load an entire project for compilation at once, interproject dependencies can become just as complex in large solutions.

Even worse are systems which require you to explicitly specify not only the dependencies but the build order. Often hierarchically structured solutions of many projects will use pretty good dependency tracking within each project, but revert to this "build these things in this order" model for the whole tree. The result is that related modules in different parts of the tree cannot be built together incrementally, and highly parallellized building becomes impossible. Scalability demands that we create some level of abstraction for solution-level builds, but that doesn't mean we have to sacrifice incremental building altogether.

The alternative to maintaining something by hand is of course maintaining it automatically. The code is already quite explicit about what depends on what - whether in the form of #include statements, import statements, assembly references, or other means. The fundamental principle of avoiding redundancy tells us that we should be generating dependency information from these sources. For example, the following buried treasure from the GNU make manual details how to use the -M flag of gcc together with some (rather esoteric) script to automatically track header dependencies in C:

The practice we recommend for automatic prerequisite generation is to have one makefile corresponding to each source file. For each source file `name.c' there is a makefile `name.d' which lists what files the object file `name.o' depends on. That way only the source files that have changed need to be rescanned to produce the new prerequisites.

Here is the pattern rule to generate a file of prerequisites (i.e., a makefile) called `name.d' from a C source file called `name.c':

 
%.d: %.c
        @set -e; rm -f $@; \
         $(CC) -M $(CPPFLAGS) $< > $@.$$$$; \
         sed 's,\($*\)\.o[ :]*,\1.o $@ : ,g' < $@.$$$$ > $@; \
         rm -f $@.$$$$
http://www.gnu.org/software/make/manual/html_mono/make.html#SEC51

The concept here is that the dependencies for each source file are stored in a separate file which itself depends on that source file. If the source file is modified, the ".d" file containing that source file's dependencies is rebuilt before it is loaded by make. In this way the dependencies can never be wrong or out of date for even a moment. This concept could be generalized to a pluggable framework that enables you to determine build-time dependencies of arbitrary files using arbitrary code.

We can play similar tricks to dynamically track the dependencies of projects in a solution. Although this is one level of abstraction higher, it's really just the same thing with source files replaced by projects. We analyze the projects to determine their dependencies at build-time, set up a build order, and then locally build each project using its own local dependency structure. This enables automatic dependency tracking to scale effectively. With a solution like this, you can simply sync up the latest changes from the repository and do an incremental build from root in minimum time.

Dependency tracking isn't just useful for builds either - it can often be used to evaluate the impact of a change. For example, if I change a header file, a good dependency tracking system should be able to name for me all the affected source files in the project, as well as any other project that might be affected. In the end I might decide to go with a less destabilizing change instead. They can also be incorporated into version control systems, where they can help to evaluate whether concurrent changes made by other people will break what you're working on right now. This is largely unexplored territory.

At Microsoft, a new build system called MSBuild has recently been developed and released with Visual Studio 2005. Currently used for almost all of the Visual Studio build, it will hopefully in time replace older ad hoc systems based on batch files across the company. One of its advantages is its incremental building support, described here:

http://msdn2.microsoft.com/en-us/library/ms171483.aspx

Although a sophisticated system, even MSBuild's support for incremental building is relatively limited, focusing on dependencies within projects between files that have direct input-to-output mappings.

Build scheduling algorithms

Let's talk a little bit about the technology behind build scheduling - a simple problem, but one with a number of interesting generalizations. To simplify the discussion, we'll refer to both