Developing for Developershttp://blogs.msdn.com/b/devdev/Tools, techniques, and theory for measuring and improving the power and performance of developers and their codeen-USTelligent Evolution Platform Developer Build (Build: 5.6.50428.7875)New blog: Papers in Computer Sciencehttp://blogs.msdn.com/b/devdev/archive/2009/02/28/new-blog-a-cs-paper-a-day.aspxSun, 01 Mar 2009 01:20:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:9451372MSDNArchive1http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=9451372http://blogs.msdn.com/b/devdev/archive/2009/02/28/new-blog-a-cs-paper-a-day.aspx#comments<p>Hey all - I apologize for the (extremely) long period of no updates, I've been prioritizing other things. I've been accepted this Fall to begin my Ph.D. at University of California, Berkeley. Since I won't be at Microsoft any longer, I've started a new blog with a new focus and more regular updates called Papers in Computer Science at Wordpress, and I've posted my first post to it:</p><p><a href="http://papersincomputerscience.wordpress.com/" mce_href="http://papersincomputerscience.wordpress.com/">http://papersincomputerscience.wordpress.com/</a></p><p>My summary from the introduction post: "<i>Papers in Computer Science</i> (originally called <i>A CS Paper a Day</i>)
is a blog for abstracts and brief summaries and discussions about
published computer science articles. I’ll cover a variety of areas,
focus on some of the most interesting and important papers in each one,
and I’ll try to explain them to a general computer science audience
without specialist background. Because most of these are available
online, many of them without cost, readers can consult the original
paper for more information if they’re interested."</p><p>I hope anyone interested in reading more of my stuff will take a look at <i>Papers in Computer Science</i> and leave comments. Thank you all for your patronage!<br></p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=9451372" width="1" height="1">P-complete and the limits of parallelizationhttp://blogs.msdn.com/b/devdev/archive/2007/09/07/p-complete-and-the-limits-of-parallelization.aspxSat, 08 Sep 2007 06:11:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:4821073MSDNArchive3http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=4821073http://blogs.msdn.com/b/devdev/archive/2007/09/07/p-complete-and-the-limits-of-parallelization.aspx#comments<P>We're entering an era where CPU clock speeds will soon cease to scale upwards and instead CPU manufacturers are planning to put more and more independent cores on a chip. <A class="" href="http://news.com.com/2100-1006_3-6119618.html" mce_href="http://news.com.com/2100-1006_3-6119618.html">Intel plans to release an 80-core chip within 5 years</A>. Consequently the research community is setting their eye on the manifold barriers to writing correct and efficient highly-parallel programs. Will we be able to continue to get the same boosts in computational performance that we have been getting through raw clock speed?</P>
<P>You might think it's just a matter of building effective tools for producing programs that can take advantage of highly parallel platforms, and training programmers to take advantage of them. But there are complexity theory results to show that there are some problems that, under widely accepted assumptions, fundamentally cannot be parallelized, and this is the theory of <A class="" href="http://en.wikipedia.org/wiki/P-complete" mce_href="http://en.wikipedia.org/wiki/P-complete">P-completeness</A>. For these problems, it may be literally impossible to get any significant benefit out of throwing additional cores at them - such problems are called <EM>inherently sequential</EM>. More to the point, however long it takes your computer to solve these problems <EM>right now</EM>, that's about how fast it's going to stay for a long time.</P>
<P>The complexity class NC (Nick's Class, named after <A title="" href="http://en.wikipedia.org/wiki/Nick_Pippenger" originalTitle="Nick Pippenger" popData="undefined" hasPopup="true">Nick Pippenger</A> by Steven Cook) is the set of problems that can be solved in polylogarithmic time (O(log<SUP><I>k</I></SUP> <I>n</I>) time for some integer <EM>k</EM>) by a machine with polynomially many processors. For example, given a list of <EM>n</EM> numbers, the typical way to add them up on a single processor would be with a loop, as in this C# snippet:</P><PRE> public static int AddList(List<int> list) {
int sum = 0;
foreach (int i in list) {
sum += i;
}
return sum;
}
</PRE>
<P>But now suppose we had <EM>n</EM>/2 processors. Then each processor can add just two of the numbers, producing a list of <EM>n</EM>/2 numbers with the same sum. We add these in pairs as well, continuing until just one number remains, the sum. This fan-in algorithm requires only O(log <EM>n</EM>) time, placing the problem in NC (or technically the decision problem version of it).</P>
<P>We can use similar ideas to parallelize many algorithms. For example, the best known sequential algorithms for matrix multiplication requires time O(<I>n</I><SUP>2.376</SUP>) for an <I>n</I> × <I>n</I> matrix, but given O(<EM>n</EM><SUP>3</SUP>) processors, we can solve it in O(log <EM>n</EM>) time: for each entry, we assign <EM>n</EM> processors to compute the products that are added to form that entry, then use the fan-in procedure above to add them together. This can still be done in O(log <EM>n</EM>) time. There are efficient parallel algorithms for other common problems too, like sorting, string matching, and algorithms in graph theory and computational geometry.</P>
<P>In real life of course we don't have such a polynomial number of processors, so this is in some ways an impractical definition; all it really tells us is that if we have <EM>many</EM> processors, we can solve the problem dramatically faster. Practical parallel algorithms pay more attention to <EM>speedup</EM> - how many times faster the algorithm is with <EM>k</EM> processors than with one. Effective speedup requires a good way of dynamically decomposing the problem. But speedup is difficult to characterize formally in a machine-independent way, so we'll stick with the definition of NC.</P>
<P>The famous P = NP problem asks whether all problems whose solutions can be verified in polynomial time, can also be solved in polynomial time. In this problem, the problems in P are considered "feasible to solve", and most computer scientists believe P ≠ NP - that is, that some problems which we can feasibly verify cannot be solved feasibly. We believe this in part because we have shown that many important problems are NP-hard. To say a problem is NP-hard means there is a polynomial time algorithm that can take any problem in NP and convert it into an instance of that problem; consequently, if an NP-hard problem can be solved in polynomial time, so can any problem in NP.</P>
<P>But when you have a large number of processors, the question turns around: we no longer consider P feasible, because any algorithm using just one CPU would face a dramatic slowdown compared to algorithms that take advantage of all of them. Now NC algorithms are considered feasible, and we ask: can all problems with efficient sequential solutions be highly parallelized? That is, is NC = P?</P>
<P>Just as with P = NP, nobody knows. However, just as there are NP-hard problems, there are P-hard problems. A P-hard problem is a problem where there is an NC algorithm (running in polylogarithmic time on a polynomial number of processors) to convert any problem in P into an instance of that problem. If we had an NC algorithm to solve a P-hard problem, it would imply that NC = P, providing a constructive and mechanical way of parallelizing any problem we can solve sequentially in polynomial time. Just as most researchers believe P ≠ NP, most researchers believe NC ≠ P, which implies that no P-hard problem has an NC algorithm to solve it.</P>
<P>One of the most important P-complete problems is <EM><A class="" href="http://en.wikipedia.org/wiki/Linear_programming" mce_href="http://en.wikipedia.org/wiki/Linear_programming">linear programming</A></EM>: given a linear target function in terms of some variables, and a number of linear inequalities expressing constraints between those variables, find the optimal (maximum) value of the target function. It finds uses in operations research, microeconomics, and business management, and it would be great if we could exploit large number of cores to tackle larger instances of these problems. But no one has found an efficient parallel algorithm for this problem.</P>
<P>Another important example is the <EM>circuit-value problem</EM>: given a digital logic circuit and its inputs, compute the output of the circuit. There is a trivial linear time solution in the number of gates: continually get the next gate that you know both inputs of and compute its output, until you're done. It would seem that certain gates don't influence each other and be computed in parallel. But because the wires can express complex interdependencies between values and can create long chains of dependencies, there's no clear way to parallelize a general circuit. Besides obvious applications to circuit design and simulation, many problems can be solved by families of circuits, and if we can evaluate them quickly we could solve these problems. But not only has no one has found a general way of parallelizing such evaluation, they've shown that even if we restrict the circuits so that wires can't cross, or such that only the inputs can be negated, the problem remains P-complete, and so difficult to parallelize (these are the <EM>planar </EM>and <EM>monotone </EM>circuit-value problems, respectively).</P>
<P>The first problem ever shown to be NP-complete, as well as an important problem in practice, is the boolean satisfiability problem: given a formula using AND, OR, and NOT, with a number of variables set to either true or false, determine if there is some way of setting the variables to make the whole formula evaluate to true. The P-complete version of this problem is called Horn satifiability.</P>
<P>Any boolean formula can be reduced efficiently to a form where it represents a three-layer circuit: a literal is either a variable or a negated variable; a clause is the OR of some literals; and an entire formula is the AND of some clauses. A "Horn clause" is a clause where all of the variables are negated except possibly one. This provides an efficient solution to the problem, since Horn clauses can be rewritten as implications (not(x1) or not(x2) or x3 is the same thing as (x1 and x2) implies x3). Formulas made up of Horn clauses appear in the context of logical deduction systems, such as the resolution systems used to model problems in AI. Again, no one has found an effective way to parallelize finding solutions for these formulas.</P>
<P>In the world of P = NP, people have long been studying ways to "get around" the infeasibility of finding a polynomial-time solution for NP-hard problems, such as approximation algorithms, special cases, fixed-parameter tractable solutions, and so on. We could ask the same questions about the NC = P question: could we find an approximate solution much more quickly by using all our cores, instead of running a classical algorithm to find an exact solution on just one? Are there special cases of these problems which are parallelizable? But comparatively little investigation has been done into this area, although some examples exist (like Karger and Motwani's "<SPAN class=l><A class="" href="http://citeseer.ist.psu.edu/karger97nc.html" mce_href="http://citeseer.ist.psu.edu/karger97nc.html">An NC Algorithm For Minimum Cuts</A>").</SPAN></P>
<P><SPAN class=l>Finally, I should mention that P-complete is relevant not only to the study of what problems can be efficiently parallelized, but also to what problems can be solved in a small amount of space. I'll leave this for another time though.</SPAN></P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=4821073" width="1" height="1">TheoryRobin's theoremhttp://blogs.msdn.com/b/devdev/archive/2007/07/16/robin-s-theorem.aspxTue, 17 Jul 2007 00:12:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:3901480MSDNArchive2http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=3901480http://blogs.msdn.com/b/devdev/archive/2007/07/16/robin-s-theorem.aspx#comments<P>Most computer scientists are familiar with the P = NP problem, which asks essentially whether we can verify more problems in polynomial time than we can solve. So fundamentally does complexity theory hinge on this result that the Clay Mathematics Institute has labelled it one of their seven <A class="" href="http://www.claymath.org/millennium/" mce_href="http://www.claymath.org/millennium/">Millennium Prize Problems</A>, and will award $1,000,000 to the researchers that solve it.</P>
<P>But there are six other Millennium Prize problems that are often less accessible to computer scientists, requiring background in areas such as topology and complex analysis. One of the most important of these is the Riemann Hypothesis, which is normally stated in terms of complex zeros of the Riemann Zeta function. Here, we will describe a recent characterization of the Riemann Hypothesis more suited to a computer scientist's background based on growth of integer functions.</P>
<P>Consider the <A class="" href="http://en.wikipedia.org/wiki/Divisor_function" mce_href="http://en.wikipedia.org/wiki/Divisor_function">sum-of-divisors function</A> σ(<EM>n</EM>) that sums up the divisors of the positive integer <EM>n</EM>. For example, σ(15) = 1 + 3 + 5 + 15 = 24. If you've studied asymptotic notation, a natural question to ask is: how fast does σ(<EM>n</EM>) grow?</P>
<P>It's clear that since 1 and <EM>n</EM> always divides <EM>n</EM>, σ(<EM>n</EM>) ≥ n + 1, so σ(<EM>n</EM>) is Ω(<EM>n</EM>). But σ(<EM>n</EM>) cannot be O(n), because σ(<EM>n</EM>)/<EM>n</EM> takes on arbitrarily large values. To see this, consider <EM>n</EM> equal to <EM>k</EM> primorial, the product of the first <EM>k </EM>primes. Because σ(<EM>n</EM>)/<EM>n</EM> is <A class="" href="http://en.wikipedia.org/wiki/Multiplicative_function" mce_href="http://en.wikipedia.org/wiki/Multiplicative_function">multiplicative</A>, σ(<EM>n</EM>)/<EM>n</EM> will equal the product of σ(<EM>p</EM>)/<EM>p</EM> = (1+<EM>p</EM>)/<EM>p</EM> = (1 + 1/p) for each prime factor. But if you expand out this product, you get a sum including 1/p for each prime factor, and the sums of the reciprocals of the primes, also known as the <A class="" href="http://mathworld.wolfram.com/HarmonicSeriesofPrimes.html" mce_href="http://mathworld.wolfram.com/HarmonicSeriesofPrimes.html">harmonic series of primes</A>, diverges. But the harmonic series of primes grows quite slowly, and is O(ln ln <EM>n</EM>).</P>
<P>From this we might conjecture that σ(<EM>n</EM>) grows at a rate of <EM>n</EM> ln ln <EM>n</EM>. <A class="" href="http://mathworld.wolfram.com/GronwallsTheorem.html" mce_href="http://mathworld.wolfram.com/GronwallsTheorem.html">Gronwall's theorem</A>, proven in 1913, shows that in fact:</P>
<P>lim sup<SUB><EM>n</EM>→∞ </SUB>σ(<EM>n</EM>)<EM> </EM>/ (<EM>n </EM>ln ln <EM>n</EM>) = <EM>e</EM><SUP>γ</SUP>.</P>
<P>Here, <EM><A class="" href="http://en.wikipedia.org/wiki/E_%28mathematical_constant%29" mce_href="http://en.wikipedia.org/wiki/E_%28mathematical_constant%29">e</A></EM> and γ are both constants (the latter is the <A class="" href="http://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant" mce_href="http://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant">Euler-Mascheroni constant</A>), so <EM>e</EM><SUP>γ</SUP> is a constant, equal to about 1.781. This is somewhat stronger than the statement that σ(<EM>n</EM>) is O(<EM>n</EM> ln ln <EM>n</EM>) - the <A class="" href="http://en.wikipedia.org/wiki/Limit_superior_and_limit_inferior" mce_href="http://en.wikipedia.org/wiki/Limit_superior_and_limit_inferior">lim sup</A> actually says that for any ε > 0, some tail of the sequence σ(<EM>n</EM>)<EM> </EM>/ (<EM>n </EM>ln ln <EM>n</EM>) must be bounded above by <EM>e</EM><SUP>γ</SUP> + ε.</P>
<P>But instead of a lim sup, what we'd really like is to have a hard bound; we'd like to say:</P>
<P>σ(<EM>n</EM>)<EM> </EM>/ (<EM>n </EM>ln ln <EM>n</EM>) < <EM>e</EM><SUP>γ</SUP> for all <EM>n</EM>.</P>
<P>This is false, and a quick search identifies 27 small counterexamples: 2, 3, 4, 5, 6, 8, 9, 10, 12, 16, 18, 20, 24, 30, 36, 48, 60, 72, 84, 120, 180, 240, 360, 720, 840, 2520, and 5040. However, no search to date has located any larger counterexamples. From this we might conjecture:</P>
<P>σ(<EM>n</EM>)<EM> </EM>/ (<EM>n </EM>ln ln <EM>n</EM>) < <EM>e</EM><SUP>γ</SUP> for all <EM>n </EM>> 5040.</P>
<P>In 1984, Guy Robin showed that in fact, this statement is true if and only if the Riemann hypothesis is true. This is <A class="" href="http://mathworld.wolfram.com/RobinsTheorem.html" mce_href="http://mathworld.wolfram.com/RobinsTheorem.html">Robin's theorem</A>. Thus, the Riemann hypothesis could, in theory, be proven false by exhibiting a single positive integer <EM>n</EM> > 5040 such that σ(<EM>n</EM>) > <EM>e</EM><SUP>γ </SUP>(<EM>n </EM>ln ln <EM>n</EM>). Because most mathematicians believe the Riemann hypothesis is true (just as most computer scientists believe P ≠ NP) it's more likely that the above statement is true, and a proof of this fact would win you $1,000,000.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=3901480" width="1" height="1">TheoryCache-oblivious data structureshttp://blogs.msdn.com/b/devdev/archive/2007/06/12/cache-oblivious-data-structures.aspxTue, 12 Jun 2007 23:54:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:3256830MSDNArchive10http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=3256830http://blogs.msdn.com/b/devdev/archive/2007/06/12/cache-oblivious-data-structures.aspx#comments<P>In most data structure and algorithms classes, the model used for basic analysis is the traditional RAM model: we assume that we have a large, random-access array of memory, and count the number of simple reads/writes needed to perform the algorithm. For example, selection sort takes about n(n-1)/2 reads and 2n writes to sort an array of n numbers. This model was relatively accurate up through the 1980s, when CPUs were slow enough and RAM small enough that reading and writing the RAM directly didn't create a significant bottleneck. Nowadays, however, typical desktop CPUs possess deep cache hierarchies - at least a sizable L1 and L2 cache - to prevent runtime from being dominated by RAM accesses. Data structures and algorithms that efficiently exploit the cache can perform dramatically more quickly than those that don't, and our analysis needs to take this into account.</P>
<P>For example, suppose you have a list of bytes stored in both an array and a linked list. Provided the linked list nodes are allocated at random positions, iterating over the array will be dramatically faster, exhibiting optimal locality of reference, while iterating over the linked list will trigger a cache miss for every element. We can formalize this: assume a cache line - the size of the block read into the cache at one time - has size B bytes. Then traversing the linked list requires n cache misses, while traversing the array requires precisely ceiling(n/B) cache misses, a speedup of B times.</P>
<P>It is possible to design data structures that specifically take advantage of the cache line size B. For example, <A class="" href="http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx" target=_blank mce_href="http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx">in a previous post</A> I described an unrolled linked list data structure where each node fills up a cache line, achieving similar cache performance to arrays while still allowing efficient insertions and deletions. Such a data structure is termed "cache-aware", because it <EM>explicitly</EM> takes knowledge about the cache model into account in its construction.</P>
<P>Unfortunately, this approach has a compelling disadvantage: one has to tune the data structure based on the cache line size B. This is a problem for many reasons: first of all, programs, even those compiled to native machine code, are often run on many different processors with different cache line sizes. A Pentium 4 can run the same machine code as a Pentium II, but has a very different cache model. But even when running on a fixed machine, these simple cache-tuned data structures can't take advantage of multilevel caches, because each one has a different size and a different block size. One can attempt to build a data structure for a specific cache hierarchy, but this is even less robust against a change of machine.</P>
<P>In 1999, in his masters thesis, Harold Prokop came up with an interesting solution to this problem: the idea of a <A class="" href="http://citeseer.ist.psu.edu/prokop99cacheobliviou.html" mce_href="http://citeseer.ist.psu.edu/prokop99cacheobliviou.html"><EM>cache-oblivious</EM> algorithm</A>. This does not mean that the algorithm does not take advantage of the cache; to the contrary, it does so quite effectively. What it means is that the algorithm does not need to know the cache line size; it works effectively for <EM>all</EM> cache line sizes B simultaneously. This allows the algorithm to robustly exploit caches across many machines without machine-specific tuning. More importantly, it allows the algorithm to exploit multilevel caches effectively: because it works for all B, it applies between each two adjacent levels of cache, and so every level is well-utilized.</P>
<P>In fact, the multilevel advantage extends beyond simply taking advantage of both the L1 and L2 caches - such algorithms perform well even when the storage hierarchy is expanded to encompass much slower media such as slow memory, Flash, compressed memory, disks, or even network resources, without knowledge of block sizes or access patterns. For example, cache-oblivious B-trees can be used to implement huge database indexes that simultaneously minimize disk reads and cache misses.</P>
<P>Note that just because the algorithm doesn't depend on B doesn't mean the <EM>analysis</EM> of the algorithm doesn't depend on B. Let's take an example: iterating over a simple array. As noted earlier, this requires about n/B cache misses, which is optimal. But neither the array's structure nor the iteration algorithm explicitly take B into account. Consequently, it works for all B. In a multilevel cache, all data prefetched into every cache is used. This is the theoretical benefit of cache-oblivious algorithms: we can reason about the algorithm using a very simple two-level cache model, and it automatically generalizes to a complex, deep cache hierarchy with no additional work.</P>
<P>While cache-oblivious algorithms are clearly useful, at first it's not clear that there even exist any other than simple array iteration. Thankfully, extensive recent research has revealed cache-oblivious data structures and algorithms for a multitude of practical problems: <A class="" href="http://supertech.csail.mit.edu/cacheObliviousBTree.html" mce_href="http://supertech.csail.mit.edu/cacheObliviousBTree.html">searching (binary trees)</A>, <A class="" href="http://portal.acm.org/citation.cfm?id=1227161.1227164" mce_href="http://portal.acm.org/citation.cfm?id=1227161.1227164">sorting</A>, <A class="" href="http://citeseer.ist.psu.edu/bender02localitypreserving.html" mce_href="http://citeseer.ist.psu.edu/bender02localitypreserving.html">associative arrays</A>, <A class="" href="http://citeseer.ist.psu.edu/prokop99cacheobliviou.html" mce_href="http://citeseer.ist.psu.edu/prokop99cacheobliviou.html">FFT computation</A>, <A class="" href="http://www5.in.tum.de/~bader/publikat/matmult.pdf" mce_href="http://www5.in.tum.de/~bader/publikat/matmult.pdf">matrix multiplication</A> and <A class="" href="http://www.springerlink.com/content/1cg4uvtb45mmyq3n/" mce_href="http://www.springerlink.com/content/1cg4uvtb45mmyq3n/">transposition</A>, <A class="" href="http://citeseer.ist.psu.edu/arge02cacheoblivious.html" mce_href="http://citeseer.ist.psu.edu/arge02cacheoblivious.html">priority queues</A>, <A class="" href="http://www.cs.utexas.edu/~vlr/papers/spaa04.ps" mce_href="http://www.cs.utexas.edu/~vlr/papers/spaa04.ps">shortest path</A>, <A class="" href="http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p05.pdf" mce_href="http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p05.pdf">query processing</A>, <A class="" href="http://portal.acm.org/citation.cfm?id=1137883" mce_href="http://portal.acm.org/citation.cfm?id=1137883">orthogonal range searching</A>, and more emerging every year. Practical comparison studies have shown a significant performance gain on real processors compared to traditional algorithms, although carefully-tuned cache-aware algorithms still enjoy an advantage on the specific machines they are tuned for (see e.g. <A class="" href="http://www.brics.dk/~gerth/pub6.html" mce_href="http://www.brics.dk/~gerth/pub6.html">[1][2]</A>).</P>
<P>Many of the cache-oblivious data structures and algorithms that have been published are relatively complex, but here I'll describe a simple one just to give you a feel for it. This discussion is based on parts of <A class="" href="http://supertech.csail.mit.edu/papers/cobtree.pdf" mce_href="http://supertech.csail.mit.edu/papers/cobtree.pdf">this paper</A>.</P>
<P>Consider the following simple search problem: we have a static list of records, and we wish to find the record with a particular key. Traditionally, this problem is solved with either an array and binary search, or a binary search tree. Both of these approaches exhibit dismal cache behavior. Database applications rely on B-trees, which group several records into each block. This greatly improves performance, but requires knowledge of the block size, and does not work across all levels of the cache hierarchy.</P>
<P>The key is the <EM>van Emde Boas</EM> <EM>layout</EM>, named after the van Emde Boas tree data structure conceived in 1977 by Peter van Emde Boas. Suppose for simplicity that the number of elements in a power of 2. We create a complete binary tree containing our records (where by "complete" we mean that all leaves are at the same level). This tree will have height <EM>h</EM> = log<SUB>2</SUB><EM>n</EM>.</P>
<P>Take a look at the first <EM>h</EM>/2 levels of the tree. This subtree has 2<SUP><EM>h</EM>/2</SUP> leaves, each itself the root of a tree of height <EM>h</EM>/2. In the van Emde Boas layout, we first lay out the subtree of height <EM>h</EM>/2 rooted at the root, followed by the subtrees rooted at each leaf of this tree from left to right. Each subtree is recursively laid out using the van Emde Boas layout. An example diagram is shown in Figure 1 of the paper.</P>
<P>The analysis proceeds like this: at each step of the recursion, the size of the subtrees being laid out is the square root of the size of the subtrees laid out at the previous step. Consequently, as we recurse, at some point we will be laying out subtrees between sqrt(B) and B in size (let's call it C). Each of these subtrees can be retrieved in a single memory transfer, and this layout partitions the tree into subtrees of this size.</P>
<P>The search procedure works like this: we access the root, which pulls in the first log<SUB>2</SUB>C levels of the tree. We have only cache hits until we access a leaf of the tree, which pulls in the next log<SUB>2</SUB>C levels of the tree rooted at that leaf. This continues until we reach the bottom of the tree. Because only 1 out of every log<SUB>2</SUB>C accesses causes a cache miss, the total number of misses is log<SUB>2</SUB>n/log<SUB>2</SUB>C = log<SUB>C</SUB>n. Because C is at least sqrt(B), log<SUB>2</SUB>C is at least (log<SUB>2</SUB>B)/2, and the total number of misses is at most 2log<SUB>B</SUB>n.</P>
<P>We made one small assumption above: that each subtree is aligned with the cache block boundaries; if this is not true, each subtree can require two accesses, bringing the final maximum to 4log<SUB>B</SUB>n. Probabilistic analysis shows that with a random starting point in memory and sufficiently large block size, the real constant is closer to 2. </P>
<P>This is the same asymptotic performance as B-trees, and is faster than ordinary binary trees by a factor of about log<SUB>2</SUB>n/2log<SUB>B</SUB>n = (log<SUB>2</SUB>B)/2. For example, for a disk block size of 2048 elements, this would be a speedup of about 6.5 times. The advantage is more than theoretical: to quote the paper, "preliminary experiments have shown, surprisingly, that CO [cache-oblivious] B-trees can outperform traditional B-trees, sometimes by factors of more than 2". </P>
<P>But as practical as the research is in cache-oblivious algorithms, many applications and libraries have yet to take advantage of them. The data structures supplied with .NET, Java, Lisp, and so on are not cache-oblivious. We need to start putting this research into practice and reaping the benefits. I hope this gives you some insight into cache-oblivious algorithms, and I hope you will consider taking advantage of it the next time you develop a product whose performance critically depends on memory access patterns.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=3256830" width="1" height="1">Data structuresK-nearest neighbor spatial searchhttp://blogs.msdn.com/b/devdev/archive/2007/06/07/k-nearest-neighbor-spatial-search.aspxFri, 08 Jun 2007 03:23:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:3154227MSDNArchive7http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=3154227http://blogs.msdn.com/b/devdev/archive/2007/06/07/k-nearest-neighbor-spatial-search.aspx#comments<P>I apologize to everyone for the hiatus - I realise a post is more than a little overdue and will try to be more regular in the future. I appreciate the support that I've received for the blog and continuing it.</P>
<P>Consider the following problem: you're starting a website for a chain of restaurants, and you want to have a "Locations" feature where a person can enter their address and get a list of the 10 outlets closest to their location, in order of distance from their location (as the crow flies, for simplicity). Moreover, each time they hit the "next" button, they should get the next 10 locations in order of distance.</P>
<P>More abstractly, say you have a list of locations on a 2D map, each specified as a point with an x and y coordinate. You can do any sort of preprocessing you like of this list of points. Then I ask you to efficiently answer: given some point, which is not necessarily one of the points in the list, produce a list of the <EM>k</EM> points closest to that point, in order of their distance from that point.</P>
<P>This problem, called the <EM>k</EM>-nearest neighbor problem, is very well-studied and has a variety of solutions. Here we describe one very simple but effective incremental algorithm first proposed by Hjaltason and Samet in their "<SPAN class=l>Ranking in Spatial Databases" in 1995 (<A class="" href="http://citeseer.ist.psu.edu/hjaltason95ranking.html" target=_blank mce_href="http://citeseer.ist.psu.edu/hjaltason95ranking.html">Citeseer</A>). A spatial database is a database specialized to perform this type of query over its (potentially very large) data set.</SPAN></P>
<P><SPAN class=l>Consider the following very simple but inefficient algorithm: we take each point, one at a time, and add it to a priority queue with its key equal to the distance from the search point. We then extract minimum repeatedly to obtain the points in order of their distance from the search point. Once we've extracted the first <EM>k</EM>, we're done, and with a good data structure this requires only O(<EM>k</EM> log <EM>n</EM>) time. Unfortunately, even with the best priority queue data structures based on Fibonacci heaps, adding all <EM>n</EM> points requires amortized linear (O(<EM>n</EM>)) time and space, which for a very large number of points as one would expect to find in a spatial database is prohibitive.</SPAN></P>
<P><SPAN class=l>One approach to narrowing down the points we have to examine is to split the space into larger regions. The idea is that points close to the search point should be found in regions close to the search point. </SPAN><SPAN class=l>In more detail, w</SPAN><SPAN class=l>e first divide the space into two rectangular regions, each containing some of the points. Each of these regions is, in turn, subdivided into two rectangular regions, and we continue in this manner until all the resulting regions have some small constant number of points in them. The result is a tree of regions, where each node is a subregion of its parent node, and each point lies in one of the leaves. It really doesn't matter how we choose to divide up the space - the authors used PMR quadtrees, but you could use kd-trees, simple quadtrees, or whatever you like. All that matters is that in the end each leaf region contains only a small number of points.</SPAN></P>
<P><SPAN class=l>Creating the regions requires examining all the points, but the key is that this part does not depend on the search point, and so only has to be done once in preprocessing. The regions may require rebalancing as points are added or removed, but this can be done efficiently as well.</SPAN></P>
<P><SPAN class=l>However, there's a trick to this: if the search point lies near the boundary of a region, its closest points may actually lie in a different, adjacent region. How can we correctly locate the nearest point without spuriously gathering all the points in all the regions around the point?</SPAN></P>
<P><SPAN class=l>Here's the key insight: our priority queue will contain two types of objects, single points and regions. The key of each point is set to its distance from the search point, while the key of each region is set to the <EM>minimum</EM> distance of that region from the search point. Each time we extract a point from the queue, we output it; each time we extract a region from the queue, we expand it by adding its subregions to the queue, or if it's a leaf region, the points in that region. Because the region's key is set to the <EM>minimum</EM> distance of that region from the search point, any point that is outputted before the region is expanded is closer to the search point than <EM>any</EM> point in the region. This ensures correctness.</SPAN></P>
<P><SPAN class=l><IMG title="k-nearest neighbor spatial search example" style="WIDTH: 300px; HEIGHT: 514px" height=514 alt="k-nearest neighbor spatial search example" src="http://moonflare.com/blogfiles/devdev/Spatial_search_1.png" width=300 align=right mce_src="http://moonflare.com/blogfiles/devdev/Spatial_search_1.png"></SPAN></P>
<P><SPAN class=l>For example, consider the search problem shown to the right. There are 9 points, and we wish to find the 5 closest points to the big red search point in order. The space is recursively subdivided into regions as shown in the tree.</SPAN></P>
<P><SPAN class=l>For the purpose of discussion we will refer to each region by the set of points it contains. Initially, the </SPAN><SPAN class=l>queue contains only the large region containing all points.</SPAN></P>
<P><SPAN class=l>Current queue in order: {1,2,3,4,5,6,7,8,9}</SPAN></P>
<P><SPAN class=l></SPAN><SPAN class=l>This contains the red search point, so its minimum distance from the search point (its key) is zero. We remove it from the queue and add its two children {1,3,4,8} and {2,5,6,7,9}. The region {2,5,6,7,9} again contains the red search point and so has key zero. The minimum distance to the region {1,3,4,8} is shown by the red line extending out of the red point and to the left and is larger than zero.</SPAN></P>
<P><SPAN class=l>Current queue in order: {2,5,6,7,9}, {1,3,4,8}</SPAN></P>
<P><SPAN class=l>We extract and expand {2,5,6,7,9}, getting leaf region {2} and region {5,6,7,9}. The distance to region {2} is shown by the red line extending upward from the red point, which is clearly longer than the distance to region {1,3,4,8}. The region {5,6,7,9} has distance zero.</SPAN></P>
<P><SPAN class=l>Current queue in order: {5,6,7,9}, {1,3,4,8}, {2}</SPAN></P>
<P><SPAN class=l></SPAN><SPAN class=l>We extract and expand {5,6,7,9} to get the regions {5,9} and {6,7}. The distance to region {6,7} is shown by the red line extending right from the red point, the longest distance so far encountered. The region {5,9} contains the red point and has distance zero.</SPAN></P>
<P><SPAN class=l>Current queue in order: {5,9}, {1,3,4,8}, {2}, {6,7}</SPAN></P>
<P><SPAN class=l>Next, we extract and expand region {5,9}. This is a leaf region, so we add the points 5, 9 to the queue. Point 5 is only slightly closer to the red point than region {2}, but farther than region {1,3,4,8}. Point 9 is farther than everything so far encountered.</SPAN></P>
<P><SPAN class=l>Current queue in order: {1,3,4,8}, 5, {2}, {6,7}, 9</SPAN></P>
<P><SPAN class=l>We extract and expand {1,3,4,8} to get {1,3,4}, at the same distance as {1,3,4,8}, and {8}, the distance of which is shown by the diagonal red line.</SPAN></P>
<P><SPAN class=l>Current queue in order: {1,3,4}, {8}, 5, {2}, {6,7}, 9</SPAN></P>
<P><SPAN class=l>We extract and expand {1,3,4}, which is a leaf region, and add the three points. Point 4 is closest, point 3 lies between {6,7} and 9 in distance, and point 1 is slightly closer than point 9.</SPAN></P>
<P><SPAN class=l>Current queue in order: 4, {8}, 5, {2}, {6,7}, 3, 1, 9</SPAN></P>
<P><SPAN class=l>We extract and output point 4, the closest point. We expand region {8} and add point 8, which is slightly farther than point 3.</SPAN></P>
<P><SPAN class=l>Current queue in order: 5, {2}, {6,7}, 3, 8, 1, 9</SPAN></P>
<P><SPAN class=l>We extract and output point 5. We expand region {2} and add point 2, which is slightly closer than point 8.</SPAN></P>
<P><SPAN class=l>Current queue in order: {6,7}, 3, 2, 8, 1, 9</SPAN></P>
<P><SPAN class=l>We extract and expand region {6,7}, adding points 6 and 7. Point 6 is farthest yet and point 7 even farther.</SPAN></P>
<P><SPAN class=l>Current queue in order: 3, 2, 8, 1, 9, 6, 7</SPAN></P>
<P><SPAN class=l>We've now expanded all regions and the remaining points in the queue are the remaining points not yet outputted in order.</SPAN></P>
<P><SPAN class=l>The nice thing about this particular algorithm is that it's fully incremental: we could have stopped at any time once we had enough points, and pick up later when we want more. This makes it quite appealing in database queries, where you don't really know how many points you need in advance, because you might want to filter the results according to additional attributes. In practice, in a typical spatial database, one only needs to expand a small fraction of the regions in order to find a small number of closest points. The idea can be extended into higher-dimensional regions, spatial objects other than points such as areas and line segments, and even more general problems where one can easily compute the minimum key of a subtree of objects.</SPAN></P>
<P><SPAN class=l>I hope you found this informative and I appreciate your comments.</SPAN></P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=3154227" width="1" height="1">Functional list processing in C# 2.0 with anonymous delegateshttp://blogs.msdn.com/b/devdev/archive/2006/06/30/652802.aspxSat, 01 Jul 2006 01:16:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:652802MSDNArchive12http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=652802http://blogs.msdn.com/b/devdev/archive/2006/06/30/652802.aspx#comments<P>One of the benefits of functional languages is their great flexibility in list manipulation, which enables them to express certain computations concisely that would require one or more verbose loops in procedural languages. Many of the features that functional programmers take for granted such as first class functions, persistent data structures, and garbage collection were included in Lisp partly for the purpose of convenient list processing (Lisp stands for "LISt Processing").</P>
<P>Although C# is not a functional language, in C# 2.0 and the accompanying .NET Framework 2.0, a number of constructs and library calls have been added that allow generic List objects to be manipulated in a more functional way. The most important of these is anonymous delegates, which are the same construct that functional programmers call <EM>closures</EM>. We start out with the following method, which removes all elements from a list that match a given condition, passed in as a delegate:</P><PRE> delegate bool Predicate<T>(T arg);
static void RemoveAll<T>(List<T> list, Predicate<T> condition)
{
int i=0;
while (i < list.Count)
{
if (condition(list[i]))
list.RemoveAt(i);
else
i++;
}
}
</PRE>
<P>The use of the word "predicate" here comes from the mathematical predicate, which is just a function that yields a boolean true/false value.</P>
<P>Now say we have a list of integers, and we want to remove all the multiples of <EM>x</EM>, where <EM>x</EM> is a given number. In C# 1.0, using RemoveAll for this is so no easy task, requiring a temporary object:</P><PRE> class RemoveMultiplesOfHelper
{
private int x;
public RemoveMultiplesOfHelper(int x)
{
this.x = x;
}
public bool IsMultiple(int n)
{
return (n % x) == 0;
}
}
static void RemoveMultiplesOf(List<int> list, int x)
{
RemoveAll<int>(list, new RemoveMultiplesOfHelper(x).IsMultiple);
}
</PRE>
<P>The advantages of reusing RemoveAll are important: it's tricky to iterate through a list while removing elements from it, and programmers often introduce subtle bugs while attempting to do so. But this code has serious problems too: it's a hassle to write, it's verbose, the "temporary" class isn't very reusable and is difficult to name, and the code is spread all over the place, making the flow of execution difficult to follow. Because of these issues, many procedural programmers opt to reinvent the wheel instead.</P>
<P>Anonymous delegates solve all these problems. An anonymous delegate allows you to write code right in the middle of a method which has access to all the local variables in that method, and then package it up and pass it to another method to execute later. Here's our example using anonymous delegates:</P><PRE> static void RemoveMultiplesOf(List<int> list, int x)
{
RemoveAll<int>(list, delegate(int y) { return (y % x) == 0; });
}
</PRE>
<P>In this much briefer implementation, the delegate "captures" the local value of x at the instant of its creation, then later RemoveAll uses it in computations.</P>
<P>Another handy trick you can do with anonymous delegates is take a method signature that doesn't quite fit an existing delegate and translate it into one that does. For example, suppose you're sorting a List using the overload of the Sort method that takes a System.Comparison delegate. Now say you're sorting a list of strings, and you have a string comparison method that takes a boolean determining whether the comparison is case-sensitive:</P><PRE> delegate int Comparison<T>(T left, T right);
void Sort<T>(Comparison<T> comparison);
int StringCompare(string left, string right, bool caseSensitive);
</PRE>
<P>At first you appear to be in quite a pickle, as the signature of the <CODE>StringCompare</CODE> method does not exactly match the delegate signature <CODE>Comparison<string></CODE>. Anonymous delegates make it easy to overcome this problem:</P><PRE> static void SortStrings(List<string> a, bool caseSensitive)
{
a.Sort(delegate(string left, string right) {
return StringCompare(left, right, caseSensitive);
});
}
</PRE>
<P>We simultaneously match the desired delegate signature and push the decision of whether to sort case-sensitively or not to the caller, where it belongs, all with very little new code.</P>
<P>The advantages of anonymous delegates make functions like our Sort and RemoveAll considerably more useful. Consequently, a number of such functions were included with the generic List class in .NET 2.0:</P>
<UL>
<LI><A href="http://msdn2.microsoft.com/en-us/library/wdka673a.aspx">RemoveAll(Predicate<T>)</A>: Like our RemoveAll implementation above, removes all elements from a list matching a predicate.
<LI><A href="http://msdn2.microsoft.com/en-us/library/w56d4y5z.aspx">Sort(Comparison<T>)</A>: Mentioned above, this is a Sort method that takes a delegate. Sorting with anonymous delegates is very handy for unusual sorting orders; for example, the following sorts a list of integers in reverse order: <PRE>list.Sort(delegate(int x, int y) { return y.CompareTo(x); });
</PRE>
<LI><A href="http://msdn2.microsoft.com/en-us/library/x0b5b5bc.aspx">Find(Predicate<T>)</A>: Finds the first element in a list satisfying a predicate. Handy for replacing linear searches. For example, this single line of code identifies the first file in the current directory modified in the last hour, or yields null if none: <PRE>new List<string>(Directory.GetFiles(Directory.GetCurrentDirectory())).Find(
delegate(string path) { return File.GetLastWriteTime(path) >=
DateTime.Now - new TimeSpan(1, 0, 0); });
</PRE>
<LI><A href="http://msdn2.microsoft.com/en-us/library/bfed8bca.aspx">Exists(Predicate<T>)</A>: Determines whether a list contains an element satisfying a predicate. Besides its added readability in situations where you only care about the existence of a value, it also has another use: since "Find" returns the default value for the type T to indicate "not found", you need to use Exists for predicates that might match null or zero, such as: <PRE>bool containsEven = list.Exists(delegate(int x) { return (x % 2) == 0; });
</PRE>
<LI><A href="http://msdn2.microsoft.com/en-us/library/kdxe4x4w.aspx">TrueForAll(Predicate<T>)</A>: Similar to Exists, but determines whether the predicate is satisfied by <EM>all</EM> elements of the list. For example, this line of code might run some tests and see if they all pass: <PRE>if (tests.TrueForAll(delegate(Test t) { t.Run(); return t.SuccessCode == 0; })
</PRE>
<LI><A href="http://msdn2.microsoft.com/en-us/library/fh1w7y8z.aspx">FindAll(Predicate<T>)</A>: Creates a new list containing all elements satisfying a predicate. In functional languages this is sometimes called <EM>select</EM> or <EM>filter</EM>, as we're "filtering out" the parts of the list not satisfying the predicate. Here's an example which, in a single line, gets a list of all processes using more than <EM>k</EM> threads: <PRE>List<Process> x = new List<Process>(Process.GetProcesses()).FindAll(
delegate(Process p) { return p.Threads.Count > k; });
</PRE>
<LI><A href="http://msdn2.microsoft.com/en-us/library/0k601hd9.aspx">FindIndex(Predicate<T>)</A>: Just like Find, except that instead of returning the first <EM>value</EM> that satisfies the predicate, it returns its <EM>index</EM> in the list. For example, this expression yields the index of the first alphabetical character in a string: <PRE>new List<CHAR>(s.ToCharArray()).FindIndex(Char.IsLetter)</PRE>
<LI><A href="http://msdn2.microsoft.com/en-us/library/xzs5503w.aspx">FindLastIndex(Predicate<T>)</A>: Same as FindIndex, except that it finds the <EM>last</EM> occurrence satisfying the predicate.
<LI><A href="http://msdn2.microsoft.com/en-us/library/bwabdf9z.aspx">ForEach(Predicate<T>)</A>: Simply executes the given delegate on each member of the list. Although just using a foreach loop is cleaner than using ForEach with an anonymous delegate, ForEach is still handy if you already have a named delegate, as in this example that deletes all the files in the current directory: <PRE>new List<string>(Directory.GetFiles(Directory.GetCurrentDirectory())).ForEach(File.Delete);
</PRE>
<LI><A href="http://msdn2.microsoft.com/en-us/library/73fe8cwf.aspx">ConvertAll(Predicate<T>)</A>: Perhaps the most powerful of the delegate-based List methods, ConvertAll feeds each element of the list through a conversion function, producing a new list. In functional languages this is often called <EM>map</EM> because, like a mathematical function, it "maps" one set to another. Here's a simple example which produces a list of strings that are the lowercased versions of the strings in the original input list: <PRE>list.ConvertAll<STRING>(delegate(string s) { return s.ToLower(); });
</PRE></LI></UL>
<P>What's more, besides being individually useful, complex transformations can be achieved by chaining these methods together in clever ways. For example, here's two lines of code that produce a list of all files larger than a given size in a given directory: <PRE> static List<string> GetBigFiles(string directory, int bigLength)
{
List<string> paths = new List<string>(Directory.GetFiles(directory));
return paths.ConvertAll<FileStream>( File.OpenRead )
.FindAll( delegate(FileStream f) { return f.Length >= bigLength; } )
.ConvertAll<string>( delegate(FileStream f) { return f.Name; } );
}
</PRE>
<P></P>
<P>The first ConvertAll opens all the files by using the static library method "File.OpenRead" as the converter. The FindAll filters out just the FileStream objects corresponding to large files. The final ConvertAll extracts the filenames from each stream. Each of these "list filters" is independently reusable. At first this way of programming may seem unusual or cumbersome, but just compare the size of the above method to a procedural implementation.</P>
<P>Don't forget also that you can write your own methods taking delegates to reap more benefits from anonymous delegates. A frequently useful application is for error handling and debugging code. Suppose you find yourself writing several methods that all look like this:</P><PRE> void foo()
{
Debug.Write("entering foo()");
try
{
// Do some stuff
}
catch (Exception e)
{
// Do some stuff
throw;
}
Debug.Write("exiting foo()");
}
</PRE>
<P>It's not clear how to factor out the two "Do some stuff" parts, since they're nested right in the middle of the method. Once again, anonymous delegates are the answer. We can write a method like this taking delegates for the parts that get filled in:</P><PRE> delegate void DoAction();
delegate bool ProcessException(Exception e);
void DebugInvoke(string name, DoAction action,
ProcessException processException)
{
Debug.Write("entering " + name);
try
{
a();
}
catch (Exception e)
{
if (processException(e))
{
throw;
}
}
Debug.Write("exiting " + name);
}
</PRE>
<P>Now we can mix-and-match try bodies and catch bodies at will, and get the tracing for free:</P><PRE> void foo(int x, int y)
{
int z = 0;
DebugInvoke("addFoo", delegate() { z = x + y; }, OverflowHandler);
DebugInvoke("addBar", delegate() { z = z + x; },
delegate(Exception e) { Debug.Assert(false); return true; });
DebugInvoke("addBaz", delegate() { y = z + z; }, OverflowHandler);
}
static bool OverflowHandler(Exception e)
{
if (e is OverflowException)
{
Debug.Write("Encountered overflow");
return true;
}
return false;
}
</PRE>
<P>Unfortunately, anonymous delegate syntax, while much more concise than the alternative, is still a bit verbose, requiring not only the word "delegate" but full argument lists. One of the goals of the next version of C# is to encourage the use of these functional idioms, partly by introducing a new closure syntax that will take advantage of type inference to overcome this syntactic overhead.</P>
<P>Just remember to keep this rule of thumb in mind: if you're iterating over a list, and the body of the foreach is pretty small, chances are you don't need to. See if you can do the same thing more concisely and with less risk using a combination of the above methods and anonymous delegates (and amaze your coworkers). I hope you found this article helpful.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=652802" width="1" height="1">Factoring large numbers with quadratic sievehttp://blogs.msdn.com/b/devdev/archive/2006/06/19/637332.aspxMon, 19 Jun 2006 22:13:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:637332MSDNArchive20http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=637332http://blogs.msdn.com/b/devdev/archive/2006/06/19/637332.aspx#comments<P>Today I'm going to talk about how the quadratic sieve factoring algorithm works, giving a comprehensive description assuming knowledge of only basic university-level mathematics.</P>
<P>The foundation of the most popular public-key cryptography algorithm in use today, RSA, rests on the difficulty of factoring large integers. When keys are generated, efficient algorithms are used to generate two very large prime numbers and multiply them together. The person who generated the keys knows these two numbers, but everyone else only knows the product. The product contains enough information to encrypt a message to the person; the two primes allow the recipient to decrypt it. There is no known way to decrypt it without using the primes, but by factoring, we can extract the two prime factors from the product and break the encryption.</P>
<P>At the time that RSA was invented in 1977, factoring integers with as few as 80 decimal digits was intractable; all known algorithms were either too slow or required the number to have a special form. This made even small, 256-bit keys relatively secure. The first major breakthrough was quadratic sieve, a relatively simple factoring algorithm invented by Carl Pomerance in 1981, which can factor numbers up to 100 digits and more. It's still the best known method for numbers under 110 digits or so; for larger numbers, the general number field sieve (GNFS) is now used. However, the general number field sieve is <EM>extremely complicated</EM>, and requires extensive explanation and background for even the most basic implementation. However, GNFS is based on the same fundamental ideas as quadratic sieve, so if factoring the largest numbers in the world is your goal, this is the place to start.</P>
<P>We'll begin by addressing a few problems that at first glance have nothing to do with factoring, then assemble them into a working algorithm. I won't be motivating them first, but trust me - they're important.</P>
<H2>Finding a subset of integers whose product is a square</H2>
<P>Suppose I give you a set of integers and I ask you to find a subset of those integers whose product is a square, if one exists. For example, given the set {10, 24, 35, 52, 54, 78}, the product 24×52×78 is 97344 = 312<SUP>2</SUP>. The brute-force solution, trying every subset, is too expensive because there are an exponential number of subsets.
<P>We'll take a different approach based on prime factorizations and linear algebra. First, we factor each of the input numbers into prime factors; for now we will assume that these numbers are easy to factor. For the above example set, we get:
<P>10 = 2 × 5<BR>24 = 2<SUP>3</SUP> × 3<BR>35 = 5 × 7<BR>52 = 2<SUP>2</SUP> × 13<BR>54 = 2 × 3<SUP>3</SUP><BR>78 = 2 × 3 × 13</P>
<P>When you multiply two numbers written as prime factorizations, you simply add the exponents of the primes used. For example, the exponent of 2 in 24×52×78 is 6, because it's 3 in 24, 2 in 52, and 1 in 78. A number is a square if and only if all the exponents in its prime factorization are even. Suppose we write the above factorizations as vectors, where the <EM>k</EM>th entry corresponds to the exponent of the <EM>k</EM>th prime number. We get:</P>
<P>[1 0 1 0 0 0]<BR>[3 1 0 0 0 0]<BR>[0 0 1 1 0 0]<BR>[2 0 0 0 0 1]<BR>[1 3 0 0 0 0]<BR>[1 1 0 0 0 1]</P>
<P>Now, multiplying numbers is as simple as adding vectors. If we add rows 2, 4, and 6, we get [6 2 0 0 0 2], which has all even exponents and so must be a square. In more familiar terms, we want the last bit of each entry in the sum to be zero. But in this case, we don't need to store all of the numbers above, only the last bit of each number. This gives us the following:</P>
<P>[1 0 1 0 0 0]<BR>[1 1 0 0 0 0]<BR>[0 0 1 1 0 0]<BR>[0 0 0 0 0 1]<BR>[1 1 0 0 0 0]<BR>[1 1 0 0 0 1]</P>
<P>Moreover, since we're only interested in last bits, we can perform all our addition using one-bit integers with wraparound semantics (in other words, mod 2). If we add rows 2, 4, and 6 in this way, we get [0 0 0 0 0 0 0], the zero vector. In fact, all squares correspond to the zero vector.</P>
<P>Let's rephrase this as a matrix equation problem. If we transpose the above matrix, so that rows become columns, we get this:</P>
<P>[1 1 0 0 1 1]<BR>[0 1 0 0 1 1]<BR>[1 0 1 0 0 0]<BR>[0 0 1 0 0 0]<BR>[0 0 0 0 0 0]<BR>[0 0 0 1 0 1]</P>
<P>Call this matrix A. If we multiply A by the vector [0 1 0 1 0 1], using one-bit integer arithmetic, we get the zero vector. This tells us precisely which numbers we need to multiply to get a square. So, our goal is to find a nonzero vector x such that Ax=0 (remember, all arithmetic here is with one-bit integers).</P>
<P>If you've had a course in linear algebra, this problem should look very familiar; it's the problem of finding the null space of a matrix, the set of vectors such that Ax=0. The problem can be solved using row reduction (Gaussian elimination). We row reduce the matrix, and then assign values to the free variables in a way that gives us a nonzero solution. The other variables will be determined by these values and the matrix. You probably studied this problem using rational numbers, not one-bit integers, but it turns out row reduction works just as well for these. For example, if we add row 1 to row 3 in the above matrix, we get the following:</P>
<P>[1 1 0 0 1 1]<BR>[0 1 0 0 1 1]<BR>[0 1 1 0 1 1]<BR>[0 0 1 0 0 0]<BR>[0 0 0 0 0 0]<BR>[0 0 0 1 0 1]</P>
<P>Completing the row reduction, we eventually end up with this matrix:</P>
<P>[1 0 0 0 0 0]<BR>[0 1 0 0 1 1]<BR>[0 0 1 0 0 0]<BR>[0 0 0 1 0 1]<BR>[0 0 0 0 0 0]<BR>[0 0 0 0 0 0]</P>
<P>If we turn this back into a system of equations and rearrange, we get this:</P>
<P><I>x</I><SUB>1</SUB> = 0<BR><I>x</I><SUB>2</SUB> = −<I>x</I><SUB>5</SUB> − <I>x</I><SUB>6</SUB><BR><I>x</I><SUB>3</SUB> = 0<BR><I>x</I><SUB>4</SUB> = −<I>x</I><SUB>6</SUB></P>
<P>Suppose we choose <I>x</I><SUB>5</SUB>=0, <I>x</I><SUB>6</SUB>=1. From the above equations, it follows that the first four vectors have the values 0, 1, 0, and 1 (remember, one-bit integer arithmetic). This gives us our final vector, [0 1 0 1 0 1]. If we were to choose <I>x</I><SUB>5</SUB>=1 and <I>x</I><SUB>6</SUB>=0 instead, we'd get a different solution: [0 1 0 0 1 0], corresponding to 24×54 = 1296 = 36<SUP>2</SUP>.</P>
<P>Moreover, a theorem of linear algebra tells us precisely how many input numbers we need to guarantee that a square can be found: as long as we have more columns than rows, the null space is guaranteed to be nontrivial, so that we have a nonzero solution. In other words, we just need more numbers than prime factors used by those numbers. As this case shows, though, this isn't a necessary condition.</P>
<P>The one remaining problem with this method is that if one of the numbers in our set happens to have very large factors, our matrix will have a large number of rows, which requires a lot of storage and makes row reduction inefficient. To avoid this, we require that the input numbers are <I>B-smooth</I>, meaning that they only have small factors less than some integer <I>B</I>. This also makes them easy to factor.</P>
<H2>Fermat's method: factoring using a difference of squares</H2>
<P>You might be wondering what squares have to do with factoring. The connection is the very simple factorization method known as Fermat's method. Although not efficient in general, it embodies the same basic idea as quadratic sieve and works great for numbers with factors close to their square root.</P>
<P>The idea is to find two numbers <I>a</I> and <I>b</I> such that <I>a</I><SUP>2</SUP>−<I>b</I><SUP>2</SUP> = <I>n</I>, the number we wish to factor. If we can do this, simple algebra tells us that (<I>a</I>+<I>b</I>)(<I>a</I>−<I>b</I>) = <I>n</I>. If we're lucky, this is a nontrivial factorization of <I>n</I>; if we're not so lucky, one of them is 1 and the other is <I>n</I>.</P>
<P>The concept behind Fermat's algorithm is to search for an integer <I>a</I> such that <I>a</I><SUP>2</SUP>−<I>n</I> is a square. If we find such an <I>a</I>, it follows that:</P>
<P><I>a</I><SUP>2</SUP>−(<I>a</I><SUP>2</SUP>-<I>n</I>) = <I>n</I></P>
<P>Hence we have a difference of squares equal to <I>n</I>. The search is a straightforward linear search: we begin with the ceiling of the square root of <I>n</I>, the smallest possible number such that <I>a</I><SUP>2</SUP>−<I>n</I> is positive, and increment <I>a</I> until <I>a</I><SUP>2</SUP>−<I>n</I> becomes a square. If this ever happens, we try to factor <I>n</I> as (<I>a</I>−sqrt(<I>a</I><SUP>2</SUP>−<I>n</I>))(<I>a</I>+sqrt(<I>a</I><SUP>2</SUP>−<I>n</I>)); if the factorization is trivial, we continue incrementing <I>a</I>.</P>
<P>Here's an example of Fermat's method <A href="http://en.wikipedia.org/wiki/Fermat%27s_factorization_method">from Wikipedia</A>. Let <I>n</I>=5959; <I>a</I> starts out at 78. The numbers 78<SUP>2</SUP>−5959 and 79<SUP>2</SUP>−5959 are not squares, but 80<SUP>2</SUP>−5959=441=21<SUP>2</SUP>. Hence (80-21)(80+21)=5959, and this gives the nontrivial factorization 59×101=5959.</P>
<P>The reason Fermat's method is slow is because simply performing a linear search of all possible <I>a</I> hoping that we'll hit one with <I>a</I><SUP>2</SUP>−<I>n</I> square is a poor strategy − there just aren't that many squares around to hit. A better way of going about it is to proactively compute an <I>a</I> having this property (actually a similar property).</P>
<P>The key is to notice that if we take a number of <I>a</I><SUP>2</SUP>−<I>n</I> values, none of which are squares themselves, and multiply them, we may get a square, say <I>S</I>. Let <I>A</I> be the product of the corresponding values of <I>a</I>. Basic algebra shows that <I>A</I><SUP>2</SUP>−<I>S</I> is a multiple of <EM>n</EM></I>. Hence, (<I>A</I>−sqrt(<I>S</I>))(<I>A</I>+sqrt(<I>S</I>)) is a factorization of <I>some multiple</I> of <I>n</I>; in other words, at least one of these shares a factor with <I>n</I>. By computing the greatest common divisor (GCD) of each with <I>n</I> using Euclid's algorithm, we can identify this factor. Again, it may be trivial (just <I>n</I> itself); if so we try again with a different square.</I>
<P>All that remains is, given a list of <I>a</I><SUP>2</SUP>−<I>n</I> values, to find a subset whose product is a square. But this is precisely an instance of the problem discussed in the last section. Unfortunately, recall that that the method we came up with there is not efficient for numbers with large factors; the matrix becomes too large. What do we do? We simply throw away numbers with large factors! Theoretical results show that there are a fairly large number of values in the sequence <I>a</I><SUP>2</SUP>−<I>n</I> that are smooth (recall that smooth numbers have only small factors). This gives us a new factoring method that works pretty well up to a point.</P>
<P>For example, consider the number 90283. If we start <I>a</I> at 301 and increment it up to 360 while computing <I>a</I><SUP>2</SUP>−<I>n</I>, we get the following values:
<P>318, 921, 1526, 2133, 2742, 3353, 3966, 4581, 5198, 5817, 6438, 7061, 7686, 8313, 8942, 9573, 10206, 10841, 11478, 12117, 12758, 13401, 14046, 14693, 15342, 15993, 16646, 17301, 17958, 18617, 19278, 19941, 20606, 21273, 21942, 22613, 23286, 23961, 24638, 25317, 25998, 26681, 27366, 28053, 28742, 29433, 30126, 30821, 31518, 32217, 32918, 33621, 34326, 35033, 35742, 36453, 37166, 37881, 38598, 39317</P>
<P>None of these are squares (the first square occurs at <I>a</I>=398); however, if we factor each value we will discover that 7 of these values have no factor larger than 43:
<P>6438, 10206, 16646, 19278, 19941, 30821, 35742</P>
<P>If we take these 7 values and feed them to the algorithm described in the last section, it finds a square: 19278×19941×30821×35742 = 423481541612104836 = 650754594<SUP>2</SUP>. The corresponding original <I>a</I> were 331, 332, 348, and 355, and their product is 13576057680. Now, we can factor the number:</P>
<P>(13576057680−650754594)(13576057680+650754594) = 12925303086 × 14226812274 is a multiple of 90283<BR>GCD(90283, 12925303086) = 137<BR>GCD(90283, 14226812274) = 659<BR>137 × 659 = 90283.</P>
<H2>Making it faster: sieving for smooth numbers</H2>
<P>The factorization algorithm above is considerably better than Fermat's algorithm, but if we try to scale up the size of number we factor, we quickly encounter a bottleneck: finding the smooth numbers in the sequence. Only 7 of the 60 values we computed in our last example were 43-smooth (actually we were lucky to get a square with so few vectors). As the size of the number that we're factoring grows, so does the size of the numbers in the sequence, and the proportion of smooth numbers rapidly shrinks. Although finding smooth numbers doesn't require completely factoring every number in the sequence (we only have to test primes up to the smoothness limit), it's still too expensive to test every number in the sequence this way.</P>
<P>The key is to observe that the prime factors of the sequence <I>a</I><SUP>2</SUP>−<I>n</I> follow a predictable sequence. Let's take a look at the prime factorizations of the first ten or so numbers in our example sequence above:</P>
<P>318 = 2×3×53<BR>921 = 3×307<BR>1526 = 2×7×109<BR>2133 = 3<SUP>3</SUP>×79<BR>2742 = 2×3×457<BR>3353 = 7×479<BR>3966 = 2×3×661<BR>4581 = 3<SUP>2</SUP>×509<BR>5198 = 2×23×113<BR>5817 = 3×7×277</P>
<P>The most obvious pattern is that every other number is even, beginning with the first one. This should be no surprise, since we're effectively adding 2<EM>a</EM>+1 to get each new number, which is always odd. Also, you'll notice that the first and second numbers are divisible by 3, as are the fourth and fifth, the seventh and eigth, and so on. If you look at the larger list, you'll notice similar patterns for larger primes; for example, the 3rd and 6th numbers are divisible by 7, and every 7th number after each of them as well. And, mysteriously, not one number in our entire sequence is divisible by 5!</P>
<P>So what's going on? The answer involves what number theorists call <I>quadratic residues</I>. A number <I>a</I> is called a <I>quadratic residue mod p</I> if there is some square <I>S</I> such that <I>S</I>−<I>a</I> is divisible by <I>p</I>. Half of all numbers are quadratic residues mod <I>p</I>, regardless of the value of <I>p</I>, and there's a simple formula for determining whether or not a particular number is: just take <I>a</I>, raise it to the power (<I>p</I>−1)/2, and then take the remainder after division by <I>p</I>. Then <I>a</I> is a quadratic residue mod <I>p</I> if and only if the answer is 1. Although this computation seems to involve very large values, in fact we can compute it quite quickly using <A href="http://en.wikipedia.org/wiki/Exponentiation_by_squaring">exponentiation by squaring</A> combined with frequent remainder operations.</P>
<P>This explains why none of our values are divisible by 5. If we compute 90283<SUP>(5-1)/2</SUP> mod 5, we get 4, which is not 1 (remember that 90283 is our original <I>n</I> to be factored). Thus, there is no square such that <I>S</I>−<I>n</I> is divisible by 5; but all numbers in our sequence have this form. In practice, this means we can compute just once ahead of time which factors may occur in the sequence (primes <I>p</I> such that <I>n</I> is a quadratic residue mod <I>p</I>), and ignore all others.</P>
<P>For our next mystery, why is it that given a number in the sequence divisible by <I>p</I>, every <I>p</I>th number after that is also divisible by <I>p</I>? Well, simple algebra shows that if <I>a</I><SUP>2</SUP>−<I>n</I>=<I>kp</I>, then:</P>
<P>(<I>a</I>+<I>p</I>)<SUP>2</SUP>−<I>n</I> = (<I>a</I><SUP>2</SUP>−<I>n</I>)+<I>p</I>(2<I>a</I>+<I>p</I>) = <I>kp</I>+<I>p</I>(2<I>a</I>+<I>p</I>).</P>
<P>But this doesn't explain why it always seems to be the case that there are <I>exactly two</I> different initial values of <I>a</I> such that <I>a</I><SUP>2</SUP>−<I>n</I> is divisible by <I>p</I> (with the exception of <I>p</I>=2). For example, in our sequence above the 3rd and 6th values were divisible by 7. The answer again is quadratic residues: it can be shown that the modular equation <I>x</I><SUP>2</SUP>≡y (mod <I>p</I>) has exactly two solutions (if it has any), and in fact there is an efficient algorithm for computing these two solutions called the <A href="http://en.wikipedia.org/wiki/Shanks-Tonelli_algorithm">Shanks-Tonelli algorithm</A>. I won't go into it here since it requires some background in number theory, but for small primes it isn't really needed; it suffices to test the first <I>p</I> numbers to see which are divisible by <I>p</I>. For larger primes, it becomes important to avoid this expensive scan.</P>
<P>Recall the <A href="http://en.wikipedia.org/wiki/Sieve_of_eratosthenes">Sieve of Eratosthenes</A>, an algorithm for locating prime numbers. It starts with a list of numbers, then crosses off all numbers not divisible by 2 except 2, then does the same for 3, 5, and so on until it's done. The numbers that remain must be prime. When attempting to find a list of prime numbers, this strategy is much more efficient than running even the most advanced primality test on each number individually.</P>
<P>We take a similar strategy here: we begin with a table of the original values in the sequence. We then visit all the numbers divisible by 2 and divide out a factor of 2. We do the same for each power of 2 up to the size of the sequence. We then do the same for every other prime up to our smoothness bound (43 in our example). In the end, the smooth numbers and only the smooth numbers will have become 1. Since we visit less and less list elements as the prime factor increases, the overall work is much less. For example, here's our original list from the above example:</P>
<P>318, 921, 1526, 2133, 2742, 3353, 3966, 4581, 5198, 5817, 6438, 7061, 7686, 8313, 8942, 9573, 10206, 10841, 11478, 12117, 12758, 13401, 14046, 14693, 15342, 15993, 16646, 17301, 17958, 18617, 19278, 19941, 20606, 21273, 21942, 22613, 23286, 23961, 24638, 25317, 25998, 26681, 27366, 28053, 28742, 29433, 30126, 30821, 31518, 32217, 32918, 33621, 34326, 35033, 35742, 36453, 37166, 37881, 38598, 39317</P>
<P>We visit elements 1, 3, 5, and so on, dividing out 2. Here's the list after this first pass is complete:</P>
<P>159, 921, 763, 2133, 1371, 3353, 1983, 4581, 2599, 5817, 3219, 7061, 3843, 8313, 4471, 9573, 5103, 10841, 5739, 12117, 6379, 13401, 7023, 14693, 7671, 15993, 8323, 17301, 8979, 18617, 9639, 19941, 10303, 21273, 10971, 22613, 11643, 23961, 12319, 25317, 12999, 26681, 13683, 28053, 14371, 29433, 15063, 30821, 15759, 32217, 16459, 33621, 17163, 35033, 17871, 36453, 18583, 37881, 19299, 39317</P>
<P>Here it is after dividing out the prime factors 3, 5, 7, 11, 13, and 17:</P>
<P>53, 307, 109, 79, 457, 479, 661, 509, 2599, 277, 1073, 7061, 61, 163, 263, 3191, 1, 10841, 1913, 577, 6379, 1489, 2341, 2099, 2557, 1777, 1189, 5767, 2993, 18617, 1, 23, 10303, 1013, 1219, 22613, 3881, 163, 12319, 2813, 619, 26681, 4561, 1039, 2053, 9811, 5021, 37, 103, 10739, 16459, 1601, 1907, 35033, 851, 12151, 18583, 1403, 919, 39317</P>
<P>We see a couple 1s have already appeared; these are 17-smooth numbers. When we get all the way up through 43, we have:</P>
<P>53, 307, 109, 79, 457, 479, 661, 509, 113, 277, 1, 307, 61, 163, 263, 3191, 1, 293, 1913, 577, 6379, 1489, 2341, 2099, 2557, 1777, 1, 5767, 73, 18617, 1, 1, 10303, 1013, 53, 22613, 3881, 163, 12319, 97, 619, 26681, 4561, 1039, 2053, 9811, 5021, 1, 103, 10739, 16459, 1601, 1907, 35033, 1, 419, 18583, 61, 919, 39317</P>
<P>We see several numbers set to 53 or 61; these would be smooth if we raised our bound a little bit.</P>
<P>This sieving process is where quadratic sieve gets its name from. This drastically decreases the overall work needed to find a sufficient number of smooth numbers, making it practical for very large numbers. This basic implementation could probably handle numbers up to 50-60 digits.</P>
<H2>Improvements and optimizations</H2>
<P>Quadratic sieve admits a number of "bells and whistles" to dramatically improve its runtime in practice. We mention only a few of the most important ones here.</P>
<P>The simple row reduction method of Gaussian elimination is not able to accomodate the very large smoothness bounds needed to factor large numbers, which often range in the millions, mostly due to space limitations; such matrices, if stored explicitly, would require trillions of bits. However, this method is wasteful, because most of the entries in the matrix are zero (they must be; each number has no more than log<SUB>2</SUB><I>n</I> prime factors). Instead of using an actual two-dimensional array, we can just keep a list for each column that lists the positions of the 1 bits in that column. We then use a method well-suited to reducing sparse matrices such as the <A href="http://en.wikipedia.org/wiki/Lanczos_algorithm">Lanczos algorithm</A>. This still requires a fair amount of space; it's common to use block algorithms that work on small portions of the matrix at one time, storing the rest of the matrix on disk. The matrix step is notoriously difficult to parallelize and for large problems is often done on a single high-performance supercomputer.</P>
<P>The most expensive step by far is the sieving, which can require scanning billions of numbers to locate the needed smooth numbers. A common trick is to only track the <EM>approximate logarithm</EM> of each number, usually in fixed-point arithmetic. Then, when visiting each number, instead of performing an expensive division we only have to subtract. This introduces a bit of rounding error into the algorithm, but that's okay; by rounding consistently in the correct direction, we can ensure that we don't miss any smooth numbers and only capture a few spurious numbers that we can quickly check and reject. Because the logarithms of small primes are small, and require visiting more numbers than any others, primes like 2 and 3 are often dropped altogether.</P>
<P>Another problem is that <I>a</I><SUP>2</SUP>−<I>n</I> grows fairly quickly; because smaller numbers are more likely to be smooth, we get diminishing returns as we scan higher in the sequence. To get around this, we scan values of not just the sequence <I>a</I><SUP>2</SUP>−<I>n</I> but also a number of similar sequences such as (<I>Ca</I>+<I>b</I>)<SUP>2</SUP>−<I>n</I> for suitable constants <I>C</I>, <I>b</I>. This variation is called the multiple polynomial quadratic sieve, since each of these sequences can be seen as the values of polynomial in <I>a</I>.</P>
<P>Finally, although the matrix step does not admit simple parallelization due to many data dependencies, the sieving step is perfectly suited to massive parallelization. Each processor or machine simply takes a portion of the sequence to scan for smooth numbers by itself, returning the small quantity of smooth numbers that it discovers to a central processor. As soon as the central processor has accumulated enough smooth numbers, it asks all the workers to stop. In the multiple polynomial variant, it's common to assign some of the polynomials to each machine.</P>
<P>One peculiar idea for massively parallelizing the sieving step, invented by Adi Shamir, is to use not computers but a specially constructed sieving device based on light emitters and sensors that he calls <A href="http://porsche.ls.fi.upm.es/Material/Articulos/twinkle.pdf">TWINKLE</A>. The concept is that we have a light for each prime number whose intensity is proportional to the logarithm of that prime. Each light turns on just two times every <EM>p</EM> cycles, corresponding to the two square roots of <EM>n</EM> mod <EM>p</EM>. A sensor senses the combined intensity of all the lights together, and if this is close enough to the logarithm of the current value, that value is a smooth number candidate.</P>
<H2>Conclusion</H2>
<P>I hope this gives you all some insight into the workings of one of the most powerful factoring algorithms available, and into how unexpected algorithms and mathematics can be applied to familiar problems. Please leave any feedback, whether clarifications, corrections, complaints, or encouragement. Thanks for reading.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=637332" width="1" height="1">Color quantizationhttp://blogs.msdn.com/b/devdev/archive/2006/05/09/594109.aspxWed, 10 May 2006 03:57:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:594109MSDNArchive8http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=594109http://blogs.msdn.com/b/devdev/archive/2006/05/09/594109.aspx#comments<P>If you've ever done work with Web graphics, I'm sure that at some point you reduced an image with thousands of colors, such as a screenshot or photograph, to an image with only 256 colors, for example to store the image in the GIF format. Remarkably, the reduced image is usually very visually similar to the original image, despite the massive loss of information. Most paint programs are capable of performing this transformation, and operating systems such as Windows will perform the transformation on-the-fly when the user is running a video mode with only 256 colors. But how does it work?</P>
<P>In the literature, this problem is called the <A href="http://en.wikipedia.org/wiki/Color_quantization">color quantization</A> problem -- generally, the word "quantization" refers to any process that maps several values to a single representative value. In this case, the values that we're mapping are colors. There are two main steps to color quantization:<BR></P>
<OL>
<LI>Generate a good palette
<LI>Dither</LI></OL>
<H2>Generate a good palette</H2>
<P>First we'll talk about generating a good palette. The simplest approach is to use a fixed palette for all images. We can map pixels in the source image to pixels in the reduced image by simply choosing the closest color. What do we mean by "closest"? Well, we can identify each color with a point in three-dimensional space whose coordinates are the red, green, and blue values of the color. Then, we consider two colors to be close if their points are close in this three-dimensional space. This isn't ideal, but it's very simple.</P>
<P>The fixed palette approach has the advantage that all images using the same palette can be rendered on the same screen simultaneously; unfortunately, it severely compromises image quality. Making this even worse is that typical fixed palettes, such as the Windows standard palette and the web colors palette, are chosen as uniform palettes, meaning that the colors are evenly spaced throughout RGB space. This is a terrible way to construct a palette, as humans can distinguish much finer color differences in some areas of RGB space than others.</P>
<P>Below, we show a <A href="http://commons.wikimedia.org/wiki/Image:Rosa_Gold_Glow_2.jpg">photograph of a rose</A> taken by Stan Shebs, but modified to remove its blue channel. This means that all of its colors lie in a single plane, the plane blue = 0. That color plane is shown in the graph below. We also show a 16-color optimized palette found by Adobe Photoshop - each blue point is a palette color, and each region shows the section of the color space mapping to that palette color. As you can see, the palette is far from uniform, and there are many colors that the image doesn't even use.</P>
<CENTER>
<TABLE border=0>
<TBODY>
<TR>
<TD vAlign=center><IMG src="http://moonflare.com/blogfiles/devdev/Rosa_Gold_Glow_2_small_noblue.png"></TD>
<TD><IMG src="http://moonflare.com/blogfiles/devdev/Rosa_Gold_Glow_2_small_noblue_color_space.png"></TD></TR></TBODY></TABLE></CENTER>
<P>From this picture, we can see that generating a good palette is essentially a point clustering problem. We want to identify clusters in the color space that contain a large number of points close together and assign a single palette entry to those colors.</P>
<P>One of the most popular algorithms for this problem is called the <EM>median cut algorithm</EM>. It was invented by Paul Heckburt in 1980 in his paper "<SPAN class=l><A href="http://citeseer.ist.psu.edu/heckbert80color.html">Color Image Quantization for Frame Buffer Display</A></SPAN>" and is still in wide use today. The algorithm follows just a few basic steps:</P>
<OL>
<LI>Create a block or box surrounding the points; it should be just large enough to contain them all.
<LI>While the number of blocks is less than the desired palette size:
<OL>
<LI>Find the block with the longest side out of all blocks.
<LI>Cut this block into two blocks along its longest side. Cut it in such a way that half of the points inside the block fall into each of the two new blocks (that is, we split it through the median, thus the name).
<LI>Shrink these two new boxes so that they just barely contain their points.</LI></OL>
<LI>Average the points in each box to obtain the final set of palette colors.</LI></OL>
<P>Despite a number of more sophisticated modern algorithms, this strikingly simple approach is adequate for generating 256 color palettes for most images. You can read about <A href="http://en.literateprograms.org/Median_cut_algorithm_(C_Plus_Plus)">a C++ implemention of the median cut algorithm</A> at my wiki.</P>
<H2>Dither</H2>
<P>If the palette is large enough and of high enough quality, Reducing by nearest color is usually sufficient. If not, however, distinct artifacts will be visible, such as banding, where what were once smooth gradients become striped regions (see <A href="http://en.wikipedia.org/wiki/Image:Dithering_example_undithered_16color.png">this image</A>; look closely at the cat's neck). One effective technique for dealing with these palette limitations is called <EM><A href="http://en.wikipedia.org/wiki/Dither#Digital_photography_and_image_processing">dithering</A></EM>, which mixes dots of different colors to simulate a larger number of colors and smooth out gradients. For example, if you mix a grid of red and blue dots, the result will appear purple. Dithering is also commonly used in printing, where for example a grayscale image can be simulated with great accuracy using a higher-resolution dithered black and white image.</P>
<P>The trick to dithering is to effectively simulate the original color in each area of the image while avoiding telltale artifacts formed by poor dithering patterns; see <A href="http://www.visgraf.impa.br/Courses/ip00/proj/Dithering1/ordered_dithering.html">ordered dithering</A>, commonly used in newspapers, as an example of a dithering scheme with clearly visible artifacts. One of the most popular dithering schemes to this day is the <A href="http://www.visgraf.impa.br/Courses/ip00/proj/Dithering1/floyd_steinberg_dithering.html">Floyd-Steinberg </A>algorithm, a classic dithering algorithm falling into the general class of <EM>error diffusion </EM>dithering algorithms.</P>
<P>The basic concept behind error diffusion is to begin with nearest color matching, but whenever we can't match a pixel color exactly, the error between the original color and the palletized color is distributed into adjacent pixels not yet visited. In Floyd-Steinberg, the pixel to the right receives 7/16 of the error, the pixel below 5/16, the pixel below and to the left 3/16, and the pixel below and to the right 1/16. When these pixels are visited, this added error affects the palette entry selected for them, leading to an average colour over that portion of the image close to the actual average color of the pixels in that area.</P>
<P>What's great about Floyd-Steinberg is that it generalizes easily to reducing full 24-bit color images to palettes of any size. The only modification we need to make is that we perform the error distribution step for each of the three channels. The palette entry is still chosen by nearest color. You can read about <A href="http://en.literateprograms.org/Floyd-Steinberg_dithering_%28C%29">a C implementation of Floyd-Steinberg dithering</A> on my wiki.</P>
<P>This isn't the end of the story of course - there is ongoing research and all kinds of fancy variations on these approaches, many of which can produce much nicer results than these classical algorithms. But at least now the next time you see a GIF or 8-bit PNG file on the web, or use a computer in 256-color mode, you can appreciate the algorithms under the cover that helped make your images look good under these severe limitations.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=594109" width="1" height="1">How does JPEG actually work?http://blogs.msdn.com/b/devdev/archive/2006/04/12/575384.aspxThu, 13 Apr 2006 02:33:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:575384MSDNArchive7http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=575384http://blogs.msdn.com/b/devdev/archive/2006/04/12/575384.aspx#comments<p>JPEG is an image encoding designed to compress photographs and
similar images effectively, often 5 to 15 times over a raw bitmap
format. It's a lossy format that exploits properties of human vision to
eliminate information that is difficult to distinguish. You see JPEGs
all the time, and if you've seen a JPEG compressed with a low quality
level, you have a good idea of the kinds of visual artifacts that
emerge when JPEG compression is pushed to its limits. But how does it
really work, and how does this cause JPEG artifacts?</p>
<p>The JPEG standard defines a number of different encoding options,
and there are a number of container file formats defined other than the
usual JFIF, but I'm going to discuss a single popular JPEG encoding
method used by many applications. This encoding combines three
different completely independent methods of image compression:
downsampling the chrominance channels, quantization of the discrete
cosine transformation, and entropy coding.</p>
<h2>Method 1: Downsampling the chrominance channels</h2>
<p>The most obvious way to make an image file smaller is to make the
image smaller - for example, a 100×100 bitmap is 1/4 the size of a
200×200 bitmap. You can expand the image back out to the original size
when viewing it by simply upsampling it. This isn't a great compression
method, since it causes clearly visible artifacts, usually blocky or
blurry effects, especially around detailed, sharp areas.</p>
<p>A different approach is instead of downsampling the entire image, to
only downscale certain channels of the image. For example, the eye's
red-green sensors are more numerous and sensitive than its blue-yellow
sensors, so if we shrink just the blue channel and later re-expand it,
the effect is much less noticable. But the RGB color space is not our
only choice - images can be encoded in many color systems, and for each
of these the channels we choose to downsample will have different
visual effects. <a href="http://www.stanford.edu/%7Eesetton/experiments.htm">This experiment</a>
by Eric Setton of Stanford University demonstrates the effect on image
quality of downsampling various channels in various color systems.</p>
<p>The results are visually obvious: we are sensitive to red and green,
but even more sensitive to intensity or luminance, the relative
brightness of pixels compared to other nearby pixels. This is expressed
by the Y component of the YCbCr system and the V component of the HSV
system. Any downsampling of this information is visually disastrous. On
the other hand, downsampling the <i>chrominance</i>, expressed by the
Cb and Cr components of the YCbCr system, has almost no visual effect
(Cb and Cr stand for "chrominance blue" and "chrominance red" - see <a href="http://r0ro.free.fr/tipe/algorithme.php">a sample image broken up into YCbCr channels in the middle of this page</a>).
For this reason, typical JPEG files convert images to the YCbCr color
space, and then downsample both the Cb and Cr channels by a factor of 2
in both width and height. Further processing of the image proceeds on
each of the three channels independently.</p>
<p>This is the method in JPEG encoding largely responsible for the
artifact known as "color bleeding", where along sharp edges the color
on one side can "bleed" onto the other side. This is because the
chrominance channels, which express the color of pixels, have had each
block of 4 pixels averaged into a single color, and some of these
blocks cross the sharp edge.</p>
<h2>Method 2: Quantization of the discrete cosine transformation</h2>
<p>Normally, we view an image as simply a grid of pixels, and if we
alter a pixel or group of pixels dramatically, the effect will be
visually obvious. A different approach to storing images is to come up
with a set of "basis" images which can be added together in various
combinations to form the image we want. JPEG encodes each 8-by-8 square
of pixels using the following basis image set of 64 images:</p><img src="http://moonflare.com/blogfiles/devdev/Dctjpeg.png" height="300" width="300">
<p>We assign each of these images a relative weight determining how
much influence it has on the image. This weight can be positive or
negative (when negative, the basis image is subtracted). Taken
together, these 64 weights can describe any 8×8 image at all, and they
describe the image precisely. <a href="http://en.wikipedia.org/wiki/JPEG#Discrete_cosine_transform">This example at Wikipedia</a>
shows an 8×8 image, its original pixel array, and its new 8×8 array of
weights after being converted using the discrete cosine transform
(which I'll discuss later).<br>
</p>
<p>But what's the point of transforming 64 pixel values into 64
weights? For one thing, the resulting data may exhibit more structure,
and so be easier to compress. For example, with blocks from typical
photographs, the basis images in the upper left above have large
weights and the ones in the lower right have small weights.</p>
<p>But to store the weights, which are real numbers, we have to encode
them somehow. The simplest encoding is to just convert them to integers
between -<i>k</i> and <i>k</i> by scaling and rounding them. This works, but there's a tradeoff in our choice of <i>k</i>:
if it's too big, the encoded image will take too much space, but if
it's too small, the rounding might visibly distort the image, since
it's effectively altering the weights.<br>
</p>
<p>A more effective approach is to use a different <i>k</i> for each
basis image. This works well because the visual difference caused by
variations in the weights of different basis images is different. The
eye can detect small variations of the weights of the images in the
upper-left, but can overlook much larger variations in the weights of
the images in the lower-right, so we can get away with smaller <i>k</i> and larger rounding error for these weights.<br>
</p>
<p>In practice, rather than deal with expensive floating-point numbers,
we round the initial weights to integers and later divide them by fixed
factors called the <i>quantization factors </i>using integer division.
During decoding, we multiply them by these factors again. The
upper-left images have small quantization factors, so that little
information is lost in this process, while the lower-right images have
larger factors. <a href="http://en.wikipedia.org/wiki/JPEG#Quantization">This example</a> shows a commonly used matrix of quantization
factors, one for each basis image, and the result of quantizing the
sample image using them. Observe how much simpler the matrix becomes -
it now contains a large number of entries that are small or zero,
making it much easier to compress. <a href="http://en.wikipedia.org/wiki/JPEG#Decoding">This example</a>
shows how the weight values expand back out during decoding - notice
that although some of the weights are now different, the visual
difference is almost undetectable.<br>
</p>
<p>Quantization is the primary source of JPEG artifacts. Because the images in the lower-right
tend to have the largest quantization divisors, JPEG artifacts will
tend to resemble combinations of these images. The matrix of quantization factors can be directly controlled by
altering the JPEG's "quality level", which scales its values up or down.<br>
</p>
<p>Quantization compresses the image in two important ways: one, it
limits the effective range of the weights, decreasing the number of
bits required to represent them. Two, many of the weights become identical or zero,
improving compression in the third step, entropy coding.</p>
<p>But this discussion leaves one important question unanswered: how do
we take an arbitrary square of 8×8 pixels and decompose it into a
combination of the basis images? As it turns out, we specifically chose
the images above in a way to make this easy. These images are formed by
cosine waves of various frequencies in the x and y directions, and a
transformation called the <a href="http://en.wikipedia.org/wiki/Discrete_cosine_transform">discrete cosine transform</a> from Fourier transformation theory allows us to rapidly decompose a signal into these parts in only O(<i>n</i> log <i>n</i>)
time. Because of the importance of JPEGs, over the years very fast and
clever specialised solutions for 8×8 blocks have been created in both
software and hardware. This is also the reason we use small,
constant-sized blocks instead of processing the image as a whole: the
cosine transform step doesn't need to scale up to large bitmaps, so the
transformation as a whole can run in linear (O(<i>n</i>)) time.</p>
<h2>Method 3: Entropy coding</h2>
<p>Unlike the other steps, entropy coding is lossless. If you've
studied the simple compression techniques used in ZIP files and BMP
files, the techniques of entropy coding should come as no surprise.</p>
<p>We start by turning our 8×8 grid of integers into a linear sequence
of integers. The obvious row-major and column-major orders are not
ideal here, since we expect there to be many zeros in the lower-right
area. Instead, we start with the top-left corner and zig-zag in a
diagonal pattern across the matrix, going back and forth until we reach
the lower-right corner.</p>
<p>Next, we apply the same run-length encoding algorithm used by GIF
files to condense sequences of identical integer values. In areas where
the values are small, we can expect many such runs, particularly runs
of zeros. We use a special code to say "all the remaining values are
zero", which comes in handy.</p>
<p>Finally, we apply Huffman compression to the remaining sequence,
which is a standard lossless compression technique that encodes
frequently used values using less bits than infrequently used values.</p>
<h2>Conclusion</h2>
<p>And that's it! For each YCbCr channel, for each 8×8 block, we
perform these transformations, and we store the results in a bitstream.
This data is wrapped up in a JFIF container giving metadata about the
image, and the file goes out ready for web browsers to view, completely
unaware of all the data that has been lost. Just don't go sending JPEGs
into space and expect aliens to see them the same way.</p>
<p>Finally, no discussion of the JPEG encoding would be complete
without noting that JPEG is based on research ideas that are now very
old and largely supplanted. The more modern <a href="http://www.jpeg.org/jpeg2000/">JPEG 2000 encoding</a>
based on wavelets can be pushed much farther before visible artifacts
appear, and includes a surprisingly efficient lossless mode. Of course,
all this improvement is rather moot until common applications actually
support it.</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=575384" width="1" height="1">Encoding correctness in typeshttp://blogs.msdn.com/b/devdev/archive/2006/03/27/562342.aspxMon, 27 Mar 2006 23:52:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:562342MSDNArchive0http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=562342http://blogs.msdn.com/b/devdev/archive/2006/03/27/562342.aspx#comments<P>This article will discuss effective use of types to catch some common problems at compile time not normally found by the typechecker.</P>
<P>The typechecker is the most important tool in the programmer's arsenal. Types enable you to structure code in a more maintainable way and prevent entire classes of errors from ever occurring at runtime, many of which can be insidiously difficult to debug. If you've written anything extensive in assembly language, you can understand the frustration of dealing with misinterpretations of data leading to corruption of program state. Researchers like to say "well-typed programs never go wrong".</P>
<P>But what does "go wrong" mean? One answer is: "more than you think". By making effective use of the type system to encode invariants about your program, you can use the compiler to catch more errors at compile time that you currently catch at run time.</P>
<P>A simple example is SQL injection, where insertion of user-supplied data into a SQL statement string can enable attacks where the user modifies the nature of the statement with carefully chosen data. At first the solution seems simple: double your apostrophes to escape them, and then the value cannot be misinterpreted. This can be trickier than sounds, though, since in a complex program you might not be able to identify which values come from an untrusted source, directly or indirectly, which are safely produced from trusted sources, and which are untrusted data that has already been properly escaped. These three types of data travel from method to method, following a winding path that obscures their source. The compiler is of no help, since it sees them all as just strings.</P>
<P>The solution, called "tainting" in languages like Perl and Ruby, is to make them not "just strings", but to create two types of data: tainted data, which is unsanitized data derived from an untrusted source, and untainted data, which is either derived from a trusted source or already sanitized. For our example, suppose we're performing our queries with SqlDataReader. We create a wrapper class for it which has an identical interface except that wherever SqlDataReader takes a string, our wrapper takes a SafeSqlString. This new type is simply a wrapper for <EM>string </EM>that can only be generated in a few strictly limited ways:</P>
<UL>
<LI>static SafeSqlString AssumeSafe(string): Used for trusted data, just wraps it</LI>
<LI>static SafeSqlString Escape(string): Escapes apostrophes; used for untrusted data.</LI>
<LI>SafeSqlString operator+(SafeSqlString): Concatenates two SafeSqlStrings</LI>
<LI>static SafeSqlString Format(SafeSqlString format, params object[] args): Takes a safe format string and a list of arguments of safe types such as SafeSqlString, int, and double, and produces a new SafeSqlString.</LI></UL>
<P>What does this buy us? Now, as long as we use our new SqlDataReader wrapper for all database access, we are essentially immune to SQL injection: the only way it could sneak in is through AssumeSafe(), which is obviously dangerous and makes the code easy to review. As an added bonus, double escaping can never happen: the data only gets escaped once as it changes from <EM>string</EM> to <EM>SafeSqlString</EM>, and Escape() cannot be applied to <EM>SafeSqlString</EM>.</P>
<P>Because these types flow through the program, we can specify that a particular method always returns a SafeSqlString, or assumes that its arguments are SafeSqlStrings, or assumes that some private variable is always safe, and enforce this statically instead of by convention. This is important, because it allows us to use AssumeSafe() at a point where it's <EM>obvious</EM> that the string is safe, as opposed to five calls up the call stack where we really have no idea who else might be calling us.</P>
<P>What else can we encode in types? All sorts of things! Here are some more examples:</P>
<UL>
<LI>We could have a type PositiveInt for positive integers, which will throw a runtime error if constructed with a nonpositive integer. Now, if a method receives a PositiveInt, it can safely pass it on to another method expecting a positive integer without any additional runtime checks. This trick works with any object where you want to enforce a complex runtime property on an object but only want to check it once (or only when it might change). In unmanaged code, it's particularly useful for checking buffer/string lengths.</LI>
<LI>Some binary tree algorithms require that every node has zero or two children, never one. We can represent this using a subclass hierarchy, where the abstract base class is TreeNode and the subclasses are Leaf and InternalNode. Leaf contains no data at all, while InternalNode contains a piece of data and two (non-null) references to TreeNodes.</LI>
<LI>There are many other properties of data structures that can be encoded in types. For example, in a red-black tree, red nodes cannot have red children. We can enforce this statically by creating a subclass hierarchy where BlackNode objects contain references to the abstract base class Node, while RedNode objects contain only references to BlackNode objects. Similarly, the root must be black, so we just use a BlackNode reference for the root.</LI>
<LI>If you're writing a sorting algorithm, it might not be clear that the array you get back contains a rearrangement of the same elements as the array you passed in. There's no quick way to check this, even at runtime. On the other hand, if instead of an array you pass in an object whose only valid non-read-only operation is to exchange two elements, then this becomes statically obvious.</LI></UL>
<P>In short, wherever you find yourself wanting to keep track of a static property as it flows through a program, types are a great way of doing so - use them to catch errors sooner.</P>
<P>In programming languages with more sophisticated type systems, you can encode even more information in types. Taking this to the extreme are dependent types, a relatively new system of types based on values that are extremely difficult to typecheck but so powerful that they can express many correctness properties of real programs. A number of research languages such as <A href="http://www.cs.bu.edu/~hwxi/DML/DML.html">Dependent ML </A>and <A href="http://www.e-pig.org/">Epigram</A> already implement dependent types. With the power of systems like these, maybe in a few years the primitive C-based type systems of C# and C++ will look a lot like assembly language does now. But before we can take advantage of new technology, let's start making better use of what we have - encode your program's properties using types.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=562342" width="1" height="1">LiteratePrograms wikihttp://blogs.msdn.com/b/devdev/archive/2006/03/03/543397.aspxSat, 04 Mar 2006 04:54:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:543397MSDNArchive0http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=543397http://blogs.msdn.com/b/devdev/archive/2006/03/03/543397.aspx#comments<P>Hi everybody. This is a short post, but I just wanted to tell you all about a new wiki I created called <A href="http://literateprograms.org/">LiteratePrograms</A>, based on an extension of the same MediaWiki software used by <A href="http://www.wikipedia.org/">Wikipedia</A>.</P>
<P>Some of you read <A HREF="/devdev/archive/2006/01/23/516431.aspx">my earlier post</A> about literate programming and some of the advantages it has for easy-to-read exhibition of code. On LiteratePrograms, every article is a literate program: it is simultaneously a document and a piece of code which can be downloaded, compiled, and run. The "download code" tab at the top of every article runs the <A href="http://www.eecs.harvard.edu/~nr/noweb/">noweb literate programming tool</A> which extracts code segments from the discussion, puts them back in a form that can be compiled, and zips them up in a package for download. Additionally, the wiki markup, editable by anyone, allows rich documentation including mathematical formulas and diagrams. Finally, all content is released under the liberal MIT/X11 non-copyleft license, so anybody can use it, even for commercial purposes. You can see a sample of LP in action at <A href="http://literateprograms.org/Insertion_sort_%28C%2C_simple%29">Insertion sort (C, simple)</A>.</P>
<P><A href="http://lambda-the-ultimate.org/node/1336">Recently featured at Lambda the Ultimate</A> (thanks Allan!), LP has attracted a lot of attention and has a couple dozen <A href="http://literateprograms.org/Special:Allpages">articles</A> and <A href="http://literateprograms.org/Special:Listusers">users</A> now. The users have some pretty diverse interests in languages and subject areas that I would never have anticipated (see <A href="http://literateprograms.org/Pi_with_Machin%27s_formula_%28Python%29">Pi with Machin's formula (Python)</A>, <A href="http://literateprograms.org/Hello_World_%28occam%29">Hello World (occam)</A>).</P>
<P>I invite your suggestions for and participation in and the site - in particular, we're really lacking in Windows and .NET programs and programmers at the moment, and I know you guys can help. See <A href="http://literateprograms.org/LiteratePrograms:How_to_write_an_article">How to write an article</A> and take a look at the existing articles to learn about the markup syntax. Please ask me any questions you have. Thanks everyone.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=543397" width="1" height="1">Dependency tracking in buildshttp://blogs.msdn.com/b/devdev/archive/2006/02/14/532262.aspxWed, 15 Feb 2006 06:20:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:532262MSDNArchive6http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=532262http://blogs.msdn.com/b/devdev/archive/2006/02/14/532262.aspx#comments<P>Happy Valentine's Day, everyone! Today I'm going to talk about the important problem of successfully tracking dependencies in builds in an automatic manner.</P>
<P>In short, there are two kinds of build systems: the ones where all dependencies are tracked successfully and incremental rebuilding is available at all times, and the ones where you have to rebuild everything whenever you change anything remotely significant. There is nothing inbetween - as soon as a build system fails to track dependencies in one circumstance, developers will become paranoid and insist on costly full rebuilds in the future. After all, the consequences of a partial rebuild are insidiously difficult to diagnose, resulting in broken builds, symbol mismatches, exceptions while loading assemblies, mixtures of interacting new and old code, and other nonsense that we don't want to waste our time debugging. This means that a dependency tracking system has to be very close to perfectly reliable, which can make it very difficult to design and implement successfully.</P>
<P>Although the problem is a very old one, with tools like GNU make dating back decades, I believe in most project environments dependency tracking is not solved to the level of reliability that is needed for this kind of trust. Behind all of these failed dependency-tracking systems is one cardinal sin:</P>
<P><FONT size=5>Dependencies cannot be maintained by hand.</FONT></P>
<P>I'm not merely saying maintaining dependencies by hand is difficult or even error-prone, but that it's damn near impossible. In C/C++ in particular, adding a #include statement to a header file implicitly adds a dependency to every source file including that header file directly or indirectly. Even worse is <EM>removing</EM> a #include, in which case some of the files depending on it might no longer require it, but then some might, depending on what they directly or indirectly include. Waste all the time you want on this - you won't get it right. Even in C#, Java, or Visual Basic, which thankfully have no concept of header and source files and load an entire project for compilation at once, interproject dependencies can become just as complex in large solutions.</P>
<P>Even worse are systems which require you to explicitly specify not only the dependencies but the <EM>build order</EM>. Often hierarchically structured solutions of many projects will use pretty good dependency tracking within each project, but revert to this "build these things in this order" model for the whole tree. The result is that related modules in different parts of the tree cannot be built together incrementally, and highly parallellized building becomes impossible. Scalability demands that we create some level of abstraction for solution-level builds, but that doesn't mean we have to sacrifice incremental building altogether.</P>
<P>The alternative to maintaining something by hand is of course maintaining it automatically. The code is already quite explicit about what depends on what - whether in the form of <EM>#include</EM> statements, <EM>import </EM>statements, assembly references, or other means. The fundamental principle of avoiding redundancy tells us that we should be <EM>generating</EM> dependency information from these sources. For example, the following buried treasure from the GNU make manual details how to use the -M flag of gcc together with some (rather esoteric) script to automatically track header dependencies in C:</P>
<BLOCKQUOTE>
<P>The practice we recommend for automatic prerequisite generation is to have one makefile corresponding to each source file. For each source file `<TT><VAR>name</VAR>.c</TT>' there is a makefile `<TT><VAR>name</VAR>.d</TT>' which lists what files the object file `<TT><VAR>name</VAR>.o</TT>' depends on. That way only the source files that have changed need to be rescanned to produce the new prerequisites. </P>
<P>Here is the pattern rule to generate a file of prerequisites (i.e., a makefile) called `<TT><VAR>name</VAR>.d</TT>' from a C source file called `<TT><VAR>name</VAR>.c</TT>': </P>
<P>
<TABLE>
<TBODY>
<TR>
<TD> </TD>
<TD class=smallexample><FONT size=-1><PRE>%.d: %.c
@set -e; rm -f $@; \
$(CC) -M $(CPPFLAGS) $< > $@.$$$$; \
sed 's,\($*\)\.o[ :]*,\1.o $@ : ,g' < $@.$$$$ > $@; \
rm -f $@.$$$$
</FONT></PRE></TD></TR></TBODY></TABLE><A href="http://www.gnu.org/software/make/manual/html_mono/make.html#SEC51">http://www.gnu.org/software/make/manual/html_mono/make.html#SEC51</A></P></BLOCKQUOTE>
<P>The concept here is that the dependencies for each source file are stored in a separate file which itself depends on that source file. If the source file is modified, the ".d" file containing that source file's dependencies is rebuilt before it is loaded by make. In this way the dependencies can never be wrong or out of date for even a moment. This concept could be generalized to a pluggable framework that enables you to determine build-time dependencies of arbitrary files using arbitrary code.
<P>We can play similar tricks to dynamically track the dependencies of projects in a solution. Although this is one level of abstraction higher, it's really just the same thing with source files replaced by projects. We analyze the projects to determine their dependencies at build-time, set up a build order, and then locally build each project using its own local dependency structure. This enables automatic dependency tracking to scale effectively. With a solution like this, you can simply sync up the latest changes from the repository and do an incremental build from root in minimum time.
<P>Dependency tracking isn't just useful for builds either - it can often be used to evaluate the impact of a change. For example, if I change a header file, a good dependency tracking system should be able to name for me all the affected source files in the project, as well as any other project that might be affected. In the end I might decide to go with a less destabilizing change instead. They can also be incorporated into version control systems, where they can help to evaluate whether concurrent changes made by other people will break what you're working on right now. This is largely unexplored territory.</P>
<P>At Microsoft, a new build system called MSBuild has recently been developed and released with Visual Studio 2005. Currently used for almost all of the Visual Studio build, it will hopefully in time replace older ad hoc systems based on batch files across the company. One of its advantages is its incremental building support, described here:</P>
<P><A href="http://msdn2.microsoft.com/en-us/library/ms171483.aspx">http://msdn2.microsoft.com/en-us/library/ms171483.aspx</A></P>
<P>Although a sophisticated system, even MSBuild's support for incremental building is relatively limited, focusing on dependencies within projects between files that have direct input-to-output mappings.</P>
<P><FONT size=5>Build scheduling algorithms</FONT></P>
<P>Let's talk a little bit about the technology behind build scheduling - a simple problem, but one with a number of interesting generalizations. To simplify the discussion, we'll refer to both inputs and targets/outputs as "nodes". Thus, source files, header files, binary outputs, object files, resource files, and so on are all nodes. We can describe this as a directed graph where each vertex is a node and each edge from A to B says that "B depends on A". In simple terms, this is just a drawing where we have a circle for each node and an arrow pointing from each node to nodes that depend on it. This is called the <EM>dependency graph</EM>, and is the fundamental data structure used in build scheduling (as well as other types of scheduling).</P>
<P>The simplest scheduling problem is sequential scheduling, where we have a one-processor machine and we need to figure out what order to build things in. In this case, we can use topological sorting. This just means that first we start by building all the things with no dependencies, and we proceed to build the things whose only dependencies are things we already built. There are efficient algorithms for determining this order in advance.</P>
<P>In large solutions, however (like say, Windows), building everything on one machine could take weeks or months. In this case, we'd like to use many machines, each with many processors, to build different parts of the product simultaneously. However, they still have to respect the build-time dependencies between components. This is where things get interesting, and we run into many of the same problems as scheduling instructions on microprocessors, and even run into some NP-complete problems. A couple common heuristics, the Decreasing-Time Algorithm and the Critical-Path Algorithm, are described here:</P>
<P><A href="http://www.ctl.ua.edu/math103/scheduling/scheduling_algorithms.htm">http://www.ctl.ua.edu/math103/scheduling/scheduling_algorithms.htm</A></P>
<P>Both of these require a time estimate that each step will take, which is not too difficult to generate automatically from metrics like lines of code. The concept behind the Critical-Path Algorithm is that if a long chain of dependencies is waiting on a particular node to be built, that's a good reason to build that node first, even if it doesn't take long to build itself. Even these heuristics are based on NP-complete problems such as the longest path problem, but in practice they can often be solved efficiently.</P>
<P>Note that one can't just throw arbitrary hardware at a scheduling problem; there are limits beyond which any additional hardware would simply sit idle, and there are even smaller limits for which very little benefit is derived. Part of the challenge of parallel scheduling is finding just the right number of systems to maximize cost-benefit.</P>
<P>There are even fancier variants of this where you have machines that build at different speeds, machines that can build some types of files but not others, and so on. I won't get into these here, but this discussion should make one thing clear: a developer doesn't want to waste precious development time figuring all this stuff out. Build order should always be determined by build technology that can be relied upon to be correct, to do efficient incremental building, and to scale up to large builds and large build clusters well, all automatically.</P>
<P>So why do we keep using hacked-up internal build systems made of Perl scripts and anarchy? I don't know, but if you decide to make something better, please come sell it to my group!</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=532262" width="1" height="1">The nature of computing and infeasible machineshttp://blogs.msdn.com/b/devdev/archive/2006/02/06/526252.aspxTue, 07 Feb 2006 05:44:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:526252MSDNArchive3http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=526252http://blogs.msdn.com/b/devdev/archive/2006/02/06/526252.aspx#comments<p>If asked what they do, some developers might say "I write code" or "I program computers". But more fundamentally, what is computing really about? After all, <a href="http://en.wikipedia.org/wiki/Euclid%27s_algorithm">Euclid's algorithm</a> for computing the greatest common denominator of two numbers in polynomial time was designed - and executed by hand - thousands of years ago, long before even early mechanical computing devices were available (besides the abacus). For centuries elaborate bureaucracies have set up processing units and communication channels to spread and delegate work, not unlike modern server farms. Even today, complexity theorists regularly study mythical machines such as <a href="http://en.wikipedia.org/wiki/Non-deterministic_Turing_machine">nondeterministic machines</a>, <a href="http://en.wikipedia.org/wiki/Alternating_Turing_machine">alternating machines</a>, nonuniform circuits, <a href="http://en.wikipedia.org/wiki/Oracle_machine">oracle machines</a>, and <a href="http://en.wikipedia.org/wiki/Hypercomputation">hypercomputers</a>, none of which are practical to construct or even to simulate in general. So why design algorithms for machines we can't build?</p>
<p>In reality, computing has nothing to do with computers. Computing is fundamentally about solving problems using strictly defined processes under resource constraints. The theory of computing, by design, has little regard for what device you use to carry those processes out.</p>
<p>For example, suppose ten monks in medieval Europe are copying a sacred book of 100 pages, numbered 0 through 99, and just as they finish one of them accidentally knocks the pages all over the floor. How can they rapidly put the pages back in order? One simple process is this: each monk picks up 10 pages and places them in one of ten piles depending on the tens digit of the page number. Then, each monk takes one of the piles and sorts it by himself. The piles are put together and they are done - this could all be done in just a couple minutes, much more quickly than a single monk would take to sort the whole book. In computing terms, this is a parallelized version of radix sort, switching to selection sort or insertion sort for small sublists.</p>
<p>Indeed, good algorithms are often more important than the device you use; for example, a computer sorting a list of a billion integers using bubble sort would probably take more time than a room full of people using the parallelized radix sort described above. On the other hand, some devices can solve problems that others cannot; for example, the very weak deterministic finite automaton (DFA) can't even tell whether its input string contains the same number of zeros and ones, which is a trivial task for a Turing machine, an abstract machine modelling practical computers. This tells us something important: DFAs aren't good enough. In other words, a computer that at its best can only simulate a DFA is fundamentally incapable of solving certain problems. Similarly, theoretical hypercomputers can solve problems that Turing machines cannot.</p>
<p>Unfortunately, once machines become powerful enough, they quickly all become equivalent with respect to the problems they can solve. For example, deterministic, multitape, random access, probabilistic, nondeterministic, and quantum Turing machines all solve the same set of problems. Why go to the bother of defining all these different machines then? There are two reasons:</p>
<ol>
<li>Certain devices make solutions to certain problems much easier to describe. For example, problems that are easy to solve by exhaustive search are also easy to express using algorithms that run on nondeterministic machines. Problems like primality testing and undirected graph connectivity are much easier to solve with known algorithms when we have the ability to generate random numbers.</li>
<li>Some devices can solve more problems than others using the same amount of resources, or solve the same problems using less resources. As a simple example, a machine with random access can search an ordered list in logarithmic time using binary search, whereas an ordinary Turing machine with only sequential access would require linear time. <a href="http://en.wikipedia.org/wiki/Shor%27s_algorithm">Shor's algorithm</a> can factor integers on quantum computers in polynomial time, something we don't know how to do on classical machines.</li></ol>
<p>We know that the "features" offered by new theoretical devices are useful in describing new algorithms for solving problems, since we designed them for that purpose, but what we'd really like to know is whether they really let us solve problems more efficiently than we possibly could without them. This question, on the "power" of devices, is the major stumbling block of modern complexity theory. Even the great unsolved question of whether P=NP can be phrased as a question of power: can nondeterminism allow us to solve problems in polynomial time that we could not otherwise? Another question, whether P=RP, asks whether we can solve more problems in poly time with the ability to generate random numbers than without it. Although also unsolved, many researchers believe that these classes are equal and randomness does not add any real new power. The unsolved question of whether L=P asks whether we can solve more problems in polynomial space and polynomial time than in logarithmic space and polynomial time. Occasionally, we show that two dissimilar devices are equally powerful, as in the famous results that PSPACE=NPSPACE, IP=PSPACE, and NL=coNL, but rarely ever have we unconditionally shown that one machine is truly more powerful than another (notable exceptions are the trivial L ≠ PSPACE and P ≠ EXPTIME; see <a href="http://en.wikipedia.org/wiki/Time_hierarchy_theorem">time hierarchy theorem</a>, <a href="http://en.wikipedia.org/wiki/Space_hierarchy_theorem">space hierarchy theorem</a>).</p>
<p>To come back to my original point, why do we design algorithms for machines we can't build? There are many reasons, but some of the most important are:</p>
<ol>
<li>The algorithm may be more natural to describe on an abstract machine which we can then simulate on a real machine. By focusing our efforts on this simulation program, we can acquire the ability to easily solve any problem that the machine being simulated can solve.</li>
<li>Understanding what problems a machine can solve helps us learn about the machine and how powerful it is. This helps us determine whether we should invest money and effort in software and hardware simulating those features or whether we can adapt existing machines to solve the same problems just as efficiently.</li>
<li>Like the proof that the halting problem cannot be solved, they help us understand what problems are truly intractable, so that we can focus our effort on problems that we're more likely to crack successfully.</li></ol>
<p>More philosophically, a popular saying among mathematicians is that if you were to wake up tomorrow and the universe were gone, you could still practice mathematics - abstract concepts are not dependent on any property of our physical universe, even if the specific representations and concepts that we use are. The same applies to computing, which at its core is a mathematical discipline: even when modern SDKs and classical architectures have become obsolete, the same abstract problem solving we now perform in C++, C#, or Java we will perform in yet another new programming language. And if you're ever stuck in purgatory for eternity, you can just quicksort a list of a billion numbers in your head.</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=526252" width="1" height="1">Literate programminghttp://blogs.msdn.com/b/devdev/archive/2006/01/23/516431.aspxTue, 24 Jan 2006 00:53:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:516431MSDNArchive10http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=516431http://blogs.msdn.com/b/devdev/archive/2006/01/23/516431.aspx#comments<P><A href="http://www.google.com/search?q=literate+programming">Literate programming</A>, invented in 1981 by the same Donald Knuth who wrote <EM>The Art of Computer Programming</EM> and the document language TeX, is a technique in which a program is written as a human-oriented document interspersing discussion and code. The code segments are arranged not according to execution order or the logical structure of the code, but in whatever order the author believes offers the best presentation to the reader. A tool can be used to automatically extract the code segments, put them back in the order expected by the compiler, and export them as a source file. Although it never really caught on for ordinary coding, it has the important ability to make code easier to understand for a reader, easing maintenance, transfer of ownership, and helping to prevent future bugs.</P>
<P>The idea is perhaps easiest to understand if we look at it not from the perspective of a coder, but from the perspective of a researcher writing a book. They write a section about an algorithm, showing first the core algorithm, then talking about issues, improvements, and variations. In each section they show a little more code. They might show a test driver to demonstrate how it might be invoked. Taken to the logical extreme, these code samples could, taken together, show every part of the program - all someone would need to do to run it is to copy the samples into a source file in the right order and build it. If we automate this extraction process, we now have a section of a book that simultaneously <EM>is</EM> a real program that can be run.</P>
<P>Some people find the idea of structuring a program in this seemingly random "presentation order" strange, but really we do it all the time. Suppose you arrive on a new team and take control of a module from someone else. At some point, you sit down with the previous owner, who proceeds to describe how the module works. If they do this effectively, they don't take an alphabetical file listing and go through each file in order - instead, they first focus on core parts of core functionality and expand outward to more peripheral functionality. You might ask a question about a piece of code you see, and they say "Don't worry about that - I'll get to that later when I talk about X." By controlling the presentation order, they enhance your comprehension by keeping code out of the way until you have the proper background to understand its purpose.</P>
<P>But what if this previous owner is long gone? They could have all the comments in the world, but you still won't have anything like the discussion above. Literate programming enables them to write a tutorial document that takes you through the code in the same friendly way as a sit-down with the owner, and yet never invokes redundant effort or becomes out-of-date, because it <EM>is</EM> the code as well.</P>
<H2>An example</H2>
<P>But there's no better way to understand literate programming than to see some. So, right here, I'm going to discuss an implementation in C# of <a href="http://blogs.msdn.com/devdev/archive/2005/08/17/452827.aspx">Bloom filters</A>, the topic of my very first post (read this first if you haven't already). This implementation is based on a paper by a former of professor of mine, Panagiotis Manolios, and one of his students, Peter Dillinger, called <EM><A href="http://www.cc.gatech.edu/fac/Pete.Manolios/research/bloom-filters-verification.html">Bloom Filters in Probabilistic Verification</A></EM>. This paper solves a frustrating implementation issue: Bloom filters require a potentially unlimited number of independent hash functions, and there's no convenient way of coming up with these hash functions on the fly. Their concept is to use only two independent hash functions to generate all of the hash values required. The insertion method below, based on Algorithm 2 in their paper, demonstrates this:</P><PRE><<BloomFilter Insert method>>=
public void Insert(TKey obj)
{
int h1 = obj.GetHashCode();
int h2 = obj.GetHashCode2();
unchecked
{
h1 = (int)(((uint)h1) % table.Count);
h2 = (int)(((uint)h2) % table.Count);
}
<<perform insertion>>
}
</PRE>
<P>The syntax <CODE><<perform insertion>></CODE>, used by the noweb tool, indicates a chunk of code that I haven't talked about yet. The label at the top is the name of this chunk of code. We use unchecked arithmetic for the casts because the hash code methods can return negative values, which we wish to convert to large positive values via the cast to uint. The narrowing cast back to int is safe because table.Count is in an int.</P>
<P>Above, TKey is a generic type parameter on the class specifying the type of values stored in the Bloom filter. Keys must implement the IDoubleHashable interface, which adds a second hash function GetHashCode2() independent from <A href="http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemobjectclassgethashcodetopic.asp">the standard GetHashCode() supplied by Object</A>:</P><PRE><<IDoubleHashable definition>>=
public interface IDoubleHashable
{
int GetHashCode2();
}
<<BloomFilter class definition>>=
public class BloomFilter<TKey> where TKey : IDoubleHashable
{
<<BloomFilter implementation>>
}
</PRE>
<P>We store the bits of the BloomFilter in a <A href="http://msdn.microsoft.com/library/en-us/cpref/html/frlrfSystemCollectionsBitArrayClassTopic.asp">standard BitArray collection</A>, which compactly stores the bits of the table and handles all the bit arithmetic for us:</P><PRE><<BloomFilter member variables>>+=
private BitArray table;
</PRE>
<P>This is in the <CODE>System.Collections</CODE> namespace, so we need a using statement:</P><PRE><<BloomFilter using statements>>+=
using System.Collections;
</PRE>
<P>The <CODE>+=</CODE> noweb syntax above is used to add a piece of code to a code section. This is useful for sections we wish to describe sequentially a part at a time, or for elements like the above where order is unimportant.</P>
<P>Now, we're prepared to describe the insertion. We set a number of roughly evenly spaced bits, where the first hash establishes the initial index, the second hash establishes the spacing, and the spacing is slightly increased as we go along. All arithmetic is done modulo the table size:</P><PRE><<Perform insertion>>=
for(int i=0; i<numHashes; i++)
{
table[h1] = true;
unchecked
{
h1 = (int)((uint)(h1 + h2) % table.Count);
h2 = (int)((uint)(h2 + i) % table.Count);
}
}
</PRE>
<P>The number of bits we set is determined by the fixed parameter numHashes. We'll describe how it's chosen a bit later on:</P><PRE><<BloomFilter member variables>>+=
private int numHashes;
</PRE>
<P>The lookup procedure is analogous, checking all the same bits and returning true only if they're all set:</P><PRE><<BloomFilter Contains method>>=
public bool Contains(TKey obj)
{
int h1 = obj.GetHashCode();
int h2 = obj.GetHashCode2();
unchecked
{
h1 = (int)(((uint)h1) % table.Count);
h2 = (int)(((uint)h2) % table.Count);
}
for (int i = 0; i < numHashes; i++)
{
if (table[h1] == false)
{
return false;
}
unchecked
{
h1 = (int)((uint)(h1 + h2) % table.Count);
h2 = (int)((uint)(h2 + i) % table.Count);
}
}
return true;
}
</PRE>
<P>The probability that this method will return true for a random value not in the set is given by:</P>
<P> <EM>p</EM> = (1 − <EM>e</EM><SUP>−<EM>kn/m</EM></SUP>)<SUP>k</SUP></P>
<P>where the variables are defined as:</P>
<UL>
<LI><EM>k</EM>: Called <CODE>numHashes</CODE> in the code, the number of table elements set by an insertion
<LI><EM>m</EM>: Called <CODE>table.Count</CODE> in the code, the number of elements in the table
<LI><EM>n</EM>: The number of distinct objects inserted so far</LI></UL>
<P>To initialize the object, we use the following method, which needs to know <EM>k</EM> and <EM>m</EM>:</P><PRE><<BloomFilter initialization>>+=
private void Initialize(int tableSize, int numHashes)
{
this.numHashes = numHashes;
table = new BitArray(tableSize, false);
}
</PRE>
<P>However, a user is unlikely to know the best values for these (which is why this isn't a constructor). Instead, we address two common use cases:</P>
<OL>
<LI>In this case, the user knows <EM>m</EM>, the desired table size in bits, and <EM>n</EM>, an estimate of how many elements they will insert. We choose <EM>k</EM> in a way that minimizes the false positive probability <EM>p</EM>, using the value (ln 2)<EM>m</EM>/<EM>n</EM> rounded to the nearest integer: <PRE><<BloomFilter initialization>>+=
public BloomFilter(int tableSize, int numElements)
{
Initialize(tableSize, (int)Math.Round(ln2*numElements/tableSize));
}
</PRE>
<LI>In this case, the user knows <EM>n</EM>, an estimate of how many elements they will insert, and <EM>p</EM>, the desired false positive probability. Bloom filter theory states that if <EM>k</EM> is chosen optimally (using the formula in part 1 above), we will have <EM>p</EM> = <EM>α</EM><SUP><EM>m</EM>/<EM>n</EM></SUP>, where the constant α = 1/2<SUP>ln 2</SUP>. This means we must have <EM>m</EM> = (log<SUB><EM>α</EM></SUB> <EM>p</EM>)<EM>n</EM>. I call <EM>α</EM> <CODE>optimalBaseRate</CODE> below: <PRE><<BloomFilter initialization>>+=
public BloomFilter(int numElements, double falsePositiveProb)
{
int tableSize = (int)Math.Ceiling(numElements * Math.Log(falsePositiveProb, optimalRateBase));
Initialize(tableSize, (int)Math.Round(ln2 * tableSize / numElements));
}</PRE></LI></OL>
<P>The constants used above are defined at the class level (we cannot use <CODE>const</CODE> because the <CODE>Math</CODE> calls are considered non-constant by the compiler):</P><PRE><<BloomFilter constants>>+=
private readonly double ln2 = Math.Log(2.0);
private readonly double optimalRateBase = 1.0/Math.Pow(2, Math.Log(2.0));
</PRE>
<P>We need the <CODE>System</CODE> namespace for <CODE>Math</CODE>:</P><PRE><<BloomFilter using statements>>+=
using System;</PRE>
<P>Finally, we put it all together to complete the implementation of <CODE>BloomFilter</CODE>:</P><PRE><<BloomFilter implementation>>=
<<BloomFilter constants>>
<<BloomFilter initialization>>
<<BloomFilter member variables>>
<<BloomFilter Insert method>>
<<BloomFilter Contains method>>
</PRE>
<P>We structure the program as two source files, with the interface in its own file, and both entities in the <CODE>BloomFilters</CODE> namespace:</P><PRE><<BloomFilter.cs>>=
<<BloomFilter using statements>>
namespace BloomFilters {
<<BloomFilter class definition>>
}
<<IDoubleHashable.cs>>=
namespace BloomFilters {
<<IDoubleHashable definition>>
}</PRE>
<P>Our implementation is now complete. My full version of this code includes double-hashable classes for integers and strings and a test driver, which I omit here for space. With a bit of added syntax, I could run this entire entry through the literate programming tool <A href="http://www.eecs.harvard.edu/~nr/noweb/">noweb</A>, and it would produce both an HTML document much like the one you're reading now, along with two C# source files which I can then build into an assembly and use.</P>
<H2>More information</H2>
<P>For a longer, fancier example of literate programming, take a look at <A href="http://moonflare.com/code/select/index.php">my Select implementation</A>, which uses the literate programming tool <A href="http://www.eecs.harvard.edu/~nr/noweb/">noweb</A> to write an implementation of the classical worst-case linear time selection algorithm using C++ and the document language LaTeX. Here we gain the additional advantage of a full-fledged document language, complete with the ability to typeset mathematical expressions and diagrams, which is impossible in ordinary source code.</P>
<P>Sometimes schemes such as Perl's pod, Javadocs, or C#'s XML comments are touted as being literate programming, but this is far from the case. These are simply isolated documentation of individual logical program units. While they are valuable and have little overhead, they do not <EM>present</EM> the program; they only document parts of it in an isolated way.</P>
<P>So when should you use literate programming? Certainly there is an overhead involved, but in any case where the ability to maintain a portion of a program is a critical factor, or where there is a high risk of an owner disappearing, or where code will be exposed to many external partners or customers, literate programming is a valuable tool for keeping the code accessible and easy to understand. Try rewriting a single confusing or troublesome portion of your program with it, update your build system to run the literate programming tools prior to compilation, and see if it makes a difference with your application.</P></SUP></EM></EM><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=516431" width="1" height="1">Software engineeringEfficient selection and partial sorting based on quicksorthttp://blogs.msdn.com/b/devdev/archive/2006/01/18/514482.aspxWed, 18 Jan 2006 23:26:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:514482MSDNArchive6http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=514482http://blogs.msdn.com/b/devdev/archive/2006/01/18/514482.aspx#comments<P>Most people who have studied algorithms remember quicksort, the ubiquitous sorting algorithm available in the standard library of nearly every programming language implementation in existence, including C, C++, Java, and the .NET Framework. Quicksort is a compelling example of how algorithms with poor worst-case behavior but good average-case or expected behavior can be highly practical. Much less well-known, however, are the simple variations on quicksort that can improve stack space usage and solve other important problems efficiently.</P>
<P>Let's briefly review quicksort. Quicksort is fundamentally based on the efficient <EM>partition</EM> operation, a linear-time operation which divides a list into two parts: the elements less than a value, and the elements greater than or equal to a value. The value is called the <EM>pivot value</EM>, and is usually chosen to be one of the values in the list. In efficient implementations of quicksort, partitioning is usually implemented as an in-place operation that just shuffles elements of the array around as it scans over it. Quicksort operates by simply choosing a pivot value, running partition on the array, and then recursively running quicksort on the two partitions:</P><PRE> // Sort the elements of a between indexes left and right inclusive
void quicksort(a, left, right) {
if (right > left) {
Choose a pivot element a[pivotIdx] with left <= pivotIdx <= right
// Here, partition() returns the new index of the pivot element
newPivotIdx = partition(a, left, right, pivotIdx);
quicksort(a, left, newPivotIdx-1);
quicksort(a, newPivotIdx+1, right);
}
}
</PRE>
<P>Much research has gone into the problem of choosing a good pivot element, but simply using a typical pseudorandom number generator to select an array index is usually sufficient to get efficient performance on average. A common and effective optimization is to switch to insertion sort for small subarrays, for which insertion sort is considerably faster than quicksort (and pretty much every sorting algorithm). A useful variation on quicksort is the partially tail-recursive variant, where we only make a recursive call for the smaller of the two sublists, and use a loop to deal with the other one:</P><PRE> // Sort the elements of a between indexes left and right inclusive
void quicksort(a, left, right) {
while (right > left) {
Choose a pivot element a[pivotIdx] with left <= pivotIdx <= right
newPivotIdx = partition(a, left, right, pivotIdx);
if (newPivotIdx - left > right - newPivotIdx) {
quicksort(a, newPivotIdx+1, right);
right = newPivotIdx - 1;
} else {
quicksort(a, left, newPivotIdx-1);
left = newPivotIdx + 1;
}
}
}
</PRE>
<P>This variant has worst-case logarithmic space usage, and in practice makes about half as many calls, making it more competitive with in-place sorting algorithms like heapsort. It's especially useful on memory-limited systems, but since stack space often comes at a premium, it's also useful on desktops.</P>
<P>Now let's turn to how we can solve different problems using variants on this idea. Related to sorting is the <I>selection problem</I>, where we are asked to find the <I>k</I>th smallest element in a list. A common special case of this is finding the median. Obviously, we can just sort the list and index, but this isn't terribly efficient - this is only a good idea if we're going to be selecting many different elements and the list is static. Instead, notice this important property of the partition function: the pivot element is always placed in its final sorted position, while all elements which come before it in sorted order are before it, and all those that come after it in sorted order are after it. Consequently, if we know the sorted position of the element we're looking for, we already know which partition it's in - just check <I>k</I> against the index of the pivot after the partition operation:</P><PRE>// Get the value of the kth smallest element in the list
object quickselect(a, k) {
int left=0, right = a.Length - 1;
while (right > left) {
Choose a pivot element a[pivotIdx] with left <= pivotIdx <= right
newPivotIdx = partition(a, left, right, pivotIdx);
if (k < newPivotIdx) {
right = newPivotIdx - 1;
} else {
left = newPivotIdx + 1;
}
}
return a[k];
}
</PRE>
<P>This is nearly the same as the second quicksort example above, but we no longer need to make any recursive calls - this algorithm is in-place. Moreover, because the sizes of the lists we visit shrink geometrically on average, the total expected time is only O(n), or linear time. The algorithm is also highly efficient in practice, and is used in all implementations of C++'s <CODE>nth_element</CODE> call. There is a variation on this scheme with worst-case linear time that I won't discuss here, because it's not as efficient in practice, but it was one of the most important early theoretical results in algorithms.
<P></P>
<P>A related problem is finding the <I>k</I> smallest or largest values in a list, or the "top <I>k</I>". This is important, for example, in cases where you want to display just the top results from a search based on some numeric metric of relevancy (or if you want to start your own pop music countdown). The simplest algorithm, running the first <I>k</I> iterations of selection sort, requires O(<I>kn</I>) time, which is quite expensive. Instead, we can exploit the properties of partition to avoid many of the recursive calls made by a complete quicksort:</P><PRE> // Rearrange the elements so that the k smallest elements appear at the beginning
void quicksortSmallestK(a, left, right, k) {
while (right > left) {
Choose a pivot element a[pivotIdx] with left <= pivotIdx <= right
newPivotIdx = partition(a, left, right, pivotIdx);
if (k <= newPivotIdx) {
right = newPivotIdx - 1;
} else if (newPivotIdx - left > right - newPivotIdx) {
quicksort(a, newPivotIdx+1, right);
right = newPivotIdx - 1;
} else {
quicksort(a, left, newPivotIdx-1);
left = newPivotIdx + 1;
}
}
}
</PRE>
<P>This algorithm uses O(n + klog k) expected time and O(log k) space, and again is quite efficient in practice. It can also be generalized to compute an arbitrary subarray of the sorted array (the <I>j</I>th through <I>k</I>th smallest values in the list in order).</P>
<P>When it comes to solving ordering problems, partitioning and quicksort-based algorithms are really your swiss army knife. Keep them in mind next time you encounter a problem like this.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=514482" width="1" height="1">Integer division by constantshttp://blogs.msdn.com/b/devdev/archive/2005/12/12/502980.aspxTue, 13 Dec 2005 04:07:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:502980MSDNArchive8http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=502980http://blogs.msdn.com/b/devdev/archive/2005/12/12/502980.aspx#comments<P>Today I'm going to discuss a few clever algorithms for dividing values by constant values known at compile-time, without the use of the division instruction.</P>
<P>In compiler optimization theory, the basic problem we attempt to solve is to create algorithms which take a specific program as input and produce a better program, for example a program which runs faster, uses less code space, or even consumes less power. An efficient standard library is an important step in creating efficient programs, but it does not provide the advantage of optimizations, which can dynamically adjust their behaviour to suit each input program. To demonstrate, today I'm going to discuss a very narrow but suprisingly interesting compiler optimization: efficiently dividing an (unsigned) integer by a fixed constant, such as 37.</P>
<P>In the bad old days, the division instruction on x86 machines was extremely slow, and the use of clever bit tricks was very common to speed it up, especially in high performance areas like graphics and simulation. Today we're more concerned with embedded systems, where, to save on chip cost, division is often implemented as a software library routine. Such a routine is necessary if neither argument is known until runtime, but it is extremely slow, easily taking hundreds of clock cycles. In many common applications, we know the divisor at compile-time, and we can exploit this knowledge to craft a specialized division algorithm that drastically outperforms the generic one.</P>
<P>Most programmers are aware that a value can be divided by 2<SUP><EM>n</EM></SUP>, truncating the remainder, by merely shifting it right by <EM>n</EM> bits. This works because the positional binary numeral system multiplies the <EM>k</EM>th digit from the right by 2<SUP><EM>k</EM></SUP>. This simple optimization is present in nearly every compiler; for this reason, overzealous programmers who convert their divisions by 4 into bit-shifts will usually discover that there is no improvement (at least if they measure the improvement, as they always should).</P>
<P>But what if we want to divide by a different number, say, 10? This would be quite handy to do efficiently, since division by 10 is often used in converting from binary to decimal or binary-coded decimal. Since 10 = 5×2, we really just have to shift right by 1 and then divide by 5. Dividing by 5 is the tricky part. There are several approaches; the following ideas are taken from the pre-released <A href="http://www.hackersdelight.org/divcMore.pdf">Chapter 10: Integer Division by Constants</A> of the next edition of <EM>Hacker's Delight</EM>, an excellent book full of neat bit tricks like these.</P>
<P>First, supposing a multiplication instruction is available, let's think about how we can divide using multiplication. First notice that dividing <EM>x </EM>by <EM>y</EM> is the same as multiplying <EM>x</EM> by 1/<EM>y</EM>, and we know <EM>y</EM> at compile time, so we can find 1/<EM>y</EM> then. We use a fixed-point representation to perform the fractional arithmetic using integers. For example, in the decimal system, one can divide by 5 by simply multiplying by 2 and then moving the decimal point one place to the left (x/5 = 2x/10). In binary, however, 1/5 does not have a terminating decimal; instead we have:</P>
<P>1/5 = 0.001100110011...<SUB>2</SUB></P>
<P>Now, suppose we have an instruction that multiplies two 32-bit values and produces a 64-bit result. Just as x/5 = 2x/10, we also see that:</P>
<P>x/5 = (858993459.2)x / 2<SUP>32</SUP>.</P>
<P>We can approximate (858993459.2)x by (858993459.25)x, which is approximated in turn by 858993459*x + (x >> 2). We just do one multiply, one shift, and one 64-bit add, then take the high word (the high 32 bits). This is just a heuristic argument, but sure enough, if we exhaustively try this formula on every 32-bit value, it never fails. On many machines like the x86 this can be done with as little as 4 instructions.</P>
<P>But what about the machines not fortunate enough to have a 64-bit multiply result? Or what if you're writing in a high-level language like C where multiplication always has the same size operands and result? In this case the task is more difficult. The key is to notice that simulating an approximate multiply by 1/5 isn't so hard, because its bit pattern is very regular: 1/5 is about 0.00110011001100110011001100110011. Now, notice that:</P>
<P>0.00110011001100110011001100110011 = 0.0011001100110011 + (0.0011001100110011 >> 16)<BR>0.0011001100110011 = 0.00110011 + (0.00110011 >> 8)<BR>0.00110011 = 0.0011 + (0.0011 >> 4)<BR>0.0011 = 1 >> 3 + 1 >> 4</P>
<P>Applying this logic in reverse, we devise the following algorithm for approximating x/5:</P><PRE> q = (x >> 3) + (x >> 4);
q = q + (q >> 4);
q = q + (q >> 8);
q = q + (q >> 16);
</PRE>
<P>Unfortunately, this algorithm isn't correct off the bat like our last one, because we're discarding bits and accumulating rounding error all over the place. However, using multiplication, we can get an idea of how far off we are:</P><PRE> r = x - 5*q;
</PRE>
<P>By adding a carefully devined error adjustment based on this value, to be precise 11r/32, we come out with a correct algorithm that can be verified by exhaustive testing. See the PDF linked above if you're interested in the precise details of how we come up with this adjustment. Our complete algorithm reads:</P><PRE> q = (x >> 3) + (x >> 4);
q = q + (q >> 4);
q = q + (q >> 8);
q = q + (q >> 16);
r = x - 5*q;
return q + ((11*r) >> 5);
</PRE>
<P>Of course, while being able to divide by 5 is great, ideally we'd want our compiler to come up with a clever algorithm for whatever constant we choose to divide things by. This is considerably more complicated, but there are compiler algorithms which will generate clever bit twiddling code like the above for any constant divisor and inline them right into the code performing the division. Especially for small divisors, the result is drastically more efficient than a call to a general software division function.</P>
<P>A much simpler version to understand in the general case is called <EM>exact division</EM>. Say you have a number that, for whatever reason, you already know is divisible by 22, and you want to divide it by 22. In this case, the solution is simple. We first shift right by 1; it remains to divide it by 11. Next, drawing from the theory of groups and modular arithmetic, we note for every odd number a less than 2<SUP>32</SUP>, there is necessarily a <EM>multiplicative inverse</EM> b such that ab = 1 (mod 2<SUP>32</SUP>). In other words, multiplying a by b and taking the low 32 bits, which the hardware does already, will result in the number 1. Consequently, x/a = bx/(ab) = bx (since ab=1). There are simple efficient algorithms to find this inverse, such as the extended Euclidean algorithm. Here are the inverses of several small odd integers mod 2<SUP>32</SUP>:</P>
<TABLE>
<TBODY>
<TR>
<TD>3</TD>
<TD>2863311531</TD>
<TD width=50>
<TD>11</TD>
<TD>3123612579</TD></TR>
<TR>
<TD>5</TD>
<TD>3435973837</TD>
<TD width=50>
<TD>13</TD>
<TD>3303820997</TD></TR>
<TR>
<TD>7</TD>
<TD>3067833783</TD>
<TD width=50>
<TD>15</TD>
<TD>4008636143</TD></TR>
<TR>
<TD>9</TD>
<TD>954437177</TD>
<TD width=50>
<TD>17</TD>
<TD>4042322161</TD></TR></TBODY></TABLE>
<P>In our example, to divide, say, 3916 by 11, we'd just compute 3916*3123612579, which when truncated to the low 32 bits by the hardware is 356.</P>
<P>If you're interested in this kind of stuff, check out <EM>Hacker's Delight</EM> - it covers not only tricks for arithmetic but for other more complex operations such as counting the number of bits set in a word (the population function). I hope you guys found this interesting. As always, please feel free to leave any comments or e-mail me.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=502980" width="1" height="1">Succinct data structureshttp://blogs.msdn.com/b/devdev/archive/2005/12/05/500171.aspxMon, 05 Dec 2005 21:43:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:500171MSDNArchive4http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=500171http://blogs.msdn.com/b/devdev/archive/2005/12/05/500171.aspx#comments<P>Sorry for the long hiatus, everyone. Today I'm going to talk about succinct data structures, which are informally data structures that use very close to the absolute minimum possible space. This material is largely based on lectures by MIT professor Erik Demaine, as transcribed by 6.897 students Wayland Ni and Gautam Jayaraman (<A href="http://theory.csail.mit.edu/classes/6.897/spring03/scribe_notes/L12/lecture12.pdf">link</A>). Also see additional notes from this year by students <A href="http://theory.csail.mit.edu/classes/6.897/spring05/lec/lec21.pdf">Yogo Zhou</A> and <A href="http://theory.csail.mit.edu/classes/6.897/spring05/lec/lec22.pdf">Nick Harvey</A>.</P>
<P>One big problem with classical data structures that is too often overlooked is how horribly inefficient their space usage is. For example, suppose you have a linked list of characters. Each node contains a character and a pointer; on a 32-bit platform, each pointer takes 4 bytes, so that's already 5n bytes for n elements. Additionally, it's unlikely with most standard allocators that the 3 bytes following the character will be used; these are normally reserved for padding for alignment. On top of this, unless some kind of memory pool is created for the nodes, most simple allocators will reserve 4 to 12 bytes of space <EM>per node</EM> for allocation metadata. We end up using something like 16 times as much space as the actual data itself requires. Similar problems appear in many simple pointer-based data structures, including binary search trees, adjacency lists, red-black trees, and so on, but the problem isn't limited to these: dynamic arrays of the sort implemented by Java and .NET's ArrayList classes can also waste linear space on reserved space for growing the array which may never be used.</P>
<P>A natural explanation to this state of affairs would be to say that all of this extra data is neccessary to facilitate fast operations, such as insertion, removal, and search, but in fact this isn't true: there are data structures which can perform many of the same operations using very close to the minimum possible space. What is the minimum possible space? For a list of data, it's simply the space taken by a static array, which is just the space taken by the data itself. But what about, say, a binary tree, or a set of integers?</P>
<P>This is a little trickier, but the idea is to use combinatorics to count the number of possible values of size <EM>n</EM>, and then take the logarithm of this. Because we need a distinct representation for each value (otherwise we couldn't tell which one we have), this will give us the average number of bits needed to represent an individual value. For example, suppose we have a binary tree with <EM>n</EM> nodes; we are ignoring data for now and just assuming each node has a left and right child reference. How many bits do we need to store this tree? Well, the number of rooted, ordered, binary trees with <EM>n</EM> nodes is described by <EM>C<SUB>n</SUB></EM>, the <EM>n</EM>th <A href="http://mathworld.wolfram.com/CatalanNumber.html">Catalan number</A>, which can be defined by this formula:</P>
<P><EM>C<SUB>n</SUB></EM> = (2<EM>n</EM>)! / ((<EM>n</EM> + 1)! <EM>n</EM>!)</P>
<P>Now, since <A href="http://mathworld.wolfram.com/StirlingsApproximation.html">Stirling's approximation</A> tells us that log(<EM>n</EM>!) is <EM>n</EM>lg <EM>n</EM> plus some stuff asymptotically smaller than <EM>n </EM>(written o(n)), where lg is the log base 2, we can see that the logarithm of <EM>C<SUB>n</SUB></EM> is asymptotically approximated by:</P>
<P>log <EM>C<SUB>n</SUB></EM> ≈ (2<EM>n</EM>)lg(2<EM>n</EM>) − (<EM>n</EM> + 1)lg(<EM>n</EM> + 1) − <EM>n</EM>lg(<EM>n</EM>)<BR>log <EM>C<SUB>n</SUB></EM> ≈ (2<EM>n</EM>)(lg(2)<EM> + </EM>lg(<EM>n</EM>)) − 2<EM>n </EM>lg(<EM>n</EM>)<BR>log <EM>C<SUB>n</SUB></EM> ≈ 2<EM>n</EM></P>
<P>In other words, we have to use at least 2 bits per node - this is the best we can possibly do. Surprisingly, there is a very simple encoding that does so. We traverse the tree in preorder, outputting 1 for each node and outputting 0 whenever we hit a <EM>null</EM>. Here's some C# code which implements the conversion back and forth between this compact bit string and a full in-memory tree structure, preserving both the tree's exact shape and its data:</P><PRE> class BinaryTreeNode {
public object data;
public BinaryTreeNode left, right;
}
static void EncodeSuccinct(BinaryTreeNode node, ref BitArray bits, ref ArrayList data) {
bits.Length++;
bits[bits.Length - 1] = (node != null);
if (node != null) {
data.Add(node.data);
EncodeSuccinct(node.left, ref bits, ref data);
EncodeSuccinct(node.right, ref bits, ref data);
}
}
static BinaryTreeNode DecodeSuccinct(BitArray bits, ArrayList data) {
int bitsIndex = -1, dataIndex = -1;
return DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
}
static BinaryTreeNode DecodeSuccinctHelper(BitArray bits, ref int bitsIndex,
ArrayList data, ref int dataIndex) {
BinaryTreeNode result;
bitsIndex++;
if (bits[bitsIndex])
{
result = new BinaryTreeNode();
dataIndex++;
result.data = data[dataIndex];
result.left = DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
result.right = DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
} else {
result = null;
}
return result;
}
</PRE>
<P>The original representation uses about 3 times as much space as the data, while the succinct representation uses only 2<EM>n</EM> + 1 extra bits, an overhead of 6% on a 32-bit machine. There is some constant overhead, but because this is asymptotically less than <EM>n</EM>, the data structure is still considered succinct. A number of applications are evident: trees can be compactly stored and transmitted across networks in this form, embedded and portable devices can incorporate pre-built tree data structures in the application's image for fast startup, and so on.</P>
<P>On the other hand, it would seem that you can't really <EM>do</EM> much with trees in this form - in limited memory environments, you might not be able to expand the tree back out to its full form. An alternative encoding with the same space requirement is to search the tree with breadth-first instead of depth-first search. If we do this, the resulting bit string has some useful properties. Define the <EM>Rank</EM> and <EM>Select</EM> functions as:</P>
<P>Rank(<EM>i</EM>) = the number of 1 bits between indexes 0 and <EM>i</EM> in the bit string<BR>Select(<EM>i</EM>) = the index of the <EM>i</EM>th 1 bit in the bit string</P>
<P>The following can be proven: if we have the breadth-first search bit string for the tree, the children of the node represented by bit <EM>i</EM> are always represented by bits 2(Rank(<EM>i</EM>)) and 2(Rank(<EM>i</EM>)) + 1. The parent of index <EM>i</EM> is always located at index Select(<EM>i</EM>/2). Finally, the data for the node at index <EM>i </EM>is located at index Rank(<EM>i</EM>) in the data array. Using these lemmas, along with an efficient implementation of Rank and Select, we can traverse the tree and perform all the usual operations on it in this compact form. I won't get into implementing Rank and Select here, but suffice it to say that by adding only O(<EM>n</EM>/log <EM>n</EM>) additional bits (which again is asymptotically less than <EM>n</EM>) they can be implemented in constant time; if the hardware supports a <EM>find-first-zero</EM> and/or <EM>population</EM> (bit-count) operation, these can be quite efficient.</P>
<P>This same idea can be applied to all sorts of data structures: for example, there are <EM>m choose n</EM> (that is, <EM>m</EM>!/(<EM>n</EM>!(<EM>m</EM>-<EM>n</EM>)!)) different sets of integers between 1 and <EM>m</EM>, and there are data structures which use bits in the logarithm of this quantity to store such a set. The number of partitions of a set of objects is defined by the <A href="http://mathworld.wolfram.com/BellNumber.html">Bell numbers</A>, and there are data structures using bits in the logarithm of these numbers. A variety of papers address succinct graph representations, and there are even schemes for representing graphs with specific properties that use no more bits than the logarithm of the number of graphs having those properties.</P>
<P>One of the most interesting succinct data structures, created by Franceschini and Grossi in 2003 in their <A href="http://www.springerlink.com/index/GQ8AMCEQ88BE4YW2.pdf"><EM>Optimal Worst-Case Operations for Implicit Cache-Oblivious Search Trees</EM></A>, is a search tree which permits all the usual insert, delete, and search operations in O(log <EM>n</EM>) time, but with only constant space overhead! It stores the data in an array, and nearly all the information it needs to search it is contained in the order of the elements alone (similar to the heap used in heapsort, but with different operations). On top of this, the structure is <EM>cache-oblivious</EM>, meaning that it maximally utilizes every level of the memory hierarchy (within a constant factor) without any machine-specific tuning (I'll talk more about cache-oblivious data structures another time). I haven't read this paper yet, but for those who believe data structures are a "solved problem", this kind of result makes it clear that there are many wonders yet to behold.</P>
<P>Finally, I wanted to come back to the problem with ArrayList. Although the simple array-growing scheme used by typical dynamic arrays can waste linear space (sometimes as much unused space is reserved as is used for data), there are alternative dynamic array data structures which remain efficient but reserve only O(sqrt(<EM>n</EM>)) space, which is asymptotically optimal (there is a sqrt(<EM>n</EM>) lower bound). These are described in Brodnik et al.'s <EM><A href="http://portal.acm.org/citation.cfm?id=673194">Resizeable Arrays in Optimal Time and Space</A>.</EM></P>
<P>Thanks for reading, everyone. Please ask any questions that you have, and don't hesitate to e-mail me if you want to.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=500171" width="1" height="1">Data structuresPersistent data structureshttp://blogs.msdn.com/b/devdev/archive/2005/11/08/490480.aspxTue, 08 Nov 2005 22:41:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:490480MSDNArchive3http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=490480http://blogs.msdn.com/b/devdev/archive/2005/11/08/490480.aspx#comments<P>When learning to program in a functional language such as Lisp, <A href="http://caml.inria.fr/">Ocaml</A>, or <A href="http://www.haskell.org/">Haskell</A>, one of the most difficult aspects of the paradigmic shift is that data in these languages is almost entirely immutable, or read-only. In purely functional programs, there is no assignment, and data structures cannot be modified. But how are we to do anything useful with a data structure we can't modify? The answer is that instead of modifying it, we create new data structures that incorporate the old structure as a part.</P>
<P>The simplest example of this is the singly-linked list. Say you have a pointer to the head of a linked list A. I can take that list and add three new nodes to the beginning to get a new list B. Yet, due to the structure of the list, the old pointer to the first node of the list A will still point to the same data structure as before; you wouldn't even notice I prepended the new elements. We have retained access to both the previous version and updated version of the list simultaneously. Conversely, if I take a pointer to the third node of A, I get a new list C consisting of A with the first two elements removed (called a <EM>tail </EM>of the list), but existing pointers to A still see the whole list. A data structure with this special property, the ability to retain access to the old version during updates, is called a <EM><A href="http://en.wikipedia.org/wiki/Persistent_data_structure">persistent data structure</A> </EM>(not to be confused with a <EM>disk-based</EM> data structure, which might also be inferred from the word "persistent"). Why might this be useful? There are several reasons:</P>
<UL>
<LI>We can treat all lists as immutable, or read-only. As a result, a function can be sure that no one else is going to modify its list. It can safely pass it to untrusted functions, and annoying phenomena like aliasing and race conditions cannot arise.
<LI>Because the data is immutable, the compiler can make assumptions that allow it to produce faster code, and the runtime and garbage collector can also make simplifying assumptions.
<LI>It allows us to efficiently "undo" a sequence of operations - just retain a pointer to the old version you want to return to later, and when you're done with the new version, throw it away (assuming you have garbage collection). This is particularly useful for rolling back a failed operation.</LI></UL>
<P>On the other hand, other operations on the singly-linked list, such as node removal or insertion in the middle, would overwrite existing pointers and destroy the old version. We can make any data structure (even arrays) persistent by simply making a complete copy of it and then modifying it to perform each operation, but this approach is obviously very expensive in time and space. An active area of programming language research is to find new, more clever data structures that can perform more operations in an efficient, persistent way.</P>
<P>Following is a simple example, taken from Chris Okasaki's excellent and readable PhD thesis, <A href="http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf"><EM>Purely Functional Data Structures</EM></A> (you can also purchase a <A href="http://www.amazon.com/exec/obidos/tg/detail/-/0521663504/">book version</A>). Say we wish to implement a simple queue. Our queue will support three main operations: add an element to the front, add an element to the back, and remove an element from the front. A singly linked list supports only two of these persistently, adding or removing an element from the front. A simple solution to this problem is to maintain two singly-linked lists, call them L and R. The queue's contents are simply the contents of L followed by the the contents of the reverse of R. For example, if L={1,2} and R={4,3}, then our queue contains the values {1,2,3,4} in that order. To add to the beginning or end of the queue, we simply add to the beginning of either L or R. To remove an element from the front, we remove an element from the front of L. However, it may so happen that L is empty. If this happens, we take R, reverse it, and make it the new L. In other words, we create a new queue {L,R} where L is the reverse of the old R and R is the empty list.</P>
<P>But isn't this operation, reversing a potentially long list, expensive? Not in an amortized sense. Each element added to R will only get moved to L one time ever. Therefore, a sequence of <EM>n</EM> insertions and <EM>n</EM> removals will still take only O(<EM>n</EM>) time, which is an amortized time of O(1) per operation. We say that we can "charge" each of the original <EM>n</EM> insertions a cost of O(1) to "pay for" the O(<EM>n</EM>) reversal cost.</P>
<P>Binary search trees and even self-balancing trees like red-black trees can also be made fully persistent: when we add a new node N, we create a new version of N's parent, call it P, with N as a child. We must also create a new version of P's parent with P as a child, and so on up to the root. For a balanced tree, O(log n) space is required for such an insertion operation - more than the O(1) required for an in-place update, but much less than the O(n) required for a complete copy-and-update. These are very useful for building map data structures that need the advantages of persistence.</P>
<P>One key observation to building more clever persistent data structures is that, really, we <EM>can</EM> modify the old version, as long as it <EM>behaves</EM> the same as it did before. For example, if we have a binary search tree, we can rebalance the tree without changing its contents; it only becomes more efficient to access. Objects holding existing pointers to the tree won't notice that it's been rebalanced as long as they access it only through its abstract interface and can't touch its internal representation. This can be exploited to maintain an efficient representation as more and more versions are created. On the other hand, this also sacrifices the advantage of avoiding race conditions; we must ensure the data structure is not accessed in an intermediate state where it would not behave the same.</P>
<P>Persistent data structures aren't strictly limited to functional languages. For example, the <A href="http://www.moo.mud.org/">MOO</A> and <A href="http://cold.org/coldc/">ColdMUD</A> virtual environment languages use immutable data structures for built-in string, list, and map types, but have mutable objects. Nevertheless, implementations of persistent data structures are, today, largely limited to functional languages. But they're useful in every language - perhaps one day we will see them in standard libraries. Meanwhile, take a look at <A href="http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf">Okasaki's thesis</A>, and when thinking about your next design, give a thought to how persistent data structures could simplify your own code.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=490480" width="1" height="1">Data structuresSoftware transactional memoryhttp://blogs.msdn.com/b/devdev/archive/2005/10/20/483247.aspxFri, 21 Oct 2005 02:29:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:483247MSDNArchive7http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=483247http://blogs.msdn.com/b/devdev/archive/2005/10/20/483247.aspx#comments<P><STRONG>Software transactional memory</STRONG> (<STRONG>STM</STRONG>) is a scheme for concurrent programming with multiple threads that uses transactions similar to those used in databases. Today I'll discuss what STM is, how it works, some implementations, and why you should care.</P>
<P>It's well-known that if two or more threads blindly try to access the same data at the same time, a variety of undesirable <EM>race conditions</EM> can result, situations where the outcome of the program will depend on the order in which the threads access the data. Usually, only one of these orders is desirable. In multithreading, the predominant means of synchronization are <EM>locks</EM>, which in their simplest form enforce that only one thread can access a particular resource at one time. There are several problems with this approach:</P>
<UL>
<LI>If the locks are too coarse-grained (that is, the code sections they guard are too large), the locked code sections can become a contended bottleneck, leading to decreased concurrency and, on multiprocessor machines, performance closer to that of a single processor machine.
<LI>If the locks are too fine-grained, the code can end up spending more time locking, unlocking, and context-switching than it does doing real work.
<LI>Locks cause several unpredictable concurrency issues of their own, such as deadlock, livelock, priority inversion, and so on.
<LI>In the event of an unexpected error, a lock can be left permanently locked, although C#'s "lock block" helps to prevent this.
<LI>Last but not least, locks disturb the flow of the program and force the programmer to think about what concurrent threads could be doing - which programmers happen to be quite terrible at doing. This error-prone process leads to many of the most difficult to reproduce and difficult to fix bugs that we face in our daily lives.</LI></UL>
<P>Some of these problems are addressed to some extent by more advanced types of locks such as read-write locks, semaphores, monitors, write barriers, and events, but when you add complexity you add overhead and decrease performance, so we continue to use normal locks more often than anything else.</P>
<P>On the other hand, if you've ever written a SELECT or UPDATE command in SQL, you probably didn't stop to think: what if someone updates the rows I'm querying or updating concurrently? Shouldn't I be locking this table before I mess with it? Should I lock just the rows I'm working with or the whole table? In most applications, query concurrency is simply transparent; the database engine takes care of everything for you. Instead of providing locks, it exposes to the programmer the simpler concept of a <EM>transaction</EM>. At its heart, a transaction is simply a way of performing a group of operations in such a way that they appear to happen atomically, all at a single instant. The intermediate states cannot be read or updated by any other transaction. This enables you to keep tables consistent with their constraints at all times - one operation can violate a constraint, but both before and after each transaction, it will hold.</P>
<P>In typical implementations of transactions, you have the option of either <EM>commiting</EM> or <EM>aborting</EM> a transaction at any time. When you commit a transaction, you agree that the transaction went as planned and make its changes permanently in the database. If you abort a transaction on the other hand, this means that something went wrong and you want to roll back, or reverse, any partial changes it's already made. Good installers use a similar rollback scheme: if some installation step fails and recovery is not possible, it will erase all the cruft it's already put on your machine so that it's just like it was before. If it fails to do this, the machine may become unusable or future installation attempts may fail.</P>
<P>Let's think about how transactions might be useful in concurrent programming. Say you're writing a threadsafe doubly-linked list class. You want to insert a node into the middle of the list. In a single-threaded application, you would just write something like this:</P><PRE> public void InsertAfter(Node node, T value)<BR> {<BR> Node newNode = new Node(value);<BR> newNode.prev = node;<BR> newNode.next = node.next;<BR> newNode.prev.next = newNode;<BR> newNode.next.prev = newNode;<BR> }<BR></PRE>
<P>Unfortunately, the linked list becomes inconsistent between the last two lines (newNode is in the list, but newNode.next.prev is not newNode), and it's impossible to reorder these statements in such a way that the linked list will remain consistent between every pair of statements. You could put a lock around every method accessing the list, but then two threads can't access different parts of the list at the same time, which can be a serious performance problem if it's a large "global" list accessed by many threads at once. If multiple locks became involved, you would also have to be careful to avoid deadlock.</P>
<P>Let's take a different approach based on transactions. We introduce the concept of a <EM>version</EM> for each object (in our example, each node would be an object). The version is just a counter that starts at zero. Each time an object is modified, its version is incremented. Each time the object is read, the version is (atomically) read along with it. Now we're ready to implement <A href="http://www.cambridge.intel-research.net/%7Erennals/notlockfree.pdf">the software transactional memory scheme of Robert Ennals</A>:</P>
<OL>
<LI><STRONG>Optimistic reads</STRONG>: Every time we read an object, we remember its version at that instant. When we commit the transaction, we verify that the version for every object we read is still the same as it was when we read it. If any of them have changed, we abort and roll back the transaction, undoing any writes it did, and then try the operation over again. If none of them have changed, we know that the concurrent writes would not have altered the transaction's behavior in any case, so we can commit. The assumption here is that the concurrent writes aren't going to tamper with the objects we're reading very often, just other objects we don't care about.
<LI><STRONG>Private revocable writes</STRONG>: Whenever we want to write to a node, we first lock it. The lock prevents others from reading or writing it simultaneously, and the transaction will continue to hold this lock until it commits or aborts. Next, we make a private copy of the node. Any changes we want to make we make to this private copy. By ensuring that each object is accessed through only one reference (which may itself be referenced by multiple references), we can atomically swap out the old copy for the new, updated copy when we commit the transaction. To abort the transaction, we simply release all write locks and throw away all the private copies. Notice that if a transaction locks a node for writing and later aborts, transactions which read that node can still go on to succeed.
<LI><STRONG>Deadlock detection</STRONG>: When we're automatically locking things all over the place, deadlock is sure to happen sooner or later. To deal with this, we periodically interrupt the program (much as a garbage collector does) and scan for a cycle of objects that are deadlocked, each waiting on a lock held by the next. We then force one of them to abort, releasing its locks and breaking the cycle. With this system in place, deadlocks can never occur. Databases with locking use a similar system to prevent deadlock.
<LI><STRONG>Edge cases</STRONG>: If a transaction which has read inconsistent state tries to commit, it will fail, but it's also possible that such a transaction could go into an infinite loop or have a fatal exception because it wasn't written to deal with this situation (nor should it be). Again, the runtime is responsible for detecting and aborting such transactions.</LI></OL>
<P>Now, all this copying, swapping, and retrying of aborted operations looks expensive. However, the truth is that so much concurrency is gained from a transaction-based system that they are often much more efficient than lock-based systems even for a small number of processors, as you can see in the Performance section of <A href="http://www.cambridge.intel-research.net/%7Erennals/notlockfree.pdf">Ennals' paper</A>. The reason for this is that simple locking is often too pessimistic - it locks an entire data structure when really different threads could be safely operating on different parts of it concurrently, as long as you have a roll back mechanism ready in case a conflict arises.</P>
<P>So it's really a win-win - you can write your program almost as though it were single-threaded, and it runs faster! Even on a single processor, the penalty is not so severe as to outweigh the benefits to the programmer (who is, after all, more expensive than both software and hardware). If we return to our linked list example, we can simply take the single-threaded solution, put each method inside a transaction, and we're done.</P>
<P>This all sounds nice in theory, but few of us are qualified to write and use a production-quality STM library after a quick read of a paper. Luckily, there are already some good implementations floating around that you can try:</P>
<UL>
<LI>Robert Ennal's <A href="https://sourceforge.net/projects/libltx">Lightweight Transaction Library</A>, a C library written for UNIX and recently re-released under the modified BSD license (which allows commercial use as well as commercial use of derivative works such as ports). Someone should really port this to a .NET library.
<LI>Microsoft Research's <A href="http://research.microsoft.com/research/downloads/default.aspx">C# Software Transactional Memory</A> (SXM). (restricted to noncommercial)</LI></UL>
<P>Here are some more papers on the topic:</P>
<UL>
<LI>The <A href="http://citeseer.ist.psu.edu/shavit95software.html">original paper</A> on software transactional memory.
<LI>Microsoft Research's <A href="http://research.microsoft.com/Users/simonpj/papers/stm/stm.pdf">Composable Memory Transactions</A>.
<LI><A href="http://research.sun.com/people/moir/Papers/PODC03.pdf">Software Transactional Memory for Dynamic-Sized Data Structures</A>, which introduced the obstruction-free DSTM implementation.
<LI>The <A href="http://en.wikipedia.org/wiki/Software_transactional_memory">Wikipedia article</A> (which needs some expansion).
<LI>A <A href="http://www.cs.wisc.edu/trans-memory/biblio/swtm.html">small bibliography</A>.</LI></UL>
<P>Give them a try - in 5 or 10 years, locks might just be a footnote in a Systems textbook.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=483247" width="1" height="1">Non-nullable typeshttp://blogs.msdn.com/b/devdev/archive/2005/10/10/479272.aspxMon, 10 Oct 2005 21:39:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:479272MSDNArchive5http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=479272http://blogs.msdn.com/b/devdev/archive/2005/10/10/479272.aspx#comments<P>If you write programs in C, C++, Java, or C#, you've gotten used to having the null value around. The null value is a special reserved reference (or pointer) value indicating that a reference does not refer to any object. It's useful for constructing a variety of data structures, but it's also a notoriously common source of bugs, because occasionally values that "should never" be null turn out to be, unexpectedly. On modern machines this type of runtime error can be detected quickly if it occurs, typically by using the value zero for null and unmapping the first page of memory, but it would be much nicer to prevent this type at error at compile-time.</P>
<P>Impossible, you say? Not at all. In fact, functional languages in the ML family such as Standard ML, Ocaml, and Haskell have always made a distinction between nullable types (types that can have a reserved null value, such as "int option") and non-nullable types (such as "int"). There's no reason for every type to include, by default, a feature you may not use. There's nothing magic about this; it doesn't make a program any more difficult to type check if you say some values can be null and other values can't.</P>
<P>We could make a similar distinction in a language such as C#. In fact, someone already has. One of the most exciting research projects at Microsoft is an extension of C# called <A href="http://research.microsoft.com/specsharp/">Spec#</A>. Spec# has a lot of truly cutting-edge features that I won't get into right now, but the one simple feature that I do want to mention is non-nullable types. These allow you to affix an exclamation mark (!) onto any reference type, yielding a new type that cannot be null but is otherwise the same. For example, consider this simple string comparison method:</P><PRE> public int StringCompare(string left, string right)
{
for(int i=0; i<Math.Min(left.Length, right.Length); i++)
{
if (left[i] != right[i])
{
return left[i] - right[i];
}
}
return right.Length - left.Length;
}
public int Foo(string name)
{<BR> return StringCompare(name, name + "bar");
}
</PRE>
<P>If I enter this into a Spec# buffer, it underlines the references to "left" and "right" with red lines to indicate that I may be dereferencing a null value, which is a compile-time error (not just a warning). The solution is to change the signature of StringCompare to the following, indicating that it only accepts non-null strings:</P><PRE> public int StringCompare(string! left, string! right)
</PRE>
<P>If I do this, however, I get an error in Foo() that it's passing a possibly-null string where a non-null string is expected (a type error). Supposing I want Foo to accept null values, I can add the appropriate null check and the error goes away:</P><PRE> public int Foo(string name)
{
if (name == null) return 0;
return StringCompare(name, name + "bar");
}
</PRE>
<P>Even more importantly, Spec# ships with improved signatures for a large number of .NET Framework methods. You no longer have to worry about NullReferenceExceptions occurring when you pass null to a Framework method that doesn't accept it - you find out at compile-time. They generated most of these signatures automatically by reflecting on the CLR and looking for an argument being tested for null followed by a throw of NullReferenceException, but ideally the CLR team would include the proper types from the beginning.</P>
<P>Non-nullable types are also present in other extension languages and tools. <A href="http://www.research.att.com/projects/cyclone/">Cyclone</A>, the safe C dialect, adds non-nullable pointers to C (along with several other types of pointers, see <A href="http://www.research.att.com/projects/cyclone/online-manual/main-screen002.html#toc3">Pointers</A>). <A href="http://research.compaq.com/SRC/esc/">ESC/Java</A> adds non-nullable pointers, along with many other extensions, to Java (see <A href="http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-2000-002.html#2.4.0%20%20non_null%20pragma">non-null pragma</A>). Microsoft has a well-known internal static checking tool called PREfix which allows the specification of non-null types in C++ code using the <STRONG>SAL_notnull</STRONG> SpecStrings annotation. This feature is so easy to add and so useful that just about every new language does so. </P>
<P>So why have we continued to battle null values for all these years when there's an easy-to-implement, well-known compile-time solution? Got me, but I wouldn't mind seeing this feature in C# 3.0.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=479272" width="1" height="1">Secret sharinghttp://blogs.msdn.com/b/devdev/archive/2005/10/02/476273.aspxMon, 03 Oct 2005 01:31:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:476273MSDNArchive2http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=476273http://blogs.msdn.com/b/devdev/archive/2005/10/02/476273.aspx#commentsOne of the most difficult problems in cryptographic key management is
keeping a secret key safe from both compromise and loss. If you don't make
enough backups, the key might be destroyed in a hardware failure or
natural disaster. But if any backup is compromised, the key is
compromised.<br>
<br>Rather than invent new tools, one might try to solve the problem
with encryption. Just encrypting the key normally is not helpful, since
then you have the
same problem with the key needed to decrypt it. Instead, if you encrypt
your key using multiple keys, then
store these keys in different locations, you need all the keys to
decrypt it. This protects against compromise, but if any key is lost,
so is the original key. To fix this,
you'd have to encrypt the original key many times, each time using just
some of
the keys. For example, suppose you had 5 keys and you wanted to be able
to decrypt using any 3 of them. Then you'd need to perform all the
following encryptions:<br>
<ul>
<li>E<sub>1</sub>(E<sub>2</sub>(E<sub>3</sub>(K)))</li>
<li>E<sub>1</sub>(E<sub>2</sub>(E<sub>4</sub>(K)))</li>
<li>E<sub>1</sub>(E<sub>2</sub>(E<sub>5</sub>(K)))</li>
<li>E<sub>1</sub>(E<sub>3</sub>(E<sub>4</sub>(K)))</li>
<li>E<sub>1</sub>(E<sub>3</sub>(E<sub>5</sub>(K)))</li>
<li>E<sub>1</sub>(E<sub>4</sub>(E<sub>5</sub>(K)))</li>
<li>E<sub>2</sub>(E<sub>3</sub>(E<sub>4</sub>(K)))</li>
<li>E<sub>2</sub>(E<sub>3</sub>(E<sub>5</sub>(K)))</li>
<li>E<sub>2</sub>(E<sub>4</sub>(E<sub>5</sub>(K)))</li>
<li>E<sub>3</sub>(E<sub>4</sub>(E<sub>5</sub>(K)))</li>
</ul>
The number of encryptions needed grows very quickly. It also takes
quite a bit of time and space to store all these multiply-encrypted
keys. Let's look at some other solutions.<br>
<br>
One solution is to simply break the key file into two parts,
make a copy of each, and store the four parts in four locations. Then,
even if any one is compromised, the key is not revealed. Unfortunately,
this scheme is not that helpful, because just knowing half of someone's
private key is enough to drastically decrease the time needed for a
brute-force search to crack the remainder of the key.<br>
<br>
Here's another simple solution: create a one-time pad (a file of random
data) the same size as the private key, then XOR them together. Put the
result in one place and the pad in another, and destroy the original
key. We call each file a "share" or "shadow". Later, you can XOR the two shares
together to retrieve the original key. Since the bytes of each file are
random, you can be assured that neither piece will reveal any
information about the key by itself. We can extend this scheme to <i>n</i> pieces by just creating <i>n</i> - 1 one-time pads, adding them together, and subtracting this from the key data to obtain the <i>n</i>th share. Unfortunately, if even one piece is lost, so is the key. This is called an (<i>n,n</i>) secret sharing scheme, because we have <i>n</i> shares and need <i>n</i> of them to reconstruct the secret.<br>
<br>
Here's another more useful scheme: let the private key be represented as the number <i>S</i>. Choose a random line in the plane passing through the point (0,<i> S</i>). Next, choose <i>n</i> random points on that line as your <i>n</i> shares. Now, if we know any two of the shares, we know the line, and so can find <i>S</i>. If we only know one share, then we just know one point, and for every <i>T</i> there is a line passing through (0, <i>T</i>) and that point; hence, the secret could be anything at all. This is called an (<i>n,2</i>) secret sharing scheme, because we have <i>n</i> shares and need 2 of them to reconstruct the secret.<br>
<br>
Particularly useful for individuals is the (3,2) scheme - you can keep
one share on your person on a floppy or USB pen drive at all times, one
share on your hard drive, and one share backed up at a secure remote
location. When you need to use the key, you recreate it briefly using
the shares on your hard drive and your person, then destroy it. If you
lose the floppy or your hard disk crashes, you can recreate the key
from the remaining shares (and make yourself 3 new shares). If any one
share is compromised, the key is not compromised. If you know of a
compromise, you can recreate the key, create 3 new shares, and destroy
the old ones, preventing the attacker from getting a second share. The
new shares are unrelated to the old ones because a different random
line through (0,<i>S</i>) is used.<br>
<br>
In 1979, Shamir generalized these ideas even further to create a general (<i>n</i>,<i>k</i>) secret sharing scheme for any <i>n</i>, <i>k</i>. The idea is to choose a random degree-<i>k</i> polynomial passing through (0,<i>S</i>), where <i>S</i> is the secret. Then choose <i>n</i> random points on this curve. Because there is exactly one degree-<i>k</i> polynomial passing through any given set of <i>k</i> points, any <i>k</i> of these points define the curve and allow us to retrieve the secret. If we have only <i>k</i> - 1 points, then we don't learn anything about <i>S</i>, because there is a polynomial passing through (0,<i>T</i>) and those <i>k</i> - 1 points for every <i>T</i>.
To make the notion of choosing random polynomials and random points
precise, Shamir works with polynomials modulo a prime number <i>p</i>, choosing polynomial coefficients and point coordinates randomly from the integers in [0, <i>p</i> - 1].<br>
<br>
Secret sharing has more applications than just protecting private keys.
It's also useful in situations where we want to entrust a secret to a
group of people but want to ensure that several of them must cooperate
to access the secret. We don't need to trust them all individually, nor
do they all need to cooperate, so the secret is still available if
members go missing or refuse to cooperate.<br>
<br>
Finally, I wouldn't just tell you all this without providing some
working code you can use. You can download an excellent implementation
of the scheme for UNIX called <a href="http://www.mindrot.org/misc-code.html">secsplit</a> (<a href="http://www2.mindrot.org/files/secsplit-1.2.tar.gz">direct download</a>).
I don't know of any free .NET implementation, but I'm sure that porting
this 600-line program into any environment wouldn't be difficult.<br>
<br>
Some more resources on secret sharing:<br>
<ul>
<li><a href="http://www.cs.cornell.edu/Courses/cs513/2000SP/SecretSharing.html">Lectures notes</a> from Cornell's CS 513</li>
<li><a href="http://en.wikipedia.org/wiki/Secret_sharing">Wikipedia article</a></li><li>Adi Shamir's "<a href="http://www.caip.rutgers.edu/%7Evirajb/readinglist/shamirturing.pdf">How to Share a Secret</a>"<br>
</li>
<li><a href="http://www.rsasecurity.com/rsalabs/node.asp?id=2259">RSA FAQ</a></li>
</ul><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=476273" width="1" height="1">AlgorithmsCustom building and code generators in Visual Studio 2005http://blogs.msdn.com/b/devdev/archive/2005/09/13/465034.aspxTue, 13 Sep 2005 23:15:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:465034MSDNArchive11http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=465034http://blogs.msdn.com/b/devdev/archive/2005/09/13/465034.aspx#comments<P>I'm a fervent fan of using code generator tools wherever possible to make your life easier. Although they come with issues related to effective building, diagnostics, and debugging, the amount of value they add to your application is immense: they can eliminate entire classes of potential bugs, save you a great deal of effort and time, make your module much easier to extend and maintain, and even yield runtime performance gains. Among the most frequently used code generator tools are the lexer and parser generators GNU <A href="http://www.gnu.org/software/flex/">Flex</A> and <A href="http://www.gnu.org/software/bison/bison.html">Bison</A>, based on the classic generators lex and yacc. Although I'll get into the details of how to use these tools effectively at a later time, today what I want to show you is a practical example of how to use the new custom build features of Visual Studio 2005 to effectively incorporate a code generator into your automatic build process.</P>
<P>Here is a very simple Bison grammar for evaluating arithmetic expressions involving addition, subtraction, and multiplication:</P><PRE>/* example.y */
%{
#define YYSTYPE int
%}
%token PLUS MINUS STAR LPAREN RPAREN NUMBER NEWLINE
%left PLUS MINUS
%left STAR
%%
line : /* empty */
| line expr NEWLINE { printf("%d\n", $2); }
expr : LPAREN expr RPAREN { $$ = $2; }
| expr PLUS expr { $$ = $1 + $3; }
| expr MINUS expr { $$ = $1 - $3; }
| expr STAR expr { $$ = $1 * $3; }
| NUMBER { $$ = $1; }
;
%%
int yyerror (char const *msg) {
printf("Error: %s\n", msg);
}
int main() {
printf("%d\n", yyparse());
return 0;
}</PRE>
<P>The Flex lexer used by this parser looks like this:</P><PRE>/* example.lex */
%{
#include "example.parser.h"
%}
%option noyywrap
%%
[ \t]+ { /* ignore whitespace */ }
"(" { return LPAREN; }
")" { return RPAREN; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return STAR; }
\n { return NEWLINE; }
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
. { printf("Invalid character '%s'", yytext); }
%%</PRE>
<P>If we were writing this parser at a UNIX command line, we might generate the source files and compile the result using this sequence of commands:</P><PRE>bison -v -d example.y -o example.parser.c
flex -oexample.lexer.c example.lex
gcc -o example example.lexer.c example.parser.c</PRE>
<P>Now say you wanted to build the same application for Windows using Visual Studio 2005. The tools are available on Windows (see <A href="http://gnuwin32.sourceforge.net/packages/flex.htm">flex for Win32</A>, <A href="http://gnuwin32.sourceforge.net/packages/bison.htm">bison for Win32</A>), and you could simply run the same first two commands at the command-line and then build the resulting source files in Visual Studio. However, this simple approach sacrifices many of the advantages that Visual Studio provides for its built-in source file types: it doesn't rebuild the generated source files as needed, it doesn't allow you to jump to errors that occur during generation, and it doesn't allow you to configure build options using a nice GUI. Let's see how we can reclaim these advantages for Flex and Bison files.</P>
<H2>Creating a simple custom build type</H2>
<P>Our first goal is simply to be able to build Flex and Bison files. First, use the Flex and Bison setup binaries from the links above to install the tools. A bin directory will be created in the installation directory. Add this to your system path. You should be able to execute both flex and bison from a command prompt without specifying a path.</P>
<P>Next, we create a new C++ console application. Uncheck the option to use precompiled headers - I'll explain how to use these with Flex and Bison later. Remove the main source file created for you by the wizard. Next, right-click the project in Solution Explorer and choose Custom Build Rules. The following dialog appears:</P><IMG src="http://moonflare.com/blogfiles/devdev/CustomBuildRules.1.PNG">
<P>A <EM>build rule </EM>establishes how to build a file of a particular type. A group of related build rules are stored in a <EM>build rule file</EM>, which can be saved, distributed, and reused in many projects. We'll start by creating a new build rule file for our Flex and Bison rules:</P>
<OL>
<LI>Click New Rule File.
<LI>Enter "GNU Tools" for Display Name and File Name.
<LI>Choose a suitable directory for the build rule file. If it asks you if you want to add the directory to your search path, say yes.</LI></OL>
<P>Now we'll create a build rule for Bison files:</P>
<OL>
<LI>Click Add Build Rule.
<LI>Enter the following values:
<OL>
<LI>Name: Bison
<LI>File Extensions: *.y
<LI>Outputs: $(InputName).parser.c;$(InputName).parser.h
<LI>Command Line: bison -d [inputs] -o $(InputName).parser.c
<LI>Execution Description: Generating parser...</LI></OL>
<LI>Click OK twice, then check the box labelled "GNU Tools" and click OK.
<LI>Add the <EM>example.y</EM> file above to your project. Right-click on the file and choose <EM>Compile</EM>. You should receive no errors.
<LI>Create a new file folder under the project called "Generated Files". Add the existing file <EM>example.parser.c</EM> to this folder.</LI></OL><IMG src="http://moonflare.com/blogfiles/devdev/CustomBuildRules.2.PNG">
<P>If you build now, you should receive only an error complaining that yylex() is undefined. Now, go back to Custom Build Tools and click Modify Rule File on GNU Tools. Create a rule for Flex:</P>
<OL>
<LI>Click Add Build Rule.
<LI>Enter the following values:
<OL>
<LI>Name: Flex
<LI>File Extensions: *.lex
<LI>Outputs: $(InputName).lexer.c
<LI>Command Line: flex -o$(InputName).lexer.c [inputs]
<LI>Execution Description: Generating lexer...</LI></OL>
<LI>Click OK three times.
<LI>Add the <EM>example.lex</EM> file above to your project. Right-click on the file and choose <EM>Compile</EM>. You should receive no errors.
<LI>Add the existing file <EM>example.lexer.c</EM> to your project.</LI></OL>
<P>If you build now, you should receive no errors and be able to run the application successfully. Now in any project you can simply check the "GNU Tools" box, add the .lex and .y files to your project, and build. What happens if you modify the example.y and build? It runs Bison again and recompiles example.parser.c, because it was regenerated, and example.lexer.c, because it includes a header file that was regenerated. If we modify the .lex file, Flex is rerun and example.lexer.c is recompiled, but example.parser.c is not rebuilt. If you had a larger parser, you'd appreciate how much time this incremental rebuilding saves you.</P>
<H2>Improving diagnostic support</H2>
<P>Delete one of the "%%" marks in the .y file and build. Unsurprisingly, Bison fails. However, the Error List tells you no more than this. It'd be more helpful if you could find out what errors the tool produced. If you look at the output window, Bison did produce some errors, but if you double click on them to visit the error location, it just takes you to the top of the file. What gives?</P>
<P>The reason for this is that Visual Studio only recognizes one error format, that used by its own tools. Here's an example:</P><PRE>c:\myprojects\myproject\hello.cpp(10) : error C2065: 'i' : undeclared identifier</PRE>
<P>Bison doesn't output errors in this format, and so they aren't parsed. Flex uses yet another different format. What to do? The simplest way to deal with this is to invoke a simple script on the output of the tools as part of the build rule which parses the output and converts it to the desired format. You can write this script in any language; I wrote them in C# using the .NET Framework's regular expressions. Here's what I wrote inside the Main() function for the Bison converter tool (error checking and such omitted):</P><PRE>string line;
while ((line = Console.In.ReadLine()) != null)
{
Match match = Regex.Match(line, "([^:]+):([0-9]+)\\.[^:]*: (.*)");
if (match != null)
{
Console.WriteLine("{0}({1}): error BISON: {2}",
Path.GetFullPath(match.Groups[1].Value),
match.Groups[2].Value, match.Groups[3].Value);
}
else
{
Console.WriteLine(line);
}
}</PRE>
<P>I deploy the binary, say it's called BisonErrorFilter.exe, to the same directory as bison.exe. I then change the Command Line of the Bison build rule to the following (click the arrow in the right of the field to access a multiline text box):</P><PRE>bison.exe -d [inputs] -o $(InputName).parser.c > bison.err 2>&1
BisonErrorFilter < bison.err</PRE>
<P>If you compile the .y file now, any errors should appear in the error list, as desired, and you can double-click them to visit their locations. I wrote a similar script for the lexer output. Be careful when doing this, though, because if you miss any errors, Visual Studio might look at the error return of the last command and interpret it as success. A better way to do this would be to wrap the tool in a script that passes its arguments to the tool, collects the tool's output and return code, converts and prints the output, and then returns its return code.</P>
<P>I haven't figured out how, but I believe it's possible to also create custom help entries for each error message, then have the filter tool produce the right error code for each one. This way, users can get help for each error individually by just clicking on it and pressing F1.</P>
<H2>Properties</H2>
<P>Properties enable you to control how the command-line tool is executed directly from the properties page for each individual file you wish to build with it. Let's start with a simple example: a handy lexer switch is -d, which prints out an informative message each time a token is recognized. We don't want it on all the time, and certainly not in release mode, but it'd be handy to be able to turn on and off as necessary.</P>
<P>To create a property for this, first return to the lexer build rule. Then follow these steps:</P>
<OL>
<LI>Click Add Property.
<LI>Choose Boolean for the User Property Type.
<LI>Enter the following values:
<OL>
<LI>Name: debug
<LI>Display Name: Print debug traces
<LI>Switch: -d
<LI>Description: Displays an informative message each time a token is recognized.</LI></OL>
<LI>Click OK. Then, add [debug] right after "flex" in the Command Line field.
<LI>Click OK three times.
<LI>Right-click on example.lex in Solution Explorer and choose Properties.
<LI>In the left pane, click the plus next to <EM>Flex</EM>. Click General.
<LI>You'll see your property. Click on it and its description will appear at the bottom. Set it to Yes.
<LI>Click Command Line in the left pane. You'll see that the -d flag has been added.
<LI>Click OK and build.
<LI>Run the app and type an arithmetic expression. You'll see trace messages.
<LI>View the project properties. You'll see that it now has a Flex node also. Here you can set the default settings for all files of that type in the project which don't have specific overriding settings set.</LI></OL><IMG src="http://moonflare.com/blogfiles/devdev/CustomBuildRules.3.PNG">
<P>Adding more properties is just as simple. You can go through the man page for the tool and add properties for each switch, using the Category field to group them into categories. You can use the other property types for switches accepting arguments. If you want, you can create a detailed help file with additional explanation and examples for each switch. When you're done you have an impressive looking property sheet for your files reminiscent of those for built-in types:</P><IMG src="http://moonflare.com/blogfiles/devdev/CustomBuildRules.4.PNG">
<P>You can also set different settings for debug and release builds. For example, for Flex, it's good to set table size to slowest and smallest for the Debug version, to speed up compilation, and to set it to the recommended full tables with equivalence classes for the Release version, which is a good tradeoff of table size and speed.</P>
<P>Finally, once you're done adding all the properties you like, you can take the resulting .rules files and give it to everyone on your team, or distribute it on a website, so that everyone can easily integrate the tool into Visual Studio. Perhaps eventually tools like Flex and Bison will ship with a .rules file.</P>
<H2>Conclusion</H2>
<P>In Visual Studio 2003 you would have had to write a plug-in to come close to achieving this level of integration with a third-party tool. Although it has limitations, I hope the problems solved by these new features help encourage you to incorporate more tools and code generation into your regular development. Now that you know how to use Flex and Bison from the Visual Studio IDE, next time I'll talk about how to use the tools themselves, going through some of the development and debugging processes that a grammar developer goes through, and show you some similar tools for other .NET languages. Thanks for reading, everyone.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=465034" width="1" height="1">Software engineeringModular arithmetic and primality testinghttp://blogs.msdn.com/b/devdev/archive/2005/09/07/462070.aspxWed, 07 Sep 2005 20:29:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:462070MSDNArchive7http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=462070http://blogs.msdn.com/b/devdev/archive/2005/09/07/462070.aspx#comments<p><a href="http://en.wikipedia.org/wiki/Number_theory">Number theory</a>
is, roughly speaking, the study of properties of integers. Often a
problem which is easy for real numbers, such as factoring or linear
programming, seems to be considerably more difficult when restricted to
integers (in fact, integer programming is NP-hard). Much of the focus
of modern number theory is on solving these problems efficiently.</p>
<p>The practical importance of number theory was greatly increased by
the introduction of RSA cryptography, an unprecedented system whereby
information could be securely transmitted without first transmitting
the key and risking its interception. Central to RSA is the ability to
efficiently generate a semiprime, which is simply a product of two very
large primes of roughly equal magnitude. What's useful about these
numbers is that we can generate them efficiently, yet given just the
semiprime, determining the two prime factors is a hard,
unsolved problem. It turns out that there are a lot of primes to
choose from: the probability that a randomly chosen k-bit number is
prime is between 0.69/<em>k </em>and 0.87/<em>k.</em> For example, about 7-9% of 10-bit numbers are prime. Consequently, we can find a <em>k</em>-bit prime number in an expected O(<em>k</em>) random guesses. The only remaining problem is to find an efficient algorithm for testing these guesses for primality.</p>
<p>Brute force primality tests such as simply dividing the number by
all values less than the square root are ineffective for large numbers.
To get around this, we'll need an important tool called <a href="http://en.wikipedia.org/wiki/Modular_arithmetic">modular arithmetic</a>.
If you've written C, you may be surprised to know that you're already
familiar with modular arithmetic - the "wrap around" behavior of <em>k</em>-bit integers is an implementation of arithmetic mod 2<sup><em>k</em></sup>. For example, this piece of code computes the value 17:</p><pre> unsigned char c1 = 177, c2 = 96;<br> unsigned char sum = c1 + c2;<br></pre>
<p>In mathematical notation, we would say:</p>
<p> 177 + 96 ≡ 17 (mod 256)</p>This
tells us that the numbers 177 + 96 and 17 differ by a multiple of 256,
or in other words have the same last 8 bits. Multiplication mod 256 is
likewise similar to multiplication of unsigned chars in C: <pre> unsigned char c1 = 177, c2 = 23;<br> unsigned char product = c1 * c2;<br></pre>
<p> 177 × 23 ≡ 231 (mod 256)</p>
<p>Modular arithmetic works exactly the same for other moduli than 256;
the only difference is the number where it wraps around. For example,
if you compute 10 + 10 mod 17, you get 3.</p>
<p>The interesting thing about modular arithmetic is that it can be shown that the values form a commutative <a href="http://en.wikipedia.org/wiki/Ring_%28algebra%29">ring</a>. This means, among other things, that:</p>
<ul>
<li>Addition and multiplication are commutative (you can reverse their operands without changing the result).
</li><li>The associative property holds for both addition and
multipliation: (177 × 23) × 45 gives the same thing as 177 × (23 × 45),
even if computed with unsigned chars. </li><li>The distributive property holds: (177 + 23) × 45 is equal to (177 × 45) + (23 × 45).</li></ul>
<p>If none of these surprise you, this might: if the modulus is a power of 2, as in machine arithmetic, every odd number <em>m</em> has a multiplicative inverse. An inverse is simply a number <em>n</em> such that <em>m </em>× <em>n</em> = 1. What good is this? Well, suppose you want to divide a number <em>k</em> by 7 on a machine with no hardware division. You know that <em>k</em> is divisible by 7. The inverse of 7 mod 256 is 183. Because 7 × 183 = 1, we have 7 × 183 × <em>k</em> = <em>k</em>. Divide both sides by 7, and we get 183 × <em>k</em> = <em>k</em>/7. In other words, multiplying by a number's inverse is the same as dividing by that number, provided that <em>k</em> is evenly divisible by the number.</p>
<p>Now we come back to our original problem: how can modular arithmetic
help us determine primality? Well, it's not hard to show that if <em>p</em> is prime, then:</p>
<p> For all <em>a </em>with<em> </em>0 < <em>a</em> < <em>p</em>, <em>a</em><sup><em>p</em></sup> ≡ <em>a</em> (mod <em>p</em>).</p>
<p>This is called <a href="http://en.wikipedia.org/wiki/Fermat%27s_little_theorem">Fermat's little theorem</a>. What's useful about it is not only that it's true for all primes, but that if you find an <em>a</em> such that it does not hold, you have proven that the number <em>p</em> is composite. This can be checked efficiently because there is <a href="http://en.wikipedia.org/wiki/Exponentiation_by_squaring">an efficient algorithm for computing large powers</a>.
This is simply an existence proof - it tells us nothing about what the
factors are. It can be shown that for most composite numbers, if you
just select several <em>a</em> at random and try them, one is likely to fail the test. Unfortunately, there are an infinite number of special numbers called <a href="http://en.wikipedia.org/wiki/Carmichael_number">Carmichael numbers</a> for which the above result holds for all <em>a</em>, even though they are not prime.</p>
<p>To get around this, we design a new, slightly more
complicated test which is not susceptible to this problem. I take
this from the excellent book <em><a href="http://www.amazon.com/exec/obidos/tg/detail/-/0387252827/">Prime Numbers: A Computational Perspective</a></em>, by Richard Crandall and Carl Pomerance. Suppose <em>p</em> is an odd prime, and that the binary representation of <em>p</em> - 1 is the odd number <em>t</em> followed by <em>s</em> zeros. Then one of the following is true for every <em>a</em> with 0 < <em>a</em> < <em>p </em>- 1:</p>
<p><em> a<sup>t</sup></em> ≡ 1 (mod <em>p</em>)<br><em> a<sup><em>t</em> << <em>i</em></sup></em> ≡ p - 1 (mod <em>p</em>) for some <em>i</em> with 0 ≤ <em>i</em> < <em>s</em></p>
<p>Above "<<" denotes the shift-left operator. It can be shown
that for any odd composite number > 9, both of the above will fail
for at least 3/4 of the <em>a</em> between 1 and <em>p</em>-2. We can use this to design a simple probabilistic algorithm:</p>
<ol>
<li>Choose an <em>a</em> at random with 0 < <em>a</em> < <em>p</em>-1.
</li><li>Check if one of the above formulas holds. If not, <em>p</em> is composite.
</li><li>If we have iterated at least T times, claim that <em>p</em> is prime. Otherwise, go back to step 1.</li></ol>
<p>Here's an (untested) piece of C# code which implements this simple
algorithm, assuming that BigInteger is a suitable arbitrary-precision
integer type (the Framework provides no such type):</p><pre> public bool IsProbablePrime(BigInteger p, int T) {<br> int s = 0;<br> BigInteger t = p - 1;<br> while (t.IsEven()) {<br> s++;<br> t >>= 1;<br> }<br> for (int repeat = 0; repeat < T; T++) {<br> BigInteger a = BigInteger.Random(1, p - 2);<br> if (!PassesPrimalityTest(a, p, s, t)) {<br> return false;<br> }<br> }<br> return true;<br> }<br><br> public bool PassesPrimalityTest(BigInteger a, BigInteger p, int s, BigInteger t) {<br> BigInteger b = BigInteger.ModPower(a, t, p); // b = a^t mod p<br> if (b == 1 || b == p - 1) return true;<br> for (int j = 1; j < s; j++) {<br> b = BigInteger.ModPower(b, 2, p);<br> if (b == p - 1) return true;<br> }<br> return false;<br> }<br></pre>
<p>Because each trial is independent, the chance of erroneously claiming that <em>p</em> is prime is 1/4<sup>T</sup>,
which is ridiculously tiny even for small T, like T = 20. We can now
use this to quickly locate large prime numbers and generate semiprimes
for RSA cryptography.</p>
<p>Unfortunately, these tests identify composite numbers but reveal
nothing about their factors. This makes them useless for factoring
numbers. Unlike many other hard problems, the factoring problem is
believed to not be NP-complete, and so may very well have an efficient
algorithm that no one has found yet. Such an algorithm would make RSA
encryption useless and win you <a href="http://www.rsasecurity.com/rsalabs/node.asp?id=2093">$625,000</a>.
Combinations of advanced techniques have managed to factor very large
numbers, but only at an enormous expense in computing time and
hardware. This is one of the most active current areas of research in
number theory and computer science. </p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=462070" width="1" height="1">AlgorithmsTheoryThe point location problemhttp://blogs.msdn.com/b/devdev/archive/2005/09/01/459540.aspxFri, 02 Sep 2005 04:05:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:459540MSDNArchive0http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=459540http://blogs.msdn.com/b/devdev/archive/2005/09/01/459540.aspx#comments<P><A href="http://en.wikipedia.org/wiki/Computational_geometry">Computational geometry</A> is a field that studies efficient solution of geometric problems, which are critical in mapping, manufacturing, and particularly graphics applications. If you find data structures and algorithms interesting, it's likely you'd also find computational geometry interesting, because it often combines and adapts well-known data structures and algorithms in novel ways to solve natural problems that are easy to explain. As an example of this, I'm going to discuss the <A href="http://cgm.cs.mcgill.ca/%7Eathens/cs507/Projects/2002/JukkaKaartinen/">point location problem</A>.</P>
<P>Imagine you have a political map of Africa with a red dot somewhere on it. Someone asks you, what country is that dot in? This is a simple question for even a child to answer, but how do we make a computer answer the same question? We can represent the map as a <EM>planar decomposition</EM>, meaning a division of the plane into polygonal regions. We store the vertices making up the boundary of each region. Stop and think for a little bit about how you might solve this.</P><IMG src="http://moonflare.com/blogfiles/devdev/Point_location_cross_product.png" align=right>
<P>Perhaps the simplest approach is to exploit the cross product. Suppose the vertices of the region are listed in clockwise (or counterclockwise) order. Compute the cross product of the vector from vertex <EM>k</EM> to vertex <EM>k</EM> − 1 with the vector from vertex <EM>k</EM> to the point being located. The point lies in the region if and only if all of these cross products (for all <EM>k</EM>) point in the same direction. Since we visit each vertex twice and the cross product takes constant time, this requires O(<EM>n</EM>) time, where <EM>n</EM> is the total number of edges.</P><IMG src="http://moonflare.com/blogfiles/devdev/Point_location_ray.png" align=left>
<P>Another simple solution that generalizes better to higher dimensions is to start at the point and draw a line emanating from it. The first time you hit an edge, you know that edge is on the boundary of the correct region. The edge may be on the boundary of two adjacent regions, but comparing the slope of the line to the slope of the edge will tell you which one is correct. To actually compute this, we define the line parametrically with an equation like this, where (<EM>x</EM><SUB>0</SUB>, <EM>y</EM><SUB>0</SUB>) is the point being located:</P>
<P>(<EM>x</EM>, <EM>y</EM>) = (<EM>x</EM><SUB>0</SUB>, <EM>y</EM><SUB>0</SUB>) + <EM>t</EM>(1, 1)</P>
<P>Then for each edge, we compute the <EM>t</EM> value at which the line intersects that edge, if any, and choose the edge with the smallest positive <EM>t</EM> value. This algorithm is also O(<EM>n</EM>).</P><IMG src="http://moonflare.com/blogfiles/devdev/Point_location_rectangles.png" align=right>
<P>These algorithms are fine if you're only locating one point, but if we wish to locate many points we can do better by creating a data structure which facilitates quick point location. I'll begin by solving a much simpler problem: say your map only has rectangles in it, and these rectangles are arranged in vertical "strips" as shown to the right. To solve this problem, we can construct a balanced binary search tree. This tree will first compare the <EM>x</EM> coordinate of the point against the <EM>x</EM> coordinates of the vertical lines to determine what strip of rectangles the point falls in. It then compares the <EM>y</EM> coordinate of the point against the <EM>y</EM> coordinates of the horizontal lines to determine what rectangle it falls in. Each comparison eliminates about half of the remaining rectangles from consideration, so this takes only O(log <EM>n</EM>) time in all for <EM>n</EM> rectangles. The search tree takes O(<EM>n</EM>log <EM>n</EM>) time to construct and takes O(<EM>n</EM>) space, since there's at most one node for each edge and there are 4<EM>n</EM> edges.</P>
<P>We generalize this by allowing each "strip" to contain not just rectangles but <EM><A href="http://en.wikipedia.org/wiki/Trapezoid">trapezoids</A></EM>, as shown below (both rectangles and triangles are considered to be trapezoids here). This is called a <EM>trapezoidal map</EM>. This makes the second part of the search only a little more complicated: instead of comparing the <EM>y</EM> coordinate to a fixed value, we choose an edge and use a cross product to check which side of the edge the point falls on.</P><IMG src="http://moonflare.com/blogfiles/devdev/Point_location_trapezoidal_map_1.png" align=left>
<P>Finally, take any map, and draw a vertical line through each vertex; in the image to the left the gray vertical lines were added in this manner. You'll end up with a trapezoidal map. Since we can determine what region of this map contains the point in O(log <EM>n</EM>) time, and each of these regions is part of only one region in the original map, we've solved the problem for general maps in O(log <EM>n</EM>) time.</P>
<P>Unfortunately, adding all these vertical lines also creates a lot of new regions, potentially as much as squaring the number of regions. Search time is still good, since log(<EM>n</EM><SUP>2</SUP>) = 2 log <EM>n</EM>, but the storage required can jump from O(<EM>n</EM>) up to O(<EM>n</EM><SUP>2</SUP>). This is worst-case, but even in typical practical cases it can require a lot of space, and construction time for the data structure also increases. We want to retain a quick search capability but without having quite so many regions.</P><IMG src="http://moonflare.com/blogfiles/devdev/Point_location_trapezoidal_map_2.png" align=right>
<P>To achieve this, notice that we don't actually need to draw a complete vertical line through every point to divide the map into trapezoids. All we really have to do is draw a line up and down from each vertex until we reach the edges above and below. It can be shown that the number of trapezoids is now no more than about three times the number of original regions, so the worst-case space has been cut back down to O(<EM>n</EM>). The search procedure is similar: at each node we either compare the <EM>x</EM> coordinate of the point to that of a vertical line segment, or we check on which side of an edge the point lies. Either way we eliminate about half the remaining regions. The search "tree" is no longer a tree but a directed acyclic graph, because a single trapezoid may fall on both sides of a vertical line segment. See <A href="http://cgm.cs.mcgill.ca/%7Eathens/cs507/Projects/2002/JukkaKaartinen/searching.html">a demonstration</A>.</P>
<P>Of course I've left a lot of details out here, like how to ensure the search tree remains well-balanced during construction, what to do if some points have the same <EM>x</EM> coordinate, and so on, but I hope this gives you a taste of how clever algorithms can efficiently solve geometric problems.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=459540" width="1" height="1">The Visitor pattern and multiple dispatchhttp://blogs.msdn.com/b/devdev/archive/2005/08/29/457798.aspxTue, 30 Aug 2005 04:27:00 GMT91d46819-8472-40ad-a661-2c78acb4018c:457798MSDNArchive14http://blogs.msdn.com/b/devdev/rsscomments.aspx?WeblogPostID=457798http://blogs.msdn.com/b/devdev/archive/2005/08/29/457798.aspx#comments<P>Today I'll talk about multiple dispatch, a programming language feature that increases flexibility of method calls and eliminates the need for awkward pattern constructions, at the cost of some additional complexity in the dispatch algorithm.</P>
<H3>The Visitor pattern</H3>
<P>Those who have read the Gang of Four's seminal <EM><A href="http://www.amazon.com/exec/obidos/tg/detail/-/0201633612/">Design Patterns</A></EM> may recall the <A href="http://www.google.com/search?&q=%22visitor+pattern%22">Visitor pattern</A>. Its purpose is to provide new operations on a class or set of related classes without actually altering the classes involved. <EM>Design Patterns</EM> provides a real-world example in the document-view paradigm, but for simplicity I'll provide a simpler example here based on some <A href="http://www.cs.rice.edu/~cork/book/node62.html">notes by Corky Cartwright</A>.</P>
<P>Say you wish to represent an arithmetic expression using integers and addition such as (1 + 5) + 2. In an object-oriented language, a natural way to do this is to create an abstract base class "Expression", and then create some related subclasses:</P>
<UL>
<LI>ConstantExpression: Represents a constant number, like 2.
<LI>SumExpression: Represents the result of adding two Expression objects.</LI></UL>
<P>The objects form a tree, where the leaves are ConstantExpression objects. To evaluate an expression represented in such a form, it suffices to create a virtual Evaluate() method which each subclass implements appropriately:</P><PRE>abstract class Expression {
public abstract int Evaluate();
}
class ConstantExpression : Expression {
int constant;
public override int Evaluate() {
return constant;
}
}
class SumExpression : Expression {
Expression left, right;
public override int Evaluate() {
return left.Evaluate() + right.Evaluate();
}
}
</PRE>
<P>This may solve the problem if all you want to do is evaluate expressions, but there are all sorts of operations you can do on arithmetic expressions. Do we really want to add a virtual method for every one of them? We certainly don't want clients of the class to be forced to extend the classes just to add application-specific operations. The Visitor pattern solves this problem by replacing the Evaluate() method with an Accept() method that takes as a parameter a visitor object. The visitor object has a method for each type of expression, and Accept() invokes the appropriate one. Here's how we would evaluate expressions using the Visitor pattern:</P><PRE>abstract class Expression {
public abstract object Accept(Visitor v);
}
class ConstantExpression : Expression {
public int constant;
public override object Accept(Visitor v) {
return v.VisitConstant(this);
}
}
class SumExpression : Expression {
public Expression left, right;
public override object Accept(Visitor v) {
return v.VisitSum(this);
}
}
interface Visitor {
object VisitConstant(ConstantExpression e);
object VisitSum(SumExpression e);
}
class EvaluateVisitor : Visitor {
public object VisitConstant(ConstantExpression e) {
return e.constant;
}
public object VisitSum(SumExpression e) {
return (int)e.left.Accept(this) + (int)e.right.Accept(this);
}
}</PRE>
<P>We evaluate an expression <CODE>e</CODE> using <CODE>(int)e.Accept(new EvaluateVisitor())</CODE>, and adding new operations is easy; we just create new implementors of Visitor. Clients can create application-specific algorithms without touching the original classes. As an added benefit, the visitor class can keep state internal to the algorithm; for example, this visitor counts and stores the number of constants in an expression:</P><PRE>class CountConstantsVisitor : Visitor {
public int Count;
public object VisitConstant(ConstantExpression e) {
Count++;
return null;
}
public object VisitSum(SumExpression e) {
e.left.Accept(this);
e.right.Accept(this);
return null;
}
}
</PRE>
<P>But the wary designer will note that there is a maintainance problem here: if we add a new subclass of Expression, like ProductExpression, the Visitor interface and all classes implementing it have to be updated. If the Expression class rarely changes, this is fine, and may even be good: it forces the visitor classes to consider the new cases. If it's rapidly changing, however, we definitely have a problem.</P>
<H3>Double dispatch</H3>
<P>At this point we take a slight detour and consider another simple example. <EM>Polymorphism </EM>is the critical ability of an object-oriented method call to invoke different methods at runtime based on the type of the object, like Evaluate() in our first example. If you've ever dealt with operator overloading in C++ or C#, though, you may come up against the limits of polymorphism. If I write an <CODE>operator+</CODE> for a class Complex for complex numbers, subclasses can override it in a polymorphic way - but only if the complex number is on the <I>left</I>. If the Complex object is on the right-hand side, there is no way to ensure the correct method is invoked polymorphically short of testing the type and invoking it yourself.</P>
<P>How cruel we are to the unprivileged right-hand sides! The built-in C operator + produces different types depending on both the type of its left and right argument; both a double plus a float and a float plus a double produce doubles. We want something similar for our own operators, the ability to polymorphically choose a method based on both the type of its left argument and its right argument. Then we could write code like this:</P><PRE>Complex operator+(Complex c, int i) { ... }
Complex operator+(ComplexSubclass c, int i) { ... }
Complex operator+(int i, Complex c) { ... }
Complex operator+(int i, ComplexSubclass c) { ... }
</PRE>
<P>The right method will be chosen at runtime based on the types of both sides; the most specific (most derived) type is always chosen. This is called <EM>double dispatch</EM>, because we select the method ("dispatch" the message) based on the type of two arguments.</P>
<P>Unfortunately, this new flexibility also introduces some new problems. Suppose we have these two methods:</P><PRE>Complex operator+(ComplexSubclass s, Complex c) { ... }
Complex operator+(Complex c, ComplexSubclass s) { ... }
</PRE>
<P>Now, suppose we try to add two objects of type <CODE>ComplexSubclass</CODE>. Neither of the above looks more appropriate than the other. There are various ways of dealing with this; for example, we can give the first argument precedence over the others, we can throw an error at runtime, or we can make the compiler require us to implement a method taking two <CODE>ComplexSubclass</CODE> objects. For this article, we'll sidestep the issue by saying we won't write any code that will invoke this dilemma.</P>
<P>Now, I'm sure you're wondering what any of this has to do with the Visitor pattern. The answer is that this feature allows us to drastically simplify our Visitor pattern example while retaining all of its benefits:</P><PRE>abstract class Expression { }
class ConstantExpression : Expression {
public int constant;
}
class SumExpression : Expression {
public Expression left, right;
}
class EvaluateVisitor {
public int Visit(ConstantExpression e) {
return e.constant;
}
public int Visit(SumExpression e) {
return Visit(e.left) + Visit(e.right);
}
}
</PRE>
<P>Now, this isn't valid C# of course, since C# doesn't support double dispatch, but you can see the effect it would have if it did: the Visitor interface is gone, the Accept() method is gone, and the calls in the last Visit() method are simplified. Best of all, it solves our dilemma of a rapidly changing Expression class hierarchy, because we can add a "default case" method that deals with any expression types we're not familiar with: <PRE> public int Visit(Expression e) {
throw new Exception("Unsupported type of expression"); // or whatever
}
</PRE>
<P>One handy trick you can do in Java and .NET languages is to have a method that examines the type of its argument and then uses reflection to pass it on to the correct overload, simulating polymorphism. This is much slower than native double dispatch support, but achieves much of the same effect.</P>
<P>If you've studied how polymorphic methods are dispatched, you're probably most familiar with the "vptr" and "vtable" model, where each object includes a reference to a type descriptor, which in turn contains a table of function pointers giving the right function to invoke for each method for an object of that type. With double dispatch this becomes trickier. One of the simplest strategies, and an effective one in practice, is to have the initial vtable dispatch the method to a "stub" method, whose job is to perform another vtable dispatch on the second argument. Because most methods don't require double dispatch, this avoids imposing a penalty on ordinary single dispatch methods.</P>
<H3>Multiple dispatch</H3>
<P>But why stop at two arguments? Why not choose the method based on <I>all</I> the arguments? Although increasingly less useful, there are legitimate applications of methods that dispatch on 3, 4, or more arguments. This form of dispatch, called <EM>multiple dispatch</EM>, is directly supported by languages such as <A href="http://gauss.gwydiondylan.org/">Dylan</A>, <A href="http://www.cs.washington.edu/research/projects/cecil/">Cecil</A>, and <A href="http://www.apl.jhu.edu/~hall/AI-Programming/CLOS.html">CLOS</A> (the Common Lisp Object System). Not only does this feature ease design of an initial system, but it also greatly increase the ability to extend it in multiple directions in an object-oriented way. Try experimenting with some of them and going through their tutorials; even in languages that don't directly support multiple dispatch, it helps to be able to recognise a situation where it would apply and simulate it.
<P>Thanks for reading, everyone.</P><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=457798" width="1" height="1">Software engineeringProgramming languages