|
|
-
-
Anyone who's taken a beginning calculus course spent some time studying (and maybe dreading) infinite series. A convergence definition was probably given as the convergence of the sequence of partial sums of the series. Next the course probably talked about various ways to tell when a series converges or diverges. From then on, very little time was spent on series that diverge, as they're clearly not very interesting.
And, oh what a lie that is. Divergent series have held mathematicians interest for quite some while. If you were lucky your calculus class might've even spent some time talking about the divergent harmonic series 1 + 1/2 + 1/3 + .... This series diverges, however it diverges in a very interesting way. The sequence of partial sums of the harmonic series closely approximate the natural logarithm function. In fact, if H_n is the nth partial sum of the harmonic series, lim(n -> infinity) (H_n - ln n) is a constant .57721566...
In general mathematicians are interested in finding ways of making divergent series converge. As a motivating example, consider the series 1 -1 + 1 -1 + 1 - ... The partial sums of this series are 1, 0, 1, 0, ... These partial sums oscillate back and forth between 1 and 0, so this series clearly doesn't converge to anything. However, an interesting property of this series is that on the odd chance that it did converge, we know what it would converge too! To see this, let S = 1 - 1 + 1 -1 + 1... Then S = 1 - (1 - 1 + 1 - 1 + 1 ...) = 1 -S, so S = 1/2. This gives us a great test. If we can find any method of making this divergent series converge, this method must cause the series to converge to 1/2.
Now there's no hope of making the series converge using the Cauchy definition of convergence that we were taught in our calculus course. We need to search for other ways of defining convergent series. While we search for alternate definitions, we shall insist on one thing. Any series that is convergent using the calculus-course definition of convergence must also converge using our new definition, and furthermore must converge to the same value. This means that our new convergence techniques will expand our universe of convergent series, however it won't change anything we already know.
For the alternating series above, let's start by considering the alternating geometric series 1 -x + x^2 -x^3 + x^4 - ... As long as |x| < 1, we know that this sums up to 1/(1+x). If we let x get closer and closer to 1 (while remaining smaller than one) the series will sum to a result closer and closer to 1/2. If x = .999, the series sums to .50025. If x = .9999, the series sums to .500025, and so on. This gives us another way to think about our divergent alternating series above. Why don't we consider all of these convergent geometric series where |x| < 1, and take the limit as x approaches 1? Instead of taking limits of more and more terms in the series, where taking limits as the terms of the series approach a number. This gives us 1/2 as the sum of the series, and motivates the following definition:
Definition: Let a0 + a1 + a2 + ... be a series, and assume that there is a function f with a power series a0 + a1x + a2x^2 + ... convergent for -1 < x < 1. If f(x) is defined and continuous at x = 1, say that a0 + a1 + a2 + ... converges in the sense of Abel, and that the sum converges to f(1)
A few minutes of thought should convince you that any series that converges in the usual sense also converges in this Abel sense, and it converges to the same value.
Another way of defining convergence is by the use of Cesaro sums. The series 1 -1 + 1 -1 +... diverges not because the partial sums grow uncontrollably, but rather because the partial sums oscillate around too fast. If we could find a way of averaging the sums in order to smooth them out, maybe this series will converge. And of course we can do exactly that.
Given a series a0 + a1 + a2 + ..., let Si be the ith partial sum of the series. So S0 = a0, S1 = a0 + a1, and so on. The nth Cesaro sum is defined as (S0 + S1 + S2 + ... + Sn)/(n+1), the average value of the first n+1 partial sums. if the Cesaro sums converge to a finite value as n grows to infinity, we say that the series a0 + a1 + a2 +... converges in the sense of Cesaro, and that the sum converges to the limit of the Cesaro sums.
Now in the case of the series 1 -1 + 1 -1 + ..., the Cesaro sums converge to 1/2. Take a minute to convince yourself that if a series is convergent in the normal sense, the Cesaro sums converge to the normal sum of the series. It's not quite as simple as it was with Abel sums, however it's not too hard.
These are two of the most basic alternative convergence definitions out there. There are a number of others that contribute to this interesting field. This can also explain some surprising results, such as Euler's assertion that 1 + 2 + 3 + 4 + ... = -1/12.
|
-
I'll pose an easy one this time. Everyone should be familiar with the Fibonacci sequence defined recursively as F(n) = F(n-1) + F(n-2). Fibonacci is often one of the first examples of recursion one learns, and also one of the first examples of how recursion can be harmful. For this problem, I'm going to generalize Fibonacci sequence to allow it to be defined in terms of an arbitrary number of previous n terms:
Definition: Let n be a positive integer and let a_0, a_1, a_n-1 be arbitrary integers. We defined the Generalized Fibonacci Sequence of these constants as follows
G(0) = a_0 G(1) = a_1 ... G(n-1) = a_n-1
G(m) = G(m-1) + G(m-2) + ... + G(m-n) For any m >= n
The regular Fibonacci sequence can be expressed in terms of G if n =2, a_0 = 0, and a_1 = 1. Now given some Generalized Fibonacci Sequence G, consider what happens if we limit this to integers modulo m, for some number m:
G_m(l) = G(l) (modulo m)
Show that the sequence of numbers G_m(0), G_m(1), G_m(2), ... is eventually cyclic. i.e. if you walk down the sequence far enough it falls into a steady pattern that repeats regularly. What can you say about the period of the cycle in terms of m, n, a_0, a_1, ..., a_n?
|
-
In the last post, we discussed a method to find common subexpression on the entire control-flow graph using partitioning. Today, we're going to talk about another classic way of accomplishing this goal using data-flow analysis. Then I think I'm going to discuss topics other than compilers... at least for a while.
The first thing we need to do is to define what it means for an expression to be redundant. So far we've done this two ways. Initially we gave an algorithm (value numbering) to find redundant expressions, and implicitly defined an expression to be redundant if value numbering declared it so. Later we gave a definition of congruence of operations, and gave an algorithm for finding congruent operations. Congruence was related to redundant expressions, but two operations being congruent did not necessarily mean that one was redundant.
For today's discussion, we're no longer going to insist that all names are unique. This means that we have to deal with blocks that redefine variables used in expressions. Say we have some expression of the form x + y. If some basic block redefines either x or y, we say the block has killed the expression x + y. With this terminology out of the way, we can now give a definition for redundant expressions.
Definition: An expression e is said to be Available at a block b if e is defined on every control-flow path that reaches b and e is not killed on any of these control-flow paths. The set of all expression available at a block b is referred to as Avail(b).
This is a very intuitive definition of e, often referred to as the meet over all paths solution. Unfortunately, the definition gives us no ideas how we can compute these available-expression sets. enumerating all control-flow paths that reach a block b sounds hard, and how should we deal with loops? This is where data-flow analysis comes in.
In order to calculate available expressions for a block b, let's start by assuming that we've already successfully computed these sets for every block in the procedure except for b. We can then easily figure out available expressions for b. If an expression is available at all predecessors of b and it isn't killed in any of those predecessors, it's clearly available at b. Further, any expression that is defined in all predecessors of b and isn't killed in any predecessor is also available at b. For every block in the control-flow graph, we can calculate two sets
DefsOut(b) = {every expression defined in b and not killed b}
Killed(b) = {all expressions that are killed as a result of an assignment in b}
Both of these sets can easily be computed by examining the block. We can then define the set of expressions available in b as follows:
(*) Avail(b) = Intersect(over all predecessors p of b) [DefsOut(p) U (Avail(p) - Killed(p))]
Where the minus sign means set substraction.
Now this works if we already have Avail sets calculated for every block other than b, but let's make a leap. Let's start everyone off with empty Avail sets, and run through the graph applying formula (*). Once we're done, run through the graph again applying the formula. Keep doing this until the sets we calculate stop changing. Once the sets stop changing, we hope that we've calculated the meet-over-all-paths solution above. Here's an algorithm that accomplishes this (with a few improvements);
Worklist = {all blocks}
while Worklist isn't empty pull a node b off the worklist (**)Avail(b) = Avail(b) U Intersect(over all predecessors p of b) [DefsOut(p) U (Avail(p) - Killed(p))] if the calculation caused Avail(b) to change add all successors of b onto Worklist
Now there are three basic questions to ask about this algorithm: does it terminate, does it calculate the correct answer, and how fast is it. It should be obvious that this algorithm terminates. Every iteration either leaves the Avail sets alone or increases them monotonically, so at some point it must finish. If the control-flow graph is acyclic, this algorithm intuitively works. If you imagine the set of available expressions as starting from the source node and being pushed out to successor nodes (though successor nodes can also act as sources of expressions), each iteration pushes the information a little further down the graph; since there are no cycles, eventually all possible execution paths are followed, and we're done. This intuition is very similar to the rational behind the Bellman-Ford algorithm for shortest paths in graphs.
In general flow graphs, whether the above algorithm correctly calculates the meet-over-all-paths solution depends on the definition of the expression (**). If this expression satisfies certain properties (which causes terminology geeks to call it admissible), the algorithm will correctly calculate the meet-over-all-paths solution. It turns out that our definition for Avail has these properties, so the above algorithm works. The proof that this algorithm works is far from obvious, and is based off of a branch of mathematics called Lattice Theory.
The last question is... how fast? This also depends on characteristic of the expression (**). Further, the order in which we look at nodes has a huge impact on how many iterations are needed to reach a fixed point. For expressions like (**) that cause data to flow downwards, a good bet might be to order the worklist by the block's index in a reverse postorder walk.
Now that we have or Avail sets, what should we do with them? Well first of all, we need to make sure to generate global names for all the expressions we care about. Now we can run the local value numbering algorithm from several posts back on each block, but instead of starting out with an empty hash table we can start off the hash table with the list of available expressions for the block. Generating global names will generate a lot of extra copies and useless variables, but later compiler passes will find these and eliminate them.
Next time I think I might get back to some math-related topics.
|
-
In the last few posts, we examined ways of eliminating redundant expressions in basic blocks and in tree-structures of basic blocks. We examined the tree of extended basic blocks and then enhanced this algorithm to run over the immediate-dominator tree of a procedure. These algorithms all worked on acyclic regions of the control-flow graph. In this post, we're going to start examining algorithms that work on the entire control-flow graph at one time.
A particularly elegant algorithm for globally discovering redundant expressions comes froma 1988 paper by Alpern, Wegman, and Zadeck. They define a notion of congruent operations.
Definition: Let x and y be two operations in a procedure. x and y are said to be congruent if they have the same operator and their operands are congruent.
Notice that two congruent operations aren't necessarily redundant. This definition does not take control flow into account, so there may be no control-flow path that includes x and y.
In this paper, Alpern,Wegman, and Zadeck propose an algorithm to determine which operations are congruent. This algorithm is based off of Hopcroft's algorithm for minimizing finite-state machines. The idea is quite simple. To start off with, any operation that loads an immediate value is obviously congruent to any other operation that loads the same immediate value. We start by assuming that all operations the same operator are congruent and that all immediate loads are congruent only if they load the same immediate value. We create a worklist containing all of these congruence classes, and we keep trying to make the classes smaller. As usual, the control-flow graph must have this static-single assignment property that any variable use is the result of only one variable definition.
Partition all operations into classes by opcode Put all of the classes into Worklist
While Worklist is not empty Remove a class c from Worklist For each class s (including classes not on Worklist) If s uses the result of any operation in c split s into two classes. One which contains all operations that use some operation in c, and one which contains all other operations in s. put the smaller of these two classes back on the worklist
When this algorithm terminates, we're guaranteed that all operations in each class are congruent.
There are a couple interesting differences between this algorithm and the previous value-numbering algorithms. For one thing, value numbering is pessimistic. Value number starts off by assuming that no expressions are redundant, and it builds up it's hash table as it find expressions in the code. This partitioning algorithm, on the other hand, is optimistic. It starts by assuming that all operations with the same operation are congruent, and then refines this assumption. Optimistic algorithms often find more opportunities then pessimistic algorithms, however one advantage of pessimistic algorithms is that they can be terminated early and still provide benefit. If this partitioning algorithm is terminated early, all of the classes computed must be thrown away.
Another key difference, is that the hash-based, value-numbering algorithms worked in an online manner. The algorithm would replace common subexpressions as it found them. This partitioning algorithm doesn't replace any expressions. It merely computes a bunch of congruency classes and leaves us with the problem of how to use this information. In the paper, Alpern, Wegman, and Zadeck suggest using the dominator information we talked about last time. Remember, that an operation x is said to dominate another operation y if every control flow path that reaches y flows through x. Using this idea, we can do the following:
for every congruence class c for every pair of operations x and y in c if one operation dominates the other rewrite the dominated operation as a load of the result of the other
Others have proposed better ways of using this information.
Partitioning and hash-based value numbering are hard to compare directly, since they find very different sets of expressions. However, partitioning has some disadvantges with respect to dominator value numbering:
- In practice, partitioning can be very slow compared to value numbering.
- It's not obvious how to handle algebraic identities such as x+0 = 0+x = x. Value numbering is great at this.
- 2*a and a+a will always be in seperate classes.
In practice, dominator-based value numbering (especially after it's been improved in various ways) is often better than the partitioning method. If you're not constricted by compile times though, there's no reason why you couldn't run both.
|
-
In the last post we extended the value-numbering algorithm to extended basic blocks. This is generally a win, but still leaves us wanting a bit more. Consider the following contrived program:
x = y + z B1
if x == 0 a = y + z B2 else b = y + z B3
c = y + z B4
This program contains 4 basic blocks as marked above. The control-flow graph for this program presents three extended basic blocks: B1,B2 B1, B3 and B4. If we apply value numbering to these extended basic blocks, we improve the program to the following
x = y + z
if x == 0 a = x else b = x
c = y + z
An improvement surely, but what about block B4? B4 has both B2 and B3 as predecessors, so we can't include it in any extended block bigger than itself. However, it's obvious from inspection that the expression y+z evaluated in B4 is redundant. No matter which branch of the if statement is taken, y+z has already been evaluated by the time B4 is reached.
In order to optimize B4, we need to generalize our concept of extended basic blocks a bit. We mentioned last time that the key property of an extended basic block B1, B2, ..., Bn (at least for value numbering) is that whenever some Bi executes it's guaranteed that all previous blocks in the list have already executed in order. We can formalize this property a bit more by introducing dominators:
Definition: Let G be a control-flow graph with start node s. Given two nodes x and y in G, x is said to dominate y if every path from s to y must pass through x.
We've now formalized the key property of extended basic blocks. Each of the blocks is dominated by all predecessors in the extended basic block. By the time any block has executed, all it predecessors have executed in order.
Let's introduce one more idea. that of immediate dominators:
Definition: Let G be a control-flow graph with start node s, and let x be an arbitrary node in G that is not s. The immediate dominator of x, referred to as IDOM(x), is the unique dominator of x that is not x itself with the property that any other dominator of x, not including x, dominates IDOM(x). The nodes of G form a tree structure rooted at s such that the parent of any node x is IDOM(x).
Each node in an extended basic block is the immediate dominator of the next node in the extended block. This means that every extended basic block appears somewhere in the immediate-dominator tree as a subtree. So, if we apply our scoped hash table algorithm from last time to the immediate-dominator tree, we should get all the benefit of extended-basic block value numbering, and then some!
In our example above, the immediate-dominator tree contains paths B1, B2 B1, B3 and B1, B4. Running our scoped hash table value numbering over this gives us the following
x = y + z
if x == 0 a = x else b = x
c = x
eliminating all common subexpressions.
In reality, value numbering over immediate-dominator trees has other complications due to the static single assignment form I mentioned previously. If you're interested in this, take a look at the paper by Briggs, Cooper, and Simpson http://citeseer.ist.psu.edu/329118.html
Tomorrow we'll move on to global methods of finding common expressions.
|
-
Yesterday's post dealt with optimizing a single basic block. Most useful progams contain some control flow however, so in the next few posts we'll explore extending the algorithm to larger regions.
The set of all basic block in a procedure form a graph, known as the Control Flow Graph, where the edges correspond to branches. Algorithms such as yesterday's that operate on single blocks are known as "local" algorithms, algorithms that operate on the entire graph are referred to as "global," while algorithms that operate on a portion of the graph are referred to as "regional." In this post and the next, we'll extend yesterday's value-numbering algorithm to operate on a regional level.
Yesterday we did a renaming pass over the block we were analyzing to ensure that no variable was assigned to more than once. The fact that this is possible in a basic block is pretty obvious, it's much less obvious whether this can be done on an entire control flow graph. It turns out that such a renaming is possible by rewriting the code into what's known as Static Single Assignment form. We're just going to accept this fact on faith right now until I get around to discussing it.
For starters, consider a procedure consisting of two blocks A and B. There's an edge from A to B, meaning that as soon as A finishes executing it branches to the beginning of B. This combination of A and B satisifies the fundamental property of basic blocks that if one instruction executes they all execute. Value numbering can clearly be applied across both of these blocks. This leads us to the following definition:
Def: Let B0, B1, ..., Bn be a sequence of basic blocks where each Bi is the only parent of Bi+1, and B0 is either the entry node or has multiple parents. B0, B1, ..., Bn is called an Extended Basic Block.
An extended basic block doesn't quite satisfy the basic block definition, however we can say that if any block executes all previous block must have executed in order. It should be obvious then that we can run value numbering over an extended basic block treating it is as a single block. Extended basic block can often take us pretty far in a procedure, so this is quite nice.
Extended basic blocks can overlap quite a bit. If we end up value numbering each extended basic block seperately, we'd end up reevaulating some blocks many times. One way to do this more efficiently is by using a scoped hash table.
A scoped hash table is a data structure that's often used to interpret lexical scoping in languages. The idea is to have a stack of hash tables, and to replace a lookup with a a sequence of lookups starting with the top hash table on the stack. As soon as one lookup succeeds the search stops, this enables definitions higher on the stack to shadow definitions lower on the stack.
The set of extended basic block rooted at a node in the CFG form a tree structure. We can walk it as follows
List HashTables Extended-Value-Number(Block b) create hash table and insert at head of HashTables Value-Number(B)
for every successor s of b if b is the only parent of s Extended-ValueNumber(s)
pop the head off of HashTables
Value-Number(Block b) perform value number of b with hash lookup replace with a scoped lookup in the list HashTables
This algorithm prevents recomputing value numbers. If I have two extended basic blocks B0, B1, B2, B3 and B0, B1, B2, B4, the algorithm will first compute the value numberings for the prefix B0, B1, B2. It will then branch to compute a value numbering of B3 based on the hash tables created by the prefix, throw away the hash table created for B3, and branch again to compute a value numbering of B4.
Optimizing extended basic block is an improvement, but we'd like to do even better. With extended basic blocks, we give up as soon as two control-flow paths merge. To do better, we're going to need a way to deal with these merges.
|
-
Despite my having worked on Windows Storage for the past four years, one of my main interests has always been compilers. I had the good fortune to go to school at Rice University, a school that had one of the strongest compiler programs in the country. I expect to spend some time on this blog rambling about interesting bits a pieces of compilers.
For the next several blog entries, I'm going to talk about a compiler transformation with a very rich history. Specifically, I'm going to talk about common subexpression elimination. Consider a program that contains the following two array references:
A[[i][j] = 1 A[i][k] = 2
When this is compiled, the compiler generates something equivalent to the following
addr1 = A + i * row-length*element-size + j*element-size *addr1 = 1
addr2 = A + i * row-length*element-size + k*element-size *addr2 = 2
Clearly A + i * row-length*element-size is common between addr1 and addr2, and doesn't need to be calculated twice. The goal of common-subexpression elimination is to automatically discover such commonality and to eliminate it.
Let's start with a bit of basic terminology. A basic block is a run of straight-line code containing no control flow; often a basic block is understood to be maximal, that means that adding any instruction for the program, either before or after, would introduce control flow. A fundamental property of a basic block is that if any contained instruction executes, all contained instructions execute. Every procedure in a program consists of a set of basic blocks that are organized into a control-flow graph. The control-flow graph for a procedure has an entry point, the entry point for the procedure, and there's an edge between block corresponding to every branch in the procedure. This post will concentrate on classical ways of eliminating common subexpressions in basic blocks. Future posts will examine ways of doing this on entire control-flow graphs.
the classic algorithm for subexpression elimination in a basic block is called value numbering, and is reportedly due to Balke. The basic idea is to assign every value in the block a number (even if we don't know what the value evaluates to), and use a hash table to keep track of expression values. Let's start out by rewriting the block so no variable is ever killed. i.e.
a = 5 b = a * c a = 7 Will be rewritten as
a = 5 b = a * c d = 7
As long as we replace subsequent uses of a with d, we can safely do this within a block. Now the value-numbering function looks something like this
ValueNumber(block) { maxValueNumber = 0 for each instruction x <- y op z (or x <- y or x <- op y) in the block lookup "ValueNumberOf(y) op ValueNumberOf(z)" in the hash table if it's there with with value number v replace the instruction with that loads x from value number v else ValueNumberOf(x) = maxValueNumber++ add ValueNumberOf(y) op ValueNumberOf(z) to the has table and give it value number ValueNumberOf(x) }
We're making a simplifying assumption here that all variables we care about are initialized with the basic block, and that we don't care about the value of any variables once the basic block has finished. This algorithm now assigns numbers to expressions with the property that two expressions are given the same number only if they will evaluate to the same value. Let's examine our above example with array references. A sample assembler output from this code might look something like this
r0 <- addressof(A) r1 <- i r2 <- row-length r3 <- element-size r4 <- r2 mul r3 ; row-length * element-size r5 <- r1 mul r4 ; i * row-length * element-size r6 <- r0 add r5 ; A + i * row-length * element-size r7 <- j r8 <- element-size r9 <- r7 mul r8 ; j * element-size r10 <- r6 add r9 ; A + i * row-length * element-size + j * element-size deref(r10) <- 1
r11 <- addressof(A) r12 <- i r13 <- row-length r14 <- element-size r15 <- r13 mul r14 ; row-length * element-size r16 <- r12 mul r15 ; i * row-length * element-size r17 <- r11 add r16 ; A+ i * row-length * element-size r18 <- k r19 <- element-size r20 <- r18 mul r19 ; k * element-size r21 <- r17 add r20 ; A+ i * row-length * element-size + k * element-size deref(r21) <- 2
Whew! That's 22 registers uses just to compute two array address. Let's see what our value numbering algorithm will do to this. If we assume an infinite number of registers (an assumption we can later "fix" with a register allocator) we can use the value numbers as register indexes once they've been assigned.
r0 <- addressof(A) r1 <- i r2 <- row-length r3 <- element-size r4 <- r2 mul r3 ; row-length * element-size r5 <- r1 mul r4 ; i * row-length * element-size r6 <- r0 add r5 ; A + i * row-length * element-size r7 <- j r8 <- r3 r9 <- r7 mul r3 ; j * element-size r10 <- r6 add r9 ; A + i * row-length * element-size + j * element-size deref(r10) <- 1
r11 <- r0 r12 <- r1 r13 <- r2 r14 <- r3 r15 <- r4 ; row-length * element-size r16 <- r5 ; i * row-length * element-size r17 <- r6 ; A+ i * row-length * element-size r18 <- k r19 <- r8 r20 <- r18 mul r8 ; k * element-size r21 <- r add r20 ; A+ i * row-length * element-size + k * element-size deref(r21) <- 2
We've saved a bunch of computation by replacing arithmetic operations with register copies, but we haven't saved any registers. Examining this code does show a lot of unneeded operations that we could eliminate by tweaking this algorithm. The first thing we notice is all the assigments to registers that are never used. This is not a big deal since a later compiler pass will get rid of these assignments. This is easy to fix (at least given our simplifying assumption above) here however by noticing that any assignment x <- y where y has already been assigned a value a number can be safely deleted.
There are a few tweaks left to this algorithm. Noticing algebraic identities is a start. x add 0, x mul 1, and x div 1 should all be noticed and given the same value number as x. Commutative operations should be noticed. x add y and y add x should have the same value number for integer values of x and y.
Noticing identities is generally a win, but it doesn't help our array example. there is one killer improvement for our example however, and that is to incorporate constant folding. Any time we see a variable loaded from an immediate, we know the actual value represented by that variable's value number and we can store it. More generally, any time we see an operation of the form x <- y op z and both ValueNumberOf(y) and ValueNumberOf(z) have known values, we can calculate the actual value being stored in x and note that information together with x's value number; furthermore, we can replace x <- y op z with an immediate load. If we continue to accept our simplifying assumption we can even remove x <- y op z entirely after storing a value with x's value number; in the more general case we can rely on a later compiler pass to remove this assignment. In any case if one of ValueNumber(y) or ValueNumber(z) has an associated value, we can replace that parameter with an immediate.
With the above improvements, our array example reduces to the following
r0 <- i r1 <- r0 mul row-length*element-size ; i * row-length * element-size r2 <- addressof(A) add r1 ; A + i * row-length * element-size r3 <- j r4 <- r3 mul element-size ; j * element-size r5 <- r2 add r4 ; A + i * row-length * element-size + j * element-size deref(r5) <- 1
r6 <- k r7 <- r6 mul element-size ; k * element-size r8 <- r2 add r7 ; A+ i * row-length * element-size + k * element-size deref(r8) <- 1
We've reduced it from 24 instructions down to 11, and along the way we used 14 fewer registers. Not a bad deal!
|
-
At some point in your programming career, you've probably written a growable list. It might've been to implement a dynamic array or it might've been for a hash table, but I bet you've written one. After all, while you can usually find a library that has already implemented your application for the growable list, using that library isn't always an option. In the past I've sometimes had to reimplement standard data structures because the library available to me had bugs I wasn't able to work around.
An introductory data structures class will teach you that the "correct" way of implementing a growable list is by using a doubling strategy. The analysis goes as follows: say I start out with a list large enough to hold a single element, and I grow the array by a single element each time. Every grow operation requires me to allocate new memory and copy the content of the array over -- this takes time proportional to the size of the array. Adding n elements requires takes time proportional to
1 + 2 + 3 + .... + n = 1/2 * n * (n+1) = O(n^2)
It should be obvious that growing the array by any constant amount will still leave you with O(n^2) time for n inserts. On the other hand if we double the array length each time, we only have to copy the entire array about lg(n) times for n inserts. If we assume that n = 2^l + r, this then requires time proportional to
n + 1 + 2 + ... 2^l = n + 2^(l+1) - 1 < n + 2n -1 < 3n = O(n)
One gotcha to look out for is shrinking the list. If you halve the length of the list whenever possible, then a loop that alternately adds and removes a single element will run very slowly indeed. A little care must be taken to not shrink too early.
End of story, right? The math works, out so we know what to do. Of course the real world is never so pretty. Sometimes we have some knowledge of the data set which tells us that doubling is overkill. If I know that I'll rarely overrun the list by more than a few elements and the list is likely to never grow very large, I might decide to use the constant-grow mechanism. After all asymptotic analysis only applies if my data can grow large, and I know that's not the case; of course if my data changes at some future date, the maintainer of the code will probably curse my name for using constant grows.
There's one major problem with the above strategy of doubling the array, and that is latency. When you amortize the cost out over a length of time it's fairly cheap, however if you graph the performance of this list as you add elements you'll see spikes whenever the array is doubled. This spikes might be acceptable in a batch system, but cause usability issues in whenever the array is doubled. A laudable goal is to find a growable array class that is both efficient and that maintains steady performance without spikes.
One trick that I learned for Alan Cox of Rice University is to programmatically amortize the cost of the doubling operation. What does this mean? Well the above operation worked by having most adds be extremely cheap with the rare (lg n times) occurance of an extremely-expensive add. If you're willing to take a slight performance hit on every add, you can completely eliminate the expensive add operation. Consider the following C++ class for an integer array (untested code):
// assume memory allocations always succeed class Array { private: int* oldList; int* currentList; size_t size, maxSize; size_t elementsCopied; public: Array() : oldList(NULL), newList(NULL), size(0), maxSize(1), elementsCopied(0) { currentList = new int[1]; }
~Array() { delete[] oldList; delete[] currentList; }
void Add(int item) { if (size == maxSize) { delete[] oldList; oldList = currentList; maxSize *= 2; currentList = new int[maxSize]; elementsCopied = 0; } currentList[size++] = item; if (oldList != NULL) { currentList[elementsCopied] = oldList[elementsCopied]; ++elementsCopied; } }
void Get(size_t index) { if (oldList != NULL && index <= elementsCopied) return oldList[index]; else return currentList[index]; } };
Notice that when we fill up the array we allocate a new memory buffer of double the size, but we don't actually copy anything. We keep the old memory around though, and every time an element is added to the the end of the array we copy one more element into the new buffer. Since the new buffer is double the size, by the time we've filled it up we've also finished copying from the old buffer (notice that this works for any multiple > 1). We've slowed down each Add operation by replacing one write with a read and two writes, but we no longer hav latency jags when we double the array. (we assume that allocating the new memory is a constant-time operation; this isn't a bad assumption, and the fact that we allocate memory in powers of 2 actually helps for many head allocators). The other tradeoff is memory. We use 1.5 times as much memory as with the previous scheme to allow us to do the copying lazily.
Starting tomorrow, I'll start focusing on interesting trasformations done by compilers.
|
-
Every so often, your application makes a call on a COM interface pointer, and this call fails with RPC_E_DISCONNECTED. Or it fails with RPC_E_SERVER_DIED, or with RPC_E_SERVER_DIED_DNE, or any number of RPC errors. In most cases what this means is that the server implementing this COM interface is no longer running. Which error code you receive is dependent on many factors, such as whether the server process exited normally, whether the server process shutdown unexpectedly, whether the shutdown was in progress when the call came in, etc. There are many reasons why this might happen. The server process may have simply crashed. Alternatively, it may have exited in response to someone stopping its service, killing it from task scheduler, or even because of a bug that caused it to exit early. If your application uses out-of-proc COM objects, it needs to realize that these errors might occur. There are many ways that these errors can be handled, but as always the choice depends on your scenario.
One common pattern is for an application to maintain a list of pointers to COM objects, all of them implementing the same interface, for use throughout the application. Your application might be an event publisher, pushing event notifications out to anyone who's subscribed. Alternatively, these COM objects may all be providers of functionality used by your application. In either case, your application is going to be littered with code that looks like this
list = GetInterfaces() for each p in list HRESULT hr = p->SomeFunc()
if (FAILED(hr)) handle error
Now, there's a real problem with the above code. If any of the COM servers has exited, you'll get one of the above-mentioned RPC errors. In order to fix this, we need to check for these errors and handle the appropriately. Let's say that we want to restart a COM server whenever it exits unexpectedly. In order to do this, we need to store a CLSID with each interface pointer and we also need to handle RPC errors in the above code snippet; for simplicity, we'll only care about the three RPC errors mentioned above. Now, instead of storing a list of interface pointers, we'll store a list of the following structure
struct Entry { CLSID m_id; IInterfacePointer* m_interface; Entry* m_next; };
Every time we want to broadcast a function call to all entries in the list, we now need code that looks like this
Entry* current = GetInterfaces(); while (current) { HRESULT hr = current->m_interface->SomeFunc(); if (hr == RPC_E_DISCONNECTED || hr == RPC_E_SERVER_DIED || hr == RPC_E_SERVER_DIED_DNE) NotifyEntryDead(current); else if (FAILED(hr)) // standard error handling
current = current->m_next; }
Where NotifyEntryDead looks something like this
void NotifyEntryDead(Entry* entry) { RemoveFromList(entry); HRESULT hr = CoCreateInstance(entry->m_id, NULL, CLSTCTX_ALL, IID_IInterfacePointer, (void**)&entry->m_interface); if (FAILED(hr)) // handle error AddToList(entry); }
Now, writing this loop each time we broadcast a function call creates a lot of tedious, repetitious code. A slightly more clever programmer would create proxy functions, one for each member of the IInterfacePointer interface, which did this broadcasting for you. However if the interface has many members, it's still annoying to have to add this NotifyEntryDead code to each of these broadcast functions. What's more, if you're fixing existing code to handle this situation, there might not be any broadcasting functions; in this case, you'll be forced to go add this RPC error checking to all of these loops scattered around your code.
Now what would be really nice is if the GetInterfaces() function did this checking for you. Add a new member to IInterfacePointer called IsAlive, and document that it should always return S_OK. GetInterfaces() will now walk over the list calling IsAlive() on all the pointers. If IsAlive fails for some interface pointer, GetInterfaces() will call NotifyEntryDead for that entry. There's still a window where your loops might hit these RPC errors, but the window is now small enough that it's acceptable to handle them as normal errors.
This is great if you have control over this IInterfacePointer interface. Unfortunately, it's often the case that you can't control what members the interface has. In order to check whether the interface pointer is valid, you need to be able to call some function on the interface which has no side effects. And that's when you remember about IUnknown.
Every COM interface derives from a base interface named IUnkown. IUnknown declares three members that all COM classes must implement: AddRef, Release, and QueryInterface. So, our first attempt at checking whether an interface pointer is valid looks as follows
bool IsValid(IUnknown* pointer) { HRESULT hr = pointer->AddRef(); if (SUCCEEDED(hr)) hr = pointer-Release();
return SUCCEEDED(hr); }
Unfortunately, it turns out that the above function does not work. To understand why, you must remember that the interface pointer isn't really a pointer to the out-of-proc COM class. Rather it's a pointer to a local proxy class that implements the same interfaces as the out-of-proc class. This proxy class is responsible for marshalling your function calls over to the process hosting the COM server and calling the real interface pointer in the context of that process. It turns out that this proxy will only send one AddRef and one Release to the out-of-proc COM class. Any AddRefs and Releases after the first AddRef are cached locally in the proxy, and a Release will not be sent to the out-of-proc class until the local reference count drops to zero. Eventually COM will notice that the server is gone and these calls should start failing, however this might take some time.
Since AddRef and Release don't work, let's try using QueryInterface. Here's our second attempt
bool IsValid(IUnknown* pointer) { IUnknown* unknown = NULL; HRESULT hr = pointer->QueryInterface(IID_IUnknown, (void**)&unknown); if (SUCCEEDED(hr)) unknown->Release();
return SUCCEEDED(hr); }
QueryInterface for IUnknown must always succeed for any valid COM object, so you would expect it only to fail if the COM object is no longer around. Unfortunately, this suffers from the same problem that the AddRef solution had. The COM proxy will cache interface pointers, and if you call QueryInterface with an interface ID that's already cached (and IUnknown will always be cached), QueryInterface succeeds without going beyond the bounds of the current process. This knowledge leads us to our final attempt. In order to guarantee that COM actually attempts to reach the actual COM object, we must call QueryInterface for an interface id that the proxy has never seen. The simplest way to do this is to create a new unique identifier each time. A proper COM object must return E_NOINTERFACE for such a request, so we write the following function
bool IsValid(IUnknown* pointer) { IID nonExistent = GUID_NULL; HRESULT hr = ::CoCreateGuid(&nonExistent); if (FAILED(hr)) // handle error IUnknown* unknown = NULL; HRESULT hr = pointer->QueryInterface(nonExistent, (void**)&unknown); assert(FAILED(hr));
return hr == E_NOINTERFACE; }
Our GetInterfaces() function will now walk over the entire list calling IsValid() on each interface pointer. Any pointers for which IsValid fails will be restarted using NotifyEntryDead.
|
-
Now that I've gotten some of my math fetish out of my system, I plan on blogging for a while on VSS and how to backup a Windows system. Starting with Windows XP, Microsoft has changed the way backup and restore of Windows system are done. We provided a way to create point-in-time shadow copies of a volume, thus eliminating previous problems of changing time windows and open handles. We also provided a generic backup infrastructure that allows a backup application to coordinate with data stores during backup and restore.
There were a number of problems with Windows backup in previous versions. The first is that it was complicated. Every system service had a different set of backup APIs, there was another set of instructions for backing up the registry, and all this work only backed up the bare system; if you had any third-party stores installed, you had to consult with them to determine how to back them up. Backup applications had to consult a 100-page white paper to figure out how to backup the system, as well as white papers from numerous third-party stores. In addition to this, backup windows caused problems. Files were constantly being modifed during the backup window, another application could have an exclusive handle open to a file, and backup often neccesitated bringing a store down for the duration of the backup window.
In Windows XP, Microsoft shipped the VSS infrastructure. This was shipped mainly in three binary files. VSSAPI.DLL contained the backup infrastructure, VSSVC.EXE implemented a service that coordinates shadow-copy creation, and SWPRV.DLL implemented a provider for shadow-copy creation. We'll talk first about the role of VSSVC.EXE and SWPRV.DLL.
In order to solve the problem of exclusive handles and changing backup windows, Microsoft introduced the notion of Volume Shadow Copies. a volume shadow copy is a point-in-time, unchanging image of some live volume on your system. The shadow copy appears as a new volume on your system (and in Windows Server 2003 you can even give it a drive letter), but all the files on this new volume are read only. What's more, nobody should have any handles open on the shadow-copy volume, eliminating that traditional problem of backup applications on Windows. The model for shadow-copy creation involves three actors: a requestor, a coordinator, and a provider. The requestor is the application that requests the shadow copy. The requestor sends the creation request to the coordinator, which is implemented in VSSVC.EXE. The coordinator then picks a provider to create the shadow copy, and returns a path to the new shadow-copy device. By default the Microsoft provider, which is implemented in SWPRV.DLL, is used to create the new shadow copy.
VSSAPI.DLL implements the new backup infrastructure, which was an attempt to simplify the task of backing up a system. Every data store on a system -- including system services as well as complicated third-party stores -- hosts a piece of code called a writer. A writer knows what files need to be backed up for that store, and also can bring those files to a consistent state. A backup application queries all writers on the system, and uses the information returned by those writers to determine which volumes and files need backing up. It then tells all the writers involved to freeze their data and asks the VSS service to create shadow copies for all volumes involved. This shadow copy now contains all of the stores' data in a consistent state, so writers are allowed to continue writing to the original volume. The backup application can then back up the data at it's leisure off of the shadow-copy volume.
This was a very basic overview of how VSS and the backup infrastructure work. In future posts I'll delve more deeply into some details of VSS. For more information you can also look at the following presentations:
http://support.microsoft.com/?kbid=889247
http://support.microsoft.com/default.aspx?kbid=889346
As well as the MSDN documentation at: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vss/base/volume_shadow_copy_service_overview.asp
|
-
I haven't been blogging in a little while, and I want to start blogging about VSS related topics soon. However, for now I'll throw out another math puzzle, though this one is much easier.
Remember that given a set U of real numbers, a real number x is said to be a limit point of U if every open interval around x contains a point of U other than x.
Let V be a bounded, countabll-infinite set of real numbers, and let limit(V) be the set of limit points of V (note that limit(V) might not intersect V at all). Is it possible for limit(V) to also be a countably-infinite set?
|
-
I left off last time explaining how invertible function spaces naturally form a group and how that leads to the concept of a group action. This concept turns out to be useful in examining any group, not just explicit function spaces. Any group G can be seen as a space of invertible functions (by the way, the fancy term for these sorts of function is bijections) on itself. How does this work? Well, let g be some element of G. With g we can associate a function g(h) G->G such that g(h) = gh. All this function does is multiply it's argument by g on the left using the normal group multiplication; since g has an inverse in G, the inverse function is also well defined. Now since we've presented the group as a function space we can define an action p of G on G using the evaluation function by saying that p(g, h) = gh.
One more piece of notation. Let's say that G acts on X and g is a group element. If A is a subset of X, we define gA to be the set obtained by applying the action of g to every element in A. In formal notation: gA = {ga : a :el: A}
Now that group actions are understood, we can move on to a concept at the heart of the Banach Tarski Paradox. Let's start with a definition in Math Speak:
Let G be a group acting on a set X. We say that this group action is paradoxical if there are disjoint subsets of X A1, A2, ..., An and B1, B2, ..., Bn and group elements g1, g2, ..., gn such that the following holds:
- g1A1 U g2 A2 U ... U gn An = X
- g1B1 U g2 B2 U ... U gn Bn = X
Less formally, this set X can be pulled apart into two pieces, and each of these pieces can be grown back into the original set by proper application of group elements.
In and of itself, this result might not be terribly surprising. After all, the function in G might be doing terribly strange things to the set X. In fact if we let G be the group of all invertible functions on X, you can show right away that the group action is paradoxical. What would be surprising is if we could show that an ordinary, familiar group and an ordinary, familiar action has this property.
And of course we can. Let SO^3 be the group of all 3-dimensional rotations. The Banach Tarski Paradox states that the group action of SO^3 on a 3-dimensional ball with radius 1 is paradoxical. Namely, we can pull the ball into several pieces, apply rotations to those pieces, and get two complete balls when we put them together. What's more, this can be done using no more than 5 pieces.
Next time I'll outline the proof, and talk about how this impacts the real world.
|
-
There are many solutions to this problem. Here's an easy one that everybody can understand.
Let's construct a bipartite graph representing this problem. On the left side, we have a vertex for every integer lattice point (i.e. (0,0) (0,1) (1,1) etc.). On the right side we have a vertex for every subrectangle in the tiling. Draw an edge between a lattice-point vertex and a subrectangle vertex if that vertex lies on one of the subrectangle's corners.
Now it's a simple matter of counting. Let's count the edges twice. First let's count the edges by looking at the rectangles. Every subrectangle can have 0, 2, or 4 edges coming in. This is obvious, since integer corners must come in pairs. If one corner has integer coordinates so must another since this subrectangle has one side of integer length. It's therefore obvious that this graph has an even number of edges.
Now let's count the edges by looking at the vertices. A vertex can be in one of four places. It can be in the middle of a square (not a corner) in which case it has 0 edges coming out. It can be on an interior corner, in which case it has 4 edges coming out for all the surrounding squares. It can be a corner of a subrectangle but on an edge of the big rectangle, in which case 2 edges come out. Finally, it can be on a corner of the big rectangle, which means only one edge comes out. The vertex (0,0) always is one of these final corners so has exactly one incident edge. However, all other types of edges add an even number of edges to the total and we know the final number of edges must be even, so there must be at least one other corner of the big rectangle which is an integer lattice point. This gives us a side of the big rectangle that's of integer length, and we're done!
For another proof of this, think about what happens if you integrate the function sin (2 pi x) sin (2 pi y) over the rectangle.
|
-
What does the following code snippet do?
bool test(const wstring& first, const wstring& second) { if (first.empty() && second.empty()) return true;
if (first.empty()) return (second[0] == L'*') && test(first, second.substr(1)); if (second.empty()) return (first[0] == L'*') && test(first.substr(1), second);
switch(first[0]) { case L'?': if (second[0] == L'*') { return test(first.substr(1), second) || test(first, second.substr(1)); }
return test(first.substr(1), second.substr(1)); case L'*': return test(first, second.substr(1)) || test(first.substr(1), second); default: switch(second[0]) { case L'?': return test(first.substr(1), second.substr(1)); case L'*': return test(first.substr(1), second) || test(first, second.substr(1)); default: return (first[0] == second[0]) && test(first.substr(1), second.substr(1)); } } }
|
|
|
|