Why have a stack?


Last time I discussed why it is that we have all the .NET compilers target an "intermediate language", or "IL", and then have jitters that translate IL to machine code: because doing so ultimately reduces the costs of building a multi-language, multi-hardware platform. Today I want to talk a bit about why IL is the way it is; specifically, why is it a "stack machine"?

To begin with, what is a "stack machine"? Let's consider how you might design a machine language that could describe the operation of adding together two integers to make a third. You could do it like this:

add [address of first addend], [address of second addend], [address of sum]


When the machine encounters this instruction it looks up the values of the addends stored in the two addresses, somehow adds them together -- how it does so is its business -- and stores the result in the third address.

You might instead say that there is a special region of memory called the "accumulator" which knows how to add a given value to itself:

clear_accumulator
increase_accumulator [address of first addend]
increase_accumulator [address of second addend]
save_accumulator [address of sum]


Or, you could say that there is a special region of memory called the "stack" which can grow and shrink; you get access to the items on the top:

push_value_at [address of first addend]
push_value_at [address of second addend]
add
pop_value_into [address of sum]


The "add" instruction takes the two values off the top of the stack, somehow adds them, and then puts the result back on the stack; the net result is that the stack shrinks by one.
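The three designs differ only in where intermediate results live. As a toy illustration of the third, here is a minimal stack-machine interpreter; the instruction names mirror the pseudo-assembly above and are invented for the example, not real CIL.

```python
# A minimal sketch of the stack-machine design above. Memory is a dict
# mapping "addresses" (names) to values; the stack is an ordinary list.

def run(program, memory):
    """Execute a list of (opcode, operand) pairs against a memory dict."""
    stack = []
    for op, arg in program:
        if op == "push_value_at":      # push memory[arg] onto the stack
            stack.append(memory[arg])
        elif op == "add":              # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "pop_value_into":   # pop the top of the stack into memory[arg]
            memory[arg] = stack.pop()
    return memory

memory = {"x": 2, "y": 3, "sum": 0}
program = [
    ("push_value_at", "x"),
    ("push_value_at", "y"),
    ("add", None),
    ("pop_value_into", "sum"),
]
run(program, memory)  # memory["sum"] is now 5
```

Note how "add" names no operands at all; the stack discipline makes the sources and destination implicit.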

A virtual machine where most of the instructions are of this third form is called a "stack machine", for obvious reasons.

IL specifies a stack machine, just like many other virtual machines. But most hardware instruction sets actually more closely resemble the second form: registers are just fancy accumulators. Why then are so many virtual machines specifying stack machines?

There are several reasons, but again, it primarily comes down to lowering costs. Stack machines are very easy to understand, they are very easy to write a compiler front-end for, they are easy to build interpreters for, they are easy to build jitters for, and they provide a concise abstraction for the common case. The common case is that the result of any particular operation is only going to be of interest for a brief period.

Imagine, for example, that we had chosen the first strategy for IL and then had to compile an expression like x = A() + B() + C(); What would we have to generate? Something like this:

create_temporary_storage // for result of A()
call A(), [address of temporary storage]
create_temporary_storage // for result of B()
call B(), [address of temporary storage]
create_temporary_storage // for result of first addition
add [address of first temporary storage], [address of second temporary storage], [address of third temporary storage]
...


You see how this goes? The IL is getting huge, and all so that we can keep track of precisely which memory locations are used to store values that we will never care about again. A stack abstraction lets the stack implementation deal with the temporary storages; in a stack machine, the same code looks something like:

push [address of x]
call A() // implicitly creates a temporary storage by pushing the result on the stack
call B()
add
call C()
add
store // store the result on top of the stack into the address just below it


The code is much smaller and much easier to understand. A stack machine is a very simple way to describe a complex computation; being able to target such a simple machine lowers the cost of making a compiler. And not only is it easier to write compilers and jitters that target simple stack machines, it is easier to write other code analysis tools as well. The IL verifier, for example, can quickly determine when there is a code path through a method that, say, misaligns the stack or passes the wrong types of arguments to a method.
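A sketch of that kind of verification: a single pass over a straight-line instruction stream, tracking stack depth, can flag code that pops an empty stack or leaves the stack unbalanced. The opcode names and their stack effects here are the invented ones from the pseudo-assembly above, not real CIL.

```python
# Hypothetical opcode -> (values popped, values pushed)
STACK_EFFECT = {
    "push_value_at": (0, 1),
    "add": (2, 1),
    "pop_value_into": (1, 0),
}

def verify(program):
    """Return True if the stack stays aligned along this code path."""
    depth = 0
    for op, _ in program:
        pops, pushes = STACK_EFFECT[op]
        if depth < pops:
            return False          # would pop from an empty stack
        depth += pushes - pops
    return depth == 0             # a balanced path leaves nothing behind

good = [("push_value_at", "x"), ("push_value_at", "y"),
        ("add", None), ("pop_value_into", "sum")]
bad = [("add", None)]             # nothing to add yet: misaligned stack
verify(good)  # True
verify(bad)   # False
```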

  • It's simple, it's well understood, and it works. Three wonderful, enviable traits for any system to have.

  • This article demonstrates the pros of using a stack machine, chiefly simplicity, but what about the cons? What are we giving away by using a stack machine?

    Good question; just because the "pros" are enormous does not mean that there are no "cons". The CIL system avoids many of the "cons" because (1) the stack machine is just an intermediate form on the way to being realized as machine code, and (2) the CIL system is not a "pure" stack machine; you can get access to "local" items that are not at the top of the stack easily enough. These effectively give you arbitrarily many registers if you want to, say, perform common subexpression elimination. For a good discussion of some of the problems that stack machines have, see http://en.wikipedia.org/wiki/Stack_machine -- Eric

  • I think you're being a tiny bit disingenuous.

    I think I am not being even a tiny bit disingenuous and frankly I resent the statement that I'm being deliberately misleading in any way. If you'd like to have a discussion of the technical merits of various solutions I am happy to do so; you can leave the ad hominem attacks out. -- Eric

    Three-address code, your "first strategy", is typically not as verbose as your example of its use; this "create_temporary_storage()" bit is a poor design choice. Rather, it might look like:

       %1 = A()
       %2 = B()
       %3 = %1 + %2
       %4 = C()
       %5 = %3 + %4
       x = %5

    I'm deliberately using syntax reminiscent of LLVM's SSA-based approach. On a line count basis, it's actually shorter than the stack machine; and with an SSA encoding, I think it's easier to apply many analyses than a stack machine encoding. SSA is also not particularly harder to generate code for than a stack machine (particularly an impure one like CLR or JVM, where locals and parameters have register-oriented storage rather than living on the stack, and you have access to something like LLVM's alloca and mem2reg pass); and nor is SSA any harder to interpret. I'll grant that the stack machine format, when encoded in bytecode, is probably easier to make more compact. But if size is that important, you'll still want to apply compression, and I wouldn't like to guess which straightforward encodings for SSA vs stack machine naturally give rise to lower entropy per instruction count for common compression algorithms.
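The lowering described above can be sketched concretely. This toy generator walks a nested-tuple expression tree and emits one three-address line per operator, with a fresh %n name for each intermediate result; the tree encoding and naming scheme are invented for the example.

```python
import itertools

def to_tac(node, code, fresh):
    """Lower a nested-tuple expression tree to three-address code lines."""
    if isinstance(node, str):          # a leaf: a variable or call result name
        return node
    op, left, right = node             # an operator node
    a = to_tac(left, code, fresh)
    b = to_tac(right, code, fresh)
    name = f"%{next(fresh)}"           # a fresh virtual register
    code.append(f"{name} = {a} {op} {b}")
    return name

# x = (A + B) + C, with A, B, C standing in for the call results:
code = []
result = to_tac(("+", ("+", "A", "B"), "C"), code, itertools.count(1))
code.append(f"x = {result}")
# code is now ['%1 = A + B', '%2 = %1 + C', 'x = %2']
```

Each intermediate gets a name instead of an implicit stack slot, which is exactly the trade the comment describes.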

  • Actually, the three-address code could look a line shorter again:

       %1 = A()
       %2 = B()
       %3 = %1 + %2
       %4 = C()
       x = %3 + %4

  • But still, the stack-based code does not have to specify all the data sources and destinations (be it memory addresses or registers).

  • @Joren: no, but in more complex scenarios it sometimes has to do a bit of copying around or shuffling of the stack ordering, because the data sources and destinations are fixed and implicit... You always gain some, lose some.

    That said, I understand stack machines are very easy to work with, but I would also tend to think that SSA forms are not that bad either and lend themselves pretty well to various analyses and optimizations. Anyway, we're stuck with the IL the way it is, and it's surely not going to change soon.

  • @Barry.Kelly: What, exactly, would %1, etc be representative of? Registers? Memory locations? If we assume the former, then we run into an issue with re-entrancy. Let's pretend this is a (non-tail) recursive function. If we're always writing into (say) EAX, then each iteration of the function will be overwriting that register, which may or may not be a problem.

    Let's assume, though, that %1 doesn't mean a specific register, but perhaps is a placeholder for "the next register" (or maybe the registers shuffle around so that EAX doesn't mean the same thing all the time, like how (I believe, without bothering to go look it up) Itanium works). Now, what happens if you run out of registers? Even in a non-recursive function, there are a limited number of those to pass around. In the interest of efficiency, we'd want to save the contents of any register we care about, so that other pieces of code don't have to worry about stepping on our toes. Now we're back to the initial problem of where to store these temporary values.

    If %1, etc, instead represent memory directly, then who owns this memory? Do they live in the DATA section of the executable? If so, what happens if we recurse? If memory is allocated for each stack frame (whoops, there's another problem with avoiding the stack), then where do we keep the address? In a register? But, there's only so many of those to go around...

    I'm sure these problems are not insurmountable (after all, I'm fairly certain that there exist stackless architectures, even if I'm not nearly smart enough to conceive of one), but the stack certainly lends itself to solving these problems in a concise manner.

    @DaRage: I imagine that one particularly nasty con of a stack-based system is that it's possible to run out of stack space. Technically, the hypothetical ideal stack has no limit on its size, but as we all know, our programs and machines cannot be infinite in size, so limits exist, and running out of stack space is an expressly fatal problem.

    Another con might be that now you need to worry about balancing the stack and ensuring that your code doesn't push more than it pops, or vice versa.

  • I'd imagine they represent local execution frame slots that are enregistered and/or spilled automatically by the JITter; after all, knowing about the hardware is the JITter's job, not the compiler's. So for @Barry's example, you could trivially generate something like (assuming callee cleanup for simplicity):

    ; %1 = C()
    call C()
    mov [ebp+4], eax
    ; %2 = D()
    call D()
    mov [ebp+8], eax
    ; %3 = %1 + %2
    mov eax, [ebp+4]
    mov ecx, [ebp+8]
    add eax, ecx
    mov [ebp+12], eax
    ; %4 = E()
    call E()
    mov [ebp+16], eax
    ; x = %3 + %4
    mov eax, [ebp+12]
    mov ecx, [ebp+16]
    add eax, ecx
    mov x, eax

    Which you could simplify of course:

    call C()
    mov [ebp+4], eax
    call D()
    mov ecx, [ebp+4]
    add ecx, eax
    mov [ebp+4], ecx
    call E()
    mov ecx, [ebp+4]
    add ecx, eax
    mov x, ecx

  • The nice thing about stack VMs is how easy it is to walk the expression tree and generate bytecode for it: you don't need to share any information between the nodes. Every one of them can generate its own code independently, without any assumptions about what the stack looks like beforehand, so long as it ends with the newly computed value on top.
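That property can be shown in a few lines: each node emits its own code with no context from its siblings, relying only on the invariant that evaluating a subtree leaves exactly one value on top of the stack. The tuple tree format and opcode names are invented, echoing the earlier pseudo-instructions.

```python
# Each subtree compiles to code that leaves exactly one value on the
# stack, so an operator node just concatenates its children's code and
# appends its own opcode -- no shared state needed.

def emit(node):
    if isinstance(node, str):                 # a leaf: a variable reference
        return [("push_value_at", node)]
    op, left, right = node                    # an operator node
    return emit(left) + emit(right) + [(op, None)]

# x + (y + z), written as a nested tuple tree:
tree = ("add", "x", ("add", "y", "z"))
emit(tree)
# [('push_value_at', 'x'), ('push_value_at', 'y'),
#  ('push_value_at', 'z'), ('add', None), ('add', None)]
```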

  • Mike Caron: The SSA "registers" (%1, %2, %3 in the example) are all virtual -- they don't "live" anywhere. The JIT compiler's job is to figure out where (if anywhere) each register is stored at any point in a function. The values might be stored in a CPU register, on the stack, or not stored at all. For example, %3 might be stored in a register when it's first computed, then moved to the stack to use it in a function call, and then discarded completely once it's unneeded. Once a register's lifetime has expired, or its value has been moved elsewhere, the location that stored it may be reused.

    Also, I should add that stack-based machines are not uncommon. The x87 is a stack architecture that probably everybody reading this post has an implementation of in some form or another. The Burroughs B5000, like the CLR, had a typed stack architecture.

  • Woah, Eric, I didn't mean to attack or insult you. I'm sorry it came out that way. I just think you picked a bad example.

    But I also explained why I think your example was bad. I notice you didn't address my technical argument.

  • @Mike.Caron: read up on LLVM. %1 etc. are pseudo-registers (most register VMs provide an infinite number of logical registers for convenience). But they can be viewed in another way: as names for nodes in an expression DAG. It's from this that simpler analyses flow.

  • Eric, FWIW, I actually believe the CLR chose a stack machine because the JVM chose a stack machine, and the JVM chose a stack machine because it was a reasonably compact representation when kept uncompressed in memory, and that in turn was important because it was designed to be interpreted on small devices (settop boxes and the like). But the CLR was from the get-go designed to compile its code, rather than interpret it in any circumstances. I think the CLR would actually have been slightly better off with an SSA design. But there's little to choose between them; it's not difficult to turn one into the other. From what I understand, most non-trivial JIT compilers convert the stack code into SSA anyway.

  • The JVM (and thus the CLR) stack machines look very much like the Smalltalk bytecode stack machines with types strapped on. The Smalltalk machine was, I think, originally microcoded on the Alto.

    At around the same time the AT&T folks, including, I think, Rob Pike, worked on the Inferno/Limbo system, which had a 3-address IL. They certainly argued it was quicker and easier to write a (simple, not-very-optimising) translator from 3-address code to machine code than it would have been with a stack machine.

  • To add balance, Microsoft also ship register-based virtual machines. So even within mscorp there is no consensus on the best approach. Some history, perhaps:

    Microsoft acquired a company in 1996 whose product was a register-based virtual machine. There was lots of speculation at the time that this was the basis for MSIL, but that speculation was wrong.

    Fast forward 15 years and the person who created that virtual machine also designed Microsoft's newest register-based virtual machine - namely the Chakra JavaScript engine:

    > "IE9 includes a new interpreter which uses a register-based layout, efficient opcode, and use of type optimizations."

    That company was Colusa, and the person is Steve Lucco:

    www.microsoft.com/.../default.mspx
