The Stack Is An Implementation Detail, Part Two

The Stack Is An Implementation Detail, Part Two

Rate This
  • Comments 23

A number of people have asked me, in the wake of my earlier posting about value types being on the stack, why it is that value types go on the stack but reference types do not.

The short answer is “because they can”. And since the stack is cheap, we do put them on the stack when possible.

The long answer is… long.

I’ll need to give a high-level description of the memory management strategies that we call “stack” and “heap”. Heap first.

The CLRs garbage-collected heap is a marvel of engineering with a huge amount of detail; this sketch is not actually how it works, but it’s a good enough approximation to get the idea.

The idea is that there is a large block of memory reserved for instances of reference types. This block of memory can have “holes” – some of the memory is associated with “live” objects, and some of the memory is free for use by newly created objects. Ideally though we want to have all the allocated memory bunched together and a large section of “free” memory at the top.

If we’re in that situation when new memory is allocated then the “high water mark” is bumped up, eating up some of the previously “free” portion of the block. The newly-reserved memory is then usable for the reference type instance that has just been allocated. That is extremely cheap; just a single pointer move, plus zeroing out the newly reserved memory if necessary.

If we have holes then we can maintain a “free list” – a linked list of holes. We can then search the free list for a hole of appropriate size and fill it in. This is a bit more expensive since there is a list search. We want to avoid this suboptimal situation if possible.

When a garbage collection is performed there are three phases: mark, sweep and compact. In the “mark” phase, we assume that everything in the heap is “dead”. The CLR knows what objects were “guaranteed alive” when the collection started, so those guys are marked as alive. Everything they refer to is marked as alive, and so on, until the transitive closure of live objects are all marked. In the “sweep” phase, all the dead objects are turned into holes. In the “compact” phase, the block is reorganized so that it is one contiguous block of live memory, free of holes.

This sketch is complicated by the fact that there are actually three such arenas; the CLR collector is generational. Objects start off in the “short lived” heap. If they survive they eventually move to the “medium lived” heap, and if they survive there long enough, they move to the “long lived” heap. The GC runs very often on the short lived heap and very seldom on the long lived heap; the idea is that we do not want to have the expense of constantly re-checking a long-lived object to see if it is still alive. But we also want short-lived objects to be reclaimed swiftly.

The GC has a huge amount of carefully tuned policy that ensures high performance; it attempts to balance the memory and time costs of having a Swiss-cheesed heap against the high cost of the compaction phase. Extremely large objects are stored in a special heap that has different compaction policy. And so on. I don’t know all the details, and fortunately, I don’t need to. (And of course, I have left out lots of additional complexity that is not germane to this article – pinning and finalization and weak refs and so on.)

Now compare this to the stack. The stack is like the heap in that it is a big block of memory with a “high water mark”. But what makes it a “stack” is that the memory on the bottom of the stack always lives longer than the memory on the top of the stack; the stack is strictly ordered. The objects that are going to die first are on the top, the objects that are going to die last are on the bottom. And with that guarantee, we know that the stack will never have holes, and therefore will not need compacting. We know that the stack memory will always be “freed” from the top, and therefore do not need a free list. We know that anything low-down on the stack is guaranteed alive, and so we do not need to mark or sweep.

On a stack, the allocation is just a single pointer move – the same as the best (and typical) case on the heap. But because of all those guarantees, the deallocation is also a single pointer move! And that is where the huge time performance savings is. A lot of people seem to think that heap allocation is expensive and stack allocation is cheap. They are actually about the same, typically. It’s the deallocation costs – the marking and sweeping and compacting and moving memory from generation to generation – that are massive for heap memory compared to stack memory.

Clearly it is better to use a stack than a GC heap if you can. When can you? Only when you can guarantee that all the necessary conditional that make a stack work are actually achieved. Local variables and formal parameters of value type are the sweet spot that achieve that. The locals of frames on the bottom of the stack clearly live longer than the locals on the frames of the top of the stack. Locals of value type are copied by value, not by reference, so the local is the only thing that references the memory; there is no need to track who is referencing a particular value to determine its lifetime. And the only way to take a ref to a local is to pass it as a ref or out parameter, which just passes the ref on up the stack. The local is going to be alive anyway, so the fact that there is a ref to it “higher up” the call stack doesn’t change its lifetime.

aside
{

A few asides:

This explains why you cannot make a “ref int” field. If you could then you could store a ref to the value of a short-lived local inside a long-lived object. Were that legal then using the stack as a memory management technique would no longer be a viable optimization; value types would be just another kind of reference type and would have to be garbage collected.

Anonymous function closures and iterator block closures are implemented behind-the-scenes by turning locals and formal parameters into fields. So now you know why it is illegal to capture a ref or out formal parameter in an anonymous function or iterator block; doing so would not be a legal field.

Of course we do not want to have ugly and bizarre rules in the language like “you can close over any local or value parameter but not a ref or out parameter”. But because we want to be able to get the optimization of putting value types on the stack, we have chosen to put this odd restriction into the language. Design is, as always, the art of finding compromises.

Finally, the CLR does allow “ref return types”; you could in theory have a method “ref int M() { … }” that returned a reference to an integer variable. If for some bizarre reason we ever decided to allow that in C#, we’d have to fix up the compiler and verifier so that they ensured that it was only possible to return refs to variables that were known to be on the heap, or known to be “lower down” on the stack than the callee.

}

So there you go. Local variables of value type go on the stack because they can. They can because (1) “normal” locals have strict ordering of lifetime, and (2) value types are always copied by value and (3) it is illegal to store a reference to a local anywhere that could live longer than the local. By contrast, reference types have a lifetime based on the number of living references, are copied by reference, and those references can be stored anywhere. The additional flexibility that reference types give you comes at the cost of a much more complex and expensive garbage collection strategy.

But again, all of this is an implementation detail. Using the stack for locals of value type is just an optimization that the CLR performs on your behalf. The relevant feature of value types is that they have the semantics of being copied by value, not that sometimes their deallocation can be optimized by the runtime.

 

  • Quite interesting tech article, because sometimes GC looks like a black box.

    Being a developer, the deeper platform understanding you have - the more efficient programs you can build.

    Thanks for the post!

    John Williams

    http://johnwilliams.eu

  • Note, however, that the time cost of precise GC heap collection is proportional to the number of live objects, not the number of objects that need to be freed, while stack allocation has a deallocation cost that's linear in the number of stack frames at least. Garbage collection, amortized, can cost less than a single instruction per allocated object.

    That means that a GC heap can actually outperform stack allocation asymptotically. Obligatory paper reference: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.8219

    Of course, this is a theoretical asymptotic bound, as it requires delaying GC until the fraction of live objects as a percentage of the total heap is quite small, and therefore requires copious extra memory.

  • Very interesting; thanks. I was discussing this with a colleague and we thought that the question "What's the difference between a value type and a reference type" would make an excellent interview question.

    Also... when we tried to define values vs reference, the best we could come up with was based on identity -- that the identity of a value type is based on its attributes, but a reference type is not. So the value tuple { 'a', 1, 'foo' } is equal to every other value tuple { 'a', 1, 'foo' }. but reference tuples (say, classes with public fields) are not equal.

    We also found that we almost never defined structs, but nearly always use classes.

  • This is very interesting but I have a question related to this is there a way to mark reference objects that you know will live for the entire lifetime of the application? For examlpe I'm working on an application where UI strings are read from a file into a hashtable the first time one needs to be used and this object is immutable for the applications lifetime, would it be possible to let the runtime know this upfront?

  • >stack allocation has a deallocation cost that's linear in the number of stack frames at least.

    Method calls return one at a time, so stack frames are removed one at a time. Then sure, it makes sense that that would take linear time in the number of stack frames.

    >Garbage collection, amortized, can cost less than a single instruction per allocated object.

    I'm not so sure. Aren't all the objects flagged by the GC in the mark phase? That's at least a single instruction per allocated object. And it would make sense if the GC checked that flag during compacting, which is more work per object.

    Even if garbage collection wasn't even linear in the number of objects, this is seems an unfair comparison. Stack deallocation time may be dependent on the number of stack frames deallocated, but it's not dependent on the number of stack objects deallocated.

  • You say here that the "free list" is a linked list - but isn't it supposed to be a heap ordered by hole size or something like that?

  • Thank you for submitting this cool story - Trackback from DotNetShoutout

  • @Joren

    I haven't read the paper that Barry Kelly linked but remember that live objects are flagged in the mark phase, not dead ones.  So if your heap consists of almost all dead objects there will be large contiguous swathes of memory that can be reclaimed without ever examining them.

    Interesting stuff.

  • @Kjartan: Your strings will quickly enter the 3rd generation of GC objects, where they will stay, untouched, for a long time. I don't think you'd gain much performance benefit having it go to the 3rd directly instead of starting off at the 1st.

  • @Kjartan That is indeed something that can help (in certain classes of applications) and the technique is called pre-tenuring.

    The benefit is marginal in a lot of real world cases though. In most cases for it to be beneficial the amount of memory to pre-tenure must be reasonably large, and as allocations over a certain size go into the Large Object heap immediately there is little need to pre-tenure them (if you size the dictionary appropriately to start with rather than relying on the doubling up behaviour then it will end up straight in the LOH it is just the strings that would have to work their way through the Generations)

    This optimisation can improve the speed of short lived process runs but in general managed languages are relatively bad at this anyway so don't try to jump through extreme hoops to make this better (fundamentally paging in the dlls required for the GC itself is a massive start-up cost anyway)

  • > A lot of people seem to think that heap allocation is expensive and stack allocation is cheap

    That is correct in the native C/C++ world so it's not very surprising that people tend to project it to the managed world

  • @Steve Cooper: The difference between value and reference types is to do with identity, as you point out, but you don't actually need a concept of "is equal to". The real distinction is when objects (value or reference) are modified - copies of a value type aren't modified when the original is, but copies of a reference type are modified. In languages like Haskell where you can't directly modify variables, there is no distinction - the implementation can use values or references under the covers as it sees fit.

  • "Equality" is a very interesting item, as this can be overloaded on a per class/struct basis therefor the points raised by Steve Cooper, are the DEFAULT behaviour, but should NEVER be assumed to be true.

  • > If for some bizarre reason we ever decided to allow that in C#, we’d have to fix up the compiler and verifier so that they ensured that it was only possible to return refs to variables that were known to be on the heap, or known to be “lower down” on the stack than the callee.

    CLR 2.0+ already does it for "ref to heap" case. In other words, you are allowed to use the sequence of LDFLDA immediately followed by RET in verifiable typesafe code, if the return type of method is a managed pointer.

  • GC is a nice evolution of memory management.  Thanks for the explanation as not many developers have done signfiicant work in C/C++ or pre-GC parallel processing applications.  I've found only a couple cases where GC needed manual tuning to get good performance.  One was an application that would read in and process large files block by block (4mb+ blocks) and needed the buffers to be allocated as soon as possible and live throughout the lifetime of the application.  

Page 1 of 2 (23 items) 12