First a little background reading before going into tail call improvements in CLR 4 - David Broman did an excellent job at covering the basics in his post here: http://blogs.msdn.com/davbr/archive/2007/06/20/enter-leave-tailcall-hooks-part-2-tall-tales-of-tail-calls.aspx. He also captured a mostly complete list of the restrictions as they stood for CLR 2 here: http://blogs.msdn.com/davbr/pages/tail-call-jit-conditions.aspx.
The primary reason for a tail call as an optimization is to improve data locality, memory usage, and cache usage. By doing a tail call the callee will use the same stack space as the caller. This reduces memory pressure. It marginally improves the cache because the same memory is reused for subsequent callers and thus can stay in the cache, rather than evicting some older cache line to make room for a new cache line.
The other usage of tail calls are in recursive algorithms. We all know from our computer science theory classes that any recursive algorithm can be made into an iterative one. However, just because you can doesn’t mean you always want to. If the algorithm is naturally tail recursive, why not let the compiler do the work for you? By using a tail call, the compiler effectively turns a tail recursive algorithm into a loop. This is such an important concept that the JIT has a special path just for turning functions that call themselves in a tail recursive manner into loops. This is a nice performance win because beyond the memory improvements mentioned previously, you also save several instructions by not executing the prolog or epilog multiple times. The compiler is also able to treat it like a loop and hoist out loop invariants so they are also only executed once instead of being executed on every iteration.
In CLR 2 the 64-bit JIT focused solely on the ‘easy’ cases. That meant that it generated code that did a tail call whenever it could because of the memory benefits. However, it also meant that sometimes when the IL requested a tail call using the “tail.” instruction prefix, the JIT would not generate a tail call because it wasn’t easy. The 32-bit JIT had a general purpose tail call mechanism that always worked, but wasn’t performant enough to consider it an optimization, so it was only used when the IL requested a tail call.
For CLR 4 the fact that the x64 JIT would sometimes not honor the “tail.” prefix prevented functional languages like F# from being viable. So we worked to improve the x64 JIT so that it could honor the “tail.” prefix all of the time and help make F# a success. Except where explicitly called out, x86 and IA64 remain unchanged between CLR 2 and CLR 4, and the remainder of this document refers solely to the x64 JIT. Also this is specific to CLR 4, all other versions of the runtime or JIT (including service packs, QFEs, etc.) might change this in any number of ways.
The 64-bit calling convention is basically a caller-pops convention. Thus the ‘easy’ cases basically boiled down to where the JIT could generate code that stored the outgoing tail call arguments in the caller’s incoming argument slots, execute the epilog, and then jump to the target rather than returning. The improvements for CLR 4 came in two categories: Fewer restrictions on the optimized/easy cases and a solution for the non-easy cases. The end result is that on x64 you should see shorter call stacks in optimized code because the JIT generated more tail calls (don’t worry this optimization is turned off for debug code), and functional programmers (or any other compiler or language that relies on tail calls and uses the “tail.” prefix) no longer need to worry about stack overflows. The down side of course is that if you have to debug or profile optimized code, be prepared to deal with call stacks that look like they’re missing a few frames.
Reducing restrictions on the easy cases
The more often we can tail call, the more likely programs will benefit from the optimization. From the second link above you can see that there were a lot of restrictions on when the JIT could perform a tail call. We looked at each one of these and tried to see if we could reduce or eliminate them.
We reduced this restriction by also allowing a single pop opcode. This allowed the case where a method that returned void ended with a call to some other method that had a non-void return. The callee was obviously being invoked for its side-effect rather than its return value. In most cases the return value is just sitting in rax or xmm0, and does no harm by being left there, so in those cases, the tail call is legal and becomes an ‘easy’ case. However, note that having an explicit tail call, with the tail. prefix is still unverifiable in this situation, so in most cases a compiler or code generator should omit the tail. prefix and just rely on the JIT optimizing the call into a tail call.
We changed the JIT to recognize this case as well. Specifically the X64 calling convention dictates that for value types that don’t fit in a register (1,2,4, or 8 byte sized value types), that the caller should pass a ‘hidden’ argument called the return buffer. The CLR 2 JIT would pass a new return buffer to the callee, and then copy the data from the callee’s return buffer to the caller’s return buffer. The CLR 4 runtime now recognizes when it is safe to do so, and then just passes the caller’s return buffer into the callee as its return buffer. This means even when we don’t do a tail call, the code has at least one less copy, and it enables one more case to be ‘easy’!
The above restriction is now removed!
These have been partially removed. Specifically, we treat the caller as only having its fixed arguments and the callee as having all of its arguments. Then we apply all of the other rules. It is also worth mentioning that the CLR deviates from the native X64 calling convention by adding another ‘hidden’ argument called the vararg cookie. It stores the GC properties of the actual call (important for the variadic part). This hidden argument is also counted as part of the signature for the purposes of the other rules.
This part applies to all platforms. We were able to reduce the number of cases where the runtime refused a tail call due to security constraints. Mainly the runtime stopped caring so much about the exact assembly and instead looked at the security properties. If performing the call as a tail call would not cause an important stack frame to disappear from the call stack, it is now allowed. Previous to this change tail calls across modules or to unknown destinations (due to calli or callvirt) were rare, now they are significantly more common.
The above x64 restriction is now removed!
A solution for the non-easy cases
Previously even if the IL had the “tail.” prefix, if the call did not match the requirements to be one of the optimized easy tail calls cases, it was turned into a regular call. This is disastrous for recursive algorithms. In many cases this prevented F# programs from running to completion without getting stack overflows. So for CLR 4 we added an alternate code path that allowed tail call like properties for these specific scenarios, where it was needed, but it was not easy.
The biggest problem to overcome is the x64 calling convention. Because of the calling convention if a tail call callee needs more argument space than the caller has, the code is wedged between a rock and a hard place. The solution involved a little stack trickery. The JIT generates a mostly normal call instead of a tail call. However instead of calling the target directly it calls through a special helper which works some magic on the stack. It logically unwinds the stack to be as if the caller has returned. Then it adds a fake method to the stack as if it had been called, and then jumps to the callee. This fake method is called the TailCallHelper. It takes no arguments, and as far as unwind data and the calling convention are concerned calls the callee with no outgoing arguments, but has a dynamically resizing locals area (think _alloca for C/C++ programmers, or stackalloc for C# programmers). Thus this method is still caller-pops, but the tail call helper can dynamically change how much to pop based on a given callee.
This stack magic is not cheap. It is significantly slower than a normal call or a tail call. It also consumes a non-trivial amount of stack space. However the stack space is ‘recycled’ such that if you have a tail recursive algorithm, the first tail call that isn’t ‘easy’ will erect the TailCallHelper stack frame, and subsequent non-easy tail calls may need to grow that frame to accommodate all the arguments. Hopefully once the algorithm has gone through a full cycle of recursion the TailCallHelper stack frame has grown to the maximum sized needed by all the non-easy calls involved in the recursion, and then never grows, and thus prevents a stack overflow.
As you can see the way the non-easy case is handled is also not fast. Because of that, the JIT never uses the non-easy helper unless the IL requires it with the “tail.” prefix. This was also the driving force behind trying to reduce the number of restrictions on the ‘easy’ cases.
CLR Codegen Team
[Updated 6/25/2009 to make a few minor clarifications based on internal feedback]