What does the optimize switch do?

I was asked recently exactly what optimizations the C# compiler performs when you specify the optimize switch. Before I answer that, I want to make sure that something is perfectly clear. The compiler’s “usage” string is not lying when it says:

/debug[+|-]     Emit debugging information
/optimize[+|-]  Enable optimizations

Emitting debug information and optimizing the generated IL are orthogonal; they have no effect on each other at all (*). The usual thing to do is to have debug on and optimizations off, or vice versa, but the other two combinations are perfectly legal. (And while I’m at it, turning debug information generation on does not also do /d:DEBUG; that’s also orthogonal. Again, the sensible thing to do is to keep /debug and /d:DEBUG in sync, but you do not have to; you might want to debug the optimized version without assertions and we’re not going to stop you.)
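
For example, every combination of the two switches (and of /d:DEBUG) is accepted; these hypothetical command lines all compile, with Program.cs standing in for whatever source file you have:

csc /debug+ /optimize- /d:DEBUG Program.cs
csc /debug- /optimize+ Program.cs
csc /debug+ /optimize+ Program.cs
csc /debug- /optimize- /d:DEBUG Program.cs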

From this point on, when I say “emitting” I mean “producing metadata”, and when I say “generating” I mean “producing IL”.

So, first off, there are some optimizations that the compiler always performs, regardless of the optimize flag.

The language semantics require us to do constant folding because we need to detect that this is an error:

switch(x) {
case 2:
case 1+1:   // error: "1+1" folds to 2, which duplicates the "case 2:" label
    break;
}

The compiler will replace usages of constant fields with the constant, but does not perform any more sophisticated constant propagation, even though the compiler has a reachability analyzer.
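
For illustration, here is a minimal sketch (the class and field names are hypothetical): a use of a constant field is compiled as though the literal value had been written at the use site, and the constant folder then folds the resulting expression.

class Physics
{
    public const int SpeedOfLight = 299792458;
}

class Demo
{
    // compiled as though you had written: int doubled = 599584916;
    int doubled = Physics.SpeedOfLight * 2;
}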

Speaking of which, there are two kinds of reachability analysis we perform for dead code elimination. First, there is the “according to spec” reachability. The specification calls out that reachability analysis consumes compile-time constants. This is the reachability analysis we use to give “unreachable code” warnings, to determine whether local variables are definitely assigned on all code paths, and to determine whether the end point of a non-void method is reachable. If this first pass determines that code is unreachable then we never generate IL for it. For example, this is optimized away:

if (false) M();

But if the expression involves non-compile-time constants then this first form of reachability analysis would tell you that the call to N here is reachable, and therefore that this is a definite-assignment error, because y is used without being definitely assigned:

int x = M();
int y;
if (x * 0 != 0) N(y);

If you fixed that up so that y was definitely assigned, then the first-pass reachability analyzer would NOT trim this code because it still thinks that the call is reachable.
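
For example, here is a sketch of that fix-up: y is now definitely assigned, so there is no error, yet the first-pass analyzer still believes the call to N is reachable and does not trim it.

int x = M();
int y = 0;
if (x * 0 != 0) N(y);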

But clearly you and I know that the call is not reachable, because we know more about arithmetic than the specified reachability analyzer. We perform a second pass of reachability analysis during code generation which is much smarter about these things. We can do that because the second pass only affects codegen; it does not affect language semantics, error reporting, and so on.

The existence of that second pass implies that we do simple arithmetic optimizations on expressions which are only partially constant. For example, if you have a method M that returns an integer, then code like

if (M() * 0 == 0) N();

can be generated as though you’d written just:

M();
N();

We have lots of simple number and type algebra optimizers that look for things like adding zero to an integer, multiplying integers by one, using “null” as an argument to “is” or “as”, concatenation of literal strings, and so on. The expression optimizations always happen, whether you’ve set the optimize flag or not; whether basic blocks that are determined to be unreachable after those optimizations are trimmed depends on the flag.
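
A few of those rewrites, sketched in one place (M is a stand-in for any method returning an integer, as in the earlier examples):

static int M() { return 42; }

static void Examples()
{
    int x = M();
    int a = x + 0;                   // generated as though you had written: int a = x;
    int b = x * 1;                   // generated as though you had written: int b = x;
    string s = "Hello, " + "world";  // folded into the single literal "Hello, world"
}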

We also perform some small optimizations on some call sites and null checks. (Though not as many as we could.) For example, suppose you have a non-virtual instance method M on a reference type C, and a method GetC that returns a C. If you say GetC().M() then we generate a callvirt instruction to call M. Why? Because it is legal to generate a callvirt for an instance method on a reference type, and callvirt automatically inserts a null check. The non-virtual call instruction does not do a null check; we'd have to generate extra code to check whether GetC returns null. So that's a small code size optimization. But we can do even better than that; if you have (new C()).M(), we generate a call instruction because we know that the result of the "new" operator is never null. That gives us a small time optimization because we can skip the nanosecond it takes to do the null check. Like I said, it's a small optimization.
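
Here is a sketch of those two call sites; C, M and GetC are placeholder names matching the description above.

class C
{
    public void M() { }                     // non-virtual instance method on a reference type
}

class Program
{
    static C GetC() { return new C(); }

    static void Main()
    {
        GetC().M();      // generated as callvirt; its built-in null check covers a null return value
        (new C()).M();   // generated as call; "new" never produces null, so no null check is needed
    }
}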

The /optimize flag does not change a huge amount of our emitting and generation logic. We try to always generate straightforward, verifiable code and then rely upon the jitter to do the heavy lifting of optimizations when it generates the real machine code. But we will do some simple optimizations with that flag set. For example, with the flag set:

  • Expressions which are determined to be only useful for their side effects are turned into code that merely produces the side effects.
  • We omit generating code for things like int foo = 0; because we know that the memory allocator will initialize fields to default values. (See the sketch after this list.)
  • We omit emitting and generating empty static class constructors. (Which typically happens if the static constructor sets all the fields to their default values and the previous optimization eliminated all of them.)
  • We omit emitting a field for any hoisted locals that are unused in an iterator block. (This includes the case where the local in question is used only inside an anonymous function in the iterator block, in which case it is going to be hoisted into a field of the closure class for the anonymous function. No need to hoist it twice if we don’t need to.)
  • We attempt to minimize the number of local variable and temporary slots allocated. For example, if you have:

for (int i = …)  {…}
for (int i = …) {…}

then the compiler could generate code to re-use the local variable storage reserved for i when the second i comes along. (We eschew this optimization if the locals have different names because then it gets hard to emit sensible debug info, which we still want to do even for the optimized build. However, the jitter is free to perform this optimization if it wishes to.)

  • Also, if you have a local which is never used at all, then there is no storage allocated for it if the flag is set.
  • Similarly, the compiler is more aggressive about re-using the unnamed temporary slots sometimes used to store results of subexpression calculations.
  • Also, with the flag set the compiler is more aggressive about generating code that throws away “temporary” values quickly for things like controlling variables of switch statements, the condition in an “if” statement, the value being returned, and so on. In the non-optimized build these values are treated as unnamed local variables, loaded from and stored to specific locations. In the optimized build they can often be just kept on the stack proper.
  • We eliminate pretty much all of the “breakpoint convenience” no-ops.
  • If a try block is empty then clearly the catch blocks are not reachable and can be trimmed. (Finally blocks of empty tries are preserved as protected regions because they have unusual behaviour when faced with certain exceptions; see the comments for details.)
  • If we have an instruction which branches to LABEL1, and the instruction at LABEL1 branches to LABEL2, then we rewrite the first instruction as a branch straight to LABEL2. Same with branches that go to returns.
  • We look for “branch over a branch” situations. For example, here we go to LABEL1 if condition is false, otherwise we go to LABEL2.

    brfalse condition, LABEL1
    br LABEL2
    LABEL1: somecode

    Since we are simply branching over another branch, we can rewrite this as simply "if condition is true, go to LABEL2":

    brtrue condition, LABEL2
    somecode
  • We look for “branch to nop” situations. If a branch goes to a nop then you can make it branch to the instruction after the nop.
  • We look for “branch to next” situations; if a branch goes to the next instruction then you can eliminate it.
  • We look for two return instructions in a row; this happens sometimes and obviously we can turn it into a single return instruction.
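
For instance, here is a minimal sketch of the field-initializer elimination mentioned earlier in the list (the class and field names are hypothetical):

class Widget
{
    int count = 0;        // redundant: the allocator zero-initializes fields, so with
                          // /optimize+ no code is generated for this assignment
    string name = null;   // likewise; reference-type fields start out null
    int limit = 100;      // not redundant: this initializer is still generated
}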

That’s pretty much it. These are very straightforward optimizations; there’s no inlining of IL, no loop unrolling, no interprocedural analysis whatsoever. We let the jitter team worry about optimizing the heck out of the code when it is actually spit into machine code; that’s the place where you can get real wins.


(*) A small lie. There is some interaction between the two when we generate the attributes that describe whether the assembly is Edit’n’Continuable during debugging.

  • An awful lot of these optimizations (maybe all) are language-independent.  Do you share the optimizer with VB?  Indeed, it looks like you could write an optimizer that reads IL and writes better IL.  Other than compilation efficiency, is there any reason this isn't done?

    You are not the first person to notice this. :-) We could do that. Or, one of the things we are considering for future versions of C# and VB is to have the compilers output a common "semantic tree" format which can be fed into a common metadata emitter and/or IL generator. However, it's not yet clear whether this would be a net win from a cost-vs-benefit analysis. We have two IL generators that work perfectly well; why throw away much of that existing, debugged code and spend a few months writing a common infrastructure in order to achieve a false economy of shared code? It seems potentially penny-wise, pound-foolish. We'll make the decision that is right for the business case. -- Eric

  • I apologize for a comment about the format of the post, rather than about the post itself, but...

    Very often in this blog, I find that once text formatted as "code" appears, all of the text following that remains formatted as "code" rather than reverting back to the normal format.

    I looked at the HTML source, and I can see that although most of the HTML is valid XML, with each <p> tag having a corresponding </p> close tag, when the <span class="code"></span> shows up, the <p> element is NOT properly closed.

    In this particular post, the example would be the first section of text where the compiler options are described.  The entire paragraph is contained by a <span class="code"> element, and there is an open <p> tag, but no close <p> tag.

    Whose bug is this?  I'm not sure.  The DOCTYPE for the page isn't XHTML, so technically a <p> tag need not have a close tag.  On the other hand, when a <p> tag doesn't have a close tag, I believe that the <p> element is considered to continue until the next block element, which in this case is the next <p> element.  Because the <p> tag for the compiler options text is inside the <span class="code"> element, this means that logically, the <p> element and <span> element overlap, which isn't allowed even in HTML (unclosed elements are allowed, but otherwise the same hierarchical rules apply).

    To further confuse the issue, some browsers figure it out (e.g. Opera) and others don't (e.g. Safari).  Presumably in the case where the text is displayed as expected, the HTML parser auto-closes any unclosed elements contained within an element that is being closed.

    Anyway, I thought I'd mention this, in the hopes that it's simple to fix the problem. I have no idea how the HTML is actually being generated, and I suppose if it's auto-generated by some utility, it might be hard to request and/or have implemented a bug-fix so that the <p> element is correctly closed even when contained within a <span> element. But, if it's being hand-generated, or generated by something that is easily fixed, it sure would be helpful to those of us using certain browsers.  :)

    Thanks!

  • @pete.d: I think it parses it as <span> <p>code</p> <p>text</p> ..... and a </span> at the end.

    Now about the post,

    Isn't callvirt more 'complicated', because it needs to use the right method in the virtual whatchamacallit-tables (vtables?), thus making the optimization of using callvirt instead of call counter-productive?

    (I assume the answer is 'no, it's not', otherwise you wouldn't make this optimization, but I'd like to know why not)

    [[ "callvirt" takes a method token as its argument; the jitter looks up the method token in the metadata tables to see if it is actually a virtual method. If it is, then it generates a null check and a vtable call, if not then it generates a null check and a direct call. -- Eric ]]

  • "* We omit generating code for things like int foo = 0; because we know that the memory allocator will initialize fields to default values."

    That's actually an interesting point... If this kind of instruction is omitted, then why does the compiler require that all local variables be initialized before use? I mean, I would probably do it anyway because it's better to be explicit, but it seems strange that we are compelled to write code that's going to be eliminated anyway...

    Because using uninitialized locals accidentally is a common source of bugs. -- Eric

  • I've always developed using debug, no optimisation and deployed w/o debug and optimised (because it's the 'Right Thing To Do'). However, I've never really tested the performance characteristics, since there was basically no cost in doing the optimisation.

    However, reading your post, it sounds like the /optimise flag doesn't do much optimising at all. Is there much of a difference for the majority of cases? If so, are the gains spread across the scenarios, or is there one or two optimisations that give the majority of the speedups?

    Interesting post, by the way.

  • Configurator....

    I think it quite clearly explained that the JIT makes the determination if "CallVirt" is actually "virtual" (in the sense of using the "v-table") or is a direct call.

    Travis....

    Hopefully you "over-simpified"...Shipping a different Build configuration [noDebug/Optimizd] than the one you subject to all of your testing [Debug/NoOptimize] is dangerous. Years back I established a policy of doing all work in the "shipping" configuration [I do not call it "release" since that is just a name, and does not describe the actual settings], and only switching to a "diagnostic" [again I do not call it "Debug" for the same reason] if there is a bug that I simply can not pin down in a reasonable time.

    The amount of grief this has saved me over the years [I did the same thing with C++ and with C before that] is priceless.

  • [quote]That’s pretty much it. These are very straightforward optimizations; there’s no inlining of IL, no loop unrolling, no interprocedural analysis whatsoever. We let the jitter team worry about optimizing the heck out of the code when it is actually spit into machine code; that’s the place where you can get real wins.[/quote]

    I would have thought that it would make more of a difference to do these types of optimizations in the IL - the less the jitter needs to do, the faster things can start up.  These optimizations can't be free.

    But the C# compiler knows nothing about the target environment. The optimizations you want to make on compact framework running on a cell phone are very different than the optimizations you want to make on a 64 bit server. So let the jitter worry about it; the jitter knows how to optimize for the current environment. -- Eric

  • WOW!

    > We attempt to eliminate generation of "protected" regions. For instance, if a try block is empty then clearly the catch blocks are not reachable and the finally can just be an ordinary code region.

    This is dangerous! In fact that is how some people create "protected" regions in the first place.

    Code that is executed within finally block will not be interrupted by Abort exception for example.

    You are correct. I was misremembering. We eliminate catch blocks, but the finally blocks live on as finally blocks. I've corrected the text; thanks for the note. -- Eric

  • "We attempt to eliminate generation of "protected" regions. For instance, if a try block is empty then clearly the catch blocks are not reachable and the finally can just be an ordinary code region."

    I've seen code like this in Microsoft's own implementation of the ASP.NET Cache:

    try { }
    finally
    {
        Monitor.Enter(this._lock);
        acquiredLock = true;
    }

    Does this optimization mean that the above code might not run the "finally" block in a protected region?

    See above. -- Eric

  • +1 ThreadAbortExceptions do not interrupt finally blocks so don't replace finally blocks with ordinary code regions.

  • >Or, one of the things we are considering for future versions of C# and VB is to have the compilers output a common "semantic tree" format which can be fed into a common metadata emitter and/or IL generator.

    Now that you mention it, I remembered that .NET 4.0 beta1 has a whole lot of goodies in System.Linq.Expressions - it's owned by the C# team, isn't it? My first impression after looking at it is that it actually pretty much covers the "semantic tree" for a single method, including things such as lambdas with all the nitty-gritty done behind the scenes. The only thing that seems to be missing in terms of C# feature coverage is iterator methods. And it can generate output into a MethodBuilder too, so it's almost a compiler writer's toolbox...

    ... almost - because it can only generate static methods (why oh why?), and seemingly cannot generate a bunch of cross-calling methods at a time.

    To generate non-static methods we need to generate them on some class. What class? We only have expression trees and statement trees; what we need are declaration trees. I would very much like to have the ability to represent types as trees. Then we would have a "compiler writer's toolbox" as you say. Whether we will get to that point or not, I don't know; guessing would be speculation about the feature sets of unannounced projects that have no budgets at this time. That kind of guessing is seldom productive. -- Eric

  • I admit, I don't get the concern about ThreadAbortException. Personally, I would have been happy if Thread.Abort(), Thread.Suspend(), and Thread.Resume() had just been left out of .NET altogether. There are so many other ways that Thread.Abort() can produce unpredictable results, I can't say that I'd be all that worried about the code in a finally block winding up unprotected.

    Now, if there's some _other_ important benefit to protected code regions I'm overlooking, then perhaps that's a reason for concern. But I'm surprised anyone, never mind three people, care about "correctness" in the context of a ThreadAbortException.  IMHO, by definition any code using ThreadAbortException is incorrect, regardless of what the compiler's doing.  :p

    I don't like thread aborts either. But wishing doesn't make them go away. We've got to write a code generator for a language that targets the environment we're given, and the environment we're given has somewhat goofy thread abort semantics. -- Eric

  • @Thomas

    Fields are initialized to default values, but I'm not sure the same is guaranteed to happen for local variables, since they are allocated to the stack.

    If you ever wish to become sure, then I encourage you to read the CLI spec, Partition II, section 24.4.4 and Partition III, section 1.8.1.1. -- Eric

  • Eric was clearly in error about the try/finally issue.  The runtime is guaranteed to execute all finally blocks before a thread is aborted, so the C# compiler cannot just throw away the protected region markers usually placed around a finally block.

  • @thomas

    I think Eric is referring to the removal of the initializer for the member variable foo, not the local variable foo.
