Optimizing C++ Code : Dead Code Elimination

If you have arrived in the middle of this blog series, you might want instead to begin at the beginning.

This post examines the optimization called Dead-Code-Elimination, which I’ll abbreviate to DCE.  It does what it says: discards any calculations whose results are not actually used by the program.

Now, you will probably assert that your code calculates only results that are used, and never any results that are not used: only an idiot, after all, would gratuitously add useless code – calculating the first 1000 digits of pi, for example, whilst also doing something useful.  So when would the DCE optimization ever have an effect?

The reason for describing DCE so early in this series is that it could otherwise wreak havoc and confusion throughout our  exploration of other, more interesting optimizations: consider this tiny example program, held in a file called Sum.cpp:

int main() {
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
}

We are interested in how fast the loop executes, calculating the sum of the first billion integers.  (Yes, this is a spectacularly silly example, since the result has a closed formula, taught in high school.  But that's not the point.)
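For reference, that closed formula is n(n+1)/2.  Here is a small compile-time sketch of my own (not part of the original example, and it needs C++11 constexpr support, which arrives in compilers later than the VS 2012 used in this series) showing the value the loop should produce:

// Closed-form sum 1 + 2 + ... + n, evaluated at compile time.
constexpr long long closedFormSum(long long n) {
    return n * (n + 1) / 2;
}

// For n = 1,000,000,000 the expected answer is 500,000,000,500,000,000,
// which still fits comfortably in a 64-bit long long.
static_assert(closedFormSum(1000000000LL) == 500000000500000000LL,
              "unexpected value for the sum of the first billion integers");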

Build this program with the command:  CL /Od /FA Sum.cpp  and run with the command Sum.  Note that this build disables optimizations, via the /Od switch.  On my PC, it takes about 4 seconds to run.  Now try compiling optimized-for-speed, using CL /O2 /FA Sum.cpp.  On my PC, this version runs so fast there’s no perceptible delay.  Has the compiler really done such a fantastic job at optimizing our code?  The answer is no (but in a surprising way, also yes):

Let’s look first at the code it generates for the /Od case, stored into Sum.asm.  I have trimmed and annotated the text to show only the loop body:

       mov    QWORD PTR s$[rsp], 0                     ;; long long s = 0
       mov    QWORD PTR i$1[rsp], 1                    ;; long long i = 1
       jmp    SHORT $LN3@main  
$LN2@main:

       mov    rax, QWORD PTR i$1[rsp]                  ;; rax = i
       inc    rax                                      ;; rax += 1
       mov    QWORD PTR i$1[rsp], rax                  ;; i = rax 
$LN3@main:
       cmp    QWORD PTR i$1[rsp], 1000000000           ;; i <= 1000000000 ?
       jg     SHORT $LN1@main                          ;; no – we’re done
       mov    rax, QWORD PTR i$1[rsp]                  ;; rax = i
       mov    rcx, QWORD PTR s$[rsp]                   ;; rcx = s
       add    rcx, rax                                 ;; rcx += rax
       mov    rax, rcx                                 ;; rax = rcx
       mov    QWORD PTR s$[rsp], rax                   ;; s = rax
       jmp    SHORT $LN2@main                          ;; loop
$LN1@main:

The instructions look pretty much like you would expect.  The variable i is held on the stack at offset i$1 from the location pointed-to by the RSP register; elsewhere in the asm file, we find that i$1 = 0.  We use the RAX register to increment i.  Similarly, variable s is held on the stack (at offset s$ from the location pointed-to by the RSP register; elsewhere in the asm file, we find that s$ = 8).  The code uses RCX to calculate the running sum each time around the loop.

Notice how we load the value for i from its “home” location on the stack, every time around the loop; and write the new value back to its “home” location.  The same for the variable s.  We could describe this code as “naïve” – it’s been generated by a dumb compiler (i.e., one with optimizations disabled).  For example, it’s wasteful to keep accessing memory for every iteration of the loop, when we could have kept the values for i and s in registers throughout.

So much for the non-optimized code.  What about the code generated for the optimized case?  Let’s look at the corresponding Sum.asm for the optimized, /O2, build.  Again, I have trimmed the file down to just the part that implements the loop body, and the answer is:

                                                       ;; there’s nothing here!

Yes – it’s empty!  There are no instructions that calculate the value of s.  

Well, that answer is clearly wrong, you might say.  But how do we know the answer is wrong?  The optimizer has reasoned that the program does not actually make use of the answer for s at any point; and so it has not bothered to include any code to calculate it.  You can’t say the answer is wrong, unless you check that answer, right?

We have just fallen victim to, been mugged in the street by, and lost our shirt to, the DCE optimization.  If you don’t observe an answer, the program (often) won’t calculate that answer. 

You might view this effect in the optimizer as analogous, in a profoundly shallow way, to a fundamental result in Quantum Physics, often paraphrased in articles on popular science as “If a tree falls in the forest, and there’s nobody around to hear, does it make a sound?”

Well, let’s “observe” the answer in our program, by adding a printf of variable s, into our code, as follows:

#include <stdio.h>
int main() {
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
    printf("%lld ", s);
}
The /Od version of this program prints the correct answer, still taking about 4 seconds to run.  The /O2 version prints the same, correct answer, but runs much faster.   (See the optional section below for the value of “much faster” – in fact, the speedup is around 7X)

At this point, we have explained the main point of this blog post: always be very careful, in analyzing compiler optimizations and in measuring their benefit, that we are not being misled by DCE.  Here are four steps to help you notice, and ward off, the unasked-for attention of DCE:

  • Check that the timings have not suddenly improved by an order of magnitude
  • Examine the generated code (using the /FA switch)
  • If in doubt, add a strategic printf (see the sketch after this list)
  • Put the code of interest into its own .CPP file, separate from the one holding main.  This works, so long as you do not request whole-program optimization (via the /GL switch, that we’ll cover later)
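Here is a minimal sketch of that "strategic printf" idea.  The printf is the approach used in this post; the volatile store is an alternative "sink" of my own (not from the original example) that also forces the compiler to keep the calculation, since writes to a volatile object count as observable behavior:

#include <stdio.h>

volatile long long g_sink;                      // the optimizer must not discard writes to a volatile

int main() {
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;

    g_sink = s;                                 // alternative sink: this store cannot be eliminated
    printf("%lld\n", s);                        // or simply print the result - either way, s is observed
}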

However, we can wring several more interesting lessons from this tiny example.  I have marked these sections below as “Optional-1” through “Optional-4”.

  

Optional-1 : Codegen for /O2

Why is our /O2 version (including a printf to defeat DCE), so much faster than the /Od version?  Here is the code generated for the loop of the /O2 version, extracted from the Sum.asm file:

       xor    edx, edx
       mov    eax, 1 
       mov    ecx, edx
       mov    r8d, edx
       mov    r9d, edx
       npad   13
$LL3@main:
       inc    r9
       add    r8, 2
       add    rcx, 3
       add    r9, rax                           ;; r9  = 2  8 18 32 50 ...
       add    r8, rax                           ;; r8  = 3 10 21 36 55 ...
       add    rcx, rax                          ;; rcx = 4 12 24 40 60 ...
       add    rdx, rax                          ;; rdx = 1  6 15 28 45 ...
       add    rax, 4                            ;; rax = 1  5  9 13 17 ...
       cmp    rax, 1000000000                   ;; i <= 1000000000 ?
       jle    SHORT $LL3@main                   ;; yes, so loop back

Note that the loop body contains approximately the same number of instructions as the non-optimized build, and yet it runs much faster.  That’s mostly because the instructions in this optimized loop use registers, rather than memory locations.  As we all know, accessing a register is much faster than accessing memory.  Here are approximate latencies that demonstrate how memory access can slow your program to a snail’s pace:

Location     Latency
Register     1 cycle
L1           4 cycles
L2           10 cycles
L3           75 cycles
DRAM         60 ns

So, the non-optimized version is reading and writing to stack locations, which will quickly migrate into L2 (10 cycle access time) and L1 cache (4 cycle access time).  Both are slower than if all the calculation is done in registers, with access times around a single cycle.

But there’s more going on here to make the code run faster.  Notice how the /Od version increments the loop counter by 1 each time around the loop.  But the /O2 version increments the loop counter (held in register RAX) by 4 each time around the loop.  Eh?

The optimizer has unrolled the loop.  So it adds four items together on each iteration, like this:

s = (1 + 2 + 3 + 4) + (5 + 6 + 7 + 8) + (9 + 10 + 11 + 12) + (13 + . . .

By unrolling this loop, the test for loop-end is made every four iterations, rather than on every iteration – so the CPU spends more time doing useful work than forever checking whether it can stop yet!

Also, rather than accumulate the results into a single location, it has decided to use 4 separate registers, to accumulate 4 partial sums, like this:

RDX = 1 + 5 +  9 + 13 + ...  =  1,  6, 15, 28 ...
R9  = 2 + 6 + 10 + 14 + ...  =  2,  8, 18, 32 ...
R8  = 3 + 7 + 11 + 15 + ...  =  3, 10, 21, 36 ...
RCX = 4 + 8 + 12 + 16 + ...  =  4, 12, 24, 40 ...

At the end of the loop, it adds the partial sums, in these four registers, together, to get the final answer.

(I will leave it as an exercise for the reader to work out how the optimizer copes with a loop whose trip count is not a multiple of 4.)
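To make the transformation concrete, here is a hand-written C++ sketch of my own (an illustration of the idea, not the code the compiler actually emits) that performs the same 4-way unrolling with four partial sums, including a clean-up loop for a trip count that is not a multiple of 4:

#include <stdio.h>

int main() {
    const long long n = 1000000000;
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // four independent partial sums

    long long i = 1;
    for (; i + 3 <= n; i += 4) {                // main loop: four additions per iteration
        s0 += i;
        s1 += i + 1;
        s2 += i + 2;
        s3 += i + 3;
    }
    for (; i <= n; ++i) s0 += i;                // clean-up loop handles the leftover 0 to 3 items

    printf("%lld\n", s0 + s1 + s2 + s3);        // combine the partial sums into the final answer
}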

 

Optional-2 : Accurate Performance Measurements

Earlier, I said that the /O2 version of the program, without a printf inhibitor, “runs so fast there’s no perceptible delay”.  Here is a program that makes that statement more precise:

#include <stdio.h>
#include <windows.h>
int main() {
  LARGE_INTEGER start, stop;
  QueryPerformanceCounter(&start);
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
  QueryPerformanceCounter(&stop);
  double diff = stop.QuadPart - start.QuadPart;
  printf("%f", diff);
}

It uses QueryPerformanceCounter to measure the elapsed time.  (This is a “sawn-off” version of a high-resolution timer described in a previous blog).  There are lots of caveats to bear in-mind when measuring performance (you might want to browse a list I wrote previously), but they don’t matter for this particular case, as we shall see in a moment:

On my PC, the /Od version of this program prints a value for  diff of about 7 million somethings.  (The units of the answer don’t matter – just know that the number gets bigger as the program takes longer to run).  The /O2 version prints a value for diff of 0.  And the cause, as explained above, is our friend DCE.

When we add a printf to prevent DCE, the /O2 version runs in about 1 million somethings – a speedup of around 7X over the /Od version.
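If you prefer the result in seconds rather than "somethings", divide the tick difference by the counter frequency.  A minimal sketch of my own (QueryPerformanceFrequency reports the number of ticks per second):

#include <stdio.h>
#include <windows.h>

int main() {
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);           // ticks per second
    QueryPerformanceCounter(&start);

    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;

    QueryPerformanceCounter(&stop);
    double seconds = double(stop.QuadPart - start.QuadPart) / double(freq.QuadPart);
    printf("%lld computed in %f seconds\n", s, seconds);   // printing s also defeats DCE
}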

 

Optional-3 : x64 Assembler “widening”

If we look carefully again at the assembler listings in this post, we find a few surprises in the part of the program that initializes our registers:

       xor    edx, edx                          ;; rdx = 0     (64-bit!)
       mov    eax, 1                            ;; rax = i = 1 (64-bit!)
       mov    ecx, edx                          ;; rcx = 0     (64-bit!)
       mov    r8d, edx                          ;; r8  = 0     (64-bit!)
       mov    r9d, edx                          ;; r9  = 0     (64-bit!)
       npad   13                                ;; multi-byte nop alignment padding
$LL3@main:

Recall first that the original C++ program uses long long variables for both the loop counter and the sum.  In the VC++ compiler, these map onto 64-bit integers.  So we should expect the generated code to make use of x64’s 64-bit registers.
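If you want to confirm that mapping from your own code, a one-line check (my addition, not from the post) will do:

// Under the VC++ compiler (and every mainstream x64 compiler), long long is 64 bits wide,
// so variables of this type occupy a full x64 register.
static_assert(sizeof(long long) == 8, "expected long long to be 64 bits");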

We already explained, in a previous post, that xor reg, reg is a compact way to zero out the contents of reg.  But our first instruction applies xor to register EDX – the lower 32 bits of the RDX register.  The next instruction moves a 1 into EAX, the lower 32 bits of the RAX register.  This same pattern – moving a value into a 32-bit register – continues with the next 3 instructions.  Taken at face value, this would leave the higher 32 bits of each target register containing random junk.  And yet the loop body performs calculations on the full-width, 64-bit registers.  How can the answers possibly be right?

The answer is that the x64 instruction set, originally published by AMD, automatically zeroes the high 32 bits of a 64-bit destination register whenever an instruction writes a 32-bit result into it.  Here are the two applicable bullets from section 3.4.5 of the AMD64 Architecture Programmer's Manual, followed by a small illustration:

  • Zero-Extension of 32-Bit Results: 32-bit results are zero-extended into the high 32 bits of 64-bit GPR destination registers.
  • No Extension of 8-Bit and 16-Bit Results: 8-bit and 16-bit results leave the high 56 or 48 bits, respectively, of 64-bit GPR destination registers unchanged.
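That first rule is what lets the compiler implement a 64-bit assignment of a 32-bit value with a cheaper 32-bit instruction.  Here is a tiny C++ sketch of my own (not from the post); the zero-extension below is guaranteed by the language, and the hardware rule above is what makes a 32-bit mov a valid way to implement it:

#include <stdint.h>
#include <stdio.h>

int main() {
    uint64_t r = 0xDEADBEEFCAFEF00DULL;         // pretend the register holds 64 bits of junk
    uint32_t one = 1;
    r = one;                                    // the compiler can implement this with a 32-bit mov:
                                                // writing a 32-bit result zeroes the upper 32 bits of r
    printf("%llx\n", (unsigned long long)r);    // prints 1 - no junk survives in the high bits
}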

Finally, notice the npad 13 instruction (really a pseudo-operation – a directive to the assembler).  This ensures the next instruction – which is the start of the loop body – will be aligned in memory on an address that’s a multiple of 16 bytes.  This improves performance (sometimes, and on some micro-architectures).

  

Optional-4 : printf versus std::cout

You might ask why I used C's printf function, rather than C++ std::cout as a way to defeat DCE in the above experiment.  Try it and see.  Both work ok, but the asm file generated by the latter is much bigger, and therefore more difficult to navigate around: 0.7 Mbytes compared with 1.7 Kbytes.

Comments

  • So the compiler doesn't turn this computation into a constant?

  • Thanks, very nice write-up, especially liked the side-by-side code explanations and the optional parts :-)

    I've also experienced rather non-trivial blow-up of the generated code (size & complexity) with std::cout :-(

    On the other hand, chrono seems to be fine, here's a version of "Optional-2" using it: tinyurl.com/gcc-chrono-asm

    // BTW, don't forget to print out "s"! ;-)

    // If the above link doesn't work, here's the source: http://ideone.com/I8mEe4

    That being said, there are some issues with support for high_resolution_clock (it should use QueryPerformanceCounter, but it's not): connect.microsoft.com/.../719443

    // However, in this example I'm using steady_clock anyway.

  • "At this point, we have explained the main point of this blog post: always be very in analyzing compiler optimizations, ..." - always be very what?

  • Regarding "Accurate Performance Measurements" - AFAIR there's another way such measurements can fail: if the compiler moves the for loop after the second QueryPerformanceCounter, I think I've seen VC++ doing that once.

    "since the result has a closed formula, taught in High school"

    It looks like for a small n the compiler knows the formula and replaces the whole loop with a constant. Well, I assume that what really happens is that it doesn't know the formula but tries to evaluate the loop and gives up after a small number of iterations :)

  • When I saw the title of this blog post, my initial thoughts were that it was going to discuss what I think is usually referred to as dead code _stripping_.

    So even though your post was on elimination, I'm going to do a partial thread hijack and talk about dead code stripping :)

    Some background (from another Microsoft employee): blogs.msdn.com/.../10383017.aspx

    Basically, if I compile a cpp file with 50 functions in it, and I only use one of those functions VC doesn't throw out the unused functions.  This can really add up to a lot of bloat.  If you use a static library, you get some amount of dead code elimination on a per .obj file basis, but if you have the same dreaded 50 function cpp, then you still have issues.

    I've even seen this as an issue with the static C runtime.  I've used the static C runtime in a program that made no use of floating point operations, yet uncalled floating point functions were left in my binary.

    Someone is going to mention /OPT:REF and /OPT:ICF.  Well, those don't strip everything.  With vc2012 and the static C runtime, the cpp program "int main() {return 42;}" still has references to _aulldvrm, LeadupVec, memcpy, and a few others, even though nothing uses them (at least transitively).

    So... what are the chances of this behavior getting improved?  Ideally, by default for /O2, but if not, with extra flags, or at least some warning flag that could be set on the linker.

  • There's a related optimisation called partially dead code motion. I don't know if VC++ does this, but it's worth mentioning because it can cause a similar kind of surprise when benchmarking algorithms.

    It happens when you have code like this:

    void
    example(bool b)
    {
        int s = something_expensive();
        if (b) {
            std::cout << s << '\n';
        }
    }

    Here, s is not dead. However, it's only used if b is true, so in a sense it's "partially" dead. So it would make more sense to move s into the condition:

    if (b) {
        int s = something_expensive();
        std::cout << s << '\n';
    }

    The compiler has to move the code rather than simply delete it. Hence, it's partially dead code motion rather than elimination.

  • Ben Craig look at /Gy or whatever it is that puts functions into their own little unit.

    /Gy[-] separate functions for linker

    Been a while since I did that sort of caring since few of my .cpp files have the size they once did.  Yeah, enable function level linking in the IDE code generation page.

  • Man of LalalMancha:

    The build of my program that included unused _aulldvrm, LeadupVec, etc... did have /Gy.  It is possible that the static C-runtime is built without that flag, and that may be the cause of the bloat, but I don't know for sure.

  • Why does the compiler set rdx to 0 with xor edx, edx, but rcx, r8, and r9 with mov instructions?  Shouldn't it set those registers with the xor trick too?

  • @pseudonym - in your example, the compiler also has to prove that something_expensive() has no side-effects in order to make that optimization.  Otherwise it's changing the observable behavior of the program, which is a no-no.

  • @blah

    It does, but only up to a modest loop count.  Because the loop body is so small (just one instruction) the optimizer unrolls the loop N-fold.  A subsequent "copy propagation" phase calculates the final answer.  

    Using VS 2012, you can verify that for a loop of 1 thru 10, the optimizer calculates the answer at compile-time.  Anything bigger, and it generates a loop that executes at runtime.

  • @MattPD

    Yes, std::chrono::steady_clock works.  But I believe it uses the same low resolution as system_clock.  For this example, the poor resolution doesn't matter, of course.  (And yes, Stephan has that fix, so that the high_resolution_clock uses QueryPerformanceCounter.)

  • @Casey

    Yep - should say "careful" - I've fixed the text - thanks.

  • @MattPD

    Also note – I am using VS 2012 throughout this blog, so Matt's code won't work as-is.  Specifically, VS 2012 does not support the C++11 type alias feature ("using clock = steady_clock").  The workaround is easy.  (I believe type alias is supported in the VS 2013 preview, but I've not checked.)

  • @Mike

    The optimizer will only move code around if it can prove that doing so does not change observable behavior (as CarlD points out, for another, similar point).  In this case, even whole-program optimization won't see the body of QueryPerformanceCounter.  So, for example, it cannot prove that QPC will not throw an exception the second time it is called.  Therefore, it would be unsafe to move the second QPC call above the loop.

    "since the result has a closed formula, taught in High school" - yep: 1 + 2 + ... + n = n * (n + 1) / 2 (number of terms times their average size)

    Yes, the optimizer calculates the answer, at compile-time, for small loops - see reply to @blah.
