Optimizing C++ Code : Dead Code Elimination

If you have arrived in the middle of this blog series, you might want instead to begin at the beginning.

This post examines the optimization called Dead-Code-Elimination, which I’ll abbreviate to DCE.  It does what it says: discards any calculations whose results are not actually used by the program.
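
As a minimal illustration (this fragment is mine, not part of the example program below), consider:

int f(int x) {
    int unused = x * 42;    // result never read anywhere: dead code
    return x + 1;           // after DCE, only this line generates instructions
}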

Now, you will probably assert that your code calculates only results that are used, and never any results that are not used: only an idiot, after all, would gratuitously add useless code – calculating the first 1000 digits of pi, for example, whilst also doing something useful.  So when would the DCE optimization ever have an effect?

The reason for describing DCE so early in this series is that it could otherwise wreak havoc and confusion throughout our  exploration of other, more interesting optimizations: consider this tiny example program, held in a file called Sum.cpp:

int main() {
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
}

We are interested in how fast the loop executes, calculating the sum of the first billion integers.  (Yes, this is a spectacularly silly example, since the result has the closed formula n(n+1)/2, taught in high school – for n = 1000000000 it gives 500000000500000000.  But that's not the point.)

Build this program with the command:  CL /Od /FA Sum.cpp  and run with the command Sum.  Note that this build disables optimizations, via the /Od switch.  On my PC, it takes about 4 seconds to run.  Now try compiling optimized-for-speed, using CL /O2 /FA Sum.cpp.  On my PC, this version runs so fast there’s no perceptible delay.  Has the compiler really done such a fantastic job at optimizing our code?  The answer is no (but in a surprising way, also yes):

Let’s look first at the code it generates for the /Od case, stored into Sum.asm.  I have trimmed and annotated the text to show only the loop body:

       mov    QWORD PTR s$[rsp], 0                     ;; long long s = 0
       mov    QWORD PTR i$1[rsp], 1                    ;; long long i = 1
       jmp    SHORT $LN3@main  
$LN2@main:

       mov    rax, QWORD PTR i$1[rsp]                  ;; rax = i
       inc    rax                                      ;; rax += 1
       mov    QWORD PTR i$1[rsp], rax                  ;; i = rax 
$LN3@main:
       cmp    QWORD PTR i$1[rsp], 1000000000           ;; i <= 1000000000 ?
       jg     SHORT $LN1@main                          ;; no – we’re done
       mov    rax, QWORD PTR i$1[rsp]                  ;; rax = i
       mov    rcx, QWORD PTR s$[rsp]                   ;; rcx = s
       add    rcx, rax                                 ;; rcx += rax
       mov    rax, rcx                                 ;; rax = rcx
       mov    QWORD PTR s$[rsp], rax                   ;; s = rax
       jmp    SHORT $LN2@main                          ;; loop
$LN1@main:

The instructions look pretty much like you would expect.  The variable i is held on the stack at offset i$1 from the location pointed-to by the RSP register; elsewhere in the asm file, we find that i$1 = 0.  We use the RAX register to increment i.  Similarly, variable s is held on the stack (at offset s$ from the location pointed-to by the RSP register; elsewhere in the asm file, we find that s$ = 8).  The code uses RCX to calculate the running sum each time around the loop.

Notice how we load the value for i from its “home” location on the stack, every time around the loop; and write the new value back to its “home” location.  The same for the variable s.  We could describe this code as “naïve” – it’s been generated by a dumb compiler (i.e., one with optimizations disabled).  For example, it’s wasteful to keep accessing memory for every iteration of the loop, when we could have kept the values for i and s in registers throughout.

So much for the non-optimized code.  What about the code generated for the optimized case?  Let’s look at the corresponding Sum.asm for the optimized, /O2, build.  Again, I have trimmed the file down to just the part that implements the loop body, and the answer is:

                                                       ;; there’s nothing here!

Yes – it’s empty!  There are no instructions that calculate the value of s.  

Well, that answer is clearly wrong, you might say.  But is it?  The optimizer has reasoned that the program does not actually make use of the answer for s at any point; and so it has not bothered to include any code to calculate it.  You can't call the answer wrong unless you actually observe that answer, right?

We have just fallen victim to, been mugged in the street by, and lost our shirt to, the DCE optimization.  If you don’t observe an answer, the program (often) won’t calculate that answer. 

You might view this effect in the optimizer as analogous, in a profoundly shallow way, to a fundamental result in Quantum Physics, often paraphrased in articles on popular science as “If a tree falls in the forest, and there’s nobody around to hear, does it make a sound?”

Well, let’s “observe” the answer in our program, by adding a printf of variable s, into our code, as follows:

#include <stdio.h>
int main() {
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
    printf("%lld ", s);
}
The /Od version of this program prints the correct answer, still taking about 4 seconds to run.  The /O2 version prints the same, correct answer, but runs much faster.   (See the optional section below for the value of “much faster” – in fact, the speedup is around 7X)

At this point, we have reached the main lesson of this blog post: when analyzing compiler optimizations, and measuring their benefit, always be very careful that you are not being misled by DCE.  Here are four steps to help notice, and ward off, the unasked-for attention of DCE:

  • Check that the timings have not suddenly improved by an order of magnitude
  • Examine the generated code (using the /FA switch)
  • If in doubt add a strategic printf
  • Put the code of interest into its own .CPP file, separate from the one holding main (see the sketch just below this list).  This works, so long as you do not request whole-program optimization (via the /GL switch, which we'll cover later)
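
Here is a minimal sketch of that last suggestion (the file names, and the way main consumes the result, are my own illustration).  While compiling Sum.cpp, the compiler cannot see whether any other .CPP file uses the result of sum, so it must keep the loop:

// Sum.cpp - compiled separately; the compiler must assume sum's result is needed
long long sum(long long n) {
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;
    return s;
}

// Main.cpp
long long sum(long long n);              // defined in Sum.cpp
int main() {
    return (int)(sum(1000000000) & 1);   // consume the result
}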

However, we can wring several more interesting lessons from this tiny example.  I have marked these sections below as “Optional-1” through “Optional-4”.

  

Optional-1 : Codegen for /O2

Why is our /O2 version (including a printf to defeat DCE), so much faster than the /Od version?  Here is the code generated for the loop of the /O2 version, extracted from the Sum.asm file:

       xor    edx, edx
       mov    eax, 1 
       mov    ecx, edx
       mov    r8d, edx
       mov    r9d, edx
       npad   13
$LL3@main:
       inc    r9
       add    r8, 2
       add    rcx, 3
       add    r9, rax                           ;; r9  = 2  8 18 32 50 ...
       add    r8, rax                           ;; r8  = 3 10 21 36 55 ...
       add    rcx, rax                          ;; rcx = 4 12 24 40 60 ...
       add    rdx, rax                          ;; rdx = 1  6 15 28 45 ...
       add    rax, 4                            ;; rax = 1  5  9 13 17 ...
       cmp    rax, 1000000000                   ;; i <= 1000000000 ?
       jle    SHORT $LL3@main                   ;; yes, so loop back

Note that the loop body contains approximately the same number of instructions as the non-optimized build, and yet it runs much faster.  That’s mostly because the instructions in this optimized loop use registers, rather than memory locations.  As we all know, accessing a register is much faster than accessing memory.  Here are approximate latencies that demonstrate how memory access can slow your program to a snail’s pace:

Location    Latency
--------    ---------
Register    1 cycle
L1          4 cycles
L2          10 cycles
L3          75 cycles
DRAM        60 ns

So, the non-optimized version is reading and writing to stack locations, which will quickly migrate into L2 (10 cycle access time) and L1 cache (4 cycle access time).  Both are slower than if all the calculation is done in registers, with access times around a single cycle.

But there’s more going on here to make the code run faster.  Notice how the /Od version increments the loop counter by 1 each time around the loop.  But the /O2 version increments the loop counter (held in register RAX) by 4 each time around the loop.  Eh?

The optimizer has unrolled the loop.  So it adds four items together on each iteration, like this:

s = (1 + 2 + 3 + 4) + (5 + 6 + 7 + 8) + (9 + 10 + 11 + 12) + (13 + . . .

By unrolling this loop, the test for loop-end is made every four iterations, rather than on every iteration – so the CPU spends more time doing useful work than forever checking whether it can stop yet!

Also, rather than accumulate the results into a single location, it has decided to use 4 separate registers, to accumulate 4 partial sums, like this:

RDX = 1 + 5 +  9 + 13 + ...  =  1,  6, 15, 28 ...
R9  = 2 + 6 + 10 + 14 + ...  =  2,  8, 18, 32 ...
R8  = 3 + 7 + 11 + 15 + ...  =  3, 10, 21, 36 ...
RCX = 4 + 8 + 12 + 16 + ...  =  4, 12, 24, 40 ...

At the end of the loop, it adds the partial sums, in these four registers, together, to get the final answer.
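
In C++ terms, the transformed loop behaves roughly like this sketch (my reconstruction, with invented variable names – the compiler, of course, performs this rewrite on its internal representation):

long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // four independent accumulators
for (long long i = 1; i <= 1000000000; i += 4) {
    s0 += i;                                // plays the role of RDX
    s1 += i + 1;                            // R9
    s2 += i + 2;                            // R8
    s3 += i + 3;                            // RCX
}
long long s = s0 + s1 + s2 + s3;            // combine the partial sums

Using four independent accumulators, rather than one, also shortens the chain of dependent additions, giving a modern out-of-order CPU more scope to overlap work.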

(I will leave it as an exercise for the reader to work out how the optimizer copes with a loop whose trip count is not a multiple of 4.)

 

Optional-2 : Accurate Performance Measurements

Earlier, I said that the /O2 version of the program, without a printf inhibitor, “runs so fast there’s no perceptible delay”.  Here is a program that makes that statement more precise:

#include <stdio.h>
#include <windows.h>
int main() {
  LARGE_INTEGER start, stop;
  QueryPerformanceCounter(&start);
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
  QueryPerformanceCounter(&stop);
  double diff = stop.QuadPart - start.QuadPart;
  printf("%f", diff);
}

It uses QueryPerformanceCounter to measure the elapsed time.  (This is a “sawn-off” version of a high-resolution timer described in a previous blog).  There are lots of caveats to bear in-mind when measuring performance (you might want to browse a list I wrote previously), but they don’t matter for this particular case, as we shall see in a moment:

On my PC, the /Od version of this program prints a value for  diff of about 7 million somethings.  (The units of the answer don’t matter – just know that the number gets bigger as the program takes longer to run).  The /O2 version prints a value for diff of 0.  And the cause, as explained above, is our friend DCE.

When we add a printf to prevent DCE, the /O2 version runs in about 1 million somethings – a speedup of about 7X.
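
(If you want the elapsed time in seconds, rather than raw "somethings", divide the tick difference by the counter frequency – a minimal sketch, reusing the start and stop variables from the program above:)

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);        // ticks per second
double seconds = (stop.QuadPart - start.QuadPart) / double(freq.QuadPart);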

 

Optional-3 : x64 Assembler “widening”

If we look carefully again at the assembler listings in this post, we find a few surprises in the part of the program that initializes our registers:

       xor    edx, edx                          ;; rdx = 0     (64-bit!)
       mov    eax, 1                            ;; rax = i = 1 (64-bit!)
       mov    ecx, edx                          ;; rcx = 0     (64-bit!)
       mov    r8d, edx                          ;; r8  = 0     (64-bit!)
       mov    r9d, edx                          ;; r9  = 0     (64-bit!)
       npad   13                                ;; multi-byte nop alignment padding
$LL3@main:

Recall first that the original C++ program uses long long variables for both the loop counter and the sum.  In the VC++ compiler, these map onto 64-bit integers.  So we should expect the generated code to make use of x64’s 64-bit registers.

We already explained, in a previous post, that xor reg, reg is a compact way to zero out the contents of reg.  But our first instruction applies xor to register EDX – the lower 32 bits of the RDX register.  The next instruction moves a 1 into EAX, the lower 32 bits of the RAX register.  This same pattern – of moving a value into a 32-bit register – continues with the next 3 instructions.  Taken at face value, this would leave the higher 32 bits of each target register containing random junk.  And yet the loop body performs calculations on the full-width, 64-bit registers.  How can the answers possibly be right?

The answer is that the x64 instruction set, originally published by AMD, defines that a 32-bit result is automatically zero-extended into the high 32 bits of its 64-bit destination register.  Here are the two applicable bullets from section 3.4.5 of AMD's architecture manual:

  • Zero-Extension of 32-Bit Results: 32-bit results are zero-extended into the high 32 bits of 64-bit GPR destination registers.
  • No Extension of 8-Bit and 16-Bit Results: 8-bit and 16-bit results leave the high 56 or 48 bits, respectively, of 64-bit GPR destination registers unchanged.
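
You can see the first rule at work in a trivial fragment like this (my sketch – the exact registers chosen will vary with compiler version and calling convention):

unsigned long long widen(unsigned int x) {
    return x;   // typically compiles to a single 32-bit move, e.g. mov eax, ecx -
}               // the hardware zero-extends, so no separate instruction is needed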

Finally, notice the npad 13 instruction (really a pseudo-operation – a directive to the assembler).  This ensures the next instruction – which is the start of the loop body – will be aligned in memory on an address that’s a multiple of 16 bytes.  This improves performance (sometimes, and on some micro-architectures).

  

Optional-4 : printf versus std::cout

You might ask why I used C's printf function, rather than C++'s std::cout, as a way to defeat DCE in the above experiment.  Try it and see.  Both work OK, but the asm file generated by the latter is much bigger, and therefore more difficult to navigate around: 0.7 Mbytes, compared with 1.7 Kbytes for the printf version.

Comments

  • @Jim - I'm assuming that when the optimizer calculates the result for small loops, it's merely a side-effect of loop unrolling, constant propagation, constant folding, and other standard optimizations that eventually reduce the code to a constant - there's no magic theorem solver in the compiler that calculates n*(n+1)/2.

  • @Ben Craig

    Dead Code Stripping:  basically, compile with /Gy (so that each function is built into its own record, or "COMDAT", in the resulting OBJ, rather than being catenated with its colleagues), then link with /OPT:REF and you should get *only* the required functions.  For this to work with the static CRT, it needs to have been built with /Gy - which it is, on recent releases.

    Experiment with: cl /Gy /Zi /MT App.cpp /link /opt:ref /verbose:ref - to list all the functions that the linker discards.  

    Then experiment with: dbh App.pdb to see all the functions it keeps.

    But, as you point out, it does not discard every function you expect, for several reasons.  This is a big topic.  Probably deserves its own blog post.  I'll add it to the list.

    [In passing, "dead code" means code that would be executed, but whose results are not used, as described in this blog post.  By contrast, "unreachable code" means code that can never be called at runtime - it won't even be executed (although not everyone is picky to bother with the difference).  The "stripping" going on here is more akin to "unreachable code", than to "dead code"]

  • @CarlD

    Correct - the fact that the optimizer figures out that Sum(1..10) = 55 is a side-effect of regular optimizations.  The optimizer does not infer that the loop is calculating Sum(1..10), and then apply the closed-form solution.

  • @Jim Hogg "So, for example, it cannot prove that QPC might not throw an exception the second time it is called.  Therefore, it would be unsafe to move the second QPC call above the loop."

    Hmm, even with /EHsc where the compiler assumes that external C functions do not throw C++ exceptions?

    In any case, I found the example I was talking about. For whatever reason the compiler will never move the code between QPC. But it will move a function call:

    long long __declspec(noinline) sum(long long n) {
        long long s = 0;
        for (long long i = 1; i <= n; ++i) s += i;
        return s;
    }

    int main() {
        LARGE_INTEGER start, stop;
        QueryPerformanceCounter(&start);
        long long s = sum(1000000000);
        QueryPerformanceCounter(&stop);
        double diff = stop.QuadPart - start.QuadPart;
        printf("%f %lld\n", diff, s);
    }

    Relevant assembly code (/FAs):

    ; 16   :     QueryPerformanceCounter(&start);
        lea    rcx, QWORD PTR start$[rsp]
        call   QWORD PTR __imp_QueryPerformanceCounter
    ; 17   :     long long s = sum(1000000000);
    ; 18   :     QueryPerformanceCounter(&stop);
        lea    rcx, QWORD PTR stop$[rsp]
        call   QWORD PTR __imp_QueryPerformanceCounter
        call   ?sum@@YA_JH@Z                ; sum - AFTER THE SECOND QPC

    Obviously, this code will happily print 0 even if the execution of sum takes some time.

    Something like this was reported on Connect a while ago, and everyone seems to agree that it is a valid optimization: connect.microsoft.com/.../visual-c-2012-rtm-serious-compiler-bug

  • @Jim " So, for example, it cannot prove that QPC might not throw an exception the second time it is called.  Therefore, it would be unsafe to move the second QPC call above the loop."

    Why would this be unsafe? Or, how would observable behaviour change?

    IMO observable behaviour consists of two performance counter reads and one or two printfs. These can't be reordered, but the calculations can be reordered freely.

  • @claudiu

    Not sure why.   ie, the compiler zeroes out rdx using "xor edx, edx" then makes use of this zeroed register to zap others, as in "mov ecx, edx" (2 bytes) and "mov r8d, edx" (3 bytes).  The xor alternatives are the same size.  Asking around the team, one thought is that the choice might reflect an attempt to reduce instruction data dependencies (flow/anti/output), but that's not firm.

  • @Mike

    I'm not explaining this too well, and perhaps muddying the waters with a mistake or two :-).  Let me try to summarize.  The optimizer can reorder code so long as it can prove the observable effects are the same as the original.

    For example, suppose we have a code fragment like this:

       // Example-1
       QPC(&start)
       x = <some inline calculation>
       QPC(&stop)
       print x

    The compiler has no "special" knowledge of QPC (QueryPerformanceCounter).  It could have side-effects (eg, print a message to the console; throw an exception; launch a missile).  But the compiler can 'see' the details of <some inline calculation>, such as our sum-1-to-n example, and infer it has no side-effects.  So there's nothing to prevent the optimizer from re-ordering this sequence to:

      // Example-2
      QPC(&start)
      QPC(&stop)
      x = <some inline calculation>
      print x

    because when you execute that sequence, the observable behavior is the same as the original.  It can even reorder this case:

      // Example-3
      QPC(&start)
      x = <some call to a non-side-effecting function in this compiland>
      QPC(&stop)
      print x

    because the compiler can 'see' the body of that called function and infer it has no side-effects.

    But, it CANNOT reorder this case:

      // Example-4
      QPC(&start)
      x = <some call to an external function>
      QPC(&stop)
      print x

    because the optimizer cannot know that the external function (where, by "external", I simply mean not in this compiland) might have side-effects (a printf, for example).  Likewise, the optimizer CANNOT reorder the sequence:

      // Example-5
      QPC(&start)
      x = <some_calculation>
      print(x)
      QPC(&stop)

    Because doing so could change the observable effect of the program.  (Just imagine QPC includes a printf to make this clearer)

    The example on Connect is like Example-3.  So, as Ian points out, the optimization is legitimate - there is no observable difference in behavior.

  • @Jim Hogg: "But, it CANNOT reorder this case:.."

    Yes, but your example loop is not such a case, it can be moved before/after QPC calls and then you'll get wrong performance results. I suppose one needs to use something like _ReadWriteBarrier to be on the safe side.

  • TempleOS Holy-C compiler is better.

    -----

    U0 Cnt(I64 n)
    {
      I64 i;
      for (i = 1; i <= n; i++)
        "%d\n", i;
    }

    Cnt(1000000);

    -----

    Counts to a million in 1.7 seconds.

  • I think there's a small typo under Optional-2:

    "When we add a printf to prevent DCE, the /Od version runs in about 1 million somethings - a speedup of about 7X."

    I think you mean /02 there.


  • For this kind of assembly inspection, I usually put the code in a function like result_t foo(optionalarg_t), marked noinline; this keeps the profiling and output code separate from the assembly of interest. It may also prevent some optimizations, such as constant folding inside loop kernels, since the arguments cannot be compile-time constants for the function.

  • A topic dear to my heart. I'd like to see the compiler emit, after the final link, a list of the methods and functions under a specified source path that were never called. It would be incredibly useful for keeping mature codebases minimal and complete.
