Optimizing C++ Code

Introduction

Hi, my name is Jim Hogg and I am a Program Manager on the Visual C++ compiler team at Microsoft, based on the main campus here in Redmond. More specifically, I work on the part of the compiler that optimizes your code, to make it run faster, or to make it smaller, or a mixture of the two.

In this series of blog posts, I will explain some of the optimizations that make your code run faster. I'll include examples, with measurements of how much gain various optimizations might deliver. I'll then describe some of the more recent optimizations that the team has added, transforming your code in amazing, non-obvious ways.
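
For a taste of what's coming, consider a fragment like this (a sketch of my own; later posts will work through real examples with measurements). With optimization enabled, the compiler folds the multiply at compile time, so no multiplication happens at run time:

    int area()
    {
        int width  = 6;
        int height = 7;
        return width * height;   // constant-folded: the function simply returns 42
    }

On x64, an optimized build typically reduces this whole function to "mov eax, 42" followed by "ret".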

Who is this blog aimed at? Anyone who is interested in how compilers work. Anyone who wonders how a compiler can possibly make the code you wrote run faster than "what the original C++ code says". I'll also look at the opposite side: patterns that prevent or inhibit optimization. Armed with this knowledge, you might tweak your source code to allow the optimizer more freedom, and make your program run faster.

What are the prerequisites to follow along with this blog? Some knowledge of programming in C or C++. (Most of the examples I will use can be understood in C; only towards the end will I examine optimizations that are specific to C++ code, such as "de-virtualization".) Ideally, you should be able to read 64-bit assembly code: that way, you can really see the transformations the optimizations make. But this is not a hard requirement; I'll aim to provide insights without digging all the way down to the binary machine code that the compiler generates.
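
To give a flavor of that C++-specific material, here is a sketch (my own, ahead of the proper post) of a call that is a candidate for de-virtualization: when the compiler can prove the dynamic type of an object, it can replace the indirect virtual call with a direct call, or even inline the callee:

    struct Animal
    {
        virtual int legs() const { return 4; }
    };

    struct Bird : Animal
    {
        int legs() const override { return 2; }
    };

    int countLegs()
    {
        Bird b;                // dynamic type is provably Bird
        Animal* p = &b;
        return p->legs();      // candidate for de-virtualization: the compiler
                               // can call (or inline) Bird::legs directly
    }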

I will create a Table-of-Contents here for all of the blog posts in this series, updating as I publish each post.

  1. Introduction (this post)
  2. Overview – the process of compiling C++ code
  3. Constant-Folding
  4. Dead Code Elimination
  5. . . .
  6. . . .
  7. Function Inlining
  8. . . .
  9. . . .
  10. Whole-Program Optimization (Link-Time Code Generation)
  11. . . .


  • @Jim:

    One case when float-to-double conversions show up in intermediate calculations:

    www.altdevblogaday.com/.../intermediate-floating-point-precision

    Consider:

    void MulDouble(double x, double y, double* pResult)
    {
        *pResult = x * y;
    }

    void MulFloat(float x, float y, float* pResult)
    {
        *pResult = x * y;
    }

    Codegen dependency chain:

    MulDouble:
        movsd -> mulsd -> movsd

    MulFloat:
        movss -> cvtps2pd -> mulsd -> cvtpd2ps -> movss

    This has a rather disastrous impact on performance:

    "Measuring such things is tricky at best, and the measurements extrapolate poorly, but in my crude tests I found that on optimized 32-bit /arch:SSE2 builds with VC++ 2010 MulFloat takes 35% longer to run than MulDouble. On tests where there is less function call overhead I’ve seen the float code take 78% longer than the double code."

    Note that VC++ 2012 and/or 64-bit may change the results (see the article).

    Importantly:

    "The frustrating thing about my test functions above is that the extra instructions are provably unneeded. They will make no difference to the result.

    The reason I can be so confident that the double-precision temporaries make no difference in the example above is that there is just one operation being done. The IEEE standard has always guaranteed that the basic operations (add, subtract, multiply, divide, square root, some others) give perfectly rounded results. On any single basic operation on two floats if the result is stored into a float then widening the calculation to double is completely pointless because the result is already perfect. The optimizer should recognize this and remove the four extraneous instructions. Intermediate precision matters in complex calculation, but in a single multiply, it doesn’t.

    In short, even though the VC++ policy in this case is to use double-precision for float calculations, the “as-if” rule means that the compiler can (and should) use single precision if it is faster and if the results would be the same “as-if” double precision was used."
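
    To check that claim empirically, here's a quick harness of my own (hypothetical, not from the article): compare the direct float multiply against the widen-multiply-narrow sequence. For a single multiply they must always agree, because the exact product of two floats fits in a double's 53-bit significand, so narrowing the double product yields exactly the correctly-rounded float result:

        #include <cstdio>
        #include <cstdlib>

        int main()
        {
            for (int i = 0; i < 10000000; ++i)
            {
                float x = static_cast<float>(rand()) / RAND_MAX;
                float y = static_cast<float>(rand()) / RAND_MAX;
                float direct  = x * y;                  // single-precision multiply
                float widened = static_cast<float>(     // widen, multiply, narrow
                    static_cast<double>(x) * static_cast<double>(y));
                if (direct != widened)
                {
                    printf("mismatch: %g * %g\n", x, y);
                    return 1;
                }
            }
            printf("no mismatches: the double-precision detour changes nothing\n");
            return 0;
        }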

  • Oh yeah! C++ is really back! Can't wait for the next post.

  • We need more aggressive vectorization and more aggressive auto-parallelization. We also need a way to automatically create optimal function-order lists to feed the linker, to reduce page faults and improve locality of reference. PGO does not always work; for example, it does not work with OpenMP enabled. The linker has accepted function-order lists for as long as I can remember (MS C 3.0, circa 1990?), but there *never* was any practical way to actually create a profile-guided function-order list. It's about time to fill that gap (and since MS is very competition-oriented when it comes to its products' feature sets: the Intel C/C++ compiler has done this for years, see software.intel.com/.../optaps_pgo_funcdata_order.htm).

  • @axel

    Yes, agreed we need to be more aggressive in auto-vectorizing and auto-parallelizing.  We're on it.

    The linker still supports function order lists (search for LINK /ORDER on MSDN).  But, as you say, how do you get a good order?  I won't digress into a discussion of PGO here.  Instead, please ask in the current series of posts about PGO that my colleague Ankit is writing on vcblog (we just talked about this a minute ago).
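
    For reference, the mechanics look roughly like this (a hypothetical example; the names are made up, and each function must be compiled with /Gy, function-level linking, so the linker is free to reorder it).  First, a functions.txt with one decorated name per line, hottest functions first:

        ?Render@@YAXXZ
        ?Update@@YAXXZ
        ?RarelyCalled@@YAXXZ

    Then compile and link:

        cl /c /Gy /O2 app.cpp
        link /ORDER:@functions.txt app.obj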

  • @MattPD

    We changed the codegen with VS 2012 (where, in passing, /arch:SSE2 has become the default).  To see what I mean, compile the following one-liner with:  cl /c /FA /O2 mul.cpp

        void fmul(float x, float y, float* p)    { *p = x * y; }

    Here is the codegen for 'fmul', targeting x64:

        mulsd  xmm0, xmm1
        movsdx QWORD PTR [r8], xmm0
        ret    0

    And here is the codegen for 'fmul', targeting x86:

        movss  xmm0, DWORD PTR _x$[esp-4]
        mulss  xmm0, DWORD PTR _y$[esp-4]
        mov    eax, DWORD PTR _p$[esp-4]
        movss  DWORD PTR [eax], xmm0
        ret    0

    As Bruce explains in his blog, the cvt* instructions were used to widen the floats to doubles, perform the multiplication at double precision, and narrow the result afterwards.  This is permitted by the C++ standard.  But the standard also permits the compiler to perform the multiply at 32-bit precision (thereby avoiding the cvt* instructions, and going faster); and in VS 2012, we do!

  • @Jim

    Thanks!

    Interestingly, while your x86 codegen is the same as mine, my x64 is different:

    ; Listing generated by Microsoft (R) Optimizing Compiler Version 17.00.60315.1
    ; ...

        mulss xmm0, xmm1
        movss DWORD PTR [r8], xmm0

    How/why are you getting double-precision "mulsd" (and "QWORD" data) in there?

    Isn't that what one should be getting for "MulDouble" instead?

    BTW, I have to ask, is "movsdx" (which I think I've only seen in MSVC codegen) an alternate spelling for "movsxd" (sign-extension from 32 to 64 bits), since I've only encountered the 2nd one in my references?

    Out of curiosity, why was the cvt* thing done before? Some ancient x87 compatibility reasons (but even then it was never necessary per the IEEE std.?) as Bruce guessed or was there something else?

  • Oops!  Yes, a cut&paste error on my part - as you guessed, mine was x64 codegen for a 'dmul'.  My x64 codegen for 'fmul' is, of course, the same as you show.

    The movsdx mnemonic is emitted (only?) by the VC++ compiler into an FA listing, as we saw.  This mnemonic is NOT in the Intel manuals.  And it's NOT an alternative to the movsxd instruction.  In fact, it corresponds to the Intel MOVSD instruction - as you will see in the disassembly, using link /dump /disasm mul.obj.   As best I can discover, we added the 'x' to disambiguate it from Intel's original MOVSD instruction ("move doubleword string").  That 'x' has nothing to do with sign-bit extension.  Not sure it was a great idea, viewed with 20-20 hindsight, but that's what we have.

    As for the cvt* x87 codegen - perhaps an oversight?  Perhaps following K&R (first edition), where all float calculations were mandated to be done at double precision.  The original codegen dates back many years, so discovering the real reason would require some 'archaeology'.

  • @Jim

    Wow, thanks! I never would have guessed that one!

    // I was even looking in the "x64 Instructions" section on MSDN (Windows Debuggers), msdn.microsoft.com/.../ff561500.aspx, but with no luck.

    Wondering if there's one place somewhere that lists extensions like that (or is movsdx the only one)?

    And thanks for the handy tip with link.exe, I'm more accustomed to the GCC toolchain for this kind of stuff (especially found the "GCC Explorer" on-line tool handy), but it's nice to have one more way to explore! :-)

  • Do you plan to cover very low-level optimizations that affect cache usage, e.g., maintaining good locality of reference? I'm not sure how much the compiler can do about that, but programmers should know about them.
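
    For example (a minimal sketch of the kind of thing I mean), just swapping the loop order changes how well the cache is used:

        const int N = 1024;
        double a[N][N];

        // Row-major traversal: consecutive accesses touch adjacent memory,
        // so every byte of each cache line fetched gets used.
        double sumByRows()
        {
            double s = 0;
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    s += a[i][j];
            return s;
        }

        // Column-major traversal: successive accesses are N * sizeof(double)
        // bytes apart, so almost every access can miss the cache.
        double sumByCols()
        {
            double s = 0;
            for (int j = 0; j < N; ++j)
                for (int i = 0; i < N; ++i)
                    s += a[i][j];
            return s;
        }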
