Optimizing C++ Code : Overview

Optimizing C++ Code : Overview

Rate This
  • Comments 18

If you have arrived in the middle of this blog series, you might want instead to begin at the beginning.

This post explains the flow of data within the Visual C++ compiler – starting with our C++ source program, and ending with a corresponding binary program. This post is an easy one – dipping our toes into the shallow end of ocean.

Let's examine what happens when we compile a single-file program, stored in App.cpp, from the command-line. (If you were to launch the compilation from within Visual Studio, the diagram below would have to include higher layers of software; however, these end up emitting the very commands I'm about to describe).

So let's imagine we just typed: CL /O2 App.cpp

The CL stands from "Compile and Link". /O2 tells the compiler to optimize for speed – to generate machine code that runs as fast as possible. This command launches a process to execute the CL.EXE program – a "driver" that calls several pieces of software: when joined together, they process the text in our App.cpp and eventually generate a binary file, called App.exe. When executed, this binary will carry out the operations we specified in our original source file.

Let's take a stroll around the diagram above and explain what's going on.

CL.EXE parses our command-line, and checks that it makes sense. It then calls the C++ "front-end", located in C1XX.DLL (the "CXX" is meant to suggest "C++", but dates back to a time when "+" was not a legal character in file names). The FrontEnd is the part of the chain that understands the C++ language. It scans, parses and transforms your App.cpp into an equivalent tree, passed to the next component via 5 temporary files. The "language" represented by these 5 files is called "CIL", for "C Intermediate Language". Don't confuse this with the intermediate code, generated by managed languages such as C#, sometimes called "MSIL" but also, unfortunately, named "CIL" in the ECMA-335 standard.

Next, CL.EXE calls the "back-end", located in C2.DLL. We call the BackEnd "UTC", which stands for "Universal Tuple Compiler", although this name doesn't show up in any of the binaries included into Visual Studio. The BackEnd starts by converting the information from the FrontEnd into tuples – a binary stream of instructions. If we were to display them, they would look like a kind of high-level assembly language. It's high-level in the senses that:

  • Operations are generic. For example, a BRANCH(LE) instruction versus how it will eventually be translated, into a cmp instruction in x64 machine code
  • Operands are symbolic. For example, t66 denoting a temporary variable generated by the compiler, versus eax, the x64 register that will hold its value at runtime

Because we asked the compiler to optimize for speed, via the /O2 switch, the Optimize part of the BackEnd analyses the tuples and transforms them into a form that will execute faster but that is semantically equivalent – that yields the same results as would the original tuples. After this step, the tuples are passed to the CodeGen part of the BackEnd, which decides the final binary code to emit.

The CodeGen module emits App.obj as a file on disk. Finally, the Linker consumes that file, resolves any references to libraries, and produces the final App.exe binary.

In the diagram, black arrows show flow of data – text or binary. Red arrows show flow of control.

(We shall return to this diagram later in the series, when we hit the topic of Whole Program Optimization, specified to the compiler with the /GL switch, and to the Linker with the /LTCG switch. We will see the same set of boxes, but connected in different ways)

In summary:

  • The FrontEnd understands C++ source code. Other parts of the chain – BackEnd and Linker – are (mostly) isolated from the details of the original source language. They work upon the tuples, mentioned above, that form a kind of high-level, binary assembly language. The original source program might have been any imperative language, such as FORTRAN or Pascal. The BackEnd really doesn't care. Well, not much.

 

  • The Optimize part of the BackEnd transforms tuples into more efficient forms that will run faster. Such transformations, we call optimizations. (We should really call them "improvements", because there may be other improvements that produce even faster-running code – we just aim to get close to that ideal. However, someone coined the term "optimization" a few decades ago, and we're stuck with it) There are dozens of such optimizations, with names like "constant folding", "common sub-expression elimination", "hoisting", "invariant code motion", "dead code elimination", "function inlining", "auto vectorization", etc. For the most part, these optimizations are independent of the eventual processor on which the program will execute – they are "machine-independent" optimizations.

 

  • The CodeGen part of the Backend decides how to lay out the runtime stack (used to implement function "activation frames"); how to make best use of the available machine registers; adds details that implement function calling conventions; and transforms the code to make it run even faster, using its detailed knowledge of the target machine. (As one tiny example, if you ever look at assembly code – for example, using the Disassembly window in Visual Studio (Alt+8) whilst debugging code – you may notice instructions like xor eax, eax used to set EAX to zero, in preference to the more obvious move eax,0. Why? Because the XOR form is smaller (2 bytes instead of 5) and executes faster. We might well call this a "micro-optimization", and wonder whether it's worth the bother. Recall the proverb: "Look after the pennies and the pounds will look after themselves"). In contrast to the Optimize part, CodeGen is aware of the processor architecture on which the code is going to run.  In some cases, it will even change the order in which machine instructions are laid out – a process called "scheduling" – based upon its understanding of the target processor.   Oh, I'd better explain that a bit more: CodeGen knows whether it is targeting x86 or x64 or ARM-32; but it's rare that is can know the specific micro-architecture the code will run on - Nehalem versus Sandy Bridge, for example. (see the /favor:ATOM switch for a case where it knows more detail)

This blog will focus on the Optimize part of the compiler, with scant mention of the other components that make up the chain – FrontEnd, CodeGen or Linker.

This post has introduced lots of jargon. I'm not intending that you make total sense of it all at this point: this post is an overview, sprinkled with ideas that hopefully pique your interest, and ensure you'll come back next time, where I will start to explain all of that jargon.

Next time, we will look at perhaps the simplest optimization and how it works – the one called "constant folding".

  • "passed to the next component via 5 temporary files"

    Why are temporary files used for this? AFAIK both the frontend and backend run in the same process.

    "UTC", which stands for "Universal Tuple Compiler"

    Heh, I always wondered what UTC stands for.

  • Nice post! Can't wait to see next. I love reading about compiler details.

  • So some questions, then some back story...

    Does the VC team track stack usage as a performance metric?

    What is the exchange rate between instruction size and stack size?

    Back story:

    About a year ago, I was debugging a buffer overflow in an x64 executable build with vc10.  For the purposes of the discussion, it looked something like this...

    inline void buffer_holder() {

      char buffer[1024];

      //do stuff with buffer

    }

    int main() {

      buffer_holder();

      buffer_holder();

      return 0;

    }

    I noticed that the code generated for this program had main() reserve ~2K of stack space, instead of reserving ~1K, and reusing the buffer.

    I have seen similar situations to this where a three byte instruction (add esp with an immediate value) could save hundreds of bytes of stack space.  I would expect this to improve system performance, as the cache would be much denser, but I haven't measured.  Do you have any idea where the break-even point is (in general) for instruction tweaks vs. stack space saved?

  • Outstanding post! Is there any chance Jim that at the end of your series you could collect all the posts together in a PDF? The above material is incredibly valuable and presentation is top notch - it's practically a Patterns and Practices level treatment of the material.

    Like others here I'm really looking forward to seeing where you take this. Kudos and many thanks.

    Tom

  • I wonder how much faster the whole compilation process would be without relying so heavily on the filesystem?

  • बहुत ही शानदार लेख लिखा है|

  • You are an excellent write Sir! I am an engineer turned somewhat programmer. No formal training in coding. Compiling to me usually consist of Compile All from the IDE. But your article was very informative and interesting!!!

    Your style of writing is great at explain things like the "UTC" explanation!

    Great Job!!!

  • How to reduce the priority to -1 on build in new build?  From historical data, a build completes in the same time but does not bring the machine to its knees begging for mercy.  *Consider this and give us back our CPUs.

    *Of course, it won't happen, but know that this is stupid to consume all CPU resources all the time when it is my machine.  **I demand to control how it is used.

    **Of course, this is like saying stop PRISM.

    O' the drag of zero competition.

  • @Mike

    Reasons for using temp files are historical - from days when memory was limited; before DLLs came into use, and the parts of the compiler launched as separate processes.  However, with a writeback buffer-cache in the filesystem, the performance is good.  (I was told that the team experimented with in-memory transfers some time back, but without much difference in performance).  Note that these intermediate representations can be quite big.  If you want to 'see' the temp files, use the undocumented /Bk switch.  For example: CL /c /Bkxyz.  App.cpp will compile App.cpp, and produce 5 temp files called xyz.db, xyz.ex, xzy.gl, xyz.in and xyz.sy.  You can't do anything with these temp files, except see their size.

    Note too that we use these temp files, packaged together, in Whole-Program (/GL /LTCG) scenarios.  We'll cover that in a future post.

  • @Ben Craig

    Re-using the same part of stack for different variables is called "stack packing".  And UTC does this, where possible, for scalar variables.  It's like register-allocation, except, instead of registers, we are dealing with "slots" in the stack.  To be safe (ie, to yield correct results), the compiler needs to analyze the lifetime of each variable, to make sure the lifetimes of any variables that share the same stack slot do not overlap.  The analysis required to do this for arrays is more involved than for scalars - in your example, checking that none of the 1000 bytes of the first instantiation could overlap, in time, any of the 1000 bytes of the second instantiation.  This is made more complex if the code assigns a char pointer to any elements in the array.

    Bottom line - yes, there are some situations where it's safe to stack-pack arrays; but the analysis can be involved, and the compiler doesn't currently do so.  (In passing, it's the CodeGen part of the compiler that's responsible for stack-packing)

  • @Tom Kirby-Green

    Gathering posts together?  Yes - I am now realizing that some of the interesting discussion arises from questions, so a PDF summary would usefully require an editor pass to merge text and Q&A together.

  • I realize this probably isn't something that will ever be supported, but being able to target C2.exe from my own front-end would be cool - get a bunch of optimizations for free, possibly also with whole-program optimization support and binary interop. Doing something more sane like targeting LLVM is probably a better idea though :)

    I might try pulling apart those /Bk outputs for fun some time...

  • @Simon Buchan  - well, that was the idea with Phoenix but unfortunately Phoenix disappeared...

  • @Simon

    We don't document, or support, feeding parsed source code into UTC, unfortunately.  And yes, that was (one of) the many aims of Phoenix (where I spent 4 interesting years).  Although Phoenix disappeared as a shipping product, that technology, and part of the team, live on within another group inside Microsoft.

  • Nice and very informative information on code cleaning, but needed one online tool to clean up codes of blogger to increase speed

    www.knowinfonow.com

Page 1 of 2 (18 items) 12