Visual C++ Code Generation in Visual Studio 2010


Hello, I’m Ten Tzen, a Compiler Architect on the Visual C++ Compiler Code Generation team. Today, I’m going to introduce some noteworthy improvements in Visual Studio 2010.

 

Faster LTCG Compilation:  LTCG (Link Time Code Generation) allows the compiler to perform better optimizations with information on all modules in the program (for more details see here).  To merge information from all modules, LTCG compilation generally takes longer than non-LTCG compilation, particularly for large applications.  In VS2010, we improved the information merging process and sped up LTCG compilation significantly. An LTCG build of Microsoft SQL Server (an application with .text size greater than 50MB) is sped up by ~30%.
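For reference, LTCG is driven by two switches: compile with /GL and link with /LTCG. A minimal command-line sketch (the file names are hypothetical):

:: Compile to object files that carry intermediate code instead of machine code
cl /c /GL /O2 a.cpp b.cpp
:: Code generation for the whole program then happens at link time
link /LTCG a.obj b.obj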

 

Faster Pogo Instrumentation runs:  Profile Guided Optimization (PGO, known informally as “Pogo”) is an approach to optimization in which the compiler uses profile information to make better optimization decisions for the program.  See here or here for an introduction to PGO.  One major drawback of PGO is that the instrumented run is usually several times slower than a regular optimized run.  In VS2010, we support a no-lock version of the instrumented binaries; with that, instrumentation (PGI) scenario runs are about 1.7x faster.
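For reference, a minimal command-line sketch of the PGO workflow (the file names are hypothetical):

:: 1. Build an instrumented binary (the PGI phase)
cl /c /GL /O2 app.cpp
link /LTCG:PGINSTRUMENT app.obj
:: 2. Exercise representative scenarios; the instrumented binary writes .pgc profile files
app.exe
:: 3. Relink using the collected profile data (the PGO phase)
link /LTCG:PGOPTIMIZE app.obj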

 

Code size reduction for the X64 target: Code size is a crucial factor in performance, especially for applications that are sensitive to instruction-cache or working-set behavior.  In VS2010, several effective optimizations have been introduced or improved for the X64 architecture. Some of the improvements are listed below:

·         More aggressively use RBP as the frame pointer to access local variables. RBP-relative address mode is one byte shorter than RSP-relative.

·         Enable tail merge optimizations in the presence of C++ EH or Windows SEH (see here and here for EH and SEH).

·         Combine successive constant stores into one store (see the sketch after this list). 

·         Recognize more cases where we can emit a 32-bit instruction for 64-bit immediate constants.

·         Recognize more cases where we can use a 32-bit move instead of a 64-bit move.

·         Optimize the code sequence of C++ EH destructor funclets.
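As an illustration of the constant-store combining mentioned above, a minimal sketch (the struct and function are hypothetical):

struct Point { int x, y; };      // two adjacent 32-bit fields

void init(Point* p)
{
    p->x = 1;                    // VS2008 x64: two separate 32-bit stores
    p->y = 2;                    // VS2010 x64: may fold both into one 64-bit store
}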

 

Altogether, we have observed code size reductions in the range of 3% to 10% across various Microsoft products such as Windows kernel components, SQL, Excel, etc.

 

Improvements for “Speed”:  As usual, there is also a great deal of code quality tuning and improvement across different code generation areas for “speed”.  In this release, we have focused more on the X64 target.  The following are some of the important changes that have contributed to these improvements:

·         Identify and use the CMOV instruction in more situations where it is beneficial (see the sketch after this list)

·         More effectively combine induction variables to reduce register pressure

·         Improve detection of region constants for strength reduction in a loop

·         Improve scalar replacement optimization in a loop

·         Better avoidance of store-forwarding stalls

·         Use XMM registers for memcpy intrinsic

·         Improve Inliner heuristics to identify and make more beneficial inlining decisions
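As an illustration of the CMOV case mentioned above, a minimal sketch (the function is hypothetical); whether a CMOV is actually emitted remains subject to the compiler's heuristics:

int clamp_min(int value, int floor)
{
    // Both operands are cheap and side-effect free, so the compiler can
    // use a CMOV instead of a potentially mispredicted conditional jump.
    return value < floor ? floor : value;
}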

Overall, we see an 8% improvement as measured by integer benchmarks and a few percentage points on the floating-point suites for X64.  

 

Better SIMD code generation for X86 and X64 targets:  The quality of SSE/SSE2 SIMD code is crucial to game, audio, video and graphics developers.  Unlike inline asm, which inhibits compiler optimization of the surrounding code, intrinsics were designed to allow effective optimization while still giving developers access to low-level control of the machine.  In VS2010, we have added several simple but effective optimizations that focus on SIMD intrinsic quality and performance.  Some of the improvements are listed below:

 

·         Break false dependency:  The scalar convert instructions (CVTSI2SD, CVTSI2SS, CVTSS2SD, or CVTSD2SS) do not modify the upper bits of the destination register. This creates a false dependency that can significantly affect performance. To break the false dependency for memory-to-register conversions, the VS2010 compiler inserts a MOVD/MOVSS/MOVSD to zero the upper bits and uses the corresponding packed conversion.  For instance,

 

cvtsi2ss xmm0, mem-operand    →    movd     xmm0, mem-operand
                                   cvtdq2ps xmm0, xmm0

For register-to-register conversions, an XORPS is inserted to break the false dependency.

cvtsd2ss xmm1, xmm0           →    xorps    xmm1, xmm1
                                   cvtsd2ss xmm1, xmm0

Even though this optimization may increase code size, we have observed significant performance improvements on several real-world programs and benchmarks. 
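For reference, a minimal C++ sketch of source that produces these scalar convert instructions (the functions are hypothetical):

float int_to_float(int i)
{
    return (float)i;     // CVTSI2SS; VS2010 may emit MOVD + CVTDQ2PS instead
}

float double_to_float(double d)
{
    return (float)d;     // CVTSD2SS; VS2010 may first XORPS the destination
}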

 

·         Perform vectorization for constant vector initializations: In VS2008, a simple initialization statement such as __m128 x = { 1, 2, 3, 4 } would require ~10 instructions. With VS2010, it is optimized down to a couple of instructions.  This applies to multi-dimensional initializations as well; the instructions generated for statements like __m128 x[] = {{1,2,3,4}, {5,6}} or __m128 t2[][2] = {{{1,2},{3,4,5}}, {{6},{7,8,9}}}; are greatly reduced with VS2010. 
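Laid out as a compilable fragment (a minimal sketch of the statements above):

#include <xmmintrin.h>

__m128 x = { 1, 2, 3, 4 };                            // ~10 instructions in VS2008,
                                                      // a couple in VS2010
__m128 x1[] = { {1,2,3,4}, {5,6} };                   // array initializations
__m128 t2[][2] = { {{1,2},{3,4,5}}, {{6},{7,8,9}} };  // benefit as well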

 

·         Optimize the _mm_set_*(), _mm_setr_*() and _mm_set1_*() intrinsic families.  In VS2008, a series of unpack instructions was used to combine the scalar values. When all arguments are constants, this can be achieved with a single vector instruction.  For example, the single statement return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7); required ~20 instructions in previous releases, while only one instruction is required in VS2010. 
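As a compilable fragment (a minimal sketch; the function name is hypothetical):

#include <emmintrin.h>

__m128i make_constant()
{
    // All eight arguments are compile-time constants, so VS2010 can load
    // the whole vector with a single instruction (e.g., a MOVDQA from the
    // constant pool) instead of ~20 scalar moves and unpacks.
    return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7);
}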

 

·         Better register allocation for XMM registers, thus removing many redundant loads, stores and moves.

·         Enable Compare & JCC CSE (Common Sub-expression Elimination) for SSE compares.  For example, the code sequence below on the left will be optimized to the code sequence on the right:

 

ECX, CC1 = PCMPISTRI               ECX, CC1 = PCMPISTRI
JCC(EQ) CC1                        JCC(EQ) CC1
ECX, CC2 = PCMPISTRI          →    JCC(ULT) CC2
JCC(ULT) CC2                       JCC(P) CC3
ECX, CC3 = PCMPISTRI
JCC(P) CC3
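At the source level, this pattern arises with the SSE4.2 string intrinsics. A minimal hypothetical sketch: each intrinsic below maps to a PCMPISTRI with identical operands but reads a different flag, so the compiler can keep a single PCMPISTRI and reuse its condition codes:

#include <nmmintrin.h>   // SSE4.2 string intrinsics

enum { MODE = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH };

int classify(__m128i a, __m128i b)
{
    if (_mm_cmpistrz(a, b, MODE)) return 0;   // tests ZF
    if (_mm_cmpistrc(a, b, MODE)) return 1;   // tests CF
    return _mm_cmpistri(a, b, MODE);          // reads the ECX index result
}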

 

Support for AVX in Intel and AMD processors:   Intel AVX (Intel Advanced Vector Extensions) is a 256-bit instruction set extension to SSE, designed for applications that are floating point intensive (see here and here for detailed information from Intel and AMD respectively).  In the VS2010 release, all AVX features and instructions are fully supported via intrinsics and the /arch:AVX switch.  Many optimizations have been added to improve the quality of AVX code generation; these will be described in more detail in an upcoming blog post. In addition to AVX support in the compiler, the Microsoft Macro Assembler (MASM) in VS2010 also supports the Intel AVX instruction set for x86 and x64.
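A minimal sketch of AVX via intrinsics (the function is hypothetical; compile with /arch:AVX):

#include <immintrin.h>

__m256 add8(__m256 a, __m256 b)
{
    // A single 256-bit VADDPS adds eight floats at a time.
    return _mm256_add_ps(a, b);
}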

 

 

More precise floating point computation with /fp:fast: To achieve maximum speed, the compiler is allowed to optimize floating point computation aggressively under the /fp:fast option.  The consequence is that floating point errors can accumulate, and a result can become inaccurate enough to severely affect the outcome of a program.  For example, we observed that more than half of the programs in our floating-point benchmark suite failed with /fp:fast in VS2008 on X64 targets.  To make /fp:fast more useful, we “down-tuned” a couple of optimizations in VS2010. This change may slightly affect the performance of some programs that were previously built with /fp:fast, but it improves their accuracy.  And if your programs were failing with /fp:fast in earlier releases, you may see better results with VS2010.
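As an illustration, a minimal sketch (the function is hypothetical): under /fp:fast the compiler may reassociate this summation (for example, to vectorize it), and since floating point addition is not associative, the result can differ from the source-order evaluation that stricter modes preserve:

double sum(const double* v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += v[i];       // reassociation changes rounding, and thus the result
    return s;
}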

 

Conclusion: The Visual C++ team cares about the performance of applications built with our compiler, and we continue to work with customers and CPU vendors to improve code generation. If you see issues or opportunities for improvement, please let us know through Connect or through our blog.


  • Woohoo! Your achievements are awesome!!! Can't wait for finally using it :-)

  • If the code generation for intrinsics is improved, that would be great. I have seen wrong code generation with intrinsics, and have to manually unroll any loop with intrinsics - the compiler seems to think they are function calls when it makes unroll decisions. The CMOV improvements would be great, too - right now the only way to force the compiler to use those is to promote all sub-expressions to variables and use very awkward ternary operator sequences.

  • Have you seen these issues with the Visual Studio 2010 BETA 2 compiler?

    Sharing sample code that exposes the problem (or missed opportunity) would be fantastic.  We do have a long list of optimizations we want to add/improve in the compiler, but specific test cases are always great to help us validate we really are addressing the issue you identified.

    -Andre

    Lead Program Manager

    C++ compiler

  • As a game developer, all of these optimizations are great news, thanks!

  • I found out about the AVX stuff on my own but the rest is news!  Thanks for the continued work.

  • Do you have any numbers on the speedup of VS 2010 vs. VS 2008 or VS 2005 for x86?

  • That's what I wanted to hear!!

  • Great news! Can't wait to test it out. Thanks a lot.

  • Cool, at least for the SSE2 stuff, I can confirm that it's indeed working, well done. You should also mention that returning using foo.m128_f32 [0] now does a simple store instead of doing weird stuff :) Another change which is worth mentioning is that _mm_mul_ps (foo, _mm_load_ps (bar)) now gets compiled to mulps XMM1, XMMWORD PTR [ecx] instead of always loading both into a register (very annoying as it increased register pressure for no good reason).

    Unfortunately, the template inline heuristic is still buggy (bug 351744) -- any chance that a SP for VS2010 will fix that, or is this really for the next release cycle?

  • RE: Bryan Hayes:

    The numbers quoted in the "Improvements for Speed" section are VS2010 improvements over VS2008.  We haven't taken the time to compare to VS2005.

    Andre Vachon

    Lead Program Manager

    C++ compiler

  • Cool! Very interesting article.

    Is "Faster Pogo" a typo? Or do you use it like this in Microsoft?


  • Nice to see the native compiler evolving. What could bring more pleasure to any true hardcore C++ coder?..

    The one thing that keeps me sad VC++ release after release is the compiler's behavior with the memcpy function. When you call it with a constant size, the compiler inlines it aggressively, uses SSE2 and produces very fast code...

    But, if you ever try to give it a dynamic size, a "call memcpy" is inserted. Regardless of any "intrinsic" settings, a full call to either the CRT DLL or the static version of the function is inserted. And moreover, even for a very simple copy of several bytes, the processor will pass through about 10 branches before it ever starts copying anything.

    Another problem is if you want to compile a small binary without a CRT dependency and happen to use memcpy/memset (even implicitly), you are the loser. I remember rewriting the simplest construction of the form

    SOMESTRUCTURE s={0};

    to per-field zeroing just to avoid implicitly calling memset, which I could not make the compiler inline.

    The behaviour has changed from VS 2005 to VS 2008 (or from .NET to VS 2005, can't remember more precisely). Before that, calls to memcpy with dynamic size were nicely inlined.

    I once tried to ask someone from MS here and in forums and was given the answer that it was "faster" that way. Still can't believe this...

  • Re: Alex

    So you would like to see the compiler inline a rep movsb\movsd of some kind for the dynamic sized copies ?

  • I'll see if we can get a sample code - we just shrugged off the problem and worked around it. The issue was code like this (resulting from a macro that needs a temp variable):

    {
        {
            __m128i foo;
            // access foo
        }
        {
            __m128i foo;
            // access foo
        }
        {
            __m128i foo;
            // access foo
        }
    }

    The assembly showed that access to foo in each scope was garbled as if scoping did not apply; not using nested scopes but a single local foo fixed the issue. We'll see if we can resurrect the test case and send it through our Microsoft contact.

    We'll give 2010 a spin with it too. However, Microsoft has consistently pointed us to beta versions of compilers as solutions to our problems. I cannot go to my supervisor and suggest we switch our code base to an unreleased and unsupported compiler regardless of what it fixes.
