Visual C++ Code Generation in Visual Studio 2010



Hello, I’m Ten Tzen, a Compiler Architect on the Visual C++ Compiler Code Generation team. Today, I’m going to introduce some noteworthy improvements in Visual Studio 2010.

 

Faster LTCG Compilation:  LTCG (Link Time Code Generation) allows the compiler to perform better optimizations with information on all modules in the program (for more details see here).  To merge information from all modules, LTCG compilation generally takes longer than non-LTCG compilation, particularly for large applications.  In VS2010, we improved the information-merging process and sped up LTCG compilation significantly. An LTCG build of Microsoft SQL Server (an application with a .text size greater than 50MB) is now about 30% faster.

 

Faster PGO Instrumentation runs:  Profile Guided Optimization (PGO) is an approach to optimization where the compiler uses profile information to make better optimization decisions for the program.  See here or here for an introduction to PGO.  One major drawback of PGO is that the instrumented run is usually several times slower than a regular optimized run.  In VS2010, we support a no-lock version of the instrumented binaries.  With that, instrumentation (PGI) runs are about 1.7x faster.

 

Code size reduction for the X64 target: Code size is a crucial factor in performance, especially for applications sensitive to instruction-cache behavior or working-set size.  In VS2010, several effective optimizations are introduced or improved for the X64 architecture. Some of the improvements are listed below:

·         More aggressively use RBP as the frame pointer to access local variables. The RBP-relative addressing mode is one byte shorter than the RSP-relative one.

·         Enable tail merge optimizations in the presence of C++ EH or Windows SEH (see here and here for EH or SEH).

·         Combine successive constant stores to one store. 

·         Recognize more cases where we can emit 32-bit instruction for 64-bit immediate constants.

·         Recognize more cases where we can use a 32-bit move instead of a 64-bit move.

·         Optimize the code sequence of C++ EH destructor funclets.

 

Altogether, we have observed code size reduction in the range of 3% to 10% with various Microsoft products such as the Windows kernel components, SQL, Excel, etc.

 

Improvements for “Speed”:  As usual, there are also many code quality tunings and improvements across different code generation areas for “speed”.  In this release, we have focused more on the X64 target.  The following are some of the important changes that have contributed to these improvements:

·         Identify and use the CMOV instruction in more situations where it is beneficial

·         More effectively combine induction variables to reduce register pressure

·         Improve detection of region constants for strength reduction in a loop

·         Improve scalar replacement optimization in a loop

·         Better avoidance of store-forwarding stalls

·         Use XMM registers for memcpy intrinsic

·         Improve inliner heuristics to make more beneficial inlining decisions

Overall, we see an 8% improvement as measured by integer benchmarks and a few percentage points on the floating point suites for X64.

 

Better SIMD code generation for X86 and X64 targets:  The quality of SSE/SSE2 SIMD code is crucial to game, audio, video, and graphics developers.  Unlike inline asm, which inhibits compiler optimization of surrounding code, intrinsics were designed to allow more effective optimization while still giving developers access to low-level control of the machine.  In VS2010, we have added several simple but effective optimizations that focus on SIMD intrinsic quality and performance.  Some of the improvements are listed below:

 

·         Break false dependencies:  The scalar convert instructions (CVTSI2SD, CVTSI2SS, CVTSS2SD, or CVTSD2SS) do not modify the upper bits of the destination register. This creates a false dependency which can significantly affect performance. To break the false dependency for memory-to-register conversions, the VS2010 compiler inserts MOVD/MOVSS/MOVSD to zero out the upper bits and uses the corresponding packed conversion.  For instance,

 

cvtsi2ss xmm0, mem-operand      →      movd     xmm0, mem-operand
                                       cvtdq2ps xmm0, xmm0

For register-to-register conversions, XORPS is inserted to break the false dependency.

cvtsd2ss xmm1, xmm0             →      xorps    xmm1, xmm1
                                       cvtsd2ss xmm1, xmm0

Even though this optimization may increase code size, we have observed significant performance improvements on several real-world programs and benchmarks.

 

·         Perform vectorization for constant vector initializations: In VS2008, a simple initialization statement, such as __m128 x = { 1, 2, 3, 4 }, would require ~10 instructions. With VS2010, it’s optimized down to a couple of instructions.  This applies to multi-dimensional array initialization as well.  The instructions generated for initialization statements like __m128 x[] = {{1,2,3,4}, {5,6}} or __m128 t2[][2] = {{{1,2},{3,4,5}}, {{6},{7,8,9}}} are greatly reduced with VS2010.

 

·         Optimize the _mm_set_*(), _mm_setr_*() and _mm_set1_*() intrinsic families.  In VS2008, a series of unpack instructions is used to combine the scalar values. When all arguments are constants, this can be achieved with a single vector instruction.  For example, the single statement return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7) would require ~20 instructions in previous releases, while only one instruction is required in VS2010.

 

·         Better register allocation for XMM registers, removing many redundant loads, stores, and moves.

·         Enable Compare & JCC CSE (Common Sub-expression Elimination) for SSE compares.  For example, the code sequence below at left will be optimized to the code sequence at right:

 

ECX, CC1 = PCMPISTRI                 ECX, CC1 = PCMPISTRI
JCC(EQ) CC1                          JCC(EQ) CC1
ECX, CC2 = PCMPISTRI          →      JCC(ULT) CC2
JCC(ULT) CC2                         JCC(P) CC3
ECX, CC3 = PCMPISTRI
JCC(P) CC3

 

Support for AVX in Intel and AMD processors:   Intel AVX (Intel Advanced Vector Extensions) is a 256-bit instruction set extension to SSE designed for floating-point-intensive applications (see here and here for detailed information from Intel and AMD respectively).  In the VS2010 release, all AVX features and instructions are fully supported via intrinsics and /arch:AVX.  Many optimizations have been added to improve the code quality of AVX code generation, which will be described in more detail in an upcoming blog post. In addition to AVX support in the compiler, the Microsoft Macro Assembler (MASM) in VS2010 also supports the Intel AVX instruction set for x86 and x64.

 

 

More precise floating point computation with /fp:fast: To achieve maximum speed, the compiler is allowed to optimize floating point computation aggressively under the /fp:fast option.  The consequence is that floating point errors can accumulate, and a result can become so inaccurate that it severely affects the outcome of the program.  For example, we observed that more than half of the programs in the floating point benchmark suite fail with /fp:fast in VS2008 on X64 targets.  In order to make /fp:fast more useful, we “down-tuned” a couple of optimizations in VS2010. This change could slightly affect the performance of some programs that were previously built with /fp:fast, but will improve their accuracy.  And if your programs were failing with /fp:fast in earlier releases, you may see better results with VS2010.

 

Conclusion: The Visual C++ team cares about the performance of applications built with our compiler, and we continue to work with customers and CPU vendors to improve code generation. If you see issues or opportunities for improvement, please let us know through Connect or through our blog.

 

 

 

 

  • P.P.S.  To belabor the point to the end since many won't get it and never will, that makes it

    __asm mov eax, ecx

    __asm shl eax, 2

    and the shift must wait for the move, obviously.  lea blows the doors off that.  Try it, and bring your hankie.

"a=b*4 should use shl instead of mul" is much more actionable

    Now that is actionable.

  • > Anonymous; longtime C+/Mfc dev.  How about some more friendly names :-)

    Well, you know my real name, check your mails regarding some guy who had issues with SSE -- like a few months before Beta2 -- and reported problems with the Beta1 compiler as he couldn't get his hands on a newer one :D

  • Boots,

    You are not giving the OOO engine the ability to do its job in your asm sample.

    If you rewrite the _asm loop code to use alternating src/dest registers and measure like I did on my Intel(R) Core 2(TM) Quad 6600 system you might find that shl will actually beat lea:

    Here is how I rewrote the lea loop:

    for (UINT32 ix = 0; ix < times; ix++) {

      __asm align 16

      __asm lea eax, [ecx * 4]

      __asm lea ecx, [eax * 4]

      __asm lea eax, [ecx * 4]

      __asm lea ecx, [eax * 4]

      __asm lea eax, [ecx * 4]

      __asm lea ecx, [eax * 4]

      __asm lea eax, [ecx * 4]

      __asm lea ecx, [eax * 4]

    }

    And for the shl loop, I rewrote it this way:

    for (UINT32 ix = 0; ix < times; ix++) {

      __asm align 16

      __asm shl ecx, 2

      __asm shl eax, 2

      __asm shl ecx, 2

      __asm shl eax, 2

      __asm shl ecx, 2

      __asm shl eax, 2

      __asm shl ecx, 2

      __asm shl eax, 2

    }

    When I recompiled with VS2010 Beta 2 (/Ox) and ran it on Vista SP1, 32-bit, the shl loop was about 6.5% faster than the lea loop.

    Also, I am sure that the operation of lea and shift are not identical, because shl writes to the carry flag and lea does not, which could result in a partial flag stall downstream.

    Even when you look at processors based on the "Nehalem" microarchitecture, shifts can be dispatched to 2 ports (0,5) whereas lea can only be dispatched to 1 port (1) per cycle.

    But, we were having fun with asm here so no harm done :)

    Disclaimer: my personal ramblings do not reflect the opinion of Intel Corporation.



  • This may not be the best place to ask about this, but maybe someone on the team can comment, or blog about it elsewhere...

    I've been researching alternative memory allocator libraries, like Hoard, tcmalloc, etc.  The use of these libraries seems fairly straightforward in *nix/gcc environments, but using them with MSVC always seems to involve jumping through strange hoops.  Can someone here comment on the state of MSVC with regard to its openness to replacing the runtime's memory allocator, and is this something that has received any thought or attention for 2010?

  • @JuanRod

    MS recommends compiling in release with /O2 not /Ox.  What happens when you do that?

  • @JK:

    First, the literal definitions. MSDN defines exactly what flags /O2 and /Ox give:

    /O2: /Ob2 /Og /Oi /Ot /Oy /Gs /GF /Gy

    /Ox: /Ob2 /Og /Oi /Ot /Oy

    So, as you can see, /O2 and /Ox both:

    * turns on inlining (/Ob2)

    * turns on global optimizations (/Og)

    * turns on intrinsics (/Oi)

    * favors execution speed (/Ot)

    * turns on frame pointer omission, an optimization (/Oy).

    However, /O2 _also_ turns on

    * stack probes (/Gs)

    * string pooling (/GF)

    * comdat folding (/Gy)

    COMDAT folding and string pooling will both reduce the size of your executable, but they also have possible run-time effects: they operate in the grey area of "implementation-specific undefined behavior". For example, the /GF switch pools string literals into a read-only page of memory, which will break C programs that mutate string literals. COMDAT folding will combine unrelated functions that are identical, but will cause problems if you compare function pointer values.*

    So, for those codebases where you already have /Ox as your release build, you may not want to move to /O2, because /GF & /Gy might change your runtime behavior.

    And that's why we have /O2 and /Ox, but we recommend you building in /O2 if you can.

    *(For the nitpickers, /GF is turned on with /ZI, /Zi, or /Z7, i.e. debug info, so it's almost certainly on.)

    Lin Xu

    Program Manager

    C++ compiler

  • If /fp:fast was modified to disable optimizations that destabilize numerical results, what are the differences now between /fp:fast and /fp:precise?

    Also, what is the difference in performance between the two after the changes?

    Thanks.

  • There are still many optimizations that are enabled only under /fp:fast, not under /fp:precise. The changes in the 2010 release turned off (or made more conservative) a few of the top-offending optimizations.

    In terms of performance, the difference depends on compile options, underlying hardware, the nature of the program, and so on.

    For Spec2k6 FP suite, we observed 4% gain on X86 and 1.6% on X64 architecture when it's built with LTCG, Pogo and /arch:SSE2.

    Ten Tzen

    Principal Architect

    C++ Compiler


  • We have a lot of C code in our software, and I'd like to ask do all these new improvements apply to C too or is only C++ improved in VS 2010?

    Thanks

  • This all applies to 'C' as well.

    -Andre Vachon

    Lead Program Manager

    C & C++ Compiler :-)

  • Could someone kindly explain how I can install the Intel C++ compiler plugin into Visual Studio 2010, so that I have the option to switch compilers.

    I can select the x86, x64, and ia64 options, but I cannot, as with Visual Studio 8, switch to the Intel compiler.

    Thank you !

  • I'm from the Intel C++ Compiler support group. Regarding Pieter Viljoen's question about the Intel C++ Compiler plugin for Visual Studio 2010: it's being worked on. Our next major release of the Intel C++ or Visual Fortran Compiler for Windows and the next release of Intel Parallel Composer will support VS2010 and will be plugged into the IDE. You will be able to select the Intel C++ Compiler from the project property page for VS2010 projects.

    If you have more questions about it, you can post questions to http://software.intel.com/en-us/forums/intel-c-compiler forum.

    Thanks,

    Jennifer Jiang
