Visual C++ Code Generation in Visual Studio 2010


Hello, I’m Ten Tzen, a Compiler Architect on the Visual C++ Compiler Code Generation team. Today, I’m going to introduce some noteworthy improvements in Visual Studio 2010.

 

Faster LTCG Compilation:  LTCG (Link Time Code Generation) allows the compiler to perform better optimizations with information on all modules in the program (for more details see here).  To merge information from all modules, LTCG compilation generally takes longer than non-LTCG compilation, particularly for large applications.  In VS2010, we improved the information merging process and sped up LTCG compilation significantly. An LTCG build of Microsoft SQL Server (an application with .text size greater than 50MB) is sped up by ~30%.
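
For reference, a minimal LTCG build looks like this (a sketch using the standard compiler and linker switches; real projects usually set these through the IDE's Whole Program Optimization settings):

cl /c /O2 /GL a.cpp b.cpp
link /LTCG a.obj b.obj /OUT:app.exe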

 

Faster Pogo instrumentation runs:  Profile Guided Optimization (PGO) is an approach to optimization where the compiler uses profile information to make better optimization decisions for the program.  See here or here for an introduction to PGO.  One major drawback of PGO is that the instrumented run is usually several times slower than a regular optimized run.  In VS2010, we support a no-lock version of the instrumented binaries.  With that, the instrumented scenario (PGI) runs are about 1.7x faster.
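
For reference, a typical PGO build cycle looks like this (a sketch; the file and scenario names are placeholders):

cl /c /O2 /GL app.cpp
link /LTCG:PGINSTRUMENT app.obj /OUT:app.exe     (PGI: produce the instrumented binary)
app.exe                                          (run training scenarios; this is the step that is now ~1.7x faster)
link /LTCG:PGOPTIMIZE app.obj /OUT:app.exe       (PGO: re-link using the collected profile)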

 

Code size reduction for the x64 target: Code size is a crucial factor in performance, especially for applications that are sensitive to instruction cache behavior or working set size.  In VS2010, several effective optimizations are introduced or improved for the x64 architecture. Some of the improvements are listed below:

·         More aggressively use RBP as the frame pointer to access local variables; the RBP-relative address mode is one byte shorter than the RSP-relative one (see the sketch after this list).

·         Enable tail merge optimizations in the presence of C++ EH or Windows SEH (see here and here for EH or SEH).

·         Combine successive constant stores into one store (also illustrated after this list).

·         Recognize more cases where we can emit a 32-bit instruction for 64-bit immediate constants.

·         Recognize more cases where we can use a 32-bit move instead of a 64-bit move.

·         Optimize the code sequence of C++ EH destructor funclets.
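
To make the frame-pointer and constant-store items above concrete, here is a sketch of the encodings involved (byte counts follow from the x64 instruction format; the compiler's actual output depends on the frame layout):

mov eax, dword ptr [rsp+8]     ; 8B 44 24 08 - RSP-relative addressing needs a SIB byte (4 bytes)
mov eax, dword ptr [rbp-8]     ; 8B 45 F8    - RBP-relative disp8 form (3 bytes)

mov dword ptr [rcx], 0         ; two adjacent 32-bit constant stores...
mov dword ptr [rcx+4], 0
mov qword ptr [rcx], 0         ; ...combined into a single 64-bit store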

 

Altogether, we have observed code size reduction in the range of 3% to 10% with various Microsoft products such as the Windows kernel components, SQL, Excel, etc.

 

Improvements for “Speed”:  As usual, there is also much code quality tuning and improvement across different code generation areas for “speed”.  In this release, we have focused more on the x64 target.  The following are some of the important changes that contributed to these improvements:

·         Identify and use the CMOV instruction in more situations where it is beneficial (see the sketch after this list)

·         Combine induction variables more effectively to reduce register pressure

·         Improve detection of region constants for strength reduction in a loop

·         Improve scalar replacement optimization in a loop

·         Better avoidance of store-forwarding stalls

·         Use XMM registers for memcpy intrinsic

·         Improve inliner heuristics to identify and make more beneficial inlining decisions
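
As a small illustration of the CMOV item above (a sketch; whether cmov is actually emitted depends on the compiler's heuristics and the surrounding code):

// A classic cmov candidate: cmp + cmovg avoids a conditional branch
// and the misprediction penalty that can come with it.
int max_int(int a, int b)
{
    return a > b ? a : b;
}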

Overall, we see an 8% improvement as measured by integer benchmarks and a few percentage points on the floating-point suites for x64.

 

Better SIMD code generation for x86 and x64 targets:  The quality of SSE/SSE2 SIMD code is crucial to game, audio, video and graphics developers.  Unlike inline asm, which inhibits compiler optimization of surrounding code, intrinsics were designed to allow more effective optimization while still giving developers access to low-level control of the machine.  In VS2010, we have added several simple but effective optimizations that focus on SIMD intrinsic quality and performance.  Some of the improvements are listed below:

 

·         Break false dependencies:  The scalar convert instructions (CVTSI2SD, CVTSI2SS, CVTSS2SD, or CVTSD2SS) do not modify the upper bits of the destination register. This creates a false dependency, which can significantly affect performance. To break the false dependency for memory-to-register conversions, the VS2010 compiler inserts a MOVD/MOVSS/MOVSD to zero the upper bits and uses the corresponding packed conversion.  For instance,

 

cvtsi2ss xmm0, mem-operand      →      movd     xmm0, mem-operand
                                       cvtdq2ps xmm0, xmm0

For register-to-register conversions, XORPS is inserted to break the false dependency.

cvtsd2ss xmm1, xmm0             →      xorps    xmm1, xmm1
                                       cvtsd2ss xmm1, xmm0

Even though this optimization may increase code size, we have observed significant performance improvements on several real-world programs and benchmarks.
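
In source terms, the conversions in question come from ordinary casts. For example (a minimal sketch; the function names are illustrative):

// int -> float uses cvtsi2ss; to avoid the false dependency on the
// destination's upper bits, VS2010 may emit movd + cvtdq2ps instead.
float int_to_float(int i)
{
    return static_cast<float>(i);
}

// double -> float uses cvtsd2ss; here an xorps on the destination
// register is inserted first to break the dependency.
float double_to_float(double d)
{
    return static_cast<float>(d);
}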

 

·         Perform vectorization for constant vector initializations: In VS2008, a simple initialization statement, such as __m128 x = { 1, 2, 3, 4 }, would require ~10 instructions. With VS2010, it’s optimized down to a couple of instructions.  This applies to multi-dimensional array initialization as well.  The instructions generated for initialization statements like __m128 x[] = {{1,2,3,4}, {5,6}} or __m128 t2[][2]= {{{1,2},{3,4,5}}, {{6},{7,8,9}}}; are greatly reduced in VS2010.
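
A minimal compilable version of the simple case (a sketch):

#include <xmmintrin.h>

// VS2008 expanded this initializer into roughly ten scalar instructions;
// VS2010 loads the whole value from a 16-byte constant in a couple of
// instructions.
__m128 make_vector()
{
    __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f };
    return x;
}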

 

·         Optimize the _mm_set_*(), _mm_setr_*() and _mm_set1_*() intrinsic families.  In VS2008, a series of unpack instructions was used to combine the scalar values. When all arguments are constants, this can be achieved with a single vector instruction.  For example, the single statement return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7) required ~20 instructions in previous releases, while only one instruction is required in VS2010.
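
For example (a sketch):

#include <emmintrin.h>

// With all-constant arguments, VS2010 emits a single vector load from
// the constant pool instead of a series of unpack instructions.
__m128i make_constants()
{
    return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7);
}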

 

·         Better register allocation for XMM registers, removing many redundant loads, stores and moves.

·         Enable Compare & JCC CSE (Common Sub-expression Elimination) for SSE compares.  For example, the code sequence below at left will be optimized to the code sequence at right:

 

ECX, CC1 = PCMPISTRI                    ECX, CC1 = PCMPISTRI
JCC(EQ) CC1                             JCC(EQ) CC1
ECX, CC2 = PCMPISTRI          →         JCC(ULT) CC2
JCC(ULT) CC2                            JCC(P) CC3
ECX, CC3 = PCMPISTRI
JCC(P) CC3

 

Support for AVX in Intel and AMD processors:   Intel AVX (Intel Advanced Vector Extensions) is a 256-bit instruction set extension to SSE, designed for floating-point-intensive applications (see here and here for detailed information from Intel and AMD respectively).  In the VS2010 release, all AVX features and instructions are fully supported via intrinsics and /arch:AVX.  Many optimizations have been added to improve the quality of AVX code generation; these will be described in more detail in an upcoming blog post. In addition to AVX support in the compiler, the Microsoft Macro Assembler (MASM) in VS2010 also supports the Intel AVX instruction set for x86 and x64.
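
For example, the following compiles to a single VEX-encoded 256-bit instruction (a sketch; build with /arch:AVX):

#include <immintrin.h>

// With /arch:AVX this maps to one vaddps ymm instruction.
__m256 add_eight_floats(__m256 a, __m256 b)
{
    return _mm256_add_ps(a, b);
}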


More precise floating-point computation with /fp:fast: To achieve maximum speed, the compiler is allowed to optimize floating-point computation aggressively under the /fp:fast option.  The consequence is that floating-point computation errors can accumulate, and a result can be so inaccurate that it severely affects the outcome of a program.  For example, we observed that more than half of the programs in the floating-point benchmark suite failed with /fp:fast in VS2008 on x64 targets.  In order to make /fp:fast more useful, we “down-tuned” a couple of optimizations in VS2010. This change may slightly affect the performance of some programs that were previously built with /fp:fast, but it improves their accuracy.  And if your programs were failing with /fp:fast in earlier releases, you may see better results with VS2010.
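
A typical pattern affected by /fp:fast is a reduction loop (a sketch):

// Under /fp:fast the compiler may reassociate or vectorize this sum, so
// it rounds in a different order than the source specifies; under
// /fp:precise the source evaluation order is preserved.
double sum(const double* p, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += p[i];
    return s;
}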

 

Conclusion: The Visual C++ team cares about the performance of applications built with our compiler, and we continue to work with customers and CPU vendors to improve code generation. If you see issues or opportunities for improvement, please let us know through Connect or through our blog.


  • The set intrinsics are indeed greatly improved in beta 2 -- that will make things easier.

    It looks like it's still impossible to generate a 32-bit load or store from XMM registers, though, with the compiler still generating a MOVD to or from a GPR and then doing a scalar move. The two patterns I've tried are *(int *)p = _mm_cvtsi128_si32(vec) and _mm_cvtsi32_si128(*(int *)p). I work with 32-bit packed pixels often, so this gums up the works a bit.
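
    Spelled out as a minimal repro (the function names are mine):

    #include <emmintrin.h>

    // Would ideally be a single movd dword ptr [mem], xmm.
    void store_lo32(void *p, __m128i v) {
        *(int *)p = _mm_cvtsi128_si32(v);
    }

    // Would ideally be a single movd xmm, dword ptr [mem].
    __m128i load_lo32(const void *p) {
        return _mm_cvtsi32_si128(*(const int *)p);
    }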

  • Phaeron:

    _mm_cvtsi32_si128(...) used to work as a MOVD in 2008 - I've used it as a replacement to some of the _mm_set_* before - are you saying this has changed, or that it goes through a scalar load first and then does a reg-reg MOVD?

    BTW, your posts on VS on the VDub blog are always helpful, so if you say the situation has improved greatly - I'll believe it...

  • Stefan:

    Yes, the problem is that the compiler always generates MOV reg, mem32 + MOVD xmm, reg or MOVD reg, xmm + MOV mem32, reg instead of just a single MOVD instruction. I tried VS2008 and was unable to get it to generate a direct store either.

    I can still find cases where _mm_set* is suboptimal. For instance, _mm_set_ps(constant, constant, constant, variable) will generate 4 x load + 3 x unpack whereas two of the 32-bit loads and one unpack could probably be optimized into a single 64-bit load. The all-constants case is now just a load, though, which is the really important case. The case that really made me cry was this:

    #include <emmintrin.h>
    #include <xmmintrin.h>

    __m128i load16() {
        return _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    }

    In VS2005, this generates 17 instructions -- 16 byte stores and a vector load. In VS2008, it generates an astounding 94 instructions.

  • And in VS 2010, how many instructions does it generate?

  • Someone made a statement that "the template inline heuristic is still buggy" and asked whether we were going to fix the problem.

    I looked at the "bug" and it describes, in general terms, that our inlining process does not generate code as optimal as one could imagine.

    In the compiler space (and in most software development AFAIK), we differentiate between two sets of issues:

    1. Code that does the *wrong* thing - aka bugs

    2. Code that could do things more efficiently/better.  Some people refer to those as bugs; we like to refer to them as work items.

    If you tell the compiler a=b*4 and we decide to multiply by 5, that's a bug and we will get that fixed ASAP, no matter what the milestone is.

    Now if you look at the code generated for this and we emit a mul instruction, but you report that the compiler could optimize that better with b << 2 and save a cycle or two, then you might be right.  We'd log this as a work item and prioritize it with the other optimization opportunities we track.

    In the compiler back end, we have very, very few bugs, but *lots* of work items.  We have enough work items to keep the team busy for 5+ years.  So we pick and choose which optimizations we will invest in based on overall customer feedback and the cost/reward of implementing those optimizations.

    I'll go further and say that many optimizations can make one piece of code go faster, while it makes other code go slower - loop unrolling being a good example.

    Loop unrolling is good if you really go through a loop a lot.  But if you have lots of loops you iterate through only a few times, the code bloat generated by unrolling the loop can make overall execution slower.

    This type of tuning is very hard, takes time to implement in the compiler, often requires PGO data specific to the program to help us make the right choice, etc.

    So please report the "bugs" and please report the "code is not as optimal as it could be issues".  We look at all of them.  But we do prioritize this feedback differently, so expect different answers.

    BTW, as always "the compiler does not generate great code" is not very actionable.  "a=b*4 should use shl instead of mul" is much more actionable :-)

    -Andre

    Lead Program Manager

    C++ compiler

  • Re: Andre

    I understand the need to prioritize certain things, but the real problem with intrinsics in VS is that they can produce extremely suboptimal code, and that inline assembly is not supported in 64-bit mode, so we cannot really take over in the cases where the compiler goofs up. So, while not being bugs, these are more than just missed optimization opportunities - I would not be using SSE intrinsics if I didn't care about performance, and since I have no other well-integrated option for using SSE (lack of inline assembly), I get to complain about poor code generation - it is in that sense a broken feature (it fails to let me optimize code manually). If you prefer to call broken features "work items" and not "bugs" - well, it's your call, but they still generate extra work and frustration for me.

    I'll end on a conciliatory note - glad to see you are working on the compiler backend, and hopefully your queue of work items will shrink soon.

  •    "a=b*4 should use shl instead of mul" is much more actionable"

    Or, how about a scaled lea?  Seems like a no-brainer there, Mr. C++ Compiler Guy.  shl can be slow on some processors.

    From what I can remember, the code generated for 4-byte floats was terrible compared to 8-byte doubles (/arch:SSE2), with lots of cvt instructions in there.  I'm sure it was slower by a long shot, too.

    Intrinsics have always been a, "Caution: use at your own risk" sort of thing to me since they are so poorly documented.  What's new there!

  • Scaled LEA takes a lot of bytes, while shift by constant is fast on any modern CPU.
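
    For instance (x86 encodings; a sketch):

    lea eax, [ecx*4]    ; 8D 04 8D 00 00 00 00 - 7 bytes (a scaled index with no base forces a disp32)

    mov eax, ecx        ; 8B C1    - 2 bytes
    shl eax, 2          ; C1 E0 02 - 3 bytes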

  • Andre asked in response to Phaeron

    "And in VS 2010, how many instructions does it generate ?"

    Here is what comes out of the VS2010 Beta 2 back-end with Phaeron's sample (compiled with /Ox) :

    ?load16@@YA?AT__m128i@@XZ (union __m128i __cdecl load16(void)):

     00000000: 66 0F 6F 05 00 00 00 00   movdqa      xmm0,xmmword ptr [__xmm@0]

     00000008: C3                        ret

    Is that any better? :)

  • > Someone made a statement that "the template inline heuristic is still buggy" and asked whether we were going to fix the problem.

    Yeah, that would be me, the original reporter ;) Well, the main point here is that for me -- as a customer -- it's very difficult to tell when/if something is going to be fixed, and it's also difficult to estimate how urgent you think it is. If the template instantiation creates a function which ends with an unconditional call every time, it looks like a bug to me, as the unconditional jump buys you nothing (if you would emit the code directly inline, the linker would be able to remove the unused tail parts, leading to less code.) I also talked with some compiler guys about this, and none of them could give me a good reason, so that's why I consider it a "bug". You have a different opinion, and close the bug as "not solvable" and explain some bright day it might get indeed changed.

    So don't get me wrong, I'm happy to hear that

    * you noticed

    * you have work for several years

    but try to see it from a customer perspective -- we rely on the feedback in Connect, and something like ETA VS2012 would already help a lot, instead of "it will be fixed some day". Just change the status to "investigate for N+1 release" or something like this, so I know yep it's not forgotten. I guess I'm biased, as I got some more bugs for which I light a candle every night that they get fixed some day (294564 ;) ). The fun thing is that on some other bugs, you do indeed great work -- I'm very happy how you handled the SSE code generation problems, even though the feedback was rather unspecific at first ("we've made a lot of fixes" compared to "we've fixed all the issues in your report and some more".)

    At least the C++ compiler did really make some advances, which is always good to see. Last note, try to make the feedback slightly more useful, especially if feedback is added to acknowledged bugs. Otherwise, some issues really look abandoned.

  • I agree with @Boots:

     Intrinsics have always been a, "Caution: use at your own risk" sort of thing to me since they are so poorly documented.  What's new there!

    Can this documentation be improved? Maybe some comparisons of performance for common ops?  Like with Intrinsics OperatorA may produce Xbytes (avg) additional size, deliver Y% (avg) faster performance.

    I mean, isn't that the essence?  Intrinsics take more space but deliver better perf?  Does much perf benchmarking happen between major releases?  This touches on a hot gotcha in VS2005.  I didn't file this issue, but I also got burned: 98890 (VS 2005: Problem with Compiler Intrinsic of strcmp)

  • Anonymous; longtime C++/MFC dev.  How about some more friendly names :-)

    It's fine for people to talk about how things were bad in VS2005.  But I hope you do look at the current product and give us some feedback on it.

    We do LOTS of benchmark measurements.  We do ridiculous amounts of both correctness and performance testing.  Folks from the test team will be blogging about that in the future.

    Based on the results of all this testing and perf measurements, we decide what are the most important work items (or what you call bugs) we are going to tackle.  Then we investigate those areas, see if we can come up with something reasonable that will improve code quality.

    When people report issues we always put them on our list.  It would probably be a good thing to remind people that report the issues through connect that we do that.

    As for keeping people up to date on what we are working on, that's trickier.  I don't think it's appropriate or good business to go blog in detail about our future product plans.  We will be able to share some general areas we hope to address in the next product (once we've decided on that), but we can't give you the list of all the "bugs" we are working on for the next version of the product.  I'm sure our competitors would like to see that :-)

    -Andre Vachon

    Lead Program Manager

    C++ Compiler

  • "I don't think it's appropriate or good business to go blog in detail about our future product plans."

    I disagree rather strongly. Visual Studio is a tool to help developers create programs which in turn sell Windows and Office. Visual Studio is not an end in and of itself.

    I believe that listing all the known "bugs" and being as transparent as possible with developer tools would be a great service to the developer community and would do much to restore your very tarnished reputation.

  • Try it yourself.  You also get the (free) option of an offset (other than 0).  You may (or may not) get better results setting timeBeginPeriod() to 1 ms.

    __asm lea eax, [ecx * 4]
    __asm shl ecx, 2

    UINT32 times = 1000 * 1000 * 1000;

    DWORD sTick = GetTickCount();
    for (UINT32 ix = 0; ix < times; ix++) {
        __asm lea eax, [ecx * 4]
        __asm lea eax, [ecx * 4]
        __asm lea eax, [ecx * 4]
        __asm lea eax, [ecx * 4]
        __asm lea eax, [ecx * 4]
        __asm lea eax, [ecx * 4]
        __asm lea eax, [ecx * 4]
        __asm lea eax, [ecx * 4]
    }
    DWORD eTick = GetTickCount();

    DWORD sTick2 = GetTickCount();
    for (UINT32 ix = 0; ix < times; ix++) {
        __asm shl ecx, 2
        __asm shl ecx, 2
        __asm shl ecx, 2
        __asm shl ecx, 2
        __asm shl ecx, 2
        __asm shl ecx, 2
        __asm shl ecx, 2
        __asm shl ecx, 2
    }
    DWORD eTick2 = GetTickCount();

    DWORD t1 = eTick - sTick;
    DWORD t2 = eTick2 - sTick2;
    t2 = t2;    // keep the results alive so the timing isn't optimized away

  • P.S. and you get the second operand in there with lea (the destination); with shl you will have to copy to another register at least, else you'll trash b (ecx).
