Visual C++ Code Generation in Visual Studio 2010

Hello, I’m Ten Tzen, a Compiler Architect on the Visual C++ Compiler Code Generation team. Today, I’m going to introduce some noteworthy improvements in Visual Studio 2010.

 

Faster LTCG Compilation:  LTCG (Link Time Code Generation) allows the compiler to perform better optimizations with information on all modules in the program (for more details see here).  To merge information from all modules, LTCG compilation generally takes longer than non-LTCG compilation, particularly for large applications.  In VS2010, we improved the information merging process and sped up LTCG compilation significantly. An LTCG build of Microsoft SQL Server (an application with .text size greater than 50MB) is sped up by ~30%.

 

Faster Pogo Instrumentation run:  Profile Guided Optimization (PGO) is an approach to optimization where the compiler uses profile information to make better optimization decisions for the program.  See here or here for an introduction to PGO.  One major drawback of PGO is that the instrumented run is usually several times slower than a regular optimized run.  In VS2010, we support a no-lock version of the instrumented binaries.  With it, the instrumented scenario (PGI) runs are about 1.7X faster.

 

Code size reduction for X64 target: Code size is a crucial factor in performance, especially for applications that are sensitive to instruction cache behavior or working set size.  In VS2010, several effective optimizations were introduced or improved for the X64 architecture. Some of the improvements are listed below:

·         More aggressively use RBP as the frame pointer to access local variables. RBP-relative address mode is one byte shorter than RSP-relative.

·         Enable tail merge optimizations in the presence of C++ EH or Windows SEH (see here and here for EH or SEH).

·         Combine successive constant stores to one store. 

·         Recognize more cases where we can emit 32-bit instruction for 64-bit immediate constants.

·         Recognize more cases where we can use a 32-bit move instead of a 64-bit move.

·         Optimize the code sequence of C++ EH destructor funclets.

 

Altogether, we have observed code size reduction in the range of 3% to 10% with various Microsoft products such as the Windows kernel components, SQL, Excel, etc.
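As a rough illustration of the "combine successive constant stores" item above, the sketch below writes constants to four adjacent one-byte fields. The `Header` struct and `reset_header` function are hypothetical names invented for this example. Whether a given compiler actually merges the four byte stores into one 32-bit store depends on the compiler version and flags, so it is worth inspecting the generated assembly rather than assuming.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 4-byte record, used only for illustration.
struct Header {
    std::uint8_t version;
    std::uint8_t flags;
    std::uint8_t kind;
    std::uint8_t reserved;
};

static_assert(sizeof(Header) == 4, "the four fields are expected to be adjacent");

// Four successive constant stores to adjacent bytes. An optimizer may
// combine them into a single 32-bit store of one immediate value.
inline void reset_header(Header* h) {
    assert(h != nullptr);
    h->version  = 1;
    h->flags    = 0;
    h->kind     = 2;
    h->reserved = 0;
}
```

The source code stays field-by-field and readable; any store merging happens in the back end.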

 

Improvements for “Speed”:  As usual, many code quality tunings and improvements were also made across different code generation areas for “speed”.  In this release, we have focused more on the X64 target.  The following are some of the important changes that have contributed to these improvements:

·         Identify and use CMOV instruction when beneficial in more situations

·         More effectively combine induction variables to reduce register pressure

·         Improve detection of region constants for strength reduction in a loop

·         Improve scalar replacement optimization in a loop

·         Better avoidance of store-forwarding stalls

·         Use XMM registers for memcpy intrinsic

·         Improve Inliner heuristics to identify and make more beneficial inlining decisions

Overall, we see an 8% improvement as measured by integer benchmarks and a few percentage points on the floating point suites for X64.
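To make the CMOV item more concrete, here is a minimal sketch of the kind of source pattern that is a natural candidate for conditional moves: a small select with no side effects on either side. The function name `clamp_int` is hypothetical, and whether the compiler emits CMOV or a branch here is its own decision, based on the target and its heuristics.

```cpp
#include <cassert>

// Clamps x into [lo, hi] using two simple conditional expressions.
// Each select has cheap, side-effect-free operands, so a compiler may
// lower it to a compare followed by CMOV instead of a conditional jump.
inline int clamp_int(int x, int lo, int hi) {
    assert(lo <= hi);
    const int at_least_lo = (x < lo) ? lo : x;
    return (at_least_lo > hi) ? hi : at_least_lo;
}
```

Writing the selects as plain conditional expressions leaves the choice between CMOV and a branch to the optimizer.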

 

Better SIMD code generation for X86 and X64 targets:  The quality of SSE/SSE2 SIMD code is crucial to game, audio, video and graphic developers.  Unlike inline asm which inhibits compiler optimization of surrounding code, intrinsics were designed to allow more effective optimization and still give developers access to low-level control of the machine.  In VS2010, we have added several simple but effective optimizations that focus on SIMD intrinsic quality and performance.  Some of the improvements are listed below:

 

·         Break false dependency:  The scalar convert instructions (CVTSI2SD, CVTSI2SS, CVTSS2SD, or CVTSD2SS) do not modify the upper bits of the destination register. This causes a false dependency on the register's previous contents, which can significantly affect performance. To break the false dependency for memory-to-register conversions, the VS2010 compiler inserts MOVD/MOVSS/MOVSD to zero out the upper bits and then uses the corresponding packed conversion.  For instance,

 

cvtsi2ss xmm0, mem-operand     →     movd     xmm0, mem-operand
                                     cvtdq2ps xmm0, xmm0

For register to register conversions, XORPS is inserted to break the false dependency.

cvtsd2ss xmm1, xmm0            →     xorps    xmm1, xmm1
                                     cvtsd2ss xmm1, xmm0

Even though this optimization may increase code size, we have observed a significant performance improvement on several real-world programs and benchmarks.

 

·         Perform vectorization for constant vector initializations: In VS2008, a simple initialization statement, such as __m128 x = { 1, 2, 3, 4 }, would require ~10 instructions. With VS2010, it’s optimized down to a couple of instructions.  This applies to array and multi-dimensional initialization as well.  The instructions generated for initialization statements like __m128 x[] = {{1,2,3,4}, {5,6}} or __m128 t2[][2]= {{{1,2},{3,4,5}}, {{6},{7,8,9}}};  are greatly reduced with VS2010.

 

·         Optimize the _mm_set_*(), _mm_setr_*() and _mm_set1_*() intrinsic families.  In VS2008, a series of unpack instructions is used to combine the scalar values. When all arguments are constants, this can be achieved with a single vector instruction.  For example, the single statement return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7) would require ~20 instructions in previous releases, while only one instruction is required in VS2010.

 

·         Better register allocation for XMM registers, thus removing many redundant loads, stores and moves.

·         Enable Compare & JCC CSE (Common Sub-expression Elimination) for SSE compares.  For example, the code sequence below at left will be optimized to the code sequence at right:

 

ECX, CC1 = PCMPISTRI                 ECX, CC1 = PCMPISTRI
JCC(EQ) CC1                          JCC(EQ) CC1
ECX, CC2 = PCMPISTRI           →     JCC(ULT) CC2
JCC(ULT) CC2                         JCC(P) CC3
ECX, CC3 = PCMPISTRI
JCC(P) CC3
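Two of the items in the list above can be tied back to ordinary source code. The sketch below is only illustrative and all three function names are hypothetical. `sum_as_float` performs scalar int-to-float conversions in a loop, which on SSE targets are typically lowered to the scalar convert instructions discussed under "Break false dependency". `constant_vector` uses the all-constant `_mm_set_epi16` call quoted above, and `store_lanes` copies the result to memory so the lane order can be inspected. The example assumes an x86 or x64 target with SSE2 available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (x86/x64 only)

// Each static_cast<float>(int) here is a scalar conversion. How it is
// lowered (CVTSI2SS or a MOVD + packed convert pair) is up to the compiler.
inline float sum_as_float(const int* values, std::size_t count) {
    assert(values != nullptr || count == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += static_cast<float>(values[i]);
    }
    return total;
}

// All arguments are compile-time constants, so the whole vector is known
// at compile time and can be materialized from a constant in memory.
// _mm_set_epi16 takes its arguments from the highest lane down to lane 0.
inline __m128i constant_vector() {
    return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7);
}

// Copies the eight 16-bit lanes to memory, lowest lane first.
inline void store_lanes(__m128i v, std::int16_t* out) {
    assert(out != nullptr);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

Because `_mm_set_epi16` lists lanes from high to low, the last argument (7) ends up in lane 0 and the first argument (0) in lane 7.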

 

Support for AVX in Intel and AMD processors:   Intel AVX (Intel Advanced Vector Extensions) is a 256-bit instruction set extension to SSE and is designed for applications that are floating point intensive (see here and here for detailed information from Intel and AMD respectively).  In the VS2010 release, all AVX features and instructions are fully supported via intrinsics and /arch:AVX.  Many optimizations have been added to improve the code quality of AVX code generation, which will be described in more detail in an upcoming blog post. In addition to AVX support in the compiler, the Microsoft Macro Assembler (MASM) in VS2010 also supports the Intel AVX instruction set for x86 and x64.

 

 

More precise Floating Point computation with /fp:fast: To achieve maximum speed, the compiler is allowed to optimize floating point computation aggressively under the /fp:fast option.  The consequence is that floating point computation errors can accumulate, and a result could be so inaccurate that it severely affects the outcome of a program.  For example, we observed that more than half of the programs in the floating-point benchmark suite fail with /fp:fast in VS2008 on the X64 targets.  In order to make /fp:fast more useful, we “down-tuned” a couple of optimizations in VS2010. This change could slightly affect the performance of some programs that were previously built with /fp:fast but will improve their accuracy.  And if your programs were failing with /fp:fast in earlier releases, you may see better results with VS2010.
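To see why aggressive floating point optimization can change results, it helps to remember that floating point addition is not associative. The sketch below is not taken from the compiler; it simply sums the same values in two different orders, which is the kind of reordering a fast floating point mode is allowed to perform. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Adds the elements from first to last.
inline double sum_forward(const double* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

// Adds the same elements from last to first.
inline double sum_backward(const double* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    double total = 0.0;
    for (std::size_t i = n; i-- > 0; ) {
        total += v[i];
    }
    return total;
}
```

For the input {1e20, -1e20, 1.0}, evaluated in strict IEEE double precision, the forward sum is 1.0 while the backward sum is 0.0, because the 1.0 is absorbed when it is added to -1e20 first. A compiler that is free to reorder additions can therefore turn one answer into the other.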

 

Conclusion: The Visual C++ team cares about the performance of applications built with our compiler and we continue to work with customers and CPU vendors to improve code generation. If you see issues or opportunities for improvements, please let us know through Connect or through our blog.

 

 

 

 

Published Monday, November 02, 2009 10:56 AM by vcblog

Comments

Monday, November 02, 2009 11:31 AM by Ooh

# re: Visual C++ Code Generation in Visual Studio 2010

Woohoo! Your achievements are awesome!!! Can't wait for finally using it :-)

Monday, November 02, 2009 11:46 AM by Stefan

# re: Visual C++ Code Generation in Visual Studio 2010

If the code generation for intrinsics is improved, that would be great. I have seen wrong code generation with intrinsics, and have to manually unroll any loop with intrinsics - the compiler seems to think they are function calls when it makes unroll decisions. The CMOV improvements would be great, too - right now the only way to force the compiler to use those is to promote all sub-expressions to variables and use very awkward ternary operator sequences.

Monday, November 02, 2009 11:58 AM by Andre Vachon

# re: Visual C++ Code Generation in Visual Studio 2010

Have you seen these issues with the Visual Studio 2010 BETA 2 compiler?

Sharing sample code that exposes the problem (or missed opportunity) would be fantastic.  We do have a long list of optimizations we want to add\improve in the compiler, but specific test cases are always great to help us validate we really are addressing the issue you identified.

-Andre

Lead Program Manager

C++ compiler

Monday, November 02, 2009 12:59 PM by repi

# re: Visual C++ Code Generation in Visual Studio 2010

As a game developer, all of these optimizations are great news, thanks!

Monday, November 02, 2009 1:13 PM by Cory Nelson

# re: Visual C++ Code Generation in Visual Studio 2010

I found out about the AVX stuff on my own but the rest is news!  Thanks for the continued work.

Monday, November 02, 2009 11:28 PM by Bryan Hayes

# re: Visual C++ Code Generation in Visual Studio 2010

Do you have any numbers on the speedup of VS 2010 vs. VS 2008 or VS 2005 for x86?

Monday, November 02, 2009 11:48 PM by QbProg

# re: Visual C++ Code Generation in Visual Studio 2010

That's what I wanted to hear!!

Tuesday, November 03, 2009 12:03 AM by Samsa

# re: Visual C++ Code Generation in Visual Studio 2010

Great news! Can't wait to test it out. Thanks a lot.

Tuesday, November 03, 2009 12:38 AM by Anonymous

# re: Visual C++ Code Generation in Visual Studio 2010

Cool, at least for the SSE2 stuff, I can confirm that it's indeed working, well done. You should also mention that returning using foo.m128_f32 [0] now does a simple store instead of doing weird stuff :) Another change which is worth mentioning is that _mm_mul_ps (foo, _mm_load_ps (bar)) now gets compiled to mulps XMM1, XMMWORD PTR [ecx] instead of always loading both into a register (very annoying as it increased register pressure for no good reason).

Unfortunately, the template inline heuristic is still buggy (bug 351744) -- any chance that a SP for VS2010 will fix that, or is this really for the next release cycle?

Tuesday, November 03, 2009 6:07 AM by Andre Vachon

# re: Visual C++ Code Generation in Visual Studio 2010

RE: Bryan Hayes:

The numbers quoted in the "Improvements for “Speed”" section are VS2010 improvements over VS2008.  We haven't taken the time to compare to VS2005.

Andre Vachon

Lead Program Manager

C++ compiler

Tuesday, November 03, 2009 6:26 AM by Tony

# re: Visual C++ Code Generation in Visual Studio 2010

Cool! Very interesting article.

Is "Faster Pogo" a typo? Or do you use it like this in Microsoft?

Tuesday, November 03, 2009 6:35 AM by Alex

# re: Visual C++ Code Generation in Visual Studio 2010

Nice to see the native compiler evolve. What could bring more pleasure to any true hardcore C++ coder?

The one thing that keeps me sad, VC++ release after release, is the compiler's behavior with the memcpy function. When you call it with a constant size, the compiler inlines it aggressively, uses SSE2 and produces very fast code...

But if you ever try to give it a dynamic size, a "call memcpy" is inserted. Regardless of any "intrinsic" settings, a full call to either the CRT DLL or the static version of the function is inserted. Moreover, even for a very simple copy of several bytes, the processor passes through about 10 branches before it starts copying anything.

Another problem: if you want to compile a small binary without a CRT dependency and happen to use memcpy/memset (even implicitly), you lose. I remember rewriting the simplest construction of the form

SOMESTRUCTURE s={0};

to per-field zeroing just to avoid implicitly calling memset, which I could not get the compiler to inline.

The behaviour has changed from VS 2005 to VS 2008 (or from .NET to VS 2005, can't remember more precisely). Before that, calls to memcpy with dynamic size were nicely inlined.

I once tried to ask someone from MS here and in forums and was given an answer that thus it was "faster". Still can't believe this...

Tuesday, November 03, 2009 7:21 AM by Andre Vachon

# re: Visual C++ Code Generation in Visual Studio 2010

Re: Alex

So you would like to see the compiler inline a rep movsb\movsd of some kind for the dynamic sized copies ?

Tuesday, November 03, 2009 9:06 AM by Stefan

# re: Visual C++ Code Generation in Visual Studio 2010

I'll see if we can get sample code - we just shrugged off the problem and worked around it. The issue was code like this (resulting from a macro that needs a temp variable):

{
 {
    __m128i foo;
    // access foo
 }
 {
    __m128i foo;
    // access foo
 }
 {
    __m128i foo;
    // access foo
 }
}

The assembly showed that access to foo in each scope was garbled as if scoping did not apply; not using nested scopes but a single local foo fixed the issue. We'll see if we can resurrect the test case and send it through our Microsoft contact.

We'll give 2010 a spin with it too. However, Microsoft has consistently pointed us to beta versions of compilers as solutions to our problems. I cannot go to my supervisor and suggest we switch our code base to an unreleased and unsupported compiler regardless of what it fixes.

Tuesday, November 03, 2009 10:04 PM by Phaeron

# re: Visual C++ Code Generation in Visual Studio 2010

The set intrinsics are indeed greatly improved in beta 2 -- that will make things easier.

It looks like it's still impossible to generate a 32-bit load or store from XMM registers, though, with the compiler still generating a MOVD to or from a GPR and then doing a scalar move. The two patterns I've tried are *(int *)p = _mm_cvtsi128_si32(vec) and _mm_cvtsi32_si128(*(int *)p). I work with 32-bit packed pixels often, so this gums up the works a bit.
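For readers who want to try the 32-bit load and store patterns described here, a minimal sketch follows. The helper names `load_pixel` and `store_pixel` are hypothetical. One deliberate difference from the patterns quoted above: the sketch uses `std::memcpy` instead of the `*(int *)p` cast so that it stays well-defined for unaligned or differently typed buffers. Whether the compiler turns each helper into a single MOVD to or from memory is exactly the code generation question raised in this comment, and the sketch makes no claim about that.

```cpp
#include <cassert>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics (x86/x64 only)

// Loads one 32-bit packed pixel into the low lane of an XMM value;
// the upper 96 bits are zeroed by _mm_cvtsi32_si128.
inline __m128i load_pixel(const void* p) {
    assert(p != nullptr);
    int bits;
    std::memcpy(&bits, p, sizeof bits);
    return _mm_cvtsi32_si128(bits);
}

// Stores the low 32 bits of an XMM value to memory.
inline void store_pixel(void* p, __m128i v) {
    assert(p != nullptr);
    const int bits = _mm_cvtsi128_si32(v);
    std::memcpy(p, &bits, sizeof bits);
}
```

Compiling these two helpers with optimizations on and reading the disassembly is a quick way to check whether a given compiler version goes through a general-purpose register.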

Wednesday, November 04, 2009 7:52 AM by Stefan

# re: Visual C++ Code Generation in Visual Studio 2010

Phaeron:

_mm_cvtsi32_si128(...) used to work as a MOVD in 2008 - I've used it as a replacement to some of the _mm_set_* before - are you saying this has changed, or that it goes through a scalar load first and then does a reg-reg MOVD?

BTW, your posts on VS on the VDub blog are always helpful, so if you say the situation has improved greatly - I'll believe it...

Wednesday, November 04, 2009 10:02 PM by Phaeron

# re: Visual C++ Code Generation in Visual Studio 2010

Stefan:

Yes, the problem is that the compiler always generates MOV reg, mem32 + MOVD xmm, reg or MOVD reg, xmm + MOV mem32, reg instead of just a single MOVD instruction. I tried VS2008 and was unable to get it to generate a direct store either.

I can still find cases where _mm_set* is suboptimal. For instance, _mm_set_ps(constant, constant, constant, variable) will generate 4 x load + 3 x unpack whereas two of the 32-bit loads and one unpack could probably be optimized into a single 64-bit load. The all-constants case is now just a load, though, which is the really important case. The case that really made me cry was this:

#include <emmintrin.h>
#include <xmmintrin.h>

__m128i load16() {
    return _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
}

In VS2005, this generates 17 instructions -- 16 byte stores and a vector load. In VS2008, it generates an astounding 94 instructions.

Thursday, November 05, 2009 3:56 AM by Andre Vachon

# re: Visual C++ Code Generation in Visual Studio 2010

And in VS 2010, how many instructions does it generate ?

Thursday, November 05, 2009 5:36 AM by Andre Vachon

# re: Visual C++ Code Generation in Visual Studio 2010

Someone made a statement that "the template inline heuristic is still buggy" and asked whether we were going to fix the problem.

I looked at the "bug" and it describes, in general terms, that our inlining process does not generate as optimal code as one could imagine.

In the compiler space (and in most software development AFAIK), we differentiate between two sets of issues:

1. Code that does the *wrong* thing - aka bugs

2. Code that could do things more efficiently\better.  Some people refer to those as bugs, we like to refer to them as work items.

If you tell the compiler a=b*4 and we decide to multiply by 5, that's a bug and we will get that fixed ASAP, no matter what the milestone is.

Now if you look at the code generated for this and we emit a mul instruction, but you report that the compiler could optimize that better with a << 2 and save a cycle or two, then you might be right.  We'd log this as a work item and prioritize that with the other optimization opportunities we track.
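The a=b*4 example can be written out as a small sketch. The function names are hypothetical, and the sketch says nothing about which instruction a particular compiler picks; it only shows that, for unsigned 32-bit values, the multiply and the shift are interchangeable at the source level, which is what makes this a pure code quality question.

```cpp
#include <cassert>
#include <cstdint>

// The form a programmer would normally write.
inline std::uint32_t times4_mul(std::uint32_t b) { return b * 4u; }

// The strength-reduced form. For unsigned operands both wrap modulo 2^32.
inline std::uint32_t times4_shift(std::uint32_t b) { return b << 2; }

// Spot-checks that the two forms agree for every b below `limit`.
inline void check_equivalence(std::uint32_t limit) {
    for (std::uint32_t b = 0; b < limit; ++b) {
        assert(times4_mul(b) == times4_shift(b));
    }
}
```

Since both forms give identical results, the choice between mul, shl and a scaled lea is left to the back end.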

In the compiler back end, we have very, very few bugs, but *lots* of work items.  We have enough work items to keep the team busy for 5+ years.  So we pick and choose which optimizations we will invest in based on overall customer feedback and the cost\reward of implementing those optimizations.

I'll go further and say that many optimizations can make one piece of code go faster, while it makes other code go slower - loop unrolling being a good example.

Loop unrolling is good if you really go through a loop a lot.  But if you have lots of loops you iterate through only a few times, the code bloat generated by unrolling the loop can make overall execution slower.

This type of tuning is very hard, takes time to implement in the compiler, often requires PGO data specific to the program to help us make the right choice, etc.

So please report the "bugs" and please report the "code is not as optimal as it could be issues".  We look at all of them.  But we do prioritize this feedback differently, so expect different answers.

BTW, as always "the compiler does not generate great code" is not very actionable.  "a=b*4 should use shl instead of mul" is much more actionable :-)

-Andre

Lead Program Manager

C++ compiler

Thursday, November 05, 2009 8:40 AM by Stefan

# re: Visual C++ Code Generation in Visual Studio 2010

Re: Andre

I understand the need to prioritize certain things, but the real problem with intrinsics in VS is that they can produce extremely suboptimal code, and that inline assembly is not supported in 64-bit mode so we cannot really take over in the cases that the compiler goofs up. So, while not being bugs, these are more than just missed optimization opportunities - I would not be using SSE intrinsics if I didn't care about performance, and since I have no other well-integrated option for using SSE (lack of inline assembly), I get to complain about poor code generation - it is in that sense a broken feature (fails to let me optimize code manually). If you prefer to call broken features "work items" and not "bugs" - well, it's your call, but they still generate extra work and frustration for me.

I'll end on a conciliatory note - glad to see you are working on the compiler backend, and hopefully your queue of work items will shrink soon.

Thursday, November 05, 2009 1:52 PM by Boots

# re: Visual C++ Code Generation in Visual Studio 2010

   "a=b*4 should use shl instead of mul" is much more actionable"

Or, how about a scaled lea?  Seems like a no-brainer there, Mr. C++ Compiler Guy.  shl can be slow on some processors.

From what I can remember, the code generated for 4-byte float was terrible compared to 8-byte float (/arch:sse2) with lots of cvts in there when doing 4-byte floats.  I'm sure it was slower by a long shot, too.

Intrinsics have always been a, "Caution: use at your own risk" sort of thing to me since they are so poorly documented.  What's new there!

Thursday, November 05, 2009 4:46 PM by Someone

# re: Visual C++ Code Generation in Visual Studio 2010

Scaled LEA takes a lot of bytes, while shift by constant is fast on any modern CPU.

Thursday, November 05, 2009 10:03 PM by JuanRod

# re: Visual C++ Code Generation in Visual Studio 2010

Andre asked in response to Phaeron

"And in VS 2010, how many instructions does it generate ?"

Here is what comes out of the VS2010 Beta 2 back-end with Phaeron's sample (compiled with /Ox) :

?load16@@YA?AT__m128i@@XZ (union __m128i __cdecl load16(void)):
 00000000: 66 0F 6F 05 00 00 00 00  movdqa      xmm0,xmmword ptr [__xmm@0]
 00000008: C3                       ret

Is that any better? :)

Friday, November 06, 2009 1:25 AM by Anonymous

# re: Visual C++ Code Generation in Visual Studio 2010

> Someone made a statement that "the template inline heuristic is still buggy" and asked whether we were going to fix the problem.

Yeah, that would be me, the original reporter ;) Well, the main point here is that for me -- as a customer -- it's very difficult to tell when/if something is going to be fixed, and it's also difficult to estimate how urgent you think it is. If the template instantiation creates a function which ends with an unconditional call every time, it looks like a bug to me, as the unconditional jump buys you nothing (if you would emit the code directly inline, the linker would be able to remove the unused tail parts, leading to less code.) I also talked with some compiler guys about this, and none of them could give me a good reason, so that's why I consider it a "bug". You have a different opinion, and close the bug as "not solvable" and explain some bright day it might get indeed changed.

So don't get me wrong, I'm happy to hear that

* you noticed

* you have work for several years

but try to see it from a customer perspective -- we rely on the feedback in Connect, and something like ETA VS2012 would already help a lot, instead of "it will be fixed some day". Just change the status to "investigate for N+1 release" or something like this, so I know yep it's not forgotten. I guess I'm biased, as I got some more bugs for which I light a candle every night that they get fixed some day (294564 ;) ). The fun thing is that on some other bugs, you do indeed great work -- I'm very happy how you handled the SSE code generation problems, even though the feedback was rather unspecific at first ("we've made a lot of fixes" compared to "we've fixed all the issues in your report and some more".)

At least the C++ compiler did really make some advances, which is always good to see. Last note, try to make the feedback slightly more useful, especially if feedback is added to acknowledged bugs. Otherwise, some issues really look abandoned.

Friday, November 06, 2009 7:25 AM by longtime c++/Mfc dev

# re: Visual C++ Code Generation in Visual Studio 2010

I agree with @Boots:

 Intrinsics have always been a, "Caution: use at your own risk" sort of thing to me since they are so poorly documented.  What's new there!

Can this documentation be improved? Maybe some comparisons of performance for common ops?  Like with Intrinsics OperatorA may produce Xbytes (avg) additional size, deliver Y% (avg) faster performance.

I mean, isn't that the essence?  Intrinsics take more space but deliver better perf?  Does much perf benchmarking happen between major releases?  This touches on a hot gotcha in VS2005.  I didn't file this issue, but I also got burned: 98890 (VS 2005: Problem with Compiler Intrinsic of strcmp)

Friday, November 06, 2009 1:02 PM by Andre Vachon

# re: Visual C++ Code Generation in Visual Studio 2010

Anonymous; longtime C++/Mfc dev.  How about some more friendly names :-)

It's fine for people to talk about how things were bad in VS2005.  But I hope you do look at the current product and give us some feedback on it.

We do LOTS of benchmark measurements.  We do ridiculous amounts of both correctness and performance testing.  Folks from the test team will be blogging about that in the future.

Based on the results of all this testing and perf measurements, we decide what are the most important work items (or what you call bugs) we are going to tackle.  Then we investigate those areas, see if we can come up with something reasonable that will improve code quality.

When people report issues we always put them on our list.  It would probably be a good thing to remind people that report the issues through connect that we do that.

As for keeping people up to date on what we are working on, that's trickier.  I don't think it's appropriate or good business to go blog in detail about our future product plans.  We will be able to share some general areas we hope to address in the next product (once we've decided on that), but we can't give you the list of all the "bugs" we are working on for the next version of the product.  I'm sure our competitors would like to see that :-)

-Andre Vachon

Lead Program Manager

C++ Compiler

Friday, November 06, 2009 2:20 PM by Joe

# re: Visual C++ Code Generation in Visual Studio 2010

"I don't think it's appropriate or good business to go blog in detail about our future product plans."

I disagree rather strongly. Visual Studio is a tool to help developers create programs which in turn sell Windows and Office. Visual Studio is not an end in and of itself.

I believe that listing all the known "bugs" and being as transparent as possible with developer tools would be a great service to the developer community and would do much to restore your very tarnished reputation.

Friday, November 06, 2009 2:36 PM by Boots

# re: Visual C++ Code Generation in Visual Studio 2010

Try it yourself.  You also get the (free) option of an offset (other than 0).  You may (or may not) get better results setting timeBeginPeriod() to 1ms.

// The two instructions being compared:
__asm lea eax, [ecx * 4]
__asm shl ecx, 2

UINT32 times = 1000 * 1000 * 1000;

// Time eight lea instructions per iteration.
DWORD sTick = GetTickCount();
for (UINT32 ix = 0; ix < times; ix++) {
   __asm lea eax, [ecx * 4]
   __asm lea eax, [ecx * 4]
   __asm lea eax, [ecx * 4]
   __asm lea eax, [ecx * 4]
   __asm lea eax, [ecx * 4]
   __asm lea eax, [ecx * 4]
   __asm lea eax, [ecx * 4]
   __asm lea eax, [ecx * 4]
}
DWORD eTick = GetTickCount();

// Time eight shl instructions per iteration.
DWORD sTick2 = GetTickCount();
for (UINT32 ix = 0; ix < times; ix++) {
   __asm shl ecx, 2
   __asm shl ecx, 2
   __asm shl ecx, 2
   __asm shl ecx, 2
   __asm shl ecx, 2
   __asm shl ecx, 2
   __asm shl ecx, 2
   __asm shl ecx, 2
}
DWORD eTick2 = GetTickCount();

DWORD t1 = eTick - sTick;    // elapsed ms for the lea loop
DWORD t2 = eTick2 - sTick2;  // elapsed ms for the shl loop
t2 = t2;                     // no-op self-assignment
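The timing scaffold above relies on GetTickCount and 32-bit MSVC inline assembly. A portable sketch of the same measuring idea, using `std::chrono::steady_clock`, might look like the following. Everything named here (`time_ms`, `chain_times4`, `g_sink`) is hypothetical. It times ordinary C++ rather than specific instructions, so it cannot settle the lea versus shl question by itself; it only shows the measurement structure.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Sink that keeps the measured result observable so the optimizer
// cannot simply delete the work being timed.
static volatile std::uint32_t g_sink = 0;

// Runs work(iterations) once and returns the elapsed time in milliseconds.
template <typename Work>
double time_ms(Work work, std::uint32_t iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    work(iterations);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// A dependent chain of "multiply by 4, add 1" steps. Each step needs the
// previous result, so the chain cannot be overlapped with itself.
inline std::uint32_t chain_times4(std::uint32_t iterations) {
    std::uint32_t x = 1;
    for (std::uint32_t i = 0; i < iterations; ++i) {
        x = x * 4u + 1u;
    }
    return x;
}
```

A caller would pass a lambda such as `[](std::uint32_t n) { g_sink = chain_times4(n); }` to `time_ms` and compare the timings of two variants built with the same flags.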

Friday, November 06, 2009 2:39 PM by Boots

# re: Visual C++ Code Generation in Visual Studio 2010

P.S. and you get the second operand in there with lea (the destination); with shl you will have to copy to another register at least, else you'll trash b (ecx).

Friday, November 06, 2009 2:56 PM by Boots

# re: Visual C++ Code Generation in Visual Studio 2010

P.P.S.  To belabor the point to the end since many won't get it and never will, that makes it

__asm mov eax, ecx

__asm shl eax, 2

and the shift must wait for the move, obviously.  lea blows the doors off that.  Try it, and bring your hankie.

"a=b*4 should use shl instead of mul" is much more actionable"

Now that is actionable.

Saturday, November 07, 2009 1:56 PM by Anonymous

# re: Visual C++ Code Generation in Visual Studio 2010

> Anonymous; longtime C+/Mfc dev.  How about some more friendly names :-)

Well, you know my real name, check your mails regarding some guy who had issues with SSE -- like a few months before Beta2 -- and reported problems with the Beta1 compiler as he couldn't get his hands on a newer one :D

Saturday, November 07, 2009 3:44 PM by JuanRod

# re: Visual C++ Code Generation in Visual Studio 2010

Boots,

You are not giving the OOO engine the ability to do its job in your asm sample.

If you rewrite the _asm loop code to use alternating src/dest registers and measure like I did on my Intel(R) Core 2(TM) Quad 6600 system you might find that shl will actually beat lea:

Here is how I rewrote the lea loop:

for (UINT32 ix = 0; ix < times; ix++) {
  __asm align 16
  __asm lea eax, [ecx * 4]
  __asm lea ecx, [eax * 4]
  __asm lea eax, [ecx * 4]
  __asm lea ecx, [eax * 4]
  __asm lea eax, [ecx * 4]
  __asm lea ecx, [eax * 4]
  __asm lea eax, [ecx * 4]
  __asm lea ecx, [eax * 4]
}

And for the shl loop, I rewrote it this way:

for (UINT32 ix = 0; ix < times; ix++) {
  __asm align 16
  __asm shl ecx, 2
  __asm shl eax, 2
  __asm shl ecx, 2
  __asm shl eax, 2
  __asm shl ecx, 2
  __asm shl eax, 2
  __asm shl ecx, 2
  __asm shl eax, 2
}

When I recompiled with VS2010 Beta 2 (/Ox) and ran it on Vista SP1, 32-bit, the shl loop was about 6.5% faster than the lea loop.

Also, I am sure that the operations of lea and shift are not identical, because shl writes to the carry flag and lea does not, which could result in a partial flag stall downstream.

Even when you look at processors based on the "Nehalem" microarchitecture, shifts can be dispatched to 2 ports (0,5) whereas lea can only be dispatched to 1 port (1) per cycle.

But, we were having fun with asm here so no harm done :)

Disclaimer: my personal ramblings do not reflect the opinion of Intel Corporation.

Sunday, November 08, 2009 10:41 PM by Tim McD

# re: Visual C++ Code Generation in Visual Studio 2010

This may not be the best place to ask about this, but maybe someone on the team can comment, or blog about it elsewhere...

I've been researching alternative memory allocator libraries, like Hoard, tcmalloc, etc.  The use of these libs seems fairly straightforward in *nix/gcc environments, but using them with MSVC always seems to involve jumping through strange hoops.  Can someone here comment on the state of MSVC with regard to its openness to replacing the runtime's memory allocator, and is this something that has received any thought or attention for 2010?

Monday, November 09, 2009 11:24 AM by JK

# re: Visual C++ Code Generation in Visual Studio 2010

@JuanRod

MS recommends compiling in release with /O2 not /Ox.  What happens when you do that?

Monday, November 09, 2009 2:24 PM by Lin

# re: Visual C++ Code Generation in Visual Studio 2010

@JK:

First, the literal definitions. MSDN defines exactly what flags /O2 and /Ox give:

/O2: /Ob2 /Og /Oi /Ot /Oy /Gs /GF /Gy

/Ox: /Ob2 /Og /Oi /Ot /Oy

So, as you can see, /O2 and /Ox both:

* turn on inlining (/Ob2)

* turn on global optimizations (/Og)

* turn on intrinsics (/Oi)

* favor execution speed (/Ot)

* turn on frame pointer omission, an optimization (/Oy).

However, /O2 _also_ turns on

* stack probes (/Gs)

* string pooling (/GF)

* function-level linking (/Gy), which packages functions as COMDATs so the linker can fold identical ones

COMDAT folding and string pooling will both reduce the size of your executable, but they also have possible run-time effects: they operate in the grey area of "implementation-specific undefined behavior". For example, /GF switch pools string literals into a read-only page of memory, which will break C programs that mutate string literals. COMDAT folding will combine unrelated functions that are identical, but will cause problems if you compare function pointer values.*
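The function pointer caveat can be illustrated with a short sketch. The names below are hypothetical. The C++ standard gives distinct functions distinct addresses, so comparing the addresses of `scale_width` and `scale_height` is expected to yield false. A toolchain that folds identical function bodies may make the two addresses equal, which is the run-time effect described above.

```cpp
#include <cassert>

// Two unrelated functions that happen to have identical bodies.
int scale_width(int w)  { return w * 2; }
int scale_height(int h) { return h * 2; }

using ScaleFn = int (*)(int);

// Returns true when both pointers refer to the same address.
// For scale_width vs. scale_height the standard-conforming answer is
// false, but identical-function folding can change that.
inline bool same_function(ScaleFn a, ScaleFn b) {
    assert(a != nullptr && b != nullptr);
    return a == b;
}
```

Code that uses function addresses as identities (for example, as keys in a dispatch table) is the kind that can be affected, so it is worth testing such code with the exact flags used for release builds.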

So, for those codebases where you already have /Ox as your release build, you may not want to move to /O2, because /GF & /Gy might change your runtime behavior.

And that's why we have /O2 and /Ox, but we recommend building with /O2 if you can.

*(For the nitpickers, /GF is turned on with /ZI, /Zi or /Z7, i.e. debug info, so it's almost certainly on.)

Lin Xu

Program Manager

C++ compiler

Saturday, November 14, 2009 10:33 AM by Mark

# re: Visual C++ Code Generation in Visual Studio 2010

If /fp:fast was modified to disable optimizations that destabilize numerical results, what are the differences now between /fp:fast and /fp:precise?

Also, what is the difference in performance between the two after the changes?

Thanks.

Monday, November 16, 2009 4:26 PM by Ten Tzen

# re: Visual C++ Code Generation in Visual Studio 2010

There are still many optimizations that are enabled only under /fp:fast, not under /fp:precise. The changes in the 2010 release turned off (or made more conservative) a few top-offending optimizations.

In terms of performance difference, it depends on compile options, underlying hardware, the nature of the programs and so on.

For Spec2k6 FP suite, we observed 4% gain on X86 and 1.6% on X64 architecture when it's built with LTCG, Pogo and /arch:SSE2.

Ten Tzen

Principal Architect

C++ Compiler
