Hi, my name is Jim Hogg and I am a Program Manager, working in the Visual C++ compiler team in Microsoft, based on the main campus here in Redmond. More specifically, I work in the part of the compiler that optimizes your code, to make it run faster, or to make it smaller, or a mixture of the two.
In this series of blog posts, I will explain some of the optimizations that make your code run faster. I'll include examples, with measurements of how much gain various optimizations might deliver. I'll then describe some of the more recent optimizations that the team has added, transforming your code in amazing, non-obvious ways.
Who is this blog aimed at? Anyone that is interested in how compilers work. Anyone that wonders how a compiler can possibly make the code you wrote run faster than "what the original C++ code says". I'll also cover the opposite side: some of the patterns that prevent or inhibit optimization. Armed with this knowledge, you might tweak your source code to allow the optimizer more freedom, and make your program run faster.
What are the pre-requisites to follow along with this blog? Some knowledge of programming in C or C++. (Most of the examples I will use can be understood in C. Only towards the end will I examine optimizations that are specific to C++ code – such as "de-virtualization"). Ideally, you should be able to read 64-bit assembler code: that way, you can really see the transformations the optimizations make. But this is not a hard requirement – I'll aim to provide insights without digging all the way down to the binary machine code that the compiler generates.
I will create a Table-of-Contents here for all of the blog posts in this series, updating as I publish each post.
Great, can't wait for the next post, definitely interested in this series!
++1 to that! This promises to be a highly valuable and fun journey. Any chance, Jim, that when the series concludes you could compile a single PDF of the posts? That would be fantastically useful.
Does a float still get loaded into an SSE2 register as a double, do the SISD (doh!) work, and then get converted back to a float? That made using float a lot slower than double.
I agree with Tom! Jim, keep the good stuff coming!
When it comes to leveraging the 20 years of investment in MSVC backend optimizations, Jim, not least your recent work on auto vectorization (which I’m a huge fan of and would love to see you develop further), I really wish MS would adopt an LLVM-style approach.
Let’s see .NET NGen sent to its rest (it barely manages to make modern silicon glow these days) and have a C# front end to the MSVC backend. Let’s have PGO etc for managed code (we might need to revisit what managed means in this context).
In the vast majority of enterprise scenarios I see .NET’s JIT delivering zero value. Worse, it actively precludes things like PGO since by definition the amount of analysis it can perform has to take place in a tiny time window. Sure, there are pathological scenarios where JIT shines, like client-side IL rewriting – but they’re few and far between. What we need is to be able to write C# code and have your auto vectorization backend smear it over the cores’ vector units, without having to PInvoke down to Intel SIMD libraries etc. More to the point, we need some of this ‘for free’, as C++ has now, with the compiler detecting SIMD opportunities in C# code and just doing it. Not sure how much advantage C# could take of C++ style aggressive inlining to dissolve function calls, but there is a huge amount of optimization tech in the MSVC backend that C# could benefit from.
I say all this even though my ‘heart belongs to native’ because I think it would help bring more investment to the native tool chain and that can only benefit my language of choice, C++.
It would be great if you could include some details about how (or even if) the presence of /clr code affects codegen in the non-/clr objects in a mixed mode .exe. I.e., would there be a benefit to the pure native performance if I were somehow able to extract the /clr code out of my exe into a separate dll? Or does introducing an ABI boundary cause a new set of perf concerns?
"Making silicon glow" isn't the primary purpose of a high-level, dynamic language. Its purpose is to make programming simpler. If you want to make silicon glow, then you have to use the tools that are designed for that task. C++ is the highest level language that is designed to be married to the metal and allow the metal to be micromanaged. It's really difficult to take something like Java or C# that is designed for ease of use and try to repurpose it for ease of computing.
Yes, I'll try to remember to gather all of these into a single PDF at the end. Please ping me later if I forget.
Sounds promising can't wait!
The NGEN utility? Yes, we shipped NGEN many years ago, with the JIT compiler as its engine. The JIT does a fine job in balancing throughput against code-quality. But when working in a 'batch' mode, you can afford the extra cycles to perform more aggressive optimizations. (Useless factoid: we initially dubbed NGEN, the "pre-JIT". I'll leave folks to ponder the several dimensions in which this name was an unusual choice)
Everything you propose is good. In effect, "One optimizer to rule them all". Interesting too, to speculate on how such a move might increase the up-take of C#, right?
Well, I think we would agree that the ideal is a high-level language that is easy to use, like C#, but which also "makes silicon glow". Certain aspects of C# make this difficult - Reflection, for example. And certain aspects of the underlying CLR semantics - such as checking array bounds on every access - also make it difficult to match the raw performance of C++.
But is there some "middle ground"? I'd vote to keep garbage-collection - great invention, and worth the extra cycles in order to avoid common bugs. But what about array accesses? Would you, as a programmer, be willing to have array bounds checks hoisted above a loop? So the check would fire early, but would allow well-behaved loops to compile to much faster code. What about __restrict to rein-in aliasing? - having a language where that is the default (eg: FORTRAN) would allow better optimizations. (By the way, I'm not suggesting we change C# or CLR - this is just mental, "what if?", doodling)
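To make the "what if?" concrete, here is a minimal C++ sketch of the two ideas: checking bounds on every access versus one check hoisted above the loop, and using __restrict to promise that two pointers do not overlap. The function names are made up for illustration, and std::vector::at stands in for the CLR's per-access check; this is not how the CLR or the MSVC optimizer implements anything.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Bounds check on every access (std::vector::at stands in for a
// CLR-style checked array access; that mapping is an assumption).
int sum_checked_each(const std::vector<int>& a, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += a.at(i);  // throws std::out_of_range if i >= a.size()
    }
    return total;
}

// One check hoisted above the loop: it fires early, before any
// element is read, and leaves the loop body free of checks.
int sum_checked_once(const std::vector<int>& a, std::size_t n) {
    if (n > a.size()) {
        throw std::out_of_range("n exceeds array length");
    }
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += a[i];  // unchecked; safe because of the test above
    }
    return total;
}

// __restrict is a promise from the programmer that dst and src never
// overlap, so a store to dst[i] cannot change any later src[j].
void scale(float* __restrict dst, const float* __restrict src,
           std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * k;
    }
}
```

Note the behavioral difference in the hoisted version: with a too-large n it throws before doing any work, whereas the per-access version would process the valid elements first and only then throw.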
@float v double
We load a float into the low 32-bits of an XMM register (using MOVSS). And float arithmetic, such as addition, does the calculation on 32-bits (using ADDSS).
Similarly for a double (MOVSD and ADDSD).
In either case, we are using the low 32 or 64 bits of the 128-bit XMM register.
But I'm not sure I am answering your question. Are you saying you are seeing codegen where we always promote a float to a double into the SSE registers?
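For anyone who wants to poke at this, the scalar SSE operations can be written out by hand with intrinsics. This is only a sketch for an x86/x64 target with SSE2: the function names are mine, and the compiler normally picks these instructions by itself for plain float and double arithmetic.

```cpp
#include <immintrin.h>

// float: the value sits in the low 32 bits of an XMM register and the
// add is done on those 32 bits (_mm_add_ss corresponds to ADDSS).
float add_floats(float a, float b) {
    __m128 x = _mm_set_ss(a);  // a in the low lane, upper lanes zeroed
    __m128 y = _mm_set_ss(b);
    return _mm_cvtss_f32(_mm_add_ss(x, y));
}

// double: same idea on the low 64 bits (_mm_add_sd corresponds to ADDSD).
double add_doubles(double a, double b) {
    __m128d x = _mm_set_sd(a);  // a in the low lane, upper lane zeroed
    __m128d y = _mm_set_sd(b);
    return _mm_cvtsd_f64(_mm_add_sd(x, y));
}
```

Neither function converts between float and double at any point; each type stays at its own width throughout.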
I was not planning to cover any /clr topics. But maybe later? - I've added it to the list on the whiteboard.
GC is a robber. A silent robber few know exists. C# will forever be in the non-performance bin no matter how "super" the rest of it is claimed to be.
I could always use fast ... SISD
00162D35 F2 0F 10 05 28 42 16 00 movsd xmm0,mmword ptr [__real@3fb99999a0000000 (164228h)]
00162D3D F2 0F 58 05 A0 43 16 00 addsd xmm0,mmword ptr [__real@0000000000000000 (1643A0h)]
00162D45 66 0F 5A C0 cvtpd2ps xmm0,xmm0
00162D49 F3 0F 5A C0 cvtss2sd xmm0,xmm0
00162D4D F2 0F 58 05 60 42 16 00 addsd xmm0,mmword ptr [__real@4024000000000000 (164260h)]
00162D55 66 0F 5A C0 cvtpd2ps xmm0,xmm0
00162D59 F3 0F 5A C0 cvtss2sd xmm0,xmm0
00162D5D F2 0F 58 05 98 43 16 00 addsd xmm0,mmword ptr [__real@408f400000000000 (164398h)]

00B72BCF F3 0F 10 05 E8 41 B7 00 movss xmm0,dword ptr [__real@3dcccccd (0B741E8h)]
00B72BD7 F3 0F 10 5D D8 movss xmm3,dword ptr [ebp-28h]
00B72BDC F3 0F 11 45 DC movss dword ptr [ebp-24h],xmm0
00B72BE1 F3 0F 10 05 BC 41 B7 00 movss xmm0,dword ptr [__real@3f800000 (0B741BCh)]
00B72BE9 F3 0F 11 45 E0 movss dword ptr [ebp-20h],xmm0
00B72BEE F3 0F 10 05 1C 43 B7 00 movss xmm0,dword ptr [__real@41200000 (0B7431Ch)]
00B72BF6 F3 0F 11 45 E4 movss dword ptr [ebp-1Ch],xmm0
00B72BFB F3 0F 10 05 18 43 B7 00 movss xmm0,dword ptr [__real@42c80000 (0B74318h)]
00B72C03 F3 0F 11 45 E8 movss dword ptr [ebp-18h],xmm0
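One way a float can end up bouncing through double, as in the first listing, comes from the C++ language rules rather than from the optimizer: mixing a float with a double literal such as 10.0 makes the whole expression a double. Whether the source behind the listing above looked like this is a guess on my part; the sketch below, with made-up function names, only shows the language rule.

```cpp
#include <type_traits>

// 10.0 is a double literal, so f is converted to double, the add is
// done in double, and the result is narrowed back to float.
float add_with_double_literal(float f) {
    return static_cast<float>(f + 10.0);
}

// 10.0f is a float literal, so the add stays in float throughout.
float add_with_float_literal(float f) {
    return f + 10.0f;
}

// The expression types are fixed by the language, not by the optimizer.
static_assert(std::is_same<decltype(1.0f + 10.0), double>::value,
              "float + double literal has type double");
static_assert(std::is_same<decltype(1.0f + 10.0f), float>::value,
              "float + float literal has type float");
```

If you see convert instructions around float arithmetic, checking the literals for a missing f suffix is a cheap first step.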