Hi, my name is Li Shao. I am a Senior Software Design Engineer in Test in the Visual C++ group. In this blog post, our team reviews Visual Studio 11 (VS11) desktop application build throughput compared to Visual Studio 2010 SP1 (VS10). Jim Hogg, Mark Hall, Bill Bailey, Mohamed Magdy Mohamed, and Valentin Isac also contributed to this post. If you are interested in build throughput for the new Metro Style applications, you can refer to the blog post by Tim Wagner.
Build throughput is one of the most important productivity factors for C++ developers, so build throughput testing is a vital component of our overall performance testing for Visual C++. We have targeted build throughput tests for the compiler and linker, as well as for the MSBuild build engine. For every release, we run performance tests that check overall end-to-end build performance, including time spent in the compiler front-end, back-end, and linker. For those of you who may not know, the compiler front-end is the phase that parses and analyzes the source code to build the intermediate representation (IR) of the program. The compiler back-end is the phase that takes the IR as input, performs optimizations, and generates code in the form of object files. The linker takes these object files as input and assembles them into an executable file. When building with /GL (compile for link-time code generation, or LTCG), code generation and optimization happen during the linker phase, where the IR for the entire executable can be analyzed.
Every new release of Visual C++ contains a large amount of new technology and features, delivered to customers in the form of source code libraries and header files. For a given compiler, the more source code you have, the longer it takes to compile. Fortunately, processor advances over the years have offset this increase in functionality. In recent releases, as processor clock speeds began topping out, we have invested in multi-core features such as multi-proc MSBuild and the /MP compiler switch. In VS11, we multi-threaded the back-end. For builds that spend a lot of time in optimization, this has netted some great wins. For example, Microsoft's SQL Server build is 30% faster thanks to reduced back-end times.
The increase in functionality in VS11 is among the largest we have ever shipped. As we integrated all this new technology into the product, our performance tests told us how it was affecting overall build performance relative to previous releases. Even though the amount of source code being compiled rose significantly, in most cases we held build time increases to an acceptable level.
However, based on our testing results, if you have an application that uses the Standard Template Library (STL) heavily, you may notice slower builds due to the increased functionality mandated by the C++ Standard. In this post, we analyze the build throughput of a representative application to demonstrate both the improvements we made and the slowdowns you may experience.
Our end-to-end throughput tests use production-level, “real world” applications, which we refer to as RWC (Real World Code). Below is build throughput data from an internal, post-Beta version of VS11 for one of our RWC projects, across different machine configurations. This desktop application has about 50 C++ projects with 2.8 million lines of code (LOC), and it makes heavy use of the Standard Template Library. “RWC” below refers specifically to this application.
Figure 1: RWC Build Throughput (MSBuild 2 Proc (/m:2) and Full Optimization (/Ox))
Figure 2: RWC Build throughput with link /LTCG (LTCG: Link Time Code Generation, MSBuild 2 Proc (/m:2), Compile with /GL, link with /LTCG)
Figure 3: RWC Compiler Back-end throughput (MSBuild 2 Proc (/m:2))
Figure 4: RWC Link time throughput (MSBuild 2 Proc (/m:2))
“Full Build time” is measured from when the build starts until it finishes; it is the wall-clock time the build takes. Compiler front-end (FE), compiler back-end (BE), and link time are the total (accumulated) time spent in C1.dll and C1xx.dll (FE), C2.dll (BE), and the linker across all processes. Because this is a multi-proc build, and some of the projects have /MP enabled for the compiler, there is generally more than one instance of the compiler running at once, which is why the total time spent in the tools is much larger than the “Full Build Time”.
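To make the difference between the two measurements concrete, here is a minimal C++ sketch. The `ToolRun` struct and both helper functions are hypothetical names, and the timings used to exercise them are made-up values for three overlapping compiler instances, not measurements from the RWC build. Accumulated time sums every instance's duration, while wall-clock time only spans the earliest start to the latest end, so overlapping instances make the accumulated figure larger.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One run of a tool instance (e.g. one compiler process), in seconds since the
// build started. Hypothetical type for illustration only.
struct ToolRun {
    double start;
    double end;
};

// "Accumulated" tool time: the sum of every instance's duration,
// regardless of whether instances overlapped.
double accumulatedTime(const std::vector<ToolRun>& runs) {
    double total = 0.0;
    for (const ToolRun& r : runs) total += r.end - r.start;
    return total;
}

// Wall-clock span of the tool runs: earliest start to latest end.
// This only approximates "Full Build time", which also includes
// MSBuild's own work between tool invocations.
double wallClockTime(const std::vector<ToolRun>& runs) {
    if (runs.empty()) return 0.0;
    double first = runs.front().start;
    double last = runs.front().end;
    for (const ToolRun& r : runs) {
        first = std::min(first, r.start);
        last = std::max(last, r.end);
    }
    return last - first;
}
```

For three runs spanning 0 to 10 s, 0 to 8 s, and 2 to 12 s, `accumulatedTime` returns 28 s while `wallClockTime` returns 12 s, which is the same effect that makes the accumulated FE/BE bars taller than the “Full Build time” bar.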
In the non-LTCG build (Figure 1), the linker does minimal work compared with the other components, so the linker time is hardly visible in the graph. In the LTCG build (Figure 2), all code generation happens at link time. Because the linker eliminates all redundant template instantiations across the object files before the back-end runs, the back-end time is drastically reduced in this scenario.
Based on our data for multi-proc MSBuild (MSBuild /m:n) on this particular application, building with 4 to 8 processes gives only a slight improvement in overall build time over 2 processes. Resource contention, and project references that limit how many projects can actually build in parallel, keep the extra processes from helping much, so a 2-proc build is representative of this application's multi-proc build characteristics in terms of overall throughput. We therefore present the 2-proc data here.
Here are some key observations based on the data:
In VS11, we have introduced a new feature that creates multiple compiler back-end threads to improve build performance. Because the back-end performs optimization and code generation one function at a time, it can generate code for multiple functions in parallel. This is especially beneficial for code generation with optimizations enabled, because optimizing requires the back-end to spend more time in each function.
This throughput win can be seen in Figure 1 and Figure 3. While these are impressive improvements in compiler back-end throughput, note that in this scenario the back-end time is a small fraction of the total build time; we highlight it here for illustration purposes only.
We can see a 15% to 20% performance degradation in the VS11 front-end compared to VS10 in this real-world example (Figure 1 and Figure 2). Our investigation revealed that this degradation is not in the compiler itself but is a function of the increased number of template instantiations the front-end processes. New functionality added to the STL significantly increases the number of template instantiations in a given compilation. This major increase in functionality is mandated by the new C++ Standard and has also been widely requested by our customers. The increase in template types forces the front-end to instantiate many more templates for a given input file, which degrades overall build performance. Of course, we haven't shipped VS11 yet, so we continue to analyze this issue and look for ways to improve throughput.
As we mentioned earlier, the linker takes the object files generated by the compiler and produces the final executable image. Because the compiler now generates larger object files, you may also see a slight slowdown in link time for both unoptimized and optimized (LTCG) builds, especially on lower-end machines (Spec C in our data), since there are more symbols to process and merge.
On high-end machines (Spec A and Spec B in our data), our tests show faster link times for LTCG builds (Figure 4).
For a number of releases the compiler has supported Link Time Code Generation (LTCG), in which code generation is deferred to link time so that information on all modules of the entire image (EXE, DLL) is available for optimization. Code generation can thus be done in the context of the entire image, enabling inlining across modules, custom calling conventions, and so on. Non-LTCG builds generally limit optimizations to the context of a single object file.
With VS11, link-time code generation can be done in multiple threads, compiling more than one function at a time, subject to the dependencies determined by the function call tree. This is a particularly good win for projects that compile into fewer, larger images. Our data shows that when building a “big EXE” Microsoft product, SQL Server, build time improved by ~30% due to our work on LTCG build throughput. Note that the SQL Server sources do not use template code, which is why that build does not show the throughput degradation caused by the new STL templates.
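The post does not describe how the back-end schedules functions across threads. As a toy model only, assume a function can be code-generated once every function it calls has been processed (for example, so that inlining decisions can see the callees). Under that assumption, the sketch below groups the functions of a call graph into "waves"; all functions in one wave could be compiled in parallel. `CallGraph` and `compileWaves` are hypothetical names, and this is a plain topological level ordering, not the VC++ implementation.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model: maps each function to the set of functions it calls.
using CallGraph = std::map<std::string, std::set<std::string>>;

// Groups functions into waves: every function in a wave calls only functions
// from earlier waves, so the members of one wave are independent of each other.
// Assumes an acyclic call graph; if a cycle (recursion) is found, it stops early
// and leaves the functions on the cycle out.
std::vector<std::vector<std::string>> compileWaves(const CallGraph& calls) {
    std::set<std::string> pending;  // every caller and callee not yet scheduled
    for (const auto& entry : calls) {
        pending.insert(entry.first);
        pending.insert(entry.second.begin(), entry.second.end());
    }

    std::set<std::string> done;
    std::vector<std::vector<std::string>> waves;
    while (!pending.empty()) {
        std::vector<std::string> wave;
        for (const std::string& f : pending) {
            bool ready = true;
            auto it = calls.find(f);
            if (it != calls.end()) {
                for (const std::string& callee : it->second) {
                    if (!done.count(callee)) { ready = false; break; }
                }
            }
            if (ready) wave.push_back(f);
        }
        if (wave.empty()) break;  // cycle: nothing can make progress
        for (const std::string& f : wave) {
            done.insert(f);
            pending.erase(f);
        }
        waves.push_back(wave);
    }
    return waves;
}
```

For a graph where `main` calls `a` and `b`, and both call `c`, the waves are `{c}`, then `{a, b}`, then `{main}`; the middle wave is where two threads could work at once. Under the same toy model, a single long call chain yields one function per wave and therefore no parallelism.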
Since build throughput is very important to C++ developer productivity, here are a few suggestions that can help you improve build throughput.
From Figure 1 and Figure 2, you can see that although the total FE time has regressed, the regression in overall build time is still minimal for this application, especially on high-end machines (Spec A and Spec B). This is due to the effect of multi-proc builds. When building in the IDE, the build system uses all the cores on your machine by default. You can change this under Tools -> Options -> Projects and Solutions -> Build and Run by changing the setting for “Maximum number of Parallel Project Builds”. If you are building on the command line, pass /m:n to MSBuild (n is the number of processes you would like to use); the default on the command line is 1. As mentioned earlier, although building with 4 or more processes might give you better performance, for this particular application a 2-proc build (/m:2) performs close to a 4- or 8-proc build. You can also set /MP per project or per file to take advantage of multi-process compilation at the compiler level. For additional suggestions on how to further tune your build, take a look at this blog post.
C++ library headers represent a collection of living, evolving libraries, and they tend to grow from release to release. In VS11, for example, the Windows headers increased in raw size by 13%, due to added API functions and types related to Windows 8. The larger Windows headers add compile time when the PCH is created, not when it is used. Once the PCH is built, the many compiles that use it are affected only slightly, if at all. Proper use of precompiled headers continues to be the most effective way to reduce overall build times. If your application uses templates extensively, you may also consider pre-instantiating the template types in PCH files.
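The post does not say how to pre-instantiate template types. One standard C++11 way to express the idea is an explicit instantiation declaration (`extern template`) in the shared header that goes into the PCH, paired with a single explicit instantiation definition in one source file. Below is a minimal sketch, assuming a hypothetical `Grid<T>` class template; both parts are shown in one listing, with comments marking which file each would live in. Whether this is exactly what the VC++ team had in mind is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ---- would live in the precompiled header (e.g. pch.h) ----
// Hypothetical class template used by many source files.
template <typename T>
class Grid {
public:
    Grid(std::size_t rows, std::size_t cols);
    T& at(std::size_t row, std::size_t col);
    std::size_t size() const;

private:
    std::size_t cols_;
    std::vector<T> cells_;
};

// Members are defined outside the class body so they are not implicitly inline;
// explicit instantiation declarations do not suppress inline functions.
template <typename T>
Grid<T>::Grid(std::size_t rows, std::size_t cols) : cols_(cols), cells_(rows * cols) {}

template <typename T>
T& Grid<T>::at(std::size_t row, std::size_t col) { return cells_[row * cols_ + col]; }

template <typename T>
std::size_t Grid<T>::size() const { return cells_.size(); }

// Explicit instantiation declaration: translation units that include the
// header do not implicitly instantiate Grid<double>'s member functions.
extern template class Grid<double>;

// ---- would live in exactly one source file (e.g. the one that creates the PCH) ----
// Explicit instantiation definition: Grid<double> is instantiated here, once.
template class Grid<double>;
```

The `extern template` line moves the instantiation of `Grid<double>`'s out-of-line members into the single file that holds the explicit instantiation definition. Measure before and after on your own project, since the benefit depends on how widely each instantiation is used.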
A small experiment shows the difference a PCH makes for a single project of around 200 files, each with around a hundred lines of code. Various library headers are included in the PCH.
Figure 5: Using a PCH minimizes the effect of the size increase in library headers
If you have a managed application, you can take advantage of managed incremental build to avoid a full build when the referenced assemblies have only insignificant changes. For more information, see this blog post.
We have made significant throughput improvements to the compiler back-end and linker build phases. If your application is not a heavy user of the STL and spends a lot of time in optimization, you should see build time improvements.
The increase in template types will play a dominant role in overall build throughput for applications that use templates extensively. Applications that use the STL heavily may experience longer build times due to changes mandated by the C++11 Standard. You may also experience slightly longer build times due to the increased number of header files in VS11 and Windows 8.
To improve overall build time, take advantage of the multiple cores on your machine with multi-proc builds. Also make sure to use precompiled headers (PCH). For managed C++ applications, be sure to turn on managed incremental build to improve incremental build performance.
We are interested in getting your build throughput performance data if you see any improvement, or slowdown beyond what you might expect from some amount of growth in header files or usage of STL templates when migrating your applications to VS11. You may reply to the blog or send email to lishao at Microsoft dot com.
In addition to capturing the overall build time, you can get the compiler and linker time for each compiler/linker instance by passing /Bt to the compiler and /Time to the linker.
· When building in the IDE, add /Bt to the compiler's additional options and /Time to the linker's additional options. Make sure you build the application with “Detailed” verbosity: go to Tools -> Options -> Projects and Solutions -> Build and Run, and set “MSBuild Project Build Verbosity” to “Detailed”.
· For a command-line build, set _CL_=/Bt and _LINK_=/Time in the build environment and build with the MSBuild /v:d option.
In your build log, you will see the time spent in C1.dll (FE), C1xx.dll (FE), C2.dll (BE), and the linker for each instance of the compiler and linker. You may need to write a simple script to add up those numbers; please let us know if you would like us to post one. Alternatively, you can enable MSBuild “Diagnostic” logging, which reports the time spent in the compiler task and the linker task; these are close to the compiler and linker times.
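Since the post offers such a script without including one, here is a minimal C++ sketch of the idea. It assumes each /Bt line contains `time(<path to DLL>)=<seconds>s`; check this against your own build log, because the exact /Bt output format may differ between compiler versions. `sumCompilerTimes` is a hypothetical name, and the sketch does not parse the linker's /Time output.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Sums per-DLL compiler times (front-end and back-end) from an MSBuild log
// captured with _CL_=/Bt. ASSUMPTION: each timing line contains
// "time(<path to dll>)=<seconds>s".
std::map<std::string, double> sumCompilerTimes(const std::string& log) {
    // Lazy match, so paths such as "C:\Program Files (x86)\..." are handled.
    static const std::regex timing(R"(time\((.*?)\)=([0-9]+(?:\.[0-9]+)?)s)");
    std::map<std::string, double> totals;
    std::istringstream lines(log);
    std::string line;
    while (std::getline(lines, line)) {
        std::smatch m;
        if (!std::regex_search(line, m, timing)) continue;  // not a timing line
        std::string path = m[1].str();
        // Keep only the file name (e.g. "c1xx.dll"), lower-cased.
        std::size_t slash = path.find_last_of("\\/");
        std::string dll = (slash == std::string::npos) ? path : path.substr(slash + 1);
        std::transform(dll.begin(), dll.end(), dll.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        totals[dll] += std::stod(m[2].str());
    }
    return totals;
}
```

Feeding it the text of a detailed build log yields one accumulated total per DLL, for example `c1xx.dll` for the C++ front-end and `c2.dll` for the back-end.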
We hope this post helps you understand more about C++ desktop application build throughput in VS11. If you are interested in build throughput for C++ Metro Style applications, which are new in VS11, take a look at this blog post for an overview. Note that we continue to improve C++ Metro Style application build throughput; you should see about a 20% overall build throughput improvement over the Beta in the upcoming release.
Please let us know if you have any feedback. We appreciate your input to help us improve build performance.
So, basically, you are saying that build times are going to be even worse than they were in VS10. I am wondering, what kind of response are you hoping to get? Are people supposed to be happy? Are you simply trying to prevent the predictable outcries related to the build times getting worse, not better? It won't work, you know.
I really don't want to be negative, but each new post on this blog is just... I don't know...
The performance downgrade is obviously not nice, but the explanation with the templates is not completely absurd (in contrast to the downgrades concerning XP support, the new user interface and the Express Edition). I am working on a static code analyzer, and I know that template parsing takes a lot of time. Additionally, this blog post isn't written in marketing language, so it's easier to understand and less offensive to the users. Pointing out problems in clear language and giving reasons usually works fine, if the user base hasn't been made angry before.
However, if your compiler team had worked on variadic templates, your Macro-Hack wouldn't be necessary. If I understood your STL guy correctly, it is a major performance problem.
Why the fu** does this matter when we can't use it due to ~30% of the users on XP? And no VC++ 11 Express..
Please, improve link time of native C++. We switched from VS 2005 to VS 2010 recently and link time is faster (thanks for this improvement), but it is still too slow.
Here is what I had posted on connect many years ago and it is still true with VS 2010 : connect.microsoft.com/.../optimize-link-time-in-real-developers-environment
We have 24 GB of RAM, you should use it.
This is clearly good for game devs since they always need a Release build and probably use LTCG. Not sure if it'll help the rest of us...
Do you have results for a large application which includes both Managed and Native code? For those cases, will we finally be able to re-enable Incremental Linking?? pretty please!!!! Linking is sooo slow w/o it.
This is disappointing. Link time is killing us, and the lack of distributed builds is killing us. Our product runs on Windows and Mac. Building the same C++ code under XCode with distributed builds absolutely flies compared to Visual Studio. It wouldn't matter that compile time is a little slower if it were being distributed across 20 or 30 machines like our XCode builds are.
So what you're basically saying is that while you've decided never ever to build a C compiler in terms of language (C99) support, you've nevertheless optimized your C++ compiler for C-like workloads (minimal use of C++ language features such as templates, and the C++ standard library)... Great...
And this comes on top of ripping out XP support and "forgetting" about C++11 support... I don't want to be too harsh, but can you see why it sometimes appears that Microsoft's compiler team is basically a bunch of monkeys running around in a cage? Is there a *single* thing that your customers care about, that you haven't wrecked or ignored in this release? Has the project properties dialog been made resizable, at least?
Or are you really planning to sell us a slower compiler, which supports fewer platforms, and doesn't even improve language support apart from fixing a bug in your lambda implementation?
Perhaps, since template instantiations make your compiler crawl, you should improve how templates are handled in your compiler. I understand GCC 4.7 did some pretty impressive work on that front. And Clang, obviously, was *designed* as a C++ compiler, rather than a C one. And as a result, is blazing fast.
Perhaps it's time you followed in the footsteps of *every other C++ compiler*, and started designing your compiler for modern C++ code, instead of 1970-style C. We're going to have a lot of templates. Deal with it. We're going to have a lot of code in headers. That's just what the world looks like. We are, in fact, going to write C++ code, and include C++ libraries. If your compiler can't handle that in a scalable manner, fix it.
By the way, on top of speeding up the compiler and linker (which you really really need to do) how about adding CCache-like functionality? Our Linux builds are literally 10x faster than our Windows builds, simply because on Linux, we can use GCC with CCache. Our build server is able to churn out a Linux build in a couple of minutes because of that. A Windows build? 40 minutes.
Precompiled headers are a decent enough hack to get around the crippling build time, but they're still a hack, which pollutes the code base. They're not the final word in compiler performance. Especially not when we can get similar (or better) speedups on other compilers using non-intrusive tools that *don't* require us to mess with our actual source code.
VC++ team, I've been reading vcblog and it seems you guys can't get a single post published without 90% of the comments being negative. Clearly, what you do with VC++ is wrong and you should listen to what people say in these negative comments.
The main wishlist right now is:
1. XP support (direct XP support, not multi-targeting)
2. Better C++11 conformance
3. Express edition that can be used to write desktop apps
I've been following Clang's progress ever since the C++11 feature list for VC++11 Developer Preview was published in September 2011. For beta you guys were only able to add range-for-loops and override/final. That's 6 months and 3 C++11 features. During that time Clang was able to implement tens of new C++11 features. Even Intel's C++ compiler has now passed you, previously it was behind. Why are you so slow?
I remember when Herb Sutter tried to justify the lack of C++11 features at the BUILD conference. Herb's words (not the exact wording): "The guy who was doing variadic templates didn't realize that they were a lot harder to implement. He was unable to add variadic templates and since he focused so hard on variadic templates we didn't have time for other features.". What was this even supposed to mean? Is there 1 person working on C++11 features?
Every other notable compiler is ahead of you. The upcoming C++ Builder will have stunning C++11 support. You guys are inferior.
My suggestion would be to look at the effect of faster disk IO on compilation times. How about a quick run using an SSD or a RAM disk?
Thank you for your comments:
@ Philipp K_
Regarding your comments about variadic templates, we are doing a lot of thinking about the implementation of variadic templates in our compiler as part of our commitment to improving our conformance to the C++11 standard.
@ Frederic Hebert
I will follow up on the Connect bug to see if we can further improve this linker scenario.
The reason incremental linking is disabled for mixed managed and native code is that the current linker cannot merge metadata incrementally. That functionality needs to be provided by the CLR (.NET). I will open a bug on this to see if we can improve it in future releases.
I agree distributed build on multiple machine is definitely a good thing to have
Thanks for the suggestion. I will consider getting the data on an SSD or a RAM disk.
What about solutions with just a single project? Are files build in parallel? If not, why not?
@AndrewDover: Isn't most of the IO cached? AFAIK building is (mostly) CPU bound.
In VS11 Beta I still get that terrible message
cl : Command line warning D9030: '/Gm' is incompatible with multiprocessing; ignoring /MP switch
For my solution it would be a great improvement for my compile time...
@ Olaf vander Spek
Cpp files (except for the file that creates the PCH) in a single project are usually batched and passed to the compiler together. We do have the /MP switch, which can be set at the file and project level. Once /MP is on, the compiler can create one or more copies of itself in separate processes, and these copies compile the source files simultaneously.
When MSBuild multi-proc (/m:n) is on, MSBuild can build the files in different projects in parallel.
Yes, /MP and /Gm are currently incompatible. /MP attempts to compile everything it is presented in parallel. /Gm does a test compile and then, based on the results, compiles whatever else it thinks needs to be compiled. /MP would need to be heavily changed to make the two compatible.
Do you have numbers as to how much improvement you would get with and without /MP even if you turn off /Gm?
With multiple cores running, and a single disk, there might be a performance impact. Anyhow the data will tell the story. If CPU was all that mattered, then it was somewhat odd that VS2010 had equal front end compile times for both B and C.
Is it? Maybe it can't take advantage of the extra cores / threads available.