In typical developer scenarios, linking takes the lion's share of the application's build time. From our investigation we know that the Visual C++ linker spends a large fraction of its time preparing, merging, and finally writing out debug information. This is especially true for non-Whole Program Optimization scenarios.
In Visual Studio 2013 Update 2 CTP2, we have added a set of features which help improve link time significantly as measured by products we build here in our labs (AAA Games and Open source projects such as Chromium):
Not all of these features are enabled by default. Keep reading for more details.
As part of our analysis we found that we were unnecessarily bloating object files by emitting symbol information even for unreferenced functions and data. This produced additional, useless input for the linker, which would eventually be thrown away by linker optimizations anyway.
Applying /Zc:inline on the compiler command line tells the compiler to perform these optimizations, producing less input for the linker and improving end-to-end linker throughput.
New Compiler Switch: /Zc:inline[-] - removes an unreferenced function or data item if it is COMDAT or has internal linkage only (off by default)
Throughput Impact: Significant (double-digit (%) link improvements seen when building products like Chromium)
Breaking Change: Yes (possibly). For code that does not conform to the C++11 standard, turning on this feature can in some cases produce an unresolved external symbol error as shown below, but the workaround is very simple. Take a look at the example below:
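The sample listing did not survive in this capture; the following is a minimal reconstruction, with the file and symbol names taken from the error message below (the exact original source is assumed). The point is that x.cpp defines x::xfunc1 as inline in a translation unit where nothing calls it, while xfunc.cpp uses it without providing its own inline definition, which violates the C++11 rule that an inline function must be defined in every translation unit that uses it. Note that the two files below are separate translation units and will not build as a single file:

```
// --- x.cpp -----------------------------------------------------
// The inline definition is unreferenced within this translation
// unit, so /Zc:inline discards it from x.obj.
struct x { void xfunc1(); };
inline void x::xfunc1() {}

// --- xfunc.cpp -------------------------------------------------
// This translation unit calls x::xfunc1 but sees no inline
// definition of it, so under /Zc:inline the reference from
// xfunc.obj has nothing to resolve against.
struct x { void xfunc1(); };
int main() { x obj; obj.xfunc1(); return 0; }
```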
If you are using VS2013 RTM, this sample program will compile (cl /O2 x.cpp xfunc.cpp) and link successfully. However, if you compile and link with VS2013 Update 2 CTP2 with /Zc:inline enabled (cl /O2 /Zc:inline x.cpp xfunc.cpp), the sample will choke and produce the following error message:
xfunc.obj : error LNK2019: unresolved external symbol "public: void __thiscall x::xfunc1(void)" (?xfunc1@x@@QAEXXZ) referenced in function _main
x.exe : fatal error LNK1120: 1 unresolved externals
There are three ways to fix this problem.
Applicability: All scenarios except LTCG/WPO and some debug scenarios should see a significant speed-up.
This feature significantly improves type-merging speed by increasing the size of our internal data structures (hash tables and the like). For larger PDBs this will grow the file by at most a few MB, but it can reduce link times significantly. This feature is enabled by default today.
Throughput Impact: Significant (double-digit(%) link improvements for AAA games)
Breaking Change: No
Applicability: All scenarios except LTCG/WPO should see a significant speed-up.
This feature parallelizes (through multiple threads) the code generation and optimization phase of the compilation process. By default today, we use four threads for that phase. With machines getting more resourceful (CPU, I/O, etc.), a few extra build threads can't hurt. This feature is especially useful and effective when performing a Whole Program Optimization (WPO) build.
There are already multiple levels of parallelism that can be specified when building an artifact. The /m (or /maxcpucount) switch specifies the number of msbuild.exe processes that can run in parallel, whereas the /MP (Multiple Processes) compiler flag specifies the number of cl.exe processes that can compile source files simultaneously.
The /cgthreads flag adds another level of parallelism: it specifies the number of threads used for the code generation and optimization phase within each individual cl.exe process. If /cgthreads, /MP and /m are all set too high, it is quite possible to bring the build machine to its knees and make it unusable, so use them with caution!
New Compiler Switch: /cgthreadsN, where N is the number of threads used for code generation and optimization; N can be any value from 1 to 8.
Breaking Change: No. Note that this switch is currently unsupported, but we are considering making it a supported feature, so your feedback is important!
Applicability: This should make a definite impact for Whole Program Optimization scenarios.
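To sketch how the three levels compose, here are hypothetical command lines (the solution, file names, and counts are made up for illustration): /m controls how many projects MSBuild builds in parallel, /MP controls how many cl.exe processes compile at once, and /cgthreads controls the threads inside each cl.exe process.

```
rem MSBuild level: up to 4 projects built in parallel.
msbuild MySolution.sln /m:4

rem Compiler level: up to 4 cl.exe processes (/MP4), each using
rem 2 threads for code generation and optimization (/cgthreads2).
cl /MP4 /cgthreads2 /O2 a.cpp b.cpp c.cpp d.cpp
```

The product of these numbers is roughly how much work your machine can be asked to do at once, which is why raising all three together can oversubscribe it.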
This blog post gives you an overview of the set of features we have enabled in the latest CTP to help improve link throughput. Our focus so far has been on slightly larger projects, so these wins should be most noticeable for projects the size of Chromium and others.
Please give them a shot and let us know how it works out for your application. It would be great if you folks can post before/after numbers on linker throughput when trying out these features.
If your link times are still painfully slow, please email me, Ankit, at firstname.lastname@example.org. We would love to know more!
Thanks to C++ MVP Bruce Dawson, Chromium developers and the Kinect Sports Rivals team for validating that our changes had a positive impact in real-world scenarios.
I like how the most exciting feature, the last one, has no throughput impact listed.
it is quite possible to bring down the build system to its knees making it unusable
Make it so the whole shebang runs below normal priority. There is no reason this needs to be normal and starve my knees. Used to be the vsspawn thing ran the show and all one had to do was patch that (change a single byte) to run below normal and everyone got their machine back during a build.
just one place. Too many switches all over the place. Better yet, an env var.
@h4x, depending upon the architecture of your product (how many functions can in fact be compiled in parallel), you should see anywhere between a 10-20% build throughput improvement.
@Kneed, thanks for your feedback.
Wish I could have some of these changes on VS 2010 since I'm stuck on it for awhile making so called AAA games and linking so slowly. :(
@psikore, can you reach out to me on my email (email@example.com) and we can see if anything can be done to ease your pain.
Visual C++ is known to be used by the UWin open source project, and its 5.0 release is 64-bit too! With the SUA community discontinued, it's a viable alternative to Cygwin (given Red Hat's monopoly) for testing source against the current version of MinGW64 as well as Visual Studio. More than Chromium or games, build performance on UWin is intriguing.
I agree with running the build at below normal priority.
"N' represents the number of threads and 'N' can be specified between [1-8]."
Is there a special reason why we have to specify the thread count manually?
It would be nice if we could simply set it to 0 (like known from other switches) or leave the number out: Then the number of threads should equal the number of CPUs/Cores, so PCs with 1 Core work with one thread and PCs with 16 Cores use all 16 Cores (or 8 if you can't add support for more threads).
Using /Zc:inline for our release build (which uses LTCG), I get a link error due to corrupt debug info:
sicore.lib(symbol.obj) : warning LNK4209: debugging information corrupt; recompile module; linking object as if no debug info
then fatal error LNK1103: debugging information corrupt; recompile module
Recompiling that module always results in the same thing.
It works fine in debug though
In addition to Vertex's comment about /cgthreadsN, please separate number and string token with ":", like /maxcpucount:8
Excellent. Will try this out when I have a chance. Would disagree a bit with the sentiments to use lower-pri threading - run it at normal priority! I'd rather have it finish faster especially if it's in a CI situation.
@edl_si, is it possible for you to provide us with a link repro? The steps to create a link repro are as follows:
support.microsoft.com/.../134650 (<--- link repro section).
@jschroedl, yes do share your feedback with us, we are eager to hear more :).
@Vertex, thank you for your feedback. As I mentioned, /cgthreads is an experimental feature for now, and your feedback is welcome. By default we provide four threads for the code generation and optimization phase; this has been the case since the VS2012 release. We speculate that while that is sufficient for a large number of applications, a higher number of threads can in fact help a wee bit more. Based on the feedback we receive from you folks, we are trying to understand what provides the best level of parallelism across the different switches /m, /MP and /cgthreads. I agree the north-star goal for us is to pick intelligent defaults based on the hardware constraints.