Hello, my name is Chandler Shen, and I am a developer on the Visual C++ Shanghai team.
We have made some changes in the upcoming Visual C++ 2010 release to improve the performance of the linker. I would like to first give a brief overview of the linker and how we analyzed the bottlenecks of the current implementation. Later, I will describe the changes we made and their impact on linker performance.
We targeted linker throughput for full builds of large-scale projects, because this scenario matters most for linker throughput scalability. Incremental linking and smaller projects will not benefit from the work I describe in this blog.
Traditionally, the work done by the linker can be split into two phases:
1. Pass1: collecting the definitions of symbols (from both object files and libraries)
2. Pass2: fixing up references to symbols with their final addresses (actually Relative Virtual Addresses) and writing out the final image.
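The two phases can be illustrated with a toy model. This is only a minimal sketch and not how link.exe is implemented: the `ObjectFile` and `Reference` types, the one-slot-per-instruction "code", and the function names `Pass1` and `Pass2` are all hypothetical simplifications chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified model of an input object file.
struct Reference {
    std::size_t patchOffset;  // slot in `code` that must receive an address
    std::string symbol;       // name of the symbol being referenced
};

struct ObjectFile {
    std::vector<std::uint32_t> code;                   // one slot per "instruction"
    std::map<std::string, std::uint32_t> definitions;  // symbol -> offset in `code`
    std::vector<Reference> references;
};

// Pass1: walk every input and record the final address of each definition.
std::map<std::string, std::uint32_t> Pass1(const std::vector<ObjectFile>& objs) {
    std::map<std::string, std::uint32_t> symbolTable;
    std::uint32_t base = 0;  // address at which the current object will be placed
    for (const ObjectFile& obj : objs) {
        for (const auto& def : obj.definitions) {
            if (!symbolTable.emplace(def.first, base + def.second).second)
                throw std::runtime_error("duplicate symbol: " + def.first);
        }
        base += static_cast<std::uint32_t>(obj.code.size());
    }
    return symbolTable;
}

// Pass2: patch each reference with its resolved address and emit the image.
std::vector<std::uint32_t> Pass2(
    const std::vector<ObjectFile>& objs,
    const std::map<std::string, std::uint32_t>& symbolTable) {
    std::vector<std::uint32_t> image;
    for (const ObjectFile& obj : objs) {
        std::vector<std::uint32_t> code = obj.code;  // patch a copy of the input
        for (const Reference& ref : obj.references) {
            auto it = symbolTable.find(ref.symbol);
            if (it == symbolTable.end())
                throw std::runtime_error("unresolved external: " + ref.symbol);
            assert(ref.patchOffset < code.size());
            code[ref.patchOffset] = it->second;
        }
        image.insert(image.end(), code.begin(), code.end());
    }
    return image;
}
```

The point of the split is that Pass2 cannot start patching until Pass1 has seen every definition, because a reference in the first object may resolve to a symbol defined in the last one.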
If /GL (Whole Program Optimization) is specified when compiling, the compiler generates a special object file format that contains intermediate language. When the linker encounters such object files, Pass1 becomes a two-phase procedure. First, the linker calls into the compiler to collect the definitions of all public symbols from these object files and builds a complete public symbol table. Then the linker supplies this symbol table to the compiler, which generates the final machine instructions (code generation).
During Pass2, in addition to writing the final image, the linker also writes debug information into a PDB (Program Database) file if the user specifies /DEBUG (Generate Debug Info). Some of this debug information, such as the addresses of symbols, is not known until link time.
In this section, I will show how we analyzed some test cases to find the performance bottlenecks.
To reach an objective conclusion, we chose four real-world projects of differing scale as test cases. Their names are omitted; they are referred to as Proj1, Proj2, Proj3 and Proj4.
Table 1 Measurements of test cases
                   Proj1     Proj2     Proj3     Proj4
Files    Total        55        27       168      1066
         .obj          4         6         7       882
         .lib         51        21       161       184
Symbols             6026     22436     69570    110262
In Table 1, “Symbols” is the number of entries in the symbol table that the linker uses internally to store information about all external symbols. Note that Proj4 is much bigger than the others.
The test machine was configured as follows:
· CPU: Intel Xeon 3.20 GHz, 4 cores
· RAM: 2 GB
· OS: Windows Vista 32-bit
To minimize the effect of the environment, each case was run five times. All times are in seconds.
Table 2 and Table 3 show that for each test case there is always one run (usually the first) that takes much longer than the others, while another run may take much less time. There are two reasons for this:
· The OS caches a file’s content in memory for the next read (called Prefetch on Windows XP and SuperFetch on Windows Vista).
· Most modern hard disks cache a file’s content for the next read.
Comparing Table 2 with Table 3, we can see that Pass2 is much shorter when /debug is off. This indicates that most of the Pass2 time is spent writing the PDB file.
Table 2 Test result of Non-LTCG with /Debug On
Project  Run     Pass1     Pass2     Total
Proj1    1       4.437     2.328     6.765
         2       0.266     1.218     1.484
         3       0.265     1.188     1.453
         4           ?     1.219         ?
         5       0.235     1.375     1.610
Proj2    1       9.484    15.766    25.250
         2       1.531     8.188     9.719
         3       1.579     8.078     9.657
         4       1.625     7.890     9.515
         5           ?     8.297     9.907
Proj3    1      27.266    43.687    70.953
         2       4.250    17.672    21.922
         3       4.141    17.265    21.406
         4       4.203    18.500    22.703
         5       4.688    19.078    23.766
Proj4    1      47.453    70.172   117.625
         2      17.250    59.813    77.063
         3      17.547    55.672    73.219
         4      16.516    47.172    63.688
         5      14.937    44.079    59.016
Table 3 Test result of Non-LTCG with /Debug Off
Project  Run     Pass1     Pass2     Total
Proj1    1       0.187     0.078         ?
         2       0.218     0.031     0.249
         3           ?     0.047     0.234
         4       0.203         ?         ?
         5           ?         ?         ?
Proj2    1       6.209     0.297     6.506
         2       1.310         ?     1.497
         3       1.295         ?     1.482
         4       1.342         ?     1.545
         5           ?         ?     1.513
Proj3    1      15.382     0.764    16.146
         2       3.541     0.546     4.087
         3       3.650     0.562     4.212
         4       3.557         ?     4.150
         5       3.588         ?         ?
Proj4    1      12.059     1.856    13.915
         2      10.811     1.778    12.589
         3      10.874     1.809    12.683
         4      12.855     1.794    14.649
         5      10.796         ?    12.574
We highly recommend that users use /LTCG (Link-time Code Generation) to optimize applications. The test results with /LTCG are shown in Table 4.
Table 4 Test Result of LTCG with /Debug On
Project  Run     Pass1     Pass2     Total
Proj1    1     178.797     1.734   180.531
         2     155.593     0.954   156.547
         3     153.750     1.031   154.781
         4     152.562     0.891   153.453
         5     153.156     0.797   153.953
Proj2    1     120.375     5.546   125.921
         2     102.343     5.172   107.515
         3     102.203     5.235   107.438
         4     102.016     5.343   107.359
         5     102.250     5.078   107.328
Proj3    1     222.859    20.719   243.578
         2     185.281    22.437   207.718
         3     184.984    21.422   206.406
         4     185.203    22.656   207.859
         5     186.078    22.844   208.922
Proj4    1     522.329   122.984   645.313
         2     490.188    54.406   544.594
         3     441.125    51.860   492.985
         4     430.609    51.813   482.422
         5     437.344    49.750   487.094
Based on the above results and other investigation, we made the following observations:
1. If /LTCG is used, most of the link time is spent on code generation (a compiler task) in Pass1.
2. OS caching of input files reduces the time spent in both passes quite a lot.
3. Most of the time spent in Pass2 goes to writing the PDB file.
After some investigation, we decided to introduce a dedicated thread for writing PDB files because:
1. Most users normally specify /debug when linking, irrespective of whether the application is built under “debug” or “release” configuration.
2. The data written into the final binary does not depend on the result of writing the PDB file, and vice versa; that is, the binary writing task is independent of the PDB writing task.
3. When the project is big, the linker has a lot of other work to do during Pass2 in addition to writing the PDB file, such as reading data from object files and libraries.
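The idea of handing PDB output to a dedicated thread can be sketched with standard C++ threading primitives. This is an illustrative sketch only, not the linker's actual code: the `DebugInfoWriter` class, the `WriteOutputs` function, and the string records standing in for debug information are all assumptions made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <ostream>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PDB writer: the linking thread queues records,
// and a dedicated background thread writes them out.
class DebugInfoWriter {
public:
    explicit DebugInfoWriter(std::ostream& pdb)
        : pdb_(pdb), worker_([this] { Run(); }) {}

    ~DebugInfoWriter() { Finish(); }

    void Enqueue(std::string record) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!done_ && "Enqueue called after Finish");
            queue_.push(std::move(record));
        }
        ready_.notify_one();
    }

    // Signals that no more records will arrive and waits for the worker.
    void Finish() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        if (worker_.joinable()) worker_.join();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) return;  // only reached once done_ is set
            std::string record = std::move(queue_.front());
            queue_.pop();
            lock.unlock();  // do the slow I/O without holding the lock
            pdb_ << record << '\n';
            lock.lock();
        }
    }

    std::ostream& pdb_;
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    bool done_ = false;
    std::thread worker_;  // declared last so every other member exists first
};

// The image is written on the calling thread while debug records are
// written concurrently by the worker.
void WriteOutputs(const std::vector<std::string>& sections,
                  std::ostream& image, std::ostream& pdb) {
    DebugInfoWriter writer(pdb);
    for (const std::string& section : sections) {
        writer.Enqueue("symbol-for:" + section);
        image << section;
    }
    writer.Finish();
}
```

Because the two outputs never read each other's data, the only synchronization needed in this sketch is the queue itself; the single consumer also keeps the records in the order they were queued.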
The following tables compare linker performance between VS2010 and VS2008 SP1. To remove the effect of caching, we rebooted the test machine (with SuperFetch disabled) before each run. For ease of comparison, the times for the old linker (the uncached runs from Table 2 and Table 4) are also listed.
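Table 5 Test Result of new linker, Non-LTCG with /Debug On
         New linker (VS2010)           Old linker (VS2008 SP1)       Improved
Project  Pass1    Pass2    Total       Pass1    Pass2    Total       Pass2    Total
Proj1    3.547    1.859    5.406       4.437    2.328    6.765       20.15%   20.08%
Proj2    9.797   10.266   20.063       9.484   15.766   25.250       34.89%   20.54%
Proj3   17.078   22.609   39.687      27.266   43.687   70.953       48.25%   44.07%
Proj4   47.500   54.281  101.781      47.453   70.172  117.625       22.65%   13.47%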
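Table 6 Test Result of new linker, LTCG with /Debug On
         New linker (VS2010)           Old linker (VS2008 SP1)       Improved
Project  Pass1    Pass2    Total       Pass1    Pass2    Total       Pass2    Total
Proj1  153.516    0.953  154.469     178.797    1.734  180.531       45.04%   14.44%
Proj2  119.703    5.391  125.094     120.375    5.546  125.921        2.79%    0.66%
Proj3  225.688   16.594  242.282     222.859   20.719  243.578       19.91%    0.53%
Proj4  525.375   80.375  605.750     522.329  122.984  645.313       34.65%    6.13%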
From Table 5 and Table 6, it can be seen that multi-threading the linker has improved the performance of Pass2, and it is especially effective for bigger projects.
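The "Improved" percentages appear to be the relative reduction in time, (old - new) / old. That formula is an inference from the published numbers rather than something the tables state, and `ImprovementPercent` below is a hypothetical helper name. A small sketch that reproduces one figure from each table:

```cpp
#include <cassert>
#include <cmath>

// Relative reduction of newTime compared with oldTime, in percent.
// Assumed formula: (old - new) / old * 100.
double ImprovementPercent(double oldTime, double newTime) {
    assert(oldTime > 0.0);
    return (oldTime - newTime) / oldTime * 100.0;
}

// Proj1 Pass2, Non-LTCG: old 2.328 s, new 1.859 s, listed as 20.15%.
const double kProj1Pass2 = ImprovementPercent(2.328, 1.859);

// Proj4 Pass2, LTCG: old 122.984 s, new 80.375 s, listed as 34.65%.
const double kProj4LtcgPass2 = ImprovementPercent(122.984, 80.375);
```

Applied to the Proj2 LTCG totals, the same helper shows why the overall gain is small there: Pass2 is only a few seconds out of roughly two minutes, so even a faster Pass2 barely moves the total.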
We will continue to look into linker throughput after the 2010 release to find areas to improve. If you have any suggestions or feedback, feel free to let us know.
Nice article! In fact, link time is really annoying, especially with LTCG, when code generation is required even after a small change.
Probably the whole "c++ -> obj->linker" architecture is old and is the cause of most c++ and linker times, isn't it?
It's great that you're looking at the linker as it's a big bottleneck on larger projects. I'm in games development and the link time is easily the biggest time waster.
Parallelising the pdb generation seems like a good start, but I think it would also be good if you could review the algorithmic complexity of the debug information compilation and symbol management as well to see if there are any easy wins there. Especially when you start using templates you end up with lots and lots of redundant debug data and empirically it seems to affect link quite a bit.
Now, if you compile a global c++ object in a cpp, it will be linked into the exe if added to the linker as an obj file.
When this obj file is put into a lib and we link the exe against that lib, the linker won't pick it up (nothing references it from outside, because the static object registers itself as a service with a singleton-type manager).
Now there is not even an option to the linker to take all obj from a given lib file.
There is a big hack in the studio against this:
Linker/General/Use Library Dependency Inputs
But this is a joke when using big libraries...
Even ATL and MFC uses "workarounds", because of this problem (define in the header, use the first link attributes in the header files).
Only per object file (cpp) __pragma init_seg(...) available to set the initialization order.
No way to set this per c++ object basis like:
CMyObject __declspec(linkinit(56)) MyObject;
#pragma comment(lib, "dir/file.lib") does not work as a relative path:
only simple "file.lib" would be searched in LIBPATH-es,
"dir/file.lib" automatically considered as global path...
A second thread for writing the PDB is nice, but still won't make much use of today's multi-core CPUs. Is there no way the bulk of the work can be split across many threads? It's annoying to see Task Manager reporting 12.5% CPU usage for link.exe on a Core i7... still, I'm glad it's getting some attention.
I saw this a few days ago: http://stackoverflow.com/questions/1401342/why-is-linker-optimization-so-poor/1401374#1401374 and I think it's actually a good question. How come LTCG takes significantly longer than compiling if you simply include all the same code into a single translation unit?
And what can you do (are you doing?) to improve this?
Do any of the test case include machine generated source code files? Please provide some test performance numbers when generating a lexer/parser using yacc or another compiler generator for a decently large language (c++, c, pascal, etc).
Most compiler/linker pairs do miserably with generated code, given the tens of thousands of symbol names and large amounts of goto code.
Unable to turn off warnings for 3rd party libraries:
xmaencoder.lib(wmachmtx.obj) : warning LNK4099: PDB 'wmachmtx.pdb' was not found with 'C:\Program Files (x86)\Microsoft Xbox 360 SDK\lib\win32\vs2008\xmaencoder.lib' or at 'C:\projects\Tachyon\_Lib\DebugWin32\wmachmtx.pdb'; linking object as if no debug info
xmaencoder.lib(resampthreads.obj) : warning LNK4099: PDB 'audiosrc.pdb' was not found with 'C:\Program Files (x86)\Microsoft Xbox 360 SDK\lib\win32\vs2008\xmaencoder.lib' or at 'C:\projects\Tachyon\_Lib\DebugWin32\audiosrc.pdb'; linking object as if no debug info
xmaencoder.lib(aresample.obj) : warning LNK4099: PDB 'audiosrc.pdb' was not found with 'C:\Program Files (x86)\Microsoft Xbox 360 SDK\lib\win32\vs2008\xmaencoder.lib' or at 'C:\projects\Tachyon\_Lib\DebugWin32\audiosrc.pdb'; linking object as if no debug info
xmaencoder.lib(gresample.obj) : warning LNK4099: PDB 'audiosrc.pdb' was not found with 'C:\Program Files (x86)\Microsoft Xbox 360 SDK\lib\win32\vs2008\xmaencoder.lib' or at 'C:\projects\Tachyon\_Lib\DebugWin32\audiosrc.pdb'; linking object as if no debug info
Hi bionicbeagle,
Thanks for your feedback, we are considering how to improve the performance of debug information processing in a future release.
Hi Andrew McDonald,
Currently, the bottleneck of linker is IO operations (mainly read operations), and we will attempt to balance CPU operations when we improve the performance of IO operations.
Hi András Csikvári,
1. (global c++ object)You can use /INCLUDE switch to force a symbol available in the final image.
2. (#pragma comment)Yes, linker will not search in LIBPATH if a directory is specified for security reasons. For example, if user specifies ..\..\aa.lib, this might lead to some private libraries that are linked.
3. (LNK4099)Sorry, linker does not provide an option to ignore any warnings silently.
Chandler Shen
Hi
1. "(global c++ object)You can use /INCLUDE switch to force a symbol available in the final image."
I don't want to /INCLUDE anything by hand, the main idea with static c++ objects, that I take into an object file and works (at least should work)...
To be more clear, the project has 1200+ cpp files, and we do such (small) objects by macro to register "services".
I would really need a force link switch to the linker like this: --force-all-objects-on mylib.lib --force-all-objects-off
2. "(#pragma comment)Yes, linker will not search in LIBPATH if a directory is specified for security reasons. For example, if user specifies ..\..\aa.lib, this might lead to some private libraries that are linked."
Ok, I have never said to do that. I would like to use like at headers (#include "lib/header.h"): #pragma comment(lib, "dir/file.lib").
There is a "workaround" now for this:
#pragma comment(lib, __FILE__ "/../../../mylibdir/mylib.lib")
It doesn't seem any nicer...
3. "(LNK4099)Sorry, linker does not provide an option to ignore any warnings silently."
Yes, and it's quite silly, that I can't turn off the warnings of the Microsoft's own libraries... :)
Avoid the java 1 class per file scheme. It is quite slow over the long run considering it takes a few extra seconds over and over throughout the life of the project.
Avoid making hundreds of extra 1 liner classes just for completeness. They can be wrapped into a large catch all class (e.g., a large class that does data type conversions from machine specific to not machine specific basic data types).
Avoid having 10 different overloaded versions of the same method. Make 1 method that does all of the work and have optional parameters to provide the equivalent signature to the caller. (NB: I've stopped using optional arguments in the last 2 years since I've seen offshore written code that only supplies the minimal set of parameters to a function instead of calling the function the right way for the time it is needed. I think this is the IntelliSense lazy method.)
Avoid creating unnecessary names and extra function calls just for coding style.
.NET, java, reflection and modern tools promote an explosion of names and identifiers, but that, while simple in the narrow scope, is detrimental in the long run. More names/symbols make it more costly and harder to move the project forward in the future.
"Most users normally specify /debug when linking, irrespective of whether the application is built under “debug” or “release” configuration."
Are you sure that's actually intentional, or is it more likely that "most users" just didn't notice that MS added /debug to the default project templates a few versions ago?
@CMWoods,
It is standard practice in most professional development shops to store PDBs for all released binaries, even if the PDBs themselves are not released, for help in debugging. Adding /debug to the default template was an effect, not a cause.
A few observations:
1. PDB vs no-PDB difference tells that the binary write is much faster than PDB write. Why? I suppose you just have to dump the tables already built in memory. Parallelising them would save a pittance. A reason for longer PDB write could be that the current linker creates the PDB file as compressed. If you issue writes in pieces smaller than the compression unit (64K or so?) it could cause multiple compress-decompress rounds. I hope the linker is not calling WriteFile per each PDB record; that would just kill the performance.
2. Typically, it makes sense to avoid writing two large files in parallel in small pieces. The most effective approach is to build the data in memory and write it in one shot. With the modern memory sizes, it's feasible. And NO, a file mapping doesn't beat that; forget it. Especially a mapped compressed file (PDB).
3. If you deal with a lot of small files, it makes sense to read them to memory, as many as possible. To avoid disk thrashing, open a bunch of them at once, and then read; don't do one by one.