Hello, my name is Chandler Shen, and I am a developer on the Visual C++ Shanghai team.
We have made some changes in the upcoming Visual C++ 2010 release to improve linker performance. I will first give a brief overview of the linker and of how we analyzed the bottlenecks in the current implementation. Then I will describe the changes we made and their impact on linker performance.
We targeted linker throughput for full builds of large-scale projects, because this scenario matters most for linker throughput scalability. Incremental linking and smaller projects will not benefit from the work I describe in this post.
Traditionally, the work done by the linker can be split into two phases:
1. Pass1: collecting the definitions of symbols (from both object files and libraries)
2. Pass2: fixing up references to symbols with their final addresses (actually Relative Virtual Addresses, or RVAs) and writing out the final image.
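The two passes can be illustrated with a toy model. This is only a sketch under stated assumptions, not the real linker: `ObjectFile`, `Definition`, `Reference`, `Fixup`, `pass1` and `pass2` are hypothetical names, objects are simply laid out one after another, and the first definition of a symbol wins.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, heavily simplified model of an object file.
struct Definition { std::string name; std::uint32_t offset; };   // offset inside the object
struct Reference  { std::string name; std::uint32_t fixupAt; };  // offset of the slot to patch
struct ObjectFile {
    std::uint32_t size;                    // bytes this object contributes to the image
    std::vector<Definition> definitions;
    std::vector<Reference> references;
};

using SymbolTable = std::unordered_map<std::string, std::uint32_t>;  // name -> RVA

// Pass1: lay the objects out back to back and record the RVA of every definition.
SymbolTable pass1(const std::vector<ObjectFile>& objects, std::uint32_t baseRva) {
    SymbolTable table;
    std::uint32_t rva = baseRva;
    for (const ObjectFile& obj : objects) {
        for (const Definition& def : obj.definitions) {
            assert(def.offset < obj.size);
            table.emplace(def.name, rva + def.offset);  // first definition wins in this toy model
        }
        rva += obj.size;
    }
    return table;
}

struct Fixup { std::uint32_t at; std::uint32_t target; };  // patch location and resolved RVA

// Pass2: resolve every reference against the table built in Pass1.
// Returns std::nullopt if any symbol is unresolved.
std::optional<std::vector<Fixup>> pass2(const std::vector<ObjectFile>& objects,
                                        const SymbolTable& table,
                                        std::uint32_t baseRva) {
    std::vector<Fixup> fixups;
    std::uint32_t rva = baseRva;
    for (const ObjectFile& obj : objects) {
        for (const Reference& ref : obj.references) {
            auto it = table.find(ref.name);
            if (it == table.end()) return std::nullopt;  // unresolved external
            fixups.push_back({rva + ref.fixupAt, it->second});
        }
        rva += obj.size;
    }
    return fixups;
}
```

The point of the split is that Pass2 cannot start patching until Pass1 has seen every definition, because a reference in one object may target a symbol defined in a later one.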
If /GL (Whole Program Optimization) is specified when compiling, the compiler generates object files in a special format that contains intermediate language. When the linker encounters such object files, Pass1 becomes a two-phase procedure. First, the linker calls into the compiler to collect the definitions of all public symbols from these object files and build a complete public symbol table. Then the linker supplies this symbol table to the compiler, which generates the final machine instructions (code generation).
During Pass2, in addition to writing the final image, the linker also writes debug information into a PDB (Program Database) file if the user specifies /DEBUG (Generate Debug Info). Some of this debug information, such as the addresses of symbols, is not known until link time.
In this section, I will show how we analyzed some test cases to find the performance bottlenecks.
To reach an objective conclusion, we chose four real-world projects of different scale as test cases. Their names are omitted; they are referred to here as Proj1, Proj2, Proj3 and Proj4.
Table 1 Measurements of test cases

| | Proj1 | Proj2 | Proj3 | Proj4 |
|---|---|---|---|---|
| Files: Total | 55 | 27 | 168 | 1066 |
| Files: .obj | 4 | 6 | 7 | 882 |
| Files: .lib | 51 | 21 | 161 | 184 |
| Symbols | 6026 | 22436 | 69570 | 110262 |
In Table 1, “Symbols” is the number of entries in the symbol table that the linker uses internally to store information about all external symbols. Note that Proj4 is much bigger than the others.
The test machine was configured as follows:
· Hardware
o CPU: Intel Xeon 3.20 GHz, 4 cores
o RAM: 2 GB
· Software: Windows Vista 32-bit
To minimize environmental effects, every case was run five times. All times are in seconds.
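The measurement harness itself is not shown in this post. As a rough illustration only, repeated timing of a task could be sketched as follows; `timeRuns` is a hypothetical helper, and the task passed to it stands in for whatever launches the link being measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `task` the given number of times and returns the wall-clock
// duration of each run in seconds, in run order.
std::vector<double> timeRuns(const std::function<void()>& task, int runs) {
    assert(runs > 0);
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        task();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

Keeping every run separate, rather than averaging them, matters here because the first run and the later runs behave very differently.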
Table 2 and Table 3 show that, for each test case, there is always one run (usually the first) that takes much longer than the others, while another run may take noticeably less time. There are two reasons for this:
· The OS caches a file’s content in memory for the next read (called Prefetch on Windows XP and SuperFetch on Windows Vista)
· Most modern hard disks cache a file’s content for the next read
Comparing Table 2 with Table 3, we can see that Pass2 takes much less time when /DEBUG is off. This indicates that most of Pass2 is spent writing the PDB file.
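Table 2 Test result of Non-LTCG with /Debug On

| Project | Run | Pass1 | Pass2 | Total |
|---|---|---|---|---|
| Proj1 | 1 | 4.437 | 2.328 | 6.765 |
| Proj1 | 2 | 0.266 | 1.218 | 1.484 |
| Proj1 | 3 | 0.265 | 1.188 | 1.453 |
| Proj1 | 4 | | 1.219 | |
| Proj1 | 5 | 0.235 | 1.375 | 1.610 |
| Proj2 | 1 | 9.484 | 15.766 | 25.250 |
| Proj2 | 2 | 1.531 | 8.188 | 9.719 |
| Proj2 | 3 | 1.579 | 8.078 | 9.657 |
| Proj2 | 4 | 1.625 | 7.890 | 9.515 |
| Proj2 | 5 | | 8.297 | 9.907 |
| Proj3 | 1 | 27.266 | 43.687 | 70.953 |
| Proj3 | 2 | 4.250 | 17.672 | 21.922 |
| Proj3 | 3 | 4.141 | 17.265 | 21.406 |
| Proj3 | 4 | 4.203 | 18.500 | 22.703 |
| Proj3 | 5 | 4.688 | 19.078 | 23.766 |
| Proj4 | 1 | 47.453 | 70.172 | 117.625 |
| Proj4 | 2 | 17.250 | 59.813 | 77.063 |
| Proj4 | 3 | 17.547 | 55.672 | 73.219 |
| Proj4 | 4 | 16.516 | 47.172 | 63.688 |
| Proj4 | 5 | 14.937 | 44.079 | 59.016 |

(Empty cells: value not available.)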
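Table 3 Test result of Non-LTCG with /Debug Off

| Project | Run | Pass1 | Pass2 | Total |
|---|---|---|---|---|
| Proj1 | 1 | 0.187 | 0.078 | |
| Proj1 | 2 | 0.218 | 0.031 | 0.249 |
| Proj1 | 3 | | 0.047 | 0.234 |
| Proj1 | 4 | 0.203 | | |
| Proj1 | 5 | | | |
| Proj2 | 1 | 6.209 | 0.297 | 6.506 |
| Proj2 | 2 | 1.310 | | 1.497 |
| Proj2 | 3 | 1.295 | | 1.482 |
| Proj2 | 4 | 1.342 | | 1.545 |
| Proj2 | 5 | | | 1.513 |
| Proj3 | 1 | 15.382 | 0.764 | 16.146 |
| Proj3 | 2 | 3.541 | 0.546 | 4.087 |
| Proj3 | 3 | 3.650 | 0.562 | 4.212 |
| Proj3 | 4 | 3.557 | | 4.150 |
| Proj3 | 5 | 3.588 | | |
| Proj4 | 1 | 12.059 | 1.856 | 13.915 |
| Proj4 | 2 | 10.811 | 1.778 | 12.589 |
| Proj4 | 3 | 10.874 | 1.809 | 12.683 |
| Proj4 | 4 | 12.855 | 1.794 | 14.649 |
| Proj4 | 5 | 10.796 | | 12.574 |

(Empty cells: value not available. For Proj1 runs 3 to 5, the placement of the values 0.047, 0.234 and 0.203 is uncertain.)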
We highly recommend using /LTCG (Link-time Code Generation) to optimize applications. The test results with /LTCG are shown in Table 4.
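Table 4 Test Result of LTCG with /Debug On

| Project | Run | Pass1 | Pass2 | Total |
|---|---|---|---|---|
| Proj1 | 1 | 178.797 | 1.734 | 180.531 |
| Proj1 | 2 | 155.593 | 0.954 | 156.547 |
| Proj1 | 3 | 153.750 | 1.031 | 154.781 |
| Proj1 | 4 | 152.562 | 0.891 | 153.453 |
| Proj1 | 5 | 153.156 | 0.797 | 153.953 |
| Proj2 | 1 | 120.375 | 5.546 | 125.921 |
| Proj2 | 2 | 102.343 | 5.172 | 107.515 |
| Proj2 | 3 | 102.203 | 5.235 | 107.438 |
| Proj2 | 4 | 102.016 | 5.343 | 107.359 |
| Proj2 | 5 | 102.250 | 5.078 | 107.328 |
| Proj3 | 1 | 222.859 | 20.719 | 243.578 |
| Proj3 | 2 | 185.281 | 22.437 | 207.718 |
| Proj3 | 3 | 184.984 | 21.422 | 206.406 |
| Proj3 | 4 | 185.203 | 22.656 | 207.859 |
| Proj3 | 5 | 186.078 | 22.844 | 208.922 |
| Proj4 | 1 | 522.329 | 122.984 | 645.313 |
| Proj4 | 2 | 490.188 | 54.406 | 544.594 |
| Proj4 | 3 | 441.125 | 51.860 | 492.985 |
| Proj4 | 4 | 430.609 | 51.813 | 482.422 |
| Proj4 | 5 | 437.344 | 49.750 | 487.094 |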
Based on the above results and other investigation, we made the following observations:
1. If /LTCG is used, most of the link time is spent on code generation (a compiler task) in Pass1.
2. OS caching of input files reduces the time spent in both passes quite a lot.
3. Most of the time spent in Pass2 goes to writing the PDB file.
After some investigation, we decided to introduce a dedicated thread for writing PDB files, because:
1. Most users specify /DEBUG when linking, regardless of whether the application is built in the “debug” or “release” configuration.
2. The data written into the final binary does not depend on the result of writing the PDB file, and vice versa; that is, the binary-writing task is independent of the PDB-writing task.
3. When the project is big, the linker has a lot of other work to do during Pass2 in addition to writing the PDB file, such as reading data from object files and libraries.
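The linker's actual implementation is not shown in this post. The general idea of handing debug-information writes to a dedicated thread can be sketched with standard C++ threading; everything named here (`DebugInfoWriter`, `submit`, `finish`, `writeImage`) is hypothetical, and appending to an in-memory vector stands in for the slow PDB I/O.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PDB writer that runs on its own thread.
class DebugInfoWriter {
public:
    DebugInfoWriter() : worker_([this] { run(); }) {}
    ~DebugInfoWriter() { finish(); }

    // Queue one debug record; the caller does not wait for it to be written.
    void submit(std::string record) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!done_);
            pending_.push(std::move(record));
        }
        ready_.notify_one();
    }

    // Signal end of input and wait until every queued record has been written.
    void finish() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        if (worker_.joinable()) worker_.join();
    }

    // Only safe to call after finish() has returned.
    const std::vector<std::string>& written() const { return written_; }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return done_ || !pending_.empty(); });
            if (pending_.empty()) return;  // finished and nothing left to write
            std::string record = std::move(pending_.front());
            pending_.pop();
            lock.unlock();
            written_.push_back(std::move(record));  // stands in for slow PDB I/O
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> pending_;
    bool done_ = false;
    std::vector<std::string> written_;
    std::thread worker_;  // declared last so the other members exist before it starts
};

// The main thread keeps building the image while debug records are written elsewhere.
std::string writeImage(const std::vector<std::string>& sections, DebugInfoWriter& pdb) {
    std::string image;
    for (const std::string& section : sections) {
        pdb.submit("debug:" + section);
        image += section;
    }
    return image;
}
```

This split only works because of the second reason above: neither output depends on the other, so the two threads share nothing except the queue.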
The following tables compare linker performance between VS2010 and VS2008 SP1. To remove the effect of caching, we rebooted the test machine (with SuperFetch disabled) before each run. For ease of comparison, the times taken by the old linker without caching (from Table 2 and Table 4) are also listed.
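Table 5 Test Result of new linker, Non-LTCG with /Debug On

| Project | New linker (VS2010) Pass1 | New linker (VS2010) Pass2 | New linker (VS2010) Total | Old linker (VS2008 SP1) Pass1 | Old linker (VS2008 SP1) Pass2 | Old linker (VS2008 SP1) Total | Pass2 Improved | Total Improved |
|---|---|---|---|---|---|---|---|---|
| Proj1 | 3.547 | 1.859 | 5.406 | 4.437 | 2.328 | 6.765 | 20.15% | 20.08% |
| Proj2 | 9.797 | 10.266 | 20.063 | 9.484 | 15.766 | 25.250 | 34.89% | 20.54% |
| Proj3 | 17.078 | 22.609 | 39.687 | 27.266 | 43.687 | 70.953 | 48.25% | 44.07% |
| Proj4 | 47.500 | 54.281 | 101.781 | 47.453 | 70.172 | 117.625 | 22.65% | 13.47% |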
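The “Improved” percentages appear to be the time saved relative to the old linker, that is (old − new) / old, with the old times taken from the first (uncached) runs in Table 2. A small sketch of that arithmetic, using a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Percentage of time saved by the new linker relative to the old one.
double improvementPercent(double oldSeconds, double newSeconds) {
    assert(oldSeconds > 0.0);
    return (oldSeconds - newSeconds) / oldSeconds * 100.0;
}
```

For Proj1, the old Pass2 time is 2.328 and the new one is 1.859, so `improvementPercent(2.328, 1.859)` gives about 20.15, which matches the table.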
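Table 6 Test Result of new linker, LTCG with /Debug On

| Project | New linker (VS2010) Pass1 | New linker (VS2010) Pass2 | New linker (VS2010) Total | Old linker (VS2008 SP1) Pass1 | Old linker (VS2008 SP1) Pass2 | Old linker (VS2008 SP1) Total | Pass2 Improved | Total Improved |
|---|---|---|---|---|---|---|---|---|
| Proj1 | 153.516 | 0.953 | 154.469 | 178.797 | 1.734 | 180.531 | 45.04% | 14.44% |
| Proj2 | 119.703 | 5.391 | 125.094 | 120.375 | 5.546 | 125.921 | 2.79% | 0.66% |
| Proj3 | 225.688 | 16.594 | 242.282 | 222.859 | 20.719 | 243.578 | 19.91% | 0.53% |
| Proj4 | 525.375 | 80.375 | 605.750 | 522.329 | 122.984 | 645.313 | 34.65% | 6.13% |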
Table 5 and Table 6 show that multi-threading the linker has improved the performance of Pass2, and that it is especially effective for bigger projects.
We will continue to look into linker throughput after the 2010 release to find further areas to improve. If you have any suggestions or feedback, feel free to let us know.