Linker throughput
Hello, my name is Chandler Shen, a developer from the Visual C++ Shanghai team.
We have made some changes in the upcoming Visual C++ 2010 release to improve the performance of the linker. I would like to first give a brief overview of the linker and how we analyzed the bottlenecks of the current implementation. Later, I will describe the changes we made and their impact on linker performance.
Our Focus
We targeted linker throughput in the full-build scenario for large-scale projects, because this scenario matters most for linker throughput scalability. Incremental linking and smaller projects will not benefit from the work I describe in this blog.
Brief Overview of Linker
Traditionally, the work done by the linker can be split into two phases:
1. Pass1: collecting the definitions of symbols (from both object files and libraries).
2. Pass2: fixing up references to symbols with their final addresses (actually Relative Virtual Addresses) and writing out the final image.
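As a rough illustration of the two passes, here is a minimal sketch in Python. It is a toy model, not the Visual C++ linker's implementation: the `ObjectFile` class, the `pass1` and `pass2` functions, the sequential layout and the starting RVA of 0x1000 are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectFile:
    """Hypothetical stand-in for an object file."""
    name: str
    definitions: Dict[str, int]  # symbol name -> offset inside this file
    size: int                    # bytes this file contributes to the image
    references: List[Tuple[int, str]] = field(default_factory=list)  # (offset, symbol)


def pass1(objects, first_rva=0x1000):
    """Pass1: lay out the inputs and collect every symbol definition."""
    symbol_table = {}
    file_rva = {}
    next_rva = first_rva
    for obj in objects:
        file_rva[obj.name] = next_rva
        for symbol, offset in obj.definitions.items():
            if symbol in symbol_table:
                raise ValueError(f"duplicate definition of {symbol}")
            symbol_table[symbol] = next_rva + offset
        next_rva += obj.size
    return symbol_table, file_rva


def pass2(objects, symbol_table, file_rva):
    """Pass2: map the RVA of each reference to the final RVA of its target."""
    fixups = {}
    for obj in objects:
        for offset, symbol in obj.references:
            if symbol not in symbol_table:
                raise KeyError(f"unresolved external symbol {symbol}")
            fixups[file_rva[obj.name] + offset] = symbol_table[symbol]
    return fixups


# main.obj references "helper", which is only defined in util.obj.
main_obj = ObjectFile("main.obj", {"main": 0x0}, size=0x40, references=[(0x10, "helper")])
util_obj = ObjectFile("util.obj", {"helper": 0x8}, size=0x20)

symbols, layout = pass1([main_obj, util_obj])
fixups = pass2([main_obj, util_obj], symbols, layout)
for site, target in fixups.items():
    print(f"reference at RVA {site:#x} -> {target:#x}")
```

The point of the split is that no reference can be fixed up until every definition has been seen, which is why the symbol table built in Pass1 has to be complete before Pass2 starts.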
Link Time Code Generation (LTCG)
If /GL (Whole Program Optimization) is specified when compiling, the compiler generates a special format of object file containing intermediate language. When the linker encounters such object files, Pass1 becomes a two-phase procedure. First, the linker calls into the compiler to collect the definitions of all public symbols from these object files and builds a complete public symbol table. Then the linker supplies this symbol table to the compiler, which generates the final machine instructions (code generation).
Debug Information
During Pass2, in addition to writing the final image, the linker also writes debug information into a PDB (Program Database) file if the user specifies /DEBUG (Generate Debug Info). Some of this debug information, such as the addresses of symbols, is not known until link time.
Bottlenecks
In this section, I will show how we analyzed some test cases to find the performance bottlenecks.
Test Cases
To reach an objective conclusion, we chose four real-world projects of different scales as test cases. Their names are omitted; they are referred to here as Proj1, Proj2, Proj3 and Proj4.
Table 1. Measurements of the test cases

| Measurement   | Proj1 | Proj2 | Proj3 | Proj4  |
|---------------|-------|-------|-------|--------|
| Files (total) | 55    | 27    | 168   | 1066   |
| Files (.obj)  | 4     | 6     | 7     | 882    |
| Files (.lib)  | 51    | 21    | 161   | 184    |
| Symbols       | 6026  | 22436 | 69570 | 110262 |
In Table 1, the number of “symbols” is the number of entries in the symbol table that the linker uses internally to store information about all external symbols. Note that Proj4 is much bigger than the others.
Test Environment
The test machine was configured as follows:
· Hardware
o CPU: Intel Xeon 3.20 GHz, 4 cores
o RAM: 2 GB
· Software: Windows Vista 32-bit
Results
To minimize the effect of the environment, every case was run five times. All times are in seconds.
Table 2 and Table 3 show that for each test case there is always one run (usually the first) that takes much longer than the others, while an occasional run is noticeably shorter than the rest. There are two reasons for this:
· The OS caches a file’s content in memory for the next read (called prefetch on Windows XP, and SuperFetch on Windows Vista).
· Most modern hard disks cache a file’s content for the next read.
Comparing Table 2 with Table 3, we can see that Pass2 is much shorter when /DEBUG is off. This indicates that most of the time in Pass2 is spent writing the PDB file.
Table 2. Test results, non-LTCG with /DEBUG on (seconds)

| Project | Run | Pass1  | Pass2  | Total   |
|---------|-----|--------|--------|---------|
| Proj1   | 1   | 4.437  | 2.328  | 6.765   |
|         | 2   | 0.266  | 1.218  | 1.484   |
|         | 3   | 0.265  | 1.188  | 1.453   |
|         | 4   | 0.265  | 1.219  | 1.484   |
|         | 5   | 0.235  | 1.375  | 1.610   |
| Proj2   | 1   | 9.484  | 15.766 | 25.250  |
|         | 2   | 1.531  | 8.188  | 9.719   |
|         | 3   | 1.579  | 8.078  | 9.657   |
|         | 4   | 1.625  | 7.890  | 9.515   |
|         | 5   | 1.610  | 8.297  | 9.907   |
| Proj3   | 1   | 27.266 | 43.687 | 70.953  |
|         | 2   | 4.250  | 17.672 | 21.922  |
|         | 3   | 4.141  | 17.265 | 21.406  |
|         | 4   | 4.203  | 18.500 | 22.703  |
|         | 5   | 4.688  | 19.078 | 23.766  |
| Proj4   | 1   | 47.453 | 70.172 | 117.625 |
|         | 2   | 17.250 | 59.813 | 77.063  |
|         | 3   | 17.547 | 55.672 | 73.219  |
|         | 4   | 16.516 | 47.172 | 63.688  |
|         | 5   | 14.937 | 44.079 | 59.016  |
Table 3. Test results, non-LTCG with /DEBUG off (seconds)

| Project | Run | Pass1  | Pass2 | Total  |
|---------|-----|--------|-------|--------|
| Proj1   | 1   | 0.187  | 0.078 | 0.265  |
|         | 2   | 0.218  | 0.031 | 0.249  |
|         | 3   | 0.187  | 0.047 | 0.234  |
|         | 4   | 0.203  | 0.031 | 0.234  |
|         | 5   | 0.187  | 0.031 | 0.218  |
| Proj2   | 1   | 6.209  | 0.297 | 6.506  |
|         | 2   | 1.310  | 0.187 | 1.497  |
|         | 3   | 1.295  | 0.187 | 1.482  |
|         | 4   | 1.342  | 0.203 | 1.545  |
|         | 5   | 1.310  | 0.203 | 1.513  |
| Proj3   | 1   | 15.382 | 0.764 | 16.146 |
|         | 2   | 3.541  | 0.546 | 4.087  |
|         | 3   | 3.650  | 0.562 | 4.212  |
|         | 4   | 3.557  | 0.546 | 4.150  |
|         | 5   | 3.588  | 0.562 | 4.150  |
| Proj4   | 1   | 12.059 | 1.856 | 13.915 |
|         | 2   | 10.811 | 1.778 | 12.589 |
|         | 3   | 10.874 | 1.809 | 12.683 |
|         | 4   | 12.855 | 1.794 | 14.649 |
|         | 5   | 10.796 | 1.778 | 12.574 |
We highly recommend that users enable /LTCG (Link-time Code Generation) to optimize their applications. The test results with /LTCG are shown in Table 4.
Table 4. Test results, LTCG with /DEBUG on (seconds)

| Project | Run | Pass1   | Pass2   | Total   |
|---------|-----|---------|---------|---------|
| Proj1   | 1   | 178.797 | 1.734   | 180.531 |
|         | 2   | 155.593 | 0.954   | 156.547 |
|         | 3   | 153.750 | 1.031   | 154.781 |
|         | 4   | 152.562 | 0.891   | 153.453 |
|         | 5   | 153.156 | 0.797   | 153.953 |
| Proj2   | 1   | 120.375 | 5.546   | 125.921 |
|         | 2   | 102.343 | 5.172   | 107.515 |
|         | 3   | 102.203 | 5.235   | 107.438 |
|         | 4   | 102.016 | 5.343   | 107.359 |
|         | 5   | 102.250 | 5.078   | 107.328 |
| Proj3   | 1   | 222.859 | 20.719  | 243.578 |
|         | 2   | 185.281 | 22.437  | 207.718 |
|         | 3   | 184.984 | 21.422  | 206.406 |
|         | 4   | 185.203 | 22.656  | 207.859 |
|         | 5   | 186.078 | 22.844  | 208.922 |
| Proj4   | 1   | 522.329 | 122.984 | 645.313 |
|         | 2   | 490.188 | 54.406  | 544.594 |
|         | 3   | 441.125 | 51.860  | 492.985 |
|         | 4   | 430.609 | 51.813  | 482.422 |
|         | 5   | 437.344 | 49.750  | 487.094 |
Observations
Based on the above results and other investigations, we made the following observations:
1. If /LTCG is used, most of the link time is spent on code generation (a compiler task) in Pass1.
2. OS caching of input files reduces the time spent in both passes quite a lot.
3. The majority of the time spent in Pass2 goes to writing the PDB file.
Linker changes and impact in VS2010
Multi-threading during Pass2
After some investigation, we decided to introduce a dedicated thread for writing PDB files, because:
1. Most users specify /DEBUG when linking, irrespective of whether the application is built under the “debug” or “release” configuration.
2. The data written into the final binary does not depend on the result of writing the PDB file, and vice versa; that is, the binary writing task is independent of the PDB writing task.
3. When the project is big, the linker has a lot of other work to do during Pass2 in addition to writing the PDB file, such as reading data from object files and libraries.
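The structure this reasoning leads to can be sketched as a producer and a consumer: the main thread keeps building the image and hands debug records to a second thread that does nothing but write them out. The Python below is only an illustration of that shape under stated assumptions. The function names, the `(image_bytes, debug_bytes)` input format and the in-memory streams are invented for the example and say nothing about how the Visual C++ linker is actually implemented.

```python
import io
import queue
import threading

_DONE = object()  # sentinel that tells the writer thread to stop


def pdb_writer(records, pdb_stream):
    """Runs on the dedicated thread: drain debug records into the PDB stream."""
    while True:
        record = records.get()
        if record is _DONE:
            break
        pdb_stream.write(record)


def pass2_with_pdb_thread(contributions, image_stream, pdb_stream):
    """Write the image on the calling thread and debug records on another.

    `contributions` is a hypothetical list of (image_bytes, debug_bytes)
    pairs, one per input file.
    """
    records = queue.Queue()
    writer = threading.Thread(target=pdb_writer, args=(records, pdb_stream))
    writer.start()
    try:
        for image_bytes, debug_bytes in contributions:
            records.put(debug_bytes)         # hand off without waiting for the write
            image_stream.write(image_bytes)  # keep building the image meanwhile
    finally:
        records.put(_DONE)
        writer.join()  # both outputs are complete only after the join


image, pdb = io.BytesIO(), io.BytesIO()
pass2_with_pdb_thread(
    [(b"\x90\x90", b"sym:main;"), (b"\xc3", b"sym:helper;")],
    image,
    pdb,
)
print(len(image.getvalue()), "image bytes,", len(pdb.getvalue()), "PDB bytes")
```

This split only works because of point 2 above: neither output needs the other's result, so the only synchronization required is the final join before the link is reported as finished.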
Results
The following tables compare linker performance between VS2010 and VS2008 SP1. To remove the effect of caching, we rebooted the test machine (with SuperFetch disabled) before each run. For ease of comparison, the times for the old linker are also listed; they are the first (uncached) runs from Table 2 and Table 4.
Table 5. Test results of the new linker, non-LTCG with /DEBUG on (seconds)

| Project | New linker (VS2010) Pass1 | New Pass2 | New Total | Old linker (VS2008 SP1) Pass1 | Old Pass2 | Old Total | Pass2 Improved | Total Improved |
|---------|---------------------------|-----------|-----------|-------------------------------|-----------|-----------|----------------|----------------|
| Proj1   | 3.547                     | 1.859     | 5.406     | 4.437                         | 2.328     | 6.765     | 20.15%         | 20.08%         |
| Proj2   | 9.797                     | 10.266    | 20.063    | 9.484                         | 15.766    | 25.250    | 34.89%         | 20.54%         |
| Proj3   | 17.078                    | 22.609    | 39.687    | 27.266                        | 43.687    | 70.953    | 48.25%         | 44.07%         |
| Proj4   | 47.500                    | 54.281    | 101.781   | 47.453                        | 70.172    | 117.625   | 22.65%         | 13.47%         |
Table 6. Test results of the new linker, LTCG with /DEBUG on (seconds)

| Project | New linker (VS2010) Pass1 | New Pass2 | New Total | Old linker (VS2008 SP1) Pass1 | Old Pass2 | Old Total | Pass2 Improved | Total Improved |
|---------|---------------------------|-----------|-----------|-------------------------------|-----------|-----------|----------------|----------------|
| Proj1   | 153.516                   | 0.953     | 154.469   | 178.797                       | 1.734     | 180.531   | 45.04%         | 14.44%         |
| Proj2   | 119.703                   | 5.391     | 125.094   | 120.375                       | 5.546     | 125.921   | 2.79%          | 0.66%          |
| Proj3   | 225.688                   | 16.594    | 242.282   | 222.859                       | 20.719    | 243.578   | 19.91%         | 0.53%          |
| Proj4   | 525.375                   | 80.375    | 605.750   | 522.329                       | 122.984   | 645.313   | 34.65%         | 6.13%          |
From Table 5 and Table 6, it can be seen that multi-threading the linker has improved the performance of Pass2, and it is especially effective for bigger projects.
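The “Improved” columns appear to be the time saved relative to the old linker, that is (old − new) / old. That formula is inferred from the numbers rather than stated anywhere in the tables, so treat it as an assumption. The short Python check below applies it to the Proj3 row of Table 5; the function name `improvement` is made up for the example.

```python
def improvement(old_seconds, new_seconds):
    """Time saved by the new linker, as a percentage of the old linker's time."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


# Proj3, non-LTCG with /DEBUG on (Table 5).
pass2_gain = improvement(old_seconds=43.687, new_seconds=22.609)
total_gain = improvement(old_seconds=70.953, new_seconds=39.687)

print(f"Pass2 improved: {pass2_gain:.2f}%")  # prints "Pass2 improved: 48.25%"
print(f"Total improved: {total_gain:.2f}%")  # prints "Total improved: 44.07%"
```

The total improves by less than Pass2 because Pass1 is unchanged by this work. Under /LTCG, where Pass1 is dominated by code generation, the same Pass2 savings barely move the total, as Table 6 shows.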
Future
We will continue to look into linker throughput after the 2010 release to find further areas to improve. If you have any suggestions or feedback, feel free to let us know.