Hello, my name is Chandler Shen, a developer from the Visual C++ Shanghai team.
We have made some changes in the upcoming Visual C++ 2010 release to improve linker performance. I would like to first give a brief overview of the linker and how we analyzed the bottlenecks in the current implementation. Later, I will describe the changes we made and their impact on linker performance.
We targeted linker throughput in the full-build scenario for large-scale projects, because that is where linker throughput scalability matters most. Incremental linking and smaller projects will not benefit from the work I describe in this post.
Traditionally, the linker's work can be split into two phases:
1. Pass1: collecting definitions of symbols (from both object files and libraries)
2. Pass2: fixing up references to symbols with their final addresses (actually Relative Virtual Addresses) and writing out the final image.
If /GL (Whole Program Optimization) is specified at compile time, the compiler generates a special object-file format containing intermediate language. When the linker encounters such object files, Pass1 becomes a two-phase procedure. The linker first calls into the compiler to collect the definitions of all public symbols from these object files and builds a complete public symbol table. It then supplies this symbol table to the compiler, which generates the final machine instructions (code generation).
During Pass2, in addition to writing the final image, the linker also writes debug information into a PDB (Program Database) file if the user specifies /DEBUG (Generate Debug Info). Some of this debug information, such as the addresses of symbols, is not known until link time.
In this section, I will show how we analyzed some test cases to identify performance bottlenecks.
To reach an objective conclusion, four real-world projects of differing scale (names omitted), proj1 through proj4, were chosen as test cases.
Table 1 Measurements of test cases
In Table 1, the number of “symbols” is the number of entries in the symbol table the linker uses internally to store information about all external symbols. Note that “proj4” is much bigger than the others.
Following is the configuration of the test machine:
· CPU: Intel Xeon 3.20 GHz, 4 cores
· RAM: 2 GB
· Software: Windows Vista 32-bit
To minimize environmental effects, each case was run five times. All times are in seconds.
Tables 2 and 3 show that for each test case there is always one run (usually the first, marked in red) that takes much longer than the others, while another run (marked in green) may be much shorter. There are two reasons for this:
· The OS caches a file’s contents in memory for the next read (the Prefetcher on Windows XP, SuperFetch on Windows Vista)
· Most modern hard disks also cache a file’s contents for the next read
Comparing Table 2 with Table 3, we can see that with /DEBUG off the time for Pass2 is much shorter, which indicates that the majority of Pass2 is spent writing the PDB file.
Table 2 Test result of Non-LTCG with /Debug On
Table 3 Test result of Non-LTCG with /Debug Off
It is highly recommended that users use /LTCG (Link-Time Code Generation) to optimize applications. The test results with /LTCG are shown in Table 4.
Table 4 Test Result of LTCG with /Debug On
Based on the above results and other investigation, we made the following observations:
1. If /LTCG is used, most of the linking time is spent on code generation (a compiler task) in Pass1.
2. OS caching of input files significantly decreases the time spent in both passes.
3. The majority of the time spent in Pass2 is writing the PDB file.
After further investigation, we decided to introduce a dedicated thread for writing the PDB file, because:
1. Most users specify /DEBUG when linking, regardless of whether the application is built in the “debug” or “release” configuration.
2. The data written into the final binary does not depend on the result of writing the PDB file, and vice versa: the binary-writing task and the PDB-writing task are independent.
3. When the project is big, the linker has plenty of other work to do during Pass2 in addition to writing the PDB file, such as reading data from object files and libraries.
The following tables compare linker performance between VS2010 and VS2008 SP1. To remove the effect of caching, we rebooted the test machine (with SuperFetch disabled) before each run. For ease of comparison, the times for the old linker (from Table 2 and Table 4, without caching) are also listed.
Table 5 Test Result of new linker, Non-LTCG with /Debug On
New linker (VS2010)
Old linker (VS2008 SP1)
Table 6 Test Result of new linker, LTCG with /Debug On
Old linker (VS2008 SP1)
From Table 5 and Table 6, it can be seen that multi-threading the linker has improved the performance of Pass2, especially for bigger projects.
We will continue to look into linker throughput even after the 2010 release to find areas to improve. If you have any suggestions or feedback, feel free to let us know.
Also, how does ICF (identical COMDAT folding) affect link time? The default BUILD script enables it, even in checked builds (which I believe is bad for setting breakpoints during debugging).
1. Before writing to the final .PDB file, the linker checks the debug information to ensure no incorrect data is written. Unfortunately, the algorithm for this check is time-consuming.
2 & 3. As for file I/O, most of the time is spent handling page faults (reading a page from disk and writing a page to disk). Usually (with enough memory/CPU available), it makes little difference whether the user calls WriteFile/ReadFile N times with one record each, or just once with N records.
4. As for ICF, the linker compares two COMDATs byte by byte before folding one of them; before that, it checks some flags and characteristics to avoid the comparison where possible. So if there are many functions within one module (.obj file) that have the same code size and similar implementations, the linker will be slowed down.
I work on a team developing a huge C++ application with circa 4000 source files and 2500 obj files, using VS2008. The Debug and Release configurations are both built with debug info and without the /LTCG flag. Debug uses incremental linking; Release does not.
My machine is a dual-core 2.3 GHz with 4096 MB of RAM.
Our concern is unsatisfactory linking times. Linking the Debug configuration takes between 20 and 1200 seconds.
Linking Release takes between 270 and 1800 seconds. Linking with IncredBuild takes especially long, even at high priority.
I have two suggestions on the issue.
First, the linker takes 25 minutes to link, yet afterwards it reports a CPU link time of only 19 seconds and memory usage of only 350 MB. It would be nice if you had a special option, say /HUGEPROJECT, to read all object segments into memory in the first step as well. All modern computers have a lot of memory, but your linker simply doesn't use it.
Second, your linker obviously lacks a satisfactory prelink check in incremental mode to determine whether it can update the project in reasonable time. Every linker starting from 7.0 fails to link in acceptable time if the changes to the files are anything more than tiny. As a result, non-incremental linking takes 4 minutes, incremental 15. That is something other than satisfactory performance.
One more remark.
It seems that the linker writes debug info and reads input modules simultaneously.
Forming the complete debug info in memory and writing it out at the end would surely reduce linking time.
The VC++ team spends a lot of time investigating linker throughput issues. The multi-threading work was done after demonstrating, through significant performance analysis of multiple scenarios, that it did actually increase overall performance.
Maybe your scenario exposes a unique performance problem, but overall I don't think we have obviously bad practices in the linker.
Maybe I am wrong, and maybe there are. All I can say is we will continue to analyze linker performance, spending time understanding IO access patterns to further optimize throughput.
We know linker throughput is one of the most important things to our customers, and we'll keep doing whatever we can to improve it.
Thanks for your suggestion! We appreciate it so much!
For the IncredBuild case, multiple PDB files will be generated, which costs the linker more time to merge them into one PDB.
For your first suggestion, I think reading some of the object segments in the second step lets us exploit multithreading: while one thread is blocked on I/O, another can be busy on the CPU.
I agree with the cpp->obj->module comment.
If I were in the compiler/linker business, I would start using RAM.
The result is a service/daemon:
- In-memory dependency graph (FindFirstChange etc.)
- Traditional dependency check will not be necessary 99.99% of the time. It's done already.
- Compiler pipes output directly to daemon
- Daemon links in-memory
- Daemon keeps result (dll/exe) in-memory
- Daemon outputs dll/exe on-demand.