Linker throughput


Hello, my name is Chandler Shen, a developer from the Visual C++ Shanghai team.

We have made some changes in the upcoming Visual C++ 2010 release to improve the performance of the linker. I would like to first give a brief overview of the linker and of how we analyzed the bottlenecks in the current implementation. Later, I will describe the changes we made and their impact on linker performance.

Our Focus

 

We targeted linker throughput in the full-build scenario of large-scale projects, because this is the scenario that matters most for linker throughput scalability. Incremental linking and smaller projects will not benefit from the work I describe in this blog.

Brief Overview of Linker

 

Traditionally, the linker’s work can be split into two phases:

1.       Pass1: collecting the definitions of symbols (from both object files and libraries)

2.       Pass2: fixing up references to symbols with their final addresses (actually Relative Virtual Addresses) and writing out the final image.
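The two passes can be illustrated with a toy model. This is a minimal sketch in Python, not the Visual C++ linker's actual design: the `ObjectFile` class, the starting RVA of 0x1000, and the simple back-to-back layout are all hypothetical simplifications, and library search, sections, and relocation types are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    """Hypothetical, highly simplified stand-in for an .obj file."""
    name: str
    size: int                                       # bytes of code/data it contributes
    defines: dict = field(default_factory=dict)     # symbol -> offset inside this object
    references: list = field(default_factory=list)  # symbols this object refers to


def pass1(objects, image_base_rva=0x1000):
    """Pass1: lay out the objects and record the RVA of every defined symbol."""
    symbol_table = {}
    rva = image_base_rva
    for obj in objects:
        for symbol, offset in obj.defines.items():
            if symbol in symbol_table:
                raise ValueError(f"duplicate definition of {symbol}")
            symbol_table[symbol] = rva + offset
        rva += obj.size
    return symbol_table


def pass2(objects, symbol_table):
    """Pass2: resolve every reference to its final RVA (the fix-up step)."""
    fixups = []
    for obj in objects:
        for symbol in obj.references:
            if symbol not in symbol_table:
                raise KeyError(f"unresolved external symbol {symbol}")
            fixups.append((obj.name, symbol, symbol_table[symbol]))
    return fixups


objects = [
    ObjectFile("main.obj", size=0x40, defines={"main": 0x0}, references=["helper"]),
    ObjectFile("util.obj", size=0x20, defines={"helper": 0x10}),
]
table = pass1(objects)
fixups = pass2(objects, table)
```

The point of the split is that Pass2 cannot start patching references until Pass1 has seen every definition, which is why the symbol table is built completely before any fix-up is applied.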

Link Time Code Generation (LTCG)

If /GL (Whole Program Optimization) is specified when compiling, the compiler generates a special object file format containing intermediate language. When the linker encounters such object files, Pass1 becomes a two-phase procedure. First, the linker calls into the compiler to collect the definitions of all public symbols from these object files and build a complete public symbol table. Then the linker supplies this symbol table to the compiler, which generates the final machine instructions (code generation).

Debug Information

During Pass2, in addition to writing the final image, the linker also writes debug information into a PDB (Program Database) file if the user specifies /DEBUG (Generate Debug Info). Some of this debug information, such as the addresses of symbols, is not known until link time.

Bottlenecks

In this section, I will show how we analyzed some test cases to identify the performance bottlenecks.

Test Cases

To reach an objective conclusion, we chose four real-world projects of different scales as test cases. Their names are omitted; they are referred to as Proj1, Proj2, Proj3 and Proj4.

Table 1 Measurements of test cases

                    Proj1    Proj2    Proj3    Proj4
Files    Total         55       27      168     1066
         .obj           4        6        7      882
         .lib          51       21      161      184
Symbols              6026    22436    69570   110262

In Table 1, “Symbols” is the number of entries in the symbol table that the linker uses internally to store information about all external symbols. Note that Proj4 is much bigger than the others.

Test Environment

The test machine was configured as follows:

·  Hardware

   o  CPU: Intel Xeon, 3.20 GHz, 4 cores

   o  RAM: 2 GB

·  Software: Windows Vista 32-bit

Results

To minimize the effect of the environment, each case was run five times. All times are in seconds.

Table 2 and Table 3 show that, for each test case, one run (usually the first) takes much longer than the others, while another run may take noticeably less time. There are two reasons for this:

·  The OS caches a file’s content in memory for the next read (called Prefetch on Windows XP and SuperFetch on Windows Vista).

·  Most modern hard disks cache a file’s content for the next read.

Comparing Table 2 with Table 3, we can see that Pass2 is much shorter when /DEBUG is off. This indicates that most of the time in Pass2 is spent writing the PDB file.
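As a rough check of this claim, the sketch below compares Pass2 times with /DEBUG on and off, using the second (warm-cache) run of each project from Table 2 and Table 3. Treating the whole difference as PDB-related work is a simplifying assumption, since the two measurements come from separate runs; Python is used purely for illustration.

```python
# Pass2 times in seconds, run 2 of each project (warm cache).
pass2_debug_on = {"Proj1": 1.218, "Proj2": 8.188, "Proj3": 17.672, "Proj4": 59.813}   # Table 2
pass2_debug_off = {"Proj1": 0.031, "Proj2": 0.187, "Proj3": 0.546, "Proj4": 1.778}    # Table 3


def pdb_share(on_seconds, off_seconds):
    """Fraction of Pass2 time that disappears when /DEBUG is turned off."""
    return (on_seconds - off_seconds) / on_seconds


shares = {
    project: pdb_share(pass2_debug_on[project], pass2_debug_off[project])
    for project in pass2_debug_on
}
for project, share in shares.items():
    print(f"{project}: {share:.1%} of Pass2 is attributable to /DEBUG")
```

With these inputs, every project comes out above 95%, which is consistent with PDB writing dominating Pass2.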

Table 2 Test result of Non-LTCG with /DEBUG on (time in seconds)

Project  Run     Pass1     Pass2     Total
Proj1    1       4.437     2.328     6.765
         2       0.266     1.218     1.484
         3       0.265     1.188     1.453
         4       0.265     1.219     1.484
         5       0.235     1.375     1.610
Proj2    1       9.484    15.766    25.250
         2       1.531     8.188     9.719
         3       1.579     8.078     9.657
         4       1.625     7.890     9.515
         5       1.610     8.297     9.907
Proj3    1      27.266    43.687    70.953
         2       4.250    17.672    21.922
         3       4.141    17.265    21.406
         4       4.203    18.500    22.703
         5       4.688    19.078    23.766
Proj4    1      47.453    70.172   117.625
         2      17.250    59.813    77.063
         3      17.547    55.672    73.219
         4      16.516    47.172    63.688
         5      14.937    44.079    59.016

 

Table 3 Test result of Non-LTCG with /DEBUG off (time in seconds)

Project  Run     Pass1     Pass2     Total
Proj1    1       0.187     0.078     0.265
         2       0.218     0.031     0.249
         3       0.187     0.047     0.234
         4       0.203     0.031     0.234
         5       0.187     0.031     0.218
Proj2    1       6.209     0.297     6.506
         2       1.310     0.187     1.497
         3       1.295     0.187     1.482
         4       1.342     0.203     1.545
         5       1.310     0.203     1.513
Proj3    1      15.382     0.764    16.146
         2       3.541     0.546     4.087
         3       3.650     0.562     4.212
         4       3.557     0.546     4.150
         5       3.588     0.562     4.150
Proj4    1      12.059     1.856    13.915
         2      10.811     1.778    12.589
         3      10.874     1.809    12.683
         4      12.855     1.794    14.649
         5      10.796     1.778    12.574

 

It is highly recommended that users use /LTCG (Link-time Code Generation) to optimize applications. The test results with /LTCG are shown in Table 4.

Table 4 Test result of LTCG with /DEBUG on (time in seconds)

Project  Run     Pass1     Pass2     Total
Proj1    1     178.797     1.734   180.531
         2     155.593     0.954   156.547
         3     153.750     1.031   154.781
         4     152.562     0.891   153.453
         5     153.156     0.797   153.953
Proj2    1     120.375     5.546   125.921
         2     102.343     5.172   107.515
         3     102.203     5.235   107.438
         4     102.016     5.343   107.359
         5     102.250     5.078   107.328
Proj3    1     222.859    20.719   243.578
         2     185.281    22.437   207.718
         3     184.984    21.422   206.406
         4     185.203    22.656   207.859
         5     186.078    22.844   208.922
Proj4    1     522.329   122.984   645.313
         2     490.188    54.406   544.594
         3     441.125    51.860   492.985
         4     430.609    51.813   482.422
         5     437.344    49.750   487.094

 

Observations

Based on the above results and other investigation, we made the following observations:

1.       If /LTCG is used, most of the link time is spent on code generation (a compiler task) in Pass1.

2.       OS caching of input files reduces the time spent in both passes considerably.

3.       The majority of the time spent in Pass2 goes to writing the PDB file.

Linker changes and impact in VS2010

Multi-threading during Pass2

After some investigation, we decided to introduce a dedicated thread for writing the PDB file, because:

1.       Most users specify /DEBUG when linking, irrespective of whether the application is built under the “debug” or “release” configuration.

2.       The data written into the final binary does not depend on the result of writing the PDB file, and vice versa; i.e., the binary-writing task is independent of the PDB-writing task.

3.       When the project is big, the linker has a lot of other work to do during Pass2 in addition to writing the PDB file, such as reading data from object files and libraries.
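The idea of moving PDB writing onto its own thread can be sketched as follows. This is only an illustration in Python of the general pattern (two independent outputs produced concurrently), not the linker's implementation: the function names are hypothetical, and the `time.sleep` calls stand in for real I/O and processing.

```python
import threading
import time


def write_image(sections, out):
    """Main-thread work: emit the final binary (simulated)."""
    for section in sections:
        time.sleep(0.01)          # stand-in for reading inputs and applying fix-ups
        out.append(("image", section))


def write_pdb(records, out):
    """Dedicated-thread work: emit debug records (simulated)."""
    for record in records:
        time.sleep(0.01)          # stand-in for validating and writing one record
        out.append(("pdb", record))


def pass2_with_pdb_thread(sections, records):
    image_out, pdb_out = [], []
    # The two outputs do not depend on each other, so no locking is needed:
    # each thread appends only to its own list.
    pdb_thread = threading.Thread(target=write_pdb, args=(records, pdb_out))
    pdb_thread.start()
    write_image(sections, image_out)
    pdb_thread.join()             # Pass2 is finished only when both outputs are complete
    return image_out, pdb_out


image, pdb = pass2_with_pdb_thread([".text", ".data"], ["sym_a", "sym_b", "sym_c"])
```

Because neither output depends on the other, the only synchronization point is the final `join`. The benefit comes from overlapping the waits of one task with the work of the other, which matches the third reason above.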

Results

The following tables compare linker performance between VS2010 and VS2008 SP1. To remove the effect of caching, we rebooted the test machine (with SuperFetch disabled) before each run. For ease of comparison, the times for the old linker (the first, uncached runs from Table 2 and Table 4) are also listed.

Table 5 Test result of new linker, Non-LTCG with /DEBUG on (time in seconds)

            New linker (VS2010)      Old linker (VS2008 SP1)       Improved
Project     Pass1    Pass2    Total    Pass1    Pass2    Total    Pass2    Total
Proj1       3.547    1.859    5.406    4.437    2.328    6.765   20.15%   20.08%
Proj2       9.797   10.266   20.063    9.484   15.766   25.250   34.89%   20.54%
Proj3      17.078   22.609   39.687   27.266   43.687   70.953   48.25%   44.07%
Proj4      47.500   54.281  101.781   47.453   70.172  117.625   22.65%   13.47%

 

Table 6 Test result of new linker, LTCG with /DEBUG on (time in seconds)

            New linker (VS2010)      Old linker (VS2008 SP1)       Improved
Project     Pass1    Pass2    Total    Pass1    Pass2    Total    Pass2    Total
Proj1     153.516    0.953  154.469  178.797    1.734  180.531   45.04%   14.44%
Proj2     119.703    5.391  125.094  120.375    5.546  125.921    2.79%    0.66%
Proj3     225.688   16.594  242.282  222.859   20.719  243.578   19.91%    0.53%
Proj4     525.375   80.375  605.750  522.329  122.984  645.313   34.65%    6.13%

 

Table 5 and Table 6 show that multi-threading the linker improves the performance of Pass2, and that it is especially effective for bigger projects.
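The “Improved” columns appear to be the time saved relative to the old linker, (old - new) / old. That formula is inferred from the numbers rather than stated in the tables; the short Python sketch below applies it to the Proj3 row of Table 5.

```python
def improvement_percent(old_seconds, new_seconds):
    """Relative time saved by the new linker, as a percentage of the old time."""
    if old_seconds <= 0:
        raise ValueError("old_seconds must be positive")
    return (old_seconds - new_seconds) / old_seconds * 100.0


# Proj3 row of Table 5 (Non-LTCG, /DEBUG on), times in seconds.
pass2_gain = improvement_percent(old_seconds=43.687, new_seconds=22.609)
total_gain = improvement_percent(old_seconds=70.953, new_seconds=39.687)
print(f"Pass2 improved: {pass2_gain:.2f}%")
print(f"Total improved: {total_gain:.2f}%")
```

The same calculation also explains why the LTCG totals in Table 6 improve so little: Pass2 is a small part of the total there, so even a large Pass2 gain barely moves the overall time.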

Future

We will continue to look into linker throughput after the 2010 release to find areas to improve. If you have any suggestions or feedback, feel free to let us know.

 

  • Also, how does ICF (identical COMDAT folding) affect link time? The default BUILD script enables it, even in checked builds (which I believe is bad for setting breakpoints during debugging).

  • @Alexandre Grigoriev

    1. Before writing to the final .PDB file, the linker has to check the debug information to ensure that no wrong data is written. Unfortunately, the algorithm for this part is time-consuming.

    2 & 3. As for file I/O, most of the time is spent on page fault handling (reading a page from disk and writing a page to disk). Usually (with enough memory and CPU available), it makes little difference whether the user calls WriteFile/ReadFile N times, processing one record per call, or calls WriteFile/ReadFile just once to process N records.

    4. As for ICF, the linker compares two COMDATs byte by byte before folding one of them; before that, it checks some flags and characteristics to avoid the comparison where possible. So if one module (.obj file) contains many functions with the same code size and similar implementations, the linker will be slowed down.

    Chandler

  • I work on a team developing huge C++ applications with about 4000 source files and 2500 obj files, using VS2008. The Debug and Release configurations are both built with debug info and without the /LTCG flag. Debug uses incremental linking; Release does not.

    My machine is a dual-core 2.3 GHz with 4096 MB of RAM.

    Our concern is the unsatisfactory linking times. Linking the Debug configuration takes between 20 and 1200 sec.

    Linking the Release configuration takes between 270 and 1800 sec. Linking with IncredBuild takes especially long, even at high priority.

    I have two suggestions on the issue.

    First, the linker takes 25 minutes to link, yet its CPU time for the link is only 19 seconds and its memory use is 350 MB. It would be nice to have a special option like /HUGEPROJECT that reads all object segments into memory in the first step. All modern computers have a lot of memory, but your linker simply doesn't use it.

    Second, your linker obviously lacks a satisfactory pre-link check in incremental mode to determine whether it is possible to update the project in reasonable time. Every linker since 7.0 fails to link in acceptable time if the changes in the files are a little more than tiny. As a result, non-incremental linking takes 4 minutes and incremental linking takes 15. That is far from satisfactory performance.

    With Regards,

    Boris

  • One more remark.

    It seems that linker simultaneously writes debug info and reads input modules.

    Forming the complete debug info in memory and writing it at the program end will surely reduce the linking time.

  • The VC++ team spends a lot of time investigating linker throughput issues.  The multi-threading work was done after demonstrating, through significant performance analysis of multiple scenarios, that it did actually increase overall performance

    Maybe your scenario exposes unique performance problem, but overall, I don't think we have obvious bad practices in the linker.  

    Maybe I am wrong, and maybe there are. All I can say is we will continue to analyze linker performance, spending time understanding IO access patterns to further optimize throughput.

    We know linker throughput is one of the most important things to our customers, and we'll keep doing whatever we can to improve it.

  • @Boris Sunik

    Thanks for your suggestion! We appreciate it so much!

    In the IncredBuild case, multiple PDB files are generated, and it costs the linker more time to merge them into one PDB.

    As for your first suggestion, I think reading some of the object segments in the second step makes better use of multithreading: one thread is blocked on I/O while another is busy on the CPU.

    Chandler

  • I agree with the cpp->obj->module comment.

    If I was in the compiler/linker business, I would start using RAM.

    The result is a service/daemon:

    - In-memory dependency graph(FindFirstChange etc.)

    - Traditional dependency check will not be necessary 99.99% of the time. It's done already.

    - Compiler pipes output directly to daemon

    - Daemon links in-memory

    - Daemon keeps result (dll/exe) in-memory

    - Daemon outputs dll/exe on-demand.

    Patrick
