Linker throughput


Hello, my name is Chandler Shen, a developer from the Visual C++ Shanghai team.

We have made some changes in the upcoming Visual C++ 2010 release to improve linker performance. I will first give a brief overview of the linker and of how we analyzed the bottlenecks in the current implementation, and then describe the changes we made and their impact on linker performance.

Our Focus

 

We targeted linker throughput for full builds of large-scale projects, because that is the scenario where linker throughput scalability matters most. Incremental linking and smaller projects will not benefit from the work described in this post.

Brief Overview of Linker

 

Traditionally, the linker's work can be split into two passes (a small example follows this list):

1. Pass1: collecting the definitions of symbols (from both object files and libraries).

2. Pass2: fixing up references to symbols with their final addresses (actually Relative Virtual Addresses) and writing out the final image.
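As a minimal illustration (the file and function names here are hypothetical, not taken from the test projects), consider two translation units: Pass1 records the definition of greet found in greet.obj, and Pass2 patches the call site left behind in main.obj with the symbol's final RVA.

// greet.cpp -- compile with: cl /c greet.cpp
#include <cstdio>

void greet()                     // Pass1 collects this definition from greet.obj
{
    std::puts("hello from greet.obj");
}

// main.cpp -- compile with: cl /c main.cpp, then: link main.obj greet.obj
void greet();                    // reference only; unresolved inside main.obj

int main()
{
    greet();                     // Pass2 fixes up this call with greet's final RVA
    return 0;
}

Until link time, main.obj simply carries a relocation record for greet; it is the linker that turns that record into a concrete address in the image.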

Link Time Code Generation (LTCG)

If /GL (Whole Program Optimization) is specified when compiling, the compiler generates a special object-file format that contains intermediate language rather than machine code. When the linker encounters such object files, Pass1 becomes a two-phase procedure: the linker first calls into the compiler to collect the definitions of all public symbols from these object files and build a complete public symbol table; it then supplies that symbol table to the compiler, which generates the final machine instructions (code generation).
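As a hedged sketch of why this matters (the file names and build lines below are illustrative assumptions, not part of the measurements in this post): because code generation is deferred to link time, the back end can optimize across translation-unit boundaries, for example inlining a function that is defined in a different object file.

// mathutil.cpp -- compile with: cl /c /O2 /GL mathutil.cpp
int square(int x)                // under /GL this TU is stored as intermediate language
{
    return x * x;
}

// main.cpp -- compile with: cl /c /O2 /GL main.cpp
//             link with:    link /LTCG main.obj mathutil.obj
int square(int x);

int main()
{
    // With /LTCG the back end sees both translation units at link time,
    // so this call can be inlined even though square() lives in another
    // .obj. That code-generation work is what dominates Pass1 in the
    // LTCG measurements later in this post.
    return square(21);
}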

Debug Information

During Pass2, in addition to writing the final image, the linker also writes debug information into a PDB (Program Database) file if the user specifies /DEBUG (Generate Debug Info). Some of this debug information, such as the addresses of symbols, is not known until link time.

Bottlenecks

In this section, I will show how we analyzed some test cases to identify the performance bottlenecks.

Test Cases

To reach an objective conclusion, four real-world projects of differing scale (whose names are omitted), referred to here as proj1 through proj4, were chosen as test cases.

Table 1 Measurements of test cases

                     Proj1     Proj2     Proj3     Proj4
Files    Total          55        27       168      1066
         .obj            4         6         7       882
         .lib           51        21       161       184
Symbols               6026     22436     69570    110262

In Table 1, “Symbols” is the number of entries in the symbol table that the linker uses internally to store information about all external symbols. Note that proj4 is much bigger than the others.

Test Environment

The test machine was configured as follows:

  • Hardware
      o CPU: Intel Xeon 3.20 GHz, 4 cores
      o RAM: 2 GB
  • Software: Windows Vista 32-bit

Results

To minimize environmental effects, each case was run five times. All times are in seconds.

Tables 2 and 3 show that, for each test case, one run (usually the first) takes much longer than the others, while another run may be noticeably shorter. There are two reasons for this:

  • The OS caches a file's contents in memory for subsequent reads (called Prefetch on Windows XP and SuperFetch on Windows Vista).

  • Most modern hard disks also cache a file's contents for subsequent reads.

 

Comparing Table 2 with Table 3, we can see that with /DEBUG off the time for Pass2 is much shorter, which indicates that the majority of Pass2 is spent writing the PDB file.

Table 2 Test result of Non-LTCG with /Debug On

          Run     Pass1     Pass2      Total
Proj1       1     4.437     2.328      6.765
            2     0.266     1.218      1.484
            3     0.265     1.188      1.453
            4     0.265     1.219      1.484
            5     0.235     1.375      1.610
Proj2       1     9.484    15.766     25.250
            2     1.531     8.188      9.719
            3     1.579     8.078      9.657
            4     1.625     7.890      9.515
            5     1.610     8.297      9.907
Proj3       1    27.266    43.687     70.953
            2     4.250    17.672     21.922
            3     4.141    17.265     21.406
            4     4.203    18.500     22.703
            5     4.688    19.078     23.766
Proj4       1    47.453    70.172    117.625
            2    17.250    59.813     77.063
            3    17.547    55.672     73.219
            4    16.516    47.172     63.688
            5    14.937    44.079     59.016

 

Table 3 Test result of Non-LTCG with /Debug Off

          Run     Pass1     Pass2      Total
Proj1       1     0.187     0.078      0.265
            2     0.218     0.031      0.249
            3     0.187     0.047      0.234
            4     0.203     0.031      0.234
            5     0.187     0.031      0.218
Proj2       1     6.209     0.297      6.506
            2     1.310     0.187      1.497
            3     1.295     0.187      1.482
            4     1.342     0.203      1.545
            5     1.310     0.203      1.513
Proj3       1    15.382     0.764     16.146
            2     3.541     0.546      4.087
            3     3.650     0.562      4.212
            4     3.557     0.546      4.150
            5     3.588     0.562      4.150
Proj4       1    12.059     1.856     13.915
            2    10.811     1.778     12.589
            3    10.874     1.809     12.683
            4    12.855     1.794     14.649
            5    10.796     1.778     12.574

 

We highly recommend that users apply /LTCG (Link-time Code Generation) to optimize their applications. The test results with /LTCG are shown in Table 4.

Table 4 Test Result of LTCG with /Debug On

          Run     Pass1     Pass2      Total
Proj1       1   178.797     1.734    180.531
            2   155.593     0.954    156.547
            3   153.750     1.031    154.781
            4   152.562     0.891    153.453
            5   153.156     0.797    153.953
Proj2       1   120.375     5.546    125.921
            2   102.343     5.172    107.515
            3   102.203     5.235    107.438
            4   102.016     5.343    107.359
            5   102.250     5.078    107.328
Proj3       1   222.859    20.719    243.578
            2   185.281    22.437    207.718
            3   184.984    21.422    206.406
            4   185.203    22.656    207.859
            5   186.078    22.844    208.922
Proj4       1   522.329   122.984    645.313
            2   490.188    54.406    544.594
            3   441.125    51.860    492.985
            4   430.609    51.813    482.422
            5   437.344    49.750    487.094

 

Observations

Based on the results above and other investigation, we made the following observations:

1. If /LTCG is used, most of the linking time is spent on code generation (a compiler task) during Pass1.

2. OS caching of input files greatly decreases the time spent in both passes.

3. The majority of the time spent in Pass2 goes to writing the PDB file.

Linker changes and impact in VS2010

Multi-threading during Pass2

After some investigation, we decided to introduce a dedicated thread for writing the PDB file (a minimal sketch of the idea follows this list), because:

1. Most users specify /DEBUG when linking, irrespective of whether the application is built under the “debug” or “release” configuration.

2. The data written into the final binary does not depend on the result of writing the PDB file, and vice versa; that is, the binary-writing task is independent of the PDB-writing task.

3. When the project is big, the linker has plenty of other work to do during Pass2 in addition to writing the PDB file, such as reading data from object files and libraries.
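The following is only a minimal sketch of the dedicated-writer idea using standard C++ threads; it is not the linker's actual implementation, and the types and function names are hypothetical.

#include <cstdio>
#include <functional>
#include <thread>

// Hypothetical stand-ins for the two independent Pass2 outputs.
struct DebugInfo { const char* path; };
struct Image     { const char* path; };

void WritePdb(const DebugInfo& debug)    // placeholder for the PDB-writing work
{
    std::printf("writing debug info to %s\n", debug.path);
}

void WriteImage(const Image& image)      // placeholder for writing the final binary
{
    std::printf("writing final image to %s\n", image.path);
}

void Pass2(const Image& image, const DebugInfo& debug, bool generateDebugInfo)
{
    std::thread pdbWriter;
    if (generateDebugInfo)
    {
        // The PDB write runs on its own thread because neither output
        // depends on the other's result.
        pdbWriter = std::thread(WritePdb, std::cref(debug));
    }

    WriteImage(image);                   // the main thread keeps writing the binary

    if (pdbWriter.joinable())
        pdbWriter.join();                // Pass2 completes when both writes are done
}

int main()
{
    Pass2(Image{ "app.exe" }, DebugInfo{ "app.pdb" }, true);
    return 0;
}

The win comes from overlapping the PDB I/O with the rest of Pass2's work, which is the effect measured in Tables 5 and 6 below.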

Results

The tables below compare linker performance between VS2010 and VS2008 SP1. To remove caching effects, we rebooted the test machine (with SuperFetch disabled) before each run. For ease of comparison, the times for the old linker (the uncached first runs from Table 2 and Table 4) are also listed.

Table 5 Test Result of new linker, Non-LTCG with /Debug On

              New linker (VS2010)        Old linker (VS2008 SP1)       Pass2       Total
           Pass1     Pass2     Total     Pass1     Pass2     Total    Improved    Improved
Proj1      3.547     1.859     5.406     4.437     2.328     6.765      20.15%      20.08%
Proj2      9.797    10.266    20.063     9.484    15.766    25.250      34.89%      20.54%
Proj3     17.078    22.609    39.687    27.266    43.687    70.953      48.25%      44.07%
Proj4     47.500    54.281   101.781    47.453    70.172   117.625      22.65%      13.47%

 

Table 6 Test Result of new linker, LTCG with /Debug On

              New linker (VS2010)        Old linker (VS2008 SP1)       Pass2       Total
           Pass1     Pass2     Total     Pass1     Pass2     Total    Improved    Improved
Proj1    153.516     0.953   154.469   178.797     1.734   180.531      45.04%      14.44%
Proj2    119.703     5.391   125.094   120.375     5.546   125.921       2.79%       0.66%
Proj3    225.688    16.594   242.282   222.859    20.719   243.578      19.91%       0.53%
Proj4    525.375    80.375   605.750   522.329   122.984   645.313      34.65%       6.13%

 

Tables 5 and 6 show that multi-threading the linker has improved the performance of Pass2, and that the effect is especially pronounced for bigger projects.
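For readers wondering how the "Improved" columns are derived, the following small check assumes improvement = (old − new) / old, which is my inference from the published numbers rather than something stated in the tables; it reproduces the Proj3 "Pass2 Improved" figure from Table 5.

#include <cstdio>

int main()
{
    double oldPass2 = 43.687;   // Proj3 Pass2, old linker (VS2008 SP1), Table 5
    double newPass2 = 22.609;   // Proj3 Pass2, new linker (VS2010), Table 5

    // Assumed formula: improvement = (old - new) / old
    std::printf("%.2f%%\n", 100.0 * (oldPass2 - newPass2) / oldPass2);  // prints 48.25%
    return 0;
}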

Future

We will continue to look into linker throughput even after the 2010 release to find areas to improve. If you have any suggestions or feedback, feel free to let us know.

 

  • Nice article! In fact, linker time is really annoying, especially with LTCG, when code generation is required even after a small change.

    Probably the whole "c++ -> obj -> linker" architecture is old and is the cause of most of the C++ compile and link time, isn't it?

  • It's great that you're looking at the linker as it's a big bottleneck on larger projects. I'm in games development and the link time is easily the biggest time waster.

    Parallelising the pdb generation seems like a good start, but I think it would also be good if you could review the algorithmic complexity of the debug information compilation and symbol management as well to see if there are any easy wins there. Especially when you start using templates you end up with lots and lots of redundant debug data and empirically it seems to affect link quite a bit.

  • Now, if you compile a global c++ object in a cpp, it will be linked into the exe if added to the linker as an obj file.

    When this obj file is put into a lib and we use that lib to link the exe, the linker won't pull it in (there is no reference to it from outside, because the static object registers itself as a service with a singleton-style manager).

    There is not even a linker option to take all objects from a given lib file.

    There is a big hack in the studio against this:

    Linker/General/Use Library Dependency Inputs

    But this is a joke when using big libraries...

    Even ATL and MFC use "workarounds" because of this problem (define in the header, use the first link attributes in the header files).

  • Only a per-object-file (cpp) __pragma init_seg(...) is available to set the initialization order.

    No way to set this per c++ object basis like:

    CMyObject __declspec(linkinit(56)) MyObject;

  • #pragma comment(lib, "dir/file.lib") does not work as a relative path:

    only a plain "file.lib" is searched for in the LIBPATH directories;

    "dir/file.lib" is automatically treated as a full path...

  • A second thread for writing the PDB is nice, but still won't make much use of today's multi-core CPUs. Is there no way the bulk of the work can be split across many threads? It's annoying to see Task Manager reporting 12.5% CPU usage for link.exe on a Core i7... still, I'm glad it's getting some attention.

  • I saw this a few days ago: http://stackoverflow.com/questions/1401342/why-is-linker-optimization-so-poor/1401374#1401374 and I think it's actually a good question. How come LTCG takes significantly longer than compiling if you simply include all the same code into a single translation unit?

    And what can you do (are you doing?) to improve this?

  • Do any of the test cases include machine-generated source code files?  Please provide some performance numbers for generating a lexer/parser with yacc or another compiler generator for a decently large language (C++, C, Pascal, etc.).

    Most compiler/linker pairs do miserably with generated code, given the tens of thousands of symbol names and large amounts of goto code.

  • Unable to turn off warnings for 3rd party libraries:

    xmaencoder.lib(wmachmtx.obj) : warning LNK4099: PDB 'wmachmtx.pdb' was not found with 'C:\Program Files (x86)\Microsoft Xbox 360 SDK\lib\win32\vs2008\xmaencoder.lib' or at 'C:\projects\Tachyon\_Lib\DebugWin32\wmachmtx.pdb'; linking object as if no debug info

    xmaencoder.lib(resampthreads.obj) : warning LNK4099: PDB 'audiosrc.pdb' was not found with 'C:\Program Files (x86)\Microsoft Xbox 360 SDK\lib\win32\vs2008\xmaencoder.lib' or at 'C:\projects\Tachyon\_Lib\DebugWin32\audiosrc.pdb'; linking object as if no debug info

    xmaencoder.lib(aresample.obj) : warning LNK4099: PDB 'audiosrc.pdb' was not found with 'C:\Program Files (x86)\Microsoft Xbox 360 SDK\lib\win32\vs2008\xmaencoder.lib' or at 'C:\projects\Tachyon\_Lib\DebugWin32\audiosrc.pdb'; linking object as if no debug info

    xmaencoder.lib(gresample.obj) : warning LNK4099: PDB 'audiosrc.pdb' was not found with 'C:\Program Files (x86)\Microsoft Xbox 360 SDK\lib\win32\vs2008\xmaencoder.lib' or at 'C:\projects\Tachyon\_Lib\DebugWin32\audiosrc.pdb'; linking object as if no debug info

  • Hi bionicbeagle,

    Thanks for your feedback, we are considering how to improve the performance of debug information processing in a future release.

    Hi Andrew McDonald,

    Currently, the linker's bottleneck is I/O (mainly read operations), and we will look at balancing the CPU work as we improve I/O performance.

    Hi András Csikvári,

    1. (global C++ object) You can use the /INCLUDE switch to force a symbol to be included in the final image.

    2. (#pragma comment) Yes, the linker will not search LIBPATH if a directory is specified, for security reasons. For example, if a user specifies ..\..\aa.lib, this might lead to unintended private libraries being linked.

    3. (LNK4099) Sorry, the linker does not provide an option to silently ignore particular warnings.

    Chandler Shen

  • Hi

    1. "(global c++ object)You can use /INCLUDE switch to force a symbol available in the final image."

    I don't want to /INCLUDE anything by hand; the main idea with static C++ objects is that I put them into an object file and they just work (at least they should)...

    To be clearer, the project has 1200+ cpp files, and we create such (small) objects via a macro to register "services".

    I would really need a force link switch to the linker like this: --force-all-objects-on mylib.lib --force-all-objects-off

    2. "(#pragma comment)Yes, linker will not search in LIBPATH if a directory is specified for security reasons. For example, if user specifies ..\..\aa.lib, this might lead to some private libraries that are linked."

    Ok, I never asked for that. I would like to use it the same way as with headers (#include "lib/header.h"): #pragma comment(lib, "dir/file.lib").

    There is a "workaround" now for this:

    #pragma comment(lib, __FILE__ "/../../../mylibdir/mylib.lib")

    It doesn't seem any nicer...

    3. "(LNK4099)Sorry, linker does not provide an option to ignore any warnings silently."

    Yes, and it's quite silly, that I can't turn off the warnings of the Microsoft's own libraries... :)

  • Avoid the java 1 class per file scheme.  It is quite slow over the long run considering it takes a few extra seconds over and over throughout the life of the project.

    Avoid making hundreds of extra 1 liner classes just for completeness.  They can be wrapped into a large catch all class (e.g., a large class that does data type conversions from machine specific to not machine specific basic data types).

    Avoid having 10 different overloaded versions of the same method.  Make 1 method that does all of the work and have optional parameters to provide the equivalent signature to the caller.  (NB: I've stopped using optional arguments in the last 2 years since I've seen offshore-written code that only supplies the minimal set of parameters to a function instead of calling the function the right way for the time it is needed.  I think this is the "." IntelliSense lazy method.)

    Avoid creating unnecessary names and extra function calls just for coding style.

    .NET, Java, reflection and modern tools promote an explosion of names and identifiers, but that, while simple in the narrow scope, is detrimental in the long run.  More names/symbols make the project more costly and harder to move forward in the future.

  • "Most users normally specify /debug when linking, irrespective of whether the application is built under “debug” or “release” configuration."

    Are you sure that's actually intentional, or is it more likely that "most users" just didn't notice that MS added /debug to the default project templates a few versions ago?

  • @CMWoods,

    It is standard practice in most professional development shops to store PDBs for all released binaries, even if the PDBs themselves are not released, for help in debugging. Adding /debug to the default template was an effect, not a cause.

  • A few observations:

    1. The PDB vs. no-PDB difference tells us that the binary write is much faster than the PDB write. Why? I suppose you just have to dump the tables already built in memory. Parallelising them would save a pittance. A reason for the longer PDB write could be that the current linker creates the PDB file as compressed. If you issue writes in pieces smaller than the compression unit (64K or so?) it could cause multiple compress-decompress rounds. I hope the linker is not calling WriteFile per each PDB record; that would just kill the performance.

    2. Typically, it makes sense to avoid writing two large files in parallel in small pieces. The most effective approach is to build the data in memory and write it in one shot. With modern memory sizes, it's feasible. And NO, a file mapping doesn't beat that; forget it. Especially a mapped compressed file (PDB).

    3. If you deal with a lot of small files, it makes sense to read them to memory, as many as possible. To avoid disk thrashing, open a bunch of them at once, and then read; don't do one by one.
