How Many Passes?

How Many Passes?

Rate This
  • Comments 50

Large bodies of code written in the C/C++ languages typically divide up the code into “header” files, which are just declarations of methods and types (and definitions of macros). The actual bodies of those methods and types are in completely different files.

People sometimes ask me why doesn’t C# need header files? which is a bit of an odd way to phrase the question; I would have asked the equivalent question why does C++ need header files? Header files seem like a huge potential point of failure; all the time I edit C++ code and change the signature of a method; if I forget to update the header file, then the code doesn’t compile and often gives some cryptic error message. Hopefully this large cost actually buys you something.

It buys the compiler writer one thing, and the user one thing.

What it buys the user is that you can compile each individual “cpp” file into a “obj” file independently, provided that you have all the necessary headers. All the information necessary to generate the bodies that are in a given cpp file can be gleaned from the set of headers. This means that the build system can recompile just those cpp files that changed, provided that no header changed.

What it buys the compiler writer is that every file can be compiled in “one pass”. Because every type and method declaration is processed before its first usage, the compiler can simply start from the top of the file, pull in all the included headers, and proceed from the top down, spitting out the obj file as it goes, never having to go back and revisit something its seen already.

The C# language does not require that declarations occur before usages, which has two impacts, again, on the user and on the compiler writer. The impact on the user is that you don’t get to recompile just the IL that changed when you change a file; the whole assembly is recompiled. Fortunately the C# compiler is sufficiently fast that this is seldom a huge issue. (Another way to look at this is that the “granularity” of recompilation in C# is at the project level, not the file level.)

The impact on the compiler writer is that we have to have a “two pass” compiler. In the first pass, we look for declarations and ignore bodies. Once we have gleaned all the information from the declarations that we would have got from the headers in C++, we take a second pass over the code and generate the IL for the bodies.

A good way to think about this is that the first pass computes metadata: the “top level” stuff, like namespaces, classes, structs, enums, interfaces, delegates, methods, type parameters, formal parameters, constructors, events, attributes, and so on. The second pass computes the IL: the code that goes in the method bodies, constructor bodies, and so on.

In practice, though we only do “two passes” over the raw text, we do many, many passes over the resulting data structures. I thought I might lay out for you just what passes we do in order to implement the C# language.

The first “metadata” phase goes like this:

The first thing we do is take the text of the sources and break it up into a stream of tokens. That is, we do lexical analysis to determine that "class c : b { }" is class, identifier, colon, identifier, left curly, right curly.

We then do a "top level parse" where we verify that the token streams define a grammatically-correct C# program. However, we skip parsing method bodies.  When we hit a method body, we just blaze through the tokens until we get to the matching close curly. We'll come back to it later; we only care about getting enough information to generate metadata at this point.

We then do a "declaration" pass where we make notes about the location of every namespace and type declaration in the program. At this point we're done looking at the source code for the first phase; every subsequent pass is over the set of "symbols" deduced from the declarations.

We then do a pass where we verify that all the types declared have no cycles in their base types. We need to do this first because in every subsequent pass we need to be able to walk up type hierarchies without having to deal with cycles.

We then do a pass where we verify that all generic parameter constraints on generic types are also acyclic.

We then do a pass where we check whether every member of every type -- methods of classes, fields of structs, enum values, and so on -- is consistent. No cycles in enums, every overriding method overrides something that is actually virtual, and so on. At this point we can compute the "vtable" layouts of all interfaces, classes with virtual methods, and so on.

We then do a pass where we work out the values of all "const" fields.

At this point we have enough information to emit almost all the metadata for this assembly. We still do not have information about the metadata for iterator/anonymous function closures or anonymous types; we do those late.

We can now start generating IL. For each method body (and properties, indexers, constructors, and so on), we rewind the lexer to the point where the method body began and parse the method body.

[UPDATE: A number of people have asked me about how the passes below are optimized. During the initial binding pass we make notes on things like "there's nullable arithmetic in here", or "there's an object initializer in here" and so on. When we go to do object initializer rewriting, say, we first check to see if we even have to, and skip the pass entirely if we know it will be a no-op. (And testing runs a special version of the compiler that does the pass anyway, and reports back whether it had any effect or not, so that we can test that our pass-skipping logic is correct.) ]

Once the method body is parsed, we do an initial "binding" pass, where we attempt to determine the types of every expression in every statement. We then do a whole pile of passes over each method body. Again, at this point we are done looking at the source code; we're now passing over the "annotated parse tree" many times, and rewriting it as we go.

We first run a pass to transform loops into gotos and labels.

(The next few passes look for bad stuff.)

Then we run a pass to look for use of deprecated types, for warnings.

Then we run a pass that searches for uses of anonymous types that we haven't emitted metadata for yet, and emit those.

Then we run a pass that searches for bad uses of expression trees. For example, using a ++ operator in an expression tree.

Then we run a pass that looks for all local variables in the body that are defined, but not used, to report warnings.

Then we run a pass that looks for illegal patterns inside iterator blocks.

Then we run the reachability checker, to give warnings about unreachable code, and tell you when you've done something like forgotten the return at the end of a non-void method.

Then we run a pass that verifies that every goto targets a sensible label, and that every label is targeted by a reachable goto.

Then we run a pass that checks that all locals are definitely assigned before use, notes which local variables are closed-over outer variables of an anonymous function or iterator, and which anonymous functions are in reachable code. (This pass does too much. I have been meaning to refactor it for some time now.)

At this point we're done looking for bad stuff, but we still have way more passes to go before we sleep.

Next we run a pass that detects missing ref arguments to calls on COM objects and fixes them. (This is a new feature in C# 4.)

Then we run a pass that looks for stuff of the form "new MyDelegate(Foo)" and rewrites it into a call to CreateDelegate.

Then we run a pass that transforms expression trees into the sequence of factory method calls necessary to create the expression trees at runtime.

Then we run a pass that rewrites all nullable arithmetic into code that tests for HasValue, and so on.

Then we run a pass that finds all references of the form base.Blah() and rewrites them into code which does the non-virtual call to the base class method.

Then we run a pass which looks for object and collection initializers and turns them into the appropriate property sets, and so on.

Then we run a pass which looks for dynamic calls (in C# 4) and rewrites them into dynamic call sites that use the DLR.

Then we run a pass that looks for calls to removed methods. (That is, partial methods with no actual implementation, or conditional methods that don't have their conditional compilation symbol defined.) Those are turned into no-ops.

Then we look for unreachable code and remove it from the tree. No point in codegenning IL for it.

Then we run an optimization pass that rewrites trivial "is" and "as" operators.

Then we run an optimization pass that looks for switch(constant) and rewrites it as a branch directly to the correct case.

Then we run a pass which turns string concatenations into calls to the correct overload of String.Concat.

(Ah, memories. These last two passes were the first things I worked on when I joined the compiler team.)

Then we run a pass which rewrites uses of named and optional parameters into calls where the side effects all happen in the correct order.

Then we run a pass which optimizes arithmetic; for example, if we know that M() returns an int, and we have 1 * M(), then we just turn it into M().

Then we do generation of the code for anonymous types first used by this method.

Then we transform anonymous functions in this body into methods of closure classes.

Finally, we transform iterator blocks into switch-based state machines.

Then we emit the IL for the transformed tree that we've just computed.

UPDATE: I've glossed over a number of additional passes that happen during IL generation. Basically, the IL generator turns the program tree into a set of "basic blocks" -- that is, sequences of code where nothing "in the middle" of the block leaves the sequence, and no other sequences branch to "the middle" of any other sequence. Once we have the code in basic block form we can start doing additional analysis on it: rewriting branches to be more efficient, removing basic blocks that the arithmetic optimizer has proven to be dead code, and so on. See my article on the optimize switch for more details of what we do during this pass.

Easy as pie!

  • David Cunningham: I would assume that the C# compiler is faster than a C++ compiler because it does less work. A C# compiler only has to look at each source file once, while a C++ compiler has to look at each header file once for each main file that includes it. Using the STL could easily make the compiler do 10 times as much work as if you had written the code yourself (for example just doing #include <string> could pull in two dozen more STL files totalling over 13,000 lines). Even if you use precompiled headers, that's still significant overhead that has to be repeated for each file. While a C# generic needs to be generated once, using C++ templates requires expansion of a template for each different type that it's instantiated for.

    Furthermore, the C# compiler generates IL, leaving the native code generation and optimization to somebody else (ngen or runtime), while a C++ compiler usually has to output optimized native machine code.

  • Interesting, at what stage are partial classes 'merged'. You speak of when the optimizations for unimplemented methods occur but not when their merge happens.

    Partial classes are merged immediately after the top-level parse completes. Since two halves of a partial class must have consistent base classes, we need to have all the "declarations" for a particular class known before base class checking. Namespaces are of course always "partial"; their contents are also merged together early. -- Eric

    Also in the same vein at what stage do you convert things like foreach/using into their more verbose equivalents?

    In the C# 2 code all the "lowering" from loops to gotos happened in the initial binding pass. In C# 3 and 4, we've been gradually moving to a model where those lowerings happen later; right now the lowerings for many constructs happen before reachability and definite assignment checking because those passes only understand gotos, not loops. Eventually we would like to be able to easily implement things like "statement trees", so statements would not be lowered until just before IL generation, so that the statement tree generator could generate the right thing. In today's code, the lowering of a foreach to its try-finally-with-a-loop-in-it form happens early; I'd like to see that happen later. -- Eric

  • Makes me think of that cartoon with two guys at the blackboard doing maths, with "then a miracle happens" in the middle.

    I just hit F5, and somebody else does the magic.

  • Eric, if I was almost half as smart as you and your team, I'd be no less than twice as smart as I'm not now!

  • Which parts of your AST are mutable and which are immutable?

Page 4 of 4 (50 items) 1234