How Many Passes?


Large bodies of code written in C or C++ are typically divided into “header” files, which contain just declarations of methods and types (and definitions of macros). The actual bodies of those methods are in completely different files.

People sometimes ask me “why doesn’t C# need header files?”, which is a bit of an odd way to phrase the question; I would have asked the equivalent question “why does C++ need header files?” Header files seem like a huge potential point of failure: all the time I edit C++ code, change the signature of a method, and forget to update the header file, and then the code doesn’t compile, often with some cryptic error message. Hopefully this large cost actually buys you something.

It buys the compiler writer one thing, and the user one thing.

What it buys the user is that you can compile each individual “cpp” file into an “obj” file independently, provided that you have all the necessary headers. All the information necessary to generate the bodies in a given cpp file can be gleaned from the set of headers. This means that the build system can recompile just those cpp files that changed, provided that no header changed.

What it buys the compiler writer is that every file can be compiled in “one pass”. Because every type and method declaration is processed before its first usage, the compiler can simply start from the top of the file, pull in all the included headers, and proceed from the top down, spitting out the obj file as it goes, never having to go back and revisit something it has seen already.

The C# language does not require that declarations occur before usages, which has two impacts, again, on the user and on the compiler writer. The impact on the user is that you don’t get to recompile just the IL that changed when you change a file; the whole assembly is recompiled. Fortunately the C# compiler is sufficiently fast that this is seldom a huge issue. (Another way to look at this is that the “granularity” of recompilation in C# is at the project level, not the file level.)

The impact on the compiler writer is that we have to have a “two pass” compiler. In the first pass, we look for declarations and ignore bodies. Once we have gleaned all the information from the declarations that we would have got from the headers in C++, we take a second pass over the code and generate the IL for the bodies.

A good way to think about this is that the first pass computes metadata: the “top level” stuff, like namespaces, classes, structs, enums, interfaces, delegates, methods, type parameters, formal parameters, constructors, events, attributes, and so on. The second pass computes the IL: the code that goes in the method bodies, constructor bodies, and so on.

In practice, though we only do “two passes” over the raw text, we do many, many passes over the resulting data structures. I thought I might lay out for you just what passes we do in order to implement the C# language.

The first “metadata” phase goes like this:

The first thing we do is take the text of the sources and break it up into a stream of tokens. That is, we do lexical analysis to determine that "class c : b { }" is class, identifier, colon, identifier, left curly, right curly.
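The lexical phase can be sketched in a few lines. This is a toy illustration only, not the real lexer (which is written in C++ and handles the full C# token grammar); the token names and regular expressions here are invented for the example.

```python
import re

# Toy lexer: maps "class c : b { }" onto the token kinds named in the text.
# Order matters: the "keyword" alternative must come before "identifier".
TOKEN_SPEC = [
    ("keyword",     r"\b(class|struct|interface|enum)\b"),
    ("identifier",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("colon",       r":"),
    ("left_curly",  r"\{"),
    ("right_curly", r"\}"),
    ("skip",        r"\s+"),            # whitespace produces no token
]

def tokenize(text):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    return [(m.lastgroup, m.group())
            for m in re.finditer(pattern, text)
            if m.lastgroup != "skip"]
```

Running it on the example from the text yields exactly the sequence described: class, identifier, colon, identifier, left curly, right curly.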

We then do a "top level parse" where we verify that the token streams define a grammatically-correct C# program. However, we skip parsing method bodies.  When we hit a method body, we just blaze through the tokens until we get to the matching close curly. We'll come back to it later; we only care about getting enough information to generate metadata at this point.

We then do a "declaration" pass where we make notes about the location of every namespace and type declaration in the program. At this point we're done looking at the source code for the first phase; every subsequent pass is over the set of "symbols" deduced from the declarations.

We then do a pass where we verify that all the types declared have no cycles in their base types. We need to do this first because in every subsequent pass we need to be able to walk up type hierarchies without having to deal with cycles.
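The base-type cycle check is a standard depth-first search for back edges. A minimal sketch, assuming each type maps to the names of its direct base types; the graph representation is invented for illustration:

```python
def find_base_type_cycle(bases):
    """bases: dict mapping type name -> list of direct base type names.
    Returns True if any base-type chain loops back on itself."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = {t: WHITE for t in bases}

    def visit(t):
        if color.get(t, BLACK) == GRAY:   # back edge: we're inside t already
            return True
        if color.get(t, BLACK) == BLACK:  # finished (or external, e.g. object)
            return False
        color[t] = GRAY
        if any(visit(b) for b in bases.get(t, [])):
            return True
        color[t] = BLACK
        return False

    return any(visit(t) for t in bases)
```

As Eric's comment below the post clarifies, only base *classes* are checked this way; `class CyclicGeneric<T> where T : CyclicGeneric<T>` is not a base-type cycle.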

We then do a pass where we verify that all generic parameter constraints on generic types are also acyclic.

We then do a pass where we check whether every member of every type -- methods of classes, fields of structs, enum values, and so on -- is consistent. No cycles in enums, every overriding method overrides something that is actually virtual, and so on. At this point we can compute the "vtable" layouts of all interfaces, classes with virtual methods, and so on.

We then do a pass where we work out the values of all "const" fields.
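Working out const values amounts to evaluating a small dependency graph on demand while watching for cycles. A toy sketch, with an invented two-operation representation for const initializers:

```python
def evaluate_consts(defs):
    """defs: name -> ("lit", [value]) or ("add", [other const names]).
    Computes every const, raising on a cyclic definition."""
    values, in_progress = {}, set()

    def value_of(name):
        if name in values:
            return values[name]
        if name in in_progress:           # A depends on A, directly or not
            raise ValueError(f"cycle in const {name}")
        in_progress.add(name)
        op, args = defs[name]
        v = args[0] if op == "lit" else sum(value_of(a) for a in args)
        in_progress.discard(name)
        values[name] = v
        return v

    return {name: value_of(name) for name in defs}
```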

At this point we have enough information to emit almost all the metadata for this assembly. We still do not have information about the metadata for iterator/anonymous function closures or anonymous types; we do those later.

We can now start generating IL. For each method body (and properties, indexers, constructors, and so on), we rewind the lexer to the point where the method body began and parse the method body.

[UPDATE: A number of people have asked me about how the passes below are optimized. During the initial binding pass we make notes on things like "there's nullable arithmetic in here", or "there's an object initializer in here" and so on. When we go to do object initializer rewriting, say, we first check to see if we even have to, and skip the pass entirely if we know it will be a no-op. (And testing runs a special version of the compiler that does the pass anyway, and reports back whether it had any effect or not, so that we can test that our pass-skipping logic is correct.) ]
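The pass-skipping idea from the update can be sketched like this; the flag and pass names are made up, and the real compiler tracks many more features than two:

```python
class MethodBodyInfo:
    """Flags noted during the initial binding pass (names invented)."""
    def __init__(self, has_nullable_arithmetic=False, has_object_initializer=False):
        self.has_nullable_arithmetic = has_nullable_arithmetic
        self.has_object_initializer = has_object_initializer

def run_lowering_passes(info, passes):
    """passes: list of (flag_name, pass_name). Runs only the passes whose
    feature was actually seen during binding; returns the names run."""
    return [name for flag, name in passes if getattr(info, flag)]
```

If binding never saw nullable arithmetic, the nullable-rewriting pass is guaranteed to be a no-op, so it is never even started.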

Once the method body is parsed, we do an initial "binding" pass, where we attempt to determine the types of every expression in every statement. We then do a whole pile of passes over each method body. Again, at this point we are done looking at the source code; we're now passing over the "annotated parse tree" many times, and rewriting it as we go.

We first run a pass to transform loops into gotos and labels.
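The loop-to-goto lowering might look something like this sketch, with invented statement shapes: a `while` becomes a conditional exit branch at the top and an unconditional back edge at the bottom.

```python
import itertools

_labels = itertools.count(1)   # fresh label numbers per lowered loop

def lower_while(cond, body):
    """Rewrite `while (cond) body` into label/goto statement tuples."""
    n = next(_labels)
    start, end = f"L{n}_start", f"L{n}_end"
    return ([("label", start),
             ("goto_if_false", cond, end)]   # exit the loop when cond fails
            + list(body)
            + [("goto", start),              # back edge to re-test cond
               ("label", end)])
```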

(The next few passes look for bad stuff.)

Then we run a pass to look for use of deprecated types, for warnings.

Then we run a pass that searches for uses of anonymous types that we haven't emitted metadata for yet, and emit those.

Then we run a pass that searches for bad uses of expression trees. For example, using a ++ operator in an expression tree.

Then we run a pass that looks for all local variables in the body that are defined, but not used, to report warnings.

Then we run a pass that looks for illegal patterns inside iterator blocks.

Then we run the reachability checker, to give warnings about unreachable code, and tell you when you've done something like forgotten the return at the end of a non-void method.
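Reachability can be computed with a simple worklist over statement indices: start at the first statement, follow fall-through, gotos, and both arms of conditional branches; anything never visited is unreachable. A toy sketch over invented lowered-statement tuples:

```python
def reachable_statements(stmts):
    """Returns the set of indices of reachable statements."""
    labels = {s[1]: i for i, s in enumerate(stmts) if s[0] == "label"}
    reachable, work = set(), [0]
    while work:
        i = work.pop()
        if i >= len(stmts) or i in reachable:
            continue
        reachable.add(i)
        kind = stmts[i][0]
        if kind == "goto":
            work.append(labels[stmts[i][1]])      # only the target
        elif kind == "return":
            pass                                  # nothing falls through
        elif kind == "goto_if_false":
            work.append(labels[stmts[i][2]])      # branch taken...
            work.append(i + 1)                    # ...or fall through
        else:
            work.append(i + 1)                    # plain fall-through
    return reachable
```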

Then we run a pass that verifies that every goto targets a sensible label, and that every label is targeted by a reachable goto.

Then we run a pass that checks that all locals are definitely assigned before use, notes which local variables are closed-over outer variables of an anonymous function or iterator, and which anonymous functions are in reachable code. (This pass does too much. I have been meaning to refactor it for some time now.)

At this point we're done looking for bad stuff, but we still have way more passes to go before we sleep.

Next we run a pass that detects missing ref arguments to calls on COM objects and fixes them. (This is a new feature in C# 4.)

Then we run a pass that looks for stuff of the form "new MyDelegate(Foo)" and rewrites it into a call to CreateDelegate.

Then we run a pass that transforms expression trees into the sequence of factory method calls necessary to create the expression trees at runtime.

Then we run a pass that rewrites all nullable arithmetic into code that tests for HasValue, and so on.

Then we run a pass that finds all references of the form base.Blah() and rewrites them into code which does the non-virtual call to the base class method.

Then we run a pass which looks for object and collection initializers and turns them into the appropriate property sets, and so on.

Then we run a pass which looks for dynamic calls (in C# 4) and rewrites them into dynamic call sites that use the DLR.

Then we run a pass that looks for calls to removed methods. (That is, partial methods with no actual implementation, or conditional methods that don't have their conditional compilation symbol defined.) Those are turned into no-ops.

Then we look for unreachable code and remove it from the tree. No point in codegenning IL for it.

Then we run an optimization pass that rewrites trivial "is" and "as" operators.

Then we run an optimization pass that looks for switch(constant) and rewrites it as a branch directly to the correct case.
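When the switch scrutinee is a compile-time constant, the whole dispatch collapses to a single branch. A one-line sketch, with an invented representation mapping case constants to labels:

```python
def lower_constant_switch(value, cases, default_label):
    """cases: dict mapping case constant -> label. Since `value` is known
    at compile time, the switch becomes one unconditional goto."""
    return ("goto", cases.get(value, default_label))
```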

Then we run a pass which turns string concatenations into calls to the correct overload of String.Concat.
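The concatenation rewrite has to flatten the left-associative tree of `+` operators into a single argument list first. A sketch with an invented expression shape:

```python
def flatten_concat(expr):
    """Expressions are ("+", left, right) tuples or plain string leaves."""
    if isinstance(expr, str):
        return [expr]
    op, left, right = expr
    return flatten_concat(left) + flatten_concat(right)

def lower_concat(expr):
    """Turn a + tree into a single multi-argument Concat-style call."""
    return ("call", "String.Concat", flatten_concat(expr))
```

So `"A" + "B" + "C"`, which parses as `("A" + "B") + "C"`, becomes one call with three arguments rather than two nested two-argument calls.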

(Ah, memories. These last two passes were the first things I worked on when I joined the compiler team.)

Then we run a pass which rewrites uses of named and optional parameters into calls where the side effects all happen in the correct order.

Then we run a pass which optimizes arithmetic; for example, if we know that M() returns an int, and we have 1 * M(), then we just turn it into M().
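These identity rewrites can be sketched as a recursive bottom-up pass. The caveat in the text matters: `1 * M()` is only safe to reduce because `M()` is known to return `int`, so dropping the multiplication cannot change the result. Expression shapes here are invented:

```python
def simplify(expr):
    """Expressions are (op, left, right) tuples; anything else is a leaf."""
    if not (isinstance(expr, tuple) and len(expr) == 3):
        return expr
    op, a, b = expr
    a, b = simplify(a), simplify(b)      # simplify children first
    if op == "*" and a == 1:
        return b                         # 1 * x  ->  x
    if op == "*" and b == 1:
        return a                         # x * 1  ->  x
    if op == "+" and a == 0:
        return b                         # 0 + x  ->  x
    if op == "+" and b == 0:
        return a                         # x + 0  ->  x
    return (op, a, b)
```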

Then we do generation of the code for anonymous types first used by this method.

Then we transform anonymous functions in this body into methods of closure classes.

Finally, we transform iterator blocks into switch-based state machines.

Then we emit the IL for the transformed tree that we've just computed.

UPDATE: I've glossed over a number of additional passes that happen during IL generation. Basically, the IL generator turns the program tree into a set of "basic blocks" -- that is, sequences of code where nothing "in the middle" of the block leaves the sequence, and no other sequences branch to "the middle" of any other sequence. Once we have the code in basic block form we can start doing additional analysis on it: rewriting branches to be more efficient, removing basic blocks that the arithmetic optimizer has proven to be dead code, and so on. See my article on the optimize switch for more details of what we do during this pass.
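Finding where basic blocks begin is the classic first step: the first instruction is a leader, every labeled instruction is a leader, and so is every instruction following a branch. A toy sketch over invented instruction tuples:

```python
def split_into_blocks(instrs):
    """Partition a flat instruction list into basic blocks."""
    leaders = {0}
    for i, ins in enumerate(instrs):
        if ins[0] in ("goto", "goto_if_false", "return") and i + 1 < len(instrs):
            leaders.add(i + 1)           # control never falls past a branch
        if ins[0] == "label":
            leaders.add(i)               # branches may land here
    order = sorted(leaders)
    return [instrs[s:e] for s, e in zip(order, order[1:] + [len(instrs)])]
```

Once the code is in this form, a block that no branch targets and no block falls into is provably dead and can be deleted outright.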

Easy as pie!

  • Might seem like a silly question, but would writing code like:

    string message = string.Concat("A string", "B string", "C string");

    be more efficient (compile-time wise)

    than

    string message = "A string" + "B string" + "C string";

  • quoth E. Lippert

    > Running all the metadata to and from disk for each project can be expensive.

    So this provokes two questions: do you develop on an SSD-juiced rig? And, what fraction of your team/build platform does? In other words, has the SSD tipping point tipped, at least for the C# compiler cognoscenti?

  • I'm not sure I understand the pass that checks generics are acyclic, could you explain? For example, this compiles:

    public class CyclicGeneric<T> where T : CyclicGeneric<T> { }

    Isn't that cyclic?

    Nope. The base classes have to be cycle free, so "class C : B {} class B : C{}" is a cycle. And the constraints have to be cycle free, so "class D<T, U> where T : U where U : T {}" is a cycle. But the code you gave is perfectly legal. Weird, but legal. Similarly, "struct E : IComparable<E>" is not a cycle, and obviously is a desirable and common pattern. However, there are numerous bugs in our existing cycle detector that accidentally detect some of these weird cases as cycles incorrectly. -- Eric

  • Eric, would you please walk across the hall and "poke" someone on the VB team to see if they'd be willing to write up some details on their current and future compiler design. :) I'm curious to hear whether they made similar choices.

    Do they plan to support incremental compilation?

    Are they really rewriting the compiler in VB?

    If so, what VB features are they using/avoiding to do so, and did they have to resort to non-VB for parts, or add hidden special features such as support for unsafe code?

    Have they considered large messy multi-million LOC single-assembly applications as a possible scenario?

    My team is currently in the midst of re-architecting/designing/writing a large mess of a Java application, and the available compilers are inadequate. The Eclipse compiler is the best available, but frequently gets confused, requiring a full rebuild. My dream is to one day move this app to the .NET world (primarily to take advantage of WPF and language features), but compilation speed is critical.

  • @Michael: no, I've used that approach to create a hierarchical class, e.g.:

    public abstract class Node<T> where T : Node<T>
    {
        protected T Parent;
        protected List<T> Children;
    }

  • Wait, misread the first part of your statement.  I'm not exactly sure what constitutes "cyclic," but I'm sure one more venerably wise than I will shed some light.

  • Ugh, today is my multi-post day.

    > Running all the metadata to and from disk for each project can be expensive.

    Avoiding repeated disk writes is precisely why I like copying projects to a RAM disk until I'm done working on them for the day.  Barring catastrophic system failure and loss of resident data, it's a good way to speed up . . . well, everything.  Have you perhaps maybe considered the possibility of compiling (or providing an option to compile) temporary files to memory before writing the final output?  I mean, you can already do this with CodeDOM.

  • Search for "Then we run a pass that checks" on Google...^^

  • On the build times and parallelization, MSBuild does use multiple cores, and it speeds build time up a bit. But the vast majority of build time is held up in disk access. Copy local references are a killer, especially in a solution with many projects with lots of references because when an assembly is copied local, all of its dependencies are copied local as well. So it fans out quite quickly and becomes painful.

    I saw numbers on SSD read and write times a while ago and it seemed that they weren't really that much faster. I would think compiling to a ramdisk would be a massive speedup. I've seen a 5 disk RAID 0 setup turn a 20-25 minute build of 4 million lines of code into 3-5 minutes.

    I doubt parallelization of the CPU work in the compiler would have a significant effect on the build times of projects that are actually suffering from slow builds.

  • Nice blog!  You mention that the C# compiler is pretty darn fast - it certainly is!  I can understand why the linker portion is significantly faster than the VS C++, but can you explain why the compile process is so much faster?

    Nope. First off, I don't know that it actually is faster; I've never once run a head-to-head comparison of C# vs C++. (Though my experience with compiling large bodies of both C# and C++ makes it seem plausible that C# is faster.) And second, supposing for the sake of argument that C# is faster, in order to determine why it is faster, I'd need to do a comparison of each *stage* of compilation in order to find the one for which C# is faster. Could be that C# is easier to parse, or that overload resolution is easier in C#, or that loading metadata is faster than loading predefined headers, or that it's the backend code generator that is faster; could be anything. I wouldn't like to say without actually crunching the numbers. -- Eric

  • Following up on Joe White's comment, I too was wondering why the dead code removal wasn't done after the switch(constant) rewrite. I would think that if you swapped the order, you could actually even remove the branch that the switch is converted to, since the branch statement would only go to the code following itself.

    I'm sure that many of these passes have dependencies on earlier passes; I'd be interested to know what some of those are.  Why isn't X done later, why isn't Y done earlier?

  • Is it bad that since I found out the C# compiler optimizes + operator string concatenations into string.Concat calls, I stopped actually specifying string.Concat?

  • Now I understand why you might be loath to add features. It's complicated!

    Now I'm left wondering what the code looks like. A method for each pass, or maybe #regions... probably methods in partial classes; less likely to stomp other dev's changes that way and easier to do unit testing.

    Don't forget, the compiler is written in C++, not in C#.

    We have a whole bunch of visitor pass base classes -- one that is useful only for its side effects (like producing errors), one that is used for tree rewriting, and one that is used to query the annotated AST for facts about it. Most of the passes are just derived classes from one of the visitor base classes. Some of them need a lot more custom logic -- like, say, definite assignment checking -- and those are built by hand to recursively descend through the tree. -- Eric

  • > Now I'm left wondering what the code looks like. A method for each pass, or maybe #regions... probably methods in partial classes; less likely to stomp other dev's changes that way and easier to do unit testing.

    From Eric's comment above:

    "The compiler is written in unmanaged C++"

  • "Then we run a pass that verifies that every goto targets a sensible label, and that every label is targeted by a reachable goto"

    Since this happens after the pass that transforms loops into gotos, does it mean that the transformed loops are also checked instead of only user code?

    Label checking determines that every user-defined label is targeted by some goto. Compiler-generated labels are not checked; we always internally generate a label for the "break" and "continue" locations even if there is no break/continue statement, but having a loop with no break or continue statement is not something you should be warned about.

    A compiler-generated goto never does anything stupid; it never branches out of a lambda or out of a finally, for example. So we skip checking those as well. -- Eric

     
