How Many Passes?

How Many Passes?

Rate This
  • Comments 50

Large bodies of code written in the C/C++ languages typically divide up the code into “header” files, which are just declarations of methods and types (and definitions of macros). The actual bodies of those methods and types are in completely different files.

People sometimes ask me why doesn’t C# need header files? which is a bit of an odd way to phrase the question; I would have asked the equivalent question why does C++ need header files? Header files seem like a huge potential point of failure; all the time I edit C++ code and change the signature of a method; if I forget to update the header file, then the code doesn’t compile and often gives some cryptic error message. Hopefully this large cost actually buys you something.

It buys the compiler writer one thing, and the user one thing.

What it buys the user is that you can compile each individual “cpp” file into a “obj” file independently, provided that you have all the necessary headers. All the information necessary to generate the bodies that are in a given cpp file can be gleaned from the set of headers. This means that the build system can recompile just those cpp files that changed, provided that no header changed.

What it buys the compiler writer is that every file can be compiled in “one pass”. Because every type and method declaration is processed before its first usage, the compiler can simply start from the top of the file, pull in all the included headers, and proceed from the top down, spitting out the obj file as it goes, never having to go back and revisit something its seen already.

The C# language does not require that declarations occur before usages, which has two impacts, again, on the user and on the compiler writer. The impact on the user is that you don’t get to recompile just the IL that changed when you change a file; the whole assembly is recompiled. Fortunately the C# compiler is sufficiently fast that this is seldom a huge issue. (Another way to look at this is that the “granularity” of recompilation in C# is at the project level, not the file level.)

The impact on the compiler writer is that we have to have a “two pass” compiler. In the first pass, we look for declarations and ignore bodies. Once we have gleaned all the information from the declarations that we would have got from the headers in C++, we take a second pass over the code and generate the IL for the bodies.

A good way to think about this is that the first pass computes metadata: the “top level” stuff, like namespaces, classes, structs, enums, interfaces, delegates, methods, type parameters, formal parameters, constructors, events, attributes, and so on. The second pass computes the IL: the code that goes in the method bodies, constructor bodies, and so on.

In practice, though we only do “two passes” over the raw text, we do many, many passes over the resulting data structures. I thought I might lay out for you just what passes we do in order to implement the C# language.

The first “metadata” phase goes like this:

The first thing we do is take the text of the sources and break it up into a stream of tokens. That is, we do lexical analysis to determine that "class c : b { }" is class, identifier, colon, identifier, left curly, right curly.

We then do a "top level parse" where we verify that the token streams define a grammatically-correct C# program. However, we skip parsing method bodies.  When we hit a method body, we just blaze through the tokens until we get to the matching close curly. We'll come back to it later; we only care about getting enough information to generate metadata at this point.

We then do a "declaration" pass where we make notes about the location of every namespace and type declaration in the program. At this point we're done looking at the source code for the first phase; every subsequent pass is over the set of "symbols" deduced from the declarations.

We then do a pass where we verify that all the types declared have no cycles in their base types. We need to do this first because in every subsequent pass we need to be able to walk up type hierarchies without having to deal with cycles.

We then do a pass where we verify that all generic parameter constraints on generic types are also acyclic.

We then do a pass where we check whether every member of every type -- methods of classes, fields of structs, enum values, and so on -- is consistent. No cycles in enums, every overriding method overrides something that is actually virtual, and so on. At this point we can compute the "vtable" layouts of all interfaces, classes with virtual methods, and so on.

We then do a pass where we work out the values of all "const" fields.

At this point we have enough information to emit almost all the metadata for this assembly. We still do not have information about the metadata for iterator/anonymous function closures or anonymous types; we do those late.

We can now start generating IL. For each method body (and properties, indexers, constructors, and so on), we rewind the lexer to the point where the method body began and parse the method body.

[UPDATE: A number of people have asked me about how the passes below are optimized. During the initial binding pass we make notes on things like "there's nullable arithmetic in here", or "there's an object initializer in here" and so on. When we go to do object initializer rewriting, say, we first check to see if we even have to, and skip the pass entirely if we know it will be a no-op. (And testing runs a special version of the compiler that does the pass anyway, and reports back whether it had any effect or not, so that we can test that our pass-skipping logic is correct.) ]

Once the method body is parsed, we do an initial "binding" pass, where we attempt to determine the types of every expression in every statement. We then do a whole pile of passes over each method body. Again, at this point we are done looking at the source code; we're now passing over the "annotated parse tree" many times, and rewriting it as we go.

We first run a pass to transform loops into gotos and labels.

(The next few passes look for bad stuff.)

Then we run a pass to look for use of deprecated types, for warnings.

Then we run a pass that searches for uses of anonymous types that we haven't emitted metadata for yet, and emit those.

Then we run a pass that searches for bad uses of expression trees. For example, using a ++ operator in an expression tree.

Then we run a pass that looks for all local variables in the body that are defined, but not used, to report warnings.

Then we run a pass that looks for illegal patterns inside iterator blocks.

Then we run the reachability checker, to give warnings about unreachable code, and tell you when you've done something like forgotten the return at the end of a non-void method.

Then we run a pass that verifies that every goto targets a sensible label, and that every label is targeted by a reachable goto.

Then we run a pass that checks that all locals are definitely assigned before use, notes which local variables are closed-over outer variables of an anonymous function or iterator, and which anonymous functions are in reachable code. (This pass does too much. I have been meaning to refactor it for some time now.)

At this point we're done looking for bad stuff, but we still have way more passes to go before we sleep.

Next we run a pass that detects missing ref arguments to calls on COM objects and fixes them. (This is a new feature in C# 4.)

Then we run a pass that looks for stuff of the form "new MyDelegate(Foo)" and rewrites it into a call to CreateDelegate.

Then we run a pass that transforms expression trees into the sequence of factory method calls necessary to create the expression trees at runtime.

Then we run a pass that rewrites all nullable arithmetic into code that tests for HasValue, and so on.

Then we run a pass that finds all references of the form base.Blah() and rewrites them into code which does the non-virtual call to the base class method.

Then we run a pass which looks for object and collection initializers and turns them into the appropriate property sets, and so on.

Then we run a pass which looks for dynamic calls (in C# 4) and rewrites them into dynamic call sites that use the DLR.

Then we run a pass that looks for calls to removed methods. (That is, partial methods with no actual implementation, or conditional methods that don't have their conditional compilation symbol defined.) Those are turned into no-ops.

Then we look for unreachable code and remove it from the tree. No point in codegenning IL for it.

Then we run an optimization pass that rewrites trivial "is" and "as" operators.

Then we run an optimization pass that looks for switch(constant) and rewrites it as a branch directly to the correct case.

Then we run a pass which turns string concatenations into calls to the correct overload of String.Concat.

(Ah, memories. These last two passes were the first things I worked on when I joined the compiler team.)

Then we run a pass which rewrites uses of named and optional parameters into calls where the side effects all happen in the correct order.

Then we run a pass which optimizes arithmetic; for example, if we know that M() returns an int, and we have 1 * M(), then we just turn it into M().

Then we do generation of the code for anonymous types first used by this method.

Then we transform anonymous functions in this body into methods of closure classes.

Finally, we transform iterator blocks into switch-based state machines.

Then we emit the IL for the transformed tree that we've just computed.

UPDATE: I've glossed over a number of additional passes that happen during IL generation. Basically, the IL generator turns the program tree into a set of "basic blocks" -- that is, sequences of code where nothing "in the middle" of the block leaves the sequence, and no other sequences branch to "the middle" of any other sequence. Once we have the code in basic block form we can start doing additional analysis on it: rewriting branches to be more efficient, removing basic blocks that the arithmetic optimizer has proven to be dead code, and so on. See my article on the optimize switch for more details of what we do during this pass.

Easy as pie!

  • Which pass expands query comprehensions? Since that's fairly mechanical, is it an effect of the top level parse?

    The initial method body binding pass does syntactic query rewrites on the parse tree. We could in theory do it in the method body parser; for a variety of reasons it turns out to simply be easier to do it post parsing. -- Eric

  • > The impact on the user is that you don’t get to recompile just the IL that changed when you change a file; the whole assembly is recompiled.

    Is this part literally true?  Does a single change require the whole assembly to be recompiled?  Or is the system a bit smarter, only recompiling those files that depend on the changes?

    Yep. We recompile the whole thing. The compiler is pretty fast! -- Eric

  • What I really want to know very much, is C# compiler faster on multicore CPU (can it use multiple threads)? For how much? Can't find any benchmarks.

    If not, are there plans to made it so?

    I would like that very much, but no guarantees. See above. -- Eric

  • Many of those passes are optimizations. Why do they need to run in different passes, and why do they not run inside a common backend for all .Net compilers?

    They don't need to run in different passes. Doing so is convenient because it means that we can (1) write classes that have a single responsibility, and (2) write later passes to depend on postconditions of earlier passes.

    They don't run in a common backend because that would mean every .NET language team would have to agree on a common format for the internal representation of a program and then of course it would not be "internal" anymore. Remember, the compiler fundamentally is manipulating parse trees at this stage, and every language parses differently.

    Could we build a library of lowest-common-denominator objects representing things like "if", "local variable", "assignment", "goto" and so on? Then lower the parse trees to those and run a common backend analyzer on them. Yes, we could. The expression tree library could be used as a start on such a thing. But is there any cost savings in getting every compiler team to agree on such a format? It seems like a lot of work for very little gain; the vast majority of the analysis and transformation is language-specific and happens before IL gen. This is an idea we've considered, but I don't think much will come of it any time soon. -- Eric

  • Aaron: I imagine having one pass that does three things is about the same in terms of performance as having three passes that do one thing each.

    In any case, the C# grammar is simple enough that compiling even very large projects takes a very short amount of time. Our software is ~1M loc, spread out over dozens of solutions and projects and takes about 3 minutes to compile (and that's including the time our build server takes to do an 'svn update'). That's impressive!

    Nice. And my guess would be that much of that time is not being spent in the analysis. A million lines of code is not that much to throw at the compiler. Running all the metadata to and from disk for each project can be expensive. -- Eric

  • @Eric

    Sweet! Thank you for this.

    @Aaron

    Given Eric's comment regarding how "impenetrable" the code is, I'll guess that the driving criteria for Eric's desire to refactor that pass is maintainability, not performance. I always look at it like this:

    Rule #1: The program must be able to successfully do the work you want it to.

    Rule #2: The program must be able to be changed in case you find out that you don't really want it to do that, unless this should conflict with the first rule or you're intentionally ruling out using this code for any other than the original application.

    Rule #3: The program should run as quickly as possible, unless this should conflict with the first or second rules.

    I'll also go out on a limb and speculate wildly that there have been no recent, major features that have touched on what that too-big pass does, or the rewrite would have been more of a priority. Then again, I could be wrong.

  • Eric, thanks for the clarification of the base method rewriting.

    I'm curious as to how you might approach refactoring the iterator-rewriting pass. I've generally been happy to consider that functionality as "magic" but I've never quite had a good understanding of the process by which that magic is made to do the right thing.

    The code is a bit of a confusing mess right now. The way it builds up the switch that drives the state machine is hard to understand and hard to maintain. I would want to refactor it until every step along the way was very clear. I'd probably divide it up into multiple passes: first, hoist the locals onto a closure class. Then identify every point in the code that is a branch target and insert a label. Then rewrite the try blocks to ensure that the finallys are run when they ought to be. And last, build a switch that branches to the labels. Put it all together and emit the closure class, and you're done. -- Eric

    I'd also like to second Jesper's question about query comprehensions; don't see an obvious place for them, either. Maybe the "loops into gotos and labels" pass?

    Query comprehensions are a syntactic sugar for method calls, and we work out the type of every method call in the first "binding" pass -- that's when we do overload resolution. So the rewrite of query comprehensions into method calls needs to happen very early. -- Eric

  • As Pavel mentioned, headers separate interface from implementation.

    The separation of interface and implementation can create some significant disadvantages. The interface and implementation are mostly redundant. This means time is spent writing the same thing twice and keeping the interface and the implementation in sync. Even worse, they can get out of sync. And they aren't completely redundant, so you have to look in two places to really know the answer to some questions.

    The separation of interface and implementation can create some significant advantages. For those who simply need the interface (compilers of client code or human readers of the code), the information density of a header is much higher than the information density of the implementation. In addition, the implementation can change without requiring a change in the implementation (i.e. clients don't have to recompile unless the header changes). These advantages are not currently available in C#.

    The part that really bugs me is that C# could have its cake and eat it too. With a bit of additional effort, it could get all of the benefits of separated interface and implementation but avoid most or all disadvantages. Unfortunately, instead of being addressed, the disadvantages have been ignored or downplayed.

    To be honest, C# does have header files, but they go by a different name. Instead of .h, you have metadata (embedded in the assembly, in a .metadata_dll file, or in an .asmmeta file) and documentation (those .xml doc files). This is actually pretty cool because it means that the headers are automatically generated from the implementation -- the programmer just writes the .cs file and the headers are generated by magic! Of course, the programmer might have to put more effort into the implementation (add XML-doc comments), but nothing that he/she wouldn't have had to put into a header in the .H model.

    The problem is that there isn't any good support in the VS IDE or the C# compiler to take advantage of the automatically-generated headers. The IDE could have a mode that filters out the implementation and shows a nice high-level view of the .CS file (just the metadata with XML comments included optionally included), but there is no such mode (there is for viewing .DLL files, but not for viewing .CS files). The C# compiler (with help from the build system) could generate .metadata_dll files automatically, but instead you have to use messy external tools like asmmeta (along with a bunch of ad-hoc additional processing to filter out "insignificant" changes). The build system could automatically detect changes in the .metadata_dll file and only rebuild the downstream assemblies if the .metadata_dll changed significantly.

  • dcook, if you wish to separate method signatures from implementation you can choose to do so by using C# interfaces.

  • Sounds like there are a lot more passes now than what's in the 2.0 compiler source in Rotor, but this helped explain some of the stuff I saw in there.  I've been looking through that source for the past couple weeks and one thing I've wondered is why it does two passes through the lex data.  Why not do one pass to build the AST and then walk the AST to pull out "header" information as a first pass?    My guess is so the parser could also be used for Intellisense or other metadata only needs.

    Will the 4.0 source every be released in a new Rotor release?

    This is highly unlikely for C# 4. However, we are evaluating whether to release hypothetical future versions of the compiler under a shared-source sort of license. No promises, but I would like this very much. -- Eric

  • I was about to ask the same question as Aaron R. The compilers I've worked on have done just what he suggested: lexing and parsing happens exactly once, then all further operations happen on in-memory trees (including matching up identifiers that aren't defined until some point after they're used). I don't understand why the C# compiler needs to parse the source code multiple times.

  • @dcook, there's nothing stopping you from writing a visualizer into Visual Studio (or standalone if you prefer).

    Or, for the lazy like myself, just collapse to definitions and scroll.

    There are many add-in's out there that supplement the look and feel and give you some wicked productivity tools. e.g. Resharper. To be honest, I've never figured out a system by looking at a .h file.

    I'd also argue that the metadata is consumed by the most important client, and that's the compiler itself. It's funny to see, however many years later, that there are people now lamenting the fact of no separate .h files and no sync issues.

    The manifest being included in the assembly, and concretely describing every type is a godsend and is what makes it possible to do so much more that you could do with C++. In theory with a correctly compiled C++ system, all you need are .h files, but developing with .h files was a nuisance, and the theory often failed in practice.

    You can argue there's still the same issue but in a different guise: instead of instead of .h proliferation, there is dll proliferation, but at least you can be assured that if you have the working dll, you will have the working code. You can't say the same about a .h.

  • What are trivial "is" and "as" operators? Is that where an expression being converted is already known to be of or not be of the type being converted to?

    Correct. Unfortunately, introducing co and contravariant conversions changed the set of reference conversions which are known to be trivial, and I accidentally introduced some bugs in this pass as a result. Hopefully the fixes will make it into the final C# 4 release. Apologies for the inconvenience. -- Eric

  • I found it interesting that you check for nonreachable code twice (once early on, to give warnings; once late, on an already-substantially-rewritten parse tree, to remove the dead code). The only reason I can think of for separating them is if the intervening rewrites could actually introduce unreachable code that wasn't in the original source. Is that the case?

    I found it even more interesting that the dead-code removal won't necessarily remove all the dead code. For example, if I had a switch(const), it sounds like you would codegen all of the cases, even though only one was reachable -- because you do the switch(const) optimization after you remove the dead code, instead of before.

    I'd be curious to hear how the decisions were made (and by whom) about where to add some of these passes, and how you weighed the tradeoffs. I'm sure there are a few interesting stories there.

    Good question. We have two kinds of dead code. First, dead code that the *specification* says is unreachable. The spec is very clear about the formal definition of unreachable. For example: 

        int x = 10;
        int y, z;
        if (false) M(y);
        if (x * 0 != 0) N(z);

    The call to M(y) is unreachable, and therefore the definite assignment checker skips reporting the fact that y is used before it is assigned. The call to N(z) is reachable according to the spec, and therefore the usage of z is an error. The spec says that the consequence of a conditional is unreachable only if the condition is the compile time constant false. A compile time constant never contains a variable. Therefore x * 0 != 0 is not a compile-time constant, and therefore the consequence is reachable, and must be checked for definite assignment errors.

    Now, you and I know that x * 0 != 0 is always false, and the arithmetic optimizer knows that. Assuming that we initialize z so the program becomes error free and we get to code gen, we can safely remove the call to N(z) during code gen. 

    In short: we need to compute these two different kinds of reachability at different times, and therefore have two different passes to remove the two kinds of unreachable code. -- Eric
       

  • This is very different from the Java world right?

    The Eclipse IDE scales to very large projects excellently as it seems to retain a pre-computed dependency graph between compilation units within a single project. That way a recompile of the entire project is unnecessary because you changed one file - you just need to recompile what really needs recompiling.

    I don't think its enough to say "the C# compiler is really fast" because it simply is not fast enough for day to day coding on a large commercial project (I'm working a project where individual assemblies were developed that contain thousands of files totally hundreds of thousands of lines of code). The speed of the compiler is a significant problem in my day to day workflow as I must sit around feeling frustrated while my box churns away. I work in finance so have a very powerful dev. box too.

    The advice "split up your assemblies" is good and we are moving in that direction, but ultimately I feel that C# and Visual Studio have taken a large retrograde step by not meeting the bar in this area.

Page 2 of 4 (50 items) 1234