How Many Passes?

How Many Passes?

Rate This
  • Comments 50

Large bodies of code written in the C/C++ languages typically divide up the code into “header” files, which are just declarations of methods and types (and definitions of macros). The actual bodies of those methods and types are in completely different files.

People sometimes ask me why doesn’t C# need header files? which is a bit of an odd way to phrase the question; I would have asked the equivalent question why does C++ need header files? Header files seem like a huge potential point of failure; all the time I edit C++ code and change the signature of a method; if I forget to update the header file, then the code doesn’t compile and often gives some cryptic error message. Hopefully this large cost actually buys you something.

It buys the compiler writer one thing, and the user one thing.

What it buys the user is that you can compile each individual “cpp” file into a “obj” file independently, provided that you have all the necessary headers. All the information necessary to generate the bodies that are in a given cpp file can be gleaned from the set of headers. This means that the build system can recompile just those cpp files that changed, provided that no header changed.

What it buys the compiler writer is that every file can be compiled in “one pass”. Because every type and method declaration is processed before its first usage, the compiler can simply start from the top of the file, pull in all the included headers, and proceed from the top down, spitting out the obj file as it goes, never having to go back and revisit something its seen already.

The C# language does not require that declarations occur before usages, which has two impacts, again, on the user and on the compiler writer. The impact on the user is that you don’t get to recompile just the IL that changed when you change a file; the whole assembly is recompiled. Fortunately the C# compiler is sufficiently fast that this is seldom a huge issue. (Another way to look at this is that the “granularity” of recompilation in C# is at the project level, not the file level.)

The impact on the compiler writer is that we have to have a “two pass” compiler. In the first pass, we look for declarations and ignore bodies. Once we have gleaned all the information from the declarations that we would have got from the headers in C++, we take a second pass over the code and generate the IL for the bodies.

A good way to think about this is that the first pass computes metadata: the “top level” stuff, like namespaces, classes, structs, enums, interfaces, delegates, methods, type parameters, formal parameters, constructors, events, attributes, and so on. The second pass computes the IL: the code that goes in the method bodies, constructor bodies, and so on.

In practice, though we only do “two passes” over the raw text, we do many, many passes over the resulting data structures. I thought I might lay out for you just what passes we do in order to implement the C# language.

The first “metadata” phase goes like this:

The first thing we do is take the text of the sources and break it up into a stream of tokens. That is, we do lexical analysis to determine that "class c : b { }" is class, identifier, colon, identifier, left curly, right curly.

We then do a "top level parse" where we verify that the token streams define a grammatically-correct C# program. However, we skip parsing method bodies.  When we hit a method body, we just blaze through the tokens until we get to the matching close curly. We'll come back to it later; we only care about getting enough information to generate metadata at this point.

We then do a "declaration" pass where we make notes about the location of every namespace and type declaration in the program. At this point we're done looking at the source code for the first phase; every subsequent pass is over the set of "symbols" deduced from the declarations.

We then do a pass where we verify that all the types declared have no cycles in their base types. We need to do this first because in every subsequent pass we need to be able to walk up type hierarchies without having to deal with cycles.

We then do a pass where we verify that all generic parameter constraints on generic types are also acyclic.

We then do a pass where we check whether every member of every type -- methods of classes, fields of structs, enum values, and so on -- is consistent. No cycles in enums, every overriding method overrides something that is actually virtual, and so on. At this point we can compute the "vtable" layouts of all interfaces, classes with virtual methods, and so on.

We then do a pass where we work out the values of all "const" fields.

At this point we have enough information to emit almost all the metadata for this assembly. We still do not have information about the metadata for iterator/anonymous function closures or anonymous types; we do those late.

We can now start generating IL. For each method body (and properties, indexers, constructors, and so on), we rewind the lexer to the point where the method body began and parse the method body.

[UPDATE: A number of people have asked me about how the passes below are optimized. During the initial binding pass we make notes on things like "there's nullable arithmetic in here", or "there's an object initializer in here" and so on. When we go to do object initializer rewriting, say, we first check to see if we even have to, and skip the pass entirely if we know it will be a no-op. (And testing runs a special version of the compiler that does the pass anyway, and reports back whether it had any effect or not, so that we can test that our pass-skipping logic is correct.) ]

Once the method body is parsed, we do an initial "binding" pass, where we attempt to determine the types of every expression in every statement. We then do a whole pile of passes over each method body. Again, at this point we are done looking at the source code; we're now passing over the "annotated parse tree" many times, and rewriting it as we go.

We first run a pass to transform loops into gotos and labels.

(The next few passes look for bad stuff.)

Then we run a pass to look for use of deprecated types, for warnings.

Then we run a pass that searches for uses of anonymous types that we haven't emitted metadata for yet, and emit those.

Then we run a pass that searches for bad uses of expression trees. For example, using a ++ operator in an expression tree.

Then we run a pass that looks for all local variables in the body that are defined, but not used, to report warnings.

Then we run a pass that looks for illegal patterns inside iterator blocks.

Then we run the reachability checker, to give warnings about unreachable code, and tell you when you've done something like forgotten the return at the end of a non-void method.

Then we run a pass that verifies that every goto targets a sensible label, and that every label is targeted by a reachable goto.

Then we run a pass that checks that all locals are definitely assigned before use, notes which local variables are closed-over outer variables of an anonymous function or iterator, and which anonymous functions are in reachable code. (This pass does too much. I have been meaning to refactor it for some time now.)

At this point we're done looking for bad stuff, but we still have way more passes to go before we sleep.

Next we run a pass that detects missing ref arguments to calls on COM objects and fixes them. (This is a new feature in C# 4.)

Then we run a pass that looks for stuff of the form "new MyDelegate(Foo)" and rewrites it into a call to CreateDelegate.

Then we run a pass that transforms expression trees into the sequence of factory method calls necessary to create the expression trees at runtime.

Then we run a pass that rewrites all nullable arithmetic into code that tests for HasValue, and so on.

Then we run a pass that finds all references of the form base.Blah() and rewrites them into code which does the non-virtual call to the base class method.

Then we run a pass which looks for object and collection initializers and turns them into the appropriate property sets, and so on.

Then we run a pass which looks for dynamic calls (in C# 4) and rewrites them into dynamic call sites that use the DLR.

Then we run a pass that looks for calls to removed methods. (That is, partial methods with no actual implementation, or conditional methods that don't have their conditional compilation symbol defined.) Those are turned into no-ops.

Then we look for unreachable code and remove it from the tree. No point in codegenning IL for it.

Then we run an optimization pass that rewrites trivial "is" and "as" operators.

Then we run an optimization pass that looks for switch(constant) and rewrites it as a branch directly to the correct case.

Then we run a pass which turns string concatenations into calls to the correct overload of String.Concat.

(Ah, memories. These last two passes were the first things I worked on when I joined the compiler team.)

Then we run a pass which rewrites uses of named and optional parameters into calls where the side effects all happen in the correct order.

Then we run a pass which optimizes arithmetic; for example, if we know that M() returns an int, and we have 1 * M(), then we just turn it into M().

Then we do generation of the code for anonymous types first used by this method.

Then we transform anonymous functions in this body into methods of closure classes.

Finally, we transform iterator blocks into switch-based state machines.

Then we emit the IL for the transformed tree that we've just computed.

UPDATE: I've glossed over a number of additional passes that happen during IL generation. Basically, the IL generator turns the program tree into a set of "basic blocks" -- that is, sequences of code where nothing "in the middle" of the block leaves the sequence, and no other sequences branch to "the middle" of any other sequence. Once we have the code in basic block form we can start doing additional analysis on it: rewriting branches to be more efficient, removing basic blocks that the arithmetic optimizer has proven to be dead code, and so on. See my article on the optimize switch for more details of what we do during this pass.

Easy as pie!

  • Interesting to hear Eric! I remember the first time I heard that

    string superBigString = bigStringIHave + anotherBigString + yetAnotherBigString;

    ...was actually optimized to a String.Concat call. Made using StringBuilder unnecessary. It would be interesting to hear about more optimizations done and how it affects the daily code developers write. So here's a vote for such a blog entry =)

  • What pass takes a base.X() call in an iterator or anonymous method and generates an appropriate accessor method on the containing class?

    That is done by both the interator rewriter and the anonymous function rewriter. In the anonymous method rewriter, that pass is run after all other transformations have been performed on the anonymous function. The iterator rewriter is kind of bizarre; first we rewrite the body to figure out where the switch needs to branch to, then we rewrite the base calls, then we generate the switch and the branches, and then we generate the returns. I need to refactor that code one of these days; it is nigh impenetrable. -- Eric

  • > What it buys the compiler writer is that every file can be compiled in “one pass”. Because every type and method declaration is processed before its first usage, the compiler can simply start from the top of the file, pull in all the included headers, and proceed from the top down, spitting out the obj file as it goes, never having to go back and revisit something its seen already.

    It's true for C, but not really so for C++, because of inline functions, and body of class being fully accessible from within method bodies. I.e.:

      class foo {

           void bar() { baz(); /* forward reference! */ }

           void baz() { bar(); }

      };

    why they didn't go all the way there, and didn't also provide the same facility on namespace level, is beyond me, but oh well. Ultimately it just means that compiler writer now has two headaches - he has to write a two-pass compiler because of classes, but he also has to insert checks in that compiler so that forward references on namespace level are reported as errors (because language spec requires them to be ill-formed).

    That aside, headers actually provide one other advantage to the user: they (ideally) separate interface of the module from its implementation. This allows several things.

    For one, I can do interface-first design - write the header first, carefully designing the public surface of my API, and then implement it. If I forget or mismatch something, the compiler will complain. And if something isn't in the header, then it's not exposed (well, not quite true with C/C++, in practice...).

    As a side note, doc comments also normally go into headers, so they stick with interface where they belong - not with implementation.

    For another, as a library client, I can use header as formalized documentation - just open it, and look at the definitions (hopefully with doc comments).

    I suspect those two reasons are why F# went for its separate module interface (.fsi) + module implementation (.fs) approach, since they don't actually use .fsi files in the same way C/C++ does for separate compilation.

  • > What pass takes a base.X() call in an iterator or anonymous method and generates an appropriate accessor method on the containing class?

    It's implicit in the algorithm due to the ordering of the steps:

    - Then we run a pass that finds all references of the form base.Blah() and rewrites them into code which does the non-virtual call to the base class method.

    - Then we transform anonymous functions in this body into methods of closure classes.

    - Finally, we transform iterator blocks into switch-based state machines.

    As I understand it, the structures between those steps don't map 1-to-1 to C# code, which is why this is possible. So at step #1, base.Foo() is mapped to something which isn't representable directly in C# syntax, but is analogous to ClassName::Foo() in C++.

  • Pavel, that's true in C# 2 and 3, but it generates unverifiable code (which therefore produces a warning, because there wasn't time to fix it in those versions of the compilers). Because verifiability prohibits non-virtual calls on any object other than 'this', I believe.

    In C#4, this bug is fixed by having the compiler actually generate a method accordingly, so that:

    public IEnumerable<int> foo() {

     yield return base.Bar();

    }

    is compiled into something equivalent to:

    public int baseBar() { return base.Bar(); }

    public Func<int> foo() {

     yield return baseBar();

    }

    I'm curious as to which pass of the compiler does that rewriting.

  • > Pavel, that's true in C# 2 and 3, but it generates unverifiable code (which therefore produces a warning, because there wasn't time to fix it in those versions of the compilers). Because verifiability prohibits non-virtual calls on any object other than 'this', I believe.

    Indeed, I've found the corresponding section in the CLI spec - PIII 3.19. It's actually a little bit more subtle - it prohibits non-virtual calls not via "this" only for non-final virtual methods - so base.GetType() is fine, but base.ToString() is not. Something new for me, and I've never ran into this VC# warning before, but overall it makes sense.

  • Thanks a lot for putting this out here. I have been wondering about the process the C# compiler used to get from a loose string of text into IL for a long time.

    What is the reason for turning partial methods with no implementation into no-ops as opposed to completely eliminating them as if they were unreachable code?  Does the compiler do this the same way regardless of the state of the debug info and optimization options?

    I believe the no-ops are removed by the code generator if in "optimize" mode but I am not certain of that; I haven't looked at that code for a while. -- Eric

  • Very enlightening.  It's interesting that you mention your intent to refactor one of the passes that does too many things, presumably, into more passes.  How important is the cohesion of the passes compared to the performance of the compiler (or are they somehow orthogonal)?  Do you consider the impact of adding passes to the compiler when designing new language features, or do you simply design the feature as you will and then do what it takes to make the compiler performant?  Obviously cramming functionality together into a single pass violates the "Do Not Make Premature Optimizations" principle, but since your team likely has a good understanding of how much penalty must be paid for each additional pass, I'm curious as to the process for determining how each pass is justified.

  • > What is the reason for turning partial methods with no implementation into no-ops as opposed to completely eliminating them as if they were unreachable code?

    I suspect it is so that one can set a breakpoint at such a method in debug build, whether it's implemented or not. VC# likes to put NOPs in the code elsewhere as well, with /optimize-, and probably for the same reason.

  • Great post!

    I am curious how the IDE communicates with the compiler and which passes are included when the IDE checks the code. Also how do you do "edit and continiue" and integrate with the debugger in general. Now when I read my questions I realize that these may require a full post or more than one full post.

    The interaction between the design-time IDE, the run-time debugger, and the edit-and-continue layer are all very complex, and none of them are my area of expertise. -- Eric

  • Given the large number of passes many of which don't seem to rely much on one another what opportunites do you see for doing some of this concurrently?  Or compiling multiple projects concurrently?  Or is disk I/O the big bottleneck, which would make concurrent project compilation pointless?

    MSBuild should already be pretty smart about doing project compilation on separate processors, but again, that is not my area of expertise.

    I would like very much to be able to take advantage of the potential parallel execution here. Doing the passes in parallel is probably a bad idea because the rewriting passes all act on the output of the previous pass, and many of the analysis passes depend on the rewriting passes having run. (For example, today's reachability checking pass assumes that all loops have been lowered into ifs and gotos.)

    We could certainly run analysis of method bodies in parallel. However, that then raises the question of what to do about emitting the IL; that seems like a poor candidate for parallelization. -- Eric

  • I'm curious to know whether the basic lexer and parser is hand-written or generated by tools analogous to lex and yacc?

    The lexer and parser are both hand-written; the parser is your basic recursive descent parser. Though it is easy to quickly get a language going with a machine-generated parser, and they do tend to be accurate, there are enough oddities and unusual cases in C# that its not easy to do a great job with a machine-generated parser. We also want to be able to easily extend and debug the parser, add custom error recovery, add heuristics that make it more useful for IDE scenarios, and so on. I'm sure that if we had an expert on machine-generated parsers we could get something going; we've done experiments with them in the past. Trouble is, when something breaks really horribly, you kinda need someone with an advanced degree in parser theory to suss it out. We don't want to have to go running to Steve Lucco every time our parser has a glitch. -- Eric

  • Wow, that brings back a lot of memories. Of being confused, mostly: I was a manager at the time.

    One implication of this architecture is if you have errors in your program, the compiler will stop on the failing pass. That means if you have cycles in your base types, and cycles in your generic parameter types, only the first class of errors will be reported. Only when the first class of errors is resolved will the second class appear.  

    We had been writing our C++ in a way the reflected our love of C#. We tried to use idioms that would bring some of the power, safety, and expressiveness of C# in to C++. At one point we tried to port the C++ to C# by renaming all the source files to .cs, compiling, and working through the errors. We figured this might be doable since we were writing C++ like it was C#.

    We'd compile and have 10,000 errors, then work our way down to 1, then fix that last 1, and then there'd be 7000 new errors, as the compiler made it on to the next pass.

    Indeed; if you cannot know that your base classes are acyclic then it makes it very difficult to, say, determine whether a given method overrides something; you could look through the base types forever. We make it easier on ourselves by skipping later phases if important earlier phases did not complete without error. This is then complicated by the desire to give good intellisense in the IDE even when the program does have problems at a more fundamental level. It's a hard problem. -- Eric

     

  • Having this internal glimpse of the mechanics is very, very interesting!

    And it's simply all kinds of awesome to blog about it.

    Eric, a hearty thank-you for the blog in general and this one in particular.

  • Thank you for this post! I've been wondering about many of these stages for a while.

    How does the compiler handle emitting of IL? Does it use reflection.emit or something more sinister?

    Thanks

    The compiler is written in unmanaged C++, so the managed ref emit library is impractical. We use the unmanaged metadata and IL emitting code provided by the CLR team.

    Even if we had a compiler written in managed code, we would not use ref emit; its metadata emitter was not designed with full-fledged compilers in mind. It is not difficult to find crazy type topologies that ref emit is unable to handle. If you want to write a compiler in managed code that can emit arbitrarily complex type topologies then consider something like CCI. -- Eric

Page 1 of 4 (50 items) 1234