Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
A number of the articles I’ve published over the years involve “backtracking” algorithms; most recently my series on how to solve Sudoku puzzles (and more generally, all graph colouring problems) by backtracking. The idea of a backtracking algorithm is really simple and powerful: when faced with a choice, try every possibility. If all of them go wrong, backtrack to the previous choice and try a different possibility. Backtracking algorithms are essentially a depth-first search of the space of possible solutions.
But what if the solution space is large? There are more ways to fill in nine digits in 81 places than there are protons in the entire Universe; we can’t check them all. The reason that backtracking works so rapidly for Sudoku puzzles is because typically every guess eliminates many of the branches of the solution tree. If you guess that the first square is 1 then you do not need to explore the possibility that the second square is 1; you can just prune that branch and never explore it. Since that branch itself is Vast, that’s a good strategy! (There are pathological cases for this algorithm, but in practice it is fast.)
Backtracking algorithms work well for many situations, but we have consciously eschewed them in the design of C#. (With a few exceptions).
For example, there is no backtracking that ever crosses “phases” of compilation in C#.(*) When we are presented with some source code the first thing we do is “lex” it; break it up into “tokens”. This is akin to finding the words in a sentence by looking at the spacing and punctuation. Then we do a grammatical analysis of the token stream, and then we do a semantic analysis of the grammatical analysis. When one of those phases finds an error, we never backtrack to a previous phases and ask for more analysis. Suppose for example you have this silly code:
x = j+++++1;
That is lexically correct C# code. The lexer is a greedy lexer; it tries to make the largest token it can at any given time. So it tokenizes this as:
x , = , j , ++ , ++ , + , 1 , ;
Now the grammar analyzer takes that token stream and says “this is an expression involving two identifiers, one constant and four operators; how should I parenthesize that?” The grammar analyzer determines using the rules of precedence and associativity that this means
So this means “evaluate x, then increment j, then increment whatever the previous result was, then add 1 to the previous result, then assign the previous result to x, done”
Then the semantic analyzer checks that to make sure that all those operations make sense. They do not, because the result of j++ is a value, not a variable, but the second ++ requires a variable. The semantic analyzer gives an error, and the program does not compile.
If we had a backtracking compiler, the semantic analyzer could tell the parser “no, you did something wrong, give me a different parse tree if there is one”.
Suppose the parser does that. It could say well, maybe that last + was not a binary operator but rather the unary operator + applied to 1, oh, but if we do that then we have two expressions next to each other with no intervening operator. Same thing if the second ++ applies to the +1 instead of the j++. So no, there is no other parse of this. So push back on the lexer and ask it to backtrack.
The lexer could then say, oh, sure, if you don’t like that then maybe my guess that the second ++ was ++ was wrong. Try this one:
x , = , j , ++ , + , ++ , 1 , ;
and then the parser says OK, if that's where the token breaks are then the grammatical analysis is:
and the semantic analyzer determines that’s wrong too, this time because you can’t have a ++ on a constant. So the semantic analyzer tells the parser, no, that’s not it either, and the parser tells the lexer, no, that’s not it either. And the lexer says “well, maybe that wasn’t a ++ at all the second time, maybe that should be
x , = , j , ++ , +, + , + , 1 , ;
And now the parser says “oh, I see what that is, it is two unary plus operators applied to 1 and an addition operator:
and the semantic analyzer says “finally! something I can work with.”
That’s not what we do. This is a silly example, but perhaps you can see that there are two major problems with this strategy.
First, though most of the time lexing is unambiguous, in some pathological cases there are potentially enormously many ways that simple programs could be lexed. Just determining where the breaks could go between five plus signs gives eight possibilities; it grows exponentially from there. Since most of the time programs lex unambiguously, and when they do lex ambiguously, the space to explore could be really big, it’s better to always lex greedily and not backtrack ever. If a program fails grammatical analysis because of an unintended lexical structure then the best thing to do is tell the user and they’ll fix it.
The second major problem is that when backtracking, how do you know which backtracking is better? Of those eight possibilities, two give program fragments that pass semantic analysis: the one we just found, and x=(j+(+(+(+(+1))))); Is it at all clear which of the two possibilities is the one the user meant when they wrote this ridiculous statement? Remember, we’re not trying to solve a Sudoku puzzle here, where any given combination of numbers is as meaningless as the next; we are attempting to output IL that has the same meaning as the meaning intended by the user! One of these possibilities mutates j and the other does not, so it makes a difference. C# is not a “guess what you meant” language, it’s a “do what you said” language(**). If what you said was ambiguous, we should tell you about it.
Next time we’ll return to this point by looking at how the name resolution algorithm also does not backtrack.
(*) Unlike JScript, interestingly enough. The ECMAScript specification notes that there are lexical ambiguities involving the / character; it could be introducing a comment as // or as /*, or closing a comment as */, or could be a normal division operator, or a /= operator, or opening or closing a regular expression. Some of those overlap; The spec actually says that the correct lex is the one that produces a grammatically sensible program!
(**) Again, unlike JScript.
Indeed. For an analysis of the ambiguities introduced by the automatic semicolon insertion algorithm see my article from 2004 on that subject:
But let's have a look at the slash character. First off, a regex can't contain comments or division operators. Well then...
/* - Would introduce an invalid regex, so it starts a comment.*/ - Closes comment mode, otherwise it's just a regular * and /..... E.g. end of a regex or a = 5 *//comment[ new line ] 4;/= - Operator or beginning of a regex, I think context can be used to decide this...... (Because a regex is an object; there are two cases:........... Directly after an lvalue: you can't have an object there............ Everywhere else: you can't have an assignment there...... Similarly, we can always distinguish the / operator from a regex start.)// - Either a comment or an empty regex. This would be the main problem, e.g.:..... ListHits(//g);ListHits(..... /a/g);..... Both Chromium and IE treat this as ListHits(/a/g)...... There are other ways to make a regex that matches an empty string.
Anything I missed?
No, that looks fine. As you note in your fourth point, whether the lexical /= character sequence is a division-equals token or not depends upon the syntactic context. That's precisely my point; it is bizarre that the *parser* should have to tell the tokenizer "no, do the tokenization again, you did it wrong because of the grammatical context". Most languages do not have any sort of "backtracking" like this whereby the parser pushes back on the lexer when it encounters a problem. - Eric
For clarification: this would mean that in certain edge cases the way these browsers handle // would appear to be contrary to the spec as you provided it. So I actually read the spec, which can be found here:
Turns out you got it wrong.
It is not at all clear from your comment exactly which claim you believe I "got wrong". If, as you say, you actually read the spec then surely you noticed the bit in section 7 where it says
"There are two goal symbols for the lexical grammar. The InputElementDiv symbol is used in those syntactic grammar contexts where a leading division or division-assignment operator is permitted"
See? The goal state for the lexical grammar depends upon analysis of the syntactic grammar! This is quite unusual! Normally the grammatical analysis depends upon the output of the lexical analysis but does not affect the lexical analysis.
If there's something else you think I "got wrong" I'm happy to discuss it. What did I get wrong?
I believe that Eric may actually mean JScript when he says JScript, not ECMAScript.
I think when Anonymous Coward 2 says "you got it wrong", he or she is referring to Anonymous Coward 1, who stated that comments (//) are also ambiguous. As far as I can tell, this is not true, comments are not ambiguous - it is only the division operators and regular expression literals that are ambiguous (and thus require cooperation with the parser).
An interesting issue about Sudoku (relevant to the discussion) is that a proper Sudoku puzzle has a unique solution. Therefore, "try all the possibilities until you find one that makes sense" is a valid algorithm because you have the implicit guarantee that only one of them will make sense. You won't accidentally find the "wrong" solution.
I don't believe there is any ambiguity in the grammar. The problem is that your typical lexer just can't tell on its own when we have a proper context for a RegularExpressionLiteral.
Here is a note from the specification:
<strong>There are no syntactic grammar contexts where both a leading division or division-assignment, and a leading
RegularExpressionLiteral are permitted.</strong>
From my experience building 3 progressively better parsers it is very simple to tell the lexer when you are in a proper context. It is a bit tricky integrating this with my latest F# parser that uses parser combinators though.
"Is it at all clear which of the two possibilities is the one the user meant when they wrote this ridiculous statement? Remember, we’re not trying to solve a Sudoku puzzle here, where any given combination of numbers is as meaningless as the next;"
The point is that a Sudoku puzzle has only one solution so there never any ambiguity. It's got nothing to do with the solution being a meaningless combination of numbers.
ChaosPandion: I think the ambiguity comes from the // comment sequence. Since every context where a regex literal is legal a comment is also legal, there's no way to tell the difference between a comment and an empty regex.
I remember when I first heard that Netscape was adding regex literals and saw the syntax they used. I was very disappointed because JS doesn't need regexes often enough to justify special syntax, let alone add ambiguity to the grammar. Even Perl's regex/comment ambiguity isn't that bad, and that's saying a lot!
Don't you perform backtracking to determine where an explicit cast operation is being performed.
for example, (MyType) - myValue, or maybe you call it lookahead.
Another example with generics.
MyFunction( MyType< X, Y > (arg) )
Yes, we use lookahead in both those situations, and as a result the grammar of C# requires potentially arbitrarily large lookahead to parse. That's unfortunate but in practice these situations are usually pretty easy to resolve. Whether looking ahead in the token stream for a hint about what comes next is logically equivalent to "backtracking" is a question of semantics; I wouldn't consider it to be backtracking. - Eric
@Wesner Moise: Eric said <i>no backtracking that ever crosses “phases”</i>. These are probably solved in the C# compiler by backtracking on the parse tree, without going back to lexer.
<i>See? The goal state for the lexical grammar depends upon analysis of the syntactic grammar! This is quite unusual! Normally the grammatical analysis depends upon the output of the lexical analysis but does not affect the lexical analysis.</i>
Controlling lexer from parser is not a backtracking (e.g. it never considers more than one alternative), it's known as a lexical tie-in, or "lexer hack". It actually is quite common, one reason is that a lot of parsers are built in yacc-like tools, which are a pain for non-trivial grammars. Another reason is "C-like" grammars -- most C/C++ compilers classify tokens as typenames in lexer, which requires lexer to know symbol table (obviously a very high-level bit of semantic information).
Maybe you've planned this for a future part, but I'm interested in how the compiler deals with List<List<int>> vs the >> operator.
How about List<Dictionary<int, int>>?
C++ has a real greedy lexer, it considers >> as a shift operation, so in this case it requires a whitespace (or comment) between two > chars.
But C# doen't. How do you process this case?
The lexer lexes the >> as two > tokens, not one >> token. >> is not a token in C# 2 and above. If the parser is then looking for an operator and it finds a >, it checks to see if the next one is another > before deciding whether this is "greater than" or "right shift". - Eric
" The spec actually says that the correct lex is the one that produces a grammatically sensible program!" I believe that the programmers who wrote this spec did not realize what this sentence means for an implementation of JScript. They would not have included backtracking across parser layers intentionally.
I assure you that they knew precisely what they were doing; the authors of that line of the spec all had implemented multiple ECMAScript parsers and knew how easy or difficult such a requirement was. As other commenters have noted, this oddity does not actually introduce too onerous a burden upon the implementer of the lexer and parser. It's just vexing. C# was designed to avoid such problems, though as other commenters have noted, it does not avoid grammatical ambiguities. - Eric
"The lexer lexes the >> as two > tokens, not one >> token. >> is not a token in C# 2 and above. If the parser is then looking for an operator and it finds a >, it checks to see if the next one is another > before deciding whether this is "greater than" or "right shift". - Eric"
From this, I believe I can deduce that the C# lexer considers "whitespace" a token rather than simply a divider between tokens that gets ignored for the purposes of passing the lexer result onto the next phase?
That's not exactly how we do it but it's pretty close. See section 2.4.5 of the specification for details. - Eric