Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
[This is part of a series on generating every string in a language. The previous part is here. The next part is here.]
Last time: Clearly we have reduced a problem of size k to multiple problems of size less than k-1, and so this is amenable to a recursive solution. Right?
Wrong. (The “clearly” was a dead giveaway.) Remember, three things are necessary for a recursive solution. First, there has to be a way of reducing a problem of some size to a related problem of smaller size. Second, there has to be a way of combining solutions to smaller problems into a solution to a larger problem. And third, the smallest possible problem has to admit a non-recursive solution; the recursion has to stop somewhere.
Have we got all of that? We seem to, but let’s try a different example. We previously discussed a grammar for arithmetical sums. Let’s make a minor modification to that:
S: N | S PP: + NN: 1 | 2 | 3
I hope you agree that this is clearly the same language, just with a slightly different grammar. Now how would we go about answering the question “what are all the strings in this grammar of form S with three terminals? I want a concise way of saying that, so let’s say that S<3> means that. To say “the set of strings which is the combination of all possible strings of X<n> followed by all possible strings of Y<m>” I’ll say X<n>Y<m>
So, what is S<3> ? It is the combination of N<3>, S<0>P<3>, S<1>P<2>, S<2>P<1> and S<3>P<0>.
Uh oh. We have not met the first condition; we “reduced” the problem of “what are all the strings of size three that match S” to a set of problems that requires us to solve that problem! In fact, the subproblem S<3>P<0> is solvable if we know that we can skip it because there are no such strings of the form P with zero terminals. But how do we know that? In this particular case it is obvious that substitution of P never produces NIL, but in general we have no way of knowing that.
Heck, we can’t even satisfy the third condition; the problem of working out S<0> is “reduced” to the problem of solving both N<0> and S<0>P<0>, which is a larger problem!
We might be able to make some progress if we can annotate the grammar to say “by the way, I happen to know that this production rule always produces at least one terminal, so you don’t need to consider the zero cases.” That is doable, but it is somewhat vexing to have to do so.
Is there another way?
Next time: Why yes, yes there is.
Step 1: Mark all terminal clauses as +1.
Step 2: Mark all rules with | as +X, where X is the maximum value.
N = +1
P = +2
S = +1 | S + 2
I'm not sure what to do at this point, but I feel like this has given us a hint at how many terminals the different rules produce.
You want to know if they can produce 0 terminals? Follow the NILs, going backwards much like a garbage collector.
A = x | x B
B = C D
C = x | NIL
D = y | C
C can be NIL.
Now everything that contains C can omit that. B can be D. D can be NIL.
Now everything that contains D can omit it. B can be NIL. A can be x.
Now we're done collecting garbage - B, C and D can be NIL, A can't.
I'm not sure how to do it in more complex cases because this solution isn't exactly recursive, but it's definitely a start.
The problem looks similar to the one I had with left recursive grammers in parser combinators. One solution for this which I implemented in my own .net parser combinator I found in these papers:
The idea is basically to limit the recursion depth for each single non-terminal to the length of the input string. This works because a useful recursive rule must at least produce one terminal in every recursion step.
However I'm not completely sure if and how to apply this to the problem on hand, because I don't fully understand it :-S
In the sequence S<3>P<0>, P<0> stands for all possible strings of length 0 that can be produced from the non-terminal P. But since the length is 0, there is only 2 possibilities: either P can produce NIL, or it can not. If P can be reduced to NIL, then S<3>P<0> is the same as S<3> (which we are evaluating already) and if not then S<3>P<0> can be skipped. So wether or not P<0> can be reduced to NIL, we can always skip the rule S<3>. Am I missing something?
Maybe not too imaginative, but I'd construct LR parser states from the grammar, then in each state iterate all terminals that are accepted by the state, moving to the proper follow state (also applying reductions, like when parsing), and when the desired word length is reached check if end-of-input is a valid follow symbol for where I am.