Evaluating Query Expressions

After writing the code to translate query expressions and remove transparent identifiers, one of my first desires was to write a tool that would generate query expressions to test the correctness of the code.  Of course, I already had a set of programmer tests in place to make sure that the code worked.  But I also knew that this feature exercised so many other features that it was sure to introduce pain points that did not previously exist.

Compiling the average query involves a large number of language features, many of which are new: query expressions, anonymous types (transparent identifiers), extension methods (method calls), lambdas (parameters), expression trees, object initializers (anonymous type initialization), type inference (generic methods and lambdas), and iterators (LINQ query operators).  Many of these are among the more complex features in the language.  Even if all of them work correctly when evaluated separately, it seemed likely that their interactions would be a source of bugs.  So I started thinking about what I could do to sample the space of query expressions to see whether we were doing all right.

My initial idea was to generate random, syntactically correct queries, feed them to both the C# compiler and an auxiliary translator, and compare the results.  Fortunately, our fantastic QA team already had a tool that did translation given an object model of the queries.  So, I put together the random query generator, QA plugged it into their translator, and we achieved the first goal.  We quickly verified that translation worked correctly and found a few spec and code issues, which were quickly resolved.  But we still hadn't tested the interaction of the many features that are used in query expressions; only the translation itself had been tested.

While driving home I thought, why can't we do more than translate the queries?  Why can't we compile them?  Why can't we run them?  In order to do this, the generator would have to generate random, semantically correct queries.  With a little work on the generator (defining an algebra and formalizing notions of scope), it was able to generate random, semantically correct queries that were large and fairly complex, although ultimately meaningless.

This was great and enabled us to find many very interesting bugs in the compiler.  It also opened our eyes to some big performance problems in some of the new language features.  In several cases, the compiler generated code patterns that typical users hadn't previously written, so we had to modify the compiler to make that style of code perform well both during compilation and at runtime.  In fact, these findings fed back into the language design, and we changed the query rewrite rules somewhat.

At first, we didn't give much serious thought to taking the next step and running the generated queries, because we would have to know what results the queries should produce in order to test them effectively.  The big problem with predetermining the results is that queries are pure syntactic sugar for a series of method calls, and those methods are not determined by the query expressions themselves.  Then I realized that we could interpret queries without translating them, given a well-defined provider.  So I wrote an interpreter for queries against the LINQ to Objects provider.  Once the interpreter was in place, we were able to test the translation of queries, the compilation of queries, and the runtime semantics of both the generated code and LINQ to Objects (IEnumerable<T> and IQueryable<T>).  Doing this, we found more bugs, both in the compiler and in System.Core, including at least one bug in subtle interactions of the type system and nullables that had been introduced in Whidbey.

As interesting as the exercise was from a software engineering standpoint, I want to focus on the notion of interpreting queries.  My last post discussed the syntax and scoping of queries.  The formalisms that were developed to discuss the scoping of queries were actually conceived while writing the semantically correct query generator.  These scoping formalisms have been very helpful to our team in both understanding and writing queries.  It has been just as helpful to have some notions of how to interpret queries without translating them.  So, I thought I would share those with you in the hope that they will help you understand queries better.

Interpretation without Translation

In this section, I will describe a way to interpret LINQ to Objects queries without performing the translation.  The essential notion is that each clause takes a sequence of bindings and produces a sequence of bindings, where a binding is a set of variables along with the values assigned to those variables.  For example, consider the following query:

from x in foo
where x >= 2
select x + 1

where

foo = new int[] { 0, 1, 2, 3 }

The first clause does not take a sequence of bindings as input, but it does produce one: a sequence that binds the variable x to each element of foo in successive order.  This will be denoted as:

[(x:0), (x:1), (x:2), (x:3)]

The next clause, the where clause, filters the bindings based on the predicate.  The resultant bindings are:

[(x:2), (x:3)]

Finally, the select clause generates the projection of the bindings in successive order.  The results are:

3, 4
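
Since the interpretation and the translation must agree, the hand-computed results can be checked against the compiled query.  Here is a quick sketch, with comments mapping each clause back to the binding model:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var foo = new int[] { 0, 1, 2, 3 };

        // Each clause corresponds to one step in the binding model:
        var q = from x in foo     // [(x:0), (x:1), (x:2), (x:3)]
                where x >= 2      // [(x:2), (x:3)]
                select x + 1;     // 3, 4

        Console.WriteLine(string.Join(", ", q)); // prints "3, 4"
    }
}
```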

Seems simple enough, but let's try something more complex.

from x in foo
where x >= 2
from y in foo
where x <= y
orderby x descending, y descending
select x + y

Look at the bindings after each clause:

from x in foo
  [(x:0),(x:1),(x:2),(x:3)]

where x >= 2
  [(x:2),(x:3)]

from y in foo
  [(x:2,y:0),(x:2,y:1),(x:2,y:2),(x:2,y:3),(x:3,y:0),(x:3,y:1),(x:3,y:2),(x:3,y:3)]

where x <= y
  [(x:2,y:2),(x:2,y:3),(x:3,y:3)]

orderby x descending, y descending
  [(x:3,y:3),(x:2,y:3),(x:2,y:2)]

select x + y
  6,5,4
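
The trace can be verified by compiling and running the query directly:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var foo = new int[] { 0, 1, 2, 3 };

        var q = from x in foo
                where x >= 2
                from y in foo
                where x <= y
                orderby x descending, y descending
                select x + y;

        Console.WriteLine(string.Join(",", q)); // prints "6,5,4"
    }
}
```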

Notice how, after each clause, we have a sequence of bindings that is then fed into the next clause.  The bindings may grow to include more variables (as after the second from).  Finally, the terminating clause (select or group...by) produces a projection or grouping based on the bindings.  Here is how to handle each clause type:

from x in src
  for each binding b
    for each element e in src
      extend b with x equals e

let x = e
  for each binding b
    extend b with x equals the value of e evaluated in the context of b

join x in src on outerKey equals innerKey
  for each binding b
    for each element e in src where outerKey equals innerKey
      extend b with x equals e

join x in src on outerKey equals innerKey into g
  for each binding b
    extend b with g equals the group of all elements e in src for which outerKey equals innerKey

where p
  for each binding b where p is true
    pass b on

orderby k1, k2, ...
  sort the bindings by comparing the value of k1 in each binding, then k2, and so on

select e
  for each binding b
    produce a result by evaluating e in the context of b

group e1 by e2
  for all bindings b where e2 evaluates to the same value
    produce a group result (key = the common value of e2, results = the sequence of evaluations of e1 from those bindings)

into x
  for each result r
    create a binding with x equal to r
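
These rules are enough to mechanize the interpretation for the simpler clauses.  The sketch below is my own illustration (not how the compiler or LINQ to Objects actually works): it models a binding as a dictionary from variable names to values and implements the from, where, and select rules as operations on sequences of bindings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A binding is a set of variables along with the values assigned to them.
using Binding = System.Collections.Generic.Dictionary<string, object>;

static class Clauses
{
    // from x in src: for each binding, extend it with x bound to each element of src
    public static IEnumerable<Binding> From(this IEnumerable<Binding> bindings,
        string x, IEnumerable<object> src)
    {
        return bindings.SelectMany(b => src.Select(e =>
        {
            var extended = new Binding(b);
            extended[x] = e;
            return extended;
        }));
    }

    // where p: keep only the bindings for which the predicate is true
    public static IEnumerable<Binding> WhereClause(this IEnumerable<Binding> bindings,
        Func<Binding, bool> p)
    {
        return bindings.Where(p);
    }

    // select e: produce a result by evaluating e in the context of each binding
    public static IEnumerable<T> SelectClause<T>(this IEnumerable<Binding> bindings,
        Func<Binding, T> e)
    {
        return bindings.Select(e);
    }
}

class Program
{
    static void Main()
    {
        var foo = new object[] { 0, 1, 2, 3 };

        // from x in foo  where x >= 2  from y in foo  where x <= y  select x + y
        var results = new[] { new Binding() }   // seed: one empty binding
            .From("x", foo)
            .WhereClause(b => (int)b["x"] >= 2)
            .From("y", foo)
            .WhereClause(b => (int)b["x"] <= (int)b["y"])
            .SelectClause(b => (int)b["x"] + (int)b["y"]);

        // prints "4,5,6"; the orderby descending in the example above would reverse this
        Console.WriteLine(string.Join(",", results));
    }
}
```

The names From, WhereClause, and SelectClause are invented here purely for illustration; the real translation targets the standard query operators instead.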

Here is a final example:

using System;
using System.Linq;

class Program
{
  static void Main(string[] args)
  {
    var foo = new[] { 2, 0, 3, 1 };
    var bar = new[] { "fred", "jim", "bob", "george" };

    var q = from x in foo // [(x:2),(x:0),(x:3),(x:1)]
            where x >= 2 // [(x:2),(x:3)]
            from y in bar // [(x:2,y:"fred"),(x:2,y:"jim")
                          // ,(x:2,y:"bob"),(x:2,y:"george")
                          // ,(x:3,y:"fred"),(x:3,y:"jim")
                          // ,(x:3,y:"bob"),(x:3,y:"george")]
            where y.Length > 3 // [(x:2,y:"fred"),(x:2,y:"george")
                               // ,(x:3,y:"fred"),(x:3,y:"george")]

            orderby x, y // [(x:2,y:"fred"),(x:2,y:"george")
                         // ,(x:3,y:"fred"),(x:3,y:"george")]

            select new { x, y }; // { x = 2, y = fred }
                                 // { x = 2, y = george }
                                 // { x = 3, y = fred }
                                 // { x = 3, y = george }

    foreach (var item in q) Console.WriteLine(item);
  }
}

And here is the output from the program:

{ x = 2, y = fred }
{ x = 2, y = george }
{ x = 3, y = fred }
{ x = 3, y = george }

Implications

It is fascinating, although not totally surprising, that given a well-defined provider, a model for interpreting the queries can be formed.  Given our model for interpreting LINQ to Objects queries, several facts become apparent.

1.  Each query clause is evaluated independently of the previous clauses.  Only the binding information flows between them.

2.  The clauses are evaluated in order, hinting at imperative evaluation.

Given these facts, it is easy to see that LINQ to Objects queries are analogous to the relational algebra, and therefore they are not so much declarative in nature as a specification of a series of operations.  This has several practical consequences.  Because the model is closer to the relational algebra than to the relational calculus, the order of the clauses has performance implications.  Consider the following two queries:

var q1 = from x in foo
         from y in bar
         where p(x)
         select f(x, y);

var q2 = from x in foo
         where p(x)
         from y in bar
         select f(x, y);

The two queries look very similar, but when evaluated against the LINQ to Objects provider, they are quite different.  This provider has no query planner, so while the two queries produce the same results, the second one probably performs fewer steps.  Of course, it is not clear whether the performance difference will matter in a given application, but it is wise to know the difference and why it exists.

The number of elements that are filtered by the where clause in the first query is equal to the number of elements in foo times the number of elements in bar.  The number of elements that are filtered by the second query is equal to the number of elements in foo.  Order matters when providers model the relational algebra.  (Try interpreting the queries by hand for some values of foo and bar.)
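
The difference is easy to observe by instrumenting the predicate to count how often it runs.  In this sketch, p and f are stand-ins just as in the queries above; a simple even-number test and an addition serve in their place:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var foo = Enumerable.Range(0, 10).ToArray();
        var bar = Enumerable.Range(0, 100).ToArray();
        int calls1 = 0, calls2 = 0;

        // The same predicate, instrumented to count its evaluations.
        Func<int, bool> p1 = x => { calls1++; return x % 2 == 0; };
        Func<int, bool> p2 = x => { calls2++; return x % 2 == 0; };

        var q1 = from x in foo
                 from y in bar
                 where p1(x)
                 select x + y;

        var q2 = from x in foo
                 where p2(x)
                 from y in bar
                 select x + y;

        q1.Count(); // force evaluation
        q2.Count();

        // prints "q1: 1000 calls, q2: 10 calls"
        Console.WriteLine("q1: {0} calls, q2: {1} calls", calls1, calls2);
    }
}
```

In q1 the predicate runs once per (x, y) pair because the filter sits downstream of the cross product; in q2 it runs once per element of foo.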

In the next post, we will see that other query providers expose a model that is more declarative in nature rather than imperative.

Merry Querying!