Welcome to MSDN Blogs Sign in | Join | Help

This is the sixteenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might try rolling into a ball and crying for momma. That’s right, reading on is as hazardous to your health as a skinned knee. Just be warned and have an anti-biotic on hand.

Complete list of posts in the Building an IQueryable Provider series

I would have had the update out sooner but I couldn’t think of what to write in this excuse slot. My excuse generator was down for scheduled maintenance but did not come back on line as planned and wouldn’t you know it the techie responsible gave me all sorts of lame excuses. Some people!

What's inside:

Performance – I’ve actually tried to improve things.

Caching – Re-use queries without the burden of defining delegates.

Evaluation – Execute expression trees without Reflection.Emit.

Bug Fixes – Most of the bugs reported on CodePlex are fixed.


The full source code and redistributable DLL's can be found at:

http://www.codeplex.com/IQToolkit

Faster is Better

One of the big things I tried to tackle this time around was to finally do some performance improvements.  Until now, the only performance considerations were in the design of compiled queries and the use of compiled LINQ expressions for materializing objects.  Yet, when I looked at actual performance of compiled queries versus straight ADO, there was still a lot of overhead.

Where was the time being spent?  As it turns out, even though I was compiling a LINQ expression to represent the execution plan and an explicit function to convert DataReader rows into objects, ideally making the writing of data into objects as fast as possible, there was still room for improvement. The problem was not the writing of data into objects, but reading the data from the DataReader. I wholely blame myself for this. In an attempt to simplify the IQToolkit source code way back in one of the original posts I chose to read data using the general DataReader.GetValue method. This has two undesirable side effects; 1) the value is returned typed as object, which mean all non reference types (mostly numeric and datetime types) are boxed, which is measurably significant, and 2) in order to make sure the value was in the correct form to write it into the object I now had to coerce it back (which immediately led to an unbox operation, equally as significant.)

I tried many variations of solutions. Ideally, the runtime code would simply know which field access method of the DataReader to call and to coerce the results only if necessary. Unfortunately, the translation engine does not track enough information to be 100% certain of the data type the provider will return. It can make a good guess, but if it is wrong then the query will die in a fiery exception that halts the program and sends your manager storming toward your office. The solution I chose was sort of a hybrid.  Based on an expected value type a type specific helper method is now invoked. This method calls the equivalent DataReader method inside of a try-catch. I know, I hate having to use try-catch in this way, but the cheap cost of setting up the try-catch and the rare condition where the guess is wrong led me to the dark side. I will now change the color of my light-saber from blue to red.

Here’s an example of the GetInt32 helper method.

public static class FieldReader
{
...
public static Int32 GetInt32(DbEntityProviderBase provider, DbDataReader reader, int ordinal) { if (reader.IsDBNull(ordinal)) { return default(Int32); } try { return reader.GetInt32(ordinal); } catch { return (Int32)provider.Convert(reader.GetValue(ordinal), typeof(Int32)); } } ... }

As long as the expected type is correct, the faster DataReader.GetInt32 method is called. If that fails, then the fallback is to call the general GetValue method and coerce the result.  This should rarely happen.

This is all it took to get the compiled query into very low overhead versus direct ADO calls; mostly less than 3%. I’ve added a performance test you can run to check it out on your set up. Of course, this will vary depending on the query, the hardware, the load on the server and the network latency.

Caching Queries

“Why can’t the provider simply cache the queries for me.”  I’ve gotten this request a lot.  Sometimes from direct pleas in email, other times from those of you trying to do it yourself and asking for advice. 

It seems natural to imagine that it would only take a little bit of work for a provider to determine if a query being executed now is the same or similar to one executed before, so why is the only way to re-use a query to specify one using the cryptic syntax of generic methods, lambda expressions and delegates?  And why do I have to hold onto the thing myself, can’t some service do this for me?

Of course, my usual reaction is to give a heavy sigh in the privacy of my office and then craft a quite sensible fire-and-brimstone reply, complete with infallible logic and dramatic consternation, as to why and how you really are better off with the compiled query solution. But I’m tired of doing that, so instead of impressing you with my sound reasoning I’m going to show you how I went ahead and just did it. I like challenges, and there’s no better challenge than the impossible, or the sometimes impractical, or the generally ill-advised.

What I built is a new piece of code called the ‘QueryCache’.  It is actually implemented to be generic enough that it will work with your own IQueryable provider. Yet its not currently integrated into any provider, though you may choose to embed it into yours.  You can, however, use the cache as is to execute your queries and take advantage of its cache-y goodness. You don’t have to make delegates and invoke them, you simply have to give the cache your query and it will give you back the results.

var cache = new QueryCache(10);
var query = from c in db.Customers where c.City == "London" select c;
var result = cache.Execute(query);

Here’s how it works.  The cache maintains a most-recently-used list of compiled queries.  Every time you execute a query via the cache, the cache compares your query against the ones in the cache. If it finds a match, it simply uses the existing compiled query and invokes it. If not, it makes a new compiled query and adds it to the list.

Of course, that’s the easy part. 

The hard part is figuring out how to compare an IQueryable query object against a list of compiled-query delegate objects and determine which ones can be reused. For example, are these two the same query?

var query1 = from c in db.Customers where c.City == "London" select c;
var query2 = from c in db.Customers where c.City == "Seattle" select c;

Technically they are different expression trees, but if that’s the deciding factor then I might as well give up now. They are structurally similar and so it is logical to assume that a query compiled for one should be nearly identical to a query compiled for the other. If I were using compiled queries directly I would simply choose to make the name of the city a parameter to the compiled query delegate and invoke it with different values. So isn’t that just what I want the cache to do for me? 

To do this I need a little tool that will take a look at a query tree and decide which bits should be considered parameters and then give me back a new query tree with the parameters in place of those bits. As it turns out this is not really all that hard since it seems obvious for the most generality that anything that appears in the query tree as a constant should be deemed a parameter. A trivial expression visitor can produce this.

If it also wraps the rewritten query tree in a lambda expression and gives me back the set of constants that it extracted then I have everything I need to make a compiled query and invoke it with an array of values. I also have what I need to find an already compiled query in the cache, since compiled queries hold onto their defining query trees. So if my first query in the example above is already in the cache, it already has a parameterized query tree defining it and that ought to look awfully similar to the parameterized version of the second query.

The only thing I need now is a way to compare two lambda expressions to see if they are similar.  Fortunately, I wrote that ages ago.  The ExpressionComparer does just this. 

If I cobble all these parts together into a reasonable order I get my QueryCache. Now, I can use compiled queries without ever having to manually define them again!  Huzzah!

Yet, if this is so grand, why haven’t I taken that one last step and simply plugged into the base DbEntityProvider class?

Unfortunately, reality is not as rosy and I would hope.

There are a few problems holding me back.  The most significant is the silly act of parameterizing itself. The problem with the cache is that it doesn’t know which constants in the query should be made into parameters, so it turns them all into parameters. Yet, as much as I’d like it to be otherwise, databases don’t treat all parameters equally. Sometimes a parameter is terrific for performance; a database like SQL Server optimizes by storing parameterized query plans so you basically get the same benefit client and server.  Yet, sometimes, for some bits of SQL, for some back ends, parameters either cannot be used or have crippling effects on speed. So it's usually the best policy to be explicit and judicious when using parameters, something the cache in all its glory cannot chose for you.

So I leave the cache, for now, as something you can explicitly opt-in to using, per query.

Please, give it a try.  Feedback is welcome.

Expression Evaluator

Sometimes you may want to execute expression trees as code without actually turning it into true runtime code. You may be running on a platform that does not currently support the Expression.Compile() method (or the ability to use Reflection.Emit to generate IL at runtime) like Windows Mobile. I’ve been encouraged by folks here at Microsoft to think about this from time to time, so I set out to explore this space and the result is a whole mess code called the ExpressionEvaluator, which is an expression tree interpreter. I added my experimental solution to the toolkit in case it is beneficial for someone. It is a lot of code, so it doesn’t compile into the library by default. You have to enable it with a specific #define, NOREFEMIT.  This will also switch over all direct uses of Reflection.Emit in the toolkit to using this evaluator.

It has a hefty performance overhead, so I would not consider it a viable alternative to Reflection.Emit. I’ve even added many tricks to avoid calls to reflection and boxing. This has made it a lot faster than a naive solution, but still it pales in comparison to real IL.  However, in cases where there is no alternative, its probably the best you are going to get.

Crazy Antics

In keeping with the spirit of making changes for no good reason, I’ve unbundled the base set of toolkit code from the ADO provider specific bits. So now if you are using IQToolkit to build your own non-ADO based provider you don’t have to be dragging the rest of it along for the ride.  All the SQL translation and DbEntityProvider goodness is now in its own DLL, IQToolkit.Data.dll. This means you’ll probably have to tweak your projects to include this new DLL, but that’s about all.

The DLL list is now:

IQToolkit.dll
IQToolkit.Data.dll
IQToolkit.Data.Access.dll
IQToolkit.Data.SqlClient.dll
IQToolkit.Data.SqlServerCe.dll
IQToolkit.Data.MySql.dll
IQToolkit.Data.SQLite.dll

Of course, as always, you only need around the ones you are using, or none of them if you are simply lifting the code out as source.

That’s All Folks

I’m sure there are many more wonderful nuggets of goodness I added but forgot to mention.  If you discover any of them please file a full report at http://www.codeplex.com/iqtoolkit

This is the fifteenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might try searching for the audio tapes on www.Bing.com.  That would be a lot easier than reading. You won't find any, but you'll feel better for having tried.

Complete list of posts in the Building an IQueryable Provider series

Getting this version of the toolkit together has taken a lot of sleepless nights.  Thank goodness for Netflix or I'd have had to spend those sleepless nights actually working on the Toolkit.

Okay, enough with the flavor text, let's get to the crunch.

What's inside:

More Providers - MySQL and SQLite join the previous MS only line up.

Transactions - Use ADO transactions to control the isolation of your queries & updates.

Entity Providers - The provider concept is expanded to include tables of entities

Entity Sessions - The session concept adds identity caching, change tracking and deferred updates via SubmitChanges

Provider Factory - Create providers on the fly w/o knowing anything more than the database name and mapping.

Madness


The full source code and redistributable DLL's can be found at:

http://www.codeplex.com/IQToolkit

More Providers

  • MySQL -- With the stewardship of MySQL in doubt after the purchase of Sun by Oracle, I was leery of taking on the challenge of making a MySQL provider for the toolkit. Yet, the benefits of doing so turned out to be very significant and not just for the MySQL users, as it forced me to challenge some of my very assumptions about SQL which in turn made some of my meager testing better, which in turn made the toolkit better as a whole. 

    I was surprised when I went to download the free and open-source database product that I had to fill out a bunch of information on myself and wait for a formal approval (that took many days) before I was *allowed* to access the product. This seemed like an awful lot of hassle and odd for an open source project; way too corporate. Fortunately, I persisted and got the server up an running with no problem. Next I searched high and low for MySQL version of Northwind so I could have something to run the tests against. If only Bing were available then!  I had to settle for plain ol' Live Search. Fortunately I found what I was looking for, even if it didn't offer me any cash back.

    MySQL seems to me what you get if you take a bunch of C programmers and tell them to make a SQL database product and yet offer no guidance. This is not criticism, just a bit of snarky wit over many of the underscore endowed API's that make up the product. Luckily, you'll not need to worry yourself about the manual dexterity needed to type such awkward function names, as you'll be using LINQ and the MySQL provider will do the work for you. On the positive side, the function library of MySQL is quite large and feature rich. I had no problem finding appropriate translations. MySQL even had many interesting features that I'd like to go back and refactor the toolkit to take advantage of.  Next release maybe.

    There were a few problems I uncovered while trying to get MySQL to pass the test suites.  I took the opportunity to fill out the execution suite that has languished for a few releases now, so I could actually assert what the correct output ought to be for specific queries. In almost all the cases MySQL did the right thing (as far as I was expecting.)  In most of the cases that it did not appear to do the right thing, it often turned out to be bad assumptions on my part about collation ordering and so removing those assumptions from the tests fixed everything up. The remaining problem still boggles my mind.

    In a test designed merely to prove that the translation of simple joins succeeded to generate the right query, the database did not return the correct number of rows. 

    from c in db.Customers
    from o in c.Orders
    from d in o.Details
    select d.ProductId

    which executes a MySql query that looks like this:

    SELECT d.ProductId
    FROM [Customers] AS c
    LEFT OUTER JOIN [Orders] AS o ON o.CustomerID = c.CustomerID
    LEFT OUTER JOIN [Order_Details] AS d ON d.OrderID = o.OrderID

    I expected to get a number of rows corresponding to the total number of order-details in the database.  But I didn't.  I got a whole lot less.  So I started experimenting with different forms of the query.  If I selected o.OrderID instead of d.ProductID, I got a different number from either of the other two. If I selected c.CustomerID, again an entirely different number of rows.  It was only when I chose to select the entire order-detail object did I get the number of rows I was expecting.

    Something strange seems to be going on.  On nobody's definition of SQL should I be getting a different number of rows depending on the columns in the selection (except if that SELECT has a DISTINCT operator specified.)  It was as if all queries were getting a default DISTINCT operation. When I took a look a the results when I selected out only ProductID, sure enough, all I got back were distinct product-id's.  MySQL has an 'ALL' keyword that is the semantic opposite of DISTINCT, so I tried adding that to the query, but to no avail. 

    I don't know if this is a problem in the query engine or the transport layer, or if the MySQL folks actually think it is appropriate behavior.  As far as I'm concerned it is catastrophically bad; E_FAIL. For this problem alone I would recommend not using MySQL. Still, it may not be as bad as it sounds. It is likely very rare that you'd ever actually write a query that was expected to retrieve duplicate rows of data. So it may not ever impact you at all. I do recommend testing all your queries (and your application) before you put it into production.
  • SQLite - This is an open-source database product that is similar in nature to MS Access and SQL Compact, since it runs in-proc and is not really a server. After wrangling with MySQL, I thought I knew all I needed to know about different forms of SQL, until Jonathan Peppers sent me his go at making a SQLite provider. ( See, I told everyone that I'd add new providers if they would send them to me.)  Of course, SQLite had a few problems of its own.

    While MySQL has a huge library of function, SQLite has a tiny one; many .Net API's don't have supported translations.  (This is of the set I had translated for MS SQL.  To be truthful, MS Access falls short on some of these too.)  So queries using API's other than some simple string manipulation or equality testing probably won't work against SQLite. 

    In addition one of SQLite's big drawbacks is its lack of a rich type system.  SQLite's developers claim this to be a feature, and it may afford great flexibility, but is often a death sentence for LINQ queries, which are all strongly typed.  SQLite does have types, but only a few of them, like number, text and binary.  Problems arise when you try to use DateTime's, as they are not really their own type in SQLite, but a text layout.  For example, if you were trying to find all orders for a customer that happen in January, you might write this query.

    from o in db.Orders
    where o.CustomerID == cust && o.OrderDate.Month = 1
    select o

    this would produce this query.

    SELECT o.OrderID, o.OrderDate
    FROM Orders AS o
    WHERE o.CustomerID = @cust AND STRFTIME('%m', o.OrderDate) = 1

    The date function to extract the month out of the OrderDate column is really a text formatting function that extracts the month portion of the date-time text layout, which in this case is the text '01', since the date-time format is always padded to a specific width.  When this is compared against the number (1), another subtle difference crops up. For all other SQL's I've come across, text is considered the weakest form of type, so when two types are compared for equality text is always converted to the other form.  Yet, in SQLite, the opposite it true (which truthfully is more like C# and Java).  So when the two types are compared they are found incompatible, because the number (1) is turned into the text '1' and that is not the same as the text '01'. Adding insult to injury, there appears to be no type conversion functions at all, so I can't even work around the problem by injecting a conversion. My ignorance of SQLite may just be showing here, or a lack of sufficient documentation.

    So my recommendation if you are using SQLite, to stay away from most API functions in your where clauses and such. API calls in the projection are okay, because these get executed on the client.

Transactions

I'm surprised no one's called me on this before. The DbQueryProvider and its ilk have been suspiciously lacking in support for transactions. The providers work, LINQ queries are converted into ADO Commands and executed, yet those ADO Command objects are never assigned an ADO transaction, even if you started one explicitly. 

Of course, the official word from Microsoft is to stop using the ADO transactions altogether and instead use the System.Transactions.TransactionScope object, that is newer, better and enables automatic use of distributed transactions, etc, etc, etc. And if you did use TransactionScope, then the problem I'm referring to would not be a problem. SqlCommand object's would implicitly enlist in the transation without me having to specify anything. 

Unfortunately, TransactionScope is mired with many problems and is not supported by all ADO providers so ADO transactions are still a necessity.  You can now use ADO transactions with query providers in a manner similar to LINQ to SQL.

provider.Transaction = provider.Connection.BeginTransaction();

// use the provider here to execute queries and updates, etc.

provider.Transaction.Commit();
provider.Transaction = null;

The provider will use whatever transaction object you give it when it creates new ADO command objects.

Entity Providers

I decided to formalize the pairing of query providers with a Table object that enables updates and other facilities.  The definition of an entity provider is now defined by these three interfaces.

public interface IEntityProvider : IQueryProvider
{
    IEntityTable<T> GetTable<T>(string tableId);
    IEntityTable GetTable(Type type, string tableId);
}

public interface IEntityTable : IQueryable, IUpdatable
{
    new IEntityProvider Provider { get; }
    string TableId { get; }
    object GetById(object id);
    int Insert(object instance);
    int Update(object instance);
    int Delete(object instance);
    int InsertOrUpdate(object instance);
}

public interface IEntityTable<T> : IQueryable<T>, IEntityTable, IUpdatable<T>
{
    new T GetById(object id);
    int Insert(T instance);
    int Update(T instance);
    int Delete(T instance);
    int InsertOrUpdate(T instance);
}

You can now always get to a table directly from a provider.  The two concepts are coupled together.  An entity table also has explicit CRUD methods and implements IUpdatable, so no more separation between normal tables and updatable tables. In my mind this simplifies things quite a bit. 

Of course, this caused me to want to rename DbQueryProvider.  Ooops.  This will likely cause you some grief as any of your existing code that was using DbQueryProvider directly is now not going to compile.  The new name for this class is now DbEntityProvider. It might not matter so much now that there is a nifty IEntityProvider interface.

Entity Sessions

One thing missing from the Toolkit so far has been all of that context stuff that LINQ to SQL and LINQ to Entities have.  When you use LINQ to SQL you have a change tracking service that detects when your objects change, and sends the updates for you all at the same time when you call SubmitChanges. 

An entity session is all of this change-tracking, deferred updating stuff packaged up together. It is distinctly different from an entity provider in these ways, yet similar to one in many others.

An entity session is defined below:

public interface IEntitySession
{
    IEntityProvider Provider { get; }
    ISessionTable<T> GetTable<T>(string tableId);
    ISessionTable GetTable(Type elementType, string tableId);
    void SubmitChanges();
}

public interface ISessionTable : IQueryable
{
    IEntitySession Session { get; }
    IEntityTable ProviderTable { get; }
    object GetById(object id);
    void SetSubmitAction(object instance, SubmitAction action);
    SubmitAction GetSubmitAction(object instance);
}

public interface ISessionTable<T> : IQueryable<T>, ISessionTable
{
    new IEntityTable<T> ProviderTable { get; }
    new T GetById(object id);
    void SetSubmitAction(T instance, SubmitAction action);
    SubmitAction GetSubmitAction(T instance);
}

public enum SubmitAction
{
    None,
    Update,
    PossibleUpdate,
    Insert,
    InsertOrUpdate,
    Delete
}

public static class SessionTableExtensions
{
    public static void InsertOnSubmit<T>(this ISessionTable<T> table, T instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Insert);
    }

    public static void InsertOnSubmit(this ISessionTable table, object instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Insert);
    }

    public static void InsertOrUpdateOnSubmit<T>(this ISessionTable<T> table, T instance)
    {
        table.SetSubmitAction(instance, SubmitAction.InsertOrUpdate);
    }

    public static void InsertOrUpdateOnSubmit(this ISessionTable table, object instance)
    {
        table.SetSubmitAction(instance, SubmitAction.InsertOrUpdate);
    }

    public static void UpdateOnSubmit<T>(this ISessionTable<T> table, T instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Update);
    }

    public static void UpdateOnSubmit(this ISessionTable table, object instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Update);
    }

    public static void DeleteOnSubmit<T>(this ISessionTable<T> table, T instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Delete);
    }

    public static void DeleteOnSubmit(this ISessionTable table, object instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Delete);
    }
}

As you can see, an entity session has tables, just like a provider.  Yet, those tables are not directly updatable.  Instead you can assign entity instances submit actions. These are the actions that take place later when you call SubmitChanges.  There are a bunch of extension methods defined to add the LINQ to SQL like InsertOnSubmit() methods to the interface.  These simply call the SetSubmitAction() method for you.

Also note that a session is not a provider. It is a service used in conjunction with a provider. You can use multiple different sessions with the same provider instance.   

There is one current implementation of an entity session in the Toolkit called (you guessed it) DbEntitySession.  You create a DbEntitySession by giving it an existing DbEntityProvider. The DbEntitySession hooks the provider in such a way that it gets first crack at all materialized objects before they are returned to you. In this way, the DbEntitySession can employ an identity cache so queries that retrieve the same entity will always return the same entity instance, and it can start automatic change tracking on all entities returned.

You are also not locked into the session's behavior.  At any time you can interact with the underlying provider instead for retrieving entities without passing through the identity cache or being changed tracked.  You can even get to the provider's table directly off a session table.

Provider Factory

Now with so many providers and one single way to write queries you'd think it would be easy to switch between them.  In reality it is not. You have to pick the provider you want, reference its library (IQToolkit.Data.XXX), reference its corresponding ADO library (System.Data.XXX), create the ADO connection, the mapping object and construct the provider.

var connection = new SqlConnection("...");
var mapping = new AttributeMapping(typeof(Northwind));
var provider = new SqlProvider(connection, mapping, QueryPolicy.Default, null);
var db = new Northwind(provider);

You can hide this all inside your database context class (or whatever you want to call yours), so you only have to write it once, but then your context class is tied to a specific provider.  Instead, you could wrap this code up into a factory method of your own devising, but then calls to the factory would be spread throughout your codebase.  There no good way to defer all this work to some configuration setting. Until now.

Introducing the new factory methods built into DbEntityProvider.

public static DbEntityProvider FromApplicationSettings();
public static DbEntityProvider From(string filename, string mappingId); public static DbEntityProvider From(string provider, string connectionString, string mappingId);

These methods allow you to get up and running with only knowing a few bits of information.  You don't have to hard link you application to any particular provider.

The FromApplicationSettings method creates you a new instance of a provider from information found in the config file.  It looks for the "Provider", "Connection" and "Mapping" properties in the configuration and feeds them to the other factories.  It is also possible to look this information up in web settings, but I have not formalized that one yet.

The provider argument is a string that refers to the name of an assembly that contains the query provider.  These are generally of the form IQToolkit.Data.XXX.  If that assembly is not loaded, it will be loaded dynamically. This assembly can be in the assembly cache or in the same directory as your app (or other places that the runtime might look.)  From this assembly it will look for a type in the same namespace (as the name of the assembly) that derives from DbEntityProvider.

The connectionString and filename arguments are really the same thing. You can specify either the name of a database file or a full ADO connection string. If a file is specified, a correct connection string is obtained by calling the static GetConnectionString(string) method on the provider. A provider may be inferred from the file extension of a database file if none is specified.

The mappingId can either refer to the name of a context class (like Northwind) that has mapping attributes on it or the name of an xml file. 

So now you can write code like this to get your provider.

var db = new Northwind(DbEntityProvider.From(somedbfile, somemapfile));

Or better yet, you can use the FromApplicationSettings() method in the constructor of your context and still be configurable at runtime.

Madness

Of course, it wouldn't be a new toolkit release without some additional crazy changes.  One significant change is namespaces again. This time its not going to conflict with your code too much.  Most of the classes that where in IQToolkit.Data have been demoted into the namespace IQToolkit.Data.Common.  This includes most all classes that are implementation detail or base classes.  Mapping attributes and the like are now in IQToolkit.Data.Mapping.  This makes the namespace clean and obvious when you start looking for things via intellisense. 

DbEntityProvider and DbEntitySession are the only classes sitting in IQToolkit.Data, as these are the ones you'll likely need to reference when writing code.  IEntityProvider and IEntitySession are in IQToolkit namespace, because they are not specific to ADO (System.Data classes).


I hope you find this version feature rich enough to either build application directly on top of it, or model your own provider or data layer by using these techniques.

Don't forget the audio tapes.

 

This is the fourteenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might request a weeks vacation, sit back, relax with a mochacino in one hand a netbook in the other, or if you've got better things to do with your time print them all out and stuff them under your pillow. Who knows, it might work better.

Complete list of posts in the Building an IQueryable Provider series

Okay, enough with all the post-is-late guilt! It's done now, so breathe a sigh of relief and get on with the reading.

What's inside:

More Mapping - Finally a real mapping system, with attributes and XML.

More Providers - MS Access and MS SQL Server Compact Edition

More POCO - Constructors, Enum and Interfaces.

More More More

The full source code can be found at:

http://www.codeplex.com/IQToolkit

More Mapping

  • Attribute Mapping - put attributes on the properties in the class that declares your tables.

    This differs from LINQ to SQL mapping attributes which are placed on the entities themselves and is more like the proposed LINQ to Entity mapping attributes. However, I've not actually gone out of my way to make them the same. The advantages to this approach are 1) keeping the mapping separate from the entity objects (more POCO), and 2) being able to supply different mapping for the same entity type based on the table the entities are accessed from. 

    Mapping attributes look like this:
  • [Table]
    [Column(Member = "CustomerId", IsPrimaryKey = true)]
    [Column(Member = "ContactName")]
    [Column(Member = "CompanyName")]
    [Column(Member = "Phone")]
    [Column(Member = "City", DbType="NVARCHAR(20)")]
    [Column(Member = "Country")]
    [Association(Member = "Orders", KeyMembers = "CustomerID", RelatedEntityID = "Orders", RelatedKeyMembers = "CustomerID")]
    public IUpdatableTable<Customer> Customers

You specify the Table, Column and Association attributes as necessary.  The 'Member' refers to the member in the entity type. If this is the same name as the database's column name you don't need to repeat it by specifying 'Name' too. 

You can specify nested mapping information by using a dot in the Member name. This allows you to have what some call value types, but to keep from clashing with .Net terminology I don't. For example, if you've defined an Address type that you want to use in a nested relationship (actually embedded in the same table row) you can do that like this:

[Table]
[Column(Member = "EmployeeID", IsPrimaryKey = true)]
[Column(Member = "LastName")]
[Column(Member = "FirstName")]
[Column(Member = "Title")]
[Column(Member = "Address.Street", Name = "Address")]
[Column(Member = "Address.City")]
[Column(Member = "Address.Region")]
[Column(Member = "Address.PostalCode")]
public IUpdatable<Employee> Employees
  • Xml Mapping -- this is same as attribute based mapping but data is read from an XML file. 

    Xml mapping looks like this:
  • <?xml version="1.0" encoding="utf-8" ?>
    <map>
      <Entity Id="Customers">
        <Table Name="Customers" />
        <Column Member = "CustomerId" IsPrimaryKey = "true" />
        <Column Member = "ContactName" />
        <Column Member = "CompanyName" />
        <Column Member = "Phone" />
        <Column Member = "City" DbType="NVARCHAR(20)" />
        <Column Member = "Country" />
        <Association Member = "Orders" KeyMembers = "CustomerID" RelatedEntityID = "Orders" RelatedKeyMembers = "CustomerID" />
      </Entity>
      <Entity Id="Orders">
        <Column Member = "OrderID" IsPrimaryKey = "true" IsGenerated = "true"/>
        <Column Member = "CustomerID" />
        <Column Member = "OrderDate" />
        <Association Member = "Customer" KeyMembers = "CustomerID" RelatedEntityID = "Customers" RelatedKeyMembers = "CustomerID" />
        <Association Member = "Details" KeyMembers = "OrderID" RelatedEntityID = "OrderDetails" RelatedKeyMembers = "OrderID" />
      </Entity>
      <Entity Id="OrderDetails">
        <Table Name="Order Details"/>
        <Column Member = "OrderID" IsPrimaryKey = "true" />
        <Column Member = "ProductID" IsPrimaryKey = "true" />
        <Association Member = "Product" KeyMembers = "ProductID" RelatedEntityID = "Products" RelatedKeyMembers = "ProductID" />
      </Entity>            
    </map>

You use it like this:

XmlMapping mapping = XmlMapping.FromXml(TSqlLanguage.Default, File.ReadAllText(@"northwind.xml"));
SqlQueryProvider provider = new SqlQueryProvider(connection, mapping);
  • Multi-table mapping -- Map multiple tables into a single entity.  If you've got entity data spread out over multiple tables with a 1:1 association between them you can now specify the additional tables in mapping using the ExtensionTable attribute or equivalent XML element. 

    Here's what a multi-table mapping looks like:
  • [Table(Name = "TestTable1", Alias = "TT1")]
    [ExtensionTable(Name = "TestTable2", Alias = "TT2", KeyColumns = "ID", RelatedAlias = "TT1", RelatedKeyColumns = "ID")]
    [ExtensionTable(Name = "TestTable3", Alias = "TT3", KeyColumns = "ID", RelatedAlias = "TT1", RelatedKeyColumns = "ID")]
    [Column(Member = "ID", Alias = "TT1", IsPrimaryKey = true, IsGenerated = true)]
    [Column(Member = "Value1", Alias = "TT1")]
    [Column(Member = "Value2", Alias = "TT2")]
    [Column(Member = "Value3", Alias = "TT3")]
    public IUpdatable<MultiTableEntity> MultiTableEntities

Extension tables are specified similar to how Associations are specified, except you are never referring to members, only column names.  You use the 'Alias' value to connect column & association mappings with columns from particular tables.  All queries for this multi-table entity treat the 'Table' as the primary table queried, all other tables are queried with left-outer joins.  All keys for associations must be from the same alias.

Can I mix nested mapping with multi-table mapping?  I have not tried it, but in theory it should work. It should not matter which table your nested entity gets it's data from, so in effect you can have a composition relationship between one table and another as long as it is 1:1.  

What about many-to-many?  Not yet. Making the system query a many-to-many relationship is relatively easy.  I haven't yet figured out the right semantics for inserts & updates. Right now, all insert, updates and deletes are explicit via calls to the IUpdatable with real-live entities. Yet how do you make an explicit update to the link table that you don't have an entity directly mapped to?  I need to ponder this some more.  Possibly if one side of the relationship is a composition as opposed to an association, then it would be implied when that side is updated.  Yet what if you chose not to load the relationship, how do you tell the system to not delete all previous relationships?

More Providers

  • MS Access -- This new query provider works with both Access 2000 - 2003 and Access 2007 data files. I don't know what the true differences are between the Jet and the Ace engines; the query language appears to be identical (as per my limited tests so far), yet the filename extension changed in Access 2007 to 'accdb' instead of 'mdb' and the northwind sample database plumped up an extra 66% in disk size without any additional data.

In order to make this work I've added an AccessLanguage object that is necessary to get the correct semantics for MS Access queries and an AccessFormatter object that handles generating the correct command text. In order to salvage as much as I could from the TSqlFormatter, I moved most of this code to a common SqlFormatter base class, and now the TSQL and Access formatters only supply the deviations from the standard syntax.  (Of course, 'standard' is currently whatever I deem it to be so don't go getting some actual online specification and prove me wrong.) Access only allows one command at a time, so that added an extra wrinkle, but in the end there is now support in the system for providers that can only do one command at a time. This means there are multiple round-trips to the engine for things like inserting a record and getting back the computed keys. Luckily, the access engine is in-proc so this is not really a burden. A new property on QueryLanguage, 'AllowMultipleCommands' determines how the execution plan is generated and whether multiple commands can be lumped together into a single ADO command.

The good news is that the access engine passes almost all the Northwind tests; some are not possible (mostly ones testing translation of framework methods that have no apparent equivalent in the access expression engine).  There were a lot of hairy strange & subtle differences in syntax between Access and TSQL, but most were handled by having different format rules, some required new expression visitors to change the query tree, like no explicit cross joins!  This caused me to write a visitor to attempt to get rid of cross joins (often injected by my visitor that tries to get rid of cross-apply joins) which is now generally useful to everyone, and if that didn't do it, another visitor that would attempt to isolate out the cross joins from any other joins and push them into sub-queries where Access lets me use the old-style comma-list, which is truly a cross join, though it just can't be mixed with other kinds of joins in the same from clause.

  • SQL Compact -- Yes, even more SQL Server.  Though to be truthful, SQL Server Compact Edition (aka SQL CE, aka SQL Compact, aka Skweelzy) is not really SQL Server, it is some other entirely different product that handles a subset of TSQL syntax, and is not a server at all since it runs in-proc just like MS Access.

  • What about MySQL or Oracle?
    One day. The fact is that MS SQL and MS Access are easy for me to get to, they are already on my box. Getting something else up and running would take actual effort, and the MS secret database police might come get me. Meanwhile, if someone out there wants to put together a provider implementation I'll add it into the drop.

  • Where did the SqlQueryProvider go?
    I moved it. With the addition of the new providers it became apparent that I'd have to start factoring out all this ADO provider specific nonsense, otherwise all uses of the toolkit would have direct dependencies to way more than necessary. So I made separate projects, each building its own library. I may end up separating all the core 'data' stuff out into its own project too.

    The solution builds these libraries now:

- IQToolkit.dll
- IQToolkit.Data.Access.dll
- IQToolkit.Data.SqlClient.dll
- IQToolkit.Data.SqlServerCe.dll

More POCO

  • Constructors -- Use entities that don't have default constructors.  It is now possible to have entities that require invocation of a constructor with parameters.  The binding process will figure out how to call your constructor and the client side code will call it for you as long as the constructor parameter names match property names. You can even have fully read-only entities if all data member are accounted for in the constructor.

  • Enums -- They actually sort of work now. You can have a member in your entity typed as some enum and you get automatic mapping between that enum and a numeric type in the database.

  • Interfaces and abstract base classes -- You can now declare you IQueryable's as IQueryable<SomeInterface> or IQueryable<SomeAbstractClass> and have the provider create instances of a compatible type under the covers, automatically translating all your references to interface or abstract class members to the appropriate mapped members. You can have mutiple entities share a common interface or base class and get different mapping for each. You can write code using generics and constrain generic parameters based on an interface and write queries that will get correct translation at runtime.  (Note, variation of mapping likely won't work with compiled queries, since the translation is fixed on the first execution.)

Less Policy

There's not a whole lot of policy being used right now and the policy objects dependence on the mapping object was no where near as deep as the mapping object's dependence on the language.  So policy is now independent of mapping, which means you can construct providers without specifying policy and/or reusing mapping with different policies.  Now if I could only make it simpler to specify/construct mapping without needing to know the language.  Back to the drawing board.

More Insanity

I apologize for the churn. The namespace changed so now all heck is going to break loose. Gone is the simple 'IQ' namespace and in its place is the 'IQToolkit' namespace.  I really did like the 'IQ' name, it was short, classy and made you feel intelligent just by looking at it.  Yet, it was hard to guess at if you did not already know what it was.  I chose to change the namespace name to match the product name and the DLL name. You add reference to the IQToolkit.dll and you import/use the IQToolkit namespace. No fuss, no muss.  Except for all those files you'll have to edit now. But hey, this is pre-pre-pre-pre beta stuff. Some people may think they are something special by snarkily keeping all their products in beta. They've got a lot to learn.

I hope this toolkit is becoming useful to many. I realize there have been a variety of requests for new things in the toolkit that I just have not gotten time to put in yet.  So you can expect plenty more in the future.

So enough with reading.  It's time to code!

This is the thirteenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you probably have a life beyond the keyboard, but if you don't then follow the link below to find oodles more to help fill your meaningless existence.

Complete list of posts in the Building an IQueryable Provider series

It's been precisely the correct amount of time that it took for me to complete the additional goodness that is jam packed into this drop, less actual work, dinners out, dinners in, any interesting film and televisions programs, housework, trips out to the store, family game night, time reading fiction, napping on the couch and other assorted unavoidable activities.

The full source code can be found at:

http://www.codeplex.com/IQToolkit

I'll try to cover as much as I can in this post, however you'll like find other gems by scouring the source itself.

What's inside:

Updates - Insert, Update & Delete operations.

Batch processing - true SQL Server batch processing.

Server language type systems - correct parameter types.

Mapping Changes - use the same class with multiple tables, etc.

Insert, Update and Delete

It's about time that this toolkit actually got usable right out of the box.  My original intention with the series was to show how to build an IQueryable provider and that turned more and more into a fully working query engine that you could actually use to get real work done. Yet, how many real world applications only ever need to pull data out of a database and never push it back?  Not many.

So I knew I'd eventually want to add updates, because I knew that you'd eventually need to do it too. Yet, every time I started thinking about updates I always fell into the trap of thinking about full blown ORM's with object tracking, et al, and I did not really want to go there, at least not yet. As a toolkit I think its just fine to define the primitives that a more advanced system might be built out of. And there is nothing wrong with those primitives being generally useful on their own. So you should be able to use the toolkit as-is and not only get a pretty good query engine but also something that at least works as a rudimentary data access layer.

Common Primitives

So then I set about thinking about just the primitives for updating data. They should have semantics similar to the underlying SQL equivalent commands. That means they should not defer work until some time later, but execute immediately. There should be at least the familiar commands, Insert, Update and Delete; but also Upsert (both Insert & Update combined) since its so often the right thing for many situations.

Also, like other LINQ operations, update commands should be a pattern, and be available for any kind of provider. So I set out thinking about what the pattern would look like and how it might be specified.  This is what I came up with.

public interface IUpdatable : IQueryable
{
}

public interface IUpdatable<T> : IUpdatable, IQueryable<T>
{
}

public static class Updatable
{
    public static S Insert<T, S>(this IUpdatable<T> collection, T instance, Expression<Func<T, S>> resultSelector)
    public static S Update<T, S>(this IUpdatable<T> collection, T instance, Expression<Func<T, bool>> updateCheck, Expression<Func<T, S>> resultSelector)
    public static S InsertOrUpdate<T, S>(this IUpdatable<T> collection, T instance, Expression<Func<T, bool>> updateCheck, Expression<Func<T, S>> resultSelector)
    public static int Delete<T>(this IUpdatable<T> collection, T instance, Expression<Func<T, bool>> deleteCheck)
    public static int Delete<T>(this IUpdatable<T> collection, Expression<Func<T, bool>> predicate)
    public static IEnumerable<S> Batch<T,S>(this IUpdatable collection, IEnumerable<T> instances, Expression<Func<T, S>> fnOperation, int batchSize, bool stream)
... }

This pattern works just like the LINQ Enumerable and Queryable patterns.  I've declared an interface 'IUpdatable' that extends IQueryable, so anything that is updatable is also queryable, and then an Updatable class with a bunch of new extension methods that encapsulate the pattern.  (I realize the IUpdatable name may be in conflict with some other library, but until I think of something better this is what it is.)

The Insert method inserts an object instance into a collection. It's not an ordinary insert, like with List<T>. The collection is considered to be remote and inserting into it copies the data from your instance. It has an optional result-selector argument that can be a function you supply to construct a result out of the object after it has been inserted. This, of course, is intended to occur on the server and can be used to read back auto-generated state and computed expressions.

IUpdatable<Customer> customers = ...;
Customer c = ...;
customers.Insert(c);

IUpdatable<Order> orders = ...;
Order o = ...;
var id = orders.Insert(o, d => d.OrderID);

The Update method updates a corresponding object already in the collection with the values in the instance supplied. This is a complete overwrite, not a dynamic update like LINQ to SQL would have generated. I have not yet defined a piecemeal update operation, but I still can.  We'll see how it goes.  In addition to an update selector (like the one for insert) you can also specify an update-check predicate. This is an expression evaluated against the server's state and can be used to implement optimistic concurrency by basically checking to see if the server's state is still the same as you remembered it. An ORM layer built on top of this primitive might choose to generate this expression automatically, based on mapping information, but here you must specify it manually if you want to use it.

IUpdatable<Customer> customers = ...;
Customer c = ...;
var computedValue = customers.Update(c, d => d.City == originalCity, d => d.ComputedColumn);

The InsertOrUpdate is the 'Upsert' operation.  It will basically insert an object into the collection if a corresponding one does not exist, or update the one that does with the new values. You specify it just like you'd specify an update, instead you call InsertOrUpdate.

There are two flavors of Delete. The first one lets you delete the object in the collection corresponding to the instance. You can optionally specify a delete-check, which is similar to the update-check, a predicate function evaluated against the server's state. The delete will only occur if the check passes. The second flavor just lets you specify a predicate. It's basically a delete-all-where method and will delete all objects from the collection that match the predicate. So far, its the only SQL-like 'set-based' operation I've defined.

IUpdatable<Customer> customers = ...;
Customer c = ...;
customers.Delete(c, d => d.City == originalCity);

IUpdatable<Customer> customers = ...;
Customer c = ...;
customers.Delete(c => c.CustomerID == "ALFKI");

The last operation is Batch.  It will allow you to specify an operation to apply to a whole set of instances. The operation can be one of the other commands like Insert or Update.  You can use this method Insert, Update or Delete a whole bunch of objects all at the same time. If possible, the provider will use optimized batching techniques to give you extra performance.

IUpdatable<Customer> customers = ...;
Customer[] custs = new Customer[] { ... };

customers.Batch(custs, c => customers.Insert(c));

If you've got many objects to update and you want to have instance specific update-checks done, you can sneak the extra information into the batch process by combining the data together into a single collection and then piecing them apart in the operation.

IUpdatable<Customer> customers = ...;
var oldAndNew = new [] { new { Old = oldCustomer, New = newCustomer }, ...};
customers.Batch(oldAndNew, (u, x) => u.Update(x.New, d => d.City == x.Old.City));
 

Updates and DbQueryProvider

In order to make use of this new capability I'm going to need a new object to declare the IUpdatable interface.  The Query<T> class only implemented IQueryable<T>, and that was fine as long as I only ever want to query.  Now I also want to be able to update, so I need a new class to represent the root of my query that I can also update. These things in databases are called tables, so that's what I'll stick with. 

public interface IQueryableTable : IQueryable
{
    string TableID { get; }
}

public interface IQueryableTable<T> : IQueryable<T>, IQueryableTable
{
}

public class QueryableTable<T> : Query<T>, IQueryableTable<T>
{
    string id;

    public QueryableTable(IQueryProvider provider, string id)
        : base(provider)
    {
        this.id = id;
    }

    public QueryableTable(IQueryProvider provider)
        : this(provider, null)
    {
    }

    public string TableID
    {
        get { return this.id; }
    }
}

public interface IUpdatableTable : IQueryableTable, IUpdatable
{
}

public interface IUpdatableTable<T> : IQueryableTable<T>, IUpdatable<T>, IUpdatableTable
{
}

public class UpdatableTable<T> : QueryableTable<T>, IUpdatableTable<T>
{
    public UpdatableTable(IQueryProvider provider, string id)
        : base(provider, id)
    {
    }

    public UpdatableTable(IQueryProvider provider)
        : this(provider, null)
    {
    }
}

You'll note that not only did I define a UpdatableTable<T> class, which is specifically what I wanted, I also went ahead and made a QueryableTable<T>, and extra interfaces to correspond to them.  This is intentional.  Eventually, I may want to add more methods specific to tables here and I'll need a place to put them.  Right now I've only added a property 'TableID'.  You can ignore it for now, though it will get more interesting when I discuss the mapping changes.

Take a look at the Northwind class in the test source code and you'll see how I made use of my new table class.

The Plumbing

Of course, update commands work in the query provider just like queries do.  First there are a bunch of new DbExpression nodes to represent them.

public abstract class CommandExpression : DbExpression
{
}

public abstract class
CommandWithResultExpression : CommandExpression { public abstract Expression Result { get; } }
public class InsertExpression : CommandWithResultExpression { public TableExpression Table { get; } public ReadOnlyCollection<ColumnAssignment> Assignments { get; } public override Expression Result { get; } } public class ColumnAssignment { public ColumnExpression Column { get; } public Expression Expression { get; } } public class UpdateExpression : CommandWithResultExpression { public TableExpression Table { get; } public Expression Where { get; } public ReadOnlyCollection<ColumnAssignment> Assignments { get; } public override Expression Result { get; } } public class UpsertExpression : CommandWithResultExpression { public Expression Check { get; } public InsertExpression Insert { get; } public UpdateExpression Update { get; } public override Expression Result { get; } } public class DeleteExpression : CommandExpression { public TableExpression Table { get; } public Expression Where { get; } } public class BatchExpression : CommandExpression { public Expression Input { get; } public LambdaExpression Operation { get; } public Expression BatchSize { get; } public Expression Stream { get; } }

Then there's the standard visit method in DbExpressionVisitor, DbExpressionWriter, etc.  Binding them happens in the QueryBinder just like all other query operations, but the work of deciding what nodes to generate gets doled out to the QueryMapping object.  Luckily, the base QueryMapping class has a default implementation that builds the correct DbExpression node.  If you want to map a single object into multiple tables or some other crazy scheme you'll probably have to have a more advanced mapping implementation. :-)

These nodes get plumbed through the system until they are encountered by the ExecutionBuilder and are formatted using the QueryLanguage rules. The TSQL formatter converts the nodes into corresponding TSQL text.  Depending on the contents of the command expression, the generated SQL may have one or more actual TSQL operations.

Batch Processing

ADO.Net has this nice feature built into its SqlClient API; the ability to get high-performance batch processing. Yet, the only way to get at it is through use of DataSet's or DataReaders. As far as I'm concerned this is rather low level and a bit complicated to use if you are starting out with domain objects and not DataSets. Your data access layer should do this for you. Yet, in order for it to do it, the abstraction for batch processing has to exist, which is why I added it to the updatable pattern.  After that it was a cinch. :-)  Not really. 

What I needed fundamentally was something that would execute the same database command over and over again with different sets of parameters. This is basically what TSQL batching does as it is sent over the wire. So I needed to add this abstraction to DbQueryProvider. Yet, since only SqlClient supports this actually behavior I'd need a fall back plan. So DbQueryProvider implements a method to do batch processing, but it does not actually do it optimally. 

public virtual IEnumerable<int> ExecuteBatch(QueryCommand query, IEnumerable<object[]> paramSets, int batchSize, bool stream)
{
    var result = this.ExecuteBatch(query, paramSets);
    if (!stream)
    {
        return result.ToList();
    }
    else
    {
        return new EnumerateOnce<int>(result);
    }
}

private IEnumerable<int> ExecuteBatch(QueryCommand query, IEnumerable<object[]> paramSets)
{
    this.LogCommand(query, null);
    DbCommand cmd = this.GetCommand(query, null);
    foreach (var paramValues in paramSets)
    {
        this.LogMessage("");
        this.LogParameters(query, paramValues);
        this.SetParameterValues(cmd, paramValues);
        int result = cmd.ExecuteNonQuery();
        yield return result;
    }
}

public virtual IEnumerable<T> ExecuteBatch<T>(QueryCommand query, IEnumerable<object[]> paramSets, Func<DbDataReader, T> fnProjector, int batchSize, bool stream)
{
    var result = this.ExecuteBatch(query, paramSets, fnProjector);
    if (!stream)
    {
        return result.ToList();
    }
    else
    {
        return new EnumerateOnce<T>(result);
    }
}

private IEnumerable<T> ExecuteBatch<T>(QueryCommand query, IEnumerable<object[]> paramSets, Func<DbDataReader, T> fnProjector)
{
    this.LogCommand(query, null);
    DbCommand cmd = this.GetCommand(query, null);
    cmd.Prepare();
    foreach (var paramValues in paramSets)
    {
        this.LogMessage("");
        this.LogParameters(query, paramValues);
        this.SetParameterValues(cmd, paramValues);
        var reader = cmd.ExecuteReader();
        if (reader.HasRows)
        {
            reader.Read();
            yield return fnProjector(reader);
        }
        else
        {
            yield return default(T);
        }
        reader.Close();
    }
}

What I have here are four methods, two of which are just private implementations, but two others that are virtual so can be overridden. Batching can work via streaming or not. If streamed the results of each execution (or batch) is yielded out. This works great if the number of individual items is large, but takes a lot of discipline to remember to actually inspect the results or nothing gets executed at all!  By requesting no streaming (stream == false) the execution occurs immediately and the results are packaged into a list that you can conveniently ignore if you so choose. That's why the implementation is separated out, so that the enumerable can be captured and converted to a list, enabling either behavior.  The two types of ExecuteBatch differ in whether a result is computed via information coming back from the server or not.

Now that these are defined, I can implement a new kind of provider, a SqlClient specific version that makes automatic use of optimized batching.

public class SqlQueryProvider : DbQueryProvider
{
    ...
public override IEnumerable<int> ExecuteBatch(QueryCommand query, IEnumerable<object[]> paramSets, int batchSize, bool stream) { var result = this.ExecuteBatch(query, paramSets, batchSize); if (!stream) { return result.ToList(); } else { return new EnumerateOnce<int>(result); } } private IEnumerable<int> ExecuteBatch(QueryCommand query, IEnumerable<object[]> paramSets, int batchSize) { SqlCommand cmd = (SqlCommand)this.GetCommand(query, null); DataTable dataTable = new DataTable(); for (int i = 0, n = query.Parameters.Count; i < n; i++) { var qp = query.Parameters[i]; cmd.Parameters[i].SourceColumn = qp.Name; dataTable.Columns.Add(qp.Name, qp.Type); } SqlDataAdapter dataAdapter = new SqlDataAdapter(); dataAdapter.InsertCommand = cmd; dataAdapter.InsertCommand.UpdatedRowSource = UpdateRowSource.None; dataAdapter.UpdateBatchSize = batchSize; this.LogMessage("-- Start SQL Batching --"); this.LogCommand(query, null); IEnumerator<object[]> en = paramSets.GetEnumerator(); using (en) { bool hasNext = true; while (hasNext) { int count = 0; for (; count < dataAdapter.UpdateBatchSize && (hasNext = en.MoveNext()); count++) { var paramValues = en.Current; dataTable.Rows.Add(paramValues); this.LogMessage(""); this.LogParameters(query, paramValues); } if (count > 0) { int n = dataAdapter.Update(dataTable); for (int i = 0; i < count; i++) { yield return (i < n) ? 1 : 0; } dataTable.Rows.Clear(); } } } this.LogMessage(string.Format("-- End SQL Batching --")); } }

Note that I only have an implementation for the variation of ExecuteBatch that computes no user specified result. This is due to there being no back-channel available when using SqlClient batching.

The implementation uses DataTable's and DataAdapters to make this work.  A DataTable is created filled with the parameters necessary for executing the command.  The DataAdapter is used to invoke the batch using the Update method.  Of course, this doesn't actually have to be an update command. I can also use this to batch inserts and deletes too, or really any TSQL that I want to execute as long as I don't expected to get a bunch of data back.

Server Language Types

One thing that has always bugged me about SQL Server was the need to get the command parameters right. If I declare the parameter to be the wrong text flavor I can cause serious performance issues for the query. So just setting the parameter values and having the SqlCommand object guess at the right SqlType encoding is really not a good plan. Fortunately, it is often possible to figure out the correct parameter types to use if the information is available via the mapping. Parameters are often never just sent to the server for no reason. If I use a parameter in a query I'm usually comparing it against a column.  In most cases I can simply infer that the parameter should have the same server type as the column.

So I've gone ahead and defined a new server type system abstraction and threaded it into some of the DbExpressions and make use of it in some of the visitors. 

public abstract class QueryType
{
    public abstract DbType DbType { get; }
    public abstract bool NotNull { get; }
    public abstract int Length { get; }
    public abstract short Precision { get; }
    public abstract short Scale { get; }
}

public abstract class QueryTypeSystem 
{
    public abstract QueryType Parse(string typeDeclaration);
    public abstract QueryType GetColumnType(Type type);        
}

A QueryType encodes the typical database scalar type. It has a few properties that you can use to inspect common attributes of a server type, but most code won't really care to know the details, it will just pass the information along until it ends up where I need it.  A QueryTypeSystem is basically a factory for producing QueryType's.  The Parse method will construct a language-specific QueryType out of some text encoding.  This is typically the server language syntax for declaring a column of that particular type, like 'VARCHAR(10)'. 

A QueryTypeSystem is specific to a language, so QueryLanguage is where you go to get one.

public abstract class QueryLanguage
{
    public abstract QueryTypeSystem TypeSystem { get; }
}

One place I definitely know where to encode server types is in the ColumnExpression.  If a ColumnExpression knows what its server type is, then when I get to the point of comparing parameters to columns I know which server type is in play.

public class ColumnExpression : DbExpression, IEquatable<ColumnExpression>
{
    public QueryType QueryType { get; }
}

I've also stuck it into NamedValueExpression, because that's the type I'm using for parameters.

public class NamedValueExpression : DbExpression
{
    public QueryType QueryType { get; }
}

And I've basically modified Parameterizer, so that if a column and parameter (named-value expression) are ever paired together in any binary expression, I'll infer the parameter to have the same server type as the column.

public class Parameterizer : DbExpressionVisitor
{
    protected override Expression VisitBinary(BinaryExpression b)
    {
        Expression left = this.Visit(b.Left);
        Expression right = this.Visit(b.Right);
        if (left.NodeType == (ExpressionType)DbExpressionType.NamedValue
         && right.NodeType == (ExpressionType)DbExpressionType.Column)
        {
            NamedValueExpression nv = (NamedValueExpression)left;
            ColumnExpression c = (ColumnExpression)right;
            left = new NamedValueExpression(nv.Name, c.QueryType, nv.Value);
        }
        else if (b.Right.NodeType == (ExpressionType)DbExpressionType.NamedValue
         && b.Left.NodeType == (ExpressionType)DbExpressionType.Column)
        {
            NamedValueExpression nv = (NamedValueExpression)right;
            ColumnExpression c = (ColumnExpression)left;
            right = new NamedValueExpression(nv.Name, c.QueryType, nv.Value);
        }
        return this.UpdateBinary(b, left, right, b.Conversion, b.IsLiftedToNull, b.Method);
    }

    protected override ColumnAssignment VisitColumnAssignment(ColumnAssignment ca)
    {
        ca = base.VisitColumnAssignment(ca);
        Expression expression = ca.Expression;
        NamedValueExpression nv = expression as NamedValueExpression;
        if (nv != null)
        {
            expression = new NamedValueExpression(nv.Name, ca.Column.QueryType, nv.Value);
        }
        return this.UpdateColumnAssignment(ca, ca.Column, expression);
    }
}

Of course, the same goes for ColumnAssignment used by Insert and Update commands. You'll notice that I'm not having these types flow throughout the expression tree like normal types.  I could probably get more edge cases correct if I did, but for now this handles most of the cases. 

The GetCommand method in  DbQueryProvider will now make use of this info when constructing parameters. The SqlQueryProvider expects to see a new TSqlType that's made available by a new TSqlTypeSystem found on the TSqlLanguage object.  :-)

public class SqlQueryProvider : DbQueryProvider
{
    protected override DbCommand GetCommand(QueryCommand query, object[] paramValues)
    {
        // create command object (and fill in parameters)
        SqlCommand cmd = new SqlCommand(query.CommandText, (SqlConnection)this.Connection);
        for (int i = 0, n = query.Parameters.Count; i < n; i++)
        {
            QueryParameter qp = query.Parameters[i];
            TSqlType sqlType = (TSqlType)qp.QueryType;
            if (sqlType == null)
                sqlType = (TSqlType)this.Language.TypeSystem.GetColumnType(qp.Type);
            var p = cmd.Parameters.Add("@" + qp.Name, sqlType.SqlDbType, sqlType.Length);
            if (sqlType.Precision != 0)
                p.Precision = (byte)sqlType.Precision;
            if (sqlType.Scale != 0)
                p.Scale = (byte)sqlType.Scale;
            if (paramValues != null)
            {
                p.Value = paramValues[i] ?? DBNull.Value;
            }
        }
        return cmd;
    }
}

Mapping Changes

I've made some changes to the mapping system (or QueryMapping to be particular.)  I came across a variety of odd behavior while developing update logic that basically boiled down to the ImplictMapping object not being able to tell the difference between a type that was intended to correspond to a database table and others that appeared there just for the sake of representation, like LINQ anonymous types.  Some other mapping implementations might be able to tell the difference, but the simplest one couldn't so I needed to find another solution.

Obviously, everything that is an entity in a query comes from somewhere, and that's either from one of the roots of the query (a table) or via a relationship property. It was a mistake to think otherwise (or not think about it at all.) What I needed was a more explicit representation in the expression tree of what was an entity and what was not.  I figured I could either add annotations to the tree in every node, or find some nominal solution that would do the trick. 

I chose to make a new expression node, EntityExpression, which I use as a wrapper around any expression that would be normally constructing an entitiy.  This node is placed into the system when the QueryMapping first creates the sub-express for constructing an entity or relationship, at the time I actually know that I'm dealing with an entity and in particular which entity it is.

public class EntityExpression : DbExpression
{
    public MappingEntity Entity { get ; }
    public Expression Expression { get; }
}

I've also introduced a new abstraction called MappingEntity.  This how I let the QueryMapping object place a bread-crumb into the expression tree so it can be reminded which exact entity was being referred to.

public class MappingEntity
{
    public string TableID { get; }
    public Type Type { get; }
}

It' really just as simple little class that minimally remembers the correspondence between the runtime type and the table its being mapped to.  If you've been paying attention you'll realize that this 'TableID' is the same property that was added to the IQueryableTable interface.  That's how the query engine gets the table-id in the first place, right from the start of the query.  Of course, the IQueryableTable<T> interface also knows the runtime type, that's the 'T' part.  So your table's have all the information needed to make a MappingEntity.  Except that job is deferred to the QueryMapping object so it can do whatever bookkeeping it wants.

public abstract class QueryMapping
{
public virtual MappingEntity GetEntity(Type type); public virtual MappingEntity GetEntity(Type type, string tableID); }

You'll also notice that most of the other methods on QueryMapping are now modified to take a MappingEntity as an argument.

public abstract class QueryMapping
{
    public virtual bool IsMapped(MappingEntity entity, MemberInfo member)
    public virtual bool IsColumn(MappingEntity entity, MemberInfo member)
    public virtual bool IsIdentity(MappingEntity entity, MemberInfo member)
    public virtual bool IsComputed(MappingEntity entity, MemberInfo member)
    public virtual bool IsGenerated(MappingEntity entity, MemberInfo member)
    public virtual bool IsRelationship(MappingEntity entity, MemberInfo member)
    public virtual MappingEntity GetRelatedEntity(MappingEntity entity, MemberInfo member)
    public virtual bool IsAssociationRelationship(MappingEntity entity, MemberInfo member)
... }

Now, what you're probably saying is "Gee, Matt, that's looks a bit ominous. Why aren't these all just methods on MappingEntity now?"  And you'd be right. They probably should be.  Yet, it makes it a lot more difficult to subclass the mapping object and merely override some of the behavior.  Not a big deal for complete mapping sub-systems that someday might exists, but painful for simple uses such as overriding the ImplicitMapping with a few additional rules.  So I'm leaving it as-is for now until I can think about it more.  Any thoughts from the peanut-gallery?

Also, its worth nothing, that given this new arrangement, with explicit entity information in the query tree and connecting each entity info back to the table it originated from, it is now possible to support mapping systems that allow individual runtime types to be mapped to more than one table.  So you can have your cake and eat it too! 

That's All Folks!

At least for today.  The future may hold more goodies.  Any suggestions are welcome, either as ideas or source-code contributions.

Remember, you can get the current sources at CodePlex:  http://www.codeplex.com/IQToolkit

The LINQ IQueryable Toolkit is now a CodePlex project. 

http://www.codeplex.com/IQToolkit 

Going forward this will the be official site to find the latest greatest source bits.  I'll continue to post here about the toolkit, how to use it and to show off new features in this blog, but you'll have to follow a link to download.

You can also start discussions over how utterly fabulous the source code is and how you can't wait for the next drop. 

It's up to you.

This is the twelfth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you probably were born yesterday. How could you possibly make sense of this post without any context at all?  At least make an attempt. Sometimes I don't know why I bother.

Complete list of posts in the Building an IQueryable Provider series 

<Insert standard disclaimer for why there have been no updates on this for 'like' forever>

It's been so long since I last posted we must be up to Web 5.0 by now. I suspect there are not any actual web developers left. Its all just twitter application programming and cybernetic mind melds now. Does anybody still write code? In a text based language? Without an electric shunt wired directly into your cerebral cortex?  Nobody?

Sigh. I'm probably just talking to an empty internet; everyone's moved on to Planck Space. But since I've got nothing better to do, I might as well get on with it.

IQueryable Toolkit

One of the first things you'll notice when you take a look at the source is that I changed it quite a bit.  I moved code around, changed names gratuitously, added & removed classes and broke a lot of continuity with the prior versions. One of the biggest changes is that the code is no longer just a sample.  All those internal classes are public, the project builds as a DLL, the tests are hosted separately and the namespace is no longer 'Sample.'  Why, oh why would I do this?  Simple.

I originally started the project to give developers a structured hands-on walk through of the construction of an IQueryable provider; hoping to inspire many of you to build your own. (Many of you have.) The source code was made available so you could learn from it and as a cheap way to get you to debug it for me. (Many of you haven't.) And so you could take from it what you would, copy a class here, steal a method there, etc, to make your job easier to get your own LINQ to XYZ up off the  ground quicker.

Yet, I received so many requests to re-use the entire code base wholesale, that I now realize what you really want is a toolkit not just some sample code. As a toolkit it is still sample code. The sources are still there and you can take from it what you will.  Or you can build it into a shiny DLL and then build your code on top of it.  All the pieces I've built so far are publicly re-usable. Many of the classes have been enhanced for extensibility, so you can mix & match and override to your hearts content to build the system you want out of the pieces I already have.

How it breaks down

At the top there is a namespace 'IQ'.  I thought that it was cleaver & consise. Maybe someone out there will come up with a better one, something catchy, like a cold, and I'll feel compelled to catch it, fall under-the-weather for a few weeks and emerge blissfully sedated enough to go with it. Maybe. Under the 'IQ' namespace exists all the bits and pieces that are generally useful for any IQueryable impementation.  This is where you'll find the basic ExpressionVisitor class, the generic Query and base QueryProvider class.  You'll also get these:

ExpressionComparer - compares two expression trees for equality. immensely useful.

ExpressionWriter - Get a better translation of that narly expression tree into a  C#-ish syntax that you can actually read.  Helps debugging a lot.

Grouping - It's that dead simple implementation of the LINQ IGrouping interface. 

CompoundKey - A class that helps you represent compound key values.

ScopedDictionary - 'cuz I needed it and now you can have it too.  Works sort of like a Dictionary but with nested scopes.

TypeHelper - Used to be called TypeSystem (how pretentious of me).  Helps you know a few things about types.

DeferredList - An implementation of IList that enables deferred loading.  (Bells are probably ringing in your head about now.)

Nested inside this namespace is another one, called simply 'Data'  (I know, original, huh). In here you'll find all those other classes that made up most of my previous posts like the provider itself, DbQueryProvider, that works on top of any DbConnection.  To really understand what has gone on in my head you need to get into this class and look around.

Logically, it still does the same thing.  It primarily just implements the IQueryable.Execute method, converting query expressions into SQL commands, executing them using the DbConnection and translating the results into objects. That's all still there, its just been re-arranged a bit and made a lot more extensible.

How it is extended is the interesting bit that flavors everything else to come. The DbQueryProvider's brains are now fully pluggable. I took a look at what was going on during execution and broke it down into logically atomic steps. The prior version used to first translate the query, build a generic reader and then execute and return.  That's still sort-of what happens, but there's an even better break-down.

Queries are translated, but the translation goes through three distinct phases: 

1) mapping is applied

2) policy is enforced

3) the target query language has the last say. 

Each of these steps is pluggable.  How?  There are three new classes to get acquainted to.

QueryMapping - Defines information & rules to map an object model onto a database model

QueryPolicy - Defines information & rules to determine inclusion of related data & execution plans

QueryLanguage - Defines information & rules to adhere to a target language, including converting the query into text

Every DbQueryProvider is now supplied these three at runtime, each can be overriden by you to take control at different points in then translation and execution process. The mapping, policy & language can each override a portion of the query translation pipeline, injecting its own rules as rewrites of the query expression using the common DbExpression primitives provided.

In addition, the QueryLanguage controls the final translation of the tree to SQL text (or whatever target language is being used), and the QueryPolicy is invoked to build the execution plan.  Note, the execution plan is not the simple object-reader of yore.  It is now an entire runtime generated program for completely executing the database query & constructing objects out of the results.  The QueryPolicy is allowed to do whatever final rewrite is necessary to turn the translated query expression into an executable piece of code.

Of course, by default, these three mostly do what was being done before, and if you care to look you'll see that the QueryLanguage's Format method (for turning queries into text) simply defers to the TSqlFormatter class (which used to be called QueryFormatter.)  You can overidde this method and have it call your own formatter.  You can even override the TSqlFormatter and make a few minor changes if your back end SQL is largely the same.

If you want to implement a different strategy to retrieving hierarchical results, you can inject your own logic both during query translation and you can take over the entire process of building the execution plan.  The default BuildExecutionPlan on the QueryPolicy class defers to the ExecutionBuilder (used to be ProjectionBuilder in its former life), but you can change that and do it your own way.  Of course, all the source is available, so you can build your version based off of the one that exists in the toolkit.

The crazy logic for inferring mapping information by simply using the names of types and members is retained, but walled off in a specific implementation of mapping called the ImplicitMapping.  You can re-use this one if your needs are as meager as my demos.  Or you can build your own.

Of course, now that you've had a moment to think about it, given so much flexibility, you'll inevitably start asking about what database providers are supported.  (Still just TSQL.)  You'll also want to know what complex mapping models I've invented and how they compare to ones used by ORM's.  (Still just the implicit demo one.)  And then you'll want to know what I did to improve reading data hierarchies because that N+1 strategy is just a loser.  (Here's where you finally get something out of me.)

Relationships and Loading

The query translation finally understands relationship properties. The QueryMapping understands the abstraction of mapping properties, these can map to columns or relationships (like associations) or whatever you want.  The mapping decides what's possible.  What have I implemented?  The basic mapping understands association relationships only; those kinds of relationships that are made via a join matching one or more columns across two tables.

The interesting part (at least to me) is how relationships are dealt with during query translation and later during execution.  Does a one to many relationship property residing in the output projection lead to extra queries at execution time or not?  If so, how many? 

I've implemented a few rewriters that deal with relationships.  For example, the SingletonProjectionRewriter finds projections of singleton relationships (one to one or many to one) and folds them into the query itself using a server-side join.  A singleton query doesn't change the cardinality of the results, so it is generally safe.  Yet, what happens when you refer to the same singleton relationship property more than once? Do you get the same join over and over again? To hold back the flood of redundant joins I had to come up with another rewriter, one that would find the duplicate uses and get them to place nice together, but because this situation can happen more often than just as the side effect of doing the singleton rewrites, I wrote it to work on joins and not just nested projections. The RedundantJoinRemover looks for identical joins expressed in the same query and throws out all the extra ones.  It does not look through into sub-queries, so its best to first removed redundant sub-query layers using the RedundantSubqueryRemover in hopes of forcing as many joins into the same layer.

The ClientJoinedProjectionRewriter attempts to do something very different. It looks to convert nested projections (ones that would become extra queries per object at execution time) into client-side joins. So you'll get one join per included relationship. This is not the "Big Join" strategy that LINQ to SQL uses. It precisely the strategy that I still consider inferior if I'm asked to ensure correctness of the results. So why did I choose to implement it? I'm not being asked to insure the correctness of results, nor am I being asked to ensure that large result sets can stream back from the server. I assume that the normal degree of ambiguity of results you get any time you attempt to get related results via more than one query is okay in this case too. I also assume that a user will choose to employ a transaction to get better consistency. You can also disable this one easily if you want. You can implement a policy that uses one or more other strategies.

This is what the ClientJoinedProjectionRewriter actually does.  If the SelectExpression of a nested projection is constrained to the outer query via an equality comparison, like this column equals that parameter and so on I figure I can easily turn this into a client-side LINQ to object style join.  So I rewrite the query so it restates all the significant bits of the queries that came before it.  So if you wanted to retrieve customers in London and all their orders you'll get one query for the customers in London and then another query for all the orders for all the customers in London. The idea is that the second query is actually run first and all the objects are stuffed into a lookup table. When the query for customers executes, it picks up the matching orders right out of the lookup table.  Neat.

Taking it for a Spin

Given a simple query for customers in London:

var query = from c in db.Customers
            where c.City == "London"
            select c;

And, assuming I have a policy that tells the provider that the Orders property on customers is supposed to be included whenever I ask for customers.  (The TestPolicy used by the supplied test project lets me pick and choose properties by name.)

When I start out, the provider first sees the query as LINQ expression tree:

Query(Test.Customer).Where(c => (c.City = "London"))


After mapping the query expression now looks like this: 

Project(
  @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
  FROM [Customers] AS t0
  WHERE (t0.City = 'London')",
  new Customer() {
    City = A0.Column("City"),
    ContactName = A0.Column("ContactName"),
    Country = A0.Column("Country"),
    CustomerID = A0.Column("CustomerID"),
    Phone = A0.Column("Phone")
  },
  p => Queryable.AsQueryable(p)
)

It's got the SelectExpression (formatted by default as TSQL), followed by an expression that constructs a customer out of the query's result columns, and function that converts the presumed resulting IEnumerable<Customer> into the expected type. 

Next the QueryPolicy is asked to translate, so it applies rules like determining if a relationship property is included in the output or not.  Since I set up the policy to automatically include orders, I get a new expression that looks like this:

Project(
  @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
  FROM [Customers] AS t0
  WHERE (t0.City = 'London')",
  new Customer() {
    City = A0.Column("City"),
    ContactName = A0.Column("ContactName"),
    Country = A0.Column("Country"),
    CustomerID = A0.Column("CustomerID"),
    Phone = A0.Column("Phone"),
    Orders = Project(
      @"SELECT t0.CustomerID, t0.OrderDate, t0.OrderID
      FROM [Orders] AS t0
      WHERE (t0.CustomerID = t2.CustomerID)",
      new Order() {
        CustomerID = A1.Column("CustomerID"),
        OrderDate = A1.Column("OrderDate"),
        OrderID = A1.Column("OrderID")
      },
      p => Enumerable.ToList(p)
    )
  },
  p => Queryable.AsQueryable(p)
)

Now I've got two queries, one nested inside the other.  If nothing happens to this, the inner query will get executed once per row in the outer query results.

Next, the QueryPolicy also applies the the ClientJoinedProjectionRewriter.  It attempts to convert nested projections into a query to fetch all the data (that will be executed once per outer query) with the resulting objects joined on the client machine.

Project(
  @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
  FROM [Customers] AS t0
  WHERE (t0.City = 'London')",
  new Customer() {
    City = A0.Column("City"),
    ContactName = A0.Column("ContactName"),
    Country = A0.Column("Country"),
    CustomerID = A0.Column("CustomerID"),
    Phone = A0.Column("Phone"),
    Orders = ClientJoin(
      OuterKey(A0.Column("CustomerID")),
      InnerKey(A1.Column("CustomerID")),
      Project(
        @"SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
        FROM [Customers] AS t0
        OUTER APPLY (
          SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
          FROM [Orders] AS t2
          WHERE (t2.CustomerID = t0.CustomerID)
          ) AS t1
        WHERE (t0.City = 'London')",
        Outer(
          A1.Column("Test"), 
          new Order() {
            CustomerID = A1.Column("CustomerID1"),
            OrderDate = A1.Column("OrderDate"),
            OrderID = A1.Column("OrderID")
          }
        ),
        p => Enumerable.ToList(p)
      )
    )
  },
  p => Queryable.AsQueryable(p)
)

Now I have an embedded ClientJoin node.  Notice how the inner query has a join so that it will retrieve all the orders for all the customers in London now.  Unfortunately, its an OUTER APPLY join when it does not really need to be.  But don't fear.  The QueryLanguage will soon fix it.  The combined query also gets a new column 'Test'. This is used to determine if the join found at least one matching row or not. If no match was found instead of the value 1 the 'Test' column will be null. I can use this information to make sure the resulting collection for a non-match is empty.

Next the QueryLanguage get its shot at translating the tree.  When it does, it attempts to get rid of any APPLY node if at all possible; OUTER APPLY's become LEFT OUTER JOIN's and CROSS APPLY's become INNER JOIN's.

Project(
  @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
  FROM [Customers] AS t0
  WHERE (t0.City = 'London')",
  new Customer() {
    City = A0.Column("City"),
    ContactName = A0.Column("ContactName"),
    Country = A0.Column("Country"),
    CustomerID = A0.Column("CustomerID"),
    Phone = A0.Column("Phone"),
    Orders = ClientJoin(
      OuterKey(A0.Column("CustomerID")),
      InnerKey(A1.Column("CustomerID")),
      Project(
        @"SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
        FROM [Customers] AS t0
        LEFT OUTER JOIN (
          SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
          FROM [Orders] AS t2
          ) AS t1
          ON (t1.CustomerID = t0.CustomerID)
        WHERE (t0.City = 'London')",
        Outer(
          A1.Column("Test"), 
          new Order() {
            CustomerID = A1.Column("CustomerID1"),
            OrderDate = A1.Column("OrderDate"),
            OrderID = A1.Column("OrderID")
          }
        ),
        p => Enumerable.ToList(p)
      )
    )
  },
  p => Queryable.AsQueryable(p)
)

Now the query expression is fully translated we only have left to build the execution plan.  The QueryPolicy is asked to turn the expression into a program that will do the actual execution.

Invoke(
  null,
  lookup1 => ((IQueryable<Customer>)ExecutionBuilder.Sequence(new Object[] {
    ExecutionBuilder.Assign(
      lookup1,
      Enumerable.ToLookup(
        Enumerable.Where(
          ((DbQueryProvider)Query(Test.Customer).Provider).Execute(
            new QueryCommand<KeyValuePair<String,Order>>(
              @"SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
              FROM [Customers] AS t0
              LEFT OUTER JOIN (
                SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
                FROM [Orders] AS t2
                ) AS t1
                ON (t1.CustomerID = t0.CustomerID)
              WHERE (t0.City = @p0)",
              new String[] {"p0"},
              r1 => new KeyValuePair<String,Order>(
                r1.IsDBNull(0)
                  ? null
                  : ((String)Convert.ChangeType(
                    r1.GetValue(0),
                    System.String
                  )),
                r1.IsDBNull(1)
                  ? null
                  : new Order() {
                    CustomerID = r1.IsDBNull(2)
                      ? null
                      : ((String)Convert.ChangeType(
                        r1.GetValue(2),
                        System.String
                      )),
                    OrderDate = r1.IsDBNull(3)
                      ? new DataTime("1/1/0001 12:00:00 AM")
                      : ((DateTime)Convert.ChangeType(
                        r1.GetValue(3),
                        System.DateTime
                      )),
                    OrderID = r1.IsDBNull(4)
                      ? 0
                      : ((Int32)Convert.ChangeType(
                        r1.GetValue(4),
                        System.Int32
                      ))
                  }
              )
            ),
            new Object[] {((Object)"London")}
          ),
          kvp => kvp.Value != null
        ),
        kvp => kvp.Key,
        kvp => kvp.Value
      )
    ),
    Queryable.AsQueryable(((DbQueryProvider)Query(Test.Customer).Provider).Execute(
      new QueryCommand<Customer>(
        @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
        FROM [Customers] AS t0
        WHERE (t0.City = @p0)",
        new String[] {"p0"},
        r0 => new Customer() {
          City = r0.IsDBNull(0)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(0),
              System.String
            )),
          ContactName = r0.IsDBNull(1)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(1),
              System.String
            )),
          Country = r0.IsDBNull(2)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(2),
              System.String
            )),
          CustomerID = r0.IsDBNull(3)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(3),
              System.String
            )),
          Phone = r0.IsDBNull(4)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(4),
              System.String
            )),
          Orders = Enumerable.ToList(lookup1.get_Item(r0.IsDBNull(3)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(3),
              System.String
            ))))
        }
      ),
      new Object[] {((Object)"London")}
    ))
  }))
  )

Now we're cookin' with GAS!  This is the exact LINQ expression that can be run to execute the query and produce the results.  Notice how the code directly calls the DbQueryProvider.Execute method that takes the command text, parameters and a function that converts a DbDataReader into a sequence of objects.  That function is described inline with the rest of the LINQ expression.  Yes, you can do that.  That's how all those other LINQ operators work.

Also notice how I cheat with expression trees.  The process to create the lookup table requires variables, assignments and statement sequences; all things not currently possible with expression trees -- or so you thought.  The variable is really a parameter to a lambda expression.  If you directly invoke an inline lambda expression you basically create a variable that is in scope over the body of the lambda.  Next I call a method on ExecutionBuilder called Sequence.  This lets me simulate having statement sequences.  All the expressions are evaluated in order because the results are turned into an argument array.  The Sequence method simply returns the last value in the array.  The last cheat is the assignment.  While there is no assignment operator in the expression tree (at least not yet), it is possible to pass a parameter to a method as a by-ref argument.  Yes, this is ugly and causes side-effects, but that's exactly what I wanted.  (Until we get support for statements sometime in the future!)  The ExecutionBuilder Assign method take a ref argument and a value and simply assigns the value to the ref argument. 

When the query is finally executed, all the data is retrieved up front in only two queries.

SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
FROM [Customers] AS t0
LEFT OUTER JOIN (
  SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
  FROM [Orders] AS t2
  ) AS t1
  ON (t1.CustomerID = t0.CustomerID)
WHERE (t0.City = @p0)
-- @p0 = [London]

SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
FROM [Customers] AS t0
WHERE (t0.City = @p0)
-- @p0 = [London]

 

Well that's it then.  Kudos, if you've read this far you really are trying to understand how to build your own IQueryable provider.  Either that or you are up late at night with insomnia trying to find something softer than a hammer to put yourself to sleep with.

Have fun with the new code!

 

Note: all sources are made available under the Microsoft Public License (MSPL)

This is the eleventh in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you’ll want to do so before proceeding, or at least before proceeding to copy the code into your own project and telling your boss you single-handedly solved the data layer problem over the weekend.

Complete list of posts in the Building an IQueryable Provider series 

No excuses for a late post this time. In my estimation, not only am I early but I have arrived with abundance. This is not my ordinary, keep it to the basics, slow and steady post about adding that next incremental feature to a feature-starved sample program. This is the Big Kahuna! This is the post where I finally go over the edge, lose my cool and let loose on the code. No more Mr. Nice guy.  Now, I’m really taking it to the cleaners. 

There’s so much goodness here I don’t even know where to start.  Even if I did, how could I possibly describe it all in one little post? I’m not even going to try. You’re going to have to download the source to get the real truth. A link is located at the bottom of the post. Click it. Feel the power flow over the internet, through the wires and onto your screen. Feel the radiance seep in through your eyes and take hold of your brain with the vice-like grip of a big-time wrestler.

So just what have I been cooking?  Let’s take a look at the highlights.

More Query Operators

That’s right; a lot more. Don’t believe me? Look at the code and see for yourself. Too busy? Boss breathing down your neck. Okay, here’s the list.

Distinct - Not only is there translation for this, but Distinct also makes an appearance inside most aggregates! AVG(DISTINCT t0.Price) anyone!

Skip & Take – Both work and work well together using the ROW_NUMBER & BETWEEN combo that gets incredible performance.

First, FirstOrDefault, Single & SingleOrDefault – No, these are not just the same thing. They really do behave differently.

Any, All & Contains - Not only do these guys work, but all three work with local collections. Use collection.Any(xxx) to get any predicate expression expanded into a series of OR conditions over the input set. 

Framework Methods and Properties


Hallelujahs, brothers! Now you can write queries that reference simple String, DateTime, Decimal, and Math operations, and get them translated into equivalent SQL.  I’ve not implemented every possible method but you’ll find a lot there. There are so many I can’t even list them all here. Rest assured you’ll get classics like string StartsWith, EndsWith and Contains and every variation of Equals and Compare that the framework can throw at me.

Parameters

That’s what I said, parameters. No more SQL injection attacks in the sample code, I’ve finally gotten around to adding parameter support. Take a look, it’s all rather straight forward; a new node type NamedValueExpression, a new visitor Parameterizer and a few extra lines of code to create and assign parameter values at execution time. Not every argument is parameterized either. Simple numeric values are left as literals.

Simpler Queries

The RedundantSubqueryRemover has been enhanced to know how to merge more types of SelectExpression’s together. Now, you’ll get even less layers of subqueries.

Compiled Queries

OMG! Can it be so? Yes, compiled queries too. How is it possible? It took a little refactoring of how queries are executed. There is now a new low-level Execute method on DbQueryProvider that takes a translated command, parameters and a function that converts a DbDataReader into a result (this could be sequence or a single value.)  Given that and LINQ Expression trees themselves, it is easy to see how I can turn a request to execute a LambdaExpression into constructing a function that when called, calls down to the low-level execute method with pre-translated SQL queries and a pre-translated object reader.

Unit Tests

Okay, get back up into your chair. I realized how easy it is to lose control and find the need to roll around the floor in ecstasy. However, your co-workers are starting to wonder if they should call for medical assistance. 

This does not mean I’m using TDD, or DDT or any TLA methodology; it just an apology really. The sample codebase has become so unwieldy that even I can’t trust myself anymore and have discovered so many prior features that I’ve broken in the process of doing bigger and better things that I literally broke down and wrote some simple unit tests. AND they are simple too. For the most part they prove 1) some LINQ operators are translated into the SQL I thought looked good at the time, and 2) the database server actually accepts these queries and does something with them.  There are over 200 of these tests now.  There is also a start at verifying actual semantics by looking at the results.  I started a second bank of tests for these and so far only have a few compiled-query tests listed.  More surely to come.

 

Whew!  I know that’s a lot to GROK, but take a look anyway.  If you’ve been building your own provider and have been stuck on how to do some of these kinds of features (except for tests, shame on you if you haven’t got your own tests yet), now you can get moving again.

But WAIT, there’s more!  I’m not done yet.  Even with all this I’ve shown you today, I still have a few more ideas.  More posts to come. 

This is the tenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you'll want to find a nice shady tree, relax and meditate on why your world is so confused and full of meaningless tasks that it has kept you from pursuing the perfection of writing IQueryable providers.

Complete list of posts in the Building an IQueryable Provider series 

Last time I blamed the television writers strike for delaying me in getting out my next installment.  This time I blame the lack of one, and sunny days, and my son riding his bicycle, and the grueling, tiresome task of getting paid.  Would you believe some of the code we are writing has nothing what-so-ever to do with IQueryable LINQ providers? Crazy, I know. Maybe it would be a different story around here if we had a few more shady trees. 

Fortunately for you I have a prime piece of savory source code ready for slow grilling over a bed of hot mesquite. It's something I'd like to say I was saving for a special occasion, but the truth is I've been putting it off because I thought the code would be just too overwhelming. I mean for me, not you. You see I've been trying to do two things with this series; one, provide a guide to help developers navigate and understand the power and flexibility of the IQuerayble interface, and two, attempt to prove to myself that code written in a purely functional style (or as pure as I can make it) can end up much cleaner and easier on the eyes. I've been horrified ever since tackling OrderBy that this subject would become my undoing. Knowing how involved the translation was for LINQ to SQL I was dreading the immutable madness that would ensue.  Obviously, I survived the ordeal and so will you.

Grappling with GroupBy and Aggregates

Translating just GroupBy itself might not be so difficult if I did not have to also account for aggregates. It seems that GroupBy is not very interesting without the ability to write expressions involving Count() and Max() and Sum(), and that's where the difficulty sets in.

A query involving groups and aggregates that looks like this:

var query = from o in db.Orders

            group o by o.CustomerID into g

            select g.Max(o => o.OrderID);

Is translated into a series of method calls that looks like this:

var query = db.Orders.GroupBy(o => o.CustomerID).Select(g => g.Max(o => o.OrderID));

 

And this is a problem because as I translate from the bottom up (as I have been doing all along) building individual SELECT subqueries for each query operator, I won’t even discover the use of the aggregate ‘Max’ until after I’ve created the SELECT with the GROUP BY.  So how do I get the two pieces together in the same place at the same time without violating my functional style and immutable expression nodes?  Weirder yet, how do I even know that a call to an aggregate method should be tied back to a particular GroupBy? What if I had two group-bys?

 

So where do I start to explain?  Maybe the easy stuff first.  I’ve added a GroupBy property to SelectExpression.

 

internal class SelectExpression : Expression {
    ...

    ReadOnlyCollection<Expression> groupBy;

 

    internal SelectExpression(
        ...,

        IEnumerable<Expression> groupBy

        )

    internal ReadOnlyCollection<Expression> GroupBy {

        get { return this.groupBy; }

    }

}

 

Now every place where I construct one I have to specify a collection of grouping expressions (or null).  GroupBy expressions also get visited after everything else (so far).  Here’s the new version of VisitSelect in DbExpressionVisitor.

 

protected virtual Expression VisitSelect(SelectExpression select) {

    Expression from = this.VisitSource(select.From);

    Expression where = this.Visit(select.Where);

    ReadOnlyCollection<ColumnDeclaration> columns = this.VisitColumnDeclarations(select.Columns);

    ReadOnlyCollection<OrderExpression> orderBy = this.VisitOrderBy(select.OrderBy);

    ReadOnlyCollection<Expression> groupBy = this.VisitExpressionList(select.GroupBy);

    if (from != select.From

        || where != select.Where

        || columns != select.Columns

        || orderBy != select.OrderBy

        || groupBy != select.GroupBy

        ) {

        return new SelectExpression(select.Type, select.Alias, columns, from, where, orderBy, groupBy);

    }

    return select;

}

 

I also added an AggregateExpression class. It represents a call to an aggregate operator.

 

    internal enum AggregateType {

        Count,

        Min,

        Max,

        Sum,

        Average

    }

 

    internal class AggregateExpression : Expression {

        AggregateType aggType;

        Expression argument;

        internal AggregateExpression(Type type, AggregateType aggType, Expression argument)

            : base((ExpressionType)DbExpressionType.Aggregate, type) {

            this.aggType = aggType;

            this.argument = argument;

        }

        internal AggregateType AggregateType {

            get { return this.aggType; }

        }

        internal Expression Argument {

            get { return this.argument; }

        }

    }

 

Now I at least know what to turn those aggregates into.  But how do I get these aggregate expressions into the right place in the query tree?  What if I did nothing and just let the aggregates fall where they may?  What would it look like?  Would it even be legal SQL?

 

SELECT MAX(t5.OrderID)

FROM (

  SELECT t0.CustomerID

  FROM Orders AS t0

  GROUP BY t0.CustomerID

  ) AS t5

 

Oh, no. Danger Will Robinson! This is not legal SQL. Where does OrderID even come from? It’s not even projected out of the sub-query with the GROUP BY. And even if it were somehow projected, the result of the query would be all wrong.  It would give me a single maximum instead of one for each group.  Getting group-by & aggregates to work together is going to be tricky.

 

Is it even possible to solve it and maintain my layering approach?  What about correlated sub-queries?  I could form a sub-query that joins back to the original query based on the grouping expressions and have it contain the aggregate.

 

SELECT (

  SELECT MAX(t2.OrderID)

  FROM Orders AS t2

  WHERE ((t2.CustomerID IS NULL AND t5.CustomerID IS NULL) OR (t2.CustomerID = t5.CustomerID))

  ) AS c0

FROM (

  SELECT t0.CustomerID

  FROM Orders AS t0

  GROUP BY t0.CustomerID

  ) AS t5

 

Now that at least executes on the server. It might not be pretty and likely not efficient unless the server is really good at unscrambling my query, but it does execute and produce the correct results. So it is technically legal. But surely I can do better!

The GroupBy Operator

 

By now you are thinking, OMG, he’s going to stick us with a wacky solution like he did with nested queries!  Never fear, my friend.  Do not look at the man behind the current.  Cast your gaze ahead and all will be made clear. In my hands, I have the GroupBy operator.  Watch as it transforms into a SQL query inside the QueryBinder!

 

    protected override Expression VisitMethodCall(MethodCallExpression m) {

        if (m.Method.DeclaringType == typeof(Queryable) ||

            m.Method.DeclaringType == typeof(Enumerable)) {

            switch (m.Method.Name) {
                ...

                case "GroupBy":

                    if (m.Arguments.Count == 2) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            null,

                            null

                            );

                    }

                    else if (m.Arguments.Count == 3) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            (LambdaExpression)StripQuotes(m.Arguments[2]),

                            null

                            );

                    }

                    else if (m.Arguments.Count == 4) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            (LambdaExpression)StripQuotes(m.Arguments[2]),

                            (LambdaExpression)StripQuotes(m.Arguments[3])

                            );

                    }

                    break;
                ...

            }

        }

        return base.VisitMethodCall(m);

    }

 

As you can see, I’m handling three different variations of GroupBy.  The first one simply takes a single keySelector lambda (the thing to group by).  It is supposed to return a sequence of grouping’s that each have a group key value and a collection of elements that make up that group.  The second form is the same as the first except it also includes an elementSelector lambda that allows me to specify a map between an item in the source sequence and its shape of the elements in the group. The last form includes a resultSelector lambda that allows me to specify what the key and group collection turn into, instead of assuming they always form into instances of IGrouping<K,E>. With all these variations, the translation of group-by is going to get a bit involved, so bear with me.

 

Get ready. Get set. Open your eyes. Close them. Now open them again and really look. Go!

 

    protected virtual Expression BindGroupBy(Expression source, LambdaExpression keySelector, LambdaExpression elementSelector, LambdaExpression resultSelector) {

        ProjectionExpression projection = this.VisitSequence(source);

       

        this.map[keySelector.Parameters[0]] = projection.Projector;

        Expression keyExpr = this.Visit(keySelector.Body);           

 

        Expression elemExpr = projection.Projector;

        if (elementSelector != null) {

            this.map[elementSelector.Parameters[0]] = projection.Projector;

            elemExpr = this.Visit(elementSelector.Body);

        }

       

        // Use ProjectColumns to get group-by expressions from key expression

        ProjectedColumns keyProjection = this.ProjectColumns(keyExpr, projection.Source.Alias, projection.Source.Alias);

        IEnumerable<Expression> groupExprs = keyProjection.Columns.Select(c => c.Expression);

 

        // make duplicate of source query as basis of element subquery by visiting the source again

        ProjectionExpression subqueryBasis = this.VisitSequence(source);

 

        // recompute key columns for group expressions relative to subquery (need these for doing the correlation predicate)

        this.map[keySelector.Parameters[0]] = subqueryBasis.Projector;

        Expression subqueryKey = this.Visit(keySelector.Body);

       

        // use same projection trick to get group-by expressions based on subquery

        ProjectedColumns subqueryKeyPC = this.ProjectColumns(subqueryKey, subqueryBasis.Source.Alias, subqueryBasis.Source.Alias);

        IEnumerable<Expression> subqueryGroupExprs = subqueryKeyPC.Columns.Select(c => c.Expression);

        Expression subqueryCorrelation = this.BuildPredicateWithNullsEqual(subqueryGroupExprs, groupExprs);

       

        // compute element based on duplicated subquery

        Expression subqueryElemExpr = subqueryBasis.Projector;

        if (elementSelector != null) {

            this.map[elementSelector.Parameters[0]] = subqueryBasis.Projector;

            subqueryElemExpr = this.Visit(elementSelector.Body);

        }

       

        // build subquery that projects the desired element

        string elementAlias = this.GetNextAlias();

        ProjectedColumns elementPC = this.ProjectColumns(subqueryElemExpr, elementAlias, subqueryBasis.Source.Alias);

        ProjectionExpression elementSubquery =

            new ProjectionExpression(

                new SelectExpression(TypeSystem.GetSequenceType(subqueryElemExpr.Type), elementAlias, elementPC.Columns, subqueryBasis.Source, subqueryCorrelation),

                elementPC.Projector

                );

 

        string alias = this.GetNextAlias();

        // make it possible to tie aggregates back to this group-by

        GroupByInfo info = new GroupByInfo(alias, elemExpr);

        this.groupByMap.Add(elementSubquery, info);

 

        Expression resultExpr;

        if (resultSelector != null) {

            Expression saveGroupElement = this.currentGroupElement;

            this.currentGroupElement = elementSubquery;

            // compute result expression based on key & element-subquery

            this.map[resultSelector.Parameters[0]] = keyProjection.Projector;

            this.map[resultSelector.Parameters[1]] = elementSubquery;

            resultExpr = this.Visit(resultSelector.Body);

            this.currentGroupElement = saveGroupElement;

        }

        else {

            // result must be IGrouping<K,E>

            resultExpr = Expression.New(

                typeof(Grouping<,>).MakeGenericType(keyExpr.Type, subqueryElemExpr.Type).GetConstructors()[0],

                new Expression[] { keyExpr, elementSubquery }

                );

        }

 

        ProjectedColumns pc = this.ProjectColumns(resultExpr, alias, projection.Source.Alias);

 

        // make it possible to tie aggregates back to this group-by

        Expression projectedElementSubquery = ((NewExpression)pc.Projector).Arguments[1];

        this.groupByMap.Add(projectedElementSubquery, info);

 

        return new ProjectionExpression(

            new SelectExpression(TypeSystem.GetSequenceType(resultExpr.Type), alias, pc.Columns, projection.Source, null, null, groupExprs),

            pc.Projector

            );

    }

 

It starts off easy, following the same binding pattern as the other operators.  First I bind the key selector and get a key expression. Then if I have an element selector I bind that too, to get an element expression, otherwise I use the projector expression from the incoming source as the element expression. Got it?

 

Now I have to figure out what the grouping expressions are going to be. You might be thinking I already know what they are, and there’s only one of them. It’s the key expression, right? Well, not exactly. It turns out its common to need to group by more than one property or column of data. The only way to do that with LINQ is to construct an anonymous type with multiple parts. So a key might be an anonymous type with multiple interesting fields. How do I turn that into one or more expression that can be evaluated as part of a SQL group by clause?

 

It turns out I already have a handy function that does what I want. If I treat the key expression they same way I treat a projection expression, I can use the ColumnProjector to turn the expression into a list of column declarations, each with its own scalar expression.  Then I can simply throw away the column declarations and keep the expressions that define the columns and voila I have grouping expressions!  Don’t try this at home.  Err, I mean, go ahead and try this at home. It’s easy. Watch, I’ll do it again in a moment.

 

Moving along we come to some code trying to compute a ‘basis’. What’s that?  Well, it turns out SQL doesn’t really support a GroupBy operator the way it is defined in LINQ. The LINQ operator, by default, produces a sequence of groups with each group containing the elements that got designated as part of that group. SQL group-by can only produce aggregate computations over the groups, not the groups themselves.  Oh, dear, what to do, what to do?

 

The only way to solve this problem is to form a self join back to the data I want. If I take the result of the group-by and join it back to the original query (everything up until the group-by) matching group-by key values I get all the data back that SQL was not going to let me have.  That’s what the ‘subqueryBasis’ is.  It’s a rebinding of the original input sequence. Then we rebind the key again and do my ColumnProjector trick to get a new collection of group expressions. Now I can use both sets to form a predicate for a join.

 

Next, you can see that if I did specify an element selector I now bind it with respect to this new ‘basis’ and use the resulting expression as the projection coming out of this new branch of the query tree.

 

Alas, I am not yet done.  What about that third argument?  If it’s specified then I have another lambda I have to do something with and if it is not specified then I need to figure out how I’m going to return an IGrouping<K,E>.

 

Okay, that doesn’t turn out to be too difficult.  I created a Grouping<K,E> class that I can use in case I need to actually return this stuff all the way back to the calling code.

 

    public class Grouping<TKey, TElement> : IGrouping<TKey, TElement> {

        TKey key;

        IEnumerable<TElement> group;

 

        public Grouping(TKey key, IEnumerable<TElement> group) {

            this.key = key;

            this.group = group;

        }

 

        public TKey Key {

            get { return this.key; }

        }

 

        public IEnumerator<TElement> GetEnumerator() {

            return this.group.GetEnumerator();

        }

 

        IEnumerator IEnumerable.GetEnumerator() {

            return this.group.GetEnumerator();

        }

    }

 

Even if I don’t, I can still use this type as a way to encode the fact that I’m returning a grouping, which is what I do. The final result of the call to GroupBy (unless a result selector was supplied) is a sequence of these groupings, so the projector expression resulting from this operation is simply a construction of one of these Grouping<K,E> types via a NewExpression node. If a result selector is supplied that lambda determines the shape of the result so I just bind that the same way I bind all the other lambdas.  In both cases, I’m using the newly created extra ‘basis’ query that projects out elements to form the group collection that is either transformed with the result selector or returned to the caller.

 

That’s about it; clean and neat. Right? What about those other few lines of code referring to ‘GroupByInfo’ and ‘currentGroupElement’? That, sir, is the man behind the current. Now I’ll show you a little of him.

 

Later, when I’m binding aggregates I will need to invent a way to tie those aggregates back to this group-by; if they belong to this group-by.  So, later, I’m going to need to know something about this group-by (and certainly all group-by’s that might have already been encountered) to be able to pick the right one.  This is what the GroupByInfo is.  It’s the bits of information I hope will be useful later to correlate the two pieces together.  If we move along now to aggregate binding I’ll show you how it’s used.

Aggregate Binding

 

Aggregates get treated just like any other operator.

 

    protected override Expression VisitMethodCall(MethodCallExpression m) {

        if (m.Method.DeclaringType == typeof(Queryable) ||

            m.Method.DeclaringType == typeof(Enumerable)) {
            ...

            switch (m.Method.Name) {

                case "Count":

                case "Min":

                case "Max":

                case "Sum":

                case "Average":

                    if (m.Arguments.Count == 1) {

                        return this.BindAggregate(m.Arguments[0], m.Method, null);

                    }

                    else if (m.Arguments.Count == 2) {

                        LambdaExpression selector = (LambdaExpression)StripQuotes(m.Arguments[1]);

                        return this.BindAggregate(m.Arguments[0], m.Method, selector);

                    }

                    break;

            }

        }

        return base.VisitMethodCall(m);

    }

 

The actual binding has a few extra bits that help make the job a bit easier.

 

    Expression currentGroupElement;

 

    class GroupByInfo {

        internal string Alias { get; private set; }

        internal Expression Element { get; private set; }

        internal GroupByInfo(string alias, Expression element) {

            this.Alias = alias;

            this.Element = element;

        }

    }

 

    private AggregateType GetAggregateType(string methodName) {

        switch (methodName) {

            case "Count": return AggregateType.Count;

            case "Min": return AggregateType.Min;

            case "Max": return AggregateType.Max;

            case "Sum": return AggregateType.Sum;

            case "Average": return AggregateType.Average;

            default: throw new Exception(string.Format("Unknown aggregate type: {0}", methodName));

        }

    }

 

    private bool HasPredicateArg(AggregateType aggregateType) {

        return aggregateType == AggregateType.Count;

    }

 

    private Expression BindAggregate(Expression source, MethodInfo method, LambdaExpression argument, bool isRoot) {

        Type returnType = method.ReturnType;

        AggregateType aggType = this.GetAggregateType(method.Name);

        bool hasPredicateArg = this.HasPredicateArg(aggType);

 

        if (argument != null && hasPredicateArg) {

            // convert query.Count(predicate) into query.Where(predicate).Count()

            source = Expression.Call(typeof(Queryable), "Where", method.GetGenericArguments(), source, argument);

            argument = null;

        }

 

        ProjectionExpression projection = this.VisitSequence(source);

 

        Expression argExpr = null;

        if (argument != null) {

            this.map[argument.Parameters[0]] = projection.Projector;

            argExpr = this.Visit(argument.Body);

        }

        else if (!hasPredicateArg) {

            argExpr = projection.Projector;

        }

 

        string alias = this.GetNextAlias();

        var pc = this.ProjectColumns(projection.Projector, alias, projection.Source.Alias);

        Expression aggExpr = new AggregateExpression(returnType, aggType, argExpr);

        Type selectType = typeof(IEnumerable<>).MakeGenericType(returnType);

        SelectExpression select = new SelectExpression(selectType, alias, new ColumnDeclaration[] { new ColumnDeclaration("", aggExpr) }, projection.Source, null);

 

        if (isRoot) {

            ParameterExpression p = Expression.Parameter(selectType, "p");

            LambdaExpression gator = Expression.Lambda(Expression.Call(typeof(Enumerable), "Single", new Type[] { returnType }, p), p);

            return new ProjectionExpression(select, new ColumnExpression(returnType, alias, ""), gator);

        }

 

        SubqueryExpression subquery = new SubqueryExpression(returnType, select);

 

        // if we can find the corresponding group-info we can build a special AggregateSubquery node that will enable us to

        // optimize the aggregate expression later using AggregateRewriter

        GroupByInfo info;

        if (!hasPredicateArg && this.groupByMap.TryGetValue(projection, out info)) {

            // use the element expression from the group-by info to rebind the argument so the resulting expression is one that

            // would be legal to add to the columns in the select expression that has the corresponding group-by clause.

            if (argument != null) {

                this.map[argument.Parameters[0]] = info.Element;

                argExpr = this.Visit(argument.Body);

            }

            else {

                argExpr = info.Element;

            }

            aggExpr = new AggregateExpression(returnType, aggType, argExpr);

 

            // check for easy to optimize case.  If the projection that our aggregate is based on is really the 'group' argument from

            // the query.GroupBy(xxx, (key, group) => yyy) method then whatever expression we return here will automatically

            // become part of the select expression that has the group-by clause, so just return the simple aggregate expression.

            if (projection == this.currentGroupElement)

                return aggExpr;

 

            return new AggregateSubqueryExpression(info.Alias, aggExpr, subquery);

        }

 

        return subquery;

    }

 

Binding aggregates has its own slew of caveats that drive how this binding function is formed.  Aggregates can be specified with our without an argument. Normally, if an aggregate is not specified with an argument it is implied that the aggregate is operating over the element of the sequence. 

 

For example, it is legal to write:

 

    db.Orders.Select(o => o.OrderID).Max();

 

It has the same meaning as:

 

    db.Orders.Max(o => o.OrderID);

 

So if an aggregate invocation has no argument, I can simply pick up and use the projector expression coming out of the source sequence, right?  Well, not when the aggregate is Count. The LINQ Count aggregate takes a predicate as an argument not a value expression.  So I definitely don’t want to treat Count with an argument the same way I treat other aggregates.  For Count, I want to take that argument and turn it into a WHERE clause. I do that near the top of the BindAggregate function by tacking on a call to Queryable.Where right onto the source before I even translate it.

 

The next thing I do is actually create the AggregateExpression and stuff it into a column of a new SelectExpression. Whoa! Stop right there. Didn’t I want to avoid this?  Don’t I want this aggregate ending up alongside the group-by?  Yes, but not just yet.

 

What happens is that the source expression, once translated, already is a correlated sub-query!  How did that happen?  Take a moment to consider what the source of the aggregate really is.  Go ahead, I’ll wait.  Got it yet?   It’s the silly ‘g’.  Ugh.

 

Remember this query?

 

var query = db.Orders.GroupBy(o => o.CustomerID).Select(g => g.Max(o => o.OrderID));

 

Now do you see it?  The ‘g’ is the parameter to the Select operator.  Where does it come from? Yes, that’s right. It comes from the output of the GroupBy operator.  And what was that?  Yes, right again.  It was a sequence of Grouping<K,E> instances.  So a single ‘g’ is a single Grouping<K,E> instance.  Which itself is a collection of grouped items, which was formed using that basis query back in the BindGroupBy method, which was a self-join back to the original query with a join condition based on matching up grouping key expressions. Yes, that query, and yes, replicated here, in the context of an aggregate expression, it is indistinguishable from a correlated sub-query.  So I simply add my extra aggregate expression on top and, presto change-o, I am finished! 

 

Well, not really.  Skipping over this next bit about “isRoot” you can see that I create a SubqueryExpression.  I have not talked about this yet, but this is how I represent a true correlated sub-query. 

 

And here it is in all its majestic glory.

 

    internal class SubqueryExpression : Expression {

        SelectExpression select;

        internal SubqueryExpression(Type type, SelectExpression select)

            : base((ExpressionType)DbExpressionType.Subquery, type) {

            this.select = select;

        }

        internal SelectExpression Select {

            get { return this.select; }

        }

    }

 

Okay, that was a tad anti-climatic. Get over it.  It comes in handy later.

 

So what’s the rest of the method doing?  Ah, yes, you’ve seen it. This is where I look to see if I can tie the aggregate back to the group-by that it’s related to.  I do this by looking up the GroupByInfo based on the source projection; again that’s the ‘g’.  If the ‘g’ translates into the same expression that I created back in the BindGroupBy method then I know it’s the same one.  Back in that method I stored a memento of the group-by keyed by this very same expression, and now I’m pulling this information back up.

 

What you don’t see me doing here is actually going back and changing the SelectExpression that contains the group-by. Believe me; I really want to, but not just yet.  Instead, what I do now is drop a breadcrumb into the query tree so that I can later follow the trail in one sweeping tree rewriting operation. To do this, I have invented yet another expression node to act as this breadcrumb.

 

    internal class AggregateSubqueryExpression : Expression {

        string groupByAlias;

        Expression aggregateInGroupSelect;

        SubqueryExpression aggregateAsSubquery;

        internal AggregateSubqueryExpression(string groupByAlias, Expression aggregateInGroupSelect, SubqueryExpression aggregateAsSubquery)

            : base((ExpressionType)DbExpressionType.AggregateSubquery, aggregateAsSubquery.Type)

        {

            this.aggregateInGroupSelect = aggregateInGroupSelect;

            this.groupByAlias = groupByAlias;

            this.aggregateAsSubquery = aggregateAsSubquery;

        }

        internal string GroupByAlias { get { return this.groupByAlias; } }

        internal Expression AggregateInGroupSelect { get { return this.aggregateInGroupSelect; } }

        internal SubqueryExpression AggregateAsSubquery { get { return this.aggregateAsSubquery; } }

    }

 

What this node does is hold onto two possible futures. Either, the sweeping rewrite is going to move the aggregate expression back into the select with the group-by or its going to abandon all hope and fall back to the correlated sub-query approach.  That’s why this node holds onto two different expressions. One is the sub-query node I’ve already created.  The other is an expression that is fit and ready to become a citizen of the group.  I get this expression by simply rebinding the argument relative to the element expression I stuffed into the GroupByInfo when binding the GroupBy.  Now I stick both of the variations into a AggregateSubqueryExpression (reminding myself that unless I rewrite this the aggregate is going to be interpreted as a correlated subquery.)

 

What about ‘currentGroupElement’?  How does this factor in?  Recall that there was a form of GroupBy that took as an argument the result selector. This is not the form that C# converts query expressions into (but VB does when you use the Aggregate clause).  If the aggregate expression is being bound as part of the result selector then whatever expression I return from this method is going to end up in the same select expression with the GroupBy, no need to optimize later.  So if this is the case I simply have the method return the optimal aggregate expression and forget all about the sub-query nonsense. 

 

That just leaves us with ‘isRoot.’  Explaining isRoot is simple.  If the aggregate operator itself is the root of the tree, and in this case I mean with operator on top, or the ‘last’ operator, then it’s a very different kind of aggregate than normal. It is not associated with a group-by. It is a top-level aggregate. 

 

I get these when I type things like:

 

    db.Orders.Max(o => o.OrderID);

 

Now instead of returning a sequence of aggregates, I need only one for the whole sequence.  However, the query provider still gets back a sequence of rows from ADO and the ProjectionReader is still an IEnumerable of whatever the returned element type is.  In this case it is IEnumerable<int>.  Yet, I can clearly not return that IEnumerable as the result of the execution of the query. I need just a single int. So I need a way to tell the provider how to get the one true int out of the sequence.  That’s what I’m doing here.  I’ve added an extra lambda to the ProjectionExpression that tells the provider later how to aggregate the sequence back into a singleton. The ProjectionExpression already has a ‘projector’ that tells the provider how to turn DataReader rows into objects.  Now is also has an ‘aggregator’ that tells the provider how to turn the entire result set into a single value.  The aggregator I construct calls the Enumerable.Single method. 

 

That’s really how it all works, except for the real magic, the part where I wave a wand and polymorph the ugly tree into the beautiful one.  That’s the aggregate rewriter.

Aggregate Rewriting

 

Are you ready for it?  This is the part that does all the truly maligned and dishonest shuffling of bits.

 

    internal class AggregateRewriter : DbExpressionVisitor {

        ILookup<string, AggregateSubqueryExpression> lookup;

        Dictionary<AggregateSubqueryExpression, Expression> map;

 

        private AggregateRewriter(Expression expr) {

            this.map = new Dictionary<AggregateSubqueryExpression, Expression>();

            this.lookup = AggregateGatherer.Gather(expr).ToLookup(a => a.GroupByAlias);

        }

 

        internal static Expression Rewrite(Expression expr) {

            return new AggregateRewriter(expr).Visit(expr);

        }

 

        protected override Expression VisitSelect(SelectExpression select) {

            select = (SelectExpression)base.VisitSelect(select);

            if (lookup.Contains(select.Alias)) {

                List<ColumnDeclaration> aggColumns = new List<ColumnDeclaration>(select.Columns);

                foreach (AggregateSubqueryExpression ae in lookup[select.Alias]) {

                    string name = "agg" + aggColumns.Count;

                    ColumnDeclaration cd = new ColumnDeclaration(name, ae.AggregateInGroupSelect);

                    this.map.Add(ae, new ColumnExpression(ae.Type, ae.GroupByAlias, name));

                    aggColumns.Add(cd);

                }

                return new SelectExpression(select.Type, select.Alias, aggColumns, select.From, select.Where, select.OrderBy, select.GroupBy);

            }

            return select;

        }

 

        protected override Expression VisitAggregateSubquery(AggregateSubqueryExpression aggregate) {

            Expression mapped;

            if (this.map.TryGetValue(aggregate, out mapped)) {

                return mapped;

            }

            return this.Visit(aggregate.AggregateAsSubquery);

        }

 

        class AggregateGatherer : DbExpressionVisitor {

            List<AggregateSubqueryExpression> aggregates = new List<AggregateSubqueryExpression>();

            private AggregateGatherer() {

            }

 

            internal static List<AggregateSubqueryExpression> Gather(Expression expression) {

                AggregateGatherer gatherer = new AggregateGatherer();

                gatherer.Visit(expression);

                return gatherer.aggregates;

            }

 

            protected override Expression VisitAggregateSubquery(AggregateSubqueryExpression aggregate) {

                this.aggregates.Add(aggregate);

                return base.VisitAggregateSubquery(aggregate);

            }

        }

    }

 

That was it.  No, really.  That was the hard part.  Look closer, you’ll see where it actually does something.

 

The AggregateRewriter is actually two visitors; the primary rewriting visitor and a secondary gatherer that builds a collection of all the reachable aggregate subqueries.  See, I told you having this new node would come in handy.

 

What is actually going on is first all the aggregate expressions are gathered up and formed into a lookup table keyed off the group-by alias.  Then during the rewrite, when I get to the select expression with that alias I simply tack on all the aggregate expressions by inventing columns for them.  On the side I keep a mapping between all the original aggregate sub-queries and a column-expression that references the newly declared column.  A new select expression is created with the new set of columns, triggering a cascade of tree rebuilding (keeping everything nice and immutable.)  Then, when the aggregate sub-query is visited later, if it’s found in the map I simply replace it with the new column expression.

 

And now the aggregate expressions are in the right place.  Peace, harmony and all is well with the world.

 

I can see you are now impressed with my mad wizard-like skillz.

 

Let’s put it all together.  I add the aggregate rewriter to the pipeline of visitors we have in DbQueryProvider.Translate().

 

    private TranslateResult Translate(Expression expression) {

        ProjectionExpression projection = expression as ProjectionExpression;

        if (projection == null) {

            expression = Evaluator.PartialEval(expression, CanBeEvaluatedLocally);

            expression = QueryBinder.Bind(this, expression);

            expression = AggregateRewriter.Rewrite(expression);

            expression = OrderByRewriter.Rewrite(expression);

            expression = UnusedColumnRemover.Remove(expression);

            expression = RedundantSubqueryRemover.Remove(expression);

            projection = (ProjectionExpression)expression;

        }

        string commandText = QueryFormatter.Format(projection.Source);

        string[] columns = projection.Source.Columns.Select(c => c.Name).ToArray();

        LambdaExpression projector = ProjectionBuilder.Build(projection.Projector, projection.Source.Alias, columns);

        return new TranslateResult(commandText, projector, projection.Aggregator);

    }

 

Notice the new ‘aggregator’ being passed along in the result. The Execute method now makes use of it.

    private object Execute(TranslateResult query) {

        Delegate projector = query.Projector.Compile();

 

        if (this.log != null) {

            this.log.WriteLine(query.CommandText);

            this.log.WriteLine();

        }

 

        DbCommand cmd = this.connection.CreateCommand();

        cmd.CommandText = query.CommandText;

        DbDataReader reader = cmd.ExecuteReader();

 

        Type elementType = TypeSystem.GetElementType(query.Projector.Body.Type);

       

        IEnumerable sequence = (IEnumerable) Activator.CreateInstance(

            typeof(ProjectionReader<>).MakeGenericType(elementType),

            BindingFlags.Instance | BindingFlags.NonPublic, null,

            new object[] { reader, projector, this },

            null

            );

 

        if (query.Aggregator != null) {

            Delegate aggregator = query.Aggregator.Compile();

            AggregateReader aggReader = (AggregateReader) Activator.CreateInstance(

                typeof(AggregateReader<,>).MakeGenericType(elementType, query.Aggregator.Body.Type),

                BindingFlags.Instance | BindingFlags.NonPublic, null,

                new object[] { aggregator },

                null

                );

            return aggReader.Read(sequence);

        }

        else {

            return sequence;

        }

    }

Taking it for a Spin

 

Let’s give the provider a big soupy group-by ball of aggravation and see how it behaves.

 

    var query = from o in db.Orders

                group o by o.CustomerID into g

                select new {

                    Customer = g.Key,

                    Total = g.Sum(o => o.OrderID),

                    Min = g.Min(o => o.OrderID),

                    Avg = g.Average(o => o.OrderID)

                };

 

This query translates into:

 

SELECT t0.CustomerID, SUM(t0.OrderID) AS agg1, MIN(t0.OrderID) AS agg2, AVG(t0.OrderID) AS agg3

FROM Orders AS t0

GROUP BY t0.CustomerID

 

The results of execution are:

 

{ Customer = ALFKI, Total = 64835, Min = 10643, Avg = 10805 }

{ Customer = ANATR, Total = 42618, Min = 10308, Avg = 10654 }

{ Customer = ANTON, Total = 74195, Min = 10365, Avg = 10599 }

{ Customer = AROUT, Total = 139254, Min = 10355, Avg = 10711 }

...

{ Customer = WILMK, Total = 75650, Min = 10615, Avg = 10807 }

{ Customer = WOLZA, Total = 75595, Min = 10374, Avg = 10799 }

 

Don’t ask me why I think it is interesting to total up the OrderID’s.  I’m just weird.

 

That’s it.  Really.  This time I mean it.  I’m done.  If you want more take a look at the source code attached.  All the gory details are in there, along with a bazillion minor bug fixes I made since the last installment. 

 

DISCLAIMER:  The provided solution for GroupBy works well for many simple cases, such as when the aggregate expressions immediately follow the GroupBy operator. It is possible to write queries that produce the correlated sub-query form. This is normal. There is definitely still room for improvement.

 

Now, get out from under that shady tree and get back to writing code.  You know I am.

 

I often get asked how LINQ to SQL is supposed to be used with Test Driven Design (TDD).  Okay, not really.  People aren’t knocking on my door or calling me at 3:00 am.  I do, however, occasionally read developers angst on their personal blogs. It seems they are trying to actually do this, but are often confounded by the DataContext and its dearth of appropriate interfaces. Of course, my original knee-jerk reaction is to question why anyone would want or need to do this in the first place. Certainly, abstraction at a higher level of the application would be more appropriate, yada yada yada.  Eventually, my internal ranting ebbs and my practical side takes over. I start thinking like an engineer. How would I go about it? If only I’d added such fundamental interfaces such as IDataContext and ITable<T> before hitting RTM, all would be so much easier. Yet, TDD was not a priority.  It wasn’t even on the list of features that didn’t make the cut.  Still, how would I do it?  Then I start wishing I could override the DataContext’s methods and substitute my own logic.  Yet these methods are not virtual and cannot be overridden. Then with fitting irony I recall reading the other developer blogs that pointed this out too.

Of course, this only makes the problem that much more interesting and worthy of a good hack. I consider wrapping the DataContext in some other layer that looks exactly like it and abstract it that way, but then realize it would certainly trip the system up, especially deep in the query translation engine where it expects to find references to specific types.  Instead, the ideal solution would keep the DataContext the same, yet allow me to do something other than hitting the database when a query is executed. If only LINQ to SQL had a public provider model, I could simply plug a new one in and use it to intercept all interaction with the database. Oh, double irony, as there is no such provider model, at least not a public one.  Grin.

LINQ to SQL was actually designed to be host to more types of back-ends than just SQL server. It had a provider model targeted for RTM, but was disabled before the release.  Don’t ask me why.  Be satisfied to know that is was not a technical reason. Internally, it still behaves that way.  The trick is to find out how to swap in a new one when everything from the language to the runtime wants to keep you from doing it.

Fortunately, the DataContext has a nice little ‘provider’ instance variable just waiting to be overwritten.  A little bit of reflection can make quick work of that. The trouble is how to specify a new provider. The DataContext only talks to it through an interface (as it should), and yet that interface is internal to the LINQ to SQL assembly. The programming language won’t let you define your own implementation.  How do you go about implementing an interface that you can’t even say the name of in your source code?

Actually, I can think of two ways; 1) write a bunch of reflection emit code that generates an implementation at runtime or 2) trick the runtime into thinking some existing object implements the interface.  You can probably guess where I am going from here, as every good hack needs a good trick. Besides, a bunch of reflection emit code would be a lot more work.  Onward to the fun solution!

This is where CLR grand-interception-theory comes in; in the CLR you can intercept any interaction with any object, really, as long as it’s a method call and the object derives from MarshalByRef.  Actually, that’s not really true, you can intercept more than method calls, or at least they don’t start out being method calls, and they don’t necessarily need to be on only MarshalByRef objects. Still, not only do I want to intercept calls on an object, I want to make the object appear to implement an interface and intercept the calls on that interface. That’s a tall order, to be sure.  But it can be done.

The interception capability is the underpinnings of remoting (aka DCOM) support in the runtime.  I can use it to make an object masquerade as another object. The original intention was to enable client-side proxy objects to appear to implement the API of an object that only really exists on a server. The term ‘MarshalByRef’ refers to the DCOM behavior of marshalling a reference to the object from the server back to the client, such that calls on the client-side proxy are marshaled back to the server. It works by the JITer injecting specialized thunks into the code that identify and handle calls to these special dopplegangers. The really interesting thing to note is that interfaces in the runtime work nearly the same. They also have thunks that are capable of recognizing these proxies and acting accordingly; quite possibly because COM is so dependent on multitudes of interfaces.  However, regardless of the reason they exist, I can use this mechanism to wedge my own provider implementation into the mix.

What I first need to do is define a proxy object that will intercept these calls.  The remoting mechanism actually uses two different proxies, one that masquerades as the type (the transparent proxy) and one that receives the interception (the ‘real’ proxy.)  Both of these guys are intended to exist on the client. The real proxy is supposed to be the object that actually implements the marshalling behavior. My guess is that the only reason that I’m even allowed to implement my own real proxy is to enable marshalling over newer communication layers. Fortunately, I can use this proxy to simply act as an interceptor to do my bidding.

The next question I faced was what to do when I actually intercepted the calls.  Should I forward them on to some new grand public provider model?  That just seemed a bit over the top.  Instead, I chose to redirect the calls back to methods on the DataContext that can be overridden.  It was a quicker hack and introduces far fewer concepts to those already familiar with the DataContext.  And that’s really what you wanted all along, anyway, wasn’t it?

So I reveal to you, the new and shiny ExtensibleDataContext, one with a few new poorly named methods that you can actually override and implement yourself.

using System;

using System.Collections;

using System.Collections.Generic;

using System.Diagnostics;

using System.IO;

using System.Linq;

using System.Linq.Expressions;

using System.Text;

using System.Reflection;

 

using System.Runtime.Remoting;

using System.Runtime.Remoting.Activation;

using System.Runtime.Remoting.Proxies;

using System.Runtime.Remoting.Messaging;

using System.Runtime.Remoting.Services;

 

using System.Data;

using System.Data.Common;

using System.Data.Linq;

using System.Data.Linq.Mapping;

using System.Data.Linq.Provider;

 

namespace System.Data.Linq

{

    public class ExtensibleDataContext : DataContext

    {

        public ExtensibleDataContext(object connection, MappingSource mapping)

            : base("", mapping)

        {

            FieldInfo providerField = typeof(DataContext).GetField("provider", BindingFlags.Instance | BindingFlags.NonPublic);

            object proxy = new ProviderProxy(this).GetTransparentProxy();

            providerField.SetValue(this, proxy);

            this.Initialize(connection);

        }

 

        protected virtual void Initialize(object connection)

        {

        }

 

        private TextWriter LogImpl { get; set; }

        private DbConnection ConnectionImpl { get; set; }

        private DbTransaction TransactionImpl { get; set; }

        private int CommandTimeoutImpl { get; set; }

 

        protected internal virtual void ClearConnectionImpl()

        {

        }

 

        protected internal virtual void CreateDatabaseImpl()

        {

        }

 

        protected internal virtual void DeleteDatabaseImpl()

        {

        }

 

        protected internal virtual bool DatabaseExistsImpl()

        {

            return false;

        }

 

        protected internal virtual IExecuteResult ExecuteImpl(Expression query)

        {

            return new ExecuteResult(null);

        }

 

        protected class ExecuteResult : IExecuteResult

        {

            object value;

 

            public ExecuteResult(object value)

            {

                this.value = value;

            }

 

            public object GetParameterValue(int parameterIndex)

            {

                return null;

            }

 

            public object ReturnValue

            {

                get { return this.value; }

            }

 

            public void Dispose()

            {

                IDisposable d = this.value as IDisposable;

                if (d != null)

                    d.Dispose();

            }

        }

 

        protected internal virtual object CompileImpl(Expression query)

        {

            return null;

        }

 

        protected internal virtual IEnumerable TranslateImpl(Type elementType, DbDataReader reader)

        {

            return null;

        }

 

        protected internal virtual IMultipleResults TranslateImpl(DbDataReader reader)

        {

            return null;

        }

 

        protected internal virtual string GetQueryTextImpl(Expression query)

        {

            return null;

        }

 

        protected internal virtual DbCommand GetCommandImpl(Expression query)

        {

            return null;

        }

 

        public class ProviderProxy : RealProxy, IRemotingTypeInfo

        {

            ExtensibleDataContext dc;

 

            internal ProviderProxy(ExtensibleDataContext dc)

                : base(typeof(ContextBoundObject))

            {

                this.dc = dc;

            }

 

            public override IMessage Invoke(IMessage msg)

            {

                if (msg is IMethodCallMessage)

                {

                    IMethodCallMessage call = (IMethodCallMessage)msg;

                    if (call.MethodBase.DeclaringType.Name == "IProvider" && call.MethodBase.DeclaringType.IsInterface)

                    {

                        MethodInfo mi = typeof(ExtensibleDataContext).GetMethod(call.MethodBase.Name + "Impl", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.DeclaredOnly);

                        if (mi != null)

                        {

                            try

                            {

                                return new ReturnMessage(mi.Invoke(this.dc, call.Args), null, 0, null, call);

                            }

                            catch (TargetInvocationException e)

                            {

                                return new ReturnMessage(e.InnerException, call);

                            }

                        }

                    }

                }

                throw new NotImplementedException();

            }

 

            public bool CanCastTo(Type fromType, object o)

            {

                return true;

            }

 

            public string TypeName

            {

                get { return this.GetType().Name; }

                set { }

            }

        }

    }

}

 

The ExtensibleDataContext’s constructor has the job of overwriting the DataContext’s private ‘provider’ variable.  It creates a new ProviderProxy instance and assigns it to the private field using FieldInfo.SetValue().  The implementation of SetValue attempts to cast the object to the LINQ to SQL private interface IProvider. This succeeds because the function CanCastTo on the ProviderProxy returns true, allowing the proxy to be cast to any type.  After that, all interface calls on this object are rerouted to the Invoke method.  The implementation of Invoke simply calls the DataContext back, invoking methods with similar names.  These are left empty for you to override in your own derivation of ExtensibleDataContext.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Data.Linq;

using System.Data.Linq.Mapping;

using System.Text;

 

namespace MocksNix

{

    public class MyDataContext : ExtensibleDataContext

    {

        static MappingSource mapping = new AttributeMappingSource();

        public MyDataContext()

            : base("", mapping)

        {

        }

 

        public Table<Customer> Customers

        {

            get { return this.GetTable<Customer>(); }

        }

 

        protected internal override IExecuteResult ExecuteImpl(System.Linq.Expressions.Expression query)

        {

            this.Log.WriteLine("executing query: {0}", query);

            return new ExecuteResult(new Customer[] { });

        }

    }

 

    public class Customer

    {

        [Column(IsPrimaryKey = true)]

        public string CustomerId;

 

        [Column]

        public string ContactName;

    }

 

    class Program

    {

        static void Main(string[] args)

        {

            MyDataContext dc = new MyDataContext();

            var query = from c in dc.Customers where c.CustomerId == "X" select c;

            var list = query.ToList();

        }

    }

}

 

Now, I can use the ExtensibleDataContext in a small test program.  I create my own MyDataContext that implements ExecuteImpl().  This method gets called whenever a query needs to be executed.  Instead of executing the query, I write out a simple message and return an empty collection.

That’s it. Now take this bit of code and go forth and prosper.

DISCLAIMER: Overriding internal implementation details is not a practice recommend or supported by Microsoft. Implementation details are subject to change without warning.

But who cares!

Go on, mock LINQ to SQL all you want.

 

 

This is the nineth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts here's a handy list of all the fun you've been missing.

Complete list of posts in the Building an IQueryable Provider series 

It's now officially a trend that additional installments to this series take longer and longer to be produced.  Blame the television writer's strike, I do.    

Cleaning up the Mess

I've been promising for a while to show you how I'm going to go about cleaning up the unnecessary layers of nested select expressions that my query translator has been accumulating. It's easy for a human brain to look at a query and realize that it could be written a lot simpler. However, its a lot easier for a computer program to just keep piling the layers on, after all the semantics are the same and the boon we get from keeping the program simple is nothing to sneer at.

It's easy to see the problem in a simple query with a where clause.

    from c in db.Customers
    where c.Country == "UK"
    select c;

This innocuous query turns into the following SQL:

SELECT t1.Country, t1.CustomerID, t1.ContactName, t1.Phone, t1.City
FROM (
  SELECT t0.Country, t0.CustomerID, t0.ContactName, t0.Phone, t0.City
  FROM Customers AS t0
) AS t1
WHERE (t1.Country = 'UK')

Why the extra SELECT?  It's easy to see why it happens when you know how the translation works and what the underlying LINQ query really is.

The LINQ query's method call syntax really looks like this: 

    db.Customers.Where(c => c.Country == "UK").Select(c => c);

It has two LINQ query operators, Where() and Select().  My translation engine in the SqlBinder class translates both of these method calls into two separate SelectExpression's.

Ideally, the SQL query would have looked like this:

SELECT t0.Country, t0.CustomerID, t0.ContactName, t0.Phone, t0.City
FROM Customers AS t0
WHERE (t0.Country = 'UK')

However, that's just the easy case.  It gets worse as more operators are added.  Did you think the translator was smart enough to merge multiple Where clauses together?  I certainly did not add any code for that.  It would be nice if the language compiler did it for me, but what about the case where additional Where() operators are added conditionally after the base query is already formed?

var query = 
    from c in db.Customers
    where c.Country == "UK"
    select c;
...
query = from c in query
        where c.Phone == "555-5555"
        select c;

This becomes a triple layer monstrosity, which would only be good if it were a sandwich.

SELECT t2.CustomerID, t2.ContactName, t2.Phone, t2.City, t2.Country
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.Country = 'UK')
) AS t2
WHERE (t2.Phone = '555-5555')

And its not just the layering either.  What happens when when I try to project out a subset of the data? 

var query = 
    from c in db.Customers
    where c.Country == "UK"
    select c.CustomerID;

This becomes the following:

SELECT t2.CustomerID
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.Country = 'UK')
) AS t2

Why oh why do the nested queries keep reselecting data that's never being used?  Hopefully the database engine is smart enough not to pipeline all that data when most of its never even referred to in the query or returned to the client.  Yet, wouldn't it be a whole lot nicer if the query translator could reduce this madness down to a simpler form that an actual human might write?  Then maybe someone would be able to make head or tails out of this one:

var query = from c in db.Customers
            join o in db.Orders on c.CustomerID equals o.CustomerID
            let m = c.Phone
            orderby c.City
            where c.Country == "UK"
            where m != "555-5555"
            select new { c.City, c.ContactName } into x
            where x.City == "London"
            select x;

I'm not even going to show you this one yet, as it might frighten you to the point of powering down the computer.

What I am going to do is show you what I've done about it, how I rolled up my sleeves and wrote some code that saves the day.  It turned out not to be too horribly difficult.  I had imagined working with the immutable expression tree would become more and more complex as the desired transformations became more and more interesting, given how complicated the order-by rewriter seemed to get.  However, I was pleasantly surprised to find out that this clean-up logic was in fact turning out to be actually sort of clean.

 

Removing Redudant Subqueries

I first wanted to tackle how to get rid of redundant subqueries.  For example, an identity select does absolutely nothing, yet its adds a whole new layer.  What I needed was a way to remove select expressions that do nothing interesting.  Of course, the first thing I needed to do was decide what made an expression redundant.  Obviously, a select expression is redundant if it does not actually do anything at all, except for re-selecting columns from another select expression. Other select expressions would seem redundant if the only thing they do are operations that could have been combined with those of another select expression.

The next thing I needed to do was to figure out how to actually remove a select expression from an expression tree and end up with a tree that is actually legal.  This step had an ominous feel that I was going to end up devising something even more arcane than the order-by rewriter.  I was surprised that it turned out to be so simple.  Take a look.

    internal class SubqueryRemover : DbExpressionVisitor
    {
        HashSet<SelectExpression> selectsToRemove;
        Dictionary<string, Dictionary<string, Expression>> map;

        public Expression Remove(SelectExpression outerSelect, params SelectExpression[] selectsToRemove)
        {
            return Remove(outerSelect, (IEnumerable<SelectExpression>)selectsToRemove);
        }

        public Expression Remove(SelectExpression outerSelect, IEnumerable<SelectExpression> selectsToRemove)
        {
            this.selectsToRemove = new HashSet<SelectExpression>(selectsToRemove);
            this.map = selectsToRemove.ToDictionary(d => d.Alias, d => d.Columns.ToDictionary(d2 => d2.Name, d2 => d2.Expression));
            return this.Visit(outerSelect);
        }

        protected override Expression VisitSelect(SelectExpression select)
        {
            if (this.selectsToRemove.Contains(select))
            {
                return this.Visit(select.From);
            }
            else
            {
                return base.VisitSelect(select);
            }
        }

        protected override Expression VisitColumn(ColumnExpression column)
        {
            Dictionary<string, Expression> nameMap;
            if (this.map.TryGetValue(column.Alias, out nameMap))
            {
                Expression expr;
                if (nameMap.TryGetValue(column.Name, out expr))
                {
                    return this.Visit(expr);
                }
                throw new Exception("Reference to undefined column");
            }
            return column;
        }
    }

That's it.  This is a nice little class that will rewrite an expression tree and remove one or more select expressions from it, automatically fixing up all references to columns that go away. Looking at the two visit methods, the code looks trivial. When I see a select that's one of ones to be removed, I simply throw it away by returning its 'from' expression. When I see a column expression that is referencing a column that is declared in a select expression that is removed, I substitute the expression used in the declaration for the reference.  As it turns out the most interesting piece of the whole visitor is figuring out the set of columns that are going away, and this is determined using a nice little LINQ query in the Remove method to construct a dictionary of dictionaries that tells me this.

Now all that's left is to write some code that actually figures out which subqueries are the redundant ones.

And here that is. 

   internal class RedundantSubqueryRemover : DbExpressionVisitor
    {
        internal Expression Remove(Expression expression)
        {
            return this.Visit(expression);
        }

        protected override Expression VisitSelect(SelectExpression select)
        {
            select = (SelectExpression)base.VisitSelect(select);

            // first remove all purely redundant subqueries
            List<SelectExpression> redundant = new RedundantSubqueryGatherer().Gather(select.From);
            if (redundant != null)
            {
                select = (SelectExpression)new SubqueryRemover().Remove(select, redundant);
            }

            // next attempt to merge subqueries

            // can only merge if subquery is a single select (not a join)
            SelectExpression fromSelect = select.From as SelectExpression;
            if (fromSelect != null)
            {
                // can only merge if subquery has simple-projection (no renames or complex expressions)
                if (HasSimpleProjection(fromSelect))
                {
                    // remove the redundant subquery
                    select = (SelectExpression)new SubqueryRemover().Remove(select, fromSelect);
                    // merge where expressions 
                    Expression where = select.Where;
                    if (fromSelect.Where != null)
                    {
                        if (where != null)
                        {
                            where = Expression.And(fromSelect.Where, where);
                        }
                        else
                        {
                            where = fromSelect.Where;
                        }
                    }
                    if (where != select.Where)
                    {
                        return new SelectExpression(select.Type, select.Alias, select.Columns, select.From, where, select.OrderBy);
                    }
                }
            }

            return select;
        }

        private static bool IsRedudantSubquery(SelectExpression select)
        {
            return HasSimpleProjection(select)
                && select.Where == null
                && (select.OrderBy == null || select.OrderBy.Count == 0);
        }

        private static bool HasSimpleProjection(SelectExpression select)
        {
            foreach (ColumnDeclaration decl in select.Columns)
            {
                ColumnExpression col = decl.Expression as ColumnExpression;
                if (col == null || decl.Name != col.Name)
                {
                    // column name changed or column expression is more complex than reference to another column
                    return false;
                }
            }
            return true;
        }

        class RedundantSubqueryGatherer : DbExpressionVisitor
        {
            List<SelectExpression> redundant;

            internal List<SelectExpression> Gather(Expression source)
            {
                this.Visit(source);
                return this.redundant;
            }

            protected override Expression VisitSelect(SelectExpression select)
            {
                if (IsRedudantSubquery(select))
                {
                    if (this.redundant == null)
                    {
                        this.redundant = new List<SelectExpression>();
                    }
                    this.redundant.Add(select);
                }
                return select;
            }
        }
    }

The RedundantSubqueryRemover is a bit more involved that the SubqueryRemover, but it basically has a simple algorithm.  When it examines a given select expression it tries to determine if one or more sub-select's are redundant. To determine this set it uses anothger visitor, the RedundantSubqueryGatherer, which builds a list of redundant subqueries that can be reached without recursing down into any 'from' expressions. This allows it to consider all the sub select expressions that are in scope to the parent select expression, seeing through any join expression nodes that may exist. Once I have this list, I just use the SubqueryRemover to remove them.

Following that, I look to see if any select expressions can be merged together.  The only interesting thing to consider at this time is whether a subquery would be considered redundant except for a where expression that can easily be combined into the outer select expression. If I find one of these, I go ahead and remove the subquery and add its where expression to the outer select expression.  Yes, I'm surprised too that it works out that easily.

 

Removing Unused Columns

The second part of clean up is to get rid of unused columns.  If I project into a smaller set of columns and don't even reference some of the others then why keep them in the query at all?  It might not matter to the semantics of the query and it might not even make the query faster, but it will certainly be easier on the eyes, and it might make it possible to discover more redundant subqueries that might have slipped by simply because the subquery computed a column that is later ignored.

Of course, this turns out to be a lot more complicated than the other two, so I saved it for last.

    internal class UnusedColumnRemover : DbExpressionVisitor
    {
        Dictionary<string, HashSet<string>> allColumnsUsed;

        internal Expression Remove(Expression expression)
        {
            this.allColumnsUsed = new Dictionary<string, HashSet<string>>();
            return this.Visit(expression);
        }

        protected override Expression VisitColumn(ColumnExpression column)
        {
            HashSet<string> columns;
            if (!this.allColumnsUsed.TryGetValue(column.Alias, out columns))
            {
                columns = new HashSet<string>();
                this.allColumnsUsed.Add(column.Alias, columns);
            }
            columns.Add(column.Name);
            return column;
        }

        protected override Expression VisitSelect(SelectExpression select)
        {
            // visit column projection first
            ReadOnlyCollection<ColumnDeclaration> columns = select.Columns;

            HashSet<string> columnsUsed;
            if (this.allColumnsUsed.TryGetValue(select.Alias, out columnsUsed))
            {
                List<ColumnDeclaration> alternate = null;
                for (int i = 0, n = select.Columns.Count; i < n; i++)
                {
                    ColumnDeclaration decl = select.Columns[i];
                    if (!columnsUsed.Contains(decl.Name))
                    {
                        decl = null;  // null means it gets omitted
                    }
                    else
                    {
                        Expression expr = this.Visit(decl.Expression);
                        if (expr != decl.Expression)
                        {
                            decl = new ColumnDeclaration(decl.Name, decl.Expression);
                        }
                    }
                    if (decl != select.Columns[i] && alternate == null)
                    {
                        alternate = new List<ColumnDeclaration>();
                        for (int j = 0; j < i; j++)
                        {
                            alternate.Add(select.Columns[j]);
                        }
                    }
                    if (decl != null && alternate != null)
                    {
                        alternate.Add(decl);
                    }
                }
                if (alternate != null)
                {
                    columns = alternate.AsReadOnly();
                }
            }

            ReadOnlyCollection<OrderExpression> orderbys = this.VisitOrderBy(select.OrderBy);
            Expression where = this.Visit(select.Where);
            Expression from = this.Visit(select.From);

            if (columns != select.Columns || orderbys != select.OrderBy || where != select.Where || from != select.From)
            {
                return new SelectExpression(select.Type, select.Alias, columns, from, where, orderbys);
            }

            return select;
        }

        protected override Expression VisitProjection(ProjectionExpression projection)
        {
            // visit mapping in reverse order
            Expression projector = this.Visit(projection.Projector);
            SelectExpression source = (SelectExpression)this.Visit(projection.Source);
            if (projector != projection.Projector || source != projection.Source)
            {
                return new ProjectionExpression(source, projector);
            }
            return projection;
        }

        protected override Expression VisitJoin(JoinExpression join)
        {
            // visit join in reverse order
            Expression condition = this.Visit(join.Condition);
            Expression right = this.VisitSource(join.Right);
            Expression left = this.VisitSource(join.Left);
            if (left != join.Left || right != join.Right || condition != join.Condition)
            {
                return new JoinExpression(join.Type, join.Join, left, right, condition);
            }
            return join;
        }
    }

The reason is turns out to be so much more complicated is that in order to figure out what columns are used I have to examine the places where they are used before I get to the place where they are defined because that's where I need to remove them, which is exactly opposite of how the visitors normally flow through the expression tree, so instead of simply override a few pieces of a visitor I have to re-specify visit methods I could have otherwise ignored.

When I get to a select expression I have to examine the column declarations, where expression and order-by expressions before I recurse down into the from expression, because all of these expression may reference columns declared in the sub queries.  The first thing I do is grovel through the set of column declarations and remove the ones that are not referenced by the layer above. The top level select expression's column expression are, of course, reference by the projection expression. Most of the code in VisitSelect is just the work of reassembling this list of declarations. The rest is pretty much boiler plate from the base DbExpressionVisitor, except in a reverse order.

 

Putting It to the Test

The last thing to do is to wire these new clean-up visitors into the query translation process.  The appropriate place to do this is where all the other top level visitors are plugged in, the DbQueryProvider class.

    public class DbQueryProvider : QueryProvider {
... private TranslateResult Translate(Expression expression) { ProjectionExpression projection = expression as ProjectionExpression; if (projection == null) { expression = Evaluator.PartialEval(expression, CanBeEvaluatedLocally); expression = new QueryBinder(this).Bind(expression); expression = new OrderByRewriter().Rewrite(expression); expression = new UnusedColumnRemover().Remove(expression); expression = new RedundantSubqueryRemover().Remove(expression); projection = (ProjectionExpression)expression; } string commandText = new QueryFormatter().Format(projection.Source); LambdaExpression projector = new ProjectionBuilder().Build(projection.Projector, projection.Source.Alias); return new TranslateResult { CommandText = commandText, Projector = projector }; }
... }

I just added them in right after the OrderByRewriter, allowing unused columns to be removed first in case doing so helps the redundant subquery removal.

Now all I have to do is try it out. 

How about that huge scary query?  How bad could it be?

var query = from c in db.Customers
            join o in db.Orders on c.CustomerID equals o.CustomerID
            let m = c.Phone
            orderby c.City
            where c.Country == "UK"
            where m != "555-5555"
            select new { c.City, c.ContactName } into x
            where x.City == "London"
            select x;

Without the new additions the translation is horrific.

SELECT t9.City, t9.ContactName
FROM (
  SELECT t8.City, t8.ContactName
  FROM (
    SELECT t7.CustomerID, t7.ContactName, t7.Phone, t7.City, t7.Country, t7.OrderID, t7.CustomerID1, t7.OrderDate
    FROM (
      SELECT t6.CustomerID, t6.ContactName, t6.Phone, t6.City, t6.Country, t6.OrderID, t6.CustomerID1, t6.OrderDate
      FROM (
        SELECT t5.CustomerID, t5.ContactName, t5.Phone, t5.City, t5.Country, t5.OrderID, t5.CustomerID1, t5.OrderDate
        FROM (
          SELECT t4.CustomerID, t4.ContactName, t4.Phone, t4.City, t4.Country, t4.OrderID, t4.CustomerID1, t4.OrderDate
          FROM (
            SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country, t3.OrderID, t3.CustomerID AS CustomerID1, t3.OrderDate
            FROM (
              SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
              FROM Customers AS t0
            ) AS t1
            INNER JOIN (
              SELECT t2.OrderID, t2.CustomerID, t2.OrderDate
              FROM Orders AS t2
            ) AS t3
              ON (t1.CustomerID = t3.CustomerID)
          ) AS t4
        ) AS t5
      ) AS t6
      WHERE (t6.Country = 'UK')
    ) AS t7
    WHERE (t7.Phone <> '555-5555')
  ) AS t8
) AS t9
WHERE (t9.City = 'London')
ORDER BY t9.City

When I gather the courage to look I start to worry about exceeding some maximum query size sending this to the server.  Clearly, if the table had a more realistic number of columns the size would explode comparatively.

Now take a look at the query produced when the new clean-up code is enabled.

SELECT t0.City, t0.ContactName
FROM Customers AS t0
INNER JOIN Orders AS t2
  ON (t0.CustomerID = t2.CustomerID)
WHERE (((t0.Country = 'UK') AND (t0.Phone <> '555-5555')) AND (t0.City = 'London'))
ORDER BY t0.City

How sweet is that?  It's actually easier to read than the original LINQ query.

Sometimes I do amaze even myself.

Check it out and see if you know the answer.  Rico's Performance Tidbits

I am going to tell you something that will disturb you. You might laugh, but it will be a cold uncertain laugh that will haunt you as you read on, because somewhere deep down you'll know it to be true. You might brush it off, get on with your day, yet sometime later, a week or a year, it will seep back in and unsettle you to the core. From that moment on you will be changed. You will think different, act different and will fundamentally be different. So take a moment to prepare yourself now, breath deeply, clear your mind and open up to the possibility that building software is hard.

It really is. It's harder then you've ever expected or even experienced. For most people on the planet it is actually impossible, for most of the rest is next to impossible. Only a rare few are even capable of taking on the challenge and most of those are simply in denial. As a species, humans excel at denial. It is built into how our brains work. For the most part we are not physically capable of acknowledging even a simple fact if that fact in some way threatens us. Instead, we adopt elaborate delusions that protect our egos. When it happens to a single person we tend to think of them as crazy, eccentric or quaint. When it happens to a large group we think of them as radicals. When it happens to most of us, we think of it as normal.

Evidence to the contrary enrages us. We react by constructing even more elaborate delusions to surround and trap these pieces of heresy. It happens with our religions, our politics, and our social lives. It even happens with our work. Our careers threaten us more than anything, so naturally we invent our best defenses. Having trouble with your latest review? Your manager is a jerk. Having trouble with all your reviews? There is a worldwide conspiracy against people like yourself. Your product has no appreciable market share? The other company is a monopolist. Can't get a handle around building software? You are using the wrong methodology.

In fact, software development methodologies are one of the biggest delusions of all. They help keep us sane by protecting us from the truth that building software is hard. They delude us into thinking that by following a preset script we will end up successful, that even the biggest software challenges can be overcome by engaging in a widely varying set of patterns and practices, that somehow if we all just planned better with a more precise schedule or gave ourselves the freedom to think big and react quickly the actual problems of finding the right solutions and building the software will take care of themselves. Yet, its just not true. On some level we know its just not true. Every project always ends the same, with some piece of something produced and everyone bemoaning how horrible the process was or how intolerable or inflexible.  Instead of realizing that it is just hard and we are not really good at doing it, we distract ourselves by diving deep into naval contemplation, trying to devise improvements for the process for the next go around. However, we might as well have gone out and gotten new haircuts for all the good it does us.

The truth is that most of us can't really build software well, no matter the system. A few among us, however, can with apparent ease accomplish just about any development task, to any degree or complexity, in spite of whichever methodology is in use. We deny this too because it upsets the belief that its the methodology, not ourselves, that is really at the heart of the matter. We think of these folks as renegades or cowboys. We build up even bigger defenses to minimize the damage and marginalize their impact. We tell them they can't work on the entire project, they must pick some smaller piece befitting a mere mortal. We wall them off into little rooms and let them build, and then we fear what they produce, because it always ends up grander, more complete and more compelling than all the work done by everyone else combined. Often their work surpasses even the most ambitious dream. Yet we react to it by squelching it using backroom politics, voting it off the island.

The truth is, that if we could only harness the power of these exceptionally brilliant few in a meaningful way we would be ahead of the game. Instead of devising systems to distract ourselves into believing we all can do it, why not acknowledge the fact that really only these few know what the hell is going on and build a system to basically keep them functioning at prime efficiency? Instead of textbooks lauding the latest practices to keep our armies of developers in sync and on track, we should be pull the men out of the silos and teach them how to be a support team for the guy or gal in the center.

In reality, the really successful product teams already function this way, the ones that build the cutting edge technology that wows the customers and is adopted with great fanfare. Yet, they are often functioning this way by accident. It can only happen in the wild when one of these genius few turns out to also be good at promoting his or her ideas to the degree that management gets fooled into following along, and before even a team can be built around it the core 80% of the product is already designed, developed and ready for beta testing.

What we need is a system that will foster this behavior. The hard part is identifying the right model for the "rest of us," so we instinctively know what to do. We are so used to expecting equal treatment it will be difficult for most to adopt a different way of working. However, it is not inconceivable that a group of primates like ourselves can find an optimal organization that will produce the desired effect. You might say it is more instinctive that you would otherwise imagine. In fact, it may be so instinctive it would be considered primal.

Take gorillas, for example, another set of primates that most certainly operate in this manner. Every group has a dominant ape. The others are often found grooming the prime ape, keeping him at his best. If we switched to this model, we could easily have one dominant developer and the rest would basically be grooms, reviewing code, fixing bugs, writing design docs, running fxcop and polycheck, getting coffee. In fact, some could literally be cutting hair.

The possibilities are endless.

I hope someday you will come to realize the inevitable truth, stop wasting your time refining pointless process minutiae and accept that our ancestors had it right all along.

This is the eighth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might want to shut off the Halo 3 and get crackin'.

Complete list of posts in the Building an IQueryable Provider series 

This post has a been a few weeks in coming. I realize that many of you have been anxiously awaiting the next installment. Your custom LINQ providers have been left alone collecting dust when they'd really rather be out on the town dazzling passers-by.

Implementing OrderBy

Today's topic is translating those order-by clauses.  Fortunately, there is only one way to do ordering, and that's using the LINQ ordering specific query operators. The bad news is that there are four different operators. 

When I write a query using the query syntax ordering looks simple.  It's just one clause.

    var query = from c in db.Customers
                orderby c.Country, c.City
                select c;

However, when that syntax is translated into LINQ there is more than one LINQ operator involved.

    var query = db.Customers.OrderBy(c => c.Country).ThenBy(c => c.City);

In fact, there is one ordering operator per ordering expression specified.  So LINQ providers translating to SQL will need to convert this pattern of individual operators back into a single clause.  The code to do this is going to be just a wee bit more involved than the previous operator translations, primarily because it needs to gather all these expressions up before it can do anything with them.  The prior operators had the luxury of being able to simply build a new outer select layer over the previous query accumulation.  They only ever had to consider their immediate arguments.  Ordering, instead, must delve deep into the otherness and understand more than just itself. 

The first thing I'm going to need is a way to represent the order-by clause.  The simplest thing to do is simply add this ordering stuff right into the SelectExpression I already have.  Yet, since each ordering expression can have a direction, ascending or descending, I'm going to need to remember that too.

So I'm going to add these new definitions:

    internal enum OrderType {
        Ascending,
        Descending
    }

    internal class OrderExpression {
        OrderType orderType;
        Expression expression;
        internal OrderExpression(OrderType orderType, Expression expression) {
            this.orderType = orderType;
            this.expression = expression;
        }
        internal OrderType OrderType {
            get { return this.orderType; }
        }
        internal Expression Expression {
            get { return this.expression; }
        }
    }

The new type 'OrderExpression' is not really an Expression node, since I'm not expecting to be able to stick one anywhere in the tree.  They will only every appear in a SelectExpression as part of its definition.  Therefore, SelectExpression needs to change a little bit.

    internal class SelectExpression : Expression {
... ReadOnlyCollection<OrderExpression> orderBy; internal SelectExpression( Type type, string alias, IEnumerable<ColumnDeclaration> columns, Expression from, Expression where, IEnumerable<OrderExpression> orderBy) : base((ExpressionType)DbExpressionType.Select, type) {
... this.orderBy = orderBy as ReadOnlyCollection<OrderExpression>; if (this.orderBy == null && orderBy != null) { this.orderBy = new List<OrderExpression>(orderBy).AsReadOnly(); } }
... internal ReadOnlyCollection<OrderExpression> OrderBy { get { return this.orderBy; } } }

Of course, I need to update DbExpressionVisitor to know about this ordering stuff:

    internal class DbExpressionVisitor : ExpressionVisitor {
... protected virtual Expression VisitSelect(SelectExpression select) { Expression from = this.VisitSource(select.From); Expression where = this.Visit(select.Where); ReadOnlyCollection<ColumnDeclaration> columns = this.VisitColumnDeclarations(select.Columns); ReadOnlyCollection<OrderExpression> orderBy = this.VisitOrderBy(select.OrderBy); if (from != select.From || where != select.Where || columns != select.Columns || orderBy != select.OrderBy) { return new SelectExpression(select.Type, select.Alias, columns, from, where, orderBy); } return select; }
...
protected ReadOnlyCollection<OrderExpression> VisitOrderBy(ReadOnlyCollection<OrderExpression> expressions) { if (expressions != null) { List<OrderExpression> alternate = null; for (int i = 0, n = expressions.Count; i < n; i++) { OrderExpression expr = expressions[i]; Expression e = this.Visit(expr.Expression); if (alternate == null && e != expr.Expression) { alternate = expressions.Take(i).ToList(); } if (alternate != null) { alternate.Add(new OrderExpression(expr.OrderType, e)); } } if (alternate != null) { return alternate.AsReadOnly(); } } return expressions; } }

And I have to fix up the other places in the code that I construct SelectExpressions, but that's relatively easy. 

Converting order-by clauses back to text is also not so bad.

    internal class QueryFormatter : DbExpressionVisitor {
... protected override Expression VisitSelect(SelectExpression select) {
... if (select.OrderBy != null && select.OrderBy.Count > 0) { this.AppendNewLine(Indentation.Same); sb.Append("ORDER BY "); for (int i = 0, n = select.OrderBy.Count; i < n; i++) { OrderExpression exp = select.OrderBy[i]; if (i > 0) { sb.Append(", "); } this.Visit(exp.Expression); if (exp.OrderType != OrderType.Ascending) { sb.Append(" DESC"); } } }
... }
... }

The heavy lifting comes in the QueryBinder where I need to build up the ordering clause out of these method calls.  What I decided to do was build up a list of these ordering expressions and evaluate them all at once.  Since ThenBy and ThenByDescending operators must follow other ordering operators, its easy to walk down the tree, throwing each of these in a collection until I reach the root of the order-by clause, a call to OrderBy or OrderByDescending.

    internal class QueryBinder : ExpressionVisitor {







...
        protected override Expression VisitMethodCall(MethodCallExpression m) {
            if (m.Method.DeclaringType == typeof(Queryable) ||
                m.Method.DeclaringType == typeof(Enumerable)) {
... switch (m.Method.Name) { case "OrderBy": return this.BindOrderBy(m.Type, m.Arguments[0], (LambdaExpression)StripQuotes(m.Arguments[1]), OrderType.Ascending); case "OrderByDescending": return this.BindOrderBy(m.Type, m.Arguments[0], (LambdaExpression)StripQuotes(m.Arguments[1]), OrderType.Descending); case "ThenBy": return this.BindThenBy(m.Arguments[0], (LambdaExpression)StripQuotes(m.Arguments[1]), OrderType.Ascending); case "ThenByDescending": return this.BindThenBy(m.Arguments[0], (LambdaExpression)StripQuotes(m.Arguments[1]), OrderType.Descending); } }
... }
List<OrderExpression> thenBys; protected virtual Expression BindOrderBy(Type resultType, Expression source, LambdaExpression orderSelector, OrderType orderType) { List<OrderExpression> myThenBys = this.thenBys; this.thenBys = null; ProjectionExpression projection = (ProjectionExpression)this.Visit(source); this.map[orderSelector.Parameters[0]] = projection.Projector; List<OrderExpression> orderings = new List<OrderExpression>(); orderings.Add(new OrderExpression(orderType, this.Visit(orderSelector.Body))); if (myThenBys != null) { for (int i = myThenBys.Count - 1; i >= 0; i--) { OrderExpression tb = myThenBys[i]; LambdaExpression lambda = (LambdaExpression)tb.Expression; this.map[lambda.Parameters[0]] = projection.Projector; orderings.Add(new OrderExpression(tb.OrderType, this.Visit(lambda.Body))); } } string alias = this.GetNextAlias(); ProjectedColumns pc = this.ProjectColumns(projection.Projector, alias, projection.Source.Alias); return new ProjectionExpression( new SelectExpression(resultType, alias, pc.Columns, projection.Source, null, orderings.AsReadOnly()), pc.Projector ); } protected virtual Expression BindThenBy(Expression source, LambdaExpression orderSelector, OrderType orderType) { if (this.thenBys == null) { this.thenBys = new List<OrderExpression>(); } this.thenBys.Add(new OrderExpression(orderType, orderSelector)); return this.Visit(source); } ... }

When a call to BindThenBy is made (used for ThenBy's and ThenByDescending's), the call's arguments are just appended to a growing list of then-by info. I re-use the OrderExpression to store the then-by's since its the same layout. Later, when BindOrderBy is called, the binding logic can then bind everything and build up single SelectExpression.  Note that when I do bind the then-by's, I iterate over the collection in reverse since the then-by's were collected backward.

Now, with all that in place, everything should be ready to go.

Giving it a whirl, I see that the query:

    var query = from c in db.Customers
                orderby c.Country, c.City
                select c;

Is translated into:

    SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
    FROM (
      SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
      FROM Customers AS t0
    ) AS t1
    ORDER BY t1.Country, t1.City

Which is exactly what I wanted. Wahoo!

 

Unfortunately, that's not the end of the story.  You probably knew I was going to say that.  You were probably thinking, hey, it seems he's about wrapped up on the feature yet there is a whole bunch more text to read. He can't possibly be done. There must be something wrong. He's trying to trick me.  It's like one of those puzzle questions.  Arrg!  I hate puzzle questions.

Yes, there is something definitely wrong. In this example, ordering appeared to work fine. The translator translated the query, the server ate it up and presto, results ordered as requested. The problem lies in other potential examples. The truth is that LINQ is a bit more free with ordering than SQL. As its stands now, a slightly different example would generate SQL text that SQL databases would choke on.

LINQ allows me to put ordering expressions anywhere I please. SQL is very restrictive. There are a few exceptions, but mostly I'm constrained to having only a single order-by clause on the outer-most SQL Select statement. Just like in my example above. Yet what would happen if the order-by came earlier? What if I piled on some more LINQ operators that logically came after the ordering?

What if instead I had this query.

    var query = from c in db.Customers
                orderby c.City
                where c.Country == "UK"
                select c;

It's very similar to the one before, yet I now have a where clause after the orderby.  I can't do that with SQL.  Even if I could, what kind of SQL would my provider generate?

    SELECT t2.City, t2.Country, t2.CustomerID, t2.ContactName, t2.Phone
    FROM (
      SELECT t1.City, t1.Country, t1.CustomerID, t1.ContactName, t1.Phone
      FROM (
        SELECT t0.City, t0.Country, t0.CustomerID, t0.ContactName, t0.Phone
        FROM Customers AS t0
      ) AS t1
      ORDER BY t1.City
    ) AS t2
    WHERE (t2.Country = 'UK')

Egads!  That's definitely not going to fly. Aside from the fact that this query is getting out of control size-wise, the order-by is now nested and that should not happen. At least if I don't want the user getting exceptions anytime they deviate from the basic ho-hum query.

It even fails when I add a simple projection.

    var query = from c in db.Customers
                orderby c.City
                select new { c.Country, c.City, c.ContactName };

This translates with the same problem.

    SELECT t2.Country, t2.City, t2.ContactName
    FROM (
      SELECT t1.City, t1.Country, t1.ContactName, t1.CustomerID, t1.Phone
      FROM (
        SELECT t0.City, t0.Country, t0.ContactName, t0.CustomerID, t0.Phone
        FROM Customers AS t0
      ) AS t1
      ORDER BY t1.City
    ) AS t2

Clearly something needs to be done.  The question is, what?

<insert dramatic pause>

Of course, I already have a solution in case you were wondering. It pretty much entails fixing up the query tree to abide by SQL's rules about ordering.  This means lifting the order-by expressions out from where they don't belong and adding them back where they do. 

Which is not going to be easy.  The query tree, based on LINQ expression nodes is immutable, meaning it cannot be changed. That's not the hard part because fortunately the visitors were written to recognize change automatically and build me a new immutable tree.  The hard part is going to be making sure the table aliases all match up correctly and dealing with order-by's that reference columns that no longer exist in the output. 

Looks like the real heavy lifting has not even started yet.

 

Reordering for SQL's sake

So how did I do it?  I basically wrote another visitor that just deals with moving the order-by's around the tree. I've tried to simplify it as much as possible, but in the end its still a complicated piece of code.  I could have attempted to integrate this rewriting into the binder itself, but that would have only added as much complication to an already involved process.  It was better to separate it out so did not add confusion to the rest.

Take a look.

    /// <summary>
    /// Move order-bys to the outermost select
    /// </summary>
internal class OrderByRewriter : DbExpressionVisitor { IEnumerable<OrderExpression> gatheredOrderings; bool isOuterMostSelect; public OrderByRewriter() { } public Expression Rewrite(Expression expression) { this.isOuterMostSelect = true; return this.Visit(expression); } protected override Expression VisitSelect(SelectExpression select) { bool saveIsOuterMostSelect = this.isOuterMostSelect; try { this.isOuterMostSelect = false; select = (SelectExpression)base.VisitSelect(select); bool hasOrderBy = select.OrderBy != null && select.OrderBy.Count > 0; if (hasOrderBy) { this.PrependOrderings(select.OrderBy); } bool canHaveOrderBy = saveIsOuterMostSelect; bool canPassOnOrderings = !saveIsOuterMostSelect; IEnumerable<OrderExpression> orderings = (canHaveOrderBy) ? this.gatheredOrderings : null; ReadOnlyCollection<ColumnDeclaration> columns = select.Columns; if (this.gatheredOrderings != null) { if (canPassOnOrderings) { HashSet<string> producedAliases = new AliasesProduced().Gather(select.From); // reproject order expressions using this select's alias so the outer select will have properly formed expressions BindResult project = this.RebindOrderings(this.gatheredOrderings, select.Alias, producedAliases, select.Columns); this.gatheredOrderings = project.Orderings; columns = project.Columns; } else { this.gatheredOrderings = null; } } if (orderings != select.OrderBy || columns != select.Columns) { select = new SelectExpression(select.Type, select.Alias, columns, select.From, select.Where, orderings); } return select; } finally { this.isOuterMostSelect = saveIsOuterMostSelect; } } protected override Expression VisitJoin(JoinExpression join) { // make sure order by expressions lifted up from the left side are not lost // when visiting the right side Expression left = this.VisitSource(join.Left); IEnumerable<OrderExpression> leftOrders = this.gatheredOrderings; this.gatheredOrderings = null; // start on the right with a clean slate Expression right = this.VisitSource(join.Right); this.PrependOrderings(leftOrders); Expression condition = this.Visit(join.Condition); if (left != join.Left || right != join.Right || condition != join.Condition) { return new JoinExpression(join.Type, join.Join, left, right, condition); } return join; } /// <summary> /// Add a sequence of order expressions to an accumulated list, prepending so as /// to give precedence to the new expressions over any previous expressions /// </summary> /// <param name="newOrderings"></param> protected void PrependOrderings(IEnumerable<OrderExpression> newOrderings) { if (newOrderings != null) { if (this.gatheredOrderings == null) { this.gatheredOrderings = newOrderings; } else { List<OrderExpression> list = this.gatheredOrderings as List<OrderExpression>; if (list == null) { this.gatheredOrderings = list = new List<OrderExpression>(this.gatheredOrderings); } list.InsertRange(0, newOrderings); } } } protected class BindResult { ReadOnlyCollection<ColumnDeclaration> columns; ReadOnlyCollection<OrderExpression> orderings; public BindResult(IEnumerable<ColumnDeclaration> columns, IEnumerable<OrderExpression> orderings) { this.columns = columns as ReadOnlyCollection<ColumnDeclaration>; if (this.columns == null) { this.columns = new List<ColumnDeclaration>(columns).AsReadOnly(); } this.orderings = orderings as ReadOnlyCollection<OrderExpression>; if (this.orderings == null) { this.orderings = new List<OrderExpression>(orderings).AsReadOnly(); } } public ReadOnlyCollection<ColumnDeclaration> Columns { get { return this.columns; } } public ReadOnlyCollection<OrderExpression> Orderings { get { return this.orderings; } } } /// <summary> /// Rebind order expressions to reference a new alias and add to column declarations if necessary /// </summary>
        protected virtual BindResult RebindOrderings(IEnumerable<OrderExpression> orderings, string alias, HashSet<string> existingAliases, IEnumerable<ColumnDeclaration> existingColumns) {
            List<ColumnDeclaration> newColumns = null;
            List<OrderExpression> newOrderings = new List<OrderExpression>();
            foreach (OrderExpression ordering in orderings) {
                Expression expr = ordering.Expression;
                ColumnExpression column = expr as ColumnExpression;
                if (column == null || (existingAliases != null && existingAliases.Contains(column.Alias))) {
                    // check to see if a declared column already contains a similar expression
                    int iOrdinal = 0;
                    foreach (ColumnDeclaration decl in existingColumns) {
                        ColumnExpression declColumn = decl.Expression as ColumnExpression;
                        if (decl.Expression == ordering.Expression || 
                            (column != null && declColumn != null && column.Alias == declColumn.Alias && column.Name == declColumn.Name)) {
                            // found it, so make a reference to this column
                            expr = new ColumnExpression(column.Type, alias, decl.Name, iOrdinal);
                            break;
                        }
                        iOrdinal++;
                    }
                    // if not already projected, add a new column declaration for it
                    if (expr == ordering.Expression) {
                        if (newColumns == null) {
                            newColumns = new List<ColumnDeclaration>(existingColumns);
                            existingColumns = newColumns;
                        }
                        string colName = column != null ? column.Name : "c" + iOrdinal;
                        newColumns.Add(new ColumnDeclaration(colName, ordering.Expression));
                        expr = new ColumnExpression(expr.Type, alias, colName, iOrdinal);
                    }
                    newOrderings.Add(new OrderExpression(ordering.OrderType, expr));
                }
            }
            return new BindResult(existingColumns, newOrderings);
        }
    }

That's a whole lot of something! :-) 

The main visitation algorithm works like this.  As the visitor walks back up the tree it maintains a growing collection of order-by expressions.  This is almost the opposite of what the binder was doing, as it collected then-by expressions as it walked down the tree.  If both an outer level and inner level select node have order-by expressions, neither of the expressions are lost. The outer level ones simply take precedence by appearing before the inner ones.  This happens in VisitSelect by calling the PrependOrderings function that adds the current order-by's as a block to the head of the growing list.

Next I decide if the current select node can even have order-by expressions in it. Currently, this is a simple test of whether the select node is the outermost select node or not. This question would be more interesting if I had TSQL's TOP clause available. I also determine whether this select node can pass on ordering information. Again, currently, this has to do with the nodes outermost levelness.  If I had DISTINCT available, I would have shut off the order-by propagation.  The reason comes clear when I take into consideration what comes next. The rebind.

After determining that this node must pass on its order-by's to an outer node, the order-by expressions themselves must be rewritten to appear as expressions that reference this select's table alias since these expressions are currently built to refer to aliases from an even deeper nesting.  In addition, if any of these columns that are referenced in the order-by expressions are not available in this select node's column projection, then I need to add them to the projection so they are accessible from the outer select.  This whole ball of wax is called rebinding, and is cordoned off into its own function RebindOrderings.

Now to get back to a prior point, if a given select node had also been distinct, it would have been incorrect to allow order-by expressions to introduce new columns into the projection.  This would have changed the evaluation of the distinct operation as it would have been based on additional columns.  No bother really right now since there is no distinct, but I might add it next week so its worth it to be thinking ahead.  This is the actual reason that LINQ to SQL does not maintain ordering through a distinct or union operation.

 

So putting it all together, I just need to modify DbQueryProvider to call this new visitor.

    public class DbQueryProvider : QueryProvider {
... private TranslateResult Translate(Expression expression) { ProjectionExpression projection = expression as ProjectionExpression; if (projection == null) { expression = Evaluator.PartialEval(expression, CanBeEvaluatedLocally); expression = new QueryBinder(this).Bind(expression);
expression = new OrderByRewriter().Rewrite(expression);
projection = (ProjectionExpression)expression; } string commandText = new QueryFormatter().Format(projection.Source); LambdaExpression projector = new ProjectionBuilder().Build(projection.Projector, projection.Source.Alias); return new TranslateResult { CommandText = commandText, Projector = projector }; }
... }

Now, if I run this otherwise complicated query

    var query = from c in db.Customers
                orderby c.City
                where c.Country == "UK"
                select new { c.City, c.ContactName };

I get this translation in SQL

    SELECT t3.City, t3.ContactName
    FROM (
      SELECT t2.City, t2.Country, t2.ContactName, t2.CustomerID, t2.Phone
      FROM (
        SELECT t1.City, t1.Country, t1.ContactName, t1.CustomerID, t1.Phone
        FROM (
          SELECT t0.City, t0.Country, t0.ContactName, t0.CustomerID, t0.Phone
          FROM Customers AS t0
        ) AS t1
      ) AS t2
      WHERE (t2.Country = 'UK')
    ) AS t3
    ORDER BY t3.City

Which is far cry better than what it was generating before. 

If I run it to completion, I get the following output:

{ City = Cowes, ContactName = Helen Bennett }
{ City = London, ContactName = Simon Crowther }
{ City = London, ContactName = Hari Kumar }
{ City = London, ContactName = Thomas Hardy }
{ City = London, ContactName = Victoria Ashworth }
{ City = London, ContactName = Elizabeth Brown }
{ City = London, ContactName = Ann Devon }

 

There!  Now that's ordering, or at least a good start. 

Of course, it would still look better if I could reduce some of the unnecessary layers of sub-queries.  Maybe next time. :-)

This is the seventh in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might want to rethink your place in the universe. :-)

Complete list of posts in the Building an IQueryable Provider series 

There's been a few weeks hiatus since the last installment. I hope that most of you have used the free time to explore building your own providers. I keep hearing about other people's LINQ to watchamacallit projects and it is encouraging. Today I want to show you how I went about adding 'join' capability to my provider.  Finally, it will be able to do something more interesting than just select and where.

Implementing Join

There are actually a couple different ways to represent a join using LINQ.  In C# or VB if I use more than one 'from' clause I am performing a cross product, and if I match keys from one with keys from the other I am performing a join. 

var query = from c in db.Customers
            from o in db.Orders
            where c.CustomerID == o.CustomerID
            select new { c.ContactName, o.OrderDate };

Of course, there is also the 'join' clause that is an explicit join.

var query = from c in db.Customers
            join o in db.Orders on c.CustomerID equals o.CustomerID
            select new { c.ContactName, o.OrderDate };

Both of these queries produce the same results.  So why are there two ways to do the same thing?

The answer to that is a bit involved, but I'll try to make a stab at it here.  The explicit join requires me to specify two key expressions to match; in database parlance this is known as an equi-join. The nested from clauses allow more flexibility. The reason the explicit join is so restrictive is that by being restrictive the LINQ to Objects implementation can be efficient in execution without needing to analyze and rewrite the query. The good news here is that almost all joins used in database queries are equi-joins. 

Also, by not being as expressive the explicit join is much simpler to implement. In this post I'll actually implement both, but I'll start with the explicit join because it has fewer pitfalls.

The definition of the Queryable Join method looks like this:

public static IQueryable<TResult> Join<TOuter,TInner,TKey,TResult>(
this IQueryable<TOuter> outer,
IEnumerable<TInner> inner,
Expression<Func<TOuter,TKey>> outerKeySelector,
Expression<Func<TInner,TKey>> innerKeySelector,
Expression<Func<TOuter,TInner,TResult>> resultSelector
)

That's a lot of arguments and a lot of generics!  Fortunately, its not as hard to understand as it looks.  The 'inner' and 'outer' parameters are referring to input sequences (the sequences on both sides of the join); each have their own key selector (the expressions that appear in the 'on' clause on opposite sides of the 'equals'); and finally an expression that is used to produce a result of the join.  This last 'resultSelector' might be a bit confusing since it does not seem to appear in the C# or VB syntax.  In fact, it actually does.  In my example above it is the select expression.  In other examples, not shown here, it might be a compiler generated projection that carries the data forward to the next query operation.

Either way, its rather straight forward to implement. In fact, I already have almost everything I need to implement it. What I don't have yet is a node to represent the join. 

At least that's easy to remedy.

    internal enum DbExpressionType {
        Table = 1000, // make sure these don't overlap with ExpressionType
        Column,
        Select,
        Projection,
        Join
    }

I modified my enum to make a new 'Join' node type, and then I implement a JoinExpression.

    internal enum JoinType {
        CrossJoin,
        InnerJoin,
        CrossApply,
    }

    internal class JoinExpression : Expression {
        JoinType joinType;
        Expression left;
        Expression right;
        Expression condition;
        internal JoinExpression(Type type, JoinType joinType, Expression left, Expression right, Expression condition)
            : base((ExpressionType)DbExpressionType.Join, type) {
            this.joinType = joinType;
            this.left = left;
            this.right = right;
            this.condition = condition;
        }
        internal JoinType Join {
            get { return this.joinType; }
        }
        internal Expression Left {
            get { return this.left; }
        }
        internal Expression Right {
            get { return this.right; }
        }
        internal new Expression Condition {
            get { return this.condition; }
        }
    }

I've also defined a JoinType enum and filled it out with all the join types I'll be needing to know about.  'CrossApply' is a SQL Server only join type. Ignore it for now, I don't need it for the equi-join. In fact, I only need the 'InnerJoin'.  The other two come later.  I told you this was the simpler case.

What about outer joins? That will have to be a topic for another post. :-)

Now, that I have a new JoinExpression I'll need to update my DbExpressionVisitor.

    internal class DbExpressionVisitor : ExpressionVisitor {
        protected override Expression Visit(Expression exp) {
... switch ((DbExpressionType)exp.NodeType) {
... case DbExpressionType.Join: return this.VisitJoin((JoinExpression)exp);
... } }
... protected virtual Expression VisitJoin(JoinExpression join) { Expression left = this.Visit(join.Left); Expression right = this.Visit(join.Right); Expression condition = this.Visit(join.Condition); if (left != join.Left || right != join.Right || condition != join.Condition) { return new JoinExpression(join.Type, join.Join, left, right, condition); } return join; } }

So far so good.  Now, I just update QueryFormatter to know how to produce SQL text out of this new node.

    internal class QueryFormatter : DbExpressionVisitor {
... protected override Expression VisitSource(Expression source) { switch ((DbExpressionType)source.NodeType) {
... case DbExpressionType.Join: this.VisitJoin((JoinExpression)source); break;
... }
... } protected override Expression VisitJoin(JoinExpression join) { this.VisitSource(join.Left); this.AppendNewLine(Indentation.Same); switch (join.Join) { case JoinType.CrossJoin: sb.Append("CROSS JOIN "); break; case JoinType.InnerJoin: sb.Append("INNER JOIN "); break; case JoinType.CrossApply: sb.Append("CROSS APPLY "); break; } this.VisitSource(join.Right); if (join.Condition != null) { this.AppendNewLine(Indentation.Inner); sb.Append("ON "); this.Visit(join.Condition); this.AppendNewLine(Indentation.Outer); } return join; } }

The idea here is that JoinExpression nodes appear in the same place as other query source expressions such as SelectExpression and TableExpression.  Therefore, I've modified the VisitSource method to know about Joins, and I've added an implementation for VisitJoin.

Of couse, I'm not going to get anywhere if I don't know how to turn expression nodes calling the Queryable Join method into my new JoinExpression.  What I need is a method in QueryBinder just like the BindSelect and BindWhere methods.  This turns out to be the meat of the operation, however, it turns out to be rather straight-forward since I already have the support built in to handle the other operators.

    internal class QueryBinder : ExpressionVisitor {
... protected override Expression VisitMethodCall(MethodCallExpression m) { if (m.Method.DeclaringType == typeof(Queryable) || m.Method.DeclaringType == typeof(Enumerable)) { switch (m.Method.Name) {
... case "Join": return this.BindJoin( m.Type, m.Arguments[0], m.Arguments[1], (LambdaExpression)StripQuotes(m.Arguments[2]), (LambdaExpression)StripQuotes(m.Arguments[3]), (LambdaExpression)StripQuotes(m.Arguments[4]) ); } }
... }
... protected virtual Expression BindJoin(Type resultType, Expression outerSource, Expression innerSource, LambdaExpression outerKey, LambdaExpression innerKey, LambdaExpression resultSelector) { ProjectionExpression outerProjection = (ProjectionExpression)this.Visit(outerSource); ProjectionExpression innerProjection = (ProjectionExpression)this.Visit(innerSource); this.map[outerKey.Parameters[0]] = outerProjection.Projector; Expression outerKeyExpr = this.Visit(outerKey.Body); this.map[innerKey.Parameters[0]] = innerProjection.Projector; Expression innerKeyExpr = this.Visit(innerKey.Body); this.map[resultSelector.Parameters[0]] = outerProjection.Projector; this.map[resultSelector.Parameters[1]] = innerProjection.Projector; Expression resultExpr = this.Visit(resultSelector.Body); JoinExpression join = new JoinExpression(resultType, JoinType.InnerJoin, outerProjection.Source, innerProjection.Source, Expression.Equal(outerKeyExpr, innerKeyExpr)); string alias = this.GetNextAlias(); ProjectedColumns pc = this.ProjectColumns(resultExpr, alias, outerProjection.Source.Alias, innerProjection.Source.Alias); return new ProjectionExpression( new SelectExpression(resultType, alias, pc.Columns, join, null), pc.Projector ); } }

Inside the BindJoin method It's almost like I'm handling two operators at once. I've got two sources, so I end up with two different source projections.  I use these projections to seed the global map that is used to translate parameter references and then I translate each of the key expressions.  Finally, the same goes for the result expression, except that the result expression can see both source projections instead of just one.

Once I have translated all the input expressions I have enough to represent the join, so I go ahead and construct the JoinExpression.  Then I use ProjectColumns to build me a column list for the new SelectExpression that I'm going to wrap around the whole thing.  Notice there is one small change in ProjectColumns.  It allows me to specify more than one pre-existing alias.  This is important, because with a Join there are actually two aliases that my result expression may be referring to.

That's it. I'm actually done. Join should work as advertised.

Let's try it out.

var query = from c in db.Customers
            where c.CustomerID == "ALFKI"
            join o in db.Orders on c.CustomerID equals o.CustomerID
            select new { c.ContactName, o.OrderDate };

Console.WriteLine(query);

foreach (var item in query) {
    Console.WriteLine(item);
}

Running this produces the following output:

SELECT t2.ContactName, t4.OrderDate
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.CustomerID = 'ALFKI')
) AS t2
INNER JOIN (
  SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
  FROM Orders AS t3
) AS t4
  ON (t2.CustomerID = t4.CustomerID)

{ ContactName = Maria Anders, OrderDate = 8/25/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 10/3/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 10/13/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 1/15/1998 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 3/16/1998 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 4/9/1998 12:00:00 AM }

 

Now for the hard stuff. :-)

Implementing SelectMany  

If you've got any experience with SQL you are probably confounded by my insistence that nested 'from' case should be difficult. After all, with SQL this might only seem to be the difference between a CROSS JOIN and an INNER JOIN.  For those non-SQL inclined a CROSS JOIN is not really a join at all, it's a cross product. To make it a join, it relies on a join condition placed in the WHERE clause to do the actually joining.  So the only real difference, as far as SQL is concerned, is a CROSS JOIN puts is join condition in the WHERE clause and a INNER JOIN puts its join condition in the ON clause.  There shouldn't be so much fuss.

Oh, but there is.  There's a lot of fuss.  The problem isn't in the SQL, its more about what's not in the SQL.  You see a LINQ nested from is not the same thing as a CROSS JOIN.  Sometimes it's the same, but not always.

The problem comes down to what is visible when. A join takes two completely independent sub queries and joins them together using a join condition. The join condition can see columns from each side of the join, but that's the only expression that can. A LINQ nested from is very different. The inner source expression can see items enumerated from the outer source. Think of them as nested foreach loops. The inner one can reference the variable for the outer one.

The problem comes in finding a suitable translation for queries that are specified with these rogue references to the outer variable.

If your query looks like this then no problem:

var query = from c in db.Customers
            from o in db.Orders
            where c.CustomerID == o.CustomerID
            select new { c.ContactName, o.OrderDate };

This translates into method calls that look like this:

var query = db.Customers
              .SelectMany(c => db.Orders, (c, o) => new { c, o })
              .Where(x => x.c.CustomerID == x.o.CustomerID)
              .Select(x => new { x.c.ContactName, x.o.OrderDate });

The SelectMany's collection expression 'db.Orders' never references anything from 'c'. This is easy to translate to SQL since we can simply put db.Customers and db.Orders on opposite sides of a join.

However, if you simply change how you write the query to this:

var query = from c in db.Customers
            from o in db.Orders.Where(o => o.CustomerID == c.CustomerID)
            select new { c.ContactName, o.OrderDate };

Now, you've got a very different beast. Translated to method calls this becomes:

var query = db.Customers
              .SelectMany(
                 c => db.Orders.Where(o => c.CustomerID == o.CustomerID),
                 (c, o) => new { c.ContactName, o.OrderDate }
                 );

Now the join condition exists as part of the SelectMany's collection expression, so it references 'c'.  Translation can no longer simply be a process of putting both source expressions on either side of a join in SQL, whether CROSS or INNER.

So how am I going to solve this problem?  I'm not.  Not really.  I'm only going to solve it in the crudest sense.  I'm going to let Microsoft's SQL solve it for me, mostly.  Microsoft SQL2005 introduced a new type of join operator, CROSS APPLY, that has the exact same semantics as what I'm looking for, a very happy coincidence indeed. A CROSS APPLY's right-hand-side expression can reference columns from the left-hand-side.  That's why I included it when defining the JoinType enum.

A large part of LINQ to SQL translation engine exists to reduce CROSS APPLY's into CROSS JOIN in an opportunistic fashion.  Without this work, LINQ to SQL would likely not work very well with SQL2000, and even so it is not always possible.  Adding capability to do this to the sample provider would also be a lot of work that I'm reluctant to do right now. However, I'm not entirely unfeeling, so I've thrown in a bone.  I'm going to catch the easy case and convert that to a CROSS JOIN.

So let's take a look at the code.

    internal class QueryBinder : ExpressionVisitor {
        protected override Expression VisitMethodCall(MethodCallExpression m) {
            if (m.Method.DeclaringType == typeof(Queryable) ||
                m.Method.DeclaringType == typeof(Enumerable)) {
                switch (m.Method.Name) {
... case "SelectMany": if (m.Arguments.Count == 2) { return this.BindSelectMany( m.Type, m.Arguments[0], (LambdaExpression)StripQuotes(m.Arguments[1]), null ); } else if (m.Arguments.Count == 3) { return this.BindSelectMany( m.Type, m.Arguments[0], (LambdaExpression)StripQuotes(m.Arguments[1]), (LambdaExpression)StripQuotes(m.Arguments[2]) ); } break;
... } }
... } protected virtual Expression BindSelectMany(Type resultType, Expression source, LambdaExpression collectionSelector, LambdaExpression resultSelector) { ProjectionExpression projection = (ProjectionExpression)this.Visit(source); this.map[collectionSelector.Parameters[0]] = projection.Projector; ProjectionExpression collectionProjection = (ProjectionExpression)this.Visit(collectionSelector.Body); JoinType joinType = IsTable(collectionSelector.Body) ? JoinType.CrossJoin : JoinType.CrossApply; JoinExpression join = new JoinExpression(resultType, joinType, projection.Source, collectionProjection.Source, null); string alias = this.GetNextAlias(); ProjectedColumns pc; if (resultSelector == null) { pc = this.ProjectColumns(collectionProjection.Projector, alias, projection.Source.Alias, collectionProjection.Source.Alias); } else { this.map[resultSelector.Parameters[0]] = projection.Projector; this.map[resultSelector.Parameters[1]] = collectionProjection.Projector; Expression result = this.Visit(resultSelector.Body); pc = this.ProjectColumns(result, alias, projection.Source.Alias, collectionProjection.Source.Alias); } return new ProjectionExpression( new SelectExpression(resultType, alias, pc.Columns, join, null), pc.Projector ); } private bool IsTable(Expression expression) { ConstantExpression c = expression as ConstantExpression; return c != null && IsTable(c.Value); }
... }

The first interesting thing to note is that there are two different forms of SelectMany that are interesting.  The first form takes a source expression and a collectionSelector expression. It produces a sequence of the same elements that are produced in the collectionSelector, only merging all the individual sequences together.  The second form adds the resultSelector expression that lets you project your own result out of the two joined items.  I've implemented BindSelectMany to work with or without the resultSelector being specified.

Note that on the fourth line of the function I determine which kind of Join I'm going to use to represent the SelectMany call.  If I can determine that the collectionSelector is just a simple table query then I know it cannot have references to outer query variable (the parameter to the collectionSelector lambda expression).  Therefore I know I can safely chose to use the CROSS JOIN instead of the CROSS APPLY. If I wanted to be a bit more sophisticated I could have written a visitor to prove that the collectionSelector made no references.  Maybe I'll do that next time. I have a feeling I'm going to need to know this for other reasons. Yet for now this is the simple test I'm going to use. 

All in all, this code is not too different from the BindJoin function or the others.  I do have to handle the case without the resultSelector.  In this case, I simply get to reuse the collectionProjection again as the final projection.

 

Let's try the new code too.

var query = from c in db.Customers
            where c.CustomerID == "ALFKI"
            from o in db.Orders
            where c.CustomerID == o.CustomerID
            select new { c.ContactName, o.OrderDate };

Console.WriteLine(query);

foreach (var item in query) {
    Console.WriteLine(item);
}

Running this code now produces the following results:

SELECT t6.ContactName, t6.OrderDate
FROM (
  SELECT t5.CustomerID, t5.ContactName, t5.Phone, t5.City, t5.Country, t5.OrderID, t5.CustomerID1, t5.OrderDate
  FROM (
    SELECT t2.CustomerID, t2.ContactName, t2.Phone, t2.City, t2.Country, t4.OrderID, t4.CustomerID AS CustomerID1, t4.OrderDate
    FROM (
      SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
      FROM (
        SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
        FROM Customers AS t0
      ) AS t1
      WHERE (t1.CustomerID = 'ALFKI')
    ) AS t2
    CROSS JOIN (
      SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
      FROM Orders AS t3
    ) AS t4
  ) AS t5
  WHERE (t5.CustomerID = t5.CustomerID1)
) AS t6

{ ContactName = Maria Anders, OrderDate = 8/25/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 10/3/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 10/13/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 1/15/1998 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 3/16/1998 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 4/9/1998 12:00:00 AM }

Yikes! The queries are starting too look a bit hefty.  I guess that's what happens when I keep mindlessly adding new select layers. Maybe one of these days I'll figure out a way to reduce this to just what is necessary. :-)

Of course, If I write the query in such a way that it does not pass my simple check I'm going to get a CROSS APPLY.

var query = db.Customers
              .Where(c => c.CustomerID == "ALFKI")
              .SelectMany(
                 c => db.Orders.Where(o => c.CustomerID == o.CustomerID),
                 (c, o) => new { c.ContactName, o.OrderDate }
                 );

Console.WriteLine(query);

foreach (var item in query) {
    Console.WriteLine(item);
}

This code produces the following:

SELECT t2.ContactName, t5.OrderDate
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.CustomerID = 'ALFKI')
) AS t2
CROSS APPLY (
  SELECT t4.OrderID, t4.CustomerID, t4.OrderDate
  FROM (
    SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
    FROM Orders AS t3
  ) AS t4
  WHERE (t2.CustomerID = t4.CustomerID)
) AS t5

{ ContactName = Maria Anders, OrderDate = 8/25/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 10/3/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 10/13/1997 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 1/15/1998 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 3/16/1998 12:00:00 AM }
{ ContactName = Maria Anders, OrderDate = 4/9/1998 12:00:00 AM }

 

Exactly what I was expecting! 

Now my provider handles Join and SelectMany calls.  Do I hear a "woot woot" out there?  Maybe my ears are just ringing.  This provider does a lot, but there are still obvious holes to fill and operators to implement.  I should get paid for doing this.

This is the sixth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might want to file for a government grant and take sabbatical because you've got a lot of catching up to do. :-)

Complete list of posts in the Building an IQueryable Provider series 

So, again you thought I was done with this series, that I've given up and moved on to greener pastures. You think that since Select works wonderfully that that's all you need to know to make your own IQueryable provider? Ha! There's loads more to know. And, by the way, Select is still broken.

 

Finishing Select

Broken? How is this possible? I thought you were this bigshot Microsoft uber developer that never made mistakes! Yet, you claim you gave us shoddy code? I cut and pasted that stuff already into my product and my boss says we 'go live' next Monday! How could you do this to me?  Sniff.

Relax. It's not that broken. It's just a little bit broken.

Recall in the last post I invented these new expression nodes; Table, Column, Select and Projection. They worked great didn't they? Sure they did! The part that is broken is that I did not handle all the cases where they might appear. The case I handled was the most obvious, when Projection nodes appear on the top of the query tree.  After all, since I'm only allowing for Select and Where anyway, the last operation must be one of those. The code actually assumed that to be true.

That's not the problem.

The problem is Projection nodes may also appear inside selector expressions themselves.  For example, take a look at the following.

    var query = from c in db.Customers

                select new {

                    Name = c.ContactName,

                    Orders = from o in db.Orders

                             where o.CustomerID == c.CustomerID

                             select o

                };

I added a nested query in the select expression. This is very different than the queries I wrote before, which were only tabular. Now, I'm basically asking my provider to construct a hierarchy, each object is going to have a name and a collection of orders. How in the world am I going to do that? SQL can't even do that. And even if I wanted to disallow it outright, what happens if someone does write this?

Bang! I get an exception. Not the one I thought though. I guess the code is more buggy than I realized. I expected to get an exception when trying to compile the projector function, since this lovely query should leave a little ProjectionException fly in my selection expression soup. Didn't I claim that it was okay to invent my own expression nodes because no one was going to see them anyway?  Ha! Looks like I was wrong. (The actual exception I get is from assembling a bad expression tree because I was mistyping the Projection nodes when I constructed them. That's going to have to be fixed.)

Assuming I fix the typing bug, what am I going to do about these nested projection nodes?  I could just catch them and throw my own exception with an apologetic disclaimer that nesting is a no-no. But then I wouldn't be a good LINQ citizen, and I wouldn't have the fun of figuring out how to actually make it work.

So on to the good stuff.

 

Nested Queries

What I want to do when I see a nested ProjectionExpression is turn it into a nested query. SQL cannot actually do this, so I'm going to have to do it or something like it in my own code. However, I'm not going to shoot for a super advanced solution here, just one that actually retrieves the data.

Since the projector function has to be turned into executable code, I'm going to have to swap some piece of code into the expression where this ProjectionExpression node currently is.  It's got to be something that constructs the Orders collection out of something. It can't come from the current DataReader since that guy only holds tabular results. Therefore its got to come from another DataReader. What I really need is something that turns a ProjectionExpression into a function that when executed returns this collection.

Now where have I seen something like that before?

Thinking...

Right. That's what my provider already does, more or less. Whew, I thought this was going to be difficult.  My provider already converts an expression tree into a result sequence via the Execute method.  I guess I'm already half way home!

So what I need to do is add a function to my good ol' ProjectionRow class that executes a nested query. It can figure out how to get back to the provider for me in order to do the actual work.

Here's the new code for ProjectionRow and ProjectionBuilder.

    public abstract class ProjectionRow {
        public abstract object GetValue(int index);
        public abstract IEnumerable<E> ExecuteSubQuery<E>(LambdaExpression query);
    }

    internal class ProjectionBuilder : DbExpressionVisitor {
        ParameterExpression row;
        string rowAlias;
        static MethodInfo miGetValue;
        static MethodInfo miExecuteSubQuery;
        
        internal ProjectionBuilder() {
            if (miGetValue == null) {
                miGetValue = typeof(ProjectionRow).GetMethod("GetValue");
                miExecuteSubQuery = typeof(ProjectionRow).GetMethod("ExecuteSubQuery");
            }
        }

        internal LambdaExpression Build(Expression expression, string alias) {
            this.row = Expression.Parameter(typeof(ProjectionRow), "row");
            this.rowAlias = alias;
            Expression body = this.Visit(expression);
            return Expression.Lambda(body, this.row);
        }

        protected override Expression VisitColumn(ColumnExpression column) {
            if (column.Alias == this.rowAlias) {
                return Expression.Convert(Expression.Call(this.row, miGetValue, Expression.Constant(column.Ordinal)), column.Type);
            }
            return column;
        }

        protected override Expression VisitProjection(ProjectionExpression proj) {
            LambdaExpression subQuery = Expression.Lambda(base.VisitProjection(proj), this.row);
            Type elementType = TypeSystem.GetElementType(subQuery.Body.Type);
            MethodInfo mi = miExecuteSubQuery.MakeGenericMethod(elementType);
            return Expression.Convert(
                Expression.Call(this.row, mi, Expression.Constant(subQuery)),
                proj.Type
                );
        }
    }

So, just like I inject code to call GetValue when I see a ColumnExpression, I'm going to inject code to call ExecuteSubQuery when I see a ProjectionExpression.

I decided I needed to bundle up the projection and the parameter I was using to refer to my ProjectionRow, because as it turns out the ProjectionExpression also gets its ColumnExpressions converted.  Luckily, there was already a class designed to do that, LambdaExpression, so I used it as the argument type for ExecuteSubQuery.

Also notice how I pass the subquery as a ConstantExpression.  This is how I trick the Expression.Compile feature into not noticing that I've invented new nodes. Fortunately, I didn't want them to be compiled just yet anyway.

Next to take a look at is the changed ProjectionReader. Of course, the Enumerator now implements ExecuteSubQuery for me.

    internal class ProjectionReader<T> : IEnumerable<T>, IEnumerable {
        Enumerator enumerator;

        internal ProjectionReader(DbDataReader reader, Func<ProjectionRow, T> projector, IQueryProvider provider) {
            this.enumerator = new Enumerator(reader, projector, provider);
        }

        public IEnumerator<T> GetEnumerator() {
            Enumerator e = this.enumerator;
            if (e == null) {
                throw new InvalidOperationException("Cannot enumerate more than once");
            }
            this.enumerator = null;
            return e;
        }

        IEnumerator IEnumerable.GetEnumerator() {
            return this.GetEnumerator();
        }

        class Enumerator : ProjectionRow, IEnumerator<T>, IEnumerator, IDisposable {
            DbDataReader reader;
            T current;
            Func<ProjectionRow, T> projector;
            IQueryProvider provider;

            internal Enumerator(DbDataReader reader, Func<ProjectionRow, T> projector, IQueryProvider provider) {
                this.reader = reader;
                this.projector = projector;
                this.provider = provider;
            }

            public override object GetValue(int index) {
                if (index >= 0) {
                    if (this.reader.IsDBNull(index)) {
                        return null;
                    }
                    else {
                        return this.reader.GetValue(index);
                    }
                }
                throw new IndexOutOfRangeException();
            }

            public override IEnumerable<E> ExecuteSubQuery<E>(LambdaExpression query) {
                ProjectionExpression projection = (ProjectionExpression) new Replacer().Replace(query.Body, query.Parameters[0], Expression.Constant(this));
                projection = (ProjectionExpression) Evaluator.PartialEval(projection, CanEvaluateLocally);
                IEnumerable<E> result = (IEnumerable<E>)this.provider.Execute(projection);
                List<E> list = new List<E>(result);
                if (typeof(IQueryable<E>).IsAssignableFrom(query.Body.Type)) {
                    return list.AsQueryable();
                }
                return list;
            }

            private static bool CanEvaluateLocally(Expression expression) {
                if (expression.NodeType == ExpressionType.Parameter ||
                    expression.NodeType.IsDbExpression()) {
                    return false;
                }
                return true;
            }

            public T Current {
                get { return this.current; }
            }

            object IEnumerator.Current {
                get { return this.current; }
            }

            public bool MoveNext() {
                if (this.reader.Read()) {
                    this.current = this.projector(this);
                    return true;
                }
                return false;
            }

            public void Reset() {
            }

            public void Dispose() {
                this.reader.Dispose();
            }
        }
    }

Now, you can see that when I construct the ProjectionReader I pass the instance of my provider here. I'm going to use that to execute the subquery down in the ExecuteSubQuery function.

Looking at ExectueSubQuery... Hey, what is that Replacer.Replace thing?

I haven't shown you that bit of magic yet. I will in just a moment. Let me explain what is going on in this method first.  I've got the argument that is a LambdaExpression that holds onto the original ProjectionExpression in the body and the parameter that was used to reference the current ProjectionRow.  That's all fine and dandy though.  The problem I have is that I can't just execute the projection expression via a call back to my provider because all of the ColumnExpressions that used to reference the outer query (think join condition in the Where clause) are now GetValue expressions. 

That's right, I've got references to the outer query in my sub query; chocolate in my peanut butter.  I can't leave those particular calls to GetValue in the projection, because they would be trying to access columns that don't exist in the new query.  What a dilema, Charlie Brown.

Thinking...

Aha! I've got it. Fortunately, all the data for those GetValue calls is readily available. It's sitting in the DataReader one object reference away from the code in ExecuteSubQuery. The data is available in the current row.  So what I want to do is somehow 'evaluate' those little bits of expressions right here and now and force those sub expressions to call those GetValue methods.  I wish I had code that could do that.  That would be perfect.

Wait, isn't that what Evaluator.PartialEval does?  Sure, but that won't work. Why?  Because those silly little expressions have references to my ProjectionRow parameter, and ParameterExpressions are the rule that tell it not to eval the expression.  If I could only get rid of those silly parameter references and instead have constants that point to my current running instance of ProjectionRow, then I could use Evaluator.PartialEval to turn those expressions into values!  Life would be good.

How to do that? I need a tool that will search my expression tree and replace some nodes with other nodes. Yah, that's the ticket.

Here's something, I call it Replacer.  It simply walks the tree looking for references to one node instance and swapping it for references to a different node.

    internal class Replacer : DbExpressionVisitor {
        Expression searchFor;
        Expression replaceWith;
        internal Expression Replace(Expression expression, Expression searchFor, Expression replaceWith) {
            this.searchFor = searchFor;
            this.replaceWith = replaceWith;
            return this.Visit(expression);
        }
        protected override Expression Visit(Expression exp) {
            if (exp == this.searchFor) {
                return this.replaceWith;
            }
            return base.Visit(exp);
        }
    }

Beautiful!  Sometimes I amaze even myself.

Okay, great, now I can swap out those nasty references to the ProjectionRow parameter with the real honest-to-goodness instance.  That's what the first line in ExecuteSubQuery does. And it only took a few dozen lines of English to explain it. :-)

The second line calls Evaluate.PartialEval.  Just what I wanted. The line after that calls the provider to execute! Hurray! Then I throw the results into a List object. Finally, I recognize that I might have to turn the result back into an IQueryable.  Weird, I know, but the type of the 'Orders' property in the original query was IQueryable<Order> because that's how IQueryable query operators work, so C# invented the anonymous type using that for the member type.  If I try to just return the list, the projector that combines the results together will be none-too-pleased.  Fortunately, I already have a facility to turn IEnumerable's into IQueryables; Queryable.AsQueryable.

Wow! It's almost as if someone designed this stuff to work together.

Full disclosure:  I did cheat a little bit.  I had to modify the Evaluator class. I had to get it to understand my new expression types.  I know, I know, I said no one else needed to know about them, but it is my code too, so I guess that's alright.  I'll save that one-liner for you to view in the attached zip file.  I only post mega-long code snippets, not measly one-liners.

I also had to invent a new CanEvaluateLocally rule for Evaluator to use.  I needed to make sure that it would never think it was possible to evaluate one of my new nodes locally.

So now let's take a look on how DbQueryProvider changed

    public class DbQueryProvider : QueryProvider {
        DbConnection connection;
        TextWriter log;

        public DbQueryProvider(DbConnection connection) {
            this.connection = connection;
        }

        public TextWriter Log {
            get { return this.log; }
            set { this.log = value; }
        }

        public override string GetQueryText(Expression expression) {
            return this.Translate(expression).CommandText;
        }

        public override object Execute(Expression expression) {
            return this.Execute(this.Translate(expression));
        }

        private object Execute(TranslateResult query) {
            Delegate projector = query.Projector.Compile();

            if (this.log != null) {
                this.log.WriteLine(query.CommandText);
                this.log.WriteLine();
            }

            DbCommand cmd = this.connection.CreateCommand();
            cmd.CommandText = query.CommandText;
            DbDataReader reader = cmd.ExecuteReader();

            Type elementType = TypeSystem.GetElementType(query.Projector.Body.Type);
            return Activator.CreateInstance(
                typeof(ProjectionReader<>).MakeGenericType(elementType),
                BindingFlags.Instance | BindingFlags.NonPublic, null,
                new object[] { reader, projector, this },
                null
                );
        }

        internal class TranslateResult {
            internal string CommandText;
            internal LambdaExpression Projector;
        }

        private TranslateResult Translate(Expression expression) {
            ProjectionExpression projection = expression as ProjectionExpression;
            if (projection == null) {
                expression = Evaluator.PartialEval(expression);
                projection = (ProjectionExpression)new QueryBinder().Bind(expression);
            }
            string commandText = new QueryFormatter().Format(projection.Source);
            LambdaExpression projector = new ProjectionBuilder().Build(projection.Projector, projection.Source.Alias);
            return new TranslateResult { CommandText = commandText, Projector = projector };
        }
    } 

The only thing that changed is my Translate method. It recognizes when it is handed a ProjectionExpression and chooses not do the work to turn an users query expression into a ProjectionExpression. Instead, it just skips down to the step that builds the command text and projection.

Did I forget to mention I added a 'Log' feature just like LINQ to SQL has.  That will help us see what's going on.  I added it here in my Context class too.

    public class Northwind {
        public Query<Customers> Customers;
        public Query<Orders> Orders;

        private DbQueryProvider provider;
        public Northwind(DbConnection connection) {
            this.provider = new DbQueryProvider(connection);
            this.Customers = new Query<Customers>(this.provider);
            this.Orders = new Query<Orders>(this.provider);
        }

        public TextWriter Log {
            get { return this.provider.Log; }
            set { this.provider.Log = value; }
        }
    }
 

Taking it for a Spin

Now let's give this new magic mojo a spin.

        string city = "London";
        var query = from c in db.Customers
                    where c.City == city
                    select new {
                        Name = c.ContactName,
                        Orders = from o in db.Orders
                                 where o.CustomerID == c.CustomerID
                                 select o
                    };


        foreach (var item in query) {
            Console.WriteLine("{0}", item.Name);
            foreach (var ord in item.Orders) {
                Console.WriteLine("\tOrder: {0}", ord.OrderID);
            }
        }

 

Run this and it outputs the following:

 

Thomas Hardy
        Order: 10355
        Order: 10383
        Order: 10453
        Order: 10558
        Order: 10707
        Order: 10741
        Order: 10743
        Order: 10768
        Order: 10793
        Order: 10864
        Order: 10920
        Order: 10953
        Order: 11016
Victoria Ashworth
        Order: 10289
        Order: 10471
        Order: 10484
        Order: 10538
        Order: 10539
        Order: 10578
        Order: 10599
        Order: 10943
        Order: 10947
        Order: 11023
Elizabeth Brown
        Order: 10435
        Order: 10462
        Order: 10848
Ann Devon
        Order: 10364
        Order: 10400
        Order: 10532
        Order: 10726
        Order: 10987
        Order: 11024
        Order: 11047
        Order: 11056
Simon Crowther
        Order: 10517
        Order: 10752
        Order: 11057
Hari Kumar
        Order: 10359
        Order: 10377
        Order: 10388
        Order: 10472
        Order: 10523
        Order: 10547
        Order: 10800
        Order: 10804
        Order: 10869

 

Here are the queries it executed:  (I used the new .Log property to capture these)

SELECT t2.ContactName, t2.CustomerID
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.City = 'London')
) AS t2

SELECT t4.OrderID, t4.CustomerID, t4.OrderDate
FROM (
  SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
  FROM Orders AS t3
) AS t4
WHERE (t4.CustomerID = 'AROUT')

SELECT t4.OrderID, t4.CustomerID, t4.OrderDate
FROM (
  SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
  FROM Orders AS t3
) AS t4
WHERE (t4.CustomerID = 'BSBEV')

SELECT t4.OrderID, t4.CustomerID, t4.OrderDate
FROM (
  SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
  FROM Orders AS t3
) AS t4
WHERE (t4.CustomerID = 'CONSH')

SELECT t4.OrderID, t4.CustomerID, t4.OrderDate
FROM (
  SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
  FROM Orders AS t3
) AS t4
WHERE (t4.CustomerID = 'EASTC')

SELECT t4.OrderID, t4.CustomerID, t4.OrderDate
FROM (
  SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
  FROM Orders AS t3
) AS t4
WHERE (t4.CustomerID = 'NORTS')

SELECT t4.OrderID, t4.CustomerID, t4.OrderDate
FROM (
  SELECT t3.OrderID, t3.CustomerID, t3.OrderDate
  FROM Orders AS t3
) AS t4
WHERE (t4.CustomerID = 'SEVES')

 

Okay, maybe lots of extra little queries is not ideal. Still, its better than throwing an exception!

 

Now, finally, Select is done. It really can handle any projection.  Maybe. :-)

More Posts Next page »
 
Page view tracker