Welcome to MSDN Blogs Sign in | Join | Help

This is the sixteenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might try rolling into a ball and crying for momma. That’s right, reading on is as hazardous to your health as a skinned knee. Just be warned and have an anti-biotic on hand.

Complete list of posts in the Building an IQueryable Provider series

I would have had the update out sooner but I couldn’t think of what to write in this excuse slot. My excuse generator was down for scheduled maintenance but did not come back on line as planned and wouldn’t you know it the techie responsible gave me all sorts of lame excuses. Some people!

What's inside:

Performance – I’ve actually tried to improve things.

Caching – Re-use queries without the burden of defining delegates.

Evaluation – Execute expression trees without Reflection.Emit.

Bug Fixes – Most of the bugs reported on CodePlex are fixed.


The full source code and redistributable DLL's can be found at:

http://www.codeplex.com/IQToolkit

Faster is Better

One of the big things I tried to tackle this time around was to finally do some performance improvements.  Until now, the only performance considerations were in the design of compiled queries and the use of compiled LINQ expressions for materializing objects.  Yet, when I looked at actual performance of compiled queries versus straight ADO, there was still a lot of overhead.

Where was the time being spent?  As it turns out, even though I was compiling a LINQ expression to represent the execution plan and an explicit function to convert DataReader rows into objects, ideally making the writing of data into objects as fast as possible, there was still room for improvement. The problem was not the writing of data into objects, but reading the data from the DataReader. I wholely blame myself for this. In an attempt to simplify the IQToolkit source code way back in one of the original posts I chose to read data using the general DataReader.GetValue method. This has two undesirable side effects; 1) the value is returned typed as object, which mean all non reference types (mostly numeric and datetime types) are boxed, which is measurably significant, and 2) in order to make sure the value was in the correct form to write it into the object I now had to coerce it back (which immediately led to an unbox operation, equally as significant.)

I tried many variations of solutions. Ideally, the runtime code would simply know which field access method of the DataReader to call and to coerce the results only if necessary. Unfortunately, the translation engine does not track enough information to be 100% certain of the data type the provider will return. It can make a good guess, but if it is wrong then the query will die in a fiery exception that halts the program and sends your manager storming toward your office. The solution I chose was sort of a hybrid.  Based on an expected value type a type specific helper method is now invoked. This method calls the equivalent DataReader method inside of a try-catch. I know, I hate having to use try-catch in this way, but the cheap cost of setting up the try-catch and the rare condition where the guess is wrong led me to the dark side. I will now change the color of my light-saber from blue to red.

Here’s an example of the GetInt32 helper method.

public static class FieldReader
{
...
public static Int32 GetInt32(DbEntityProviderBase provider, DbDataReader reader, int ordinal) { if (reader.IsDBNull(ordinal)) { return default(Int32); } try { return reader.GetInt32(ordinal); } catch { return (Int32)provider.Convert(reader.GetValue(ordinal), typeof(Int32)); } } ... }

As long as the expected type is correct, the faster DataReader.GetInt32 method is called. If that fails, then the fallback is to call the general GetValue method and coerce the result.  This should rarely happen.

This is all it took to get the compiled query into very low overhead versus direct ADO calls; mostly less than 3%. I’ve added a performance test you can run to check it out on your set up. Of course, this will vary depending on the query, the hardware, the load on the server and the network latency.

Caching Queries

“Why can’t the provider simply cache the queries for me.”  I’ve gotten this request a lot.  Sometimes from direct pleas in email, other times from those of you trying to do it yourself and asking for advice. 

It seems natural to imagine that it would only take a little bit of work for a provider to determine if a query being executed now is the same or similar to one executed before, so why is the only way to re-use a query to specify one using the cryptic syntax of generic methods, lambda expressions and delegates?  And why do I have to hold onto the thing myself, can’t some service do this for me?

Of course, my usual reaction is to give a heavy sigh in the privacy of my office and then craft a quite sensible fire-and-brimstone reply, complete with infallible logic and dramatic consternation, as to why and how you really are better off with the compiled query solution. But I’m tired of doing that, so instead of impressing you with my sound reasoning I’m going to show you how I went ahead and just did it. I like challenges, and there’s no better challenge than the impossible, or the sometimes impractical, or the generally ill-advised.

What I built is a new piece of code called the ‘QueryCache’.  It is actually implemented to be generic enough that it will work with your own IQueryable provider. Yet its not currently integrated into any provider, though you may choose to embed it into yours.  You can, however, use the cache as is to execute your queries and take advantage of its cache-y goodness. You don’t have to make delegates and invoke them, you simply have to give the cache your query and it will give you back the results.

var cache = new QueryCache(10);
var query = from c in db.Customers where c.City == "London" select c;
var result = cache.Execute(query);

Here’s how it works.  The cache maintains a most-recently-used list of compiled queries.  Every time you execute a query via the cache, the cache compares your query against the ones in the cache. If it finds a match, it simply uses the existing compiled query and invokes it. If not, it makes a new compiled query and adds it to the list.

Of course, that’s the easy part. 

The hard part is figuring out how to compare an IQueryable query object against a list of compiled-query delegate objects and determine which ones can be reused. For example, are these two the same query?

var query1 = from c in db.Customers where c.City == "London" select c;
var query2 = from c in db.Customers where c.City == "Seattle" select c;

Technically they are different expression trees, but if that’s the deciding factor then I might as well give up now. They are structurally similar and so it is logical to assume that a query compiled for one should be nearly identical to a query compiled for the other. If I were using compiled queries directly I would simply choose to make the name of the city a parameter to the compiled query delegate and invoke it with different values. So isn’t that just what I want the cache to do for me? 

To do this I need a little tool that will take a look at a query tree and decide which bits should be considered parameters and then give me back a new query tree with the parameters in place of those bits. As it turns out this is not really all that hard since it seems obvious for the most generality that anything that appears in the query tree as a constant should be deemed a parameter. A trivial expression visitor can produce this.

If it also wraps the rewritten query tree in a lambda expression and gives me back the set of constants that it extracted then I have everything I need to make a compiled query and invoke it with an array of values. I also have what I need to find an already compiled query in the cache, since compiled queries hold onto their defining query trees. So if my first query in the example above is already in the cache, it already has a parameterized query tree defining it and that ought to look awfully similar to the parameterized version of the second query.

The only thing I need now is a way to compare two lambda expressions to see if they are similar.  Fortunately, I wrote that ages ago.  The ExpressionComparer does just this. 

If I cobble all these parts together into a reasonable order I get my QueryCache. Now, I can use compiled queries without ever having to manually define them again!  Huzzah!

Yet, if this is so grand, why haven’t I taken that one last step and simply plugged into the base DbEntityProvider class?

Unfortunately, reality is not as rosy and I would hope.

There are a few problems holding me back.  The most significant is the silly act of parameterizing itself. The problem with the cache is that it doesn’t know which constants in the query should be made into parameters, so it turns them all into parameters. Yet, as much as I’d like it to be otherwise, databases don’t treat all parameters equally. Sometimes a parameter is terrific for performance; a database like SQL Server optimizes by storing parameterized query plans so you basically get the same benefit client and server.  Yet, sometimes, for some bits of SQL, for some back ends, parameters either cannot be used or have crippling effects on speed. So it's usually the best policy to be explicit and judicious when using parameters, something the cache in all its glory cannot chose for you.

So I leave the cache, for now, as something you can explicitly opt-in to using, per query.

Please, give it a try.  Feedback is welcome.

Expression Evaluator

Sometimes you may want to execute expression trees as code without actually turning it into true runtime code. You may be running on a platform that does not currently support the Expression.Compile() method (or the ability to use Reflection.Emit to generate IL at runtime) like Windows Mobile. I’ve been encouraged by folks here at Microsoft to think about this from time to time, so I set out to explore this space and the result is a whole mess code called the ExpressionEvaluator, which is an expression tree interpreter. I added my experimental solution to the toolkit in case it is beneficial for someone. It is a lot of code, so it doesn’t compile into the library by default. You have to enable it with a specific #define, NOREFEMIT.  This will also switch over all direct uses of Reflection.Emit in the toolkit to using this evaluator.

It has a hefty performance overhead, so I would not consider it a viable alternative to Reflection.Emit. I’ve even added many tricks to avoid calls to reflection and boxing. This has made it a lot faster than a naive solution, but still it pales in comparison to real IL.  However, in cases where there is no alternative, its probably the best you are going to get.

Crazy Antics

In keeping with the spirit of making changes for no good reason, I’ve unbundled the base set of toolkit code from the ADO provider specific bits. So now if you are using IQToolkit to build your own non-ADO based provider you don’t have to be dragging the rest of it along for the ride.  All the SQL translation and DbEntityProvider goodness is now in its own DLL, IQToolkit.Data.dll. This means you’ll probably have to tweak your projects to include this new DLL, but that’s about all.

The DLL list is now:

IQToolkit.dll
IQToolkit.Data.dll
IQToolkit.Data.Access.dll
IQToolkit.Data.SqlClient.dll
IQToolkit.Data.SqlServerCe.dll
IQToolkit.Data.MySql.dll
IQToolkit.Data.SQLite.dll

Of course, as always, you only need around the ones you are using, or none of them if you are simply lifting the code out as source.

That’s All Folks

I’m sure there are many more wonderful nuggets of goodness I added but forgot to mention.  If you discover any of them please file a full report at http://www.codeplex.com/iqtoolkit

This is the fifteenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might try searching for the audio tapes on www.Bing.com.  That would be a lot easier than reading. You won't find any, but you'll feel better for having tried.

Complete list of posts in the Building an IQueryable Provider series

Getting this version of the toolkit together has taken a lot of sleepless nights.  Thank goodness for Netflix or I'd have had to spend those sleepless nights actually working on the Toolkit.

Okay, enough with the flavor text, let's get to the crunch.

What's inside:

More Providers - MySQL and SQLite join the previous MS only line up.

Transactions - Use ADO transactions to control the isolation of your queries & updates.

Entity Providers - The provider concept is expanded to include tables of entities

Entity Sessions - The session concept adds identity caching, change tracking and deferred updates via SubmitChanges

Provider Factory - Create providers on the fly w/o knowing anything more than the database name and mapping.

Madness


The full source code and redistributable DLL's can be found at:

http://www.codeplex.com/IQToolkit

More Providers

  • MySQL -- With the stewardship of MySQL in doubt after the purchase of Sun by Oracle, I was leery of taking on the challenge of making a MySQL provider for the toolkit. Yet, the benefits of doing so turned out to be very significant and not just for the MySQL users, as it forced me to challenge some of my very assumptions about SQL which in turn made some of my meager testing better, which in turn made the toolkit better as a whole. 

    I was surprised when I went to download the free and open-source database product that I had to fill out a bunch of information on myself and wait for a formal approval (that took many days) before I was *allowed* to access the product. This seemed like an awful lot of hassle and odd for an open source project; way too corporate. Fortunately, I persisted and got the server up an running with no problem. Next I searched high and low for MySQL version of Northwind so I could have something to run the tests against. If only Bing were available then!  I had to settle for plain ol' Live Search. Fortunately I found what I was looking for, even if it didn't offer me any cash back.

    MySQL seems to me what you get if you take a bunch of C programmers and tell them to make a SQL database product and yet offer no guidance. This is not criticism, just a bit of snarky wit over many of the underscore endowed API's that make up the product. Luckily, you'll not need to worry yourself about the manual dexterity needed to type such awkward function names, as you'll be using LINQ and the MySQL provider will do the work for you. On the positive side, the function library of MySQL is quite large and feature rich. I had no problem finding appropriate translations. MySQL even had many interesting features that I'd like to go back and refactor the toolkit to take advantage of.  Next release maybe.

    There were a few problems I uncovered while trying to get MySQL to pass the test suites.  I took the opportunity to fill out the execution suite that has languished for a few releases now, so I could actually assert what the correct output ought to be for specific queries. In almost all the cases MySQL did the right thing (as far as I was expecting.)  In most of the cases that it did not appear to do the right thing, it often turned out to be bad assumptions on my part about collation ordering and so removing those assumptions from the tests fixed everything up. The remaining problem still boggles my mind.

    In a test designed merely to prove that the translation of simple joins succeeded to generate the right query, the database did not return the correct number of rows. 

    from c in db.Customers
    from o in c.Orders
    from d in o.Details
    select d.ProductId

    which executes a MySql query that looks like this:

    SELECT d.ProductId
    FROM [Customers] AS c
    LEFT OUTER JOIN [Orders] AS o ON o.CustomerID = c.CustomerID
    LEFT OUTER JOIN [Order_Details] AS d ON d.OrderID = o.OrderID

    I expected to get a number of rows corresponding to the total number of order-details in the database.  But I didn't.  I got a whole lot less.  So I started experimenting with different forms of the query.  If I selected o.OrderID instead of d.ProductID, I got a different number from either of the other two. If I selected c.CustomerID, again an entirely different number of rows.  It was only when I chose to select the entire order-detail object did I get the number of rows I was expecting.

    Something strange seems to be going on.  On nobody's definition of SQL should I be getting a different number of rows depending on the columns in the selection (except if that SELECT has a DISTINCT operator specified.)  It was as if all queries were getting a default DISTINCT operation. When I took a look a the results when I selected out only ProductID, sure enough, all I got back were distinct product-id's.  MySQL has an 'ALL' keyword that is the semantic opposite of DISTINCT, so I tried adding that to the query, but to no avail. 

    I don't know if this is a problem in the query engine or the transport layer, or if the MySQL folks actually think it is appropriate behavior.  As far as I'm concerned it is catastrophically bad; E_FAIL. For this problem alone I would recommend not using MySQL. Still, it may not be as bad as it sounds. It is likely very rare that you'd ever actually write a query that was expected to retrieve duplicate rows of data. So it may not ever impact you at all. I do recommend testing all your queries (and your application) before you put it into production.
  • SQLite - This is an open-source database product that is similar in nature to MS Access and SQL Compact, since it runs in-proc and is not really a server. After wrangling with MySQL, I thought I knew all I needed to know about different forms of SQL, until Jonathan Peppers sent me his go at making a SQLite provider. ( See, I told everyone that I'd add new providers if they would send them to me.)  Of course, SQLite had a few problems of its own.

    While MySQL has a huge library of function, SQLite has a tiny one; many .Net API's don't have supported translations.  (This is of the set I had translated for MS SQL.  To be truthful, MS Access falls short on some of these too.)  So queries using API's other than some simple string manipulation or equality testing probably won't work against SQLite. 

    In addition one of SQLite's big drawbacks is its lack of a rich type system.  SQLite's developers claim this to be a feature, and it may afford great flexibility, but is often a death sentence for LINQ queries, which are all strongly typed.  SQLite does have types, but only a few of them, like number, text and binary.  Problems arise when you try to use DateTime's, as they are not really their own type in SQLite, but a text layout.  For example, if you were trying to find all orders for a customer that happen in January, you might write this query.

    from o in db.Orders
    where o.CustomerID == cust && o.OrderDate.Month = 1
    select o

    this would produce this query.

    SELECT o.OrderID, o.OrderDate
    FROM Orders AS o
    WHERE o.CustomerID = @cust AND STRFTIME('%m', o.OrderDate) = 1

    The date function to extract the month out of the OrderDate column is really a text formatting function that extracts the month portion of the date-time text layout, which in this case is the text '01', since the date-time format is always padded to a specific width.  When this is compared against the number (1), another subtle difference crops up. For all other SQL's I've come across, text is considered the weakest form of type, so when two types are compared for equality text is always converted to the other form.  Yet, in SQLite, the opposite it true (which truthfully is more like C# and Java).  So when the two types are compared they are found incompatible, because the number (1) is turned into the text '1' and that is not the same as the text '01'. Adding insult to injury, there appears to be no type conversion functions at all, so I can't even work around the problem by injecting a conversion. My ignorance of SQLite may just be showing here, or a lack of sufficient documentation.

    So my recommendation if you are using SQLite, to stay away from most API functions in your where clauses and such. API calls in the projection are okay, because these get executed on the client.

Transactions

I'm surprised no one's called me on this before. The DbQueryProvider and its ilk have been suspiciously lacking in support for transactions. The providers work, LINQ queries are converted into ADO Commands and executed, yet those ADO Command objects are never assigned an ADO transaction, even if you started one explicitly. 

Of course, the official word from Microsoft is to stop using the ADO transactions altogether and instead use the System.Transactions.TransactionScope object, that is newer, better and enables automatic use of distributed transactions, etc, etc, etc. And if you did use TransactionScope, then the problem I'm referring to would not be a problem. SqlCommand object's would implicitly enlist in the transation without me having to specify anything. 

Unfortunately, TransactionScope is mired with many problems and is not supported by all ADO providers so ADO transactions are still a necessity.  You can now use ADO transactions with query providers in a manner similar to LINQ to SQL.

provider.Transaction = provider.Connection.BeginTransaction();

// use the provider here to execute queries and updates, etc.

provider.Transaction.Commit();
provider.Transaction = null;

The provider will use whatever transaction object you give it when it creates new ADO command objects.

Entity Providers

I decided to formalize the pairing of query providers with a Table object that enables updates and other facilities.  The definition of an entity provider is now defined by these three interfaces.

public interface IEntityProvider : IQueryProvider
{
    IEntityTable<T> GetTable<T>(string tableId);
    IEntityTable GetTable(Type type, string tableId);
}

public interface IEntityTable : IQueryable, IUpdatable
{
    new IEntityProvider Provider { get; }
    string TableId { get; }
    object GetById(object id);
    int Insert(object instance);
    int Update(object instance);
    int Delete(object instance);
    int InsertOrUpdate(object instance);
}

public interface IEntityTable<T> : IQueryable<T>, IEntityTable, IUpdatable<T>
{
    new T GetById(object id);
    int Insert(T instance);
    int Update(T instance);
    int Delete(T instance);
    int InsertOrUpdate(T instance);
}

You can now always get to a table directly from a provider.  The two concepts are coupled together.  An entity table also has explicit CRUD methods and implements IUpdatable, so no more separation between normal tables and updatable tables. In my mind this simplifies things quite a bit. 

Of course, this caused me to want to rename DbQueryProvider.  Ooops.  This will likely cause you some grief as any of your existing code that was using DbQueryProvider directly is now not going to compile.  The new name for this class is now DbEntityProvider. It might not matter so much now that there is a nifty IEntityProvider interface.

Entity Sessions

One thing missing from the Toolkit so far has been all of that context stuff that LINQ to SQL and LINQ to Entities have.  When you use LINQ to SQL you have a change tracking service that detects when your objects change, and sends the updates for you all at the same time when you call SubmitChanges. 

An entity session is all of this change-tracking, deferred updating stuff packaged up together. It is distinctly different from an entity provider in these ways, yet similar to one in many others.

An entity session is defined below:

public interface IEntitySession
{
    IEntityProvider Provider { get; }
    ISessionTable<T> GetTable<T>(string tableId);
    ISessionTable GetTable(Type elementType, string tableId);
    void SubmitChanges();
}

public interface ISessionTable : IQueryable
{
    IEntitySession Session { get; }
    IEntityTable ProviderTable { get; }
    object GetById(object id);
    void SetSubmitAction(object instance, SubmitAction action);
    SubmitAction GetSubmitAction(object instance);
}

public interface ISessionTable<T> : IQueryable<T>, ISessionTable
{
    new IEntityTable<T> ProviderTable { get; }
    new T GetById(object id);
    void SetSubmitAction(T instance, SubmitAction action);
    SubmitAction GetSubmitAction(T instance);
}

public enum SubmitAction
{
    None,
    Update,
    PossibleUpdate,
    Insert,
    InsertOrUpdate,
    Delete
}

public static class SessionTableExtensions
{
    public static void InsertOnSubmit<T>(this ISessionTable<T> table, T instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Insert);
    }

    public static void InsertOnSubmit(this ISessionTable table, object instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Insert);
    }

    public static void InsertOrUpdateOnSubmit<T>(this ISessionTable<T> table, T instance)
    {
        table.SetSubmitAction(instance, SubmitAction.InsertOrUpdate);
    }

    public static void InsertOrUpdateOnSubmit(this ISessionTable table, object instance)
    {
        table.SetSubmitAction(instance, SubmitAction.InsertOrUpdate);
    }

    public static void UpdateOnSubmit<T>(this ISessionTable<T> table, T instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Update);
    }

    public static void UpdateOnSubmit(this ISessionTable table, object instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Update);
    }

    public static void DeleteOnSubmit<T>(this ISessionTable<T> table, T instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Delete);
    }

    public static void DeleteOnSubmit(this ISessionTable table, object instance)
    {
        table.SetSubmitAction(instance, SubmitAction.Delete);
    }
}

As you can see, an entity session has tables, just like a provider.  Yet, those tables are not directly updatable.  Instead you can assign entity instances submit actions. These are the actions that take place later when you call SubmitChanges.  There are a bunch of extension methods defined to add the LINQ to SQL like InsertOnSubmit() methods to the interface.  These simply call the SetSubmitAction() method for you.

Also note that a session is not a provider. It is a service used in conjunction with a provider. You can use multiple different sessions with the same provider instance.   

There is one current implementation of an entity session in the Toolkit called (you guessed it) DbEntitySession.  You create a DbEntitySession by giving it an existing DbEntityProvider. The DbEntitySession hooks the provider in such a way that it gets first crack at all materialized objects before they are returned to you. In this way, the DbEntitySession can employ an identity cache so queries that retrieve the same entity will always return the same entity instance, and it can start automatic change tracking on all entities returned.

You are also not locked into the session's behavior.  At any time you can interact with the underlying provider instead for retrieving entities without passing through the identity cache or being changed tracked.  You can even get to the provider's table directly off a session table.

Provider Factory

Now with so many providers and one single way to write queries you'd think it would be easy to switch between them.  In reality it is not. You have to pick the provider you want, reference its library (IQToolkit.Data.XXX), reference its corresponding ADO library (System.Data.XXX), create the ADO connection, the mapping object and construct the provider.

var connection = new SqlConnection("...");
var mapping = new AttributeMapping(typeof(Northwind));
var provider = new SqlProvider(connection, mapping, QueryPolicy.Default, null);
var db = new Northwind(provider);

You can hide this all inside your database context class (or whatever you want to call yours), so you only have to write it once, but then your context class is tied to a specific provider.  Instead, you could wrap this code up into a factory method of your own devising, but then calls to the factory would be spread throughout your codebase.  There no good way to defer all this work to some configuration setting. Until now.

Introducing the new factory methods built into DbEntityProvider.

public static DbEntityProvider FromApplicationSettings();
public static DbEntityProvider From(string filename, string mappingId); public static DbEntityProvider From(string provider, string connectionString, string mappingId);

These methods allow you to get up and running with only knowing a few bits of information.  You don't have to hard link you application to any particular provider.

The FromApplicationSettings method creates you a new instance of a provider from information found in the config file.  It looks for the "Provider", "Connection" and "Mapping" properties in the configuration and feeds them to the other factories.  It is also possible to look this information up in web settings, but I have not formalized that one yet.

The provider argument is a string that refers to the name of an assembly that contains the query provider.  These are generally of the form IQToolkit.Data.XXX.  If that assembly is not loaded, it will be loaded dynamically. This assembly can be in the assembly cache or in the same directory as your app (or other places that the runtime might look.)  From this assembly it will look for a type in the same namespace (as the name of the assembly) that derives from DbEntityProvider.

The connectionString and filename arguments are really the same thing. You can specify either the name of a database file or a full ADO connection string. If a file is specified, a correct connection string is obtained by calling the static GetConnectionString(string) method on the provider. A provider may be inferred from the file extension of a database file if none is specified.

The mappingId can either refer to the name of a context class (like Northwind) that has mapping attributes on it or the name of an xml file. 

So now you can write code like this to get your provider.

var db = new Northwind(DbEntityProvider.From(somedbfile, somemapfile));

Or better yet, you can use the FromApplicationSettings() method in the constructor of your context and still be configurable at runtime.

Madness

Of course, it wouldn't be a new toolkit release without some additional crazy changes.  One significant change is namespaces again. This time its not going to conflict with your code too much.  Most of the classes that where in IQToolkit.Data have been demoted into the namespace IQToolkit.Data.Common.  This includes most all classes that are implementation detail or base classes.  Mapping attributes and the like are now in IQToolkit.Data.Mapping.  This makes the namespace clean and obvious when you start looking for things via intellisense. 

DbEntityProvider and DbEntitySession are the only classes sitting in IQToolkit.Data, as these are the ones you'll likely need to reference when writing code.  IEntityProvider and IEntitySession are in IQToolkit namespace, because they are not specific to ADO (System.Data classes).


I hope you find this version feature rich enough to either build application directly on top of it, or model your own provider or data layer by using these techniques.

Don't forget the audio tapes.

 

This is the fourteenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you might request a weeks vacation, sit back, relax with a mochacino in one hand a netbook in the other, or if you've got better things to do with your time print them all out and stuff them under your pillow. Who knows, it might work better.

Complete list of posts in the Building an IQueryable Provider series

Okay, enough with all the post-is-late guilt! It's done now, so breathe a sigh of relief and get on with the reading.

What's inside:

More Mapping - Finally a real mapping system, with attributes and XML.

More Providers - MS Access and MS SQL Server Compact Edition

More POCO - Constructors, Enum and Interfaces.

More More More

The full source code can be found at:

http://www.codeplex.com/IQToolkit

More Mapping

  • Attribute Mapping - put attributes on the properties in the class that declares your tables.

    This differs from LINQ to SQL mapping attributes which are placed on the entities themselves and is more like the proposed LINQ to Entity mapping attributes. However, I've not actually gone out of my way to make them the same. The advantages to this approach are 1) keeping the mapping separate from the entity objects (more POCO), and 2) being able to supply different mapping for the same entity type based on the table the entities are accessed from. 

    Mapping attributes look like this:
  • [Table]
    [Column(Member = "CustomerId", IsPrimaryKey = true)]
    [Column(Member = "ContactName")]
    [Column(Member = "CompanyName")]
    [Column(Member = "Phone")]
    [Column(Member = "City", DbType="NVARCHAR(20)")]
    [Column(Member = "Country")]
    [Association(Member = "Orders", KeyMembers = "CustomerID", RelatedEntityID = "Orders", RelatedKeyMembers = "CustomerID")]
    public IUpdatableTable<Customer> Customers

You specify the Table, Column and Association attributes as necessary.  The 'Member' refers to the member in the entity type. If this is the same name as the database's column name you don't need to repeat it by specifying 'Name' too. 

You can specify nested mapping information by using a dot in the Member name. This allows you to have what some call value types, but to keep from clashing with .Net terminology I don't. For example, if you've defined an Address type that you want to use in a nested relationship (actually embedded in the same table row) you can do that like this:

[Table]
[Column(Member = "EmployeeID", IsPrimaryKey = true)]
[Column(Member = "LastName")]
[Column(Member = "FirstName")]
[Column(Member = "Title")]
[Column(Member = "Address.Street", Name = "Address")]
[Column(Member = "Address.City")]
[Column(Member = "Address.Region")]
[Column(Member = "Address.PostalCode")]
public IUpdatable<Employee> Employees
  • Xml Mapping -- this is same as attribute based mapping but data is read from an XML file. 

    Xml mapping looks like this:
  • <?xml version="1.0" encoding="utf-8" ?>
    <map>
      <Entity Id="Customers">
        <Table Name="Customers" />
        <Column Member = "CustomerId" IsPrimaryKey = "true" />
        <Column Member = "ContactName" />
        <Column Member = "CompanyName" />
        <Column Member = "Phone" />
        <Column Member = "City" DbType="NVARCHAR(20)" />
        <Column Member = "Country" />
        <Association Member = "Orders" KeyMembers = "CustomerID" RelatedEntityID = "Orders" RelatedKeyMembers = "CustomerID" />
      </Entity>
      <Entity Id="Orders">
        <Column Member = "OrderID" IsPrimaryKey = "true" IsGenerated = "true"/>
        <Column Member = "CustomerID" />
        <Column Member = "OrderDate" />
        <Association Member = "Customer" KeyMembers = "CustomerID" RelatedEntityID = "Customers" RelatedKeyMembers = "CustomerID" />
        <Association Member = "Details" KeyMembers = "OrderID" RelatedEntityID = "OrderDetails" RelatedKeyMembers = "OrderID" />
      </Entity>
      <Entity Id="OrderDetails">
        <Table Name="Order Details"/>
        <Column Member = "OrderID" IsPrimaryKey = "true" />
        <Column Member = "ProductID" IsPrimaryKey = "true" />
        <Association Member = "Product" KeyMembers = "ProductID" RelatedEntityID = "Products" RelatedKeyMembers = "ProductID" />
      </Entity>            
    </map>

You use it like this:

XmlMapping mapping = XmlMapping.FromXml(TSqlLanguage.Default, File.ReadAllText(@"northwind.xml"));
SqlQueryProvider provider = new SqlQueryProvider(connection, mapping);
  • Multi-table mapping -- Map multiple tables into a single entity.  If you've got entity data spread out over multiple tables with a 1:1 association between them you can now specify the additional tables in mapping using the ExtensionTable attribute or equivalent XML element. 

    Here's what a multi-table mapping looks like:
  • [Table(Name = "TestTable1", Alias = "TT1")]
    [ExtensionTable(Name = "TestTable2", Alias = "TT2", KeyColumns = "ID", RelatedAlias = "TT1", RelatedKeyColumns = "ID")]
    [ExtensionTable(Name = "TestTable3", Alias = "TT3", KeyColumns = "ID", RelatedAlias = "TT1", RelatedKeyColumns = "ID")]
    [Column(Member = "ID", Alias = "TT1", IsPrimaryKey = true, IsGenerated = true)]
    [Column(Member = "Value1", Alias = "TT1")]
    [Column(Member = "Value2", Alias = "TT2")]
    [Column(Member = "Value3", Alias = "TT3")]
    public IUpdatable<MultiTableEntity> MultiTableEntities

Extension tables are specified similar to how Associations are specified, except you are never referring to members, only column names.  You use the 'Alias' value to connect column & association mappings with columns from particular tables.  All queries for this multi-table entity treat the 'Table' as the primary table queried, all other tables are queried with left-outer joins.  All keys for associations must be from the same alias.

Can I mix nested mapping with multi-table mapping?  I have not tried it, but in theory it should work. It should not matter which table your nested entity gets it's data from, so in effect you can have a composition relationship between one table and another as long as it is 1:1.  

What about many-to-many?  Not yet. Making the system query a many-to-many relationship is relatively easy.  I haven't yet figured out the right semantics for inserts & updates. Right now, all insert, updates and deletes are explicit via calls to the IUpdatable with real-live entities. Yet how do you make an explicit update to the link table that you don't have an entity directly mapped to?  I need to ponder this some more.  Possibly if one side of the relationship is a composition as opposed to an association, then it would be implied when that side is updated.  Yet what if you chose not to load the relationship, how do you tell the system to not delete all previous relationships?

More Providers

  • MS Access -- This new query provider works with both Access 2000 - 2003 and Access 2007 data files. I don't know what the true differences are between the Jet and the Ace engines; the query language appears to be identical (as per my limited tests so far), yet the filename extension changed in Access 2007 to 'accdb' instead of 'mdb' and the northwind sample database plumped up an extra 66% in disk size without any additional data.

In order to make this work I've added an AccessLanguage object that is necessary to get the correct semantics for MS Access queries and an AccessFormatter object that handles generating the correct command text. In order to salvage as much as I could from the TSqlFormatter, I moved most of this code to a common SqlFormatter base class, and now the TSQL and Access formatters only supply the deviations from the standard syntax.  (Of course, 'standard' is currently whatever I deem it to be so don't go getting some actual online specification and prove me wrong.) Access only allows one command at a time, so that added an extra wrinkle, but in the end there is now support in the system for providers that can only do one command at a time. This means there are multiple round-trips to the engine for things like inserting a record and getting back the computed keys. Luckily, the access engine is in-proc so this is not really a burden. A new property on QueryLanguage, 'AllowMultipleCommands' determines how the execution plan is generated and whether multiple commands can be lumped together into a single ADO command.

The good news is that the access engine passes almost all the Northwind tests; some are not possible (mostly ones testing translation of framework methods that have no apparent equivalent in the access expression engine).  There were a lot of hairy strange & subtle differences in syntax between Access and TSQL, but most were handled by having different format rules, some required new expression visitors to change the query tree, like no explicit cross joins!  This caused me to write a visitor to attempt to get rid of cross joins (often injected by my visitor that tries to get rid of cross-apply joins) which is now generally useful to everyone, and if that didn't do it, another visitor that would attempt to isolate out the cross joins from any other joins and push them into sub-queries where Access lets me use the old-style comma-list, which is truly a cross join, though it just can't be mixed with other kinds of joins in the same from clause.

  • SQL Compact -- Yes, even more SQL Server.  Though to be truthful, SQL Server Compact Edition (aka SQL CE, aka SQL Compact, aka Skweelzy) is not really SQL Server, it is some other entirely different product that handles a subset of TSQL syntax, and is not a server at all since it runs in-proc just like MS Access.

  • What about MySQL or Oracle?
    One day. The fact is that MS SQL and MS Access are easy for me to get to, they are already on my box. Getting something else up and running would take actual effort, and the MS secret database police might come get me. Meanwhile, if someone out there wants to put together a provider implementation I'll add it into the drop.

  • Where did the SqlQueryProvider go?
    I moved it. With the addition of the new providers it became apparent that I'd have to start factoring out all this ADO provider specific nonsense, otherwise all uses of the toolkit would have direct dependencies to way more than necessary. So I made separate projects, each building its own library. I may end up separating all the core 'data' stuff out into its own project too.

    The solution builds these libraries now:

- IQToolkit.dll
- IQToolkit.Data.Access.dll
- IQToolkit.Data.SqlClient.dll
- IQToolkit.Data.SqlServerCe.dll

More POCO

  • Constructors -- Use entities that don't have default constructors.  It is now possible to have entities that require invocation of a constructor with parameters.  The binding process will figure out how to call your constructor and the client side code will call it for you as long as the constructor parameter names match property names. You can even have fully read-only entities if all data member are accounted for in the constructor.

  • Enums -- They actually sort of work now. You can have a member in your entity typed as some enum and you get automatic mapping between that enum and a numeric type in the database.

  • Interfaces and abstract base classes -- You can now declare you IQueryable's as IQueryable<SomeInterface> or IQueryable<SomeAbstractClass> and have the provider create instances of a compatible type under the covers, automatically translating all your references to interface or abstract class members to the appropriate mapped members. You can have mutiple entities share a common interface or base class and get different mapping for each. You can write code using generics and constrain generic parameters based on an interface and write queries that will get correct translation at runtime.  (Note, variation of mapping likely won't work with compiled queries, since the translation is fixed on the first execution.)

Less Policy

There's not a whole lot of policy being used right now and the policy objects dependence on the mapping object was no where near as deep as the mapping object's dependence on the language.  So policy is now independent of mapping, which means you can construct providers without specifying policy and/or reusing mapping with different policies.  Now if I could only make it simpler to specify/construct mapping without needing to know the language.  Back to the drawing board.

More Insanity

I apologize for the churn. The namespace changed so now all heck is going to break loose. Gone is the simple 'IQ' namespace and in its place is the 'IQToolkit' namespace.  I really did like the 'IQ' name, it was short, classy and made you feel intelligent just by looking at it.  Yet, it was hard to guess at if you did not already know what it was.  I chose to change the namespace name to match the product name and the DLL name. You add reference to the IQToolkit.dll and you import/use the IQToolkit namespace. No fuss, no muss.  Except for all those files you'll have to edit now. But hey, this is pre-pre-pre-pre beta stuff. Some people may think they are something special by snarkily keeping all their products in beta. They've got a lot to learn.

I hope this toolkit is becoming useful to many. I realize there have been a variety of requests for new things in the toolkit that I just have not gotten time to put in yet.  So you can expect plenty more in the future.

So enough with reading.  It's time to code!

This is the thirteenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you probably have a life beyond the keyboard, but if you don't then follow the link below to find oodles more to help fill your meaningless existence.

Complete list of posts in the Building an IQueryable Provider series

It's been precisely the correct amount of time that it took for me to complete the additional goodness that is jam packed into this drop, less actual work, dinners out, dinners in, any interesting film and televisions programs, housework, trips out to the store, family game night, time reading fiction, napping on the couch and other assorted unavoidable activities.

The full source code can be found at:

http://www.codeplex.com/IQToolkit

I'll try to cover as much as I can in this post, however you'll like find other gems by scouring the source itself.

What's inside:

Updates - Insert, Update & Delete operations.

Batch processing - true SQL Server batch processing.

Server language type systems - correct parameter types.

Mapping Changes - use the same class with multiple tables, etc.

Insert, Update and Delete

It's about time that this toolkit actually got usable right out of the box.  My original intention with the series was to show how to build an IQueryable provider and that turned more and more into a fully working query engine that you could actually use to get real work done. Yet, how many real world applications only ever need to pull data out of a database and never push it back?  Not many.

So I knew I'd eventually want to add updates, because I knew that you'd eventually need to do it too. Yet, every time I started thinking about updates I always fell into the trap of thinking about full blown ORM's with object tracking, et al, and I did not really want to go there, at least not yet. As a toolkit I think its just fine to define the primitives that a more advanced system might be built out of. And there is nothing wrong with those primitives being generally useful on their own. So you should be able to use the toolkit as-is and not only get a pretty good query engine but also something that at least works as a rudimentary data access layer.

Common Primitives

So then I set about thinking about just the primitives for updating data. They should have semantics similar to the underlying SQL equivalent commands. That means they should not defer work until some time later, but execute immediately. There should be at least the familiar commands, Insert, Update and Delete; but also Upsert (both Insert & Update combined) since its so often the right thing for many situations.

Also, like other LINQ operations, update commands should be a pattern, and be available for any kind of provider. So I set out thinking about what the pattern would look like and how it might be specified.  This is what I came up with.

public interface IUpdatable : IQueryable
{
}

public interface IUpdatable<T> : IUpdatable, IQueryable<T>
{
}

public static class Updatable
{
    public static S Insert<T, S>(this IUpdatable<T> collection, T instance, Expression<Func<T, S>> resultSelector)
    public static S Update<T, S>(this IUpdatable<T> collection, T instance, Expression<Func<T, bool>> updateCheck, Expression<Func<T, S>> resultSelector)
    public static S InsertOrUpdate<T, S>(this IUpdatable<T> collection, T instance, Expression<Func<T, bool>> updateCheck, Expression<Func<T, S>> resultSelector)
    public static int Delete<T>(this IUpdatable<T> collection, T instance, Expression<Func<T, bool>> deleteCheck)
    public static int Delete<T>(this IUpdatable<T> collection, Expression<Func<T, bool>> predicate)
    public static IEnumerable<S> Batch<T,S>(this IUpdatable collection, IEnumerable<T> instances, Expression<Func<T, S>> fnOperation, int batchSize, bool stream)
... }

This pattern works just like the LINQ Enumerable and Queryable patterns.  I've declared an interface 'IUpdatable' that extends IQueryable, so anything that is updatable is also queryable, and then an Updatable class with a bunch of new extension methods that encapsulate the pattern.  (I realize the IUpdatable name may be in conflict with some other library, but until I think of something better this is what it is.)

The Insert method inserts an object instance into a collection. It's not an ordinary insert, like with List<T>. The collection is considered to be remote and inserting into it copies the data from your instance. It has an optional result-selector argument that can be a function you supply to construct a result out of the object after it has been inserted. This, of course, is intended to occur on the server and can be used to read back auto-generated state and computed expressions.

IUpdatable<Customer> customers = ...;
Customer c = ...;
customers.Insert(c);

IUpdatable<Order> orders = ...;
Order o = ...;
var id = orders.Insert(o, d => d.OrderID);

The Update method updates a corresponding object already in the collection with the values in the instance supplied. This is a complete overwrite, not a dynamic update like LINQ to SQL would have generated. I have not yet defined a piecemeal update operation, but I still can.  We'll see how it goes.  In addition to an update selector (like the one for insert) you can also specify an update-check predicate. This is an expression evaluated against the server's state and can be used to implement optimistic concurrency by basically checking to see if the server's state is still the same as you remembered it. An ORM layer built on top of this primitive might choose to generate this expression automatically, based on mapping information, but here you must specify it manually if you want to use it.

IUpdatable<Customer> customers = ...;
Customer c = ...;
var computedValue = customers.Update(c, d => d.City == originalCity, d => d.ComputedColumn);

The InsertOrUpdate is the 'Upsert' operation.  It will basically insert an object into the collection if a corresponding one does not exist, or update the one that does with the new values. You specify it just like you'd specify an update, instead you call InsertOrUpdate.

There are two flavors of Delete. The first one lets you delete the object in the collection corresponding to the instance. You can optionally specify a delete-check, which is similar to the update-check, a predicate function evaluated against the server's state. The delete will only occur if the check passes. The second flavor just lets you specify a predicate. It's basically a delete-all-where method and will delete all objects from the collection that match the predicate. So far, its the only SQL-like 'set-based' operation I've defined.

IUpdatable<Customer> customers = ...;
Customer c = ...;
customers.Delete(c, d => d.City == originalCity);

IUpdatable<Customer> customers = ...;
Customer c = ...;
customers.Delete(c => c.CustomerID == "ALFKI");

The last operation is Batch.  It will allow you to specify an operation to apply to a whole set of instances. The operation can be one of the other commands like Insert or Update.  You can use this method Insert, Update or Delete a whole bunch of objects all at the same time. If possible, the provider will use optimized batching techniques to give you extra performance.

IUpdatable<Customer> customers = ...;
Customer[] custs = new Customer[] { ... };

customers.Batch(custs, c => customers.Insert(c));

If you've got many objects to update and you want to have instance specific update-checks done, you can sneak the extra information into the batch process by combining the data together into a single collection and then piecing them apart in the operation.

IUpdatable<Customer> customers = ...;
var oldAndNew = new [] { new { Old = oldCustomer, New = newCustomer }, ...};
customers.Batch(oldAndNew, (u, x) => u.Update(x.New, d => d.City == x.Old.City));
 

Updates and DbQueryProvider

In order to make use of this new capability I'm going to need a new object to declare the IUpdatable interface.  The Query<T> class only implemented IQueryable<T>, and that was fine as long as I only ever want to query.  Now I also want to be able to update, so I need a new class to represent the root of my query that I can also update. These things in databases are called tables, so that's what I'll stick with. 

public interface IQueryableTable : IQueryable
{
    string TableID { get; }
}

public interface IQueryableTable<T> : IQueryable<T>, IQueryableTable
{
}

public class QueryableTable<T> : Query<T>, IQueryableTable<T>
{
    string id;

    public QueryableTable(IQueryProvider provider, string id)
        : base(provider)
    {
        this.id = id;
    }

    public QueryableTable(IQueryProvider provider)
        : this(provider, null)
    {
    }

    public string TableID
    {
        get { return this.id; }
    }
}

public interface IUpdatableTable : IQueryableTable, IUpdatable
{
}

public interface IUpdatableTable<T> : IQueryableTable<T>, IUpdatable<T>, IUpdatableTable
{
}

public class UpdatableTable<T> : QueryableTable<T>, IUpdatableTable<T>
{
    public UpdatableTable(IQueryProvider provider, string id)
        : base(provider, id)
    {
    }

    public UpdatableTable(IQueryProvider provider)
        : this(provider, null)
    {
    }
}

You'll note that not only did I define a UpdatableTable<T> class, which is specifically what I wanted, I also went ahead and made a QueryableTable<T>, and extra interfaces to correspond to them.  This is intentional.  Eventually, I may want to add more methods specific to tables here and I'll need a place to put them.  Right now I've only added a property 'TableID'.  You can ignore it for now, though it will get more interesting when I discuss the mapping changes.

Take a look at the Northwind class in the test source code and you'll see how I made use of my new table class.

The Plumbing

Of course, update commands work in the query provider just like queries do.  First there are a bunch of new DbExpression nodes to represent them.

public abstract class CommandExpression : DbExpression
{
}

public abstract class
CommandWithResultExpression : CommandExpression { public abstract Expression Result { get; } }
public class InsertExpression : CommandWithResultExpression { public TableExpression Table { get; } public ReadOnlyCollection<ColumnAssignment> Assignments { get; } public override Expression Result { get; } } public class ColumnAssignment { public ColumnExpression Column { get; } public Expression Expression { get; } } public class UpdateExpression : CommandWithResultExpression { public TableExpression Table { get; } public Expression Where { get; } public ReadOnlyCollection<ColumnAssignment> Assignments { get; } public override Expression Result { get; } } public class UpsertExpression : CommandWithResultExpression { public Expression Check { get; } public InsertExpression Insert { get; } public UpdateExpression Update { get; } public override Expression Result { get; } } public class DeleteExpression : CommandExpression { public TableExpression Table { get; } public Expression Where { get; } } public class BatchExpression : CommandExpression { public Expression Input { get; } public LambdaExpression Operation { get; } public Expression BatchSize { get; } public Expression Stream { get; } }

Then there's the standard visit method in DbExpressionVisitor, DbExpressionWriter, etc.  Binding them happens in the QueryBinder just like all other query operations, but the work of deciding what nodes to generate gets doled out to the QueryMapping object.  Luckily, the base QueryMapping class has a default implementation that builds the correct DbExpression node.  If you want to map a single object into multiple tables or some other crazy scheme you'll probably have to have a more advanced mapping implementation. :-)

These nodes get plumbed through the system until they are encountered by the ExecutionBuilder and are formatted using the QueryLanguage rules. The TSQL formatter converts the nodes into corresponding TSQL text.  Depending on the contents of the command expression, the generated SQL may have one or more actual TSQL operations.

Batch Processing

ADO.Net has this nice feature built into its SqlClient API; the ability to get high-performance batch processing. Yet, the only way to get at it is through use of DataSet's or DataReaders. As far as I'm concerned this is rather low level and a bit complicated to use if you are starting out with domain objects and not DataSets. Your data access layer should do this for you. Yet, in order for it to do it, the abstraction for batch processing has to exist, which is why I added it to the updatable pattern.  After that it was a cinch. :-)  Not really. 

What I needed fundamentally was something that would execute the same database command over and over again with different sets of parameters. This is basically what TSQL batching does as it is sent over the wire. So I needed to add this abstraction to DbQueryProvider. Yet, since only SqlClient supports this actually behavior I'd need a fall back plan. So DbQueryProvider implements a method to do batch processing, but it does not actually do it optimally. 

public virtual IEnumerable<int> ExecuteBatch(QueryCommand query, IEnumerable<object[]> paramSets, int batchSize, bool stream)
{
    var result = this.ExecuteBatch(query, paramSets);
    if (!stream)
    {
        return result.ToList();
    }
    else
    {
        return new EnumerateOnce<int>(result);
    }
}

private IEnumerable<int> ExecuteBatch(QueryCommand query, IEnumerable<object[]> paramSets)
{
    this.LogCommand(query, null);
    DbCommand cmd = this.GetCommand(query, null);
    foreach (var paramValues in paramSets)
    {
        this.LogMessage("");
        this.LogParameters(query, paramValues);
        this.SetParameterValues(cmd, paramValues);
        int result = cmd.ExecuteNonQuery();
        yield return result;
    }
}

public virtual IEnumerable<T> ExecuteBatch<T>(QueryCommand query, IEnumerable<object[]> paramSets, Func<DbDataReader, T> fnProjector, int batchSize, bool stream)
{
    var result = this.ExecuteBatch(query, paramSets, fnProjector);
    if (!stream)
    {
        return result.ToList();
    }
    else
    {
        return new EnumerateOnce<T>(result);
    }
}

private IEnumerable<T> ExecuteBatch<T>(QueryCommand query, IEnumerable<object[]> paramSets, Func<DbDataReader, T> fnProjector)
{
    this.LogCommand(query, null);
    DbCommand cmd = this.GetCommand(query, null);
    cmd.Prepare();
    foreach (var paramValues in paramSets)
    {
        this.LogMessage("");
        this.LogParameters(query, paramValues);
        this.SetParameterValues(cmd, paramValues);
        var reader = cmd.ExecuteReader();
        if (reader.HasRows)
        {
            reader.Read();
            yield return fnProjector(reader);
        }
        else
        {
            yield return default(T);
        }
        reader.Close();
    }
}

What I have here are four methods, two of which are just private implementations, but two others that are virtual so can be overridden. Batching can work via streaming or not. If streamed the results of each execution (or batch) is yielded out. This works great if the number of individual items is large, but takes a lot of discipline to remember to actually inspect the results or nothing gets executed at all!  By requesting no streaming (stream == false) the execution occurs immediately and the results are packaged into a list that you can conveniently ignore if you so choose. That's why the implementation is separated out, so that the enumerable can be captured and converted to a list, enabling either behavior.  The two types of ExecuteBatch differ in whether a result is computed via information coming back from the server or not.

Now that these are defined, I can implement a new kind of provider, a SqlClient specific version that makes automatic use of optimized batching.

public class SqlQueryProvider : DbQueryProvider
{
    ...
public override IEnumerable<int> ExecuteBatch(QueryCommand query, IEnumerable<object[]> paramSets, int batchSize, bool stream) { var result = this.ExecuteBatch(query, paramSets, batchSize); if (!stream) { return result.ToList(); } else { return new EnumerateOnce<int>(result); } } private IEnumerable<int> ExecuteBatch(QueryCommand query, IEnumerable<object[]> paramSets, int batchSize) { SqlCommand cmd = (SqlCommand)this.GetCommand(query, null); DataTable dataTable = new DataTable(); for (int i = 0, n = query.Parameters.Count; i < n; i++) { var qp = query.Parameters[i]; cmd.Parameters[i].SourceColumn = qp.Name; dataTable.Columns.Add(qp.Name, qp.Type); } SqlDataAdapter dataAdapter = new SqlDataAdapter(); dataAdapter.InsertCommand = cmd; dataAdapter.InsertCommand.UpdatedRowSource = UpdateRowSource.None; dataAdapter.UpdateBatchSize = batchSize; this.LogMessage("-- Start SQL Batching --"); this.LogCommand(query, null); IEnumerator<object[]> en = paramSets.GetEnumerator(); using (en) { bool hasNext = true; while (hasNext) { int count = 0; for (; count < dataAdapter.UpdateBatchSize && (hasNext = en.MoveNext()); count++) { var paramValues = en.Current; dataTable.Rows.Add(paramValues); this.LogMessage(""); this.LogParameters(query, paramValues); } if (count > 0) { int n = dataAdapter.Update(dataTable); for (int i = 0; i < count; i++) { yield return (i < n) ? 1 : 0; } dataTable.Rows.Clear(); } } } this.LogMessage(string.Format("-- End SQL Batching --")); } }

Note that I only have an implementation for the variation of ExecuteBatch that computes no user specified result. This is due to there being no back-channel available when using SqlClient batching.

The implementation uses DataTable's and DataAdapters to make this work.  A DataTable is created filled with the parameters necessary for executing the command.  The DataAdapter is used to invoke the batch using the Update method.  Of course, this doesn't actually have to be an update command. I can also use this to batch inserts and deletes too, or really any TSQL that I want to execute as long as I don't expected to get a bunch of data back.

Server Language Types

One thing that has always bugged me about SQL Server was the need to get the command parameters right. If I declare the parameter to be the wrong text flavor I can cause serious performance issues for the query. So just setting the parameter values and having the SqlCommand object guess at the right SqlType encoding is really not a good plan. Fortunately, it is often possible to figure out the correct parameter types to use if the information is available via the mapping. Parameters are often never just sent to the server for no reason. If I use a parameter in a query I'm usually comparing it against a column.  In most cases I can simply infer that the parameter should have the same server type as the column.

So I've gone ahead and defined a new server type system abstraction and threaded it into some of the DbExpressions and make use of it in some of the visitors. 

public abstract class QueryType
{
    public abstract DbType DbType { get; }
    public abstract bool NotNull { get; }
    public abstract int Length { get; }
    public abstract short Precision { get; }
    public abstract short Scale { get; }
}

public abstract class QueryTypeSystem 
{
    public abstract QueryType Parse(string typeDeclaration);
    public abstract QueryType GetColumnType(Type type);        
}

A QueryType encodes the typical database scalar type. It has a few properties that you can use to inspect common attributes of a server type, but most code won't really care to know the details, it will just pass the information along until it ends up where I need it.  A QueryTypeSystem is basically a factory for producing QueryType's.  The Parse method will construct a language-specific QueryType out of some text encoding.  This is typically the server language syntax for declaring a column of that particular type, like 'VARCHAR(10)'. 

A QueryTypeSystem is specific to a language, so QueryLanguage is where you go to get one.

public abstract class QueryLanguage
{
    public abstract QueryTypeSystem TypeSystem { get; }
}

One place I definitely know where to encode server types is in the ColumnExpression.  If a ColumnExpression knows what its server type is, then when I get to the point of comparing parameters to columns I know which server type is in play.

public class ColumnExpression : DbExpression, IEquatable<ColumnExpression>
{
    public QueryType QueryType { get; }
}

I've also stuck it into NamedValueExpression, because that's the type I'm using for parameters.

public class NamedValueExpression : DbExpression
{
    public QueryType QueryType { get; }
}

And I've basically modified Parameterizer, so that if a column and parameter (named-value expression) are ever paired together in any binary expression, I'll infer the parameter to have the same server type as the column.

public class Parameterizer : DbExpressionVisitor
{
    protected override Expression VisitBinary(BinaryExpression b)
    {
        Expression left = this.Visit(b.Left);
        Expression right = this.Visit(b.Right);
        if (left.NodeType == (ExpressionType)DbExpressionType.NamedValue
         && right.NodeType == (ExpressionType)DbExpressionType.Column)
        {
            NamedValueExpression nv = (NamedValueExpression)left;
            ColumnExpression c = (ColumnExpression)right;
            left = new NamedValueExpression(nv.Name, c.QueryType, nv.Value);
        }
        else if (b.Right.NodeType == (ExpressionType)DbExpressionType.NamedValue
         && b.Left.NodeType == (ExpressionType)DbExpressionType.Column)
        {
            NamedValueExpression nv = (NamedValueExpression)right;
            ColumnExpression c = (ColumnExpression)left;
            right = new NamedValueExpression(nv.Name, c.QueryType, nv.Value);
        }
        return this.UpdateBinary(b, left, right, b.Conversion, b.IsLiftedToNull, b.Method);
    }

    protected override ColumnAssignment VisitColumnAssignment(ColumnAssignment ca)
    {
        ca = base.VisitColumnAssignment(ca);
        Expression expression = ca.Expression;
        NamedValueExpression nv = expression as NamedValueExpression;
        if (nv != null)
        {
            expression = new NamedValueExpression(nv.Name, ca.Column.QueryType, nv.Value);
        }
        return this.UpdateColumnAssignment(ca, ca.Column, expression);
    }
}

Of course, the same goes for ColumnAssignment used by Insert and Update commands. You'll notice that I'm not having these types flow throughout the expression tree like normal types.  I could probably get more edge cases correct if I did, but for now this handles most of the cases. 

The GetCommand method in  DbQueryProvider will now make use of this info when constructing parameters. The SqlQueryProvider expects to see a new TSqlType that's made available by a new TSqlTypeSystem found on the TSqlLanguage object.  :-)

public class SqlQueryProvider : DbQueryProvider
{
    protected override DbCommand GetCommand(QueryCommand query, object[] paramValues)
    {
        // create command object (and fill in parameters)
        SqlCommand cmd = new SqlCommand(query.CommandText, (SqlConnection)this.Connection);
        for (int i = 0, n = query.Parameters.Count; i < n; i++)
        {
            QueryParameter qp = query.Parameters[i];
            TSqlType sqlType = (TSqlType)qp.QueryType;
            if (sqlType == null)
                sqlType = (TSqlType)this.Language.TypeSystem.GetColumnType(qp.Type);
            var p = cmd.Parameters.Add("@" + qp.Name, sqlType.SqlDbType, sqlType.Length);
            if (sqlType.Precision != 0)
                p.Precision = (byte)sqlType.Precision;
            if (sqlType.Scale != 0)
                p.Scale = (byte)sqlType.Scale;
            if (paramValues != null)
            {
                p.Value = paramValues[i] ?? DBNull.Value;
            }
        }
        return cmd;
    }
}

Mapping Changes

I've made some changes to the mapping system (or QueryMapping to be particular.)  I came across a variety of odd behavior while developing update logic that basically boiled down to the ImplictMapping object not being able to tell the difference between a type that was intended to correspond to a database table and others that appeared there just for the sake of representation, like LINQ anonymous types.  Some other mapping implementations might be able to tell the difference, but the simplest one couldn't so I needed to find another solution.

Obviously, everything that is an entity in a query comes from somewhere, and that's either from one of the roots of the query (a table) or via a relationship property. It was a mistake to think otherwise (or not think about it at all.) What I needed was a more explicit representation in the expression tree of what was an entity and what was not.  I figured I could either add annotations to the tree in every node, or find some nominal solution that would do the trick. 

I chose to make a new expression node, EntityExpression, which I use as a wrapper around any expression that would be normally constructing an entitiy.  This node is placed into the system when the QueryMapping first creates the sub-express for constructing an entity or relationship, at the time I actually know that I'm dealing with an entity and in particular which entity it is.

public class EntityExpression : DbExpression
{
    public MappingEntity Entity { get ; }
    public Expression Expression { get; }
}

I've also introduced a new abstraction called MappingEntity.  This how I let the QueryMapping object place a bread-crumb into the expression tree so it can be reminded which exact entity was being referred to.

public class MappingEntity
{
    public string TableID { get; }
    public Type Type { get; }
}

It' really just as simple little class that minimally remembers the correspondence between the runtime type and the table its being mapped to.  If you've been paying attention you'll realize that this 'TableID' is the same property that was added to the IQueryableTable interface.  That's how the query engine gets the table-id in the first place, right from the start of the query.  Of course, the IQueryableTable<T> interface also knows the runtime type, that's the 'T' part.  So your table's have all the information needed to make a MappingEntity.  Except that job is deferred to the QueryMapping object so it can do whatever bookkeeping it wants.

public abstract class QueryMapping
{
public virtual MappingEntity GetEntity(Type type); public virtual MappingEntity GetEntity(Type type, string tableID); }

You'll also notice that most of the other methods on QueryMapping are now modified to take a MappingEntity as an argument.

public abstract class QueryMapping
{
    public virtual bool IsMapped(MappingEntity entity, MemberInfo member)
    public virtual bool IsColumn(MappingEntity entity, MemberInfo member)
    public virtual bool IsIdentity(MappingEntity entity, MemberInfo member)
    public virtual bool IsComputed(MappingEntity entity, MemberInfo member)
    public virtual bool IsGenerated(MappingEntity entity, MemberInfo member)
    public virtual bool IsRelationship(MappingEntity entity, MemberInfo member)
    public virtual MappingEntity GetRelatedEntity(MappingEntity entity, MemberInfo member)
    public virtual bool IsAssociationRelationship(MappingEntity entity, MemberInfo member)
... }

Now, what you're probably saying is "Gee, Matt, that's looks a bit ominous. Why aren't these all just methods on MappingEntity now?"  And you'd be right. They probably should be.  Yet, it makes it a lot more difficult to subclass the mapping object and merely override some of the behavior.  Not a big deal for complete mapping sub-systems that someday might exists, but painful for simple uses such as overriding the ImplicitMapping with a few additional rules.  So I'm leaving it as-is for now until I can think about it more.  Any thoughts from the peanut-gallery?

Also, its worth nothing, that given this new arrangement, with explicit entity information in the query tree and connecting each entity info back to the table it originated from, it is now possible to support mapping systems that allow individual runtime types to be mapped to more than one table.  So you can have your cake and eat it too! 

That's All Folks!

At least for today.  The future may hold more goodies.  Any suggestions are welcome, either as ideas or source-code contributions.

Remember, you can get the current sources at CodePlex:  http://www.codeplex.com/IQToolkit

The LINQ IQueryable Toolkit is now a CodePlex project. 

http://www.codeplex.com/IQToolkit 

Going forward this will the be official site to find the latest greatest source bits.  I'll continue to post here about the toolkit, how to use it and to show off new features in this blog, but you'll have to follow a link to download.

You can also start discussions over how utterly fabulous the source code is and how you can't wait for the next drop. 

It's up to you.

This is the twelfth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you probably were born yesterday. How could you possibly make sense of this post without any context at all?  At least make an attempt. Sometimes I don't know why I bother.

Complete list of posts in the Building an IQueryable Provider series 

<Insert standard disclaimer for why there have been no updates on this for 'like' forever>

It's been so long since I last posted we must be up to Web 5.0 by now. I suspect there are not any actual web developers left. Its all just twitter application programming and cybernetic mind melds now. Does anybody still write code? In a text based language? Without an electric shunt wired directly into your cerebral cortex?  Nobody?

Sigh. I'm probably just talking to an empty internet; everyone's moved on to Planck Space. But since I've got nothing better to do, I might as well get on with it.

IQueryable Toolkit

One of the first things you'll notice when you take a look at the source is that I changed it quite a bit.  I moved code around, changed names gratuitously, added & removed classes and broke a lot of continuity with the prior versions. One of the biggest changes is that the code is no longer just a sample.  All those internal classes are public, the project builds as a DLL, the tests are hosted separately and the namespace is no longer 'Sample.'  Why, oh why would I do this?  Simple.

I originally started the project to give developers a structured hands-on walk through of the construction of an IQueryable provider; hoping to inspire many of you to build your own. (Many of you have.) The source code was made available so you could learn from it and as a cheap way to get you to debug it for me. (Many of you haven't.) And so you could take from it what you would, copy a class here, steal a method there, etc, to make your job easier to get your own LINQ to XYZ up off the  ground quicker.

Yet, I received so many requests to re-use the entire code base wholesale, that I now realize what you really want is a toolkit not just some sample code. As a toolkit it is still sample code. The sources are still there and you can take from it what you will.  Or you can build it into a shiny DLL and then build your code on top of it.  All the pieces I've built so far are publicly re-usable. Many of the classes have been enhanced for extensibility, so you can mix & match and override to your hearts content to build the system you want out of the pieces I already have.

How it breaks down

At the top there is a namespace 'IQ'.  I thought that it was cleaver & consise. Maybe someone out there will come up with a better one, something catchy, like a cold, and I'll feel compelled to catch it, fall under-the-weather for a few weeks and emerge blissfully sedated enough to go with it. Maybe. Under the 'IQ' namespace exists all the bits and pieces that are generally useful for any IQueryable impementation.  This is where you'll find the basic ExpressionVisitor class, the generic Query and base QueryProvider class.  You'll also get these:

ExpressionComparer - compares two expression trees for equality. immensely useful.

ExpressionWriter - Get a better translation of that narly expression tree into a  C#-ish syntax that you can actually read.  Helps debugging a lot.

Grouping - It's that dead simple implementation of the LINQ IGrouping interface. 

CompoundKey - A class that helps you represent compound key values.

ScopedDictionary - 'cuz I needed it and now you can have it too.  Works sort of like a Dictionary but with nested scopes.

TypeHelper - Used to be called TypeSystem (how pretentious of me).  Helps you know a few things about types.

DeferredList - An implementation of IList that enables deferred loading.  (Bells are probably ringing in your head about now.)

Nested inside this namespace is another one, called simply 'Data'  (I know, original, huh). In here you'll find all those other classes that made up most of my previous posts like the provider itself, DbQueryProvider, that works on top of any DbConnection.  To really understand what has gone on in my head you need to get into this class and look around.

Logically, it still does the same thing.  It primarily just implements the IQueryable.Execute method, converting query expressions into SQL commands, executing them using the DbConnection and translating the results into objects. That's all still there, its just been re-arranged a bit and made a lot more extensible.

How it is extended is the interesting bit that flavors everything else to come. The DbQueryProvider's brains are now fully pluggable. I took a look at what was going on during execution and broke it down into logically atomic steps. The prior version used to first translate the query, build a generic reader and then execute and return.  That's still sort-of what happens, but there's an even better break-down.

Queries are translated, but the translation goes through three distinct phases: 

1) mapping is applied

2) policy is enforced

3) the target query language has the last say. 

Each of these steps is pluggable.  How?  There are three new classes to get acquainted to.

QueryMapping - Defines information & rules to map an object model onto a database model

QueryPolicy - Defines information & rules to determine inclusion of related data & execution plans

QueryLanguage - Defines information & rules to adhere to a target language, including converting the query into text

Every DbQueryProvider is now supplied these three at runtime, each can be overriden by you to take control at different points in then translation and execution process. The mapping, policy & language can each override a portion of the query translation pipeline, injecting its own rules as rewrites of the query expression using the common DbExpression primitives provided.

In addition, the QueryLanguage controls the final translation of the tree to SQL text (or whatever target language is being used), and the QueryPolicy is invoked to build the execution plan.  Note, the execution plan is not the simple object-reader of yore.  It is now an entire runtime generated program for completely executing the database query & constructing objects out of the results.  The QueryPolicy is allowed to do whatever final rewrite is necessary to turn the translated query expression into an executable piece of code.

Of course, by default, these three mostly do what was being done before, and if you care to look you'll see that the QueryLanguage's Format method (for turning queries into text) simply defers to the TSqlFormatter class (which used to be called QueryFormatter.)  You can overidde this method and have it call your own formatter.  You can even override the TSqlFormatter and make a few minor changes if your back end SQL is largely the same.

If you want to implement a different strategy to retrieving hierarchical results, you can inject your own logic both during query translation and you can take over the entire process of building the execution plan.  The default BuildExecutionPlan on the QueryPolicy class defers to the ExecutionBuilder (used to be ProjectionBuilder in its former life), but you can change that and do it your own way.  Of course, all the source is available, so you can build your version based off of the one that exists in the toolkit.

The crazy logic for inferring mapping information by simply using the names of types and members is retained, but walled off in a specific implementation of mapping called the ImplicitMapping.  You can re-use this one if your needs are as meager as my demos.  Or you can build your own.

Of course, now that you've had a moment to think about it, given so much flexibility, you'll inevitably start asking about what database providers are supported.  (Still just TSQL.)  You'll also want to know what complex mapping models I've invented and how they compare to ones used by ORM's.  (Still just the implicit demo one.)  And then you'll want to know what I did to improve reading data hierarchies because that N+1 strategy is just a loser.  (Here's where you finally get something out of me.)

Relationships and Loading

The query translation finally understands relationship properties. The QueryMapping understands the abstraction of mapping properties, these can map to columns or relationships (like associations) or whatever you want.  The mapping decides what's possible.  What have I implemented?  The basic mapping understands association relationships only; those kinds of relationships that are made via a join matching one or more columns across two tables.

The interesting part (at least to me) is how relationships are dealt with during query translation and later during execution.  Does a one to many relationship property residing in the output projection lead to extra queries at execution time or not?  If so, how many? 

I've implemented a few rewriters that deal with relationships.  For example, the SingletonProjectionRewriter finds projections of singleton relationships (one to one or many to one) and folds them into the query itself using a server-side join.  A singleton query doesn't change the cardinality of the results, so it is generally safe.  Yet, what happens when you refer to the same singleton relationship property more than once? Do you get the same join over and over again? To hold back the flood of redundant joins I had to come up with another rewriter, one that would find the duplicate uses and get them to place nice together, but because this situation can happen more often than just as the side effect of doing the singleton rewrites, I wrote it to work on joins and not just nested projections. The RedundantJoinRemover looks for identical joins expressed in the same query and throws out all the extra ones.  It does not look through into sub-queries, so its best to first removed redundant sub-query layers using the RedundantSubqueryRemover in hopes of forcing as many joins into the same layer.

The ClientJoinedProjectionRewriter attempts to do something very different. It looks to convert nested projections (ones that would become extra queries per object at execution time) into client-side joins. So you'll get one join per included relationship. This is not the "Big Join" strategy that LINQ to SQL uses. It precisely the strategy that I still consider inferior if I'm asked to ensure correctness of the results. So why did I choose to implement it? I'm not being asked to insure the correctness of results, nor am I being asked to ensure that large result sets can stream back from the server. I assume that the normal degree of ambiguity of results you get any time you attempt to get related results via more than one query is okay in this case too. I also assume that a user will choose to employ a transaction to get better consistency. You can also disable this one easily if you want. You can implement a policy that uses one or more other strategies.

This is what the ClientJoinedProjectionRewriter actually does.  If the SelectExpression of a nested projection is constrained to the outer query via an equality comparison, like this column equals that parameter and so on I figure I can easily turn this into a client-side LINQ to object style join.  So I rewrite the query so it restates all the significant bits of the queries that came before it.  So if you wanted to retrieve customers in London and all their orders you'll get one query for the customers in London and then another query for all the orders for all the customers in London. The idea is that the second query is actually run first and all the objects are stuffed into a lookup table. When the query for customers executes, it picks up the matching orders right out of the lookup table.  Neat.

Taking it for a Spin

Given a simple query for customers in London:

var query = from c in db.Customers
            where c.City == "London"
            select c;

And, assuming I have a policy that tells the provider that the Orders property on customers is supposed to be included whenever I ask for customers.  (The TestPolicy used by the supplied test project lets me pick and choose properties by name.)

When I start out, the provider first sees the query as LINQ expression tree:

Query(Test.Customer).Where(c => (c.City = "London"))


After mapping the query expression now looks like this: 

Project(
  @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
  FROM [Customers] AS t0
  WHERE (t0.City = 'London')",
  new Customer() {
    City = A0.Column("City"),
    ContactName = A0.Column("ContactName"),
    Country = A0.Column("Country"),
    CustomerID = A0.Column("CustomerID"),
    Phone = A0.Column("Phone")
  },
  p => Queryable.AsQueryable(p)
)

It's got the SelectExpression (formatted by default as TSQL), followed by an expression that constructs a customer out of the query's result columns, and function that converts the presumed resulting IEnumerable<Customer> into the expected type. 

Next the QueryPolicy is asked to translate, so it applies rules like determining if a relationship property is included in the output or not.  Since I set up the policy to automatically include orders, I get a new expression that looks like this:

Project(
  @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
  FROM [Customers] AS t0
  WHERE (t0.City = 'London')",
  new Customer() {
    City = A0.Column("City"),
    ContactName = A0.Column("ContactName"),
    Country = A0.Column("Country"),
    CustomerID = A0.Column("CustomerID"),
    Phone = A0.Column("Phone"),
    Orders = Project(
      @"SELECT t0.CustomerID, t0.OrderDate, t0.OrderID
      FROM [Orders] AS t0
      WHERE (t0.CustomerID = t2.CustomerID)",
      new Order() {
        CustomerID = A1.Column("CustomerID"),
        OrderDate = A1.Column("OrderDate"),
        OrderID = A1.Column("OrderID")
      },
      p => Enumerable.ToList(p)
    )
  },
  p => Queryable.AsQueryable(p)
)

Now I've got two queries, one nested inside the other.  If nothing happens to this, the inner query will get executed once per row in the outer query results.

Next, the QueryPolicy also applies the the ClientJoinedProjectionRewriter.  It attempts to convert nested projections into a query to fetch all the data (that will be executed once per outer query) with the resulting objects joined on the client machine.

Project(
  @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
  FROM [Customers] AS t0
  WHERE (t0.City = 'London')",
  new Customer() {
    City = A0.Column("City"),
    ContactName = A0.Column("ContactName"),
    Country = A0.Column("Country"),
    CustomerID = A0.Column("CustomerID"),
    Phone = A0.Column("Phone"),
    Orders = ClientJoin(
      OuterKey(A0.Column("CustomerID")),
      InnerKey(A1.Column("CustomerID")),
      Project(
        @"SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
        FROM [Customers] AS t0
        OUTER APPLY (
          SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
          FROM [Orders] AS t2
          WHERE (t2.CustomerID = t0.CustomerID)
          ) AS t1
        WHERE (t0.City = 'London')",
        Outer(
          A1.Column("Test"), 
          new Order() {
            CustomerID = A1.Column("CustomerID1"),
            OrderDate = A1.Column("OrderDate"),
            OrderID = A1.Column("OrderID")
          }
        ),
        p => Enumerable.ToList(p)
      )
    )
  },
  p => Queryable.AsQueryable(p)
)

Now I have an embedded ClientJoin node.  Notice how the inner query has a join so that it will retrieve all the orders for all the customers in London now.  Unfortunately, its an OUTER APPLY join when it does not really need to be.  But don't fear.  The QueryLanguage will soon fix it.  The combined query also gets a new column 'Test'. This is used to determine if the join found at least one matching row or not. If no match was found instead of the value 1 the 'Test' column will be null. I can use this information to make sure the resulting collection for a non-match is empty.

Next the QueryLanguage get its shot at translating the tree.  When it does, it attempts to get rid of any APPLY node if at all possible; OUTER APPLY's become LEFT OUTER JOIN's and CROSS APPLY's become INNER JOIN's.

Project(
  @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
  FROM [Customers] AS t0
  WHERE (t0.City = 'London')",
  new Customer() {
    City = A0.Column("City"),
    ContactName = A0.Column("ContactName"),
    Country = A0.Column("Country"),
    CustomerID = A0.Column("CustomerID"),
    Phone = A0.Column("Phone"),
    Orders = ClientJoin(
      OuterKey(A0.Column("CustomerID")),
      InnerKey(A1.Column("CustomerID")),
      Project(
        @"SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
        FROM [Customers] AS t0
        LEFT OUTER JOIN (
          SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
          FROM [Orders] AS t2
          ) AS t1
          ON (t1.CustomerID = t0.CustomerID)
        WHERE (t0.City = 'London')",
        Outer(
          A1.Column("Test"), 
          new Order() {
            CustomerID = A1.Column("CustomerID1"),
            OrderDate = A1.Column("OrderDate"),
            OrderID = A1.Column("OrderID")
          }
        ),
        p => Enumerable.ToList(p)
      )
    )
  },
  p => Queryable.AsQueryable(p)
)

Now the query expression is fully translated we only have left to build the execution plan.  The QueryPolicy is asked to turn the expression into a program that will do the actual execution.

Invoke(
  null,
  lookup1 => ((IQueryable<Customer>)ExecutionBuilder.Sequence(new Object[] {
    ExecutionBuilder.Assign(
      lookup1,
      Enumerable.ToLookup(
        Enumerable.Where(
          ((DbQueryProvider)Query(Test.Customer).Provider).Execute(
            new QueryCommand<KeyValuePair<String,Order>>(
              @"SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
              FROM [Customers] AS t0
              LEFT OUTER JOIN (
                SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
                FROM [Orders] AS t2
                ) AS t1
                ON (t1.CustomerID = t0.CustomerID)
              WHERE (t0.City = @p0)",
              new String[] {"p0"},
              r1 => new KeyValuePair<String,Order>(
                r1.IsDBNull(0)
                  ? null
                  : ((String)Convert.ChangeType(
                    r1.GetValue(0),
                    System.String
                  )),
                r1.IsDBNull(1)
                  ? null
                  : new Order() {
                    CustomerID = r1.IsDBNull(2)
                      ? null
                      : ((String)Convert.ChangeType(
                        r1.GetValue(2),
                        System.String
                      )),
                    OrderDate = r1.IsDBNull(3)
                      ? new DataTime("1/1/0001 12:00:00 AM")
                      : ((DateTime)Convert.ChangeType(
                        r1.GetValue(3),
                        System.DateTime
                      )),
                    OrderID = r1.IsDBNull(4)
                      ? 0
                      : ((Int32)Convert.ChangeType(
                        r1.GetValue(4),
                        System.Int32
                      ))
                  }
              )
            ),
            new Object[] {((Object)"London")}
          ),
          kvp => kvp.Value != null
        ),
        kvp => kvp.Key,
        kvp => kvp.Value
      )
    ),
    Queryable.AsQueryable(((DbQueryProvider)Query(Test.Customer).Provider).Execute(
      new QueryCommand<Customer>(
        @"SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
        FROM [Customers] AS t0
        WHERE (t0.City = @p0)",
        new String[] {"p0"},
        r0 => new Customer() {
          City = r0.IsDBNull(0)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(0),
              System.String
            )),
          ContactName = r0.IsDBNull(1)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(1),
              System.String
            )),
          Country = r0.IsDBNull(2)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(2),
              System.String
            )),
          CustomerID = r0.IsDBNull(3)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(3),
              System.String
            )),
          Phone = r0.IsDBNull(4)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(4),
              System.String
            )),
          Orders = Enumerable.ToList(lookup1.get_Item(r0.IsDBNull(3)
            ? null
            : ((String)Convert.ChangeType(
              r0.GetValue(3),
              System.String
            ))))
        }
      ),
      new Object[] {((Object)"London")}
    ))
  }))
  )

Now we're cookin' with GAS!  This is the exact LINQ expression that can be run to execute the query and produce the results.  Notice how the code directly calls the DbQueryProvider.Execute method that takes the command text, parameters and a function that converts a DbDataReader into a sequence of objects.  That function is described inline with the rest of the LINQ expression.  Yes, you can do that.  That's how all those other LINQ operators work.

Also notice how I cheat with expression trees.  The process to create the lookup table requires variables, assignments and statement sequences; all things not currently possible with expression trees -- or so you thought.  The variable is really a parameter to a lambda expression.  If you directly invoke an inline lambda expression you basically create a variable that is in scope over the body of the lambda.  Next I call a method on ExecutionBuilder called Sequence.  This lets me simulate having statement sequences.  All the expressions are evaluated in order because the results are turned into an argument array.  The Sequence method simply returns the last value in the array.  The last cheat is the assignment.  While there is no assignment operator in the expression tree (at least not yet), it is possible to pass a parameter to a method as a by-ref argument.  Yes, this is ugly and causes side-effects, but that's exactly what I wanted.  (Until we get support for statements sometime in the future!)  The ExecutionBuilder Assign method take a ref argument and a value and simply assigns the value to the ref argument. 

When the query is finally executed, all the data is retrieved up front in only two queries.

SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
FROM [Customers] AS t0
LEFT OUTER JOIN (
  SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
  FROM [Orders] AS t2
  ) AS t1
  ON (t1.CustomerID = t0.CustomerID)
WHERE (t0.City = @p0)
-- @p0 = [London]

SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
FROM [Customers] AS t0
WHERE (t0.City = @p0)
-- @p0 = [London]

 

Well that's it then.  Kudos, if you've read this far you really are trying to understand how to build your own IQueryable provider.  Either that or you are up late at night with insomnia trying to find something softer than a hammer to put yourself to sleep with.

Have fun with the new code!

 

Note: all sources are made available under the Microsoft Public License (MSPL)

This is the eleventh in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you’ll want to do so before proceeding, or at least before proceeding to copy the code into your own project and telling your boss you single-handedly solved the data layer problem over the weekend.

Complete list of posts in the Building an IQueryable Provider series 

No excuses for a late post this time. In my estimation, not only am I early but I have arrived with abundance. This is not my ordinary, keep it to the basics, slow and steady post about adding that next incremental feature to a feature-starved sample program. This is the Big Kahuna! This is the post where I finally go over the edge, lose my cool and let loose on the code. No more Mr. Nice guy.  Now, I’m really taking it to the cleaners. 

There’s so much goodness here I don’t even know where to start.  Even if I did, how could I possibly describe it all in one little post? I’m not even going to try. You’re going to have to download the source to get the real truth. A link is located at the bottom of the post. Click it. Feel the power flow over the internet, through the wires and onto your screen. Feel the radiance seep in through your eyes and take hold of your brain with the vice-like grip of a big-time wrestler.

So just what have I been cooking?  Let’s take a look at the highlights.

More Query Operators

That’s right; a lot more. Don’t believe me? Look at the code and see for yourself. Too busy? Boss breathing down your neck. Okay, here’s the list.

Distinct - Not only is there translation for this, but Distinct also makes an appearance inside most aggregates! AVG(DISTINCT t0.Price) anyone!

Skip & Take – Both work and work well together using the ROW_NUMBER & BETWEEN combo that gets incredible performance.

First, FirstOrDefault, Single & SingleOrDefault – No, these are not just the same thing. They really do behave differently.

Any, All & Contains - Not only do these guys work, but all three work with local collections. Use collection.Any(xxx) to get any predicate expression expanded into a series of OR conditions over the input set. 

Framework Methods and Properties


Hallelujahs, brothers! Now you can write queries that reference simple String, DateTime, Decimal, and Math operations, and get them translated into equivalent SQL.  I’ve not implemented every possible method but you’ll find a lot there. There are so many I can’t even list them all here. Rest assured you’ll get classics like string StartsWith, EndsWith and Contains and every variation of Equals and Compare that the framework can throw at me.

Parameters

That’s what I said, parameters. No more SQL injection attacks in the sample code, I’ve finally gotten around to adding parameter support. Take a look, it’s all rather straight forward; a new node type NamedValueExpression, a new visitor Parameterizer and a few extra lines of code to create and assign parameter values at execution time. Not every argument is parameterized either. Simple numeric values are left as literals.

Simpler Queries

The RedundantSubqueryRemover has been enhanced to know how to merge more types of SelectExpression’s together. Now, you’ll get even less layers of subqueries.

Compiled Queries

OMG! Can it be so? Yes, compiled queries too. How is it possible? It took a little refactoring of how queries are executed. There is now a new low-level Execute method on DbQueryProvider that takes a translated command, parameters and a function that converts a DbDataReader into a result (this could be sequence or a single value.)  Given that and LINQ Expression trees themselves, it is easy to see how I can turn a request to execute a LambdaExpression into constructing a function that when called, calls down to the low-level execute method with pre-translated SQL queries and a pre-translated object reader.

Unit Tests

Okay, get back up into your chair. I realized how easy it is to lose control and find the need to roll around the floor in ecstasy. However, your co-workers are starting to wonder if they should call for medical assistance. 

This does not mean I’m using TDD, or DDT or any TLA methodology; it just an apology really. The sample codebase has become so unwieldy that even I can’t trust myself anymore and have discovered so many prior features that I’ve broken in the process of doing bigger and better things that I literally broke down and wrote some simple unit tests. AND they are simple too. For the most part they prove 1) some LINQ operators are translated into the SQL I thought looked good at the time, and 2) the database server actually accepts these queries and does something with them.  There are over 200 of these tests now.  There is also a start at verifying actual semantics by looking at the results.  I started a second bank of tests for these and so far only have a few compiled-query tests listed.  More surely to come.

 

Whew!  I know that’s a lot to GROK, but take a look anyway.  If you’ve been building your own provider and have been stuck on how to do some of these kinds of features (except for tests, shame on you if you haven’t got your own tests yet), now you can get moving again.

But WAIT, there’s more!  I’m not done yet.  Even with all this I’ve shown you today, I still have a few more ideas.  More posts to come. 

This is the tenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you'll want to find a nice shady tree, relax and meditate on why your world is so confused and full of meaningless tasks that it has kept you from pursuing the perfection of writing IQueryable providers.

Complete list of posts in the Building an IQueryable Provider series 

Last time I blamed the television writers strike for delaying me in getting out my next installment.  This time I blame the lack of one, and sunny days, and my son riding his bicycle, and the grueling, tiresome task of getting paid.  Would you believe some of the code we are writing has nothing what-so-ever to do with IQueryable LINQ providers? Crazy, I know. Maybe it would be a different story around here if we had a few more shady trees. 

Fortunately for you I have a prime piece of savory source code ready for slow grilling over a bed of hot mesquite. It's something I'd like to say I was saving for a special occasion, but the truth is I've been putting it off because I thought the code would be just too overwhelming. I mean for me, not you. You see I've been trying to do two things with this series; one, provide a guide to help developers navigate and understand the power and flexibility of the IQuerayble interface, and two, attempt to prove to myself that code written in a purely functional style (or as pure as I can make it) can end up much cleaner and easier on the eyes. I've been horrified ever since tackling OrderBy that this subject would become my undoing. Knowing how involved the translation was for LINQ to SQL I was dreading the immutable madness that would ensue.  Obviously, I survived the ordeal and so will you.

Grappling with GroupBy and Aggregates

Translating just GroupBy itself might not be so difficult if I did not have to also account for aggregates. It seems that GroupBy is not very interesting without the ability to write expressions involving Count() and Max() and Sum(), and that's where the difficulty sets in.

A query involving groups and aggregates that looks like this:

var query = from o in db.Orders

            group o by o.CustomerID into g

            select g.Max(o => o.OrderID);

Is translated into a series of method calls that looks like this:

var query = db.Orders.GroupBy(o => o.CustomerID).Select(g => g.Max(o => o.OrderID));

 

And this is a problem because as I translate from the bottom up (as I have been doing all along) building individual SELECT subqueries for each query operator, I won’t even discover the use of the aggregate ‘Max’ until after I’ve created the SELECT with the GROUP BY.  So how do I get the two pieces together in the same place at the same time without violating my functional style and immutable expression nodes?  Weirder yet, how do I even know that a call to an aggregate method should be tied back to a particular GroupBy? What if I had two group-bys?

 

So where do I start to explain?  Maybe the easy stuff first.  I’ve added a GroupBy property to SelectExpression.

 

internal class SelectExpression : Expression {
    ...

    ReadOnlyCollection<Expression> groupBy;

 

    internal SelectExpression(
        ...,

        IEnumerable<Expression> groupBy

        )

    internal ReadOnlyCollection<Expression> GroupBy {

        get { return this.groupBy; }

    }

}

 

Now every place where I construct one I have to specify a collection of grouping expressions (or null).  GroupBy expressions also get visited after everything else (so far).  Here’s the new version of VisitSelect in DbExpressionVisitor.

 

protected virtual Expression VisitSelect(SelectExpression select) {

    Expression from = this.VisitSource(select.From);

    Expression where = this.Visit(select.Where);

    ReadOnlyCollection<ColumnDeclaration> columns = this.VisitColumnDeclarations(select.Columns);

    ReadOnlyCollection<OrderExpression> orderBy = this.VisitOrderBy(select.OrderBy);

    ReadOnlyCollection<Expression> groupBy = this.VisitExpressionList(select.GroupBy);

    if (from != select.From

        || where != select.Where

        || columns != select.Columns

        || orderBy != select.OrderBy

        || groupBy != select.GroupBy

        ) {

        return new SelectExpression(select.Type, select.Alias, columns, from, where, orderBy, groupBy);

    }

    return select;

}

 

I also added an AggregateExpression class. It represents a call to an aggregate operator.

 

    internal enum AggregateType {

        Count,

        Min,

        Max,

        Sum,

        Average

    }

 

    internal class AggregateExpression : Expression {

        AggregateType aggType;

        Expression argument;

        internal AggregateExpression(Type type, AggregateType aggType, Expression argument)

            : base((ExpressionType)DbExpressionType.Aggregate, type) {

            this.aggType = aggType;

            this.argument = argument;

        }

        internal AggregateType AggregateType {

            get { return this.aggType; }

        }

        internal Expression Argument {

            get { return this.argument; }

        }

    }

 

Now I at least know what to turn those aggregates into.  But how do I get these aggregate expressions into the right place in the query tree?  What if I did nothing and just let the aggregates fall where they may?  What would it look like?  Would it even be legal SQL?

 

SELECT MAX(t5.OrderID)

FROM (

  SELECT t0.CustomerID

  FROM Orders AS t0

  GROUP BY t0.CustomerID

  ) AS t5

 

Oh, no. Danger Will Robinson! This is not legal SQL. Where does OrderID even come from? It’s not even projected out of the sub-query with the GROUP BY. And even if it were somehow projected, the result of the query would be all wrong.  It would give me a single maximum instead of one for each group.  Getting group-by & aggregates to work together is going to be tricky.

 

Is it even possible to solve it and maintain my layering approach?  What about correlated sub-queries?  I could form a sub-query that joins back to the original query based on the grouping expressions and have it contain the aggregate.

 

SELECT (

  SELECT MAX(t2.OrderID)

  FROM Orders AS t2

  WHERE ((t2.CustomerID IS NULL AND t5.CustomerID IS NULL) OR (t2.CustomerID = t5.CustomerID))

  ) AS c0

FROM (

  SELECT t0.CustomerID

  FROM Orders AS t0

  GROUP BY t0.CustomerID

  ) AS t5

 

Now that at least executes on the server. It might not be pretty and likely not efficient unless the server is really good at unscrambling my query, but it does execute and produce the correct results. So it is technically legal. But surely I can do better!

The GroupBy Operator

 

By now you are thinking, OMG, he’s going to stick us with a wacky solution like he did with nested queries!  Never fear, my friend.  Do not look at the man behind the current.  Cast your gaze ahead and all will be made clear. In my hands, I have the GroupBy operator.  Watch as it transforms into a SQL query inside the QueryBinder!

 

    protected override Expression VisitMethodCall(MethodCallExpression m) {

        if (m.Method.DeclaringType == typeof(Queryable) ||

            m.Method.DeclaringType == typeof(Enumerable)) {

            switch (m.Method.Name) {
                ...

                case "GroupBy":

                    if (m.Arguments.Count == 2) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            null,

                            null

                            );

                    }

                    else if (m.Arguments.Count == 3) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            (LambdaExpression)StripQuotes(m.Arguments[2]),

                            null

                            );

                    }

                    else if (m.Arguments.Count == 4) {

                        return this.BindGroupBy(

                            m.Arguments[0],

                            (LambdaExpression)StripQuotes(m.Arguments[1]),

                            (LambdaExpression)StripQuotes(m.Arguments[2]),

                            (LambdaExpression)StripQuotes(m.Arguments[3])

                            );

                    }

                    break;
                ...

            }

        }

        return base.VisitMethodCall(m);

    }

 

As you can see, I’m handling three different variations of GroupBy.  The first one simply takes a single keySelector lambda (the thing to group by).  It is supposed to return a sequence of grouping’s that each have a group key value and a collection of elements that make up that group.  The second form is the same as the first except it also includes an elementSelector lambda that allows me to specify a map between an item in the source sequence and its shape of the elements in the group. The last form includes a resultSelector lambda that allows me to specify what the key and group collection turn into, instead of assuming they always form into instances of IGrouping<K,E>. With all these variations, the translation of group-by is going to get a bit involved, so bear with me.

 

Get ready. Get set. Open your eyes. Close them. Now open them again and really look. Go!

 

    protected virtual Expression BindGroupBy(Expression source, LambdaExpression keySelector, LambdaExpression elementSelector, LambdaExpression resultSelector) {

        ProjectionExpression projection = this.VisitSequence(source);

       

        this.map[keySelector.Parameters[0]] = projection.Projector;

        Expression keyExpr = this.Visit(keySelector.Body);           

 

        Expression elemExpr = projection.Projector;

        if (elementSelector != null) {

            this.map[elementSelector.Parameters[0]] = projection.Projector;

            elemExpr = this.Visit(elementSelector.Body);

        }

       

        // Use ProjectColumns to get group-by expressions from key expression

        ProjectedColumns keyProjection = this.ProjectColumns(keyExpr, projection.Source.Alias, projection.Source.Alias);

        IEnumerable<Expression> groupExprs = keyProjection.Columns.Select(c => c.Expression);

 

        // make duplicate of source query as basis of element subquery by visiting the source again

        ProjectionExpression subqueryBasis = this.VisitSequence(source);

 

        // recompute key columns for group expressions relative to subquery (need these for doing the correlation predicate)

        this.map[keySelector.Parameters[0]] = subqueryBasis.Projector;

        Expression subqueryKey = this.Visit(keySelector.Body);

       

        // use same projection trick to get group-by expressions based on subquery

        ProjectedColumns subqueryKeyPC = this.ProjectColumns(subqueryKey, subqueryBasis.Source.Alias, subqueryBasis.Source.Alias);

        IEnumerable<Expression> subqueryGroupExprs = subqueryKeyPC.Columns.Select(c => c.Expression);

        Expression subqueryCorrelation = this.BuildPredicateWithNullsEqual(subqueryGroupExprs, groupExprs);

       

        // compute element based on duplicated subquery

        Expression subqueryElemExpr = subqueryBasis.Projector;

        if (elementSelector != null) {

            this.map[elementSelector.Parameters[0]] = subqueryBasis.Projector;

            subqueryElemExpr = this.Visit(elementSelector.Body);

        }

       

        // build subquery that projects the desired element

        string elementAlias = this.GetNextAlias();

        ProjectedColumns elementPC = this.ProjectColumns(subqueryElemExpr, elementAlias, subqueryBasis.Source.Alias);

        ProjectionExpression elementSubquery =

            new ProjectionExpression(

                new SelectExpression(TypeSystem.GetSequenceType(subqueryElemExpr.Type), elementAlias, elementPC.Columns, subqueryBasis.Source, subqueryCorrelation),

                elementPC.Projector

                );

 

        string alias = this.GetNextAlias();

        // make it possible to tie aggregates back to this group-by

        GroupByInfo info = new GroupByInfo(alias, elemExpr);

        this.groupByMap.Add(elementSubquery, info);

 

        Expression resultExpr;

        if (resultSelector != null) {

            Expression saveGroupElement = this.currentGroupElement;

            this.currentGroupElement = elementSubquery;

            // compute result expression based on key & element-subquery

            this.map[resultSelector.Parameters[0]] = keyProjection.Projector;

            this.map[resultSelector.Parameters[1]] = elementSubquery;

            resultExpr = this.Visit(resultSelector.Body);

            this.currentGroupElement = saveGroupElement;

        }

        else {

            // result must be IGrouping<K,E>

            resultExpr = Expression.New(

                typeof(Grouping<,>).MakeGenericType(keyExpr.Type, subqueryElemExpr.Type).GetConstructors()[0],

                new Expression[] { keyExpr, elementSubquery }

                );

        }

 

        ProjectedColumns pc = this.ProjectColumns(resultExpr, alias, projection.Source.Alias);

 

        // make it possible to tie aggregates back to this group-by

        Expression projectedElementSubquery = ((NewExpression)pc.Projector).Arguments[1];

        this.groupByMap.Add(projectedElementSubquery, info);

 

        return new ProjectionExpression(

            new SelectExpression(TypeSystem.GetSequenceType(resultExpr.Type), alias, pc.Columns, projection.Source, null, null, groupExprs),

            pc.Projector

            );

    }

 

It starts off easy, following the same binding pattern as the other operators.  First I bind the key selector and get a key expression. Then if I have an element selector I bind that too, to get an element expression, otherwise I use the projector expression from the incoming source as the element expression. Got it?

 

Now I have to figure out what the grouping expressions are going to be. You might be thinking I already know what they are, and there’s only one of them. It’s the key expression, right? Well, not exactly. It turns out its common to need to group by more than one property or column of data. The only way to do that with LINQ is to construct an anonymous type with multiple parts. So a key might be an anonymous type with multiple interesting fields. How do I turn that into one or more expression that can be evaluated as part of a SQL group by clause?

 

It turns out I already have a handy function that does what I want. If I treat the key expression they same way I treat a projection expression, I can use the ColumnProjector to turn the expression into a list of column declarations, each with its own scalar expression.  Then I can simply throw away the column declarations and keep the expressions that define the columns and voila I have grouping expressions!  Don’t try this at home.  Err, I mean, go ahead and try this at home. It’s easy. Watch, I’ll do it again in a moment.

 

Moving along we come to some code trying to compute a ‘basis’. What’s that?  Well, it turns out SQL doesn’t really support a GroupBy operator the way it is defined in LINQ. The LINQ operator, by default, produces a sequence of groups with each group containing the elements that got designated as part of that group. SQL group-by can only produce aggregate computations over the groups, not the groups themselves.  Oh, dear, what to do, what to do?

 

The only way to solve this problem is to form a self join back to the data I want. If I take the result of the group-by and join it back to the original query (everything up until the group-by) matching group-by key values I get all the data back that SQL was not going to let me have.  That’s what the ‘subqueryBasis’ is.  It’s a rebinding of the original input sequence. Then we rebind the key again and do my ColumnProjector trick to get a new collection of group expressions. Now I can use both sets to form a predicate for a join.

 

Next, you can see that if I did specify an element selector I now bind it with respect to this new ‘basis’ and use the resulting expression as the projection coming out of this new branch of the query tree.

 

Alas, I am not yet done.  What about that third argument?  If it’s specified then I have another lambda I have to do something with and if it is not specified then I need to figure out how I’m going to return an IGrouping<K,E>.

 

Okay, that doesn’t turn out to be too difficult.  I created a Grouping<K,E> class that I can use in case I need to actually return this stuff all the way back to the calling code.

 

    public class Grouping<TKey, TElement> : IGrouping<TKey, TElement> {

        TKey key;

        IEnumerable<TElement> group;

 

        public Grouping(TKey key, IEnumerable<TElement> group) {

            this.key = key;

            this.group = group;

        }

 

        public TKey Key {

            get { return this.key; }

        }

 

        public IEnumerator<TElement> GetEnumerator() {

            return this.group.GetEnumerator();

        }

 

        IEnumerator IEnumerable.GetEnumerator() {

            return this.group.GetEnumerator();

        }

    }

 

Even if I don’t, I can still use this type as a way to encode the fact that I’m returning a grouping, which is what I do. The final result of the call to GroupBy (unless a result selector was supplied) is a sequence of these groupings, so the projector expression resulting from this operation is simply a construction of one of these Grouping<K,E> types via a NewExpression node. If a result selector is supplied that lambda determines the shape of the result so I just bind that the same way I bind all the other lambdas.  In both cases, I’m using the newly created extra ‘basis’ query that projects out elements to form the group collection that is either transformed with the result selector or returned to the caller.

 

That’s about it; clean and neat. Right? What about those other few lines of code referring to ‘GroupByInfo’ and ‘currentGroupElement’? That, sir, is the man behind the current. Now I’ll show you a little of him.

 

Later, when I’m binding aggregates I will need to invent a way to tie those aggregates back to this group-by; if they belong to this group-by.  So, later, I’m going to need to know something about this group-by (and certainly all group-by’s that might have already been encountered) to be able to pick the right one.  This is what the GroupByInfo is.  It’s the bits of information I hope will be useful later to correlate the two pieces together.  If we move along now to aggregate binding I’ll show you how it’s used.

Aggregate Binding

 

Aggregates get treated just like any other operator.

 

    protected override Expression VisitMethodCall(MethodCallExpression m) {

        if (m.Method.DeclaringType == typeof(Queryable) ||

            m.Method.DeclaringType == typeof(Enumerable)) {
            ...

            switch (m.Method.Name) {

                case "Count":

                case "Min":

                case "Max":

                case "Sum":

                case "Average":

                    if (m.Arguments.Count == 1) {

                        return this.BindAggregate(m.Arguments[0], m.Method, null);

                    }

                    else if (m.Arguments.Count == 2) {

                        LambdaExpression selector = (LambdaExpression)StripQuotes(m.Arguments[1]);

                        return this.BindAggregate(m.Arguments[0], m.Method, selector);

                    }

                    break;

            }

        }

        return base.VisitMethodCall(m);

    }

 

The actual binding has a few extra bits that help make the job a bit easier.

 

    Expression currentGroupElement;

 

    class GroupByInfo {

        internal string Alias { get; private set; }

        internal Expression Element { get; private set; }

        internal GroupByInfo(string alias, Expression element) {

            this.Alias = alias;

            this.Element = element;

        }

    }

 

    private AggregateType GetAggregateType(string methodName) {

        switch (methodName) {

            case "Count": return AggregateType.Count;

            case "Min": return AggregateType.Min;

            case "Max": return AggregateType.Max;

            case "Sum": return AggregateType.Sum;

            case "Average": return AggregateType.Average;

            default: throw new Exception(string.Format("Unknown aggregate type: {0}", methodName));

        }

    }

 

    private bool HasPredicateArg(AggregateType aggregateType) {

        return aggregateType == AggregateType.Count;

    }

 

    private Expression BindAggregate(Expression source, MethodInfo method, LambdaExpression argument, bool isRoot) {

        Type returnType = method.ReturnType;

        AggregateType aggType = this.GetAggregateType(method.Name);

        bool hasPredicateArg = this.HasPredicateArg(aggType);

 

        if (argument != null && hasPredicateArg) {

            // convert query.Count(predicate) into query.Where(predicate).Count()

            source = Expression.Call(typeof(Queryable), "Where", method.GetGenericArguments(), source, argument);

            argument = null;

        }

 

        ProjectionExpression projection = this.VisitSequence(source);

 

        Expression argExpr = null;

        if (argument != null) {

            this.map[argument.Parameters[0]] = projection.Projector;

            argExpr = this.Visit(argument.Body);

        }

        else if (!hasPredicateArg) {

            argExpr = projection.Projector;

        }

 

        string alias = this.GetNextAlias();

        var pc = this.ProjectColumns(projection.Projector, alias, projection.Source.Alias);

        Expression aggExpr = new AggregateExpression(returnType, aggType, argExpr);

        Type selectType = typeof(IEnumerable<>).MakeGenericType(returnType);

        SelectExpression select = new SelectExpression(selectType, alias, new ColumnDeclaration[] { new ColumnDeclaration("", aggExpr) }, projection.Source, null);

 

        if (isRoot) {

            ParameterExpression p = Expression.Parameter(selectType, "p");

            LambdaExpression gator = Expression.Lambda(Expression.Call(typeof(Enumerable), "Single", new Type[] { returnType }, p), p);

            return new ProjectionExpression(select, new ColumnExpression(returnType, alias, ""), gator);

        }

 

        SubqueryExpression subquery = new SubqueryExpression(returnType, select);

 

        // if we can find the corresponding group-info we can build a special AggregateSubquery node that will enable us to

        // optimize the aggregate expression later using AggregateRewriter

        GroupByInfo info;

        if (!hasPredicateArg && this.groupByMap.TryGetValue(projection, out info)) {

            // use the element expression from the group-by info to rebind the argument so the resulting expression is one that

            // would be legal to add to the columns in the select expression that has the corresponding group-by clause.

            if (argument != null) {

                this.map[argument.Parameters[0]] = info.Element;

                argExpr = this.Visit(argument.Body);

            }

            else {

                argExpr = info.Element;

            }

            aggExpr = new AggregateExpression(returnType, aggType, argExpr);

 

            // check for easy to optimize case.  If the projection that our aggregate is based on is really the 'group' argument from

            // the query.GroupBy(xxx, (key, group) => yyy) method then whatever expression we return here will automatically

            // become part of the select expression that has the group-by clause, so just return the simple aggregate expression.

            if (projection == this.currentGroupElement)

                return aggExpr;

 

            return new AggregateSubqueryExpression(info.Alias, aggExpr, subquery);

        }

 

        return subquery;

    }

 

Binding aggregates has its own slew of caveats that drive how this binding function is formed.  Aggregates can be specified with our without an argument. Normally, if an aggregate is not specified with an argument it is implied that the aggregate is operating over the element of the sequence. 

 

For example, it is legal to write:

 

    db.Orders.Select(o => o.OrderID).Max();

 

It has the same meaning as:

 

    db.Orders.Max(o => o.OrderID);

 

So if an aggregate invocation has no argument, I can simply pick up and use the projector expression coming out of the source sequence, right?  Well, not when the aggregate is Count. The LINQ Count aggregate takes a predicate as an argument not a value expression.  So I definitely don’t want to treat Count with an argument the same way I treat other aggregates.  For Count, I want to take that argument and turn it into a WHERE clause. I do that near the top of the BindAggregate function by tacking on a call to Queryable.Where right onto the source before I even translate it.

 

The next thing I do is actually create the AggregateExpression and stuff it into a column of a new SelectExpression. Whoa! Stop right there. Didn’t I want to avoid this?  Don’t I want this aggregate ending up alongside the group-by?  Yes, but not just yet.

 

What happens is that the source expression, once translated, already is a correlated sub-query!  How did that happen?  Take a moment to consider what the source of the aggregate really is.  Go ahead, I’ll wait.  Got it yet?   It’s the silly ‘g’.  Ugh.

 

Remember this query?

 

var query = db.Orders.GroupBy(o => o.CustomerID).Select(g => g.Max(o => o.OrderID));

 

Now do you see it?  The ‘g’ is the parameter to the Select operator.  Where does it come from? Yes, that’s right. It comes from the output of the GroupBy operator.  And what was that?  Yes, right again.  It was a sequence of Grouping<K,E> instances.  So a single ‘g’ is a single Grouping<K,E> instance.  Which itself is a collection of grouped items, which was formed using that basis query back in the BindGroupBy method, which was a self-join back to the original query with a join condition based on matching up grouping key expressions. Yes, that query, and yes, replicated here, in the context of an aggregate expression, it is indistinguishable from a correlated sub-query.  So I simply add my extra aggregate expression on top and, presto change-o, I am finished! 

 

Well, not really.  Skipping over this next bit about “isRoot” you can see that I create a SubqueryExpression.  I have not talked about this yet, but this is how I represent a true correlated sub-query. 

 

And here it is in all its majestic glory.

 

    internal class SubqueryExpression : Expression {

        SelectExpression select;

        internal SubqueryExpression(Type type, SelectExpression select)

            : base((ExpressionType)DbExpressionType.Subquery, type) {

            this.select = select;

        }

        internal SelectExpression Select {

            get { return this.select; }

        }

    }

 

Okay, that was a tad anti-climatic. Get over it.  It comes in handy later.

 

So what’s the rest of the method doing?  Ah, yes, you’ve seen it. This is where I look to see if I can tie the aggregate back to the group-by that it’s related to.  I do this by looking up the GroupByInfo based on the source projection; again that’s the ‘g’.  If the ‘g’ translates into the same expression that I created back in the BindGroupBy method then I know it’s the same one.  Back in that method I stored a memento of the group-by keyed by this very same expression, and now I’m pulling this information back up.

 

What you don’t see me doing here is actually going back and changing the SelectExpression that contains the group-by. Believe me; I really want to, but not just yet.  Instead, what I do now is drop a breadcrumb into the query tree so that I can later follow the trail in one sweeping tree rewriting operation. To do this, I have invented yet another expression node to act as this breadcrumb.

 

    internal class AggregateSubqueryExpression : Expression {

        string groupByAlias;

        Expression aggregateInGroupSelect;

        SubqueryExpression aggregateAsSubquery;

        internal AggregateSubqueryExpression(string groupByAlias, Expression aggregateInGroupSelect, SubqueryExpression aggregateAsSubquery)

            : base((ExpressionType)DbExpressionType.AggregateSubquery, aggregateAsSubquery.Type)

        {

            this.aggregateInGroupSelect = aggregateInGroupSelect;

            this.groupByAlias = groupByAlias;

            this.aggregateAsSubquery = aggregateAsSubquery;

        }

        internal string GroupByAlias { get { return this.groupByAlias; } }

        internal Expression AggregateInGroupSelect { get { return this.aggregateInGroupSelect; } }

        internal SubqueryExpression AggregateAsSubquery { get { return this.aggregateAsSubquery; } }

    }

 

What this node does is hold onto two possible futures. Either, the sweeping rewrite is going to move the aggregate expression back into the select with the group-by or its going to abandon all hope and fall back to the correlated sub-query approach.  That’s why this node holds onto two different expressions. One is the sub-query node I’ve already created.  The other is an expression that is fit and ready to become a citizen of the group.  I get this expression by simply rebinding the argument relative to the element expression I stuffed into the GroupByInfo when binding the GroupBy.  Now I stick both of the variations into a AggregateSubqueryExpression (reminding myself that unless I rewrite this the aggregate is going to be interpreted as a correlated subquery.)

 

What about ‘currentGroupElement’?  How does this factor in?  Recall that there was a form of GroupBy that took as an argument the result selector. This is not the form that C# converts query expressions into (but VB does when you use the Aggregate clause).  If the aggregate expression is being bound as part of the result selector then whatever expression I return from this method is going to end up in the same select expression with the GroupBy, no need to optimize later.  So if this is the case I simply have the method return the optimal aggregate expression and forget all about the sub-query nonsense. 

 

That just leaves us with ‘isRoot.’  Explaining isRoot is simple.  If the aggregate operator itself is the root of the tree, and in this case I mean with operator on top, or the ‘last’ operator, then it’s a very different kind of aggregate than normal. It is not associated with a group-by. It is a top-level aggregate. 

 

I get these when I type things like:

 

    db.Orders.Max(o => o.OrderID);

 

Now instead of returning a sequence of aggregates, I need only one for the whole sequence.  However, the query provider still gets back a sequence of rows from ADO and the ProjectionReader is still an IEnumerable of whatever the returned element type is.  In this case it is IEnumerable<int>.  Yet, I can clearly not return that IEnumerable as the result of the execution of the query. I need just a single int. So I need a way to tell the provider how to get the one true int out of the sequence.  That’s what I’m doing here.  I’ve added an extra lambda to the ProjectionExpression that tells the provider later how to aggregate the sequence back into a singleton. The ProjectionExpression already has a ‘projector’ that tells the provider how to turn DataReader rows into objects.  Now is also has an ‘aggregator’ that tells the provider how to turn the entire result set into a single value.  The aggregator I construct calls the Enumerable.Single method. 

 

That’s really how it all works, except for the real magic, the part where I wave a wand and polymorph the ugly tree into the beautiful one.  That’s the aggregate rewriter.

Aggregate Rewriting

 

Are you ready for it?  This is the part that does all the truly maligned and dishonest shuffling of bits.

 

    internal class AggregateRewriter : DbExpressionVisitor {

        ILookup<string, AggregateSubqueryExpression> lookup;

        Dictionary<AggregateSubqueryExpression, Expression> map;

 

        private AggregateRewriter(Expression expr) {

            this.map = new Dictionary<AggregateSubqueryExpression, Expression>();

            this.lookup = AggregateGatherer.Gather(expr).ToLookup(a => a.GroupByAlias);

        }

 

        internal static Expression Rewrite(Expression expr) {

            return new AggregateRewriter(expr).Visit(expr);

        }

 

        protected override Expression VisitSelect(SelectExpression select) {

            select = (SelectExpression)base.VisitSelect(select);

            if (lookup.Contains(select.Alias)) {

                List<ColumnDeclaration> aggColumns = new List<ColumnDeclaration>(select.Columns);

                foreach (AggregateSubqueryExpression ae in lookup[select.Alias]) {

                    string name = "agg" + aggColumns.Count;

                    ColumnDeclaration cd = new ColumnDeclaration(name, ae.AggregateInGroupSelect);

                    this.map.Add(ae, new ColumnExpression(ae.Type, ae.GroupByAlias, name));

                    aggColumns.Add(cd);

                }

                return new SelectExpression(select.Type, select.Alias, aggColumns, select.From, select.Where, select.OrderBy, select.GroupBy);

            }

            return select;

        }

 

        protected override Expression VisitAggregateSubquery(AggregateSubqueryExpression aggregate) {

            Expression mapped;

            if (this.map.TryGetValue(aggregate, out mapped)) {

                return mapped;

            }

            return this.Visit(aggregate.AggregateAsSubquery);

        }

 

        class AggregateGatherer : DbExpressionVisitor {

            List<AggregateSubqueryExpression> aggregates = new List<AggregateSubqueryExpression>();

            private AggregateGatherer() {

            }

 

            internal static List<AggregateSubqueryExpression> Gather(Expression expression) {

                AggregateGatherer gatherer = new AggregateGatherer();

                gatherer.Visit(expression);

                return gatherer.aggregates;

            }

 

            protected override Expression VisitAggregateSubquery(AggregateSubqueryExpression aggregate) {

                this.aggregates.Add(aggregate);

                return base.VisitAggregateSubquery(aggregate);

            }

        }

    }

 

That was it.  No, really.  That was the hard part.  Look closer, you’ll see where it actually does something.

 

The AggregateRewriter is actually two visitors; the primary rewriting visitor and a secondary gatherer that builds a collection of all the reachable aggregate subqueries.  See, I told you having this new node would come in handy.

 

What is actually going on is first all the aggregate expressions are gathered up and formed into a lookup table keyed off the group-by alias.  Then during the rewrite, when I get to the select expression with that alias I simply tack on all the aggregate expressions by inventing columns for them.  On the side I keep a mapping between all the original aggregate sub-queries and a column-expression that references the newly declared column.  A new select expression is created with the new set of columns, triggering a cascade of tree rebuilding (keeping everything nice and immutable.)  Then, when the aggregate sub-query is visited later, if it’s found in the map I simply replace it with the new column expression.

 

And now the aggregate expressions are in the right place.  Peace, harmony and all is well with the world.

 

I can see you are now impressed with my mad wizard-like skillz.

 

Let’s put it all together.  I add the aggregate rewriter to the pipeline of visitors we have in DbQueryProvider.Translate().

 

    private TranslateResult Translate(Expression expression) {

        ProjectionExpression projection = expression as ProjectionExpression;

        if (projection == null) {

            expression = Evaluator.PartialEval(expression, CanBeEvaluatedLocally);

            expression = QueryBinder.Bind(this, expression);

            expression = AggregateRewriter.Rewrite(expression);

            expression = OrderByRewriter.Rewrite(expression);

            expression = UnusedColumnRemover.Remove(expression);

            expression = RedundantSubqueryRemover.Remove(expression);

            projection = (ProjectionExpression)expression;

        }

        string commandText = QueryFormatter.Format(projection.Source);

        string[] columns = projection.Source.Columns.Select(c => c.Name).ToArray();

        LambdaExpression projector = ProjectionBuilder.Build(projection.Projector, projection.Source.Alias, columns);

        return new TranslateResult(commandText, projector, projection.Aggregator);

    }

 

Notice the new ‘aggregator’ being passed along in the result. The Execute method now makes use of it.

    private object Execute(TranslateResult query) {

        Delegate projector = query.Projector.Compile();

 

        if (this.log != null) {

            this.log.WriteLine(query.CommandText);

            this.log.WriteLine();

        }

 

        DbCommand cmd = this.connection.CreateCommand();

        cmd.CommandText = query.CommandText;

        DbDataReader reader = cmd.ExecuteReader();

 

        Type elementType = TypeSystem.GetElementType(query.Projector.Body.Type);

       

        IEnumerable sequence = (IEnumerable) Activator.CreateInstance(

            typeof(ProjectionReader<>).MakeGenericType(elementType),

            BindingFlags.Instance | BindingFlags.NonPublic, null,

            new object[] { reader, projector, this },

            null

            );

 

        if (query.Aggregator != null) {

            Delegate aggregator = query.Aggregator.Compile();

            AggregateReader aggReader = (AggregateReader) Activator.CreateInstance(

                typeof(AggregateReader<,>).MakeGenericType(elementType, query.Aggregator.Body.Type),

                BindingFlags.Instance | BindingFlags.NonPublic, null,

                new object[] { aggregator },

                null

                );

            return aggReader.Read(sequence);

        }

        else {

            return sequence;

        }

    }

Taking it for a Spin

 

Let’s give the provider a big soupy group-by ball of aggravation and see how it behaves.

 

    var query = from o in db.Orders

                group o by o.CustomerID into g

                select new {

                    Customer = g.Key,

                    Total = g.Sum(o => o.OrderID),

                    Min = g.Min(o => o.OrderID),

                    Avg = g.Average(o => o.OrderID)

                };

 

This query translates into:

 

SELECT t0.CustomerID, SUM(t0.OrderID) AS agg1, MIN(t0.OrderID) AS agg2, AVG(t0.OrderID) AS agg3

FROM Orders AS t0

GROUP BY t0.CustomerID

 

The results of execution are:

 

{ Customer = ALFKI, Total = 64835, Min = 10643, Avg = 10805 }

{ Customer = ANATR, Total = 42618, Min = 10308, Avg = 10654 }

{ Customer = ANTON, Total = 74195, Min = 10365, Avg = 10599 }

{ Customer = AROUT, Total = 139254, Min = 10355, Avg = 10711 }

...

{ Customer = WILMK, Total = 75650, Min = 10615, Avg = 10807 }

{ Customer = WOLZA, Total = 75595, Min = 10374, Avg = 10799 }

 

Don’t ask me why I think it is interesting to total up the OrderID’s.  I’m just weird.

 

That’s it.  Really.  This time I mean it.  I’m done.  If you want more take a look at the source code attached.  All the gory details are in there, along with a bazillion minor bug fixes I made since the last installment. 

 

DISCLAIMER:  The provided solution for GroupBy works well for many simple cases, such as when the aggregate expressions immediately follow the GroupBy operator. It is possible to write queries that produce the correlated sub-query form. This is normal. There is definitely still room for improvement.

 

Now, get out from under that shady tree and get back to writing code.  You know I am.

 

I often get asked how LINQ to SQL is supposed to be used with Test Driven Design (TDD).  Okay, not really.  People aren’t knocking on my door or calling me at 3:00 am.  I do, however, occasionally read developers angst on their personal blogs. It seems they are trying to actually do this, but are often confounded by the DataContext and its dearth of appropriate interfaces. Of course, my original knee-jerk reaction is to question why anyone would want or need to do this in the first place. Certainly, abstraction at a higher level of the application would be more appropriate, yada yada yada.  Eventually, my internal ranting ebbs and my practical side takes over. I start thinking like an engineer. How would I go about it? If only I’d added such fundamental interfaces such as IDataContext and ITable<T> before hitting RTM, all would be so much easier. Yet, TDD was not a priority.  It wasn’t even on the list of features that didn’t make the cut.  Still, how would I do it?  Then I start wishing I could override the DataContext’s methods and substitute my own logic.  Yet these methods are not virtual and cannot be overridden. Then with fitting irony I recall reading the other developer blogs that pointed this out too.

Of course, this only makes the problem that much more interesting and worthy of a good hack. I consider wrapping the DataContext in some other layer that looks exactly like it and abstract it that way, but then realize it would certainly trip the system up, especially deep in the query translation engine where it expects to find references to specific types.  Instead, the ideal solution would keep the DataContext the same, yet allow me to do something other than hitting the database when a query is executed. If only LINQ to SQL had a public provider model, I could simply plug a new one in and use it to intercept all interaction with the database. Oh, double irony, as there is no such provider model, at least not a public one.  Grin.

LINQ to SQL was actually designed to be host to more types of back-ends than just SQL server. It had a provider model targeted for RTM, but was disabled before the release.  Don’t ask me why.  Be satisfied to know that is was not a technical reason. Internally, it still behaves that way.  The trick is to find out how to swap in a new one when everything from the language to the runtime wants to keep you from doing it.

Fortunately, the DataContext has a nice little ‘provider’ instance variable just waiting to be overwritten.  A little bit of reflection can make quick work of that. The trouble is how to specify a new provider. The DataContext only talks to it through an interface (as it should), and yet that interface is internal to the LINQ to SQL assembly. The programming language won’t let you define your own implementation.  How do you go about implementing an interface that you can’t even say the name of in your source code?

Actually, I can think of two ways; 1) write a bunch of reflection emit code that generates an implementation at runtime or 2) trick the runtime into thinking some existing object implements the interface.  You can probably guess where I am going from here, as every good hack needs a good trick. Besides, a bunch of reflection emit code would be a lot more work.  Onward to the fun solution!

This is where CLR grand-interception-theory comes in; in the CLR you can intercept any interaction with any object, really, as long as it’s a method call and the object derives from MarshalByRef.  Actually, that’s not really true, you can intercept more than method calls, or at least they don’t start out being method calls, and they don’t necessarily need to be on only MarshalByRef objects. Still, not only do I want to intercept calls on an object, I want to make the object appear to implement an interface and intercept the calls on that interface. That’s a tall order, to be sure.  But it can be done.

The interception capability is the underpinnings of remoting (aka DCOM) support in the runtime.  I can use it to make an object masquerade as another object. The original intention was to enable client-side proxy objects to appear to implement the API of an object that only really exists on a server. The term ‘MarshalByRef’ refers to the DCOM behavior of marshalling a reference to the object from the server back to the client, such that calls on the client-side proxy are marshaled back to the server. It works by the JITer injecting specialized thunks into the code that identify and handle calls to these special dopplegangers. The really interesting thing to note is that interfaces in the runtime work nearly the same. They also have thunks that are capable of recognizing these proxies and acting accordingly; quite possibly because COM is so dependent on multitudes of interfaces.  However, regardless of the reason they exist, I can use this mechanism to wedge my own provider implementation into the mix.

What I first need to do is define a proxy object that will intercept these calls.  The remoting mechanism actually uses two different proxies, one that masquerades as the type (the transparent proxy) and one that receives the interception (the ‘real’ proxy.)  Both of these guys are intended to exist on the client. The real proxy is supposed to be the object that actually implements the marshalling behavior. My guess is that the only reason that I’m even allowed to implement my own real proxy is to enable marshalling over newer communication layers. Fortunately, I can use this proxy to simply act as an interceptor to do my bidding.

The next question I faced was what to do when I actually intercepted the calls.  Should I forward them on to some new grand public provider model?  That just seemed a bit over the top.  Instead, I chose to redirect the calls back to methods on the DataContext that can be overridden.  It was a quicker hack and introduces far fewer concepts to those already familiar with the DataContext.  And that’s really what you wanted all along, anyway, wasn’t it?

So I reveal to you, the new and shiny ExtensibleDataContext, one with a few new poorly named methods that you can actually override and implement yourself.

using System;

using System.Collections;

using System.Collections.Generic;

using System.Diagnostics;

using System.IO;

using System.Linq;

using System.Linq.Expressions;

using System.Text;

using System.Reflection;

 

using System.Runtime.Remoting;

using System.Runtime.Remoting.Activation;

using System.Runtime.Remoting.Proxies;

using System.Runtime.Remoting.Messaging;

using System.Runtime.Remoting.Services;

 

using System.Data;

using System.Data.Common;

using System.Data.Linq;

using System.Data.Linq.Mapping;

using System.Data.Linq.Provider;

 

namespace System.Data.Linq

{

    public class ExtensibleDataContext : DataContext

    {

        public ExtensibleDataContext(object connection, MappingSource mapping)

            : base("", mapping)

        {

            FieldInfo providerField = typeof(DataContext).GetField("provider", BindingFlags.Instance | BindingFlags.NonPublic);

            object proxy = new ProviderProxy(this).GetTransparentProxy();

            providerField.SetValue(this, proxy);

            this.Initialize(connection);

        }

 

        protected virtual void Initialize(object connection)

        {

        }

 

        private TextWriter LogImpl { get; set; }

        private DbConnection ConnectionImpl { get; set; }

        private DbTransaction TransactionImpl { get; set; }

        private int CommandTimeoutImpl { get; set; }

 

        protected internal virtual void ClearConnectionImpl()

        {

        }

 

        protected internal virtual void CreateDatabaseImpl()

        {

        }

 

        protected internal virtual void DeleteDatabaseImpl()

        {

        }

 

        protected internal virtual bool DatabaseExistsImpl()

        {

            return false;

        }

 

        protected internal virtual IExecuteResult ExecuteImpl(Expression query)

        {

            return new ExecuteResult(null);

        }

 

        protected class ExecuteResult : IExecuteResult

        {

            object value;

 

            public ExecuteResult(object value)

            {

                this.value = value;

            }

 

            public object GetParameterValue(int parameterIndex)

            {

                return null;

            }

 

            public object ReturnValue

            {

                get { return this.value; }

            }

 

            public void Dispose()

            {

                IDisposable d = this.value as IDisposable;

                if (d != null)

                    d.Dispose();

            }

        }

 

        protected internal virtual object CompileImpl(Expression query)

        {

            return null;

        }

 

        protected internal virtual IEnumerable TranslateImpl(Type elementType, DbDataReader reader)

        {

            return null;

        }

 

        protected internal virtual IMultipleResults TranslateImpl(DbDataReader reader)

        {

            return null;

        }

 

        protected internal virtual string GetQueryTextImpl(Expression query)

        {

            return null;

        }

 

        protected internal virtual DbCommand GetCommandImpl(Expression query)

        {

            return null;

        }

 

        public class ProviderProxy : RealProxy, IRemotingTypeInfo

        {

            ExtensibleDataContext dc;

 

            internal ProviderProxy(ExtensibleDataContext dc)

                : base(typeof(ContextBoundObject))

            {

                this.dc = dc;

            }

 

            public override IMessage Invoke(IMessage msg)

            {

                if (msg is IMethodCallMessage)

                {

                    IMethodCallMessage call = (IMethodCallMessage)msg;

                    if (call.MethodBase.DeclaringType.Name == "IProvider" && call.MethodBase.DeclaringType.IsInterface)

                    {

                        MethodInfo mi = typeof(ExtensibleDataContext).GetMethod(call.MethodBase.Name + "Impl", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.DeclaredOnly);

                        if (mi != null)

                        {

                            try

                            {

                                return new ReturnMessage(mi.Invoke(this.dc, call.Args), null, 0, null, call);

                            }

                            catch (TargetInvocationException e)

                            {

                                return new ReturnMessage(e.InnerException, call);

                            }

                        }

                    }

                }

                throw new NotImplementedException();

            }

 

            public bool CanCastTo(Type fromType, object o)

            {

                return true;

            }

 

            public string TypeName

            {

                get { return this.GetType().Name; }

                set { }

            }

        }

    }

}

 

The ExtensibleDataContext’s constructor has the job of overwriting the DataContext’s private ‘provider’ variable.  It creates a new ProviderProxy instance and assigns it to the private field using FieldInfo.SetValue().  The implementation of SetValue attempts to cast the object to the LINQ to SQL private interface IProvider. This succeeds because the function CanCastTo on the ProviderProxy returns true, allowing the proxy to be cast to any type.  After that, all interface calls on this object are rerouted to the Invoke method.  The implementation of Invoke simply calls the DataContext back, invoking methods with similar names.  These are left empty for you to override in your own derivation of ExtensibleDataContext.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Data.Linq;

using System.Data.Linq.Mapping;

using System.Text;

 

namespace MocksNix

{

    public class MyDataContext : ExtensibleDataContext

    {

        static MappingSource mapping = new AttributeMappingSource();

        public MyDataContext()

            : base("", mapping)

        {

        }

 

        public Table<Customer> Customers

        {

            get { return this.GetTable<Customer>(); }

        }

 

        protected internal override IExecuteResult ExecuteImpl(System.Linq.Expressions.Expression query)

        {

            this.Log.WriteLine("executing query: {0}", query);

            return new ExecuteResult(new Customer[] { });

        }

    }

 

    public class Customer

    {

        [Column(IsPrimaryKey = true)]

        public string CustomerId;

 

        [Column]

        public string ContactName;

    }

 

    class Program

    {

        static void Main(string[] args)

        {

            MyDataContext dc = new MyDataContext();

            var query = from c in dc.Customers where c.CustomerId == "X" select c;

            var list = query.ToList();

        }

    }

}

 

Now, I can use the ExtensibleDataContext in a small test program.  I create my own MyDataContext that implements ExecuteImpl().  This method gets called whenever a query needs to be executed.  Instead of executing the query, I write out a simple message and return an empty collection.

That’s it. Now take this bit of code and go forth and prosper.

DISCLAIMER: Overriding internal implementation details is not a practice recommend or supported by Microsoft. Implementation details are subject to change without warning.

But who cares!

Go on, mock LINQ to SQL all you want.

 

 

This is the nineth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts here's a handy list of all the fun you've been missing.

Complete list of posts in the Building an IQueryable Provider series 

It's now officially a trend that additional installments to this series take longer and longer to be produced.  Blame the television writer's strike, I do.    

Cleaning up the Mess

I've been promising for a while to show you how I'm going to go about cleaning up the unnecessary layers of nested select expressions that my query translator has been accumulating. It's easy for a human brain to look at a query and realize that it could be written a lot simpler. However, its a lot easier for a computer program to just keep piling the layers on, after all the semantics are the same and the boon we get from keeping the program simple is nothing to sneer at.

It's easy to see the problem in a simple query with a where clause.

    from c in db.Customers
    where c.Country == "UK"
    select c;

This innocuous query turns into the following SQL:

SELECT t1.Country, t1.CustomerID, t1.ContactName, t1.Phone, t1.City
FROM (
  SELECT t0.Country, t0.CustomerID, t0.ContactName, t0.Phone, t0.City
  FROM Customers AS t0
) AS t1
WHERE (t1.Country = 'UK')

Why the extra SELECT?  It's easy to see why it happens when you know how the translation works and what the underlying LINQ query really is.

The LINQ query's method call syntax really looks like this: 

    db.Customers.Where(c => c.Country == "UK").Select(c => c);

It has two LINQ query operators, Where() and Select().  My translation engine in the SqlBinder class translates both of these method calls into two separate SelectExpression's.

Ideally, the SQL query would have looked like this:

SELECT t0.Country, t0.CustomerID, t0.ContactName, t0.Phone, t0.City
FROM Customers AS t0
WHERE (t0.Country = 'UK')

However, that's just the easy case.  It gets worse as more operators are added.  Did you think the translator was smart enough to merge multiple Where clauses together?  I certainly did not add any code for that.  It would be nice if the language compiler did it for me, but what about the case where additional Where() operators are added conditionally after the base query is already formed?

var query = 
    from c in db.Customers
    where c.Country == "UK"
    select c;
...
query = from c in query
        where c.Phone == "555-5555"
        select c;

This becomes a triple layer monstrosity, which would only be good if it were a sandwich.

SELECT t2.CustomerID, t2.ContactName, t2.Phone, t2.City, t2.Country
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.Country = 'UK')
) AS t2
WHERE (t2.Phone = '555-5555')

And its not just the layering either.  What happens when when I try to project out a subset of the data? 

var query = 
    from c in db.Customers
    where c.Country == "UK"
    select c.CustomerID;

This becomes the following:

SELECT t2.CustomerID
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.Country = 'UK')
) AS t2

Why oh why do the nested queries keep reselecting data that's never being used?  Hopefully the database engine is smart enough not to pipeline all that data when most of its never even referred to in the query or returned to the client.  Yet, wouldn't it be a whole lot nicer if the query translator could reduce this madness down to a simpler form that an actual human might write?  Then maybe someone would be able to make head or tails out of this one:

var query = from c in db.Customers
            join o in db.Orders on c.CustomerID equals o.CustomerID
            let m = c.Phone
            orderby c.City
            where c.Country == "UK"
            where m != "555-5555"
            select new { c.City, c.ContactName } into x
            where x.City == "London"
            select x;

I'm not even going to show you this one yet, as it might frighten you to the point of powering down the computer.

What I am going to do is show you what I've done about it, how I rolled up my sleeves and wrote some code that saves the day.  It turned out not to be too horribly difficult.  I had imagined working with the immutable expression tree would become more and more complex as the desired transformations became more and more interesting, given how complicated the order-by rewriter seemed to get.  However, I was pleasantly surprised to find out that this clean-up logic was in fact turning out to be actually sort of clean.

 

Removing Redudant Subqueries

I first wanted to tackle how to get rid of redundant subqueries.  For example, an identity select does absolutely nothing, yet its adds a whole new layer.  What I needed was a way to remove select expressions that do nothing interesting.  Of course, the first thing I needed to do was decide what made an expression redundant.  Obviously, a select expression is redundant if it does not actually do anything at all, except for re-selecting columns from another select expression. Other select expressions would seem redundant if the only thing they do are operations that could have been combined with those of another select expression.

The next thing I needed to do was to figure out how to actually remove a select expression from an expression tree and end up with a tree that is actually legal.  This step had an ominous feel that I was going to end up devising something even more arcane than the order-by rewriter.  I was surprised that it turned out to be so simple.  Take a look.

    internal class SubqueryRemover : DbExpressionVisitor
    {
        HashSet<SelectExpression> selectsToRemove;
        Dictionary<string, Dictionary<string, Expression>> map;

        public Expression Remove(SelectExpression outerSelect, params SelectExpression[] selectsToRemove)
        {
            return Remove(outerSelect, (IEnumerable<SelectExpression>)selectsToRemove);
        }

        public Expression Remove(SelectExpression outerSelect, IEnumerable<SelectExpression> selectsToRemove)
        {
            this.selectsToRemove = new HashSet<SelectExpression>(selectsToRemove);
            this.map = selectsToRemove.ToDictionary(d => d.Alias, d => d.Columns.ToDictionary(d2 => d2.Name, d2 => d2.Expression));
            return this.Visit(outerSelect);
        }

        protected override Expression VisitSelect(SelectExpression select)
        {
            if (this.selectsToRemove.Contains(select))
            {
                return this.Visit(select.From);
            }
            else
            {
                return base.VisitSelect(select);
            }
        }

        protected override Expression VisitColumn(ColumnExpression column)
        {
            Dictionary<string, Expression> nameMap;
            if (this.map.TryGetValue(column.Alias, out nameMap))
            {
                Expression expr;
                if (nameMap.TryGetValue(column.Name, out expr))
                {
                    return this.Visit(expr);
                }
                throw new Exception("Reference to undefined column");
            }
            return column;
        }
    }

That's it.  This is a nice little class that will rewrite an expression tree and remove one or more select expressions from it, automatically fixing up all references to columns that go away. Looking at the two visit methods, the code looks trivial. When I see a select that's one of ones to be removed, I simply throw it away by returning its 'from' expression. When I see a column expression that is referencing a column that is declared in a select expression that is removed, I substitute the expression used in the declaration for the reference.  As it turns out the most interesting piece of the whole visitor is figuring out the set of columns that are going away, and this is determined using a nice little LINQ query in the Remove method to construct a dictionary of dictionaries that tells me this.

Now all that's left is to write some code that actually figures out which subqueries are the redundant ones.

And here that is. 

   internal class RedundantSubqueryRemover : DbExpressionVisitor
    {
        internal Expression Remove(Expression expression)
        {
            return this.Visit(expression);
        }

        protected override Expression VisitSelect(SelectExpression select)
        {
            select = (SelectExpression)base.VisitSelect(select);

            // first remove all purely redundant subqueries
            List<SelectExpression> redundant = new RedundantSubqueryGatherer<