Welcome to MSDN Blogs Sign in | Join | Help

I often get asked how LINQ to SQL is supposed to be used with Test Driven Design (TDD).  Okay, not really.  People aren’t knocking on my door or calling me at 3:00 am.  I do, however, occasionally read developers angst on their personal blogs. It seems they are trying to actually do this, but are often confounded by the DataContext and its dearth of appropriate interfaces. Of course, my original knee-jerk reaction is to question why anyone would want or need to do this in the first place. Certainly, abstraction at a higher level of the application would be more appropriate, yada yada yada.  Eventually, my internal ranting ebbs and my practical side takes over. I start thinking like an engineer. How would I go about it? If only I’d added such fundamental interfaces such as IDataContext and ITable<T> before hitting RTM, all would be so much easier. Yet, TDD was not a priority.  It wasn’t even on the list of features that didn’t make the cut.  Still, how would I do it?  Then I start wishing I could override the DataContext’s methods and substitute my own logic.  Yet these methods are not virtual and cannot be overridden. Then with fitting irony I recall reading the other developer blogs that pointed this out too.

Of course, this only makes the problem that much more interesting and worthy of a good hack. I consider wrapping the DataContext in some other layer that looks exactly like it and abstract it that way, but then realize it would certainly trip the system up, especially deep in the query translation engine where it expects to find references to specific types.  Instead, the ideal solution would keep the DataContext the same, yet allow me to do something other than hitting the database when a query is executed. If only LINQ to SQL had a public provider model, I could simply plug a new one in and use it to intercept all interaction with the database. Oh, double irony, as there is no such provider model, at least not a public one.  Grin.

LINQ to SQL was actually designed to be host to more types of back-ends than just SQL server. It had a provider model targeted for RTM, but was disabled before the release.  Don’t ask me why.  Be satisfied to know that is was not a technical reason. Internally, it still behaves that way.  The trick is to find out how to swap in a new one when everything from the language to the runtime wants to keep you from doing it.

Fortunately, the DataContext has a nice little ‘provider’ instance variable just waiting to be overwritten.  A little bit of reflection can make quick work of that. The trouble is how to specify a new provider. The DataContext only talks to it through an interface (as it should), and yet that interface is internal to the LINQ to SQL assembly. The programming language won’t let you define your own implementation.  How do you go about implementing an interface that you can’t even say the name of in your source code?

Actually, I can think of two ways; 1) write a bunch of reflection emit code that generates an implementation at runtime or 2) trick the runtime into thinking some existing object implements the interface.  You can probably guess where I am going from here, as every good hack needs a good trick. Besides, a bunch of reflection emit code would be a lot more work.  Onward to the fun solution!

This is where CLR grand-interception-theory comes in; in the CLR you can intercept any interaction with any object, really, as long as it’s a method call and the object derives from MarshalByRef.  Actually, that’s not really true, you can intercept more than method calls, or at least they don’t start out being method calls, and they don’t necessarily need to be on only MarshalByRef objects. Still, not only do I want to intercept calls on an object, I want to make the object appear to implement an interface and intercept the calls on that interface. That’s a tall order, to be sure.  But it can be done.

The interception capability is the underpinnings of remoting (aka DCOM) support in the runtime.  I can use it to make an object masquerade as another object. The original intention was to enable client-side proxy objects to appear to implement the API of an object that only really exists on a server. The term ‘MarshalByRef’ refers to the DCOM behavior of marshalling a reference to the object from the server back to the client, such that calls on the client-side proxy are marshaled back to the server. It works by the JITer injecting specialized thunks into the code that identify and handle calls to these special dopplegangers. The really interesting thing to note is that interfaces in the runtime work nearly the same. They also have thunks that are capable of recognizing these proxies and acting accordingly; quite possibly because COM is so dependent on multitudes of interfaces.  However, regardless of the reason they exist, I can use this mechanism to wedge my own provider implementation into the mix.

What I first need to do is define a proxy object that will intercept these calls.  The remoting mechanism actually uses two different proxies, one that masquerades as the type (the transparent proxy) and one that receives the interception (the ‘real’ proxy.)  Both of these guys are intended to exist on the client. The real proxy is supposed to be the object that actually implements the marshalling behavior. My guess is that the only reason that I’m even allowed to implement my own real proxy is to enable marshalling over newer communication layers. Fortunately, I can use this proxy to simply act as an interceptor to do my bidding.

The next question I faced was what to do when I actually intercepted the calls.  Should I forward them on to some new grand public provider model?  That just seemed a bit over the top.  Instead, I chose to redirect the calls back to methods on the DataContext that can be overridden.  It was a quicker hack and introduces far fewer concepts to those already familiar with the DataContext.  And that’s really what you wanted all along, anyway, wasn’t it?

So I reveal to you, the new and shiny ExtensibleDataContext, one with a few new poorly named methods that you can actually override and implement yourself.

using System;

using System.Collections;

using System.Collections.Generic;

using System.Diagnostics;

using System.IO;

using System.Linq;

using System.Linq.Expressions;

using System.Text;

using System.Reflection;

 

using System.Runtime.Remoting;

using System.Runtime.Remoting.Activation;

using System.Runtime.Remoting.Proxies;

using System.Runtime.Remoting.Messaging;

using System.Runtime.Remoting.Services;

 

using System.Data;

using System.Data.Common;

using System.Data.Linq;

using System.Data.Linq.Mapping;

using System.Data.Linq.Provider;

 

namespace System.Data.Linq

{

    public class ExtensibleDataContext : DataContext

    {

        public ExtensibleDataContext(object connection, MappingSource mapping)

            : base("", mapping)

        {

            FieldInfo providerField = typeof(DataContext).GetField("provider", BindingFlags.Instance | BindingFlags.NonPublic);

            object proxy = new ProviderProxy(this).GetTransparentProxy();

            providerField.SetValue(this, proxy);

            this.Initialize(connection);

        }

 

        protected virtual void Initialize(object connection)

        {

        }

 

        private TextWriter LogImpl { get; set; }

        private DbConnection ConnectionImpl { get; set; }

        private DbTransaction TransactionImpl { get; set; }

        private int CommandTimeoutImpl { get; set; }

 

        protected internal virtual void ClearConnectionImpl()

        {

        }

 

        protected internal virtual void CreateDatabaseImpl()

        {

        }

 

        protected internal virtual void DeleteDatabaseImpl()

        {

        }

 

        protected internal virtual bool DatabaseExistsImpl()

        {

            return false;

        }

 

        protected internal virtual IExecuteResult ExecuteImpl(Expression query)

        {

            return new ExecuteResult(null);

        }

 

        protected class ExecuteResult : IExecuteResult

        {

            object value;

 

            public ExecuteResult(object value)

            {

                this.value = value;

            }

 

            public object GetParameterValue(int parameterIndex)

            {

                return null;

            }

 

            public object ReturnValue

            {

                get { return this.value; }

            }

 

            public void Dispose()

            {

                IDisposable d = this.value as IDisposable;

                if (d != null)

                    d.Dispose();

            }

        }

 

        protected internal virtual object CompileImpl(Expression query)

        {

            return null;

        }

 

        protected internal virtual IEnumerable TranslateImpl(Type elementType, DbDataReader reader)

        {

            return null;

        }

 

        protected internal virtual IMultipleResults TranslateImpl(DbDataReader reader)

        {

            return null;

        }

 

        protected internal virtual string GetQueryTextImpl(Expression query)

        {

            return null;

        }

 

        protected internal virtual DbCommand GetCommandImpl(Expression query)

        {

            return null;

        }

 

        public class ProviderProxy : RealProxy, IRemotingTypeInfo

        {

            ExtensibleDataContext dc;

 

            internal ProviderProxy(ExtensibleDataContext dc)

                : base(typeof(ContextBoundObject))

            {

                this.dc = dc;

            }

 

            public override IMessage Invoke(IMessage msg)

            {

                if (msg is IMethodCallMessage)

                {

                    IMethodCallMessage call = (IMethodCallMessage)msg;

                    if (call.MethodBase.DeclaringType.Name == "IProvider" && call.MethodBase.DeclaringType.IsInterface)

                    {

                        MethodInfo mi = typeof(ExtensibleDataContext).GetMethod(call.MethodBase.Name + "Impl", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.DeclaredOnly);

                        if (mi != null)

                        {

                            try

                            {

                                return new ReturnMessage(mi.Invoke(this.dc, call.Args), null, 0, null, call);

                            }

                            catch (TargetInvocationException e)

                            {

                                return new ReturnMessage(e.InnerException, call);

                            }

                        }

                    }

                }

                throw new NotImplementedException();

            }

 

            public bool CanCastTo(Type fromType, object o)

            {

                return true;

            }

 

            public string TypeName

            {

                get { return this.GetType().Name; }

                set { }

            }

        }

    }

}

 

The ExtensibleDataContext’s constructor has the job of overwriting the DataContext’s private ‘provider’ variable.  It creates a new ProviderProxy instance and assigns it to the private field using FieldInfo.SetValue().  The implementation of SetValue attempts to cast the object to the LINQ to SQL private interface IProvider. This succeeds because the function CanCastTo on the ProviderProxy returns true, allowing the proxy to be cast to any type.  After that, all interface calls on this object are rerouted to the Invoke method.  The implementation of Invoke simply calls the DataContext back, invoking methods with similar names.  These are left empty for you to override in your own derivation of ExtensibleDataContext.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Data.Linq;

using System.Data.Linq.Mapping;

using System.Text;

 

namespace MocksNix

{

    public class MyDataContext : ExtensibleDataContext

    {

        static MappingSource mapping = new AttributeMappingSource();

        public MyDataContext()

            : base("", mapping)

        {

        }

 

        public Table<Customer> Customers

        {

            get { return this.GetTable<Customer>(); }

        }

 

        protected internal override IExecuteResult ExecuteImpl(System.Linq.Expressions.Expression query)

        {

            this.Log.WriteLine("executing query: {0}", query);

            return new ExecuteResult(new Customer[] { });

        }

    }

 

    public class Customer

    {

        [Column(IsPrimaryKey = true)]

        public string CustomerId;

 

        [Column]

        public string ContactName;

    }

 

    class Program

    {

        static void Main(string[] args)

        {

            MyDataContext dc = new MyDataContext();

            var query = from c in dc.Customers where c.CustomerId == "X" select c;

            var list = query.ToList();

        }

    }

}

 

Now, I can use the ExtensibleDataContext in a small test program.  I create my own MyDataContext that implements ExecuteImpl().  This method gets called whenever a query needs to be executed.  Instead of executing the query, I write out a simple message and return an empty collection.

That’s it. Now take this bit of code and go forth and prosper.

DISCLAIMER: Overriding internal implementation details is not a practice recommend or supported by Microsoft. Implementation details are subject to change without warning.

But who cares!

Go on, mock LINQ to SQL all you want.

 

 

This is the nineth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts here's a handy list of all the fun you've been missing.

Part I - Reusable IQueryable base classes
Part II - Where and reusable Expression tree visitor
Part II - Local variable references
Part IV - Select
Part V - Improved Column binding
Part VI - Nested queries
Part VII - Join and SelectMany
Part VIII - OrderBy

It's now officially a trend that additional installments to this series take longer and longer to be produced.  Blame the television writer's strike, I do.    

Cleaning up the Mess

I've been promising for a while to show you how I'm going to go about cleaning up the unnecessary layers of nested select expressions that my query translator has been accumulating. It's easy for a human brain to look at a query and realize that it could be written a lot simpler. However, its a lot easier for a computer program to just keep piling the layers on, after all the semantics are the same and the boon we get from keeping the program simple is nothing to sneer at.

It's easy to see the problem in a simple query with a where clause.

    from c in db.Customers
    where c.Country == "UK"
    select c;

This innocuous query turns into the following SQL:

SELECT t1.Country, t1.CustomerID, t1.ContactName, t1.Phone, t1.City
FROM (
  SELECT t0.Country, t0.CustomerID, t0.ContactName, t0.Phone, t0.City
  FROM Customers AS t0
) AS t1
WHERE (t1.Country = 'UK')

Why the extra SELECT?  It's easy to see why it happens when you know how the translation works and what the underlying LINQ query really is.

The LINQ query's method call syntax really looks like this: 

    db.Customers.Where(c => c.Country == "UK").Select(c => c);

It has two LINQ query operators, Where() and Select().  My translation engine in the SqlBinder class translates both of these method calls into two separate SelectExpression's.

Ideally, the SQL query would have looked like this:

SELECT t0.Country, t0.CustomerID, t0.ContactName, t0.Phone, t0.City
FROM Customers AS t0
WHERE (t0.Country = 'UK')

However, that's just the easy case.  It gets worse as more operators are added.  Did you think the translator was smart enough to merge multiple Where clauses together?  I certainly did not add any code for that.  It would be nice if the language compiler did it for me, but what about the case where additional Where() operators are added conditionally after the base query is already formed?

var query = 
    from c in db.Customers
    where c.Country == "UK"
    select c;
...
query = from c in query
        where c.Phone == "555-5555"
        select c;

This becomes a triple layer monstrosity, which would only be good if it were a sandwich.

SELECT t2.CustomerID, t2.ContactName, t2.Phone, t2.City, t2.Country
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.Country = 'UK')
) AS t2
WHERE (t2.Phone = '555-5555')

And its not just the layering either.  What happens when when I try to project out a subset of the data? 

var query = 
    from c in db.Customers
    where c.Country == "UK"
    select c.CustomerID;

This becomes the following:

SELECT t2.CustomerID
FROM (
  SELECT t1.CustomerID, t1.ContactName, t1.Phone, t1.City, t1.Country
  FROM (
    SELECT t0.CustomerID, t0.ContactName, t0.Phone, t0.City, t0.Country
    FROM Customers AS t0
  ) AS t1
  WHERE (t1.Country = 'UK')
) AS t2

Why oh why do the nested queries keep reselecting data that's never being used?  Hopefully the database engine is smart enough not to pipeline all that data when most of its never even referred to in the query or returned to the client.  Yet, wouldn't it be a whole lot nicer if the query translator could reduce this madness down to a simpler form that an actual human might write?  Then maybe someone would be able to make head or tails out of this one:

var query = from c in db.Customers
            join o in db.Orders on c.CustomerID equals o.CustomerID
            let m = c.Phone
            orderby c.City
            where c.Country == "UK"
            where m != "555-5555"
            select new { c.City, c.ContactName } into x
            where x.City == "London"
            select x;

I'm not even going to show you this one yet, as it might frighten you to the point of powering down the computer.

What I am going to do is show you what I've done about it, how I rolled up my sleeves and wrote some code that saves the day.  It turned out not to be too horribly difficult.  I had imagined working with the immutable expression tree would become more and more complex as the desired transformations became more and more interesting, given how complicated the order-by rewriter seemed to get.  However, I was pleasantly surprised to find out that this clean-up logic was in fact turning out to be actually sort of clean.

 

Removing Redudant Subqueries

I first wanted to tackle how to get rid of redundant subqueries.  For example, an identity select does absolutely nothing, yet its adds a whole new layer.  What I needed was a way to remove select expressions that do nothing interesting.  Of course, the first thing I needed to do was decide what made an expression redundant.  Obviously, a select expression is redundant if it does not actually do anything at all, except for re-selecting columns from another select expression. Other select expressions would seem redundant if the only thing they do are operations that could have been combined with those of another select expression.

The next thing I needed to do was to figure out how to actually remove a select expression from an expression tree and end up with a tree that is actually legal.  This step had an ominous feel that I was going to end up devising something even more arcane than the order-by rewriter.  I was surprised that it turned out to be so simple.  Take a look.

    internal class SubqueryRemover : DbExpressionVisitor
    {
        HashSet<SelectExpression> selectsToRemove;
        Dictionary<string, Dictionary<string, Expression>> map;

        public Expression Remove(SelectExpression outerSelect, params SelectExpression[] selectsToRemove)
        {
            return Remove(outerSelect, (IEnumerable<SelectExpression>)selectsToRemove);
        }

        public Expression Remove(SelectExpression outerSelect, IEnumerable<SelectExpression> selectsToRemove)
        {
            this.selectsToRemove = new HashSet<SelectExpression>(selectsToRemove);
            this.map = selectsToRemove.ToDictionary(d => d.Alias, d => d.Columns.ToDictionary(d2 => d2.Name, d2 => d2.Expression));
            return this.Visit(outerSelect);
        }

        protected override Expression VisitSelect(SelectExpression select)
        {
            if (this.selectsToRemove.Contains(select))
            {
                return this.Visit(select.From);
            }
            else
            {
                return base.VisitSelect(select);
            }
        }

        protected override Expression VisitColumn(ColumnExpression column)
        {
            Dictionary<string, Expression> nameMap;
            if (this.map.TryGetValue(column.Alias, out nameMap))
            {
                Expression expr;
                if (nameMap.TryGetValue(column.Name, out expr))
                {
                    return this.Visit(expr);
                }
                throw new Exception("Reference to undefined column");
            }
            return column;
        }
    }

That's it.  This is a nice little class that will rewrite an expression tree and remove one or more select expressions from it, automatically fixing up all references to columns that go away. Looking at the two visit methods, the code looks trivial. When I see a select that's one of ones to be removed, I simply throw it away by returning its 'from' expression. When I see a column expression that is referencing a column that is declared in a select expression that is removed, I substitute the expression used in the declaration for the reference.  As it turns out the most interesting piece of the whole visitor is figuring out the set of columns that are going away, and this is determined using a nice little LINQ query in the Remove method to construct a dictionary of dictionaries that tells me this.

Now all that's left is to write some code that actually figures out which subqueries are the redundant ones.

And here that is. 

   internal class RedundantSubqueryRemover : DbExpressionVisitor
    {
        internal Expression Remove(Expression expression)
        {
            return this.Visit(expression);
        }

        protected override Expression VisitSelect(SelectExpression select)
        {
            select = (SelectExpression)base.VisitSelect(select);

            // first remove all purely redundant subqueries
            List<SelectExpression> redundant = new RedundantSubqueryGatherer().Gather(select.From);
            if (redundant != null)
            {
                select = (SelectExpression)new SubqueryRemover().Remove(select, redundant);
            }

            // next attempt to merge subqueries

            // can only merge if subquery is a single select (not a join)
            SelectExpression fromSelect = select.From as SelectExpression;
            if (fromSelect != null)
            {
                // can only merge if subquery has simple-projection (no renames or complex expressions)
                if (HasSimpleProjection(fromSelect))
                {
                    // remove the redundant subquery
                    select = (SelectExpression)new SubqueryRemover().Remove(select, fromSelect);
                    // merge where expressions 
                    Expression where = select.Where;
                    if (fromSelect.Where != null)
                    {
                        if (where != null)
                        {
                            where = Expression.And(fromSelect.Where, where);
                        }
                        else
                        {
                            where = fromSelect.Where;
                        }
                    }
                    if (where != select.Where)
                    {
                        return new SelectExpression(select.Type, select.Alias, select.Columns, select.From, where, select.OrderBy);
                    }
                }
            }

            return select;
        }

        private static bool IsRedudantSubquery(SelectExpression select)
        {
            return HasSimpleProjection(select)
                && select.Where == null
                && (select.OrderBy == null || select.OrderBy.Count == 0);
        }

        private static bool HasSimpleProjection(SelectExpression select)
        {
            foreach (ColumnDeclaration decl in select.Columns)
            {
                ColumnExpression col = decl.Expression as ColumnExpression;
                if (col == null || decl.Name != col.Name)
                {
                    // column name changed or column expression is more complex than reference to another column
                    return false;
                }
            }
            return true;
        }

        class RedundantSubqueryGatherer : DbExpressionVisitor
        {
            List<SelectExpression> redundant;

            internal List<SelectExpression> Gather(Expression source)
            {
                this.Visit(source);
                return this.redundant;
            }

            protected override Expression VisitSelect(SelectExpression select)
            {
                if (IsRedudantSubquery(select))
                {
                    if (this.redundant == null)
                    {
                        this.redundant = new List<SelectExpression>();
                    }
                    this.redundant.Add(select);
                }
                return select;
            }
        }
    }

The RedundantSubqueryRemover is a bit more involved that the SubqueryRemover, but it basically has a simple algorithm.  When it examines a given select expression it tries to determine if one or more sub-select's are redundant. To determine this set it uses anothger visitor, the RedundantSubqueryGatherer, which builds a list of redundant subqueries that can be reached without recursing down into any 'from' expressions. This allows it to consider all the sub select expressions that are in scope to the parent select expression, seeing through any join expression nodes that may exist. Once I have this list, I just use the SubqueryRemover to remove them.

Following that, I look to see if any select expressions can be merged together.  The only interesting thing to consider at this time is whether a subquery would be considered redundant except for a where expression that can easily be combined into the outer select expression. If I find one of these, I go ahead and remove the subquery and add its where expression to the outer select expression.  Yes, I'm surprised too that it works out that easily.

 

Removing Unused Columns

The second part of clean up is to get rid of unused columns.  If I project into a smaller set of columns and don't even reference some of the others then why keep them in the query at all?  It might not matter to the semantics of the query and it might not even make the query faster, but it will certainly be easier on the eyes, and it might make it possible to discover more redundant subqueries that might have slipped by simply because the subquery computed a column that is later ignored.

Of course, this turns out to be a lot more complicated than the other two, so I saved it for last.

    internal class UnusedColumnRemover : DbExpressionVisitor
    {
        Dictionary<string, HashSet<string>> allColumnsUsed;

        internal Expression Remove(Expression expression)
        {
            this.allColumnsUsed = new Dictionary<string, HashSet<string>>();
            return this.Visit(expression);
        }

        protected override Expression VisitColumn(ColumnExpression column)
        {
            HashSet<string> columns;
            if (!this.allColumnsUsed.TryGetValue(column.Alias, out columns))
            {
                columns = new HashSet<string>();
                this.allColumnsUsed.Add(column.Alias, columns);
            }
            columns.Add(column.Name);
            return column;
        }

        protected override Expression VisitSelect(SelectExpression select)
        {
            // visit column projection first
            ReadOnlyCollection<ColumnDeclaration> columns = select.Columns;

            HashSet<string> columnsUsed;
            if (this.allColumnsUsed.TryGetValue(select.Alias, out columnsUsed))
            {
                List<ColumnDeclaration> alternate = null;
                for (int i = 0, n = select.Columns.Count; i < n; i++)
                {
                    ColumnDeclaration decl = select.Columns[i];
                    if (!columnsUsed.Contains(decl.Name))
                    {
                        decl = null;  // null means it gets omitted
                    }
                    else
                    {
                        Expression expr = this.Visit(decl.Expression);
                        if (expr != decl.Expression)
                        {
                            decl = new ColumnDeclaration(decl.Name, decl.Expression);
                        }
                    }
                    if (decl != select.Columns[i] && alternate == null)
                    {
                        alternate = new List<ColumnDeclaration>();
                        for (int j = 0; j < i; j++)
                        {
                            alternate.Add(select.Columns[j]);
                        }
                    }
                    if (decl != null && alternate != null)
                    {
                        alternate.Add(decl);
                    }
                }
                if (alternate != null)
                {
                    columns = alternate.AsReadOnly();
                }
            }

            ReadOnlyCollection<OrderExpression> orderbys = this.VisitOrderBy(select.OrderBy);
            Expression where = this.Visit(select.Where);
            Expression from = this.Visit(select.From);

            if (columns != select.Columns || orderbys != select.OrderBy || where != select.Where || from != select.From)
            {
                return new SelectExpression(select.Type, select.Alias, columns, from, where, orderbys);
            }

            return select;
        }

        protected override Expression VisitProjection(ProjectionExpression projection)
        {
            // visit mapping in reverse order
            Expression projector = this.Visit(projection.Projector);
            SelectExpression source = (SelectExpression)this.Visit(projection.Source);
            if (projector != projection.Projector || source != projection.Source)
            {
                return new