Note : I plan to post this blog on the VB team blog eventually. However, because of the length and complexity, I'm posting this up on my blog for now to hopefully get some eyes on it so that I won't make a fool of myself on the team blog :) Please feel free to contact me if you think something is wrong, misleading, or could be improved on. Thanks!
A couple months ago, I had a chance to do a presentation at the New England Code Camp #8 where I talked about LINQ under the hood. This blog post is an attempt at reconstructing the presentation as best as I can.
Disclaimer: This is an advanced post on LINQ, the new VB9 features, and how the compiler does its magic to make LINQ work.
One day, I may try to record a screen cast of this as well if I manage to find time (and if people will find it helpful).
Overview
The goal for us today is to understand what exactly happens when the compiler sees a LINQ query that is like this:
Dim q = From c In customers Where c.ID Mod 2 = 0 Select c.ID, c.Name
In order to understand what happens, we're going to start with exploring the world as it existed in VB 2005.
We will attempt to create a doubly linked list from first principles (that is, without the help of existing FX types that were introduced in VB 2005, such as List(Of T)), because building this from scratch will help us understand the concepts at a deeper level.
Creating a doubly linked list
Let's try to create the doubly linked list from the ground up, without relying on FX collection types. A doubly linked list will contain elements that have forward and backward pointers, so the following basic implementation will suffice:
Class DoublyLinkedListElement(Of T) Public Sub New(ByVal data As T) Me.m_data = data End Sub Private m_data As T Public ReadOnly Property Data() As T Get Return m_data End Get End Property Private m_nextElement As DoublyLinkedListElement(Of T) Public Property NextElement() As DoublyLinkedListElement(Of T) Get Return Me.m_nextElement End Get Set(ByVal value As DoublyLinkedListElement(Of T)) Me.m_nextElement = value End Set End Property Private m_prevElement As DoublyLinkedListElement(Of T) Public Property PrevElement() As DoublyLinkedListElement(Of T) Get Return m_prevElement End Get Set(ByVal value As DoublyLinkedListElement(Of T)) m_prevElement = value End Set End Property End Class Class DoublyLinkedList(Of T) Private m_Head As DoublyLinkedListElement(Of T) Private m_Tail As DoublyLinkedListElement(Of T) Public Sub InsertFirst(ByVal data As T) Dim ele = New DoublyLinkedListElement(Of T)(data) ele.NextElement = m_Head If m_Head IsNot Nothing Then m_Head.PrevElement = ele m_Head = ele Else m_Head = ele m_Tail = ele End If End Sub
Public Sub InsertLast(ByVal data As T) Dim ele = New DoublyLinkedListElement(Of T)(data) ele.PrevElement = m_Tail If m_Tail IsNot Nothing Then m_Tail.NextElement = ele m_Tail = ele Else m_Head = ele m_Tail = ele End If End Sub End Class
(BTW the final code will be available as a ZIP file; but the intermediate steps won't be, so you'll have to follow this post through in order to understand the concepts).
Take some time to understand what this code is doing; it's a very simply, feature-lacking doubly linked list. All that you can do to it is add to the beginning or the end of the list. Nothing more.
Modeling some data on top of the doubly linked list
The first thing that we'll try to do is to model some data on top of the doubly linked list. A typically "real world" model that maps to doubly linked lists well are trains. As a result, let's try to build some train models on top of the doubly linked list.
Here's an example:
Class TrainCar Public ID As Integer Public Color As String Public Overrides Function ToString() As String Return "ID: " & ID & vbCrLf & "Color: " & Color End Function End Class Class PassengerCar : Inherits TrainCar Public Count As Integer Public Overrides Function ToString() As String Return "Passenger Car" & vbCrLf & "Count: " & Count & vbCrLf & MyBase.ToString() End Function End Class Class Cargo : Inherits TrainCar Public Company As String Public Overrides Function ToString() As String Return "Cargo Car" & vbCrLf & "Company: " & Company & vbCrLf & MyBase.ToString() End Function End Class Class Engine : Inherits TrainCar Public HorsePower As Integer Public Overrides Function ToString() As String Return "Engine Car" & vbCrLf & "Horsepower: " & HorsePower & vbCrLf & MyBase.ToString() End Function End Class Class Caboose : Inherits TrainCar Public Overrides Function ToString() As String Return "Caboose" & vbCrLf & MyBase.ToString() End Function End Class Class Train : Inherits DoublyLinkedList(Of TrainCar) End Class
Here, we are creating a small hierarchy of car types, along with a Train model that simply inherits from DoublyLinkedList and providing a specific parameter for the generic argument.
This isn't necessarily the best way to model the train domain; it simply serves as an example for this blog post. Also, the code is "demo" code; a real production model for train cars will be much more robust.
So now we have a basic working train model from which you can add cars to it, and that's about it.
Populating the trains - new language syntax!
Here's a simple function that will populate a train class with a number of train cars:
Function PopulateTrains() As Train Dim t = New Train() t.InsertFirst(New Engine With {.ID = 1, .Color = "Red", .HorsePower = 100}) t.InsertLast(New Engine With {.ID = 2, .Color = "Black", .HorsePower = 100}) t.InsertLast(New Engine With {.ID = 3, .Color = "Red", .HorsePower = 75}) t.InsertLast(New Engine With {.ID = 4, .Color = "Green", .HorsePower = 75}) t.InsertLast(New Engine With {.ID = 5, .Color = "Red", .HorsePower = 75}) t.InsertLast(New PassengerCar With {.ID = 6, .Color = "Red", .Count = 50}) t.InsertLast(New PassengerCar With {.ID = 7, .Color = "Green", .Count = 51}) t.InsertLast(New PassengerCar With {.ID = 8, .Color = "Green", .Count = 50}) t.InsertLast(New Cargo With {.ID = 9, .Color = "Blue", .Company = "Microsoft"}) t.InsertLast(New Cargo With {.ID = 10, .Color = "Blue", .Company = "Microsoft"}) t.InsertLast(New Cargo With {.ID = 11, .Color = "Blue", .Company = "Microsoft"}) t.InsertLast(New PassengerCar With {.ID = 12, .Color = "Black", .Count = 55}) t.InsertLast(New Cargo With {.ID = 13, .Color = "Green", .Company = "Starbucks"}) t.InsertLast(New Cargo With {.ID = 14, .Color = "Green", .Company = "Starbucks"}) t.InsertLast(New Caboose With {.ID = 15, .Color = "Gold"}) Return t End Function
The first thing to notice is the new object initializer syntax. This allows you to construct an instance of an object and initialize it's fields and properties in an "expression context" - that is, any place in the VB language that expects an expression, you can use an object initializer. In the example above, we used the object initializer as an argument to a function call.
This is different from the with statement, which works as a statement, and not as an expression.
The second thing to notice is type inference. In the PopulateTrains method, the type of t is not object, as it would be in VS2005. Rather, the compiler inferred the type from the right hand side of the assignment statement, and t is typed as a Train.
What the compiler does in this scenario is that it notices that you did not specify a type for the variable t, and so it tries its best to determine what the type is of the right hand side of the assignment, and use that type as the type of t. However, if it cannot, then it will give up and generate an error.
This is especially useful in certain places where we introduced the concept of unnamed types. You'll see this later.
This new feature is controlled by a new option : option infer. By default, it is on for new projects. You can turn it off, and you'll get VS2005 behavior.
Here are a couple more posts that talk about type inference.
Enumerating the doubly linked list
Now that we have some data, we really need more support in the doubly linked list class for this data to be useful. In the .Net Framework, all collection types can be enumerated, so let's add a custom enumerator and take advantage of the fact that Visual Basic allows you to enumerate over a type that exposes a GetEnumerator method.
Here's a simple and straightforward implementation of IEnumerator for our doubly linked list (the Visual Basic IDE automatically generated the stubs for me, so I just filled in where necessary):
Class DoublyLinkedListEnumerator(Of T) : Implements IEnumerator(Of T) Private m_First As DoublyLinkedListElement(Of T) Private m_Current As DoublyLinkedListElement(Of T) Private m_Started As Boolean Public Sub New(ByVal elem As DoublyLinkedListElement(Of T)) m_First = elem m_Current = Nothing m_Started = False End Sub Public ReadOnly Property Current() As T Implements System.Collections.Generic.IEnumerator(Of T).Current Get Return m_Current.Data End Get End Property Public ReadOnly Property Current1() As Object Implements System.Collections.IEnumerator.Current Get Return m_Current.Data End Get End Property Public Function MoveNext() As Boolean Implements System.Collections.IEnumerator.MoveNext If Not m_Started Then m_Started = True m_Current = m_First Return If(m_Current Is Nothing, False, True) End If m_Current = m_Current.NextElement If m_Current IsNot Nothing Then Return True Else Return False End If End Function Public Sub Reset() Implements System.Collections.IEnumerator.Reset m_Current = Nothing m_Started = False End Sub Private disposedValue As Boolean = False ' To detect redundant calls ' IDisposable Protected Overridable Sub Dispose(ByVal disposing As Boolean) If Not Me.disposedValue Then If disposing Then ' TODO: free other state (managed objects). End If ' TODO: free your own state (unmanaged objects). ' TODO: set large fields to null. End If Me.disposedValue = True End Sub #Region " IDisposable Support " ' This code added by Visual Basic to correctly implement the disposable pattern. Public Sub Dispose() Implements IDisposable.Dispose ' Do not change this code. Put cleanup code in Dispose(ByVal disposing As Boolean) above. Dispose(True) GC.SuppressFinalize(Me) End Sub #End Region End Class
Take a look at the class above; it walks the doubly linked list as expected every time you ask the enumerator for the next element.
Now, we need to add the following method to the DoublyLinkedList(Of T) class defined at the beginning of the article:
Public Overridable Function GetEnumerator() As IEnumerator(Of T) Return New DoublyLinkedListEnumerator(Of T)(m_Head) End Function
We now added enumeration support to the doubly linked list, so that we can iterate the items and get the items out.
Example
Ok so we have enough pieces to actually get a somewhat meaningful example. Here's some code that will create a train with a bunch of train cars, and then iterate and print them out:
Module Module1 Sub Main() Dim trains = PopulateTrains() For Each t in trains Console.WriteLine(t.ToString) Next End Sub End Module
You should see all the train cars printed to the console. So far, so good.
Searching for specific cars
Right now, we have a bunch of data in the form of train cars, and the only thing that we can do is to iterate the data and print it out. That's not very interesting, because very quickly we will want to be able to ask questions about our data.
In real life, data is really useless unless you can ask questions about it. If you had a portfolio of stocks, you want to ask questions like, "which ones are outperforming others?", or "how long have I held these stocks?". If you had a social network of friends on Facebook, you want to ask questions like, "which friends have updated their profile recently?", or "which friends have recently poked me?".
If you aren't able to ask questions and filter your data, you'll quickly be overwhelmed.
So let's start by asking a specific question: let's find all the train cars that are red.
One of the most natural ways (and probably the one that most people would think of) is to add a new method to the Train class. Object oriented programming is all about sending a message (calling a method) on an object, and finding the right place to handle these is of paramount importance. In this scenario, the Train class is the right model.
Here's some code that will do this. Notice that we take advantage of the fact that we can iterate the train cars internally:
Function FindRedTrainCars() As Train Dim ret = New Train For Each c In Me If c.Color = "Red" Then ret.InsertLast(c) End If Next Return ret End Function
This does specifically what we asked for, and nothing more. It's fairly straight forward; so far, so good.
More, and more, and more...
As with all things in life, things constantly change. The problem with the prior API, FindRedTrainCars, is that it's too inflexible. You can imagine that you would end up with dozens of public APIs that must be exposed with this model. The problem with this model is that:
It's the last point that's really a kicker for me. So let's see what we can do about it here.
The template method pattern
As our first stab at making the train class "feel better", let's apply the template method pattern with a twist. In the official template method pattern, you define a "template" method (which has boilerplate code) and you customize with inheritance; for us, we'll customize with aggregation rather than inheritance for a variety of reasons (for me, any time where aggregation is viable I prefer it, because I don't like inheritance unless it's needed).
The way we'll do aggregation is via delegates.
Let's modify the find method so that it employs the template method pattern.
The first order of business is to define a delegate that is suitable. We'll follow the same implementation strategy as before, so that calls for a delegate that is applied to each train car, and we'll let the delegate decide whether or not the train car fits the bill.
Here's a straight forward implementation:
Public Delegate Function FindFilter(ByVal c As TrainCar) As Boolean Public Function Find(ByVal filter As FindFilter) As Train Dim ret = New Train For Each c In Me If filter(c) Then ret.InsertLast(c) End If Next Return ret End Function
There's nothing tricky about this : we define a delegate function that filters a train car and returns true if the filter is positive, and false otherwise.
Notice that with this strategy, we've accomplished a design goal - we've deferred filtering elements to the client, and in a way that doesn't require inheritance.
Built-in delegate types
One of the small niceties we added to the .NET Framework 3.5 is the Func type. You can use the Func type as a predefined delegate type. There are several definitions that follow this pattern:
Delegate Function Func(Of R)() As R Delegate Function Func(Of TArg1, R)(a As TArg1) As R
And so on (there are definitions for up to 5 arguments).
Thus you can write the above without having to define your delegate, FindFilter, like the following:
Public Function Find(ByVal filter As Func(Of TrainCar, Boolean)) As Train Dim ret = New Train For Each c In Me If filter(c) Then ret.InsertLast(c) End If Next Return ret End Function
This is just a nice addition so that you don't have to define delegate types for the most common (less then 5 arguments) delegates.
Using the new Find method
Ok cool, so now let's see how we can make use of this new find method. Here's the typical way you would pass a function as a delegate to Find. First, we define a function called Filter which matches the signature, and then we take the address of the function when we call the Find method:
Module Module1 Private Function Filter(ByVal c As TrainCar) As Boolean Return c.Color = "Red" End Function Sub Main() Dim trains = PopulateTrains() Dim results = trains.Find(AddressOf Filter) For Each t in results Console.WriteLine(t.ToString) Next End Sub End Module
This is pretty good! We now control the filtering, and we (as the client) can do whatever filtering we want, without making requirements to the train class.
There are a couple of things that are not so cool about this though:
Ok, so can we do better?
Lambda expressions
We definitely can do better. We'll need a new feature in VB9 called lambda expressions. Think of a lambda expression as a "inline function" that contains an expression. The following is equivalent to the above:
Module Module1 Sub Main() Dim trains = PopulateTrains() Dim results = trains.Find(Function(c) c.Color = "Red") For Each t in results Console.WriteLine(t.ToString) Next End Sub End Module
Notice all the following cool things:
Lambda expressions are really powerful.
But it gets better.
Let the user determine what the filter criteria is
Let's suppose that we want to allow the user to control what color to filter the trains with. If we go back to the original design with the delegates, we essentially need a way to pass data to the Filter method. Since delegates don't allow you pass data to it, there are several unpleasant workarounds:
The third option is the most viable, so let's see what that looks like:
Class ColorFilter Private m_Color As String Public Sub New(ByVal color As String) Me.m_Color = color End Sub Private Function Filter(ByVal c As TrainCar) As Boolean Return c.Color = m_Color End Function End Class Module Module1 Sub Main() Dim trains = PopulateTrains() Dim filter = new ColorFilter(Console.Readline()) Dim results = trains.Find(AddressOf filter.Filter) For Each t in results Console.WriteLine(t.ToString) Next End Sub End Module
This just got a lot worst. We now need a private class just for filtering. If we wanted 100 filters, we'd need 100 private classes. This sucks. Not much better than just inheriting 100 classes from the Train class, either.
It would be nice if the lambda expression feature can handle this scenario, and it most definitely can.
Lambda expressions part two - lifting variables
Let's consider the following example, which is slightly modified from the lambda expression example above. Pay particular attention to the body of the lambda expression:
Module Module1 Sub Main() Dim trains = PopulateTrains() Dim color = Console.Readline() Dim results = trains.Find(Function(c) c.Color = color) For Each t in results Console.WriteLine(t.ToString) Next End Sub End Module
Notice that there's a subtlety here : the lambda expression is referring to a variable, color, which is not a lambda parameter, but is defined in Main. When this happens, the compiler generates code that is morally equivalent to the ColorFilter class, except we call the ColorFilter class a closure.
That is, the two examples are equivalent. The compiler provides powerful variable lifting services for you so that you can write code that refers to variables that are not in the scope of the lambda expression, and the compiler will "make it work".
You've seen the magic (we wrote the code in the ColorFilter class example!) - now you know what the compiler does.
Lambda expressions are extremely powerful : check out my Lambda expression article on MSDN magazine for more coverage.
Now, we really have a clean, good solution. I like what we have. But as usual, let's push on, and see what else we can do to make this even better.
Can we make it more general?
Everything we've done so far has been pretty cool, and I feel that we are converging on a good design. But let's continue to push further : we added a very general find method to the train class, but it feels like it's in the wrong place, since the act of finding elements is not necessarily restricted to finding train cars in trains.
We really want to add this method to the DoublyLinkedList class, so that the train class just gains the benefit of the find method, as well as any other class that uses the DoublyLinkedList class.
But what if we've already shipped the DoublyLinkedList class, and we can't modify the interface anymore? Or what if the DoublyLinkedList class belong to someone else (and we purchased the rights to the API library, for example)?
We need a way to extend the behavior of an existing class.
Extension methods
One of the most trivial ways we can "extend" the behavior of an existing class is to create a shared method that takes this class as the first parameter (sort of like the ADT way of doing things way before OOP became popular). Here's some code that will do just that:
Module DoublyLinkedListExtensions Public Function Find(Of TElement)(ByVal col As DoublyLinkedList(Of TElement), _ ByVal filter As Func(Of TElement, Boolean)) As DoublyLinkedList(Of TElement) Dim ret = New DoublyLinkedList(Of TElement) For Each c In col If filter(c) Then ret.InsertLast(c) End If Next Return ret End Function End Module
The implementation is pretty straight forward and is very similar to what you would do if you were to implement this function as an instance method (just like above).
You use this method like so:
Module Module1 Sub Main() Dim t = PopulateTrains() Dim results = Find(t, Function(c) c.Color = "Red") For Each train in results Console.WriteLine(train.ToString) Next End Sub End Module
This is how you would call the shared method. There are a couple of things with the shared method approach that are a bit lacking:
For example, here's what the tooltip and intellisense show:
Notice that in this case, we have no idea what TElement is. As a result, when you are writing the lambda expression, you don't get intellisense:
So while shared methods work, they don't really give the greatest user experience.
This is where extension methods come in. One of my team mates, Scott, has a series of articles on extension methods that are extremely well written and in-depth : I encourage you to check those out if you aren't too familiar with extension methods.
What I want to highlight is that extension methods have two very nice properties:
As above, I'll demonstrate with some samples. To make your shared method an extension method, simply supply the <Extension> attribute : and that's it! Nothing more. It's really easy to do:
Module DoublyLinkedListExtensions <Extension> _ Public Function Find(Of TElement)(ByVal col As DoublyLinkedList(Of TElement), _ ByVal filter As Func(Of TElement, Boolean)) As DoublyLinkedList(Of TElement) Dim ret = New DoublyLinkedList(Of TElement) For Each c In col If filter(c) Then ret.InsertLast(c) End If Next Return ret End Function End Module
Be sure to import the System.Runtime.CompilerServices namespace.
Now, you can write code that looks like this:
Module Module1 Sub Main() Dim t = PopulateTrains() Dim results = t.Find(Function(c) c.Color = "Red") For Each t in results Console.WriteLine(t.ToString) Next End Sub End Module
As you can see, you now call the Find method as if it were an instance method : in a sense, we've "extended" the type DoublyLinkedList with a new method called Find.
But the power really comes from the user experience. Check out the intellisense:
Check that out! The partial type inference feature essentially performs type inference on TElement based on the "instance" type : in this case, it decided that TElement is a TrainCar, and now, you've got better tooltips!
And not only that, when you write your lambda expression, we now know what the lambda argument is, so you get intellisense there as well:
Look at that! We now know that x is a TrainCar, so you get intellisense there!
This is the power of extension methods. Again, please check out Scott's articles for more details.
Projections
To hammer home the idea of extension methods, I want to define another extension method as an example. This method transforms a DoublyLinkedList of type T to a DoublyLinkedList of type R, like so:
<Extension()> _ Public Function Transform(Of TSource, TResult)(ByVal col As DoubleLinkedList(Of TSource), _ ByVal Projection As Func(Of TSource, TResult)) As DoublyLinkedList(Of TResult) Dim ret = New DoublyLinkedList(Of TResult) For Each c In col ret.InsertLast(Projection(c)) Next Return ret End Function
This is a simple extension method that just converts elements in a list. Check it out yourself and make sure you understand the concepts, and try to call it and see what results you get.
Where are we?
So now we have a very good design. We can choose to use instance methods or extension methods to provide a template method pattern of access for our clients, and our clients can write clear, concise code to use our collection.
But we can do more. I have not written at all about LINQ yet, and we're now going to see how LINQ works.
Using LINQ on our DoublyLinkedList class
The wishful thinking part of me wishes that the following code would compile, but it does not:
Module Module1 Sub Main() Dim trains = PopulateTrains() Dim q = From t In trains _ Where t.Color = "Red" _ Select t End Sub End Module
We really want to make this work, because this would be the ultimate syntax for what we want. It would allow us to ask information about our train classes (and any class that inherits from our doublylinkedlist class) through the new LINQ operators.
Are you ready to see this work? Simply change the method "Find" to "Where" and "Transform" to "Select":
Module DoublyLinkedListExtensions <Extension()> _ Public Function Where(Of TElement)(ByVal col As DoublyLinkedList(Of TElement), _ ByVal filter As Func(Of TElement, Boolean)) As DoublyLinkedList(Of TElement) Dim ret = New DoublyLinkedList(Of TElement) For Each c In col If filter(c) Then ret.InsertLast(c) End If Next Return ret End Function <Extension()> _ Public Function [Select](Of TSource, TResult)(ByVal col As DoublyLinkedList(Of TSource), _ ByVal Projection As Func(Of TSource, TResult)) As DoublyLinkedList(Of TResult) Dim ret = New DoublyLinkedList(Of TResult) For Each c In col ret.InsertLast(Projection(c)) Next Return ret End Function End Module
And the code compiles! We can use LINQ over our DoublyLinkedList!
Wrapping up
Whew. I had originally hoped to have this post done a long time ago. And the initial draft ideas had this post at about a quarter of the length. If you're still reading, I hope you've gotten a very good idea of what exactly the compiler does when it sees a LINQ statement.
After all, the whole point of this post is to understand what magic the compiler performs, in order for you to have the knowledge and know-how of how LINQ works.
To summarize in reverse order:
I hope that you can move forward and backwards : that is, that not only do you understand how to decompose a LINQ query into its parts (for those rare cases where you need to call a method that's not available in query format), but you also understand how to take advantage of LINQ in your legacy applications, by following a sequence of refactorings similar to the one I did in this article.
And there's still much more to say about LINQ:
I'll save these for a post in the future.
As always, if you have questions, feel free to email me or leave a comment!
Thanks, and I hope you enjoy Visual Basic 2008! On the product team, we love this product, we love LINQ, and we hope that you find it as enjoyable, productive, and useful as we do.
Tim