Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
OK, computers aren't entirely dumb when it comes to LINQ. Here's an example of a place where we're a bit smarter.
Consider the following query:
IEnumerable<int> query = from n in number_array orderby n select n;
Does this get transformed by the compiler into
IEnumerable<int> query = number_array.OrderBy(n => n);
IEnumerable<int> query = number_array.OrderBy(n => n).Select( n => n );
This question caused tremendous debate amongst the members of the C# design and implementation teams. I am quite happy with the compromise we achieved.
The sections of the C# specification which cover this topic are sections 220.127.116.11, and 18.104.22.168:
A query expression of the form “from x in e select v” is translated into “( e ) . Select ( x => v )” except when v is the identifier x, the translation is simply “( e )”
A query expression of the form “from x in e select x” is translated into “( e ) . Select ( x => x )”.
A degenerate query expression is one that trivially selects the elements of the source. A later phase of the translation removes degenerate queries introduced by other translation steps by replacing them with their source. It is important however to ensure that the result of a query expression is never the source object itself, as that would reveal the type and identity of the source to the client of the query. Therefore this step protects degenerate queries written directly in source code by explicitly calling Select on the source. It is then up to the implementers of Select and other query operators to ensure that these methods never return the source object itself.
It might sound like the second statement contradicts the first. The first statement says that the “.Select(x=>v)” is omitted when “v” is “x”, and the second statement says that no, the “.Select(x=>x)” is in fact generated.
The spec is not particularly clear on this point, I agree, though the explanatory text about “degenerate query expressions” does help clear it up somewhat. Basically what we are saying here is that the optimization to remove degenerate selects is only performed on the result of an earlier query translation.
Some examples will help.
If you have from n in number_array orderby n select n then this is first translated into from n in (number_array).OrderBy(n=>n) select n and then, since the selector is the same as the range variable, this is translated into ((number_array).OrderBy(n=>n)).
We optimize away the Select(n=>n) because from n in (number_array).OrderBy(n=>n) select n is the intermediate result of an earlier query translation.
If you had from n in number_array select n then this would be considered a degenerate query expression requiring a Select. We cannot simply translate it into (number_array) because then you could assign that to a variable and have referential identity with the collection. The result of the query should not “leak” information about the collection out; it should just be an enumerable object of the appropriate type, and not have an identity relationship. Therefore, we take the performance hit and protect the identity of the collection by emitting a degenerate query translation (number_array).Select(n=>n).
This is important because the collection might be a mutable collection. One does not expect that the result of a query over that collection could be used to mutate the collection! The query results should be read-only.
We do not need to append the degenerate Select(n=>n) when you have an “orderby” in there because the OrderBy(n=>n) call already protects the identity of the original collection. This saves on the performance cost of the no-op Select.
I think we’ve made a reasonable compromise in the case of degenerate queries. If you are crazy enough to write “from x in xs select x” when you mean “(xs)”, then you’ll have to live with the decreased performance. But if you are introducing a degenerate select at the end of a meaningful query, then it will be optimized away.
I think the spec is actually reasonably clear on what happens - although as I've argued before, it makes it even more sensible to be able to omit the select clause - but I'm still not quite as clear in my mind about why the identity relationship is such a bad thing.
I suspect it's because I haven't seen any examples of where it would make a difference. Do you have any compelling examples where it would be really obviously a bad thing to do? I think I understand it from a "philosophical" point of view, but I can't immediately see an urgent practical problem.
Consider the code I blogged about here:
The code requires that the private hashset only be accessed via the read-only IEnumerable<T> enumerator. If external code can get at the underlying hash set via a cast, then it can muck up the carefully maintained invariants of the class. That's why even though the collection implements IEnumerable, I had the code explicitly implement an iterator, to protect the mutable collection from undesired mutation.
You could imagine implementing the iterator as a degenerate query rather than an iterator block; we would want the result to be the same. The degenerate query should protect the mutable collection from mutation.
Right, I suppose the important part is when you're returning the result of a query to another piece of code. That's quite possibly what I'd been missing before. Thanks for bearing with me :)
Obviously the code which is *doing* the query to start with already has the original reference - I hadn't been widening my mental scope far enough.
Just as a though, it would be nice to have an extension method on IEnumerable which explicitly does a "no-op select" to avoid having to write the yield return loop you had in the linked post. Easy to do, obviously, but it would be nice to have it in the framework. Maybe not for 3.5 SP1 though ;)
I just wanted to let you know how much I have enjoyed your last several posts. Every one of your posts seems to have little seed of useful insight!
Ummm. I didn't know about that (and I my mind crashed on those seemingly contradictory statements from spec :D).
3.0 is the first version of C# that I haven't read its spec fully... yet!
Rather than place the links to the most recent C# team content directly in Community Convergence , I
"This is important because the collection might be a mutable collection. One does not expect that the result of a query over that collection could be used to mutate the collection! The query results should be read-only."
Another argument why it's a shame that .net doesn't support immutable datastructures out of the box. LINQ to objects would have much more latitude in optimizing queries against such things.