Conor Cunningham’s blog on SQL Server, data-driven applications, and pretty much whatever other random stuff he decides to post.
I was asked to comment on a post about the order of WHERE NOT EXISTS (<subquery>) in a query and its impact on a query plan.
Specifically this thread:
Broadly, SQL Server does attempt to make it so that the order in which you write a query does not matter in terms of plan choice. For many of the common cases, we have done this work. I was reading through the source code last night and I don’t believe we do relative reordering of WHERE NOT EXISTS subqueries (which in the academic literature is called an anti-semi-join) in the same scope.
In the specific customer case, the customer actually did WHERE NOT EXISTS (SELECT TOP 1 … <rest of subquery>).
I will specifically recommend _against_ doing this as a general practice. The Optimizer can do all sorts of magic to transform semi-joins to joins and joins to semi-joins, and there are reorderings that are blocked if you do the TOP 1 (since it is not a relational operator). The SQL Optimizer is smart enough to know not to retrieve more than one row for any WHERE EXISTS or WHERE NOT EXISTS subquery, so you should not consider it necessary or wise to add TOP 1 for this case. (Perhaps there are cases when it is appropriate, but I don’t know of any general cases that could not be solved by using other hints).
As I have posted before, there are some cases where SQL does not do every possible rule transformation because some of them are just not common enough.
Also, remember that we do add rules each release, so don’t assume that we never do these rewrites. We like adding transformation rules in the engine :).
Thanks for the post! I have a few followup questions though. In this post, your wording makes it seem like the TOP operator is the culprit behind the lack or join optimization; however, at the bottom of the MSDN thread Brad Schulz provides an example that does not use the top operator. So is it fair to say that the optimizer does not optimize ANTI-SEMI predicates at all, regardless of the top operator? Additionally this behavior is not limited to ANTI-SEMI joins, as this also occurs with correlated subqueries, in the predicate. Are these not being optimized too?
Here is a sample query to illustrate (with the sample data from the MSDN thread)
from #Customers c
WHERE (select MAX([CustomerID]) from #CustHugeRowSize where CustomerID=c.CustomerID) IS NULL
AND (select MAX([CustomerID]) from #CustTeenyRowSize where CustomerID=c.CustomerID) IS NULL
WHERE (select MAX([CustomerID]) from #CustTeenyRowSize where CustomerID=c.CustomerID) IS NULL
AND (select MAX([CustomerID]) from #CustHugeRowSize where CustomerID=c.CustomerID) IS NULL
Adding to what Adam said... as I pointed out in my blog post a few weeks ago...
...the NOT EXISTS predicates (i.e. ANTI SEMI JOINs) are processed in the order we present them in the query. (And these are "vanilla" predicates without a TOP 1 inside them).
But it's not just an ANTI SEMI JOIN issue. It seems to be an OUTER JOIN issue in general. The other example I gave in the post was three multiple-LEFT-JOIN queries that are identical except for the order of the LEFT JOINs. They produce the same result set, but they have 3 different query plans (all of the LEFT JOINed tables only refer back to the FROM table... they do not interact with each other):
from Sales.Customer c
left join Sales.StoreContact sc on c.CustomerID=sc.CustomerID
left join Sales.SalesOrderHeader soh on c.CustomerID=soh.CustomerID
left join Sales.CustomerAddress ca on c.CustomerID=ca.CustomerID
And the estimated plans come up with relative costs of 35%:33%:31% and, in fact, the first one IS slower and the 3rd one is faster.
So we just want to make absolutely sure and get confirmation... It looks like the bottom line is that the optimizer always inextricably links the FROM table with the first LEFT JOIN table (as shown by the query plans of the 3 queries above)... in other words, the optimizer does not do any optimization or transformation to come up with a single "best" plan.
If this is so, then we developers have to be careful in our choice of which table to LEFT JOIN first (or, for that matter, which table to reference first in multiple NOT EXISTS or correlated subquery predicates).
IMHO this is a an example where cost is not something you can compare.
It appears this issue is that the MERGE JOIN operator does not pass on order after an OUTER join operation, and so the optimiser fixes one merge operator and then hashes the others. So it probably finds a good one and doesn't look at others. Remember the optimiser finds a good enough plan and not the best plan.
Also this example is a bit contrived as you are going a cross product of four tables. Each join is going to introduce more rows, and so doesn't make any sense. Cross product queries are often difficult to optimise.
The optimizer does actually have the ability to compare different alternatives for subtrees with different physical properties (ex: sortedness). It can find the local minimum plan and it can also find the global minimum plan for all plans it considers. Note, however, that the optimizer has logic to not generate all alternatives to improve performance. These are somewhat separate concepts.