Conor Cunningham’s blog on SQL Server, data-driven applications, and pretty much whatever other random stuff he decides to post.
I received an email from a reader this weekend that has a lot of different questions, but one of the main questions related to the difference between views and base tables in SQL Server’s plan generation process. ie if I have a view “select * from table” vs. a table “table”, does the optimizer treat these differently?
Before I answer, I’ll clarify that SQL Server changes some of its internals each release and that there is no single answer that will completely satisfy here. This is a simple question without a simple answer. I’ll go over the main points and I hope this will help you form rules-of-thumb for questions in this area when you are trying to understand what the SQL Server Optimizer is doing during plan generation.
Views != Tables
These are syntactically equivalent for most queries, but they are not identical from a plan generation perspective in all cases. For trivial views, things are pretty close, but there are additional things happening when you use a view instead of a table:
1. SQL Server expands the view into the input tree for the Optimizer. If you have a view that’s a 10-way join, it gets included into the input tree and doesn’t just look like a single table to the optimizer.
2. Simplfication steps are run on the whole tree. This includes finding unnecessary subtrees, moving predicates towards the tree leaves to facilitate later index matching, etc.
Once you have a simplified input tree, the Optimizer will perform other heuristics including finding an initial join order (or orders) to begin searching for a good query plan. That order is a function of the shape of the tree at the time that logic is run. Very complex views distribute the join graph and may prevent it from being located in one spot – this causes the algorithm to run multiple times instead of potentially once. The heuristics of that algorithm are partially a function of the number of tables in each local area. (So, while trivial views won’t really impact this too much, views containing joins or other operations can significantly impact this set of heuristics). Furthermore, the SQL Server optimizer uses a dynamic search algorithm that searches until it has found a “good enough” query plan. That logic is based on the time spent searching vs. the best plan cost found so far. If the view is non-trivial, that can also impact the search cost/time.
The user who mailed me also happened to be using a view that was defined over a 4-part name (ie a remote table). This kind of query is _very much not_ like a normal table. Distributed queries have significantly higher costs to execute. The cost of running this query is higher, which will impact the optimal plan choice in costing, whether parallelism can be used, somewhat limits the index sargability, and otherwise looks the same but is really something completely different.
I’ve co-authored a paper on how SQL Server optimizes distributed queries here. It goes over some of the deep technical pieces of how the current DQ algorithm works. For this specific case, it is worth noting that the heuristics associated with setting the initial join order are different for distributed queries.
The bottom line is that you should be careful in assuming that a local table and a local view over a remote table are remotely the same. They are not. While some of the plan choices may turn out looking similar for these two paths, in general they are not optimized the same way and you should not assume that they should be.
The full problem of optimizing queries in distributed environments is a tough problem and isn’t really an easy blog post (there are books on this subject ;). Often, splitting the query up is the only viable option with the current shipped products (2008R2 and below). I am hopeful that we can improve that area in a future product release.
So, the lesson for Monday is that local tables != local views over a remote table.
Can you elaborate on the impact of using remote tables please - in terms of are these tables in another database for the same instance, or in a linked server? or is the detrimental affect on plan choice the same?
Also does the same apply for synonyms - pointing at same database tables, same server tables and linked server tables?
"remote", at least in this context, means tables referenced using distributed query. Cross-database table references within a single database in a regular SQL Server are not using distributed query and they would work the same as normal tables within the same database.
Synonyms is a more complex discussion since they don't work properly with all distributed query scenarios. In general, you should look at the final target, not that it is a synonym.
It is incorrect to say that the plan choice is "bad" as a result of referencing distributed query tables. It is just not the same as the normal path. That's a good thing, as the optimal plan choices would be different. Remember, the question was "why aren't these identical?" and I wanted to convey that they should not be.