I received a question from a customer today about query plan choice for a parameterized query - specifically that different plans get picked based on where the plan is located, and sometimes one might be better than another. Apparently, for their application, a particular query plan was “better” when compiled in dynamic sql instead of a stored procedure. Others say that query plans are just as fine when compiled within a stored procedure. Which is it?
(If I were still consulting, the answer to this is “it depends” ;))
There are a lot of things necessary to answer this question, as it really isn’t about the exact location, at least not as such. It has more to do with the way different batches/procedures are compiled and also about how parameters are sniffed during plan compilation.
Let’s start at the beginning: The model for T-SQL (the procedural part of the programming surface area) is compiled in batches in SQL Server. This means that if you submit a batch of 5 queries, SQL Server will generally try to compile them all before executing any of them. (There are exceptions)
Now, let’s mix things like "set @p=5” in that query batch, and let’s have a later query reference that parameter. In this example, the whole batch is “compiled” before it is “executed”. That means that the “set @p=5” does not actually set @p=5 until that statement is executed. That means that the later query that references that parameter will sniff a NULL instead of a 5.
I wrote about the basics of this issue a few years ago in the QO team blog:
I Smell a Parameter!
So, back to our question – which one is “better”. The answer is that this problem is about scopes and not procedures vs. dynamic sql. Basically, putting the parameterized query within a nested scope (dynamic sql or another procedure) happens to impact whether the parameter gets sniffed or not. It’s the sniffing that impacts the plan choice, not from where it is hosted in T-SQL.
Now, why did the plan become “good” or “bad”:
* Sometimes when you sniff a NULL, you get a perfectly reasonable plan (if NULL is represented in similar proportions to other possible values).
* Sometimes if you sniff a NULL, you can get an absolutely horrible plan.
Generally, you want to get the Optimizer to see a good average value for your parameterized queries so that it will put a plan that works well, on average, for the average parameter case. If you never call the procedure with NULL for a parameter AND it generates a different plan than if you had compiled it with an actual value you would use, then you probably want to look at that and see how to get it to sniff a more common value – you want to get that Optimizer working for _you_ and not for that dreaded NULL :).
So, we’re done, right? Unfortunately no – it gets more complex again. There are some cases where one plan may not perform great for all parameter values (so, there is either not a good average value or there is otherwise some large performance difference based on the parameter value for the “best” plan). Let’s say that you are querying some big table for the “USA” table during the day and the “Togo” (a small island nation) data at night. These plans may be very different because their filter’s selectivities will be different. We refer to this as a plan with parameter-sensitivity.
For these parameter sensitive plans, you can also see “good” vs. ”bad” performance. By that, the performance differs, perhaps even by a lot, based on the parameter value. If you have cases like this, you have a few options.
So, there are a few different ways you can see “good” vs. “bad” plans for parameterized queries. Let’s talk about some practical guidance for how to avoid these issues.
First, you can move things to a new scope – that’s what happens when you move to dynamic sql. This likely gets the parameter sniffed and almost always will give you a good plan (at least for that parameter value). Another option is to just use an OPTIMIZE FOR hint to pass a common value to be sniffed – this is a great way to not have to worry about whether the batch compiles before it executes, etc. Another option is to use OPTION(RECOMPILE). If you have a parameter-sensitive case where sniffing the value does not work well, then you will want to consider OPTION(RECOMPILE) – when the compilation cost is smaller than the performance difference from using a suboptimal plan for the wrong parameter value, then it is a very good idea.
I’ve given you the short version :) There are some additional details around “deferred compilation” where a batch is only partially compiled before parts of it gets executed. SQL 2005+ also has statement-level recompile, an this can change the sniffing logic as well. I am not trying to give you a complete answer other than to give you the tools to ask the right questions – this stuff is tricky!
Happy Querying!
Conor Cunningham
Architect, SQL Server QP
Thanks to those of you who replied to my previous post. I spent the last week or so talking with customers, coworkers, and looking at various applications to see how recompiles are modeled today in their applications. At a high level, SQL Server doesn’t really “require” you to recompile at all. The main set of people who seemed to think about this were people having trouble with plan selection. For example, if statistics were not updated frequently enough and users were querying new data on a time-series column, this could cause the automatic algorithm to not work well until statistics were updated on the queried column. So, this is usually a more advanced topic for DBAs to consider.
As you might expect, customers who do this usually manage a somewhat larger database or set of databases. Sometimes it is an application where there are very precise performance requirements and the queries are relatively small (for example, a query that is usually a “seek” against a table instead of a scan). When the query plan selected is a scan, the query performs much more slowly than the seek plan.
In terms of frequency of mechanism, this seemed to be all over the place – people use sp_recompile, freeproccache, and update statistics for different kinds of problems. I think update statistics has an edge since it usually addresses the problems that can be fixed with a recompile, but others would do freeproccache or sp_recompile as well.
I will convey a small fact to you about sp_recompile that you may not know. Internally, SQL Server has a timestamp that is used to determine whether plans in the procedure cache are still valid – as long as the timestamp on the referenced object is the same the current timestamp in the metadata for that object, the plan is “valid”. sp_recompile works by updating that metadata timestamp for the object passed as an argument to sp_recompile. So, you are effectively making a metadata change by doing an sp_recompile. DBCC FREEPROCCACHE will drop objects from the cache without a metadata change. update statistics works on a different, secondary metadata version with slightly different semantics.
So, if I had to make a recommendation to you, I’d say to just consider whether you need to use sp_recompile or not – it is somewhat more heavyweight than the other approaches. None of them are invalid, of course, but you might find that it is a bigger hammer than you need if you want to fix one query with a plan that is not what you want. Note that DBCC FREEPROCCACHE takes arguments, and you can limit what is evicted so that you can reduce recompile load to the subset you desire.
Happy Querying (and Memorial Day, for those of you in the USA)!
Conor
I have a question for you guys about how do you force a recompile for a single query today?
There are a lots of different ways that one could cause the system to recompile a query:
a) sp_recompile <object used in a query>
b)DBCC FREEPROCCACHE (with optional arguments to restrict the set of impacted queries)
c) detach/attach db (2008+), restart server, etc.
d) use a lot of memory to force plans out of memory
e) update statistics
f) change the text of the query (add a space), which works mostly except for trivial plans (which you may not care about recompiling anyways)
I am curious what other ways people use (or what you use most frequently). If you have any interesting scenarios about how/why you recompile a plan, I’d like to hear about them, so please send them my way (conorc @ microsoft.com)
If you never have occasion to recompile queries or think it’s not useful, that’s fine too – I want to hear about those cases as well.
If there are some interesting cases, I’ll post them up anonymously if you don’t mind.
Thanks,
Conor
I received a question this week about whether filtered statistics update as frequently as regular statistics. The right way to ask the question is “Do filtered statistics become invalid as frequently as regular statistics?”, as stats are recomputed based on the queries that consume them, while stats are invalidated based on changes to the underlying data.
The basic mechanism for stats invalidation is a per-column counter in the storage engine. It’s not transacted, and it is incremented on each INSERT/UPDATE/DELETE/MERGE. When a normal statistics object is created, the current “modcounter” for that column is stored in the statistics object. As changes are made to the table, the difference between the current modcounter and the one stored in the statistics object becomes greater. Ultimately, when the difference is large enough (I will slightly oversimplify and say that it is 20% of the table changed for this blog post), the statistics object is invalidated. When a query consuming the statistic is executed again, it will determine that the statistic is out-of-date and recompile it. Once done, the query is recompiled using the new statistics object. Unfortunately, many of the details of this logic are not exposed, and this can make it somewhat difficult to understand the mechanism.
Filtered statistics do not change the fundamental mechanism – there is still one set of counters in the storage engine. The thresholds for when filtered statistics become invalid, however, are modified from the normal case based on the estimated selectivity of the filter. So, a filter that only covers 20% of the domain is updated 1/5th as frequently. The calculation is done without knowledge to whether the changes being made to the table would actually cause the statistic object to be changed.
I’ve written up an example to show that filtered stats do indeed get updated even when all of the changes to the data are outside of the range of the filter on the statistics object. In this example, the filter selectivity is very high (~90%), and the “data change” is effectively not really changing the data, but I know that it is impacting the modcounter for the column.
1: use tempdb
2:
3: --drop table testfilteredstatsrefresh
4: -- create a table for our example with values 0...999 in col2
5: create table testfilteredstatsrefresh(col1 int identity, col2 int)
6: go
7: set nocount on
8: begin transaction
9: declare @a int=0
10: while @a < 20000
11: begin
12: insert into testfilteredstatsrefresh(col2) values (@a % 1000)
13: set @a+=1
14: end
15: commit transaction
16: go
17: -- create filtered stats over a portion of the domain
18: create statistics i1 on testfilteredstatsrefresh(col2) WHERE col2 > 100
19: go
20: select * from testfilteredstatsrefresh where col2 <> 55 option(recompile)
21: go
22: set nocount off
23: -- update the table with a series of no-ops on the section of the range
24: -- not covered by the filtered stats
25: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
26: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
27: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
28: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
29: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
30: go
31: -- run the query again - auto-stats fires again
32: -- so,
33: select * from testfilteredstatsrefresh where col2 <> 55 option(recompile)
34:
I turned on sql profiler and captured the “auto stats” event to watch when it is recompiled. You will notice that this happens on the last query (after the changes), even though there have been no changes in the part of the domain covered by the filtered statistics object.
Happy Querying!
Conor
For those of you waiting for the new "Inside SQL Server" book, it is actually out now. Kalen decided to give it a name change, and the actual title is "SQL Server 2008 Internals". I've written a chapter on how the Query Optimizer works. It is currently #1 in the 3 related database/sql server categories on Amazon, so I'll say "thank you" to those who have already ordered a copy ;). Like my blog postings, there are a lot of sample scripts you can try so that you can "see" what the optimizer is doing - I find this technique to be a very good way to learn what is happening, and I hope it works for you.
http://www.amazon.com/Microsoft%C2%AE-SQL-Server%C2%AE-2008-Internals/dp/0735626243/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1237753244&sr=8-1
Enjoy!
Conor Cunningham
A customer recently asked our support organization about an unexplained series of range locks that they were seeing on a DELETE to a table with a foreign key.
Here is the repro:
1: create table Foo (FooId int not null primary key)
2: create table Bar (FooId int not null, BarId int not null)
3:
4: alter table Bar
5: add constraint PK_Bar
6: primary key (FooId, BarId)
7:
8: alter table Bar
9: add constraint FK_Bar_Foo
10: foreign key (FooId) references Foo(FooId)
11: on delete cascade
12:
13: insert into Foo values (1)
14: insert into Bar values (1, 1)
15:
16:
17: set transaction isolation level read committed
18: begin tran
19: delete from Foo where FooId = 1
20: commit tran
So the basic idea is to delete a row from a table in such a way that it would cause a cascading delete to occur. (It would also work for updates with update cascades). The customer saw that we were taking key range locks, which is really something that is done only in the serializable isolation level (which is SQL Server’s safest but “most blocking” isolation mode).
You can read more about isolation levels and key range locking here:
http://msdn.microsoft.com/en-us/library/ms189122.aspx
http://msdn.microsoft.com/en-us/library/ms191272.aspx
Locking is a very important aspect of database applications, but it is not well understood because it happens almost entirely implicitly. If you look at the plan for this delete, you don’t see much in terms of locking being explained:
(So, I think that this is probably something we need to do a better job of documenting in the future ;)
To understand the customer problem, we need to understand how locks get taken by the system. Internally, the Storage Engine will lock all pages that are scanned by the Query Processor. So, if the QP generates a seek plan, the SE will lock each structure that is touched as part of the Seek (Index object, then each intermediate page, and then the leaf row, assuming row-level locking as the granularity is on). Likewise, a scan will acquire locks for each row it touches, but it might have a lot more. At a certain point, SQL Server will try to escalate locks to a less granular lock so it can work better on large operations.
Now, what happens when we have a delete that actually touches two tables? The semantic properties of a cascading delete are:
1. Delete from the Foo table implies that all associated rows in Bar get deleted (the cascading part)
2. At the end of the statement, the Foreign Key property is still valid
So, if we just used nice read-committed, we hold X locks on things that we modify until the transaction commits (good), but we don’t prevent someone else from inserting a row while the Delete is being processed. This is a classic definition of why you would need a SERIALIAZABLE transaction… This diagram shows what is happening on the plan temporally:
(Remember that SQL Server often has many threads executing at once. So, operations against different access paths will not happen at identical times. There is a small time gap between the operations, and this is where you can get race conditions).
So, rather than require that customers set the serialization level for operations like this, SQL Server automatically upgrades the isolation level of certain operations to block other operations from inserting a row that would violate the foreign key property. In this case, it will take some range locks as if those parts of the query were running in a serializable isolation level without upgrading the whole statement. This is one way to minimize the blocking associated with the requirement by the server to avoid data corruptions.
SERIALIZABLE locks are typically held until the end of the transaction. If you run multiple statements in the same batch, it could take awhile until those locks get released. Since the server knows that exactly why it took the locks, it happens to know that it is safe to release those serialized range locks at the end of the statement. So, these internal serialized locks are held only until the statement finishes.
This same condition applies to indexed view maintenance, but I’ll save that for another day :).
Happy Querying,
Conor
(as ANSI NULLs is one of the requirements for indexed views, I thought I’d give you some perspective on why the questions I ask you are useful and interesting to better understand queries and various query operations)
As currently designed, indexed views tries to strike a balance among:
1. query shapes that can be efficiently matched
2. query shapes that can be efficiently maintained
3. query shapes that can be efficiently created
Loosely speaking, indexed views work for basic SPJG (SELECT-PROJECT-JOIN-GROUPBY) queries. The joins must be inner joins, not self joins, and join on a key. Each operator has a series of restrictions based on (1),(2),and (3) above.
So how does the optimizer maintain these things? Well, it helps to have an understanding of what happens with views in general. Here’s a primer I wrote on the subject awhile back (http://www.sqlskills.com/BLOGS/CONOR/post/Inserts-against-views-An-introduction.aspx). Overall, the view has a tree definition, and an update against that view is decomposed by substituting the original update tree with an expanded tree for the view definition. Since the view is ultimately based on base tables, the update queries that are supported in SQL Server are ultimately operations against those base tables (in other words, the server translates your request into an equivalent operation against the base table). Then the change is performed and everyone is happy.
INDEXED views are an extension of this concept – the main difference is that the index is maintained as part of the update to the base table. (My soon-to-be-released book chapter on Query Optimization goes into more detail on the subject of wide updates). The set of changes to the base table are further streamed into a set of changes against the indexed view. The key property is that the exact set of changes to the view can be computed from the specific changes to the base table. Many of the restrictions in indexed views come from this requirement. In the academic literature, this is sometimes referred to as the delta algebra or delta logic.
Essentially, the original view definition is modified so that the source table (the one you are modifying in the UPDATE) is replaced with only the set of changes instead of the whole original table. Here’s a basic example
create table dbo.t1(col1 int primary key, col2 int, col3 int)
create table dbo.t2(col1 int primary key, col2 int, col3 int)
drop view dbo.v1
CREATE VIEW dbo.v1 with schemabinding as
select t1.col1 from dbo.t1 as t1 inner join dbo.t2 as t2 ON (t1.col1=t2.col3)
WHERE t1.col3=5 and t2.col2 = 8
create unique clustered index i1 on dbo.v1(col1)
update v1 set col1=col1+1
The update to the view is seen in the bottom branch of the query, and the delta part is the Table Spool (which replaced t1 in this case)
So, the rows being modified are joined with t2 (just like in the view definition), filtered, and projected (the SELECT list). The resulting delta stream is then applied to the persisted indexed view after some update-specific operations (collapse and a special filter that is mostly internal stuff).
Let’s talk about a few of the restrictions:
1. No Self-joins – well, the delta algebra doesn’t work for self joins. In other words, one can’t take a single update stream for the base table and substitute it into the view definition and then _efficiently_ update the view definition (remember, one of the goals is to maintain views efficiently)
2. ANSI_PADDING, ARITHABORT, ANSI_NULLs, … The main problem here is that these can impact the results returned in the query, and since customers likely don’t want to have 1 index view for each combination of these values, they are restricted.
3. Limited operator support – Finding the set of operators that can be efficiently matched, maintained, and created is a difficult problem – there are many operators, and some of them can not easily be supported in indexed views. Most of them have odd restrictions like what is described in the CREATE INDEX help page (http://msdn.microsoft.com/en-us/library/ms188783.aspx)
3a. No DISTINCT – The QP needs to be able to uniquely determine the set of rows in the indexed view to change. DISTINCT, like GROUP BY, “destroys” information in the query tree by collapsing rows. GROUP BY in indexed views requires COUNT_BIG(*), which is a way to not lose information (but it requires that you modify your query to achieve that).
3b. MIN/MAX, STDEV, VARIANCE, AVG – does not preserve enough information to allow incremental maintenance.
3c. UNION – UNION (not ALL) implicitly includes DISTINCT on the result of a UNION ALL, so it is not allowed for the same reason as DISTINCT. More generally, UNION ALL makes it hard to decompose the source table for the UPDATE to work (without more extensive modificiations to the view to track which branch of the UNION is the source of the row)
3d. fulltext, ROWSET – these include external information, and these are restricted to prevent it from changing. It is necessary to prevent a wrong results bug if the external source changes, as indexed views are designed to be kept in-sync with the source.
3e. Subquery – this is partially performance issue, but some subqueries make it impossible to determine the exact set of rows to change without also looking at rows not being changed.
…
There are lots of restrictions on indexed views, but I hope this gives you a bit of the “why” on the design.
Happy Querying,
Conor Cunningham
Technorati Tags:
SQL,
Indexed Views
I was thinking about indexed views. Well, more specifically, I was thinking about why indexed views are hard to use. One of the reasons that they are hard is that there are just so many restrictions. Anyways, one of the things I'd like to better understand is how many customers actually use ANSI NULLs OFF and specifically _why_ do you use it? The typical reason we would expect is legacy applications (and that's a fine reason), but I am curious if there is more to it than just this.
So, you know the email address - conorc (at) microsoft . com - feel free to email me. I'll anonymously summarize what I learn, but I'd like to get a feel for how customers approach this little flag. Are you planning to get rid of you use of this? Do you plan to keep using it forever? Does the packaged sql app that you bought still use this? which one?
It is in the deprecation path (http://msdn.microsoft.com/en-us/library/ms188048.aspx). So it will eventually go away, but that doesn't mean that we have nothing to learn from it :).
Happy Querying!
Conor Cunningham
MERGE is a new operation added in SQL Server 2008. It has all sorts of knobs, bells, and whistles. The primary value of this feature is to collapse multiple query statements into one query statement. Overall, this avoids the “overhead” necessary to run a query a few times and can be a big benefit, especially when running many small queries that do things like “if exists then update else insert”.
I received a natural question this week on the behavior of MERGE with respect to locks and transactions. The expressed expectation was that since MERGE was a single statement, it should be atomic transactionally and it should not be possible to get duplicate primary key errors. This expectation is not really correct – ultimately, the locking semantics for UPDATES, DELETES,INSERTS, and MERGE are more complex and require some knowledge of locking and query plans to fully understand.
Here’s the script. The initial part is created once and the second part could be run multiple times.
(running in default readcommitted, btw)
Setup:
DROP TABLE FOO
CREATE TABLE FOO (id INT PRIMARY KEY NOT NULL, name nvarchar(20) NOT NULL)
DELETE FROM FOO
INSERT INTO FOO VALUES(0, 'base')
Runtime portion:
DELETE FROM @fooChanges
INSERT INTO @fooChanges VALUES((SELECT MAX(id)+1 FROM FOO), 'induction')
MERGE FOO AS f USING @fooChanges as fc
ON ( fc.id = f.id )
WHEN MATCHED THEN
UPDATE SET name = fc.name
WHEN NOT MATCHED BY TARGET THEN
INSERT (id, name)
VALUES (fc.id, fc.name);
Effectively, MERGE is doing complex stuff under the covers for you. It is performing a join between the original table (FOO) and the table to be MERGEed (@fooChanges). It is an outer join, and the matching rows correspond to the case when the two rows match and the “not matched” rows in the join correspond to the “not matched” case in the query statement. The not matched rows have NULL on the non-matching side, and this is eventually converted into an action to INSERT instead of UPDATE in the stream of operations to be performed against the table FOO.
Here’s where this is happening in the query plan:

For non-MERGE UPDATEs and DELETEs, the common practice within most engines is to have a special “U” (Update) lock that is taken on the source rowset and then promoted to an “X” (exclusive) lock when the Update operation is actually performed later in the query tree. (Way back in the days, the read part would take an “S” (shared) lock that would not prevent other consumers from reading the row. Ultimately, this could cause all updates to fail in high transaction environments because two update queries could both get the S lock and not be able to promote to the X lock (since S locks can only be promoted if nobody else holds another S lock). Ultimately, this was solved by adding a U lock that prevented other UPDATEs but allowed other SELECTs.
In the case of MERGE WHEN NOT MATCHED, we actually don’t have a row yet, so there is no place on which to attach a lock (unless you use SERIALIZABLE isolation, which would take key range locks). So, no U lock means that there is no prevention against duplicate inserts (or PRIMARY KEY duplicate violations, in this case).
So, ultimately, the MERGE does not prevent key collisions for WHERE NOT MATCHED INSERTs.
Happy Querying!
Conor Cunningham
As many of you may know, I work from my house most of the time - I have a nice office that is quiet and I actually find it to be a great way to get stuff done - I have no commute, I don't get interrupted very much, and I have to attend fewer meetings :).
There are a few negatives to the arrangement, and one of them is that I am my own IT staff. Actually, I am both the IT staff for me and for my wife's business (she is a photographer http://www.shannoncunninghamphoto.com/, and a darn good one IMO).
So I have a printer that until recently sat in my closet and was quiet until someone printed something. It would then wake up, warm up the drum, and print something. So far, so good. Unfortunately, it also causes my server computer to reboot.
Yes, coming out of power saving mode apparently causes it to reboot. My server is on an APS UPS as well, but apparently the spike does funny things and I am unlucky.
So, I go program my server to auto-login so that my music server stuff will continue to work.
I also ordered a new wireless bridge so that I could move the printer out of my room (and onto a different circuit in the house). That arrived over the holidays and I was able to get that all fixed up (without having to run another cable in the attic, and that made me happy).
So, I go back and look at my logs for my server, and it's still rebooting - apparently my decision to buy one of these 80%+ efficient power-saving power supplies was.... premature..
I should just become a part owner in Newegg - every problem seems to require sending them money :).
The printer now mostly works - I think :). I am playing with free network monitoring software (currently I am trying "The Dude"), as my home network has become complex enough to require some help in keeping everything working.
Yes, I'm alive.
I have been chatting with my wife about the times I need to "go dark" - in this case, I am working on features that I can't discuss yet, other than to say that I am *very* excited to go to work every day.. (well, since I work out of my house, I really just walk up the stairs, but give me some artisitic liberty). That doesn't mean that I can't talk about other topics, of course, so feel free to mail me if you have a pressing one. I just can't blog about my regular work each day.
I've gotten some questions about OPTIMIZE FOR UNKNOWN, which was added in 2008. It does seem a bit daft to try to reason about the unknown, but in this case there is a good reason.
For a primer on parameters and parameter sniffing, I'll refer you to my previous post on the subject:
http://blogs.msdn.com/queryoptteam/archive/2006/03/31/565991.aspx
The UNKNOWN extension is intended to help customers trying to get a good "default" parameterized plan. OPTIMIZE FOR is able to let you specify a particular paramter to use when creating a plan. Internally, it uses the cardinality estimate for a specific value to help determine the plan through the normal costing formulas. The main reason to do this is that the plans are cached, and if you cache a plan for a very common parameter value, it may not be the best plan for average use. Alternatively, if you cache the plan for the average predicate selectivity, you also may be suboptimal when you cache this plan for use in your workload.
OPTIMIZE FOR UNKNOWN lets you say "I want to select a plan that is based on the average selectivity for the predicate (the not common parameter value), but I don't know a good value to use".
Statistics in SQL Server have a histogram to track the distribution of values and their frequency. It also contains an average frequency over the whole table, which is stored inverted as 1/frequency in a field that is called the "density"). This value is used in different places. One place where it is used is in cardinality estimation for the GROUP BY operator (if you have 3 occurannces per value, grouping on that value should reduce your cardinality to 1/3 of the previous value). This same value happens to also be used for cardinality estimation for parameters with some predicates. If I have "WHERE col1 = @p", you can estimate that, for the average case, that predicate will return whatever the average number of duplicate values in the table is (assuming there is a match, which is generally the case).
So, OPTIMIZE FOR UNKNOWN is a nice way to tell the optimizer that you want a parameterized plan for the average parameter frequency distribution case. The nice benefit is that if you add a lot of data to the table and then update the statistics, the definition of "average" is defined based on the current statistics, so you don't have to worry about pegging a constant in your plan that is not representative in the future as your data changes.
My almost-in-print book chapter in the Inside SQL Server series for 2008 has a section on the math behind cardinality estimation with statistics. (Update: apparently Kalen changed the name of the book series - it appears to be called Microsoft SQL Server 2008 Internals now: http://sqlblog.com/blogs/kalen_delaney/archive/2008/10/23/SQL-Server-2008-internals-is-orderable.aspx)
I hope this gives you some insight into the optimizer's thought process when generating a query plan.
Happy Querying!
Conor
http://blogs.msdn.com/sqlprogrammability/archive/2008/11/26/optimize-for-unknown-a-little-known-sql-server-2008-feature.aspx)
I attended only a small number of the talks at PASS. I really enjoyed Bob Ward's talk on memory and Paul Randal's talk on DBCC corruption/repair. Otherwise, I spent a lot of time answering questions from customers. LOTS of questions. So many questions I pretty much lost my voice.
Overall, I was amazed at how many people had challenging questions at PASS - I didn't expect to be busy for the whole day answering questions like that - I conclude that SQL Server users are all above average, better looking, and generally smarter than non-SQL Server users :).
While I received lots of questions, advice, feature requests, requests to debug live servers, and much more, the topic that came up the most was ETL - lots of data loading questions, specifically. So, I'll talk a bit about data loading in SQL Server and in query plans.
Most query plans are complicated because we have to deal with tree normalization, statistics, exponential search spaces, and a host of other problems. Whenever you have queries with lots of joins, this is typically the kind of stuf that I end up examining to see why a query is fast or slow. ETL _can_ be about this, but often this is more about how much computing power you can throw at the problem.
In automotive terms, algorithmic optimization is about knowing what gear to use, what line to take in the turns, and when to do smart things like draft on someone else. Loading lots of data, on the other hand, is about horsepower. The more, the merrier.
As I had this conversation a few times with customers, I'll share questions and typical answers I got while working on this:
1. Have you separated out the major streams of data in your system? Is your data on a different drive than your log and a different drive again from tempdb? If they are on different drives, do you need different storage controllers?
2. Some people buy storage aggregators - basically lots of hard drives in big striping solutions over a LOT of drives (think RAID). These are nice because you can theoretically load at the throughput of more drives. You can either federate traffic on these (put tempdb on a subset of the drives, for example) or buy multiple aggregation solutions, one for each set of traffic if you prefer.
Remember, traditional drives are fastest at sequential writes because moving the disk head is relatively expensive. So, you want to minimize disk head seeks which means writing sequentially when possible. I had one customer who had put far more files than spindles on this, and I have no idea why (if someone has a good reason, let me know - I don't buy these things every day, so perhaps there is a good reason that eludes me).
So most of this advice from the first two is simple - separate things out so you can get greater throughput.
Now we get into the more interesting things.
3. bulk insert can work in parallel. regular insert typically doesn't. SQL Server can't support more than one writer to a table in a single query, typically. So, often you get blocked on this if you want to insert into an existing table (as bulk insert has restrictions). So, if you want to add more stuff into your table quickly, consider partitioning. You can create a new table, bulk insert into that table, and then SWITCH the partion into your big table. In fact, we did substantial work in SQL 2005 and 2008 to make that work well. It's called the Sliding Window Scenario (look it up in Books Online).
If you understand that query plans can't write to a single rowset in parallel, you'll understand the bottleneck from 1-2 to 3. If you hit the rate at which your CPU and memory can push rows for a single thread, you're in trouble. So, federating your work is key to a highly scalable app. So, you can bulk insert into the same table with multiple threads and you can do multiple partitions-to-be at the same time. This lets you exploit all of the CPU cores you have as well. Eventually you will run out of something else (memory bandwidth?), but by now I'll say that most of the customers are happy
The advice is basic, but most of the problems I saw around loading lots of data boiled down to one of these problems - understand how to keep all of the drives busy and then all of the CPUs busy and as productive as possible, and then you will maximize your hardware.
Conor
I hope to see folks at PASS - please look for me at the Ask the Experts booth for Query Processing/Performance. I'm very interested in hearing about what kinds of ways you are using SQL Server and how we can make the product better!
Thanks,
Conor
I've been building a new machine and I have put Win2K8 Enterprise on it. While I have no specific reason to enable the hypervisor other than my absolute, unbounded enjoyment at saying the word "hypervisor", I also was going to try to run a few of the server-like programs that I use at the house (for example, my music server software).
So, I turned it on and found that it would randomly reboot 1-3 times/day. (I realized this because my Squeezebox would stop showing me the time downstairs, which means the server went away). While I can (and did) put in the registry keys to auto-login to my server and get the music player to work more reliably, I was disappointed because this hardware was a combination of pieces I had been using successfully in another machine.
My best guess is that I have some sort of driver issue with my motherboard/cpu/etc and win2k8. In fairness, the system is about 2 years old, and I doubt they did extensive testing on the final version of win2k8. I will go hunt for updated BIOS and drivers, but often this means you need to upgrade to a more recent motherboard...
FWIW, I am running a core 2 duo w/8GB RAM on an Asus P5B-E. The video card is an ATI Radeon x2600 PCI-E, I believe. Not much else in there but drives.
Based on my searches, not too many people are running such a combination :). Oh well.
Thank you for everyone who "voted" - a great day to talk about the issues of the day :).
After much research, I've found that we live in a divided country. Some of us like triggers. Others wouldn't be caught dead with a trigger and do lots of business logic validation within stored procedures. Some of you consider CHECK constraints, depending on what else is available on the ballot. Nobody uses CHECK constraints for all of their validation, though a suprising few use some of the lesser-known and unsupported parts of CHECK constraints in SQL Server to get the job done.
My original purpose for asking about business logic validation was to see how many people use CHECK constraints (and to ponder possible future features in the area). What I learned was that there are lots of emotions, questions, and opportunities for SQL Server around business logic validation.
Here's a rough summary of what I've heard from customers.
1. Applications that are stored-procedure based (often OLTP applications of various sizes) usually put a good chunk of the business validation logic in the server within these stored procedures. The good part about this design is that all of the logic can be centralized on the server. This usally aligns well in organizations where the DBA job is separate from the application developer job (as the DBA can use this design to validate data coming across that trust boundary). The challenges with this design are that writing T-SQL code can be more difficult, that similar routines may have to duplicate logic, and that this may be "slow" (which I won't even try to define, but we can accept that it takes time to validate constraints of various forms, no matter the application).
2. Applications that use triggers often do similar things. Curiously, I didn't find anyone actively building such designs, so most of these were legacy. I don't know if I should read anything into this or not (people building new systems with triggers - please mail me!).
I didn't see a specific delineation between why one would pick one path or the other. My sample size is probably not quite big enough to draw conclusions about this.
3. Some people use CHECK constraints. Often this was the result of a DBA trying to deal with issues across the "trust boundary" between database developers and DBAs. Many people didn't know much about CHECK constraints, but the ones who used them typically swore by them.
4. (no customers sent this to me in my latest inquiry, but I'll add this to the discussion). As databases start to get into larger and larger scales, it becomes necessary to factor portions of the application to make it more loosely coupled. This typically relates to how many CPUs you can stuff onto a single machine, so it is more of a problem at the high-end. Service-Oriented Architectures speak to this issue. In Business Logic Terms, this often moves to the mid-tier as applications tend to scale. That doesn't mean that you should try to start with it there - most applications never get to that scale. It does make sense to factor your business logic so it can be moved/reimplemented in a different tier.
My advice to people considering new applications today is to think about the data constraints more explicitly than many people do today. Even if unenforced, going through the exercise of saying what the database data *should* look like can find issues in your application. I get the impression that this is a bit of a lost art.
One last point - PRIMARY KEY, UNIQUE, CHECK, and FOREIGN KEY constraints are all used by the optimizer in plan generation. These can be helpful in plan generation, but maintaining those constraints is considered expensive for some customer scenarios, such as large data warehouses, where each index is scrutinized. I'd recommend considering database-level constraints in non-DW applications, where possible. It can often to lead to substantial improvements in certain query plans.
Thanks,
Conor