Conor Cunningham’s blog on SQL Server, data-driven applications, and pretty much whatever other random stuff he decides to post.
In my previous post, I debugged a customer problem for you, including the various guesses I had to make and why they matter.
The specific problem was likely related to parameter sensitivity, an issue in query optimization where the optimizer will try to generate a plan that is optimal for one parameter value but perhaps not optimal over all parameters values likely to be used in the future calling pattern of this query. I had recommended using the optimizer hint OPTIMIZE FOR (@p <value>) to nudge the optimizer to use a value that was better for the average workload.
Another user wants to better understand the difference between OPTIMIZE FOR(@p value) and OPTIMIZE for (@p UNKNOWN), a new hint that was added in SQL 2008. This new hint optimizes for an average value rather than a specific value. It is probably a good time to review a few background details of parameterized query plans so that I can use some vocabulary that we use internally to discuss specific customer cases.
So, the abbreviated summary (which you can find in earlier blog posts if you have not seen before:
1. Queries in SQL Server (all current versions) are compiled and are usually cached. That plan is used for all subsequent executions as long as it does not recompile.
2. Queries with parameter values _can_ (but do not always) have their values “sniffed”, meaning that the optimizer takes a peek at the current state of the containing T-SQL batch, using this value to generate a more optimal plan. You can tell if a particular parameter value was sniffed by looking at the showplan xml output, but please understand that the current system is imperfect in terms of sniffing. For the rest of this post, I will assume that values ARE sniffed for simplicity in describing the behavior that happens in the system.
When the optimizer knows a specific value to sniff, it will be able to determine things like whether a table scan is likely to be faster than a seek + bookmark lookup for a given filter. For example, “SELECT * FROM T WHERE col = @p” would likely generate a scan plan if col = @p is non-selective. If it is highly selective, then the seek may be optimal. This seek/scan plan flip is one common case where parameter sniffing can yield a more optimal plan for a specific parameter value. Whether this plan is optimal for the workload is another matter entirely. Unfortunately, the optimizer is in a tricky spot. Sniffing the value often leads to a much better plan – so much so that customers complain if they don’t get that plan. So, there is an interaction between the initial sniffed value + the resulting plan that is generated with the average calling pattern for that parameter. (It gets more complicated when you have many parameters).
There are a number of different common “plan flips” where the optimizer generates different plans based on what is sniffed. In addition to seek/scan flips, I often see index flips where the optimizer will choose a seek on index 1 vs. a seek on index 2. I also see join order flips in OLTP applications. The main ingredients here are that your workload be repetitive enough to see the patterns.
Even if you do have plan flips, this does not necessarily mean that your application will have problems. The relative performance of those plans may be very close, meaning that there is no real substantial benefit in picking one over the other. Even if the performance isn’t great, it might be that the query is just unimportant to an application – perhaps it is not really visible to the application and the whole job takes awhile, so some variability in one query is insignificant. The key point is that this problem includes the perception of the user and is not just some arbitrary metric defined in the optimizer.
The most common class of parameter sensitive problem (PSP) we see is where sniffing one value has “good” performance (from the perspective of the customer) for all parameter values while sniffing + caching the plan for another parameter value leads to “bad” performance.
Common Case (Atypical parameter):
Less Common (Distinct Sets of Optimal Plans/Values):
90% of the cases I see (where I can also get enough information to categorize) relate to having an uncommon value sniffed. If we take our earlier example, if @p is a very frequent value in the data distribution, the filter WHERE col = @p is highly unselective and the Scan plan may be generated. However, this value may be very infrequently used. As a result, most of the time the seek plan is generated but occasionally the scan plan might be generated instead if this infrequent value happens to be passed when the query needs to be recompiled. Understanding the frequency of calling values vs. the frequency of data distribution is necessary to fully understand which case you have. Note that I could construct an example this is opposite of this, where picking the scan plan is the right answer in most cases but sniffing a very infrequent value with low representation in the data set could cache a seek plan. The metric is overall workload time.
The other kind of problem we see is that plans are really only optimal for certain sets of parameters. If you use the wrong plan with the wrong parameter, then the performance is “bad” (again, from the perspective of a customer). This is harder because you actually don’t want to cache plans like this, so adding option(recompile) is usually needed if you want consistent performance. (You can also split the query into N copies and call each separate copy based on the values you see via T-SQL).
To optimize overall workload throughput, the other piece of the equation is compilation time. If you have a very expensive query to run but is cheap to compile, option(recompile) fixes all PSP problems easily :). If you have a PSP problem where the performance difference between the “good” and “bad” cases is larger than the total compilation time, you might very well be able to use option(recompile) to bound that query’s performance. If you have a critical code path that needs faster performance than compiling each time, then you should consider the OPTIMIZE FOR hints. If you are in the common case, you can hint a regularly used parameter value and that will avoid the “bad” performance case. If you are in the less common case, you may need to revert to OPTION(RECOMPILE) or perhaps split your queries into different copies so each set of parameters can get their own plans.
Now I’ve given enough background to explain OPTIMIZE FOR … UNKNOWN. There are some customers who have a common PSP problem but do not know a common value to pass. Perhaps their data distribution changes over time. Perhaps they don’t know their data set that well. OPTIMIZE FOR… UNKNOWN tells the optimizer to use an “average” value to optimize the query, and this often (but not always) will avoid the “bad” squares in the taxonomy I described.
Here’s an example of how this works:
SELECT * FROM T WHERE col = @p.
Let’s say that T has 10000 rows. Let’s say that we have the values 1-5000 each in the table once. Let’s have the value 5001 in there 5000 times. Furthermore, we have the column “col” in an index and we also have a lot of other columns that are not in that index. If the optimizer sniffs any of the values 1-5000, it is likely to generate the seek + fetch plan. For the value 5001, the scan plan is optimal.
When the optimizer is reasoning about parameters, it might not have a specific value to use. So, it reverts to use the density/frequency information stored in our histograms. That stores a representation of how common an “average” value is in the data distribution. In this case, the first 5000 values exist once and the last value is there 5000 times, so selecting an “average” value in an equality predicate would yield ~2 rows. 2 may be a low enough cardinality such that the seek + fetch plan is still optimal and the customer is happy. Note that running the parameter 5001 with that plan may still be very slow but perhaps it is never/rarely done. That parameter will get suboptimal performance.
The use of the frequency/density information varies a bit based on the scalar predicate or relational operator. GROUP BY col would use the density, but an arbitrary scalar condition in a WHERE clause may or may not (the conditions are too complex to explain and we don’t document them all publicly anyways).
I hope this gives a bit more insight into parameter sensitivity problems and how to solve them
Great post. Thanks for this information.
Thanks for these two articles, Connor. I hadn't understood how SQL Server managed query plans before.
What are the other reasons why a query is suddenly MUCH slower than usual? I have an ETL SSIS project which runs a series of SQL procs to clean up and augment the data it has loaded. Normally each of the several updates takes a few seconds - but sometimes one or another takes many hours, on the same data as took seconds in a previous run. If I start the whole job again, the same update usually runs normally, in seconds.
I don't think it's a query plan issue, as the parameter values actually used should not result in a different plan - and also running dbcc freeproccache first doesn't stop the problem. I've also tried rebuilding the indices just before the procedures run, without much benefit, so I don't think it's caused by index fragmentation. Finally, I added EXEC sp_updatestats to the start of the update tasks in case it was a problem with out of date statistics, yet the problem still occurs.
Nor is it a hardware problem, as it happens on both the development machine (old AMD single core running Windows 7) and the production server (dual Xeon, 16 GB RAM, Windows Server 2008); both run SQL Server 2008 sp1. The database size is 10 GBytes.
Nothing else is using the server at the time and there don't appear to be any locked transactions. The only thing I can find out of the ordinary when a query goes slow - other than the duration! - is hundreds of millions of logical reads, against a few thousand physical reads.
Usually this would be a plan change, and I would first check to see if it is parameter sniffing. The kinds of cases where this would happen are seek/scan flips (as described in the post) and sometimes there are Nested Loops + Spools on the inner side of the NL. This would happen if you have a "gatekeeper row" case, which is conceptually similar to a parameter but not actually a parameter. I guess I have my next topic on which to blog :).
I'm not sure that seek/scan would make this much difference - it's almost as if occasionally the plan is to use no indices at all! Some of the queries involve self-joins which are very slow without an index.
I don't /think/ it's a gatekeeper row case; the number of rows affected does go up if an Id matches a lot of orders, but with the update affecting 5k to 20k out of 800k rows, I would not have thought that would make a 3 orders of magnitude time difference. But perhaps I misunderstand the term; either way I look forward to your next post!
Of course the frustrating thing is that I can't see the bad execution plan - every time I test the code it runs quickly. Heisenbug!
Anyway, I added WITH RECOMPILE to the start of the procs which most often do this, and SO FAR it hasn't happened again. (I don't minds the mSecs of compile time when the procs will run for a few seconds.)
And of course, soon after I posted that, the job once again took 100 minutes for a step that normally takes 2 seconds, DESPITE the WITH RECOMPILE
The main thing I can recommend is capturing the query plans (sql profiler, DMVs) when you see this case happening. This would give you insight into what is happening - table scan, lots of random IOs, etc.
Thanks. The profiler didn't help much - simply too much data - but
SELECT * FROM sys.dm_exec_requests ORDER BY logical_reads DESC;
gave me a clue - though sys.dm_exec_query_plan() showed a plan of NULL while running, and there was nothing in the cache afterwards due to WITH RECOMPILE. However, I did manage to work out a situation in which the problem ALWAYS happens, so I could then run the query under test in SSMS on my development system and see the Actual Execution Plan when it finished - 7 hours later! 99% of time doing CLUSTERED INDEX SEEK on estimated 1 row, actual 9722 rows. BUT the rest of the plan involved two Nested Loops (1%), one of which had met an Actual 2498 million rows instead of its Estimated 1 row!!!
So I did EXEC sp_updatestats and ran the query again - this time it took just 4 seconds, with a completely different plan that used the secondary indexes on [Order_Number] and [Order_Date], and did Hash Matches. This DESPITE the fact that there was an sp_updatestats just before the proc had run. Which also gave me a clue; the query in question is near the end of the proc, and depends on fields [Order_Date] and [Selection_Date]. The latter is set by the previous UPDATE step. If we're loading new data, [Selection_Date] would have been all NULL at the start of the proc - so the plan created at that point would have assumed it still was; hence the pathologically bad plan. Maybe I should put a WITH RECOMPILE in the query itself? Or an UPDATE STATS for that index? Or add a hint to use the [Index_Date] index? (This is all new to me; I always thought "SQL Server knows best" before!).
Please read my earlier blog post on batch compilation in SQL Server:
Interesting. My problem wasn't due to a parameter, so I didn't use OPTIMISE FOR @p=whatever. I added option(RECOMPILE) to the query (instead of for the whole proc), but it did not help, perhaps because it was recompiling against misleading stats from when [Selection_Date] was always NULL, (is that what you mean by a gatekeeper case?). So I put UPDATE STATISTICS [tablename]; immediately before the query, and this + the recompile in the query DOES seem to have fixed the problem. I haven't been able to make it go wrong yet anyway!
Thanks for your help and guidance. I think I've fixed my problem, and I've learned a lot about SQL Server DMVs, statistics and execution plans in the process.
I use this query to find the cached query plans. It takes a while to run depending on how you write the where clause but it works most of the time. When run against a production environment usually you can finger the highest volume plan as the one most used. If you click on the xml output and it's a suboptimal plan then you most likely have your winner. Hope this helps.
select * from sys.dm_exec_cached_plans c
cross apply sys.dm_exec_query_plan(plan_handle) p
cross apply sys.dm_exec_sql_text(plan_handle) s
where s.[text] like '%query string here%'
and usecounts > 50 -- I set this around 50 to filter out tests run by developers. For the most part I don't care about the low volume plans, it's usually the high volume ones that kill you, in prod at least.
Let me start by stating I am quite new to the world of SQL Server and DB administration in general. I'm trying to learn as much as I can.
My understanding of parameterized queries and cached plans is that each subsequent query must be identical to the cached version. My question is how identical must future queries be? Does it have to be identical character for character? Is the optimizer smart enough to recognize simple difference in column or predicate order? What if the only difference is the name of the database being queried?
Thanks Leon, that's a useful script.
caching is done with an exact string match today. (memcmp), broadly.
One more thing you could do is that. Find the threshold value beyond which scan will be better and before this threshold seek + lookup is better. So if value is above threshold execute a query with forced scan plan using hints and below that use forced seek + lookup. I know it can not be done at optimizer level because optimizer wont be having these details stored somewhere.But an application designer or developer who knows what kind of data is there in table and how the data is accessed and how many times sql is called and what is the ratio of the values below threshold and above threshold. Thus based on this you could tweak the code to take adavntage of this technique to make all your queries as efficient as possible.