Welcome to MSDN Blogs Sign in | Join | Help

Programmers should naturally gravitate toward the simplest, most elegant solution.  This is because the simplest coding solution is so often the best solution: simple solutions are cheaper to implement; easier for others to understand, maintain, and extend; and less prone to bugs.  Simplicity can also be associated with efficient and consistent execution, but performance is one area where Occam's Razor doesn't apply quite as consistently.  Performance requirements, more often than many other types of requirement, may demand introducing complexity into what would otherwise be a nice, tidy implementation.  I’m going to discuss one such case here. 

 

The All-In-One Search Query

Sometimes you need a query or a stored procedure to expose several search parameters but actually do the search only for the subset of the parameters that the caller cares about.  For example, consider a [find_customer] stored procedure that can search for a @name value, a customer @id, an @address, or any combination of these.  The most obvious solution to this problem looks something like this:

 

    SELECT [id], [name], [address], ...

    FROM [customers]

    WHERE (@name IS NULL OR [name] = @name)

        AND (@id IS NULL OR [id] = @id)

        AND (@address IS NULL OR [address] = @address)

 

This allows the caller to pass non-NULL values for the search criteria they care about, and to pass NULLs for any criteria that they want to be ignored.  For lack of a better term, I have always referred to this as the “All-In-One Search Query”.  Here’s a sample stored proc that shows a slightly simplified example with just two available search parameters:

 

    CREATE PROC find_customer

        @name nvarchar(128) = NULL,

        @id int = NULL

    AS

    BEGIN

        SELECT [id], [name], [address], [creation_date]

        FROM [customers]

        WHERE (@name IS NULL OR [name] = @name)

            AND (@id IS NULL OR [id] = @id);

    END;

 

This idiom is simple enough that it is relatively easy to understand.  Despite its simplicity, it allows a single query to meet many different needs (in this case: a search on name, or a search on address, or a search on both name and address).  It is understandable that queries like this are so popular; it seems to be an elegant solution, and developers like elegance.  Unfortunately, it’s usually a bad idea for practical performance reasons.  The problems are significant enough that I regard this as an anti-pattern in my own code. 

 

What's Wrong With the Obvious Solution?

The first problem is that the OR operators will prevent an efficient index seek-based query plan.  If the search parameters are mutually exclusive (the caller is only supposed to provide one parameter value), you can avoid this problem by changing the query to:

 

    SELECT [id], [name], [address], [creation_date]

    FROM [customers]

    WHERE (@name IS NOT NULL AND [name] = @name)

        OR (@id IS NOT NULL AND [id] = @id);

 

That may allow an index seek, but you should expect that it will still perform poorly in some cases because it suffers from other problems that I’ll discuss next.  Also note that the query semantics aren't quite the same as the first query. 

 

The second problem that affects both the original problem query and the attempted rewrite is that SQL does a poor job with plan costing for queries like this.  In general, you should avoid using filter predicates where both operands are constants and one or both are variables or parameters (e.g. “where @var = 1”, or “where @parameter is null”).  These have a tendency to screw up the optimizer’s estimates because at plan compile time the optimizer has no idea whether the predicate will qualify all rows, or disqualify all rows.  You are essentially guaranteed that its blind guess will be exactly wrong for some parameters.  The bad estimate is likely to cause inaccurate costing of plan alternatives, which in turn is likely to cause selection of a suboptimal plan. 

 

The third problem is that there is no single plan that will be appropriate for all of the different combinations of parameters, even if the optimizer was smart enough to predict the outcome of “variable = constant”-type predicates at compile time.  For example, suppose you had a nonclustered index on the [name] column, and another nonclustered index on [id].  If SQL chose a plan that first did a lookup using the index on the [name] column, this would be a very poor choice whenever the caller passed NULL for @name.  Conversely, a plan that first scanned an index on the [id] column would be inefficient when @id was null and @name was non-null. 

 

The only situations where you should use queries like either of those shown above is when (a) you don't care about the performance of the query, or (b) you can guarantee that the table will only contain a handful of rows so that a table scan-based plan will never be noticeably slower than a seek-based plan. 

 

Possible Solutions

One possible solution would be to have a different query for each search parameter (or combination of search parameters, if that is legal input for the procedure).  This is shown below.  It may be a practical solution if you only have two or three possible search parameters. 

 

    CREATE PROC find_customer

        @name nvarchar(128) = NULL,

        @id int = NULL

    AS

    BEGIN

        IF (@name IS NULL)

        BEGIN

            SELECT [id], [name], [address], [creation_date]

            FROM [customers]

            WHERE [id] = @id;

        END

        ELSE IF (@id IS NULL)

        BEGIN

            SELECT [id], [name], [address], [creation_date]

            FROM [customers]

            WHERE [name] = @name;

        END

        ELSE BEGIN

            SELECT [id], [name], [address], [creation_date]

            FROM [customers]

            WHERE [name] = @name AND [id] = @id;

        END;

    END;

 

Note that parameter sniffing may cause bad plan selection even in this case if each search query is more complex than the queries shown here (if the queries include joins against large tables, for example).  You can protect against this problem by assigning the parameters to local variables and using the variables in the queries.  You can also avoid the problem by moving the search queries to a different compilation scope (child stored procedures, dynamic SQL, or "OPTION(RECOMPILE)" query hints). 

 

The second possible solution uses dynamic SQL to construct the WHERE clause dynamically. 

 

    CREATE PROC find_customer

        @name nvarchar(128) = NULL,

        @id int = NULL

    WITH EXECUTE AS OWNER

    AS

    BEGIN

        DECLARE @search_query nvarchar(max);

        SET @search_query = '

            SELECT [id], [name], [address], [creation_date]

            FROM [customers]

            WHERE 1 = 1 ';

       

        IF (@name IS NOT NULL)

        BEGIN

            SET @search_query = @search_query + 'AND [name] = @name';

        END;

       

        IF (@id IS NOT NULL)

        BEGIN

            SET @search_query = @search_query + 'AND [id] = @id';

        END;

       

        EXEC sp_executesql @search_query, N' @name nvarchar(128), @id int',

            @name = @name, @id = @id;

    END;

 

You should explicitly parameterize the query, as shown here, to avoid the risk of SQL injection attacks.  Note the EXECUTE AS OWNER used for this procedure, which may be appropriate to avoid the need to grant the end user SELECT permission on the base table(s).  Also note that the pattern is simplified by always passing all parameters to sp_executesql, even though the query may only make use of a subset of the parameters.  I generally don't like to use dynamic SQL where I can avoid it, but I make an exception for this case; the dynamic WHERE clause is actually my preferred solution when I'm faced with this problem. 

 

I’ve seen the All-In-One search query cause perf problems in code written both inside and outside of Microsoft, and I’ve been burned by the problem in my own code.  It’s unfortunate that the better alternatives all require more complex code, but this is a case where the additional complexity is justified. 

 

 

 

UPDATE: Erland Sommarskog has a page on this very topic here: http://www.sommarskog.se/dyn-search-2005.html.  He calls it the problem of "dynamic search conditions" (a less awkward phrase than "All-In-One Search Query").  The page provides a very detailed discussion of the topic, and would be a great read if you really want to drill into the details. 

 

UPDATE #2: Another good read on a variation of this problem: http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/ (and, to a lesser degree, http://sqlinthewild.co.za/index.php/2009/09/15/multiple-execution-paths/), both from Gail Shaw.  Gail calls this class of query "Catch-All Queries". 

 

 

SQL Server 2008 added a new data type named “datetimeoffset”.  This is similar to the old datetime data type, with the following significant differences:

·         Internally, the time is stored in unambiguous UTC format

·         The local time zone offset is stored along with the UTC time, which allows the time to be displayed as a local time value (or converted to any another time zone offset)

·         The data type is capable of storing more precise times than datetime

 

So, when should you use datetimeoffset, and when should you use datetime?  The first answer is that you have no choice at all if you aren’t using SQL 2008, since datetimeoffset was first added in this SQL version.  If you are working with SQL 2008, let’s address the question by examining some problems with the older datetime type:

 

Problem 1: DateTime is inherently ambiguous

Suppose you are a consultant looking at an existing table with a datetime column.  A row in this table tells you that some critical event occurred at 4:35pm.  Is this the server’s local time?  Local time for the end user’s machine?  GMT, or Coordinated Universal Time (UTC)?  Local time for a time zone selected by convention that may or may not match the server’s local time zone?  Here in Microsoft, for example, some systems were designed to store Pacific time by convention, even though in some of these cases the SQL Server may reside in a data center that isn’t on the west coast.  Other databases store UTC times, again by convention adopted by whoever designed those systems.  So this is the first problem: datetime is ambiguous.  A datetime value by itself actually does not identify a particular moment in time; it takes on an clear meaning only when you interpret it in the context of some assumed, and usually unenforced, time zone. 

 

Problem 2: It is impractical to convert historical DateTime values to/from time zones

If you’ve ever built a data warehouse that consolidated data from several data sources, you may have struggled to convert various ambiguous local time representations into a consistent form.  Some data sources store their times as UTC, some store local server time, others use a local time based on some end user’s time zone.  You could just cram all of those values into a datetime column in your warehouse, but it would be impossible to interpret the times in a meaningful way.  No matter what time zone you selected as a lens through which to interpret the data, it would be wrong for some values. 

 

So you must expend some extra effort to figure out the implicit time zone for every data source in your warehouse, and then you have to write code that converts all of these various times to some consistent time zone, likely UTC time, in the central data warehouse.  If that was the end of it, it would be bad enough.  But Daylight Savings Time makes it next to impossible to do this conversion in a generic way.  The problem is that different locales have different Daylight Savings Time rules.  Many parts of the world don’t honor DST at all, some places do honor it but use half-hour offsets instead of full hour offsets, and various places begin and end DST on different dates.  Here’s an concrete example: suppose you have the local datetime value 2008-03-05 08:30:00 in the database, and you need to convert this to UTC time.  By interviewing the right DBA or by examining source code, you have determined that this database stores datetime values using local server time.  The local server is in the Pacific time zone, and you’ve found that you can use a T-SQL expression like this to determine the current time zone offset for the local server:

 

                DATEDIFF (minute, GETUTCDATE(), GETDATE())

 

This tells you that the local server time is 7 hours behind UTC right now, so you should be able to add 7 hours to the local time to get the equivalent UTC time, right?  That would be wrong; Daylight Savings Time is in effect in the U.S. today (April), but when this datetime value was collected back in March of last year, DST was not in effect.  The correct time zone offset to use is UTC minus 8 hours.  So you have to have knowledge of what the local time zone offset would have been on arbitrary past or future dates, and bake this knowledge into your conversion routine.  If your data comes from a variety of locales, you have to have correct information about the time zone rules in every place your data comes from.  To make matters worse, the rules for DST may change from year to year in the same locale due to legislative changes, so you have to capture different sets of rules for different ranges of dates within each region.  Have you taken into account the fact that (most of) the state of Arizona doesn’t use DST?  Or that Indiana didn’t use DST at all prior to 2006, but you do need to adjust for DST for any data that was captured on a server from Indiana after 2006?  Does your conversion routine account for the fact that Daylight Savings Time in the U.S. was lengthened by about a month starting in 2007?  This problem isn’t unique to the United States, and the situation can be even more grim if your data comes from more than one country.  DBAs and developers in China, India, and Japan get off a little bit easier because DST is not observed in those countries, but they still have the problem to some degree if they ever need to consume data that originated in other places or push their data to a consumer in a different country. 

 

Finally, there is small time window each year that will defeat even the world’s most intelligent time zone conversion routine.  In 2009, DST in the United States will end at 2:00 am on November 1st.  At that time the clocks will roll back an hour, to 1:00 am.  In other words, each year there is a one-hour window during which times like 2009-11-01 01:35 am will actually occur twice.  That ambiguity is completely intractable.

 

Problem 3: If you do manage to convert DateTime values collected in various places to a single time zone, you lose important information

Let’s suppose that a database you are working with stores UTC times.  Good for you: all of your times are unambiguous.  But unless your users all live in London or Lisbon (and DST is not in effect), UTC is generally not very meaningful to a user.  You could theoretically convert the times to the end user’s local time zone (if it weren't for the inconvenient fact that this is impractical, as we just discovered).  But what if you wanted to present the time in local time relative to the place where the timestamp was captured?  For example, suppose you wanted to show records from a consolidated server health log as local times for the server where they were captured.  You can’t.  The information about the current local time zone offset at the moment the timestamp was collected was lost when you converted the difficult-to-work-with local time into that nice, pure UTC time. 

 

 

These three problems can combine to create a real mess.  Frankly, I think I might consider a career change if I was tasked with solving all of these problems in a large-scale project that consolidated historical data from many different places.  Of course, you might be thinking, My datetime values are all local server time; their meaning is perfectly clear to me.  Well, one day your company may expand and your little homegrown system might need to handle data from more than one region.  Or you might need to import the data into a new system when your solution is thrown out for being too provincial :).  Or the data in your local database might turn out to be needed in some central data warehouse that consolidates data from a variety of sources.  You can save yourself and your successors some grief by using the more robust datetimeoffset data type from the start.

 

 

When to Use DateTimeOffset?

Because the datetimeoffset data type stores a UTC date internally, it’s free of the ambiguity that causes problems #1 and #2.  And because it also stores the time zone offset that was current at the time the timestamp was generated, it doesn’t suffer from the data loss problem that you face if you store times as UTC datetime values (problem #3).  In other words, with the same datetimeoffset value you can represent the value as a local time or easily convert it to a UTC time.  SQL will do the right thing if you compare two datetimeoffset values, even if the values were captured from systems with very different time zone offsets. 

 

So, let's return to the original question: When should you use datetimeoffset instead of datetime?  The answer is: you should almost always use datetimeoffset.  I’ll make the claim that there is only a single case where datetime is clearly the best data type for the job, and that’s when you actually require an ambiguous time.  For example, if you wanted a column to record the fact that all stores in a chain should open at 8:00am local time (whatever the local time zone may be), you should use datetime.  But any time you want to store a value that represents an absolute moment in time, you would be better off using datetimeoffset.  For most applications, that means that just about everywhere you currently use datetime would be a good candidate for datetimeoffset.

 

Please don’t beat me up over the fact that SQL 2008’s DMVs still use datetime :).  This is a known problem, and it’s on the books to look at for future versions.  We’re facing the same problems that you’ll face in your existing systems: it’s hard to make a system-wide datatype change that doesn’t break someone somewhere.  For any brand new SQL development work you do, though, I encourage you to pause and consider your choice carefully before using datetime.  It may be an appropriate choice in some cases, but most of the time you’d probably be better off with datetimeoffset. 

 

Common DateTimeOffset-Related Tasks in T-SQL (SQL 2008 only)

Retrieve the current time as a datetimeoffset (comparable to the venerable GETDATE function):

                SELECT SYSDATETIMEOFFSET()

 

Retrieve the server’s current time zone offset (the number of minutes before or after UTC):

                SELECT DATENAME (TZoffset, SYSDATETIMEOFFSET())

 

Convert from datetime to datetimeoffset (note that this uses the server’s current time zone offset, which could be inappropriate for historical dates):

                SELECT TODATETIMEOFFSET (datetimevalue, DATENAME (TZoffset, SYSDATETIMEOFFSET()))

 

Convert a datetimeoffset value (in this case, local server time returned by SYSDATETIMEOFFSET) to a new time zone offset:

                SELECT SWITCHOFFSET (SYSDATETIMEOFFSET(), '-05:00')

            

 

Finally, be aware that .Net 2.0 SP1 added support for the datetimeoffset data type, so you can round-trip this nice new data type between SQL and a client app without any fuss. 

 

 

I've received a couple of questions in email and in comments about deadlocks involving mysterious-sounding non-lock resources like "exchangeEvent" and "threadpool".  There are a couple of examples in the comments for post http://blogs.msdn.com/bartd/archive/2006/09/25/deadlock-troubleshooting-part-3.aspx, and here's a forum post on the topic: http://forums.microsoft.com/Forums/ShowPost.aspx?PostID=3913233&SiteID=1.  

 

Here's one example (note that I've omitted the "inputbuf" and "executionStack" nodes for the sake of brevity and clarity):

 

__________________________________________________________________

deadlock-list

deadlock victim=process38316d8

 

 process-list

  process id=process3808478 schedulerid=1 kpid=216 status=suspended  spid=51 sbid=0 ecid=8 

 

  process id=process3809ac8 schedulerid=1 kpid=5672 status=suspended spid=51 sbid=0 ecid=17

  process id=process38136d8 schedulerid=2 kpid=5644 status=suspended spid=51 sbid=0 ecid=16

  process id=process3813828 schedulerid=2 kpid=6064 status=suspended spid=51 sbid=0 ecid=9 

  process id=process381c478 schedulerid=3 kpid=5292 status=suspended spid=51 sbid=0 ecid=10

  process id=process381d2e8 schedulerid=3 kpid=4372 status=suspended spid=51 sbid=0 ecid=19

  process id=process38265c8 schedulerid=4 kpid=5552 status=suspended spid=51 sbid=0 ecid=11

  process id=process3827ac8 schedulerid=4 kpid=5716 status=suspended spid=51 sbid=0 ecid=18

  process id=process38309b8 waittime=609 schedulerid=5 kpid=0

  process id=process38312e8 schedulerid=5 kpid=3204 status=suspended spid=51 sbid=0 ecid=6 

  process id=process38316d8 schedulerid=5 kpid=5108 status=suspended spid=51 sbid=0 ecid=13

  process id=process383a718 schedulerid=6 kpid=5216 status=suspended spid=51 sbid=0 ecid=7 

  process id=process383ada8 waittime=609 schedulerid=6 kpid=0

  process id=process383beb8 schedulerid=6 kpid=5852 status=suspended spid=51 sbid=0 ecid=14

  process id=process3845588 schedulerid=7 kpid=6096 status=suspended spid=51 sbid=0 ecid=15

  process id=process38456d8 schedulerid=7 kpid=760 status=suspended  spid=51 sbid=0 ecid=0 

  process id=process3845c18 schedulerid=7 kpid=5992 status=suspended spid=51 sbid=0 ecid=12

 

 resource-list

  threadpool id=scheduleree6080

   owner-list

    owner id=process38316d8

    owner id=process38312e8

   waiter-list

    waiter id=process38309b8

  exchangeEvent id=port80140950 nodeId=9

   owner-list

    owner event=pending id=process383ada8

    owner event=pending id=process38309b8

   waiter-list

    waiter event=e_waitPortOpen type=consumer id=process3813828

    waiter event=e_waitPortOpen type=consumer id=process3808478

    waiter event=e_waitPortOpen type=consumer id=process381c478

    waiter event=e_waitPortOpen type=consumer id=process38265c8

    waiter event=e_waitPortOpen type=consumer id=process3845c18

    waiter event=e_waitPortOpen type=consumer id=process38316d8

    waiter event=e_waitPortOpen type=consumer id=process383beb8

    waiter event=e_waitPortOpen type=producer id=process3845588

    waiter event=e_waitPortOpen type=producer id=process38136d8

    waiter event=e_waitPortOpen type=producer id=process3809ac8

    waiter event=e_waitPortOpen type=producer id=process3827ac8

    waiter event=e_waitPortOpen type=producer id=process381d2e8

  exchangeEvent id=port80140690 nodeId=5

   owner-list

    owner event=pending id=process383ada8

    owner event=pending id=process38309b8

   waiter-list

    waiter event=e_waitPortOpen type=consumer id=process38456d8

  exchangeEvent id=port80140c10 nodeId=12

   owner-list

    owner event=pending id=process383ada8

    owner event=pending id=process38309b8

   waiter-list

    waiter event=e_waitPortOpen type=producer id=process38312e8

    waiter event=e_waitPortOpen type=producer id=process383a718

  threadpool id=scheduleref6080

   owner-list

    owner id=process383beb8

    owner id=process383a718

   waiter-list

    waiter id=process383ada8

__________________________________________________________________

 

"exchangeEvent" Deadlock Resources

Some terminology, to better understand the trace flag 1222 deadlock output shown above:

spid = system process ID, AKA "session_id" -- to oversimplify slightly, this represents a connection to SQL

sbid = system batch ID, also called "request_id" -- a query that a spid is running

ecid = execution context ID -- a worker thread running part of a query

 

There's one thing that you should note about this deadlock right off the bat: all of the participants (the "process" nodes in the -T1222 output) are from the same session identifier (spid) and the same batch (sbid).  They each have a different thread ID (kpid and ecid).  Each "process" entry in this deadlock represents a different worker thread, but all of these worker threads are running part of a single large parallel query that was submitted by spid 51. 

 

Another interesting thing about this deadlock is in the resource list: most deadlocks involve lock resources ("pagelock", "keylock", etc), but this one only deals with "exchangeEvent" and "threadpool" resources. 

 

Deadlocks centering around exchangeEvent resources have been given the name "intra-query parallelism deadlock".  (I know -- it just drips 'sexy', doesn't it?)  They may be accompanied by this error message -- sent to the client app only, not logged in the SQL errorlogs:

Server: Msg 8650, Level 13, State 1, Line 1 Intra-query parallelism caused your server command (process ID #51) to deadlock. Rerun the query without intra-query parallelism by using the query hint option (maxdop 1).

An "exchangeEvent" resource indicates the presence of parallelism operators in a query plan.  The idea is that the work for an operation like a large scan, sort, or join is divided up so that it can be executed on multiple child threads.  There are "producer" threads that do the grunt work and feed sets of rows to "consumers".  Intra-query parallel requires signaling between these worker threads: the consumers may have to wait on producers to hand them more data, and the producers may have to wait for consumers to finish processing the last batch of data.  Parallelism-related waits manifest in SQL DMVs as CXPACKET or EXCHANGE wait types (note that the presence of these wait types is normal -- by itself, these waits don't indicate that this type or any other type of deadlock is occurring).

 

Wherever you have threads waiting for resources, there is a risk that they will end up in a circular blocking chain (thread A holding resource X and waiting for resource Y, thread B holding resource Y and waiting for resource X).  The synchronization objects used in parallel query execution are no exception; in rare cases, the threads running a single query can end up deadlocking with one another.  Most intra-query parallelism deadlocks are considered bugs, although some of them can be risky bugs to fix.  If you run into one and you're already on the latest SQL service pack, your best bet may be to investigate workarounds.  Luckily, this type of deadlock is relatively uncommon, and in most cases it's possible to work around the problem by eliminating parallelism in the query.  Try one of these two approaches:

 

Workaround #1: Add an index or improve the query to eliminate the need for parallelism.  In most cases, the use of parallelism in a query indicates that you have a very large scan, sort, or join that isn't supported by proper indexes.  If you tune the query, you will often find that you end up with a much quicker and more efficient plan that doesn't use parallelism, and therefore isn't subject to this type of problem. Of course, in some queries (DSS/OLAP-type queries, in particular) it may be difficult to eliminate all large scans.


Workaround #2:  Force single-threaded execution with an "OPTION (MAXDOP 1)" query hint at the end of the query.  If you can't modify the query, you can apply the hint to any query with a plan guide (assuming that you're running SQL 2005 or later).

 

 

"threadpool" Deadlock Resources

A process waiting for a "threadpool" resource is actually waiting for a worker thread.  There are a finite number of threads in SQL's thread pool, and if they are all in use, new requests must wait for an in-progress task to complete and free up a thread.  Thread pool waits (in DMVs, "THREADPOOL" or "UMSTHREAD" waittype) are typically a side effect of a massive resource contention problem -- most commonly, a large blocking chain.  You should investigate what is tying up all of your worker threads, and eliminate that bottleneck.  While it's not clear from this deadlock output alone, I suspect that in this case there may have been many other large untuned queries using lots of parallel threads, so excessive parallelism itself may have been the cause of the thread starvation. 

 

 

"resourceWait" Deadlock Resources

(UPDATE: The following info on "resourceWait" was added to this post 15 Oct 2009.)

A process waiting for a "resourceWait" resource is waiting for a "resource semaphore".  A "resourceWait" type resource corresponds to a resource semaphore in SQL.  Resource semaphores are typically used to govern memory used for query sorts and hashes.  So the following wait graph:

resource-list
   keylock hobtid=72057594038845440 dbid=6 objectname=XXXTABLE indexname=YYYINDEX id=lockffffffff81314cc0 mode=X associatedObjectId=72057594038845440
    owner-list
     owner id=processebb108 mode=X
    waiter-list
     waiter id=processebae38 mode=S requestType=wait
   resourceWait
    owner-list
     owner id=processebae38
    waiter-list
     waiter id=processebb108

could be read as:

     Spid A is waiting for a shared key lock, but is blocked by Spid B, who holds an exclusive lock on this key. 
     Spid B is waiting for more memory to run his query (and, eventually, to release the X key lock), but he is blocked by other spids, including Spid A, who are currently holding all of the memory available for this type of operation. 

 

You could tackle this by looking for tuning opportunities in the queries run by both deadlock participants.  They are probably running a query plan that involves a hash or sort operation.  Remove this through indexing or query changes and you should eliminate the query's need to wait for a query memory grant.  You could also try throwing RAM at the problem, but keep in mind that query workspace memory, the memory used for sorts and hashes, must be drawn from "visible bpool".  On a 32-bit box, visible buffer pool is limited to approximately 2GB (3GB if you are running with /3GB).  If SQL already has this much memory available to it, adding more won't help. 

 

Caveats

Just because you see “exchangeEvent” resources in your deadlock graph doesn’t necessarily mean that you are facing an intra-query parallelism deadlock.  Sometimes the engine includes extraneous resources in the deadlock graph.  This makes it important to find out how the waiters relate to one another so that you can determine which of the resources is an essential part of the circular blocking chain.   The post http://blogs.msdn.com/bartd/archive/2006/09/09/Deadlock-Troubleshooting_2C00_-Part-1.aspx steps you through a deconstruction of -T1222 output so that you can get a clearer understanding of the relationships.  As a rule of thumb, if there are any lock resources in your deadlock output (pagelock, keylock, rowlock) along with the exchangeEvent resources, you should suspect that the exchangeEvent resources are non-essential and that you are probably facing a “normal” deadlock. 

 

 

If you're interested in more background info on parallel query execution, there's a great presentation by Craig Freedman attached to this blog post: http://blogs.msdn.com/craigfr/archive/2007/04/17/parallel-query-execution-presentation.aspx

 

Policy Based Management (PBM) is a new feature in SQL Server 2008 that allows you to define a set of policies that capture the intended state for a group of servers.  For example, you could define a policy that says that your user databases should all have the auto update statistics database option enabled.  (If you’re not yet familiar with PBM, you can read more about it in Books Online or in the PBM blog.)  

In SQL 2008, the focus of PBM is primarily on static aspects of server management – policies covering things like schema requirements, or server and database configuration settings.  However, there are certain more dynamic aspects of server state that are equally important, but much harder to monitor.  Server “health” monitoring (e.g. uptime, responsiveness) is one example.  I’m going to show you how you can use the ExecuteSql function to extend the normal capabilities of PBM by defining policies that ensure your servers are servicing queries effectively.  You can use live Dynamic Management View (DMV) queries, or even query historical data that you are capturing in a Management Data Warehouse. 

IMPORTANT: Before going on, read through this blog post for an overview of the PBM ExecuteSql function. 

Suppose you wanted to define a policy like this one:

The average disk response time for all data and log files that have a non-trivial number of I/Os should not exceed 100ms. 

You can use a query like this one to find files that violate this policy:

SELECT

    CASE

        WHEN MAX (avg_ms_per_io) > 100 THEN

            'Excessive disk response time ('

            + CONVERT (varchar, MAX (avg_ms_per_io))

            + 'ms) for file ID ' + CONVERT (varchar, MAX (file_id))

            + ', database ' + MAX (database_name)

        ELSE ''

    END AS policy_violation_message

FROM

(   

    SELECT TOP 1

        io_stall / (num_of_reads + num_of_writes) AS avg_ms_per_io,

        io_stall AS io_stall_ms,

        (num_of_reads + num_of_writes) AS num_io,

        DB_NAME (database_id) AS database_name,

        file_id

    -- Only check I/O stats for the current database

    FROM sys.dm_io_virtual_file_stats (DB_ID(), null)

    WHERE

        -- Ignore idle databases and files; we do not care about disk

        -- wait time if the file has barely seen any I/O.

        (num_of_reads + num_of_writes) > 500

    -- PBM will use the first column of the first row to evaluate policy

    -- compliance, so return the file with the longest avg I/O time. 

    ORDER BY 1 DESC

) AS slowest_file;

 

We’re going to wrap this query in PBM’s ExecuteSql() function, which will allow us to reference it within a policy condition.  To permit PBM to bubble up more actionable info to the user (other than the simple fact that a database is out of policy), I’ve written the query to provide a string-type policy compliance indicator rather than a simple numeric 0/1 value.  If the query detects that a database is in policy, it returns an empty string for the [policy_violation_message] column.  But if disk response time is found to be out of policy, the query returns a message identifying the offending file and its average disk response time.  This message will be logged with the policy results, which can help during postmortem diagnosis. 

1.     Create a new policy (in Management Studio under Management\Policies, right-click the Policies folder and select New Policy).  Name the policy “Disk Health”. 

2.     In the Check Condition drop-down listbox, select New condition

3.     Name the condition “Disk Response Time is Healthy”, and set the Facet drop-down listbox to Database. 

4.     Click the “…” button next to the Field textbox.  In the Advanced Edit dialog, enter the text shown below, then click OK (this is the same query as above, except that single quotes have been escaped by doubling them). 

ExecuteSql('String', '

DECLARE @max_allowed_ms_per_io int;

SET @max_allowed_ms_per_io = 100;

 

SELECT

    CASE

        WHEN MAX (avg_ms_per_io) > @max_allowed_ms_per_io THEN

            ''Excessive disk response time (''

            + CONVERT (varchar, MAX (avg_ms_per_io))

            + ''ms) for file ID '' + CONVERT (varchar, MAX (file_id))

            + '', database '' + MAX (database_name)

        ELSE ''''

    END AS policy_violation_message

FROM

(   

    SELECT TOP 1

        io_stall / (num_of_reads + num_of_writes) AS avg_ms_per_io,

        io_stall AS io_stall_ms,

        (num_of_reads + num_of_writes) AS num_io,

        DB_NAME (database_id) AS database_name,

        file_id

    -- Only check I/O stats for the current database

    FROM sys.dm_io_virtual_file_stats (DB_ID(), null)

    WHERE

        -- Ignore idle databases and files; we do not care about disk

        -- wait time if the file has barely seen any I/O.

        (num_of_reads + num_of_writes) > 500

    -- PBM will use the first column of the first row to evaluate policy

    -- compliance, so return the file with the longest avg I/O time. 

    ORDER BY 1 DESC

) AS slowest_file;

')

 

5.     Set the Operator to equals (=) and the Value to '' (Note: that’s two single quotes, not a double quote). 

Your new policy should look like this.  If it does, click OK twice to create the policy.

 

Now right-click on your policy and select Evaluate.  Click the Evaluate button in the lower-right portion of the Evaluate Policies dialog, and acknowledge that you realize that the policy includes a custom script (see the blog post I referenced earlier for more information about the security implications of this approach).  This will run the policy against each database on your instance, and will flag any that are suffering from poor disk response time.  In the screenshot below, you can see that my [mdw] and [testdatabase] databases are in policy, but my [AdventureWorks] database is out of policy: the average I/O response time for file 1 in this database is an embarrassingly slow 122ms.

Because ExecuteSql takes an arbitrary query, you can define a policy that monitors literally any aspect of server health that you can write a query to detect.  Below, for example, is a query that will trigger a policy violation if the value of the “SQLServer:Buffer Manager\Page Life Expectancy” perfmon counter drops too low:

ExecuteSql('String', '

SELECT

    CASE

        WHEN cntr_value < 300 THEN

            ''Page life expectancy (sec): ''

            + CONVERT (varchar, cntr_value)

        ELSE ''''

    END AS policy_violation_message

FROM sys.dm_os_performance_counters

WHERE counter_name = ''Page life expectancy'' AND object_name LIKE ''%Buffer Manager%'';

')

 

 

SQL Health Monitoring Policies and Management Data Warehouse

If you’ve spent much time trying to use SQL DMVs for server health monitoring in the past, you’ve probably run into situations where you needed to compare the live DMV data to an historical snapshot of the same data.  For example, consider the “slow I/O” query at the top of this post.  The I/O counters and disk “stall” time reported in sys.dm_os_virtual_file_stats are cumulative totals since the SQL instance was started.  If the server had been running a long time, these values would be very large, and any disk performance problem would have to exist for a long time before the overall average I/O response time grew high enough to trigger the threshold.  What you typically want in a case like this is to assess recent I/O performance, where “recent” means the average within the last several minutes or hours.  Here's a more refined version of the policy we started with, that would make this type of monitoring more useful in the real world:

The average disk response time for all data and log files that have a non-trivial number of I/Os should not exceed 100ms (where "average disk response time" is defined as the average I/O wait time for all reads or writes to a file within the last 30 minutes)

This requires historical snapshots of the same data to compare the most recent data to.  Luckily, this happens to be exactly what the Data Collector and Management Data Warehouse (MDW) features in SQL 2008 were intended to provide. 

Being able to take advantage of historical data in your health monitoring policies has a number of advantages:

·          For cumulative-since-server-startup DMV data, the policy will flag problems with less delay

·          Similarly, a monitored object can go back “in policy” relatively quickly once the problem has been addressed

·          For metrics that are instantaneous measurements, you get fewer false positive policy violations if you can average a set of recent measurements instead of focusing on a single instantaneous data point that might be an atypical value

The query below pulls recent disk stats from a local MDW database (this DMV’s data is collected by the built-in Server Activity collection set, so you don’t need to create a custom collection set).  This assumes that (a) the monitored SQL instance is hosting its own MDW database, and (b) the MDW database’s name is “MDW”.  If your MDW database is local but is named something other than "MDW", update the query text to reference the correct MDW database.  (If you need a walkthough showing how to set up MDW, see this page.)

ExecuteSql('String', '

DECLARE @max_allowed_ms_per_io int;

DECLARE @time_window_min int;

SET @max_allowed_ms_per_io = 100;

SET @time_window_min = 120;

 

SELECT

    CASE

        WHEN MAX (recent_avg_ms_per_io) > @max_allowed_ms_per_io THEN

            ''Excessive disk response time (''

            + CONVERT (varchar, MAX (recent_avg_ms_per_io))

            + ''ms) for file ID '' + CONVERT (varchar, MAX (file_id))

            + '', database '' + MAX (database_name)

        ELSE ''''

    END AS policy_violation_message

FROM

(   

    SELECT TOP 1

        (recent_io_stall_read_ms + recent_io_stall_write_ms)

            / (recent_read_count + recent_write_count + 1)

            AS recent_avg_ms_per_io,

        (recent_io_stall_read_ms + recent_io_stall_write_ms)

            AS recent_io_stall_ms,

        (recent_read_count + recent_write_count) AS recent_io_count,

        database_name,

        file_id

    FROM

    (

        SELECT

            database_name,

            file_id,

            MAX (io_stall_read_ms) - MIN (io_stall_read_ms)

                AS recent_io_stall_read_ms,

            MAX (io_stall_write_ms) - MIN (io_stall_write_ms)

                AS recent_io_stall_write_ms,

            MAX (num_of_reads) - MIN (num_of_reads) AS recent_read_count,

            MAX (num_of_writes) - MIN (num_of_writes) AS recent_write_count

        FROM mdw.snapshots.io_virtual_file_stats AS fs

        INNER JOIN mdw.core.snapshots AS snap

            ON fs.snapshot_id = snap.snapshot_id

        WHERE

            -- MDW data for the local instance only

            snap.instance_name = @@SERVERNAME

            -- I/O stats only for the current database

            AND fs.database_id = DB_ID()

            -- within the last 5 minutes

            AND fs.collection_time >

                DATEADD (minute, @time_window_min, SYSDATETIMEOFFSET())

        GROUP BY database_name, file_id

    ) AS recent_io_stats

    WHERE

        -- Ignore idle databases and files; we do not care about disk

        -- wait time if the file has barely seen any I/O.

        (recent_read_count + recent_write_count) > 500

    -- PBM will use the first column of the first row to evaluate policy

    -- compliance, so return the file with the longest avg I/O time. 

    ORDER BY 1 DESC

) AS slowest_file;

')

 

I think this approach to server health monitoring could be useful, but it does have some limitations that are worth noting:

·          This release of SQL doesn’t provide out-of-the-box policies of this sort.  You have to write custom out-of-policy detection queries for any complex server health condition you want to monitor. 

·          Complex policies may have parameters that you might want to tweak on a per-server basis (in this example, the max allowed disk response time is one such parameter).  Unfortunately, there is no facility for this in PBM in SQL 2008.  If you really need this, you could create your own custom policy configuration table on each server with thresholds particular to that server, and join to this table in your custom condition query.

·          When you use ExecuteSql, you can only choose “=” or “!=” for your operators.  Because you can’t use inequality operators, you must push your policy violation thresholds down into the query itself. 

If you’ve tried this approach to health monitoring and have experiences you can share, good or bad, let me know.

 

In versions of SQL Server before SQL Server 2008, it can be difficult to determine the cumulative cost of the queries running on a server if the workload includes unparameterized queries.  The only truly reliable method is to capture a Profiler trace during a representative time period, then post-process the trace with a utility that strips out any inline literal values from the query text.   A number of utilities have sprung up that use this general approach: ReadTrace, SQL Nexus/TraceBuster, and Bill Graziano’s ClearTrace utility are three that I know of.  Trace-based query cost analysis is effective, but there are a number of big problems with this approach: capturing a batch- or statement-level profiler trace is expensive and sometimes can slow down your server, a trace can grow up to several GB per minute on servers that have a high transaction rate, and the capture and analysis of the data tends to be a time-consuming and labor-intensive process that is difficult to automate.

In SQL Server 2008, the SQL Server database engine has a powerful new feature that generates an identifier for each query.  The identifier is independent of any inline parameter values, so it serves as a very effective query identifier.  This identifier – sometimes called a “query fingerprint” – enables a fairly robust method of identifying the most expensive queries on your server based on nothing but DMV queries.  I think the feature will eventually form the basis of a query cost analysis approach that requires a much smaller investment of DBA time, has greatly reduced risk, and scales to higher-volume workloads than a traditional trace-based analysis.  

This is my personal favorite “sleeper feature” in SQL 2008; the query optimizer team deserves kudos for getting it done in time for the release, and for making it inexpensive enough that fingerprint generation can be on-by-default.  It was a very late addition to the release, which I think is the main reason that fingerprints have generated relatively little buzz so far.  Plan fingerprints are used to calculate query plan cost in the new Activity Monitor tool that you can launch by right-clicking a server in SQL Server 2008's Management Studio. 

To fully appreciate query fingerprints’ value, you first must understand one of the limitations of the sys.dm_exec_query_stats Dynamic Management View (DMV) that was introduced in SQL Server 2005.  This DMV lists every statement query plan that is in procedure cache at that moment, along with execution stats for the query plan such as the number of executions of the plan, total CPU cost, physical and logical reads, and so on.  It was a groundbreaking addition to SQL Server because, for the first time, it allowed a DBA to examine the cost of the queries in a workload without capturing a Profiler trace.  For certain workloads, the original sys.dm_exec_query_stats DMV provides a simple and powerful way to identify the most expensive queries: you capture two snapshots of the DMV, then you join the second snapshot back to the first to calculate the execution cost of each query plan in the time between the two snapshots.  But the DMV also has some limitations that constrain its usefulness.  In particular, if a query is not explicitly or implicitly parameterized and if the query text contains inline literal values, that query plan will not be reused. Every execution of the query with a different set of parameter values will generate a new compiled plan object.  You can see this by running these queries:

-- Run "the same" query twice, but with a different inline parameter

-- value for each execution

GO

SELECT type FROM sys.objects WHERE name = 'sysfiles1'

GO

SELECT type FROM sys.objects WHERE name = 'sysprivs'

GO

 

-- Find the query execution statistics row(s) for the query

SELECT sql_handle, plan_handle, execution_count

FROM sys.dm_exec_query_stats AS qs

CROSS APPLY sys.dm_exec_sql_text (qs.plan_handle) AS sql

WHERE sql.text LIKE 'SELECT type FROM sys.objects WHERE name %'

GO

Here's the output:

The two queries that look like “SELECT type FROM sys.objects WHERE name = 'x'” are identical except for a different inline parameter value, but each query’s statistics are tracked in different rows in sys.dm_exec_query_stats.  This happens because the query is not explicitly parameterized, and the different inline literal values cause SQL Server to compile a separate query plan each time the query is executed with a different parameter value.  This behavior makes it difficult to use sys.dm_exec_query_stats to identify the true cumulative cost of a query.  Suppose that this query was executed frequently enough to be the most expensive query on your server, but suppose that most executions had a different search parameter value.  With the query’s cumulative cost spread out across thousands of rows in sys.dm_exec_query_stats, how would you recognize that each of those thousands of rows actually represents a fraction of the total execution cost of a single query?  This is a key problem that the query fingerprint feature in SQL Server 2008 helps to address.  To see how it works, add these queries to your query window, and then re-execute the entire script:

-- Execute a second query that is not quite the same shape as the prior query

GO

SELECT type FROM sys.objects WHERE name = 'sysprivs' AND create_date < GETDATE()

GO

 

-- Find the query execution statistics row(s) for the query

SELECT sql_handle, plan_handle, execution_count, query_hash, query_plan_hash  

FROM sys.dm_exec_query_stats AS qs

CROSS APPLY sys.dm_exec_sql_text (qs.plan_handle) AS sql

WHERE sql.text LIKE 'SELECT type FROM sys.objects WHERE name %'

GO

When you re-run the script, you should see output like the following:

Note the query_hash and query_plan_hash columns that are new to sys.dm_exec_query_stats in SQL Server 2008.  This is one of two places where the new SQL Server 2008 query fingerprint and query plan fingerprint features are exposed (the other is the sys.dm_exec_requests DMV).  Note that the two queries we ran first have the same query_hash value, which indicates that if you were to strip out any inline literals, the two queries have the same “shape”.  They also have the same query_plan_hash value, which means that even though they each had their own compiled plan object in the procedure cache, those two query plans also had the same general form.  However, the second query doesn’t have quite the same “shape” as the first two queries; it includes an additional filter predicate (“create_date < GETDATE()”), and therefore has different query_hash and query_plan_hash values.  Using these two new columns, we can combine the costs of all plans that have the same shape and calculate the true cumulative cost of all executions of a query.  This approach generally works well even when faced with an unparameterized workload that features poor plan reuse. 

Below I've provided some more detailed information about query fingerprints and query plan fingerprints.  This information should be accurate as of the initial release of SQL Server 2008, but it is possible that some of the implementation details will change in subsequent releases.  In order to protect the ability to improve the feature, we’re not providing an iron-clad guarantee that a query’s fingerprint won’t change across SQL versions, although we recognize the value of an identifier that remains constant across releases and plan to avoid changes to the fingerprint calculation algorithms unless there’s a very good reason to change them.

Query Fingerprints

A query fingerprint is also called a “query hash”.  The queries below are all similar enough to have the same query fingerprint:

SELECT * FROM foo WHERE column1 = 'A' AND column2 = 'B'

SELECT * FROM foo WHERE column1 = 'X' AND column2 = 'Y'

SELECT * FROM foo WHERE column1 = 100 AND column2 = 200

SELECT * FROM foo WHERE foo.column1 = 'X' AND foo.column2 = 'Y'

On the other hand, all of the queries below are have differences that are significant enough to produce different query fingerprints:

SELECT * FROM foo WHERE column1 = 'X'

SELECT * FROM foo AS foo2 WHERE column1 = 'X' AND column2 = 'Y'

SELECT * FROM foo WHERE column1 = 'X' AND column2 IS NULL

SELECT * FROM foo WHERE column1 = 'X' OR column2 = 'Y'

SELECT column1 FROM foo WHERE column1 = 'X' AND column2 > 'Y'

Here are some facts that help clarify how query fingerprints are generated:

  • Query fingerprints are generated from a tree of logical operators that is used as an input to the query optimizer.  Because this tree is created prior to query optimization, a query fingerprint is influenced only by the query’s text, not by the query plan.  In other words, two queries may have the same query fingerprint but use two very different query plans. 
  • Similarly, because the query fingerprint is generated from an operator tree *after* parsing, two queries don't have to have the exact same text to have the same fingerprint.  Whitespace and comments don't matter, and the queries can even have some small, semantically-irrelevant differences. 
  • Query fingerprints are not affected by the current database context or by the current instance name.  This means that the same query run within two different databases, or even two different SQL Server instances, will generally have the same query fingerprint value.  
  • The table, view, and function names referenced by two queries must be identical for the queries to have the same fingerprint.  This includes table aliases; if two queries are identical except that one of them uses a table alias and the other query refers to the table using its actual name, the two queries will not have the same fingerprint. 
  • If a query has any table or query hints (including any hints applied via a plan guide), the hints must be identical in order to generate the same query fingerprint. 
  • SET options may influence query fingerprints if they change the query semantics.  For example, the logical meaning of a predicate like “column1 = NULL” is influenced by the current ANSI_NULLS setting.  If the query includes a predicate like this, then changing the ANSI_NULLS setting may generate a different query fingerprint.

Plan Fingerprints

A query plan fingerprint may also be referred to as a “plan hash”.  Plan fingerprints are generated from the tree of physical operators that makes up a compiled query execution plan.  Generally speaking, if a user would consider two plans to be different, they will have different plan fingerprint values. 

In order to have the same plan fingerprint value, the trees of operators that make up the plans must have the same shape.  For each physical operator in one plan, the corresponding node in the other plan must be the same physical operator.  For example, if two plans have the same general shape but one plan includes a Hash Join operator where the second plan uses a Loop Join, the two plans will not have the same plan fingerprint. 

Certain operator attributes must also be identical in order to generate a matching plan fingerprint.  For example, two Table Scan operator must reference the same table name, or they will not match.  However, not every attribute of an operator is included in the query plan hash value.  For example, the specific number of rows that is estimated to be returned from a Table Scan operator does not influence the plan fingerprint.   Two cached plans may have slightly different estimated row counts, yet have the exact same shape and the same execution characteristics, and therefore the same plan fingerprint. 

Like query fingerprints, a plan fingerprint is not affected by database context or SQL Server instance name, so if similarly-shaped query plans are used in two different databases, the plans will receive the same plan fingerprint.  Plan fingerprints are sensitive to object names, but an exception exists for automatically-generated primary key and unique key constraints.  (Unfortunately, due to an issue in the initial release of SQL Server 2008, a query that references a local temporary table will generate different plan and query fingerprints for each execution.  Hopefully, this will be addressed in a subsequent Cumulative Update or Service Pack release.) 

Finally, in the initial implementation of the plan fingerprint feature, a statement’s query fingerprint value is included in the plan hash.  In other words, two different queries with different query fingerprints may result in the “same” plan, but the plan fingerprints will be different simply because the query fingerprints are different. 

Limitations

Here are some of the more important restrictions that you should be aware of:

  • In SQL 2008, the query and plan fingerprints are still tied to the sys.dm_exec_query_stats DMV, which means that the aggregate stats for a fingerprint are associated with cached query plans.  Query plans have a transient lifetime, and may be removed at any time in response to internal or external memory pressure on the procedure cache.  Any query statistics that are inserted and removed in the interval between two queries against the DMV will not be reflected in your query cost estimates (you can partially compensate for this by querying the DMV at a more frequent interval).  Also, certain types of query plan are never inserted into the procedure cache (one example is the plan for a CREATE INDEX).  Execution statistics for these types of queries may be undercounted. 
  • The sys.dm_exec_query_stats DMV only shows statistics for completed query executions.  In-progress, long-running queries will not show up in the DMV until they finish running.  You can merge the stats from this DMV with the the sys.dm_exec_requests DMV (which also exposes the new fingerprint columns) in order to get a more complete view.
  • Using fingerprints to determine cumulative query cost relies on using the new query_hash and query_plan_hash columns as keys that uniquely identify a particular query’s or plan’s “shape”.  It is possible, though unlikely, that two different queries may end up with the same hash value, causing statistics for both of the queries to be charged to one of them.  It is also possible that two variations of the “same” query may be assigned different fingerprint hash values, in which case the cost of the query may appear to be spread out over several buckets, making it difficult to recognize the query’s true cumulative cost.  At the time of the initial release of SQL Server 2008, we’ve only encountered one instance of this: different executions of a query that references a temp table may be assigned different query_hash values.  This problem with temp tables isn’t intentional; it’s a bug, and I’m hopeful that it will be fixed in an upcoming Service Pack or Cumulative Update. 

Future Plans

I already mentioned the new SQL 2008 Activity Monitor, which uses plan fingerprints to generate more accurate estimates of query cost than would otherwise be possible.  Looking forward, we have plans to publish a new custom collection set based on the SQL Server 2008 Data Collector that uses plan fingerprints to provide a Top N Query identification tool.  You will be able to use this new “Query Hash Statistics” collection set as-is, or customize it to meet your needs.  The documentation that accompanies this collection set will include a subset of the info in this post, so if you’ve read to this point, you’ll be able to skim over some of the collection set docs.

I suspect that it won’t be terribly long before your favorite SQL monitoring tool starts to take advantage of this feature, but if you’re of the roll-your-own bent, you can start creating your own custom query analysis scripts right now using the info provided here.  Below I've provided a query to get you started. It groups on query_plan_hash to calculate query statistics for all plans with a given plan fingerprint, and pulls the statement text from a representative cached plan object that has a given fingerprint. 

SELECT TOP 100

    query_hash, query_plan_hash,

    cached_plan_object_count,

    execution_count,

    total_cpu_time_ms, total_elapsed_time_ms,

    total_logical_reads, total_logical_writes, total_physical_reads,

    sample_database_name, sample_object_name,

    sample_statement_text

FROM

(

    SELECT

        query_hash, query_plan_hash,

        COUNT (*) AS cached_plan_object_count,

        MAX (plan_handle) AS sample_plan_handle,

        SUM (execution_count) AS execution_count,

        SUM (total_worker_time)/1000 AS total_cpu_time_ms,

        SUM (total_elapsed_time)/1000 AS total_elapsed_time_ms,

        SUM (total_logical_reads) AS total_logical_reads,

        SUM (total_logical_writes) AS total_logical_writes,

        SUM (total_physical_reads) AS total_physical_reads

    FROM sys.dm_exec_query_stats

    GROUP BY query_hash, query_plan_hash

) AS plan_hash_stats

CROSS APPLY

(

    SELECT TOP 1

        qs.sql_handle AS sample_sql_handle,

        qs.statement_start_offset AS sample_statement_start_offset,

        qs.statement_end_offset AS sample_statement_end_offset,

        CASE

            WHEN [database_id].value = 32768 THEN 'ResourceDb'

            ELSE DB_NAME (CONVERT (int, [database_id].value))

        END AS sample_database_name,

        OBJECT_NAME (CONVERT (int, [object_id].value), CONVERT (int, [database_id].value)) AS sample_object_name,

        SUBSTRING (

            sql.[text],

            (qs.statement_start_offset/2) + 1,

            (

                (

                    CASE qs.statement_end_offset

                        WHEN -1 THEN DATALENGTH(sql.[text])

                        WHEN 0 THEN DATALENGTH(sql.[text])

                        ELSE qs.statement_end_offset

                    END

                    - qs.statement_start_offset

                )/2

            ) + 1

        ) AS sample_statement_text

    FROM sys.dm_exec_sql_text(plan_hash_stats.sample_plan_handle) AS sql 

    INNER JOIN sys.dm_exec_query_stats AS qs ON qs.plan_handle = plan_hash_stats.sample_plan_handle

    CROSS APPLY sys.dm_exec_plan_attributes (plan_hash_stats.sample_plan_handle) AS [object_id]

    CROSS APPLY sys.dm_exec_plan_attributes (plan_hash_stats.sample_plan_handle) AS [database_id]

    WHERE [object_id].attribute = 'objectid'

        AND [database_id].attribute = 'dbid'

) AS sample_query_text

ORDER BY total_cpu_time_ms DESC;

 

This isn't perf-related like most of my earlier posts, but I thought it was useful enough that I should share it.  We recently had a situation where we had to convert a hexadecimal string representation of a binary value to a true binary (e.g. varchar value '0x1234abcdef' --> varbinary 0x1234ABCDEF).  There's a built-in function in SQL (fn_varbintohexstr) to convert from varbinary to a hex-formatted varchar value, but nothing to convert in the opposite direction.  A smart guy on my team (Ahmed Ayad) came up with a clever solution.  There are other solutions floating around out there that do this same conversion, but I don't think I've ever run across one as nice and tidy as this.  It takes advantage of the fact that SQL's parser already knows how to convert a hex string representiation of a binary into a "real" binary. 

IF OBJECT_ID ('usp_hexstrtovarbin', 'P') IS NOT NULL DROP PROC usp_hexstrtovarbin
GO
CREATE PROC usp_hexstrtovarbin @hexstr varchar(3990), @bin varbinary (4000) OUTPUT AS
DECLARE @sql nvarchar(4000)
SET @sql = 'SET @bin=' + @hexstr
EXEC sp_executesql @sql, N'@bin varbinary(4000) OUTPUT', @bin OUTPUT
GO 

Usage is straightforward: just call the proc, passing it the hex-formatted string and an output param to receive the converted value.  For example:

DECLARE @hexstr varchar(max), @bin varbinary (4000)
SET @hexstr = '0x1234abcdef'

EXEC usp_hexstrtovarbin @hexstr, @bin OUTPUT

SELECT
@bin
GO

Unfortunately, SQL won't allow you to use sp_executesql within a user-defined function, so a disadvantage of this approach is that you can't move this into a scalar function for use in queries like "SELECT dbo.fn_hexstrtovarbin(mycolumn) FROM mytable". 

 

Did you know that your SQL Server is keeping track of the indexes that it thinks you should create?  The "missing index" DMVs in SQL are a really great new feature in SQL Server 2005 that (in my opinion) seem to have been underutilized so far.  If you want to see if this feature can spare you the tedium of an afternoon identifying poor performing queries and tuning them, all you have to do is ask:

SELECT

  migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) AS improvement_measure,

  'CREATE INDEX [missing_index_' + CONVERT (varchar, mig.index_group_handle) + '_' + CONVERT (varchar, mid.index_handle)

  + '_' + LEFT (PARSENAME(mid.statement, 1), 32) + ']'

  + ' ON ' + mid.statement

  + ' (' + ISNULL (mid.equality_columns,'')

    + CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END

    + ISNULL (mid.inequality_columns, '')

  + ')'

  + ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,

  migs.*, mid.database_id, mid.[object_id]

FROM sys.dm_db_missing_index_groups mig

INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle

INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle

WHERE migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) > 10

ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC

You'll want to run this after your server has been up and running a normal workload for a while.  If this returns no results, that's good news and indicates that you're not missing any indexes that are obvious enough for the DMV to detect.  If it does return some suggestions, even better: you just improved your server's perf with almost no work.

While to me this feature is so cool it almost seems magical, it does have a few limitations you should be aware of:

  • It's not as smart as the Database Engine Tuning Advisor.  If you have identified a query that you know is expensive and needs some help, don't pass up DTA just because the missing index DMVs didn't have any suggestions.  DTA might still be able to help. 
  • The missing index DMVs don't take into account the overhead that new indexes can create (extra disk space, slight impact on insert/delete perf, etc). DTA does take this into account, however.
  • The "improvement_measure" column in this query's output is a rough indicator of the (estimated) improvement that might be seen if the index was created.  This is a unitless number, and has meaning only relative the same number for other indexes.  (It's a combination of the avg_total_user_cost, avg_user_impact, user_seeks, and user_scans columns in sys.dm_db_missing_index_group_stats.)
  • The missing index DMVs don't make recommendation about whether a proposed index should be clustered or nonclustered.  This has workload-wide ramifications, while these DMVs focus only on the indexes that would benefit individual queries.  (DTA can do this, however.)
  • Won't recommend partitioning.
  • It's possible that the DMVs may not recommend the ideal column order for multi-column indexes.
  • The DMV tracks information on no more than 500 missing indexes.

If you're a typical SQL user, you may not be using these DMVs yet.  If you look around, though, there are a few places where they are in use. One is in the SP2 Performance Dashboard reports.  Another is the Perf Stats Script that SQL PSS uses.  And if you think the missing index DMVs are useful, check out this set of scripts that builds on the missing index DMVs to simulate an "auto create index" feature.  Also, you should be aware there is similar missing index info output in the new XML showplan format in SQL 2005.  If you are already focused on a poorly-performing query, I would start with the plan view of missing indexes (followed by DTA) rather than the DMVs. 

Here’s an example of the classic scenario that is usually used to introduce the concept of a deadlock in a database:

 

 

Process A

Process B

 

 

1. Begin Transaction

1. Begin Transaction

 

 

2. Update Part table

2. Update Supplier table

 

à

3. Update Supplier table

3. Update Part table

ß

 

4. Commit Transaction

4. Commit Transaction

 

 

If Process A and Process B each reached step #3 in their respective transactions at approximately the same time, it’s easy to see how they could end up blocking each other.  The most obvious solution to this deadlock is to change the order of the UPDATE statements in one of the transactions, so that lock resources are acquired in a consistent order. 

 

Instead of this overly simplistic deadlock, let’s take a closer look at the deadlock scenario demonstrated in Deadlock Troubleshooting, Part 2.  In that case, these two stored procedures ended up deadlocked:

 

       CREATE PROC p1 @p1 int AS

       SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1

       GO

 

       CREATE PROC p2 @p1 int AS

       UPDATE t1 SET c2 = c2+1 WHERE c1 = @p1

       UPDATE t1 SET c2 = c2-1 WHERE c1 = @p1

       GO

 

There’s no UPDATE, DELETE, or INSERT in the first proc; it consists of a single SELECT.  And even the two UPDATE statements in the second proc aren’t wrapped in an outer BEGIN TRAN/COMMIT TRAN.  Both UPDATEs ran within their own autocommit transaction, which means that only one of them could have been involved in the deadlock.  Clearly this doesn’t fit the stereotypical “modify A then modify B / modify B then modify A” deadlock model described above.  This isn’t an edge case, by the way. We actually see this type of deadlock – where one or both of the participants are in the middle a single-query, autocommit transaction – more often than easy-to-understand deadlock scenarios involving two multi-statement transactions that just modify two tables in a different order. 

 

So, what would you do if DTA hadn’t automagically recommended a new index that prevented this deadlock?  To craft your own solution by hand, you need a deeper understanding of the deadlock than we have at the moment. 

 

What caused this deadlock?

We’ll need to refer back to the deadlock summary that was distilled from the -T1222 output (see Deadlock Troubleshooting, Part 1 for a refresher on decoding -T1222):

 

            Spid X is running this query (line 2 of proc [p1], inputbuffer “… EXEC p1 4 …”):
                                SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1
                Spid Y is running this query (line 2 of proc [p2], inputbuffer “EXEC p2 4”):
                                UPDATE t1 SET c2 = c2+1 WHERE c1 = @p1
               
                The SELECT is waiting for a Shared KEY lock on index t1.cidx.  The UPDATE holds a conflicting X lock.
                The UPDATE is waiting for an eXclusive KEY lock on index t1.idx1.  The SELECT holds a conflicting S lock.

 

First, let’s look at the query plan for the SELECT query.  To view the plan, execute “SET STATISTICS PROFILE ON”, then run “EXEC p1 4”. 

 

   SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1

     |--Nested Loops(Inner Join, OUTER REFERENCES:([Uniq1002], [t1].[c1]))

          |--Index Seek(OBJECT:([t1].[idx1]), SEEK:([t1].[c2] >= [@p1] AND [t1].[c2] <= [@p1]+(1)) ORDERED FORWARD)

          |--Clustered Index Seek(OBJECT:([t1].[cidx]), SEEK:([t1].[c1]=[t1].[c1] AND [Uniq1002]=[Uniq1002]) LOOKUP ORDERED FORWARD)

 

A Nested Loop join executes its first child operator once, and executes the second child operator for each row returned by the first child (see this post for details).  In this case, the first child is a nonclustered Index Seek to find the rows “WHERE c2 BETWEEN @p1 AND @p1+1”.  For each qualifying row in the nonclustered index, a second seek is done on the clustered index to look up the whole data row.  This clustered index seek is necessary because the nonclustered index does not cover the query.  If you’re running SQL 2000, you’ll see a different-looking plan:

 

   SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1

     |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([t1]))

          |--Index Seek(OBJECT:([t1].[idx1]), SEEK:([t1].[c2] >= [@p1] AND [t1].[c2] <= [@p1]+1) ORDERED FORWARD)

 

For practical purposes, these two plans are identical.  The purpose of the Bookmark Lookup operator in the SQL 2000 plan is to visit the clustered index to retrieve the full set of columns for a row identified by a nonclustered index.  In SQL 2005 this same operation is expressed as a loop join between the nonclustered index and the clustered index.  For this deadlock, it’s simply important to note that both plans calls for a seek from the nonclustered index, then a seek from the clustered index.

 

Now let’s look at the UPDATE:

 

   UPDATE t1 SET c2 = c2+1 WHERE c1 = @p1

     |--Clustered Index Update(OBJECT:([t1].[cidx]), OBJECT:([t1].[idx1]), SET:([t1].[c2] = [Expr1004]))

          |--Compute Scalar(DEFINE:([Expr1013]=[Expr1013]))

               |--Compute Scalar(DEFINE:([Expr1004]=[t1].[c2]+(1), [Expr1013]=CASE WHEN CASE WHEN ...

                    |--Top(ROWCOUNT est 0)

                         |--Clustered Index Seek(OBJECT:([t1].[cidx]), SEEK:([t1].[c1]=[@p1]) ORDERED FORWARD)

 

The UPDATE has a fairly simple query plan.  The two most significant operators are the first and the last one.  The Clustered Index Seek locates the rows that quality for the “WHERE c1 = @p1” predicate.  Once a qualifying row has been found, the Clustered Index Update operator acquires an eXclusive key lock on the clustered index and modifies the row. 

 

We now have a full understanding of how the UPDATE blocks the SELECT: the UPDATE acquires an X lock on a clustered index key, and that lock blocks the SELECT’s bookmark lookup on the clustered index.  But the other half of the deadlock – the reason that the SELECT blocks the UPDATE – isn’t quite so obvious.  The -T1222 told us “The UPDATE is waiting for an eXclusive KEY lock on index t1.idx1.  The SELECT holds a conflicting S lock.  It’s not very apparent from the plan, but the UPDATE needs an X lock on the nonclustered index [idx1] because the column it is updating ([c2]) is one of the non-clustered index’s key columns.  Any change to an index key column means that a row in the index must be relocated, and that relocation requires an X lock. 

 

This is a key point to remember when trying to understand many deadlocks: the access path to find the qualifying rows is important, but index updates implied by the columns being modified can be just as important.  To make things more confusing, sometimes you’ll see explicit “Index Update” or “Index Delete” operators in the plan for each nonclustered index that needs to be updated, while other times these don’t show up in the plan.  (For more info on this check out Wide vs. Narrow Plans.) 

 

To summarize: the SELECT used the nonclustered index to find a qualifying row.  While holding a Shared lock on the nonclustered index, it needs to jump over to the clustered index and retrieve some columns that aren’t part of the nonclustered index.  While it’s doing this, the UPDATE is busy doing a seek on the clustered index.  It finds a row, locks it and modifies it.  But because one of the columns being modified is a key column in the nonclustered index, it then has to move to the nonclustered index and update that index, too.  This requires a second X key lock on the nonclustered index.  So, the SELECT ends up blocked waiting for the UPDATE to release his X lock on the clustered index, while the UPDATE winds up blocked and waiting for the SELECT to release his S lock on the nonclustered index. 

 

Hopefully it’s clear that even though each participant in this deadlock is just a single query, this is still a problem caused by out-of-order resource access patterns.  The SELECT statement locks a key in the nonclustered index, then locks a key in the clustered index.  The problem is that the UPDATE needs to lock the same two resources, but because of its query plan, it tries to lock them in the opposite order.  In a sense, it’s really the same problem as the simple deadlock scenario described at the beginning of this post. 

 

The locks acquired by a query aren’t acquired all at once.  A query plan is like a little program.  It wouldn’t be terribly inaccurate, for example, to think of a nested loop join as a FOR loop.  Each iteration of the loop acquires a key lock on the outer table, then holds that lock while looking up (and locking) matching rows in the inner table.  Deadlocks like this one are a little harder to figure out because the order of resource access within a single query depends on the query plan, and can’t be determined just by looking at the T-SQL. 

 

How did DTA’s new index avoid the deadlock? 

Here’s an index that will prevent this deadlock:

            CREATE INDEX idx2 ON t1 (c2, c3)

 

This index “covers” the query “SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1”, which is just another way of saying that the index includes all of the columns referenced by the query.  SQL will use this index instead of the [idx1] index because the plan based on the covering index is cheaper.  The fact that the index covers the query means that the bookmark lookup against the clustered index is no longer necessary.  And since the SELECT no longer needs to access the clustered index, it won’t get blocked by the UPDATE’s lock on the clustered index.  

 

What other solutions are available?

All deadlocks boil down to out-of-order resource access patterns.  In the simple deadlock scenario described at the beginning of this post, the solution is obvious: just reverse the two UPDATE statements in one of the transactions, and you won’t end up deadlocked.  But in the more complex scenario that we just explored, it’s not so clear how to change the order in which locks are acquired.  Each deadlock participant is running a single-query, autocommit transaction, so you can’t just swap the order of two queries to acquire resources in a different order.  SQL is a language designed to express high-level set operations; the specifics of how the database should go about retrieving and updating the specified set of data is generally left up to the SQL engine, with good reason.  However, you do have some options for either influencing which lock resources a query needs, or modifying the order in which it acquires the locks.  Below are six different possible solutions to this deadlock.  Some of these are not ideal for this particular deadlock, but they are still worth exploring since the approach to deadlock avoidance that they illustrate may be the best possible solution for some other deadlock you encounter.

 

  • The new index is arguably the simplest and most elegant solution.  This deadlock occurs because two queries take different paths to the same resource.  The new index avoids the deadlock by eliminating any need for the SELECT to access the row in the clustered index.  As a happy side effect, it also speeds up the SELECT query.  
           CREATE INDEX idx2 ON t1 (c2, c3)
  • If you’re running SQL 2005, you could use the new SNAPSHOT or READ_COMMITTED_SNAPSHOT isolation levels.  
           ALTER DATABASE deadlocktest SET READ_COMMITTED_SNAPSHOT ON
  • Adding a NOLOCK hint to the SELECT will avoid the deadlock, but be cautious of this solution -- dirty reads can cause runtime errors and will expose you to uncommitted data.
           ALTER PROC p1 @p1 int AS
           SELECT c2, c3 FROM t1 WITH (NOLOCK) WHERE c2 BETWEEN @p1 AND @p1+1
  • As mentioned above, this deadlock occurs because two queries take different paths to the same resource.  By forcing one of the queries to use the same index as the other query, you can prevent the deadlock.  However, SQL chose query plans that used two different indexes because those were the most efficient plans available for the two queries.  By forcing a different index path, you are actually slowing down one of the queries.  This may be OK since it does avoid the deadlock, but you should test to make sure the cost is acceptable.  
           ALTER PROC p1 @p1 int AS
           SELECT c2, c3 FROM t1 WITH (INDEX=cidx) WHERE c2 BETWEEN @p1 AND @p1+1
    If this query was coming from an application as an ad hoc query (not part of a stored proc), you could either modify the app to specify the index hint or use a plan guide with OPTION (USE PLAN...) if modifying the app wasn't possible.  Plan guides are available in SQL 2005 and later.
  • One way to look at this deadlock is as a problem that arises because there’s an index on a frequently-updated column.  Dropping the nonclustered index [idx1] will avoid the deadlock by (a) depriving the SELECT of its alternate access path to the row, and (b) preventing the UPDATE from having to update the nonclustered index row when it updates the [c2] column.  Like the prior solution, however, this will slow down the SELECT and any other queries that use this index.  
           DROP INDEX t1.idx1
  • You could force one of the transactions to block at an earlier point, before it has had an opportunity to acquire the lock that ends up blocking the other transaction.  In the example below, the SELECT proc has been modified to run a new query that acquires and holds a lock on the clustered index before it accesses the nonclustered index.  In effect, this changes the order of resource access from (nonclustered, clustered) to (clustered, nonclustered).  Since that’s the same order that the UPDATE uses, the deadlock is no longer an issue.  
           ALTER PROC p1 @p1 int AS
           BEGIN TRAN
             DECLARE @x int
             SELECT @x = COUNT(*) FROM t1 WITH (HOLDLOCK, UPDLOCK) WHERE c1 = @p1
             SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1
           COMMIT TRAN

 

If you can think of any other solutions, please share them in a comment.  

In this post I’ll look at an actual deadlock, then troubleshoot it using the steps I described in Deadlock Troubleshooting, Part 1 so you can see them in action.  This is a simplified version of a deadlock scenario that an internal customer here at Microsoft called us for help with.  To set up the scenario, run this:

 

       -- Batch #1

       CREATE DATABASE deadlocktest

       GO

       USE deadlocktest

       SET NOCOUNT ON

       DBCC TRACEON (1222, -1)

       GO

       IF OBJECT_ID ('t1') IS NOT NULL DROP TABLE t1

       IF OBJECT_ID ('p1') IS NOT NULL DROP PROC p1

       IF OBJECT_ID ('p2') IS NOT NULL DROP PROC p2

       GO

       CREATE TABLE t1 (c1 int, c2 int, c3 int, c4 char(5000))

       GO

       DECLARE @x int

       SET @x = 1

       WHILE (@x <= 1000) BEGIN

         INSERT INTO t1 VALUES (@x*2, @x*2, @x*2, @x*2)

         SET @x = @x + 1

       END

       GO

       CREATE CLUSTERED INDEX cidx ON t1 (c1)

       CREATE NONCLUSTERED INDEX idx1 ON t1 (c2)

       GO

       CREATE PROC p1 @p1 int AS

       SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1

       GO

       CREATE PROC p2 @p1 int AS

       UPDATE t1 SET c2 = c2+1 WHERE c1 = @p1

       UPDATE t1 SET c2 = c2-1 WHERE c1 = @p1

       GO

 

Now, run this from another connection:

       -- Batch #2

       USE deadlocktest

       SET NOCOUNT ON

       WHILE (1=1)

         EXEC p2 4

       GO

 

Finally, leave that one running while you run this from a third connection:

       -- Batch #3

       USE deadlocktest

       SET NOCOUNT ON

       CREATE TABLE #t1 (c2 int, c3 int)

       GO

       WHILE (1=1) BEGIN

         INSERT INTO #t1 EXEC p1 4

         TRUNCATE TABLE #t1

       END

       GO

 

This will cause a deadlock; you should see one of the batches aborted by a 1205 error.  Now that we have a reproducible deadlock, I’ll follow the troubleshooting steps that I posted in Deadlock Troubleshooting, Part 1.  

 

  1. Turn on trace flag 1222.  The setup script already turned this on for you as a global flag (the “-1” in the dbcc traceon command is critical). 
  2. Get the -T1222 output.  Look at your errorlog now and you should see the trace flag 1222 output describing the deadlock.  
  3. Decode the -T1222 output.  Read through Deadlock Troubleshooting, Part 1 again if you need more information about how to interpret -T1222 or -T1204 output.  Here’s what you should end up with after sifting through the -T1222 details and extracting the most important tidbits:

               
    Spid X is running this query (line 2 of proc [p1], inputbuffer “… EXEC p1 4 …”):
                                    SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1
                    Spid Y is running this query (line 2 of proc [p2], inputbuffer “EXEC p2 4”):
                                    UPDATE t1 SET c2 = c2+1 WHERE c1 = @p1
                   
                    Spid X is waiting for a Shared KEY lock on index t1.cidx.  Spid Y holds a conflicting X lock.
                    Spid Y is waiting for an eXclusive KEY lock on index t1.idx1.  Spid X holds a conflicting S lock.

  4. Run the queries through Database Tuning Advisor.  The -T1222 output tell us what inputbuffer we were running at the time of the deadlock (“EXEC p1 4” and “EXEC p2 4”).  Tune each of these queries in DTA using the steps I discussed in Part 1.  DTA will recommend a new index for Batch 3.  Create the index by selecting "Apply Recommendations" from the Action drop-down menu.

 

At this point, if you re-run Batch 2 and Batch 3, you’ll find that the deadlock has been solved.  You didn’t even have to use steps 5-8 or the list of other deadlock avoidance strategies that I listed in Part 1 of this series of posts.  

 

In a subsequent post I'll look at the details of the query plans involved in this particular deadlock to understand what caused the deadlock and why DTA's proposed index fixed it. 

 

(This post series is continued in Deadlock Troubleshooting, Part 3.)

A deadlock is a circular blocking chain, where two or more threads are each blocked by the other so that no one can proceed.  When the deadlock monitor thread in SQL Server detects a circular blocking chain, it selects one of the participants as a victim, cancels that spid’s current batch, and rolls backs his transaction in order to let the other spids continue with their work.  The deadlock victim will get a 1205 error:

 

Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

 

A deadlock is a special type of blocking scenario, but blocking and deadlocking are not the same thing.  Sometimes we have people report that they are experiencing "deadlocking" when they are really only seeing blocking.

 

With very few exceptions, deadlocks are a natural side effect of blocking, not a SQL Server bug.  The typical deadlock solution is either a stored proc/app code tweak, or a schema/indexing change. 

 

Here’s how to troubleshoot deadlocks.  These steps apply to most deadlocks, and they’ll allow you to resolve many of them without even having to dig into query plans or other nitty gritty details.  What’s that?  You like digging into query plans, and have nitty grits for breakfast every morning?  OK then, we’ll look at a deadlock scenario from the inside out a bit later.  But first, here are the basics:

 

  1. Turn on trace flag 1222 with “DBCC TRACEON (1222, -1)” or by adding “-T1222” as a SQL startup parameter.  This trace flag is a new trace flag in SQL 2005, a much improved version of the tried-and-true -T1204.  If you’re running SQL 2005, you should be using 1222 instead of 1204 unless you have deep-seated masochistic tendencies. Alternatives to 1222:
    • If you are using SQL 2000 or SQL 7.0, you’ll have no choice but to fall back on the older -T1204. 
    • There’s a “Deadlock graph” Profiler trace event that provides the same info as -T1222.  Feel free to use this instead of -T1222 if you’re on SQL 2005.  But don’t waste your time with the “Lock:Deadlock” and “Lock:Deadlock Chain” trace events that are in SQL 2000, as they provide an unacceptably incomplete picture of the deadlock. 
  2. Get the -T1222 output from the SQL errorlog after the deadlock has occurred.  You’ll see output that looks like this:

deadlock-list

 deadlock victim=processdceda8

  process-list

   process id=processdceda8 taskpriority=0 logused=0 waitresource=KEY: 2:72057594051493888 (0400a4427a09) waittime=5000 ownerId=24008914 transactionname=SELECT lasttranstarted=2006-09-08T15:54:22.327 XDES=0x8fd9a848 lockMode=S schedulerid=1 kpid=4404 status=suspended spid=54 sbid=0 ecid=0 priority=0 transcount=0 lastbatchstarted=2006-09-08T15:54:22.293 lastbatchcompleted=2006-09-08T15:54:22.293 clientapp=OSQL-32 hostname=BARTD2 hostpid=3408 loginname=bartd isolationlevel=read committed (2) xactid=24008914 currentdb=2 lockTimeout=4294967295 clientoption1=538968096 clientoption2=128056

    executionStack

     frame procname=tempdb.dbo.p1 line=2 stmtstart=60 sqlhandle=0x03000200268be70bd

       SELECT c2, c3 FROM t1 WHERE c2 = @p1    

     frame procname=adhoc line=2 stmtstart=32 stmtend=52 sqlhandle=0x020000008a4df52d3

       EXEC p1 3    

    inputbuf

       EXEC p1 3

   process id=process3c54c58 taskpriority=0 logused=16952 waitresource=KEY: 2:72057594051559424 (0900fefcd2fe) waittime=5000 ownerId=24008903 transactionname=UPDATE lasttranstarted=2006-09-08T15:54:22.327 XDES=0x802ecdd0 lockMode=X schedulerid=2 kpid=4420 status=suspended spid=55 sbid=0 ecid=0 priority=0 transcount=2 lastbatchstarted=2006-09-08T15:54:22.327 lastbatchcompleted=2006-09-08T15:54:22.310 clientapp=OSQL-32 hostname=BARTD2 hostpid=2728 loginname=bartd isolationlevel=read committed (2) xactid=24008903 currentdb=2 lockTimeout=4294967295 clientoption1=538968096 clientoption2=128056

    executionStack

     frame procname=tempdb.dbo.p2 line=2 stmtstart=58 sqlhandle=0x030002005fafdb0c

       UPDATE t1 SET c1 = FLOOR (c1), c2 = FLOOR (c2) WHERE c1 = @p1    

     frame procname=adhoc line=2 stmtstart=32 stmtend=52 sqlhandle=0x020000006f878816

       EXEC p2 3    

    inputbuf

       EXEC p2 3

  resource-list

   keylock hobtid=72057594051559424 dbid=2 objectname=tempdb.dbo.t1 indexname=idx1 id=lock83642a00 mode=S associatedObjectId=72057594051559424

    owner-list

     owner id=processdceda8 mode=S

    waiter-list

     waiter id=process3c54c58 mode=X requestType=wait

   keylock hobtid=72057594051493888 dbid=2 objectname=tempdb.dbo.t1 indexname=cidx id=lock83643780 mode=X associatedObjectId=72057594051493888

    owner-list

     owner id=process3c54c58 mode=X

    waiter-list

     waiter id=processdceda8 mode=S requestType=wait

 

  1. “Decode” the -T1222 output to better understand the deadlock scenario.  The deadlock is summarized by a “process-list” and a “resource-list”.  A “process” is a spid or worker thread that participates in the deadlock.  Each process is assigned an identifier, like “processdceda8”.  A resource is a resource that one of the participants owns (usually a lock) that the other participant is waiting on.  I like to use a format like the one below to summarize the deadlock.  You can skip this step if you want, but I never do; I find it really helps me understand the deadlock situation more clearly.  I’ve highlighted in yellow each of the data points within the 1222 output that you would need to reconstruct this summary on your own.

               
    Spid 54 is running this query (line 2 of proc [p1]):
                                    SELECT c2, c3 FROM t1 WHERE c2 = @p1
                    Spid 55 is running this query (line 2 of proc [p2]):
                                    UPDATE t1 SET c1 = FLOOR (c1), c2 = FLOOR (c2) WHERE c1 = @p1
                   
                    Spid 54 is waiting for a Shared KEY lock on index t1.cidx.  
                                    (Spid 55 holds a conflicting X lock.)
                    Spid 55 is waiting for an eXclusive KEY lock on index t1.idx1.  
                                    (
    Spid 54 holds a conflicting S lock.)


    For those of you on SQL 2005 who think that the -T1222 output is a bit overwhelming, count your blessings and be thankful that you don’t have to wade through -T1204 output, which is a lot more difficult to interpret than -T1222 and doesn’t provide nearly as much useful information about the deadlock.  Check out the file "Decoding_T1204_Output.htm" attached to this post for annotated -T1204 output.
  2. Run the queries involved in the deadlock through Database Tuning Advisor.  Plop the query in a Management Studio query window, change db context to the correct database, right-click the query text and select “Analyze Query in DTA”.  Don’t skip this step; more than half of the deadlock issues we see are resolved simply by adding an appropriate index so that one of the queries runs more quickly and with a smaller lock footprint.  If DTA recommends indexes (it'll say “Estimated Improvement: <some non-zero>%”), create them and monitor to see if the deadlock persists.  You can select “Apply Recommendations” from the Action drop-down menu to create the index immediately, or save the CREATE INDEX commands as a script to create them during a maintenance window.  Be sure to tune each of the queries separately. 
  3. Make sure the query is using the minimum necessary transaction isolation level (-T1222 will tell you this – search the output for “isolationlevel”).  Queries run by transactional COM+ components will default to serializable, which is usually overkill.  This can be reduced by query hints (“...FROM tbl1 WITH (READCOMMITTED)...”), a SET TRANSACTION ISOLATION LEVEL command, or, in Windows 2003 and later, by configuring the object in the Component Services MMC plugin.
  4. Make sure that your transactions are as brief as they can be while still meeting the relevant business constraints.  Try not to use implicit transactions, as this model of transaction management encourages unnecessarily long transactions. 
  5. Look for other opportunities to improve the efficiency of the queries involved in the deadlock, either through query changes or through indexing improvements.  A query that locks the minimum number of resources will be much less likely to deadlock with another query.  Table scans, index scans, and large hashes or large sorts in the query plan may indicate opportunities for improvement.
  6. If one or both spids is running a multi-statement transaction, you may need to capture a profiler trace that spans the deadlock in order to identify the full set of queries that were involved in the deadlock.  Unfortunately, both -T1204 and -T1222 only print out the two queries that “closed the loop”, and it’s possible that one of the blocking locks was acquired by an earlier query run within the same transaction.

These are all general recommendations that you can apply to any deadlock without having to really roll up your sleeves and get dirty.  If after doing all of this you haven’t resolved it, though, you’ll have to dive a bit deeper and tailor a solution to the specifics of the scenario.  Here’s a menu of some common techniques that you can choose from when deciding how best to tackle a deadlock:

 

  • Access objects in the same order.   Consider the following two batches:

1. Begin Transaction

1. Begin Transaction

2. Update Part table

2. Update Supplier table

3. Update Supplier table

3. Update Part table

4. Commit Transaction

4. Commit Transaction

These two batches may deadlock frequently.  If both are about to execute step 3, they may each end up blocked by the other because they both need access to a resource that the other connection locked in step 2. 

  • If both deadlock participants are using the same index, consider adding an index that can provide an alternate access path to one of the spids.  For example, adding a covering nonclustered index for a SELECT involved in a deadlock may prevent the problem (assuming that none of the covering index keys are modified by the other deadlock participant).
  • On the other hand, if the spids are deadlocking because they took alternate paths (indexes) to a common required data row or page, consider whether one of the indexes can be removed or an index hint used to force both queries to share an access path.  Be cautious of potential performance hits as a result of this approach.
  • Deadlocks are a special type of blocking where two spids both end up blocking the other.  Sometimes the best way to prevent a deadlock is to force the blocking to occur at an earlier point in one of the two transactions.  For example, if you force spid A to be blocked by spid B at the very beginning of A’s transaction, it may not have a chance to acquire the lock resource that later ends up blocking spid B.  Doesn’t this means you are deliberately causing blocking?  Yes, but remember that you already have blocking or you wouldn’t be in a deadlock situation, and simple blocking is a big improvement over a deadlock.  As soon as B commits his transaction, A will be able to proceed.  HOLDLOCK and UPDLOCK hints can be useful for this.
  • If a high priority process is being selected as a victim in a deadlock with a lower priority process, the lower priority process could be modified to SET DEADLOCK_PRIORITY LOW.  Spids that set this will offer themselves up as the sacrificial lamb in any deadlock they encounter. 
  • Avoid placing clustered indexes on columns that are frequently updated. Updates to clustered index key columns will require locks on the clustered index (to move the row) and all nonclustered indexes (since the leaf level of NC indexes reference rows by clustered index key value). 
  • In some cases it may be appropriate to add a NOLOCK hint, assuming that one of the queries is a SELECT statement.  While this is a tempting path because it is a quick and easy solution for many deadlocks, approach it with caution as it carries with it all the usual caveats surrounding read uncommitted isolation level (a query could return a transactionally inconsistent view of the data).  If you are unfamiliar with the risks, read the "SET TRANSACTION ISOLATION LEVEL" topic in SQL Books Online. 
  • In SQL 2005 you could consider the new SNAPSHOT isolation level.  This will avoid most blocking while avoiding the risks of NOLOCK.  An even cooler new feature IMHO is the new READ COMMITTED SNAPSHOT database option (see ALTER DATABASE), which allows you to use a variant of snapshot isolation level without changing your app.  
  • If one or both locks involved in the deadlock are S/X TAB (table) locks, lock escalation may be involved.  You can reduce the likelihood of lock escalation by enabling trace flag 1224 (SQL 2005 and later) or 1211 (see KB 323630).  Note that this does not apply to "intent" TAB locks, which have a capital "I" prefix (e.g. IS / IX TAB locks).

In a follow-up post I’ll look at a fairly typical deadlock in detail.  This will provide an example of what you'd have to do if the 8 high-level steps listed above fail you, forcing you to understand the scenario at a deeper level so that you can craft a custom solution.  

 

(This post series is continued in Deadlock Troubleshooting, Part 2.)

 

CraigFr has a great series of posts in his blog describing the difference between the various logical and physical join types, the details of how SQL Server implements these joins, and the things that the query optimizer takes into account when selecting a join type.  These five posts are a wonderful read.  
 
 

Sometime we get complaints that a query is slower than it could be because a filter isn’t pushed very deeply down into a plan.  For example, consider this hypothetical poor performance scenario (my apologies in advance for the lack of normalization):

 

USE tempdb

IF OBJECT_ID ('Sales') IS NOT NULL DROP TABLE Sales

IF OBJECT_ID ('SalesSummary') IS NOT NULL DROP VIEW SalesSummary

CREATE TABLE Sales (

  SalesPerson varchar(30) NOT NULL,

  SalesAmount money NOT NULL, Comments char(200))

INSERT INTO Sales VALUES ('Green ', 19011.87, '')

INSERT INTO Sales VALUES ('Green', 2478.42, '')

INSERT INTO Sales VALUES ('Green ', 1975.11, '')

INSERT INTO Sales VALUES ('White', 3007.01, '')

INSERT INTO Sales VALUES ('White', 5312.44, '')

INSERT INTO Sales VALUES ('Brown', 843.20, '')

CREATE INDEX idx ON Sales (SalesPerson, SalesAmount)

GO

CREATE VIEW SalesSummary AS

SELECT SalesPerson, SUM (SalesAmount) AS TotalSales

FROM Sales

GROUP BY SalesPerson

GO

 

SET STATISTICS PROFILE ON

GO

-- This query uses an index seek to retrieve only the rows

-- where SalesPerson = 'Green'

SELECT *

FROM SalesSummary

WHERE SalesPerson = 'Green'

GO

 

-- This query uses an index scan, then filters the rows later.

SELECT *

FROM SalesSummary

WHERE SalesPerson LIKE 'Green'

GO

SET STATISTICS PROFILE OFF

GO

 

The first query filters on “SalesPerson = 'Green'”.  The second filters on “SalesPerson LIKE 'Green'”.  Here’s the first query’s plan:

 

  |--Stream Aggregate(DEFINE:([Expr1004]=SUM([Sales].[SalesAmount]), [Sales].[SalesPerson]=ANY([Sales].[SalesPerson])))

       |--Index Seek(OBJECT:([Sales].[idx]), SEEK:([Sales].[SalesPerson]='Green') ORDERED FORWARD)

 

You can see that the first thing the plan does is a very efficient index seek to narrow the set of rows down to those that pass the “SalesPerson = ‘Green’” filter.  Then a Stream Aggregate operator computes the SUM(SalesAmount) expression for each SalesPerson returned by the index seek. 

 

The second plan, though, scans every row in the table and computes the SUM for every SalesPerson.  Only after it has scanned and aggregated every row does it filter out those values that don’t survive the LIKE predicate. 

 

  |--Filter(WHERE:([Sales].[SalesPerson] like 'Green'))

       |--Stream Aggregate(GROUP BY:([Sales].[SalesPerson]) DEFINE:([Expr1004]=SUM([Sales].[SalesAmount])))

            |--Index Scan(OBJECT:([Sales].[idx]), ORDERED FORWARD)

 

Here’s a general rule off thumb you can follow when looking for tuning opportunities in a query plan: for best performance, you usually want to push the most selective predicates as deeply as possible into the plan.  If you do the most selective operation first, the remaining operators have fewer rows to process, and that means faster overall query execution.  Based on this rule of thumb, the first plan here is clearly preferable from a performance perspective.  So what gives?  Why does the use of “LIKE” instead of “=” make SQL refuse to push the filter down?  Even more baffling, why do you get the more efficient plan when you bypass the view and select directly from the table, even if you use "LIKE"?  Run this and you’ll see what I mean:

 

SELECT SalesPerson, SUM (SalesAmount) AS TotalSales

FROM Sales

WHERE SalesPerson LIKE 'Green'

GROUP BY SalesPerson

 

This uses an efficient index seek-based plan, just like the first query. 

 

There are several things going on here:

  • Views must behave like a table
  • The “LIKE” and “=” operators use subtly different rules for string matching 
  • GROUP BY uses the same string comparison rules as “=” for the purposes of determining which rows end up in the same bucket 

What I mean by “views must behave like a table” is that the output of a select from a view must be the same as what you could get by materializing the view (e.g. selecting it into a temp table), then querying the materialized view.  In this case, the GROUP BY in the view could return a different total sales amount for a SalesPerson if SQL chose a plan that pushed a LIKE predicate below the Stream Aggregate.  Here’s proof:

 

SELECT *

FROM SalesSummary

WHERE SalesPerson LIKE 'Green '

 

-- Query 1 output:

SalesPerson                    TotalSales           

------------------------------ ---------------------

Green                          23465.4000

 

 

SELECT SalesPerson, SUM (SalesAmount) AS TotalSales

FROM Sales

WHERE SalesPerson LIKE 'Green '

GROUP BY SalesPerson

 

-- Query 2 output:

SalesPerson                    TotalSales           

------------------------------ ---------------------

Green                          20986.9800

 

The first query selects from the view, while the second moves the view logic into the query and selects directly from the base table.  Other than that, they are identical, yet the SUM(SalesAmount) calculation is different.  Recall that I mentioned that “=” and “LIKE” have different string comparison semantics.  In particular, LIKE considers trailing blanks in the right-hand operand to be significant, while the “=” operator ignores trailing blanks.  For the set of three rows with group ID “Green”, only two will qualify for the filter “WHERE SalesPerson LIKE ‘Green ‘” because they have trailing blanks.  The third “Green” row doesn’t have any trailing blanks and won’t survive the LIKE filter.  When you push this LIKE filter below the aggregate, you end up SUMming a different set of rows.  That’s not allowed if the GROUP BY is part of a view; if a WHERE clause applied to a view can change a property of a row instead of just filtering it out, it would mean that the view didn’t behave like a materialized table.  Put another way, a filter on a view that includes a GROUP BY is only allowed to eliminate entire groups; it's not legal for it to eliminate some base rows in a group but retain others, changing the group's membership.  It’s therefore by design that the slower scan-based plan is selected for the select from the view with LIKE.  In contrast, the query from the view with the "SalesPerson='Green'" filter can be pushed because GROUP BY uses the same string comparison rules as the "=" operator.  It's safe for the QO to assume that pushing the "=" filter below the Stream Aggregate will not change the view's semantics. 

 

This isn’t just about trailing blanks – you can see the attached script for a couple of examples that demonstrate the exact same thing (pushing LIKE below an aggregate changes the output of the aggregate) for a couple of interesting non-blank characters.  And it isn’t only about “LIKE” vs. “=”, either; this is just the example that was close at hand when I wrote this (we have a case open for this scenario right now). 

 

Finally, be aware that derived tables (and CTEs) also provide the same guarantee.  For example, note that this query selects from the base table but also does a full scan followed by filter, just like the above select from the view:

 

SELECT *

FROM (

   SELECT SalesPerson, SUM (SalesAmount) AS TotalSales

   FROM Sales

   GROUP BY SalesPerson) AS t

WHERE SalesPerson LIKE 'Green'

 

So, to net out all of this: Generally you want your filters to be pushed deep into the query plan -- as deeply as possible.  But when you’re selecting from a view, there will be some limits to what can be pushed.  Some filters can’t be pushed beneath parts of the view without changing the view’s semantics, and that would break a contract that SQL is required to maintain.  

 

 

UPDATE (2 March 2009): Fabiano Amorim pointed out that the Query #1 and Query #2 use the same plan on SQL 2008.  He's right; a new performance optimization causes the LIKE predicate to be pushed below the GROUP BY's aggregate operator.  I think this is actually a bug -- it does result in a faster plan, but it breaks the "views behave like tables" rule that SQL follows in all other cases.  The general rule stands: not all predicates can be pushed below a view's GROUP BY, even in SQL 2008 with this fairly aggressive performance optimization.  (And don't be surprised if this optimization gets removed in a future release. ;)

 

Here's another case where you might see intermittently poor performance that is "by design".  Suppose you see that a delete, insert, or update query in a stored proc usually runs quickly, but occasionally the query takes much longer to complete.  You captured a detailed profiler trace of both the good and bad periods, including the Showplan Statistics event.  The offending delete is in a stored proc, and it couldn't be simpler: "DELETE FROM t1 WHERE c1 = @p1".   One of the plans you see for this statement looks like this: 

DELETE FROM t1 WHERE c1 = @p1 
     |--Clustered Index Delete(OBJECT:([mydb].[dbo].[t1].[idx1]), WHERE:([t1].[c1]=[@p1])) 

And the other plan looks something like this: 

DELETE FROM t1 WHERE c1 = @p1 
     |--Sequence
          |--Index Delete(OBJECT:([mydb].[dbo].[t1].[idx2]))
          |    |--Sort(ORDER BY:([t1].[c2] ASC, [t1].[c1] ASC, [Bmk1000] DESC))
          |         |--Table Spool
          |              |--Clustered Index Delete(OBJECT:([mydb].[dbo].[t1].[idx1]), WHERE:([t1].[c1]=[@p1]))
          |--Index Delete(OBJECT:([mydb].[dbo].[t1].[idx3]))
          |    |--Sort(ORDER BY:([t1].[c3] ASC, [t1].[c1] ASC, [Bmk1000] DESC))
          |         |--Table Spool
          |--Index Delete(OBJECT:([mydb].[dbo].[t1].[idx4]))
               |--Sort(ORDER BY:([t1].[c2] ASC, [t1].[c3] ASC, [t1].[c1] ASC, [Bmk1000] DESC))
                    |--Table Spool


(The number of branches may vary -- there will be one for each nonclustered index.)
 
These are two very different-looking plans for the same, very simple (DELETE FROM t1 WHERE c1 = @p1) query.  What's going on here? 
 
When a SQL query plan needs to delete a clustered index row, it can let the storage engine take care of cleaning up (deleting) the corresponding rows in all the nonclustered indexes. (Similarly, in the case of an insert or update, the storage engine is capable of automatically inserting or updating rows in the affected nonclustered indexes.)  This is a simple approach, but the problem with this is that the nonclustered index rows are deleted in clustered index key order, not nonclustered key order, which means that SQL has to jump all around the nonclustered index in a more or less random fashion deleting one row at a time here, a second one way over there, etc.  If there are a large number of rows being deleted, the disk seeks to reposition the head back and forth across the nonclustered indexes can be very expensive.  In these cases (when SQL knows that it has to modify a lot of rows from a clustered index), the query optimizer has the option of choosing to include the logic needed to delete all the corresponding nonclustered index rows in the query plan as explicit Index Delete operators (see the second plan above for an example).  In this case the QO can do fancier things than the storage engine knows how to do, like sorting the data set in nonclustered index key order before the Index Delete; this means that all the nonclustered index keys can be removed in a single, ordered pass over the nonclustered index, which means less drive head movement and therefore a smaller amount of disk wait time.  This is a neat trick, but the overhead of setting up and performing all the sorts and having the NC index deletes happen at a higher level in the engine has a certain amount of overhead, so it only makes sense for the QO to handle index maintenance itself if a large number of rows will be deleted.  (There are other factors that also come into play like the number of nonclustered indexes on the table, but for a fixed schema and query, the key variable is the number of rows that SQL estimates will be affected by the query.) 
 
In this particular case, we have a parameterized DELETE query.  The number of rows that will be deleted depends entirely on the specific parameter value that is passed into the query.  Whether the QO will choose to compile a plan that does the index maintenance in the plan (this is called the "wide plan") or leaves this up to the storage engine (a "narrow plan") depends on the number of rows that will be deleted.  If a parameter value that only qualifies a small number of rows is passed in when the plan for the proc isn't in cache, the QP will select a narrow plan.  This is most likely the most efficient plan for that particular parameter value.  Unfortunately it may not be an efficient plan for other parameter values that may delete thousands of rows, but because the query is in a stored proc, the plan will be reused for subsequent executions until something kicks the plan out of cache (that something could be a reindex, a schema change to a table, a DBCC FREEPROCCACHE, or just normal cache maintenance to free up memory for new plans).  Plan dependency on the parameters that are passed in for the execution that triggers the compile is called parameter sniffing.  (Param sniffing can lead to a host of common perf issues -- I may talk more about some of these in future entries.) 

So, the basic problem here is the result of
parameter sniffing combined with the choice of wide vs. narrow plans.  A few possible solutions (definitely not an exhaustive list):
  • Use local variables instead of parameters in the INSERT/DELETE/UPDATE query so that the plan selected doesn't depend on param value.
  • Define the proc WITH RECOMPILE (or use an OPTION(RECOMPILE) query hint if you're on SQL 2005) so that everyone gets a plan tailored to their particular parameter.
  • Use an "OPTIMIZE FOR (@p1=<a "typical" param value>)" query hint to tell the optimizer to always optimize for a parameter value that is typical of the type that will most commonly be passed in
  • Use a USE PLAN hint, possibly with a plan guide, to force one of the two plan possibilities
Below is a simple script that shows the two types of index maintenance plans; you can play around with this to see them in action.   After looking at the plans, try omitting the DBCC FREEPROCCACHE to see how perf is impacted when a wide plan is reused to modify a small number of rows, or when a narrow plan is reused to modify a large number of rows. 

USE tempdb
GO
IF @@TRANCOUNT > 0 ROLLBACK TRAN
IF OBJECT_ID ('t1') IS NOT NULL DROP TABLE t1
IF OBJECT_ID ('myproc') IS NOT NULL DROP PROC myproc
GO
CREATE TABLE t1 (c1 int, c2 int, c3 int)
GO
SET NOCOUNT ON
GO
DECLARE @x int
SET @x = 1
BEGIN TRAN
WHILE @x <= 100000
BEGIN
  IF @x % 250 < 100
    INSERT INTO t1 (c1, c2, c3)
    VALUES (0, @x % 400, @x % 1000)
  ELSE
    INSERT INTO t1 (c1, c2, c3)
    VALUES (@x % 1000, @x % 400, @x % 1000)
  IF @x % 5000 = 0
  BEGIN
    RAISERROR ('Inserted %d rows...', 0, 1, @x) WITH NOWAIT
    COMMIT TRAN
    BEGIN TRAN
  END
  SET @x = @x + 1
END
WHILE @@TRANCOUNT > 0 COMMIT TRAN
GO
CREATE CLUSTERED INDEX idx1 ON t1 (c1)
CREATE INDEX idx2 ON t1 (c2)
CREATE INDEX idx3 ON t1 (c3)
CREATE INDEX idx4 ON t1 (c2, c3)
GO
CREATE PROC myproc @p1 int AS
UPDATE t1 SET c1 = c1 WHERE c1 = @p1
GO
DBCC FREEPROCCACHE
GO
-- First delete a small number of rows: let storage
-- engine take care of index maintenance
SET STATISTICS PROFILE ON
EXEC myproc @p1 = 126
SET STATISTICS PROFILE OFF
GO
-- Be sure to flush the proc cache or we will reuse the plan
-- that was compiled for a much smaller number of rows on the
-- last execution.
DBCC FREEPROCCACHE
GO
-- Delete a larger number of rows: handle index maintenance
-- explicitly in query plan
SET STATISTICS PROFILE ON
EXEC myproc @p1 = 0
SET STATISTICS PROFILE OFF
GO


If you look closely at the wide plan you might notice something curious: only one of the Table Spool operators in the plan has a child.  The purpose of a table spool is to store a set of rows in an intermediate place so that they can be used more than once in the plan without having to execute the same query subtree over and over.  Spools are typically used either for a perf boost or for Halloween protection; in this example, the spool fulfills both of these needs.  Table spools require a child -- remember that the purpose of a spool is to cache rows, so they must have an input that provides the rows to cache.  Any time you see a spool without a visible child, it's a spool that was created elsewhere in the plan, and is being reused.  In this plan, only the first appearance of the first spool shows the spool's child -- the content of the spool is then reused by the subsequent index delete operators.  (Unfortunately, if there are multiple spools in a plan, there's no obvious indicator about which one is referenced by each of the "childless" spools.)
 
If you examine the wide update plan from the script above, you might notice a Split operator.  This guy is poorly documented, but pretty cool -- he provides both Halloween protection and a performance boost (the benefits may sound similar to those of the Table Spool, but the mechanism is completely different).  If I have time, in a future post I may explain how Split works to provide these benefits.
 
A final footnote: In the comments above I mention that the storage engine is responsible for index maintenance when executing a narrow plan.  Technically, this is only true in SQL 2000.  In SQL 2005, both wide and narrow plans are performed by layers above the storage engine.  The different plan types still exist in SQL 2005, however, and the concepts and terms "wide plan" and "narrow plan" still apply to SQL 2005. 
 

To set up this scenario, run the script below:

USE tempdb
GO
IF OBJECT_ID ('test1') IS NOT NULL DROP TABLE test1
GO
CREATE TABLE test1 (c1 tinyint, c2 smallint)
DECLARE @x int
DECLARE @msg varchar(1000)
SET @x = 1
SET NOCOUNT ON
BEGIN TRAN
WHILE (@x <= 1000000)
BEGIN
  INSERT INTO test1 (c1, c2) VALUES (@x % 255, CASE WHEN @x % 1000 = 500 THEN 1000 ELSE @x % 1000 END)
  IF @x % 5000 = 0
  BEGIN
    COMMIT TRAN
    BEGIN TRAN
    SET @msg = 'Inserted ' + CONVERT(varchar(20), @x) 
+ ' rows ...'
    RAISERROR (@msg, 0, 1) WITH NOWAIT
  END
  SET @x = @x + 1
END
WHILE (@@TRANCOUNT > 0) COMMIT TRAN
SET NOCOUNT OFF
GO
CREATE INDEX idx1 ON test1 (c2)
UPDATE STATISTICS test1 WITH FULLSCAN
DBCC SHRINKDATABASE (db2)
-- DBCC SHOW_STATISTICS ('test1', 'idx1')

GO

The hypothetical scenario here is that your users complain that the first of the two queries below runs very slowly. You find that if you force SQL to use index [idx1] with an index hint, the query executes much more quickly.  But why should you change your app to compensate for what seems to be a SQL Server bug?  :)

USE db2
GO
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
-- Query #1 (slow)
SELECT c1, c2 FROM test1 WHERE c2 = 500


-- Query #2 (fast)
SELECT c1, c2 FROM test1 WITH (INDEX = idx1) WHERE c2 = 500


SET STATISTICS TIME OFF
SET STATISTICS PROFILE OFF

GO

To me, this suggests two questions: 1) Why isn't the fast plan chosen by default? 2) Are there any possible solutions that don't require modifying code to supply an index hint?

 

 

Despite appearances, this actually isn't a bug.  Here's the explanation:

 

This table demonstrates uneven data distribution.  The [c2] column holds values ranging from 0 to 999.  There are exactly 1000 rows with the same [c2] value for each of the integers between 0 and 999. However, the value 500 is an anomaly; there are 0 rows with this value.  Run the following query to show the number of rows with each [c2] value and you'll see that there are none with [c2]=500:

SELECT c2, COUNT(*) AS cnt FROM test1 GROUP BY c2


 c2     cnt        
 ------ -----------
 ...
 498    1000
 499    1000
 501    1000
 502    1000
 ...

When the server builds a histogram for the statistics on a column, it attempts to intelligently select range endpoints so that a given step of the histogram represents a range of values with similar density.  If you run DBCC SHOW_STATISTICS on the index you can see the histogram: 

 

 -- DBCC SHOW_STATISTICS ('test1', 'idx1')

 RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
 ------------ ---------- ------- ------------------- --------
 0            0.0        1000.0  0                   0.0
 499          498000.0   1000.0  498                 1000.0
 503          2000.0     1000.0  2                   1000.0
 1000         496000.0   1000.0  496                 1000.0

 

RANGE_HI_KEY is the value that sets the upper bound for the range of values represented in each histogram step.  The lower bound is RANGE_HI_KEY+1 for the preceding step.  For example, consider this row:


 499          498000.0   1000.0  498                 1000.0

 

This indicates that there are 498000 rows (RANGE_ROWS) with a value between 1 and 498, inclusive.  There are 1000 rows with the exact value of 499 (EQ_ROWS).  AVG_RANGE_ROWS tells us that the typical value that falls within the range 1-498 shows up in 1000 rows.  Now consider the next step in the histogram:

 

 RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
 ------------ ---------- ------- ------------------- --------
 503          2000.0     1000.0  2                   1000.0

 

This indicates that there are 2000 rows (RANGE_ROWS) with a value between 500 and 502, inclusive; and there are 1000 rows (EQ_ROWS) with the exact value of 503. The typical c2 value in the range 500-502 shows up average of 1000 times in the table.  Recall what we know about the actual number of rows in this range:

 c2     cnt        
 ------ -----------
 ...
 499    1000
 501    1000
 502    1000
 503    1000
 ...

This histogram step describes its range of c2 values exactly.  What the histogram doesn't tell us, however, is exactly how many times a particular value in the 500-502 range appears.  Remember that each histogram step only summarizes the values within a given range; it doesn't tell you exactly how many of each particular value there are (that is, unless the table's domain is small enough for every distinct value to have its own step in the histogram).  SQL knows it needs to search for rows where [c2]=500, so it locates this step in the statistics' histogram, finds that the typical value in this range shows up in 1000 rows, and uses this as its rowcount estimate.

 

This is a key point: if a value being searched for falls in the middle of a histogram step, the AVG_RANGE_ROWS for that step is used to estimate the number of matches that will be found.  SQL is always optimistic and assumes that the value it is searching for is actually one of the range rows.  In this case that assumption is incorrect.  The QP ends up overestimating the number of rows that will be returned and choosing a table scan when an index seek would have been more efficient.

 

So if this isn't a bug, what could you do to fix the problem?  One solution would be to make the nonclustered index a covering index.  Alternately, you could make [idx1] a clustered index.  And, if you're using SQL 2005, you could use a plan guide with a USE PLAN hint to force the fast plan without changing the query text.  All of these solutions should cause SQL to select the fast index seek plan.  The bad cardinality estimate will still be there, but the perf problem won't.

 

Everyone that has worked with databases for long enough has run into situations where the query optimizer doesn't select the best possible plan.  You may find that you can force SQL to use an index, choose a different join algorithm or join order, or use some other query hint to get a much faster plan.  But you're reluctant to try to push through a change in your app in order to compensate for what seems to be a clear SQL Server bug.
 
We see lots of instances where people have the expectation that if SQL chose a demonstrably inefficient plan, it must be a bug.  Unfortunately, it's not quite that simple.  At the end of the day, the query processor is just a guessing engine that tries to pick the best possible plan based off statistics that summarize the data in a table.  There are some significant limits to how accurately the query optimizer can model the real world when it is costing plans.  For starters, the information available to it in statistics only summarizes column data at a fairly high level; it doesn't tell the QO all that there is to know about the distribution of values in a column, or the relationships between data in two or more columns.  And there's a tradeoff between the complexity of the QO's modeling rules and the amount of time that it takes to cost the hundreds of thousands of plan possibilities that exist for even modestly complex queries.  Even if the QO had full knowledge of everything there was to know about your data, trying to take all of those facts into account when compiling plans would lead to more situations where the optimizer spent more time trying to find that perfect plan than it would have taken to actually execute the query.  There are definitely QO bugs out there, but the cases where the QO picks a less-than-ideal plan are not necessarily bugs; some are simply the result of these limits.  In fact, our experience is that a majority of bad plan issues fall into the "unfortunate, but by design" bucket. 
 
Dubious?  That's OK -- I'll try to illustrate some of the more common "by design" situations where you might see poor plan selections here.  Even if you end up taking the perfectly defensible position that the QO should be able to handle some of these situations more gracefully, it'll still benefit you to be able to recognize them. 
More Posts Next page »
 
Page view tracker