Welcome to MSDN Blogs Sign in | Join | Help

Correction to my prior post on sys.dm_db_index_operational_stats

In this post about the sys.dm_db_index_usage_stats and sys.dm_db_index_operational_stats DMVs, I wrote:

Another less important difference between these DMVs is that sys.dm_db_index_usage_stats only reports on indexes that have been used at least once since the server was last restarted while sys.dm_db_index_operational_stats reports on all indexes regardless of whether they have been used.

The Books Online page for sys.dm_db_index_operational_stats similarly states about the object_id parameter: "Specify NULL to return information for all tables and views in the specified database."  However, further down in the same page, it states:

The data returned by sys.dm_db_index_operational_stats exists only as long as the metadata cache object that represents the heap or index is available. This data is neither persistent nor transactionally consistent. This means you cannot use these counters to determine whether an index has been used or not, or when the index was last used. For information about this, see sys.dm_db_index_usage_stats (Transact-SQL).

An observant reader pointed out that the sys.dm_db_index_operational_stats DMV was not behaving as I claimed - that it was not returning all rows - and further pointed out the apparent discrepancy in Books Online.  I checked with one of the storage engine developers who confirmed for me that the DMV will only return rows for those objects that are currently in the metadata cache.

The metadata cache does not contain any user tables when the server is first started or, for a database, when it is first attached.  Moreover, if a database contains many tables, the cache may not be large enough to store all of them at once and some may be evicted.  If a table is evicted, it will cease to appear in the DMV output and its counters will be reset to zero.

Let's look at a simple example.  First, let's create a test database:

CREATE DATABASE DMVTest
GO
USE DMVTest
GO
CREATE TABLE T (A INT, B INT, C INT)
CREATE UNIQUE CLUSTERED INDEX TA ON T(A)
CREATE UNIQUE INDEX TB ON T(B)
GO

Since we just created this table, it should be in the metadata cache and should be listed by sys.dm_db_index_operational_stats:

SELECT index_id, range_scan_count, singleton_lookup_count
FROM sys.dm_db_index_operational_stats (DB_ID(), OBJECT_ID('T'), NULL, NULL)
ORDER BY index_id

index_id    range_scan_count     singleton_lookup_count
----------- -------------------- ----------------------
1           0                    0
2           0                    0

Let's also run a simple query so that the table appears in sys.dm_db_index_usage_stats.  Recall that sys.dm_index_usage_stats only lists tables and indexes that appear in a query plan and only when that query plan is actually executed.

SELECT * FROM T

SELECT index_id, user_seeks, user_scans, user_lookups, user_updates
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID() and object_id = OBJECT_ID('T')
ORDER BY index_id

index_id    user_seeks           user_scans           user_lookups         user_updates
----------- -------------------- -------------------- -------------------- --------------------
1           0                    1                    0                    0

Now, let's detach and reattach the database to clear the metadata cache:

USE tempdb
GO
EXEC SP_DETACH_DB 'DMVTest'
GO
EXEC SP_ATTACH_DB 'DMVTest',
      'C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\DATA\DMVTest.mdf',
      'C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\DATA\DMVTest_log.ldf'
GO
USE DMVTest
GO

Rechecking the two DMVs shows that the table is not listed in either one:

SELECT index_id, range_scan_count, singleton_lookup_count
FROM sys.dm_db_index_operational_stats (DB_ID(), OBJECT_ID('T'), NULL, NULL)
ORDER BY index_id

index_id    range_scan_count     singleton_lookup_count
----------- -------------------- ----------------------

SELECT index_id, user_seeks, user_scans, user_lookups, user_updates
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID() and object_id = OBJECT_ID('T')
ORDER BY index_id

index_id    user_seeks           user_scans           user_lookups         user_updates
----------- -------------------- -------------------- -------------------- --------------------

Finally, we can repopulate the metadata cache by compiling a query that references the table.  Note that compiling the query is all that is needed.  There is no need to execute it.

SET SHOWPLAN_TEXT ON
GO
SELECT * FROM T
GO
SET SHOWPLAN_TEXT OFF
GO

Rechecking sys.dm_db_index_operational_stats one final time shows that the table is once again listed:

SELECT index_id, range_scan_count, singleton_lookup_count
FROM sys.dm_db_index_operational_stats (DB_ID(), OBJECT_ID('T'), NULL, NULL)
ORDER BY index_id

index_id    range_scan_count     singleton_lookup_count
----------- -------------------- ----------------------
1           0                    0
2           0                    0

Note that all of the indexes, not just the clustered index (which happens to be used by the table scan in the above query plan), are listed.  In general, SQL Server loads all of a table's metadata as part of the compilation process.

In this simple example, I used database attach and detach to flush the metadata cache.  Of course, SQL Server also starts with an empty metadata cache after a restart and, as I noted above, may evict metadata at any time due to memory pressure.  Thus, Books Online is correct in stating that sys.dm_db_index_operational_stats may not offer reliable data regarding a table's past usage.
Posted by craigfr | 0 Comments

Maximum Row Size and Query Hints

In my last post (yes, that was two months ago), I gave an example of how a query hint could cause a query to fail.  In this post, I'll give another example of how query hints can cause problems.  As with my last post, this post was inspired by a question submitted by a reader.

SQL Server has a documented row size limit of 8060 bytes.  This row size limit applies specifically to the fixed width or in-row portion of variable width columns.  A web search yields plenty of articles and discussions about this limit.  If you try to create a table that exceeds this limit, you will encounter the following error:

CREATE TABLE T (A INT, B CHAR(8000), C CHAR(8000))

Msg 1701, Level 16, State 1, Line 1
Creating or altering table 'T' failed because the minimum row size would be 16011, including 7 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.

So, what does this row size limit have to do with query hints?  Let's start with the following simple join:

CREATE TABLE T1 (A INT, B CHAR(8000))
CREATE TABLE T2 (A INT, B CHAR(8000))
CREATE TABLE T3 (A INT, B CHAR(8000))

SELECT *
FROM T1 JOIN T2 ON T1.A = T2.A

  |--Hash Match(Inner Join, HASH:([T2].[A])=([T1].[A]), RESIDUAL:([T2].[A]=[T1].[A]))
       |--Table Scan(OBJECT:([T2]))
       |--Table Scan(OBJECT:([T1]))

So far, so good.  Now let's suppose that we want to sort the results of this query:

SELECT *
FROM T1 JOIN T2 ON T1.A = T2.A
ORDER BY T1.A

  |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([T1].[A])=([T2].[A]), RESIDUAL:([T2].[A]=[T1].[A]))
       |--Sort(ORDER BY:([T1].[A] ASC))
       |    |--Table Scan(OBJECT:([T1]))
       |--Sort(ORDER BY:([T2].[A] ASC))
            |--Table Scan(OBJECT:([T2]))

The optimizer has switched to a merge join.  This plan may seem pretty reasonable.  SQL Server needs to sort the data, so why not before the join?  Recall that merge join is order preserving so SQL Server does not need to sort again after the join.  However, suppose that we really want the hash join after all.  We might be tempted to add a hint forcing it:

SELECT *
FROM T1 JOIN T2 ON T1.A = T2.A
ORDER BY T1.A
OPTION (HASH JOIN)

Msg 8618, Level 16, State 2, Line 1
The query processor could not produce a query plan because a worktable is required, and its minimum row size exceeds the maximum allowable of 8060 bytes. A typical reason why a worktable is required is a GROUP BY or ORDER BY clause in the query. If the query has a GROUP BY or ORDER BY clause, consider reducing the number and/or size of the fields in the clause. Consider using prefix (LEFT()) or hash (CHECKSUM()) of fields for grouping or prefix for ordering. Note however that this will change the behavior of the query.

What happened?  Recall that hash join is not order preserving.  Thus, SQL Server must sort after the join.  The sort requires a work table which must include all columns to be sorted.  In this example, the columns include T1.B and T2.B.  These two CHAR(8000) columns together exceed the 8060 byte row size limit, SQL Server cannot create the work table, and the query fails.  Note that the hash join itself is not a problem (the original query above succeeds with a hash join) but rather it is the sort that causes the failure.

Let's try another example:

SELECT *
FROM (T1 JOIN T2 ON T1.A = T2.A) JOIN T3 ON T1.A = T3.A

  |--Nested Loops(Inner Join, WHERE:([T2].[A]=[T3].[A]))
       |--Hash Match(Inner Join, HASH:([T3].[A])=([T1].[A]), RESIDUAL:([T1].[A]=[T3].[A]))
       |    |--Table Scan(OBJECT:([T3]))
       |    |--Table Scan(OBJECT:([T1]))
       |--Table Scan(OBJECT:([T2]))

Notice that this query uses one hash join and one nested loops join.  Suppose that we really want two hash joins.  We might again be tempted to use a hint:

SELECT *
FROM (T1 JOIN T2 ON T1.A = T2.A) JOIN T3 ON T1.A = T3.A
OPTION (HASH JOIN)

This query fails with the same 8618 error that we saw above.  Why?  Like a sort, a hash join also requires a work table.  In fact, a hash join requires work tables for both the build and probe (i.e., left and right inputs).  No matter which two tables SQL Server joins first, the results of that join, which includes two CHAR(8000) columns, must be joined with the third and final table.  This second join cannot use a hash join as SQL Server cannot build a work table that exceeds the 8060 byte row size limit.

There are many more examples possible where hints may cause a query to fail due to requiring a work table that exceeds the 8060 byte row size limit.  One solution to this problem is, of course, to remove the offending hint.  Another solution is to replace large fixed length CHAR columns with variable length VARCHAR columns.

Posted by craigfr | 0 Comments
Filed under: ,

Implied Predicates and Query Hints

In this post, I want to take a look at how two seemingly unrelated features of SQL Server can interact to cause a problem.  The idea for this post came from a question submitted by a reader.  Let's begin.  Consider the following trivial schema and query:

CREATE TABLE T1 (A INT, B INT)
CREATE TABLE T2 (A INT, B INT)

SELECT *
FROM T1 INNER JOIN T2 ON T1.A = T2.A
WHERE T1.B = 0
OPTION (HASH JOIN)

Not surprisingly, this query yields the following plan:

  |--Hash Match(Inner Join, HASH:([T1].[A])=([T2].[A]), RESIDUAL:([T2].[A]=[T1].[A]))
       |--Table Scan(OBJECT:([T1]), WHERE:([T1].[B]=(0)))
       |--Table Scan(OBJECT:([T2]))

In fact, this query yields this plan with or without the hint.  Now let's make a small modification to the WHERE clause of the query see what happens:

SELECT *
FROM T1 INNER JOIN T2 ON T1.A = T2.A
WHERE T1.A = 0
OPTION (HASH JOIN)

Now this query yields the following error message:

Msg 8622, Level 16, State 1, Line 1

Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

What happened?  Why does this seemingly innocuous change to the query cause it to fail?  To find the answer, let's run the query without the HASH JOIN hint and look at the plan:

SELECT *
FROM T1 INNER JOIN T2 ON T1.A = T2.A
WHERE T1.A = 0

  |--Nested Loops(Inner Join)
       |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=(0)))
       |--Table Scan(OBJECT:([T1]), WHERE:([T1].[A]=(0)))

There are two things about this plan that are especially notable:

First, the plan includes the predicate "T2.A = 0".  Although we did not specify this predicate in the original query, SQL Server derives it from the predicates that we did specify.  This derivation is a good thing.  It allows SQL Server to filter out rows from the scan of T2 earlier than would otherwise be possible.

Second, the original equijoin predicate "T1.A = T2.A" appears nowhere in this plan.  This predicate is redundant with the original predicate "T1.A = 0" and the derived predicate "T2.A = 0" so SQL Server eliminates it.  Normally, eliminating a redundant predicate is a good thing.  By the time the rows from the scans reach the join, there is no reason to evaluate this predicate.  It would be a waste of time.  Unfortunately, in this case, the eliminated predicate also happens to be the only equijoin predicate and hash joins (and merge joins) require at least one equijoin predicate.  Thus, the query with the HASH JOIN hint fails.

Note that the loss of the hash join and merge join alternatives for the above plan is not a big deal from a performance perspective.  With or without the join predicate, the query is a cross join since all rows from T1 will join with all rows from T2.

If we throw in a third, briefly lived feature of SQL Server 2008, the situation gets even more complex.  SQL Server 2008 RTM has an optimization that substitutes constants for parameter and variable values if we use the RECOMPILE hint.  Note that this optimization was removed from SQL Server 2008 Cummulative Update 4 and Service Pack 1 to fix an issue.  If your instance of SQL Server has the fix, you will not be able to reproduce the following scenario.

Let's see a simple example of this optimization in action.  Compare the plans for the following two identical queries:

DECLARE @P INT
SET @P = 0

SELECT * FROM T1 WHERE T1.A = @P

  |--Table Scan(OBJECT:([T1]), WHERE:([T1].[A]=[@P]))

SELECT * FROM T1 WHERE T1.A = @P
OPTION (RECOMPILE)

  |--Table Scan(OBJECT:([T1]), WHERE:([T1].[A]=(0)))

Observe how the WHERE clause in the first plan references the @P variable while the WHERE clause in the second plan which includes the RECOMPILE hint references the constant 0.  This optimization is safe since the plan generated for the query with the RECOMPILE hint will be used only once and then discarded rather than cached.

The preceding plans are "actual" rather than "estimated" query plans.  That is, I collected them using SET STATISTICS PROFILE ON rather than SET SHOWPLAN_TEXT ON.  The parameter substitution optimization can only be applied when the query is recompiled immediately prior to execution and the actual parameter values are known.

Now let's create a simple stored procedure:

CREATE PROCEDURE MY_SP (@P1 INT, @P2 INT)
AS
SELECT *
FROM T1 INNER JOIN T2 ON T1.A = T2.A
WHERE T1.A BETWEEN @P1 AND @P2
OPTION (HASH JOIN, RECOMPILE)

If we run this stored procedure with two different parameters, it works just fine:

EXEC MY_SP 0, 1

Here is the query plan:

  |--Hash Match(Inner Join, HASH:([T1].[A])=([T2].[A]), RESIDUAL:([T2].[A]=[T1].[A]))
       |--Table Scan(OBJECT:([T1]), WHERE:([T1].[A]>=(0) AND [T1].[A]<=(1)))
       |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]>=(0) AND [T2].[A]<=(1)))

However, if we run the stored procedure with the same parameter, it fails with the same error that we saw at the beginning of this post.  For example, the following fails:

EXEC MY_SP 0, 0

Msg 8622, Level 16, State 1, Line 1

Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

Why does the stored procedure work with some parameters but not with others?  If we alter the stored to remove the HASH JOIN hint, we can see what has happened:

ALTER PROCEDURE MY_SP (@P1 INT, @P2 INT)
AS
SELECT *
FROM T1 INNER JOIN T2 ON T1.A = T2.A
WHERE T1.A BETWEEN @P1 AND @P2
OPTION (RECOMPILE)

EXEC MY_SP 0, 0

  |--Nested Loops(Inner Join)
       |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=(0)))
       |--Table Scan(OBJECT:([T1]), WHERE:([T1].[A]=(0)))

SQL Server has transformed the predicate "T1.A BETWEEN @P1 AND @P2" first into "T1.A BETWEEN 0 AND 0" and then into "T1.A = 0".  Now we have exactly the same scenario that we had in the first failed query above.

Query hints can be powerful tools, but they can also backfire - often in mysterious ways.  In these examples, two and even three separate features all interact to induce the error.  Even a small change to a query or to a stored procedure's parameters can be the difference between success and failure.
Posted by craigfr | 1 Comments
Filed under: ,

OPTIMIZED Nested Loops Joins

In my past two posts, I explained how SQL Server may add a sort to the outer side of a nested loops join and showed how this sort can significantly improve performance.  In an earlier post, I discussed how SQL Server can use random prefetching to improve the performance of a nested loops join.  In this post, I'm going to explore one more nested loops join performance feature.  I'll use the same database that I used in my two prior posts.  Let's start with the following simple query:

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END))
       |--Stream Aggregate(DEFINE:([Expr1011]=COUNT_BIG([T].[Data]), [Expr1012]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1010]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (1000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

Notice that the nested loops join includes an extra keyword: OPTIMIZED.  This keyword indicates that the nested loops join may try to reorder the input rows to improve I/O performance.  This behavior is similar to the explicit sorts that we saw in my two previous posts, but unlike a full sort it is more of a best effort.  That is, the results from an optimized nested loops join may not be (and in fact are highly unlikely to be) fully sorted.

SQL Server only uses an optimized nested loops join when the optimizer concludes based on its cardinality and cost estimates that a sort is most likely not required, but where there is still a possibility   that a sort could be helpful in the event that the cardinality or cost estimates are incorrect.  In other words, an optimized nested loops join may be thought of as a "safety net" for those cases where SQL Server chooses a nested loops join but would have done better to have chosen an alternative plan such as a full scan or a nested loops join with an explicit sort.  For the above query which only joins a few rows, the optimization is unlikely to have any impact at all.

Let's look at an example where the optimization actually helps:

SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000 AND
    Flags & 0x1 = 0x1 AND
    Flags & 0x2 = 0x2 AND
    Flags & 0x4 = 0x4 AND
    Flags & 0x8 = 0x8

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1014]=(0) THEN NULL ELSE [Expr1015] END))
       |--Stream Aggregate(DEFINE:([Expr1014]=COUNT_BIG([T].[Data]), [Expr1015]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1013]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)),  WHERE:(([T].[Flags]&(1))=(1) AND ([T].[Flags]&(2))=(2) AND ([T].[Flags]&(4))=(4) AND ([T].[Flags]&(8))=(8)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

The Flags column contains the value 0xFF in every row.  Thus, every one of the bitwise AND predicates evaluates to true and this query returns about 2.5 million rows or 10% of the table.  Ordinarily, when faced with a query like this one, SQL Server would resort to a sequential scan of the entire table.  Indeed, if you try this query without the extra bitwise filters, you will get a sequential scan.  However, SQL Server does not realize that these predicates are always true, estimates a much lower cardinality of less than 10,000 rows, and chooses a simple nested loops join plan.  Note that I would generally recommend against using predicates like these ones in a real world application precisely because they will lead to cardinality estimation errors and poor plans.

To see what effect the optimized nested loops join has, let's compare the above plan with an "un-optimized" nested loops join.  We can eliminate the optimization by using the following UPDATE STATISTICS statement to trick SQL Server into believing that the table is very small:

UPDATE STATISTICS T WITH ROWCOUNT = 1, PAGECOUNT = 1

I'll compare the above query with the following simpler query which uses essentially the same plan and touches the same data but has an "un-optimized" nested loops join:

SELECT SUM(Data)
FROM T WITH (INDEX (IRandKey))
WHERE RandKey < 100000000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END))
       |--Stream Aggregate(DEFINE:([Expr1009]=COUNT_BIG([T].[Data]), [Expr1010]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK]))
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

We can reset the statistics using the following statement:

UPDATE STATISTICS T WITH ROWCOUNT = 25600000, PAGECOUNT = 389323

As in my last post, I'm going to simulate a larger table by reducing the memory available to the server to 1 GByte with SP_CONFIGURE 'MAX SERVER MEMORY' and I'm also going to flush the buffer pool between runs with DBCC DROPCLEANBUFFERS.

Note that you will NOT want to run these statements on a production server.

I ran both of the above queries with three different constants.  Here are my results.  Keep in mind that these results depend greatly on the specific hardware.  If you try this experiment, your results may vary.

 

Execution Time

Increase

OPTIMIZED

"un-OPTIMIZED"

Constant

10,000,000
(1% of rows)

6.5 minutes

26 minutes

4x

100,000,000
(10% of rows)

10.4 minutes

4.3 hours

25x

250,000,000
(25% of rows)

11.3 minutes

10.6 hours

56x

Clearly the optimized nested loops join can have a huge impact on performance.  Moreover, as the plan touches more rows the benefit of the optimization grows dramatically.  Although a full scan or a nested loops join with an explicit sort would be faster, the optimized nested loops join really is a safety net protecting against a much worse alternative.

Posted by craigfr | 0 Comments
Filed under: ,

Optimizing I/O Performance by Sorting – Part 2

In my last post, I discussed how SQL Server can use sorts to transform random I/Os into sequential I/Os.  In this post, I'll demonstrate directly how such a sort can impact performance.  For the following experiments, I'll use the same 3 GByte database that I created last week.

The system I'm using to run this test has 8 GBytes of memory.  To exaggerate the performance effects and simulate an even larger table that does not fit in main memory, I'm going to adjust the ‘MAX SERVER MEMORY' SP_CONFIGURE option to allow SQL Server to use just 1 GByte of memory.  I'm going to use CHECKPOINT to ensure that the newly created database is completely flushed to disk before running any experiments.  Finally, I'm going to run DBCC DROPCLEANBUFFERS before each test to ensure that none of the data is cached in the buffer pool between tests.

CHECKPOINT

EXEC SP_CONFIGURE 'SHOW ADVANCED OPTIONS', '1'
RECONFIGURE
EXEC SP_CONFIGURE 'MAX SERVER MEMORY', '1024'
RECONFIGURE

DBCC DROPCLEANBUFFERS

Note that you will NOT want to run these statements on a production server.

As I discussed last week, SQL Server can use one of three plans for the following query depending on the value of the constant:

SELECT SUM(Data)
FROM T
WHERE RandKey < constant

To recap, if the constant is small, SQL Server uses a non-clustered index seek and a bookmark lookup.  If the constant is large, SQL Server uses a clustered index scan to avoid performing many random I/Os.  Finally,  if the constant is somewhere in the middle, SQL Server uses the non-clustered index seek but sorts the rows prior to performing the bookmark lookup to reduce the number of random I/Os.  You can review last week's post to see examples of each of these plans.  I'm going to focus on the third and final plan with the sort.

To demonstrate the benefit of the sort, I need to be able to run the same query with and without the sort.  A simple way to make SQL Server remove the sort is to use the following UPDATE STATISTICS statement to trick SQL Server into believing that the table is really small.  To ensure that I still get the plan with the non-clustered index seek and the bookmark lookup, I need to add an INDEX hint.  I'm also adding a RECOMPILE query hint to ensure that SQL Server generates a new plan after I've altered the statistics.

UPDATE STATISTICS T WITH ROWCOUNT = 1, PAGECOUNT = 1

SELECT SUM(Data)
FROM T WITH (INDEX (IRandKey))
WHERE RandKey < constant
OPTION (RECOMPILE)

I can also reset the statistics using the following statement:

UPDATE STATISTICS T WITH ROWCOUNT = 25600000, PAGECOUNT = 389323

Here is an example of the default plan with the real statistics and with the sort:

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
       |--Stream Aggregate(DEFINE:([Expr1010]=COUNT_BIG([T].[Data]), [Expr1011]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1009]) WITH UNORDERED PREFETCH)
                 |--Sort(ORDER BY:([T].[PK] ASC))
                 |    |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (2000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

Here is an example of the plan after running UPDATE STATISTICS and without the sort:

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END))
       |--Stream Aggregate(DEFINE:([Expr1009]=COUNT_BIG([T].[Data]), [Expr1010]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK]))
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (2000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

Here are my results running this query with two values of the constant both with and without the sort.  Keep in mind that these results depend greatly on the specific hardware.  If you try this experiment, your results may vary.

 

Execution Time

% Increase

with Sort

without Sort

Constant

2,000,000
(0.2% of rows)

91 seconds

352 seconds

286%

4,000,000
(0.4% of rows)

97 seconds

654 seconds

574%

% Increase

100%

6%

86%

 

There are a two points worth noting regarding these results.  First, it should be very clear that the plan with the sort is significantly faster (up to 7 times faster) than the plan without the sort.  This result clearly shows the benefit of sequential vs. random I/Os.  Second, doubling the number of rows touched had hardly any effect on the execution time for the plan with the sort but nearly doubled the execution time for the plan without the sort.  Adding additional I/Os to the plan with the sort adds only a small incremental cost since the I/Os are sequential and the disk head will pass over the required data exactly once either way.  Adding additional I/Os to the plan without the sort adds additional disk seeks and increases the execution time proportionately to the increase in the number of rows.  In fact, if the constant is increased further, the execution time of the plan with the sort will continue to increase only gradually with the execution time of the plan without the sort will continue to increase rapidly.

Posted by craigfr | 2 Comments
Filed under: ,

Optimizing I/O Performance by Sorting – Part 1

In this post from last year, I discussed how random I/Os are slower than sequential I/Os (particularly for conventional rotating hard drives).  For this reason, SQL Server often favors query plans that perform sequential scans of an entire table over plans that perform random lookups of only a portion of a table.  (See the last example in this post for a simple demonstration.)  In other cases, instead of performing a sequential scan, SQL Server introduces a sort operator whose sole purpose is to convert random I/Os into sequential I/Os.

Let's look at an example of such a sort.  To measure the performance effects, we'll need a reasonably large table.  The following script creates a 25.6 million row table that consumes about 3 GBytes of storage.

CREATE DATABASE IOTest
    ON ( NAME = IOTest_Data, FILENAME = '...\IOTest_Data.mdf', SIZE = 4 GB )
    LOG ON ( NAME = IOTest_Log, FILENAME = '...\IOTest_Log.ldf', SIZE = 200 MB )
GO
ALTER DATABASE IOTest SET RECOVERY SIMPLE
GO
USE IOTest
GO
CREATE TABLE T (
    PK INT IDENTITY PRIMARY KEY,
    RandKey INT,
    Flags TINYINT,
    Data INT,
    Pad CHAR(100))
GO
SET NOCOUNT ON
DECLARE @I INT
SET @I = 0
WHILE @I < 100000
  BEGIN
    WITH
      X2 (R) AS ( SELECT RAND() UNION ALL SELECT RAND() ),
      X4 (R) AS ( SELECT R FROM X2 UNION ALL SELECT R FROM X2 ),
      X8 (R) AS ( SELECT R FROM X4 UNION ALL SELECT R FROM X4 ),
      X16 (R) AS ( SELECT R FROM X8 UNION ALL SELECT R FROM X8 ),
      X32 (R) AS ( SELECT R FROM X16 UNION ALL SELECT R FROM X16 ),
      X64 (R) AS ( SELECT R FROM X32 UNION ALL SELECT R FROM X32 ),
      X128 (R) AS ( SELECT R FROM X64 UNION ALL SELECT R FROM X64 ),
      X256 (R) AS ( SELECT R FROM X128 UNION ALL SELECT R FROM X128 )
    INSERT T (RandKey, Flags, Data, Pad)
        SELECT R * 1000000000, 0xFF, 1, '' FROM X256
    SET @I = @I + 1
  END
GO
CREATE INDEX IRandKey on T (RandKey, Flags)
GO

Due to the fixed width Pad column, each row of T consumes 113 bytes (plus overhead).  Roughly 65 rows fit on a single 8 Kbyte page.  (The Flags column is unused in this example, but I will make use of it in a subsequent post.)

The RandKey column, as the name suggests, contains random values.  Notice that we have a non-clustered index on this column.  Given a predicate on the RandKey column, SQL Server can use this index to fetch qualifying rows from the table.  However, because the values in this column are random, the selected rows will be scattered randomly throughout the clustered index.

If we select just a few rows from the table using a filter on RandKey, SQL Server will use the non-clustered index:

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END))
       |--Stream Aggregate(DEFINE:([Expr1011]=COUNT_BIG([T].[Data]), [Expr1012]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1010]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (1000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

The non-clustered index seek selects a few rows (the use of random keys means that the exact number may vary each time the table is loaded) and looks them up in the clustered index to get the value of the Data column for the SUM aggregate.  The non-clustered index seek is very efficient - it likely touches only one page - but the clustered index seek generates a random I/O for each row.

If we select a large number of rows, SQL Server recognizes that the random I/Os are too expensive and switches to a clustered index scan:

SELECT SUM(Data)
FROM T
WHERE RandKey < 10000000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END))
       |--Stream Aggregate(DEFINE:([Expr1009]=COUNT_BIG([T].[Data]), [Expr1010]=SUM([T].[Data])))
            |--Clustered Index Scan(OBJECT:([T].[PK__T__...]), WHERE:([T].[RandKey]<(10000000)))

This query touches only 1% of the data.  Still, the query is going to touch more than half of the pages in the clustered index so it is faster to scan the entire clustered index than to perform on the order of 256,000 random I/Os.

Somewhere in between these two extremes things get a little more interesting:

SELECT SUM(Data)
FROM T
WHERE RandKey < 2500000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
       |--Stream Aggregate(DEFINE:([Expr1010]=COUNT_BIG([T].[Data]), [Expr1011]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1009]) WITH UNORDERED PREFETCH)
                 |--Sort(ORDER BY:([T].[PK] ASC))
                 |    |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (2500000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

This query touches a mere 0.25% of the data.  The plan uses the non-clustered index to avoid unnecessarily touching many rows.  Yet, performing 64,000 random I/Os is still rather expensive so SQL Server adds a sort.  By sorting the rows on the clustered index key, SQL Server transforms the random I/Os into sequential I/Os.  Thus, we get the efficiency of the seek - touching only those rows that qualify - with the performance of the sequential scan.

It is worth pointing out that sorting on the clustered index key will yield rows that are in the logical index order.  Due to fragmentation or due simply to the multiple layers of abstraction between SQL Server and the actual hard drives, there is no guarantee that the physical order on disk matches the logical order.

In my next post, I'll run some of these queries and demonstrate the performance implications of the sort.
Posted by craigfr | 4 Comments
Filed under: ,

What is the difference between sys.dm_db_index_usage_stats and sys.dm_db_index_operational_stats?

SQL Server includes two DMVs - sys.dm_db_index_usage_stats and sys.dm_db_index_operational_stats - that are extremely useful for monitoring which indexes are used as well as how and when they are used.  Both DMVs report similar statistics on information such as the number of scans, seeks, and updates to different indexes.  These DMVs are documented in Books Online (see here and here) and a simple Web search reveals numerous other postings about these DMVs.  However, in my own search, I did not find many direct explanations of the difference between these two valuable DMVs.  (You will find a short explanation halfway through this post on the Microsoft SQL Server Customer Advisory Team blog.)

The main difference between these DMVs is simple but important:

sys.dm_db_index_usage_stats records how many times the query optimizer uses an index in a plan.  This usage information is recorded again each time the plan is executed.  (Compiling a plan alone is not sufficient to record an index's usage.)  However, and this is the important part, for the purposes of computing the statistics, it does matter how many times the query processor executes the specific operator that references the index.  For that matter, it does not matter whether the query processor executes the operator at all.  Mere execution of the plan counts as a single usage for each index used by the plan.

sys.dm_db_index_operational_stats records how many times the storage engine executes a specific operation on the index.  These statistics do depend on how many times the query processor executes each operator.  If an operator is never executed, the storage engine does not perform any operations on the index and the DMV reports that the index was not used.  If an operator is executed multiple times, the storage engine performs multiple operations on the index and the DMV reports that the index was used multiple times.

Update (7/29/2009): The following paragraph is incorrect.  See this post for more information.

(Another less important difference between these DMVs is that sys.dm_db_index_usage_stats only reports on indexes that have been used at least once since the server was last restarted while sys.dm_db_index_operational_stats reports on all indexes regardless of whether they have been used.)

Let's try an example to see this difference in action.  I'll use the following simple schema:

CREATE TABLE T (A INT, B INT, C INT)
CREATE UNIQUE CLUSTERED INDEX TA ON T(A)
CREATE UNIQUE INDEX TB ON T(B)

As expected, immediately after creating this table, the stats are zero (or just non-existent):

SELECT index_id, user_seeks, user_scans, user_lookups, user_updates
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID('tempdb') and object_id = OBJECT_ID('tempdb..t')
ORDER BY index_id

SELECT index_id, range_scan_count, singleton_lookup_count
FROM sys.dm_db_index_operational_stats (DB_ID('tempdb'), OBJECT_ID('tempdb..t'), NULL, NULL)
ORDER BY index_id

index_id    user_seeks           user_scans           user_lookups         user_updates
----------- -------------------- -------------------- -------------------- --------------------
index_id    range_scan_count     singleton_lookup_count
----------- -------------------- ----------------------
1           0                    0
2           0                    0

Now suppose that we do a scan of the clustered index:

SELECT * FROM T

  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Repeating the DMV queries, we see that the clustered index shows one scan in both DMVs.  SQL Server records the scan even though the table contains no rows and the query returns an empty result:

index_id    user_seeks           user_scans           user_lookups         user_updates
----------- -------------------- -------------------- -------------------- --------------------
1           0                    1                    0                    0
index_id    range_scan_count     singleton_lookup_count
----------- -------------------- ----------------------
1           1                    0
2           0                    0

Next let's try a singleton lookup on the clustered index:

SELECT * FROM T WHERE A = 1

  |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[T].[TA]), SEEK:([tempdb].[dbo].[T].[A]=CONVERT_IMPLICIT(int,[@1],0)) ORDERED FORWARD)

Again the table contains no rows and the query returns an empty result.  Nevertheless, the DMVs now report one seek and one singleton lookup:

index_id    user_seeks           user_scans           user_lookups         user_updates
----------- -------------------- -------------------- -------------------- --------------------
1           1                    1                    0                    0
index_id    range_scan_count     singleton_lookup_count
----------- -------------------- ----------------------
1           1                    1
2           0                    0

(Keep in mind that the DMV results are cumulative so you need to subtract the previous values from the current values as you run each of these experiments.  Thus, we can disregard the scan that was already reported by the previous example.)

Now let's try something a little more interesting.  Let's run a bookmark lookup:

SELECT * FROM T WHERE B = 1

  |--Nested Loops(Inner Join, OUTER REFERENCES:([tempdb].[dbo].[T].[A]))
       |--Index Seek(OBJECT:([tempdb].[dbo].[T].[TB]), SEEK:([tempdb].[dbo].[T].[B]=CONVERT_IMPLICIT(int,[@1],0)) ORDERED FORWARD)
       |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[T].[TA]), SEEK:([tempdb].[dbo].[T].[A]=[tempdb].[dbo].[T].[A]) LOOKUP ORDERED FORWARD)

As expected sys.dm_db_index_usage_stats reports a seek on index TB (index id 2) and a bookmark lookup on the clustered index (index id 1).  However, sys.dm_db_index_operational_stats reports only the singleton lookup on index TB but does not report any new activity on the clustered index:

index_id    user_seeks           user_scans           user_lookups         user_updates
----------- -------------------- -------------------- -------------------- --------------------
1           1                    1                    1                    0
2           1                    0                    0                    0
index_id    range_scan_count     singleton_lookup_count
----------- -------------------- ----------------------
1           1                    1
2           0                    1

To understand what has happened, recall how a nested loops join works.  The server executes the seek (the singleton lookup) on index TB and, as in the previous example, both DMVs are updated even though the seek returns no rows.  However, since the seek on index TB returns no rows, the nested loops join does not execute the clustered index seek (i.e., the bookmark lookup).  The server updates sys.dm_db_index_usage_stats to indicate that it executed a query plan that includes a bookmark lookup on table T, but does not update sys.dm_db_index_operational_stats since the query did not actually perform any bookmark lookups.

Next, let's insert three rows into the table and run another bookmark lookup experiment.  I'm using a hint to force a bookmark lookup plan.  Without the hint, the optimizer would simply use a clustered index scan since the query returns all three rows in the table:

INSERT T VALUES (0, 0, 0), (1, 1, 1), (2, 2, 2)
SELECT * FROM T WITH (INDEX (TB))

  |--Nested Loops(Inner Join, OUTER REFERENCES:([tempdb].[dbo].[T].[A]))
       |--Index Scan(OBJECT:([tempdb].[dbo].[T].[TB]))
       |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[T].[TA]), SEEK:([tempdb].[dbo].[T].[A]=[tempdb].[dbo].[T].[A]) LOOKUP ORDERED FORWARD)

This time sys.dm_db_index_usage_stats reports a scan on index TB and a bookmark lookup on the clustered index (plus the updates from the insert statement).  But, sys.dm_db_index_operational_stats reports a scan on index TB and three bookmark lookups on the clustered index:

index_id    user_seeks           user_scans           user_lookups         user_updates
----------- -------------------- -------------------- -------------------- --------------------
1           1                    1                    2                    1
2           1                    1                    0                    1
index_id    range_scan_count     singleton_lookup_count
----------- -------------------- ----------------------
1           1                    4
2           1                    1

When the server executes the above query, it runs the clustered index seek three times - once for each row returned by the index scan.  We ran the query only once but it performed three bookmark lookups.  Thus, as in the prior example, the server updates sys.dm_db_index_usage_stats to indicate that it executed a query plan that includes a bookmark lookup on table T, but unlike the prior example, it updates sys.dm_db_index_operational_stats to indicate that the query performed three actual bookmark lookups.

I've used bookmark lookups in the above examples, but any nested loops join will produce similar results.  At this point, it should be clear that the statistics returned by these two DMVs can differ dramatically.

So, what is the important takeaway from all of these examples?   Don't expect the data reported by these two DMVs to match.  sys.dm_db_index_usage_stats tells us the proportion of query plans that were executed that use various indexes.  This information is useful for concluding how many of the executed query plans might be affected if we drop an index but it does not tell us how many actual operations are performed using each index.  sys.dm_db_index_operational_stats, on the other hand, tells us how often the indexes are actually used during the execution of plans and, thus, which indexes are directly contributing to server performance.  But, even if sys.dm_db_index_operational_stats indicates that an index is not used very often (or perhaps even that an index is never used), do not automatically conclude that you can drop the index.  First, be sure that sys.dm_db_index_usage_stats indicates that no queries depend on the index.  In some cases, the presence of an index could change a query plan for the better even though the index itself is not used when the plan is executed.
Posted by craigfr | 4 Comments
Filed under:

Random Prefetching

In my last post, I explained the importance of asynchronous I/O  and described how SQL Server uses sequential read ahead to boost the performance of scans.  In this post, I'll discuss how SQL Server uses random prefetching.  Let's begin with a simple example of a query plan that performs many random I/Os.  As in my prior post, all of the examples in this post use a 1GB scale factor TPC-H database.  The following query returns the number of line items associated with each order placed on March 15, 1998:

SELECT O_ORDERKEY, COUNT(*)
FROM ORDERS JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY
WHERE O_ORDERDATE = '1998-03-15'
GROUP BY O_ORDERKEY

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(GROUP BY:([ORDERS].[O_ORDERKEY]) DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([ORDERS].[O_ORDERKEY], [Expr1011]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Clustered Index Seek(OBJECT:([ORDERS].[O_ORDERDATE_CLUIDX]), SEEK:([ORDERS].[O_ORDERDATE]='1998-03-15') ORDERED FORWARD)
                 |--Index Seek(OBJECT:([LINEITEM].[L_ORDERKEY_IDX]), SEEK:([LINEITEM].[L_ORDERKEY]=[ORDERS].[O_ORDERKEY]) ORDERED FORWARD)

This query plan uses an index nested loops join.  The clustered index seek on the ORDERS table returns the 661 orders that were placed on March 15, 1998.  For each of these 661 orders, SQL Server performs an index seek on the LINEITEM table to lookup the records associated with this order.  Each of these index seeks potentially represents a series of random I/Os to navigate from the root of the B-tree index to the leaf page(s) where the records for that order are stored.  To minimize the cost of these I/Os, SQL Server enhances the nested loops join with prefetching.  (Notice the WITH UNORDERED PREFETCH keywords associated with the nested loops join.)  The prefetching mechanism peers ahead in the clustered index seek on the ORDERS table and issues asynchronous I/Os for the pages that will ultimately be needed by the index seek on the LINEITEM table.  As in the sequential read ahead scenario, we can see the prefetching in action by checking the output of SET STATISTICS IO ON.  Look at the read-ahead reads for the LINEITEM table:

Table 'LINEITEM'. Scan count 661, logical reads 5165, physical reads 2, read-ahead reads 5000, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 15, physical reads 2, read-ahead reads 19, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

You may have noticed that the prefetch in this example was UNORDERED.  Indeed, there are two types of prefetch:  UNORDERED and ORDERED.  Although a nested loops join ordinarily preserves the order of the rows from its outer input (in this case the ORDERS table), a nested loops join WITH UNORDERED PREFETCH does not preserve order.  Instead, the rows are returned in the order that the asynchronous I/Os complete.  However, if the order of the rows is important, SQL Server can use a nested loops join WITH ORDERED PREFETCH.  For example, observe what happens to the plan if we add an ORDER BY clause to the above query:

SELECT O_ORDERKEY, COUNT(*)
FROM ORDERS JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY
WHERE O_ORDERDATE = '1998-03-15'
GROUP BY O_ORDERKEY
ORDER BY O_ORDERKEY

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(GROUP BY:([ORDERS].[O_ORDERKEY]) DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([ORDERS].[O_ORDERKEY], [Expr1011]) WITH ORDERED PREFETCH)
                 |--Sort(ORDER BY:([ORDERS].[O_ORDERKEY] ASC))
                 |    |--Clustered Index Seek(OBJECT:([ORDERS].[O_ORDERDATE_CLUIDX]), SEEK:([ORDERS].[O_ORDERDATE]='1998-03-15 00:00:00.000') ORDERED FORWARD)
                 |--Index Seek(OBJECT:([LINEITEM].[L_ORDERKEY_IDX]), SEEK:([LINEITEM].[L_ORDERKEY]=[ORDERS].[O_ORDERKEY]) ORDERED FORWARD)

Notice that SQL Server chooses to push the sort below the nested loops join.  For this sort to satisfy the ORDER BY clause, the nested loops join must preserve the order of the rows that it returns.  Thus, this time SQL Server uses a nested loops join WITH ORDERED PREFETCH.

SQL Server can also use random prefetching to speed up bookmark lookups and certain update and delete statements.  For instance, consider the following two statements:

SELECT *
FROM LINEITEM
WHERE L_ORDERKEY BETWEEN 5000000 AND 5001000

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Uniq1002], [LINEITEM].[L_SHIPDATE], [Expr1004]) OPTIMIZED WITH UNORDERED PREFETCH)
       |--Index Seek(OBJECT:([LINEITEM].[L_ORDERKEY_IDX]), SEEK:([LINEITEM].[L_ORDERKEY] >= (5000000) AND [LINEITEM].[L_ORDERKEY] <= (5001000)) ORDERED FORWARD)
       |--Clustered Index Seek(OBJECT:([LINEITEM].[L_SHIPDATE_CLUIDX]), SEEK:([LINEITEM].[L_SHIPDATE]=[LINEITEM].[L_SHIPDATE] AND [Uniq1002]=[Uniq1002]) LOOKUP ORDERED FORWARD)

UPDATE LINEITEM
SET L_DISCOUNT = 0.1
WHERE L_ORDERKEY BETWEEN 5000000 AND 5001000

  |--Clustered Index Update(OBJECT:([LINEITEM].[L_SHIPDATE_CLUIDX]), SET:([LINEITEM].[L_DISCOUNT] = RaiseIfNull([ConstExpr1010])) WITH UNORDERED PREFETCH)
       |--Compute Scalar(DEFINE:([ConstExpr1010]=CONVERT_IMPLICIT(money,[@1],0)))
            |--Top(ROWCOUNT est 0)
                 |--Index Seek(OBJECT:([LINEITEM].[L_ORDERKEY_IDX]), SEEK:([LINEITEM].[L_ORDERKEY] >= [@2] AND [LINEITEM].[L_ORDERKEY] <= [@3]) ORDERED FORWARD)

Both plans use a non-clustered index on the LINEITEM table to identify rows that match the L_ORDERKEY predicate.  In the case of the SELECT statement, SQL Server performs a bookmark lookup - recall that a bookmark lookup is just a special case of a nested loops join - to fetch the columns of the LINEITEM table that are not covered by the non-clustered index.  In the case of the UPDATE statement, SQL Server needs to locate the correct page and row in the clustered index and update the L_DISCOUNT column.  The resulting I/O sequence is the same as the bookmark lookup.  In both cases, to minimize the cost of the I/Os, SQL Server adds prefetching to the plan.  Just as in the original example above, the prefetch mechanism peers ahead in the non-clustered index seek on the LINEITEM table and issues asynchronous I/Os for the pages of the clustered index that will be needed.

For systems with many hard drives, random prefetching can dramatically improve performance.  However, prefetching can adversely affect concurrency as I explained in this post.
Posted by craigfr | 1 Comments
Filed under: , ,

Sequential Read Ahead

Balancing CPU and I/O throughput is essential to achieve good overall performance and to maximize hardware utilization.  SQL Server includes two asynchronous I/O mechanisms - sequential read ahead and random prefetching - that are designed to address this challenge.

To understand why asynchronous I/O is so important, consider the CPU to I/O performance gap.  The memory subsystem on a modern CPU can deliver data sequentially at roughly 5 Gbytes per second per socket (or for non-NUMA machines for all sockets sharing the same bus) and (depending on how you measure it) can fetch random memory locations at roughly 10 to 50 million accesses per second.  By comparison, a high end 15K SAS hard drive can read only 125 Mbytes per second sequentially and can perform only 200 random I/Os per second (IOPS).  Solid State Disks (SSDS) can reduce the gap between sequential and random I/O performance by eliminating the moving parts from the equation, but a performance gap remains.  In an effort to close this performance gap, it is not uncommon for servers to have a ratio of 10 or more drives for every CPU.  (It is also important to consider and balance the entire I/O subsystem including the number and type of disk controllers not just the drives themselves but that is not the focus of this post.)

Unfortunately, a single CPU issuing only synchronous I/Os can keep only one spindle active at a time.  For a single CPU to exploit the available bandwidth and IOPs of multiple spindles effectively the server must issue multiple I/Os asynchronously.  Thus, SQL Server includes the aforementioned read ahead and prefetching mechanisms.  In this post, I'll take a look at sequential read ahead.

When SQL Server performs a sequential scan of a large table, the storage engine initiates the read ahead mechanism to ensure that pages are in memory and ready to scan before they are needed by the query processor.  The read ahead mechanism tries to stay 500 pages ahead of the scan.  We can see the read ahead mechanism in action by checking the output of SET STATISTICS IO ON.  For example, I ran the following query on a 1GB scale factor TPC-H database.  The LINEITEM table has roughly 6 million rows.

SET STATISTICS IO ON

SELECT COUNT(*) FROM LINEITEM

Table 'LINEITEM'. Scan count 3, logical reads 22328, physical reads 3, read-ahead reads 20331, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Repeating the query a second time shows that the table is now cached in the buffer pool:

SELECT COUNT(*) FROM LINEITEM

Table 'LINEITEM'. Scan count 3, logical reads 22328, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

For sequential I/O performance, it is important to distinguish between allocation ordered and index ordered scans.  An allocation ordered scan tries to read pages in the order in which they are physically stored on disk while an index ordered scan reads pages according to the order in which the data on those index pages is sorted.  (Note that in many cases there are multiple levels of indirection such as RAID devices or SANS between the logical volumes that SQL Server sees and the physical disks.  Thus, even an allocation ordered scan may in fact not be truly optimally ordered.)  Although SQL Server tries to sort and read pages in allocation order even for an index ordered scan, an allocation ordered scan is generally going to be faster since pages are read in the order that they are written on disk with the minimal number of seeks.  Heaps have no inherent order and, thus, are always scanned in allocation order.  Indexes are scanned in allocation order only if the isolation level is read uncommitted (or the NOLOCK hint is used) and only if the query process does not request an ordered scan.  Defragmenting indexes can help to ensure that index ordered scans perform on par with allocation ordered scans.

In my next post, I'll take a look at random prefetching.
Posted by craigfr | 9 Comments
Filed under: ,

Dynamic Partition Elimination Performance

In this post on partitioned tables, I mentioned that SQL Server 2008 has a much more efficient implementation of dynamic partition elimination as compared to SQL Server 2005.  In response, a reader posted this comment asking how much dynamic partition elimination really costs in SQL Server 2005.  While I was sure that the answer is not much, I nonetheless decided to measure it and find out.

I wrote the following stored procedure which creates a table with the specified number of partitions:

CREATE PROCEDURE CreateTable @N INT
AS
BEGIN
    DECLARE @Cmd VARCHAR(8000)
    DECLARE @I INT

    SET @Cmd = 'CREATE PARTITION FUNCTION PF(INT) AS RANGE FOR VALUES (1'

    SET @I = 2
    WHILE @I < @N
        BEGIN
            SET @Cmd = @Cmd + ',' + CONVERT(VARCHAR(4),@I)
            SET @I = @I + 1
        END

    SET @Cmd = @Cmd + ')'

    EXECUTE (@Cmd)

    CREATE PARTITION SCHEME PS AS PARTITION PF ALL TO ([PRIMARY])

    CREATE TABLE T1 (A INT, B INT)
    CREATE CLUSTERED INDEX T1A ON T1(A) ON PS(A)
END

I then created a table with either 2, 10, 100, or 1000 partitions.  For each table, I ran the following batch:

SET NOCOUNT ON

DECLARE @T DATETIME
DECLARE @I INT, @J INT, @K INT
SET @I = 0
SET @J = 0
SET @T = GETDATE()
WHILE @J < 10000
    BEGIN
        SELECT @K = B FROM T1 WHERE A < @I
        SET @J = @J + 1
    END
SELECT DATEDIFF (MS, @T, GETDATE())

The main SELECT statement in this batch scans a single partition but, because the WHERE clause references a variable, SQL Server does not know this information at compile time.  This query is too fast to measure a single execution accurately.  Thus, I magnify the cost by repeating the query 10,000 times in a WHILE loop.  The SET NOCOUNT ON statement and the assignment in the SELECT statement merely serve to minimize client communication costs which would otherwise dominate the batch cost.

The table is empty which makes this test a worst case scenario.  The cost of the query is dominated by the cost of locating the correct partition.

Looking at the SQL Server 2005 STATISTICS PROFILE output for the 1000 partition table, we can see that SQL Server tests each partition before scanning only the one that is not eliminated:

0      1        |--Nested Loops(Inner Join, OUTER REFERENCES:([PtnIds1004]) PARTITION ID:([PtnIds1004]))
1      1             |--Filter(WHERE:([PtnIds1004]<=RangePartitionNew([@I],(0),(1),(2),(3),...
1000   1             |    |--Constant Scan(VALUES:(((1)),((2)),((3)),...
0      1             |--Clustered Index Seek(OBJECT:([T1].[T1A]), SEEK:([T1].[A] < [@I]) ORDERED FORWARD PARTITION ID:([PtnIds1004]))

By comparison, SQL Server 2008 seeks directly to the correct partition:

0      1        |--Clustered Index Seek(OBJECT:([T1].[T1A]), SEEK:([PtnId1000] >= (1) AND [PtnId1000] <= RangePartitionNew([@I],(0),(1),(2),(3),...

Here are the results.  In each case, before collecting any data, I ran the batch once to eliminate compilation overhead and warm up costs.  I then took the median of 5 runs.  Times are in milliseconds for the full 10,000 executions of the query.  Note that we cannot compare the SQL Server 2005 results directly to the SQL Server 2008 results as I ran these tests on two different machines.  However, the relative cost of adding partitions in SQL Server 2005 vs. SQL Server 2008 is clear.

# Partitions

SQL Server 2005

SQL Server 2008

2

233

216

10

313

203

100

1216

216

1000

18060

216

It is important to remember that these are times for 10,000 executions of the query.  Thus, even on SQL Server 2005 with 1000 partitions, the cost of a single execution of the query is only 1.8 milliseconds.  Moreover, if the table contains any real data and if the query returns even a modest number of rows, the cost of the query will be dominated by the actual data processing.

Posted by craigfr | 1 Comments
Filed under:

Partitioned Indexes in SQL Server 2008

In my last post, I looked at how SQL Server 2008 handles scans on partitioned tables.  I explained that SQL Server 2008 treats partitioned tables as tables with a logical index on the partition id column and that SQL Server 2008 implements partition elimination by performing a logical index seek on the partition id column.  Specifically, I showed some examples using a heap.  In this post, I'll continue this discussion and explore how SQL Server 2008 handles scans and seek on partitioned indexes.

Let's begin with a simple example:

CREATE PARTITION FUNCTION PF(INT) AS RANGE FOR VALUES (0, 10, 100)
CREATE PARTITION SCHEME PS AS PARTITION PF ALL TO ([PRIMARY])

CREATE TABLE T1 (A INT, B INT)
CREATE CLUSTERED INDEX T1A ON T1(A) ON PS(A)

As I noted in my last post, the SQL Server query processor logically treats this partitioned index as a multi-column index on ([PtnId], A).  We can scan this logical index just like any real index:

SELECT * FROM T1

  |--Clustered Index Scan(OBJECT:([T1].[T1A]))

This query uses the same plan as any other clustered index scan.  It implicitly scans all of the partitions.  Aside from the "Partitioned" attribute in the graphical and XML plans, there is no difference between this plan and a similar clustered index scan on a non-partitioned table.  We can also perform a seek on the logical index:

DECLARE @I INT
SELECT @I = 0
SELECT * FROM T1 WHERE A = @I

  |--Clustered Index Seek(OBJECT:([T1].[T1A]), SEEK:([PtnId1000]=RangePartitionNew([@I],(0),(0),(10),(100)) AND [T1].[A]=[@I]) ORDERED FORWARD)

Notice that SQL Server derives the predicate on the [PtnId] column from the explicit predicate on column A.  SQL Server then seeks on both columns of the logical index - [PtnId] and column A - just as if the logical index were a real index.  So far, everything is pretty straightforward.  Now, let's consider a slightly more complex scenario:

DECLARE @I INT
SELECT @I = 0
SELECT * FROM T1 WHERE A < @I

Both this and the prior example have predicates on column A.  In both cases, SQL Server derives a predicate on the [PtnId] column.  However, in the prior example, the derived predicate was an equality predicate that limited the seek to a single partition.  In this example, due to the inequality, the derived predicate is actually a range of partitions:

  |--Clustered Index Seek(OBJECT:([T1].[T1A]), SEEK:([PtnId1000] >= (1) AND [PtnId1000] <= RangePartitionNew([@I],(0),(0),(10),(100)) AND [T1].[A] < [@I]) ORDERED FORWARD)

If you are familiar with the rules on index seeks, you will notice something odd about this plan.  Ordinarily, SQL Server can only perform an index seek on the second column of a multi-column index if there exists an equality predicate on the first column.  Clearly, there is not an equality predicate on the [PtnId] column in this query and yet there is a predicate on column A.  What's going on?  The query processor supports a special type of index seek on partitioned tables.  Basically, the query processor first uses the predicate on the [PtnId] column to identify those partitions into which to seek and then seeks again using the predicate on column A.  Because the seek operator skips over some rows in each partition that do not match the predicate on column A, we refer to this type of seek as a skip scan.  Note that this operation really is a seek - it is not a residual predicate.  The storage engine only returns those rows that pass both predicates.  Also note that SQL Server 2008 only supports this skip scan functionality on the [PtnId] column of a partitioned table.

Now let's change the schema slightly:

CREATE TABLE T2 (A INT, B INT)
CREATE CLUSTERED INDEX T2A ON T2(A) ON PS(B)

This new table is indexed on column A but partitioned on column B.  Once again, SQL Server logically treats the index as a multi-column index on ([PtnId], B).

DECLARE @I INT
SELECT @I = 0
SELECT * FROM T2 WHERE A = @I

Since the table is partitioned on column B, we cannot use the predicate on column A to derive a predicate on the [PtnId] column.  Fortunately, the query processor can still perform a skip scan by adding a predicate on the [PtnId] column that explicitly scans all partitions:

  |--Clustered Index Seek(OBJECT:([T2].[T2A]), SEEK:([PtnId1000] >= (1) AND [PtnId1000] <= (4) AND [T2].[A]=[@I]) ORDERED FORWARD)

Finally, before I wrap up this post, I'd like to point out that, while it's not often useful, we can add an explicit predicate on the [PtnId] column of a table.  To do so, we use the $PARTITION function:

DECLARE @PtnId INT
SELECT @PtnId = 1
SELECT * FROM T1 WHERE $PARTITION.PF(A) = @PtnId

  |--Clustered Index Scan(OBJECT:([T1].[T1A]), SEEK:([PtnId1000]=[@PtnId]) ORDERED FORWARD)

Just as with the derived [PtnId] predicates, SQL Server can perform a seek on the [PtnId] column.  When using this syntax, it is important to use the correct $PARTITION function and the correct column or SQL Server will not recognize it as the [PtnId] column for this table.

Books Online has more information on these and other partitioned table changes in SQL Server 2008.

Partitioned Tables in SQL Server 2008

In this post, I introduced how SQL Server 2005 implements query plans on partitioned tables.  If you've read that post or used partitioned tables, you may recall that SQL Server 2005 uses a constant scan operator to enumerate the list of partition ids that need to be scanned.  As a refresher, here is the example from that post showing the plan for scanning a table with four partitions:

CREATE PARTITION FUNCTION PF(INT) AS RANGE FOR VALUES (0, 10, 100)
CREATE PARTITION SCHEME PS AS PARTITION PF ALL TO ([PRIMARY])
CREATE TABLE T (A INT, B INT) ON PS(A)

SELECT * FROM T

  |--Nested Loops(Inner Join, OUTER REFERENCES:([PtnIds1004]) PARTITION ID:([PtnIds1004]))
       |--Constant Scan(VALUES:(((1)),((2)),((3)),((4))))
       |--Table Scan(OBJECT:([t]))

SQL Server 2005 treats partitioned tables specially and creates special plans, such as the above one, for partitioned tables.  SQL Server 2008, on the other hand, for the most part treats partitioned tables as regular tables that just happen to be logically indexed on the partition id column.  For example, for the purposes or query optimization and query execution, SQL Server 2008 treats the above table not as a heap but as an index on [PtnId].  If we create a partitioned index (clustered or non-clustered), SQL Server 2008 logically adds the partition id column as the first column of the index.  For instance, if we create the following index:

CREATE INDEX TA ON T(A) ON PS(A)

SQL Server 2008 treats it not as a single column index on T(A), but rather as a multi-column or composite index on T([PtnId],A).  Note that the table and index are still stored physically as partitioned tables.  In this example, the table and non-clustered index are decomposed into four heaps and four non-clustered indexes.

By treating partitioned tables as indexes, SQL Server 2008 can frequently use much simpler query plans.  Let's look at the SQL Server 2008 plan for the above example:

  |--Table Scan(OBJECT:([T]))

This query plan is no different than what you might see if you scanned any other ordinary table!  So, how can we tell that the table is really partitioned and how can we tell what partitions the plan is going to scan?  Luckily, the graphical and XML plans identify partitioned tables by adding a "Partitioned" attribute to any scan, seek, or update operator that processes a partitioned table.  Moreover, the actual graphical plan and the STATISTICS XML output identify the exact partitions that the plan touches during execution.  For example, here is an excerpt from the STATISTICS XML output generated by running the above query:

<RunTimePartitionSummary>
  <PartitionsAccessed PartitionCount="4">
    <PartitionRange Start="1" End="4"/>
  </PartitionsAccessed>
</RunTimePartitionSummary>

This output shows that, as expected, the query plan scanned all four partitions of the table.  Now, suppose we insert a row into the second partition (up until now, the table has been empty) and run the following slightly modified query:

INSERT T VALUES (1, 1)

SELECT TOP 1 * FROM T

This query can stop executing as soon as it finds a single row.  It scans the first partition and finding no rows continues on to scan the second partition.  At that point, it finds a row and stops.  Thus, the query terminates after scanning only two partitions.  This outcome is clearly reflected in the STATISTICS XML output:

<RunTimePartitionSummary>
  <PartitionsAccessed PartitionCount="2">
    <PartitionRange Start="1" End="2"/>
  </PartitionsAccessed>
</RunTimePartitionSummary>

Now let's see how partition elimination works.  Let's start with static partition elimination:

SELECT * FROM T WHERE A < 100

  |--Table Scan(OBJECT:([T]), SEEK:([PtnId1001] >= (1) AND [PtnId1001] <= (3)),  WHERE:([T].[A]<(100)) ORDERED FORWARD)

Because SQL Server 2008 treats the partitioned table like an index on [PtnId], it can simply calculate the range of partitions that need to be scanned and perform a "seek" on those partitions.  It may seem a little strange to see a SEEK predicate on a table scan, but it just means that the plan is going to scan partitions 1 through 3.

And now, let's look at dynamic partition elimination:

DECLARE @I INT
SELECT @I = 0
SELECT * FROM T WHERE A < @I

  |--Table Scan(OBJECT:([T]), SEEK:([PtnId1001] >= (1) AND [PtnId1001] <= RangePartitionNew([@I],(0),(0),(10),(100))),  WHERE:([T].[A]<[@I]) ORDERED FORWARD)

This plan is identical to the prior plan except that the constant partition id 3 has been replaced by the RangePartitionNew function which computes the correct partition id from the variable @I.  To find out exactly which partitions were scanned, we can once again check the STATISTICS XML output:

<RunTimePartitionSummary>
  <PartitionsAccessed PartitionCount="1">
    <PartitionRange Start="1" End="1"/>
  </PartitionsAccessed>
</RunTimePartitionSummary>

Note that this plan computes exactly which partitions to scan.  Compare that to the SQL Server 2005 plan which evaluates a filter for each and every partition id:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([PtnIds1004]) PARTITION ID:([PtnIds1004]))
       |--Filter(WHERE:([PtnIds1004]<=RangePartitionNew([@i],(0),(0),(10),(100))))
       |    |--Constant Scan(VALUES:(((1)),((2)),((3)),((4))))
       |--Table Scan(OBJECT:([t]), WHERE:([t].[a]<[@i]) PARTITION ID:([PtnIds1004]))

While evaluating the filter repeatedly may not cost too much if we have only four partitions, it is certainly going to waste some cycles if we have many partitions!

In my next post, I'll take a look at how SQL Server 2008 handles scans and seeks on partitioned indexes.

Subqueries in BETWEEN and CASE Statements

Consider the following query:

CREATE TABLE T1 (A INT, B1 INT, B2 INT)
CREATE TABLE T2 (A INT, B INT)

SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) BETWEEN T1.B1 AND T1.B2

Observe that the subquery in this query only needs to be evaluated once for each row of T1.  Indeed running on SQL Server 2000, we get the following plan (with the portion of the plan corresponding to the subquery in bold):

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1004]))
       |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1014]=0) then NULL else [Expr1015]))
       |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1014]=COUNT_BIG([T2].[B]), [Expr1015]=SUM([T2].[B])))
       |         |--Sort(ORDER BY:([T2].[A] ASC))
       |              |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1004]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1004]))
                 |--Table Scan(OBJECT:([T1]))

Now let's look at the query plan we get (with SQL Server 2005 or SQL Server 2008):

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008], [Expr1014]))
       |--Merge Join(Inner Join, MERGE:([T2].[A])=([T2].[A]), RESIDUAL:([T2].[A]=[T2].[A]))
       |    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1026]=(0) THEN NULL ELSE [Expr1027] END))
       |    |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1026]=COUNT_BIG([T2].[B]), [Expr1027]=SUM([T2].[B])))
       |    |         |--Sort(ORDER BY:([T2].[A] ASC))
       |    |              |--Table Scan(OBJECT:([T2]))
       |    |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1028]=(0) THEN NULL ELSE [Expr1029] END))
       |         |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1028]=COUNT_BIG([T2].[B]), [Expr1029]=SUM([T2].[B])))
       |              |--Sort(ORDER BY:([T2].[A] ASC))
       |                   |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1014]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
                 |--Table Scan(OBJECT:([T1]))

Notice that the subquery is actually evaluated twice.  There are two scans of T2, two sorts, and two stream aggregates.  I've highlighted both sets of operators in bold.

Why is SQL Server evaluating the subquery twice?  The answer is that SQL Server transforms "X BETWEEEN Y AND Z" into "X <= Y AND X >= Z".  If as in this query, X is a subquery, the subquery is repeated:

SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) >= T1.B1 AND
    (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) <= T1.B2

This transformation occurs very early in the processing of the query.  Unfortunately, after that point, SQL Server 2005 and SQL Server 2008 do not realize that the subquery is a common subexpression and they evaluate it as if there were two completely different subqueries.

You may also have noticed that, instead of scanning T1 first and then evaluating the subquery for each row of T1, this plan actually scans T2 first.  This join order results from the optimizer decorrelating the subqueries.

So, how can we get SQL Server to evaluate the subquery only once?  There are actually multiple solutions and all involve rewriting the query to calculate the subquery separately from the BETWEEN clause so that when SQL Server transforms the BETWEEN clause, it does not also duplicate the subquery.  Here are some examples:

SELECT Q.A, Q.B1, Q.B2
FROM
    (
    SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
    FROM T1
    ) Q
WHERE SUM_B BETWEEN Q.B1 AND Q.B2

SELECT T1.*
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q
WHERE Q.SUM_B BETWEEN T1.B1 AND T1.B2

SELECT T1.*
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A AND Q.SUM_B BETWEEN T1.B1 AND T1.B2

All three of these rewrites produce the same plan:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008]))
       |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1016]=(0) THEN NULL ELSE [Expr1017] END))
       |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1016]=COUNT_BIG([T2].[B]), [Expr1017]=SUM([T2].[B])))
       |         |--Sort(ORDER BY:([T2].[A] ASC))
       |              |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1008]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
                 |--Table Scan(OBJECT:([T1]))

SQL Server also transforms a CASE statement of the form:

CASE X
    WHEN Y1 THEN Z1
    WHEN Y2 THEN Z2
    ...
    ELSE ZN
END

Into:

CASE
    WHEN X = Y1 THEN Z1
    WHEN X = Y2 THEN Z2
    ...
    ELSE ZN
END

Thus, CASE statements can yield the same problematic behavior if X is a subquery.  Unfortunately, with a CASE statement, the number of times that the subquery is reevaluated depends on the number WHEN clauses and can be quite large.  Here is a simple query that illustrates the problem:

SELECT *,
    CASE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A)
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1

Here is the SQL Server 2000 plan:

  |--Compute Scalar(DEFINE:([Expr1007]=If ([Expr1004]=[T1].[B1]) then 'B1' else If ([Expr1004]=[T1].[B2]) then 'B2' else NULL))
       |--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
            |--Table Scan(OBJECT:([T1]))
            |--Hash Match(Cache, HASH:([T1].[A]), RESIDUAL:([T1].[A]=[T1].[A]))
                 |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1020]=0) then NULL else [Expr1021]))
                      |--Stream Aggregate(DEFINE:([Expr1020]=COUNT_BIG([T2].[B]), [Expr1021]=SUM([T2].[B])))
                           |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))

And here is the SQL Server 2005 and SQL Server 2008 plan:

  |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1008]=[T1].[B1] THEN 'B1' ELSE CASE WHEN [Expr1014]=[T1].[B2] THEN 'B2' ELSE NULL END END))
       |--Nested Loops(Inner Join, PASSTHRU:([Expr1008]=[T1].[B1]), OUTER REFERENCES:([T1].[A]))
            |--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
            |    |--Table Scan(OBJECT:([T1]))
            |    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1030]=(0) THEN NULL ELSE [Expr1031] END))
            |         |--Stream Aggregate(DEFINE:([Expr1030]=COUNT_BIG([T2].[B]), [Expr1031]=SUM([T2].[B])))
            |              |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))
            |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1032]=(0) THEN NULL ELSE [Expr1033] END))
                 |--Stream Aggregate(DEFINE:([Expr1032]=COUNT_BIG([T2].[B]), [Expr1033]=SUM([T2].[B])))
                      |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))

Finally, the same solutions once again apply:

SELECT Q.A, Q.B1, Q.B2,
    CASE Q.SUM_B
        WHEN Q.B1 THEN 'B1'
        WHEN Q.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM
    (
    SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
    FROM T1
    ) Q

SELECT T1.*,
    CASE Q.SUM_B
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q

SELECT T1.*,
    CASE Q.SUM_B
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A

Posted by craigfr | 5 Comments
Filed under:

Implicit Conversions

In my last couple of posts, I wrote about how explicit conversions can lead to errors.  In this post, I'm going to take a look at some issues involving implicit conversions.  SQL Server adds implicit conversions whenever you mix columns, variables, and/or parameters with different (but compatible) data types in a single expression.  For example, if you try to compare INT and FLOAT columns, the INT must be converted to a FLOAT.  If you write "C_INT = C_FLOAT", SQL Server rewrites this expressions as "CONVERT(FLOAT,C_INT) = C_FLOAT".

When performing implicit conversions, SQL Server will try to choose the conversion that is least likely either to fail due to an overflow or to lose precision.  For example, a SMALLINT will be converted to an INT since all SMALLINTs can be converted to INTs without any data loss.  On the other hand, an INT will be converted to a REAL since all INTs can be converted to REALs but not vice-versa.  However, the conversion is potentially lossy since some INTs contain more digits than can be represented by a REAL.

Note that some data types can only be combined in a single expression by using an explicit conversion and that some data types are not compatible at all with or without an explicit conversion.  This Books Online page includes a compatibility matrix showing all of the possible data type combinations.  For the remainder of this post, I'm going to focus only on implicit conversions.

To get started, I'd like to show an example that did not work so well on SQL Server 2000:

CREATE TABLE T (C INT)
CREATE UNIQUE CLUSTERED INDEX TC ON T(C)

INSERT T VALUES (1000000000)
INSERT T VALUES (1000000001)

First, let's consider a query that uses a clustered index scan:

DECLARE @V REAL
SET @V = 1E9
SELECT * FROM T WITH (INDEX(0)) WHERE C = @V

C           
----------- 
1000000000
1000000001

At first glance, this may seem like an incorrect result.  The second row does not appear to match the predicate.  However, if we check the plan, we see that SQL added an implicit convert to the predicate:

  |--Clustered Index Scan(OBJECT:([T].[TC]), WHERE:(Convert([T].[C])=[@V]))

Moreover, if we check the results of the conversion, we can see that it is lossy.  Both INTs convert to the same REAL value.  The above results now make more sense:

SELECT *, CONVERT(REAL,C) REAL_C FROM T

C           REAL_C                   
----------- ------------------------ 
1000000000  1.0E+9
1000000001  1.0E+9

Now, let's consider the same query with a clustered index seek:

DECLARE @V REAL
SET @V = 1E9
SELECT * FROM T WITH (INDEX(1)) WHERE C = @V

C           
----------- 
1000000000

Hold it!  What happened to the second row?  Let's check the plan:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1002], [Expr1003], [Expr1004]))
       |--Compute Scalar(DEFINE:([Expr1002]=Convert([@V])-1, [Expr1003]=Convert([@V])+1, [Expr1004]=If (Convert([@V])-1=NULL) then 0 else 6|If (Convert([@V])+1=NULL) then 0 else 10))
       |    |--Constant Scan
       |--Clustered Index Seek(OBJECT:([T].[TC]), SEEK:([T].[C] > [Expr1002] AND [T].[C] < [Expr1003]),  WHERE:(Convert([T].[C])=[@V]) ORDERED FORWARD)

This plan looks rather complicated, but aside from unfortunately getting a wrong answer (at least according to the conversion semantics we just discussed), it is really quite simple.  To perform an index seek, SQL Server must have key values that match the data type stored in the index.  SQL Server cannot perform an index seek on an INT index using a REAL key.  So, SQL Server must convert the REAL variable to an INT.  Because, as we've already seen, conversions can be lossy, SQL Server also expands the range of values returned by the index seek by subtracting and adding one to the result of the conversion.  Basically, after substituting the value of the parameter, this plan is the same as running the following query:

SELECT * FROM T WHERE (C > 999999999 AND C < 1000000001) AND CONVERT(REAL,C) = 1E9

Notice that the original predicate with the convert is retained in case the index seek returns too many rows and some must be filtered out.  In most cases, this algorithm works just fine; I've deliberately selected a scenario that does not work so well.  In addition to potentially omitting results, this algorithm may also harm performance.  For instance, consider this example:

CREATE TABLE T (C NUMERIC(12,4))
CREATE UNIQUE CLUSTERED INDEX TC ON T(C)

INSERT T VALUES (0.0001)
INSERT T VALUES (0.0002)
INSERT T VALUES (0.0003)
...

DECLARE @P REAL
SET @P = 1.0000
SELECT * FROM T WITH (INDEX(1)) WHERE C = @P

Suppose that table T includes many values in the range 0.0000 to 2.0000.  In this example, the conversion is lossless.  Only rows with the value 1.0000 should be returned.  Unfortunately, the index seek still returns all of the rows in the range and the filter with the convert discards all of them except for those with the value 1.0000.

Now let's see how SQL Server 2005 (and 2008) handle these same queries.  Let's start with the clustered index scan:

DECLARE @P REAL
SET @P = 1E9
SELECT * FROM T WITH (INDEX(0)) WHERE C = @P

  |--Clustered Index Scan(OBJECT:([T].[TC]), WHERE:(CONVERT_IMPLICIT(real(24),[T].[C],0)=[@P]))

This is the same plan that we got from SQL Server 2000 and we get the same result (both rows).  The only difference is that SQL Server 2005 indicates that the convert is implicit and provides the type of the conversion.  Next let's try the clustered index seek:

DECLARE @P REAL
SET @P = 1E9
SELECT * FROM T WITH (INDEX(1)) WHERE C = @P

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1005], [Expr1006], [Expr1004]))
       |--Compute Scalar(DEFINE:(([Expr1005],[Expr1006],[Expr1004])=GetRangeThroughConvert([@P],[@P],(62))))
       |    |--Constant Scan
       |--Clustered Index Seek(OBJECT:([T].[TC]), SEEK:([T].[C] > [Expr1005] AND [T].[C] < [Expr1006]),  WHERE:(CONVERT_IMPLICIT(real(24),[T].[C],0)=[@P]) ORDERED FORWARD)

This plan is similar to the SQL Server 2000 plan with one key difference.  The conversion of the seek key from REAL to INT is performed by an internal function GetRangeThroughConvert which correctly determines the exact range of values to seek.  A little experimentation will show that, in this example, the range works out to all values between 999,999,968 and 1,000,000,479.  On the other hand, for the example with the numeric column, the "range" works out to exactly 1.0000.  This solution is both more accurate and more efficient than the SQL Server 2000 algorithm.

Finally, let's take a look at the performance implications of the new plan: 

CREATE TABLE T1 (C_INT INT, C_REAL REAL)
CREATE CLUSTERED INDEX T1_C_REAL ON T1(C_REAL)

CREATE TABLE T2 (C_INT INT, C_REAL REAL)
CREATE CLUSTERED INDEX T2_C_INT ON T2(C_INT)

SET NOCOUNT ON
DECLARE @I INT
SET @I = 0
WHILE @I < 100000
  BEGIN
    INSERT T1 VALUES (@I, @I)
    INSERT T2 VALUES (@I, @I)
    SET @I = @I + 1
  END

Here are three join plans.  I'm using a join hint to ensure that I do not get a hash join which uses a completely different algorithm.  I'm using COUNT(*) to eliminate the overhead of transferring output rows to the client and an OPTION hint to avoid parallelism.  The first plan joins the two INT columns.  The second plan joins the two REAL columns.  The third plan joins the INT and REAL columns and uses the same algorithm with the GetRangeThroughConvert function as the simpler query above.

SELECT COUNT(*)
FROM T1 INNER LOOP JOIN T2 ON T1.C_INT = T2.C_INT
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[C_INT], [Expr1011]) WITH UNORDERED PREFETCH)
                 |--Clustered Index Scan(OBJECT:([T1].[T1_C_REAL]))
                 |--Clustered Index Seek(OBJECT:([T2].[T2_C_INT]), SEEK:([T2].[C_INT]=[T1].[C_INT]) ORDERED FORWARD)

SELECT COUNT(*)
FROM T2 INNER LOOP JOIN T1 ON T2.C_REAL = T1.C_REAL
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[C_REAL], [Expr1011]) WITH UNORDERED PREFETCH)
                 |--Clustered Index Scan(OBJECT:([T2].[T2_C_INT]))
                 |--Clustered Index Seek(OBJECT:([T1].[T1_C_REAL]), SEEK:([T1].[C_REAL]=[T2].[C_REAL]) ORDERED FORWARD)

SELECT COUNT(*)
FROM T1 INNER LOOP JOIN T2 ON T1.C_REAL = T2.C_INT
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1014],0)))
       |--Stream Aggregate(DEFINE:([Expr1014]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[C_REAL]))
                 |--Clustered Index Scan(OBJECT:([T1].[T1_C_REAL]))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1012], [Expr1013], [Expr1011]))
                      |--Compute Scalar(DEFINE:(([Expr1012],[Expr1013],[Expr1011])=GetRangeThroughConvert([T1].[C_REAL],[T1].[C_REAL],(62))))
                      |    |--Constant Scan
                      |--Clustered Index Seek(OBJECT:([T2].[T2_C_INT]), SEEK:([T2].[C_INT] > [Expr1012] AND [T2].[C_INT] < [Expr1013]),  WHERE:([T1].[C_REAL]=CONVERT_IMPLICIT(real(24),[T2].[C_INT],0)) ORDERED FORWARD)

I measured the times using SET STATISTICS TIME ON.  Here are the results I got on my system:

SQL Server Execution Times:   CPU time = 750 ms,  elapsed time = 741 ms.

SQL Server Execution Times:   CPU time = 735 ms,  elapsed time = 742 ms.

SQL Server Execution Times:   CPU time = 1609 ms,  elapsed time = 1605 ms.

Notice that each of first two queries with the joins on the matching types complete in 740ms while the third query with the join on the mismatched types takes more than twice as long at 1600ms!  The moral of this story?  Try to stick to matching types whenever possible.

Posted by craigfr | 1 Comments
Filed under:

Query Processing Presentation

Last week, I had the opportunity to talk to the New England SQL Server Users Group.  I would like to thank the group for inviting me, Adam Machanic for organizing the event, and Red Gate for sponsoring it.  My talk was an introduction to query processing, query execution, and query plans in SQL Server.  I've had a request for the slides, so here they are.  Enjoy!

Posted by craigfr | 0 Comments
Attachment(s): QPTalk.pdf
More Posts Next page »
 
Page view tracker