Welcome to MSDN Blogs Sign in | Join | Help

Partitioned Tables in SQL Server 2008

In this post, I introduced how SQL Server 2005 implements query plans on partitioned tables.  If you've read that post or used partitioned tables, you may recall that SQL Server 2005 uses a constant scan operator to enumerate the list of partition ids that need to be scanned.  As a refresher, here is the example from that post showing the plan for scanning a table with four partitions:

CREATE PARTITION FUNCTION PF(INT) AS RANGE FOR VALUES (0, 10, 100)
CREATE PARTITION SCHEME PS AS PARTITION PF ALL TO ([PRIMARY])
CREATE TABLE T (A INT, B INT) ON PS(A)

SELECT * FROM T

  |--Nested Loops(Inner Join, OUTER REFERENCES:([PtnIds1004]) PARTITION ID:([PtnIds1004]))
       |--Constant Scan(VALUES:(((1)),((2)),((3)),((4))))
       |--Table Scan(OBJECT:([t]))

SQL Server 2005 treats partitioned tables specially and creates special plans, such as the above one, for partitioned tables.  SQL Server 2008, on the other hand, for the most part treats partitioned tables as regular tables that just happen to be logically indexed on the partition id column.  For example, for the purposes or query optimization and query execution, SQL Server 2008 treats the above table not as a heap but as an index on [PtnId].  If we create a partitioned index (clustered or non-clustered), SQL Server 2008 logically adds the partition id column as the first column of the index.  For instance, if we create the following index:

CREATE INDEX TA ON T(A) ON PS(A)

SQL Server 2008 treats it not as a single column index on T(A), but rather as a multi-column or composite index on T([PtnId],A).  Note that the table and index are still stored physically as partitioned tables.  In this example, the table and non-clustered index are decomposed into four heaps and four non-clustered indexes.

By treating partitioned tables as indexes, SQL Server 2008 can frequently use much simpler query plans.  Let's look at the SQL Server 2008 plan for the above example:

  |--Table Scan(OBJECT:([T]))

This query plan is no different than what you might see if you scanned any other ordinary table!  So, how can we tell that the table is really partitioned and how can we tell what partitions the plan is going to scan?  Luckily, the graphical and XML plans identify partitioned tables by adding a "Partitioned" attribute to any scan, seek, or update operator that processes a partitioned table.  Moreover, the actual graphical plan and the STATISTICS XML output identify the exact partitions that the plan touches during execution.  For example, here is an excerpt from the STATISTICS XML output generated by running the above query:

<RunTimePartitionSummary>
  <PartitionsAccessed PartitionCount="4">
    <PartitionRange Start="1" End="4"/>
  </PartitionsAccessed>
</RunTimePartitionSummary>

This output shows that, as expected, the query plan scanned all four partitions of the table.  Now, suppose we insert a row into the second partition (up until now, the table has been empty) and run the following slightly modified query:

INSERT T VALUES (1, 1)

SELECT TOP 1 * FROM T

This query can stop executing as soon as it finds a single row.  It scans the first partition and finding no rows continues on to scan the second partition.  At that point, it finds a row and stops.  Thus, the query terminates after scanning only two partitions.  This outcome is clearly reflected in the STATISTICS XML output:

<RunTimePartitionSummary>
  <PartitionsAccessed PartitionCount="2">
    <PartitionRange Start="1" End="2"/>
  </PartitionsAccessed>
</RunTimePartitionSummary>

Now let's see how partition elimination works.  Let's start with static partition elimination:

SELECT * FROM T WHERE A < 100

  |--Table Scan(OBJECT:([T]), SEEK:([PtnId1001] >= (1) AND [PtnId1001] <= (3)),  WHERE:([T].[A]<(100)) ORDERED FORWARD)

Because SQL Server 2008 treats the partitioned table like an index on [PtnId], it can simply calculate the range of partitions that need to be scanned and perform a "seek" on those partitions.  It may seem a little strange to see a SEEK predicate on a table scan, but it just means that the plan is going to scan partitions 1 through 3.

And now, let's look at dynamic partition elimination:

DECLARE @I INT
SELECT @I = 0
SELECT * FROM T WHERE A < @I

  |--Table Scan(OBJECT:([T]), SEEK:([PtnId1001] >= (1) AND [PtnId1001] <= RangePartitionNew([@I],(0),(0),(10),(100))),  WHERE:([T].[A]<[@I]) ORDERED FORWARD)

This plan is identical to the prior plan except that the constant partition id 3 has been replaced by the RangePartitionNew function which computes the correct partition id from the variable @I.  To find out exactly which partitions were scanned, we can once again check the STATISTICS XML output:

<RunTimePartitionSummary>
  <PartitionsAccessed PartitionCount="1">
    <PartitionRange Start="1" End="1"/>
  </PartitionsAccessed>
</RunTimePartitionSummary>

Note that this plan computes exactly which partitions to scan.  Compare that to the SQL Server 2005 plan which evaluates a filter for each and every partition id:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([PtnIds1004]) PARTITION ID:([PtnIds1004]))
       |--Filter(WHERE:([PtnIds1004]<=RangePartitionNew([@i],(0),(0),(10),(100))))
       |    |--Constant Scan(VALUES:(((1)),((2)),((3)),((4))))
       |--Table Scan(OBJECT:([t]), WHERE:([t].[a]<[@i]) PARTITION ID:([PtnIds1004]))

While evaluating the filter repeatedly may not cost too much if we have only four partitions, it is certainly going to waste some cycles if we have many partitions!

In my next post, I'll take a look at how SQL Server 2008 handles scans and seeks on partitioned indexes.

Subqueries in BETWEEN and CASE Statements

Consider the following query:

CREATE TABLE T1 (A INT, B1 INT, B2 INT)
CREATE TABLE T2 (A INT, B INT)

SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) BETWEEN T1.B1 AND T1.B2

Observe that the subquery in this query only needs to be evaluated once for each row of T1.  Indeed running on SQL Server 2000, we get the following plan (with the portion of the plan corresponding to the subquery in bold):

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1004]))
       |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1014]=0) then NULL else [Expr1015]))
       |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1014]=COUNT_BIG([T2].[B]), [Expr1015]=SUM([T2].[B])))
       |         |--Sort(ORDER BY:([T2].[A] ASC))
       |              |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1004]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1004]))
                 |--Table Scan(OBJECT:([T1]))

Now let's look at the query plan we get (with SQL Server 2005 or SQL Server 2008):

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008], [Expr1014]))
       |--Merge Join(Inner Join, MERGE:([T2].[A])=([T2].[A]), RESIDUAL:([T2].[A]=[T2].[A]))
       |    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1026]=(0) THEN NULL ELSE [Expr1027] END))
       |    |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1026]=COUNT_BIG([T2].[B]), [Expr1027]=SUM([T2].[B])))
       |    |         |--Sort(ORDER BY:([T2].[A] ASC))
       |    |              |--Table Scan(OBJECT:([T2]))
       |    |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1028]=(0) THEN NULL ELSE [Expr1029] END))
       |         |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1028]=COUNT_BIG([T2].[B]), [Expr1029]=SUM([T2].[B])))
       |              |--Sort(ORDER BY:([T2].[A] ASC))
       |                   |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1014]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
                 |--Table Scan(OBJECT:([T1]))

Notice that the subquery is actually evaluated twice.  There are two scans of T2, two sorts, and two stream aggregates.  I've highlighted both sets of operators in bold.

Why is SQL Server evaluating the subquery twice?  The answer is that SQL Server transforms "X BETWEEEN Y AND Z" into "X <= Y AND X >= Z".  If as in this query, X is a subquery, the subquery is repeated:

SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) >= T1.B1 AND
    (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) <= T1.B2

This transformation occurs very early in the processing of the query.  Unfortunately, after that point, SQL Server 2005 and SQL Server 2008 do not realize that the subquery is a common subexpression and they evaluate it as if there were two completely different subqueries.

You may also have noticed that, instead of scanning T1 first and then evaluating the subquery for each row of T1, this plan actually scans T2 first.  This join order results from the optimizer decorrelating the subqueries.

So, how can we get SQL Server to evaluate the subquery only once?  There are actually multiple solutions and all involve rewriting the query to calculate the subquery separately from the BETWEEN clause so that when SQL Server transforms the BETWEEN clause, it does not also duplicate the subquery.  Here are some examples:

SELECT Q.A, Q.B1, Q.B2
FROM
    (
    SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
    FROM T1
    ) Q
WHERE SUM_B BETWEEN Q.B1 AND Q.B2

SELECT T1.*
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q
WHERE Q.SUM_B BETWEEN T1.B1 AND T1.B2

SELECT T1.*
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A AND Q.SUM_B BETWEEN T1.B1 AND T1.B2

All three of these rewrites produce the same plan:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008]))
       |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1016]=(0) THEN NULL ELSE [Expr1017] END))
       |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1016]=COUNT_BIG([T2].[B]), [Expr1017]=SUM([T2].[B])))
       |         |--Sort(ORDER BY:([T2].[A] ASC))
       |              |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1008]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
                 |--Table Scan(OBJECT:([T1]))

SQL Server also transforms a CASE statement of the form:

CASE X
    WHEN Y1 THEN Z1
    WHEN Y2 THEN Z2
    ...
    ELSE ZN
END

Into:

CASE
    WHEN X = Y1 THEN Z1
    WHEN X = Y2 THEN Z2
    ...
    ELSE ZN
END

Thus, CASE statements can yield the same problematic behavior if X is a subquery.  Unfortunately, with a CASE statement, the number of times that the subquery is reevaluated depends on the number WHEN clauses and can be quite large.  Here is a simple query that illustrates the problem:

SELECT *,
    CASE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A)
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1

Here is the SQL Server 2000 plan:

  |--Compute Scalar(DEFINE:([Expr1007]=If ([Expr1004]=[T1].[B1]) then 'B1' else If ([Expr1004]=[T1].[B2]) then 'B2' else NULL))
       |--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
            |--Table Scan(OBJECT:([T1]))
            |--Hash Match(Cache, HASH:([T1].[A]), RESIDUAL:([T1].[A]=[T1].[A]))
                 |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1020]=0) then NULL else [Expr1021]))
                      |--Stream Aggregate(DEFINE:([Expr1020]=COUNT_BIG([T2].[B]), [Expr1021]=SUM([T2].[B])))
                           |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))

And here is the SQL Server 2005 and SQL Server 2008 plan:

  |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1008]=[T1].[B1] THEN 'B1' ELSE CASE WHEN [Expr1014]=[T1].[B2] THEN 'B2' ELSE NULL END END))
       |--Nested Loops(Inner Join, PASSTHRU:([Expr1008]=[T1].[B1]), OUTER REFERENCES:([T1].[A]))
            |--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
            |    |--Table Scan(OBJECT:([T1]))
            |    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1030]=(0) THEN NULL ELSE [Expr1031] END))
            |         |--Stream Aggregate(DEFINE:([Expr1030]=COUNT_BIG([T2].[B]), [Expr1031]=SUM([T2].[B])))
            |              |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))
            |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1032]=(0) THEN NULL ELSE [Expr1033] END))
                 |--Stream Aggregate(DEFINE:([Expr1032]=COUNT_BIG([T2].[B]), [Expr1033]=SUM([T2].[B])))
                      |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))

Finally, the same solutions once again apply:

SELECT Q.A, Q.B1, Q.B2,
    CASE Q.SUM_B
        WHEN Q.B1 THEN 'B1'
        WHEN Q.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM
    (
    SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
    FROM T1
    ) Q

SELECT T1.*,
    CASE Q.SUM_B
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q

SELECT T1.*,
    CASE Q.SUM_B
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A

Posted by craigfr | 5 Comments
Filed under:

Implicit Conversions

In my last couple of posts, I wrote about how explicit conversions can lead to errors.  In this post, I'm going to take a look at some issues involving implicit conversions.  SQL Server adds implicit conversions whenever you mix columns, variables, and/or parameters with different (but compatible) data types in a single expression.  For example, if you try to compare INT and FLOAT columns, the INT must be converted to a FLOAT.  If you write "C_INT = C_FLOAT", SQL Server rewrites this expressions as "CONVERT(FLOAT,C_INT) = C_FLOAT".

When performing implicit conversions, SQL Server will try to choose the conversion that is least likely either to fail due to an overflow or to lose precision.  For example, a SMALLINT will be converted to an INT since all SMALLINTs can be converted to INTs without any data loss.  On the other hand, an INT will be converted to a REAL since all INTs can be converted to REALs but not vice-versa.  However, the conversion is potentially lossy since some INTs contain more digits than can be represented by a REAL.

Note that some data types can only be combined in a single expression by using an explicit conversion and that some data types are not compatible at all with or without an explicit conversion.  This Books Online page includes a compatibility matrix showing all of the possible data type combinations.  For the remainder of this post, I'm going to focus only on implicit conversions.

To get started, I'd like to show an example that did not work so well on SQL Server 2000:

CREATE TABLE T (C INT)
CREATE UNIQUE CLUSTERED INDEX TC ON T(C)

INSERT T VALUES (1000000000)
INSERT T VALUES (1000000001)

First, let's consider a query that uses a clustered index scan:

DECLARE @V REAL
SET @V = 1E9
SELECT * FROM T WITH (INDEX(0)) WHERE C = @V

C           
----------- 
1000000000
1000000001

At first glance, this may seem like an incorrect result.  The second row does not appear to match the predicate.  However, if we check the plan, we see that SQL added an implicit convert to the predicate:

  |--Clustered Index Scan(OBJECT:([T].[TC]), WHERE:(Convert([T].[C])=[@V]))

Moreover, if we check the results of the conversion, we can see that it is lossy.  Both INTs convert to the same REAL value.  The above results now make more sense:

SELECT *, CONVERT(REAL,C) REAL_C FROM T

C           REAL_C                   
----------- ------------------------ 
1000000000  1.0E+9
1000000001  1.0E+9

Now, let's consider the same query with a clustered index seek:

DECLARE @V REAL
SET @V = 1E9
SELECT * FROM T WITH (INDEX(1)) WHERE C = @V

C           
----------- 
1000000000

Hold it!  What happened to the second row?  Let's check the plan:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1002], [Expr1003], [Expr1004]))
       |--Compute Scalar(DEFINE:([Expr1002]=Convert([@V])-1, [Expr1003]=Convert([@V])+1, [Expr1004]=If (Convert([@V])-1=NULL) then 0 else 6|If (Convert([@V])+1=NULL) then 0 else 10))
       |    |--Constant Scan
       |--Clustered Index Seek(OBJECT:([T].[TC]), SEEK:([T].[C] > [Expr1002] AND [T].[C] < [Expr1003]),  WHERE:(Convert([T].[C])=[@V]) ORDERED FORWARD)

This plan looks rather complicated, but aside from unfortunately getting a wrong answer (at least according to the conversion semantics we just discussed), it is really quite simple.  To perform an index seek, SQL Server must have key values that match the data type stored in the index.  SQL Server cannot perform an index seek on an INT index using a REAL key.  So, SQL Server must convert the REAL variable to an INT.  Because, as we've already seen, conversions can be lossy, SQL Server also expands the range of values returned by the index seek by subtracting and adding one to the result of the conversion.  Basically, after substituting the value of the parameter, this plan is the same as running the following query:

SELECT * FROM T WHERE (C > 999999999 AND C < 1000000001) AND CONVERT(REAL,C) = 1E9

Notice that the original predicate with the convert is retained in case the index seek returns too many rows and some must be filtered out.  In most cases, this algorithm works just fine; I've deliberately selected a scenario that does not work so well.  In addition to potentially omitting results, this algorithm may also harm performance.  For instance, consider this example:

CREATE TABLE T (C NUMERIC(12,4))
CREATE UNIQUE CLUSTERED INDEX TC ON T(C)

INSERT T VALUES (0.0001)
INSERT T VALUES (0.0002)
INSERT T VALUES (0.0003)
...

DECLARE @P REAL
SET @P = 1.0000
SELECT * FROM T WITH (INDEX(1)) WHERE C = @P

Suppose that table T includes many values in the range 0.0000 to 2.0000.  In this example, the conversion is lossless.  Only rows with the value 1.0000 should be returned.  Unfortunately, the index seek still returns all of the rows in the range and the filter with the convert discards all of them except for those with the value 1.0000.

Now let's see how SQL Server 2005 (and 2008) handle these same queries.  Let's start with the clustered index scan:

DECLARE @P REAL
SET @P = 1E9
SELECT * FROM T WITH (INDEX(0)) WHERE C = @P

  |--Clustered Index Scan(OBJECT:([T].[TC]), WHERE:(CONVERT_IMPLICIT(real(24),[T].[C],0)=[@P]))

This is the same plan that we got from SQL Server 2000 and we get the same result (both rows).  The only difference is that SQL Server 2005 indicates that the convert is implicit and provides the type of the conversion.  Next let's try the clustered index seek:

DECLARE @P REAL
SET @P = 1E9
SELECT * FROM T WITH (INDEX(1)) WHERE C = @P

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1005], [Expr1006], [Expr1004]))
       |--Compute Scalar(DEFINE:(([Expr1005],[Expr1006],[Expr1004])=GetRangeThroughConvert([@P],[@P],(62))))
       |    |--Constant Scan
       |--Clustered Index Seek(OBJECT:([T].[TC]), SEEK:([T].[C] > [Expr1005] AND [T].[C] < [Expr1006]),  WHERE:(CONVERT_IMPLICIT(real(24),[T].[C],0)=[@P]) ORDERED FORWARD)

This plan is similar to the SQL Server 2000 plan with one key difference.  The conversion of the seek key from REAL to INT is performed by an internal function GetRangeThroughConvert which correctly determines the exact range of values to seek.  A little experimentation will show that, in this example, the range works out to all values between 999,999,968 and 1,000,000,479.  On the other hand, for the example with the numeric column, the "range" works out to exactly 1.0000.  This solution is both more accurate and more efficient than the SQL Server 2000 algorithm.

Finally, let's take a look at the performance implications of the new plan: 

CREATE TABLE T1 (C_INT INT, C_REAL REAL)
CREATE CLUSTERED INDEX T1_C_REAL ON T1(C_REAL)

CREATE TABLE T2 (C_INT INT, C_REAL REAL)
CREATE CLUSTERED INDEX T2_C_INT ON T2(C_INT)

SET NOCOUNT ON
DECLARE @I INT
SET @I = 0
WHILE @I < 100000
  BEGIN
    INSERT T1 VALUES (@I, @I)
    INSERT T2 VALUES (@I, @I)
    SET @I = @I + 1
  END

Here are three join plans.  I'm using a join hint to ensure that I do not get a hash join which uses a completely different algorithm.  I'm using COUNT(*) to eliminate the overhead of transferring output rows to the client and an OPTION hint to avoid parallelism.  The first plan joins the two INT columns.  The second plan joins the two REAL columns.  The third plan joins the INT and REAL columns and uses the same algorithm with the GetRangeThroughConvert function as the simpler query above.

SELECT COUNT(*)
FROM T1 INNER LOOP JOIN T2 ON T1.C_INT = T2.C_INT
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[C_INT], [Expr1011]) WITH UNORDERED PREFETCH)
                 |--Clustered Index Scan(OBJECT:([T1].[T1_C_REAL]))
                 |--Clustered Index Seek(OBJECT:([T2].[T2_C_INT]), SEEK:([T2].[C_INT]=[T1].[C_INT]) ORDERED FORWARD)

SELECT COUNT(*)
FROM T2 INNER LOOP JOIN T1 ON T2.C_REAL = T1.C_REAL
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[C_REAL], [Expr1011]) WITH UNORDERED PREFETCH)
                 |--Clustered Index Scan(OBJECT:([T2].[T2_C_INT]))
                 |--Clustered Index Seek(OBJECT:([T1].[T1_C_REAL]), SEEK:([T1].[C_REAL]=[T2].[C_REAL]) ORDERED FORWARD)

SELECT COUNT(*)
FROM T1 INNER LOOP JOIN T2 ON T1.C_REAL = T2.C_INT
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1014],0)))
       |--Stream Aggregate(DEFINE:([Expr1014]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[C_REAL]))
                 |--Clustered Index Scan(OBJECT:([T1].[T1_C_REAL]))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1012], [Expr1013], [Expr1011]))
                      |--Compute Scalar(DEFINE:(([Expr1012],[Expr1013],[Expr1011])=GetRangeThroughConvert([T1].[C_REAL],[T1].[C_REAL],(62))))
                      |    |--Constant Scan
                      |--Clustered Index Seek(OBJECT:([T2].[T2_C_INT]), SEEK:([T2].[C_INT] > [Expr1012] AND [T2].[C_INT] < [Expr1013]),  WHERE:([T1].[C_REAL]=CONVERT_IMPLICIT(real(24),[T2].[C_INT],0)) ORDERED FORWARD)

I measured the times using SET STATISTICS TIME ON.  Here are the results I got on my system:

SQL Server Execution Times:   CPU time = 750 ms,  elapsed time = 741 ms.

SQL Server Execution Times:   CPU time = 735 ms,  elapsed time = 742 ms.

SQL Server Execution Times:   CPU time = 1609 ms,  elapsed time = 1605 ms.

Notice that each of first two queries with the joins on the matching types complete in 740ms while the third query with the join on the mismatched types takes more than twice as long at 1600ms!  The moral of this story?  Try to stick to matching types whenever possible.

Posted by craigfr | 1 Comments
Filed under:

Query Processing Presentation

Last week, I had the opportunity to talk to the New England SQL Server Users Group.  I would like to thank the group for inviting me, Adam Machanic for organizing the event, and Red Gate for sponsoring it.  My talk was an introduction to query processing, query execution, and query plans in SQL Server.  I've had a request for the slides, so here they are.  Enjoy!

Posted by craigfr | 0 Comments
Attachment(s): QPTalk.pdf

Conversion and Arithmetic Errors: Change between SQL Server 2000 and 2005

In this post from last week, I gave an example of a query with a conversion where the optimizer pushes the conversion below a join.  The result is that the conversion may be run on rows that do not join which could lead to avoidable failures.  I ran this query on SQL Server 2005.  After I published that post, a reader pointed out to me that my example query generates a different plan on SQL Server 2000:

CREATE TABLE T1 (A INT, B CHAR(8))
INSERT T1 VALUES (0, '0')
INSERT T1 VALUES (1, '1')
INSERT T1 VALUES (99, 'Error')

CREATE TABLE T2 (X INT)
INSERT T2 VALUES (1)
INSERT T2 VALUES (2)
INSERT T2 VALUES (3)
INSERT T2 VALUES (4)
INSERT T2 VALUES (5)

SELECT T1.A, CONVERT(INT, T1.B) AS B_INT
FROM T1 JOIN T2 ON T1.A = T2.X

Here is the SQL Server 2000 plan:

  |--Compute Scalar(DEFINE:([Expr1004]=Convert([T1].[B])))
       |--Hash Match(Inner Join, HASH:([T1].[A])=([T2].[X]), RESIDUAL:([T2].[X]=[T1].[A]))
            |--Table Scan(OBJECT:([T1]))
            |--Table Scan(OBJECT:([T2]))

And here is the SQL Server 2005 plan:

  |--Hash Match(Inner Join, HASH:([T1].[A])=([T2].[X]), RESIDUAL:([T2].[X]=[T1].[A]))
       |--Compute Scalar(DEFINE:([Expr1008]=CONVERT(int,[T1].[B],0)))
       |    |--Table Scan(OBJECT:([T1]))
       |--Table Scan(OBJECT:([T2]))

The result is that the SQL Server 2000 plan runs without any errors - the "bad" row is filtered out by the join - while the SQL Server 2005 plan fails when it tries to convert the "bad" row.  It turns out that this change in behavior is documented in this Books Online entry under the Transact-SQL section and "Expressions in queries" feature.  This entry states that queries that worked on SQL Server 2000 could fail on SQL Server 2005.  It also states that the change enables two benefits: "the ability to match indexes on computed columns" and "the prevention of redundant computation of expression results."  Let's investigate these two benefits.

To understand the first benefit ("the ability to match indexes on computed columns"), suppose we create a new table with a computed column that matches the conversion in our query and then create an index on that computed column:

CREATE TABLE T3 (A INT, B CHAR(8), C AS CONVERT(INT, B)
CREATE INDEX T3C ON T3 (C, A)

INSERT T3 VALUES (0, '0')
INSERT T3 VALUES (1, '1')

This computed column is sometimes referred to as a "persisted" computed column because the index stores the actual values of the computed column.  For computed columns on complex or expensive expressions, using such an index could be faster than evaluating the expression anew.  Note that, because the computed column must be evaluated to maintain the index, we cannot insert a "bad" row into T3.  Any attempt to insert a "bad" row immediately fails with a conversion error.

I've deliberately included column A as part of the index so that the index covers all of the columns in the query.  If we omit column A from the index, SQL Server cannot use the index without performing a bookmark lookup.  In this particular example, that would prevent SQL Server from choosing the plan that I'm trying to demonstrate.  Also, note that, on SQL Server 2005 we could use the INCLUDE keyword instead of making column A an index key, but I wanted this example to work on both SQL Server 2000 and SQL Server 2005.

Now let's compare the plans using the new table.  First, the SQL Server 2000 plan:

  |--Compute Scalar(DEFINE:([Expr1004]=Convert([T3].[B])))
       |--Hash Match(Inner Join, HASH:([T3].[A])=([T2].[X]), RESIDUAL:([T2].[X]=[T3].[A]))
            |--Table Scan(OBJECT:([T3]))
            |--Table Scan(OBJECT:([T2]))

Compare that to the SQL Server 2005 plan:

  |--Hash Match(Inner Join, HASH:([T3].[A])=([T2].[X]), RESIDUAL:([T2].[X]=[T3].[A]))
       |--Compute Scalar(DEFINE:([T3].[C]=[T3].[C]))
       |    |--Index Scan(OBJECT:([T3].[T3C]))
       |--Table Scan(OBJECT:([T2]))

Notice that SQL Server 2000 uses the same plan as we saw above while SQL Server 2005 uses the new index and eliminates the conversion.  This type of index matching is not limited to conversions and works for many other expressions and computed columns.

I researched the second benefit ("the prevention of redundant computation of expression results") but I'm unable to come up with a compelling explanation for it.  In some cases, it is certainly true that pushing the conversion below a join reduces the number of times that the conversion is evaluated.  For instance, returning to the original example above, suppose we add many duplicate values to T2 such that each T1 row joins multiple times.  In the SQL Server 2005 plan, the conversion is evaluated before the join and so it is evaluated exactly once for each T1 row regardless of how many times each row joins.  In the SQL Server 2000 plan, the conversion is evaluated after the join and so it is evaluated multiple times for each T1 row - once for each time that the row joins.  Of course, it is also possible that none of the rows from T1 join.  In this case, the SQL Server 2005 plan unnecessarily evaluates the conversion on every T1 row while the SQL Server 2000 plan does not evaluate the conversion at all.

Moreover, you may recall from my last post, that SQL Server defers the execution of most scalar expressions until they are actually evaluated.  I used the following nested loops join example to illustrate this point:

SELECT T1.A, CONVERT(INT, T1.B) AS B_INT
FROM T1 JOIN T2 ON T1.A = T2.X
OPTION (LOOP JOIN)

  |--Nested Loops(Inner Join, WHERE:([T2].[X]=[T1].[A]))
       |--Compute Scalar(DEFINE:([Expr1008]=CONVERT(int,[T1].[B],0)))
       |    |--Table Scan(OBJECT:([T1]))
       |--Table Scan(OBJECT:([T2]))

In this example, the compute scalar is below the join in the plan but the conversion is still executed after the join.  Thus, pushing the conversion below the join does not change the number of times that it is executed.

Posted by craigfr | 0 Comments
Filed under: ,

Conversion and Arithmetic Errors

Let's take a look at a simple query:

CREATE TABLE T1 (A INT, B CHAR(8))
INSERT T1 VALUES (0, '0')
INSERT T1 VALUES (1, '1')
INSERT T1 VALUES (99, 'Error')

SELECT T1.A, CONVERT(INT, T1.B) AS B_INT FROM T1

There is no way to convert the string "Error" into an integer, so it should come as no surprise that this query fails with a conversion error:

A           B_INT
----------- -----------
1           1
Msg 245, Level 16, State 1, Line 1
Conversion failed when converting the varchar value 'Error   ' to data type int.

Now, let's create a second table and try a join:

CREATE TABLE T2 (X INT)
INSERT T2 VALUES (1)
INSERT T2 VALUES (2)
INSERT T2 VALUES (3)
INSERT T2 VALUES (4)
INSERT T2 VALUES (5)

SELECT T1.A, CONVERT(INT, T1.B) AS B_INT
FROM T1 JOIN T2 ON T1.A = T2.X

Although this join query includes the same conversion as the first query, the row with T1.A = 99 does not join with any rows in T2, so at first glance, we might conclude that the conversion should never be executed.  However, if we execute the query, we still get the error:

A           B_INT
----------- -----------
Msg 245, Level 16, State 1, Line 1
Conversion failed when converting the varchar value 'Error   ' to data type int.

What happened?  The SQL Server query optimizer is free to move expressions up and down within a query plan.  Since the expression with the conversion depends only on T1, the optimizer can (and does) push the conversion below the join and close to the scan of T1:

  |--Hash Match(Inner Join, HASH:([T1].[A])=([T2].[X]), RESIDUAL:([T2].[X]=[T1].[A]))
       |--Compute Scalar(DEFINE:([Expr1008]=CONVERT(int,[T1].[B],0)))
       |    |--Table Scan(OBJECT:([T1]))
       |--Table Scan(OBJECT:([T2]))

Now, let's see what happens if we force a nested loops join:

SELECT T1.A, CONVERT(INT, T1.B) AS B_INT
FROM T1 JOIN T2 ON T1.A = T2.X
OPTION (LOOP JOIN)

  |--Nested Loops(Inner Join, WHERE:([T2].[X]=[T1].[A]))
       |--Compute Scalar(DEFINE:([Expr1008]=CONVERT(int,[T1].[B],0)))
       |    |--Table Scan(OBJECT:([T1]))
       |--Table Scan(OBJECT:([T2]))

Once again, the optimizer pushes the conversion below the join, so we might conclude that this query will also fail.  Surprisingly, it does not:

A           B_INT
----------- -----------
1           1

What's going on here?  SQL Server includes a runtime optimization that defers the evaluation of most scalar expressions until they are actually referenced.  We can see this optimization in action if we check the output of SET STATISTICS PROFILE ON:

1      1        |--Nested Loops(Inner Join, WHERE:([T2].[X]=[T1].[A]))
0      0             |--Compute Scalar(DEFINE:([Expr1008]=CONVERT(int,[T1].[B],0)))
3      1             |    |--Table Scan(OBJECT:([T1]))
15     3             |--Table Scan(OBJECT:([T2])) 

Notice that the compute scalar is executed zero times.  The result of the conversion is not needed by the join and is computed immediately prior to returning the query result to the client.  By this time, the bad row has already been filtered by the join.

SQL Server actually uses the same optimization both for the nested loops join and for the hash join.  However, recall that a hash join builds a hash table on its left input.  This hash table includes values for all columns that are needed to evaluate the join and values for all columns that are output by the join.  Thus, the hash join must evaluate and store the result of the conversion in the hash table.  Thus, the conversion is evaluated and fails before the hash join can determine whether the bad row actually joins.

As you can see, whether or not a conversion causes a failure depends on a variety of factors including where in the plan the optimizer chooses to place the conversion and when during execution the conversion is actually evaluated.  Moreover, when the conversion is actually evaluated depends on the precise operators in the plan and the behavior of those operators.

We can see similar problems with arithmetic errors such as overflows or divide by zero:

INSERT T1 VALUES (0, '0')

SELECT T1.A, 1 / T1.A AS A_Reciprocal
FROM T1 JOIN T2 ON T1.A = T2.X

A           A_Reciprocal
----------- ------------
Msg 8134, Level 16, State 1, Line 1
Divide by zero error encountered.

So, what can you do to work around this problem?  The best solution from a performance perspective is to avoid using the conversion at all or to ensure that the data in the column to be converted is valid (or to avoid using an expression that could fail due to an arithmetic error).  If you must use a conversion (or any other expression that could fail), you can use a CASE statement to validate the input.  For example, the following query trims away leading and trailing spaces from the CHAR column and then uses the PATINDEX function to validate that the column does not contain any non-numeric characters:

SELECT T1.A,
    CASE WHEN PATINDEX('%[^0-9]%', RTRIM(LTRIM(T1.B))) = 0
        THEN CONVERT(INT, T1.B)
        ELSE NULL
    END AS B_INT
FROM T1

This example shows how we can avoid the divide by zero error:

SELECT T1.A,
    CASE WHEN T1.A <> 0
        THEN 1 / T1.A
        ELSE NULL
    END AS A_Reciprocal
FROM T1 JOIN T2 ON T1.A = T2.X

Although I would not recommend it, for arithmetic errors, you can use the SET ANSI_WARNINGS, SET ARITHABORT, and SET ARITHIGNORE options to disable the error.  These settings do nothing for conversion errors.

Finally, it is worth pointing out that some hints can affect the placement of the conversion:

SELECT T1.A, CONVERT(INT, T1.B) AS B_INT
FROM T1 JOIN T2 ON T1.A = T2.X
OPTION (FORCE ORDER)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT(int,[T1].[B],0)))
       |--Hash Match(Inner Join, HASH:([T1].[A])=([T2].[X]), RESIDUAL:([T2].[X]=[T1].[A]))
            |--Table Scan(OBJECT:([T1]))
            |--Table Scan(OBJECT:([T2]))

Notice that the conversion is now computed after the join.  I would strongly discourage the use of this hint as a workaround except perhaps in the simplest of queries given the number of side effects it has.  Mostly I point out this particular result in case any readers are tempted to use this or other hints to explore how other join types and join orders might affect the behavior of this query.

Posted by craigfr | 4 Comments
Filed under: ,

Ranking Functions: RANK, DENSE_RANK, and NTILE

In my previous post, I discussed the ROW_NUMBER ranking function which was introduced in SQL Server 2005.  In this post, I'll take a look at the other ranking functions - RANK, DENSE_RANK, and NTILE.  Let's begin with RANK and DENSE_RANK.  These functions are similar - both in functionality and implementation - to ROW_NUMBER.  The difference is that while the ROW_NUMBER function assigns a unique (and ascending) value to each row without regard for ties in the ORDER BY values, the RANK and DENSE_RANK functions assign the same value to rows that have the same ORDER BY value.  The difference between the RANK and DENSE_RANK functions is in how values are assigned to rows following a tie.  The easiest way to illustrate the difference between all of these functions is with a simple example:

CREATE TABLE T (PK INT IDENTITY, A INT, B INT, C INT)
CREATE UNIQUE CLUSTERED INDEX TPK ON T(PK)

INSERT T VALUES (0, 1, 6)
INSERT T VALUES (0, 1, 4)
INSERT T VALUES (0, 3, 2)
INSERT T VALUES (0, 3, 0)
INSERT T VALUES (1, 0, 7)
INSERT T VALUES (1, 0, 5)
INSERT T VALUES (0, 2, 3)
INSERT T VALUES (0, 2, 1)

SELECT *,
    ROW_NUMBER() OVER (ORDER BY B) AS RowNumber,
    RANK() OVER (ORDER BY B) AS Rank,
    DENSE_RANK() OVER (ORDER BY B) AS DenseRank
FROM T

PK    A     B     C     RowNumber  Rank       DenseRank
----- ----- ----- ----- ---------- ---------- ----------
5     1     0     7     1          1          1
6     1     0     5     2          1          1
1     0     1     6     3          3          2
2     0     1     4     4          3          2
7     0     2     3     5          5          3
8     0     2     1     6          5          3
3     0     3     2     7          7          4
4     0     3     0     8          7          4

Notice how the ROW_NUMBER function ignores the duplicate values for column B and assigns the unique integers from 1 to 8 to the 8 rows while the RANK and DENSE_RANK functions assigns the same value to each of the pairs of duplicate rows.  Moreover, notice how the RANK function counts the duplicate rows even while it assigns the same value to each duplicate row whereas the DENSE_RANK function does not count the duplicate rows.  For example, both the RANK and DENSE_RANK functions assign a rank of 1 to the first two rows, but the RANK function assigns a rank of 3 to the third row - as it is the third row - while the DENSE_RANK function assigns a rank of 2 to the third row - as it contains the second distinct value for column B.  Note that the maximum value returned by the DENSE_RANK function is exactly equal to the number of distinct values in column B.

The query plan for computing all three of these functions is essentially the same and, in fact, as long as all three of the functions share the same PARTITION BY and ORDER BY clauses, a single sequence project operator can compute all three functions simultaneously:

  |--Sequence Project(DEFINE:([Expr1003]=row_number, [Expr1004]=rank, [Expr1005]=dense_rank))
       |--Segment [GROUP BY:([T].[B])]
            |--Segment [GROUP BY:()]
                 |--Sort(ORDER BY:([T].[B] ASC))
                      |--Clustered Index Scan(OBJECT:([T].[TPK]))

The only real difference between this plan and a plan that computes only the ROW_NUMBER function is the presence of the extra segment operator that groups by column B.  This segment operator detects ties and notifies the sequence project operator when to output the same value and when to output a new value for the RANK and DENSE_RANK functions.

All of the issues from my previous post on ROW_NUMBER (such as the use of an index to avoid the sort or the impact of mixing ranking functions with different PARTITION BY and/or ORDER BY clauses) apply equally well to the RANK and DENSE_RANK functions so I will not repeat them here.

Now let's take a look at the NTILE function.  This function breaks an input set down into N equal sized groups.  To determine how many rows belong in each group, SQL Server must first determine the total number of rows in the input set.  If the NTILE function includes a PARTITION BY clause, SQL Server must compute the number of rows in each partition separately.  Once we know the number of rows in each partition, we can write the NTILE function as

NTILE(N) := (N * (ROW_NUMBER() - 1) / COUNT(*)) + 1

where COUNT(*) is the number of rows in each partition.  We need the -1 and +1 in this equation as the ROW_NUMBER and NTILE functions are 1 not 0 based.

We already know how to compute the ROW_NUMBER function, so the real question is how to compute the COUNT(*) function.  It turns out that SQL Server supports use of the OVER clause that we've already seen with the ranking functions with nearly all aggregate functions as well.  This type of aggregate function is sometimes referred to as a "window aggregate function."  With a window aggregate function, instead of returning a single row per group, SQL Server returns the same aggregate result with each row in the group.  Let's look at an example:

SELECT *, COUNT(*) OVER (PARTITION BY A) AS Cnt FROM T

PK    A     B     C     Cnt
----- ----- ----- ----- ----------
1     0     1     6     6
2     0     1     4     6
3     0     3     2     6
4     0     3     0     6
7     0     2     3     6
8     0     2     1     6
5     1     0     7     2
6     1     0     5     2

Notice that unlike a scalar aggregate query which can return only aggregate functions or an aggregate query with a GROUP BY clause which can return only columns from the GROUP BY clause and aggregate functions, this query can return any columns from the table.  Moreover, since each row in the group gets the same result, the ORDER BY clause is not supported for this scenario.  Now let's take a look at the plan for this query:

  |--Nested Loops(Inner Join)
       |--Table Spool
       |    |--Segment [GROUP BY:([T].[A])]
       |         |--Sort(ORDER BY:([T].[A] ASC))
       |              |--Clustered Index Scan(OBJECT:([T].[TPK]))
       |--Nested Loops(Inner Join, WHERE:((1)))
            |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1005],0)))
            |    |--Stream Aggregate(DEFINE:([Expr1005]=Count(*)))
            |         |--Table Spool
            |--Table Spool

While this plan is certainly more complex than a typical aggregation query plan, it's actually fairly straightforward.  The sort and segment operators merely break the rows into groups in the same way as a stream aggregate.  The table spool above the segment operator is a special type of common subexpression spool known as a segment spool.  This spool operator makes a copy of all rows in one group and then, when the segment operator indicates the start of a new group, it returns a single row to the topmost nested loops join which immediately executes its inner side.  At this point, the table spool replays the rows from the current group and the stream aggregate counts them.  Finally, the stream aggregate returns the count to the bottommost nested loops join which joins this count with the rows from the current group - the table spool replays these rows a second time.

Note that this plan needs the segment spool because there is no way to know a priori how many rows are in the same group with the current row without scanning the entire group.  But after scanning the entire group and determining the count, we need some way to go back and apply the count to each of the original rows.  The spool makes it possible to go back.  A typical aggregate query with a GROUP BY clause does not have this issue because after computing the aggregate function (be it a COUNT or any other function), it simply outputs the result without go back to the original rows.

For an example of a different query that yields a very similar query plan involving a segment spool, check out my discussion of subqueries on pages 171 through 173 of Inside Microsoft SQL Server 2005: Query Tuning and Optimization.

It's now a simple matter to see how to compute the NTILE function.  For example, the following query groups the rows by column A and then for each group computes which rows are below the median (denoted by a 1) and which rows are above the median (denoted by a 2):

SELECT *, NTILE(2) OVER (PARTITION BY A ORDER BY B) AS NTile FROM T

Here are results for this query:

PK    A     B     C     NTile
----- ----- ----- ----- ------
1     0     1     6     1
2     0     1     4     1
7     0     2     3     1
8     0     2     1     2
3     0     3     2     2
4     0     3     0     2
5     1     0     7     1
6     1     0     5     2

And here is the plan for this query:

  |--Sequence Project(DEFINE:([Expr1003]=ntile))
       |--Compute Scalar(DEFINE:([Expr1007]=(1)))
            |--Segment [GROUP BY:([T].[A])]
                 |--Nested Loops(Inner Join)
                      |--Table Spool
                      |    |--Segment [GROUP BY:([T].[A])]
                      |         |--Sort(ORDER BY:([T].[A] ASC, [T].[B] ASC))
                      |              |--Clustered Index Scan(OBJECT:([T].[TPK]))
                      |--Nested Loops(Inner Join, WHERE:((1)))
                           |--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
                           |    |--Table Spool
                           |--Table Spool

This plan is basically the same as the plan for the previous COUNT(*) query.  The only difference is the addition of the extra segment and sequence project operators which compute the NTILE function by computing ROW_NUMBER and applying the formula I provided above.  Using this formula, we could also compute NTILE manually as follows:

SELECT *, (2*(RowNumber-1)/Cnt)+1 AS MyNTile
FROM
    (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY A ORDER BY B) AS RowNumber,
        COUNT(*) OVER (PARTITION BY A ) AS Cnt,
        NTILE(2) OVER (PARTITION BY A ORDER BY B) AS NTile
    FROM T
    ) TSeq

Unfortunately, the optimizer does not recognize that it can use the same plan fragment to compute both the ranking functions and the COUNT(*) window aggregate function and ends up generating the following inefficient plan which computes the COUNT(*) twice:

  |--Compute Scalar(DEFINE:([Expr1006]=((2)*([Expr1003]-(1)))/CONVERT_IMPLICIT(bigint,[Expr1004],0)+(1)))
       |--Nested Loops(Inner Join)
            |--Table Spool
            |    |--Segment [GROUP BY:([T].[A])]
            |         |--Sequence Project(DEFINE:([Expr1003]=row_number, [Expr1005]=ntile))
            |              |--Compute Scalar(DEFINE:([Expr1010]=(1)))
            |                   |--Segment [GROUP BY:([T].[A])]
            |                        |--Nested Loops(Inner Join)
            |                             |--Table Spool
            |                             |    |--Segment [GROUP BY:([T].[A])]
            |                             |         |--Sort(ORDER BY:([T].[A] ASC, [T].[B] ASC))
            |                             |              |--Clustered Index Scan(OBJECT:([T].[TPK]))
            |                             |--Nested Loops(Inner Join, WHERE:((1)))
            |                                  |--Stream Aggregate(DEFINE:([Expr1007]=Count(*)))
            |                                  |    |--Table Spool
            |                                  |--Table Spool
            |--Nested Loops(Inner Join, WHERE:((1)))
                 |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1012],0)))
                 |    |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
                 |         |--Table Spool
                 |--Table Spool

I'll leave parsing this query plan as an exercise.  Enjoy!
Posted by craigfr | 0 Comments

Ranking Functions: ROW_NUMBER

SQL Server 2005 introduced four new functions, ROW_NUMBER, RANK, DENSE_RANK, and NTILE that are collectively referred to as ranking functions.  These functions differ from ordinary scalar functions in that the result that they produce for a given row depends on the other rows in the result set.  They also differ from aggregate functions in that they produce exactly one output row for each input row.  Unlike aggregates they do not collapse a set of rows into a single row.  In this post, I'll take a look at ROW_NUMBER - the simplest of all ranking functions.

I'll use the following simple data set for the examples in this post:

CREATE TABLE T (PK INT IDENTITY, A INT, B INT, C INT)
CREATE UNIQUE CLUSTERED INDEX TPK ON T(PK)

INSERT T VALUES (0, 1, 8)
INSERT T VALUES (0, 3, 6)
INSERT T VALUES (0, 5, 4)
INSERT T VALUES (0, 7, 2)
INSERT T VALUES (0, 9, 0)
INSERT T VALUES (1, 0, 9)
INSERT T VALUES (1, 2, 7)
INSERT T VALUES (1, 4, 5)
INSERT T VALUES (1, 6, 3)
INSERT T VALUES (1, 8, 1)

Let's begin with the following simple example:

SELECT *, ROW_NUMBER() OVER (ORDER BY B) AS RowNumber FROM T

This query orders the rows in the table by column B and assigns sequential row numbers (much like an identity column) to these rows.  The result looks like so:

PK    A     B     C     RowNumber
----- ----- ----- ----- ----------
6     1     0     9     1
1     0     1     8     2
7     1     2     7     3
2     0     3     6     4
8     1     4     5     5
3     0     5     4     6
9     1     6     3     7
4     0     7     2     8
10    1     8     1     9
5     0     9     0     10

The plan for this query is fairly simple:

  |--Sequence Project(DEFINE:([Expr1003]=row_number))
       |--Compute Scalar(DEFINE:([Expr1005]=(1)))
            |--Segment [GROUP BY:()]
                 |--Sort(ORDER BY:([T].[B] ASC))
                      |--Clustered Index Scan(OBJECT:([T].[TPK]))

This plan scans the table, sorts it on column B, and then the sequence project operator assigns sequential numbers to each row.

The segment operator normally breaks adjacent rows into segments or groups of related rows.  It is unnecessary in this example as all rows are in a single segment.  Note that the text plan normally does not show the GROUP BY columns for a segment operator; this information is only available from SHOWPLAN_ALL, SHOWPLAN_XML, or the graphical plan.  I've annotated the text plan to show explicitly that the segment operator has no GROUP BY columns.  We will see an example where the segment operator is necessary in just a moment.

The compute scalar operator is also unnecessary and, in fact, has been removed in the SQL Server 2008 CTPs.

ROW_NUMBER (and the other ranking functions) also support dividing rows into groups or partitions and computing the function separately on each group:

SELECT *, ROW_NUMBER() OVER (PARTITION BY A ORDER BY B) AS RowNumber FROM T

Since column A has two values (0 and 1), this query breaks the rows in the table into two groups and assigns row numbers separately to each group.  The result is:

PK    A     B     C     RowNumber
----- ----- ----- ----- ----------
1     0     1     8     1
2     0     3     6     2
3     0     5     4     3
4     0     7     2     4
5     0     9     0     5
6     1     0     9     1
7     1     2     7     2
8     1     4     5     3
9     1     6     3     4
10    1     8     1     5

The plan for this query is nearly identical to the previous plan:

  |--Sequence Project(DEFINE:([Expr1003]=row_number))
       |--Compute Scalar(DEFINE:([Expr1005]=(1)))
            |--Segment [GROUP BY:([T].[A])]
                 |--Sort(ORDER BY:([T].[A] ASC, [T].[B] ASC))
                      |--Clustered Index Scan(OBJECT:([T].[TPK]))

Notice that the new plan sorts on columns A and B and then the segment operator groups the rows by column A.  The sequence project assigns sequential numbers to each row just as in the first example but resets the row count at the beginning of each segment.  You may notice some similarities between how the grouping works in this plan and how the grouping works in a plan with a stream aggregate operator.

It is possible to combine multiple ROW_NUMBER functions (or multiple of any of the other ranking functions) with different ORDER BY clauses in a single query:

SELECT *,
    ROW_NUMBER() OVER (ORDER BY B) AS RowNumber_B,
    ROW_NUMBER() OVER (ORDER BY C) AS RowNumber_C
FROM T

PK    A     B     C     RowNumber_B  RowNumber_C
----- ----- ----- ----- ------------ ------------
5     0     9     0     10           1
10    1     8     1     9            2
4     0     7     2     8            3
9     1     6     3     7            4
3     0     5     4     6            5
8     1     4     5     5            6
2     0     3     6     4            7
7     1     2     7     3            8
1     0     1     8     2            9
6     1     0     9     1            10

The plan for this query is just like the plan for the first query only it is repeated once for each ranking function with a different sort order for each ORDER BY clause:

  |--Sequence Project(DEFINE:([Expr1004]=row_number))
       |--Compute Scalar(DEFINE:([Expr1008]=(1)))
            |--Segment [GROUP BY:()]
                 |--Sort(ORDER BY:([T].[C] ASC))
                      |--Sequence Project(DEFINE:([Expr1003]=row_number))
                           |--Compute Scalar(DEFINE:([Expr1006]=(1)))
                                |--Segment [GROUP BY:()]
                                     |--Sort(ORDER BY:([T].[B] ASC))
                                          |--Clustered Index Scan(OBJECT:([T].[TPK]))

Note that the ORDER BY clause associated with each ranking function determines the behavior of that ranking function only.  These ORDER BY clauses do not determine the order that rows will be returned from the query.  To guarantee absolutely a certain output order from the query, you must add an explicit query-level ORDER BY clause:

SELECT *, ROW_NUMBER() OVER (ORDER BY B) AS RowNumber FROM T ORDER BY B

In this case, the extra ORDER BY clause does not affect the plan as the optimizer recognizes that the data is already sorted by column B following the computation of the ROW_NUMBER function.

Unfortunately, if we have multiple ROW_NUMBER functions with different ORDER BY clauses and a query-level ORDER BY clause, the optimizer does not consider whether computing the ROW_NUMBER functions in a different order might help.  For example, the following query results in an extra sort:

SELECT *,
    ROW_NUMBER() OVER (ORDER BY B) AS RowNumber_B,
    ROW_NUMBER() OVER (ORDER BY C) AS RowNumber_C
FROM T
ORDER BY B

  |--Sort(ORDER BY:([T].[B] ASC))
       |--Sequence Project(DEFINE:([Expr1004]=row_number))
            |--Compute Scalar(DEFINE:([Expr1008]=(1)))
                 |--Segment
                      |--Sort(ORDER BY:([T].[C] ASC))
                           |--Sequence Project(DEFINE:([Expr1003]=row_number))
                                |--Compute Scalar(DEFINE:([Expr1006]=(1)))
                                     |--Segment
                                          |--Sort(ORDER BY:([T].[B] ASC))
                                               |--Clustered Index Scan(OBJECT:([T].[TPK]))

On the other hand, if we simply reverse the order of the two ROW_NUMBER functions in the SELECT clause, the extra sort vanishes:

SELECT *,
    ROW_NUMBER() OVER (ORDER BY C) AS RowNumber_C,
    ROW_NUMBER() OVER (ORDER BY B) AS RowNumber_B
FROM T
ORDER BY B

  |--Sequence Project(DEFINE:([Expr1004]=row_number))
       |--Compute Scalar(DEFINE:([Expr1008]=(1)))
            |--Segment
                 |--Sort(ORDER BY:([T].[B] ASC))
                      |--Sequence Project(DEFINE:([Expr1003]=row_number))
                           |--Compute Scalar(DEFINE:([Expr1006]=(1)))
                                |--Segment
                                     |--Sort(ORDER BY:([T].[C] ASC))
                                          |--Clustered Index Scan(OBJECT:([T].[TPK]))

Like a stream aggregate, SQL Server also can eliminate the sort for a sequence project operator if there is an available index:

CREATE INDEX TB ON T(B)

SELECT PK, ROW_NUMBER() OVER (ORDER BY B) AS RowNumber_B FROM T

  |--Sequence Project(DEFINE:([Expr1003]=row_number))
       |--Compute Scalar(DEFINE:([Expr1005]=(1)))
            |--Segment
                 |--Index Scan(OBJECT:([T].[TB]), ORDERED FORWARD)

However, because the query plan for a query with multiple ROW_NUMBER functions with different ORDER BY clauses "stacks" several sequence project operators on top of a single scan, SQL Server can use only a single index in such a plan.  Thus, the plan for such a query must contain at least one sort.  Moreover, even if the plan does use an index in place of one of the sorts, if that index does not cover all columns required in the query, the optimizer must choose between using the index along with a bookmark lookup or using the clustered index and performing an extra sort.  I'll leave experimenting with these issues as an exercise.

Before I conclude this post, I'd also like to point out that SQL Server does not support using the ROW_NUMBER function without an ORDER BY clause.  Generally, using a ranking function without an ORDER BY clause does not make much sense as the results are non-deterministic.  However, there are some scenarios where you may want to omit the ORDER BY clause.  For example, you may simply want to use the ROW_NUMBER function to assign identity values.  This post by Itzik Ben-Gan and this article by Alex Kozak both discuss this issue and how to work around it.  Finally, this post by Itzik Ben-Gan shows how you can calculate row numbers on earlier versions of SQL Server that did not support the ROW_NUMBER function.

In my next post, I'll look at the other ranking functions - namely RANK, DENSE_RANK, and NTILE.
Posted by craigfr | 1 Comments
Filed under:

Halloween Protection

In a prior post, I introduced the notion that update plans consist of two parts: a read cursor that identifies the rows to be updated and a write cursor that actually performs the updates.  Logically speaking, SQL Server must execute the read cursor and write cursor of an update plan in two separate steps or phases.  To put it another way, the actual update of rows must not affect the selection of which rows to update.  This problem of ensuring that the write cursor of an update plan does not affect the read cursor is known as "Halloween protection" as it was discovered by IBM researchers more than 30 years ago on Halloween.

One simple solution to the Halloween problem is to physically separate the read and write cursors of an update plan using a blocking operator such as an eager spool or sort.  Inserting a blocking operator between the two halves of an update plan ensures that the read cursor runs in its entirety and generates all rows that it will generate before the write cursor begins executing or modifying any rows.   Unfortunately, inserting a blocking operator such as an eager spool into the update plan requires copying all rows output by the read cursor.  This copying can be quite expensive.  Fortunately, in many cases, SQL Server can determine that the write cursor will not affect the read cursor and does not need to add a blocking operator at all.

Let's look at an example:

CREATE TABLE T (PK INT, A INT)
CREATE UNIQUE CLUSTERED INDEX TPK ON T(PK)
CREATE INDEX TA ON T(A)

INSERT T VALUES (1, 1)
INSERT T VALUES (2, 2)
INSERT T VALUES (3, 3)

UPDATE T SET A = A + 10

Here is the plan for the update statement:

  |--Clustered Index Update(OBJECT:([T].[TPK]), OBJECT:([T].[TA]), SET:([T].[A] = [Expr1003]))
       |--Compute Scalar(DEFINE:([Expr1016]=[Expr1016]))
            |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1004] THEN (1) ELSE (0) END))
                 |--Compute Scalar(DEFINE:([Expr1003]=[T].[A]+(10), [Expr1004]=CASE WHEN [T].[A] = ([T].[A]+(10)) THEN (1) ELSE (0) END))
                      |--Top(ROWCOUNT est 0)
                           |--Clustered Index Scan(OBJECT:([T].[TPK]))

In this plan, the clustered index scan is the read cursor while the clustered index update is the write cursor.  Notice that this plan has no blocking operators.  Because we are modifying column A which is not part of the clustered index key, SQL Server knows that rows in the clustered index will not move due to this update and, thus, knows that there is no need to separate the scan and update operators.  Now let's use a hint to force SQL Server to scan the non-clustered index on column A:

UPDATE T SET A = A + 10 FROM T WITH (INDEX(TA))

Here is the new update plan:

  |--Clustered Index Update(OBJECT:([T].[TPK]), OBJECT:([T].[TA]), SET:([T].[A] = [Expr1003]))
       |--Compute Scalar(DEFINE:([Expr1016]=[Expr1016]))
            |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1004] THEN (1) ELSE (0) END))
                 |--Top(ROWCOUNT est 0)
                      |--Compute Scalar(DEFINE:([Expr1003]=[T].[A]+(10), [Expr1004]=CASE WHEN [T].[A] = ([T].[A]+(10)) THEN (1) ELSE (0) END))
                           |--Table Spool
                                |--Index Scan(OBJECT:([T].[TA]), ORDERED FORWARD)

Notice that this plan does include a blocking operator - specifically an eager table spool.  (The text plan as generated by SHOWPLAN_TEXT does not show that the spool is eager, but SHOWPLAN_ALL as well as the graphical and XML plans do indicate that the spool is eager.)  This time SQL Server recognizes that updating column A could cause rows to move within the index on column A which could cause the scan to return these rows more than once which would in turn lead to the plan updating the same rows more than once.  The spool ensures that we get the correct result by saving a copy of the scan output before updating any rows.

There is no way to get SQL Server to omit the spool from the above plan as this would lead to incorrect results.  However, we can simulate what would happen by using dynamic cursors.  The following batch creates a dynamic cursor to scan index TA and then updates each row before fetching the next row.  Because updates are immediately visible to dynamic cursors, this batch yields the same result as the above plan would yield if we could remove the spool.  Note that I am not suggesting that anyone should implement an update this way and, if anything, the following example nicely illustrates one of the pitfalls of dynamic cursors.

DECLARE @PK INT
DECLARE C CURSOR DYNAMIC SCROLL_LOCKS FOR SELECT PK FROM T WITH (INDEX(TA))
OPEN C
WHILE 0=0
    BEGIN
        FETCH NEXT FROM C INTO @PK
        IF @@FETCH_STATUS <> 0
            BREAK
        UPDATE T SET A = A + 10 WHERE PK = @PK
    END
CLOSE C
DEALLOCATE C

If you execute this batch, it will enter an infinite loop as it repeatedly scans, updates, and then again scans the same three rows.  On the other hand, the batch terminates properly if we change the index hint to use the clustered index (where no spool is required) or if we use a static cursor (which makes a copy of the table just like the spool):

DECLARE C CURSOR DYNAMIC SCROLL_LOCKS FOR SELECT PK FROM T WITH (INDEX(TPK))
DECLARE C CURSOR STATIC FOR SELECT PK FROM T WITH (INDEX(TA))

SQL Server can use any blocking operator, not just a spool, to provide Halloween protection.  Normally, if an update plan requires Halloween protection, SQL Server adds a spool because the spool is the cheapest blocking operator.  However, if an update plan already includes another blocking operator, SQL Server will not also add a spool.  For example, if we update column PK which has a unique clustered index, we get a sort as a by-product of maintaining the unique index.  Since this plan already has a sort, we do not also get a spool operator.

UPDATE T SET PK = PK + 10 FROM T WITH (INDEX(TPK))

  |--Index Update(OBJECT:([T].[TA]), SET:([PK1015] = [T].[PK],[A1016] = [T].[A]))
       |--Split
            |--Clustered Index Update(OBJECT:([T].[TPK]), SET:([T].[PK] = [T].[PK],[T].[A] = [T].[A]))
                 |--Collapse(GROUP BY:([T].[PK]))
                      |--Sort(ORDER BY:([T].[PK] ASC, [Act1014] ASC))
                           |--Split
                                |--Top(ROWCOUNT est 0)
                                     |--Compute Scalar(DEFINE:([Expr1003]=[T].[PK]+(10)))
                                          |--Clustered Index Scan(OBJECT:([T].[TPK]))

As I mentioned above, adding a spool is not free and does increase the cost of an update plan.  If we insert enough rows into the table, we can measure this effect.  For example, I loaded the table with 100,000 rows as follows.

TRUNCATE TABLE T
SET NOCOUNT ON
DECLARE @I INT
SET @I = 0
WHILE @I < 100000
    BEGIN
        INSERT T VALUES (@I, @I)
        SET @I = @I + 1
    END
SET NOCOUNT OFF

I then used SET STATISTICS TIME ON to measure the time to run the first two update statements above.  I ran each statement twice and reported the second time to ensure that the buffer pool was warmed up.

UPDATE T SET A = A + 10

SQL Server Execution Times:
   CPU time = 3046 ms,  elapsed time = 3517 ms.

UPDATE T SET A = A + 10 FROM T WITH (INDEX(TA))

SQL Server Execution Times:
   CPU time = 4391 ms,  elapsed time = 4666 ms.

As you can see, the same update took over 30% longer in elapsed time and over 40% longer in CPU time.  I ran this experiment on a Pentium Xeon 2.2 GHz workstation with 2 GB of RAM, Windows Server 2003 SP2, and SQL Server 2005 SP2.  Other system configurations may yield different results.

Finally, although I've used update statements for all of the examples in this post, some insert and delete statements also require Halloween protection, but I'll save that topic for a future post.
Posted by craigfr | 2 Comments
Filed under:

Maintaining Unique Indexes with IGNORE_DUP_KEY

A few months ago, I wrote a post describing how SQL Server maintains unique indexes while avoiding false uniqueness violations.  In this post, I'm going to look at how SQL Server maintains unique indexes that were created with the WITH IGNORE_DUP_KEY clause.  Normally, if we attempt to insert a duplicate key into a unique index, SQL Server fails the entire statement, rolls back any changes made by that statement, and returns an error.  However, if we attempt to insert a duplicate key into a unique index that was created with the WITH IGNORE_DUP_KEY clause, SQL Server merely discards the "bad" row, generates a warning message, and continues executing the remainder of the statement.  Note that SQL Server only generates the warning for actual insert statements.  Update statements that try to create a duplicate key still fail with an error.

Let's begin with a trivial insert into a unique clustered index:

CREATE TABLE T_SRC (A INT, B INT)

CREATE TABLE T_DST (A INT, B INT)
CREATE UNIQUE CLUSTERED INDEX T_DST_A ON T_DST(A) WITH IGNORE_DUP_KEY

INSERT T_DST SELECT * FROM T_SRC

This insert yields the following rather mundane query plan:

  |--Clustered Index Insert(OBJECT:([T_DST].[T_DST_A]), SET:([T_DST].[A] = [T_SRC].[A],[T_DST].[B] = [T_SRC].[B]))
       |--Top(ROWCOUNT est 0)
            |--Table Scan(OBJECT:([T_SRC]))

If you were expecting something more exciting, I'm sorry for the disappointment.  So what's going on here?  SQL Server simply tries to insert each row.  If any insertion fails due to a duplicate key, SQL Server just converts the error into a warning, discards the failed row, and continues.  Since we only have the one clustered index and no non-clustered indexes to maintain, SQL Server either inserts or does not insert each row into the single clustered index.  Either way, the integrity of the table is preserved.

Now let's create a non-clustered index and see what happens:

CREATE UNIQUE INDEX T_DST_B ON T_DST(B)

INSERT T_DST SELECT * FROM T_SRC

As it turns out, aside from adding the extra index to the clustered index insert, nothing changes:

  |--Clustered Index Insert(OBJECT:([T_DST].[T_DST_A]), OBJECT:([T_DST].[T_DST_B]), SET:([T_DST].[A] = [T_SRC].[A],[T_DST].[B] = [T_SRC].[B]))
       |--Top(ROWCOUNT est 0)
            |--Table Scan(OBJECT:([T_SRC]))

Even with two indexes (one clustered and one non-clustered) to maintain, SQL Server continues to use the same strategy of inserting rows and waiting to see whether any of the inserts fail due to duplicate keys.  This strategy works because SQL Server always maintains the clustered index first.  If it successfully inserts the row into the clustered index, it can safely proceed to insert the row into the non-clustered index.  If it cannot insert the row into the clustered index, it can discard the row without any risk of corrupting the indexes.  Once again, either way, the integrity of the table is preserved.

It turns out that this strategy works perfectly so long as only the clustered index was created with the WITH IGNORE_DUP_KEY clause.  However, if we have even one non-clustered index that was created with the WITH IGNORE_DUP_KEY clause, this strategy no longer works.  Imagine what would happen if SQL Server were to insert a row into the clustered index only to encounter a duplicate key violation when trying to insert the same row into the non-clustered index.  At this point, SQL Server could not simply issue a warning and discard the row as this would leave the clustered and non-clustered indexes inconsistent.  Let's create a non-clustered index with the WITH IGNORE_DUP_KEY clause and see how SQL Server handles this case:

DROP INDEX T_DST.T_DST_B
CREATE UNIQUE INDEX T_DST_B ON T_DST(B) WITH IGNORE_DUP_KEY

INSERT T_DST SELECT * FROM T_SRC

Finally, an interesting query plan:

  |--Clustered Index Insert(OBJECT:([T_DST].[T_DST_A]), OBJECT:([T_DST].[T_DST_B]), SET:([T_DST].[A] = [T_SRC].[A],[T_DST].[B] = [T_SRC].[B]))
       |--Top(TOP EXPRESSION:((1)))
            |--Segment
                 |--Sort(ORDER BY:([T_SRC].[B] ASC))
                      |--Assert(WHERE:(CASE WHEN [Expr1012] IS NOT NULL THEN (0) ELSE NULL END))
                           |--Nested Loops(Left Semi Join, OUTER REFERENCES:([T_SRC].[B]), DEFINE:([Expr1012] = [PROBE VALUE]))
                                |--Top(ROWCOUNT est 0)
                                |    |--Table Scan(OBJECT:([T_SRC]))
                                |--Index Seek(OBJECT:([T_DST].[T_DST_B]), SEEK:([T_DST].[B]=[T_SRC].[B]) ORDERED FORWARD)

There are two parts of this plan that handle duplicate keys.  The first part includes the left semi-join, index seek, and assert operators.  The second part includes the sort, segment, and top operators.

For each row that we intend to insert, the left semi-join, index seek, and assert operators determine whether a matching row with the same key already exists in the non-clustered index.  If they find a matching row, they issue the duplicate key warning, discard the row, and continue.  Let's explore exactly how these operators accomplish this task.  First, the table scan returns a row to insert.  The semi-join uses the index seek on the non-clustered index to determine whether a row with the same key already exists in the index.  Recall that a left semi-join ordinarily returns rows from the left (or first) input that join with at least one row from the right (or second) input.  This semi-join is special.  It returns all rows, whether or not they join, and outputs a PROBE column that indicates whether or not each row in fact joined.  (For two other examples of a semi-join with a probe column, see my subquery posts here and here.)  Finally, the assert operator checks the value of the PROBE column.  If the index seek does not return a matching row, the new row is not a duplicate and the assert operator returns it normally.  If the index seek does return a matching row, the assert operator issues the warning and discards the row.

The sort, segment, and top operators check for duplicates in the stream of input rows.  If they find a duplicate input row, they issue the duplicate key warning, discard the duplicate row, and continue.  Again, let's explore exactly how these operators execute this task.  The sort simply ensures that any duplicate input rows are clustered one after the other in the input stream.  The segment operator compares adjacent rows checking for duplicates.  It performs two functions in this plan.  First, if it finds a duplicate row, it issues the warning.  Second, it "flags" groups of duplicate rows.  Unlike the assert operator, the segment operator does not discard any rows.  Instead, the top operator discards the duplicates.  The top operator in this plan is a special type of top known as a segment top.   A normal top returns the first N rows and then stops.  A segment top returns the first N rows from each group of duplicate rows flagged by the segment operator.  In this case, the segment top returns only the first row from each group and, thus, eliminates the duplicates.

Any rows that are not discarded by the time they reach the insert operator must be unique and can safely be inserted without triggering a uniqueness violation.

If we create multiple non-clustered indexes with the WITH IGNORE_DUP_KEY clause, one set of these extra operators is inserted for each non-clustered index.  Note also that the plan needs these additional operators only for the non-clustered index.  The clustered index is handled as described at the beginning of this post.

Finally, I want to comment briefly on the performance implications of the WITH IGNORE_DUP_KEY clause.  On a clustered index, there is no plan change and the performance implications are negligible.  Unfortunately, on a non-clustered index, all of the additional operators discussed above are costly and do result in a measurable performance reduction.

Posted by craigfr | 3 Comments
Filed under:

Partial Aggregation

In some of my past posts, I've discussed how SQL Server implements aggregation including the stream aggregate and hash aggregate operators.  I also used hash aggregation as an initial example in my introductory post on parallel query execution.   In this post, I'll look at a partial aggregation.  Partial aggregation is a technique that SQL Server uses to optimize parallel aggregation.  Before I begin, I just want to note that I also discuss partial aggregation in Inside Microsoft SQL Server 2005: Query Tuning and Optimization.  (See the bottom of page 187.)

Let's begin with a simple scalar aggregation example.  Recall that a scalar aggregate is an aggregate without a GROUP BY clause.  A scalar aggregate always produces a single output row.

CREATE TABLE T (A INT, B INT IDENTITY, C INT, D INT)
CREATE CLUSTERED INDEX TA ON T(A)

SELECT COUNT(*) FROM T

Not surprisingly, this query yields a trivial stream aggregate plan:

  |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1005],0)))
       |--Stream Aggregate(DEFINE:([Expr1005]=Count(*)))