Welcome to MSDN Blogs Sign in | Join | Help

More on TOP

Last week I wrote about a special case of the TOP operator known as ROWCOUNT TOP.  This week I'll take a look at some other interesting TOP scenarios.  In general, TOP is a fairly mundane operator.  It simply counts and returns the specified number of rows.  SQL Server 2005 does include two enhancements to TOP that were not present in SQL Server 2000.  First, in SQL Server 2000, we can only specify an integer constant for the number of rows to return.  In SQL Server 2005, we can specify an arbitrary expression, including an expression containing T-SQL variables or parameters.  Second, SQL Server 2000 only permits TOP in a SELECT statement (though it does support ROWCOUNT TOP in INSERT, UPDATE, and DELETE statements).  SQL Server 2005 permits TOP with SELECT, INSERT, UPDATE, and DELETE statements.

In this post, I'm going to focus only on some simple examples involving SELECT statements.  Let's create a small table to get started:

CREATE TABLE T (A INT, B INT)
CREATE CLUSTERED INDEX TA ON T(A)

SET NOCOUNT ON
DECLARE @i INT
SET @i = 0
WHILE @i < 100
  BEGIN
    INSERT T VALUES (@i, @i)
    SET @i = @i + 1
  END
SET NOCOUNT OFF

The plan for the simplest possible TOP query should not need any explanation:

SELECT TOP 5 * FROM T

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5)))
5      1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

TOP is often used in conjunction with ORDER BY.  Combining TOP with ORDER BY adds determinism to the set of rows returned.  Without the ORDER BY, the set of rows returned depends on the query plan and may even vary from execution to execution.  If we have an index to provide order, we continue to get a simple query plan (notice the ORDERED FORWARD keywords):

SELECT TOP 5 * FROM T ORDER BY A

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5)))
5      1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]), ORDERED FORWARD)

If we do not have an index to provide order, SQL Server must sort the data:

SELECT TOP 5 * FROM T ORDER BY B

Rows   Executes
5      1        |--Sort(TOP 5, ORDER BY:([tempdb].[dbo].[T].[B] ASC))
100    1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Note that without an index to find the top 5 rows, SQL Server must scan all 100 rows in the input table.  Also note that the sort is actually a "top sort."  A top sort generally uses less memory than a regular sort as it only needs to identify and sort the top few rows rather than sorting the entire input.

Now let's consider what happens if we request the top 5% of rows.  To figure out the actual number of rows to return, SQL Server must count all of the rows and calculate 5%.  This makes queries that use TOP PERCENT less efficient than queries that use TOP with an absolute row count.

SELECT TOP 5 PERCENT * FROM T

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5.000000000000000e+000)) PERCENT)
5      1             |--Table Spool
100    1                 |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Like the prior example, SQL Server must scan all 100 rows in the input table.  In this example, SQL Server uses an eager spool which reads and counts all of the input rows before returning any rows.  The top then requests the row count from the spool, calculates 5%, and proceeds like any other top.

If SQL Server must sort, the sort can also provide the count of input rows.  However, only a regular sort can provide this count.  A top sort must know the number of rows to return from the start.

SELECT TOP 5 PERCENT * FROM T ORDER BY B

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5.000000000000000e+000)) PERCENT)
5      1             |--Sort(ORDER BY:([tempdb].[dbo].[T].[B] ASC))
100    1                  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

TOP WITH TIES is also incompatible with top sort.  TOP WITH TIES cannot know for certain how many rows will be returned until the number of ties is determined.  For this example, let's add a tie for the fifth row returned:

INSERT T VALUES (4, 4)

SELECT TOP 5 WITH TIES * FROM T ORDER BY B

Rows   Executes
6      1        |--Top(TOP EXPRESSION:((5)))
7      1             |--Sort(ORDER BY:([tempdb].[dbo].[T].[B] ASC))
101    1                  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Although the above plan does not seem to reflect that we have a TOP WITH TIES, the argument column of SHOWPLAN_ALL or STATISTICS PROFILE includes this information:  "TIE COLUMNS:([T].[B])".  This information is also available in the graphical and XML plans on SQL Server 2005.  Note that the TOP actually returns one extra row.  When a TOP N WITH TIES reaches the Nth row, it keeps a copy of the tie columns for this row (in this example B==4) and compares each subsequent row to that row.  As long as there are rows that match, it keeps returning them.  Because the top must compare subsequent rows until it finds a non-match to the Nth row, the top retrieves one more row from the sort than it returns.

Finally, there are a couple of special cases.  The optimizer knows that TOP 0 and TOP 0 PERCENT never return any rows and replaces any query plan with a TOP 0 with a constant scan:

SELECT TOP 0 * FROM T

  |--Constant Scan

The optimizer also knows that TOP 100 PERCENT always returns all rows and removes the top operator from the query plan:

SELECT TOP 100 PERCENT * FROM T

  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Both of these optimizations require that the number of rows is a constant.  For example, using an expression that includes a T-SQL variable or parameter will prevent the optimization.  Both optimizations also work with INSERT, UPDATE, and DELETE statements.

Note that using TOP to work around SQL language restrictions on the use of ORDER BY in sub-selects or in views or to force specific plan evaluation orders is not recommended.

Published Wednesday, August 01, 2007 9:13 AM by craigfr
Filed under:

Comments

# re: More on TOP

Hi Craig,

Thanks again for continuing to post.

I have some questions regarding the table spool in this query plan.

Rows   Executes

5      1        |--Top(TOP EXPRESSION:((5.000000000000000e+000)) PERCENT)

5      1             |--Table Spool

100    1                 |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

I understand that a Table spool is a physical operator that scans the input and places a copy of each row in a table that is stored in the tempdb database.  So in this plan, are we actually creating a copy of the table T and then scanning the Table Spool in Tempdb in order to answer the query: SELECT TOP 5 PERCENT * FROM T? or the server only stores a set of pointers? is there any particular reason why the optimizer chooses this path? what happened if the table is huge? Wouldn't it be more efficient (memory wise) to remove the Table Spool and have a second Clustered Index Scan for the total of rows we want to retrieve?

Thanks again for your post.

Thursday, August 09, 2007 2:45 PM by AlexLopez

# re: More on TOP

The table spool does create a copy of the table though it only copies those columns that are specified in the SELECT list of the query.  In some cases, this example is likely one of those cases, it might be faster to perform two clustered index scans.  However, if the table has many large columns that are not output, the spool could actually make the query faster by eliminating unnecessary columns from the second scan and, thus, reducing the number of I/Os.  More importantly, if the input to the spool is a complex operation such as an aggregation or a join, it could be much faster to spool the results than to recompute them.

Wednesday, August 15, 2007 12:57 PM by craigfr

# re: More on TOP

Hi Craig, thanks for the interesting blog.

I have a question on the simplest of selects:

SELECT TOP 5 * FROM T

Rows   Executes

5      1        |--Top(TOP EXPRESSION:((5)))

5      1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Why does SQL Server do an entire clustered index scan when it only needs the top 5 (unordered) entries? Or am I misunderstanding what happens?

Thanks,

Tommy.

Monday, November 19, 2007 7:28 AM by thomas.hayes

# re: More on TOP

The presence of the scan operator does not mean that SQL Server scans the entire table.  In fact, as the statistics profile output shows, SQL Server actually only scans 5 rows.  SQL Server executes this query plan by requesting one row at a time from the top operator which requests one row at a time from the scan operator.  Once the top operator has returned 5 rows, it terminates the scan (regardless of the number of rows in the table) and returns that execution is complete.

Monday, November 26, 2007 10:38 AM by craigfr
Anonymous comments are disabled
 
Page view tracker