Welcome to MSDN Blogs Sign in | Join | Help

Understanding SQL Server Fast_Forward Server Cursors

SQL Server's server cursor model is a critical tool to many application writers.  Fast_forward cursors are very popular as an alternative to read_only forward_only cursors, but their inner workings are not well-publicized.  So I thought I'd give it a go.

Background

A server cursor is a cursor managed by SQL Engine.  It consists of a query execution and some runtime state, including a current position.  SQL clients can use a server cursor to fetch results of a query one-at-a-time, or in batches of N-at-a-time, rather than by the usual all-at-a-time, firehose, default query result set.  Many applications and client libraries have iteration models, but server cursors are the only way for SQL Server engine to do incremental query processing.

Server cursors' popularity is derived from three advantages:

1.       Matching very well the iterative model of programming that non-database programmers are most comfortable with.  This advantage is more perceived than actual, and does more harm than good.  Books Online puts it well: 

           

            If the data that is to be retrieved will be consumed all at once, and there is no need for       positioned updates or scrolling, default result sets are recommended.

 

2.       Reducing unnecessary processing of unused data when a client application stops reading data early.  Because in many cases (though not all), server cursor processing is incremental and on-demand, server load and data transmission can be reduced.

3.       Mapping very well to screen-at-a-time or window-at-a-time viewing and scrolling through large data sets.  There is simply no better way to tell SQL Server to "give me the next row(s)."

SQL Server engine has four server cursor models:  static, keyset, dynamic, and fast_forward.  [Apologies for not using the ADO.NET terminology -- it tends to confuse me.  How various settings on ADO.NET and ODBC commands translate to server cursor models is a topic for another time.]  There is lots of further background on the models and their differences in SQL Server 2008 Books Online's section on cursors.  But my concern here is to demystify fast_forward cursors, about which Books Online say very little, perhaps worse than nothing:

FAST_FORWARD

Specifies a FORWARD_ONLY, READ_ONLY cursor with performance optimizations enabled.

What's a fast_forward cursor in a nutshell?

It's a cursor model equivalent to read_only, forward_only that compiles to a static-like or dynamic-like cursor plan.  It does not downgrade to other cursor models, like dynamic and keyset do.

Aren't fast_forward cursors redundant?  We already have read_only forward_only cursors.  Why do we need them?

Fast_forward is redundant.  Read_only forward_only cursors are sufficient for many apps.  But some apps got poor query plans.  Read_only forward_only cursors are dynamic cursors, and dynamic cursors use a dynamic plan if one is available.  The reason this is a problem is that sometimes, the best dynamic plan is much, much worse than a static one.  Fast_forward cursors take a more balanced approach, choosing a static plan if it looks better.

When should I use fast_forward vs. read_only forward_only?

On balance, fast_forward is a little better. [1][1] However, performance testing should be done before a final decision is made for a particular application.  This is because the way the decision is made to use a dynamic-like or a static-like plan is fundamentally different (see below).  Either way, index tuning and plan hints may be an important part of tuning your cursors, whichever cursor model you choose.

What's a dynamic plan?

A dynamic plan can be processed incrementally.  In SQL Server we do this by serializing the state of the query execution into what we call a marker.  Later, we can build a new query execution tree, use the marker to reposition each operator.  Moreover, a dynamic plan can move forwards and backwards relative to its current position.  Dynamic plans are used by both dynamic and some fast_forward cursors.

A dynamic plan consists only of dynamic operators -- operators that support markers and moving forwards and backwards.  This corresponds closely, but not exactly, to the query processing notion of streaming operators (vs. stop-and-go).  But not every streaming operator is dynamic.  In SQL Server, dynamic means:

1.       The operator can be repositioned to its current position using a marker, or to a relative position (either next or previous) from its current one.

2.       The operator's state has to be small, so the marker can be small.  No row data can be stored in the operator.  In particular, no sort table, hash table, or work table.  Not even one row can be stored, since a single row can be very large.

Without a dynamic plan, the cursor would need temporary storage to keep the query result set (or keyset thereof).  A dynamic plan does no such thing!  However, certain operators are disqualified -- hash join, hash agg, compute sequence, and sort, for example.  This leads to sub-optimal plans.

When is a dynamic plan worse than a static one?

In some queries, like those that use row_number, a dynamic plan is simply not available.  The trouble comes when there are both dynamic and static alternatives.  This can happen when an operator, e.g. join, has dynamic (nested loops) and non-dynamic (hash) implementations.  This can also happen when some indexes support interesting orders and some do not.

Here is an example of the latter drawn from a real customer case.  Table ORDERS has an index on DATE and an index on SUBTOTAL (clustered/non-clustered is not relevant for the example).  The application wants to see very large orders:

      SELECT DATE, SUBTOTAL, ORDERID, CUSTOMERID

      FROM ORDERS where SUBTOTAL > 10000

      ORDER BY DATE

 

For argument's sake, say there are 100 orders this large out of a table of 100 million orders.  A dynamic cursor cannot sort, so it must use the DATE index, touching all the rows, filtering on SUBTOTAL.  A static cursor has the option of seeking the SUBTOTAL index for the range >10000, sorting the rows, and storing them in the cursor.

How do fast_forward cursors solve this problem?

Fast_forward cursors under certain conditions[2][2] choose the cheaper of the best static and best dynamic plans.  In an extreme case like the above, it can use a static plan.  It should be the best of both worlds. 

What's the hitch?

The optimizer relies on cost estimates to decide on the cursor model.  Sometimes, when it picks static rather than dynamic, it will simply be wrong, and the application will be very slow.  There are three reasons why this happens.

1.       The cost of a query depends on the "row goal":  the number of rows estimated to be fetched.  In many cursor applications, user are viewing a screen with a few rows and a next button or scroll bar.  If the user is likely to scroll through all the data, the plan cost is different than if the user is likely to quit after viewing the first page.  We have to pick a row goal at compile time to estimate a cost, but the user's behavior is unknown at that time.  Typically, static plans are not sensitive to row goal, while dynamic plans are critically sensitive.  In the case above, the dynamic plan cost scales linearly with the row goal (and when there are joins, it's super-linear).  That makes the dynamic plan superior by an arbitrary factor for low row goals, and inferior by an arbitrary factor for high row goals.  The factors depend on query and data parameters (such as the 10000, and the data distribution, in the example).  Application writers can control row goals by specifying OPTION (FAST <N>) on their cursor queries, but often they cannot predict the correct value either.

2.       Parameterized queries are compiled based on the parameters at hand, and then reused for different parameters.  In the example above, it makes a big difference whether the cut-off is at 10000 or at 10.  This is no different from non-cursor queries.

3.       The choice of plan is subject to estimation errors, which can be extreme, especially when combined with the previous factors.  A typical flavor of disaster involves two indexes like in the example, where the server estimates that the predicate is super selective, and therefore will scan half the table before finding the first qualifying row, therefore picking the static plan.  However, if there is a qualifying row on the first page, and the user only wants the first row, then the dynamic plan is very, very cheap.

What can an application developer do?

For canned queries, via trial and error, you can use hints to get you the cursor plan you want.  For query generators, it is much more difficult, and there's no one solution.  You do have a few levers at your disposal, though.

1.       Picking a neutral value for OPTION (FAST <N>).

2.       Avoiding plan reuse for different parameters by calling sp_cursorprepare or by using OPTION (RECOMPILE).

3.       Avoiding ORDER BY in cursors when not really needed.

4.       Use equality predicates and multi-column indexes to support interesting orders efficiently.  Modifying the example slightly, if we had a computed column SIZE with values {S,M,L,XL} for ranges of SUBTOTAL, we could query WHERE SIZE = 'XL' ORDER BY DATE, and index on <SIZE, DATE>.  In general, extend the index columns to include the equality columns followed by the ordering columns.

 

 

By Marc Friedman, 8/2009

Marc.friedman@donotreply.microsoft.com

 

Note to self:  ref:  VSTS# 273442

 



 



[1][1] For SQL Server 2005 users, it's worth mentioning that there is a performance enhancement to fast-forward prefetched I/O in SP3 Cumulative Update 5 that brings fast_forward up to parity.  SQL Server 2008 and beyond, this is a non-issue.

[2][2] In situations where the dynamic plan looks promising, the cost comparison may be heuristically skipped.  This occurs mainly for extremely cheap queries, though the details are esoteric.  In that case no difference between dynamic and fast_forward will be noticed.

Posted by queryproc | 0 Comments

Distinct Aggregation Considered Harmful

Distinct aggregation (e.g. select count(distinct key) …) is a SQL language feature that results in some very slow queries.  It’s particularly frustrating that you can take a perfectly efficient query with multiple aggregates, and make that query take forever just by adding a distinct keyword to one of the aggregates.  For instance, it often makes logical sense to compute distinct counts “along the way” while you are computing other aggregates.  Recently, I've seen a number of customer queries like this:

select

sum(salary),

max(salary),

count(employeeid),

count(distinct employeeid),

count(distinct salarygrade)

 

In practice, however, SQL Server 2008 does not compute these distinct aggregates “along the way.”  Mixing one or more distinct aggregates with non-distinct aggregates in the same select list, or mixing two or more distinct aggregates, leads to spooling and rereading of intermediate results – which can be more expensive than computing the distinct aggregates in a separate query!  In the worst case, such as the fragment of code above, this is sort of inevitable.  The SQL Server 2008 query processor cannot compute aggregates on a stream without destroying the stream.  So computing the distinct aggregate consumes the stream, which has to be recomputed for the other aggregates.

Fortunately, for select lists with only one distinct aggregate, you can rewrite the input SQL in a way that does not have this problem.  SQL Server 2008’s optimizer does not do this rewrite for you.  (Disclaimers:  all the aggregates have to be algebraic [Gray et al 1996], and no guarantees are made that the behavior of the rewrite is exactly the same, particularly when there are arithmetic overflows.)

Getting Rid of Distinct Aggregates

The code below shows an example rewrite using the AdventureWorksDW database distributed with SQL Server 2008.  The rewrite gives me 5x speedup on my desktop even on this small database!  What’s going on is that we are breaking the aggregation into two aggregation steps.  First, we build an intermediate result called PartialSums.  We group together all the values for the distinct aggregate (adding the distinct aggregate’s column to the GROUP BY list), while building partial aggregations for all the non-distinct aggregates (in this example, just count(*)).  Then in the second step, we use the original GROUP BY list, which in this example is empty, and collect the partial aggregates together.  Note how the aggregate functions used depend on the initial aggregates:  count becomes count, followed by sum.  Note that in sum(1), the 1 is actually a constant number value, not a column reference.

As far as I can tell, this rewrite is never worse than the spooled plan.  It parallelizes better, uses tempdb better, and does less logical I/O.

set statistics profile on

set statistics time on

go

use adventureworksdw

go

--

-- Use a distinct aggregate and a normal aggregate in the same select list

-- over some complex (one or more joins) derived table

--

with FISinFRS as (

 select * from factinternetsales FIS

  where ProductKey in

   (select ProductKey from FactResellerSales)

 

)

select

      count(*)                            as CountStar,

      count(distinct ProductKey)          as CountProductKeys

from FISinFRS

go

 

--

-- Now use partial aggregations in a new derived table

-- This is equivalent, but SQL Server 2008 does not do this transformation

-- for you

--

with FISinFRS as (

 select * from factinternetsales FIS

  where ProductKey in

   (select ProductKey from FactResellerSales)

 

)

, PartialSums as (

  select

      count(*)                            as CountStarPartialCount

  from FISinFRS

  group by ProductKey

)

select

      sum(CountStarPartialCount)          as CountStar,

      sum(1)                              as CountProductKeys

from PartialSums

 

References

[Gray et al 1996] Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals (1996), ftp://ftp.research.microsoft.com/pub/tr/tr-95-22.pdf.

 

 

By Marc Friedman, 9/2008

Marc.friedman@donotreply.microsoft.com

 

 

Store Statistics XML in database tables using SQL Traces for further analysis.

Since SQL Server 2005, query plan as well as statistics of query execution can be captured  in XML format. Also, SQL Server 2005 has XQuery support to directly query XML document. By combining these two new features, users can analyze the query plans using queries.

 

However, in SQL Server, there is no easy way to capture the statitics XML into a table. Fortunately, there are SQL traces provided by SQL Server to capture the showplan XML and statistics XML information into trace files and loaded into tables.

 

Note: The only limitation is the 128 level of nesting levels supported by XML data type in SQL 2005. In that case, you have to write client code to parse the query plan, which going to be a very complex query plan.

 

Here is a small example using SQL traces to store the statistics XML and extract the estimated rows and actual rows

 

/*Using Traces to Capture Statistics XML*/

declare @trace_id int

declare @trace_file nvarchar(200)

select @trace_file = 'c:\temp\test_stats_' + cast(newid() as varchar(100))

 

-- using trace table.

exec sp_trace_create @trace_id output,

      2,

      @tracefile=@trace_file

 

-- capture statistics-xml, textdata, on

exec sp_trace_setevent @trace_id, 146, 1, 1

 

-- start

exec sp_trace_setstatus @trace_id, 1

 

-- test statement.

select * from sys.objects

 

-- stop

exec sp_trace_setstatus @trace_id, 0

 

-- close

exec sp_trace_setstatus @trace_id, 2

 

-- load trace files into table

if object_id('temp_trc') is not null

      drop table temp_trc

 

select *

into temp_trc

from fn_trace_gettable(@trace_file + '.trc', default)

 

-- look at the captured stats xml

declare @plan xml

 

select @plan=cast(textdata as xml)

from temp_trc

where eventclass = 146;

 

-- collect the actual and also estimate stats.

with XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/showplan' as sql)

select

      ro.relop.value('@NodeId', 'int') NodeId,

      ro.relop.value('@PhysicalOp', 'nvarchar(200)') PhysicalOp,

      ro.relop.value('@LogicalOp', 'nvarchar(200)') LogicalOp,

      (ro.relop.value('@EstimateRows', 'float')

            * (ro.relop.value('@EstimateRewinds', 'float')

                  +  ro.relop.value('@EstimateRebinds', 'float')

                  + 1.0)) EstimateRows,

      case

            when root_actual.ActualRows = 0

                  then null

            else root_actual.ActualRows

      end ActualRows,

      cast(ro.relop.exist('*/sql:RelOp') as bit) IsNotLeaf

from @plan.nodes('//sql:RelOp') as ro(relop)

      cross apply (

            select sum(rti.info.value('@ActualRows', 'float')) ActualRows

            from ro.relop.nodes('sql:RunTimeInformation/sql:RunTimeCountersPerThread') as rti(info)

      ) root_actual;

go

 

The output of the estimate rows and actual rows is given below:

 

NodeId

PhysicalOp

LogicalOp

EstimateRows

ActualRows

IsNotLeaf

0

Nested Loops

Left Outer Join

52

52

1

1

Nested Loops

Left Outer Join

52

52

1

2

Filter

Filter

52

52

1

3

Compute Scalar

Compute Scalar

52

NULL

1

4

Clustered Index Scan

Clustered Index Scan

52

52

0

13

Clustered Index Seek

Clustered Index Seek

52.000031

NULL

0

14

Clustered Index Seek

Clustered Index Seek

52

52

0

Index Build strategy in SQL Server - Part 4-2: Offline Serial/Parallel Partitioning (Non-aligned partitioned index build)

Source Partitioned
While the table is partitioned, we may want to change the way it is partitioned when building the new index. For example, by using the same partition function and scheme, the new index can be partitioned on different columns than the original table.

Create table t (c1 int, c2 int) on ps(c2)

…….

Create clustered Index idx_t on t(c1) on ps(c1)

The serial plan looks like follows.
    Index Insert
       |
     Sort
       |
      NL (Nested Loop)
     /    \
 CTS   Scan
CTS is the Constant Table Scan. It scans each partition one by one and provides partition ID to the inner side (the lower part in graphics showplan) of NL. The inner side of NL scans the corresponding partition and sends data to the sort iterator. From there, it is exactly the same as source non-partitioned scenario. Not surprisingly, the memory and disk space requirement is the same too.

In the case of parallel plan, we have
      X (Distribute Streams)
       |
     Index Insert
       |
     Sort
       |
       X (Repartition Streams)
       |
      NL
     /   \
   X    Scan
  /
CTS
The operator above CTS is a Gather Streams operator, meaning it has one producer and many consumers. There is no parallelism below this operator. Between the Gather Streams and the Repartition Streams, each worker is assigned (Number of Source's Partitions)/(Degree of Parallelism) number of source partitions. The source is only scanned once in total.

The Repartition Streams operator splits the query plan into two parallelism sections. Between the top-level Distribute Streams and the Repartition Streams, we have a different set of workers than the worker set below Repartition Streams. Each worker is assigned (Number of Target Index Partitions)/(Degree of Parallelism) target partitions. The target index partition information is pushed down to the Repartition Streams which redirects data to different sort tables based on target partition location. The rest, i.e. how the sort and index building works, is the same as the parallel plan in the source non-partitioned case. Again, the memory and disk space requirement is the same as in the source non-partitioned case.

Posted by queryproc | 1 Comments
Filed under:

Index Build strategy in SQL Server - Part 4-1: Offline Serial/Parallel Partitioning (Non-aligned partitioned index build)

Recall that in the previous posts on index build, we defined "aligned" as the case when base object and in-build index use the same partition schema, and "non-aligned" to be the case when heap and index use different partition schemes, or the case when heap is not partitioned. In this post, we will talk about the two scenarios of non-aligned partitioned index build, source partitioned and source not partitioned.

Source Not Partitioned
Consider the following query.

Create Partition Function pf (int)

as range right for values (1, 100, 1000)

Create Partition Scheme ps as Partition pf

ALL TO ([PRIMARY])

Create table t (c1 int, c2 int) –the table is created on the primary filegroup by default

Create clustered Index idx_t on t(c1) on ps(c1) -- non-aligned index build

The serial plan is straightforward.

 Index Insert (write data to the in-build index)
   |
 Sort (order by index key)
   |
 Scan (read data from source)

The sort iterator creates one sort table per target partition (there are four partitions in this example so we will construct four sort tables concurrently). By default, we use the user database for sort to spill data. As we mentioned before, we free each extent from sort table after all its rows are copied. By doing this, for each partition, we can reduce the disk space requirement from 3*partition size (source + sort table + b-tree) to just about 2.2*partition size. Therefore, each file group requires 2.2*(size of all partitions that belong to this file group) of disk space. If the users specify sort_in_tempdb, then all the sort tables reside in the tempdb. Therefore, we require 2.2*(Size of the whole index) of free space in tempdb.

Index insert iterator can start building index after the sort iterator finishes sorting all sort tables. Therefore, we will have as many sort tables as the number of partitions at the same time. Recall that each sort table requires at least 40 pages. So, the minimum required memory will be #PT*40pages.

When it comes to parallel plan, it looks like

 X (Exchange)
   |
 Index Insert
   |
 Sort
   |
 Scan

Each worker thread is assigned (Partition Count)/(Degree Of Parallelism) number of partitions (e.g. if we have 4 partitions and 4 worker threads each gets 1 partition), which can be skewed. The sort iterator creates one sort table per assigned partition. Each worker scans the source once and retrieves the rows that belong to its partition(s), the retrieved rows will be inserted into the corresponding sort table depending on which partition they belong to.

After all sort tables got populated, the index builder starts consuming rows from sort tables, it consumes one sort table after another, building b-tree(s) in each partition's file group. Currently workers do not share partitions. Therefore, it is possible for one worker to finish all assigned partitions and idle while another worker is busy inserting rows.

The disk space and memory requirements are exactly the same as the serial plan. This is because in both cases, we cannot start building the index until all the sort tables are populated.

Posted by queryproc | 1 Comments
Filed under:

How to Check Whether the Final Query Plan is Optimized for Star Join Queries?

The star join optimization technique is an index based optimization designed for data warehousing scenarios to make optimal use of non-clustered indexes on the huge fact tables. The general idea is to use the non-clustered indexes on the fact table to limit the number of rows scanned from it. More details of index based star join optimization can be found at http://blogs.msdn.com/bi_systems/pages/164502.aspx.

 

The following discussion is based on SQL Server 2005 query plans. 

 

In SQL Server 2005, we put "StarJoinInfo" element in Showplan XML to highlight star join optimization. If the query plan contains the “StarJoinInfo” element, then SQL Server has identified this plan as a star join plan and it definitely is one.

 

However, the query optimizer may not detect all star join plans due to star join detection restrictions. Hence there are some star join plans that won’t have the StarJoinInfo. This post will shed some light on how to manually detect if a given query plan is a star join plan. 

 

These steps can help you identify what’s NOT a star join plan:

  • First identify your fact table.
  • If you see clustered index scan (or table scan) on fact table, then it’s not an index-based star join plan (however, this is a valid multi-table join plan, which can benefit from multiple-bitmap filter pushdown).

To identify a star join plan, you should:

  • Again, first identify your fact table
  • You should have a single RID lookup or clustered index seek on the fact table
  • Restrictive dimensions (dimension tables with restrictive filters in the query) should be processed before processing the clustered index seek or RID lookup on the fact table. You should find either:
    • A Cartesian product between the dimensions joined with a multi-column index on the fact table.
    • Or a semi-join of the dimensions with some non-clustered indexes on the fact table.
    • Or a join between multiple dimensions.
  • Non-restrictive dimensions are joined later with the fact table.

So the rule of thumb is: to detect whether a given plan is using index-based star join optimization, you should always look for a seek on a fact table that is based on some joins of some dimension tables.

Hash Warning SQL Profiler Event

One of the less well-known warning events that is logged to SQL Profiler trace is the Hash Warning event.  Hash Warning events are fired when a hash recursion or hash bailout has occurred during a hashing operation.  Both of these situations are less than desirable, as they mean that a Hash Join or Hash Aggregate has run out of memory and been forced to spill information to disk during query execution.  When a hashing operation spills to disk, this almost always results in slower query performance and can cause space consumption increase in tempdb.

 

Note that the Hash Warning event needs to be explicitly enabled within SQL Profiler.  It is not one of the “default” set of events.  More info on SQL Profiler can be found here

 

What can be done if you see a lot of Hash Warning events?  The recommended actions are:

 

·         Make sure that statistics exist on the columns that are involved in the hashing operation.  Without statistics, the hashing operation has no way to know how much memory to pre-allocate.

·         Even if statistics do exist, try updating them, as this can give more current information to the hashing operation when it decides how much memory to allocate.

·         Try using a different type of join (this can be done by hinting OPTION(MERGE JOIN) or OPTION(LOOP JOIN)).  Note that forcing a join type does not necessarily guarantee a better execution plan.

·         If all of these fail, you can increase the available memory on the computer.

 

A sample of what you will see in the profiler would look something like the following.  Note the batch starting, followed by a number of Hash Warning events prior to batch completion.  Also, the SPID that is causing the events will be recorded

 

EventClass

StartTime

SPID

SQL:BatchStarting                           

2007-02-01 18:53:34.703

51

Hash Warning

2007-02-01 18:53:48.267

51

Hash Warning

2007-02-01 18:53:48.283

51

Hash Warning

2007-02-01 18:53:50.097

51

Hash Warning

2007-02-01 18:54:05.300

51

SQL:BatchCompleted

2007-02-01 18:54:19.130

51

 

- Steve Herbert

SQL Server Query Execution

Index Build strategy in SQL Server - Part 3: Offline Serial/Parallel Partitioning (Aligned partitioned parallel index build)

Aligned partitioned parallel index build

 

In case of parallel build we scan and sort partitions in parallel and the actual number of sort tables existing at the same time will depends on the actual number of concurrent workers. Partitions are being chosen by workers one by one and when one worker completes with one partition it takes the next partition which is not yet taken by another worker. Each worker builds 0 - N partitions (we do not share one partition among multiple workers).  Why can it be 0?  If DOP > # of partitions, then we do not have enough partition to give out to all the workers(s). Which partition a worker is going to work on?  This is non-deterministic per execution.  First come first serve.

As we never share one partition among several workers, the biggest partition becomes a bottleneck. A situation could happen when all workers completed with their partitions and one worker still sorting the biggest one along. Which also means the resource being used by this query (such as memory and threads) will not be available for other queries.

 

There is no final stitch among workers for partitioned index.  Each partition is being represented as a separated b-tree in storage engine.

 

How does it affect disc space requirements:

-         In case of sorting in user’s database (default setting) the requirements are actually the same as in case of serial build as we are sorting in each filegroup for each corresponding partition we will need 2.2*(Size of partition) for each filegroup.

-         In case of using Sort_in_tempdb (SORT_IN_TEMPDB = ON) we won’t have the same advantage as in case of serial build because we may have several sort tables at the same time and as long as we don’t know the actual distribution of data between the partitions we will still require the same 2.2*(Size of the whole index) of free space in tempdb.

 

Memory consideration:

We will have several sort tables at the same time (depends of #DOP and # of partitions) and we will need at least 40pages per each sort table to be able to start the index build operation. So, the minimum required memory will be #DOP*40pages.

Total memory = 40 * DOP + additional memory.

 

Note that additional memory does not change for serial or parallel plans, this is because the total number of rows we need to sort remain the same in both plans.

Posted by queryproc | 1 Comments
Filed under:

Index Build strategy in SQL Server - Part 3: Offline Serial/Parallel Partitioning

There are two main categories of partitioned index build:

  • Aligned (when base object and in-build index use the same partition schema)
  • Not- Aligned (when heap and index use different partition schemas (including the case when base object is not partitioned at all and in-build index use partitions))

(see Index Build strategy in SQL Server - Introduction (II)

 

 

Aligned partitioned index build

 

Aligned partitioned serial index build

 

     NL

                /       \

             CTS   Builder (write data to the in-build index)

                           \

                        [Sort] (order by index key) <-- optional

                             \

                          Scan (read data from source)

 

CTS: Constant Table Scan (the purpose of CTS is to provide partition IDs for index builder)

NL: Nested Loop

 

In case of aligned partition index build Constant Table Scan provides partition IDs to the inner side of Nested Loop so we can build one partition at a time – for each partition ID provided by CTS (outer side of the NL), inner side would build the index for that partition (not the entire index). Sort table gets created for each partition, but as we are doing it one by one and we are building final b-tree for each partition one by one we don’t need to keep this sort table for each partition at the same time. As a result we have only one sort table at any given moment.

How does it affect disc space requirements:

-         In case of sorting in user’s database (default setting) we are actually sorting in each filegroup for each corresponding partition, which means we will need the same 2.2*(Size of the partition) for each filegroup. For example: let’s say we have 3 partitions located in filegroups FG1, FG2, FG3 and index data takes 1Gb, 2Gb, and 3Gb respectively. In this case we will need 2.2*1 = 2.2Gb of free space in FG1; 2.2*2 = 4.4Gb of free space in FG2, and 2.2*3 = 6.6Gb of free space in FG3. That means we will need total 9.9Gb of free space on the disk(s).

-         In case of using Sort_in_tempdb (SORT_IN_TEMPDB = ON) option we can reuse the same space of tempdb for the sort table and as we are sorting partitions one by one we actually will need only 2.2*(Size of the biggest partition) of free space in tempdb. For example: let’s look at the case described above. We will need only 3.3Gb of free space in tempdb (this size of free space will let us build smaller partitions and then reuse this space for building the biggest partition).

 

Memory consideration:

We will have only one sort tables at the same time and we will need at least 40pages for the sort table to be able to start the index build operation. So, the minimum required memory will be 40pages.

Total memory = minimum required memory + additional memory*.

 

*additional memory calculated as row size multiplied by the estimated row count provided by Query Optimizer.  

 

Posted by: Lyudmila Fokina

Posted by queryproc | 1 Comments
Filed under:

Index Build strategy in SQL Server - Part 2: Offline, Parallel, No Partitioning (Non stats plan (no histogram))

Build (serial) (write data to the in-build index)

                          |

                X (Merge exchange)

                           /          |           \

                      Sort…      Sort…  Sort …(order by index key)

                           |           |            |

                       Scan…    Scan… Scan…(read data from source)   

 

When histogram is not available (for example when we building an index on a view) we can’t use the same methods as described in pervious post (For statistics gathering, it is possible only for ‘real’ object.  We are building index on view, so index does not exist at this point and we are not able to gather sample stats on the view this is why we have a different plan here). So we will be using regular parallel scan which is not aware of data distribution at all.  

 

How this works:

We will scan source data in parallel using parallel scan, but serial b-tree build.

Each worker scans a page from the heap the same way as in a previous parallel scan method. When scan is done sort table gets created and populated for each worker. Each worker maintains its own sort table - one sort table and the data from all workers will be merged (as these sort tables are not disjoint we can not just build separate b-tree’s and ‘stitch’ them; we have to merge sort tables) and produced to a final sorted output. After that, index build operation will be serial as we have one final output. The index builder consumes data from Merge Exchange and builds the final index.

Why is this plan relatively slow?  We are building index in serial plus some extra overhead introduced by ‘merge exchange’.

 

 

Memory consideration:

            In parallel index build, we are building multiple sort tables concurrently hence the basic memory requirement is higher and the calculation is a bit different.  For memory calculation, we have 1) required memory, 2) additional memory.  Each sort requires 40 pages of required memory.  Let’s say we have DOP = 2 so we have 2 sort tables, we need 80 pages of required memory, but the total additional memory remains the same regardless DOP setting, this is because the total number of rows remain the same regardless DOP setting.  For example, if serial plan needs 500 pages of additional memory, then parallel plan has the same request for additional memory, each worker will get 500/DOP pages of additional memory + 40 pages of required memory.

Posted by queryproc | 0 Comments
Filed under:

Index Build strategy in SQL Server - Part 2: Offline, Parallel, No Partitioning

The type of parallel index build plan in SQL server depends on whether or not we have a histogram available with necessary statistics. Therefore, there are two broad categories of parallel index plans:

  • Histogram available:
  • No histogram

 

Histogram available (parallel sort and build): 

 

              X (Exchange)

   |          \            \

         Builder… Build…  Build… (write data to the in-build index)

                           |           |            |

                      Sort…      Sort…  Sort … (order by index key)

                           |          /            /

                       Scan (read data from source)

 

This type of Parallel Index build is getting chosen when we have statistics available (hence range partition information is available and can be used to identify data distribution).

 

How does scan happen in this case?  We must have some statistics on the leading key column, so if we don’t have stats we will go ahead and generate sample statistics to determine whether and how to parallelize the index build operation. In some situations, however, we are not able to build sample stats, such as indexed view ("No stats plan"), and then different index build plan will be generated. Using the statistics and histogram we can identify data distribution (divide data in several buckets), so we can load balance the workload among workers in parallel plan, it also help us to make DOP (degree of parallelism) decision to achieve high utilization of system resource. Using the row count estimates from the histogram for each bucket in the distribution the workload is split into N ranges (N = DOP), one for each worker (this is an attempt to load balance the work among all workers).

Using range partition scan to scan data, each worker receives data belonging to its range and builds its own sort table and b-tree based on sort table, so each worker will have its own sort table and all the b-trees are disjoint. The coordinator thread will then stitch all the b-tree’s together at the end of index build operation and we build full statistics on the new b-tree and are done.

 

Parallel Index Build with histogram available can give us the best performance.

On the downside it is more memory consuming and will fail if there is not enough memory (because each worker creates #DOP sort tables). We can play with MAXDOP option to reduced to max number of DOP used in the index build and as a result – min memory required for the build. You can run sp_configure to figure out what is the default setting for ‘max degree of parallelism’ on the server. Max degree of parallelism = 0, means ‘uses the actual number of available CPUs depending on the current system workload’. You can explicitly limit the number of processors to use in parallel plan execution.

For example:

Create index idx_t on t(c1, c2)

WITH (MAXDOP = 2)

-- limit # of processor to use for index build to 2

 

 //Next time - Non stats plan (no histogram) index build plan

Posted by: Lyudmila Fokina

Posted by queryproc | 1 Comments
Filed under:

Query Execution Timeouts in SQL Server (Part 2 of 2)

Checklist for time out errors

 

Memory pressure: In most cases timeouts are caused by insufficient memory (i.e. memory pressure). There are different types of memory pressures and it is very important to identify the root cause. The following articles give a good start point on this issue:

 

http://blogs.msdn.com/slavao/archive/2005/02/19/376714.aspx

http://blogs.msdn.com/slavao/archive/2005/02/01/364523.aspx

http://www.microsoft.com/technet/prodtechnol/sql/2005/tsprfprb.mspx#EWIAC (includes a link that explains DBCC MEMORYSTATUS)

 

Especially, we should pay attention to the size of buffer pool (since it is the source for query execution memory grant) and the size of memory held by query execution. You can use this simple query to get the size of buffer pool:

 

select sum(virtual_memory_committed_kb) from sys.dm_os_memory_clerks where type='MEMORYCLERK_SQLBUFFERPOOL'

 

The following query gives the size of memory held by query execution (available in SQL Server 2000 SP1 only):

 

select sum(total_memory_kb) from sys.dm_exec_query_resource_semaphores

 

Note: please be cautious when using sys.dm_exec_query_memory_grants and sys.dm_exec_query_resource_semaphores with an “order by” clause or a JOIN on a loaded system since this query may itself require a memory grant and it may experience a query execution time out. It is true even you use the DAC connection. DAC has a pre-committed memory for normal allocations, but not for memory grants.  It will need to use regular resource semaphores for memory grants.  The difference is: DAC query does not wait for memory grant and may force minimum grant.  This will likely make OOM condition worse.

 

It is important to point out that more physical memory does not necessarily mean more memory for query execution. The memory that can be used by query execution is limited by process-addressable virtual address space, which is normally 2GB for 32-bit architectures. So on a 32-bit system, the maximum memory query execution can use is around 1.7G since the operating system and other SQL Server components (like optimization) need memory as well. Generally, you should expect around 1.2 GB of main memory for query execution since quite likely other components could require more memory in a loaded system. There is no such 2GB limit for a 64-bit system.

 

Option “min memory per query” (BOL link). This option sets the minimum amount of memory (in kilobytes (KB)) that is allocated for the execution of a query. The default value is 1024 (KB) and the minimum allowed value is 512. Don’t make it too large if there are many ad hoc small queries: it simply wastes the memory since small queries won’t make full use of them.

 

Option “max server memory” (BOL link). This option controls the maximum size of buffer pool, which is the source of query execution memory. If it is too small, there won’t be many queries running at the same time. Make sure this option is set to a reasonably large value.

 

Option “query wait” (BOL link). This option specifies the time in seconds a query waits for memory before it times out. Check if it is set properly. We recommend leaving it as default, which is calculated as 25 times of the estimated query cost.

 

Update statistics. The amount of memory to be granted is mainly based on the cardinality estimation. So updating statistics could improve the accuracy of cardinality estimation and perhaphs reduce the waste on memory reservation.  On the other hand, if statistics is out of date, AUTOSTAT can kick in during compilation, which typically uses big memory because it has to sort rows. If we cannot get grant for AUTOSTAT, we will use stale stats instead.

 

 

Identify the queries consuming (or that will consume) the most memory. If you ever decide to kill some queries to free up memory, it might be more efficient to kill queries that are using or will use a large amount of memory. Of course, executing a big query will make the situation worse.  The following query shows the memory required by both running (non-null grant_time) and waiting queries (null grant_time).

select requested_memory_kb, grant_time, cost, plan_handle, sql_handle

from sys.dm_exec_query_memory_grants

 

Before you decide to kill a query, it is always recommended to check the showplan of that query. You should investigate if the plan cost and/or memory requirement exceed your expectation. You can use plan_handle to retrieve the showplan from sys.dm_exec_query_plan and sql_handle to retrieve the SQL text from sys.dm_exec_sql_text.

Posted by queryproc | 2 Comments

Index Build strategy in SQL Server - Part 1: offline, serial, no partitioning

         Builder (write data to the in-build index)

                           |

                     Sort (order by index key)

                           |

                     Scan (read data from source)

 

In order to build the b-tree for the index we have to first sort the data from source.  The flow is to scan the source, sort it (if possible - in memory*), then build the b-tree from the sort. 

Why do we need to sort first before building the b-tree?  In theory we don’t have to sort, we could use regular DML and directly insert data into the in-build index (no sort), but in this case we would be doing random inserts, random inserts in a b-tree require searching the b-tree for the correct leaf node first and then inserting the data. And while searching a b-tree is fairly fast, doing so before each insert is far from optimal. So for index build operation, we sort the data using the sort order for the new index, so when we insert data into the in-build index, it is not a random insert, it is actually an append operation and this is why the operation can be much faster than random inserts. 

While inserting data between sort and index builder we free each extent from the sort table as soon as all of its rows are copied.  In this way we reduce the overall disk space consumption from a possible 3*Index Size (source + sort table + b-tree) to just 2.2*Index Size (approximately).

 

*We do not guarantee in memory sort; the decision of whether we can do in memory sort or not depends on memory available and actual row count. ‘In-memory’ sort is, naturally, fast (also disk space requirements will be more relaxed in this case, because we don’t have to allocate space for the sort table on the disk), but it is not required; we can always spill data to disk, although the performance is much slower than in-memory sort.

 

For an index build operation, we use the user database (the database where we build the index) by default for sort to spill data, but if the user specifies sort_in_tempdb, then we use tempdb for spill.

Each sort table (even when we have very little data to sort) requires at least 40 pages (3200KB) to run (later we will see that in case of parallelism we can have several sort tables at the same time). When calculating memory for sort, we try to allocate enough memory to have an in-memory sort.  For large index build operations it is not likely that we will be able to fit the entire sort in memory. If we can’t provide at least 40 pages for the Index Build operation, it will fail.

 

The last step of index build is to always build full statistics. Good statistics information helps the query optimizer to generate better plan, users can issue ‘create’ or ‘update’ stats commands to force SQL Server generate or refresh stats on a certain object.  When we are building a new index, since we need to touch every row, we use this opportunity to build full stats at the same time as a side benefit.

 

Conclusion:

To be able to build a non partitioned Index offline with serial plan we will need free disk space (in user’s or in tempdb database) of approximately 2.2*IndexSize and at least 40 pages of memory available for the query executor to be able to start the process.

 

 

Read in next post: Index Build Scenario 2: Offline, Parallel, No Partitioning

 

Posted by: Lyudmila Fokina

Posted by queryproc | 1 Comments
Filed under:

Query Execution Timeouts in SQL Server (Part 1 of 2)

This short article provides a checklist for query execution time out errors in Yukon. It does not touch the time out issues on optimization and connection. Before reading this article, you are recommended to read the following post to get familiar with SQL Server memory management architecture: http://blogs.msdn.com/slavao/archive/2005/02/11/371063.aspx

 

Overview of query processing

When a query is submitted, SQL Server first checks if there is a plan cached. If yes, that plan will be selected. If not, the query statement is first parsed to generate sequence tree. The sequence tree is then bound and normalized and converted to algebras tree. Algebras tree is optimized to generate algebraic plan.

 

When an optimized plan is generated (or selected if it is cached), it will be executed if its memory requirement can be satisfied right away. If not, which happens quite often, the query is put into a queue and waiting for the memory.

 

How does a query execution time out happen?

 

Before executing a query, SQL Server estimates how much memory it needs to run and tries to reserve this amount of memory from the buffer pool. If the reservation succeeds the query is executed immediately. If there is not enough memory readily available from the buffer pool, then the query is put into a queue with a timeout value, where the timeout value is guided by the query cost. The basic rule is: higher the estimated cost is, larger the time out value is. When the waiting time of this query exceeds the timeout value, a time out error is thrown and the query is removed from the queue. The following shows a sample error for time out:

 

[State:42000 Error:8645]      [Microsoft][SQL Native Client][SQL Server]A time out occurred while waiting for memory resources to execute the query. Rerun the query.

 

 

If the memory is enough for a newly submitted query but there are queries in waiting queues, this query is put into a queue. Queries in waiting queues are “sorted” based on their cost and waiting time. Less cost or longer waiting time a query has, higher it is ranked. Note the ranking is dynamic and changes frequently. The query with the highest rank will run if there is enough free memory. If the memory is insufficient, then no other queries will run. It will NOT bother to check if the free memory is enough to run other queries. You can check which query is next to be picked up by running the following query. If the returns no rows, then there are no waiting queries. Note: the results from this query change with time.

select * from sys.dm_exec_query_memory_grants where is_next_candidate is not null

 

You can use the value in the plan_handle column to retrieve the showplan from sys.dm_exec_query_plan and the sql_handle column to retrieve the SQL text from sys.dm_exec_sql_text.

 

Note that not every query needs a memory reservation. Usually a query needs a memory reservation if its execution plan has sort, hash, or bitmap operators. Since an index build requires a sort, it always needs a memory reservation. If a query needs no memory reservation, it is immediately executed.

- Senqiang Zhou

Query Execution

Posted by queryproc | 1 Comments
Filed under:

Using ETW for SQL Server 2005

ETW stands for “Event Tracing for Windows” and it is used by many Windows applications to provide debug trace functionality.  This “wide” availability is a key point of using ETW because it can help to track certain activities from end to end.  For example, you can literally track a request coming from IIS, passing through protocol layer, and then finally handled by database engine in single trace file.   Unfortunately, not many people know that SQL Server 2005 provides a full ETW functionality which can output most of SQL Trace events available for SQL Server Profiler.  In this article, I will explain how to use SQL ETW with examples.

First of all, how do I know if SQL ETW is really active on my machine?  The following shows output on my server which has SQL Server 2005 Standard Edition as 2nd instance (named Yukon).

C:\>logman query providers
Provider                                 GUID
-------------------------------------------------------------------------------
YUKON Trace                              {130A3BE1-85CC-4135-8EA7-5A724EE6CE2C}
.NET Common Language Runtime             {e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}
ACPI Driver Trace Provider               {dab01d4d-2d48-477d-b1c3-daad0ce6f06b}
Active Directory: Kerberos               {bba3add2-c229-4cdb-ae2b-57eb6966b0c4}
IIS: Request Monitor                     {3b7b0b4b-4b01-44b4-a95e-3c755719aebf}
Local Security Authority (LSA)           {cc85922f-db41-11d2-9244-006008269001}

My other server has a default instance of SQL Server 2005 and it shows something like this:

MSSQLSERVER Trace                        {2373A92B-1C1C-4E71-B494-5CA97F96AA19}

Is there a connection between these 2 examples?   Yes, the SQL ETW provider name appears to be a combination of “Instance name” and “Trace”, and its GUID is unique across instances.    You can use either provider name or GUID to start and stop ETW session.

Please note this ETW functionality is not available for SQL Express, because, well…, you get what you paid for J.  If you don’t see SQL ETW provider name, and you are sure you have a regular SQL Server 2005, then you may be able to fix it yourself.   Here’s a short list of instructions.

  1. Go to the directory where your SQL server binaries are.  On my machine, it is on
    “C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn”.
  2. Look for “etwcls.mof”.   If you cannot find it, something is definitely wrong with your installation.  You probably need to seek help elsewhere.
  3. Run “mofcomp etwcls.mof” in command window.
  4. Try “logman query providers” again.  You should see your SQL ETW provider.

Now I will describe how to configure SQL ETW to collect interesting traces.   Go to the same directory described in the above instruction and find a file called “etwcnf.xml”.   This is your configuration file and you can modify it with any text editor.   There should several entries already defined in this file for your examples.   See the following example picked out of etwcls.xml file.

<Template id="1" Name="TSQL replay">
       <Event id="11"/>
       <Event id="13"/>
       <Event id="14"/>
       <Event id="15"/>
       <Event id="17"/>
       <Event id="53"/>
       <Event id="70"/>
       <Event id="71"/>
       <Event id="72"/>
       <Event id="74"/>
       <Event id="77"/>
       <Event id="78"/>
       <Event id="100"/>
</Template>

This defines a tracing template called “TSQL replay” with template ID = 1.   It contains a list of events with ID = 11, 13, and so forth.  This event ID is matched to SQL Trace Event Class.  By consulting BOL documentation on available SQL Trace events, you can find “RPC:Starting” event has ID of 11, and “SQL:BatchStarting” event has ID of 13.   Yes, now you know how to change this template to suite your need!

Now let’s try to get some actual trace events.  I’m using “logman.exe” and “tracerpt.exe” which should be available on most Windows platform.

First, create a text file as shown in the following example.  You can activate multiple providers by listing them in separate lines.  For simplicity, let’s enable only SQL ETW.

C:\etw>type prov.txt
"YUKON Trace" 1 0

As you have figured out already, the first column is SQL ETW provider name (you can use GUID also).   The next number (1) is called “enable flag” and should match the template ID in etwcls.xml file.  The 3rd number (0) is called “enable level” and should be kept as zero for SQL ETW.   The meaning of these 2 flags depends on providers and you should consult appropriate documentations for other providers.

Now you are ready to start your 1st SQL ETW session!   In my example, I decided to give a very creative/original name: “mytrace”.   Here’s a screen dump from my machine.

C:\etw>logman start mytrace -pf prov.txt -ets
Name:                      mytrace
Age Limit:                 15
Buffer Size:               64
Buffers Written:           1
Clock Type:                System
Events Lost:               0
Flush Timer:               0
Buffers Free:              2
Buffers Lost:              0
File Mode:                 Sequential
File Name:                 C:\etw\mytrace.etl
Logger Id:                 3
Logger Thread Id:          1308
Maximum Buffers:           25
Maximum File Size:         0
Minimum Buffers:           3
Number of buffers:         3
Real Time Buffers Lost:    0

Provider                                  Flags                     Level
-------------------------------------------------------------------------------
* "YUKON Trace"                           0x00000001                0x00
{130A3BE1-85CC-4135-8EA7-5A724EE6CE2C}  0x00000001                0x00

To test if this ETW session can actually receive something from SQL Server, let’s issue “select * from sys.dm_exec_requests”.  After that, you can stop the tracing as shown in the following example.

C:\etw>logman stop mytrace -ets
The command completed successfully.

Look for newly created file in the working directory (mytrace.etl).  Unfortunately, this is a binary trace file and you will need some help to crack it open.   “Tracerpt.exe” does some rudimentary job of converting this binary trace file into human-readable form.

C:\etw>tracerpt mytrace.etl
Input
----------------
File(s):
     mytrace.etl
100.00%
Output
---------------
Text (CSV):         dumpfile.csv
Summary:            summary.txt
The command completed successfully.

Note this creates 2 files: dumpfile.csv and summary.txt.   Here’s a screen dump. 

C:\etw>type dumpfile.csv
  Event Name,       Type,        TID,           Clock-Time, Kernel(ms),   User(ms), User Data
  EventTrace,     Header, 0x00000BF4,   128075886826813362,          0,         10,    65536, 33620485,     790, 1, 128075887366383792,   100144,        0, 0x00000000,        2,        1,        4,        0,     2785, 0x01402930, 001402940,                0,          3579545, 128075886826813362, 0x00000002,        0, 0, 0
SQL:BatchStarting,          0, 0x00000874,   128075886939625782,          0,         10, "select * from sys.dm_exec_reqests",        1,                0, "xxxx", "REDMOND", "JAYC8V1",     1580, "SQL Query Analyzer", "REDMOND\xxxx",       51, 28075598939600000,       13, "master", 010500000000000515000000A065CF7E784B9B5FE77C8770A9640000",        0,    0,       38,        0, "REDMOND\xxxx", 0, 0
C:\etw>

Yes, this looks pretty cryptic and you will probably need to look at etwcls.mof (and SQL Trace description) to decipher each column.   From etwcls.mof file, we know 1st column is “TextData”, 2nd column is “DatabaseID”, 3rd column is “TransactionID”, etc.

       [WmiDataId(1),
        Description("TextData"),
        format("w"),
        StringTermination("Counted"),
        read
       ]
       string TextData;
       [WmiDataId(2),
        Description("DatabaseID"),
        read
       ]
       sint32 DatabaseID;
       [WmiDataId(3),
        Description("TransactionID"),
        read
       ]
       sint64 TransactionID;

Now we know this ETW session received “SQL:BatchStarting” event correctly, and its TextData column contains correct TSQL statement.

OK, you know how to start and stop SQL ETW.  But with very little UI support compared to the pretty looking SQL Server Profiler, why even bother to consider ETW if I don’t need end-to-end tracing?  Well, one good thing about SQL ETW is that your ETW session survives SQL Server restart, unlike ordinary SQL Trace.

 
Jay Choe

Posted by queryproc | 3 Comments
Filed under: , ,
More Posts Next page »
 
Page view tracker