Ian Jose's WebLog

This weblog is intended to help T-SQL developers get the best performing query plans from SQL Server. Some of the posts describe simple mistakes that can be easily avoided. Other posts describe complex solutions to limitations in SQL Server. Lastly, some posts describe recent improvements in SQL Server, and the means to employ them best.

  • Ascending Keys and Auto Quick Corrected Statistics

    A common problem for some SQL Server applications are cases where data typically ascends.  For example, datetime columns where the column represents a current date.  SQL Server builds statistics with the assumption that the data will by in large be similar in the future.  However, when data typically ascends, most new insertions are out of the previously found range.  This can lead to poorly performing plans as filters selecting recent data seem to exclude the entire relation when in fact a significant number of rows are included.

    Trace flag 2389 and 2390, both new in SQL Server 2005 SP1, can help to address this problem.  SQL Server 2005 SP1 begins to track the nature of columns via subsequent operations of updating statistics.  When the statistics are seen to increase three times the column is branded ascending.  If trace flag 2389 is set, and a column is branded ascending, and a covering index exists with the ascending column as the leading key, then the statistics will be updated automatically at query compile time.  A statement is compiled to find the highest value and a new step is added at the end of the existing histogram to model the recently added data. 

    Trace flag 2390 enables the same behavior even if the ascending nature of the column is not known.  As long as the column is a leading column in an index, then the optimizer will refresh the statisitc (with respect to the highest value) at query compile time.  Never use 2390 alone since this would mean that this logic would be disabled as soon as the ascending nature of the column was known.

    -- enable auto-quick-statistics update for known ascending keys

    dbcc traceon( 2389 )

    -- neable auto-quick-statistics update for all columns, known ascending or unknown

    dbcc traceon( 2389, 2390 ) -- never enable 2390 alone

  • Query Processor Modelling Extensions in SQL Server 2005 SP1

    Trace flag 2301, available in SQL Server 2005 SP1, enhances the modelling ability of the query optimizer to better handle complex statements.  Improved modelling can lead to dramatically faster performing query plans in some cases.  These extensions to the query processor modelling abilities can lead to increased compile time and so should only be used by applications which compile infrequently.  The model extensions are as follows:

    Integer Modelling

    Normally, histogram modelling assumes that values between histogram steps are equally distributed to every numerical double code point.  This modelling extension remembers, for integer base types, that values can only occur on integer code points and this improves cardinality estimates for inequality filters.

    Comprehensive Histogram Usage

    Normally, histograms are ignored when the cardinality of a relation dropps below the number of steps in a histogram.  This is a heuristic which captures the liklihood that a histogram continues to describe a relation.  This modelling extension applies the histogram in cardinality estimate regardless of the cardinality estimate for the relation.

    Base Containment Assumption

    Normally, when two relations are joined, we assume that X distinct code points in the same key range on input relation R will join with Y distinct code points in the same key range on input relation S such that MIN(X,Y) will find matches.  This assumption is called Simple Containment.  We assume that the smaller number of distinct code points match with code points from the other side.  This modelling ignores the relative population of distinct code points in the base forms of R and S, and also ignores any filtering that has occured to the base forms for R and S before joining.  Base containment applies the containment assumption only to the base relations and uses probabilistic methods to compute the degree of joining.  In addition, implied filters are modelled correctly since their behavior is very different from orthogonal filters.

    Comprehensive Density Remapping

    Normally, when columns are CONVERTed only a small number of densities involving such columns are remapped to the new column definitions.  Note that operations like convert rarely change the density of a column.  Density is the measure of the number of duplicate values for each distinct value.  With this modelling extension, all such remappings are applied which makes possible subsequent density matching for the purposes of cardinality estimation.  In some cases, this can lead to excessive use of memory.

    Comprehensive Density Matching

    Normally, densities are matched when the very same base column is filtered or joined.  With this modelling extension, the notion of equivalence of columns as a result of equi-joins is applied leading to more complete density matching for the purposes of improved cardinality estimation.

    These extentions all were developed to address customer found problems relation to poor performing query plans.  If customers experience such poor performing plans where one or more of the above extentions may help, then trace flag 2301 may be applied.  It is important to note that compile times will increase, and in some cases memory consumption can increase dramatically.  Thus, it is important to apply this trace flag with care and test exhaustively before using in production.

     

  • Disabling Constant-Constant Comparison Estimation

    SQL Server 8.0 did not perform cardinality estimates based on the comparion of two constants.  Instead, SQL Server 8.0 guessed at the resulting selectivity.  The reasoning for this is that one or more of the constants may be statement parameters, which would change from one execution of the statement to the next.  However, SQL Server 9.0 reversed this behavior and does esimate operator cardinality based on the comparison of two constants, when both values are available at compile time.  Constant values are avilable at compile when when 1) they are literal constants or 2) they are parameters to a stored procedure or otherwise set at the nesting level above the the compilation of the statement.  This change in behavior can be problematic for statements that optimize for one set of values but run for other sets of values.  Applications that find adverse plan changes resulting from the change in behvior can revert behavior to that of the previous released by enabling trace flag 2328.  For example, to revert the behavior server wide, for the current invokation of SQL Server, one could run the following command:

    dbcc traceon( 2328, -1)

    If running SQL Server 8.0, or SQL Server 9.0 with trace flag 2328 enabled, the optimizer will guess the selectivity arising from the comparion of two constants.  The guess used is the same as comparing a column to an constant (where either the column distribution or the constant value, or both, is unknown).  Since the selectivity of the comparison of two constants is either 0 or 1, one might think that a guess for such a comparion wouldbe 50%.  Although there is a lot of logic in this behavior, for historic reasons, the guess used by SQL Server was very different. 

    Guessing the selectivity of a column-constant comparsion attempts to model several phenomena.  First, when there are many rows, the selectivity is likely smaller because there could be more values from which to choose.  Second, when there are many such conjuncts, it is more likely that subsequent conjuncts will be correlated with previous conjucts.  SQL Server therefore reduces the selectivity from one to 0 by the exponent of the cardinality and the reducition factor is reduced with each guess.  The following table shows the number of conjucts guessed and the resultant selectivity as a function of input table cardinality of N:

     

    # Conjucts         Cardinality            Selectivity

    1                        N^(3/4)               N^(-1/4)

    2                        N^(11/16)           N^(-5/16)

    3                        N^(43/64)           N^(-21/64)

    4                        N^(171/256)       N^(-85/256)

    5                        N^(170/256)       N^(-86/256)

    6                        N^(169/256)       N^(-87/256)

    7                        N^(168/256)       N^(-88/256)

    ...

    175                    N^(0/256)          N^(-1)

  • Make Functions Schema-bound

    Create user defined functions with the SCHEMABINDING clause where possible.  In the absence of this clause, SQL Server must assume at compile time that statements using the function may bind to a function which accesses and updates data.  Note that even if the UDFs do not access data or perform any updates, the SQL Server optimizer must be prepared for this in case the statement at run-time binds to a different UDF which does access and update data.  This leads the optimizer to add halloween protection that is often unneccessary.  This protection often manifests as Spool or Sort iterators in the plans and can dramatically slow down plan performance.  Making UDFs schema-bound is necessary even if the UDF is uniquely named, and even if no UDFs access or update data. 

    This behavior has changed in SQL Server 2005, and applications that are porting from SQL Server 2000 to SQL Server 2005 should re-examine whether their UDFs can be schema bound and to make them schema bound if they are not already so marked.

    Background on Halloween Protection.

    "Halloween protection" in database systems refers to a solution for a problem that can occur in update queries. The problem occurs when an update itself affects the rows selected for update.  For example, imagine a company wants to give every employee a 10% pay increase. If the update query is unlucky enough to walk the salary index then the same row may be selected, updated, the update moves the row ahead in the index, and then the row will be redundantly selected again and again.  This problem is corrected by isolating the rows chosen from the effects of the update itself.  For example, a SPOOL operation which stores all the rows to be updated outside of the context and any index can provide the necessary isolation. SORTs are also sufficient for isolation purposes.

  • Regularly Update Statistics for Ascending Keys

    Ascending key columns, such as IDENTITY columns or datetime columns representing real-world timestamps, can cause inaccurate statistics in tables with frequent INSERTS because new values all lie outside the histogram.  Consider updating statistics on such columns frequently with a batch job if your application seems to be getting inadequate query plans for queries that have a condition on the ascending key column. How often to run the batch job depends on your application. Consider daily or weekly intervals, or more often if needed for your application. Alternatively, trigger the job based on an application event, such as after a bulk load or after a certain number of INSERT operations.

  • Create Multi-Column Statistics

    When a query has a multi-column condition, consider creating multi-column statistics if you suspect that the optimizer is not producing the best plan for the query. You get multi-column statistics as a by-product of creating a multi-column index, so if there is already a multi-column index that supports the multi-column condition, there is no need to create statistics explicitly. Auto create statistics only creates single-column statistics, never multi-column statistics. So if you need multi-column statistics, create them manually, or create a multi-column index.

    Consider a query that accesses the AdventureWorks.Person.Contact table, and contains the following condition:

    FirstName = 'Catherine' AND LastName = 'Abel'

    To make selectivity estimation more accurate for this query, create the following statistics object:

    CREATE STATISTICS LastFirst ON Person.Contact(LastName,FirstName)

    This statistics object will be useful for queries that contain predicates on LastName and FirstName, as well as LastName alone. In general, the selectivity of a predicate on any prefix of the set of columns in a multi-column statistics object can be estimated using that statistics object.

    For a statistics object to fully support a multi-column condition, a prefix of the columns in the multi-column statistics object must contain the columns in the condition. For example, a multi-column statistics object on columns (a,b,c) only partially supports the condition a=1 AND c=1; the histogram will be used to estimate the selectivity for a=1, but the density information for c will not be used since b is missing from the condition. Multi-column statistics on (a,c) or (a,c,b) would support the condition a=1 AND c=1, and density information could be used to improve selectivity estimation.

  • Auto-create and Auto-update Statistics

    For a large majority of SQL Server installations, the most important best practice is to use auto create and auto update statistics database-wide. Auto create and auto update statistics are on by default. If you observe bad plans and suspect that missing or out of date statistics are at fault, verify that auto create and auto update statistics are on.

    An alternative to enabling auto create statistics is enabled or make sure to manually create statistics using CREATE STATISTICS or sp_createstats.  Note that auto-statistics will not work for read-only databases.

  • Create Statistics for All Union Inputs

    When a query requires statistics on the result of a UNION or UNION ALL operation, create needed statistics uniformly on all UNION or UNION ALL inputs.  It is not sufficient to create statistics on a subset of the UNION or UNION ALL inputs, even when those inputs are dominant.  For example,

    SELECT * FROM Lineitem l WHERE EXISTS

    ( SELECT * FROM Region1 r1 WHERE r1.C = l.C and r1.S = l.S

    UNION SELECT * FROM Region2 r2 WHERE r2.C = l.C and r2.S = l.S )

    In order for the EXISTS to be optimized accurately, it may be necessary to have a multi-column statistic on columns C and S in for Lineitem, Region1 and Region2.  If the multi-column statistic does not exist in either Region1 or Region2 then the remaining statistics cannot be used to optimize this query.

  • Limit Use of Multi-Statement TVFs and Table Variables

    Limit use of multi-statement table valued functions (TVFs) and table variables in situations where getting a high-performance plan is requied.  Both multi-statement TVFs and table variables have no statistics. The optimizer must guess the size of their results. Similarly, result column do not have statistics, and the optimizer must resort to guesses for predicates involving these columns.  If a bad plan results because of these guesses, consider using an in-lined TVF, or standard table or temporary table as a temporary holding place for the results of the TVF, or a replacement for the table variable. This will allow the optimizer to use better cardinality estimates.  Note that this technique may also lead to an increased number of statement recompilations.

  • Simplify statements with IF

    In some cases, a SQL statement can be simplified by using procedural logic.  Instead of issuing one complex query to cover multiple cases with UNION or OR, it is better to use and if..else logic to separate the cases into different SQL statements.  In this way, the optimizer optimizes multiple simpler statements instead of a single complex statement and may be able to get better performing plans for each case.

  • Use Condition-Specific Stored Procedures

    The optimizer chooses the best plan for an SP given the current parameter values.  This plan is then reused regardless of whether the nature of the parameters changes from call to call.  If an application is aware that input parameters have a small number of significant behaviors, then the application may use condition-specific SPs. 

    For example, if an SP has a single parameter which can be either NULL or non-NULL, and a NULL value selects very few rows while a non-NULL value selects many more, then this technique may be used.  The application would code the SP twice.  The two SPs will have different names but the SP contents will be the same.  The logic that calls the SP must call the appropriate SP depending on whether the parameter value is NULL or non-NULL.  SPs are compiled the first time they are called.  As a result, the SP for NULL will be optimized for a NULL value and will be subsequently only called when the parameter value is NULL.  The SP for a non-NULL parameter will similarly be optimized for a non-NULL parameter value and only be called with a non-NULL parameter value.  This method can result in dramatic improvements in performance when the degree of selectivity of parameters varies significantly.

  • Use Function Results

    Built-in functions with literal constant inputs are simplified during optimization to resultant constant values.  However, user defined functions or built-in functions with variable[1] inputs are only simplified to resultant constants for the purposes of plan optimization for the following functions:

     


    Lower

    Upper

    RTrim

    Datediff

    Dateadd

    Datepart

    Substring

    Charindex

    Length

    SUser_SName

    IsNull

    GetDate

    GetUTCDate


     

    With other functions, the optimizer will guess the selectivity of predicates and this may lead to a poorly performing plan.  It is better for the query writer to set the value of the scalar expression to a variable and pass this value to the SQL statement as a parameter to a stored procedure or parameter to sp_executesql.  The query optimizer can determine the value contained within a parameter and will use this value to determine the best plan.  In this way, the query optimizer will be able to optimize with the resultant expression value.

    If the expression is present in the SQL query, the optimizer will not know the resultant value and will guess the selectivity of predicates containing this expression.  This can lead to poorly performing plans.  Note that column inputs to expressions must be present in the query statement since they require query context to have meaning. 



    [1] Corelated parameter, local variable, or procedure parameter.

  • Limit Non-order Preserving Expressions

    Expressions with column transformations that do not retain the original column order do not benefit from column statistics.  This can lead to poor plans.  Assume for purposes of illustration that we've added a column to Sales.SalesOrderHeader in AdventureWorks as follows:

    ALTER TABLE Sales.SalesOrderHeader ADD SalesOrderFlag tinyint

    For example, consider the following:

          SELECT * FROM Sales.SalesOrderHeader h

          WHERE h.SalesOrderFlag & 0x0001 = 1

    If a predicate on a column performs a bit-wise AND on the column, the selectivity of the predicate cannot in general be estimated using the column histogram.  Here, it would be better to separate out the bit flag as its own column on which statistics could be built.  For example:

    ALTER TABLE Sales.SalesOrderHeader ADD SalesOrderValidFlag tinyint

    In the latter case the selectivity of a predicate on the flag which is no longer bit-wise ANDed could be estimated, as shown below.

    SELECT * FROM Sales.SalesOrderHeader h

          WHERE h.SalesOrderValidFlag = 1

    Alternatively, a computed column could be defined on the bit-wise ANDed value, and a statistic could be built on the computed column.  The computed column would then be referenced in the query.

    There are many cases of non-order preserving expressions.  Consider the query shown below.

    SELECT * FROM Sales

          WHERE Price * ( 1 + Tax ) > 100

    If you see slower-than-desired query performance in such a case, consider creating a computed column with the equivalent expression, and creating statistics or an index on the computed column. Auto create statistics will also create statistics for the computed column if it exists, so you need not create the computed column statistics manually if auto create statistics is enabled. You do not have to modify the query to reference the computed column explicitly because query expressions are matched to identical computed column expressions automatically.

  • Avoid Unnecessary Data Type Conversions

    SQL Server adds implicit data type conversions when types don’t match.  This can have unintended results both on query results, but also on the query plan.  It is important to use literal constants that match in type column they are being compared with wherever possible.  A common mistake in writing T-SQL is to always use string literals for constants even when those constants are numbers, for example.

    SELECT * FROM Sales

    WHERE Quantity = ‘100’

    In this example, ‘100’ is a character string while the column Quantity is an integer.  The implicit conversion used follows strict precedence rules.  Here the conversion will be on Quantity to character string, and not on the ‘100’ to an integer. 

  • Use Parameters or Literals for Query Inputs

    Use unmodified parameters or literal constants in query statements to ensure that the optimizer can determine a representative value and optimize accordingly.  If you use a local variable in a query predicate instead of a parameter or literal, the optimizer is unable to determine the value for the input.  Likewise, if a parameter value is modified before being used in a statement, then the query plan will be chosen for the original value of the parameter and may not be the best plan for the actual value of the parameter when the statement is executed.  Not having the proper value, or any value at all, leads to a less accurate cardinality estimates.  Poor estimates can in turn lead to a poorly performing query plan.  Use unmodified parameters or literals in the query instead of local variables, and the optimizer typically will be able to pick a better query plan. For example, consider this query that uses a local variable:

    declare @StartOrderDate datetime

    set @StartOrderDate = '20040731'

    select * from Sales.SalesOrderHeader h, Sales.SalesOrderDetail d

    WHERE h.SalesOrderID = d.SalesOrderId

    AND h.OrderDate >= @StartOrderDate

    The number of rows from Sales.SalesOrderHeader that the optimizer estimates will qualify vs. the condition h.OrderDate >= @StartOrderDate is 9439.5, which is exactly 30% of the size of the table. You can use the graphical showplan for the query and right-click the plan node for Sales.SalesOrderHeader to display this cardinality estimate. In a pre-release version of SQL Server 2005 used while preparing this paper, the plan chosen uses a merge join (the observations that follow are based on this same SQL Server 2005 version; your results may differ depending on your SQL Server version, available memory, etc.). Now, consider this equivalent query that doesn't use a local variable:

    SELECT * FROM Sales.SalesOrderHeader h, Sales.SalesOrderDetail d

    WHERE h.SalesOrderID = d.SalesOrderId

    AND h.OrderDate >= '20040731'

    The cardinality of the result set for the predicate "h.OrderDate >= '20040731'" is estimated as 40 in the graphical showplan for the query (right-click the filter operator), for a selectivity of 0.13%. The plan chosen for this query uses a nested loop join instead of a merge join because of this improved estimate.

    Even when local variables are used in a query, an estimate that is better than a guess is used in the case of equality predicates. Selectivity for conditions of the form "@local_variable = column_name" is estimated using the average value frequency from the histogram for column_name. So, for example, if the column column_name contains all unique values, then a selectivity estimate of 1/(number of unique values in column) will be used, which is accurate.

    To eliminate the use of local variables, consider (1) rewriting the query to use literals instead of variables, (2) using sp_executesql with parameters that replace your use of local variables, or (3) using a stored procedure with parameters that replace your use of local variables. Dynamic SQL via EXEC may also be useful for eliminating local variables, but it typically will result in higher compilation overhead.

    The query optimizer chooses the best plan for the value of the parameter that was input to the stored procedure or call to sp_executesql.  For this reason, it is best to avoid writing SQL where the value used at execution is different from the value input to the stored procedure.  Instead it is better to modify the parameter value in one stored procedure and then pass it to another stored procedure which contains the SQL statement where it is used.  For example, consider this query that modifies a parameter value:

    CREATE PROCEDURE GetRecentSales (@date datetime) WITH RECOMPILE AS

    BEGIN

          IF @date IS NULL

                SET @date = dateadd("mm",-3,(SELECT MAX(OrderDATE)

                                              FROM Sales.SalesOrderHeader))

          SELECT * FROM Sales.SalesOrderHeader h, Sales.SalesOrderDetail d

          WHERE h.SalesOrderID = d.SalesOrderID

          AND h.OrderDate > @date

    END

    This SP, if called with NULL, will have the final SELECT statement optimized for @date = NULL. Since no rows have NULL OrderDate, the cardinality estimate for the result of applying this filter to SalesOrderHeader is very low (1 row). However, at run time, the date is not NULL, but three months before the latest OrderDate.  The actual number of SalesOrderHeader rows that qualify is 5,736. The optimizer chooses a nested loop join for the query when NULL is passed to GetRecentSales, whereas the optimal plan contains a merge join. You can see the plan selected, and the expected and actual cardinalities, using this script:

    SET STATISTICS PROFILE ON

    GO

    EXEC GetRecentSales NULL

    GO

    SET STATISTICS PROFILE OFF

    GO

    The WITH RECOMPILE option specified on the GetRecentSales stored procedure above does not eliminate the cardinality estimation error. One way to ensure that the queries in this example are optimized with appropriate parameter values that allow good estimates to be obtained is to modify the stored procedure as follows, breaking it down into parts:

    CREATE PROCEDURE GetRecentSales (@date datetime)  AS

    BEGIN

          IF @date IS NULL

                SET @date = dateadd("mm",-3,(SELECT MAX(OrderDATE)          FROM Sales.SalesOrderHeader))

          EXEC GetRecentSalesHelper @date

    END

    CREATE PROCEDURE GetRecentSalesHelper (@date datetime) WITH RECOMPILE AS

    BEGIN

          SELECT * FROM Sales.SalesOrderHeader h, Sales.SalesOrderDetail d

          WHERE h.SalesOrderID = d.SalesOrderID

          AND h.OrderDate > @date -- @date is unchanged from compile time,

                                  -- so a good plan is obtained.

    END


© 2008 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker