Microsoft SQL ISV Program Management Team

  • Designing Composite Indexes

     When it comes to creating composite indexes there are two questions I get asked most often by the ISVs I work with as well as their customers.

    1.      What is the optimal number of columns to include in a composite index?

    2.      What should the order of those columns be?

    While in most cases the answer is “it depends’, there are a handful of considerations which help in defining optimal composite indexes.  This BLOG attempts to offer some guidance on these and also elaborate on some of the tradeoffs.

     

    To start with let’s make sure we’re all on the same page with the definition of a composite index: “A composite index is one which is created across two or more key columns”.  Composite indexes are often also referred to as ‘multi-column’ index.  For example, an index (idx1) created on table (TabA) columns (col1, col2, col3 (CREATE CLUSTERED INDEX idx1 ON TabA (col1, col2, col3);).  Index columns refer only to key columns of the index; although INCLUDED columns are part of the physical index structure, they are not part of the index key and therefore not considered to be a part of a composite index for the purpose of this discussion.

     

    Now let’s drill into at the answers to the two questions individually.

     

    1.     Optimal number of columns to include in a composite index

    As mentioned earlier, there is no one answer for this question, every query and table situation is different and needs to be evaluated on a case by case basis.  That being said there are a few rules of thumb that can be used to effectively determine which columns to include in the composite index.

     

    a.       Only include columns that are selective.  Columns that have just a few, or in an extreme case just 1, values across the entire table add little or no value in index search operations and should not be included in an index even if they are used as query predicates.

     

    b.      Volatile columns (columns that are frequently updated) should not be indexed.   When the value of the column that’s part of an index is modified, it usually results in the index needing to be reorganized as well.   This results in the database engine having to perform multiple writes and thereby having additional overhead.  For example if a volatile column (col1) is included in a non-clustered index, every time any data in col1 changes, the database engine has to perform two writes, one to update the col1 data itself and the other to update the non-clustered index that includes col1.  If the table has more than one non-clustered index that includes col1, then for every change equal to the number of such non-clustered indexes need to be performed.

     

    c.       Columns included in query predicates.  Columns that are commonly used as query predicates in multiple frequently executed queries should be included in the composite index as long as they qualify the checks mentioned in 1.a and 1.b above.

     

    d.      Limit the number of columns.  While SQL Server limits the maximum number of columns in an index key to 16, in my opinion that limit is of little significance as in most cases you should be able to create an optimal indexes with far fewer columns.  As a rule you should use the fewest number of columns in a composite index as required to give you optimal performance. For example, if the combination of two columns (col1 and col2) results in uniquely qualifying every row of the table, there is little value in adding additional columns to the composite index to make the index seek operations more efficient.  In most cases I have seen well designed applications to have composite indexes built across no more than 6 columns.  Every column in a composite index has disk space, performance overhead related to disk I/O and index maintenance overhead associated with it.  To yield a net positive gain from having the index you should make sure that the benefit of having a column in a composite index outweighs the overheads.  Therefore the fewer the columns in a composite index the lower the overhead and higher the probability of a net gain.  This is particularly applicable to clustered indexes because the length of the index key length will affect all the non-clustered indexes.

     

    e.       Use INCLUDED columns.   Scenarios where a column is added to a composite index for the sole purpose of creating a ‘covering index’ (an index where the SELECT clause is fully serviced by the index itself and the underlying table doesn’t need to get referenced) can be made to use INCLUDED columns instead.  The INCLUDED columns provide the same benefits as far as covering indexes go, but do not have the index maintenance overhead associated with the actual composite index columns. You can get additional information on INCLUDED columns as well as read some of the other advantages they offer by referring to: http://msdn.microsoft.com/en-us/library/ms190806.aspx.

     

    f.        Index key record size limited to 900 bytes. In SQL Server 2008 the sum of the lengths of all columns in an index key cannot exceed 900 bytes (this limit does not apply to XML indexes or spatial indexes).  When using composite indexes you need to make sure that you do not exceed this limit.  For covering index case where you need to have indexes whose total length is greater than 900 bytes, you can consider the use of INCLUDED columns as explained above.  For cases where the limit is exceeded due to the number and sizes of the columns, you have no choice but to either eliminate some of the previously selected columns, or change the data types of some of the columns if possible.

     

    g.       Star join optimization.  There are some queries, especially in data warehouse workloads, where creating a composite index on two dimension columns in the fact table increases performance by the use of the ‘safe crossjoin’ star join optimization.  In such cases, the database engine is able to crossjoin and apply combined selectivity of the two dimensions before digging into the fact table, potentially yielding a significant reduction in I/O. For example, in the following query:

     

    SELECT Fact.* FROM Fact

    JOIN DimA ON Fact.SK_A = DimA.SK_A

    JOIN DimB ON Fact.SK_B = DimB.SK_B

    WHERE DimA.Col IN (42, 13, 7)

    AND DimB.Col IN (13, 42, 7);

     

    a composite index on columns SK_A, SK_B (CREATE INDEX IX_StarJoin ON Fact (SK_A, SK_B)) enables the use of the ‘safe crossjoin’ star join optimization as can be seen in the corresponding query plan below.

     

     

    2.     Order of columns in a composite index

    Once we’ve determined which indexes to include in a composite index, the next challenge is to determine the optimal order for these columns.  As with the previous question, there is no clear-cut answer here either, but the points below should help provide guidance around the key factors to consider.

     

    a.       Ensure that the leading columns are selective.  This has a two-fold benefit: (1) the effectiveness of index seek operation is increased; and (2) since the intra-query parallelism with which an index create or rebuild operation is executed is limited to the selectivity of the first column, having a selective leading column ensures that the database engine can execute the operation with the highest degree of parallelism possible on the server at the given time.  For example, if the first column of an index only has one data value (example: all values are ‘Region001’), the index rebuild operation can at most be executed with a parallelism of 1, i.e. it is executed with a serial query plan.   I have seen real-world customer deployments where the design of the database has been such that the leading columns of certain large tables have had low (all values the same) selectivity and because of this their index rebuild operations were executed with a serial query plan.  This resulted in the task taking a really long time to complete even though they had a relatively lightly loaded 32-core server and had all the other database server configurations (‘max degree of parallelism’, ‘cost threshold for parallelism’) set correctly.

     

    One possible exception to this point may arise with regards to index fragmentation.  When selecting the leading column of an index (clustered index especially) you should take into consideration potential for fragmentation, and any column for which you are inserting data values all over key range should preferably not be placed as the leading column, especially if there's a high insert or update activity on the table.

     

    b.      Specify columns used in inequality predicate towards the end of the list.  When creating composite indexes for queries that use a mix of equality and inequality (>, <, !=, BETWEEN) predicates, the columns corresponding to the selective equality predicates should specified first in the composite index, followed by the inequality predicates.  This is because the depth with which an index seek operation is performed is limited to the occurrence of the first inequality predicate.  For example, if the following query was to execute and use the idx1 index defined above, the index seek operation would only be able to seek on the first column (col1) of the index even though there are valid equality predicates for col3.

     SELECT col1 FROM TabA

    WHERE col1 > 28

          AND col2 = 99

          AND col3 = 0;

    GO

     

    The query execution plan text (viewed using SET SHOWPLAN_TEXT ON) for this query would look like:

    |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[TabA].[idx1]), SEEK:([tempdb].[dbo].[TabA].[col1] > CONVERT_IMPLICIT(int,[@1],0)),  WHERE:([tempdb].[dbo].[TabA].[col2]=CONVERT_IMPLICIT(int,[@2],0) AND [tempdb].[dbo].[TabA].[col3]=CONVERT_IMPLICIT(int,[@3],0)) ORDERED FORWARD)

     

    In this case you can notice that the index is not optimally used due to the SEEK operation only being executed on col1 while the predicate values of col2 and col3 are filtered out using a WHERE clause (SCAN operation).

     

    If the index idx1 was primarily created to help the performance of this query it may help to reorganize the index columns such that the one (col1) which is used in the inequality predicate is specified towards the end (right-most column) in the composite index, i.e. the order should be:  col2, col3, col1.  This would permit the database engine to optimally SEEK into columns col2, col3 and col1 as can be seen in the query execution plan text below.

    |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[TabA].[idx1]), SEEK:([tempdb].[dbo].[TabA].[col2]=CONVERT_IMPLICIT(int,[@2],0) AND [tempdb].[dbo].[TabA].[col3]=CONVERT_IMPLICIT(int,[@3],0) AND [tempdb].[dbo].[TabA].[col1] > CONVERT_IMPLICIT(int,[@1],0)) ORDERED FORWARD)

     

    After making this index change you will notice that there is no WHERE clause in the query plan text and the number of logical read operations (observed via SET STATISTICS IO ON) decrease.

     

    NOTE: You cannot view these internal index operations using the graphical showplan available in SQL Server Management Studio.

     

    If a query has more than 1 inequality predicate it is best to specify the most selective inequality predicate immediately following any equality predicates.

     

    c.       Specify appropriate ascending or descending column order.  The columns in an index can be specified to be ascending (ASC) or descending (DESC).  For composite index, qualifying the column order to match the ORDER BY clause in a query optimizes the use of the index.  For example, for the query below it would help to specify the order of the index columns as col1 ASC, col2 DESC, col3 ASC.

    SELECT col1, col2, col3 FROM TabA

    ORDER BY col1 ASC, col2 DESC, col3 ASC;

     

    NOTE:  SQL Server is able to perform index seeks and scans in either direction; this means that you do not need to create 2 indexes if one query orders the result set using “ORDER BY col1 ASC” and the other uses “ORDER BY col1 DESC”.

     

    Once you’ve created the composite index, you can use the SQL Server built-in stored procedure sp_spaceused to view the total size of all the indexes present on a table. For example: sp_spaceused  [dbo.TbA] displays:

     

    name        rows        reserved    data        index       unused

    TabA        869088     187288 KB   158032 KB   29192 KB    64 KB

    In this output we can see that the total space used by all the indexes in TabA is 29.19 MB.

     

    If you’re interested in drilling in further, you can use the following query to view details about each index level of every index for the table (TabA).

    SELECT OBJECT_NAME(OBJECT_ID), *

    FROM sys.dm_db_index_physical_stats (DB_ID(),NULL, NULL , NULL, 'SAMPLED')

    WHERE OBJECT_ID=OBJECT_ID('TabA')

     

    Please refer to SQL Server Books OnLine for details about each of the columns displayed (http://msdn.microsoft.com/en-us/library/ms188917.aspx).

     

    The points mentioned above apply equally to clustered as well as non-clustered composite indexes, unless explicitly stated, and therefore no distinction has been drawn between them.

     

    NOTE: It is a SQL Server best practice to have a clustered index on all tables, please refer to the SQL Server Best Practices Article: ‘Comparing Tables Organized with Clustered Indexes versus Heaps’ (http://sqlcat.com/whitepapers/archive/2007/12/16/comparing-tables-organized-with-clustered-indexes-versus-heaps.aspx) for additional information on this topic.

     

     

  • How to create an autonomous transaction in SQL Server 2008

    I have been asked by many customers and partners, especially those migrating from Oracle, this question: how to create an autonomous transaction in SQL Server? It turns out to be a tricky thing to do since SQL Server doesn't have built-in autonomous transaction support like Oracle.

    An Autonomous transaction is essentially a nested transaction where the inner transaction is not affected by the state of the outer transaction. In other words, you can leave the context of current transaction (outer transaction) and call another transaction (autonomous transaction). Once you finish work in the autonomous transaction, you can come back to continue on within current transaction. What is done in the autonomous transaction is truly DONE and won't be changed no matter what happens to the outer transaction. To make it easier to understand, here is an example of the described scenario.

    BEGIN TRAN OuterTran

          INSERT TABLE1

          BEGIN “AUTONOMOUS” TRAN InnerTran

                INSERT TABLE2

          COMMIT “AUTONOMOUS” TRAN InnerTran

    ROLLBACK TRAN OuterTran

    The above pseudo script is meant to preserve result of INSERT TABLE2”. In SQL Server 2008 or prior versions, "ROLLBACK TRAN" would always rollback all inner transactions to the outermost "BEGIN TRAN" statement (without specifiying savepoint). So the "InnerTran" transaction would be rolled back as well, which is not the desired behavior for the particular scenario.

    You could wonder why we need an autonomous transaction in the first place. Why can't we just implement two separate transactions so they don't interfere with each other? There are scenarios where applications do need logic structured like this. Logging errors in database is one of the most common scenarios. Below is a TSQL script demonstrating a nested transaction where the inner transaction attempts to save the runtime errors in a table.

    USE TEMPDB

    GO

    CREATE TABLE ErrorLogging (logTime DATETIME, msg VARCHAR(255))

    CREATE TABLE TestAT (id INT PRIMARY KEY)

    GO

    CREATE PROCEDURE usp_ErrorLogging

          @errNumber INT

    AS

          INSERT INTO ErrorLogging VALUES (GETDATE(), 'Error ' + CAST(@errNumber AS VARCHAR(8)) +' occurred.')

    GO

     

    DECLARE @ERROR AS INT

    INSERT INTO TestAT VALUES (1)

    BEGIN TRAN OuterTran

          INSERT INTO TestAT VALUES (1) -- This will raise primary key constraint violation error

         

          SELECT @ERROR = @@ERROR

          IF @ERROR <> 0

          BEGIN

                BEGIN TRAN InnerTran

                      EXEC usp_ErrorLogging @ERROR

                COMMIT TRAN InnerTran

         

                ROLLBACK TRAN OuterTran

          END

     

    IF @@TRANCOUNT > 0     

    COMMIT TRAN OuterTran

    GO

    SELECT * FROM TestAT

    SELECT * FROM ErrorLogging

    GO

    If you run above script against SQL Server, you would see no error message recorded in table "ErrorLogging" due to the "ROLLBACK TRAN OuterTran" statement. So, how can we make it work?

    In SQL Server 2008, you can implement a loopback linked server to achieve the same goal. For more information about loopback linked server, check Books Online for details (http://msdn.microsoft.com/en-us/library/ms188716.aspx).

    USE MASTER

    GO

    EXEC sp_addlinkedserver @server = N'loopback',@srvproduct = N' ',@provider = N'SQLNCLI', @datasrc = @@SERVERNAME

    GO

    EXEC sp_serveroption loopback,N'remote proc transaction promotion','FALSE'

    Go

    Note 'remote proc transaction promotion' is a new option on SQL Server 2008, which allows you to control whether or not you want to enlist remote stored procedure call in a distributed transaction. When this option is off  (FALSE) as we set in the above example, the local transaction will not be promoted to distributed transaction. This is how we are able to separate outer and inner transactions in a "autonomous transaction" fashion.

    The Inner transaction above can be replaced by:

          BEGIN TRAN InnerTran

                EXEC loopback.tempdb.dbo.usp_ErrorLogging @ERROR

          COMMIT TRAN InnerTran

    A complete working script is in the appendix below. I want to point out that this method of using a loopback linked server might not scale well if it's executed very frequently. And it only works in SQL Server 2008 due to new server option of 'remote proc transaction promotion' as discussed above. As always, test before you use it.

    If you are looking for alternative ways of creating autonomous transaction on SQL 2008 or 2005, you have these options:

    1. Loopback connection from SQLCLR procedure to start a new transaction. Compared to more rigid structure need of loopback linked server, SQLCLR is more flexible and gives you more control over how you want to handle interaction with database. If the logic of the autonomous transaction includes computational tasks, it's one of SQLCLR's strengths to provide performance gain as extra benefit.
    2. Using table variable to save data within transaction. Table variables are not affected by transaction rollback thus serve as temporary buffer for transaction data. Once transaction is done, you can dump data out of table variable to a permanent table. Table variables have limited scope and are less flexible. Usually they would also be slower due to lack of index/statistics. However, it does offer you a pure TSQL option with no need to create anything new.
    3. Loopback connection from Extended Stored Procedures. Note: Extendeded Stored Procedure functionality is on the SQL Server deprecation list and Microsoft strongly recommends NOT using it.

    In a future blog, I'll provide sample SQLCLR code and a script using a table variable to create autonomous transactions. I will also compare their performance differences with loopback linked server in a scalability test. Stay tuned.

    Appendix

    USE MASTER

    GO

    EXEC sp_addlinkedserver @server = N'loopback',@srvproduct = N' ',@provider = N'SQLNCLI', @datasrc = @@SERVERNAME

    GO

    EXEC sp_serveroption loopback,N'remote proc transaction promotion','FALSE'

    EXEC sp_serveroption loopback,N'RPC OUT','TRUE' -- Enable RPC to the given server.

    Go

    USE TEMPDB

    GO

    CREATE TABLE ErrorLogging (logTime DATETIME, msg VARCHAR(255))

    CREATE TABLE TestAT (id INT PRIMARY KEY)

    GO

    CREATE PROCEDURE usp_ErrorLogging

          @errNumber INT

    AS

          INSERT INTO ErrorLogging VALUES (GETDATE(), 'Error ' + CAST(@errNumber AS VARCHAR(8)) +' occurred.')

    GO

     

    DECLARE @ERROR AS INT

    INSERT INTO TestAT VALUES (1)

    BEGIN TRAN OuterTran

          INSERT INTO TestAT VALUES (1) -- This will raise primary key constraint violation error

         

          SELECT @ERROR = @@ERROR

          IF @ERROR <> 0

          BEGIN

                BEGIN TRAN InnerTran

                      EXEC loopback.tempdb.dbo.usp_ErrorLogging @ERROR

                COMMIT TRAN InnerTran

               

                ROLLBACK TRAN OuterTran

          END

     

    IF @@TRANCOUNT > 0     

    COMMIT TRAN OuterTran

    GO

    SELECT * FROM TestAT

    SELECT * FROM ErrorLogging

    GO
  • UPDATE with OUTPUT clause – Triggers – and SQLMoreResults

    NOTE:  the code in this BLOG is TSQL instead of ODBC calls.  Since ODBC can be hard to understand and other API’s will have the same basic issues, I decided to use the simpler and more concise TSQL, which should also appeal to a wider audience.

     

    An ISV I work with recently ran into an interesting problem; here is the description and solution.

     

    PROBLEM:

    Adding an unexpected trigger caused application code to fail due to incomplete SQL Syntax, and not reading through all returned results.

     

    The ISV wanted to utilize the OUTPUT Clause of the UPDATE statement in their ODBC (SNAC) based application. The OUTPUT clause is very useful in providing data back to the application regarding the row, or rows, which were updated (or: inserted / deleted).  In the example I use below, the application is interested in knowing the date/time of the updated row(s).

     

    This could be accomplished by issuing the following statement:

    UPDATE T SET COL2 = @Pcol2, COL3 = getdate() OUTPUT CAST(INSERTED.COL3 AS varchar(30))WHERE COL1 = @Pcol1

     

    The ISV coded up the application expecting a return value for number of rows affected, and if that value was greater than 0 then it also returned the value of the inserted date/time.

     

    This worked well, until an external Partner application added a trigger to the table listed in the UPDATE statement.

     

    Example: CREATE TRIGGER [dbo].[TTrigger1] on [dbo].[T] after update as update t2 set col3 = 0

     

    Now the application failed on the UPDATE statement with the following error message:

    [Microsoft][SQL Native Client][SQL Server]The target table 'T' of the DML statement cannot have any enabled triggers if the statement contains an OUTPUT clause without INTO clause.

     

    The error message is self-explanatory, but was a surprise to the ISV application (and the application developer).  The developer did not expect a trigger to ever be created on the table.

    There are two different methods of getting OUTPUT data from an UPDATE statement;

    ·       UPDATE with the OUTPUT clause only – this returns output results directly as part of the statement. This option cannot have a trigger defined on the table.

    ·       UPDATE with OUTPUT and INTO clauses – this returns the output a specific table, or table variable. This option must be used if there is any possibility the table will have a trigger on it at any point.

    ·       See the following website for complete the OUTPUT Clause documentation:

    http://msdn.microsoft.com/en-us/library/ms177564.aspx

    The developer then utilized the following syntax to send the same statement to SQL Server, and also to get the expected result back: declare @p165 table (col2 varchar(30));UPDATE T SET COL2 = ?, COL3 = getdate() OUTPUT CAST(INSERTED.COL3 AS varchar(30)) into @p165 WHERE COL1 = 1;select * from @p165

     

    Now a subtlety occurred, can you guess what it was? If you guessed that additional results are returned you are correct.

    The ODBC code returned data in a loop utilizing the following API calls:  SQLFetch, SQLNumResultCols, SQLRowCount, SQLMoreResults:

    ·       The first results returned were the number of rows affected by the trigger, not the number of rows affected by the UPDATE statement, which was what the application was actually expecting

    ·       The second set of results were the number of rows affected by the UPDATE statement

    ·       The third set of results were the number of rows returned by the SELECT statement reading the table variable

    ·       And finally, the actual data from the updated row(s) – which is what we really wanted in the first place!

    So, the lessons to be learned here are:

    1.   Be aware that triggers will affect your UPDATE statements if utilizing the OUTPUT clause

    2.    You should utilize the INTO clause to avoid the issue

    3.    Always use SQLMoreResults to read all of the result-sets that could be returned from SELECT, UPDATE, INSERT, or DELETE statements.

    4.    Triggers should include the ‘SET NOCOUNT ON’ statement to avoid returning the ‘affected number of rows’.

    SOLUTION:

    The application was changed to utilize the INTO clause, and SQLMoreResults was used to return all the resulting data.  Using SET NOCOUNT ON in trigger logic is also a best practice that prevents additional results ‘Rows affected’ from being generated.

     

    Here is a script to duplicate the issues I’ve described:

    USE tempdb

    GO

    ------You may want to run this script in steps from comment – to comment

    ------so you can follow along, instead of running the entire script at once

     

    CREATE TABLE t(

          [col1] [int] NOT NULL,

          [col2] [varchar](30) NULL,

          [col3] [datetime] NULL

    ) ON [PRIMARY]

    GO

    insert into t values (1,'abc', getdate())

    insert into t values (1,'abc', getdate())

    insert into t values (1,'abc', getdate())

    GO

    select * from t

    GO

    UPDATE t SET col2 = 'Peter', col3 = getdate()

    OUTPUT CAST(INSERTED.col3 AS varchar(30))WHERE col1 = 1

    GO

    select * from t

    GO

    ------So far everything is good, Now let’s add the new table and the trigger

    CREATE TABLE t2(

          [col1] [int] NULL,

          [col2] [datetime] NULL

    ) ON [PRIMARY]

    GO

    insert into t2 values (2, getdate())

    insert into t2 values (2, getdate())

    GO

    select * from t2

    GO

    ------In this example, the trigger: ttr1 will update the rows

    ------of a second table: t2

    CREATE TRIGGER ttr1 on t after update as update t2 set col1 = 0

    GO

    ------OK, let’s try now with the trigger on

    UPDATE t SET col2 = 'Peter', col3 = getdate() OUTPUT CAST(INSERTED.col3 AS varchar(30))WHERE col1 = 1

    GO

    ------Chances are good you got the following error message

    --Msg 334, Level 16, State 1, Line 1

    --The target table 't' of the DML statement cannot have any enabled triggers --if the statement contains an OUTPUT clause without INTO clause.

    ----- let’s fix that now.

    declare @p1 varchar(30)

    UPDATE t SET col2 = 'Peter', col3 = getdate() OUTPUT CAST(INSERTED.col3 AS varchar(30))into @p1 WHERE col1 = 1

    GO

    ------Notice this failed as well with the following error message

    --Msg 1087, Level 16, State 1, Line 2

    --Must declare the table variable "@p1".

    ------We need to use a table

    ------for this to work correctly we must use a table or

    ------a table variable where the ‘INTO’ data will reside,

    ------and be retrieved from

    declare @p1 table (col2 varchar(30))

    UPDATE t SET col2 = 'Peter', col3 = getdate() OUTPUT CAST(INSERTED.col3 AS varchar(30))into @p1 WHERE col1 = 1

    select * from @p1

    --Now you get what we were originally looking for

    --    the date/times of the rows that were updated