SQL Server Storage Engine

  • TempDB:: Table variable vs local temporary table

    As you know the tempdb is used by user applications and SQL Server alike to store transient results needed to process the workload. The objects created by users and user applications are called ‘user objects’ while the objects created by SQL Server engine as part of executing/processing the workload are called ‘internal objects’. In this blog, I will focus on user objects and on table variable in particular.

    There are three types of user objects; ##table, #table and table variable. Please refer to BOL for specific details. While the difference between ##table (global temporary table) and #table (local temporary table) are well understood, there is a fair amount of confusion between #table and table variable.  Let me walk through main differences between these.

    A table variable, like any other variable, is a very useful programming construct. The scoping rules of the table variable are similar to any other programming variables. For example, if you define a variable inside a stored procedure, it can’t be accessed outside the stored procedure. Incidentally, #table is very similar. So why did we create table variables? Well, a table variable can be very powerful when user with stored procedures to pass it as input/output parameters (new functionality available starting with SQL Server 2008) or to store the result of a table valued function. Here are some similartities and differences between the two:

    ·         First, the table variable is NOT necessarily memory resident. Under memory pressure, the pages belonging to a table variable can be pushed out to tempdb. Here is one example shows space taken by table variable in tempdb

    use tempdb

    go

     

    drop table #tv_source

    go

     

    create table #tv_source(c1 int, c2 char(8000))

    go

     

    declare @i int

    select @i = 0

    while (@i < 1000)

    begin

           insert into #tv_source values (@i, replicate ('a', 100))

           select @i = @i + 1

    end

     

    DECLARE @tv_target TABLE (c11 int, c22 char(8000))

     

     

     INSERT INTO @tv_target (c11, c22)

        SELECT c1, c2

        FROM  #tv_source

     

    -- checking the size through DMV.

    -- The sizes here are in 8k pages. This shows the allocated space

    -- to user objects to be 2000 pages (1000 pages for #tv_source and

    -- 1000 pages for @tv_target

     

    Select total_size = SUM (unallocated_extent_page_count) +

    SUM (user_object_reserved_page_count) +

    SUM (internal_object_reserved_page_count) +

    SUM (version_store_reserved_page_count) +

    SUM (mixed_extent_page_count),

     

    SUM (unallocated_extent_page_count) as freespace_pgs,

    SUM (user_object_reserved_page_count) as user_obj_pgs,

    SUM (internal_object_reserved_page_count) as internal_obj_pgs,

    SUM (version_store_reserved_page_count)  as version_store_pgs,

    SUM (mixed_extent_page_count) as mixed_extent_pgs

    from sys.dm_db_file_space_usage

    ·         Second, when you create a table variable, it is like a regular DDL operation and its metadata is stored in system catalog. Here is one example to check this

    declare @ttt TABLE(c111 int, c222 int)

    select name from sys.columns where object_id > 100 and name like 'c%'

    This will return two rows containing columns c111 and c222. Now this means that if you were encountering DDL contention, you cannot address it by changing a #table to table variable.

    ·         Third, transactional and locking semantics. Table variables don’t participate in transactions or locking. Here is one example


    -- create a source table

    create table tv_source(c1 int, c2 char(100))

    go

     

    declare @i int

    select @i = 0

    while (@i < 100)

    begin

           insert into tv_source values (@i, replicate ('a', 100))

           select @i = @i + 1

           end

    -- using #table

    create table #tv_target (c11 int, c22 char(100))

    go

     

    BEGIN TRAN

     

        INSERT INTO #tv_target (c11, c22)

                SELECT c1, c2

                FROM  tv_source

     

     

    -- using table variable

     

    DECLARE @tv_target TABLE (c11 int, c22 char(100))

     

    BEGIN TRAN

           INSERT INTO @tv_target (c11, c22)

            SELECT c1, c2

        FROM  tv_source

     

     

    -- Now if I look at the locks, you will see that only

    -- #table takes locks. Here is the query that used

    -- to check the locks   

    select 

        t1.request_session_id as spid, 

        t1.resource_type as type,  

        t1.resource_database_id as dbid, 

        (case resource_type

          WHEN 'OBJECT' then object_name(t1.resource_associated_entity_id)

          WHEN 'DATABASE' then ' '

          ELSE (select object_name(object_id) 

                from sys.partitions 

                where hobt_id=resource_associated_entity_id)

        END) as objname, 

        t1.resource_description as description,  

        t1.request_mode as mode, 

        t1.request_status as status,

           t2.blocking_session_id

    from sys.dm_tran_locks as t1 left outer join sys.dm_os_waiting_tasks as t2

    ON t1.lock_owner_address = t2.resource_address

     

    Another interesting aspect is that if I rollback the transaction involving the table variable earlier, the data in the table variable is not rolled back.

    Rollback

    -- this query will return 100 for table variable but 0 for #table.

    SELECT COUNT(*) FROM @tv_target

    ·         Fourth, the operations done on table variable are not logged. Here is the example I tried

    -- create a table variable, insert bunch of rows and update

    DECLARE @tv_target TABLE (c11 int, c22 char(100))

     

    INSERT INTO @tv_target (c11, c22)

        SELECT c1, c2

        FROM  tv_source

     

     

    -- update all the rows

    update @tv_target set c22 = replicate ('b', 100)

     

     

    -- look at the top 10 log records. I get no records for this case

    select top 10 operation,context, [log record fixed length], [log record length], AllocUnitId, AllocUnitName

    from fn_dblog(null, null)

    where AllocUnitName like '%tv_target%'

    order by [Log Record Length] Desc

     

    -- create a local temptable

    drop table #tv_target

    go

     

    create table #tv_target (c11 int, c22 char(100))

    go

     

    INSERT INTO #tv_target (c11, c22)

        SELECT c1, c2

        FROM  tv_source

     

    -- update all the rows

    update #tv_target set c22 = replicate ('b', 100)

     

     

    -- look at the log records. Here I get 100 log records for update

    select  operation,context, [log record fixed length], [log record length], AllocUnitName

    from fn_dblog(null, null)

    where AllocUnitName like '%tv_target%'

    order by [Log Record Length] Desc

     

    ·         Fifth, no DDL is allowed on table variables. So if you have a large rowset which needs to be queried often, you may want to use #table when possible so that you can create appropriate indexes. You can get around this by creating unique constraints when declaring table variable.

     

    ·         Finally, no statistics is maintained on table variable which means that any changes in data impacting table variable will not cause recompilation of queries accessing table variable. Queries involving table variables don't generate parallel plans.

     

     

    Thanks

    Sunil

     

  • Overhead of Row Versioning

    Last week I was teaching a class on snapshot isolation and we discussed the overheads of snapshot isolation. There are three main overheads that you need to be aware of when you enable snapshot isolation read-committed-snapshot on a database. First, each UPDATE/DELETE operation generates a row version. Note, that the row version is not generated when you INSERT a new row. There is an exceptions to this, but I will skip that discussion here. Second, row versions are created and kept in tempdb that leads to increase space usage and IOs there. Third, the queries run under snapshot isolation or read-committed-snapshot need to traverse the row version chain which may lead to as many logical IOs and potentially as many 'random' physical IOs because row versions for a given row are not necessarily on the same or even contiguous physical page(s). It is interesting to note that if we did not traverse the row versions, the query would have blocked otherwise. So basically, at some expense of IO/CPU, the SQL Server can return the transactional consistent results without incurring blocking. We recommend enabling row versioning based isolation levels only when your application is incurring reader/writer blocking.

    One question that came up was on the overhead of row versioning on the data or index row. When you enable snapshot isolation or read-committed-snapshot option on the database, the SQL Server starts adding a 14 byte overhead to each row to keep the following information:

    • XTS (transaction sequence number). It takes 6 bytes. This is used for marking the XSN that did the DML operation on the row
    • RID (row identifier) that points to the versioned row. It takes 8 bytes.

    But this overhead is only added when an existing row is actually modified or if the database was already enabled for either of these options. Let me give a very simple example.

    create table foo..t2_snapshot (c1 int)

    go

     

    insert into foo..t2_snapshot values (1)

    go

     

     

    -- show the max rowsize in bytes

    select max_record_size_in_bytes

    from sys.dm_db_index_physical_stats (db_id('foo'),

                                           object_id('foo.t2_snapshot'),

        null, null, 'DETAILED')

    This will show a max row size of 11 bytes (note, the 7 bytes are the metadata overhead within the row and other 4 bytes store the integer value). Now, let me enable snapshot isolation on the database foo. Note, this is a metadata operation and it will NOT change the existing rows in the tables to add extra 14 bytes.

    alter database foo set allow_snapshot_isolation ON

    go

     

    -- you can verify the datbaase state using the following 

    select is_read_committed_snapshot_on, snapshot_isolation_state_desc,

    snapshot_isolation_state from sys.databases where name='foo'

    go

     

    Now I run the query

    select max_record_size_in_bytes

    from sys.dm_db_index_physical_stats (db_id('foo'),

                                         object_id('foo.t2_snapshot'),

                                          null, null, 'DETAILED')

    You will notice that the row length still remains 11 bytes. No extra 14 bytes overhead even though the database foo has been enabled for snapshot isolation.  Let us now update the data row as follows

    update foo..t2_snapshot set c1 = 2

    After the update, you will notice the length of the row has become 25 bytes. In other words, there is a 14 byte overhead. Also, the new rows that you insert will have a length of 25 bytes (including row versioning overhead of 14 bytes).There are two interesting observations to be made. First, an update of a fixed length column may lead to page splits because the length of the row has increased. This only happens the first time the row is updated. Second, the increased row length will cause the table size to grow. A typical row size in production is around 300 bytes. So this means that snapshot isolation or read-committed-snapshot will cause 5% growth in the size of the table (in the worst case when all the rows were updated). 

    Now, if I disable the snapshot isolation and update the same row again, the extra 14 bytes will go away. This is because row versioning has been disabled. I can do the same by rebuilding the index

    Thanks

    Sunil

     

  • CHECKSUM and Tempdb

    You may recall that starting with SQL Server 2005, you have an option available to enable CHECKSUM on the user databases.  For details, please refer to http://blogs.msdn.com/sqlserverstorageengine/archive/2006/06/29/Enabling-CHECKSUM-in-SQL2005.aspxIn fact, any new database created in SQL2005 has CHECKSUM enabled automatically but it does not happen to databases that are upgraded from previous versions of SQL Server. For the upgraded databases from SQL Server versions earlier than SQL Server2005, you will need to enable CHECKSUM explicitly using ALTER DATABASE command. Enabling CHECKSUM is critical and it allows SQL Server to detect the corruption in the IO path (e.g. a disk malfunction) when the page is read as part of query or when you run DBCC CHECKDB command. While this has been very useful to our customers, there was one missing link and that was that SQL Server did not allow enabling CHECKSUM on tempdb. This provided a window where a page corrupted due to mis-behaving disk found its way into user database even if you had enabled CHECKSUM on the databse. Here is one such scenario

    Scenario:: bulk import the data into a temp table for staging purposes and then move it to the user database. The user database has CHECKSUM enabled, so the new page, when written to the disk, will have checksum computed. But guess what is missing? If the tempdb disk corrupts the pages in tempdb, SQL Server will have no way of knowing that the page was corrupted and it will go to user database without detection. Yes, when the page is subsequently read, depending upon what the corruption was, the SQL Server may detect it or may not. For example if a bit flip happened for the integer value, it will go undetected.

    With CHECKSUM available on tempdb starting with SQL2008, you can finally close this window.  You can use the following command

    ALTER DATASE tempdb set PAGE_VERIFY CHECKSUM

    For new installs of SQL Server 2008, all tempdbs will have CHECKSUM enabled by default. You can always disable it using ALTER DATABASE command but we don't recommend it. For databases upgraded to SQL Server 2008, you will need to explicitly enable CHECKSUM on the tempdb. We measured the performance impact of enabling CHECKSUM in tempdb and the impact is very low (<2% of CPU) which is similar to what you would expect in user database. Since the CHECKSUM is only computed when page is written to the disk, the added point is that there is no 'checkpoint' in tempdb, so a page in tempdb is written to disk ONLY under memory pressure. So you may not see as many CHECKSUM calculations in tempdb.

     Note, this is not available in CTP-6 but will be available in RTM bits.

    Thanks

    Sunil

  • Minimal Logging changes in SQL Server 2008 (part-3)

     

    So far we had discussed  (http://blogs.msdn.com/sqlserverstorageengine/archive/2008/03/23/minimal-logging-changes-in-sql-server-2008-part-2.aspx ) how minimal logging changes impact when you are moving data from one table to another table. Now let us look at how does this change more conventional bulk import. As you will see, conventional bulk import takes advantage on these changes as well but these changes are only useful when importing into a btree as conventional bulk import already provides minimal logging for heaps. Here are the series of scenarios that I tried

     

     

     

    (1)  Inserting into an HEAP. No changes in this behavior

     

    begin tran

     

    -- this is optimized load

    -- (1) show that individual rows are not logged

    -- (2) show the BU lock

     

    bulk insert t_heap

    from 'C:\sql-server-test\minimal-logging\t_source-dat.dat'

      with (TABLOCK)

     

    Logging:: Minimal Logging

     

     

    -- if we don't specify TABLOCK, it leads to full logging. This is same what we always had

    -- (1) without TABLOCK, no minmal logging.

    bulk insert t_heap

    from 'C:\sql-server-test\minimal-logging\t_source-dat.dat'

     

     

    LOGGING: Fully logged

     

     

     

    (2) Insert into BTREE

     

    a)    Into empty BTREE

    create table t_ci (c1 int,  c2 int, c3 char(100), c4 char(1000))

    go

     

    create clustered index ci on t_ci(c1)

    go

     

    begin tran

     

    -- this is optimized load

    -- (1) show that individual rows are not logged

    -- (2) Lock: X lock on the table

    bulk insert t_ci

    from 'C:\sql-server-test\minimal-logging\t_source-dat.dat'

      with (TABLOCK, datafiletype = 'char')


     

     

     

    -- Now, what if don't specify TABLOCK? This is a change from earlier versions. and this is also optimized load

    -- (1) we get minimal logging

    -- (2) IX: at table level

    -- (3) you can use this to do parallel load as long as you are

    -- (4) importing into disjoint ranges

     

    bulk insert t_ci

    from 'C:\sql-server-test\minimal-logging\t_source-dat.dat'

      with (datafiletype = 'char')

     

     

    Here are the top 10 log records. Note, in this case, only the range lock is taken

     

    b)    Into non-empty btree (when inserting into an increasing range)

     

    -- Inserting rows with disjoint range, we still get minimal logging

    -- if you are inserting into a btree but the range you are inserting overlaps partially with the existing data,

    -- the logging will go between minimal and full logging. The rule is that you will get minimal logging

    -- only when you allocate a new page.

     

     

    bulk insert t_ci

    from 'C:\sql-server-test\minimal-logging\t_newrange-dat.dat'

      with (datafiletype = 'char')

    Here are the top 10 log records. Note, in this case, only the range lock is taken. Since the SQL Server does not take X lock on the table, you can do parallel bulk import with minimal logging into a btree.

     

    c)   Into a table with clustered and non-clustered indexes

     

    create table t_ci_nci (c1 int, c2 int, c3 char(100), c4 char(1000))

    go

     

    create clustered index ci on t_ci_nci(c1)

    go

     

    create index nci on t_ci_nci(c2, c3)

    go<