SQL Server Storage Engine

How can you prove that hardware is the problem? Use stress to reduce stress...

I'm sure you've been there - something causes corruption in the database. You blame the hardware, the hardware guys blame the software. There's no smoking gun and the hardware diagnostics come back clean. What can you do?

SQLIOStress

This is the hardware diagnostic you really want to run. It simulates a very heavy SQL Server workload and should do a far better job of discovering flaws in your hardware setup than individual hardware vendors' diagnostics will. We recommend that you run it before installing a system as well as using it to expose hardware as the problem in difficult-to-diagnose corruption problems.

Youc an find info on it at http://support.microsoft.com/default.aspx?scid=kb;en-us;231619 and there will be an updated version released in the next month or so (I'll blog when it gets out).

Page checksums

This is a cool new feature of 2005. It's a per-database option that will write an error-detection checksum on each page when it is flushed from the buffer pool. When a page is subsequently read again, the checksum is recalculated and checked against that stored on the page..

Here are some points to note about page checksums (they debunk a bunch of common misconceptions):

  • Upgrading a database to SQL Server 2005 and turning on the option does not automatically protect all the pages, as a page has to be changed and written to disk after the database option is enabled to have a checksum written on it. Only when a page has been through this process is it protected. There is no tool or automatic way to force all pages to go through this process (we may provide one in future).
  • In benchmarking tests with a TPCH workload, we measured approx 2% performance degradation as a result of having checksums enabled.
  • The checksum cannot be used for error correction. Generating an error-correcting checksum would be a more complicated algorithm and so would be slower to compute.
  • The checksum is validated when a page is read for checking by any of the DBCC CHECK* commands (regardless of whether the PHYSICAL_ONLY option was used) so all existing page checksums can be checked by issuing a DBCC CHECKDB command.
  • Any existing checksums are checked when pages are read as part of taking a backup. In addition, the restore logic will also verify the page checksums as the pages are restored from the backup media, so there is solid assurance that the data from the backup is consistent.
  • Page checksums do not prevent in-memory corruptions from memory scribblers (where the page is read in, corrupted in memory from some rogue process, and then written out with a new checksum). In Enterprise Edition, there is a ‘checksum sniffer’ that runs constantly, randomly picking unchanged pages in the buffer pool and validating their checksums to see if the page has been scribbled on.

Bad page checksums will result in IO errors being reported:

  • IO failures will trigger read-retry logic, which will re-read the page several times to see if the error clears itself (if it does, a message is written to the SQL error log)
  • IO failures that persist through read-retry are logged in the error log and Windows event log, so monitoring these will allow you to be alerted to hardware problems quickly.
  • If an IO error is reported, the page is not taken offline in any way. Anything that subsequently touches the page will get another IO error. The exception to this is if the IO error is encountered during transaction rollback. In this case, the database is taken offline and must be brought back online manually.

Page checksums should vastly reduce the number of undiagnosed corruption problems (saving time and hassle for you and our support team). Now all you need to do is make sure you have that excellent backup strategy to go with it...

Published Wednesday, June 21, 2006 9:37 PM by Paul Randal - MSFT
Filed under:

Comments

 

bertcord said:

Has thi sutility been updated to allwo test files larger than 4GB.  I think this was the limit the last time I tried.  While this is good for DAS solutions some of our SANS have 16-32GB of cache
June 30, 2006 5:02 PM
 

JHalmans said:

The updated version has raised the file size substantially.  Actually it's capable of creating files as large as ones created by SQL Server.  
July 7, 2006 10:18 PM
 

dabiep said:

It starts with restoring to disimiliar hardware.. Try doing a fresh install of SQL and see how it attaches itself to the hardware, then create a database, then back it up , using veritas, TSM, or whatever else you can find out there. Then, simulate a disaster. Restore this to disimilar hardware. GOOD LUCK ... If you really want to blow a blood vessel..do system state restore as well... I know what the problem is and the solution... It is really simple...

May 11, 2007 10:10 PM
 

Drugs and Movies » SQL Server Storage Engine : How can you prove that hardware is the … said:

April 16, 2008 8:58 PM
Anonymous comments are disabled

About Paul Randal - MSFT

Paul started in the industry in 1994 working for DEC on the VMS file system and check/repair tools. In 1999 he moved to Microsoft to work on SQL Server, specifically on DBCC. For SQL Server 2000, he concentrated on index fragmentation (writing DBCC INDEXDEFRAG and DBCC SHOWCONTIG) plus various algorithms in DBCC CHECKDB. During SQL Server 2005 development Paul was the lead developer/manager of one the core dev teams in the Storage Engine, responsible for data access and storage (DBCC, allocation, indexes & heaps, pages/records, text/LOB storage, snapshot isolation, etc). He also spent several years rewriting DBCC CHECKDB and repair. For SQL Server 2008, Paul managed the Program Management team for the core Storage Engine to become more focused on customer/partner engagement and feature set definition. In 2007, after 8.5 years on the SQL Server team, Paul left Microsoft to join his wife, Kimberly Tripp, running SQLskills.com and pursuing his passion for presenting and consulting. Paul regularly presents at conferences and user groups around the world on high-availability, disaster recovery and Storage Engine internals. His popular blog is at http://www.sqlskills.com/blogs/paul/.

© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker