My name is HarshDeep Singh, and I am a SQL Server engineer with Microsoft CSS. I've been working on SQL Server for quite some time, and the intention behind starting this blog is to try and share my knowledge with the SQL Server community. Any feedback or suggestions on the blog are both welcome and solicited. Please feel free to drop me a line on firstname.lastname@example.org anytime.
I know many of you get into situations where SQL Server fails over from one node of a cluster to the other, and you’re hard-pressed to find out why. In this post, I shall seek to answer quite a few questions about how to go about conducting a post-mortem analysis for a SQL Server cluster failover, aka a Cluster Failover RCA.
First up, since this is a post mortem analysis, we need all the logs we can get. Start by collecting the following:
Now that we have all the logs in place, then comes the analysis part. I’ve tried to list down the steps and most common scenarios here:
2012-07-13 06:39:45.22 Server SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required.
2012-07-13 06:39:48.04 spid14s The Database Mirroring protocol transport has stopped listening for connections.
2012-07-13 06:39:48.43 spid14s Service Broker manager has shut down.
2012-07-13 06:39:55.39 spid7s SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
2012-07-13 06:39:55.43 Server The SQL Server Network Interface library could not deregister the Service Principal Name (SPN) for the SQL Server service. Error: 0x6d3, state: 4. Administrator should deregister this SPN manually to avoid client authentication errors.

You can also use the systeminfo command from a command prompt to check when the node was last rebooted (look for “System Boot Time”), and see if this matches the time of the failover. If so, you need to investigate why the node rebooted, because SQL Server was just a victim in this case.
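The correlation above can be scripted. Here is a minimal sketch that pulls timestamped entries out of a SQL Server errorlog and flags those that fall close to the node's last boot time; the function name, window size, and sample boot time are my own illustrative choices, not part of any official tooling:

```python
from datetime import datetime, timedelta

def entries_near_boot(errorlog_lines, boot_time, window_minutes=10):
    """Return errorlog entries whose timestamp falls within window_minutes
    of the node's boot time -- a hint that the failover was caused by a
    node reboot rather than by SQL Server itself."""
    hits = []
    for line in errorlog_lines:
        # SQL Server errorlog lines start with 'YYYY-MM-DD hh:mm:ss.cc'
        try:
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # continuation line without a timestamp
        if abs(ts - boot_time) <= timedelta(minutes=window_minutes):
            hits.append(line)
    return hits

log = [
    "2012-07-13 06:39:45.22 Server SQL Server is terminating because of a system shutdown.",
    "2012-07-13 06:39:48.43 spid14s Service Broker manager has shut down.",
]
# Boot time as reported by systeminfo (hypothetical value for illustration)
boot = datetime(2012, 7, 13, 6, 41, 0)
print(len(entries_near_boot(log, boot)))  # → 2, both entries are within 10 minutes
```

If the shutdown messages cluster around the boot time, the investigation shifts from SQL Server to the OS and hardware.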
Information 7/29/2012 12:44:07 AM MSSQLSERVER 680 Server Error [8, 23, 2] occurred while attempting to drop allocation unit ID 423137010909184 belonging to worktable with partition ID 423137010909184.
Error 7/29/2012 12:44:07 AM MSSQLSERVER 823 Server The operating system returned error 2(The system cannot find the file specified.) to SQL Server during a read at offset 0x000001b6d70000 in file 'H:\MSSQL\Data\tempdata4.ndf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

And then, some time later, we see SQL Server shutting down in reaction to this:
Error 7/29/2012 12:44:17 AM MSSQLSERVER 3449 Server SQL Server must shut down in order to recover a database (database ID 2). The database is either a user database that could not be shut down or a system database. Restart SQL Server. If the database fails to recover after another startup, repair or restore the database.
Error 7/29/2012 12:44:17 AM MSSQLSERVER 3314 Server During undoing of a logged operation in database 'tempdb', an error occurred at log record ID (12411:7236:933). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
Error 7/29/2012 12:44:17 AM MSSQLSERVER 9001 Server The log for database 'tempdb' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

Another error that clearly points toward the disks being the culprit is this:
Error 7/29/2012 12:44:15 AM MSSQLSERVER 823 Server The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x00000000196000 in file 'S:\MSSQL\Data\tempdb.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

The next logical step, of course, would be to check why the disks became unavailable or inaccessible. I would strongly recommend having your disks checked for consistency, speed and stability by your vendor.
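When you have thousands of event-log entries, filtering for the storage-related error numbers first saves a lot of time. A rough sketch, assuming the event entries look like the excerpts above (823/824 are I/O errors and 9001 means the database log became unavailable; the parsing pattern is my own assumption about the exported log format):

```python
import re

# Error numbers that commonly indicate storage problems in a failover RCA
DISK_ERRORS = {"823", "824", "9001"}

def disk_related_events(event_lines):
    """Keep only event-log lines whose MSSQLSERVER event ID is one of the
    disk-related error numbers."""
    pattern = re.compile(r"MSSQLSERVER\s+(\d+)\s")
    return [line for line in event_lines
            if (m := pattern.search(line)) and m.group(1) in DISK_ERRORS]

events = [
    "Error 7/29/2012 12:44:15 AM MSSQLSERVER 823 Server The operating system returned error 21(The device is not ready.)",
    "Error 7/29/2012 12:44:17 AM MSSQLSERVER 3449 Server SQL Server must shut down in order to recover a database.",
]
print(len(disk_related_events(events)))  # → 1, only the 823 entry matches
```

If the filter surfaces 823/824/9001 errors just before the failover, the disks move to the top of the suspect list.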
There is no exhaustive guide to finding the root cause of a Cluster Failover, mainly because it is a matter of approach. I do, however, want to talk about a few cluster concepts here, which might help you understand the messages from the various logs better.
checkQueryProcessorAlive: Also known as the isAlive check in SQL Server, this executes “SELECT @@servername” against the SQL Server instance. It waits 60 seconds before running the query again, but checks every 5 seconds whether the service is alive by calling sqsrvresCheckServiceAlive. Both these values (60 seconds and 5 seconds) are the defaults and can be changed from the properties of the SQL Server resource in Failover Cluster Manager/Cluster Administrator. I understand that for SQL 2012, we’ve included some more comprehensive checks, such as running sp_server_diagnostics as part of this check, to ensure that SQL Server is in good health.
sqsrvresCheckServiceAlive: Also known as the looksAlive check in SQL Server, this checks the status of the SQL Server service and returns “Service is dead” if the status is not one of the following:
So if you see messages related to one of these checks failing in either the event logs or the cluster logs, you know that SQL Server was not exactly “available” at that time, which caused the failover. The next step, of course, would be to investigate why SQL Server was not available at that time. It can be due to a resource bottleneck such as high CPU or memory consumption, SQL Server being hung/stalled, etc.
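The two checks can be pictured as one polling loop: a cheap service-status check on the short interval, and the heavier query-based check on the long interval. This is only a conceptual sketch of that cadence, not the real resource DLL; the function names and return strings are illustrative, and the intervals are the defaults described above:

```python
def monitor(check_service_alive, run_isalive_query,
            looks_alive_interval=5, is_alive_interval=60, duration=60):
    """Simulate the resource DLL's polling cadence: looksAlive (service
    status) every looks_alive_interval seconds, isAlive (the
    SELECT @@servername query) every is_alive_interval seconds.
    Time is simulated, not slept, so the sketch runs instantly."""
    elapsed = 0
    while elapsed < duration:
        if not check_service_alive():          # looksAlive check
            return "Service is dead"
        if elapsed % is_alive_interval == 0:   # isAlive check on the longer interval
            if not run_isalive_query():
                return "isAlive check failed"
        elapsed += looks_alive_interval
    return "healthy"

# A healthy instance passes both checks for the whole window
print(monitor(lambda: True, lambda: True))     # → healthy
# A stopped service is caught by the cheap looksAlive check immediately
print(monitor(lambda: False, lambda: True))    # → Service is dead
```

The design point is that the cheap check runs often and the expensive one rarely, which is why a hung (but running) SQL Server can take up to a full isAlive interval to be detected.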
The base idea here, as with any post-mortem analysis, is to construct a logical series of events leading up to the failover, based on the data. If we can do that, then we have at least a clear indication on what caused the failover, and more importantly, how to avoid such a situation in the future.
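In practice, "constructing a logical series of events" means merging the errorlog, the Windows event logs, and the cluster log into a single chronological timeline. A minimal sketch, with timestamps loosely based on the excerpts above (the tuple layout and source names are my own convention):

```python
from datetime import datetime

def build_timeline(*sources):
    """Merge (timestamp, source, message) events from several logs into one
    chronological timeline -- the backbone of a failover post-mortem."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: event[0])

errorlog   = [(datetime(2012, 7, 29, 0, 44, 7),  "errorlog", "823 read error on tempdb file")]
eventlog   = [(datetime(2012, 7, 29, 0, 44, 17), "eventlog", "3449 SQL Server must shut down")]
clusterlog = [(datetime(2012, 7, 29, 0, 44, 30), "cluster",  "SQL Server resource failed")]

for ts, source, message in build_timeline(errorlog, eventlog, clusterlog):
    print(ts, source, message)
```

Here the ordering alone tells the story: the I/O error came first, SQL Server shut down in response, and only then did the cluster see the resource fail.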
If you’re still unable to determine anything about the cause of the failover, I would strongly recommend contacting Microsoft CSS to review the data once and see if they’re able to spot anything.
Hope this helps. As always, comments, feedback and suggestions are welcome.