I know many of you run into situations where SQL Server fails over from one node of a cluster to another, and you’re hard-pressed to find out why. In this post, I’ll walk through how to conduct a post-mortem analysis of a SQL Server cluster failover, aka a Cluster Failover RCA.

First up, since this is a post mortem analysis, we need all the logs we can get. Start by collecting the following:

  • SQL Server Errorlogs
  • The “Application” and “System” event logs, saved in txt or csv format (eases analysis)
  • The cluster log (see here and here for details on how to enable/collect cluster logs for Windows 2003 and 2008 respectively)

Now that we have all the logs in place, we can move on to the analysis. I’ve listed the steps and the most common scenarios below:

  1. Start with the SQL Errorlog. The Errorlog files in the SQL Server log folder can be viewed with notepad, textpad or any other text editor. The current file is named Errorlog, the previous one Errorlog.1, and so on. Check whether SQL Server was shut down normally. For example, the following entries denote a normal shutdown for SQL:

    2012-09-04 00:32:54.32 spid14s     Service Broker manager has shut down.
    2012-09-04 00:33:02.48 spid6s      SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
    2012-09-04 00:33:02.50 spid6s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
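For a quick first pass over many Errorlog files, you can scan for these shutdown markers programmatically. A minimal sketch, assuming the log has been read into a list of lines (the matched phrases are taken from the sample entries above):

```python
# Scan SQL Server Errorlog lines for signs of a clean (SCM-initiated) shutdown.
# The marker phrases come from the sample log entries above; extend as needed.
CLEAN_SHUTDOWN_MARKERS = [
    "terminating in response to a 'stop' request from Service Control Manager",
    "SQL Trace was stopped due to server shutdown",
]

def was_clean_shutdown(errorlog_lines):
    """Return True if any clean-shutdown marker appears in the log lines."""
    return any(marker in line
               for line in errorlog_lines
               for marker in CLEAN_SHUTDOWN_MARKERS)

lines = [
    "2012-09-04 00:32:54.32 spid14s Service Broker manager has shut down.",
    "2012-09-04 00:33:02.48 spid6s  SQL Server is terminating in response to a "
    "'stop' request from Service Control Manager.",
]
print(was_clean_shutdown(lines))  # True for the sample above
```

If this returns False for the Errorlog covering the failover window, the shutdown was not a normal stop, and the scenarios below become more likely.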
  2. You might see many situations where SQL Server failed over because of a system shutdown, i.e. the node itself rebooted. In that case, the entries at the bottom of the SQL Errorlog will look something like this:

     2012-07-13 06:39:45.22 Server      SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required.
    2012-07-13 06:39:48.04 spid14s     The Database Mirroring protocol transport has stopped listening for connections.
    2012-07-13 06:39:48.43 spid14s     Service Broker manager has shut down.
    2012-07-13 06:39:55.39 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
    2012-07-13 06:39:55.43 Server      The SQL Server Network Interface library could not deregister the Service Principal Name (SPN) for the SQL Server service. Error: 0x6d3, state: 4. Administrator should deregister this SPN manually to avoid client authentication errors.
    You can also use the systeminfo command from a command prompt to check when the node was last rebooted (look for “System Boot Time”), and see if this matches the time of the Failover. If so, then you need to investigate why the node rebooted, because SQL was just a victim in this case.
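The boot-time comparison can be done by hand, but here is a small sketch of the idea, assuming illustrative timestamps (the boot-time string mimics what `systeminfo` prints; the exact format varies with regional settings):

```python
from datetime import datetime, timedelta

# Compare the node's last boot time (from `systeminfo`, "System Boot Time")
# with the failover time taken from the SQL Errorlog. Both timestamps here
# are illustrative placeholders.
boot_time = datetime.strptime("7/13/2012, 6:41:02 AM", "%m/%d/%Y, %I:%M:%S %p")
failover_time = datetime.strptime("2012-07-13 06:39:45", "%Y-%m-%d %H:%M:%S")

# If the reboot happened within a few minutes of the failover, the node
# reboot is the likely trigger and SQL Server was just a victim.
if abs(boot_time - failover_time) <= timedelta(minutes=10):
    print("Failover coincides with a node reboot - investigate the reboot.")
```

The 10-minute window is an arbitrary threshold for this sketch; adjust it to how quickly your cluster completes a failover.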

  3. Next come the event logs. Look for peculiar signs in the application and system event logs that could have caused the failover. For example, one strange scenario I came across was when the disks hosting tempdb became inaccessible for some reason. In that case, I saw the following in the event logs:

    Information 7/29/2012 12:44:07 AM MSSQLSERVER 680 Server Error [8, 23, 2] occurred while attempting to drop allocation unit ID 423137010909184 belonging to worktable with partition ID 423137010909184.

    Error 7/29/2012 12:44:07 AM MSSQLSERVER 823 Server The operating system returned error 2(The system cannot find the file specified.) to SQL Server during a read at offset 0x000001b6d70000 in file 'H:\MSSQL\Data\tempdata4.ndf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
    And then some time later, we see SQL shutting down in reaction to this:

    Error 7/29/2012 12:44:17 AM MSSQLSERVER 3449 Server SQL Server must shut down in order to recover a database (database ID 2). The database is either a user database that could not be shut down or a system database. Restart SQL Server. If the database fails to recover after another startup, repair or restore the database.

    Error 7/29/2012 12:44:17 AM MSSQLSERVER 3314 Server During undoing of a logged operation in database 'tempdb', an error occurred at log record ID (12411:7236:933). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

    Error 7/29/2012 12:44:17 AM MSSQLSERVER 9001 Server The log for database 'tempdb' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
    Another error that clearly points toward the disks being a culprit is this:

    Error 7/29/2012 12:44:15 AM MSSQLSERVER 823 Server The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x00000000196000 in file 'S:\MSSQL\Data\tempdb.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
    The next logical step of course would be to check why the disks became unavailable/inaccessible. I would strongly recommend having your disks checked for consistency, speed and stability by your vendor. 
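If you export the event logs to CSV, filtering for these disk-related event IDs (823, 3449, 3314 and 9001, as seen in the entries above) is straightforward. A sketch, with rows mimicking exported event-log entries:

```python
# Flag event-log entries whose event IDs match the disk/tempdb errors shown
# above. Each row mimics an exported CSV entry: (level, time, source, id, text).
DISK_ERROR_IDS = {823, 3449, 3314, 9001}

events = [
    ("Error", "7/29/2012 12:44:07 AM", "MSSQLSERVER", 823,
     "The operating system returned error 2(...) to SQL Server during a read"),
    ("Error", "7/29/2012 12:44:17 AM", "MSSQLSERVER", 9001,
     "The log for database 'tempdb' is not available."),
    ("Information", "7/29/2012 12:45:30 AM", "MSSQLSERVER", 17126,
     "SQL Server is now ready for client connections."),
]

suspects = [e for e in events if e[3] in DISK_ERROR_IDS]
for level, when, source, event_id, text in suspects:
    print(event_id, when, text)
```

Sorting the flagged entries by timestamp usually gives you the first disk error, which is the one worth chasing with your storage vendor.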

  4. If the previous steps don’t give you any clues, take a look at the cluster log as well. Do note that the Windows cluster logs are always recorded in the GMT/UTC time zone, so you’ll need to convert your local failover time to UTC to determine which part of the cluster log to focus on. See if you can find anything which could have caused the cluster group to fail, such as the network being unavailable, failure of the IP address or Network Name resources, etc.
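The UTC conversion is a common stumbling block, so here is a short sketch. The UTC+5:30 offset and the timestamp are placeholders; use your node’s actual offset and failover time:

```python
from datetime import datetime, timedelta, timezone

# Cluster logs are written in UTC. Convert the local failover time to UTC
# to find the right section of the log. The offset below assumes a
# UTC+5:30 node purely for illustration.
local_tz = timezone(timedelta(hours=5, minutes=30))
failover_local = datetime(2012, 7, 13, 6, 39, 45, tzinfo=local_tz)
failover_utc = failover_local.astimezone(timezone.utc)

print(failover_utc.strftime("%Y/%m/%d-%H:%M:%S"))  # 2012/07/13-01:09:45
```

The timestamp printed is the one to search for in the cluster log around the failover.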

There is no exhaustive guide to finding the root cause of a cluster failover, mainly because it is as much about approach as about checklists. I do, however, want to cover a few cluster concepts here, which might help you better understand the messages from the various logs.

checkQueryProcessorAlive: Also known as the IsAlive check in SQL Server, this executes “SELECT @@servername” against the SQL Server instance. It waits 60 seconds before running the query again, but checks every 5 seconds whether the service is alive by calling sqsrvresCheckServiceAlive. Both of these values (60 seconds and 5 seconds) are defaults and can be changed from the properties of the SQL Server resource in Failover Cluster Manager/Cluster Administrator. I understand that for SQL Server 2012, some more comprehensive checks were added, such as running sp_server_diagnostics as part of this check, to ensure that SQL Server is in good health.

sqsrvresCheckServiceAlive: Also known as the LooksAlive check in SQL Server, this checks the status of the SQL Server service and returns “Service is dead” if the status is not one of the following:

  • SERVICE_RUNNING
  • SERVICE_START_PENDING
  • SERVICE_PAUSED
  • SERVICE_PAUSE_PENDING

So if you see messages related to one of these checks failing in either the event logs or the cluster logs, you know that SQL Server was not exactly “available” at that time, which caused the failover. The next step, of course, is to investigate why SQL Server was unavailable at that time. It could be a resource bottleneck such as high CPU or memory consumption, a hung or stalled SQL Server, etc.
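The interplay between the two checks can be sketched as follows. The intervals, the healthy-status set, and the query come from the description above; the service and query probes are stubbed out, since the real checks live inside the cluster resource DLL:

```python
# Sketch of the cluster resource DLL's health checks: a cheap LooksAlive
# status check every 5 seconds, and a deeper IsAlive query check every
# 60 seconds. Probes are stubs; the real checks run inside the resource DLL.
HEALTHY_STATES = {
    "SERVICE_RUNNING",
    "SERVICE_START_PENDING",
    "SERVICE_PAUSED",
    "SERVICE_PAUSE_PENDING",
}

def looks_alive(service_state):
    """sqsrvresCheckServiceAlive: dead unless the service state is healthy."""
    return service_state in HEALTHY_STATES

def is_alive(run_query):
    """checkQueryProcessorAlive: succeeds if SELECT @@servername returns."""
    try:
        return run_query("SELECT @@servername") is not None
    except Exception:
        return False

# Simulate one minute of checks: LooksAlive at t=5,10,...,60; IsAlive at t=60.
for t in range(5, 61, 5):
    state = "SERVICE_RUNNING"              # stub: would query the SCM
    if not looks_alive(state):
        print(f"t={t}s: Service is dead -> failover")
    if t % 60 == 0 and not is_alive(lambda q: "NODE1\\INST1"):
        print(f"t={t}s: IsAlive failed -> failover")
```

In this simulation both checks pass, so nothing is printed; a hung instance would fail the IsAlive probe while LooksAlive might still pass, which is exactly why the deeper check exists.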

The basic idea here, as with any post-mortem analysis, is to construct a logical sequence of events leading up to the failover, based on the data. If we can do that, then we have a clear indication of what caused the failover, and, more importantly, of how to avoid such a situation in the future.

If you’re still unable to determine anything about the cause of the failover, I would strongly recommend contacting Microsoft CSS to review the data once and see if they’re able to spot anything.

Hope this helps. As always, comments, feedback and suggestions are welcome.