Using Windows Cluster log to troubleshoot SQL Server cluster errors

There are times where all the SQL Server knowledge in the world won't help us to find a solution to our cluster problem. The following post will show you how to use the Microsoft Windows cluster log file to find the root-cause of a common SQL Server issue (it does not matter what SQL Server version we use for the purpose of this post).

In this case the customer called to Microsoft PSS due to a problem in which a SQL Server 2000 clustered instance was not starting up for an unknown reason. Looking at the customer Cluster Administrator, this was the situation of the SQL Server 2000 cluster group, called "SHILOH":

 

We tried to diagnose what was wrong using the SQL Server ERRORLOG file, but we could not find any useful information there. After a few seconds online, the SQL Server service was stopped by Service Control Manager for no apparent reason:

2008-07-27 17:28:04.51 server Copyright (C) 1988-2002 Microsoft Corporation.
2008-07-27 17:28:04.51 server All rights reserved.
2008-07-27 17:28:04.51 server Server Process ID is 304.
2008-07-27 17:28:04.51 server Logging SQL Server messages in file 'S:\Program Files\Microsoft SQL Server\MSSQL$SHILOH\log\ERRORLOG'.
2008-07-27 17:28:04.53 server SQL Server is starting at priority class 'normal'(1 CPU detected).
2008-07-27 17:28:04.55 server SQL Server configured for thread mode processing.
2008-07-27 17:28:04.56 server Using dynamic lock allocation. [2500] Lock Blocks, [5000] Lock Owner Blocks.
2008-07-27 17:28:04.57 server Attempting to initialize Distributed Transaction Coordinator.
2008-07-27 17:28:18.76 spid3 Starting up database 'master'.
2008-07-27 17:28:19.30 server Using 'SSNETLIB.DLL' version '8.0.2039'.
2008-07-27 17:28:19.32 server SQL server listening on 192.168.0.60: 1131.
2008-07-27 17:28:19.32 spid3 Server name is 'SHILOH\SHILOH'.
2008-07-27 17:28:19.32 spid5 Starting up database 'model'.
2008-07-27 17:28:19.33 spid8 Starting up database 'msdb'.
2008-07-27 17:28:19.33 spid9 Starting up database 'pubs'.
2008-07-27 17:28:19.34 server SQL server listening on TCP, Shared Memory, Named Pipes.
2008-07-27 17:28:19.34 server SQL Server is ready for client connections
2008-07-27 17:28:19.34 spid10 Starting up database 'Northwind'.
2008-07-27 17:28:20.30 spid5 Clearing tempdb database.
2008-07-27 17:28:21.49 spid5 Starting up database 'tempdb'.
2008-07-27 17:28:21.87 spid3 Recovery complete.
2008-07-27 17:28:21.87 spid3 SQL global counter collection task is created.
2008-07-27 17:28:42.29 spid3 SQL Server is terminating due to 'stop' request from Service Control Manager.

 

Application Event log was full of the following error, but the cryptic message description wasn't of any help to us:

Event Type: Error
Event Source: MSSQL$SHILOH
Event Category: (3)
Event ID: 19019
Date: 7/27/2008
Time: 5:28:19 PM
User: N/A
Computer: JORGEPC-CN1
Description:
The description for Event ID ( 19019 ) in Source ( MSSQL$SHILOH ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: [sqsrvres] ODBC sqldriverconnect failed
.
Data:
0000: 4b 4a 00 40 01 00 00 00 KJ.@....
0008: 07 00 00 00 53 00 48 00 ....S.H.
0010: 49 00 4c 00 4f 00 48 00 I.L.O.H.
0018: 00 00 00 00 00 00 ...... 

 

System Event log did show a very generic error about Cluster service failing to bring SQL Server service online:

Event Type: Error
Event Source: ClusSvc
Event Category: Failover Mgr
Event ID: 1069
Date: 7/27/2008
Time: 5:27:38 PM
User: N/A
Computer: JORGEPC-CN2
Description:
Cluster resource 'SQL Server (SHILOH)' in Resource Group 'SHILOH' failed.

For more information, see Help and Support Center at https://go.microsoft.com/fwlink/events.asp.

 

It was clear that something related to the Cluster service was causing the SQL Server service to go offline.

Windows Server 2003 includes two useful cluster log files where all cluster activity is logged on: cluster.log and cluster.oml. These two files can be located into the C:\Windows\Cluster folder. cluster.log is the file where all the cluster activity is collected in a very verbose way, while cluster.oml is where we can find mappings between globally unique identifiers (GUIDs) for each cluster resource and resource names (a kind of Rosetta stone to understand cluster.log file). You can find more information about these and other cluster log files at the Microsoft TechNet article How a Server Cluster Works.

At first sight cluster.log can look very intimidating but we just need to take the specific time where the SQL Server event happened and search for that time on the cluster log. Be aware that cluster.log files are stored in GMT time while Windows Event Logs are stored at local time, depending on your time zone you will have to add/subtract one or more hours. In my case I am located on GMT+2 so I know I I have to look for events logged on cluster.log two hours in advance.

I knew the first Application Event log took place around 17:28 so I did look for anything abnormal around 15h28m:

0000067c.00000870::2008/07/27-15:28:07.524 INFO [Qfs] QfsCloseHandle 424, status 0
0000067c.00000870::2008/07/27-15:28:07.524 INFO [Qfs] QfsDeleteFile C:\DOCUME~1\CLUADMIN\LOCALS~1\Temp\CLSA8.tmp, status 0
00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] ODBC sqldriverconnect failed
00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] checkODBCConnectError: sqlstate = 28000; native error = 4818; message = [Microsoft][SQL Native Client][SQL Server]Login failed for user 'CONTOSO\CLUADMIN'.
00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] ODBC sqldriverconnect failed
00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] checkODBCConnectError: sqlstate = 08S01; native error = e9; message = [Microsoft][SQL Native Client]Named Pipes Provider: No process is on the other end of the pipe.

00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] ODBC sqldriverconnect failed
00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] checkODBCConnectError: sqlstate = 08S01; native error = e9; message = [Microsoft][SQL Native Client]Communication link failure
00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] ODBC sqldriverconnect failed
00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] checkODBCConnectError: sqlstate = 08S01; native error = e9; message = [Microsoft][SQL Native Client]Named Pipes Provider: No process is on the other end of the pipe.

00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] ODBC sqldriverconnect failed
00000718.00000ad4::2008/07/27-15:28:19.668 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] checkODBCConnectError: sqlstate = 08S01; native error = e9; message = [Microsoft][SQL Native Client]Communication link failure
00000718.00000ad4::2008/07/27-15:28:19.678 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] ODBC sqldriverconnect failed
00000718.00000ad4::2008/07/27-15:28:19.678 ERR SQL Server <SQL Server (SHILOH)>: [sqsrvres] checkODBCConnectError: sqlstate = 08S01; native error = e9; message = [Microsoft][SQL Native Client]Named Pipes Provider: No process is on the other end of the pipe.

There were plenty of errors (code ERR) on the cluster.log file, but usually the first ones are the more interesting to look for:

(SHILOH)>: [sqsrvres] checkODBCConnectError: sqlstate = 28000; native error = 4818; message = [Microsoft][SQL Native Client][SQL Server]Login failed for user 'CONTOSO\CLUADMIN'.

In our case CONTOSO\CLUADMIN was the Cluster service account use by Windows Cluster. As you probably know, this account needs to have access to SQL Server in order to perform the IsAlive and LooksAlive poll intervals to detect a resource failure.

In this case the problem was that the BUILTIN\Administrators SQL Server account has been removed as part of a Company security guideline. In a default installation, removing this group from the SQL Server logins effectively removes the ability for the Cluster service to log into SQL Server and perform the IsAlive and LooksAlive checks. This is a fairly well known issue discussed on Knowledge Base article KB291255.

In other cases you will see events in cluster.log related to some "cryptic" resources, like the following ones:

0000067c.00000f5c::2008/07/27-15:27:40.070 INFO [FM] FmpOnlineResourceList: trying to bring resource c7018774-2673-4350-ad88-196e4aca3f95 online
0000067c.00000f5c::2008/07/27-15:27:40.070 INFO [FM] OnlineResource: c7018774-2673-4350-ad88-196e4aca3f95 depends on 582210d0-c29d-40a5-af8a-bbbc08c3e08d. Bring online first.

cluster.log does not usually show the resource name but the resource GUI (like "resource ce12eb0d-a9d0-4e4a-8531-f3d70cad8c6c"). By using the cluster.oml file we can translate the GUID string to a specific cluster resource:

00000644.000006b0::2008/05/15-11:41:40.621 OBRENAME "Resource" "c7018774-2673-4350-ad88-196e4aca3f95" "SQL Server Fulltext (SHILOH)"

00000664.000006b8::2008/05/15-11:36:29.428 OBRENAME "Resource" "582210d0-c29d-40a5-af8a-bbbc08c3e08d" "SQL Server (SHILOH)"

In this cluster.log sample section Cluster service is reporting that it needs to start SQL Server service first in order to bring Full Text Search service online.

Additional Information

To make your sysadmin life easier, Microsoft has released a utility to simplify cluster.log analysis called Cluster Diagnostics and Verification Tool (ClusDiag.exe). This tool can be used to read in a more user-friendly way cluster log files from Windows Server 2003 and 2000 clusters. Here you can see a screen capture of this tool in action:

 

In Windows Server 2008 cluster.log file is not present in Cluster folder anymore. Even though you can still access cluster.log file on a Windows Server 2008 cluster; please refer to the Overview of Failover Clustering with Windows Server 2008 document in this Microsoft web page for more information about the changes incorporated into Windows Server 2008 cluster. I have extracted here the relevant part of the text:

The text-file-based cluster log is also gone. Event trace logging (.etl) is now enabled via Event Tracing for Windows (ETW). Default log sizes vary and can be modified with cluster installs logs (Operational and ClusterLog). There is new functionality built into the command line. The cluster.exe tool allows you to dump the trace log into a text file. This file looks similar to the cluster log used in previous versions of failover clustering. Use the cluster.exe Log /Generate command to see this log.