A recent issue I saw was that the SQl service would not come on line on an x64 cluster. my prefered method of looking at cluster failovers / failure to start is to first look at the errorlog.This is to a large part to get aquainted with the server and see if anything is apparent from there. Clearly if there's no errorlog to look through we are a little stuck. But on this occasion there was. We could see the following
The bottom of the errorlog looked like this
2009-01-24 07:09:08.21 spid19s Service Broker manager has shut down.2009-01-24 07:09:08.49 spid11s SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.2009-01-24 07:09:08.49 spid11s SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
In addition I couldn't see any nasty looking errors or stack dumps so the conclusion here is that for some reason the cluster service could not connect to the SQL instance to run its IsAlive check.
So at this point its time to change attack and look at the application event log or cluster log, you can take your pick which you choose but I generally start with the application event log, I then search for the following string [sqsrvres]So why do I do that, well sqsrvres.dll is the name of the resource dll for the SQL Server resource, when it logs errors this string will be somewhere in the line.
I find the following in the application event log
20/01/2009 23:31:20 MSSQL$SQL1 Error (3) 19019 N/A TESTSQL2 "[sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = ffffffff; message = [Microsoft][SQL Native Client]SQL Network Interfaces: Error getting enabled protocols list from registry [xFFFFFFFF]. "20/01/2009 23:31:20 MSSQL$SQL1 Error (3) 19019 N/A TESTSQL2 "[sqsrvres] ODBC sqldriverconnect failed"20/01/2009 23:31:00 MSSQL$SQL1 Error (3) 19019 N/A TESTSQL2 "[sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = ffffffff; message = [Microsoft][SQL Native Client]An error has occurred while establishing a connection to the server. When connecting to SQL Server 2005, this failure may be caused by the fact that under the default settings SQL Server does not allow remote connections."20/01/2009 23:31:00 MSSQL$SQL1 Error (3) 19019 N/A TESTSQL2 "[sqsrvres] ODBC sqldriverconnect failed"20/01/2009 23:31:00 MSSQL$SQL1 Error (3) 19019 N/A TESTSQL2 "[sqsrvres] checkODBCConnectError: sqlstate = HYT00; native error = 0; message = [Microsoft][SQL Native Client]Login timeout expired"20/01/2009 23:31:00 MSSQL$SQL1 Error (3) 19019 N/A TESTSQL2 "[sqsrvres] ODBC sqldriverconnect failed"20/01/2009 23:31:00 MSSQL$SQL1 Error (3) 19019 N/A TESTSQL2 "[sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = ffffffff; message = [Microsoft][SQL Native Client]SQL Network Interfaces: Error getting enabled protocols list from registry [xFFFFFFFF]. "20/01/2009 23:31:00 MSSQL$SQL1 Error (3) 19019 N/A TESTSQL2 "[sqsrvres] ODBC sqldriverconnect failed"
So this looks nasty, checking the tip that we might not allow remote connections, this is not the cause.But there is also some suggestion of inpropriety in the registry. Additionally from the client side we may have an issue with the browser service causing us not to be able to find the SQL instance.
Firther down in the registry we can see the following
20/01/2009 23:30:34 SQLBrowser Information None 12 N/A TESTSQL2 The SQLBrowser service has started.20/01/2009 23:30:34 SQLBrowser Information None 16 N/A TESTSQL2 The SQLBrowser is enabling SQL instance and connectivity discovery support.20/01/2009 23:30:34 SQLBrowser Warning None 3 N/A TESTSQL2 The configuration of the AdminConnection\TCP protocol in the SQL instance WEB1 is not valid.
Now I had seen a couple of previous cases where the above was observed, and this is what we found.As you may know part of the registry is replicated for the SQL resource so that each node is kept up to date. However on the impacted cluster we have the following
[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1] [HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1\Cluster] [HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1\CPE] [HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLSERVER][HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1\PROVIDERS][HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1\Replication] [HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1\SQLserverAgent] [HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1\SQLServerSCP]
Now the significant thing here is the WOW6432 part, these keys are junk keys, as they exist (as expected) in the normal part of the registry, the WOW6443Node is for compatability. So the simple solution is to remove these keys (after backing up the registry etc etc....)
Once that is done and SQL and the browser service restarted things should be working a lot better again