Recently I encountered a case wherein clustered instance of sql server 2000 fails to come online on other node of a 2 Node Cluster. 

Usually for such cases we need to start troubleshooting by checking the Error log of the sql server from the server where it fails to come online. The Sql Server Error log can give you the cause if the sql server service itself is not able to start.

In our case we observed that the sql server error log was clean and the sql server was able to start without any errors.

So it appears that cluster service was not able to communicate with the sql server and made an assumption that sql server service could not be started.  Cluster service tries to connect to the sql server until it gets timed out after which cluster service marks the sql resources as failed.

The next obvious step in this cases would be to check the Application event log. We checked the Application Event log and found the following error reported.

Event Type:    Error
Event Source:    MSSQL$TEST
Event Category:    (3)
Event ID:    17052
Date:        1/11/2009
Time:        1:26:01 AM
User:        N/A
Computer:    Node02
Description:
[sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = 11; message = [Microsoft][ODBC SQL Server Driver][TCP/IP Sockets]SQL Server does not exist or access denied.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 9c 42 00 40 01 00 00 00   ?B.@....
0008: 0a 00 00 00 4b 00 44 00   ....K.D.
0010: 43 00 44 00 53 00 51 00   C.D.S.Q.
0018: 4c 00 31 00 34 00 00 00   L.1.4...
0020: 00 00 00 00               ....   

Event Type:    Error
Event Source:    MSSQL$TEST
Event Category:    (3)
Event ID:    17052
Date:        1/11/2009
Time:        1:26:01 AM
User:        N/A
Computer:    Node02
Description:
[sqsrvres] checkODBCConnectError: sqlstate = 01000; native error = 274d; message = [Microsoft][ODBC SQL Server Driver][TCP/IP Sockets]ConnectionOpen (Connect()).

From the above Error we observe that Either Sql server is listening on wrong port or cluster service is trying to connect to sql on wrong port.

We checked the following location of the registry and found that sql server was listening on port 2068.

HKLM/SOFTWARE/MICROSOFT/MICROSOFT SQL SERVER/TEST/MSSQLSERVER/SuperSocketNetLib/TCP    

We also confirmed by checking the sql server error log which states as following

2009-01-11 01:28:44.07 server    SQL server listening on 10.10.0.194: 2068

We checked the port on the working Node1 and it was the same i.e 2068

So it appears that cluster was trying to connect to sql server to wrong port.

We checked the Client Network Config Utility and found that there was TCP/IP Alias created  for the sql server instance SQLCLUSTER\TEST on port 1433 while sql was listening on port 2068.

Cause
=====

Whenever we have alias created to connect to the sql server instance using tcp/ip or np protocol on clustered instance of sql server. Cluster service will use only the alias to connect to the sql instance and so if the alias has wrong definition, the cluster service will fail to connect to sql server instance and thereby will fail to bring the sql resource online.

Resolution
=======

So in our case we checked on the box "Dynamically determine port" in the client network utility and removed the static port address of 1433 in the TCP/IP alias and the sql resource came online fine on Node 2 without any errors.

 

Parikshit Savjani,
SE, Microsoft SQL Server