Unable to correct invalid SQL Server Network Configuration on clustered SQL Server causes clustered SQL Server fail to start “permanently”

Unable to correct invalid SQL Server Network Configuration on clustered SQL Server causes clustered SQL Server fail to start “permanently”

  • Comments 14

Recently, we encountered an issue reported by a customer on a clustered SQL Server. What happens is the following.

 

The Customer needs to restrict the number of IP addresses SQL 2005 Server is listening. He uses the following procedure for that: open “SQL Server Configuration Manager” (SSCM)  -> “SQL Server 2005 Network Configuration” -> “Protocols for XX” , right clicked TCP/IP choosing Protocols and set “Listen All” to “NO”. After that he wants to set the proper IP-addresses. This is working fine on a non-clustered system. On a clustered system the following happens when he is restarting SQL Server: SQL Server does not come online anymore. And the SQL Server error log reports the following.

 

SQL Server 2005 errorlog

2005-11-28 10:53:29.02 Server      Error: 17182, Severity: 16, State: 1.

2005-11-28 10:53:29.02 Server      TDSSNIClient initialization failed with error 0x32, status code 0x1c.

2005-11-28 10:53:29.02 Server      Error: 17182, Severity: 16, State: 1.

2005-11-28 10:53:29.02 Server      TDSSNIClient initialization failed with error 0x32, status code 0x1.

2005-11-28 10:53:29.03 Server      Error: 17826, Severity: 18, State: 3.

2005-11-28 10:53:29.03 Server      Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.

2005-11-28 10:53:29.03 Server      Error: 17120, Severity: 16, State: 1.

2005-11-28 10:53:29.03 Server      SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

 

 

Configuring the server to listening on individual IP-addresses is not supported on Clustered SQL Server, thus the failure indicated by the error log is accurate. However, the customer can NOT set “Listen ALL” to “Yes” any more to bring the server online. To be more accurate, even though the customer can set “Listen ALL” to “Yes” using SSCM on each physical node, each time he try to bring SQL Server online, the value will be overwritten to “NO” and the clustered SQL Server will end up failing. This is a serious issue that causes frustration of our customers.

 

The root cause of this issue is because of cluster checkpoint service behavior. If a setting is changed while the resource is online, that change will get check-pointed to the CPT hive file in the cluster quorum disk. If the resource is offline while you make the parameter change, then it will never be check-pointed. Each time you bring up the resources, the check-pointed value will overwrite the local value. SQL Server network configuration is one resource that is check-pointed. So, if you put the SQL Server resource offline while you change the “Listen All” from “NO” to “YES”, and then you try to put the resource back online, it will fail as the "local" change was overwritten (during resource startup) with what was persisted in the checkpoint file.

 

Because of this check-pointing behavior, any time that SQL Server network configurations are modified into invalid values while the server is online, restarting the clustered SQL Server will cause the server fail to start “permanently”.

 

To get out of such BAD state, one workaround is to disable the check-pointing for SQL Server network configuration, described as following.

 

1. While SQL Server instance is in offline/failed state, disable cluster checkpointing for network configuration by:

      cluster res "SQL Server" /removecheck:”Software\Microsoft\Microsoft SQL Server\MSSQL.XXX\MSSQLSERVER”

2. Correct the configuration by using SSCM. Verified the key was corrected on both nodes.

3. Bring SQL cluster back online.

4. Re-enabled cluster checkpointing for network configuration by:

      cluster res "SQL Server" /addcheck: ”Software\Microsoft\Microsoft SQL Server\MSSQL.XXX\MSSQLSERVER”

 

Note that, for named instance, the resource display name "SQL Server" should be replaced with "SQL Server (<instance name>)".

 

If the workaround does not resolve the issue described as above in your case, please let us know.

 

Do you know that you can post question w.r.t SQL Server data access, connectivty issues at http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=87&SiteID=1?

 

Nan Tu, Software Design Engineer, SQL Protocols 

 

Disclaimer: This posting is provided "AS IS" with no warranties, and confers no rights

Leave a Comment
  • Please add 8 and 6 and type the answer here:
  • Post
  • there is a similar story in sql 2000

    a resource kit tool clusterrecovery.exe can do the same thing (refresh the registry checkpoint) with a GUI, without having to know the registry keys
  • I am trying to connect to a server which is installed with SQL server 2000, using SQL server 2005 client. I am getting the following error

    When connecting to SQL server 2005, his failure may be caused by the fact that under the default settingsSQL server does not allow remote connections. (Provider: Named pipes Provider, error: 40 - could not open a connection to SQL server)(Microsoft SQL server, Error:53)
  • Manik,

    This is a connectivity issue not pertain only to Cluster Scenario. In your case, you need to check:

    (1) Remote connectivity is enabled. If not, you can use SQL Server Surface Area Configuration to enable that.

    (2) Make sure that the listening port of SQL Server is not blocked by Windows Firewall
  • When connecting to SQL server 2005, his failure may be caused by the fact that under the default settingsSQL server does not allow remote connections. (Provider: Named pipes Provider, error: 40 - could not open a connection to SQL server)(Microsoft SQL server, Error:53)
  • I get the error message above.

    I have made the exception in the firewall for both Database engine and the Browser processes.  I configured Surface Area config to allow remote with TCP and Named Pipes.  I turned on the service "Browser".  I still get the error above.

  • Please try to answer questions in

    http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=362498&SiteID=1,

    If it doesn't solve your problem, please post your question on the forum. It'll be answered in a timely manner.

    Thanks,

    Nan

  • Thanks for your blog Nan. I was expecting to have to make a call but was saved hours of heartache by being able to stop the cluster checkpoint. I found your explanation doing a search on the network error.

    May I suggest that KB docs 823938 and 265808 be updated to warn folks like me that do not realize static ports are not supported on a cluster? I struggled a bit with the syntax but eventually got it. Fortunately, it was not a production machine.

    Ralph

  • Hi all,

    I did use this but got error as below:

    C:\>cluster res "SQL Server" /removecheck:"Software\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLSERVER"

    The 'removecheck' option does not require any parameters to be specified.

    See "CLUSTER RESOURCE /?" for correct syntax.

    So, I open a case with support team and the soluction was:

    1.With SQL offline on cluster

    C:\>cluster res "SQL Server" /checkpoints

    --- List all checkpoint to SQL Server

    2. Run /removecheckpoints

    C:\>cluster res "SQL Server" /removecheckpoints:"Software\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer"

    3.Correct the configuration by using SSCM.

    4.run /addcheckpoints

    C:\>cluster res "SQL Server" /addcheckpoints:"Software\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer"

    5. Bring SQL cluster back online.

    Yes....it´s ok now !!

    regards

    Nilton Pinheiro

  • Worked like a champ.  Thanks for the post - we looked everywhere trying to fix this.

  • Thanks heaps, worked perfectly - just the commands in the original post weren't right but Nilton Pinheiro's correctios were.  It's /removecheckpoints not /removecheck and /addcheckpoints not /addcheck

  • Nilton, you save me from a new support call... txs!

  • This post just saved me from insanity, thank you so much!

  • Thanks Nan and Nilton, worked perfectly

  • In my case (this took some time to figure out) SQL 2005 32 bit on Win 2008 64 bit, you need to remove/add the WOW6432 checkpoint.

    Software\Wow6432Node\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLSERVER

    ...and as to the conflicting final 2 steps:

    Nan Tu 3) start up cluster 4) /addcheckpoint

    vs. Nilton 4)  /addcheckpoint 5) bring cluster back online

    ...I went with Nan Tu's approach and that worked for me.

    Thanks!

Page 1 of 1 (14 items)