• SQL Server Cast

    “OnlineThread: SQL Cluster shared data upgrade failed” error (ID 19019) installing Cumulative Update in SQL Server cluster

    • 5 Comments

    Last week I was working with a customer on this obscure SQL Server error right after installing Cumulative Update 9 (CU9) on a SQL Server 2008 two-node cluster. As you will see, the root-cause of the problem was easy to understand, but getting to the point of finding this root-cause was not.

    In this case the customer was running a virtual instance of SQL Server 2008 SP1; this instance was running with no errors and moving back and forth the clustered resources was not causing any problem. The customer was installing CU9 to avoid a database issue that was affecting the performance of a CRM system. Following best practices in SQL Server 2008/R2, she was installing the CU in the passive node first, in order to minimize the amount of downtime in the production CRM system. The problem was that trying to start the SQL Server service on the updated cluster node was failing immediately.

    We looked into the Windows Event log and Cluster log but the information from these two sources did not provide any clue:

    Application Event log:

    10/09/2010 08:08:15 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3) 
    10/09/2010 08:08:11 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3) 
    10/09/2010 08:08:10 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3) 
    10/09/2010 07:56:26 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3) 
    10/09/2010 07:55:26 PM  Error         NODE1 19019   MSSQL$SQL2008                    Failover        N/A                                [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3)
     

    Cluster log:

    00000d78.000010dc::2010/09/14-16:50:34.060 ERR   [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed                            (status 0, Worker retval = 3)
    00000d78.000010dc::2010/09/14-16:50:34.060 ERR   [RHS] Online for resource SQL Server (SQL2008) failed.
    000009dc.000011cc::2010/09/14-16:50:34.060 INFO  [RCM] HandleMonitorReply: ONLINERESOURCE for 'SQL Server (SQL2008)', gen(0) result 5018.
    000009dc.000011cc::2010/09/14-16:50:34.060 INFO  [RCM] TransitionToState(SQL Server (SQL2008)) OnlinePending-->ProcessingFailure.
    000009dc.00001168::2010/09/14-16:50:34.060 ERR   [RCM] rcm::RcmResource::HandleFailure: (SQL Server (SQL2008))
    000009dc.00001168::2010/09/14-16:50:34.060 INFO  [RCM] resource SQL Server (SQL2008): failure count: 1, restartAction: 2.

    The error was preventing SQL Server to create and write any ERRORLOG file so we couldn’t rely in this log either. In these kind of situations Sysinternals Process Monitor (Procmon) is usually a good option so we run this utility while reproducing the service start failure. After capturing Procmon, we filtered the trace by the rhs.exe process which is the Resource Health Check process responsible of loading SQSRVRES.DLL running the IsAlive check. Through this test we could see this DLL was detecting that the SQL Server component required an updated because the PatchLevel and SharedDataPatchLevel on Windows Registry were different. The error was taking place during this update:

    image

    So Procmon was useful telling us what was failing but not telling us why it was failing. Working with my colleague João Loureiro I learnt that by modifying the following registry key to “1” (one), the Windows cluster.log file could provide a more verbose output:

    HKLM\Cluster\Resources\<ID for SQL Server Instance>\Parameters\VerboseLogging

    After modifying this key, we failed the updated node again and check one more time to the cluster.log file, this time the information in the file was the following:

    000014c8.00001eb0::2010/10/06-16:34:54.991 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: enter; VirtualServerName=NETNAME; InstanceName=SQL2008
    000014c8.00001eb0::2010/10/06-16:34:54.991 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: ServiceName=MSSQL$SQL2008
    000014c8.00001eb0::2010/10/06-16:34:55.022 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] ClusterResourceControl, found the network name: NETNAME.
    000014c8.00001eb0::2010/10/06-16:34:55.022 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: ServerName=NETNAME\SQL2008
    000014c8.00001eb0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Calling SQLClusterResourceWorker::WaitForCompletion (200)
    000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Worker thread starting ...
    000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Entering SQLClusterSharedDataUpgradeWorker thread.
    000014c8.000017f0::2010/10/06-16:34:55.054
    INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): SqlDataRoot = 'R:\MSSQL10.SQL2008\MSSQL',         CurrentPatchLevel = '10.1.2789.0',         SharedDataPatchLevel = '10.1.2531.0'
    000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): FTDataPath = 'R:\MSSQL10.SQL2008\MSSQL\FTData',         SQLGroup = 'S-1-5-80-2586167408-2258694634-1203600018-1002233561-722790735'
    000014c8.000017f0::2010/10/06-16:34:55.054 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0):
    Entering DoSQLDataRootApplyACL (R:\MSSQL10.SQL2008\MSSQL\Data).
    000014c8.000017f0::2010/10/06-16:34:55.054 WARN  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): DoSQLDataRootApplyACL : Failed to create directory tree at SQLDataRoot.
    000014c8.000017f0::2010/10/06-16:34:55.069 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Exiting DoSQLDataRootApplyACL (3).
    000014c8.000017f0::2010/10/06-16:34:55.069 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Exiting SQLClusterSharedDataUpgradeWorker thread (3).
    000014c8.000017f0::2010/10/06-16:34:55.069 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Worker thread exiting (retval = 3) ...
    000014c8.00001eb0::2010/10/06-16:34:55.069 INFO  [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Calling SQLClusterResourceWorker::WaitForCompletion (4294967295)
    000014c8.00001eb0::2010/10/06-16:34:55.069 ERR   [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed       (status 0, Worker retval = 3)
    000014c8.00001eb0::2010/10/06-16:34:55.069 ERR   [RHS] Online for resource SQL Server (SQL2008) failed.

    Here we could see much more detailed information. After reviewing this log with the help of the customer we did find that the R: drive was indeed non-existent on the cluster. Only then we were able to understand why the SQLClusterSharedDataUpgradeWorker thread was failing. Checking the following registry key:

    HKLM\Software\Microsoft\Microsoft SQL Server\MSSQL10.SQL2008\Setup\SQLDataRoot

    we found drive R: listed instead of the actual SQL Server Data root drive. From this point fixing the problem was a matter of changing this registry key one cluster node at a time right after installing the required CU. It is clear than the problem was not related to this CU9 specifically, using any other CU or SP would have probably lead to a similar situation.

    Neither the customer nor us were able to understand how this path could have been changed. We can only imagine some ALTER DATABASE statement was run at certain point, leaving a wrong registry entry for the SQLDataRoot key.

  • SQL Server Cast

    Unexpected Behavior Setting Up Firewall Rules with Clustered SQL Server Instances

    If you have ever try to setup firewall rules in your network to manage access to SQL Server you have probably found a somehow weird behavior when connecting to a clustered instance. In this particular case my customer was finding that the rules she had configured on her Windows firewall were working fine when connecting to a SQL Server 2008 cluster, but were failing when connecting to a SQL Server 2005 cluster instead.

    This behavior is indeed clearly documented on KB944390 article but in our case it took some time to match what we were finding looking at the network traces to this specific Knowledge Base article. I am posting this information here so you can see what you can expect when running your prefer network sniffer while tracing down this problem.

    If you want to go ahead and try this to yourself, grab you favorite network sniffing utility, a PC with SQL Server Management Studio (SSMS) and locate a SQL Server 2005 and SQL Server 2008 clusters on your network. The test only includes opening SSMS on your client and capture some network traces so there is no risk for the clusters on a the production system. In my case I used Microsoft Network Monitor  because I find easy to interpret the SQL Server packet information using the available SQL Server parser; you can find and download this parser here.

    Open Microsoft Network Monitor and make sure the correct SQL Server parser is selected, the new Network Monitor version makes this easier than ever:

    image

    Click on New Capture to open a new capture window and then click on the Start button to initiate the capture process. Then, open SSMS and connect to the clustered SQL Server 2008 instance using the default setting;, as soon as the connection completes, go back to Network Monitor and stop the capture process.

    As you may know, when connecting to SQL Server named instances the SQL Browser service plays an important role giving to the clients the required information. SQL Browser runs as a local service on each standalone or cluster server where SQL Server is installed (you won’t find SQL Browser service on SQL Server 2000 because it was implemented as a stand-alone service starting with SQL Server 2005). Here is in summary what SQL Server browser does from the MSDN documentation:

    “On startup, SQL Server Browser starts and claims UDP port 1434. SQL Server Browser reads the registry, identifies all SQL Server instances on the computer, and notes the ports and named pipes that they use. When a server has two or more network cards, SQL Server Browser will return all ports enabled for SQL Server. SQL Server 2005 and SQL Server Browser support ipv6 and ipv4. […]

    When SQL Server 2000 and SQL Server 2005 clients request SQL Server resources, the client network library sends a UDP message to the server using port 1434. SQL Server Browser responds with the TCP/IP port or named pipe of the requested instance.”

    The component in SQL Browser that takes care of all this is called the SQL Server Resolution Protocol or SSRP for short. When connecting to my SQL Server 2008 clustered and named instance I was able to see the expected pattern of network packets going in and out. First we can see my client “JORGEPC-SQL” trying to access the remote clustered instance “SQL2008\KATMAI64”, the first Browser request is identified by and UDP Ssrp packet to the destination port 1434:

    image

    Then we can see the response from the SQL Server Browser service using the same SSRP protocol; the response contains all the information required by the client to complete the connection successfully:

    image

    As you can see, the answer in the SSRP traffic comes from the SQL Server Network Name resource in the cluster (sql2008.contoso.lab), not from the cluster node where the SQL Browser service is running. This is expected because I am initiating the communication with the virtual server name in SQL Server, not with the cluster node.

    See now what happens when I perform the very same action using connecting to a clustered SQL Server 2005 named instance called “SQL2005\YUKON”. Here you have the screen captures of the SSRP request from my client (first) and the SSRP response from the server (second):

    image

     

    image

    As you can see from the second image the Browser service is providing the correct information to the client via SSRP, including the instance name, version, and TCP and NP information (you can also see from this frame there are two instances in the cluster, one for SQL Server 2000, represented by Version = 8 and another one for 2005, represented by Version = 9). However, the SSRP response in this case is not coming from the SQL Server Network Name “SQL2005” but from the cluster node where the instance is currently active, “JORGEPC-VM-CN1”.

    If you are not using a firewall or if your firewall rules do not include to block all unknown incoming traffic this behavior will be unnoticed. However, if you block all unknown incoming traffic on your client and using of a firewall that does not support loose source mapping, the connection attempt will fail. You firewall will log the following information on this case:

    2010-10-28 15:23:21 DROP UDP 192.168.0.50 192.168.0.40 1434 56039 122 - - - - - - - RECEIVE
    2010-10-01 15:23:22 DROP UDP 192.168.0.50 192.168.0.40 1434 56040 122 - - - - - - - RECEIVE
    2010-10-01 15:23:23 DROP UDP 192.168.0.50 192.168.0.40 1434 56041 122 - - - - - - - RECEIVE
    2010-10-01 15:23:25 DROP UDP 192.168.0.50 192.168.0.40 1434 56042 122 - - - - - - - RECEIVE

    The KB944390 mentioned before explain this situation in more detail:

    “When you connect to the named instance, the client network library sends a User Datagram Protocol (UDP) request packet to the IP address of the named instance. Then, SQL Server Browser returns a UDP response packet that contains the information about the connection endpoints.
    However, in the UDP response packet, the source IP address may not be the IP address to which the UDP request packet was sent. If the named instance is a failover cluster instance, the source IP address is the IP address of the physical computer instead of the virtual IP address of the remote server. If the remote server has multiple IP addresses, the source IP address may be any of the IP addresses that are assigned to the remote server”

    You can learn more about loose source mapping in the this MSDN link.

    These results depend not only on SQL Server but Windows versions. The network traces will shown the expected behavior (incoming RRSP traffic source is the same as the destination) in SQL Server 2008 SP1 and above when installed on Windows Server 2008 or above; other SQL Server and Windows combination below these two versions will suffer from the same problem, for example:

    • SQL Server 2008 (RTM) on Windows Server 2008
    • SQL Server 2008 SP1/SP2 on Windows Server 2003 cluster
    • SQL Server 2005 on Windows Server 2008

    For the SQL Server and Windows Server versions affected by this problem, 3 different workarounds are provided on the abovementioned KB944390 article.

Page 1 of 1 (2 items)