I recently worked on a very interesting issue, where one of the cluster nodes in an AlwaysOn environment became unstable, and the administrators ended up evicting the node from the Windows cluster as an emergency measure. Ideally, since the primary node/replica was no longer available, the Availability Group should have come up on the secondary replica, but it didn’t in this case. The AG was showing online in the Failover Cluster Manager, but in SQL Server Management studio, the database in the AG was in “Not Synchronizing\Recovery Pending” state.
We checked the errorlogs (on the secondary), and found these messages:
2012-09-05 04:01:32.300 spid18s AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required. 2012-09-05 04:01:32.310 spid21s Error: 35262, Severity: 17, State: 1. 2012-09-05 04:01:32.310 spid21s Skipping the default startup of database 'Test' because the database belongs to an availability group (Group ID: 65537). The database will be started by the availability group. This is an informational message only. No user action is required. ……..
2012-09-05 04:01:32.430 spid18s AlwaysOn: The local replica of availability group 'PST TEST' is starting. This is an informational message only. No user action is required. ……. 2012-09-05 04:01:32.470 spid18s The state of the local availability replica in availability group 'PST TEST' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. …….
2012-09-05 04:01:32.880 spid52 AlwaysOn: The local replica of availability group 'PST TEST' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is require 2012-09-05 04:01:32.980 spid52 The state of the local availability replica in availability group 'PST TEST' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. 2012-09-05 04:01:33.090 Server Error: 41015, Severity: 16, State: 1. 2012-09-05 04:01:33.090 Server Failed to obtain the Windows Server Failover Clustering (WSFC) node handle (Error code 5042). The WSFC service may not be running or may not be accessible in its current state, or the specified cluster node name is invalid.
Since there were clear errors related to the Windows Server Failover Cluster (WSFC), we checked and ensured that the windows cluster was stable. It was, and the cluster validation came back clean.
We tried bringing the database online using "Restore database lab with recovery", but it failed saying the database is part of an availability group. We then tried removing it from the Availability Group, but it failed with error 41190, stating that the database is not in a state that it can be removed from the Availability Group. The only option we had at this point was to delete the AG. We tried doing so, but that too returned with an error:
Msg 41172, Level 16, State 0, Line 3 An error occurred while dropping availability group 'PST TEST' from Windows Server Failover Clustering (WSFC) cluster and from the local metadata. The operation encountered SQL OS error 41036, and has been terminated. Verify that the specified availability group name is correct, and then retry the command.
However, the AG was no longer visible in SQL Server Management Studio and Failover Cluster Manager. I was still skeptical, since the error had clearly complained about the metadata cleanup. When we tried creating a new AG with the name PST TEST, it errored out as expected, stating that the AG as still present. So we ended up creating an AG with a different name and adding the Test database to it.
Root Cause Analysis
So much for getting the environment back up, but what about the root cause? I mean, how can we ensure that such as issue never happens again? I checked with some friends in the Product Group, and according to them, deleting an AG should “Always” work. So why didn’t it work in this case?
The answer lies in the size of the registry on the servers. As many of you might know, the limit for registry size is still 2 GB. This is also documented in the msdn article here. The proper way to investigate would be to follow these steps:
The best remedial measure is to compress the “Bloated” registry hives, using the steps outlined in this KB: http://support.microsoft.com/kb/2498915
There can, of course, be other processes bloating the Software hive, and the only way to find out is to take a backup of the registry hive and try to find which hives/keys are the largest. Once we have identified the keys, we can trace them back to the process which is responsible.
Update: A fix for the issue SQL Server issue (the msiexec.exe keeps running after SP1 installation) is available at:http://www.microsoft.com/en-us/download/details.aspx?id=36215
The fix is also included in the Cumulative Update 2 (CU2) for SQL 2012 SP1, available at:http://support.microsoft.com/kb/2790947
Hope this helps. Any feedback/suggestions are welcome.