I recently worked on a very interesting issue, where one of the cluster nodes in an AlwaysOn environment became unstable, and the administrators ended up evicting the node from the Windows cluster as an emergency measure. Ideally, since the primary node/replica was no longer available, the Availability Group should have come up on the secondary replica, but it didn’t in this case. The AG was showing online in the Failover Cluster Manager, but in SQL Server Management studio, the database in the AG was in “Not Synchronizing\Recovery Pending” state.

We checked the errorlogs (on the secondary), and found these messages:

2012-09-05 04:01:32.300 spid18s      AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.
2012-09-05 04:01:32.310 spid21s      Error: 35262, Severity: 17, State: 1.
2012-09-05 04:01:32.310 spid21s      Skipping the default startup of database 'Test' because the database belongs to an availability group (Group ID:  65537). The database will be started by the availability group. This is an informational message only. No user action is required.
……..

2012-09-05 04:01:32.430 spid18s      AlwaysOn: The local replica of availability group 'PST TEST' is starting. This is an informational message only. No user action is required.
…….      
2012-09-05 04:01:32.470 spid18s      The state of the local availability replica in availability group 'PST TEST' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.
…….

2012-09-05 04:01:32.880 spid52       AlwaysOn: The local replica of availability group 'PST TEST' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is require
2012-09-05 04:01:32.980 spid52       The state of the local availability replica in availability group 'PST TEST' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. 
2012-09-05 04:01:33.090 Server       Error: 41015, Severity: 16, State: 1.
2012-09-05 04:01:33.090 Server       Failed to obtain the Windows Server Failover Clustering (WSFC) node handle (Error code 5042).  The WSFC service may not be running or may not be accessible in its current state, or the specified cluster node name is invalid.

Since there were clear errors related to the Windows Server Failover Cluster (WSFC), we checked and ensured that the windows cluster was stable. It was, and the cluster validation came back clean.

We tried bringing the database online using "Restore database lab with recovery", but it failed saying the database is part of an availability group. We then tried removing it from the Availability Group, but it failed with error 41190, stating that the database is not in a state that it can be removed from the Availability Group. The only option we had at this point was to delete the AG. We tried doing so, but that too returned with an error:

Msg 41172, Level 16, State 0, Line 3
An error occurred while dropping availability group 'PST TEST' from Windows Server Failover Clustering (WSFC) cluster and from the local metadata. The operation encountered SQL OS error 41036, and has been terminated. Verify that the specified availability group name is correct, and then retry the command.

However, the AG was no longer visible in SQL Server Management Studio and Failover Cluster Manager. I was still skeptical, since the error had clearly complained about the metadata cleanup. When we tried creating a new AG with the name PST TEST, it errored out as expected, stating that the AG as still present. So we ended up creating an AG with a different name and adding the Test database to it.

Root Cause Analysis

So much for getting the environment back up, but what about the root cause? I mean, how can we ensure that such as issue never happens again? I checked with some friends in the Product Group, and according to them, deleting an AG should “Always” work. So why didn’t it work in this case?

The answer lies in the size of the registry on the servers. As many of you might know, the limit for registry size is still 2 GB. This is also documented in the msdn article here. The proper way to investigate would be to follow these steps:

  1. Check the Paged pool usage from perfmon by checking the Memory->Pool Paged Bytes counter
  2. If you see high memory usage there (close to the 2 GB limit), then we need to find who’s using the pool. There’s a very useful article on this:
    http://msdn.microsoft.com/en-us/library/windows/hardware/gg463213.aspx
  3. Using one of the methods described in the article, we can easily identify which process is using the paged pool. One other way is to use the Process->Pool paged bytes Perfmon counter.
  4. In our case, we identified CM31 as the tag using about 1.97 GB from the paged pool. Looking up the tag list available here, we can see that the CM series corresponds to “Configuration Manager (registry)”.
    So it’s clear that registry is using a large chunk of the paged pool, and once this usage hits 2 GB, users will not be able to login to the system, and as a result, everything else, including the cluster service and the AG, will fail. This issue can happen either due to large registry hives or some process loading keys multiple times.
  5. Next, check the sizes of the files in the Windows/system32/config folder. If these are large (>1 GB), then that will be the cause of the issue. Also, check the sizes of the NTUser.dat files in C:\Users. There will be one for each user, so searching for them in c:\users is the simplest way.
  6. In our case, we could clearly see that the SOFTWARE hive was by far the largest, and very close to the limit:
    image
  7. The next step is to figure out which process/hive is responsible for the huge size of the Software branch. In our case we found that it was a known issue with the Cluster service, outlined in this KB:
    http://support.microsoft.com/kb/2616514
  8. Another known issue with SQL 2012 SP1:
    https://connect.microsoft.com/SQLServer/feedback/details/770630/msiexec-exe-processes-keep-running-after-installation-of-sql-server-2012-sp1
  9. Yet another known issue that can cause similar issues in Windows 2000 and 2003:
    http://support.microsoft.com/kb/906952

The best remedial measure is to compress the “Bloated” registry hives, using the steps outlined in this KB:
http://support.microsoft.com/kb/2498915

There can, of course, be other processes bloating the Software hive, and the only way to find out is to take a backup of the registry hive and try to find which hives/keys are the largest. Once we have identified the keys, we can trace them back to the process which is responsible.

Update: A fix for the issue SQL Server issue (the msiexec.exe keeps running after SP1 installation) is available at:
http://www.microsoft.com/en-us/download/details.aspx?id=36215

The fix is also included in the Cumulative Update 2 (CU2) for SQL 2012 SP1, available at:
http://support.microsoft.com/kb/2790947

Hope this helps. Any feedback/suggestions are welcome.