Over the last few years that I have been in Microsoft, I have come across multiple issues where database administrators have mistakenly made inappropriate changes to the clustered SQL Server registry hives. Later, when SQL Server restarts (either due to manual intervention or during a failover), SQL Server can no longer come online on the cluster. Additionally, the registry changes cannot be reverted back; and SQL Server remains in the failed state despite all efforts to bring it online.
This phenomenon happens due to a process known as "Registry Check-pointing" which happens automatically on a cluster. This is true not only for the SQL Server Resource, but for all clustered resources. Lets first try and understand the "Registry Check-pointing" process, as this will help us understand how to fix a scenario such as this.
To understand the "Registry Check-pointing" process, lets take up a cluster. My clustered environment has 2 instances of SQL Server installed; however, I would not make any changes to these instances. What I would do, for demonstration purpose, is to create a new resource and play with that.
Lets go ahead and create a new cluster group, and a new resource in the group. We will not bring the resource online just yet. Also, for this demonstration, the resource type is not important.
Now, in the registry, we will create a sample registry hive, that we will later associate to this resource.
Now, to associate the Registry hive we just created to the new resource, we go to HKLM\Cluster\Resources. We navigate through the GUIDs to locate the GUID that corresponds to our Resource. Under the GUID, we create a new Key and name it RegSync. Under RegSync, we create a new String Value, name it 00000001, and enter Software\MyTestResource as its value. This will actually point to the Registry Hive HKLM\Software\MyTestResource. Refer snapshot below:
Now, let us bring the group MyTestGroup online. This will bring the resource MyTestResource also online. Now, we will edit the registry entry HKLM\Software\MyTestResource\ChildKey and modify the value of Param_2. Within a few seconds, we can see a folder being created in the Quorum_Drive\MSCS folder, having its name same as the GUID. This will contain a file 00000001.CPT. We will now, revert back the value of Param_2. At this point, we have the following:
Now, we will run the following tests:
So, what's happening in the background?
Here's what – The registry hives that are mentioned under the RegSync key of any clustered resource go in for automatic check-pointing when the resource is online. This means that whenever a change is made to any entry in this hive, when the corresponding resource is online, the value is automatically replicated to the other nodes of the cluster. Also, the entire registry hive is exported to a folder in the Quorum Drive. The folder name corresponds to the GUID for the resource in the Registry and the filename corresponds to the Hive Number under RegSync.
Additionally, when the resource goes offline, the check-point process is fired and the same activity is completed once again.
Now, when the resource tries to come online, here is what it does:
Hence, we can clearly understand that any changes made to any of the check-pointed registry hives will be automatically over-written by the contents in the CPT file.
How do we recover in case such a mistake has been done?
It is not impossible to recover the resource back from this scenario. However, care must be taken while following the steps that I will mention below; so that even by mistake, we do not commit any mistakes. Please be extra cautious, as any mistake here can lead to a complete reinstall / reconfiguration.
I hope this post will be useful to my readers. Please post your comments in case you find this post helpful.
Disclaimer: All information provided here is my personal opinion and is neither verified nor approved by Microsoft before it is published. All information, and code samples, if any, is provided "AS IS" with no warranties and confers no rights.
PingBack from http://www.anith.com/?p=41602
Thanks for the information, it is very helpful
Nice article and I loved the way it was demonstrated
very helpful and neatly described. Thanks a lot. This article cleared many of my doubts.