Failover Clustering and Network Load Balancing Team Blog
In this blog I will discuss how Failover Clustering communicates with cluster resources, along with how clustering detects and recovers when something goes wrong. For the sake of simplicity I will use a Virtual Machine as an example throughout this blog, but the logic is generic and applies to all workloads.
When a Virtual Machine is clustered, a cluster “Virtual Machine” resource is created which controls that VM. The “Virtual Machine” resource and its associated resource DLL communicate with the VMMS service and tell the VM when to start and when to stop. They also run health checks to ensure the VM is OK.
Resources all run in a component of the Failover Clustering feature called the Resource Hosting Subsystem (RHS). Actions a user takes on the VM map to entry point calls that RHS makes to resources, such as Online, Offline, IsAlive, and LooksAlive. The full list of resource DLL entry-point functions is in the Failover Cluster API documentation.
In most cases where a resource goes unresponsive and clustering needs to recover, the interesting entry points are LooksAlive and IsAlive, which are health checks on the resource.
Health check calls to the resource continue constantly while resources are online. If a resource returns a failure for the lightweight LooksAlive health check, RHS will then immediately do a more comprehensive health check and call IsAlive to see if the resource is really healthy. A resource is considered failed as the result of an IsAlive failure.
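To make the escalation concrete, here is a minimal Python sketch of a single health-check pass as described above. It is a model of the logic only, not the RHS implementation: the function name health_check and the two callables are hypothetical stand-ins for the resource DLL's LooksAlive and IsAlive entry points.

```python
from typing import Callable


def health_check(looks_alive: Callable[[], bool],
                 is_alive: Callable[[], bool]) -> bool:
    """Model one health-check pass; return True if the resource is healthy.

    looks_alive and is_alive are hypothetical stand-ins for the resource
    DLL entry points of the same names.
    """
    # Lightweight check first; a pass needs no further work.
    if looks_alive():
        return True
    # LooksAlive failed: escalate immediately to the comprehensive check.
    # Only an IsAlive failure marks the resource as failed.
    return is_alive()
```

A LooksAlive failure on its own therefore never fails the resource; the IsAlive result decides. The periodic IsAlive call that RHS also makes on its own schedule is left out of this sketch.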
Think of it like this… Every 60 seconds RHS calls IsAlive, basically asking the resource “Are you OK?”, and the resource responds “Yes, I am doing fine.” This periodic health check goes on and on… until something happens to the resource and it doesn’t respond. Think of it like a dropped call on your cell phone: how long are you willing to sit there going “Hello? Hello? Hello?”… before you give up and call the person back, basically resetting the connection?
Failover Clustering has this same concept. RHS will sit there waiting for the resource to respond to an IsAlive call, and eventually it will give up and need to take recovery action. By default RHS will wait for 5 minutes for the resource to respond to an entry point call to it. This is configurable with the resource DeadlockTimeout common property.
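The waiting-then-giving-up behavior can be sketched in Python as a call with a deadline. This is an illustrative model under stated assumptions, not how RHS is implemented: call_with_deadlock_timeout and the returned status strings are hypothetical, and only the 5 minute default comes from the text above.

```python
import threading
from typing import Callable

# 5 minutes expressed in milliseconds, the default cited in the text.
DEFAULT_DEADLOCK_TIMEOUT_MS = 300_000


def call_with_deadlock_timeout(entry_point: Callable[[], bool],
                               timeout_ms: int = DEFAULT_DEADLOCK_TIMEOUT_MS) -> str:
    """Run a (hypothetical) entry point and classify the outcome.

    Returns "healthy" or "failed" if the call answers in time, and
    "deadlocked" if it is still running when the timeout expires.
    """
    result = {}

    def worker() -> None:
        result["value"] = entry_point()

    # Daemon thread so a hung call cannot keep this model process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_ms / 1000.0)

    if thread.is_alive():
        # No answer within the deadline: the caller must take recovery action.
        return "deadlocked"
    return "healthy" if result.get("value", False) else "failed"
```

The point of the sketch is that "failed" and "deadlocked" are different outcomes. A failed resource answered the question; a deadlocked one never answered at all, which is why recovery has to come from outside the resource.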
To modify the DeadlockTimeout property of an individual resource, use the following PowerShell command. The value is in milliseconds, so 300000 is the 5 minute default:
(Get-ClusterResource "Resource Name").DeadlockTimeout = 300000
Or, if you want to modify the DeadlockTimeout for all resources of a given type, you can set it at the resource type level with the following syntax (this example applies to all virtual machine resources):
(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout = 300000
Resources are expected to respond to an IsAlive or LooksAlive within a few hundred milliseconds, so waiting 5 minutes for a resource to respond is a really long time. Something pretty bad happened if a resource which normally responds in milliseconds suddenly takes longer than 5 minutes. So it is generally recommended to stay with the default values.
If the resource doesn’t respond in 5 minutes, RHS decides that there must be something wrong with the resource and that it should take recovery action to get it back up and running. Remember that the resource has gone silent; RHS has no idea what is wrong with it. The only way to recover is to terminate the RHS process. RHS then restarts, which in turn restarts the resource, and everything is back up and running. You may also see the associated entries in the System event log:
Event ID 1230 Cluster resource ‘Resource Name’ (resource type ‘Resource Type Name’, DLL ‘DLL Name’) did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running this resource.
Event ID 1146 The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource.
This is the way clustering is designed to work… it is monitoring the health of the system, it detects something is wrong, and recovers. This is a good thing!
The Resource Hosting Subsystem (RHS) is the process which hosts resources. When multiple resources are online on a given node, they may share a common RHS process. For example, if you had 5 clustered VMs running on the same node, all the resources associated with those VMs would be running in the same RHS process.
There are some side effects from terminating the RHS process when a resource goes unresponsive. If there are multiple resources hosted on that node, they may be hosted in the same RHS process. That means when RHS terminates and restarts to recover an individual resource, all resources being hosted in that specific RHS process are also restarted. With Windows Server 2008 R2 if you have 5 VMs running on a node, all 5 VMs are going to get restarted.
If a resource becomes unresponsive and causes an RHS crash, the cluster service will deem that specific resource to be suspect and in need of isolation. Think of it as one strike and you are out! The cluster service will automatically set the resource common property SeparateMonitor to mark that resource to run in its own dedicated RHS process, so that if the resource becomes unresponsive again, it will not affect others. This setting is also configurable: you can manually enable a resource to run in its own RHS process, or you can stop a resource from running in its own RHS process once a past issue that caused it to be isolated has been addressed.
To modify the SeparateMonitor property of an individual resource, use the following PowerShell command. A value of 1 runs the resource in its own RHS process and 0 returns it to a shared one:
(Get-ClusterResource "Resource Name").SeparateMonitor = 0
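The "one strike and you are out" isolation described above can be modeled with a short Python sketch. This is a deliberate simplification: it assumes a single shared RHS process per node, and the Resource class and both function names are hypothetical, used only to illustrate the effect of the SeparateMonitor flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    name: str
    separate_monitor: bool = False  # models the SeparateMonitor common property


def assign_rhs_processes(resources: List[Resource]) -> List[List[str]]:
    """Group resource names by the (modeled) RHS process that hosts them.

    Assumption: unflagged resources share one process; each flagged
    resource gets a dedicated process of its own.
    """
    shared = [r.name for r in resources if not r.separate_monitor]
    processes = [[r.name] for r in resources if r.separate_monitor]
    if shared:
        processes.insert(0, shared)
    return processes


def recover_from_deadlock(resources: List[Resource], unresponsive: str) -> List[str]:
    """Return the names restarted by recovery and flag the offender for isolation."""
    processes = assign_rhs_processes(resources)
    # Terminating an RHS process restarts everything hosted in it.
    restarted = next(p for p in processes if unresponsive in p)
    for r in resources:
        if r.name == unresponsive:
            r.separate_monitor = True  # one strike: isolate from now on
    return restarted
```

With five VMs named VM1 to VM5 and VM3 going silent, the first recovery restarts all five. After that VM3 sits in its own modeled process, so a second deadlock on VM3 restarts VM3 alone.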
The impact of running resources in their own dedicated RHS process is that each RHS process consumes a little more system resources. If you open Task Manager you will see a series of “Failover Cluster Resource Host Subsystem” processes running, each consuming a few MB of RAM.
In general clustering will self-manage misbehaving resources. Resources will be given a chance to play nicely with everyone else, and if they don’t they will be automatically isolated to minimize impact. So it is generally recommended to stay with the default values.
There are some feature enhancements in Windows Server 2012 to mitigate the impact of non-responsive resource recovery.
Additionally resources can also be marked with the SeparateMonitor property to run in their own dedicated RHS process in Windows Server 2012, as they could in previous releases.
Everything we have discussed in this blog to this point has described the expected behavior of how Failover Clustering recovers when something goes wrong with a resource and it becomes unresponsive. Now the most important question… What do you do about it?
The key take-away is that RHS recovery is expected behavior for a resource that has become unresponsive. To address the root cause, you need to dig into which resource is failing; then, by understanding what it was attempting to do, you can identify why it didn’t respond.
For additional information on troubleshooting resources that result in RHS recovery, see the blogs below. Microsoft support is also available to assist in advanced debugging to help you identify root cause.
Resource Hosting Subsystem (RHS) In Windows Server 2008 Failover Clusters http://blogs.technet.com/b/askcore/archive/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters.aspx
Thanks! Elden Christensen Principal Program Manager Lead Clustering & High-Availability Microsoft