Failover Clustering and Network Load Balancing Team Blog
Windows Server Failover Clustering is a high availability platform that is constantly monitoring the network connections and health of the nodes in a cluster. If a node is not reachable over the network, then recovery action is taken to recover and bring applications and services online on another node in the cluster.
Failover Clustering by default is configured to deliver the highest levels of availability, with the smallest amount of downtime. The default settings out of the box are optimized for failures where there is a complete loss of a server, what we will refer to in this blog as a ‘hard’ failure. These would be unrecoverable failure scenarios such as the failure of non-redundant hardware or an OS bug check. In these situations the server is lost and the goal is for Failover Clustering to very quickly detect the loss of the server and rapidly recover on another server in the cluster. To accomplish this fast recovery from hard failures, the default settings for cluster health monitoring are fairly aggressive. However, they are fully configurable to allow flexibility for a variety of scenarios.
These default settings deliver the best behavior for most customers, however as clusters are stretched from being inches to possibly miles apart the cluster may become exposed to more unreliable networking components between the nodes. Another factor is that the quality of commodity servers is constantly increasing, coupled with augmented resiliency through redundant components (such as dual power supplies, NIC teaming, and multi-path I/O), the number of non-redundant hardware failures may potentially be fairly rare. Because hard failures may be less frequent some customers may wish to tune the cluster for transient failures, where the cluster is more resilient to brief network failures between the nodes. By increasing the default failure thresholds you can decrease the sensitivity to brief network issues that last a short period of time.
It is important to understand that there is no right answer here, and the optimized setting may vary by your specific business requirements and service level agreements.
Think of it like your cell phone, when the other end goes silent how long are you willing to sit there going “Hello?... Hello?... Hello?” before you hang-up the phone and call the person back. When the other end goes silent, you don’t know when or even if they will come back.
The key question you need to ask yourself is: What is more important to you? To quickly recover when you pull out the power cord or to be tolerant to a network hiccup?
There are four primary settings that affect cluster heartbeating and health detection between nodes.
It is important to understand that both the delay and threshold have a cumulative effect on the total health detection. For example setting CrossSubnetDelay to send a heartbeat every 2 seconds and setting the CrossSubnetThreshold to 10 heartbeats missed before taking recovery, means that the cluster can have a total network tolerance of 20 seconds before recovery action is taken. In general, continuing to send frequent heartbeats but having greater thresholds is the preferred method.
Fast Failover (Default)
Note: The “Relaxed” column above is a recommendation for customers looking to set their clusters to be more tolerant of transient failures. The recommended suggestions double the defaults to 10 second threshold for nodes on the same subnet and 20 second threshold for nodes on different subnets.
Disclaimer: When increasing the cluster thresholds, it is recommended to increase in moderation. It is important to understand that increasing resiliency to network hiccups comes at the cost of increased downtime when a hard failure occurs. In most customers’ minds, the definition of a server being down on the network is when it is no longer accessible to clients. Traditionally for TCP based applications this means the resiliency of the TCP reconnect window. While the cluster thresholds can be configured for durations of minutes, to achieve reasonable recovery times for clients it is generally not recommended to exceed the TCP reconnect timeouts. Evaluate your business needs to define what are the maximum values that are right for your deployments configuration.
It critical to recognize that cranking up the thresholds to high values does not fix nor resolve the transient network issue, it simply masks the problem by making health monitoring less sensitive. The #1 mistake made broadly by customers is the perception of not triggering cluster health detection means the issue is resolved (which is not true!). I like to think of it, that just because you choose not to go to the doctor it does not mean you are healthy. In other words, the lack of someone telling you that you have a problem does not mean the problem went away.
Cluster heartbeat configuration settings are considered advanced settings which are only exposed via PowerShell. These setting can be set while the cluster is up and running with no downtime and will take effect immediately with no need to reboot or restart the cluster.
To view the current heartbeat configuration values: PS C:\Windows\system32> get-cluster | fl *subnet*
The setting can be modified with the following syntax: PS C:\Windows\system32> (get-cluster).CrossSubnetDelay = 2000
In Windows Server 2012 there is additional logging to the Cluster.log for heartbeat traffic when heartbeats are dropped. By default the RouteHistoryLength setting is set 10, which is two times the number of default thresholds. If you increase the SameSubnetThreshold or CrossSubnetThrehold values, it is recommended to increase the RouteHistoryLength value to be twice the value to ensure that if the time arises that you need to troubleshoot heartbeat packets being dropped that there is sufficient logging. This can be done with the following syntax: PS C:\Windows\system32> (get-cluster).RouteHistoryLength = 20
For more information on troubleshooting issues with nodes being removed from cluster membership due to network communication issues, please see the following blog: http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx
Thanks! Elden Christensen Principal Program Manager Lead Clustering & High-Availability Microsoft