In this blog I will discuss considerations on planning the number of nodes in a Windows Server Failover Cluster.

Starting with Windows Server 2012, Failover Clustering supports up to 64 nodes in a single cluster, making it industry leading in scale for a private cloud.  While this is exciting, the reality is that it is probably bigger than the average deployment needs.  There is also no difference in cluster size limits between Windows Server editions (Standard vs. Datacenter, etc.).  Since there is no practical limitation on scale for the average IT admin, how many nodes should you deploy in your cluster?  The primary consideration comes down to defining a fault domain.  Let’s discuss the considerations…

Resiliency to Hardware Faults

When thinking about fault domains, hardware resiliency is one of the biggest considerations, be it at the chassis, rack, or datacenter level.  Let’s start with blades as an example: you probably don’t want a chassis to be a single point of failure, so to mitigate a chassis failure you want to span multiple chassis.  If you have eight blades per chassis, you would want your nodes to reside across two different chassis for resiliency, so you create a 16-node cluster with eight nodes in each chassis.  Or maybe you want rack resiliency; in that case, create a cluster out of nodes that span multiple racks.  The number of nodes in the cluster will be influenced by how many servers you have in each rack.  If you want your cluster to achieve disaster recovery in addition to high availability, you will have nodes in the cluster that span datacenters.  Defining fault domains can protect you from hardware-class failures.

Multi-site Clusters

To expand on the previous topic a little: when thinking about disaster recovery scenarios, and having a Failover Cluster that achieves not only high availability but also disaster recovery, you may span clusters across physical locations.  Generally speaking, local failover is less expensive than site failover: on a site failover, data replication needs to reverse direction, IPs may switch to different subnets, and failover times may be longer.  In fact, switching over to another site may require IT leadership approval.  When deploying a multi-site cluster it is recommended to scale up the number of nodes so that there are 2+ nodes in each site.  The goal is that when there is a server failure, there is fast failover to a site-local node; then, when there is a catastrophic site failure, services fail over to the disaster recovery site.  Defining multiple nodes per fault domain can give you better service level agreements.

All your Eggs in One Basket

There are no technical limitations which make one size of cluster better than another.  While we hope there is never a massive failure that takes an entire cluster down, some might point out that they have seen it happen… so there’s the matter of how many eggs you want in one basket.  By breaking up your clusters you create multiple fault domains, which mitigates the impact of losing an entire cluster.  So let’s say you have 1,000 VMs: if you have a single 32-node cluster and the entire cluster goes down, all 1,000 VMs go down.  Whereas if you had them broken into two 16-node clusters, only 500 VMs (half) go down.  Defining fault domains can protect you from cluster-class failures.
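The eggs-in-one-basket trade-off above is simple arithmetic, but it can be handy to sketch when comparing layouts.  A minimal illustration in Python (the function name and the assumption that VMs are split evenly across clusters are mine, not part of any clustering tooling):

```python
def blast_radius(total_vms: int, num_clusters: int) -> int:
    """VMs lost if one entire cluster goes down, assuming the VMs
    are distributed evenly across the clusters."""
    return total_vms // num_clusters

# 1,000 VMs on a single cluster: losing it takes down everything.
print(blast_radius(1000, 1))  # 1000
# The same VMs split across two clusters: half the impact.
print(blast_radius(1000, 2))  # 500
```

The same calculation applies at any scale: each additional cluster shrinks the worst-case impact of a cluster-class failure, at the cost of more clusters to manage.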

Flexibility with a Larger Pool of Nodes

System Center Virtual Machine Manager has a feature called Dynamic Optimization, which analyzes the load across the nodes and moves VMs around to load balance the cluster.  The larger the cluster, the more nodes Dynamic Optimization has to work with and the better balancing it can achieve.  So while creating multiple smaller clusters may divide up fault domains, creating clusters that are too small can increase management overhead and keep them from being utilized optimally.  A larger cluster creates finer granularity to spread and move workloads across.

Greater Resiliency to Failures

The more nodes you have in your cluster, the less impactful losing each node becomes.  So let’s say you create a bunch of little 2-node clusters: if you were to lose both nodes in one of them, all of its VMs go down.  Whereas with a 4-node cluster, you can lose 2 servers and the cluster stays up and keeps running.  Again, this ties back to the hardware fault domains discussion.

Another aspect is that when a node fails, the more surviving nodes you have, the more hosts there are to distribute the load across.  So let’s say you have 2 nodes: if you lose 1 node, the surviving node is now running at 200% capacity (everything it was running before, plus everything from the failed node).  If you scale up the number of nodes, the VMs can be spread across more hosts, and the loss of an individual node is less impactful.  If you have a 3-node cluster and lose a node, each surviving node is operating at 150% capacity.
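The capacity figures above generalize: with N evenly loaded nodes and k failures, each survivor ends up at N/(N−k) of its original load.  A small sketch, assuming the workload is spread evenly and survivors absorb it evenly (the function name is illustrative):

```python
def surviving_node_load(total_nodes: int, failed_nodes: int) -> float:
    """Per-node load multiplier after failures, assuming an evenly
    loaded cluster whose survivors absorb the failed load evenly."""
    survivors = total_nodes - failed_nodes
    if survivors <= 0:
        raise ValueError("no surviving nodes - the cluster is down")
    return total_nodes / survivors

# 2-node cluster, 1 failure: survivor runs at 200% capacity
print(surviving_node_load(2, 1))              # 2.0
# 3-node cluster, 1 failure: each survivor at 150%
print(surviving_node_load(3, 1))              # 1.5
# 32-node cluster, 1 failure: barely noticeable
print(round(surviving_node_load(32, 1), 3))   # 1.032
```

The 32-node case is the golf-round scenario from the next paragraph: losing one node out of 32 adds only about 3% load to each survivor.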

Another way to think of it: how much stress do you want to put on yourself?  If you have a 2-node cluster and you lose a node… you will probably have a fire drill to urgently get that node fixed.  Whereas if you lose a node in a 32-node cluster… you might be ok finishing your round of golf before you worry about it.  Increasing scale can protect you from larger numbers of failures and makes an individual failure less impactful.

Diagnosability

Troubleshooting a large cluster may at times be more difficult than troubleshooting a smaller one.  Say, for example, you have a problem on your 64-node cluster; that may involve pulling and correlating logs from all 64 servers, which can be complex and cumbersome.  Another example: the cluster Validation tool is a functional test tool, and it will take longer to run on larger clusters when things go wrong and you want to check your cluster.  Some IT admins prefer smaller fault domains when troubleshooting problems.

Workload

You also scale different types of clusters differently based on the workload they are running:

  • Hyper-V:  You want your private cloud to be one fluid system where VMs are dynamically moving around and adjusting.  Tools like SCVMM Dynamic Optimization really start to shine on larger clusters, monitoring the load of the nodes and seamlessly moving VMs around to optimize and load balance the cluster.  Hyper-V clusters are usually the biggest and may have 16, 24, 32, or more nodes.
  • Scale-out File Server:  File-based storage for your applications with a SoFS should usually be 2 – 4 nodes.  For example, internal Microsoft SoFS clusters are deployed with 4 nodes.
  • Traditional File Server:  Traditional information worker file clusters tend to also be smaller, again in the 2 – 4 node range.
  • SQL Server:  Most SQL Server clusters deployed with a failover cluster instance (FCI) are 2-node, but that has more to do with SQL Server licensing and the ability to create a 2-node FCI with SQL Server Standard edition.  The other consideration is that each SQL instance requires a drive letter, so that caps you at roughly 24 instances.  This is addressed with SQL Server 2014 support for Cluster Shared Volumes… but generally speaking, it doesn’t make much sense to deploy a 32-node SQL Server cluster.  So think smaller: 2, 4, or maybe up to 8.  A SQL cluster with an Availability Group (AG) is usually a multi-site cluster and will have more nodes than an FCI.

Conclusion

There’s no right or wrong answer on how many nodes to have in your cluster.  Many other vendors have strong recommendations to work around limitations they may have… but those don’t apply to Windows Server Failover Clustering.  It’s more about thinking through your fault domains and, for a large part, personal preference.  Big clusters are cool and come with serious bragging rights, but have some considerations…  little clusters seem simple, but don’t really shine as well as they should…  you will likely find the right fit for you somewhere in between.

Thanks!
Elden Christensen
Principal Program Manager Lead
Clustering & High-Availability
Microsoft