Draining Nodes for Planned Maintenance with Windows Server 2012


Windows Server 2012 Failover Clusters are easier to manage and maintain with the new “Node Drain” and “Resume with Failback” features, which enable nodes to be gracefully drained for planned maintenance. This functionality is part of the infrastructure that enables “Cluster Aware Updating” (CAU) for patching nodes in a cluster.

Overview

Bringing an individual node down for planned maintenance, for example to install a Service Pack or perform a hardware upgrade, is a common administrative task.

On a Windows Server 2008 R2 Failover Cluster, this is a manual process where you place a cluster node in PAUSED state, and then move individual Roles (workloads) to the other nodes in the cluster as outlined in this KB article.

In Windows Server 2012, conducting planned maintenance on Failover Clusters is dramatically simplified because these steps are automated by the Node Drain (or Node Maintenance Mode) feature.

Node Drain

Using Node Drain, you can automate moving the Roles (workloads) off of a cluster node. Think of Node Drain as an enhanced, workload-aware Node Pause.

Steps automated by Node Drain:

1)      The cluster node is put in a PAUSED state, which prevents other workloads hosted on other nodes from moving to the node.

2)      The Roles (workloads) currently owned by the cluster node are sorted according to their Priority. (Priority of Roles is another new Failover Clustering feature in Windows Server 2012.)

3)      The Roles are then distributed to the other active nodes in the cluster in priority order. Node Drain works with all workloads running on the cluster. For virtual machines, it leverages live migrations and memory-aware intelligent placement.

4)      When all the Roles have been moved off of the cluster node, the Node Drain operation is complete.
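
Before draining, it can help to see which Roles a node owns and in what Priority order they would be considered. The following is a minimal sketch, assuming the FailoverClusters PowerShell module is available; "Node1" is a placeholder node name, not one taken from this article. It only lists Roles and does not reproduce the cluster's internal placement logic.

```powershell
Import-Module FailoverClusters

# "Node1" is a placeholder; substitute the node you plan to drain.
$nodeName = "Node1"

# List the Roles currently owned by the node, highest Priority first.
Get-ClusterNode -Name $nodeName |
    Get-ClusterGroup |
    Sort-Object -Property Priority -Descending |
    Format-Table -Property Name, State, Priority -AutoSize
```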

Initiating Node Drain through Failover Cluster Manager:

Initiating Node Drain through the Failover Cluster Manager snap-in is a simple operation:

  1.        Open Failover Cluster Manager (CluAdmin.msc)
  2.        On the left hand pane navigate to Nodes
  3.        Right-click on the node you wish to drain
  4.        Under Pause select Drain Roles

    Note: If you select “Do Not Drain Roles”, the node is simply paused, as in Windows Server 2008 R2.

    Initiating Node Drain through PowerShell:

    You can initiate Node Drain using the “Suspend-ClusterNode” PowerShell command.
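
A minimal sketch of the simplest form of the command follows. "Node1" is a placeholder node name; the session is assumed to run on a cluster node or to have the FailoverClusters module installed.

```powershell
Import-Module FailoverClusters

# Pause Node1 and move its Roles to the other active nodes.
Suspend-ClusterNode -Name "Node1" -Drain
```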

    Additional advanced options are available through PowerShell to manage draining nodes, including:

    Parameter     Purpose
    Drain         Initiates Node Drain
    TargetNode    The destination node to which all drained roles will be moved or live migrated
    ForceDrain    Moves the roles off of the draining node even if a Group cannot move, either because no other node can host the Group or because it is in a locked state
    Wait          Waits for the Node Drain operation to complete before the command returns
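
As an illustration of combining these parameters, the sketch below drains one node to a specific destination. "Node1" and "Node2" are placeholder names. It assumes that -Wait is a switch that keeps the command from returning until the drain has finished, which is how the shipped cmdlet is documented; treat that as an assumption if your build behaves differently.

```powershell
Import-Module FailoverClusters

# Drain Node1, send every Role to Node2, and block until the drain is done.
# Both node names are placeholders.
Suspend-ClusterNode -Name "Node1" -Drain -TargetNode "Node2" -Wait
```

Sending everything to a single target bypasses the cluster's own placement decisions, so it is probably best reserved for cases where you know the target has enough capacity.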

     

    Status of Drained Node:

    When a Node Drain is initiated, the command returns the NodeDrainStatus property, indicating that the cluster node has begun the node drain operation. You can track the status of the on-going node drain operation using these two cluster node common properties:

    Node Common Property   Values               Purpose
    NodeDrainStatus        0 – Not Initiated    This property indicates the current status of the Node Drain.
                           1 – In Progress
                           2 – Completed
                           3 – Failed
    NodeDrainTarget        Cluster Node Id      ID of the cluster node to which all the workloads will be moved. This ID is set when you use the TargetNode parameter.
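
One way to track a drain from PowerShell is to poll the node object until the status leaves "In Progress". This is a sketch under assumptions: on the object returned by Get-ClusterNode, the status is assumed to be exposed as a DrainStatus property whose numeric values match the table above, and "Node1" is a placeholder.

```powershell
Import-Module FailoverClusters

$nodeName = "Node1"   # placeholder node name

do {
    Start-Sleep -Seconds 5
    $node = Get-ClusterNode -Name $nodeName
    # Assumption: DrainStatus surfaces the NodeDrainStatus common property.
    $status = [int]$node.DrainStatus
    Write-Host ("{0}: State={1}, DrainStatus={2}" -f $node.Name, $node.State, $status)
} while ($status -eq 1)   # 1 = In Progress

if ($status -eq 3) {
    Write-Warning "Node Drain failed on $nodeName."
}
```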

     

    Node Drain Failure:

    Node Drain will fail if a virtual machine’s Live Migration fails for any reason, or if a Role cannot be moved because the node being drained is the last possible owner node for the Role.

    Upon encountering an error with an individual role, the node drain operation continues to drain the remaining roles hosted on the node. The status of the node drain is set to “3” only after the remaining roles have been drained from the cluster node.

    You can then restart the Node Drain, optionally specifying the “-ForceDrain” parameter to override any errors encountered during the initial node drain.
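
A restart after a failed drain might look like the sketch below. It assumes the failure is visible as a DrainStatus value of 3 on the node object (an assumption about how the common property surfaces in PowerShell), and "Node1" is a placeholder.

```powershell
Import-Module FailoverClusters

$nodeName = "Node1"   # placeholder node name
$node = Get-ClusterNode -Name $nodeName

# 3 = Failed, per the status values listed earlier in this article.
if ([int]$node.DrainStatus -eq 3) {
    # Retry the drain and override the errors from the first attempt.
    Suspend-ClusterNode -Name $nodeName -Drain -ForceDrain
}
```

Because -ForceDrain overrides the conditions that stopped the first attempt, it is worth checking why the drain failed before reaching for it.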

    Rebooting a Drained Node:

    Once a node is drained, it will remain in the PAUSED state across reboots to prevent any roles from moving to that node, until the node is resumed. This keeps the node drained for the duration of the maintenance window.

    Node Resume with Failback

    When a node is drained, the cluster will remember the workload(s) that were moved off of the node. When resuming the node after maintenance, you have the option of moving back all the workload(s) to the cluster node.  This will restore the cluster back to the original state it was in before the maintenance.

    Steps automated by Node Resume with Failback:

    1)      The cluster node is removed from PAUSED state - this enables workload(s) to move to this node.

    2)      The workload(s) that were originally drained from the node are moved back using Failback.

      1. If a failback policy is configured to fail back only during a specific failback window, resume honors that setting and failback of the roles is delayed until the window opens.
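
The failback window mentioned above is configured per Role through group common properties. The sketch below shows one way to set it; "SampleRole" is a hypothetical Role name and the hours are arbitrary examples.

```powershell
Import-Module FailoverClusters

# "SampleRole" is a hypothetical Role (cluster group) name.
$group = Get-ClusterGroup -Name "SampleRole"

$group.AutoFailbackType    = 1   # 1 = allow failback, 0 = prevent failback
$group.FailbackWindowStart = 1   # window opens at 01:00 (hour of day, 0-23)
$group.FailbackWindowEnd   = 5   # window closes at 05:00

# Confirm the settings.
$group | Format-List -Property Name, AutoFailbackType, FailbackWindowStart, FailbackWindowEnd
```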

     Resuming Node through Failover Cluster Manager:

    1.        Open Failover Cluster Manager (CluAdmin.msc)
    2.        On the left hand pane navigate to Nodes
    3.        Right-click on the node you wish to resume
    4.        Under Resume select Fail Roles Back

    Note: If you select “Do Not Fail Roles Back”, the node is simply resumed, as in Windows Server 2008 R2.

    Resuming Node through PowerShell:

    You can resume a node using the Resume-ClusterNode PowerShell command.

    Additional advanced options are available through PowerShell to manage resuming nodes, including:

    Name        Value                                               Purpose
    Failback    NoFailback – Don’t fail back workloads              This defines the type of failback to expect after the node is resumed.
                Immediate – Fail back immediately
                Policy – Fail back during the configured window
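
For example, resuming a node and bringing its original workloads straight back could be sketched as follows. "Node1" is a placeholder node name.

```powershell
Import-Module FailoverClusters

# Resume Node1 and fail its drained Roles back right away.
# Use -Failback Policy to honor each Role's failback window instead,
# or -Failback NoFailback to leave the Roles where they are.
Resume-ClusterNode -Name "Node1" -Failback Immediate
```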

    Additional Information:

    Cancelling Node Drain:

    Draining a node may be a long running operation.  A Node Drain that is in progress can be cancelled by initiating a Node Resume. This will cause the Node Drain operation to stop, and if Fail Roles Back is specified, the drained workloads which were moved will be moved back to the cluster node.
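
A cancellation could be scripted along these lines. This is a sketch only: it assumes an in-progress drain shows up as a DrainStatus value of 1 on the node object, and "Node1" is a placeholder.

```powershell
Import-Module FailoverClusters

$nodeName = "Node1"   # placeholder node name
$node = Get-ClusterNode -Name $nodeName

# 1 = In Progress. Resuming the node stops the drain; Immediate also
# moves back the Roles that had already left the node.
if ([int]$node.DrainStatus -eq 1) {
    Resume-ClusterNode -Name $nodeName -Failback Immediate
}
```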

    Configuring the Move Type for a Virtual Machine

    Node Drain and Node Resume with Failback will leverage Live Migration for virtual machines so that a node can be drained with no downtime. Live Migration may at times be a long running operation, and there may be scenarios where you wish to quickly drain a node. Node draining provides the flexibility to allow configuration of how VMs should be moved, using either Live Migration or Quick Migration. 

    You also have granular control over the move type based on the priority setting of the VM. This is configured with NodeDrainMoveTypeThreshold, a private property of the Virtual Machine resource type:

    Name                                             Value                           Purpose
    NodeDrainMoveTypeThreshold (Private Property)    Priority of Virtual Machines    Virtual Machines with Priority equal to or higher than the specified priority will be moved using Live Migration. Virtual Machines with Priority lower than the specified priority will be moved using Quick Migration.

    Example PowerShell commands to view or modify this private property:

    Creating property:
    Get-ClusterResourceType "Virtual Machine" | Set-ClusterParameter -Name NodeDrainMoveTypeThreshold -Value 3000 -Create

    Modifying created property:
    Get-ClusterResourceType "Virtual Machine" | Set-ClusterParameter -Multiple @{"NodeDrainMoveTypeThreshold"="3000"}

    Reading property:
    Get-ClusterResourceType "Virtual Machine" | Get-ClusterParameter NodeDrainMoveTypeThreshold

     

    Conclusion:

    Node Drain is a great new time-saving feature in Windows Server 2012 Failover Clustering for conducting planned maintenance. Using this feature, you can easily drain the workload(s) off of a cluster node in a single click, and easily restore them when maintenance operations are completed on the cluster node.

     

    Thanks!

    Amitabh Tamhane, Program Manager II, Clustering & High Availability, Microsoft
    Lokesh Koppolu, Principal Development Lead, Clustering & High Availability, Microsoft

    Comments:
    • Does the Node Drain also change Owner Node for Cluster Disks? What about the Quorum? Thanks!

    • @Eric: Yes, as part of Node Drain, the cluster moves ALL groups (including Cluster Shared Volumes, the Cluster Group, and any other clustered Roles). As part of the Cluster Group, we also move the Quorum Witness resource to a different node.

      Node Drain will fail if moving the quorum witness to a different node would cause the cluster to lose quorum.

    • During a node drain, if moving a quorum resource causes the Node Drain to fail, how is the state of the VMs determined?

    • @Nishant: When node drain hits a failure condition, it keeps the remaining VMs running on the local node. For the VMs which are already live migrated to different nodes, the node drain would not attempt to bring them back or change their state. In all cases, we always do our best to keep VMs up & running as much as possible.

    • We have SQL 2012 (64-bit) SP1 installed on Windows Server 2012 Data Center. All the nodes are under WSFC. And all these runs under VMWare. We also have availability group setup in each node for automatic failover.

      Questions:

      (1) Is CAU enabled by default ? where can I check it ?

      (2) Do I need to have CAU enabled in order to use the "node drain" and "resume with failback" feature ?

      (3) are there other requirements to use "node drain" and "resume with failback" feature ?

      Thanks in advance
