Last updated 2/22/2012

 

In the Microsoft Multipath I/O Step-By-Step guide for Windows Server 2008 R2, two new MPIO registry keys (UseCustomPathRecoveryInterval and PathRecoveryInterval) were introduced to mitigate a transient error. We previously issued post-release guidance and have recently updated it as described below.

For additional information about MPIO settings, please refer to the Microsoft Multipath I/O Step-By-Step guide here:

http://technet.microsoft.com/en-us/library/ee619749(WS.10).aspx

Background on the issue:

The two new settings were introduced in Windows Server 2008 R2 to help mitigate the issue detailed below:

  • A transient error can cause a path to briefly fail and recover.
  • MPIO detects that the path has failed and then performs a failover.
  • The failed path is the last path for a particular pseudo-LUN, so the associated PDO Remove Timer starts counting down.
  • The error is brief enough, and PnP is busy enough, that PnP misses the fact that the path went away and came back. Therefore, no PnP events are generated to indicate that the path is back online.
  • The pseudo-LUN never sees the path come back online and it gets removed after the PDO Remove Timer runs out.

The end result is that the system has at least one path and one device online, but no pseudo-LUN to represent that device.

MPIO has a path recovery mechanism that can be used to avoid this issue. However, by default, the period at which path recovery is attempted is set to twice the PDORemovePeriod. In the majority of cases the default is acceptable, but it does not solve the problem in this particular scenario, because the pseudo-LUN is removed before the first recovery attempt runs.

This is where the UseCustomPathRecoveryInterval and PathRecoveryInterval settings come into play. They allow you to configure a timer that determines the period at which path recovery attempts are performed. When PathRecoveryInterval is set to less than the PDORemovePeriod, the path recovery attempt executes before the pseudo-LUN is removed, the path is detected as back online, and the pseudo-LUN is saved from removal.

We recommend that you test the use of this value before widespread deployment in production to ensure that path recovery attempts are not happening so frequently that they have a significant impact on regular I/O.

Updated guidance:

Because the default settings allow for the possibility that a path recovery under high load may be missed, we are making the following updated recommendation on the use of these settings. Note, however, that settings should, as always, be evaluated for potential impact in a test environment before changes are implemented in production environments.

We now recommend that the keys above be considered for wider use since they have the potential to allow path recovery under load in situations that might otherwise result in a path failure and I/O delays.

A word of caution about this value: PathRecoveryInterval controls how often MPIO checks whether the device has returned after an error. A lower value means more frequent checks, which translates to a greater amount of traffic to the array. Use caution when implementing this setting, because a value that is too low may adversely affect performance.

Our general guidance going forward for this setting is as follows:

  • The PDORemovePeriod should be set to a minimum of 30 seconds. 
  • UseCustomPathRecoveryInterval should be set to 1 to enable custom path recovery.
  • PathRecoveryInterval should be set to a value approximately 10 seconds less than the PDORemovePeriod. Additionally, we recommend that this value not be set below 20 seconds, to mitigate the risk of performance degradation.

It is also important to note that the PDORemovePeriod must be set to a value less than the global Windows Disk Timeout, to allow path recovery prior to I/O timeouts.  For more information on the global Windows “Disk” timeout registry key, please see the article link at the end of this post.

For example:

If the Windows Disk timeout is 30 seconds AND the PDORemovePeriod is 25 seconds, then a good starting point value for PathRecoveryInterval would be 15 to 20 seconds.
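The arithmetic behind this guidance can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not an official tool: the function name and the margin and floor parameters are hypothetical, and the defaults simply encode the "about 10 seconds less" and "not below 20 seconds" recommendations.

```python
def suggest_path_recovery_interval(pdo_remove_period, margin=10, floor=20):
    """Return a starting-point PathRecoveryInterval, in seconds.

    margin and floor are hypothetical names for the two recommendations:
    roughly 10 seconds below PDORemovePeriod, and never below 20 seconds.
    """
    suggestion = max(pdo_remove_period - margin, floor)
    # The interval only helps if recovery runs before the pseudo-LUN is removed.
    if suggestion >= pdo_remove_period:
        raise ValueError(
            "PDORemovePeriod is too small to fit a PathRecoveryInterval "
            "of at least %d seconds below it" % floor
        )
    return suggestion


if __name__ == "__main__":
    for pdo in (25, 30, 60):
        print(pdo, suggest_path_recovery_interval(pdo))
```

Treat the result as a starting point for testing rather than a final value.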

Setting: HKLM\System\CurrentControlSet\Services\mpio\Parameters\UseCustomPathRecoveryInterval
Definition: If this key exists and is set to 1, it allows the use of PathRecoveryInterval.

Setting: HKLM\System\CurrentControlSet\Services\mpio\Parameters\PathRecoveryInterval
Definition: Represents the period after which PathRecovery is attempted. This setting is only used if it is not set to 0 and UseCustomPathRecoveryInterval is set to 1.
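As one possible way to stage these two values for testing, the sketch below generates the text of a .reg file that could be reviewed and then imported on a test server. It assumes both values are stored as REG_DWORD and expressed in seconds; the function name build_reg_file is hypothetical, and nothing here touches the registry directly.

```python
# Registry path taken from the settings above, written with the full hive
# name that the .reg file format expects.
MPIO_PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\mpio\Parameters"
)


def build_reg_file(path_recovery_interval):
    """Return .reg file text enabling a custom path recovery interval.

    Assumption: both values are REG_DWORD, with the interval in seconds.
    """
    if path_recovery_interval <= 0:
        # A value of 0 means the setting is not used, so reject it here.
        raise ValueError("PathRecoveryInterval must be greater than 0")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % MPIO_PARAMETERS_KEY,
        '"UseCustomPathRecoveryInterval"=dword:00000001',
        # .reg files express DWORD values as eight hexadecimal digits.
        '"PathRecoveryInterval"=dword:%08x' % path_recovery_interval,
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file(20))
```

Generating text instead of writing the registry keeps the change reviewable before it is applied.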

Additional Guidelines for setting timeouts:

Regardless of the values that you choose for MPIO, it is crucial that the following rules be used when setting the timeouts referenced in this article:

  • The Windows Disk timeout (detailed in the article below) must have the highest value.
  • The PDORemovePeriod must be less than the Disk timeout setting.
  • The PathRecoveryInterval must be less than the value used for PDORemovePeriod.
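The ordering rules above lend themselves to a simple mechanical check before a change is rolled out. The following Python sketch is a hypothetical helper, not part of any Microsoft tooling; it takes the three values in seconds and returns a list of rule violations, which is empty when the ordering is valid.

```python
def check_timeout_ordering(disk_timeout, pdo_remove_period,
                           path_recovery_interval):
    """Return a list of violated ordering rules (empty if all hold).

    All three arguments are in seconds. The function name and signature
    are illustrative only.
    """
    problems = []
    if pdo_remove_period >= disk_timeout:
        problems.append("PDORemovePeriod must be less than the Disk timeout")
    if path_recovery_interval >= pdo_remove_period:
        problems.append(
            "The path recovery interval must be less than PDORemovePeriod"
        )
    return problems


if __name__ == "__main__":
    # Values from the example earlier in this post: 30, 25, and 20 seconds.
    print(check_timeout_ordering(30, 25, 20))
```

If both checks pass, the Disk timeout is necessarily the highest of the three values, so the first rule does not need a separate test.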

  

Note: The settings detailed in this article are also useful in the recovery of paths with the iSCSI Initiator and MPIO.

Additional References:

http://blogs.msdn.com/b/san/archive/2011/09/01/the-windows-disk-timeout-value-understanding-why-this-should-be-set-to-a-small-value.aspx

 

Thanks,

The MPIO Team