Cluster Shared Volume (CSV) Inside Out


In this blog we will take a look under the hood of the cluster file system in Windows Server 2012 R2 called Cluster Shared Volumes (CSV). This blog post is targeted at developers and ISVs who are looking to integrate their storage solutions with CSV.

Note: Throughout this blog I will refer to C:\ClusterStorage, assuming that Windows is installed on the C:\ drive. Windows can be installed on any available drive and the CSV namespace will be built on the system drive, but instead of writing %SystemDrive%\ClusterStorage\ I've used C:\ClusterStorage for better readability, since C:\ is the system drive most of the time.

Components

Cluster Shared Volumes in Windows Server 2012 is a completely re-architected solution compared to the Cluster Shared Volumes you knew in Windows Server 2008 R2. Although it may look similar from the user's perspective (just a bunch of volumes mapped under C:\ClusterStorage\, accessed through the regular Windows file system interfaces), under the hood these are two completely different architectures. One of the main changes is that in Windows Server 2012, CSV has been expanded beyond the Hyper-V workload, for example to the Scale-Out File Server, and in Windows Server 2012 R2 CSV is also supported with SQL Server 2014.

First, let us look under the hood of CsvFs at the components that constitute the solution.

Figure 1 CSV Components and Data Flow Diagram

The diagram above shows a 3 node cluster. There is one shared disk that is visible to Node 1 and Node 2. Node 3 in this diagram has no direct connectivity to the storage. The disk was first clustered and then added to the Cluster Shared Volume. From the user's perspective everything will look the same as in Windows Server 2008 R2. On every cluster node you will find a mount point to the volume: C:\ClusterStorage\Volume1. The "VolumeX" name can be changed; just use Windows Explorer and rename it like you would any other directory. CSV will then take care of synchronizing the updated name around the cluster to ensure all nodes are consistent. Now let's look at the components that are backing these mount points.
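The rename can also be done from PowerShell; this is a sketch, where "Volume1" and "Accounting" are illustrative names:

```powershell
# Rename the CSV mount point; the cluster propagates the new name to all nodes.
Rename-Item -Path C:\ClusterStorage\Volume1 -NewName Accounting
```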

Terminology

The node where NTFS for the clustered CSV disk is mounted is called the Coordinator Node. In this context, any other node that does not have the clustered disk mounted is called a Data Server (DS). Note that a Coordinator Node is always a Data Server node at the same time; in other words, the coordinator is the special data server node where NTFS is mounted.

If you have multiple disks in CSV, you can place them on different cluster nodes. The node that hosts a disk will be a Coordinator Node only for the volumes that are located on that disk. Since each node might be hosting a disk, each of them might be a Coordinator Node, but for different disks. So technically, to avoid ambiguity, we should always qualify "Coordinator Node" with the volume name. For instance, we should say: "Node 2 is the Coordinator Node for Volume1". For simplicity, most of the examples in this blog post will have only one CSV disk in the cluster, so we will drop the qualification and just say Coordinator Node to refer to the node that has this disk online.
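You can see which node currently coordinates each CSV disk, and move coordination to another node, using the failover clustering cmdlets. This is a sketch; "Cluster Disk 1" and "Node2" are illustrative names:

```powershell
# List each CSV and the node that currently owns (coordinates) its disk
Get-ClusterSharedVolume | Select-Object Name, OwnerNode

# Move coordination of a CSV disk to another node
Move-ClusterSharedVolume -Name "Cluster Disk 1" -Node Node2
```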

Sometimes we will use the terms "disk" and "volume" interchangeably, because in the samples we will be going through one disk will have only one NTFS volume, which is the most common deployment configuration. In practice, you can create multiple volumes on a disk and CSV fully supports that as well. When you move disk ownership from one cluster node to another, all the volumes will travel along with the disk, and any given node will be the coordinator for all volumes on a given disk. Storage Spaces would be one exception to that model, but we will ignore that possibility for now.

This diagram is complicated, so let's break it up into pieces, discuss each piece separately, and then hopefully the whole picture will make more sense.

On Node 2, you can see the following stack, which represents the mounted NTFS. The cluster guarantees that only one node has NTFS in a state where it can write to the disk; this is important because NTFS is not a clustered file system. CSV provides a layer of orchestration that enables NTFS or ReFS (with Windows Server 2012 R2) to be accessed concurrently by multiple servers. The following blog post explains how the cluster leverages SCSI-3 Persistent Reservation commands with disks to implement that guarantee: http://blogs.msdn.com/b/clustering/archive/2009/03/02/9453288.aspx

Figure 2 CSV NTFS stack

The cluster makes this volume hidden so that the Volume Manager (Volume in the diagram above) does not assign a volume GUID to it, and no drive letter is assigned. You also will not see this volume using mountvol.exe or the FindFirstVolume() and FindNextVolume() Win32 APIs.

On the NTFS stack the cluster will attach an instance of a file system mini-filter driver called CsvFlt.sys at altitude 404800. You can see that filter attached to the NTFS volume used by CSV if you run the following command:

>fltmc.exe instances
Filter                Volume Name                              Altitude        Instance Name
--------------------  -------------------------------------  ------------  ----------------------
<skip>
CsvFlt                \Device\HarddiskVolume7                   404800     CsvFlt Instance
<skip>

Applications are not expected to access the NTFS stack, and we even go the extra mile to block access to this volume from user mode applications. CsvFlt will check all create requests coming from user mode against the security descriptor that is kept in the cluster public property SharedVolumeSecurityDescriptor. You can use the PowerShell cmdlet "Get-Cluster | fl SharedVolumeSecurityDescriptor" to get to that property. The output of this PowerShell cmdlet shows the value of the security descriptor in self-relative binary format (http://msdn.microsoft.com/en-us/library/windows/desktop/aa374807(v=vs.85).aspx):

PS D:\Windows\system32> Get-Cluster | fl SharedVolumeSecurityDescriptor

SharedVolumeSecurityDescriptor : {1, 0, 4, 128...}
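If you want the descriptor in a human-readable form, one way (a sketch using the .NET RawSecurityDescriptor class from PowerShell) is to convert the self-relative binary form into SDDL:

```powershell
# Convert the self-relative binary security descriptor to an SDDL string
$bytes = (Get-Cluster).SharedVolumeSecurityDescriptor
$sd = New-Object System.Security.AccessControl.RawSecurityDescriptor($bytes, 0)
$sd.GetSddlForm([System.Security.AccessControl.AccessControlSections]::All)
```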

CsvFlt plays several roles:

  • Provides an extra level of protection for the hidden NTFS volume used for CSV
  • Helps provide a local volume experience (after all, CsvFs does look like a local volume). For instance, you cannot open a volume over SMB or read the USN journal over SMB. To enable these kinds of scenarios, CsvFs often marshals the operations that need to be performed to CsvFlt, disguising them behind a tunneling file system control. CsvFlt is responsible for converting the tunneled information back to the original request before forwarding it down the stack to NTFS.
  • It implements several mechanisms to help coordinate certain states across multiple nodes. We will touch on them in future posts; the File Revision Number is one example.

The next stack we will look at is the system volume stack. In the diagram above you see this stack only on the coordinator node, which has NTFS mounted. In practice, exactly the same stack exists on all nodes.

 

Figure 3 System Volume Stack

The CSV Namespace Filter (CsvNsFlt.sys) is a file system mini-filter driver at an altitude of 404900:

D:\Windows\system32>fltmc instances
Filter                Volume Name                              Altitude        Instance Name
--------------------  -------------------------------------  ------------  ----------------------
<skip>
CsvNSFlt              C:                                        404900     CsvNSFlt Instance
<skip>

CsvNsFlt plays the following roles:

  • It protects C:\ClusterStorage by blocking unauthorized attempts (those not coming from the cluster service) to delete or create any files or subfolders in this folder or change any attributes on the files. Other than opening these folders, about the only operation that is not blocked is renaming them. You can use the command prompt or Explorer to rename C:\ClusterStorage\Volume1 to something like C:\ClusterStorage\Accounting. The directory name will be synchronized and updated on all nodes in the cluster.
  • It helps us dispatch the block level redirected IO. We will cover this in more detail when we talk about block level redirected IO later in this post.

The last stack we will look at is the stack of the CSV file system. Here you will see two modules: the CSV Volume Manager (csvvbus.sys) and the CSV File System (CsvFs.sys). CsvFs is a file system driver and mounts exclusively to the volumes surfaced by CsvVbus.

 

Figure 5 CsvFs stack

Data Flow

Now that we are familiar with the components and how they are related to each other, let’s look at the data flow.

First, let's look at how metadata flows. Below you can see the same diagram as in Figure 1; I've kept only the arrows and blocks that are relevant to the metadata flow and removed the rest from the diagram.

 

Figure 6 Metadata Flow

Our definition of a metadata operation is everything except read and write. Examples of metadata operations would be create file, close file, rename file, change file attributes, delete file, change file size, any file system control, etc. Some writes may also, as a side effect, cause a metadata change. For instance, an extending write will cause CsvFs to extend all or some of the following: file allocation size, file size and valid data length. A read might cause CsvFs to query some information from NTFS.

On the diagram above you can see that metadata from any node goes to the NTFS stack on Node 2. Data server nodes (Node 1 and Node 3) are using Server Message Block (SMB) as a protocol to forward metadata over.

Metadata are always forwarded to NTFS. On the coordinator node CsvFs will forward metadata IO directly to the NTFS volume while other nodes will use SMB to forward the metadata over the network.

Next, let's look at the data flow for Direct IO. The following diagram is produced from the diagram in Figure 1 by removing any blocks and lines that are not relevant to Direct IO. By definition, Direct IO covers the reads and writes that never go over the network, but go from CsvFs through CsvVbus straight to the disk stack. To make sure there is no ambiguity, I'll repeat it again: Direct IO bypasses the volume stack and goes directly to the disk.

Figure 7 Direct IO Flow

Both Node 1 and Node 2 can see the shared disk, so they can send reads and writes directly to the disk, completely avoiding sending data over the network. Node 3 is not in Figure 7 since it cannot perform Direct IO, but it is still part of the cluster and it will use block level redirected IO for reads and writes.

The next diagram shows how File System Redirected IO requests flow. The diagram and data flow for the redirected IO are very similar to those for the metadata in Figure 6:

Figure 8 File System Redirected IO Flow

Later we will discuss when CsvFs uses the file system redirected IO to handle reads and writes, and how it compares to what we see in the next diagram, Block Level Redirected IO:

Figure 9 Block Level Redirected IO Flow

Note that on this diagram I have completely removed the CsvFs stack and the CSV NTFS stack from the Coordinator Node, leaving only the system volume NTFS stack. The CSV NTFS stack is removed because Block Level Redirected IO completely bypasses it and goes to the disk (yes, like Direct IO, it bypasses the volume stack and goes straight to the disk) below the NTFS stack. The CsvFs stack is removed because on the coordinating node CsvFs would never use Block Level Redirected IO and would always talk to the disk directly. The reason Node 3 uses redirected IO is that Node 3 does not have physical connectivity to the disk.

A curious reader might wonder why Node 1, which can see the disk, would ever use Block Level Redirected IO. There are at least two cases when this might happen. First, although the disk might be visible on the node, it is possible that IO requests will fail because the adapter or storage network switch is misbehaving. In this case, CsvVbus will first attempt to send IO to the disk and, on failure, will forward the IO to the Coordinator Node using Block Level Redirected IO. The other example is Storage Spaces: if the disk is a Mirrored Storage Space, then CsvFs will never use Direct IO on a data server node, but instead will send the block level IO to the Coordinating Node using Block Level Redirected IO. In Windows Server 2012 R2 you can use the Get-ClusterSharedVolumeState cmdlet (http://technet.microsoft.com/en-us/library/dn456528.aspx) to query the CSV state (direct / file system redirected / block level redirected), and if redirected it will state why.
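As a sketch of querying that state with the Get-ClusterSharedVolumeState cmdlet mentioned above ("Cluster Disk 1" is an illustrative resource name):

```powershell
# Show, per node, whether IO to the CSV volume is direct or redirected,
# and the reason for any redirection
Get-ClusterSharedVolumeState -Name "Cluster Disk 1" |
    Format-List Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason
```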

Note that CsvFs sends the Block Level Redirected IO to the CsvNsFlt filter attached to the system volume stack on the Coordinating Node. This filter dispatches the IO directly to the disk, bypassing NTFS and the volume stack, so no other filters below CsvNsFlt on the system volume will see that IO. Since CsvNsFlt sits at a very high altitude, in practice no one besides this filter will see these IO requests. This IO is also completely invisible to the CSV NTFS stack. You can think of Block Level Redirected IO as Direct IO that CsvVbus ships to the Coordinating Node, where with the help of CsvNsFlt it is dispatched directly to the disk, just as Direct IO is dispatched directly to the disk by CsvVbus.

What are these SMB shares?

CSV uses the Server Message Block (SMB) protocol to communicate with the Coordinator Node. As you know, SMB3 requires certain configuration to work; for instance, it requires file shares. Let's take a look at how the cluster configures SMB to enable CSV.

If you dump the list of SMB file shares on a cluster node with CSV volumes you will see the following:

> Get-SmbShare
Name                          ScopeName                     Path                          Description
----                          ---------                     ----                          -----------
ADMIN$                        *                             C:\Windows                    Remote Admin
C$                            *                             C:\                           Default share
ClusterStorage$               CLUS030512                  C:\ClusterStorage             Cluster Shared Volumes Def...
IPC$                          *                                                           Remote IPC

There is a hidden admin share that is created for CSV, shared as ClusterStorage$. This share is created by the cluster to facilitate remote administration. You should use it in the scenarios where you would normally use an admin share on any other volume (such as D$). This share is scoped to the Cluster Name. The Cluster Name is a special kind of Network Name that is designed to be used to manage a cluster. You can learn more about Network Names in the following blog post: http://blogs.msdn.com/b/clustering/archive/2009/07/17/9836756.aspx. You can access this share using the Cluster Name: \\<cluster name>\ClusterStorage$
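For example, using the cluster name from the Get-SmbShare output above, a remote administrator could browse the CSV namespace through this share:

```powershell
# Browse the CSV namespace remotely through the cluster-scoped admin share
# (CLUS030512 is the cluster name from the sample output above)
Get-ChildItem \\CLUS030512\ClusterStorage$
```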

Since this is an admin share, it is ACL'd so that only members of the Administrators group have full access. In the output below, the access control list is defined using the Security Descriptor Definition Language (SDDL). You can learn more about SDDL here: http://msdn.microsoft.com/en-us/library/windows/desktop/aa379567(v=vs.85).aspx

ShareState            : Online
ClusterType           : ScaleOut
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           : Cluster Shared Volumes Default Share
EncryptData           : False
Name                  : ClusterStorage$
Path                  : C:\ClusterStorage
Scoped                : True
ScopeName             : CLUS030512
SecurityDescriptor    : D:(A;;FA;;;BA)

There are also a couple of hidden shares that are used by CSV. You can see them if you add the IncludeHidden parameter to the Get-SmbShare cmdlet. These shares are used only on the Coordinator Node; other nodes either do not have these shares, or the shares are not used:

> Get-SmbShare -IncludeHidden
Name                          ScopeName                     Path                          Description
----                          ---------                     ----                          -----------
17f81c5c-b533-43f0-a024-dc... *                             \\?\GLOBALROOT\Device\Hard...
ADMIN$                        *                             C:\Windows                    Remote Admin
C$                            *                             C:\                           Default share
ClusterStorage$               VPCLUS030512                  C:\ClusterStorage             Cluster Shared Volumes Def...
CSV$                          *                             C:\ClusterStorage
IPC$                          *                                                           Remote IPC

For each Cluster Shared Volume hosted on a coordinating node, the cluster creates a share with a name that looks like a GUID. This is used by CsvFs to communicate with the hidden CSV NTFS stack on the coordinating node. This share points to the hidden NTFS volume used by CSV. Metadata and File System Redirected IO flow to the Coordinating Node using this share.

ShareState            : Online
ClusterType           : CSV
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           :
EncryptData           : False
Name                  : 17f81c5c-b533-43f0-a024-dc431b8a7ee9-1048576$
Path                  : \\?\GLOBALROOT\Device\Harddisk2\ClusterPartition1\
Scoped                : False
ScopeName             : *
SecurityDescriptor    : O:SYG:SYD:(A;;FA;;;SY)(A;;FA;;;S-1-5-21-2310202761-1163001117-2437225037-1002)
ShadowCopy            : False
Special               : True
Temporary             : True 

On the Coordinating Node you will also see a share with the name CSV$. This share is used to forward Block Level Redirected IO to the Coordinating Node. There is only one CSV$ share on every Coordinating Node:

ShareState            : Online
ClusterType           : CSV
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           :
EncryptData           : False
Name                  : CSV$
Path                  : C:\ClusterStorage
Scoped                : False
ScopeName             : *
SecurityDescriptor    : O:SYG:SYD:(A;;FA;;;SY)(A;;FA;;;S-1-5-21-2310202761-1163001117-2437225037-1002)
ShadowCopy            : False
Special               : True
Temporary             : True

Users are not expected to use these shares; they are ACL'd so that only Local System and the Failover Cluster Identity user (CLIUSR) have access.

All of these shares are temporary: information about them is not kept in any persistent storage, and when a node reboots they will be removed from the Server Service. The cluster takes care of creating the shares every time during CSV start up.

Conclusion

You can see that Cluster Shared Volumes in Windows Server 2012 R2 is built on the solid foundation of the Windows storage stack, CSVv1, and SMB3.

Thanks!
Vladimir Petter
Principal Software Development Engineer
Clustering & High-Availability
Microsoft

  • Thanks for the great post!

  • Thanks for a great post - very informative.

    Can you also blog about Resume Key Filter and svhdxflt?

    Looking forward to more such blog posts.

  • Is it possible to set a preference for the CSV Coordinator Node, similar to the Preferred Nodes for other cluster resources?

    The scenario I'm talking about is where we have a stretched Scale Out File Cluster with 4 nodes, 2 of which are full read and write and the two on the other site are just read. The SAN automatically swaps the direction based upon the Owner Node via the HP Cluster Extension resource on which the CSV's are dependant.

    We have a scenario where we have VM's that are preferred to run on the Hyper-V Servers on one site, but can run on the other if that site goes down.

    At present, the automatic levelling out of the SOFC means that sometimes we have VM's on there preferred site having redirected IO to the other site. This isn't a major issue at present as the two InfiniBand Fabrics are connected by Dual 40GB Ethernet.

    That said, it's not ideal and would be much better if we could configure a preferred owner nodes setting so they don't level out to the remote site.

    If there is a solution to this I'd really appreciate hearing about it.

  • Question: Is it possible to set a preference for the CSV Coordinator Node, similar to the Preferred Nodes for other cluster resources?

    Answer:

    In Windows Server 2012 we explicitly were prohibiting setting preferred nodes. We have removed this block in Windows 2012 R2. It would not help with your scenario. Setting this property on CSV group will only constrain where CSV would place physical disk resource (what node can be coordinating), but cluster still would build CSV volumes on all cluster nodes. Setting this property on group with SOFS and DNN will only suggest that leader should be on preferred nodes, but clients still will be able to connect to any node in the cluster.

    You might be able to achieve what you want by setting possible owners on the Scale Out File Server and Distributed Network Name group. Please note that you need to set possible owners, not preferred owners (blogs.msdn.com/.../9000092.aspx). In that case DNN will not register in DNS the IP addresses of the nodes that are not in the possible owner list, and SOFS will not plumb SMB shares on these nodes. The CSV namespace for local access will still be built on all nodes. On failover from one site to another you would need to change possible owners.

  • We are seeing in Server 2012 R2 that when a node comes online and the CSVs are redistributed, 95% of the time the CSV ends in a failed state, crashing any associated VMs. We are using iSCSI storage. This never seemed to happen with Server 2012. We disabled ODX as our storage does not support it. We are at our wits' end. Do you have any ideas? This auto balancing is auto crashing for us.

  • @Bryant

    Hi Bryant, very few issues reported with CSV balancer in WS.2012 R2. You can check if this is the issue by turning off the CSV Balancer to see if the issue remains: (Get-Cluster).CSVBalancer=0.

  • That example looks very interesting.

    Suppose Node3 was at a disaster recovery (DR) site and the main datacenter hosting Node1 and Node2 was totally lost.

    I believe this cluster (as is) would still be lost, as Node3 would no longer have network access to the shared disk (located at the main datacenter).

    So, could you also replicate the shared disk to a sever at the DR site as Node3 and have Node3 use this copy in the event of a disaster at the main site ?

    If so, what the best way to set this up in Windows 2012 R2.

    We have the option of using SAN replication to replicate LUNs between SANs at a block level, if that helps.

    Thanks

    Mark.

  • Mark, CSVFS supports disk replication. You need to consult with the provider of your storage replication solutions about particular details, and if they have support for CSV. One thing to keep in mind is that depending on architecture of replication is might or might not be Direct IO friendly. For instance if replication is done by a filter attached to the disk/volume on the coordinator node then for this solution to work correctly we need to send all IO to the coordinator node, and all other nodes would need to be in File System Redirected mode or Block Redirected mode.

  • Vladimir

    This is a very good read ; shed light to many of our concerns.  we have a design in-place and I am testing performance of the system .

    We are using storage tiering on a Mirrored storage space. This is causing all CSV nodes to be "FileSystemRedirected". When I stressed CSV with more IOPS, I noticed that even on the CSV coordinator node some IOPS go over the network. Is this a bug, or is it by design?

  • To Ali: "some IOPS go over network ; Is this a bug ? or it's by design ? "

    The coordinator node would never talk to NTFS over SMB. Perhaps other nodes are getting some load. You can look at the performance counters to better understand where the traffic is coming from.

    Check out these two blog posts

    Cluster Shared Volume Diagnostics

    blogs.msdn.com/.../10507826.aspx

    Cluster Shared Volume Performance Counters

    blogs.msdn.com/.../10531462.aspx

    If the performance counters do not answer your question you can use netmon, procmon, or the Windows Performance Toolkit. These tools allow you to see individual IOs on a node.

  • I correct my previous comment: IOs are going over the network only on data server nodes.

  • I have a two Hyper-V 2012 R2 hosts connected to an iSCSI SAN using two CSV LUNS.

    Is it supported at the host level to copy data from one CSV LUN to the other (C:\ClusterStorage\LUN1 to C:\ClusterStorage\LUN2) while at the same time, on the other host, copying data from C:\ClusterStorage\LUN2 to C:\ClusterStorage\LUN1?

    Thanks

    Eamonn

  • In a scenario with a low-cost 2-node cluster with a shared JBOD we use Storage Spaces for cluster storage and 4x1GbE network between the nodes. Cluster is used for virtual machines. We see quite slow Block Redirected IO on one of the nodes, which is nicely explained in the article. But how to optimize disk performance?

    Should it be 2 CSV volumes on one large pool of disks or should it be 2 CSV volumes on 2 different pools of disks? Does it make sense to have each CSV volume owned by each node, and to host some VM's on one CSV volume being owned by node 1, and to host other VM's on another CSV volume being owned by node 2? We can assign preferred owners for VM's but we cannot assign preferred owners for CSV volumes. How then we can ensure that we have direct IO on both nodes? Are there any best practices?

  • To: Eamonn

    "Is it supported at the host level to copy data from one CSV LUN to the other (C:\ClusterStorage\LUN1 to C:\ClusterStorage\LUN2) while at the same time, on the other host, copying data from C:\ClusterStorage\LUN2 to C:\ClusterStorage\LUN1?"

    Yes, that is supported.

  • To: Alexey

    Q: "We see quite slow Block Redirected IO on one of the nodes, which is nicely explained in the article. But how to optimize disk performance?:

    A: You need to find the bottleneck in your system. First run a perf test directly against the space disk on your local machine and verify that the performance numbers match your expectations. If they do, then next try to run the performance test over SMB from one cluster node to another. If you see a performance drop then your network is the problem. To improve the network you can either use RDMA or add more NICs and rely on SMB Multichannel to load balance traffic across the NICs. Check out this blog post that explains how to use performance counters to find a bottleneck: blogs.msdn.com/.../10531462.aspx

    Q: "Should it be 2 CSV volumes on one large pool of disks or should it be 2 CSV volumes on 2 different pools of disks?"

    A: If partitioning pool into 2 spaces does not become a management problem I would consider that because by having more CSV disks spread across multiple cluster nodes you will have better load balancing.

    Q: "Does it make sense to have each CSV volume owned by each node, and to host some VM's on one CSV volume being owned by node 1, and to host other VM's on another CSV volume being owned by node 2?"

    A: It certainly should work, but make sure your network is not a bottleneck. If your network is not a bottleneck then you can host VMs on MDS as well as on DS nodes, load balancing memory and CPU utilization by the VMs.

    Q: "We can assign preferred owners for VM's but we cannot assign preferred owners for CSV volumes."

    A: I believe in Windows Server 2012 R2 we allow assigning preferred owners for CSV groups; W2012 did not allow that. Also in W2012 R2 we have a CSV load balancer that makes sure CSV disks are evenly distributed across all cluster nodes.
