Hi,

In this series of blog posts we help you design, develop, and debug a Resource DLL that gives your application high availability with Windows Server 2008 and 2008 R2 Failover Clustering.

We recommend that you start with the earlier posts in the series:

·         Part 1: http://blogs.msdn.com/b/clustering/archive/2010/03/11/9976620.aspx

·         Part 2: http://blogs.msdn.com/b/clustering/archive/2010/03/30/9987135.aspx

·         Part 3: http://blogs.msdn.com/b/clustering/archive/2010/04/21/9999736.aspx

In this post we look at the IsAlive and LooksAlive entry points.  To familiarize yourself with these functions, see the following documentation:  http://support.microsoft.com/kb/914458, http://msdn.microsoft.com/en-us/library/aa370496(v=VS.85).aspx and http://msdn.microsoft.com/en-us/library/aa370972(v=VS.85).aspx.

By default, the cluster gives each of these calls up to 5 minutes to complete before declaring the call deadlocked.  LooksAlive is also expected to be a very lightweight check so that it does not affect performance, while IsAlive can be a more detailed health check.  For example, if you are customizing an IIS-based application, the lightweight LooksAlive call may just ping the IIS service, while the detailed IsAlive may check that a specific web page can be opened.
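
The split between a cheap LooksAlive and a thorough IsAlive can be sketched as follows.  This is a minimal, hypothetical C sketch: web_app_resource, query_service_running and fetch_test_page are made-up names whose flags stand in for real probes (querying the IIS service, requesting a specific page), and RESID is defined locally so the code compiles without the Windows SDK.  In a real Resource DLL, RESID comes from resapi.h and both entry points return BOOL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in so the sketch compiles without the Windows SDK.
   In a real Resource DLL, RESID comes from resapi.h. */
typedef void *RESID;

/* Hypothetical per-resource state for an IIS-hosted application.
   The flags simulate what real probes would observe. */
typedef struct {
    bool service_running;   /* would come from querying the IIS service */
    bool page_responds;     /* would come from requesting a specific page */
} web_app_resource;

/* Hypothetical cheap probe: is the IIS service up at all? */
static bool query_service_running(const web_app_resource *r)
{
    return r->service_running;
}

/* Hypothetical expensive probe: can a specific web page be opened? */
static bool fetch_test_page(const web_app_resource *r)
{
    return r->page_responds;
}

/* LooksAlive: polled often, so it only runs the cheap probe. */
static bool looks_alive(RESID id)
{
    const web_app_resource *r = id;
    assert(r != NULL);
    return query_service_running(r);
}

/* IsAlive: can afford the detailed check. The cheap probe runs first
   so an obviously dead service fails fast. */
static bool is_alive(RESID id)
{
    const web_app_resource *r = id;
    assert(r != NULL);
    return query_service_running(r) && fetch_test_page(r);
}
```

With this split, an application whose service is running but whose page no longer loads still passes LooksAlive but fails IsAlive, so the cost of the expensive check is paid only on the call that is typically polled less often.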

There are no published recommendations for how best to use these calls.  It all depends on the nature of your resource and on which health checks are necessary to determine that the resource is responsive.

Here are a few best practices we can suggest:

·         If your LooksAlive/IsAlive call can easily complete within 5 minutes, simply do the work on the caller’s thread.

·         If it might take longer than 5 minutes, then you can do one of the following:

o   In the Online entry point, create a worker thread that monitors the application’s health.  This thread can report a resource failure in one of two ways:

1.       Signaling the failure event handle that Online returns to the cluster (or the event handle reported through SetResourceStatus)

2.       Failing the next IsAlive/LooksAlive call

o   Use the thread pool:

1.       If you do not want to keep a dedicated worker thread around, you can schedule a work item on a thread pool instead.  The Win32 thread pool provides this feature; see the Timer section here: http://msdn.microsoft.com/en-us/library/ms686766(VS.85).aspx.  Keep in mind that you should not occupy a thread pool worker thread for a long time; if the check takes a long time, spawning your own thread is probably a better choice.

2.       Submit a work item from LooksAlive/IsAlive (see the Work section at http://msdn.microsoft.com/en-us/library/ms686766(VS.85).aspx) and use subsequent LooksAlive/IsAlive calls to monitor the progress of this work item.  The same caveat about long-running work items applies.

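3.       Use the Resource DLL’s Open/Close entry points to start and stop one thread per RHS (Resource Hosting Subsystem) instance that monitors all resources of the given type hosted in that RHS (for example, start the thread when the first resource is opened and stop it when the last one is closed).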
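
The first option, a worker thread started from Online whose findings fail the next IsAlive/LooksAlive call, could look roughly like the portable C11 sketch below.  It is an illustration under assumptions, not the Resource DLL API: resource_ctx, slow_health_check, resource_online and resource_offline are hypothetical names, C11 threads and atomics stand in for the Win32 threading calls a real DLL would more likely use, and the app_ok flag simulates the application’s health.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <threads.h>
#include <time.h>

/* Hypothetical polling interval, kept short so the sketch is quick to
   exercise; a real resource would poll less often. */
#define MONITOR_INTERVAL_MS 10

/* Hypothetical per-resource context (e.g. what Open hands back as RESID). */
typedef struct {
    atomic_bool healthy;   /* last result published by the monitor thread */
    atomic_bool stop;      /* set when the resource goes offline */
    atomic_bool app_ok;    /* simulated application state for the demo */
    thrd_t      monitor;
} resource_ctx;

/* Hypothetical slow health check; in a real DLL it might exceed the
   deadlock timeout, which is why it does not run on the caller's thread. */
static bool slow_health_check(resource_ctx *ctx)
{
    return atomic_load(&ctx->app_ok);
}

/* Monitor loop: publish each result; exit on failure or on request. */
static int monitor_main(void *arg)
{
    resource_ctx *ctx = arg;
    const struct timespec interval = {
        .tv_sec = 0, .tv_nsec = MONITOR_INTERVAL_MS * 1000L * 1000L
    };
    while (!atomic_load(&ctx->stop)) {
        bool ok = slow_health_check(ctx);
        atomic_store(&ctx->healthy, ok);
        if (!ok)
            break;          /* failure stays reported until a restart */
        thrd_sleep(&interval, NULL);
    }
    return 0;
}

static void resource_init(resource_ctx *ctx)
{
    atomic_init(&ctx->healthy, false);
    atomic_init(&ctx->stop, false);
    atomic_init(&ctx->app_ok, true);
}

/* Work done from the Online entry point: start the monitor thread. */
static bool resource_online(resource_ctx *ctx)
{
    atomic_store(&ctx->healthy, true);
    atomic_store(&ctx->stop, false);
    return thrd_create(&ctx->monitor, monitor_main, ctx) == thrd_success;
}

/* Work done from Offline/Terminate: stop the monitor and wait for it. */
static void resource_offline(resource_ctx *ctx)
{
    atomic_store(&ctx->stop, true);
    int rc = thrd_join(ctx->monitor, NULL);
    assert(rc == thrd_success);
    (void)rc;
}

/* LooksAlive/IsAlive only read the published result, so they return at
   once, and a failure found by the monitor fails the next call. */
static bool looks_alive(resource_ctx *ctx) { return atomic_load(&ctx->healthy); }
static bool is_alive(resource_ctx *ctx)    { return atomic_load(&ctx->healthy); }
```

Because LooksAlive and IsAlive only read an atomic flag, they return immediately no matter how long the real check takes.  The trade-off is that they report the most recently completed check rather than a fresh one, so the polling interval bounds how quickly a failure is noticed.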

I bet that with this information you can come up with many other ways to solve this design challenge and pick the one that is right for your application.

Also remember that there is another way to tell the cluster that the resource is not healthy: return an event handle to the cluster from the Online call (through its EventHandle parameter).  Your Resource DLL signals this handle when it detects that the resource is not healthy, and the cluster can then take corrective action.
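
The handshake can be modeled in portable C11 as shown below.  Treat this as a hypothetical sketch: failure_event is a stand-in for a Win32 event object, resource_online only mimics how the real Online entry point hands an event to the cluster through its EventHandle out-parameter, and simulated_monitor pretends that the DLL’s monitoring code has detected a failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <threads.h>

/* Portable stand-in for a Win32 event object. A real Resource DLL would
   create an event (e.g. with CreateEvent) and return its HANDLE through
   Online's EventHandle parameter; this type only models that. */
typedef struct {
    mtx_t lock;
    cnd_t cond;
    bool  signaled;
} failure_event;

/* Hypothetical per-resource context owning the failure event. */
typedef struct {
    failure_event failed;
} resource_ctx;

static bool failure_event_init(failure_event *ev)
{
    ev->signaled = false;
    if (mtx_init(&ev->lock, mtx_plain) != thrd_success)
        return false;
    if (cnd_init(&ev->cond) != thrd_success) {
        mtx_destroy(&ev->lock);
        return false;
    }
    return true;
}

static void failure_event_destroy(failure_event *ev)
{
    cnd_destroy(&ev->cond);
    mtx_destroy(&ev->lock);
}

/* Resource side: the DLL's own monitoring code calls this on failure.
   In this sketch the event stays signaled once set. */
static void failure_event_signal(failure_event *ev)
{
    mtx_lock(&ev->lock);
    ev->signaled = true;
    cnd_broadcast(&ev->cond);
    mtx_unlock(&ev->lock);
}

/* Cluster side (modeled): block until the resource reports a failure. */
static void failure_event_wait(failure_event *ev)
{
    mtx_lock(&ev->lock);
    while (!ev->signaled)          /* loop guards against spurious wakeups */
        cnd_wait(&ev->cond, &ev->lock);
    mtx_unlock(&ev->lock);
}

/* Models Online: prepare the event and hand it to the caller through an
   out-parameter, the way the real entry point returns EventHandle. */
static bool resource_online(resource_ctx *ctx, failure_event **event_out)
{
    assert(event_out != NULL);
    if (!failure_event_init(&ctx->failed))
        return false;
    *event_out = &ctx->failed;
    return true;
}

/* Hypothetical monitor thread body: pretend a failure was just detected. */
static int simulated_monitor(void *arg)
{
    failure_event_signal(arg);
    return 0;
}
```

In a real Resource DLL the waiting happens on the cluster side; the DLL’s part is limited to creating the event, returning it from Online, and signaling it from its own monitoring code.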

I hope this series of blog posts will be helpful when you design your own Resource DLL.

Thanks,
Vladimir Petter
Senior Software Development Engineer
Clustering & High-Availability
Microsoft