Building a decoupled, queue-based system gives you the ability to scale and the opportunity to create a highly available application. By dispatching work to multiple back-end worker roles we are building a system that can survive unfortunate events like bugs, exceptions, hardware failure, fire, flood, pestilence and the other horsemen of the developer apocalypse - ok, maybe I'm getting carried away and our application won't need to survive all that, but you get the idea. If one of our worker roles dies, another can take its place. Due to the reliable nature of the queues, anything it was working on will be reprocessed by another node and all is well.

Azure includes a mechanism for checking on the health of your nodes. ASP.NET already has this functionality built in, but for a worker role we need to roll it ourselves. All worker roles derive from RoleEntryPoint, which defines a method that Azure will call periodically to determine whether your role is working. The default implementation looks like:

    public override RoleStatus GetHealthStatus()
    {
        // This is a sample worker implementation. Replace with your logic.
        return RoleStatus.Healthy;
    }

That's all well and good if our app is perfect and bug free, but in reality we are going to want to make this a little bit smarter. There are a variety of things that we can return from this function, but for the most part all we will be interested in is Healthy or Unhealthy. A sensible implementation would be to get our main worker process to periodically update a timestamp to show that it is working correctly, and to check this timestamp in GetHealthStatus. So let's build ourselves a main processing loop for a worker role.

    while (true)
    {
        // Report in
        lastHealthyReport = DateTime.Now;

        // Get the next message to work on
        Message msg = queue.GetMessage();

        if (msg != null)
        {
            // process message
            // ...

        }
        else
        {
            // no messages waiting, so sleep for a while
            Thread.Sleep(DEFAULT_SLEEP_TIME);
        }
    }

There are a couple of things going on here:

  1. At the start of each loop we record the current time in a DateTime member variable. This is effectively a record of the last time the main loop was known to be in a healthy state.
  2. If there are no messages waiting on the queue, we sleep - this ensures that we aren't constantly burning up our CPU looking for work when there isn't any. This reduces the power consumption of our app and will presumably reduce the cost of our Azure deployment - I say presumably because any costing details are yet to be announced.
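One thing worth keeping in mind: the main loop never returns, so GetHealthStatus is presumably called on a different thread from the one writing the timestamp. That is an assumption on my part, not something I have confirmed. If it holds, it may be safer to store the timestamp in a way that can be read and written atomically. The Heartbeat class below is a hypothetical helper of my own (it is not part of the Azure SDK) that sketches one way to do this, keeping the time as UTC ticks in a long and going through Interlocked:

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the Azure SDK: a timestamp that one thread
// can write while another thread reads it, without tearing.
public sealed class Heartbeat
{
    // Ticks (UTC) of the last report, held as a long so Interlocked can be used.
    private long lastReportTicks = DateTime.UtcNow.Ticks;

    // Called by the main loop at the top of each iteration.
    public void ReportIn()
    {
        Interlocked.Exchange(ref lastReportTicks, DateTime.UtcNow.Ticks);
    }

    // The time of the last report, in UTC.
    public DateTime LastReport
    {
        get
        {
            return new DateTime(Interlocked.Read(ref lastReportTicks), DateTimeKind.Utc);
        }
    }

    // True if the main loop has reported in within the allowed window.
    public bool IsAlive(TimeSpan maxSilence)
    {
        return DateTime.UtcNow - LastReport <= maxSilence;
    }
}
```

With something like this in place, the main loop would call ReportIn() instead of assigning to lastHealthyReport directly, and GetHealthStatus would ask IsAlive with whatever window we consider acceptable.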

Now when we come to write GetHealthStatus, all we have to do is check lastHealthyReport to ensure that the main thread reported in within an acceptable time.

    public override RoleStatus GetHealthStatus()
    {
        if (lastHealthyReport.AddSeconds(60) < DateTime.Now)
        {
            RoleManager.WriteToLog("Error", 
                "Node going unhealthy, not heard from main loop since " 
                + lastHealthyReport.ToUniversalTime().ToString());
            return RoleStatus.Unhealthy;
        }
        return RoleStatus.Healthy;
    }

The above code assumes that each message should take no longer than 60 seconds to process. (It also assumes that DEFAULT_SLEEP_TIME is comfortably under 60 seconds, since the loop only reports in once per iteration.) If a message takes longer than that, GetHealthStatus will report that the node is unhealthy, allowing Azure to tear down the node and build us another one. Oh, and logging the error in Universal Time (or Greenwich Mean Time if you are British ;-) means that it will be easier to look in any other log files to see what was happening at about the time the node went down.
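If some messages legitimately need more than 60 seconds, one option is to report in during processing rather than only at the top of the loop. Here is a minimal sketch, assuming the work for a message can be broken into smaller steps. LongRunningWork and its parameters are names I have made up for illustration and are not part of any Azure API:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: runs a long job as a series of smaller steps,
// reporting in before each one so the health check sees a fresh timestamp.
public static class LongRunningWork
{
    public static void Process(IEnumerable<Action> steps, Action reportIn)
    {
        foreach (Action step in steps)
        {
            reportIn(); // e.g. () => lastHealthyReport = DateTime.Now
            step();
        }
    }
}
```

Each individual step still has to finish within the 60 second window, so this only helps when the work divides up sensibly. A step that hangs will still, quite rightly, cause the node to be reported as unhealthy.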

Neil.