After the last two blog entries, our worker process's main loop feeds back into Azure's health checking system. If we go unhealthy, Azure will notice and will eventually restart the worker role. However, this is a bit heavy-handed: what if the failure was a temporary network error, a poison message or some other bug? Do we really want our node to die so easily? By refactoring our code slightly we can make our process far more robust and harder to kill than Chuck Norris!

Step 1, we need an interface - remember: every computer science problem can be solved by adding a new level of abstraction; but beware: a new level of abstraction will introduce a new problem.

Let's call it IProcessMessages - I prefer interface names that are grammatically correct - IDoThings, IHateThings, IDontGetOutMuch:

    public interface IProcessMessages 
    {
        bool Process ();
    }

Here, Process returns true if it found a message and false if the queue was empty; we need this to determine whether the node should have a quick sleep before checking the queue again.
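To make the contract concrete, here is a minimal sketch of an implementation backed by an in-memory queue rather than Azure storage. `InMemoryProcessor` and its `Handled` list are hypothetical names used only for illustration (handy for exercising the main loop in a unit test); the interface is repeated so the snippet stands alone.

```csharp
using System.Collections.Generic;

public interface IProcessMessages
{
    bool Process();
}

// Hypothetical stand-in for a queue-backed processor; not part of any Azure library.
public class InMemoryProcessor : IProcessMessages
{
    private readonly Queue<string> messages;

    // Messages that have been taken off the queue, in order.
    public List<string> Handled { get; private set; }

    public InMemoryProcessor(IEnumerable<string> initialMessages)
    {
        messages = new Queue<string>(initialMessages);
        Handled = new List<string>();
    }

    public bool Process()
    {
        if (messages.Count == 0)
        {
            // Queue empty: the caller should sleep before polling again.
            return false;
        }

        Handled.Add(messages.Dequeue());

        // A message was handled: the caller can poll again straight away.
        return true;
    }
}
```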

We can now create some processor types:

    public class QueueProcessor : IProcessMessages {

        MessageQueue queue;

        public QueueProcessor ()
        {
            QueueStorage qs = QueueStorage.Create (
                      StorageAccountInfo.GetDefaultQueueStorageAccountFromConfiguration());
            queue = qs.GetQueue ("myqueue");
        }

        public bool Process()
        {
            Message msg = queue.GetMessage ();
            if (msg != null) {
                ProcessMessage (msg);
            }
            return msg != null;
        }
    }

We've created an instance type and the initialisation of the queue is moved into the constructor. The creation of a queue is an expensive operation, as GetQueue involves a trip to the storage system, so we've moved it out of our processing loop. We can now modify our main loop to look like:

    public override void Start()
    {
        QueueProcessor p = new QueueProcessor ();

        while (true)
        {            
            // Updating the date time in the main loop
            lock (dateTimeLock)
            {
                lastThreadTest = DateTime.Now;
            }            
            
            if (p.Process () == false) {
                Thread.Sleep(DEFAULT_SLEEP_TIME);
            }
        }
    }

Functionally we are now pretty much back to what we had at the end of Part 2. Time for our bulletproof vest ... we can now update our main loop to catch any exceptions thrown from the Process method. If we catch an exception, we can discard the current processor, create a new one and carry on as though nothing happened. This isn't going to work for all errors, and normally I wouldn't recommend catching exceptions that you can't explicitly deal with, as some of them could be unrecoverable. However, we have Azure's status check watching our back. If we don't update our lastThreadTest timestamp in the event of an exception, repeated failures will eventually make us go unhealthy and Azure will restart us. So our main loop looks like:

    public override void Start()
    {
        QueueProcessor p = new QueueProcessor ();

        while (true)
        {            
            try {
                if (p.Process () == false) {
                    Thread.Sleep(DEFAULT_SLEEP_TIME);
                }

                // Update the date time only if no error occurred
                lock (dateTimeLock)
                {
                    lastThreadTest = DateTime.Now;
                }            

            } catch (Exception ex) {
                RoleManager.WriteToLog("Error",
                    "Error occurred, trying again: "
                    + ex.ToString());
                p = new QueueProcessor ();
            }
        }
    }
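One gap in the loop above: `new QueueProcessor()` in the catch block can itself throw (for example while the network is still down), and that exception would escape `Start`. A minimal sketch of one way around it is to create the processor lazily inside the try block. `ResilientRunner`, `RunOnce` and the factory delegate are hypothetical names introduced for this illustration, and logging is left out to keep the snippet self-contained.

```csharp
using System;

public interface IProcessMessages
{
    bool Process();
}

// Hypothetical helper: runs one iteration of the main loop, creating the
// processor inside the try block so that constructor failures are caught too.
public class ResilientRunner
{
    private readonly Func<IProcessMessages> createProcessor;
    private IProcessMessages processor; // null means "needs (re)creating"

    public ResilientRunner(Func<IProcessMessages> createProcessor)
    {
        this.createProcessor = createProcessor;
    }

    // Returns true if the iteration completed without an exception.
    // foundMessage reports what Process() returned (false on failure).
    public bool RunOnce(out bool foundMessage)
    {
        foundMessage = false;
        try
        {
            if (processor == null)
            {
                processor = createProcessor();
            }

            foundMessage = processor.Process();
            return true;
        }
        catch (Exception)
        {
            // A real loop would log here. Drop the processor so the next
            // call rebuilds it inside the try block.
            processor = null;
            return false;
        }
    }
}
```

In the main loop, the factory would be something like `() => new QueueProcessor()`, and the health timestamp would only be updated when `RunOnce` returns true.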

We might want to adjust the catch block to perform a sleep, just so we don't spin at 100% CPU in the event of repeated errors. This could even use a back-off algorithm, so the first couple of retries have no sleep, then 100ms, 500ms, and so on, so that longer-lasting errors (network?) won't keep us too busy throwing exceptions. The easiest way to do a back-off is to declare an array of sleep times:

    // Sleep times in milliseconds (arrays can't be const in C#, hence static readonly)
    private static readonly int [] SLEEP_TIMES = { 0, 0, 100, 500, 1000, 2000, 5000, 10000 };

and have a local variable keep track of the current sleep time index:

        // Index into SLEEP_TIMES, declared before the loop
        int sleepTimeIndex = 0;

        while (true)
        {            
            try {
                if (p.Process () == false) {
                    Thread.Sleep(DEFAULT_SLEEP_TIME);
                }

                // Update the date time only if no error occurred
                lock (dateTimeLock)
                {
                    lastThreadTest = DateTime.Now;
                }            

                // reset the sleep time
                sleepTimeIndex = 0;

            } catch (Exception ex) {
                RoleManager.WriteToLog("Error",
                    "Error occurred, trying again: "
                    + ex.ToString());
                p = new QueueProcessor ();
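                // Sleep using our back-off, then move on to the next (longer) sleep time,
                // staying on the last entry once we reach the end of the array.
                // (sleepTimeIndex++ inside Math.Min would never advance the index.)
                Thread.Sleep (SLEEP_TIMES [sleepTimeIndex]);
                sleepTimeIndex = Math.Min (sleepTimeIndex + 1, SLEEP_TIMES.Length - 1);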
            }
        }
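If the index bookkeeping starts to clutter the loop, it could be pulled into a small helper. This is only a sketch: `BackOff`, `NextDelay` and `Reset` are hypothetical names, and the table simply repeats the sleep times from above.

```csharp
using System;

// Hypothetical helper that wraps the sleep-time table and its current index.
public class BackOff
{
    // Sleep times in milliseconds, same values as the array in the post.
    private static readonly int[] SleepTimes = { 0, 0, 100, 500, 1000, 2000, 5000, 10000 };

    private int index;

    // Returns the delay for the current failure, then advances the index,
    // staying on the last entry once the table is exhausted.
    public int NextDelay()
    {
        int delay = SleepTimes[index];
        index = Math.Min(index + 1, SleepTimes.Length - 1);
        return delay;
    }

    // Call after a successful iteration to start again from the first entry.
    public void Reset()
    {
        index = 0;
    }
}
```

The catch block would then call `Thread.Sleep(backOff.NextDelay())`, and the success path would call `backOff.Reset()`.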

We can now survive errors with a more appropriate response and only rely on a restart in the event of repeated errors.

Neil