This is a summary of a relatively long thread from the Microsoft internal SQL Azure discussion alias.  I basically stole some ideas and put them together in an implementation that works well for our solution.


Our solution
As I mentioned in my previous post, we released the Social eXperience Platform (SXP) in April of this year.  We heavily instrumented our code as Windows Azure operations was new to us (Design for Operations is a good practice regardless of the newness of the platform).  Our solution consists of a web service running as a web role on Windows Azure compute and a set of SQL Azure databases.  We are a multi-tenant solution and chose to use a SQL Azure database per tenant (more on that in a future post).

We have been live for 8 weeks and have delivered over 4 - 9s of uptime - on a version 1 release on new technology.  Other than the SQL retry code and caching our configuration data, every call to our service is at least one round trip to SQL Azure.  We intentionally chose not to build in redundancy at the aplication level, but to rely on the platform.  So far, we're extremely pleased with Windows Azure and SQL Azure, but our goal is 100% availability, so we keep tweaking. 


The problem
One of the things that SQL Azure does to deliver high availability is it sometimes closes connections. SQL Azure does some pretty cool stuff under the covers to minimize the impact, but this is a key difference in SQL Azure development vs. SQL Server development.


When a connection gets closed, the vast majority of the time, closing and opening the connection object solves the problem.  Sometimes, you have to wait a few hundred ms to a few seconds.  Very seldom does the retry totally fail.


These retries seem to come in waves and very seldom affect more than one of our databases at a time.  This makes sense because SQL Azure spreads databases independently.  The odds of having multiple databases on the same virtual or physical hardware is remote – unless you have hundreds or thousands of databases.


We know all of this because we heavily instrumented our code.  Design for operations is very important, particularly when the platform is new to you.


A solution
Your typical ADO.NET data access code looks something like this (note that if you’re using Linq, Entity Framework, or other higher level data libraries, something like this is likely happening under the covers):


try
{
    using (SqlConnection conn = new SqlConnection(sqlConnectionString))
    {
        using (SqlCommand cmd = new SqlCommand(sqlStatement, conn))
        {
            conn.Open();

            using (SqlDataReader dr = cmd.ExecuteReader())
            {
                while (dr.Read())
                {
                }
            }
        }
    }
}

catch (SqlException ex)
{
  SxpLog.WriteSqlException(ex, "some", "other", "data");
}


One approach is to put retry logic around the entire try – catch block.  A simple for loop with a sleep does the trick.  For reads, this works just fine, but for writes, you have to make sure that the call really failed and deal with it accordingly.  That is much more difficult to generalize.  You also have to repeat the logic unless you're using Lambda expressions or something similar.


Another solution
What we’ve noticed is that all of our failures are due to connection issues.  conn.Open() doesn’t fail, but rather the first SQL statement we try to execute fails.  Remember that our “client” is running in the same Azure data center as our database.  Remote connections could cause the command execute or the results processing to fail.  Our queries are very small and very fast (by design), so I doubt we will ever see many failures outside of connect.  Larger queries could fail.  Large, remote queries seem the most likely to fail in the execute or results processing.  Adjust your approach accordingly.


Something else we wanted to do is get the SQL Azure context information.  This information helps the SQL Azure team debug any issues you run into.  There is a great blog post on that here.


We think we came up with an elegant solution to both issues.  Here’s how the generic ADO.NET code is affected:


string sqlContext = string.empty;

try
{
    using (SqlConnection conn = new SqlConnection(sqlConnectionString))
    {
        using (SqlCommand cmd = new SqlCommand(sqlStatement, conn))
        {
            sqlContext = GetSqlContextInfo(conn);

            using (SqlDataReader dr = cmd.ExecuteReader())
            {
                while (dr.Read())
                {
                }
            }
        }
    }
}

catch (SqlException ex)
{
  SxpLog.WriteSqlException(ex, sqlContext, "some", "other", "data");
}


As you can see, we made 3 code changes and kept the generalization.  Our first line of code declares the sqlContext variable.  We do this outside the try block for scoping.  The second change is the replacement of conn.Open() with sqlContext = GetSqlContextInfo(conn).  GetSqlContextInfo encapsulates our retry logic, opens the connection, and retrieves our SQL context from the server.  This actually works out well because conn.Open() doesn’t fail – we have to issue a command execute in order to determine if the connection is valid.  Rather than "waste" a roundtrip, we do something useful.  The 3rd change is that we added the sqlContext to our exception log.  Again, this will help the SQL Azure team when debugging.


GetSqlContextInfo
public string GetSqlContextInfo(SqlConnection conn)
{
    string sqlContext = string.Empty;

    // Omitted the code to read these from configuration
    int sqlMaxRetries = 4;
    int sqlRetrySleep = 100;
    int sqlMaxSleep = 5000;
    int sqlMinSleep = 10;

    // start a timer
    TimeSpan ts;
    DateTime dt = DateTime.UtcNow;

    for (int retryCount = 0; retryCount <= sqlMaxRetries; retryCount++)
    {
        try
        {
            conn.Open();

            // get the SQL Context and validate the connection is still valid
            using (SqlCommand cmd = new SqlCommand("SELECT CONVERT(NVARCHAR(36), CONTEXT_INFO())", conn))
            {
                sqlContext = cmd.ExecuteScalar().ToString();
            }

            ts = DateTime.UtcNow - dt;

            // log opens that take too long
            if (ts.TotalMilliseconds >= 90)
            {
                SxpLog.WriteKnownEvent(8001, ts.TotalMilliseconds, "Connect", conn.DataSource, conn.Database, conn.WorkstationId, sqlContext);
            }

            break;
        }

        catch (SqlException ex)
        {
            conn.Close();
            SqlConnection.ClearPool(conn);

            if (retryCount < sqlMaxRetries)
            {
                SxpLog.WriteSqlRetry(5902, ex, conn.DataSource, conn.Database, conn.WorkstationId, "Connect");

                // don't sleep on the first retry
                // Most SQL Azure retries work on the first retry with no sleep
                if (retryCount > 0)
                {
                    // wait longer between each retry
                    int sleep = (int)Math.Pow(retryCount + 1, 2.0) * sqlRetrySleep;

                    // limit to the min and max retry values
                    if (sleep > sqlMaxSleep)
                    {
                        sleep = sqlMaxSleep;
                    }
                    else if (sleep < sqlMinSleep)
                    {
                        sleep = sqlMinSleep;
                    }

                    // sleep
                    System.Threading.Thread.Sleep(sleep);
                }
            }
            else
            {
                // Log the exception
                SxpLog.WriteSqlException(ex, conn.DataSource, conn.Database, conn.WorkstationId, "Connect");

                // we thought about rethrowing the exception, but chose not to
                // this will give us one more chance to execute the request - we might get lucky ...

                // sleep
                System.Threading.Thread.Sleep(sqlMaxSleep);
            }
        }
    }

    // we log this value and null might cause an issue
    if (sqlContext == null)
    {
        sqlContext = string.Empty;
    }

    return sqlContext;
}


Notice that we don't sleep on our first retry and we sleep longer on each subsequent retry.  This gives a good balance between recovering quickly and giving SQL Azure time to re-open the connection.


We debated re-throwing the exception on retry failure, but decided not to.  Our next statement is a command execute and it will fail if the connection isn’t ready.  There’s a chance SQL Azure might recover between our last retry and our command execute.  We don’t bother checking the return value as we want to take that last shot at success.


We’ve had this change in production for a couple of days now and, so far, so good.  It's definitely an improvement over our previous logic, but only time will tell if it needs to be tweaked some more.  Again, this might not be 100% right for your solution, particularly if you have remote clients and large, long-running queries, but hopefully it gives you some ideas.  If you run into any bugs, please let me know and I’ll send you a Dove bar.

Please see an update to this post here.