I just submitted the final (photo-ready) version of a new paper called "Building on Quicksand" to the Conference on Innovative Database Research.  This is a paper I coauthored with my friend, Dave Campbell.  We submitted an earlier version (to try to get into the conference) in October, I wrote a presentation on this for TechEd EMEA, and now we have a new and longer full paper complete.  I like this version best of all. 

- Pat

Building on Quicksand

Reliable systems have always been built out of unreliable components.  Early on, the reliable components were small such as mirrored disks or ECC (Error Correcting Codes) in core memory. These systems were designed such that failures of these small components were transparent to the application. Later, the size of the unreliable components grew larger and semantic challenges crept into the application when failures occurred.

Fault tolerant algorithms comprise a set of idempotent sub-algorithms.  Between these idempotent sub-algorithms, state is sent across the failure boundaries of the unreliable components.  The failure of an unreliable component can then be tolerated as a takeover by a backup, which uses the last known state and drives forward with a retry of the idempotent sub-algorithm.  Classically, this has been done in a linear fashion (i.e. one step at a time).

As the granularity of the unreliable component grows (from a mirrored disk to a system to a data  center), the latency to communicate with a backup becomes unpalatable.  This leads to a more relaxed model for fault tolerance.  The primary system will acknowledge the work request and its actions without waiting to ensure that the backup is notified of the work.  This improves the responsiveness of the system because the user is not delayed behind a slow interaction with the backup.

There are two implications of asynchronous state capture:

1)   Everything promised by the primary is probabilistic.  There is always a chance that an untimely failure shortly after the promise results in a backup proceeding without knowledge of the commitment.  Hence, nothing is guaranteed!

2)   Applications must ensure eventual consistency.  Since work may be stuck in the primary after a failure and reappear later, the processing order for work cannot be guaranteed.

Platform designers are struggling to make this easier for their applications.  Emerging patterns of eventual consistency and probabilistic execution may soon yield a way for applications to express requirements for a “looser” form of consistency while providing availability in the face of ever larger failures. As we will also point out in this paper, the patterns of probabilistic execution and eventual consistency are applicable to intermittently connected application patterns.  

This paper recounts portions of the evolution of these trends, attempts to show the patterns that span these changes, and talks about future directions as we continue to “build on quicksand”.