This is a continuation of last Wednesday's post, which featured the intro section to my forthcoming white paper, "Data Architecture in a Cloudy World".
In that overview, I briefly mentioned the bewildering array of choices facing developers today in the data architecture area. But the array of choices is actually much longer than I was able to describe in a brief overview.
There is not only a wide variety of possible data stores, but there are also a number of choices for physical architecture.
Here is a short list of the possible data stores that could be used in an application:
Different kinds of “NoSQL” databases:
“Big Data” solutions such as HDInsight (aka Hadoop on Windows), with its many components built on top of Hadoop
And here is a list of possible choices for physical architecture:
The above is only a partial listing. PaaS is somewhat circumscribed by the services offered by the cloud service provider. So for example, with Windows Azure you have choices like Windows Azure Table Storage, or SQL Azure Database. On the other hand, the IAAS architecture is virtually unlimited in that you are free to install whatever you want on a VM.
In the interests of full disclosure, while this paper addresses architecture in general, the cloud service provider focus is mainly on Windows Azure.
Another dimension to consider is that for some products, cloud versions and on-premise versions both exist, but the feature sets of the products might not be identical across the two platforms. Two examples of this are Microsoft SQL Server, and HDInsight. For both of these products there is a Windows Azure version, and an on-premise version. Over the long run you can expect the feature sets to converge, but at present you must take feature differences into account.
Two major differences have emerged:
Several consequences flow from the above:
In a distributed system, the basic mechanism that enables complex transactions in relational databases, two-phase commit, no longer works well at extreme scaling. This has been summed up by the “CAP Theorem”, which says that in a large distributed system, you can provide only 2 of the following items, and not all 3:
Consistency refers to the internal consistency of a database’s data. Relational database transactions are expected to satisfy the ACID properties:
With the two-phase commit protocol, you can ensure that your data is always consistent and satisfies the ACID properties. The stereotypical example is moving $10 from your bank’s checking account to the savings account. You first subtract $10 from checking, and next add $10 to savings. If the system goes down, you don’t want $10 to be subtracted from checking only. The “two-phase commit” process ensures that a transaction either completely succeeds, or is discarded.
Availability means that every transaction or request receives a response indicating whether it succeeded or failed.
Partition tolerance means that the overall system is unaffected when any piece of hardware fails, such as a disk, a computer, or an entire rack in the data center. This is ensured by means of massive hardware redundancy. Windows Azure provides at least 3 copies of every resource, and transparently recovers from hardware failures.
The item that has the highest performance cost is “Consistency”, and particularly the two-phase commit process that relational databases use to enforce it. Consequently it is the commonest item to be dropped in large distributed systems, in favor of availability and partition tolerance, since you can only get two out of three. Typically ACID consistency is relaxed to require “eventual consistency”: sooner or later the database is guaranteed to be in a consistent state.
For more information on Distributed Two-Phase Commit and the CAP theorem see: