I'd like to recommend the talk "Resiliency through Failure - Netflix's Approach to Extreme Availability in the Cloud". If you have not heard of Chaos Monkey, it is the main topic of the talk.  The main point is that for a service, you cannot rely purely on testing to ensure quality.  You need to build failure injection into the service or the services it depends on so that you can test the resiliency of your service in production.  This is one of the important ways to ensure that your service has high availability.  In our SQL Azure case, we did a lot of such tests, and we even forced a whole cluster down in order to verify that we could recover from it with no data loss within a given time range.
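
To make the idea of injecting failures concrete, here is a minimal Python sketch. It is not Netflix's Chaos Monkey or anything we used in SQL Azure; the names `FaultInjector`, `InjectedFault`, and `fetch_with_retry` are hypothetical and exist only for illustration. The wrapper fails a configurable fraction of calls to a dependency, so you can check that the calling code actually survives the failure.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure (hypothetical name)."""


class FaultInjector:
    """Wraps calls to a dependency and fails a fraction of them on purpose."""

    def __init__(self, failure_rate, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self.failure_rate = failure_rate
        # A seeded random.Random can be passed in to make runs repeatable.
        self._rng = rng or random.Random()

    def call(self, dependency, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 0.0 never
        # fails and a rate of 1.0 always fails.
        if self._rng.random() < self.failure_rate:
            raise InjectedFault("injected failure before calling dependency")
        return dependency(*args, **kwargs)


def fetch_with_retry(injector, dependency, attempts=3):
    """Caller-side resiliency under test: retry when the dependency fails."""
    last_error = None
    for _ in range(attempts):
        try:
            return injector.call(dependency)
        except InjectedFault as err:
            last_error = err
    # Every attempt failed, so surface the last failure to the caller.
    raise last_error
```

With `failure_rate=0.0` the wrapper is a pass-through, which is how it would normally run; raising the rate in a controlled window is what exercises the recovery path.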