To bring the Windows Live experience to life, we host a variety of interconnected services. These are large-scale services that run on thousands of servers and, because of their scale, cannot be upgraded atomically. One of the challenges is making sure that when one of these services is upgraded, the rest of the system continues to work properly. Upgrades to the full site require a coordinated test effort that often involves running one service against multiple versions of another. To ensure that upgrades go smoothly, Windows Live has several strategies to make sure that all the versions to which our users will be exposed during a release are covered.
Keeping the state throughout upgrades
When testing an upgrade in a stateful system like ours, we need to think about how the new version of the software will interact with preexisting state and data. It is a common mistake to only test a service with data created using the new version, when existing customer data has been around for much longer and has gone through multiple upgrades and migrations. The question is how we replicate this state in our labs, particularly using our automation. Some of our approaches are the following:
Another interesting issue arises when upgrading stateful systems: old versions of client software cannot always support state created with new versions of the service. Testing an upgrade that covers both client and server scenarios requires us to consider the version matrix and test each combination. Often this matrix is reduced by having a way to enforce that a new version of the client software cannot run against an old version of the service code, which means that the transition that we need to test is:
Testing strategies when upgrading multiple services
Upgrading a single service in the cloud has its own level of complexity, but in practice we are often upgrading multiple services at a time. Common issues during upgrades happen when one service is rolled out sooner or later than expected. It is thus important to build a timeline of the full upgrade to understand the dependencies that one service will have on another. Ideally, we would replicate this upgrade procedure in a production-like environment and then run end-to-end application tests. However, this is not always easy to coordinate and execute. It is generally easier to list the different versions of the services that we are planning to run against during the upgrade. With this information, we can build a test matrix to confirm that each service works correctly when running against any of those versions.
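To make the idea of a test matrix concrete, here is a minimal sketch in Python. The service names, version numbers, and rollout order are hypothetical and chosen only for illustration; they do not describe the actual Windows Live services. The sketch enumerates every combination of versions that could coexist during the upgrade, then separates the combinations we expect to pass through if the rollout follows the planned timeline from those that would only appear if a service is rolled out sooner or later than expected.

```python
from itertools import product

# Hypothetical services, each listed with its old version first and new version last.
versions = {
    "storage": ["1.0", "2.0"],
    "auth": ["3.1", "3.2"],
    "frontend": ["7.0", "8.0"],
}


def full_matrix(versions):
    """Return every combination of service versions that could coexist."""
    names = sorted(versions)
    combos = product(*(versions[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def planned_states(versions, rollout_order):
    """Return the combinations visited when services upgrade one at a time."""
    state = {name: vs[0] for name, vs in versions.items()}  # everything on the old version
    states = [dict(state)]
    for name in rollout_order:
        state[name] = versions[name][-1]  # this service moves to its new version
        states.append(dict(state))
    return states


matrix = full_matrix(versions)
planned = planned_states(versions, ["storage", "auth", "frontend"])

# Combinations that only occur if a service is rolled out out of order.
off_plan = [combo for combo in matrix if combo not in planned]

for combo in planned:
    print("planned :", combo)
for combo in off_plan:
    print("off-plan:", combo)
```

The planned states are the natural first priority for end-to-end runs, while the off-plan combinations show what would need coverage to be safe against a rollout that drifts from the timeline.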
One thing to keep in mind is that upgrade test procedures often rely on moving a test lab back and forth from one version to the other. How easy it is to execute upgrade testing depends largely on how easy it is to bring a lab environment to a particular state. If you work on a system that is constantly running through upgrades, or that has a complex upgrade matrix, investing in easy and quick deployments will be crucial to your ability to test for and find upgrade-related issues.
In conclusion, upgrade testing requires thinking in terms of versions of services and clients, and thinking about the different states that a user may be in when the process begins, executes and completes. Upgrade tests are often more involved and sometimes hard to execute; however, they are crucial to ensure that our customers will have a good experience when we offer them the new version of our service. Even if customers are exposed to the system mid-upgrade for only a short period of time, a small defect can lead to problems such as data corruption, which can have long-term consequences. Upgrade testing is often overlooked, because it is hard and because the upgrade state is transient. However, our customers always deserve the highest quality service, and that includes the periods of time when we are upgrading.
Federico Gomez Suarez, Microsoft