One of the most useful resources for a test engineer dealing with web services is the production environment. This is the live environment that exposes the product to end users. Some of the challenges that the production environment presents are the following:
In this blog post, we will look at some of the strategies that can be used to improve quality by incorporating the production environment and production data into our testing.
Smoke Testing in production
Some bugs appear more readily in production due to discrepancies between the test and live environments. For example, the network configuration in a test environment might differ slightly from the live site, causing calls between datacenters to fail unexpectedly. One way to identify issues like this would be to perform a full test pass on the production environment for every change that we want to make. However, running a full suite of tests before every upgrade of the live environment would be prohibitively time-consuming. Smoke tests are a good compromise: they give us confidence that the core features of the product are working without incurring too high a cost.
A smoke test performs a broad and shallow validation of the product. The term comes from the electronics field: if smoke comes out after a new board is plugged in, there is no point in doing any further testing. During daily testing, we can use smoke tests as a first validation that the product is functional and ready for further testing. Smoke tests also provide a quick way to determine whether the site is working properly after an update is deployed. When we release an update to our production environment, we generally perform the following steps to validate that everything went as planned:
The tests used for smoke testing should have the following qualities:
Windows Live uses an automated smoke test tool that can validate the service within a few minutes. The same utility is used on developer boxes, in test environments, and in production, and it is continually updated as new features are added to the system.
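To make the idea of a broad, shallow pass concrete, here is a minimal Python sketch of a smoke test runner. It is not the Windows Live tool, whose design is not described in this post. The names `SmokeResult`, `run_smoke_tests`, `check_sign_in`, and `check_upload_photo`, as well as the two-minute time budget, are assumptions made purely for illustration. A real check would call the service; the ones here are stand-ins so the example runs anywhere.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SmokeResult:
    name: str
    passed: bool
    seconds: float
    detail: str = ""


def run_smoke_tests(
    checks: List[Tuple[str, Callable[[], None]]],
    budget_seconds: float = 120.0,  # assumed budget, not a figure from the post
) -> List[SmokeResult]:
    """Run each check once, in order. A check passes if it returns without raising.

    Checks that would start after the time budget is spent are not run and are
    reported as failed with a "skipped" detail.
    """
    results = []
    started = time.monotonic()
    for name, check in checks:
        if time.monotonic() - started > budget_seconds:
            results.append(SmokeResult(name, False, 0.0, "skipped: time budget exceeded"))
            continue
        t0 = time.monotonic()
        try:
            check()
            results.append(SmokeResult(name, True, time.monotonic() - t0))
        except Exception as exc:
            # Record the failure and keep going so one broken feature
            # does not hide the state of the others.
            results.append(SmokeResult(name, False, time.monotonic() - t0, repr(exc)))
    return results


def site_is_healthy(results: List[SmokeResult]) -> bool:
    """The deployment is considered good only if every core check passed."""
    return all(r.passed for r in results)


# Hypothetical stand-in checks; real ones would exercise the live service.
def check_sign_in() -> None:
    assert True


def check_upload_photo() -> None:
    raise RuntimeError("upload endpoint returned 500")


if __name__ == "__main__":
    outcome = run_smoke_tests([("sign_in", check_sign_in), ("upload_photo", check_upload_photo)])
    for r in outcome:
        print(f"{'PASS' if r.passed else 'FAIL'} {r.name} {r.detail}")
    print("healthy" if site_is_healthy(outcome) else "needs attention")
```

Because each check is just a callable, the same list could in principle be pointed at a developer box, a test environment, or production by changing only the configuration the checks read.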
Reacting to issues through data collection and monitoring
Even though we may have done thorough functional validation, shipping a new feature to production always implies a risk that things may not work as intended. Logging and real-time monitoring are tools that help us on this front. Before shipping a new feature to production, try to answer the following questions. This will give you a sense of readiness for handling issues:
Some of the strategies that Windows Live uses to allow a quicker reaction to issues are the following:
Using production data as input for tests
The involvement of Test in production should not be limited to releasing a new feature or investigating an issue. Production contains a wealth of data that helps us better define what to test. The highest-priority tests are those that map to the core customer calling patterns, and for existing scenarios, production data is the best source of those patterns. Some of the interesting questions that production data analysis is able to answer are the following:
Gathering and analyzing data to answer the above and other questions is often non-trivial, but the resulting data is invaluable, particularly when deciding which areas should have a bigger focus when testing.
Within Windows Live, we have used this approach to understand both user scenarios and calling patterns. We measure some of the characteristics of the data (like how many folders a SkyDrive account has, or how many comments photos typically have) to identify both common scenarios and outliers. This data lets us focus efforts like performance and stress testing on the most common scenarios, while ensuring that we have coverage of the edge cases.
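As an illustration of separating the common case from the outliers, the following Python sketch summarizes a list of per-account folder counts. The sample numbers, the function names, and the choice of the 90th percentile as the outlier cutoff are all assumptions for the example; they are not measurements or thresholds from Windows Live.

```python
from statistics import median
from typing import Dict, List


def percentile(values: List[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(counts: List[int], outlier_pct: int = 90) -> Dict[str, object]:
    """Report the typical value and the values above the chosen percentile."""
    threshold = percentile(counts, outlier_pct)
    return {
        "median": median(counts),      # the common scenario
        "threshold": threshold,        # cutoff for "unusually large"
        "outliers": sorted(c for c in counts if c > threshold),  # edge cases
    }


if __name__ == "__main__":
    # Made-up folder counts for ten accounts.
    folder_counts = [3, 4, 4, 5, 5, 5, 6, 6, 7, 250]
    print(summarize(folder_counts))
```

The median suggests what a performance test should treat as the typical account, while the outlier list suggests which sizes deserve an explicit edge-case test.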
When using production data in testing, the approach to privacy is extremely important and needs to be figured out before starting the work. Our tools only interact with abstractions of user data, with all actual user content and identity removed. We care about what the data looks like, not specifically what the data is.
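One way to picture "what the data looks like, not what the data is" is a function that reduces a record to structural measurements only. The sketch below is an assumption about how such an abstraction could work, not a description of the actual Windows Live tooling. The record layout (`user_id`, `folders`, `files`, `content`) and the function name `abstract_account` are hypothetical.

```python
from typing import Any, Dict


def abstract_account(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return only counts and sizes; no ids, names, or content are copied to the output."""
    folders = record.get("folders", [])
    return {
        "folder_count": len(folders),
        # Sorted so that even the original folder order is not preserved.
        "files_per_folder": sorted(len(f.get("files", [])) for f in folders),
        "total_bytes": sum(
            len(item.get("content", b""))
            for f in folders
            for item in f.get("files", [])
        ),
    }


if __name__ == "__main__":
    # Made-up record in the hypothetical layout described above.
    record = {
        "user_id": "u-123",
        "folders": [
            {"name": "Vacation", "files": [
                {"name": "a.jpg", "content": b"12345"},
                {"name": "b.jpg", "content": b"12"},
            ]},
            {"name": "Empty", "files": []},
        ],
    }
    print(abstract_account(record))
```

A test tool fed with this output can generate synthetic accounts of the same shape without ever handling the original identity or content. Whether counts and sizes alone are safe to retain is itself a privacy decision that would need review in a real system.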
In conclusion, the effectiveness of a test engineer can be enhanced by using production as a source of information. It may be by making sure that all the core scenarios work as expected through smoke testing, by creating a quick mechanism for reacting to issues, or by harvesting data to feed into test tools and plans.