This is part 2 of my series on “principles”. In this post, I talk about the principles I have used to guide all teams I have managed. Before we delve in – a couple of clarifications:
The principles, which I cover, are:
Enjoy!
High test stability is the cornerstone of every good test automation system. Ideally, a test should fail only when there is a product bug. In reality, tests tend to fail for other reasons too, but we should strive to minimize those failures by either fixing or removing (and rewriting) unstable tests.
The principle of high test stability (99.5% test stability) has several important implications. It allows us to spend less time chasing random test failures and more time discovering bugs in the product. It also allows us to increase test coverage faster than support costs grow. It helps us land the product in a stable, predictable manner. But most importantly, high test stability allows us to drive testing upstream.
Defect prevention is the “holy grail” of quality management. Many studies show that preventing a defect from entering the product is much cheaper than discovering and fixing the defect after it has been introduced.
Obviously there are many different ways to go about defect prevention. One of the simplest and most robust is having a stable and fast suite of tests that developers can easily run prior to check-in. The suite has to be stable, because no one likes to debug through hundreds of unrelated failures to confirm that the code works. The suite has to be fast, because no one likes to wait hours for test results to come out.
Having a stable and fast test suite allows us to practice a continuous “develop – run tests – fix issues – repeat” cycle prior to check-in.
About 5 years ago, we used to invest a lot in advanced failure investigation tools, which allowed us to quickly deal with and resolve large numbers of test failures. It turned out that having high test stability negated the need for advanced failure investigation tools. For the most part, an NUnit-like report (typically XML with an XSLT transform to make the data readable) is all the team needed. At this point in time, I assert that having an advanced failure investigation tool may actually be a dangerous practice for test teams, as it acts as a powerful crutch that pulls the team away from the root underlying issues that need to be fixed – i.e. tests with low stability or a very low quality product.
There is one important aspect of failure investigation tool development that we need to account for though. SDETs are fundamentally software engineers. As software engineers we like to develop software, and a failure investigation tool often presents opportunities to hone one’s design and coding skills and to experiment with new technologies (web services, databases, extensibility frameworks, AI, GUI frameworks, etc.). If we move away from such tools, we clearly have to find other ways for our SDETs to improve their transferable skills (design, algorithms, etc.).
Case Study: 99.5% pass rate of the WPF tests
99.5% is obviously an arbitrary number. The right number to strive for is always 100%. A realistic stability goal depends on the size of the team and on the size of the test suite, but typically revolves around 99.5%. For example, in the WPF test team, we had a suite of about 105,000 tests, which at a 99.5% pass rate produced about 500 test failures for a team of about 30 engineers, or about 18 failures per person. That seemed to be a reasonable number (so that every SDET spends no more than about 30 minutes investigating failures every day).
With time, 99.5 became a team mantra. SDETs, SDEs and PMs actively identified with it and fought for it.
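The arithmetic behind the case study can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and the inputs are the approximate figures quoted above.

```python
def daily_failure_budget(total_tests: int, pass_rate: float, engineers: int) -> tuple[float, float]:
    """Return (expected failures per run, expected failures per engineer)."""
    failures = total_tests * (1.0 - pass_rate)
    return failures, failures / engineers


# Approximate figures from the case study: 105,000 tests, 99.5% pass rate, 30 engineers.
failures, per_person = daily_failure_budget(105_000, 0.995, 30)
print(f"{failures:.0f} failures per run, {per_person:.1f} per engineer")
# prints "525 failures per run, 17.5 per engineer"
```

The result is in line with the "about 500" failures and "about 18" failures per person mentioned above. Plugging in your own suite size and team size gives a quick sanity check on whether a given stability target is affordable.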
At Microsoft, we serve hundreds of millions of customers with great environment variability. A typical “test matrix” consists of a number of HW configurations, OS platforms, languages, locales, display configurations, etc. and typically contains millions of variations.
Teams typically do the following to deal with the “configuration matrix explosion” problem:
Technique (b) specifically often has the unfortunate effect of getting us to test mostly on “vanilla configurations”, which results in missing bugs until late in the testing cycle and results in “training” the tests to pass on vanilla configurations only. Technique (c) tends to result in high testing schedule management costs.
An easy way to combat the undesired effects of (b) and (c) is to switch to weighted random configuration management. The weight of a given config can be calculated dynamically from historical data (the pass rate the last time the config was run, and how frequently the config has been run).
One can even build predictive models that allow the team to search for configurations that produce a large number of bugs.
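A minimal Python sketch of weighted random configuration selection follows. The weighting formula, the class and function names, and the sample configurations are all assumptions made for illustration; the post only says that the weight depends on the last pass rate and on how often the config has been run.

```python
import random
from dataclasses import dataclass


@dataclass
class ConfigHistory:
    name: str
    last_pass_rate: float  # pass rate of the most recent run on this config (0.0 to 1.0)
    run_count: int         # how many times this config has been run so far


def config_weight(h: ConfigHistory, floor: float = 0.01) -> float:
    """Hypothetical weighting: favour configs that failed more and were run less often."""
    failure_signal = (1.0 - h.last_pass_rate) + floor  # floor keeps healthy configs selectable
    novelty = 1.0 / (1 + h.run_count)                  # rarely run configs get a boost
    return failure_signal * novelty


def pick_config(history: list[ConfigHistory], rng: random.Random) -> ConfigHistory:
    """Draw one config at random, in proportion to its weight."""
    weights = [config_weight(h) for h in history]
    return rng.choices(history, weights=weights, k=1)[0]


# Made-up sample data, not real lab history.
history = [
    ConfigHistory("vanilla-en-us", last_pass_rate=0.999, run_count=200),
    ConfigHistory("ja-jp-high-dpi", last_pass_rate=0.970, run_count=5),
    ConfigHistory("ar-sa-rtl", last_pass_rate=0.985, run_count=1),
]
print(pick_config(history, random.Random()).name)
```

With these weights the rarely run, less stable configurations are picked far more often than the vanilla one, yet no configuration is ever excluded.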
Case Study: WPF test execution matrix management
Prior to early 2007, the WPF test team used to invest a considerable amount of effort to plan day-to-day test execution. The team had a dedicated test lead who managed the test execution matrix and sequence. Introduction of a new OS to the matrix was a fairly disruptive event, necessitating a regeneration of the whole execution sequence. Test passes took a long time, because of completeness concerns.
At the beginning of 2007, the WPF test team switched to weighted random configuration management. We introduced a simple Excel spreadsheet, which was used by the lab engineers to generate a random testing config every day that was then installed in the lab and used for test execution. The team identified about 20 configurations as high priority configurations. 4 out of 5 days in the week, the team used a randomly selected high pri config. Every 5th day or so, the team explored the rest of the configurations.
The switch to random test config generation removed the need for test execution scheduling, resulted in additional test stabilization (because the tests were not trained to pass on vanilla configurations) and enabled the team to find bugs off the beaten path earlier in the development cycle.
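The daily selection policy from this case study is simple enough to sketch in Python. The team actually used an Excel spreadsheet; the function below is a hypothetical stand-in, and it turns "every 5th day or so" into a strict every-fifth-day rule for simplicity.

```python
import random


def pick_daily_config(day_index: int, high_priority: list[str],
                      others: list[str], rng: random.Random) -> str:
    """Pick the lab configuration for a given day (day_index counts from 0)."""
    explore = day_index % 5 == 4  # days 4, 9, 14, ... explore the rest of the matrix
    pool = others if (explore and others) else high_priority
    return rng.choice(pool)


# Placeholder configuration names, for illustration only.
high_priority = [f"high-pri-{i}" for i in range(20)]
others = [f"other-{i}" for i in range(100)]
rng = random.Random()
for day in range(5):
    print(day, pick_daily_config(day, high_priority, others, rng))
```

Because nothing is scheduled in advance, adding a new OS to the matrix is just one more entry in one of the two lists.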
Developer unit tests use exactly the same test harness as the functional tests developed by SDETs. This enables code reuse between the unit tests and the functional tests. It also enables developers to easily run and debug functional tests (because they know how to do it), thus enabling defect prevention.
Automated tests are programs that get executed multiple times throughout the lifecycle of the product. Having slow tests results in the following problems:
One area that often gets overlooked is the speed of building the tests compared to the speed of building the product. Tests often take much longer to build because they are not properly packaged (e.g. too many DLLs) and because no one really looks into improving the build times. In reality having a fast product and test build enables defect prevention.
In theory, the idea of BVTs (build verification tests: a sub-suite of high priority tests that get run more often than the rest of the tests) sounds good. In practice, BVTs tend to become just another “crutch” that typically prevents test teams from addressing the root underlying problem of test slowness and instability[1]. Introducing the notion of BVTs also introduces various “management costs” (suite management and curation, execution schedule management, etc.) that move the focus away from more important activities directly related to the quality of the product.
So I highly discourage the use of BVTs.
Case Study: WPF BVTs
In the WPF test team, we experienced the full cycle. We started with a suite of tests that we used to run every day. As we expanded the suite of tests, we saw that the run times of the suite became longer and that the stability of tests became lower. Instead of investing in fixing test perf, stability and duplication of coverage, we decided to segment the suite into P0s, P1s, P2s, etc. We created various processes to handle bugs produced by P0 tests, etc. (the “hot bug” concept). Because BVTs were treated as special high-priority tests fortified by these additional processes, SDETs tended to add more and more tests to the BVT suite, which in turn increased the run times of the suite, reduced its stability, and necessitated introducing additional “BVT nomination” and “BVT auditing” processes. We had a BVT team (!!!) whose sole purpose of existence was handling the BVT suite. The “BVT auditing” process did not work, so we invested in further segmenting the BVT suite into “micro-BVTs” and “regular BVTs”, we introduced micro-BVT execution time budgets per area, etc, etc, etc. We lived with this crazy complicated system for years.
At the beginning of 2007, we decided to put an end to the madness, and focused on optimizing our run times and improving our stability. We improved our test build times by 800% and our test run times by 500%, which enabled us to do a complete test run in about 2 hours on 20 machines. We did not really get rid of priorities, but we found that when the tests are stable and fast, people tend to not care about test priority. Today, nobody cares about priorities.
The test result reports visibly display run times, so that SDETs, SDEs and leads can address any excessively long tests. SDEs are also asked to provide feedback whenever they feel that the test run times are too long.
Run times are reported by test (and aggregated into feature area run times) to allow drilling into areas that take too long to run.
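As an illustration of per-test run times rolled up into feature area totals, here is a small Python sketch. The record layout (area, test name, seconds), the function names, and the sample numbers are assumptions for the example, not the format of the actual WPF reports.

```python
from collections import defaultdict


def aggregate_run_times(results):
    """Sum per-test run times into per-area totals, slowest area first.

    results: list of (area, test_name, seconds) tuples.
    """
    totals = defaultdict(float)
    for area, _test_name, seconds in results:
        totals[area] += seconds
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


def slowest_tests(results, limit=3):
    """Return the `limit` slowest individual tests, for drilling into a slow area."""
    return sorted(results, key=lambda record: record[2], reverse=True)[:limit]


# Made-up sample data.
results = [
    ("layout", "grid_measure", 1.5),
    ("layout", "dock_arrange", 12.0),
    ("text", "glyph_cache", 0.5),
    ("media", "video_seek", 30.0),
]
print(aggregate_run_times(results))
print(slowest_tests(results, limit=2))
```

Sorting both views by duration puts the areas and tests most worth optimizing at the top of the report.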
Tests have both build-time dependencies (the test code, etc.) and run-time dependencies (test metadata such as owners, priorities, test support files, configuration details, etc.). Some teams tend to keep the run-time dependencies on a dedicated server or in databases. In theory that sounds great – after all, databases are created for storing data.
In practice, storing test metadata in databases is problematic, because it introduces versioning issues. Handling code branch forks, RIs (reverse integrations), FIs (forward integrations), etc. becomes very difficult and error-prone because you have to mirror the branch structure on the server or in the database. Some teams have designed elaborate processes to maintain referential integrity between the source control system and the support servers / databases. Although these processes may work (or can be made to work), they come with a significant support cost and are typically not robust when left on their own.
A better approach is to keep all test-related data in the source control system. There are obviously exceptions (e.g. checking in large video files may not be a good idea), and we have to be smart to not bring the source control system to its knees, but in general this approach is superior to maintaining separate databases, servers, etc.
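One way to keep test metadata under source control is to declare it right next to the test code, so that it branches and merges together with the tests. The Python sketch below shows the idea; the decorator, the registry, and the owner and area names are all hypothetical, and this is not how the WPF team's system was implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CaseInfo:
    owner: str
    priority: int
    area: str


# Filled in at import time; replaces a lookup in an external database.
REGISTRY: dict[str, CaseInfo] = {}


def metadata(owner: str, priority: int, area: str) -> Callable:
    """Attach metadata to a test function and record it in the registry."""
    def decorate(fn: Callable) -> Callable:
        REGISTRY[fn.__name__] = CaseInfo(owner, priority, area)
        return fn
    return decorate


@metadata(owner="alice", priority=1, area="layout")
def grid_measures_children():
    assert 2 + 2 == 4  # placeholder test body


print(REGISTRY["grid_measures_children"])
```

Since the metadata lives in the same file as the test, every branch automatically carries the matching version and there is nothing to mirror on a server.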
It’s a good idea to keep dev and test code in the same branch. That helps catch and prevent build breaks caused by breaking changes. It also enables code reuse across the organization (SDEs can reuse test framework pieces for unit tests, SDETs can reuse dev code for tests).
A lot of teams tend to keep specs on SharePoint servers. While this makes specs readily accessible and easily editable (if you can get around some of SharePoint’s idiosyncrasies), it suffers from the following problems:
As a best practice, specs should be checked in right next to the source code (or in an obvious location that makes sense, based on the source code directory organization).
In principle, it’s a good idea to minimize the server dependencies you have. Having server dependencies increases the maintenance costs, complicates the move to Sustained Engineering (because SE has to replicate all servers and record server management knowledge), prevents SDETs and SDEs from running tests when not connected to the network, and in general complicates the lab setup.
So we should subject every server dependency to careful examination and if we have to have servers, move them to the cloud.
Case Study: WPF server and database dependencies
At the beginning of 2007, the WPF test team had about 30 servers and a database-based test case management (TCM) system (its name was Tactics). This resulted in significant maintenance costs that could no longer be afforded. So the team decided to switch to a source-control-based TCM.
The results were outstanding – the team reduced server dependencies to about 6 servers (1 file server, 1 web server, 1 SharePoint server and 3 backup servers) and enabled SDETs and SDEs to build and run tests without being connected to the network. It also practically removed the need for any referential integrity related maintenance.
The WPF team also checked all specs into Source Depot. Specs are bin-placed as part of the build process. This ensures that the team can produce a build with matching specs for every release of the product and that we can track and compare different versions of the specs.
For every piece of software, there are a number of cross-cutting concerns which we call “fundamentals” or “basics”. These are:
All of these are extremely important and have historically represented a significant competitive advantage for Microsoft. Testing of fundamentals is expensive so we need to have the right supporting processes and automated systems in place to ensure that we produce solid software by default.
Some of these fundamentals can be integrated within the functional testing – others typically need dedicated systems and processes. Below, I am only presenting two of the fundamentals above.
Performance and Scalability are two fundamentals that can and should be automated. Ideally, the team has an automated system that runs Perf / Scalability tests (both micro-benchmarks and end-to-end scenarios are equally important) on target HW and SW configurations on every daily build. The system runs a set of tests, which capture agreed-upon performance goals and metrics. The system provides facilities for easy visualization of key trends.
Having an automated Perf system enables early discovery and fixing of Perf regressions. Due to the domain-specific nature of the work, it’s typically necessary to have a dedicated Performance team that develops and supports the necessary processes, tools and systems.
Case Study: WPF performance
The WPF team has a dedicated highly automated Performance lab. The Performance infrastructure automatically picks up and installs daily builds on a stable set of machines and configurations, runs a set of performance tests, presents results (cold startup times, warm startup times, working set size and various other metrics) and trends, identifies regressions, and captures necessary traces for follow up investigation.
The lab infrastructure also allows testing and generating diffs for individual dev changes (this feature is not really broadly used, although the lab does do dedicated perf runs of specific potentially disruptive changes).
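The regression identification step can be approximated by comparing each metric from today's build against a baseline computed from recent builds. The Python sketch below is only an illustration under stated assumptions: the median baseline, the 5% tolerance, the metric names, and all the numbers are made up and do not describe the actual WPF infrastructure.

```python
from statistics import median


def find_regressions(history, today, tolerance=0.05):
    """Flag metrics that are worse than the recent baseline by more than `tolerance`.

    history: metric name -> values from recent builds (lower is better).
    today:   metric name -> value measured on today's build.
    Returns: metric name -> (baseline, today's value) for each regression.
    """
    regressions = {}
    for metric, value in today.items():
        baseline = median(history[metric])
        if value > baseline * (1 + tolerance):
            regressions[metric] = (baseline, value)
    return regressions


# Made-up sample numbers, for illustration only.
history = {
    "cold_startup_ms": [1200, 1210, 1190, 1205],
    "warm_startup_ms": [400, 410, 395, 405],
    "working_set_mb": [58, 57, 59, 58],
}
today = {"cold_startup_ms": 1330, "warm_startup_ms": 402, "working_set_mb": 58}
print(find_regressions(history, today))
# prints "{'cold_startup_ms': (1202.5, 1330)}"
```

Using the median of several recent builds rather than only yesterday's value makes the check less sensitive to a single noisy run.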
Stress and Stability testing is another fundamental that requires a dedicated process and system. It can be done either on a dedicated pool of machines / devices or on the devices in engineers’ offices during off-work hours. Some teams tend to invest a lot in automated triage systems, but in my experience these tend to be expensive to maintain, so they should be avoided at first.
The Stress system can also be used for security testing by running fuzzers, fault injection, other penetration tools alongside with the
Case Study: WPF stress
The WPF team has a simple dedicated stress framework, which consumes both stress-specific test code and generalized testing blocks. The tests are “deterministically random”, i.e. a stress test failure can in theory (and often in practice) be reproduced on demand. The tests are distributed to about 100 machines every night (these machines are in the lab). Results get analyzed and presented by vendors in China.
The team used to have a fairly sophisticated stress system that was able to do preliminary triage of failure stacks, map them to existing bugs, etc. This turned out to be an over-automated system which had a significant support cost, so we switched to the current significantly simpler system where stress failures are triaged by a vendor team in China, who manage the pool of the stress machines remotely. The system is managed by 2 engineers in China with some support from local Redmond SDETs.
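A common way to make tests “deterministically random” is to drive every random decision from a seed that is logged with the run, so that a failing run can be replayed with the same seed. The Python sketch below shows the idea; the action names and the function are hypothetical and are not taken from the WPF stress framework.

```python
import random

# Hypothetical stress actions, for illustration only.
ACTIONS = ("open_window", "resize", "navigate", "close_window")


def generate_stress_sequence(seed: int, steps: int) -> list[str]:
    """Produce a pseudo-random action sequence that depends only on the seed."""
    rng = random.Random(seed)  # private generator, independent of global random state
    return [rng.choice(ACTIONS) for _ in range(steps)]


seed = 20070115  # in a real run this would be generated per run and written to the log
first_run = generate_stress_sequence(seed, 10)
replay = generate_stress_sequence(seed, 10)
print(first_run == replay)  # prints "True"
```

The same seed always yields the same sequence, which is what makes a stress failure reproducible in theory. In practice timing and machine state can still differ between runs, which matches the "often in practice" caveat above.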
Tests are long-lived software. Tests actually tend to live longer than the product code they test, due to AppCompat and other reasons – in Windows we have tests that are 20 years old, dating back to the Win16 days. In order to build a stable test suite teams need to invest in design and code and asset reuse.
Proper software design is an acquired taste. It is one of those things that doesn’t just happen on its own. It requires focus on architecture, design reviews, code reviews. It also requires training – both on the job (e.g. through mentoring, DR, CR) and structured training (e.g. design pattern “brownbag” sessions, etc.)
A test suite (as any other piece of software) needs to be able to evolve. A test suite also needs to be portable. The single best way to create a robust, evolvable, maintainable test suite is to construct it as a combination of small self-contained building blocks.
We employ the principles of componentization and aggregation as opposed to inheritance and integration when constructing test suites. In that sense, tests are like “lego models” constructed as an aggregation of reusable “lego blocks” woven together by test execution and configuration policy.
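The “lego block” idea can be sketched in Python as small self-contained steps that are aggregated into a test, instead of tests that inherit behavior from a deep base class hierarchy. All names below are hypothetical and the steps only record what they would do.

```python
from typing import Callable

# A building block is any function that acts on a shared context dictionary.
Step = Callable[[dict], None]


def launch_app(ctx: dict) -> None:
    ctx["log"].append("launch")


def open_document(ctx: dict) -> None:
    ctx["log"].append("open")


def verify_rendering(ctx: dict) -> None:
    ctx["log"].append("verify")
    assert "open" in ctx["log"], "nothing to verify before a document is opened"


def compose(*steps: Step) -> Step:
    """Aggregate small blocks into one test; the order of steps is the only policy here."""
    def run(ctx: dict) -> None:
        for step in steps:
            step(ctx)
    return run


rendering_test = compose(launch_app, open_document, verify_rendering)
ctx = {"log": []}
rendering_test(ctx)
print(ctx["log"])  # prints "['launch', 'open', 'verify']"
```

Each block can be reused in functional, stress, or performance tests, and a new scenario is just a new combination of existing blocks.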
There are two major techniques for automated code verification:
Static code analysis should be performed prior to check-in to prevent the team from checking in buggy code. Ideally, tools such as PreFast, PreFix, FxCop, StyleCop and Polycheck should be integrated in the build environment, and any violation of their rules should result in a build break.
Dynamic code analysis is another powerful technique, done through in-situ verification (asserts, special OS instrumentation such as that enabled with AppVerifier and Driver Verifier, etc.).
Product features are created in order to enable end-to-end user scenarios. The quality of a product feature does not matter if the e2e scenario doesn’t work. So test teams should work closely with the PM and dev teams to understand the e2e scenarios and to build tests that capture these e2e scenarios early in the development cycle.
The single fastest way to improve a team’s throughput is through training. Of course all of us experience continuous on-the-job training (which is one of the great things about the dynamic industry we are in), but I am a big believer in providing a bit of a training structure for the organization.
The fundamental training activities we engage in are:
Hopefully the two words that jump to mind after reading this post are “simplicity” and “rigor”. As a general rule, I’d always trade advanced functionality for simplicity, and we keep doing that all the time in the team. It takes a lot of spine though to truly commit to simplicity – as engineers our natural tendency is to solve the most complicated (and interesting) problem.