As I described in my last article, our first layer of quality assessment is build "scouting", designed to do a shallow but broad pass across the product to determine if it's worthwhile to proceed with deeper testing.

Our next layer is a set of more thorough automated test runs.  Many would call this "functional" testing, but words in the testing space get so overloaded.  It includes API level tests (which some may call Unit tests), UI automation tests and some broader "system integration tests".

We divide our automation tests at this level into two categories - Nightly Automation Runs (NARs) and Full Automation Runs (FARs).

As the name suggests, NARs are designed to be run every day, on every build that is not SelfToast (see part 1 for a description of this term).  Today we have on the order of 3,000 NAR tests.  In our test planning exercise (when we design the test cases we want to run and choose what will be automated) we prioritize our test cases (1, 2, 3).  The NARs are generally selected from the Pri 1 test cases and chosen so that they can be run and the results analyzed within a few hours (important if you are going to do this every day).

FARs are the sum of all automated tests that we have.  For TFS, this amounts to something on the order of 20,000 tests today.  A FAR run takes about a week to run and analyze the results.  As a result, we don't start doing them until later in the product cycle and we run them less frequently - the closer to the end we get, the more frequently we run them.  Right now, I think we are running them every 2 or 3 weeks.

For completeness, beyond FARs, we have what we call a Full Test Pass (FTP).  This is a period of time in which we run multiple NAR and FAR runs on a cross section of our test matrix (the subject of a future Managing Quality post) and run our manual test cases.  Last I checked, we had about 10,000 manual test cases on top of the 20,000 FAR cases.  A Full Test Pass takes somewhere from 2 - 4 weeks.

So, with that background, on to the reports...  Here's a recent NAR trend report:

As you can see, we break down each result by the cause of the failure.  These warrant a little discussion:

  • Initial Pass Rate - The test passed the first time it was run.
  • Final Pass Rate - When tests fail, we "analyze" the run.  For some of the failures, we are able to tweak something about the test and re-run it.  Those that pass the second time are counted under "Final Pass Rate".  Over time this should go to zero, since all tests should pass on the first run, but when the code is churning it's not uncommon to need to tweak tests to keep them up to date.
  • Product issue - This is what you'd expect - it's a failure in the product and results in a "product bug report".
  • Test Issue - Believe it or not, test code can have bugs too.  These are failures in tests that can't be fixed by tweaks and require significant work in the test itself.  They can result from changes in the product or just improperly written test code.
  • Other Issue - Anything else.  These might be test infrastructure issues, lab network issues, etc.
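
To make the breakdown above concrete, here is a minimal sketch of how one run's results could be tallied into those five categories.  The category keys, the function names and the sample counts are all made up for illustration, and treating "initial pass" plus "final pass" as the overall pass rate is an assumption; this is not the tooling that produces the reports.

```python
from collections import Counter

# Hypothetical keys for the five categories in the report legend.
CATEGORIES = (
    "initial_pass",
    "final_pass",
    "product_issue",
    "test_issue",
    "other_issue",
)


def summarize_run(results):
    """Return the percentage of tests that fell into each category.

    `results` is an iterable with one category key per test.
    """
    counts = Counter(results)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("run contains no results")
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}


def overall_pass_rate(summary):
    """Assumption: a test counts as passing if it passed initially or after a tweak."""
    return summary["initial_pass"] + summary["final_pass"]


# Made-up counts for a 3,000-test run, purely for illustration.
sample = (
    ["initial_pass"] * 2700
    + ["final_pass"] * 150
    + ["product_issue"] * 90
    + ["test_issue"] * 45
    + ["other_issue"] * 15
)
summary = summarize_run(sample)
print(summary["initial_pass"], overall_pass_rate(summary))  # 90.0 95.0
```

Keeping the categories in a fixed tuple means a category with zero tests still shows up in the summary as 0.0, which makes runs easier to compare side by side.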

Sidebar - I think I've described this before but since I'm showing a bunch of build numbers, let me tell you about them again.  20108.00 is a build number.  The format is YMMDD.NN.  The year digit (Y) is somewhat arbitrary but increases each year that the project is underway - we are using 2 for Orcas because this is the second calendar year (we started on it in '06).  0108 is January 8th.  NN represents the number of rebuilds of this "build".  Mostly this is used when we branch a build for a release: we freeze the main part of the build number and only increment NN.  During the main phase of development, it's pretty much always 00.
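
If you want to pull those fields out of a build number yourself, the YMMDD.NN format can be split with a few lines of code.  This is only a sketch: the `BuildNumber` type and `parse_build_number` function are hypothetical names, and the range checks on month and day are my own assumption rather than anything the build system enforces.

```python
import re
from typing import NamedTuple


class BuildNumber(NamedTuple):
    year_index: int  # Y: project-year counter, not a calendar year
    month: int       # MM
    day: int         # DD
    rebuild: int     # NN: number of rebuilds of this build


# One digit for Y, two each for MM and DD, a dot, then two digits for NN.
_BUILD_RE = re.compile(r"(\d)(\d{2})(\d{2})\.(\d{2})")


def parse_build_number(text):
    """Split a YMMDD.NN build number into its four fields."""
    match = _BUILD_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not in YMMDD.NN format: {text!r}")
    year_index, month, day, rebuild = (int(g) for g in match.groups())
    # Assumed sanity checks; the real build system may allow other values.
    if not 1 <= month <= 12 or not 1 <= day <= 31:
        raise ValueError(f"implausible date in build number: {text!r}")
    return BuildNumber(year_index, month, day, rebuild)


build = parse_build_number("20108.00")
print(build.year_index, build.month, build.day, build.rebuild)  # 2 1 8 0
```

Note that Y is kept as a plain counter: turning it into a calendar year would need to know when the project started, which the build number itself does not carry.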

Here's a FAR report.  It looks pretty much the same:

This report shows only 2 FAR runs because we've only just started doing FAR runs again.  We are dusting them off and getting them running on the Orcas code base.  You'll notice the gap between these two runs is a little over 2 weeks.

We also produce more detailed test result reports to drill into specific feature areas, like this:

The wide variation in the number of scenarios tends to come from how granularly different feature teams break up their tests.  As you'll see in a future post, the code coverage metrics for all feature areas are about the same.

Looking at all of this data from a project management perspective...  Together with the build quality data in part 1, these reports form a very important part of telling how the quality of the product is progressing.  In my experience, these numbers rise and fall throughout the product cycle and tend to hit their low point right around our "code complete" deadline - that's when developers are feeling pressure to get their functionality done and quality tends to suffer a bit.  Then we move into a stabilization phase and the numbers trend up.  They need to be in the mid-to-high 90s to have a Beta quality release.  We generally never achieve more than about a 99% pass rate because there are inevitably some tests which fail for very minor reasons and we decide they do not materially impact the quality of the product.

This product cycle we've made some pretty significant changes to our macro level project management.  After I get through this quality series, perhaps I'll talk a bit about that and the impact it's had on how we manage quality.

Hopefully this continuing thread is useful to you.  Let me know if not, or if you would like me to focus on certain aspects.

Brian