The Cost of a Test Defect


Hi, I’m Marina Polishchuk, an SDET on the VC++ Compiler Front End team.  Recently, I have been working on investigating Orcas test pass results, which involves validating and reproducing the results, researching the causes of failures, and filing bugs when appropriate.  This is a typical activity for an SDET during the latter stages of our product cycle, when we focus most heavily on uncovering any lingering product issues in our supported features and scenarios.  The longer a bug persists in the source, the more costly it becomes to fix: hence, finding bugs early on, during Beta testing, is very important so they can be fixed in a timely manner.  However, the ability to detect product bugs rests crucially on having accurate test results in one’s possession, which brings up an often overlooked type of bug: the test defect.  The cost of a test defect, much like that of a product bug, depends largely on how long it goes undetected: for example, whether it is found before or after a real product regression occurs and goes uncaught because the associated test case is buggy.  In this blog post, I will talk about several kinds of test defects I’ve encountered, steps to ensure a product issue is not hiding behind a failing test, and ideas on how to better detect certain types of test defects. 

 

An “easy” test bug to be faced with is a test that no longer works due to an intended behavior change in the compiler.  I call this bug “easy” in the sense that it is straightforward to detect: “regressed” tests (passing tests that start to fail from one version of the compiler to the next) are examined by a human, who determines that the cause is benign and that the tests need updating.  Ideally, tests are patched right away to account for the new behavior; if the number of affected tests is large, we may allow them to fail until fixes can be applied, and associate a test bug in our tracking system with the failures so everyone on the team is aware that the failures are expected. 

 

Another type of test result often seen during beta testing is a false positive: a “failed” result that is not indicative of incorrect compiler behavior.  False positives may arise from incorrect machine or product configurations, dependencies on other failing tests, test commands run with insufficient user privileges, and so on.  False positives are diagnosed either by examining logs or, if the logs are insufficient, by re-running the test.  Re-creating the original test environment is nontrivial and generally infeasible due to factors such as machine stress during a full suite run or non-determinism when tests are executed concurrently on multiple processors.  If I cannot recreate the failure, a general course of action is to wait and see whether it manifests on subsequent test runs with the same parameters, possibly expanding the test’s logging output to make configuration failures easier to diagnose.  In my view, false positives are also benign with respect to product quality, unless of course one counts the time taken away from discovering real defects while investigating spurious errors.   

 

Finally, a highly distressing type of test result is a false negative: a passing result from a test that exercises a bug in the compiler but fails to report it.  Since code coverage of the compiler is the same whether a test reports “1” or “0” at the end, and testers generally do not investigate passing results much, such defects often go undetected for a long time.  A phenomenon I find interesting about bug-finding tools at large is that they generally throw away information associated with affirmative results.  Within our test harness, we avoid logging test execution details for tests that report “pass” by default.  Given current practices, this is quite reasonable: no one will investigate passing test results, so why preserve logs that no one will be looking at?  However, when one reasons about the causes of failures solely from logging information, the details of how the test last passed become important.  In particular, saving test traces regardless of result type opens up opportunities for better diagnosis of false negatives: rather than reporting isolated test results per test run, a more advanced harness might consider previous results to contextualize test results from run to run.  Taking “pass” and “fail” results only, six new result types may be derived from considering result pairs, as follows:

 

 

         Run1   Run2   Log Differences
  (i)    pass   pass   same
  (ii)   pass   fail
  (iii)  fail   pass
  (iv)   fail   fail   different
  (v)    pass   pass   different
  (vi)   fail   fail   same
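To make the pairing concrete, here is a minimal sketch of how a harness might derive these six result types from two consecutive runs. This is my own illustration, not code from any real test harness; the function and type names are hypothetical.

```python
# Hypothetical sketch: classifying a pair of consecutive test runs into
# the six result types above. Each run is a (result, log) tuple, where
# result is "pass" or "fail". Names here are illustrative only.

def classify(prev, curr):
    """Return the result type, (i) through (vi), for a pair of runs."""
    (r1, log1), (r2, log2) = prev, curr
    same_log = log1 == log2
    if r1 == "pass" and r2 == "pass":
        # Same logs: nothing to see. Different logs: possible hidden change.
        return "(i) stable pass" if same_log else "(v) pass with changed logs"
    if r1 == "pass" and r2 == "fail":
        return "(ii) regression"
    if r1 == "fail" and r2 == "pass":
        return "(iii) fix, or possibly a test bug"
    # Both runs failed: a changed log may indicate a second, masked bug.
    return "(vi) known failure" if same_log else "(iv) failure changed"

# A test that "passes" twice but logs different output is type (v) and
# may deserve a second look:
print(classify(("pass", "Value=4"), ("pass", "Value=2147942402")))
```

A run of this sketch prints `(v) pass with changed logs` for the example pair, which is exactly the case the post argues should not be silently discarded.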

 

Result types (i) and (vi) would not need to be examined, as they definitely have no associated product behavior changes (presumably, (vi) has been investigated previously).   Result type (ii) is a regression that would be investigated as usual.  Result type (iii) could be used to verify that a new passing result corresponds to a bug fix in the product and not to a manifestation of a test bug.  Result (iv) could be used to examine a test case for possible coverage of several compiler bugs: it may already be failing due to Bug #1, but may have started to fail differently due to Bug #2.  Currently, this test case would not effectively detect Bug #2, because Bug #1 has been associated with it as the cause of its failure, making it an “expected failure” that does not need to be re-analyzed.  Result (v) is most notable in that it could be used to find false positives.  For example, a few months ago, a small number of our tests were found to return “pass” even when the compiler was not installed.  With either result (iii) or (v), this bug would have been caught at the first configuration failure, rather than sporadically later on.  Result (v) may also expose logic errors: if a test prints “Value=4” during one run and “Value=2147942402” on the next, both passing, then its behavior may be worth double-checking.  Of course, logging information would have to be designed with differencing in mind, handling run-specific values (such as pointer values and timing information) specially.  If anyone has tried this type of test result analysis, I would be curious to hear about your experiences.  
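The last point, handling run-specific values specially before differencing, might look something like the following sketch. The masking patterns here are my own illustrative guesses (hexadecimal pointer values and millisecond/second timings); a real harness would tailor them to its actual log format.

```python
# Hypothetical sketch: normalize logs before differencing, so that
# run-specific values (pointers, timings) don't produce false "different
# logs" results. The regex patterns are illustrative, not from any real
# test harness.
import re

def normalize(log: str) -> str:
    log = re.sub(r"0x[0-9A-Fa-f]+", "<ptr>", log)              # pointer values
    log = re.sub(r"\b\d+(\.\d+)?\s*(ms|s)\b", "<time>", log)   # timings
    return log

def logs_differ(log1: str, log2: str) -> bool:
    return normalize(log1) != normalize(log2)

# Pointer and timing noise is masked out...
print(logs_differ("alloc at 0x1A2B took 12ms",
                  "alloc at 0x9F00 took 15ms"))   # False
# ...but a genuine value change, as in the "Value=4" example, still
# registers as a log difference.
print(logs_differ("Value=4", "Value=2147942402")) # True
```

With a normalizer like this in front of the diff, only the second pair above would be flagged as result type (v), which is the behavior the post is asking for.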

 

If you have feedback, comments, or questions about the discussion in this blog, please post them here!

 

 

Thanks,

 

Marina Polishchuk

Visual C++ Compiler Front End QA Team

  • Good post, and I have a couple of questions:

    [1] In your experience, what's the ratio of lines of test code to lines of product code?  Just curious, since in my recent experience (C# test code, C# product code), it's about 60-40, with 60% being in the product code.

    [2] How much automated UI testing is done on the IDE?  If any, do you use commercial tools (I'm not fishing for names), internal tools, or a mix?

    Thanks,

    Richard

  • [1] In your experience, what's the ratio of lines of test code to lines of product code?  Just curious, since in my recent experience (C# test code, C# product code), it's about 60-40, with 60% being in the product code.

    For the compiler, the ratio you mention can vary highly depending on the feature.  For instance, to test a parser, ideally one runs it over as much real world C++ code as possible to be confident of its correctness, while the parsing code is actually fairly small compared to the amount of code fed into it (and, in fact, most other tests will be testing the parser implicitly).  For a specific new language feature in the compiler, there are small test cases designed to exercise that feature specifically, so the ratio may be closer to the one you mentioned--a quick line count for an arbitrary feature and corresponding tests I just picked out gave a ratio of approximately 2:1.  

    [2] How much automated UI testing is done on the IDE?  If any, do you use commercial tools (I'm not fishing for names), internal tools, or a mix?

    Personally, I am not too familiar with our IDE testing frameworks, but, luckily, my office mate (Gang Zhao) happens to be an IDE tester :)  His assessment is that our tools are largely internal tools, and most/all of the testing is automated.

  • Thanks for the answer, Marina.  :-)
