My team recently discovered a bug that had been lurking in our daily builds for a while. We did a retrospective after the fact and learned some interesting things. We had a test case covering the scenario, and it had been passing with no problems for the past several weeks. There were no bugs in the test automation – it reliably did exactly what we wanted it to do. The scenario even had 100% code coverage via test automation.

So what went wrong?

The problem was that running the scenario once worked fine, but it left behind side effects in the system that caused it to fail if you executed the scenario again. So the next time you ran it (and every time after that), the user was presented with a nice unhandled exception message rather than completing the task they were attempting.
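To make that failure mode concrete, here's a minimal sketch (hypothetical names, not our actual code) of a scenario that passes the first time but leaves behind state that breaks every run after it:

```python
# Module-level state that survives across runs of the scenario.
_registered_channels = set()

def run_scenario(channel_id: str) -> str:
    # First run: channel_id is new, so registration succeeds.
    if channel_id in _registered_channels:
        # Second and later runs hit this path, which nothing handles,
        # so the user sees an unhandled exception instead of finishing.
        raise RuntimeError(f"channel {channel_id!r} already registered")
    _registered_channels.add(channel_id)
    return "task completed"

run_scenario("sync-job")   # passes: exactly what a single-run test verifies
run_scenario("sync-job")   # raises RuntimeError: the bug that test never sees
```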

How did we find the bug? Since the daily build had passed initial tests, one of our engineers picked it up to do some “real world” exploratory testing. As soon as he tried to do something “real” with the build, he ran into the problem. This scenario was one that any real user would almost certainly execute multiple times under normal operating conditions.

So how could we have found this bug sooner? I’d like to say we discovered some key indicator in our retrospective that should have warned us we had a hole, but the truth is the scenario was covered well and passed the sniff test. What was missing was an automated repetition test. As our real world testing illustrated, this was a “high repetition” scenario – one users would execute over and over – so we should have had a test in place verifying that it worked reliably across repeated runs. That repetition test was difficult to automate because the scenario required two computers. Not impossible – in fact, a task for adding that automation was already on our backlog – but the fact that we had a highly repetitive user scenario without automated repetition coverage should have been our cue to exercise the scenario manually more often until the automation was in place. In the end, a simple tweak to our existing automation was all it took to put this test in place. As an added bonus, we can now also use this automated test as a canary for finding memory leaks and performance degradation issues over time.
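If you're wondering what such a test might look like, here's a minimal sketch. Everything here is a placeholder – `run_scenario` stands in for whatever your existing single-run test already calls, and the iteration counts and thresholds are illustrative, not tuned numbers from our system:

```python
import time
import tracemalloc

def run_scenario(channel_id: str) -> str:
    """Stand-in for the real scenario (the earlier sketch, bug fixed)."""
    return "task completed"

def test_scenario_repeats():
    # The repetition test: the same scenario, back to back, in one session.
    # Against the buggy sketch above, run 2 fails; a single-run test passes.
    for i in range(20):
        assert run_scenario("sync-job") == "task completed", f"failed on run {i + 1}"

def test_scenario_soak_canary():
    # The same loop doubles as a canary for leaks and performance drift.
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    for i in range(200):
        start = time.perf_counter()
        run_scenario("sync-job")
        assert time.perf_counter() - start < 1.0      # latency ceiling per run
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert current - baseline < 10 * 1024 * 1024      # allocation growth cap
```

The repetition count doesn't need to be large – in our case, two runs would have caught the bug. The value is in running the scenario more than once in the same session, so that leftover state has a chance to bite.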

Obviously this isn’t the only way to catch this type of bug. Additional code reviews and pre-commit unit tests might have helped too. But the code had been reviewed, and our unit tests already had high code coverage for this scenario, so a repetition test was the most obvious and easiest way for us to ensure the right coverage.

So, next time you’re scratching your head thinking of ways to find more bugs in your system under test, maybe try running your existing tests… again (and again, and again, and again…).