I haven't actually written in a long time, which is an indication of how busy things have been around here. I put together a presentation about how to handle intermittent test failures earlier this year and I've been meaning to turn it into a document, so I figured this was a good chance to do just that (you're warned that this is the first draft). The presentation is way more entertaining than the document since you're missing the jokes and the pretty pictures, but it's still interesting enough if you're into software testing.


 

Abstract:

Ever had a situation where you saw a test failing, but since it didn't immediately repro the failure got dismissed, and then later the issue resurfaced and turned out to be serious? We'll explore why people still ignore failures even though most of us agree we shouldn't, along with some real case studies of this that I've come across as a tester on Visual Studio. Finally, we'll talk about when it's "ok" to dismiss a failure and ways to influence the culture around this.

 

Section 1: Introduction

Does this scenario sound familiar?

  • Test fails
  • The first thing whoever is looking at it does is run it again
  • Test passes
  • If there was a bug it's closed as not repro
  • If it was on a test run it gets ignored as pass on rerun
  • Real bug keeps lurking and resurfaces later

If it does, it probably frustrates you to no end. After all, tests are written for a reason, and when they fail it usually means something is wrong. In theory all this sounds right, but in practice it might not be the case. So why are we so far apart between what should happen and what actually happens? From experience I can summarize the top reasons in these four buckets:

  • Everyone is on a tight schedule – This means that when there’s a business decision to be made you have to take some calculated risks and if a test is only failing sometimes that’s something you might justify living with.
  • Investigations are unattractive work – Some people love digging into problems and figuring them out. Others view it as something that isn't fun or that they'd rather not do. I'd even go as far as to say that some people think test investigations are beneath them. Bottom line: it isn't glamorous work, and some investigations can be very tedious.
  • Lack of manual repro must mean that the user scenario isn't broken – This falls under the blame-the-test category (which we'll cover in more detail later). It's very easy to blame test automation (which effectively translates to "it's someone else's bug") when you're dealing with a hard-to-repro bug.
  • It’s easier to add a workaround in the test than to find the root cause – Sometimes people will agree that there’s a bug (albeit begrudgingly) but they’ll argue that working around the issue is easy and cheap so the test should be doing that instead.

 

Section 2: Culture around test failures

One big reason people still dismiss failures, even though they agree with the statements in the previous section, is how tests (and by association testers) are viewed and valued in an organization, and more importantly how much that organization values testing.

There are two very distinct lines of thought when talking about tests, and as a tester you should always be pushing for the latter:

  • Tests are just things that need to run and pass before I check in my code changes.
  • Test results paint me an accurate picture of the quality of my product.

If you’re stuck around people that have the mindset of the former you’ve probably seen both of these:

  • When a test fails the first thing a good number of developers ask is, “Is this a flaky test?”

The first thing you should ask yourself when a test fails is what new bug got introduced in the product and how do we fix it, not whether the test itself is full of bugs.

  • The “guilty until proven innocent” mentality

This is a variation on the last statement. For a lot of developers until you can prove and explain a product bug it’s considered a test bug.

Both of these paint a bleak picture where the overall mentality ends up being “It’s never my fault, it’s always yours” and its sibling “It’s someone else’s bug”. This has the unfortunate side effect of affecting how testers are valued. These are statements I’ve heard that make me roll my eyes because they’re wrong and, most of the time, unfair and untrue:

  • “This is a crappy test so I just keep running it until it passes.”
  • “My code is fine, your test never really finds issues it just fails randomly.”
  • “Can we fix the test so that it’s more resilient to bugs?”

 

Section 3: Common issues on the product and on the test side

So how do we fix all this? The first thing we need to understand is why some failures feel random: there's a multitude of issues on both the test and the product side that can contribute to it. Here's what I've seen cause the most instability in test results. It's worth noting that, as with everything in software, there's enough blame to go around and both sides (development and testing) have their share.

On the product side of things:

  • Issues that only manifest themselves on the application's first load

This one is self-explanatory: these are issues that only happen on cold boots or the very first time the application runs on a machine.

  • Race conditions

One of the “fun” parts of dealing with threading. When multiple threads share resources, assumptions are usually made about who gets there first, and a lot of the time those assumptions are wrong. The resulting bugs only manifest themselves when one thread gets there before another in an order nobody planned for.
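
To make this concrete, here's a minimal Python sketch (purely illustrative) of multiple threads assuming they have a shared counter to themselves; the gap between the read and the write is where the race lives:

import threading

counter = 0  # shared resource with no lock protecting it

def increment_many(times):
    global counter
    for _ in range(times):
        current = counter       # read
        counter = current + 1   # write; another thread may have updated counter in between

threads = [threading.Thread(target=increment_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 400000, but lost updates make the result vary from run to run.
print(counter)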

  • Reentrancy issues

Another threading issue; one of the common causes is a supposedly reentrant function presuming the resources it uses are thread safe when they aren't necessarily completely so.
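
Reentrancy problems don't even need a second thread; the same corruption happens whenever a function that keeps shared state gets re-entered, for example through a callback. A hedged Python sketch (the names are made up) of a function that isn't reentrant because it leans on module-level state:

_scratch = []

def format_items(items, on_each=None):
    # Not reentrant: every call shares the same module-level scratch buffer.
    _scratch.clear()
    for item in items:
        _scratch.append(str(item))
        if on_each:
            on_each(item)  # if this callback re-enters format_items, _scratch gets clobbered
    return ", ".join(_scratch)

# The inner call wipes out the outer call's partial results:
print(format_items([1, 2, 3], on_each=lambda _: format_items(["x"])))  # prints "x", not "1, 2, 3"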

On the test side of things:

  • Dirty test assets

Depending on what your application consumes, your assets might not always be clean. One good example is testing a text editor where you open an existing file and make some edits. The first time the test runs the file might be pristine, but if you don't clean up correctly, subsequent runs might be running against a different file.
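
One cheap defense is to never touch the checked-in asset directly. A hedged Python sketch (the paths are made up) that hands each run a throwaway copy:

import shutil
import tempfile
from pathlib import Path

PRISTINE_ASSET = Path("assets/sample_document.txt")  # hypothetical checked-in asset

def checkout_fresh_copy() -> Path:
    # Work on a disposable copy so a previous (or failed) run can't leave
    # edits behind that change what the next run opens.
    work_dir = Path(tempfile.mkdtemp(prefix="editor_test_"))
    working_copy = work_dir / PRISTINE_ASSET.name
    shutil.copyfile(PRISTINE_ASSET, working_copy)
    return working_copy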

  • Dirty application settings/environment

When the test depends on the state of the machine and/or the application and it doesn't ensure that everything is in the expected state, you'll see different results depending on where the tests are run.

  • Lack of synchronization

Not every action finishes instantaneously, so a lot of the time you have to wait until an action is finished before validating the result or performing other actions that depend on it. Often that synchronization happens as part of your test APIs, but you shouldn't take anything for granted; always confirm that the correct synchronization is happening.

 

Section 4: Case studies

Let’s talk about actual examples of issues I’ve seen while working on Visual Studio over the last few years that illustrate some of the issues I previously mentioned. I’ll go over one common test bug and two interesting product bugs I’ve had to deal with.

 

Section 4.1 – Blame Your Tester

This is something I've seen way more times than I'd like to admit, in multiple incarnations. Can you spot the bug in this algorithm?

OpenVisualStudio()
CreateNewProject()
RequestIntelliSense()


The problem here is that after you create a new project IntelliSense is not immediately ready (is this reminding you yet of the synchronization issues we talked about before?). It's a very easy issue to fix as long as you have the right tools and you actually spot it. Now, this is how it's usually solved. Does it sound right?

OpenVisualStudio()
CreateNewProject()
Sleep()
RequestIntelliSense()

So now we wait and then we perform the action. Issue solved, right? No, and I can't stress this enough: if you see a sleep, there had better be a good explanation as to why it was the last resort, or sooner or later you'll hit the same issue. Sleeps are not deterministic; for all you know the wait might end well before the product was actually ready, or, even worse, the test will work most of the time and fail other times.

What should happen instead is something like this:

OpenVisualStudio()
CreateNewProject()
WaitForIntelliSenseReady()
RequestIntelliSense()

In which you have some good deterministic synchronization that waits exactly as long as needed and won't introduce random failures.
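
If your test framework doesn't already give you a wait primitive, a deterministic wait can be as simple as polling a condition with a timeout. A hedged Python sketch (the IntelliSense readiness call at the end is a made-up stand-in):

import time

def wait_for(condition, timeout_s=30.0, poll_interval_s=0.25):
    # Poll until the condition holds; return as soon as it does instead of
    # sleeping a fixed, arbitrary amount. Fail loudly if it never holds.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(poll_interval_s)
    raise TimeoutError(f"condition not met within {timeout_s} seconds")

# wait_for(lambda: intellisense_session.is_ready())  # hypothetical readiness check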

Conclusion: Synchronizing the right way makes a world of difference, yet it's easy to overlook. When you end up with cases like this, where you hit the worst possible scenario of working most of the time but still failing on some occasions, the developer statements we talked about in section two start to look more than justified.

 

Section 4.2 – Death by a Thousand Cuts

This is an example that perfectly illustrates the “it’s someone else’s bug” attitude we mentioned previously. It also caused me a lot of grief, and you'll soon see why. So to start off, here's the failure:

Before running tests our team does a couple of pre-setup steps which ensure that we're running on a clean environment. Remember that pesky issue about dirty environments we mentioned earlier? This is one of the ways we deal with it. One of these steps loads Visual Studio and resets its settings. Now, we don't want things to hang, so we set the timeout for this step to 5 minutes, which is more than enough. All of a sudden this starts failing, and everyone says it has nothing to do with them.

Now here's where the grief part starts. These are actual excerpts from an email conversation I had with one of the developers investigating a failure:

Me: What I'm concerned about is that I'm not seeing that issue in main (or elsewhere), and you mentioned that you hit it multiple times, so if it is an issue it's coming from the branch and it should be investigated there.

Developer: I am only looking at the debugger issue. If it fails on official runs the owner will need to take a look. I doubt that it is a product issue

Me: If you had to rerun it multiple times that would be something you should investigate, as that might mean we'll see the same failure on official runs. The timeout for that step is 5 minutes; if a call to reset settings is taking that long there might be some perf regressions or other badness here.

So this already isn't going too well. Over the next several weeks I start getting emails from a lot of different people about tests failing with the same common cause: that timeout. At this point I think I was close to testing how sturdy my office door was by ramming my head into it quite a few times.

As usual though, I had to keep fighting the good fight, as I knew that the root cause here was a product issue and something that should get addressed. Fast forward three weeks, and after enough investigation was done there was a broad email from a development manager high up the food chain acknowledging that we indeed had an issue and that multiple people were working on it. Score one for testers standing up for their tests!

Conclusion: The root cause of this failure ended up being multiple bugs in different packages that caused significant slowdowns which only manifested themselves the first time the application was loaded. Interestingly enough, the two tests that were hitting this (which of course came under fire, and I had to fight off a lot of requests to change them in order to add workarounds) weren't even going through the scenarios that were broken. It just so happened that their setup routines were uncovering issues that would have otherwise been ignored.

 

Section 4.3 – I Called Dibs!

This was an extremely interesting case that yielded some big changes, which I'll talk about later. Is there anything wrong with this test code?

OpenVisualStudio()
CreateNewProject()
Build()

It turns out everything on the test side was correct; however, under the wraps there was a lot going on. After a project is created there's a bunch of background operations that happen in order to deliver the full user experience, including the work IntelliSense needs to do. In this particular case both the IntelliSense engine and the build system have their own builds going on (each using its own components).

So far so good; however, both of them need to ask the build manager for some project-specific data. So what happens when the build manager is busy? It turns out that when there's a build in progress and someone asks the build manager for information, it tells them "sorry, I'm occupied right now, could you please ask me later?" Now, if either component is greedy and presumes it will get everything whenever it asks for it, it will eventually run into issues, which is what ultimately happened here.
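
For illustration, a consumer that treats "ask me later" as a normal answer might look roughly like this Python sketch (the exception type and the build manager call are hypothetical stand-ins, not the actual Visual Studio APIs):

import time

class BuildManagerBusy(Exception):
    # Hypothetical: raised when the build manager can't answer right now.
    pass

def query_project_data(build_manager, project, timeout_s=60.0, backoff_s=0.5):
    # Keep asking politely instead of assuming we always have priority.
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return build_manager.get_project_data(project)  # hypothetical call
        except BuildManagerBusy:
            if time.monotonic() >= deadline:
                raise  # give up with the original error after a bounded wait
            time.sleep(backoff_s)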

Conclusion: The root cause ended up being that both consumers needed to handle the "ask me later" response gracefully and retry, and neither of them did, so depending on which one started first we'd see a failure coming from the other component. As I mentioned earlier, this turned out to be an architectural issue that needed to change, with both components understanding that they might not always have priority and that being told to wait is a completely valid response.

 

Section 5: Promoting change

So all of this is great information, but you're still not sure how you can make a difference and change the culture around you? For starters, you should always be thorough and able to explain with confidence what's happening (even if all you have is a working theory). Here are three things you can do that will help your argument that a failure is a product bug:

  • Run the test on multiple iterations

Don't give up when it doesn't repro immediately; as we've mentioned multiple times, just because it passes on rerun doesn't mean everything is good. Most test drivers have options to run a test multiple times, so use them liberally.
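
If your driver doesn't have a built-in repeat option, a crude loop does the job. A hedged Python sketch:

def run_repeatedly(test_fn, iterations=50):
    # Re-running many times turns a "1 in 20" failure into something you can
    # reproduce on demand; collect failures instead of stopping at the first.
    failures = []
    for i in range(iterations):
        try:
            test_fn()
        except Exception as exc:
            failures.append((i, exc))
    print(f"{len(failures)} failures out of {iterations} runs")
    return failures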

  • Restore the machine to a clean state

This can be tedious and annoying as it's very time-consuming, but sometimes it's the only way. If possible (and virtualization helps with this in a big way) you can restore the machine to the same state it was in before the test was originally run. You don't always have to go that far though; every application has specific things it does the first time it runs (adding registry entries, saving settings files, etc.). If you know your product well you should be able to get the machine back to that state manually.

  • Clean up the test assets and environment before and after the test

This should probably go without saying, but if you eliminate as many variables from the equation as possible, it becomes hard to argue that the failure comes from the way the test is doing something.
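
As a hedged sketch of what that can look like in practice (Python, with illustrative paths), wrap the test in something that restores known-good settings before it runs and cleans up after:

import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def clean_environment(settings_path, defaults_path):
    # Start from known-good settings and leave nothing behind afterwards,
    # so every run sees the same machine state. Paths are illustrative.
    shutil.copyfile(defaults_path, settings_path)
    scratch = tempfile.mkdtemp(prefix="test_scratch_")
    try:
        yield scratch  # the test does its work against this throwaway directory
    finally:
        shutil.rmtree(scratch, ignore_errors=True)
        shutil.copyfile(defaults_path, settings_path)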

Now, there are cases when you have to cut your losses and accept that the test failed without having a good explanation for it. Even when that is the case, my previous statement still holds: always be thorough and explain things with confidence. Two very important tips in this area:

  • Above all perform due diligence before dismissing it outright

If you're dismissing your test failures without exploring all possible options, then you can't expect others to respect your tests, which correlates with how much respect they have for you as a tester.

  • Use your judgment and prioritize

Some tests are more important than others, and as much as it pains us to admit it, most bugs will get fixed but some will not. If it's something you consider extremely important, go the extra mile; but if it's something you can probably live with given the rarity and severity of the bug, your time might be better spent elsewhere. You know your product and your features, so trust your judgment.

Section 6: Conclusion

If nothing else there’s one thing that I hope was ingrained in your brain from this:

“Pass on rerun” and “not repro” resolutions are unacceptable.

If that's all you took away from this, then all this time was already more than worth it; but if you also take away the points below, you'll fully appreciate testing (and your testers) even more.

  • Failures are bad; if they were acceptable, what's the point of having a test for them?
  • A lot of times flaky tests are just reflecting a flaky product, don’t be quick to blame the tests.
  • Until you can prove that there’s a test issue presume that it is a product issue.
  • Workarounds are a sign of a buggy product, if you have to work around something it probably means your customers will have to as well.
  • Even when it is a test issue, get to the root cause of it.