Prologue

One of my teammates recently set up a Wiki site for my team, and I must admit that I have become quite addicted to it.  One of the pages I’ve created has been a list of common test areas, a checklist of things most of our tests will want to concern themselves with.  Things like “bad parameters” and “different transport protocols”.

Let me interject at this point that this concept of a common test checklist is by no means a new one.  Anyone among my three readers who has done software testing before knows what I mean.  Despite all the prior art, however, I believe it is useful to have a team-specific list: something that targets those test areas particularly relevant to that team.  To pick a team completely at random, for example, .NET Remoting likely has specific test areas and checklist items that are particularly relevant to it.

My checklist currently has about 20-30 items on it; today I would like to focus on a single one of them, two simple words: Multithreaded Testing.

In reality, it’s a topic that can fill a library.  I’m going to chat about my empirical experiences in this area, but if you are interested, there is also a considerable body of computer science research on the subject.

As I write this, I find myself mentioning test topics that deserve more discussion in a future post.  I’ll mark those with a *Ping* so I remember to go back to them some other day.

“The first priority, young man, is to find the bugs…”

Many moons ago, in the before-time, I wrote a test suite for the IErrorInfo interface and the associated COM infrastructure.  (Yes, it is my fault if it doesn’t work right…)  One set of tests that I wrote was designed to discover race conditions when multiple callers on different threads got error objects back from the same target object.

I was actually quite proud of these tests.  In one case, I would have threads A and B call object O, where the calls arrived in the order A, then B, but by judiciously blocking the calls in the target object, they would return in the order B, then A.  In another case, I would force the order as A-calls, B-calls, A-returns, B-returns.
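
Just to make the idea concrete, here is a minimal sketch of that kind of controlled ordering.  It’s Java rather than the original COM test code, the Target class is a made-up stand-in for object O, and the latch names are mine; the point is only that the blocking lives inside the target, which pins down one exact interleaving, run after run.

```java
import java.util.concurrent.CountDownLatch;

public class OrderedRaceTest {
    // Hypothetical stand-in for the real COM target object: it blocks A's call
    // internally until B's call has come and gone, forcing the return order B-then-A.
    static class Target {
        final CountDownLatch aInside = new CountDownLatch(1);
        final CountDownLatch bDone = new CountDownLatch(1);

        String call(String caller) throws InterruptedException {
            if (caller.equals("A")) {
                aInside.countDown();   // A's call has arrived...
                bDone.await();         // ...but it will not return until B is finished
            } else {
                aInside.await();       // make sure A's call arrived first
            }
            return "error info for " + caller;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Target target = new Target();

        Thread a = new Thread(() -> run(target, "A"));
        Thread b = new Thread(() -> {
            run(target, "B");
            target.bDone.countDown();  // B has returned; now release A
        });

        a.start();
        b.start();
        a.join();
        b.join();
    }

    private static void run(Target target, String caller) {
        try {
            System.out.println(caller + " got: " + target.call(caller));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```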

I had about four of these test variations.  I had beautiful charts in my test specification describing the control flow.  I had bountiful program output, describing the scenario in loving detail for anyone who might happen to need to debug a failure.  I loved those, because if anything failed, I could point to the exact repro scenario and documentation needed to demonstrate and debug the bug.  *Ping*

What I forgot (or hadn’t learned yet) was that documentation and easy repros, while important, are all “priority two”.  The first priority is to *find the bugs*.

Hindsight is 20/20

You see, by artificially controlling the ordering of the thread actions, I’m also artificially constraining the product code paths that my test explores.  For example, my test would never try the case where a call arrived at the target object at the exact same time as another call returned from that object; the care I took in synchronizing the scenario prohibited it.

What I should have done was kick off about a hundred threads, set up some loops so they continuously hit the target object, and let it run for a minute or two.  Sure, random testing is not deterministic; there is no guarantee that a given failure will repro, and figuring out what happened when something does fail is a major pain in the rear.  But remember, that is all “priority two”.  Think about all those calls, twisting and twining, overlapping and conflicting, throughout the internal IErrorInfo infrastructure.  It’s gonna be *tons* better at finding bugs than the four simple variations I wrote years ago.
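
Something along these lines, in other words.  This is again a hypothetical Java sketch rather than the real suite; the hundred threads and the two-minute deadline are just the sort of numbers I have in mind, and the Target class stands in for whatever hands back the error objects.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RandomRaceTest {
    // Hypothetical target; in the real test this is the object handing back error info.
    static class Target {
        String call(String caller) { return "error info for " + caller; }
    }

    public static void main(String[] args) throws InterruptedException {
        Target target = new Target();
        long deadline = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(2);

        ExecutorService pool = Executors.newFixedThreadPool(100);
        for (int i = 0; i < 100; i++) {
            String name = "worker-" + i;
            pool.submit(() -> {
                // Hammer the target in a tight loop; the interleavings are whatever
                // the scheduler happens to give us on this particular run.
                while (System.currentTimeMillis() < deadline) {
                    String result = target.call(name);
                    if (!result.endsWith(name)) {
                        // Any cross-thread leakage of error state shows up here.
                        System.err.println(name + " saw someone else's error: " + result);
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```

No charts, no carefully scripted ordering; just a lot of calls colliding in whatever order the scheduler feels like.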

Now, even that random test case isn’t enough.  There may be races that will just never show up on your machine, in your configuration.  To catch stuff like that, you’ll want to induce errors or delays; the joys of fault injection.  *Ping*
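
If you don’t have a real fault-injection framework handy, you can approximate the idea with a test-only flavor of the target that randomly delays and fails.  The class below is a hypothetical sketch, and the 20% and 5% probabilities are completely arbitrary; the point is to stretch out some calls and fail others, so the interleavings and error paths you never see on a fast, healthy machine actually get exercised.

```java
import java.util.concurrent.ThreadLocalRandom;

// A test-only flavor of the target that randomly delays and fails.  Dropping this
// into the stress loop above, in place of the plain Target, widens the windows in
// which calls can overlap and exercises the error paths under load.
public class FaultyTarget {
    String call(String caller) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();

        // Injected delay on roughly 20% of calls: stretches the call out so that
        // other threads get a chance to arrive while this one is still inside.
        if (rnd.nextInt(100) < 20) {
            try {
                Thread.sleep(rnd.nextInt(50));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        // Injected failure on roughly 5% of calls: the callers now have to handle
        // errors while everything else is racing around them.
        if (rnd.nextInt(100) < 5) {
            throw new RuntimeException("injected fault for " + caller);
        }

        return "error info for " + caller;
    }
}
```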

You also might want to run tests for longer than a minute or two, which brings us to our next topic…

How I learned to stop worrying and love the stress

Sometimes, a race only shows up once every couple of months.  Sometimes it will only show up on a single machine; the one that has the magic combination of system components that demonstrates the failure.  To catch these, the Windows team has this thing called “Office Stress”.  This runs a bunch of different tests, exercising many of the features of Windows.  It crushes the machine – office stress will routinely peg the CPU, and things run so slowly that failures are the norm.

Now, honestly, that is mostly useful for testing software that cannot fail.  Things like winlogon or rpcss; if they fail, the machine fails.  These core system components have to keep functioning even if 90% of their memory allocations start failing – and office stress will force that condition.

Office stress is not so useful for testing other kinds of programs.  Your typical application doesn’t expect to keep working with memory failures – it just dies, hopefully with some appropriate error message.  Typically when you run these programs under office stress, they’ll die in the first 10% of the program, so you end up never testing the other 90%.  For these programs, a lower-intensity variant of stress is appropriate.  We’ll tune the test configuration so that it runs at about 70% resource utilization, and just let it run continuously.  The ASP.NET team does a lot of their stress testing like this; in addition to finding race conditions and other multithreading issues, it is also good for finding slow resource leaks.
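
One simple-minded way to get that kind of lower-intensity run is to give each worker thread a duty cycle.  This is a hypothetical sketch, not how any particular team’s harness works, and the 70% is just a per-thread work/idle ratio rather than a guarantee of 70% machine-wide utilization.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThrottledStress {
    // Hypothetical work item; a real run would drive the feature under test here.
    static void doOneUnitOfWork() { /* exercise the product */ }

    public static void main(String[] args) {
        final double dutyCycle = 0.70;   // aim for roughly 70% busy, 30% idle per thread
        final long sliceMillis = 100;    // length of each work/idle slice

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    long busyUntil = System.currentTimeMillis() + (long) (sliceMillis * dutyCycle);
                    while (System.currentTimeMillis() < busyUntil) {
                        doOneUnitOfWork();
                    }
                    try {
                        Thread.sleep((long) (sliceMillis * (1 - dutyCycle)));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        // Runs until the process is killed; a long-running pass would also snapshot
        // memory and handle counts periodically to catch those slow resource leaks.
    }
}
```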

On a side note, we also have a concept of “long-haul” stress.  Teams at Microsoft often have ship criteria where we won’t ship a product unless it has run on so-many-hundreds of machines, for a certain number of days, under stress, without failures.  For Windows, for example, I believe it is something like forty days.  (Once we start the last one of these forty-day test passes, it is a major pain in the rear if someone finds a showstopper bug that resets testing…)  Depending on the team, long-haul stress may consist mainly of high-intensity or low-intensity stress testing.

In the managed world, we do an interesting variation on fault injection combined with stress.  We have this tool called GCStress, which basically forces the garbage collector to do a collection on every program step.  Yeah, it’s really slow.  By running this along with our tests, we can surface memory failures pretty much at the point where they occur, which makes them much easier to debug.  (Similar to using the appverifier and turning on full pageheap in the unmanaged world.)
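
I can’t reproduce GCStress itself here, but the idea is easy to approximate in any managed runtime: force a collection around every test step, so that lifetime bugs surface as close as possible to the step that caused them.  Here is a rough Java sketch (with the caveat that System.gc() is only a hint to the JVM, unlike the real tool, which genuinely forces collections).

```java
public class GcStressHarness {
    // Hypothetical test step; a real suite would plug its scenarios in here.
    interface TestStep {
        void run();
    }

    // Run each step with a forced collection before and after it, so that problems
    // tied to object lifetime show up right next to the step that caused them.
    static void runUnderGcPressure(TestStep... steps) {
        for (TestStep step : steps) {
            System.gc();
            step.run();
            System.gc();
        }
    }

    public static void main(String[] args) {
        runUnderGcPressure(
            () -> System.out.println("step 1"),
            () -> System.out.println("step 2")
        );
    }
}
```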

Anyway, I’m definitely rambling away from the original topic of this entry, so I’ll sign off now.  I’ve enjoyed writing this up; it helps me clarify the concepts in my mind.  Please do let me know if you find it interesting as well!