I’d like to talk about one of the first lessons I learned as a tester at Microsoft:

Anticipate failure investigations, and log everything.

Let me tell you a fairly rambling tale of my early days at Microsoft:

We’d just shipped Windows NT 4.0 (and IIS 1.0).  I’d recently moved from a contract position in the Windows build lab to a full-time position (aka FTE, aka ‘blue-badge’) on the COM test team.

In those days, the COM team was just beginning its transition from a consumer-oriented, drag-and-drop, linking-and-embedding technology to a distributed-computing, enterprise-application sort of focus.  This was all before COM+; DCOM was the new thing.  Given that history, it is perhaps not surprising that we had a bunch of old tests covering a variety of those consumer-oriented uses of COM.  Linking and embedding, the COM common dialogs, and more: these were tests that had been lovingly handed down from tester to tester over the years.

You’ll recall I mentioned a transition.  Now, those who have been in the industry for a while know what it means when the product changes focus like that.  Resources shift.  Things happen, like taking these “upper layer” tests from the five experienced testers who owned them, and giving them to a new, green tester fresh from the build lab.  Sink or swim, as they say.  Let me tell you, taking over maintenance of a large test suite (written by a number of different people, in a number of different styles *ping*) is a great way to learn some valuable lessons about test code design.

For example, there was a test for the OLE common dialogs, used for inserting OLE objects and other operations.  This test basically ran a large script (*ping*) that iterated through a bunch of scenarios exercising these dialogs.  In total the script performed thousands of operations – click that button, move that window, select this other menu option, etc.  Now, any time you have one program executing a script like that, you basically have one program running another program.  This means it is particularly difficult to figure out, from a debugger, what the dang thing is doing.  When you break in, the stack just shows you what the script interpreter is doing, and it requires some investigation of the program variables to figure out where you are in the script being interpreted.

Now, the test did write a log file, but unfortunately when I inherited it the output was fairly minimal.  Let me tell you, debugging a failure in this test was an adventure!  It was time-consuming and technically challenging.  (Technically educational, too, I will say.)  So I went into the interpreter and did something very simple – I just logged each operation as it executed:

LOG TRACE: Menu Select ‘File Open’

The logs got a lot bigger, and the test ran marginally more slowly (those UI tests tended to be dog slow, anyway), but now when something failed, I could look at the log and see the precise script actions that led up to the failure.
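To make the idea concrete, here is a minimal sketch of what that change amounted to.  The interpreter loop, the ScriptOp structure, and the ExecuteOp function are hypothetical stand-ins, not the actual test code – the point is simply to trace each scripted operation right before it runs, so the log mirrors the script:

// Minimal sketch of tracing inside a script interpreter's dispatch loop.
// ScriptOp, ExecuteOp, and the log format are hypothetical stand-ins.
#include <cstdio>
#include <string>
#include <vector>

struct ScriptOp {
    std::string action;    // e.g. "Menu Select"
    std::string argument;  // e.g. "File Open"
};

void LogTrace(const std::string& message) {
    // The real test wrote to its log file; stdout keeps the sketch simple.
    std::printf("LOG TRACE: %s\n", message.c_str());
}

void ExecuteOp(const ScriptOp& op) {
    // Stand-in for the code that actually drives the UI (clicks, menus, ...).
    (void)op;
}

void RunScript(const std::vector<ScriptOp>& script) {
    for (const ScriptOp& op : script) {
        // Log *before* executing, so a crash or hang still leaves the
        // last attempted step as the final line in the log.
        LogTrace(op.action + " '" + op.argument + "'");
        ExecuteOp(op);
    }
}

int main() {
    RunScript({ {"Menu Select", "File Open"},
                {"Click Button", "OK"} });
    return 0;
}

Logging before the operation, rather than after, is the detail that pays off: when the script hangs or crashes the machine, the last line in the log still tells you which step was in flight.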

So what I learned is that it is really, really important in a test to log *everything*.  Log your inputs; log your outputs; log all the program variables that seem even halfway interesting.  It will make your life so much easier.  It will make the life of the next test maintainer so much better.  If it causes a severe performance hit, *then* you can perhaps think about placing some of the trace output under a switch that you can enable or disable as needed for debugging purposes.  Even there, make sure you can turn on the trace output without having to recompile the program – you’ll thank yourself later.
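One way to build that kind of switch, sketched below with a hypothetical environment variable named TEST_TRACE (a config file or registry value works the same way), is to gate the trace output on a check performed at run time rather than at compile time:

// Sketch of a run-time trace switch; TEST_TRACE is a hypothetical name.
#include <cstdio>
#include <cstdlib>
#include <string>

bool TraceEnabled() {
    // Check the environment once and cache the answer, so the switch
    // costs almost nothing when tracing is turned off.
    static const bool enabled = (std::getenv("TEST_TRACE") != nullptr);
    return enabled;
}

void LogTrace(const std::string& message) {
    if (!TraceEnabled())
        return;
    std::printf("LOG TRACE: %s\n", message.c_str());
}

Because the check happens when the test runs, anyone investigating a failure can turn the verbose output on without rebuilding anything.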

On a side note, in the last couple of years I’ve noticed a trend in Microsoft product design to add this kind of detailed tracing to our products.  Debug-level tracing, in the production code – something you can enable or disable with a program or registry entry.  I know that DTC has done this for a while, and I believe it has been standard practice in the mainframe world for decades.  Of course, for performance reasons the tracing is typically turned off in normal execution – but it is there, and it doesn’t require a recompile to enable.
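For a Windows-flavored sketch of the same idea, the product can read a registry value when it starts up.  The key and value names below are hypothetical – they are not what DTC or any particular product actually uses – but the shape of the check is the point:

// Sketch of a registry-controlled trace switch (hypothetical key and value names).
// Link with Advapi32.lib.
#include <windows.h>

bool ProductTraceEnabled() {
    DWORD level = 0;
    DWORD size = sizeof(level);
    // If the value is missing or not a DWORD, tracing simply stays off.
    LONG rc = RegGetValueW(HKEY_LOCAL_MACHINE,
                           L"SOFTWARE\\Contoso\\MyService",  // hypothetical key
                           L"TraceLevel",                     // hypothetical value
                           RRF_RT_REG_DWORD,
                           nullptr,
                           &level,
                           &size);
    return rc == ERROR_SUCCESS && level > 0;
}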

PSS (Product Support Services) loves this – it really helps them track down and solve customer problems.