I like processes that simply drop-dead fail when they have an unrecoverable fault.  Trying to continue is often dangerous and unlikely to actually help anyone.  This policy is all fine and well but in all cases it is vital that the “death” stack of a process be the true problem, not a generic error handler.  Those kinds of handlers ruin the lives of developers who are trying to analyze error logs, and effectively make services like Watson useless.  I can’t stress enough how critical it is to NOT mess with the failure stack.

I’ve seen many of these in the wild, including of course Microsoft code, and it means that instead of getting a nice actionable stack in the log (which would have been a Watson report in the field) you have to go back to the code and try to repro.  In the case below I stared at the source for a while trying to figure out what was likely wrong and it turned out the true problem was quite distant from the likely candidates.

2010/04/05 11:02:14 AM  BEGIN    Peek_Range TEST
2010/04/05 11:02:14 AM  ---- Exception in [StackTest] thrown type=[TestCodeException], eip=[00000004`4716C93E], message=[Index (1) is out-of-range [0..1]!]  

That was the failure right there, and that stack was the one I needed! Nothing after that should have happened... But... execution continues until finally....

2010/04/05 11:02:15 AM  Test completed exceptionally [Peek_Range]: Index (1) is out-of-range [0..1]!

The stack reported at this point is useless… the real problem was millions of cycles ago.  I'm not even including the stack from the log because it's just stupid.

The real stack can be extracted with a repro (at great pain)

And it ends up looking something like this (cleaned up)


The actual problem is in the test code!   The only useful information from the original log was basically the test that I should run.  Thank goodness it’s a 100% repro.