Larry Osterman's WebLog

Confessions of an Old Fogey
Resilience is NOT necessarily a good thing

I just ran into this post by Eric Brechner, who is the director of Microsoft's Engineering Excellence center.

What really caught my eye was his opening paragraph:

I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it's better to crash and let Watson report the error than it is to catch the exception and try to correct it.

Wow.  I'm not going to mince words: What a profoundly stupid assertion to make.  Of course it's better to crash and let the OS handle the exception than to try to continue after an exception.

 

I have a HUGE issue with the concept that an application should catch exceptions[1] and attempt to correct them.  In my experience handling exceptions and attempting to continue is a recipe for disaster.  At best, it turns an easily debuggable problem into one that takes hours of debugging to resolve.  At its worst, exception handling can either introduce security holes or render security mitigations irrelevant.

I have absolutely no problems with fail fast (which is what Eric suggests with his "Restart" option).  I think that restarting a process after the process crashes is a great idea (as long as you have a way to prevent crashes from spiraling out of control).  In Windows Vista, Microsoft built this functionality directly into the OS with the Restart Manager: if your application calls the RegisterApplicationRestart API, the OS will offer to restart your application if it crashes or becomes unresponsive.  This concept also shows up in the service restart options in the ChangeServiceConfig2 API (if a service crashes, the OS will restart it if you've configured the OS to restart it).
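To make the "prevent crashes from spiraling out of control" caveat concrete, here is a minimal sketch of a restart loop with a crash budget.  It is written in Python purely for illustration and does not use or model the Windows APIs mentioned above; `RestartBudget`, `supervise`, and the "3 restarts per 60 seconds" policy are all hypothetical names and numbers.

```python
import time
from collections import deque


class RestartBudget:
    """Hypothetical policy: allow at most max_restarts per sliding window."""

    def __init__(self, max_restarts=3, window_seconds=60.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock
        self._restarts = deque()  # timestamps of recent restarts

    def should_restart(self):
        now = self.clock()
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False  # crashing too often: stop restarting
        self._restarts.append(now)
        return True


def supervise(run_once, budget):
    """Call run_once() until it returns 0 (clean exit) or the budget runs out.

    run_once stands in for "start the process and wait for its exit code".
    Returns the number of times the process was run.
    """
    runs = 0
    while True:
        runs += 1
        if run_once() == 0:
            return runs
        if not budget.should_restart():
            raise RuntimeError("crash loop detected; giving up")
```

The point of the budget is that the supervisor itself fails fast: once the process has crashed too many times in a short period, it stops restarting and surfaces the problem instead of hiding it.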

I also agree with Eric's comment that asserts that cause crashes have no business living in production code, and I have no problems with asserts logging a failure and continuing (assuming that there's someone who is going to actually look at the log and can understand the contents of the log, otherwise the  logs just consume disk space). 
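As a rough illustration of an assert that logs and continues instead of crashing, here is a small Python sketch.  `soft_assert` is a hypothetical helper, not something from Eric's article, and it only helps if somebody actually reads the log.

```python
import logging

log = logging.getLogger("soft_assert")


def soft_assert(condition, message):
    """Log a failed check (with a stack trace) and keep running.

    Returns whether the condition held, so callers can still branch on it.
    """
    if not condition:
        # stack_info=True records where the failed check was made.
        log.error("assertion failed: %s", message, stack_info=True)
    return bool(condition)
```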

 

But I simply can't wrap my head around the idea that it's ok to catch exceptions and continue to run.  Back in the days of Windows 3.1 it might have been a good idea, but after the security fiascos of the early 2000s, any thoughts that you could continue to run after an exception has been thrown should have been removed forever.

The bottom line is that when an exception is thrown, your program is in an unknown state.  Attempting to continue in that unknown state is pointless and potentially extremely dangerous - you literally have no idea what's going on in your program.  Your best bet is to let the OS exception handler dump core and hopefully your customers will submit those crash dumps to you so you can post-mortem debug the problem.  Any other attempt at continuing is a recipe for disaster.

 

-------

[1] To be clear: I'm not necessarily talking about C++ exceptions here, just structured exceptions.  For some C++ and C# exceptions, it's ok to catch the exception and continue, assuming that you understand the root cause of the exception.  But if you don't know the exact cause of the exception you should never proceed.  For instance, if your binary tree class throws a "Tree Corrupt" exception, you really shouldn't continue to run, but if opening a file throws a "file not found" exception, it's likely to be ok.  For structured exceptions, I know of NO circumstance under which it is appropriate to continue running.
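The distinction in this footnote can be sketched in a few lines.  The example below uses Python as a stand-in for the C++ and C# cases (it says nothing about structured exceptions), and `TreeCorruptError`, `read_or_default`, and `in_order_keys` are hypothetical names made up for the illustration.  The understood failure is caught by its exact type; the "Tree Corrupt" failure is deliberately not caught anywhere, so it ends the program.

```python
class TreeCorruptError(Exception):
    """Hypothetical: the tree's internal invariants no longer hold."""


def read_or_default(path, default):
    """'File not found' has a known cause, so handling it is reasonable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        # Only this one, well-understood failure is handled.  Anything
        # else propagates to the caller untouched.
        return default


def in_order_keys(keys):
    """Stand-in for a tree traversal that checks its ordering invariant."""
    for previous, current in zip(keys, keys[1:]):
        if previous > current:
            # Nobody upstream should catch this and carry on.
            raise TreeCorruptError(f"{previous!r} found before {current!r}")
    return keys
```

Note that there is no catch-all handler anywhere in the sketch: a blanket `except Exception` around these calls would swallow the corruption error along with the benign one.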

 

Edit: Cleaned up wording in the footnote.

  • I like the principle: "You should handle an exception only if you know what to do with it."

  • Doug: Works for me, but only for C++ exceptions (and RPC exceptions, which are essentially the same as C++ exceptions except they're propagated by SEH).

  • Larry,

    I think I may be a little unclear so I ask for your help.

    In your example of the binary search tree, if it throws a tree corrupt exception, what would be wrong with wiping out the tree, making a new tree, and starting over?

    Also, I assume you are not talking about other exceptions, such as one thrown because an Access database you are trying to connect to doesn't exist, in which case you tell the user to either enter a new path or give them the option to close the program gracefully.

    Thank you for your assistance in helping me understand.

    P.S.

    I like the idea of "handle an exception if and only if you know what to do with it."  But I would extend that to languages such as those of .Net and Java.  Then again, you may have managed environments in a whole new category.  

    JamesNT

  • Maybe I'm not reading you right, but are you saying that if someone powers down Google's datacenter and my C#-implemented browser gets some sort of TimeoutException from the TCP stack, the correct thing is for my browser to crash?  

  • JamesNT: That might be ok, IF you can guarantee that the only cause of the tree corruption failure is that the tree's internal state is corrupt.  

    But if the tree corruption error is thrown because of something else (I don't know, maybe it was because of an error in an underlying heap manager that was rethrown as a tree corruption error), you can't.

    And that's exactly my point.  When you encounter an exception you don't FULLY understand, you can make NO assumptions about the state of the process.  And the only safe action to take at that point is to die and let the OS restart you if possible.

    The "access database you are trying to connect to doesn't exist" scenario is analogous to my "file not found" example - in that case, the exception really isn't "exceptional", it's just a mechanism used by the database library to communicate an error and you handle it just like you handle any other error.

  • Reliability is a complicated thing.  There's a tradeoff between availability and integrity,  and that tradeoff becomes more severe as a system becomes larger and more distributed.  UNIX tends to choose availability over integrity,  and Windows does the opposite.

    You're more likely to find some funny characters at the end of a file on a UNIX system after a crash,  and more likely to have a Windows machine give up the ghost or let a badly written application lock up your desktop for a few minutes.

    Life-critical systems can't shut down just because something unexpected happened.  Neither can large scale web sites or e-commerce systems.  There's a whole art of system recovery,  partitioning of corruption,  and having the system stay in a 'sane' state that isn't necessarily correct.

    People have different expectations for desktop apps:  people expect to have them crash and lose their work.  That's one of the reasons why the world is giving up on desktop apps.

  • John: You're confusing exceptions and errors (it's really easy to confuse the two).

    Exceptions are supposed to be used to handle exceptional events (like corrupted internal state).  They're not the same as errors (which are used to express "normal" failures).  

    The only kind of exception handling that is unilaterally bad is structured exception handling (except in VERY limited circumstances like handling RPC failures and kernel mode probes of user mode addresses).  

    See my footnote: C++ and C# and Java exceptions might be ok IF you can guarantee you know the reason for the failure.

    I'm not aware of any networking stacks that use SEH to represent network failures.

  • It's not that hard: if it's a known exception that can be handled, handle it.  If it's an unknown/unexpected exception, crash and report.

  • From that article, I didn't get the idea that continuing from exceptions was considered a good practice. I only got the idea that using Watson alone to handle crashes is not sufficient.

    MSN

  • It depends on your application domain.  In my desktop, userland world, crashing is a wonderful option.

    In my brother's medical device world, crashing means a kid stops breathing.  The FDA kinda insists that software in such devices fails in a safe way.  Crashing isn't a safe way if it's providing life support to the user.  Restarting the process may or may not be depending on the situation.

    Aside from that, I agree that assertions only belong in debug builds.  But if you did have something assertion-like in a release build, then it should be treated just as critically as an exception.  If your assertion failed, then your program is in an unknown, illegal, or improper state--just as it would be if an exception is thrown.  Even if the assertion itself is the bug, others who wrote the code that follows may be counting on the assumption it represents.  Report and bail out.

  • I agree with the statement "You should handle an exception only if you know what to do with it.".

    Somehow, it seems to me that the more you try, the harder you fall. If you intend to make a system more reliable, the few failures will be even bigger headaches.

  • I do believe I see where Larry is coming from now.  Exceptions for things such as missing files, incorrect database passwords, and things of that nature you can handle yourself since either you know the answer, can give the user a chance to answer what needs to be done (i.e. enter the correct password or path), or can allow the program to exit gracefully.

    But for those exceptions where you don't have the slightest idea as to what could have happened, don't try to continue since that is analogous to ignoring there is a problem.  Let the program die in flames, then open up a formal investigation to see what happened.

    Programs that attempt to continue after an unknown exception actually sound dubious when you think about it.

    JamesNT

  • I think Eric just picked a really bad quote to start off his article with.  His article doesn't really advocate "catching the exception and trying to correct it" as the quote may suggest.  The closest thing to it that was advocated was "retry"ing an operation, and the examples he described have nothing to do w/ catching an exception and correcting it.

    The overall point of the article is really to make error recovery less disruptive to the user experience, and that applications need to be written with that in mind.

    This kinda reminds me of the MobileSafari browser on the iPhone and iPod Touch.  There have been several times where it clearly crashed, but what happens is that the OS simply closes the browser without telling the user anything.  It builds up the crash dumps silently on the device and those get sent to Apple when you sync the device thru iTunes (of course they don't call it crash dumps, but something like "customer data to improve the software").  I honestly don't think this business of "hiding" the fact that it crashed is really that much of an improvement but I can see users getting fooled into thinking that things are working better than they actually do, and well, user perception is king.  (As an anecdote, this doesn't always work anyway; once my iPod Touch actually wound up in a hard freeze that required a full power off/power on to reset the device.)

  • Hang on....under Windows I thought structured exceptions were the basic exception type, and that C++ exceptions were implemented as structured exceptions.

    But you're saying that C++ exceptions are not implemented as SEs?

  • Exceptions should be used for their original intended purpose: exceptional circumstances. All those C++ exceptions for "errors" like file not found are just unnecessary complexity; they should be replaced by error codes.

    When an application catches an exception, it should save the work in progress as much as possible, and then exit and let error reporting take over.
