Larry Osterman's WebLog

Confessions of an Old Fogey

Resilience is NOT necessarily a good thing


  • Comments 66

I just ran into this post by Eric Brechner, who is the director of Microsoft's Engineering Excellence center.

What really caught my eye was his opening paragraph:

I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it's better to crash and let Watson report the error than it is to catch the exception and try to correct it.

Wow.  I'm not going to mince words: What a profoundly stupid assertion to make.  Of course it's better to crash and let the OS handle the exception than to try to continue after an exception.

 

I have a HUGE issue with the concept that an application should catch exceptions[1] and attempt to correct them.  In my experience, handling exceptions and attempting to continue is a recipe for disaster.  At best, it turns an easily debuggable problem into one that takes hours of debugging to resolve.  At its worst, exception handling can either introduce security holes or render security mitigations irrelevant.

I have absolutely no problems with fail fast (which is what Eric suggests with his "Restart" option).  I think that restarting a process after the process crashes is a great idea (as long as you have a way to prevent crashes from spiraling out of control).  In Windows Vista, Microsoft built this functionality directly into the OS with the Restart Manager: if your application calls the RegisterApplicationRestart API, the OS will offer to restart your application if it crashes or becomes non-responsive.  The same concept shows up in the service restart options of the ChangeServiceConfig2 API (if a service crashes, the OS will restart it, provided you've configured it to do so).
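The "prevent crashes from spiraling out of control" part can be sketched independently of the Windows APIs. Below is an illustrative C++ class (the class name and policy are invented for this example; this is not the Restart Manager) that refuses further restarts once too many crashes occur within a sliding time window:

```cpp
#include <chrono>
#include <deque>

// Decide whether a crashed process should be restarted, refusing once
// too many crashes have occurred within a sliding time window.
// Hypothetical sketch: names and policy are for illustration only.
class RestartThrottle {
public:
    RestartThrottle(int maxCrashes, std::chrono::seconds window)
        : maxCrashes_(maxCrashes), window_(window) {}

    // Record a crash at time 'now' and report whether a restart is allowed.
    bool ShouldRestart(std::chrono::steady_clock::time_point now) {
        crashes_.push_back(now);
        // Drop crashes that have fallen out of the window.
        while (!crashes_.empty() && now - crashes_.front() > window_)
            crashes_.pop_front();
        return static_cast<int>(crashes_.size()) <= maxCrashes_;
    }

private:
    int maxCrashes_;
    std::chrono::seconds window_;
    std::deque<std::chrono::steady_clock::time_point> crashes_;
};
```

A supervisor using this would restart the child while ShouldRestart returns true and give up (or escalate to the user) once it returns false, which is essentially the behavior the OS-level mechanisms provide.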

I also agree with Eric's comment that asserts that cause crashes have no business living in production code, and I have no problems with asserts logging a failure and continuing (assuming that someone is actually going to look at the log and can understand its contents; otherwise the logs just consume disk space).
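A "log and continue" assert of the kind described here might look like the following. This is a hypothetical macro for illustration; the counter and message format are invented, not taken from any Microsoft codebase:

```cpp
#include <cstdio>

// Count of assertion failures seen so far (so tests or telemetry can
// observe that something went wrong even though we kept running).
static int g_assertFailures = 0;

// A release-mode assert that records the failure and continues,
// instead of bringing the process down. Only useful if someone
// actually reads the resulting log.
#define LOG_ASSERT(cond)                                               \
    do {                                                               \
        if (!(cond)) {                                                 \
            ++g_assertFailures;                                        \
            std::fprintf(stderr, "Assertion failed: %s (%s:%d)\n",     \
                         #cond, __FILE__, __LINE__);                   \
        }                                                              \
    } while (0)
```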

 

But I simply can't wrap my head around the idea that it's ok to catch exceptions and continue to run.  Back in the days of Windows 3.1 it might have been a good idea, but after the security fiascos of the early 2000s, any thoughts that you could continue to run after an exception has been thrown should have been removed forever.

The bottom line is that when an exception is thrown, your program is in an unknown state.  Attempting to continue in that unknown state is pointless and potentially extremely dangerous - you literally have no idea what's going on in your program.  Your best bet is to let the OS exception handler dump core; with luck your customers will submit those crash dumps to you so you can debug the problem post-mortem.  Any other attempt at continuing is a recipe for disaster.

 

-------

[1] To be clear: I'm not necessarily talking about C++ exceptions here, just structured exceptions.  For some C++ and C# exceptions, it's ok to catch the exception and continue, assuming that you understand the root cause of the exception.  But if you don't know the exact cause of the exception you should never proceed.  For instance, if your binary tree class throws a "Tree Corrupt" exception, you really shouldn't continue to run, but if opening a file throws a "file not found" exception, it's likely to be ok.  For structured exceptions, I know of NO circumstance under which it is appropriate to continue running.
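The footnote's rule - catch only exceptions whose root cause you fully understand, and let everything else terminate the process - can be illustrated in C++. The exception types and function below are hypothetical, standing in for the "file not found" and "Tree Corrupt" examples:

```cpp
#include <stdexcept>
#include <string>

// Expected, well-understood failure: safe to catch and handle.
struct FileNotFoundError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Invariant violation: the program's state is suspect; never swallow it.
struct TreeCorruptError : std::logic_error {
    using std::logic_error::logic_error;
};

// Hypothetical loader illustrating the selective-catch pattern.
std::string LoadOrDefault(bool fileExists) {
    try {
        if (!fileExists)
            throw FileNotFoundError("settings.ini not found");
        return "loaded";
    } catch (const FileNotFoundError&) {
        // Root cause is known and benign: fall back to defaults.
        return "defaults";
    }
    // TreeCorruptError (or anything else) is deliberately NOT caught:
    // it propagates up and the process fails fast.
}
```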

 

Edit: Cleaned up wording in the footnote.

  • Tony: That's the rub.  How do you know that the document is in a consistent state in memory?

    It's not an easy challenge.  That's why I'm continuing to state that crashing is the right thing to do.

  • Tom M:

    Not quite... the idea is that I want a customized application to act as the "debugger." Actually, what I probably would do is just have the SAME application launched a second time, and have it use the regular Win32 debug APIs to analyze the crashed instance and recover data. It's mostly the same as an in-process exception handler, but the process separation would make it much more robust (although more difficult to write).

    Larry:

    Yes, the certificate is the blocking point. I've never been able to find very good information on what exactly is required for OCA access for individuals like me. The WinQual site itself offers little information, and I've seen conflicting information in various places. Cost is one issue, although it looks like a single $99 certificate is sufficient. The other big issue is that everything I've seen regarding WinQual and the Class 3 Verisign certificate required to sign up for it only refers to companies -- it looks like individual developers not associated with an official business entity aren't eligible. All of the OCA literature also refers to companies, which doesn't encourage me to spend money on an experiment.

    The uncertainty as to whether I could participate in OCA wouldn't bother me except that there seems to be a recent trend toward blocking non-OCA diagnosis methods, unintentional or not. What really pissed me off was when I found out that the Visual C++ library team stuck code in the 8.0 CRT that explicitly tears off any existing exception handler and calls Watson directly. I think that unless the Windows and WinQual teams ensure that small ISVs can participate and provide clear directions for doing so, it isn't appropriate to assume that everyone can use Watson + WER + OCA.

    Don't get me wrong, I'd love to get OCA reports for my application, even if the ones that fell through that path didn't have all of the diagnostic information that my app's normal exception handler dumps. I take all crash reports seriously. There's just too much ambiguity and uncertainty involved in getting set up. I haven't found any report from a Microsoft employee along the lines of, "yes, we've successfully had individuals not associated with a business sign up to WinQual for crash reports with just certificate X."

    On a side note, I just realized... regarding your comment about asserts that cause crashes in release code: doesn't the NT kernel do exactly that?

  • >[Larry]: How do you know that the document is in a consistent state in memory?  It's not an easy challenge.  That's why I'm continuing to state that crashing is the right thing to do.

    I read that as "it's hard, therefore we shouldn't try".  But I know that's not what you mean, because you have some great examples of exception handling done right.

    I think many of the commenters on both blogs (Eric's and yours) are looking at the problem too coarsely; too black and white.  How do you know that the document is in a consistent state in memory?  By designing in consistency checks.  Little extra validation routines that take advantage of a little extra redundancy built into your document's memory representation.

    This isn't really novel work.  It's just work that hasn't traditionally been a priority at Microsoft outside of specialized teams.  But I believe, and I think this is Eric's point as well, that Trustworthy Computing includes Highly Available Software and that we need to tackle the hard challenges associated with that.
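Alan's "little extra validation routines" exploiting "a little extra redundancy" can be sketched with a classic example: a doubly linked list, where the back links are redundant with the forward links and can be cross-checked. Illustrative C++, not from any real codebase:

```cpp
// A doubly linked list carries redundant information: every forward
// link should have a matching back link. A validation routine can
// cross-check the two views to detect in-memory corruption.
struct Node {
    Node* next = nullptr;
    Node* prev = nullptr;
};

// Returns true iff forward and backward links agree for every node.
bool ValidateLinks(const Node* head) {
    const Node* prev = nullptr;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (n->prev != prev)
            return false;  // redundancy check failed: corruption
        prev = n;
    }
    return true;
}
```

A document model could run checks like this before deciding whether its state is trustworthy enough to act on, which is the "consistency checks by design" idea in the comment above.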

  • One question.

    When JIT translates callvirt into x86 assembly. It turns it into

    mov eax, [ecx]

    call whatever

    The move will raise an exception if 'this' is null, because the processor will attempt to read from address 0.  This exception is caught and eventually turned into a NullReferenceException.

    But [ecx] could equally reside outside committed memory, and that would be turned into a NullReferenceException too.

  • Badar: I think that it operates on the assumption that in managed code all object references must be valid or null, because you have no way to generate a pointer to an object that points into random memory.

  • How does Watson decide what memory to include in the dump sent to Winqual? Sometimes I've needed to check a structure whose reference was passed as a parameter a few frames up the call stack. Unfortunately, most of the time that memory was not included in the dump.

  • Alan, I'm willing to concede that for some exceptions it may be possible to catch them and terminate after saving state.  But for many of them (like STATUS_ACCESS_VIOLATION, which is likely to be the most common one) there is no safe way of running ANY additional code in the process.  We have far too many examples of security vulnerabilities caused by people trying to be resilient in the face of an access violation for that kind of practice to be considered safe.

    Jiri, by default Watson generates a minidump, which consists of the stack and thread context for each thread.  You may add additional data to the dump with WerRegisterMemoryBlock.  See here for more details: http://msdn2.microsoft.com/en-us/library/ms678713(VS.85).aspx

    Phaeron: As far as I know, the NT kernel doesn't have many asserts that are live.  Some of them (page fault at raised IRQL) are sort-of asserts, but they exist because (a) in all circumstances, a page fault at IRQL 2 is a bug, and more importantly (b) there is no way to satisfy the page fault.

  • > You may add additional data to the dump with WerRegisterMemoryBlock.

    What I would like is equivalent of MiniDumpWithIndirectlyReferencedMemory. While that might be doable with the WerRegisterMemoryBlock and manual stack walk, I do not think that attempting to do that from the exception handler or crashing process is a good idea.

  • Phaeron:

    The main reason the NT Kernel bluescreens (i.e. 'asserts') is to protect the disk data and metadata from corruption.

    I disagree with Larry's point that assertions should not be in released code.  I just think we don't do it for performance reasons.  

  • >[Larry]: But for the many of them (like STATUS_ACCESS_VIOLATION, which is likely to be the most common one) there is no safe way of running ANY additional code in the process.

    I think that statement is also too absolute (too black and white).  Even Dave LeBlanc points out (in a post linked above) that it's not about risk avoidance, it's about risk management -- shades of grey.

    But you yourself point out the perfect counterexample: catching access violations while probing parameters across an untrusted->trusted boundary.  In other words, there are patterns and practices where even catching access violations, in a controlled way, can increase both resiliency and security.

  • Alan, here's the problem.  Let's say you have all sorts of internal consistency checks and you KNOW that your data structures are likely to be intact.  So you install a top-level exception handler wrapped around all your code that saves your state and exits.

    How do you know that the reason the exception handler was called isn't that some attacker has defeated a validation check in your code (see http://www.matasano.com/log/1032/this-new-vulnerability-dowds-inhuman-flash-exploit/ for the classic example of this), enabling him to exploit an error in your exception handler?

    By their very nature, exception handlers are less tested than the rest of your code, and they're vastly more dangerous.  As I mentioned above, the Watson team has literally spent years refining the built-in exception handler code (which does nothing but dump core) to reduce the likelihood of vulnerabilities in that code.  You're proposing that not only should the exception handler be application-specific, but also that it invoke the save-file handler and potentially put up UI.  That means even MORE code is being run while the application is in an unknown state.

    I think a far better solution is to do what Office does: it installs an application restart handler and checkpoints its state periodically.  When it recovers from a crash, it looks for one of the checkpoint files and attempts to recover it.  You should also add a bunch of validation checks to the recovery process, because you don't know the full state of the file (I don't know if Office does this).  It means that you get the resiliency you desire WITHOUT the threats associated with running code after an access violation.

    IE has a similar solution (and they're improving it for IE8).  In IE8, if an IExplore instance crashes, it doesn't affect the IE hosting application, which will restart the instance in-frame.
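The Office-style pattern Larry describes - checkpoint state periodically from the healthy process, then validate during recovery rather than saving from a crashing process - might be sketched like this. The structure, function names, and checksum are all hypothetical, standing in for the validation checks he recommends:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// A checkpoint written by the healthy process: payload plus a checksum,
// so the *recovering* process can validate it after a crash.
struct Checkpoint {
    std::vector<uint8_t> payload;
    uint32_t checksum;
};

// Simple Fletcher-style checksum over the payload bytes.
uint32_t Checksum(const std::vector<uint8_t>& bytes) {
    uint32_t a = 1, b = 0;
    for (uint8_t c : bytes) {
        a = (a + c) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

// Called periodically while the process is healthy -- never from an
// exception handler, so no code runs in an unknown state.
Checkpoint WriteCheckpoint(const std::vector<uint8_t>& state) {
    return Checkpoint{state, Checksum(state)};
}

// Called by the restarted process; refuses checkpoints that fail
// validation rather than trusting possibly corrupt data.
std::optional<std::vector<uint8_t>> Recover(const Checkpoint& cp) {
    if (Checksum(cp.payload) != cp.checksum)
        return std::nullopt;  // corrupt: start fresh instead
    return cp.payload;
}
```

The key property is that all of this code runs either before the crash or in a fresh process after restart; nothing executes inside the crashed process's unknown state.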

  • > How do you know that the reason that the exception handler was called was because some attacker has exploited a validation check in your code that has enabled him to exploit an error in your exception handler?

    > I think a far better solution is to do what Office does: it installs an application restart handler and checkpoints its state periodically. [...] It means that you get the resiliency you desire WITHOUT the threats associated with running code after an access violation.

    I don't think this by itself is a good justification for pushing the recovery process out of process. If the exploited vulnerability was due to malicious data, and the exceptional path is less reliable because it's exercised less, then I'd say the vulnerability could hit the recovery code just as easily as the main code. Given that save routines are often agnostic to the data being serialized, the autosave routine may just push the corrupted data to disk without itself crashing.

    Pushing the recovery handler out of process does greatly reduce the risk of recursive crashes due to general process badness, but I'd say it's weak security-wise unless there's something fundamentally different than the original process, such as it runs with extremely limited process privileges IE7+Vista style, or it's written in a different language such that the original vulnerability is impossible. I'm not a fan of .NET for mainstream desktop applications, but I could see using it for the recovery app.

  • >[Larry]: ...you install a top level exception handler ...

    Let me make a clarification.  I think many readers are assuming (and a lot of reaction is coming from this) that Eric and I are talking about recovery from top-level (global unhandled) exception handlers.  Speaking for myself, I am not.  The only thing I've ever done in a global unhandled exception handler (in a managed-code hosted service) is log and die.

    Since state management is really the hard problem we're discussing, the scope of any exception handling I'm proposing must be limited and controlled such that state is manageable.  

    (Yes, I read the Flash story.  Truly amazing.)

  • Alan, ah, that makes a lot more sense to me - I use locally scoped exception handlers a lot (I live in an error-code based world where exceptions are evil).  As I've said before, in certain limited scenarios (kernel/user parameter probes, RPC error handling, etc) locally scoped handlers can be quite useful.  

    I HAD assumed (based on the recovery behaviors that Eric was proposing) that you and he were promoting the idea of global top level exception handlers.

  • "At best, it takes an easily debuggable problem into one that takes hours of debugging to resolve."

    If a crash happens on an end-users PC, then it is extremely likely NOT EASILY DEBUGGABLE. Most of them are not developers. So, where is the benefit?

    Programs should try to recover as much as they safely can - but not more. And they should tell - in simple words - what went wrong.

    Did you notice that Vista does not just abort a copy operation because of a full drive, but actually allows you to free up space and retry and finish the rest of the operation? (Handy for that memory stick that's always full.)

    "The bottom line is that when an exception is thrown, your program is in an unknown state."

    Agreed. But what about all the _known_ states that a programmer could, but did not handle?

    I can actually do without a functioning spell checker if all I want to do is print a page of a document. But I can't put up with an app dying, or even a BSOD, just because some data file -- one not even necessary for the task at hand -- was not found.

    If I understood Eric's article correctly, he focused on the user experience, not the developer experience. :-)
