Larry Osterman's WebLog

Confessions of an Old Fogey

Structured Exception Handling Considered Harmful


I could have sworn that I wrote this up before, but apparently I’ve never posted it, even though it’s been one of my favorite rants for years.

In my “What’s wrong with this code, Part 6” post, several of the commenters indicated that I should be using structured exception handling to prevent the function from crashing.  I couldn’t disagree more.  In my opinion, SEH, if used for this purpose, takes simple, reproducible, easy-to-diagnose failures and turns them into hard-to-debug subtle corruptions.

By the way, I’m far from being alone on this.  Joel Spolsky has a rather famous piece “Joel on Exceptions” where he describes his take on exceptions (C++ exceptions).  Raymond has also written about exception handling (on CLR exceptions).

Structured exception handling is in many ways far worse than C++ exceptions.  There are multiple ways that structured exception handling can truly mess up an application.  I’ve already mentioned the guard page exception issue.  But the problem goes further than that.  Consider what happens if you’re using SEH to ensure that your application doesn’t crash.  What happens when you have a double free?  If you don’t wrap the function in SEH, then it’s highly likely that your application will crash in the heap manager.  If, on the other hand, you’ve wrapped your functions with try/except, then the crash will be handled.  But the problem is that the exception caused the heap code to blow past the release of the heap critical section – the thread that raised the exception still holds the heap critical section. The next attempt to allocate memory on another thread will deadlock your application, and you have no way of knowing what caused it.
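
To make the failure mode concrete, here’s a minimal sketch (the function names are mine, not from any real codebase) of the pattern described above:

    #include <windows.h>

    void ProcessRequest(char *buffer)
    {
        HeapFree(GetProcessHeap(), 0, buffer);
        HeapFree(GetProcessHeap(), 0, buffer);    // double free: the heap code may raise
                                                  // an exception while it holds its lock
    }

    void HandleRequest(char *buffer)
    {
        __try {
            ProcessRequest(buffer);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            // The exception is swallowed here, but the faulting thread may still own
            // the heap's internal critical section.  The next HeapAlloc on any other
            // thread can then deadlock, far away from the actual bug.
        }
    }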

The example above is NOT hypothetical.  I once spent several days trying to track down a hang in Exchange that was caused by exactly this problem – because a component in the store didn’t want to crash the store, they installed a high-level exception handler.  That handler caught the exception in the heap code and swallowed it.  And the next time we came in to do an allocation, we hung.  In this case, the offending thread had exited, so the heap critical section was marked as being owned by a thread that no longer existed.

Structured exception handling also has performance implications.  Structured exceptions are considered “asynchronous” by the compiler – any instruction might cause an exception.  As a result of this, the compiler can’t perform flow analysis in code protected by SEH.  So the compiler disables many of its optimizations in routines protected by try/catch (or try/finally).  This does not happen with C++ exceptions, by the way, since C++ exceptions are “synchronous” – the compiler knows if a method can throw (or rather, the compiler can know when a method will not throw).

One other issue with SEH was discussed by Dave LeBlanc in Writing Secure Code, and reposted in this article on the web.  SEH can be used as a vector for security bugs – don’t assume that because you wrapped your function in SEH, your code will not suffer from security holes.  Googling for “structured exception handling security hole” leads to some interesting hits.

The bottom line is that once you’ve caught an exception, you can make NO assumptions about the state of your process.  Your exception handler really should just pop up a fatal error and terminate the process, because you have no idea what’s been corrupted during the execution of the code.
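
If you do want a top-level handler, one reasonable shape for it (this is just a sketch of the “report and terminate” approach, not a prescription) is a filter that records the exception and then kills the process rather than continuing:

    #include <windows.h>
    #include <stdio.h>

    // Top-level "report and die" filter: record what happened, then terminate
    // instead of letting the process keep running in an unknown state.
    static LONG WINAPI FatalExceptionFilter(EXCEPTION_POINTERS *info)
    {
        fprintf(stderr, "Fatal exception 0x%08lx at %p - terminating.\n",
                info->ExceptionRecord->ExceptionCode,
                info->ExceptionRecord->ExceptionAddress);
        // Production code would write a minidump or report the fault here.
        TerminateProcess(GetCurrentProcess(),
                         (UINT)info->ExceptionRecord->ExceptionCode);
        return EXCEPTION_CONTINUE_SEARCH;    // never reached
    }

    int main(void)
    {
        SetUnhandledExceptionFilter(FatalExceptionFilter);
        // ... application code ...
        return 0;
    }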

At this point, people start screaming: “But wait!  My application runs 3rd party code whose quality I don’t control.  How can I ensure 5 9’s reliability if the 3rd party code can crash?”  Well, the simple answer is to run that untrusted code out-of-proc.  That way, if the 3rd party code does crash, it doesn’t kill YOUR process.  If the 3rd party code that is processing a request crashes, then the individual request fails, but at least your service didn’t go down in the process.  Remember – if you catch the exception, you can’t guarantee ANYTHING about the state of your application – it might take days for your application to crash, thus giving you a false sense of robustness, but…
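
As a sketch of what the out-of-proc approach can look like (the worker executable name and the error handling here are made up for illustration), the host launches a separate process per request and judges success by its exit code:

    #include <windows.h>
    #include <stdio.h>

    // Run the untrusted plugin in its own process: if it crashes, only this
    // request fails; the hosting service keeps running.
    BOOL RunUntrustedPlugin(const char *requestFile)
    {
        char cmdLine[MAX_PATH * 2];
        STARTUPINFOA si = { sizeof(si) };
        PROCESS_INFORMATION pi;
        DWORD exitCode = 1;

        sprintf_s(cmdLine, sizeof(cmdLine), "plugin_host.exe \"%s\"", requestFile);

        if (!CreateProcessA(NULL, cmdLine, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
            return FALSE;

        WaitForSingleObject(pi.hProcess, INFINITE);
        GetExitCodeProcess(pi.hProcess, &exitCode);
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);

        return exitCode == 0;    // a crashed worker shows up as a failed request
    }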

 

PS: To make things clear: I’m not completely opposed to structured exception handling.  Structured exception handling has its uses, and it CAN be used effectively.  For example, all NT system calls (as opposed to Win32 APIs) capture their arguments in a try/except handler.  This is to guarantee that the version of the arguments to the system call that is referenced in the kernel is always valid – there’s no way for an application to free the memory on another thread, for example.
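
Roughly, the pattern looks like this (this is not actual NT source, just a sketch of probe-and-capture, and EXAMPLE_INFORMATION/ExampleProcessInformation are made-up names):

    #include <ntddk.h>

    typedef struct _EXAMPLE_INFORMATION {
        ULONG Flags;
        ULONG Value;
    } EXAMPLE_INFORMATION;

    NTSTATUS ExampleProcessInformation(EXAMPLE_INFORMATION *Info);   // hypothetical worker

    NTSTATUS NtExampleSetInformation(PVOID UserBuffer, ULONG Length)
    {
        EXAMPLE_INFORMATION capturedInfo;

        if (Length < sizeof(capturedInfo)) {
            return STATUS_INFO_LENGTH_MISMATCH;
        }

        __try {
            // Verify the buffer really is readable user-mode memory, then copy it.
            ProbeForRead(UserBuffer, Length, sizeof(ULONG));
            RtlCopyMemory(&capturedInfo, UserBuffer, sizeof(capturedInfo));
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            // A bad pointer (or memory freed by another thread mid-copy) becomes
            // a clean error return instead of a crash in the kernel.
            return GetExceptionCode();
        }

        // From here on only the captured copy is used; user mode can no longer
        // change or free it out from under the kernel.
        return ExampleProcessInformation(&capturedInfo);
    }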

RPC also uses exceptions to differentiate between RPC-initiated errors and the remoted function’s own return values – the exception is essentially used as a back channel to provide additional error information that could not be provided by the remoted function.
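
On the client side, that back channel shows up as the familiar RpcTryExcept pattern (GetServerTime below stands in for a hypothetical MIDL-generated remoted function):

    #include <windows.h>
    #include <rpc.h>
    #include <stdio.h>

    DWORD GetServerTime(void);    // hypothetical remoted function from an IDL file

    DWORD CallServerTime(void)
    {
        DWORD serverTime = 0;

        RpcTryExcept {
            // The function's own return value comes back the normal way...
            serverTime = GetServerTime();
        }
        RpcExcept (1) {
            // ...while RPC-initiated failures (server unreachable, endpoint gone,
            // call cancelled) arrive as exceptions on this separate channel.
            printf("RPC failure: %lu\n", RpcExceptionCode());
        }
        RpcEndExcept

        return serverTime;
    }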

Historically (I don’t know if they do this currently) the NT file-systems have also used structured exception handling extensively.  Every function in the file-systems is protected by a try/finally wrapper, and errors are propagated by throwing exceptions.  This way, if any code DOES throw an exception, every routine in the call stack has an opportunity to clean up its critical sections and release allocated resources.  And IMHO, this is the ONLY way to use SEH effectively – if you want to catch exceptions, you need to ensure that every function in your call stack also uses try/finally to guarantee that cleanup occurs.
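
In code, that discipline looks like this in every routine that owns a resource (the lock and helper names below are made up):

    #include <windows.h>

    extern CRITICAL_SECTION CacheLock;      // assumed to be initialized elsewhere
    extern void WriteEntry(void *entry);    // hypothetical helper that may raise

    void UpdateCache(void *entry)
    {
        EnterCriticalSection(&CacheLock);
        __try {
            WriteEntry(entry);
        } __finally {
            // Runs on normal return AND on exception unwind, so the lock can
            // never be left orphaned by a fault further down the call stack.
            LeaveCriticalSection(&CacheLock);
        }
        // Any exception keeps propagating upward, where each caller's own
        // __finally block gets its chance to clean up as the stack unwinds.
    }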

Also, to make it COMPLETELY clear.  This post is a criticism of using C/C++ structured exception handling as a way of adding robustness to applications.  It is NOT intended as a criticism of exception handling in general.  In particular, the exception handling primitives in the CLR are quite nice, and mitigate most (if not all) of the architectural criticisms that I’ve mentioned above – exceptions in the CLR are synchronous (so code wrapped in try/catch/finally can be optimized), the CLR synchronization primitives build exception unwinding into the semantics of the exception handler (so critical sections can’t dangle, and memory can’t be leaked), etc.  I do have the same issues with using exceptions as a mechanism for error propagation as Raymond and Joel do, but that’s unrelated to the affirmative harm that SEH can cause if misused.

  • Niclas,

    Remember that we're talking about SEH here, not exceptions in general, so we're looking at pretty bad events like access violations and guard page exceptions. I can probably count on one hand the number of cases where a program could properly recover from these events. Usually they're indicative of a bug in your code.

    For example, as I mentioned in my last post, I removed a __try/__except(1) block that was wrapping an entire program. The program in question was a server, and if it caught an exception, it would log it and then happily go on serving clients. But the program couldn't tell what had caused that exception to be thrown, and something like an access violation often points to very bad things like memory corruption. So, rather than trying to keep going, it was better to crash and let the service control manager restart the server in a clean state.
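
    Schematically, the wrapper looked something like this (reconstructed from memory; the helper names are made up):

        #include <windows.h>

        extern void ServiceNextClientRequest(void);    // per-request worker
        extern void LogEvent(DWORD exceptionCode);     // logging helper

        void ServerMainLoop(void)
        {
            for (;;) {
                __try {
                    ServiceNextClientRequest();
                } __except (1) {                        // 1 == EXCEPTION_EXECUTE_HANDLER
                    // Log and keep serving.  The process stays up, but after an
                    // access violation it is running on possibly corrupted state,
                    // with no record of what went wrong beyond this log entry.
                    LogEvent(GetExceptionCode());
                }
            }
        }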
  • At this point I'm pretty well convinced that it's nigh unto impossible to write reliable software that uses exceptions for error propagation. To get a flavor, see http://blogs.msdn.com/mgrier/archive/2004/02/18/75324.aspx.

    Exceptions only really work reliably when nobody catches them.

    And in that case, I don't understand why we don't just call something like BugcheckApplication() instead of throwing an actual catchable exception.

    It was clever on VMS to have continuable exceptions which led to the SEH design on NT. I'm not sure that giving code the ability to do fun things like user-mode fixups of things like uncommitted virtual address space or adjust FP results etc. is worth the complexity that this design entails.

    Catching exceptions in an exception rich environment (like the CLR or Java for example) is nearly impossible to do correctly. If we were to start writing in C again, it's do-able but the fact that all the new languages include capabilities like implicit conversions and operator overloading means that it's impossible to understand whether the scope of the try/catch is correct. (And even if it was correct, changes to other parts of the code can invalidate your careful analysis and coding.)

    So, Larry's point is entirely valid but once you accept it, it's not hard to see that the use of exceptions in modern languages fundamentally makes it impossible to write reliable software.

    Which is, of course, funny since most people think that exceptions are about writing reliable software finally. Well, I guess throwing the exceptions is OK. It's just those super geniuses who think that they can catch them that mess it all up. :-)

  • Anything is better than crash simply because user still has a chance to save his work. Yeah, corruption may happen and app should warn user that exception has been caught and it might be a good idea to restart the application. But it is BETTER than crash. In debug build crash is better.
  • > Anything is better than crash simply because
    > user still has a chance to save his work.

    For a text editor, maybe. For a non-interactive service that processes financial transactions, definitely no.

    And even in a text editor a crash is better than an undebuggable deadlock after some COM object corrupts the heap then swallows the resulting AV leaving the default process heap critical section orphaned.

    If you want to allow the user to save his work in case of an unhandled exception, that's fine. Nobody is saying you shouldn't do that. But catching unknown exceptions and not reporting them properly (using ReportFault() or something similar) is often worse than no exception handling at all.
  • Pavel:

    Well, then I would agree with you, if the design uses throw extensively. I do not like a design that throws extensively, as it complicates debugging =), that is why I tried to explain that exceptions should only happen in rare conditions. So it is not the exception theory itself that complicates it for you, it is the implementation of it.

    If the caught exception leaves the application with unreleased resources, then it was not caught at all the levels it needed to be caught at to clean up properly. It is not the exception's fault.

    You want a nice memory dump, but even if a nice memory dump is heaven, a logged stack trace and full details of the exception (and maybe even a hex dump of the surrounding memory) will provide you with almost equally interesting information, and you can let the memory dumps stay in house as much as possible.

    But I do agree with you: if you have a service that is not allowed to glitch, then don't start guessing about the state of your application. You don't want a $10 transaction turning into a $100000 one, unless of course it is your paycheck =)


    -------------

    Extensive use of exceptions is a bad thing according to me; I do not like the philosophy of Java/C#. To me it is just a lazy way of getting out of trouble and getting fewer nested if statements. But your code path has more than one exit point, and I don't like that, because it tends to trick the developer into resource leaks. I have seen so many cases where a developer grabs some resource when the function begins, and then added some check afterwards in the code that merely did a return in the middle of it, but forgot to return the resources. I believe in simple design.

    grab resource
    work with resource, record error/success
    release resource
    return error/success

    And in an extensive exception philosophy this would be

    grab resource
    try
    work with resource, throw on error
    catch or finally (cleanup, which is so much better)
    cleanup

    throw again if the work part threw, else return success.

    Which is a design that I do not like.

    I have had many of these discussions before, and the only way to convince anyone of the opposite is to show it in practice, implemented in a way which I think is proper and safe. I have never had complicated odd crashes due to it; instead I have slept better knowing that even if we have a bug we might survive, and if we didn't, then too bad. Our starting state is already a crash, so it can't get worse.

    One of the worst applications of exceptions I can think of is COM objects used with the non-raw interface wrappers, where any failure (E_) code is merely turned into a throw...


    Not far from this discussion is the question of whether you should leave asserts active in a release build. I surely don't think so, and assert in general is a lazy way to do things; it tends to lead the developer to not care about the failure scenario, and thus not do the proper cleanup. Why should he/she? It will crash on the assert anyway.

    Most of the examples brought up are very rare conditions of asynchronous exceptions; almost always they are much less harmful, and even more often they are merely a NULL pointer exception, which of course is a bug, but usually not fatal in any way to the program state. The non-NULL-pointer exceptions, however, are more scary.

    But it is just a question of determining which part of your program state is most likely corrupt; rinse that and go again.

    If the same exception keeps thrashing (because your estimate of which parts of your program state must be corrupt was wrong), then it is about time to abort. But then again, that logic applies to any kind of unconditional loop; if you don't supervise them in some way, they can turn into a possible hang.

    Catching exceptions is not in itself a way to make your application more robust, but it is one tool among many to make it more robust.
  • >> Anything is better than crash simply because user still has a chance to save his work. Yeah, corruption may happen and app should warn user that exception has been caught and it might be a good idea to restart the application.

    Dear User,

    Something evil this way comes. The application you are currently running did something really bad, but we don't really know what. I know what you are thinking, "Did I save my data five minutes ago or 30 minutes ago?" You have to ask yourself, "Am I feeling lucky?" Well do you punk? BTW, I would exit this application and restart.

    [OK]

    Problem #1: Users do not read dialog boxes. "Hey, I was typing and this dialog got in my way so I am going to OK it to get it to go away. I worked for three more hours and then when I went to save, it trashed my data."

    Problem #2: Users do not understand that concept of a program that has "crashed" but is still running. "Hey, I can still type, things must be good".

    Problem #3: Users do not understand that what they save might be totally trashed and cannot be read back in. They will save over the current version of their file. If you force them to save to another file name, they will curse you for forcing them to do something they don't want to do and then promptly delete their original and rename the newly saved trashed file. If they don't know how to rename files, they will just load up the old file and that trashed file will remain in their directory, haunting them until they get a new computer. (Nah, I've never seen this happen. Right...)

    Problem #4: In a mission critical application you run a great risk of sending bad data to other applications. I have seen this nearly happen. People can die from bad data. Usually caused by a series of procedural errors and a computer error. "Stupid didn't turn the panel into service mode and remove power before starting to work on the screw pump. A hardware fault then caused the software to fail and the screw pump was turned on."

    If an application has an "unexpected exception", all bets are off as far as any level of functionality.
  • "If an application has an "unexpected exception" , all bets are off as far as any level of functionality."

    True, but that does not mean that it will work better because you restart it. The user or the application will probably retry what he/she/it just did and probably hit the same bug again, rinse and repeat. If this is an online service I am sure you can already hear the phones ringing from the slightly upset customer.

    "Stupid didn't turn the panel into service mode and remove power before starting to work on the screw pump. A hardware fault then caused the software to fail and the screw pump was turned on."

    Any kind of memory corruption bug could cause this to happen, or any other kind of bug too, for that matter.

    Crashes are just too expensive when it comes to customers' perception of the stability of the application. In many cases it is better to stay alive and hope for the best, because most of the time it will be fine.
  • > If an application has an "unexpected
    > exception", all bets are off as far as any
    > level of functionality.

    That's taking it to the other extreme.

    Certainly there are applications where saving user's work in case of a crash makes sense. Like email processors for example.

    This is a separate issue from where to handle unknown exceptions and how to report them.
  • Trying to run code in an address space which is likely to be corrupt is just plain bad for the user. If you really want to preserve the value of the keystrokes/operations that occurred before the crash, then journal them!

    All the editors on VAX/VMS journalled; I was shocked to come to the PC world and find that we never do such things.

    This is a much smarter approach than to try to continue to run code in the corrupt address space.
  • Well, journalling might sound smart, but remember that journalling will recreate what you just did, which means if you hit the bug doing it, journalling to back track to it will most likely hit it again and voila... (at least if it is one of those bugs that I like to catch with this kind of exception handling, a state problem).

    There are very few occasions where you actually do corrupt the address space; most exceptions occur due to subtle race conditions where a pointer dangled, and these may or may not corrupt your user space. They may or may not corrupt the user space without you noticing it for a long time (that is, with no exception raised). An exception is not a receipt telling you that your address space is corrupt; it is more often than not a programming error, where the program happened to get into a state which wasn't fully analyzed. Rarely does that mean that the user space is corrupt.


    You can actually have user space corruption that you _never_ notice, and that actually didn't matter, and if it didn't matter why bother?

    As I said, if you have a program where you need to know 100% that the entire state of the application is healthy, then exceptions should not be part of such a model. But if you don't need that, exceptions will catch the numerous NULL pointer exceptions that hurt no one (unless the application crashes, of course).

    I have one very fine real-life occurrence, which happened only 2 days ago. Our system had a NULL pointer bug in the provisioning part of the system, when a certain order of events occurred. Our application sadly uses C on FreeBSD, so exceptions are not portable. Anyway, if we had nicely caught that exception with nice logging so we could fix it in the normal process, the customer could have kept using the system with a _slight_ defect.

    As it was now, it crashed the entire system (it is a Mobile IP Telephony exchange with approx 80000-100000 mean active users 24/7), repeatedly, since this one user kept running with the same setup. It turned into a class A red alert where we had to bring in the right people and go through an emergency build procedure. The costs of doing that are huge, and I would take the bad sides of exceptions any day of the week to avoid those.

    I should probably add that we wouldn't lose all 100k users right away, but the user causing the crash was moved around within the system each time it connected, so eventually it had crashed literally all users and kept doing so. So even with process separation you are far from safe, nor will the problem get any better because you reboot.

    The system was functioning flawlessly 99.9% of the time; the 0.1% that caused the crash made it 100% unusable. That is just not acceptable; anything is better than that.

    I frankly don't care if I could get into worse problems using exceptions, because they don't get worse than that. What should have happened was a nice little SNMP trap to the operator due to the exception, a nice little email with a trap log from the operator, and a call to a designer for analysis to advise the customer. This is by far cheaper than having to bring in half the staff to spin a new build for a missing "if (pPointer)"; imagine what those roughly 15 bytes of code cost.

    I know what you will say: the (possible) memory corruption could cause the system to, for instance, charge users more than it should, but I will wager a lot that the exception handling will not be the reason for that; it will be the X number of other bugs (yes, all applications have bugs, no matter how much testing is done; some just have less).

    Even if the exception had been a dangling pointer, chances are that it was a race and that it was removed prematurely; that will not have caused any harm to the user space either. They are all programmatic errors, which are _far_ more common than buffer overruns (which are the most common memory corruptions) or double deletes.

    The class of bugs that are dangerous to catch with exceptions (those which cause some kind of corruption) is _far_ smaller than the class of stupidity bugs that can be found in every application. And even if you do catch a bug which is a result of corruption, you are most likely able to clean up the state problem and restart that little part of the system.

    If we lived in an ideal world I would agree to never catch exceptions, but anywhere I turn my head the world is not ideal and all we do is try to plan and foresee all kinds of coming trouble.

    If trouble was never coming, then we wouldn't even have invented the word "exception".



    Those are my pennies and I will continue to be paranoid about crashing an application.
  • 9/12/2004 2:39 PM Niclas Lindgren

    > journalling will recreate what you just did,
    > which means if you hit the bug doing it,
    > journalling to back track to it will most
    > likely hit it again and voila

    Bingo. I once edited a journal, deleting the mention of the keystroke that was mishandled by the editor, so that replay of the modified journal would not hit the bug. Then I saved my work at that point, i.e. saving with a loss of one known keystroke instead of saving with a loss of a forgotten number of minutes and changes. Then I found some other way to proceed with the next necessary change.

    Theoretically the journal would also be of immense value to any coder who wanted to fix the editor.

    > we had to bring in the right people and go
    > through an emergency build procedure

    Well, I'll repeat here an idea which got me a black mark on my record at one previous employer, and which could have got me fired if my boss had been present. Maybe someone can say what was wrong with it. When the current build wasn't working and you had an emergency situation where you needed a working build, boot the previous working build. While the production system is operating on its previous build, configure one test system to match, boot the failing build there, and debug it on the test system. Orange flag emergency instead of red flag. During the time that the previous build is running, you don't get to charge customers for features that aren't being provided, but you still get to charge for basic telephone service and customers still have it.

    (Well actually yes I do know what was wrong with my suggestion. I'm an engineer and corporate politicians are corporate politicians, and there's no room for engineers in companies that are run by politicians.)
  • For user apps, auto-saving work, automatically restarting the app, and giving the user the option to recover their old documents is usually fine.

    For online services doing financial transactions, you better not be messing with my money in corrupt address space.
  • > As a result of this, the compiler can’t perform flow analysis in code protected by SEH. So the compiler disables many of its optimizations in routines protected by try/catch (or try/finally). This does not happen with C++ exceptions

    shouldn't that read "try/except (or try/finally)"?

  • probably try/except/finally, you're right rsd.