Asynchrony in C# 5, Part Eight: More Exceptions


(In this post I'll be talking about exogenous, vexing, boneheaded and fatal exceptions. See this post for a definition of those terms.)

If your process experiences an unhandled exception then clearly something bad and unanticipated has happened. If it's a fatal exception then you're already in no position to save the process; it is going down. You might as well leave it unhandled, or just log it and rethrow it. If it had been anticipated because it's a vexing or exogenous exception then there would be a handler in place for it. An unhandled vexing/exogenous exception is a bug, but probably one that does not actually indicate a logic problem in the program's algorithms; it's just an oversight.

But if you have an unhandled boneheaded exception then that is evidence that your program has a very serious bug indeed, a bug so bad that its operation cannot continue. The boneheaded exception should never have been thrown in the first place; you never handle them, you make for darn sure they cannot possibly happen. If a boneheaded exception is thrown then you have no idea whatsoever what locks were released early, what internal state is now corrupt or inconsistent, and so on. You can't do anything with confidence, and often the best thing to do in that case is to aggressively shut down the process before things get any worse.

We cannot easily tell the difference between bugs that are merely missing handlers for vexing/exogenous exceptions, and bugs that have caused a program crash because something is broken in the implementation. The safest thing to do is to assume that every unhandled exception is either a fatal exception or an unhandled boneheaded exception. In both cases, the right thing to do is to take down the process immediately.
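
For concreteness, a minimal sketch of what such a "take down the process immediately" policy can look like as a last-chance handler; Log here is a placeholder, and this is an illustration rather than anything the CTP does for you:

    AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
    {
        Log(e.ExceptionObject);  // Log is a placeholder for your diagnostics
        // FailFast terminates right now, without running finally blocks or
        // finalizers that might touch corrupt state.
        Environment.FailFast("Unhandled exception; state may be corrupt.");
    };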

This philosophy underlies the implementation of unhandled exceptions in the CLR. Way back in the CLR v1.0 days the policy was that an unhandled exception on the "main" thread took down the process aggressively, but an unhandled exception on a "worker" thread simply killed the thread and left the main thread running. (And an exception on the finalizer thread was ignored and finalizers kept running.) This turned out to be a poor choice; the scenario it leads to is that a server assigns a buggy subsystem to do some work on a bunch of worker threads; all the worker threads go down silently, and the user is stuck with a server that is sitting there waiting patiently for results that will never come because all the threads that produce results have disappeared. It is very difficult for the user to diagnose such a problem; a server that is working furiously on a hard problem and a server that is doing nothing because all its workers are dead look pretty much the same from the outside. The policy was therefore changed in CLR v2.0 such that an unhandled exception on a worker thread also takes down the process by default. You want to be noisy about your failures, not silent.

I am of the philosophical school that says that sudden, catastrophic failure of a software device is, of course, unfortunate, but in many cases it is preferable that the software call attention to the problem so that it can be fixed, rather than trying to muddle along in a bad state, possibly introducing a security hole or corrupting user data along the way. Software that terminates itself upon encountering unexpected exceptions is software that is less vulnerable to attackers taking advantage of its flaws. As Ripley said, when things go wrong you should take off and nuke the entire site from orbit; it's the only way to be sure. But does this awesome philosophy serve the async scenario well?

Last time I mentioned two interesting scenarios: (1) what happens if a task-returning async method does a WhenAll or WhenAny on multiple tasks, several of which throw exceptions? and (2) what if a void-returning async method awaits a task which completes abnormally? What happens to that exception?

Let's consider the first case first.

WhenAll collects all the exceptions from its completed sub-tasks and stuffs them into an aggregating exception. When all its sub-tasks complete, it completes its task abnormally with the aggregated exception. A slightly bizarre fact, however, is that by default the EndAwait re-throws only the first of those exceptions; it does not re-throw the entire aggregating exception. The reasoning is that the more common scenario is for a try-catch surrounding an "await" to catch some set of specific exceptions; making you always write code that unpacks the aggregating exception seems onerous. If this still seems odd, see Jon Skeet's recent posts on the topic for more details on why it is a reasonable design.
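
A sketch of what that looks like from the consumer's side, using the WhenAll combinator as the CTP exposes it (TaskEx.WhenAll) and a hypothetical FetchAsync operation that fails with WebException:

    Task first = FetchAsync(url1);   // FetchAsync is a hypothetical async operation
    Task second = FetchAsync(url2);
    Task both = TaskEx.WhenAll(first, second);
    try
    {
        await both;
    }
    catch (WebException ex)
    {
        // Even if both sub-tasks threw, only the first recorded exception
        // is re-thrown by the await; the full collection is still available
        // on the combined task object itself:
        AggregateException all = both.Exception;
    }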

The WhenAny case is similar. Suppose the first sub-task completes, either normally or abnormally. That completes the WhenAny task, either normally or abnormally. Suppose one of the additional sub-tasks completes abnormally; what happens to its exception? The WhenAny is done: it has already completed and called its continuation, which is now scheduled to run on some work queue if it hasn't already.
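
Again as a sketch, with invented names and the CTP's TaskEx.WhenAny:

    Task<string> fast = FetchAsync(mirror1);
    Task<string> slow = FetchAsync(mirror2);
    // The outer await extracts the winning task; the inner await extracts
    // its result (or re-throws its exception).
    string result = await await TaskEx.WhenAny(fast, slow);
    // If the losing task later completes abnormally, its exception is
    // recorded in that task object and nothing here ever re-throws it.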

In both the WhenAll and WhenAny cases we have a situation where there could be an exception that goes "unobserved" by the creator of the WhenAll or WhenAny task. That is to say, in both these cases there could be an exception that is thrown, automatically caught, cached and never thrown again which in the equivalent synchronous code would have brought down the process.

This seems potentially bad. Should an unobserved exception from a task that was asynchronously awaited take down the process, as the equivalent synchronous code would have?

Suppose we decide that yes, an unobserved exception should take down the process. When does that happen? That is, when do we definitively know that the exception actually was not re-thrown? We only know that if the task object is finalized without its result ever being observed. After all, a "living" task object that has completed abnormally could have its continuation executed at any time in the future; it cannot know when that continuation is going to be scheduled. There could be any number of queued-up tasks on this thread that get to run between the time this task completed abnormally and its result is requested. As long as the task object is alive then its exception could be observed.

OK, so, great, if a task is finalized, and it completed abnormally then we... what? Throw the exception on the finalizer thread? Sure! That will take down the process, right? In CLR v2.0 and above, unhandled exceptions on any thread take down the process. But let's take a step back. Remind me, why do we want an unobserved exception to take down the process? The philosophical reason is: we cannot tell whether this was a boneheaded exception that indicates a potentially horrible, security-impacting situation that needs to be dealt with by immediate termination, or simply the result of a missing handler for an unanticipated exogenous exception. The safe thing to do is to say that it was a boneheaded exception with a security impact and immediately take the process down. Which is precisely what we are not doing! We are waiting for the task to be collected by the garbage collector and then trying to take the process down in the finalizer thread. But in the gap between the exception being recorded in the task and the finalizer observing the exception, we've potentially kept right on running dozens more tasks, any of which could be using the inconsistent state caused by the boneheaded exception.
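
(As it happens, the Task Parallel Library in .NET 4 surfaces exactly this "abnormally completed task is being finalized" moment as an event, so a program that wants to log or escalate can hook it. A sketch, with Log again a placeholder:)

    TaskScheduler.UnobservedTaskException += (sender, e) =>
    {
        Log(e.Exception);   // e.Exception is the aggregating exception
        e.SetObserved();    // mark it observed so it is not escalated further
    };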

Furthermore, we anticipate that most async tasks that throw exceptions in realistic code will in fact be throwing exogenous exceptions like "the password for this web service is wrong" or "you don't have permission to read this file", or "this operation timed out", rather than boneheaded exceptions like "you dereferenced null" or "you tried to pop an empty stack". In these realistic cases it seems much more plausible to say that if for some reason a task completes abnormally and no one bothers to observe its result, it's because some asynchronous unit of work was abandoned; any of its sub-tasks that ran into problems connecting to web servers (or whatever) can safely be ignored.

In short, an unobserved exception from a finalized task is one that no one cares about, is probably harmless, and if it was harmful, then we've already delayed taking action too long to prevent more harm. Either way, we might as well just ignore it.

This does illustrate that asynchronous programming introduces a new flavour of security vulnerability. If there is a security vulnerability caused by a bug that would normally take down the process, and if that code is rewritten to be asynchronous, and if the buggy task is abandoned without observation of its exception, then the bug might not result in an aggressive destruction of the now-vulnerable process. And even if the exception is eventually observed, there might be a window in time between when the bug introduces the vulnerability and the exception is observed. That window might be large enough for an attacker to succeed. That sounds like a tortuous chain of things that have to go wrong - because it is - but attackers will take whatever they can get. They are crafty, they have all the time in the world, and they only have to succeed once.

I never did say what happens to a void-returning method that awaits a task; you can think of this as a "fire and forget" sort of method. Perhaps a void-returning button-click event handler awaits fetching some data asynchronously and then updating the user interface; there's no "caller" of the event handler that cares to hold on to a task, and will never observe its result. So what happens if the data-fetching task completes abnormally?

In that case, when the void-returning method (which registered itself as a continuation, remember) starts up again, it checks to see if the task completed abnormally. If it did, then it immediately re-throws the exception to its caller, which is, of course, probably some message loop. I believe the plan of action here is to be consistent with the behaviour described above; in that scenario the message loop will discard the exception, assuming that the fire-and-forget asynchronous method failed in some benign way.
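
To make the scenario concrete, here is a sketch of such a fire-and-forget handler; the names and UI details are invented for illustration:

    async void LoadButton_Click(object sender, EventArgs e)
    {
        // If the task returned by FetchDataAsync completes abnormally, the
        // line that updates the UI never runs; the exception is re-thrown
        // into the message loop, and there is no Task object for anyone
        // to interrogate later.
        var data = await FetchDataAsync();
        resultsListBox.DataSource = data;
    }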

Having been an advocate of the "nuke from orbit" philosophy of unhandled exceptions for many years, emotionally this does not sit well with me, but I'm unable to marshal a convincing argument against this strategy for dealing with exceptions in task-based asynchrony. Readers: What do you think? What is in your opinion the right thing to do in scenarios where exceptions of tasks go unobserved?

And on that somewhat ominous note, I'm going to take a break from talking about the new Task Asynchrony Pattern for now. Please download the CTP, keep sending us your feedback and questions, and start thinking about what sorts of things will work well or poorly with this new feature. Next time: we'll pick up with more fabulous adventures after American Thanksgiving; I'm cooking turkey for 19 this year, which should be quite the adventure in and of itself.

  • Unhandled exceptions should at least go to AppDomain.UnhandledException.

    To handle the exception, I think the bigger problem is the current catch statement, which does not deal with the aggregated exceptions that come out of async tasks. Some possible improvement could be:

    try
    {
        await something;
    }
    catch (InvalidDataException ex) // for each exception in the AggregateException, go through the catch list in the normal way
    {
        ...
    }
    catch (IOException ex)
    {
        ...
    }
    aggregate (Exception[] unhandledAndRethrownExceptions) // for all remaining and rethrown exceptions
    {
        // analyze all remaining exceptions
        // can do some logging
        // can throw a new exception
    }
    finally
    {
    }

    Just some wild thinking...

  • Hello Eric,

    There are two "elephants in the living room", and the problem you presented is a facet of both. I would like to hear what you think about the following subjects:

    1. In distributed systems we want each component to fail fast, because it is for the common good of the distributed "being". As another component takes its place, there is no harm in killing a single component. In that respect the component (before it pushes the self-destruct button) needs to be sure that the exception originated from itself and was not propagated from another component; otherwise the fail-fast would be counterproductive and cause the opposite effect.

    2. What do Code Contracts for .NET say about the degenerate case where no contracts were defined, but an exception is thrown anyway? Are all exceptions then considered fatal? How do Code Contracts relate to the async keyword?

    Regards,

    Itai

  • Tasks aren't that different from threadpooled work items in an abstract sense.  .NET moved toward fail-fast; please don't give that up, you'll make asynchronous programming more of a minefield than it is already!

    There's a relevant distinction between Tasks here, namely into those that don't have side-effects, and those that do.  If you're not sharing mutable state, then it's certainly irrelevant that a task failed: it's fine to ignore even boneheaded exceptions since the task simply has no effect.

    If you are sharing mutable state (or possibly doing so indirectly via a Task you start), then the exception means that something is broken and wrong - and in that case, indefinitely delayed failure is probably still better than no failure at all.  It's not as bad as you make it out to be: it may seem very un-fail-fast since it's not "fast", but it *is* fast in the sense that it will be noticeable during this run of the process eventually (unless some other error happens first).  Such an exception will contain a real stack trace and will be reasonably debuggable despite the non-deterministic timing, though obviously much harder than synchronous code.

    I think that void-returning async methods failing (or an awaited subtask failing) should nuke the process.  Throwing an exception in an event-handler is not OK; and since the method returns void you can be almost certain that anything it calls (asynchronously or not) is in some way aiming to assist it in mutating state; ignoring a failure here means making debugging *very* hard and leaving the process in a partially mutated state.  Not Good.

    For a non-void-returning async method, I suppose you're bound by what .NET 4 offers: that is, silently ignore errors.  I don't think that's good either; there should be a task creation option, and preferably the default should be "fail-as-soon-as-possible": if a Task is finalized or disposed and hasn't been explicitly marked "ignore exceptions" then any exceptions terminating it should be fatal.

    So, I'd consider it ideal if all tasks had fail-ASAP on by default, but realize this is a compatibility problem.  Certainly void-returning async should not swallow exceptions silently.  You could add an app.config setting controlling the default and have C# 5 projects terminate-on-unobserved-task-failure by default to avoid compatibility issues.

    I'm with AnthonyP: I'd like void-returning async methods to fail if there's an unhandled exception which gets to the top level.  Otherwise this starts looking just like the "silent" death of a background thread in .NET 1.0.

    Plus, if I want to swallow all exceptions, I can manually add a top-level try { ... } catch.  But if such a handler is provided (implicitly) by the framework, I can't get rid of it.

  • "we anticipate that most async tasks that throw exceptions in realistic code will in fact be throwing exogenous exceptions [snip] rather than boneheaded exceptions"

    Perhaps most, yes, but what happens when it is a boneheaded exception?  How does this affect our debugging experience?

    "the message loop will discard the exception, assuming that the fire-and-forget asynchronous method failed in some benign way"

    I understand the perspective that if nobody's listening for a benign exception, then it doesn't really matter; however, the biggest issue with passivity is that this assumption could be wrong in critical situations.  I believe it's these critical minority situations to which the benign majority should yield.

    We could easily opt-in to ignore benign exceptions.  But now, it seems like we must opt-in to being notified about critical exceptions, which are, well, exceptional.  How can we opt-in to being notified about something we don't know about?

    This also seems to defeat one of the primary reasons for using exceptions as opposed to return codes.

  • Eamon - the definition of shared state is important.

    If the corrupt state is in memory and is only implicit (members of waiting tasks) then the smart thing is to kill the tasks that are related (ancestors or siblings) to the corrupt task. You can actually halt these tasks and dump their data to a "hospital", and only then kill them. Later dev-ops can analyze the business data waiting in the "hospital".

    If the corrupt shared state is in remote storage (database or NoSQL) then a restart would not do any good; in fact it could cause a collective suicide of a distributed cluster encountering the same corrupt shared state. Instead you need to reset the state in the database, and send the corrupt data to a "hospital" component that will later be analyzed by the dev-ops.

    If the corrupt state is in shared memory (usually meaning it is a local cache of persisted data), then the right thing is to clear the cache, and lazily reload the persisted data.

    There is a fourth option, and that is that the shared memory is the primary source of data. Usually that means you are using an In-Memory Data Grid, which has its own tools for handling corrupt state (out of scope of this comment, I think).

    To conclude, in many cases it's the state that we want to restart (which might be achieved by restarting the process, but that depends on your distributed architecture).

    Itai

  • As earlier posters pointed out, the core issue is whether the failed task was updating state shared with other tasks.

    I suggest that the issue is the exception-handling behavior of the synchronization construct used to protect the shared state (most commonly a C# lock statement) rather than the exception-handling behavior of the task.

    The problem is that a lock statement releases the lock when its body terminates, even if it terminates due to an exception. As soon as this happens, another thread can acquire the lock and see the inconsistent state. One way to address this would be to mark a lock as "Broken" when it is released due to an exception. When a thread attempts to acquire a lock that is Broken, an exception is thrown.
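
    A minimal sketch of such a "breakable" lock, purely illustrative (this is not an existing .NET type):

    sealed class BreakableLock
    {
        private readonly object gate = new object();
        private bool broken;

        public void Run(Action body)
        {
            lock (gate)
            {
                if (broken)  // a previous body left via an exception
                    throw new InvalidOperationException("Lock is broken; protected state may be inconsistent.");
                try { body(); }
                catch { broken = true; throw; }
            }
        }
    }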

    You might be interested in the work I did on Failboxes (www.cs.kuleuven.be/.../failboxes). It attempts to address this problem, as well as the problem of running finally blocks in unsafe states, the problem of zombie servers, and the problem of safe cancellation (i.e. Thread.Abort done right). I'd be interested in any comments you have on this.

  • The problem with the "fail eventually" scenario is that it doesn't seem easy to protect your app against benign errors that you don't care about. Consider this snippet from a hypothetical web browser:

    try {
        var ipAddr = PerformDnsLookupAsync(url);
        if (await CheckBlacklistAsync(url)) {
            string data = await FetchDataAsync(await ipAddr, url);
        }
        else {
            DisplayError();
        }
    }
    catch (TimeoutException e) { DisplayError(); }
    catch (SocketException e) { DisplayError(); }

    As it is now, the system will start a DNS lookup, start a blacklist lookup, and if the blacklist lookup succeeds it will await the result of the DNS lookup, eventually fetching the URL. So what happens when the blacklist lookup fails or throws an exception? An error is displayed to the user and we no longer care about the results of the DNS lookup. Obviously we don't want the web browser to crash if the DNS lookup throws an exception after the blacklist lookup has already returned failure or thrown an exception.

    I suggest those advocating "fail eventually" semantics rewrite my simple example to have the same semantics in their intended system.

  • @Gabe

    bool errorDisplayed = false;

    void DisplayError() {
        if (errorDisplayed) return;
        errorDisplayed = true;
        ...
    }

    This presupposes semantics that would allow the same catch clause to be entered more than once in the case of an async method. That'd be a slightly odd change to the semantics of the current "catch" clause, so my suggestion would be to rename it to "async catch" in that case.

    You'd have to define the semantics carefully, especially with regard to what happens to any code that's *after* the catch/finally clause. And I'm sure I haven't thought it through quite right, but it kind of seems like you would want the *first* time through the async catch to then continue running the code that comes after, and subsequent times through to simply run the catch and then stop.

    And it does seem like there ought to be some way to behind-the-scenes annotate the Task object to indicate whether exceptions thrown by it will be handled or not. That way, if not, the failure can be immediate rather than "eventual"?
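
    One way to approximate that annotation today, as a sketch; IgnoreExceptions is an invented helper (an extension method in some static class), not a framework method:

    public static Task IgnoreExceptions(this Task task)
    {
        // Reading t.Exception marks the failure as "observed", so it will
        // not surface again when the task is finalized.
        task.ContinueWith(t => { var ignored = t.Exception; },
                          TaskContinuationOptions.OnlyOnFaulted);
        return task;
    }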

  • "In short, an unobserved exception from a finalized task is one that no one cares about, is probably harmless, and if it was harmful, then we've already delayed taking action too long to prevent more harm."

    This statement is the crux of the entire argument, and it is deeply flawed. Hopefully in describing its flaws we can come up with the "convincing argument" you are looking for.

    First, the idea that "an unobserved exception from a finalized task is one that no one cares about" and "is probably harmless" is based on a very flimsy assumption: "most async tasks that throw exceptions in realistic code will in fact be throwing exogenous exceptions like 'the password for this web service is wrong' or 'you don't have permission to read this file', or 'this operation timed out', rather than boneheaded exceptions like 'you dereferenced null' or 'you tried to pop an empty stack'". This is pure assumption and is offered with no facts or supporting data, or even a convincing thought experiment. I don't see why you have any more reason to believe that an exception from an async task isn't "boneheaded" than you have to believe the same about an exception from non-async code. In fact, wasn't that the whole point of the change that was made to the 2.0 CLR that you just described? It seems like you are making exactly the same mistake all over again.

    Second, even if we accept the idea that exceptions thrown from async tasks are somehow not important enough to bring down the process, why doesn't the same logic apply to the first exception that is thrown? Why isn't that exception simply stored in the task with the other exceptions, for the user to observe or not as they choose? You are elevating the first exception that gets thrown to a special status that you have no reason to believe it deserves (or more accurately demoting the other exceptions to a status you have no reason to believe they deserve).

    Third, the idea that "we've already delayed taking action too long to prevent more harm" is erroneous. To see why, all you have to do is look again at the first exception. It triggers the continuation, which will throw the exception on the original thread, and it will bring the process down if there is no handler ready to catch it. BUT, it doesn't bring the process down UNTIL the continuation is called. And there is no way of knowing when that will be; the message pump may process any number of messages between the time the exception occurs and the time it is rethrown on the original thread.

    So both in the case of the first exception AND all subsequent exceptions, an indeterminate amount of time has passed, and an indeterminate amount of code has been executed. So why should one case cause a process failure and the other case be silently ignored? In my opinion, it shouldn't be ignored: "fail fast" is still the right philosophy in this case, for all the same reasons that it is in (nearly) every other case.

    I agree that an exception from a background thread that won't be handled should take down the application, preferably as soon as possible. I use the words "won't be" and "should" deliberately.

    I disagree with the opinion from referenced post "Vexing Exceptions" that "vexing exceptions" should be avoided. I prefer code to throw in an expected failure scenario for two reasons:

    1) The try catch block is a clean and natural mechanism for dealing with exceptional flows.

    2) If a developer forgets that (for example) a bank transaction might not complete if the debited account has insufficient funds, I would prefer the code attempting the transaction fail with a traceable exception and possibly take down the application if it's not expected that this operation might fail.

    (Of course, as is true of all design philosophies, there are cases where I would recommend a different approach)

    If I dispatch a task to do some work, I would prefer not to write try blocks and check result flags each and every time. I would like the exception to be dealt with as though the await keyword wasn't used. If I am using the await keyword from a background thread with no exception handler, I would like it to kill the application. It would be nice if it happened immediately, but I can imagine reasons why this would be messy to implement.

    If you are writing code that modifies state from a background thread, even if that code can never throw, you're going to have to write that code carefully. If you are making multiple state changes that have to be made together, you will already be using locking mechanisms, and if your code can fail you should already be dealing with cases where one change succeeds and a later change fails. This is an annoying problem, I agree, but I don't think it's fair to expect the async / await syntax to deal with it.

    If the choice is between killing an app if a task ever fails, or throwing an exception in the continuation (even if it's delayed), then my vote is for the second.

  • One solution is to, as soon as an exception is thrown, lock the program (i.e. suspend all other threads) and check if there is handler which will find the exception.  If not, crash.  If so, behave exactly the same way you described (don't unlock the program until after exiting the catch block).  This solution is brutal, might not actually be possible, has numerous flaws, and is generally an ugly idea that is tough to implement.  It's inefficient, too.  Still, I wonder if it offers a starting point for thinking about solutions to this problem.


  • Ignoring the exceptions may be a good default, but for diagnostic purposes there absolutely must be a way to log them through a custom handler! As a bonus you could then use that handler to shut down the application if you don't like the default of ignoring it.

  • I dislike the AggregateException approach. In a fork/join scenario, should one task encounter an exception, it's been my experience that all of the tasks encounter the same exception. So having an aggregate of all of the exceptions isn't helpful. True, I can think of situations where different tasks *could* get different exceptions, but I have yet to encounter that. I would prefer for WaitAll to throw the first exception it receives while the rest of the tasks are considered (and are) canceled.

    In a non fork/join scenario, should the programmer not continue the task, they don't deserve to know that something went wrong.

    But to resolve Eric's concerns I think the only thing to do would be to have the compiler enforce that all tasks are either joined or continued. This of course would make some proof-of-concept programming annoying, but I feel it is a solution to the problem.
