Careful with that axe, part one: Should I specify a timeout?

(This is part one of a two-part series on the dangers of aborting a thread. Part two is here.)

The other day, six years ago, I was talking a bit about how to decide whether to keep waiting for a bus, or to give up and walk. It led to a quite interesting discussion on the old JoS forum. But what if the choice isn’t “wait for a bit, then give up”, but rather “wait for a bit, and then take an axe to the thread”? A pattern I occasionally see is something like this: I’ve got a worker thread that I started up, I ask it to shut down, and then I wait for it to do so. If it doesn’t shut down soon, I take an axe to it:

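this.running = false;               // ask the worker to stop
if (!workerThread.Join(timeout))    // wait up to 'timeout' for it to finish
    workerThread.Abort();           // it did not finish in time; take the axe to it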

Is this a good idea?

It depends on just how badly the worker thread behaves and what it is doing when it is misbehaving.

If you can guarantee that the work is short in duration, for whatever 'short' means to you, then you don't need a timeout. If you cannot guarantee that, then I would suggest first rewriting the code so that you can guarantee that; life becomes much easier if you know that the code will terminate quickly when you ask it to.
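As a rough illustration of what "guaranteeing that" can look like, here is a minimal sketch in which the worker does its job in short, bounded slices and checks for a stop request between them. It swaps the running flag from the snippet above for a CancellationToken; the names CooperativeWorker, RunSlices and StartThenStop are hypothetical, and the 10 ms sleep is only a stand-in for one unit of real work.

```csharp
using System;
using System.Threading;

public static class CooperativeWorker
{
    // Hypothetical worker body: one short slice of work per iteration,
    // with a check for the stop request between slices.
    public static int RunSlices(CancellationToken token)
    {
        int slices = 0;
        while (!token.IsCancellationRequested)
        {
            Thread.Sleep(10); // stand-in for a short, bounded unit of real work
            slices++;
        }
        return slices;
    }

    // Starts the worker, lets it run for a while, asks it to stop,
    // and then waits for it with no timeout at all.
    public static int StartThenStop(TimeSpan runFor)
    {
        using (var cts = new CancellationTokenSource())
        {
            int completed = 0;
            var worker = new Thread(() => completed = RunSlices(cts.Token));
            worker.Start();

            Thread.Sleep(runFor);
            cts.Cancel();  // ask the worker to shut down
            worker.Join(); // safe to wait indefinitely: each slice is bounded

            return completed;
        }
    }
}
```

The plain Join with no timeout is only reasonable here because every slice is assumed to be short; if a slice can block indefinitely, the sketch no longer holds.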

If you cannot, then what's the right thing to do? The assumption of this scenario is that the worker is ill-behaved and does not terminate in a timely manner when asked to. So now we've got to ask ourselves "is the scenario that the worker is slow by design, buggy, or hostile?"

In the first option, the worker is simply doing something that takes a long time and, for whatever reason, cannot be interrupted. What's the right thing to do here? I have no idea. This is a terrible situation to be in. Presumably the worker is not shutting down quickly because doing so is dangerous or impossible. In that case, what are you going to do when the timeout times out? You've got something that is dangerous or impossible to shut down, and it's not shutting down in a timely manner. Your choices seem to be:

(1) do nothing
(2) wait longer
(3) do something impossible. Preferably before breakfast.
(4) do something dangerous

Choice one is identical to not waiting at all; if that’s what you’re going to do then why wait in the first place? Choice two just changes the timeout to a different value; this is question begging. By assumption we're not waiting forever. Choice three is impossible. That leaves “do something dangerous”. Which seems… dangerous.

Knowing what the right thing to do in order to minimize harm to user data depends upon the exact circumstances that are causing the danger; analyze it carefully, understand all the scenarios, and figure out the right thing to do. There’s no slam-dunk easy solution here; it will depend entirely on the real code running.

Now suppose the worker is supposed to be able to shut down quickly, but does not because it has a bug. Obviously, if you can, fix the bug. If you cannot fix the bug -- perhaps it is in code you do not own -- then again, you are in a terrible fix. You have to understand what the consequences are of not waiting for already-buggy-and-therefore-unpredictable code to finish before disposing of the resources that you know it is using right now on another thread. And you have to know what the consequences are of terminating a thread while a buggy worker thread is still busy doing heaven only knows what to operating system state.

If the code is hostile and is actively resisting being shut down then you have already lost. You cannot halt the thread by normal means, and you cannot even reliably thread abort it. There is no guarantee whatsoever that aborting a hostile thread actually terminates it; the owner of the hostile code that you have foolishly started running in your process could be doing all of its work in a finally block or other constrained region which prevents thread abort exceptions.

The best thing to do is to never get into this situation in the first place; if you have code that you think is hostile, either do not run it at all, or run it in its own process, and terminate the process, not the thread when things go badly.
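A minimal sketch of the "own process" approach, assuming the untrusted work can be packaged as a separate executable. IsolatedRunner and RunWithDeadline are hypothetical names, and the file name and arguments are whatever your own setup uses; only the System.Diagnostics.Process calls are real API.

```csharp
using System;
using System.Diagnostics;

public static class IsolatedRunner
{
    // Runs the given executable in a child process. Returns true if it exited
    // on its own within the timeout, false if the whole process had to be killed.
    public static bool RunWithDeadline(string fileName, string arguments, TimeSpan timeout)
    {
        var startInfo = new ProcessStartInfo(fileName, arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process child = Process.Start(startInfo))
        {
            if (child.WaitForExit((int)timeout.TotalMilliseconds))
                return true;

            try
            {
                child.Kill(); // terminate the process, not a thread inside ours
            }
            catch (InvalidOperationException)
            {
                // The child exited between the wait and the kill; nothing left to do.
            }

            child.WaitForExit(); // wait for the operating system to finish tearing it down
            return false;
        }
    }
}
```

When the child is killed, the operating system reclaims its memory and handles, so whatever state it was in the middle of mutating dies with it rather than lingering in your process. Anything it wrote outside its own process, such as files or database rows, is still your problem.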

In short, there's no good answer to the question "what do I do if it takes too long?" You are in a terrible situation if that happens and there is no easy answer. Best to work hard to ensure you don't get into it in the first place; only run cooperative, benign, safe code that always shuts itself down cleanly and rapidly when asked. Careful with that axe, Eugene.

Next time, what about exceptions?

  • Eric, I agree, "careful" is the key word here:

    A few months ago, I ran into an issue where a mobile application wasn't shutting down under some special conditions (losing Wi-Fi connectivity while shutting down a 3rd-party component that was sending data). We reported the bug to the vendor along with code that could repro the bug fairly consistently on our test devices; unfortunately the bug was closed with "no repro" after several iterations of knowledge exchange. So I had to pull the axe. The code for the app was specifically written to be resilient in case of errors on both the client side and the server side, so the only outcome could have been some minor data loss, but we did a small redesign to deal with such an issue by storing all unprocessed data locally.

    Our profession is no less dangerous (from a failure perspective) than most are. We are faced with human errors, hardware errors and software errors; the possibility of data loss or exceptional conditions can creep up at any time, so you have to be prepared to pull the axe, and know how to use it without hurting yourself or others. This is a fact: you can't avoid using extreme measures indefinitely. Situations will arise when the axe is needed, even if only as a temporary solution, or in extreme or unforeseen cases.

    My opinion is that you shouldn't use the axe if other options are available, but if you have to, then make sure you know how to use it, and take all the necessary safeguards.

    P.S. We can't make perfect software just yet, sometimes the axe will be needed despite everything else, and no matter how much we dislike taking that path, it comes with being a professional.

  • @Anon: the other context referred to was processes on the OS.

    ThreadAbort as-is is problematic because you don't know the state of the overall process (or appdomain) when the thread aborts: perhaps the thread was doing something critical to global state and leaves it behind corrupted.

    Now, there are (at least) two solutions to this problem:

    (A) manually mark safe stopping points.  This is a huge effort; it's likely to be buggy at least some of the time.  Even if you do pick "safe" stopping points, that's not a cure-all: you'll be terminating via a code-path not commonly travelled in the single-threaded scenario, so you'll still need to manually inspect each clean-up case, or trust an automatic mechanism such as try...finally.

    (B) partition and encapsulate updatable regions of your memory.  If a thread goes down asynchronously, assume any partition the thread is currently touching is simply trash; use an external tool to enforce the release of resources held by that thread.  This is essentially what the OS does when processes go down.

    (B) isn't a magic bullet - resources need to be owned by a thread to be released (often a stack-unwinding release à la using clauses will work here, but a more robust method is conceivable).  Also, you do need to partition *write* access to memory.  In practice, manageable multithreaded programs do that anyhow; you don't go around writing everywhere unless you like headaches.

    (A), however, is simply not sustainable.  It's virtually untestable, unverifiable, and asking for exotic race conditions - did you check all possible exits?  Also, problems with this approach are *not* limited to thread abort exceptions.  In practice, any other unexpected exception might halt execution in an invalid state - so for code to be robust, it needs to be abortable *anyhow* just to be pragmatically exception safe (which is why even in a single-threaded world things like RAII and using-clauses are needed).  Marking manual stopping points is much, *much* more work and makes code less readable.  It's also a waste of time if your code is actually largely a side-effect-free functional body of code; and that seems to be the direction multi-threaded code is heading anyhow.  If your code doesn't have any side effects - that is, the only consistency-critical effect on external state is the final "return" value - then it doesn't matter *when* you shut down the thread *anyhow*.

    We want side-effect free threading anyhow; this notion of "safe" stopping points is a treacherous bog I don't want to enter if I can at all avoid it.  "using" blocks can deal with external resources the thread really needs to release, and the rest of the state is trash for the GC.

  • The great thing about managed code is that you can safely take the axe to any thread that doesn't manipulate shared state. Back in the olden days terminating a thread leaked whatever memory it had allocated, leaving files locked, sockets open, and so forth, meaning you couldn't rely on axing threads as a strategy, so you had to pepper threads with stopping points. Managed code lets you avoid that, so it is reasonable to use thread termination as a way of dealing with things like users canceling a long-running calculation.

  • I have a similar situation here at work. I'm designing a system where we make a "cross-environment method call", which is, of course, asynchronous. The caller, A, delegates this task to my process orchestrator, B, telling B that a process § should be completed (or, B(§)). Now, since the process § is parameter based and can call as many different programs as the user wants, I have no clue how long § should take to complete. In fact, any program P can enter an infinite loop.

    Now, since A is making a cross-environment call, he deals with atomicity using compensation methods. So if the call to B(§) fails or times out, compensation should be done for every step up to B(§), but not for B(§) itself, since it failed. But let's say that what really takes long in B(§) is a database commit. The commit is the last step B executes, and B can no longer expire, even if the timeout has already occurred. By the time B finishes committing, A has already begun compensating the other steps in his own process.

    This is a liability and, right now, I can only think of dealing with it by fine-tuning the timeouts... which is no good at all.
