Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
(This is part one of a two-part series on the dangers of aborting a thread. Part two is here.)
Six years ago I was talking a bit about how to decide whether to keep waiting for a bus or to give up and walk; it led to quite an interesting discussion on the old JoS forum. But what if the choice isn't "wait for a bit, then give up", but rather "wait for a bit, then take an axe to the thread"? A pattern I occasionally see goes something like this: I've got a worker thread that I started up; I ask it to shut down and then wait for it to do so; if it doesn't shut down soon, I take an axe to it:
this.running = false;
if (!workerThread.Join(timeout))
    workerThread.Abort();
Is this a good idea?
It depends on just how badly the worker thread behaves and what it is doing when it is misbehaving.
If you can guarantee that the work is short in duration, for whatever 'short' means to you, then you don't need a timeout. If you cannot guarantee that, then I would suggest first rewriting the code so that you can guarantee that; life becomes much easier if you know that the code will terminate quickly when you ask it to.
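To make that guarantee concrete, here is a minimal sketch of the rewritten code (the names Worker and DoOneSmallUnitOfWork are illustrative, not from the post): the worker checks a shared flag between short units of work, so it always reaches a safe stopping point quickly.

```csharp
using System;
using System.Threading;

class Worker
{
    // volatile: a stop request written by one thread must be seen
    // promptly by the worker thread.
    private volatile bool running = true;
    private Thread thread;

    public void Start()
    {
        thread = new Thread(() =>
        {
            while (running)
                DoOneSmallUnitOfWork();  // hypothetical: each unit is short
            // Clean up here; the worker chose a safe point to stop.
        });
        thread.Start();
    }

    public void Stop()
    {
        running = false;
        thread.Join();  // no timeout needed: each unit of work is short
    }

    private void DoOneSmallUnitOfWork() => Thread.Sleep(10);
}
```

Because every unit of work is short, the Join without a timeout is safe and the axe never has to come out.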
If you cannot, then what's the right thing to do? The assumption of this scenario is that the worker is ill-behaved and does not terminate in a timely manner when asked to. So now we've got to ask ourselves "is the scenario that the worker is slow by design, buggy, or hostile?"
In the first option, the worker is simply doing something that takes a long time and for whatever reason, cannot be interrupted. What's the right thing to do here? I have no idea. This is a terrible situation to be in. Presumably the worker is not shutting down quickly because doing so is dangerous or impossible. In that case, what are you going to do when the timeout times out? You've got something that is dangerous or impossible to shut down, and it's not shutting down in a timely manner. Your choices seem to be:
(1) do nothing
(2) wait longer
(3) do something impossible. Preferably before breakfast.
(4) do something dangerous
Choice one is identical to not waiting at all; if that's what you're going to do then why wait in the first place? Choice two just changes the timeout to a different value; this is question begging, since by assumption we're not waiting forever. Choice three is impossible. That leaves "do something dangerous". Which seems… dangerous.
Knowing what the right thing to do in order to minimize harm to user data depends upon the exact circumstances that are causing the danger; analyze it carefully, understand all the scenarios, and figure out the right thing to do. There’s no slam-dunk easy solution here; it will depend entirely on the real code running.
Now suppose the worker is supposed to be able to shut down quickly, but does not because it has a bug. Obviously, if you can, fix the bug. If you cannot fix the bug -- perhaps it is in code you do not own -- then again, you are in a terrible fix. You have to understand what the consequences are of not waiting for already-buggy-and-therefore-unpredictable code to finish before disposing of the resources that you know it is using right now on another thread. And you have to know what the consequences are of terminating a thread while a buggy worker thread is still busy doing heaven only knows what to operating system state.
If the code is hostile and is actively resisting being shut down then you have already lost. You cannot halt the thread by normal means, and you cannot even reliably thread abort it. There is no guarantee whatsoever that aborting a hostile thread actually terminates it; the owner of the hostile code that you have foolishly started running in your process could be doing all of its work in a finally block or other constrained region which prevents thread abort exceptions.
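To see why the finally trick works, here is a small demonstration (a sketch: it uses a finite Sleep where genuinely hostile code would loop forever). The abort raised in the try does not take effect until the finally block has finished. Note that Thread.Abort is only supported on the .NET Framework; later runtimes refuse it outright.

```csharp
using System;
using System.Threading;

class FinallyDefersAbort
{
    public static bool FinallyCompleted;

    public static void Demo()
    {
        var t = new Thread(() =>
        {
            try { Thread.Sleep(Timeout.Infinite); }
            finally
            {
                // A ThreadAbortException is not delivered while a finally
                // block is running; hostile code would do all its work here.
                Thread.Sleep(200);
                FinallyCompleted = true;
            }
        });
        t.Start();
        Thread.Sleep(50);  // let the worker get into its Sleep
        t.Abort();         // interrupts the try; the finally still runs fully
        t.Join();
    }
}
```

After Demo returns, FinallyCompleted is true: the abort had to wait for the finally block, and a hostile finally block never finishes waiting.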
The best thing to do is to never get into this situation in the first place; if you have code that you think is hostile, either do not run it at all, or run it in its own process, and terminate the process, not the thread when things go badly.
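A sketch of that process-level axe (the executable name and arguments are placeholders you would supply): start the suspect work in its own process, wait with a deadline, and kill the process rather than the thread.

```csharp
using System;
using System.Diagnostics;

class ProcessAxe
{
    // Returns true if the process exited on its own within the deadline,
    // false if we had to kill it. Killing a process is safe in a way that
    // aborting a thread is not: the OS reclaims everything the process owned.
    public static bool RunWithDeadline(string fileName, string args, TimeSpan timeout)
    {
        using (var p = Process.Start(fileName, args))
        {
            if (p.WaitForExit((int)timeout.TotalMilliseconds))
                return true;
            p.Kill();
            p.WaitForExit();
            return false;
        }
    }
}
```

The trade-off is that all communication with the work now has to cross a process boundary, but misbehaving code can no longer take your process down with it.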
In short, there's no good answer to the question "what do I do if it takes too long?" You are in a terrible situation if that happens and there is no easy answer. Best to work hard to ensure you don't get into it in the first place; only run cooperative, benign, safe code that always shuts itself down cleanly and rapidly when asked. Careful with that axe, Eugene.
Next time, what about exceptions?
I never realized that a finally block prohibited thread abort exceptions. I think I understand the reasoning behind it (don't die without cleaning up), but is malicious C# really as easy as
try { } finally { MaliciousFunction(); /* never returns */ }
If so, what mechanisms can terminate a malicious .Net process that is executing in a finally block? What would happen in this situation if MaliciousFunction() ran unmanaged code?
Pretty sure Environment.FailFast() will kill a finally block as will running the possibly malicious code in a separate process and using Process.Kill().
Once I nearly resorted to spawning a "taskkill myprocess.exe" to kill a rogue thread that sometimes would cause my process not to exit. This rogue thread was in a buggy old IE plug-in that wasn't supported by the vendor anymore, so I just had to use whatever means necessary to work around its quirks.
I would add one extra option to that list:
(5) ask another entity (a human user, perhaps)
To add safety, you can also isolate threads that you do not fully control (due to their needing to use components you do not own/fully control) into their own app domains, hence limiting their potential for damage. Of course, that limits your ability to interact with them too.
Ideally, threads that run your own code should have a shared status flag designed into them (e.g. a volatile bool shouldContinue), giving you a clean means of your own design to cancel a thread's operation; use that. If you do that and the thread still does not shut down, you may as well hard-shut it down - you are lost in that scenario anyway.
Some commandments of multi-threading.
All threads shall terminate quickly when asked to.
Thou shalt only use one primitive for multi-threading.
Thou shalt treat every thread as if it is a feature.
Thou shalt not share data incautiously.
Thou shalt not thrash the L2 cache.
If a failure to shut down when told to do so is dangerous I'd hate to think of the consequences in the failure of a component that is a single point of failure in such a system.
I really hate this situation. This is often the case when working with someone else's interface. You call into some interface whose implementation you have no control over, and it decides to never return. Ugh.
I wish some developers would be honest with their naming. Instead of calling it FetchValue(...), they should call it FetchValue_OrHangForeverWithoutGivingTheDeveloperAnyWayToCancelThisStupidCall(...).
Was that a Hitch-hiker's reference in #3?
Mike: no, Alice in Wonderland.
"Presumably the worker is not shutting down quickly because doing so is dangerous or impossible."
That's an unwarranted assumption. The worker might not be shutting down simply because it's doing some long-running computation, and the worker code wasn't peppered with frequent checks for cancellation requests. Moreover, doing so makes the worker code harder to read, is rather laborious, and possibly very difficult since you need to balance performance with quick cancellation. Thread.Abort is the only practical option here, and it's also safe if the code is working only on managed memory with no outside dependencies.
However, I hear F# (or was it the new parallel library in .NET 4?) has a nice trick where parallelized loops do automatic cancellation checks at each iteration, so you don't have to manually add those checks to the worker code. That should take care of some cases that are amenable to this kind of parallelization.
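That trick is the new ParallelOptions.CancellationToken in .NET 4: Parallel.For checks the token between iterations, so the loop body needs no hand-written cancellation checks. A minimal sketch (the method name is illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CooperativeLoop
{
    // Parallel.For polls the token at iteration boundaries; the body
    // itself contains no cancellation logic at all.
    public static long SumUntilCancelled(int n, CancellationToken token)
    {
        long sum = 0;
        var options = new ParallelOptions { CancellationToken = token };
        try
        {
            Parallel.For(0, n, options, i =>
            {
                Interlocked.Add(ref sum, i);
            });
        }
        catch (OperationCanceledException)
        {
            // The loop stopped at an iteration boundary;
            // treat the partial sum as nonsensical and discard it.
        }
        return sum;
    }
}
```

Cancellation can only happen between iterations, so each iteration either runs to completion or not at all, which sidesteps the half-finished-state problem entirely for work shaped like this.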
Potentially hostile code (such as plug-ins) are often executed in a new AppDomain. For those unaware, this approach offers the use of CAS to limit the code's capabilities (i.e., sandbox), makes communication with the main application quite easy (e.g., MarshalByRefObject), and also provides a way to terminate the code and unload all of its assemblies without affecting the main AppDomain.
However, AppDomain cannot solve the "aborted thread" problem as you've described for potentially harmful code, in all cases, due to CERs. Whenever CannotUnloadAppDomainException is thrown by a call to AppDomain.Unload, calling Environment.FailFast will ignore all finally blocks, except for CERs. I believe we can prevent plug-ins from using CERs by denying them unmanaged code privileges via CAS. But of course, if plug-ins need unmanaged code execution rights, then we have to make the difficult choice of running them out-of-proc (possibly sacrificing CAS altogether, and increasing the difficulty of communicating with the main app), or running them in-proc and leaving it up to the user to "Fail Fast" manually by terminating the process in Task Manager.
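For readers who have not used this isolation pattern, here is a minimal sketch (.NET Framework only - CreateDomain is not available on later runtimes; the actual plug-in loading is elided):

```csharp
using System;

class PluginHost
{
    public static void RunIsolated()
    {
        // Give the plug-in its own AppDomain...
        AppDomain sandbox = AppDomain.CreateDomain("PluginSandbox");
        try
        {
            // ...load and invoke it via a MarshalByRefObject proxy here...
        }
        finally
        {
            // ...and unload the whole domain when done or misbehaving:
            // its threads are aborted and its assemblies unloaded without
            // touching the main AppDomain (unless a CER blocks the unload,
            // as described above).
            AppDomain.Unload(sandbox);
        }
    }
}
```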
The "have cancellation checks" alternative to thread.abort is a terribly poor solution to what (in other contexts) is a solved problem. It's not always easy to predict which component needs cancellation checks where; large very valuable bodies of external code you'll be using don't include these new cancellation points (but potentially *do* include finally blocks - go figure), and adding cancellation points in complex computations may be tricky and obscure the code.
If an operation reads a bunch of memory and writes to a particular segment of memory, then it is in principle OK to stop the operation halfway, in whatever invalid state, so long as you discard the contents of the output (or consider them nonsensical). This approach maps much more naturally onto how encapsulation in general works; it plays particularly nicely with functional code, and, via finally blocks or some other mechanism, it can still permit an unlimited amount of cleanup after an asynchronous abort if that is absolutely critical (which it really shouldn't be).
Now, we *could* stick that abortable code in a separate process. However, the .NET framework in particular doesn't make that easy: the new process lives in a new memory space, meaning that even read-only access to critical data structures isn't possible (or at least not easy), and even if it were, you suddenly lose the benefits of the GC, since the GC doesn't track cross-process references. And, obviously, you can't use a whole host of standard features for no particularly good reason (e.g. things like Parallel.ForEach). Finally, it's also a much less stable solution: whereas previously a synchronization error might corrupt a single process (however many thousand threads it had), using processes like this by default breaks containment; killing one does not necessarily kill the rest, so bugs can more easily undermine the whole system's stability. If processes were as cheap as threads, and the instantiation library would instruct the OS to consider new processes as sub-processes to be terminated on a parent's exit, and the GC were multi-process aware, and accessing other processes' objects (at least in a read-only fashion) were transparently possible, sure, *then* processes would be a decent option.
None of this is to say that Thread.Abort might not be potentially problematic *right now*. However, in the long run, it's a much more fundamentally attractive proposition than manually sprinkling code with "stopping points". Operating systems switched from cooperative to preemptive multitasking, and for similar reasons thread termination should not *require* manual handholding - it's just too bug-prone, and it breaks encapsulation.
If .NET aims to be a useful platform for multi-threaded development, it should aim to make such development easy and reliable. Manually sprinkling code with stopping points is anything but easy and reliable.
Funny, just got back from The Australian Pink Floyd. They didn't play that, but it was very good!!
What about threads which themselves use timeouts?
I use worker threads to monitor devices over the internet. The worker threads send commands with timeouts over sockets.
When stopping a worker thread I wait for the longest timeout, but sometimes the thread still does not respond in a timely fashion.
So I'm using the axe. Didn't find another option.
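One mitigation in this situation is to put the timeout on the socket operation itself, so that no single call can block longer than the timeout and the worker always returns to its cancellation check promptly. A sketch (the method name and the 5-second value are illustrative):

```csharp
using System;
using System.Net.Sockets;

class DeviceMonitor
{
    // Each socket call is bounded: Send/Receive throw a SocketException
    // once the timeout elapses, instead of blocking the worker forever.
    public static byte[] SendCommand(Socket socket, byte[] command)
    {
        socket.SendTimeout = 5000;      // milliseconds
        socket.ReceiveTimeout = 5000;
        socket.Send(command);
        var response = new byte[1024];
        int read = socket.Receive(response);  // throws on timeout
        Array.Resize(ref response, read);
        return response;
    }
}
```

The worker's loop then catches the SocketException, notices the stop request, and exits cleanly, so the longest you ever wait at shutdown is one socket timeout.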
> The "have cancellation checks" alternative to thread.abort is a terribly poor solution to what (in other contexts) is a solved problem.
@Eamon, so what's the solution then to the 'solved problem'?
I'd agree that predicting is hard without profiling, but typically a worker thread is going away to do something specific, and for me that usually means something like 'go and process the next n requests' and at the end of every n, the thread checks for cancellation.
I've written a few windows services, and these checks are built into the base class that each worker runs. We try to keep sys admins happy and avoid "The service failed to end in a timely fashion" messages.
So, if you're creating a background process, or anything unattended you have to raise the bar, and you should always raise the question 'Is it cancellable, and how long do I let it run before terminating?'
So as long as you know the worst case behaviour and communicate that out, then the code that spawns the process knows that it should on average take x, so if, after requesting cancel, the time is > x + some fudge factor, then go ahead and take the axe to it.
The axe coming out really points to a defect in the worker, and your code should log that. If you're calling a 3rd party component that's run away on you, well at least you have an axe to grind.
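The policy described in these last comments - request a clean stop, wait the communicated worst case plus a fudge factor, log, and only then take the axe - can be sketched as follows (all names here are illustrative):

```csharp
using System;
using System.Threading;

class SupervisedWorker
{
    // Returns true on a clean shutdown. The Abort path is a last resort:
    // needing it at all points to a defect in the worker, so it is logged.
    public static bool StopOrAbort(Thread worker, Action requestStop,
                                   TimeSpan worstCase, TimeSpan fudge,
                                   Action<string> log)
    {
        requestStop();
        if (worker.Join(worstCase + fudge))
            return true;  // clean shutdown within the promised bound
        log("Worker failed to stop within " + (worstCase + fudge) + "; aborting.");
        worker.Abort();   // the axe: investigate the worker's defect
        return false;
    }
}
```

A well-behaved worker never reaches the Abort branch; the log entry is what tells you the worker (or the third-party component it calls) has a bug worth chasing.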