Larry Osterman's WebLog

Confessions of an Old Fogey

Cleaning up shared resources when a process is abnormally terminated

This post came into my suggestion box yesterday from Darren Cherneski:

We have a system that has an in-memory SQL database running in shared memory that is created with CreateFileMapping(). Processes start up, attach to it via a DLL, do some queries, and shut down. The problem we keep running into during development is when a developer starts up a process in the debugger, performs a query (which gets a lock on a database table), and then the developer hits Shift-F5 to stop debugging, the lock on the database table doesn't get released. We've put code in the DllMain() function of the DLL to perform proper cleanup when a process crashes but DllMain() doesn't seem to get called when a developer stops a process in the debugger.

Windows has hundreds of system DLLs where a process can get a handle to a resource (Mutex, file, socket, GDI, etc.). How do these DLLs know to clean up when a developer hits Stop in the debugger?

It's a great question which comes up on our internal Win32 programming alias once a month or so, and it illustrates one of the key issues with resource ownership.

The interesting thing is that this issue only occurs with named synchronization objects.  Unnamed synchronization objects are always private, so the effects of a process abnormally terminating are restricted to that process.  The other resources mentioned above (files, sockets, GDI, etc.) don't have this problem, because when the process is terminated, the handle to the resource is closed, and closing that handle causes all the per-process state (locks on the file, etc.) to be flushed.  The problem with synchronization objects is that, with the exception of mutexes, they have state (the signaled state) that's not tied to a process or thread.  The system has no way of knowing what to do when a handle is closed with an event set to the signaled state, because there is no way of knowing what the user intended.

Having said that, a mutex DOES have the concept of an owning thread, and if the thread that owns a mutex terminates, then one of the threads blocked waiting on the mutex will be awoken with a return code of WAIT_ABANDONED.  That allows the caller to realize that the owning thread was terminated, and perform whatever cleanup is necessary.

Putting code in the DllMain doesn't work, because, as Darren observed, the DllMain won't be called when the process is terminated abruptly (like when exiting the debugger).

To me, the right solution is to use a mutex to protect the shared memory region, and if any of the people waiting on the mutex get woken up with WAIT_ABANDONED, they need to recognize that the owner of the mutex terminated without releasing the resource and clean up.

Oh, and I looked: Windows doesn't have "hundreds of system DLLs where a process can get a handle to a resource".  There are actually relatively few cases in the Windows code base where a named shared synchronization object is created (for just this reason).  And all of the cases I looked at either use a mutex and handle the WAIT_ABANDONED return code, or use a manual reset event (which doesn't have this problem), or implement some form of alternative logic to manage the issue (waiting with a timeout, registering the owner in a shared memory region, and, if the timeout occurs, checking whether the owner process still exists).

The reason that manual reset events aren't vulnerable to this issue, by the way, is that they don't have the concept of "ownership".  Instead, manual reset events are typically used to notify multiple listeners that some event has occurred (or that some state has changed).  In fact, internally in the kernel, manual reset events are known as NotificationEvents for just this reason (auto-reset events are known as SynchronizationEvents).  Oh, and mutexes are known as Mutants internally (you can see this if you double-click on a mutex object using the WinObj tool).  Why are they called mutants?  Well, it's sort of an in-joke.  As Helen Custer put it in "Inside Windows NT":

The name mutant has a colorful history.  Early in Windows NT's development, Dave Cutler created a kernel mutex object that implemented low-level mutual exclusion.  Later he discovered that OS/2 required a version of the mutual exclusion semaphore with additional semantics, which Dave considered "brain-damaged" and which was incompatible with the original object. (Specifically, a thread could abandon the object and leave it inaccessible.)  So he created an OS/2 version of the mutex and gave it the name mutant.  Later Dave modified the mutant object to remove the OS/2 semantics, allowing the Win32 subsystem to use the object.  The Win32 API calls the modified object mutex, but the native services retain the name mutant.

Edit: Cleaned up newsgator mess.

 

  • What's with all the weird and wonderful garbage characters in today's post? =)
  • Newsgator messed up. I've seen it once before, and I don't know what causes it.
  • Larry, I've been trying to figure out what timezone your posts are timestamped to. I've narrowed it down to a few islands off the coast of Newfoundland. Please elucidate.

    - adam
  • Adam,
    It's a rather curious interaction between newsgator and .Text (with the error being on the .Text end). Apparently Scott's working on fixing it.
  • > Edit: Cleaned up newsgator mess.

    Off by one, and the remaining bug is brain-damaged.
  • Nice info (as always!) :)

    Did you forget to remove the 'æ' in the last paragraph from the newsgator/.text problem?

    "... Dave considered "œbrain-damaged" and which was incompatible ..."

    Hm, or did Dave indeed call them 'æbrain-damaged'? :)
  • A couple of points:

    1. Putting cleanup code in a DllMain (or a global C++ destructor in a DLL) is usually a bad idea. See MSDN docs for the list of restrictions that apply to code running in DllMain.

    2. No matter what you do, you can't guarantee that your cleanup code will always run. This is a fact of life, and if you want to build a reliable system you need to accept it and design accordingly.

    In this case, it means using a mutex or some other object that will not be left in an unpredictable state even if a process crashes. Note that you also have to make sure that the data protected by the mutex is not corrupt - this can be much trickier (but then you decided to build your own SQL database, so you knew what you were getting into, right?)
  • Processes and threads are waitable too. When one of these terminates, it becomes 'signaled'; it will satisfy a wait function.

    You could create a thread to wait on threads and processes that need cleanup when they terminate using WaitForMultipleObjects with bWaitAll set to false, plus an extra event for adding new items to the list. I guess the biggest problem with that is that you would need a separate control process to host that thread.
    It would at least work as a backup plan, if your client threads/processes don't terminate nicely, you can still tell when they die. Just don't forget to close the handle when you are done.

    Looking at the csrss.exe process in SysInternals's Process Explorer, I notice that csr has handles open to every win32 process in this session, plus a long list of thread handles in those processes. I wonder if csr uses something like this to do emergency cleanup?
  • Curious: Explorer dies/hangs and is terminated for whatever reason. What brings it back fresh? And what causes it to not come back those really ugly times?
  • Winlogon tries to keep the shell (usually Explorer) alive by restarting it when it dies. It also puts an event in the event log when that happens. I don't know why it doesn't always come back.
    See this registry key: http://www.microsoft.com/windows2000/techinfo/reskit/en-us/default.asp?url=/windows2000/techinfo/reskit/en-us/regentry/12316.asp
  • If the shell exits with a status of 1 then winlogon doesn't try to restart it.
  • When developing/debugging shell extensions it's nice that explorer doesn't always restart (like if you kill it manually from task manager). More information about that is available here: http://msdn.microsoft.com/library/en-us/shellcc/platform/shell/programmersguide/shell_basics/shell_basics_programming/debugging.asp.
  • And sometimes when you manually restart Explorer it starts a separate window with its usual settings (in my case detailed view and maximized), which is usually what one wants for the second or subsequent instances of Explorer. It doesn't start a replacement for the Explorer that crashed, so the task bar doesn't come back.
  • Thanks Larry!

    This is one of those issues that has come up several times in our development. We have a Performance Monitor plugin DLL that uses a bit of shared memory that doesn't clean up on Shift-F5. We also have a small event-queue system that uses shared memory and the queues don't get cleaned up on Shift-F5. A guy programs all day - start, stop, start, stop, start, stop and by 2-3:00 you start getting weird failures because your subsystems have run out of memory.

    Using mutexes for their WAIT_ABANDONED property has a certain appeal. We've also been thinking about scanning for abnormal process death by trying to OpenProcess() each process that is using some shared memory and then checking its creation time with GetProcessTimes() to avoid the PID reuse issue. The problem with this method is that we're using a software copy-protection system that disables OpenProcess(). <sigh>


    Pavel Lebedinsky
    >(but then you decided to build your own SQL database, so you knew what you were getting into, right?)

    We actually bought this one http://www.quilogic.cc/

    It has a few bugs in it but they sell you the source so you can patch them up yourself.