Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

Concurrency, Part 4 - Deadlocks

Concurrency, Part 4 - Deadlocks

  • Comments 14
Yesterday I talked about critical sections and their use in protecting application data.

But critical sections are like duct-tape - they hold applications together but they also have their dark side.  A couple of commenters have touched on this, but the biggest issue with critical sections is "deadlocks".

The simplest way of generating a deadlock is what is called a "lock order inversion".  Basically if you have two objects, call them "Thing1" and "Thing2".  Each is protected by a critical section.

So what happens when you have:

class Thing1
{
    CriticalSection _Thing1Lock;
    class Thing2 *_Thing2;
public:
    void Lock() { EnterCriticalSection(&_Thing1Lock); }
    void Unlock() { LeaveCriticalSection(&_Thing1Lock); }

    void Thing1Method1()
    {
        Lock();
        ...
        _Thing2->Thing2Method2();
        ...
        Unlock();
    }
    Thing1Method2()
    {
        Lock();
        ...
        Unlock();
    }
};

class Thing2
{
    CriticalSection _Thing2Lock;
    class Thing1 *_Thing1;
public:
    void Lock() { EnterCriticalSection(&_Thing2Lock); }
    void Unlock() { LeaveCriticalSection(&_Thing2Lock); }

    Thing2Method1()
    {
        Lock();
        ...
        _Thing1->Thing1Method2();
        ...
        Unlock();
    }
    Thing2Method2()
    {
        Lock();
        ...
        Unlock();
    }
};

So what happens when Thread1 calls:

        thing1->Thing1Method1();

And Thread2 calls:

        thing2->Thing2Method1();

?

Thread 1 Thread 2
thing1->Thing1Method1() thing2->Thing2Method1()
EnterCriticalSection(&_Thing1Lock) EnterCriticalSection(&_Thing2Lock)
... ...
_thing2->Thing2Method2() _thing1->Thing1Method2()
EnterCriticalSection(&_Thing2Lock) EnterCriticalSection(&_Thing1Lock)

At this point, you've deadlocked the process.  Thread 1 is blocked waiting for Thread2 to release the _Thing2Lock.  Thread2 is blocked waiting on Thread1 to release the _Thing1Lock.

And there's no way of getting out of this situation, you're totally stuck.

So how do you avoid deadlocks?  Well, this is where the whole "concurrency discussion" starts getting tricky.

The simplest solution is once again, to avoid the problem.  Define a strict ordering for your locks and stick with it.

So this would change Thing2::Thing2Method1() to be:
    Thing2Method1()
    {
        _Thing1->Lock();
        Lock();
        ...
        _Thing1->Thing1Method2();
        ...
        Unlock();
        _Thing1->Unlock();
    }
 

And now, the flow of control is:

Thread 1 Thread 2
thing1->Thing1Method1() thing2->Thing2Method1()
EnterCriticalSection(&_Thing1Lock) EnterCriticalSection(_Thing1Lock)
   
...  
_thing2->Thing2Method2()  
EnterCriticalSection(&_Thing2Lock)  
_Thing1->Thing1Method2()  
EnterCriticalSection(&_Thing1Lock)  
...  
LeaveCriticalSection(&_Thing1Lock)  
...  
LeaveCriticalSection(&_Thing2Lock)  
LeaveCriticalSection(&_Thing1Lock)  
  EnterCriticalSection(&_Thing2Lock)
  ...
  _Thing1->Thing1Method2()
  EnterCriticalSection(&_Thing1Lock)
  ...
  LeaveCriticalSection(&_Thing1Lock)
  ...
  LeaveCriticalSection(&_Thing2Lock)
  LeaveCriticalSection(&_Thing1Lock)

Voila, we've avoided the deadlock.

So that's the third principle of concurrent programming: Know your lock order and never, ever violate it.

But what happens if you can't avoid violating the lock order?  It does happen, especially if your code calls out to other components.

Tomorrow, I'll talk about some ways of avoiding the problem.

 

Edit: Fixed typo in Thing2::Lock and Unlock, thanks ac.

 

  • The hard part is maintenance, for the original programmer as well as anyone else who has to do it. If it's up to the programmer to remember to lock things in a specific order, it creates bugs that take forever to figure out. It's hard enough to remember what's in a class you haven't touched for 6 months, let alone what order you're supposed to lock things, let alone what things need to be locked. It's horrible.

    A good idea just crossed my mind: Every time a thread tries to acquire a mutex/critical section/etc., the acquisition routine looks at the other mutexes that thread owns and makes sure they're being acquired in the correct order, i.e. by memory address. If they're out of order, then it does some magic to fix it. The magic could be difficult. This way locks aren't acquired in the wrong order and the philosophers can all dine (I think that's the right situation, correct me if I'm wrong).

    Yes, it's horribly expensive during program runtime, but it's a lot faster during program development, testing, maintenance, release, and support.

    Ugh. Every time I start thinking about concurrency, even easy concurrency, I cringe at how difficult it is and how many undiscovered bugs my code must still have.

    Programming languages and tools will *have* to evolve better support for it, otherwise most programmers will be totally lost at it. It's simply too hard for non-Einsteins to get correct.
  • Absolutely right Ryan.

    I've tried the "define a hierarchy and change your "EnterCriticalSection" logic to enforce the hierarchy (it's not too hard - you define a TLS entry that contains a linked list of the locks that have been acquired on the thread so far and check it on entry).

    THe problem is that this requires a HUGE amount of effort on the part of the development team, and there are times when lock inversions are benign (if two code paths can never be executed simultaneously (because of a higher level lock, for example), then the lock order doesn't matter).
  • I'm guessing that the Lock and Unlock functions in class Thing2 should be calling Enter/LeaveCriticalSection(&_Thing2Lock) not Thing_1_Lock, right?

  • Several times I've been asked to help with hang bugs that turned out to be deadlocks, and when I asked about the lock order, I heard "what's a lock hierarchy." Groan.

    What usually screws you over here is hidden locks in APIs that call back into user code. The Win32 loader lock is famous for this; I've also been burned by the tree locks in Java AWT and the filter graph lock in DirectShow. I believe that internal lock info should be part of all API specifications.
  • The tree lock is actually documented, but only after that problem has bit you in the rear and you read the source code to figure out what's happening. All the AWT/Swing components plainly tell you that you can only access them from the AWT thread after show() has been called on any AWT/Swing component. You can only realize how important that detail is when your program mysteriously hangs (some of the time) when the GUI is popping up.
  • Larry, this is a great series ! Thanks.

    If you write device drivers for Windows you get lock inversion checking when you run the driver verifier. This is just so cool. Kudos to the team who did that, it must have saved man-centuries by now and contributed massively to kernel stability.

    By the way, I don't believe that *any* lock inversion is benign. You often have to think really hard to convince yourself, and no matter how sure you are that it works _today_, there is no guarantee that it will work after the latest bug fix (which Murphy tells us will be a rushed emergency security patch).
  • One thing I've done in the past to prevent this problem is this :

    a) a unique integer is associated to each critical section
    b) a stack (of integers) is in each TLS
    c) debug macros for EnterCriticalSection and LeaveCriticalSection do push/pop the associated integer in the stack. If the pushed value is less than a previously pushed value, an assert is fired

    this enforces a stricter policy than needed, but it solves this common problem :)
  • This applies not only to system programming, but also to database programmers. I can't tell you how many bugs I've fixed in database software because two developers writing two separate sprocs access (and lock) the same database objects in a different order. It never seems to manifest itself in testing, but in a production environment the starts to ring with statements like "My process was chosen as a deadlock victim!!!"
  • Another variation of the unique integer system is to assign a priority to each lock that doesn't need to be unique. If you try and lock multiple locks, then the new lock must have a higher priority than the previous lock. VMS used this trick with their IRQL locks (or something like that).
  • I think the issue here is that you're overly protective. You should only lock around code that absolutely needs to be locked. Pointers to the other thing are instanced, not global, so calls to their methods do not need to be locked. If you stick to locking only around code that accesses common resource (and does absolutely nothing else) and have a single lock per each resource you would not have this problem. Instead a lot of programmers decide to be "safe" and lock the whole method even though it should not be.
  • I do not like APIs that do their locking bythemselves, as it can create all sorts of weirdness. The APIs should provide a lock mechanism that the caller will have to use to ensure correctness.

    However circular relations between APIs are even worse.

    Both have been violated here and of course you will end up in a mess.

    It isn't concurrency's fault that you get into trouble, you would end up with the same problems even in a single thread application with an asynchrounous state machine for instance.

    The only difference is, it would then probably be a called a race condition.

    Threaded applications have more benefits than just speed (which in my opinion is a very limited gain on normal machines with normal applications which often don'
    t have many concurrent jobs). Threaded application often become more top-down and the error flows are usually easier to write. It also forces the developer to design proper and disjunct interfaces between different parts of the application, something which often is very violated in a single threaded applications.

    Threading does not have to be as difficult as some want to make you believe. There are some simple guidelines to follow and if you do 95% of your headaches are most likely gone.
  • Niclas,
    You're right. I disagree on the APIs that do locking - there are very few examples where it's appropriate for an API to trust the integrity of the APIs internal data structures to its callers - the callers will almost always get it wrong.

    But you're right about multithreading not being hard - that's actually why I started writing this series - it's really NOT that hard, if you follow a few rules.
Page 1 of 1 (14 items)