Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

Concurrency, part 6 - Reference counting is hard :)

Concurrency, part 6 - Reference counting is hard :)

  • Comments 9
The other day, I talked about using reference counting as a way of ensuring object when enumerating a list containing objects.

Reference counting is an incredibly useful, but incredibly dangerous technique.  If you use it right, it can be a huge help, if you get it wrong, it's a nightmare.

For example, what's wrong with this code:

class Thing1 : public RefCountedObject
{
    CriticalSection _Thing1Lock;
    class Thing2 *_Thing2;
public:
    void Lock() { EnterCriticalSection(&_Thing1Lock); }
    void Unlock() { LeaveCriticalSection(&_Thing1Lock); }

    void SetThing2(Thing2 thing2)
    {
        Lock();
        _Thing2 = thing2;
        _Thing2->AddRef();
        Unlock();
    }
};

class Thing2 : public RefCountedObject

{
    CriticalSection _Thing2Lock;
    class Thing1 *_Thing1;
public:
    void Lock() { EnterCriticalSection(&_Thing2Lock); }
    void Unlock() { LeaveCriticalSection(&_Thing2Lock); }

    void SetThing1(Thing1 *thing1)
    {
        Lock();
        _Thing1 = thing1;
        _Thing1->AddRef();
        Unlock();
    }
};

If you call thing2->SetThing1 and thing1->SetThing2, you've just introduced a circular dependency - neither thing1 or thing2 will ever be released.  And, while this example is utterly silly, there are lots of examples where you can get circular references. 

I think a real world example fits in quite nicely here.  We've got a notification class for some of our stuff that had a deadlock in it.  The original class looked like:

class NotificationManager: public IUnknown
{
    struct NotificationBlock
    {
        CALLBACKFUNCTION _Function;
        LPVOID _Context;
        NotificationBlock(CALLBACKFUNCTION Function, LPVOID Context) :
            _Function(Function),
            _Context(Context);
        {
        }
    }
    CAtlList<NotificationBlock> _NotificationListList;
    CRITICAL_SECTION _NotificationListLock;

    RegisterForNotifications(CALLBACKFUNCTION Function, LPVOID Context)
    {
        NotificationBlock block;
        block._Function = Function; block._Context = Context;
       
        EnterCriticalSection(&_NotificationListLock);
        _NotificationList.Add(block);
        LeaveCriticalSection(&_NotificationListLock);
    }
    UnregisterNotifications(CALLBACKFUNCTION Function, LPVOID Context)
    {
        NotificationBlock block(Function, Context);
        EnterCriticalSection(&_NotificationListLock);
        pos = _NotificationList.Find(block);
        _NotificationList.RemoveAt(pos);
        LeaveCriticalSection(&_NotificationListLock);
    }
    GenerateNotification(LPVOID NotificationBlock)
    {
        NotificationBlock block;
        EnterCriticalSection(&_NotificiationListLock)
        pos = _NotificationList.GetHeadList(block);
        while (block = _NotificationList->GetNext(pos))
        {
            block._Function(block._Context, NotificationBlock);
        }
        LeaveCriticalSection(&_NotificationListLock);
    }
}

Unfortunately, because the callback function was called with the list locked, there was a huge potential for deadlock (and in fact, one was observed).

My first thought was to change the notification block using a temporary list..

    GenerateNotification(LPVOID NotificationBlock)
    {
        NotificationBlock block;
        CAtlList<NotificationBlock> temporaryList;
        EnterCriticalSection(&_NotificiationListLock)
        pos = _NotificationList.GetHeadList(block);
        while (block = _NotificationList->GetNext(pos))
        {
            NotificationBlock tempBlock(block._Function, block._Context);
            temporaryList.Add(tempBlock);
        }
        LeaveCriticalSection(&_NotificationListLock);
        while (block = temporaryList.RemoveHead())
        {
            block._Function(block._Context, NotificationBlock);
        }       
    }

The problem with this is that it's possible for someone to have removed the block from the list as soon as the lock was released.  So I needed to have a way of protecting the blocks.  Well, the fourth principle comes into play: I reference counted the blocks.

Now the function looks like:

class Thing1 : public INotification
{
    CriticalSection _Thing1Lock;
    Thing1Container *_MyContainer;
public:
    void Lock() { EnterCriticalSection(&_Thing1Lock); }
    void Unlock() { LeaveCriticalSection(&_Thing1Lock); }

    void FinalConstruct(NotificationContainer *MyContainer)
    {
        _MyContainer = MyContainer;
        _MyContainer->AddRef();
        _MyContainer->RegisterForNotifications(CallbackFunction, this);
    }
    void FinalRelease()
    {
        _MyContainer->UnregisterNotifications(CallbackFunction, this);
        _MyContainer->Release();
    }
};

class NotificationManager: public IUnknown
{
    CAtlList<INotification *> _NotificationListList;
    CRITICAL_SECTION _NotificationListLock;

    RegisterForNotifications(INotification* Notify)
    {
        EnterCriticalSection(&_NotificationListLock);
        _NotificationList.Add(Notify);
        LeaveCriticalSection(&_NotificationListLock);
    }
    UnregisterNotifications(INotification *Notify)
    {
        EnterCriticalSection(&_NotificationListLock);
        pos = _NotificationList.Find(Notify);
        _NotificationList.RemoveAt(pos);
        LeaveCriticalSection(&_NotificationListLock);
    }
    GenerateNotification(LPVOID NotificationBlock)
    {
        INotification *notify;
        CAtlList<INotification *> temporaryList;
        EnterCriticalSection(&_NotificiationListLock)
        pos = _NotificationList.GetHeadList(block);
        while (notify = _NotificationList->GetNext(pos))
        {
            notify->AddRef();
            temporaryList.Add(notify);
        }
        LeaveCriticalSection(&_NotificationListLock);
        while (notify = temporaryList.RemoveHead())
        {
            notify->OnNotification(NotificationBlock);
            notify->Release();
        }
    }


}

Unfortunately, this introduces a circular reference.  Since thing1 is added to the list in its final construct, the call to UnregisterNotifications never happens - the notification block keeps a reference to thing1 and the reference count never goes away.  Unless there's an external mechanism for removing thing1 from the container, neither object will ever be freed.

The easiest way of solving this is to introduce a tiny "delegator" object between thing1 and the notification manager.  The notification manager references the delegator, not the Thing1 object, and thus the lifetime of Thing1 is unaffected by the notification manager.

The good news is that managed environments, like the CLR make most of this reference counting stuff pretty much unnecessary, which is a very good thing.

Oh, and the fifth principle: "Reference counting is hard if you're not really careful".

So much for framework involving concurrency.  Believe it or not, I'm almost done :)  Tomorrow, I want to talk about opportunities for concurrency.

 

Edit: One of these days, I'll remember that your principal is your friend.

  • s/principal/principle

    If you do ever write a book get a good proofreader ;-)
  • First, great blog! I always was too lazy for writing that sort of things...

    I guess, your NotificationManager would not compile that easily. I can see about a ten of errors that won't compile. But still, I've got an idea :)

    In fact I'm in software development for pretty long time and I always hated that people do not understand such absolutely-must-know things like reference counting, multi-threading, re-entrancy, and syncronization issues. Most of them lose themselves even in virtual methods!.. But I digress.

    I haven't worked with .NET thing for very long time, but what they did with reference counting in remoting was another WTF for me. This is where reference counting is broken I think.
  • regarding the 1st example,
    in Thing1::SetThing2, don't you need to release whatever was in this->_Thing2 before overwriting it & addref'ing the passed value? (same thing in Thing2::SetThing1). This is so you don't leak refs when these methods are called repeatedly.
    Also the AddRef could probably be moved before the Lock call in these methods, thereby reducing the time spent with the lock held.
    But then again, maybe i'm wrong, since ref count is hard ;-)


  • Alexey, I didn't bother to try to ensure that the examples compiled, to be honest :)

    jbn,
    you're right, I should have released the _Thing1 variable (if it was not null). And the addref could be moved before the lock is held (after changing the AddRef target to thing1). If you were addref'ing thing1 (as opposed to _Thing1) you could even move it AFTER releasing the lock (because the caller has a reference to thing1).

    On the other hand, the AddRef of _Thing1 has to happen within the lock - otherwise _Thing1 could be set to NULL between the LeaveCriticalSection and the call to Release.
  • Could you explain a little more about solving the circular reference issues? I'm sure many people have run into these sort of problems.
  • I have had to debug many problems of circular references, and to me deadlock problems and circular references are a lot alike.
    If you look at http://lists.ximian.com/archives/public/mono-patches/2003-August/023295.html, they explain how they model their lock hierarchy to never violate their lock order; now if you graph how your data structures point to each other and which pointers are owning references (i.e. AddRef'ed pointers), then it's pretty natural to think that when two items are in a refcount cycle, they "deadlock" as in "they lock each other in memory". Whether for mutexes or refs, no deadlocks means no cycles in the associated graph.. In any case, it should be possible to demonstrate via formal methods that a given program cannot deadlock: does this make any sense to you or am i off-topic already? ;-)
  • I think circular references is more a design problem. Often, reference counting is used as a remedy for some problem and added on some stage of coding (or even after it was already done). That's just plain wrong. These questions like who owns what and who controls what must be solved early in design stage.

    I remember I had to fix some file system filter driver (that was an anti-virus filter). None of the internal objects were reference-counted and all code (il-)logic was built on some expected behaviour pattern. After adding new functionality to driver, it started to bugcheck after just several seconds of stressing. It turned out, that objects were often destroyed before they were actually used. So, the thing I did was minor re-design. First, I made all entities to be objects (that means at least reference counting support). Second, I defined locking hierarchy for all of objects. I then needed just three days to re-work all the code and it started working! And manager said that previous guy needed about half a year to find workarounds for all of issues they had. In fact, I needed half of synchronization objects there was before and according to performance testing I managed to dramaticaly reduce the time they were locked.

    So, you see I love to talk about myself even on other one's blog :) But what I want to say is that every developer really should think about lock ordering and reference counting and concurrency and re-entrance and all that stuff on very early stages of designing. And every developer *must* known what *is* concurrency and *understand* its issues. But as Neal Christiansen says on Channel 9, so many people (event experienced ones) cannot do that, and you need kind of talent to do thay. Sigh.
  • Alexey,
    You're right that it's a design problem. But it's indescribably easy for circular references to come into existance. That's one of the reasons that garbage collection is harder than just reference counting objects - you need to actively sweep through all allocated objects looking for cycles.

    Having said that, your comments are spot on.
  • The boost shared_ptr / weak_ptr libraries (coming soon to a C++ standard library near you) are a great way to handle this sort of circular dependency. If A has a strong pointer to B but B only has a weak pointer to A, no circularity is introduced (although B has to be aware that when it tries to access the A it might have gone away).

    Works really well for callbacks, and is well-enough designed that it's hard to use it wrongly.
Page 1 of 1 (9 items)