April, 2011

  • The Old New Thing

    Lock-free algorithms: The try/commit/(try again) pattern


    The singleton constructor pattern and the Interlocked­Multiply example we saw some time ago are really special cases of the more general pattern which I'll call try/commit/(try again). I don't know if this pattern has a real name, but that's what I'm calling it for today.

    The general form of this pattern goes like this:

    for (;;) {
     // capture the initial value of a shared variable we want to update
     originalValue = sharedVariable;
    
     ... capture other values we need to perform the operation ...
      ... these values must be independent of sharedVariable ...
    
     newValue = ... calculate the desired new result
                    based on originalValue and other captured values ...
    
     // Xxx can be Acquire, Release, or null
     if (InterlockedCompareExchangeXxx(
                &sharedVariable,
                newValue, originalValue) == originalValue) {
      break; // update was successful
     }
    
     ... clean up newValue ...
    
    } // loop back and try again
    

    We calculate the desired new value based on the initial value, combining it with other values that vary depending on the operation you want to perform, and then use an Interlocked­Compare­Exchange to update the shared value, provided the variable hasn't changed from its initial value. If the value did change, then that means another thread raced against us and updated the value before we could; in that case, we go back and try it again. Maybe the next time through we won't collide against somebody.

    Note that the try/commit/try again pattern is unfair. There's no assurance that the thread that has been trying to update the value for the longest time will win the next race. (This is a property common to most lock-free algorithms.)

    The Interlocked­Multiply function follows this pattern very closely: The other value required to perform the operation is simply the multiplier, which is a parameter to the function and therefore is independent of the shared variable. The new value is simply the product, and if we are unable to update the shared value (because somebody else modified it), we just start over.
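
    For reference, here is a sketch of how such an InterlockedMultiply helper might look. It is a helper built out of InterlockedCompareExchange rather than a Win32 API, and this sketch is only an illustration of the pattern described above:

     LONG InterlockedMultiply(volatile LONG *plMultiplicand, LONG lMultiplier)
     {
      LONG lOriginal, lResult;
      do {
       // capture the shared value and compute the product from the capture
       lOriginal = *plMultiplicand;
       lResult = lOriginal * lMultiplier;
       // commit only if the shared value hasn't changed; otherwise try again
      } while (InterlockedCompareExchange(plMultiplicand,
                                          lResult, lOriginal) != lOriginal);
      return lResult;
     }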

    A variation of try/commit/try again is try/commit/abandon. In this pattern, there is no loop. If the exchange fails, you just give up and return a failure code. The function Try­Enter­Critical­Section uses the try/commit/abandon pattern. (It also uses the Acquire version of Interlocked­Compare­Exchange for reasons which should be obvious.)
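
    To make the shape concrete, here is a minimal hypothetical sketch of try/commit/abandon (the state encoding of 0 = available, 1 = claimed is my own, purely for illustration): one attempt, no loop, and failure is simply reported to the caller.

     // Hypothetical sketch: claim a resource if nobody else has.
     // If the exchange fails, we abandon the attempt rather than retry.
     BOOL TryClaimResource(volatile LONG *plState)
     {
      return InterlockedCompareExchangeAcquire(plState, 1, 0) == 0;
     }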

    Our singleton pattern is another special case of try/commit/try again where the "try again" is optimized out because we know what the result of "try again" is going to be, so we don't actually have to do it. In the singleton pattern case, the Interlocked­Compare­Exchange is a Release because the new value depends on other memory locations.

    Normally, the shared variable is an integer rather than a pointer, because a pointer is subject to the ABA problem if you incorporate the pointed-to data into your calculations. We get away with it in the singleton pattern case because the value change is unidirectional: It goes from NULL to something, and once it's something it never changes again. If the value of the shared variable can change in more general ways, then you have to be more careful if you use a pointer as the shared variable. (The most common solution is to make the shared variable not just a pointer but a pointer plus a counter which increments at each operation.)
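
    For illustration, here is a hypothetical sketch of the pointer-plus-counter idea. The union layout and names are my own, and a 32-bit build is assumed so that the pointer and the counter together fit in a single 64-bit compare-exchange (a 64-bit build would need a 128-bit compare-exchange instead).

     // Hypothetical sketch, not from the article: pack the shared pointer
     // and a change counter into one 64-bit value (32-bit build assumed).
     union TAGGEDPTR {
      LONGLONG llWhole;
      struct {
       void *p;      // the shared pointer itself
       LONG  lCount; // incremented on every successful update
      } s;
     };
     
     // The caller captures the shared value into tpCaptured, computes pNew
     // from it, and then tries to commit. Because the counter changes on
     // every update, a pointer value that merely repeats no longer looks
     // like "nothing happened."
     BOOL TryReplacePointer(TAGGEDPTR volatile *ptpShared,
                            TAGGEDPTR tpCaptured, void *pNew)
     {
      TAGGEDPTR tpNew;
      tpNew.s.p = pNew;
      tpNew.s.lCount = tpCaptured.s.lCount + 1;
      return InterlockedCompareExchange64(&ptpShared->llWhole,
                 tpNew.llWhole, tpCaptured.llWhole) == tpCaptured.llWhole;
     }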

    Here's another use of the try/commit/try again pattern, using a change counter as the shared variable. First, two helper functions:

    LONG InterlockedReadAcquire(__in LONG *pl, __in LONG lUnlikely)
    {
      return InterlockedCompareExchangeAcquire(pl, lUnlikely, lUnlikely);
    }
    
    LONG InterlockedReadRelease(__in LONG *pl, __in LONG lUnlikely)
    {
      return InterlockedCompareExchangeRelease(pl, lUnlikely, lUnlikely);
    }
    

    Although direct reads and writes of properly aligned LONGs are atomic, the operations are not synchronized and impose no memory ordering semantics. To read a value with specific semantics, I pull a sneaky trick: I perform an Interlocked­Compare­Exchange with the desired memory ordering semantics, passing the same value as the comparand and the exchange; therefore, the operation, even if successful, has no computational effect.

    if (*pl == lUnlikely) *pl = lUnlikely;
    

    To avoid dirtying the cache line, I use an unlikely value as the comparand/exchange, so most of the time, the comparison fails and no memory is written. (This trick doesn't help on all architectures, but it doesn't hurt.)

    Okay, back to the change counter example:

    LONG g_lColorChange;
    
    ...
    case WM_SYSCOLORCHANGE:
     InterlockedIncrement(&g_lColorChange);
     ...
    
    int CalculateSomethingAboutSystemColors()
    {
      LONG lColorChangeStart;
      int iResult;
     do {
      lColorChangeStart = InterlockedReadAcquire(&g_lColorChange, -1);
      COLORREF clrWindow = GetSysColor(COLOR_WINDOW);
      COLORREF clrHighlight = GetSysColor(COLOR_HIGHLIGHT);
      ... other computations involving GetSysColor(...)
     } while (InterlockedReadRelease(
                           &g_lColorChange, -1) != lColorChangeStart);
     return iResult;
    }
    

    We capture the color change counter and then begin doing our calculations. We capture the value with acquire semantics so that we get the value before we start reading the system colors. When we're done, we compare the value of the change counter against the value we captured. If it's different, then that means that the colors changed while we were doing our calculations, so our calculations are all messed up. In that case, we go back and try it again.

    This technique does assume that you won't get into a situation where one thread manages to increment the change counter 4 billion times before the other thread manages to run. This is not a problem in practice. For example, in this case, it's reasonable to assume that nobody is going to change their system colors 4 billion times within a single thread quantum.

    Next time, I'll show a different variation on try/commit/abandon which might be suitable for simple caches.

    Exercise: Criticize the following: "I noticed that there is no interlocked read operation, but there is Interlocked­Or, so my plan is to perform an interlocked read by or'ing with zero."

  • The Old New Thing

    Lock-free algorithms: The singleton constructor


    The first half may be familiar to many (most?) readers, but there's an interesting exercise at the bottom.

    A very useful pattern for the Interlocked* functions is lock-free lazy initialization via Interlocked­Compare­Exchange­Pointer­Release. Yes, that's a really long function name, but it turns out every part of it is important.

    Widget *g_pwidCached;
    
    Widget *GetSingletonWidget()
    {
     Widget *pwid = g_pwidCached;
     if (!pwid) {
      pwid = new(nothrow) Widget();
      if (pwid) {
       Widget *pwidOld = reinterpret_cast<Widget*>
           (InterlockedCompareExchangePointerRelease(
              &reinterpret_cast<PVOID&>(g_pwidCached),
              pwid, NULL));
       if (pwidOld) {
        delete pwid; // lost the race - destroy the redundant copy
        pwid = pwidOld; // use the old one
       }
      }
     }
     return pwid;
    }
    

    This is a double-check lock, but without the locking. Instead of taking a lock when doing the initial construction, we just let it be a free-for-all over who gets to create the object. If five threads all reach this code at the same time, sure, let's create five objects. After everybody creates what they think is the winning object, they call Interlocked­Compare­Exchange­Pointer­Release to attempt to update the global pointer.

    The parts of the name of the Interlocked­Compare­Exchange­Pointer­Release function work like this:

    • Interlocked: The operation is atomic. This is important to avoid two threads successfully updating the value of g_pwidCached.
    • Compare: The value in g_pwidCached is compared against NULL.
    • Exchange: If the values are equal, then g_pwidCached is set to pwid. This, combined with the comparison, ensures that only one thread gets to set the value of g_pwidCached.
    • Pointer: The operations are on pointer-sized data.
    • Release: The operation takes place with release semantics. This is important to ensure that the pwid we created is fully-constructed before we publish its pointer to other processors.

    This technique is suitable when it's okay to let multiple threads try to create the singleton (and have all the losers destroy their copy). If creating the singleton is expensive or has unwanted side-effects, then you don't want to use the free-for-all algorithm.

    Bonus reading:

    • One-Time Initialization helper functions save you from having to write all this code yourself. They deal with all the synchronization and memory barrier issues, and support both the one-person-gets-to-initialize and the free-for-all-initialization models.
    • A lazy initialization primitive for .NET provides a C# version of the same.

    Okay, now here's the interesting exercise. This is an actual problem I helped out with, although details have been changed for expository purposes.

    We have a data structure which manages a bunch of singleton objects, let's say that they are instances of a structure called ITEMCONTROLLER and they are keyed by a 32-bit ID. We're looking for design suggestions on making it thread-safe. The existing code goes like this (pseudocode):

    struct ITEMCONTROLLER;
    struct SINGLETONINFO {
     DWORD dwId;
     ITEMCONTROLLER *(*pfnCreateController)();
    };
    
    class SingletonManager {
    public:
     // rgsi is an array that describes how to create the objects.
     // It's a static array, with csi in the range 20 to 50.
     SingletonManager(const SINGLETONINFO *rgsi, UINT csi)
                   : m_rgsi(rgsi), m_csi(csi),
                     m_rgcs(NULL), m_ccs(0), m_ccsAlloc(0) { }
     ~SingletonManager() { ... }
     ITEMCONTROLLER *Lookup(DWORD dwId);
    
    private:
     struct CREATEDSINGLETON {
      DWORD dwId;
      ITEMCONTROLLER *pic;
     };
    
    private:
     const SINGLETONINFO *m_rgsi;
     int m_csi;
    
     // Array that describes objects we've created
     CREATEDSINGLETON *m_rgcs;
      int m_ccs;
      int m_ccsAlloc;
    };
    
    ITEMCONTROLLER *SingletonManager::Lookup(DWORD dwId)
    {
     int i;
    
     // See if we already created one
     for (i = 0; i < m_ccs; i++) {
      if (m_rgcs[i].dwId == dwId)
       return m_rgcs[i].pic;
     }
    
     // Not yet created - time to create one
      ITEMCONTROLLER *pic = NULL;
      for (i = 0; i < m_csi; i++) {
      if (m_rgsi[i].dwId == dwId) {
       pic = m_rgsi[i].pfnCreateController();
       break;
      }
     }
      if (pic == NULL) return NULL;
    
     ... if m_rgcs == NULL then allocate it and update m_ccsAlloc
     ... else realloc it bigger and update m_ccsAlloc
    
     // append to our array so we can find it next time
     m_rgcs[m_ccs].dwId = dwId;
     m_rgcs[m_ccs].pic  = pic;
     m_ccs++;
    
     return pic;
    }
    

    In words, the SingletonManager takes an array of SINGLETONINFO structures, each of which contains an ID and a function to call to create the object with that ID. To look up an entry, we first check if we already created one; if so, then we just return the existing one. Otherwise, we create the object (using pfnCreateController) and add it to our array of created objects.

    Our initial inclination is to put a critical section around the entire Lookup function, but maybe there's something more clever we can do here. Maybe a slim reader-writer lock?

    Bonus chatter: Although it's the case on Windows that the plain versions of the interlocked functions impose both acquire and release semantics, other platforms may not follow Windows' lead. In particular, on the Xbox 360 platform, the plain versions of the interlocked functions impose neither acquire nor release semantics. I don't know what the rules are for Windows CE.

    Erratum: I once knew but subsequently forgot that the singleton pattern described in this article (with the InterlockedCompareExchangePointer) is not safe on some CPU architectures. An additional MemoryBarrier() needs to be inserted after the fetch of the singleton pointer to ensure that indirections through it will retrieve the new values and not any cached old values:

    Widget *GetSingletonWidget()
    {
     Widget *pwid = g_pwidCached;
     if (!pwid) {
      ...
     } else {
      // Ensure that dereferences of pwid access new values and not old
      // cached values.
      MemoryBarrier();
     }
     return pwid;
    }
    

    The discussion of lock-free algorithms continues (with probably more errors!) next time.

  • The Old New Thing

    Windows is not a .NET Framework delivery channel either


    We learned a while ago that Windows is not an MFC delivery channel. And, since you asked, it's not a .NET Framework delivery channel either.

    If you're developing a program that uses the .NET Framework, you have to have a backup plan if the version of the .NET Framework you need is not installed on the computer. This might mean including a copy of the installer on your CD. It might mean redirecting the user to an appropriate download site. It might just mean telling the user, "This program requires version XYZ of the .NET Framework." Whatever you do, you need to do something.

    Windows XP didn't come with any version of the .NET Framework. Windows Vista came with version 2, and Windows 7 came with version 3.5, but these were provided as optional components which were installed by default. You can go into the Programs and Features control panel to remove them.

    As I recall, the application compatibility folks have a list of programs that treated Windows as a .NET Framework delivery channel; if you install any of them, and the version of the .NET Framework they depend on isn't installed, the application compatibility layer will display the informational message that the application neglected to.

    Bonus chatter: These programs which treated Windows as a .NET Framework delivery channel may have been misled by charts like this one or lists like this one which give the impression that Windows is a .NET Framework delivery channel.

  • The Old New Thing

    Lock-free algorithms: The opportunistic cache


    Suppose profiling reveals that a specific calculation is responsible for a significant portion of your CPU time, and instrumentation says that most of the time, it's just being asked to calculate the same thing over and over. A simple one-level cache would do the trick here.

    BOOL IsPrime(int n)
    {
     static int nLast = 1;
     static BOOL fLastIsPrime = FALSE;
    
     // if it's the same value as last time, then
     // use the answer we cached from last time
     if (n == nLast) return fLastIsPrime;
    
     // calculate and cache the new result
     nLast = n;
     fLastIsPrime = slow_IsPrime(n);
     return fLastIsPrime;
    }
    

    Of course, this isn't thread-safe, because if one thread is pre-empted inside the call to slow_IsPrime, then another thread will see values for nLast and fLast­Is­Prime that do not correspond to each other.

    One solution would be to put a critical section around this code, but this introduces an artificial bottleneck: If the most recent cached result is nLast = 5, fLast­Is­Prime = TRUE, and if two threads both try to see whether 5 is prime, you don't want those two threads to serialize against each other.

    Another solution is to use slim reader-writer locks and acquire in shared mode when checking the cache and in exclusive mode when updating the cache. But let's try a lock-free solution.

    We're going to combine two different techniques. First, we use the change counter technique we saw last time when we investigated try/commit/(try again), but we also combine it with a lock that is manipulated with a try/commit/abandon pattern.

    #define IsLocked(l) ((l) & 1)
    
    BOOL IsPrime(int n)
    {
     static int nLast = 1;
     static BOOL fLastIsPrime = FALSE;
     static LONG lCounter = 0;
    
     // see if we can get it from the cache
     LONG lCounterStart = InterlockedReadAcquire(&lCounter, -1);
     if (!IsLocked(lCounterStart) && n == nLast) {
      BOOL fResult = fLastIsPrime;
      if (InterlockedReadRelease(&lCounter, -1) == lCounterStart)
       return fResult;
     }
    
     // calculate it the slow way
     BOOL fIsPrime = slow_IsPrime(n);
    
     // update the cache if we can
     lCounterStart = lCounter;
     if (!IsLocked(lCounterStart) &&
         InterlockedCompareExchangeAcquire(&lCounter,
                  lCounterStart+1, lCounterStart) == lCounterStart) {
      nLast = n;
      fLastIsPrime = fIsPrime;
      InterlockedCompareExchangeRelease(&lCounter,
                  lCounterStart+2, lCounterStart+1);
     }
     return fIsPrime;
    }
    

    The lCounter consists of a LOCK bit as the bottom bit and a change counter in the remaining bits. (Choosing the bottom bit as the LOCK bit makes the operations of clearing the lock and incrementing the counter very simple.)

    There are two parts to this function, the part that reads the cache and the part that updates the cache.

    To read the cache, we first read the counter with acquire semantics, so that the reads of nLast and fLast­Is­Prime will not take place until after we get the counter. If the counter says that the cache is not locked, then we go ahead and fetch the last value and the last result. If the last value in the cache matches the value we're calculating, then we go ahead and use the last result. But as a final check, we make sure that the counter hasn't changed while we were busy looking at the protected variables. If it has, then it means that we may have read inconsistent values and cannot trust the cached result.

    If we have a cache miss or we couldn't access the cache, we go ahead and calculate the result the slow way.

    Next, we try to update the cache. This time, instead of just looking to see whether the cache is locked, we try to lock it ourselves by setting the bottom bit. (If the lock fails, then we skip the cache update and just return the value we calculated.) Once the lock is taken, we update the protected variables, then atomically release the lock and increment the counter. (This is where putting the lock in the bottom bit comes in handy: You can increment the counter by adding 2 and not worry about a carry out of the counter bits turning into an accidental lock bit.) We use Release semantics so that the values of the protected variables are committed to memory before the lock bit clears.

    Note that in both halves of the function, if the cache is locked, we just proceed as if there were no cache at all. The theory here is that it's better just to say "Oh, the heck with it, I'll just do it myself" than to line up and wait to access the cache. Continuing instead of waiting avoids problems like priority inversion, but it also means that you get some spurious cache misses. Fortunately, since it's just a cache, an occasional spurious miss is not the end of the world.

    You could do something similar with the Try­Enter­Critical­Section function provided you're running Windows NT 4.0 or higher:

    BOOL IsPrime(int n)
    {
     static int nLast = 1;
     static BOOL fLastIsPrime = FALSE;
     BOOL fHaveAnswer = FALSE;
     BOOL fIsPrime;
    
     // see if we can get it from the cache
     if (TryEnterCriticalSection(&g_cs)) {
      if (n == nLast) {
       fHaveAnswer = TRUE;
       fIsPrime = fLastIsPrime;
      }
      LeaveCriticalSection(&g_cs);
     }
     if (fHaveAnswer) return fIsPrime;
    
     // calculate it the slow way
     fIsPrime = slow_IsPrime(n);
    
     // update the cache if we can
     if (TryEnterCriticalSection(&g_cs)) {
      nLast = n;
      fLastIsPrime = fIsPrime;
      LeaveCriticalSection(&g_cs);
     }
     return fIsPrime;
    }
    

    This does have the disadvantage that multiple readers will lock each other out, so we can switch to a slim reader/writer lock provided we're running on Windows 7 or higher:

    BOOL IsPrime(int n)
    {
     static int nLast = 1;
     static BOOL fLastIsPrime = FALSE;
     BOOL fHaveAnswer = FALSE;
     BOOL fIsPrime;
    
     // see if we can get it from the cache
     if (TryAcquireSRWLockShared(&g_lock)) {
      if (n == nLast) {
       fHaveAnswer = TRUE;
       fIsPrime = fLastIsPrime;
      }
      ReleaseSRWLockShared(&g_lock);
     }
     if (fHaveAnswer) return fIsPrime;
    
     // calculate it the slow way
     fIsPrime = slow_IsPrime(n);
    
     // update the cache if we can
     if (TryAcquireSRWLockExclusive(&g_lock)) {
      nLast = n;
      fLastIsPrime = fIsPrime;
       ReleaseSRWLockExclusive(&g_lock);
     }
     return fIsPrime;
    }
    

    This still has the problem that readers can lock out a cache update. If the function is hot (and if it weren't, why would you switch to a lock-free algorithm?), and the usage pattern shifts (say, instead of checking whether 13 is prime over and over, it starts checking whether 17 is prime over and over), everybody will be so busy reading the cache to see if the cached value is 17 that nobody will get a chance to update the cache to actually be 17!

    Exercise: What constraints must be imposed on the protected variables for this technique to be successful?

  • The Old New Thing

    Patterns for using the InitOnce functions


    Since writing lock-free code is such a headache-inducer, you're probably best off making some other people suffer the headaches for you. And those other people are the kernel folks, who have developed quite a few lock-free building blocks so you don't have to. For example, there's a collection of functions for manipulating interlocked lists. But today we're going to look at the one-time initialization functions.

    The simplest version of the one-time initialization functions isn't actually lock-free, but it does implement the double-checked-lock pattern for you so you don't have to worry about the details. The usage pattern for the Init­Once­Execute­Once function is pretty simple. Here it is in its simplest form:

    int SomeGlobalInteger;
    
    BOOL CALLBACK ThisRunsAtMostOnce(
        PINIT_ONCE initOnce,
        PVOID Parameter,
        PVOID *Context)
    {
        calculate_an_integer(&SomeGlobalInteger);
        return TRUE;
    }
    
    void InitializeThatGlobalInteger()
    {
        static INIT_ONCE initOnce = INIT_ONCE_STATIC_INIT;
        InitOnceExecuteOnce(&initOnce,
                            ThisRunsAtMostOnce,
                            nullptr, nullptr);
    }
    

    In the simplest form, you give Init­Once­Execute­Once an INIT_ONCE structure (where it records its state), and a callback. If this is the first time that Init­Once­Execute­Once is called for a particular INIT_ONCE structure, it calls the callback. The callback can do whatever it likes, but presumably it's doing some one-time initialization. If another thread calls Init­Once­Execute­Once on the same INIT_ONCE structure, that other thread will wait until the first thread has finished its one-time execution.

    We can make this a tiny bit fancier by supposing that the calculation of the integer can fail.

    BOOL CALLBACK ThisSucceedsAtMostOnce(
        PINIT_ONCE initOnce,
        PVOID Parameter,
        PVOID *Context)
    {
        return SUCCEEDED(calculate_an_integer(&SomeGlobalInteger));
    }
    
    BOOL TryToInitializeThatGlobalInteger()
    {
        static INIT_ONCE initOnce = INIT_ONCE_STATIC_INIT;
        return InitOnceExecuteOnce(&initOnce,
                                   ThisSucceedsAtMostOnce,
                                   nullptr, nullptr);
    }
    

    If your initialization function returns FALSE, then the initialization is considered to have failed, and the next time somebody calls Init­Once­Execute­Once, it will try to initialize again.

    A slightly fancier use of the Init­Once­Execute­Once function takes advantage of the Context parameter. The kernel folks noticed that an INIT_ONCE structure in the "initialized" state has a lot of unused bits, and they've offered to let you use them. This is convenient if the thing you're initializing is a pointer to a C++ object, because it means that there's only one thing you need to worry about instead of two.

    BOOL CALLBACK AllocateAndInitializeTheThing(
        PINIT_ONCE initOnce,
        PVOID Parameter,
        PVOID *Context)
    {
        *Context = new(nothrow) Thing();
        return *Context != nullptr;
    }
    
    Thing *GetSingletonThing(int arg1, int arg2)
    {
        static INIT_ONCE initOnce = INIT_ONCE_STATIC_INIT;
        void *Result;
        if (InitOnceExecuteOnce(&initOnce,
                                AllocateAndInitializeTheThing,
                                nullptr, &Result))
        {
            return static_cast<Thing*>(Result);
        }
        return nullptr;
    }
    

    The final parameter to the Init­Once­Execute­Once function receives the magic almost-pointer-sized data that the function will remember for you. Your callback function passes this magic value back through the Context parameter, and then Init­Once­Execute­Once gives it back to you as the Result.

    As before, if two threads call Init­Once­Execute­Once simultaneously on an uninitialized INIT_ONCE structure, one of them will call the initialization function and the other will wait.

    Up until now, we've been looking at the synchronous initialization patterns. They aren't lock-free: If you call Init­Once­Execute­Once and initialization of the INIT_ONCE structure is already in progress, your call will wait until that initialization attempt completes (either successfully or unsuccessfully).

    More interesting is the asynchronous pattern. Here it is, as applied to our Singleton­Manager exercise:

     SingletonManager(const SINGLETONINFO *rgsi, UINT csi)
                   : m_rgsi(rgsi), m_csi(csi),
                     m_rgio(new INIT_ONCE[csi]) {
       for (UINT iio = 0; iio < csi; iio++) {
        InitOnceInitialize(&m_rgio[iio]);
       }
     }
     ...
     // Array that describes objects we've created
     // runs parallel to m_rgsi
     INIT_ONCE *m_rgio;
    };
    
    ITEMCONTROLLER *SingletonManager::Lookup(DWORD dwId)
    {
     ... same as before until we reach the "singleton constructor pattern"
    
     void *pv = NULL;
     BOOL fPending;
     if (!InitOnceBeginInitialize(&m_rgio[i], INIT_ONCE_ASYNC,
                                  &fPending, &pv)) return NULL;
    
     if (fPending) {
      ITEMCONTROLLER *pic = m_rgsi[i].pfnCreateController();
      DWORD dwResult = pic ? 0 : INIT_ONCE_INIT_FAILED;
      if (InitOnceComplete(&m_rgio[i],
                           INIT_ONCE_ASYNC | dwResult, pic)) {
       pv = pic;
      } else {
       // lost the race - discard ours and retrieve the winner
       delete pic;
       InitOnceBeginInitialize(&m_rgio[i], INIT_ONCE_CHECK_ONLY,
                                &fPending, &pv);
      }
     }
     return static_cast<ITEMCONTROLLER *>(pv);
    }
    

    The pattern for asynchronous initialization is as follows:

    • Call Init­Once­Begin­Initialize in async mode.
    • If it returns fPending == FALSE, then initialization has already been performed and you can go ahead and use the result passed back in the final parameter.
    • Otherwise, initialization is pending. Do your initialization, but remember that since this is a lock-free algorithm, there can be many threads trying to initialize simultaneously, so you have to be careful how you manipulate global state. This pattern works best if initialization takes the form of creating a new object (because that means multiple threads performing initialization are each creating independent objects).
    • Call Init­Once­Complete with the result of your initialization.
    • If Init­Once­Complete succeeds, then you won the initialization race, and you're done.
    • If Init­Once­Complete fails, then you lost the initialization race and should clean up your failed initialization. In that case, you should call Init­Once­Begin­Initialize one last time to get the answer from the winner.

    It's conceptually simple; it just takes a while to explain. But at least now it's in recipe form.

    Exercise: Instead of calling Init­Once­Complete with INIT_ONCE_INIT_FAILED, what happens if the function simply returns without ever completing the init-once?

    Exercise: What happens if two threads try to perform asynchronous initialization and the first one to complete fails?

    Exercise: Combine the results of the first two exercises and draw a conclusion.

  • The Old New Thing

    Lock-free algorithms: The one-time initialization


    A special case of the singleton constructor is simply lazy-initializing a bunch of variables. In a single-threaded application you can do something like this:

    // suppose that any valid values for a and b stipulate that
    // a ≥ 0 and b ≥ a. Therefore, -1 is never a valid value,
    // and we use it to mean "not yet initialized".
    int a = -1, b = -1;
    
    void LazyInitialize()
    {
     if (a != -1) return; // initialized already
    
     a = calculate_nominal_a();
     b = calculate_nominal_b();
    
     // Adjust the values to conform to our constraint.
     a = max(0, a);
     b = max(a, b);
    }
    

    This works fine in a single-threaded program, but if the program is multi-threaded, then two threads might end up trying to lazy-initialize the variables, and there are race conditions which can result in one thread using values before they have been initialized:

     Thread 1: if (a != -1) [not taken]
     Thread 1: a = calculate_nominal_a(); // returns 2
     Thread 2: if (a != -1) return; // premature return!

    Observe that if the first thread is pre-empted after the value for a is initially set, the second thread will think that everything is initialized and may end up using an uninitialized b.

    "Aha," you say, "that's easy to fix. Instead of a, I'll just use b to tell if initialization is complete."

    void LazyInitialize()
    {
     if (b != -1) return; // initialized already (test b, not a)
    
     a = calculate_nominal_a();
     b = calculate_nominal_b();
    
     // Adjust the values to conform to our constraint.
     a = max(0, a);
     b = max(a, b);
    }
    

    This still suffers from a race condition:

     Thread 1: if (b != -1) [not taken]
     Thread 1: a = calculate_nominal_a(); // returns 2
     Thread 1: b = calculate_nominal_b(); // returns 1
     Thread 2: if (b != -1) return; // premature return!

    "I can fix that too. I'll just compute the values of a and b in local variables, and update the globals only after the final values have been computed. That way, the second thread won't see partially-calculated values."

    void LazyInitialize()
    {
     if (b != -1) return; // initialized already
    
     // perform all calculations in temporary variables first
     int temp_a = calculate_nominal_a();
     int temp_b = calculate_nominal_b();
    
     // Adjust the values to conform to our constraint.
     temp_a = max(0, temp_a);
     temp_b = max(temp_a, temp_b);
    
     // make the temporary values permanent
     a = temp_a;
     b = temp_b;
    }
    

    Nearly there, but there is still a race condition:

     Thread 1: if (b != -1) [not taken]
     Thread 1: temp_a = calculate_nominal_a(); // returns 2
     Thread 1: temp_b = calculate_nominal_b(); // returns 1
     Thread 1: temp_a = max(0, temp_a); // temp_a = 2
     Thread 1: temp_b = max(temp_a, temp_b); // temp_b = 2
     Thread 1: a = temp_a; // store issued to memory
     Thread 1: b = temp_b; // store issued to memory
     Thread 1: (store of b completes to memory)
     Thread 2: if (b != -1) return; // premature return!
     Thread 1: (store of a completes to memory)

    There is no guarantee that the assignment b = 2 will become visible to other processors after the assignment a = 2. Even though the store operations are issued in that order, the memory management circuitry might complete the memory operations in the opposite order, resulting in Thread 2 seeing a = -1 and b = 2.

    To prevent this from happening, the store to b must be performed with Release semantics, indicating that all prior memory stores must complete before the store to b can be made visible to other processors.

    void LazyInitialize()
    {
     if (b != -1) return; // initialized already
    
     // perform all calculations in temporary variables first
     int temp_a = calculate_nominal_a();
     int temp_b = calculate_nominal_b();
    
     // Adjust the values to conform to our constraint.
     temp_a = max(0, temp_a);
     temp_b = max(temp_a, temp_b);
    
     // make the temporary values permanent
     a = temp_a;
     // since we use "b" as our indication that
     // initialization is complete, we must store it last,
     // and we must use release semantics.
     InterlockedCompareExchangeRelease(
        reinterpret_cast<LONG*>(&b), temp_b, -1);
    }
    

    If you look at the final result, you see that this is pretty much a re-derivation of the singleton initialization pattern: Do a bunch of calculations off to the side, then publish the result with a single Interlocked­Compare­Exchange­Release operation.

    The general pattern is therefore

    void LazyInitializePattern()
    {
     if (global_signal_variable != sentinel_value) return;
    
     ... calculate values into local variables ...
    
     globalvariable1 = temp_variable1;
     globalvariable2 = temp_variable2;
     ...
     globalvariableN = temp_variableN;
    
     // publish the signal variable last, and with release
     // semantics to ensure earlier values are visible as well
     InterlockedCompareExchangeRelease(
        reinterpret_cast<LONG*>(&global_signal_variable),
        temp_signal_variable, sentinel_value);
    }
    

    If this all is too much for you (and given some of the subtlety of double-check-locking that I messed up the first time through, it's clearly too much for me), you can let the Windows kernel team do the thinking and use the one-time initialization functions, which encapsulate all of this logic. (My pal Doron called out the one-time initialization functions a while back.) Version 4 of the .NET Framework has corresponding functionality in the Lazy<T> class.

    Exercise: What hidden assumptions are being made about the functions calculate_nominal_a and calculate_nominal_b?

    Exercise: What are the consequences if we use Interlocked­Exchange instead of Interlocked­Compare­Exchange­Release?

    Exercise: In the final version of Lazy­Initialize, are the variables temp_a and temp_b really necessary, or are they just leftovers from previous attempts at fixing the race condition?

    Exercise: What changes (if any) are necessary to the above pattern if the global variables are pointers? Floating point variables?

    Update: See discussion below between Niall and Anon regarding the need for acquire semantics on the initial read.
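
    For reference, here is a sketch of what giving that initial read acquire semantics might look like, reusing the InterlockedReadAcquire helper from the try/commit/(try again) article above. This is my own illustration of the point under discussion, not code from the article:

     void LazyInitialize()
     {
      // read b with acquire semantics so that, once initialization is seen
      // as complete, subsequent reads of a and b are not satisfied from
      // stale values
      if (InterlockedReadAcquire(reinterpret_cast<LONG*>(&b), -1) != -1)
       return; // initialized already
     
      ... rest of the function unchanged ...
     }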

  • The Old New Thing

    Lock-free algorithms: Choosing a unique value (warm-up)


    Here's a snippet of code whose job is to generate a unique number within the process. Here's some reference reading to get yourself in the mood. Caution: It may or may not be useful.

    dwUniqueId = InterlockedCompareExchange(&g_dwUniqueId, 
                                            g_dwUniqueId+1, 
                                            g_dwUniqueId);
    

    Criticize this code fragment.

  • The Old New Thing

    Lock-free algorithms: The try/commit/(hand off) model


    The last lock-free pattern for this week isn't actually lock-free, but it does run without blocking.

    The pattern for what I'll call try/commit/(hand off) is more complicated than the other patterns, so I'll start off by describing it in words rather than in code, because the code tends to make things more complicated.

    First, you take the state variable and chop it up into pieces. You need some bits to be used as a lock and as a "work has been handed off" flag. And if the work that has been handed off is complicated, you may need some more bits to remember the details of the handoff. A common way of doing this is to use a pointer-sized state variable, require that the objects being pointed to are suitably aligned, and reuse the bottom bits as flags. For example, if you require that the objects be DWORD-aligned, then the two bottom bits will always be zero and you can reuse them as flags.

    To perform an operation, you first try to lock the state variable. If you can't because the state variable is already locked, then you record the details of the operation in the state variable and update it atomically.

    If you succeed in locking the state variable, then you perform the desired operation, but before you unlock the state variable, you look to see if any work has been handed off. (This hand-off work is the result of attempts to perform the operation while you held the lock.) If there is hand-off work, then you perform that work as well. Of course, while you're doing that, more hand-off work may arrive. You can't unlock the state variable until you've drained off all the pent-up hand-off work.

    The code for this pattern tends to be a tangle of loops since there is a lot of backing off and retrying. Every atomic operation is its own loop, draining the hand-off work is another loop, and any time an Interlocked­Compare­Exchange fails, you have to undo the work you did and retry—another loop.

    I trust only about five people in the world to write code that is this advanced, and I'm not one of them. But just to illustrate the principle (although I will certainly get the details wrong), here's an implementation of a synchronization-like object which I will call a Group­Wait for lack of any other name. It has the following operations:

    • Add­Wait: Register an event handle with the group wait.
    • Signal­All: Signals all events that are registered with the group wait. Once an event is signalled, it is automatically unregistered from the group wait. If you want the event to be signalled at the next call to Signal­All you have to re-add it.

    The group wait object is just a linked list of NODEs containing the handles being waited on.

    Actually, this type of object doesn't need to use the try/commit/hand off model. It can be implemented in a much more straightforward manner by having Add­Wait atomically prepend the node to a list and having Signal­All atomically steal the list. There are even prewritten functions to perform these atomic linked list operations for you. But I'm going to implement it the complicated way for demonstration purposes. In real life, the code would be much simpler (a sketch of the simpler version appears below).
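
    For comparison, here is a sketch of that much more straightforward version built on the interlocked singly-linked list (SList) functions. The node layout and the use of _aligned_malloc are my own assumptions; the point is simply that AddWait atomically prepends and SignalAll atomically steals the whole list.

     #include <windows.h>
     #include <malloc.h>
     
     // SList rules: the SLIST_ENTRY must be MEMORY_ALLOCATION_ALIGNMENT-aligned.
     struct WAITNODE {
      SLIST_ENTRY Entry;
      HANDLE hEvent;
     };
     
     class SimpleGroupWait {
     public:
      SimpleGroupWait() { InitializeSListHead(&m_head); }
     
      BOOL AddWait(HANDLE hEvent)
      {
       WAITNODE *pn = static_cast<WAITNODE*>(
           _aligned_malloc(sizeof(WAITNODE), MEMORY_ALLOCATION_ALIGNMENT));
       if (pn == NULL) return FALSE;
       pn->hEvent = hEvent;
       InterlockedPushEntrySList(&m_head, &pn->Entry); // atomic prepend
       return TRUE;
      }
     
      void SignalAll()
      {
       // atomically steal the entire list, then signal at leisure
       SLIST_ENTRY *pEntry = InterlockedFlushSList(&m_head);
       while (pEntry != NULL) {
        WAITNODE *pn = CONTAINING_RECORD(pEntry, WAITNODE, Entry);
        pEntry = pEntry->Next;
        SetEvent(pn->hEvent);
        _aligned_free(pn);
       }
      }
     
     private:
      SLIST_HEADER m_head;
     };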

    Since the bottom two bits of the pointer must be zero due to alignment, we repurpose them as a lock bit and a signal bit. The lock bit is set when the list is locked, and the signal bit is set when a signal was requested but had to be handed off because the list was locked.

    // WARNING! IF YOU USE THIS CODE YOU ARE AN IDIOT - READ THE TEXT ABOVE
    
    struct NODE;
    NODE *Node(LONG_PTR key) { return reinterpret_cast<NODE*>(key); }
    
    enum {
     Locked = 1,
     Signalled = 2,
    };
    
    struct NODE {
     NODE *pnNext;
     HANDLE hEvent;
    
     LONG_PTR Key() { return reinterpret_cast<LONG_PTR>(this); }
     NODE *Ptr() { return Node(Key() & ~(Locked | Signalled)); }
    };
    
    #define NODE_INVALID Node(-1)
    
    class GroupWait {
    public:
     GroupWait() : m_pnRoot(NULL) { }
     ~GroupWait();
     BOOL AddWait(HANDLE hEvent);
     void SignalAll();
    private:
     NODE *m_pnRoot;
    };
    

    Since I will be viewing the NODE* as both a pointer and as a bunch of bits (which I call a key), I created some helper methods to save typing. Node and Key convert back and forth between node pointers and keys, and Ptr strips off the tag bits and returns a usable pointer.

    For notational purposes, a NODE* will be written as the combination p|S|L where p is a pointer to the next node, S is the signalled bit, and L is the lock bit. The signalled bit is set to indicate that we need to signal all the nodes in the list starting with the next node. (Think of the S bit as being attached to the outgoing arrow.) For example, this linked list:

       m_pnRoot
      +--------+-+-+
      |   *    |0|1|
      +---|----+-+-+
          |
          v
      +--------+-+-+---------+
    A |   *    |1|?| hEvent1 |
      +---|----+-+-+---------+
          |
          v
      +--------+-+-+---------+
    B |   *    |?|?| hEvent2 |
      +---|----+-+-+---------+
          |
          v
      +--------+-+-+---------+
    C |  NULL  |?|?| hEvent3 |
      +--------+-+-+---------+
    

    represents a group wait with three registered event handles. The S bit is clear on the root pointer, which means that nobody has yet requested that hEvent1 be signalled. On the other hand, the S bit is set on node A, which means that all the events after node A need to be signaled, specifically, hEvent2 and hEvent3. Note that this means that it doesn't matter whether the S bit is set on nodes B or C; those events are getting set regardless because the S bit on node A already requested it. (In particular, the S bit on the last node is meaningless since there are no nodes which come after it.)

    The L bit is meaningless on all pointers other than m_pnRoot.

    Okay, let's start by adding a handle to the wait list:

    BOOL GroupWait::AddWait(HANDLE hEvent)
    {
     NODE *pnInsert = new(nothrow) NODE;
     if (pnInsert == NULL) return FALSE;
     pnInsert->hEvent = hEvent;
    
     NODE *pn;
     NODE *pnNew;
     do {
      pn = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
      pnInsert->pnNext = pn;
      pnNew = Node(pnInsert->Key() | (pn->Key() & Locked));
     } while (InterlockedCompareExchangeRelease(&m_pnRoot, pnNew, pn) != pn);
     return TRUE;
    }
    

    To add a handle to the wait list, we just prepend it to the linked list, being careful to propagate the L bit into the new pointer so we don't accidentally release a lock that somebody else took. We add the node with the S bit clear on the inbound pointer since nobody has yet asked for this handle to be signalled. After setting up the node, we attempt to insert it into the head of the list, and if we can't (because somebody else beat us to it), then we restart and try again. This is a standard try/commit/try again pattern.

    Exercise: Is there an ABA race condition here?

    The Add­Wait method illustrates one extreme case of the try/commit/hand off model, where there is really nothing to hand off; we did it all ourselves. Of course, this does make other parts of the code trickier since they have to go back and deal with nodes that were added while the list was locked.

    The nasty part of the code is in Signal­All. I'll present it in pieces.

    void GroupWait::SignalAll()
    {
     NODE *pnCapture;
     NODE *pnNew;
     do {
      pnCapture = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
    
      if (pnCapture->Key() & Locked) {
       pnNew = Node(pnCapture->Key() | Signalled);
      } else {
       pnNew = Node(Locked);
      }
     } while (InterlockedCompareExchangeAcquire(&m_pnRoot,
                                  pnNew, pnCapture) != pnCapture);
    
     if (pnCapture->Key() & Locked) return;
    
     ...
    

    If the list is locked, then all we do is try to set the S bit on the root. If the list is not locked, then we try to lock it and simultaneously detach all the nodes by replacing the root pointer with NULL|0|1. Either way, we perform the operation with the try/commit/try again pattern until we finally get through.

    If the list was locked, then all we had to do was set the S bit on the root pointer. Setting the S bit on the root pointer means that all the nodes reachable from this pointer (i.e., all nodes after the root, which is all nodes) should be signalled, which is exactly what we want. Since the list is locked, we leave the actual signalling to the code that unlocks the list. (This is the hand off part of try/commit/hand off.)

    Exercise: What if the S bit is already set? Did we lose a signal?

    Otherwise, we are the ones to lock the list. We also detach the node list, for if another thread calls Signal­All, we don't want that signal to affect the nodes that we're signalling. (Otherwise we might end up double-signalling the event.)

     ...
     NODE *pnNext;
     NODE *pn;
     for (pn = pnCapture->Ptr(); pn; pn = pnNext) {
      SetEvent(pn->hEvent);
      pnNext = pn->pnNext->Ptr();
      delete pn;
     }
     ...
    

    That little fragment above is basically what you would do in a naïve implementation that didn't worry about multithreading: It walks the list of nodes, signals each event, and then frees the node. The only trick is sending each node pointer through ->Ptr() to strip off the tag bits.

    Next comes the unlock code. First, a preparatory step:

     ...
     pnCapture = pnNew;
     ...
    

    We exchanged pnNew into m_pnRoot up above, and if that's still the value of m_pnRoot, then it means that nobody tried to perform any operations while the list was locked, and we got off easy.

     ...
     for (;;) {
      pnNew = Node(pnCapture->Key() & ~Locked);
      if (InterlockedCompareExchangeRelease(&m_pnRoot,
                          pnNew, pnCapture) == pnCapture) {
       return;
      }
     ...
    

    We start a new loop whose job is to drain off all the handed-off work items that built up while the list was locked. First, we see whether anything has changed since the last time we looked; if not, then we unlock and we're done. Otherwise, we proceed to pick up all the handed-off work:

     ...
      pnCapture = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
    
      NODE *pnNew = Node(pnCapture->Key() & ~(Locked | Signalled));
      NODE **ppn = &pnNew;
      NODE *pn;
      NODE *pnNext;
    
      BOOL fSignalSeen = FALSE;
      for (pn = pnNew; pn->Ptr(); pn = pnNext) {
       pnNext = pn->Ptr()->pnNext;
       if (fSignalSeen) {
        SetEvent(pn->Ptr()->hEvent);
        delete pn->Ptr();
       } else if (pn->Key() & Signaled) {
        fSignalSeen = TRUE;
        (*ppn) = Node(Locked); // detach but retain lock
        SetEvent(pn->Ptr()->hEvent);
        delete pn->Ptr();
       } else {
        ppn = &pn->Ptr()->pnNext;
       }
      }
     } // retry unlock
    } // end of function
    

    To drain the handed-off work, we walk the list of nodes, keeping track of whether we've seen an S bit. If so, then we signal the event and free the node. And the first time we see an S bit, we null out the inbound pointer to detach the list from the chain so we do not double-signal the event in the future.

    Once that's done, we go back and try to unlock again. Eventually, there will be no more hand-off work, and we can finally return.

    And that's it, a demonstration of the try/commit/hand off model. The basic idea is simple, but getting all the details right is what makes your head hurt.

    I leave this sort of thing to the kernel folks, who have the time and patience and brainpower to work it all through. An example of this pattern can be found in this talk that describes the dismantling of the dispatcher spinlock.

  • The Old New Thing

    The funniest joke I've ever told (to a three-year-old)


    I've tested this joke on several children ages three and four, and it never fails.

    There were two kittens walking down the street, and one of them fell on its butt!

    I developed this joke for one of my honorary nieces. She had just started to learn about joke-telling and asked me to tell her a joke. One of the keys to joke-telling is to know your audience: A three-year-old won't have the attention span for a longer joke, and subtlety and observational humor won't work either. It's got to be short and direct. She thought kittens were cute, and falling on one's butt was funny, so I figured, hey, let me combine the two.

    When I told this joke, she fell on the floor in uncontrollable laughter. I had a winner.

  • The Old New Thing

    The performance improvement of a lock-free algorithm is often not in the locking


    GWO wonders what the conditions are under which the lock-free version significantly outperforms a simple critical section.

    Remember that switching to a lock-free algorithm should be guided by performance measurements. Switching from a simple algorithm to a complex one shouldn't be done unless you know that the simple algorithm is having trouble.

    That said, here are some non-obvious advantages of a lock-free algorithm over one that uses a simple lock. (Later, we'll see how you can take advantage of these techniques without actually going lock-free.)

    Consider a program that uses a simple critical section to perform something like the singleton constructor. Instead of a fancy lock-free algorithm, we use the much simpler version:

    CRITICAL_SECTION g_csSingletonX;
    X *g_px = NULL;
    
    X *GetSingletonX()
    {
        EnterCriticalSection(&g_csSingletonX);
        if (g_px == NULL)
        {
            g_px = new(nothrow) X();
        }
        LeaveCriticalSection(&g_csSingletonX);
        return g_px;
    }
    

    This simple code can run into trouble if the constructor function itself requires some locks, because now you have to impose a lock hierarchy in order to avoid a deadlock. (And this becomes impossible if the constructor function belongs to code outside your control.)

    When working out what your lock hierarchy should be, you may discover that you need to consolidate some locks. This avoids the inversion problem, but it also reduces your lock granularity. You might decide to use a single lock to cover all singletons, and then you later discover that you also have to extend the lock that protects X's constructor to cover other operations on X.

    CRITICAL_SECTION g_csCommon;
    
    // (updated to remove double-check lock because that just raises
    // more questions that distract from the point of the article)
    X *GetSingletonX()
    {
        EnterCriticalSection(&g_csCommon);
        if (g_px == NULL)
        {
            g_px = new(nothrow) X();
        }
        LeaveCriticalSection(&g_csCommon);
        return g_px;
    }
    
    Y *GetSingletonY()
    {
        EnterCriticalSection(&g_csCommon);
        if (g_py == NULL)
        {
            g_py = new(nothrow) Y();
        }
        LeaveCriticalSection(&g_csCommon);
        return g_py;
    }
    
    void X::DoSomething()
    {
        EnterCriticalSection(&g_csCommon);
        .. something ..
        LeaveCriticalSection(&g_csCommon);
    }
    

    Over time, your quiet little singleton lock has turned into a high-contention lock in your system.

    One nice thing about a lock-free algorithm is that since there is no lock, it can't create inversion in a lock hierarchy. (Of course, you have to be careful not to use the interlocked operations to build a private lock, because that puts you back where you started.)

    Another nice consequence of a lock-free algorithm is that, since there is no lock, you don't have to handle the WAIT_ABANDONED case. The data structure is never inconsistent; it passes atomically from one consistent state to another. Therefore, there's no need to write code to clean up leftover inconsistency. This came in handy in a case we looked at earlier, so that an application which crashes at an inopportune time will not corrupt the shared data and require a server reboot.
