April, 2011

  • The Old New Thing

    Lock-free algorithms: The singleton constructor (answer to exercises)

    • 4 Comments

    A few days ago, I asked you to make an existing class multithread-safe. The class caches objects called SINGLETON­INFO which are indexed by a 32-bit ID. The cache is implemented as an array that dynamically resizes as more items are added to it. A naïve multithreaded version might use a slim reader-writer lock with shared access on reads, exclusive access on writes, and mixed access on the treacherous "create if it doesn't already exist" path.

    Let's see. First of all, the function doesn't allocate the memory for the cache until somebody actually tries to look something up. But duh, that's the whole point of the class: to look things up! The only time this lazy-initialization actually provides a benefit is if somebody creates a Singleton­Manager, calls no methods on it, and then destroys it.

    This doesn't happen in practice, and even if it did, it's certainly not a scenario we're going to optimize for. Get rid of the lazy-initialization of the cache; it makes multithreading unnecessarily complicated.

    Second, since the only way an ITEM­CONTROLLER can get into the cache is via the SINGLETON­INFO, if a Singleton­Manager is told, "Here are 30 item IDs and their corresponding controller creation functions," then the cache can never hold more than 30 items. If you only know how to create 30 items, and you never create more than one copy of each item, then you're never going to create more than 30 items.

    Therefore, instead of managing a dynamically-growing array, we can allocate, at construction, a fixed-size array whose length equals the number of SINGLETON­INFO elements. This eliminates the need to lock around array reallocation, because the array never reallocates. And since the array length is in the range 20–50, we don't have the problem of allocating megabytes of memory to track just a few objects. In the worst case, we allocate a 50-element cache.

    Next, we can store each ITEM­CONTROLLER in the same position in the cache array as it exists in the SINGLETON­INFO array.

    With these simplifications, we see that we don't need to do any locking or complicated duplicate-detection. After locating the ID in the SINGLETON­INFO array, look at the corresponding entry in the cache array and perform a singleton initialization there.

    struct ITEMCONTROLLER;
    struct SINGLETONINFO {
     DWORD dwId;
     ITEMCONTROLLER *(*pfnCreateController)();
    };
    
    class SingletonManager {
    public:
     // rgsi is an array that describes how to create the objects.
     // It's a static array, with csi in the range 20 to 50.
     SingletonManager(const SINGLETONINFO *rgsi, UINT csi)
                   : m_rgsi(rgsi), m_csi(csi),
                     m_rgpic(new ITEMCONTROLLER*[csi]()) { } // () value-initializes: entries start NULL
     ~SingletonManager() { ... }
     ITEMCONTROLLER *Lookup(DWORD dwId);
    
    private:
     const SINGLETONINFO *m_rgsi;
     int m_csi;
    
     // Array of controllers we've created;
     // runs parallel to m_rgsi
     ITEMCONTROLLER **m_rgpic;
    };
    
    ITEMCONTROLLER *SingletonManager::Lookup(DWORD dwId)
    {
     int i;
    
     // Convert ID to index
     for (i = 0; i < m_csi; i++) {
      if (m_rgsi[i].dwId == dwId)
       break;
     }
     if (i >= m_csi) return NULL; // not something we know about
    
     // Singleton constructor pattern
     if (!m_rgpic[i]) {
      ITEMCONTROLLER *pic = m_rgsi[i].pfnCreateController();
      if (!pic) return NULL;
      if (InterlockedCompareExchangePointerRelease(
              &reinterpret_cast<PVOID&>(m_rgpic[i]),
              pic, NULL) != 0) {
       delete pic; // lost the race - destroy the redundant copy
      }
     }
     MemoryBarrier();
     return m_rgpic[i];
    }
    

    Comments on proposed solutions: Gabe pointed out that the reallocation was a sticking point which made a lock-free implementation difficult if not impossible. Credit to him for recognizing the problem.

    Thorsten proposed using a linked list instead of an array to avoid the reallocation problem.

    Ray Trent reminded us of the C++ function-local static technique, which works if it's what you need, but it has its own problems, such as lack of thread-safety (up until perhaps two weeks ago), and the fact that it doesn't generalize to a solution to the exercise. The not-thread-safe-ness of C++ static initialization was called out as a feature in early versions of the C++ language specification (to permit recursive initialization). This was revised in the ISO version of C++, which declared that if control enters a function which is in the middle of initializing its statics, the behavior is undefined. I don't know what C++0x has to say about the subject, but seeing as the standard was approved only two weeks ago and hasn't even been formally published yet, it seems premature to expect all compilers to conform to any new multi-threading semantics.
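
    For reference, here's a minimal sketch of what the function-local static technique looks like, using the Widget class from later in this series (purely illustrative):

    struct Widget { /* ... */ };

    Widget *GetProcessWideWidget()
    {
     // The runtime constructs s_widget the first time control passes
     // through this declaration. Whether concurrent first calls are
     // serialized depends on your compiler: C++0x requires it, but
     // older compilers generally do not.
     static Widget s_widget;
     return &s_widget;
    }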

    Note that the function-local static technique works only if you want a process-wide singleton. If you need a singleton with a tighter scope (say, "one object per thread" or "one object per transaction"), then the function-local static technique will not work. Which after all was the point of the SingletonManager class: To manage singletons relative to its own scope, not globally. If you had wanted global singletons, then you wouldn't need a singleton manager; you would just have each object manage its own singleton.

    To elaborate: Suppose you have an object with a bunch of components. Most clients don't use all the components, so you want to lazy-create those components. Say, each Transaction can have an error log file, but you don't want to create the error log file until an error occurs. On the other hand, you want all the errors for a single transaction to go into the same log file.

    class LogFile : public ITEMCONTROLLER
    {
    public:
      static ITEMCONTROLLER *Create() { return new LogFile(); }
    };
    
    const SINGLETONINFO c_rgsiTransactions[] = {
      { LOGFILE_ID, LogFile::Create },
    };
    
    class Transaction
    {
    public:
      Transaction()
        : m_singletons(c_rgsiTransactions,
                       ARRAYSIZE(c_rgsiTransactions))
      { }
    
      void LogError(blah blah)
      {
        LogFile *plog = static_cast<LogFile*>
                            (m_singletons.Lookup(LOGFILE_ID));
        if (plog) plog->Log(blah blah);
      }
    private:
      SingletonManager m_singletons;
    };
    

    The singleton manager makes sure that each transaction has at most one log file. But we can't use the function-local static technique in LogFile::Create, because we want multiple log files in general, just a singleton log file per transaction. If we had used the function-local static technique in LogFile::Create, then all errors would have been logged into a giant log file instead of a separate log file per transaction.

    Scott tried to update the singleton atomically but overlooked the thread safety of the reallocation, and his solution had holes of its own. Alex Grigoriev's solution is the classic back-off-and-retry algorithm, modulo forgetting to protect against reallocation.

    nksingh was the first to observe that the reallocation could be removed, and effectively came up with the solution presented here. (But missed the further optimization that the dwId member was redundant.) He also recommended using the InitOnce functions, which is something I too recommend. We'll look at the InitOnce functions in a separate article since this one is getting kind of long.

  • The Old New Thing

    Lock-free algorithms: The one-time initialization

    • 21 Comments

    A special case of the singleton constructor is simply lazy-initializing a bunch of variables. In a single-threaded application you can do something like this:

    // suppose that all valid values for a and b satisfy
    // a ≥ 0 and b ≥ a. Therefore, -1 is never a valid value,
    // and we use it to mean "not yet initialized".
    int a = -1, b = -1;
    
    void LazyInitialize()
    {
     if (a != -1) return; // initialized already
    
     a = calculate_nominal_a();
     b = calculate_nominal_b();
    
     // Adjust the values to conform to our constraint.
     a = max(0, a);
     b = max(a, b);
    }
    

    This works fine in a single-threaded program, but if the program is multi-threaded, then two threads might end up trying to lazy-initialize the variables, and there are race conditions which can result in one thread using values before they have been initialized:

    Thread 1                                      Thread 2
    if (a != -1) [not taken]
    a = calculate_nominal_a(); // returns 2
                                                  if (a != -1) return; // premature return!

    Observe that if the first thread is pre-empted after the value for a is initially set, the second thread will think that everything is initialized and may end up using an uninitialized b.

    "Aha," you say, "that's easy to fix. Instead of a, I'll just use b to tell if initialization is complete."

    void LazyInitialize()
    {
     if (b != -1) return; // initialized already (test b, not a)
    
     a = calculate_nominal_a();
     b = calculate_nominal_b();
    
     // Adjust the values to conform to our constraint.
     a = max(0, a);
     b = max(a, b);
    }
    

    This still suffers from a race condition:

    Thread 1                                      Thread 2
    if (b != -1) [not taken]
    a = calculate_nominal_a(); // returns 2
    b = calculate_nominal_b(); // returns 1
                                                  if (b != -1) return; // premature return!

    "I can fix that too. I'll just compute the values of a and b in local variables, and update the globals only after the final values have been computed. That way, the second thread won't see partially-calculated values."

    void LazyInitialize()
    {
     if (b != -1) return; // initialized already
    
     // perform all calculations in temporary variables first
     int temp_a = calculate_nominal_a();
     int temp_b = calculate_nominal_b();
    
     // Adjust the values to conform to our constraint.
     temp_a = max(0, temp_a);
     temp_b = max(temp_a, temp_b);
    
     // make the temporary values permanent
     a = temp_a;
     b = temp_b;
    }
    

    Nearly there, but there is still a race condition:

    Thread 1                                      Thread 2
    if (b != -1) [not taken]
    temp_a = calculate_nominal_a(); // returns 2
    temp_b = calculate_nominal_b(); // returns 1
    temp_a = max(0, temp_a); // temp_a = 2
    temp_b = max(temp_a, temp_b); // temp_b = 2
    a = temp_a; // store issued to memory
    b = temp_b; // store issued to memory
    store of b completes to memory
                                                  if (b != -1) return; // premature return!
    store of a completes to memory

    There is no guarantee that the assignment b = 2 will become visible to other processors after the assignment a = 2. Even though the store operations are issued in that order, the memory management circuitry might complete the memory operations in the opposite order, resulting in Thread 2 seeing a = -1 and b = 2.

    To prevent this from happening, the store to b must be performed with Release semantics, indicating that all prior memory stores must complete before the store to b can be made visible to other processors.

    void LazyInitialize()
    {
     if (b != -1) return; // initialized already
    
     // perform all calculations in temporary variables first
     int temp_a = calculate_nominal_a();
     int temp_b = calculate_nominal_b();
    
     // Adjust the values to conform to our constraint.
     temp_a = max(0, temp_a);
     temp_b = max(temp_a, temp_b);
    
     // make the temporary values permanent
     a = temp_a;
     // since we use "b" as our indication that
     // initialization is complete, we must store it last,
     // and we must use release semantics.
     InterlockedCompareExchangeRelease(
        reinterpret_cast<LONG*>(&b), temp_b, -1);
    }
    

    If you look at the final result, you see that this is pretty much a re-derivation of the singleton initialization pattern: Do a bunch of calculations off to the side, then publish the result with a single Interlocked­Compare­Exchange­Release operation.

    The general pattern is therefore

    void LazyInitializePattern()
    {
     if (global_signal_variable != sentinel_value) return;
    
     ... calculate values into local variables ...
    
     globalvariable1 = temp_variable1;
     globalvariable2 = temp_variable2;
     ...
     globalvariableN = temp_variableN;
    
     // publish the signal variable last, and with release
     // semantics to ensure earlier values are visible as well
     InterlockedCompareExchangeRelease(
        reinterpret_cast<LONG*>(&global_signal_variable),
        temp_signal_variable, sentinel_value);
    }
    

    If this all is too much for you (and given some of the subtlety of double-check-locking that I messed up the first time through, it's clearly too much for me), you can let the Windows kernel team do the thinking and use the one-time initialization functions, which encapsulate all of this logic. (My pal Doron called out the one-time initialization functions a while back.) Version 4 of the .NET Framework has corresponding functionality in the Lazy<T> class.
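
    As a taste of what that looks like, here's a minimal sketch (my adaptation, not code from either article) of the same a/b initialization expressed with the one-time initialization functions. InitOnceExecuteOnce makes later callers wait until the first one finishes and handles the memory barriers for you:

    #include <windows.h>

    int calculate_nominal_a(); // as defined earlier
    int calculate_nominal_b();

    INIT_ONCE g_InitOnce = INIT_ONCE_STATIC_INIT;
    int a, b;

    BOOL CALLBACK InitializeAB(PINIT_ONCE, PVOID, PVOID *)
    {
     // This body runs at most once per process; InitOnceExecuteOnce
     // blocks concurrent callers until it returns.
     a = max(0, calculate_nominal_a());
     b = max(a, calculate_nominal_b());
     return TRUE; // report success
    }

    void LazyInitialize()
    {
     InitOnceExecuteOnce(&g_InitOnce, InitializeAB, NULL, NULL);
    }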

    Exercise: What hidden assumptions are being made about the functions calculate_nominal_a and calculate_nominal_b?

    Exercise: What are the consequences if we use Interlocked­Exchange instead of Interlocked­Compare­Exchange­Release?

    Exercise: In the final version of Lazy­Initialize, are the variables temp_a and temp_b really necessary, or are they just leftovers from previous attempts at fixing the race condition?

    Exercise: What changes (if any) are necessary to the above pattern if the global variables are pointers? Floating point variables?

    Update: See discussion below between Niall and Anon regarding the need for acquire semantics on the initial read.

  • The Old New Thing

    Lock-free algorithms: Choosing a unique value (solutions)

    • 7 Comments

    Last time, I left a warm-up exercise consisting of a code fragment which tries to compute a unique process-wide value. Here it is again:

    dwUniqueId = InterlockedCompareExchange(&g_dwUniqueId, 
                                            g_dwUniqueId+1, 
                                            g_dwUniqueId);
    

    It may be easier to enumerate what the function does right rather than what it does wrong.

    Um, the words are correctly spelled.

    That's about it.

    Damien was the first to note that the author basically reimplemented Interlocked­Increment. Poorly.

    As we saw earlier, the algorithm for performing complex calculations with interlocked functions is (capture, compute, compare-exchange, retry). But the above code didn't do any of these things.

    By failing to capture the values, the code is vulnerable to another thread modifying the g_dwUniqueId value simultaneously. This means that the computation step can fail, because the inconsistent reads of g_dwUniqueId result in who-knows-what getting passed to the Interlocked­Compare­Exchange function.

    Okay, they managed to spell Interlocked­Compare­Exchange correctly.

    And then they forgot to retry the operation if the compare-exchange failed, which means that they will just proceed with whatever value the g_dwUniqueId variable held at the time of the Interlocked­Compare­Exchange call. If it just got incremented by another thread, then this thread and the other thread will be using the same "unique" value.
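
    For contrast, here's a sketch of what the (capture, compute, compare-exchange, retry) pattern would look like if you insisted on writing it out by hand for this problem. It's purely illustrative; the right fix is simply to call Interlocked­Increment:

    #include <windows.h>

    volatile LONG g_dwUniqueId; // volatile: see the next paragraph

    LONG GetUniqueId()
    {
     LONG lCaptured, lPrevious;
     do {
      lCaptured = g_dwUniqueId;                     // capture
      lPrevious = InterlockedCompareExchange(
          &g_dwUniqueId, lCaptured + 1, lCaptured); // compute + compare-exchange
     } while (lPrevious != lCaptured);              // retry if another thread raced us
     return lCaptured + 1;                          // our now-unique value
    }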

    Joshua points out that compiler optimization can prevent the capture from being a true capture. Though I would put the volatile keyword on g_dwUniqueId rather than scv, because the volatile object is the global variable, not the local. Marking the local as volatile forces all accesses to the local to be executed as written, but the compiler can still optimize the access to g_dwUniqueId. (It might, for example, propagate the value in from a previous read earlier in the function.)

    And do take into consideration Leo Davidson's warning: This series of articles is a "peek behind the scenes" series, not a "here's how you should do it" series. We're taking apart a bunch of toasters to see how they work. When possible, take advantage of code written by people smarter than you.

  • The Old New Thing

    Lock-free algorithms: The singleton constructor

    • 25 Comments

    The first half may be familiar to many (most?) readers, but there's an interesting exercise at the bottom.

    A very useful pattern for the Interlocked* functions is lock-free lazy initialization via Interlocked­Compare­Exchange­Pointer­Release. Yes, that's a really long function name, but it turns out every part of it is important.

    Widget *g_pwidCached;
    
    Widget *GetSingletonWidget()
    {
     Widget *pwid = g_pwidCached;
     if (!pwid) {
      pwid = new(nothrow) Widget();
      if (pwid) {
       Widget *pwidOld = reinterpret_cast<Widget*>
           (InterlockedCompareExchangePointerRelease(
              &reinterpret_cast<PVOID&>(g_pwidCached),
              pwid, NULL));
       if (pwidOld) {
        delete pwid; // lost the race - destroy the redundant copy
        pwid = pwidOld; // use the old one
       }
      }
     }
     return pwid;
    }
    

    This is a double-check lock, but without the locking. Instead of taking a lock when doing the initial construction, we just let it be a free-for-all over who gets to create the object. If five threads all reach this code at the same time, sure, let's create five objects. After everybody creates what they think is the winning object, they call Interlocked­Compare­Exchange­Pointer­Release to attempt to update the global pointer.

    The parts of the name of the Interlocked­Compare­Exchange­Pointer­Release function work like this:

    • Interlocked: The operation is atomic. This is important to avoid two threads successfully updating the value of g_pwidCached.
    • Compare: The value in g_pwidCached is compared against NULL.
    • Exchange: If the values are equal, then g_pwidCached is set to pwid. This, combined with the comparison, ensures that only one thread gets to set the value of g_pwidCached.
    • Pointer: The operations are on pointer-sized data.
    • Release: The operation takes place with release semantics. This is important to ensure that the pwid we created is fully-constructed before we publish its pointer to other processors.
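
    Putting the Compare and Exchange pieces together, the operation behaves like this sketch, except that the hardware performs all the steps as one indivisible unit:

    // Logical behavior of InterlockedCompareExchangePointer.
    // Illustrative only - the real operation is atomic.
    PVOID CompareExchangePointerLogic(
        PVOID volatile *ppvDestination, PVOID pvExchange, PVOID pvComparand)
    {
     PVOID pvOriginal = *ppvDestination;
     if (pvOriginal == pvComparand)
      *ppvDestination = pvExchange;
     return pvOriginal; // callers check this to see who won the race
    }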

    This technique is suitable when it's okay to let multiple threads try to create the singleton (and have all the losers destroy their copy). If creating the singleton is expensive or has unwanted side-effects, then you don't want to use the free-for-all algorithm.

    Bonus reading:

    • One-Time Initialization helper functions save you from having to write all this code yourself. They deal with all the synchronization and memory barrier issues, and support both the one-person-gets-to-initialize and the free-for-all-initialization models.
    • A lazy initialization primitive for .NET provides a C# version of the same.

    Okay, now here's the interesting exercise. This is an actual problem I helped out with, although details have been changed for expository purposes.

    We have a data structure which manages a bunch of singleton objects, let's say that they are instances of a structure called ITEMCONTROLLER and they are keyed by a 32-bit ID. We're looking for design suggestions on making it thread-safe. The existing code goes like this (pseudocode):

    struct ITEMCONTROLLER;
    struct SINGLETONINFO {
     DWORD dwId;
     ITEMCONTROLLER *(*pfnCreateController)();
    };
    
    class SingletonManager {
    public:
     // rgsi is an array that describes how to create the objects.
     // It's a static array, with csi in the range 20 to 50.
     SingletonManager(const SINGLETONINFO *rgsi, UINT csi)
                   : m_rgsi(rgsi), m_csi(csi),
                     m_rgcs(NULL), m_ccs(0), m_ccsAlloc(0) { }
     ~SingletonManager() { ... }
     ITEMCONTROLLER *Lookup(DWORD dwId);
    
    private:
     struct CREATEDSINGLETON {
      DWORD dwId;
      ITEMCONTROLLER *pic;
     };
    
    private:
     const SINGLETONINFO *m_rgsi;
     int m_csi;
    
     // Array that describes objects we've created
     CREATEDSINGLETON *m_rgcs;
     int m_ccs;
     int m_ccsAlloc; // allocated capacity of m_rgcs
    };
    
    ITEMCONTROLLER *SingletonManager::Lookup(DWORD dwId)
    {
     int i;
    
     // See if we already created one
     for (i = 0; i < m_ccs; i++) {
      if (m_rgcs[i].dwId == dwId)
       return m_rgcs[i].pic;
     }
    
     // Not yet created - time to create one
     ITEMCONTROLLER *pic = NULL;
     for (i = 0; i < m_csi; i++) {
      if (m_rgsi[i].dwId == dwId) {
       pic = m_rgsi[i].pfnCreateController();
       break;
      }
     }
     if (pic == NULL) return NULL;
    
     ... if m_rgcs == NULL then allocate it and update m_ccsAlloc
     ... else realloc it bigger and update m_ccsAlloc
    
     // append to our array so we can find it next time
     m_rgcs[m_ccs].dwId = dwId;
     m_rgcs[m_ccs].pic  = pic;
     m_ccs++;
    
     return pic;
    }
    

    In words, the SingletonManager takes an array of SINGLETONINFO structures, each of which contains an ID and a function to call to create the object with that ID. To look up an entry, we first check if we already created one; if so, then we just return the existing one. Otherwise, we create the object (using pfnCreateController) and add it to our array of created objects.

    Our initial inclination is to put a critical section around the entire Lookup function, but maybe there's something more clever we can do here. Maybe a slim reader-writer lock?

    Bonus chatter: Although it's the case on Windows that the plain versions of the interlocked functions impose both acquire and release semantics, other platforms may not follow Windows' lead. In particular, on the XBOX360 platform, the plain versions of the interlocked functions impose neither acquire nor release semantics. I don't know what the rules are for Windows CE.

    Erratum: I once knew but subsequently forgot that the singleton pattern described in this article (with the InterlockedCompareExchangePointer) is not safe on some CPU architectures. An additional MemoryBarrier() needs to be inserted after the fetch of the singleton pointer to ensure that indirections through it will retrieve the new values and not any cached old values:

    Widget *GetSingletonWidget()
    {
     Widget *pwid = g_pwidCached;
     if (!pwid) {
      ...
     } else {
      // Ensure that dereferences of pwid access new values and not old
      // cached values.
      MemoryBarrier();
     }
     return pwid;
    }
    

    The discussion of lock-free algorithms continues (with probably more errors!) next time.

  • The Old New Thing

    Lock-free algorithms: Choosing a unique value (warm-up)

    • 34 Comments

    Here's a snippet of code whose job is to generate a unique number within the process. Here's some reference reading to get yourself in the mood. Caution: It may or may not be useful.

    dwUniqueId = InterlockedCompareExchange(&g_dwUniqueId, 
                                            g_dwUniqueId+1, 
                                            g_dwUniqueId);
    

    Criticize this code fragment.

  • The Old New Thing

    Windows is not a .NET Framework delivery channel either

    • 52 Comments

    We learned a while ago that Windows is not an MFC delivery channel. And, since you asked, it's not a .NET Framework delivery channel either.

    If you're developing a program that uses the .NET Framework, you have to have a backup plan if the version of the .NET Framework you need is not installed on the computer. This might mean including a copy of the installer on your CD. It might mean redirecting the user to an appropriate download site. It might just mean telling the user, "This program requires version XYZ of the .NET Framework." Whatever you do, you need to do something.
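
    If you go the detection route, one possible shape for the check is a sketch like this. (The registry key shown is the documented detection key for .NET Framework 3.5; adjust it for whatever version you actually require.)

    #include <windows.h>

    bool IsDotNet35Installed()
    {
     HKEY hkey;
     if (RegOpenKeyExW(HKEY_LOCAL_MACHINE,
             L"SOFTWARE\\Microsoft\\NET Framework Setup\\NDP\\v3.5",
             0, KEY_QUERY_VALUE, &hkey) != ERROR_SUCCESS) return false;
     DWORD dwInstall = 0, cb = sizeof(dwInstall);
     bool fInstalled = RegQueryValueExW(hkey, L"Install", NULL, NULL,
             reinterpret_cast<BYTE*>(&dwInstall), &cb) == ERROR_SUCCESS
         && dwInstall == 1;
     RegCloseKey(hkey);
     return fInstalled;
    }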

    Windows XP didn't come with any version of the .NET Framework. Windows Vista came with version 3.0, and Windows 7 came with version 3.5, but these were provided as optional components which were installed by default. You can go into the Programs and Features control panel to remove them.

    As I recall, the application compatibility folks have a list of programs that treated Windows as a .NET Framework delivery channel; if you install any of them, and the version of the .NET Framework they depend on isn't installed, the application compatibility layer will display the informational message that the application neglected to.

    Bonus chatter: These programs which treated Windows as a .NET Framework delivery channel may have been misled by charts like this one or lists like this one which give the impression that Windows is a .NET Framework delivery channel.

  • The Old New Thing

    The funniest joke I've ever told (to a three-year-old)

    • 33 Comments

    I've tested this joke on several children ages three and four, and it never fails.

    There were two kittens walking down the street, and one of them fell on its butt!

    I developed this joke for one of my honorary nieces. She had just started to learn about joke-telling and asked me to tell her a joke. One of the keys to joke-telling is to know your audience: A three-year-old won't have the attention span for a longer joke, and subtlety and observational humor won't work either. It's got to be short and direct. She thought kittens were cute, and falling on one's butt was funny, so I figured, hey, let me combine the two.

    When I told this joke, she fell on the floor in uncontrollable laughter. I had a winner.

  • The Old New Thing

    The introduction of whimsical teasing in Comic Chat

    • 11 Comments

    A few months after my post on the sad demise of whimsical teasing in Comic Chat, I received a piece of email from none other than the author of Comic Chat, DJ Kurlander:

    I was the person that started the Comic Chat project in Microsoft Research and was responsible for that line, "This person is too lazy to create a profile entry."

    Not a whole lot of thought went into the default profile. In maybe the 30 seconds that I put into it, I thought that the line was moderately humorous, fit with the quirky nature of Comic Chat, and might motivate more people to create profiles. When we released version 2, MSN got the following complaint. Yes, this is verbatim (and specifically in reference to the default profile):

    I am very offended that this is type of code is allowed to leave microsoft. I am seriously considering dropping my subscription to MSN. I don't get this kind of crap from CompuServe, and I can get comparable Internet access from a local ISP. I can see that there is a serious lack of customer respect at microsoft and this is probably just the tip of the iceberg.

    MSN was pretty conservative, at least at the time, and this caused a big headache for the team. How dare we insult our customers! Did the project not have enough management overhead? Can we make things any more difficult for the small team? Yes we can! As a result, we were required to remove the message from the 2.1 release. It became "This user has no profile." at least in the version that was released on the MSN CD-ROM.

    Meanwhile, we got some advance notice about an upcoming PC World review of the 2.0 version that had included the original message. The summary (which was emailed to us) said: "She [the reviewer] praises Microsoft for finally having a sense of humor (loves the note about being 'too lazy to have a bio' message). She claims Comic Chat is an absolute delight."

    I love the part about MS finally having a sense of humor. Well at least we were allowed to add back the message for the 2.5 release!

    Check out DJ's history of Comic Chat, which includes links to videos and other time-wasty goodness.

    Bonus chatter: DJ also has a page on Peedy the Parrot, a research project to develop a more natural user interface. Out in the product groups, we just called it that stupid parrot demo because it became tiresome from overexposure, having been used in a CES keynote and multiple company meetings.
