April, 2011

  • The Old New Thing

    Visual Studio 2005 gives you acquire and release semantics for free on volatile memory access

    • 10 Comments

    If you are using Visual Studio 2005 or later, then you don't need the weird Interlocked­Read­Acquire function because Visual Studio 2005 and later automatically impose acquire semantics on reads from volatile locations. It also imposes release semantics on writes to volatile locations. In other words, you can replace the old Interlocked­Read­Acquire function with the following:

    #if _MSC_VER >= 1400
    LONG InterlockedReadAcquire(__in volatile LONG *pl)
    {
        return *pl; // Acquire imposed by volatility
    }
    #endif
    

    This is a good thing because it expresses your intentions more clearly to the compiler. The old method that overloaded Interlocked­Compare­Exchange­Acquire forced the compiler to perform the actual compare-and-exchange even though we really didn't care about the operation; we just wanted the side effect of the Acquire semantics. On some architectures, this forces the cache line dirty even if the comparison fails.
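
    For comparison, here is a sketch of what the old-style helper might have looked like (the original comes from an earlier article, so the details here are assumed): it obtains acquire semantics by performing an Interlocked­Compare­Exchange­Acquire whose only purpose is its memory-ordering side effect.

    #if _MSC_VER < 1400
    // Sketch of the old-style helper, for contrast with the version above.
    LONG InterlockedReadAcquire(__in volatile LONG *pl)
    {
        // Compare against an unlikely value so the exchange normally fails
        // and nothing is written; we only want the Acquire side effect.
        return InterlockedCompareExchangeAcquire(pl, -1, -1);
    }
    #endif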

  • The Old New Thing

    Back from Las Vegas, and now my clothes smell like cigarette smoke

    • 23 Comments

    I actually came back Thursday night, but I've been too lazy to jot down some reactions until now.

    There are signs on the street directing you to a tram connecting the Monte Carlo hotel with the Bellagio, but once you follow the first sign (the one that takes you into the casino), there are no more signs telling you how to get to the tram. The tram is a lie.

    Actually, the tram does exist, but it's not marked. You have to walk through the entire Monte Carlo casino to the Street of Dreams shops, and then past them and up a flight of stairs to the tram station. When you reach the Bellagio, you then have to do the reverse, walking through the Bellagio to make your way back to the street. I'm not sure if it's actually faster than just walking the regular way.

    Note to people who choose swag for conferences: if people are flying into your conference, a toy gun is probably not a good choice for a giveaway.

    A word of warning to Japanese tourists: The people in costumes standing around enticing you to pose with them for a picture? After you take your picture, they're going to ask you for money. And apparently, the practice has spread to helpful "volunteer" photographers who will just shrug and walk away if they break your camera.

    The Marquee bills itself as a "Nightclub and Dayclub." Doesn't that just make it a "Club"?

    I didn't win a Niney, but I did get to meet Lady Gaga (or at least a Lady Gaga impersonator). As a courtesy, I called her "Your Grace." I knew it wasn't the correct honorific, but she seemed okay with it.

    Man, that Niney award is heavy. And Scott won two of them! If he packs them both in the same suitcase, he's going to get a hernia.

    At the urinal is a small shelf, presumably a place to put your drink while you make room for your next drink.

    Let me get this straight. If you visit the Paris Hotel and take a picture of the fake Eiffel Tower, you'll also capture in the frame a U.S. flag. I guess that's sort of like the fake streets mapmakers add so they can detect copyright violations.

    One of the figures atop the Paris Hotel is a statue with the simple label Suger. Once again proving that if you know Swedish, the world is funnier. In Swedish, suger means sucks.

    When airline pilots come on the speaker to tell you the weather at your destination, why do they all assume they're talking to other pilots? "The winds are out of the south-southwest, and the cloud ceiling is 10,000 feet, with visibility of five miles and air pressure of 30.05 inches." The only way this could be made more useless would be if a passenger shouted out, "But what's the dew point?" One of my friends suggested that airline pilots are simply frustrated that they couldn't make it as weathermen.

    My clothes smell like cigarette smoke, so I guess what happens in Vegas doesn't always stay in Vegas.

  • The Old New Thing

    Don't forget to include the message queue in your lock hierarchy

    • 5 Comments

    In addition to the loader lock, the message queue is another resource that people often forget to incorporate into their lock hierarchy. If your code runs on a UI thread, then it implicitly owns the message queue whenever it is running, because messages cannot be dispatched to a thread until it calls a message-retrieval function such as Get­Message or Peek­Message. In other words, whenever a thread is not checking for a message, it cannot receive a message.

    For example, consider the following code:

    EnterCriticalSection(&g_cs);
    for (int i = 0; i < 10; i++) {
      SendMessage(hwndLB, LB_ADDSTRING, 0, (LPARAM)strings[i]);
    }
    LeaveCriticalSection(&g_cs);
    

    If hwndLB belongs to another thread, then you have a potential deadlock, because that thread might be waiting for your critical section.

    case WM_DOESNTMATTERWHAT:
        EnterCriticalSection(&g_cs);
        ... doesn't matter what goes here ...
        LeaveCriticalSection(&g_cs);
        break;
    

    If you happen to try to send the message while that other thread is waiting for the critical section, you will deadlock because you are waiting for that thread to finish whatever it's doing so it can process the message you sent to it, but that thread is waiting for the critical section which you own.

    Even if you promise that hwndLB belongs to your thread, the possibility of subclassing or window hooks means that you do not have full control over what happens when you try to send that message. A WH_CALL­WND­PROC window hook may decide to communicate with another thread (for example, to log the message). Boom, what you thought was a simple message sent to a window on your thread turned into a cross-thread message.

    There are many actions that generate message traffic that may not be obvious at first glance because they don't involve explicitly sending a message. Invoking a COM method from an STA thread on an object which belongs to another apartment requires the call to be marshaled to the thread that hosts the object. Tinkering with a window's scroll bars can result in the WS_HSCROLL or WS_VSCROLL style being added or removed, which in turn generates WM_STYLE­CHANGING and WM_STYLE­CHANGED messages. Obtaining the text from a window belonging to another thread in your process results in synchronous message traffic.

    A good rule of thumb is to avoid doing anything that involves windows belonging to other threads while holding a critical section or other resource. And even windows which belong to your own thread are suspect, due to hooks and subclassing.
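
    For example, one way to apply that rule to the listbox code at the top of this article is to do whatever actually requires the critical section first, and only then send the messages, so that no lock is held during the cross-thread call. This is just a sketch, reusing the g_cs, hwndLB, and strings variables from the earlier fragment:

    EnterCriticalSection(&g_cs);
    ... copy or compute whatever actually needs g_cs's protection ...
    LeaveCriticalSection(&g_cs);
    
    // No locks are held here, so even if this SendMessage turns into a
    // cross-thread call (directly, or via a hook or subclass), it cannot
    // deadlock against g_cs.
    for (int i = 0; i < 10; i++) {
      SendMessage(hwndLB, LB_ADDSTRING, 0, (LPARAM)strings[i]);
    }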

  • The Old New Thing

    Lock-free algorithms: The try/commit/(hand off) model

    • 16 Comments

    The last lock-free pattern for this week isn't actually lock-free, but it does run without blocking.

    The pattern for what I'll call try/commit/(hand off) is more complicated than the other patterns, so I'll start off by describing it in words rather than in code, because the code tends to make things more complicated.

    First, you take the state variable and chop it up into pieces. You need some bits to be used as a lock and as a "work has been handed off" flag. And if the work that has been handed off is complicated, you may need some more bits to remember the details of the handoff. A common way of doing this is to use a pointer-sized state variable, require that the objects being pointed to are suitably aligned, and reuse the bottom bits as flags. For example, if you require that the objects be DWORD-aligned, then the two bottom bits will always be zero and you can reuse them as flags.

    To perform an operation, you first try to lock the state variable. If you can't because the state variable is already locked, then you record the details of the operation in the state variable and update it atomically.

    If you succeed in locking the state variable, then you perform the desired operation, but before you unlock the state variable, you look to see if any work has been handed off. (This hand-off work is the result of attempts to perform the operation while you held the lock.) If there is hand-off work, then you perform that work as well. Of course, while you're doing that, more hand-off work may arrive. You can't unlock the state variable until you've drained off all the pent-up hand-off work.

    The code for this pattern tends to be a tangle of loops since there is a lot of backing off and retrying. Every atomic operation is its own loop, draining the hand-off work is another loop, and any time an Interlocked­Compare­Exchange fails, you have to undo the work you did and retry—another loop.

    I trust only about five people in the world to write code that is this advanced, and I'm not one of them. But just to illustrate the principle (although I will certainly get the details wrong), here's an implementation of a synchronization-like object which I will call a Group­Wait for lack of any other name. It has the following operations:

    • Add­Wait: Register an event handle with the group wait.
    • Signal­All: Signals all events that are registered with the group wait. Once an event is signalled, it is automatically unregistered from the group wait. If you want the event to be signalled at the next call to Signal­All you have to re-add it.

    The group wait object is just a linked list of NODEs containing the handles being waited on.

    Actually, this type of object doesn't need to use the try/commit/hand off model. It can be implemented in a much more straightforward manner by having Add­Wait atomically prepend the node to a list and having Signal­All atomically steal the list. There are even prewritten functions to perform these atomic linked list operations for you. But I'm going to implement it the complicated way for demonstration purposes. In real life, the code would be much simpler.
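
    For comparison, here is a sketch of what that much more straightforward implementation might look like using the prewritten interlocked singly-linked list (SLIST) functions. The WAITNODE structure and g_WaitList variable are made up for this illustration.

    // Sketch only: the straightforward alternative using the SLIST functions.
    struct WAITNODE {
     SLIST_ENTRY ItemEntry; // must be aligned to MEMORY_ALLOCATION_ALIGNMENT;
                            // keeping it first makes that easy
     HANDLE hEvent;
    };
    
    SLIST_HEADER g_WaitList; // initialize once with InitializeSListHead(&g_WaitList)
    
    BOOL AddWait(HANDLE hEvent)
    {
     WAITNODE *pNode = (WAITNODE*)_aligned_malloc(sizeof(WAITNODE),
                                                  MEMORY_ALLOCATION_ALIGNMENT);
     if (pNode == NULL) return FALSE;
     pNode->hEvent = hEvent;
     InterlockedPushEntrySList(&g_WaitList, &pNode->ItemEntry); // atomic prepend
     return TRUE;
    }
    
    void SignalAll()
    {
     // Atomically steal the entire list, then signal each node at leisure.
     SLIST_ENTRY *pEntry = InterlockedFlushSList(&g_WaitList);
     while (pEntry != NULL) {
      WAITNODE *pNode = CONTAINING_RECORD(pEntry, WAITNODE, ItemEntry);
      pEntry = pEntry->Next;
      SetEvent(pNode->hEvent);
      _aligned_free(pNode);
     }
    }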

    Since the bottom two bits of the pointer must be zero due to alignment, we repurpose them as a lock bit and a signal bit. The lock bit is set when the list is locked, and the signal bit is set when a signal was requested but had to be handed off because the list was locked.

    // WARNING! IF YOU USE THIS CODE YOU ARE AN IDIOT - READ THE TEXT ABOVE
    
    struct NODE;
    NODE *Node(LONG_PTR key) { return reinterpret_cast<NODE*>(key); }
    
    enum {
     Locked = 1,
     Signalled = 2,
    };
    
    struct NODE {
     NODE *pnNext;
     HANDLE hEvent;
    
     LONG_PTR Key() { return reinterpret_cast<LONG_PTR>(this); }
     NODE *Ptr() { return Node(Key() & ~(Locked | Signalled)); }
    };
    
    #define NODE_INVALID Node(-1)
    
    class GroupWait {
    public:
     GroupWait() : m_pnRoot(NULL) { }
     ~GroupWait();
     BOOL AddWait(HANDLE hEvent);
     void SignalAll();
    private:
     NODE *m_pnRoot;
    };
    

    Since I will be viewing the NODE* as both a pointer and as a bunch of bits (which I call a key), I created some helper methods to save typing. Node and Key convert back and forth between node pointers and keys, and Ptr strips off the tag bits and returns a usable pointer.

    For notational purposes, a NODE* will be written as the combination p|S|L where p is a pointer to the next node, S is the signalled bit, and L is the lock bit. The signalled bit is set to indicate that we need to signal all the nodes in the list starting with the next node. (Think of the S bit as being attached to the outgoing arrow.) For example, this linked list:

       m_pnRoot
      +--------+-+-+
      |   *    |0|1|
      +---|----+-+-+
          |
          v
      +--------+-+-+---------+
    A |   *    |1|?| hEvent1 |
      +---|----+-+-+---------+
          |
          v
      +--------+-+-+---------+
    B |   *    |?|?| hEvent2 |
      +---|----+-+-+---------+
          |
          v
      +--------+-+-+---------+
    C |  NULL  |?|?| hEvent3 |
      +--------+-+-+---------+
    

    represents a group wait with three registered event handles. The S bit is clear on the root pointer, which means that nobody has yet requested that hEvent1 be signalled. On the other hand, the S bit is set on node A, which means that all the events after node A need to be signalled, specifically hEvent2 and hEvent3. Note that this means that it doesn't matter whether the S bit is set on nodes B or C; those events are getting set regardless because the S bit on node A already requested it. (In particular, the S bit on the last node is meaningless since there are no nodes which come after it.)

    The L bit is meaningless on all pointers other than m_pnRoot.

    Okay, let's start by adding a handle to the wait list:

    BOOL GroupWait::AddWait(HANDLE hEvent)
    {
     NODE *pnInsert = new(nothrow) NODE;
     if (pnInsert == NULL) return FALSE;
     pnInsert->hEvent = hEvent;
    
     NODE *pn;
     NODE *pnNew;
     do {
      pn = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
      pnInsert->pnNext = pn;
      pnNew = Node(pnInsert->Key() | (pn->Key() & Locked));
     } while (InterlockedCompareExchangeRelease(&m_pnRoot, pnNew, pn) != pn);
     return TRUE;
    }
    

    To add a handle to the wait list, we just prepend it to the linked list, being careful to propagate the L bit into the new pointer so we don't accidentally release a lock that somebody else took. We add the node with the S bit clear on the inbound pointer since nobody has yet asked for this handle to be signalled. After setting up the node, we attempt to insert it into the head of the list, and if we can't (because somebody else beat us to it), then we restart and try again. This is a standard try/commit/try again pattern.

    Exercise: Is there an ABA race condition here?

    The Add­Wait method illustrates one extreme case of the try/commit/hand off model, where there is really nothing to hand off; we did it all ourselves. Of course, this does make other parts of the code trickier since they have to go back and deal with nodes that were added while the list was locked.

    The nasty part of the code is in Signal­All. I'll present it in pieces.

    void GroupWait::SignalAll()
    {
     NODE *pnCapture;
     NODE *pnNew;
     do {
      pnCapture = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
    
      if (pnCapture->Key() & Locked) {
       pnNew = Node(pnCapture->Key() | Signalled);
      } else {
       pnNew = Node(Locked);
      }
     } while (InterlockedCompareExchangeAcquire(&m_pnRoot,
                                  pnNew, pnCapture) != pnCapture);
    
     if (pnCapture->Key() & Locked) return;
    
     ...
    

    If the list is locked, then all we do is try to set the S bit on the root. If the list is not locked, then we try to lock it and simultaneously detach all the nodes by replacing the root pointer with NULL|0|1. Either way, we perform the operation with the try/commit/try again pattern until we finally get through.

    If the list was locked, then all we had to do was set the S bit on the root pointer. Setting the S bit on the root pointer means that all the nodes reachable from this pointer (i.e., all nodes after the root, which is all nodes) should be signalled, which is exactly what we want. Since the list is locked, we leave the actual signalling to the code that unlocks the list. (This is the hand off part of try/commit/hand off.)

    Exercise: What if the S bit is already set? Did we lose a signal?

    Otherwise, we are the ones to lock the list. We also detach the node list, for if another thread calls Signal­All, we don't want that signal to affect the nodes that we're signalling. (Otherwise we might end up double-signalling the event.)

     ...
     NODE *pnNext;
     NODE *pn;
     for (pn = pnCapture->Ptr(); pn; pn = pnNext) {
      SetEvent(pn->hEvent);
      pnNext = pn->pnNext->Ptr();
      delete pn;
     }
     ...
    

    That little fragment above is basically what you would do in a naïve implementation that didn't worry about multithreading: It walks the list of nodes, signals each event, and then frees the node. The only trick is sending each node pointer through ->Ptr() to strip off the tag bits.

    Next comes the unlock code. First, a preparatory step:

     ...
     pnCapture = pnNew;
     ...
    

    We exchanged pnNew into m_pnRoot up above, and if that's still the value of m_pnRoot, then it means that nobody tried to perform any operations while the list was locked, and we got off easy.

     ...
     for (;;) {
      pnNew = Node(pnCapture->Key() & ~Locked);
      if (InterlockedCompareExchangeRelease(&m_pnRoot,
                          pnNew, pnCapture) == pnCapture) {
       return;
      }
     ...
    

    We start a new loop whose job is to drain off all the handed-off work items that built up while the list was locked. First, we see whether anything has changed since the last time we looked; if not, then we unlock and we're done. Otherwise, we proceed to pick up all the handed-off work:

     ...
      pnCapture = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
    
      NODE *pnNew = Node(pnCapture->Key() & ~(Locked | Signalled));
      NODE **ppn = &pnNew;
      NODE *pn;
      NODE *pnNext;
    
      BOOL fSignalSeen = FALSE;
      for (pn = pnNew; pn->Ptr(); pn = pnNext) {
       pnNext = pn->Ptr()->pnNext;
       if (fSignalSeen) {
        SetEvent(pn->Ptr()->hEvent);
        delete pn->Ptr();
       } else if (pn->Key() & Signaled) {
        fSignalSeen = TRUE;
        (*ppn) = Node(Locked); // detach but retain lock
        SetEvent(pn->Ptr()->hEvent);
        delete pn->Ptr();
       } else {
        ppn = &pn->Ptr()->pnNext;
       }
      }
     } // retry unlock
    } // end of function
    

    To drain the handed-off work, we walk the list of nodes, keeping track of whether we've seen an S bit. If so, then we signal the event and free the node. And the first time we see an S bit, we null out the inbound pointer to detach the list from the chain so we do not double-signal the event in the future.

    Once that's done, we go back and try to unlock again. Eventually, there will be no more hand-off work, and we can finally return.

    And that's it, a demonstration of the try/commit/hand off model. The basic idea is simple, but getting all the details right is what makes your head hurt.

    I leave this sort of thing to the kernel folks, who have the time and patience and brainpower to work it all through. An example of this pattern can be found in this talk that describes the dismantling of the dispatcher spinlock.

  • The Old New Thing

    Lock-free algorithms: The opportunistic cache

    • 22 Comments

    Suppose profiling reveals that a specific calculation is responsible for a significant portion of your CPU time, and instrumentation says that most of the time, it's just being asked to calculate the same thing over and over. A simple one-level cache would do the trick here.

    BOOL IsPrime(int n)
    {
     static int nLast = 1;
     static BOOL fLastIsPrime = FALSE;
    
     // if it's the same value as last time, then
     // use the answer we cached from last time
     if (n == nLast) return fLastIsPrime;
    
     // calculate and cache the new result
     nLast = n;
     fLastIsPrime = slow_IsPrime(n);
     return fLastIsPrime;
    }
    

    Of course, this isn't thread-safe, because if one thread is pre-empted inside the call to slow_IsPrime, then another thread will see values for nLast and fLast­Is­Prime that do not correspond to each other.

    One solution would be to put a critical section around this code, but this introduces an artificial bottleneck: If the most recent cached result is nLast = 5, fLast­Is­Prime = TRUE, and if two threads both try to see whether 5 is prime, you don't want those two threads to serialize against each other.

    Another solution is to use slim reader-writer locks and acquire in shared mode when checking the cache and in exclusive mode when updating the cache. But let's try a lock-free solution.

    We're going to combine two different techniques. First, we use the change counter technique we saw last time when we investigated try/commit/(try again), but we also combine it with a lock that is manipulated with a try/commit/abandon pattern.

    #define IsLocked(l) ((l) & 1)
    
    BOOL IsPrime(int n)
    {
     static int nLast = 1;
     static BOOL fLastIsPrime = FALSE;
     static LONG lCounter = 0;
    
     // see if we can get it from the cache
     LONG lCounterStart = InterlockedReadAcquire(&lCounter, -1);
     if (!IsLocked(lCounterStart) && n == nLast) {
      BOOL fResult = fLastIsPrime;
      if (InterlockedReadRelease(&lCounter, -1) == lCounterStart)
       return fResult;
     }
    
     // calculate it the slow way
     BOOL fIsPrime = slow_IsPrime(n);
    
     // update the cache if we can
     lCounterStart = lCounter;
     if (!IsLocked(lCounterStart) &&
         InterlockedCompareExchangeAcquire(&lCounter,
                  lCounterStart+1, lCounterStart) == lCounterStart) {
      nLast = n;
      fLastIsPrime = fIsPrime;
      InterlockedCompareExchangeRelease(&lCounter,
                  lCounterStart+2, lCounterStart+1);
     }
     return fIsPrime;
    }
    

    The lCounter consists of a LOCK bit as the bottom bit and a change counter in the remaining bits. (Choosing the bottom bit as the LOCK bit makes the operations of clearing the lock and incrementing the counter very simple.)
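
    To make that layout concrete, here is a small illustration (the values are arbitrary):

    // lCounter layout: bit 0 = LOCK bit, bits 1 and up = change counter.
    LONG lStart  = 6;          // counter = 3, unlocked
    LONG lLocked = lStart + 1; // counter = 3, locked (adding 1 sets only bit 0)
    LONG lDone   = lStart + 2; // counter = 4, unlocked; a reader that captured
                               // lStart can now tell that something changed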

    There are two parts to this function, the part that reads the cache and the part that updates the cache.

    To read the cache, we first read the counter with acquire semantics, so that the reads of nLast and fLast­Is­Prime will not take place until after we get the counter. If the counter says that the cache is not locked, then we go ahead and fetch the last value and the last result. If the last value in the cache matches the value we're calculating, then we go ahead and use the last result. But as a final check, we make sure that the counter hasn't changed while we were busy looking at the protected variables. If it has, then it means that we may have read inconsistent values and cannot trust the cached result.

    If we have a cache miss or we couldn't access the cache, we go ahead and calculate the result the slow way.

    Next, we try to update the cache. This time, instead of just looking to see whether the cache is locked, we try to lock it ourselves by setting the bottom bit. (If the lock fails, then we skip the cache update and just return the value we calculated.) Once the lock is taken, we update the protected variables, then atomically release the lock and increment the counter. (This is where putting the lock in the bottom bit comes in handy: You can increment the counter by adding 2 and not worry about a carry out of the counter bits turning into an accidental lock bit.) We use Release semantics so that the values of the protected variables are committed to memory before the lock bit clears.

    Note that in both halves of the function, if the cache is locked, we just proceed as if there were no cache at all. The theory here is that it's better just to say "Oh, the heck with it, I'll just do it myself" than to line up and wait to access the cache. Continuing instead of waiting avoids problems like priority inversion, but it also means that you get some spurious cache misses. Fortunately, since it's just a cache, an occasional spurious miss is not the end of the world.

    You could do something similar with the Try­Enter­Critical­Section function provided you're running Windows NT 4.0 or higher:

    BOOL IsPrime(int n)
    {
     static int nLast = 1;
     static BOOL fLastIsPrime = FALSE;
     BOOL fHaveAnswer = FALSE;
     BOOL fIsPrime;
    
     // see if we can get it from the cache
     if (TryEnterCriticalSection(&g_cs)) {
      if (n == nLast) {
       fHaveAnswer = TRUE;
       fIsPrime = fLastIsPrime;
      }
      LeaveCriticalSection(&g_cs);
     }
     if (fHaveAnswer) return fIsPrime;
    
     // calculate it the slow way
     fIsPrime = slow_IsPrime(n);
    
     // update the cache if we can
     if (TryEnterCriticalSection(&g_cs)) {
      nLast = n;
      fLastIsPrime = fIsPrime;
      LeaveCriticalSection(&g_cs);
     }
     return fIsPrime;
    }
    

    This does have the disadvantage that multiple readers will lock each other out, so we can switch to a slim reader/writer lock provided we're running on Windows 7 or higher:

    BOOL IsPrime(int n)
    {
     static int nLast = 1;
     static BOOL fLastIsPrime = FALSE;
     BOOL fHaveAnswer = FALSE;
     BOOL fIsPrime;
    
     // see if we can get it from the cache
     if (TryAcquireSRWLockShared(&g_lock)) {
      if (n == nLast) {
       fHaveAnswer = TRUE;
       fIsPrime = fLastIsPrime;
      }
      ReleaseSRWLockShared(&g_lock);
     }
     if (fHaveAnswer) return fIsPrime;
    
     // calculate it the slow way
     fIsPrime = slow_IsPrime(n);
    
     // update the cache if we can
     if (TryAcquireSRWLockExclusive(&g_lock)) {
      nLast = n;
      fLastIsPrime = fIsPrime;
      ReleaseSRWLockExclusive(&g_lock);
     }
     return fIsPrime;
    }
    

    This still has the problem that readers can lock out a cache update. If the function is hot (and if it weren't, why would you switch to a lock-free algorithm?), and the usage pattern shifts (say, instead of checking whether 13 is prime over and over, it starts checking whether 17 is prime over and over), everybody will be so busy reading the cache to see if the cached value is 17 that nobody will get a chance to update the cache to actually be 17!

    Exercise: What constraints must be imposed on the protected variables for this technique to be successful?

  • The Old New Thing

    Lock-free algorithms: Update if you can I'm feeling down

    • 19 Comments

    A customer was looking for advice on this synchronization problem:

    We have a small amount of data that we need to share among multiple processes. One way to protect the data is to use a spin lock. However, that has potential for deadlock if the process which holds the spinlock doesn't get a chance to release it. For example, it might be suspended in the debugger, or somebody might decide to use Terminate­Process to nuke it.

    Any suggestions on how we can share this data without exposure to these types of horrible failure modes? I'm thinking of something like a reader takes the lock, fetches the values, and then checks a status at the end to tell if the data is valid. Meanwhile, a writer tries to take the lock with a timeout, and if the timeout fires, then the writer just goes ahead anyway and updates the values, and somehow sets the status on the reader so it knows that the value is no good and it should try again. Basically, I don't want either the reader or writer to get stuck indefinitely if a developer, say, just happens to break into the debugger at the worst possible time.

    This can be solved with a publishing pattern. When you want to update the values, you indicate that new values are ready by publishing their new location.

    Let's say that the data that needs to be shared is a collection of four integers.

    struct SHAREDDATA {
     int a, b, c, d;
    };
    

    Assume that there is a practical limit on how often the value can change; this is usually a safe assumption because you'll have some sort of external rate limiter, like "This value changes in response to a user action." (Even if there is no such limit, most solutions will simply posit one. For example, the SLIST functions simply assume that a processor won't get locked out more than 65535 times in a row.) In our case, let's say that the value will not change more than 64 times in rapid succession.

    #define SHAREDDATA_MAXCONCURRENT 64
    
    SHAREDDATA g_rgsd[SHAREDDATA_MAXCONCURRENT];
    LONG g_isd; // index of the current valid value
    
    void GetSharedData(__out SHAREDDATA *psd)
    {
     *psd = g_rgsd[g_isd];
    }
    

    Reading the data simply retrieves the most recently published value. The hard part is publishing the value.

    Actually, it's not hard at all.

    LONG g_isdNext = 1;
    
    void UpdateSharedData(__in const SHAREDDATA *psd)
    {
     UINT isd = (UINT)InterlockedIncrementAcquire(&g_isdNext);
     isd %= SHAREDDATA_MAXCONCURRENT;
     g_rgsd[isd] = *psd;
     InterlockedExchange(&g_isd, isd); // publish the new index
    }
    

    Publishing the data is a simple matter of obtaining a slot for the data, using Interlocked­Increment to get a unique location to store the data, or at least unique until SHAREDDATA_MAXCONCURRENT - 1 intervening publications have occurred. We store our results into the memory we obtained and then publish the new index. The publication needs to be done with release semantics, but since there is no Interlocked­Exchange­Release, we just do a full barrier exchange.

    Note that the update is not atomic with the read. A processor can call Get­Shared­Data, revise the values, then publish them, only to find that it overwrote a publication from another processor. If the new values are dependent on the old values (for example, if they are a running total), then you just lost an update.

    Note also that if two threads try to update at the same time, it's pretty much random which set of values you get since it's last writer wins.

    It so happens that in this particular case, the new values had nothing to do with the old values, so there was no problem with lost updates. And in practice, only one process updated the values at a time. (There is a master controller who updates the values, and everybody else just reads them.) Therefore, this simple method meets the requirements.

    Exercise: How would you adapt this solution if the new values depended on the old values?

  • The Old New Thing

    Overheard conversation fragment: I'm over here by the slot machines

    • 11 Comments

    While on a trip to Las Vegas, I happened to overhear a woman talking on her mobile phone who, from her body language, was clearly trying to meet up with a friend. We were in the casino of one of the major hotels. She said, "I'm over here by the slot machines."

    Yeah, that narrows it down.

    I'll be heading to Vegas for the Niney Awards. If I don't see you at the ceremony, I'll meet you by the slot machines.

  • The Old New Thing

    Lock-free algorithms: The try/commit/(try again) pattern

    • 31 Comments

    The singleton constructor pattern and the Interlocked­Multiply example we saw some time ago are really special cases of the more general pattern which I'll call try/commit/(try again). I don't know if this pattern has a real name, but that's what I'm calling it for today.

    The general form of this pattern goes like this:

    for (;;) {
     // capture the initial value of a shared variable we want to update
     originalValue = sharedVariable;
    
     ... capture other values we need to perform the operation ...
     ... these values must be independent of sharedVariable ...
    
     newValue = ... calculate the desired new result
                    based on originalValue and other captured values ...
    
     // Xxx can be Acquire, Release, or null
     if (InterlockedCompareExchangeXxx(
                &sharedVariable,
                newValue, originalValue) == originalValue) {
      break; // update was successful
     }
    
     ... clean up newValue ...
    
    } // loop back and try again
    

    We calculate the desired new value based on the initial value, combining it with other values that vary depending on the operation you want to perform, and then use an Interlocked­Compare­Exchange to update the shared value, provided the variable hasn't changed from its initial value. If the value did change, then that means another thread raced against us and updated the value before we could; in that case, we go back and try it again. Maybe the next time through we won't collide against somebody.

    Note that the try/commit/try again pattern is unfair. There's no assurance that the thread that has been trying to update the value for the longest time will win the next race. (This is a property common to most lock-free algorithms.)

    The Interlocked­Multiply function follows this pattern very closely: The other value required to perform the operation is simply the multiplier, which is a parameter to the function and therefore is independent of the shared variable. The new value is simply the product, and if we are unable to update the shared value (because somebody else modified it), we just start over.

    A variation of try/commit/try again is try/commit/abandon. In this pattern, there is no loop. If the exchange fails, you just give up and return a failure code. The function Try­Enter­Critical­Section uses the try/commit/abandon pattern. (It also uses the Acquire version of Interlocked­Compare­Exchange for reasons which should be obvious.)
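
    Here is a minimal sketch of try/commit/abandon (the helper and its parameters are hypothetical): try the exchange once, and if somebody else got there first, just report failure instead of looping.

    // Try once to claim ownership of a slot; if the compare-exchange
    // fails, give up rather than retry.
    BOOL TryClaimSlot(volatile LONG *plOwner, LONG lNewOwner)
    {
     // Succeeds only if the slot was previously unowned (zero).
     return InterlockedCompareExchangeAcquire(plOwner, lNewOwner, 0) == 0;
    }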

    Our singleton pattern is another special case of try/commit/try again where the "try again" is optimized out because we know what the result of "try again" is going to be, so we don't actually have to do it. In the singleton pattern case, the Interlocked­Compare­Exchange is a Release because the new value depends on other memory locations.
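
    For reference, here is a sketch of that singleton pattern (the Widget type and function name are hypothetical; the earlier article covers the pattern in detail):

    Widget *g_pWidget = NULL;
    
    Widget *GetSingletonWidget()
    {
     Widget *pWidget = g_pWidget;
     if (pWidget == NULL) {
      Widget *pCandidate = new(nothrow) Widget();
      if (pCandidate == NULL) return NULL;
      // Publish with release semantics so the fully-constructed Widget
      // becomes visible before the pointer does.
      PVOID pvPrev = InterlockedCompareExchangePointerRelease(
          (PVOID volatile *)&g_pWidget, pCandidate, NULL);
      if (pvPrev == NULL) {
       pWidget = pCandidate;  // we won the race; our object is the singleton
      } else {
       delete pCandidate;     // we lost; use the winner's object
       pWidget = (Widget *)pvPrev;
      }
     }
     return pWidget;
    }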

    Normally, the shared variable is an integer rather than a pointer, because a pointer is subject to the ABA problem if you incorporate the pointed-to data into your calculations. We get away with it in the singleton pattern case because the value change is unidirectional: It goes from NULL to something, and once it's something it never changes again. If the value of the shared variable can change in more general ways, then you have to be more careful if you use a pointer as the shared variable. (The most common solution is to make the shared variable not just a pointer but a pointer plus a counter which increments at each operation.)

    Here's another use of the try/commit/try again pattern, using a change counter as the shared variable. First, two helper functions:

    LONG InterlockedReadAcquire(__in LONG *pl, __in LONG lUnlikely)
    {
      return InterlockedCompareExchangeAcquire(pl, lUnlikely, lUnlikely);
    }
    
    LONG InterlockedReadRelease(__in LONG *pl, __in LONG lUnlikely)
    {
      return InterlockedCompareExchangeRelease(pl, lUnlikely, lUnlikely);
    }
    

    Although direct reads and writes of properly aligned LONGs are atomic, the operations are not synchronized and impose no memory ordering semantics. To read a value with specific semantics, I pull a sneaky trick: I perform an Interlocked­Compare­Exchange with the desired memory ordering semantics, passing the same value as the comparand and the exchange; therefore, the operation, even if successful, has no computational effect.

    if (*pl == lUnlikely) *pl = lUnlikely;
    

    To avoid dirtying the cache line, I use an unlikely value as the comparand/exchange, so most of the time, the comparison fails and no memory is written. (This trick doesn't help on all architectures, but it doesn't hurt.)

    Okay, back to the change counter example:

    LONG g_lColorChange;
    
    ...
    case WM_SYSCOLORCHANGE:
     InterlockedIncrement(&g_lColorChange);
     ...
    
    int CalculateSomethingAboutSystemColors()
    {
     LONG lColorChangeStart;
     do {
      lColorChangeStart = InterlockedReadAcquire(&g_lColorChange, -1);
      COLORREF clrWindow = GetSysColor(COLOR_WINDOW);
      COLORREF clrHighlight = GetSysColor(COLOR_HIGHLIGHT);
      ... other computations involving GetSysColor(...)
     } while (InterlockedReadRelease(
                           &g_lColorChange, -1) != lColorChangeStart);
     return iResult;
    }
    

    We capture the color change counter and then begin doing our calculations. We capture the value with acquire semantics so that we get the value before we start reading the system colors. When we're done, we compare the value of the change counter against the value we captured. If it's different, then that means that the colors changed while we were doing our calculations, so our calculations are all messed up. In that case, we go back and try it again.

    This technique does assume that you won't get into a situation where one thread manages to increment the change counter 4 billion times before the other thread manages to run. This is not a problem in practice. For example, in this case, it's reasonable to assume that nobody is going to change their system colors 4 billion times within a single thread quantum.

    Next time, I'll show a different variation on try/commit/abandon which might be suitable for simple caches.

    Exercise: Criticize the following: "I noticed that there is no interlocked read operation, but there is Interlocked­Or, so my plan is to perform an interlocked read by or'ing with zero."

  • The Old New Thing

    Holding down the shift key when right-clicking lets you pin things to the Start menu even when you might have been better off not doing so

    • 20 Comments

    Holding the shift key when calling up a context menu is a convention for indicating that you want to see additional advanced options which are normally hidden. One of those options is Pin to Start menu. What is this doing as an extended command?

    The Pin to Start menu command normally appears on the context menu of a program or a shortcut to a program. The Start menu pin list was intended to be a launch point for programs, so if the item you pick isn't actually a program, the Pin to Start menu item will be hidden. Furthermore, only items on the local hard drive will show Pin to Start menu. This avoids ugliness if the user accidentally pins a file on a network drive or removable media, and the network is down or the removable media is removed. (Or, if the removable media is a floppy drive or CD-ROM, freaking out the user by spinning up the drive to get the properties of the shortcut.)

    The shift key override was added as an escape hatch, like the double-Ctrl+Alt+Del, in case there was some situation discovered after Windows XP shipped where it would be reasonable to pin a non-program or a shortcut on network or removable media to the Start menu. For example, ClickOnce applications are not recognized as executables because their extension is not .exe but rather .appref-ms. But you can use the shift key override to pin them to your Start menu.

    With great power comes great responsibility: If you manually override the Pin to Start menu logic, then it's your problem if you pin something that ends up making your life miserable.

    Update: Upon closer examination, I've discovered that my characterization of what the shift key accomplishes was incorrect. The shift key does not let you force-pin files on the network or removable media. It only lets you force-pin files that aren't programs.

  • The Old New Thing

    Patterns for using the InitOnce functions

    • 14 Comments

    Since writing lock-free code is such a headache-inducer, you're probably best off making some other people suffer the headaches for you. And those other people are the kernel folks, who have developed quite a few lock-free building blocks so you don't have to. For example, there's a collection of functions for manipulating interlocked lists. But today we're going to look at the one-time initialization functions.

    The simplest version of the one-time initialization functions isn't actually lock-free, but it does implement the double-checked-lock pattern for you so you don't have to worry about the details. The usage pattern for the Init­Once­Execute­Once function is pretty simple. Here it is in its simplest form:

    int SomeGlobalInteger;
    
    BOOL CALLBACK ThisRunsAtMostOnce(
        PINIT_ONCE initOnce,
        PVOID Parameter,
        PVOID *Context)
    {
        calculate_an_integer(&SomeGlobalInteger);
        return TRUE;
    }
    
    void InitializeThatGlobalInteger()
    {
        static INIT_ONCE initOnce = INIT_ONCE_STATIC_INIT;
        InitOnceExecuteOnce(&initOnce,
                            ThisRunsAtMostOnce,
                            nullptr, nullptr);
    }
    

    In the simplest form, you give Init­Once­Execute­Once an INIT_ONCE structure (where it records its state) and a callback. If this is the first time that Init­Once­Execute­Once is called for a particular INIT_ONCE structure, it calls the callback. The callback can do whatever it likes, but presumably it's doing some one-time initialization. If another thread calls Init­Once­Execute­Once on the same INIT_ONCE structure, that other thread will wait until the first thread has finished its one-time execution.

    We can make this a tiny bit fancier by supposing that the calculation of the integer can fail.

    BOOL CALLBACK ThisSucceedsAtMostOnce(
        PINIT_ONCE initOnce,
        PVOID Parameter,
        PVOID *Context)
    {
        return SUCCEEDED(calculate_an_integer(&SomeGlobalInteger));
    }
    
    BOOL TryToInitializeThatGlobalInteger()
    {
        static INIT_ONCE initOnce = INIT_ONCE_STATIC_INIT;
        return InitOnceExecuteOnce(&initOnce,
                                   ThisSucceedsAtMostOnce,
                                   nullptr, nullptr);
    }
    

    If your initialization function returns FALSE, then the initialization is considered to have failed, and the next time somebody calls Init­Once­Execute­Once, it will try to initialize again.

    A slightly fancier use of the Init­Once­Execute­Once function takes advantage of the Context parameter. The kernel folks noticed that an INIT_ONCE structure in the "initialized" state has a lot of unused bits, and they've offered to let you use them. This is convenient if the thing you're initializing is a pointer to a C++ object, because it means that there's only one thing you need to worry about instead of two.

    BOOL CALLBACK AllocateAndInitializeTheThing(
        PINIT_ONCE initOnce,
        PVOID Parameter,
        PVOID *Context)
    {
        *Context = new(nothrow) Thing();
        return *Context != nullptr;
    }
    
    Thing *GetSingletonThing(int arg1, int arg2)
    {
        static INIT_ONCE initOnce = INIT_ONCE_STATIC_INIT;
        void *Result;
        if (InitOnceExecuteOnce(&initOnce,
                                AllocateAndInitializeTheThing,
                                nullptr, &Result))
        {
            return static_cast<Thing*>(Result);
        }
        return nullptr;
    }
    

    The final parameter to Init­Once­Execute­Once function receives the magic almost-pointer-sized data that the function will remember for you. Your callback function passes this magic value back through the Context parameter, and then Init­Once­Execute­Once gives it back to you as the Result.

    As before, if two threads call Init­Once­Execute­Once simultaneously on an uninitialized INIT_ONCE structure, one of them will call the initialization function and the other will wait.

    Up until now, we've been looking at the synchronous initialization patterns. They aren't lock-free: If you call Init­Once­Execute­Once and initialization of the INIT_ONCE structure is already in progress, your call will wait until that initialization attempt completes (either successfully or unsuccessfully).

    More interesting is the asynchronous pattern. Here it is, as applied to our Singleton­Manager exercise:

     SingletonManager(const SINGLETONINFO *rgsi, UINT csi)
                   : m_rgsi(rgsi), m_csi(csi),
                  m_rgio(new INIT_ONCE[csi]) {
       for (UINT iio = 0; iio < csi; iio++) {
        InitOnceInitialize(&m_rgio[iio]);
       }
     }
     ...
     // Array that describes objects we've created
     // runs parallel to m_rgsi
     INIT_ONCE *m_rgio;
    };
    
    ITEMCONTROLLER *SingletonManager::Lookup(DWORD dwId)
    {
     ... same as before until we reach the "singleton constructor pattern"
    
     void *pv = NULL;
     BOOL fPending;
     if (!InitOnceBeginInitialize(&m_rgio[i], INIT_ONCE_ASYNC,
                                  &fPending, &pv)) return NULL;
    
     if (fPending) {
      ITEMCONTROLLER *pic = m_rgsi[i].pfnCreateController();
      DWORD dwResult = pic ? 0 : INIT_ONCE_INIT_FAILED;
      if (InitOnceComplete(&m_rgio[i],
                           INIT_ONCE_ASYNC | dwResult, pic)) {
       pv = pic;
      } else {
       // lost the race - discard ours and retrieve the winner
       delete pic;
       InitOnceBeginInitialize(&m_rgio[i], INIT_ONCE_CHECK_ONLY,
                               &fPending, &pv);
      }
     }
     return static_cast<ITEMCONTROLLER *>(pv);
    }
    

    The pattern for asynchronous initialization is as follows:

    • Call Init­Once­Begin­Initialize in async mode.
    • If it returns fPending == FALSE, then initialization has already been performed and you can go ahead and use the result passed back in the final parameter.
    • Otherwise, initialization is pending. Do your initialization, but remember that since this is a lock-free algorithm, there can be many threads trying to initialize simultaneously, so you have to be careful how you manipulate global state. This pattern works best if initialization takes the form of creating a new object (because that means multiple threads performing initialization are each creating independent objects).
    • Call Init­Once­Complete with the result of your initialization.
    • If Init­Once­Complete succeeds, then you won the initialization race, and you're done.
    • If Init­Once­Complete fails, then you lost the initialization race and should clean up your failed initialization. In that case, you should call Init­Once­Begin­Initialize one last time to get the answer from the winner.

    It's conceptually simple; it just takes a while to explain. But at least now it's in recipe form.

    Exercise: Instead of calling Init­Once­Complete with INIT_ONCE_INIT_FAILED, what happens if the function simply returns without ever completing the init-once?

    Exercise: What happens if two threads try to perform asynchronous initialization and the first one to complete fails?

    Exercise: Combine the results of the first two exercises and draw a conclusion.
