May 2006

    Doing quick arithmetic from the command prompt

    The command processor CMD.EXE comes with a mini-calculator that can perform simple arithmetic on 32-bit signed integers:

    C:\>set /a 2+2
    4
    C:\>set /a 2*(9/2)
    8
    C:\>set /a (2*9)/2
    9
    C:\>set /a "31>>2"
    7
    

    Note that we had to quote the shift operator since it would otherwise be misinterpreted as a "redirect stdout and append" operator.

    For more information, type set /? at the command prompt.

    The alertable wait is the non-GUI analog to pumping messages

    When you are doing GUI programming, you well know that the message pump is the primary way of receiving and dispatching messages. The non-GUI analog to the message pump is the alertable wait.
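
    For reference, the GUI side of that analogy has this familiar shape. The following is a minimal sketch (not from the article), with error handling abbreviated:

    #include <windows.h>
    
    // Minimal sketch of the classic message pump: pull messages off the
    // queue and dispatch them to window procedures until WM_QUIT arrives.
    int RunMessagePump()
    {
     MSG msg;
     while (GetMessage(&msg, NULL, 0, 0) > 0) {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
     }
     return (int)msg.wParam; // exit code carried by WM_QUIT
    }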

    A user-mode APC is a request for a function to run on a thread in user mode. You can explicitly queue an APC to a thread with the QueueUserAPC function, or you can do it implicitly by passing a completion function to a waitable timer or asynchronous I/O. (That's why the code that indicates that a wait was interrupted by an APC is WAIT_IO_COMPLETION: Originally, the only thing that queued APCs was asynchronous I/O.)

    Of course, when an APC is queued, the function cannot run immediately. Imagine what the world would be like if it did: The function would interrupt the thread in the middle of whatever it was doing, possibly with unstable data structures, leaving the APC function to run at a point where the program is in an inconsistent state. If APCs really did run this way, it would be pretty much impossible to write a meaningful APC function since it couldn't reliably read from or write to any variables (since those variables could be unstable), nor could it call any functions that read from or wrote to variables. Given these constraints, there isn't much left for a function to do.

    Instead, APCs are dispatched when you perform what is known as an "alertable wait". The "Ex" versions of most wait functions (for example, WaitForSingleObjectEx, WaitForMultipleObjectsEx, MsgWaitForMultipleObjectsEx, and SleepEx) allow you to specify whether you want to wait alertably. If you do wait alertably, and one or more APCs are queued to the thread, then all the pending APCs are run and the wait operation returns with a code indicating that the wait was interrupted by an APC. If the APC you are waiting for has not yet run (maybe you were interrupted by some unrelated APC), then it is your responsibility to restart the wait and try again.
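
    To make this concrete, here is a minimal, self-contained sketch (not from the article) that queues an APC to the current thread and then waits alertably; the SayHello function and its context value of 42 are made up for illustration:

    #include <windows.h>
    #include <stdio.h>
    
    // Hypothetical APC function, just for illustration. It runs on the
    // thread it was queued to, but only while that thread waits alertably.
    void CALLBACK SayHello(ULONG_PTR context)
    {
     printf("APC ran, context = %u\n", (unsigned)context);
    }
    
    int main()
    {
     // Queue an APC to the current thread. Nothing visible happens yet.
     QueueUserAPC(SayHello, GetCurrentThread(), 42);
    
     // The APC runs only once we wait alertably. SleepEx returns
     // WAIT_IO_COMPLETION to say the sleep was interrupted by an APC.
     DWORD result = SleepEx(INFINITE, TRUE);
     printf("SleepEx returned %lu\n", result);
     return 0;
    }
    
    If you pass FALSE instead of TRUE, the wait is not alertable: the APC stays queued and the sleep never wakes up.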

    Why doesn't the operating system automatically restart the wait? "Imagine what the world would be like if it did": Suppose you want to issue asynchronous I/O and then go off and do some other stuff, and then wait for the asynchronous I/O to complete so you can use the result.

    // When an asynchronous read completes, we fire off the next
    // read request. When all done, set fCompleted.
    BOOL fCompleted = FALSE;
    BOOL fSuccess;
    void CALLBACK CompletionRoutine(DWORD, DWORD, LPOVERLAPPED)
    {
     if (<finished>) {
      fSuccess = TRUE;
      fCompleted = TRUE;
     } else {
      // issue the next read in the sequence
      if (!ReadFileEx(hFile, ..., CompletionRoutine)) {
       fSuccess = FALSE; // problem occurred
       fCompleted = TRUE; // we're done
      }
     }
    }
    ...
    // start the read cycle
    if (ReadFileEx(hFile, ..., CompletionRoutine)) {
      DoOtherStuffInTheMeantime();
      <wait for fCompleted to be set>
      DoStuffWithResult();
    }
    

    How would you write the "wait for fCompleted to be set" if the operating system auto-restarted waits? If you did an alertable infinite wait, say with SleepEx(INFINITE, TRUE), then the APCs would run, the operating system would auto-restart the waits, and the sleep would just run forever. You would be forced to use a non-INFINITE sleep and poll for the completion. But this has two serious flaws: One is that polling is bad. The second is that the rate at which you poll controls how quickly your program reacts to the completion of the read chain. Higher polling rates give you better responsiveness but consume more CPU.

    Fortunately, waits are not auto-restarted. This gives you a chance to decide for yourself whether you want to restart them or not:

    ...
    // start the read cycle
    if (ReadFileEx(hFile, ..., CompletionRoutine)) {
      DoOtherStuffInTheMeantime();
      while (!fCompleted) SleepEx(INFINITE, TRUE);
      DoStuffWithResult();
    }
    

    The SleepEx loop just keeps waiting alertably, processing APCs, until the completion routine finally decides that it's had enough and sets the fCompleted flag.

    A cache with a bad policy is another name for a memory leak

    A common performance trick is to reduce time spent in the heap manager by caching the last item freed (or maybe the last few) so that a subsequent allocation can just re-use the item rather than having to go make a new one. But you need to be careful how you do this or you can end up making things worse rather than better. Here's an example motivated by an actual problem the Windows performance team researched.

    Consider a cache of variable-sized buffers. I will use only a one-entry cache for simplicity. In real life, the cache would be more complicated: People tend to have a deeper cache of four to ten entries, and you would have to ensure that only one thread used the cache at a time; typically this is done by associating the cache with something that has thread affinity. Furthermore, you probably would keep the size of the cached buffer in a member variable instead of calling LocalSize all the time. I've left out all these complications to keep the presentation simple.

    class BufferCache {
    public:
     BufferCache() : m_pCache(NULL) { }
     ~BufferCache() { LocalFree(m_pCache); }
    
     void *GetBuffer(SIZE_T cb);
     void ReturnBuffer(void *p);
    
    private:
     void *m_pCache;
    };
    

    If a request for a memory buffer arrives and it can be satisfied from the cache, then the cached buffer is returned. Otherwise, a brand new buffer is allocated.

    void *BufferCache::GetBuffer(SIZE_T cb)
    {
     // Satisfy from cache if possible
     if (m_pCache && LocalSize(m_pCache) >= cb) {
      void *p = m_pCache;
      m_pCache = NULL;
      return p;
     }
     return LocalAlloc(LMEM_FIXED, cb);
    }
    

    When a buffer is returned to the cache, we compare it against the item already in the cache and keep the bigger one, since that is more likely to satisfy a GetBuffer in the future. (In the general case of a multiple-entry cache, we would free the smallest entry.)

    // Flawed design - see discussion
    void BufferCache::ReturnBuffer(void *p)
    {
     SIZE_T cb = LocalSize(p);
     if (!m_pCache || cb > LocalSize(m_pCache)) {
      // Returned buffer is bigger than the cache:
      // Keep the returned buffer
      LocalFree(m_pCache);
      m_pCache = p;
     } else {
      // Returned buffer is smaller than the cache:
      // Keep the cache
      LocalFree(p);
     }
    }
    

    Why is this a flawed design? I'll let you think about this for a while.

    No really, I want you to think about it.

    Are you thinking? Take your time; I'll be here when you're done.

    Okay, since I know you haven't actually thought about it but are just sitting there waiting for me to tell you, I'll give you a bit of a nudge.

    The distribution of buffer sizes is rarely uniform. The most common distribution is that small buffers are popular, with larger and larger buffers being required less and less often. Let's write a sample program that allocates and frees memory according to this pattern. To make the bad behavior easier to spot in a short run, I'm going to use a somewhat flat distribution and say that half of the buffers are small, with larger buffers becoming less popular according to exponential decay. In practice, the decay curve is usually much, much steeper.

    #include <windows.h>  // LocalAlloc, LocalSize, LocalFree
    #include <cstdlib>    // rand
    #include <vector>
    #include <iostream>
    
    // Since this is just a quick test, we're going to be sloppy
    using namespace std; // sloppy
    int __cdecl main(int argc, char **argv)
    {
     BufferCache b;
    
     // seeding the random number generator is not important here
    
     vector<void *> v; // keeps track of allocated memory
     for (;;) {
      // randomly allocate and free
      if (v.size() == 0 || (rand() & 1)) { // allocate
       SIZE_T cb = 100;
       while (cb < 1024 * 1024 && (rand() & 1)) {
        cb *= 2; // exponential decay distribution up to 1MB
       }
       void* p = b.GetBuffer(cb);
       if (p) {
        cout << " A" << LocalSize(p) << "/" << cb;
        v.push_back(p);
       }
      } else { // free
       int victim = rand() % v.size(); // choose one at random
       cout << " F" << LocalSize(v[victim]);
       b.ReturnBuffer(v[victim]); // free it
       v[victim] = v.back();
       v.pop_back();
      }
     }
    }
    

    This short program randomly allocates and frees memory from the buffer cache, printing (rather cryptically) the size of the blocks allocated and freed. When memory is allocated, it prints "A1/2" where "1" is the size of the block actually allocated and "2" is the size requested. When freeing memory, it prints "F3" where "3" is the size of the block being freed. Run this program, let it do its thing for maybe ten, fifteen seconds, then pause the output and study it. I'll wait. If you're too lazy to actually compile and run the program, I've included some sample output for you to study:

    F102400 A102400/400 F800 F200 A800/100 A200/200 A400/400
    A400/400 A200/200 F1600 A1600/100 F100 F800 F25600 A25600/200
    F12800 A12800/200 F200 F400 A400/100 F200 A200/100 A200/200
    A100/100 F200 F3200 A3200/400 A200/200 F51200 F800 F25600
    F1600 F1600 A51200/100 F100 A100/100 F3200 F200 F409600 F100
    A409600/400 A100/100 F200 F3200 A3200/800 A400/400 F800 F3200
    F200 F12800 A12800/200 A100/100 F200 F25600 F400 F6400
    A25600/100 F100 F200 F400 F200 F800 F400 A800/800 A100/100
    

    Still waiting.

    Okay, maybe you don't see it. Let's make the effect even more obvious by printing some statistics periodically. Of course, to generate the statistics, we need to keep track of them, so we'll have to remember how big the requested buffer was (which we'll do in the buffer itself):

    int __cdecl main(int argc, char **argv)
    {
     BufferCache b;
    
     // seeding the random number generator is not important here
    
     vector<void *> v; // keeps track of allocated memory
     SIZE_T cbAlloc = 0, cbNeeded = 0;
     for (int count = 0; ; count++) {
      // randomly allocate and free
      if (v.size() == 0 || (rand() & 1)) { // allocate
       SIZE_T cb = 100;
       while (cb < 1024 * 1024 && (rand() & 1)) {
        cb *= 2; // exponential decay distribution up to 1MB
       }
       void* p = b.GetBuffer(cb);
       if (p) {
        *(SIZE_T*)p = cb;
        cbAlloc += LocalSize(p);
        cbNeeded += cb;
        v.push_back(p);
       }
      } else { // free
       int victim = rand() % v.size(); // choose one at random
       cbAlloc -= LocalSize(v[victim]);
       cbNeeded -= *(SIZE_T*)v[victim];
       b.ReturnBuffer(v[victim]); // free it
       v[victim] = v.back();
       v.pop_back();
      }
      if (count % 100 == 0) {
       cout << count << ": " << v.size() << " buffers, "
            << cbNeeded << "/" << cbAlloc << "="
            << cbNeeded * 100.0 / cbAlloc << "% used" << endl;
      }
     }
    }
    

    This new version keeps track of how many bytes were allocated as opposed to how many were actually needed, and prints a summary of those statistics every hundred allocations. Since I know you aren't actually going to run it yourself, I've run it for you. Here is some sample output:

    0: 1 buffers, 400/400=100% used
    100: 7 buffers, 4300/106600=4.03377% used
    200: 5 buffers, 1800/103800=1.7341% used
    300: 19 buffers, 9800/115800=8.46287% used
    400: 13 buffers, 5100/114000=4.47368% used
    500: 7 buffers, 2500/28100=8.8968% used
    ...
    37200: 65 buffers, 129000/2097100=6.15135% used
    37300: 55 buffers, 18100/2031400=0.891011% used
    37400: 35 buffers, 10400/2015800=0.515924% used
    37500: 43 buffers, 10700/1869100=0.572468% used
    37600: 49 buffers, 17200/1874000=0.917823% used
    37700: 75 buffers, 26000/1889900=1.37573% used
    37800: 89 buffers, 30300/1903100=1.59214% used
    37900: 91 buffers, 29600/1911900=1.5482% used
    

    By this point, the problem should be obvious: We're wasting insane quantities of memory. For example, after step 37900, we've allocated 1.8MB of memory when we needed only 30KB, for a waste of over 98%.

    How did we go horribly wrong?

    Recall that most of the time, the buffer being allocated is a small buffer, and most of the time, a small buffer is freed. But it's the rare case of a large buffer that messes up everything. The first time a large buffer is requested, it can't come from the cache, since the cache has only small buffers, so it must be allocated. And when it is returned, it is kept, since the cache keeps the largest buffer.

    The next allocation comes in, and it's probably one of the common-case small buffers, and it is given the cached buffer, which is big. You're wasting a big buffer on something that needs only 100 bytes. Some time later, another rare big-buffer request comes in, and since that earlier big buffer got wasted on a small allocation, you have to allocate a new big buffer. You've now allocated two big buffers even though you need only one. Since big buffers are rare, it is unlikely that a big buffer will be given to a caller that actually needs a big buffer; it is much more likely to be given to a caller that needs only a small buffer.

    Bad effect 1: Big buffers get wasted on small callers.

    Notice that once a big buffer enters the system, it is hard to get rid of, since a returned big buffer will be compared against what is likely to be a small buffer, and the small buffer will lose.

    Bad effect 2: Big buffers rarely go away.

    The only way a big buffer can get freed is if the buffer in the cache is itself already a big buffer. If instead of a one-entry cache like we have here, you keep, say, ten buffers in your buffer cache, then in order to free a big buffer, you have to have eleven consecutive ReturnBuffer calls, all of which pass a big buffer.

    Bad effect 3: The more efficient you try to make your cache, the more wasteful it gets!

    What's more, when that eleventh call to ReturnBuffer is made with a big buffer, it is only the smallest of the big buffers that gets freed. The biggest buffers stay.

    Bad effect 4: When a big buffer does go away, it's only because you are keeping an even bigger buffer!
    Corollary: The biggest buffer never gets freed.

    What started out as an "obvious" decision in choosing which buffer to keep has turned into a performance disaster. By favoring big buffers, you allowed them to "poison" the cache, and the longer you let the system run, the more allocations end up being big "poisoned" buffers. It doesn't matter how rare those big blocks are; you will eventually end up in this state. It's just a matter of time.

    When the performance team tries to explain this problem to people, many of them get the mistaken impression that the problem is merely that there is wasted space in the cache. But look at our example: Our cache has only one entry and we are still wasting over 90% of the memory. That's because the waste is not in the memory being held by the cache, but rather is in the memory that the cache hands out. (It's sort of like that scene in It's a Wonderful Life where George Bailey is explaining where all the money is. It's not in the bank; it's in all the places that got money from the bank.)

    My recommendations:

    • Instrument your cache and understand what your program's memory allocation patterns are.
    • Use that information to pick a size cutoff point beyond which you simply will not use the cache at all. This ensures that big buffers never get into the cache in the first place. Choosing this cutoff point is usually extremely easy once you look at the allocation histogram.
    • Although you've taken the big buffers out of the picture, you will still have the problem that the small buffers will gradually grow up to your cutoff size. (I.e., you still have the same problem, just in miniature.) Therefore, if the cache is full, you should just free the most recently returned buffer regardless of its size.
    • Do not use the cached buffer if the waste is too great. You might decide to use multiple "buckets" of cached entries, say one for buffers below 100 bytes, another for buffers between 100 and 200 bytes, and so on. That way, the waste per allocation is never more than 100 bytes. (A sketch of this bucketed approach appears after the results below.)
    • Finally, reinstrument your cache to ensure that you're not suffering from yet some other pathological behavior that I haven't taken into account.

    Here's a new ReturnBuffer implementation that takes some of the above advice into account. Instrumentation shows that three quarters of the allocations are in the 100–200 byte range, so let's cap our cache at 200 bytes.

    void BufferCache::ReturnBuffer(void *p)
    {
     if (m_pCache == NULL && LocalSize(p) <= 200) {
      m_pCache = p;
     } else {
      LocalFree(p);
     }
    }
    

    With this one seemingly-minor change, our efficiency stays above 90% and occasionally even gets close to 100%:

    0: 1 buffers, 400/400=100% used
    100: 7 buffers, 4300/4400=97.7273% used
    200: 5 buffers, 1800/1800=100% used
    300: 19 buffers, 9800/9800=100% used
    400: 13 buffers, 5100/5100=100% used
    500: 7 buffers, 2500/2600=96.1538% used
    ...
    37200: 65 buffers, 129000/130100=99.1545% used
    37300: 55 buffers, 18100/18700=96.7914% used
    37400: 35 buffers, 10400/11000=94.5455% used
    37500: 43 buffers, 10700/11000=97.2727% used
    37600: 49 buffers, 17200/18000=95.5556% used
    37700: 75 buffers, 26000/26800=97.0149% used
    37800: 89 buffers, 30300/31900=94.9843% used
    37900: 91 buffers, 29600/30600=96.732% used
    

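    For completeness, the "buckets" suggestion from the recommendation list can be sketched in the same style. None of this is from the original article: the BucketCache name, the 100-byte bucket width, and the ten-bucket (1000-byte) cutoff are arbitrary choices for illustration, and the same simplifications apply (no locking, at most one cached buffer per bucket).

    class BucketCache {
    public:
     BucketCache() { for (int i = 0; i < NBUCKETS; i++) m_rgpCache[i] = NULL; }
     ~BucketCache() { for (int i = 0; i < NBUCKETS; i++) LocalFree(m_rgpCache[i]); }
    
     void *GetBuffer(SIZE_T cb)
     {
      if (cb > 0 && cb <= NBUCKETS * BUCKETSIZE) {
       int i = (int)((cb + BUCKETSIZE - 1) / BUCKETSIZE) - 1;
       // Satisfy from this size's bucket if the cached buffer is big enough
       if (m_rgpCache[i] && LocalSize(m_rgpCache[i]) >= cb) {
        void *p = m_rgpCache[i];
        m_rgpCache[i] = NULL;
        return p;
       }
       cb = (i + 1) * BUCKETSIZE; // round up to the top of the bucket
      }
      return LocalAlloc(LMEM_FIXED, cb);
     }
    
     void ReturnBuffer(void *p)
     {
      SIZE_T cb = LocalSize(p);
      // Cache only buffers at or below the cutoff, at most one per bucket
      if (cb > 0 && cb <= NBUCKETS * BUCKETSIZE) {
       int i = (int)((cb - 1) / BUCKETSIZE);
       if (!m_rgpCache[i]) {
        m_rgpCache[i] = p;
        return;
       }
      }
      LocalFree(p);
     }
    
    private:
     static const SIZE_T BUCKETSIZE = 100; // waste per allocation < 100 bytes
     static const int NBUCKETS = 10;       // cutoff: cache nothing above 1000 bytes
     void *m_rgpCache[NBUCKETS];
    };
    
    Rounding each request up to the top of its bucket means a returned buffer can satisfy any later request in the same bucket (ignoring any extra rounding the heap itself does), so the waste per allocation stays below the bucket width, and the cutoff keeps big buffers out of the cache entirely.
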
    Don't forget to check out performance guru Rico Mariani's reminder that Caching implies Policy. As he explained to me, "Cache policy is everything so you must be dead certain that your policy is working as you intended. A cache with a bad policy is another name for a memory leak."

    Tips from an American on driving in Taiwan

    Although I have been a passenger in a car many times, I thank my lucky stars that I have never had to be the person behind the wheel in Taiwan. But if you decide you want to give it a shot, you might want to pick up some driving tips from an American who spent time in Taiwan as an English teacher, part of his Teaching English in Taiwan site. My conclusion is that one should simply avoid driving there entirely.

    The text is largely disconnected from the pictures, but that's okay. The pictures just bring back memories of Taiwan in both its scenic and not-so-scenic glory.

    In the section on learning Chinese, he remarks:

    Most foreigners go through a difficult phase after they know some Chinese, in which they think they are speaking Chinese, but actually, because their tones are unclear, they are speaking gibberish which the locals literally cannot understand. ...

    You will know that your Chinese tones have arrived when you hear another foreigner speaking during this phase of her learning, and you discover that you can't understand a word they are saying.

    I guess that's good news for me, as my sense for Mandarin tones didn't take long to develop (having grown up with a tonal language, albeit a different one). Too bad my vocabulary and grammar are still effectively nonexistent. Although I can pronounce each of the tones, I often just plain forget which tone a word is in! (My cousin lent me some textbooks that use a different system of representing tones: instead of treating the tone as an add-on, the system incorporates the tone into the spelling of the word. Words therefore have no superscripts or accent marks. This may actually stick; I'll have to try it out and see.)

    What's so special about bitmaps and DCs?

    You can select pens, brushes, fonts, and bitmaps into a DC with the SelectObject function, and from this list, bitmaps are special, because, if you look carefully, they are the only modifiable objects on the list. Pens, brushes, and fonts cannot be modified once they are created.

    But bitmaps, oh, bitmaps. A bitmap selected into a DC changes as you draw into it. If a bitmap could be selected into multiple DCs at once, then drawing into it through one DC would secretly change what the other DC sees, which isn't a very nice thing to do to a DC.

    So let's see, you can select pens, brushes, and fonts into multiple DCs, but you can't do it with bitmaps. Coincidence? You make the call.
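
    To make that concrete, here is a minimal sketch (not from the article) of the usual pattern; the DrawIntoBitmap helper and its parameters are made up for illustration:

    #include <windows.h>
    
    // Minimal sketch: draw into a bitmap through a memory DC, then
    // select the bitmap back out when finished.
    void DrawIntoBitmap(HDC hdcRef, HBITMAP hbm)
    {
     HDC hdcMem = CreateCompatibleDC(hdcRef);
     if (hdcMem) {
      // This fails (returns NULL) if hbm is already selected into another
      // DC; GDI enforces the one-DC-at-a-time rule for bitmaps.
      HBITMAP hbmPrev = (HBITMAP)SelectObject(hdcMem, hbm);
      if (hbmPrev) {
       PatBlt(hdcMem, 0, 0, 10, 10, BLACKNESS); // draw through hdcMem
    
       // Select the old bitmap back in so hbm is free to be used elsewhere.
       SelectObject(hdcMem, hbmPrev);
      }
      DeleteDC(hdcMem);
     }
    }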
