• The Old New Thing

    The psychology of confirmation, or something, I don't know what to call it

    • 46 Comments

    There is probably a name for this phenomenon. I will illustrate it below.

    "Is there a way to configure the system to do X?"

    Go to the Y dialog and select Z.

    "It doesn't work."

    I just tried it. It works for me. I'm using ⟨configuration details⟩.

    "Thanks. It's working."

  • The Old New Thing

    Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result

    • 20 Comments
    Suppose you want to multiply two double-word values producing a quad-word result, but your processor supports only single-word multiplication with a double-word result. For concreteness, let's say that your processor supports 32 × 32 → 64 multiplication and you want to implement 64 × 64 → 128 multiplication. (Sound like any processor you know?)

    Oh boy, let's do some high school algebra. Let's start with unsigned multiplication.

    Let x = A × 2³² + B and y = C × 2³² + D, where A, B, C, and D are all in the range 0 … 2³² − 1.

    x × y AC × 264 + (AD + BC) × 232 + BD
    AC × 264 + BD  +  (AD + BC) × 232
    provisional result cross-terms

    Each of the multiplications (not counting the power-of-two multiplications) is a 32 × 32 → 64 multiplication, so they are something we have as a building block. And the basic implementation is simply to perform the four multiplications and add the pieces together. But if you have SSE, you can perform two multiplies in a single instruction.

        // Prepare our source registers
        movq xmm0, x         // xmm0 = { 0, 0, A, B } = { *, *, A, B }
        movq xmm1, y         // xmm1 = { 0, 0, C, D } = { *, *, C, D }
        punpckldq xmm0, xmm0 // xmm0 = { A, A, B, B } = { *, A, *, B }
        punpckldq xmm1, xmm1 // xmm1 = { C, C, D, D } = { *, C, *, D }
        pshufd xmm2, xmm1, _MM_SHUFFLE(2, 0, 3, 1)
                             // xmm2 = { D, D, C, C } = { *, D, *, C }
    

    The PMULUDQ instruction multiplies 32-bit lanes 0 and 2 of its source and destination registers, producing 64-bit results. The values in lanes 1 and 3 do not participate in the multiplication, so it doesn't matter what we put there. It so happens that the PUNPCKLDQ instruction duplicates the value, but we really don't care. I used * to represent a don't-care value.

        pmuludq xmm1, xmm0 // xmm1 = { AC, BD } // provisional result
        pmuludq xmm2, xmm0 // xmm2 = { AD, BC } // cross-terms
    

    In two PMULUDQ instructions, we created the provisional result and the cross-terms. Now we just need to add the cross-terms to the provisional result. Unfortunately, SSE does not have a 128-bit addition (or at least SSE2 doesn't; who knows what they'll add in the future), so we need to do that the old-fashioned way.

        movdqa result, xmm1
        movdqa crossterms, xmm2
    
        mov    eax, crossterms[0]
        mov    edx, crossterms[4] // edx:eax = BC
        add    result[4], eax
        adc    result[8], edx
        adc    result[12], 0      // add the first cross-term
    
        mov    eax, crossterms[8]
        mov    edx, crossterms[12] // edx:eax = AD
        add    result[4], eax
        adc    result[8], edx
        adc    result[12], 0      // add the second cross-term
    

    There we go, a 64 × 64 → 128 multiply constructed from 32 × 32 → 64 multiplies.

    Exercise: Why didn't I use the rax register to perform the 64-bit addition? (This is sort of a trick question.)

    That calculates an unsigned multiplication, but how do we do a signed multiplication? Let's work modulo 2128 so that signed and unsigned multiplication are equivalent. This means that we need to expand x and y to 128-bit values X and Y.

    Let s = the sign bit of x expanded to a 64-bit value, and similarly t = the sign bit of y expanded to a 64-bit value. In other words, s is 0xFFFFFFFF`FFFFFFFF if x < 0 and zero if x ≥ 0.

    The 128-bit values being multiplied are

    X s × 264 + x
    Y t × 264 + y

    The product is therefore

    X × Y st × 2128  (sy + tx) × 264   +  xy
    zero adjustment unsigned product

    The first term is zero because it overflows the 128-bit result. That leaves the second term as the adjustment, which simplifies to "If x < 0 then subtract y from the high 64 bits; if y < 0 then subtract x from the high 64 bits."

        if (x < 0) result.m128i_u64[1] -= y;
        if (y < 0) result.m128i_u64[1] -= x;
    

    If we were still playing with SSE, we could compute this as follows:

        movdqa xmm0, result   // xmm0 = { high, low }
        movq   xmm1, x        // xmm1 = { 0, x }
        movq   xmm2, y        // xmm2 = { 0, y }
        pshufd xmm3, xmm1, _MM_SHUFFLE(1, 1, 3, 2) // xmm3 = { xhi, xhi, 0, 0 }
        pshufd xmm1, xmm1, _MM_SHUFFLE(1, 0, 3, 2) // xmm1 = { x, 0 }
        pshufd xmm4, xmm2, _MM_SHUFFLE(1, 1, 3, 2) // xmm4 = { yhi, yhi, 0, 0 }
        pshufd xmm2, xmm2, _MM_SHUFFLE(1, 0, 3, 2) // xmm2 = { y, 0 }
        psrad  xmm3, 31       // xmm3 = { s, s, 0, 0 } = { s, 0 }
        psrad  xmm4, 31       // xmm4 = { t, t, 0, 0 } = { t, 0 }
        pand   xmm3, xmm2     // xmm3 = { x < 0 ? y : 0, 0 }
        pand   xmm4, xmm1     // xmm4 = { y < 0 ? x : 0, 0 }
        psubq  xmm0, xmm3     // first adjustment
        psubq  xmm0, xmm4     // second adjustment
        movdqa result, xmm0   // update result
    

    The code is a bit strange because SSE2 doesn't have a full set of 64-bit integer opcodes. We would have liked to have used a psraq instruction to fill a 64-bit field with a sign bit. But there is no such instruction, so instead we duplicate the 64-bit sign bit into two 32-bit sign bits (one in lane 2 and one in lane 3) and then fill the lanes with that bit using psrad.

  • The Old New Thing

    Killing a window timer prevents the WM_TIMER message from being generated for that timer, but it doesn't retroactively remove ones that were already generated

    • 7 Comments

    Calling Kill­Timer to cancel a window timer prevents WM_TIMER messages from being generated for that timer, even if one is overdue. In other words, give this sequence of operations:

    SetTimer(hwnd, IDT_MYTIMER, 1000, NULL);
    Sleep(2000);
    KillTimer(hwnd, IDT_MYTIMER);
    

    no WM_TIMER message is ever generated. Even though a timer became due during the Sleep, no timer message was generated during the sleep because timer messages are generated on demand, and nobody demanded one. Killing the timer then removes the ability to demand a timer message, and the result is that no message ever appears.

    In general, this means that once you kill a timer, you will not receive any WM_TIMER messages for that timer.

    Unless you demanded one while the timer was active and didn't process it.

    Let's try a variation:

    SetTimer(hwnd, IDT_MYTIMER, 1000, NULL);
    Sleep(2000);
    if (PeekMessage(&msg, NULL, WM_TIMER, WM_TIMER, 0)) {
     DispatchMessage(&msg);
    }
    KillTimer(hwnd, IDT_MYTIMER);
    

    In this case, the Peek­Message function looks for a WM_TIMER message in the queue, and if none is found, it asks for one to be generated on the fly if a timer is due. It so happens that one is due (IDT_MY­TIMER), so the Peek­Message causes a WM_TIMER to be generated and placed in the queue. But it doesn't remain in this state for long, because the message is removed from the queue by the Peek­Message function.

    Okay, now let's make things weird:

    SetTimer(hwnd, IDT_MYTIMER, 1000, NULL);
    Sleep(2000);
    if (PeekMessage(&msg, NULL, WM_TIMER, WM_TIMER, PM_NOREMOVE)) {
     // oh hey there is an overdue timer, how about that
    }
    KillTimer(hwnd, IDT_MYTIMER);
    

    This time, we passed the PM_NO­REMOVE flag. The window manager goes through the same process as before, first looking for a WM_TIMER message in the queue, and then failing to find one, generates one on the fly since the IDT_MY­TIMER timer is overdue. But the PM_NO­REMOVE flag makes things weird because it says, "Thanks for generating that message for me. But don't remove it from the queue. Leave it there. I'll deal with it later."

    You might do this if you want to stop processing if a timer elapses, but you don't want to handle the timer immediately because you are in some sensitive state at the point you realize that you need to stop processing. Instead, you want to return back out to the main message loop and let it deal with the timer.

    BOOL DoWorkUntilTheNextTimer()
    {
     BOOL fFinished = FALSE;
     MSG msg;
     PrepareToDoWork();
     while (!PeekMessage(&msg, NULL, WM_TIMER, WM_TIMER, PM_NOREMOVE)) {
      if (AnyWorkLeft()) DoSomeWork(); 
      else { fFinished = TRUE; break; }
     }
     CleanUpAfterDoingWork();
     return fFinished;
    }
    

    And then you might call it like this:

    void DoWorkForUpToOneSecond()
    {
     SetTimer(hwnd, IDT_MYTIMER, 1000, NULL);
     DoWorkUntilTheNextTimer();
     KillTimer(hwnd, IDT_MYTIMER);
    }
    

    The Kill­Timer will prevent any new timer messages from being generated for IDT_MY­TIMER, but it does not go back in time and retroactively un-generate the timer message that was generated when Do­Work­Until­The­Next­Timer asked to see if there were any timer messages.

    You are now in the strange situation where a subsequent call to Peek­Message or Get­Message will retrieve a timer message for a timer that is no longer active!

    This is captured in the MSDN documentation with the simple sentence, "The Kill­Timer function does not remove WM_TIMER messages already posted to the message queue."

  • The Old New Thing

    If my WM_TIMER handler takes longer than the timer period, will my queue fill up with WM_TIMER messages?

    • 15 Comments

    A customer was worried that they may have a problem with their message queue filling with WM_TIMER messages. "If my WM_TIMER handler takes longer than the timer period, will my queue fill up with WM_TIMER messages?"

    As we should know by now, timer messages are generated on demand:

    The WM_TIMER message is a low-priority message. The Get­Message and Peek­Message functions post this message only when no other higher-priority messages are in the thread's message queue.

    Here's the basic algorithm. (I'm ignoring filtering and I'm assuming that messages are removed.)

    • Look for a posted message. If one exists, then return it.
    • Was Post­Quit­Message called? If so, then generate and return a WM_QUIT message.
    • Look for an input message. If one exists, then return it.
    • Did the mouse move since the last call? If so, then generate and return a WM_MOUSE­MOVE message.
    • Does a window need to be repainted? If so, then generate and return a WM_PAINT message.
    • Is there a timer that has elapsed? If so, then generate and return a WM_TIMER message.

    Notice that the generated messages are generated on demand by message retrieval functions. If you never call a message retrieval function, then no messages are generated. And in the case where the messages are removed (i.e., you use Get­Message or you use Peek­Message with PM_REMOVE), the messages are removed immediately after being generated, so they don't hang around very long at all.

    In particular, if your WM_TIMER handler takes longer than the timer period, and it doesn't call a message retrieval function, then there is no opportunity for another WM_TIMER message to be generated. Only when you call a message retrieval function does there become a possibility for a WM_TIMER message to be generated.

  • The Old New Thing

    What happens if I don't paint when I get a WM_PAINT message?

    • 19 Comments

    Suppose your window procedure doesn't paint when it gets a WM_PAINT message. What happens?

    It depends on how you don't paint.

    If you have an explicit handler for the WM_PAINT message that does nothing but return without painting, then the window manager will turn around and put a new WM_PAINT message in your queue. "And try harder this time." Remember that the rules for the WM_PAINT message are that the window manager will generate a WM_PAINT message for any window that has a dirty region. If you fail to remove the dirty region in your WM_PAINT message handler, well, then the rules state that you get another WM_PAINT message. (The most common way of clearing the dirty region is to call Begin­Paint, but there are other less common ways, like Validate­Rect or Redraw­Window with the RDW_VALIDATE flag.)

    The other case is that you simply don't have a WM_PAINT handler and let the message fall through to Def­Window­Proc. In that case, Def­Window­Proc will do a blank paint for you. In other words, Def­Window­Proc contains the logical equivalent of

    case WM_PAINT:
     {
      PAINTSTRUCT ps;
      if (BeginPaint(hwnd, &ps))
       EndPaint(hwnd, &ps);
     }
     return 0;
    

    In the case where you pass the WM_PAINT to Def­Window­Proc, the dirty region is cleared because Def­Window­Proc will call Begin­Paint for you.

    There are some quirks in the handling of the WM_PAINT message by the Def­Window­Proc function to handle various application compatibility cases, but the above is the basic idea.

    To avoid tripping over the weird application compatibility cases, decide up front how you want to deal with WM_PAINT messages delivered to your window procedure.

    • Handle them completely by calling Begin­Paint and End­Paint, then returning 0. (Do not pass the message to Def­Window­Proc.)
    • Pass them all to Def­Window­Proc, and let it do the Begin­Paint and End­Paint.

    Don't try playing fancy games like "Oh, I'm going to call Begin­Paint and End­Paint, but sometimes I'm also going to pass the message to Def­Window­Proc afterwards." Just pick one plan and stick to it. It's a lot simpler for everybody that way.

  • The Old New Thing

    Microspeak: Redlines

    • 30 Comments

    To the outside world, redline can mean to mark something for removal, or it could mean the maximum safe speed of an engine. But in the world of Microsoft design, the term redlines (pronounced as if it were written as the two words red lines, but the accent is on the red) refers to a diagram showing the exact position of visual elements. They typically take the form of a proposed screen shot, with arrows and lines superimposed to indicate the distances between items, which items align with each other, and so on. They also contain indications as to the exact colors to use for different elements.

    Originally the lines and arrows were actually red, hence the name. Here's an example of something that gives you the idea:

    These aren't real redlines because the diagram doesn't contain any indications about the colors to use, and more complete redlines would include diagrams showing the hover, pressed, and disabled states.

  • The Old New Thing

    Counting array elements which are below a particular limit value using SSE

    • 12 Comments

    Some time ago, we looked at how doing something can be faster than not doing it. That is, we observed the non-classical effect of the branch predictor. I took the branch out of the inner loop, but let's see how much further I can push it.

    The trick I'll employ today is using SIMD in order to operate on multiple pieces of data simultaneously. Take the original program and replace the count­them function with this one:

    int countthem(int boundary)
    {
     __m128i xboundary = _mm_cvtsi32_si128(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000; i++) {
      __m128i value =  _mm_cvtsi32_si128(array[i]);
      __m128i test = _mm_cmplt_epi32(value, xboundary);
      count = _mm_sub_epi32(count, test);
     }
     return _mm_cvtsi128_si32(count);
    }
    

    Now, this program doesn't actually use any parallel operations, but it's our starting point. For each 32-bit value, we load it, compare it agains the boundary value, and accumulate the result. The _mm_cmplt_epi32 function compares the four 32-bit integers in the first parameter against the four 32-bit integers in the second parameter, producing four new 32-bit integers. Each of the new 32-bit integers is 0xFFFFFFFF if the corresponding first parameter is less than the second, or it is 0x00000000 if it is greater than or equal.

    In this case, we loaded up the value we care about, then compare it against the boundary value. The result of the comparison is either 32 bits of 0 (for false) or 32 bits of 1 (for true), so this merely sets test equal to 0xFFFFFFFF if the value is less than the boundary; otherwise 0x0000000. Since 0xFFFFFFFF is the same as a 32-bit -1, we subtract the value so that the count goes up by 1 if the value is less than the boundary.

    Finally, we convert back to a 32-bit integer and return it.

    With this change, the running time drops from 2938 time units to 2709, an improvement of 8%.

    So far, we have been using only the bottom 32 bits of the 128-bit XMM registers. Let's turn on the parallelism.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i++) {
      __m128i value = _mm_loadu_si128(&xarray[i]);
      __m128i test = _mm_cmplt_epi32(value, xboundary);
      count = _mm_sub_epi32(count, test);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    We take our 32-bit integers and put them in groups of four, so instead of thinking of them as 10000 32-bit integers, we think of them as 2500 128-bit blocks, each block containing four lanes, with each lane holding one 32-bit integers.

    Lane 3 Lane 2 Lane 1 Lane 0
    xarray[0] array[3] array[2] array[1] array[0]
    xarray[1] array[7] array[6] array[5] array[4]
    xarray[2499] array[9999] array[9998] array[9997] array[9996]

    Now we can run our previous algorithm in parallel on each lane.

    Lane 3 Lane 2 Lane 1 Lane 0
    xboundary boundary boundary boundary boundary
     
    test array[3] < boundary array[2] < boundary array[1] < boundary array[0] < boundary
    test array[7] < boundary array[6] < boundary array[5] < boundary array[4] < boundary
    test array[9999] < boundary array[9998] < boundary array[9997] < boundary array[9996] < boundary
     
    count = Σtest Lane 3 totals Lane 2 totals Lane 1 totals Lane 0 totals

    The xboundary variable contains a copy of the boundary in each of the four 32-bit lanes. We load the values from the array four at a time¹ and compare them (in parallel) against the boundary, then we tally them (in parallel). The result of the loop is that each lane of count performs a count of values for its lane.

    After we complete the loop, we combine the parallel results by adding the lanes together. We do this by shuffling the values around and performing more parallel adds. The _mm_shuffle_epi32 function lets you rearrange the lanes of an XMM register. The _MM_SHUFFLE macro lets you specify how you want the shuffle to occur. For example, _MM_SHUFFLE(1, 0, 3, 2) says that we want lanes 1, 0, 3 then 2 of the original value. (You can shuffle a value into multiple destination lanes; for example, _MM_SHUFFLE(0, 0, 0, 0) says that you want four copies of lane 0. That's how we created xboundary.)

    Lane 3 Lane 2 Lane 1 Lane 0
    count Lane 3 totals Lane 2 totals Lane 1 totals Lane 0 totals
    shuffle1 Lane 1 totals Lane 0 totals Lane 3 totals Lane 2 totals
     
    count += shuffle1 Lane 3 + Lane 1 Lane 2 + Lane 0 Lane 1 + Lane 3 Lane 0 + Lane 2
    shuffle2 Lane 2 + Lane 0 Lane 3 + Lane 1 Lane 0 + Lane 2 Lane 1 + Lane 3
     
    count += shuffle2 Lane 3 + Lane 1 +
    Lane 2 + Lane 0
    Lane 2 + Lane 0 +
    Lane 3 + Lane 1
    Lane 1 + Lane 3 +
    Lane 0 + Lane 2
    Lane 0 + Lane 2 +
    Lane 1 + Lane 3

    At the end of the shuffling and adding, we have calculated the sum of all four lanes. (For style points, I put the answer in all the lanes.)

    This new version runs in 688 time units, or 3.9 times faster than the previous one. This makes sense because we are counting four values at each iteration. The overall improvement is 4.3×.

    Let's see if we can reduce the loop overhead by doing some unrolling.

    #define GETVALUE(n) __m128i value##n = _mm_loadu_si128(&xarray[i+n])
    #define GETTEST(n) __m128i test##n = _mm_cmplt_epi32(value##n, xboundary)
    #define GETCOUNT(n)  count = _mm_sub_epi32(count, test##n)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 4) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    We unroll the loop fourfold. At each iteration, we load 16 values from memory, and then accumulate the totals. We fetch all the memory values first, then do the comparisons, then accumulate the results. If we had written it as GETVALUE immediately followed by GETTEST, then the _mm_cmplt_epi32 would have stalled waiting for the result to arrive from memory. By interleaving the operations, we get some work done instead of stalling.

    This version runs in 514 time units, an improvement of 33% over the previous version and an overall improvement of 5.7×.

    Can we unroll even further? Let's try fivefold.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    Huh? This version runs marginally slower, at 528 time units. So I guess further unrolling won't help any more. (For example, if you unroll a loop so much that you have more live variables than registers, the compiler will need to spill registers to memory. The x86 has eight XMM registers available, so you can easily cross that limit.)

    But wait, there's still room for tweaking. We have been using _mm_cmplt_epi32 to perform the comparison, expecting the compiler to generate code like this:

        ; suppose xboundary is in xmm0 and count is in xmm1
        movdqu   xmm2, xarray[i] ; xmm2 = value
        pcmpltd  xmm2, xmm0      ; xmm2 = test
        psubd    xmm1, xmm2
    

    If you crack open your Intel manual, you'll see that there is no PCMPLTD instruction. The compiler intrinsic is emulating the instruction by flipping the parameters and using PCMPGTD.

    _mm_cmplt_epi32(x, y) ↔ _mm_cmpgt_epi32(y, x)
    

    But the PCMPGTD instruction writes the result back into the first parameter. In other words, it always takes the form

    y = _mm_cmpgt_epi32(y, x);
    

    In our case, y is xboundary, but we don't want to modify xboundary. As a result, the compiler needs to introduce a temporary register:

        movdqu   xmm2, xarray[i] ; xmm2 = value
        movdqa   xmm3, xmm0      ; xmm3 = copy of xboundary
        pcmpgtd  xmm3, xmm2      ; xmm3 = test
        psubd    xmm1, xmm3
    

    We can take an instruction out of the sequence by switching to _mm_cmpgt_epi32 and adjusting our logic accordingly, taking advantage of the fact that

    x < y ⇔ ¬(x ≥ y) ⇔ ¬(x > y − 1)
    

    assuming the subtraction does not underflow. Fortunately, it doesn't in our case since boundary ranges from 0 to 10, and subtracting 1 does not put us in any danger of integer underflow.

    With this rewrite, we can switch to using _mm_cmpgt_epi32, which is more efficient for our particular scenario. Since we are now counting the values which don't meet our criteria, we need to take our final result and subtract it from 10000.

    #define GETTEST(n) __m128i test##n = _mm_cmpgt_epi32(value##n, xboundary1)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary1 = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 - _mm_cvtsi128_si32(count);
    }
    

    Notice that we have two subtractions which cancel out. We are subtracting the result of the comparison, and then we subtract the total from 10000. The two signs cancel out, and we can use addition for both. This saves an instruction in the return because subtraction is not commutative, but addition is.

    #define GETCOUNT(n) count = _mm_add_epi32(count, test##n)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary1 = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 + _mm_cvtsi128_si32(count);
    }
    

    You can look at the transformation this way: The old code considered the glass half empty. It started with zero and added 1 each time it found an entry that passed the test. The new code considers the glass half full. It assumes each entry passes the test, and it subtracts one each time it finds an element that fails the test.

    This version runs in 453 time units, an improvement of 13% over the fourfold unrolled version and an improvement of 6.5× overall.

    Okay, let's unroll sixfold, just for fun.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     int i = 0;
     {
        GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3);
         GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);
        GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
     }
     i += 4;
     for (; i < 10000 / 4; i += 6) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2);
      GETVALUE(3); GETVALUE(4); GETVALUE(5);
       GETTEST(0);  GETTEST(1);  GETTEST(2);
       GETTEST(3);  GETTEST(4);  GETTEST(5);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2);
      GETCOUNT(3); GETCOUNT(4); GETCOUNT(5);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 + _mm_cvtsi128_si32(count);
    }
    

    Since 10000 / 4 % 6 = 4, we have four values that don't fit in the loop. We deal with those values up front, and then enter the loop to get the rest.

    This version runs in 467 time units, which is 3% slower than the previous version. So I guess it's time to stop unrolling. Let's go back to the previous version which ran faster.

    The total improvement we got after all this tweaking is speedup of 6.5× over the original jumpless version. And most of that improvement (5.7×) came from unrolling the loop fourfold.

    Anyway, no real moral of the story today. I just felt like tinkering.

    Notes

    ¹ The _mm_loadu_si128 intrinsic is kind of weird. Its formal argument is a __m128i*, but since it is for loading unaligned data, the formal argument really should be __m128i __unaligned*. The problem is that the __unaligned keyword doesn't exist on x86 because prior to the introduction of MMX and SSE, x86 allowed arbitrary misaligned data. Therefore, you are in this weird situation where you have to use an aligned pointer to access unaligned data.

    Bonus chatter: Clang at optimization level 3 does autovectorization. It doesn't know some of the other tricks, like converting x + 1 to x - (-1), thereby saving an instruction and a register.

  • The Old New Thing

    A user's SID can change, so make sure to check the SID history

    • 23 Comments

    It doesn't happen often, but a user's SID can change. For example, when I started at Microsoft, my account was in the SYS-WIN4 domain, which is where all the people on the Windows 95 team were placed. At some point, that domain was retired, and my account moved to the REDMOND domain. We saw some time ago that the format of a user SID is

    S-1- version number (SID_REVISION)
    -5- SECURITY_NT_AUTHORITY
    -21- SECURITY_NT_NON_UNIQUE
    -w-x-y- the entity (machine or domain) that issued the SID
    -z the unique user ID for that entity

    The issuing entity for a local account on a machine is the machine to which the account belongs. The issuing entity for a domain account is the domain.

    If an account moves between domains, the issuing entity changes, which means that the old SID is not valid. A new SID must be issued.

    Wait, does this mean that if my account moves between domains, then I lose access to all my old stuff? All my old stuff grants access to my old SID, not my new SID.

    Fortunately, this doesn't happen, thanks to the SID history. When your account moves to the new domain, the new domain controller remembers all the previous SIDs you used to have. When you authenticate against the domain controller, it populates your token with your SID history. In my example, it means that my token not only says "This is user number 271828 on the REDMOND domain", it also says "This user used to be known as number 31415 on the SYS-WIN4 domain." That way, when the system sees an object whose ACL says, "Grant access to user 31415 on the SYS-WIN4 domain," then it should grant me access to that object.

    The existence of SID history means that recognizing users when they return is more complicated than a simple Equal­Sid, because Equal­Sid will say that "No, S-1-5-21-REDMOND-271828 is not equal to S-1-5-21-SYS-WIN4-31415," even though both SIDs refer to the same person.

    If you are going to remember a SID and then try to recognize a user when they return, you need to search the SID history for a match, in case the user changed domains between the two visits. The easiest way to do this is with the Access­Check function. For example, suppose I visited your site while I belong to the SYS-WIN4 domain, and you remembered my SID. When I return, you create a security descriptor that grants access to the SID you remembered, and then you ask Access­Check, "If I had an object that granted access only to this SID, would you let this guy access it?"

    (So far, this is just recapping stuff I discussed a few months ago. Now comes the new stuff.)

    There are a few ways of building up the security descriptor. In all the cases, we will create a security descriptor that grants the specified SID some arbitrary access, and then we will ask the operating system whether the current user has that access.

    My arbitrary access shall be

    #define FROB_ACCESS     1 // any single bit less than 65536
    

    One way to build the security descriptor is to let SDDL do the heavy lifting: Generate the string D:(A;;1;;;⟨SID⟩) and then pass it to String­Security­Descriptor­To­Security­Descriptor.

    Another is to build it up with security descriptor functions. I defer to the sample code in MSDN for an illustration.

    The hard-core way is just to build the security descriptor by hand. For a security descriptor this simple, the direct approach involves the least amount of code. Go figure.

    The format of the security descriptor we want to build is

    struct ACCESS_ALLOWED_ACE_MAX_SIZE
    {
        ACCESS_ALLOWED_ACE Ace;
        BYTE SidExtra[SECURITY_MAX_SID_SIZE - sizeof(DWORD)];
    };
    

    The ACCESS_ALLOWED_ACE_MAX_SIZE structure represents the maximum possible size of an ACCESS_ALLOWED_ACE. The ACCESS_ALLOWED_ACE leaves a DWORD for the SID (Sid­Start), so we add additional bytes afterward to accommodate the largest valid SID. If you wanted to be more C++-like, you could make ACCESS_ALLOWED_ACE_MAX_SIZE derive from ACCESS_ALLOWED_ACE.

    struct ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR
    {
        SECURITY_DESCRIPTOR_RELATIVE Header;
        ACL Acl;
        ACCESS_ALLOWED_ACE_MAX_SIZE Ace;
    };
    
    const ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR c_sdTemplate = {
      // SECURITY_DESCRIPTOR_RELATIVE
      {
        SECURITY_DESCRIPTOR_REVISION,           // Revision
        0,                                      // Reserved
        SE_DACL_PRESENT | SE_SELF_RELATIVE,     // Control
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Ace.Ace.SidStart),
                                                // Offset to owner
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Ace.Ace.SidStart),
                                                // Offset to group
        0,                                      // No SACL
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Acl),
                                                // Offset to DACL
      },
      // ACL
      {
        ACL_REVISION,                           // Revision
        0,                                      // Reserved
        sizeof(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR) -
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Acl),
                                                // ACL size
        1,                                      // ACE count
        0,                                      // Reserved
      },
      // ACCESS_ALLOWED_ACE_MAX_SIZE
      {
        // ACCESS_ALLOWED_ACE
        {
          // ACE_HEADER
          {
            ACCESS_ALLOWED_ACE_TYPE,            // AceType
            0,                                  // flags
            sizeof(ACCESS_ALLOWED_ACE_MAX_SIZE),// ACE size
          },
          FROB_ACCESS,                          // Access mask
        },
      },
    };
    

    Our template security descriptor says that it is a self-relative security descriptor with an owner, group and DACL, but no SACL. The DACL consists of a single ACE. We set up everything in the ACE except for the SID. We point the owner and group to that same SID. Therefore, this security descriptor is all ready for action once you fill in the SID.

    BOOL IsInSidHistory(HANDLE Token, PSID Sid)
    {
      DWORD SidLength = GetLengthSid(Sid);
    
      if (SidLength > SECURITY_MAX_SID_SIZE) {
        // Invalid SID. That's not good.
        // Somebody is playing with corrupted data.
        // Stop before anything bad happens.
        RaiseFailFastException(nullptr, nullptr, 0);
      }
    
      ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR Sd = c_sdTemplate;
      CopyMemory(&Sd.Ace.Ace.SidStart, Sid, SidLength);
    

    As you can see, generating the security descriptor is a simple matter of copying our template and then replacing the SID. The next step is performing an access check of the token against that SID.

      const static GENERIC_MAPPING c_GenericMappingFrob = {
        FROB_ACCESS,
        FROB_ACCESS,
        FROB_ACCESS,
        FROB_ACCESS,
      };
      PRIVILEGE_SET PrivilegeSet;
      DWORD PrivilegeSetSize = sizeof(PrivilegeSet);
      DWORD GrantedAccess = 0;
      BOOL AccessStatus = 0;
      return AccessCheck(&Sd, Token, FROB_ACCESS,
        const_cast<PGENERIC_MAPPING>(&c_GenericMappingFrob),
        &PrivilegeSet, &PrivilegeSetSize,
        &GrantedAccess, &AccessStatus) &&
        AccessStatus;
    }
    

    So let's take this guy out for a spin. Since I don't know what is in your SID history, I'm going to pick something that should be in your token already (Authenticated Users) and something that shouldn't (Local System).

    // Note: Error checking elided for expository purposes.
    
    void CheckWellKnownSid(HANDLE Token, WELL_KNOWN_SID_TYPE type)
    {
      BYTE rgbSid[SECURITY_MAX_SID_SIZE];
      DWORD cbSid = sizeof(rgbSid);
      CreateWellKnownSid(type, NULL, rgbSid, &cbSid);
      printf("Is %d in SID history? %d\n", type,
             IsInSidHistory(Token, rgbSid));
    }
    
    int __cdecl wmain(int argc, wchar_t **argv)
    {
      HANDLE Token;
      // In real life you had better error-check these calls,
      // to avoid a security hole.
      ImpersonateSelf(SecurityImpersonation);
      OpenThreadToken(GetCurrentThread(), TOKEN_QUERY, TRUE, &Token);
      RevertToSelf();
    
      CheckWellKnownSid(Token, WinAuthenticatedUserSid);
      CheckWellKnownSid(Token, WinLocalSystemSid);
      CloseHandle(Token);
    
      return 0;
    }
    

    Related reading: Hey there token, long time no see! (Did you do something with your hair?)

  • The Old New Thing

    Some light reading on lock-free programming

    • 8 Comments

    Today is a holiday in the United States, so I'm going to celebrate by referring you to other things to read.

    I'm going to start with a presentation by Bruce Dawson at GDC 2009, which is basically multiple instances of the question "Is this code correct?", and the answer is always "No!" Although the title of the talk is Lockless Programming in Games, the information is relevant to pretty much everybody. I can't find a recording of the presentation, but you can download the PowerPoint slides or view them in your browser. But I recommend downloading the PowerPoint slides and reading the notes, because the notes explain the slides. [Update: Ah, you can see the notes in the browser by clicking the Notes button at the bottom. So download whichever you prefer. Just make sure you read the notes.]

    A more game-focused presentation by Bruce Dawson has the more general title Coding for Multiple Cores. Download the PowerPoint sides or view them in your browser.

    Then there is the MSDN white paper that he authored, Lockless Programming Considerations for Xbox 360 and Microsoft Windows.

    Finally, there's Herb Sutter's two-part talk atomic<> Weapons, part 1 and part 2.

    That should keep you busy for a while.

  • The Old New Thing

    If 16-bit Windows had a single input queue, how did you debug applications on it?

    • 32 Comments

    After learning about the bad things that happened if you synchronized your application's input queue with its debugger, commenter kme wonders how debugging worked in 16-bit Windows, since 16-bit Windows didn't have asynchronous input? In 16-bit Windows, all applications shared the same input queue, which means you were permanently in the situation described in the original article, where the application and its debugger (and everything else) shared an input queue and therefore would constantly deadlock.

    The solution to UI deadlocks is to make sure the debugger doesn't have any UI.

    At the most basic level, the debugger communicated with the developer through the serial port. You connected a dumb terminal to the other end of the serial port. Mine was a Wyse 50 serial console terminal. All your debugging happened on the terminal. You could disassemble code, inspect and modify registers and memory, and even patch new code on the fly. If you wanted to consult source code, you needed to have a copy of it available somewhere else (like on your other computer). It was similar to using the cdb debugger, where the only commands available were r, db, eb, u, and a. Oh, and bp to set breakpoints.

    Now, if you were clever, you could use a terminal emulator program so you didn't need a dedicated physical terminal to do your debugging. You could connect the target computer to your development machine and view the disassembly and the source code on the same screen. But you weren't completely out of the woods, because what did you use to debug your development machine if it crashed? The dumb terminal, of course.¹

    Target machine
    Debugger
    Development machine
    Debugger
    Wyse 50
    dumb terminal

    I did pretty much all my Windows 95 debugging this way.

    If you didn't have two computers, another solution was to use a debugger like CodeView. CodeView avoided the UI deadlock problem by not using the GUI to present its UI. When you hit a breakpoint or otherwise halted execution of your application, CodeView talked directly to the video driver to save the first 4KB of video memory, then switched into text mode to tell you what happened. When you resumed execution, it restored the video memory, then switched the video card back into graphics mode, restored all the pixels it captured, then resumed execution as if nothing had happened. (If you were debugging a graphics problem, you could hit F3 to switch temporarily to graphics mode, so you could see what was on the screen.)

    If you were really fancy, you could spring for a monochrome adapter, either the original IBM one or the Hercules version, and tell CodeView to use that adapter for its debugging UI. That way, when you broke into the debugger, you could still see what was on the screen! We had multiple monitors before it was cool.

    ¹ Some people were crazy and cross-connected their target and development machines.

    Target machine
    Debugger
    Development machine
    Debugger

    This allowed them to use their target machine to debug their development machine and vice versa. But if your development machine crashed while it was debugging the target machine, then you were screwed.

Page 2 of 438 (4,378 items) 12345»