• The Old New Thing

    What states are possible in a DRAWITEMSTRUCT structure?

    • 24 Comments

    The DRAW­ITEM­STRUCT structure has an item­State member which contains a number of bits describing the state of the item being drawn. How do those states map to the underlying control?

    Most of the states are rather obvious. For a list box item to be selected, it means that the item is part of the selection. But what does selected mean for a button?

    Since people like tables, I'll put the answer in a table:

    Menu Listbox Combobox Button
    CtlType ODT_MENU ODT_LISTBOX ODT_COMBOBOX ODT_BUTTON
    itemID menu item ID item index or −1 item index or −1
    ODS_SELECTED Selected Selected Selected Pushed
    ODS_GRAYED Grayed
    ODS_DISABLED Disabled Disabled Disabled Disabled
    ODS_CHECKED Checked
    ODS_FOCUS Focus Focus Focus
    ODS_DEFAULT Default menu item
    ODS_HOTLIGHT Hover
    ODS_INACTIVE Inactive
    ODS_NOACCEL HideAccel HideAccel HideAccel HideAccel
    ODS_NOFOCUSRECT HideFocus HideFocus HideFocus
    ODS_COMBOBOXEDIT Is edit control
    Static Header Tab Listview Status
    CtlType ODT_STATIC ODT_HEADER ODT_TAB ODT_LISTVIEW
    itemID item index item index item index part index
    ODS_SELECTED Pushed Selected Selected
    ODS_GRAYED
    ODS_DISABLED Oops
    ODS_CHECKED AutoChecked
    ODS_FOCUS Focus
    ODS_DEFAULT
    ODS_HOTLIGHT Hover
    ODS_INACTIVE
    ODS_NOACCEL HideAccel
    ODS_NOFOCUSRECT
    ODS_COMBOBOXEDIT

    Okay, now that it's all in a table, how do I read the table?

    A box is blank if the corresponding flag is not currently used by the control type. (No guarantees about the future.) For example, as of this writing, button controls do not set an itemID, nor do they ever ask for ODS_GRAYED.

    You may have noticed that the box for CtlType is blank for status controls. That's an oops. The status bar control forgot to set the CtlType when it sends the WM_DRAW­ITEM message, so the value is uninitialized garbage. The way to detect a status bar control is to check the window handle. (This works in general. You can always detect a control by checking the window handle.)

    For list boxes and combo boxes, the itemID can have the special value -1 to mean "I am drawing a list box/combo box where no item is selected." For list boxes, this happens when the list box is empty. For combo boxes, this happens when the user types text into the edit box that does not match any of the items in the list portion of the combo box.

    Most of the other box entries are self-explanatory. For the most part, the flag name matches the conditions under which the corresponding flag is set. For example, the ODS_FOCUS flag is set when the list box item being drawn is the selected item.

    Note that the ODS_SELECTED flag is used for button and header controls to indicate that the control should be drawn in the pushed state. For example, the user may have put focus on a button control and pressed the space bar and not yet released it, or the application may have manually set the BST_PUSHED state. Header controls can get into a pushed state if you enable the HDS_BUTTONS style.

    List view controls set the ODS_CHECKED flag if a check box should be drawn over the item. This happens if the LVS_EX_AUTO­CHECK­SELECT extended style is specified and the item is selected. (Normally, the check box is drawn to the side as a state image.)

    The ODS_COMBO­BOX­EDIT flag is used only by combo box controls. It is set if the item being drawn is the edit portion of a combo box control. If not set, then the item being drawn is in the list box portion of the combo box control.

    Finally, there is a box marked Oops.

    The static control is supposed to set ODS_DISABLED if the static control is disabled. And that's what happens if you are using the classic static control. However, there is a typo in the the fancy themed static control, and it sets the ODS_DISBALED flag incorrectly. If you are owner-drawing a themed static control, and you want to draw differently depending on whether the control is disabled, then you should ignore the ODS_DISABLED flag and instead draw the disabled state based on the result of calling Is­Window­Enabled function.

    The bug in the themed static control cannot be fixed for compatibility reasons. I can pretty much guarantee that there is some application which doesn't draw correctly if the ODS_DISABLED flag is not set.

  • The Old New Thing

    If you get a procedure address by ordinal, you had better be absolutely sure it's there, because the failure mode is usually indistinguishable from success

    • 18 Comments

    A customer reported that the Get­Proc­Address function was behaving strangely.

    We have this code in one of our tests:

    typedef int (CALLBACK *T_FOO)(int);
    
    void TestFunctionFoo(HINSTANCE hDLL)
    {
      // Function Foo is ordinal 1 in our DLL
      T_FOO pfnFoo = (T_FOO)GetProcAddress(hDLL, (PCSTR)1);
      if (pfnFoo) {
        ... run tests on pfnFoo ...
      }
    }
    

    Recently, this test started failing in bizarre ways. When we stepped through the code, we discovered that pfnFoo ends up calling Bar instead of Foo. The first time we try to test pfnFoo, we get stack corruption because Bar has a different function prototype from Foo, and of course on top of that the test fails horribly because it's calling the wrong function!

    When trying to narrow the problem, we found that the issue began when the test was run against a version of the DLL that was missing the Foo function entirely. The line

        Foo @1
    

    was removed from the DEF file. Why did the call to Get­Proc­Address succeed and return the wrong function? We expected it to fail.

    Let's first consider the case where a DLL exports no functions by ordinal.

    EXPORTS
        Foo
        Bar
        Plugh
    

    The linker builds a list of all the exported functions (in an unspecified order) and fills in two arrays based on that list. If you look in the DLL image, you'll see something like this:

    Exported Function Table
    
    00049180 address of Bar
    00049184 address of Foo
    0004918C address of Plugh
    
    Exported Names
    
    00049190 address of the string "Bar"
    00049194 address of the string "Foo"
    00049198 address of the string "Plugh"
    

    There are two parallel arrays, one with function addresses and one with function names. The string "Bar" is the first entry in the exported names table, and the function Bar is the first entry in the exported function table. In general, the string in the Nth entry in the exported names table corresponds to the function in the Nth entry of the exported function table.

    Since it is only the relative position that matters, let's replace the addresses with indices.

    Exported Function Table
    
    [1] address of Bar
    [2] address of Foo
    [3] address of Plugh
    
    Exported Names
    
    [1] address of the string "Bar"
    [2] address of the string "Foo"
    [3] address of the string "Plugh"
    

    Okay, now let's introduce functions exported by ordinal. When you do that, you're telling the linker, "Make sure this function goes into the NNth slot in the exported function table." Suppose your DEF file went like this:

    EXPORTS
        Foo @1
        Bar
        Plugh
    

    This says "First thing we do is put Foo in slot 1. Once that's done, fill in the rest arbitrarily."

    The linker says, "Okay, I have a total of three functions, so let me build two tables with three entries each."

    Exported Function Table
    
    [1] address of ?
    [2] address of ?
    [3] address of ?
    
    Exported Names
    
    [1] address of ?
    [2] address of ?
    [3] address of ?
    

    "Now I place Foo in slot 1."

    Exported Function Table
    
    [1] address of Foo
    [2] address of ?
    [3] address of ?
    
    Exported Names
    
    [1] address of the string "Foo"
    [2] address of ?
    [3] address of ?
    

    "Now I fill in the rest arbitrarily."

    Exported Function Table
    
    [1] address of Foo
    [2] address of Bar
    [3] address of Plugh
    
    Exported Names
    
    [1] address of the string "Foo"
    [2] address of the string "Bar"
    [3] address of the string "Plugh"
    

    Since you explicitly placed Foo in slot 1, when you do a Get­Proc­Address(hDLL, 1), you will get Foo. On the other hand, if you do a Get­Proc­Address(hDLL, 2), you will get Bar, or at least you will with this build. With the next build, you may get something else, because the linker just fills in the slots arbitrarily, and next time, it may choose to fill them arbitrarily in some other order. Furthermore, if you do a Get­Proc­Address(hDLL, 6), you will get NULL because the table has only three functions in it.

    I hope you see where this is going.

    If you delete Foo from the EXPORTS section, this stops exporting Foo but says nothing about what goes into slot 1. As a result, the linker is free to put anything it wants into that slot.

    Exported Function Table
    
    [1] address of Bar
    [2] address of Plugh
    
    Exported Names
    
    [1] address of the string "Bar"
    [2] address of the string "Plugh"
    

    Now, when you do a Get­Proc­Address(hDLL, 1), you get Bar, since that's the function that happened to fall into slot 1 this time.

    The moral of the story is that if you try to obtain a function by ordinal, then it had better be there, because there is no reliable way of being sure that the function you got is one that was explicitly placed there, as opposed to some other function that happened to be assigned that slot arbitrarily.

    Related reading: How are DLL functions exported in 32-bit Windows?

  • The Old New Thing

    The psychology of confirmation, or something, I don't know what to call it

    • 46 Comments

    There is probably a name for this phenomenon. I will illustrate it below.

    "Is there a way to configure the system to do X?"

    Go to the Y dialog and select Z.

    "It doesn't work."

    I just tried it. It works for me. I'm using ⟨configuration details⟩.

    "Thanks. It's working."

  • The Old New Thing

    Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result

    • 20 Comments
    Suppose you want to multiply two double-word values producing a quad-word result, but your processor supports only single-word multiplication with a double-word result. For concreteness, let's say that your processor supports 32 × 32 → 64 multiplication and you want to implement 64 × 64 → 128 multiplication. (Sound like any processor you know?)

    Oh boy, let's do some high school algebra. Let's start with unsigned multiplication.

    Let x = A × 2³² + B and y = C × 2³² + D, where A, B, C, and D are all in the range 0 … 2³² − 1.

    x × y AC × 264 + (AD + BC) × 232 + BD
    AC × 264 + BD  +  (AD + BC) × 232
    provisional result cross-terms

    Each of the multiplications (not counting the power-of-two multiplications) is a 32 × 32 → 64 multiplication, so they are something we have as a building block. And the basic implementation is simply to perform the four multiplications and add the pieces together. But if you have SSE, you can perform two multiplies in a single instruction.

        // Prepare our source registers
        movq xmm0, x         // xmm0 = { 0, 0, A, B } = { *, *, A, B }
        movq xmm1, y         // xmm1 = { 0, 0, C, D } = { *, *, C, D }
        punpckldq xmm0, xmm0 // xmm0 = { A, A, B, B } = { *, A, *, B }
        punpckldq xmm1, xmm1 // xmm1 = { C, C, D, D } = { *, C, *, D }
        pshufd xmm2, xmm1, _MM_SHUFFLE(2, 0, 3, 1)
                             // xmm2 = { D, D, C, C } = { *, D, *, C }
    

    The PMULUDQ instruction multiplies 32-bit lanes 0 and 2 of its source and destination registers, producing 64-bit results. The values in lanes 1 and 3 do not participate in the multiplication, so it doesn't matter what we put there. It so happens that the PUNPCKLDQ instruction duplicates the value, but we really don't care. I used * to represent a don't-care value.

        pmuludq xmm1, xmm0 // xmm1 = { AC, BD } // provisional result
        pmuludq xmm2, xmm0 // xmm2 = { AD, BC } // cross-terms
    

    In two PMULUDQ instructions, we created the provisional result and the cross-terms. Now we just need to add the cross-terms to the provisional result. Unfortunately, SSE does not have a 128-bit addition (or at least SSE2 doesn't; who knows what they'll add in the future), so we need to do that the old-fashioned way.

        movdqa result, xmm1
        movdqa crossterms, xmm2
    
        mov    eax, crossterms[0]
        mov    edx, crossterms[4] // edx:eax = BC
        add    result[4], eax
        adc    result[8], edx
        adc    result[12], 0      // add the first cross-term
    
        mov    eax, crossterms[8]
        mov    edx, crossterms[12] // edx:eax = AD
        add    result[4], eax
        adc    result[8], edx
        adc    result[12], 0      // add the second cross-term
    

    There we go, a 64 × 64 → 128 multiply constructed from 32 × 32 → 64 multiplies.

    Exercise: Why didn't I use the rax register to perform the 64-bit addition? (This is sort of a trick question.)

    That calculates an unsigned multiplication, but how do we do a signed multiplication? Let's work modulo 2128 so that signed and unsigned multiplication are equivalent. This means that we need to expand x and y to 128-bit values X and Y.

    Let s = the sign bit of x expanded to a 64-bit value, and similarly t = the sign bit of y expanded to a 64-bit value. In other words, s is 0xFFFFFFFF`FFFFFFFF if x < 0 and zero if x ≥ 0.

    The 128-bit values being multiplied are

    X s × 264 + x
    Y t × 264 + y

    The product is therefore

    X × Y st × 2128  (sy + tx) × 264   +  xy
    zero adjustment unsigned product

    The first term is zero because it overflows the 128-bit result. That leaves the second term as the adjustment, which simplifies to "If x < 0 then subtract y from the high 64 bits; if y < 0 then subtract x from the high 64 bits."

        if (x < 0) result.m128i_u64[1] -= y;
        if (y < 0) result.m128i_u64[1] -= x;
    

    If we were still playing with SSE, we could compute this as follows:

        movdqa xmm0, result   // xmm0 = { high, low }
        movq   xmm1, x        // xmm1 = { 0, x }
        movq   xmm2, y        // xmm2 = { 0, y }
        pshufd xmm3, xmm1, _MM_SHUFFLE(1, 1, 3, 2) // xmm3 = { xhi, xhi, 0, 0 }
        pshufd xmm1, xmm1, _MM_SHUFFLE(1, 0, 3, 2) // xmm1 = { x, 0 }
        pshufd xmm4, xmm2, _MM_SHUFFLE(1, 1, 3, 2) // xmm4 = { yhi, yhi, 0, 0 }
        pshufd xmm2, xmm2, _MM_SHUFFLE(1, 0, 3, 2) // xmm2 = { y, 0 }
        psrad  xmm3, 31       // xmm3 = { s, s, 0, 0 } = { s, 0 }
        psrad  xmm4, 31       // xmm4 = { t, t, 0, 0 } = { t, 0 }
        pand   xmm3, xmm2     // xmm3 = { x < 0 ? y : 0, 0 }
        pand   xmm4, xmm1     // xmm4 = { y < 0 ? x : 0, 0 }
        psubq  xmm0, xmm3     // first adjustment
        psubq  xmm0, xmm4     // second adjustment
        movdqa result, xmm0   // update result
    

    The code is a bit strange because SSE2 doesn't have a full set of 64-bit integer opcodes. We would have liked to have used a psraq instruction to fill a 64-bit field with a sign bit. But there is no such instruction, so instead we duplicate the 64-bit sign bit into two 32-bit sign bits (one in lane 2 and one in lane 3) and then fill the lanes with that bit using psrad.

  • The Old New Thing

    Killing a window timer prevents the WM_TIMER message from being generated for that timer, but it doesn't retroactively remove ones that were already generated

    • 7 Comments

    Calling Kill­Timer to cancel a window timer prevents WM_TIMER messages from being generated for that timer, even if one is overdue. In other words, give this sequence of operations:

    SetTimer(hwnd, IDT_MYTIMER, 1000, NULL);
    Sleep(2000);
    KillTimer(hwnd, IDT_MYTIMER);
    

    no WM_TIMER message is ever generated. Even though a timer became due during the Sleep, no timer message was generated during the sleep because timer messages are generated on demand, and nobody demanded one. Killing the timer then removes the ability to demand a timer message, and the result is that no message ever appears.

    In general, this means that once you kill a timer, you will not receive any WM_TIMER messages for that timer.

    Unless you demanded one while the timer was active and didn't process it.

    Let's try a variation:

    SetTimer(hwnd, IDT_MYTIMER, 1000, NULL);
    Sleep(2000);
    if (PeekMessage(&msg, NULL, WM_TIMER, WM_TIMER, 0)) {
     DispatchMessage(&msg);
    }
    KillTimer(hwnd, IDT_MYTIMER);
    

    In this case, the Peek­Message function looks for a WM_TIMER message in the queue, and if none is found, it asks for one to be generated on the fly if a timer is due. It so happens that one is due (IDT_MY­TIMER), so the Peek­Message causes a WM_TIMER to be generated and placed in the queue. But it doesn't remain in this state for long, because the message is removed from the queue by the Peek­Message function.

    Okay, now let's make things weird:

    SetTimer(hwnd, IDT_MYTIMER, 1000, NULL);
    Sleep(2000);
    if (PeekMessage(&msg, NULL, WM_TIMER, WM_TIMER, PM_NOREMOVE)) {
     // oh hey there is an overdue timer, how about that
    }
    KillTimer(hwnd, IDT_MYTIMER);
    

    This time, we passed the PM_NO­REMOVE flag. The window manager goes through the same process as before, first looking for a WM_TIMER message in the queue, and then failing to find one, generates one on the fly since the IDT_MY­TIMER timer is overdue. But the PM_NO­REMOVE flag makes things weird because it says, "Thanks for generating that message for me. But don't remove it from the queue. Leave it there. I'll deal with it later."

    You might do this if you want to stop processing if a timer elapses, but you don't want to handle the timer immediately because you are in some sensitive state at the point you realize that you need to stop processing. Instead, you want to return back out to the main message loop and let it deal with the timer.

    BOOL DoWorkUntilTheNextTimer()
    {
     BOOL fFinished = FALSE;
     MSG msg;
     PrepareToDoWork();
     while (!PeekMessage(&msg, NULL, WM_TIMER, WM_TIMER, PM_NOREMOVE)) {
      if (AnyWorkLeft()) DoSomeWork(); 
      else { fFinished = TRUE; break; }
     }
     CleanUpAfterDoingWork();
     return fFinished;
    }
    

    And then you might call it like this:

    void DoWorkForUpToOneSecond()
    {
     SetTimer(hwnd, IDT_MYTIMER, 1000, NULL);
     DoWorkUntilTheNextTimer();
     KillTimer(hwnd, IDT_MYTIMER);
    }
    

    The Kill­Timer will prevent any new timer messages from being generated for IDT_MY­TIMER, but it does not go back in time and retroactively un-generate the timer message that was generated when Do­Work­Until­The­Next­Timer asked to see if there were any timer messages.

    You are now in the strange situation where a subsequent call to Peek­Message or Get­Message will retrieve a timer message for a timer that is no longer active!

    This is captured in the MSDN documentation with the simple sentence, "The Kill­Timer function does not remove WM_TIMER messages already posted to the message queue."

  • The Old New Thing

    If my WM_TIMER handler takes longer than the timer period, will my queue fill up with WM_TIMER messages?

    • 15 Comments

    A customer was worried that they may have a problem with their message queue filling with WM_TIMER messages. "If my WM_TIMER handler takes longer than the timer period, will my queue fill up with WM_TIMER messages?"

    As we should know by now, timer messages are generated on demand:

    The WM_TIMER message is a low-priority message. The Get­Message and Peek­Message functions post this message only when no other higher-priority messages are in the thread's message queue.

    Here's the basic algorithm. (I'm ignoring filtering and I'm assuming that messages are removed.)

    • Look for a posted message. If one exists, then return it.
    • Was Post­Quit­Message called? If so, then generate and return a WM_QUIT message.
    • Look for an input message. If one exists, then return it.
    • Did the mouse move since the last call? If so, then generate and return a WM_MOUSE­MOVE message.
    • Does a window need to be repainted? If so, then generate and return a WM_PAINT message.
    • Is there a timer that has elapsed? If so, then generate and return a WM_TIMER message.

    Notice that the generated messages are generated on demand by message retrieval functions. If you never call a message retrieval function, then no messages are generated. And in the case where the messages are removed (i.e., you use Get­Message or you use Peek­Message with PM_REMOVE), the messages are removed immediately after being generated, so they don't hang around very long at all.

    In particular, if your WM_TIMER handler takes longer than the timer period, and it doesn't call a message retrieval function, then there is no opportunity for another WM_TIMER message to be generated. Only when you call a message retrieval function does there become a possibility for a WM_TIMER message to be generated.

  • The Old New Thing

    What happens if I don't paint when I get a WM_PAINT message?

    • 19 Comments

    Suppose your window procedure doesn't paint when it gets a WM_PAINT message. What happens?

    It depends on how you don't paint.

    If you have an explicit handler for the WM_PAINT message that does nothing but return without painting, then the window manager will turn around and put a new WM_PAINT message in your queue. "And try harder this time." Remember that the rules for the WM_PAINT message are that the window manager will generate a WM_PAINT message for any window that has a dirty region. If you fail to remove the dirty region in your WM_PAINT message handler, well, then the rules state that you get another WM_PAINT message. (The most common way of clearing the dirty region is to call Begin­Paint, but there are other less common ways, like Validate­Rect or Redraw­Window with the RDW_VALIDATE flag.)

    The other case is that you simply don't have a WM_PAINT handler and let the message fall through to Def­Window­Proc. In that case, Def­Window­Proc will do a blank paint for you. In other words, Def­Window­Proc contains the logical equivalent of

    case WM_PAINT:
     {
      PAINTSTRUCT ps;
      if (BeginPaint(hwnd, &ps))
       EndPaint(hwnd, &ps);
     }
     return 0;
    

    In the case where you pass the WM_PAINT to Def­Window­Proc, the dirty region is cleared because Def­Window­Proc will call Begin­Paint for you.

    There are some quirks in the handling of the WM_PAINT message by the Def­Window­Proc function to handle various application compatibility cases, but the above is the basic idea.

    To avoid tripping over the weird application compatibility cases, decide up front how you want to deal with WM_PAINT messages delivered to your window procedure.

    • Handle them completely by calling Begin­Paint and End­Paint, then returning 0. (Do not pass the message to Def­Window­Proc.)
    • Pass them all to Def­Window­Proc, and let it do the Begin­Paint and End­Paint.

    Don't try playing fancy games like "Oh, I'm going to call Begin­Paint and End­Paint, but sometimes I'm also going to pass the message to Def­Window­Proc afterwards." Just pick one plan and stick to it. It's a lot simpler for everybody that way.

  • The Old New Thing

    Microspeak: Redlines

    • 30 Comments

    To the outside world, redline can mean to mark something for removal, or it could mean the maximum safe speed of an engine. But in the world of Microsoft design, the term redlines (pronounced as if it were written as the two words red lines, but the accent is on the red) refers to a diagram showing the exact position of visual elements. They typically take the form of a proposed screen shot, with arrows and lines superimposed to indicate the distances between items, which items align with each other, and so on. They also contain indications as to the exact colors to use for different elements.

    Originally the lines and arrows were actually red, hence the name. Here's an example of something that gives you the idea:

    These aren't real redlines because the diagram doesn't contain any indications about the colors to use, and more complete redlines would include diagrams showing the hover, pressed, and disabled states.

  • The Old New Thing

    Counting array elements which are below a particular limit value using SSE

    • 12 Comments

    Some time ago, we looked at how doing something can be faster than not doing it. That is, we observed the non-classical effect of the branch predictor. I took the branch out of the inner loop, but let's see how much further I can push it.

    The trick I'll employ today is using SIMD in order to operate on multiple pieces of data simultaneously. Take the original program and replace the count­them function with this one:

    int countthem(int boundary)
    {
     __m128i xboundary = _mm_cvtsi32_si128(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000; i++) {
      __m128i value =  _mm_cvtsi32_si128(array[i]);
      __m128i test = _mm_cmplt_epi32(value, xboundary);
      count = _mm_sub_epi32(count, test);
     }
     return _mm_cvtsi128_si32(count);
    }
    

    Now, this program doesn't actually use any parallel operations, but it's our starting point. For each 32-bit value, we load it, compare it agains the boundary value, and accumulate the result. The _mm_cmplt_epi32 function compares the four 32-bit integers in the first parameter against the four 32-bit integers in the second parameter, producing four new 32-bit integers. Each of the new 32-bit integers is 0xFFFFFFFF if the corresponding first parameter is less than the second, or it is 0x00000000 if it is greater than or equal.

    In this case, we loaded up the value we care about, then compare it against the boundary value. The result of the comparison is either 32 bits of 0 (for false) or 32 bits of 1 (for true), so this merely sets test equal to 0xFFFFFFFF if the value is less than the boundary; otherwise 0x0000000. Since 0xFFFFFFFF is the same as a 32-bit -1, we subtract the value so that the count goes up by 1 if the value is less than the boundary.

    Finally, we convert back to a 32-bit integer and return it.

    With this change, the running time drops from 2938 time units to 2709, an improvement of 8%.

    So far, we have been using only the bottom 32 bits of the 128-bit XMM registers. Let's turn on the parallelism.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i++) {
      __m128i value = _mm_loadu_si128(&xarray[i]);
      __m128i test = _mm_cmplt_epi32(value, xboundary);
      count = _mm_sub_epi32(count, test);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    We take our 32-bit integers and put them in groups of four, so instead of thinking of them as 10000 32-bit integers, we think of them as 2500 128-bit blocks, each block containing four lanes, with each lane holding one 32-bit integers.

    Lane 3 Lane 2 Lane 1 Lane 0
    xarray[0] array[3] array[2] array[1] array[0]
    xarray[1] array[7] array[6] array[5] array[4]
    xarray[2499] array[9999] array[9998] array[9997] array[9996]

    Now we can run our previous algorithm in parallel on each lane.

    Lane 3 Lane 2 Lane 1 Lane 0
    xboundary boundary boundary boundary boundary
     
    test array[3] < boundary array[2] < boundary array[1] < boundary array[0] < boundary
    test array[7] < boundary array[6] < boundary array[5] < boundary array[4] < boundary
    test array[9999] < boundary array[9998] < boundary array[9997] < boundary array[9996] < boundary
     
    count = Σtest Lane 3 totals Lane 2 totals Lane 1 totals Lane 0 totals

    The xboundary variable contains a copy of the boundary in each of the four 32-bit lanes. We load the values from the array four at a time¹ and compare them (in parallel) against the boundary, then we tally them (in parallel). The result of the loop is that each lane of count performs a count of values for its lane.

    After we complete the loop, we combine the parallel results by adding the lanes together. We do this by shuffling the values around and performing more parallel adds. The _mm_shuffle_epi32 function lets you rearrange the lanes of an XMM register. The _MM_SHUFFLE macro lets you specify how you want the shuffle to occur. For example, _MM_SHUFFLE(1, 0, 3, 2) says that we want lanes 1, 0, 3 then 2 of the original value. (You can shuffle a value into multiple destination lanes; for example, _MM_SHUFFLE(0, 0, 0, 0) says that you want four copies of lane 0. That's how we created xboundary.)

    Lane 3 Lane 2 Lane 1 Lane 0
    count Lane 3 totals Lane 2 totals Lane 1 totals Lane 0 totals
    shuffle1 Lane 1 totals Lane 0 totals Lane 3 totals Lane 2 totals
     
    count += shuffle1 Lane 3 + Lane 1 Lane 2 + Lane 0 Lane 1 + Lane 3 Lane 0 + Lane 2
    shuffle2 Lane 2 + Lane 0 Lane 3 + Lane 1 Lane 0 + Lane 2 Lane 1 + Lane 3
     
    count += shuffle2 Lane 3 + Lane 1 +
    Lane 2 + Lane 0
    Lane 2 + Lane 0 +
    Lane 3 + Lane 1
    Lane 1 + Lane 3 +
    Lane 0 + Lane 2
    Lane 0 + Lane 2 +
    Lane 1 + Lane 3

    At the end of the shuffling and adding, we have calculated the sum of all four lanes. (For style points, I put the answer in all the lanes.)

    This new version runs in 688 time units, or 3.9 times faster than the previous one. This makes sense because we are counting four values at each iteration. The overall improvement is 4.3×.

    Let's see if we can reduce the loop overhead by doing some unrolling.

    #define GETVALUE(n) __m128i value##n = _mm_loadu_si128(&xarray[i+n])
    #define GETTEST(n) __m128i test##n = _mm_cmplt_epi32(value##n, xboundary)
    #define GETCOUNT(n)  count = _mm_sub_epi32(count, test##n)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 4) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    We unroll the loop fourfold. At each iteration, we load 16 values from memory, and then accumulate the totals. We fetch all the memory values first, then do the comparisons, then accumulate the results. If we had written it as GETVALUE immediately followed by GETTEST, then the _mm_cmplt_epi32 would have stalled waiting for the result to arrive from memory. By interleaving the operations, we get some work done instead of stalling.

    This version runs in 514 time units, an improvement of 33% over the previous version and an overall improvement of 5.7×.

    Can we unroll even further? Let's try fivefold.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    Huh? This version runs marginally slower, at 528 time units. So I guess further unrolling won't help any more. (For example, if you unroll a loop so much that you have more live variables than registers, the compiler will need to spill registers to memory. The x86 has eight XMM registers available, so you can easily cross that limit.)

    But wait, there's still room for tweaking. We have been using _mm_cmplt_epi32 to perform the comparison, expecting the compiler to generate code like this:

        ; suppose xboundary is in xmm0 and count is in xmm1
        movdqu   xmm2, xarray[i] ; xmm2 = value
        pcmpltd  xmm2, xmm0      ; xmm2 = test
        psubd    xmm1, xmm2
    

    If you crack open your Intel manual, you'll see that there is no PCMPLTD instruction. The compiler intrinsic is emulating the instruction by flipping the parameters and using PCMPGTD.

    _mm_cmplt_epi32(x, y) ↔ _mm_cmpgt_epi32(y, x)
    

    But the PCMPGTD instruction writes the result back into the first parameter. In other words, it always takes the form

    y = _mm_cmpgt_epi32(y, x);
    

    In our case, y is xboundary, but we don't want to modify xboundary. As a result, the compiler needs to introduce a temporary register:

        movdqu   xmm2, xarray[i] ; xmm2 = value
        movdqa   xmm3, xmm0      ; xmm3 = copy of xboundary
        pcmpgtd  xmm3, xmm2      ; xmm3 = test
        psubd    xmm1, xmm3
    

    We can take an instruction out of the sequence by switching to _mm_cmpgt_epi32 and adjusting our logic accordingly, taking advantage of the fact that

    x < y ⇔ ¬(x ≥ y) ⇔ ¬(x > y − 1)
    

    assuming the subtraction does not underflow. Fortunately, it doesn't in our case since boundary ranges from 0 to 10, and subtracting 1 does not put us in any danger of integer underflow.

    With this rewrite, we can switch to using _mm_cmpgt_epi32, which is more efficient for our particular scenario. Since we are now counting the values which don't meet our criteria, we need to take our final result and subtract it from 10000.

    #define GETTEST(n) __m128i test##n = _mm_cmpgt_epi32(value##n, xboundary1)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary1 = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 - _mm_cvtsi128_si32(count);
    }
    

    Notice that we have two subtractions which cancel out. We are subtracting the result of the comparison, and then we subtract the total from 10000. The two signs cancel out, and we can use addition for both. This saves an instruction in the return because subtraction is not commutative, but addition is.

    #define GETCOUNT(n) count = _mm_add_epi32(count, test##n)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary1 = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 + _mm_cvtsi128_si32(count);
    }
    

    You can look at the transformation this way: The old code considered the glass half empty. It started with zero and added 1 each time it found an entry that passed the test. The new code considers the glass half full. It assumes each entry passes the test, and it subtracts one each time it finds an element that fails the test.

    This version runs in 453 time units, an improvement of 13% over the fourfold unrolled version and an improvement of 6.5× overall.

    Okay, let's unroll sixfold, just for fun.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     int i = 0;
     {
        GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3);
         GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);
        GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
     }
     i += 4;
     for (; i < 10000 / 4; i += 6) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2);
      GETVALUE(3); GETVALUE(4); GETVALUE(5);
       GETTEST(0);  GETTEST(1);  GETTEST(2);
       GETTEST(3);  GETTEST(4);  GETTEST(5);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2);
      GETCOUNT(3); GETCOUNT(4); GETCOUNT(5);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 + _mm_cvtsi128_si32(count);
    }
    

    Since 10000 / 4 % 6 = 4, we have four values that don't fit in the loop. We deal with those values up front, and then enter the loop to get the rest.

    This version runs in 467 time units, which is 3% slower than the previous version. So I guess it's time to stop unrolling. Let's go back to the previous version which ran faster.

    The total improvement we got after all this tweaking is speedup of 6.5× over the original jumpless version. And most of that improvement (5.7×) came from unrolling the loop fourfold.

    Anyway, no real moral of the story today. I just felt like tinkering.

    Notes

    ¹ The _mm_loadu_si128 intrinsic is kind of weird. Its formal argument is a __m128i*, but since it is for loading unaligned data, the formal argument really should be __m128i __unaligned*. The problem is that the __unaligned keyword doesn't exist on x86 because prior to the introduction of MMX and SSE, x86 allowed arbitrary misaligned data. Therefore, you are in this weird situation where you have to use an aligned pointer to access unaligned data.

    Bonus chatter: Clang at optimization level 3 does autovectorization. It doesn't know some of the other tricks, like converting x + 1 to x - (-1), thereby saving an instruction and a register.

  • The Old New Thing

    A user's SID can change, so make sure to check the SID history

    • 23 Comments

    It doesn't happen often, but a user's SID can change. For example, when I started at Microsoft, my account was in the SYS-WIN4 domain, which is where all the people on the Windows 95 team were placed. At some point, that domain was retired, and my account moved to the REDMOND domain. We saw some time ago that the format of a user SID is

    S-1- version number (SID_REVISION)
    -5- SECURITY_NT_AUTHORITY
    -21- SECURITY_NT_NON_UNIQUE
    -w-x-y- the entity (machine or domain) that issued the SID
    -z the unique user ID for that entity

    The issuing entity for a local account on a machine is the machine to which the account belongs. The issuing entity for a domain account is the domain.

    If an account moves between domains, the issuing entity changes, which means that the old SID is not valid. A new SID must be issued.

    Wait, does this mean that if my account moves between domains, then I lose access to all my old stuff? All my old stuff grants access to my old SID, not my new SID.

    Fortunately, this doesn't happen, thanks to the SID history. When your account moves to the new domain, the new domain controller remembers all the previous SIDs you used to have. When you authenticate against the domain controller, it populates your token with your SID history. In my example, it means that my token not only says "This is user number 271828 on the REDMOND domain", it also says "This user used to be known as number 31415 on the SYS-WIN4 domain." That way, when the system sees an object whose ACL says, "Grant access to user 31415 on the SYS-WIN4 domain," then it should grant me access to that object.

    The existence of SID history means that recognizing users when they return is more complicated than a simple Equal­Sid, because Equal­Sid will say that "No, S-1-5-21-REDMOND-271828 is not equal to S-1-5-21-SYS-WIN4-31415," even though both SIDs refer to the same person.

    If you are going to remember a SID and then try to recognize a user when they return, you need to search the SID history for a match, in case the user changed domains between the two visits. The easiest way to do this is with the Access­Check function. For example, suppose I visited your site while I belong to the SYS-WIN4 domain, and you remembered my SID. When I return, you create a security descriptor that grants access to the SID you remembered, and then you ask Access­Check, "If I had an object that granted access only to this SID, would you let this guy access it?"

    (So far, this is just recapping stuff I discussed a few months ago. Now comes the new stuff.)

    There are a few ways of building up the security descriptor. In all the cases, we will create a security descriptor that grants the specified SID some arbitrary access, and then we will ask the operating system whether the current user has that access.

    My arbitrary access shall be

    #define FROB_ACCESS     1 // any single bit less than 65536
    

    One way to build the security descriptor is to let SDDL do the heavy lifting: Generate the string D:(A;;1;;;⟨SID⟩) and then pass it to String­Security­Descriptor­To­Security­Descriptor.

    Another is to build it up with security descriptor functions. I defer to the sample code in MSDN for an illustration.

    The hard-core way is just to build the security descriptor by hand. For a security descriptor this simple, the direct approach involves the least amount of code. Go figure.

    The format of the security descriptor we want to build is

    struct ACCESS_ALLOWED_ACE_MAX_SIZE
    {
        ACCESS_ALLOWED_ACE Ace;
        BYTE SidExtra[SECURITY_MAX_SID_SIZE - sizeof(DWORD)];
    };
    

    The ACCESS_ALLOWED_ACE_MAX_SIZE structure represents the maximum possible size of an ACCESS_ALLOWED_ACE. The ACCESS_ALLOWED_ACE leaves a DWORD for the SID (Sid­Start), so we add additional bytes afterward to accommodate the largest valid SID. If you wanted to be more C++-like, you could make ACCESS_ALLOWED_ACE_MAX_SIZE derive from ACCESS_ALLOWED_ACE.

    struct ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR
    {
        SECURITY_DESCRIPTOR_RELATIVE Header;
        ACL Acl;
        ACCESS_ALLOWED_ACE_MAX_SIZE Ace;
    };
    
    const ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR c_sdTemplate = {
      // SECURITY_DESCRIPTOR_RELATIVE
      {
        SECURITY_DESCRIPTOR_REVISION,           // Revision
        0,                                      // Reserved
        SE_DACL_PRESENT | SE_SELF_RELATIVE,     // Control
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Ace.Ace.SidStart),
                                                // Offset to owner
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Ace.Ace.SidStart),
                                                // Offset to group
        0,                                      // No SACL
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Acl),
                                                // Offset to DACL
      },
      // ACL
      {
        ACL_REVISION,                           // Revision
        0,                                      // Reserved
        sizeof(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR) -
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Acl),
                                                // ACL size
        1,                                      // ACE count
        0,                                      // Reserved
      },
      // ACCESS_ALLOWED_ACE_MAX_SIZE
      {
        // ACCESS_ALLOWED_ACE
        {
          // ACE_HEADER
          {
            ACCESS_ALLOWED_ACE_TYPE,            // AceType
            0,                                  // flags
            sizeof(ACCESS_ALLOWED_ACE_MAX_SIZE),// ACE size
          },
          FROB_ACCESS,                          // Access mask
        },
      },
    };
    

    Our template security descriptor says that it is a self-relative security descriptor with an owner, group and DACL, but no SACL. The DACL consists of a single ACE. We set up everything in the ACE except for the SID. We point the owner and group to that same SID. Therefore, this security descriptor is all ready for action once you fill in the SID.

    BOOL IsInSidHistory(HANDLE Token, PSID Sid)
    {
      DWORD SidLength = GetLengthSid(Sid);
    
      if (SidLength > SECURITY_MAX_SID_SIZE) {
        // Invalid SID. That's not good.
        // Somebody is playing with corrupted data.
        // Stop before anything bad happens.
        RaiseFailFastException(nullptr, nullptr, 0);
      }
    
      ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR Sd = c_sdTemplate;
      CopyMemory(&Sd.Ace.Ace.SidStart, Sid, SidLength);
    

    As you can see, generating the security descriptor is a simple matter of copying our template and then replacing the SID. The next step is performing an access check of the token against that SID.

      const static GENERIC_MAPPING c_GenericMappingFrob = {
        FROB_ACCESS,
        FROB_ACCESS,
        FROB_ACCESS,
        FROB_ACCESS,
      };
      PRIVILEGE_SET PrivilegeSet;
      DWORD PrivilegeSetSize = sizeof(PrivilegeSet);
      DWORD GrantedAccess = 0;
      BOOL AccessStatus = 0;
      return AccessCheck(&Sd, Token, FROB_ACCESS,
        const_cast<PGENERIC_MAPPING>(&c_GenericMappingFrob),
        &PrivilegeSet, &PrivilegeSetSize,
        &GrantedAccess, &AccessStatus) &&
        AccessStatus;
    }
    

    So let's take this guy out for a spin. Since I don't know what is in your SID history, I'm going to pick something that should be in your token already (Authenticated Users) and something that shouldn't (Local System).

    // Note: Error checking elided for expository purposes.
    
    void CheckWellKnownSid(HANDLE Token, WELL_KNOWN_SID_TYPE type)
    {
      BYTE rgbSid[SECURITY_MAX_SID_SIZE];
      DWORD cbSid = sizeof(rgbSid);
      CreateWellKnownSid(type, NULL, rgbSid, &cbSid);
      printf("Is %d in SID history? %d\n", type,
             IsInSidHistory(Token, rgbSid));
    }
    
    int __cdecl wmain(int argc, wchar_t **argv)
    {
      HANDLE Token;
      // In real life you had better error-check these calls,
      // to avoid a security hole.
      ImpersonateSelf(SecurityImpersonation);
      OpenThreadToken(GetCurrentThread(), TOKEN_QUERY, TRUE, &Token);
      RevertToSelf();
    
      CheckWellKnownSid(Token, WinAuthenticatedUserSid);
      CheckWellKnownSid(Token, WinLocalSystemSid);
      CloseHandle(Token);
    
      return 0;
    }
    

    Related reading: Hey there token, long time no see! (Did you do something with your hair?)

Page 7 of 443 (4,430 items) «56789»