• The Old New Thing

    Limiting the bottom byte of an XMM register and clearing the other bytes


    Suppose you have a value in an XMM register and you want to limit the bottom byte to a particular value and set all the other bytes to zero. (Yes, I needed to do this.)

    One way to do this is to apply the two steps in sequence:

    ; value to truncate/limit is in xmm0
    
    ; First, zero out the top 15 bytes
        pslldq  xmm0, 15
        psrldq  xmm0, 15
    
    ; Now limit the bottom byte to N
        mov     al, N
        movd    xmm1, eax
        pminub  xmm0, xmm1
    

    But you can do it all in one step by realizing that min(x, 0) = 0 for all unsigned values x.

    ; value to truncate/limit is in xmm0
        mov     eax, N
        movd    xmm1, eax
        pminub  xmm0, xmm1
    

    In pictures:

    xmm0 xmm1 xmm0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    ? min 0 = 0
    x min N = min(x, N)

    In intrinsics:

    __m128i min_low_byte_and_set_upper_bytes_to_zero(__m128i x, uint8_t N)
    {
     return _mm_min_epu8(x, _mm_cvtsi32_si128(N));
    }
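
As a quick sanity check of the one-step version, here is a minimal sketch that applies the unsigned byte minimum to a register and copies the bytes out for inspection. The helper names `limit_low_byte` and `bytes_of` are made up for this illustration, and the sketch assumes an x86 compiler with SSE2 available.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: the one-step trick, using the unsigned byte minimum (pminub).
inline __m128i limit_low_byte(__m128i x, std::uint8_t n)
{
    // _mm_cvtsi32_si128 places n in byte 0 and zeroes the other 15 bytes (movd).
    return _mm_min_epu8(x, _mm_cvtsi32_si128(n));
}

// Hypothetical helper: copy the register out so individual bytes can be inspected.
inline std::array<std::uint8_t, 16> bytes_of(__m128i v)
{
    std::array<std::uint8_t, 16> out;
    std::memcpy(out.data(), &v, sizeof(out));
    return out;
}
```

With every input byte set to 0xC8 and a limit of 100, byte 0 should come out as 100 and the other fifteen bytes as zero; with every input byte set to 7, byte 0 stays at 7.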
    
  • The Old New Thing

    Finding the leaked object reference by scanning memory: Example


    An assertion failure was hit in some code.

        // There should be no additional references to the object at this point
        assert(m_cRef == 1);
    

    But the reference count was 2. That's not good. Where is that extra reference and who took it?

    This was not code I was at all familiar with, so I went back to first principles: Let's hope that the reference was not leaked but rather that the reference was taken and not released. And let's hope that the memory hasn't been paged out. (Because debugging is an exercise in optimism.)

    1: kd> s 0 0fffffff 00 86 ec 00
    04effacc  00 86 ec 00 c0 85 ec 00-00 00 00 00 00 00 00 00  ................ // us
    0532c318  00 86 ec 00 28 05 00 00-80 6d 32 05 03 00 00 00  ....(....m2..... // rogue
    

    The first hit is the reference to the object from the code raising the assertion. The second hit is the interesting one. That's probably the rogue reference. But who is it?

    1: kd> ln 532c318
    1: kd>
    

    It does not report as belonging to any module, so it's not a global variable.

    Is it a reference from a stack variable? If so, then a stack trace of the thread with the active reference may tell us who is holding the reference and why.

    1: kd> !process -1 4
    PROCESS 907ef980  SessionId: 2  Cid: 06cc    Peb: 7f4df000  ParentCid: 0298
        DirBase: 9e983000  ObjectTable: a576f560  HandleCount: 330.
        Image: contoso.exe
    
            THREAD 8e840080  Cid 06cc.0b78  Teb: 7f4de000 Win32Thread: 9d04b3e0 WAIT
            THREAD 91e24080  Cid 06cc.08d8  Teb: 7f4dd000 Win32Thread: 00000000 WAIT
            THREAD 8e9a3580  Cid 06cc.09f8  Teb: 7f4dc000 Win32Thread: 9d102cc8 WAIT
            THREAD 8e2be080  Cid 06cc.0878  Teb: 7f4db000 Win32Thread: 9d129978 WAIT
            THREAD 82c08080  Cid 06cc.0480  Teb: 7f4da000 Win32Thread: 00000000 WAIT
            THREAD 90552400  Cid 06cc.0f5c  Teb: 7f4d9000 Win32Thread: 9d129628 WAIT
            THREAD 912c9080  Cid 06cc.02ec  Teb: 7f4d8000 Win32Thread: 00000000 WAIT
            THREAD 8e9e8680  Cid 06cc.0130  Teb: 7f4d7000 Win32Thread: 9d129cc8 READY on processor 0
            THREAD 914b8b80  Cid 06cc.02e8  Teb: 7f4d6000 Win32Thread: 9d12d568 WAIT
            THREAD 9054ab00  Cid 06cc.0294  Teb: 7f4d5000 Win32Thread: 9d12fac0 WAIT
            THREAD 909a2b80  Cid 06cc.0b54  Teb: 7f4d4000 Win32Thread: 00000000 WAIT
            THREAD 90866b80  Cid 06cc.0784  Teb: 7f4d3000 Win32Thread: 93dbb4e0 RUNNING on processor 1
            THREAD 90cfcb80  Cid 06cc.08c4  Teb: 7f3af000 Win32Thread: 93de0cc8 WAIT
            THREAD 90c39a00  Cid 06cc.0914  Teb: 7f3ae000 Win32Thread: 00000000 WAIT
            THREAD 90629480  Cid 06cc.0bc8  Teb: 7f3ad000 Win32Thread: 00000000 WAIT
    

    Now I have to dump the stack boundaries to see whether the address in question lies within the stack range.

    1: kd> dd 7f4de000 l3
    7f4de000  ffffffff 00de0000 00dd0000
    1: kd> dd 7f4dd000 l3
    7f4dd000  ffffffff 01070000 01060000
    ...
    1: kd> dd 7f4d7000 l3
    7f4d7000  ffffffff 04e00000 04df0000 // our stack
    ...
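
The range test applied to each of the dumps above can be sketched in a few lines. This assumes the usual 32-bit TEB layout, in which the second DWORD is the stack base (high address) and the third is the stack limit (low address). The names `StackRange` and `on_stack` are invented for this illustration.

```cpp
#include <cstdint>

// Stack bounds as read from the second and third DWORDs of a 32-bit TEB
// (assumed layout: exception list, stack base, stack limit).
struct StackRange
{
    std::uint32_t base;   // highest address of the stack, exclusive
    std::uint32_t limit;  // lowest address of the stack, inclusive
};

// True if the address falls inside the given thread's stack.
constexpr bool on_stack(StackRange s, std::uint32_t address)
{
    return address >= s.limit && address < s.base;
}
```

For example, `on_stack({0x00de0000, 0x00dd0000}, 0x0532c318)` is false, and the same holds for the other ranges shown.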
    

    The rogue reference did not land in any of the stack ranges, so it's probably on the heap. Fortunately, since it's on the heap, it's probably part of some larger object. And let's hope (see: optimism) that it's an object with virtual methods.

    0532c298  73617453
    0532c29c  74654d68
    0532c2a0  74616461
    0532c2a4  446e4961
    0532c2a8  00007865
    0532c2ac  00000000
    0532c2b0  76726553 USER32!_NULL_IMPORT_DESCRIPTOR  (USER32+0xb6553)
    0532c2b4  44497265
    0532c2b8  45646e49
    0532c2bc  41745378 contoso!CMumble::CMumble+0x4c
    0532c2c0  00006873
    0532c2c4  00000000
    0532c2c8  4e616843
    0532c2cc  79546567
    0532c2d0  4e496570
    0532c2d4  00786564
    0532c2d8  2856662a
    0532c2dc  080a9b87
    0532c2e0  00f59fa0
    0532c2e4  05326538
    0532c2e8  00000000
    0532c2ec  00000000
    0532c2f0  0000029c
    0532c2f4  00000001
    0532c2f8  00000230
    0532c2fc  fdfdfdfd
    0532c300  45ea1370 contoso!CFrumble::`vftable'
    0532c304  45ea134c contoso!CFrumble::`vftable'
    0532c308  00000000
    0532c30c  05b9a040
    0532c310  00000002
    0532c314  00000001
    0532c318  00ec8600
    

    Hooray, there is a vtable a few bytes before the pointer, and the contents of the memory do appear to match a CFrumble object, so I think we found our culprit.

    I was able to hand off the next stage of the investigation (why is a Frumble being created with a reference to the object?) to another team member with more expertise with Frumbles.

    (In case anybody cared, the conclusion was that this was a variation of a known bug.)

  • The Old New Thing

    What happens if I don't pass a pCreateExParams to CreateFile2?


    The final pCreateExParams parameter to the CreateFile2 function is optional. What happens if I pass NULL?

    If you pass NULL as the pCreateExParams parameter, then the function behaves as if you had passed a pointer to this structure:

    CREATEFILE2_EXTENDED_PARAMETERS defaultCreateExParams =
    {
     sizeof(defaultCreateExParams), // dwSize
     0, // dwFileAttributes
     0, // dwFileFlags
     0, // dwSecurityQosFlags
     NULL, // lpSecurityAttributes
     NULL, // hTemplateFile
    };
    
  • The Old New Thing

    Why is there a BSTR cache anyway?


    The Sys­Alloc­String function uses a cache of BSTRs which can mess up your performance tracing. There is a switch for disabling it for debugging purposes, but why does the cache exist at all?

    The BSTR cache is a historical artifact. When BSTRs were originally introduced, performance tracing showed that a significant chunk of time was spent merely allocating and freeing memory in the heap manager. Using a cache reduced the heap allocation overhead significantly.

    In the intervening years, um, decades, the performance of the heap manager has improved to the point where the cache isn't necessary any more. But the SysAllocString people can't get rid of the BSTR cache because so many applications unwittingly rely on it.

    The BSTR cache is now a compatibility constraint.

  • The Old New Thing

    No good deed goes unpunished: Marking a document as obsolete


    I was contacted by a customer support liaison who was hoping that I could help them understand Feature X.

    I saw your name on a "Feature X technical specification" document in the Windows specification repository, and I was hoping you could answer a few questions for me, or redirect me to somebody who can.

    I was puzzled why this person saw my name on the "Feature X technical specification" document in the Windows specification repository, because I was not the author of that specification. I went to the specification repository, opened the document in question, and nope, my name appears nowhere in it.

    I asked, "What gave you the impression that I had anything to do with Feature X? XYZ can help you with your questions; he's the one listed as the author of the document."

    The response was, "Oh, I'm sorry. I didn't actually read the specification. I merely did a search through the entire repository for Feature X, and the "Feature X technical specification" is the one that showed up as most recently updated by you. In the past, this technique has been pretty good at finding someone who can help with a feature. Sorry about that."

    I went back and took another look at the document, and then I remembered why I updated it: My duties at the time included reviewing all documents that met certain criteria, such as this particular document. I had some feedback about the document for the author, who told me, "Oh, that's an obsolete version of the document, but it's retained for historical purposes. The current one is over there." To save the next person some time, I edited the obsolete document by inserting in big letters at the top, "TECHNICAL DOCUMENTATION FOR THIS FEATURE HAS MOVED TO ⟨new location⟩. THIS DOCUMENT IS OBSOLETE." I could've asked the author to do this, but I had the document open already, so I figured I'd save a few steps (ask author to update document, wait for reply, reopen document to verify that edit occurred) and just do it myself.

    Boom, no good deed goes unpunished. My update was made long after the real technical specification was completed. As a result, of all the documents on Feature X, not only is it the obsolete one that shows up as most recently updated, but I am the one listed as the person who made that most recent update.

    Next time, I'll try to remember to do things the long way, even though it is a big hassle for everybody.

    "Please update the document to indicate that it is obsolete and redirect the reader to the current document."

    — Could you do that? You've already got the document open.

    "No, I used to do that, but it came back and bit me, because I became the last person to edit the document, and then everybody came to me with questions about the document instead of you."

    — You do realize that in the time you tried to convince me to do it, you could've just done it.

    Follow-up: I tried it, and sometimes the response was "I'm really busy now, I'll get around to it in a few weeks." Now I have to create a reminder task in two weeks to follow up. More hassle for everybody.

    I think the next time this happens, I'll write back, "I'm coming over to your office. I'll make the one-line edit on your computer so that your name is the one attached to the edit."

  • The Old New Thing

    More notes on calculating constants in SSE registers


    A few weeks ago I noted some tricks for creating special bit patterns in all lanes, but I forgot to cover the case where you treat the 128-bit register as one giant lane: Setting all of the least significant N bits or all of the most significant N bits.

    This is a variation of the trick for setting a bit pattern in all lanes, but the catch is that the pslldq instruction shifts by bytes, not bits.

    We'll assume that N is not a multiple of eight, because if it were a multiple of eight, then the pslldq or psrldq instruction does the trick (after using pcmpeqd to fill the register with ones).

    One case is if N ≤ 64. This is relatively easy because we can build the value by first building the desired value in both 64-bit lanes, and then finishing with a big pslldq or psrldq to clear the lane we don't like.

    ; set the bottom N bits, where N ≤ 64
    pcmpeqd xmm0, xmm0      ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
    ; unsigned shift right 64 − N bits in each 64-bit lane
    psrlq   xmm0, 64 - N    ; 0000 0000 0FFF FFFF 0000 0000 0FFF FFFF
    ; unsigned shift right 64 bits
    psrldq  xmm0, 8         ; 0000 0000 0000 0000 0000 0000 0FFF FFFF

    ; set the top N bits, where N ≤ 64
    pcmpeqd xmm0, xmm0      ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
    ; unsigned shift left 64 − N bits in each 64-bit lane
    psllq   xmm0, 64 - N    ; FFFF FFF0 0000 0000 FFFF FFF0 0000 0000
    ; unsigned shift left 64 bits
    pslldq  xmm0, 8         ; FFFF FFF0 0000 0000 0000 0000 0000 0000
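
For readers working in intrinsics rather than assembly, the N ≤ 64 case might look like the following. This is a minimal sketch, assuming SSE2 and a compile-time N. The function and helper names (`low_bits_up_to_64`, `high_bits_up_to_64`, `halves_of`) are made up for this example, and the compiler is free to choose a different instruction sequence.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstring>

// Set the bottom N bits of the 128-bit value, for 1 <= N <= 64.
template <int N>
inline __m128i low_bits_up_to_64()
{
    static_assert(N >= 1 && N <= 64, "this sketch handles 1 <= N <= 64");
    __m128i v = _mm_set1_epi32(-1);   // all ones
    v = _mm_srli_epi64(v, 64 - N);    // psrlq: N ones at the bottom of each 64-bit lane
    return _mm_srli_si128(v, 8);      // psrldq: drop the upper lane, zero-fill
}

// Set the top N bits of the 128-bit value, for 1 <= N <= 64.
template <int N>
inline __m128i high_bits_up_to_64()
{
    static_assert(N >= 1 && N <= 64, "this sketch handles 1 <= N <= 64");
    __m128i v = _mm_set1_epi32(-1);   // all ones
    v = _mm_slli_epi64(v, 64 - N);    // psllq: N ones at the top of each 64-bit lane
    return _mm_slli_si128(v, 8);      // pslldq: drop the lower lane, zero-fill
}

// Inspection helper: view the register as low and high 64-bit halves.
struct Halves { std::uint64_t lo, hi; };
inline Halves halves_of(__m128i v)
{
    Halves h;
    std::memcpy(&h, &v, sizeof(h));
    return h;
}
```

With N = 28, the two functions reproduce the bit patterns shown in the comments of the assembly listing.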

    If N ≥ 80, then we shift zeroes into both the top and bottom halves, but then use a shuffle to patch up the half that needs to stay all-ones.

    ; set the bottom N bits, where N ≥ 80
    pcmpeqd xmm0, xmm0      ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
    ; unsigned shift right 128 − N bits in each 64-bit lane
    psrlq   xmm0, 128 - N   ; 0000 0000 0FFF FFFF 0000 0000 0FFF FFFF
    ; shuffle: copy the lowest 16 bits across the lower half
    pshuflw xmm0, xmm0, _MM_SHUFFLE(0, 0, 0, 0) ; 0000 0000 0FFF FFFF FFFF FFFF FFFF FFFF

    ; set the top N bits, where N ≥ 80
    pcmpeqd xmm0, xmm0      ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
    ; unsigned shift left 128 − N bits in each 64-bit lane
    psllq   xmm0, 128 - N   ; FFFF FFF0 0000 0000 FFFF FFF0 0000 0000
    ; shuffle: copy the highest 16 bits across the upper half
    pshufhw xmm0, xmm0, _MM_SHUFFLE(3, 3, 3, 3) ; FFFF FFFF FFFF FFFF FFFF FFF0 0000 0000

    We have N ≥ 80, which means that 128 - N ≤ 48, which means that there are at least 16 bits of ones left in the low-order bits of each 64-bit lane after we shift right. We then use a 4×16-bit shuffle to copy those known-all-ones 16 bits into the other lanes of the lower half. (A similar argument applies to setting the top bits.)
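
The same shift-then-shuffle idea can be written with intrinsics. The following is a minimal sketch, assuming SSE2 and a compile-time N. The names `low_bits_from_80`, `high_bits_from_80`, and `halves_of` are hypothetical and exist only for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also provides _MM_SHUFFLE)
#include <cstdint>
#include <cstring>

// Set the bottom N bits, for 80 <= N <= 128.
template <int N>
inline __m128i low_bits_from_80()
{
    static_assert(N >= 80 && N <= 128, "this sketch handles 80 <= N <= 128");
    __m128i v = _mm_set1_epi32(-1);      // all ones
    v = _mm_srli_epi64(v, 128 - N);      // psrlq: upper lane is now correct
    // pshuflw: word 0 is still all ones, so replicate it across the lower half.
    return _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 0, 0, 0));
}

// Set the top N bits, for 80 <= N <= 128.
template <int N>
inline __m128i high_bits_from_80()
{
    static_assert(N >= 80 && N <= 128, "this sketch handles 80 <= N <= 128");
    __m128i v = _mm_set1_epi32(-1);      // all ones
    v = _mm_slli_epi64(v, 128 - N);      // psllq: lower lane is now correct
    // pshufhw: the top word is still all ones, so replicate it across the upper half.
    return _mm_shufflehi_epi16(v, _MM_SHUFFLE(3, 3, 3, 3));
}

// Inspection helper: view the register as low and high 64-bit halves.
struct Halves { std::uint64_t lo, hi; };
inline Halves halves_of(__m128i v)
{
    Halves h;
    std::memcpy(&h, &v, sizeof(h));
    return h;
}
```

The boundary case N = 80 is the interesting one: the shift leaves exactly 16 one-bits at the end of each lane, which is just enough for the shuffle to work with.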

    This leaves 64 < N < 80. That uses a different trick:

    ; set the bottom N bits, where N ≤ 120
    pcmpeqd xmm0, xmm0      ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
    ; unsigned shift right 8 bits
    psrldq  xmm0, 1         ; 00FF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
    ; signed shift right 120 − N bits
    psrad   xmm0, 120 - N   ; 0000 00FF FFFF FFFF FFFF FFFF FFFF FFFF

    The sneaky trick here is that we use a signed shift in order to preserve the bottom half. Unfortunately, there is no corresponding left shift that shifts in ones, so the best I can come up with is four instructions:

    ; set the top N bits, where 64 ≤ N ≤ 96
    pcmpeqd xmm0, xmm0      ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
    ; unsigned shift left 96 − N bits in each 64-bit lane
    psllq   xmm0, 96 - N    ; FFFF FFFF FFF0 0000 FFFF FFFF FFF0 0000
    ; shuffle
    pshufd  xmm0, xmm0, _MM_SHUFFLE(3, 3, 1, 0) ; FFFF FFFF FFFF FFFF FFFF FFFF FFF0 0000
    ; unsigned shift left 32 bits
    pslldq  xmm0, 4         ; FFFF FFFF FFFF FFFF FFF0 0000 0000 0000

    We view the 128-bit register as four 32-bit lanes and split the shift into two steps. First, we fill Lane 0 with the value we ultimately want in Lane 1, then we patch up the damage we did to Lane 2, then we shift the 128-bit value left 32 places to slide the value into position and zero-fill Lane 0.
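
Here is how that four-step sequence might look with intrinsics. It is a minimal sketch, assuming SSE2 and a compile-time N. The names `high_bits_64_to_96` and `halves_of` are invented for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also provides _MM_SHUFFLE)
#include <cstdint>
#include <cstring>

// Set the top N bits, for 64 <= N <= 96.
template <int N>
inline __m128i high_bits_64_to_96()
{
    static_assert(N >= 64 && N <= 96, "this sketch handles 64 <= N <= 96");
    __m128i v = _mm_set1_epi32(-1);                     // all ones
    v = _mm_slli_epi64(v, 96 - N);                      // psllq: Lane 0 gets the future Lane 1 value
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 1, 0));  // pshufd: repair Lane 2 from Lane 3
    return _mm_slli_si128(v, 4);                        // pslldq: slide up 32 bits, zero-fill Lane 0
}

// Inspection helper: view the register as low and high 64-bit halves.
struct Halves { std::uint64_t lo, hi; };
inline Halves halves_of(__m128i v)
{
    Halves h;
    std::memcpy(&h, &v, sizeof(h));
    return h;
}
```

For N = 76, the upper half is all ones and the lower half is 0xFFF0000000000000, which is 64 + 12 = 76 one-bits counted from the top.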

    Note that a lot of the ranges of N overlap, so you often have a choice of solutions. There are other three-instruction solutions I didn't bother presenting here. The only one I couldn't find a three-instruction solution for was setting the top N bits where 64 < N < 80.

    If you find a three-instruction solution for this last case, share it in the comments.

  • The Old New Thing

    How do you prevent the linker from discarding a function you want to make available for debugging?


    We saw some time ago that you can ask the Windows symbolic debugger engine to call a function directly from the debugger. To do this, of course, the function needs to exist.

    But what if you want a function for the sole purpose of debugging? It never gets called from the main program, so the linker will declare the code dead and remove it.

    One sledgehammer solution is to disable discarding of unused functions. This is a global solution to a local problem, since you are now preventing the discard of any unused function, even though all you care about is one specific function.

    If you are comfortable hard-coding function decorations for specific architectures, you can use the /INCLUDE directive.

    #if defined(_X86_)
    #define DecorateCdeclFunctionName(fn) "_" #fn
    #elif defined(_AMD64_)
    #define DecorateCdeclFunctionName(fn) #fn
    #elif defined(_IA64_)
    #define DecorateCdeclFunctionName(fn) "." #fn
    #elif defined(_ALPHA_)
    #define DecorateCdeclFunctionName(fn) #fn
    #elif defined(_MIPS_)
    #define DecorateCdeclFunctionName(fn) #fn
    #elif defined(_PPC_)
    #define DecorateCdeclFunctionName(fn) ".." #fn
    #else
    #error Unknown architecture - don't know how it decorates cdecl.
    #endif
    #pragma comment(linker, "/include:" DecorateCdeclFunctionName(TestMe))
    EXTERN_C void __cdecl TestMe(int x, int y)
    {
        ...
    }
    

    If you are not comfortable with that (and I don't blame you), you can create a false reference to the debugging function that cannot be optimized out. You do this by passing a pointer to the debugging function to a helper function outside your module that doesn't do anything interesting. Since the helper function is not in your module, the compiler doesn't know that the helper function doesn't do anything, so it cannot optimize out the debugging function.

    struct ForceFunctionToBeLinked
    {
      ForceFunctionToBeLinked(const void *p) { SetLastError(PtrToInt(p)); }
    };
    
    ForceFunctionToBeLinked forceTestMe(TestMe);
    

    The call to SetLastError merely updates the thread's last-error code, but since this is not called at a time when anybody cares about the last-error code, it has no meaningful effect. The compiler doesn't know that, though, so it has to generate the code, and that forces the function to be linked.

    The nice thing about this technique is that the optimizer sees that this class has no data members, so no data gets generated into the module's data segment. The not-nice thing about this technique is that it is kind of opaque.

  • The Old New Thing

    Horrifically nasty gotcha: FindResource and FindResourceEx


    The FindResourceEx function is an extension of the FindResource function in that it allows you to specify a particular language fork in which to search for the resource. Calling the FindResource function is equivalent to calling FindResourceEx and passing zero as the wLanguage.

    Except for the horrible nasty gotcha: The second and third parameters to Find­Resource­Ex are in the opposite order compared to the second and third parameters to Find­Resource!

    In other words, if you are adding custom language support to a program, you cannot just stick a wLanguage parameter on the end when you switch from Find­Resource to Find­Resource­Ex. You also have to flip the second and third parameters.

    Original code:                  FindResource(hModule, MAKEINTRESOURCE(IDB_MYBITMAP), RT_BITMAP)
    You change it to:               FindResourceEx(hModule, MAKEINTRESOURCE(IDB_MYBITMAP), RT_BITMAP, 0)
    You should have changed it to:  FindResourceEx(hModule, RT_BITMAP, MAKEINTRESOURCE(IDB_MYBITMAP), 0)

    The nasty part of this is that since the second and third parameters are the same type, the compiler won't notice that you got them backward. The only way you find out is that your resource code suddenly stopped working.

  • The Old New Thing

    2014 year-end link clearance


    Another round of the semi-annual link clearance.

  • The Old New Thing

    Even the publishing department had its own Year 2000 preparedness plan


    On December 31, 1999, Microsoft Product Support Services were ready in case something horrible happened as the calendar rolled over into the new year.

    I'm told that Microsoft Press also had its own Year 2000 plan. They staffed their helpline continuously from Friday evening, December 31, 1999, all the way through Sunday, January 2, 2000. They did this even though Microsoft Press did not normally staff its helpline outside normal business hours, and even though all sample code in all publications comes with a disclaimer that it is provided "as is" with no warranty.

    I do not know if they took any calls, but I suspect not.
