• The Old New Thing

    Integer signum in SSE


    The signum function is defined as follows:

    signum(x) =  −1  if x < 0
    signum(x) =   0  if x = 0
    signum(x) =  +1  if x > 0

    There are a couple of ways of calculating this in SSE integers.

    One way is to convert the C idiom

    int signum(int x) { return (x > 0) - (x < 0); }

    The SSE translation of this is mostly straightforward. The quirk is that the SSE comparison functions return −1 to indicate true, whereas C uses +1 to represent true. But this is easy to take into account:

    x > 0  ⇔  − pcmpgt(x, 0)
    x < 0  ⇔  − pcmpgt(0, x)

    Substituting this into the original signum function, we get

    signum(x) =  (x > 0)  −  (x < 0)
              =  − pcmpgt(x, 0)  −  (− pcmpgt(0, x))
              =  − pcmpgt(x, 0)  +  pcmpgt(0, x)
              =  pcmpgt(0, x)  −  pcmpgt(x, 0)

    In assembly:

            ; assume x is in xmm0
            pxor    xmm1, xmm1
            pxor    xmm2, xmm2
            pcmpgtw xmm1, xmm0 ; xmm1 = pcmpgt(0, x)
            pcmpgtw xmm0, xmm2 ; xmm0 = pcmpgt(x, 0)
            psubw   xmm1, xmm0 ; xmm1 = pcmpgt(0, x) - pcmpgt(x, 0) = signum
            ; answer is in xmm1

    With intrinsics:

    __m128i signum16(__m128i x)
    {
        return _mm_sub_epi16(_mm_cmpgt_epi16(_mm_setzero_si128(), x),
                             _mm_cmpgt_epi16(x, _mm_setzero_si128()));
    }

    This pattern extends mutatis mutandis to signum8, signum32, and signum64.

    Another solution is to use the signed minimum and maximum opcodes, using the formula

    signum(x) = min(max(x, −1), +1)

    In assembly:

            ; assume x is in xmm0
            pcmpeqw xmm1, xmm1 ; xmm1 = -1 in all lanes
            pmaxsw  xmm0, xmm1
            psrlw   xmm1, 15   ; xmm1 = +1 in all lanes
            pminsw  xmm0, xmm1
            ; answer is in xmm0

    With intrinsics:

    __m128i signum16(__m128i x)
    {
        // alternatively: minusones = _mm_set1_epi16(-1);
        __m128i minusones = _mm_cmpeq_epi16(_mm_setzero_si128(),
                                            _mm_setzero_si128());
        x = _mm_max_epi16(x, minusones);
        // alternatively: ones = _mm_set1_epi16(1);
        __m128i ones = _mm_srli_epi16(minusones, 15);
        x = _mm_min_epi16(x, ones);
        return x;
    }
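The two SSE2 formulas so far (compare-and-subtract, and clamping between −1 and +1) can be sanity-checked with a portable scalar model of a single 16-bit lane. This is only a sketch: it does not execute any SSE instruction, and the helper names (`lane_cmpgt`, `signum_by_compare`, `signum_by_clamp`, `signum_models_agree`) are made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar stand-in for one 16-bit lane of pcmpgtw:
// -1 (all bits set) when a > b, otherwise 0.
inline std::int16_t lane_cmpgt(std::int16_t a, std::int16_t b) {
    return a > b ? std::int16_t(-1) : std::int16_t(0);
}

// Compare-and-subtract formula: pcmpgt(0, x) - pcmpgt(x, 0).
inline std::int16_t signum_by_compare(std::int16_t x) {
    return static_cast<std::int16_t>(lane_cmpgt(0, x) - lane_cmpgt(x, 0));
}

// Clamp formula: min(max(x, -1), +1).
inline std::int16_t signum_by_clamp(std::int16_t x) {
    const std::int16_t lo = -1, hi = 1;
    return std::min(std::max(x, lo), hi);
}

// The C idiom the derivation starts from.
inline int signum_reference(int x) { return (x > 0) - (x < 0); }

// Exhaustively compare both models against the C idiom
// for every possible 16-bit lane value.
inline bool signum_models_agree() {
    for (int v = INT16_MIN; v <= INT16_MAX; ++v) {
        const std::int16_t x = static_cast<std::int16_t>(v);
        if (signum_by_compare(x) != signum_reference(v)) return false;
        if (signum_by_clamp(x) != signum_reference(v)) return false;
    }
    return true;
}
```

Because a 16-bit lane has only 65,536 possible values, the exhaustive loop is cheap, and the same approach should work for checking the real intrinsics versions on a machine that supports them.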

    The catch here is that SSE2 supports only 16-bit signed minimum and maximum; to get other bit sizes, you need to bump up to SSE4.1. But if you're going to do that, you may as well use the psign instruction, which has been available since SSSE3. In assembly:

            ; assume x is in xmm0
            pcmpeqw xmm1, xmm1 ; xmm1 = -1 in all lanes
            psrlw   xmm1, 15   ; xmm1 = +1 in all lanes
            psignw  xmm1, xmm0 ; apply sign of x to xmm1
            ; answer is in xmm1

    With intrinsics:

    __m128i signum16(__m128i x)
    {
        // alternatively: ones = _mm_set1_epi16(1);
        __m128i minusones = _mm_cmpeq_epi16(_mm_setzero_si128(),
                                            _mm_setzero_si128());
        __m128i ones = _mm_srli_epi16(minusones, 15);
        return _mm_sign_epi16(ones, x);
    }

    The psign instruction applies the sign of its second argument to its first argument. We load up the first argument with the value +1 in all lanes, then apply the sign of x, which negates the value if the corresponding lane of x is negative, sets the value to zero if the lane is zero, and leaves it alone if the corresponding lane is positive.
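The per-lane behavior described above can be written out as a small scalar model. This is a sketch of the semantics only, not a use of the real instruction, and `lane_psign` and `signum_by_psign` are hypothetical names.

```cpp
#include <cstdint>

// Scalar stand-in for one 16-bit lane of psignw(a, b):
// negate a when b is negative, zero it when b is zero,
// and pass it through unchanged when b is positive.
inline std::int16_t lane_psign(std::int16_t a, std::int16_t b) {
    if (b < 0) return static_cast<std::int16_t>(-a);
    if (b == 0) return 0;
    return a;
}

// Signum is then just "apply the sign of x to +1".
inline std::int16_t signum_by_psign(std::int16_t x) {
    return lane_psign(1, x);
}
```

Since the first argument is always +1 here, the negation can never overflow, so the most negative lane value needs no special handling.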


    Debugging walkthrough: Access violation on nonsense instruction


    A colleague of mine asked for help puzzling out a mysterious crash dump which arrived via Windows Error Reporting.

    rax=00007fff219c5000 rbx=00000000023c8380 rcx=00000000023c8380
    rdx=0000000000000000 rsi=00000000043f0148 rdi=0000000000000000
    rip=00007fff21af2d22 rsp=000000000392e518 rbp=000000000392e580
     r8=00000000276e4639  r9=00000000043b2360 r10=00000000ffffffff
    r11=0000000000000000 r12=0000000000000001 r13=0000000000000000
    r14=000000000237cfc0 r15=00000000023d3ea0
    iopl=0         nv up ei pl zr na po nc
    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
    00007fff`21af2d22 30488b xor byte ptr [rax-75h],cl ds:00007fff`219c4f8b=41

    Well that's a pretty strange instruction. Especially since it doesn't match up with the source code at all.

    void CNosebleed::OnFrimble(...)
    {
        if (CanFrumble(...)) { ... }
        else {
            hr = pCereal->AddMilk(pCarton);
            if (SUCCEEDED(hr)) {
                pCereal->Snap();
                pCereal->Crackle(false);
                if (SUCCEEDED(pCereal->Pop(uId))) { ... } // ← crash here
            }
        }
    }

    There is no bit-toggling in the actual code. The method calls to Snap, Crackle, and Pop are all interface calls and therefore should be vtable calls. We are clearly in a case of a bogus return address, possibly a stack smash (and therefore cause for concern from a security standpoint).

    My approach was to try to figure out what was happening just before the crash. And that meant figuring out how we ended up in the middle of an instruction.

    Here is the code surrounding the crash point.

    00007fff`21af2d17 ff90d0020000    call    qword ptr [rax+2D0h]
    00007fff`21af2d1d 488b03          mov     rax,qword ptr [rbx]
    00007fff`21af2d20 8b5530          mov     edx,dword ptr [rbp+30h]
    00007fff`21af2d23 488bcb          mov     rcx,rbx

    Notice that the code that crashed is actually the last byte of the mov edx, dword ptr [rbp+30h] (the 30) and the first two bytes of the mov rcx, rbx (the 488b).

    Disassembling backward is a tricky business on a processor with variable-length instructions, so to get my bearings, I looked for the call to CanFrumble:

    0:011> #CanFrumble nosebleed!CNosebleed::OnFrimble
    00007fff`21af2c43 e8e0e40f00 call nosebleed!CNosebleed::CanFrumble

    The # command means "Start disassembling at the specified location and stop when you see the string I passed." This is an automated way of just hitting u until you get to the thing you are looking for.

    Now that I am at some known good code, I can disassemble forward:

    00007fff`21af2c48 488bcb          mov     rcx,rbx
    00007fff`21af2c4b 84c0            test    al,al
    00007fff`21af2c4d 0f849a000000    je      nosebleed!CNosebleed::OnFrimble+0x1f88e5 (00007fff`21af2ced)

    The above instructions check whether CanFrumble returned true and, if not, jump to 00007fff`21af2ced. Since we know that we are in the false path, we follow the jump.

    // Make a vtable call into pCereal->AddMilk()
    00007fff`21af2ced 488b03          mov     rax,qword ptr [rbx] ; vtable
    00007fff`21af2cf0 498bd7          mov     rdx,r15 ; pCarton
    00007fff`21af2cf3 ff9068010000    call    qword ptr [rax+168h] ; call
    00007fff`21af2cf9 8bf8            mov     edi,eax ; save to hr
    00007fff`21af2cfb 85c0            test    eax,eax ; succeeded?
    00007fff`21af2cfd 0f880dffffff    js      nosebleed!CNosebleed::OnFrimble+0x1f8808 (00007fff`21af2c10)
    // Now call Snap()
    00007fff`21af2d03 488b03          mov     rax,qword ptr [rbx] ; vtable
    00007fff`21af2d06 488bcb          mov     rcx,rbx ; "this"
    00007fff`21af2d09 ff9070020000    call    qword ptr [rax+270h] ; Snap
    // Now call Crackle()
    00007fff`21af2d0f 488b03          mov     rax,qword ptr [rbx] ; vtable
    00007fff`21af2d12 33d2            xor     edx,edx ; parameter: false
    00007fff`21af2d14 488bcb          mov     rcx,rbx ; "this"
    00007fff`21af2d17 ff90d0020000    call    qword ptr [rax+2D0h] ; Crackle
    // Get ready to Pop
    00007fff`21af2d1d 488b03          mov     rax,qword ptr [rbx] ; vtable
    00007fff`21af2d20 8b5530          mov     edx,dword ptr [rbp+30h] ; uId
    00007fff`21af2d23 488bcb          mov     rcx,rbx ; "this"

    But we never got to execute the Pop because our return address from Crackle got messed up.

    Let's follow the call into Crackle.

    0:011> dps @rbx l1
    00000000`02b4b790  00007fff`219c50a0 nosebleed!CCereal::`vftable'
    0:011> dps 00007fff`219c50a0+2d0 l1
    00007fff`219c5370  00007fff`21aa5c28 nosebleed!CCereal::Crackle
    0:011> u 00007fff`21aa5c28
    00007fff`21aa5c28 889163010000    mov     byte ptr [rcx+163h],dl
    00007fff`21aa5c2e c3              ret

    So at least the pCereal pointer seems to be okay. It has a vtable and the slot in the vtable points to the function we expect. The Crackle method merely stashes the bool parameter into a member variable. No stack corruption here because rbx is nowhere near rsp.

    0:012> db @rbx+163 l1
    00000000`02b4b8f3  ??                                               ?

    Sadly, the byte in question was not captured in the dump, so we cannot verify whether the call actually was made. Similarly, the members of CCereal manipulated by the Snap method were also not captured in the dump, so we can't verify that either. (The only member of CCereal captured in the dump is the vtable itself.)

    So we can't find any evidence one way or the other as to whether any of the calls leading up to Pop actually occurred. Maybe we can try to figure out how many misaligned instructions we managed to execute before we crashed, see if that reveals anything. To do this, I'm going to disassemble at varying incorrect offsets and see which ones lead to the instruction that crashed.

    0:011> u .-1 l2
    00007fff`21af2d21 55              push    rbp
    00007fff`21af2d22 30488b          xor     byte ptr [rax-75h],cl
    // ^^ this looks interesting; we'll come back to it
    0:011> u .-3 l2
    00007fff`21af2d1f 038b5530488b    add     ecx,dword ptr [rbx-74B7CFABh]
    00007fff`21af2d25 cb              retf
    // ^^ this doesn't lead to the crashed instruction
    0:011> u .-4 l2
    00007fff`21af2d1e 8b03            mov     eax,dword ptr [rbx]
    00007fff`21af2d20 8b5530          mov     edx,dword ptr [rbp+30h]
    // ^^ this doesn't lead to the crashed instruction
    0:012> u .-5 l3
    00007fff`21af2d1c 00488b          add     byte ptr [rax-75h],cl
    00007fff`21af2d1f 038b5530488b    add     ecx,dword ptr [rbx-74B7CFABh]
    00007fff`21af2d25 cb              retf
    // ^^ this doesn't lead to the crashed instruction
    0:012> u .-6 l3
    00007fff`21af2d1b 0000            add     byte ptr [rax],al
    00007fff`21af2d1d 488b03          mov     rax,qword ptr [rbx]
    00007fff`21af2d20 8b5530          mov     edx,dword ptr [rbp+30h]
    // ^^ this doesn't lead to the crashed instruction

    Exercise: Why didn't I bother checking .-2?

    You only need to test as far back as the maximum instruction length, and in practice you can give up much sooner because the maximum instruction length involves a lot of prefixes which are unlikely to occur in real code.
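The rewind search can be replayed as plain address arithmetic. The sketch below hard-codes the instruction lengths that the debugger reported in the transcript above (keyed by the low 16 bits of the address, purely as shorthand) and asks which starting offsets fall straight through to the crashing instruction at ...2d22. The names `decoded_lengths` and `falls_through_to` are hypothetical, and no actual x86 decoding is performed.

```cpp
#include <cstdint>
#include <map>

// Instruction lengths in bytes, as decoded by the debugger in the
// transcript, keyed by the low 16 bits of the instruction address.
inline const std::map<std::uint32_t, int>& decoded_lengths() {
    static const std::map<std::uint32_t, int> lengths = {
        {0x2d1b, 2},  // add byte ptr [rax],al
        {0x2d1c, 3},  // add byte ptr [rax-75h],cl
        {0x2d1d, 3},  // mov rax,qword ptr [rbx]
        {0x2d1e, 2},  // mov eax,dword ptr [rbx]
        {0x2d1f, 6},  // add ecx,dword ptr [rbx-74B7CFABh]
        {0x2d20, 3},  // mov edx,dword ptr [rbp+30h]
        {0x2d21, 1},  // push rbp
    };
    return lengths;
}

// Returns true if straight-line execution starting at `start`
// lands exactly on `target` without overshooting it.
inline bool falls_through_to(std::uint32_t start, std::uint32_t target) {
    const auto& lengths = decoded_lengths();
    std::uint32_t ip = start;
    while (ip < target) {
        const auto it = lengths.find(ip);
        if (it == lengths.end()) return false;  // no decoding recorded here
        ip += static_cast<std::uint32_t>(it->second);
    }
    return ip == target;
}
```

With these recorded lengths, only the start at ...2d21 reaches ...2d22; every other candidate steps over it.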

    The only single-instruction rewind that makes sense is the push rbp. Let's see if it matches.

    0:011> ?? @rbp
    unsigned int64 0x453e700
    0:011> dps @rsp l1
    00000000`0453e698  00000000`0453e700

    Yup, it lines up. This wayward push is also consistent with the stack frame layout for the function.

    00007fff`218fa408 48895c2410      mov     qword ptr [rsp+10h],rbx
    00007fff`218fa40d 4889742418      mov     qword ptr [rsp+18h],rsi
    00007fff`218fa412 55              push    rbp
    00007fff`218fa413 57              push    rdi
    00007fff`218fa414 4154            push    r12
    00007fff`218fa416 4156            push    r14
    00007fff`218fa418 4157            push    r15
    00007fff`218fa41a 488bec          mov     rbp,rsp
    00007fff`218fa41d 4883ec60        sub     rsp,60h

    The values of rbp and rsp should differ by 0x60.

    0:012> ?? @rbp-@rsp
    unsigned int64 0x68

    The difference is in error by 8 bytes, exactly the size of the rbp register that was pushed.

    It therefore seems highly likely that the push rbp was executed.
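The frame arithmetic can be spelled out as compile-time checks. This is only a restatement of the numbers shown in the debugger output above (the rbp value from `?? @rbp`, and the stack address printed by `dps @rsp l1`); the constant names are made up for illustration.

```cpp
#include <cstdint>

constexpr std::uint64_t kRbp = 0x453e700;    // value reported by "?? @rbp"
constexpr std::uint64_t kRsp = 0x453e698;    // address shown by "dps @rsp l1"
constexpr std::uint64_t kFrameSize = 0x60;   // from "sub rsp,60h" in the prologue
constexpr std::uint64_t kPushSize = 8;       // one 64-bit register push

// The observed gap between rbp and rsp.
static_assert(kRbp - kRsp == 0x68, "observed rbp - rsp");

// The gap exceeds the expected frame size by exactly one push.
static_assert(kRbp - kRsp - kFrameSize == kPushSize,
              "excess equals the size of one pushed register");
```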

    Repeating the exercise to find the instruction before the push rbp shows that no instruction fell through to the push rbp. Therefore, execution jumped to 00007fff`21af2d21 somehow.

    Another piece of data is that rax matches the value we expect it to have, sort of. Here are some selected lines from earlier in the debug session:

    // What we expected to have executed
    00007fff`21af2d1d 488b03          mov     rax,qword ptr [rbx]
    // The value we expected to have fetched
    0:011> dps @rbx l1
    00000000`02b4b790  00007fff`219c50a0 nosebleed!CCereal::`vftable'
    // The value in the rax register
    rax=00007fff219c5000 ...

    The value we expect is 00007fff`219c50a0, but the value in the register has the bottom eight bits cleared.

    Putting this all together, my theory is that the CPU executed the instruction at 00007fff`21af2d1d, and then due to some sort of hardware failure, instead of incrementing the rip register by three, it (1) incremented it by four, and then (2) as part of its confusion, zeroed out the bottom byte of rax. The erroneous rip led to the rogue push rbp and the crash on the nonsensical xor.

    It's not a great theory, but it's all I got.

    As to what sort of hardware failure could have occurred: This particular failure was reported twice, so a cosmic ray is less likely to be the culprit (because you have to get lightning to strike twice) than overheating or overclocking.


    My pants are fancy!


    During the development of Windows, the User Research team tried out an early build of some proposed changes on volunteers from the general community. During one of the tests, they invited the volunteer to just play around with a particular component, to explore it the way they would at home.

    The usability subject scrolled around a bit, admired the visuals, selected a few things, and then had an idea to try to customize the component. He fiddled around a bit and quickly discovered the customization feature.

    To celebrate his success, he proudly announced in a sing-song sort of way, "My pants are fancy!"

    That clip of a happy usability study participant gleefully announcing "My pants are fancy!" tickled the team's funny bone, and the phrase "My pants are fancy" became a catch phrase.


    How can I let my child use an app that I bought from the Windows Store?


    If you buy an app from the Windows Store, you can make it available to other users on the same Windows PC. This is useful if you, say, buy an app for your child to use. Here's how you do it. (This is all explained on the Windows Store blog, but I've converted it into a step-by-step and updated it for Windows 8.1.)

    First, sign on as yourself and install the app under your own account.

    Next, sign on as the child (or whatever other account you want to share the app with), and launch the Store from that second account.

    In the Store app, go to the top of the screen and hit Account, then My account.

    From the My account page, use the Change User button to sign out as the child account and sign in as yourself.

    Once signed in as yourself, you can reinstall the app into the child account. You can do this the hard way, by searching for the app, or the easy way by hitting Account at the top of the screen, and then choosing My Apps. Tap the app you want to reinstall, then hit the Reinstall button. (Since the app is already installed, all this does is increment the reference count on the app.)

    When finished, sign out of the Store from the child account.

    In Windows 8, each purchased app could be used on up to five PCs, regardless of how many times it was installed on each PC, so adding an app to a second account did not eat into your device quota. In Windows 8.1, the limit was bumped to 81 PCs, which means that for most people, the device limit will not be a problem.


    The Softsel Hot List for the week of December 22, 1986


    Back in the days before Internet-based software distribution, heck back even before the Internet existed in a form resembling what it is today, one of the most important ways of keeping track of the consumer computing industry was to subscribe to the Softsel Hot List, a weekly poster of the top sellers in various categories. Here is the Softsel Hot List for the week of December 22, 1986, or at least an HTML reproduction of it. The title at the top was inspired by a space agey font popular at the time, but I am too lazy to figure out how to do that in HTML so you'll have to use your imagination. (If your imagination fails you, you can use the photo from this page.)

     1   1   43    120D Dot Matrix • Citizen America
     2   3   30  P7 Pinwriter • NEC Information Systems
     3   2   100    MSP-10 Dot Matrix • Citizen America
     4   4   19    P6 Pinwriter • NEC Information Systems
     5   5   34    Premiere 35 Daisywheel • Citizen America
     6   6   37    MSP-20 Dot Matrix • Citizen America
     7   7   86    MSP-15 Dot Matrix • Citizen America
     8   8   46    3550 Spinwriter • NEC Information Systems
     9   9   33    P5 Pinwriter • NEC Information Systems
     10   –   1  KX-P1080i Dot Matrix • Panasonic
     1   1   40    JC 1401 Multisync • NEC Home Electronics
     2   2   98    Video 310A Hi-Res Amber TTL • Amdek
     3   3   79    Color 600 Hi-Res RGB • Amdek
     4   4   33    JB 1285 Amber TTL • NEC Home Electronics
     5   6   37    Color 722 CGA/EGA • Amdek
     6   5   82    121 Hi-Res Green TTL • Taxan
     7   –   1  318 Hi-Res Color • AT&T
     8   7   50    122 Amber TTL • Taxan
     9   –   1  Super Vision 720 Hi-Res • Taxan
     10   –   1  313 Mono • AT&T
     1   1   27    Laser FD100 Apple Drives • Video Technology • AP
     2   4   17  Bernoulli Box Dual 20MB • Iomega • IBM, MAC
     3   2   26    FileCard 20MB Hard Disk/Card • Western Digital • IBM
     4   3   36    QIC-60H External Tape Backup • Tecmar • IBM
     5   6   23    Bernoulli Box Dual 10MB • Iomega • IBM, MAC
     6   5   26    QIC-60AT Internal Tape Backup • Tecmar • IBM
     7   8   3    Teac AT 360k Drive • Maynard • IBM
     8   –   23    Maynstream 20MB Portable Backup • Maynard • IBM
     1   1   165    Hercules Graphics Card Plus • Hercules • IBM
     2   2   155    SixPakPlus • AST Research • IBM
     3   3   113    Hercules Color Card • Hercules • IBM
     4   4   170    Smartmodem 1200B • Hayes • IBM
     5   7   14    Above Board/AT • Intel • IBM
     6   5   188    Smartmodem 1200 • Hayes • AP
     7   8   38    Advantage AT! • AST Research • IBM
     8   6   68    Smartmodem 2400 • Hayes • IBM
     9   9   36    Gamecard III • CH Products • IBM
     10   11   35    Smartmodem 2400B • Hayes • IBM
     11   13   113    Grappler • Orange Micro • AP
     12   16   37  Practical Modem 1200 • Practical Peripherals • IBM
     13   12   37    Above Board/PC • Intel • IBM
     14   10   31    QuadEGA+ • Quadram • IBM
     15   15   7    SixPakPremium • AST Research • IBM
     16   20   20    Rampage! AT • AST Research • IBM
     17   –   19    Hotlink • Orange Micro • AP
     18   18   17    Autoswitch EGA • Paradise Systems • IBM
     19   19   39    Expanding Quadboard • Quadram • IBM
     20   17   2    Advantage Premium • AST Research • IBM
     1   1   116    Mach III • CH Products • AP, IBM
     2   2   98    Microsoft Mouse • Microsoft • IBM
     3   3   178    Joystick • Kraft Systems • AP, IBM
     4   4   35    Mach II • CH Products • AP, IBM
     5   5   2  Tac 10 • Suncom • AP, IBM
     6   7   34    Safe Strip • Curtis Manufacturing
     7   8   222    System Saver • Kensington • AP, MAC
     8   9   10    Intel 80287 Coprocessor • Intel • IBM
     9   10   99    MasterPiece • Kensington • IBM
     10   –   21    Intel 8087 Coprocessor • Intel • IBM
     1   1   138    WordPerfect • WordPerfect Corp • AP, IBM
     2   2   202    1-2-3 • Lotus • IBM
     3   5   22  Javelin • Javelin • IBM
     4   3   161    Microsoft Word • Microsoft • IBM, MAC
     5   4   7    Quicken • Intuit • AP, IBM
     6   6   15    PFS:First Choice • Software Publishing • IBM
     7   8   31    SQZ! • Turner Hall • IBM
     8   7   49    dBase III Plus • Ashton-Tate • IBM
     9   11   3  Lotus HAL • Lotus • IBM
     10   9   57    Q & A • Symantec • IBM
     11   12   113    Sidekick • Borland Int'l. • IBM
     12   10   56    Paradox • Ansa Software • IBM
     13   27   2  NewsMaster • Unison (Brown-Wagh) • IBM
     14   20   27  ProDesign II • American Small Bus. Comp. • IBM
     15   15   27    DAC Easy Accounting • DAC • IBM
     16   16   9    Microsoft Works • Microsoft • MAC
     17   13   57    VP Planner • Paperback Software • IBM
     18   18   11    PFS:Professional Write • Software Publishing • IBM
     19   14   18    MacDraft • IDD • MAC
     20   21   177    Multimate • Ashton-Tate • IBM
     21   22   55    Reflex • Borland Int'l. • IBM, MAC
     22   19   37    Multimate Advantage • Ashton-Tate • IBM
     23   24   25    Note-It • Turner Hall • IBM
     24   25   8    Clipper • Nantucket • IBM
     25   23   97    Wordstar 2000 • MicroPro Int'l. • IBM
     26   28   52    Microsoft Windows • Microsoft • IBM
     27   17   62    Microsoft Excel • Microsoft • MAC
     28   –   9    MORE • Living Videotext • MAC
     29   –   1  PFS:Professional File • Software Publishing • IBM
     30   –   7    R:Base System V • Microrim • IBM
     1   1   167    Crosstalk XVI • DCA/Crosstalk Communications • AP, IBM
     2   3   116    Norton Utilities • Norton Computing • IBM
     3   4   140    Sideways • Funk Software • IBM
     4   2   39    Fastback • Fifth Generation • IBM
     5   9   109  Turbo Pascal • Borland Int'l • AP, IBM, MAC
     6   8   17    Carbon Copy • Meridian Technology • IBM
     7   6   25    Dan Bricklin's Demo Program • Software Garden • IBM
     8   5   99    Smartcom II • Hayes • IBM, MAC
     9   7   27    XTREE • Executive Systems • IBM
     10   10   5    Disk Optimizer • SoftLogic Solutions • IBM
     1   1   128    Print Shop • Broderbund • AP, IBM, MAC, COM
     2   2   163    Math Blaster! • Davidson & Assoc. • AP, IBM, MAC, COM, AT
     3   6   5  Microsoft Learning DOS • Microsoft • IBM
     4   3   122    Typing Tutor III • Simon & Shuster • AP, IBM, MAC, COM
     5   4   21    Certificate Maker • Springboard • AP, IBM, COM
     6   –   90  Managing Your Money • MECA • AP, IBM
     7   5   95    The Newsroom • Springboard • AP, IBM, COM
     8   8   194    Bank Street Writer • Broderbund • AP, IBM, COM
     9   7   211    Mastertype • Mindscape • AP, IBM
     10   –   118    E.G. for Young Children • Springboard • AP, IBM, MAC
     1   1   203    Microsoft Flight Simulator • Microsoft • IBM, MAC
     2   2   156    Sargon III • Hayden Software • AP, IBM, MAC, AT
     3   4   3    King's Quest III • Sierra On-Line • IBM, ST
     4   3   71    Jet • SubLogic • AP, IBM, COM
     5   5   52    Winter Games • Epyx • AP, MAC, COM, ST
     6   7   91    F-15 Strike Eagle • Microprose • AP, IBM
     7   6   204    Flight Simulator II • SubLogic • AP, COM, AT, AG
     8   8   43    Silent Service • Microprose • AP, IBM
     9   10   48    Where is Carmen San Diego • Broderbund • AP, IBM, COM
     10   –   1  Bop'N Wrestle • Mindscape • AP, IBM, COM
    Week of December 22, 1986
    The HOT LIST is compiled from Softsel sales to over 15,000 dealers in 50 states and 45 countries. Sales may vary regionally. The names of the products and companies appearing above may be trademarks or registered trademarks.
    For an annual HOT LIST subscription, send your check for to: Softsel Computer Products, Inc., Attn: Hot List Subscriptions, 546 North Oak Street, P.O. Box 6080, Inglewood California, 90312-6080. For more details, please call Softsel's Marketing Department at (213) 412-8290.
    ©1986 Softsel® Computer Products, Inc.

    Setting, clearing, and testing a single bit in an SSE register


    Today I'm going to set, clear, and test a single bit in an SSE register.


    On Mondays I don't have to explain why.

    First, we use the trick from last time that lets us generate constants where all set bits are contiguous, and apply it to the case where we want only one bit.

        pcmpeqd xmm0, xmm0      ; set all bits to one
        psrlq   xmm0, 63        ; set both 64-bit lanes to 1
    IF N LT 64
        psrldq  xmm0, 64 / 8    ; clear the upper lane
    ELSE
        pslldq  xmm0, 64 / 8    ; clear the lower lane
    ENDIF
    IF N AND 63
        psllq   xmm0, N AND 63  ; shift the bit into position
    ENDIF

    We start by setting all bits in xmm0.

    We then shift both 64-bit lanes right by 63 positions, putting 1 in each lane.

    If the bit we want is in the upper half, then we shift the entire value left 8 bytes (64 bits). This clears the bottom 64 bits and leaves the value 1 in the upper 64 bits. (Similarly, if the bit we want is in the lower half, we shift right instead of left, which clears the upper 64 bits.)

    Finally, if we need a bit other than 0 or 64, we shift left by the desired amount within the 64-bit lane.
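The four steps can be traced with a portable model that treats the register as two 64-bit lanes. This is a sketch for following the bit movement, not SSE code; the `Xmm` struct and `model_two_to_the_n` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 128-bit XMM register: two 64-bit lanes.
struct Xmm {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Models the constant-generation sequence for 2^N, 0 <= N < 128.
inline Xmm model_two_to_the_n(int n) {
    assert(n >= 0 && n < 128);
    Xmm r{~0ull, ~0ull};   // pcmpeqd: every bit set
    r.lo >>= 63;           // psrlq 63: each lane now holds 1
    r.hi >>= 63;
    if (n < 64) {
        r.lo = r.hi;       // psrldq 8: whole register right by 8 bytes,
        r.hi = 0;          //           which clears the upper lane
    } else {
        r.hi = r.lo;       // pslldq 8: whole register left by 8 bytes,
        r.lo = 0;          //           which clears the lower lane
    }
    const int shift = n & 63;
    r.lo <<= shift;        // psllq: shift within each 64-bit lane
    r.hi <<= shift;
    return r;
}
```

For example, N = 100 leaves the lower lane at zero and sets bit 36 of the upper lane, since 100 = 64 + 36.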

    Now that we can generate a single bit value, we can use it to set and clear individual bits.

    ; Set bit N in xmm1 (using xmm0 as a helper)
            ⟨set xmm0 = 2^N⟩
            por     xmm1, xmm0
    ; Clear bit N in xmm1 (putting result in xmm0)
            ⟨set xmm0 = 2^N⟩
            pandn   xmm0, xmm1

    To test a bit, we can use the PMOVMSKB instruction.

    IF 7 - (N AND 7)
        psllq xmm0, 7 - (N AND 7)
    ENDIF
        pmovmskb eax, xmm0
    IF N LT 64
        test  al, 1 SHL (N / 8)
    ELSE
        test  ah, 1 SHL (N / 8 - 8)
    ENDIF

    First, we move the bit we want to test into a position that is 7 mod 8, because those are the bits captured by the PMOVMSKB instruction. (If the bit is already there, then we don't need to do anything.) Then we use the PMOVMSKB instruction to extract the bits into a general purpose register and test the one that corresponds to the bit we want.
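To see why the bit has to be moved to a position that is 7 mod 8, here is a portable model of the test. It is a sketch under the assumption that the register can be treated as two 64-bit lanes; `Xmm`, `model_pmovmskb`, and `model_test_bit` are hypothetical names, and no SSE instruction is executed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 128-bit XMM register: two 64-bit lanes.
struct Xmm {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Models pmovmskb: gather the top bit of each of the 16 bytes
// into a 16-bit mask, byte 0 in mask bit 0.
inline unsigned model_pmovmskb(Xmm v) {
    unsigned mask = 0;
    for (int byte = 0; byte < 8; ++byte) {
        mask |= static_cast<unsigned>((v.lo >> (8 * byte + 7)) & 1) << byte;
        mask |= static_cast<unsigned>((v.hi >> (8 * byte + 7)) & 1) << (byte + 8);
    }
    return mask;
}

// Models the test of bit N, 0 <= N < 128.
inline bool model_test_bit(Xmm v, int n) {
    assert(n >= 0 && n < 128);
    const int shift = 7 - (n & 7);  // distance to the top bit of its byte
    v.lo <<= shift;                 // psllq within each 64-bit lane
    v.hi <<= shift;
    // Byte number n / 8 now carries bit N in its top position.
    return (model_pmovmskb(v) & (1u << (n / 8))) != 0;
}
```

The shift never carries the bit out of its own byte, so byte number N / 8 is still the right one to inspect in the mask.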

    Alternatives: I tend to stick to SSE2 instructions because they are widely supported (and are indeed part of the minimum system requirements for Windows 8), but if you are willing to do CPU dispatching on SSE4.1, you can use PTEST, which might be faster; I haven't tested it.

    You could use movd and movq to load up a constant, but you do incur domain crossing penalties. Another alternative is to put the constant in memory, but then you pay an even bigger cost for memory access if the value is not in cache.

    Other remarks: Of course, you want to schedule the instructions better than the way I wrote them above. I wrote them in a logical order above to make the algorithm clearer, but you will want to reorder them to avoid stalls.

    Using intrinsics:

    __m128i Calc2ToTheN(int N)
    {
     __m128i zero = _mm_setzero_si128();
     __m128i ones = _mm_cmpeq_epi32(zero, zero);
     // shift right (not left) so that each 64-bit lane holds the value 1
     __m128i onesLowHigh = _mm_srli_epi64(ones, 63);
     __m128i singleOne = N < 64 ? _mm_srli_si128(onesLowHigh, 64 / 8) :
                                  _mm_slli_si128(onesLowHigh, 64 / 8);
     return _mm_slli_epi64(singleOne, N & 63);
    }
    __m128i SetBitN(__m128i value, int N)
    {
      return _mm_or_si128(value, Calc2ToTheN(N));
    }
    __m128i ClearBitN(__m128i value, int N)
    {
      // andnot computes (~first) & second, so the mask goes first
      return _mm_andnot_si128(Calc2ToTheN(N), value);
    }
    bool TestBitN(__m128i value, int N)
    {
     __m128i positioned = _mm_slli_epi64(value, 7 - (N & 7));
     return (_mm_movemask_epi8(positioned) & (1 << (N / 8))) != 0;
    }

    Note that since these functions pass a non-constant value to intrinsics like _mm_slli_epi64, you incur additional runtime penalties because the compiler is going to use a movd to load up the value, incurring the exact domain crossing penalty we are trying to avoid. To avoid this, templatize the function to force the bit number to be determined at compile time.

    template<int N>
    __m128i Calc2ToTheN()
    {
     __m128i zero = _mm_setzero_si128();
     __m128i ones = _mm_cmpeq_epi32(zero, zero);
     // shift right (not left) so that each 64-bit lane holds the value 1
     __m128i onesLowHigh = _mm_srli_epi64(ones, 63);
     __m128i singleOne = N < 64 ? _mm_srli_si128(onesLowHigh, 64 / 8) :
                                  _mm_slli_si128(onesLowHigh, 64 / 8);
     return _mm_slli_epi64(singleOne, N & 63);
    }
    template<int N>
    __m128i SetBitN(__m128i value)
    {
      return _mm_or_si128(value, Calc2ToTheN<N>());
    }
    template<int N>
    __m128i ClearBitN(__m128i value)
    {
      // andnot computes (~first) & second, so the mask goes first
      return _mm_andnot_si128(Calc2ToTheN<N>(), value);
    }
    template<int N>
    bool TestBitN(__m128i value)
    {
     __m128i positioned = _mm_slli_epi64(value, 7 - (N & 7));
     return (_mm_movemask_epi8(positioned) & (1 << (N / 8))) != 0;
    }

    How did protected-mode 16-bit Windows fix up jumps to functions that got discarded?


    Commenter Neil presumes that Windows 286 and later simply fixed up the movable entry table with jmp selector:offset instructions once and for all.

    It could have, but it went one step further.

    Recall that the point of the movable entry table is to provide a fixed location that always refers to a specific function, no matter where that function happens to be. This was necessary because real mode has no memory manager.

    But protected mode does have a memory manager. Why not let the memory manager do the work? That is, after all, its job.

    In protected-mode 16-bit Windows, the movable entry table was ignored. When one piece of code needed to reference another piece of code, it simply jumped to or called it by its selector:offset.

        push    ax
        call    0987:6543

    (Exercise: Why didn't I use call 1234:5678 as the sample address?)

    The selector was patched directly into the code as part of fixups. (We saw this several years ago in another context.)

    When a segment is relocated in memory, there is no stack walking to patch up return addresses to point to thunks, and no editing of the movable entry points to point to the new location. All that happens is that the base address in the descriptor table entry for the selector is updated to point to the new linear address of the segment. And when a segment is discarded, the descriptor table entry is marked not present, so that any future reference to it will raise a selector not present exception, which the kernel handles by reloading the selector.
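The bookkeeping described above can be illustrated with a toy model. This is a loose sketch and not how the Windows kernel is written: `ToyDescriptorTable`, its methods, and the reload addresses are all invented for illustration. The point is only that callers keep using the same selector:offset while the table entry changes underneath them.

```cpp
#include <cstdint>
#include <map>

// Toy descriptor: where the segment currently lives and whether it is in memory.
struct Descriptor {
    std::uint32_t base;
    bool present;
};

class ToyDescriptorTable {
public:
    // Bring a segment into memory at the given linear base address.
    void load(std::uint16_t selector, std::uint32_t base) {
        table_[selector] = {base, true};
    }
    // Moving a segment only rewrites the base; code that refers to
    // selector:offset does not need to be patched.
    void relocate(std::uint16_t selector, std::uint32_t newBase) {
        table_.at(selector).base = newBase;
    }
    // Discarding a segment just marks the entry not present.
    void discard(std::uint16_t selector) {
        table_.at(selector).present = false;
    }
    // Translate selector:offset to a linear address. A not-present entry
    // stands in for the exception that the kernel handles by reloading.
    std::uint32_t resolve(std::uint16_t selector, std::uint16_t offset) {
        Descriptor& d = table_.at(selector);
        if (!d.present) {
            d.base = nextReloadBase_;      // hypothetical reload location
            nextReloadBase_ += 0x10000;
            d.present = true;
            ++reloads_;
        }
        return d.base + offset;
    }
    int reloads() const { return reloads_; }

private:
    std::map<std::uint16_t, Descriptor> table_;
    std::uint32_t nextReloadBase_ = 0x00200000;  // arbitrary value for the sketch
    int reloads_ = 0;
};
```

In this model, a call to 0987:6543 resolves to a different linear address after a relocation or a discard-and-reload, yet the caller's selector:offset pair never changes.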

    Things are a lot easier when you have a memory manager around. A lot of the head-exploding engineering in real-mode Windows was in all the work of simulating a memory manager on a CPU that didn't have one!


    How can I query the location of the taskbar on secondary monitors?


    A customer wanted to know how to get the location of the taskbar on secondary monitors. "I know that SHAppBarMessage will tell me the location of the taskbar on the primary monitor, but how do I get its location on secondary monitors?"

    We asked the customer what problem they were actually trying to solve, for which they believed that finding the taskbar on secondary monitors was the solution. The customer was kind enough to explain.

    Our application shows a small window, and sometimes users move it behind the taskbar. They then complain that they can't find it, and they have to move their taskbar out of the way in order to find it again. We want our window to automatically avoid the taskbar.

    The solution to the customer's problem is to stop obsessing about the taskbar. Use the GetMonitorInfo function to obtain the working area for the monitor the window is on. The window can then position itself inside the working area.

    The working area is the part of the monitor that is not being used by the taskbar or other application bars. The customer was too focused on avoiding the taskbar and missed the fact that they needed to avoid other taskbar-like windows as well.
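Once the working area is known, keeping the window inside it is simple rectangle arithmetic. The sketch below is portable and uses a hypothetical `Rect` struct in place of the Win32 RECT; in a real program the working area would presumably come from the rcWork member of the MONITORINFO structure filled in by GetMonitorInfo, for the monitor returned by MonitorFromWindow. The function name `clamp_to_work_area` is made up for illustration.

```cpp
#include <algorithm>

// Hypothetical stand-in for the Win32 RECT structure.
struct Rect {
    long left;
    long top;
    long right;
    long bottom;
};

// Moves `window` by the smallest amount that puts it inside `work`,
// preserving its size. If the window is larger than the working area,
// its top-left corner is aligned with the working area's top-left corner.
inline Rect clamp_to_work_area(Rect window, const Rect& work) {
    const long width = window.right - window.left;
    const long height = window.bottom - window.top;

    long left = std::min(window.left, work.right - width);
    left = std::max(left, work.left);

    long top = std::min(window.top, work.bottom - height);
    top = std::max(top, work.top);

    return Rect{left, top, left + width, top + height};
}
```

Because the arithmetic works on whatever rectangle it is given, it also handles secondary monitors whose coordinates are negative, and it avoids any other application bars that reduce the working area.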

    The customer was kind enough to write back to confirm that GetMonitorInfo was working.


    It's not too late (okay maybe it's too late) to get this gift for the physicist who has everything


    A LEGO set to measure Planck's constant.


    It rather involved being on the other side of this airtight hatchway: Account vulnerable to Active Directory administrator


    A security vulnerability report came in that went something like this:

    Disclosure of arbitrary data from any user

    An attacker can obtain arbitrary data from any user by means of the following steps:

    1. Obtain administrative access on the domain controller.
    2. Stop the XYZZY service.
    3. Edit the XYZZY.DAT file in a hex editor and change the bytes starting at offset 0x4242 as follows:
    4. ...

    There's no point continuing, because the first step assumes that you are on the other side of the airtight hatchway. If you have compromised the domain controller, then you control the domain. From there, all the remaining steps are just piling on style points and cranking up the degree of difficulty.

    A much less roundabout attack is as follows:

    1. Obtain administrative access on the domain controller.
    2. Deploy a logon script to all users that does whatever you want.
    3. Wait for the user to log in next, and your script will DO ANYTHING YOU WANT.

    No, wait, I can make it even easier.

    1. Obtain administrative access on the domain controller.
    2. Change the victim's password.
    3. Log on as that user and DO ANYTHING YOU WANT.

    You are the domain administrator. You already pwn the domain. That you can pwn a domain that you pwn is really not much of a surprise.

    This is why it is important to choose your domain administrators carefully.
