
    The Itanium processor, part 3: The Windows calling convention, how parameters are passed


    The calling convention on Itanium uses a variable-sized register window. The mechanism by which this is done is rather complicated, so I'm first going to present a conceptual version, and then I'll come back and fix up some of the implementation details. For today, I'm just going to talk about how parameters are passed. There are other aspects of the calling convention that I will cover in separate articles.

    Recall that the first 32 registers r0 through r31 are static (they are never renumbered), and the remaining registers r32 through r127 are stacked. These stacked registers fall into three categories: input registers, local registers, and output registers.

    The input registers receive the function parameters. On entry to a function, the function's parameters are received in registers starting at r32 and increasing. For example, a function that takes two parameters receives the first parameter in r32 and the second parameter in r33.

    Immediately after the input registers are the registers for the function's private use. These are known as local registers. For example, if that function with two parameters also wants four registers for private use, those private registers would be r34 through r37.

    After the local registers come the registers used to pass parameters to other functions, known as output registers. For example, if the function with two parameters and four local registers wants to call a function that takes three parameters, it would put those parameters in registers r38 through r40. Therefore, a function needs as many output registers as the maximum number of parameters of any function it calls.

    The input registers and local registers are collectively known as the local region. The input registers, local registers, and output registers are collectively known as the register frame.

    Any registers higher than the last output register are off-limits to the function, and we shall henceforth pretend they do not exist. Since the registers go up to r127, and in practice register frames are around one or two dozen registers, there end up being a lot of registers that go unused.

    The first thing a function does is notify the processor of its intended register usage. It uses the alloc instruction to say how many input registers, local registers, and output registers it needs.

    alloc r35 = ar.pfs, 2, 4, 3, 0
    

    This means, "Set up my register frame as follows: Two input registers, four local registers, three output registers, and no rotating registers. Put the previous register frame state (pfs) in register r35."

    The second thing a function does is save the return address, typically in one of the local registers it just created. For example, the above alloc might be followed by

    mov r34 = rp
    

    On entry to a function, the rp register contains the caller's return address, and most of the time, the compiler will save the return address in a register. Note that this means that on the Itanium, a stack buffer overrun will never overwrite a return address, since return addresses are not kept on the stack. (Let that sink in. On Itanium, return addresses are not kept on the stack. This means that tricks like _AddressOfReturnAddress will not work!)

    By convention, the rp and ar.pfs are saved in consecutive registers (here, r34 and r35). This convention makes exception unwinding slightly easier.

    Let's see what happens when somebody calls this function. Suppose the caller's register frame looks like this:

    |       static        |      local region      |        output
    |                     |  input  |    local     |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36  | r37 r38 r39 r40 r41

    The caller places the parameters to our function in its output registers, in this case r37 and r38. (Our function takes only two parameters, so r39 and beyond are not used.)

    |       static        |      local region      |        output
    |                     |  input  |    local     |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36  | r37 r38 r39 r40 r41
    | 0   A   ... F   G   | H   I   | J   K   L    | M   N   ?   ?   ?

    The caller then invokes our function.
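
    In instructions, the call might look something like this sketch (the name OurFunction and the mov sources are placeholders for the real callee and for wherever the values M and N actually come from):

    mov r37 = r4                        // first parameter (the value M)
    mov r38 = r5                        // second parameter (the value N)
    br.call.sptk.many rp = OurFunction ;;
    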

    Our function opens by performing this alloc, declaring two input registers, four local registers, and three output registers.

    alloc r35 = ar.pfs, 2, 4, 3, 0
    

    That alloc instruction shuffles the registers like this:

    • The static registers don't change.
    • The registers in the caller's local region are saved in a magic place.
    • The specified number of output registers from the caller become the new function's input registers.
    • New local and output registers are created but left uninitialized.
    • The previous function state is placed in the specified register (for restoration at function exit). There are many parts of the function state, but the part we care about is the frame state, which describes how registers are assigned.

    Here's what the register frame looks like after all but the last step above:

    |       static        |       local region        |   output
    |                     |  input  |      local      |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36 r37 | r38 r39 r40
    | 0   A   ... F   G   | M   N   | ?   ?   ?   ?   | ?   ?   ?
      unchanged (r0–r31)    moved     uninitialized (r34–r40)

    The last step (storing the previous function state in the specified register) updates the r35 register:

    |       static        |       local region        |   output
    |                     |  input  |      local      |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36 r37 | r38 r39 r40
    | 0   A   ... F   G   | M   N   | ?   pfs ?   ?   | ?   ?   ?

    The next instruction is typically one to save the return address.

    mov r34 = rp
    

    After that mov instruction, the function prologue is complete, and the register state looks like this:

    |       static        |       local region        |   output
    |                     |  input  |      local      |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36 r37 | r38 r39 r40
    | 0   A   ... F   G   | M   N   | ra  pfs ?   ?   | ?   ?   ?

    where ra is the function's return address.

    At this point the function runs and does actual work. Once it's done, its register state might look like this:

    |       static        |       local region        |   output
    |                     |  input  |      local      |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36 r37 | r38 r39 r40
    | 0   A′  ... F′  G′  | T   U   | ra  pfs V   W   | X   Y   Z

    The function epilogue typically consists of three instructions:

    mov rp = r34     // prepare to return to caller
    mov ar.pfs = r35 // restore previous function state
    br.ret rp        // return!
    

    This sequence begins by copying the saved return address into the rp register so that we can jump back to it. (We could have copied r34 into any scratch branch register, but by convention we use the rp register because it makes exception unwinding easier.)

    Next, it restores the register state from the pfs it saved at function entry. Finally, it transfers control back to the caller by jumping through the rp register. (We cannot do a br.ret r34 because r34 is not a branch register; the parameter to br.ret must be a branch register.)

    Restoring the previous function state causes the caller's register frame layout to be restored, and the values of the registers in the caller's local region are restored from that magic place.

    The register state upon return back to the caller looks like this:

    |       static        |      local region      |        output
    |                     |  input  |    local     |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36  | r37 r38 r39 r40 r41
    | 0   A′  ... F′  G′  | H   I   | J   K   L    | ?   ?   ?   ?   ?
      unchanged (r0–r31)    restored (r32–r36)       uninitialized (r37–r41)

    From the point of view of the calling function, calling another function has the following effect:

    • Static registers are shared with the called function. (Any changes to static registers are visible to the caller.)
    • The local region is preserved across the call.
    • The output registers are trashed by the call.

    At most eight parameters are passed in registers. Any additional parameters are passed on the stack, and it is the caller's responsibility to clean them up. (The stack-based parameters begin after the red zone. We'll talk more about the red zone later.)
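
    For example, a ten-parameter function splits like this (an illustration with made-up names, not a real API):

    // The first eight parameters travel in the caller's output registers;
    // the last two go onto the memory stack, beyond the red zone.
    void TenParams(int p1, int p2, int p3, int p4,  // output registers 1-4
                   int p5, int p6, int p7, int p8,  // output registers 5-8
                   int p9, int p10);                // passed on the stack
    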

    Thank goodness for the parameter cap, because a variadic function doesn't know how many parameters were passed, so it would otherwise not know how many input parameters to declare in its alloc instruction. The parameter cap means that variadic functions alloc eight input registers, and typically the first thing they do is spill them onto the stack so that they are contiguous with any parameters beyond 8 (if any). Note that this spilling must be done very carefully to avoid crashing if the corresponding register does not correspond to an actual parameter but happens to be a NaT left over from a failed speculative execution. (There is a special instruction for spilling without taking a NaT consumption exception.)
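
    The special instruction is st8.spill, which stores the register and parks the NaT bit in the ar.unat application register instead of faulting. A hypothetical variadic prologue might open something like this (a sketch with made-up register numbers, not actual compiler output; the exact stack layout is glossed over):

    alloc r41 = ar.pfs, 8, 2, 0, 0  // variadic: always declare eight inputs
    mov   r40 = rp                  // save the return address in a local
    adds  sp = -80, sp ;;           // make room (real code positions this area
    adds  r2 = 16, sp ;;            //   next to any stack-based parameters)
    st8.spill [r2] = r32, 8 ;;      // st8.spill diverts the NaT bit to ar.unat
    st8.spill [r2] = r33, 8 ;;      //   instead of raising an exception
    // ... and so on through r39
    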

    If any parameter is smaller than 64 bits, then the unused bits of the corresponding register are garbage and should be ignored. I didn't discuss floating point parameters or aggregates. You can read Thiago's comment for a quick version, or dig into the Itanium Software Conventions and Runtime Architecture Guide (Section 8.5: Parameter Passing) for gory details.

    Okay, that's the conceptual model. The actual implementation is not quite as I described it, but the conceptual model is good enough for most debugging purposes. Here are some of the implementation details which will come in handy if you need to roll up your sleeves.

    First of all, the processor does not actually distinguish between input registers and local registers. It only cares about the local region. In other words, the parameters to the alloc instruction are

    • Size of local region.
    • Number of output registers.
    • Number of rotating registers.
    • Register to receive previous function state.

    When the called function establishes its register frame, the processor just takes all the caller's output registers (even the ones that aren't actually relevant to the function call) and slides them down to r32. It is the compiler's responsibility to ensure that the code passes the correct number of parameters. Therefore, our diagram of the function call process would more accurately go like this: The caller's register frame looks like this before the call:

    |       static        |      local region      |        output
    |                     |  input  |    local     |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36  | r37 r38 r39 r40 r41
    | 0   A   ... F   G   | H   I   | J   K   L    | M   N   X₁  X₂  X₃

    where the X values are whatever garbage values happen to be left over from previous computations, possibly even NaT.

    When the called function sets up its register frame (before storing the previous register frame), it gets this:

    |       static        |       local region        |   output
    |                     |  input  |      local      |
    | r0  r1  ... r30 r31 | r32 r33 | r34 r35 r36 r37 | r38 r39 r40
    | 0   A   ... F   G   | M   N   | X₁  X₂  X₃  ?   | ?   ?   ?
      unchanged (r0–r31)    moved (r32–r36)    uninitialized (r37–r40)

    The processor took all the output registers from the caller and slid them down to r32 through r36.

    Of course, the called function shouldn't try to read from any registers beyond r33, if it knows what's good for it, because those registers contain nothing of value and may indeed be poisoned by a NaT.

    This little implementation detail has no practical consequences because those registers were uninitialized in the conceptual model anyway. But it does mean that when you disassemble the alloc instruction, you'll see that the distinction between input registers and local registers has been lost, and that both sets of registers are reported as input registers. In other words, an instruction written as

    alloc r35 = ar.pfs, 2, 4, 3, 0
    

    disassembles as

    alloc r35 = ar.pfs, 6, 0, 3, 0
    

    The disassembler doesn't know how many of the six registers in the local region are input registers and how many are local, so it just treats them all as input registers.

    That explains some of the undefined registers, but what about those question marks? To solve this riddle, we need to answer a different question first: "Where is this magic place that the caller's local region gets saved to and restored from?"

    This is where the infamous Itanium second stack comes into play.

    There are two stacks on Itanium. One is indexed by the sp register and is what one generally means when one says the stack. The other stack is indexed by the bsp register (backing store pointer), and it is the magic place where these "registers from long ago" are saved. The bsp register grows upward in memory (toward higher addresses), opposite from the sp, which grows downward (toward lower addresses). Windows allocates the two stacks right next to each other. Here's an artistic impression by Slava Oks. Bear in mind that Slava drew the diagram upside-down (low addresses at the top, high addresses at the bottom). The bsp grows toward higher addresses, but in Slava's diagram, that direction is downward.

    One curious implementation detail is that the two stacks abut each other without a gap. I'm told that the kernel team considered putting a no-access page between the two stacks, so that a runaway memory copy into the stack would encounter an access violation before it reached the backing store. For whatever reason, they didn't bother.

    Now, the processor is sneaky and doesn't actually push the values onto the backing store immediately. Instead, the processor rotates them into high-numbered unused registers (all the registers beyond the last output register), and only when it runs out of space there does it spill them into the backing store. When the function returns, the rotation is undone, and the values squirreled away into the high-numbered unused registers magically reappear in the caller's local region.

    Each time a function is called, the registers rotate to the left, and when a function returns, the registers rotate to the right. As a result, the local regions of functions in the call stack can be found among the off-limits registers, up until we reach the last spill point.

    Suppose the call stack looks like this (most recent function at the top):

    a() -- current function
    b()
    c()
    d()
    e()
    f()
    g()
    

    If we zoom out, we can see all those local regions.

    static   a: LR   a: O   open            g: LR   f: LR   e: LR   d: LR   c: LR   b: LR
    ••••••   •••••   •••    ••••••••••••••  •••••   •••••   ••••••  ••••••  ••••    ••••••

    Why don't we see any output registers for any functions other than the current one? You know why: Because at each function call, the caller's output registers become the called function's input registers. If you really wanted to draw the output registers, you could do it like this, where each function's input registers are shared with its caller's output registers.

    static | a: I L O | open | g: I L | g.O = f.I | f: L | f.O = e.I | e: L | e.O = d.I
           | d: L | d.O = c.I | c: L | c.O = b.I | b: L | b.O = a.I (wrapping around)

    But we won't bother drawing this exploded view any more.

    Now, if the function a calls another function x, then all the registers rotate left, with a's local region wrapping around to the end of the list:

    static   x: LR   x: O   open        g: LR   f: LR   e: LR   d: LR   c: LR   b: LR   a: LR
    ••••••   •••     ••••   ••••••••••  •••••   •••••   ••••••  ••••••  ••••    ••••••  •••••

    And when x returns, the registers rotate right, bringing us back to

    static   a: LR   a: O   open            g: LR   f: LR   e: LR   d: LR   c: LR   b: LR
    ••••••   •••••   •••    ••••••••••••••  •••••   •••••   ••••••  ••••••  ••••    ••••••

    Note that the conceptual model doesn't care about this implementation detail. In theory, future versions of the Itanium processor might have additional "bonus registers" after r127 which are programmatically inaccessible but which are used to expand the number of register frames that can be held before needing to spill.

    With this additional information, you now can see the contents of those undefined registers on entry to a function: They contain whatever garbage happened to be left over in the open registers. Similarly, the contents of those undefined output registers after the function returns to its caller are the leftover values in the called function's local region.

    You can also see the contents of the uninitialized output registers on return from a function: They contain whatever garbage happened to be left over in the called function's input registers. This behavior is actually documented by the processor, so in theory somebody could invent a calling convention where information is passed from a function back to its caller through the input registers, say, for a language that supports functions with multiple return values. (In other words, the input registers are actually in/out registers.) The Windows calling convention doesn't use this feature, however.

    It so happens that the debugger forces a full spill into the backing store when it gains control. This is useful, because groveling into the backing store gives you a way to see the local regions of any function on the stack.

    kd> r
    ...
          r32 =      6fbffd21130 0        r33 =          1170065 0
          r34 =      6fbffd23700 0        r35 =                8 0
          r36 =      6fbffd21338 0        r37 =            20000 0
          r38 =             8000 0        r39 =             2000 0
          r40 =              800 0        r41 =              400 0
          r42 =              100 0        r43 =               80 0
          r44 =              200 0        r45 =            10000 0
          r46 =         7546fdf0 0        r47 = c000000000000693 0
          r48 =             5041 0        r49 =         75ab0000 0
          r50 =      6fbffd21130 0        r51 =          1170065 0
          r52 =      6fbfc79f770 0        r53 =         7546cbe0 0
    kd> dq @bsp
    000006fb`fc7a02e0  000006fb`ffd21130 00000000`01170065 // r32 and r33
    000006fb`fc7a02f0  000006fb`ffd23700 00000000`00000008 // r34 and r35
    000006fb`fc7a0300  000006fb`ffd21338 00000000`00020000 // r36 and r37
    000006fb`fc7a0310  00000000`00008000 00000000`00002000 // r38 and r39
    000006fb`fc7a0320  00000000`00000800 00000000`00000400 // r40 and r41
    000006fb`fc7a0330  00000000`00000100 00000000`00000080 // r42 and r43
    000006fb`fc7a0340  00000000`00000200 00000000`00010000 // r44 and r45
    000006fb`fc7a0350  00000000`7546fdf0 c0000000`00000693 // r46 and r47
    

    But wait, ia64 integer registers are 65 bits wide, not 64. The extra bit is the NaT bit. Where did that go?

    Whenever the bsp hits a 512-byte boundary (that is, whenever (bsp & 0x1F8) == 0x1F8, which happens once every 64 slots), the value spilled into the backing store is not a 64-bit register but rather the accumulated NaT bits. You are not normally interested in the NaT bits, so the only practical consequence of this is that you have to remember to skip an entry whenever you hit a 512-byte boundary.
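
    If you are walking the backing store programmatically, the bookkeeping looks something like this sketch (a hypothetical helper, not a real debugger API):

    #include <stdint.h>

    // Advance a backing store pointer past 'count' stacked registers,
    // stepping over the NaT-collection slot that the processor deposits
    // whenever the address hits a 512-byte boundary.
    uint64_t *AdvanceBsp(uint64_t *bsp, int count)
    {
        while (count-- > 0) {
            if (((uintptr_t)bsp & 0x1F8) == 0x1F8)
                bsp++; // this slot holds NaT bits, not a register
            bsp++;
        }
        return bsp;
    }
    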

    Suppose we wanted to look at our caller's local region. Here's the start of a sample function. Don't worry about most of the instructions, just pay attention to the alloc and the mov ... = rp.

    SAMPLE!.Sample:
           alloc    r47 = ar.pfs, 013h, 00h, 04h, 00h
           mov      r48 = pr
           addl     r31 = -2004312, gp
           adds     sp = -1072, sp ;;
           ld8.nta  r3 = [sp]
           mov      r46 = rp
           adds     r36 = 0208h, r32
           or       r49 = gp, r0 ;;
    

    Suppose you hit a breakpoint partway through this function, and you want to know why the caller passed a strange value for the first input parameter r32.

    From reading the function prologue, you see that the return address is kept in r46, so you can disassemble there to see how your caller set up its output parameters:

    kd> u @r46-20
    SAMPLE!.Caller+2bd0:
           ld8      r47 = [r32]
           ld4      r46 = [r33]
           or       r45 = r35, r0
           nop.b    00h
           nop.b    00h
           br.call.sptk.many  rp = SAMPLE!.Sample
    

    (Notice the nop instructions which suggest that this is unoptimized code.)

    But we don't know which of those registers are the output registers of the caller. For that, we need to know the register frame of the caller. We see from the alloc instruction that the previous function state (pfs) was saved in the r47 register.

    kd> ?@r47
    Evaluate expression: -4611686018427386221 = c0000000`00000693
    

    This value is not easy to parse. The bottom seven bits record the total size of the caller's register frame, which includes both the local region and the output registers. The size of the local region is kept in bits 7 through 13, which is a bit tricky to extract by eye. You take the third and fourth digits from the right, double the value, and add one more if the second digit from the right is 8 or higher. This is easier to do than to explain:

    • The third- and fourth-to-last digits are 06 hex.
    • Double that, and you get 12 (decimal).
    • Since the second-to-last digit is 9, add one more.
    • Result: 13.

    The previous function's local region has 13 registers. Therefore, the previous function's output registers begin at 32 + 13 = 45. (You can also see that the previous function had 0x13 = 19 registers in its register frame, and you can therefore infer that it had 19 - 13 = 6 output registers.)
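
    In code, the decoding we just did by eye looks like this (a sketch using the pfs value dumped above):

    #include <stdint.h>

    uint64_t pfs    = 0xc000000000000693ULL;         // from the ?@r47 above
    unsigned frame  = (unsigned)(pfs & 0x7F);        // bits 0-6:  0x13 = 19 registers
    unsigned locals = (unsigned)((pfs >> 7) & 0x7F); // bits 7-13: 13 registers
    unsigned output = frame - locals;                // 6 output registers, starting at r45
    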

    Applying this information to the disassembly of the caller, we see that the caller passed

    • first output register r45 = r35. (Recall that the r0 register is always zero, so or'ing it with another value just copies that other value.)
    • second output register r46 = 4-byte value stored at [r33]
    • third output register r47 = 8-byte value stored at [r32]

    That first output register was a copy of the r35 register. We can grovel through the backing store to see what that value is.

    0:000> dq @bsp-0n13*8 l4
    000006fb`ffe906d8  00000000`4b1e9720 00000000`4b1ea2e8     // r32 and r33
    000006fb`ffe906e8  00000000`0114a7c0 000006fb`fe728cac     // r34 and r35
    

    And now we have extracted the registers from our caller's local region. Specifically, we see that the caller's r35 is 000006fb`fe728cac.

    We can extend this technique to grovel even further back in the stack. To do that, we need to obtain the pfs chain so we can see the structure of the register frame for each function in the call stack.

    From the disassembly above, we saw that the caller's return address was kept in r46. To go back another level, we need to find that caller's caller. We merely repeat the exercise, but with the caller. Sometimes it can be hard to find the start of a function (especially if you don't have symbols); it can be easier to look for the end of the function instead! Instead of looking for the alloc and mov ... = rp instructions which save the previous function state and return address, we look for the mov ar.pfs = ... and mov rp = ... instructions which restore them.

    Here's an example of a stack trace I had to reconstruct:

    0:000> u
    00000000`4b17e9d4       mov      rp = r37              // return address
    00000000`4b17e9e4       mov.i    ar.pfs = r38          // restore pfs
    00000000`4b17e9e8       br.ret.sptk.many  rp ;;        // return to caller
    0:000> dq @bsp
    000006fb`ffe90758  000006fb`fe761cc0 000006fb`ffe8f860 // r32 and r33
    000006fb`ffe90768  000006fb`ffe8fa70 00000000`00000104 // r34 and r35
    000006fb`ffe90778  00000000`0114a7c0 00000000`4b1b6890 // r36 and r37
    000006fb`ffe90788  c0000000`0000050e 00000000`00005001 // r38 and r39
    

    Double the 05 to get 10 (decimal), and don't add one since the next digit (0) is less than 8. The previous function therefore has 10 registers in its local region.

    The current function's return address is kept in r37 and the pfs in r38. You can find them in the bsp dump above: r37 at the end of the third line, and r38 at the start of the fourth.

    Let's disassemble at the return address and dump that function's local variables, thereby walking back one level in the call stack.

    0:000> u 00000000`4b1b6890
    ...
    00000000`4b1b6bd4       mov      rp = r38 ;;           // return address
    00000000`4b1b6be4       mov.i    ar.pfs = r39          // restore pfs
    00000000`4b1b6be8       br.ret.sptk.many  rp ;;
    // we calculated that the local region of the previous function is size 0xA
    0:000> dq @bsp-a*8 la
    000006fb`ffe90708  000006fb`fe73bfc0 000006fb`fe73ff10     // r32 and r33
    000006fb`ffe90718  00000000`00000000 000006fb`ffe8f850     // r34 and r35
    000006fb`ffe90728  000006fb`ffe8f858 00000000`00000000     // r36 and r37
    000006fb`ffe90738  00000000`4b1e9350 c0000000`00000308     // r38 and r39
    000006fb`ffe90748  00000000`00009001 00000000`4b57e000     // r40 and r41
    

    By studying the value in the caller's r39, we see that the caller's caller has 3 × 2 + 0 = 6 registers in its local region. And the caller's r38 gives us the return address. Let's walk back another frame in the call stack.

    0:000> u 4b1e9350
    ...
    00000000`4b1e9354       mov      rp = r34              // return address
    00000000`4b1e9368       mov.i    ar.pfs = r35          // restore pfs
    00000000`4b1e9378       br.ret.sptk.many  rp ;;
    0:000> dq @bsp-a*8-6*8 l6
    000006fb`ffe906d8  00000000`0114a7c0 000006fb`fe728cac     // r32 and r33
    000006fb`ffe906e8  00000000`4b1e9720 c0000000`00000389     // r34 and r35
    000006fb`ffe906f8  00000000`00009001 00000000`4b57e000     // r36 and r37
    

    This time, the return address is in r34 and the previous pfs is in r35. This time, the caller's caller's caller has 3 × 2 + 1 = 7 registers in its local region.

    0:000> u 4b1e9720
    ...
    00000000`4b1e9784       mov      rp = r35             // return address
    00000000`4b1e9788       adds     sp = 010h, sp ;;
    00000000`4b1e9790       nop.m    00h
    00000000`4b1e9794       mov      pr = r37, -2 ;;
    00000000`4b1e9798       mov.i    ar.pfs = r36         // restore pfs
    00000000`4b1e97a0       nop.m    00h
    00000000`4b1e97a4       nop.f    00h
    00000000`4b1e97a8       br.ret.sptk.many  rp ;;
    0:000> dq @bsp-a*8-6*8-7*8 l7
    000006fb`ffe906a0  00000000`0114a7c0 00000000`00000000    // r32 and r33
    000006fb`ffe906b0  00000000`0114a900 00000000`4b19ba00    // r34 and r35
    000006fb`ffe906c0  c0000000`0000058f 00000000`00009001    // r36 and r37
    000006fb`ffe906d0  00000000`4b57e000                      // r38
    

    This function also allocates 0x10 bytes from the stack, so if you want to see its stack variables, you can dump the values at sp + 0x10 for length 0x10. The + 0x10 is to skip over the red zone.

    Anyway, that's the way to reconstruct the call stack on an Itanium. Repeat until bored.

    Maybe you can spot the fast one I pulled when discussing how the alloc instruction and pfs register work. More details next time, when we discuss leaf functions and the red zone.


    The Itanium processor, part 2: Instruction encoding, templates, and stops


    Instructions on Itanium are grouped into chunks of three, known as bundles, and each of the three positions in a bundle is known as a slot. A bundle is 128 bits long (16 bytes) and always resides on a 16-byte boundary, so that the last digit of the address is always zero. The Windows debugging engine disassembler shows the three slots as if they were at offsets 0, 4, and 8 in the bundle, but in reality they are all crammed together into one bundle.

    You cannot jump into the middle of a bundle.

    Now, you can't just put any old instruction into any old slot. There are 32 bundle templates, and each has its own rules about which types of instructions it accepts and the dependencies between the slots. For example, the bundle template MII allows a memory access instruction in slot 0, an integer instruction in slot 1, and another integer instruction in slot 2.

    (Math: Each slot is 41 bits wide, so 123 bits are required to encode the slots. Add five bits for encoding the template, and you get 128 bits for the entire bundle.)¹

    The slot types are

    • M = memory or move
    • I = complex integer or multimedia
    • A = simple arithmetic, bit logic, or multimedia
    • F = floating point or SIMD
    • B = branch

    Some instructions can be used in multiple slot types, and the disassembler will stick a suffix (known as a completer) on the mnemonic to disambiguate them. For example, there are five different nop instructions, one for each slot type: nop.m, nop.i, nop.a, nop.f, and nop.b. When reading code, you don't need to worry too much about slotting. You can assume that the compiler did it correctly; otherwise it wouldn't have disassembled properly! (For the remainder of this series, I will tend to omit completers if their sole purpose is to disambiguate a slot type.)

    If you are debugging unoptimized code, you may very well see a lot of nops because the compiler didn't bother trying to optimize slot usage.

    Another thing that bundles encode is the placement of what are known as stops. A stop is used to indicate that the instructions after the stop depend on instructions before the stop. For example, if you had the following sequence of instructions

        mov r3 = r2
        add r1 = r2, r4 ;;
        add r2 = r1, r3
    

    there is no dependency between the first two instructions; they can execute in parallel. However, the third instruction cannot execute until the first two have completed. The compiler therefore inserts a stop after the second instruction, which is represented by a double-semicolon.

    A sequence of instructions without any stops is known as an instruction group. (There are other things that can end an instruction group, but they aren't important here.) As noted above, the instructions in an instruction group may not have any dependencies among them. This allows the processor to execute them in parallel. (This is an example of how the processor relies on the compiler: By making it the compiler's responsibility to ensure that there are no dependencies within an instruction group, the processor can avoid having to do its own dependency analysis.)

    There are some exceptions to the rule against having dependencies within an instruction group:

    • A branch instruction is allowed to depend on a predicate register and/or branch register set up earlier in the group.
    • You are allowed to use the result of a successful ld.c without an intervening stop. We'll learn more about ld.c when we discuss explicit speculation.
    • Comparison instructions of type .and, .andcm, .or, and .orcm are allowed to combine with others of the same type that target the same predicate registers. (In other words, you can combine two .ands, but not an .and and an .or.)
    • You are allowed to write to a register after a previous instruction reads it. (With rare exceptions.)
    • Two instructions in the same group cannot write to the same register. (With the exception of combined comparisons noted above.)

    There are a lot of fine details in the rules, but I'm ignoring them because they are of interest primarily to compiler-writers. The above rules are to give you a general idea of the sorts of dependencies that can exist within an instruction group. (Answer: Not much.)

    It does highlight that writing ia64 assembly by hand is exceedingly difficult because you have to make sure every triplet of instructions you write matches a valid template in terms of slots and stops, and you have to ensure that the instruction groups do not break the rules.

    Next time, we'll look at the calling convention.

    ¹ There are two templates which are special in that they encode only two slots rather than three. The first slot is the normal 41 bits, but the second slot is a double-wide 82 bits. The double-wide slot is used by a few special-purpose instructions we will not get into.


    The Itanium processor, part 1: Warming up


    The Itanium may not have been much of a commercial success, but it is interesting as a processor architecture because it is different from anything else commonly seen today. It's like learning a foreign language: It gives you an insight into how others view the world.

    The next two weeks will be devoted to an introduction to the Itanium processor architecture, as employed by Win32. (Depending on the reaction to this series, I might also do a series on the Alpha AXP.)

    I originally learned this information in order to be able to debug user-mode code as part of the massive port of several million lines of code from 32-bit to 64-bit Windows, so the focus will be on being able to read, understand, and debug user-mode code. I won't cover kernel-mode features since I never had to learn them.

    Introduction

    The Itanium is a 64-bit EPIC architecture. EPIC stands for Explicitly Parallel Instruction Computing, a design in which work is offloaded from the processor to the compiler. For example, the compiler decides which operations can be safely performed in parallel and which memory fetches can be productively speculated. This relieves the processor from having to make these decisions on the fly, thereby allowing it to focus on the real work of processing.

    Registers overview

    There are a lot of registers.

    • 128 general-purpose integer registers r0 through r127, each carrying 64 value bits and a trap bit. We'll learn more about the trap bit later.
    • 128 floating point registers f0 through f127.
    • 64 predicate registers p0 through p63.
    • 8 branch registers b0 through b7.
    • An instruction pointer, which the Windows debugging engine for some reason calls iip. (The extra "i" is for "insane"?)
    • 128 special-purpose registers, not all of which have been given meanings. These are called "application registers" (ar) for some reason. I will cover selected registers as they arise during the discussion.
    • Other miscellaneous registers we will not cover in this series.

    Some of these registers are further subdivided into categories like static, stacked, and rotating.

    Note that if you want to retrieve the value of a register with the Windows debugging engine, you need to prefix it with an at-sign. For example ? @r32 will print the contents of the r32 register. If you omit the at-sign, then the debugger will look for a variable called r32.

    A notational note: I am using the register names assigned by the Windows debugging engine. The formal names for the registers are gr# for integer registers, fr# for floating point registers, pr# for predicate registers, and br# for branch registers.

    Static, stacked, and rotating registers

    These terms describe how the registers participate in register renumbering.

    Static registers are never renumbered.

    Stacked registers are pushed onto a register stack when control transfers into a function, and they pop off the register stack when control transfers out. We'll see more about this when we study the calling convention.

    Rotating registers can be cyclically renumbered during the execution of a function. They revert to being stacked when the function ends (and are then popped off the register stack). We'll see more about this when we study register rotation.

    Integer registers

    Of the 128 integer registers, registers r0 through r31 are static, and r32 through r127 are stacked (but they can be converted to rotating).

    Win32 assigns the following mnemonics to some of the static registers, corresponding to their use in the Win32 calling convention.

    Register   Mnemonic    Meaning
    r0                     Reads as zero (writes will fault)
    r1         gp          Global pointer
    r8–r11     ret0–ret3   Return values
    r12        sp          Stack pointer
    r13                    TEB

    Registers r4 through r7 are preserved across function calls. Well, okay, you should also preserve the stack pointer and the TEB if you know what's good for you, and there are special rules for gp which we will discuss later. The other static registers are scratch (may be modified by the function).

    Register r0 is a register that always contains the value zero. Writes to r0 trigger a processor exception.

    The gp register points to the current function's global variables. The Itanium has no absolute addressing mode. In order to access a global variable, you need to load it indirectly through a register, and the gp register points to the global variables associated with the current function. The gp register is kept up to date when code transfers between DLLs by means we'll discuss later. (This is sort of a throwback to the old days of MAKEPROCINSTANCE.)

    Every integer register contains 64 value bits and one trap bit, known as not-a-thing, or NaT. The NaT bit is used by speculative execution to indicate that the register values are not valid. We learned a little about NaT some time ago; we'll discuss it further when we reach the topic of control speculation. The important thing to know about NaT right now is that if you take a register which is tagged as NaT and try to do arithmetic with it, then the NaT bit is set on the output register. Most other operations on registers tagged as NaT will raise an exception.

    The NaT bit means that accessing an uninitialized variable can crash.

    void bad_idea(int *p)
    {
     int uninitialized;
     *p = uninitialized; // can crash here!
    }
    

    Since the variable uninitialized is uninitialized, the register assigned to it might happen to have the NaT bit set, left over from previous execution, at which point trying to save it into memory raises an exception.

    You may have noticed that there are four return value registers, which means that you can return up to 32 bytes of data in registers.
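
    For example (an illustrative declaration with a made-up type name, assuming the convention returns trivial aggregates of up to 32 bytes in registers):

    typedef struct { unsigned long long a, b, c, d; } FourQwords;

    FourQwords MakeFourQwords(void); // a in ret0, b in ret1, c in ret2, d in ret3
    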

    Floating point registers

    Register   Meaning
    f0         Reads as 0.0 (writes will fault)
    f1         Reads as 1.0 (writes will fault)

    Registers f0 through f31 are static, and f32 through f127 are rotating.

    By convention, registers f0 through f5 and f16 through f31 are preserved across calls. The others are scratch.

    That's about all I'm going to say about floating point registers, since they aren't really where the Itanium architecture is exciting.

    Predicate registers

    Instead of a flags register, the Itanium records the state of previous comparison operations in dedicated registers known as predicates. Each comparison operation indicates which predicates should hold the comparison result, and future instructions can test the predicate.

    Register   Meaning
    p0         Reads as true (writes are ignored)

    Predicate registers p0 through p15 are static, and p16 through p63 are rotating.

    You can predicate almost any instruction, and the instruction will execute only if the predicate register is true. For example:

    (p1) add ret0 = r32, r33
    

    means, "If predicate p1 is true, then set register ret0 equal to the sum of r32 and r33. If not, then do nothing." The thing inside the parentheses is called the qualifying predicate (abbreviated qp).

    Instructions which execute unconditionally are internally represented as being conditional upon predicate register p0, since that register is always true.

    Actually, I lied when I said that the instruction will execute only if the qualifying predicate is true. There is one class of instructions which execute regardless of the state of the qualifying predicate; we'll learn about that when we get to them.

    The Win32 calling convention specifies that predicate registers p0 through p5 are preserved across calls, and p6 through p63 are scratch.

    There is a special pseudo-register called preds by the Windows debugging engine which consists of the 64 predicate registers combined into a single 64-bit value. This pseudo-register is used when code needs to save and restore the state of the predicate registers.
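
    In instructions, a save/restore sequence might go something like this sketch (the register number is made up; a real example appears in the function prologue and epilogue shown in part 3):

    mov r40 = pr        // capture all 64 predicate registers at once
    // ... code that clobbers predicates ...
    mov pr = r40, -2    // put them back (the mask -2 skips the read-only p0)
    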

    Branch registers

    The branch registers are used for indirect jump instructions. The only things you can do with branch registers are load them from an integer register, copy them to an integer register, and jump to them. In particular, you cannot load them directly from memory or do arithmetic on them. If you want to do any of those things, you need to do it with an integer register, then transfer it to a branch register.

    The Win32 calling convention assigns the following meanings to the branch registers:

    Register   Mnemonic   Meaning
    b0         rp         Return address

    The return address register is sometimes called br, but the disassembler calls it rp, so that's what we'll call it.

    The return address register is set automatically by the processor when a br.call instruction is executed.

    By convention, registers b1 through b5 are preserved across calls, while b6 and b7 are scratch. (Exercise: Is b0 preserved across calls?)

    Application registers

    There are a large number of application registers, most of which are not useful to user-mode code. We'll introduce the interesting ones as they arise. I've already mentioned one of them: bsp, the ia64's second stack pointer.

    Break

    Okay, this was a whirlwind tour of the Itanium register set. I bet your head hurts already, and we haven't even started coding yet!

    In fact, we're not going to be coding for quite some time. Next time, we'll look at the instruction format.


    The curse of the redefinition of the symbol HLOG


    A customer was running into a compiler error complaining about redefinition of the symbol HLOG.

    #include <pdh.h>
    #include <lm.h>
    
    ...
    

    The result is

    lmerrlog.h(80): error C2373: 'HLOG' redefinition; different type modifiers
    pdh.h(70): See declaration of 'HLOG'
    

    "Our project uses both performance counters (pdh.h) and networking (lm.h). What can we do to avoid this conflict?"

    We've seen this before. The conflict arises from two problems.

    First is hubris/lack of creativity. "My component does logging. I need a handle to a log. I will call it HLOG because (1) I can't think of a better name, and/or (2) obviously I'm the only person who does logging. (Anybody else who wants to do logging should just quit their job now because it's been done.)"

    This wouldn't normally be a problem except that Win32 uses a global namespace. This is necessary for annoying reasons:

    • Not all Win32 languages support namespaces.
    • Even though C++ supports namespaces, different C++ implementations decorate differently, so there is no agreement on the external linkage. (Indeed, the decoration can change from one version of the C++ compiler to another!)

    Fortunately, in the case of HLOG, the two teams noticed the collision and came to some sort of understanding. If you include them in the order

    #include <lm.h>
    #include <pdh.h>
    

    then pdh.h detects that lm.h has already been included and avoids the conflicting definition.

    #ifndef _LMHLOGDEFINED_
    typedef PDH_HLOG     HLOG;
    #endif
    

    The PDH log is always accessible via the name PDH_HLOG. If lm.h was not also included, then the PDH log is also accessible under the name HLOG.

    Sorry for the confusion.


    Corrupted file causes application to crash; is that a security vulnerability?


    A security vulnerability report came in that went something like this:

    We have found a vulnerability in the XYZ application when it opens the attached corrupted file. The error message says, "Unhandled exception: System.OverflowException. Value was either too large or too small for an Int16." For a nominal subscription fee, you can learn about similar vulnerabilities in Microsoft products in the future.

    Okay, so there is a flaw in the XYZ application where a file that is corrupted in a specific way causes it to suffer an unhandled exception trying to load the file.

    That's definitely a bug, and thanks for reporting it, but is it a security vulnerability?

    The attack here is that you create one of these corrupted files and you trick somebody into opening it. And then when they open it, the XYZ application crashes. The fact that an OverflowException was raised strongly suggests that the application was diligent enough to do its file parsing under the checked keyword, or that the entire module was compiled with the /checked compiler option, so that any overflow or out-of-range errors raise an exception rather than being ignored. That way, the overflow cannot be used as a vector to another attack.

    What is missing from the story is that nobody was set up to catch the overflow exception, so the corrupted file resulted in the entire application crashing rather than some sort of "Sorry, this file is corrupted" error being displayed.

    Okay, so let's assess the situation. What have you gained by tricking somebody into opening the file? The program detects that the file is corrupted and crashes instead of using the values in it. There is no code injection because the overflow is detected at the point it occurs, before any decisions are made based on the overflowed value. Consequently, there is no elevation of privilege. All you got was a denial of service against the XYZ application. (The overflow checking did its job and stopped processing as soon as corruption was detected.)

    There isn't even data loss, because the problem occurred while loading up the corrupted file. It's not like the XYZ application had any old unsaved data.

    At the end of the day, the worst you can do with this crash is annoy somebody.

    Here's another way you can annoy somebody: Send them a copy of onestop.mid.


    When you think you found a problem with a function, make sure you're actually calling the function, episode 2


    A customer reported that the DuplicateHandle function was failing with ERROR_INVALID_HANDLE even though the handle being passed to it seemed legitimate:

      // Create the handle here
      m_Event =
        ::CreateEvent(NULL, FALSE/*bManualReset*/,
                           FALSE/*bInitialState*/, NULL/*lpName*/);
      ... error checking removed ...
    
    
    // Duplicate it here
    HRESULT MyClass::CopyTheHandle(HANDLE *pEvent)
    {
     HRESULT hr = S_OK;
     
     if (m_Event != NULL) {
      BOOL result = ::DuplicateHandle(
                    GetCurrentProcess(),
                    m_Event,
                    GetCurrentProcess(),
                    pEvent,
                    0,
                    FALSE,
                    DUPLICATE_SAME_ACCESS
                    );
      if (!result) {
        // always fails with ERROR_INVALID_HANDLE
        return HRESULT_FROM_WIN32(GetLastError());
      }
     } else {
      *pEvent = NULL;
     }
     
     return hr;
    }
    

    The handle in m_Event appears to be valid. It is non-null, and we can still set and reset it. But we can't duplicate it.

    Now, before claiming that a function doesn't work, you should check what you're passing to it and what it returns. The customer checked the m_Event parameter, but what about the other parameters? The function takes three handle parameters, after all, and they checked only one of them. According to the debugger, DuplicateHandle was called with the parameters

    hSourceProcessHandle  = 0x0aa15b80
    hSourceHandle  = 0x00000ed8 m_Event, appears to be valid
    hTargetProcessHandle  = 0x0aa15b80
    lpTargetHandle  = 0x00b0d914
    dwDesiredAccess  = 0x00000000
    bInheritHandle  = 0x00000000
    dwOptions  = 0x00000002

    Upon sharing this information, the customer immediately saw the problem: The other two handle parameters come from the GetCurrentProcess function, and that function was returning 0x0aa15b80 rather than the expected pseudo-handle (which is currently -1, but that is not contractual).

    The customer explained that their MyClass has a method with the name GetCurrentProcess, and it was that method which was being called rather than the Win32 function GetCurrentProcess. They left off the leading :: and ended up calling the wrong GetCurrentProcess.
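
    The fix was to qualify the calls explicitly (a sketch of the corrected call):

     BOOL result = ::DuplicateHandle(
                   ::GetCurrentProcess(), // leading :: selects the Win32 function,
                   m_Event,               // not the MyClass member function
                   ::GetCurrentProcess(),
                   pEvent,
                   0,
                   FALSE,
                   DUPLICATE_SAME_ACCESS
                   );
    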

    By default, Visual Studio colors member functions and global functions the same, but you can change this in the Fonts and Colors options dialog. Under Show settings for, select Text Editor, and then under Display items you can customize the colors to use for various language elements. In particular, you can choose a special color for static and instance member functions.

    Or, as a matter of style, you could have a policy of not giving member functions the same name as global functions. (This has the bonus benefit of reducing false positives when grepping.)

    Bonus story: A different customer reported a problem with visual styles in the common tab control. After a few rounds of asking questions, coming up with theories, testing the theories, disproving the theories, the customer wrote back: "We figured out what was happening when we tried to step into the call to CreateDialogIndirectParamW. Someone else in our code base redefined all the dialog creation functions in an attempt to enforce a standard font on all of them, but in doing so, they effectively made our code no longer isolation aware, because in the overriding routines, they called CreateDialogIndirectParamW instead of IsolationAwareCreateDialogIndirectParamW. Thanks for all the help, and apologies for the false alarm."


    Please enjoy the new eco-friendly printers, now arguably less eco-friendly


    Some years ago, the IT department replaced the printers and multifunction devices with new, reportedly eco-friendly models. One feature of the new devices is that when you send a job to the printer, it doesn't print out immediately. Printing doesn't begin until you go to the device and swipe your badge through the card reader. The theory here is that this cuts down on the number of forgotten or accidental printouts, where you send a job to a printer and forget to pick it up, or you click the Print button by mistake. If a job is not printed within a few days, it is automatically deleted.

    The old devices already supported secured printing, where the job doesn't come out of the printer until you go to the device and release the job. But with this change, secured printing is now mandatory. Of course, this means that even if you weren't printing something sensitive, you still have to stand there and wait for your document to print instead of having the job already completed and waiting for you.

    The new printing system also removes the need for job separator pages. Avoiding banner pages and eliminating forgotten print jobs are touted as the printer's primary eco-friendly features.

    Other functions provided by the devices are photocopying and scanning. With the old devices, you place your document on the glass or in the document hopper, push the Scan button, and the results are emailed to you. With the new devices, you place your document on the glass or in the document hopper, push the Scan button, and the results are emailed to you, plus a confirmation page is printed out.

    Really eco-friendly service there, printing out confirmation pages for every scanning job.

    The problem was fixed a few weeks later.

    Bonus chatter: Our fax machines also print confirmation pages, or at least they did the last time I used one, many years ago.


    How can I detect whether a keyboard is attached to the computer?


    Today's Little Program tells you whether a keyboard is attached to the computer. The short answer is "Enumerate the raw input devices and see if any of them is a keyboard."

    Remember: Little Programs don't worry about silly things like race conditions.

    #include <windows.h>
    #include <iostream>
    #include <vector>
    #include <algorithm>
    
    bool IsKeyboardPresent()
    {
     UINT numDevices = 0;
      if (GetRawInputDeviceList(nullptr, &numDevices,
                                sizeof(RAWINPUTDEVICELIST)) != 0) {
       throw GetLastError();
     }
    
     std::vector<RAWINPUTDEVICELIST> devices(numDevices);
    
     if (GetRawInputDeviceList(&devices[0], &numDevices,
                               sizeof(RAWINPUTDEVICELIST)) == (UINT)-1) {
      throw GetLastError();
     }
    
     return std::find_if(devices.begin(), devices.end(),
        [](RAWINPUTDEVICELIST& device)
        { return device.dwType == RIM_TYPEKEYBOARD; }) != devices.end();
    }
    
    int __cdecl main(int, char**)
    {
     std::cout << IsKeyboardPresent() << std::endl;
     return 0;
    }
    

    There is a race condition in this code if the number of devices changes between the two calls to GetRawInputDeviceList. I will leave you to fix it before incorporating this code into your program.
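
    One possible fix (a sketch, not the only approach): if the device list grows between the two calls, the second call fails with ERROR_INSUFFICIENT_BUFFER and reports the new count, so you can simply retry until the two calls agree.

     std::vector<RAWINPUTDEVICELIST> GetDeviceList()
     {
      UINT numDevices = 0;
      if (GetRawInputDeviceList(nullptr, &numDevices,
                                sizeof(RAWINPUTDEVICELIST)) != 0) {
       throw GetLastError();
      }
      std::vector<RAWINPUTDEVICELIST> devices(numDevices);
      for (;;) {
       UINT written = GetRawInputDeviceList(devices.data(), &numDevices,
                                            sizeof(RAWINPUTDEVICELIST));
       if (written != (UINT)-1) {
        devices.resize(written); // the list may also have shrunk
        return devices;
       }
       if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
        throw GetLastError();
       }
       devices.resize(numDevices); // the list grew; retry with more room
      }
     }
    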


    What did the Ignore button do in Windows 3.1 when an application encountered a general protection fault?


    In Windows 3.0, when an application encountered a general protection fault, you got an error message that looked like this:

    Application error
    CONTOSO caused a General Protection Fault in
    module CONTOSO.EXE at 0002:2403
    Close

    In Windows 3.1, under the right conditions, you would get a second option:

    CONTOSO
    An error has occurred in your application.
    If you choose Ignore, you should save your work in a new file.
    If you choose Close, your application will terminate.
    Close
    Ignore

    Okay, we know what Close does. But what does Ignore do? And under what conditions will it appear?

    Roughly speaking, the Ignore option becomes available if

    • The fault is a general protection fault,
    • The faulting instruction is not in the kernel or the window manager,
    • The faulting instruction is one of the following, possibly with one or more prefix bytes:
      • Memory operations: op r, m; op m, r; or op m.
      • String memory operations: movs, stos, etc.
      • Selector load: lds, les, pop ds, pop es.

    If these conditions were met, then the Ignore option became available. If you chose to Ignore, then the kernel did the following:

    • If the faulting instruction is a selector load instruction, the destination selector register is set to zero.
    • If the faulting instruction is a pop instruction, the stack pointer is incremented by two.
    • The instruction pointer is advanced over the faulting instruction.
    • Execution is resumed.

    In other words, the kernel did the assembly language equivalent of ON ERROR RESUME NEXT.

    Now, your reaction to this might be, "How could this possibly work? You are just randomly ignoring instructions!" But the strange thing is, this idea was so crazy it actually worked, or at least worked a lot of the time. You might have to hit Ignore a dozen times, but there's a good chance that eventually the bad values in the registers will get overwritten by good values (and it probably won't take long because the 8086 has so few registers), and the program will continue seemingly-normally.

    Totally crazy.

    Exercise: Why didn't the code have to know how to ignore jump instructions and conditional jump instructions?

    Bonus trivia: The developer who implemented this crazy feature was Don Corbitt, the same developer who wrote Dr. Watson.


    Why do I get ERROR_INVALID_HANDLE from GetModuleFileNameEx when I know the process handle is valid?


    Consider the following program:

    #define UNICODE
    #define _UNICODE
    #include <windows.h>
    #include <psapi.h>
    #include <stdio.h> // horrors! mixing C and C++!
    
    int __cdecl wmain(int, wchar_t **)
    {
     STARTUPINFO si = { sizeof(si) };
     PROCESS_INFORMATION pi;
     wchar_t szBuf[MAX_PATH] = L"C:\\Windows\\System32\\notepad.exe";
    
     if (CreateProcess(szBuf, szBuf, NULL, NULL, FALSE,
                       CREATE_SUSPENDED,
                       NULL, NULL, &si, &pi)) {
      if (GetModuleFileNameEx(pi.hProcess, NULL, szBuf, ARRAYSIZE(szBuf))) {
       wprintf(L"Executable is %ls\n", szBuf);
      } else {
       wprintf(L"Failed to get module file name: %d\n", GetLastError());
      }
      TerminateProcess(pi.hProcess, 0);
      CloseHandle(pi.hProcess);
      CloseHandle(pi.hThread);
     } else {
      wprintf(L"Failed to create process: %d\n", GetLastError());
     }
    
     return 0;
    }
    

    This program prints

    Failed to get module file name: 6
    

    and error 6 is ERROR_INVALID_HANDLE. "How can the process handle be invalid? I just created the process!"

    Oh, the process handle is valid. The handle that isn't valid is the NULL.

    "But the documentation says that NULL is a valid value for the second parameter. It retrieves the path to the executable."

    In Windows, processes are initialized in-process. (In other words, processes are self-initializing.) The CreateProcess function creates a process object, sets the initial state of that object, copies some information into the address space of the new process (like the command line parameters), and sets the instruction pointer to the process startup code inside ntdll.dll. From there, the startup code in ntdll.dll pulls the process up by its bootstraps. It creates the default heap. It loads the primary executable and the associated bookkeeping that says "Here is the module information for the primary executable, in case anybody asks." It identifies all the DLLs referenced by the primary executable, the DLLs referenced by those DLLs, and so on. It loads each of the DLLs in turn, creating the module information that says "Here is another module that this process loaded, in case anybody asks," and then it initializes the DLLs in the proper order. Once all the process bootstrapping is complete, ntdll.dll calls the executable entry point, and the program takes control.

    An interesting take-away from this is that modules are a user-mode concept. Kernel mode does not know about modules. All kernel mode sees is that somebody in user mode asked to map sections of a file into memory.

    Okay, so if the process is responsible for managing its modules, how do functions like GetModuleFileNameEx work? They issue a bunch of ReadProcessMemory calls and manually parse the in-memory data structures of another process. Normally, this would be considered "undocumented reliance on internal data structures that can change at any time," and in fact those data structures do change quite often. But it's okay because the people who maintain the module loader (and therefore would be the ones who change the data structures) are also the people who maintain GetModuleFileNameEx (so they know to update the parser to match the new data structures).

    With this background information, let's go back to the original question. Why is GetModuleFileNameEx failing with ERROR_INVALID_HANDLE?

    Observe that the process was created suspended. This means that the process object has been created, the initialization parameters have been injected into the new process's address space, but no code in the process has run yet. In particular, the startup code inside ntdll.dll hasn't run. This means that the code to add a module information entry for the main executable hasn't run.

    Now we can connect the dots. Since the module information entry for the main executable hasn't been added to the module table, the call to GetModuleFileNameEx is going to try to parse the module table from the suspended Notepad process, and it will see that the table is empty. Actually, it's worse than that. The module table hasn't been created yet. The function then reports, "There is no module table entry for NULL," and it tells you that the handle NULL is invalid.

    Functions like GetModuleFileNameEx and CreateToolhelp32Snapshot are designed for diagnostic or debugging tools. There are naturally race conditions involved, because the process you are inspecting is certainly free to load or unload a module immediately after the call returns, at which point your information may be out of date. What's worse, the process you are inspecting may be in the middle of updating its module table, in which case the call may simply fail with a strange error like ERROR_PARTIAL_COPY. (Protecting the data structures with a critical section isn't good enough because critical sections do not cross processes, and the process doing the inspecting is going to be using ReadProcessMemory, which doesn't care about critical sections.)

    In the particular example above, the code could avoid the problem by using the QueryFullProcessImageName function to get the path to the executable.
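
    A sketch of that alternative (QueryFullProcessImageName asks the kernel rather than parsing the target process's module table, so it works even while the process is suspended):

      wchar_t szPath[MAX_PATH];
      DWORD cch = ARRAYSIZE(szPath);
      if (QueryFullProcessImageName(pi.hProcess, 0, szPath, &cch)) {
       wprintf(L"Executable is %ls\n", szPath);
      }
    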

    Bonus chatter: The CreateToolhelp32Snapshot function extracts the information in a different way from GetModuleFileNameEx. Rather than trying to parse the information via ReadProcessMemory, it injects a thread into the target process and runs code to extract the information from within the process, and then marshals the results back. I'm not sure whether this is more crazy than using ReadProcessMemory or less crazy.

    Second bonus chatter: A colleague of mine chose to describe this situation more directly. "Let's cut to the heart of the matter. These APIs don't really work by the normally-accepted definitions of 'work'." These snooping-around functions are best-effort, so use them in situations where best-effort is better than nothing. For example, if you have a diagnostic tool, you're probably happy that it gets information at all, even if it may sometimes be incomplete. (Debuggers don't use any of these APIs. Debuggers receive special events to notify them of modules as they are loaded and unloaded, and those notifications are generated by the loader itself, so they are reliable.)

    Exercise: Diagnose this customer's problem: "If we launch a process suspended, the GetModuleInformation function fails with ERROR_INVALID_HANDLE."

    #include <windows.h>
    #include <psapi.h>
    #include <iostream>
    
    int __cdecl wmain(int, wchar_t **)
    {
     STARTUPINFO si = { sizeof(si) };
     PROCESS_INFORMATION pi;
     wchar_t szBuf[MAX_PATH] = L"C:\\Windows\\System32\\notepad.exe";
    
     if (CreateProcess(szBuf, szBuf, NULL, NULL, FALSE,
                       CREATE_SUSPENDED,
                       NULL, NULL, &si, &pi)) {
      DWORD addr;
      std::cin >> std::hex >> addr;
      MODULEINFO mi;
      if (GetModuleInformation(pi.hProcess, (HINSTANCE)addr,
                               &mi, sizeof(mi))) {
       wprintf(L"Got the module information\n");
      } else {
       wprintf(L"Failed to get module information: %d\n", GetLastError());
      }
      TerminateProcess(pi.hProcess, 0);
      CloseHandle(pi.hProcess);
      CloseHandle(pi.hThread);
     } else {
      wprintf(L"Failed to create process: %d\n", GetLastError());
     }
    
     return 0;
    }
    

    Run Process Explorer, then run this program. When the program asks for an address, enter the address that Process Explorer reports for the base address of the module.
