Other

  • The Old New Thing

    Debugging walkthrough: Access violation on nonsense instruction, episode 3

    • 14 Comments

    A colleague of mine asked for help debugging a strange failure. Execution halted on what appeared to be a nonsense instruction.

    eax=022b13a0 ebx=00000000 ecx=02570df4 edx=769f4544 esi=02570dec edi=05579748
    eip=76c49131 esp=05cce038 ebp=05cce07c iopl=0         nv up ei pl nz na po nc
    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
    KERNELBASE!GetFileAttributesExW+0x2:
    76c49131 ec              in      al,dx
    

    This is clearly an invalid instruction. But observe that the offset is +2. The first two bytes of Windows operating system functions are a mov edi, edi instruction, so offset +2 is where the function proper normally begins. Since there's garbage there, the function has been corrupted. Let's look back two bytes to see if they give any clues.

    0:006> u 76c49131-2
    KERNELBASE!GetFileAttributesExW:
    76c4912f e95aecebf3      jmp     IoLog!Mine_GetFileAttributesExW (6ab07d8e)
    

    Oh look, somebody is doing API patching (already unsupported) and they did a bad job. They tried to patch code while a thread was in the middle of executing it, resulting in a garbage instruction.

    This is a bug in IoLog. The great thing about API patching is that when you screw up, it looks like an OS bug. That way, nobody ever files bugs against you!

    (In this case, IoLog is a diagnostic tool which is logging file I/O performed by an application which is being instrumented.)

    My colleague replied, "Thanks. Looks like a missing lock in IoLog. It doesn't surprise me that API patching isn't supported..."

  • The Old New Thing

    Rules can exist not because there's a problem, but in order to prevent future problems

    • 19 Comments

    I lost the link, but one commenter noted that the ReadFile function documentation says

    Applications must not read from, write to, reallocate, or free the input buffer that a read operation is using until the read operation completes.

    The commenter noted, "What is the point of the rule that disallows reading from or writing to the input buffer while the I/O is in progress? If there is no situation today where this actually causes a problem, then why is the rule there?"

    Not all rules exist to address current problems. They can also exist to prevent future problems.

    In general, you don't want the application messing with an I/O buffer because the memory may have been given to the device, and now the device has to deal with bus contention. And there isn't really much interesting you can do with the buffer before the I/O completes. You can't assume that the I/O will complete the first byte of the buffer first, and the last byte of the buffer last. The I/O request may get split into multiple pieces, and the individual pieces may complete out of order.

    So the rule against accessing the buffer while I/O is in progress is not a significant impediment in practice because you couldn't reliably obtain any information from the buffer until the I/O completed. And the rule leaves room for the future versions of the operating system to take advantage of the fact that the application will not read from or write to the buffer.
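    To make the out-of-order point concrete, here's a minimal C sketch. This is a simulation, not a real I/O API; the fill_chunk helper and the four-chunk split are invented for illustration.

```c
#include <string.h>

/* Illustrative simulation of a 16-byte "read" that the "device" has
   split into four 4-byte chunks, completed in an arbitrary order.
   fill_chunk models one chunk of the request completing. */
void fill_chunk(char *buffer, const char *source, int chunk)
{
    memcpy(buffer + chunk * 4, source + chunk * 4, 4);
}

/* Returns 1 if, after the first filled_chunks completions (in the given
   order), byte 0 of the buffer is still unwritten -- i.e. a mid-flight
   peek at the "start" of the data finds garbage even though later bytes
   have already arrived. */
int start_still_garbage(const int *order, int filled_chunks)
{
    const char source[17] = "ABCDEFGHIJKLMNOP";
    char buffer[16];
    memset(buffer, '?', sizeof(buffer));
    for (int i = 0; i < filled_chunks; i++)
        fill_chunk(buffer, source, order[i]);
    return buffer[0] == '?';
}
```

    A peek after the first completion can find the end of the data filled in before the beginning, which is why nothing useful can be extracted from the buffer until the whole operation completes.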

    Tomorrow, I'll tell a story of a case where accessing the I/O buffer before the I/O completed really did cause problems in Windows 95.

  • The Old New Thing

    Microspeak: DRI, the designated response individual

    • 7 Comments

    Someone sent a message to a peer-to-peer discussion group and remarked, "This is critical. I'm a DRI at the moment and have some issues to fix."

    The term DRI was new to many people on the mailing list (including me), and while others on the mailing list helped to solve the person's problem, I also learned that DRI stands for designated response individual or designated responsible individual, depending on whom you ask. This is the person who is responsible for monitoring and replying to email messages sent to a hot issues mailing list. For online services, it's the person responsible for dealing with live site issues.

    From what I can gather, teams that use this model rotate the job of being the DRI through the members of the team, so that each person on the schedule serves as DRI for a set period of time (typically a day or a week). The DRI may also be responsible for running various tools at specific times. Each team sets its own rules.

    Other teams have come up with their own name for this job. Another term I've seen is Point Dev. On our team, we call it the Developer of the Day.

    Bonus chatter: I bought this hat back in the day when the stitching was done by hand on a specially-designed sewing machine. Nowadays, it's computerized.

  • The Old New Thing

    Crashes in the I/O stack tend to occur in programs which do the most I/O

    • 29 Comments

    A customer was diagnosing repeated blue screen errors on their system. They shared a few crash dumps, and they all had a similar profile: The crash occurred in the file system filter stack as the I/O request passed through the anti-virus software.

    Some of the crashes reported PROCESS_NAME: ngen.exe. "Could ngen.exe be the problem?"

    As a general rule, user-mode code cannot be responsible for blue-screen failures. It's the job of the kernel to be resistant to misbehavior in user mode. Failures of the form IRQL_NOT_LESS_OR_EQUAL and PAGE_FAULT_IN_NONPAGED_AREA are typically driver bugs or faulty hardware (for example, due to overheating or overclocking).

    The application that happened to be active at the time of the failure is not typically interesting in and of itself, although it can give a clue as to what part of the kernel is misbehaving. The fact that ngen appears is more an indication that ngen performs a lot of disk I/O, so if there's a problem in the I/O stack, there's a good chance that ngen was involved, simply because ngen is involved in a lot of I/O requests.

    • Bob goes to the beach very frequently.
    • Every time there is a shark attack, Bob is at the beach.
    • Conclusion: Bob causes shark attacks.

    Blaming ngen for the kernel crash is like blaming Bob for the shark attacks.

    Bonus chatter: Some of my colleagues came to different conclusions:

    • Conclusion: Bob should stop going to the beach.
    • Conclusion: Bob must be the shark.

  • The Old New Thing

    Why does the taskbar icon for grouped windows change to something weird?

    • 21 Comments

    This question came in from two customers. I'm pretty sure it was two customers rather than one customer going on a fishing expedition because the questions came six months apart, and the scenarios were different.

    Suppose you remove all shortcuts to Explorer from the taskbar and the Start menu. Then you create a shortcut to Explorer and put it on the desktop. Wait, you're not done yet. Now view the Properties of that shortcut, use the Change Icon button, and give it some random icon. The uglier the better.

    Last step: Go to the Taskbar properties and set Taskbar buttons to Always combine, hide labels.

    All right, now open an Explorer window. Observe that it has the ugly icon in the taskbar rather than an icon that matches the Explorer window that you opened.

    What's going on here?

    Last step first: Since you configured the taskbar as Always combine, the taskbar icon for Explorer does not come from the window itself, but is rather the group icon.

    Okay, so where does the taskbar get the group icon from? The taskbar looks in the following places to get the group icon:

    1. A shortcut on the Start menu.
    2. A shortcut on the desktop.
    3. The executable itself.

    Normally, a shortcut is found on the Start menu, but in this case, the user explicitly removed all shortcuts to explorer.exe from the Start menu. That means that the winner was the shortcut on the desktop. That shortcut has a really ugly icon, so the taskbar shows the really ugly icon.

    In other words, the reason you're getting an ugly icon is that when Windows tries to figure out the icon to show for Explorer groups, you deleted all the good icons and left only the ugly icon.

    Okay, so why does the taskbar even bother looking at shortcuts on the Start menu and on the desktop? Why doesn't it just show the icon for the executable itself?

    A lot of applications don't bother giving their executable a nice icon. The theory being, "Well, we give our Start menu shortcut a nice icon. And when the program runs, it registers a nice icon when it calls RegisterClass. The executable itself is buried off in the Program Files directory, which nobody should be messing with anyway, so who cares if it has an ugly icon there?" And then when the taskbar first added the "group icons" feature, a lot of programs showed the wrong icon when collapsed to a group.


    So that's where the first rule comes from: See if there is a shortcut to the program on the Start menu. If so, then use that icon, because that's the icon the program wants to show to the user to say "Hey, run my program!"

    But even with that, there were still some incorrect icons. Those came from programs that installed their shortcut on the desktop rather than the Start menu. That's why there is rule number two.

    Only if there is no shortcut on the Start menu or the desktop does the taskbar look to the executable itself.
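    The fallback chain can be sketched in C. The types and names here are hypothetical; the real taskbar code obviously looks nothing like this, but the first-non-NULL ordering is the point.

```c
#include <stddef.h>

/* Hypothetical model of the group-icon lookup order described above. */
typedef struct {
    const char *start_menu_shortcut_icon; /* NULL if no Start menu shortcut */
    const char *desktop_shortcut_icon;    /* NULL if no desktop shortcut */
    const char *executable_icon;          /* the executable always has one */
} IconSources;

/* Rule 1: Start menu shortcut. Rule 2: desktop shortcut.
   Rule 3: fall back to the executable itself. */
const char *pick_group_icon(const IconSources *s)
{
    if (s->start_menu_shortcut_icon) return s->start_menu_shortcut_icon;
    if (s->desktop_shortcut_icon)    return s->desktop_shortcut_icon;
    return s->executable_icon;
}
```

    With the Start menu shortcuts deleted, the desktop shortcut's ugly icon wins, exactly as in the customer's scenario.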

    It so happens that Explorer already has to keep track of every shortcut on the Start menu and on the desktop, because it needs to keep track of any hotkeys registered by those shortcuts. Having it keep track of yet another piece of information for every shortcut wasn't too much of an extra burden.

    Bonus chatter: Why not just create a compatibility shim for these ugly applications?

    In general, when you find these sorts of compatibility issues, you can choose to fix them either by accommodating the issue in the core operating system, or by creating a compatibility shim that applies only to the applications affected by the issue. If the problem is widespread, then you just have to suck it up and put the compatibility behavior in the core operating system. Otherwise, the compatibility database would be bloated with thousands of entries. What's more, it's clear that there will be a very long tail of affected applications, seeing as the default icon for MFC applications is the generic MFC icon, and there are probably a whole lot of vertical-market and line-of-business applications that are just using the default icon without realizing it. These applications are not big-market mainstream applications, so they will likely never come to the attention of the application compatibility team.

  • The Old New Thing

    The ARM processor architecture: Somebody else's introduction

    • 15 Comments

    While you wait for me to write the Alpha AXP processor architecture overview,¹ I'd like to direct your attention to the ARM processor architecture overview written up by Marion Cole over at the NT Debugging blog. I hope that'll keep you busy for a little while longer.

    ¹ Spoiler alert: Don't hold your breath.

  • The Old New Thing

    The Itanium processor, part 10: Register rotation

    • 18 Comments

    Last time, we looked at counted loops and then improved a simple loop by explicitly pipelining the loop iterations. This time, we're going to take the pipelining to the next level.

    Let's reorder the columns of the chart we had last time so the instructions are grouped not by the register being operated upon but by the operation being performed. Since no instructions within an instruction group are dependent on each other in our example, I can reorder them without affecting the logic.

    1   ld4 r32 = [r29], 4       ;;
    2 ld4 r33 = [r29], 4       ;;
    3 ld4 r34 = [r29], 4   adds r32 = r32, 1   ;;
    4 ld4 r35 = [r29], 4   adds r33 = r33, 1 st4 [r28] = r32, 4 ;;
    5 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r33, 4 ;;
    6 ld4 r33 = [r29], 4   adds r35 = r35, 1 st4 [r28] = r34, 4 ;;
    7 ld4 r34 = [r29], 4   adds r32 = r32, 1 st4 [r28] = r35, 4 ;;
    8 ld4 r35 = [r29], 4   adds r33 = r33, 1 st4 [r28] = r32, 4 ;;
    9 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r33, 4 ;;
    10 ld4 r33 = [r29], 4   adds r35 = r35, 1 st4 [r28] = r34, 4 ;;
    11     adds r32 = r32, 1 st4 [r28] = r35, 4 ;;
    12   adds r33 = r33, 1 st4 [r28] = r32, 4 ;;
    13     st4 [r28] = r33, 4 ;;

    What an interesting pattern. Each column represents a functional unit, and at each cycle, the unit operates on a different register in a clear pattern: r32, r33, r34, r35, then back to r32. The units are staggered so that each operates on a register precisely when its result from the previous unit is ready.

    Suppose you have to make 2000 sandwiches and you have four employees. You could arrange your sandwich factory with three stations. At the first station, you have the bread and the toaster. At the second station, you have the protein. At the third station, you have the toppings. Each employee goes through the stations in order: First they take two pieces of bread and put them in the toaster. When the toast is finished, they add the protein, then they add the toppings, and then they put the finished sandwich in the box. Once that's done, they go back to the first station. You stagger the starts of the four employees so that at any moment, one is preparing the bread, one is waiting for the toaster, one is adding protein, and one is adding the toppings.

    That is how the original code was arranged. Each register is an employee that is at one of the four stages of sandwich construction.

    But another way to organize your sandwich factory is as an assembly line. You put one employee in charge of the bread, one in charge of the toaster, one in charge of the protein, and one in charge of the toppings. When a sandwich completes a stage in the process, it gets handed from one employee to the next.

    (And since there isn't really anything for the toaster-boy to do, you can eliminate that position and create the same number of sandwiches per second with only three employees. The original version had each employee sitting idle waiting for the toaster 25% of the time. Switching to the assembly line model allowed us to squeeze out that idle time.)

    Let's apply the assembly line model to our code. Handing a sandwich from one person to the next is done by moving the value from one register to the next. Let's imagine that there is a slide instruction that you can put at the end of an instruction group which copies r32 to r33, r33 to r34, and so on.

    1   ld4 r32 = [r29], 4       slide ;;
    2 ld4 r32 = [r29], 4       slide ;;
    3 ld4 r32 = [r29], 4   adds r34 = r34, 1   slide ;;
    4 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    5 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    6 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    7 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    8 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    9 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    10 ld4 r32 = [r29], 4   adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    11     adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    12   adds r34 = r34, 1 st4 [r28] = r35, 4 slide ;;
    13     st4 [r28] = r35, 4 slide ;;

    During the execution of the first instruction group, the first value is loaded into r32, and the slide instruction slides it into r33.

    At the second instruction group, the second value is loaded into r32, and the first value sits unchanged in r33. (Technically, the value is waiting to be loaded into r33.) The slide instruction slides the second value into r33 and the first value into r34.

    At the third instruction group, the third value is loaded into r32, and the first value (now in r34) is incremented. Then the slide instruction slides the third value into r33, the second value into r34, and the first value into r35.

    At the fourth instruction group, the fourth value is loaded into r32, the second value (now in r34) is incremented, and the first value (now in r35) is stored to memory. Then the slide instruction slides the fourth value into r33, the third value into r34, and the second value into r35. (The first value slides into r36, but we don't really care.)

    And so on. At each instruction group, a fresh value is loaded into r32, a previously-loaded value is incremented in r34, and the incremented value is stored from r35. And then the slide instruction moves everything down one step for the next instruction group.
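    Here's a minimal C model of this hypothetical slide instruction and the schedule above. The four-element array stands in for r32 through r35, and each pass through the loop is one instruction group.

```c
/* reg[0..3] stand for r32..r35. slide() copies each register into the
   next one, modeling the hypothetical slide instruction. */
enum { ROT = 4 };

void slide(int reg[ROT])
{
    for (int i = ROT - 1; i > 0; i--)
        reg[i] = reg[i - 1];
}

/* Run the pipelined schedule: at group t we load element t (while any
   remain), increment the value that has rotated into reg[2], and store
   the value that has rotated into reg[3]. n elements take n + 3 groups,
   matching the 13 groups shown for 10 elements. Returns the group count. */
int increment_array_pipelined(int *a, int n)
{
    int reg[ROT] = { 0 };
    int groups = 0;
    for (int t = 0; t < n + 3; t++, groups++) {
        if (t < n)               reg[0] = a[t];       /* ld4  */
        if (t >= 2 && t < n + 2) reg[2] = reg[2] + 1; /* adds */
        if (t >= 3)              a[t - 3] = reg[3];   /* st4  */
        slide(reg);
    }
    return groups;
}
```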

    When we reach the 11th instruction group, we drain out the last value and don't bother starting up any new ones.

    Observe that the above code also falls into a prologue/kernel/epilogue pattern. In the prologue, the assembly line starts up and gradually fills the registers with work. In the kernel, the assembly line is completely busy. And in the epilogue, the work of the final registers drains out.

    You can already see how br.cloop would come in handy here: The kernel can be written as a single-instruction loop! But wait, there's more.

    Let's add some predicate registers to the mix. Let's suppose that the slide instruction slides not only integer registers but also predicate registers.

    1   (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    2 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    3 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    4 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    5 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    6 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    7 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    8 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    9 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    10 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    11 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    12 (p16) ld4 r32 = [r29], 4 (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;
    13 (p16) ld4 r32 = [r29], 4   (p18) adds r34 = r34, 1 (p19) st4 [r28] = r35, 4 slide ;;

    We can initially set p16 = true, p17 = p18 = p19 = false. That way, only the load executes in the first instruction group. And then the slide instruction slides both the integer registers and the predicate registers, which causes p17 to become true.

    In the second instruction group, again, only the load executes. And then the slide instruction slides p17 into p18, so now p18 = true also.

    Since p18 = true, the third instruction group both loads and increments. And then the slide instruction slides p18 into p19, so now all of the predicates are true.

    With all the predicates true, every step in instruction groups three through 10 execute.

    Now, with instruction group 11, we want to slide the predicates, but also set p16 to false. That turns off the ld4 instruction.

    The p16 = false then slides into p17 for instruction group 12, then into p18 for instruction group 13, which turns off the increment instruction.

    If we can get the slide instruction to slide the predicates and set p16 to true for the first 10 instructions, and set it to false for the last three, then we can simply execute the same instruction 13 times!

    Okay, now I can reveal the true identity of the slide instruction: It's called br.ctop.

    The br.ctop instruction works like this:

     if (ar.lc != 0) { slide; p16 = true; ar.lc = ar.lc - 1; goto branch; }
     if (ar.ec != 0) { slide; p16 = false; ar.ec = ar.ec - 1;
                      if (ar.ec != 0) goto branch; }
     else { /* unimportant */ }
    

    In words, the br.ctop instruction first checks the ar.lc register (loop counter). If it is nonzero, then the registers slide over, the p16 register is set to true, the loop counter is decremented, and the jump is taken.

    If ar.lc is zero, then the br.ctop instruction checks the ar.ec register (epilogue counter). If it is nonzero, then the registers slide over, the p16 register is set to false, and the epilogue counter is decremented. If the decremented value of the epilogue counter is nonzero, then the jump is taken; otherwise we fall through and the loop ends.

    (If both ar.lc and ar.ec are zero, then the loop is finished before it even started. Some weird edge-case handling happens here which is not important to the discussion.)
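    As a sanity check, the pseudocode above transcribes directly into C. This model ignores the register sliding and tracks only the counters and p16; br_ctop returns whether the branch is taken.

```c
/* A C model of br.ctop: lc is ar.lc, ec is ar.ec, p16 is the predicate
   injected for the new iteration. Returns 1 if the branch is taken. */
typedef struct { int lc, ec, p16; } CtopState;

int br_ctop(CtopState *s)
{
    if (s->lc != 0) { s->p16 = 1; s->lc--; return 1; }        /* kernel   */
    if (s->ec != 0) { s->p16 = 0; s->ec--; return s->ec != 0; } /* epilogue */
    return 0;
}

/* How many times does the loop body execute? Once by fall-through,
   plus once per taken branch. */
int count_iterations(int lc, int ec)
{
    CtopState s = { lc, ec, 1 };
    int n = 1;
    while (br_ctop(&s))
        n++;
    return n;
}
```

    With ar.lc = 9 and ar.ec = 4, the body executes 13 times: exactly the 13 instruction groups of the 10-element example, prologue and epilogue included.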

    Code that takes advantage of the br.ctop instruction goes like this:

          alloc r36 = ar.pfs, 0, 8, 0, 4 // four rotating registers!
          mov r37 = ar.lc         // preserve lc
          mov r38 = ar.ec         // preserve ec
          mov r39 = preds         // preserve predicates
    
          addl r31 = r0, 1999 ;;  // r31 = 1999
          mov ar.lc = r31         // ar.lc = 1999
          mov ar.ec = 4
          mov pr.rot = 1 << 16 // p16 = true, all others false
          addl r29 = gp, -205584  // calculate start of array
          addl r28 = r29, 0 ;;     // put it in both r28 and r29
    
    again:
    (p16) ld4 r32 = [r29], 4      // execute an entire loop with
    (p18) adds r34 = r34, 1       // a single instruction group
    (p19) st4 [r28] = r35, 4      // using this one weird trick
          br.ctop again ;;
    
          mov ar.lc = r37         // restore registers we preserved
          mov ar.ec = r38
          mov preds = r39
          mov ar.pfs = r36
          br.ret.sptk.many rp     // return
    

    We are now using the last parameter to the alloc instruction. The 4 says that we want four rotating registers. The ar.lc and ar.ec registers must be preserved across calls, so we save them here for restoration at the end. Predicate registers p16 through p63 must also be preserved, so we save all the predicate registers by using the preds pseudo-register which grabs all 64 predicates into a single 64-bit value.

    Next, we set up the loop by setting the loop counter to the number of additional times we want to execute the loop (not counting the one execution we get via fall-through), the epilogue counter to the number of steps we need in order to drain the final iterations, and set the predicates so that p16 = true and everything else is false. We also set up r28 and r29 to step through the array.

    Once that is done, we can execute the entire loop in a single instruction group.

    And then we clean up after the loop by restoring all the registers to how we found them, then return.

    And there you have register rotation. It lets you compress the prologue, kernel, and epilogue of a pipelined loop into a single instruction group.

    I pulled a fast one here: The Itanium requires that the number of rotating registers be a multiple of eight. So our code really should look like this:

          alloc r40 = ar.pfs, 0, 12, 0, 8
          mov r41 = ar.lc         // preserve lc
          mov r42 = ar.ec         // preserve ec
          mov r43 = preds         // preserve predicates
    
          addl r31 = r0, 1999 ;;  // r31 = 1999
          mov ar.lc = r31         // ar.lc = 1999
          mov ar.ec = 4
          mov pr.rot = 1 << 16 // p16 = true, all others false
          addl r29 = gp, -205584  // calculate start of array
          addl r28 = r29, 0 ;;     // put it in both r28 and r29
    
    again:
    (p16) ld4 r32 = [r29], 4      // execute an entire loop with
    (p18) adds r34 = r34, 1       // a single instruction group
    (p19) st4 [r28] = r35, 4      // using this one weird trick
          br.ctop again ;;
    
          mov ar.lc = r41         // restore registers we preserved
          mov ar.ec = r42
          mov preds = r43
          mov ar.pfs = r40
          br.ret.sptk.many rp     // return
    

    Instead of four rotating registers, we use eight. The underlying analysis remains the same. We are just throwing more registers into the pot.

    Now, the loop we were studying happens to be very simple, with only one load and one store. For more complex loops, you may need to use things like the unconditional comparison, or pipelining the iterations with a stagger of more than one cycle.

    There are other types of instructions for managing loops with register rotation. For example, br.cexit is like br.ctop except that it jumps when br.ctop falls through and vice versa. This is handy to put at the start of your pipelined loop to handle the case where the number of iterations is zero. There are also br.wtop and br.wexit instructions to handle while loops instead of counted loops. The basic idea is the same, so I won't go into the details. You can read the Itanium manual to learn more.

    That's the end of the whirlwind tour of the Itanium architecture. There are still parts left unexplored, but I tried to hit the most interesting sights.

  • The Old New Thing

    The Itanium processor, part 9: Counted loops and loop pipelining

    • 8 Comments

    There is a dedicated register named ar.lc for counted loops. The br.cloop instruction acts like this:

        if (ar.lc != 0) { ar.lc = ar.lc - 1; goto branch; }
    

    Consider this loop to increment every 32-bit integer in an array.

    extern int array[2000];
    
    void IncrementEachElement()
    {
     for (int i = 0; i < 2000; i++) {
      array[i]++;
     }
    }
    

    This could be compiled as

        mov r30 = ar.lc         // save original value of ar.lc
        addl r29 = gp, -205584  // calculate start of array
        addl r31 = r0, 1999 ;;  // r31 = 1999
        mov ar.lc = r31         // loop 1999 times
    again:
        ld4 r31 = [r29] ;;      // load the next integer
        adds r31 = r31, 1 ;;    // increment the value
        st4 [r29] = r31, 4      // store it and autoincrement
        br.cloop again ;;       // do it 1999 more times
        mov ar.lc = r30         // restore ar.lc
        br.ret.sptk.many rp     // return
    

    Note that the ar.lc register is initialized to one fewer than the number of iterations desired. That's because it counts the number of times the br.cloop instruction will branch. Since we used fall-through to initiate the loop, one of the times through the loop was already performed, and we want br.cloop to branch only 1999 times.
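    A quick C model of br.cloop's counting behavior (just the counter logic, nothing else) confirms the off-by-one reasoning:

```c
/* Model of a fall-through loop ended by br.cloop: the branch is taken
   while the counter lc is nonzero, and lc counts taken branches, not
   loop-body executions. */
int body_executions(int lc)
{
    int count = 0;
    do {
        count++;                        /* loop body */
    } while (lc != 0 ? (lc--, 1) : 0);  /* br.cloop  */
    return count;
}
```

    Initializing the counter to 1999 runs the body 2000 times, one per array element; initializing it to 0 still runs the body once, because of the fall-through entry.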

    The ar.lc register must be preserved across calls, so if you intend to use it in your function, you need to save its original value and restore it when done. (You also need to record in the unwind table where you saved it, so it's easier to do it up front; otherwise, you have to go to the extra work of encoding how you shrink-wrapped the function.)

    For the sake of illustration, let's say that the CPU can fetch memory from cache in two cycles, and each cycle it can issue one load and one store. (If the memory access is not in cache, it takes basically forever, in which case it doesn't really matter how we optimize the rest of the code, so we may as well assume that all memory accesses are cache hits.) Each iteration of the loop performs a fetch (two cycles), an addition (one cycle), then a store in parallel with a conditional jump (one cycle), for a total of four cycles per iteration.

    Let's try to do better.

    First, let's simplify to the case where the array has only four elements. We could do it like this:

        alloc r36 = ar.pfs, 0, 5, 0, 0 // set up frame
        addl r29 = gp, -205584  // calculate start of array
        addl r28 = r29, 0 ;;    // put it in both r28 and r29
    
        ld4 r32 = [r29], 4 ;;   // crazy stuff
        ld4 r33 = [r29], 4 ;;
        adds r32 = r32, 1
        ld4 r34 = [r29], 4 ;;
        st4 [r28] = r32, 4
        adds r33 = r33, 1
        ld4 r35 = [r29], 4 ;;
        st4 [r28] = r33, 4
        adds r34 = r34, 1 ;;
        st4 [r28] = r34, 4
        adds r35 = r35, 1 ;;
        st4 [r28] = r35, 4 ;;
    
        mov ar.pfs = r36
        br.ret.sptk.many rp     // return
    

    (In reality, we would reorder the instructions in order to match the templates better, but I'll leave them in this order for now.)

    That is kind of hard to understand, so let me rewrite the crazy middle part like this, putting all the instructions from an instruction group on one line, adding some separator lines, and putting instructions into columns carefully chosen to highlight the structure of the code.

    1   ld4 r32 = [r29], 4       ;;
    2   ld4 r33 = [r29], 4     ;;
    3 adds r32 = r32, 1   ld4 r34 = [r29], 4   ;;
    4 st4 [r28] = r32, 4 adds r33 = r33, 1   ld4 r35 = [r29], 4 ;;
    5   st4 [r28] = r33, 4 adds r34 = r34, 1   ;;
    6     st4 [r28] = r34, 4 adds r35 = r35, 1 ;;
    7       st4 [r28] = r35, 4 ;;

    The first thing to observe is that this sequence completes in just seven cycles, as opposed to the 16 cycles of the original version. That's over double the performance!

    Notice that each column performs one iteration of the loop. Each column uses a different register to do the calculation, and they share register r29 to hold the address of the next value to read and r28 to hold the address of the next value to write. Each column also waits two cycles after each read before consuming the result, thereby avoiding memory stalls.

    The idea here is to run multiple iterations of the loop in parallel, but setting each one to begin one cycle after the start of the previous iteration. Staggering the starts keeps us from overloading the memory controller. (Otherwise, everybody would issue load requests in cycle 1, and the memory controller would stall.)
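    Here's a small C model of this staggered schedule. The four-element array stands in for r32 through r35, and the modulo picks which register's turn it is; you can check it produces the same result as the simple loop in n + 3 cycles.

```c
/* reg[0..3] stand for r32..r35. At cycle t, register (t mod 4) loads
   element t, increments it two cycles later, and stores it the cycle
   after that -- by which time that register's previous value has long
   since been stored, so reuse is safe. Returns the cycle count. */
int increment_array_staggered(int *a, int n)
{
    int reg[4];
    int cycles = 0;
    for (int t = 0; t < n + 3; t++, cycles++) {
        if (t < n)               reg[t % 4] = a[t];        /* ld4  */
        if (t >= 2 && t - 2 < n) reg[(t - 2) % 4] += 1;    /* adds */
        if (t >= 3)              a[t - 3] = reg[(t - 3) % 4]; /* st4 */
    }
    return cycles;
}
```

    Ten elements finish in 13 cycles, matching the chart: n + 3 instead of 4n.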

    Now, the Itanium has a lot of registers, but it doesn't have 2000 of them. Fortunately, we don't need 2000 of them. Observe that starting at cycle 5, we can reuse register r32 because the previous iteration doesn't need it any more. So if we need to increment ten elements, we can do it this way:

    1   ld4 r32 = [r29], 4       ;;
    2   ld4 r33 = [r29], 4     ;;
    3 adds r32 = r32, 1   ld4 r34 = [r29], 4   ;;
    4 st4 [r28] = r32, 4 adds r33 = r33, 1   ld4 r35 = [r29], 4 ;;
    5 ld4 r32 = [r29], 4 st4 [r28] = r33, 4 adds r34 = r34, 1   ;;
    6   ld4 r33 = [r29], 4 st4 [r28] = r34, 4 adds r35 = r35, 1 ;;
    7 adds r32 = r32, 1   ld4 r34 = [r29], 4 st4 [r28] = r35, 4 ;;
    8 st4 [r28] = r32, 4 adds r33 = r33, 1   ld4 r35 = [r29], 4 ;;
    9 ld4 r32 = [r29], 4 st4 [r28] = r33, 4 adds r34 = r34, 1   ;;
    10   ld4 r33 = [r29], 4 st4 [r28] = r34, 4 adds r35 = r35, 1 ;;
    11 adds r32 = r32, 1   st4 [r28] = r35, 4 ;;
    12 st4 [r28] = r32, 4 adds r33 = r33, 1   ;;
    13 st4 [r28] = r33, 4     ;;

    We incremented ten elements in 13 cycles instead of 40. In general, we can increment n elements in n + 3 cycles instead of 4n. For large values of n this is a four-fold speed-up over the original version.

    The pattern above breaks down into three natural sections.

    cycle  r32                   r33                   r34                   r35
    ----- Warm-up -----
      1    ld4  r32 = [r29], 4                                                                   ;;
      2                          ld4  r33 = [r29], 4                                             ;;
      3    adds r32 = r32, 1                           ld4  r34 = [r29], 4                       ;;
    ----- Cruise -----
      4    st4  [r28] = r32, 4   adds r33 = r33, 1                           ld4  r35 = [r29], 4 ;;
      5    ld4  r32 = [r29], 4   st4  [r28] = r33, 4   adds r34 = r34, 1                         ;;
      6                          ld4  r33 = [r29], 4   st4  [r28] = r34, 4   adds r35 = r35, 1   ;;
      7    adds r32 = r32, 1                           ld4  r34 = [r29], 4   st4  [r28] = r35, 4 ;;
      8    st4  [r28] = r32, 4   adds r33 = r33, 1                           ld4  r35 = [r29], 4 ;;
      9    ld4  r32 = [r29], 4   st4  [r28] = r33, 4   adds r34 = r34, 1                         ;;
     10                          ld4  r33 = [r29], 4   st4  [r28] = r34, 4   adds r35 = r35, 1   ;;
    ----- Cool-down -----
     11    adds r32 = r32, 1                                                 st4  [r28] = r35, 4 ;;
     12    st4  [r28] = r32, 4   adds r33 = r33, 1                                               ;;
     13                          st4  [r28] = r33, 4                                             ;;

    The first three cycles comprise the warm-up phase (formally known as the prologue). At the start, no registers are doing any work, but during the course of the warm-up phase, they get into the act one at a time. At the end of the warm-up phase, all the registers are busy doing work.

    Most of the time is spent in the middle cruise phase (formally known as the kernel), wherein all four registers are busy carrying out one of the iterations. Note that during every cycle of the cruise phase, there is a load, an increment, and a store, with the registers taking turns performing each of the operations.

    The last three cycles are the cool-down phase (formally known as the epilogue), where the registers start draining their last bits of work and no new work is started.
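In source form, the loop being pipelined is just an increment over an array of 4-byte integers — something like this C++ sketch (the function name and signature are mine, not from the article):

```cpp
#include <cstdint>

// The loop that the compiler software-pipelines: each iteration is a
// load / add / store chain, and the pipelined schedule overlaps
// iteration i's store with iteration i+1's add and iteration i+2's load.
void increment_all(int32_t* p, int n) {
    for (int i = 0; i < n; i++) {
        p[i] += 1;
    }
}
```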

    Okay, now that we understand how the above code works, we're going to turn it on its side next time. Stay tuned for the thrilling conclusion!

  • The Old New Thing

    The Itanium processor, part 8: Advanced loads

    • 14 Comments

    Today we'll look at advanced loads, which is when you load a value before you're supposed to, in the hope that the value won't change in the meantime.

    Consider the following code:

    int32_t SomeClass::tryGetValue(int32_t *value)
    {
     if (!m_errno) {
      *value = m_value;
      m_readCount++;
     }
     return m_errno;
    }
    

    Let's say that SomeClass has m_value at offset zero, m_readCount at offset 4, and m_errno at offset 8.

    The naïve way of compiling this function would go something like this:

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = value
    
            addl    r30 = 08h, r32          // calculate &m_errno
            addl    r29 = 04h, r32 ;;       // calculate &m_readCount
    
            ld4     ret0 = [r30] ;;         // load m_errno
    
            cmp4.eq p6, p7 = ret0, r0       // p6 = m_errno == 0, p7 = !p6
    
    (p7)    br.ret.sptk.many rp             // return m_errno if there was an error¹
    
            ld4     r31 = [r32] ;;          // load m_value (at offset 0)
            st4     [r33] = r31 ;;          // store m_value to *value
    
            ld4     r28 = [r29] ;;          // load m_readCount
            addl    r28 = 01h, r28 ;;       // calculate m_readCount + 1
            st4     [r29] = r28 ;;          // store updated m_readCount
    
            ld4     ret0 = [r30]            // reload m_errno for return value
    
            br.ret.sptk.many rp             // return
    

    First, we calculate the addresses of our member variables. Then we load m_errno, and if there is an error, then we return it immediately. Otherwise, we copy the current value to *value, load m_readCount, increment it, and finally, we return m_errno.

    The problem here is that we have a deep dependency chain.

    addl r30 = 08h, r32
    ld4 ret0 = [r30]
    cmp4.eq p6, p7 = ret0, r0
    (p7) br.ret.sptk.many rp ld4 r31 = [r32]
    st4 [r33] = r31 addl r29 = 04h, r32
    non-obvious dependency
    ld4 r28 = [r29]
    addl r28 = 01h, r28
    st4 [r29] = r28
    non-obvious dependency
    ld4 ret0 = [r30]
    br.ret.sptk.many rp

    Pretty much every instruction depends on the result of the previous instruction. Some of these dependencies are obvious. You have to calculate the address of a member variable before you can read it, and you have to get the result of a memory access before you can perform arithmetic on it. Some of the dependencies are not obvious. For example, we cannot access m_value or m_readCount until after we confirm that m_errno is zero, to avoid a potential access violation if the object straddles a page boundary with m_errno on one page and m_value on the other (invalid) page. (We saw last time how this can be solved with speculative loads, but let's not add that to the mix yet.)

    Returning m_errno is a non-obvious dependency. We'll see why later. For now, note that the return value came from a memory access, which means that if the caller of the function tries to use the return value, it may stall waiting for the result to arrive from the memory controller.

    When you issue a read on Itanium, the processor merely initiates the operation and proceeds to the next instruction before the read completes. If you try to use the result of the read too soon, the processor stalls until the value is received from the memory controller. Therefore, you want to put as much distance as possible between the load of a value from memory and the attempt to use the result.

    Let's see what we can do to parallelize this function. We'll perform the increment of m_readCount and the fetch of m_value simultaneously.

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = value
    
            addl    r30 = 08h, r32          // calculate &m_errno
            addl    r29 = 04h, r32 ;;       // calculate &m_readCount
    
            ld4     ret0 = [r30] ;;         // load m_errno
    
            cmp4.eq p6, p7 = ret0, r0       // p6 = m_errno == 0, p7 = !p6
    
    (p7)    br.ret.sptk.many rp             // return m_errno if there was an error
    
            ld4     r31 = [r32]             // load m_value (at offset 0)
            ld4     r28 = [r29] ;;          // preload m_readCount
    
            addl    r28 = 01h, r28          // calculate m_readCount + 1
            st4     [r33] = r31 ;;          // store m_value to *value
    
            st4     [r29] = r28             // store updated m_readCount
    
            br.ret.sptk.many rp             // return (answer already in ret0)
    

    We've basically rewritten the function as

    int32_t SomeClass::tryGetValue(int32_t *value)
    {
     int32_t local_errno = m_errno;
     if (!local_errno) {
      int32_t local_readCount = m_readCount;
      int32_t local_value = m_value;
      local_readCount = local_readCount + 1;
      *value = local_value;
      m_readCount = local_readCount;
     }
     return local_errno;
    }
    

    This time we loaded the return value from m_errno long before the function ends, so when the caller tries to use the return value, it will definitely be ready and not incur a memory stall. (If a stall were needed, it would have occurred at the cmp4.) And we've also shortened the dependency chain significantly in the second half of the function.

    addl r30 = 08h, r32
    ld4 ret0 = [r30]
    cmp4.eq p6, p7 = ret0, r0 addl r29 = 04h, r32
    (p7) br.ret.sptk.many rp ld4 r31 = [r32] ld4 r28 = [r29]
    st4 [r33] = r31 addl r28 = 01h, r28
    st4 [r29] = r28
    br.ret.sptk.many rp

    This works great until somebody does this:

    int32_t SomeClass::Haha()
    {
      return this->tryGetValue(&m_readCount);
    }
    

    or even this:

    int32_t SomeClass::Hoho()
    {
      return this->tryGetValue(&m_errno);
    }
    

    Oops.

    Let's look at Haha. Suppose that our initial conditions are m_errno = 0, m_value = 42, and m_readCount = 0.

    Original:

     if (!m_errno)                 // true
      *value = m_value;            // m_readCount = 42
      m_readCount++;               // m_readCount = 43
     return m_errno;               // 0

    Optimized:

     local_errno = m_errno;        // 0
     if (!local_errno)             // true
      readCount = m_readCount;     // 0
      *value = m_value;            // m_readCount = 42
      m_readCount = readCount + 1; // m_readCount = 1
     return local_errno;           // 0

    The original code copies the value before incrementing the read count. This means that if the caller says that m_readCount is the output variable, the act of copying the value modifies m_readCount. This modified value is then incremented. Our optimized version does not take this case into account and sets m_readCount to the old value incremented by 1.

    We were faked out by pointer aliasing!

    (A similar disaster occurs in Hoho.)
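The divergence is easy to reproduce in plain C++ by writing both orderings out by hand (the struct layout and the name tryGetValueOpt are mine; the "optimized" version mimics the compiler's reordering):

```cpp
#include <cstdint>

// Demonstration of the aliasing bug: the "optimized" version caches
// m_readCount before the store to *value, which is wrong when value
// aliases &m_readCount.
struct SomeClass {
    int32_t m_value = 42;
    int32_t m_readCount = 0;
    int32_t m_errno = 0;

    int32_t tryGetValue(int32_t* value) {          // original semantics
        if (!m_errno) {
            *value = m_value;                      // may clobber m_readCount
            m_readCount++;
        }
        return m_errno;
    }

    int32_t tryGetValueOpt(int32_t* value) {       // the reordered version
        int32_t local_errno = m_errno;
        if (!local_errno) {
            int32_t local_readCount = m_readCount; // cached too early
            *value = m_value;
            m_readCount = local_readCount + 1;     // overwrites the store
        }
        return local_errno;
    }
};
```

With value = &m_readCount, the original leaves m_readCount at 43 while the reordered version leaves it at 1, matching the table above.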

    Now, whether the behavior described above is intentional or desirable is not at issue here. The C++ language specification requires that the original code result in the specified behavior, so the compiler is required to honor it. Optimizations cannot alter the behavior of standard-conforming code, even if that behavior seems strange to a human being reading it.

    But we can still salvage this optimization by handling the aliasing case. The processor contains support for aliasing detection via the ld.a instruction.

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = value
    
            addl    r30 = 08h, r32          // calculate &m_errno
            addl    r29 = 04h, r32 ;;       // calculate &m_readCount
    
            ld4     ret0 = [r30] ;;         // load m_errno
    
            cmp4.eq p6, p7 = ret0, r0       // p6 = m_errno == 0, p7 = !p6
    
    (p7)    br.ret.sptk.many rp             // return m_errno if there was an error
    
            ld4     r31 = [r32]             // load m_value (at offset 0)
            ld4.a   r28 = [r29] ;;          // preload m_readCount
    
            addl    r28 = 01h, r28          // calculate m_readCount + 1
            st4     [r33] = r31             // store m_value to *value
    
            chk.a.clr r28, recover ;;       // recover from pointer aliasing
    recovered:
            st4     [r29] = r28 ;;          // store updated m_readCount
    
            br.ret.sptk.many rp             // return
    
    recover:
            ld4     r28 = [r29] ;;          // reload m_readCount
            addl    r28 = 01h, r28          // recalculate m_readCount + 1
            br      recovered               // recovery complete, resume mainline code
    

    The ld.a instruction is the same as an ld instruction, but it also tells the processor that this is an advanced load, and that the processor should stay on the lookout for any instructions that write to any bytes accessed by the load instruction. When the value is finally consumed, you perform a chk.a.clr to check whether the value you loaded is still valid. If no instructions have written to the memory in the meantime, then great. But if the address was written to, the processor will jump to the recovery code you provided. The recovery code re-executes the load and any other follow-up calculations, then returns to the original mainline code path.

    The .clr completer tells the processor to stop monitoring that address. It clears the entry from the Advanced Load Address Table, freeing it up for somebody else to use.

    There is also a ld.c instruction which is equivalent to a chk.a that jumps to a reload and then jumps back. In other words,

        ld.c.clr r1 = [r2]
    

    is equivalent to

        chk.a.clr r1, recover
    recovered:
        ...
    
    recover:
        ld     r1 = [r2]
        br     recovered
    

    but is much more compact and doesn't take branch penalties. This is used if there is no follow-up computation; you merely want to reload the value if it changed.

    As with recovery from speculative loads, we can inline some of the mainline code into the recovery code so that we don't have to pad out the mainline code to get recovered to sit on a bundle boundary. I didn't bother doing it here; you can do it as an exercise.

    The nice thing about processor support for pointer aliasing detection is that it can be done across functions, something that cannot easily be done statically. Consider this function:

    void accumulateTenTimes(int32_t (*something)(int32_t), int32_t *victim)
    {
     int32_t total = 0;
     for (int32_t i = 0; i < 10; i++) {
      total += something(*victim);
     }
     *victim = total;
    }
    
    int32_t negate(int32_t a) { return -a; }
    
    int32_t value = 2;
    accumulateTenTimes(negate, &value);
    // result: value = -2 + -2 + -2 + ... + -2 = -20
    
    int32_t value2 = 2;
    int32_t sneaky_negate(int32_t a) { value2 /= 2; return -a; }
    accumulateTenTimes(sneaky_negate, &value2);
    // result: value2 = -2 + -1 + -0 + -0 + ... + -0 = -3
    

    When compiling the accumulateTenTimes function, the compiler has no way of knowing whether the something function will modify victim, so it must be conservative and assume that it might, just in case we are in the sneaky_negate case.
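A quick stand-alone restatement of the example (same names, repackaged so it compiles on its own) confirms the two results claimed in the comments:

```cpp
#include <cstdint>

// Shared state touched by sneaky_negate.
static int32_t value = 2;
static int32_t value2 = 2;

static int32_t negate(int32_t a) { return -a; }
static int32_t sneaky_negate(int32_t a) { value2 /= 2; return -a; }

static void accumulateTenTimes(int32_t (*something)(int32_t), int32_t* victim) {
    int32_t total = 0;
    for (int32_t i = 0; i < 10; i++) {
        total += something(*victim);   // *victim must be reloaded each time
    }
    *victim = total;
}
```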

    Let's assume that the compiler has done flow analysis and determined that the function pointer passed to accumulateTenTimes is always within the same module, so it doesn't need to deal with gp. Since function descriptors are immutable, it can also enregister the function address.

            // 2 input registers, 6 local registers, 1 output register
            alloc   r34 = ar.pfs, 2, 6, 1, 0
            mov     r35 = rp                // save return address
            mov     r36 = ar.lc             // save loop counter
            or      r37 = r0, r0            // total = 0
            ld8     r38 = [r32]             // get the function address
            or      r31 = 09h, r0 ;;        // r31 = 9
            mov     ar.lc = r31             // loop nine more times (ten total)
    again:
            ld4     r39 = [r33]             // load *victim for output
            mov     b6 = r38                // move to branch register
            br.call.dptk.many rp = b6 ;;    // call function in b6
            addl    r37 = ret0, r37         // accumulate total
            br.cloop.sptk.few again ;;      // loop 9 more times
    
            st4     [r33] = r37             // save the total
    
            mov     ar.lc = r36             // restore loop counter
            mov     rp = r35                // restore return address
            mov     ar.pfs = r34            // restore stack frame
            br.ret.sptk.many rp             // return
    

    Note that at each iteration, we read *victim from memory because we aren't sure whether the something function modifies it. But with advanced loads, we can remove the memory access from the loop.

            // 2 input registers, 7 local registers, 1 output register
            alloc   r34 = ar.pfs, 2, 7, 1, 0
            mov     r35 = rp                // save return address
            mov     r36 = ar.lc             // save loop counter
            or      r37 = r0, r0            // total = 0
            ld8     r38 = [r32]             // get the function address
            or      r31 = 09h, r0 ;;        // r31 = 9
            mov     ar.lc = r31             // loop nine more times (ten total)
            ld4.a   r39 = [r33]             // get the value of *victim
    again:
            ld4.c.nc r39 = [r33]            // reload *victim if necessary
            or      r40 = r39, r0           // set *victim as the output parameter
            mov     b6 = r38                // move to branch register
            br.call.dptk.many rp = b6 ;;    // call function in b6
            addl    r37 = ret0, r37         // accumulate total
            br.cloop.sptk.few again ;;      // loop 9 more times
    
            invala.e r39                    // stop tracking r39
    
            st4     [r33] = r37             // save the total
    
            mov     ar.lc = r36             // restore loop counter
            mov     rp = r35                // restore return address
            mov     ar.pfs = r34            // restore stack frame
            br.ret.sptk.many rp             // return
    

    We perform an advanced load of *victim in the hope that the callback function will not modify it. This is true if the callback function is negate, but it will trigger reloads if the accumulator function is sneaky_negate.

    Note here that we use the .nc completer on the ld.c instruction. This stands for no clear and tells the processor to keep tracking the address because we will be checking it again. When the loop is over, we use invala.e to tell the processor, "Okay, you can stop tracking it now." This also shows how handy the ld.c instruction is: we can do the reload inline rather than writing separate recovery code and jumping out and back.

    (Processor trivia: We do not need a stop after the ld4.c.nc. You are allowed to consume the result of a check load in the same instruction group.)

    In the case where the callback function does not modify *victim, the only memory accesses performed by this function and the callback are loading the function address, loading the initial value from *victim, and storing the final value to *victim. The loop body itself runs without any memory access at all!

    Going back to our original function, I noted that we could also add speculation to the mix. So let's do that. We're going to speculate an advanced load!

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = value
    
            ld4.sa  r31 = [r32]             // speculatively preload m_value (at offset 0)
            addl    r30 = 08h, r32          // calculate &m_errno
            addl    r29 = 04h, r32 ;;       // calculate &m_readCount
    
            ld4.sa  r28 = [r29]             // speculatively preload m_readCount
            ld4     ret0 = [r30] ;;         // load m_errno
    
            cmp4.eq p6, p7 = ret0, r0       // p6 = m_errno == 0, p7 = !p6
    
    (p7)    invala.e r31                    // abandon the advanced load
    (p7)    invala.e r28                    // abandon the advanced load
    (p7)    br.ret.sptk.many rp             // return m_errno if there was an error
    
            ld4.c.clr r31 = [r32]           // validate speculation and advanced load of m_value
            st4     [r33] = r31             // store m_value to *value
    
            ld4.c.clr r28 = [r29]           // validate speculation and advanced load of m_readCount
            addl    r28 = 01h, r28 ;;       // calculate m_readCount + 1
            st4     [r29] = r28             // store updated m_readCount
    
            br.ret.sptk.many rp             // return
    

    To validate a speculative advanced load, you just need to do a ld.c. If the speculation failed, then the advanced load also fails, so all we need to do is check the advanced load, and the reload will raise the exception.

    The dependency chain for this function is even shorter now that we were able to speculate the case where there is no error. (Since you are allowed to consume an ld4.c in the same instruction group, I combined the ld4.c and its consumption in a single box since they occur within the same cycle.)

    ld4.sa r31 = [r32] addl r30 = 08h, r32 addl r29 = 04h, r32
    ld4 ret0 = [r30] ld4.sa r28 = [r29]
    cmp4.eq p6, p7 = ret0, r0
    ld4.c st4 [r33] = r31
    invala.e r31 invala.e r28 br.ret rp
    ld4.c addl r28 = 01h, r28
    st4 [r29] = r28
    br.ret.sptk.many rp

    Aw, look at that pretty diagram. Control speculation and data speculation allowed us to run three different operations in parallel even though they might have dependencies on each other. The idea here is that if profiling suggests that the dependencies are rarely realized (pointers are usually not aliased), you can use speculation to run the operations as if they had no dependencies, and then use the check instructions to convert the speculated results to real ones.

    ¹ Note the absence of a stop between the cmp4 and the br.ret. That's because of a special Itanium rule that says that a conditional branch is permitted to use a predicate register calculated earlier within the same instruction group. (Normally, instructions within an instruction group are not allowed to have dependencies among each other.) This allows a test and jump to occur within the same cycle.

  • The Old New Thing

    The Itanium processor, part 7: Speculative loads

    • 15 Comments

    Today we'll look at speculative loads, which is when you load a value before you even know whether the value should be loaded. (A related but distinct concept is advanced loading, which we'll look at next time.)

    Consider the following code:

    int32_t SomeClass::calculateSomething(int32_t *a, int32_t *b)
    {
     int32_t result;
     if (m_p > m_q + 1) {
      result = a[*b];
     } else {
      result = defaultValue;
     }
     return result;
    }
    

    The naïve way to compile this function would go something like this:

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = a, r34 = b
    
            // calculate complex condition, putting the result in p6
            // and the opposite of the result in p7.
    
            addl    r31 = 08h, r32          // r31 = &this->m_p
            addl    r30 = 10h, r32 ;;       // r30 = &this->m_q
            ld8     r31 = [r31]             // r31 = this->m_p
            ld8     r30 = [r30] ;;          // r30 = this->m_q
            addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
            cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
    
            // Now take action based on the result.
    
    (p6)    ld4     r29 = [r34] ;;          // if true: load *b
    (p6)    shladd  r29 = r29, 2, r33 ;;    // if true: calculate &a[*b]
    (p6)    ld4     ret0 = [r29]            // if true: load a[*b]
    (p7)    or      ret0 = 2ah, r0          // if false: return default value
            br.ret.sptk.many rp             // return
    

    First we decide whether the condition is true or not. Once we know whether the branch is taken, we execute the appropriate branch. If the condition is true, then we load *b, then use it to index the array a. If the condition is false, then we just set the result to the default value.

    Now, in reality, the encoded instructions aren't that neat due to template restrictions. For example, we cannot put the first three instructions into a bundle because they consist of two arithmetic instructions, a stop, and a memory instruction, but there is no II|M template. The actual encoding would be more like this:

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = a, r34 = b
    
            // calculate complex condition, putting the result in p6
            // and the opposite of the result in p7.
    
    {      // MII| template
            nop.m
            addl    r31 = 08h, r32          // r31 = &this->m_p
            addl    r30 = 10h, r32 ;;       // r30 = &this->m_q
    }
    {       // MM|I| template (M port can also execute integer operations)
            ld8     r31 = [r31]             // r31 = this->m_p
            ld8     r30 = [r30] ;;          // r30 = this->m_q
            addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
    }
    {       // M|MI| template
            cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
    
            // Now take action based on the result.
    
    (p6)    ld4     r29 = [r34]             // if true: load *b
            nop.i ;;                        // move the stop here
    }
    {       //  M|MI template
    (p6)    shladd  r29 = r29, 2, r33 ;;    // if true: calculate &a[*b]
    (p6)    ld4     ret0 = [r29]            // if true: load a[*b]
    (p7)    or      ret0 = 2ah, r0          // if false: return default value
    }
    {       // BBB template
            br.ret.sptk.many rp             // return
            nop.b
            nop.b
    }
    

    Anyway, let's go back to the original version. What you might notice is that there is a long dependency chain:

    addl r31 = 08h, r32 addl r30 = 10h, r32
    ld8 r31 = [r31] ld8 r30 = [r30]
    addl r30 = 08h, r30
    cmp.gt p6, p7 = r30, r31
    (p7) or ret0 = 2Ah, r0 (p6) ld4 r29 = [r34]
    (p6) shladd r29 = r29, 2, r33
    (p6) ld4 ret0 = [r29]
    br.ret.sptk.many rp

    It would be great if we could parallelize more of this computation. For example, we could precalculate the true branch:

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = a, r34 = b
    
            ld4     r29 = [r34]             // assume true: load *b
            addl    r31 = 08h, r32
            addl    r30 = 10h, r32 ;;
            shladd  r29 = r29, 2, r33       // assume true: calculate &a[*b]
            ld8     r31 = [r31]
            ld8     r30 = [r30] ;;
            ld4     r29 = [r29]             // assume true: load a[*b]
            addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
            cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
    
            // Now take action based on the result.
    
    (p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
    (p7)    or      ret0 = 2ah, r0          // if false: return the default value
            br.ret.sptk.many rp             // return
    

    After applying template restrictions, the code looks more like this:

    {       // MII| template
            ld4     r29 = [r34]             // assume true: load *b
            addl    r31 = 08h, r32
            addl    r30 = 10h, r32 ;;
    }
    {       // MMI| template
            ld8     r31 = [r31]
            ld8     r30 = [r30]
            shladd  r29 = r29, 2, r33 ;;    // assume true: calculate &a[*b]
    }
    {       // MI|I| template
            ld4     r29 = [r29]             // assume true: load a[*b]
            addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
            cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
    }
    {       // MIB template (recalling that the M port can also execute integer instructions)
    (p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
    (p7)    or      ret0 = 2ah, r0          // if false: return the default value
            br.ret.sptk.many rp             // return
    }
    

    Note that we managed to shrink the code because we were able to use the speculative instructions to "fill in the holes" left by the template-mandated nop instructions in the original. We managed to squeeze out all the nops!

    Okay, enough with the template-restriction digressions.

    This alternate version assumes that the complex condition is true and speculatively evaluates the true branch. If the test turns out to be true, then it just uses the precalculated result. If it turns out to be false, then the precalculated result is thrown away.

    In other words, we rewrote the code like this:

    int32_t SomeClass::calculateSomething(int32_t *a, int32_t *b)
    {
     int32_t speculativeResult = a[*b];
     int32_t result;
     if (m_p > m_q + 1) {
      result = speculativeResult;
     } else {
      result = defaultValue;
     }
     return result;
    }
    

    The dependency chain is now much shorter.

    addl r31 = 08h, r32 addl r30 = 10h, r32 ld4 r29 = [r34]
    ld8 r31 = [r31] ld8 r30 = [r30] shladd r29 = r29, 2, r33
    addl r30 = 08h, r30 ld4 r29 = [r29]
    cmp.gt p6, p7 = r30, r31
    (p7) or ret0 = 2ah, r0 (p6) or ret0 = r0, r29
    br.ret.sptk.many rp

    This rewrite is generally beneficial if profiling feedback suggests that the conditional is normally true, since it hides the memory access latency inside the calculation of the conditional.

    Until somebody does this:

        // Get the default value. We know that m_p == m_q.
        p->calculateSomething(nullptr, nullptr);
    

    In this case, the speculative execution will take an access violation because it tries to dereference the null pointer as part of calculating the speculative result.

    Well, that sucks.

    To solve this problem, the Itanium lets you explicitly tag memory read operations as speculative. If you try to load a value speculatively, the instruction will read the value if possible, but if doing so would normally raise an exception, the processor says, "Sorry, I couldn't read it. But since this is part of speculative execution, I shouldn't raise an exception. Instead, I will set the NaT bit on the value so that you will know that the speculative load failed."

    The NaT bit (short for Not a Thing) is a 65th bit associated with each 64-bit general-purpose integer register that says whether the register holds a valid value. (Floating point registers do not have a NaT bit; instead there is a special value called NaTVal which serves the same purpose.) Arithmetic operations on NaT simply result in another NaT, but if you try to do something interesting with a NaT, you incur a STATUS_REG_NAT_CONSUMPTION exception.
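As a purely illustrative software model (not how the hardware is actually programmed), you can think of a speculative load as producing an "empty" value instead of faulting, with emptiness propagating through arithmetic until a check point recomputes the value for real:

```cpp
#include <cstdint>
#include <optional>

// Illustrative model only: nullopt plays the role of NaT.
using SpecVal = std::optional<int64_t>;

// Like ld.s: read if possible; on a would-be fault, defer by returning "NaT".
SpecVal spec_load(const int64_t* p) {
    if (p == nullptr) return std::nullopt;
    return *p;
}

// Arithmetic on NaT yields NaT: the deferral is sticky.
SpecVal spec_add(SpecVal a, int64_t b) {
    if (!a) return std::nullopt;
    return *a + b;
}

// Like chk.s: if the speculated value is NaT, re-run the computation
// non-speculatively (modeled here as a recovery callback).
int64_t chk_s(SpecVal v, int64_t (*recover)()) {
    return v ? *v : recover();
}
```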

    So let's take advantage of explicit speculative execution:

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = a, r34 = b
    
            ld4.s   r29 = [r34]             // speculatively load *b
            addl    r31 = 08h, r32
            addl    r30 = 10h, r32 ;;
            ld8     r31 = [r31]
            ld8     r30 = [r30]
            shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
            ld4.s   r29 = [r29]             // speculatively load a[*b]
            addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
            cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
    
            // Now take action based on the result.
    
    (p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
    (p7)    or      ret0 = 2ah, r0          // if false: return the default value
            br.ret.sptk.many rp             // return
    

    We changed the two load operations from ld4 to ld4.s. The trailing .s means that the load is being performed speculatively, and that if an error occurs or if the address is itself NaT, then set the result to NaT rather than raising an exception.

    Okay, so this prevents the exception from being raised during the speculative execution, but what if an exception really occurred? How do we turn the NaT back into the original exception? As written, we return NaT back to our caller, which is definitely not what the caller expects!

    You might put on your language lawyer hat at this point and say that dereferencing a null pointer invokes undefined behavior, so returning NaT is standard-conforming (because undefined behavior allows anything to be standard-conforming). That's true, if the exception was due to an access violation. But the exception might have been a page not present exception because the memory was paged out. In that case, we really do want to raise the exception so that the kernel can handle it by paging the memory back in, and then we want to read the value and resume our calculations. The caller definitely does not expect that passing valid parameters will result in a NaT just because the memory happens to be paged out.

    What we need to do is convert the deferred exception back into the original exception so that it can be raised as if no speculation had occurred. The instruction that lets us know that an exception got converted to a NaT is chk.s. This means "Check if the register contains NaT. If so, then jump to recovery code." The recovery code re-executes the instructions non-speculatively so that all the exceptions can be raised in the standard way, and any exception handlers can do their work in the standard way. Since NaT infects future computations, we don't need to check every speculative step; we need only check the final speculated result.

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = a, r34 = b
    
            ld4.s   r29 = [r34]             // speculatively load *b
            addl    r31 = 08h, r32
            addl    r30 = 10h, r32 ;;
            ld8     r31 = [r31]
            ld8     r30 = [r30]
            shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
            ld4.s   r29 = [r29]             // speculatively load a[*b]
            addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
            cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
    
            // Now take action based on the result.
    
    (p6)    chk.s   r29, recovery           // if true: recover r29 if not valid
    recovered:
    (p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
    (p7)    or      ret0 = 2ah, r0          // if false: return the default value
            br.ret.sptk.many rp             // return
    
    recovery:
            ld4     r29 = [r34] ;;          // load *b
            shladd  r29 = r29, 2, r33 ;;    // calculate &a[*b]
            ld4     r29 = [r29]             // load a[*b]
            br      recovered               // resume with recovered value
    

    The chk.s instruction checks the specified register to see if the NaT bit is set. If not, the instruction allows execution to continue normally. But if the register is invalid, then control transfers to the specified label. Our recovery code re-executes the instructions that led to the invalid value, but this time we execute them non-speculatively so that exceptions can be raised and handled. Once the value has been recovered, we jump back to the instruction after the chk.s so that normal execution can resume.
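    As a rough illustration (not real ia64 code, and all names invented for the sketch), the deferred-exception pattern can be simulated in C++: a "speculative load" that turns a fault into an empty std::optional standing in for NaT, a later check playing the role of chk.s, and recovery code that re-runs the loads non-speculatively. Here only a null pointer counts as a fault; a real processor defers any exception the same way.

```cpp
#include <cassert>
#include <optional>

// Speculative load: any fault is deferred by producing "NaT"
// (modeled here as std::nullopt) instead of raising an exception.
std::optional<int> ld4_s(const int *p)
{
    if (p == nullptr) return std::nullopt; // exception deferred as NaT
    return *p;
}

int f(int *a, int *b, bool take_branch)
{
    // Speculate a[*b] before we know whether we need it.
    std::optional<int> r29 = ld4_s(b);
    if (r29) r29 = ld4_s(a + *r29);     // NaT infects further uses

    if (take_branch) {
        if (!r29) {                     // chk.s: value invalid, run recovery
            return a[*b];               // non-speculative reload; a real
                                        // fault would be raised here
        }
        return *r29;                    // speculated value was valid
    }
    return 42;                          // default value
}
```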

    In this case, we can make an additional optimization. Since the only things happening after recovery are copying r29 to ret0 and returning, we can inline those two instructions and then perform a peephole optimization that combines ld4 r29 = [r29] and or ret0 = r0, r29 into a single ld4 ret0 = [r29].

    Note that optimizing the recovery code is not really that important from an execution-speed standpoint, since the recovery code runs only if an exception occurred, and the cost of raising and handling the exception will drown out any cycle-squeezing effects. The real benefit of optimizing the recovery code is avoiding the jump back into the mainline code, because that allows the mainline code to be more compact: Recall that all jump targets must be the start of a bundle. If we had the recovery code jump back to the mainline code, we would have to insert some nops so that the recovered label landed at the start of a bundle. (In practice, what the compiler will do is repeat the trailing instructions from the bundle containing the chk.s and then jump to the start of the next bundle.)

    The final compiled function now looks like this:

            // we are a leaf function, so no need to use "alloc" or to save rp.
            // on entry: r32 = this, r33 = a, r34 = b
    
            ld4.s   r29 = [r34]             // speculatively load *b
            addl    r31 = 08h, r32
            addl    r30 = 10h, r32 ;;
            ld8     r31 = [r31]
            ld8     r30 = [r30]
            shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
            ld4.s   r29 = [r29]             // speculatively load a[*b]
            addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
            cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
    
            // Now take action based on the result.
    
    (p6)    chk.s   r29, recovery           // if true: recover r29 if not valid
    (p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
    (p7)    or      ret0 = 2ah, r0          // if false: return the default value
            br.ret.sptk.many rp             // return
    
    recovery:
            ld4     r29 = [r34] ;;          // load *b
            shladd  r29 = r29, 2, r33 ;;    // calculate &a[*b]
            ld4     ret0 = [r29]            // load a[*b] as return value
            br.ret.sptk.many rp             // return
    

    Next time, we'll look at advanced loading, which is a different type of speculative execution.
