• #### The Itanium processor, part 7: Speculative loads

Today we'll look at speculative loads, which is when you load a value before you even know whether the value should be loaded. (A related but distinct concept is advanced loading, which we'll look at next time.)

Consider the following code:

```
int32_t SomeClass::calculateSomething(int32_t *a, int32_t *b)
{
    int32_t result;
    if (m_p > m_q + 1) {
        result = a[*b];
    } else {
        result = defaultValue;
    }
    return result;
}
```

The naïve way to compile this function would go something like this:

```
        // we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = a, r34 = b

// calculate complex condition, putting the result in p6
// and the opposite of the result in p7.

addl    r31 = 08h, r32          // r31 = &this->m_p
addl    r30 = 10h, r32 ;;       // r30 = &this->m_q
ld8     r31 = [r31]             // r31 = this->m_p
ld8     r30 = [r30] ;;          // r30 = this->m_q
addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

// Now take action based on the result.

(p6)    ld4     r29 = [r34] ;;          // if true: load *b
(p6)    shladd  r29 = r29, 2, r33 ;;    // if true: calculate &a[*b]
(p6)    ld4     ret0 = [r29]            // if true: load a[*b]
(p7)    or      ret0 = 2ah, r0          // if false: return default value
br.ret.sptk.many rp             // return
```

First we decide whether the condition is true or not. Once we know whether the branch is taken, we execute the appropriate branch. If the condition is true, then we load `*b`, then use it to index the array `a`. If the condition is false, then we just set the result to the default value.

Now, in reality, the encoded instructions aren't that neat due to template restrictions. For example, we cannot put the first three instructions into a bundle because they consist of two arithmetic instructions, a stop, and a memory instruction, but there is no II|M template. The actual encoding would be more like this:

```
        // we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = a, r34 = b

// calculate complex condition, putting the result in p6
// and the opposite of the result in p7.

{      // MII| template
nop.m
addl    r31 = 08h, r32          // r31 = &this->m_p
addl    r30 = 10h, r32 ;;       // r30 = &this->m_q
}
{       // MM|I| template (M port can also execute integer operations)
ld8     r31 = [r31]             // r31 = this->m_p
ld8     r30 = [r30] ;;          // r30 = this->m_q
addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
}
{       // M|MI| template
cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

// Now take action based on the result.

(p6)    ld4     r29 = [r34]             // if true: load *b
nop.i ;;                        // move the stop here
}
{       //  M|MI template
(p6)    shladd  r29 = r29, 2, r33 ;;    // if true: calculate &a[*b]
(p6)    ld4     ret0 = [r29]            // if true: load a[*b]
(p7)    or      ret0 = 2ah, r0          // if false: return default value
}
{       // BBB template
br.ret.sptk.many rp             // return
nop.b
nop.b
}
```

Anyway, let's go back to the original version. What you might notice is that there is a long dependency chain:

```
 addl r31 = 08h, r32        addl r30 = 10h, r32
          ↓                          ↓
   ld8 r31 = [r31]            ld8 r30 = [r30]
          ↓                          ↓
          ↓                 addl r30 = 08h, r30
          ↓                          ↓
      cmp.gt p6, p7 = r30, r31
          ↓                ↘
 (p7) or ret0 = 2Ah, r0    (p6) ld4 r29 = [r34]
          ↓                          ↓
          ↓            (p6) shladd r29 = r29, 2, r33
          ↓                          ↓
          ↓                (p6) ld4 ret0 = [r29]
          ↓                     ↙
      br.ret.sptk.many rp
```

It would be great if we could parallelize more of this computation. For example, we could precalculate the true branch:

```
        // we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = a, r34 = b

ld4     r29 = [r34]             // assume true: load *b
addl    r31 = 08h, r32
addl    r30 = 10h, r32 ;;
shladd  r29 = r29, 2, r33       // assume true: calculate &a[*b]
ld8     r31 = [r31]
ld8     r30 = [r30] ;;
ld4     r29 = [r29]             // assume true: load a[*b]
addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

// Now take action based on the result.

(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
br.ret.sptk.many rp             // return
```

After applying template restrictions, the code looks more like this:

```
{       // MII| template
ld4     r29 = [r34]             // assume true: load *b
addl    r31 = 08h, r32
addl    r30 = 10h, r32 ;;
}
{       // MMI| template
ld8     r31 = [r31]
ld8     r30 = [r30]
shladd  r29 = r29, 2, r33 ;;    // assume true: calculate &a[*b]
}
{       // MI|I| template
ld4     r29 = [r29]             // assume true: load a[*b]
addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
}
{       // MIB template (recalling that the M port can also execute integer instructions)
(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
br.ret.sptk.many rp             // return
}
```

Note that we managed to shrink the code because we were able to use the speculative instructions to "fill in the holes" left by the template-mandated `nop` instructions in the original. We managed to squeeze out all the `nop`s!

Okay, enough of the template restriction digression.

This alternate version assumes that the complex condition is true and speculatively evaluates the true branch. If the test turns out to be true, then it just uses the precalculated result. If it turns out to be false, then the precalculated result is thrown away.

In other words, we rewrote the code like this:

```
int32_t SomeClass::calculateSomething(int32_t *a, int32_t *b)
{
    int32_t speculativeResult = a[*b];
    int32_t result;
    if (m_p > m_q + 1) {
        result = speculativeResult;
    } else {
        result = defaultValue;
    }
    return result;
}
```

The dependency chain is now much shorter.

```
 addl r31 = 08h, r32    addl r30 = 10h, r32    ld4 r29 = [r34]
          ↓                      ↓                     ↓
   ld8 r31 = [r31]        ld8 r30 = [r30]    shladd r29 = r29, 2, r33
          ↓                      ↓                     ↓
          ↓             addl r30 = 08h, r30     ld4 r29 = [r29]
          ↓                      ↓                     ↓
      cmp.gt p6, p7 = r30, r31                        ↓
          ↓                ↘                         ↙
 (p7) or ret0 = 2Ah, r0    (p6) or ret0 = r0, r29
          ↓                   ↙
      br.ret.sptk.many rp
```

This rewrite is generally beneficial if profiling feedback suggests that the conditional is normally true, since it hides the memory access latency inside the calculation of the conditional.

Until somebody does this:

```
    // Get the default value. We know that m_p == m_q.
p->calculateSomething(nullptr, nullptr);
```

In this case, the speculative execution will take an access violation because it tries to dereference the null pointer as part of calculating the speculative result.

Well, that sucks.

To solve this problem, the Itanium lets you explicitly tag memory read operations as speculative. If you try to load a value speculatively, the instruction will read the value if possible, but if doing so would normally raise an exception, the processor says, "Sorry, I couldn't read it. But since this is part of speculative execution, I shouldn't raise an exception. Instead, I will set the NaT bit on the value so that you will know that the speculative load failed."

The NaT bit (short for Not a Thing) is a 65th bit associated with each 64-bit general-purpose integer register that says whether the register holds a valid value. (Floating point registers do not have a NaT bit; instead there is a special value called NaTVal which serves the same purpose.) Arithmetic operations on NaT simply result in another NaT, but if you try to do something interesting with a NaT, you incur a `STATUS_REG_NAT_CONSUMPTION` exception.
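The propagation rule can be illustrated with a small C++ model. This is purely illustrative: `SpecValue` and `consume` are invented names, and the real NaT bit is a hardware feature attached to each register, not a struct member.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a general-purpose register with its NaT bit.
struct SpecValue {
    int64_t value; // the 64 value bits
    bool nat;      // the 65th "Not a Thing" bit
};

// Arithmetic on a NaT input simply produces another NaT.
SpecValue operator+(SpecValue a, SpecValue b) {
    if (a.nat || b.nat) return {0, true};
    return {a.value + b.value, false};
}

// "Doing something interesting" with a NaT is an error; the real
// processor raises STATUS_REG_NAT_CONSUMPTION here.
int64_t consume(SpecValue v) {
    assert(!v.nat && "STATUS_REG_NAT_CONSUMPTION");
    return v.value;
}
```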

So let's take advantage of explicit speculative execution:

```
        // we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = a, r34 = b

ld4.s   r29 = [r34]             // speculatively load *b
addl    r31 = 08h, r32
addl    r30 = 10h, r32 ;;
ld8     r31 = [r31]
ld8     r30 = [r30]
shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
ld4.s   r29 = [r29]             // speculatively load a[*b]
addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

// Now take action based on the result.

(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
br.ret.sptk.many rp             // return
```

We changed the two load operations from `ld4` to `ld4.s`. The trailing `.s` means that the load is being performed speculatively, and that if an error occurs or if the address is itself NaT, then set the result to NaT rather than raising an exception.

Okay, so this prevents the exception from being raised during the speculative execution, but what if an exception really occurred? How do we turn the NaT back into the original exception? As written, we return NaT back to our caller, which is definitely not what the caller expects!

You might put on your language lawyer hat at this point and say that dereferencing a null pointer invokes undefined behavior, so returning NaT is standard-conforming (because undefined behavior allows anything to be standard-conforming). That's true, if the exception was due to an access violation. But the exception might have been a page not present exception because the memory was paged out. In that case, we really do want to raise the exception so that the kernel can handle it by paging the memory back in, and then we want to read the value and resume our calculations. The caller definitely does not expect that passing valid parameters will result in a NaT just because the memory happens to be paged out.

What we need to do is convert the deferred exception back into the original exception so that it can be raised as if no speculation had occurred. The instruction that lets us know that an exception got converted to a NaT is `chk.s`. This means "Check if the register contains NaT. If so, then jump to recovery code." The recovery code re-executes the instructions non-speculatively so that all the exceptions can be raised in the standard way, and any exception handlers can do their work in the standard way. Since NaT infects future computations, we don't need to check every speculative step; we need only check the final speculated result.

```
        // we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = a, r34 = b

ld4.s   r29 = [r34]             // speculatively load *b
addl    r31 = 08h, r32
addl    r30 = 10h, r32 ;;
ld8     r31 = [r31]
ld8     r30 = [r30]
shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
ld4.s   r29 = [r29]             // speculatively load a[*b]
addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

// Now take action based on the result.

(p6)    chk.s   r29, recovery           // if true: recover r29 if not valid
recovered:
(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
br.ret.sptk.many rp             // return

recovery:
ld4     r29 = [r34] ;;          // load *b
shladd  r29 = r29, 2, r33 ;;    // calculate &a[*b]
ld4     r29 = [r29]             // load a[*b]
br      recovered               // resume with recovered value
```

The `chk.s` instruction checks the specified register to see if the NaT bit is set. If not, the instruction allows execution to continue normally. But if the register is invalid, then control transfers to the specified label. Our recovery code re-executes the instructions that led to the invalid value, but this time we execute them non-speculatively so that exceptions can be raised and handled. Once the value has been recovered, we jump back to the instruction after the `chk.s` so that normal execution can resume.
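Here is a rough C++ analogue of the whole `ld4.s` / `chk.s` pattern. This is a sketch with invented names: `speculative_load` stands in for a speculative load that defers its exception, and the `if (elem.nat)` branch plays the role of the recovery code, which re-executes the loads non-speculatively so that any real exception can occur.

```cpp
#include <cstdint>

struct SpecLoad {
    int32_t value;
    bool nat; // true if the speculative load "failed"
};

// Stand-in for ld4.s: returns NaT instead of faulting on a bad address.
static SpecLoad speculative_load(const int32_t* p) {
    if (p == nullptr) return {0, true}; // would have faulted; defer it
    return {*p, false};
}

static int32_t calculateSomething(bool condition, const int32_t* a,
                                  const int32_t* b, int32_t defaultValue) {
    // Speculate: perform the loads before we know whether we need them.
    SpecLoad idx = speculative_load(b);
    SpecLoad elem = idx.nat ? SpecLoad{0, true} // NaT infects the next load
                            : speculative_load(a + idx.value);
    if (condition) {
        // chk.s: if the speculation failed, re-execute non-speculatively,
        // letting any real exception (page fault, access violation) occur.
        if (elem.nat) return a[*b]; // recovery path
        return elem.value;          // use the precalculated value
    }
    return defaultValue;            // speculative result thrown away
}
```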

In this case, we can make an additional optimization. Since the only things happening after recovery are copying `r29` to `ret0` and returning, we can inline those two instructions then perform peephole optimization to combine the `ld4 r29 = [r29]` and `or ret0 = r0, r29` into `ld4 ret0 = [r29]`.

Note that optimizing the recovery code is not really that important from an execution speed standpoint, since the recovery code runs only if an exception occurred, and the cost of raising and handling the exception will drown out any cycle squeezing effects. The real benefit of optimizing the recovery code is to avoid the jump back into the mainline code, because that allows the mainline code to be more compact: Recall that all jump targets must be the start of a bundle. If we had the recovery code jump back to the mainline code, we would have to insert some `nop`s so that the `recovered` label is at the start of a bundle. (In practice, what the compiler will do is repeat the trailing instructions in the bundle containing the `chk.s` then jump to the start of the next bundle.)

The final compiled function now looks like this:

```
        // we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = a, r34 = b

ld4.s   r29 = [r34]             // speculatively load *b
addl    r31 = 08h, r32
addl    r30 = 10h, r32 ;;
ld8     r31 = [r31]
ld8     r30 = [r30]
shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
ld4.s   r29 = [r29]             // speculatively load a[*b]
addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

// Now take action based on the result.

(p6)    chk.s   r29, recovery           // if true: recover r29 if not valid
(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
br.ret.sptk.many rp             // return

recovery:
ld4     r29 = [r34] ;;          // load *b
shladd  r29 = r29, 2, r33 ;;    // calculate &a[*b]
ld4     ret0 = [r29]            // load a[*b] as return value
br.ret.sptk.many rp             // return
```

Next time, we'll look at advanced loading, which is a different type of speculative execution.

• #### The Itanium processor, part 6: Calculating conditionals

The Itanium does not have a flags register. A flags register creates implicit dependencies between instructions, which runs contrary to the highly parallel model the Itanium was designed for. Instead of implicitly setting a register after computations, the Itanium has explicit comparison operations that put the comparison result into dedicated predicate registers.

Here's a simple fragment that performs some operation if two registers are equal.

```
        cmp.eq p6, p7 = r32, r33 ;;
(p6)    something
```

The cmp instruction compares two values and sets the two specified predicate registers as follows:

• p6 is true if the values satisfy the condition, or false if they do not satisfy the condition.
• p7 is set to the opposite of p6

The comparison operation generates two results, one which holds the nominal result and one which holds the opposite. This lets you conditionalize both sides of a branch.

```
        cmp.eq p6, p7 = r32, r33 ;;
(p6)    something // executes if they are equal
(p7)    something // executes if they are not equal
```

There is also a cmp4 instruction which compares two 32-bit values, in which case only the least-significant 32 bits participate in the comparison.

The comparands can be either two registers or an immediate and a register. The immediate is an 8-bit sign-extended value, though the final value may be interpreted as unsigned depending on the comparison type.

There are three comparison types:

| type | meaning |
| --- | --- |
| `eq` | equality |
| `lt` | signed less than |
| `ltu` | unsigned less than |

The first destination predicate register receives the result of the test, and the second receives the opposite of the result.

These are the only comparisons you will see in disassembly, but the compiler can manufacture other types of comparisons. For example, if the compiler wants to perform a ge comparison, it can just do a lt comparison and flip the order of the two predicates.

More generally, the compiler can synthesize the other integer comparisons as follows:

| imaginary opcode | meaning | synthesized as |
| --- | --- | --- |
| `cmp.ne p, q = a, b` | not equal | `cmp.eq q, p = a, b` |
| `cmp.ge p, q = a, b` | signed greater than or equal | `cmp.lt q, p = a, b` |
| `cmp.gt p, q = a, b` | signed greater than | `cmp.lt p, q = b, a` if a is a register<br>`cmp.lt q, p = a − 1, b` if a is an immediate |
| `cmp.le p, q = a, b` | signed less than or equal | `cmp.lt q, p = b, a` if a is a register<br>`cmp.lt p, q = a − 1, b` if a is an immediate |
| `cmp.geu p, q = a, b` | unsigned greater than or equal | `cmp.ltu q, p = a, b` |
| `cmp.gtu p, q = a, b` | unsigned greater than | `cmp.ltu p, q = b, a` if a is a register<br>`cmp.ltu q, p = a − 1, b` if a is an immediate |
| `cmp.leu p, q = a, b` | unsigned less than or equal | `cmp.ltu q, p = b, a` if a is a register<br>`cmp.ltu p, q = a − 1, b` if a is an immediate |

These syntheses rely on the identities

- x > y ⇔ y < x
- x ≤ y ⇔ ¬(x > y)
- x ≤ y ⇔ x − 1 < y (for integers x and y, assuming no overflow)
- x ≥ y ⇔ y ≤ x
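As a sanity check, the synthesis rules can be modeled in C++. This is illustrative only; `Preds`, `cmp_lt`, and the two helpers are invented names standing in for the real instructions.

```cpp
#include <cassert>
#include <cstdint>

struct Preds { bool p, q; };

// The real instruction: p = (a < b), q = !(a < b).
static Preds cmp_lt(int64_t a, int64_t b) { return { a < b, !(a < b) }; }

// cmp.gt p, q = a, b where a is a register: swap the operands.
static Preds cmp_gt_reg(int64_t a, int64_t b) { return cmp_lt(b, a); }

// cmp.gt p, q = imm, b where imm is an immediate: the immediate must stay
// in the first position, so compare (imm - 1 < b) and swap the predicates,
// using a > b ⇔ ¬(a − 1 < b).
static Preds cmp_gt_imm(int64_t imm, int64_t b) {
    Preds r = cmp_lt(imm - 1, b);
    return { r.q, r.p };
}
```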

The next level of complexity is the parallel comparisons. These perform a comparison and combine the result with the values already in the destination predicates.

| opcode | meaning | really |
| --- | --- | --- |
| `cmp.xx.or p, q = a, b` | p = p \|\| (a xx b)<br>q = q \|\| (a xx b) | if (a xx b) then p = q = true |
| `cmp.xx.orcm p, q = a, b` | p = p \|\| ¬(a xx b)<br>q = q \|\| ¬(a xx b) | if ¬(a xx b) then p = q = true |
| `cmp.xx.and p, q = a, b` | p = p && (a xx b)<br>q = q && (a xx b) | if ¬(a xx b) then p = q = false |
| `cmp.xx.andcm p, q = a, b` | p = p && ¬(a xx b)<br>q = q && ¬(a xx b) | if (a xx b) then p = q = false |
| `cmp.xx.or.andcm p, q = a, b` | p = p \|\| (a xx b)<br>q = q && ¬(a xx b) | if (a xx b) then p = true, q = false |
| `cmp.xx.and.orcm p, q = a, b` | p = p && (a xx b)<br>q = q \|\| ¬(a xx b) | if ¬(a xx b) then p = false, q = true |

The meaning column describes how it is convenient to think of the operations, but the really column describes how they actually work.

The `orcm` and `andcm` versions take the complement of the comparison, which is handy because some of the synthesized comparisons involve taking the opposite of the specified result.

These parallel comparisons get their name because they are designed to have multiple copies executed in parallel. Consequently, they are an exception to the general rule that you can write to a register only once per instruction group. If all writes to a predicate register are AND-like (i.e., `and` or `andcm`) or all writes are OR-like (i.e., `or` or `orcm`), then the writes are allowed to coexist within a single instruction group. (This is where the really column comes in handy. You can see that all AND-like operations either do nothing or set the predicate to false, and that all OR-like operations either do nothing or set the predicate to true. That's why they can run in parallel: If multiple conditions pass, they all do the same thing, so it doesn't matter which one goes first.)
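The really column can be modeled directly in C++ (an illustrative sketch; `cmp_eq_or_andcm` is an invented helper mirroring `cmp.eq.or.andcm`): an OR-like/AND-like write either does nothing or forces its predicate one way, so several copies can execute in any order with the same result.

```cpp
#include <cassert>
#include <cstdint>

// cmp.eq.or.andcm p, q = a, b, per the "really" column:
// if (a == b) then p = true, q = false; otherwise touch nothing.
// Because the operation is order-independent, multiple copies may
// write the same predicates within one instruction group.
static void cmp_eq_or_andcm(bool& p, bool& q, int64_t a, int64_t b) {
    if (a == b) { p = true; q = false; }
}
```

Priming the predicates to p = false, q = true and then running the compares in any order yields the same answer, which is exactly why the hardware permits the parallel writes.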

Executing them in parallel lets you perform multiple tests in a single cycle. For example:

```
x = ... calculate x ...;
y = ... calculate y ...;
z = ... calculate z ...;
if (x == 0 || y == 0 || z == 0) {
    something_is_zero;
} else {
    all_are_nonzero;
}
```

could be compiled as

```
        ... calculate x in r29 ...
... calculate y in r30 ...
... calculate z in r31 ...
cmp.eq p6, p7 = +1, r0 ;; // set p6 = false, p7 = true

cmp.eq.or.andcm p6, p7 = r29, r0 // p6 = p6 || x == 0
// p7 = p7 && x != 0
cmp.eq.or.andcm p6, p7 = r30, r0 // p6 = p6 || y == 0
// p7 = p7 && y != 0
cmp.eq.or.andcm p6, p7 = r31, r0 ;; // p6 = p6 || z == 0
// p7 = p7 && z != 0

(p6)    something_is_zero
(p7)    all_are_nonzero
```

First, we calculate the values of x, y, and z. At the same time, we prime the parallel comparison: we compare the constant +1 against register r0, which is the hard-coded zero register. This comparison always fails, so we set p6 to false and p7 to true.

Now we perform the three comparisons in parallel. We check if r29, r30, and r31 are zero. If any of them is zero, then p6 becomes true and p7 becomes false. If all are nonzero, then nothing changes, so p6 stays false and p7 stays true.

Finally, we act on the calculated predicates.

Notice that the parallel comparison lets us calculate and combine all the parts of the test in a single cycle. In a flags-based architecture, we would have to perform a comparison, test the result, then perform another comparison, test the result, then perform the last comparison, and test the result one last time. That's a sequence of six dependent operations, which is difficult to parallelize. (And it most likely consumes three branch prediction slots instead of just one.)

The last wrinkle in the comparison instructions is the so-called unconditional comparison. This special instruction violates the rule that a predicated instruction has no effect if the predicate is false.

```
(qp)    cmp.xx.unc p, q = r, s
```

Even though there is a qualifying predicate, this comparison is executed unconditionally (as indicated by the `unc` suffix). The behavior of an unconditional comparison is

```
p = qp && (r xx s)
q = qp && ¬(r xx s)
```

In other words, if the qualifying predicate is true, then the instruction behaves as normal. But if the qualifying predicate is false, then the result of the comparison is considered false for all branches, regardless of the actual test.

This formulation is handy when you are nesting predicates. Consider:

```
x = ... calculate x ...;
y = ... calculate y ...;
if (x == 0) {
    x_is_zero;
} else {
    x_is_nonzero;
    if (y == 0) {
        x_is_nonzero_and_y_is_zero;
    } else {
        both_are_nonzero;
    }
}
```

This can be compiled like this:

```
        ... calculate x in r30 ...
... calculate y in r31 ...

cmp.eq p6, p7 = r30, r0 ;;
(p6)    x_is_zero
(p7)    x_is_nonzero
(p7)    cmp.eq.unc p8, p9 = r31, r0 ;;
(p8)    x_is_nonzero_and_y_is_zero
(p9)    both_are_nonzero
```

After calculating x and y, we check whether x is zero. If it is, then we execute x_is_zero. If not, then we execute x_is_nonzero. Next, we check whether y is zero, and we do so via an unconditional comparison. That way, if we are in the case that x is zero, then both p8 and p9 are set to false. Now we can use p8 and p9 to select between the final two branches. (Or if x is zero, neither gets selected.)
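In C++ terms, the unconditional comparison behaves like this (illustrative sketch; `cmp_eq_unc` is an invented name mirroring `cmp.eq.unc`):

```cpp
#include <cassert>
#include <cstdint>

// (qp) cmp.eq.unc p, q = r, s: executes even when qp is false,
// in which case both destination predicates are set to false.
static void cmp_eq_unc(bool qp, bool& p, bool& q, int64_t r, int64_t s) {
    p = qp && (r == s);
    q = qp && (r != s);
}
```

This is exactly the property the nested-predicate example relies on: when the outer predicate is false, neither inner branch gets selected.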

We'll see later that the unconditional comparison is also useful in register rotation.

So that's a quick tour of the Itanium conditional instructions. Next time, we'll start looking at speculation.

• #### The Itanium processor, part 5: The GP register, calling functions, and function pointers

We saw a brief mention of the gp register last time, where we saw it used when we calculated the address of a global variable.

The only addressing mode supported by the Itanium processor is register indirect (possibly with post-increment). There is no absolute addressing mode. If you want to access a global variable, you need to calculate its address, and the convention for this is that the gp register points to the module's global variables. If you want to access a global variable stored at offset n in the global data segment, you do it in two steps:

```
        addl    r30 = n, gp ;;    // r30 -> global variable
ld4     r30 = [r30]       // load 4 bytes from the global variable
```

The name gp stands for global pointer since it is the pointer used to access global variables. (Note that since immediates are signed, the range of values of n is −2MB to +2MB.)

Those of you familiar with the PowerPC will recognize this model, since it is very similar to the Table of Contents model, except that Itanium uses a single table of contents for the entire module, as opposed to the PowerPC which gives each function its own table of contents.

The Itanium addl instruction is limited to a 22-bit immediate, which provides a reach of 4MB. This means that the above pattern is viable only for 4MB of global variables. Since some modules have more than 4MB of global data, the compiler separates global data into two categories, large and small. Small data objects are stored directly in the global data segment, but large data objects are not. Instead, the large data object is placed outside the global data segment, and all that is placed in the global data segment is a pointer to the large object. This means that accessing a large object actually takes three instructions.

```
        addl    r30 = n, gp ;; // r30 -> global variable forwarder
ld8     r30 = [r30] ;; // r30 -> global variable
ld4     r30 = [r30]    // load 4 bytes from the global variable
```
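In C++ terms, the small/large split looks something like this (a sketch with invented names; the point is that the global data segment holds the small object itself but only a pointer to the large one):

```cpp
#include <cstdint>

static int32_t small_obj = 17;                  // lives in the gp-relative data segment
static int32_t large_obj_storage[1000000] = {}; // too big; placed outside the segment
static int32_t* large_obj = large_obj_storage;  // the segment holds just this pointer

static int32_t read_small() { return small_obj; }    // addl + ld4
static int32_t read_large() { return large_obj[0]; } // addl + ld8 + ld4
```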

We see that it is vitally important that the gp register be set properly. Otherwise, the code has no idea where its global variables are. The Itanium calling convention says that on entry to a function, the gp register must be set to that function's global pointer.

Okay, so if you're going to call a function, how do you know what global pointer it expects?

Since all functions in the same module share the same global variables, the answer is easy if you are calling a function within the same module: You don't need to do anything special with gp, since the caller's gp is the same as the callee's gp. You also don't need to perform an indirect call; you know where the target is and can use a direct `br.call OtherFunction`.¹

On the other hand, if you are calling a function through a function pointer, then the target of the call might belong to another module. How are you supposed to know what the target function wants gp to be?

The answer is that on Itanium, a function pointer is not the address of the first instruction. Rather, it is a pointer to a structure containing two pointers. The first pointer in the structure points to the first instruction of the target function. The second pointer is the target function's gp. Therefore, calling a function through a function pointer looks like this:

```
        // suppose the function pointer is in r30
ld8     r31 = [r30], 8 ;;       // get the function address
// then add 8 to r30
ld8     gp = [r30]              // get the function's gp
mov     b6 = r31                // move to branch register
br.call.dptk.many rp = b6 ;;    // call function in b6
or      gp = r41, r0            // gp = r41 OR 0 = r41
```

First, we load the address of the first instruction into the r31 register, using a post-increment addressing mode so that r30 after the instruction points to the callee's gp.

Next, we load the gp register with the target function's gp. Simultaneously, we move r31 to b6 so that we can use it as the target of the br.call. (Branch registers cannot be the target of a ld8 instruction, which is why we needed to use r31 as a middle-man.)

Now that gp is set up properly, we can call the function through the branch register.

After the call returns, the gp register is now whatever value is left over by the function we called. We need to set gp to the current function's global pointer, which for the sake of example we'll assume had been saved in the r41 register.
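The whole calling sequence can be sketched in C++ (invented names throughout; the real gp is a register, not a global variable, and the real descriptor is laid out by the linker):

```cpp
#include <cstdint>

static intptr_t gp; // stand-in for the gp register

// On Itanium, a "function pointer" points to this pair, not to code.
struct FunctionDescriptor {
    int32_t (*entry)(int32_t); // address of the first instruction
    intptr_t target_gp;        // the target function's global pointer
};

static int32_t call_indirect(const FunctionDescriptor* fd, int32_t arg) {
    intptr_t saved_gp = gp;          // our own gp (kept in r41 in the article)
    gp = fd->target_gp;              // ld8 gp = [r30]
    int32_t result = fd->entry(arg); // br.call through b6
    gp = saved_gp;                   // or gp = r41, r0: restore after the call
    return result;
}
```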

There's yet another wrinkle: The naïve imported function. In the case of an imported function not declared with the `dllimport` attribute, the compiler doesn't know that the function is imported. It acts as if the function is part of the current module. On x86, this is simulated by making a stub function which jumps to the real (imported) function. On Itanium, the same thing is done, with a stub function that looks like this:

```
.ImportedFunction:
addl    r30 = n, gp ;;      // r30 -> function descriptor
ld8     r31 = [r30], 8;;    // get the function address
// then add 8 to r30
ld8     gp = [r30]          // get the function's gp
mov     b6 = r31            // move to branch register
br.cond.sptk.many b6 ;;     // jump there
```

The stub function loads the gp register with the value expected by the imported function then jumps to the imported function. Unconditional computed jumps are encoded as conditional jumps where the qualifying predicate is p0, which is always true.

The possibility that any function is really a stub function for an imported function creates a problem for the compiler: Since any function could be an imported function in disguise, the compiler must assume that any function is potentially imported and therefore may result in the gp register being trashed. Therefore, the compiler needs to restore the gp register after any function call.

Now, the above pessimistic assumption can be relaxed if the compiler has other information available to it. For example, if the function being called is in the same translation unit, then the compiler can see by inspection that the target function is not a stub and therefore can elide the restoration of gp. Similarly, if link-time code generation is enabled, then the linker can see all the code in the module and see whether the target function is a stub or a real function.

Exercise: How does tail-call elimination affect this optimization?

Bonus reading: Programming for 64-bit Windows, which spends nearly all its time talking about the gp register.

¹ The direct call instruction has a reach of 16MB, so if the function you want to call is too far away, the linker redirects the br.call to a stub function which in turn jumps to the final destination.

```
    br.call.dptk.many stub_for_OtherFunction
...

stub_for_OtherFunction:
```

You have a few options for jumping to the function.

• If the stub is within 16MB of the target, it can use a br.cond direct jump:
```
stub_for_OtherFunction:
br.cond.sptk.many OtherFunction
```
• The stub can load the target address from the data segment and use an indirect jump:
```
stub_for_OtherFunction:
addl r3 = n, gp ;;  // look up the function address
ld8  r3 = [r3] ;;   // fetch it
mov  b6 = r3 ;;     // prepare to jump there
br.cond.sptk.many b6 ;; // and off we go
```
• The stub can load the target address offset from data stored in the code segment, then apply the offset to the current instruction pointer to determine the target:
```
stub_for_OtherFunction:
mov  r3 = ip ;;     // get current location
addl r3 = n, r3 ;;  // find the offset
ld8  r2 = [r3] ;;   // load the offset
addl r2 = r2, r3 ;; // apply to current location
mov  b6 = r2 ;;     // prepare to jump there
br.cond.sptk.many b6 ;; // and off we go
```

This last case is tricky because the Itanium conventions forbid relocations in the code segment; all code is position-independent. Therefore, the data in the code segment must not be relocatable. We work around this by storing an offset rather than the absolute address and applying the offset at runtime.

• #### The Itanium processor, part 3b: How does spilling actually work?

Answering some interesting questions that arose yesterday. I didn't know the answers either, so I went and read the manual, specifically Volume 2 (IA-64 System Architecture), chapter 6 (IA-64 Register Stack Engine).

According to the manual:

> The RSE operates concurrently and asynchronously with respect to instruction execution by taking advantage of unused memory bandwidth to dynamically perform register spill and fill operations.

So yeah, it's done in hardware.

Let's look at our register allocation diagram again. But I'm going to use colors to describe how they are treated by the Register Stack Engine.

*(Diagram: the stacked register file, showing the static registers, the spilled-or-open frames of callers `b` through `g`, the open registers, and the local (LR) and output (O) registers of the currently active function `a`.)*

The static registers do not participate in register stacking, so I've grayed them out.

The registers belonging to the currently active function are never spilled to backing store. Those are marked red in the diagram above.

The registers with undefined contents are in white. The ones marked "open" are registers that were discarded when a function returned, or registers which have never been used at all. There might also be registers, like those belonging to function g, which got flushed out to backing store, then got reused to satisfy some deeply-nested function call, and haven't yet been reloaded from backing store.

The green registers are those that are clean. They contain valid values, and the values have been flushed out to backing store.

The yellow registers are dirty. They contain valid values, and the values have not yet been flushed out to backing store.

We saw last time that when a function call occurs, the registers in the local region are rotated to the end of the diagram, and the other registers slide left to make room. What also happens is that the formerly-red registers are colored yellow, to indicate that they are dirty and can be flushed. The function being called carves out some white and green registers and colors them red. These registers become the new local registers and output registers.

When a function returns, its red registers are colored white, and the previous function's local region is slid into position (assuming the registers are yellow or green) and colored red.

What the Register Stack Engine does is look for unused memory bandwidth and use it to extend the size of the green region. To turn white registers green, it loads them from backing store. To turn yellow registers green, it saves them to backing store.

Turning white registers green means that when you return from a function, the caller's registers will already be on-chip and don't need to be loaded from memory. Turning yellow registers green means that when you call a function, the new function will have enough white and green registers available to satisfy its needs.

If there aren't enough registers of the appropriate color to satisfy the needs of a function call or return, the processor will perform spills or loads immediately in order to make room.
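The forced spills and fills can be sketched as bookkeeping. This toy C model (invented for illustration; the real RSE also flushes lazily in the background, which is the whole point of the green region) tracks only how many stacked registers hold older frames, how many belong to the current frame, and how many older values live only in the backing store:

```c
#include <assert.h>

#define STACKED 96   /* number of stacked registers, r32..r127 */

static int onchip = 0;   /* older frames still held in stacked registers */
static int cur = 0;      /* current function's frame (never spilled) */
static int spilled = 0;  /* older values living only in backing store */

static void rse_call(int frame)
{
    onchip += cur;                   /* caller's frame becomes flushable */
    cur = frame;
    while (onchip + cur > STACKED) { /* no room: forced spill of oldest */
        onchip--;
        spilled++;
    }
}

static void rse_return(int caller_frame)
{
    while (onchip < caller_frame) {  /* forced fill from backing store */
        assert(spilled > 0);
        spilled--;
        onchip++;
    }
    onchip -= caller_frame;          /* caller's frame is live again */
    cur = caller_frame;
}
```

Six nested calls with 20-register frames overflow the 96 stacked registers and force spills; returning back up forces the corresponding fills, and everything balances out.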

Bob wonders if the large number of registers makes context switching expensive. On a context switch, only the red and yellow registers need to be flushed. Since the processor is working in the background to eliminate yellow registers, the hope is that by the time you perform the context switch, there aren't many red and yellow registers left.

But yeah, context switching is more expensive.

• #### The Itanium processor, part 4: The Windows calling convention, leaf functions

Last time, we looked at the general rules for parameter passing on the Itanium. But those rules are relaxed for leaf functions (functions which call no other functions).

Before we start, I need to correct some of the explanation I had given when introducing the calling convention. I used that explanation because it makes for an easier conceptual model, but the reality is slightly different.

First of all, I said that the `alloc` function shuffles the registers around and lays out the new local region and output registers. In reality, it is the `br.call` instruction that moves the registers and the `alloc` which sets up the register frame. Since the first instruction of a function is `alloc`, it doesn't make much difference how the work is distributed between the `br.call` and the `alloc` since they come right after each other. The only time you notice the difference is if you happen to break into the debugger immediately between those two instructions.

More precisely, here's what the `br.call` instruction does:

• Copy the current register frame state (and some other stuff) to the `pfs` register.
• Rotate the registers so that the first output register is now r32.
• Create a new register frame as follows:
• input registers = caller's output registers
• no local registers
• no output registers
• no rotating registers
Registers are spilled if necessary to make room.
• Set the rp register to the return address.
• Transfer control to the target.

In other words, the register stack changes like this:

```
Before f does a br.call:  r32 ···························· r43
                          [ f's Input | f's Local | f's Output ]

On entry to g:            r32  r33  r34
                          [      g's Input       ]

After g does an alloc:    r32 ······················· r42
                          [ g's Input | g's Local | g's Output ]
```
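This division of labor can be modeled with a movable register base. The sketch below is a toy (names invented; it handles only one call level because it keeps the previous frame state in a single variable rather than saving it per-function, which is exactly why real code must save ar.pfs with `alloc`):

```c
#include <assert.h>

enum { NSTACKED = 96 };
static char regs[NSTACKED];  /* the physical stacked registers */
static int base = 0;         /* physical index that logical r32 maps to */
static int sol = 0, sof = 0; /* current frame: size of locals, size of frame */
static int pfs;              /* stands in for the ar.pfs register */

static char *reg(int n) { return &regs[base + (n - 32)]; }

/* br.call: save the frame state, then slide r32 past the caller's
   locals so the caller's outputs appear as the callee's inputs. */
static void br_call(void)
{
    pfs = (sol << 7) | sof;
    base += sol;
    sof -= sol;   /* callee's frame = caller's output registers... */
    sol = 0;      /* ...with no locals and no outputs yet */
}

/* alloc: declare the new frame's shape. */
static void ia64_alloc(int in, int loc, int out)
{
    sol = in + loc;
    sof = sol + out;
}

/* br.ret: restore the caller's frame from the saved state. */
static void br_ret(void)
{
    sof = pfs & 0x7f;
    sol = (pfs >> 7) & 0x7f;
    base -= sol;
}
```

With a caller frame of 2 inputs, 4 locals, and 3 outputs, a value placed in the caller's r38 shows up as the callee's r32 after `br_call`, and reappears as r38 after `br_ret`.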

A consequence of this division of labor between `br.call` and `alloc` is that leaf functions can take advantage of this default register frame: If a leaf function can do all its work with just

• its input registers
• scratch registers
• the red zone

then it doesn't need to perform an `alloc` at all! It can use the default register allocation of "Caller's output registers become my input registers, and I have no local registers or output registers." When finished, the function just does a `br.ret rp` to return to the caller.

Note that this optimization is available only to leaf functions. If the function calls another function, then the `br.call` will overwrite the `pfs` and `rp` registers, which will make it hard to return to your caller when you're done.

The red zone is officially known as the scratch area. The 16 bytes at the top of the stack (the 16 bytes starting at sp) are available for use by the currently executing function. If you want values to be preserved across a function call, you need to move them out of the scratch area, because they become the scratch area for the function being called! In other words, the scratch area is not preserved across function calls.

A more obscure consequence of this division of labor between `br.call` and `alloc` is that a function could in principle perform `alloc` more than once in order to change the size of its local region or the number of output registers. For example, a function might start by saying, "I have five local registers and two output registers," and then later realize, "Oh, wait, I need to call a function with six parameters. I will issue a new `alloc` instruction that requests five local registers and six output registers." Although technically legal, it doesn't often occur in practice because it's usually easier just to set up your register state once and stick with it for the duration of the function.

A more common case where this occurs is when a function has an early exit that can be determined using only leaf-available resources.

```extern HANDLE LogFile;

void Log(char *message, char *file, int line)
{
if (!LogFile) return;
... complicated logging code goes here ...
}
```

If profiling feedback indicates that logging is rarely enabled, then the compiler can avoid setting up all the registers and stack for the complicated logging code, at least until it knows that logging is enabled.

```.Log:
addl    r30 = -205584, gp ;; // get address of LogFile variable
ld8     r30 = [r30] ;;       // fetch the value
cmp.eq  p6, p0 = r30, r0     // is it zero?
(p6)  br.ret  rp                  // return if so

// Okay, we are really logging. Set up our stack.
alloc   r35 = ar.pfs, 3, 5, 6, 0 // set up register frame
sub     sp = sp, 0x240      // set up stack buffers
mov     r36 = rp            // save return address

... do complicated logging ...

mov     rp = r36            // return address
mov.i   ar.pfs = r35        // restore pfs
br.ret.sptk.many  rp ;;     // return to caller
```

The first instruction calculates the effective address of the `LogFile` variable. We'll learn more about the gp register later.

The second instruction loads an 8-byte value from that address, thereby retrieving the value of `LogFile`.

The third instruction compares the value against r0, which is a hard-coded zero register. It asks for an equality comparison, putting the answer in the predicate register p6 (and putting the complement of the answer in p0, which effectively throws it away, since writes to p0 are ignored).

The fourth instruction conditionally returns from the function if the comparison was true. In the common case where logging is not enabled, the function returns at this point. Only if logging is enabled do the `alloc` and related instructions execute to set up the stack frame and then perform the complicated logging.

This is an example of an optimization known as shrink-wrapping. Shrink-wrapping occurs when a function does some work with a temporary stack frame, and then expands to a full stack frame only if it is needed. (Shrink-wrapping entails a few extra entries in the unwind exception table because the unwinding needs to take place differently depending on where in the function the exception occurred. I'll spare you the details.)

Okay, that's all for leaf functions and getting to the bottom of the whole `br.call` / `alloc` dance. Next time, we'll look at function pointers and the funky gp register. Here's something to whet your appetite: On ia64, a function pointer is not the address of the first instruction in the function's code. In fact, it's nowhere near the function's code.

• #### The Itanium processor, part 3: The Windows calling convention, how parameters are passed

The calling convention on Itanium uses a variable-sized register window. The mechanism by which this is done is rather complicated, so I'm first going to present a conceptual version, and then I'll come back and fix up some of the implementation details. For today, I'm just going to talk about how parameters are passed. There are other aspects of the calling convention that I will cover in separate articles.

Recall that the first 32 registers r0 through r31 are static (they do not participate in register stacking), and the remaining registers r32 through r127 are stacked. The stacked registers fall into three categories: input registers, local registers, and output registers.

The input registers receive the function parameters. On entry to a function, the function's parameters are received in registers starting at r32 and increasing. For example, a function that takes two parameters receives the first parameter in r32 and the second parameter in r33.

Immediately after the input registers are the registers for the function's private use. These are known as local registers. For example, if that function with two parameters also wants four registers for private use, those private registers would be r34 through r37.

After the local registers come the registers used to pass parameters to other functions, known as output registers. For example, if the function with two parameters and four local registers wants to call a function that has three parameters, it would put those parameters in registers r38 through r40. Therefore, a function needs as many output registers as the maximum number of parameters of any function it calls.

The input registers and local registers are collectively known as the local region. The input registers, local registers, and output registers are collectively known as the register frame.

Any registers higher than the last output register are off-limits to the function, and we shall henceforth pretend they do not exist. Since the registers go up to r127, and in practice register frames are around one or two dozen registers, there end up being a lot of registers that go unused.

The first thing a function does is notify the processor of its intended register usage. It uses the `alloc` instruction to say how many input registers, local registers, and output registers it needs.

```alloc r35 = ar.pfs, 2, 4, 3, 0
```

This means, "Set up my register frame as follows: Two input registers, four local registers, three output registers, and no rotating registers. Put the previous register frame state (pfs) in register r35."

The second thing a function does is save the return address, typically in one of the local registers it just created. For example, the above `alloc` might be followed by

```mov r34 = rp
```

On entry to a function, the rp register contains the caller's return address, and most of the time, the compiler will save the return address in a register. Note that this means that on the Itanium, a stack buffer overrun will never overwrite a return address, since return addresses are not kept on the stack. (Let that sink in. On Itanium, return addresses are not kept on the stack. This means that tricks like `_Address­Of­Return­Address` will not work!)

By convention, the rp and ar.pfs are saved in consecutive registers (here, r34 and r35). This convention makes exception unwinding slightly easier.

Let's see what happens when somebody calls this function. Suppose the caller's register frame looks like this:

```
        static        |       local region      |        output
 r0  r1  …  r30  r31  | r32  r33  r34  r35  r36 | r37  r38  r39  r40  r41
```

The caller places the parameters to our function in its output registers, in this case r37 and r38. (Our function takes only two parameters, so r39 and beyond are not used.)

```
        static        |       local region      |        output
 r0  r1  …  r30  r31  | r32  r33  r34  r35  r36 | r37  r38  r39  r40  r41
  0   A  …   F    G   |  H    I    J    K    L  |  M    N    ?    ?    ?
```

The caller then invokes our function.

Our function opens by performing this `alloc`, declaring two input registers, four local registers, and three output registers.

```alloc r35 = ar.pfs, 2, 4, 3, 0
```

That `alloc` instruction shuffles the registers like this:

• The static registers don't change.
• The registers in the caller's local region are saved in a magic place.
• The specified number of output registers from the caller become the new function's input registers.
• New local and output registers are created but left uninitialized.
• The previous function state is placed in the specified register (for restoration at function exit). There are many parts of the function state, but the part we care about is the frame state, which describes how registers are assigned.

Here's what the register frame looks like after all but the last step above:

```
        static        |        local region       |    output
                      |  input  |      local      |
 r0  r1  …  r30  r31  | r32 r33 | r34 r35 r36 r37 | r38 r39 r40
  0   A  …   F    G   |  M   N  |  ?   ?   ?   ?  |  ?   ?   ?
      unchanged       |  moved  |          uninitialized
```

The last step (storing the previous function state in the specified register) updates the r35 register:

```
        static        |         local region       |    output
                      |  input  |       local      |
 r0  r1  …  r30  r31  | r32 r33 | r34  r35 r36 r37 | r38 r39 r40
  0   A  …   F    G   |  M   N  |  ?   pfs  ?   ?  |  ?   ?   ?
```

The next instruction is typically one to save the return address.

```mov r34 = rp
```

After that `mov` instruction, the function prologue is complete, and the register state looks like this:

```
        static        |         local region        |    output
                      |  input  |       local       |
 r0  r1  …  r30  r31  | r32 r33 | r34  r35  r36 r37 | r38 r39 r40
  0   A  …   F    G   |  M   N  |  ra  pfs   ?   ?  |  ?   ?   ?
```

where `ra` is the function's return address.

At this point the function runs and does actual work. Once it's done, its register state might look like this:

```
        static        |         local region        |    output
                      |  input  |       local       |
 r0  r1  …  r30  r31  | r32 r33 | r34  r35  r36 r37 | r38 r39 r40
  0   A′ …   F′   G′  |  T   U  |  ra  pfs   V   W  |  X   Y   Z
```

The function epilogue typically consists of three instructions:

```mov rp = r34     // prepare to return to caller
mov ar.pfs = r35 // restore previous function state
br.ret rp        // return!
```

This sequence begins by copying the saved return address into the rp register so that we can jump back to it. (We could have copied r34 into any scratch branch register, but by convention we use the rp register because it makes exception unwinding easier.)

Next, it restores the register state from the pfs it saved at function entry. Finally, it transfers control back to the caller by jumping through the rp register. (We cannot do a `br.ret r34` because `r34` is not a branch register; the parameter to `br.ret` must be a branch register.)

Restoring the previous function state causes the caller's register frame layout to be restored, and the values of the registers in the caller's local region are restored from that magic place.

The register state upon return back to the caller looks like this:

```
        static        |       local region      |        output
 r0  r1  …  r30  r31  | r32  r33  r34  r35  r36 | r37  r38  r39  r40  r41
  0   A′ …   F′   G′  |  H    I    J    K    L  |  ?    ?    ?    ?    ?
      unchanged       |         restored        |      uninitialized
```

From the point of view of the calling function, calling another function has the following effect:

• Static registers are shared with the called function. (Any changes to static registers are visible to the caller.)
• The local region is preserved across the call.
• The output registers are trashed by the call.

At most eight parameters are passed in registers. Any additional parameters are passed on the stack, and it is the caller's responsibility to clean them up. (The stack-based parameters begin after the red zone. We'll talk more about the red zone later.)

Thank goodness for the parameter cap, because a variadic function doesn't know how many parameters were passed, so it would otherwise not know how many input parameters to declare in its `alloc` instruction. The parameter cap means that variadic functions `alloc` eight input registers, and typically the first thing they do is spill them onto the stack so that they are contiguous with any parameters beyond 8 (if any). Note that this spilling must be done very carefully to avoid crashing if the corresponding register does not correspond to an actual parameter but happens to be a NaT left over from a failed speculative execution. (There is a special instruction for spilling without taking a NaT consumption exception.)

If any parameter is smaller than 64 bits, then the unused bits of the corresponding register are garbage and should be ignored. I didn't discuss floating point parameters or aggregates. You can read Thiago's comment for a quick version, or dig into the Itanium Software Conventions and Runtime Architecture Guide (Section 8.5: Parameter Passing) for gory details.

Okay, that's the conceptual model. The actual implementation is not quite as I described it, but the conceptual model is good enough for most debugging purposes. Here are some of the implementation details which will come in handy if you need to roll up your sleeves.

First of all, the processor does not actually distinguish between input registers and local registers. It only cares about the local region. In other words, the parameters to the `alloc` instruction are

• Size of local region.
• Number of output registers.
• Number of rotating registers.
• Register to receive previous function state.

When the called function establishes its register frame, the processor just takes all the caller's output registers (even the ones that aren't actually relevant to the function call) and slides them down to r32. It is the compiler's responsibility to ensure that the code passes the correct number of parameters. Therefore, our diagram of the function call process would more accurately go like this: The caller's register frame looks like this before the call:

```
        static        |       local region      |        output
 r0  r1  …  r30  r31  | r32  r33  r34  r35  r36 | r37  r38  r39  r40  r41
  0   A  …   F    G   |  H    I    J    K    L  |  M    N    X₁   X₂   X₃
```

where the X values are whatever garbage values happen to be left over from previous computations, possibly even NaT.

When the called function sets up its register frame (before storing the previous register frame), it gets this:

```
        static        |        local region       |    output
                      |  input  |      local      |
 r0  r1  …  r30  r31  | r32 r33 | r34 r35 r36 r37 | r38 r39 r40
  0   A  …   F    G   |  M   N  |  X₁  X₂  X₃  ?  |  ?   ?   ?
      unchanged       |    moved (r32–r36)   | uninitialized
```

The processor took all the output registers from the caller and slid them down to r32 through r36.

Of course, the called function shouldn't try to read from any registers beyond r33, if it knows what's good for it, because those registers contain nothing of value and may indeed be poisoned by a NaT.

This little implementation detail has no practical consequences because those registers were uninitialized in the conceptual model anyway. But it does mean that when you disassemble the `alloc` instruction, you'll see that the distinction between input registers and local registers has been lost, and that both sets of registers are reported as input registers. In other words, an instruction written as

```alloc r34 = ar.pfs, 2, 4, 3, 0
```

disassembles as

```alloc r34 = ar.pfs, 6, 0, 3, 0
```

The disassembler doesn't know how many of the six registers in the local region are input registers and how many are local, so it just reports them all as input registers.

That explains some of the undefined registers, but what about those question marks? To solve this riddle, we need to answer a different question first: "Where is this magic place that the caller's local region gets saved to and restored from?"

This is where the infamous Itanium second stack comes into play.

There are two stacks on Itanium. One is indexed by the sp register and is what one generally means when one says the stack. The other stack is indexed by the bsp register (backing store pointer), and it is the magic place where these "registers from long ago" are saved. The bsp grows upward in memory (toward higher addresses), opposite from the sp, which grows downward (toward lower addresses). Windows allocates the two stacks right next to each other. Slava Oks once drew an artistic impression of the layout; bear in mind that his diagram is upside-down (low addresses at the top, high addresses at the bottom), so although the bsp grows toward higher addresses, that direction is downward in his diagram.

One curious implementation detail is that the two stacks abut each other without a gap. I'm told that the kernel team considered putting a no-access page between the two stacks, so that a runaway memory copy into the stack would encounter an access violation before it reached the backing store. For whatever reason, they didn't bother.

Now, the processor is sneaky and doesn't actually push the values onto the backing store immediately. Instead, the processor rotates them into high-numbered unused registers (all the registers beyond the last output register), and only when it runs out of space there does it spill them into the backing store. When the function returns, the rotation is undone, and the values squirreled away into the high-numbered unused registers magically reappear in the caller's local region.

Each time a function is called, the registers rotate to the left, and when a function returns, the registers rotate to the right. As a result, the local regions of functions in the call stack can be found among the off-limits registers, up until we reach the last spill point.

Suppose the call stack looks like this (most recent function at the top):

```a() -- current function
b()
c()
d()
e()
f()
g()
```

If we zoom out, we can see all those local regions.

```
 static | a: LR, O | open | g: LR | f: LR | e: LR | d: LR | c: LR | b: LR
```

Why don't we see any output registers for any functions other than the current one? You know why: Because at each function call, the caller's output registers become the called function's input registers. If you really wanted to draw the output registers, you could do it like this, where each function's input registers are shared with its caller's output registers.

[Exploded view: the same diagram with each local region split into input (I), local (L), and output (O) registers, where each function's I is the same set of registers as its caller's O.]

But we won't bother drawing this exploded view any more.

Now, if the function `a` calls another function `x`, then all the registers rotate left, with `a`'s local region wrapping around to the end of the list:

```
 static | x: LR, O | open | g: LR | f: LR | e: LR | d: LR | c: LR | b: LR | a: LR
```

And when `x` returns, the registers rotate right, bringing us back to

```
 static | a: LR, O | open | g: LR | f: LR | e: LR | d: LR | c: LR | b: LR
```

Note that the conceptual model doesn't care about this implementation detail. In theory, future versions of the Itanium processor might have additional "bonus registers" after r127 which are programmatically inaccessible but which are used to expand the number of register frames that can be held before needing to spill.

With this additional information, you can now see where the contents of those undefined registers on entry to a function come from: They contain whatever garbage happened to be left over in the open registers.

You can also see the contents of the uninitialized output registers on return from a function: They contain whatever garbage happened to be left over in the called function's input registers. This behavior is actually documented by the processor, so in theory somebody could invent a calling convention where information is passed from a function back to its caller through the input registers, say, for a language that supports functions with multiple return values. (In other words, the input registers are actually in/out registers.) The Windows calling convention doesn't use this feature, however.

It so happens that the debugger forces a full spill into the backing store when it gains control. This is useful, because groveling into the backing store gives you a way to see the local regions of any function on the stack.

```kd> r
...
r32 =      6fbffd21130 0        r33 =          1170065 0
r34 =      6fbffd23700 0        r35 =                8 0
r36 =      6fbffd21338 0        r37 =            20000 0
r38 =             8000 0        r39 =             2000 0
r40 =              800 0        r41 =              400 0
r42 =              100 0        r43 =               80 0
r44 =              200 0        r45 =            10000 0
r46 =         7546fdf0 0        r47 = c000000000000693 0
r48 =             5041 0        r49 =         75ab0000 0
r50 =      6fbffd21130 0        r51 =          1170065 0
r52 =      6fbfc79f770 0        r53 =         7546cbe0 0
kd> dq @bsp
000006fb`fc7a02e0  000006fb`ffd21130 00000000`01170065 // r32 and r33
000006fb`fc7a02f0  000006fb`ffd23700 00000000`00000008 // r34 and r35
000006fb`fc7a0300  000006fb`ffd21338 00000000`00020000 // r36 and r37
000006fb`fc7a0310  00000000`00008000 00000000`00002000 // r38 and r39
000006fb`fc7a0320  00000000`00000800 00000000`00000400 // r40 and r41
000006fb`fc7a0330  00000000`00000100 00000000`00000080 // r42 and r43
000006fb`fc7a0340  00000000`00000200 00000000`00010000 // r44 and r45
000006fb`fc7a0350  00000000`7546fdf0 c0000000`00000693 // r46 and r47
```

But wait, ia64 integer registers are 65 bits wide, not 64. The extra bit is the NaT bit. Where did that go?

Whenever the spill address reaches offset 0x1F8 within a 512-byte block (bsp & 0x1F8 == 0x1F8, which happens once every 63 registers), the value spilled into the backing store is not a 64-bit register but rather the accumulated NaT bits. You are not normally interested in the NaT bits, so the only practical consequence of this is that you have to remember to skip an entry whenever you hit one of these NaT-collection slots.
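The skipping can be wrapped up in a little address calculator. A sketch (assuming `base` is an 8-byte-aligned backing-store address that itself points at a register slot, not a NaT-collection slot):

```c
#include <assert.h>
#include <stdint.h>

/* Compute the backing-store address of the nth register after the one
   spilled at "base", skipping the NaT-collection slot that occupies
   offset 0x1F8 of every 512-byte block (one slot out of every 64). */
static uint64_t bsp_add(uint64_t base, unsigned n)
{
    uint64_t slot = (base >> 3) & 0x3f; /* position within the block */
    uint64_t nats = (slot + n) / 63;    /* NaT slots crossed on the way */
    return base + 8 * (n + nats);
}
```

With the skip accounted for, 63 consecutive registers fit in each 512-byte block, and the 64th slot is skipped over automatically.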

Suppose we wanted to look at our caller's local region. Here's the start of a sample function. Don't worry about most of the instructions, just pay attention to the `alloc` and the `mov ... = rp`.

```SAMPLE!.Sample:
alloc    r47 = ar.pfs, 013h, 00h, 04h, 00h
mov      r48 = pr
addl     r31 = -2004312, gp
adds     sp = -1072, sp ;;
ld8.nta  r3 = [sp]
mov      r46 = rp
adds     r36 = 0208h, r32
or       r49 = gp, r0 ;;
```

Suppose you hit a breakpoint partway through this function, and you want to know why the caller passed a strange value for the first input parameter r32.

From reading the function prologue, you see that the return address is kept in r46, so you can disassemble there to see how your caller set up its output parameters:

```kd> u @r46-20
SAMPLE!.Caller+2bd0:
ld8      r47 = [r32]
ld4      r46 = [r33]
or       r45 = r35, r0
nop.b    00h
nop.b    00h
br.call.sptk.many  rp = SAMPLE!.Sample
```

(Notice the `nop` instructions which suggest that this is unoptimized code.)

But we don't know which of those registers are the output registers of the caller. For that, we need to know the register frame of the caller. We see from the `alloc` instruction that the previous function state (`pfs`) was saved in the r47 register.

```kd> ?@r47
Evaluate expression: -4611686018427386221 = c0000000`00000693
```

This value is not easy to parse. The bottom seven bits record the total size of the caller's register frame, which includes both the local region and the output registers. The size of the local region is kept in bits 7 through 13, which is a bit tricky to extract by eye. You take the third and fourth digits from the right, double the value, and add one more if the second digit from the right is 8 or higher. This is easier to do than to explain:

• The third- and fourth-to-last digits are `06` hex.
• Double that, and you get 12 (decimal).
• Since the second-to-last digit is 9, add one more.
• Result: 13.

The previous function's local region has 13 registers. Therefore, the previous function's output registers begin at 32 + 13 = 45. (You can also see that the previous function had 0x13 = 19 registers in its register frame, and you can therefore infer that it had 19 - 13 = 6 output registers.)
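The digit trick is just extracting two bit fields. In C (the field names sof, "size of frame", and sol, "size of locals", follow the architecture manual; the other pfs bits are ignored here):

```c
#include <assert.h>
#include <stdint.h>

/* pfs frame fields: total frame size in bits 0-6,
   local region size in bits 7-13. */
static unsigned pfs_sof(uint64_t pfs) { return (unsigned)(pfs & 0x7f); }
static unsigned pfs_sol(uint64_t pfs) { return (unsigned)((pfs >> 7) & 0x7f); }
```

Feeding it the pfs value from the example reproduces the hand calculation: 19 registers in the frame, 13 of them in the local region, and therefore 6 output registers.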

Applying this information to the disassembly of the caller, we see that the caller passed

• first output register r45 = r35. (Recall that the r0 register is always zero, so or'ing it with another value just copies that other value.)
• second output register r46 = 4-byte value stored at [r33]
• third output register r47 = 8-byte value stored at [r32]

That first output register was a copy of the r35 register. We can grovel through the backing store to see what that value is.

```0:000> dq @bsp-0n13*8 l4
000006fb`ffe906d8  00000000`4b1e9720 00000000`4b1ea2e8     // r32 and r33
000006fb`ffe906e8  00000000`0114a7c0 000006fb`fe728cac     // r34 and r35
```

And now we have extracted the registers from our caller's local region. Specifically, we see that the caller's r35 is 000006fb`fe728cac.

We can extend this technique to grovel even further back in the stack. To do that, we need to obtain the pfs chain so we can see the structure of the register frame for each function in the call stack.

From the disassembly above, we saw that the return address was kept in r46. To go back another level, we need to find that caller's caller. We merely repeat the exercise, but with the caller. Sometimes it can be hard to find the start of a function (especially if you don't have symbols); it can be easier to look for the end of the function instead! Instead of looking for the `alloc` and `mov ... = rp` instructions which save the previous function state and return address, we look for the `mov ar.pfs = ...` and `mov rp = ...` instructions which restore them.

Here's an example of a stack trace I had to reconstruct:

```0:000> u
00000000`4b17e9d4       mov      rp = r37              // return address
00000000`4b17e9e4       mov.i    ar.pfs = r38          // restore pfs
00000000`4b17e9e8       br.ret.sptk.many  rp ;;        // return to caller
0:000> dq @bsp
000006fb`ffe90758  000006fb`fe761cc0 000006fb`ffe8f860 // r32 and r33
000006fb`ffe90768  000006fb`ffe8fa70 00000000`00000104 // r34 and r35
000006fb`ffe90778  00000000`0114a7c0 00000000`4b1b6890 // r36 and r37
000006fb`ffe90788  c0000000`0000050e 00000000`00005001 // r38 and r39
```

The current function's return address is kept in r37 and the pfs in r38; you can read their values (00000000`4b1b6890 and c0000000`0000050e) out of the bsp dump above.

To decode the pfs, double the `05` to get 10 (decimal), and don't add one, since the second-to-last digit (`0`) is less than 8. The previous function therefore has 10 registers in its local region.

Let's disassemble at the return address and dump that function's local variables, thereby walking back one level in the call stack.

```0:000> u 00000000`4b1b6890
...
00000000`4b1b6bd4       mov      rp = r38 ;;           // return address
00000000`4b1b6be4       mov.i    ar.pfs = r39          // restore pfs
00000000`4b1b6be8       br.ret.sptk.many  rp ;;
// we calculated that the local region of the previous function is size 0xA
0:000> dq @bsp-a*8 la
000006fb`ffe90708  000006fb`fe73bfc0 000006fb`fe73ff10     // r32 and r33
000006fb`ffe90718  00000000`00000000 000006fb`ffe8f850     // r34 and r35
000006fb`ffe90728  000006fb`ffe8f858 00000000`00000000     // r36 and r37
000006fb`ffe90738  00000000`4b1e9350 c0000000`00000308     // r38 and r39
000006fb`ffe90748  00000000`00009001 00000000`4b57e000     // r40 and r41
```

By studying the value in the caller's r39, we see that the caller's caller has 3 × 2 + 0 = 6 registers in its local region. And the caller's r38 gives us the return address. Let's walk back another frame in the call stack.

```0:000> u 4b1e9350
...
00000000`4b1e9354       mov      rp = r34              // return address
00000000`4b1e9368       mov.i    ar.pfs = r35          // restore pfs
00000000`4b1e9378       br.ret.sptk.many  rp ;;
0:000> dq @bsp-a*8-6*8 l6
000006fb`ffe906d8  00000000`0114a7c0 000006fb`fe728cac     // r32 and r33
000006fb`ffe906e8  00000000`4b1e9720 c0000000`00000389     // r34 and r35
000006fb`ffe906f8  00000000`00009001 00000000`4b57e000     // r36 and r37
```

This time, the return address is in r34 and the previous pfs is in r35. The value in r35 tells us that the caller's caller's caller has 3 × 2 + 1 = 7 registers in its local region.

```0:000> u 4b1e9720
...
00000000`4b1e9784       mov      rp = r35             // return address
00000000`4b1e9788       adds     sp = 010h, sp ;;
00000000`4b1e9790       nop.m    00h
00000000`4b1e9794       mov      pr = r37, -2 ;;
00000000`4b1e9798       mov.i    ar.pfs = r36         // restore pfs
00000000`4b1e97a0       nop.m    00h
00000000`4b1e97a4       nop.f    00h
00000000`4b1e97a8       br.ret.sptk.many  rp ;;
0:000> dq @bsp-a*8-6*8-7*8 l7
000006fb`ffe906a0  00000000`0114a7c0 00000000`00000000    // r32 and r33
000006fb`ffe906b0  00000000`0114a900 00000000`4b19ba00    // r34 and r35
000006fb`ffe906c0  c0000000`0000058f 00000000`00009001    // r36 and r37
000006fb`ffe906d0  00000000`4b57e000                      // r38
```

This function also allocates 0x10 bytes from the stack, so if you want to see its stack variables, you can dump the values at sp + 0x10 for length 0x10. The `+ 0x10` is to skip over the red zone.

Anyway, that's the way to reconstruct the call stack on an Itanium. Repeat until bored.

Maybe you can spot the fast one I pulled when discussing how the `alloc` instruction and pfs register work. More details next time, when we discuss leaf functions and the red zone.

• #### The Itanium processor, part 2: Instruction encoding, templates, and stops

Instructions on Itanium are grouped into chunks of three, known as bundles, and each of the three positions in a bundle is known as a slot. A bundle is 128 bits long (16 bytes) and always resides on a 16-byte boundary, so that the last digit of the address is always zero. The Windows debugging engine disassembler shows the three slots as if they were at offsets 0, 4, and 8 in the bundle, but in reality they are all crammed together into one bundle.

You cannot jump into the middle of a bundle.

Now, you can't just put any old instruction into any old slot. There are 32 bundle templates, and each has different rules about what types of instructions it can accept and the dependencies between the slots. For example, the bundle template MII allows a memory access instruction in slot 0, an integer instruction in slot 1, and another integer instruction in slot 2.

(Math: Each slot is 41 bits wide, so 123 bits are required to encode the slots. Add five bits for encoding the template, and you get 128 bits for the entire bundle.)¹

The slot types are

• M = memory or move
• I = complex integer or multimedia
• A = simple arithmetic, bit logic, or multimedia
• F = floating point or SIMD
• B = branch

Some instructions can be used in multiple slot types, and the disassembler will stick a suffix (known as a completer) onto the mnemonic to disambiguate them. For example, there are five different `nop` instructions, one for each slot type: `nop.m`, `nop.i`, `nop.a`, `nop.f`, and `nop.b`. When reading code, you don't need to worry too much about slotting. You can assume that the compiler did it correctly; otherwise it wouldn't have disassembled properly! (For the remainder of this series, I will tend to omit completers if their sole purpose is to disambiguate a slot type.)

If you are debugging unoptimized code, you may very well see a lot of `nop`s because the compiler didn't bother trying to optimize slot usage.

Another thing that bundles encode is the placement of what are known as stops. A stop is used to indicate that the instructions after the stop depend on instructions before the stop. For example, if you had the following sequence of instructions

```    mov r3 = r2
add r1 = r2, r4 ;;
add r2 = r1, r3
```

there is no dependency between the first two instructions; they can execute in parallel. However, the third instruction cannot execute until the first two have completed. The compiler therefore inserts a stop after the second instruction, which is represented by a double-semicolon.

A sequence of instructions without any stops is known as an instruction group. (There are other things that can end an instruction group, but they aren't important here.) As noted above, the instructions in an instruction group may not have any dependencies among them. This allows the processor to execute them in parallel. (This is an example of how the processor relies on the compiler: By making it the compiler's responsibility to ensure that there are no dependencies within an instruction group, the processor can avoid having to do its own dependency analysis.)

There are some exceptions to the rule against having dependencies within an instruction group:

• A branch instruction is allowed to depend on a predicate register and/or branch register set up earlier in the group.
• You are allowed to use the result of a successful `ld.c` without an intervening stop. We'll learn more about `ld.c` when we discuss explicit speculation.
• Comparison instructions with the `.and`, `.andcm`, `.or`, and `.orcm` completers are allowed to write to the same predicate registers as other comparisons of the same type. (In other words, you can combine two `.and`s, but not an `.and` and an `.or`.)
• You are allowed to write to a register after a previous instruction reads it. (With rare exceptions.)
• Two instructions in the same group cannot write to the same register. (With the exception of combined comparisons noted above.)

There are a lot of fine details in the rules, but I'm ignoring them because they are of interest primarily to compiler-writers. The above rules are to give you a general idea of the sorts of dependencies that can exist within an instruction group. (Answer: Not much.)

All of this highlights that writing ia64 assembly by hand is exceedingly difficult: you have to make sure every triplet of instructions you write matches a valid template in terms of slots and stops, and you have to ensure that the instruction groups do not break the rules.

Next time, we'll look at the calling convention.

¹ There are two templates which are special in that they encode only two slots rather than three. The first slot is the normal 41 bits, but the second slot is a double-wide 82 bits. The double-wide slot is used by a few special-purpose instructions we will not get into.

• #### The Itanium processor, part 1: Warming up

The Itanium may not have been much of a commercial success, but it is interesting as a processor architecture because it is different from anything else commonly seen today. It's like learning a foreign language: It gives you an insight into how others view the world.

The next two weeks will be devoted to an introduction to the Itanium processor architecture, as employed by Win32. (Depending on the reaction to this series, I might also do a series on the Alpha AXP.)

I originally learned this information in order to be able to debug user-mode code as part of the massive port of several million lines of code from 32-bit to 64-bit Windows, so the focus will be on being able to read, understand, and debug user-mode code. I won't cover kernel-mode features since I never had to learn them.

Introduction

The Itanium is a 64-bit EPIC architecture. EPIC stands for Explicitly Parallel Instruction Computing, a design in which work is offloaded from the processor to the compiler. For example, the compiler decides which operations can be safely performed in parallel and which memory fetches can be productively speculated. This relieves the processor from having to make these decisions on the fly, thereby allowing it to focus on the real work of processing.

Registers overview

There are a lot of registers.

• 128 general-purpose integer registers r0 through r127, each carrying 64 value bits and a trap bit. We'll learn more about the trap bit later.
• 128 floating point registers f0 through f127.
• 64 predicate registers p0 through p63.
• 8 branch registers b0 through b7.
• An instruction pointer, which the Windows debugging engine for some reason calls iip. (The extra "i" is for "insane"?)
• 128 special-purpose registers, not all of which have been given meanings. These are called "application registers" (ar) for some reason. I will cover selected registers as they arise during the discussion.
• Other miscellaneous registers we will not cover in this series.

Some of these registers are further subdivided into categories like static, stacked, and rotating.

Note that if you want to retrieve the value of a register with the Windows debugging engine, you need to prefix it with an at-sign. For example `? @r32` will print the contents of the r32 register. If you omit the at-sign, then the debugger will look for a variable called r32.

A notational note: I am using the register names assigned by the Windows debugging engine. The formal names for the registers are gr# for integer registers, fr# for floating point registers, pr# for predicate registers, and br# for branch registers.

Static, stacked, and rotating registers

These terms describe how the registers participate in register renumbering.

Static registers are never renumbered.

Stacked registers are pushed onto a register stack when control transfers into a function, and they pop off the register stack when control transfers out. We'll see more about this when we study the calling convention.

Rotating registers can be cyclically renumbered during the execution of a function. They revert to being stacked when the function ends (and are then popped off the register stack). We'll see more about this when we study register rotation.

Integer registers

Of the 128 integer registers, registers r0 through r31 are static, and r32 through r127 are stacked (but they can be converted to rotating).

Win32 assigns the following mnemonics to some of the static registers, corresponding to their use in the Win32 calling convention.

| Register | Mnemonic | Meaning |
| --- | --- | --- |
| r0 | | Reads as zero (writes will fault) |
| r1 | gp | Global pointer |
| r8–r11 | ret0–ret3 | Return values |
| r12 | sp | Stack pointer |
| r13 | | TEB |

Registers r4 through r7 are preserved across function calls. Well, okay, you should also preserve the stack pointer and the TEB if you know what's good for you, and there are special rules for gp which we will discuss later. The other static registers are scratch (may be modified by the function).

Register r0 is a register that always contains the value zero. Writes to r0 trigger a processor exception.

The gp register points to the current function's global variables. The Itanium has no absolute addressing mode. In order to access a global variable, you need to load it indirectly through a register, and the gp register points to the global variables associated with the current function. The gp register is kept up to date when code transfers between DLLs by means we'll discuss later. (This is sort of a throwback to the old days of `MAKEPROCINSTANCE`.)

Every integer register contains 64 value bits and one trap bit, known as not-a-thing, or NaT. The NaT bit is used by speculative execution to indicate that the register values are not valid. We learned a little about NaT some time ago; we'll discuss it further when we reach the topic of control speculation. The important thing to know about NaT right now is that if you take a register which is tagged as NaT and try to do arithmetic with it, then the NaT bit is set on the output register. Most other operations on registers tagged as NaT will raise an exception.

The NaT bit means that accessing an uninitialized variable can crash.

```void bad_idea(int *p)
{
int uninitialized;
*p = uninitialized; // can crash here!
}
```

Since the variable uninitialized is uninitialized, the register assigned to it might happen to have the NaT bit set, left over from previous execution, at which point trying to save it into memory raises an exception.

You may have noticed that there are four return value registers, which means that you can return up to 32 bytes of data in registers.

Floating point registers

| Register | Meaning |
| --- | --- |
| f0 | Reads as 0.0 (writes will fault) |
| f1 | Reads as 1.0 (writes will fault) |

Registers f0 through f31 are static, and f32 through f127 are rotating.

By convention, registers f0 through f5 and f16 through f31 are preserved across calls. The others are scratch.

That's about all I'm going to say about floating point registers, since they aren't really where the Itanium architecture is exciting.

Predicate registers

Instead of a flags register, the Itanium records the state of previous comparison operations in dedicated registers known as predicates. Each comparison operation indicates which predicates should hold the comparison result, and future instructions can test the predicate.

| Register | Meaning |
| --- | --- |
| p0 | Reads as true (writes are ignored) |

Predicate registers p0 through p15 are static, and p16 through p63 are rotating.

You can predicate almost any instruction, and the instruction will execute only if the predicate register is true. For example:

```(p1) add ret0 = r32, r33
```

means, "If predicate p1 is true, then set register ret0 equal to the sum of r32 and r33. If not, then do nothing." The thing inside the parentheses is called the qualifying predicate (abbreviated qp).

Instructions which execute unconditionally are internally represented as being conditional upon predicate register p0, since that register is always true.

Actually, I lied when I said that the instruction will execute only if the qualifying predicate is true. There is one class of instructions which execute regardless of the state of the qualifying predicate; we'll learn about that when we get to them.

The Win32 calling convention specifies that predicate registers p0 through p5 are preserved across calls, and p6 through p63 are scratch.

There is a special pseudo-register called preds by the Windows debugging engine which consists of the 64 predicate registers combined into a single 64-bit value. This pseudo-register is used when code needs to save and restore the state of the predicate registers.

Branch registers

The branch registers are used for indirect jump instructions. The only things you can do with branch registers are load them from an integer register, copy them to an integer register, and jump to them. In particular, you cannot load them directly from memory or do arithmetic on them. If you want to do any of those things, you need to do it with an integer register, then transfer it to a branch register.

The Win32 calling convention assigns the following meanings to the branch registers:

| Register | Mnemonic | Meaning |
| --- | --- | --- |
| b0 | rp | Return address |

The return address register is sometimes called br, but the disassembler calls it rp, so that's what we'll call it.

The return address register is set automatically by the processor when a `br.call` instruction is executed.

By convention, registers b1 through b5 are preserved across calls, while b6 and b7 are scratch. (Exercise: Is b0 preserved across calls?)

Application registers

There are a large number of application registers, most of which are not useful to user-mode code. We'll introduce the interesting ones as they arise. I've already mentioned one of them: `bsp` is the ia64's second stack pointer.

Break

Okay, this was a whirlwind tour of the Itanium register set. I bet your head hurts already, and we haven't even started coding yet!

In fact, we're not going to be coding for quite some time. Next time, we'll look at the instruction format.

• #### The curse of the redefinition of the symbol HLOG

A customer was running into a compiler error complaining about redefinition of the symbol `HLOG`.

```#include <pdh.h>
#include <lm.h>

...
```

The result is

```lmerrlog.h(80): error C2373: 'HLOG' redefinition; different type modifiers
pdh.h(70): See declaration of 'HLOG'
```

"Our project uses both performance counters (`pdh.h`) and networking (`lm.h`). What can we do to avoid this conflict?"

We've seen this before. The conflict arises from two problems.

First is hubris/lack of creativity. "My component does logging. I need a handle to a log. I will call it `HLOG` because (1) I can't think of a better name, and/or (2) obviously I'm the only person who does logging. (Anybody else who wants to do logging should just quit their job now because it's been done.)"

This wouldn't normally be a problem except that Win32 uses a global namespace. This is necessary for annoying reasons:

• Not all Win32 languages support namespaces.
• Even though C++ supports namespaces, different C++ implementations decorate differently, so there is no agreement on the external linkage. (Indeed, the decoration can change from one version of the C++ compiler to another!)

Fortunately, in the case of `HLOG`, the two teams noticed the collision and came to some sort of understanding. If you include them in the order

```#include <lm.h>
#include <pdh.h>
```

then `pdh.h` detects that `lm.h` has already been included and avoids the conflicting definition.

```#ifndef _LMHLOGDEFINED_
typedef PDH_HLOG     HLOG;
#endif
```

The PDH log is always accessible via the name `PDH_HLOG`. If `lm.h` was not also included, then the PDH log is also accessible under the name `HLOG`.

Sorry for the confusion.

• #### Corrupted file causes application to crash; is that a security vulnerability?

A security vulnerability report came in that went something like this:

We have found a vulnerability in the XYZ application when it opens the attached corrupted file. The error message says, "Unhandled exception: System.OverflowException. Value was either too large or too small for an Int16." For a nominal subscription fee, you can learn about similar vulnerabilities in Microsoft products in the future.

Okay, so there is a flaw in the XYZ application where a file that is corrupted in a specific way causes it to suffer an unhandled exception trying to load the file.

That's definitely a bug, and thanks for reporting it, but is it a security vulnerability?

The attack here is that you create one of these corrupted files and you trick somebody into opening it. And then when they open it, the XYZ application crashes. The fact that an `OverflowException` was raised strongly suggests that the application was diligent enough to do its file parsing under the `checked` keyword, or that the entire module was compiled with the `/checked` compiler option, so that any overflow or out-of-range errors raise an exception rather than being ignored. That way, the overflow cannot be used as a vector to another attack.

What is missing from the story is that nobody was set up to catch the overflow exception, so the corrupted file resulted in the entire application crashing rather than some sort of "Sorry, this file is corrupted" error being displayed.

Okay, so let's assess the situation. What have you gained by tricking somebody into opening the file? The program detects that the file is corrupted and crashes instead of using the values in it. There is no code injection because the overflow is detected at the point it occurs, before any decisions are made based on the overflowed value. Consequently, there is no elevation of privilege. All you got was a denial of service against the XYZ application. (The overflow checking did its job and stopped processing as soon as corruption was detected.)

There isn't even data loss, because the problem occurred while loading up the corrupted file. It's not like the XYZ application had any old unsaved data.

At the end of the day, the worst you can do with this crash is annoy somebody.

Here's another way you can annoy somebody: Send them a copy of onestop.mid.
