Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title



Rate This
  • Comments 22

I was chatting with one of the perf guys last week and he mentioned something that surprised me greatly.  Apparently he's having perf issues that appear to be associated with a 3rd party driver.  Unfortunately, he's having problems figuring out what's going wrong because the vendor wrote the driver used FPO (and hasn't provided symbols), so the perf guy can't track the root cause of the problem.

The reason I was surprised was that I didn't realize that ANYONE was using FPO any more.

What's FPO?

To know the answer, you have to go way back into prehistory.

Intel's 8088 processor had an extremely limited set of registers (I'm ignoring the segment registers), they were:


With such a limited set of registers, the registers were all assigned specific purposes.  AX, BX, CX, and DX were the "General Purpose" registers, SI and DI were "Index" registers, SP was the "Stack Pointer", BP was the "Frame Pointer", IP was the "Instruction Pointer", and FLAGS was a read-only register that contained several bits that were indicated information about the processors' current state (whether the result of the previous arithmetic or logical instruction was 0, for instance).

The BX, SI, DI and BP registers were special because they could be used as "Index" registers.  Index registers are critically important to a compiler, because they are used to access memory through a pointer.  In other words, if you have a structure that's located at offset 0x1234 in memory, you can set an index register to the value 0x1234 and access values relative to that location.  For example:

MOV    BX, [Structure]
MOV    AX, [BX]+4

Will set the BX register to the value of the memory pointed to by [Structure] and set the value of AX to the WORD located at the 4th byte relative to the start of that structure.

One thing to note is that the SP register wasn't an index register.  That meant that to access variables on the stack, you needed to use a different register, that's where the BP register came from - the BP register was dedicated to accessing values on the stack.

When the 386 came out, they stretched the various registers to 32bits, and they fixed the restrictions that only BX, SI, DI and BP could be used as index registers.


This was a good thing, all of a sudden, instead of being constrained to 3 index registers, the compiler could use 6 of them.

Since index registers are used for structure access, to a compiler they're like gold - more of them is a good thing, and it's worth almost any amount of effort to gain more of them.

Some extraordinarily clever person realized that since ESP was now an index register the EBP register no longer had to be dedicated for accessing variables on the stack.  In other words, instead of:

    PUSH    EBP
    MOV     EBP, ESP
    SUB      ESP, <LocalVariableStorage>
    MOV     EAX, [EBP+8]
    MOV     ESP, EBP
    POP      EBP

to access the 1st parameter on the stack (EBP+0 is the old value of EBP, EBP+4 is the return address), you can instead do:

    SUB      SP, <LocalVariableStorage>
    MOV     EAX, [ESP+4+<LocalVariableStorage>]
    ADD     SP, <LocalVariableStorage>

This works GREAT - all of a sudden, EBP can be repurposed and used as another general purpose register!  The compiler folks called this optimization "Frame Pointer Omission", and it went by the acronym FPO.

But there's one small problem with FPO.

If you look at the pre-FPO example for MyFunction, you'd notice that the first instruction in the routine was PUSH EBP followed by a MOV EBP, ESP.  That had an interesting and extremely useful side effect.  It essentially created a singly linked list that linked the frame pointer for each of the callers to a function.  From the EBP for a routine, you could recover the entire call stack for a function.  This was unbelievably useful for debuggers - it meant that call stacks were quite reliable, even if you didn't have symbols for all the modules being debugged.  Unfortunately, when FPO was enabled, that list of stack frames was lost - the information simply wasn't being tracked.

To solve the is problem, the compiler guys put the information that was lost when FPO was enabled into the PDB file for the binary.  Thus, when you had symbols for the modules, you could recover all the stack information.

FPO was enabled for all Windows binaries in NT 3.51, but was turned off for Windows binaries in Vista because it was no longer necessary - machines got sufficiently faster since 1995 that the performance improvements that were achieved by FPO weren't sufficient to counter the pain in debugging and analysis that FPO caused.


Edit: Clarified what I meant by "FPO was enabled in NT 3.51" and "was turned off in Vista", thanks Steve for pointing this out.

  • Did the syntax really change that much from 16-bit to 32-bit? Perhaps the following patch makes it more correct?

    - MOV    AX, [BX]+4

    + MOV    AX, [BX+4]

  • > How does it matter whether it's callee-pops or not?  Maybe I'm just not understanding what kind of static analysis you're describing.

    Assuming that the code is well-behaved and doesn't do anything wacky to the stack pointer, you can compute the ESP offsets across a piece of code by stepping through the disassembly from current IP and maintaining a virtual stack pointer until you hit the return instruction. PUSH / POP / LEA ESP, [ESP+n] / ADD ESP, n cover most of the cases generated by the compiler. This is common practice when running on RISC CPUs with easy-to-interpret instruction streams, like MIPS, and gives you a perfect stack trace. X64 does this during an unwind, with the help of restrictions on the prolog/epilog and extra unwind bytecode.

    The nit is when you hit a CALL instruction with a dynamically computed address, because you can't determine the called function without actually executing the code. If the function has the cdecl calling convention, ESP is unchanged across the CALL and thus you can ignore it. If it's stdcall or thiscall, however, there is a hidden offset on the RET at the end of the callee that you can't see statically, and thus in the absence of debug information you have to resort to heuristics.

    The REALLY evil case that I mentioned is that the linker can merge two functions with entirely different stack behaviors, but identical code. Consider the following instruction stream:

    push 2

    push 1

    call dword ptr [ecx]

    mov ecx, eax

    call dword ptr [ecx] ;<-- AV on ecx=0

    The calling sequences A(1, 2)->B(), A(1)->B(2), and A()->B(1, 2) can be represented by this code. If presented with a crash on the second call instruction, you can't tell whether you need to pop off 0, 4, or 8 bytes to clear the parameters. I'm not even sure this is resolvable even with debug information, because if the preceding call is polymorphic the call stack may not be enough to determine which function is executing.

  • Ok I see now, you're talking about static analysis at runtime to try to determine return addresses.  (I first thought you're talking about the compiler.)

    I guess __stdcall would be a pretty common roadblock to the static analysis you described, though there are other difficulties that could arise, such as encountering an unconditional indirect jump.

    >> I'm not even sure this is resolvable even with debug information, because if the preceding call is polymorphic the call stack may not be enough to determine which function is executing.


    Well at least in C++, even if the call is polymorphic, I think the function prototype of the method in the class declaration would ensure that the number of parameters (which has to be fixed for __stdcall) and the calling convention cannot vary amongst the possible different classes.

  • foxyshadis:



    before your procedure and



    after your procedure is the way to do this in MASM.  So I'm pretty sure VC++ exposes some pragma to do the same in C/C++.

  • >especially on AMD64 where we have eight more general-purpose registers to use.

    I would like to clear a misconception here -- those registers come with cost. Each instruction referencing them has an extra byte prefix hampering instruction decoding throughput. It is even advisable to use 32-bit parts instead of full registers whenever possible. Another thing to consider is that hardware register renamer has been pretty efficient so having more (developer and complier) visible registers doesn't mean much for x86.

    I wrote a function in assembler (highly optimized piece of code if I might add) which uses EBP as a general purpose register and [ESP + offset] to access parameters and local variables on the stack. It does pay off, but not unless you are sorely missing a register like I did.

  • That is exactly what FPO is.

  • Windows symbols are used for two purposed as I know. One is providing FPO data for call stacks, other is providing names for symbols themselves. This means, pdb file sizes for Vista are reduced a bit comparatively, disregarding the additional code and functionality?

    Right or wrong?

Page 2 of 2 (22 items) 12