Larry Osterman's WebLog

Confessions of an Old Fogey




I was chatting with one of the perf guys last week and he mentioned something that surprised me greatly.  Apparently he's having perf issues that appear to be associated with a 3rd party driver.  Unfortunately, he's having problems figuring out what's going wrong because the vendor who wrote the driver used FPO (and hasn't provided symbols), so the perf guy can't track down the root cause of the problem.

The reason I was surprised was that I didn't realize that ANYONE was using FPO any more.

What's FPO?

To know the answer, you have to go way back into prehistory.

Intel's 8088 processor had an extremely limited set of registers (I'm ignoring the segment registers).  They were:

AX, BX, CX, DX, SI, DI, SP, BP, IP, FLAGS

With such a limited set of registers, the registers were all assigned specific purposes.  AX, BX, CX, and DX were the "General Purpose" registers, SI and DI were "Index" registers, SP was the "Stack Pointer", BP was the "Frame Pointer", IP was the "Instruction Pointer", and FLAGS was a read-only register that contained several bits indicating information about the processor's current state (whether the result of the previous arithmetic or logical instruction was 0, for instance).

The BX, SI, DI and BP registers were special because they could be used as "Index" registers.  Index registers are critically important to a compiler, because they are used to access memory through a pointer.  In other words, if you have a structure that's located at offset 0x1234 in memory, you can set an index register to the value 0x1234 and access values relative to that location.  For example:

MOV    BX, [Structure]
MOV    AX, [BX]+4

This will set the BX register to the value of the memory pointed to by [Structure] and set the value of AX to the WORD located 4 bytes past the start of that structure.
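To make the index-register idea concrete, here is a toy simulation in Python (not 8088 code).  It models memory as a byte array and mimics the two MOV instructions above.  The address 0x0100 for the [Structure] variable and the values stored in the structure are made up for illustration; only the 0x1234 structure address comes from the text.

```python
import struct

# Toy 64 KB memory image. The addresses and contents are invented for
# illustration; this is a simulation, not real 8088 code.
memory = bytearray(0x10000)

STRUCTURE_PTR = 0x0100   # hypothetical location of the variable [Structure]
STRUCT_BASE = 0x1234     # where the structure itself lives (as in the text)

# [Structure] holds a little-endian 16-bit pointer to the structure.
struct.pack_into("<H", memory, STRUCTURE_PTR, STRUCT_BASE)
# Fill the structure with three WORDs at offsets 0, 2 and 4.
struct.pack_into("<HHH", memory, STRUCT_BASE, 0x1111, 0x2222, 0x3333)

def read_word(addr):
    """Read a little-endian 16-bit WORD from the toy memory."""
    return struct.unpack_from("<H", memory, addr)[0]

bx = read_word(STRUCTURE_PTR)   # MOV BX, [Structure]
ax = read_word(bx + 4)          # MOV AX, [BX]+4
print(hex(bx), hex(ax))         # prints: 0x1234 0x3333
```

The point of the sketch is just that "index register plus displacement" is the hardware form of "pointer plus field offset".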

One thing to note is that the SP register wasn't an index register.  That meant that to access variables on the stack, you needed to use a different register.  That's where the BP register came in - the BP register was dedicated to accessing values on the stack.

When the 386 came out, they stretched the various registers to 32 bits, and they removed the restriction that only BX, SI, DI and BP could be used as index registers.


This was a good thing.  All of a sudden, instead of being constrained to 3 index registers, the compiler could use 6 of them.

Since index registers are used for structure access, to a compiler they're like gold - more of them is a good thing, and it's worth almost any amount of effort to gain more of them.

Some extraordinarily clever person realized that since ESP was now an index register the EBP register no longer had to be dedicated for accessing variables on the stack.  In other words, instead of:

    PUSH    EBP
    MOV     EBP, ESP
    SUB     ESP, <LocalVariableStorage>
    MOV     EAX, [EBP+8]
    MOV     ESP, EBP
    POP     EBP

to access the 1st parameter on the stack (EBP+0 is the old value of EBP, EBP+4 is the return address), you can instead do:

    SUB     ESP, <LocalVariableStorage>
    MOV     EAX, [ESP+4+<LocalVariableStorage>]
    ADD     ESP, <LocalVariableStorage>

This works GREAT - all of a sudden, EBP can be repurposed and used as another general purpose register!  The compiler folks called this optimization "Frame Pointer Omission", and it went by the acronym FPO.
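As a sanity check on the offsets, here is a small Python simulation of a 32-bit stack showing that [EBP+8] in the framed version and [ESP+4+<LocalVariableStorage>] in the FPO version name the same stack slot.  The addresses, the parameter value, and the 16 bytes of local storage are all invented for the sketch.

```python
# Toy 32-bit stack: a dict from address to value. All addresses and values
# are invented for illustration.
stack = {}
LOCALS = 16                    # stands in for <LocalVariableStorage>

def push(esp, value):
    """PUSH: decrement ESP by 4, store the value, return the new ESP."""
    esp -= 4
    stack[esp] = value
    return esp

# Caller side: push the 1st parameter, then CALL pushes the return address.
esp = 0x1000
esp = push(esp, 0xCAFE)        # 1st parameter
esp = push(esp, 0x00401234)    # return address
esp_at_entry = esp

# Framed prolog: PUSH EBP / MOV EBP, ESP / SUB ESP, LOCALS
ebp = 0x2000                   # caller's EBP (arbitrary)
esp = push(esp_at_entry, ebp)
ebp = esp
esp -= LOCALS
framed_param = stack[ebp + 8]          # MOV EAX, [EBP+8]

# FPO prolog: SUB ESP, LOCALS only
esp = esp_at_entry - LOCALS
fpo_param = stack[esp + 4 + LOCALS]    # MOV EAX, [ESP+4+LOCALS]

print(hex(framed_param), hex(fpo_param))   # prints: 0xcafe 0xcafe
```

One thing the sketch glosses over: in real code the ESP-relative offset shifts every time the function pushes or pops something, so the compiler has to track the current ESP adjustment at each instruction.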

But there's one small problem with FPO.

If you look at the pre-FPO example above, you'll notice that the first instruction in the routine was PUSH EBP followed by a MOV EBP, ESP.  That had an interesting and extremely useful side effect.  It essentially created a singly linked list that linked the frame pointer for each of the callers to a function.  From the EBP for a routine, you could recover the entire call stack for a function.  This was unbelievably useful for debuggers - it meant that call stacks were quite reliable, even if you didn't have symbols for all the modules being debugged.  Unfortunately, when FPO was enabled, that list of stack frames was lost - the information simply wasn't being tracked.
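Here is a rough Python sketch of how a debugger can walk that linked list.  It assumes the frame layout produced by the framed prolog ([EBP] holds the caller's saved EBP, [EBP+4] holds the return address).  The memory contents are invented, and using a saved EBP of 0 to mark the outermost frame is an assumption of this sketch, not something the text specifies.

```python
# Toy stack memory: address -> 32-bit value. Frame layout per the framed
# prolog: [EBP] = caller's saved EBP, [EBP+4] = return address.
# All addresses are invented; a saved EBP of 0 marks the outermost frame
# (an assumption of this sketch).
memory = {
    0x0FD0: 0x0FE8, 0x0FD4: 0x00401050,   # innermost frame
    0x0FE8: 0x0FF8, 0x0FEC: 0x00402010,   # its caller's frame
    0x0FF8: 0x0000, 0x0FFC: 0x00403000,   # outermost frame
}

def walk_frames(ebp, max_frames=64):
    """Follow the saved-EBP linked list, collecting return addresses."""
    return_addresses = []
    while ebp != 0 and len(return_addresses) < max_frames:
        return_addresses.append(memory[ebp + 4])
        next_ebp = memory[ebp]
        if next_ebp != 0 and next_ebp <= ebp:
            break  # the stack grows down, so callers must sit at higher addresses
        ebp = next_ebp
    return return_addresses

trace = walk_frames(0x0FD0)
print([hex(a) for a in trace])   # prints: ['0x401050', '0x402010', '0x403000']
```

Note that nothing in the walk needs symbols: the return addresses fall out of the frame chain alone.  With FPO, [EBP] no longer holds a saved frame pointer, so the same loop would follow garbage.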

To solve this problem, the compiler guys put the information that was lost when FPO was enabled into the PDB file for the binary.  Thus, when you had symbols for the modules, you could recover all the stack information.

FPO was enabled for all Windows binaries in NT 3.51, but was turned off for Windows binaries in Vista because it was no longer necessary - machines have gotten sufficiently faster since 1995 that the performance improvements achieved by FPO weren't sufficient to counter the pain in debugging and analysis that FPO caused.


Edit: Clarified what I meant by "FPO was enabled in NT 3.51" and "was turned off in Vista", thanks Steve for pointing this out.

  • Everybody who's using default MSVC options is using FPO. As of MSVC 2005 /O1, /O2 and /Ox (I forget which is default for Release) all include /Oy.

  • So for those of us who have symbols for our modules, is there any debugging advantage to disabling FPO?

  • If you list an option as an "optimization" and the program doesn't crash, people are going to use it even if they don't know whether it's really beneficial.

    I'd turn it off permanently if I could turn it on for only specific loops with a pragma, but I haven't seen docs pointing to that possibility.

  • Chris: If your code lives 100% in modules you own, then perhaps not.  (If you're the type who needs to debug at an assembly level rather than via a source-augmenting GUI debugger, though, then being able to look at the previous frame via EBP is extremely handy.)

    The minute you touch modules you don't own, though, not using FPO becomes a godsend.  As far as my experience goes, half the time that an app of mine breaks, it's deep in some OS-provided API, or in some third-party middleware that you're calling.  Being able to walk EBP without symbols gives you a good sense of parameters being passed, functions being called, etc. without needing to actually read backwards through the code.  Of course, the bug will usually be in my code -- but if being able to trace backwards through the call stack allows me to find the source of the bug more quickly, all the better.

    (Also, not all symbols are created equal.  It's pretty common for middleware to have two sets of symbols -- a "private" set that is complete, and a "public" set that has topmost function names and datatypes, but has internal functions and variables stripped.)

    The difference in performance between FPO and non-FPO code is pretty negligible on modern CPUs, enough that it slips into the noise factor -- especially on AMD64 where we have eight more general-purpose registers to use.  Unless you're hand-coding assembly and specifically know that you can make a major difference (i.e. fitting in a cache line) by having that one extra register, leave it off.  This is one of those small things that has a great "ecosystem"-scale benefit when everyone does it.

  • It's the normal rule: if you think an optimisation may be useful, use the profiler to confirm that it's actually measurable, and significant.

  • > Some extraordinarily clever person realized that since ESP

    > was now an index register the EBP register no longer had to

    > be dedicated for accessing variables on the stack.

    No, that was just ordinary cleverness.  For comparison, I was using R13 as a base register in addition to pointing to the save area, on IBM 360 and 370 before Intel's 8080 and 8086 existed.  Registers were gold to assembly language coders too.

  • Disabling FPO can have both serious code size and performance impact. Tail call optimizations have to be disabled when a frame pointer is present, leading to much greater stack usage in affected paths. Small functions are also disproportionately affected by prolog/epilog code. Third, although there are still six registers available with a frame pointer on X86, only three of them are nonvolatile with respect to nested calls: EBX, ESI, and EDI. Opening up a fourth register can drop out a bunch of spill code.

    That having been said, FPO is often minor in impact, mainly because ESP-based addressing takes an extra byte per instruction, and any aligned objects on the stack will force an aligned frame pointer anyway.

    The issue with call stacks is a problem with the Win32 ABI and the debugging information written by VC++, not with FPO itself. Not only does VC++ not write enough information to always crawl past FPO functions, the __stdcall calling convention makes it impossible to statically determine ESP offsets since it is caller-pops. On other compilers or platforms, either the calling convention and ISA are simple enough that instruction analysis can reliably determine the ESP-to-return-address offset, or the required information is available due to table-based exception handling (X64).

  • > the __stdcall calling convention makes it impossible to

    > statically determine ESP offsets since it is caller-pops.

    No, isn't it callee-pops?  Called function pops its own arguments from stack when returning.  Accomplished on the x86 by using the "ret x" form of the x86 return instruction.

  • __cdecl is caller-pops. __stdcall is callee-pops.

    (caller-pops is needed to support variadic functions like the printf family).

  • Is it just not worthwhile to add a feature to your debugger to walk through the stack effects of taking the next return from the function??  I suppose this wouldn't work in all cases, but it should work in many.

    This strategy might not work right if you have a really messed up stack, but the code should give you enough information to pop the correct amount from the stack since the processor needs to do it correctly too!  With some added heuristics about return addresses, this could even be made to work quite reliably.

    Maybe this is done already in windbg and I'm just being an idiot.

  • Whoops, I meant to type callee-pops. You're screwed in static analysis in that case, because after an indirect call you don't know the stack offset. It's theoretically possible for the linker to merge two routines that have different ESP offsets between two CALL instructions in the same instruction stream.

  • How was this turned off in Vista?  Isn't FPO done at executable compile time?  Do you just mean this was not used by OS code in Vista or that Vista somehow stops applications from using it?

  • Good point, Steve - a more accurate statement is that we disabled FPO for Windows components.

    3rd party apps that chose to enable FPO can still enable it.  But Windows doesn't.

  • FPO is pretty much the standard out there in the world today; I'm surprised that you are surprised about it :)

    Microsoft is the exception in that it does things like turn off FPO for DLLs and rebases their modules and all sorts of other little things that virtually nobody else in the world really thinks to do.  Of course, the fact that VS still continues to default to enabling FPO doesn't really help the situation either.

    In any case, though, it's not too hard to do a manual stack trace in the absence of frame pointers; see the `dds' command in WinDbg (for instance, `dds @esp').  That and a little bit of disassembly make it quite doable to manually construct a call stack where there are no frame pointers involved.  More work, yes, but not a blocker.

    The places where FPO really begins to be a problem are where you have automated processes that take snapshots of stacks for debugging purposes (e.g. page heap, or handle tracing).  There, you really lose out because those features can't handle FPO frames on x86 and you lose the ability to easily track down heap corruption or handle leak style problems in many circumstances if a function wrapping a handle or memory allocation API has FPO enabled.

    Windows on x64 makes this a nonissue with the stringent requirements on calling conventions (and unwind metadata) that allow for perfect unwinds in all circumstances without symbols.

  • >> Whoops, I meant to type callee-pops. You're screwed in static analysis in that case, because after an indirect call you don't know the stack offset. It's theoretically possible for the linker to merge two routines that have different ESP offsets between two CALL instructions in the same instruction stream.


    How does it matter whether it's callee-pops or not?  Maybe I'm just not understanding what kind of static analysis you're describing.
