January, 2004

  • The Old New Thing

    The history of calling conventions, part 5: amd64

    The last architecture I'm going to cover in this series is the AMD64 architecture (also known as x86-64).

    The AMD64 takes the traditional x86 and expands the registers to 64 bits, naming them rax, rbx, etc. It also adds eight more general-purpose registers, named simply r8 through r15.

    • The first four parameters to a function are passed in rcx, rdx, r8 and r9. Any further parameters are pushed on the stack. Furthermore, space for the register parameters is reserved on the stack, in case the called function wants to spill them; this is important if the function is variadic.

    • Parameters that are smaller than 64 bits are not zero-extended; the upper bits are garbage, so remember to zero them explicitly if you need to. Parameters that are larger than 64 bits are passed by address.

    • The return value is placed in rax. If the return value is larger than 64 bits, then a secret first parameter is passed which contains the address where the return value should be stored.

    • All registers must be preserved across the call, except for rax, rcx, rdx, r8, r9, r10, and r11, which are scratch.

    • The callee does not clean the stack. It is the caller's job to clean the stack.

    • The stack must be kept 16-byte aligned. Since the "call" instruction pushes an 8-byte return address, this means that every non-leaf function is going to adjust the stack by a value of the form 16n+8 in order to restore 16-byte alignment.
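    The rule about parameters smaller than 64 bits can be modeled in portable C++. This is only a sketch: raw_register is a hypothetical stand-in for whatever the callee finds in rcx, and both helper names are invented for illustration.

```cpp
#include <cstdint>

// Hypothetical model: raw_register is the full 64-bit content of rcx on entry.
// Only the low 32 bits carry the "int" parameter; the upper 32 bits are garbage.
std::uint32_t LowDwordOfRegister(std::uint64_t raw_register) {
    // Truncation keeps the meaningful low half and discards the garbage.
    return static_cast<std::uint32_t>(raw_register);
}

// Widening the truncated value zero-extends it. This is the explicit
// "zero them yourself" step that the convention leaves to the callee.
std::uint64_t ZeroExtendedParameter(std::uint64_t raw_register) {
    return static_cast<std::uint64_t>(LowDwordOfRegister(raw_register));
}
```

    A callee that used the full 64-bit value directly, say as an array index, would be indexing with the garbage included.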

    Here's a sample:

    void SomeFunction(int a, int b, int c, int d, int e);
    void CallThatFunction()
    {
        SomeFunction(1, 2, 3, 4, 5);
        SomeFunction(6, 7, 8, 9, 10);
    }

    On entry to CallThatFunction, the stack looks like this:

    xxxxxxx0 .. rest of stack ..
    xxxxxxx8 return address <- RSP

    Due to the presence of the return address, the stack is misaligned. CallThatFunction sets up its stack frame, which might go like this:

        sub    rsp, 0x28

    Notice that the local stack frame size is 16n+8, so that the result is a realigned stack.

    xxxxxxx0 .. rest of stack ..
    xxxxxxx8 return address
    xxxxxxx0   (arg5)
    xxxxxxx8   (arg4 spill)
    xxxxxxx0   (arg3 spill)
    xxxxxxx8   (arg2 spill)
    xxxxxxx0   (arg1 spill) <- RSP
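    The 0x28 can be double-checked with a little arithmetic. The helper below is a sketch with an invented name (AdjustmentFor); it assumes every parameter occupies one 8-byte slot and that locals are simply added on top of the outgoing parameter area.

```cpp
#include <cstddef>

// Sketch: smallest value of the form 16n+8 that covers the outgoing parameter
// area plus local_bytes of local variables.
std::size_t AdjustmentFor(std::size_t max_outgoing_params, std::size_t local_bytes) {
    // Spill space for the four register parameters is reserved even when the
    // callee takes fewer than four parameters.
    std::size_t slots = max_outgoing_params < 4 ? 4 : max_outgoing_params;
    std::size_t needed = slots * 8 + local_bytes;
    // Round up to the next value congruent to 8 modulo 16.
    return ((needed + 7) / 16) * 16 + 8;
}
```

    For the five-parameter call above, AdjustmentFor(5, 0) comes out to 0x28. A function whose calls take at most four parameters needs the same 0x28, because 0x20 alone would leave the stack misaligned.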

    Now we can set up for the first call:

            mov     dword ptr [rsp+0x20], 5     ; output parameter 5
            mov     r9d, 4                      ; output parameter 4
            mov     r8d, 3                      ; output parameter 3
            mov     edx, 2                      ; output parameter 2
            mov     ecx, 1                      ; output parameter 1
            call    SomeFunction                ; Go Speed Racer!

    When SomeFunction returns, the stack is not cleaned, so it still looks like it did above. To issue the second call, then, we just shove the new values into the space we already reserved:

            mov     dword ptr [rsp+0x20], 10    ; output parameter 5
            mov     r9d, 9                      ; output parameter 4
            mov     r8d, 8                      ; output parameter 3
            mov     edx, 7                      ; output parameter 2
            mov     ecx, 6                      ; output parameter 1
            call    SomeFunction                ; Go Speed Racer!

    CallThatFunction is now finished and can clean its stack and return.

            add     rsp, 0x28
            ret

    Notice that you see very few "push" instructions in amd64 code, since the paradigm is for the caller to reserve parameter space and keep re-using it.

    [Updated 11:00am: Fixed some places where I said "ecx" and "edx" instead of "rcx" and "rdx"; thanks to Mike Dimmick for catching it.]

  • The Old New Thing

    What can go wrong when you mismatch the calling convention?

    Believe it or not, calling conventions are one of the things that programs frequently get wrong. The compiler yells at you when you mismatch a calling convention, but lazy programmers will just stick a cast in there to get the compiler to "shut up already".

    And then Windows is stuck having to support your buggy code forever.

    The window procedure

    So many people misdeclare their window procedures (usually by declaring them as __cdecl instead of __stdcall) that the function that dispatches messages to window procedures contains extra protection to detect incorrectly declared window procedures and perform the appropriate fixup. This is the source of the mysterious 0xdcbaabcd on the stack. The function that dispatches messages to window procedures checks whether this value is on the stack in the correct place. If not, then it checks whether the window procedure popped one dword too many off the stack (if so, it fixes up the stack; I have no idea how a window procedure this messed up could have existed), or whether the window procedure was mistakenly declared as __cdecl instead of __stdcall (if so, it pops the parameters off the stack that the window procedure was supposed to pop).
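    The sentinel check can be illustrated with a toy model. This is not the actual Windows dispatcher; the stack is a std::vector, the parameter values are arbitrary, and all the names (DispatchAndCheck, ProcKind, Fixup) are made up for the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kSentinel = 0xdcbaabcd;

enum class ProcKind { Stdcall, Cdecl, PopsOneTooMany };
enum class Fixup { None, RestoredOneDword, PoppedParameters, Unrecognized };

// Toy stack: slot 0 is the deepest entry; `top` counts the live slots.
Fixup DispatchAndCheck(ProcKind kind) {
    const std::size_t kParams = 4;
    std::vector<std::uint32_t> stack = {0x11111111u, kSentinel};
    for (std::uint32_t param : {40u, 30u, 20u, 10u})  // lParam, wParam, uMsg, hwnd
        stack.push_back(param);

    // The simulated window procedure "returns", popping however many dwords
    // its (possibly wrong) declaration says it should.
    std::size_t popped = kind == ProcKind::Stdcall ? 4
                       : kind == ProcKind::Cdecl   ? 0
                                                   : 5;
    std::size_t top = stack.size() - popped;  // simulated stack pointer after the call

    // Sentinel exactly on top: the procedure cleaned up correctly.
    if (stack[top - 1] == kSentinel) return Fixup::None;
    // Sentinel one slot beyond the top: one dword too many was popped.
    if (top < stack.size() && stack[top] == kSentinel) return Fixup::RestoredOneDword;
    // Sentinel buried under all four parameters: nothing was popped (__cdecl).
    if (top > kParams && stack[top - 1 - kParams] == kSentinel) return Fixup::PoppedParameters;
    return Fixup::Unrecognized;
}
```

    The model only classifies the damage; the real fixup would then adjust the stack pointer accordingly.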

    DirectX callbacks

    Many DirectX functions use callbacks, and people once again misdeclared their callbacks as __cdecl instead of __stdcall, so the DirectX enumerators have to do special stack cleanup for those bad functions.


    IShellFolder::CreateViewObject

    I remember there was one program that decided to declare their CreateViewObject function incorrectly, and somehow they managed to trick the compiler into accepting it!

    class BuggyFolder : public IShellFolder ... {
     // wrong function signature!
     HRESULT CreateViewObject(HWND hwnd) { return S_OK; }
    };

    Not only did they get the function signature wrong, they returned S_OK even though they failed to do anything! I had to add extra code to clean up the stack after calling this function, as well as verify that the return value wasn't a lie.

    Rundll32.exe entry points

    The function signature required for functions called by rundll32.exe is documented in this Knowledge Base article. That hasn't stopped people from using rundll32 to call random functions that weren't designed to be called by rundll32, like user32 LockWorkStation or user32 ExitWindowsEx.

    Let's walk through what happens when you try to use rundll32.exe to call a function like ExitWindowsEx:

    The rundll32.exe program parses its command line and calls the ExitWindowsEx function on the assumption that the function is written like this:

    void CALLBACK ExitWindowsEx(HWND hwnd, HINSTANCE hinst,
           LPSTR pszCmdLine, int nCmdShow);

    But it isn't. The actual function signature for ExitWindowsEx is

    BOOL WINAPI ExitWindowsEx(UINT uFlags, DWORD dwReserved);

    What happens? Well, on entry to ExitWindowsEx, the stack looks like this:

    .. rest of stack ..
    nCmdShow
    pszCmdLine
    hinst
    hwnd
    return address <- ESP

    However, the function is expecting to see

    .. rest of stack ..
    dwReserved
    uFlags
    return address <- ESP

    What happens? The hwnd passed by rundll32.exe gets misinterpreted as uFlags and the hinst gets misinterpreted as dwReserved. Since window handles are pseudorandom, you end up passing random flags to ExitWindowsEx. Maybe today it's EWX_LOGOFF, tomorrow it's EWX_FORCE, the next time it might be EWX_POWEROFF.

    Now suppose that the function manages to return. (For example, the exit fails.) The ExitWindowsEx function cleans two parameters off the stack, unaware that it was passed four. The resulting stack is

    .. rest of stack ..
    nCmdShow (garbage not cleaned up)
    pszCmdLine <- ESP (garbage not cleaned up)

    Now the stack is corrupted and really fun things happen. For example, suppose the thing at ".. rest of stack .." is a return address. Well, the original code is going to execute a "return" instruction to return through that return address, but with this corrupted stack, the "return" instruction will instead return to a command line and attempt to execute it as if it were code.
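    The whole mix-up can be replayed in a small model. This is a sketch, not rundll32 itself: the stack is a std::vector, the return address is left out, and the struct and function names are invented.

```cpp
#include <cstdint>
#include <vector>

struct MismatchResult {
    std::uint32_t seen_uFlags;
    std::uint32_t seen_dwReserved;
    std::vector<std::uint32_t> left_on_stack;  // listed from the top of the stack downward
};

// The caller pushes four parameters right to left, as rundll32.exe would.
// The callee then treats the stack as if it held only its own two parameters.
MismatchResult CallTwoParamFunctionWithFourArgs(std::uint32_t hwnd, std::uint32_t hinst,
                                                std::uint32_t pszCmdLine,
                                                std::uint32_t nCmdShow) {
    std::vector<std::uint32_t> stack;  // back() is the top of the stack
    stack.push_back(nCmdShow);
    stack.push_back(pszCmdLine);
    stack.push_back(hinst);
    stack.push_back(hwnd);

    MismatchResult result;
    // The callee reads the top two slots as uFlags and dwReserved...
    result.seen_uFlags = stack.back();
    stack.pop_back();
    result.seen_dwReserved = stack.back();
    stack.pop_back();
    // ...and, being callee-clean, removes only those two on return.
    result.left_on_stack.assign(stack.rbegin(), stack.rend());
    return result;
}
```

    Whatever value is passed as hwnd comes back as seen_uFlags, and the two leftover entries are pszCmdLine and nCmdShow, with pszCmdLine on top.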

    Random custom functions

    An anonymous commenter exported a function as __cdecl but treated it as if it were __stdcall. This will seem to work, but on return, the stack will be corrupted (because the caller is expecting a __stdcall function that cleans the stack, but what it gets is a __cdecl function that doesn't), and bad things will happen as a result.

    Okay, enough with the examples; I think you get the point. Here are some questions I'm sure you're asking:

    Why doesn't the compiler catch all these errors?

    It does. (Well, not the rundll32 one.) But people have gotten into the habit of just inserting the function cast to get the compiler to shut up.

    Here's a random example I found:

       WPARAM wParam, LPARAM lParam);

    This is the incorrect function signature for a dialog procedure. The correct signature is

    INT_PTR CALLBACK DialogProc(HWND hwndDlg, UINT uMsg,
        WPARAM wParam, LPARAM lParam);

    You start with

              hWnd, DlgProc);

    but the compiler rightly spits out the error message

    error C2664: 'DialogBoxParamA' : cannot convert parameter 4

    so you fix it by slapping a cast in to make the compiler shut up:

              hWnd, reinterpret_cast<DLGPROC>(DlgProc));

    "Aw, come on, who would be so stupid as to insert a cast to make an error go away without actually fixing the error?"

    Apparently everyone.

    I stumbled across this page that does exactly the same thing, and this one in German which gets not only the return value wrong, but also misdeclares the third and fourth parameters, and this one in Japanese. It's as easy to fix (incorrectly) as 1-2-3.
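    For a portable look at the check the cast throws away, here is a sketch using made-up types. ExpectedCallback, GoodCallback, and BadCallback are hypothetical stand-ins, not Win32 declarations, and the mismatch is in the return type only.

```cpp
#include <type_traits>

// Hypothetical callback contract and two candidate functions.
using ExpectedCallback = long (*)(void* window, unsigned message);

long GoodCallback(void*, unsigned) { return 0; }
bool BadCallback(void*, unsigned) { return false; }  // wrong return type

// Asks the same question the compiler asks when no cast is involved:
// can this function pointer be converted to the expected pointer type?
template <typename F>
constexpr bool MatchesContract(F) {
    return std::is_convertible<F, ExpectedCallback>::value;
}

static_assert(MatchesContract(&GoodCallback), "correct signature converts");
static_assert(!MatchesContract(&BadCallback), "mismatch is rejected without a cast");
```

    A reinterpret_cast<ExpectedCallback>(&BadCallback) would compile anyway, which is exactly the problem: the cast does not fix the function, it only turns off the diagnosis.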

    How did programs with these bugs ever work at all? Certainly these programs worked to some degree or people would have noticed and fixed the bug. How can the program survive a corrupted stack?

    I'll answer this question tomorrow.

  • The Old New Thing

    The history of calling conventions, part 3

    Okay, here we go: The 32-bit x86 calling conventions.

    (By the way, in case people didn't get it: I'm only talking in the context of calling conventions you're likely to encounter when doing Windows programming or which are used by Microsoft compilers. I do not intend to cover calling conventions for other operating systems or that are specific to a particular language or compiler vendor.)

    Remember: If a calling convention is used for a C++ member function, then there is a hidden "this" parameter that is the implicit first parameter to the function.


    The 32-bit x86 calling conventions all preserve the EDI, ESI, EBP, and EBX registers, using the EDX:EAX pair for return values.

    C (__cdecl)

    The same constraints apply to the 32-bit world as in the 16-bit world. The parameters are pushed from right to left (so that the first parameter is nearest to top-of-stack), and the caller cleans the parameters. Function names are decorated by a leading underscore.


    __stdcall

    This is the calling convention used for Win32, with exceptions for variadic functions (which necessarily use __cdecl) and a very few functions that use __fastcall. Parameters are pushed from right to left [corrected 10:18am] and the callee cleans the stack. Function names are decorated by a leading underscore and a trailing @-sign followed by the number of bytes of parameters taken by the function.


    __fastcall

    The first two parameters are passed in ECX and EDX, with the remainder passed on the stack as in __stdcall. Again, the callee cleans the stack. Function names are decorated by a leading @-sign and a trailing @-sign followed by the number of bytes of parameters taken by the function (including the register parameters).


    thiscall

    The first parameter (which is the "this" parameter) is passed in ECX, with the remainder passed on the stack as in __stdcall. Once again, the callee cleans the stack. Function names are decorated by the C++ compiler in an extraordinarily complicated mechanism that encodes the types of each of the parameters, among other things. This is necessary because C++ permits function overloading, so a complex decoration scheme must be used so that the various overloads have different decorated names.
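    The three C-style decoration rules are simple enough to write down as code. This is a sketch with invented names (Convention, DecorateC); it covers only the rules stated above and makes no attempt at C++ name decoration.

```cpp
#include <string>

enum class Convention { Cdecl, Stdcall, Fastcall };

// Builds the decorated name for a C function on 32-bit x86.
// param_bytes is the total size of the parameters, register ones included.
std::string DecorateC(Convention cc, const std::string& name, unsigned param_bytes) {
    switch (cc) {
    case Convention::Cdecl:
        return "_" + name;  // leading underscore only
    case Convention::Stdcall:
        return "_" + name + "@" + std::to_string(param_bytes);
    case Convention::Fastcall:
        return "@" + name + "@" + std::to_string(param_bytes);
    }
    return name;  // not reached for the enumerators above
}
```

    For a function taking two int parameters, DecorateC(Convention::Stdcall, "SomeFunction", 8) yields "_SomeFunction@8".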

    There are some nice diagrams on MSDN illustrating some of these calling conventions.

    Remember that a calling convention is a contract between the caller and the callee. For those of you crazy enough to write in assembly language, this means that your callback functions need to preserve the registers mandated by the calling convention because the caller (the operating system) is relying on it. If you corrupt, say, the EBX register across a call, don't be surprised when things fall apart on you. More on this in a future entry.

  • The Old New Thing

    Why can't I GetProcAddress a function I dllexport'ed?

    The dllexport attribute tells the linker to generate an export table entry for the specified function. This export entry is decorated. This is necessary to support dllexporting of overloaded functions. But it also means that the string you pass to GetProcAddress needs to be decorated.

    As we learned earlier, the decoration scheme varies from architecture to architecture and from calling convention to calling convention. So, for example, if the function is exported from a PPC DLL, you would have to do GetProcAddress(hinst, "..SomeFunction"). If it is exported from an 80386 DLL as extern "C" __stdcall, you would need GetProcAddress(hinst, "_SomeFunction@8"), and if it's __fastcall, you would need GetProcAddress(hinst, "@SomeFunction@8").

    What's more, C++ decoration varies from compiler vendor to compiler vendor. A C++ exported function might require GetProcAddress(hinst, "?SomeFunction@@YGXHH@Z") if compiled with the Microsoft C++ compiler, but some other decorated string if compiled with the Borland C++ compiler.

    So if you intend people to be able to GetProcAddress for functions and you intend your code to be portable to multiple platforms, or if you intend them to be able to use your DLL from a language other than C/C++ or use a C++ compiler different from Microsoft Visual Studio, then you must export the function by its undecorated name.

    When a DLL is generated, the linker produces a matching LIB file which translates the decorated names to undecorated names. So, for example, the LIB file has an entry that says, "If somebody asks for the function _GetTickCount@0, send them to kernel32!GetTickCount."

    Exercise: If dllexport ties you to an architecture, compiler, and language (by exporting decorated names), then why does MSVCRT.DLL use it?

  • The Old New Thing

    Some reasons not to do anything scary in your DllMain

    As everybody knows by now, you're not supposed to do anything even remotely interesting in your DllMain function. Oleg Lvovitch has written two very good articles about this, one about how things work, and one about what goes wrong when they don't work.

    Here's another reason not to do anything remotely interesting in your DllMain: It's common to load a library without actual intent to invoke its full functionality. For example, somebody might load your library like this:

    // error checking deleted for expository purposes
    hinst = LoadLibrary(you);
    hicon = LoadIcon(hinst, MAKEINTRESOURCE(5));

    This code just wants your icon. It would be very surprised (and perhaps even upset) if your DLL did something heavy like starting up a timer or a thread.

    (Yes, this could be avoided by using LoadLibraryEx and LOAD_LIBRARY_AS_DATAFILE, but that's not my point.)

    Another case where your library gets loaded even though no code is going to be run is when it gets tugged along as a dependency for some other DLL. Suppose "middle" is the name of some intermediate DLL that is linked to your DLL.

    hinst = LoadLibrary(middle);
    pfn = GetProcAddress(hinst, "SomeFunction");

    When "middle" is loaded, your DLL will get loaded and initialized, too. So your initialization runs even if "SomeFunction" doesn't use your DLL.

    This "intermediate DLL loaded for a brief time" scenario is actually quite common. For example, if somebody does "Regsvr32 middle.dll", that will load the middle DLL to call its DllRegisterServer function, which typically doesn't do much other than install some registry keys. It almost certainly doesn't call into your helper DLL.

    Another example is the opening of the Control Panel folder. The Control Panel folder loads every *.cpl file so it can call its CPlApplet function to determine what icon to display. Again, this typically will not call into your helper DLL.

    And under no circumstances should you create any objects with thread affinity in your DLL_PROCESS_ATTACH handler. You have no control over which thread will send the DLL_PROCESS_ATTACH message, nor which thread will send the DLL_PROCESS_DETACH message. The thread that sends the DLL_PROCESS_ATTACH message might terminate immediately after it loads your DLL. Any object with thread-affinity will then stop working since its owner thread is gone.

    And even if that thread survives, there is no guarantee that the thread that calls FreeLibrary is the same one that called LoadLibrary. So you can't clean up those objects with thread affinity in DLL_PROCESS_DETACH since you're on the wrong thread.

    And absolutely under no circumstances should you be doing anything as crazy as creating a window inside your DLL_PROCESS_ATTACH. In addition to the thread affinity issues, there's the problem of global hooks. Hooks running inside the loader lock are a recipe for disaster. Don't be surprised if your machine deadlocks.

    Even more examples to come tomorrow.

  • The Old New Thing

    Another reason not to do anything scary in your DllMain: Inadvertent deadlock


    Your DllMain function runs inside the loader lock, one of the few times the OS lets you run code while one of its internal locks is held. This means that you must be extra careful not to violate a lock hierarchy in your DllMain; otherwise, you are asking for a deadlock.

    (You do have a lock hierarchy in your DLL, right?)

    The loader lock is taken by any function that needs to access the list of DLLs loaded into the process. This includes functions like GetModuleHandle and GetModuleFileName. If your DllMain enters a critical section or waits on a synchronization object, and that critical section or synchronization object is owned by some code that is in turn waiting for the loader lock, you just created a deadlock:

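    // global variable
    CRITICAL_SECTION g_csGlobal;

    // some code somewhere
    EnterCriticalSection(&g_csGlobal);
    ... GetModuleFileName(MyInstance, ..);
    LeaveCriticalSection(&g_csGlobal);

    BOOL WINAPI
    DllMain(HINSTANCE hinstDLL, DWORD fdwReason,
            LPVOID lpvReserved)
    {
      switch (fdwReason) {
      case DLL_THREAD_DETACH:
        EnterCriticalSection(&g_csGlobal);
        ...
      }
    }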

    Now imagine that some thread is happily executing the first code fragment and enters g_csGlobal, then gets pre-empted. During this time, another thread exits. This enters the loader lock and sends out DLL_THREAD_DETACH messages while the loader lock is still held.

    You receive the DLL_THREAD_DETACH and attempt to enter your DLL's g_csGlobal. This blocks on the first thread, who owns the critical section. That thread then resumes execution and calls GetModuleFileName. This function requires the loader lock (since it's accessing the list of DLLs loaded into the process), so it blocks, since the loader lock is owned by somebody else.

    Now you have a deadlock:

    • g_csGlobal owned by first thread, waiting on loader lock.
    • Loader lock owned by second thread, waiting on g_csGlobal.

    I have seen this happen. It's not pretty.

    Moral of the story: Respect the loader lock. Include it in your lock hierarchy rules if you take any locks in your DllMain.
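    One way to make a lock hierarchy rule checkable is to give each lock a rank and record what each thread holds. The sketch below is only a bookkeeping model with invented names; it takes no real locks, and putting the loader lock at the outermost rank is an assumption of the sketch.

```cpp
#include <vector>

// Lower rank means "must be taken first".
enum LockRank { kLoaderLock = 0, kCsGlobal = 1 };

// Records the locks one thread holds. A real checker would wrap the
// actual lock and unlock calls and keep one tracker per thread.
class LockOrderTracker {
public:
    // Returns false if taking `rank` now would violate the hierarchy,
    // that is, if a lock of equal or higher rank is already held.
    bool TryRecordAcquire(LockRank rank) {
        if (!held_.empty() && held_.back() >= rank) return false;
        held_.push_back(rank);
        return true;
    }

    // Forgets the most recently recorded lock.
    void RecordRelease() {
        if (!held_.empty()) held_.pop_back();
    }

private:
    std::vector<LockRank> held_;
};
```

    Under this ranking, DllMain taking g_csGlobal while inside the loader lock is in order, and it is the "some code somewhere" fragment, which calls a loader-lock function while holding g_csGlobal, that gets flagged.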
  • The Old New Thing

    The history of calling conventions, part 1

    The great thing about calling conventions on the x86 platform is that there are so many to choose from!

    In the 16-bit world, part of the calling convention was fixed by the instruction set: The BP register defaults to the SS selector, whereas the other registers default to the DS selector. So the BP register was necessarily the register used for accessing stack-based parameters.

    The registers for return values were also chosen automatically by the instruction set. The AX register acted as the accumulator and therefore was the obvious choice for passing the return value. The 8086 instruction set also has special instructions which treat the DX:AX pair as a single 32-bit value, so that was the obvious choice to be the register pair used to return 32-bit values.

    That left SI, DI, BX and CX.

    (Terminology note: Registers that do not need to be preserved across a function call are often called "scratch".)

    When deciding which registers should be preserved by a calling convention, you need to balance the needs of the caller against the needs of the callee. The caller would prefer that all registers be preserved, since that removes the need for the caller to worry about saving/restoring the value across a call. The callee would prefer that no registers be preserved, since that removes the need to save the value on entry and restore it on exit.

    If you require too few registers to be preserved, then callers become filled with register save/restore code. But if you require too many registers to be preserved, then callees become obligated to save and restore registers that the caller might not have really cared about. This is particularly important for leaf functions (functions that do not call any other functions).

    The non-uniformity of the x86 instruction set was also a contributing factor. The CX register could not be used to access memory, so you wanted to have some register other than CX be scratch, so that a leaf function can at least access memory without having to preserve any registers. So BX was chosen to be scratch, leaving SI and DI as preserved.

    So here's the rundown of 16-bit calling conventions:

    All calling conventions in the 16-bit world preserve registers BP, SI, DI (others scratch) and put the return value in DX:AX or AX, as appropriate for size.

    C (__cdecl)
    Functions with a variable number of parameters constrain the C calling convention considerably. It pretty much requires that the stack be caller-cleaned and that the parameters be pushed right to left, so that the first parameter is at a fixed position relative to the top of the stack. The classic (pre-prototype) C language allowed you to call functions without telling the compiler what parameters the function requested, and it was common practice to pass the wrong number of parameters to a function if you "knew" that the called function wouldn't mind. (See "open" for a classic example of this. The third parameter is optional if the second parameter does not specify that a file should be created.)

    In summary: Caller cleans the stack, parameters pushed right to left.
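    To see why right-to-left pushing matters for variadic functions, here is a small model. It is a sketch with an invented name (FirstParamDepth): parameters are represented by their 1-based position, and the stack is a std::vector.

```cpp
#include <cstddef>
#include <vector>

// Returns how many slots below the top of the stack the first parameter
// ends up after all param_count parameters have been pushed.
std::size_t FirstParamDepth(std::size_t param_count, bool right_to_left) {
    std::vector<std::size_t> stack;  // back() is the top of the stack
    if (right_to_left) {
        for (std::size_t i = param_count; i >= 1; --i) stack.push_back(i);
    } else {
        for (std::size_t i = 1; i <= param_count; ++i) stack.push_back(i);
    }
    // Search downward from the top for parameter number 1.
    for (std::size_t depth = 0; depth < stack.size(); ++depth) {
        if (stack[stack.size() - 1 - depth] == 1) return depth;
    }
    return stack.size();  // no parameters were pushed
}
```

    Pushed right to left, the first parameter is always at depth zero no matter how many parameters follow it. Pushed left to right, its depth depends on the count, which is exactly what a variadic callee does not know.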

    Function name decoration consists of a leading underscore. My guess is that the leading underscore prevented a function name from accidentally colliding with an assembler reserved word. (Imagine, for example, if you had a function called "call".)

    Pascal (__pascal)
    Pascal does not support functions with a variable number of parameters, so it can use the callee-clean convention. Parameters are pushed from left to right, because, well, it seemed the natural thing to do. Function name decoration consists of conversion to uppercase. This is necessary because Pascal is not a case-sensitive language.

    Nearly all Win16 functions are exported as Pascal calling convention. The callee-clean convention saves three bytes at each call point, with a fixed overhead of two bytes per function. So if a function is called ten times, you save 3*10 = 30 bytes for the call points, and pay 2 bytes in the function itself, for a net savings of 28 bytes. It was also fractionally faster. On Win16, saving a few hundred bytes and a few cycles was a big deal.
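    The savings arithmetic generalizes to any number of call points. A one-line sketch, using the three-byte and two-byte figures quoted above (the function name is made up):

```cpp
// Net bytes saved by callee-clean over caller-clean for one function:
// 3 bytes saved at each call point, 2 bytes paid once in the function.
constexpr int NetBytesSaved(int call_points) {
    return 3 * call_points - 2;
}

static_assert(NetBytesSaved(10) == 28, "matches the ten-call example");
```

    Even a function called from a single place comes out one byte ahead.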

    Fortran (__fortran)
    The Fortran calling convention is the same as the Pascal calling convention. It got a separate name probably because Fortran has strange pass-by-reference behavior.

    Fastcall (__fastcall)
    The Fastcall calling convention passes the first parameter in the DX register and the second in the CX register (I think). Whether this was actually faster depended on your call usage. It was generally faster since parameters passed in registers do not need to be spilled to the stack, then reloaded by the callee. On the other hand, if significant computation occurs between the computation of the first and second parameters, the caller has to spill it anyway. To add insult to injury, the called function often spilled the register into memory because it needed to spare the register for something else, which in the "significant computation between the first two parameters" case means that you get a double-spill. Ouch!

    Consequently, __fastcall was typically faster only for short leaf functions, and even then it might not be.

    Okay, those are the 16-bit calling conventions I remember. Part 2 will discuss 32-bit calling conventions, if I ever get around to writing it.

  • The Old New Thing

    Words I'd like to ban in 2004

    It seems to be fashionable to do a "top words" list this time of year. We have Google 2003 Zeitgeist, Top Yahoo! Searches 2003, Merriam-Webster's Words of the Year for 2003, YourDictionary.com's Top Ten Words of 2003, Lake Superior State University's Banished Words List for 2004; still waiting for the American Dialect Society's choice for Word of the Year for 2003.

    I like LSSU's approach, so here's my list of words I'd like to ban.


    Thank goodness this has faded, but there are still some citations out there. Please don't use it to describe my work. It makes me sound like a dog in a show. (No offense to dogs in shows!)


    Everybody is "the leading this" or "the leading that". Here's my rule: If you say you're the leading XYZ or (even dodgier) "among the leading XYZs", then you have to list at least three companies that are not leaders in the XYZ market. Because if nobody is following you, then you're not really "leading", now, are you.

    And the word I most would like to banish from the English language:

    Ask (as a noun)

    This has taken over Microsoft-speak in the past year or so and it drives me batty. "What are our key asks here?", you might hear in a meeting. Language tip: The thing you are asking for is called a "request". Plus, of course, the thing that is an "ask" is usually more of a "demand" or "requirement". But those are such unfriendly words, aren't they? Why not use a warm, fuzzy word like "ask" to take the edge off?

    Answer: Because it's not a word.

    I have yet to find any dictionary which sanctions this usage. Indeed, the only definition for "ask" as a noun is A water newt [Scot. & North of Eng.], and that was from 1913!

    Answer 2: Because it's passive-aggressive.

    These "asks" are really "demands". So don't guilt-trip me with "Oh, you didn't meet our ask. We had to cut half our features. But that's okay. We'll just suffer quietly, you go do your thing, don't mind us."

  • The Old New Thing

    What happened to DirectX 4?

    If you go through the history of DirectX, you'll see that there is no DirectX 4. It went from DirectX 3 straight to DirectX 5. What's up with that?

    After DirectX 3 was released, development on two successor products took place simultaneously: a shorter-term release called DirectX 4 and a more substantial longer-term release called DirectX 5.

    But based on the feedback we were getting from the game development community, they didn't really care about the small features in DirectX 4; what they were much more interested in were the features of DirectX 5. So it was decided to cancel DirectX 4 and roll all of its features into DirectX 5.

    So why wasn't DirectX 5 renamed to DirectX 4?

    Because there were already hundreds upon hundreds of documents that referred to the two projects as DirectX 4 and DirectX 5. Documents that said things like "Feature XYZ will not appear until DirectX 5". Changing the name of the projects mid-cycle was going to create even more confusion. You would end up with headlines like "Microsoft removes DirectX 5 from the table - kiss good-bye to feature XYZ" and conversations reminiscent of Who's on First:

    "I have some email from you saying that feature ABC won't be ready until DirectX 5. When do you plan on releasing DirectX 5?"

    "We haven't even started planning DirectX 5; we're completely focused on DirectX 4, which we hope to have ready by late spring."

    "But I need feature XYZ and you said that won't be ready until DirectX 5."

    "Oh, that email was written two weeks ago. Since then, DirectX 5 got renamed to DirectX 4, and DirectX 4 was cancelled."

    "So when I have a letter from you talking about DirectX 5, I should pretend it says DirectX 4, and when it says DirectX 4, I should pretend it says 'a project that has since been cancelled'?"

    "Right, but check the date at the top of the letter, because if it's newer than last week, then when it says DirectX 4, it really means the new DirectX 4."

    "And what if it says DirectX 5?"

    "Then somebody screwed up and didn't get the memo."

    "Okay, thanks. Clear as mud."

  • The Old New Thing

    How can a program survive a corrupted stack?

    Continuing from yesterday:

    The x86 architecture traditionally uses the EBP register to establish a stack frame. A typical function prologue goes like this:

      push ebp       ; save old ebp
      mov  ebp, esp  ; establish new ebp
      sub  esp, nn*4 ; local variables
      push ebx       ; must be preserved for caller
      push esi       ; must be preserved for caller
      push edi       ; must be preserved for caller

    This establishes a stack frame that looks like this, for, say, a __stdcall function that takes two parameters.

    .. rest of stack ..
    param2
    param1
    return address
    saved EBP <- EBP
    local1
    local2
    ...
    local-nn
    saved EBX
    saved ESI
    saved EDI <- ESP

    Parameters can be accessed with positive offsets from EBP; for example, param1 is [ebp+8]. Local variables have negative offsets from EBP; for example, local2 is [ebp-8].
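    The offsets follow a simple pattern. The helpers below are a sketch with invented names, assuming every parameter and every local variable occupies exactly one dword.

```cpp
// Byte offsets relative to EBP in an EBP-based frame.
// [ebp+0] holds the saved EBP and [ebp+4] holds the return address,
// so the first parameter starts at [ebp+8].
constexpr int ParamOffsetFromEbp(int param_number) {  // 1-based
    return 4 + 4 * param_number;
}

// Local variables sit just below the saved EBP, starting at [ebp-4].
constexpr int LocalOffsetFromEbp(int local_number) {  // 1-based
    return -4 * local_number;
}

static_assert(ParamOffsetFromEbp(1) == 8, "param1 is [ebp+8]");
static_assert(LocalOffsetFromEbp(2) == -8, "local2 is [ebp-8]");
```

    None of these offsets involve ESP, which is why garbage at the top of the stack does not disturb them.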

    Now suppose that a calling convention or function declaration mismatch occurs and extra garbage is left on the stack:

    .. rest of stack ..
    param2
    param1
    return address
    saved EBP <- EBP
    local1
    local2
    ...
    local-nn
    saved EBX
    saved ESI
    saved EDI
    garbage <- ESP

    The function doesn't really feel any damage yet. The parameters are still accessible at the same positive offsets and the local variables are still accessible at the same negative offsets.

    The real damage doesn't occur until it's time to clean up. Look at the function epilogue:

      pop  edi       ; restore for caller
      pop  esi       ; restore for caller
      pop  ebx       ; restore for caller
      mov  esp, ebp  ; discard locals
      pop  ebp       ; restore for caller
      retd 8         ; return and clean stack

    In a normal stack, the three "pop" instructions match with the actual values on the stack and nobody gets hurt. But on the garbage stack, the "pop edi" actually loads garbage into the EDI register, as does the "pop esi". And the "pop ebx" - which thinks it's restoring the original value of EBX - actually loads the original value of the EDI register into EBX. But then the "mov esp, ebp" instruction fixes the stack back up, so the "pop ebp" and "retd" are executed with a repaired stack.

    What happened here? Things sort of got put back on their feet. Well, except that the ESI, EDI, and EBX registers got corrupted. If you're lucky, the values in ESI, EDI and EBX weren't important and could have survived corruption. Or all that was important was whether the value was zero or not, and you were lucky and replaced one nonzero value with another. For whatever reason, the corruption of those three registers is not immediately apparent, and you end up never realizing what you did wrong.

    Maybe the corruption has a subtle effect (say, you changed a value from zero to nonzero, causing the caller to go down the wrong codepath), but it's subtle enough that you don't notice, so you ship it, throw a party, and start the next project.

    But then a new compiler comes along, say one that does FPO optimization.

    FPO stands for "frame pointer omission"; the function dispenses with the EBP register as a frame register and instead just uses it like any other register. On the x86, which has comparatively few registers, an extra arithmetic register goes a long way.

    With FPO, the function prologue goes like this:

      sub  esp, nn*4 ; local variables
      push ebp       ; must be preserved for caller
      push ebx       ; must be preserved for caller
      push esi       ; must be preserved for caller
      push edi       ; must be preserved for caller

    The resulting stack frame looks like this:

    .. rest of stack ..
    param2
    param1
    return address
    local1
    local2
    ...
    local-nn
    saved EBP
    saved EBX
    saved ESI
    saved EDI <- ESP

    Everything is now accessed relative to the ESP register. For example, local-nn is [esp+0x10].

    Under these conditions, garbage on the stack is much more fatal. The function epilogue goes like this:

      pop  edi       ; restore for caller
      pop  esi       ; restore for caller
      pop  ebx       ; restore for caller
      pop  ebp       ; restore for caller
      add  esp, nn*4 ; discard locals
      retd 8         ; return and clean stack

    If there is garbage on the stack, the four "pop" instructions will restore the wrong values, as before, but this time, the cleanup of local variables won't fix anything. The "add esp, nn*4" will adjust the stack by what the function believes to be the correct amount, but since there was garbage on the stack, the stack pointer will be off.

    .. rest of stack ..
    param2
    param1
    return address
    local1
    local2 <- ESP (oops)

    The "retd 8" instruction now attempts to return to the caller, but instead it returns to whatever is in local2, which is probably not valid code.
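    Both epilogues can be replayed in a toy machine to watch the difference. This is a sketch under stated assumptions: the stack is a std::vector with one slot per dword, the function has two parameters and two locals, and all register values, addresses, and names (Cpu, RunWithFramePointer, RunWithFpo) are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kReturnAddress = 0xAAAA;
constexpr std::uint32_t kLocal1 = 0x1111, kLocal2 = 0x2222;
constexpr std::uint32_t kGarbage = 0xBAD;

// Toy 32-bit stack machine: the stack grows toward index 0.
struct Cpu {
    std::vector<std::uint32_t> stack = std::vector<std::uint32_t>(64, 0);
    std::size_t esp = 64;  // index of the current top-of-stack slot
    std::uint32_t ebp = 0xE0, ebx = 0xB0, esi = 0x50, edi = 0xD0;  // caller's values

    void push(std::uint32_t v) { stack[--esp] = v; }
    std::uint32_t pop() { return stack[esp++]; }
};

// Caller side of a two-parameter __stdcall call.
static void CallerPushes(Cpu& cpu) {
    cpu.push(2);               // param2
    cpu.push(1);               // param1
    cpu.push(kReturnAddress);  // pushed by the "call" instruction
}

// EBP-framed callee. Returns the address that the final "ret" jumps to.
std::uint32_t RunWithFramePointer(Cpu& cpu, int garbage_dwords) {
    CallerPushes(cpu);
    cpu.push(cpu.ebp);                              // push ebp
    cpu.ebp = static_cast<std::uint32_t>(cpu.esp);  // mov ebp, esp
    cpu.esp -= 2;                                   // sub esp, 2*4
    cpu.stack[cpu.ebp - 1] = kLocal1;
    cpu.stack[cpu.ebp - 2] = kLocal2;
    cpu.push(cpu.ebx); cpu.push(cpu.esi); cpu.push(cpu.edi);
    for (int i = 0; i < garbage_dwords; ++i) cpu.push(kGarbage);  // mismatch residue
    cpu.edi = cpu.pop(); cpu.esi = cpu.pop(); cpu.ebx = cpu.pop();
    cpu.esp = cpu.ebp;                              // mov esp, ebp (repairs ESP)
    cpu.ebp = cpu.pop();                            // pop ebp
    std::uint32_t target = cpu.pop();               // retd ...
    cpu.esp += 2;                                   // ... 8
    return target;
}

// FPO callee: same function, but everything is relative to ESP.
std::uint32_t RunWithFpo(Cpu& cpu, int garbage_dwords) {
    CallerPushes(cpu);
    cpu.esp -= 2;                                   // sub esp, 2*4
    cpu.stack[cpu.esp + 1] = kLocal1;
    cpu.stack[cpu.esp] = kLocal2;
    cpu.push(cpu.ebp); cpu.push(cpu.ebx); cpu.push(cpu.esi); cpu.push(cpu.edi);
    for (int i = 0; i < garbage_dwords; ++i) cpu.push(kGarbage);  // mismatch residue
    cpu.edi = cpu.pop(); cpu.esi = cpu.pop(); cpu.ebx = cpu.pop(); cpu.ebp = cpu.pop();
    cpu.esp += 2;                                   // add esp, 2*4 (no repair)
    std::uint32_t target = cpu.pop();               // retd ...
    cpu.esp += 2;                                   // ... 8
    return target;
}
```

    With two dwords of garbage, the EBP-framed version still returns to kReturnAddress but leaves garbage in EDI and ESI and the caller's old EDI in EBX. The FPO version "returns" to kLocal2 instead.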

    So this is an example of where optimizing your code reveals other people's bugs.

    Monday, I'll give a much more subtle example of something that can go wrong if you use the wrong function signature for a callback.
