Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

What's wrong with this code, part 21, a psychic debugging example

What's wrong with this code, part 21, a psychic debugging example

Rate This
  • Comments 48

Over the weekend, one of the developers in my group sent me some mail - he was seeing one of the registers in his code getting corrupted across a procedure call.  He was quite surprised to see this, and asked me for any suggestions.

With the help of the info he gave me, I was able to figure out what had gone wrong with his code, and I realized that it'd make a great "what's wrong with this code" example.

There are three parts to the code associated with this "what's wrong".  The first is an interface definition:

class IPsychicInterface
{
public:
    virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
};

Next, you have a tiny test application:

int _tmain(int argc, _TCHAR* argv[])
{
    register int value1 = 1;
    IPsychicInterface *psychicInterface = GetPsychicInterface();
    register int value2 = 2;

    psychicInterface->DoSomeOperation(argc, argv);
    assert(value1 == 1);
    assert(value2 == 2);
    return 0;
}

The failure  happened when the caller returned from psychicInterface->DoSomeOperation - upon the return, the ESI register, which is supposed to be preserved got trashed.  Further debugging showed that the reason that ESI was trashed was that the stack was imbalanced after the call to DoSomeOperation.

There's one more piece of information that I was given that let me immediately realize the root cause of the problem.

 

I know that if I include that information, what went wrong should be blindingly obvious, so I'm going to be mean and ask you to tell me what the one additional piece of information was.   The reason that the other developer in my group didn't find it was simply because he was looking at too much data - if I had pointed out that one piece of additional data, he'd have instantly figured it out too.

 

So the "answer" to this part of the "What's wrong" problem is "What single additional piece of information was I given that made this problem simple to solve?"

  • The additional piece of information may be it works if the developer makes the definition 'register int value=1' global. In most of the compilers the stores to local register variables are clobbered because the assemble code in the calling function may use this same register(the availability of the register depends on dataflow analysis of the compiler so local registers are reused) for some other purpose.

    one suggestion would be to find which registers are saved and restored during a function call smilar to %EBP ,%ESP (which are the local frame and stack pointers), don't store any thing to %EBP,%ESP your stack will be corrupted :)

  • Does the calling convention on the implementation of DoSomeOperation() match the interface definition?

  • Although slightly off topic, Vamsi's comment reminded me of a Michael Abrash article at http://www.ddj.com/184405848 -

    "Many people don't realize that under Windows it's safe to use ESP as a general-purpose register because the OS switches to a system stack when fielding interrupts and context switching. (Note, though, that the code is no longer multithread-safe unless thread-local storage is used because ESP must be stored somewhere other than the stack so it can be restored at the end of the routine.)"

    Health warning - messing around with esp is dangerous, and has to be treated with extreme care - however in certain situations (such as what Michael was facing) it may be worthwhile.

    It's tempting to try this - does anyone have any experience with this approach? Is the behaviour the same across versions of Windows?

  • Sounds like vtable ordering was changed or a function parameter was added to the implementation of an interface without changing the caller, and whatever mechanism that the caller used to locate the interface (i.e. the IID if this is a COM interface) wasn't changed for the new interface revision.

    For example, this might be a COM interface, and somebody got lazy and didn't change the IID when they changed the interface, and now instead of failing to create the interface, callers blow up in bad ways (if the call is in-proc) or get mysterious call failures (if the call is marshalled).

    - S

  • He told you that the class that implemented the interface was exported from a DLL.  That changed the calling convention from __cdecl to __stdcall.

  • The calling convention.

    It wasn't specified as part of the method declaration. If either GetPsychicInterface() or main() was compiled using different default calling conventions (e.g. one _cdecl, the other _stdcall) then the stack will go nuts.

  • I'm probably just shooting in the dark on this, but the lack of a prototype for GetPsychicInterface() bothers me.  What if this is a 64-bit system, and the compiler defaults the return type to "int"?

    'Course that wouldn't be the problem here, 'cause you'd very likely crash on the call to DoSomeOperation--unless you just got lucky, or unlucky depending on your point of view.

  • I'd guess that the piece of information is that GetPsychicInterface() is defined in a different DLL than the application.  Then you would go on to assume that the app was using wide chars and the DLL was not, or vice versa,

  • "What single additional piece of information was I given that made this problem simple to solve?"

    The threat model? :)

  • dewb: DoSomeOperation uses the thiscall calling convention as do all C++ methods.

  • Adam: Good answer, unfortunately not right - this isn't a threat modeling problem :).

    Ben: This all happens in a single executable.

    It's not a calling convention mismatch, because the calling convention is specified by the fact that this is a pure virtual method on a function.

    Several people are close to the root cause, but I'll go back to my original question: What was the one piece of information that's necessary to confirm the diagnosis.

  • The information that would tip me off is that it only happens in an optimized build or when frame pointer omission is enabled.

    I'm surprised this didn't expose itself in a debug build, because the enhanced stack checking added in VS2002 will ordinarily flag errors like this.

  • I'd guess that the IPsychicInterface implementation returned BOOL rather than bool. (Although this requires a C-style cast too and missing the actual inheritance from the implementation.)

  • James, that wouldn't cause a stack overrun (the compiler still returns 4 bytes of data when returning a bool, it's just that only the low 8 bits are used).

    Phaeron: The problem occurs on both optimized and non-optimized builds.  And FPO doesn't change the outcome.  There was a debug/retail mismatch (the OS was retail, and the binary was debug), but the problem ocurred even with the retail bits.

  • My guess would be that multiple inheritance is used and there's an incorrect cast, which causes DoSomeOperation() to call a different method (with a different signature) than actually intended.

Page 1 of 4 (48 items) 1234