Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

What's wrong with this code, Part 21 - A psychic debugging example: The missing piece

What's wrong with this code, Part 21 - A psychic debugging example: The missing piece

Rate This
  • Comments 18

As I mentioned yesterday, one of the other developers in my group had hit a sticky problem, and he asked me for my opinion on what was going wrong.

There were 3 pieces of information that I needed to use to diagnose the problem, I gave you two of them yesterday:

The interface:

class IPsychicInterface
{
public:
    virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
};

And the test application:

int _tmain(int argc, _TCHAR* argv[])
{
    register int value1 = 1;
    IPsychicInterface *psychicInterface = GetPsychicInterface();
    register int value2 = 2;

    psychicInterface->DoSomeOperation(argc, argv);
    assert(value1 == 1);
    assert(value2 == 2);
    return 0;
}

Originally the problem was that the ESI register was being trashed.  Since the C and C++ calling convention requires that the ESI register be preserved and the ESI register was trashed, that narrowed down the failure to three possible causes to the problem:

  • Somewhere inside DoSomeOperation, there was a stack overflow that caused the saved version of ESI to be corrupted.  This was actually my first thought.
  • Somewhere inside DoSomeOperation, there was a stack imbalance, which would cause garbage to be restored when the ESI register was popped off the stack.  Normally the compiler catches these errors, so I originally discounted this possibility.
  • There was a horrible compiler bug or OS bug which caused the register to be trashed (which is extraordinarily unlikely (but has happened)).

The other developer had chased the problem down further and realized that there was a stack imbalance on the call to DoSomeOperation.  There are basically two things that can cause a stack imbalance, most of the people who left comments in the original post caught one of them, some caught the other:

  • A calling convention mismatch.
  • A parameter declaration mismatch.

But I didn't have enough information to figure out which of the two it was.  That's when he gave me the final piece that let me accurately figure out what was going wrong.

The final piece was the last bit of assembly language in the DoSomeOperation function:

0040106B  pop         edi 
0040106C  pop         ebx 
0040106D  mov         al,1
0040106F  pop         esi 
00401070  ret         0Ch 

Now that you have the last piece, what was the bug?  Be specific - we already know that the problem is a stack imbalance, but what's the root cause?

For a bonus, why didn't the compiler catch it?

  • The ret 0Ch moves the stack pointer 12 bytes to clear up the args. The function only takes 8 bytes of parameters (the this pointer gets passed in ecx). I believe that means there's a parameter mismatch.

    As for how it compiled I can only assume that there's two versions of the IPsychicInterface - one that is being compiled with and another that is being linked with.

  • If it's within the same program, I would say that two source files had different compiler settings.  The compiler wouldn't catch the call at compile time if it's a virtual function call as that doesn't require a linker lookup against the decorated symbol name (which would be wrong if the calling convention was mismatched across two files).

    Another possibility is that one source file was compiled with different padding or class layout options than another source file.

    Being in the same problem, it would seem to be that it's a violation of the one definition rule...

  • Wow, that last sentence did not come out right at all :)

    "Being in the same program, it would seem to me that it's a violation of the one definition rule..."

  • Larry, have you answered the question of whether the interface, the class implementing the interface, and the application using the interface were all compiled together yet?

    I can't say I actually understand the problem (and the assembly snippet made it more opaque rather than less to me) but it seems like if the interface were compiled twice, once with one calling convention and once with the other, and then the application and the implementing class were compiled against those different copies, it might cause this kind of problem?

  • Darn it all, Skywing.  I was planning on being really geeky and mentioning the ODR and you went ahead and spoiled it for me :).

    Nice job.

  • LOL...

    I was looking for really technically obscure reasons and missed the obvious one of someone doing something REALLLLLY dumb.  I hope it really wasn't the case of the interface being cut/copied/pasted to two different locations.

    In a big and complex code base, I could see someone having this problem similar to the issues you see when two modules use different settings for such things as WINVER.

    Two definitions, different vtable ordering, the wrong routine getting called.

  • "thiscall" strikes again!

  • Barry: How is this an issue with thiscall?

  • Since the type of parameter is _TCHAR, my guess is that one or the other compilation units was compiled with _UNICODE defined.  The compilation unit that had it defined saw _TCHAR == wchar_t and the one that didn't saw _TCHAR == char.

    Originally I didn't see how that would cause a stack problem because it is an array of pointers which is still 4 bytes no matter wchar_t or char.

    The only thing I can think of is if DoSomeOperation used the ATL character conversion functions, those use stack space.  And those would think that the pointer coming in contained the wrong character set and that might stomp on the stack a bit.

  • I am not quite sure if this is what skywing was getting at, but I was trying to see if it was possible to define an overload of the DoSomeOperation method that would take 3 or more parameters, with the 3rd taking a default value. In which case, the compiler might not complain about the call in the code above (viz default value assigned), but such an overload does not satisfy the implementation for the virtual DoSomeOperation method (you're left with an abstract class)...

    So I am not sure. Sorry for rambling.

  • From the disassembly, it's obvious DoSomeOperation() is using a calling convention where the callee cleans up the stack.

    Let assume _tmain is using the default C calling convention, where the caller cleans up the stack. The fact that the program run means it linked OK, so that means GetPsychicInterface() most likely got the interface from a DLL or something, where the interface was compiled and linked using a non-C calling convention.

  • Was the IPsychicInterface object implemented in C (rather than C++)? This is possible (although a little unorthodox these days). If so, then the implementation of DoSomeOperation() is likely using the wrong calling convention.

  • Charlie, nope.  Skywing nailed it the first post: it was a violation of the one definition rule.

  • Nitpick:

    "the last bit of assembly language" indicates that there will be a stack imbalance _after_ returning from DoSomeOperation. That wouldn't show as a trashed ESI upon returning from DoSomeOperation.

    There must be some other imbalance (ODR violation or otherwise) inside of DoSomeOperation.

    "the last bit of assembly language" does indicate that there is ODR violation in the program, so it's likely that the other imbalance will also be caused by ODR in some other function called from inside of DoSomeOperation, but not certain.

  • I think this is a mismatch between stdcall and thiscall.

    The function is implemented as stdcall and the caller is assuming the thiscall.

    If for the for implementer the calling convention is defined (in some header file) as stdcall and for the user it is defined as thiscall (in a different header file or as a compiler setting) it will keep the compiler happy.

Page 1 of 2 (18 items) 12