Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

What's wrong with this code, part 10 - the answers

What's wrong with this code, part 10 - the answers

  • Comments 15

Yesterday's article was a bit of a trick question, but was a real world example.  Our group encountered this in some code we were testing last week (in some pre-production code - it was not part of any product).

Somewhat surprisingly, it turns out that the code, as written, wasn't incorrect.  The hr = ERROR_NOT_SUPPORTED line was in fact correct - the code was using the fact that it was a success error code as a signal to a higher level component.  As many of you pointed out, that looked like an absolute error, and I was bitten by it as well: I was making an unrelated modification to the routine and noticed the hr = ERROR_NOT_SUPPORTED; expression and fixed it to be hr = HRESULT_FROM_WIN32(ERROR_NOT_SUPPORTED);

But that turned out to be a HORRIBLE mistake.  Even though the code had passed my unit tests, and several other developers had tested the code, one of our testers encountered a fairly reliable test failure on his machines.  One of the developers in our group (after spending a long day debugging) finally chased the problem down to this change.  It turns out that on certain audio cards, PerformAnOperation() was returning a failure (but only on those audio cards).

And my change propagated a failure code out of the routine (which used to be a SUCCESS return, with an informational error code).  And, because I'd changed the SUCCESS error code to a FAILED error code, the caller of this routine didn't handle the error correctly.

So what was wrong?  The mistake was that the semantics associated with returning a success return code was not called out anywhere in the code.  The code was correct (or rather, functioning as intended), but some of the assumptions associated with the code were not documented.  So someone (me) came along later and "fixed" the code and thus introduced a bug.  Correcting the function was as simple as documenting the assumption...

// ----------------------------------------------------------------------
// Function:
// CThing1::OnSomethingHappening()
//
// Description:
// Called when something happens
//
// Return:
// S_OK if successful.  If an error occurs while performing an operation, returns a success
// error code with the error number set to ERROR_NOT_SUPPORTED
//
// ----------------------------------------------------------------------
HRESULT CThing1::OnSomethingHappening()
{
    HRESULT hr;
        :

        :
    <Do Some Stuff>
        :
   
    // Perform some operation...
    hr = PerformAnOperation();
    if (FAILED(hr))
        hr = ERROR_NOT_SUPPORTED;  // Swallow the error from PerformAnOperation and return an indicator to the caller.
    IF_FAILED_JUMP(hr, Error);

Exit:
    return hr;

Error:
    goto Exit;
}

If that had been done, it would have saved us a great deal of pain.

And, as I mentioned above, this was NOT a problem in any production systems.  And we're reviewing our unit tests to see how we can improve them to ensure that problems like this issue get caught even earlier.

Ok, Kudos and comments...

Michael Ruck spotted the root cause of the error: That the failure was not specified, and thus the next poor schlump (me) coming along into the code fixes what appears to be a simple bug and thus breaks something.

Some other comments: Skywing was, of course the first person to notice the ERROR_NOT_SUPPORTED (he/she was also the first person to post) - I sort-of used him as bait because I intentionally didn't comment that the hr = ERROR_NOT_SUPPORTED was not an error - upon reflection, I realize that that was a mistake.

A number of people were concerned about the IF_FAILED_JUMP macro, that's just a convention we use in my group, I should have edited it out since it complicated the problem, that was my bad.

Another group of people were concerned about the goto Error/goto Exit paradigm.  This is actually a convention I picked up several years ago - you move all the error handling to the end of the function (out-of-line) and then jump back to a common "cleanup" or "exit" label.  That allows the "normal" case to be in-line and isolates the error handling logic to one location.

Skywing also pointed out that returning ERROR_NOT_SUPPORTED STILL isn't correct, the facility should have been set to FACILITY_WIN32, he's absolutely right.

A number of other people suggested that the routine should be changed to return either BOOL or bool, but (as was immediately pointed out) that throws away error code information.

And again, several people mentioned throwing exceptions, the code's neutral w.r.t. exception handling (although I disagree with the whole "throw an exception to return an error" paradigm - that's a different issue).

 

  • Isnt this an example of how horrendous a system based on error codes can get.

    With exceptions, if it succeeds it succeeds. If there is an exception it has to be handled, without exception
  • indranil:

    Exceptions wouldn't have solved any of these problems, the only thing they are solving is that they force the caller to place the code into a try/catch block or to propagate the exception into higher levels.

    Imagine this scenario in C#, were you can't specify exceptions thrown by a function. You'd document that you're throwing exception A only or nothing. And now you're code comes in and throws (or even propagates from calls deeper in the callstack) exception B - Your client still isn't prepared for this situation.

    I believe exceptions don't solve anything. They have some advantages, but the costs outweigh the benefit by far - just look at the costs to throw an exception in one of the managed or virtual machine environments. None of them is really speedy in doing so - they're a performance killer as no runtime is optimized for the worst case.

    Larry:

    I was thinking that ERROR_NOT_SUPPORTED could indicate a success state, but was caught in the thought that it's use indicates wrong state as I've written in my comment.

    This was a nice post - forcing the reader to think out of the box.
  • Michael,
    Thank you. We so often concentrate on the code bits and ignore the non-code bits, but they all go together.

    That's actually why I put the link in to my "programming with style" entries in as a hint.
  • There are some APIs that return HRESULTs but are documented as something like "returns S_OK on success". Since becoming the WinVerifyTrust tester I've taken that to heart. When that's what MSDN or the header says, I check for S_OK and treat anything else as failure. I don't use SUCCEEDED in those cases.
    The reason I do this has to do with the garbage lagacy code underneath WinVerifyTrust that sometimes returns an actual HRESULT, sometimes does HRESULT_FROM_WIN32, and sometimes just bubbles up an NTSTATUS or a Windows error DWORD instead of an actual HRESULT. Yeah - it's a mess, but the header is the contract and the header specifically tells callers to check vs S_OK, which works.

    Given all of that:
    Without knowing what PerformAnOperation can return, I have to assume that it might return a success code that is not S_OK. So I'm sticking by what I said yesterday when I claimed that the bug is that OnSomethingHappening doesn't necessarily return S_OK on success as it claims it does.
  • Drew, you're right (on all points). I also try to only check for S_OK (but I get pushback from the rest of my group).

    And you're right, the issue was that the contract for OnSomethingHappening as documented by the headers was incorrect.

    Sorry if I missed that - I got more responses yesterday than I could do justice to.
  • The comments should probably be changed, then. The part about S_OK on success should be removed/replaced/whatever.

    Believe it or not, if OnSomethingHappening had swallowed any additional status info and returned S_OK for all successes in the original version you posted yesterday I would have considered this function to have been correct. Poorly designed, certainly, and almost certainly I'd complain to the dev about it, but it would work as promised.

    I'm not claiming my statements reflect the way that HRESULTs are suposed to work. My comments here are all heavily biased by my experience (horror!) with real-world, brittle APIs.

    I wonder if I'm getting too jaded to be a tester.
  • > A number of people were concerned about the
    > IF_FAILED_JUMP macro, that's just a
    > convention we use in my group,

    A convention to do meaningful things is good.

    A convention to add something that is syntactic fluff but confuses the hell out of readers just for the purpose of conforming to ritual^H^H^H^H^H^Hconvention is not good.

    > I should have edited it out since it
    > complicated the problem, that was my bad.

    In the posting or in the original code? If the convention requires conforming to ritual then you need to add a comment to the ritual saying "This can never be executed (because errors were swallowed above) but our convention requires syntactically putting it here anyway."
  • Why is a constant with a name beginning with ERROR_ a success result? I'd expect everything called ERROR_* to be a failure.
  • I don't think the fix is entirely changing the comments.

    I would probably have changed the error code assignment line to:

    hr = MAKE_HRESULT(SEVERITY_SUCCESS, FACILITY_WIN32, ERROR_NOT_SUPPORTED);

    that way - the next "schlump" can "get it" without knowing he needs to read the comments.

    I totally agree with your sentiments on exceptions btw.
  • "A number of other people suggested that the routine should be changed to return either BOOL or bool, but (as was immediately pointed out) that throws away error code information."

    How does only returning S_OK or ERROR_NOT_SUPPORTED convey any more information than a boolean would? If other return values are also valid, then the documentation for the function is still incomplete.
  • ERROR_NOT_SUPPORTED is not a success code. It is a number, whose value is unknown and not guaranteed. It is not <0 or =0 or >0. If Microsoft changed ERROR_NOT_SUPPORTED to 0xfedcba98 (and NULL to 0xeaeaeaea, come to that) and recompiled Windows, Windows should still work... as long, of course, as application software vendors followed suit.
    And it is pretty simple to bulletproof an application so that is not crucially dependent on the half-understood policies of another software vendor.
  • Larry wrote: "A number of other people suggested that the routine should be changed to return either BOOL or bool, but (as was immediately pointed out) that throws away error code information."

    Isn't error code information being thrown away anyway? The function reduces all cases to S_OK or ERROR_NOT_SUPPORTED. That sounds pretty boolean to me.

    (From a readability standpoint, I still like having something more explicit that a bool, but without knowing more about the domain of the problem, I don't really understand why all errors are mapped to ERROR_NOT_SUPPORTED.)

    If ERROR_NOT_SUPPORTED is a successful return, then IF_ERROR_JUMP will never fire, and you've created unreachable code. Yes, the compiler can optimize it away, but unreachable code should be removed from the source to eliminate confusion and compiler warnings. (Please tell me you haven't lowered your compiler warning level.)

    Macros that embody an if statement without a braces or an else clause will inevitably be misused and cause a mismatching else problem.
  • It really is horrible that you criticise exceptions and then do someone similar using macros and gotos! Readability conventions that use gotos?! What year is it?
  • You're 100% right Asd. I should have expunged the IF_FAILED before I posted it because it complicates the discussion.

    However I firmly believe that using "goto" dramatically enhances the readability of code IF it's used in a limited way. And the contortions you need to go through to avoid goto's can make code be totally unreadable.

    When was the last time you saw:
    do
    {
    ...<something>
    ...if (FAILED(result))
    ......break;
    } while (0);
    ?

    Someone inserted a gratuitous do/while loop just to enable the use of break as a replacement for goto.
  • PingBack from http://paidsurveyshub.info/story.php?title=larry-osterman-s-weblog-what-s-wrong-with-this-code-part-10-the

Page 1 of 1 (15 items)