Larry Osterman's WebLog

Confessions of an Old Fogey

Error Codes, again...

One of the tech writers in my group just asked a question about documenting error codes.

I've written about my feelings regarding documenting error codes in the past, but I've never actually written about what it means to define error codes for your component.

The critical thing to recognize about error codes is that they are all about diagnosability. They're about providing enough information for someone to figure out the cause of a problem.  This is true whether you use error codes or exceptions, btw - they're all mechanisms for diagnosing failures.

Error codes serve two related purposes.  You need to be able to provide information to the developer of an application that allows that developer to diagnose the cause of a failure (or to let the developer of an application determine the appropriate corrective action to take in the event of a failure).  And you need to be able to provide information to the user of the application that hosts your control to allow them to diagnose the cause of a failure.

The second reason above is why there are APIs like FormatMessage which allow you to determine a string version of system errors.  Or waveOutGetErrorText, which does the same thing for the multimedia APIs (there's a similar mixerGetErrorText, etc).  These APIs allow you to get a human readable error string for any system error.
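As a rough analog outside Win32, Python's standard library offers the same kind of lookup through os.strerror, which turns a numeric system error code into a human readable string. The sketch below assumes a POSIX-style errno value rather than a Win32 error, and the helper name describe_error is hypothetical:

```python
import errno
import os

def describe_error(code: int) -> str:
    """Return the numeric code together with the platform's text for it."""
    return f"error {code}: {os.strerror(code)}"

# The exact wording of the text depends on the platform.
print(describe_error(errno.ENOENT))
```

The code alone is what a program checks; the string is what a person reading a log or a message box needs.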

One of the basic requirements for any interface is that you define the errors that will be returned by that interface.  It's a fundamental part of the contract (and every interface defines a contract).

Now your definition of errors can be simple ("Returns an HRESULT which defines the failure") or it can be complex ("When the frobble can't be found, it returns E_FROBLE_NOT_FOUND").  But you need to define your error codes.

When you define your error codes, you essentially have three choices:

  1. You can choose to simply let the lower level error code bubble up to your caller.
  2. You can choose to define new error codes for your component.
  3. You can completely define the error codes that your component returns.

There are pros and cons to each of these choices.

The problem with the first choice is that oftentimes the low level error code is meaningless.  Or worse, it may be incorrect.  A great example of this occurs if you mess up the AEDebug registry key for an application.  The loader will attempt to access this registry key, and if there is an error (like an entry not found), it will bubble the failure up to the caller.  Which can result in your getting an ERROR_FILE_NOT_FOUND error when you try to launch your application, even though the application is there - the problem is that the AEDebug registry key pointed to a debugger that wasn't found.  But bubbling the failure up has killed diagnosability - the actual problem had to do with the parsing of a registry key, but the caller has no way of knowing that.  This is also yet another example of Joel's Law of Leaky Abstractions - the lower level information leaked to the higher level.
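To make the loss of context concrete, here is a minimal sketch of the bubbling pattern. It is not the real loader: the registry is modeled as a plain dict, and read_debugger_entry and launch_bubbling are hypothetical names. Only the two numeric values match the Win32 constants.

```python
ERROR_SUCCESS = 0
ERROR_FILE_NOT_FOUND = 2  # same numeric values as the Win32 constants

def read_debugger_entry(registry):
    """Hypothetical lower layer: look up AEDebug in a dict standing in for the registry."""
    if "AEDebug" not in registry:
        return ERROR_FILE_NOT_FOUND, None
    return ERROR_SUCCESS, registry["AEDebug"]

def launch_bubbling(app_path, files_on_disk, registry):
    """Hypothetical upper layer that passes lower-level codes through unchanged."""
    if app_path not in files_on_disk:
        return ERROR_FILE_NOT_FOUND
    status, _debugger = read_debugger_entry(registry)
    if status != ERROR_SUCCESS:
        return status  # bubbled up as-is
    return ERROR_SUCCESS

# Two unrelated failures: the application is missing, or the registry entry is missing.
missing_app = launch_bubbling("foo.exe", set(), {"AEDebug": "dbg.exe"})
broken_key = launch_bubbling("foo.exe", {"foo.exe"}, {})
print(missing_app == broken_key)  # True
```

Both calls return the same code, so the caller cannot tell which lookup actually failed.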

The problem with the second choice is actually that it hides the information from the lower level abstraction.  It's just the opposite - sometimes you WANT the abstraction to leak, because there is often useful information that gets lost.  For instance, in the component on which I'm working, RPC_X_ENUM_VALUE_OUT_OF_RANGE, RPC_X_BYTE_COUNT_TOO_SMALL, and a couple of other RPC errors are mapped to E_INVALIDARG.  While E_INVALIDARG is reasonably accurate (these are all errors in argument), RPC returned specific information about the failure, and mapping the error away masks it.  So there has been a loss of specificity about the error, which once again hinders diagnosability - it's harder to debug the problem from the error.  On the other hand, the errors that are returned are domain specific.
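The mapping can be sketched as a simple lookup table. This is an illustration only: the enum members borrow the names of the real constants but not their numeric values, and the fallback to E_FAIL for unmapped errors is an assumption, not something the component is known to do.

```python
from enum import Enum, auto

class RpcError(Enum):
    # Names mirror the RPC constants; the values are placeholders.
    ENUM_VALUE_OUT_OF_RANGE = auto()
    BYTE_COUNT_TOO_SMALL = auto()

class ComponentError(Enum):
    E_INVALIDARG = auto()
    E_FAIL = auto()

_RPC_TO_COMPONENT = {
    RpcError.ENUM_VALUE_OUT_OF_RANGE: ComponentError.E_INVALIDARG,
    RpcError.BYTE_COUNT_TOO_SMALL: ComponentError.E_INVALIDARG,
}

def translate(rpc_error: RpcError) -> ComponentError:
    """Map a lower-level error to the component's own error (assumed fallback: E_FAIL)."""
    return _RPC_TO_COMPONENT.get(rpc_error, ComponentError.E_FAIL)

# Two distinct causes collapse into one domain-specific code.
print(translate(RpcError.ENUM_VALUE_OUT_OF_RANGE) is translate(RpcError.BYTE_COUNT_TOO_SMALL))  # True
```

Once the translation has happened, nothing in the returned value says which of the two RPC errors occurred.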

The third choice (locking down the set of error codes returned) is what was done in my linked example.  The problem with this is that it locks you into those error codes forever.  You will NEVER have an opportunity to change them, even if something changes underneath.  So when the time comes to add offline storage to your file system, you can't add a "tape index not found" error to the CreateFile API because it wasn't one of the previously enumerated error codes.

The first is a recipe for confusion, especially when the lower level error codes apply to another domain - what do you do if CreateThread returns ERROR_PATH_NOT_FOUND?  The third option is simply an unmitigated nightmare for the long term viability of your system.

My personal choice is #2, even with the error hiding potential.  But you need to be very careful to ensure that your choice of error codes is appropriate - you need to ensure that you provide enough diagnostic information for a developer to determine the cause of the failure while retaining enough domain specific information to allow the user to understand the cause of the failure.
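One way to get both halves of that balance with plain error values is to return the domain-specific code together with the lower-level code it was derived from. The sketch below is hypothetical throughout: ComponentFailure and translate_rpc_failure are made-up names, the codes are carried as strings for readability, and the E_FAIL fallback is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ComponentFailure:
    code: str                     # domain-specific code the caller acts on
    detail: Optional[str] = None  # lower-level code kept purely for diagnosis

_ARGUMENT_ERRORS = {"RPC_X_ENUM_VALUE_OUT_OF_RANGE", "RPC_X_BYTE_COUNT_TOO_SMALL"}

def translate_rpc_failure(rpc_error: str) -> ComponentFailure:
    """Translate to a domain code without discarding the original error name."""
    code = "E_INVALIDARG" if rpc_error in _ARGUMENT_ERRORS else "E_FAIL"
    return ComponentFailure(code=code, detail=rpc_error)

failure = translate_rpc_failure("RPC_X_BYTE_COUNT_TOO_SMALL")
print(failure.code, failure.detail)
```

Callers that only care about the contract look at code; someone debugging the failure can still read detail.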

Interestingly enough, CLR Exceptions handle the leaky abstraction issue neatly by defining the Exception.InnerException property, which allows you to retain the original cause of the error.  This allows a developer attempting to diagnose a failure to see the ACTUAL cause of the failure, while allowing the component to define a failure that's more germane to its problem domain.
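Python has a comparable mechanism: raising with "from" stores the original exception in the new exception's __cause__ attribute, which plays roughly the role InnerException plays in the CLR. The sketch below reuses the AEDebug scenario; the exception classes and function names are hypothetical, and the registry is again a plain dict.

```python
class RegistryLookupError(Exception):
    """Hypothetical lower-level failure."""

class LaunchError(Exception):
    """Hypothetical failure in the launcher's own problem domain."""

def read_debugger_path(registry):
    if "AEDebug" not in registry:
        raise RegistryLookupError("AEDebug entry not found")
    return registry["AEDebug"]

def launch_chained(app_path, registry):
    try:
        read_debugger_path(registry)
    except RegistryLookupError as exc:
        # "from exc" keeps the original failure reachable as __cause__.
        raise LaunchError(f"could not start {app_path}") from exc

try:
    launch_chained("foo.exe", {})
except LaunchError as err:
    print(err)            # the domain-level description
    print(err.__cause__)  # the actual underlying cause
```

The caller handles one domain-level exception type, and the underlying cause is still available to whoever has to diagnose the failure.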

  • In my programming experience I have found that the simplest, best and easiest way to report an obscure error (such as would trigger an exception) is a string which describes all the aspects of the problem that it is reasonable to gather. Why?

    (1) Error codes convey very little information; their only advantage is that they are easy for programs to check. But programs will only check those error codes which they expect, which means, those error codes which are expected results with specific fallback action. This is a very small portion of possible error codes and can be handled by providing the necessary programmatic information as part of a method result.

    (2) With errors that aren't common enough to warrant special handling, what you need to fix them is context: where did the error happen and what exactly was it about? Error codes store no explanation and no context. With strings, each component relaying the error can just prepend its explanation and interpretation of the error, like this (whole text thought up):

    "The following error occurred while processing socket rule #13: The DNS name could not be looked up: Error code 54: The network is unavailable."

    In summary, there are two kinds of errors:

    (1) to be interpreted by programs, in which case it shouldn't be an error, it should be a well-defined result;

    (2) to be interpreted by humans (system admins / developers), in which case the error should be a string containing all the necessary information.

    It's as simple as this. In all my (long) experience as a C++ programmer, I have hardly found any use for exceptions any more structured than containing simple and straightforward strings.
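The prefixing scheme described in the comment above can be sketched in a few lines. The helper name with_context is hypothetical, and the message text is the commenter's own made-up example:

```python
def with_context(context: str, inner: str) -> str:
    """Prepend one layer's explanation to the message it received from below."""
    return f"{context}: {inner}"

# Innermost layer first; each relaying layer adds its own prefix.
message = "Error code 54: The network is unavailable."
message = with_context("The DNS name could not be looked up", message)
message = with_context("The following error occurred while processing socket rule #13", message)
print(message)
```

Each layer only needs to know what it was trying to do; it never has to interpret the text it was handed.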
  • This goes blatantly off-topic but without a more proper blog to post to..

    I bumped into a very interesting WinHEC PowerPoint when googling around. It does not seem to be up on the WinHEC page yet, though. And I can see why - it immediately raises some questions for the random viewer:

    What is WASAPI and is there a better name for this?

    Is the MFT "plugin" architecture the "Media Foundation" or are these separate things? One could presume MFT is abbreviation of Media Foundation.

    These new slides still didn't tell a clear story for video. What is the DirectShow story? Obviously it is going to be there in LH, but are there new or easier ways coming to do what currently requires writing, say, a source filter, and where can one find first-hand information?

    Needless to say, I'd love to see a blog from a person who works with this stuff, even if it only had a post once a month. Too much to hope for?
  • I plan to get around to this topic in the next month or so but basically there's some notion of "contractually significant" errors vs. "general errors". The fact that CreateFile() returns ERROR_FILE_NOT_FOUND if and only if the file in question could not be located in the directory specified (it would be ERROR_PATH_NOT_FOUND if the directory could not be located...) is something that a lot of code depends on and is part of the basic contract of the interface.

    I have a theory that "error returns" (status return codes, exceptions) should never be contractually significant but that requires a lot of motivation and will buck pretty much all the currently blessed coding/interface patterns.

    In particular, we need to teach people that handling an error not called out in the contract is just holding onto a live hand grenade with the pin pulled. Maybe you can pass the buck on to someone else but the mistake was pulling the pin in the first place.
  • Joku,
    The PPT you're talking about is up there. I know :) The PM who gave it is the PM for my group. I can answer a bit of this...

    MFT - Media Foundation Transform - mostly the equivalent of a DMO but with a different input/output logic.

    I don't know about video because I don't work on video (although video is owned by my group).

    WASAPI - Windows Audio Session API. That's the best name you're going to get for it :)

    And I'm trying to get permission to start blogging about this. Certainly I'll be doing that after LH beta1 ships.
  • (banging head to the kbd) I just found some blogs with DS information. I'll go bang my head there from now on.
  • 1) Is [1] the ppt you mention, Joku (or Larry!)? Seems very interesting!

    2) Joku: Could you please give the links to those DirectShow blogs? I've done some DS work in the past, and would like to read some "ramblings" about it :)

    3) FYI: Java introduced a mechanism to keep lower layer error information in its Throwable class (which Error and Exception inherit from) in v1.4. Unfortunately this isn't the case with J2ME.

    [1] http://download.microsoft.com/download/9/8/f/98f3fe47-dfc3-4e74-92a3-088782200fe7/TWEN05003_WinHEC05.ppt
  • Andreas,

    1) that's the ppt
    2) Better than none, but not exactly very active so far..

    http://blogs.msdn.com/mikewasson/default.aspx

    http://blogs.msdn.com/ccgibson/archive/2004/5/20.aspx

    http://blogs.msdn.com/deanro/archive/2005/01/21/358586.aspx - Well this is like saying Larry's blog was about Video ;)
  • Larry, I've been reading your blog the last few days, and you've repeated many times you prefer using error codes over exceptions.

    I think we're finally getting somewhere about the merits of using exceptions. An error code provides almost no information, except sometimes to the immediate caller (unless you apply method #2, which makes it worse). CLR exceptions, as you pointed out, solve this problem very nicely. Plus, exceptions contain a Message property, which allows storing contextual information in them.

    In your example of AEDebug and loader, suppose the system was based on exceptions, the end message to the user would be something like:
    "Error starting application "c:\MyPrograms\foo.exe". The program specified in the registry key "HKLM\...\AEDebug" can't be found."

    The 2nd part of the message would come from the component reading from the registry (Exception.InnerException.Message), the 1st part from the upper layer (Exception.Message).

    Sure enough, this doesn't really help a clueless user, but, now, we have a starting point to help her. If she calls support, support will solve the problem very quickly, since the error message is very precise and easy to search for. If she looks on the internet (I won't name the search engine ;-)), her problem will be solved in 2 sec. Compare that to the same user getting the message "File not found". What file? Why?

    Applying method #2, as you point out, means, in practice, swallowing error codes (you can't possibly redefine every single error code from all your lower components). This makes support even harder. Now, your user gets a message saying "Invalid Argument".

    How many times have I seen "Unspecified error" in the event log or in a message box? How many times have I seen an "Access denied" message? Access to what? Why was it denied?

    Of course, you could achieve the same result with a richer error code based system, but it's never been done in practice because it would be too time consuming (it gets more complicated than just returning an HR). An exception based system, as the one defined by the CLR, gives you this for free.
  • Renaud, what about HRESULTs? It's an error code but it contains a lot of information (see WinError.h in PSDK for the details). How to relate the error code to a message is not a hard task, if you own both. It is very neat in .NET, but you can do the same wrapping yourself.
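As a sketch of the information packed into an HRESULT that the comment above alludes to, the function below splits a value into its severity, facility, and code fields. It mirrors the shifts and masks of the HRESULT_SEVERITY, HRESULT_FACILITY, and HRESULT_CODE macros in WinError.h; the name decode_hresult is hypothetical.

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into the fields the WinError.h macros extract."""
    hr &= 0xFFFFFFFF  # treat the value as an unsigned 32-bit quantity
    return {
        "severity": (hr >> 31) & 0x1,      # 1 means failure
        "facility": (hr >> 16) & 0x1FFF,   # which subsystem defined the code
        "code": hr & 0xFFFF,               # the subsystem-specific error code
    }

# E_ACCESSDENIED: severity 1, facility 7 (FACILITY_WIN32), code 5 (ERROR_ACCESS_DENIED)
print(decode_hresult(0x80070005))
```

The fields say who reported the failure and which code they reported, but they carry no resource name or other context.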
  • > One of the basic requirements for any
    > interface is that you define the errors that
    > will be returned by that interface. It's a
    > fundamental part of the contract (and every
    > interface defines a contract).

    Agreed. That's how even MSDN's documentation of the ReadFile API has been pissing me off the last few days.
    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/readfile.asp

    > If the anonymous write pipe handle has been
    > closed and ReadFile attempts to read using
    > the corresponding anonymous read pipe
    > handle, the function returns FALSE and
    > GetLastError returns ERROR_BROKEN_PIPE.

    1. The documentation has no exception to that statement for an anonymous pipe handle that was opened with FILE_FLAG_OVERLAPPED and a non-NULL lpOverlapped parameter. The page for GetOverlappedResult also doesn't mention such a case. I think I know some programmers who experiment and deliberately disobey the contract.

    2. There is no documentation at all for what happens if the named write pipe handle has been closed and ReadFile attempts to read using the corresponding named read pipe handle. I think I know some programmers who rely on experiments.

    Is #2 trivial? If so, why does the documentation say "anonymous" where it could have just omitted the word? Twice?

    Is #1 trivial? If so, why is there a not-completely-trivial amount of documentation for asynchronous reads reaching EOF? If a programmer notices the discussion of EOF and guesses that other details are trivial then they would expect ERROR_HANDLE_EOF instead of ERROR_BROKEN_PIPE in the following case:
    ReadFile says 0, GetLastError says ERROR_IO_PENDING,
    peer process closes write side of named pipe,
    GetOverlappedResult says FALSE, GetLastError says why.

    If the contract will be changed (reworded) to match the results of experiments, this one will be fine with me.
  • Andreas: My point is that error codes don't contain enough context. E_ACCESSDENIED can be displayed as "Access is denied" (and that's why you get those useless error messages in Windows), but the error code doesn't tell you the resource name and the reason it failed.

    Exception.Message is a string and is generated by the component throwing the exception. Usually, the component throwing the exception has all the information to add context to the error message. In the case of an access denied, the component probably knows the name of the file or the resources it was trying to access.

    Let me give you a practical example: A few years back, I was working for a small company and we were trying to install SQL Server 2000. Setup would fail after 1 hour of processing with the error "Access is Denied" (actually, it would give us the error code, and we had to translate it manually). We tried everything: running as admin, adding a bunch of rights to the admin, killing services, rebooting, searching msdn, etc. It took us 2 weeks to solve the problem.

    We finally found out that someone had changed some ACL on the disk we were trying to install it to, and Setup couldn't copy some files to some location.

    If the error message had been "Can't copy file 'blah' to folder 'bar': Write access is denied", we would have solved the problem right away.

    In my experience, this kind of problem happens all the time. And MS doesn't do much to improve the situation (in the case of access denied, granted, you can enable logging them in the event viewer, but you have to reboot your machine). One reason is that most devs at MS will tell you that "exceptions are too complicated and dangerous", so we end up having to deal with more and more APIs using HRESULT as the error reporting mechanism.
  • Is IErrorInfo the solution to this? The happy medium between the rich information of an exception vs the simplicity of a return value?
  • Fat error codes are interesting and, as pointed out, are supported in COM as well as exception-based languages. (I would bet a latte that IErrorInfo bears an uncanny resemblance to the internal Visual Basic error object...)

    Exceptions have their own raft of problems. It's a different set of (source code) problems from statuses due to the syntactic sugar but really it's the same problems. You just can't see them any more. In fact, we know less about building and maintaining exception-based platforms than status-based so caveat programmer.

    The fundamentals are still that if an error code is part of the contract, it has to be documented and then the function has the responsibility to only return that error code in the contractually correct circumstances.

    I think it's this point that blocks the desire to actually document error codes. Once you do that you've said that every single line of code that checks an error and propagates it has a bug. I think that's a scary notion too which is why I suggest that we're really just in a bad space in the first place when you're either (a) comparing status codes or (b) catching certain exceptions.