June, 2005

Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    What's wrong with this code, part 13 - the answers

    • 15 Comments
    I loved this particular "What's wrong" because it was quite subtle (and real world).

    One of the developers on the audio team came to me the other day and complained that the assert in CFooBase::~CFooBase was firing.  I looked at the source for a few minutes, and all of a sudden realized what was happening.

    You see (as mirobin figured out), an object that derives from ATL::CComObjectRoot doesn't actually implement its IUnknown.  Instead, the IUnknown implementation comes from a wrapper class, the CComObject.

    As a result, the IUnknown implementation in the CFooBase is dead code - it will never be called (unless someone instantiates a CFooBase directly).  And that's why the assert fires: since CFooBase's AddRef and Release never run, its _refCount stays at the 1 it was given in the constructor, so the ASSERT(_refCount == 0) in the destructor trips when the CComObject deletes the object.  But, since CFooBase implements IFooBase, it also needs to implement IUnknown (even if the implementation is never called).

    What complicates this scenario is that you don't get to see the constructor for the CFooDerived class - that's hidden inside:

    OBJECT_ENTRY_AUTO(CLSID_FooDerived, CFooDerived)

    This macro invokes linker magic to cause the class factory for the CFooDerived to instantiate the CComObject that wraps the CFooDerived - you never see the code, so it's not at all obvious where the class is being instantiated.
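    If it helps to visualize what that machinery ends up doing, here's a rough sketch of the creation path (this is just the shape of the pattern, not the actual ATL class factory source - CreateFooDerived is a name I made up):

    // Sketch: roughly what ATL's class factory does when a client asks for a
    // new CFooDerived.  CComObject<CFooDerived> derives from CFooDerived and
    // supplies the "real" IUnknown; the QI/AddRef/Release that CFooDerived
    // inherits from CFooBase are never called.
    #include <atlbase.h>
    #include <atlcom.h>

    HRESULT CreateFooDerived(IFooBase **ppFoo)
    {
        CComObject<CFooDerived> *pObject = NULL;

        // CreateInstance news up the wrapper and calls FinalConstruct; the
        // object starts out with a reference count of zero.
        HRESULT hr = CComObject<CFooDerived>::CreateInstance(&pObject);
        if (SUCCEEDED(hr))
        {
            pObject->AddRef();
            hr = pObject->QueryInterface(__uuidof(IFooBase),
                                         reinterpret_cast<void **>(ppFoo));
            pObject->Release();
        }
        return hr;
    }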

    So for the kudos:

    First off, mad props to mirobin for initially catching two huge, massive typos (ok, stupid omissions) in the original post (I forgot the AddRef in the QI method and forgot to return a value in the Release() method).

    mirobin correctly called out one other issue.  Instead of:

            if (iid == IID_FooBase)
            {
                AddRef();
                *ppUnk = reinterpret_cast<void *>(this);
            }

    It should be:

            if (iid == IID_FooBase)
            {
                AddRef();
                *ppUnk = reinterpret_cast<void *>(static_cast<IFooBase *>(this));
            }

    The reason for this is that if CFooBase implements multiple interfaces, each interface has its own vtable, so you need the double cast to resolve the ambiguity - the static_cast adjusts the "this" pointer to the right subobject before it's flattened to a void *.  The good news is that if a static_cast is ambiguous (casting to IUnknown * when two base interfaces both derive from IUnknown, for example), the compiler will complain, which forces you to resolve the ambiguity.  In this case it's not strictly necessary, but in general it's a good idea.
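    To see why the intermediate cast matters, consider a hypothetical class that implements a second interface (IFooOther, CFoo, and Demo are all made up for illustration):

    __interface IFooOther : IUnknown
    {
        HRESULT FooOther();
    };

    class CFoo : public IFooBase, public IFooOther { /* ... */ };

    void Demo(CFoo *pFoo)
    {
        void *p1 = static_cast<void *>(static_cast<IFooBase *>(pFoo));
        void *p2 = static_cast<void *>(static_cast<IFooOther *>(pFoo));
        // p1 != p2 - each static_cast adjusts "this" to point at the
        // subobject whose vtable implements that interface.  A plain
        // reinterpret_cast<void *>(pFoo) performs no adjustment at all,
        // so it only yields the right pointer for the first base class.
    }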

    Other comments: Michael Ruck complained that CFooBase() isn't implemented.  You don't have to have a constructor on a class, it's possible that this class doesn't need it.

    And Michael complained that ~CFooBase() is private.  This one merits an entire blog post by itself.

     

  • Larry Osterman's WebLog

    What's wrong with this code, part lucky 13

    • 35 Comments
    Today's example is a smidge long; I've stripped out everything I can possibly imagine stripping out to reduce its size.

    This is a very real world example that we recently hit - only the names have been changed to protect the innocent.

    I've used the built-in C++ attributes for interfaces, but that was just to get this stuff to compile in a single source file; it's not related to the bug.

    extern CLSID CLSID_FooDerived;
    [
        object,
        uuid("0A0DDEDC-C422-4BB3-9869-4FED020B66C5"),
    ]
    __interface IFooBase : IUnknown
    {
        HRESULT FooBase();
    };

    class CFooBase: public IFooBase
    {
        LONG _refCount;
        virtual ~CFooBase()
        {
            ASSERT(_refCount == 0);
        };
    public:
        CFooBase() : _refCount(1) {};
        virtual HRESULT STDMETHODCALLTYPE QueryInterface(const IID& iid, void** ppUnk)
        {
            HRESULT hr=S_OK;
            *ppUnk = NULL;
            if (iid == IID_FooBase)
            {
                AddRef();
                *ppUnk = reinterpret_cast<void *>(this);
            }
            else if (iid == IID_IUnknown)
            {
                AddRef();
                *ppUnk = reinterpret_cast<void *>(this);
            }
            else
            {
                hr = E_NOINTERFACE;
            }
            return hr;
        }
        virtual ULONG STDMETHODCALLTYPE AddRef(void)
        {
            return InterlockedIncrement(&_refCount);
        }
        virtual ULONG STDMETHODCALLTYPE Release(void)
        {
            LONG refCount;
            refCount = InterlockedDecrement(&_refCount);
            if (refCount == 0)
            {
                delete this;
            }
            return refCount;
        }
        STDMETHOD(FooBase)(void);
    };
    class ATL_NO_VTABLE CFooDerived :
        public CComObjectRootEx<CComMultiThreadModel>,
        public CComCoClass<CFooDerived, &CLSID_FooDerived>,
        public CFooBase
    {
        virtual ~CFooDerived();
        public:
        CFooDerived();
        DECLARE_NO_REGISTRY()
        BEGIN_COM_MAP(CFooDerived)
            COM_INTERFACE_ENTRY(IFooBase)
        END_COM_MAP()
        DECLARE_PROTECT_FINAL_CONSTRUCT()

    };

    OBJECT_ENTRY_AUTO(CLSID_FooDerived, CFooDerived)

     

    As always, tomorrow I'll post the answers along with kudos and mea culpas.

    Edit: Fixed missing return value in Release() - without it it doesn't compile.  Also added the addrefs - my stupid mistake.  mirobin gets major props for those ones.

  • Larry Osterman's WebLog

    Chimney sweeping

    • 23 Comments

    Sorry, nothing technical's coming to mind (I'm utterly swamped with work related stuff), but I just had to vent.

    Sharron's going to camp next week, so once again we're doing the "label everything that Sharron owns" ritual.

    Neither Valorie nor I are particularly handy with a needle and thread (or a sewing machine), so we look for shortcuts on the "label ritual".  Many, many years ago, we bought a self-inking fabric stamp from one of those circulars that you get in the mail.  Unlike so many of the things you buy from those flyers, this one lasted us for almost 6 years and a huge number of stampings on a wide variety of fabrics (ever tried to stamp a terry-cloth towel?  Not pretty).

    Unfortunately, after six long hard years of work, the stamp finally bit the big one (it was close last year, but this year it didn't have the gumption to ink even a paper towel).

    So we needed to go out and buy a new stamp.  Valorie spent some time yesterday looking and found a local stamp company, and I stopped by there on my way to work.  The store is quite fascinating, actually - their web site looks quite professional, but the store itself is in an old one family home in the heart of downtown Redmond.

    For those of you that don't live in the Seattle area, Redmond is a rather fascinating mix of old and new.  New Redmond is all high tech and modern - brand new buildings filled with the latest stores.  It has its own outdoor mall called Redmond Town Center, with a 12 screen multiplex, and several other brand spanking new shopping centers.  New Redmond has street after street of planned ticky-tacky pseudo-urban environments with chiropractor offices on the ground floor and residential spaces above (all within a 4 story height limit).  It's like someone decided that because Redmond was a city, and cities have apartment buildings with businesses on the ground floor, Redmond had to have them to compete.

    Old Redmond, on the other hand, is more like a small town from the 1960s or 1970s.  It reminds me of the town where I spent most of my childhood - Delmar, NY - relatively small streets, old 1970s style shopping centers.  Most of the buildings downtown date from the turn of the century or earlier.

    The stamp store is a perfect example of old Redmond - it's a tiny little stand-alone building, painted white (and it needs a new coat of paint).  At the top of the building, where you'd find the "Mercantile" sign if this were a Western, is the name of the business.  On the other hand, in the windows are crisp decals reminiscent of the web site.  I went inside, and found exactly what I'd expect from a professional stamp store - counter, stamps of various types, etc.  Nothing exciting at all.

    I did my business with the owner, and ordered the stamp - we'll get our stamp tomorrow (awesome turnaround time, we'll probably go back to them again).

    But the reason I'm writing this particular post is that the owner smoked.  In his store.

    Man, that was a bummer.  You see (again, for those that don't live around here), Washington state is almost a 100% smoke free environment, at least indoors.  This means that I've not had to experience the utter joy of being in close proximity to a cigarette factory in many years (I have several co-workers that smoke, but they keep it outside).

    As I was heading into work, I started realizing that I was still smelling the store.  It kept on, even after I'd gotten to my desk.

    It appears that the 5 minutes I spent in the store were enough to imbue my clothing with the smell of the store.

    So today, I smell like the inside of a chimney.

  • Larry Osterman's WebLog

    How many computers does it take to make a Microsoft employee happy

    • 9 Comments
    Over the weekend, it seems like a mini-meme went through http://blogs.msdn.com, "How many computers does it take to make a Microsoft employee happy".

    Normally I don't do memes, but enough people have asked this question privately that...

    What machines do I have in my office today?

    I currently have four computers with power cords connected to them - my dev machine (a 3ish GHz HP machine), two test machines (one AMD64, one 2ish GHz Dell machine), and my laptop (an old Dell Latitude C610).  I also have an old Dell that's currently taking up space in the corner in case I need a spare kernel debugger machine (I used to use it for Windows Media Connect testing).

    Over the years, I've had as many as 8 computers in my office, typically I've got three (the AMD64 is mostly a loaner, for some AMD64-only work I'm doing).

    As I mentioned, my highest number of computers was 8.  That was back when I was in Exchange and had to have an NT4 machine running Exchange 5.5, two machines running Exchange 2000, one development machine, one laptop, a prototype Itanium machine (which was kept powered off most of the time), and one other that I don't remember.

    When I started, I had three computers - a PC/XT, a Salmon (a prototype PC/AT), and a PCjr (which was turned off most of the time) - plus a Z19 terminal.  How times have changed.

    It's a little bit weird - I've not debugged on the same machine I develop on for almost my entire career at Microsoft.  It's to the point that even when I COULD develop on my dev machine (when I was working on SCP) I didn't - I copied all the bits to my test machine and ran the tests on the dedicated test machine.  After all this time, I'm not sure that I'd feel comfortable going back to working on the same machine as my source code.

     

  • Larry Osterman's WebLog

    Why is the DOS path character "\"?

    • 54 Comments
    Many, many months ago, Declan Eardly asked why the \ character was chosen as the path separator.

    The answer's from before my time, but I do remember the original reasons.

    It all stems from Microsoft's relationship with IBM.  For DOS 1.0, DOS only supported floppy disks.

    Many of the DOS utilities (except for command.com) were written by IBM, and they used the "/" character as the "switch" character for their utilities.  (The "switch" character is the character that's used to distinguish command line switches - on *nix, it's the "-" character; on most DEC operating systems (including VMS, the DECSystem-20 and DECSystem-10), it's the "/" character.  Note: I'm grey on whether the "/" character came from IBM or from Microsoft - several of the original MS-DOS developers were old-hand DEC-20 developers, so it's possible that they carried it forward from their DEC background.)

    The fact that the "/" character conflicted with the path character of another relatively popular operating system wasn't particularly relevant to the original developers - after all, DOS didn't support directories, just files in a single root directory.

    Then along came DOS 2.0.  DOS 2.0 was tied to the PC/XT, whose major feature was a 10M hard disk.  IBM asked Microsoft to add support for hard disks, and the MS-DOS developers took this as an opportunity to add support for modern file APIs - they added a whole series of handle based APIs to the system (DOS 1.0 relied on an application controlled structure called an FCB).  They also had to add support for hierarchical paths.

    Now historically there have been a number of different mechanisms for providing hierarchical paths.  The DecSystem-20, for example, represented directories as <volume>:<Directory[.Subdirectory]>FileName.Extension[,Version] ("PS:<SYSTEM>MONITR.EXE,4").  VMS used a similar naming scheme, but instead of the < and > characters it used [ and ] (and VMS used ";" to differentiate between versions of files).  *nix defines hierarchical paths with a simple hierarchy rooted at "/" - in *nix's naming hierarchy, there's no way of differentiating between files and directories, etc (this isn't bad, btw, it just is).

    For MS-DOS 2.0, the designers of DOS chose a hybrid version - they already had support for drive letters from DOS 1.0, so they needed to continue using that.  And they chose to use the *nix style method of specifying a hierarchy - instead of calling the directory out in the filename (like VMS and the DEC-20), they simply made the directory and filename indistinguishable parts of the path.

    But there was a problem.  They couldn't use the *nix form of path separator of "/", because the "/" was being used for the switch character.

    So what were they to do?  They could have used the "." character like the DEC machines, but the "." character was being used to differentiate between file and extension.  So they chose the next best thing - the "\" character, which was visually similar to the "/" character.

    And that's how the "\" character was chosen.

    Here's a little known secret about MS-DOS.  The DOS developers weren't particularly happy about this state of affairs - heck, they all used Xenix machines for email and stuff, so they were familiar with the *nix command semantics.  So they coded the OS to accept either "/" or "\" character as the path character (this continues today, btw - try typing "notepad c:/boot.ini"  on an XP machine (if you're an admin)).  And they went one step further.  They added an undocumented system call to change the switch character.  And updated the utilities to respect this flag.

    And then they went and finished out the scenario:  They added a config.sys option, SWITCHAR=, that would let you set the switch character to "-".

    Which flipped MS-DOS into a *nix style system where command lines used "-switch", and paths were / delimited.

    I don't know the fate of the switchar API; it's been gone for many years now.

     

    So that's why the path character is "\".  It's because "/" was taken.

    Edit: Fixed title - it's been bugging me all week.

     

  • Larry Osterman's WebLog

    Why is defense in depth so important?

    • 16 Comments
    Yesterday's post on the principle of least privilege engendered some discussion about how important the PLP was.

    Many of the commenters (including Skywing, the originator of the discussion) felt that it wasn't important because malware could always enable the privilege.

    And they're right - it just raises the bar.  But raising the bar is important.

    The PLP is a specific example of a technique named "Defense in Depth" - the idea is that even if you don't know about a specific attack vector to your component, you still ensure that your code doesn't have defects.

    Let me give a contrived but feasible example.

    I've got an API that lives in a DLL, call it MyCoolNewImageRenderingAPI.dll.  The component ships as a part of Windows (or it's open source and freely redistributable, it doesn't really matter).

    It turns out that there's a bug in my API - if you pass the API a filename that's invalid, it can overflow.
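    To make the defect class concrete, here's a sketch of the kind of bug I'm describing (RenderImageFromFile and its innards are made up for illustration):

    #include <windows.h>
    #include <string.h>

    HRESULT RenderImageFromFile(const char *fileName)
    {
        char canonicalName[MAX_PATH];

        // BUG: no length check.  A filename longer than MAX_PATH smashes the
        // stack - harmless as long as every caller validates its input, and
        // fatal the day some caller passes an attacker-controlled name.
        strcpy(canonicalName, fileName);

        // The fix is cheap - reject the input instead of trusting the caller:
        //     if (strlen(fileName) >= MAX_PATH) return E_INVALIDARG;

        return S_OK;    // ... actual rendering elided ...
    }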

    Do you fix the bug?  Your code lives in a DLL.  It's not network facing.  None of the components that use the DLL are network facing.  So it's not exploitable, right?

    Well, it might not be exploitable, today.  But you still have to fix the bug.

    Why?

    Because you can't control the callers of your API.  You simply can't predict what they're going to do with the API.

    What if the brand spanking new IceWeasel web browser decides that it really likes your API and decides to use it to render images?  And further, what if that browser allows a web site author to control the filename, or a portion of the filename passed into the API?

    Now all an attacker needs to do is to construct a filename that exploits your buffer overflow and they own the client machine.  In the absence of your bug, the only  consequence of the browser's allowing the attacker to control the name might be a denial of service attack (it might crash the browser, or fail to render the page correctly).  But with your bug, your "unexploitable" bug just became a security hole.

    You could argue back and forth about whether this is a bug in the browser or not, but ultimately it's not the browser's responsibility to work around YOUR bugs.  You just need to fix the bug.

    And that, in a nutshell, is what defense in depth is all about.  If you've got a problem in your code that might conceivably be used to launch an exploit, even though you believe that there are ample mitigations in place, you still need to fix the problem.

  • Larry Osterman's WebLog

    Why did I say that the Date&Time control panel applet was a security hole waiting to happen?

    • 23 Comments

    In response to a comment I'd made on Raymond's post about the Date/Time CPL:

    And what's even neater is that it's a security hole waiting to happen - the reason the dialog pops up when you're a LUA user is that they're enabling the set date&time privilege on startup (rather than when they set the date&time).

    That means that the applet runs with a privilege enabled, which violates the principle of least privilege (don't enable a privilege until you absolutely need it).

    Now since most users are admins, this isn't a problem (they already have the set date&time privilege) but when users are LUA...

    I received a private email from Skywing asking:

    I was wondering why exactly you think that leaving a privilege enabled in a token is a security hole.

    Any program (or conceivably exploit code delivered via a buffer overflow or similar mechanism) could call AdjustTokenPrivileges to re-enable a disabled privilege. Thus, I don't see how enabling a privilege by default as opposed to disabling it by default increases system security at all.

    In fact, I really don't understand the reason behind there being the concept of disabled or enabled privileges, given that any program that is running under a user can change a privilege from disabled to enabled in their token.

    Of course, this is completely different from adding or removing a privilege from a token. Clearly, adding additional privileges to a token can cause security problems, and as such requires special permissions. However, if a token has a privilege disabled, that privilege can still be used (after an AdjustTokenPrivileges call), so it doesn't really prevent misuse.
     

    Instead of writing a private answer, I figured it was worth a public one.  The three word answer to why this is important is "Defense in Depth".  But it deserves a more thorough explanation.

    The principle of least privilege says that you run with the minimum set of privileges to accomplish your task. But the Date&Time CPL applet runs with the date&time privilege enabled all the time (not just when it attempts to change the date&time), thus violating the PLP.

    You can think of a privilege as being a trump card over the normal security - privileges can allow the user to bypass the normal security mechanisms to perform their job.  For example, the backup privilege lets you bypass the ACLs on the files on your hard disk.  The restore privilege lets you set an arbitrary SD, including the owner field (normally the owner field in an SD can only be set to the user doing the setting - that's why it's called "take ownership").

    Another way of thinking about privileges is by having them disabled, it makes the application take another step before it can cause harm.  A backup operator can see all the files on the hard disk - but they have to enable a privilege (which can be audited) in order to do it - normally, when running from the console, backup operators can't bypass security.

    The thing about privileges is that they provide sort-of an "alternative ACL" mechanism - there are operations that need to bypass the normal ACL mechanism (like restoring the ACL on a file that's being restored from tape).  You can't protect the restore operation with an ACL, because restore has to be able to bypass ACLs in the first place.  Similarly, since there's no clock object in the system (and no, I don't know why there's no clock object), there's no way of having ACLs to protect the ability to change the system time.  So you need a mechanism to perform these operations WITHOUT using ACLs.  Privileges provide that mechanism.

    The other thing to remember is that not all users have all privileges.  When running as a local administrator, I have most of the privileges, but not all of them - for example, I don't have the TCB privilege (that's the "Act as a part of the operating system" privilege).  The TCB privilege is the holy grail of hackers - if you have the TCB privilege, you own the system (and if you're an administrator, the escalation path to gaining the TCB privilege isn't particularly hard).

    Privileges that a user has but that aren't enabled are a speedbump - putting the "trump cards" in a privilege reduces the potential for someone who has that privilege to accidentally cause harm.

    Running with a privilege enabled is like running with sharp pointed sticks in your hand.  Most of the time you're just fine.  But what happens if you trip?

    Now consider the Date&Time control panel applet.  It runs all of its UI with the privilege enabled.  This means that if there's an exploitable buffer overflow in the UI, then a bad guy who wants to change the machine's time just has to find the exploitable buffer overflow to work his will with the system.  If, on the other hand, they had only enabled the change time privilege around the call to set the date&time, then there's a huge amount of code that isn't vulnerable to attacks.  By applying the principle of least privilege, and only enabling the privilege when it's needed, the attack surface of the product is reduced.  It's not eliminated - Skywing's totally right, the exploit could simply choose to enable the privilege and move on.
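    For the curious, here's roughly what "enabling the privilege around the call" looks like in code - a minimal sketch (the function name is mine; error handling is abbreviated):

    #include <windows.h>

    BOOL SetSystemTimeWithLeastPrivilege(const SYSTEMTIME *pst)
    {
        HANDLE hToken;
        TOKEN_PRIVILEGES tp = { 1 };    // PrivilegeCount = 1
        BOOL ok = FALSE;

        if (!OpenProcessToken(GetCurrentProcess(),
                              TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &hToken))
            return FALSE;

        if (LookupPrivilegeValue(NULL, SE_SYSTEMTIME_NAME,
                                 &tp.Privileges[0].Luid))
        {
            // Enable the privilege only for the duration of the call.
            tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
            if (AdjustTokenPrivileges(hToken, FALSE, &tp, 0, NULL, NULL) &&
                GetLastError() == ERROR_SUCCESS)
            {
                ok = SetSystemTime(pst);    // the only privileged operation

                // Disable it again as soon as we're done.
                tp.Privileges[0].Attributes = 0;
                AdjustTokenPrivileges(hToken, FALSE, &tp, 0, NULL, NULL);
            }
        }
        CloseHandle(hToken);
        return ok;
    }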

    Further, let's consider a hypothetical system where the user DOESN'T have the date&time privilege.  And the attempt to enable the date&time privilege causes an authentication dialog to come up (sort-of like what OS X does today).  In that case, the privilege would only be enableable after the user was prompted.  If the prompting happened before the UI came up, the attacker would have a large viable target, if the prompting happened only when the change was happening, the target would be WAY smaller.

    In general, even if you have a privilege, you don't want to have it enabled all the time.  Because enabling the privilege is a speed bump.

    And defense-in-depth is all about putting as many speed bumps in the system as possible.  If you make the hacker's job hard, then maybe they'll leave you alone (or rather, they'll go find someone with fewer speed bumps).

     

  • Larry Osterman's WebLog

    Something I learned the other day...

    • 10 Comments

    I have a mild allergy to theatrical makeup (powder in particular).

     

    Go figure that one.

     

    Yes, it's a tease.  And no, I'm not telling (until people figure out the tease).  Why?  Because I've been asked to keep it under my hat.  But the reason is out there.

     

     

  • Larry Osterman's WebLog

    Hey, I'm an author too!

    • 8 Comments

    Joel just sent me an email letting me know that the first edition of "The Best Software Writing I" has now gone live.

    I'm quite honored to have had one of my blog posts (Larry's rules of software engineering #2: Measuring Testers by Test Metrics doesn't) selected for inclusion in the book.

    I've got to say that it feels pretty cool to have my name up there in a listing on Amazon. 

    It's also humbling to have my writing up there with articles by Raymond, Eric, and Rick.  Very, very neat.

    Joel let me know about this back in January, it's been an "interesting" experience working through all the issues.

    Sometime, over a beer, I'll talk about them.

     

  • Larry Osterman's WebLog

    Michael's finally writing the nitty gritty details of DllLoad

    • 9 Comments

    I don't normally post on weekends, but I just noticed that Michael Grier's finally started posting his "How does the NT loader work" series.

    His second post, on the basic operation of the loader, is also up.

    Michael sent out a doc internally on Thursday with most of the meat of what he's doing (in response to a question on an internal mailing list), so I know where he's going with this, and it's gonna be good.  When I got the original message, I immediately forwarded it to my extended team (and put it on the team wiki) because it's THAT important.  I also sent him a private email asking (no, begging) him to post this to his blog.  I don't know if my email was the impetus or if he was planning on doing it anyway, frankly I don't care.  Now I'll have a definitive reference to point people at when they ask DllMain questions.

    This is good stuff folks, what Michael's describing is a great example of why Microsoft blogs are so important.

    You can't find this stuff written down as clearly and as plainly as he's stating it - anywhere.  You can infer and guess, but this is the real deal.

    Yay!

  • Larry Osterman's WebLog

    Nathan's laws of software

    • 16 Comments
    Way back in 1997, Nathan Myhrvold (CTO of Microsoft at the time) wrote a paper entitled "The Next Fifty Years of Software" (Subtitled "Software: The Crisis Continues!")  which was presented at the ACM97 conference (focused on the next 50 years of computing).

    I actually attended an internal presentation of this talk; it was absolutely riveting.  Nathan's a great public speaker, maybe even better than Michael Howard :).

    But an email I received today reminded me of Nathan's First Law of Software:  "Software is a Gas!"

    Nathan's basic premise is that as machines get bigger, the software that runs on those computers will continue to grow. It doesn't matter what kind of software it is, or what development paradigm is applied to that software.  Software will expand to fit the capacity of the container.

    Back in the 1980s, computers were limited.  So software couldn't do much.  Your spell checker didn't run automatically; it had to be invoked separately.  Nowadays, the spell checker runs concurrently with the word processor.

    The "Bloatware" phenomenon is a direct consequence of Nathan's First Law.

    Nathan's second law is also fascinating: "Software grows until it becomes limited by Moore's Law". 

    The second law is interesting because we're currently nearing the end of the cycle of CPU growth brought on by Moore's law.  So in the future, the growth of software is going to become significantly constrained (until some new paradigm comes along).

    His third law is "Software growth makes Moore's Law possible".  Essentially he's saying that because software grows to hit the limits of Moore's law, software regularly comes out that pushes the boundaries.  And that's what drives hardware sales.  And the drive for ever increasing performance drives hardware manufacturers to make even faster and smaller machines, which in turn makes Moore's Law a reality.

    And I absolutely LOVE Nathan's 4th law.  "Software is only limited by human ambition and expectation."   This is so completely true.  Even back when the paper was written, the capabilities of computers today were mere pipe dreams.  Heck, in 1997, you physically couldn't have a computer with a large music library - a big machine in 1997 had a 600M hard disk.

    What's also interesting is the effort that goes into fighting Nathan's first law.  It's a constant fight, waged by diligent performance people against the hordes of developers who want to add their new feature to the operating system.  All the developers want to expand their features.  And the perf people need to fight back to stop them (or at least make them justify what they're doing).  The fight is ongoing, and unending.

    Btw, check out the slides - they're worth reading.  Especially when he gets to the part where the stuff that makes you genetically unique fits on a 3 1/2" floppy disk.

    He goes on from that point - at one point in his presentation, he pointed out that the entire human sensory experience can be transmitted easily on a 100Mb Ethernet connection.

     

    Btw, for those of you who would like, there's a link to two different streaming versions of the talk here: http://research.microsoft.com/acm97/

     

    Edit: Added link to video of talk.

     

  • Larry Osterman's WebLog

    The dirty little secret of Windows volume

    • 8 Comments

    Here's a dirty little secret about volume in Windows.

    If you look at the documentation for waveOutSetVolume it very clearly says:

    Volume settings are interpreted logarithmically. This means the perceived increase in volume is the same when increasing the volume level from 0x5000 to 0x6000 as it is from 0x4000 to 0x5000.

    The implication of this is that you can implement a linear slider for volume control and use the position of the slider to represent the volume.  This is pretty cool.

    But if you've ever written an application that uses the waveform volume (say an app that plays content with a volume slider attached to it), you'll notice that your volume control is far more responsive when it's on the low end of the slider and less responsive on the high end of the slider.

    [Image: a logarithmic volume curve]

    That's weird.  The volume settings are supposed to be logarithmic, but a slider that's more responsive at the low end of the scale than the high end of the scale is an indicator that the slider's controlling LINEAR volume.

    And that's the dirty little secret.  Even though the wave volume is supposed to be logarithmic, the wave volume is actually linear.
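    If you want logarithmic behavior out of a linear control, you have to apply the taper yourself before calling the API.  Here's a minimal sketch of that idea (SetTaperedVolume is my name, and the 60dB range is an illustrative choice, not something the API defines):

    #include <math.h>
    #include <windows.h>
    #include <mmsystem.h>

    // Map a slider position (0.0 - 1.0) through a decibel taper before
    // handing it to waveOutSetVolume, so equal slider movements produce
    // roughly equal perceived changes in loudness.
    void SetTaperedVolume(HWAVEOUT hwo, double sliderPos)
    {
        double db = (sliderPos - 1.0) * 60.0;     // 1.0 -> 0dB, 0.0 -> -60dB
        double amplitude = (sliderPos > 0.0) ? pow(10.0, db / 20.0) : 0.0;

        WORD level = (WORD)(amplitude * 0xffff);  // the linear scale the API actually uses
        waveOutSetVolume(hwo, MAKELONG(level, level));  // same level in both channels
    }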

    What's worse is that we didn't notice this until we shipped Media Center Edition.  The PM for my group was playing with his MCE machine and noticed that the volume was linear.   To confirm it, he whipped out his sound pressure meter (he's a recording  artist so he has stuff like that in his house).  And yup, the volume control was linear.

    When he came back to work the next day, panic ensued.  I can't explain WHY nobody had noticed this, but they hadn't.

    In response, we added support (for XP SP2) for customized volume tapers for the audio APIs.  The results of that are discussed in this article.

     

    Interestingly enough, it appears that this problem is well known.  The article from which I stole this image discusses the problem of linear vs. logarithmic tapers and discusses how to find the optimal volume taper.

     

    Edit: Cleared up some ambiguities in the language.

  • Larry Osterman's WebLog

    What's in an audio volume?

    • 16 Comments

    I've been talking about audio controls - volumes and mutes and the like, but one of the more confusing things I've run into here at work is the concept of "volume".

    First off, what IS volume?

    Well, roughly speaking (and I know the audiophiles out there will get on my case about this), there are actually several concepts when you talk about "volume".  The first (and most common) is that volume is a representation of "loudness".  But it turns out that in practice, volume is a representation of "intensity".

    The difference between "loudness" and "intensity" is that "loudness" is perceptual - how do you perceive a sound.  But "intensity" is actually what's measured - as SPL (Sound Pressure Level), which is a representation of energy in the sound space.

    Typically volumes are measured in decibels - a decibel is a logarithmic scale (each 10dB increase is a 10x increase in sound intensity).  20dB is about the volume of a whisper, 140dB is that of a jet airplane taking off next door. 

    Now when you deal with volumes in pro audio equipment, volume is measured by two factors - attenuation and amplification.  0 means that the sound is playing at its native level, negative numbers are reductions in volume from that native level, and positive numbers indicate amplification. 

    For most computer hardware, volume is measured as attenuations - negative numbers running from 0 (max volume) to -infinity (0 volume).  In practice, the number runs from 0 to -96dB.  Typically computers don't ever amplify signals, just attenuate them.  If you think about how digital audio works this makes sense.  Since an audio sample at full volume is at 0dB, it's easy to attenuate the samples (just scale them down appropriately).  On the other hand, it's not easy to amplify them - they're already AT 100% - any amplification would have to come AFTER the DAC.  So digital volumes ultimately measure attenuation.

    But audio volumes AREN'T in decibels (because that would be easy).  Instead, the audio volume is represented in a number of different sets of units, depending on your API.

    And that's where it gets really, really ugly.  There are at least five different sets of APIs in the system that measure audio volume, and they use totally different units.

    For example, the wave APIs (waveOutSetVolume, waveOutGetVolume) represent volume as a number between 0x0000 and 0xffff, where 0 represents silence and 0xffff represents full volume.  The wave APIs assume that all audio outputs are stereo, and they pack the left and right channels into a single DWORD.  Of course if your audio system has more than two channels, that's a problem, but the reality is that almost nobody ever wants to adjust the balance as a normal activity (it's typically done once during system setup and then ignored).

    The mixer APIs on the other hand set their volumes with the mixerSetControlDetails API.  That API takes an integer between a low bound and a high bound, determined from the dwMinimum and dwMaximum fields of the relevant MIXERCONTROL.  The MIXERCONTROL structure also defines the number of steps between the low and the high value.  For most audio adapters, this is a number between 0 and 0xffff, with 0xffff steps, but this is not guaranteed - I've seen audio adapters with discrete volumes - 256 steps, for example.

    And then there's direct sound.  DirectSound sets volume on individual DSound buffers - you set the volume with the IDirectSoundBuffer8::SetVolume API.  The DSound set volume API sets the volume as a DWORD with the volume measured in hundredths of a dB, ranging from 0 to -10,000 (0 to -100dB).

    Oh, and I can't forget the audio CD playback volume.  The IOCTL_CDROM_GET_VOLUME (which is used to control the volume of CD playback when you're playing an audio CD over the analog connector to your sound card) specifies volumes in numbers between 0 and 255.

    And of course, the audio device driver that's actually used to render all these different volume levels takes a fifth type of volume.  The KSPROPERTY_AUDIO_VOLUMELEVEL property takes a number from -2147483648 to +2147483647, where -2147483648 is silence (-32768 dB), 0 is max volume, and 2147483647 is +32767 decibels (gain).  The units for the sysaudio volume are 1/65536th of a decibel, which is nice since the high 16 bits represent the decibel value and the low 16 bits represent the fractional portion of the volume (typically 0).

    Sigh.
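    And if you ever need to bridge two of these unit systems, you get to do the math yourself.  Here's a sketch of converting a wave-API style linear level into the units DirectSound wants (WaveLevelToDSoundVolume is a name I made up; the DSBVOLUME_* constants come from dsound.h):

    #include <math.h>
    #include <dsound.h>

    // Convert a wave-API style linear level (0x0000 - 0xffff) into the
    // hundredths-of-a-decibel attenuation that
    // IDirectSoundBuffer8::SetVolume expects (DSBVOLUME_MIN = -10000
    // through DSBVOLUME_MAX = 0).
    LONG WaveLevelToDSoundVolume(WORD waveLevel)
    {
        if (waveLevel == 0)
        {
            return DSBVOLUME_MIN;                       // silence
        }
        double amplitude = (double)waveLevel / 0xffff;  // 0.0 - 1.0
        LONG hundredthsOfDb = (LONG)(2000.0 * log10(amplitude));
        return (hundredthsOfDb < DSBVOLUME_MIN) ? DSBVOLUME_MIN
                                                : hundredthsOfDb;
    }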

  • Larry Osterman's WebLog

    Mapping audio topologies to mixer topologies

    • 3 Comments

    Yesterday, I described the internal topology of an audio card.

    But there's no Windows API that exposes that topology directly.  You can generate the topology using the KS IOCTLs (KS stands for Kernel Streaming), but the reality is that it's not an API for mortals - a quick browse through the API makes that quite clear.

    Instead, the internal topology of the audio card is exposed via the MME mixer API set.

    But the mixer APIs only provide a rough approximation of the functionality in the topology.  A lot of this is because of the architecture of the mixer APIs.

    Instead of viewing the internal topology of the audio card as a graph, the mixer APIs represent the internal audio topology as an audio mixer.

    Consider this mixer:

    [Image: a Mackie 1202VLZPRO mixing console]

    On the left hand side, there are 8 input channels (or source lines).  On the right hand side, there's a single output channel (or destination line).

    On each of the channels, there are a series of controls - mute, treble, bass, volume, etc - in the pictured mixer (a Mackie 1202VLZPRO), the volume for each of the input channels is the silver knob at the bottom, and there's a gray button above it which is the mute.  On the output, there's a meter (that's the two rows of LEDs), and a bunch of volume controls (and some more controls that I can't recognize).

    Well, the mixer APIs take the audio adapter's topology and attempt to turn that graph into something that looks like the mixer above.

    Here's the picture of the topology of my test machine's AC97 adapter again, for reference.

    Here's a snapshot of another tool written by the same developer:

    [Image: the tool's view of the AC97 adapter's mixer lines]

    You can see the wave pin (pin 9), the CD Player (pin 6) etc here.

    The mixer API's really only concerned with the topology filter - the wave filter isn't very interesting as far as the mixer's concerned (the mixer does look at the wave filter, but not deeply).  The mixer API takes the 10 pins on the left hand side and calls them "sources".  And it takes the 3 pins on the right and calls them "destinations".  It then walks the connections from the source to the destination pins, takes every control that's on the path, and adds it to the source line.
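    You can watch this flattening happen from code - here's a minimal sketch that walks the destinations and sources exactly the way I just described (DumpMixerLines is my name; error handling omitted, link with winmm.lib):

    #include <windows.h>
    #include <mmsystem.h>
    #include <stdio.h>

    void DumpMixerLines(void)
    {
        HMIXER hMixer;
        MIXERCAPSA caps;

        // Open the first mixer device and ask how many destination
        // lines it exposes.
        mixerOpen(&hMixer, 0, 0, 0, MIXER_OBJECTF_MIXER);
        mixerGetDevCapsA(0, &caps, sizeof(caps));

        for (DWORD dest = 0; dest < caps.cDestinations; dest++)
        {
            MIXERLINEA dstLine = { sizeof(dstLine) };
            dstLine.dwDestination = dest;
            mixerGetLineInfoA((HMIXEROBJ)hMixer, &dstLine,
                              MIXER_GETLINEINFOF_DESTINATION);
            printf("Destination: %s\n", dstLine.szName);

            // Each destination reports how many source lines feed it.
            for (DWORD src = 0; src < dstLine.cConnections; src++)
            {
                MIXERLINEA srcLine = { sizeof(srcLine) };
                srcLine.dwDestination = dest;
                srcLine.dwSource = src;
                mixerGetLineInfoA((HMIXEROBJ)hMixer, &srcLine,
                                  MIXER_GETLINEINFOF_SOURCE);
                printf("    Source: %s\n", srcLine.szName);
            }
        }
        mixerClose(hMixer);
    }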

    But it turns out that that isn't quite right.  Consider  what happens when the code walking the source lines hits the mixer control #18.  Every source line generated would have the master volume and mute controls.

    So there's a special case - controls to the right of a mixer (or mux) control get assigned to the destination, controls to the left get assigned to the individual source lines.  In the case of the mux control, the mux control is added to the destination (which makes sense).  Mixers don't get added to the destination because they don't have any controls.

    One other thing to notice is what happens with mux controls (essentially switches) like #27 and #30.  For instance, mux control #27 is connected to most of the input sources - if it's set incorrectly, however, you can't capture off of that input source.

    It turns out that muxes are one of the common problem areas in audio troubleshooting.  Because if the mux isn't set correctly, you don't get audio.

    And there's something else going on here.  If you looked carefully at the wave mixer, you noticed that the mixer line has a volume and mute control.  But if you looked at the topology, there's no mute or volume. Why is that?  Well, in Windows XP, the kernel drivers that provide audio support recognize that the source line doesn't have any volume controls and they add them in for you.  That's also where the SW Synth source line and a number of the other lines come from - they're controls that are synthesized by the kernel drivers even though they're not actually implemented in the hardware.

    During review, I was pointed to the MSDN article that describes this translation, here.

  • Larry Osterman's WebLog

    What's in your audio card?

    • 8 Comments
    No, it's not a credit card (obscure reference to a current credit card campaign running here in the US).

    One of the things that really surprised me when I joined this group is the complexity of an audio adapter.

    I had figured that an audio adapter was similar to most of the devices I'd run into over the years.  Relatively simple - a small command set - a register for volume, others for bass, treble, etc, and a register hooked up to DMA for rendering and capturing.

    Boy was I wrong.

    It turns out that an audio adapter has a remarkably rich internal topology - essentially inside each audio adapter is a complex topology of controls and connections.

    We need to start at the beginning.  In general each audio adapter exposes two "filters" - the "wave" filter and the "topology" filter.  The wave filter is used for rendering audio, the topology filter contains the controls through which the audio passes.  Within each filter are a series of "pins", if you look below, the pins are the little black boxes on the outside of the larger boxes.

    Essentially the topology of the audio adapter describes a complex path from a source pin to a destination pin (which usually corresponds to an external audio jack on the adapter).

    But the path that the audio samples take through the driver can be quite complex.

    You can see a rough approximation of this topology if you dump the mixer line APIs - but the mixer APIs don't always reflect the full richness of the device's topology - for that, you need to use a lower level tool like KSStudio (a DDK tool written by a member of the audio team several years ago).

    For a really simple example of an audio topology, I've attached the topology from the AC97 adapter in my test machine below:

    The pink blob on the left is the topology filter, the pink blob on the right is the wave filter.  There are four "pin factories" on the wave filter - 0 is the render pin, 1 is the capture pin, 2 is the data output, and 3 is the data input (ignore the direction of the arrows, it doesn't reflect the actual data flow).  The topology filter has 10 different inputs, corresponding to wave out, cd audio, aux, etc - the left hand column is essentially the list of source lines on the mixer topology.  It's important to notice that the only one of these that can be used by a PC application is number 9 (the wave pin) - the other pins are connected to external hardware inputs (for instance the analog audio output of my CDROM drive is connected to pin 4).  The right hand column roughly corresponds to the jacks on the back of the audio adapter - the top one (#10) is the PC speakers, 11 is the microphone input, and 12 is the line input.

    If you follow the path from the wave filter control 0, data flows from the render pin to a DAC (control 0) then is passed in analog form to pin 9 of the topology filter, to the mixer (pin 18), then to the master volume (16) and master mute (17) before going to the speakers (10).

    One of the things to note when looking at the line in and mic graphs is the last control in the graph (27 and 30).  Those are mux controls - essentially the mux control selects which of the "input" lines (mic, line in, etc) gets routed to the capture pin. 

    Other controls to notice are 4 and 5 (connected to the mic source).  Those are AGC (Automatic Gain Control) controls - they help ensure that the input microphone gain is set correctly.

  • Larry Osterman's WebLog

    Venting steam

    • 21 Comments

    Ok, today I'm going to vent a bit...

    This has been an extraordinarily frustrating week (that's a large part of why I've had virtually no technical content this week).  Think of this one as a peek behind the curtain into a bit of what happens behind the scenes here.

    The week started off great: on Tuesday, we had a meeting that finally put the final pieces together on a month-long multi-group design effort that I've been driving (over the course of the month, the effort's wandered through the core windows team, the security team, the terminal services team, the multimedia team, and I don't know how many other teams).  For me, it's been a truly challenging development effort and I was really happy to see it finally come to a conclusion.  I've been working on the non-controversial pieces, and that stuff has been going pretty well.

    On Wednesday, I started trying to test the next set of changes I've made.

    I dropped a version of win32k.sys that I'd built (since my feature involves some minor changes to win32k.sys) onto my test machine and rebooted.  Kaboom.  The system failed to boot.  It turns out that you can't drop a checked version of the win32k.sys onto a retail build (yeah, I test on a retail OS).  This isn't totally surprising, if I'd thought about it I'd have realized that it wouldn't work.

    But it's not the end of the world, I rebooted my test machine back to the safe build - you always have to have a safe build if you're doing OS development, otherwise if the test OS crashes irretrievably (and that does happen on test OSs), you need to be able to recover your system.

    Unfortunately, one of the security changes in Longhorn meant that I was unable to put the working version of win32k.sys back on my machine when running my safe build.  Not a huge deal, and if I'd been thinking about it I could have probably tried the recovery console to repair the system.

    Instead, I decided to try to install the checked build on my test machine (that way I'd be able to just copy my checked binary over).

    One of the tools we have internally is used to automate the installation of a new OS.  Since we do this regularly, it's an invaluable tool.  Essentially, after installing it on our test machine, we can click a couple of buttons and have the latest build installed cleanly on our test machines (or we can click a different set of buttons and have a built upgraded, etc).   It's extraordinarily useful because it pretty much guarantees that we don't have to waste time chasing down a debugger and installing it, enabling the kernel debugger, etc.  It's a highly specialized tool, and is totally unsuitable for general distribution, but boy is it useful if you're installing a new build once a week or so.

    I installed the checked build, and my test machine went to work copying the binaries and running setup.  A while later, it had rebooted.

    It turns out that the driver for the network card in my test machine isn't in the current Longhorn build - this is temporary, but...  No big deal, I have a copy of the driver for the network card saved on the test machine's hard disk.

    The thing is, the auto-install tool can be temperamental.  It can be extremely sensitive to failure scenarios (if one of the domain controllers is unavailable, bad sectors on the disk, etc).  And this week the tool was particularly temperamental.  And it turns out that not having a network card is one of the situations that makes the tool temperamental.  If you don't get things just right, the script can get "stuck" (that's the problem with an automated solution - it's automated, and if something goes wrong, it gets upset).

    And that's what happened.  My test machine got stuck somewhere in the middle of running the scripts.  I'm not even sure where in the scripts it got stuck, since the tool doesn't report progress (it's intended for unattended use, so that normally isn't necessary). 

    Sigh.  Well, it's time to reinstall.  And reinstall.  And reinstall.  The stupid tool got stuck three different times.  All at the same place.  It's quite frustrating.   I'm skipping a bunch of stuff that went on here as I tried to make progress, but you get the picture.  I think I did this about 4 times yesterday alone.

    And of course the team expert for this tool is on vacation, so...

    This morning, I'm trying one more time. 

    ** Flashes to an image of someone banging their head against the wall exclaiming that they're hoping it will stop hurting soon **

    I just want to get to testing my code - I've got a bunch of work to do on this silly feature and the stupid tool is getting in my way.  Aargh.

    Oh, and one of the program managers on the team that's asking for my new feature just added a new requirement to the feature.  That's going to involve even more cross-group discussions and coordination of work items.

    Oh well.  On the other hand, I've made some decent progress documenting the new feature in its feature spec, and I've been to some really quite interesting meetings about the process for our annual review cycle (which runs through this month).

     

    Edit: One of the testers in my group came by and helped me get the machine unstuck.  Yay.

     

  • Larry Osterman's WebLog

    Hacking Billy

    • 20 Comments

    I love it when I come into work in the morning and I find something in my email that just screams "write about me".

    As a couple of people have commented, I have a LOT of toys in my office.  I love collecting them, they're just a huge amount of fun.

    One of the toys that Valorie got for me many years ago - one that I don't particularly care for, but that everyone else in my office seems to love - is my Big Mouth Billy Bass (yes, I posted a picture for those that don't have one):

    Billy's really annoying (especially since my version has a bad sector in the sampled "Take me to the River" track, which causes it to screech horribly) but people constantly stop by and push his button, so I'm keeping him.

    This morning, one of the people in my group sent me a pointer to this site which describes how to hack a Billy to play random audio files.  Unfortunately, they use Linux - I wasn't able to find a Windows client :(

    But it is tempting...

    And yes, I know this isn't a new site.  But it's the first I've heard about it...

  • Larry Osterman's WebLog

    @@#$@# License plates...

    • 26 Comments

    Nothing technical today, just a bit of "a day in the life"...

    So I'm heading home last night, driving up Avondale Road, and I hear a noise.

    Sort of a rattle, rattle, thump, kerthud, then nothing.

    I figure, no big deal, I went over something.  But it was weird - I hadn't seen anything on the road ahead of me.  So I checked my rear view mirror.

    There, in the traffic behind me, was a black rectangle being bounced around by the cars following me (I did mention this was in the middle of rush hour, didn't I?).

    I recognized the rectangle instantly - it was the license plate holder for my front license plate.  I knew it because it had fallen off the car a couple of times before - each time, either I or the car dealer had put it back on, and I'd driven off.

    But this time, it fell off on a busy road (at peak, over 1,000 cars an hour).  If you don't know Avondale, it's a four lane arterial with a central median.  There are sidewalks on each side of the road, but where the plate fell off, there were no crossings - the nearest one was about a half a mile north or south of where the plate fell off.  And of course, since it's a major arterial, there's no on-street parking.  And there's really no off-street parking either - there are either empty fields or apartment complexes (which don't have public parking) on each side of the road.

    I REALLY wanted to get the license plate - it's something like a $500 fine for driving without a plate in this state, so you REALLY don't want to be caught without it.  So I knew I was in for a bit of a hike.

    Fortunately, about 1/4 mile south of where the plate fell off, there's a car wash.  And they were nice enough to let me park my car at one of their vacuum stations while I went to get my plate.  Very, very nice of them. 

    Of course, as I mentioned, there was no crosswalk near the car wash.  But fortunately it turns out that while traffic is really heavy, there's a traffic light about 1/2 a mile south of the car wash.  So there are periodic lulls in the traffic - just enough for me to cross the street to the median.  Then I realize my first problem - there's no way through the median, it's got a boxwood hedge across it.  Sigh.  I push through the hedge (getting my pants soaking wet from the rain that's collected on it) and stand, watching the traffic zoom by two feet from my face at 45 miles an hour. 

    Eventually there's another lull, and I run across the remaining two lanes.   Time to start looking for the license plate.  I don't know exactly where the plate is, I just know it's north of where I am.  That's no big deal, I'm on the sidewalk, so even if it takes a while, I'm safe.  And then I spot it - the black rectangle of my license plate holder. 

    In the gutter next to the median strip.

    Sigh.  I wait for the next lull in the traffic and sprint across the road.  Unfortunately, when I get to it, I realize that the license plate's not attached - all I have is the license plate holder.  Crud.

    So I wait in the median some more, waiting for the traffic.  Fortunately, the road's a bit wider and the median's less overgrown here, so it's not as nerve-wracking as the first time.  But I'm still standing there waiting.  And waiting.

    Eventually traffic opens up and I can run back across the road to the safety of the sidewalk.  And it's time to keep on looking.

    About another hundred yards or so north, I finally find my license plate - it's pretty banged up but it looks like it'll still work.

    So back to my car I go. This time I go back on the southbound side of the road - there's almost no traffic on that side since it's evening.  At some point, one of my co-workers (who lives near me) calls out to me to see if I need help, fortunately, I'm just wet at this point so it's not a big deal.  I walk back to the car wash, pick up my car and put the license plate on the dashboard.

    All the way home, I can't help but think that I've somehow turned my expensive Mercedes into some kind of hillbilly car by putting the license plate on the dashboard.  I don't know why, but somehow it just feels trashy.

    This morning, I took the car to the dealership, the guys at the Mercedes dealership were really nice, they straightened out the plate and re-mounted it on my bumper.  This time they drilled new holes so hopefully it won't fall off as easily again.

    What a pain in the patooties.

     

  • Larry Osterman's WebLog

    The Endian of Windows

    • 17 Comments

    Rick's got a great post on what big and little endian are, and what the Apple switch has to do with Word for the Mac.

    In the comments, Alicia asked about Windows...

    I tried to make this a comment on his blog but the server wouldn't take it.

     

    The answer is pretty simple: For Windows, there are parts of Windows that are endian-neutral (for instance, the CIFS protocol handlers and all of the DCE RPC protocol, etc), but the vast majority of Windows is little-endian.

    A decision was made VERY long ago that Windows would not be ported to a big-endian processor.  And as far as I can see, that's going to continue.  Since almost all the new processors coming out are either little-endian, or swing both ways (this is true of all the RISC machines Windows has supported, for example), this isn't really a big deal.

  • Larry Osterman's WebLog

    When new features expose old bugs.

    • 13 Comments
    Not quite "Riffing on Raymond" but he just wrote about this, and it reminded me of a story that was related to me by the dev lead for the security team (the guys who own the LSA and authentication in Windows, not the SWI team) here at Microsoft.

    Raymond recently wrote about one of the changes for 32bit applications with the /LARGEADDRESSAWARE flag set - on 64bit windows, they get access to a full 32bit address space (4G) instead of the 31.5bit address space (3G) to which they previously had access.

    In a much earlier post I'd written about transferring a pointer between processes, and why you'd want to do that.  Well, it turns out that two of the Windows components that need to do this are the LSA and SSPI components.  For those that don't know, the LSA is the component that creates the initial user token when the user logs on (and maintains the local accounts database, etc), and SSPI is a generalized authentication package API.  So the dev lead had to do essentially the same thing that the RPC team had done - ensuring that their pointers could be marshalled across processes and guaranteeing that the shared memory structures were sized appropriately.  The work was done way back in 2000, for the first release of 64bit windows (on the Itanium), and mostly forgotten (as is all code that works reliably).

    Until the very end of the testing of Win2K3 SP1. I mean the VERY end of testing.  The builds were in escrow.  For those that aren't familiar with the escrow process, when a build is in escrow, it means that it's almost ready to ship - the final test suites are being run, and the only bugs that will be fixed are bugs that would literally cause us to recall the product from manufacturing.

    And the security lead got a report of an SSL failure in one of our server applications running under Wow64 (32bit application on Win64).  The application would work just fine for a really long time, and all of a sudden they'd start getting SSL failures. 

    Now on NT, SSL is implemented by an SSPI provider.  Running the app under the debugger showed that they were getting a STATUS_ACCESS_VIOLATION error when they tried to copy client certificate information from the server process into the LSA process.  Hmm.  That error only occurs when dealing with a bad pointer.

    It turns out that way back in 2000, when the 32-bit->64-bit marshalling code was originally written, 32bit applications ran the same on x64 as they did on x86 - they had access to only 2G of address space (I believe this was even before the /LARGEADDRESSAWARE flag existed).  Later on in the process, the decision was made to grant these applications access to the entire 4G address space (thus allowing the Win64 platform to provide benefits to 32bit applications that are address-space sensitive, like Microsoft Exchange and SQL Server).

    And it turns out that the code to convert pointers from the 32bit space to the 64bit space did its conversion using LONG datatypes.  And if the high bit was set in the 32bit value, it was quite happily sign-extended into a really big negative number on the 64bit side.  This wasn't a problem when the 32bit apps only had access to 2G of address space, since the high bit was never set.  But it only showed up in very limited cases - essentially the problem could only appear when the app had used up 2G of address space (since NT tends to allocate from the bottom of the address space to the top), and then only when it used a particular subset of the LSA or SSPI APIs.
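
    The failure mode is easy to demonstrate.  Here's a minimal sketch (my reconstruction, not the actual LSA code) of how widening a 32bit pointer value through a signed type goes wrong once the high bit is set:

    #include <cstdio>
    #include <cstdint>

    int main()
    {
        // A 32bit pointer value with the high bit set - an address above 2G,
        // which can only exist once the app has the full 4G address space.
        uint32_t ptr32 = 0x80001000;

        // Buggy conversion: widening through a signed 32bit type (like LONG)
        // sign-extends the value, producing a bogus 64bit "address".
        int64_t bad = static_cast<int32_t>(ptr32);

        // Correct conversion: widening from an unsigned type zero-extends.
        uint64_t good = ptr32;

        printf("sign-extended: 0x%016llx\n", static_cast<unsigned long long>(bad));
        printf("zero-extended: 0x%016llx\n", static_cast<unsigned long long>(good));
        return 0;
    }

    The buggy path prints 0xffffffff80001000 - hand that to a 64bit process as a pointer and you get exactly the STATUS_ACCESS_VIOLATION described above.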

    Whoops.  One way of looking at this is that the marshalling logic itself wasn't /LARGEADDRESSAWARE - it contained an example of the same kind of bug that prevents the /LARGEADDRESSAWARE flag from being the default.  Double whoops.

    So the LSA developers went around and quickly found the cases with the error and fixed them all.  And, having been burned by this once, the base team went around and took a hard look at all of their integer conversions and found a few of those as well.

    And of course, we now have BVT (build verification test) cases that will attempt to catch this kind of error in the future.

  • Larry Osterman's WebLog

    My office guest chair

    • 42 Comments

    Adam writes about his office guest chair.

    Microsoft's a big company and, like all big companies, has all sorts of silly rules about what you can have in your office.  One of them is that for office furniture, you get:

    1. A desk chair
    2. One PED (sort of a mobile filing cabinet)
    3. One curved desk piece (we have modular desk pieces with adjustable heights)
    4. One short straight desk piece
    5. One long straight desk piece
    6. One white board
    7. One cork board
    8. One or two hanging book shelves with THREE shelves (not 4)
    9. One guest chair.

    If you're a manager, you can get a round table as well (presumably to have discussions at).

    In my case, most of my office stuff is pretty stock - except that I got my manager to requisition a round table for his office for me (he already had one).  I use it to hold my manipulative puzzles.  I also have two PEDs.

    But I'm most proud of my guest chair.  I have two of them.  One's the standard Microsoft guest chair.  But the other one's special.  You see, it comes from the original Microsoft campus at 10700 Northup Way, and is at least 20 years old.

    I don't think that it's the original chair I had in my original office way back then - that was lost during one of my moves, but I found the exact match for the chair in a conference room the day after the move and "liberated" it. 

    But I've had this particular chair since at least 1988 or so.  The movers have dutifully moved it with me every time.

    Daniel loves it when he comes to my office since it's comfy - it's padded and the standard guest chairs aren't.

    Edit: Someone asked me to include a picture of the chair:

  • Larry Osterman's WebLog

    Error Code Paradigms

    • 33 Comments

    At some point when I was reading the comments on the "Exceptions as repackaged error codes" post, I had an epiphany (it's reflected in the comments to that thread but I wanted to give it more visibility).

    I'm sure it's an indication of just how slow my mind is working these days, but I've only now realized that in all the "error code" vs. "exception" discussions that seem to go on interminably, there are two UNRELATED issues being discussed.

    The first is about error semantics - what information do you hand to the caller about what failed.  The second is about error propagation - how do you report the failure to the caller.

    It's critical for any discussion about error handling to keep these two issues separate, because it's really easy to commingle them.  And when you commingle them, you get confusion.

    Consider the following example classes (cribbed in part from the previous post):

    class Win32WrapperException
    {
        // Returns a handle to the open file.  If an error occurs, it throws an object derived from
        // System.Exception that describes the failure.
        HANDLE OpenException(LPCWSTR FileName)
        {
            HANDLE fileHandle;
            fileHandle = CreateFile(FileName, xxxx);
            if (fileHandle == INVALID_HANDLE_VALUE)
            {
                throw (System.Exception(String.Format("Error opening {0}: {1}", FileName, GetLastError())));
            }
            return fileHandle;
        };
        // Returns a handle to the open file.  If an error occurs, it throws the Win32 error code that describes the failure.
        HANDLE OpenError(LPCWSTR FileName)
        {
            HANDLE fileHandle;
            fileHandle = CreateFile(FileName, xxxx);
            if (fileHandle == INVALID_HANDLE_VALUE)
            {
                throw (GetLastError());
            }
            return fileHandle;
        };
    };

    class Win32WrapperError
    {
        // Returns either NULL if the file was successfully opened or an object derived from System.Exception on failure.
        System.Exception OpenException(LPCWSTR FileName, OUT HANDLE *FileHandle)
        {
            *FileHandle = CreateFile(FileName, xxxx);
            if (*FileHandle == INVALID_HANDLE_VALUE)
            {
                return new System.Exception(String.Format("Error opening {0}: {1}", FileName, GetLastError()));
            }
            else
            {
                return NULL;
            }

        };
        // Returns either NO_ERROR if the file was successfully opened or a Win32 error code describing the failure.
        DWORD OpenError(LPCWSTR FileName, OUT HANDLE *FileHandle)
        {
            *FileHandle = CreateFile(FileName, xxxx);
            if (*FileHandle == INVALID_HANDLE_VALUE)
            {
                return GetLastError();
            }
            else
            {
                return NO_ERROR;
            }
        };
    };

    I fleshed out the example from yesterday and broke it into two classes to more clearly show what I'm talking about.  I have two classes that perform the same operation.  Win32WrapperException is an example of a class that solves the "How do I report a failure to the caller" problem by throwing exceptions.  Win32WrapperError is an example that solves the "How do I report a failure to the caller" problem by returning an error code.

    Within each class are two different methods, each of which solves the "What information do I return to the caller" problem - one returns a simple numeric error code, the other returns a structure that describes the error.  I used System.Exception as the error structure, but it could have just as easily been an IErrorInfo class, or any one of a bazillion other ways of reporting errors to callers.

    But looking at these examples, it's not clear which is better.  If you believe that reporting errors by exceptions is better than reporting by error codes, is Win32WrapperException::OpenError better than Win32WrapperError::OpenException?  Why? 

    If you believe that reporting errors by error codes is better, then is Win32WrapperError::OpenError better than Win32WrapperError::OpenException?  Why?

    When you look at the problem this way (as two unrelated problems), you can view the "exceptions vs. error codes" debate in a rather different light.  Many (most?) of the arguments that I've read in favor of exceptions as an error propagation mechanism concentrate on the additional information that the exception carries along with it.  But those arguments ignore the fact that it's totally feasible (and in fact reasonable) to define an error code based system that provides the caller with exactly the same level of information that exceptions provide.

    These two problems are equally important when dealing with errors.  The mechanism for error propagation has critical ramifications for all aspects of engineering - choosing one form of error propagation over another can literally alter the fundamental design of a system.
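
    To see how the propagation choice ripples through a design, here's a hypothetical caller-side sketch (in the same pseudo-C++ as the classes above - the function names are mine) contrasting the two styles:

    void CallerUsingExceptions(Win32WrapperException &wrapper)
    {
        // Exception propagation: the success path reads straight through,
        // and failure handling moves out of line into the catch block.
        try
        {
            HANDLE fileHandle = wrapper.OpenException(L"foo.txt");
            // ... use fileHandle ...
        }
        catch (System.Exception e)
        {
            // Recover here, or let the exception propagate to our caller.
        }
    }

    void CallerUsingErrorCodes(Win32WrapperError &wrapper)
    {
        // Error code propagation: every call site checks explicitly, and
        // failures we can't handle must be passed up the stack by hand.
        HANDLE fileHandle;
        System.Exception error = wrapper.OpenException(L"foo.txt", &fileHandle);
        if (error != NULL)
        {
            // Recover here, or hand the error object back to our caller.
            return;
        }
        // ... use fileHandle ...
    }

    Note that both callers end up with exactly the same System.Exception describing the failure - the error semantics are identical; only the propagation mechanism differs.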

    And the error semantic mechanism provides critical information for diagnosability - both for developers and for customers.  Everyone HATES seeing a message box with nothing but "Access Denied" and no additional context.

     

    And yes, before people complain, I recognize that none of the common error code returning APIs today provide the same quality of error semantics that System.Exception does as first class information - the error return information is normally hidden in a relatively unsophisticated scalar value.  I'm just saying that if you're going to enter into a discussion of error codes vs. exceptions from a philosophical point of view, then you need to recognize that there are two separate problems being discussed, and differentiate between them.

    In other words, are you advocating exceptions over error codes because you like how they solve the "what information do I return to the caller?" problem, or are you advocating them because you like how they solve the "how do I report errors?" problem?

    Similarly, are you denigrating exceptions because you don't like their solution to the "how do I report errors?" problem and ignoring the "what information do I return to the caller?" problem?

    Just some food for thought.

  • Larry Osterman's WebLog

    My favorite error code

    • 4 Comments

    Yesterday I'd mentioned X.400 OM error codes.  Originally, the primary message transport for Exchange was an X.400 transport (this changed with Exchange 2000).  At one point, one of the Exchange MTA developers told me about his favorite X.400 error code:

       ERROR_RECIPIENT_DEAD

    Yup, X.400 apparently defined a non-delivery status code indicating that the person who was supposed to receive the email message was no longer alive.

    Given that X.400 was developed by the post offices (the PTTs), this error code actually makes sense - the PTTs had non-delivery codes for physical mail indicating that the recipient was deceased, so they simply translated their physical mail error codes into electronic mail.

     

    Unfortunately, I can't come up with an independent confirmation of this (beyond my recollection of the conversation), so I can't cite a reference.

    Edit: This just gets better.  One of the MTA testers just sent me a private email indicating that Exchange had re-used that particular error code value as the error code mapping for an access denied error on public folders.  One of the gateways for Exchange did a simple-minded error code->text mapping, and we got a bug from one of our testers saying "I just got this NDR when I sent mail to a public folder: 'Mail could not be delivered.  Recipient is dead.'"

     
