Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    What does Larry's style look like?

    • 31 Comments

    Sorry this took so long - work's been really busy lately.

    Clearly I'm not enough of an egotist[1]  to believe that my style is in any way the "one true style", but over the course of writing the series on style, I realized that I've never written down my current coding style, and I think it would be vaguely humorous to write down "Larry's Coding Conventions".

    So here goes.  I'm using the keywords defined in RFC2119 within.

    Larry's Coding Conventions

    1. Files - Global file information.

    All source files (.C, .CXX, .CPP, etc) MUST be plain text files, with CR/LF (0x0D0x0A) at the end of each line.  Each file MUST end with a CRLF.  C++ code SHOULD be in files with an extension of .CPP, C code SHOULD be in files with an extension of .C.  Header files SHOULD have an extension of .H.

    Tab size MUST be set to 4, and tab characters MUST NOT appear in source files.  This allows a user to use any editor and still have the same experience while viewing the code.

    Every source file MUST start with the following header:

    /*++
     * <Copyright Notice (talk to your legal department for the format of the copyright notice)>
     *
     * Module-Name:
     *     <file name>
     *
     * Author:
     *     <author full name> (<author email address>) <date of creation>
     *
     * Abstract:
     *     <Brief abstract of the source file>
     *
     * Revision History:
     *
     *--*/


    Filenames SHOULD be representative of the contents of the files.  There SHOULD be one class (or set of functionality) per file.  So the CFoo class should be located in a source file named "Foo.cpp".

    Global variables SHOULD begin with a prefix of g_.  Care MUST be taken when declaring global variables, since they are likely sources of synchronization issues.

    Source code lines SHOULD be no longer than 100 characters long.

    2. Class definitions

    All classes SHOULD start with C in the class name.  Member variables of classes MUST start with an _ character.   Member variables are PascalCased (so a member variable could be _MyMemberVariable).

    Classes that are reference counted SHOULD follow the OLE conventions - an AddRef() and a Release() method should be used for reference counting.  If a class is reference counted, then the destructor for that class SHOULD be private.
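
    As a concrete (and entirely hypothetical) illustration of the class conventions above - the C prefix, the _PascalCased member variable, OLE-style AddRef()/Release(), and the private destructor on a reference counted class - here's a minimal sketch (function headers omitted for brevity):

    #include <windows.h>

    class CRefCountedWidget
    {
    public:
        CRefCountedWidget() : _ReferenceCount(1)
        {
        }

        ULONG AddRef()
        {
            return InterlockedIncrement(&_ReferenceCount);
        }

        ULONG Release()
        {
            ULONG referenceCount = InterlockedDecrement(&_ReferenceCount);
            if (referenceCount == 0)
            {
                delete this;
            }
            return referenceCount;
        }

    private:
        //
        //  The destructor is private because the object is reference counted -
        //  callers release their reference rather than deleting the object.
        //
        ~CRefCountedWidget()
        {
        }

        LONG _ReferenceCount;
    };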

    3. Functions and Methods - Info pertaining to various functions.

    Each routine MUST have a function header, with the following format:

    /*++
     * Function Name
     *
     *     <Description of the function>
     *
     * Inputs:
     *     <Description of the parameters to the function, or "None.">
     *
     * Returns:
     *     <Description of the return value, or "None.">
     *
     * Remarks:
     *     <Relevant information about the function, may be empty>
     *
     *--*/
     

    Function names SHOULD be representative of their function.  All function names MUST be in PascalCase as per the CLR coding standards.  If the project is using an auto-doc tool, it's acceptable to tag the inputs closer to their definition.

    Parameters to functions are also in PascalCase (note that this is a difference from the CLR coding standard).

    Local variables in functions SHOULD be camelCased.  This allows the reader to easily distinguish between local variables, parameters, and class members.

    Parameter names SHOULD NOT match the names of methods in the class.

    Code SHOULD have liberal use of vertical whitespace, with descriptive block comments every five or so lines of source code.

    Descriptive comments SHOULD have the format:

        :
        :
        //
        //<space><space>Descriptive Comment Line 1
        //<space><space>Descriptive Comment Line 2
        //
        :
        :

    Each descriptive comment starts 4 spaces from the left margin; there is a single empty comment line before and after the descriptive comment, and two spaces between the // and the start of the comment text.

    Functions SHOULD occupy no more than one screen, or about 70 lines, including comments (not including headers).  This means that each function SHOULD be at most about 40 lines of code.

    4. Predefined Identifiers (Manifest Constants)

    Manifest constants SHOULD be in all upper-case.  Instead of using #define, enums or consts SHOULD be used when possible, especially if the value being defined is unique, since they allow for better representation in the debugger (yeah, I know I've said that source level debuggers are a crutch, but...).
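
    For illustration, here's a tiny hypothetical example of that preference - the const and the enum values show up symbolically in the debugger, while the #define does not:

    #define MAX_WIDGET_COUNT 16             //  Preprocessor constant: invisible to the debugger.

    const int MAX_WIDGET_NAME_LENGTH = 64;  //  Typed constant: visible in the debugger.

    enum WIDGET_STATE                       //  Enumeration: values display by name in the debugger.
    {
        WIDGET_STATE_IDLE = 0,
        WIDGET_STATE_RUNNING,
        WIDGET_STATE_STOPPED
    };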

    5. Code layout

    Code is laid out using BSD-style - braces appear on their own line at the same indentation level as the conditional, code is indented 4 spaces on the line after the brace.

    So an if/else statement is formatted as:

        if (i < 0)
        {
            i += 1;
        }
        else
        {
            i += 1;
        }

    In general, unless semantically necessary, I use <variable> += 1 instead of <variable>++ or ++<variable>.

    Variable declarations SHOULD appear each on their own line.

    6. Code Example

    The following is an example of code in "Larry's Style".

    /*++
     * ErrorPrint
     *
     *     Print a formatted error string on the debug console.
     *
     * Inputs:
     *     Format - printf style format specifier
     *
     * Returns:
     *     None.
     *
     * Remarks:
     *     The total printf string should be less than DEBUG_STRING_BUFFER_SIZE bytes.
     *
     *--*/
    static void ErrorPrint(LPCSTR Format, ...)
    {
        int result = 0;
        static char outputBuffer[DEBUG_STRING_BUFFER_SIZE];
        va_list marker;

        va_start(marker, Format);
        result = StringCchVPrintfA(outputBuffer, DEBUG_STRING_BUFFER_SIZE, Format, marker);
        if (result == S_OK)
        {
            OutputDebugStringA(outputBuffer);
        }
        va_end(marker);
    }

     

    [1] Ok, I've got a blog, that makes me an egotist, but not enough of an egotist[2]
    [2] Apologies to KC for stealing her footnoting style :)

    Edit: pszFormat->Format.

     

  • Larry Osterman's WebLog

    Windows Error Reporting and Online Crash Analysis are your friends.

    • 31 Comments

    I normally don’t do “me too” posts, since I figure that most of the people reading my blog are also looking at the main weblogs.asp.net/blogs.msdn.com feed, but I felt obliged to chime in on this one.

    A lot of people on weblogs.msdn.com have been posting this, but I figured I’d toss in my own version.

    When you get a “your application has crashed, do you want to let Microsoft know about it?” dialog, then yes, please send the crash report in.  We’ve learned a huge amount about where we need to improve our systems from these reports.  I know of at least three different bug fixes that I’ve made in the audio area that came directly from OCA (online crash analysis) reports.  Even if the bugs are in drivers that we didn’t write (Jerry Pisk commented about Creative Labs’ drivers here, for example), we still pass the info on to the driver authors.

    In addition, we do data mining to see if there are common mistakes made by different driver authors and we use these to improve the driver verifier – if a couple of driver authors make the same mistake, then it makes sense for us to add tests to ensure that the problems get fixed on the next go-round.

    And we do let 3rd party vendors review their data.  There was a chat about this in August of 2002 where Greg Nichols and Alther Haleem discussed how it’s done.  The short answer is you go here and follow the instructions.  You have to have a Verisign Class 3 code-signing ID to participate, though.

    Bottom line: Participate in WER/OCA – Windows gets orders of magnitude more stable because of it.  As Steve Ballmer said:

    About 20 percent of the bugs cause 80 percent of all errors, and — this is stunning to me — one percent of bugs cause half of all errors.

    Knowing where the bugs are in real-world situations allows us to catch the high visibility bugs that plague our users that we’d otherwise have no way of discovering.

  • Larry Osterman's WebLog

    Should I check the parameters to my function?

    • 31 Comments

    I just had an interesting discussion with one of the testers in my group.

    He had just finished filing a series of bugs against our components because they weren’t failing when he passed bogus pointers to the API.  Instead, they raised a 0xC0000005 exception and crashed his application.

    The APIs did fail if he passed a null pointer in, with E_POINTER. 

    But he felt that the API should check all the bogus pointers passed in and fail with E_POINTER if the pointer passed in didn’t point to valid memory.

    This has been a subject of a lot of ongoing discussion over the years internally here at Microsoft.  There are two schools of thought:

    School one says “We shouldn’t crash the application on bogus data.  Crashing is bad.  So we should check our parameters and return error codes if they’re bogus”.

    School two says “GIGO – if the app hands us garbage, then big deal if it crashes”.

    I’m firmly in the second camp (not surprisingly, if you know me).  There are a lot of reasons for this.  The biggest one is security.  The way you check for bad pointers on Win32 is by calling the IsBadReadPtr and IsBadWritePtr APIs.  Michael Howard calls these APIs “CrashMyApplication” and “CorruptMemoryAndCrashMySystem” respectively.  The problem with IsBadReadPtr/IsBadWritePtr is that they do exactly what they’re advertised as doing:  they read and/or write to the memory location specified, with an exception handler wrapped around the read/write.  If an exception is thrown, they fail; if not, they succeed.

    There are two problems with this.  The first is that the only thing IsBadReadPtr/IsBadWritePtr verifies is that, at the instant the API is called, there was valid memory at that location.  There’s nothing to prevent another thread in the application from unmapping the virtual address passed into IsBadReadPtr immediately after the call is made.  Which means that any error checks you made based on the results of this API aren’t valid (this is called out in the documentation for IsBadWritePtr/IsBadReadPtr).
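
    To make that race concrete, here's a contrived sketch (the function and its callers are hypothetical, this isn't code from any real API):

    #include <windows.h>
    #include <string.h>

    BOOL CopyToUserBuffer(void *UserBuffer, const void *Data, SIZE_T Size)
    {
        if (IsBadWritePtr(UserBuffer, Size))
        {
            //
            //  The buffer looked bad at this instant, so reject it.
            //
            return FALSE;
        }

        //
        //  The buffer looked good at this instant - but nothing stops another
        //  thread from unmapping (or re-using) that address right here, before
        //  the copy below happens.  The "validation" bought us nothing.
        //
        memcpy(UserBuffer, Data, Size);
        return TRUE;
    }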

    The other one is worse.  What happens if the memory address passed into IsBadReadPtr is a stack guard page (a guard page is a page kept at the bottom of the stack – when the system top level exception handler sees a fault on a guard page, it will grow the thread’s stack, up to the thread’s stack limit)?  Well, IsBadReadPtr will catch the guard page exception and will handle it (because IsBadReadPtr handles all exceptions).  So the system exception handler doesn’t see the exception.  Which means that when that thread later runs, its stack won’t grow past the current limit.  By calling IsBadReadPtr in your API, you’ve turned an easily identifiable application bug into a really subtle stack overflow bug that may not be encountered until many minutes (or hours) later.

    The other problem with aggressively checking for bad parameters on an API is this: what happens if the app doesn’t check the return code from the API?  This means that they could easily have a bug in their code that passes a bogus pointer into IsBadWritePtr, thus corrupting memory.  But, since they didn’t check the return code, they don’t know about their bug.  And, again, much later the heap corruption bug that’s caused by the call to IsBadWritePtr shows up.  If the API had crashed, then they’d find the problem right away.

    Now, having said all this, if you go with school two, you’ve still got a problem – you can’t trust the user’s buffers.  At all.  This means you’ve got to be careful when touching those buffers to ensure that you’re not going to deadlock the process by (for instance) holding onto a critical section while writing to the user’s buffer.
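
    Here's a rough sketch of what that looks like in practice (the widget structure and names are hypothetical): reject NULL, capture what you need while holding the lock, and only touch the caller's buffer after the lock has been released - that way, if the pointer is bogus, the access violation happens at the point of the caller's bug instead of while you're holding the lock.

    #include <windows.h>
    #include <strsafe.h>

    typedef struct _WIDGET
    {
        CRITICAL_SECTION Lock;
        WCHAR Name[64];
    } WIDGET, *PWIDGET;

    HRESULT CopyWidgetName(PWIDGET Widget, LPWSTR Name, ULONG NameLength)
    {
        WCHAR localName[64];

        if (Name == NULL)
        {
            return E_POINTER;
        }

        //
        //  Capture the data we need while holding the lock...
        //
        EnterCriticalSection(&Widget->Lock);
        StringCchCopyW(localName, ARRAYSIZE(localName), Widget->Name);
        LeaveCriticalSection(&Widget->Lock);

        //
        //  ...and only touch the caller's buffer once the lock is released.
        //
        return StringCchCopyW(Name, NameLength, localName);
    }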

    The other thing to keep in mind is that there are some situations where it’s NOT a good idea to crash the user’s app.  For example, if you’re using RPC, then RPC uses structured exception handling to communicate RPC errors back to the application (as opposed to API return codes).  So sometimes you have no choice but to catch the exceptions and return them.  The other case is if someone has written and shipped an existing API that uses IsBadReadPtr to check for bad pointers on input, it may not be possible to remove this because there may be applications that depend on this behavior.

    So in general, it’s a bad idea to use IsBadXxxPtr on your input parameters to check for correctness.  Your users may curse you for crashing their app when they screw up, but in the long term, it’s a better idea.

  • Larry Osterman's WebLog

    Larry's rules of software engineering #2: Measuring testers by test metrics doesn't.

    • 30 Comments

    This one’s likely to get a bit controversial :).

    There is an unfortunate tendency among test leads to measure the performance of their testers by the number of bugs they report.

    As best as I’ve been able to figure out, the logic works like this:

    Test Manager 1: “Hey, we want to have concrete metrics to help in the performance reviews of our testers.  How can we go about doing that?”
    Test Manager 2: “Well, the best testers are the ones that file the most bugs, right?”
    Test Manager 1: “Hey that makes sense.  We’ll measure the testers by the number of bugs they submit!”
    Test Manager 2: “Hmm.  But the testers could game the system if we do that – they could file dozens of bogus bugs to increase their bug count…”
    Test Manager 1: “You’re right.  How do we prevent that then? – I know, let’s just measure them by the bugs that are resolved “fixed” – the bugs marked “won’t fix”, “by design” or “not reproducible” won’t count against the metric.”
    Test Manager 2: “That sounds like it’ll work, I’ll send the email out to the test team right away.”

    Sounds good, right?  After all, the testers are going to be rated by an absolute value based on the number of real bugs they find – not the bogus ones, but real bugs that require fixes to the product.

    The problem is that this idea falls apart in reality.

    Testers are given a huge incentive to find nit-picking bugs – instead of finding significant bugs in the product, they try to find the bugs that increase their number of outstanding bugs.  And they get very combative with the developers if the developers dare to resolve their bugs as anything other than “fixed”.

    So let’s see how one scenario plays out using a straightforward example:

    My app pops up a dialog box with the following:

     

                Plsae enter you password:  _______________ 

     

    Where the edit control is misaligned with the text.

    Without a review metric, most testers would file a bug with a title of “Multiple errors in password dialog box” which then would call out the spelling error and the alignment error on the edit control.

    They might also file a separate localization bug because there’s not enough room between the prompt and the edit control (separate because it falls under a different bug category).

    But if the tester has their performance review based on the number of bugs they file, they now have an incentive to file as many bugs as possible.  So the one bug morphs into two bugs – one for the spelling error, the other for the misaligned edit control. 

    This version of the problem is a total and complete nit – it’s not significantly more work for me to resolve one bug than it is to resolve two, so it’s not a big deal.

    But what happens when the problem isn’t a real bug – remember – bugs that are resolved “won’t fix” or “by design” don’t count against the metric so that the tester doesn’t flood the bug database with bogus bugs artificially inflating their bug counts. 

    Tester: “When you create a file when logged on as an administrator, the owner field of the security descriptor on the file is set to BUILTIN\Administrators, not the current user”.
    Me: “Yup, that’s the way it’s supposed to work, so I’m resolving the bug as by design.  This is because NT considers all administrators interchangeable, so when a member of BUILTIN\Administrators creates a file, the owner is set to the group to allow any administrator to change the DACL on the file.”

    Normally the discussion ends here.  But when the tester’s going to have their performance review score based on the number of bugs they submit, they have an incentive to challenge every bug resolution that isn’t “Fixed”.  So the interchange continues:

    Tester: “It’s not by design.  Show me where the specification for your feature says that the owner of a file is set to the BUILTIN\Administrators account”.
    Me: “My spec doesn’t.  This is the way that NT works; it’s a feature of the underlying system.”
    Tester: “Well then I’ll file a bug against your spec since it doesn’t document this.”
    Me: “Hold on – my spec shouldn’t be required to explain all of the intricacies of the security infrastructure of the operating system – if you have a problem, take it up with the NT documentation people”.
    Tester: “No, it’s YOUR problem – your spec is inadequate, fix your specification.  I’ll only accept the “by design” resolution if you can show me the NT specification that describes this behavior.”
    Me: “Sigh.  Ok, file the spec bug and I’ll see what I can do.”

    So I have two choices – either I document all these subtle internal behaviors (and security has a bunch of really subtle internal behaviors, especially relating to ACL inheritance) or I chase down the NT program manager responsible and file bugs against that program manager.  Neither of which gets us closer to shipping the product.  It may make the NT documentation better, but that’s not one of MY review goals.

    In addition, it turns out that the “most bugs filed” metric is often flawed in the first place.  The tester that files the most bugs isn’t necessarily the best tester on the project.  Often times the tester that is the most valuable to the team is the one that goes the extra mile and spends time investigating the underlying causes of bugs and files bugs with detailed information about possible causes of bugs.  But they’re not the most prolific testers because they spend the time to verify that they have a clean reproduction and have good information about what is going wrong.  They spent the time that they would have spent finding nit bugs and instead spent it making sure that the bugs they found were high quality – they found the bugs that would have stopped us from shipping, and not the “the florblybloop isn’t set when I twiddle the frobjet” bugs.

    I’m not saying that metrics are bad.  They’re not.  But basing people’s annual performance reviews on those metrics is a recipe for disaster.

    Somewhat later:  After I wrote the original version of this, a couple of other developers and I discussed it a bit at lunch.  One of them, Alan Ludwig, pointed out that one of the things I missed in my discussion above is that there should be two halves of a performance review:

                MEASUREMENT:  Give me a number that represents the quality of the work that the employee is doing.
                EVALUATION:  Given that measurement, is the employee doing a good job or a bad job?  In other words, you need to assign a value to the metric – how relevant is the metric to the employee’s performance.

    He went on to discuss the fact that any metric is worthless unless it is periodically reevaluated to determine how relevant it still is – a metric is only as good as its validity.

    One other comment that was made was that absolute bug count metrics cannot be a measure of the worth of a tester.  The tester that spends two weeks and comes up with four buffer overflow errors in my code is likely to be more valuable to my team than the tester that spends the same two weeks and comes up with 20 trivial bugs.  Using the severity field of the bug report was suggested as a metric, but Alan pointed out that this only worked if the severity field actually had significant meaning, and it often doesn’t (it’s often very difficult to determine the relative severity of a bug, and often the setting of the severity field is left to the tester, which has the potential for abuse unless all bugs are externally triaged, which doesn’t always happen).

    By the end of the discussion, we had all agreed that bug counts were an interesting metric, but they couldn’t be the only metric.

    Edit: To remove extra <p> tags :(

  • Larry Osterman's WebLog

    Coding with Style

    • 30 Comments

    AKA, Software archaeology, a working example.

    I've touched on this a couple of times before, but I'd like to talk a bit about style.

    This isn't an issue for people working on a single person project, but once you start working in a team, the issue of coding style comes up often.

    The problem is that everyone has their own style, and they're usually pretty different.  K&R lays out several different styles, and Kernighan and Plauger lay out still more.  Style encompasses everything from where the braces go, to how far they're indented (and where they're indented), to the names of identifiers (both variables and function names), to where the variables are declared.  Even whether source files contain spaces or tabs (and what the tab setting is) is an aspect of programming style.

    Most programmers learn a particular style when they start to learn how to program, and then they tend to stick with that style for their career (with minor modifications).  One thing to keep in mind about style is that it's personal.  Nobody's style is any better than anyone else's style, they're all ultimately a matter of personal choice (having said that, some personal style choices can be rather distracting, for a simple example, see this Daily WTF post).

    When you're laying out a project at the beginning, you essentially have two ways to go when determining the style for your code.  The first is to get all the developers together in a meeting, hash out the details, and write a specification for the coding style for your project.  And then you've got to be diligent in enforcing that style, especially if the style you're enforcing is different from that of an individual developer in your group.

    The other way is to be somewhat more ad-hoc in coding style - typically each developer owns one or more components in the system, you let them develop those components with their own style, and trust that the final product will be harmonious.  If you're using paired programming, or if there is more than one owner of a piece of software, then clearly the individual developers need to come to consensus on their style.

    There are times that the ad-hoc coding style is effectively forced on an organization.  This happens a lot when dealing with legacy code - often times the code was written by developers who have long left the organization, and thus they are no longer maintaining the code.

    There's one critical thing to deal with when you're dealing with disparate styles - you should never, ever change existing code to match your personal style.  I've seen this happen in real life, two developers on a team each have their own idea of what a variable should be named, and as the two developers work on the module, each of them goes through and makes sweeping changes to rename the variables to meet their personal naming convention.  The biggest problem with this is that sweeping changes like this cause huge issues with code reviews - each gratuitous change to the code shows up as a change when diff'ing source files, and they decrease the signal-to-noise ratio, which reduces the effectiveness of the code review - the code reviewer can easily miss a critical error if it's buried deep in a sea of variable name changes.

    For those of you who have worked on long running projects, how many times have you run across code like this?

    typedef int CB;
    typedef wchar * SZ;
    #define length 20
    void MyFunctionName(SZ szInput, CB cbInput)
    {
        CB stringlength = strlen(szInput);
        char myBuffer[length];
        if (stringlength < length)
        {
            if (szInput[0] == 'A')
                {
                    <Handle the case where the input string starts with 'A'>
                }
            else {
            if ('B' == szInput[0]) {
                <Handle the case where the input string starts with 'B'>
            }
            }
        }
    }

    What happened here?  Well, the code was worked on by four different developers, each of whom had their own style, and who applied it to the source.  The one that authored the routine used strict Hungarian - they defined types for SZ and CB, and strictly typed the parameters to the routine.  The next developer came along, and renamed the local variables to match their own personal style, and added the check for case "A".  The second developer used the following style for their if/then statements:

    if (if condition, constants in conditionals appear on the right)
        {
            <if statement>
        }
    else
        {
            <else statement>
        }

    Now, a 3rd developer came along; this developer added the check for case "B".  The third developer used the following style for their if/then statements:

    if (if condition, constants in conditionals appear on the left) {
        <if statement>
    }
    else {
        <else statement>
    }

    And then, finally, the 4th developer came along and added the buffer overflow check.  Their if/then style was:

    if (if condition, constants in conditionals appear on the left)
    {
        <if statement>
    }
    else
    {
        <else statement>
    }

    None of these styles is wrong, they're just different.  But when put together, they turn into a totally unmaintainable mess.

    So it's utterly critical, if you're working on a project that uses ad-hoc coding standards, that you adapt your coding style to match the code.  The future maintainers of the code will thank you for it.  Even if your project has strict coding standards, the instant you have to deal with legacy code, you MUST adopt the coding standard of the original author of the code, even if it doesn't match your group's standards.

  • Larry Osterman's WebLog

    Building a flicker free volume control

    • 30 Comments

    When we shipped Windows Vista, one of the really annoying things about the volume control's UI was that whenever you resized it, it would flicker.

    To be more specific, the right side of the control would flicker – the rest didn’t flicker (which was rather strange).

     

    Between the Win7 PDC release (what we called M3 internally) and the Win7 Beta, I decided to bite the bullet and see if I could fix the flicker.  It seemed like I tried everything to make the flickering go away, but I wasn’t able to do it until I ran into the WM_PRINTCLIENT message, which allowed me to direct all of the internal controls on the window to paint themselves.

    Basically on a paint call, I’d take the paint DC and send a WM_PRINTCLIENT message to each of the controls in sndvol, asking them each to paint themselves to the new DC.  This worked almost perfectly – I was finally able to build a flicker free version of the UI.  The UI wasn’t perfect (for instance, the animations that faded in the “flat buttons” didn’t fire), but the UI worked just fine and looked great, so I was happy that I’d finally nailed the problem.  That happiness lasted until I got a bug report in that I simply couldn’t figure out.  It seems that if you launched the volume mixer, set the focus to another application, then selected the volume mixer’s title bar and moved the mixer, there were a ton of drawing artifacts left on the screen.

    I dug into it a bunch and was stumped.  It appeared that the clipping rectangle sent in the WM_PAINT message to the top level window didn’t include the entire window, thus portions of the window weren’t erased.  I worked on this for a couple of days trying to figure out what was going wrong, and I finally asked for help on one of our internal mailing lists.

    The first response I got was that I shouldn’t use WM_PRINTCLIENT because it was going to cause me difficulty.  I’d already come to that conclusion – by trying to control every aspect of the drawing experience for my app, I was essentially working against the window manager – that’s why the repaint problem was happening.  By calling WM_PRINTCLIENT I was essentially putting a band-aid on the real problem but I hadn’t solved the real problem, all I’d done is to hide it.

     

    So I had to go back to the drawing board.  Eventually (with the help of one of the developers on the User team) I finally tracked down the original root cause of the problem and it turns out that the root cause was somewhere totally unexpected.

    Consider the volume UI:

    [Screenshot: the volume mixer UI, showing the “Devices” and “Applications” groups]

    The UI is composed of two major areas: The “Devices” group and the “Applications” group.  There’s a group box control wrapped around the two areas.

    Now lets look at the group box control.  For reasons that are buried deep in the early history of Windows, a group box is actually a form of the “button” control.  If you look at the window styles for a button in SpyXX, you’ll see:

    [Screenshot: the window class styles for a button control, as shown in SpyXX]

     

    Notice the CS_VREDRAW and CS_HREDRAW window class styles.  The MSDN documentation for class styles says:

    CS_HREDRAW - Redraws the entire window if a movement or size adjustment changes the width of the client area.
    CS_VREDRAW - Redraws the entire window if a movement or size adjustment changes the height of the client area.

    In other words every window class with the CS_HREDRAW or CS_VREDRAW style will always be fully repainted whenever the window is resized (including all the controls inside the window).  And ALL buttons have these styles.  That means that whenever you resize any buttons, they’re going to flicker, and so will all of the content that lives below the button.  For most buttons this isn’t a big deal but for group boxes it can be a big issue because group boxes contain other controls.

    In the case of sndvol, when you resize the volume control, we resize the applications group box (because it’s visually pinned to the right side of the dialog).  Which causes the group box and all of its contained controls to repaint and thus flicker like crazy.  The only way to fix this is to remove the CS_HREDRAW and CS_VREDRAW styles from the window class for the control.

    The good news is that once I’d identified the root cause, the solution to my problem was relatively simple.  I needed to build my own custom version of the group box which handled its own painting and didn’t have the CS_HREDRAW and CS_VREDRAW class styles.  Fortunately it’s really easy to draw a group box – if themes are enabled, a group box can be drawn with the DrawThemeBackground API using the BP_GROUPBOX part, and if theming is disabled, you can use the DrawEdge API to draw the group box.  Once I added the new control and dealt with a number of other clean-up issues (making sure that the right portions of the window were invalidated when the window was resized, making sure that my top level window had the WS_CLIPCHILDREN style, and making sure that each of the sub windows had the WS_CLIPSIBLINGS style), I had a version of sndvol that was flicker free AND that let the window manager handle all the drawing complexity.  There are still some minor visual gotchas in the UI (for example, if you resize the window using the left edge, the right side of the group box “shudders” a bit – this is apparently an artifact that’s outside my control – other apps have similar issues when resized on the left edge), but they’re acceptable.
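
    For the curious, here's roughly what that looks like (this is a from-scratch sketch, not the actual sndvol code - the class name and window procedure are made up):

    #include <windows.h>
    #include <uxtheme.h>
    #include <vssym32.h>

    LRESULT CALLBACK GroupBoxWndProc(HWND Window, UINT Message, WPARAM WParam, LPARAM LParam)
    {
        if (Message == WM_PAINT)
        {
            PAINTSTRUCT paintStruct;
            HDC paintDC = BeginPaint(Window, &paintStruct);
            RECT clientRect;
            HTHEME theme;

            GetClientRect(Window, &clientRect);
            theme = OpenThemeData(Window, L"BUTTON");
            if (theme != NULL)
            {
                //
                //  Themes are enabled - draw the themed group box frame.
                //
                DrawThemeBackground(theme, paintDC, BP_GROUPBOX, GBS_NORMAL, &clientRect, NULL);
                CloseThemeData(theme);
            }
            else
            {
                //
                //  Themes are disabled - fall back to DrawEdge.
                //
                DrawEdge(paintDC, &clientRect, EDGE_ETCHED, BF_RECT);
            }
            EndPaint(Window, &paintStruct);
            return 0;
        }
        return DefWindowProcW(Window, Message, WParam, LParam);
    }

    ATOM RegisterFlickerFreeGroupBox(HINSTANCE Instance)
    {
        WNDCLASSEXW windowClass = { sizeof(windowClass) };

        windowClass.style = 0;              //  Deliberately no CS_HREDRAW or CS_VREDRAW.
        windowClass.lpfnWndProc = GroupBoxWndProc;
        windowClass.hInstance = Instance;
        windowClass.hCursor = LoadCursor(NULL, IDC_ARROW);
        windowClass.lpszClassName = L"FlickerFreeGroupBox";
        return RegisterClassExW(&windowClass);
    }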

    As an added bonus, now that I was no longer painting everything manually, the fade-in animations on the flat buttons started working again!

     

    PS: While I was writing this post, I ran into this tutorial on building flicker free applications.  I wish I’d run into it while I was trying to deal with the flickering problem, because it nicely lays out how to solve the problem.

  • Larry Osterman's WebLog

    An Overview of Windows Sound and "Glitching" Issues

    • 30 Comments

    Nick White over at the Windows Vista Blog just posted an article written by Steve Ball, the PM in charge of the sounds team.

     

    It does a pretty good job of covering why my $2000 PC sometimes glitches like crazy, while my $20 CD player works perfectly every single time.

     

    It's worth a read.

  • Larry Osterman's WebLog

    The Windows command line is just a string...

    • 30 Comments

    Yesterday, Richard Gemmell left the following comment on my blog (I've trimmed to the critical part):

    I was referring to the way that IE can be tricked into calling the Firefox command line with multiple parameters instead of the single parameter registered with the URL handler.

    I saw this comment and was really confused for a second, until I realized the disconnect.  The problem is that *nix and Windows handle command line arguments totally differently.  On *nix, you launch a program using the execve API (or its cousins execv, execvp, execl, execlp, and execle).  The interesting thing about these APIs is that they allow the caller to specify each of the command line arguments - the signature for execve is:

    int execve(const char *filename, char *const argv [], char *const envp[]);

    In *nix, the shell is responsible for turning the string provided by the user into the argv parameter to the program[1].

     

    On Windows, the command line doesn't work that way.  Instead, you launch a new program using the CreateProcess API, which takes the command line as a string (the lpCommandLine parameter to CreateProcess).  It's considered the responsibility of the newly started application to call the GetCommandLine API to retrieve that command line and parse it (possibly using the CommandLineToArgvW helper function).
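
    To make the contrast concrete, here's a minimal sketch of both halves of the Windows model (the program name and arguments are just placeholders):

    #include <windows.h>
    #include <shellapi.h>
    #include <stdio.h>

    //
    //  The parent side: the command line handed to CreateProcess is just a single string.
    //
    void LaunchChild(void)
    {
        WCHAR commandLine[] = L"child.exe /verbose \"an argument with spaces\"";
        STARTUPINFOW startupInfo = { sizeof(startupInfo) };
        PROCESS_INFORMATION processInformation;

        if (CreateProcessW(NULL, commandLine, NULL, NULL, FALSE, 0, NULL, NULL,
                           &startupInfo, &processInformation))
        {
            CloseHandle(processInformation.hThread);
            CloseHandle(processInformation.hProcess);
        }
    }

    //
    //  The child side: retrieve the string and decide for yourself how to parse it
    //  (here using the CommandLineToArgvW helper).
    //
    void DumpMyArguments(void)
    {
        int argc;
        int i;
        LPWSTR *argv;

        argv = CommandLineToArgvW(GetCommandLineW(), &argc);
        if (argv != NULL)
        {
            for (i = 0 ; i < argc ; i += 1)
            {
                wprintf(L"argv[%d] = %s\n", i, argv[i]);
            }
            LocalFree(argv);
        }
    }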

    So when Richard talked about IE "tricking" Firefox by calling it with multiple parameters, he was apparently thinking about the *nix model where an application launches a new application with multiple command line arguments.  But that model isn't the Windows model - instead, in the Windows model, the application is responsible for parsing its own command line arguments, and thus IE can't "trick" anything - it's just asking the shell to pass a string to the application, and it's the application's job to figure out how to handle that string.

    We can discuss the relative merits of that decision, but it was a decision made over 25 years ago (in MS-DOS 2.0).

     

    [1] Yes, I know that the execl() API allows you to specify a command line string, but the execl() API parses that command line string into argv and argc before calling execve.

  • Larry Osterman's WebLog

    So why are applets so bad, anyway?

    • 29 Comments

    There's a simple answer to that question.  As I mentioned in the first post in this series, "It's my machine dagnabbit".  The simple answer is that applets consume resources that can be better used by the customer.

    At an absolute minimum, each applet process consumes a process (no duh - that was a stupid statement, Larry).  But you need to realize that each process on Windows consumes a significant amount of system resources - you can see this in Vista's taskmgr.

    There are three columns that are interesting:  Working Set, Commit Size and Memory.  Commit Size is the amount of memory reserved for the process (so it can be insanely large), Working Set is the amount of physical memory that the process is currently consuming, and Memory is the amount of working set that's not being used by DLLs.

    On my machine, to pick on two applets that I have running, you find:

    • FlashUtil9d.exe consuming 4.5M of working set, 1.3M of commitment and 760K of Memory
    • FwcMgmt.exe (the ISA firewall client) consuming 4M of working set, 1.6M of commitment and 300K of Memory

    That 700K or so is real, physical RAM that's being actively used by the process (otherwise it would have been swapped out).  With multiple applets running, it adds up FAST.  On today's big machines, this isn't a big deal, but on a machine with less memory, it can be crippling.

     

    In my last post, I categorized applets into 4 categories (updaters, tray notification handlers, helper applications and services).  In addition to the common issues mentioned above, each of these has its own special set of issues associated with it.

    Updaters often run all the time, even though they're only actually doing work once a day (or once a month).  That means that they consume resources all the time that they're active.  Adding insult to injury, on my machine at home, I have an updater that is manifested to require elevation (which means I get the "your app requires elevation" popup whenever it tries to run). 

    Tray notification handlers also run all the time, and adding insult to injury, they clutter up the notification area.  The more items in the notification area, the less useful it is.  This is actually the primary justification for the "big 4" notification area items in Vista - people kept on finding that the 3rd party notification area icons crowded out functionality they wanted to access.  In addition, notification handlers seem to love popping up toast on the desktop, which often interrupts the user.  In addition, since tray handlers often run synchronously at startup, they delay system boot time.

    Helper applications don't have any specific issues, from what I've seen.  They just consume resources when they're running.

    Services are both good and bad.  Each Windows service has a start type which lets the system know what to do with the service on startup.  There are 3 relevant start types for most services: AutoStart, DemandStart and Disabled.  When a service is marked as AutoStart, it starts on every boot of the system, which degrades the system startup time.  In addition, because services often run in highly privileged accounts, the author of the service needs to take a great deal of care to ensure that they don't introduce security holes into the system.  Before Vista, high privileged services were notorious for popping up UI on the user's desktop, a practice so dangerous, it justified its own category of security threat ("shatter attacks").  In Vista, changes were made to eliminate classic shatter attacks for all subsequent versions of the OS, so fortunately this issue isn't as grave as it was in the past.

     

     

    Tomorrow:  So how do you mitigate the damage that applets can cause?

  • Larry Osterman's WebLog

    A few of my favorite Win7 Sound features – Stream Switching

    • 29 Comments

    Way back when I was in college, I learned Lisp using a derivative of Lisp called MACLISP (named for MIT’s project MAC, not for anything that came from a fruity company in Cupertino).  One of the coolest features that MACLISP offered was the (DWIM) command – basically if you had a typo when entering an s-list into MACLISP, you could type (DWIM) and the MACLISP interpreter would fix your code for you (and yeah, it usually got it somewhat wrong :)).

    Stream Switching is a DWIM feature in the audio stack.  If an application is rendering to a default audio device and the device is removed, the audio stream will automatically switch to the new default device.  The same happens if the default device changes for other reasons (if the user changes the default device, for example) or if the sample rate on the device changes (this can happen with certain types of audio hardware that allow external controls to change the sample rate for the device).

    We were able to figure out how to implement the stream switching logic in a fashion that causes it to work without requiring changes from 3rd party applications, which is really cool because it allows us to enable new scenarios without breaking appcompat – as long as the application is rendering to a default endpoint, we’ll stream switch the application without the app being notified.

    If an application is rendering to a specific endpoint, we’re not going to stream switch when the endpoint is removed – we don’t know the reasons for the application choosing that particular endpoint, so we don’t attempt to second guess the application’s intent (maybe the user has asked that their music player only play through the headphones and not through the speakers because playing through the speakers would disturb the baby).

    We also don’t support stream switching if the application is using WASAPI (the low level audio rendering APIs) to render audio.  That’s for a number of reasons, but mostly it’s because the presumption is that any application that has deliberately chosen a low level rendering API like WASAPI doesn’t want this kind of behavior.

     

    The stream switching logic is really cool in action, especially if you’ve got a machine which supports dynamic jack detection – when you’re watching a DVD in windows media player and you plug in a pair of headphones, poof – the audio gets redirected to the headphones just like you’d expect it to.

  • Larry Osterman's WebLog

    So what exactly IS COM anyway?

    • 29 Comments

    A couple of days ago, David Candy asked (in a comment on a previous COM related post) what exactly was COM.

    Mike Dimmick gave an excellent answer to the question, and I'd like to riff on his answer a bit.

    COM is just one of three associated technologies: RPC, COM and OLE (really OLE Automation).

    Taken in turn:

    RPC, or Remote Procedure Call, is actually the first of the "Cairo" features to debut in Windows (what, you didn't know that there were parts of Cairo already in Windows?  Yup, actually, almost all of what was called "Cairo" is currently in Windows).

    RPC provides a set of services to enable inter-process and inter-machine procedure calls.  The RPC technology is actually an implementation of the DCE RPC specification (the DCE APIs are renamed to be more Windows-like), and is on-the-wire interoperable with 3rd party DCE implementations.  RPC deals with two types of entities, clients and servers.  The client makes requests, and the server responds to those requests.  You tell RPC about the semantics of the procedures you're calling with an IDL file (IDL stands for "Interface Definition Language" - it defines the interface between client and server).  IDL files are turned into C files by MIDL, the "Microsoft IDL compiler".

    When RPC needs to make a call from one process to another, it "marshalls" the parameters to the function call.  Marshalling is essentially the process of flattening the data structures (using the information in the IDL file), copying the data to the destination and then unpacking the flattened data into a format that the receiver can use.

    RPC provides an extraordinarily rich set of services - it's essentially trivial to write an application that says "I want to talk to someone on my local network segment who's providing this service, but I don't care who they are - find out who's offering this service and let me talk to them" and RPC will do the hard work.

    The next technology, COM, is built on RPC.  COM stands for "Component Object Model".  COM is many, many, things - it's a design pattern, it's a mechanism to hide implementation of functionality, it's an inter-process communication mechanism, it's the kitchen sink.

    At its heart, COM's all about a design pattern that's based around "Interfaces".  Just as RPC defines an interface as the contract between a client and a server, COM defines an interface as a contract between a client of a set of functionality and the implementor of that functionality.  All COM interfaces are built around a single "base" interface called IUnknown, which provides reference count semantics, and the ability to query to see if a particular object implements a specific interface.  In addition, COM provides a standardized activation pattern (CoCreateInstance) that allows the implementation of the object to be isolated from the client of the object. 
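
    As a concrete (if somewhat contrived) illustration of those patterns - activation through CoCreateInstance, interface discovery through QueryInterface, and reference counting through Release - here's a sketch that uses the shell's ShellLink object (the file paths are just placeholders):

    #include <windows.h>
    #include <shlobj.h>

    void CreateNotepadShortcut(void)
    {
        if (SUCCEEDED(CoInitialize(NULL)))
        {
            IShellLinkW *shellLink = NULL;

            //
            //  Activation: we name the class and the interface we want; the
            //  implementation of the object stays hidden from us.
            //
            if (SUCCEEDED(CoCreateInstance(CLSID_ShellLink, NULL, CLSCTX_INPROC_SERVER,
                                           IID_IShellLinkW, (void **)&shellLink)))
            {
                IPersistFile *persistFile = NULL;

                shellLink->SetPath(L"C:\\Windows\\Notepad.exe");

                //
                //  IUnknown::QueryInterface: ask the same object whether it also
                //  implements a different contract (IPersistFile).
                //
                if (SUCCEEDED(shellLink->QueryInterface(IID_IPersistFile, (void **)&persistFile)))
                {
                    persistFile->Save(L"C:\\Temp\\Notepad.lnk", TRUE);
                    persistFile->Release();     //  IUnknown::Release - reference counting.
                }
                shellLink->Release();
            }
            CoUninitialize();
        }
    }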

    Because the implementation of the COM object is hidden from the client of the object, and the implementation may exist in another process (or on another machine in the case of DCOM), COM also defines its interfaces in an IDL file.  When the MIDL compiler is compiling an IDL file for COM, it emits some additional information, including C++ class definitions (and C surrogates for those definitions).  It will also optionally emit a typelib for the interfaces.

    The typelib is essentially a partially compiled version of the information in the IDL - it contains enough information to allow someone to know how to marshall the data.  For instance, you can take the information in a typelib and generate enough information to allow managed code to interoperate with the COM object - the typelib file contains enough information for the CLR to know how to convert the unmanaged data into its managed equivalent (and vice versa).

    The third technology is OLE Automation (Object Linking and Embedding Automation).  OLE Automation is an extension of COM that allows COM to be used by languages that aren't C/C++.  Essentially OLE Automation is built around the IDispatch interface.  IDispatch can be thought of as "varargs.h-on-steroids" - it provides an abstraction for the process of passing parameters to and from functions, thus allowing an application to accept method semantics that are radically different from the semantics provided by the language (for instance, VB allows parameters to functions to be absent, which is not allowed for C functions - IDispatch allows a VB client to call into an object implemented in C).

    Anyway that's a REALLY brief discussion, there are MANY, MANY books written about this subject.  Mike referenced Dale Rogerson's "Inside COM", I've not read that one, but he says it's good :)

     

     

  • Larry Osterman's WebLog

    What's wrong with this code?

    • 29 Comments

    This one comes courtesy of a co-worker who once upon a time was the lead developer for the Manx (http://www.itee.uq.edu.au/~csmweb/decompilation/hist-c-pc.html) compiler.

    He ran into the following problem that was reported by the customer.  They tried and tried to get the following code to compile and it didn't seem to work:

    Main.c:

    #include <stdio.h>
    float ReadFileAndCalculateAverage(FILE *InputFile, int Values[], int LengthOfValues, int *NumberOfValuesRead)
    {
       float average;
       int i;

       //
       // Read the Values array from InputFile.
       //

       <Code eliminated for brevity>

       //
       // Now calculate the average.
       //

       for (i = 1 ; i <= *NumberOfValuesRead ; i += 1)
       {
          average += Values[i];
       }

       average = average/*NumberOfValuesRead;
    }

    int main(int argc, char *argv[])
    {
        int Values[20];
        float average;
        int numberOfValues;
        FILE *inputFile;

        inputFile = fopen("MyFile.Txt");
        if (inputFile != NULL)
        {
            average = ReadFileAndCalculateAverage(inputFile, Values, sizeof(j) / sizeof(j[0]), &numberOfValues);
        }
        fclose(inputFile);
     }

     Now imagine this problem in a 10,000 line source file :) Have fun trying to track it down.

    Edit: Removed bogus <p>'s.  Thank you Newsgator's .Text plugin...

  • Larry Osterman's WebLog

    Why do we all use Wintel machines these days?

    • 29 Comments

    Barry Dorrans made a comment on Monday’s blog post that reminded me of the old IBM PC technical reference manual.

    In my opinion, this document is the only reason that we’re all using Wintel computers these days (as opposed to Apple Macintoshes or Commodore Amigas).

    You see, when IBM first developed the IBM PC, they entrusted the project to a visionary named Don Estridge.  Don’s vision was to produce a platform whose design was closed but whose architecture was totally open.  When IBM first shipped the PC, they also made available a reference manual for the PC.  This reference manual included EVERYTHING about the PC’s hardware.  The pin-outs on the cards.  The source code to the System ROMs.  And most importantly, they even included the schematics of the original PC.

    They continued this tradition throughout the original IBM PC line – for every major revision of the original PC line, there was a technical reference manual that accompanied the product.  The XT, AT and network cards all got their own technical reference manuals.

    This was an EXTRAORDINARY admission.  For most of the other PC manufacturers, their schematics and ROM source code were tightly held secrets.  They didn’t want people designing hardware for their platforms or messing with their system ROMs, because then 3rd parties could produce replacement parts for their PCs and undercut their hardware business.  For instance, the original Mac didn’t even have an expansion ability – you could plug a keyboard, a mouse and a power cord into it and that was about it.

    For whatever reason, Don Estridge decided that IBM should have a more open policy, and so he published EVERYTHING about the IBM PC.  The ROM sources were copyrighted, but other than that, everything was fully documented – Everything, from the pin-outs and timing diagrams on the parallel interface, to the chip specifications of the various processors used on the motherboard.  As a result, a thriving 3rd party hardware market ensued providing a diverse hardware platform far beyond what was available on other platforms.  In addition, they licensed MS-DOS and published full documentation for it as well.  When I was writing the BIOS for MS-DOS 4.0, I had a copy of the Intel components data catalog and a ream of chip spec sheets on my desk at all times so I could look up the detailed specifications for the system.  I used the timing diagrams in the system to debug a bunch of problems with the printer drivers, for example (there was a bug in the printer hardware on the original IBM PC that prevented using the printer interrupt to allow interrupt driven printing – IIRC, the INTR line was raised before the “data ready” line was raised, which meant that the printer interrupt would be generated before the printer was actually ready to accept the next byte of data – they later fixed this on the PC/AT machines).

    As a result, a confluence of documented hardware and software platforms existed which allowed software developers to take full advantage of the hardware platform, and the IBM PC platform grew and flourished.  When IBM didn’t provide graphics support for their monochrome monitors, then an OEM, Hercules stepped up and provided it.  When IBM/Microsoft didn’t provide spreadsheet support, then an ISV, Lotus stepped up and provided it.

    But it was the synergy of open hardware and open software that made all the magic come together.  None of the other PC manufacturers provided that level of openness at the time.

    This openness wasn’t always to IBM’s advantage – it also allowed OEM’s like Compaq to clone the IBM hardware and produce their own interoperable IBM clone machines, but it did allow the platform to thrive and succeed.

    In my honest opinion, THIS is the reason that the IBM PC architecture (ISA, later called Wintel) succeeded.  It was because IBM and Microsoft let anyone produce products for their platform and NOT because of any marketing genius on IBM’s (or Microsoft’s) part.

     

  • Larry Osterman's WebLog

    AARDvarks in your code.

    • 29 Comments

    If there was ever a question that I’m a glutton for punishment, this post should prove it.

    We were having an email discussion the other day, and someone asked:

    Isn't there a similar story about how DOS would crash when used with [some non-MS thing] and only worked with [some MS thing]? I don't remember what the "thing" was though =)

    Well, the only case I could think of where that was the case was the old AARD code in Windows.  Andrew Schulman wrote a great article on it back in the early 1990’s, which dissected the code pretty thoroughly.

    The AARD code in Windows was code to detect when Windows was running on a cloned version of MS-DOS, and to disable Windows on that cloned operating system.  By the time that Windows 3.1 shipped, it had been pulled from Windows, but the vestiges of the code were left behind.  As Andrew points out, the code was obfuscated, and had debugger-hiding logic, but it could be reverse engineered, and Andrew did a great job of doing it.

    I can’t speak as to why the AARD code was obfuscated, I have no explanation for that, it seems totally stupid to me.  But I’ve got to say that I totally agree with the basic concept of Windows checking for an alternative version of MS-DOS and refusing to run on it.

    The thing is that the Windows team had a problem to solve, and they didn’t care how they solved it.  Windows decided that it owned every part of the system, including the internal data structures of the operating system.  It knew where those structures were located, it knew what the size of those data structures was, and it had no compunction against replacing those internal structures with its own version.  Needless to say, from a DOS developer’s standpoint, keeping Windows working was an absolute nightmare.

    As a simple example, when Windows started up, it increased the size of MS-DOS’s internal file table (the SFT, that’s the table that was created by the FILES= line in config.sys).  It did that to allow more than 20 files to be opened on the windows system (a highly desirable goal for a multi-tasking operating system).  But it did that by using an undocumented API call, which returned a pointer to a set of “interesting” pointers in MS-DOS. It then indexed a known offset relative to that pointer, and replaced the value of the master SFT table with its own version of the SFT.  When I was working on MS-DOS 4.0, we needed to support Windows.  Well, it was relatively easy to guarantee that our SFT was at the location that Windows was expecting.  But the problem was that the MS-DOS 4.0 SFT was 2 bytes larger than the MS-DOS 3.1 SFT.   In order to get Windows to work, I had to change the DOS loader to detect when win.com was being loaded, and if it was being loaded, I looked at the code at an offset relative to the base code segment, and if it was a “MOV” instruction, and the amount being moved was the old size of the SFT, I patched the instruction in memory to reflect the new size of the SFT!  Yup, MS-DOS 4.0 patched the running windows binary to make sure Windows would still continue to work.

    Now then, considering how sleazy Windows was about MS-DOS, think about what would happen if Windows ran on a clone of MS-DOS.  It’s already groveling internal MS-DOS data structures.  It’s making assumptions about how our internal functions work, when it’s safe to call them (and which ones are reentrant and which are not).  It’s assuming all SORTS of things about the way that MS-DOS’s code works.

    And now we’re going to run it on a clone operating system.  Which is different code.  It’s a totally unrelated code base.

    If the clone operating system isn’t a PERFECT clone of MS-DOS (not a good clone, a perfect clone), then Windows is going to fail in mysterious and magical ways.  Your app might lose data.  Windows might corrupt the hard disk.   

    Given the degree with which Windows performed extreme brain surgery on the innards of MS-DOS, it’s not unreasonable for Windows to check that it was operating on the correct patient.

     

    Edit: Given that most people aren't going to click on the link to the Schulman article, it makes sense to describe what the AARD check was :)

    Edit: Fixed typo, thanks KC

  • Larry Osterman's WebLog

    Audio in Vista, the big picture

    • 29 Comments

    So I've talked a bit about some of the details of the Vista audio architecture, but I figure a picture's worth a bunch of text, so here's a simple version of the audio architecture:

    This picture is for "shared" mode, I'll talk about exclusive mode in a future post.

    The picture looks complicated, but in reality it isn't.  There are a boatload of new constructs to discuss here, so bear with me a bit.

    The flow of audio samples through the audio engine is represented by the arrows - data flows from the application, to the right in this example.

    The first thing to notice is that once the audio leaves the application, it flows through a very simple graph - the topology is quite straightforward, but it's a graph nonetheless, and I tend to refer to samples as moving through the graph.

    Starting from the left, the audio system introduces the concept of an "audio session".  An audio session is essentially a container for audio streams; in general there is only one session per process, although this isn't strictly true.

    Next, we have the application that's playing audio.  The application (using WASAPI) renders audio to a "Cross Process Transport".  The CPT's job is to get the audio samples to the audio engine running in the Windows Audio service.
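    Just to make the application end of the picture concrete, here's a minimal sketch (error handling and cleanup mostly elided, and I'm assuming CoInitializeEx has already been called) of an application opening a shared mode WASAPI render stream.  From the application's point of view, everything to the right of it in the diagram is hidden behind IAudioClient and IAudioRenderClient:

    #include <objbase.h>
    #include <mmdeviceapi.h>
    #include <audioclient.h>

    HRESULT RenderOneSilentBuffer()
    {
        IMMDeviceEnumerator *enumerator = NULL;
        IMMDevice *device = NULL;
        IAudioClient *audioClient = NULL;
        IAudioRenderClient *renderClient = NULL;
        WAVEFORMATEX *mixFormat = NULL;

        // Find the default render endpoint and activate the WASAPI client on it.
        HRESULT hr = CoCreateInstance(__uuidof(MMDeviceEnumerator), NULL, CLSCTX_ALL,
                                      IID_PPV_ARGS(&enumerator));
        if (SUCCEEDED(hr)) hr = enumerator->GetDefaultAudioEndpoint(eRender, eConsole, &device);
        if (SUCCEEDED(hr)) hr = device->Activate(__uuidof(IAudioClient), CLSCTX_ALL, NULL,
                                                 reinterpret_cast<void **>(&audioClient));

        // Shared mode streams render in the mix format of the engine.
        if (SUCCEEDED(hr)) hr = audioClient->GetMixFormat(&mixFormat);
        if (SUCCEEDED(hr)) hr = audioClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0,
                                                        10 * 1000 * 1000,   // 1 second, in 100ns units
                                                        0, mixFormat, NULL);
        if (SUCCEEDED(hr)) hr = audioClient->GetService(IID_PPV_ARGS(&renderClient));

        // Fill one buffer (with silence, in this sketch) and start the stream.
        UINT32 bufferFrames = 0;
        BYTE *data = NULL;
        if (SUCCEEDED(hr)) hr = audioClient->GetBufferSize(&bufferFrames);
        if (SUCCEEDED(hr)) hr = renderClient->GetBuffer(bufferFrames, &data);
        if (SUCCEEDED(hr)) hr = renderClient->ReleaseBuffer(bufferFrames, AUDCLNT_BUFFERFLAGS_SILENT);
        if (SUCCEEDED(hr)) hr = audioClient->Start();

        // Cleanup elided: Release() the interfaces and CoTaskMemFree(mixFormat).
        return hr;
    }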

    In general, the terminal nodes in the graph are transports.  There are three transports that ship with Vista: the cross process transport I mentioned above, a "Kernel Streaming" transport (used for rendering audio to a local audio adapter), and an "RDP Transport" (used for rendering audio over a Remote Desktop Connection).

    As the audio samples flow from the cross process transport to the kernel streaming transport, they pass through a series of Audio Processing Objects, or APOs.  APOs are used to perform DSP (digital signal processing) on the audio samples.  Some examples of the APOs shipped in Vista are:

    • Volume - The volume APO provides mute and gain control.
    • Format Conversion - The format converter APOs (there are several) provide data format conversion - int to float32, float32 to int, etc.
    • Mixer - The mixer APO mixes multiple audio streams.
    • Meter - The meter APO remembers the peak and RMS values of the audio samples pumped through it.
    • Limiter - The limiter APO prevents audio samples from clipping when rendering.

    All of the code above runs in user mode except for the audio driver at the very end.

  • Larry Osterman's WebLog

    What's wrong with this code, part 17

    • 29 Comments
    Time for another "What's wrong with this code".  This time, it's an exercise in how a fix for a potential security problem has the potential to go horribly wrong.  This is a multi-part bug, so we'll start with the original code.

    We start the exercise with some really old code:

    BOOL GetStringValueFromRegistry(HKEY KeyHandle,
                        LPCWSTR ValueName,
                        ULONG dwLen,
                        LPWSTR lpszValue)
    {
        BOOL returnCode;
        WCHAR buffer[256];
        DWORD bufferSize = sizeof(buffer);
        DWORD valueType;
        returnCode = RegQueryValueEx(KeyHandle,
                                     ValueName,
                                     NULL,
                                     &valueType,
                                     (LPBYTE)buffer,
                                     &bufferSize) == ERROR_SUCCESS;

        if (returnCode) {
            /*
             ** Check we got the right type of data and not too much
             */

            if (bufferSize > dwLen * sizeof(WCHAR) ||
                (valueType != REG_SZ &&
                 valueType != REG_EXPAND_SZ))
            {
                returnCode = FALSE;
            }
            else
            {
                /*
                 ** Copy back the data
                 */
                if (valueType == REG_EXPAND_SZ)
                {
                    lpszValue[0] = TEXT('\0');
                    ExpandEnvironmentStringsW(buffer,
                                            (LPWSTR)lpszValue,
                                            dwLen);
                }
                else
                {
                    CopyMemory((PVOID)lpszValue,
                                (PVOID)buffer,
                                dwLen * sizeof(WCHAR));
                }
            }
        }
        return returnCode;
    }

    There's a security hole in this code, but it's not really obvious.  If you've been paying attention and it's blindingly obvious what's going on, please give others a chance :)

    As always, kudos and mea culpas on each step of the way.

     

  • Larry Osterman's WebLog

    Larry goes to Layer Court

    • 29 Comments
    Two weeks ago, my boss, another developer in my group, and I had the opportunity to attend "Layer Court".

    Layer Court is the end product of a really cool part of the quality gate process we've introduced for Windows Vista.  This is a purely internal process, but the potential end-user benefits are quite cool.

    As systems get older, and as features get added, systems grow more complex.  The operating system (or database, or whatever) that started out as a 100,000 line of code paragon of elegant design slowly turns into fifty million lines of code that have a distinct resemblance to a really big plate of spaghetti.

    This isn't something specific to Windows, or Microsoft; it's a fundamental principle of software engineering.  The only way to avoid it is extreme diligence - you have to be 100% committed to ensuring that your architecture remains pure forever.

    It's no secret that regardless of how architecturally pure the Windows codebase was originally, over time lots of spaghetti-like issues have crept into the product.

    One of the major initiatives that was ramped up with the Longhorn/Windows Vista reset was the architectural layering initiative.  The project had existed for quite some time, but with the reset, the layering team got serious.

    What they've done is really quite remarkable.  They wrote tools that perform static analysis of the windows binaries and they work out the architectural and engineering dependencies between various system components.

    These can be as simple as DLL dependencies (program A references DLLs B and C, DLL B references DLL D, DLL D in turn references DLL C), they can be as complicated as RPC dependencies (DLL A has a dependency on process B because DLL A contacts an RPC server that is hosted in process B).

    The architectural layering team then went out and assigned a number to every single part of the system starting at ntoskrnl.exe (which is the bottom, at layer 0).

    Everything that depended only on ntoskrnl.exe (things like win32k.sys or kernel32.dll) was assigned layer 1, the pieces that depend on those (for example, user32.dll) got layer 2, and so forth (btw, I'm making these numbers up - the actual layering is somewhat more complicated, but this is enough to show what's going on).
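    Just to illustrate the idea (this is a toy sketch, not the actual layering tool, and the components below are the same made-up examples as above), a component's layer can be computed as one more than the highest layer of anything it depends on, with ntoskrnl.exe pinned at layer 0:

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    int LayerOf(const std::string &component,
                const std::map<std::string, std::set<std::string>> &dependsOn,
                std::map<std::string, int> &cache)
    {
        if (component == "ntoskrnl.exe") return 0;      // the bottom of the world
        auto cached = cache.find(component);
        if (cached != cache.end()) return cached->second;

        // A component sits one layer above the highest thing it depends on.
        // Note: a dependency *cycle* (i.e. a layering violation) would recurse
        // forever here; the real tools obviously have to detect and report that.
        int layer = 0;
        auto deps = dependsOn.find(component);
        if (deps != dependsOn.end())
            for (const auto &dep : deps->second)
                layer = std::max(layer, LayerOf(dep, dependsOn, cache) + 1);

        cache[component] = layer;
        return layer;
    }

    int main()
    {
        std::map<std::string, std::set<std::string>> dependsOn = {
            { "kernel32.dll", { "ntoskrnl.exe" } },
            { "win32k.sys",   { "ntoskrnl.exe" } },
            { "user32.dll",   { "kernel32.dll", "win32k.sys" } },
        };
        std::map<std::string, int> cache;
        for (const auto &entry : dependsOn)
            std::cout << entry.first << " -> layer "
                      << LayerOf(entry.first, dependsOn, cache) << "\n";
        return 0;
    }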

    As long as the layering is simple, this is pretty straightforward.  But then the spaghetti problem starts to show up.  Raymond may get mad, but I'm going to pick on the shell team as an example of how a layering violation can appear.  Consider a DLL like SHELL32.DLL.  SHELL32 contains a host of really useful low level functions that are used by lots of applications (like PathIsExe, for example).  These functions do nothing but string manipulation of their inputs, so they have virtually no lower level dependencies.  But other functions in SHELL32 (like DefScreenSaverProc or DragAcceptFiles) manipulate windows and interact with a large number of lower level components.  As a result of these high level functions, SHELL32 sits relatively high in the architectural layering map (since some of its functions require high level functionality).

    So if a relatively low level component (say the Windows Audio service) calls into SHELL32, that's what is called a layering violation - the low level component has taken an architectural dependency on a high level component, even if it's only using the low level functions (like PathIsExe).

    They also looked for engineering dependencies - cases where low level component A gets code that's delivered from high level component B.  The DLLs and other interfaces might be just fine, but if a low level component gets code from a higher level component, it still has a dependency on that higher level component - it's a build-time dependency instead of a runtime dependency, but it's STILL a dependency.

    Now there are times when low level components have to call into higher level components - it happens all the time (windows media player calls into skins which in turn depend on functionality hosted within windows media player).  Part of the layering work was to ensure that when this type of violation occurred that it fit into one of a series of recognized "plug-in" patterns - the layering team defined what were "recognized" plug-in design patterns and factored this into their analysis.

    The architectural layering team went through the entire Windows product and identified every single instance of a layering violation.  They then went to each of the teams in turn and asked them to resolve their dependencies (either by changing their code (good), by explaining why their code matches a recognized plug-in pattern (also good), or by explaining the process by which their component will change to remove the dependency (not good, because it means that the dependency is still present)).  For this release, they weren't able to deal with all the existing problems, but at least they are preventing new ones from being introduced.  And, since there's a roadmap for the future, we can rely on things getting better over time.

    This was an extraordinarily painful process for most of the teams involved, but it was totally worth the effort.  We now have a clear map of which Windows components call into which other Windows components.  So if a low level component changes, we can now clearly identify which higher level components might be affected by that change.  We finally have the ability to understand how changes ripple throughout the system, and more importantly, we now have mechanisms in place to ensure that no lower level components ever take new dependencies on higher level components (which is how spaghetti software gets introduced).

    In order to ensure that we never introduce a layering violation that isn't understood, the architectural layering team has defined a "quality gate" that ensures that no new layering violations are introduced into the system (there is a finite set of known layering violations that are allowed for a number of reasons).  Chris Jones mentioned "quality gates" in his Channel9 video; essentially they are a series of hurdles that are placed in front of a development team - the team is not allowed to check code into the main Windows branches unless they have met all the quality gates.  So by adding the architectural layering quality gate, the architectural layering team is drawing a line in the sand to ensure that no new layering violations ever get added to the system.

    So what's this "layer court" thingy I talked about in the title?  Well, most of the layering issues can be resolved via email, but for some set of issues, email just doesn't work - you need to get in front of people with a whiteboard so you can draw pretty pictures and explain what's going on.  And that's where we were two weeks ago - one of the features I added for Beta2 restored some functionality that was removed in Beta1, but restoring the functionality was flagged as a layering violation.  We tried, but were unable to resolve it via email, so we had to go in and explain what we were doing and discuss how we were going to resolve the dependency.

    The "good" news (from our point of view) is that we were able to successfully resolve the issue - while we are still in violation, we have a roadmap to ensure that our layering violation will be fixed in the next release of Windows.  And we will be fixing it :)

     

  • Larry Osterman's WebLog

    UUIDs are only unique if you generate them...

    • 28 Comments

    We had an internal discussion recently and the upshot of the discussion was that it turns out that some distributed component on the web appears to have used the UUID of a sample COM component.

    Sigh.

    I wonder sometimes why people do this.  It's not like it's hard to run uuidgen and then copy the relevant GUIDs to your RGS file (and/or IDL file, or however it is you're defining and registering your class).

    I guess the developers of the distributed component figured that they didn't have to follow the rules because everyone else was going to follow them.

    And, no, I don't know what component it was, or why they decided to copy the sample.

    So here's a good rule of thumb.  When you're designing a COM component, you should probably use UUIDGEN (or UuidCreate()) to generate unique (and separate) GUIDs for the Interface ID, Class ID, Library ID, and App ID.
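    For what it's worth, here's a minimal sketch of doing the same thing programmatically with UuidCreate (uuidgen.exe gets you the same result from the command line):

    #include <rpc.h>        // UuidCreate, UuidToStringA - link with rpcrt4.lib
    #include <stdio.h>

    int main()
    {
        UUID uuid;
        if (UuidCreate(&uuid) == RPC_S_OK)
        {
            RPC_CSTR uuidString = NULL;
            if (UuidToStringA(&uuid, &uuidString) == RPC_S_OK)
            {
                // Paste this into your own IDL/RGS - don't reuse the one from the sample.
                printf("{%s}\n", uuidString);
                RpcStringFreeA(&uuidString);
            }
        }
        return 0;
    }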

     

  • Larry Osterman's WebLog

    What are these "Threading Models" and why do I care?

    • 28 Comments

    Somehow it seems like it’s been “Threading Models” week, another example of “Blogger synergy”.  I wrote this up for internal distribution to my group about a year ago, and I’ve been waiting for a good time to post it.  Since we just hit another instance of the problem in my group yesterday, it seemed like a good time.

     

    So what is this thing called a threading model anyway?

    Ok.  So the COM guys had this problem.  NT supports multiple threads, but most developers, especially the VB developers at whom COM/ActiveX was targeted, are totally terrified by the concept of threading.  In fact, it’s very difficult to make thread-safe VB (or JS) applications, since those languages don’t support any kind of threading concepts.  So the COM guys needed to design an architecture that would allow these single-threaded objects to be supported and hosted in a multi-threaded application.

    The solution they came up with was the concept of apartments.  Essentially, each application that hosts COM objects holds one or more apartments.  There are two types of apartments: Single Threaded Apartments (STAs) and Multi Threaded Apartments (MTAs).  Within a given process there can be multiple STAs, but there is only one MTA.

    When a thread calls CoInitializeEx (or CoInitialize), the thread tells COM which of the two apartment types it’s prepared to host.  To indicate that the thread should live in the MTA, you pass the COINIT_MULTITHREADED flag to CoInitializeEx.  To indicate that the thread should host an STA, either call CoInitialize or pass the COINIT_APARTMENTTHREADED flag to CoInitializeEx.
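    In code, the apartment choice looks something like this (a minimal sketch):

    #include <objbase.h>

    DWORD WINAPI WorkerThread(void *)
    {
        // This thread joins the single, process-wide multi-threaded apartment.
        CoInitializeEx(NULL, COINIT_MULTITHREADED);
        // ... create and use COM objects ...
        CoUninitialize();
        return 0;
    }

    DWORD WINAPI UiThread(void *)
    {
        // This thread hosts its own single-threaded apartment (and must pump messages).
        CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);
        // ... create and use COM objects, run a message loop ...
        CoUninitialize();
        return 0;
    }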

    A COM object’s lifetime is limited to the lifetime of the apartment that creates the object.  So if you create an object in an STA, then destroy the apartment (by calling CoUninitialize), all objects created in this apartment will be destroyed.

    Single Threaded Apartment Model Threads

    When a thread indicates that it’s going to be in a single threaded apartment, the thread indicates to COM that it will host single threaded COM objects.  Part of the contract of being an STA is that the STA thread cannot block without running a windows message pump (at a minimum, if it blocks it must use MsgWaitForMultipleObjects – internally, COM uses windows messages to do inter-thread marshalling).

    The reason for this requirement is that COM guarantees that objects will be executed on the thread in which they were created regardless of the thread in which they’re called (thus the objects don’t have to worry about multi-threading issues, since they can only ever be called from a single thread).  Eric mentions “rental threaded objects”, but I’m not aware of any explicit support in COM for this.
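    Here's a minimal sketch of what "blocking without giving up the message pump" looks like in practice - waiting on a handle from an STA thread while still dispatching the window messages that COM uses for its inter-thread marshalling:

    #include <windows.h>

    // Wait for a handle to be signaled on an STA thread, pumping messages the whole time.
    void WaitOnSTAThread(HANDLE handle)
    {
        for (;;)
        {
            DWORD wait = MsgWaitForMultipleObjects(1, &handle, FALSE, INFINITE, QS_ALLINPUT);
            if (wait == WAIT_OBJECT_0)
            {
                return;                         // the handle was signaled
            }

            // Otherwise there are window messages waiting - dispatch them so COM's
            // cross-apartment calls into this thread can be serviced.
            MSG msg;
            while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
            {
                TranslateMessage(&msg);
                DispatchMessage(&msg);
            }
        }
    }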

     

    Multi Threaded Apartment Model Threads

    Threads in the multi threaded apartment don’t have any restrictions – they can block using whatever mechanism they want.  If COM needs to execute a method on an object and no thread is blocked, then COM will simply spin up a new thread to execute the code (this is particularly important for out-of-proc server objects – COM will simply create new RPC threads to service the object as more clients call into the server).

    How do COM objects indicate which thread they work with?

    When an in-proc COM object is registered with OLE, the COM object creates the following registry key:

                HKCR\CLSID\{<Object class ID>}\InprocServer32

    The InprocServer32 key tells COM which DLL hosts the object (in the default value for the key) and, via the ThreadingModel value, what the threading model for the COM object is.

     

    There are essentially four legal values for the ThreadingModel value.  They are:

    • Apartment
    • Free
    • Both
    • Neutral

    Apartment Model objects.

    When a COM object is marked as being an “Apartment” threading model object, it means that the object will only run in an STA thread.  All calls into the object will be serialized by the apartment model, and thus it will not have to worry about synchronization.

    Free Model objects.

    When a COM object is marked as being a “Free” threading model object, it means that the object will run in the MTA.  There is no synchronization of the object.  When a thread in an STA wants to call into a free model object, then the STA will marshal the parameters from the STA into the MTA to perform the call. 

    Both Model objects.

    The “Both” threading model is an attempt at providing the best of both worlds.  An object that is marked with a threading model of “Both” takes on the threading model of the thread that created the object. 

    Neutral Model objects.

    With COM+, COM introduced the concept of a “Neutral” threading model.  A “Neutral” threading model object is one that totally ignores the threading model of its caller.

    COM objects declared as out-of-proc (with a LocalServer32=xxx value under the class ID) are automatically considered to be in the multi-threaded apartment (more about that below).

    It turns out that COM’s enforcement of the threading model is not consistent.  In particular, when a thread that’s located in an STA calls into an object that was created in the MTA, COM does not enforce the requirement that the parameters be marshaled through a proxy object.  This can be a big deal, because it means that the author of COM objects can be lazy and ignore the threading rules – it’s possible to create a COM object that uses the “Both” threading model and, as long as the object is in-proc, there’s nothing that’ll check to ensure you didn’t violate the threading model.  However, the instant you interact with an out-of-proc object (or call into a COM method that enforces apartment model checking), you’ll get the dreaded RPC_E_WRONG_THREAD error return.  The table here describes this in some detail.

    What about Proxy/Stub objects?

    Proxy/Stub objects are objects that are created by COM to handle automatically marshaling the parameters of the various COM methods to other apartments/processes.  The normal mechanism for registering Proxy/Stub objects is to let COM handle the registration by letting MIDL generate a dlldata.c file that is referenced during the proxy DLL’s initialization.

    When COM registers these proxy/stub objects, it registers the proxy/stub objects with a threading model of “Both”.  This threading model is hard-coded and cannot be changed by the application.

    What limitations are there that I need to worry about?

    The problem that we most often see occurs because of the Proxy/Stub objects.  Since the proxy/stub objects are registered with a threading model of “Both”, they take on the threading model of the thread that created the object.  So if a proxy/stub object is created in a single threaded apartment, it can only be executed in the apartment that created it.  The proxy/stub marshaling routines DO enforce the threading restriction I mentioned above, so applications learn about this when they unexpectedly get an RPC_E_WRONG_THREAD error return from one of their calls.  On the server side, the threading model of the object is set by the threading model of the caller of CoRegisterClassObject.  The good news is that the default ATL 7.1 behavior is to specify multi-threaded initialization (in other words, the ATL header files define _ATL_FREE_THREADED by default).

    How do I work around these limitations?

    Fortunately, this problem is a common problem, and to solve it COM provides a facility called the “Global Interface Table”.  The GIT is basically a singleton object that allows you to register an object with the GIT and it will then return an object that can be used to perform the call from the current thread.  This object will either be the original object (if you’re in the apartment that created the object) or it will be a proxy object that simply marshals the calls into the thread that created the object.

    If you have a COM proxy/stub object (or you use COM proxy/stub objects in your code), you need to be aware of when you’ll need to use the GIT to hold your object.

    To use the GIT: after you’ve called CoCreateInstance to create your COM object, call IGlobalInterfaceTable::RegisterInterfaceInGlobal to add the object to the global interface table.  This will return a “cookie” to you.  When you want to access the COM object, you first call IGlobalInterfaceTable::GetInterfaceFromGlobal to retrieve the interface.  When you’re done with the object, you call IGlobalInterfaceTable::RevokeInterfaceFromGlobal.
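    Here's a minimal sketch of that pattern (error handling trimmed down to the basics):

    #include <windows.h>
    #include <objbase.h>

    // In the apartment that created the object: stash it in the GIT and hand back a cookie.
    HRESULT RegisterInGit(IUnknown *punk, REFIID riid, DWORD *pdwCookie)
    {
        IGlobalInterfaceTable *pGit = NULL;
        HRESULT hr = CoCreateInstance(CLSID_StdGlobalInterfaceTable, NULL,
                                      CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&pGit));
        if (SUCCEEDED(hr))
        {
            hr = pGit->RegisterInterfaceInGlobal(punk, riid, pdwCookie);
            pGit->Release();
        }
        return hr;
    }

    // In whatever apartment needs the object: trade the cookie for a usable pointer
    // (either the original object or a proxy, depending on where you're calling from).
    HRESULT GetFromGit(DWORD dwCookie, REFIID riid, void **ppv)
    {
        IGlobalInterfaceTable *pGit = NULL;
        HRESULT hr = CoCreateInstance(CLSID_StdGlobalInterfaceTable, NULL,
                                      CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&pGit));
        if (SUCCEEDED(hr))
        {
            hr = pGit->GetInterfaceFromGlobal(dwCookie, riid, ppv);
            pGit->Release();
        }
        return hr;
    }

    // When you're done, call RevokeInterfaceFromGlobal(dwCookie) on the same GIT.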

    In our case, we didn’t feel that pushing the implementation details of interacting with the global interface table to the user was acceptable, so we actually wrote an in-proc object that wraps our out-of-proc object. 

    Are there other problems I need to worry about?

    Unfortunately, yes.  Since the lifetime of a COM object is scoped to the lifetime of the apartment that created the object, this means that when the apartment goes away, the object will go away.  This will happen even if the object is referenced from another thread.  If the object in question is a local object, this really isn’t that big a deal since the memory backing the object won’t go away.  If, however, the object is a proxy/stub object, then the object will be torn down post-haste.  The global interface table will not help this problem, since it will remove all the entries in the table that were created in the apartment that’s going away.

    Additional resources:

    The MSDN article Geek Speak Decoded #7 (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngeek/html/geekthread.asp) also has some detail on how this stuff works (although it’s somewhat out-of-date).

     

  • Larry Osterman's WebLog

    Choosing a C runtime library

    • 28 Comments

    Yesterday a developer in my group came by asking about a failure he saw when running the application verifier on his component.  The app verifier was reporting that he was using a HEAP_NO_SERIALIZE heap from a thread other than the one that created the heap.

    I looked a bit deeper and realized that he was running with the single threaded statically linked C runtime library.  An honest mistake, given that it’s the default version of the C runtime library.

    You see, there are 3 different versions of the C runtime library shipped (and 3 different versions of the ATL and MFC libraries too). 

    The first is the statically linked single-threaded library.  This one can be used only on single threaded applications, and all the object code for the C runtime library functions used is included in the application binary.  You get this with the /ML compiler switch.

    The second is the statically linked, multi-threaded library.  This one’s the same as the first, but you can use it in a multithreaded application.  You get this one with the /MT compiler switch.

    The third is the dynamically linked library.  This one keeps all the C runtime library code in a separate DLL (MSVCRTxx.DLL).  Since the runtime library code’s in a DLL, it also handles multi-threaded issues.   The DLL library is enabled with the /MD switch.

    But I’ve been wondering.  Why on earth would anyone ever choose any option OTHER than multi-threaded DLL version of the runtime library?

    There are LOTS of reasons for always using the multithreaded DLL:

    1)      Your application is smaller because it doesn’t have the C runtime library loaded into it.

    2)      Because of #1, your application will load faster.  The C runtime library is almost certainly in memory, so the pages containing the library don’t have to be read from disk.

    3)      Using the multithreaded DLL future-proofs your application.  If you ever add a second thread to your application (or call into an API that creates multiple threads), you don’t have to remember to change your C runtime library.  And unless you’re running the app verifier regularly, the only way you’ll find out about the problem is if you get a heap corruption (if you’re lucky).

    4)      If your application has multiple DLLs and they statically link the C runtime library, then you need to be VERY careful about allocation – each DLL will have its own C runtime library heap, as will the application.  If you allocate a block in one DLL, you must free it in the same DLL (see the sketch after this list).

    5)      If a security bug is ever found in the C runtime library, you don’t have to release an update to your app.
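    To make point 4 concrete, here's a hypothetical sketch (the module and function names are made up) of the cross-module heap problem you get with the static C runtime library:

    // widget.dll, statically linked against the CRT (/MT):
    #include <string.h>

    extern "C" __declspec(dllexport) char *CreateWidgetName(void)
    {
        return _strdup("widget");    // allocated on widget.dll's private CRT heap
    }

    // The application, also statically linked (/MT):
    //
    //     char *name = CreateWidgetName();
    //     free(name);               // BUG: frees on the EXE's CRT heap, not widget.dll's
    //
    // With /MD both modules share the single CRT DLL heap and the free above is legal,
    // although exporting a matching FreeWidgetName() is still the more robust design.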

    The last one’s probably the most important IMHO.  Just to be clear - there haven’t been any security holes found in the C runtime library.  But it could happen.  And when it happens, it’s pretty ugly.  A really good example of this can be seen with the security vulnerability that was found in the zlib compression library.  This library was shipped in dozens of products, and every single one of them had to be updated.  If you do a Google search for “zlib library security vulnerability” you can see some of the chaos that resulted from this disclosure.  If your app used the DLL C runtime library, then you’d get the security fix for free from Windows Update when Microsoft posted the update.

    The only arguments I’ve been able to come up with for using the static C runtime libraries are:

    1)      I don’t have to distribute two binaries with my application – If I use the DLL, I need to redistribute the DLL.  This makes my application setup more complicated.

    Yes, but not significantly (IMHO).  This page lists the redistribution info for the C runtime library and other components.

    2)      If I statically link to the C runtime library, I avoid DLL hell.

    This is a red herring IMHO.  Ever since VC6, the C runtime library has been tightly versioned, as long as your installer follows the rules for version checking of redistributable files (found here) you should be ok.

    3)      My code is faster since the C runtime library doesn’t have to do all that heap synchronization stuff.

    Is it really?  How much checking is involved in the multithreaded library?  Let’s see.  The multithreaded library puts some stuff that was kept in global variables into thread local storage.  So there’s an extra memory indirection involved on routines like strtok etc.  Also, the single threaded library creates its heap with HEAP_NO_SERIALIZE (that’s what led to this entire post :)).  But that serialization just wraps the heap access with an EnterCriticalSection/LeaveCriticalSection pair.  Which is very very fast if there’s no contention.  And since this is a single threaded application, by definition there’s no contention for the critical section.

    Using the multithreaded DLL C runtime library is especially important for systems programmers.  First off, if your system component is a DLL, it’s pretty safe to assume that you’ll be called from multiple threads, so at an absolute minimum, you’re going to want to use the multithreaded static C runtime library.  And if you’re using the multithreaded static C runtime library, why NOT use the DLL version?

    If you’re not writing a DLL, then it’s highly likely that your app does (or will) use multiple threads.  Which brings me back to the previous comment – why NOT use the DLL version? 

    Your app will be smaller, more secure, future-proofed, and no slower than if you used the static version.

     

  • Larry Osterman's WebLog

    Checking file versions is surprisingly hard.

    • 28 Comments

    I was wandering around the web the other day and ran into this post.  In general I don’t have many issues with the post, until you get to the bottom of the article.  The author mentions that his code only runs on Win7 or newer, so he helpfully included a check to make sure that his code only runs on Win7:

    // Example in C#.
    
    internal bool SupportsTaskProgress() {
        if (System.Environment.OSVersion.Version.Major >= 6) {
            if (System.Environment.OSVersion.Version.Minor >= 1) {
                return true;
            }
        }
        return false;
    }

    This is a great example of why it’s so hard to write code that checks for versions.  The problem here is that this code is highly likely to fail to work on the next version of Windows (or whenever the Windows version number reaches 7.0): a version of 7.0 passes the major version check but fails the minor version check, since 0 is not >= 1.  In that case SupportsTaskProgress will incorrectly return false.

     

    Personally I wouldn’t even bother writing the SupportsTaskProgress function this way.  Instead I’d check for the “new TaskbarLib.TaskbarList()” call to return NULL and assume that if it returned NULL the API call wasn’t supported (the non COM interop equivalent would be to check for a failure on the call to CoCreateInstance).  That way the code would work even if (for some obscure reason) the taskbar logic was ported to a previous OS version.
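    For the non interop case, a minimal sketch of that feature detection approach might look like the following.  (I'm assuming the TaskbarList COM object and its Windows 7 ITaskbarList3 interface here, which is what I take the TaskbarLib interop wrapper to be wrapping.)

    #include <objbase.h>
    #include <shobjidl.h>    // CLSID_TaskbarList, ITaskbarList3

    // Feature detection instead of version checking: just ask for the interface.
    bool SupportsTaskProgress()
    {
        ITaskbarList3 *taskbar = NULL;
        HRESULT hr = CoCreateInstance(CLSID_TaskbarList, NULL, CLSCTX_INPROC_SERVER,
                                      IID_PPV_ARGS(&taskbar));
        if (FAILED(hr))
        {
            return false;    // no object, or the object doesn't implement ITaskbarList3
        }
        taskbar->Release();
        return true;
    }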

     

    If I simply HAD to keep the SupportsTaskProgress function, I’d rewrite it as:

    // Example in C#.
    
    internal bool SupportsTaskProgress() {
        if (System.Environment.OSVersion.Version.Major >= 6) {
            if (System.Environment.OSVersion.Version.Major == 6) {
                if (System.Environment.OSVersion.Version.Minor >= 1) {
                    return true;
                }
                return false;
            }
            return true;
        }
        return false;
    }

    That way it would only check for the minor version being greater than or equal to 1 if the major version is exactly 6.  I suspect that this code could be tightened up further as well.

     

    This is a part of the reason that picking a version number for the OS is so complicated.

  • Larry Osterman's WebLog

    What’s wrong with this code, part 26 – a real-world example

    • 28 Comments

    This is an example of a real-world bug that was recently fixed in an unreleased Microsoft product.  I was told about the bug because it involved the PlaySound API (and thus they asked me to code review the fix), but it could happen with any application.

    static DWORD WINAPI _PlayBeep(__in void* pv)
    {
        UNREFERENCED_PARAMETER(pv);
        PlaySound(L".Default"NULL, SND_SYNC | SND_ALIAS);
        return 0;
    }
    
    LRESULT WndProc(...)
    {
        :
        :
        case WM_KEYDOWN:
            if (!_AcceptInputKeys(wParam, lParam))
            {
                QueueUserWorkItem(_PlayBeep, NULL, 0);
            }
            break;
    }

     

    This is actual code from inside the client side of a client/server component in Windows that was attempting to “beep” on invalid input (I’ve changed the code slightly to hide the actual origin and undoubtedly introduced issues).  And it has a whopper of a bug in it.

    Given the simplicity of the code above, to get the answer right, it’s not enough to say what’s wrong with the code (the problem should be blindingly obvious).  You also need to be able to explain why this is so bad (in other words, what breaks when you do this).

     

    Bonus points if you can identify the fix that was eventually applied.

  • Larry Osterman's WebLog

    I can make it arbitrarily fast if I don’t actually have to make it work.

    • 27 Comments

    Digging way back into my pre-Microsoft days, I was recently reminded of a story that I believe was told to me by Mary Shaw back when I took her Computer Optimization class at Carnegie-Mellon…

    During the class, Mary told an anecdote about a developer (“Sue”) who found a bug that another developer (“Joe”) had introduced with a performance optimization.  When “Sue” pointed the bug out to “Joe”, his response was “Oops, but it’s WAY faster with the bug”.  “Sue” exploded: “If it doesn’t have to be correct, I can calculate the result in 0 time!” [1].

    Immediately after telling this anecdote, she discussed a contest that the CS faculty held for the graduate students every year.  Each year the CS faculty posed a problem to the graduate students with a prize awarded to the grad student who came up with the most efficient (fastest) solution to the problem.  She then assigned the exact same problem to us:

    “Given a copy of the “Declaration of Independence”, calculate the 10 most common words in the document”

    We all went off and built programs to parse the words in the document, inserting them into a tree (tracking usage) and read off the 10 most frequent words.  The next assignment was “Now make it fast – the 5 fastest apps get an ‘A’, the next 5 get a ‘B’, etc.”

    So everyone in the class (except me :)) went out and rewrote their apps to use a hash table so that their insertion time was constant and then they optimized the heck out of their hash tables[2].
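    For reference, the basic hash table approach everybody converged on looks something like this in modern terms (a sketch in C++ rather than the Pascal the contest required):

    #include <algorithm>
    #include <cctype>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main()
    {
        // Count word frequencies with a hash table (constant time insertion).
        std::unordered_map<std::string, int> counts;
        std::string word;
        while (std::cin >> word)
        {
            std::string clean;
            for (char c : word)
                if (std::isalpha(static_cast<unsigned char>(c)))
                    clean.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
            if (!clean.empty())
                ++counts[clean];
        }

        // Pull out the ten most frequent words.
        std::vector<std::pair<std::string, int>> words(counts.begin(), counts.end());
        size_t top = std::min<size_t>(10, words.size());
        std::partial_sort(words.begin(), words.begin() + top, words.end(),
                          [](const auto &a, const auto &b) { return a.second > b.second; });

        for (size_t i = 0; i < top; i++)
            std::cout << words[i].first << ": " << words[i].second << "\n";
        return 0;
    }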

    After our class had our turn, Mary shared the results of what happened when the CS grad students were presented with the exact same problem.

    Most of them basically did what most of the students in my class did – built hash tables and tweaked them.  But a couple of results stood out.

    • The first one simply hard coded the 10 most common words in their app and printed them out.  This was disqualified because it was perceived as breaking the rules.
    • The next one was quite clever.  The grad student in question realized that they could write the program much faster if they wrote it in assembly language.  But the rules of the contest required that they use Pascal for the program.  So the grad student essentially created an array on the stack, introduced a buffer overflow, loaded his assembly language program into the buffer, and used that as a way of getting his assembly language version of the program to run.  IIRC he wasn’t disqualified but he didn’t win because he circumvented the rules (I’m not sure, it’s been more than a quarter century since Mary told the class this story).
    • The winning entry was even more clever.  He realized that he didn’t actually need to track all the words in the document.  Instead he decided to track only some of the words in the document in a fixed array.  His logic was that each of the 10 most frequent words was likely to appear in the first <n> words in the document, so all he needed to do was to figure out what “n” is and he’d be golden.

     

    So the moral of the story is “Yes, if it doesn’t have to be correct, you can calculate the response in 0 time.  But sometimes it’s ok to guess and if you guess right, you can get a huge performance benefit from the result”. 

     

     

    [1] This anecdote might also come from Jon L. Bentley’s “Writing Efficient Programs”, I’ll be honest and say that I don’t remember where I heard it (but it makes a great introduction to the subsequent story).

    [2] I was stubborn and decided to take my binary tree program and make it as efficient as possible but keep the basic structure of the solution (for example, instead of comparing strings, I calculated a hash for the string and compared the hashes to determine if strings matched).  I don’t remember if I was in the top 5 but I was certainly in the top 10.  I do know that my program beat out most of the hash table based solutions.

  • Larry Osterman's WebLog

    This is the way the world (wide web) ends...

    • 27 Comments

    Robert Hensing linked to a post by Thomas Ptacek over on the Matasano Chargen blog. Thomas (who is both a good hacker AND a good writer) has a writeup of a “game-over” vulnerability that was just published by Mark Dowd over at IBM's ISS X-Force that affects Flash. For those that don’t speak hacker-speak, in this case, a “game-over” vulnerability is one that can be easily weaponized (his techniques appear to be reliable and can be combined to run an arbitrary payload). As an added bonus, because it’s a vulnerability in Flash, it allows the attacker to write a cross-browser, cross-platform exploit – this puppy works just fine in both IE and Firefox (and potentially in Safari and Opera).

    This vulnerability doesn’t affect Windows directly, but it DOES show how a determined attacker can take what was previously thought to be an unexploitable failure (a null pointer dereference) and turn it into something that can be used to 0wn the machine.

    Every one of the “except not quite” issues that Thomas writes about in the article represented a stumbling block that the attacker (who had no access to the source to Flash) had to overcome – there are about 4 of them, but the attacker managed to overcome all of them.

    This is seriously scary stuff.  People who have Flash installed should run, not walk, over to Adobe to pick up the update.  Please note that the security update comes with the following warning:

    "Due to the possibility that these security enhancements and changes may impact existing Flash content, customers are advised to review this March 2008 Adobe Developer Center article to determine if the changes will affect their content, and to begin implementing necessary changes immediately to help ensure a seamless transition."

    Edit2: It appears that the Adobe update center I linked to hasn't yet been updated with the fix; I followed their update procedure, and my Flash plugin still had the vulnerable version number.

    Edit: Added a link to the relevant Adobe security advisory, thanks JD.

     

  • Larry Osterman's WebLog

    Why doesn't Mozilla (Firefox) like the Microsoft OCA web site?

    • 27 Comments

    In my previous post about OCA, the comments thread has a long discussion started by Shannon J Hager about Mozilla’s behavior when you attempt to access https://winqual.microsoft.com.  If you attempt to access this web site using Firefox (or other Mozilla variants), you get the following dialog box:

    Which is weird, because of course the web site works just fine in IE.  No big deal, right – Microsoft’s well known for sleazing the rules for its own products, so obviously this is Microsoft’s fault – they probably did something like hard coding in trust to the Microsoft issuing CA.  But I was kinda surprised at this, so I spent a bit of time checking it out...

    The way that SSL certificate verification is supposed to work is that if the issuer of a certificate isn’t trusted, then the code validating the certificate is supposed to check the parent of the issuer to see if IT is trusted.  If the parent of the issuer isn’t trusted, it’s supposed to check the grandparent of the issuer, and so forth until you find the root certificate authority (CA).

    The issuing CA of the certificate on the winqual web site is the “Microsoft Secure Server Authority”; it’s not surprising Mozilla doesn’t trust that one.  The parent of the issuing CA is the “Microsoft Internet Authority”, again, no surprise that Mozilla doesn’t trust it.

    But the grandparent of the issuing CA is the “GTE CyberTrust Root”.  This is a well known CA, and Mozilla should be trusting it.  And what do you know, Mozilla DOES claim to trust that root CA:

    Well, Cesar Eduardo Barros actually went and checked using openssl to see why the CA isn’t trusted.  He tried:

    $ openssl s_client -connect winqual.microsoft.com:443 -showcerts

    depth=0 /C=US/ST=Washington/L=Redmond/O=WHDC (Old WHQL)/OU=Microsoft/CN=winqual.microsoft.com
    verify error:num=20:unable to get local issuer certificate
    verify return:1
    depth=0 /C=US/ST=Washington/L=Redmond/O=WHDC (Old WHQL)/OU=Microsoft/CN=winqual.microsoft.com
    verify error:num=27:certificate not trusted
    verify return:1
    depth=0 /C=US/ST=Washington/L=Redmond/O=WHDC (Old WHQL)/OU=Microsoft/CN=winqual.microsoft.com
    verify error:num=21:unable to verify the first certificate
    verify return:1
    CONNECTED(00000003)
    ---
    Certificate chain
    0 s:/C=US/ST=Washington/L=Redmond/O=WHDC (Old WHQL)/OU=Microsoft/CN=winqual.microsoft.com
    i:/DC=com/DC=microsoft/DC=corp/DC=redmond/CN=Microsoft Secure Server Authority
    -----BEGIN CERTIFICATE-----
    [...]
    -----END CERTIFICATE-----
    ---
    Server certificate
    subject=/C=US/ST=Washington/L=Redmond/O=WHDC (Old WHQL)/OU=Microsoft/CN=winqual.microsoft.com
    issuer=/DC=com/DC=microsoft/DC=corp/DC=redmond/CN=Microsoft Secure Server Authority
    ---
    No client certificate CA names sent
    ---
    SSL handshake has read 1444 bytes and written 324 bytes
    ---
    New, TLSv1/SSLv3, Cipher is RC4-MD5
    Server public key is 1024 bit
    SSL-Session:
    Protocol : TLSv1
    Cipher : RC4-MD5
    Session-ID: [...]
    Session-ID-ctx:
    Master-Key: [...]
    Key-Arg : None
    Start Time: [...]
    Timeout : 300 (sec)
    Verify return code: 21 (unable to verify the first certificate)
    ---
    DONE

    Decoding the certificate it gave me above (openssl x509 -text) I get the same information Mozilla gives me and a bit more, but no copy of the issuer. The only suspicious thing in there is:

    Authority Information Access:
    CA Issuers - URI:http://www.microsoft.com/pki/mscorp/msssa1(1).crt
    CA Issuers - URI:http://corppki/aia/msssa1(1).crt

    Getting that URI gives me a blank HTML page with a 0.1 second redirect to itself. (The CRL one seems valid, however.)

    So I was confused: why wasn’t OpenSSL able to verify the certificate?  So I started asking the security PMs here at Microsoft what was up.  One of the things they told me was that Microsoft doesn’t hard code ANY intermediate certificates in our browser.  Instead, our browser relies on the referral information in the certificate to chase down the CA hierarchy.

    So why can’t Mozilla do the same thing?  Is there something wrong with our certificates that’s preventing this from working?  I kept on pestering and the PMs kept on digging.  Eventually I got email from someone indicating “IE is chasing 48.2 AIA”.

    Well, this isn’t very helpful to me, so I asked the security PM in question to explain it in English.  Apparently the root cause of the problem is that IE is following the Authority Information Access 48.2 OID (1.3.6.1.5.5.7.48.2) to find the parent of the certificate, while Mozilla isn’t.

    Inside the Microsoft certificate is the following:

    And if you go to http://www.microsoft.com/pki/mscorp/msssa1(1).crt you’ll find the parent CA for the certificate on the winqual web site.  So now it’s off to figure out if the IE behavior is according to standard, or if it’s another case of Microsoft ignoring web standards in favor of proprietary extensions.

    A few minutes of googling discovers that the AIA 48.2 field is also known as the id-ad-caIssuers OID.  The authoritative reference for this OID is RFC2459 (the RFC that defines the x.509 certificate infrastructure).  It describes this field as:

    The id-ad-caIssuers OID is used when the additional information lists CAs that have issued certificates superior to the CA that issued the certificate containing this extension.  The referenced CA Issuers description is intended to aid certificate users in the selection of a certification path that terminates at a point trusted by the certificate user.

    In other words, IE is correctly chasing the AIA 48.2 references in the certificate to find the root issuing CA.  Since it didn’t have direct knowledge of the issuing CA, it correctly looked at the AIA 48.2 field of the certificate for the winqual web site and chased the AIA 48.2 references up to the root CA.  It appears that Mozilla (and OpenSSL and GnuTLS) don’t follow this link, which is why they pop up the untrusted certificate dialog.
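    On the Windows side you don't normally see any of this, because the CryptoAPI chain engine does the AIA chasing for you when it builds the chain.  A hedged sketch (this is the general purpose chain building API, not necessarily IE's exact code path):

    #include <windows.h>
    #include <wincrypt.h>    // link with crypt32.lib

    bool ChainsToTrustedRoot(PCCERT_CONTEXT leafCertificate)
    {
        CERT_CHAIN_PARA chainPara = { sizeof(chainPara) };   // no usage restrictions
        PCCERT_CHAIN_CONTEXT chainContext = NULL;

        // The default flags allow the engine to retrieve missing intermediate
        // certificates from the AIA (caIssuers) URLs in the leaf certificate.
        if (!CertGetCertificateChain(NULL,              // default chain engine
                                     leafCertificate,
                                     NULL,              // evaluate at the current time
                                     NULL,              // no additional store
                                     &chainPara,
                                     0,
                                     NULL,
                                     &chainContext))
        {
            return false;
        }

        bool trusted = (chainContext->TrustStatus.dwErrorStatus == CERT_TRUST_NO_ERROR);
        CertFreeCertificateChain(chainContext);
        return trusted;
    }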

    Issue solved.  Now all someone has to do is to file bugs against Mozilla and OpenSSL to get them to fix their certificate validation logic :)

    Btw, I want to give HUGE kudos to Cesar Eduardo Barros for tirelessly trying to figure this out, and to Michael Howard and the lead program manager for NT security for helping me figure this out.  If you look at the info from the certificate that Cesar posted above, he correctly caught the AIA 48.2 fields inside the CA; it was a huge step in the right direction, all that remained was to figure out what it really meant.

    Edit: Fixed picture links.

    Edit2: Fixed line wrapping of reference from RFC2459.
