
My Obligatory Bill Gates Story

This one of course says more about me than Bill, but it's a true enough tale.

The year is either late in 1989 or early in 1990- shortly after I joined Microsoft the first time around.  In those days, meeting Bill wasn't too terribly difficult, as periodically all of the new hires in the development and marketing disciplines were invited to a dinner at his home (in those days it was in the Laurelhurst area of Seattle, near the University of Washington- his current home was still under construction).

I had come from a background where geeks and nerds weren't particularly well-liked.  During my ten previous years at IBM, I had heard it all- insinuations I was kept in a cage because my social ineptness was akin to that of an animal, for instance.  To be fair, this was New York, after all- a locale well-known for the art of the insult.  But always the professed belief that intelligence had to come at the cost of some other more worthy attribute.

Well, as the 80's wore on, Bill became the guy I'd throw back in their face.  "He's not only smart, but he outmaneuvered all of your suits in one of the most important technical shifts of the century"- I never quite put it exactly that way, but you probably get the idea- the idea that you couldn't be trusted to make a business decision if you were capable of higher mathematics had to be the most galling thing I constantly heard.

At any rate, I eventually decided to leave (for unrelated reasons, actually- I grew up in that area, and that kind of treatment had been the norm from my childhood- I was quite used to it), and the material from Microsoft intrigued me- it reminded me of pleasant memories of my days at Caltech- people dressed casually [yes, I was one of those people who at one time received the "so when are you going to cut your hair and trim your beard and wear a good suit instead of just slapping a necktie on any old shirt" speech, with the clear insinuation these things were roadblocks to career advancement], doing all kinds of cool things with software, and being appreciated for it [I did some pretty cool stuff at my previous job, but most of the time nobody even had the background to evaluate it properly].  After that round of interviews I was more excited than I had been for a long time [some of that passion I've written about before], and bailed on a verbally accepted previous offer and came to Redmond.

You need to understand how I practically idolized him- heroic in stature and a redeemer for the oppressed and underappreciated geeks of the world.  The man who proved you could be intelligent AND business savvy.  We were even roughly contemporaries- I'm about one year older than he is.

Well, there I was, a guest in his house- they'd loaded dozens and dozens of us into several busses on the campus and driven us there.  I went into the entryway, and, having an intense dislike of large crowds [most of the time, anyway], tried a likely looking door and found myself heading down the stairs into the basement area where all the tables were set up.

There at the bottom of the stairs was Bill (easily recognized, of course).  I thought briefly that perhaps he was about as fond of crowds as I was, and then began thinking of all the things I had wanted to say.  I then realized that if he was anything like me, he'd find that kind of talk to be rather embarrassing, and would prefer not to have to deal with it...

Well I was also probably throwing some quizzical expressions in his direction, so he spoke first- "Hi!".

And I, feeling completely unable to say anything that wasn't sheer idiocy, responded in kind, and went on past him as if I didn't even know who he was.  I think that's now something that could be viewed as a form of meltdown...

Well, at least my anxiety eventually subsided- I looked wistfully, later in the evening, at Bill in the center of a sizable chatting group, and eventually came to the realization that at least some of what I'd been hearing all along was true- I had the chance to meet a personal celebrity in the flesh, and I didn't even know what to do with it when the moment came.

Still, while it reveals more about who I am than it ever will about Bill, I at least had a story to tell afterwards.  Having now told it, I'll get back to doing my job...

Posted by BobKjelgaard | 0 Comments

Potential

The month since my last post has been incredibly hectic.  I've started this particular article several times, only to abandon the effort because there were too many other things needing to be done.

So, just as the last time, I'm here working on a weekend- this time the project is completely something I should be doing during the week (I'm adding enough underlying bus support to a software test bus driver that we can use it for some simulated DMA test drivers; our previous solution to this seems to have pulled a vanishing act).  I'm in the test phase of it, so that occasionally leaves me some time to write as I wait for things to complete.

That previous article was the latest in a theme of the "P's and Q's" of my job here at Microsoft.  Events of the last month have surfaced an important "P"- the one I titled this essay with.

First off, there has been incredible growth in the support I've had in triaging that never ending stream of lab failures from Patrick and Wei, buttressed by their invaluable assistance in fixing the stream of test bugs we find in doing that.  Evgeny has been making steady progress on some new tools to help us catch any inadvertent regressions in the basic framework interfaces for UMDF and KMDF as early as possible.

But the one that brought the issue into focus was a job interview- a delightful young woman named Neslihan, who had been at Microsoft a short time, was desirous of joining our QA team.  She was energetic and excited about the work, well-educated (a graduate of the Middle East Technical University in Ankara), and in the obligatory coding exercises she showed that all that passion and drive was coupled with plenty of intellect- even as she assembled the basic solutions, she was looking for ways to improve her algorithms, rapidly proposing and then disposing of various solutions at a rate that made it clear she had "the stuff" needed to work on the sorts of problems we wind up working on.  Now, I've interviewed good candidates before (Wei himself, for instance), and I get excited about it myself- will the other interviewers see the same things I saw, or will some shortcoming surface that I didn't see?

Well, she did fine, and will be joining us in a few short weeks, which is certainly good for us.  But it also got me to thinking about where that excitement comes from.  For me, at least in some vicarious fashion, it's a sense of all this young person is going to be able to accomplish- not simply for us, but for Microsoft, and even [as is the case of much of what we do] for larger groups- planetary impact (even if it's only a tiny one, it can at least ripple there, after all), if you will.  Thinking about that led me to thinking about the way I view my colleagues (as the senior member of the group, mentoring in one form or another is an important part of my job)- again that same sense of potential, and the chance to help in bringing it forth.  Finally came the idea that I myself am here for the same reason- someone saw that potential in me as I interviewed, and in some sense still sees it as my management and I go through our career planning processes.

Then came the dislocation of one of those occasional "group moves" that occur here- at the end of it, I had a new office mate- a young Russian lad named Alexey Karetnikov, who is here as an intern- a student in St. Petersburg in the Russian Federation.   I began by trotting out the half dozen phrases I could remember of Russian from my high school days (hard to believe that was more than 30 years ago), and we proceeded to get acquainted- common interests in anime and music (I've been listening to quite a bit of Russian music he has brought in for us to listen to ever since, and I quite enjoy it), and of course in software.  He has been working on automating testing of our debugger extensions, and it is clear even in this short amount of time that he has had excellent training and is a highly skilled programmer already.

I took a brief trip early on with him to help him get signed up with the Social Security Administration (need to have that Taxpayer ID number, after all) and secure a bank account (after all, a dislocation to a new language, culture, and country could require some assistance).  Basically, he's a very nice person in addition to all else.  Once again, the potential for a person like that is staggering.

Now for people with that kind of potential, we offer the opportunity to have that kind of global impact, and somewhere inside all of us is the desire to see that happen.  I recall (or else I've become delirious in my old age) that we had an advertising tag line for a while- "Your Potential, Our Passion".  Well, I believe that's more than a slogan- it's a maxim and a glimpse into who and what we truly are- or at least want to be [and that is itself a part of our own potential].

Finally, there's Bill's exodus- plenty has been said about it (and if I've got time, I'll add my little send off a bit later)- but again- think of the potential Mr. Gates still has and the efforts he is now beginning to make to realize it.  Behind him he has left for many of us the means to realize our dreams and ambitions.  Now among his undertakings is one to help potentially millions of our fellows to achieve their potential- among others, helping them get the chance to stay alive long enough to have an opportunity to move beyond subsistence.

Well, in his case, it's been said better and no doubt by better observers and in more skilled terms.  But overall, it's been time for me to think about what the implications of the term are, and I thought I'd share some of that...

Setup is done- time to get back and make stuff work...

Posted by BobKjelgaard | 0 Comments

Passion and Persistence

Well, it's the Memorial Day weekend here in Redmond, and I am in my office working for the second of the three days of the weekend.  Just can't stay away.

Why?

Because for months now I have been trying to improve the efficiency of our team's automated tests, and also get all the content in them under control.  Over the years (I'm afraid it was the KMDF side of the team, from which I originally came, that was the biggest culprit), bits and pieces of content- test tool binaries, script files, and the like- were placed on a server accessible to both us and the lab- none of it under source control, nor with any idea in most cases where it could be built from or symbols to it procured.  Shefali and I have been whittling away at this problem for months now, and I'm finally close enough to having it work I'm determined to make that one last push to get the job done once and for all (except for the inevitable maintenance, and I've tried to be smart enough about designing my solution that this should also be easy).

We've got many hours of tests overall between KMDF and UMDF, and I've got to make sure all of them work before I tell the lab that this new stuff is the way it's going to be from here on out [not to mention there is going to be some peer review first, as well].  It takes a lot of time, and of course, none of this is committed or budgeted work.  I'm filling in the idle time [among other things, I'm doing abusive searches of build release servers so I can figure out where some of the tools we've just stuck up there before actually appear in Windows builds- so my being here when almost no one else is proves to be somewhat considerate of me, as well] by writing blog entries, updating code, and working on even further out ideas I want to try out.

Not that it's all hard work.  I've got the music turned way up (a live version of the Grateful Dead's "Casey Jones" of all things, right at the moment- so I hear my blog's title being raucously sung), and I'm having my usual whimsical fun with comment fields and the like.  For instance, the job itself (which is working rather well, so far) is named "Sauron", and the Details tab (which is supposed to describe it) begins with this absurd off-the-cuff plagiarized quatrain [my apologies to the Tolkiens]:

One job to run for all-

One job to find them;

One job to bring them all

And at the test box bind them.

So why do something this exhaustive,

and do it when I'm supposed to be relaxing, and when it's unpaid time, and so on?

Because I care about it (passion) and because I'm not going to just give up because it isn't easy (persistence).

Besides, this reminds me of my early days at Microsoft (the days of OS/2, Windows 3.1, and what became Windows NT 3.1).  The days where the 70 and 80 hour week were common and I usually didn't mind.  The days when I slept on the floor of my office because I got to be too tired to drive home, even though it was only a block away [yes, I really should walk, but I find the idea daunting at the 1-5 AM hours I am usually coming to work- Redmond is usually safe, but why take the chance?].  The days when I wrote code because I thought I'd need it or because I wanted to try something that struck me as cool at the time.  The days when I may have worked hard, but I was enjoying it so much it didn't feel like work.

They came to an end- the reasons why don't matter.  At one point after that I was interviewed by a journalist / author (he was writing a book about the development of Windows NT, and I guess he thought I'd make for a mildly interesting highlight or side story), and I remember telling him "I'd sit at the keyboard and I couldn't type- I didn't want to type.  I wasn't interested in programming anything- I'd never had that happen before".  While I didn't put it that way, it was like being dead, but still being capable of breath and movement.  In a very real way, that was behind my resigning at the time I did- partly protective (no way could I do a decent job in that state, and sooner or later that would result in some black marks on the record), and partly reactive (some of the roots were environmental, or so I felt- doesn't matter really, it was all so long ago).

Ever since then, I've been looking for that feeling- that sense of wonder- that confident feeling that if I could think of it, I'd find a way to use the technology available to me and my own capabilities to make it happen.  I'd get occasional flashes of it, but never the full-blown, out and out desire to do nothing but work on something for endless hours, and actually achieving that goal.  The closest I came was gaming, but to me that was always "play".  "Work" was a better use of time, but it just wasn't engaging me.

When I decided to rejoin Microsoft, I was hoping I might finally find that desire and enthusiasm capable of being rekindled.  Working as a contractor, I'd had urges in that direction [but you must charge for those hours- so the kind of thing I am doing now would be an abuse of the business relationship to either my employer or to their customer].  But it just seemed to sputter- there were brief flashes, but nothing consistent.

But now it looks like it is happening at last.  Perhaps I do get another bite at the apple- I already feel younger for some reason.  Perhaps I can take on big challenges and bring them to fruition, even though I am highly prone to "the lone wolf" style of addressing things.  If it's happened- I know who to thank- Shefali, Abdullah, Scott, Darren, and everyone from them right on up to Messrs. Ballmer and Gates.  For providing an environment (and yes, there is much more to that environment than just the chain I've listed) where I can be myself, and still accomplish things- maybe not great things, but at least things worthy of accomplishing.  For listening to sometimes petty concerns and putting up with my foibles, inconsistencies and mistakes (especially Shefali in this case).

I hadn't thought about that before- perhaps the problem previously is I felt I had to be something other than what I am to succeed.  Now I finally feel otherwise, and it is very liberating.

Well, perhaps I am deluded.  While I see this as a good thing to do, and an improvement over the existing conditions, perhaps it will be judged otherwise.  I am not using all the latest and greatest technologies to solve the problem, so perhaps it will lack appeal because of that.  Time will tell, in either event.  I'll deal with that as it comes.

In the meantime, I am going to do my level best to make this thing work, and I don't mind the time it is taking.

Because believe it or not, I am actually having quite a lot of fun...

[Segue into Grateful Dead [Wake of the Flood] Mississippi Half Step Uptown Toodeloo.... "along the Rio Grande, oh-..."]

The Forever Wait

Many thanks to Patrick for this reminder about another practice an SDET should avoid (actually, most SDE's should, as well- but they have their own masters and I'll leave that job to them).  Not that it's easy to pick the "right" amount of time.  In this much older post of mine, I mention a test case that took an incredible amount of time when we turned Driver Verifier on.  Well, that test case has been a rascal in many ways (it even found some unusual and rare bugs in Driver Verifier itself).  I was initially using the same wait time on all of my tests, but I had to stop- I set the wait on this one out to many minutes, but special cased it so the others wait a few seconds- if I didn't, a hardcore hang would potentially set runtime out to hours, not good for automation.

Well, I ran it today [I'm spending time I shouldn't be trying to streamline our automation- in this case making the installation aspects of it smart enough to find their own content, instead of having people type an ever changing set of locations into WTT as parameters every time we have to make a run, which is, after all, every day or more- but at the same time I removed the copying of things we no longer use, or are getting from insecure places when there are secure places, etc- to make sure I didn't mess it up, I'm running all of our tests to make sure I didn't prune out the collection of something we need], and the test case AFTER it failed because it timed out.  I already know this is because the cleanup from all those allocations is hogging resources (I've even hit the "too long at dispatch level" break on machines with way too much memory), preventing me from getting time to get that next instance spawned off my test bus (serves me right for trying to squeeze some test performance by spinning up the next test while the first was spinning down).

The forever wait would make life easier in this case- but it wouldn't be near as much fun.

Now if any of you are thinking- "wait- why aren't you even talking about breaking and forcing people to look at the hang when it times out like this?"- well, we still have an opening or two for more SDETs on our team (and our close buddies in Storage have plenty of good opportunities for you as well), and you're obviously beginning to think like one!  The best answer I have at the moment is that I serve multiple masters, one of which is regular automated checking, for which recording a failure and moving on is OK (as long as it isn't new, anyway).  I've been thinking about ways over the last few days to accommodate this [break conditionally, but not all of the time, in a way that's smart enough to suit our needs], but haven't hit upon that most desirable solution yet.  Not that I can't think of ways- but I need ways I can achieve with what's at hand- all the aggressive work we've done on hardening the test code so it doesn't break more often than the product it's testing has been working, so I've at least got time to move from the "Stop all the bugchecks" phase to the "what about attempt rates and failure rates" phase.  Actually, it's not just that- it's that Patrick and Wei have been stepping up and helping with some of that fixing and the never-ending job of triaging failures in our test labs.

But, thanks to Patrick, that "things to do next time you crack that DDI test code" list now has another item:

Things to do the next time you crack that moldy old test code

  • Set USER_C_FLAGS to /TP
  • Set the warning level to 4 (convert warnings to error is already on)
  • PFD green before you're done, period.
  • Look for suspicious coding patterns where failures are being masked or ignored- at the very least in the places the above changes have forced you to look.
  • Examine KeWait... calls for NULL timeout and WaitFor... calls for INFINITE.  Take them out with the rest of the trash!
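
To make that last item concrete, here is a minimal user-mode sketch of a bounded wait in portable C++.  It is not the KMDF test code: WaitBounded, WaitResult and TimeoutForTest are hypothetical names invented for illustration, the polling loop merely stands in for a KeWait... or WaitFor... call, and the 10 minute / 5 second figures are assumptions standing in for "many minutes" and "a few seconds".

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Outcome of a bounded wait: the caller always gets an answer back.
enum class WaitResult { Completed, TimedOut };

// Hypothetical helper: polls isDone until it returns true or the timeout
// elapses.  There is deliberately no "wait forever" option.
inline WaitResult WaitBounded(const std::function<bool()>& isDone,
                              std::chrono::milliseconds timeout,
                              std::chrono::milliseconds pollInterval =
                                  std::chrono::milliseconds(1)) {
    assert(timeout.count() >= 0);  // a negative timeout is a caller bug
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!isDone()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return WaitResult::TimedOut;  // record the failure and move on
        }
        std::this_thread::sleep_for(pollInterval);
    }
    return WaitResult::Completed;
}

// Hypothetical per-test policy: one known-slow case gets many minutes,
// every other case gets a few seconds (both values are assumptions).
inline std::chrono::milliseconds TimeoutForTest(bool knownSlowCase) {
    if (knownSlowCase) {
        return std::chrono::minutes(10);
    }
    return std::chrono::seconds(5);
}
```

A caller would write something like WaitBounded(isDone, TimeoutForTest(false)) and treat TimedOut as a test failure, so a hard hang costs seconds instead of hours of lab time.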

[Edit 5/24] Sorry, I just realized I revealed someone's email address without permission [it made a fine pun, but still not the right thing to do]- stats say not likely anyone realized it yet, but mea culpa, anyway...

Just had to add this one

The title of this post keeps reminding me of The Forever War- a science fiction novel that earned both Nebula and Hugo awards.  The awards probably speak for themselves- I'm quite pleased to have a copy of it in my personal library (I guess I'm pleased to have a personal library, but I leave that sort of thing to my personal blog).

As for the music- happens to be another bit of mellow fineness from Field of View (sorry, I didn't come up with a more useful link, but I'm short on time, as always) at the moment- I may have paid what seemed an outrageous fee to import this CD set (not the one I linked to, that would take me a while to find, I'm afraid) from Japan, but it was soooooo worth it!

Why SDETs should be the most fastidious and paranoid coders in existence

Well, to follow up some more on my adventures in test code maintenance, I bring you a case study.

As I whittled away at the 100+ PFD warnings I mentioned in my previous article,  I did my paranoid best to look for dubious code above and beyond what PFD was telling me about when I noticed this in one of our oldest test drivers:

NTSTATUS  _stdcall 
CTestDriver::SetupObjectHandleParameter(
    IN ULONG eHandle, 
    IN OBJECT_TYPE ObjectType,
    IN OUT PVOID * Handle
    )
{
    NTSTATUS status = STATUS_SUCCESS;

    //
    //setup Queue Handle parameter
    //
    switch(eHandle){
        
        case HANDLE_INVALID_NULL:
            *Handle = NULL;
            break;

        case HANDLE_INVALID_OBJECT_DELETED:
            //
            //we will treat this as HANDLE_INVALID_NULL since we can not delete the default queue
            //
            *Handle = NULL;
            break;

        case HANDLE_INVALID_WRONG_OBJECT_TYPE:
            
            if(ObjectType != OBJECT_TYPE_DRIVER){ 
                
                *Handle = (PVOID)WdfGetDriver();
                
            }else{
            
                *Handle = (PVOID)this->m_pFdo;
            }
            break;

        case  HANDLE_VALID:
            
            switch(ObjectType)
            {
                case OBJECT_TYPE_DEVICE:
                    *Handle = (PVOID)this->m_pFdo;
                    break;
                    
                case OBJECT_TYPE_QUEUE:
                    *Handle = WdfDeviceGetDefaultQueue(this->m_pFdo);
                    break;
                    
                case OBJECT_TYPE_REQUEST:
                    status = WdfRequestCreate(NULL, NULL, (WDFREQUEST *)Handle);
                    break;
                    
                default:
                    *Handle = NULL;
                    
            }
            break;
            
        default:
            DbgBreakPoint();
#pragma prefast(suppress:__WARNING_UNREACHABLE_CODE, "Always runs under debugger, so not unreachable, merely deliberately annoying")
            *Handle = NULL;
            status = STATUS_INVALID_PARAMETER;
    }

The breakpoint and suppression pragma are my later addition [and I removed some invalid cases prior to this- in fact, it was the act of removing them that led me to even look at this code]. Why did I add them? Because this silently changes an invalid and unexpectedly out-of-range value without any notice or any indication that something unexpected and unwanted has happened.  I wanted to make sure that if this EVER happened, someone was forced to take a look (and yes, this is why people use ASSERT of course).

Well, I finally got all my warnings cleaned up, and I built the dozens of drivers and assembled everything I needed so I could make a "dry run" of all the affected code under lab conditions.  I hit that breakpoint about a half hour into the run.  So I took a look, of course- that IS why I did this, after all [to force someone to look- I just also turned out to BE that someone].
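
For readers without a kernel debugger handy, the same "never swallow the unexpected value" idea can be sketched in portable C++.  Everything here is a hypothetical stand-in: ReportUnexpected plays the role of DbgBreakPoint(), Status plays the role of NTSTATUS, and the selector values 0 and 3 are used only because they match HANDLE_INVALID_NULL and HANDLE_VALID in the header discussed in this post.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for NTSTATUS.
enum Status { StatusSuccess, StatusInvalidParameter };

// Counts unexpected selectors so a harness can fail the run afterwards.
int g_unexpectedSelectorCount = 0;

// Hypothetical stand-in for DbgBreakPoint(): make the event impossible to miss.
inline void ReportUnexpected(const char* what, unsigned long value) {
    ++g_unexpectedSelectorCount;
    std::fprintf(stderr, "UNEXPECTED %s: %lu\n", what, value);
}

// Maps a script-supplied selector to a handle.  Unknown selectors are
// reported and rejected rather than silently mapped to NULL.
inline Status SelectHandle(unsigned long selector,
                           const void* validHandle,
                           const void** out) {
    assert(out != nullptr);  // precondition: caller supplies storage
    switch (selector) {
    case 0:  // deliberately pass a NULL handle
        *out = nullptr;
        return StatusSuccess;
    case 3:  // pass a valid handle
        *out = validHandle;
        return StatusSuccess;
    default:
        ReportUnexpected("handle selector", selector);
        *out = nullptr;
        return StatusInvalidParameter;
    }
}
```

The output handle is still NULL in the default case, but the caller now gets a failing status and a visible report to go with it, which is the whole point.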

As one would expect from even middlin'-to-fair practice, those constants are in a header file, the appropriate snippet being this:

typedef enum {
    HANDLE_INVALID_NULL,
    HANDLE_INVALID_OBJECT_DELETED,
    TOTALLY_INVALID_AND_DO_NOT_USE_UNTIL_ALL_ARE_REMOVED_AND_THIS_CAN_BE_RECYCLED,
    HANDLE_VALID,
    HANDLE_INVALID_NOT_SET,
    HANDLE_VALID_NULL,
    HANDLE_INVALID_NOT_OPEN,
    HANDLE_INVALID_WRONG_OBJECT_TYPE,
    HANDLE_CALL_DDI_MULTIPLE_TIMES = 23
    } HANDLE_STATES;
 

The long name was my addition- it was for the invalid case I was trying to prune from all those drivers [it passed -1 as  an "invalid handle value"- fine practice in user mode, but in KMDF they [wisely, IMO] decided NULL was enough to guard against].  The oddball value at the end is there because of one of the items that factors into my tale [but I won't mention it again].

Soon after these early test drivers were developed, we off-shored the work of developing more of them to test the burgeoning KMDF DDI.  They basically made a copy of what we had, and then began developing more, extending and refining our techniques.  Now I'm about to describe an error of theirs, but if the original developers had been more paranoid, these folks could have known about this long before I found out about it!  People make mistakes, especially during those early stages of development, period.  They continue to make them in maintenance and addition of new features.  This isn't a "look at this fool" story- it's a "this is good and intelligent people just being human" story- please don't misunderstand me!

To go back now to the present, a quick check in the debugger showed that "9" was the value for eHandle.  You'll notice it doesn't match any of the enumerated values.  So where did it come from?  Well these tests are driven from scripts, and the original developers being fastidious used the same names and definitions in their script-based engine. [I told you these people were good!]  So I found the test case used because it was readily available from the driver at the point this failed (drum roll, please):

    var    eQueue    = HANDLE_INVALID_WRONG_OBJECT_TYPE;
    var    bStopComplete = POINTER_VALID;
    var    eContext   = POINTER_VALID_NULL;

Now HANDLE_INVALID_WRONG_OBJECT_TYPE is defined, but its value is 7, not 9- so why the mismatch?  Well, like I said, all the original files were forked off, and the offshore team maintained their copies.  At first I diff'd the original header to see if I'd done something totally bad, like deleting values from the enumeration without maintaining the value of those that were left- after all a stream of 14+ hour days leaves me quite capable of such a move!  But it wasn't me, this time- so I went to their version of that same header, and found:

typedef enum 
{
    HANDLE_INVALID_NULL                             = 0,
    HANDLE_INVALID_OBJECT_DELETED                   = 1,
    INVALID_VALUE_NOT_TO_BE_USED_1                  = 2,
    HANDLE_VALID                                    = 3,
    HANDLE_INVALID_NOT_SET                          = 4,
    HANDLE_VALID_NULL                               = 5,
    HANDLE_INVALID_NOT_OPEN                         = 6,
    HANDLE_INVALID_REACQUIRE                        = 7,
    HANDLE_INVALID_NOTACQUIRED                      = 8,
    HANDLE_INVALID_WRONG_OBJECT_TYPE                = 9,

Oops (and yes, my names for the value I ripped out were inconsistent, but then maybe some consistent inconsistency is useful in doing this sort of maintenance activity, and besides, none of this is finished yet)!

I then went to the trouble of trolling through our configuration management system (mostly out of curiosity, as this bug isn't a killer in and of itself) to see when this breaking code change was made, and found it happened to have been made about 10 days after I walked out of New Employee Orientation and into my second SDET career at Microsoft [I probably couldn't even begin to use WTT {DTM to you} at that time].  This defect silently caused us to stop testing the effects of passing the wrong kind of object to the DDI tested in this driver [passing NULL instead, which as you can probably divine was already being done].  The test failed, but it was supposed to, so no visible change in outward results- just a silent loss of coverage of some paths in the product code for the last 3 years.
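
One cheap guard against this kind of drift, sketched here in standard C++ and not taken from our actual headers, is to pin the numeric values that scripts depend on with explicit initializers and static_assert, and to reject any raw value that is not in the list.  The SK_ names, IsKnownHandleState and ToHandleState are hypothetical; the numbers mirror the original header shown earlier.

```cpp
#include <cassert>

// Hypothetical copy of the original enumeration with every value explicit,
// so inserting a new entry in the middle cannot renumber the others.
enum SketchHandleState {
    SK_HANDLE_INVALID_NULL              = 0,
    SK_HANDLE_INVALID_OBJECT_DELETED    = 1,
    SK_RETIRED_DO_NOT_USE               = 2,
    SK_HANDLE_VALID                     = 3,
    SK_HANDLE_INVALID_NOT_SET           = 4,
    SK_HANDLE_VALID_NULL                = 5,
    SK_HANDLE_INVALID_NOT_OPEN          = 6,
    SK_HANDLE_INVALID_WRONG_OBJECT_TYPE = 7,
    SK_HANDLE_CALL_DDI_MULTIPLE_TIMES   = 23
};

// Compile-time tripwires: a renumbering edit fails the build immediately.
static_assert(SK_HANDLE_VALID == 3,
              "scripts pass 3 for a valid handle");
static_assert(SK_HANDLE_INVALID_WRONG_OBJECT_TYPE == 7,
              "scripts pass 7 for the wrong-object-type case");

// Runtime check for values arriving from outside (for example, a script).
inline bool IsKnownHandleState(unsigned long raw) {
    switch (raw) {
    case SK_HANDLE_INVALID_NULL:
    case SK_HANDLE_INVALID_OBJECT_DELETED:
    case SK_HANDLE_VALID:
    case SK_HANDLE_INVALID_NOT_SET:
    case SK_HANDLE_VALID_NULL:
    case SK_HANDLE_INVALID_NOT_OPEN:
    case SK_HANDLE_INVALID_WRONG_OBJECT_TYPE:
    case SK_HANDLE_CALL_DDI_MULTIPLE_TIMES:
        return true;
    default:  // includes the retired value 2 and anything out of range
        return false;
    }
}

// Converts a raw value, asserting loudly if it is not a known state.
inline SketchHandleState ToHandleState(unsigned long raw) {
    assert(IsKnownHandleState(raw));
    return static_cast<SketchHandleState>(raw);
}
```

This would not have caught a forked copy of the header by itself, since each copy would happily assert its own numbering, but it does turn a silent renumbering within one copy into a build break, and it rejects a stray 9 at the boundary.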

Umm, so what?

Granted- this is hardly the most important problem found on the planet this week (probably even in the last 10 seconds).  But I think there's a point here for any serious SDET.  Be paranoid!  Check everything!  Trust nothing!  Before you punt that oddball case, at least consider using an ASSERT or some other mechanism to focus attention on questionable results or behaviors.  If the code looks ratty as a result, and you'd never ship it, so what?  Your purpose isn't to make pretty code- leave that to the SDEs- your task is to find bugs [and preferably in the product- but if you find your own with these practices, the odds of your doing the same for your product are that much improved].  I don't personally care much where the braces go in your code- but I sure care a lot about what you do between them!

Yes, someone inclined to point out errors and less-than-optimal practices can point out many more [like, oh, say- keeping two versions of the same header file about?]  But it takes time to sort out improving what you've got, and there's the continual need to add new stuff [which is what I was doing when I began this little detour this week, and I'd better get back to it, and soon].  I'll add removing that superfluous header to the ever growing list of "things that ought to get done when there's finally time to do it"...

Enough of the soapbox rant for now.  I'll go put on my hat and find some of the other curmudgeons [someday I'll have to take a photo of it perkily perched on top of the replica Spartan helmet from the Halo 3 Legendary edition [someone once told me it gave them visions of John-117 as Indiana Jones, bullwhip in hand] in my office]...

Odds and Ends

  • Well, PFD eventually found well over 30 IRQL misuses by the time I was done- many of the pattern I showed in that last article, others more subtle- and some of those were more of a "risk of future maintenance trouble" issue- exactly the sort of maintenance error behind this bug.  PFD- it isn't just for developers- your test code ought to be as good as theirs and preferably better [so when something breaks you've got all the information you need to say what's going on].  It also focused attention on dozens more problem areas in this code base.  Being PFD clean doesn't guarantee your driver is rock-solid bulletproof, but it sure makes a nice foundation!
  • Wei reminded me that you can combine the values when suppressing by separating them with white space.  Figures- I'd tried commas and got into too much of a hurry to look up the basics of those pragmas.  Also that the suppression is supposed to apply to a single source line (but I think stacking them as I did also works- can't say for sure).
  • I've begun using some Analytical / tracking software on my blog to understand my readership better [nothing personally identifiable].  I see that I get a reasonable number of non-English speakers, so I'll see if I can rein in my vocabulary and use of idioms in the future.
  • The web site I keep some of my blog content on (like my photograph) is undergoing maintenance for a couple of days.  The text files for fixing KMDF installation failures are there too.  They'll be back by Monday, May 12, 2008.
  • No song lists- too busy to try to capture them this time [lucky you!].
  • A good weekend to all!
Posted by BobKjelgaard | 0 Comments

Reaping the Benefits of Static Analysis Tools

Our team has a huge amount of legacy test driver code, much of it rather old, and originally created by somewhat less experienced developers.  I continually find myself going back and performing maintenance on that code, and as I've noted previously- Static Analysis Tools are a great help in identifying problems in this code.  I'll put this "down time" (among other things, I've got PFD analyzing 40+ files that are rather large as I write this) to good use by describing both some further benefits and showing how easy it is to annotate your own code so you can recognize this sort of benefit.

Irql Tracking Finds Yet Another Flaw

One of the benefits PFD adds above regular PreFast is tracking IRQL changes in your code.  Of course, there are additional annotations used to support this, and I found them to be most useful in a very real-world way.

In our tests of the KMDF DDI, we try to make a lot of invalid calls to assess how well the framework verifier works.  One of the things that we do is make calls at improper IRQL levels- since efficiency suggests you usually have one place that tests a particular call, this change has to be programmable.  Not the sort of thing you expect a static analysis tool to be able to do for you, eh?  Well one of our SDETs made an initial pass at this issue [after I told the team we were going to make all of our test drivers PFD clean over the next few months, instead of having me piecemeal it as I fixed bugs as I had done previously], and encountered some difficulties.

The original declarations looked like this:

/*******************************************************************************
 KmdfTestRaiseAndPrintIRQL

 Synopsis - Raises IRQL to DISPATCH_LEVEL and prints IRQL

 Parameters -
   eIRQL    -   Enumerator for IRQL
   OldIRQL  -   IRQL prior to raising to DISPATCH_LEVEL

******************************************************************************/
VOID 
KmdfTestRaiseAndPrintIRQL(
    __in    ULONG   eIRQL,
    __out   KIRQL   *OldIRQL
    );
Initial code- nice enough, and PreFast annotated, but it changes IRQL and doesn't say a thing about that other than in comments

There was no declaration for a function to lower IRQL- typically eIRQL was compared to a constant and KeLowerIrql was called when needed.  Well, PFD quickly noticed this function elevates IRQL on some paths, so the attempt was made to annotate it, and a paired restore function was also defined- thus:

/*******************************************************************************
 KmdfTestRaiseAndPrintIRQL

 Synopsis - Raises IRQL to DISPATCH_LEVEL and prints IRQL

 Parameters -
   eIRQL    -   Enumerator for IRQL
   OldIRQL  -   IRQL prior to raising to DISPATCH_LEVEL

******************************************************************************/
__drv_raisesIRQL(DISPATCH_LEVEL)
VOID 
KmdfTestRaiseAndPrintIRQL(
    __in    ULONG   eIRQL,
    __out   __drv_out_deref(__drv_savesIRQL) PKIRQL   OldIRQL
    );

/*******************************************************************************
 KmdfTestRestoreIRQL

 Synopsis - Restore IRQL

 Parameters -
   eIRQL    -   Enumerator for IRQL
   OldIRQL  -   IRQL returned from KmdfTestRaiseAndPrintIRQL

******************************************************************************/
VOID
KmdfTestRestoreIRQL(
    __in    ULONG   eIRQL,
    __in __drv_restoresIRQL __drv_nonConstant KIRQL OldIRQL
    );
First attempt- this says IRQL is changed and a restore function is added, but is it enough?

However, this didn't work out so well once the implementation of these routines was analyzed by PFD, and since I had mandated cleanliness, the resulting warnings were suppressed, thus:

__drv_raisesIRQL(DISPATCH_LEVEL)
VOID __stdcall
#pragma prefast(suppress:__WARNING_IRQL_NOT_SET, "Intentionally save irql at specified condition");
KmdfTestRaiseAndPrintIRQL(
    __in    ULONG   eIRQL,
    __out   __drv_out_deref(__drv_savesIRQL) PKIRQL   OldIRQL
    )

// and then this

VOID __stdcall
#pragma prefast(suppress:__WARNING_IRQL_NOT_USED, "Intentionally restore irql at specified condition");
KmdfTestRestoreIRQL(
    __in    ULONG   eIRQL,
    __in __drv_restoresIRQL __drv_nonConstant KIRQL OldIRQL
    )
These annotations must not be enough, because PFD is smart enough to find paths that put the lie to the annotated effects of these functions

Why the need to suppress?  Because PFD correctly analyzed that there were paths through the code that did not do what the annotations say these functions do (raise IRQL to dispatch and restore it, respectively). These actions were conditional, and the annotations didn't reflect this.

Conditional Annotations: One Powerful Tool Makes One Effective Solution

I encountered the problem in a different form.  I found myself working with the earliest of our drivers, written before these library routines were used.  They defined similar functions inline.  At first, I made these same annotations there [but didn't suppress], as that let me borrow from the work already done.  But since I didn't suppress, I got the identical errors.

When I realized why I was getting errors, I recalled reading about conditional annotations, so I read up on them and really, for a case like this they just weren't that hard to figure out.  I fixed the code I was working with, and then when I was sure this was the right thing to do, I made the same fix to the library code and header file I snipped above.  The end result looked like this (and I removed the suppression pragma from the library code, of course).

/*******************************************************************************
 KmdfTestRaiseAndPrintIRQL

 Synopsis - Raises IRQL to DISPATCH_LEVEL and prints IRQL

 Parameters -
   eIRQL    -   Enumerator for IRQL
   OldIRQL  -   IRQL prior to raising to DISPATCH_LEVEL

******************************************************************************/
__drv_when( eIRQL == EXECUTION_LEVEL_DISPATCH, __drv_raisesIRQL(DISPATCH_LEVEL))
VOID __stdcall
KmdfTestRaiseAndPrintIRQL(
    __in    ULONG   eIRQL,
    __out __drv_when( eIRQL == EXECUTION_LEVEL_DISPATCH, __drv_out_deref(__drv_savesIRQL))
            PKIRQL  OldIRQL
    );

/*******************************************************************************
 KmdfTestRestoreIRQL

 Synopsis - Restore IRQL

 Parameters -
   eIRQL    -   Enumerator for IRQL
   OldIRQL  -   IRQL returned from KmdfTestRaiseAndPrintIRQL

******************************************************************************/
VOID __stdcall
KmdfTestRestoreIRQL(
    __in    ULONG   eIRQL,
    __in __drv_when( eIRQL == EXECUTION_LEVEL_DISPATCH, __drv_restoresIRQL)
            KIRQL   OldIRQL
    );
Adding conditional annotations lets PFD correctly analyze the functions and code that uses them, removing the need to suppress perfectly valid warnings.
It also more clearly states how the function works in a fairly readable way.

The annotations are now more precise.  EXECUTION_LEVEL_DISPATCH is defined in a header always known when this is included, so PFD can properly analyze all usages.  It now knows that we change and restore when the eIRQL parameter is set to this value, and leave it alone otherwise.
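For anyone who'd like to see the behavior those conditional annotations describe, here is a minimal user-mode sketch.  It is not our library code: the KIRQL typedef, the simulated g_CurrentIrql variable and the numeric values given to the EXECUTION_LEVEL_* names are all assumptions made so it compiles outside the WDK, and the annotations themselves are left out because only PFD understands them.

```c
#include <assert.h>
#include <stdio.h>

/* User-mode stand-ins (assumptions, not WDK definitions) so this compiles anywhere. */
typedef unsigned char KIRQL;
#define PASSIVE_LEVEL            0
#define DISPATCH_LEVEL           2
#define EXECUTION_LEVEL_PASSIVE  0UL   /* hypothetical enumerator value */
#define EXECUTION_LEVEL_DISPATCH 1UL   /* hypothetical enumerator value */

static KIRQL g_CurrentIrql = PASSIVE_LEVEL;   /* simulated current IRQL */

/* Raises only when eIRQL asks for DISPATCH, mirroring the __drv_when condition. */
void KmdfTestRaiseAndPrintIRQL(unsigned long eIRQL, KIRQL *OldIRQL)
{
    *OldIRQL = g_CurrentIrql;                 /* always report the entry IRQL */
    if (eIRQL == EXECUTION_LEVEL_DISPATCH) {
        g_CurrentIrql = DISPATCH_LEVEL;       /* stands in for KeRaiseIrql */
    }
    printf("IRQL is now %u\n", (unsigned)g_CurrentIrql);
}

/* Restores only under the same condition; otherwise leaves IRQL alone. */
void KmdfTestRestoreIRQL(unsigned long eIRQL, KIRQL OldIRQL)
{
    if (eIRQL == EXECUTION_LEVEL_DISPATCH) {
        assert(OldIRQL <= g_CurrentIrql);     /* a restore must never raise */
        g_CurrentIrql = OldIRQL;              /* stands in for KeLowerIrql */
    }
}
```

The point of the sketch is simply that both routines branch on the same eIRQL test, which is exactly what the __drv_when conditions tell the analyzer.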

Reaping The Benefits, Part 1- Finding a Real Bug at Build Time

Well, it didn't take long for it to turn up a real flaw [in one of those 40 files I alluded to earlier].  This is pseudo-code illustrating what it found.

KmdfTestRaiseAndPrintIRQL(
    eIRQL, 
    &OriginalIRQL
    );

if (condition) {
    MakeTheCall();
    KmdfTestRestoreIRQL(
        eIRQL,
        OriginalIRQL
        );
}
Pseudocode for a bug the above annotations exposed- found by PFD in that ancient test code.

Out of the tens of thousands of source lines I'm working with, I don't think I could have noticed that- but PFD zeroed right in on it.  If the condition isn't met, we raise IRQL without ever lowering it.  It's rare for the condition to not occur, which is one reason why nobody had found this before [another is that intermittent failures in these tests used to be routinely overlooked- but I don't work that way- I only give up when I absolutely have to].  Easily fixed, of course, just move the raise inside the conditional block as well.
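Here is a rough sketch of that fix next to the flagged shape, again in user mode.  The raise and restore routines are simplified mocks under the same assumed names, BuggyPath and FixedPath are hypothetical wrappers of my own, and the simulated g_CurrentIrql stands in for the real thing.  Treat it as an illustration of the idea rather than the actual test code.

```c
#include <assert.h>
#include <stdbool.h>

/* User-mode stand-ins (assumptions, not WDK definitions). */
typedef unsigned char KIRQL;
#define PASSIVE_LEVEL            0
#define DISPATCH_LEVEL           2
#define EXECUTION_LEVEL_DISPATCH 1UL   /* hypothetical enumerator value */

static KIRQL g_CurrentIrql = PASSIVE_LEVEL;   /* simulated current IRQL */

static void KmdfTestRaiseAndPrintIRQL(unsigned long eIRQL, KIRQL *OldIRQL)
{
    *OldIRQL = g_CurrentIrql;
    if (eIRQL == EXECUTION_LEVEL_DISPATCH) g_CurrentIrql = DISPATCH_LEVEL;
}

static void KmdfTestRestoreIRQL(unsigned long eIRQL, KIRQL OldIRQL)
{
    if (eIRQL == EXECUTION_LEVEL_DISPATCH) g_CurrentIrql = OldIRQL;
}

static void MakeTheCall(void) { /* the DDI call under test would go here */ }

/* The flagged shape: the raise always happens, the restore only sometimes. */
void BuggyPath(unsigned long eIRQL, bool condition)
{
    KIRQL OriginalIRQL;
    KmdfTestRaiseAndPrintIRQL(eIRQL, &OriginalIRQL);
    if (condition) {
        MakeTheCall();
        KmdfTestRestoreIRQL(eIRQL, OriginalIRQL);
    }
}

/* The fix: raise and restore live in the same block, so every path balances. */
void FixedPath(unsigned long eIRQL, bool condition)
{
    KIRQL EntryIRQL = g_CurrentIrql;
    KIRQL OriginalIRQL;
    if (condition) {
        KmdfTestRaiseAndPrintIRQL(eIRQL, &OriginalIRQL);
        MakeTheCall();
        KmdfTestRestoreIRQL(eIRQL, OriginalIRQL);
    }
    assert(g_CurrentIrql == EntryIRQL);   /* be paranoid: check the balance */
}
```

Calling BuggyPath with the condition false leaves the simulated IRQL stuck at DISPATCH_LEVEL, while FixedPath comes back at the level it entered with on both paths.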

Reaping the Benefits, Part 2- Clearer Documentation of Test Practices

By requiring the code to be clean (no errors or warnings) we have to use suppression in those places where we deliberately break the rules.  Now I could call this an annoyance, but actually I think it makes it a lot easier to understand how we are testing the DDI in these cases.  So I'll leave this final sample:

        TraceEvents(TRACE_LEVEL_INFORMATION, DBG_INFO, 
            "\nCalling DDI WdfDmaEnablerCreate\n\n");

#pragma prefast(suppress:__WARNING_INVALID_PARAM_VALUE_1, "Intentionally passing 0 to test the DDI")
#pragma prefast(suppress:__WARNING_IRQ_TOO_HIGH, "Deliberately done to test framework verifier")
        status = WdfDmaEnablerCreate(
                    device,          // Handle to device object
                    pDmaConfig,      // Pointer to WDF_DMA_ENABLER_CONFIG structure
                    pAttributes,     // Pointer to WDF_OBJECT_ATTRIBUTES
                    pDmaEnabler      // Pointer to WDFDMAENABLER to create
                    );
For a further benefit, places where rules are deliberately broken for test purposes are now clearly marked by the suppression pragmas

Now, we actually pass NULL values in some other parameters- one reason this isn't reflected is there are common routines that establish some of these pointer values, and they are not annotated to reflect that the values can be NULL on output.  Someday maybe I'll get to that task [or just try to make sure we do it upfront on new test code].

Well, that run finished and I've got 128 bits of further errata to investigate [I sort by warning number, and I'd just finished the warnings for Irql too high when I saved it all off, recompiled to make sure I hadn't broken anything, and went on- I started with over 180 failures, so I'm making headway].  In case you were unaware of it, you can get the mnemonic constants for the errors [typically reported by number] from the file suppress.h in the sdk\inc directory.

So, I'm off to do more cleanup [I started with one test driver, and this just sort of ballooned on me when I decided to rip out some less desirable things I've been meaning to rip out], and I'll let you get back to your regularly scheduled workday!

Tunes

Grateful Dead

[Aoxomoxoa] China Cat Sunflower

[Workingman's Dead] High Time, Black Peter

[Europe 72] Sugar Magnolia

[Wake Of The Flood] Eyes of the World

[Shakedown Street] If I had the World to Give [Bonus Track]

[Go To Heaven] Alabama Getaway

[Dead Set] Brokedown Palace, Rhythm Devils

[Reckoning] It Must Have Been The Roses

Various Artists

[Record of Lodoss War] Instrumental [name untranslated]

[InuYasha TV OST] Kagome and InuYasha 1

Tsuneo Imahori

[Trigun the First Donuts] Knives

Yoko Kanno /Seatbelts

[Cowboy Bebop] Car 24, The Egg and I

[Cowboy Bebop Blue] Farewell Blues

[Cowboy Bebop Knockin On Heaven's Door] Diggin'

Yuki Kajiura / See-Saw

[.hack//liminality] [name untranslated- also on "Dream Field"], [name untranslated- also on "Dream Field"], [name untranslated]

[.hack//sign OST1] silent life, kiss

[Fiction] Fiction

Posted by BobKjelgaard | 0 Comments

Hardening takes time [unfortunately]

As time has gone on and I have worked with various KMDF users over the last few years, some of them have been helpful in forwarding information to me regarding installation failures even when they had successfully resolved them.  Nothing secret involved- setup logs, what fix they used and how well it worked, etc. 

Recently I received one from Dana Gregory of DataColor [names used with permission], which highlighted another problem with our older coinstallers, which I thought I'd discuss in today's installment.  While I had previously passed along instructions on fixing a broken installation by registering the runtime service for KMDF, and that worked for him, this problem would still break the next attempt to install another product (or to reinstall his own).  I'll add links for the service fix to this article, and also to the problem this showed.

But the logs showed me more than that.  This user had first attempted to use another vendor's product which utilized KMDF 1.5, tried dozens of times to get it to install, and ran afoul of one of our other issues with that coinstaller- thus never successfully installing.

He or she then uninstalled that product, and replaced it with a competing one that did not use KMDF.  I believe that the uninstall set up the problem Dana faced, but again, this customer made dozens of attempts to install, some of those probably at the direction of DataColor's tech support, before finally getting the product to work.

It doesn't take loads of empathy to realize how frustrating this experience must have been for that customer.  Not to mention the fact that an even closer customer [the vendor of the first product] suffered damage to their reputation because they had relied upon us to provide a robust installation experience for the end user and had failed to do so.  Beyond that, there were potential career repercussions for the engineers and others within that company who advocated the use of WDF [which obviously we would like to think was a good thing for them to have done].  Now some of those damages may be small in the larger picture, but they aren't small to the individuals involved, and success is something that really needs to be built one customer at a time, even in a global market.  These negative images aren't things I like to think of- all I can say at this point is I don't take them lightly and it's one reason I've been pursuing coinstaller issues as much as I have been [it wasn't even part of my job to begin with- I just got concerned folks were having trouble with installation, and I've been meddling in it ever since].

The problem here is that the coinstaller will add version information about drivers being installed to a key beneath HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Wdf.  This key is normally created when the runtime is installed.  If it is removed, the coinstaller will not attempt to create it if it determines the runtime is present- it just emits an error message and fails the installation.

When we first encountered this, Ilias had me file a bug which we would fix for KMDF 1.9 [1.7, alas, was already shipped].  In this case it was happening [we believed] because a failed boot driver caused a system to revert to last known good, which undid the registry changes made by the coinstaller.  Well, when it came to be time to address the bug fix, we had one of those long discussions about whether and how to fix this.  In this case, "argument" comes closer to describing the passion and heat level- but I don't give up as easily as I used to [rather, I don't bide my time as easily].  Consensus wasn't totally reached, but I got a majority of the key players and that was enough to get what I wanted, and believed was best for our customers.

The issue as I saw it:  all of the information under that key was either informational, related to debugging [and thus expected to be entered manually, initially]  or kept around for "future uses" which we didn't test and had no clear rationale for.  You can remove the key and all your installed drivers will function just fine.  WdfVerifier will create it if it doesn't exist [the switch for turning on diagnostic output from the loader lives in there].  So its nonexistence should just be ignored- creating it might be a good idea, but I was fine with even skipping it.

The primary opposing viewpoint was that it was installed as a part of the installation, and if it was missing, that should be treated as an indication that the installation had been corrupted, and the correct way to fix it was to re-install.  I was vehemently opposed to this on several grounds:

  1. Eventually we would have a greater minor version than 1.9 on the machine, and this logic would have the 1.9 coinstaller overinstall that and break any 1.10+ drivers [I didn't push it much though, so perhaps I didn't think of it at the time, and this is another instance of my muddled memory].
  2. The assumption is being made that the installation process is 100% reliable, and we know from experience that it isn't, and given that part of the issue is unexpected interference by unknown 3rd parties, we can never be certain of reaching that level.
  3. As I said above- there was nothing under that key that was absolutely required to make drivers operational.

My basic argument revolved around the last two: if we know the existing installation is working except for that key being absent, we should not assume it is corrupted to the point of requiring reinstallation because a key (or subkey of that key) that can be removed without disturbing its operation is no longer there.  We can know the existing installation is working, because we can now tell which version of KMDF is actually loaded and running [beginning with 1.7].  I basically relied on the old maxim "If it ain't broke, don't fix it".

I still believe this was the correct decision- it makes it harder to break an existing installation, and we already have seen cases where this sort of thing has happened.  But let's just say there isn't universal agreement on that point [or at least there wasn't at the time].
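To make the two positions in that debate concrete, here is a deliberately tiny sketch of the decision as I've described it.  This is not coinstaller source: the INSTALL_STATE structure, the action names and both policy functions are hypothetical, and the real logic also has version checks that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of what a coinstaller might know; not real coinstaller code. */
typedef struct {
    bool RuntimeLoaded;      /* a KMDF runtime is loaded and running */
    bool ControlKeyPresent;  /* the ...\Control\Wdf key exists in the registry */
} INSTALL_STATE;

typedef enum { ACTION_PROCEED, ACTION_INSTALL_RUNTIME, ACTION_FAIL } INSTALL_ACTION;

/* Behavior described for the older coinstallers: a missing key is fatal. */
INSTALL_ACTION OldPolicy(const INSTALL_STATE *State)
{
    assert(State != NULL);
    if (!State->RuntimeLoaded)     return ACTION_INSTALL_RUNTIME;
    if (!State->ControlKeyPresent) return ACTION_FAIL;
    return ACTION_PROCEED;
}

/* Position argued here: a working runtime is not broken by a missing key. */
INSTALL_ACTION ArguedPolicy(const INSTALL_STATE *State)
{
    assert(State != NULL);
    if (!State->RuntimeLoaded) return ACTION_INSTALL_RUNTIME;
    return ACTION_PROCEED;   /* key absence is ignored (it could also be recreated) */
}
```

The only case where the two differ is the one Dana's customer hit: runtime fine, key gone.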

The real good news

That would be that Wei is putting together a comprehensive set of coinstaller and versioning tests, and he has done a very thorough job of it.  It's a lot easier to get the job done right when you've got enough people to get it done, and they are capable people.  I'm making certain that this testing includes doing more "real world" testing- deliberately breaking keys, deleting and moving files, etc.  Eventually I hope we can utilize more of our fault injection technologies to assess the overall robustness of the installation itself [but there will always be holes there we can't plug- some faults you simply cannot recover from gracefully].  The problem is in good hands [in my opinion, anyway].

So the best I can say for now to those we failed in the case Dana gave me information about is that we're learning from their experiences and doing our best to not repeat old mistakes.  More to the point, we're not falling back upon quibbling about the fact that your experiences are relatively rare- we're trying to harden this as best we can.  After all- a 1% failure rate of something that occurs 1 billion times is 10 million failures- a number that had better be daunting!

I must also add that I am grateful to all those who have shared this sort of detailed information with us, rather than simply throwing up their hands in disgust and assuming we don't care about their problem.  I can find cases of people having problems using search engines, but all too often there's not enough information to tell what broke, and I see a lot of dangerous [to future users of WDF on the same machine] solutions being promoted, as well.

The links I promised: this one sets the default runtime service settings, while this one ensures the required control key settings for pre-1.9 coinstallers are present.  You may copy the text from both (or just download them with a "Save Target As", etc.).  You can rename them as .REG files and use the right-click to install method, or leave them as is and use them with the OS "reg import" command line tool.  Please do not make these part of your normal installation process- the first should never be necessary beginning with KMDF 1.7, and the second beginning with KMDF 1.9- widespread use of them could prevent us from making future improvements to the installation process and thus make life harder for everyone [including eventually yourself].  If your driver is a boot driver, then you may need to tweak the default service settings further, as the default is for non-boot.  As always, if you have problems using these or other questions, let me know!

Posted by BobKjelgaard | 0 Comments

It's soup!

You can now get the good WDF 1.7 coinstallers from Microsoft Connect.  What you will get is an MSI that will copy them and some accompanying notices to your machine.  If you have the WS08 RTM WDK (18000), replace those coinstallers with these.  If you have the 18001 WDK (with no WDF coinstallers), just use these.

In a while (hope it's a few weeks, anyway, I'm getting real tired of coinstallers) a WDK 18002 will be produced and take the place of the WDK sans coinstallers plus coinstaller MSI.  We've started the process to get this stuff into MSDN proper, but that process has been a bit sluggish lately, so I'm not about to offer any estimates as to when it will make it there.

Short and to the point...

Now Playing: Yuki Kajiura [Noir Vol 2]: Secret Game- interesting anime- gotta spring for "Madlax" sometime...

Posted by BobKjelgaard | 0 Comments

The light at the end of the WDF 1.7 tunnel

I know people want to know about this, but firm dates still prove elusive.  My best guess is today or tomorrow at this point.

Basically we're trying to put an interim package up that satisfies everyone and isn't going to cause us foreseeable problems in the future.  Eliyas keeps tweaking this MSI to satisfy one concern or another raised by members of the Dev, PM and QA teams, and once it's done, it will go to Microsoft Connect and you'll have your coinstallers.  Sometime after that we'll have an updated WDK with everything tied together nicely again.  As I understand it, the links for it on MSDN will work eventually, but I have no idea when.

As far as the real testing- all done, all good.  All the legal and licensing stuff, good.  Just have to make an MSI with everything in it that doesn't look like it was slapped together at the last minute-  even though it sort of is.

Now playing: Yoko Kanno [Ghost in the Shell Stand Alone Complex, V2]: Psychedelic Soul

Some follow up

Kumar fixed the UMDF coinstaller build problem (which you'll never see) in fairly short order, and then went on vacation himself.  So of course, more build problems turned up- but until we get that MSI out, not likely I'll get any traction.  This time the x64 update package for WinXP and Win2K3 is missing some pieces- blocks me, but again, I doubt this is a problem that affects anybody but my team [call this an attempt to give you an idea what it can be like sometimes "on the inside", because it's easy to get the idea everything is endlessly smooth and we spend all our time pondering the whichness of the why].

Now Playing: Grateful Dead [Live / Dead] St Stephen [always a welcome album]

So that's what they look like!

Yesterday I gave a brief presentation on WdfVerifier and WdfTester in a session on Driver Testing at the Global MVP summit.  Met Mikhail Vodicka, Gianlucca Varenni, Martin O Brien and Maxim Shatskih- or rather at least exchanged greetings face-to-face, either in the session or out in the hall afterward.  I'd dropped off the OSR lists for a bit to help manage my email and workload, but after talking to Martin a bit, I decided to re-subscribe.  Hopefully I can better manage my impulses to jump in on some of the esoteric issues that occasionally surface there...

Now Playing: Kraftwerk [The Mix] Musique Non Stop

Now for a different UMDF Coinstaller story...

This one can never affect you, so your blood pressure can start easing now.  It's just been a while since I tried to put on my "war stories" hat and tried tellin' one of them tales...

In the Beginning was...

the need to make sure drivers we build for test are signed so we can automate their usage and avoid all the nasty workarounds unsigned driver installation entails- particularly since we test on so many operating system platforms.  In the early versions of KMDF, we did this as part of our normal build process, getting them signed just like all the rest of the OS is.  Even when we went to produce releases (this happens in build lines not under our direct control) they built all the samples and test code as well, and life was good.  We always had a coinstaller with the right update packages in it, and everything was signed.

You still see vestiges of those halcyon days in the KMDF samples, where entries for the catalog kmdfsamples.cat can be found even now [and this will be the case for as long as we are responsible for them].  All of our test drivers are also catalogued in there, so we can mix and match with impunity and it all works as long as it's from the same build.

The price of success

But we also became a part of the OS beginning with Windows Vista.  Now for a host of reasons, the build lines that build the OS produce coinstallers for KMDF and UMDF that contain no update packages.  We did continue to build coinstallers with update packages on our private build machines, though.  We utilized various nefarious techniques to undo the system's file protection and place KMDF on a Vista machine when testing so we could test our latest versions, so things still weren't too bad.  But sometimes we wanted to use coinstallers and test binaries from differing builds- signing was becoming a problem.  More importantly, when we approached WDK release times, our external builders now only produced the coinstallers.  So we no longer had a single nice signed package automagically produced for us.

We worked around this as best we could- typically one of the SDETs assembled a "build" out of disparate pieces and then re-ran the signing steps with a bcz in the proper directory.  Like all manual processes, errors happened, but we muddled along.

Finally, late last year things got to be too much for Shefali and me- we had to run really old test content and found the signatures were no longer valid.  I should explain that a bit, if I can.  Developers working on Windows have certificates created for them identifying components they build that chain to a special test root certificate (a term you can look up, for instance- this MSDN article touches upon them) that is recognized by most interim builds produced of Windows.  This means all of our content is "signed" (and it also means that if any of those signed binaries show up where they should not, they identify who produced it, giving one easy place to start searching for a leak).  When we approach releases a switch is made to more official forms of signing- our test drivers are also part of those builds, although they never ship to anyone, so we're still good to install on those- but we can't use them anywhere else because the coinstallers are still what we call "thin" [no update packages].  Of course, since nobody should need them for very long, those certificates also have a very short shelf life, which is what I was referring to at the start of this rambling paragraph of mine.

We also had a hard time making clear to the software test engineers who were trying to run our rapidly changing test mixes what parameters to use (or even to figure out for ourselves which combinations of parameters really did what).  This led to delays, confusion, dissatisfaction [one of those poor STEs must have been sure he was on the verge of being dismissed- and that bothered me because I knew it wasn't all his fault], and other generally bad and stressful things.

If you want it done right, DIY

So, if self-signing driver packages and using test-signing approaches is good enough for our customers, it ought to be good enough for us.  I redesigned our entire automation process around this approach (with much encouragement and prodding from Shefali).  I solved both problems at the same time, but since I like to wander when telling tales, and I'm the one with the keyboard, I shall tangent...

Too many cooks

Normally a good test automation design in WTT (known to you as DTM) is fairly self-contained.  It gets its stuff, does its work, and cleans up after itself.  You see this in the three phases- setup, regular, and cleanup.  Virtually all of our test jobs worked this way, meaning any one could run independently of the others.

We have literally hundreds of jobs that work this way- the bulk of them testing the KMDF DDI.  They took parameters with little bits and pieces of path names [because most of the time everything came from a specific machine, or if the machine changed, parts of the path were known, etc] and assembled them together to locate things, install them, run them, and clean up.

But the names weren't consistent, the portions mapped weren't consistent and while it was sometimes possible to get a correct path by using .. in path entries and even blank entries for some parameters, determining those values was a logic puzzle in and of itself.  Worse, you couldn't be sure after a run that all of those jobs had really run the same thing.  Since I now found myself the only cook left in the KMDF QA kitchen, I took advantage of the situation to impose order on the chaos.

Slicing the knot

The story of Alexander the Great and the Gordian Knot has been with me all these last few weeks for some reason, and this may have been another of my "Brute Force" solutions.  I broke our test pass into three stages:

  1. Staging- in this phase, all of the tools and content used is copied from disparate and myriad sources to the test machine in a known location.  Common tools like DSF are installed.  The key to the underlying narrative here is that this job also creates a test certificate on the machine, creates a catalog containing the entire contents that had been copied earlier, and signs that catalog with the new certificate.  It then sets the machine up so it works with test-signed binaries effectively.  So there's still a kmdfsamples.cat- but now it gets built fresh and piping hot right at your table [that thought makes me want to visit Benihana].
  2. Setting the framework on the machine.  In this phase, if we need to overwrite the normal version of KMDF already there we do- either brute force (by overriding the system protection on it) or elegantly (by using a "fat" coinstaller containing the appropriate update package).  As mentioned somewhere in here, you have to reboot the machine if the coinstaller is used (in fact, you have to do it either way).  Sometimes we don't even need this phase [XP, for instance].  Whenever possible we try to utilize real coinstallers in random configurations to more closely duplicate the end user experience, after all.
  3. The tests themselves.

I had a single ground rule- only the staging job would have any parameters.  All the subsequent jobs would use what was staged.  There was a corollary based on previous experience:  those parameters would be substantially complete path names.  They might take longer to type, but it was easier to switch and accommodate quirks in how paths were assembled as you tried to get things from elsewhere if you just always took entire paths.

I finished most of that work in one weekend (in November, if I remember correctly).  The most mind-numbing part of it was modifying the existing tests- I'd go through task by task with the new "known" staged path in the clipboard, selecting each directory name I found and pasting it in.  There were some deviations that I wound up adjusting in the initial setup job because they were done in too many places.  There were places where parameters were passed down into library jobs that I left untouched (I actually set any such parameters to totally invalid values to make sure nothing escaped my wrath).

After all that surgery I now had the ability to mix and match with much greater flexibility, and simple instructions with four basic parameters that covered all the known variations we had seen.  I then created a Wiki on the internal Microsoft network where I listed the instructions for all of the normal passes we did so we could clearly communicate what settings were to be used each day- at first, the STE could literally cut and paste from my instructions.  Once they were familiar with the new setup, they could do more of the work themselves.  You wouldn't recognize that same STE today.

It worked pretty well, even if underneath there are a lot of rough edges (if you like clean setups, this isn't one- the sheer scale of the task is too big to justify yet).  It also made testing test changes easier- I build everything on my machine, and can schedule a job to pick up the content from there.  If I'm doing even more aggressive mixing and matching than usual, I actually assemble the binaries on the test machine and let the setup job copy them from there into the new official staged location [one part of the hard drive to another, but it's all with a tool we use continually and rarely needs to be done].  To be fully fair, I should add that I didn't get all the tests at this time, just the ones that we absolutely had to keep running.  For instance, our stress mix fell out.  But Shefali later chipped in on her own and got them working.

Life was good- and there was now time to work on problems in the test code instead of trying to figure out how to continually tweak creaky automation into doing something slightly different every few days.  Shefali seemed pleased, and what the heck, making the boss happy is generally a good idea in the business world...

Trouble in Paradise

Until I deployed a new test.  Or rather a new variation on an old test.  I have a rather elaborate setup I use to verify operation of the IoTarget and IRP processing function in KMDF [although I don't go totally into queues- just the very basic configurations], and to support some new features in WDF 1.7 you'll be hearing about soon, I added UMDF drivers into that test.  I had to add another parameter to make sure I had all the flexibility I needed in finding a UMDF update coinstaller, but that's not a problem.  I put it all together, tested it quite a bit, rejoicing somewhat in how easy this new process made it for me to do a test that was now creating 88 different devices on top of a virtual test bus, installing the proper drivers with no popups in sight, and then putting those devices through their paces- and it looked good.  So I called it complete, got it reviewed, and checked it all in.  [For the 2 or 3 regular readers [overestimating my impact again?] this is the test with the targets and "hunters" where I showed some code here].

But early this week, it failed.  Makecat wouldn't process the UMDF coinstaller from our own build machines (to debug it, I forced WTT to halt when this failed- we were losing logs due to some problems not worth going into here- the following is an email snippet):

I set the task to freeze if it fails.  This is weird- this is the tail of the self-sign log:

 

processing: <hash>C:\kmdftest\WUDFSvc.dll.mui

processing: <hash>C:\kmdftest\WUDFUpdate_01007.dll

NOT processed: calculating the indirect data (C:\kmdftest\WUDFUpdate_01009.dll)

Failed: CryptCATCDFEnumMembersByCDFTagEx.  Last Error: 0x80004005

 

Errors found in parsing the CDF file

A comedy of errors ensued for a while after that, as I tried to find out why the 1.7 RC1 coinstaller was there when the job parameters I was told had been used pointed to locations that couldn't have contained it [if I'd dug into the job reports, I'd have seen that when the set they gave didn't work, they pointed to a server containing that and tried it again].  Once that was settled, I began focusing on why makecat was giving me the ever-so-helpful E_FAIL parsing a file that seemed perfectly good.

Well, it was the file itself for some reason- take it out of the CDF, makecat worked.  Have it as the only one in the CDF, same error.  Since we recently had some changes made, I was wondering how they could affect hashing the file- so I went back and tried earlier versions.  This led to another comedy of errors when I inadvertently mistyped a path and the process worked [because it couldn't find the coinstaller, and without going too deep into why, I couldn't treat it as an error at this point in our setup job].
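For anyone wanting to repeat that kind of isolation, here is a minimal sketch in Python of generating CDF text, including one single-member CDF per file so each can be fed to makecat separately.  It is not the code our setup job uses.  The function names are hypothetical, the header values are common defaults rather than our actual settings, and the choice of using the full path as the member label is an assumption based on the log excerpt above.

```python
def build_cdf(catalog_name, member_paths):
    """Return the text of a catalog definition file (CDF) for makecat.

    Assumption: each member is labeled with its own full path, which
    matches the "<hash>C:\\kmdftest\\..." lines in the self-sign log.
    """
    lines = [
        "[CatalogHeader]",
        f"Name={catalog_name}",
        "PublicVersion=0x0000001",
        "EncodingType=0x00010001",
        "",
        "[CatalogFiles]",
    ]
    for path in member_paths:
        # "<hash>" asks makecat to hash the file named on the right-hand side.
        lines.append(f"<hash>{path}={path}")
    return "\r\n".join(lines) + "\r\n"


def single_member_cdfs(catalog_name, member_paths):
    """Map each path to a CDF containing only that file.

    Running makecat on each of these in turn shows which member
    is the one that cannot be hashed.
    """
    return {path: build_cdf(catalog_name, [path]) for path in member_paths}
```

Write each string to its own .cdf file and run makecat against it; the one that reproduces the failure names the culprit.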

Well, the world was looking strange- I know I ran this dozens of times while I was developing this, didn't I?  We'd run it the previous week, and there'd been no problems.  It got stranger when I ran the same thing on a Windows 7 machine, and it worked flawlessly.  Why should the OS have made a difference?  It was the same set of binaries, tools and all...  Now if I weren't stressed and well-befuddled by then, I might not have continued to be so stressed and befuddled at that point, but of course I was and I did...

But this is serious after all- in this design, all the test pass eggs are in one basket called the staging or "Unified Setup Job" and with it broken, NOTHING works!

Fools Rush In

And this old fool is no exception. 

Ilias sends me an email about the problem late in the day with this link and the comment "real programmers use butterflies".  I had decided what I was going to do, so at about 5 AM the next morning (I started sometime between 1 and 2 AM) in reply I said, "A sledgehammer is more my style"- ahh, well- afterthought says "Real SDETs use sledgehammers" might have been a better retort...

Onward- "Take the bull by the horns!", Papa sez to himself, and loads the debugger package on the machine.  Point it to makecat, give it the command line to process the CDF, and go.  Make sure we've got all the symbols [miracle of all, they were there the first time!], and set a breakpoint on the routine name which was most helpfully displayed in that error message above, and go.  Then step into the code [now while I did have symbols, I don't normally need to work with that part of the Windows source, and this is Windows 2003, anyway- so I'm doing it the old-fashioned way, reading the assembler and using the old noggin to cipher out what's up...  I wasn't totally cheating- but because I had symbols I could see internal names and also the names and types of local variables, so I wasn't flying entirely blind].

Now before I did this, I went through a phase where I thought there was a defect I could note externally that would tell me what had happened- and in the process, I dumped the headers of the coinstaller with the linker (link /dump /headers <file name>).  It seems to me that my long-term memory is beginning to suffer the ravages of age, but short-term is still pretty good, so I still remember things like this:

      10 number of directories
   11440 [      8D] RVA [size] of Export Directory
   10974 [      50] RVA [size] of Import Directory
   15000 [  12A53C] RVA [size] of Resource Directory
       0 [       0] RVA [size] of Exception Directory
   13400 [     FB8] RVA [size] of Certificates Directory
  140000 [     AE0] RVA [size] of Base Relocation Directory
    1210 [      1C] RVA [size] of Debug Directory
       0 [       0] RVA [size] of Architecture Directory
       0 [       0] RVA [size] of Global Pointer Directory
       0 [       0] RVA [size] of Thread Storage Directory
    5530 [      40] RVA [size] of Load Configuration Directory
       0 [       0] RVA [size] of Bound Import Directory
    1000 [     1D0] RVA [size] of Import Address Table Directory
       0 [       0] RVA [size] of Delay Import Directory
       0 [       0] RVA [size] of COM Descriptor Directory
       0 [       0] RVA [size] of Reserved Directory

What was odd was that even though this says a certificate was there, I couldn't see one in Explorer.  Odd, but it didn't raise any red flags to this old bull, so on he went.

Well, after much digging and a bit of backtracking, I found the place makecat decided to make that error.  So I followed the preceding call deeper and deeper and got into code that was preparing to hash the binary and was looking for parts of the PE image to exclude.  Now it happens I've done lots of hacking to binaries- stripping resources out, putting them back in, altering tables and all sorts of general mayhem, so following this code is a snap, even in assembler [with those handy locals about, anyway].  I find a path where it is clearly failing, and looking back up through the registers shown as I single-stepped the code, the values FB8 and 13400 caught my eye.  Hurrah for what memory remains!  A quick check confirmed they were the header values.  Bashing them against the values in dv, I had my cause...

It had refused to hash the binary because the certificate was not at the end of the file's memory image- specifically, the resources followed it.  It turns out this also caused the certificate to be invisible to explorer and made signtool verify most unhappy [but signtool also happily replaced the certificate in situ every time I tried, alas].
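The same condition can be checked from outside the debugger.  What follows is a rough Python sketch, not a reconstruction of what the Windows hashing code does: it reads the Certificates Directory entry from the PE optional header (for this one directory the first field is a file offset rather than a true RVA) and reports whether the certificate table runs exactly to the end of the file.  The function names are hypothetical, and treating "offset + size equals file length" as the whole rule is my assumption.

```python
import struct

SECURITY_DIR_INDEX = 4  # IMAGE_DIRECTORY_ENTRY_SECURITY


def certificate_table(data: bytes):
    """Return (file_offset, size) of the certificate table, or None if absent."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    opt = pe_offset + 4 + 20  # skip the signature and the 20-byte COFF header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:      # PE32
        dirs = opt + 96
    elif magic == 0x20B:    # PE32+
        dirs = opt + 112
    else:
        raise ValueError("unknown optional header magic")
    # NumberOfRvaAndSizes sits immediately before the directory array.
    (count,) = struct.unpack_from("<I", data, dirs - 4)
    if count <= SECURITY_DIR_INDEX:
        return None
    offset, size = struct.unpack_from("<II", data, dirs + 8 * SECURITY_DIR_INDEX)
    if offset == 0 or size == 0:
        return None
    return offset, size


def certificate_is_last(data: bytes) -> bool:
    """True when there is no certificate table or it ends at end of file."""
    entry = certificate_table(data)
    if entry is None:
        return True
    offset, size = entry
    return offset + size == len(data)
```

With the header values I dumped earlier, the table would end at 0x13400 + 0xFB8 = 0x143B8, so any file longer than that fails this check.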

I then sent a rather rambling and somewhat incoherent email to Ilias telling him we'd been building an unsignable UMDF coinstaller [too much stress and too little sleep- forgot that it had worked in Windows 7] since time immemorial.  Probably boosted his blood pressure since I wasn't all that clear I meant only on our private build line [obviously WHQL signs the official versions from time to time, after all].  I also wasn't making any clear distinction between signing the binary by embedding a certificate and signing it by having it properly hashed in a catalog signed with an embedded certificate [that duplication of terms has always been a source of confusion].  After a face to face and some more coherent and detailed explanations from me we had it down- that only the UMDF coinstaller from our private line was unsignable, and then only on Windows 2003 and earlier [I'm afraid you'll have to repeat some of what I did to see why]...

As I describe here, I can disassemble the coinstallers quite readily and in converse I know how they're put together- the problem is clearly that we signed it before we added the update package as a resource [now the package that does this could just be made smarter...]. We found out how that's happening, but fixing it is proving a challenge.

Well, the main build lines have well-funded and trained staff to handle all those scripts that handle all those things we do after build- on a private line like ours, you have something a bit more seat of the pants, and Ilias is not the originator of most of those scripts.  If you've been around software long enough, you probably get the picture- logging not quite up to snuff, commentary a bit lacking, and so on.  He's still working on it [actually, he's going on vacation, so my old buddy Kumar probably gets to hold this hot potato].

So you unhappy souls we've held up with the Server WDK [and I mean this with all sympathy and respect- you've got good reason to feel that way] aren't the only ones with coinstaller issues- but at least this one is never going to affect you.

I also got to tell him somewhere in the middle of all that about my fascination with the Gordian Knot, him being Greek and all [alas, I kept wondering if Greeks regarded Alexander as Greek, since Alexander was Macedonian- but I kept saying Mycenaean and totally fudging the issue- that aging memory again].  This time I figured my brute force approach to the knot was debugging it myself instead of doing the usual thing and trying to find someone who could just tell me or wanted to find out why something didn't work on such an old operating system [let's face it, the main focus is on Windows 7 around here].  He admitted to being one of those closet WdfVerifier users I occasionally speculate exist, and so it went [our conversations are usually quite a bit of fun, even when the situation isn't all that much fun- similar sense of humor, perhaps].

Those who hate my endless music lists can rejoice- this was much too long to bother trying to accumulate one.  But I at least got to hear all that good stuff (ahh, Garcia's "Bird Song"- "tell me all that you know- I'll show you snow and rain")

L8r!

Make that April 16th for WDF 1.7...

That seems to be the new consensus- only off by a day in that last post (which also has some background on why).

Some "Thank You"s

First off to Emil Protalinski, for this article about our updated release date on Ars Technica.  Not many people read my blog (which is fine with me), so having some coverage in an outlet of that caliber helps alleviate the flood of emails we get asking when we're going to be ready.  Heck, I wind up there at least a couple times a month without even trying...

A second to Mr Evgeny Balykov, who just this week has joined us (and Microsoft) as our latest SDET (due to visa issues he is working across the border in Canada, but he's worth the hassle).  Jenya's credentials are impressive, and he's eager to help us not have problems like this one [and to make WDF a better product overall].  One position left to fill!

Tons of Tunes, again...

Grateful Dead [Dead set]: Loser, Space

                     [Hundred Year Hall]: Me and My Uncle

Yoko Kanno [untranslated] : untranslated name- nice Japanese ballad

                    [Ghost In The Shell Stand Alone Complex]- Pet Food

Yuki Kajiura [Madlax]: Inside Your Heart

Unknown [InuYasha SoundTrack]- instrumental, sounds like the little fox demon's theme...

Posted by BobKjelgaard | 1 Comments

A New Target Date For WDF 1.7!

As noted in the previous post (among others), I've not been saying anything about when WDF 1.7 will reappear for all those eager to begin using it, biding my time and waiting for a date that sounds achievable and has been given to a customer from our Program Management team.

Well, that's happened, and it's a day familiar to most U.S. taxpayers: April 15th, 2008.  For those not in the US, that's the day income tax payments and their associated paperwork for the previous year are due [although you can get an extension on the paperwork by just filing more paperwork].

I've been involved in the testing, and this date certainly looks achievable.  So hopefully this sad sorry saga is soon to be behind us. Still, I feel the need to stress- this is not guaranteed to happen.  I could never guarantee such a thing, especially given what's already occurred.

Could I be a little more vague?

I've decided to shed a bit more light on what happened here.  As I mentioned here and here, our problems trace to the fact that we are a part of the OS itself beginning with Windows Vista, and this introduced us to new update technologies [both new to us and new to the world of OS updating in general] that we just didn't know as well as one could have hoped.

The initial problem (and the one we were well on our way to fixing by that original March 15th date) was that the mechanism we used to update KMDF 1.5 to KMDF 1.7 was undone by a must-have fix to KMDF 1.5 that was finalized and released later than Server 2008.  We knew about it, but we did them in the reverse order, and thus were quite surprised by this.

But in the world of OS fixes, there are two classes of fixes- those we know apply to all users (general distribution) and those that apply only to specific cases (limited distribution).  When broad sets of fixes like service packs are made, a decision has to be made which fixes are included and which can be updated or superseded.  Turns out we didn't understand this properly, either.

So our second issue- the one mentioned in the previous post (link above)- came when we found out the UMDF 1.7 coinstaller failed to work when one of those "limited distribution" fixes was present.  It seems our coinstallers could not update ANY fix of that sort.  Worse, fixing it so they could meant changing the way they were built and packaged- I show here how to disassemble it, so that should give you an insight into the underlying complexity.  But it's even harder than that would look, because the packages I extract there contain more internals you don't need to look at to solve the problem I was describing in that article.

So, learning what needed to be changed, getting the build processes changed, verifying the new processes [and there's a bunch of manual