Fire drill!
Our coinstallers are incredibly complex, as you may have gathered if you've been following this subject (that probably means I'm talking to myself again, but I'll post anyway).
Now in theory the way a piece of software goes out of here is it gets designed, built, heavily tested, and QA finally signs off with a minimum of fear and trepidation. But WDF 1.7's coinstallers just didn't work that way. First off, take a look at this post where I show how to disassemble a coinstaller. All of those pieces have to be put together after a normal build process (something we commonly call "post-build" for obvious reasons). This is done with custom scripts that have to be manually juggled as build processes change and things move about. I didn't even get into how not everything gets signed the same way...
On top of that, for 1.7 there was a technology change related to how we can update on Windows Vista (because we are part of the OS there, unlike the situation where we are an add-on, which was the case for 1.0-1.5). So new scripts to build that update package, and then to stuff it into the coinstaller, had to be added.
Of course all of that was late in coming. As a result, it wasn't until WDK RC1 that we even had all of the pieces together. Not only were there the errata mentioned in the above post, but there were internal flaws we kept discovering related to edge cases (actually, to be fair, Ilias is the one finding most of these), right up to literally the final build for Server 2008.
Also, because we have been short a few people in QA for a long time now, we don't have good automated tests for the coinstaller. We do use it in regular automated testing (and so far we have avoided optimizing installation just so we can leverage our automated testing), but the really important cases are all done manually. So in the end, we had a fire drill where various people tried using the coinstaller and made sure various versioning scenarios worked. Of course, other important things also had to be done, and of course, somewhere along the line even basic record-keeping like checklists went by the wayside.
So we signed off, but maybe with more fear and trepidation than usual (I still think it was the right thing to do, though), and with a lot less of the old paper trail than one would like to have.
So what's the point of all that gossip?
Well, we have of course begun to hear some feedback from people having problems with 1.7. An early common thread is problems installing on Windows Vista. The problem is that all the logs make it look like we updated from 1.5 to 1.7, but when we check afterward, the update didn't actually happen. So, we try it on a Vista machine we have handy, and Oh, blazes and be damned, it doesn't work! Well, this is on a developer's test machine, and who knows what's been going on there, but one immediately has to ask "How did this happen? We did test this, right?".
So I begin checking emails and such because that's all the records there are to fall back on. Guess who was going to test this on the final coinstaller? Guess who got sidetracked by other interrupting tasks? Guess who didn't report not being able to get to that task? Are my ears burning? Maybe somebody checked it (I could have made a verbal request, and I wouldn't have any record), but I have no firm proof this was tested on the final build.
<Expletive deleted>!
Time for a nice drill, d00d!
So I find and boot up the machine I was going to do this on in the first place (and notice none of what would have been there if I'd done the testing at the correct time was present), put the ramdisk sample driver for 1.7 and the released 1.7 coinstaller on it, and install. Installation (devcon) succeeds, but the system says it needs a reboot. Well, that's what should happen (and was happening on the failing machines as well). So I take a look at the actual binaries on the machine before I reboot: they are the 1.5 binaries. Now I think that's OK (Ilias, however, thinks they should have been replaced by this point, but fortunately, he's not around to raise my adrenaline levels), so I reboot the machine.
I reboot, and the ramdisk driver works! A check of the binaries reveals they are the 1.7 binaries. This is what I expected...
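For the curious, the check I did by hand boils down to something like the sketch below. It is not our tooling and it doesn't touch a real machine: `parse_version` and `classify_update` are made-up names, and the version strings are whatever you read off the binaries yourself. The only point is that "the installer said it worked" and "the binaries are actually at the target version after the reboot" are two separate facts, and the interesting case is when they disagree. I've deliberately left the pre-reboot state out of it, since (as noted above) we don't all agree on what it should be.

```python
from typing import Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn a dotted version string such as '1.7' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def classify_update(reported_success: bool,
                    version_after_reboot: str,
                    target_version: str) -> str:
    """Compare what the installer claimed with what is actually on disk.

    Hypothetical helper: only the major.minor part of each version is
    compared, which is an assumption made for this illustration.
    """
    if not reported_success:
        return "reported failure"

    found = parse_version(version_after_reboot)[:2]
    wanted = parse_version(target_version)[:2]

    if found >= wanted:
        return "updated"
    # The logs claim success, but the binaries are still the old ones.
    return "silent failure"


if __name__ == "__main__":
    # What I saw on my test machine: 1.7 binaries after the reboot.
    print(classify_update(True, "1.7", "1.7"))
    # What the failing machines look like: success in the logs, still 1.5.
    print(classify_update(True, "1.5", "1.7"))
```

Had something like this been wired into a checklist, at least there would be a record of which outcome each machine produced.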
I Be the Artful (Bullet) Dodger!
OK, perhaps I should have mentioned that I and others HAD been checking this periodically and irregularly- we just don't have records and even two weeks after the event (much less the timespan since Server 2008 completed) I can't remember exactly what I had done when. We signed off in the belief (apparently accurate) that we had met our criteria, but we knew there was some risk- we managed it as best we could under the circumstances.
So yes, the 1.7 coinstaller will update a Vista machine [obviously my concern was it was totally broken at the last minute and I had missed it because of what happened]. So our problem (as yet unresolved, but we're gaining some understanding even as I write this) is that there are cases where it doesn't, and it doesn't even report a failure. When it is understood, we'll also address the issue of what went wrong and how to try to prevent it in the future.
Well, as I said not all that long ago, I am a pragmatist. If I'd been dogmatic, I could have said we can't ship because we need more time to shake out problems like this. But in reality, it's highly unlikely that we could have seen what's happening in these cases even if we had that extra bake time. The final changes were for things like correct certificates which should not have affected basic installation. In my current opinion, our basic problem is we have a new servicing technology and we are still discovering its weaknesses. I suppose in retrospect I should add that I also occasionally take some risks, and while I do my best, I'm not always certain I take the "right" ones.
Can't change the past, though. For WDF 1.9 we have begun testing versioning and coinstallers even earlier in the process. We will add new automation (now that we are finally getting some of the new people we've been looking for), and do our best to avoid all that last-minute uncertainty. We keep looking for the best information we can get about what the problems in our shipped product are, and for the best ways to fix them and avoid similar problems in the future.
But we still rely on others for key parts of that technology and for the processes that build it all. Dependency after dependency, so the risk of a repeat remains.
That's how it continues to be done: do what we can, do our best, and hope it's all enough and works well enough when all is done. So far it seems to, but it's never as good as I'd like it (but then again, I really like perfect). Theory and practice, forever at odds, in part because the human element is so, ummm, human.