This one can never affect you, so your blood pressure can start easing now. It's just been a while since I tried to put on my "war stories" hat and tried tellin' one of them tales...
In the Beginning was...
the need to make sure drivers we build for test are signed so we can automate their usage and avoid all the nasty workarounds unsigned driver installation entails- particularly since we test on so many operating system platforms. In the early versions of KMDF, we did this as part of our normal build process, getting them signed just like all the rest of the OS is. Even when we went to produce releases (this happens in build lines not under our direct control) they built all the samples and test code as well, and life was good. We always had a coinstaller with the right update packages in it, and everything was signed.
You still see vestiges of those halcyon days in the KMDF samples, where entries for the catalog kmdfsamples.cat can be found even now [and this will be the case for as long as we are responsible for them]. All of our test drivers are also catalogued in there, so we can mix and match with impunity and it all works as long as it's from the same build.
The price of success
But we also became a part of the OS beginning with Windows Vista. Now for a host of reasons, the build lines that build the OS produce coinstallers for KMDF and UMDF that contain no update packages. We did continue to build coinstallers with update packages on our private build machines, though. We utilized various nefarious techniques to undo the system's file protection and place KMDF on a Vista machine when testing so we could test our latest versions, so things still weren't too bad. But sometimes we wanted to use coinstallers and test binaries from differing builds- signing was becoming a problem. More importantly, when we approached WDK release times, our external builders now only produced the coinstallers. So we no longer had a single nice signed package automagically produced for us.
We worked around this as best we could- typically one of the SDETs assembled a "build" out of disparate pieces and then re-ran the signing steps with a bcz in the proper directory. Like all manual processes, errors happened, but we muddled along.
Finally, late last year things got to be too much for Shefali and me- we had to run really old test content and found the signatures were no longer valid. I should explain that a bit, if I can. Developers working on Windows have certificates created for them identifying components they build that chain to a special test root certificate (a term you can look up, for instance- this MSDN article touches upon them) that is recognized by most interim builds of Windows. This means all of our content is "signed" (and it also means that if any of those signed binaries show up where they should not, they identify who produced it, giving one easy place to start searching for a leak). When we approach releases a switch is made to more official forms of signing- our test drivers are also part of those builds, although they never ship to anyone, so we're still good to install on those- but we can't use them anywhere else because the coinstallers are still what we call "thin" [no update packages]. Of course, since nobody should need them for very long, those certificates also have a very short shelf life, which is what I was referring to at the start of this rambling paragraph of mine.
We also had a hard time making clear to the software test engineers who were trying to run our rapidly changing test mixes what parameters to use (or even to figure out for ourselves which combinations of parameters really did what). This led to delays, confusion, dissatisfaction [one of those poor STEs must have been sure he was on the verge of being dismissed- and that bothered me because I knew it wasn't all his fault], and other generally bad and stressful things.
If you want it done right, DIY
So, if self-signing driver packages and using test-signing approaches is good enough for our customers, it ought to be good enough for us. I redesigned our entire automation process around this approach (with much encouragement and prodding from Shefali). I solved both problems at the same time, but since I like to wander when telling tales, and I'm the one with the keyboard, I shall tangent...
Too many cooks
Normally a good test automation design in WTT (known to you as DTM) is fairly self-contained. It gets its stuff, does its work, and cleans up after itself. You see this in the three phases- setup, regular, and cleanup. Virtually all of our test jobs worked this way, meaning any one could run independently of the others.
We have literally hundreds of jobs that work this way- the bulk of them testing the KMDF DDI. They took parameters with little bits and pieces of path names [because most of the time everything came from a specific machine, or if the machine changed, parts of the path were known, etc] and assembled them together to locate things, install them, run them, and clean up.
But the names weren't consistent, the portions mapped weren't consistent and while it was sometimes possible to get a correct path by using .. in path entries and even blank entries for some parameters, determining those values was a logic puzzle in and of itself. Worse, you couldn't be sure after a run that all of those jobs had really run the same thing. Since I now found myself the only cook left in the KMDF QA kitchen, I took advantage of the situation to impose order on the chaos.
Slicing the knot
The story of Alexander the Great and the Gordian Knot has been with me all these last few weeks for some reason, and this may have been another of my "Brute Force" solutions. I broke our test pass into three stages:
- Staging- in this phase, all of the tools and content used is copied from disparate and myriad sources to the test machine in a known location. Common tools like DSF are installed. The key to the underlying narrative here is that this job also creates a test certificate on the machine, creates a catalog containing the entire contents that had been copied earlier, and signs that catalog with the new certificate. It then sets the machine up so it works with test-signed binaries effectively. So there's still a kmdfsamples.cat- but now it gets built fresh and piping hot right at your table [that thought makes me want to visit Benihana].
- Setting the framework on the machine. In this phase, if we need to overwrite the normal version of KMDF already there we do- either brute force (by overriding the system protection on it) or elegantly (by using a "fat" coinstaller containing the appropriate update package). As mentioned somewhere in here, you have to reboot the machine if the coinstaller is used (in fact, you have to do it either way). Sometimes we don't even need this phase [XP, for instance]. Whenever possible we try to utilize real coinstallers in random configurations to more closely duplicate the end user experience, after all.
- The tests themselves.
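To make the staging step a little more concrete: makecat builds a catalog from a catalog definition file (CDF) that lists every file to be hashed. The sketch below, in Python, generates such a CDF for a staged set of files. It is an illustration under assumptions, not the actual setup job: the function name build_cdf is hypothetical, the header values are the commonly documented defaults rather than anything taken from our job, and the certificate creation and catalog signing steps are left out.

```python
def build_cdf(catalog_name, result_dir, files):
    """Return the text of a makecat catalog definition file (CDF).

    Assumption: header values are the usual documented defaults,
    not the ones the real staging job uses.
    """
    lines = [
        "[CatalogHeader]",
        f"Name={catalog_name}",
        f"ResultDir={result_dir}",
        "PublicVersion=0x0000001",
        "EncodingType=0x00010001",
        "",
        "[CatalogFiles]",
    ]
    # Each member is "<hash>label=path"; the path doubles as the label here.
    for path in files:
        lines.append(f"<hash>{path}={path}")
    return "\n".join(lines) + "\n"


staged = [
    r"C:\kmdftest\WUDFSvc.dll.mui",
    r"C:\kmdftest\WUDFUpdate_01007.dll",
]
print(build_cdf("kmdfsamples.cat", r"C:\kmdftest", staged))
```

Because the file list is rebuilt from whatever was actually staged, the catalog always matches the content on the machine, which is the whole point of doing it fresh on every run.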
I had a single ground rule- only the staging job would have any parameters. All the subsequent jobs would use what was staged. There was a corollary based on previous experience: those parameters would be substantially complete path names. They might take longer to type, but it was easier to switch and accommodate quirks in how paths were assembled as you tried to get things from elsewhere if you just always took entire paths.
I finished most of that work in one weekend (in November, if I remember correctly). The most mind-numbing part of it was modifying the existing tests- I'd go through task by task with the new "known" staged path in the clipboard, and selected each directory name I found and pasted it in. There were some deviations that I wound up adjusting in the initial setup job because they were done too many places. There were places where parameters were passed down into library jobs that I left untouched (I actually set any such parameters to totally invalid values to make sure nothing escaped my wrath).
After all that surgery I now had the ability to mix and match with much greater flexibility, and simple instructions with four basic parameters that covered all the known variations we had seen. I then created a Wiki on the internal Microsoft network where I listed the instructions for all of the normal passes we did so we could clearly communicate what settings were to be used each day- at first, the STE could literally cut and paste from my instructions. Once they were familiar with the new setup, they could do more of the work themselves. You wouldn't recognize that same STE today.
It worked pretty well, even if underneath there are a lot of rough edges (if you like clean setups, this isn't one- the sheer scale of the task is too big to justify yet). It also made testing test changes easier- I build everything on my machine, and can schedule a job to pick up the content from there. If I'm doing even more aggressive mixing and matching than usual, I actually assemble the binaries on the test machine and let the setup job copy them from there into the new official staged location [one part of the hard drive to another, but it's all with a tool we use continually and rarely needs to be done]. To be fully fair, I should add that I didn't get all the tests at this time, just the ones that we absolutely had to keep running. For instance, our stress mix fell out. But Shefali later chipped in on her own and got them working.
Life was good- and there was now time to work on problems in the test code instead of trying to figure out how to continually tweak creaky automation into doing something slightly different every few days. Shefali seemed pleased, and what the heck, making the boss happy is generally a good idea in the business world...
Trouble in Paradise
Until I deployed a new test. Or rather a new variation on an old test. I have a rather elaborate setup I use to verify operation of the IoTarget and IRP processing function in KMDF [although I don't go totally into queues- just the very basic configurations], and to support some new features in WDF 1.7 you'll be hearing about soon, I added UMDF drivers into that test. I had to add another parameter to make sure I had all the flexibility I needed in finding a UMDF update coinstaller, but that's not a problem. I put it all together, tested it quite a bit, rejoicing somewhat in how easy this new process made it for me to do a test that was now creating 88 different devices on top of a virtual test bus, installing the proper drivers with no popups in sight, and then putting those devices through their paces- and it looked good. So I called it complete, got it reviewed, and checked it all in. [For the 2 or 3 regular readers [overestimating my impact again?] this is the test with the targets and "hunters" where I showed some code here].
But early this week, it failed. Makecat wouldn't process the UMDF coinstaller from our own build machines (to debug it, I forced WTT to halt when this failed- we were losing logs due to some problems not worth going into here- the following is an email snippet):
I set the task to freeze if it fails. This is weird- this is the tail of the self-sign log:
processing: <hash>C:\kmdftest\WUDFSvc.dll.mui
processing: <hash>C:\kmdftest\WUDFUpdate_01007.dll
NOT processed: calculating the indirect data (C:\kmdftest\WUDFUpdate_01009.dll)
Failed: CryptCATCDFEnumMembersByCDFTagEx. Last Error: 0x80004005
Errors found in parsing the CDF file
A comedy of errors ensued for a while after that, as I tried to find out why the 1.7 RC1 coinstaller was there when the job parameters I was told were used pointed to locations that couldn't have contained it [if I'd dug into the job reports, I'd have seen that when the set they gave didn't work, they pointed to a server containing that and tried it again]. Once that was settled, I began focusing on why makecat was giving me the ever-so-helpful E_FAIL parsing a file that seemed perfectly good.
Well, it was the file itself for some reason- take it out of the CDF, makecat worked. Have it as the only one in the CDF, same error. Since we recently had some changes made, I was wondering how they could affect hashing the file- so I went back and tried earlier versions. This led to another comedy of errors when I inadvertently mistyped a path and the process worked [because it couldn't find the coinstaller, and without going too deep into why, I couldn't treat it as an error at this point in our setup job].
Well, the world was looking strange- I know I ran this dozens of times while I was developing this, didn't I? We'd run it the previous week, and there'd been no problems. It got stranger when I ran the same thing on a Windows 7 machine, and it worked flawlessly. Why should the OS have made a difference? It was the same set of binaries, tools and all... Now if I weren't stressed and well-befuddled by then, I might not have continued to be so stressed and befuddled at that point, but of course I was and I did...
But this is serious after all- in this design, all the test pass eggs are in one basket called the staging or "Unified Setup Job" and with it broken, NOTHING works!
Fools Rush In
And this old fool is no exception.
Ilias sends me an email about the problem late in the day with this link and the comment "real programmers use butterflies". I had decided what I was going to do, so at about 5 AM the next morning (I started sometime between 1 and 2 AM) in reply I said, "A sledgehammer is more my style"- ahh, well- afterthought says "Real SDETs use sledgehammers" might have been a better retort...
Onward- "Take the bull by the horns!", Papa sez to himself, and loads the debugger package on the machine. Point it to makecat, give it the command line to process the CDF, and go. Make sure we've got all the symbols [miracle of all, they were there the first time!], and set a breakpoint on the routine name which was most helpfully displayed in that error message above, and go. Then step into the code [now while I did have symbols, I don't normally need to work with that part of the Windows source, and this is Windows 2003, anyway- so I'm doing it the old-fashioned way, reading the assembler and using the old noggin to cipher out what's up... I wasn't totally cheating- but because I had symbols I could see internal names and also the names and types of local variables, so I wasn't flying entirely blind].
Now before I did this, I went through a phase where I thought there was a defect I could note externally that would tell me what had happened- and in the process, I dumped the headers of the coinstaller with the linker (link /dump /headers <file name>). It seems to me that my long-term memory is beginning to suffer the ravages of age, but short-term is still pretty good, so I still remember things like this:
10 number of directories
11440 [ 8D] RVA [size] of Export Directory
10974 [ 50] RVA [size] of Import Directory
15000 [ 12A53C] RVA [size] of Resource Directory
0 [ 0] RVA [size] of Exception Directory
13400 [ FB8] RVA [size] of Certificates Directory
140000 [ AE0] RVA [size] of Base Relocation Directory
1210 [ 1C] RVA [size] of Debug Directory
0 [ 0] RVA [size] of Architecture Directory
0 [ 0] RVA [size] of Global Pointer Directory
0 [ 0] RVA [size] of Thread Storage Directory
5530 [ 40] RVA [size] of Load Configuration Directory
0 [ 0] RVA [size] of Bound Import Directory
1000 [ 1D0] RVA [size] of Import Address Table Directory
0 [ 0] RVA [size] of Delay Import Directory
0 [ 0] RVA [size] of COM Descriptor Directory
0 [ 0] RVA [size] of Reserved Directory
What was odd was that even though this says a certificate was there, I couldn't see one in Explorer. Odd, but it didn't raise any red flags to this old bull, so on he went.
Well, after much digging and a bit of backtracking, I found the place makecat decided to make that error. So I followed the preceding call deeper and deeper and got into code that was preparing to hash the binary and was looking for parts of the PE image to exclude. Now it happens I've done lots of hacking to binaries- stripping resources out, putting them back in, altering tables and all sorts of general mayhem, so following this code is a snap, even in assembler [with those handy locals about, anyway]. I find a path where it is clearly failing, and looking back up through the registers shown as I single-stepped the code, the values FB8 and 13400 caught my eye. Hurrah for what memory remains! A quick check confirmed they were the header values. Bashing them against the values in dv, I had my cause...
It had refused to hash the binary because the certificate was not at the end of the file's memory image- specifically, the resources followed it. It turns out this also caused the certificate to be invisible to Explorer and made signtool verify most unhappy [but signtool also happily replaced the certificate in situ every time I tried, alas].
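The condition described above can be checked from outside without a debugger. What follows is a minimal sketch in Python, not the code makecat runs (that was only ever read in assembler). It reads the Certificates Directory entry from the PE optional header and tests whether anything follows the certificate in the file. The assumptions: the strict "ends exactly at end of file" test is a simplification of whatever the real hashing code does, the function names are made up for illustration, and synthetic_image builds a header-only fake image so the example runs without a real coinstaller. The PE/COFF layout itself (e_lfanew at 0x3C, data directories at offset 96 for PE32 and 112 for PE32+, slot 4 for certificates, whose first field is a file offset) is standard.

```python
import struct

SECURITY_INDEX = 4  # data directory slot for the Certificates Directory


def certificate_entry(image: bytes):
    """Return (file_offset, size) of the certificate blob, or None."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)  # e_lfanew
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    optional = pe_offset + 4 + 20  # skip signature and COFF file header
    (magic,) = struct.unpack_from("<H", image, optional)
    if magic == 0x10B:      # PE32
        directories = optional + 96
    elif magic == 0x20B:    # PE32+
        directories = optional + 112
    else:
        raise ValueError("unknown optional header magic")
    (count,) = struct.unpack_from("<I", image, directories - 4)
    if count <= SECURITY_INDEX:
        return None
    offset, size = struct.unpack_from(
        "<II", image, directories + 8 * SECURITY_INDEX)
    return (offset, size) if offset and size else None


def certificate_is_last(image: bytes) -> bool:
    """True when nothing follows the certificate (or there is none)."""
    entry = certificate_entry(image)
    if entry is None:
        return True
    offset, size = entry
    return offset + size == len(image)


def synthetic_image(cert_offset, cert_size, total_size):
    """Build a zero-filled fake PE32 with only the fields read above."""
    buf = bytearray(total_size)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x80)
    buf[0x80:0x84] = b"PE\0\0"
    optional = 0x80 + 24
    struct.pack_into("<H", buf, optional, 0x10B)
    struct.pack_into("<I", buf, optional + 92, 16)  # number of directories
    struct.pack_into("<II", buf, optional + 96 + 8 * SECURITY_INDEX,
                     cert_offset, cert_size)
    return bytes(buf)


# Header values from the dump: certificate at 13400, size FB8.
good = synthetic_image(0x13400, 0xFB8, 0x13400 + 0xFB8)
# Extra bytes after the certificate stand in for the late-added resources.
bad = synthetic_image(0x13400, 0xFB8, 0x13400 + 0xFB8 + 0x1000)
print(certificate_is_last(good), certificate_is_last(bad))  # True False
```

To try it on a real binary, read the file with open(path, "rb").read() and pass the bytes to certificate_is_last.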
I then sent a rather rambling and somewhat incoherent email to Ilias telling him we'd been building an unsignable UMDF coinstaller [too much stress and too little sleep- forgot that it had worked in Windows 7] since time immemorial. Probably boosted his blood pressure since I wasn't all that clear I meant only on our private build line [obviously WHQL signs the official versions from time to time, after all]. I also wasn't making any clear distinction between signing the binary by embedding a certificate and signing it by having it properly hashed in a catalog signed with an embedded certificate [that duplication of terms has always been a source of confusion]. After a face to face and some more coherent and detailed explanations from me we had it down- that only the UMDF coinstaller from our private line was unsignable, and then only on Windows 2003 and earlier [I'm afraid you'll have to repeat some of what I did to see why]...
As I describe here, I can disassemble the coinstallers quite readily and in converse I know how they're put together- the problem is clearly that we signed it before we added the update package as a resource [now the package that does this could just be made smarter...]. We found out how that's happening, but fixing it is proving a challenge.
Well, the main build lines have well-funded and trained staff to handle all those scripts that handle all those things we do after build- on a private line like ours, you have something a bit more seat of the pants, and Ilias is not the originator of most of those scripts. If you've been around software long enough, you probably get the picture- logging not quite up to snuff, commentary a bit lacking, and so on. He's still working on it [actually, he's going on vacation, so my old buddy Kumar probably gets to hold this hot potato].
So you unhappy souls we've held up with the Server WDK [and I mean this with all sympathy and respect- you've got good reason to feel that way] aren't the only ones with coinstaller issues- but at least this one is never going to affect you.
I also got to tell him somewhere in the middle of all that about my fascination with the Gordian Knot, him being Greek and all [alas, I kept wondering if Greeks regarded Alexander as Greek, since Alexander was Macedonian- but I kept saying Mycenaean and totally fudging the issue- that aging memory again]. This time I figured my brute force approach to the knot was debugging it myself instead of doing the usual thing and trying to find someone who could just tell me or wanted to find out why something didn't work on such an old operating system [let's face it, the main focus is on Windows 7 around here]. He admitted to being one of those closet WdfVerifier users I occasionally speculate exist, and so it went [our conversations are usually quite a bit of fun, even when the situation isn't all that much fun- similar sense of humor, perhaps].
Those who hate my endless music lists can rejoice- this was much too long to bother trying to accumulate one. But I at least got to hear all that good stuff (ahh, Garcia's "Bird Song"- "tell me all that you know- I'll show you snow and rain")
L8r!