Root causing a not reproducible KMDF installation issue- part 1: Looking for trouble, finding it, and working around it
Working in the Windows Driver Frameworks QA department has its challenges. One of them for me has been that occasionally there is a paucity of the interesting problems that make for an intense workday. Since those are the days you write code or blog, they aren't wasted- they just lack that adrenaline...
For one thing, in my opinion the development team is usually very good about testing before checking in. Their unit tests are excellent and keep getting better [a topic I hope to return to someday]. For another, the other members of my team have put together additional tests that are also quite effective. For instance, we have exhaustive DDI-level tests for the frameworks. Microsoft has an excellent set of code coverage measurement tools, and we utilize these to further improve test coverage [the more you cover at even the path level, the less there is of "the easy stuff" to find]. In my case, a final factor is I'm Johnny-come-lately, so KMDF was much more stable by the time I began working on it than is often the case.
But this is good- there are other aspects of quality than the simple bugcheck or hang, and that leaves time to go after them. For instance, it leaves time to pursue "the ones that got away".
This time, the problem could not be reproduced except on the end user's machine, and there were oceans, time zones, and a few national borders separating us. Good reason to give up?
Well, what would you do for a customer with a problem? The answer says a lot about how your business will fare [sorry- I used to run my own, so it's hard for me to not think in those terms].
Making House Calls? Or Just "Beating the Bushes" to Search for Trouble?
As an aside, this is how I got to where I could hear of this problem.
The frameworks are developer tools, and developers of course have Internet communities. Being shy to begin with and something of the lone wolf [although I prefer the term self-sufficient], I have rarely participated in these. But my manager insists I commit to doing something to "engage customers", so I decide a mailing list is a good place to start.
OSR's NTDEV mailing list is recommended to me by several colleagues as an active community of likely early adopters of KMDF, and also having a high level of expertise. So, with a general idea of monitoring traffic to see if we missed anything in our test processes, I subscribe. I also begin participating [often quite foolishly, because my judgement's never been terribly good in any community- I also tend to shoot from the hip, which doesn't always go well with such an audience], but that's not of much further relevance.
An Opportunity to Help
So: many days later, a post arrives (http://www.osronline.com/showthread.cfm?link=110819). A developer has used KMDF in their product, and the installation is failing mysteriously on a very small percentage of machines, all of them in Europe. This sounds like a potential test escape to me! After exchanging a few posts on the mailing list, I ask the developer if we can work via email to address the problem. He agrees.
So, just what seems to be the problem?
Those who read Doron Holan's blog know that there are several files that can tell us quite a bit of history about what may have broken for you when a KMDF driver installation fails. Our customer is obviously conversant with these facts, because his very first post has the problem that I focus on from the beginning.
WdfCoInstaller: [03/27/2007 23:59.53.609] DIF_INSTALLDEVICE: Pre-Processing
WdfCoInstaller: [03/27/2007 23:59.53.718] ReadComponents: WdfSection for Driver Service hypkern using KMDF lib version Major 0x1, minor 0x1
WdfCoInstaller: [03/27/2007 23:59.53.875] VerifyMSRoot: exit: error(0) Op?ration r?ussie.
WdfCoInstaller: [03/28/2007 00:01.11.203] Update process returned error code :status(C0000005) <no error text>. Possible causes are running fre version of coinstaller on checked version of OS and vice versa
WdfCoInstaller: [03/28/2007 00:01.17.875] DIF_INSTALLDEVICE: Pre-Processing
WdfCoInstaller: [03/28/2007 00:01.17.890] ReadComponents: WdfSection for Driver Service hypaudio using KMDF lib version Major 0x1, minor 0x1
WdfCoInstaller: [03/28/2007 00:01.17.953] DIF_INSTALLDEVICE: GetLatestInstalledVersion install version major 0x1, minor 0x1 is less then or equal to latest major 0x1, minor 0x1, asking for post processing
WdfCoInstaller: [03/28/2007 00:01.18.859] DIF_INSTALLDEVICE: Post-Processing
WdfCoInstaller: [03/28/2007 00:03.19.437] DIF_INSTALLDEVICE: Pre-Processing
Take a look at the bolded line, here- looks like someone has a serious problem on their hands!
C0000005 is the exception code for an access violation [touching non-existent memory]. What seems strange (and this issue resurfaces later) is that this is usually accompanied by a popup, or an error report, and not one mention has been made of this.
The rest of the error message indicates that perhaps we should be more explicit in checking error codes, because the customer was wasting his time pursuing a red herring, so I make that suggestion to the development team.
Favoring the Direct Approach
This gentleman's company apparently has an enviable dynamic with its customers, because they have several who are helping them troubleshoot the problem. Not everyone has such a good relationship, and it made a world of difference here.
Still, first you try to reproduce the problem right where you can get your hands on it if you can. We know it's KMDF 1.1, and XP SP2 French and German- not much else seems relevant, although we do try more things than I'm going to bore you further with. After a few fruitless attempts on our part to reproduce the problem (not a big surprise, since their engineers have also been unable to reproduce it), I begin to focus on two things:
- What can we do for a workaround? If the end user (and our IHV customer, who remains involved throughout this process) can see progress, then they will have more confidence that we can deliver a solution, and will be more motivated to continue to work with us to solve this problem. Since it looks like we have to debug transcontinentally, it may take a while- lose the engagement, and you may never find out what really happened.
- How do we find the source of that access violation, when the only cases of it are thousands of miles away, OCA/Watson/WER doesn't seem to be happening, and we don't even share the same language?
The first is the more important- it is clear that we at least have clear reproduction of the problem somewhere, and a workaround will make the IHV's (and by extension our) customers able to enjoy their product as intended, while we work on the second problem. As I mentioned earlier, it will also boost confidence, helping to guarantee we keep the essential players involved.
So after some of my typically incoherent requests for information, I write a batch file that generates a simple text file that will let us assess the current state of the installation on those machines. It tells me if the files are where they should be, what their vintage is, and also if the registry shows the appropriate services installed, and also that the framework registry support has been properly created.
reg export hklm\System\CurrentControlSet\Services\wdf01000 Service.Prt
reg export hklm\System\CurrentControlSet\Control\wdf Loader.Prt
Dir %windir%\system32\wdf* > CoinstallerFiles.Prt
Dir %windir%\system32\drivers\wdf* > RuntimeFiles.Prt
copy *.prt InstallationStatus.Log
del *.prt
Bob's handy-dandy batch file for assessing KMDF installation health in one swell foop. Copy it, run it, and send us InstallationStatus.Log.
What we find is that on each machine, the framework's files have been copied to where they belong, but none of the registry entries have been made. This and some reading of the coinstaller's source code provides the insight needed for solving the first priority.
It is pretty clear now that this is what has happened:
We provide developers with a coinstaller for KMDF. They include this in their device package and make specific INF changes to invoke it when their device is installed.
The KMDF coinstaller first checks to see if there is any need to install the version of the KMDF runtime it contains. If it finds that KMDF is not present, or that the version that is present is not as new as the one it contains, then it will try to install its payload.
To do this, we have an executable hidden in the coinstaller's resources, which is extracted as a temporary file, and then executed.
Normally, this just works, or else it returns a specific error code related to checked/free mismatches between the coinstaller and the OS. The latter should never happen for a typical end user, because checked OS versions are only provided to software developers, and are not redistributable by them.
But this time it failed, and then [as often happens with unexpected failures] things got worse. The installation has actually done its work, and all the files needed for the framework are on the machine. But since it failed, we will not complete the process and take the steps needed to make the framework usable.
Unfortunately, the way the check for prior version installation works is this: we check if the files are where they belong, and if they are, we see what version they are. On these machines, this check tells us that the correct version is present, so all subsequent installations "succeed". But none of them actually work (because the steps to make KMDF usable were never taken).
So I make another "suggestion"- that initial check needs to be strengthened- some malicious soul could just drop a file in the right place on an end user's machine that wasn't using our technology and we could never successfully install on it (for me clearing up an earlier mystery from NTDEV, where a developer trying to troubleshoot an installation issue did this to himself more than once).
Now it's time for that workaround- the coinstaller's developer has some REG files handy that we can use to complete the installation properly [normally used when working on WinPE]. We send these and instructions on how to use them to complete the aborted installation. If they use these, and retry the installation, everything should work.
Get out the way...
After the usual transcontinental delays, we get back some results. We now know that we have a workaround!
That should ease some tensions and build confidence for both our direct and indirect customers. After several weeks of frustrating delays, some progress!
The job isn't over, yet- time now for the second problem- what is causing that access violation?
This being plenty long for an initial post, I'll save THAT story for the next time around.
Later...