Root-causing a non-reproducible KMDF installation issue, part 2: Not stupid, merely human
In the last installment, we had a workaround, so people could get on with their lives. BUT, there's still that problem of an access violation of unknown origin that happens every time we try to install KMDF 1.1 on a machine a continent or two away...
I'll get right to the stupid / human part. In my befuddled mind, I am convinced that this AV is being handled within the failing program itself. Actually, it is being handled by the OS code that manages the process. But I got lucky, and my "direct approach" is still a sound one: if I use the normal post-mortem techniques, the faulting stack will already have unwound to the exception handler that invoked the post-mortem debugger (although the original context is typically still available through the debugger's .cxr command). Being a simpleton, I want to see what's actually failing rather than make guesses [weeks have elapsed, there's been a whole lotta guessin' goin' on, and nothing really useful has come of it]. So I may be about to make things harder than they need to be.
So, the plan [simple minds like simple plans, and this is pure brute force]:
- Have the user install the Debugging Tools For Windows package, using the default settings.
- Create a REG file that uses Image File Execution Options to debug just the program I think is failing.
- Create a debugger script that, as each access violation occurs (aka "first chance" exception), captures a rather exhaustive minidump, and just keeps doing this [overwriting it each time] until the process ends.
- Provide batch files for setting it all up and tearing it down, so that the user isn't left with a big mess on their machine when we're done [we are guests, after all- if we can't do the dishes, we can at least pick up after ourselves].
The user received a ZIP file with several files and some instructions. The first file here is this batch file, run to start the process:
@echo off
@Echo Preparing to capture dumps from broken installation...
md c:\dumps
md c:\dumps\backups
@Echo Preparing to capture dumps from broken installation... > C:\Dumps\Preparation.log
reg import DebugUpdateExe.reg >> C:\Dumps\Preparation.log
net stop hypkern >> C:\Dumps\Preparation.log
net stop hypaudio >> C:\Dumps\Preparation.log
net stop wdf01000 >> C:\Dumps\Preparation.log
sc delete hypkern >> C:\Dumps\Preparation.log
sc delete hypaudio >> C:\Dumps\Preparation.log
sc delete wdf01000 >> C:\Dumps\Preparation.log
del %windir%\system32\drivers\hypkern.sys >> C:\Dumps\Preparation.log
del %windir%\system32\drivers\hypaudio.sys >> C:\Dumps\Preparation.log
del %windir%\system32\drivers\wdf01000.sys >> C:\Dumps\Preparation.log
del %windir%\system32\drivers\wdfldr.sys >> C:\Dumps\Preparation.log
copy script.txt c:\dumps >> C:\Dumps\Preparation.log
copy %windir%\setup*.log C:\Dumps\Backups >> C:\Dumps\Preparation.log
copy %windir%\wdf*.log c:\Dumps\Backups >> C:\Dumps\Preparation.log
del %windir%\setup*.log >> C:\Dumps\Preparation.log
del %windir%\wdf*.log >> C:\Dumps\Preparation.log
Echo You are now ready to attempt the installation again >> C:\Dumps\Preparation.log
Echo You are now ready to attempt the installation again
A batch file for a roll-your-own "try an install, get a minidump with the final AV, and make sure you get all the backup info you need in case anything in this massive hack breaks".
The batch file does these things in order (I leave matching each step to its batch file lines up to the reader, but you can always ask if you want to).
- Directories are made to hold the dump and to backup setup-related files.
- A log file for the batch process is maintained, so we can review it in case anything breaks.
- We set things up so the debugger will debug the update binary when it spawns.
- Just to be safe, all of their driver services are stopped and then deleted (as is KMDF). This is insurance, as they were told to uninstall first.
- To be even safer, all of the driver files are deleted.
- The debugger script is copied to a known directory, since the command line we registered references it.
- The setup-related log files are cached and then deleted.
- Just to be nice, we spit out a message saying they are now ready to retry the installation (the message is in a foreign language for them, which is hopefully forgivable under the circumstances).
Our second file is this registry file that starts the secret debugging sauce:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\update.exe]
"Debugger"="C:\\Program Files\\Debugging Tools For Windows\\Cdb.Exe -G -c \"$<C:\\\\Dumps\\\\Script.Txt\""
Secret sauce? No- just a way to make sure we debug only what we want to debug.
Key things to note here:
- I used CDB instead of WinDbg because by default, WinDbg is going to throw a popup when the session starts, and I don't want to confuse the user any further than I already will.
- The path is hard coded, but it is the default when the MSI on WHDC is used. I verified this, and provided the user with instructions to allow them to double-check, just in case.
- The command line itself skips unneeded breaks (-G tells CDB to ignore the final breakpoint at process exit) [I imagine having a console window pop up in the middle of the installation is going to be shock enough, why make them punch keys?], and executes the script, which is what I will show next.
The third file in the package is the debugger script itself:
sxd -c ".dump /ma /o C:\\Dumps\\UpdateAV.Dmp" av;g
A simple script for catching the final access violation in the program when it occurs, and then taking a "full minidump"
Now this is pretty brute force: the script creates a dump every single time an access violation occurs! AVs occur a lot in user mode, though most are silently handled. But each dump overwrites the previous one, so only the last one survives. Furthermore, if the final dump isn't the right one [as in maybe some AV occurs after the handler has decided to reflect the failure out], I can change a switch in the .dump command and keep a separate dump for each AV (and have them send the last half dozen or so, which I could mine to see if they had what I was looking for).
Finally, since I'm trying to be a good houseguest, there is the last file, the "cleanup script":
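For reference, here is a sketch of what the "one dump per AV" variant of the script might look like. The switch isn't named in this post, so treat /u as my assumption: the debugger documentation describes it as appending the date, time, and PID to the dump file name, which matches the timestamp-based naming described later. With unique names there is nothing to overwrite, so /o is dropped.

```text
$$ Assumption - /u is the switch in question (it appends date, time and PID to the file name)
sxd -c ".dump /ma /u C:\\Dumps\\UpdateAV.Dmp" av;g
```

Everything else (the registry entry and the batch files) would stay the same, since only the contents of Script.Txt change.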
@Echo off
echo Cleaning up the registry and collecting logs...
echo Cleaning up the registry and collecting logs... > C:\Dumps\Cleanup.Log
reg delete "hklm\software\microsoft\windows NT\currentversion\Image File Execution Options\Update.Exe" /v Debugger /f >> C:\Dumps\Cleanup.Log
copy %windir%\setup*.log c:\Dumps >> C:\Dumps\Cleanup.Log
del %windir%\setup*.log >> C:\Dumps\Cleanup.Log
copy %windir%\wdf*.log c:\Dumps >> C:\Dumps\Cleanup.Log
del %windir%\wdf*.log >> C:\Dumps\Cleanup.Log
copy c:\Dumps\Backups %windir% >> C:\Dumps\Cleanup.Log
@Echo Cleanup complete >> C:\Dumps\Cleanup.Log
@Echo Cleanup complete
Cleaning up after the fact- turn off the debugger intervention, grab the installation logs, and put the original installation records back where you found them.
On my end, I not only try these pieces out to make sure they work, I also save the final dump for comparison against the one they send back. If it is the same AV, then we missed something. I also try the change that catches "all" AVs (actually it doesn't quite: the switch uses a timestamp to name the files, and its granularity isn't fine enough to literally catch all of them). That leaves me looking at 100+ MB of dumps, but I decide we can cross that bridge when we get to it.
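Comparing two dumps mostly comes down to comparing the exception record and the faulting stack in each. A minimal sketch of the commands I'd expect to use, assuming each dump has been opened in the debugger (for example with cdb -z and the dump path) and that the dump carries the exception information from the moment it was taken:

```text
$$ Show the most recent exception record (code, faulting address, parameters)
.exr -1
$$ Switch to the register context saved with that exception
.ecxr
$$ Display the stack for that context
kb
```

If the exception address and the top few frames match between my dump and theirs, it is the same AV.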
So who gets to play host?
I already know from previous submissions that our valiant European consumers can ZIP files, but even so, these dumps are going to be too big for an email. Somebody needs an FTP site. Yes, Microsoft has them, but it's not like I know anything about that! So I ask the IHV engineer who's acting as the middleman if they have the capacity. Turns out they do, so Bob can continue his slacker ways and let them host the files...
When the middleman tries my technique, his installation breaks. Might the debugger's return code be creeping out the code that calls it? I decide that as long as we get the dumps, we'll have what we really need, so I urge him (perhaps against his better judgement) to pass it on to our end users [long may their devices function!].
No fairy tale ending here...
So, when all is said and done, what happens? I get the same dump from the end user that I got on my own machine [cue Obi-Wan, who gestures and says "this is not the AV you are looking for"]. I could make the changes to ask for more, but for the moment, doubt about the whole chain of thought that has led us to this pass is the order of the day.
Well, if George Lucas can end the second part of his trilogy on a down note, why can't I? Since suspense won't change my compensation, I'll say now that I do eventually figure out where it all went wrong [and the above technique is basically sound, or I wouldn't have listed it here]- but I'm saving that for the next and final installment.