Serendipitous PreFast, or A Day in the Life
I noticed yesterday [Wednesday, 9/26] that my blog is listed on the WHDC page for WDF tools, and the same day a static analysis tool saved me hours of future work, so somehow from that came the thought of sketching a day in my life as an SDET, just to keep to the title of the blog.
Trouble ahead
No lady in red, though! After grabbing a fresh-brewed Starbucks from the kitchen, and a Mountain Dew Game Fuel from the cooler there before I left, I reviewed my emails of the previous day. I had two test pass failures from the lab to look into (more accurately, SOMEBODY had to look into them, and since it was 5 AM, the reports were from the previous day and nobody else had found the time yet, I stepped up and began doing some of what I get paid for].
We have an extensive set of automated tests [the automation is essentially the same as what you folks see in DTM], and we run them each week in a lab elsewhere on the campus against the latest builds. In this particular case, I was looking at two bugchecks- one on the latest incarnation of KMDF 1.7, and the other on the latest of what will probably become KMDF 1.9 [not sure why we're fond of the odd numbers- wasn't it supposed to be the even numbered Star Trek movies that didn't...]. The 1.7 case was really weird, and I eventually asked the Dev team to look into it- I suspect it's a test code bug, but if I'm uncertain, it never hurts to get Eliyas' expert opinion.
The 1.9 case was easier- someone on OSR's NTDEV list had asked Doron why we didn't NULL the WDFREQUEST pointer passed when you call any of the WdfIoQueue... request retrieval routines and they fail to find a suitable request. That seemed like a good idea, so it's a "bug fix" in 1.9. Well, we have detailed DDI-level tests, and one of them was failing. The case that was failing is one where we submit a single request to a queue, then call WdfIoQueueRetrieveNextRequest twice to make sure it fails appropriately the second time. This test was originally coded so it pointed to the same WDFREQUEST variable both times, and then completed it afterwards. This meant it relied on the undocumented behavior that the WDFREQUEST pointed to was not touched when a call failed. Now that we NULL it, the test passes a NULL WDFREQUEST to WdfRequestComplete, and since we leverage the Wdf Verifier capabilities in these tests, we were now rewarded with the requisite bugcheck.
Well, that's easily fixed, and it costs everyone to leave it as it is [we tie up lab assets, including support staff, waiting to triage a known problem, and have to spend our own time investigating them when we've all got plenty else to do], so I checked out the offending code and fixed it. I then built the driver [I like x64 machines- they helped me get this job, and I have one as my main machine- so I usually build amd64fre or amd64chk first] to make sure the change hadn't broken anything.
My build environment ALWAYS runs PFD [as a background task] against checked out or new code each time I build with it- it also includes a little applet that displays an icon in the system tray- it shows green when all the new or modified code is clean, yellow if there are warnings, and red if there are errors. While the build went clean, I waited for PFD (reading my email while listening to my usual eclectic music mix on my headphones, because my neighbors don't enjoy my music quite the way I do) to finish- it was yellow. Double clicking the icon brought up 7 warnings, all in the single file I had checked out.
How could that be?
Many reasons- errors are unacceptable, but some people don't bother with warnings. In my own tests, even though my code is always green when I check it in, these are evolving tool sets, and as they get more sophisticated, they find things they couldn't before. This wasn't my code originally, so it may have been a combo of both.
Now since I'm the one owning it at the moment, I'd prefer to apply my rules. But it's not as simple as that- I need to make sure I know the "rules of engagement" for this code base. Usually changes need an accompanying bug report, so I write one. But good practice also dictates that at times you only make the minimum necessary changes to avoid unnecessary destabilization of the product. I ask my lead if I can get approval to check in the fix, and she tells me that currently we don't have that kind of restriction on test code. I take that as carte blanche to fix the PreFast issues [but I do tell her so in case I am assuming too much].
I once sneered at static analysis tools- I had to be forced to use PreFast at first. But that first time, among all "those worthless nitpicking trivia" were several real bugs. I later realized that even the "trivial" problems I had incorrectly analyzed, and represented potential threats. Sometime after that I came to the conclusion that being "green" was just good discipline- maybe those practices WERE safe in some particular instance, but why get comfortable with shoddy practice? Even though this is test code, as I am already seeing, building it solidly in the first place pays dividends over the long run.
The monster driving my thinking is this: I have new tests to write for new features in 1.9 [can't tell you what those are yet, of course]- if the tests don't get written, and the feature isn't adequately tested, it may never appear. Our dev team does great work, and I want to support that, so time spent in maintenance isn't "annoying" or "boring"- it's a schedule risk to development plans, which makes it a lot more of a problem. So I do a quick analysis and fix of each problem- this means I might miss something, but in most cases, good solutions are clear, even if perfect or optimal ones are not.
Today, at least, the tradeoff is making sure this maintenance action gets us closer to that ideal of tests that never break except when the product does. Of the 7 warnings, I suppress one [I want to look into it further later- it identified a potential leak, and I took some steps to fix it, but in reality, if the leaking path occurred, I'd be facing more important problems anyway]. Almost all of the others are real quality issues- several were of the sort where a superfluous check occurred- the risk there being that when you dig deeper, it means that one can take a future maintenance action that breaks or degrades the test, and it gets missed because the check "handles" the mistake for you. For instance, in a test of the WdfRequestRequeue DDI (sorry about my color choices- that's me being idiosyncratic- those on a black background is how I look at the world of code):
if(NT_SUCCESS(status)){
TraceEvents(TRACE_LEVEL_INFORMATION, DBG_INFO,
"\nWdfRequestRequeue DDI called successfully\n");
status = WdfIoQueueRetrieveNextRequest(
Queue,
pRequestNext
);
if(NT_SUCCESS(status) && pRequestNext != NULL){
TraceEvents(TRACE_LEVEL_INFORMATION, DBG_INFO,
"\nWdfIoQueueRetrieveNextRequest DDI called successfully\n");
if(*pRequestNext == *pRequest){
status = STATUS_SUCCESS;
WdfRequestCompleteWithInformation(
Request, // Request
status, // Status
0L // Information
);
}
else{
WdfRequestCompleteWithInformation(
Request, // Request
STATUS_UNSUCCESSFUL, // Status
0L // Information
);
return;
}
}
else{
WdfRequestCompleteWithInformation(
Request, // Request
status, // Status
0L // Information
);
}
}
else{
WdfRequestCompleteWithInformation(
Request, // Request
status, // Status
0L // Information
);
}
The problem is what I marked in red above- pRequestNext can never be NULL according to the analysis PreFast made [and I checked that conclusion myself, just to be diligent]. At first, I thought it was supposed to be *pRequestNext, which in and of itself would be an easy win for having paid attention. But if so, there's something else wrong, because this will cause the test to report a successful status back, and returning a NULL WDFREQUEST on a successful call is a violation of the DDI contract.
So is it safe to leave it be? The answer is "NO", because a NULL pointer to a request should have caused WdfIoQueueRetrieveNextRequest to fail, and the offending code has already verified a successful return code. Yes, having that pointer be NULL can cause the subsequent attempt to compare WDFREQUEST handles [the idea being, of course, that the requeued request is the one you just pulled off the queue] to access violate, but this is test code- you want it to tell you when things go wrong! As written, it will have the same problem- the test will have looked successful, even though an error occurred.
Now, as PreFast noted, this actually doesn't happen- but having the check there means that if someone unintentionally changes the test so it DOES make that pointer NULL, and the other DDI breaks, then an opportunity gets missed. We probably catch it in a test for that DDI, but what if some bizarre set of circumstances causes it to only fail in this particular sequence? OK, it's stretching, but so what? I deleted the code in red- if said circumstance occurs, we get a bugcheck under obviously broken and easily analyzed circumstances, probably picking up two bugs [one in the test code and one in the framework] at one go. That works for me.
Not done yet
Alas, discipline is important in being part of a huge group, and the Windows team is about as huge as it can get. Since this is a fix to a single file, I won't ask someone else to help further, but I now go and build the driver for all three computer platforms, in a mix of free and checked builds [on more complex changes, I simply build all six flavors, but this time I chose 3]. Why? Because "build breaks" cost- build line assets [including the people] go underutilized waiting for you to fix your mistake, people waiting for the build have to wait longer since many things change between builds, emails go to a bunch of people, and often people who shouldn't have to do so try to figure out how to fix your mistake. When time to deadlines is short, and the pressure is really on, management attention [rightly so, IMHO] can be brought to bear. So why only 3?
Because:
- the first builds affected will only affect the Windows Driver Frameworks team [meaning it can be fixed there before it spreads], currently about 15-20 people
- the fix is simple
- I need to get back to that new test code as soon as I can.
I haven't been the cause of a break in a very long time, so I think my tradeoffs are working, but I readily admit there is some risk involved in not doing all 6 [and also in not asking someone else to try the changes first- something we call a "buddy build"], each time I change something. Worse yet, I didn't make even a cursory test- if that goes badly, I'll pay the price next week, and I won't be alone. But perhaps part of being a seasoned veteran is knowing when such a choice is conscionable [or at least doing it well enough not to get caught doing it, which of course I've just blown].
One long day (Trouble Behind)
As it turns out, after investigating both issues, reading the day's emails, and a very small amount of new test code- it's off to a meeting. A couple of times each week, the Development, QA, and Program Management team meet for an hour to discuss current work [in this particular case, it's features for 1.9- what we can and can't do, when we can do them, that sort of boring stuff. Since I've usually been at work for 9-10 hours before the meeting even starts, I often miss them, but this time I didn't....
It wound up being well over 11 hours by the time I left, and [this is not typical for Microsoft, of course- it's a weird habit I picked up somewhere] I don't eat a meal during the workday, so I leave- tired, a bit hungry, slightly bummed that I didn't make half the progress I wanted to on the new test that day, and pleased as punch that there are tools like PFD and SDV around to keep me from having even worse days.