Why SDETS should be the most fastidious and paranoid coders in existence
Well, to follow up some more on my adventures in test code maintenance, I bring you a case study.
As I whittled away at the 100+ PFD warnings I mentioned in my previous article, I did my paranoid best to look for dubious code above and beyond what PFD was telling me about when I noticed this in one of our oldest test drivers:
NTSTATUS _stdcall
CTestDriver::SetupObjectHandleParameter(
IN ULONG eHandle,
IN OBJECT_TYPE ObjectType,
IN OUT PVOID * Handle
)
{
NTSTATUS status = STATUS_SUCCESS;
//
//setup Queue Handle parameter
//
switch(eHandle){
case HANDLE_INVALID_NULL:
*Handle = NULL;
break;
case HANDLE_INVALID_OBJECT_DELETED:
//
//we will treat this as HANDLE_INVALID_NULL since we can not delete the default queue
//
*Handle = NULL;
break;
case HANDLE_INVALID_WRONG_OBJECT_TYPE:
if(ObjectType != OBJECT_TYPE_DRIVER){
*Handle = (PVOID)WdfGetDriver();
}else{
*Handle = (PVOID)this->m_pFdo;
}
break;
case HANDLE_VALID:
switch(ObjectType)
{
case OBJECT_TYPE_DEVICE:
*Handle = (PVOID)this->m_pFdo;
break;
case OBJECT_TYPE_QUEUE:
*Handle = WdfDeviceGetDefaultQueue(this->m_pFdo);
break;
case OBJECT_TYPE_REQUEST:
status = WdfRequestCreate(NULL, NULL, (WDFREQUEST *)Handle);
break;
default:
*Handle = NULL;
}
break;
default:
DbgBreakPoint();
#pragma prefast(suppress:__WARNING_UNREACHABLE_CODE, "Always runs under debugger, so not unreachable, merely deliberately annoying")
*Handle = NULL;
status = STATUS_INVALID_PARAMETER;
}
The breakpoint and suppression pragma are my later addition [and I removed some invalid cases prior to this- in fact, it was the act of removing them that led me to even look at this code]. Why did I add them? Because this silently changes an invalid and unexpectedly out-of-range value without any notice or any indication that something unexpected and unwanted has happened. I wanted to make sure that if this EVER happened, someone was forced to take a look (and yes, this is why people use ASSERT of course).
Well, I finally got all my warnings cleaned up, and I built the dozens of drivers and assembled everything I needed so I could make a "dry run" of all the affected code under lab conditions. I hit that breakpoint about a half hour into the run. So I took a look, of course- that IS why I did this, after all [to force someone to look- I just also turned out to BE that someone].
As one would expect from even middlin'-to-fair practice, those constants are in a header file, the appropriate snippet being this:
typedef enum {
HANDLE_INVALID_NULL,
HANDLE_INVALID_OBJECT_DELETED,
TOTALLY_INVALID_AND_DO_NOT_USE_UNTIL_ALL_ARE_REMOVED_AND_THIS_CAN_BE_RECYCLED,
HANDLE_VALID,
HANDLE_INVALID_NOT_SET,
HANDLE_VALID_NULL,
HANDLE_INVALID_NOT_OPEN,
HANDLE_INVALID_WRONG_OBJECT_TYPE,
HANDLE_CALL_DDI_MULTIPLE_TIMES = 23
} HANDLE_STATES;
The long name was my addition- it was for the invalid case I was trying to prune from all those drivers [it passed -1 as an "invalid handle value"- fine practice in user mode, but in KMDF they [wisely, IMO] decided NULL was enough to guard against]. The oddball value at the end is there because of one of the items that factors into my tale [but I won't mention it again].
Soon after these early test drivers were developed, we off-shored the work of developing more of them to test the burgeoning KMDF DDI. They basically made a copy of what we had, and then began developing more, extending and refining our techniques. Now I'm about to describe an error of theirs, but if the original developers had been more paranoid, these folks could have known about this long before I found put about it! People make mistakes, especially during those early stages of development, period. They continue to make them in maintenance and addition of new features. This isn't a "look at this fool" story- it's a "this is good and intelligent people just being human" story- please don't misunderstand me!
To go back now to the present, a quick check in the debugger showed that "9" was the value for eHandle. You'll notice it doesn't match any of the enumerated values. So where did it come from? Well these tests are driven from scripts, and the original developers being fastidious used the same names and definitions in their script-based engine. [I told you these people were good!] So I found the test case used because it was readily available from the driver at the point this failed (drum roll, please):
var eQueue = HANDLE_INVALID_WRONG_OBJECT_TYPE;
var bStopComplete = POINTER_VALID;
var eContext = POINTER_VALID_NULL;
Now HANDLE_INVALID_WRONG_OBJECT_TYPE is defined, but its value is 7, not 9- so why the mismatch? Well, like I said, all the original files were forked off, and the offshore team maintained their copies. At first I diff'd the original header to see if I'd done something totally bad, like deleting values from the enumeration without maintaining the value of those that were left- after all a stream of 14+ hour days leaves me quite capable of such a move! But it wasn't me, this time- so I went to their version of that same header, and found:
typedef enum
{
HANDLE_INVALID_NULL = 0,
HANDLE_INVALID_OBJECT_DELETED = 1,
INVALID_VALUE_NOT_TO_BE_USED_1 = 2,
HANDLE_VALID = 3,
HANDLE_INVALID_NOT_SET = 4,
HANDLE_VALID_NULL = 5,
HANDLE_INVALID_NOT_OPEN = 6,
HANDLE_INVALID_REACQUIRE = 7,
HANDLE_INVALID_NOTACQUIRED = 8,
HANDLE_INVALID_WRONG_OBJECT_TYPE = 9,
Oops (and yes my names for the value I ripped out were inconsistent, but then maybe some consistent inconsistency is useful in doing this sort of maintenance activity, and besides, none of this is finished yet]!
I then went to the trouble of trolling through our configuration management system (mostly out of curiosity, as this bug isn't a killer in and of itself] to see when this breaking code change was made, and found it happened to have been made about 10 days after I walked out of New Employee Orientation and into my second SDET career at Microsoft [I probably couldn't even begin to use WTT {DTM to you} at that time]. This defect silently caused us to stop testing the effects of passing the wrong kind of object to the DDI tested in this driver [passing NULL instead, which as you can probably divine was already being done]. The test failed, but it was supposed to, so no visible change in outward results- just a silent loss of coverage of some paths in the product code for the last 3 years.
Umm, so what?
Granted- this is hardly the most important problem found on the planet this week (probably even in the last 10 seconds). But I think there's a point here for any serious SDET. Be paranoid! Check everything! Trust nothing! Before you punt that oddball case, at least consider using an ASSERT or some other mechanism to focus attention on questionable results or behaviors. If the code looks ratty as a result, and you'd never ship it, so what? Your purpose isn't to make pretty code- leave that to the SDEs- your task is to find bugs [and preferably in the product- but if you find your own with these practices, the odds of your doing the same for your product are that much improved]. I don't personally care much where the braces go in your code- but I sure care a lot about what you do between them!
Yes, someone inclined to point out errors and less-than-optimal practices can point out many more [like, oh, say- keeping two versions of the same header file about?] But it takes time to sort out improving what you've got, and there's the continual need to add new stuff [which is what I was doing when I began this little detour this week, and I'd better get back to it, and soon]. I'll add removing that superfluous header to the ever growing list of "things that ought to get done when there's finally time to do it"...
Enough of the soapbox rant for now. I'll go put on my hat and find some of the other curmudgeons [someday I'll have to take a photo of it perkily perched on top of the replica Spartan helmet from the Halo 3 Legendary edition [someone once told me it gave them visions of John-117 as Indiana Jones, bullwhip in hand] in my office]...
Odds and Ends
- Well, PFD eventually found well over 30 IRQL misusages by the times I was done- many of the pattern I showed in that last article, others more subtle- and some of those were more of a "risk of future maintenance trouble" issue- exactly the sort of maintenance error behind this bug. PFD- it isn't just for developers- your test code ought to be as good as theirs and preferably better [so when something breaks you've got all the information you need to say what's going on]. It also focused attention on dozens more problem areas in this code base. Being PFD clean doesn't guarantee your driver is rock-solid bulletproof, but it sure makes a nice foundation!
- Wei reminded me that you can combine the values when suppressing by separating them with white space. Figures- I'd tried commas and got into too much of a hurry to look up the basics of those pragmas. Also that the suppression is suppose to apply to a single source line (but I think stacking them as I did also works- can't say for sure).
- I've begun using some Analytical / tracking software on my blog to understand my readership better [nothing personally identifiable]. I see that I get a reasonable number of non-English speakers, so I'll see if I can rein in my vocabulary and use of idioms in the future.
- The web site I keep some of my blog content on (like my photograph) is undergoing maintenance for a couple of days. The text files for fixing KMDF installation failres are there too. They'll be back by Monday, May 12, 2008.
- No song lists- too busy to try to cpture thm this time [lucky you!].
- A good weekend to all!