• #### We can't cut that; it's our last feature

Many years ago, I was asked to help a customer with a problem they were having. I don't remember the details, and they aren't important to the story anyway, but as I was investigating one of their crashes, I started to wonder why they were even doing it.

I expressed my concerns to the customer liaison. "Why are they writing this code in the first place? The performance will be terrible, and it'll never work exactly the way they want it to."

The customer liaison confided, "Yeah, I thought the same thing. But this is a feature they're adding to the next version of their product. The product is so far behind schedule, they've been cutting features like mad to get back on track. But they can't cut this feature. It's the last one left!"

• #### Controlling which devices will wake the computer out of sleep

I haven't experienced this problem, but I know of people who have. They'll put their laptop into suspend or standby mode, and after a few seconds, the laptop will spontaneously wake itself up. Someone gave me this tip that might (might) help you figure out what is wrong.

Open a command prompt and run the command

powercfg -devicequery wake_from_any

This lists all the hardware devices that are capable of waking up the computer from standby. But the operating system typically ignores most of them. To see the ones that are not being ignored, run the command

powercfg -devicequery wake_armed

This second list is typically much shorter. On my computer, it listed just the keyboard, the mouse, and the modem. (The modem? I never use that thing!)

You can disable each of these devices one by one until you find the one that is waking up the computer.

powercfg -devicedisablewake "device name"

(How is this different from unchecking "Allow this device to wake the computer" from the device properties in Device Manager? Beats me.)

Once you find the one that is causing problems, you can re-enable the others.

powercfg -deviceenablewake "device name"

I would start by disabling wake-up for the keyboard and mouse. Maybe the optical mouse is detecting tiny vibrations in your room. Or the device might simply be "chatty", generating activity even though you aren't touching it.

This may not solve your problem, but at least it's something you can try. I've never actually tried it myself, so who knows whether it works.

Exercise: Count how many disclaimers there are in this article, and predict how many people will ignore them.

• #### If you are trying to understand an error, you may want to look up the error code to see what it means instead of just shrugging

A customer had a debug trace log and needed some help interpreting it. The trace log was generated by an operating system component, but the details aren't important to the story.

I've attached the log file. I think the following may be part of the problem.

[07/17/2005:18:31:19] Creating process D:\Foo\bar\blaz.exe
[07/17/2005:18:31:19] CreateProcess failed with error 2


Any ideas?

Thanks,
Bob Smith
Senior Test Engineer
Tailspin Toys

What struck me is that Bob is proud of the fact that he's a Senior Test Engineer, perhaps because it makes him think that we will take him more seriously because he has some awesome title.

But apparently a Senior Test Engineer doesn't know what error 2 is. There are some error codes that you end up committing to memory because you run into them over and over. Error 32 is ERROR_SHARING_VIOLATION, error 3 is ERROR_PATH_NOT_FOUND, and in this case, error 2 is ERROR_FILE_NOT_FOUND.

And even if Bob didn't have error 2 memorized, he should have known to look it up.

Error 2 is ERROR_FILE_NOT_FOUND. Does the file D:\Foo\bar\blaz.exe exist?

No, it doesn't.

-Bob

Bob seems to have shut off his brain and decided to treat troubleshooting not as a collaborative effort but rather as a game of Twenty Questions in which the person with the problem volunteers as little information as possible in order to make things more challenging. I had to give Bob a nudge.

Can you think of a reason why the system would be looking at D:\Foo\bar\blaz.exe? Where did you expect it to be looking for blaz.exe?

This managed to wake Bob out of his stupor, and the investigation continued. (And no, I don't remember what the final resolution was. I didn't realize I would have to remember the fine details of this support incident three years later.)
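Looking up an error code doesn't have to be elaborate. Here is a minimal sketch of a lookup for the three codes mentioned above; the numeric values are the ones given in the article, but the function name DescribeWin32Error is made up for illustration and is not part of any Windows API.

```cpp
#include <string>

// Names for the handful of Win32 error codes mentioned in this article.
// DescribeWin32Error is a hypothetical helper, not a real API.
std::string DescribeWin32Error(unsigned long code)
{
    switch (code) {
    case 2:  return "ERROR_FILE_NOT_FOUND";
    case 3:  return "ERROR_PATH_NOT_FOUND";
    case 32: return "ERROR_SHARING_VIOLATION";
    default: return "unknown (look it up in winerror.h)";
    }
}
```

On a Windows machine, running net helpmsg 2 from a command prompt should get you the message text for the code without writing any code at all.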

• #### Quick overview of how processes exit on Windows XP

Exiting is one of the scariest moments in the lifetime of a process. (Sort of how landing is one of the scariest moments of air travel.)

Many of the details of how processes exit are left unspecified in Win32, so different Win32 implementations can follow different mechanisms. For example, Win32s, Windows 95, and Windows NT all shut down processes differently. (I wouldn't be surprised if Windows CE uses yet another different mechanism.) Therefore, bear in mind that what I write in this mini-series is implementation detail and can change at any time without warning. I'm writing about it because these details can highlight bugs lurking in your code. In particular, I'm going to discuss the way processes exit on Windows XP.

I should say up front that I do not agree with many steps in the way processes exit on Windows XP. The purpose of this mini-series is not to justify the way processes exit but merely to fill you in on some of the behind-the-scenes activities so you are better-armed when you have to investigate a mysterious crash or hang during exit. (Note that I just refer to it as the way processes exit on Windows XP rather than saying that it is how process exit is designed. As one of my colleagues put it, "Using the word design to describe this is like using the term swimming pool to refer to a puddle in your garden.")

When your program calls ExitProcess a whole lot of machinery springs into action. First, all the threads in the process (except the one calling ExitProcess) are forcibly terminated. This dates back to the old-fashioned theory on how processes should exit: Under the old-fashioned theory, when your process decides that it's time to exit, it should already have cleaned up all its threads. The termination of threads, therefore, is just a safety net to catch the stuff you may have missed. It doesn't even wait two seconds first.

Now, we're not talking happy termination like ExitThread; that's not possible since the thread could be in the middle of doing something. Injecting a call to ExitThread would result in DLL_THREAD_DETACH notifications being sent at times the thread was not prepared for. Nope, these threads are terminated in the style of TerminateThread: Just yank the rug out from under it. Buh-bye. This is an ex-thread.

Well, that was a pretty drastic move, now, wasn't it. And all this after the scary warnings in MSDN that TerminateThread is a bad function that should be avoided!

Wait, it gets worse.

Some of those threads that got forcibly terminated may have owned critical sections, mutexes, home-grown synchronization primitives (such as spin-locks), all those things that the one remaining thread might need access to during its DLL_PROCESS_DETACH handling. Well, mutexes are sort of covered; if you try to enter that mutex, you'll get the mysterious WAIT_ABANDONED return code which tells you that "Uh-oh, things are kind of messed up."

What about critical sections? There is no "Uh-oh" return value for critical sections; EnterCriticalSection doesn't have a return value. Instead, the kernel just says "Open season on critical sections!" I get the mental image of all the gates in a parking garage just opening up and letting anybody in and out. [See correction.]

As for the home-grown stuff, well, you're on your own.

This means that if your code happened to have owned a critical section at the time somebody called ExitProcess, the data structure the critical section is protecting has a good chance of being in an inconsistent state. (After all, if it were consistent, you probably would have exited the critical section! Well, assuming you entered the critical section because you were updating the structure as opposed to reading it.) Your DLL_PROCESS_DETACH code runs, enters the critical section, and it succeeds because "all the gates are up". Now your DLL_PROCESS_DETACH code starts behaving erratically because the values in that data structure are inconsistent.

Oh dear, now you have a pretty ugly mess on your hands.

And if your thread was terminated while it owned a spin-lock or some other home-grown synchronization object, your DLL_PROCESS_DETACH will most likely simply hang indefinitely waiting patiently for that terminated thread to release the spin-lock (which it never will do).
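To make the spin-lock case concrete, here is a toy sketch in portable C++. The SpinLock class and the helper function are invented for this illustration. The sketch does not terminate a real thread; it fakes the effect by taking the lock and never releasing it. The spin limit exists only so that the demonstration finishes; a typical home-grown spin lock has no such limit, which is why the real thing hangs.

```cpp
#include <atomic>

// A toy home-grown spin lock, for illustration only.
class SpinLock {
public:
    // Try to take the lock, giving up after maxSpins attempts.
    // A spin lock without this limit would spin forever on an
    // abandoned lock.
    bool TryAcquire(int maxSpins)
    {
        for (int i = 0; i < maxSpins; ++i) {
            if (!held_.test_and_set(std::memory_order_acquire)) {
                return true;    // we now own the lock
            }
        }
        return false;           // still held by somebody else
    }

    void Release() { held_.clear(std::memory_order_release); }

private:
    std::atomic_flag held_ = ATOMIC_FLAG_INIT;
};

// Simulates an owner that took the lock and was then terminated:
// the lock is acquired and nobody is left to release it.
bool DetachCanGetLockAfterOwnerVanished()
{
    SpinLock lock;
    lock.TryAcquire(1);             // the "terminated" owner
    return lock.TryAcquire(1000);   // the surviving thread's attempt
}
```

The surviving thread's attempt fails no matter how many times it spins, because the only code that would have called Release is gone.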

But wait, it gets worse. That critical section might have been the one that protects the process heap! If one of the threads that got terminated happened to be in the middle of a heap function like HeapAlloc or LocalFree, then the process heap may very well be inconsistent. If your DLL_PROCESS_DETACH tries to allocate or free memory, it may crash due to a corrupted heap.

Moral of the story: If you're getting a DLL_PROCESS_DETACH due to process termination,† don't try anything clever. Just return without doing anything and let the normal process clean-up happen. The kernel will close all your open handles to kernel objects. Any memory you allocated will be freed automatically when the process's address space is torn down. Just let the process die a quiet death.

Note that if you were a good boy and cleaned up all the threads in the process before calling ExitProcess, then you've escaped all this craziness, since there is nothing to clean up.

Note also that if you're getting a DLL_PROCESS_DETACH due to dynamic unloading, then you do need to clean up your kernel objects and allocated memory because the process is going to continue running. But on the other hand, in the case of dynamic unloading, no other threads should be executing code in your DLL anyway (since you're about to be unloaded), so—assuming you coded up your DLL correctly—none of your critical sections should be held and your data structures should be consistent.
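The two flavors of DLL_PROCESS_DETACH can be told apart. As I understand the DllMain documentation, the lpvReserved parameter is non-null when the process is terminating and null when the DLL is being unloaded dynamically. The sketch below models only that decision in portable C++ so that it compiles anywhere; DetachAction and ChooseDetachAction are names invented for this illustration, and a real DllMain would make the same test on its own lpvReserved parameter.

```cpp
// What a DLL might decide to do when it receives DLL_PROCESS_DETACH.
enum class DetachAction { DoNothing, CleanUp };

// reserved stands in for DllMain's lpvReserved parameter:
// non-null means the process is terminating,
// null means the DLL is being unloaded dynamically.
DetachAction ChooseDetachAction(const void* reserved)
{
    if (reserved != nullptr) {
        // Process termination: other threads may have been terminated
        // mid-operation, so locks and the heap cannot be trusted.
        return DetachAction::DoNothing;
    }
    // Dynamic unload: the process keeps running, so release resources.
    return DetachAction::CleanUp;
}
```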

Hang on, this disaster isn't over yet. Even though the kernel went around terminating all but one thread in the process, that doesn't mean that the creation of new threads is blocked. If somebody calls CreateThread in their DLL_PROCESS_DETACH (as crazy as it sounds), the thread will indeed be created and start running! But remember, "all the gates are up", so your critical sections are just window dressing to make you feel good.

(The ability to create threads after process termination has begun is not a mistake; it's intentional and necessary. Thread injection is how the debugger breaks into a process. If thread injection were not permitted, you wouldn't be able to debug process termination!)

Next time, we'll see how the way process termination takes place on Windows XP caused not one but two problems.

Footnotes

• #### We've traced the call and it's coming from inside the house: A function call that always fails

A customer reported that they had a problem with a particular function added in Windows 7. The tricky bit was that the function was used only on very high-end hardware, not the sort of thing your average developer has lying around.

GROUP_AFFINITY GroupAffinity;
... code that initializes the GroupAffinity structure ...
if (... call to the function, passing &GroupAffinity ...);
{
 return FALSE;
}


The customer reported that the function always failed with error 122 (ERROR_INSUFFICIENT_BUFFER) even though the buffer seems perfectly valid.

Since most of us don't have machines with more than 64 processors, we couldn't run the code on our own machines to see what happens. People asked some clarifying questions, like whether this code is compiled 32-bit or 64-bit (thinking that maybe there is an issue with the emulation layer), until somebody noticed that there was a stray semicolon at the end of the if statement.

The customer was naturally embarrassed, but was gracious enough to admit that, yup, removing the semicolon fixed the problem.
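If you haven't been bitten by this one yet, here is a minimal sketch of the bug in portable C++. CallThatSucceeds is a hypothetical stand-in for the real API call, and the wrapper names are invented for this illustration.

```cpp
// Hypothetical stand-in for an API call that reports success.
bool CallThatSucceeds() { return true; }

bool BuggyWrapper()
{
    if (!CallThatSucceeds());   // stray semicolon: the if controls an empty statement
    {
        // This block is no longer governed by the if, so it always runs.
        return false;
    }
    return true;                // never reached
}

bool FixedWrapper()
{
    if (!CallThatSucceeds())
    {
        return false;
    }
    return true;
}
```

BuggyWrapper reports failure even though the call succeeded. Any error code printed inside that block would presumably be left over from some earlier, unrelated call. Many compilers can warn about an empty if body, provided the warning is turned on.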

This reminds me of an incident many years ago. I was having a horrible time debugging a simple loop. It looked like the compiler was on drugs and was simply ignoring my loop conditions and always dropping out of the loop. At wit's end, I asked a colleague to come to my office and serve as a second set of eyes. I talked him through the code as I single-stepped:

"Okay, so we set up the loop here..."

NODE pn = GetActiveNode();


"And we enter the loop, continuing while the node still needs processing."

if (pn->NeedsProcessing())
{


"Okay, we entered the loop. Now we realign the skew rods on the node."

 pn->RealignSkewRods();


"If the treadle is splayed, we need to calibrate the node against it."

 if (IsSplayed()) pn->Recalibrate(this);


"And then we loop back to see if there is more work to be done on this node."

}


"But look, even though the node needs processing «view node members», we don't loop back. We just drop out of the loop. What's going on?"

Um, that's an if statement up there, not a while statement.

A moment of silence while I process this piece of information.

"All right then, sorry to bother you, hey, how about that sporting event last night, huh?"

• #### What does an invalid handle exception in LeaveCriticalSection mean?

Internally, a critical section is a bunch of counters and flags, and possibly an event. (Note that the internal structure of a critical section is subject to change at any time—in fact, it changed between Windows XP and Windows 2003. The information provided here is therefore intended for troubleshooting and debugging purposes and not for production use.) As long as there is no contention, the counters and flags are sufficient because nobody has had to wait for the critical section (and therefore nobody had to be woken up when the critical section became available).

If a thread needs to be blocked because the critical section it wants is already owned by another thread, the kernel creates an event for the critical section (if there isn't one already) and waits on it. When the owner of the critical section finally releases it, the event is signaled, thereby alerting all the waiters that the critical section is now available and they should try to enter it again. (If there is more than one waiter, then only one will actually enter the critical section and the others will return to the wait loop.)

If you get an invalid handle exception in LeaveCriticalSection, it means that the critical section code thought that there were other threads waiting for the critical section to become available, so it tried to signal the event, but the event handle was no good.

Now you get to use your brain to come up with reasons why this might be.

One possibility is that the critical section has been corrupted, and the memory that normally holds the event handle has been overwritten with some other value that happens not to be a valid handle.

Another possibility is that some other piece of code passed an uninitialized variable to the CloseHandle function and ended up closing the critical section's handle by mistake. This can also happen if some other piece of code has a double-close bug, and the handle (now closed) just happened to be reused as the critical section's event handle. When the buggy code closes the handle the second time by mistake, it ends up closing the critical section's handle instead.
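The double-close scenario is easier to see with a toy model. The handle table below is invented for this illustration and reuses a slot as soon as it is freed. It is not a description of how the kernel actually allocates handle values, but it shows how a stale handle value can end up referring to somebody else's object.

```cpp
#include <vector>

// Toy handle table: a handle is an index, and freed slots are reused.
class ToyHandleTable {
public:
    int Open()
    {
        for (int i = 0; i < static_cast<int>(open_.size()); ++i) {
            if (!open_[i]) { open_[i] = true; return i; }
        }
        open_.push_back(true);
        return static_cast<int>(open_.size()) - 1;
    }

    // Returns false if the handle was not open.
    bool Close(int h)
    {
        if (!IsOpen(h)) return false;
        open_[h] = false;
        return true;
    }

    bool IsOpen(int h) const
    {
        return h >= 0 && h < static_cast<int>(open_.size()) && open_[h];
    }

private:
    std::vector<bool> open_;
};

// Returns true if the "event" handle is still valid after the buggy code runs.
bool EventSurvivesDoubleClose()
{
    ToyHandleTable table;
    int file = table.Open();    // buggy code opens a handle...
    table.Close(file);          // ...and closes it (the legitimate close)
    int event = table.Open();   // the critical section's event reuses the value
    table.Close(file);          // the second close hits the event instead
    return table.IsOpen(event);
}
```

In this model the second Close succeeds, which is exactly the problem: the buggy code sees nothing wrong, and the victim finds out later when it tries to signal the event.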

Of course, the problem might be that the critical section is not valid because it was never initialized in the first place. The values in the fields are just uninitialized garbage, and when you try to leave this uninitialized critical section, that garbage gets used as an event handle, raising the invalid handle exception.

Then again, the problem might be that the critical section is not valid because it has already been destroyed. For example, one thread might have code that goes like this:

EnterCriticalSection(&cs);
... do stuff...
LeaveCriticalSection(&cs);


While that thread is busy doing stuff, another thread calls DeleteCriticalSection(&cs). This destroys the critical section while another thread was still using it. Eventually that thread finishes doing its stuff and calls LeaveCriticalSection, which raises the invalid handle exception because the DeleteCriticalSection already closed the handle.

All of these are possible reasons for an invalid handle exception in LeaveCriticalSection. To determine which one you're running into will require more debugging, but at least now you know what to be looking for.

Postscript: One of my colleagues from the kernel team points out that the Locks and Handles checks in Application Verifier are great for debugging issues like this.

• #### What's the difference between the COM and EXE extensions?

Commenter Koro asks why you can rename a COM file to EXE without any apparent ill effects. (James Mastros asked a similar question, though there are additional issues in James' question which I will take up at a later date.)

Initially, the only programs that existed were COM files. The format of a COM file is... um, none. There is no format. A COM file is just a memory image. This "format" was inherited from CP/M. To load a COM file, the program loader merely sucked the file into memory unchanged and then jumped to the first byte. No fixups, no checksum, nothing. Just load and go.

The COM file format had many problems, among which was that programs could not be bigger than about 64KB. To address these limitations, the EXE file format was introduced. The header of an EXE file begins with the magic letters "MZ" and continues with other information that the program loader uses to load the program into memory and prepare it for execution.

And there things lay, with COM files being "raw memory images" and EXE files being "structured", and the distinction was rigidly maintained. If you renamed an EXE file to COM, the operating system would try to execute the header as if it were machine code (which didn't get you very far), and conversely if you renamed a COM file to EXE, the program loader would reject it because the magic MZ header was missing.

So why did the program loader change to ignore the extension entirely and just use the presence or absence of an MZ header to determine what type of program it is? Compatibility, of course.

Over time, programs like FORMAT.COM, EDIT.COM, and even COMMAND.COM grew larger than about 64KB. Under the original rules, that meant that the extension had to be changed to EXE, but doing so introduced a compatibility problem. After all, since the files had been COM files up until then, programs or batch files that wanted to, say, spawn a command interpreter, would try to execute COMMAND.COM. If the command interpreter were renamed to COMMAND.EXE, these programs which hard-coded the program name would stop working since there was no COMMAND.COM any more.

Making the program loader more flexible meant that these "well-known programs" could retain their COM extension while no longer being constrained by the "It all must fit into 64KB" limitation of COM files.
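The content-based check amounts to peeking at the first two bytes. Here is a rough sketch, assuming the file contents are already in memory; ImageKind and ClassifyImage are names invented for this illustration, and the actual loader code certainly looked nothing like this.

```cpp
#include <vector>

enum class ImageKind { Com, Exe };

// Decide how to load a program from its contents rather than its extension:
// an image that begins with the bytes 'M' (0x4D) and 'Z' (0x5A) is treated
// as a structured EXE, and anything else as a raw COM memory image.
ImageKind ClassifyImage(const std::vector<unsigned char>& bytes)
{
    if (bytes.size() >= 2 && bytes[0] == 0x4D && bytes[1] == 0x5A) {
        return ImageKind::Exe;
    }
    return ImageKind::Com;
}
```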

But wait, what if a COM program just happened to begin with the letters MZ? Fortunately, that never happened, because the machine code for "MZ" disassembles as follows:

0100 4D            DEC     BP
0101 5A            POP     DX


The first instruction decrements a register whose initial value is undefined, and the second instruction underflows the stack. No sane program would begin with two undefined operations.

• #### Do not overload the E_NOINTERFACE error

One of the more subtle ways people mess up IUnknown::QueryInterface is returning E_NOINTERFACE when the problem wasn't actually an unsupported interface. The E_NOINTERFACE return value has very specific meaning. Do not use it as your generic "gosh, something went wrong" error. (Use an appropriate error such as E_OUTOFMEMORY or E_ACCESSDENIED.)

Recall that the rules for IUnknown::QueryInterface are that (in the absence of catastrophic errors such as E_OUTOFMEMORY) if a request for a particular interface succeeds, then it must always succeed in the future for that object. Similarly, if a request fails with E_NOINTERFACE, then it must always fail in the future for that object.

These rules exist for a reason.

In the case where COM needs to create a proxy for your object (for example, to marshal the object into a different apartment), the COM infrastructure does a lot of interface caching (and negative caching) for performance reasons. For example, if a request for an interface fails, COM remembers this so that future requests for that interface are failed immediately rather than being marshalled to the original object only to have the request fail anyway. Requests for unsupported interfaces are very common in COM, and optimizing that case yields significant performance improvements.

If you start returning E_NOINTERFACE for problems other than "The object doesn't support this interface", COM will assume that the object really doesn't support the interface and may not ask for it again even if you do. This in turn leads to very strange bugs that defy debugging: You are at a call to IUnknown::QueryInterface, you set a breakpoint on your object's implementation of IUnknown::QueryInterface to see what the problem is, you step over the call and get E_NOINTERFACE back without your breakpoint ever hitting. Why? Because at some point in the past, you said you didn't support the interface, and COM remembered this and "saved you the trouble" of having to respond to a question you already answered. The COM folks tell me that they and their comrades in product support end up spending hours debugging customers' problems like "When my computer is under load, sometimes I start getting E_NOINTERFACE for interfaces I definitely support."

Save yourself and the COM folks several hours of frustration. Don't return E_NOINTERFACE unless you really mean it.
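Here is a toy model of the failure, written in portable C++ so it can run anywhere. ToyObject and ToyProxy are invented for this illustration and use strings in place of interface IDs. The real COM proxy machinery is far more involved, and this sketch assumes only the negative-caching behavior described above.

```cpp
#include <cstdint>
#include <set>
#include <string>

using HResult = std::int32_t;
constexpr HResult kOk          = 0;                                 // S_OK
constexpr HResult kNoInterface = static_cast<HResult>(0x80004002);  // E_NOINTERFACE
constexpr HResult kOutOfMemory = static_cast<HResult>(0x8007000E);  // E_OUTOFMEMORY

// Toy object that supports "IFoo" but fails while it is under load.
struct ToyObject {
    bool underLoad = false;
    bool misuseNoInterface = false;  // report load failures as E_NOINTERFACE?

    HResult QueryInterface(const std::string& iid)
    {
        if (iid != "IFoo") return kNoInterface;  // genuinely unsupported
        if (underLoad) return misuseNoInterface ? kNoInterface : kOutOfMemory;
        return kOk;
    }
};

// Toy proxy that remembers E_NOINTERFACE answers and never asks again.
class ToyProxy {
public:
    explicit ToyProxy(ToyObject& target) : target_(target) {}

    HResult QueryInterface(const std::string& iid)
    {
        if (unsupported_.count(iid)) return kNoInterface;  // answered from cache
        HResult hr = target_.QueryInterface(iid);
        if (hr == kNoInterface) unsupported_.insert(iid);
        return hr;
    }

private:
    ToyObject& target_;
    std::set<std::string> unsupported_;
};

// Ask for IFoo once under load, then again after the load has passed,
// and return the second answer.
HResult AskAgainAfterLoad(bool misuseNoInterface)
{
    ToyObject obj;
    obj.misuseNoInterface = misuseNoInterface;
    ToyProxy proxy(obj);
    obj.underLoad = true;
    proxy.QueryInterface("IFoo");
    obj.underLoad = false;
    return proxy.QueryInterface("IFoo");
}
```

When the object misuses E_NOINTERFACE, the second request fails even though the object would now succeed, and the object's QueryInterface is never called. When it reports E_OUTOFMEMORY instead, the proxy asks again and gets the interface.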

• #### Everybody thinks about CLR objects the wrong way (well not everybody)

Many people responded to Everybody thinks about garbage collection the wrong way by proposing variations on auto-disposal based on scope.

What these people fail to recognize is that they are dealing with object references, not objects. (I'm restricting the discussion to reference types, naturally.) In C++, you can put an object in a local variable. In the CLR, you can only put an object reference in a local variable.

For those who think in terms of C++, imagine if it were impossible to declare instances of C++ classes as local variables on the stack. Instead, you had to declare a local variable that was a pointer to your C++ class, and put the object in the pointer.

C++ (no longer possible):

void Function(OtherClass o)
{
 // No longer possible to declare objects
 // with automatic storage duration
 Color c(0,0,0);
 Brush b(c);
 o.SetBackground(b);
}

C#:

void Function(OtherClass o)
{
 Color c = new Color(0,0,0);
 Brush b = new Brush(c);
 o.SetBackground(b);
}

C++:

void Function(OtherClass* o)
{
 Color* c = new Color(0,0,0);
 Brush* b = new Brush(c);
 o->SetBackground(b);
}

This world where you can only use pointers to refer to objects is the world of the CLR.

In the CLR, objects never go out of scope because objects don't have scope.¹ Object references have scope. Objects are alive from the point of construction to the point that the last reference goes out of scope or is otherwise destroyed.

If objects were auto-disposed when references went out of scope, you'd have all sorts of problems. I will use C++ notation instead of CLR notation to emphasize that we are working with references, not objects. (I can't use actual C++ references since you cannot change the referent of a C++ reference, something that is permitted by the CLR.)

C#:

void Function(OtherClass o)
{
 Color c = new Color(0,0,0);
 Brush b = new Brush(c);
 Brush b2 = b;
 o.SetBackground(b2);
}

C++:

void Function(OtherClass* o)
{
 Color* c = new Color(0,0,0);
 Brush* b = new Brush(c);
 Brush* b2 = b;
 o->SetBackground(b2);
 // automatic disposal when variables go out of scope
 dispose b2;
 dispose b;
 dispose c;
 dispose o;
}

Oops, we just double-disposed the Brush object and probably prematurely disposed the OtherClass object. Fortunately, disposal is idempotent, so the double-disposal is harmless (assuming you actually meant disposal and not destruction). The introduction of b2 was artificial in this example, but you can imagine b2 being, say, the leftover value in a variable at the end of a loop, in which case we just accidentally disposed the last object in an array.

Let's say there's some attribute you can put on a local variable or parameter to say that you don't want it auto-disposed on scope exit.

C#:

void Function([NoAutoDispose] OtherClass o)
{
 Color c = new Color(0,0,0);
 Brush b = new Brush(c);
 [NoAutoDispose] Brush b2 = b;
 o.SetBackground(b2);
}

C++:

void Function([NoAutoDispose] OtherClass* o)
{
 Color* c = new Color(0,0,0);
 Brush* b = new Brush(c);
 [NoAutoDispose] Brush* b2 = b;
 o->SetBackground(b2);
 // automatic disposal when variables go out of scope
 dispose b;
 dispose c;
}

Okay, that looks good. We disposed the Brush object exactly once and didn't prematurely dispose the OtherClass object that we received as a parameter. (Maybe we could make [NoAutoDispose] the default for parameters to save people a lot of typing.) We're good, right?

Let's do some trivial code cleanup, like inlining the Color parameter.

C#:

void Function([NoAutoDispose] OtherClass o)
{
 Brush b = new Brush(new Color(0,0,0));
 [NoAutoDispose] Brush b2 = b;
 o.SetBackground(b2);
}

C++:

void Function([NoAutoDispose] OtherClass* o)
{
 Brush* b = new Brush(new Color(0,0,0));
 [NoAutoDispose] Brush* b2 = b;
 o->SetBackground(b2);
 // automatic disposal when variables go out of scope
 dispose b;
}

Whoa, we just introduced a semantic change by what seemed like a harmless transformation: The Color object is no longer auto-disposed. This is even more insidious than the scope of a variable affecting its treatment by anonymous closures, because introducing temporary variables to break up a complex expression (or removing one-time temporary variables) is a common transformation that people expect to be harmless, especially since many language transformations are expressed in terms of temporary variables. Now you have to remember to tag all of your temporary variables with [NoAutoDispose].

Wait, we're not done yet. What does SetBackground do?

C#:

void OtherClass.SetBackground([NoAutoDispose] Brush b)
{
 this.background = b;
}

C++:

void OtherClass::SetBackground([NoAutoDispose] Brush* b)
{
 this->background = b;
}

Oops, there is still a reference to that Brush in the o.background member. We disposed an object while there were still outstanding references to it. Now when the OtherClass object tries to use the reference, it will find itself operating on a disposed object.

Working backward, this means that we should have put a [NoAutoDispose] attribute on the b variable. At this point, it's six of one, a half dozen of the other. Either you put using around all the things that you want auto-disposed or you put [NoAutoDispose] on all the things that you don't.²

The C++ solution to this problem is to use something like shared_ptr and reference-counted objects, with the assistance of weak_ptr to avoid reference cycles, and being very selective about which objects are allocated with automatic storage duration. Sure, you could try to bring this model of programming to the CLR, but now you're just trying to pick all the cheese off your cheeseburger and intentionally going against the automatic memory management design principles of the CLR.
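For those who haven't seen it, here is roughly what the shared_ptr approach looks like for the running example. The Color, Brush, and OtherClass definitions are minimal stand-ins invented for this sketch, and BrushOutlivesFunction exists only to check the result.

```cpp
#include <memory>
#include <utility>

struct Color { int r, g, b; };

struct Brush {
    explicit Brush(Color c) : color(c) {}
    Color color;
};

struct OtherClass {
    // Taking and storing a shared_ptr makes OtherClass a co-owner of the Brush.
    void SetBackground(std::shared_ptr<Brush> b) { background = std::move(b); }
    std::shared_ptr<Brush> background;
};

void Function(OtherClass& o)
{
    auto b = std::make_shared<Brush>(Color{0, 0, 0});
    o.SetBackground(b);
}   // b goes out of scope here, but o.background keeps the Brush alive

// Returns true if the Brush survived the end of Function and
// o.background is now its only owner.
bool BrushOutlivesFunction()
{
    OtherClass o;
    Function(o);
    return o.background != nullptr && o.background.use_count() == 1;
}
```

The Brush is destroyed when the last shared_ptr to it goes away, not when any particular variable leaves scope. That is reference counting, with all the cycle problems that weak_ptr exists to work around.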

I was sort of assuming that since you're here for CLR Week, you're one of those people who actively chose to use the CLR and want to use it in the manner in which it was intended, rather than somebody who wants it to work like C++. If you want C++, you know where to find it.

Footnotes

¹ Or at least don't have scope in the sense we're discussing here.

² As for an attribute for specific classes to have auto-dispose behavior, that works only if all references to auto-dispose objects are in the context of a create/dispose pattern. References to auto-dispose objects outside of the create/dispose pattern would need to be tagged with the [NoAutoDispose] attribute.

[AutoDispose] class Stream { ... };

Stream MyClass.GetSaveStream()
{
[NoAutoDispose] Stream stm;
if (saveToFile) {
stm = ...;
} else {
stm = ...;
}
return stm;
}

void MyClass.Save()
{
// NB! do not combine into one line
Stream stm = GetSaveStream();
SaveToStream(stm);
}