Holy cow, I wrote a book!
Many years ago, I was asked to help a customer with a
problem they were having.
I don't remember the details, and they aren't important to the story anyway,
but as I was investigating one of their crashes,
I started to wonder why they were even doing it.
I expressed my concerns to the customer liaison.
"Why are they writing this code in the first place?
The performance will be terrible, and
it'll never work exactly the way they want it to."
The customer liaison confided,
"Yeah, I thought the same thing.
But this is a feature they're adding to the next version of their product.
The product is so far behind schedule,
they've been cutting features like mad
to get back on track.
But they can't cut this feature. It's the last one left!"
A customer reported that they had a problem with a particular function
added in Windows 7.
The tricky bit was that the function was used only on very high-end
machines with more than 64 processors,
not the sort of thing your average developer has lying around.
... code that initializes the GroupAffinity structure ...
if (!SetThreadGroupAffinity(hThread, &GroupAffinity, NULL));
    printf("SetThreadGroupAffinity failed: %d\n", GetLastError());
The customer reported that the function always failed
with error 122 (ERROR_INSUFFICIENT_BUFFER)
even though the buffer seemed perfectly valid.
Since most of us don't have machines with more than 64 processors,
we couldn't run the code on our own machines to see what happens.
People asked some clarifying questions,
like whether this code is compiled 32-bit or 64-bit
(thinking that maybe there is
an issue with the emulation layer),
until somebody noticed that there was a stray semicolon at the end
of the if statement.
The customer was naturally embarrassed, but was gracious enough to
admit that, yup, removing the semicolon fixed the problem.
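The failure mode is easy to reproduce in miniature. This hypothetical helper (the function name is made up for illustration) shows why the stray semicolon makes the error path run unconditionally:

```c
/* Illustration of the stray-semicolon bug: the semicolon after the
   "if" forms an empty statement that serves as the entire body, so
   the indented line below it is an ordinary statement that always
   executes, whether or not the call succeeded. */
int reports_failure(int call_succeeded)
{
    int reported = 0;
    if (!call_succeeded);   /* <-- stray semicolon */
        reported = 1;       /* runs unconditionally */
    return reported;
}
```

Most compilers will flag this pattern if you ask (for example, GCC's -Wempty-body), which is one more reason to build with warnings enabled.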
This reminds me of an incident many years ago.
I was having a horrible time debugging a simple loop.
It looked like the compiler was on drugs and was simply
ignoring my loop conditions and always dropping out of the loop.
At wit's end, I asked a colleague to come to my office and
serve as a second set of eyes.
I talked him through the code as I single-stepped:
"Okay, so we set up the loop here..."
NODE pn = GetActiveNode();
"And we enter the loop, continuing while the node still needs processing."
if (pn->NeedsProcessing()) {
"Okay, we entered the loop.
Now we realign the skew rods on the node."
"If the treadle is splayed, we need to calibrate the node against it."
if (IsSplayed()) pn->Recalibrate(this);
"And then we loop back to see if there is more work to be done
on this node."
}
"But look, even though the node still needs processing,
we don't loop back.
We just drop out of the loop.
What's going on?"
— Um, that's an if statement up there,
not a while statement.
A moment of silence while I process this piece of information.
"All right then, sorry to bother you, hey,
how about that sporting event last night, huh?"
I haven't experienced this problem, but I know of people who have.
They'll put their laptop into suspend or standby mode, and after a few
seconds, the laptop will spontaneously wake itself up.
Someone gave me this tip
that might (might) help you figure out what is wrong.
Open a command prompt and run the command
powercfg -devicequery wake_from_any
This lists all the hardware devices that are capable of waking up
the computer from standby.
But the operating system typically ignores most of them.
To see the ones that are not being ignored, run the command
powercfg -devicequery wake_armed
This second list is typically much shorter.
On my computer, it listed just the keyboard, the mouse, and the modem.
(The modem? I never use that thing!)
You can disable each of these devices one by one until you find the
one that is waking up the computer.
powercfg -devicedisablewake "device name"
(How is this different from unchecking
Allow this device to wake the computer
from the device properties in Device Manager?)
Once you find the one that is causing problems, you can re-enable the others:
powercfg -deviceenablewake "device name"
I would start by disabling wake-up for the keyboard and mouse.
Maybe the optical mouse is detecting tiny vibrations in your room.
Or the device might simply be "chatty",
generating activity even though you aren't touching it.
This may not solve your problem, but at least it's something you can try.
I've never actually tried it myself, so who knows whether it works.
Exercise: Count how many disclaimers there are in this article,
and predict how many people will ignore them.
Internally, a critical section is a bunch of counters and flags,
and possibly an event.
(Note that the internal structure of a critical section is subject
to change at any time—in fact, it changed between
Windows XP and Windows 2003.
The information provided here is therefore intended for troubleshooting
and debugging purposes and not for production use.)
As long as there is no contention, the counters and flags are
sufficient because nobody has had to wait for the critical section
(and therefore nobody has had to be woken up when the critical section became available).
If a thread needs to be blocked because the critical section it wants
is already owned by another thread,
the kernel creates an event for the critical section
(if there isn't one already) and waits on it.
When the owner of the critical section finally releases it,
the event is signaled, thereby alerting all the waiters that the
critical section is now available and they should try to enter it.
(If there is more than one waiter, then only one will actually
enter the critical section and the others will go back to waiting.)
If you get an invalid handle exception in LeaveCriticalSection,
it means that the critical section code thought that there
were other threads waiting for the critical section to become
available, so it tried to signal the event, but the event handle
was no good.
Now you get to use your brain to come up with reasons why this might be.
One possibility is that the critical section has been corrupted,
and the memory that normally holds the event handle has been
overwritten with some other value that happens not to be a valid handle.
Another possibility is that some other piece of code passed
an uninitialized variable to the CloseHandle
function and ended up closing the critical section's event handle.
This can also happen if some other piece of code has a double-close
bug, and the handle (now closed) just happened to be reused as the
critical section's event handle.
When the buggy code closes the handle the second time by mistake,
it ends up closing the critical section's handle instead.
Of course, the problem might be that the critical section is not
valid because it was never initialized in the first place.
The values in the fields are just uninitialized garbage,
and when you try to leave this uninitialized critical section,
that garbage gets used as an event handle, raising the invalid handle exception.
Then again, the problem might be that the critical section is
not valid because it has already been destroyed.
For example, one thread might have code that goes like this:
EnterCriticalSection(&cs);
... do stuff...
LeaveCriticalSection(&cs);
While that thread is busy doing stuff,
another thread calls DeleteCriticalSection(&cs).
This destroys the critical section while another thread
was still using it.
Eventually the first thread finishes doing its stuff and calls
LeaveCriticalSection,
which raises the invalid handle exception because
DeleteCriticalSection already closed the handle.
All of these are possible reasons for an invalid handle
exception in LeaveCriticalSection.
To determine which one you're running into will require more
debugging, but at least now you know what to be looking for.
One of my colleagues from the kernel team points out that
the Locks and Handles checks in
Application Verifier are great
for debugging issues like this.
The RunAs program demands that you type the password manually.
Why doesn't it accept a password on the command line?
This was a conscious decision.
If it were possible to pass the password on the command line,
people would start embedding passwords into batch files
and logon scripts, which is laughably insecure.
In other words, the feature is missing to remove the temptation
to use the feature insecurely.
If this offends you and you want to be insecure and pass the password
on the command line anyway (for everyone to see in the command window
title bar), you can write your own program that calls
the CreateProcessWithLogonW function.
(I'm told that there is
a tool available for download
which domain administrators might find useful, though it
solves a slightly different problem.)
Exiting is one of the scariest moments in the
lifetime of a process.
(Sort of how landing is one of the scariest moments of air travel.)
Many of the details of how processes exit are left
unspecified in Win32, so different Win32 implementations can
follow different mechanisms.
Win32s, Windows 95, and
Windows NT all shut down processes differently.
(I wouldn't be surprised if Windows CE uses yet another mechanism.)
Therefore, bear in mind that what I write in this mini-series
is implementation detail and can change at any time without warning.
I'm writing about it because these details can highlight
bugs lurking in your code.
In particular, I'm going to discuss the way processes exit
on Windows XP.
I should say up front that I do not agree with many steps in the
way processes exit on Windows XP.
The purpose of this mini-series is not to justify the way processes
exit but merely to fill you in on some of the behind-the-scenes
activities so you are better-armed when you have to investigate into
a mysterious crash or hang during exit.
(Note that I just refer to it as the way processes exit
on Windows XP rather than saying that it is how process
exit is designed.
As one of my colleagues put it,
"Using the word design to describe this is like using
the term swimming pool to refer to a puddle in your garden.")
When your program calls ExitProcess a whole lot of
machinery springs into action.
First, all the threads in the process (except the one calling
ExitProcess) are forcibly terminated.
This dates back to the old-fashioned theory on how processes should exit.
Under the old-fashioned theory,
when your process decides that it's time to exit,
it should already have cleaned up all its threads.
The termination of threads, therefore, is just a safety
net to catch the stuff you may have missed.
It doesn't even wait two seconds first.
Now, we're not talking happy termination like ExitThread;
that's not possible since the thread could be in the middle of doing something.
Injecting a call to ExitThread would result in
DLL_THREAD_DETACH notifications being sent at times
the thread was not prepared for.
Nope, these threads are terminated in the style of TerminateThread:
Just yank the rug out from under it.
This is an ex-thread.
Well, that was a pretty drastic move, now, wasn't it.
And all this after the scary warnings in MSDN that
TerminateThread is a bad function that should never be called.
Wait, it gets worse.
Some of those threads that got forcibly terminated may have
owned critical sections, mutexes, home-grown synchronization
primitives (such as spin-locks), all those things
that the one remaining thread might need access to during its cleanup.
Well, mutexes are sort of covered; if you try to enter that
mutex, you'll get the mysterious
WAIT_ABANDONED return code
which tells you that "Uh-oh, things are kind of messed up."
What about critical sections?
There is no "Uh-oh" return value for critical sections;
EnterCriticalSection doesn't have a return value.
Instead, the kernel just says "Open season on critical sections!"
I get the mental image of all the gates in a parking garage just
opening up and letting anybody in and out.
As for the home-grown stuff, well, you're on your own.
This means that if your code happened to have owned a critical section
at the time somebody called ExitProcess,
the data structure the critical section is protecting has a good
chance of being in an inconsistent state.
(After all, if it were consistent, you probably would have exited
the critical section!
Well, assuming you entered the critical section because you were
updating the structure as opposed to reading it.)
Your DLL_PROCESS_DETACH code runs,
enters the critical section, and it
succeeds because "all the gates are up".
Now your DLL_PROCESS_DETACH code
starts behaving erratically because the values in that data
structure are inconsistent.
Oh dear, now you have a pretty ugly mess on your hands.
And if your thread was terminated while it owned a spin-lock
or some other home-grown synchronization object,
your DLL_PROCESS_DETACH will most likely simply
hang indefinitely waiting patiently for that terminated thread
to release the spin-lock (which it never will do).
But wait, it gets worse.
That critical section might have been the one that protects
the process heap!
If one of the threads that got terminated happened to be in
the middle of a heap function like HeapAllocate
or LocalFree, then the process heap may very
well be inconsistent.
If your DLL_PROCESS_DETACH tries to allocate or
free memory, it may crash due to a corrupted heap.
Moral of the story:
If you're getting a DLL_PROCESS_DETACH due to
process termination,† don't try anything clever.
Just return without doing anything
and let the normal process clean-up happen.
The kernel will close all your open handles to kernel objects.
Any memory you allocated will be freed automatically when
the process's address space is torn down.
Just let the process die a quiet death.
Note that if you were a good boy and cleaned up all the
threads in the process
before calling ExitProcess,
then you've escaped all this craziness, since
there is nothing to clean up.
Note also that if you're getting a DLL_PROCESS_DETACH
due to dynamic unloading, then you do need to clean up
your kernel objects and allocated memory
because the process is going to continue running.
But on the other hand,
in the case of dynamic unloading, no other threads should be
executing code in your DLL anyway (since you're about to be unloaded),
so—assuming you coded up your DLL correctly—none of your
critical sections should be held and
your data structures should be consistent.
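The two cases can be told apart inside DllMain: on DLL_PROCESS_DETACH, the lpReserved parameter is NULL for a dynamic unload and non-NULL for process termination. The helper below is hypothetical, and the DllMain usage is shown only as a comment since it is Windows-specific:

```c
#include <stddef.h>

/* Hypothetical helper capturing the rule above: on
   DLL_PROCESS_DETACH, lpReserved is non-NULL when the whole process
   is exiting and NULL on a dynamic unload (FreeLibrary). */
enum detach_kind { DETACH_PROCESS_EXIT, DETACH_DYNAMIC_UNLOAD };

enum detach_kind classify_detach(void *lpReserved)
{
    return lpReserved != NULL ? DETACH_PROCESS_EXIT
                              : DETACH_DYNAMIC_UNLOAD;
}

/* Sketch of usage inside a DllMain (Windows-only):

   case DLL_PROCESS_DETACH:
       if (classify_detach(lpvReserved) == DETACH_PROCESS_EXIT)
           return TRUE;          // process is dying: do nothing
       FreeMyResources();        // dynamic unload: clean up for real
       return TRUE;
*/
```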
Hang on, this disaster isn't over yet.
Even though the kernel went around terminating all
but one thread in the process,
that doesn't mean that the creation of new threads is blocked.
If somebody calls CreateThread in their
DLL_PROCESS_DETACH (as crazy as it sounds),
the thread will indeed be created and start running!
But remember, "all the gates are up", so your critical sections
are just window dressing to make you feel good.
(The ability to create threads after process termination has begun
is not a mistake; it's intentional and necessary.
Thread injection is how the debugger breaks into a process.
If thread injection were not permitted, you wouldn't be able
to debug process termination!)
Next time, we'll see how the
way process termination takes place on Windows XP
caused not one but two problems.
†Everybody reading this article
should already know how to determine whether
this is the case.
I'm assuming you're smart.
Don't disappoint me.
Commenter Koro asks
why you can rename a COM file to EXE without any apparent ill effects.
(James asked a similar question,
though there are additional issues in James' question
which I will take up at a later date.)
Initially, the only programs that existed were COM files.
The format of a COM file is... um, none.
There is no format.
A COM file is just a memory image.
This "format" was inherited from CP/M.
To load a COM file,
the program loader merely sucked the file into memory unchanged
and then jumped to the first byte.
No fixups, no checksum, nothing.
Just load and go.
The COM file format had many problems, among which was that
programs could not be bigger than about 64KB.
To address these limitations, the EXE file format was introduced.
The header of an EXE file begins with
the magic letters "MZ" and continues with other information that
the program loader uses to load the program into memory and prepare it for execution.
And there things lay, with COM files being "raw memory images"
and EXE files being "structured",
and the distinction was rigidly maintained.
If you renamed an EXE file to COM, the operating system would
try to execute the header as if it were machine code (which didn't
get you very far), and conversely if you renamed a COM file to EXE,
the program loader would reject it because the magic MZ header was missing.
So when did the program loader change to ignore the extension entirely
and just use the presence or absence of an MZ header to determine
what type of program it is?
Compatibility, of course.
Over time, programs like FORMAT.COM
and even COMMAND.COM grew larger than about 64KB.
Under the original rules, that meant that the extension
had to be changed to EXE,
but doing so introduced a compatibility problem.
After all, since the files had been COM files up until then,
programs or batch files that wanted to, say, spawn a command interpreter,
would try to execute COMMAND.COM.
If the command interpreter were renamed to COMMAND.EXE,
these programs which hard-coded the program name
would stop working since there was no
COMMAND.COM any more.
Making the program loader more flexible meant that these
"well-known programs" could retain their COM extension
while no longer being constrained by the "It all must fit into 64KB"
limitation of COM files.
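The relaxed rule can be sketched in a few lines (a hypothetical helper, not the actual loader code): the extension no longer matters; only the first two bytes of the image do.

```c
#include <stddef.h>

/* Sketch of the relaxed loader rule: decide EXE vs. COM based solely
   on whether the image begins with the "MZ" header, regardless of
   the file's extension. */
int looks_like_exe(const unsigned char *image, size_t size)
{
    return size >= 2 && image[0] == 'M' && image[1] == 'Z';
}
```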
But wait, what if a COM program just happened to begin with the bytes "MZ"?
Fortunately, that never happened, because the machine code for
"MZ" disassembles as follows:
0100 4D DEC BP
0101 5A POP DX
The first instruction decrements a register whose initial value
is undefined, and the second instruction underflows the stack.
No sane program would begin with two undefined operations.
Some time ago, I explained why a corrupted program sometimes results
in a "Program too big to fit in memory" error.
In response, Dog complained that while that may have been a reasonable
response back in the 1980's,
in today's world, there's plenty of memory around for the MS-DOS
emulator to add that extra
check and return a better error code.
Well yeah, but if you change the externally visible behavior,
then you've failed as an emulator.
The whole point of an emulator is to
mimic another world,
and any deviations from that other world can come back to bite you.
MS-DOS is perhaps one of the strongest examples of requiring absolute
unyielding backward compatibility.
Hundreds if not thousands of programs
scanned memory looking for specific byte sequences inside MS-DOS
so it could patch them or hunted around inside MS-DOS's internal state
variables so it could modify them.
If you move one thing out of place, those programs stop working.
MS-DOS is filled with chunks of "junk DNA",
code fragments which do nothing but waste space,
but which exist so that programs which go scanning through memory
looking for specific byte sequences will find them.
(This principle is not dead; there's even some junk DNA in Explorer.)
Given the extreme compatibility required for MS-DOS emulation,
I'm not surprised that the original error behavior was preserved.
There is certainly some program out there that stops working if
attempting to execute a COM-style image larger than 64KB returns
any error other than 8.
(Besides, if you wanted it to return some other error code,
you had precious few to choose from.)
One of the more subtle ways
people mess up IUnknown::QueryInterface
is returning E_NOINTERFACE when the problem wasn't actually an unsupported interface.
The E_NOINTERFACE return value
has a very specific meaning.
Do not use it as your generic "gosh, something went wrong" error.
(Use an appropriate error such as E_OUTOFMEMORY instead.)
Recall that the rules for IUnknown::QueryInterface
are that (in the absence of catastrophic errors such as E_OUTOFMEMORY)
if a request for a particular interface succeeds,
then it must always succeed in the future for that object.
Similarly, if a request fails with E_NOINTERFACE,
then it must always fail in the future for that object.
These rules exist for a reason.
In the case where COM needs to create a proxy for your object
(for example, to marshal the object into a different apartment),
the COM infrastructure does a lot of interface caching (and
negative caching) for performance reasons.
For example, if a request for an interface fails, COM remembers
this so that future requests for that interface are failed
immediately rather than being marshalled to the original object
only to have the request fail anyway.
Requests for unsupported interfaces are very common in COM,
and optimizing that case yields significant performance improvements.
If you start returning E_NOINTERFACE for problems
other than "The object doesn't support this interface",
COM will assume that the object really doesn't support the interface
and may not ask for it again even if you do.
This in turn leads to very strange bugs that defy debugging:
A call to QueryInterface fails with E_NOINTERFACE, so
you set a breakpoint on your object's implementation of
IUnknown::QueryInterface to see what the problem is,
you step over the call and get
E_NOINTERFACE back without your breakpoint ever being hit.
Why? Because at some point in the past, you said you didn't support
the interface, and COM remembered this and "saved you the trouble"
of having to respond to a question you already answered.
The COM folks tell me that they and their comrades in product support
end up spending hours debugging customers' problems like
"When my computer is under load, sometimes I start getting
E_NOINTERFACE for interfaces I definitely support."
Save yourself and the COM folks several hours of frustration.
Don't return E_NOINTERFACE
unless you really mean it.
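Here is a deliberately simplified, hypothetical sketch of the failure mode (plain C rather than real COM; the names query_object and query_via_proxy are made up). The object wrongly reports E_NOINTERFACE during a transient low-memory condition, and the proxy's negative cache then pins that answer even after the condition clears:

```c
#include <string.h>

#define S_OK          0x00000000u
#define E_NOINTERFACE 0x80004002u

int low_on_memory = 1;               /* transient condition */

/* Buggy object: uses E_NOINTERFACE as a generic error. */
unsigned query_object(const char *iid)
{
    if (strcmp(iid, "IWidget") != 0)
        return E_NOINTERFACE;        /* genuinely unsupported */
    if (low_on_memory)
        return E_NOINTERFACE;        /* BUG: should be E_OUTOFMEMORY */
    return S_OK;
}

/* Simplified stand-in for COM's proxy with a negative cache. */
int cached_nointerface = 0;
unsigned query_via_proxy(const char *iid)
{
    if (cached_nointerface)
        return E_NOINTERFACE;        /* answered from the cache;
                                        the object is never asked */
    unsigned hr = query_object(iid);
    if (hr == E_NOINTERFACE)
        cached_nointerface = 1;      /* remember the "no" forever */
    return hr;
}
```

Once the cache records the "no", the object's QueryInterface is never called again for that interface, which is exactly why the breakpoint in the debugging story above never gets hit.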
One of my colleagues pointed out that my web site is listed in
the references section of a formal document.
It scares me that I'm being used as formal documentation because that
is explicitly what this web site isn't.
I wrote back,
I really need to put a disclaimer on my web site.
FOR ENTERTAINMENT PURPOSES ONLY
Remember, this is a blog.
The opinions (and even some facts) expressed here are those of the author
and do not necessarily reflect those of Microsoft Corporation.
Nothing I write here creates an obligation on Microsoft
or establishes the company's official position on anything.
I am not a spokesperson.
I'm just this guy who strings people along in the hopes that they might
hear a funny story once in a while.
You'd think this was obvious,
but apparently there are people who think that
what I write has the weight of official Microsoft policy
and take my sentences apart as if they were legal documents
or who take my articles and declare them to be
official statements from Microsoft Corporation.