Holy cow, I wrote a book!
During the development of Windows,
the User Research team
tried out an early build of some proposed changes on
volunteers from the general community.
During one of the tests,
they invited the volunteer to just play around with a particular feature,
to explore it the way they would at home.
The usability subject
scrolled around a bit,
admired the visuals,
selected a few things,
and then had an idea to try to customize what he saw.
He fiddled around a bit and quickly discovered the customization options.
To celebrate his success, he proudly announced in a sing-song sort of way,
"My pants are fancy!"
That clip of a happy usability study participant gleefully announcing
"My pants are fancy!"
tickled the team's funny bone,
and the phrase "My pants are fancy" became a catch phrase.
Back in the days before Internet-based software distribution,
heck back even before the Internet existed in a form resembling
what it is today,
one of the most important ways of keeping track of the consumer
software industry was to subscribe to the
Softsel Hot List,
a weekly poster of the top sellers in various categories.
Here is the Softsel Hot List for the week of December 22, 1986,
or at least an HTML reproduction of it.
The title at the top was inspired by a
space-agey font popular at the time,
but I am too lazy to figure out how to do that in HTML
so you'll have to use your imagination.
(If your imagination fails you, you can
use the photo from this page.)
Commenter Neil presumes that
Windows 286 and later simply fixed up the movable entry table with
jmp selector:offset instructions once and for all.
It could have, but it went one step further.
Recall that the point of the movable entry table is to provide
a fixed location that always refers to a specific function,
no matter where that function happens to be.
This was necessary because real mode has no memory manager.
But protected mode does have a memory manager.
Why not let the memory manager do the work?
That is, after all, its job.
In protected-mode 16-bit Windows, the movable entry
table was ignored.
When one piece of code needed to reference another piece of code,
it simply jumped to or called it
by its selector:offset.
(Exercise: Why didn't I use
call 1234:5678 as the sample address?)
When a segment is relocated in memory,
there is no stack walking to
patch up return addresses
to point to thunks,
no editing of the movable entry points
to point to the new location.
All that happens is that the base address in the
descriptor table entry for the selector is updated to
point to the new linear address of the segment.
And when a segment is discarded,
the descriptor table entry is marked not present,
so that any future reference to it will raise a
selector not present exception,
which the kernel handles by reloading the selector.
Things are a lot easier when you have a memory manager around.
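As a rough illustration of why the memory manager makes this easy, here is a toy model in C (all names and sizes are invented for illustration, not the real descriptor format): references go through a descriptor table, so relocating a segment means updating one table entry, and every selector:offset reference held elsewhere remains valid without any patching.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a descriptor table over a simulated linear memory.
 * A "selector" is just an index into descriptor_base[]. */
uint8_t linear_memory[4096];
uint32_t descriptor_base[16];   /* base linear address per selector */

/* Resolve a selector:offset reference through the table. */
uint8_t *seg_deref(unsigned selector, uint32_t offset)
{
    return &linear_memory[descriptor_base[selector] + offset];
}

/* "Relocate" a segment: copy the bytes and update only the table
 * entry.  Callers holding selector:offset references are unaffected. */
void seg_relocate(unsigned selector, uint32_t new_base, uint32_t size)
{
    memmove(&linear_memory[new_base],
            &linear_memory[descriptor_base[selector]], size);
    descriptor_base[selector] = new_base;
}
```

The real hardware does the base lookup on every memory access, which is exactly why no code patching is needed.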
A lot of the head-exploding engineering in real-mode Windows went into
doing all the work of
simulating a memory manager on a CPU that didn't have one!
After learning about the bad things that happened if you synchronized
your application's input queue with its debugger,
commenter kme wonders
how debugging worked in 16-bit Windows,
since 16-bit Windows didn't have asynchronous input.
In 16-bit Windows, all applications shared the same input queue,
which means you were permanently in the situation described
in the original article,
where the application and its debugger (and everything else)
shared an input queue and therefore would constantly deadlock.
The solution to UI deadlocks is to make sure the debugger doesn't
have any UI.
At the most basic level, the debugger communicated with the developer
through the serial port.
You connected a dumb terminal to the other end of the serial port.
Mine was a
Wyse 50 serial console terminal.
All your debugging happened on the terminal.
You could disassemble code,
inspect and modify registers and memory,
and even patch new code on the fly.
If you wanted to consult source code,
you needed to have a copy of it available somewhere else
(like on your other computer).
It was similar to using the cdb debugger,
where only a handful of basic commands were available.
Oh, and bp to set breakpoints.
Now, if you were clever, you could use a
terminal emulator program so you didn't need a dedicated
physical terminal to do your debugging.
You could connect the target computer to your development machine
and view the disassembly and the source code on the same screen.
But you weren't completely out of the woods,
because what did you use to debug your development machine if it crashed?
The dumb terminal, of course.¹
I did pretty much all my Windows 95 debugging this way.
If you didn't have two computers,
another solution was to use a debugger like CodeView.
CodeView avoided the UI deadlock problem by not using the GUI to
present its UI.
When you hit a breakpoint or otherwise halted execution
of your application, CodeView talked directly to the video driver
to save the first 4KB of video memory, then
switched into text mode to tell you what happened.
When you resumed execution, it restored the video memory,
then switched the video card back into graphics mode,
restored all the pixels it had captured,
then resumed execution as if nothing had happened.
(If you were debugging a graphics problem,
you could hit F3 to switch temporarily to graphics mode,
so you could see what was on the screen.)
If you were really fancy,
you could spring for a monochrome adapter,
either the original IBM one or the Hercules version,
and tell CodeView to use that adapter for its debugging UI.
That way, when you broke into the debugger,
you could still see what was on the screen!
We had multiple monitors before it was cool.
¹ Some people were crazy and cross-connected their
target and development machines.
This allowed them to use their target machine to debug their
development machine and vice versa.
But if your development machine crashed
while it was debugging the target machine,
then you were screwed.
Anon is interested in
why the FAT driver is called FASTFAT.SYS.
"Was there an earlier slower FAT driver?
What could you possibly get so wrong with a FAT implementation that
it needed to be chucked out?"
The old FAT driver probably had a boring name like, um, FAT.SYS.
At some point, somebody decided to write a newer, faster one,
so they called it FASTFAT.
And the name stuck.
As for what you could possibly get so wrong with a FAT implementation
that it needed to be improved:
Remember that circumstances change over time.
A design that works well under one set of conditions may start
to buckle when placed under alternate conditions.
It's not that the old implementation was wrong;
it's just that conditions have changed,
and the new implementation is better for the new conditions.
For example, back in the old days,
there were three versions of FAT: FAT8, FAT12, and FAT16.
For such small disks, simple algorithms work just fine.
In fact, they're preferable because a simple algorithm
is easy to get right and is easier to debug.
It also typically takes up a lot less space,
and memory was at a premium in the old days.
An O(n) algorithm is not a
big deal if n never gets very large and the
constants are small.
Since FAT16 capped out at 65535 clusters per drive,
there was a built-in limit on how big n could get.
If a typical directory has only a few dozen files in it,
then a linear scan is just fine.
It's natural to choose algorithms
that map directly to the on-disk data structures.
(Remember, data structures determine algorithms.)
FAT directories are just unsorted arrays of file names,
so a simple directory searching function would just
read through the directory one entry at a time until it
found the matching file.
Finding a free cluster is just a memory scan looking for a 0
in the allocation table.
Memory management was simple: Don't try.
Let the disk cache do it.
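A minimal sketch of those simple algorithms, using invented in-memory stand-ins rather than the real on-disk layouts: the directory search is a linear scan of an unsorted array, and free-cluster allocation is a scan for a 0 in the allocation table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Invented stand-in for an 8.3 directory entry (not the real format). */
struct dirent8_3 {
    char name[12];          /* empty string = unused slot */
    uint16_t first_cluster;
};

/* Linear scan of an unsorted directory array, one entry at a time. */
int dir_find(const struct dirent8_3 *dir, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(dir[i].name, name) == 0)
            return i;
    return -1;
}

/* Finding a free cluster is just a scan for a 0 in the allocation
 * table (0 marks a free cluster in FAT). */
int fat_find_free(const uint16_t *fat, int nclusters)
{
    for (int i = 2; i < nclusters; i++)  /* clusters 0 and 1 are reserved */
        if (fat[i] == 0)
            return i;
    return -1;                           /* disk full */
}
```

Both routines map one-for-one onto the on-disk structures, which is exactly the property the original driver was optimizing for.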
These simple algorithms worked fine until FAT32 showed up
and bumped n sky high.
But fortunately, by the time that happened, computers were also
faster and had more memory available,
so you had more room to be ambitious.
The big gains in FASTFAT came from algorithmic changes.
For example, the on-disk data structures are transformed
into more efficient in-memory data structures and cached.
The first time you look in a directory, you need to do
a linear search to collect all the file names,
but if you cache them in a faster data structure
(say, a hash table),
subsequent accesses to the directory become much faster.
And since computers now have more memory available,
you can afford to keep a cache of directory entries around,
as opposed to the old days where memory was tighter and
large caches were a big no-no.
(I wonder if any non-Microsoft FAT drivers do this sort of
optimization, or whether they just do the obvious thing and
use the disk data structures as memory data structures.)
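For illustration, the caching idea might look something like this hypothetical sketch — a fixed-size hash table with linear probing, built after the first linear scan. All names and sizes are invented; a real driver would also handle growth, eviction, and invalidation when the directory changes.

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS 128          /* power of two, larger than max entries */

/* Invented in-memory cache mapping file name -> directory index. */
struct dircache {
    const char *names[NBUCKETS];   /* NULL = empty bucket */
    int indexes[NBUCKETS];
};

static unsigned hash_name(const char *s)
{
    unsigned h = 5381;                 /* djb2-style string hash */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h & (NBUCKETS - 1);
}

/* Called once per entry during the first (and only) linear scan. */
void cache_insert(struct dircache *c, const char *name, int index)
{
    unsigned b = hash_name(name);
    while (c->names[b] != NULL)        /* linear probing on collision */
        b = (b + 1) & (NBUCKETS - 1);
    c->names[b] = name;
    c->indexes[b] = index;
}

/* Subsequent lookups avoid rescanning the on-disk directory. */
int cache_lookup(const struct dircache *c, const char *name)
{
    unsigned b = hash_name(name);
    while (c->names[b] != NULL) {
        if (strcmp(c->names[b], name) == 0)
            return c->indexes[b];
        b = (b + 1) & (NBUCKETS - 1);
    }
    return -1;                         /* not in this directory */
}
```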
The original FAT driver
was very good at solving the problem it was given,
while staying within the limitations it was forced to operate under.
It's just that over time, the problem changed,
and the old solutions didn't hold up well any more.
I guess it's a matter of interpretation whether this means that
the old driver was "so wrong."
If your child outgrows his toddler bed,
does that mean the toddler bed was a horrible mistake?
I noted some time ago that
there is a 64KB no-man's land
near the 2GB boundary
to accommodate a quirk of the Alpha AXP processor architecture.
But that's not the only reason why it's there.
The no-man's land near the 2GB
boundary is useful even on x86 processors
because it simplifies parameter validation at the boundary between
user mode and kernel mode by taking out a special case.
If the 64KB zone did not exist,
then somebody could pass a buffer that straddles the 2GB boundary,
and the kernel mode validation layer
would have to detect that unusual condition and reject the buffer.
By having a guaranteed invalid region,
the kernel mode buffer validation code can simply validate that
the starting address is below the 2GB boundary,
then walk through the buffer checking each page.
If somebody tries to straddle the boundary,
the validation code will hit the permanently-invalid region and reject the buffer.
Yes, this sounds like a micro-optimization,
but I suspect this was not so much for optimization purposes
as it was to remove weird boundary conditions,
because weird boundary conditions are where the bugs tend to be.
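Here is a minimal sketch of the simplified validation logic, with illustrative constants (4KB pages, a 64KB invalid zone just below 2GB) and a toy page-validity check: the only boundary test is that the start lies below 2GB, and a straddling buffer fails naturally when the page walk reaches the permanently invalid region.

```c
#include <assert.h>
#include <stdint.h>

#define USER_LIMIT   0x80000000u             /* the 2GB boundary */
#define NOMANS_START (USER_LIMIT - 0x10000)  /* 64KB no-man's land */
#define PAGE_SIZE    0x1000u                 /* 4KB pages */

/* Toy model: everything below the no-man's land counts as valid. */
static int page_is_valid(uint32_t page)
{
    return page < NOMANS_START;
}

/* Returns 1 if the whole buffer is usable, 0 otherwise.  There is no
 * special straddle check: the start-below-2GB test plus the page walk
 * covers every case, because the no-man's land always rejects a
 * buffer that tries to cross the boundary. */
int validate_buffer(uint32_t start, uint32_t length)
{
    if (start >= USER_LIMIT)
        return 0;
    for (uint32_t p = start & ~(PAGE_SIZE - 1);
         p < start + length; p += PAGE_SIZE)
        if (!page_is_valid(p))
            return 0;                        /* hit the invalid region */
    return 1;
}
```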
(Obviously, the no-man's land moves if you set the /3GB switch.)
The default base address for a DLL is 0x10000000,
but the default base address for an EXE is 0x00400000.
Why that particular value for EXEs?
What's so special about 0x00400000?
It has to do with the amount of address space
mapped by a single page directory entry on an x86
and a design decision made in 1987.
The only technical requirement for
the base address of an EXE is that it be
a multiple of 64KB.
But some choices for base address are better than others.
The goal in choosing a base address is to minimize
the likelihood that modules will have to be relocated.
This means not colliding with things already in the
address space (which will force you to relocate)
as well as not colliding with things
that may arrive in the address space later
(forcing them to relocate).
For an executable, the not colliding with things that may
arrive later part means avoiding the region of the
address space that tends to fill with DLLs.
Since the operating system itself puts DLLs at high addresses,
and the default base address for non-operating-system DLLs
is 0x10000000,
this means that the base address for the executable should be
somewhere below 0x10000000,
and the lower you go, the more room you have before you start
colliding with DLLs.
But how low can you go?
The first part means that
you also want to avoid the things that
are already there.
Windows NT didn't have a lot of stuff at low addresses.
The only thing that was already there was a PAGE_NOACCESS
page mapped at zero in order to catch null pointer accesses.
Therefore, on Windows NT, you could base your executable at
0x00010000, and many applications did just that.
But on Windows 95,
there was a lot of stuff already there.
The Windows 95 virtual machine manager
permanently maps the first
64KB of physical
memory to the first 64KB of virtual memory
in order to avoid a CPU erratum.
(Windows 95 had to work around a lot of CPU bugs.)
Furthermore, the entire first megabyte of virtual
address space is mapped to the logical address space
of the active virtual machine.
(Okay, actually a little more than a megabyte.)
This mapping behavior is required by the x86 processor's
virtual-8086 mode.
Windows 95, like its predecessor Windows 3.1,
runs Windows in a special
virtual machine (known as the System VM),
and for compatibility it still routes all
sorts of things through 16-bit code
just to make sure the decoy quacks the right way.
Therefore, even when the CPU is running a Windows application
(as opposed to an MS-DOS-based application),
it still keeps the virtual machine mapping active
so it doesn't have to do page remapping
(and the expensive TLB flush that comes with it)
every time it needs to go to the MS-DOS compatibility layer.
Okay, so the first megabyte of address space is
already off the table.
What about the other three megabytes?
Now we come back to that little hint at the top of the article.
In order to make context switching fast,
the Windows 3.1 virtual machine manager "rounds up" the per-VM
context to 4MB.
It does this so that a memory context switch can be performed
by simply updating a single 32-bit value in the page directory.
(Okay, you also have to mark
the instance data pages, but that's just flipping a dozen or so bits.)
This rounding causes us to lose three megabytes of address space,
but given that there were four gigabytes of address space,
a loss of less than one tenth of one percent was deemed a fair
trade-off for the significant performance improvement.
(Especially since no applications at the time came anywhere
near beginning to scratch the surface of this limit.
Your entire computer had only 2MB of RAM in the first place!)
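The arithmetic behind the 4MB figure: on the 386, each page directory entry covers 1024 page-table entries × 4KB pages = 4MB of address space, so a 4MB-aligned, 4MB-sized region corresponds to exactly one PDE, and switching it is a single 32-bit write. A small sketch of the numbers:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE      0x1000u                     /* 4KB page */
#define PTES_PER_TABLE 1024u                       /* entries per page table */
#define PDE_SPAN       (PTES_PER_TABLE * PAGE_SIZE) /* 4MB per PDE */

/* A linear address selects its page directory entry by its top
 * 10 bits, so every address in the same 4MB region shares one PDE. */
uint32_t pde_index(uint32_t linear)
{
    return linear >> 22;
}
```

This is why 0x00400000 (4MB) is the first EXE base address that stays out of the per-VM context: it is the first address that lands in page directory entry 1 rather than entry 0.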
This memory map was carried forward into Windows 95,
with some tweaks to handle separate address spaces for 32-bit Windows applications.
Therefore, the lowest address an executable could be loaded on Windows 95
was at 4MB, which is 0x00400000.
The linker chooses a default base address for executables of 0x00400000
so that the resulting binary can load without relocation on both Windows NT
and Windows 95.
Nobody really cares much about targeting Windows 95 any more,
so in principle, the linker folks could choose a different default base address now.
But there's no real incentive for doing it aside from making diagrams look prettier,
especially since ASLR makes the whole issue moot anyway.
And besides, if they changed it, then people would be asking,
"How come some executables have a base address of 0x00400000 and some executables
have a base address of 0x00010000?"
TL;DR: To make context switching fast.
There are a few references to the
DS_RECURSE dialog style scattered throughout MSDN,
and they are all of the form
"Don't use it."
But if you look in your copy of winuser.h,
there is no sign of DS_RECURSE anywhere.
This obviously makes it trivial to avoid using it
because you couldn't use it even if you wanted it,
seeing as it doesn't exist.
"Do not push the red button on the control panel!"
— There is no red button on the control panel.
"Well, that makes it easy not to push it."
As with many of these types of stories,
the answer is rather mundane.
When nested dialogs were added to Windows 95,
the flag to indicate that a dialog is a control host was named DS_RECURSE.
The name was intended to indicate
that anybody who is walking a dialog looking for controls
should recurse into this window, since it has more controls inside.
The window manager folks later decided to change the name,
and they changed it to DS_CONTROL.
All documentation that was written before the renaming had to be revised
to change all occurrences of DS_RECURSE to DS_CONTROL.
It looks like they didn't quite catch them all:
There are two straggling references in the
Windows Embedded documentation.
My guess is that the Windows Embedded team
took a snapshot of the main Windows documentation,
and they took their snapshot before the renaming was complete.
Unfortunately, I don't have any contacts in the Windows Embedded
documentation team,
so I don't know whom to contact to get them to remove the references
to flags that don't exist.
This is the end of
a week that sort of happened around me and I had to catch up with.
The Windows 3.1 virtual machine manager had a clever solution for
avoiding deadlocks:
There was only one synchronization object in the entire kernel.
It was called "the critical section", with the definite article
because there was only one.
The nice thing about a system where the only available
synchronization object is a single critical section
is that deadlocks are impossible:
The thread with the critical section will always be able to make
progress because the only thing that could cause it to stop would be
blocking on a synchronization object.
But there is only one synchronization object (the critical section),
and the thread already owns it.
When you hit Ctrl+Alt+Del in
Windows 3.1, a bunch of crazy stuff happened.
All this work was in a separate driver, known as the
virtual reboot device.
By convention, all drivers in Windows 3.1 were called
the virtual something device
because their main job was to virtualize some hardware or other functionality.
That's where the funny name VxD came from.
It was short for virtual x device.
First, the virtual reboot device driver checked
which virtual machine had focus.
If you were using an MS-DOS program, then it told all the device
drivers to clean up whatever they were doing for that virtual machine,
and then it terminated the virtual machine.
This was the easy case.
Otherwise, the focus was on a Windows application.
Now things got messy.
When the 16-bit Windows kernel started up, it gave the
virtual reboot device the addresses of a few magic things.
One of those magic things was a special byte that was set to 1
every time the 16-bit Windows scheduler regained control.
When you hit Ctrl+Alt+Del,
the virtual reboot device set the byte to 0,
and it also registered a callback with the virtual machine manager
to say "Call me back once the critical section becomes available."
The callback didn't do anything aside from remember the fact that it was
called at all.
And then the code waited for ¾ seconds.
(Why ¾ seconds? I have no idea.)
After ¾ seconds, the virtual reboot device looked to see
what the state of the machine was.
If the "call me back once the critical section becomes available"
callback was never called,
then the problem was that a device driver was stuck in the critical section.
Maybe the device driver had put an
Abort, Retry, Ignore message on the screen that the user
needed to respond to.
The user saw this message:
After the user pressed a key,
focus was placed on the
virtual machine that held the critical section
so the user could address the problem.
A user who was still stuck could hit Ctrl+Alt+Del again
to restart the whole process,
and this time, execution will go into the
"If you were using an MS-DOS program" paragraph,
and the code will shut down the stuck virtual machine.
If the critical section was not the problem, then the
virtual reboot device checked whether the 16-bit kernel scheduler
had set the byte to 1 in the meantime.
If so, then it meant that no applications were hung,
and you got a message explaining that nothing appeared to be stuck
and how to exit an application normally through its System menu.
(The System menu was called the Control menu back then.)
Otherwise, the special byte was still 0,
which means that the 16-bit scheduler never got control,
which meant that a 16-bit Windows application was not releasing
control back to the kernel.
The virtual reboot device then waited
for the virtual machine to finish processing any pending virtual interrupts.
(This allowed any pending MS-DOS emulation or 16-bit MS-DOS device
drivers to finish up their work.)
If things did not return to this sane state
within 3¼ seconds, then you got this screen:
Otherwise, we are in the case where the system returned to
a state where there are no active virtual interrupts.
The kernel single-stepped the processor if necessary
until the instruction pointer
was no longer in the kernel,
or until it had single-stepped for 5000
instructions and the instruction pointer was not in the heap manager.
(The heap manager was allowed to run for more than 5000 instructions.)
At this point, you got the screen that Steve Ballmer wrote.
If you hit Enter,
then the 16-bit kernel terminated the application by doing
mov ax, 4c00h followed by int 21h,
which was the system call that applications used to exit normally.
This time, the kernel is making the exit call on behalf of the application.
Everything looks as if the application had simply decided to exit normally.
The stuck application exits,
the kernel regains control,
and hopefully, things return to normal.
I should point out that I didn't write any of this code.
"It was like that when I got here."
There were various configuration settings to tweak all of the
above behavior.
For example, you could say that Ctrl+Alt+Del
restarted the computer rather than terminating the current application.
Or you could
skip the check whether the 16-bit kernel scheduler had
set the byte to 1 so that you could use
Ctrl+Alt+Del to terminate
an application even if it wasn't hung.¹
There was also a setting to restart the computer upon receipt
of an NMI,
the intention being that the signal would be triggered
either by a dedicated add-on switch
or by poking a ball-point pen in just the right spot.
(This is safer than just pushing the reset button because
the restart would flush disk caches and shut down devices in
an orderly manner.)
This setting was intended for developers to assist in debugging their
programs, but if you went for this option, the program that got terminated
was whichever one happened to have control of the CPU at the time you
triggered the NMI.
This was, in theory, random,
but in practice it often guessed right.
That's because the problem was usually that a program got wedged
into an infinite message loop,
so most of the CPU was being run in the stuck application anyway.
The CRITICAL_SECTION structure has gone through
a lot of changes since its introduction back oh so many decades ago.
The amazing thing is that as long as you stick to the documented API,
your code is completely unaffected.
Initially, the critical section object had an owner field
to keep track of which thread entered the critical section, if any.
It also had
a lock count to keep track of how many times the owner thread
entered the critical section, so that the critical section would
be released when the matching number of
LeaveCriticalSection calls was made.
And there was an auto-reset event used to manage contention.
We'll look more at that event later.
(It's actually more complicated than this, but the details aren't important here.)
If you've ever looked at the innards of a critical section
(for entertainment purposes only),
you may have noticed that the lock count was off by one:
The lock count was the number of active calls to
EnterCriticalSection minus one.
The bias was needed because the original version of the
interlocked increment and decrement operations
returned only the sign of the result, not the revised value.
Biasing the result by 1 means that all three states could be detected:
Unlocked (negative), locked exactly once (zero), reentrant lock (positive).
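A toy model of the biasing trick (not the real Windows code): if the interlocked increment can report only the sign of the result, starting the count at −1 lets the sign alone distinguish all three states.

```c
#include <assert.h>

/* Toy critical section with a 1-biased lock count:
 * -1 = unlocked, 0 = locked exactly once, >0 = reentrant lock. */
typedef struct {
    int lock_count;
} toy_cs;

/* Model of the old-style interlocked increment: it reveals only the
 * sign of the new value (-1, 0, or +1), never the value itself. */
static int inc_sign(int *p)
{
    ++*p;
    return (*p > 0) - (*p < 0);
}

/* Returns 0 when the caller took the lock uncontended (count went
 * from -1 to 0), +1 when someone was already inside. */
int toy_enter(toy_cs *cs)
{
    return inc_sign(&cs->lock_count);
}

/* Returns the new count; -1 means the lock is fully released. */
int toy_leave(toy_cs *cs)
{
    return --cs->lock_count;
}
```

Without the bias, "unlocked" (0) and "locked once" (1) would both look non-negative to a sign-only primitive, so the fast path couldn't tell them apart.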
(It's actually more complicated than this, but the details aren't important here.)
If a thread tries to enter a critical section but can't
because the critical section is owned by another thread,
then it sits and waits on the contention event.
When the owning thread releases all its claims on the critical section,
it signals the event to say,
"Okay, the door is unlocked.
The next guy can come in."
The contention event is allocated only when contention occurs.
(This is what older versions of MSDN meant when they said that
the event is "allocated on demand.")
Which leads to a nasty problem:
What if contention occurs,
but the attempt to create the contention event fails?
Originally, the answer was "The kernel raises an out-of-memory exception."
Now you'd think that a clever program could catch this exception
and try to recover from it, say, by unwinding everything that led
up to the exception.
Unfortunately, the weakest link in the chain is the critical section itself.
In the original version of the code,
the out-of-memory exception was raised while the critical section was
in an unstable state.
Even if you managed to catch the exception and unwind everything you could,
the critical section was itself irretrievably corrupted.
Another problem with the original design became apparent on multiprocessor
systems: If a critical section was typically held for a very brief time,
then by the time you called into the kernel to wait on the contention event,
the critical section was already freed!
CRITICAL_SECTION g_cs;
GUID g_theGuid;

void SetGuid(REFGUID guid)
{
    EnterCriticalSection(&g_cs);
    g_theGuid = guid;
    LeaveCriticalSection(&g_cs);
}

void GetGuid(GUID *pguid)
{
    EnterCriticalSection(&g_cs);
    *pguid = g_theGuid;
    LeaveCriticalSection(&g_cs);
}
This imaginary code uses a critical section to protect accesses
to a GUID.
The actual protected region is just nine instructions long.
Setting up to wait on a kernel object is way,
way more than nine instructions.
If the second thread immediately waited on the critical section's contention event,
it would find that by the time the kernel got around to entering
the wait state, the event would say,
"Dude, what took you so long? I was signaled, like, a bazillion cycles ago!"
Windows 2000 added the InitializeCriticalSectionAndSpinCount function,
which lets you avoid the problem where waiting for a critical section
costs more than the code the critical section was protecting.
If you initialize with a spin count, then when a thread tries to
enter the critical section and can't,
it goes into a loop trying to enter it over and over again,
in the hopes that it will be released.
"Are we there yet?
How about now?
How about now?
How about now?
How about now?
How about now?
How about now?
How about now?
How about now?
How about now?
How about now?
How about now?"
If the critical section is not released after the requested number of spins,
then the old slow wait code is executed.
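The spin-then-wait idea can be sketched with C11 atomics. This is a toy lock, not the real CRITICAL_SECTION: the kernel wait is modeled as a counter plus a busy loop, and the spin count is a plain field rather than an initialization parameter.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;
    unsigned spin_count;     /* how many times to retry before "waiting" */
    unsigned slow_waits;     /* stand-in for trips into the kernel */
} spin_cs;

/* One attempt to take the lock; nonzero on success. */
int spin_try_enter(spin_cs *cs)
{
    return !atomic_flag_test_and_set(&cs->locked);
}

void spin_enter(spin_cs *cs)
{
    /* Fast path: spin in user mode hoping the owner releases soon. */
    for (unsigned i = 0; i < cs->spin_count; i++)
        if (spin_try_enter(cs))
            return;              /* lock came free while we spun */

    /* Slow path: the real code blocks on the contention event here. */
    cs->slow_waits++;
    while (!spin_try_enter(cs))
        ;
}

void spin_leave(spin_cs *cs)
{
    atomic_flag_clear(&cs->locked);
}
```

With short hold times the fast path wins and slow_waits stays at zero, which is exactly the scenario the spin count was added for.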
Note that spinning on a critical section is helpful only on multiprocessor systems,
and only in the case where you know that all the protected code
segments are very short in duration.
If the critical section is held for a long time,
then spinning is wasteful since the odds that the critical section
will become free during the spin loop
are very low,
and you wasted a bunch of CPU.
Another feature added in Windows 2000 is the ability to
preallocate the contention event.
This avoids the dreaded
"out of memory" exception in EnterCriticalSection,
whether actual contention occurs or not.
Windows XP solved the problem of the dreaded "out of memory"
exception by using a fallback algorithm that could be used
if the contention event could not be allocated.
The fallback algorithm is not as efficient, but at least
it avoids the "out of memory" situation.
(Which is a good thing, because nobody really expects
EnterCriticalSection to fail.)
This also means that requests for the contention event to be preallocated
are now ignored, since the reason for preallocating
(avoiding the "out of memory" exception) no longer exists.
(And while they were there, the kernel folks also fixed
InitializeCriticalSection so that
a failed initialization left the critical section object in
a stable state so you could safely try again later.)
In Windows Vista, the internals of the critical section object
were rejiggered once again,
this time to add convoy resistance.
The internal bookkeeping completely changed;
the lock count is no longer a 1-biased count of the number of
EnterCriticalSection calls which are pending.
As a special concession to backward compatibility with people
who violated the API contract and
looked directly at the internal data structures,
the new algorithm goes to some extra effort to ensure that
if a program breaks the rules and
looks at a specific offset inside the critical section object,
the value stored there is −1 if and only if the critical section is unlocked.
Often, people will remark that "your compatibility problems would go
away if you just open-sourced the operating system."
I think there is some confusion over what "go away" means.
If you release the source code to the operating system,
it makes it even easier for people to take undocumented
dependencies on it,
because they no longer have the barrier of "Well, I can't find any
documentation, so maybe it's not documented."
They can just read the source code and say,
"Oh, I see that if the critical section is unlocked,
the LockCount variable has the value −1."
Boom, instant undocumented dependency.
Compatibility is screwed.
(Unless what people are saying is "your compatibility problems would
go away if you just open-sourced all applications,
so that these problems can be identified and fixed as soon as they are introduced.")
Exercise: Why isn't it important that the fallback algorithm be highly efficient?