Commenter HiTechHiTouch wants to know whether
the "X-Mouse" feature went through the
"every request starts at −100 points" filter,
and if so, how it managed to gain 99 points.
The X-Mouse feature is ancient and long predates the
"−100 points" rule.
It was added back in the days when a developer could
add a random rogue feature because he liked it.
But I'm getting ahead of myself.
Rewind back to 1995.
Windows 95 had just shipped,
and some of the graphics people had shifted their focus to DirectX.
The DirectX team maintained a very close relationship with the
video game software community,
and a programmer at one of the video game software companies
mentioned in passing as part of some other conversation,
"Y'know, one thing I miss from my X-Windows workstation
is the ability to set focus to a window by just
moving the mouse into it."
As it happened,
the programmer mentioned it to a DirectX team member
who used to be on the shell team,
so the person on the receiving end
actually knew a lot about all this GUI programming stuff.
Don't forget, in the early days of DirectX,
it was a struggle convincing game vendors to target this new
Windows 95 operating system;
they were all accustomed to writing their games to run under MS-DOS.
Video game programmers
didn't know much about programming for Windows
because they had never done it before.
That DirectX team member sat down and quickly pounded out the
first version of what eventually
became known to the outside world as the X-Mouse PowerToy.
He gave a copy to the programmer whose almost-offhand request
had started it all,
and the programmer was thrilled that he could move focus around
with the mouse the way he was used to.
"Hey, great little tool you got there.
Could you tweak it so that when I move the mouse into a window,
it gets focus but doesn't come to the top?
Sorry I didn't mention that originally;
I didn't realize you were going to interpret my idle musing
as a call to action!"
The DirectX team member
added the feature and added a check-box to the X-Mouse PowerToy
to control whether the window is brought to the top when it is
activated by mouse motion.
"This is really sweet.
I hate to overstay my welcome, but could you tweak it so that
it doesn't change focus until my mouse stays in the window for
a little while?
Again, sorry I didn't mention that originally."
Version three of X-Mouse added the ability to set a delay
before it moved the focus.
And that was the version of X-Mouse that went into the PowerToys.
When the Windows NT folks saw the X-Mouse PowerToy,
they said, "Aw shucks, we can do that too!"
And they added the three SystemParametersInfo
values I described in
an earlier article
so as to bring Windows NT up to feature parity with X-Mouse.
It was a total rogue feature.
Anon wants to know why
I am listed in
the credits for the video game Quake
under the "Special Thanks" section.
"Were you an early tester/debugger?"
I've never played a game of Quake in my entire life.
I played earlier first-person shooters
(as did most of the rest of the Windows 95 team),
but after a while,
first-person-shooter games started giving me a headache.
By the time Quake came out, I had already abandoned playing FPS games.
I don't remember what it was that I did specifically,
but it was along the lines of helping them with various
technical issues related to running under Windows.
At the time, I was a kernel developer,
and the advice I gave was almost certainly related
to memory management and swapping.
Sorry it wasn't anything exciting.
I noted some time ago that real-mode Windows had to do all
its memory management without any hardware assistance.
And yet, along the way, they managed to implement an LRU-based
discard algorithm.
Gabe is really interested in how that was done.
As we saw a few months ago,
inter-segment calls were redirected through a little stub which
either jumped directly to the target (if it was in memory)
or loaded the target
(possibly discarding other memory to make room)
before jumping to it.
And we saw that the executable format had
INT 3Fh instructions baked into it
so that the Entry Table could be loaded directly into memory
and used as-is.
As it happens, Windows didn't take advantage of that feature,
because it wanted to do more.
When it came time to load the Entry Table,
the loader did a little rewriting, so that each entry began with a
sar byte ptr cs:[xxx], 1
instruction,
where the xxx refers to a table of bytes
in the Entry Table preallocated for this purpose,
initialized to 1's.
What is "this purpose"?
Whenever anybody needed the address of an inter-segment
function, instead of returning the address of the int 3Fh,
the kernel returned the address of the sar instruction.
The sar instruction stands for "shift arithmetic right."
For a byte value, this means to shift the bits right one place,
but keep the high-order bit the same.
Okay, so what was the effect of sticking this little
sar instruction at the start of every inter-segment call?
Since the values in the table were initialized to 1,
a right arithmetic shift changed the 1 to 0.
Therefore, each time an inter-segment call was performed,
the corresponding byte in the table was set to zero.
Hooray, a software-implemented Accessed bit!
Every 250 milliseconds, Windows scanned and reset the Accessed bits,
using the data to maintain an LRU list of all the segments in the system.
That way, when it was time to discard some memory,
it could discard the least recently used ones first.
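The whole trick can be modeled in a few lines of Python. This is a sketch with invented names and data structures, not the actual kernel code: each inter-segment call clears a table byte via an arithmetic right shift, and the periodic scan converts those cleared bytes into LRU ordering.

```python
# A toy model of the software "Accessed bit" (illustrative only; the
# names and structures are invented, not the actual kernel code).

class Segment:
    def __init__(self, name):
        self.name = name
        self.access_byte = 1          # the loader initializes each byte to 1

    def call_into(self):
        # Stand-in for "sar byte ptr cs:[xxx], 1": the arithmetic right
        # shift turns the 1 into a 0 as a side effect of the call.
        self.access_byte >>= 1

def scan_and_reset(segments, lru):
    # The periodic scan: a 0 byte means "called since the last scan",
    # so move that segment to the most-recently-used end of the list
    # and reset its byte back to 1 for the next interval.
    for seg in segments:
        if seg.access_byte == 0:
            lru.remove(seg)
            lru.append(seg)
            seg.access_byte = 1

segs = [Segment("A"), Segment("B"), Segment("C")]
lru = list(segs)                      # least recently used at the front

segs[0].call_into()                   # some code calls into segment A
scan_and_reset(segs, lru)
print([s.name for s in lru])          # ['B', 'C', 'A'] -> A is most recent
```

Note that the only work on the hot path is the single shift; all the list maintenance is deferred to the periodic scan, which is what made the scheme affordable on hardware of that era.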
Today, a timer that runs continuously at 250ms would
incur the wrath of the power management team.
But back in the days of real-mode Windows,
there was no power management.
Like Chuck Norris, PCs ran at only one power level: Awesome.
I continue to be amazed at how much Windows 1.0 accomplished
with so little.
[Raymond is currently away; this message was pre-recorded.]
Leo Davidson observes that
a hit-test code is defined for
HTOBJECT, but it is not documented, and wonders what it's for:
#define HTOBJECT 19
The HTOBJECT is another one of those features
that never got implemented.
The code does nothing and nobody uses it.
It was added back in Windows 95 for reasons lost to the
mists of time,
but when the reason for adding it vanished (maybe a feature got cut),
it was too late to remove it from the header file because that would
require renumbering HTCLOSE and HTHELP,
two values which were in widespread use already.
So the value just stayed in the header file,
taking up space but accomplishing nothing.
Throughout the history of the Internet, there have been many cases
of one company providing a service, and others trying to
piggyback off the service through a nonstandard client.
The result is usually a back-and-forth where the provider changes
the interface, the piggybacker reverse-engineers the interface,
back and forth, until one side finally gives up.
Once upon a time, there was one company with a well-known service,
and another company that was piggybacking off it.
(I first heard this story from somebody who worked at the
piggybacking company.)
The back-and-forth continued for several rounds, until the provider
made a change to the interface that ended the game:
They exploited a buffer overflow bug in their own client.
The server sent an intentional buffer overflow to the client,
resulting in the client being pwned by the server.
I'm not sure what happened next, but presumably the server
sent some exploit code to the client and waited for the client to
respond in a manner that confirmed that the exploit had executed.
With that discovery, the people from the piggybacking company gave up.
They weren't going to introduce an intentional security flaw into
their own client.
The service provider could send not only the exploit but also some
code to detect and disable the rogue client.
By an amazing stroke of good fortune,
I happened to also hear the story of this battle from somebody
who worked at the provider.
He said that they had a lot of fun fighting this particular battle
and particularly enjoyed timing the releases so they caused
maximum inconvenience for their adversaries,
like, for example, 2am on Saturday.
"trying to guess the identity of a program whose name I did not reveal."
The first question has to do with how shortcuts can find their targets
even if they've been renamed.
This is something I had covered
nearly a year before the question was asked,
so the reason I'm not answering that question isn't that I'm ignoring it.
It's that I already answered it.
While I'm at it, several of the other questions people asked are
also ones that I've already answered.
The other question was a series of questions about the history of
multiple monitor support in Windows.
Actually, I think I've already discussed all of the parts of this
so today's entry is more like a
"Remember the first time I talked about multiple monitors?"
Windows 98 was the first version of Windows to support
multiple monitors.
(Code to support multiple monitors started being written
shortly after Windows 95 was out the door,
so my guess is that the preliminary design work overlapped
the end of the Windows 95 project.)
To facilitate development of code that takes advantage of
multiple monitors, the multimon.h header file was introduced
so you could code as if multiple monitor support was present in the
operating system, and it would emulate the multimon APIs
(with a single monitor) if running on Windows 95.
In Windows 98, the maximum number of monitors was nine.
There was no restriction on color depth,
and the most common configuration involved one powerful graphics card
combined with one really lame one.
When support for multiple monitors was ported to Windows NT,
the Windows NT folks
figured they could one-up the Windows 98 team.
The maximum number of monitors was increased from nine to ten.
And maybe someday it will
go to eleven.
Last week, I described
how real-mode Windows fixed up jumps to functions that got discarded.
But what about return addresses to functions that got discarded?
The naïve solution would be to allocate a special "return address
recovery" function for each return address you found,
but that idea comes with its own problems:
You are patching addresses on the stack because you are trying to free
memory.
It would be a bad idea to try to allocate memory while you're trying
to free memory!
What if, in order to satisfy the allocation, you had to discard still
more memory?
You would start moving and patching stacks
before they were fully patched from the previous round of memory
discarding.
The stack patcher would get re-entered and
see an inconsistent stack because the previous stack patcher
didn't get a chance to finish.
The result would be rampant memory corruption.
Therefore, you have to preallocate your "return address recovery" functions.
But you don't know how many return addresses you're going to need
until you walk the stack (at which point it's too late),
and you definitely don't want to preallocate the worst-case scenario
since each stack can be up to 64KB
in size, and can hold up to 16384 far return addresses.
You'd end up allocating nearly all your available memory just for
return address recovery stubs!
How did real-mode Windows solve this problem?
It magically found a way to put ten pounds of flour in a five-pound bag.
For each segment, there was a special "return thunk"
that was shared by all return addresses which returned back
into that segment.
Since there is only one per segment, you can preallocate it as part of
the segment bookkeeping data.
To patch the return address, the original return address was overwritten
by this shared return thunk.
Okay, so you have 32 bits of information you need to save
(the original return address,
which consists of
16 bits for the segment and 16 bits for the offset),
and you have a return thunk that captures the segment information.
But you still have 16 bits left over.
Where do you put the offset?
We saw some time ago that
the BP register associated with far calls was incremented before being
pushed on the stack.
This was necessary so that the stack patcher knew whether to decode the
frame as a near frame or a far frame.
But that wasn't the only rule associated with far stack frames.
On entry to a far function,
the prologue pushed the incremented BP and then the caller's DS.
Every far call therefore left the return address (CS and IP),
the incremented BP, and the saved DS together on the stack.
The stack patcher overwrote the saved CS:IP with the address of the
return thunk.
The return thunk describes the segment that got discarded,
so the CS is implied by your choice of return thunk,
but the stack patcher still needed to save the IP somewhere.
So it saved it where the DS used to be.
Wait, you've just traded one problem for another.
Sure, you found a place to put the IP,
but now you have to find a place to put the DS.
Here comes the magic.
Recall that on the 8086, the combination segment:offset
corresponds to the physical address
segment×16 + offset.
For example, the address 1234:0005 refers to physical byte
0x1234 * 16 + 0x0005 = 0x12345.
Since the segment and offset were both 16-bit values,
there were multiple ways to refer to the same physical address.
1000:2345 would also resolve to physical address 0x12345.
But there are other ways to refer to the same byte,
like the not entirely obvious 0235:FFF5.
In fact, there's a whole range of values you can use,
starting at 0235:FFF5 and ending at 1234:0005.
In general, there are 4096 different ways of referring to
the same address.
There's a bit of a problem with very low addresses, though.
To access byte 0x00400, for example, you could use
0000:0400 through 0040:0000, but that's as far as you could go,
so these very low addresses do not have the full complement
of aliases.
Aha, but they do
if you take advantage of wraparound.
Since the 8086 had only 20 address lines,
any overflow in the calculations was simply taken mod 0x100000.
Therefore, you could also use F041:FFF0 to refer to that address, since
0xF041 × 16 + 0xFFF0 = 0x100400 ≡ 0x00400.
Wraparound allowed the full range of 4096 aliases
since you could use
F041:FFF0 to FFFF:0410,
and then 0000:0400 to 0040:0000.
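The alias arithmetic is easy to check mechanically. The following brute-force script is just a sanity check of the numbers above, not anything resembling Windows code:

```python
# Count the segment:offset aliases of a physical address on the 8086.

def aliases(phys, wrap=True):
    """All segment:offset pairs that resolve to physical address phys."""
    result = []
    for seg in range(0x10000):
        # Physical address = segment * 16 + offset; with wraparound,
        # the sum is taken mod 0x100000 (only 20 address lines).
        off = (phys - seg * 16) % 0x100000 if wrap else phys - seg * 16
        if 0 <= off <= 0xFFFF:
            result.append((seg, off))
    return result

# 0x12345 has the full complement of 4096 aliases,
# from 0235:FFF5 up to 1234:0005.
a = aliases(0x12345)
print(len(a), "%04X:%04X" % a[0], "%04X:%04X" % a[-1])

# A very low address falls short without wraparound...
print(len(aliases(0x00400, wrap=False)))   # only 65
# ...but recovers all 4096 aliases once wraparound is considered.
print(len(aliases(0x00400)))               # 4096
```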
(You may remember this wraparound behavior from the story of the
mysterious WINA20.386 file.)
Okay, back to stack patching.
Once you consider aliasing, you realize that the 32 bits of flour
actually had a lot of air in it.
By rewriting the address of the return thunk into the form
XXXX:000Y, you can see the 12 bits of air,
and to stash the 12-bit value N into that air pocket,
you would set the segment to XXXX−N and the offset to 000Y+16N.
Recall that we were looking for a place to put that saved DS value,
which is a 16-bit value,
and we have found 12 bits of air in which to save it.
We need to find 4 more bits of air somewhere.
The next trick is realizing that DS is not an arbitrary 16-bit value.
It's a 16-bit segment value that was obtained from
the Windows memory manager.
Therefore, if the Windows memory manager imposed a generous artificial limit
of 4096 segments,
it could convert the DS segment value into an integer segment index.
That index got saved in the upper 12 bits of the offset.
Okay, let's see what happens when the code tries to unwind to the
patched return address.
The function whose return address got patched goes through the
usual function epilogue.
It pops what it thinks is the original DS off the stack,
even though that DS has been replaced
by the original return address's IP.
The epilogue then pops the old BP, decrements it,
and returns to the return thunk.
The return thunk now knows where the real return address is
(it knows which segment it is responsible for,
and it can figure out the IP from the incoming DS register).
It can also study its own IP to extract the segment index
and from that recover the original DS value.
Now that it knows what the original code was trying to do,
it can reload the segment, restore the registers to their proper
values, and jump to the original return address inside the
reloaded segment.
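Putting the pieces together, the packing scheme can be sketched in Python. This is my own reconstruction of the encoding described above, with invented function names: the thunk's canonical address XXXX:000Y is rewritten as an alias whose offset carries the 12-bit segment index in its upper bits.

```python
# A reconstruction of the return-thunk packing (names invented).

def pack(thunk_seg, thunk_off, ds_index):
    """Hide a 12-bit index in an alias of the thunk address.

    thunk_off must be < 16 (the thunk address in canonical XXXX:000Y
    form) and ds_index < 4096 (the memory manager's segment limit).
    """
    assert thunk_off < 16 and ds_index < 0x1000
    # Subtracting N from the segment and adding 16*N to the offset
    # leaves the physical address (segment*16 + offset) unchanged.
    return (thunk_seg - ds_index) & 0xFFFF, thunk_off + 16 * ds_index

def unpack(seg, off):
    """Recover the canonical thunk address and the hidden index."""
    ds_index = off >> 4               # the upper 12 bits of the offset
    return (seg + ds_index) & 0xFFFF, off & 0xF, ds_index

# The thunk lives at 1234:0005; smuggle DS segment index 0x0AB into it.
seg, off = pack(0x1234, 0x0005, 0x0AB)

# Both forms refer to the same physical byte...
assert (seg * 16 + off) & 0xFFFFF == (0x1234 * 16 + 0x0005) & 0xFFFFF
# ...and the thunk can recover everything it needs.
assert unpack(seg, off) == (0x1234, 0x0005, 0x0AB)
print("%04X:%04X" % (seg, off))       # 1189:0AB5
```

Because pack only trades segment for offset along the same physical address, the patched value still points at the return thunk, and the thunk runs the reverse computation to recover both its canonical address and the hidden index.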
I continue to be amazed at how real-mode Windows managed to get so
much done with so little.
The arbitrary limit of 4096 segments was quite generous,
seeing as the maximum number of selectors
in protected mode was defined by the processor to be 8191.
What small change could you make to expand the segment limit
in real mode to match that of protected mode?
In a discussion of how real-mode Windows
patched stacks when segments were discarded, commenter Matt wonders
about fixing jumps in the rest of the code to the discarded functions.
I noted in the original article that
"there are multiple parts to the solution"
and that stack-walking was just one piece.
Today, we'll look at another piece:
fixing up the inter-segment calls themselves.
Recall that real-mode Windows ran on an 8086 processor,
a simple processor with no memory manager, no CPU privilege levels,
and no concept of task switching.
Memory management in real-mode Windows was handled manually by
the real-mode kernel,
and the way it managed memory was by loading code from disk
on demand, and discarding code when under memory pressure.
(It didn't discard data because it wouldn't know how to regenerate it,
and it couldn't swap it out because there was no swap file.)
There were a few flags you could attach to a segment.
Of interest for today's discussion are movable
(and it was spelled without the "e")
and discardable.
If a segment was not movable (known as fixed),
then it was loaded into memory and stayed there until the
module was unloaded.
If a segment was movable, then the memory manager was allowed
to move it around when it needed to defragment memory in order
to satisfy a large memory allocation.
And if a segment was discardable,
then it could even be evicted from memory
to make room for other stuff.
I'm going to combine the movable and discardable cases,
since the effect is the same for the purpose of today's discussion,
the difference being that with discardable memory,
you also have the option of throwing the memory out entirely.
First of all, let's get the easy part out of the way.
If you had an intra-segment call
(calling a function in your own segment),
then there was no work that needed to be done.
Real-mode Windows always discarded full segments,
so if your segment was running code,
it was by definition present,
and therefore any other code in that segment was also present.
The hard part is the inter-segment calls.
As it happens,
an old document on the 16-bit Windows executable file format
gives you some insight into how things worked,
if you sit down and puzzle it out hard enough.
Let's start with the GetProcAddress function.
When you call GetProcAddress, the kernel needs
to locate the address of the function inside the target module.
The loader consults the Entry Table to find the function
you're asking for.
As you can see, there are three types of entries in the Entry Table:
unused entries (representing ordinals with no associated function),
fixed segment entries, and movable segment entries.
Obviously, if the match is in an unused entry, the return value
is NULL because there is no such function.
If the match is in a fixed entry, that's pretty easy too:
Look up the segment number in the target module's segment list
and combine it with the specified offset.
Since the segment is fixed, you can just return the raw pointer
directly, since the code will never move.
The tricky part is if the function is in a movable segment.
If you look at the document, it says that "a moveable segment entry
is 6 bytes long and has the following format."
It starts with a byte of flags (not important here),
a two-byte INT 3Fh instruction,
a one-byte segment number, and the offset within the segment.
What's the deal with the
INT 3Fh instruction?
It seems rather pointless to specify that a file format
requires some INT 3Fh instructions
scattered here and there.
Why not get rid of it to save some space in the file?
If you called GetProcAddress and the result
was a function in a movable segment,
GetProcAddress didn't actually return the
address of the target function.
It returned the address of the INT 3Fh instruction!
(Thankfully, the Entry Table is always a fixed segment,
so we don't have to worry about the Entry Table itself being discarded.)
(Now you see why the file format includes these strange
INT 3Fh instructions:
The file format was designed to be loaded directly into memory.
When the loader loads the entry table,
it just slurps it into memory and bingo, it's ready to go,
INT 3Fh instructions and all!)
Since GetProcAddress returned the address of the
INT 3Fh instruction,
calls to imported functions didn't actually go straight
to the target.
Instead, you called the INT 3Fh instruction,
and it was the
INT 3Fh handler which said,
"Gosh, somebody is trying to call code in another segment.
Is that segment loaded?"
It took the return address of the interrupt and used it to
locate the segment number and offset.
If the segment in question was already in memory,
then the handler jumped straight to the segment at the
specified offset.
You got the function call you wanted, just in a roundabout way.
If the segment wasn't loaded, then the
INT 3Fh handler loaded it
(which might trigger a round of discarding),
then jumped to the newly-loaded segment at the specified offset.
An even more roundabout function call.
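In modern terms, each Entry Table entry behaves like a lazy-loading thunk. Here's a toy Python model of that behavior (the structure and names are invented for illustration; the real mechanism was the INT 3Fh handler decoding the segment number and offset that follow the interrupt):

```python
# A toy model of the Entry Table thunk: calls go through a stub that
# loads the target segment on demand.

loaded = {}          # segment number -> dict of offset -> function

def load_segment(segnum):
    # Stand-in for reading the segment's code back in from disk.
    print("loading segment", segnum)
    loaded[segnum] = {0x10: lambda: "hello from 2:0010"}

def make_entry(segnum, offset):
    # One movable-segment entry: conceptually "INT 3Fh" followed by
    # the segment number and the offset within the segment.
    def stub():
        if segnum not in loaded:          # segment discarded or never loaded?
            load_segment(segnum)          # may discard others to make room
        return loaded[segnum][offset]()   # jump to the real target
    return stub

# GetProcAddress hands out the stub, not the target function.
get_proc_address = make_entry(2, 0x10)
print(get_proc_address())                 # first call loads, then jumps
print(get_proc_address())                 # second call jumps directly
```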
Okay, so that's the case where a function pointer is obtained
by calling GetProcAddress.
But it turns out that a lot of stuff inside the kernel turns into
GetProcAddress at the end of the day.
Suppose you have some code that calls a function in another
segment within the same module.
As we saw earlier, fixups are
threaded through the code segment,
and if you scroll down to the
Per Segment Data section of that old document,
you'll see a description of the way the relocation records
are expressed.
A call to a function in another segment within the same module
requires an INTERNALREF fixup,
and as you can see in the document, there are two types of
INTERNALREF fixups, ones which refer to fixed
segments and ones which refer to movable segments.
The easy case is a reference to a fixed segment.
In that case, the kernel can just look up where it put that
segment, add in the offset, and patch that address into the
calling code.
Since it's a fixed segment, the patch will never have to be
updated.
The hard case is a reference to a movable segment.
In that case, you can see that the associated information in the
fixup table is the "ordinal number index into [the] Entry Table."
Aha, you now realize that the Entry Table is more than just a list
of your exported functions.
It's also a list of all the functions in movable segments that
are called from other segments.
In a sense, these are "secret exports" in your module.
(However, you can't get to them by calling GetProcAddress;
GetProcAddress knows how to keep a secret.)
To fix up a reference to a function in a movable segment,
the kernel calls the SecretGetProcAddress (not its real name)
function, which as we saw before, returns not the actual function pointer
but rather a pointer to the magic INT 3Fh in the
It is that pointer which is patched into your code segment,
so that when your code calls what it thinks is a function in
another segment,
it's really calling the Entry Table,
which as we saw before, loads the code in the target segment if necessary
before jumping to it.
Matt asked, "If the kernel wants to discard that procedure,
it has to find that jump address in my code,
and redirect it to a page fault handler,
so that when my process gets to it,
it will call the procedure and fault the code back in.
How does it find all of the references to that function across the program,
so that it can patch them all up?"
Now you know the answer:
It finds all of those references because it already had to find them
when applying fixups.
It doesn't try to find them at discard time;
it finds them when it loads your segment.
(Exercise: Why doesn't it need to reapply fixups when a segment moves?)
All inter-segment function pointers were really pointers into the
Entry Table.
You passed a function pointer to be used as a callback?
Not really; you really passed a pointer to your own Entry Table.
You have an array of function pointers?
Not really; you really have an array of pointers into your Entry Table.
It wasn't actually hard for the kernel to find all of these pointers,
because you had to declare them in your fixup table in the first place.
It is my understanding that the INT 3Fh trick
came from the overlay manager which was included with
the Microsoft C compiler.
(The Zortech C compiler followed a similar model.)
While the above discussion describes how things worked in principle,
there are in fact a few places where the actual
implementation differs from the description above,
although not in any way that fundamentally affects the design.
For example, real-mode Windows did a bit of optimization
in the INT 3Fh stubs.
If the target segment was in memory,
then it replaced the INT 3Fh instruction
with a direct jmp xxxx:yyyy to the target,
effectively precalculating the jump destination when a segment
is loaded rather than performing the calculation each time
a function in that segment is called.
By an amazing coincidence, the code sequence
int 3Fh / db entry_segment / dw entry_offset
is five bytes long, which is the exact length
of a jmp xxxx:yyyy instruction.
Phew, the patch just barely fits!
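You can verify the size coincidence from the 8086 encodings: int 3Fh assembles to two bytes (CD 3F), the segment number takes one byte and the offset two, while a far jmp xxxx:yyyy assembles to EA followed by a two-byte offset and a two-byte segment. A quick check (the instruction encodings come from the processor manual; the patching detail is as described above):

```python
# int 3Fh encodes as CD 3F (2 bytes); the entry then carries a one-byte
# segment number and a two-byte offset: 5 bytes of patchable code.
entry = bytes([0xCD, 0x3F, 0x02]) + (0x0010).to_bytes(2, "little")

# jmp xxxx:yyyy encodes as EA, then the 2-byte offset, then the
# 2-byte segment: also exactly 5 bytes.
jmp_far = (bytes([0xEA])
           + (0x0010).to_bytes(2, "little")    # yyyy (offset)
           + (0x1234).to_bytes(2, "little"))   # xxxx (segment)

print(len(entry), len(jmp_far))   # 5 5 -> the in-place patch fits
```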
An anonymous commenter was curious about how
the GetRandomRgn function
arrived at its strange name,
what the purpose of the third parameter is,
and why it is inconsistent between Windows 95 and Windows NT.
The sense of the word "random" here is
not its formal probabilistic definition
but rather the informal dictionary sense 2,
perhaps with a bit of sense 4 sprinkled in:
"Not well organized."
(Commenter Gabe suggested that
a better name would have been GetSpecificRgn.)
Once upon a time, when men were men and Windows was 16-bit,
there was an internal function used to communicate
between the window manager and GDI in order to set up device contexts.
Internally, the region was called the Rao Region,
after the programmer who invented it,
and the function that calculated the Rao Region was
rather uncreatively called ComputeRaoRgn.
When porting to 32-bit Windows,
the Windows NT and Windows 95 teams
both found that they needed this same internal
communication between the window manager and GDI.
GDI already had a bunch of functions named
GetXxxRgn, so instead of writing a separate
marshaler for each one, they opted to write a single
GetRandomRgn function which takes an
integer which serves as a function code,
specifying which region the caller actually wants.
(I suspect the Windows 95 team followed the cue of the Windows NT
team, since Windows NT ran into the problem first.)
Since this was an internal function,
it didn't matter that the name was a bit cutesy,
nor did it matter what coordinate
system it used, as long as the window manager and GDI agreed on the
name and coordinate system.
The Windows 95 team still had a lot of 16-bit code that they
needed to be compatible with, so they chose to generate the Rao region
the same way
that the 16-bit ComputeRaoRgn function did it.
The Windows NT folks, on the other hand,
decided that it was more convenient for them
that this internal function use screen coordinates,
so that's what it returns on Windows NT.
GetRandomRgn isn't really a function that
was designed to be public.
It was just an internal helper function that outsiders discovered
and relied upon, to the point that
it became a compatibility constraint so strong that it
turned into a de facto documented function.
And all the weirdness you see behind it is the weirdness of
a function never intended for public consumption.
The introduction of the Desktop Window Manager in Windows Vista
changed the way the visible region was managed (since all windows
are logically visible even when occluded because their drawing
is redirected to an off-screen surface),
but the GetRandomRgn function has to keep track
of the "visible region" anyway, for compatibility.
I don't know why the Close button went to the upper right
instead of going to the left of the other buttons,
but I'm going to guess.
(That's what I do around here most of the time anyway;
I just don't usually call it out.)
The corners of the screen are very valuable,
because users can target them with very little effort.
You just slam the mouse in the direction you want,
and the cursor goes into the corner.
And since closing a window is a much more common operation
than minimizing, maximizing, and restoring it,
it seems a natural choice to give the close button the
prized corner position.
Besides, maximizing and restoring a window
already have very large targets,
namely the entire caption.
You can double-click the caption to maximize,
and double-click again to restore.
The restore even gets you a little bit of Fitts's Law
because the top of the screen makes the height of the
caption bar effectively infinite.