History

  • The Old New Thing

    How did protected-mode 16-bit Windows fix up jumps to functions that got discarded?

    • 7 Comments

    Commenter Neil presumes that Windows 286 and later simply fixed up the movable entry table with jmp selector:offset instructions once and for all.

    It could have, but it went one step further.

    Recall that the point of the movable entry table is to provide a fixed location that always refers to a specific function, no matter where that function happens to be. This was necessary because real mode has no memory manager.

    But protected mode does have a memory manager. Why not let the memory manager do the work? That is, after all, its job.

    In protected-mode 16-bit Windows, the movable entry table was ignored. When one piece of code needed to reference another piece of code, it simply jumped to or called it by its selector:offset.

        push    ax
        call    0987:6543
    

    (Exercise: Why didn't I use call 1234:5678 as the sample address?)

    The selector was patched directly into the code as part of fixups. (We saw this several years ago in another context.)

    When a segment is relocated in memory, there is no stack walking to patch up return addresses to point to thunks, and no editing of the movable entry points to point to the new location. All that happens is that the base address in the descriptor table entry for the selector is updated to point to the new linear address of the segment. And when a segment is discarded, the descriptor table entry is marked not present, so that any future reference to it will raise a selector not present exception, which the kernel handles by reloading the selector.

    Things are a lot easier when you have a memory manager around. A lot of the head-exploding engineering in real-mode windows was in all the work of simulating a memory manager on a CPU that didn't have one!

  • The Old New Thing

    If 16-bit Windows had a single input queue, how did you debug applications on it?

    • 32 Comments

    After learning about the bad things that happened if you synchronized your application's input queue with its debugger, commenter kme wonders how debugging worked in 16-bit Windows, since 16-bit Windows didn't have asynchronous input? In 16-bit Windows, all applications shared the same input queue, which means you were permanently in the situation described in the original article, where the application and its debugger (and everything else) shared an input queue and therefore would constantly deadlock.

    The solution to UI deadlocks is to make sure the debugger doesn't have any UI.

    At the most basic level, the debugger communicated with the developer through the serial port. You connected a dumb terminal to the other end of the serial port. Mine was a Wyse 50 serial console terminal. All your debugging happened on the terminal. You could disassemble code, inspect and modify registers and memory, and even patch new code on the fly. If you wanted to consult source code, you needed to have a copy of it available somewhere else (like on your other computer). It was similar to using the cdb debugger, where the only commands available were r, db, eb, u, and a. Oh, and bp to set breakpoints.

    Now, if you were clever, you could use a terminal emulator program so you didn't need a dedicated physical terminal to do your debugging. You could connect the target computer to your development machine and view the disassembly and the source code on the same screen. But you weren't completely out of the woods, because what did you use to debug your development machine if it crashed? The dumb terminal, of course.¹

    Target machine
    Debugger
    Development machine
    Debugger
    Wyse 50
    dumb terminal

    I did pretty much all my Windows 95 debugging this way.

    If you didn't have two computers, another solution was to use a debugger like CodeView. CodeView avoided the UI deadlock problem by not using the GUI to present its UI. When you hit a breakpoint or otherwise halted execution of your application, CodeView talked directly to the video driver to save the first 4KB of video memory, then switched into text mode to tell you what happened. When you resumed execution, it restored the video memory, then switched the video card back into graphics mode, restored all the pixels it captured, then resumed execution as if nothing had happened. (If you were debugging a graphics problem, you could hit F3 to switch temporarily to graphics mode, so you could see what was on the screen.)

    If you were really fancy, you could spring for a monochrome adapter, either the original IBM one or the Hercules version, and tell CodeView to use that adapter for its debugging UI. That way, when you broke into the debugger, you could still see what was on the screen! We had multiple monitors before it was cool.

    ¹ Some people were crazy and cross-connected their target and development machines.

    Target machine
    Debugger
    Development machine
    Debugger

    This allowed them to use their target machine to debug their development machine and vice versa. But if your development machine crashed while it was debugging the target machine, then you were screwed.

  • The Old New Thing

    Why is the FAT driver called FASTFAT? Why would anybody ever write SLOWFAT?

    • 33 Comments

    Anon is interested in why the FAT driver is called FASTFAT.SYS. "Was there an earlier slower FAT driver? What could you possibly get so wrong with a FAT implementation that it needed to be chucked out?"

    The old FAT driver probably had a boring name like, um, FAT.SYS. At some point, somebody decided to write a newer, faster one, so they called it FASTFAT. And the name stuck.

    As for what you could possibly get so wrong with a FAT implementation that it needed to be improved: Remember that circumstances change over time. A design that works well under one set of conditions may start to buckle when placed under alternate conditions. It's not that the old implementation was wrong; it's just that conditions have changed, and the new implementation is better for the new conditions.

    For example, back in the old days, there were three versions of FAT: FAT8, FAT12, and FAT16. For such small disks, simple algorithms work just fine. In fact, they're preferable because a simple algorithm is easy to get right and is easier to debug. It also typically takes up a lot less space, and memory was at a premium in the old days. An O(n) algorithm is not a big deal if n never gets very large and the constants are small. Since FAT16 capped out at 65535 clusters per drive, there was a built-in limit on how big n could get. If a typical directory has only a few dozen files in it, then a linear scan is just fine.

    It's natural to choose algorithms that map directly to the on-disk data structures. (Remember, data structures determine algorithms.) FAT directories are just unsorted arrays of file names, so a simple directory searching function would just read through the directory one entry at a time until it found the matching file. Finding a free cluster is just a memory scan looking for a 0 in the allocation table. Memory management was simple: Don't try. Let the disk cache do it.

    These simple algorithms worked fine until FAT32 showed up and bumped n sky high. But fortunately, by the time that happened, computers were also faster and had more memory available, so you had more room to be ambitious.

    The big gains in FASTFAT came from algorithmic changes. For example, the on-disk data structures are transformed into more efficient in-memory data structures and cached. The first time you look in a directory, you need to do a linear search to collect all the file names, but if you cache them in a faster data structure (say, a hash table), subsequent accesses to the directory become much faster. And since computers now have more memory available, you can afford to keep a cache of directory entries around, as opposed to the old days where memory was tighter and large caches were a big no-no.

    (I wonder if any non-Microsoft FAT drivers do this sort of optimization, or whether they just do the obvious thing and use the disk data structures as memory data structures.)

    The original FAT driver was very good at solving the problem it was given, while staying within the limitations it was forced to operate under. It's just that over time, the problem changed, and the old solutions didn't hold up well any more. I guess it's a matter of interpretation whether this means that the old driver was "so wrong." If your child outgrows his toddler bed, does that mean the toddler bed was a horrible mistake?

  • The Old New Thing

    Why is there a 64KB no-man's land near the end of the user-mode address space?

    • 29 Comments

    We learned some time ago that there is a 64KB no-man's land near the 2GB boundary to accommodate a quirk of the Alpha AXP processor architecture. But that's not the only reason why it's there.

    The no-man's land near the 2GB boundary is useful even on x86 processors because it simplifies parameter validation at the boundary between user mode and kernel mode by taking out a special case. If the 64KB zone did not exist, then somebody could pass a buffer that straddles the 2GB boundary, and the kernel mode validation layer would have to detect that unusual condition and reject the buffer.

    By having a guaranteed invalid region, the kernel mode buffer validation code can simply validate that the starting address is below the 2GB boundary, then walk through the buffer checking each page. If somebody tries to straddle the boundary, the validation code will hit the permanently-invalid region and fail.

    Yes, this sounds like a micro-optimization, but I suspect this was not so much for optimization purposes as it was to remove weird boundary conditions, because weird boundary conditions are where the bugs tend to be.

    (Obviously, the no-man's land moves if you set the /3GB switch.)

  • The Old New Thing

    Why is 0x00400000 the default base address for an executable?

    • 31 Comments

    The default base address for a DLL is 0x10000000, but the default base address for an EXE is 0x00400000. Why that particular value for EXEs? What's so special about 4 megabytes

    It has to do with the amount of address space mapped by a single page directory entry on an x86 and a design decision made in 1987.

    The only technical requirement for the base address of an EXE is that it be a multiple of 64KB. But some choices for base address are better than others.

    The goal in choosing a base address is to minimize the likelihood that modules will have to be relocated. This means not colliding with things already in the address space (which will force you to relocate) as well as not colliding with things that may arrive in the address space later (forcing them to relocate). For an executable, the not colliding with things that may arrive later part means avoiding the region of the address space that tends to fill with DLLs. Since the operating system itself puts DLLs at high addresses and the default base address for non-operating system DLLs is 0x10000000, this means that the base address for the executable should be somewhere below 0x10000000, and the lower you go, the more room you have before you start colliding with DLLs. But how low can you go?

    The first part means that you also want to avoid the things that are already there. Windows NT didn't have a lot of stuff at low addresses. The only thing that was already there was a PAGE_NOACCESS page mapped at zero in order to catch null pointer accesses. Therefore, on Windows NT, you could base your executable at 0x00010000, and many applications did just that.

    But on Windows 95, there was a lot of stuff already there. The Windows 95 virtual machine manager permanently maps the first 64KB of physical memory to the first 64KB of virtual memory in order to avoid a CPU erratum. (Windows 95 had to work around a lot of CPU bugs and firmware bugs.) Furthermore, the entire first megabyte of virtual address space is mapped to the logical address space of the active virtual machine. (Nitpickers: actually a little more than a megabyte.) This mapping behavior is required by the x86 processor's virtual-8086 mode.

    Windows 95, like its predecessor Windows 3.1, runs Windows in a special virtual machine (known as the System VM), and for compatibility it still routes all sorts of things through 16-bit code just to make sure the decoy quacks the right way. Therefore, even when the CPU is running a Windows application (as opposed to an MS-DOS-based application), it still keeps the virtual machine mapping active so it doesn't have to do page remapping (and the expensive TLB flush that comes with it) every time it needs to go to the MS-DOS compatibility layer.

    Okay, so the first megabyte of address space is already off the table. What about the other three megabytes?

    Now we come back to that little hint at the top of the article.

    In order to make context switching fast, the Windows 3.1 virtual machine manager "rounds up" the per-VM context to 4MB. It does this so that a memory context switch can be performed by simply updating a single 32-bit value in the page directory. (Nitpickers: You also have to mark instance data pages, but that's just flipping a dozen or so bits.) This rounding causes us to lose three megabytes of address space, but given that there was four gigabytes of address space, a loss of less than one tenth of one percent was deemed a fair trade-off for the significant performance improvement. (Especially since no applications at the time came anywhere near beginning to scratch the surface of this limit. Your entire computer had only 2MB of RAM in the first place!)

    This memory map was carried forward into Windows 95, with some tweaks to handle separate address spaces for 32-bit Windows applications. Therefore, the lowest address an executable could be loaded on Windows 95 was at 4MB, which is 0x00400000.

    Geek trivia: To prevent Win32 applications from accessing the MS-DOS compatibility area, the flat data selector was actually an expand-down selector which stopped at the 4MB boundary. (Similarly, a null pointer in a 16-bit Windows application would result in an access violation because the null selector is invalid. It would not have accessed the interrupt vector table.)

    The linker chooses a default base address for executables of 0x0400000 so that the resulting binary can load without relocation on both Windows NT and Windows 95. Nobody really cares much about targeting Windows 95 any more, so in principle, the linker folks could choose a different default base address now. But there's no real incentive for doing it aside from making diagrams look prettier, especially since ASLR makes the whole issue moot anyway. And besides, if they changed it, then people would be asking, "How come some executables have a base address of 0x04000000 and some executables have a base address of 0x00010000?"

    TL;DR: To make context switching fast.

  • The Old New Thing

    What is the story of the mysterious DS_RECURSE dialog style?

    • 12 Comments

    There are a few references to the DS_RECURSE dialog style scattered throughout MSDN, and they are all of the form "Don't use it." But if you look in your copy of winuser.h, there is no sign of DS_RECURSE anywhere. This obviously makes it trivial to avoid using it because you couldn't use it even if you wanted it, seeing as it doesn't exist.

    "Do not push the red button on the control panel!"

    There is no red button on the control panel.

    "Well, that makes it easy not to push it."

    As with many of these types of stories, the answer is rather mundane.

    When nested dialogs were added to Windows 95, the flag to indicate that a dialog is a control host was DS_RECURSE. The name was intended to indicate that anybody who is walking a dialog looking for controls should recurse into this window, since it has more controls inside.

    The window manager folks later decided to change the name, and they changed it to DS_CONTROL. All documentation that was written before the renaming had to be revised to change all occurrences of DS_RECURSE to DS_CONTROL.

    It looks like they didn't quite catch them all: There are two straggling references in the Windows Embedded documentation. My guess is that the Windows Embedded team took a snapshot of the main Windows documentation, and they took their snapshot before the renaming was complete.

    Unfortunately, I don't have any contacts in the Windows Embedded documentation team, so I don't know whom to contact to get them to remove the references to flags that don't exist.

  • The Old New Thing

    What did Windows 3.1 do when you hit Ctrl+Alt+Del?

    • 29 Comments

    This is the end of Ctrl+Alt+Del week, a week that sort of happened around me and I had to catch up with.

    The Windows 3.1 virtual machine manager had a clever solution for avoiding deadlocks: There was only one synchronization object in the entire kernel. It was called "the critical section", with the definite article because there was only one. The nice thing about a system where the only available synchronization object is a single critical section is that deadlocks are impossible: The thread with the critical section will always be able to make progress because the only thing that could cause it to stop would be blocking on a synchronization object. But there is only one synchronization object (the critical section), and it already owns that.

    When you hit Ctrl+Alt+Del in Windows 3.1, a bunch of crazy stuff happened. All this work was in a separate driver, known as the virtual reboot device. By convention, all drivers in Windows 3.1 were called the virtual something device because their main job was to virtualize some hardware or other functionality. That's where the funny name VxD came from. It was short for virtual x device.

    First, the virtual reboot device driver checked which virtual machine had focus. If you were using an MS-DOS program, then it told all the device drivers to clean up whatever they were doing for that virtual machine, and then it terminated the virtual machine. This was the easy case.

    Otherwise, the focus was on a Windows application. Now things got messy.

    When the 16-bit Windows kernel started up, it gave the virtual reboot device the addresses of a few magic things. One of those magic things was a special byte that was set to 1 every time the 16-bit Windows scheduler regained control. When you hit Ctrl+Alt+Del, the virtual reboot device set the byte to 0, and it also registered a callback with the virtual machine manager to say "Call me back once the critical section becomes available." The callback didn't do anything aside from remember the fact that it was called at all. And then the code waited for ¾ seconds. (Why ¾ seconds? I have no idea.)

    After ¾ seconds, the virtual reboot device looked to see what the state of the machine was.

    If the "call me back once the critical section becomes available" callback was never called, then the problem is that a device driver is stuck in the critical section. Maybe the device driver put an Abort, Retry, Ignore message on the screen that the user needs to respond to. The user saw this message:


     Procomm 


    This background non-Windows application is not responding.

    *  Press any key to activate the non-Windows application.
    *  Press CTRL+ALT+DEL again to restart your computer. You will
       lose any unsaved information.


      Press any key to continue _

    After the user presses a key, focus was placed on the virtual machine that holds the critical section so the user can address the problem. A user who is still stuck can hit Ctrl+Alt+Del again to restart the whole process, and this time, execution will go into the "If you were using an MS-DOS program" paragraph, and the code will shut down the stuck virtual machine.

    If the critical section was not the problem, then the virtual reboot device checked if the 16-bit kernel scheduler had set the byte to 1 in the meantime. If so, then it means that no applications were hung, and you got the message


     Windows 


    Although you can use CTRL+ALT+DEL to quit an application that has stopped responding to the system, there is no application in this state.

    To quit an application, use the application's quit or exit command, or choose the Close command from the Control menu.

    *  Press any key to return to Windows.
    *  Press CTRL+ALT+DEL again to restart your computer. You will
       lose any unsaved information in all applications.


      Press any key to continue _

    (Anachronism alert: The System menu was called the Control menu back then.)

    Otherwise, the special byte was still 0, which means that the 16-bit scheduler never got control, which meant that a 16-bit Windows application was not releasing control back to the kernel. The virtual reboot device then waited for the virtual machine to finish processing any pending virtual interrupts. (This allowed any pending MS-DOS emulation or 16-bit MS-DOS device drivers to finish up their work.) If things did not return to this sane state within 3¼ seconds, then you got this screen:


     Windows 


    The system is either busy or has become unstable. You can wait and see if the system becomes available again and continue working or you can restart your computer.

    *  Press any key to return to Windows and wait.
    *  Press CTRL+ALT+DEL again to restart your computer. You will
       lose any unsaved information in all applications.


      Press any key to continue _

    Otherwise, we are in the case where the system returned to a state where there are no active virtual interrupts. The kernel single-stepped the processor if necessary until the instruction pointer was no longer in the kernel, or until it had single-stepped for 5000 instructions and the instruction pointer was not in the heap manager. (The heap manager was allowed to run for more than 5000 instructions.)

    At this point, you got the screen that Steve Ballmer wrote.



    Contoso Deluxe Music Composer


      This Windows application has stopped responding to the system.

      *  Press ESC to cancel and return to Windows.
      *  Press ENTER to close this application that is not responding.
         You will lose any unsaved information in this application.
      *  Press CTRL+ALT+DEL again to restart your computer. You will
         lose any unsaved information in all applications.


    If you hit Enter, then the 16-bit kernel terminated the application by doing mov ax, 4c00h followed by int 21h, which was the system call that applications used to exit normally. This time, the kernel is making the exit call on behalf of the stuck application. Everything looks like the application simply decided to exit normally.

    The stuck application exits, the kernel regains control, and hopefully, things return to normal.

    I should point out that I didn't write any of this code. "It was like that when I got here."

    Bonus chatter: There were various configuration settings to tweak all of the above behavior. For example, you could say that Ctrl+Alt+Del always restarted the computer rather than terminating the current application. Or you could skip the check whether the 16-bit kernel scheduler had set the byte to 1 so that you could use Ctrl+Alt+Del to terminate an application even if it wasn't hung.¹ There was also a setting to restart the computer upon receipt of an NMI, the intention being that the signal would be triggered either by a dedicated add-on switch or by poking a ball-point pen in just the right spot. (This is safer than just pushing the reset button because the restart would flush disk caches and shut down devices in an orderly manner.)

    ¹ This setting was intended for developers to assist in debugging their programs because if you went for this option, the program that got terminated is whichever one happened to have control of the CPU at the time you hit Ctrl+Alt+Del. This was, in theory, random, but in practice it often guessed right. That's because the problem was usually that a program got wedged into an infinite message loop, so most of the CPU was being run in the stuck application anyway.

  • The Old New Thing

    The history of Win32 critical sections so far

    • 19 Comments

    The CRITICAL_SECTION structure has gone through a lot of changes since its introduction back oh so many decades ago. The amazing thing is that as long as you stick to the documented API, your code is completely unaffected.

    Initially, the critical section object had an owner field to keep track of which thread entered the critical section, if any. It also had a lock count to keep track of how many times the owner thread entered the critical section, so that the critical section would be released when the matching number of Leave­Critical­Section calls was made. And there was an auto-reset event used to manage contention. We'll look more at that event later. (It's actually more complicated than this, but the details aren't important.)

    If you've ever looked at the innards of a critical section (for entertainment purposes only), you may have noticed that the lock count was off by one: The lock count was the number of active calls to Enter­Critical­Section minus one. The bias was needed because the original version of the interlocked increment and decrement operations returned only the sign of the result, not the revised value. Biasing the result by 1 means that all three states could be detected: Unlocked (negative), locked exactly once (zero), reentrant lock (positive). (It's actually more complicated than this, but the details aren't important.)

    If a thread tries to enter a critical section but can't because the critical section is owned by another thread, then it sits and waits on the contention event. When the owning thread releases all its claims on the critical section, it signals the event to say, "Okay, the door is unlocked. The next guy can come in."

    The contention event is allocated only when contention occurs. (This is what older versions of MSDN meant when they said that the event is "allocated on demand.") Which leads to a nasty problem: What if contention occurs, but the attempt to create the contention event fails? Originally, the answer was "The kernel raises an out-of-memory exception."

    Now you'd think that a clever program could catch this exception and try to recover from it, say, by unwinding everything that led up to the exception. Unfortunately, the weakest link in the chain is the critical section object itself: In the original version of the code, the out-of-memory exception was raised while the critical section was in an unstable state. Even if you managed to catch the exception and unwind everything you could, the critical section was itself irretrievably corrupted.

    Another problem with the original design became apparent on multiprocessor systems: If a critical section was typically held for a very brief time, then by the time you called into kernel to wait on the contention event, the critical section was already freed!

    void SetGuid(REFGUID guid)
    {
     EnterCriticalSection(&g_csGuidUpdate);
     g_theGuid = guid;
     LeaveCriticalSection(&g_csGuidUpdate);
    }
    
    void GetGuid(GUID *pguid)
    {
     EnterCriticalSection(&g_csGuidUpdate);
     *pguid = g_theGuid;
     LeaveCriticalSection(&g_csGuidUpdate);
    }
    

    This imaginary code uses a critical section to protect accesses to a GUID. The actual protected region is just nine instructions long. Setting up to wait on a kernel object is way, way more than nine instructions. If the second thread immediately waited on the critical section contention event, it would find that by the time the kernel got around to entering the wait state, the event would say, "Dude, what took you so long? I was signaleded, like, a bazillion cycles ago!"

    Windows 2000 added the Initialize­Critical­Section­And­Spin­Count function, which lets you avoid the problem where waiting for a critical section costs more than the code the critical section was protecting. If you initialize with a spin count, then when a thread tries to enter the critical section and can't, it goes into a loop trying to enter it over and over again, in the hopes that it will be released.

    "Are we there yet? How about now? How about now? How about now? How about now? How about now? How about now? How about now? How about now? How about now? How about now? How about now?"

    If the critical section is not released after the requested number of iterations, then the old slow wait code is executed.

    Note that spinning on a critical section is helpful only on multiprocessor systems, and only in the case where you know that all the protected code segments are very short in duration. If the critical section is held for a long time, then spinning is wasteful since the odds that the critical section will become free during the spin cycle are very low, and you wasted a bunch of CPU.

    Another feature added in Windows 2000 is the ability to preallocate the contention event. This avoids the dreaded "out of memory" exception in Enter­Critical­Section, but at a cost of a kernel event for every critical section, whether actual contention occurs or not.

    Windows XP solved the problem of the dreaded "out of memory" exception by using a fallback algorithm that could be used if the contention event could not be allocated. The fallback algorithm is not as efficient, but at least it avoids the "out of memory" situation. (Which is a good thing, because nobody really expects Enter­Critical­Section to fail.) This also means that requests for the contention event to be preallocated are now ignored, since the reason for preallocating (avoiding the "out of memory" exception) no longer exists.

    (And while they were there, the kernel folks also fixed Initialize­Critical­Section so that a failed initialization left the critical section object in a stable state so you could safely try again later.)

    In Windows Vista, the internals of the critical section object were rejiggered once again, this time to add convoy resistance. The internal bookkeeping completely changed; the lock count is no longer a 1-biased count of the number of Enter­Critical­Section calls which are pending. As a special concession to backward compatibility with people who violated the API contract and looked directly at the internal data structures, the new algorithm goes to some extra effort to ensure that if a program breaks the rules and looks at a specific offset inside the critical section object, the value stored there is −1 if and only if the critical section is unlocked.

    Often, people will remark that "your compatibility problems would go away if you just open-sourced the operating system." I think there is some confusion over what "go away" means. If you release the source code to the operating system, it makes it even easier for people to take undocumented dependencies on it, because they no longer have the barrier of "Well, I can't find any documentation, so maybe it's not documented." They can just read the source code and say, "Oh, I see that if the critical section is unlocked, the Lock­Count variable has the value −1." Boom, instant undocumented dependency. Compatibility is screwed. (Unless what people are saying "your compatibility problems would go away if you just open-sourced all applications, so that these problems can be identified and fixed as soon as they are discovered.")

    Exercise: Why isn't it important that the fallback algorithm be highly efficient?

  • The Old New Thing

    I wrote the original blue screen of death, sort of

    • 45 Comments

    We pick up the story with Windows 95. As I noted, the blue Ctrl+Alt+Del dialog was introduced in Windows 3.1, and in Windows 95; it was already gone. In Windows 95, hitting Ctrl+Alt+Del called up a dialog box that looked something like this:

    Close Program × 
    Explorer
    Contoso Deluxe Composer [not responding]
    Fabrikam Chart 2.0
    LitWare Chess Challenger
    Systray
    WARNING: Pressing CTRL+ALT+DEL again will restart your computer. You will lose unsaved information in all programs that are running.
    End Task
    Shut Down
    Cancel

    (We learned about Systray some time ago.)

    Whereas Windows 3.1 responded to fatal errors by crashing out to a black screen, Windows 95 switched to showing severe errors in blue. And I'm the one who wrote it. Or at least modified it last.

    I was responsible for the code that displayed blue screen messages: Asking the kernel-mode video driver to switch into text mode, filling the screen with a blue background, drawing white text, waiting for the user to press a key, restoring the screen to its original contents, and reporting the user's response back to the component that asked to display the message.¹

    When a device driver crashed, Windows 95 tried its best to limp along despite a catastrophic failure in a kernel-mode component. It wasn't a blue screen of death so much as a blue screen of lameness. For those fortunate never to have seen one, it looked like this:



     Windows 


    An exception 0D has occurred at 0028:80014812. This was called from 0028:80014C34. It may be possible to continue normally.

    * Press any key to attempt to continue.
    * Press CTRL+ALT+DEL to restart your computer. You will
      lose any unsaved information in all applications.

    Note the optimistic message "It may be possible to continue normally." Everybody forgets that after Windows 95 showed you a blue screen error, it tried its best to ignore the error and keep running anyway. I mean, sure your scanner driver crashed, so scanning doesn't work any more, but the rest of the system seems to be okay.

    (Imagine if you did that today. "Press any key to ignore this kernel panic.")

    Technically, what happened was that the virtual machine manager abandoned the event currently in progress and returned to the event dispatcher. It's the kernel-mode equivalent to swallowing exceptions in window procedures and returning to the message loop. If there was no event in progress, then the current application was terminated.

    Sometimes the problem was global, and abandoning the current event or terminating the application did nothing to solve the problem; all that happened was that the next event or application to run encountered the same problem, and you got another blue screen message a few milliseconds later. After about a half dozen of these messages, you most likely gave up hope and hit Ctrl+Alt+Del.

    Now, that's what the message looked like originally, but that message had a problem: Since the addresses at which device drivers were loaded into the kernel were not predictable, having the raw address didn't really tell you much. If you were someone who was told, "This senior executive got this crash message, can you figure out what happened?", all you had to work with was a bunch of mostly useless numbers.

    That someone might have been me.

    To help with this problem, I tweaked the message to include the name of the driver, the section number, and the offset within the section.



     Windows 


    An exception 0D has occurred at 0028:80014812 in VxD CONTOSO(03) + 00000152. This was called from 0028:80014C34 in VxD CONTOSO(03) + 00000574. It may be possible to continue normally.

    * Press any key to attempt to continue.
    * Press CTRL+ALT+DEL to restart your computer. You will
      lose any unsaved information in all applications.

    Now you had the name of the driver that crashed, which might give you a clue of where the problem is, even if you knew nothing else. And somebody with access to a MAP file for the driver could now look up the address and identify which line crashed. Not great, but better than nothing, and before I made this change, nothing is what you had.

    So you could say that I wrote the Windows 95 blue screen of death lameness to make my own life easier.

    Bonus chatter: Later, someone (I forget whether it was me, so let's say it was one of my colleagues) added some more code to inspect the crashing address, and if it was inside the kernel heap manager, the message changed slightly:



     Windows 


    A 32-bit device driver has corrupted critical system memory, resulting in an exception 0D at 0028:80001812 in VxD VMM(01) + 00001812. This was called from 0028:80014C34 in VxD CONTOSO(03) + 00000575.

    * Press any key to attempt to continue.
    * Press CTRL+ALT+DEL to restart your computer. You will
      lose any unsaved information in all applications.

    In this case, the sentence "It may be possible to continue normally" disappeared. Because we knew that, odds are, it won't be.

    Bonus chatter: Nice job, Slashdot. You considered posting a correction, but your correction was also wrong. At least you realized your mistake.

    ¹ Since this code ran in the kernel, it didn't have access to keyboard layout information. It doesn't know that if you are using the Chinese-Bopomofo keyboard layout, then the way to type "OK" is to press C, followed by L, followed by 3. Not that it would help, because there is no IME in the kernel anyway. As much as possible, the responses were mapped to language-independent keys like Enter and ESC.

  • The Old New Thing

    Steve Ballmer did not write the text for the blue screen of death

    • 61 Comments

    Somehow, it ended up widely reported that Steve Ballmer wrote the blue screen of death. And all of those articles cited my article titled "Who wrote the text for the Ctrl+Alt+Del dialog in Windows 3.1?" Somehow, everybody decided to ignore that I wrote "Ctrl+Alt+Del dialog" and replace it with what they wanted to hear: "blue screen of death".¹

    Note also that people are somehow blaming the person who wrote the text of the error message for the fact that the message appears in the first place. It's like blaming Pat Fleet for AT&T's crappy service. It must suck being the person whose job it is to write error messages.

    Anyway, the Ctrl+Alt+Del dialog was not the blue screen of death. I mean, it had a Cancel option, for goodness's sake. What message of death comes with a Cancel option?

    Grim Reaper: I am here to claim your soul.

    You: No thanks.

    Grim Reaper: Oh, well, in that case, sorry to have bothered you.

    Also, blue screen error messages were not new to Windows 3.1. They existed in Windows 3.0, too. What was new in Windows 3.1 was a special handler for Ctrl+Alt+Del which tried to identify the program that was not responding and give you the opportunity to terminate it. Windows itself was fine; it was just the program that was in trouble.

    Recall that Windows in Enhanced mode ran three operating systems simultaneously, A copy of Standard mode Windows ran inside one of the virtual machines, and all your MS-DOS applications ran in the other virtual machines. These blue screen messages came from the virtual machine manager.

    If you had a single-floppy system, the two logical drives A: and B: were shared by the single physical floppy drive. When a program switched from accessing drive A: to drive B:, or vice versa, Windows prompted you to insert the disk for the new drive:



     XyWrite 


      Please Insert Diskette for drive B:


      Press any key to continue _

    Another job of the virtual machine manager is to arbitrate access to physical hardware. As long as two virtual machines didn't try to access the same resource simultaneously, the arbitration could be resolved invisibly. But if two virtual machines tried to access, say, the serial port at the same time, Windows alerted you to the conflict and asked you which virtual machines should be granted access and which should be blocked. It looked like this:



     Device Conflict 


    'XyWrite' is attempting to use the COM1 device, which 'Procomm' is currently using. Do you want 'XyWrite' to take control of the device?


    Press Y for Yes or N for No: _

    Windows 3.1 didn't have a blue screen of death. If an MS-DOS application crashed, you got a blue screen message saying that the application crashed, but Windows kept running. If it was a device driver that crashed, then Windows 3.1 shut down the virtual machine that the device driver was running in. But if the device driver crashed the Windows virtual machine, then the entire virtual machine manager shut itself down, sometimes (but not always) printed a brief text message, and handed control back to MS-DOS. So you might say that it was a black screen of death.

    Could not continue running Windows because of paging error.

    C:\>_



















    The window of opportunity for seeing the blue Ctrl+Alt+Del dialog was quite small: You basically had to be running Windows 3.1 or Windows 3.11.

    Next time, we'll see who actually wrote Windows 95 blue screen of death. Spoiler alert: It was me.

    ¹ I like how The Register wrote "Microsoft has revealed" and "Redmond's Old new thing blog", once again turning me into an official company spokesperson. (They also chose not to identify me by name, which may be a good thing.)

    DailyTech identified me as a Microsoft executive. I'm still waiting for that promotion letter. (And, more important, the promotion pay raise.)

    First honorable mention goes to Engadget for illustrating the article with the Windows 95 dialog that Steve Ballmer didn't write. I mean, I had a copy of the screen in my article, and they decided to show some other screen instead. I gues nobody bothered to verify that the dialogs matched.

    Second honorable mention goes jointly to Gizmodo and lifehacker for illustrating the article with the Windows NT blue screen of death. Not only was it the wrong dialog, it was the wrong operating system. Nice one, guys.

    And special mention goes to BGR, who entirely fabricated a scenario and posited it as real: "What longtime Windows user can forget the panic that set in the first time their entire screen went blue for no explicable reason and was informed that 'This Windows application has stopped responding to the system.'" Yeah, that never happened. The Ctrl+Alt+Del dialog did not appear spontaneously; you had to press Ctrl+Alt+Del to get it. The answer to their question "Who remembers this?" is "Nobody." BGR also titled their article "It turns out Steve Ballmer was directly responsible for developing Windows most hated feature." He didn't develop the feature. He just wrote the text. I also wonder why giving the user the opportunity to terminate a runaway program is the most-hated feature of Windows. Sorry for giving you some control over your computer. I guess they would prefer things the way Windows 3.0 did it: A runaway program forces you to reboot.

Page 1 of 50 (496 items) 12345»