History

  • The Old New Thing

    The history of Win32 critical sections so far

    The CRITICAL_SECTION structure has gone through a lot of changes since its introduction back oh so many decades ago. The amazing thing is that as long as you stick to the documented API, your code is completely unaffected.

    Initially, the critical section object had an owner field to keep track of which thread entered the critical section, if any. It also had a lock count to keep track of how many times the owner thread entered the critical section, so that the critical section would be released when the matching number of Leave­Critical­Section calls was made. And there was an auto-reset event used to manage contention. We'll look more at that event later. (It's actually more complicated than this, but the details aren't important.)

    If you've ever looked at the innards of a critical section (for entertainment purposes only), you may have noticed that the lock count was off by one: The lock count was the number of active calls to Enter­Critical­Section minus one. The bias was needed because the original version of the interlocked increment and decrement operations returned only the sign of the result, not the revised value. Biasing the result by 1 means that all three states could be detected: Unlocked (negative), locked exactly once (zero), reentrant lock (positive). (It's actually more complicated than this, but the details aren't important.)
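
    To see why the bias helps, here is a minimal sketch in portable C. It is not the actual Windows code: sign_after_add is a hypothetical stand-in for the early interlocked operations (it is not even thread-safe), and the toy_ names are made up for illustration.

```c
#include <assert.h>

/* Model of the early interlocked operations: they updated the value
   but told the caller only the sign of the result. Single-threaded
   stand-in, for illustration only. */
static int sign_after_add(int *value, int delta)
{
    *value += delta;
    return (*value > 0) - (*value < 0);   /* -1, 0, or +1 */
}

/* Lock count biased by one: -1 means "nobody is inside". */
typedef struct { int lock_count; } toy_cs;

static void toy_init(toy_cs *cs) { cs->lock_count = -1; }

/* Returns 0 if this call took the lock for the first time,
   +1 if the lock was already held (re-entry or contention). */
static int toy_enter(toy_cs *cs)
{
    return sign_after_add(&cs->lock_count, +1);
}

/* Returns -1 if the lock is now free, 0 if exactly one claim remains,
   +1 if more than one claim remains. */
static int toy_leave(toy_cs *cs)
{
    assert(cs->lock_count >= 0);          /* must currently be held */
    return sign_after_add(&cs->lock_count, -1);
}
```

    If the count started at zero instead of −1, the first and second calls to toy_enter would both report "positive", and the sign alone could not tell a fresh acquisition from a re-entry.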

    If a thread tries to enter a critical section but can't because the critical section is owned by another thread, then it sits and waits on the contention event. When the owning thread releases all its claims on the critical section, it signals the event to say, "Okay, the door is unlocked. The next guy can come in."

    The contention event is allocated only when contention occurs. (This is what older versions of MSDN meant when they said that the event is "allocated on demand.") Which leads to a nasty problem: What if contention occurs, but the attempt to create the contention event fails? Originally, the answer was "The kernel raises an out-of-memory exception."

    Now you'd think that a clever program could catch this exception and try to recover from it, say, by unwinding everything that led up to the exception. Unfortunately, the weakest link in the chain is the critical section object itself: In the original version of the code, the out-of-memory exception was raised while the critical section was in an unstable state. Even if you managed to catch the exception and unwind everything you could, the critical section was itself irretrievably corrupted.

    Another problem with the original design became apparent on multiprocessor systems: If a critical section was typically held for a very brief time, then by the time you called into the kernel to wait on the contention event, the critical section was already freed!

    void SetGuid(REFGUID guid)
    {
     EnterCriticalSection(&g_csGuidUpdate);
     g_theGuid = guid;
     LeaveCriticalSection(&g_csGuidUpdate);
    }
    
    void GetGuid(GUID *pguid)
    {
     EnterCriticalSection(&g_csGuidUpdate);
     *pguid = g_theGuid;
     LeaveCriticalSection(&g_csGuidUpdate);
    }
    

    This imaginary code uses a critical section to protect accesses to a GUID. The actual protected region is just nine instructions long. Setting up to wait on a kernel object is way, way more than nine instructions. If the second thread immediately waited on the critical section contention event, it would find that by the time the kernel got around to entering the wait state, the event would say, "Dude, what took you so long? I was signaled, like, a bazillion cycles ago!"

    Windows 2000 added the Initialize­Critical­Section­And­Spin­Count function, which lets you avoid the problem where waiting for a critical section costs more than the code the critical section was protecting. If you initialize with a spin count, then when a thread tries to enter the critical section and can't, it goes into a loop trying to enter it over and over again, in the hopes that it will be released.

    "Are we there yet? How about now? How about now? How about now? How about now? How about now? How about now? How about now? How about now? How about now? How about now? How about now?"

    If the critical section is not released after the requested number of iterations, then the old slow wait code is executed.

    Note that spinning on a critical section is helpful only on multiprocessor systems, and only in the case where you know that all the protected code segments are very short in duration. If the critical section is held for a long time, then spinning is wasteful since the odds that the critical section will become free during the spin cycle are very low, and you wasted a bunch of CPU.
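
    The spin-then-wait idea can be sketched in portable C11. This is only an illustration of the shape of the algorithm, not how the real critical section is implemented: toy_spin_lock and its functions are hypothetical names, and the slow path is reduced to "return false and let the caller go wait."

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock: the flag is set while some thread is inside. */
typedef struct {
    atomic_flag held;
    unsigned    spin_count;   /* quick retries allowed before giving up */
} toy_spin_lock;

static void toy_spin_init(toy_spin_lock *lock, unsigned spin_count)
{
    assert(lock);
    atomic_flag_clear(&lock->held);
    lock->spin_count = spin_count;
}

/* Try once, then retry up to spin_count more times.
   Returns true if the lock was acquired. Returns false if it is still
   busy, in which case the caller falls back to the slow path (in the
   article, waiting on the contention event). */
static bool toy_spin_try_enter(toy_spin_lock *lock)
{
    for (unsigned attempt = 0; attempt <= lock->spin_count; attempt++) {
        if (!atomic_flag_test_and_set_explicit(&lock->held,
                                               memory_order_acquire)) {
            return true;      /* "How about now?" Yes. */
        }
    }
    return false;             /* still held after all that asking */
}

static void toy_spin_leave(toy_spin_lock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

    With the real API, you ask for this behavior by passing a nonzero spin count to InitializeCriticalSectionAndSpinCount; the rest of your code keeps calling EnterCriticalSection and LeaveCriticalSection as before.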

    Another feature added in Windows 2000 is the ability to preallocate the contention event. This avoids the dreaded "out of memory" exception in Enter­Critical­Section, but at a cost of a kernel event for every critical section, whether actual contention occurs or not.

    Windows XP solved the problem of the dreaded "out of memory" exception by using a fallback algorithm that could be used if the contention event could not be allocated. The fallback algorithm is not as efficient, but at least it avoids the "out of memory" situation. (Which is a good thing, because nobody really expects Enter­Critical­Section to fail.) This also means that requests for the contention event to be preallocated are now ignored, since the reason for preallocating (avoiding the "out of memory" exception) no longer exists.

    (And while they were there, the kernel folks also fixed Initialize­Critical­Section so that a failed initialization left the critical section object in a stable state so you could safely try again later.)

    In Windows Vista, the internals of the critical section object were rejiggered once again, this time to add convoy resistance. The internal bookkeeping completely changed; the lock count is no longer a 1-biased count of the number of Enter­Critical­Section calls which are pending. As a special concession to backward compatibility with people who violated the API contract and looked directly at the internal data structures, the new algorithm goes to some extra effort to ensure that if a program breaks the rules and looks at a specific offset inside the critical section object, the value stored there is −1 if and only if the critical section is unlocked.

    Often, people will remark that "your compatibility problems would go away if you just open-sourced the operating system." I think there is some confusion over what "go away" means. If you release the source code to the operating system, it makes it even easier for people to take undocumented dependencies on it, because they no longer have the barrier of "Well, I can't find any documentation, so maybe it's not documented." They can just read the source code and say, "Oh, I see that if the critical section is unlocked, the LockCount variable has the value −1." Boom, instant undocumented dependency. Compatibility is screwed. (Unless what people are saying is "your compatibility problems would go away if you just open-sourced all applications, so that these problems can be identified and fixed as soon as they are discovered.")

    Exercise: Why isn't it important that the fallback algorithm be highly efficient?

  • The Old New Thing

    I wrote the original blue screen of death, sort of

    We pick up the story with Windows 95. As I noted, the blue Ctrl+Alt+Del dialog was introduced in Windows 3.1, and in Windows 95, it was already gone. In Windows 95, hitting Ctrl+Alt+Del called up a dialog box that looked something like this:

    Close Program × 
    Explorer
    Contoso Deluxe Composer [not responding]
    Fabrikam Chart 2.0
    LitWare Chess Challenger
    Systray
    WARNING: Pressing CTRL+ALT+DEL again will restart your computer. You will lose unsaved information in all programs that are running.
    End Task
    Shut Down
    Cancel

    (We learned about Systray some time ago.)

    Whereas Windows 3.1 responded to fatal errors by crashing out to a black screen, Windows 95 switched to showing severe errors in blue. And I'm the one who wrote it. Or at least modified it last.

    I was responsible for the code that displayed blue screen messages: Asking the kernel-mode video driver to switch into text mode, filling the screen with a blue background, drawing white text, waiting for the user to press a key, restoring the screen to its original contents, and reporting the user's response back to the component that asked to display the message.¹

    When a device driver crashed, Windows 95 tried its best to limp along despite a catastrophic failure in a kernel-mode component. It wasn't a blue screen of death so much as a blue screen of lameness. For those fortunate never to have seen one, it looked like this:



     Windows 


    An exception 0D has occurred at 0028:80014812. This was called from 0028:80014C34. It may be possible to continue normally.

    * Press any key to attempt to continue.
    * Press CTRL+ALT+DEL to restart your computer. You will
      lose any unsaved information in all applications.

    Note the optimistic message "It may be possible to continue normally." Everybody forgets that after Windows 95 showed you a blue screen error, it tried its best to ignore the error and keep running anyway. I mean, sure your scanner driver crashed, so scanning doesn't work any more, but the rest of the system seems to be okay.

    (Imagine if you did that today. "Press any key to ignore this kernel panic.")

    Technically, what happened was that the virtual machine manager abandoned the event currently in progress and returned to the event dispatcher. It's the kernel-mode equivalent to swallowing exceptions in window procedures and returning to the message loop. If there was no event in progress, then the current application was terminated.

    Sometimes the problem was global, and abandoning the current event or terminating the application did nothing to solve the problem; all that happened was that the next event or application to run encountered the same problem, and you got another blue screen message a few milliseconds later. After about a half dozen of these messages, you most likely gave up hope and hit Ctrl+Alt+Del.

    Now, that's what the message looked like originally, but that message had a problem: Since the addresses at which device drivers were loaded into the kernel were not predictable, having the raw address didn't really tell you much. If you were someone who was told, "This senior executive got this crash message, can you figure out what happened?", all you had to work with was a bunch of mostly useless numbers.

    That someone might have been me.

    To help with this problem, I tweaked the message to include the name of the driver, the section number, and the offset within the section.



     Windows 


    An exception 0D has occurred at 0028:80014812 in VxD CONTOSO(03) + 00000152. This was called from 0028:80014C34 in VxD CONTOSO(03) + 00000574. It may be possible to continue normally.

    * Press any key to attempt to continue.
    * Press CTRL+ALT+DEL to restart your computer. You will
      lose any unsaved information in all applications.

    Now you had the name of the driver that crashed, which might give you a clue of where the problem is, even if you knew nothing else. And somebody with access to a MAP file for the driver could now look up the address and identify which line crashed. Not great, but better than nothing, and before I made this change, nothing is what you had.
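
    The lookup behind the improved message amounts to finding which loaded section contains the address and subtracting the section's base. Here is a rough sketch in C. The table is entirely hypothetical: the base addresses and sizes were picked so that the numbers match the sample screen, and find_section and format_location are made-up names, not anything from the Windows 95 sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One loaded section of a driver. */
typedef struct {
    const char *driver;   /* e.g. "CONTOSO" */
    unsigned    section;  /* section number within the driver */
    uint32_t    base;     /* linear address where the section was loaded */
    uint32_t    size;     /* section length in bytes (made-up value) */
} loaded_section;

/* Hypothetical load map, chosen to reproduce the sample screen. */
static const loaded_section g_sections[] = {
    { "VMM",     1, 0x80000000u, 0x00004000u },
    { "CONTOSO", 3, 0x800146C0u, 0x00001000u },
};

/* Find the section containing address and compute the offset into it.
   Returns NULL if the address is not inside any known section. */
static const loaded_section *find_section(uint32_t address, uint32_t *offset)
{
    assert(offset != NULL);
    for (size_t i = 0; i < sizeof g_sections / sizeof g_sections[0]; i++) {
        const loaded_section *s = &g_sections[i];
        if (address >= s->base && address - s->base < s->size) {
            *offset = address - s->base;
            return s;
        }
    }
    return NULL;
}

/* Write "in VxD NAME(SS) + OFFSET" into out. Returns the length that
   snprintf reports, or 0 if the address could not be identified. */
static int format_location(uint32_t address, char *out, size_t out_size)
{
    uint32_t offset;
    const loaded_section *s = find_section(address, &offset);
    if (s == NULL) return 0;
    return snprintf(out, out_size, "in VxD %s(%02u) + %08lX",
                    s->driver, s->section, (unsigned long)offset);
}
```

    With this table, the address 80014812 comes out as CONTOSO(03) + 00000152 and 80014C34 comes out as CONTOSO(03) + 00000574. Those offsets are the ones somebody with the MAP file would look up.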

    So you could say that I wrote the Windows 95 blue screen of death lameness to make my own life easier.

    Bonus chatter: Later, someone (I forget whether it was me, so let's say it was one of my colleagues) added some more code to inspect the crashing address, and if it was inside the kernel heap manager, the message changed slightly:



     Windows 


    A 32-bit device driver has corrupted critical system memory, resulting in an exception 0D at 0028:80001812 in VxD VMM(01) + 00001812. This was called from 0028:80014C34 in VxD CONTOSO(03) + 00000574.

    * Press any key to attempt to continue.
    * Press CTRL+ALT+DEL to restart your computer. You will
      lose any unsaved information in all applications.

    In this case, the sentence "It may be possible to continue normally" disappeared. Because we knew that, odds are, it won't be.

    Bonus chatter: Nice job, Slashdot. You considered posting a correction, but your correction was also wrong. At least you realized your mistake.

    ¹ Since this code ran in the kernel, it didn't have access to keyboard layout information. It doesn't know that if you are using the Chinese-Bopomofo keyboard layout, then the way to type "OK" is to press C, followed by L, followed by 3. Not that it would help, because there is no IME in the kernel anyway. As much as possible, the responses were mapped to language-independent keys like Enter and ESC.

  • The Old New Thing

    Steve Ballmer did not write the text for the blue screen of death

    Somehow, it ended up widely reported that Steve Ballmer wrote the blue screen of death. And all of those articles cited my article titled "Who wrote the text for the Ctrl+Alt+Del dialog in Windows 3.1?" Somehow, everybody decided to ignore that I wrote "Ctrl+Alt+Del dialog" and replace it with what they wanted to hear: "blue screen of death".¹

    Note also that people are somehow blaming the person who wrote the text of the error message for the fact that the message appears in the first place. It's like blaming Pat Fleet for AT&T's crappy service. It must suck being the person whose job it is to write error messages.

    Anyway, the Ctrl+Alt+Del dialog was not the blue screen of death. I mean, it had a Cancel option, for goodness's sake. What message of death comes with a Cancel option?

    Grim Reaper: I am here to claim your soul.

    You: No thanks.

    Grim Reaper: Oh, well, in that case, sorry to have bothered you.

    Also, blue screen error messages were not new to Windows 3.1. They existed in Windows 3.0, too. What was new in Windows 3.1 was a special handler for Ctrl+Alt+Del which tried to identify the program that was not responding and give you the opportunity to terminate it. Windows itself was fine; it was just the program that was in trouble.

    Recall that Windows in Enhanced mode ran three operating systems simultaneously. A copy of Standard mode Windows ran inside one of the virtual machines, and all your MS-DOS applications ran in the other virtual machines. These blue screen messages came from the virtual machine manager.

    If you had a single-floppy system, the two logical drives A: and B: were shared by the single physical floppy drive. When a program switched from accessing drive A: to drive B:, or vice versa, Windows prompted you to insert the disk for the new drive:



     XyWrite 


      Please Insert Diskette for drive B:


      Press any key to continue _

    Another job of the virtual machine manager is to arbitrate access to physical hardware. As long as two virtual machines didn't try to access the same resource simultaneously, the arbitration could be resolved invisibly. But if two virtual machines tried to access, say, the serial port at the same time, Windows alerted you to the conflict and asked you which virtual machines should be granted access and which should be blocked. It looked like this:



     Device Conflict 


    'XyWrite' is attempting to use the COM1 device, which 'Procomm' is currently using. Do you want 'XyWrite' to take control of the device?


    Press Y for Yes or N for No: _

    Windows 3.1 didn't have a blue screen of death. If an MS-DOS application crashed, you got a blue screen message saying that the application crashed, but Windows kept running. If it was a device driver that crashed, then Windows 3.1 shut down the virtual machine that the device driver was running in. But if the device driver crashed the Windows virtual machine, then the entire virtual machine manager shut itself down, sometimes (but not always) printed a brief text message, and handed control back to MS-DOS. So you might say that it was a black screen of death.

    Could not continue running Windows because of paging error.

    C:\>_

    The window of opportunity for seeing the blue Ctrl+Alt+Del dialog was quite small: You basically had to be running Windows 3.1 or Windows 3.11.

    Next time, we'll see who actually wrote the Windows 95 blue screen of death. Spoiler alert: It was me.

    ¹ I like how The Register wrote "Microsoft has revealed" and "Redmond's Old new thing blog", once again turning me into an official company spokesperson. (They also chose not to identify me by name, which may be a good thing.)

    DailyTech identified me as a Microsoft executive. I'm still waiting for that promotion letter. (And, more important, the promotion pay raise.)

    First honorable mention goes to Engadget for illustrating the article with the Windows 95 dialog that Steve Ballmer didn't write. I mean, I had a copy of the screen in my article, and they decided to show some other screen instead. I guess nobody bothered to verify that the dialogs matched.

    Second honorable mention goes jointly to Gizmodo and lifehacker for illustrating the article with the Windows NT blue screen of death. Not only was it the wrong dialog, it was the wrong operating system. Nice one, guys.

    And special mention goes to BGR, who entirely fabricated a scenario and posited it as real: "What longtime Windows user can forget the panic that set in the first time their entire screen went blue for no explicable reason and was informed that 'This Windows application has stopped responding to the system.'" Yeah, that never happened. The Ctrl+Alt+Del dialog did not appear spontaneously; you had to press Ctrl+Alt+Del to get it. The answer to their question "Who remembers this?" is "Nobody." BGR also titled their article "It turns out Steve Ballmer was directly responsible for developing Windows most hated feature." He didn't develop the feature. He just wrote the text. I also wonder why giving the user the opportunity to terminate a runaway program is the most-hated feature of Windows. Sorry for giving you some control over your computer. I guess they would prefer things the way Windows 3.0 did it: A runaway program forces you to reboot.

  • The Old New Thing

    Who wrote the text for the Ctrl+Alt+Del dialog in Windows 3.1?

    One of the differences between standard-mode Windows and enhanced-mode Windows was what happened when you hit Ctrl+Alt+Del. Since 16-bit Windows applications are co-operatively multi-tasked, it is easy to determine whether the system is responding, and if not, it is also easy to identify the application which is responsible. In that case, Windows gave you options to close the non-responsive application, restart the computer, or cancel.

    During this time period, Steve Ballmer was head of the Systems Division, and he paid a visit to the Windows team to see what they were up to, as is the wont of many executives.¹ When they showed him the Ctrl+Alt+Del feature, he nodded thoughtfully and added, "This is nice, but I don't like the text of the message. It doesn't sound right to me."

    "Okay, Steve. If you think you can do a better job, then go for it." Unlike some other executive, Steve took up the challenge, and a few days later, he emailed what he thought the Ctrl+Alt+Del screen should say.

    The text he came up with was actually quite good, and it went into the product pretty much word for word.



    Contoso Deluxe Music Composer


      This Windows application has stopped responding to the system.

      *  Press ESC to cancel and return to Windows.
      *  Press ENTER to close this application that is not responding.
         You will lose any unsaved information in this application.
      *  Press CTRL+ALT+DEL again to restart your computer. You will
         lose any unsaved information in all applications.


    Note to journalists: This is the Ctrl+Alt+Del dialog, not the blue screen of death. Thank you for paying attention.

    ¹ It occurred to me only as I wrote up this entry that people took the phrase Right on top of my notepad from the earlier story literally: There was a chair, the chair had a notepad on its seat, Bill sat in the chair (on top of the notepad). That interpretation never occurred to me. From the description in the previous paragraph, it was apparent to me that the notepad was on a desk, and Bill's choice of seat blocked access to the notepad. (I.e., the manager would have to reach around Bill to get the notepad.) The person telling the story is not a native English speaker, so there may have been a preposition translation issue.

  • The Old New Thing

    It's time we face reality, my friends: We're not rocket scientists

    During the development of Windows 95, it was common for team members to pay visits to other teams to touch base and let them know what's been happening on the Windows 95 side of the project.

    It was during one of these informal visits that one of my colleagues reported that he saw that one of the members of the partner team had a Gary Larson cartoon from The Far Side depicting a group of scientists studying a multi-stage rocket ship they just assembled, but the stages are connected all crooked. One of the scientists says, "It's time we face reality, my friends. … We're not exactly rocket scientists."

    The comic was "enhanced" a bit by the partner team. They added a sign on the wall of the laboratory that says Windows 95 Development, and the stages of the rocket are alternately labeled 16-bit and 32-bit. The graffiti were clearly poking fun at Windows 95's attempt to straddle the 16-bit and 32-bit worlds.

    The Windows 95 team knew how to take a joke, and for a time, they adopted "Hey, we're not rocket scientists" as a catch phrase.

    Following up on that article from 2010: Microsoft ran a free seminar on Windows 95 development for Macintosh programmers at the 1995 MacWorld Expo. Upon successful completion, participants received T-shirts with the slogan "Windows 95 sucks less."

  • The Old New Thing

    My friend and his buddy invented the online shopping cart back in 1994

    Back in 1994 or so, my friend helped out his buddy who worked as the IT department for a local Seattle company known as Sub Pop Records. Here's what their Web site looked like back then. Oh, and in case you were wondering, when I said that his buddy worked as the IT department, I mean that the IT department consisted of one guy, namely him. And this wasn't even his real job. His main job was as their payroll guy; he just did their IT because he happened to know a little bit about computers. (If you asked him, he'd say that his main job was as a band member in Earth.)

    The mission was to make it possible for fans to buy records online. Nobody else was doing this at the time, so they had to invent it all by themselves. The natural metaphor for them was the shopping cart. You wandered through the virtual record store putting records in your basket, and then you went to check out.

    The trick here is how to keep track of the user as they wander through your store. This was 1994. Cookies hadn't been invented yet, or at least if they had been invented, support for them was very erratic, and you couldn't assume that every visitor to your site was using a browser that supported them.

    The solution was to encode the shopping cart state in the URL by making every link on the page include the session ID in the URL. It was crude but it got the job done.
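
    I don't know what their code looked like, but the link-rewriting step can be sketched in a few lines of C. Everything here is an assumption for illustration: the function name, the "session" parameter name, and the sample paths in the note below. The sketch also assumes the session ID is plain alphanumeric text that needs no URL escaping, and it ignores links that contain a fragment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrite one link so that it carries the visitor's session ID.
   Returns 1 on success, 0 if the result would not fit in out. */
static int add_session_to_link(const char *href, const char *session_id,
                               char *out, size_t out_size)
{
    assert(href && session_id && out);

    /* Use '&' if the link already has a query string, '?' otherwise. */
    char separator = strchr(href, '?') ? '&' : '?';

    int needed = snprintf(out, out_size, "%s%csession=%s",
                          href, separator, session_id);
    return needed >= 0 && (size_t)needed < out_size;
}
```

    Run every link on the page through something like this before sending the page out, and a hypothetical link to /catalog/singles.html becomes /catalog/singles.html?session=A1B2C3. The server then looks up the basket for A1B2C3 on each request.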

    The site went online, and soon they were taking orders from excited fans around the world. The company loved it, because they probably got to charge full price for the records (rather than losing a cut to the distributor). And my friend told me the deep dark secret of his system: "We do okay if you ask for standard shipping, but the real money is when somebody is impatient and insists on overnight shipping. Overcharging for shipping is where the real money is."

    (Note: Statements about business models for a primitive online shopping site from 1994 are not necessarily accurate today.)

  • The Old New Thing

    Did the Windows 95 interface have a code name?

    Commenter kinokijuf wonders whether the Windows 95 interface had a code name.

    Nope.

    We called it "the new shell" while it was under preliminary development, and when it got enabled in the builds, we just called it "the shell."

    (Explorer originally was named Cabinet, unrelated to the container file format of the same name. This original name lingers in the window class: CabinetWClass.)

  • The Old New Thing

    The alternate story of the time one of my colleagues debugged a line-of-business application for a package delivery service

    Some people objected to the length, the structure, the metaphors, the speculation, and fabrication. So let's say they were my editors. Here's what the article might have looked like, had I taken their recommendations. (Some recommendations applied to text that was also recommended for cutting. I applied the recommendations before cutting; the cuts are in gray.) You tell me whether you like the original or the edited version.

    Back in the days of Windows 95 development, one of my colleagues debugged a line-of-business application for a major delivery service. This was a program that the company gave to its top-tier high-volume customers, so that they could place and track their orders directly. And by directly, I mean that the program dialed the modem (since that was how computers communicated with each other back then) to contact the delivery service's mainframe (it was all mainframes back then) a computer at the delivery service and upload the new orders and download the status of existing orders.¹

    [Length. The "top tier customer" part of the story is irrelevant.]
    [Length. The mainframe part of the story is irrelevant.]
    [Speculation. No proof that the computer being dialed is a mainframe. For all you know, it was an Apple ][ on the other end of the modem.]

    Version 1.0 of the application had a notorious bug: Ninety days after you installed the program, it stopped working. They forgot to remove the beta expiration code. I guess that's why they have a version 1.01. It told you that the beta period has expired.

    [Length. Version 1.0 is irrelevant.]
    [Speculation. No proof that the beta expiration code was left by mistake. It could have been intentional, for whatever reason. Probably some nefarious reason.]

    Anyway, the bug that my colleague investigated was that If you entered a particular type of order with a particular set of options in a particular way, then the application crashed your system. Setting up a copy of the application in order to replicate the problem was itself a bit of an ordeal, but that's a whole different story.

    [Length. Retransition no longer necessary. The "setting up" story is irrelevant.]

    Okay, the program is set up, and yup, it crashes exactly as described when run on Windows 95. Actually, it also crashes exactly as described when run on Windows 3.1. This is just plain an application bug.

    [Length. Irrelevant.]

    The initial crash

    [Structure. Create heading (even though it gives away some of the story).]

    Here's why it crashed: After the program dials up the mainframe to submit the order the order system, it tries to refresh the list of orders that have yet to be delivered a list box control. The code that does this assumes that the list of undelivered orders the list box control is the control with focus. But if you ask for labels to be printed, then the printing code changes focus in order to display the "Please place the label on the package exactly like this" dialog, under the specific circumstances, the control is no longer focus; as I recall, it was because a dialog box had appeared and changed focus, and as a result, the refresh code can't find the undelivered order list list box and crashes on a null pointer. (I'm totally making this up, by the way. The details of the scenario aren't important to the story.)

    [Fabrication. All that is known is that there was a list box that lost focus to a dialog box.]

    Okay, well, that's no big deal. A null pointer fault should just put up the Unrecoverable Application Error dialog box and close the program. Why does this particular null pointer fault crash the entire system?

    [Embellishment.]

    Recovering from the crash

    [Structure. Create heading.]

    The developers of the program saw that their refresh code sometimes crashed on a null pointer, and instead of fixing it by actually fixing the code so it could find the list of undelivered orders even if it didn't have focus, or fixing it by adding a null pointer check, they fixed it by adding a null pointer exception handler. (I wish to commend myself for resisting the urge to put the word fixed in quotation marks in that last sentence.) The program installed a null pointer exception handler.

    [Speculation. No way of knowing that this was what the developers were thinking when they wrote the code.]

    Now, 16-bit Windows didn't have structured exception handling. The only type of exception handler was a global exception handler, and this wasn't just global to the process. This was global to the entire system. Your exception handler was called for every exception everywhere. If you screwed it up, you screwed up the entire system. (I think you can see where this is going.)

    [Embellishment.]

    The developers of the program converted their global exception handler to a local one by going to every function that had a "We seem to crash on a null pointer and I don't know why" bug and making these changes: A few functions in the program took the following form:

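    extern jmp_buf caught;       // where the global handler jumps back to
    extern BOOL trapExceptions;  // TRUE while a guarded function is running
    
    void scaryFunction(...)
    {
     if (setjmp(caught)) return; // nonzero means we arrived here via longjmp
     trapExceptions = TRUE;
     ... body of function ...
     trapExceptions = FALSE;
    }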
    

    Their global exception handler checks the trapExceptions global variable, and if it is TRUE, they set it back to FALSE and do a longjmp which sends control back to the start of the function, which detects that something bad must have happened and just returns out of the function.
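
    The control flow described above can be reproduced in portable C. This is a sketch under heavy assumptions: a real 16-bit system-wide exception handler cannot be written in standard C, so the fault is simulated by calling fakeGlobalHandler directly. That function, the simulateFault parameter, and the appTerminated counter are all hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

static jmp_buf caught;
static bool trapExceptions = false;
static int  appTerminated  = 0;   /* counts "exit the application" decisions */

/* Stand-in for the system-wide handler. The sketch calls it directly
   to simulate a null pointer fault. */
static void fakeGlobalHandler(void)
{
    if (trapExceptions) {
        trapExceptions = false;
        longjmp(caught, 1);       /* resume at the setjmp in the guarded function */
    }
    appTerminated++;              /* unguarded fault: tear down the application */
}

/* Returns true if the body completed, false if a fault was swallowed. */
static bool scaryFunction(bool simulateFault)
{
    if (setjmp(caught)) return false;   /* second return, via longjmp */
    trapExceptions = true;
    if (simulateFault) fakeGlobalHandler();   /* "body of function" */
    trapExceptions = false;
    return true;
}
```

    Anything the body allocated before the simulated fault is never released, because longjmp skips the rest of the function. And a fault that arrives while trapExceptions is false takes the other branch, which is the one that matters for the rest of the story.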

    [Speculation. No way of knowing that this was what the developers were thinking when they wrote the code. No proof that the code was first written without a global exception handler, and that the handler was added later. No proof that every such function set this variable. No proof that the reason for adding the setjmp was to protect against null pointer failures.]

    Yes, things are kind of messed up as a result of this. Yes, there is a memory leak. But at least their application didn't crash.

    [Embellishment.]

    On the other hand, if the global variable is FALSE, because their application crashed in some other function that didn't have this special protection, or because some other totally unrelated application crashed, the global exception handler decided to exit the application by running around freeing all the DLLs and memory associated with their application.

    Okay, so far so good, for certain values of good.

    [Embellishment.]

    Failed recovery

    [Structure. Add heading here.]

    These system-wide exception handlers had to be written in assembly code because they were dispatched with a very strange calling convention. But the developers of this application didn't write their system-wide exception handler in assembly language. Their application was written in MFC, so they just went to Visual C++ (as it was then known), clicked through some Add a Windows hook wizard, and got some generic HOOKPROC. (I don't know if Visual C++ actually had an Add a Windows hook wizard; they could just have copied the code from somewhere.) Nevermind that these system-wide exception handlers are not HOOKPROCs, so the function has the wrong prototype. What's more, the code they used marked the hook function as __loadds. This means that the function For whatever reason, the handler they installed saves the previous value of the DS register on entry, then changes the register to point to the application's data, and on exit, the function restores the previous value of DS.

    [Speculation. No proof that the program was written with MFC in the Microsoft Visual C++ IDE. It could have been written with Notepad in assembly language that just happens to look like the assembly language generated by the Microsoft Visual C++ compiler when it compiles code written in MFC.]

    The DS is a register on the x86 CPU that describes the data currently being operated upon. All that's important here is that the value in the DS register must always be valid, or the CPU will raise an exception.

    [Need to explain the DS register in case the reader cannot infer this from the description that comes later. We have established that neither the author nor the reader is allowed to draw inferences.]

    Okay, now we're about to enter the set piece at the end of the movie: Our hero's fear of spiders, his girlfriend's bad ankle from an old soccer injury, the executive toy on the villain's desk, and all the other tiny little clues dropped in the previous ninety minutes come together to form an enormous chain reaction.

    [Embellishment.]

    The application crashes on a null pointer. The system-wide custom exception handler is called. The crash is not one that is being protected by the global variable, so the custom exception handler frees the application from memory. The system-wide custom exception handler now returns, but wait, what is it returning to?

    The crash was in the application, which means that the DS register it saved on entry to the custom exception handler points to the application's data. The custom exception handler freed the application's data and then returned, declaring the exception handled. As the function exited, it tried to restore the original DS register, but the CPU said, "Nice try, but that is not a valid value for the DS register (because you freed it)." The CPU reported this error by (dramatic pause) raising an exception.

    [Embellishment.]

    That's right, the system-wide custom exception handler crashed with an exception.

    [Embellishment]

    The chain reaction

    [Structure. Add heading here.]

    Okay, things start snowballing. This is the part of the movie where the director uses quick cuts between different locations, maybe with a little slow motion thrown in.

    [Embellishment.]

    Since an exception was raised, the custom exception handler is called recursively. Each time through the recursion, the custom exception handler frees all the DLLs and memory associated with the application. But that's okay, right? Because the second and subsequent times, the memory was already freed, so the attempts to free them again will just fail with an invalid parameter error.

    But wait, their list of DLLs associated with the application included USER, GDI, and KERNEL. Now, Windows is perfectly capable of unloading dependent DLLs when you unload the main DLL, so when they unloaded their main program, the kernel already decremented the usage count on USER, GDI, and KERNEL automatically. But they apparently didn't trust Windows to do this, because after all, it was Windows that was causing their application to crash, so they took it upon themselves to free those DLLs manually. For whatever reason, the handler frees the DLLs anyway.

    [Speculation. No way of knowing that this was what the developers were thinking when they wrote the code.]

    Therefore, each time through the loop, the usage counts for USER, GDI, and KERNEL drop by one. Zoom in on the countdown clock on the ticking time bomb.

    Beep beep beep beep beep. The reference count finally drops to zero. The window manager, the graphics subsystem, and the kernel itself have all been unloaded from memory. There's nothing left to run the show!

    [Embellishment.]

    Boom, bluescreen. Hot flaming death.

    The punch line to all this is that whenever you call the company's product support line and describe a problem you encountered, their response is always, "Yeah, we're really sorry about that one."

    [Length. Irrelevant.]

    Bonus chatter: What is that whole different story mentioned near the top?

    [Length. Cut the entire bonus chatter. Irrelevant story.]

    Well, when the delivery service sent the latest version of the software to the Windows 95 team, they also provided an account number to use. My colleague used that account number to try to reproduce the problem, and since the problem occurred only after the order was submitted, she would have to submit delivery requests, say for a letter to be picked up from 221B Baker Street and delivered to 62 West Wallaby Street, or maybe for a 100-pound package of radioactive material to be picked up from 1600 Pennsylvania Avenue and delivered to 10 Downing Street, all of which were fictitious.

    [Fabrication. No proof that these were the addresses and orders used. All that is known is that fictitious orders were placed.]

    After about two weeks of this, my colleague got a phone call from people identifying themselves as Microsoft's shipping department. "What the heck are you doing?"

    [Speculation. No proof that the call truly came from the shipping department. Could have been a lucky prank call.]
    [Fabrication. No transcript of this call exists.]

    It turns out that the account number my colleague was given was Microsoft's own corporate account number. As in a real live account. She was inadvertently prank-calling the delivery company and sending actual trucks all over the country to pick up nonexistent letters and packages. The people who identified themselves as Microsoft's shipping department and people from the delivery service's headquarters claimed that they were frantic trying to trace where all the bogus orders were coming from.

    [Hearsay.]

    ¹ Mind you, this sort of thing is the stuff that average Joe customers can do while still in their pajamas, but back in those days, it was a feature that only top-tier customers had access to, because, y'know, mainframe.

  • The Old New Thing

    The time one of my colleagues debugged a line-of-business application for a package delivery service

    Back in the days of Windows 95 development, one of my colleagues debugged a line-of-business application for a major delivery service. This was a program that the company gave to its top-tier high-volume customers, so that they could place and track their orders directly. And by directly, I mean that the program dialed the modem (since that was how computers communicated with each other back then) to contact the delivery service's mainframe (it was all mainframes back then) and upload the new orders and download the status of existing orders.¹

    Version 1.0 of the application had a notorious bug: Ninety days after you installed the program, it stopped working. They forgot to remove the beta expiration code. I guess that's why they have a version 1.01.

    Anyway, the bug that my colleague investigated was that if you entered a particular type of order with a particular set of options in a particular way, then the application crashed your system. Setting up a copy of the application in order to replicate the problem was itself a bit of an ordeal, but that's a whole different story.

    Okay, the program is set up, and yup, it crashes exactly as described when run on Windows 95. Actually, it also crashes exactly as described when run on Windows 3.1. This is just plain an application bug.

    Here's why it crashed: After the program dials up the mainframe to submit the order, it tries to refresh the list of orders that have yet to be delivered. The code that does this assumes that the list of undelivered orders is the control with focus. But if you ask for labels to be printed, then the printing code changes focus in order to display the "Please place the label on the package exactly like this" dialog, and as a result, the refresh code can't find the undelivered order list and crashes on a null pointer. (I'm totally making this up, by the way. The details of the scenario aren't important to the story.)

    Okay, well, that's no big deal. A null pointer fault should just put up the Unrecoverable Application Error dialog box and close the program. Why does this particular null pointer fault crash the entire system?

    The developers of the program saw that their refresh code sometimes crashed on a null pointer, and instead of fixing it by actually fixing the code so it could find the list of undelivered orders even if it didn't have focus, or fixing it by adding a null pointer check, they fixed it by adding a null pointer exception handler. (I wish to commend myself for resisting the urge to put the word fixed in quotation marks in that last sentence.)

    Now, 16-bit Windows didn't have structured exception handling. The only type of exception handler was a global exception handler, and this wasn't just global to the process. This was global to the entire system. Your exception handler was called for every exception everywhere. If you screwed it up, you screwed up the entire system. (I think you can see where this is going.)

    The developers of the program converted their global exception handler to a local one by going to every function that had a "We seem to crash on a null pointer and I don't know why" bug and making these changes:

    #include <setjmp.h>

    extern jmp_buf caught;
    extern BOOL trapExceptions;
    
    void scaryFunction(...)
    {
     /* A nonzero return means the global handler did a longjmp back here. */
     if (setjmp(caught)) return;
     trapExceptions = TRUE;
     ... body of function ...
     trapExceptions = FALSE;
    }
    

    Their global exception handler checks the trapExceptions global variable, and if it is TRUE, they set it back to FALSE and do a longjmp which sends control back to the start of the function, which detects that something bad must have happened and just returns out of the function.
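
    To make the control flow concrete, here is a minimal sketch of the pattern in portable C. It is a simulation under stated assumptions, not the application's actual code: the 16-bit system-wide handler is replaced by an ordinary function that the guarded function calls at the point where the fault would occur, and the fault parameter and the leakedBlocks counter are hypothetical names added for illustration.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stdlib.h>

jmp_buf caught;               /* shared escape hatch, as in the article */
bool trapExceptions = false;  /* true only while a guarded function runs */
int leakedBlocks = 0;         /* hypothetical counter, for illustration only */

/* Stand-in for the system-wide handler (assumption: called directly
   instead of being dispatched by the system).
   Returns false when the fault was not inside a guarded function. */
bool globalExceptionHandler(void)
{
    if (trapExceptions) {
        trapExceptions = false;
        longjmp(caught, 1);   /* does not return; resumes at the setjmp */
    }
    return false;             /* unguarded fault: the real handler tore down the app here */
}

/* A guarded function. The fault parameter simulates the null pointer crash. */
bool scaryFunction(bool fault)
{
    if (setjmp(caught)) return false;   /* reached via longjmp: bail out */
    trapExceptions = true;

    void *scratch = malloc(64);
    assert(scratch != NULL);
    leakedBlocks++;

    if (fault) globalExceptionHandler(); /* the "crash", before any cleanup */

    free(scratch);
    leakedBlocks--;
    trapExceptions = false;
    return true;
}
```

    In this sketch, scaryFunction(true) returns false and leaves leakedBlocks at 1, because the longjmp skips the cleanup code and the allocation is never freed.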

    Yes, things are kind of messed up as a result of this. Yes, there is a memory leak. But at least their application didn't crash.

    On the other hand, if the global variable is FALSE, because their application crashed in some other function that didn't have this special protection, or because some other totally unrelated application crashed, the global exception handler decided to exit the application by running around freeing all the DLLs and memory associated with their application.

    Okay, so far so good, for certain values of good.

    These system-wide exception handlers had to be written in assembly code because they were dispatched with a very strange calling convention. But the developers of this application didn't write their system-wide exception handler in assembly language. Their application was written in MFC, so they just went to Visual C++ (as it was then known), clicked through some Add a Windows hook wizard, and got some generic HOOKPROC. (I don't know if Visual C++ actually had an Add a Windows hook wizard; they could just have copied the code from somewhere.) Never mind that these system-wide exception handlers are not HOOKPROCs, so the function has the wrong prototype. What's more, the code they used marked the hook function as __loadds. This means that the function saves the previous value of the DS register on entry, then changes the register to point to the application's data, and on exit, the function restores the previous value of DS.

    Okay, now we're about to enter the set piece at the end of the movie: Our hero's fear of spiders, his girlfriend's bad ankle from an old soccer injury, the executive toy on the villain's desk, and all the other tiny little clues dropped in the previous ninety minutes come together to form an enormous chain reaction.

    The application crashes on a null pointer. The system-wide custom exception handler is called. The crash is not one that is being protected by the global variable, so the custom exception handler frees the application from memory. The system-wide custom exception handler now returns, but wait, what is it returning to?

    The crash was in the application, which means that the DS register it saved on entry to the custom exception handler points to the application's data. The custom exception handler freed the application's data and then returned, declaring the exception handled. As the function exited, it tried to restore the original DS register, but the CPU said, "Nice try, but that is not a valid value for the DS register (because you freed it)." The CPU reported this error by (dramatic pause) raising an exception.

    That's right, the system-wide custom exception handler crashed with an exception.
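
    The save-and-restore dance can be modeled in a few lines of C. This is only a toy model under loud assumptions: the selector numbers, the selectorValid table, and the function names are all made up for illustration, and a failed load is reported as a false return value rather than as a real CPU exception.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical selectors; index 0 plays the role of an invalid selector. */
enum { SEL_SYSTEM = 1, SEL_APP = 2, SEL_COUNT = 3 };

bool selectorValid[SEL_COUNT] = { false, true, true };
int ds = SEL_SYSTEM;          /* simulated DS register */

/* Loading DS with a freed selector faults on the real CPU.
   The model reports that as a false return value instead. */
bool loadDS(int selector)
{
    assert(selector >= 0 && selector < SEL_COUNT);
    if (!selectorValid[selector]) return false;
    ds = selector;
    return true;
}

/* Models a handler marked __loadds that frees the application's data.
   Returns true if it returned cleanly, false if a DS load faulted. */
bool handlerWithLoadDS(void)
{
    int savedDS = ds;                    /* prologue: remember the caller's DS */
    if (!loadDS(SEL_APP)) return false;  /* prologue: point DS at the app's data */

    selectorValid[SEL_APP] = false;      /* body: free the application's data */

    return loadDS(savedDS);              /* epilogue: restore the caller's DS */
}
```

    If ds holds SEL_SYSTEM on entry, the restore succeeds. If ds holds SEL_APP on entry, which is the case when the crash was in the application itself, the epilogue tries to reload a selector the body just freed, and the model reports the fault.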

    Okay, things start snowballing. This is the part of the movie where the director uses quick cuts between different locations, maybe with a little slow motion thrown in.

    Since an exception was raised, the custom exception handler is called recursively. Each time through the recursion, the custom exception handler frees all the DLLs and memory associated with the application. But that's okay, right? Because the second and subsequent times, the memory was already freed, so the attempts to free them again will just fail with an invalid parameter error.

    But wait, their list of DLLs associated with the application included USER, GDI, and KERNEL. Now, Windows is perfectly capable of unloading dependent DLLs when you unload the main DLL, so when they unloaded their main program, the kernel already decremented the usage count on USER, GDI, and KERNEL automatically. But they apparently didn't trust Windows to do this, because after all, it was Windows that was causing their application to crash, so they took it upon themselves to free those DLLs manually.

    Therefore, each time through the loop, the usage counts for USER, GDI, and KERNEL drop by one. Zoom in on the countdown clock on the ticking time bomb.

    Beep beep beep beep beep. The reference count finally drops to zero. The window manager, the graphics subsystem, and the kernel itself have all been unloaded from memory. There's nothing left to run the show!
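
    For those who prefer the countdown in code form, here is a rough sketch. It flattens the recursion into a loop, assumes every pass of the handler ends in another fault, and stops as soon as any one system module hits zero. The starting usage counts are invented inputs, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { USER, GDI, KERNEL, MODULE_COUNT };

int usageCount[MODULE_COUNT];   /* simulated usage counts of the system DLLs */

/* One pass of the handler's cleanup: free each system DLL once. */
void handlerCleanupPass(void)
{
    for (int i = 0; i < MODULE_COUNT; i++) {
        if (usageCount[i] > 0) usageCount[i]--;
    }
}

/* True while every system module still has a nonzero usage count. */
bool systemStillLoaded(void)
{
    for (int i = 0; i < MODULE_COUNT; i++) {
        if (usageCount[i] == 0) return false;
    }
    return true;
}

/* Re-enters the handler until a system module is unloaded.
   Returns the number of passes that took. */
int passesUntilSystemUnloaded(int user, int gdi, int kernel)
{
    assert(user > 0 && gdi > 0 && kernel > 0);
    usageCount[USER] = user;
    usageCount[GDI] = gdi;
    usageCount[KERNEL] = kernel;

    int passes = 0;
    while (systemStillLoaded()) {
        handlerCleanupPass();   /* each fault runs the cleanup again */
        passes++;
    }
    return passes;
}
```

    With made-up starting counts of 5, 7, and 9, the loop ends after five passes, when the smallest count reaches zero.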

    Boom, bluescreen. Hot flaming death.

    The punch line to all this is that whenever you call the company's product support line and describe a problem you encountered, their response is always, "Yeah, we're really sorry about that one."

    Bonus chatter: What is that whole different story mentioned near the top?

    Well, when the delivery service sent the latest version of the software to the Windows 95 team, they also provided an account number to use. My colleague used that account number to try to reproduce the problem, and since the problem occurred only after the order was submitted, she would have to submit delivery requests, say for a letter to be picked up from 221B Baker Street and delivered to 62 West Wallaby Street, or maybe for a 100-pound package of radioactive material to be picked up from 1600 Pennsylvania Avenue and delivered to 10 Downing Street.

    After about two weeks of this, my colleague got a phone call from Microsoft's shipping department. "What the heck are you doing?"

    It turns out that the account number my colleague was given was Microsoft's own corporate account number. As in a real live account. She was inadvertently prank-calling the delivery company and sending actual trucks all over the country to pick up nonexistent letters and packages. Microsoft's shipping department and people from the delivery service's headquarters were frantic trying to trace where all the bogus orders were coming from.

    ¹ Mind you, this sort of thing is the stuff that average Joe customers can do while still in their pajamas, but back in those days, it was a feature that only top-tier customers had access to, because, y'know, mainframe.

  • The Old New Thing

    Why does Outlook map Ctrl+F to Forward instead of Find, like all right-thinking programs?

    It's a widespread convention that the Ctrl+F keyboard shortcut initiates a Find operation. Word does it, Excel does it, Wordpad does it, Notepad does it, Internet Explorer does it. But Outlook doesn't. Why doesn't Outlook get with the program?

    Rewind to 1995.

    The mail team was hard at work on their mail client, known as Exchange (code name Capone, in keeping with all the Chicago-related code names from that era). Back in those days, the Ctrl+F keyboard shortcut did indeed call up the Find dialog, in accordance with convention.

    And then a bug report came in from a beta tester who wanted Ctrl+F to forward rather than find, because he had become accustomed to that keyboard shortcut from the email program he used before Exchange.

    That beta tester was Bill Gates.
