Anatomy of a Software Bug

Chris Mason is the person who hired me to work at Microsoft. By the time he hired me, he’d already spent a great deal of time looking into the issue of general software quality, and had written a memo (known as the “Zero Defects” memo) that underlies much of our software practices today. The ideas have been refined since then, but they haven’t changed much in terms of the basic concepts.

One of my favorite Chris Mason quotes comes from that memo, “Since human beings themselves are not fully debugged yet, there will be bugs in your code no matter what you do.” We work to minimize the bugs in the software we ship, but they’ll always be there.

The problem stems from the overall complexity of the software. In this context, “complexity” doesn’t refer to the code itself. Rather, we’re talking about the sheer volume of things the user can do. In Word, for example, we have:

  • More than 850 command functions (e.g. Bold and Italic are the same command function)
  • More than 1600 distinct commands (e.g. Bold and Italic are distinct commands)
  • At any given time roughly 50% of these commands are enabled (conservative estimate)
  • With just 3 steps, the number of possible code execution paths exceeds 500 million
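
The last figure follows from the ones before it. Here is a rough check in Python, assuming (as a simplification for illustration) that any enabled command can follow any other enabled command:

```python
# Rough check of the figures in the list above.
# Assumption: any enabled command can follow any other enabled command.
distinct_commands = 1600
enabled_fraction = 0.5   # the "conservative estimate" from the list
steps = 3

enabled = int(distinct_commands * enabled_fraction)
paths = enabled ** steps

print(enabled)  # 800
print(paths)    # 512000000
```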

Now, there’s a philosophical issue about the desirability of increasingly complex software, but I’m not going to discuss it here. For all practical purposes, I don’t think there’s much benefit to getting into a discussion about it. It may be an interesting question on some level, but it’s one we’ll never fully resolve. And I’m just not all that interested in getting bogged down in an endless debate without the possibility of resolution.

I mention the issue of complexity because it leads to subtle interactions that can be difficult to track down. To illustrate the point even further, I thought I’d discuss the anatomy of one of the more famous bugs we’ve had in Word: the “Disk is full” on save error. Before I do, however, I should point out that Pierre Igot, after some prodding on my part, did provide us with a sample document that helped us to track down one of the more subtle interactions involved in this particular problem. For that, Pierre, I thank you and so do Word users everywhere.

The story of this problem begins with a basic design decision made when Richard Brodie was Word’s primary software architect. Brodie came to Microsoft along with Charles Simonyi after working at Xerox PARC, where he’d worked on Bravo, their version of the GUI word processor. A number of the ideas used in Word came from that early effort. Brodie joined Microsoft in 1981, began work on Word in the summer of 1982, and finished version 1.0 in October of 1983. You can read about much of the story in Microsoft First Generation by Cheryl Tsang.

Brodie figured out that a document is really just a collection of pieces of text, and that it didn’t really matter where each piece of text is physically located within the document’s file. For that matter, you could have one piece of text that came from one file and another piece of text that came from another file. We refer to this collection of pieces of text as the “piece table.” This design has a number of benefits. For example, if you copy text from one document to another, you don’t have to actually copy the text from one file to another—at least not right away. All you really need to do is copy the appropriate entries in the piece table in the source document to the piece table in the destination document. Of course, you do need to copy the physical text and formatting from one file to the other when you save the destination document, but delaying that physical copy until save time meant that the actual copy/paste could be done very quickly.
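
To make the idea concrete, here is a minimal sketch of a piece table in Python. It is not Word’s actual data structure: the names `Piece` and `Document`, the file names, and the use of in-memory strings in place of real files are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    source: str   # hypothetical name of the file that physically holds the text
    start: int    # offset of the text within that file
    length: int   # number of characters in the piece


class Document:
    """A document is just an ordered list of pieces."""

    def __init__(self, pieces=None):
        self.pieces = list(pieces or [])

    def text(self, files):
        # Assemble the visible text by reading each piece from its source.
        return "".join(
            files[p.source][p.start:p.start + p.length] for p in self.pieces
        )


# In-memory stand-ins for two files on disk.
files = {"a.doc": "Hello, world", "b.doc": "Goodbye"}
doc_a = Document([Piece("a.doc", 0, 12)])
doc_b = Document([Piece("b.doc", 0, 7)])

# "Paste" the word "world" from doc_a into doc_b. Only a piece-table
# entry is copied; the characters themselves stay in a.doc.
doc_b.pieces.append(Piece("a.doc", 7, 5))
pasted = doc_b.text(files)
```

After the paste, `doc_b` still depends on `a.doc` for part of its text. That dependency on other files is exactly the kind of thing that matters later in this story.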

This design also made implementing undo rather simple. In fact, according to Brodie, implementing undo was the primary reason to use this design. With this design, all you have to do is create an internal undo document. When the user deletes some text from the current document, for example, you copy the deleted entries from the piece table to the undo document and save some information about where those piece-table entries had been located in the original document. To undo the delete, you just copy the piece-table entries from the undo document back to the original document.
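
Here is a small sketch of that undo scheme, again in Python and again only an illustration. Pieces are plain `(source, start, length)` tuples, and the names `delete_pieces` and `undo_last` are hypothetical. The undo document is a list of records, so the same sketch works for one level of undo or several.

```python
# Pieces are (source, start, length) tuples; the text itself never moves.
document = [("saved.doc", 0, 6), ("saved.doc", 6, 5)]
undo_document = []   # each record: (position in the piece table, removed pieces)


def delete_pieces(doc, undo, index, count):
    """Remove pieces from the document and remember them in the undo document."""
    removed = doc[index:index + count]
    del doc[index:index + count]
    undo.append((index, removed))


def undo_last(doc, undo):
    """Put the most recently removed pieces back where they came from."""
    index, removed = undo.pop()
    doc[index:index] = removed


original = list(document)
delete_pieces(document, undo_document, 1, 1)
after_delete = list(document)
undo_last(document, undo_document)
```

Note that the undo record still names `saved.doc` as the source of the deleted text, so that file has to remain available for as long as the record exists.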

This design does have one problem: where do you put the text that the user types into the document if it doesn’t go into the file that’s behind the document? To solve that, Brodie added something called a “scratch” file, and the scratch file remains a core part of Word’s design to this day. On the Mac, Word creates this file in your TemporaryItems folder. On Mac OS X, this folder is located at /private/tmp/<UID>/TemporaryItems, where “UID” is your user ID number (for most people, that’s 501, but it can be a different number altogether depending on how your user account was created). If you start up Word, open a terminal window, and get a listing of your TemporaryItems folder, you’ll see a file named something like “Word Work File S_”. There may be a number after the “_” character. That’s Word’s scratch file (the “S” standing for “scratch”). You might also see one or more files named “Word Work File D_” with some number after the “_” character. This is a back-up copy of a document file (the “D” standing for “Document”).

At this point, we need to fast-forward the story by a decade to the next major feature that brought this problem to the fore: multiple undo. For ten years, Word had undo, but it was just a single-level undo. For Word 6, we added the ability to go back and undo every change you’d made to the document since you first started editing it. And, with Word’s document/file architecture, this wasn’t all that difficult to do: just make the undo document contain multiple records with one record for each change to the main document. It’s a very cool feature, and most of us couldn’t think of how we’d survive without it. But it leads to a problem.

It’s not uncommon for users to make a few edits to a document, save the document, make a few more edits, save the document again, make a few more changes, and continue this process of edit/save for hours on end. Each time you delete text, however, the actual text itself exists in the last-saved file for the document you’re editing, and, with multiple levels of undo, the undo records for text deletions still point back to the last-saved version of the document’s file before you deleted the text. The next time you save, Word can’t close the last-saved version of the file, because the undo document still contains a reference to it. So, if you keep editing and saving, you’ll eventually hit an open file limit. At least this was true of Word 6. It’s changed quite a bit since then.
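
The edit/save cycle can be modeled in a few lines of Python. This is a toy model, not Word’s code: it assumes one deletion before every save, and the limit of 20 open files is a made-up number chosen only to show the effect.

```python
def saves_until_limit(open_file_limit, cycles):
    """Return the number of the save that hits the limit, or None if none does."""
    pinned = set()   # previously saved files still referenced by undo records
    for n in range(1, cycles + 1):
        # A deletion made before this save leaves an undo record pointing
        # at the previously saved file, so that file cannot be closed.
        pinned.add(f"save_{n - 1}")
        # The save itself needs one more open file for the new version.
        if len(pinned) + 1 > open_file_limit:
            return n
    return None


print(saves_until_limit(open_file_limit=20, cycles=100))  # 20
print(saves_until_limit(open_file_limit=20, cycles=5))    # None
```

A short session never gets near the limit, while a long one always reaches it eventually. That fits the reports of trouble after editing “for a while.”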

Arguably, this is something we should have figured out before we shipped Word 6, but, as Chris Mason pointed out, we humans haven’t been fully debugged yet. Moreover, it’s easy to say that one should have thought of a particular interaction in a complex piece of software, but that’s way easier said than done. When you’re implementing any given feature, you’re totally focused on the basic problems involved in the feature itself. To put this into perspective, the person who implemented multiple undo in Word is one of the best developers who has ever worked on Word, and has, since, been recognized as a Microsoft Distinguished Engineer.

The reality is that we hadn’t realized we’d created this situation when we added multiple levels of undo. Moreover, this problem has several different variations on the basic theme. At this point, the story involves our efforts to understand the nature and scope of the problem, and to come up with the “best” way to fix it. Because of the variations, however, the problem has been like an onion. We’d peel away one layer of the onion, only to find some other variation that we hadn’t, for various reasons, figured out before.

As we weave our way through the rest of the story, there are some important points to keep in mind. The first is that I can’t fix what I can’t see, and, where software bugs are concerned, “seeing” means being able to watch the program execute, via some debugging tool, at the key point in the execution of the code where the problem occurs. In order to do this, I have to have a precise set of steps that consistently reproduces the problem. This is not all that different from the problem a mechanic faces when trying to figure out the cause of that mysterious engine noise that only occurs after you’ve been driving the car around town for a few hours.

The second is that this particular problem is a developer’s worst nightmare. The fundamental cause is a basic design decision that you made more than a decade ago, and the only way to really fix it for certain is to rewrite the entire application from the ground up. Since that’s simply not an option for a product that you’ve shipped several times, you’re left with trying to make the problem difficult for most users to run into while trying to also minimize the negative effects if the user should ever run into the problem. This approach can, unfortunately, lead you to believe that you’ve come up with an “optimal” fix only to discover later that there’s another facet you haven’t taken into consideration (because you didn’t even know it existed until you peeled away the previous layer of the onion).

The third point to keep in mind is that we in Mac BU have relatively limited resources. When there’s a problem that’s fundamental to Word itself, we tend to let our Win Word siblings focus on that problem. Our efforts tend to have little chance of adding to their efforts, and this frees us up to focus on problems specific to Mac users. In general, this is the most efficient way to handle problems that our users are having, but there can be instances where there’s a Mac-specific dimension to a problem. As we’ll see soon enough, this particular problem had a Mac-specific dimension that complicated our efforts to fix it, and it took us a while to find that Mac-specific dimension.

Lastly, the fact that Mac Word’s code base has been forked from Win Word’s means that the Win Word people can make a change in the code for one reason, and that change can have other side-effects that we won’t see in the Mac version until we run into some very specific circumstances that show us the different behaviors caused by this change. In this particular case, Win Word added two lines of code in a routine that would seemingly be completely unrelated to this problem, but also made this problem much more difficult for users to run into in Win Word than it was in Mac Word. This one is the last piece (maybe I should say latest piece) in the puzzle that we discovered only a few months ago.

Whew! That’s a lot to keep in the back of our heads, but, nonetheless, let’s rewind back about ten years. Word 6.0 has just shipped on Windows, and we’re pretty happy with people’s reactions to the product. It doesn’t take long, though, for us to figure out that there’s a fly in the ointment. Reports start trickling in about people editing their documents “for a while” at which point they try to save their document and they get a “Disk is full” error. We’d ask people what they were doing, and the response was always some form of vague notion that they’d just been editing their document “for a while.” The precise measurement of “for a while” varied from user to user. For some folks, it was a little over an hour. For others, it was several hours. Reproducing the problem appeared to be highly dependent upon the user’s work habits.

After several months of trying to figure out the problem, someone in testing wrote a macro that inserted a large amount of text into the document and then, in a loop, replaced successive words within the document, saving it after each replace. Run this macro for a while, and you get a “Disk is full” error on one of the saves, at which point you can no longer save your document. Cool! We now have steps that reproduce the problem.

So, this document got handed off to a developer, who then fired up Word under the debugger, opened the document and ran the macro. The problem “reproduced,” but, for reasons that weren’t apparent at the time, the error that the developer ran into was subtly different from the error that the tester ran into. The developer thought about the problem he was seeing, and came up with one of those “optimal” fixes I mentioned above. It was the “right” fix in terms of the problem the developer saw, but it wasn’t the “right” fix for the problem that the tester saw.

What was this subtle difference between what the developer saw and what the tester saw? As I mentioned above, the basic theme of the problem is to hit an open file limit. In this case, there are two limits: Word’s internal open file limit and the OS’ open file limit. It turns out that the debugger bumps the OS’ open file limit from what it would normally be when you run Word outside the debugger. When the tester ran the macro, Word hit the OS’ open file limit. When the developer ran the macro, with Word running under the debugger, Word ran into its own, internal, file limit.

After a few iterations of the tester saying, “Sorry, but the bug’s not fixed yet,” and the developer saying, “What are you talking about? I don’t see the problem!” they both figured out that they were seeing different errors. Crap! The problem only reproduces when you’re not running under the debugger, which removes the one case where the developer can actually see what’s going on. At this point, we have yet to figure out that the problem involves hitting the OS’ open file limit. Even so, the developer isn’t completely in the dark, and comes up with a fix for the tester’s problem.

As I pointed out, the problem involves the undo document having a reference to the previously saved version of the document’s file. The developer’s original fix was to add some code, in the case where Word hit its internal open file limit, that would basically remove everything from the undo document (what we refer to as “nuking the undo stack”). Nuking the undo stack allows the save to proceed, because Word can now close the open files that were referenced by the undo document. However, since the tester was seeing a different error, the developer’s fix didn’t handle that case.

Nonetheless, the developer took a different approach. Knowing that the undo document was very likely to be involved, one could walk through the undo document, and copy the text for any pieces that pointed to the previously-saved version of the document’s file to the scratch file. He coded up the solution, and handed a buddy-build off to the tester. The tester ran the macro, and the problem was fixed. The first layer of the onion had been peeled away, but the fix still wasn’t an “optimal” fix. As it stood, the chances that a user would run into the problem had been greatly reduced, but we still hadn’t dealt with the “minimize the damage if they do hit it” side of the issue. That’s because, at this point, we had yet to understand that the problem outside the debugger had to do with the OS’ open file limit. Because this problem wouldn’t reproduce under the debugger, the developer had no way of knowing exactly where the failure was occurring. Without knowing that, the developer didn’t know where to add the code that would “nuke the undo stack.”

To give you a sense of the time frame, this fix was ported from Win Word to Mac Word during the Office 2001 development cycle, and was back-ported into Word 98 for a service release that was done not too long after that. It’s also at this point where the Win Word and Mac Word stories diverge. There are two reasons for this. The first is that this was the point in time where Win Word got that two-line code change that I mentioned above. The second is that the open file limit under Mac OS is different than it is under Windows. I might be mistaken on this point, but I think the open file limit under Mac OS X is different from the limit under Mac OS 9 as well.

At this point, we still didn’t know that the basic problem involved hitting the OS’ open file limit. After a while, though, we did know that Mac Word users were seeing this problem way more often than Win Word users. In fact, the difference was enough for Mac Word testers to start investigating the problem directly. One of the things we did know is that the problem involved file references in the undo document. So, we came up with a variation of the original fix.

In order to understand this, we have to understand a basic principle of fixes. You make the simplest code change required to fix the problem. This reduces the chances that the fix will cause some other problem that is, potentially, worse than the one you’re trying to fix. When you’re mucking about with the locations where data is stored in files, the potential for catastrophic problems resulting from your fix is high. In that sense, the original fix for this problem was limited to copying what might be known as “simple” pieces. A “simple” piece has only text. A “complex” piece might have a graphic, or it might involve a field in the document, both of which are likely to have data in the file in addition to the text itself.
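
As a sketch of what “copy the simple pieces to the scratch file” means, here is a Python illustration. It is not the actual fix: the function names, the `simple` flag on each piece, and the use of `io.StringIO` as a stand-in for the scratch file are all assumptions.

```python
import io


def relocate_undo_pieces(undo_pieces, files, scratch, old_file):
    """Copy text-only pieces that point at old_file into the scratch file."""
    relocated = []
    for source, start, length, simple in undo_pieces:
        if source == old_file and simple:
            text = files[source][start:start + length]
            new_start = scratch.seek(0, io.SEEK_END)   # append at the end
            scratch.write(text)
            relocated.append(("scratch", new_start, length, simple))
        else:
            # "Complex" pieces (graphics, fields) are left where they are.
            relocated.append((source, start, length, simple))
    return relocated


def can_close(pieces, old_file):
    """The old file can be closed only if no piece still refers to it."""
    return all(source != old_file for source, *_ in pieces)


files = {"v1.doc": "alpha beta [picture]"}
scratch = io.StringIO()
undo = [("v1.doc", 0, 5, True), ("v1.doc", 11, 9, False)]

undo = relocate_undo_pieces(undo, files, scratch, "v1.doc")
print(undo[0])                    # ('scratch', 0, 5, True)
print(can_close(undo, "v1.doc"))  # False: the complex piece still pins v1.doc
```

The limitation shows up right away. A single complex piece in the undo document is enough to keep the old file open, so a conservative definition of “simple” reduces the problem without eliminating it.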

With this in mind, for Mac Word X, we modified the notion of what would be a “simple” piece of text for the sake of deciding whether or not to copy a piece from the previously-saved document’s file over to the scratch file. To view this in a slightly different way, we made the code that copies undo document referents more aggressive. This resolved another test case that the Mac Word testers had developed, again using a slightly different macro that would eventually cause the “Disk is full” error to occur. This fix didn’t actually make it into the shipping release of Mac Office X, but it was included in a subsequent SR (I don’t recall specifically which one).

At this point, we still don’t know that the problem involves the OS’ open file limit. That discovery didn’t happen until this past summer when, through the very persistent efforts of Mac Word’s current lead tester, we were able to use some tools on Mac OS X to figure out exactly what was happening. While we were able to verify this, we still didn’t know the exact location where Word was failing to open a file due to having hit the OS’ open file limit. Again, we still can’t get this to reproduce under the debugger, and there are a couple of places in the save process where it can fail because the OS won’t let Word open the file. So, rather than scatter fixes all over the place, we went with the sure fix: lower Word’s internal open file limit so we hit it before we hit the OS’ open file limit. This allows the code that nukes the undo stack to kick in, and then the save succeeds.
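
The interplay between the two limits can be captured in a toy model. The Python below is only an illustration of the reasoning: `save_outcome` is a hypothetical name, and every number is invented, since the real limits aren’t given here.

```python
def save_outcome(open_files_needed, word_limit, os_limit):
    """Toy model: which limit, if any, does a save run into first?"""
    if open_files_needed > word_limit:
        # Word notices its own limit and can recover by itself.
        return "nuke undo stack, save succeeds"
    if open_files_needed > os_limit:
        # The OS refuses to open another file and Word can't recover.
        return "Disk is full"
    return "save succeeds"


# Outside the debugger, before the fix: the OS limit is the lower one.
print(save_outcome(45, word_limit=60, os_limit=40))   # Disk is full

# Under the debugger, the OS limit is raised, so Word's own limit is hit.
print(save_outcome(65, word_limit=60, os_limit=100))  # nuke undo stack, save succeeds

# After the fix: Word's limit is lowered below the OS limit.
print(save_outcome(45, word_limit=30, os_limit=40))   # nuke undo stack, save succeeds
```

The fix doesn’t need to know where in the save process the OS refuses the file. It only needs Word’s own limit to be the first one reached.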

This brings us to late February/early March of this year, and the discussion I’d had with Pierre. While we still can’t reproduce the actual file open failure under the debugger, we now have enough information about what causes the problem to be able to predict when the failure will eventually occur. From that, we knew enough about the bug for me to believe that Pierre shouldn’t still be hitting that “Disk is full” save error in the version of Word he was using. Yet, he was still running into the problem.

That was the bad news: there was something about this problem that we still didn’t fully understand. However, armed with a sample document and the ability to predict when the error will occur, I could do something we’d never been able to do before: set up Word under the debugger, perform some steps in Word to see if those steps caused the predictive condition to occur, and set breakpoints that would tell me exactly why we weren’t able to copy pieces from the undo document over to the scratch file.

This is where I discovered those two lines of code that had been added to Win Word so long ago, yet hadn’t been added to Mac Word. When Word lays out a page in page layout view and a header or footer is visible, it updates any fields in the header or footer. If you have, say, a page field in the visible header/footer, Word will update that field. This is particularly necessary when you have the footer of one page and the header of the following page both visible in the document window. Word has to lay out two pages in the same update, so it updates fields for the first page footer, lays out that page, then updates the fields for the next page header and lays out that page.

Now, why would this result in a field being copied over to the undo document? Well, Word has something called “auto undo tracking.” Basically, when you’re typing, Word automatically tracks the changes you’re making until you do something that causes Word to close out the “typing” undo record. You can see this when you click on the “Undo” dropdown on the standard toolbar. You’ll see “Typing <text you typed>” at various locations in the dropdown interspersed with other actions you’ve taken.

The two lines of code that were added to Win Word paused automatic undo tracking while updating these fields in the header or footer during page layout, then un-paused automatic undo tracking once the field update was finished. Ugh! How, on earth, were we to ever figure out that these two lines of code were the primary reason Win Word users weren’t seeing this problem nearly as often as Mac Word users? In any event, if you’ve stayed with me long enough, here’s a tip you can use until we release an SR of Word X (or earlier) with this fix. If you have a document that has headers and footers with page fields in them, do your editing in Normal view, and you’ll likely never hit the “Disk is full” save error.
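
The shape of that two-line change is a familiar one: pause, do the work, un-pause. Here is a Python sketch of the pattern. The class and function names are hypothetical and are not taken from Word’s source.

```python
from contextlib import contextmanager


class UndoTracker:
    """Hypothetical stand-in for automatic undo tracking."""

    def __init__(self):
        self.records = []
        self.paused = False

    def record(self, change):
        # Changes made while tracking is paused never reach the undo document.
        if not self.paused:
            self.records.append(change)

    @contextmanager
    def paused_tracking(self):
        previous = self.paused
        self.paused = True       # "line one": pause tracking
        try:
            yield
        finally:
            self.paused = previous   # "line two": un-pause tracking


def update_header_fields(tracker, pause):
    """Simulate a field update that page layout triggers on its own."""
    if pause:
        with tracker.paused_tracking():
            tracker.record("update PAGE field")
    else:
        tracker.record("update PAGE field")


tracker = UndoTracker()
tracker.record("typing")
update_header_fields(tracker, pause=True)    # the Win Word behavior
with_pause = list(tracker.records)
update_header_fields(tracker, pause=False)   # the Mac Word behavior
without_pause = list(tracker.records)
```

With the pause in place, the layout-driven field update leaves no trace in the undo document. Without it, every such update adds a record that the user never asked for.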

Right about now, you’re probably asking, “Why did it take so long to figure out what was up with this?” Well, you might as well ask why police departments continue to have a large number of unsolved crimes on the books. The issue is the same: the investigation stalls for the lack of any further leads to follow. For the same reason that the police can’t just go out and start arresting anyone who might be a suspect, we can’t go scattering potential fixes throughout the code. Until we figure out what the precise nature of the problem is, we need leads that we can follow. The mere fact that you’re running into a particular problem isn’t a lead that I can follow. Specific details about potential suspects, however, are leads I can follow. When it comes to software problems, leads I can follow consist of information that helps me to reproduce the problems consistently.

And, always remember that I can’t fix what I can’t see. I have to be able to reproduce the problem while being able to run some kind of diagnostic tool. The key to fixing a bug is predictability. Without predictability, I can’t fix it, because without predictability I have no way to understand how the complex interactions in modern software cause the specific problem to occur.