Chris Mason is the person who hired me to work at Microsoft. By the time he
hired me, he’d already spent a great deal of time looking into the issue of
general software quality, and had written a memo (known as the “Zero Defects”
memo) that underlies many of our software practices today. The ideas have been
refined since then, but they haven’t changed much in terms of the basic
concepts.
One of my favorite Chris Mason quotes comes from that memo: “Since human
beings themselves are not fully debugged yet, there will be bugs in your code
no matter what you do.” We work to minimize the bugs in the software we ship,
but they’ll always be there.
The problem stems from the overall complexity of the software. In this
context, “complexity” doesn’t refer to the code itself. Rather, we’re talking
about the sheer volume of things the user can do. In Word, for example, we
have:
- More than 850 command functions (e.g. Bold and Italic are the
same command function)
- More than 1600 distinct commands (e.g. Bold and Italic are
distinct commands)
- At any given time roughly 50% of these commands are enabled
(conservative estimate)
- With just 3 steps, the number of possible combinations of code
execution paths exceeds 500 million
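The last figure follows from the two before it. Here is a rough check, assuming exactly 1,600 commands with exactly half of them enabled at every step (the real counts are "more than" these, so treat the result as a lower bound):

```python
# Back-of-the-envelope check of the figures in the list above.
# Assumptions: exactly 1600 distinct commands, 50% enabled at each step.
distinct_commands = 1600
enabled_fraction = 0.5
steps = 3

enabled_commands = int(distinct_commands * enabled_fraction)  # 800
sequences = enabled_commands ** steps  # ordered sequences of 3 commands

print(f"{sequences:,}")  # 512,000,000
```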
Now, there’s a philosophical issue about the desirability of increasingly
complex software, but I’m not going to discuss it here. For all practical
purposes, I don’t think there’s much benefit to getting into a discussion about
it. It may be an interesting question on some level, but it’s one we’ll never
fully resolve. And I’m just not all that interested in getting bogged down in
an endless debate without the possibility of resolution.
I mention the issue of complexity because it leads to subtle interactions
that can be difficult to track down. To illustrate the point even further, I
thought I’d discuss the anatomy of one of the more famous bugs we’ve had in
Word: the “Disk is full” on save error. Before I do, however, I should point
out that Pierre Igot, after some prodding on my part, did provide us with a
sample document that helped us to track down one of the more subtle
interactions involved in this particular problem. For that, Pierre, I thank you
and so do Word users everywhere.
The story of this problem begins with a basic design decision made when
Richard Brodie was Word’s primary software architect. Brodie came to Microsoft
along with Charles Simonyi after working at Xerox PARC, where he’d worked on
Bravo—their version of the GUI word processor. A number of the ideas used
in Word came from that early effort. Brodie joined Microsoft in 1981, began
work on Word in the summer of 1982, and finished version 1.0 in October of 1983.
You can read about much of the story in Microsoft First Generation by Cheryl Tsang.
Brodie figured out that a document is really just a collection of pieces of
text, and that it didn’t really matter where each piece of text is physically
located within the document’s file. For that matter, you could have one piece
of text that came from one file and another piece of text that came from
another file. We refer to this collection of pieces of text as the “piece
table.” This design has a number of benefits. For example, if you copy text
from one document to another, you don’t have to actually copy the text from one
file to another—at least not right away. All you really need to do is
copy the appropriate entries in the piece table in the source document to the
piece table in the destination document. Of course, you do need to copy the
physical text and formatting from one file to the other when you save the
destination document, but delaying that physical copy until save time meant
that the actual copy/paste could be done very quickly.
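To make the piece table idea concrete, here is a minimal sketch. It is not Word’s actual data structure: the `Piece` class, the `files` dictionary standing in for files on disk, and the `text_of` helper are all names made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Piece:
    """A run of text that lives in some file (hypothetical layout)."""
    file: str    # which file holds the characters
    start: int   # offset of the first character in that file
    length: int  # number of characters

# Stand-ins for files on disk.
files = {
    "a.doc": "The quick brown fox",
    "b.doc": "Hello, world",
}

def text_of(piece_table):
    """Materialize a document by reading each piece from its file."""
    return "".join(files[p.file][p.start:p.start + p.length] for p in piece_table)

doc_a = [Piece("a.doc", 0, 19)]
doc_b = [Piece("b.doc", 0, 7), Piece("b.doc", 7, 5)]

# "Copy" the word "quick " from A and "paste" it into B before "world".
# No characters move; B just gains a piece that points into A's file.
clipboard = Piece("a.doc", 4, 6)
doc_b.insert(1, clipboard)

print(text_of(doc_b))  # Hello, quick world
```

The paste is a single list insertion. The physical copy of the characters from one file to the other can wait until document B is saved.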
This design also made implementing undo rather simple. In fact, according to
Brodie, implementing undo was the primary reason to use this design. With this
design, all you have to do is create an internal undo document. When the user
deletes some text from the current document, for example, you copy the deleted
entries from the piece table to the undo document and save some information
about where those piece-table entries had been located in the original document.
To undo the delete, you just copy the piece-table entries from the undo
document back to the original document.
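A single-level version of that undo scheme might look something like the sketch below. The tuple layout and the function names are assumptions for illustration, and the sketch only deletes whole pieces to keep things short.

```python
# Pieces are (file, start, length) tuples; every name here is hypothetical.
files = {"letter.doc": "Dear Sir or Madam,"}

def text_of(piece_table):
    return "".join(files[f][s:s + n] for f, s, n in piece_table)

document = [("letter.doc", 0, 5), ("letter.doc", 5, 4), ("letter.doc", 9, 9)]
undo_document = []    # holds the pieces removed by the last edit
undo_position = None  # where those pieces sat in the document

def delete_pieces(first, count):
    """Delete whole pieces, remembering them so the delete can be undone."""
    global undo_position
    undo_document[:] = document[first:first + count]
    undo_position = first
    del document[first:first + count]

def undo():
    """Put the remembered pieces back where they came from."""
    document[undo_position:undo_position] = undo_document
    undo_document.clear()

delete_pieces(1, 1)
after_delete = text_of(document)  # "Dear or Madam,"
undo()
after_undo = text_of(document)    # "Dear Sir or Madam,"
```

Notice that the undo document never holds any text of its own. It only holds pieces that point into a file, which is exactly what matters later in the story.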
This design does have one problem: where do you put the text that the user
types into the document if it doesn’t go into the file that’s behind the
document? To solve that, Brodie added something called a “scratch” file, and
the scratch file remains a core part of Word’s design to this day. On the Mac,
Word creates this file in your TemporaryItems folder. On Mac OS X, this folder
is located at /private/tmp/<UID>/TemporaryItems, where “UID” is your user
ID number (for most people, that’s 501, but it can be a different number
altogether depending on how your user account was created). If you start up Word,
open a terminal window, and get a listing of your TemporaryItems folder, you’ll
see a file named something like “Word Work File S_.” There may be a number
after the “_” character. That’s Word’s scratch file (the “S” standing for
scratch). You might also see one or more files named “Word Work File D_” with
some number after the “_” character. This is a back-up copy of a document file
(the “D” standing for “Document”).
At this point, we need to fast-forward the story by a decade to the next major
feature that brought this problem to the fore: multiple undo. For ten years,
Word had undo, but it was just a single-level undo. For Word 6, we added the
ability to go back and undo every change you’d made to the document since you
first started editing it. And, with Word’s document/file architecture, this
wasn’t all that difficult to do: just make the undo document contain multiple
records with one record for each change to the main document. It’s a very cool
feature, and most of us couldn’t think of how we’d survive without it. But it
leads to a problem.
It’s not uncommon for users to make a few edits to a document, save the
document, make a few more edits, save the document again, make a few more
changes, and continue this process of edit/save for hours on end. Each time you
delete text, however, the actual text itself exists in the last-saved file for
the document you’re editing, and, with multiple levels of undo, the undo
records for text deletions still point back to the last-saved version of the
document’s file before you deleted the text. The next time you save, Word can’t
close the last-saved version of the file, because the undo document still
contains a reference to it. So, if you keep editing and saving, you’ll
eventually hit an open file limit. At least this was true of Word 6. It’s
changed quite a bit since then.
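Here is a toy model of that edit/save loop. It is a sketch of the mechanism as described above, not of Word’s code: the limit of 20 is an assumed value, and the file names and functions are invented for illustration.

```python
# Toy model: how many files stay open across repeated edit/save cycles.
OPEN_FILE_LIMIT = 20  # assumed value, purely for illustration

open_files = {"save-0"}  # the file behind the document
undo_refs = []           # files that undo records still point into
current = "save-0"

def delete_some_text():
    # The deleted text lives in the last-saved file, so the undo record
    # keeps a reference to that file.
    undo_refs.append(current)

def save(n):
    global current
    if len(open_files) >= OPEN_FILE_LIMIT:
        raise OSError("Disk is full")  # the error users reported
    new_file = f"save-{n}"
    open_files.add(new_file)
    # The old file can be closed only if no undo record points at it.
    if current not in undo_refs:
        open_files.discard(current)
    current = new_file

saves_completed = 0
try:
    for n in range(1, 1000):
        delete_some_text()
        save(n)
        saves_completed = n
except OSError as err:
    failure = str(err)
```

Every cycle leaves one more file open, so with a limit of 20 the twentieth save is the one that fails. How long "a while" takes depends entirely on how often the user deletes and saves.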
Arguably, this is something we should have figured out before we shipped Word
6, but, as Chris Mason pointed out, we humans haven’t been fully debugged yet.
Moreover, it’s easy to say that one should have thought of a particular
interaction in a complex piece of software, but that’s way easier said than
done. When you’re implementing any given feature, you’re totally focused on
the basic problems involved in the feature itself. To put this into
perspective, the person who implemented multiple undo in Word is one of the
best developers who has ever worked on Word, and has, since, been recognized as
a Microsoft Distinguished Engineer.
The reality is that we hadn’t realized we’d created this situation when we
added multiple levels of undo. Moreover, this problem has several different
variations on the basic theme. At this point, the story involves our efforts to
understand the nature and scope of the problem, and to come up with the “best”
way to fix it. Because of the variations, however, the problem has been like an
onion. We’d peel away one layer of the onion, only to find some other variation
that we hadn’t, for various reasons, figured out before.
As we weave our way through the rest of the story, there are some important
points to keep in mind. The first is that I can’t fix what I can’t see, and, where
software bugs are concerned, “seeing” means being able to watch the program
execute, via some debugging tool, at the key point in the execution of the code
where the problem occurs. In order to do this, I have to have a precise set of
steps that consistently reproduces the problem. This is not all that different from
the problem a mechanic faces when trying to figure out the cause of that
mysterious engine noise that only occurs after you’ve been driving the car
around town for a few hours.
The second is that this particular problem is a developer’s worst nightmare.
The fundamental cause is a basic design decision that you made more than a
decade ago, and the only way to really fix it for certain is to rewrite the
entire application from the ground up. Since that’s simply not an option for a
product that you’ve shipped several times, you’re left with trying to make the
problem difficult for most users to run into while trying to also minimize the
negative effects if the user should ever run into the problem. This approach
can, unfortunately, lead you to believe that you’ve come up with an “optimal”
fix only to discover later that there’s another facet you haven’t taken into
consideration (because you didn’t even know it existed until you peeled away
the previous layer of the onion).
The third point to keep in mind is that we in Mac BU have relatively limited
resources. When there’s a problem that’s fundamental to Word itself, we tend to
let our Win Word siblings focus on that problem. Our efforts tend to have little
chance of adding to their efforts, and this frees us up to focus on problems
specific to Mac users. In general, this is the most efficient way to handle
problems that our users are having, but there can be instances where there’s a
Mac-specific dimension to a problem. As we’ll see soon enough, this particular
problem had a Mac-specific dimension that complicated our efforts to fix it,
and it took us a while to find that Mac-specific dimension.
Lastly, the fact that Mac Word’s code base has been forked from Win Word’s
means that the Win Word people can make a change in the code for one reason,
and that change can have other side-effects that we won’t see in the Mac
version until we run into some very specific circumstances that show us the
different behaviors caused by this change. In this particular case, Win Word
added two lines of code in a routine that would seemingly be completely
unrelated to this problem, but also made this problem much more difficult for
users to run into in Win Word than it was in Mac Word. This one is the last
piece (maybe I should say latest piece) in the puzzle that we discovered only a
few months ago.
Whew! That’s a lot to keep in the back of our heads, but, nonetheless, let’s
rewind back about ten years. Word 6.0 has just shipped on Windows, and we’re
pretty happy with people’s reactions to the product. It doesn’t take long,
though, for us to figure out that there’s a fly in the ointment. Reports start
trickling in about people editing their documents “for a while” at which point
they try to save their document and they get a “Disk is full” error. We’d ask
people what they were doing, and the response was always some form of vague
notion that they’d just been editing their document “for a while.” The precise
measurement of “for a while” varied from user to user. For some folks, it was a
little over an hour. For others, it was several hours. Reproducing the problem
appeared to be highly dependent upon the user’s work habits.
After several months of trying to figure out the problem, someone in testing
wrote a macro that inserted a large amount of text into the document and then,
in a loop, replaced successive words within the document, saving it after each
replace. Run this macro for a while, and you get a “Disk is full” error on one
of the saves, at which point you can no longer save your document. Cool! We
now have steps that reproduce the problem.
So, this document got handed off to a developer, who then fired up Word
under the debugger, opened the document and ran the macro. The problem “reproduced,”
but, for reasons that weren’t apparent at the time, the error that the
developer ran into was subtly different from the error that the tester ran into.
The developer thought about the problem he was seeing, and came up with one of
those “optimal” fixes I mentioned above. It was the “right” fix in terms of the
problem the developer saw, but it wasn’t the “right” fix for the problem that
the tester saw.
What was this subtle difference between what the developer saw and what the
tester saw? As I mentioned above, the basic theme of the problem is hitting an
open file limit. In this case, there are two limits:
Word’s internal open file limit and the OS’ open file limit. It turns out that
the debugger bumps the OS’ open file limit from what it would normally be when
you run Word outside the debugger. When the tester ran the macro, Word hit the
OS’ open file limit. When the developer ran the macro, with Word running under
the debugger, Word ran into its own, internal, file limit.
After a few iterations of the tester saying, “Sorry, but the bug’s not fixed
yet,” and the developer saying, “What are you talking about? I don’t see the
problem!” they both figured out that they were seeing different errors. Crap!
The problem only reproduces when you’re not running under the debugger, which
removes the one case where the developer can actually see what’s going on. At
this point, we have yet to figure out that the problem involves hitting the OS’
open file limit. The developer isn’t completely in the dark, though, and comes
up with a fix for the tester’s problem.
As I pointed out, the problem involves the undo document having a reference
to the previously-saved version of the document’s file. The developer’s
original fix was to add some code, in the case where Word hit its internal open
file limit, that would basically remove everything from the undo document (what
we refer to as “nuking the undo stack”). Nuking the undo stack allows the save
to proceed, because Word can now close the open files that were referenced by
the undo document. However, since the tester was seeing a different error, the
developer’s fix didn’t handle that case.
Nonetheless, the developer took a different approach. Knowing that the undo
document was very likely to be involved, one could walk through the undo
document, and copy the text for any pieces that pointed to the previously-saved
version of the document’s file to the scratch file. He coded up the solution,
and handed a buddy-build off to the tester. The tester ran the macro, and the
problem was fixed. The first layer of the onion had been peeled away, but the
fix still wasn’t an “optimal” fix. As it stood, the chances that a user would
run into the problem had been greatly reduced, but we still hadn’t dealt with
the “minimize the damage if they do hit it” side of the issue. That’s because,
at this point, we had yet to understand that the problem outside the debugger
had to do with the OS’ open file limit. Because this problem wouldn’t reproduce
under the debugger, the developer had no way of knowing exactly where the
failure was occurring. Without knowing that, the developer didn’t know where to
add the code that would “nuke the undo stack.”
To give you a sense of the time frame, this fix was ported from Win Word to
Mac Word during the Office 2001 development cycle, and was back-ported into
Word 98 for a service release that was done not too long after that. It’s also
at this point where the Win Word and Mac Word stories diverge. There are two
reasons for this. The first is that this was the point in time where Win Word
got that two-line code change that I mentioned above. The second is that the
open file limit under Mac OS is different than it is under Windows. I might be
mistaken on this point, but I think the open file limit under Mac OS X is
different from the limit under Mac OS 9 as well.
At this point, we still didn’t know that the basic problem involved hitting
the OS’ open file limit. After a while, though, we did know that Mac Word users
were seeing this problem way more often than Win Word users. In fact, the
difference was enough for Mac Word testers to start investigating the problem
directly. One of the things we did know is that the problem involved file
references in the undo document. So, we came up with a variation of the
original fix.
In order to understand this, we have to understand a basic principle of
fixes. You make the simplest code change required to fix the problem. This
reduces the chances that the fix will cause some other problem that is,
potentially, worse than the one you’re trying to fix. When you’re mucking about
with the locations where data is stored in files, the potential for
catastrophic problems resulting from your fix is high. In that sense, the
original fix for this problem was limited to copying what might be known as “simple”
pieces. A “simple” piece has only text. A “complex” piece might have a graphic,
or it might involve a field in the document, both of which are likely to have
data in the file in addition to the text itself.
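The "simple pieces only" rule can be sketched as follows. This is a guess at the shape of the logic, not the real code: the `has_extra_data` flag, the `release_old_file` function, and the file names are all hypothetical, and reassigning `piece.file` stands in for physically copying text into the scratch file.

```python
from dataclasses import dataclass

@dataclass
class Piece:
    file: str
    text: str
    has_extra_data: bool = False  # e.g. a graphic or a field (hypothetical flag)

SCRATCH = "scratch"

def release_old_file(undo_document, old_file):
    """Move simple pieces that point into old_file over to the scratch file.

    Returns True if nothing in the undo document refers to old_file
    afterwards, which is the condition for being able to close it.
    """
    for piece in undo_document:
        if piece.file == old_file and not piece.has_extra_data:
            piece.file = SCRATCH  # stands in for copying the text itself
    return all(piece.file != old_file for piece in undo_document)

only_text = [Piece("save-1", "deleted words")]
with_field = [Piece("save-1", "deleted words"),
              Piece("save-1", "Page 3", has_extra_data=True)]

can_close_a = release_old_file(only_text, "save-1")   # True
can_close_b = release_old_file(with_field, "save-1")  # False
```

One complex piece left behind is enough to keep the old file open, which is why a conservative definition of "simple" reduces the problem without eliminating it.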
With this in mind, for Mac Word X, we modified the notion of what would be a
“simple” piece of text for the sake of deciding whether or not to copy a piece
from the previously-saved document’s file over to the scratch file. To view
this in a slightly different way, we made the code that copies undo document
referents more aggressive. This resolved another test case that the Mac Word
testers had developed, again using a slightly different macro that would
eventually cause the “Disk is full” error to occur. This fix didn’t actually
make it into the shipping release of Mac Office X, but it was included in a
subsequent SR (I don’t recall specifically which one).
At this point, we still didn’t know that the problem involved the OS’ open
file limit. That discovery didn’t happen
until this past summer when, through the very persistent efforts of Mac Word’s
current lead tester, we were able to use some tools on Mac OS X to figure out
exactly what was happening. While we were able to verify this, we still didn’t
know the exact location where Word was failing to open a file due to having hit
the OS’ open file limit. Again, we still can’t get this to reproduce under the
debugger, and there are a couple of places in the save process where it can
fail because the OS won’t let Word open the file. So, rather than scatter fixes
all over the place, we went with the sure fix: lower Word’s internal open file
limit so we hit it before we hit the OS’ open file limit. This allows the code
that nukes the undo stack to kick in, and then the save succeeds.
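The reasoning behind lowering the internal limit can be sketched like this. Both limit values are assumptions (the real numbers aren’t given here), and the `Editor` class and its methods are invented for illustration. The only thing that matters is that the internal limit is the lower of the two.

```python
OS_OPEN_FILE_LIMIT = 20    # assumed value
WORD_OPEN_FILE_LIMIT = 15  # assumed value; the point is only that it is lower

class Editor:
    def __init__(self):
        self.open_files = ["save-0"]
        self.undo_stack = []  # each record names the file it points into
        self.current = "save-0"
        self.nukes = 0

    def edit(self):
        self.undo_stack.append(self.current)

    def nuke_undo_stack(self):
        """Drop every undo record and close files nothing refers to any more."""
        self.undo_stack.clear()
        self.open_files = [self.current]
        self.nukes += 1

    def save(self, name):
        if len(self.open_files) >= WORD_OPEN_FILE_LIMIT:
            self.nuke_undo_stack()  # the internal limit is hit first...
        if len(self.open_files) >= OS_OPEN_FILE_LIMIT:
            raise OSError("Disk is full")  # ...so this is never reached
        self.open_files.append(name)
        if self.current not in self.undo_stack:
            self.open_files.remove(self.current)
        self.current = name

editor = Editor()
for n in range(1, 101):
    editor.edit()
    editor.save(f"save-{n}")
```

In this model all 100 saves succeed. The cost is that the user loses their undo history each time the internal limit trips, which is the "minimize the damage" half of the trade-off rather than a real cure.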
This brings us to late February/early March of this year, and the discussion
I’d had with Pierre. While we still can’t reproduce the actual file open
failure under the debugger, we now have enough information about what causes
the problem to be able to predict when the failure will eventually occur. From
that, we knew enough about the bug for me to believe that Pierre shouldn’t
still be hitting that “Disk is full” save error in the version of Word he was
using. Yet, he was still running into the problem.
That was the bad news: there was something about this problem that we still
didn’t fully understand. However, armed with a sample document and the ability
to predict when the error will occur, I could do something we’d never been able
to do before: set up Word under the debugger, perform some steps in Word to see
if those steps caused the predictive condition to occur, and set breakpoints
that would tell me exactly why we weren’t able to copy pieces from the undo
document over to the scratch file.
This is where I discovered those two lines of code that had been added to
Win Word so long ago, yet hadn’t been added to Mac Word. When Word lays out a
page in page layout view and a header or footer is visible, it updates any
fields in the header or footer. If you have, say, a page field in the visible
header/footer, Word will update that field. This is particularly necessary when
you have the footer of one page and the header of the following page both
visible in the document window. Word has to lay out two pages in the same
update, so it updates fields for the first page footer, lays out that page,
then updates the fields for the next page header and lays out that page.
Now, why would this result in a field being copied over to the undo
document? Well, Word has something called “auto undo tracking.” Basically,
when you’re typing, Word automatically tracks the changes you’re making until
you do something that causes Word to close out the “typing” undo record. You
can see this when you click on the “Undo” dropdown on the standard toolbar. You’ll
see “Typing <text you typed>” at various locations in the dropdown
interspersed with other actions you’ve taken.
The two lines of code that were added to Win Word paused automatic undo
tracking while updating these fields in the header or footer during page
layout, then un-paused automatic undo tracking once the field update was
finished. Ugh! How, on earth, were we to ever figure out that these two lines
of code were the primary reason Win Word users weren’t seeing this problem
nearly as often as Mac Word users? In any event, if you’ve stayed with me long
enough, here’s a tip you can use until we release an SR of Word X (or earlier)
with this fix. If you have a document that has headers and footers with page
fields in them, do your editing in Normal view, and you’ll likely never hit the
“Disk is full” save error.
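To make the effect of that pause/un-pause pair concrete, here is a minimal sketch. The `UndoTracker` class and everything in it are hypothetical stand-ins. All it models is the idea that changes made while tracking is paused never reach the undo document.

```python
from contextlib import contextmanager

class UndoTracker:
    """Hypothetical stand-in for automatic undo tracking."""
    def __init__(self):
        self.paused = False
        self.records = []

    def note_change(self, description):
        if not self.paused:
            self.records.append(description)

    @contextmanager
    def paused_tracking(self):
        previous = self.paused
        self.paused = True          # models the "pause" line
        try:
            yield
        finally:
            self.paused = previous  # models the "un-pause" line

def update_header_footer_fields(tracker):
    tracker.note_change("field update: PAGE")

tracker = UndoTracker()
tracker.note_change("Typing hello")
update_header_footer_fields(tracker)      # without the pause: recorded
with tracker.paused_tracking():
    update_header_footer_fields(tracker)  # with the pause: not recorded
```

Only the first field update lands in `tracker.records`. In the real product, each recorded field update is a complex piece in the undo document, and that is what kept the old files pinned open.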
Right about now, you’re probably asking, “Why did it take so long to figure
out what was up with this?” Well, you might as well ask why police departments
continue to have a large number of unsolved crimes on the books. The issue is the
same: the investigation stalls for the lack of any further leads to follow. For
the same reason that the police can’t just go out and start arresting anyone
who might be a suspect, we can’t go scattering potential fixes throughout the
code. Until we figure out what the precise nature of the problem is, we need
leads that we can follow. The mere fact that you’re running into a particular
problem isn’t a lead that I can follow. Specific details about potential
suspects, however, are leads I can follow. When it comes to software problems, leads
I can follow consist of information that helps me to reproduce the problems
consistently.
And, always remember that I can’t fix what I can’t see. I have to be able to
reproduce the problem while being able to run some kind of diagnostic tool. The
key to fixing a bug is predictability. Without predictability, I can’t fix it,
because without predictability I have no way to understand how the complex
interactions in modern software cause the specific problem to occur.
Rick