Anatomy of a Software Bug

Chris Mason is the person who hired me to work at Microsoft. By the time he hired me, he’d already spent a great deal of time looking into the issue of general software quality, and had written a memo (known as the “Zero Defects” memo) that underlies much of our software practices today. The ideas have been refined since then, but they haven’t changed much in terms of the basic concepts.

One of my favorite Chris Mason quotes comes from that memo, “Since human beings themselves are not fully debugged yet, there will be bugs in your code no matter what you do.” We work to minimize the bugs in the software we ship, but they’ll always be there.

The problem stems from the overall complexity of the software. In this context, “complexity” doesn’t refer to the code itself. Rather, we’re talking about the sheer volume of things the user can do. In Word, for example, we have:

  • More than 850 command functions (e.g. Bold and Italic are the same command function)
  • More than 1600 distinct commands (e.g. Bold and Italic are distinct commands)
  • At any given time roughly 50% of these commands are enabled (conservative estimate)
  • With just 3 steps, the number of possible code execution paths exceeds 500 million
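
A back-of-the-envelope check of that last figure, assuming each of the 3 steps is an independent choice among the enabled commands (that independence is a simplifying assumption for illustration, not something stated above):

```python
# Figures taken from the list above; the 50% enabled rate is the
# conservative estimate quoted there.
distinct_commands = 1600
enabled_fraction = 0.5
steps = 3

# Roughly 800 commands are available at each step.
enabled_commands = int(distinct_commands * enabled_fraction)

# Treat each step as an independent pick from the enabled commands.
paths = enabled_commands ** steps

print(f"{enabled_commands} enabled commands, {steps} steps: {paths:,} sequences")
```

Under those assumptions the count comes out a little over 500 million, which is consistent with the figure in the list.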

Now, there’s a philosophical issue about the desirability of increasingly complex software, but I’m not going to discuss it here. For all practical purposes, I don’t think there’s much benefit to getting into a discussion about it. It may be an interesting question on some level, but it’s one we’ll never fully resolve. And I’m just not all that interested in getting bogged down in an endless debate without the possibility of resolution.

I mention the issue of complexity because it leads to subtle interactions that can be difficult to track down. To illustrate the point even further, I thought I’d discuss the anatomy of one of the more famous bugs we’ve had in Word: the “Disk is full” on save error. Before I do, however, I should point out that Pierre Igot, after some prodding on my part, did provide us with a sample document that helped us to track down one of the more subtle interactions involved in this particular problem. For that, Pierre, I thank you and so do Word users everywhere.

The story of this problem begins with a basic design decision made when Richard Brodie was Word’s primary software architect. Brodie came to Microsoft along with Charles Simonyi after working at Xerox PARC, where he’d worked on Bravo, their version of the GUI word processor. A number of the ideas used in Word came from that early effort. Brodie joined Microsoft in 1981, began work on Word in the summer of 1982, and finished version 1.0 in October of 1983. You can read about much of the story in Microsoft First Generation by Cheryl Tsang.

Brodie figured out that a document is really just a collection of pieces of text, and that it doesn’t really matter where each piece of text is physically located within the document’s file. For that matter, you could have one piece of text that came from one file and another piece of text that came from another file. We refer to this collection of pieces of text as the “piece table.” This design has a number of benefits. For example, if you copy text from one document to another, you don’t have to actually copy the text from one file to another, at least not right away. All you really need to do is copy the appropriate entries in the piece table in the source document to the piece table in the destination document. Of course, you do need to copy the physical text and formatting from one file to the other when you save the destination document, but delaying that physical copy until save time means that the actual copy/paste can be done very quickly.
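
To make the piece table idea concrete, here is a minimal sketch. It is not Word's actual code: the `Piece` and `Document` names, the dictionary standing in for files on disk, and the file names are all hypothetical, and formatting is ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    source: str   # name of the file that physically holds the text
    start: int    # offset of the text within that file
    length: int   # number of characters in the piece


class Document:
    def __init__(self, files):
        self.files = files   # shared mapping: file name -> contents
        self.pieces = []     # the piece table

    def text(self):
        # Assemble the visible text by walking the piece table.
        return "".join(
            self.files[p.source][p.start:p.start + p.length]
            for p in self.pieces
        )

    def copy_range(self, first, last):
        # Copying only duplicates table entries; no text is moved.
        return list(self.pieces[first:last])

    def paste(self, index, pieces):
        self.pieces[index:index] = pieces

    def save(self, name):
        # Only at save time is the text physically gathered into one file.
        self.files[name] = self.text()
        self.pieces = [Piece(name, 0, len(self.files[name]))]


files = {"a.doc": "Hello, world", "b.doc": "Goodbye"}
a = Document(files)
a.pieces = [Piece("a.doc", 0, 5), Piece("a.doc", 5, 7)]
b = Document(files)
b.pieces = [Piece("b.doc", 0, 7)]

# Paste ", world" from document a into document b.
b.paste(1, a.copy_range(1, 2))
pasted_text = b.text()
b.save("b2.doc")
```

After the paste, document `b` still points into `a.doc` for part of its text. That reference only goes away when `save` copies everything into the new file.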

This design also made implementing undo rather simple. In fact, according to Brodie, implementing undo was the primary reason to use this design. With this design, all you have to do is create an internal undo document. When the user deletes some text from the current document, for example, you copy the deleted entries from the piece table to the undo document and save some information about where those piece-table entries had been located in the original document. To undo the delete, you just copy the piece-table entries from the undo document back to the original document.
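
A rough sketch of that single-level undo, under the assumption that pieces are plain `(file, start, length)` tuples and that the undo document is just a second list of pieces (the class and attribute names are hypothetical):

```python
class Document:
    def __init__(self, pieces):
        self.pieces = list(pieces)   # main document's piece table
        self.undo_pieces = []        # the internal "undo document"
        self.undo_position = None    # where the deleted pieces used to sit

    def delete(self, first, last):
        # Move the deleted entries into the undo document and remember
        # where they came from. No text is copied or destroyed.
        self.undo_pieces = self.pieces[first:last]
        self.undo_position = first
        del self.pieces[first:last]

    def undo(self):
        if self.undo_position is None:
            return  # nothing to undo
        # Copy the entries back to their original location.
        at = self.undo_position
        self.pieces[at:at] = self.undo_pieces
        self.undo_pieces, self.undo_position = [], None


original = [("report.doc", 0, 5), ("report.doc", 5, 6), ("report.doc", 11, 4)]
doc = Document(original)
doc.delete(1, 2)
after_delete = list(doc.pieces)
doc.undo()
```

Note that the undo document keeps pointing at `report.doc` for as long as the deleted piece lives there.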

This design does have one problem: where do you put the text that the user types into the document if it doesn’t go into the file that’s behind the document? To solve that, Brodie added something called a “scratch” file, and the scratch file remains a core part of Word’s design to this day. On the Mac, Word creates this file in your TemporaryItems folder. On Mac OS X, this folder is located at /private/tmp/<UID>/TemporaryItems, where “UID” is your user ID number (for most people, that’s 501, but it can be a different number altogether depending on how your user account was created). If you start up Word, open a Terminal window, and get a listing of your TemporaryItems folder, you’ll see a file named something like “Word Work File S_.” There may be a number after the “_” character. That’s Word’s scratch file (the “S” standing for scratch). You might also see one or more files named “Word Work File D_” with some number after the “_” character. This is a back-up copy of a document file (the “D” standing for “Document”).
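
As an illustration only, typed text landing in a scratch file might look something like this. The `ScratchFile` class and the tuple layout are assumptions made for the sketch, and an in-memory string stands in for the real file:

```python
class ScratchFile:
    """Stand-in for the scratch file: an append-only text buffer."""

    def __init__(self):
        self.data = ""

    def append(self, text):
        # New text always goes at the end; return a piece that points at it.
        start = len(self.data)
        self.data += text
        return ("scratch", start, len(text))


def read_piece(piece, sources):
    name, start, length = piece
    return sources[name][start:start + length]


original = "The fox"
scratch = ScratchFile()

# The document starts as two pieces of the original file: "The " and "fox".
pieces = [("original", 0, 4), ("original", 4, 3)]

# The user types "quick " between them. The original file is untouched;
# the new text lives in the scratch file.
pieces.insert(1, scratch.append("quick "))

sources = {"original": original, "scratch": scratch.data}
result = "".join(read_piece(p, sources) for p in pieces)
print(result)
```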

At this point, we need to fast-forward the story by a decade to the next major feature that brought this problem to the fore: multiple undo. For ten years, Word had undo, but it was just a single-level undo. For Word 6, we added the ability to go back and undo every change you’d made to the document since you first started editing it. And, with Word’s document/file architecture, this wasn’t all that difficult to do: just make the undo document contain multiple records with one record for each change to the main document. It’s a very cool feature, and most of us couldn’t think of how we’d survive without it. But it leads to a problem.

It’s not uncommon for users to make a few edits to a document, save the document, make a few more edits, save the document again, make a few more changes, and continue this process of edit/save for hours on end. Each time you delete text, however, the actual text itself exists in the last-saved file for the document you’re editing, and, with multiple levels of undo, the undo records for text deletions still point back to the last-saved version of the document’s file before you deleted the text. The next time you save, Word can’t close the last-saved version of the file, because the undo document still contains a reference to it. So, if you keep editing and saving, you’ll eventually hit an open file limit. At least this was true of Word 6. It’s changed quite a bit since then.
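
The edit/save cycle described above can be simulated in a few lines. This is a toy model, not Word's logic: the limit of 5 is a made-up number chosen to keep the run short, and each undo record simply remembers which saved version of the file holds the deleted text.

```python
MAX_OPEN_FILES = 5  # hypothetical limit, chosen only to keep the demo short


class FileLimitError(Exception):
    pass


class Editor:
    def __init__(self):
        self.version = 0          # last-saved version of the document's file
        self.open_files = {0}     # saved versions currently held open
        self.undo_records = []    # version each deleted piece still points at

    def delete_text(self):
        # The deleted text lives in the last-saved file, so the undo
        # record points back at that file.
        self.undo_records.append(self.version)

    def save(self):
        # An old file can be closed only if no undo record references it.
        pinned = self.open_files & set(self.undo_records)
        if len(pinned) + 1 > MAX_OPEN_FILES:
            raise FileLimitError("Disk is full")
        self.version += 1
        self.open_files = pinned | {self.version}


editor = Editor()
saves = 0
try:
    while True:
        editor.delete_text()
        editor.save()
        saves += 1
except FileLimitError as err:
    print(f"save {saves + 1} failed: {err}")
```

Every save leaves one more file pinned open by the undo records, so the failure is only a matter of time. How long it takes depends on how often the user deletes text and saves, which fits the "for a while" in the reports.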

Arguably, this is something we should have figured out before we shipped Word 6, but, as Chris Mason pointed out, we humans haven’t been fully debugged yet. Moreover, it’s easy to say that one should have thought of a particular interaction in a complex piece of software, but that’s way easier said than done. When you’re implementing any given feature, you’re totally focused on the basic problems involved in the feature itself. To put this into perspective, the person who implemented multiple undo in Word is one of the best developers who has ever worked on Word, and has, since, been recognized as a Microsoft Distinguished Engineer.

The reality is that we hadn’t realized we’d created this situation when we added multiple levels of undo. Moreover, this problem has several different variations on the basic theme. At this point, the story involves our efforts to understand the nature and scope of the problem, and to come up with the “best” way to fix it. Because of the variations, however, the problem has been like an onion. We’d peel away one layer of the onion, only to find some other variation that we hadn’t, for various reasons, figured out before.

As we weave our way through the rest of the story, there are some important points to keep in mind. The first is that I can’t fix what I can’t see, and, where software bugs are concerned, “seeing” means being able to watch the program execute, via some debugging tool, at the key point in the execution of the code where the problem occurs. In order to do this, I have to have a precise set of steps that consistently reproduces the problem. This is not all that different from the problem a mechanic faces when trying to figure out the cause of that mysterious engine noise that only occurs after you’ve been driving the car around town for a few hours.

The second is that this particular problem is a developer’s worst nightmare. The fundamental cause is a basic design decision that you made more than a decade ago, and the only way to really fix it for certain is to rewrite the entire application from the ground up. Since that’s simply not an option for a product that you’ve shipped several times, you’re left with trying to make the problem difficult for most users to run into while trying to also minimize the negative effects if the user should ever run into the problem. This approach can, unfortunately, lead you to believe that you’ve come up with an “optimal” fix only to discover later that there’s another facet you haven’t taken into consideration (because you didn’t even know it existed until you peeled away the previous layer of the onion).

The third point to keep in mind is that we in Mac BU have relatively limited resources. When there’s a problem that’s fundamental to Word itself, we tend to let our Win Word siblings focus on that problem. Our efforts tend to have little chance of adding to their efforts, and this frees us up to focus on problems specific to Mac users. In general, this is the most efficient way to handle problems that our users are having, but there can be instances where there’s a Mac-specific dimension to a problem. As we’ll see soon enough, this particular problem had a Mac-specific dimension that complicated our efforts to fix it, and it took us a while to find that Mac-specific dimension.

Lastly, the fact that Mac Word’s code base has been forked from Win Word’s means that the Win Word people can make a change in the code for one reason, and that change can have other side-effects that we won’t see in the Mac version until we run into some very specific circumstances that show us the different behaviors caused by this change. In this particular case, Win Word added two lines of code in a routine that would seemingly be completely unrelated to this problem, but also made this problem much more difficult for users to run into in Win Word than it was in Mac Word. This one is the last piece (maybe I should say latest piece) in the puzzle that we discovered only a few months ago.

Whew! That’s a lot to keep in the back of our heads, but, nonetheless, let’s rewind back about ten years. Word 6.0 has just shipped on Windows, and we’re pretty happy with people’s reactions to the product. It doesn’t take long, though, for us to figure out that there’s a fly in the ointment. Reports start trickling in about people editing their documents “for a while” at which point they try to save their document and they get a “Disk is full” error. We’d ask people what they were doing, and the response was always some form of vague notion that they’d just been editing their document “for a while.” The precise measurement of “for a while” varied from user to user. For some folks, it was a little over an hour. For others, it was several hours. Reproducing the problem appeared to be highly dependent upon the user’s work habits.

After several months of trying to figure out the problem, someone in testing wrote a macro that inserted a large amount of text into the document and then, in a loop, replaced successive words within the document, saving it after each replace. Run this macro for a while, and you get a “Disk is full” error on one of the saves, at which point you can no longer save your document. Cool! We now have steps that reproduce the problem.

So, this document got handed off to a developer, who then fired up Word under the debugger, opened the document and ran the macro. The problem “reproduced,” but, for reasons that weren’t apparent at the time, the error that the developer ran into was subtly different from the error that the tester ran into. The developer thought about the problem he was seeing, and came up with one of those “optimal” fixes I mentioned above. It was the “right” fix in terms of the problem the developer saw, but it wasn’t the “right” fix for the problem that the tester saw.

What was this subtle difference between what the developer saw and what the tester saw? As I mentioned above, the basic theme of the problem is to hit an open file limit. In this case, there are two limits: Word’s internal open file limit and the OS’ open file limit. It turns out that the debugger bumps the OS’ open file limit from what it would normally be when you run Word outside the debugger. When the tester ran the macro, Word hit the OS’ open file limit. When the developer ran the macro, with Word running under the debugger, Word ran into its own, internal, file limit.
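
The tester/developer mismatch boils down to which of two ceilings a growing count of open files reaches first. Here is a tiny sketch; every number in it is hypothetical, since the actual limits aren't given here:

```python
def first_limit_hit(word_limit, os_limit):
    """Return which limit a steadily growing open-file count reaches first."""
    return "Word's internal limit" if word_limit <= os_limit else "OS limit"


# All three values are made up for illustration.
WORD_LIMIT = 300                 # Word's own internal cap
OS_LIMIT_NORMAL = 256            # OS cap when Word runs on its own
OS_LIMIT_UNDER_DEBUGGER = 1024   # OS cap once the debugger has bumped it

tester_sees = first_limit_hit(WORD_LIMIT, OS_LIMIT_NORMAL)
developer_sees = first_limit_hit(WORD_LIMIT, OS_LIMIT_UNDER_DEBUGGER)

print("tester:", tester_sees)
print("developer:", developer_sees)
```

Same macro, same document, but two different failures, because the debugger changed one of the inputs.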

After a few iterations of the tester saying, “Sorry, but the bug’s not fixed yet,” and the developer saying, “What are you talking about? I don’t see the problem!” they both figured out that they were seeing different errors. Crap! The problem only reproduces when you’re not running under the debugger, which removes the one case where the developer can actually see what’s going on. At this point, we have yet to figure out that the problem involves hitting the OS’ open file limit. The developer isn’t completely in the dark, though, and comes up with a fix for the tester’s problem.

As I pointed out, the problem involves the undo document having a reference to the previously saved-version of the document’s file. The developer’s original fix was to add some code, in the case where Word hit its internal open file limit, that would basically remove everything from the undo document (what we refer to as “nuking the undo stack”). Nuking the undo stack allows the save to proceed, because Word can now close the open files that were referenced by the undo document. However, since the tester was seeing a different error, the developer’s fix didn’t handle that case.

Nonetheless, the developer took a different approach. Knowing that the undo document was very likely to be involved, one could walk through the undo document, and copy the text for any pieces that pointed to the previously-saved version of the document’s file to the scratch file. He coded up the solution, and handed a buddy-build off to the tester. The tester ran the macro, and the problem was fixed. The first layer of the onion had been peeled away, but the fix still wasn’t an “optimal” fix. As it stood, the chances that a user would run into the problem had been greatly reduced, but we still hadn’t dealt with the “minimize the damage if they do hit it” side of the issue. That’s because, at this point, we had yet to understand that the problem outside the debugger had to do with the OS’ open file limit. Because this problem wouldn’t reproduce under the debugger, the developer had no way of knowing exactly where the failure was occurring. Without knowing that, the developer didn’t know where to add the code that would “nuke the undo stack.”
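
In outline, that second fix might look like the sketch below. It assumes pieces are `(file, start, length)` tuples and uses an in-memory buffer for the scratch file. The function and file names are hypothetical, and it ignores formatting and any non-text data a piece might carry.

```python
class Scratch:
    """Stand-in for the scratch file: an append-only text buffer."""

    def __init__(self):
        self.data = ""

    def append(self, text):
        start = len(self.data)
        self.data += text
        return start


def move_undo_pieces_to_scratch(undo_pieces, old_file, files, scratch):
    """Rewrite undo pieces so none of them point into old_file."""
    rewritten = []
    for name, start, length in undo_pieces:
        if name == old_file:
            # Copy the text out of the previously saved file...
            text = files[name][start:start + length]
            # ...and point the undo piece at the copy in the scratch file.
            rewritten.append(("scratch", scratch.append(text), length))
        else:
            rewritten.append((name, start, length))
    return rewritten


files = {"report-v1": "The quick brown fox"}
scratch = Scratch()
scratch.append("typed ")   # text the user typed earlier

undo_pieces = [("report-v1", 4, 6), ("scratch", 0, 6)]
undo_pieces = move_undo_pieces_to_scratch(undo_pieces, "report-v1", files, scratch)

still_referenced = any(name == "report-v1" for name, _, _ in undo_pieces)
print("old file still pinned by undo:", still_referenced)
```

Once nothing in the undo document refers to the old file, the save can close it, and undo still works because the text now lives in the scratch file.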

To give you a sense of the time frame, this fix was ported from Win Word to Mac Word during the Office 2001 development cycle, and was back-ported into Word 98 for a service release that was done not too long after that. It’s also at this point where the Win Word and Mac Word stories diverge. There are two reasons for this. The first is that this was the point in time where Win Word got that two-line code change that I mentioned above. The second is that the open file limit under Mac OS is different than it is under Windows. I might be mistaken on this point, but I think the open file limit under Mac OS X is different from the limit under Mac OS 9 as well.

At this point, we still didn’t know that the basic problem involved hitting the OS’ open file limit. After a while, though, we did know that Mac Word users were seeing this problem way more often than Win Word users. In fact, the difference was enough for Mac Word testers to start investigating the problem directly. One of the things we did know is that the problem involved file references in the undo document. So, we came up with a variation of the original fix.

In order to understand this, we have to understand a basic principle of fixes. You make the simplest code change required to fix the problem. This reduces the chances that the fix will cause some other problem that is, potentially, worse than the one you’re trying to fix. When you’re mucking about with the locations where data is stored in files, the potential for catastrophic problems resulting from your fix is high. In that sense, the original fix for this problem was limited to copying what might be known as “simple” pieces. A “simple” piece has only text. A “complex” piece might have a graphic, or it might involve a field in the document, both of which are likely to have data in the file in addition to the text itself.

With this in mind, for Mac Word X, we modified the notion of what would be a “simple” piece of text for the sake of deciding whether or not to copy a piece from the previously-saved document’s file over to the scratch file. To view this in a slightly different way, we made the code that copies undo document referents more aggressive. This resolved another test case that the Mac Word testers had developed, again using a slightly different macro that would eventually cause the “Disk is full” error to occur. This fix didn’t actually make it into the shipping release of Mac Office X, but it was included in a subsequent SR (I don’t recall specifically which one).

At this point, we still don’t know that the problem involves the OS’ open file limit. That discovery didn’t happen until this past summer when, through the very persistent efforts of Mac Word’s current lead tester, we were able to use some tools on Mac OS X to figure out exactly what was happening. While we were able to verify this, we still didn’t know the exact location where Word was failing to open a file due to having hit the OS’ open file limit. Again, we still can’t get this to reproduce under the debugger, and there are a couple of places in the save process where it can fail because the OS won’t let Word open the file. So, rather than scatter fixes all over the place, we went with the sure fix: lower Word’s internal open file limit so we hit it before we hit the OS’ open file limit. This allows the code that nukes the undo stack to kick in, and then the save succeeds.
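
Here is a toy simulation of why lowering the internal limit is a "sure fix". Both limits are made-up numbers, and the model is the same simplified one as before: each undo record pins the saved file that holds the deleted text.

```python
OS_OPEN_FILE_LIMIT = 8     # hypothetical OS cap
WORD_OPEN_FILE_LIMIT = 6   # hypothetical internal cap, deliberately lower


class Editor:
    def __init__(self):
        self.version = 0
        self.open_files = {0}
        self.undo_records = []   # saved version each undo record points at
        self.nuked = 0           # how many times the undo stack was dropped
        self.max_open = 1

    def delete_text(self):
        self.undo_records.append(self.version)

    def save(self):
        pinned = self.open_files & set(self.undo_records)
        if len(pinned) + 1 > WORD_OPEN_FILE_LIMIT:
            # The internal limit is reached first, so the recovery code
            # runs: drop the undo stack, which unpins every old file.
            self.undo_records.clear()
            self.nuked += 1
            pinned = set()
        if len(pinned) + 1 > OS_OPEN_FILE_LIMIT:
            raise RuntimeError("Disk is full")  # never reached in this model
        self.version += 1
        self.open_files = pinned | {self.version}
        self.max_open = max(self.max_open, len(self.open_files))


editor = Editor()
for _ in range(20):
    editor.delete_text()
    editor.save()

print(f"{editor.version} saves succeeded, undo nuked {editor.nuked} times, "
      f"at most {editor.max_open} files open")
```

The user loses undo history now and then, but the save always goes through because the open-file count never gets anywhere near the OS limit.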

This brings us to late February/early March of this year, and the discussion I’d had with Pierre. While we still can’t reproduce the actual file open failure under the debugger, we now have enough information about what causes the problem to be able to predict when the failure will eventually occur. From that, we knew enough about the bug for me to believe that Pierre shouldn’t still be hitting that “Disk is full” save error in the version of Word he was using. Yet, he was still running into the problem.

That was the bad news: there was something about this problem that we still didn’t fully understand. However, armed with a sample document and the ability to predict when the error will occur, I could do something we’d never been able to do before: set up Word under the debugger, perform some steps in Word to see if those steps caused the predictive condition to occur, and set breakpoints that would tell me exactly why we weren’t able to copy pieces from the undo document over to the scratch file.

This is where I discovered those two lines of code that had been added to Win Word so long ago, yet hadn’t been added to Mac Word. When Word lays out a page in page layout view and a header or footer is visible, it updates any fields in the header or footer. If you have, say, a page field in the visible header/footer, Word will update that field. This is particularly necessary when you have the footer of one page and the header of the following page both visible in the document window. Word has to lay out two pages in the same update, so it updates fields for the first page footer, lays out that page, then updates the fields for the next page header and lays out that page.

Now, why would this result in a field being copied over to the undo document? Well, Word has something called “auto undo tracking.” Basically, when you’re typing, Word automatically tracks the changes you’re making until you do something that causes Word to close out the “typing” undo record. You can see this when you click on the “Undo” dropdown on the standard toolbar. You’ll see “Typing <text you typed>” at various locations in the dropdown interspersed with other actions you’ve taken.

The two lines of code that were added to Win Word paused automatic undo tracking while updating these fields in the header or footer during page layout, then un-paused automatic undo tracking once the field update was finished. Ugh! How, on earth, were we to ever figure out that these two lines of code were the primary reason Win Word users weren’t seeing this problem nearly as often as Mac Word users? In any event, if you’ve stayed with me long enough, here’s a tip you can use until we release an SR of Word X (or earlier) with this fix. If you have a document that has headers and footers with page fields in them, do your editing in Normal view, and you’ll likely never hit the “Disk is full” save error.
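
The pause/un-pause pattern described above can be sketched as follows. This is an illustration of the idea only: `UndoTracker`, `paused_tracking`, and `update_header_footer_fields` are hypothetical names, and the real change was two lines inside an existing routine.

```python
from contextlib import contextmanager


class UndoTracker:
    def __init__(self):
        self.records = []
        self.paused = False

    def record(self, change):
        # Automatic tracking: every change is recorded unless paused.
        if not self.paused:
            self.records.append(change)

    @contextmanager
    def paused_tracking(self):
        previous = self.paused
        self.paused = True        # "line one": pause tracking
        try:
            yield
        finally:
            self.paused = previous  # "line two": un-pause tracking


def update_header_footer_fields(tracker, pause_tracking):
    if pause_tracking:
        with tracker.paused_tracking():
            tracker.record("update PAGE field")
    else:
        tracker.record("update PAGE field")


with_pause = UndoTracker()
with_pause.record("typing")
update_header_footer_fields(with_pause, pause_tracking=True)

without_pause = UndoTracker()
without_pause.record("typing")
update_header_footer_fields(without_pause, pause_tracking=False)

print("with pause:   ", with_pause.records)
print("without pause:", without_pause.records)
```

Without the pause, every layout pass that touches a header or footer field adds another entry to the undo document, and each of those entries can pin a previously saved file open.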

Right about now, you’re probably asking, “Why did it take so long to figure out what was up with this?” Well, you might as well ask why police departments continue to have a large number of unsolved crimes on the books. The issue is the same: the investigation stalls for the lack of any further leads to follow. For the same reason that the police can’t just go out and start arresting anyone who might be a suspect, we can’t go scattering potential fixes throughout the code. Until we figure out what the precise nature of the problem is, we need leads that we can follow. The mere fact that you’re running into a particular problem isn’t a lead that I can follow. Specific details about potential suspects, however, are leads I can follow. When it comes to software problems, leads I can follow consist of information that helps me to reproduce the problems consistently.

And, always remember that I can’t fix what I can’t see. I have to be able to reproduce the problem while being able to run some kind of diagnostic tool. The key to fixing a bug is predictability. Without predictability, I can’t fix it, because without predictability I have no way to understand how the complex interactions in modern software cause the specific problem to occur.

 

Rick

Published Wednesday, May 19, 2004 1:50 PM by Rick Schaut

Comments

Wednesday, May 19, 2004 2:13 PM by Larry Osterman

# re: Anatomy of a Software Bug

So what exactly is the Mac's open file limit?

Windows NT's is in the hundreds of thousands of handles per process (depending on the amount of physical RAM available), I'm surprised you guys didn't notice that the file handle count for Word was getting that high.

Even Win9x had an open file handle limit in the tens of thousands IIRC.
Wednesday, May 19, 2004 2:35 PM by TWR IV

# re: Anatomy of a Software Bug

Thanks for the interesting post. I remember the disk full error with great dismay although happily I haven't seen it in years.

If you want lists of reproducible Word 2004 bugs I'd suggest you put out a blog notice. We've already noticed a few around here, although nothing serious so far.
Wednesday, May 19, 2004 2:44 PM by matthew

# re: Anatomy of a Software Bug

In DOS, I used to set files=20 to save my conventional RAM. I fiddled with this, moving it up and down, as some programs didn't work if it was set too low.
Wednesday, May 19, 2004 3:05 PM by Ryan Gregg

# re: Anatomy of a Software Bug

Great post! It's really nice being able to read information about products I use every day and what it took to develop them and resolve their issues.
Wednesday, May 19, 2004 4:18 PM by B.Y.

# re: Anatomy of a Software Bug

I remember reading in a book (by Steve Maguire, I think) saying that Mac and Win Word codebases have been merged. Are they split again ? I can see the reason for GUI code branching, but not internal stuff like undo levels.
Wednesday, May 19, 2004 4:44 PM by David Buxton

# re: Anatomy of a Software Bug

Wasn't this bug in Mac Word related to the classic Mac file open limit of 384 open handles? A fairly well known limit to the classic Mac system. Mac Word 6 and later had a habit of using file handles for fast saves without closing them properly, and at some point a user would hit the max open file limit, hence these problems?
Wednesday, May 19, 2004 10:26 PM by Eric Albert

# re: Anatomy of a Software Bug

A couple questions from above are answered by Apple's Technote 1184, "FCBs Now and Forever" (http://developer.apple.com/technotes/tn/tn1184.html).

The quick summary: Mac OS versions prior to 9.0 were limited to 348 open files or, more correctly, open forks. The limit was actually far lower in very early system software releases, but that's another story. Mac OS 9.0 increased the open file limit to 8169.

Mac OS X, being a Unix-like system, has completely different open file limits.
Wednesday, May 19, 2004 11:46 PM by Nate Friedman

# re: Anatomy of a Software Bug

great article, reminds me of the details, and explanations of those that you'll find at http://www.folklore.org/index.py
Thursday, May 20, 2004 2:30 AM by Rick Schaut

# re: Anatomy of a Software Bug

Being a BSD derivative, the open file handle limit on OS X is 256. It can be modified, but that's through a native BSD call. The technote is at http://developer.apple.com/qa/qa2001/qa1292.html
Thursday, May 20, 2004 2:42 AM by Rick Schaut

# re: Anatomy of a Software Bug

B.Y. the Mac/Win code bases were the same as of Word 6.0. We forked the code bases as of Office 98.

In theory, you'd think the internals should remain within the same code base, but, in practice, it becomes a source, and quality, control nightmare. Unless you have everyone doing quality checks on both platforms for every code change, you end up with Win developers breaking the Mac product and vice versa.
Thursday, May 20, 2004 6:15 AM by aellath

# re: Anatomy of a Software Bug

i *knew* there was a reason i assiduously avoided Office! AppleWorks has rarely (in a quick scan of my memory just now, no events occur) screwed up on me.
Thursday, May 20, 2004 8:35 AM by Chris

# re: Anatomy of a Software Bug

Hmm... seems to me that it would just be easier (and better allocate resources on the machine) if Word would just do a better job of cleaning itself up every now and then to not have so many open files.
Thursday, May 20, 2004 9:00 AM by Larry Osterman

# re: Anatomy of a Software Bug

Oh my goodness. 256 handles/process? That's obscenely painful. I'm not surprised this showed up on Macs only.


Thursday, May 20, 2004 10:28 AM by Eric Hildum

# re: Anatomy of a Software Bug

Thanks for the explanation of the Cut and Paste process. Now I understand a particularly nasty bug affecting the US and Japanese versions of Windows Word that I encountered while working in Japan. Summary of problem: receive a document from the US (made using US Word), cut and paste into Japanese document (using Japanese Word). Upon autosave or save, Japanese document is destroyed in memory and on disk. The delayed copying of the text would explain the behavior perfectly. Apparently, there was a bug in the text copy code executed when documents were saved.

By the way I did try to report the bug via our $500,000+/year global support contract with Microsoft, and was told directly by our Microsoft support representative, and I quote, "I wouldn't know how to file a bug report for that." Never was able to get it addressed, even though I had two good sample documents for reproduction of the problem.
Thursday, May 20, 2004 10:48 AM by Beth Rosengard

# re: Anatomy of a Software Bug

Fascinating! Thanks, Rick.
Thursday, May 20, 2004 11:26 AM by John A.

# re: Anatomy of a Software Bug

My God! You mean to say that you couldn't pin this bug for years because you couldn't get to it with the debugger? What about debugging through code manipulation? This isn't Schroedinger's cat.
Thursday, May 20, 2004 11:50 AM by Kiliman

# re: Anatomy of a Software Bug

One of the things that really bugs me is when you get misleading or unhelpful error messages (like "Unexpected Error").

I'm assuming that since Windows and OS X are completely different platforms, that the "Disk is full" error is coming from Word and not the OS.

Wouldn't it have been possible to instrument Word so it would display the actual error code returned from the OS? I imagine for Windows the underlying error code would be ERR_TOO_MANY_FILES_OPEN.

Kiliman
Thursday, May 20, 2004 12:13 PM by skeptic

# re: Anatomy of a Software Bug

The part about this story that bothers me is this line: "At this point, we still don’t know that the problem involves the OS’ open file limit. That discovery didn’t happen until this past summer." Well, users had posited this years ago as the explanation of the problem. So why didn't this knowledge filter up to the MacBU? Because there's no way to report it? Because Microsoft is in denial? Also, why doesn't Microsoft just stop spewing these temporary files all over my hard disk? (Remember when they used to be *visible*???) Isn't there anyone at Microsoft who is ready to admit that this architecture sucks?
Thursday, May 20, 2004 12:44 PM by Rick Schaut

# re: Anatomy of a Software Bug

John, I did leave out a few details in the story. In answer to your question, yes. I wrote almost as much debug-only code trying to track this down as I've written shipping code for some features.

Kiliman, that sounds easy on the surface, but there are several places where a failure in an OS call could result in that error message. Knowing where to instrument is almost as difficult as tracking down the bug itself.

Remember, also, that the key point in this, the discovery about the open undo tracking while header/footer fields were being updated, would never have been caught by an instrumented version. It's also a scenario that can't be duplicated by running a macro to simulate user behavior. The key to getting that down was having the bevy of diagnostic tools available on Mac OS X.

Skeptic, there's a difference between "knowing" the source of a problem and being able to prove it. It's like knowing that someone has committed a crime, but being unable to prove that fact in a court of law. What we couldn't do was prove that this was the source of the problem. A sure way to introduce other bugs that are potentially worse than the one you're trying to fix is to make some change to the code when you can't prove that the change actually fixes the problem.
Thursday, May 20, 2004 2:16 PM by MacJack44

# re: Anatomy of a Software Bug

This saga is truly informative (if exhausting), and supports my growing conviction that both OS development (by any company) and software development (by any company) have reached an overload limit.

Notice that in both MS and Apple cases, advancement of the OS has resulted in endless updates, vulnerability discoveries and too much time and money spent by consumer/users just to "keep up." The same applies in "simple" word-processing apps. AppleWorks ported to OS X is buggier, less useful and more annoying than it ever was under OS 9 (for example). Third party T/Es like Nisus Writer have taken ferrrevvverrr to reach full featured useability and suffer from the same kind of "generational bugs" as OS X and AppleWorks.

MS Word and MS Office are a perfect example of trying to "do all, be all" to prospective customer / users. But this is consumer choice, the line of thinking goes, "Better buy the whole hog, might need it some day..." Or the ever-popular: "Gotta have the latest version" muck. When in fact, I'd bet there's a large percentage of users who use only a fraction of MS Office features. And, to "just write a novel" you need only use RichText format.

This nonsense causes real problems: when publishers, for example, demand electronic submissions be in Word format. Why? There's no legitimate reason, other than a kind of mindset prejudice.

Simply put from the consumer / user standpoint: We have committed the sin of expecting too much from our computers. The "computer companies" are only trying to satisfy "Demand" (with a capital "D"). Their efforts (as exemplified by Rick's story) have been heroic, yet there's no end in sight for this kind of problem -- which ultimately falls on the shoulders of end users.

I'm not against "newest and best" but have disciplined myself to use and want "just what I need" to communicate, to enjoy myself and to stay informed. I think it's time that everybody just slowed down a bit.
Thursday, May 20, 2004 2:48 PM by Ryan Clark

# re: Anatomy of a Software Bug

Wow. As a Mac user, and someone who's been using Word for quite a while, it was fascinating to read about what had been causing the dreaded "disk is full" bug. I used to work as a consultant in our university computer lab and it was incredibly frustrating when people using Word 2001 on the Macs would run into this problem.
Thursday, May 20, 2004 3:32 PM by John Fisher

# re: Anatomy of a Software Bug

Very well-told story.

I am completely in sympathy with the problem. It's not MS's fault that Word is too large - nobody wants Works, though it's usually free; it's not the developers' fault that Word changes too much and too often. The problems are endemic to large, complex software with millions of users. Other Windows software I use is equally buggy, and frequently has less usable design.

However, I also think there is a near-total lack of two aspects of good engineering practice at MS: 1) they have never understood trace, logging, and error messages; 2) they do (did) not implement code dumps correctly.

As one commenter pointed out, if they had simply passed through the actual OS error it would have helped. Better yet, the code that failed should have logged a failure indicating what code failed and why.
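To make the "pass through the actual OS error" idea concrete, here is a minimal Python sketch. It assumes, as the open-file-limit discussion in this thread suggests, that the real failure was something like running out of file handles rather than disk space. The function name `describe_os_error` and the message format are invented for illustration; nothing here reflects Word's actual code.

```python
import errno
import os


def describe_os_error(exc: OSError) -> str:
    """Build a message that keeps the real OS error visible (hypothetical helper)."""
    code = exc.errno
    if code is None:
        # No numeric code available; fall back to the exception text.
        return f"Save failed: {exc}"
    # errno.errorcode maps numeric codes to symbolic names such as "EMFILE".
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"Save failed: {name} (errno {code}): {os.strerror(code)}"


# Simulate two different failures that a generic "disk is full" message would conflate.
too_many_files = OSError(errno.EMFILE, os.strerror(errno.EMFILE))
disk_full = OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))

print(describe_os_error(too_many_files))
print(describe_os_error(disk_full))
```

With the symbolic name in the message, a support engineer could tell the two cases apart at a glance, even if the end user still sees a friendlier summary.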

If they were able to dump their code correctly, they would be able to run Word in the debugger, dump it, and sift through the output to find which code failed, and what the values were at the time it failed. My (NT4-era) experience was that the debugger was buggy and symbols did not always align. Having a power-of-two number for a value would have been a strong hint here.

MS paid support is infamous. We had the same dead-end experience with fundamental problems with NT4 Wolfpack failover (in the end it never worked) at a company in which MS had some ownership.

Lastly, there should never be 'unlimited' anything. Unlimited Undo is a marketing standard, not good engineering. Developers should always place arbitrary limits on repetitive actions to prevent unknowable results like this. If there had been a limit, and a log entry had then said, "UnDoer reached u_limit", all would have been easily fixed. Their undo function is so complex that it may not qualify as 'repetitive.' If so, this might be a sign that it is inherently too complex.
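The "arbitrary limit plus a log entry" suggestion can be sketched in a few lines of Python. This is only an illustration of the commenter's idea, not a description of how Word's undo works: the class name `BoundedUndoStack`, the limit of 3, and the action strings are all made up, and the log text simply borrows the wording from the comment.

```python
import logging
from collections import deque

logger = logging.getLogger("undo")


class BoundedUndoStack:
    """Undo history with a hard cap that logs whenever the cap is hit."""

    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self._limit = limit
        self._items: deque = deque()

    def push(self, action: str) -> None:
        if len(self._items) >= self._limit:
            # Drop the oldest entry and leave a trace in the log.
            dropped = self._items.popleft()
            logger.warning(
                "UnDoer reached u_limit (%d); dropped %r", self._limit, dropped
            )
        self._items.append(action)

    def pop(self) -> str:
        """Return the most recent action, as an undo command would."""
        return self._items.pop()

    def __len__(self) -> int:
        return len(self._items)


stack = BoundedUndoStack(limit=3)
for action in ["type a", "type b", "bold", "italic"]:
    stack.push(action)
```

After the fourth push the oldest action is discarded and a warning is logged, so the history never grows past the cap and the log records exactly when the limit was reached.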
Thursday, May 20, 2004 3:33 PM by RG

# re: Anatomy of a Software Bug

Thanks for the detailed walkthrough ... I would love to see a similar explanation as to why text copied from Word 2004 and pasted into iChat comes out as a black graphic blob. It will paste fine into Mail, TextEdit, etc. (and then copy/paste from there into iChat), but it won't go directly into iChat.

I suspect it's an iChat issue, since it pastes fine elsewhere, but I can't find another Mac program that causes the same iChat behavior...
Thursday, May 20, 2004 5:09 PM by Paul Berkowitz

# re: Anatomy of a Software Bug

Great story, Rick.

But really only 256 open files in OSX, compared to 8196 in OS 9?? I remember the joy that ensued when 384 increased to 8196. Is there something else in OS X that makes such a minute number of open files feasible? (Such as - do they get closed automatically or something like that?) I have never run into this limit in OS X in any app. Something just doesn't sound right here, Rick. Do you, or anyone, have an idea where to look this up?
Thursday, May 20, 2004 5:42 PM by Kiyooka

# re: Anatomy of a Software Bug

Fascinating reading about how things work from the inside.

I am looking for a contact in the MacBU Office group. As creator of arguably one of the most successful office add-ins of all time, and now on the mac, I'm interested in writing some add-ins for MacOffice. I sent some email to s sinofsky but he's apparently PC only, as are all my other contacts.

Do you guys have an add-in/office evangelist that you could point towards me? My contact info is on my blog site (above).

-gen
Thursday, May 20, 2004 6:05 PM by OS X file limit

# re: Anatomy of a Software Bug

256 is only the baseline limit in OS X, per process.

In OS 9, I believe the limit was global (but changeable by some utilities).

The actual limit in OS X depends on the amount of RAM in the system (more RAM, more files allowed). Also, this limit can be changed by the application. I don't know if there is a fixed upper limit.

Photoshop and Illustrator hit the same limit when porting to OS X (due to an OS bug that left files open after certain API calls).
Thursday, May 20, 2004 10:11 PM by TrackBack

# Betalogue

Betalogue
Thursday, May 20, 2004 7:33 PM by SomeRandomGuy

# re: Anatomy of a Software Bug

Larry, the 256 open file limit is a soft limit imposed on processes so that they can't go around eating system resources. In a shell you can use 'ulimit' to increase it, or in an app you can use a BSD API to increase it if needed. There is a way to increase the limit globally for all processes, but I don't want to post it since it is a bad idea to ever use it.

The default open file limit for the system is 12288, but this can be increased using the sysctl command to increase kern.maxfiles if you really need to.
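For readers wondering what raising the soft limit from inside an app looks like, here is a small sketch using Python's `resource` module, which wraps the `getrlimit`/`setrlimit` calls available on BSD-derived and other Unix-like systems. The helper name `raise_open_file_limit` and the target of 1024 are assumptions chosen for illustration; the comment above does not say which call or value any particular application used.

```python
import resource  # Unix-only module


def raise_open_file_limit(wanted: int) -> int:
    """Try to raise this process's soft open-file limit; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough, nothing to do
    # The soft limit can never exceed the hard limit.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Some systems reject values above a kernel-wide cap; keep the old limit.
        return soft
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


before = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
after = raise_open_file_limit(1024)
print("soft limit before:", before, "after:", after)
```

This only changes the limit for the calling process and its children, which matches the per-process behaviour described in this thread.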
Thursday, May 20, 2004 7:36 PM by SomeRandomGuy

# re: Anatomy of a Software Bug

I didn't explicitly explain it, so I guess I should say that the 256 limit is per-process. So Word eating 240 files has no effect at all on Photoshop etc. until you get into the 12288 open files range at which point you probably have bigger problems.
Friday, May 21, 2004 4:55 AM by JD

# re: Anatomy of a Software Bug

I've dealt with this problem at clients for a long time now and to have a different perspective is helpful. However, I'm still of the opinion that most people would be productive and happy with the equivalent of Word 5.1. There wasn't a lot of extra stuff in the way and one could get a document completed quickly without the "wizards" and "assistants" popping up and needing to be killed one by one.
Take the code from that, port it, sell it for $99. I'd suspect that lots of people would buy it because it would be small, fast, and easy to use.
I'm of the opinion that smaller and more focussed is the way that software should be looking.
Friday, May 21, 2004 9:41 AM by TrackBack

# Betalogue

Betalogue
Friday, May 21, 2004 9:23 AM by Gideon Greenspan

# Runtime error logging

Rick,

Interesting post. But maybe I'm missing something - surely this would have been tracked down much earlier by some runtime error logging code, switched in and out with a flag? You should be able to flip a header flag, and get *every* error code returned by an OS call or internal function, logged assertion-style in a text file with the source file name, line number and error code. For projects the size of Word, I think this should be built-in from the bottom up. It also lets you send debug builds to your users and then ask them to send back the log - it's been a lifesaver many times with my own programs.

Gideon
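The kind of switchable error logging described above might look roughly like the following Python sketch. It is an assumption-laden illustration, not anyone's actual instrumentation: `ERROR_LOGGING_ENABLED` stands in for the "header flag", `check` and `fake_os_call` are hypothetical names, the error value -1 is arbitrary, and an in-memory buffer replaces the text file so the example is self-contained.

```python
import io
import logging

ERROR_LOGGING_ENABLED = True  # hypothetical switch, standing in for the "header flag"

log_buffer = io.StringIO()  # a real build would more likely write to a text file
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(filename)s:%(lineno)d %(message)s"))
error_log = logging.getLogger("runtime_errors")
error_log.addHandler(handler)
error_log.setLevel(logging.DEBUG)
error_log.propagate = False


def check(code: int, what: str) -> int:
    """Log any non-zero result code with the caller's file and line, then return it."""
    if ERROR_LOGGING_ENABLED and code != 0:
        # stacklevel=2 (Python 3.8+) attributes the record to the caller of check().
        error_log.error("%s returned error %d", what, code, stacklevel=2)
    return code


def fake_os_call(succeed: bool) -> int:
    """Hypothetical stand-in for an OS call that returns 0 on success."""
    return 0 if succeed else -1


check(fake_os_call(True), "fake_os_call")   # success: nothing is logged
check(fake_os_call(False), "fake_os_call")  # failure: one line is logged
print(log_buffer.getvalue(), end="")
```

Each failing call produces one line with the source file, line number, and error code of the call site. As Rick notes earlier in the thread, the hard part in practice is knowing which of the many call sites to wrap.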
Saturday, May 22, 2004 1:43 PM by Larry Osterman

# re: Anatomy of a Software Bug

On my current Windows XP machine, the handle count for the system is 10509. This isn't just file handles; NT doesn't differentiate between file handles and other kinds of handles.

There are 14 processes with more than 256 handles open, including IE (345) and perfmon (314).

I'm surprised that a modern operating system like OSX has hard coded limits of any kind to be honest.

But this is totally off-topic, and irrelevant to a rather remarkable piece of detection.

Btw, for those like Gideon, John A, and Kiliman: the fact that the Mac has such a restricted limit and Windows effectively doesn't have a limit drastically reduces the ability to diagnose the problem. With NT processes routinely having hundreds of handles open, Word's having one or two more simply falls out in the noise factor, while on OS X, those two handles could easily be the difference between a trivial resolution and one that requires much more work.
Saturday, May 22, 2004 10:20 PM by Josh

# re: Anatomy of a Software Bug

Yadda. Yadda. Yadda.

I've got Word v.X and have installed updates 10.1.2, 10.1.4, and 10.1.5. And yet the bug still occurs... specifically when I have been working for hours and have saved the document many, many times. (In other words, only when I've worked really hard, and am really stressed and tired.)

I suppose the most important question now is: is this fixed (finally) in Word 2004?

Sheesh.
Tuesday, May 25, 2004 6:03 AM by Jason

# re: Anatomy of a Software Bug

hi,
i came across a software quote on a web site that says something like, vendors create buggy software to be able to sell it. as i remember this is from a famous author, but i lost the address. does anyone know about this quote and its author?
thanks
Thursday, May 27, 2004 1:33 PM by Benjamin Huot

# re: Anatomy of a Software Bug

I stopped using Word within a year because of it crashing so often. That is what interested me in coding (so I could fix the errors) until I found that there was free software available that had over 90 percent of the functionality. And if there was an error, at least I didn't have to pay hundreds of dollars to call Microsoft or for buying it in the first place. Then when OS X came out, I found that you could buy commercial software from small companies, whose software is much cheaper and their support is much better. A large company should be able to support their products much better and have the money to test it in more conditions for errors, yet they don't. I have a strong suspicion that the bigger the company, the bigger the scam. It is a similar problem to how the US can possibly lose a war when they are at least 20 years ahead of the other most advanced Army in the world (of the technology that isn't classified). You can have overwhelming firepower, but if you don't have the intelligence to focus it at the right place and time, you can be defeated by a much smaller force. I would buy products from Microsoft if they would certify that they hadn't coded anything on them. People aren't switching to Linux because of the licenses; they are doing it because Microsoft products don't work right. Microsoft's poor quality makes the whole industry look bad and holds back innovation. Why buy new software or a new computer when it doesn't provide any more real value and the support is still unaffordable?
Saturday, May 29, 2004 5:17 AM by dave rogers

# re: Anatomy of a Software Bug

Rick,

Thanks for your weblog, and for what you do at BU.

I've never been a big fan of MS (just ask Scoble), but you don't deserve the negative comments being left in your weblog.

The point about human beings not being fully debugged is amply demonstrated here.

I hope you keep writing stories like this, and _some_ of the comments have been interesting and useful.
Monday, May 31, 2004 10:35 AM by TrackBack

# Betalogue

Betalogue
Tuesday, June 01, 2004 4:03 PM by Julie Krauss

# re: Anatomy of a Software Bug

Rick,

Amen! to all Dave Rogers' comments.

Those of us who have been professionals know how hard it can be to figure out the problem. There's even a novel about this exact issue: "The Bug," by Ellen Ullman.

Once again, Rick, thanks for the story. It's fascinating.

Tuesday, June 22, 2004 10:05 AM by Michael

# re: Anatomy of a Software Bug

A possible lead:

For a while I was running into the symptoms described very frequently in Word v.X, especially while translating documents with complex formatting (tables, styles, graphics, etc.). I use a translation assistance tool written in VBA called Wordfast (see wordfast.net), which among other things is able to automatically set the language property of translated segments to the desired target language. (Of course, you still have to do the actual translating yourself.)

While working on one job, I started running into the "open files" bug so frequently that I couldn't work properly. I used a tool called Sloth (sorry, I don't have a URL handy) that identifies the actual names and paths of open files for each active application. I didn't take notes at the time, so I no longer have the actual filenames, but I found a very large number of temporary files that were apparently related to spelling and grammar checking in the target language. When I temporarily removed the proofing tools for that language from their standard location, I stopped encountering the "open files" bug every few minutes and was able to continue work. This was several months ago and I don't remember exactly what I did, but I think I probably also turned off the Wordfast feature that sets the language property of target-language segments.
Thursday, June 24, 2004 1:53 PM by bynkii.com's Mac Matters

# In Search of a Bug

Rick Schaut, of the Microsoft Mac BU has written an excellent article on how hard it can be to track down a bug. However, lest one thinks that only big applications like Word can be that hard to troubleshoot, let...
Friday, June 25, 2004 3:24 AM by Buggin' My Life Away

# Scripting Word

Friday, July 16, 2004 1:40 PM by Maireth

# re: Anatomy of a Software Bug

http://word.mvps.org/FAQs/WordMac/DiskFullError.htm

I found this helpful. It is a tweak that lets you clear out all the temporary files opened by Word so that you can continue working without Microsoft's "handy" fix of telling you to close every 20 saves. Really, who counts?

Rick, thanks for posting. It was fascinating to see this problem so fully dissected.
Wednesday, July 28, 2004 1:02 AM by Norman Diamond

# re: Anatomy of a Software Bug

5/20/2004 10:28 AM Eric Hildum

> Summary of problem: receive a document from
> the US (made using US Word), cut and paste
> into Japanese document (using Japanese
> Word). Upon autosave or save, Japanese
> document is destroyed in memory and on disk.

Odds are that it didn't matter if one of the source documents had been made using US Word. Microsoft occasionally tested US Word (including but not limited to the case described in this blog entry). Odds are that the bug is either wholly within Japanese Word, or within the combination of Japanese Word and Japanese Windows.

> The delayed copying of the text would
> explain the behavior perfectly. Apparently,
> there was a bug in the text copy code
> executed when documents were saved.

The bug could be anything related to the way Word stores its documents, not necessarily related to copying and pasting. Though it is fortunate that the copy on disk wouldn't get corrupted until you actually did a save.

> By the way I did try to report the bug via
> our $500,000+/year global support contract
> with Microsoft, and was told directly by our
> Microsoft support representative, and I
> quote, "I wouldn't know how to file a bug
> report for that."

Surely your support contact only understood the US. Microsoft's idea of globalization is still pretty much US-centric. If you had a support contract with Microsoft Japan then you would be able to submit a report.

But even if you could get the report submitted, odds are that it would still never get solved. There are quite a lot of things you can see in Japanese versions of Office, and even just Windows without Office, that make it pretty clear that no testing was ever done. Things happen in the Start menu on the first reboot after installation, that could not be missed by anyone except Microsoft. Well, one of the bugs introduced by Windows NT4 SP4 was half-fixed in SP5, but has never been fully fixed in Japanese versions of Windows. In the US it was fixed in Windows 2000 during beta testing, but you expect a fix in US Windows' handling of Japanese to be copied back in to fix Japanese Windows' handling of Japanese, ha, no way.
Tuesday, August 03, 2004 4:36 AM by John J. Rynne

# re: Anatomy of a Software Bug

Rick, thanks for the interesting article. So that's what caused me and my company to lose so much work :-(
We ran into this on upgrading to Word 98. Called Microsoft support and got little relief. It would have been very helpful if they had said "Yes, it is a known bug - please do not adjust your set". But they didn't, so I reinstalled the OS and Office, on all machines - and still couldn't fix it.
In fact, a "fix" came out, but we continued to have the problem on and off for a long time.
We resorted to quitting Word every hour or so. Certainly, the problem could be attributed to "saving too often".
Friday, August 27, 2004 1:23 PM by Lee Packham's Corner

# Anatomy of a Software Bug

Friday, August 27, 2004 2:36 PM by TrackBack

# Ensight - Jeremy C. Wright » Anatomy of a Microsoft Bug

Ensight - Jeremy C. Wright » Anatomy of a Microsoft Bug
Friday, August 27, 2004 5:11 PM by TrackBack

# Joakim Andersson's blog » A bug is not always that easy to fix

Joakim Andersson's blog » A bug is not always that easy to fix
Saturday, August 28, 2004 12:08 PM by TrackBack

# skrud.net

skrud.net
Sunday, August 29, 2004 2:25 AM by Shaghaghi.net

# Anatomy of a Software Bug

Anatomy of a Software Bug...
Sunday, August 29, 2004 6:12 AM by TheophileEscargot from Hulver's site

# King of the World

Me. What I'm Not Watching. What I'm Downloading. Cups. What I'm Listening To. What I'm Reading: "Ancient Light" (with SPOILERS and ENDING). Web.
Contains great content EXCLUSIVE to loyal weekend readers!
Plus a super-multipoll!
Me Phew. Definitely need a three day weekend right now: feeling very abraded by the...
Sunday, August 29, 2004 8:19 PM by Chris Online

# Anatomy of a Software Bug -MS Word

Anatomy of a Software Bug

Thursday, September 02, 2004 6:54 AM by Lee Packham's Corner

# Anatomy of a Software Bug

Wednesday, November 03, 2004 2:56 PM by WebMonster blog

# Anatomy of a Software Bug

Wednesday, November 03, 2004 7:16 PM by TrackBack

# diego sevilla's weblog » The Anatomy of a Software Bug

diego sevilla's weblog » The Anatomy of a Software Bug
Friday, November 05, 2004 11:59 AM by TrackBack

# Bug Links » Undo in Word 6

Bug Links » Undo in Word 6
Thursday, December 23, 2004 11:30 AM by Buggin' My Life Away

# You Don't Need Word

Tuesday, January 18, 2005 1:14 PM by The Old New Thing

# re: The importance of error code backwards compatibility

Friday, April 21, 2006 12:37 AM by CyraX’s lair » MS Word ‘disk-full’ error

# CyraX’s lair » MS Word ‘disk-full’ error

Thursday, June 22, 2006 11:43 PM by Buggin' My Life Away

# Opportunity Cost

Back when I started blogging, I had a go-around with Pierre Igot.  I'm not going to rehash it, but I...
Tuesday, July 25, 2006 7:27 PM by How things (should) work

# Debugando cebolas

The post is a bit old, but I only read it today and it is worth the read: Anatomy of a Software Bug
It tells how a...
Wednesday, September 06, 2006 7:36 PM by npat's blog - the anatomy of a bug

# npat's blog - the anatomy of a bug

Wednesday, September 06, 2006 7:36 PM by the anatomy of a bug by npat () | LjSEEK.COM

# the anatomy of a bug by npat () | LjSEEK.COM

Monday, April 02, 2007 9:13 PM by Inherent Quality by Ron Richard

# Portals

Field of dreams… The software and IT industry is a field of dreams. More than ever all can come to the field to offer ideas and contribute to its evolution. One means of doing so is through portals. A ...

Wednesday, January 30, 2008 9:18 PM by Anatomy of a Software Bug

# Anatomy of a Software Bug

Wednesday, June 11, 2008 3:30 AM by Eric Hu's Weblog

# A Software Tester is...

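I am a software tester, but I suddenly realized that the "Test" category on my blog has surprisingly few posts, so I think I should talk more about software testing. Recently I have been reading blogs by some other experts,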

Saturday, June 14, 2008 9:59 AM by IWebThereforeIAm

# Anatomy of a software bug

[Anatomy of a software bug] A Microsoft developer explains how a tricky bug in Word's undo stack behavior was tracked down....

Sunday, September 07, 2008 2:14 PM by Schwieb · How to report a bug

# Schwieb · How to report a bug

# the boy in the bubble » Blog Archive » Vom Bug zum Feature

# Word 2004 BUG: Disk full when it is not (OsX 10.4.4) | keyongtech
