Repro, Man

Written communication is amazingly difficult. Even when you’re aware of some of the pitfalls, as I was when I wrote my last post about the disk full save error in Word, it’s still all too easy to say something that other people will take to mean something other than what you had intended. Some of the comments to that post, both here and in other blogs, clearly indicate that I’d fallen into one of those pitfalls. And it’s a simple pitfall. I’d used a word that has a particular meaning within this context to some readers while it has a completely different meaning in other contexts.

I’d said that we didn’t know that the real problem with the disk is full error involved hitting the OS’ open file limit. The problem is that the word “know” can have several shades of meaning such that not knowing something can span an entire range from having no clue whatsoever to having a very strong suspicion while still being unable to conclusively prove something as a matter of fact. Comments I’ve read, both here and elsewhere, reflect this range of meaning in a rather interesting way. The more someone appeared willing to attribute cluelessness to my remark, the less that person actually knew or understood of the process of software development and fixing software bugs.

I point this out, because there are those who have an extremely judgmental attitude that goes beyond simply being rude. There is a tendency among some to respond to this kind of ambiguity by assuming the worst, or, more accurately, by choosing a meaning that confirms their own prejudice while giving little or no consideration to other possible meanings. These are people who will take a remark out of context, and turn it into a personal attack. To the people who do this, and, frankly, I think you know who you are, if what you write personally attacks me, then your words serve no useful purpose other than the possible gratification of your own egos.

Anyway, I digress. I’m sorry for the ambiguity. Please allow me to dispel it by telling the story of another problem that we ran into with early versions of Jaguar. Jaguar beta testers were running into strange crashes in Word. We collected a large number of crash logs, all of which involved a protected memory violation in either GetHandleSize or memmove. The former is an operating system call; the latter is a compiler intrinsic that moves data from one location in memory to another. Both of them involve loading an address and dereferencing the address as a pointer.

Now, there are two things an experienced developer can immediately tell you about this problem. The first is that some piece of code, somewhere, has overwritten some data that didn’t belong to it yet wasn’t in a block of protected memory—in essence, a buffer overrun. The second is that the actual bug could be anywhere, and not at all related to any of the code that happened to be executing at the time of the crash. These are nasty bugs, because figuring out what’s wrong requires a significant amount of sleuthing.

So, we looked for more information about the scenarios that were causing the crash both by asking users what they’d been doing before the crash and by having testers try to reproduce the problem based on what users were telling us. It took a while, but we found two common elements to the problem. The first is that, at some point, someone had clicked on the font drop-down in the formatting palette (or the font menu itself) and discovered some blank items on the menu or drop-down. The other is that the crash most often happened when they did something that added a file to the most-recently-used (MRU) list on the File menu.

That information was enough for me to take a look at the order in which various chunks of memory were being allocated, and, sure enough, Word’s internal data structure for the font menu was being allocated immediately before the handle (a “handle” being a doubly-indirect reference to a piece of data in memory allowing the data to be moved around without requiring all references to be updated after the move) for the data structure that contains the files on the MRU list. This was a major clue. If the “GetHandleSize” crash involved the MRU handle, then it’s likely that some piece of code for maintaining the font menu had a buffer overrun.
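For readers who haven’t worked with handles, here is a minimal sketch of the idea in Python. It is a toy model, not the Mac OS Memory Manager: the class `TinyHeap`, its methods, and the sample data are all made up for illustration. The point it shows is that callers hold an index into a table of master pointers, so a block can move and only the master pointer needs updating.

```python
class TinyHeap:
    """Toy relocatable heap (hypothetical): a handle indexes a master
    pointer, and the master pointer records where the block lives."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.master = []   # one [offset, length] master pointer per handle
        self.top = 0       # next free offset in memory

    def new_handle(self, data):
        offset = self.top
        self.memory[offset:offset + len(data)] = data
        self.top += len(data)
        self.master.append([offset, len(data)])
        return len(self.master) - 1   # the handle is an index, not an address

    def handle_size(self, handle):
        # Loads the master pointer and reads through it. If the master
        # pointer has been overwritten with garbage, this is where it hurts.
        return self.master[handle][1]

    def read(self, handle):
        offset, length = self.master[handle]
        return bytes(self.memory[offset:offset + length])

    def move_block(self, handle):
        """Relocate a block to the top of the heap. Only its master
        pointer changes; every outstanding handle stays valid."""
        offset, length = self.master[handle]
        data = bytes(self.memory[offset:offset + length])
        self.memory[self.top:self.top + length] = data
        self.memory[offset:offset + length] = bytes(length)  # clear old spot
        self.master[handle][0] = self.top
        self.top += length


heap = TinyHeap(64)
fonts = heap.new_handle(b"Geneva")
mru = heap.new_handle(b"report.doc")
old_offset = heap.master[mru][0]
heap.move_block(mru)
# The handle value held by the caller did not change, but the data moved.
assert heap.read(mru) == b"report.doc"
assert heap.master[mru][0] != old_offset
```

The double indirection is also why a corrupted handle tends to blow up in a call like GetHandleSize: that call has to follow the reference to do its job.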

So, I poked around in the font menu code, and found what would very likely be the cause of the problem. Apple added some APIs that would allow applications to ascertain whether or not the contents of the font menu had changed since the last time the application had asked for the font menu. For Word X, we added support for this so that we could update the font menu. Unfortunately, there was another, rather ancient, piece of code that assumed the contents of the font menu would never change. This piece of code had what would be a buffer overrun should the contents of the font menu ever grow as a result of calling this new API.
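The shape of that bug can be sketched in a few lines of Python. This is a simulation under loudly stated assumptions, not Word’s code: the 8-byte font slots, the 4-byte MRU count sitting right after the font table, and every function name here are hypothetical. It uses one fixed byte array as a stand-in for memory so that the overrun is visible instead of fatal.

```python
FONT_SLOT = 8      # hypothetical: bytes reserved per font name
ARENA_SIZE = 64    # hypothetical: stand-in for a region of memory


def build_arena(font_count):
    """Lay out a font table sized for font_count entries, followed
    immediately by a 4-byte MRU file count."""
    table_size = font_count * FONT_SLOT
    arena = bytearray(ARENA_SIZE)
    arena[table_size:table_size + 4] = (3).to_bytes(4, "big")  # 3 MRU files
    return arena, table_size


def write_fonts_unchecked(arena, names):
    """Old-style writer: assumes the font list never grows."""
    for i, name in enumerate(names):
        start = i * FONT_SLOT
        arena[start:start + FONT_SLOT] = name.ljust(FONT_SLOT, b"\0")[:FONT_SLOT]


def write_fonts_checked(arena, table_size, names):
    """Same writer, but it refuses to write past the table it was given."""
    if len(names) * FONT_SLOT > table_size:
        raise ValueError("font list no longer fits in the font table")
    write_fonts_unchecked(arena, names)


def mru_count(arena, table_size):
    return int.from_bytes(arena[table_size:table_size + 4], "big")


arena, table_size = build_arena(font_count=2)
write_fonts_unchecked(arena, [b"Geneva", b"Monaco"])
assert mru_count(arena, table_size) == 3          # neighbor is intact

# The font menu grows by one entry, and the old code writes it anyway.
write_fonts_unchecked(arena, [b"Geneva", b"Monaco", b"Optima"])
assert mru_count(arena, table_size) != 3          # neighbor is now garbage
```

Notice that nothing goes wrong at the moment of the bad write. The damage only shows up later, when unrelated code reads the MRU data, which is why the crash logs pointed at GetHandleSize and memmove rather than at the font menu code.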

OK, so I wrote a fix for this, but we still had a problem. I’d come up with this fix not by actually reproducing the problem and watching the buffer overrun happen under a debugger, but by deducing where the buffer overrun occurred. What we lacked was a consistently reproducible set of steps that showed how the crash was happening for actual users. Without that set of steps under which we could consistently reproduce the crash, we had no way to prove that this fix resolved the problem that users were seeing. We strongly suspected that we had a fix, but we didn’t know for sure.

The problem was complicated by the fact that no one, neither the testers who could get the crash to happen occasionally nor the users who had reported the crash and had given us information about what they were doing, had actually done something that would have caused the contents of the font menu to change. No one had added fonts to either the system font folder or their user library folder. Something else was afoot, and we needed some help from Apple to figure it out.

So, one of the Word program managers sent a piece of e-mail off to one of our contacts at Apple asking about what circumstances in Jaguar would cause this new API to report that the contents of the font menu had changed—the point being that we wanted to be able to consistently reproduce the scenario. Unfortunately, this PGM had worded the e-mail in such a way as to imply that we didn’t expect the contents of the font menu to change after calling this API whose specific purpose is to tell the app that the font menu had changed. The Apple contact, quite understandably, asked why, on earth, would we not expect the contents of the font menu to change after calling this API?

At that point, I replied by saying that we certainly expected the contents of the font menu to change, but that there was an ancient piece of code in Word that didn’t expect the contents of the font menu to change. I said, “It sometimes helps to remember that we hired interns this past summer who were younger than Word’s code base,” to which the Apple rep replied, “Gosh! That must be some kind of milestone for an app!”

Well, we got the help we needed, were able to come up with a consistent repro scenario, and were able to prove that the fix I’d implemented did, indeed, resolve the problem that users had been seeing. Thanks to some very hard work by a lot of people, both in and outside of Microsoft, we were able to have a service release of Word ready the day Jaguar shipped, and very few users ever actually ran into this particular problem.

The disk is full on save problem, however, was plagued by our inability to come up with consistent repro scenarios. While we’d suspected the OS file limit was the root cause for some time, the question that always dogged us was why, in those rare instances where we came up with a repro scenario, we could not reproduce the problem while running under the debugger. It wasn’t until we learned that the debugger bumped the OS open file limit for the debugged process that we knew the answer to that question, and could proceed. Moreover, as we’d fix one scenario in one particular way, finding other repro scenarios got increasingly difficult.
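To see how a raised limit can hide this kind of failure, here is a small sketch using Python’s `resource` module on a Unix-like system. It is an illustration only and says nothing about how Word or the debugger actually managed files: the function name `open_until_limit` and the temporary limit of 64 are made up for the example.

```python
import errno
import os
import resource


def open_until_limit(temporary_soft_limit=64):
    """Lower this process's soft open-file limit, open files until the
    OS refuses, then restore the original limit.

    Returns (files_opened, errno_name).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may not exceed the hard limit.
    if hard == resource.RLIM_INFINITY:
        limit = temporary_soft_limit
    else:
        limit = min(temporary_soft_limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))

    descriptors = []
    error_name = None
    try:
        while True:
            descriptors.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        error_name = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        for fd in descriptors:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return len(descriptors), error_name


if __name__ == "__main__":
    opened, error_name = open_until_limit(64)
    print(f"opened {opened} files before the OS said {error_name}")
```

Run the same loop under a higher limit and it simply gets further before failing, or never fails at all for a realistic workload. That is the trap: identical steps, two different outcomes, and the only variable is a per-process setting that the debugger changed behind our backs.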

The point of all of this is that fixing a bug requires two things. The first, of course, is some way to diagnose the actual problem and come up with a fix. But the second, and not so obvious, is the need for a consistent repro scenario that enables us to prove that the fix actually resolves the problem. Without both, we’re really only guessing, and even though it might be a very well educated guess, it’s still only a guess.

We resolved the font menu buffer overrun in September of 2002. Since then, the age of Word’s code base has surpassed the legal drinking age in all 50 states. With a code base that old, you simply can’t afford to release a fix based on even an educated guess. The risk that you’ve introduced a problem worse than the one you’re trying to fix is simply too high. At that point, the only alternative is to roll the fix into the next major release, which gives us enough time to pound on the fix in a wide variety of scenarios such that we have sufficient confidence in the efficacy of the fix.

Now, why tell these stories? Why even bother to air out our dirty laundry like this? Certainly not because I’m a masochist. Rather, I’m a pragmatist. Word has a number of quirks, funky behaviors, and downright nasty bugs (of the latter I’m certain, even though I can’t prove it). Fixing them requires a cooperative effort between users and those of us who work on the product. We need as much information as you can give, but I want people to understand the difference between actual information and speculation about the causes of the problem.

I can pretty much guarantee that you won’t be able to out-speculate those of us who work on the product. For any given problem that you’ve encountered, there’s a pretty good chance that we have suspects that never occurred to you. Sgt. Joe Friday’s most famous line is apropos. Speculation, especially speculation that is highly judgmental, is utterly useless to me, and a waste of both your time and mine. I gotta have a repro, man.

And for all of those people blogging about Word, don’t think we don’t read what you’ve had to say. In fact, right now, there’s a team of developers, testers and users who are scouring the Internet and various newsgroups for any reports of problems with Office 2004. They’re known as the MacSWAT team. They include all of the Mac Office MVPs, people who really have both my admiration and undying gratitude.