In my last post I discussed the subject of shipping quality products (which is really rather different from code quality or stability, or many other measures one could use). And let's not forget design quality - obviously, if a product does not meet the customer's needs or is too hard to use, it doesn't matter how stable or well written it is. I got quite a few comments on my last post, so I guess this is an interesting topic. For the record, I am not writing a book here, so I am not necessarily exhaustive when I describe bugs or coding techniques. The remarkable thing about Watson, and the reason I wanted to write about it, is that it is a new way to make products better, one based on measuring the real world. It is not an excuse to avoid good code architecture, or to skip thinking about product design. One person pointed out that non-crash bugs can be more annoying than crashes - that's very true, although crashes and hangs are pretty nasty. In fact we're looking into ways to extend Watson beyond crashes and hangs; the Watson guys have a bunch of exciting ideas around that.

I mentioned last time that the goal is to ship a product with the highest known quality, not necessarily with the fewest bugs. This is a counter-intuitive concept, so I'll explain a little more. The naïve way to think about bugs is that if you fix a bug, the product is better as a result. The truth is that it probably is better, but you cannot say for certain, right after you check in your change, that it absolutely is better. You may have inadvertently introduced a different problem by fixing the issue you were dealing with. That kind of bug is called a "regression" - the product quality has regressed (decreased) as a result of trying to improve it.

In the example I gave, fixing a redraw problem in a toolbar button may seem pretty harmless, but in fact something is now different, which might cause a problem somewhere else that you did not anticipate. If you doubt this, then you haven't developed enough software, or had enough contact with the people who use your stuff, to know that you sometimes make things worse and you don't always find out immediately. In the toolbar button example, if you fix it by setting a flag on a call to the system which causes the video driver to be called in a different way, that will probably be fine on your machine, but some fraction of people out there might be using a video driver that can't handle that particular request, and their machines crash. Unless you or your test team happen to have that video card and driver, and happen to try this new code out, you will have no idea that your fix to a minor visual glitch has now caused 1.5% of your future users to have a horrible experience with bizarre crashes that seem to have no cause, since their own actions are not what triggers the problem. It may take you months to discover what has happened - months of complaints from a seemingly random set of customers who have no repro steps for their crashes - truly a nightmare. If you made this fix right before you thought you were done, then you'll probably find out about the problem from your customers rather than your test team, which is not good. Once you do find out about this other bug and finally track it down to the video driver (if you ever do), you will of course fix it, but it is a little late - the damage is done. Some customer might have deployed your software on thousands of machines, and your fix is going to have to be deployed on all those machines as well. That little bug might cost your support lines and your customers more money than they paid you for the software, or than you made from it.

So the goal is to get the product to a known state of quality before you put it out there. To maximize the known quality, the process we use in Office (and, for the most part, the rest of Microsoft) is, believe it or not, a staged reduction in bug fixing. Naturally during the course of a project we fix just about every bug we find. But near the end of the project, as it gets harder to find bugs and our predicted ship date approaches, we need to make sure we can control the quality of the product in the end game. We begin a process of deciding which bugs to fix and which ones our customers can live with (because they won't notice them, mainly). Some people will read this and say "I knew it! See, Microsoft guy admits shipping buggy software!". If you conclude that, you did not understand.

Bugs are not all equal - we use a measure called "severity" to gauge the impact of a bug in isolation from how common it is. This is a measure a tester can apply without having to make a judgment call on how important the bug is to fix. It's pretty simple: a "Sev 1" bug is a crash, hang, or data loss. Sev 2 is a serious impact on functionality - a feature cannot be used. Sev 3 is a minor impact on functionality - a feature doesn't quite work right. Sev 4 is a minor glitch - a visual annoyance, etc. The severity by itself doesn't determine whether we fix a bug. That decision rests on another, more subjective measure - the opinion of a program manager, mainly. For example, if the splash screen for Word has "Word" misspelled, that is a Sev 4, but clearly a must-fix.
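
To make the distinction concrete, here is a minimal sketch in Python - the names and fields are invented, and our actual bug database looks nothing like this - of the idea that severity records impact while the decision to fix is a separate call:

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    """Impact of a bug considered in isolation, as a tester records it."""
    SEV1 = 1  # crash, hang, or data loss
    SEV2 = 2  # serious impact - a feature cannot be used
    SEV3 = 3  # minor impact - a feature doesn't quite work right
    SEV4 = 4  # minor glitch - visual annoyance, etc.

@dataclass
class Bug:
    title: str
    severity: Severity
    must_fix: bool = False  # separate, subjective call - a program manager's, mainly

# A Sev 4 can still be a must-fix: severity measures impact, not importance.
splash_typo = Bug("Splash screen misspells 'Word'", Severity.SEV4, must_fix=True)
footnote_glitch = Bug("Footnote format breaks on a huge footnote number", Severity.SEV3)
```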

We know that, statistically, our testing process flushes out most regressions within about 3 months of the time they were introduced. Because of that long lead time, we start reducing the number of changes months before our expected ship date. Usually a regression is found within a day or two, because we test specifically for them, but some of them lurk, as I have explained. If a regression is not caught by the special "regression testing" done specifically to find these, then that area of the product may not be revisited for 2-3 months as the test team works on other areas - hence the long period.

Because regression bugs can lurk for this long, some months before we decide we can ship the product we begin a "triage" process. We start rejecting bugs (marking them as "won't fix" or "postpone") that have little to no customer impact, and that only a very persistent customer would run into (like, typing 1,000,000,000,000,000 into the footnote number control causes the footnote format to look incorrect - who cares?). The goal is to reduce "code churn". Any sizeable software project has to be managed as an organic thing since, as I discussed, it is no longer possible to know for certain what effect any particular code change will have. By simply reducing the amount of change, you reduce the number of random problems (regressions) being introduced.

As time goes by, we raise the "triage bar" that a bug must meet in order to be worth fixing. That is, bugs need to be a little more serious to get fixed. As a result, fewer and fewer changes are happening to the code base. Time passes, and we can be more confident that new bugs caused by bug fixes are being introduced in much smaller numbers. Eventually the number of bug fixes we are taking per week is down to single digits, and finally no bugs remain that we would consider "must fix". We now have a "release candidate", and we will only take "recall class" bugs. Essentially we could now ship the product, but we leave it in "escrow", testing like mad to see whether any truly heinous bug appears that would make us recall the product, and whether those last few fixes have regressions. So we have reached a state where, by choosing not to fix bugs we considered survivable, we have minimized the chance that even worse bugs are now in the code instead. We're at maximum known quality.
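
Nobody actually runs triage as a program, of course - it's a roomful of people arguing over a bug list - but the shape of the end game looks roughly like this sketch, where the importance scale and the thresholds are made up purely for illustration:

```python
def triage_bar(weeks_to_ship: int) -> int:
    """Minimum 'importance' a bug must clear to be worth the regression
    risk of fixing it; the bar rises as the ship date approaches."""
    if weeks_to_ship > 12:
        return 1    # early on, fix just about everything we find
    if weeks_to_ship > 6:
        return 5    # start rejecting low-impact, hard-to-hit bugs
    if weeks_to_ship > 2:
        return 8    # only serious, customer-visible problems get through
    return 10       # escrow: recall-class bugs only

def should_fix(importance: int, weeks_to_ship: int) -> bool:
    """Take a fix only if the bug clears the current bar."""
    return importance >= triage_bar(weeks_to_ship)

# A bug rated 7 gets fixed four months out, but not in the end game.
assert should_fix(7, weeks_to_ship=16)
assert not should_fix(7, weeks_to_ship=3)
```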

And of course the inexperienced among us are shocked, because they can't understand why this or that bug did not get fixed. But the wiser ones remember what happens when you buck the process and fix that last bug.

A last anecdote to leave you with. Even re-linking your code (not even recompiling it) can introduce a crashing bug. A few years ago, we were working on the release candidate for an Asian-language version of Word97. We thought we were done, and ran our last optimization on the build. We have some technology at Microsoft that profiles code usage and arranges the code modules so that they are in the executable in the optimal order to produce the best possible boot speed. After running this process, which involves mainly the linker and is considered very safe, we put the code in escrow while the testers tried to break it. And they did - they found that it would crash on some of their machines when a certain feature was used. But the unoptimized build did not crash with the same steps on the same machines.

So we ran the "debug" build (a version of the build that has the same code as the "ship" build you all use, but includes extra information to allow debugging) and it also did not crash. We then tried to debug the ship build - but just running the ship build in the debugger made the crash go away. Some developers love this sort of mystery. One of the future inventors of Watson stopped by to get involved. No matter what they did, as soon as they tried to find out what was causing the crash, it went away. We went to get an ICE (an in-circuit emulator - a hardware debugger) in the hope it might give us some clues.

Then we noticed that there was a pattern. The only machines that showed this crash had Pentium processors of 150MHz or less, and even some of the 150MHz machines did not show the problem. We had a hunch, and searched the Intel web site for "errata" (their word for bugs - we prefer "issues"). Sure enough, there was a flaw in the Pentium chip which, under certain obscure circumstances, could cause a fault: there needed to be a JMP instruction followed exactly 33 bytes later by a branch, and the JMP instruction had to be aligned on a "page" (4096-byte block) boundary in memory. Talk about specific. The flaw had been fixed in later versions of the 150MHz generation, and in all chips produced later.
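
The pattern is specific enough that you can hunt for it mechanically. Here is a rough sketch of that kind of scan in Python - the opcode values are illustrative stand-ins rather than Intel's exact encodings, and treating file offsets as memory addresses assumes the code is loaded page-aligned:

```python
import sys

PAGE_SIZE = 4096   # the JMP had to sit on a "page" (4096-byte block) boundary
GAP = 33           # the branch had to follow exactly 33 bytes later

# Illustrative opcodes only - the real check used the encodings from the erratum.
JMP_OPCODES = {0xE9, 0xEB}                             # near/short unconditional JMP
BRANCH_OPCODES = set(range(0x70, 0x80)) | JMP_OPCODES  # short conditional jumps + JMP

def find_risky_sites(image: bytes) -> list[int]:
    """Return offsets where a JMP sits on a page boundary and a branch
    opcode appears exactly GAP bytes after it."""
    hits = []
    for offset in range(0, len(image) - GAP, PAGE_SIZE):
        if image[offset] in JMP_OPCODES and image[offset + GAP] in BRANCH_OPCODES:
            hits.append(offset)
    return hits

if __name__ == "__main__":
    image = open(sys.argv[1], "rb").read()
    for offset in find_risky_sites(image):
        print(f"possible erratum pattern at offset {offset:#x}")
```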

Now, there are actually flaws in chips quite often (one or two are even famous) - after all, those chips include software too. But as flaws are discovered, the chip manufacturers tell the people who write compilers about them, and the compiler people modify their compilers to produce code that cannot hit the bug. So the flaws seem to just disappear (thank goodness for software!). It turned out we were using a slightly older version of the compiler, which did not know about this flaw. An Intel rep at Microsoft confirmed the details of the problem, and rather than relinking or taking any chances whatsoever, we ran a check for this byte sequence and manually moved three bytes in our 5MB executable to make sure those instructions were 34 bytes apart. Problem solved. So now when someone tells me a fix is "safe", I can tell them that no fix is truly safe - you really can never know.