We’re very happy to announce that the second, newly expanded edition of I. M. Wright’s “Hard Code”: A Decade of Hard-Won Lessons from Microsoft, by Eric Brechner, is available for purchase. (Print ISBN 9780735661707; page count 448.)
Here is an excerpt from Chapter 5, “Software Quality—More Than a Dream”.
Some people mock software development, saying that if buildings were built like software, the first woodpecker would destroy civilization. That’s quite funny, or disturbing, but regardless, it’s misguided. Early buildings lacked foundations. Early cars broke down incessantly. Early TVs required constant fiddling to work properly. Software is no different.
At first, Microsoft wrote software for early adopters, people comfortable replacing PC boards. Back then, time to market won over quality, because early adopters could work around issues, but they couldn’t slow the clock. Shipping fastest meant coding quickly and then fixing just enough to make it work.
Now our market is consumers and the enterprise, who value quality over the hassles of experimentation. The market change was gradual, so Microsoft’s initial response was simply to fix more bugs. Soon bug fixing was taking longer than coding, an incredibly slow process. The fastest way to ship high quality is to trap errors early, coding it right the first time and minimizing rework. Microsoft has been shifting to this quality upstream approach over the time I’ve been writing these columns. The first major jolt that drove the company-wide change was a series of Internet virus attacks in late 2001.
In this chapter, I. M. Wright preaches quality to the engineering masses. The first column evaluates security issues. The second analyzes why quality is essential and how you get it. The third column explains an engineering approach to software that dramatically reduces defects. The fourth talks about design and code inspections. The fifth describes metrics that can predict quality issues before customers experience them. The sixth focuses on techniques to make software resilient. And the chapter aptly finishes by emphasizing the five basics of software quality. While all these columns provide an interesting perspective, the second one, “Where’s the beef? Why we need quality,” stands out as an important turning point. When I wrote it, few inside or outside Microsoft believed we were serious about quality. Years later, many of the concepts are taken for granted. It took far more than an opinion piece to drive that change, but it’s nice to call for action and have people respond.
I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it’s better to crash and let Watson report the error than it is to catch the exception and try to correct it.
From a technical perspective, there is some sense to the strategy of allowing the crash to complete and get reported. It’s like the logic behind asserts—the moment you realize you are in a bad state, capture that state and abort. That way, when you are debugging later you’ll be as close as possible to the cause of the problem. If you don’t abort immediately, it’s often impossible to reconstruct the state and identify what went wrong. That’s why asserts are good, right? So, crashing is sensible, right?
Oh please. Asserts and crashing are so 1990s. If you’re still thinking that way, you need to shut off your Walkman and join the twenty-first century, unless you write software just for yourself and your old-school buddies. These days, software isn’t expected to run only until its programmer gets tired. It’s expected to run and keep running. Period.
Hold on, an old-school developer, I’ll call him Axl Rose, wants to inject “reality” into the discussion. “Look,” says Axl, “you can’t just wish bad machine states away, and you can’t fix every bug no matter how late you party.” You’re right, Axl. While we need to design, test, and code our products and services to be as error free as possible, there will always be bugs. What we in the new century have realized is that for many issues it’s not the bugs that are the problem—it’s how we respond to those bugs that matters.
Axl Rose responds to bugs by capturing data about them in hopes of identifying the cause. Enlightened engineers respond to bugs by expecting them, logging them, and making their software resilient to failure. Sure, we still want to fix the bugs we log because failures are costly to performance and impact the customer experience. However, cars, TVs, and networking fail all the time. They are just designed to be resilient to those failures so that crashes are rare.
“But asserts are still good, right? Everyone says so,” says Axl. No. Asserts as they are implemented today are evil. They are evil. I mean it, evil. They cause programs to be fragile instead of resilient. They perpetuate the mindset that you respond to failure by giving up instead of rolling back and starting over.
We need to change how asserts act. Instead of aborting, asserts should log problems and then trigger a recovery. I repeat—keep the asserts, but change how they act. You still want asserts to detect failures early. What’s even more important is how you respond to those failures, including the ones that slip through.
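To make that concrete, here is a minimal sketch in C++ of what a recovery-oriented assert might look like. The RESILIENT_ASSERT macro and the LogFailure helper are invented for illustration, not taken from the book: the assert logs the bad state, then evaluates to false so the enclosing action can roll back and start over instead of aborting the process.

#include <cstdio>

// Illustrative logging hook; a real product would route this to telemetry.
inline void LogFailure(const char* expr, const char* file, int line)
{
    std::fprintf(stderr, "assert failed: %s (%s:%d)\n", expr, file, line);
}

// A recovery-oriented assert: log the bad state, then evaluate to false so
// the caller can roll back the current action and try again, rather than
// aborting the whole process.
#define RESILIENT_ASSERT(cond) \
    ((cond) ? true : (LogFailure(#cond, __FILE__, __LINE__), false))

bool SaveSettings(const char* path)
{
    if (!RESILIENT_ASSERT(path != nullptr))
        return false;  // failed action; the caller decides how to recover
    // ... write the settings file ...
    return true;
}

int main()
{
    if (!SaveSettings(nullptr))
        std::puts("save rolled back; caller can retry with a valid path");
}

Note that the failure is still detected and logged at the earliest possible point, which preserves the diagnostic value of the assert while leaving the recovery decision to the code that owns the action.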
So, how do you respond appropriately to failure? Well, how do you? I mean, in real life, how do you respond to failure? Do you give up and walk away? I doubt you made it through the Microsoft interview process if that was your attitude.
When you experience failure, you start over and try again. Ideally, you take notes about what went wrong and analyze them to improve, but usually that comes later. In the moment, you simply dust yourself off and give it another go.
For web services, the approach is called the five Rs—retry, restart, reboot, reimage, and replace. Let’s break them down:
■ Retry  First off, you try the failed action again. Often something just goofed the first time and will work the second time.
■ Restart  If retrying doesn’t work, restarting often does. For services, this often means rolling back and restarting a transaction or unloading a DLL, reloading it, and performing the action again the way Internet Information Server (IIS) does.
■ Reboot  If restarting doesn’t work, do what a user would do, and reboot the machine.
■ Reimage  If rebooting doesn’t work, do what support would do, and reimage the application or entire box.
■ Replace  If reimaging doesn’t do the trick, it’s time to get a new device.
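To make the ladder concrete, here is a minimal C++ sketch of how a service action might escalate through the first three Rs. Everything in it is an assumption for illustration: TryAction, RestartWorker, and RequestReboot are invented stand-ins for a real service’s transaction, worker pool, and fabric controller, and the simulated failures exist only so the example runs. Reimage and replace involve operators and hardware, so the sketch simply reports when it has exhausted its options.

#include <cstdio>

// Hypothetical hooks; we simulate two transient failures for the example.
static int g_transientFailures = 2;
bool TryAction()     { return g_transientFailures-- <= 0; }
void RestartWorker() { std::puts("restarting worker process"); }
void RequestReboot() { std::puts("asking the fabric to recycle this machine"); }

// Escalate through the first three Rs. Reimage and replace involve
// operations and hardware, so this code only reports that it gave up.
bool RunResiliently()
{
    for (int attempt = 0; attempt < 3; ++attempt)  // Retry
        if (TryAction())
            return true;

    RestartWorker();                               // Restart
    if (TryAction())
        return true;

    RequestReboot();                               // Reboot; beyond this,
    return false;                                  // escalate to reimage/replace
}

int main()
{
    std::printf("action %s\n", RunResiliently() ? "succeeded" : "failed");
}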
Much of our software doesn’t run as a service in a datacenter, and contrary to what Google might have you believe, customers don’t want all software to depend on a service. For client software, the five Rs might seem irrelevant to you. Ah, to be so naïve and dismissive.
The five Rs apply just as well to client and application software on a PC or a phone. The key most engineers miss is defining the action, the scope of what gets retried or restarted. On the web it’s easier to identify—the action is usually a transaction to a database or a GET or POST to a page. For client and application software, you need to think more about what action the user or subsystem is attempting.
Well-designed software will have custom error handling at the end of each action, just like I talked about in my column “A tragedy of error handling” (which appears in Chapter 6). Having custom error handling after actions makes applying the five Rs much simpler. Unfortunately, lots of throwback engineers, like Axl Rose, use a Routine for Error Central Handling (RECH) instead, as I described in the same column. If your code looks like Axl’s, you’ve got some work to do to separate out the actions, but it’s worth it if a few actions harbor most crashes and you aren’t able to fix the root cause.
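As a rough illustration of what separating out the actions might look like, the C++ sketch below gives a user-level action its own handler that maps a failure onto one of the five Rs, instead of funneling everything into a central routine. The action names, the simulated failure, and the RecoverConnection helper are all invented for the example.

#include <cstdio>
#include <stdexcept>

// Invented action: one user-visible operation with a clearly defined
// scope, so recovery can be specific to that action.
void SyncToServer()
{
    throw std::runtime_error("connection dropped");  // simulated failure
}

void RecoverConnection()  // Restart the dependency the action relies on
{
    std::puts("reopening server connection");
}

// The sync action owns its error handling: it restarts the connection and
// retries, instead of bubbling up to a central routine that can only quit.
bool DoSyncAction()
{
    for (int attempt = 0; attempt < 2; ++attempt)
    {
        try
        {
            SyncToServer();
            return true;
        }
        catch (const std::exception& e)
        {
            std::fprintf(stderr, "sync failed: %s\n", e.what());
            RecoverConnection();
        }
    }
    return false;  // the action failed; the caller may escalate (reboot)
}

int main()
{
    std::printf("sync %s\n", DoSyncAction() ? "succeeded" : "gave up");
}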
Let’s check out some examples of applying the five Rs to client and application software:
■ Retry  PCs and devices are a bit more predictable than web services, so failed operations will likely fail again. However, retrying works for issues that fail sporadically, like network connectivity or data contention. So, when saving a file, rather than blocking for what seems like an eternity and then failing, try blocking for a short timeout and then retrying—a better result for the same time or less (see the sketch after this list). Doing so asynchronously unblocks the user entirely and is even better, but it might be tricky.
■ Restart  What can you restart at the client level? How about device drivers, database connections, OLE objects, DLL loads, network connections, worker threads, dialogs, services, and resource handles. Of course, blindly restarting the components you depend upon is silly. You have to consider the kind of failure, and you need to restart the full action to ensure that you don’t confuse state. Yes, it’s not trivial. What kills me is that as a sophisticated user, restarting components is exactly what I do to fix half the problems I encounter. Why can’t the code do the same? Why is the code so inept? Wait for it, the answer will come to you.
■ Reboot  If restarting components doesn’t work or isn’t possible because of a serious failure, you need to restart the client or application itself—a reboot. Most of the Office applications do this automatically now. They even recover most of their state as a bonus. There are some phone and game applications that purposely freeze the screen and reboot the application or device in order to recover (this works only for fast reboots).
■ Reimage  If rebooting the application doesn’t work, what does product support tell you to do? Reinstall the software. Yes, this is an extreme measure, but these days installs and repairs are entirely programmable for most applications, often at a component level. You’ll likely need to involve the user and might even have to check online for a fix. But if you’re expecting the user to do it, then you should do it.
■ Replace  This is where we lose. If our software fails to correct the problem, the customer has few choices left. These days, with competitors aching …
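And here is one way the file-save retry from the first bullet might look in C++: a few attempts with a short bounded wait between them rather than one long block. TrySaveFile, the attempt count, and the timeout are assumptions for the sketch, not anything prescribed by the book.

#include <chrono>
#include <cstdio>
#include <thread>

// Invented save primitive: pretends the first attempt hits a transient
// failure, such as a locked file or a dropped network share.
bool TrySaveFile(const char* path)
{
    static int transientFailures = 1;
    std::printf("attempting to save %s\n", path);
    return transientFailures-- <= 0;
}

// Retry with a short pause between attempts, so the total wait stays
// bounded and the user sees a result in about the time of one long block.
bool SaveWithRetry(const char* path)
{
    for (int attempt = 0; attempt < 5; ++attempt)
    {
        if (TrySaveFile(path))
            return true;
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
    return false;  // escalate: restart the connection, prompt the user, etc.
}

int main()
{
    std::printf("%s\n", SaveWithRetry("report.docx") ? "saved" : "save failed");
}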
Sorry, but I don’t see any significant change in the code quality of MS products since 2001: it’s exactly the same rush, letting customers beta-test software they’ve already purchased. New major releases ship every 2 years just to feed stakeholders, with constant bug patching between releases. By the time release N-1 becomes anything like a solid product, version N has already RTM’d and the patching cycle starts over on N. And N-1 gets immediately deprecated, no longer a first priority, soon to be dumped.
Seems to be a nice read.