In the feedback to my last posting, Balaji said

“It has been by personal experience too. If we devote much time and effort to architecture, design, clarifying requirements, attention to detail etc, the outcoming product/feature is usually pretty good with very few bugs.”

No doubt “pretty good with very few bugs” is relative to the application market you are in. Still, I am sure that every tester knows only too well the feeling of someone else finding that impossible bug, and then wondering, “Could we have improved our testing methodology so that we caught that one before the egg hit our faces?”

At this point I think the world (deep down, aren’t we all testers?) divides into two camps – one which accepts the occasional smell of egg as a fact of life, and another that says it is time to learn from past mistakes and re-engineer the process so it can never happen again.

So how do you think the folks at GE Power Systems felt when they finally found the XA/21 bug?  Before I tell you, let me give you more detail. 

GE Power is no Johnny-come-lately – their annual revenues are in excess of $20 billion – and their XA/21 system has over 3 million operational hours (340 years!) since it was first introduced in 1990. So, on August 14th, 2003, when 50 million people in the Northeast United States and Canada lost power, I am guessing that their QA staff in Florida went home pretty unconcerned that their XA/21 energy management and supervisory control and data acquisition (EMS/SCADA) system had any role to play in the ‘perfect storm’ that hit that day.

But by late October, after GE and their energy consultants from KEMA Inc spent weeks going through 4 million lines of C/C++ code, they had identified the race condition that left operators at FirstEnergy Corp’s Akron, Ohio control room in the dark while three of the company's high voltage lines sagged into unkempt trees and tripped. Because the alarm portion of the XA/21 system failed silently, control room operators didn't know they were relying on outdated information; apparently, choosing to trust their systems, they even discounted phone calls warning them about worsening conditions on their grid.

There is an interesting fractal symmetry here, with the XA/21 encountering its own little perfect storm that caused the unknown race condition to be hit, which in its way was an ingredient in the larger storm in North America that day. The drama in The Register’s reporting is not something I normally associate with the software industry.

Sometimes working late into the night and the early hours of the morning, the team pored over the approximately one-million lines of code that comprise the XA/21's Alarm and Event Processing Routine, written in the C and C++ programming languages. Eventually they were able to reproduce the Ohio alarm crash in GE Energy's Florida laboratory, says Mike Unum [manager of commercial solutions at GE Energy]. "It took us a considerable amount of time to go in and reconstruct the events." In the end, they had to slow down the system, injecting deliberate delays in the code while feeding alarm inputs to the program. About eight weeks after the blackout, the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds.

“There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time," says Unum. "And that corruption lead to the alarm event application getting into an infinite loop and spinning."

After the alarm function crashed in FirstEnergy's controls center, unprocessed events began to cue [sic] up, and within half-an-hour the EMS server hosting the alarm process folded under the burden, according to the blackout report. A backup server kicked-in, but it also failed. By the time FirstEnergy operators figured out what was going on and restarted the necessary systems, hours had passed, and it was too late.

So in which of my camps might GE find itself?

The company did everything it could, says Unum. "We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug," says Unum. "I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software."

Even if Unum is a manager, I am guessing he has some grease under his fingernails, knows his customers, and has to get a product out the door while maintaining his company’s share value. On the other hand, Peter Neumann, a principal scientist at SRI International, can take a more detached, academic, arm’s-length view. (“My main research interests continue to involve security, crypto applications, overall system survivability, reliability, fault tolerance, safety, software-engineering methodology, systems in the large, applications of formal methods, and risk avoidance.”)

[He says] that the root problem is that makers of critical systems aren't availing themselves of a large body of academic research into how to make software bulletproof.

"We keep having these things happen again and again, and we're not learning from our mistakes," says Neumann. "There are many possible problems that can cause massive failures, but they require a certain discipline in the development of software, and in its operation and administration, that we don't seem to find. ... If you go way back to the AT&T collapse of 1990, that was a little software flaw that propagated across the AT&T network. If you go ten years before that you have the ARPAnet collapse.

"Whether it's a race condition, or a bug in a recovery process as in the AT&T case, there's this idea that you can build things that need to be totally robust without really thinking through the design and implementation and all of the things that might go wrong," Neumann says.

Which camp are you in?  Or is there a middle ground?