I love hunting them down and killing them. Hunting down and killing the developer who caused the bug would be just as much fun if it weren’t illegal in every country in the world. That, and programmers would probably be an extinct species if every programmer who had to fix a bug killed off the guy who created it ... and of course if they introduced the bug … well, you get the picture; it's not pretty.
The thing is, if you haven’t debugged a bug in an emulator, you haven’t really had to debug. Oh, I know people say debugging hardware is tough, and I’ve had to do that, but debugging software that emulates hardware or even virtualizes hardware is an art, and you really don’t know how hard hunting down a bug can be until you’ve gotten an emulation bug.
The first hard bug I ever had to find was back in my Apple days (think System 6 ... yeah I'm that old :-). Essentially, if you created a new folder on the desktop and renamed it, you corrupted the desktop database, which essentially wiped your hard drive clean. I didn’t track it down, but I figured out a way to reproduce it. The engineer who came and looked at it introduced me to one of my favorite programs: MacsBug.
From then on it was assembly or nothing. For me, starting at the bottom was the way to go; who needs source code? Understand the machine and the compiler, and the rest is pretty easy … until I met an emulator. Don’t get me wrong, I know that developers deal with tough bugs all day. Code is filled with them, but hands down, in my 16 years of developing software, the emulator bugs are the toughest. In the six years I worked on Virtual PC and Virtual Game Station, I would say on AVERAGE my bugs took a week or more to fix. That’s PER bug: at least 40 hours of my time, not working on 5 or 6 bugs at once, but focusing my entire attention on one issue. On Virtual PC 7, in the 16 months I worked on it, I think I only fixed 30-some-odd bugs.
The doozy of them all has to be the malformed TLB entry. For those who need a brush-up, Virtual PC was a full-blown PC motherboard emulator. We emulated a Triton-FX/BX chipset from the Pentium 3 chip down to the A20 interrupt. In between all of that was memory management, disk, networking and a host of other integration features. In the memory management unit for our emulator we dealt with page faults and memory page mappings … all of this to properly emulate the MMU and let Windows (or another x86-based OS) believe it was running on real hardware. TLB stands for Translation Look-aside Buffer: a cache of recent virtual-to-physical page translations, which puts it right in the middle of page-fault handling. Most of the code that did this was written back in the early days of Virtual PC by Eric Traut, for when we had the original PowerPC processors and G4 machines (circa 1998-1999).
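For anyone who wants to see the idea rather than read about it, here is a minimal sketch of a software TLB in Python. It is not Virtual PC's code and has nothing to do with how Eric's implementation worked; the names (SoftTLB, PageFault, TLB_SLOTS) and the 4 KiB page size are assumptions made up for illustration. It only shows the general shape: look in the cache first, fall back to the page table on a miss, and raise a fault when there is no mapping at all.

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages, as on 32-bit x86
PAGE_SIZE = 1 << PAGE_SHIFT
TLB_SLOTS = 64                  # hypothetical size, chosen only for illustration


class PageFault(Exception):
    """Raised when the page table has no mapping for an address."""


class SoftTLB:
    """Toy direct-mapped TLB in front of a dict-based page table."""

    def __init__(self, page_table):
        self.page_table = page_table        # virtual page number -> physical frame number
        self.slots = [None] * TLB_SLOTS     # each slot holds (vpn, pfn) or None
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT           # virtual page number
        offset = vaddr & (PAGE_SIZE - 1)    # byte offset inside the page
        slot = vpn % TLB_SLOTS
        entry = self.slots[slot]
        if entry is not None and entry[0] == vpn:
            self.hits += 1                  # fast path: cached translation
            return (entry[1] << PAGE_SHIFT) | offset
        self.misses += 1                    # slow path: consult the page table
        if vpn not in self.page_table:
            raise PageFault(hex(vaddr))
        pfn = self.page_table[vpn]
        self.slots[slot] = (vpn, pfn)       # cache the translation for next time
        return (pfn << PAGE_SHIFT) | offset

    def invalidate(self, vpn):
        """Drop the cached entry for one page, if it is present."""
        slot = vpn % TLB_SLOTS
        entry = self.slots[slot]
        if entry is not None and entry[0] == vpn:
            self.slots[slot] = None
```

The nasty part of any cache like this is that a bad entry keeps answering confidently. In this toy version, if you change page_table for a page and forget to call invalidate, translate keeps returning the old frame until the slot is evicted. That is just one illustrative way a TLB can go wrong, not a description of the bug in this story.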
Okay, so cutting to the chase: when the G5 processor was introduced we had to do a great deal of work to rewrite our low-level emulator to deal with a number of issues brought up by the G5. What we didn’t know is that lurking deep in the guts of the memory management was a very subtle malformed TLB bug. So here we are, 14 months into development, we’re ready to ship with XP SP2 (we held up shipping to make sure we could ship the latest service pack) and wouldn’t you know, 20% of the time on DUAL G5 processors, when running through the OOBUE (Out of box user experience) on XP SP2 Professional, Windows would crash, but somehow manage to recover. With XP SP2 Home it was only 10% of the time, and when we ran XP SP2 Professional with a stripped-down OOBUE, we got the failure rate down to 5%.
As anyone who has ever developed software would say… SHIP IT! Okay, no, that’s not really what they would say. Usually it’s something like this: “$*&^!@%@!#$!@) !(!*@#&@!#*(” and if they are from New Jersey the expletives go on for quite a while. Needless to say, we finally did ship VPC 7.0 and found a fix for the TLB bug. It only took me 9 weeks and hundreds of hours trying to get the bug to reproduce through automation and manual means, as well as having the smartest emulator gurus at Microsoft looking over my shoulder at the source code and helping me debug the problem.
When I thought I had the fix, we ran all the lab machines through an automated test script overnight. Each machine ran through the OOBUE about 100 times, and we had maybe 20 machines going, turning our lab into a sauna. I came in the next morning and found out that we had a 0% failure rate. I can’t even put into words the feeling I get when I solve these tough problems; anyone who has tackled a tough bug knows what I am talking about.
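If you are wondering why an overnight run is convincing for a bug that only shows up some of the time, the arithmetic is simple enough to sketch. The Python below assumes every run is independent and fails with a fixed probability, which is a simplification (real intermittent bugs are rarely that well behaved). The function names are made up for illustration, and the figure of roughly 2,000 runs is just 20 machines times about 100 runs each.

```python
import math


def prob_zero_failures(failure_rate, runs):
    """Chance of seeing no failures at all if the bug were still present."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence=0.99):
    """Smallest number of clean runs that makes a surviving bug this unlikely."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    total_runs = 20 * 100  # assumption: about 20 machines, about 100 runs each
    for rate in (0.20, 0.10, 0.05):
        print(rate, prob_zero_failures(rate, total_runs), runs_needed(rate))
```

Under those assumptions, even at the lowest failure rate we had seen (5%), the odds of a couple of thousand clean runs in a row with the bug still in there are vanishingly small.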
That was the thing I loved the most about working on Virtual PC: no matter how good you thought you were, there was always a bug like that waiting to challenge you. In some ways that challenge is what I am going to miss the most, too.