I recently completed a year at Microsoft. I thought I’d take the chance to write about some of the fun anecdotes from my one year here. Blood, sweat, screams, tears and joy – I’ve lived through them all in my short stint here.

Nothing in Microsoft is feared as much as a P0 bug. P0, which stands for Priority 0, is the highest importance a bug can be given. You’ll usually see a P0 bug if there is a build break – i.e., for some reason, the build didn’t complete or the smoke tests on the new build didn’t pass.

For a Microsoft developer, life stops if you get a P0 bug assigned to you. To quote a colleague, “a P0 means that you drop whatever you’re doing, forget about the outside world, and stop breathing even. The only thing that matters in your life now is that P0 bug”.

As a Program Manager, dealing with coding bugs is usually not a part of my day-to-day work – so I hadn’t had to deal with a P0 bug in quite a while. Until one late night in August.

August 2005 was a stressful period. We were in the final stages of shipping Visual Studio 2005. This meant that we spent all our time fixing late bugs and cleaning up the product to get it out the door for its late October release.

At such a late point in the product’s cycle, there was always the fear that a showstopper bug might come at you from nowhere. Of course, there was no logical reason for this fear – the number of bugs being found was dropping drastically. But still, there was Murphy’s Law and the gnawing fear that something might dramatically fail at the last moment.

This particular night, I was up late at the office looking at some bug. Finally done with it, I packed up my stuff and prepared to head home when a developer on my team asked me if I was free. He had a problem – he was investigating a couple of P0 issues and all of a sudden, a third P0 issue had been assigned to him.

Assigning the bug to myself, I unpacked my bags again and pulled up to my computer. I wasn’t worried – a large number of P0 issues are false alarms, caused by some weirdness in the machine running the test or in the automated test code. I’ve rarely seen P0 bugs due to actual issues in the code.

This bug sounded relatively benign. The toolbox was failing to load its controls when you opened a device project. My guess was that it was due to an unclean machine where someone hadn’t uninstalled the previous raw build cleanly.

Let me explain some terminology here. When we did Whidbey, we had two builds of Visual Studio. One was the layout build, which we (and you) use, with the nice graphical installer and everything. The other was the ‘raw’ build, which just registered the binaries, some debug tools and registry keys. Raw builds were built with debug information and with optimizations turned off, so you could debug your way through easily.

In this particular case, our internal testing tools had just installed the raw build on this machine and were running through their usual battery of tests. Everything was hunky-dory until they tried to create a device project, at which point everything came down in flames since there was nothing in the toolbox.

I poked around in the debugger for a bit. After 30 minutes or so, I knew why the toolbox wasn’t getting populated – for the simple reason that the .NET Compact Framework wasn’t installed on the machine.

This was weird – NetCF gets installed along with the rest of Visual Studio and there was no reason for it not to be installed.

I broke into a cold sweat. Not finding NetCF was a pretty serious problem – it would render useless most of the device stuff. Had somebody checked in code which broke the setup? This was a ship stopper bug if I had ever seen one.

I poked through the logs of the raw build’s setup (essentially spew from a huge batch file) and the plot thickened. The log file indicated that the NetCF setup had been run just like the setup for every other part of Visual Studio. But for some weird reason, the files weren’t there. It was as if the setup had never happened. I looked through the event logs and they were a mystery – they told me that NetCF’s installation was kicked off. But for some reason, they never recorded the successful completion or any error.

I tried rerunning setup and voila! – NetCF installed perfectly and all tests passed successfully. I was confused – what had happened that first time? I tried rerunning setup a few times but couldn’t spot the problem. Beaten, I told the tester that I couldn’t ‘repro’ the bug and went home – but not before leaving a note in the bug asking people to assign any similar bugs to me. If this happened again, I wanted to know about it.

I didn’t have to wait long. 24 hours later, there was a mail in my inbox from another tester. A new build had exhibited the same issue. The toolbox wasn’t loading controls.

I plonked myself down and remotely logged into the test machine. The scene of the crime was identical to the last time. No NetCF. Logs indicating the successful installation of NetCF.

I started poking around in the event logs of the machine. What was happening in the machine around the time of the incident? Was there a hardware failure? Was there a network outage?

The event logs didn’t reveal much. However, one entry caught my eye – just before the entry documenting the NetCF setup, there was an entry about a Systems Management Server operation on the machine. SMS is how we deploy updates to Windows and other internal tools inside Microsoft, and these deployments are very routine around here. However, the reason this caught my eye was that I had seen a similar entry on the previous machine.

On both machines, SMS was doing something a few minutes before NetCF’s setup was kicked off.

This was the clincher – I now had a good hunch as to what was happening. I checked in a piece of code to add logging of NetCF’s setup (basically the verbose logging option of msiexec). My code now also checked whether the setup had failed and, if so, bailed out.
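
Roughly, the check looked like the sketch below. This isn’t the actual build-script code – the .msi and log file names here are made up for illustration – but it shows the idea: run the MSI through msiexec with its verbose logging switch and bail out on any non-zero exit code.

```c
/* A sketch of the "log and bail out" check, not the actual build code.
   The .msi and log file names are hypothetical. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* /qn = no UI, /l*v = verbose log of everything msiexec does */
    int rc = system("msiexec /qn /i netcfsetup.msi /l*v netcf_setup.log");

    if (rc != ERROR_SUCCESS)
    {
        /* bail out loudly so the failure gets flagged instead of
           being silently swallowed by the rest of the batch file */
        fprintf(stderr, "NetCF setup failed with exit code %d; see netcf_setup.log\n", rc);
        return rc;
    }
    return 0;
}
```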

Hardly a day later, my guess was vindicated. NetCF installation failed on another machine – but this time, my logs had documented the sequence of events.

For those of you who know the innards of msiexec, you’ll know that only one setup can happen at any point in time. Msiexec.exe tries to grab a global mutex – if another instance is already holding the mutex, it fails. The reasoning is simple – you don’t want two setup programs making changes to the system at the same time, as they could trample all over each other.
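
You can see this mechanism for yourself. Here’s a small sketch (assuming the documented _MSIExecute mutex name, which the Windows Installer service holds while an installation is running) that checks whether a setup is in progress:

```c
/* Sketch: detect whether Windows Installer is busy. The Installer
   service holds the documented _MSIExecute mutex for the duration of
   an installation, so trying to open it tells you whether a setup is
   currently running. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hMutex = OpenMutexW(SYNCHRONIZE, FALSE, L"Global\\_MSIExecute");
    if (hMutex != NULL)
    {
        printf("An installation is in progress; a concurrent msiexec would fail.\n");
        CloseHandle(hMutex);
    }
    else
    {
        /* the mutex doesn't exist, so no installation is running */
        printf("No installation in progress (last error %lu).\n", GetLastError());
    }
    return 0;
}
```

When msiexec loses this race, it exits with code 1618 (ERROR_INSTALL_ALREADY_RUNNING) – exactly the kind of failure a check on its exit code will catch.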

Visual Studio 2005 setup is made up of a few monolithic MSIs and smaller MSIs such as NetCF, which are installed as part of the installation process. There is no one huge Visual Studio 2005 .msi file.

Some readers might have spotted the ending by now. What was happening was a freak occurrence. In the infinitesimally small time gap between the end of the previous MSI’s setup and the beginning of NetCF’s setup, the machine was unlucky enough to encounter a setup request from Systems Management Server. This was causing the NetCF installation to fail, and things went rapidly downhill from there. The difference this time was that my log files had faithfully recorded all this.

Whew!

I shot off a mail to the folks running the build machines to make sure the SMS updates happened beforehand, and we thankfully never had to deal with this again. We never did find out why this bug had never occurred earlier with other parts of setup. My guess is that something in SMS was changed.

Moral of the story – I really don’t have one, actually. Except maybe that shipping a product is an incredibly fun and stressful experience all at once.

[Please post comments at http://www.sriramkrishnan.com/blog/2006/07/p0.html ]