My favorite bug was an issue that would cause a test hang on roughly 1 machine in 100 running overnight stress. All the test threads would get stuck making TCP connections, and we’d see the same memory signature in the kernel heap which indicated one of hundreds of kernel threads had written to a wrong memory location. Once in this state, a system reboot was required to get network connectivity again.

The developer for the area had tried instrumenting potential overflow candidates and as the tester, I had tried different testing angles to narrow down the culprit but we were making no progress. The product was a week or two away from shipping and general consensus was that we should probably just risk it.

Finally, in desperation, I just stared at the memory in hex for hours trying to discern what it might be and the answer came to me. It was a tick count and more than that, it was roughly the same tick count in all the failed machines. From that, we were able to pinpoint the source of the corruption in the DHCPv6 refresh timer, get the fix ready, and prevent hordes of customers from rebooting their devices to cure weird connectivity issues.

-- Corey

 

Do you have a bug whose story you love to tell? Let me know!