Also known as "Larry mounts a DDoS attack against every single machine running Windows NT"
Or: No stupid mistake goes unremembered.
I was recently in the office of a very senior person at Microsoft debugging a problem on his machine. He introduced himself, and commented "We've never met, but I've heard of you. Something about a ping of death?"
Oh. My. Word. People still remember the "ping of death"? Wow. I thought I was long past the ping of death (after all, it's been 15 years), but apparently not. I'm not surprised when people who were involved in the PoD incident remember it (it was pretty spectacular), but to have a very senior person who wasn't even working at the company at the time remember it is not a good thing :).
So, for the record, here's the story of Larry and the Ping of Death.
First I need to describe my development environment at the time (actually, it's pretty much the same as my dev environment today). My primary development machine ran a version of NT, with a kernel debugger connected to my test machine over a serial cable. When my test machine crashed, I would use the kernel debugger on my dev machine to debug it. Nothing was debugging my dev machine, because NT was pretty darned reliable at that point and I didn't need a kernel debugger 99% of the time. In addition, the corporate network wasn't a switched network - as a result, each machine received datagram traffic from every other machine on the network.
Back in that day, I was working on the NT 3.1 browser (I've written about the browser here and here before). As I was working on some diagnostic tools for the browser, I wrote a tool to manually generate some of the packets used by the browser service.
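(For the curious: the real browser service carried its announcements in SMB mailslot datagrams, and its actual wire format is far more involved than anything shown here. Purely as a modern illustration of the general idea - hand-crafting a datagram and broadcasting it to every machine on the subnet - here's a minimal Python sketch. The opcode, layout, and port number are all made up for the example, not the browser protocol's.)

```python
# Illustrative sketch only: the opcode/layout/port below are hypothetical
# stand-ins, NOT the actual NT browser (\MAILSLOT\BROWSE) wire format.
import socket
import struct

def build_announcement(server_name: str, opcode: int = 0x01) -> bytes:
    """Pack a tiny made-up 'host announcement' datagram:
    1-byte opcode, 1-byte name length, then the name itself."""
    name = server_name.encode("ascii")
    if len(name) > 15:
        raise ValueError("NetBIOS-style names are at most 15 characters")
    return struct.pack("BB", opcode, len(name)) + name

def broadcast(payload: bytes, port: int = 13138) -> None:
    """Send the payload as a UDP broadcast (the port number is arbitrary)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(payload, ("255.255.255.255", port))

if __name__ == "__main__":
    broadcast(build_announcement("MYSERVER"))
```

The key point for the story: on a non-switched network, that one `sendto` reaches every listening machine on the wire, so a single malformed packet gets parsed by every receiver at once.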
One day, as I was adding some functionality to the tool, my dev machine crashed, and my test machine locked up.
*CRUD*. I can't debug the problem to see what happened because I lost my kernel debugger. Ok, I'll reboot my machines, and hopefully whatever happened will hit again.
The failure didn't hit, so I went back to working on the tool.
And once again, my machine crashed.
At this point, things started to get noisy in the offices around me - there was a great deal of cursing going on. What I'd not realized was that every machine had crashed at the same time as my dev machine. And I do mean EVERY machine. Every single machine in the corporation running Windows NT had crashed. Twice (with just enough time between crashes for people to start getting back to work).
I quickly realized that my test application was the cause of the crash, so I isolated my machines from the network and started digging in. The root cause didn't take long to find - the broadcast sent by my test application was malformed, and it exposed a bug in the bowser.sys driver. When the bowser received this packet, it crashed.
I quickly fixed the problem on my machine and added the change to the checkin queue so that it would be in the next day's build.
I then walked around the entire building and personally apologized to every single person on the NT team for causing them to lose hours of work. And 15 years later, I'm still apologizing for that one moment of utter stupidity.
Ah, but you *did* uncover the bug, and probably saved billions from losses due to maliciously malformed packets.
Though it does bring up the idea of isolated networks for stuff like this.
> I quickly root caused the problem - the broadcast that was sent by my test application was malformed and it exposed a bug in the bowser.sys driver. When the bowser received this packet, it crashed.
Bowser.sys? There's a whole *driver* dedicated to dogfooding?
I thought I'd told the story of the name of the bowser before. It's because the driver is "such a dog" :). My boss at the time had a colorful way with names.
Sounds like you're being harsh on yourself. Can't see anything you did as being stupid - it wasn't your fault that bowser.sys was buggy and caused OS crashes. (unless you also wrote that).
You sent out a malformed packet. Whoop-de-do. The network should be able to handle that.
The only possible reason you might have to be hard on yourself is the "doing it again" thing. But 1) you didn't cause people to lose much work there 'cos they'd only just rebooted from last time, and 2) I don't think spotting cause and effect from the first time around is something that would be expected. The first time might have been a coincidence - simultaneous crashes on your machines due to some other, unrelated local cause (power fluctuations in your office?).
Nah, that's not stupidity.
Now, going round apologising and letting everyone think it was your fault - that might have been a little foolish :)
Karellen: I wrote bowser.sys too.
Actually a single failure would have been excused. Stuff does happen, and we all know that.
The reason this became a legend was that I did it a second time.
And that was inexcusable.
Doesn't a story like this belong in Us Magazine though, in the "They're Just Like Us" section? I want to see a picture of Larry with a big caption saying, "THEY BRING DOWN ENTIRE CORPORATE NETWORKS!"
Technically, wouldn't this be a plain old DOS attack rather than a DDOS attack? From what you wrote, the PoD packets were from a single source (your machine) so they weren't really "distributed".
Chris: I was wondering if someone would think of that. I figured it was "distributed" because one packet sent from my dev machine was distributed to several thousand other machines and crashed them all.
Well, I'm no expert but as I understand it, back in Ye Olden Days, the conventional way to carry out a denial of service attack was to subvert a powerful machine with a big internet pipe and use it to launch a flood of traffic at the target computer. Two problems with this: first, as the computers people were trying to take down with DoS attacks got more powerful, eventually becoming services running on multiple computers, it got harder and harder to find a computer big enough to overwhelm them. There isn't a single computer in the world powerful enough to DoS Google, for instance. Second, a single source attack is relatively easy to deal with. While there are methods of disguising the origin of a DoS attack (forging information on the packets, for instance) it's still possible to trace such a big flood of packets back to the origin. That means most DoS attacks could be dealt with by either getting the owner to clean out the subverted system, or getting its ISP to filter the traffic or shut down their connection entirely.
These days, rather than using one big system, they started subverting a lot of systems into a botnet (including desktop machines as well as big servers) often using viruses, worms, trojans, or other automated mechanisms, and using them to launch a coordinated DoS attack. This sort of Distributed Denial of Service attack is a lot harder to stop. Each machine is sending out less traffic, so they're harder to trace back. Even if you can, there's so many of them that tracking down each one and dealing with the owner or ISP is effectively impossible. This makes DDoS attacks much harder to combat than old style single-machine DoS attacks. It also scales to attack websites and services that have far too much hardware behind them to be brought down by a single machine trying to DoS them. Now virtually all denial of service attacks are distributed.
I guess it would be a reverse DDoS attack, given that a normal DDoS is a bunch of machines bringing down one.
Not quite the same thing, but when I was testing Winsock, I used JamesG's harness api tester, on what I mistakenly believed to be my office isolated network. Hey, I was curious about how the competitor's TCP/IP stacks would handle it.
Buildings 1-4 had problems keeping up with the "very large" broadcast packet. I told my test manager and PM about it, and they both agreed that the incident should be forgotten asap and never brought up again.
Shame on me, and I quickly removed all of my office test machines in the lab.
> The reason this became a legend was that I did it a second time.
> And that was inexcusable.
But that is excusable, and enormously important. The first time you did it, you didn't know. The second time you did it, again you didn't know at first, but when you knew about it, you released a fix. Your fix eventually reached millions of customers, right? The only surprising part of this is that Microsoft didn't fire you for making a fix that eventually reached millions of customers. Outside of Microsoft, you'd be a hero.
Compare that to the Excel bug, where the typically Microsoftian decision was to not release a hotfix. Someone must have got a big bonus for deciding not to release that hotfix.
The way to get memories of that event to be forgotten would be to store them on hard drives partitioned by Windows. That'll get all those memories wiped out. Still. Thank you for bucking Microsoft's system and getting your fix out the door.
Norman: Huh? The Excel guys issued a hotfix ASAP. And this was way early in the development process (years before we shipped).
> The Excel guys issued a hotfix ASAP.
Last I saw, Microsoft wasn't distributing the hotfix but was considering including it in a service pack.