… Continued from part one:
In the beginning of 1999, the small hosting company I worked for (Virtual Servers, LLC.) realized that they were losing business because they didn’t have a Windows-based hosting product offering. By this time we were using NT 4.0 workstations in the office, but we didn’t have any real expertise with IIS or with running NT as a web server. They hired a New Yorker named “Vinnie” to put together our Windows-based hosting solution, and he did as good a job as anyone could have. The goal was to make it a “virtual” product offering in the same style as our UNIX product. Unfortunately, true virtualization on Windows was impossible with the architecture of the time. To give customers the “feel” of having more control over their web server, we allowed them to register their own DLLs. As you might imagine, this caused all kinds of trouble for our Windows sysadmins. I also remember the company having all kinds of hardware trouble (only on the Windows side), as we had been used to building our own servers to save on hardware costs. Eventually, we invested in some Dell servers and the major hardware problems went away.
While the Windows hosting product was getting off the ground, I was busy creating the Unix systems administration team. At first, I was the only sysadmin, and I had to wear all kinds of hats in that role: security, abuse, system administration, backup/recovery, and top-tier support. All of our servers were hosted in a small NOC located in one of the Westin office towers in downtown Seattle. The NOC was owned and operated by our sister company, Lightrealm Communications. I wouldn’t quite call it a data center, as it couldn’t have been more than 1,500 square feet in total. Our office, where all personnel worked, was on the Eastside, in the city of Kirkland. As such, we had a “dark” data center.
We had built a custom monitoring system that would automatically page me if one of the servers went down. Each server’s power was routed through a device called an “e-commander,” which allowed one to telnet in and power-cycle a box remotely if it stopped responding and needed a hard reboot. This worked decently *most* of the time. Unfortunately, the UFS filesystem used by BSD/OS wasn’t the most resilient to hard booting. When a box came back up after a hard reboot, it would attempt to repair the corrupted inodes automatically, but every once in a while it couldn’t. When that happened, I had to make a trip downtown to give the box personal attention. I think I can count on one hand the number of times I had to make that trip during business hours, but the systems had an uncanny knack for going down between 1 and 5 AM. The first few times I had to drive downtown in the middle of the night to fix a server, I thought it was pretty exciting. It didn’t take long to become quite a chore, however. I had been married for less than a year, and I can still remember my wife getting very angry at my pager.
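The original monitoring system is long gone, but the check-then-escalate idea is simple to sketch. Here is a minimal, modern Python illustration of that kind of loop; the function names, thresholds, and the choice of a TCP connect as the health check are all my own assumptions, not the actual 1999 code (which would have driven the e-commander over telnet):

```python
import socket

def host_is_up(host, port=80, timeout=5.0):
    """Health check: can we open a TCP connection to the server?
    (Illustrative stand-in for whatever probe the real monitor used.)"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def next_action(consecutive_failures, page_after=2, cycle_after=5):
    """Escalation policy (hypothetical thresholds): page the on-call
    sysadmin after a couple of failed checks, and fall back to a remote
    power-cycle via the e-commander if the box still isn't answering."""
    if consecutive_failures >= cycle_after:
        return "power-cycle"   # telnet to the e-commander, hard reboot
    if consecutive_failures >= page_after:
        return "page"          # alert the on-call pager
    return "wait"              # transient blip; keep checking
```

A driver loop would call `host_is_up` on a schedule, track consecutive failures per server, and act on whatever `next_action` returns.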
Besides early-morning filesystem recoveries, the biggest problem with the systems was performance. Since we allowed our customers a great deal of freedom in how they configured their servers, we were a very popular hosting solution for people who wanted to push the limits. Because of the backend system that created the “virtual” server environment, the systems tended to use more resources than a standalone hosting solution would have. The problems ranged from customers running bad “runaway” scripts to sites that simply got more traffic than the system could support. During the night, the systems tended to run pretty smoothly (when they weren’t crashing with filesystem problems), but during the day the loads could get pretty high. Any time the load average rose above 7, system performance would degrade to the point where customers would start to complain. I had to come up with a number of creative ways to automate managing the growing number of systems…
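The load-average threshold lends itself to the same kind of automation. As a sketch only: the threshold of 7 comes from the story above, while the function and everything else here is illustrative, not the scripts we actually ran back then:

```python
import os

LOAD_THRESHOLD = 7.0  # above this, customers started to complain

def load_is_high(one_minute_load=None, threshold=LOAD_THRESHOLD):
    """Compare the 1-minute load average against the complaint threshold.
    Pass a value explicitly for testing; otherwise read it from the OS."""
    if one_minute_load is None:
        one_minute_load = os.getloadavg()[0]  # Unix-only
    return one_minute_load > threshold
```

A cron job built around a check like this could page the on-call sysadmin, or go hunting for the runaway CGI script responsible, before customers noticed.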
[To be continued in part 3]