Random Disconnected Diatribes of a p&p Documentation Engineer
It's been four months since I moved all my websites to the Windows Azure Web Sites platform, so how's it working out? Very well so far is pretty much all I can say, because there's been nothing in terms of operational activity to report. A welcome change after all the fuss and effort of running the same sites on my own web server. And, so far, the bill has been zero. What you might call excellent value for money!
I monitor the sites using my own Server Monitor utility (available here), and they consistently show a minimum of 99.9% availability - even 100% for a lot of the time. Very occasionally the hourly automated FTP upload for the local weather site fails, but that's perhaps only once every couple of weeks.
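To put an availability figure like that in perspective, here is a minimal Python sketch of the arithmetic behind it. The function names and the 30-day window are assumptions made for illustration; they are not taken from the Server Monitor utility.

```python
def availability_percent(successful_checks: int, total_checks: int) -> float:
    """Percentage of monitoring checks that succeeded."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100.0 * successful_checks / total_checks


def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime that fit inside the given availability over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability) / 100.0


if __name__ == "__main__":
    # 99.9% over an assumed 30-day month leaves roughly 43 minutes of downtime.
    print(round(allowed_downtime_minutes(99.9), 1))
```

In other words, a site can be unreachable for a little over 43 minutes in a 30-day month and still report 99.9%.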
Access and response times do, however, vary. As I've remarked before, the initial startup time when the sites have been idle for a while, and hence are no longer loaded on the shared web server, can be a bit more than I'd wish for. It's not uncommon to see a delay of five seconds or longer on the first hit for the most complex site. However, subsequent requests return startlingly quickly from the North Europe datacenter where the sites are hosted, and people I've spoken to who use the sites have remarked how fast they seem to be.
I haven't needed to do much in the way of updates to the sites, and what bits I have done have been easy using Web Matrix. A recent neat addition to Windows Azure Web Sites is the Web Matrix icon on the lower ribbon of a website in the Management Portal that installs the latest version of Web Matrix, and launches it with a connection to the hosted site files.
The sites use Windows Azure SQL Database, and I've backed up the databases using the Import/Export tool within the Management Portal. The first time I used this it was really fiddly and annoying, with poor documentation and an unintuitive interface that resulted in several attempts with different subsets of values for the storage account and other parameters. However, the latest update to this tool makes it really easy. Just click Export in the lower ribbon and it automatically selects the storage account and other information required. You can often just select a blob container and enter your database password. Best of all, it can even create a new blob container if you haven't already done this.
And it's reassuring to see that the Azure team has as much trouble keeping up with the changes to the features and the portal UI as we do here in p&p. At the time of writing their online docs for using the Import/Export feature still described the previous version...
I love those scenes in nature documentaries where they deploy a remote camera, and the local wildlife takes an interest in it so you get wonderful close-up shots of inquisitive animals. A while ago my wife persuaded me that we should get one to use in the small copse of trees next to our house.
There’s certainly plenty of evidence of night-time activity because the local wildlife population has succeeded in producing several clearly defined pathways through the trees. No doubt, in part, it's down to the selection of foxes that come to visit every evening, helping to dispose of leftovers from the kitchen and the food that some days our two fine-mouth-hungry cats decline to even sniff.
However, we’ve also heard interesting night-time rumblings and cries from what we assumed were badgers, plus occasional visits from a bad-tempered squirrel that chases the cats and steals the birds’ sunflower hearts from the feeder. There have even been several reports locally of a large black cat-like creature that may have escaped from captivity, though this is probably an urban myth that you hear in every area of the country. But you never know!
So, after a few months, did we catch any views of the passing wildlife? Here’s a selection of the results:
OK, so it’s not as dramatic as those people who put pictures on their blog of bears scavenging from their dustbin, or roe deer eating their geraniums, but it’s nice to know that we do get a regular procession of wildlife passing through. Even if most of it is ours and all the neighbors’ cats.
The camera is a ScoutGuard 550, which captures images at 5 Megapixels and can also do video. The only downside is that, at night, it takes a couple of seconds to switch on the infrared LEDs and take a picture when it detects motion, so you do get a lot of pictures of tails...
Does lateral thinking mean you need to look outside your own head instead of just accepting the most obvious solution? If so, I might as well plead guilty in terms of managing the backup power supply for my servers.
Like a great many people I depend on APC UPSs to handle mains power fluctuations and interruptions for my servers. Since Windows NT, through Server 2000, 2003, and now 2008 R2 I've blithely installed the default power management utilities provided by APC. Everything was hunky dory until I went virtual and set up the servers using Hyper-V. That's when the problems started.
Mind you, I can't say I really noticed the problems at the start. OK, so the latest versions of the APC software don't seem to install on a machine configured to host Hyper-V VMs, but the earlier version did and I continued to use that. The only things I noticed were occasional messages saying that the server had lost its connection with the UPS, but then immediately restored it.
I had the software set up to shut down the server gracefully, well before the battery ran out in the UPS, and reboot it when back to 60% charge. In the past on Server 2000 etc. this has worked fine. So I reckoned that, because the Hyper-V system manages graceful shutdown of the hosted VMs, it would also work fine on 2008 and 2008 R2, and initial tests proved this to be the case.
However, during a recent power outage (I was rewiring a ring main socket) I came back to find the UPS fully charged but the server stopped. Pressing the power button initiated a reboot sequence but the machine just shut down again. It looked very much like the recent episode when a motherboard failed in another machine, and here I was on a busy Saturday morning pondering another visit from the Dell man. Thankfully I took out a full onsite warranty this time!
However, after shutting down the UPS, then restarting it and the server, everything came back to life again. The event log showed a graceful shutdown and reboot, and all the VMs that are set up to auto-start were running fine. Interestingly, the one that doesn't auto-start showed up in Hyper-V Manager as "Suspended" rather than "Off". It didn't take long to figure that Hyper-V had suspended the VMs rather than closing down and then restarting them.
But why had the server not restarted automatically? In fact, as the power was off for only a few minutes, why did it shut down in the first place? The answer, as evident in the Event Log, was a "Communication Lost" event from the APC software, followed by "Runtime Limit Reached" and then "Shutting Down the System". If the software can't see the UPS it assumes it's broken and shuts down the server automatically, even though there was at least an hour left in the batteries.
According to the APC site, the free software doesn't support Hyper-V because it can't guarantee to safely shut down each VM individually. As many people regularly attempt to point out on the APC forum, surely it doesn't need to. Server 2008 and 2008 R2 can quite happily respond to a shutdown message and safely manage the VMs it hosts. The suggestion from aggravated forum posters is that it's just a cynical way for APC to sell the network version of the management software.
Oh well, I don't mind paying a bit for the real thing, but it seems that to make it work I also have to buy and install a special network management card, and install a ton more drivers and stuff. Do I really want to do that? So I look at the Open Source alternative, apcupsd, but it looks complicated enough to need more than what remains of the afternoon to sort out. I'll need a day to read and understand the manual.
But that's when the "outside your head" thing struck me. A quick Bing located a post by Ben Armstrong (the Virtual PC Guy) that says that the built-in power management stuff in Server 2008 R2 can manage your server and UPS automatically. In fact, as I discovered when installing the APC software, battery management is part of the O/S and all you install from APC is the service that manages the UPS and interacts with Windows. Without the APC software, Server 2008's default settings will monitor battery power and can initiate a server hibernate and shutdown when it's low, though you probably want to tweak the Low and Critical level settings in the Advanced Power Management dialog to something less optimistic than 10% and 5%.
Then I read somewhere else that installing the Hyper-V service changes the server's behavior by disabling hibernate mode, because hibernating a server that hosts VMs is not recommended. When I checked the advanced power configuration settings in my box the Critical Battery Action was still set to "Hibernate", but opening the drop-down list showed that the only options available now are "Do Nothing" and "Shut Down". Obviously installing Hyper-V does not change the current settings. I selected "Shut Down" and set the Critical Battery Level to 50% to make sure that the O/S has plenty of time to shut down all the VMs. I also set the Low Battery Level to 75% and the Low Battery Notification to "On" so that I can see when (and if) the server detects a power failure.
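The thresholds I chose can be pictured with a small Python model. This is only a sketch of the intended behavior, not how Windows power management is actually implemented: the class and function names are hypothetical, and treating each level as "at or below" is my assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatteryPolicy:
    """Hypothetical model of the settings described above."""
    low_level: int = 75            # Low Battery Level (%)
    critical_level: int = 50       # Critical Battery Level (%)
    low_notification: bool = True  # Low Battery Notification "On"
    critical_action: str = "Shut Down"


def action_for(charge_percent: int, on_battery: bool,
               policy: BatteryPolicy = BatteryPolicy()) -> str:
    """Return the action this simplified model would take at a given charge."""
    if not on_battery:
        return "None"                  # mains power present, nothing to do
    if charge_percent <= policy.critical_level:
        return policy.critical_action  # assumed: triggers at or below the level
    if charge_percent <= policy.low_level and policy.low_notification:
        return "Notify"
    return "None"


if __name__ == "__main__":
    for charge in (90, 75, 60, 50):
        print(charge, action_for(charge, on_battery=True))
```

With these values the model would notify once the charge falls to 75% and start a shutdown at 50%, which is the behavior I was hoping to get from the real settings.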
Since uninstalling the APC software and allowing Windows to manage its power requirements directly I've had no Event Log warnings and the power icon in the system tray seems to work, as a quick shutoff of the mains feed to the UPS demonstrated. Of course, where the APC software and the Open Source apcupsd service have an advantage is that they can restart the server when power is restored. And without the APC software I can't monitor the UPS, or configure the EEPROM settings inside it (although apcupsd provides a utility that can do this). So before I uninstalled the APC software I set up the UPS to do a shutdown only (not turn off) and allow 15 minutes for the server to shut down when the low battery warning occurs.
I also configured the UPS to turn on the power again when the charge reaches 60% after a power failure, and the server BIOS is configured to auto-start when power is restored. Therefore it should, in theory, all work by powering up the server again automatically. The real test was a few days later when the electrician arrived to rewire the kitchen as part of our ongoing modernization plan. Unfortunately, while it kind of worked, there are some issues.
The server did shut down, and restart again. But examining the Event Logs after the restart revealed that, despite the Power settings in Windows Server being set to notify when the battery charge drops to 75%, there was no matching Event Log message. Maybe the warning just pops up in the notification area of the screen. But the Event Log messages did indicate that the server correctly shut down, and restarted with no unexpected errors.
Things were different with the VMs, however. I had configured a combination of different settings in Hyper-V Manager to experiment with the behavior. One VM was set to "Turn Off" and restart if previously running, one was set to "Save" and restart if previously running, and one was set to "Shut Down" and restart if previously running. The fourth was set to "Save" and always restart. Hyper-V Manager revealed that they had all started automatically, so that's OK. The "Turn Off", "Save", and "Shut Down" actions when the host server shuts down all work as expected and allow for automatic restart if previously running.
The problem was that the Event Logs in all of the VMs indicated that they had all shut down unexpectedly. There was the System log message saying just this, and the Critical system error message to confirm it on every one. While the host server had shut down correctly, it seems that the VMs had not.
When you shut down the server manually this doesn't happen, so it must be that the shutdown initiated by the battery power management system does something different from the "Shut Down" command on the Start menu. I wondered if it was just that the UPS had switched off the power to the host server before it had a chance to shut down, turn off, or save the VMs, but the fact that the host server had shut down properly without error seems to indicate this isn't the case.
From the times recorded in the host server and VM Event Logs and by my NAS (which also logs power failure events), it seems that the shutdown occurred only 30 minutes after the power failure, whereas the UPS reckons it has more than 90 minutes of battery life. So it does look like the shutdown occurred when Windows power management detected only 50% battery life remaining.
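As a rough sanity check on those timings, here is a small Python sketch. It assumes the battery charge falls linearly from 100%, which real UPS batteries do not do exactly, and the function names are mine, so treat the results as ballpark figures only.

```python
def minutes_to_reach(charge_percent: float, full_runtime_minutes: float) -> float:
    """Minutes on battery until charge drops to charge_percent (linear model)."""
    return full_runtime_minutes * (100.0 - charge_percent) / 100.0


def implied_full_runtime(elapsed_minutes: float, charge_percent: float) -> float:
    """Total runtime implied by reaching charge_percent after elapsed_minutes."""
    if charge_percent >= 100.0:
        raise ValueError("charge_percent must be below 100")
    return elapsed_minutes * 100.0 / (100.0 - charge_percent)


if __name__ == "__main__":
    # UPS estimate of 90 minutes, Critical Battery Level of 50%.
    print(minutes_to_reach(50, 90))      # expected time to reach 50%
    # Observed: shutdown about 30 minutes after the power failed.
    print(implied_full_runtime(30, 50))  # total runtime that would imply
```

Under that linear assumption a 90-minute battery should take about 45 minutes to reach 50%, so a shutdown after 30 minutes is in the right region, but it hints that the real runtime under load is nearer 60 minutes than the 90 the UPS reports.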
Does the UPS send some signal to the Windows power management system that initiates a shutdown? Or perhaps Windows power management sends a signal to the UPS to hibernate until the power is restored? Or maybe it's just that there's some setting hidden away somewhere that I haven't found yet...
So, at last, we're done. After fighting with multiple new versions of the Windows Azure SDK, updated features in the management portal, changes to the functionality of services, and the regular changes to the names of various parts of Windows Azure, we've shipped the third editions of two of our Windows Azure guides and the associated Hands-on Labs.
The first, "Moving Applications to the Cloud", is aimed at those whose Field is Brown. It focuses on getting your existing on-premises applications running in Windows Azure using Virtual Machines, Windows Azure Web Sites, Cloud Services, and many other features of Windows Azure.
Through a series of migration stages, our fictional company named Adatum moves their aExpense application to Windows Azure. The first step is to use Windows Azure Virtual Machines, including a VM running SQL Server and another running Active Directory. This approach minimizes the need to change the application code; it simply runs exactly as it did when on-premises.
Next, Adatum experiments with Windows Azure Web Sites before refactoring the application to run as a Cloud Service. Along the way Adatum switches over to using federated authentication with Windows Azure Access Control, and using Windows Azure SQL Database instead of a hosted SQL Server.
Adatum then adds background processing with a separate Cloud Services worker role, before moving the data to Windows Azure table storage. Along the way Adatum calculates running costs, adds features to make the application scalable, and takes advantage of other Windows Azure features such as Caching.
OK, so most companies won’t go through the multiple migration steps that Adatum carried out, but the aim of the guide is to demonstrate as many of the available options as possible. As well as the planning, design, decision making, and development processes the guide also discusses application lifecycle management issues such as testing, monitoring, and maintenance.
The second guide, "Developing Multi-tenant Applications for the Cloud", is aimed at Greenfield scenarios. While much of the content is devoted to understanding the concepts of multi-tenant application design and development, the fundamentals are equally applicable to all kinds of applications designed from scratch to run in the cloud.
This guide is centered round a fictional independent software vendor (ISV) named Tailspin, and its design and implementation of the multi-tenant Surveys application. It discusses hosting options, application partitioning, and data storage options for multi-tenancy, but also contains a wealth of information about designing Windows Azure applications so as to maximize availability, scalability, elasticity, and performance. In addition, the guide explores different security and authentication options, how you can implement features directly related to ISVs, and techniques for managing the application.
One of the fundamental features for maximizing performance is to appreciate and manage the throughput limitations imposed by Windows Azure services and the Internet itself, and this guide will help you to understand how you can work round these limitations. For example, it explores how Tailspin uses multiple queues, the delayed write pattern, storage partitioning, and optimistic concurrency in order to maximize performance.
Both of the guides have had a full makeover from the previous editions, and the addition of considerable new content, so take a look and let us know what you think. Or just tell us what color your field is...