Update March 7, 2013
Added to the Q&A section --- Q: How long will the upgrade take? How long will my VM be down?
Roughly once per month Microsoft releases a new Guest OS version for Windows Azure PaaS VMs. The exact schedule varies and the historic trend can be seen at http://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx. During this rollout the Windows Azure Fabric Controller will do two passes through all of the datacenters.
Mark Russinovich has a great blog post which describes the Host OS upgrade process - http://blogs.technet.com/b/markrussinovich/archive/2012/08/22/3515679.aspx.
Note that this article is focused on PaaS scenarios, but the Host OS update process applies to IaaS Persistent VMs as well. For more information about IaaS VM restarts see http://blogs.msdn.com/b/windows_azure_technical_support_wats_team/archive/2013/11/27/windows-azure-iaas-host-os-update-demystified.aspx.
See http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the processes which are running and the location of log files which can be used to troubleshoot.
At this time the Windows Azure platform does not offer proactive notifications when an OS upgrade is happening. The Windows Azure development team is working on this functionality so that service administrators can better plan for upgrades and possible service impact. Your role instances will receive a RoleEnvironment.Stopping event prior to being shut down and you can use that event to gracefully terminate any work that the role instance is doing or notify an administrator that an instance is shutting down.
In the meantime you can subscribe to the Windows Azure OS Updates RSS feed at http://sxp.microsoft.com/feeds/3.0/msdntn/WindowsAzureOSUpdates. This feed should be updated the same day that the OS updates start being rolled out to the datacenter. This typically does not give advance notification, but it does help identify when the updates are happening. As noted above in the Host OS and Guest OS description, the update process can take several days to complete, so it may be one or more days between when the RSS feed is updated and when your hosted service begins updating.
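One way to watch the feed without a reader is to poll it and report entries newer than the last one you saw. The following is a minimal sketch only. It assumes the feed is plain RSS 2.0 with item, title and pubDate elements, which the real feed may not match, and the sample XML and its titles are hypothetical stand-ins rather than real feed content.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample; the real feed's layout and titles may differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Windows Azure OS Updates (sample)</title>
    <item>
      <title>Hypothetical Guest OS rollout started</title>
      <pubDate>Thu, 07 Mar 2013 18:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Hypothetical earlier rollout</title>
      <pubDate>Tue, 05 Feb 2013 18:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml, last_seen):
    """Return (title, published) pairs newer than last_seen, oldest first.

    last_seen is a timezone-aware datetime, or None to return every item.
    """
    root = ET.fromstring(feed_xml)
    found = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if last_seen is None or published > last_seen:
            found.append((title, published))
    return sorted(found, key=lambda pair: pair[1])


if __name__ == "__main__":
    cutoff = datetime(2013, 3, 1, tzinfo=timezone.utc)
    for title, published in new_items(SAMPLE_FEED, cutoff):
        print(published.isoformat(), title)
```

In practice you would download the feed text on a schedule and pass it to new_items together with the timestamp of the newest entry you have already handled.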
The Guest OS list at http://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx and the OS version selection dropdown in the management portal are typically updated after the Guest OS rollout has completed so you should not use the latest entry in these lists as an indication of when the OS updates are in progress.
At this time there is no direct way to detect a Host OS upgrade, but you can see the evidence of the reboot within the logs on the VM:
A: You cannot opt out of the Host OS updates because Microsoft must maintain updated and patched host OSes within the datacenter. You can opt out of the Guest OS update by specifying a version of the Guest OS, but note that your service will no longer receive security patches and may be vulnerable. See http://msdn.microsoft.com/en-us/library/windowsazure/ff729422.aspx.
A: There is no way to control when an individual instance or service will be upgraded for the Host OS. The upgrade is started on all Azure datacenters across the world at approximately the same time, and the fabric works continuously on upgrading each datacenter. This process takes several days due to the complexity of making sure upgrade domain rules are followed for all cloud services, and there is no way to control or determine when a specific instance will be impacted. To control the Guest OS update you can specify a fixed Guest OS version and then update it whenever you are ready.
A: Connecting to an Azure PaaS VM via RDP and making changes or installing software is unsupported. At any point in time the VM may be completely rebuilt and any changes you make will be lost. This can happen if the hardware fails and we have to start up a new VM on new hardware. This will also happen during the Guest OS update when the Windows partition is rebuilt. If you need to install software or make changes to the VM you must create a startup task and do the work from there. This ensures that your configuration is applied again whenever the VM is recreated.
A: The updates that are installed onto the new guest OS version are publicly available and thoroughly tested hotfixes which are also being deployed to servers around the world via Windows Update, so the chance of a negative impact to your service is extremely small. However, the root of the question goes back to how you manage OS patches in your on-premises services: do you install directly on the production servers and assume it will work, or do you have a staging environment where you test the patches first? You will follow the same pattern in Azure. If you want to have a staging environment to test patches prior to production then you should configure your production service to use a fixed version OS string in the .cscfg file. Then when a new guest OS is available you can deploy your service into the staging slot using the newest guest OS version. After you have validated that the service works correctly on the latest guest OS you can then either do a VIP swap, or do an in-place upgrade of your production service to use the latest OS.
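To check which pattern a given .cscfg follows, you can read the osVersion attribute on the root ServiceConfiguration element, where "*" (or no attribute at all) means automatic updates and anything else pins the Guest OS. The sketch below is only an illustration. The service name, role name and the specific version string in the sample are placeholders in the documented format, not values taken from a real deployment.

```python
import xml.etree.ElementTree as ET

# XML namespace used by .cscfg service configuration files.
NS = "{http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration}"

# Placeholder configuration; names and the version string are illustrative.
SAMPLE_CSCFG = """<ServiceConfiguration serviceName="SampleService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"
    osFamily="3" osVersion="WA-GUEST-OS-3.2_201302-01">
  <Role name="WebRole1">
    <Instances count="2" />
  </Role>
</ServiceConfiguration>"""


def guest_os_policy(cscfg_xml):
    """Return ("automatic", None) or ("fixed", version) for a .cscfg document."""
    root = ET.fromstring(cscfg_xml)
    if root.tag != NS + "ServiceConfiguration":
        raise ValueError("not a service configuration file")
    # A missing attribute is treated the same as "*" (automatic updates).
    version = root.get("osVersion", "*")
    if version == "*":
        return ("automatic", None)
    return ("fixed", version)


if __name__ == "__main__":
    print(guest_os_policy(SAMPLE_CSCFG))
```

A check like this could run in a build step to confirm that the production configuration stays pinned while the staging configuration uses "*".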
A: There is a common misconception that the more patches being applied, the longer the update will take. This is based on the belief that the upgrade works similarly to a Windows Update upgrade on your local desktop machine, where a bunch of patches are copied to Windows and installed with subsequent reboots, but this is not how upgrading works in Azure. When a new OS version is being released in Azure, the OS team will take the latest image, apply the patches, and then save a new VHD with this new base image. This base image is then copied to a repository in Azure. When the fabric is instructed to do an OS upgrade it will first make a copy pass where it copies this new base image VHD to the hard disks on each server in the datacenter that is going to be upgraded. Once this copy process is finished the fabric will begin the upgrade process, following the normal upgrade domain rules. When a guest is going to be updated the fabric will do a graceful shutdown of the OS and then start a new VM using the new base image. The time it takes to upgrade a given VM for a Guest OS is roughly the time it takes to do a graceful Windows shutdown + the time it takes to start Windows.
The timing for a Host OS update is a little different. When a Host is being upgraded it first sends the shutdown message to each Guest OS running on that Host. Each Guest OS is then given the standard OnStop and Windows Shutdown time to finish shutting down. Once every Guest OS is shut down, the Host OS does a graceful shutdown and goes through its normal shutdown procedure. Once the Host OS is shut down, the Host is rebooted using the new OS image. Once the Host is up and running it will start each of the Guest OSes. Typically this Host OS update process will take 15 to 20 minutes, but it can vary depending on how many other Guests are on that Host and how long they take.
Having said that, there will always be exceptions if there is a failure on a particular node and the Azure fabric determines that the Guests on that node need to be moved to a different node.
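The timing described in this answer can be written down as a rough back-of-the-envelope model. This is only a sketch: every duration below is a made-up number, and it assumes the Guests on a Host shut down in parallel so the Host waits for the slowest one, which the description above implies but does not state outright.

```python
def guest_os_update_downtime(shutdown_min, startup_min):
    """Guest OS update: graceful Windows shutdown plus Windows startup."""
    return shutdown_min + startup_min


def host_os_update_downtime(guest_shutdowns_min, host_shutdown_min,
                            host_boot_min, guest_startup_min):
    """Downtime seen by one Guest during a Host OS update.

    Assumption: Guests shut down in parallel, so the Host waits for the
    slowest Guest before it shuts down, reboots, and restarts the Guests.
    """
    wait_for_all_guests = max(guest_shutdowns_min)
    return (wait_for_all_guests + host_shutdown_min
            + host_boot_min + guest_startup_min)


if __name__ == "__main__":
    # All figures are hypothetical minutes, for illustration only.
    print(guest_os_update_downtime(shutdown_min=2, startup_min=3))
    print(host_os_update_downtime(guest_shutdowns_min=[2, 5, 1],
                                  host_shutdown_min=3,
                                  host_boot_min=6,
                                  guest_startup_min=4))
```

The point of the model is that neither figure depends on how many patches are in the new image, only on shutdown and startup times and, for the Host OS case, on the slowest neighbor on the same Host.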
A: When the OS is being updated the Azure Fabric will perform a graceful shutdown of your role instance. This means that your ASP.NET code will receive the Application_End event, and the Azure service runtime will raise the Stopping and OnStop events. Your code will have 5 minutes to finish cleanup work in OnStop before the process is shut down. After your Azure host process is shut down, Windows will go through a normal graceful shutdown, including raising the standard OnStop and related events for Windows Services. For more information about gracefully handling a shutdown of your instance see http://blogs.msdn.com/b/windowsazure/archive/2013/01/14/the-right-way-to-handle-azure-onstop-events.aspx, http://msdn.microsoft.com/en-us/library/hh180152.aspx and http://msdn.microsoft.com/en-us/library/windowsazure/microsoft.windowsazure.serviceruntime.roleentrypoint.onstop.aspx.
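The 5 minute limit suggests treating cleanup as a budgeted task: finish what fits and leave the rest for another instance to pick up. The sketch below shows that pattern in Python purely as an illustration. It is not the Azure service runtime API (real roles are .NET code that overrides OnStop), and the function name, the safety margin and the injectable clock are all hypothetical choices.

```python
import time
from collections import deque

ON_STOP_BUDGET_SECONDS = 5 * 60  # the 5 minute limit described above


def drain_before_deadline(pending, handle,
                          budget_seconds=ON_STOP_BUDGET_SECONDS,
                          safety_margin_seconds=30,
                          clock=time.monotonic):
    """Process queued work items until the stop budget is nearly used up.

    Returns the items that were not started, so the caller can hand them
    back to a durable queue instead of losing them at shutdown.
    """
    deadline = clock() + budget_seconds - safety_margin_seconds
    while pending and clock() < deadline:
        handle(pending.popleft())
    return list(pending)


if __name__ == "__main__":
    work = deque(["order-1", "order-2", "order-3"])
    leftover = drain_before_deadline(work, lambda item: print("finished", item))
    print("not started:", leftover)
```

The deadline is only checked between items, so each individual item still needs to be short relative to the budget.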
This explains a lot of my recent headaches with our Azure web roles in an "infinite initialization" state.
Thanks for the information - it is very helpful!
Thanks for detailed information :)
We have had this issue where roles fail after the OS updates. Reimaging always fixes it.
Matt, I would encourage you to open a support incident at www.windowsazure.com/.../contact next time this happens and the team can help you investigate why your role fails to start. The root cause is typically pretty easy to find and the fix is usually easy to implement, and this will make your service much more robust.
Do you have any idea why two reboots are necessary? Once the host has been rebooted why not immediately reimage all the VMs inside it and let them start? What's the need for the second reboot?
Dmitry, the two-pass upgrade has been around since Azure started and I am not positive of the reasoning behind this design decision. My best guess is that it isolates the Host OS upgrade in order to make it faster and get through the datacenter as quickly as possible. During the host OS upgrade of any specific server the fabric waits for a maximum of 15 minutes for each guest on that host to report Ready before it is able to move to the next upgrade domain for that service. During a Host OS update the Windows partition on the guest OS is preserved, which can shorten the startup time for the hosted service running in that guest OS. During a Guest OS update the Windows partition is wiped out, which means startup tasks that do installations will have to run again, increasing the amount of time it takes to get to the Ready state. See blogs.msdn.com/.../windows-azure-disk-partition-preservation.aspx for more info on the disk preservation scenarios.
That's really awesome!!! Nice to know it before freaking out
There is an intermittent issue with the certificate path for our SSL web service that occurs at certain times, I am assuming, either when our cloud service on Azure reboots or is moved. This occurred on November 18, 2013, and previously on or about September 27, 2013. Using SSL Checker at www.sslshopper.com/ssl-checker.html it reports that the certificate is not trusted in all web browsers. When I add our domain to IIS site binding, the issue is resolved.
Sometime after the September occurrence, I later removed the site binding setting (as it is a real issue that prevents using staging for testing) and we had no issues until last night, Nov 18. Again I had to add the site binding to resolve the issue (at 08:45 UTC). I have now removed the site binding setting at 13:30 UTC and the issue remains resolved.
The real problem is that before I changed the site binding setting, requests to our web service could not be made. Salesforce.com only allows Apex callouts for GET and POST requests to SSL web services only for certain specified root certificates and only when the certificate path can be determined correctly by Salesforce. Callouts will result in a PKIX path building failed error when the path can not be determined. After adding the domain to IIS site binding, Salesforce has no problem.
This appears then to be a Windows Azure issue where the certificate paths are not re-established promptly when certain changes are made to the server instance. It seems that having multiple role instances would not avoid this issue as our web service works, using soapUI for testing, but the certificate path for Salesforce is still not correct.