Some of the more common questions I get are around heartbeats/probes, how the fabric recovers from failed probes, and how the load balancer manages traffic to these instances.
Q: How does the fabric know that an instance has failed, and what actions does it take to recover that instance?
A: There is a series of heartbeat probes between the fabric and the instance: Fabric <-> Host Agent <-> Guest Agent (WaAppAgent.exe) <-> Host Bootstrapper (WaHostBootstrapper.exe) <-> Host Process (typically WaIISHost.exe or WaWorkerHost.exe).
See http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the processes and probes on the Guest OS.
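To make the probe chain more concrete, here is a minimal Python sketch of heartbeat-based failure detection along that chain. It is only an illustration: the HeartbeatMonitor class, the timeout value, and the idea of reporting the first silent link are assumptions for the example, not the fabric's actual implementation or intervals.

```python
from dataclasses import dataclass, field

# Probe chain from the answer above, outermost component first.
CHAIN = ["Fabric", "Host Agent", "Guest Agent", "Host Bootstrapper", "Host Process"]


@dataclass
class HeartbeatMonitor:
    # Hypothetical timeout; the real probe intervals are not given in this post.
    timeout_seconds: float
    last_seen: dict = field(default_factory=dict)

    def record(self, component, now):
        """Note that `component` answered a probe at time `now` (seconds)."""
        self.last_seen[component] = now

    def first_silent(self, now):
        """Return the first component below the fabric whose heartbeat is overdue, else None."""
        for component in CHAIN[1:]:
            seen = self.last_seen.get(component)
            if seen is None or now - seen > self.timeout_seconds:
                return component
        return None


monitor = HeartbeatMonitor(timeout_seconds=10.0)
for name in CHAIN[1:]:
    monitor.record(name, now=0.0)
# Only the two outer agents keep answering after time 0.
monitor.record("Host Agent", now=12.0)
monitor.record("Guest Agent", now=12.0)
silent = monitor.first_silent(now=15.0)
print(silent)
```

Walking the chain from the outside in means the first overdue link is the one reported, which is roughly how a layered probe chain narrows down where a failure sits.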
Q: How does the load balancer know when an instance is unhealthy?
A: There are two different mechanisms the load balancer can use to determine instance health and whether or not to include that instance in the round-robin rotation and send new traffic to it.
Q: What does the load balancer do when an instance is detected as unhealthy?
A: The load balancer will route new incoming TCP connections to instances which are in rotation. The instances that are in rotation are either:
If an instance drops out of rotation, the load balancer will not terminate any existing TCP connections. If the client and server maintain the TCP connection, traffic on that connection will still be sent to the instance that has dropped out of rotation, but no new TCP connections will be sent to it. If the TCP connection is broken by the server (e.g., the VM restarts or the process holding the TCP connection crashes), the client should retry the connection, at which point the load balancer will see it as a new TCP connection and route it to an instance that is in rotation.
Note that for single-instance deployments, the load balancer considers that instance to always be in rotation. Regardless of the status of the instance, the load balancer will send traffic to it.
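The rotation rules above can be illustrated with a small toy model in Python. This is a sketch under assumptions, not the real load balancer: the LoadBalancerModel class, its method names, and the instance names are hypothetical, and the behavior when every instance in a multi-instance deployment is out of rotation is not covered by the text, so the model simply raises an error in that case.

```python
from itertools import count


class LoadBalancerModel:
    """Toy model of the rotation behavior described above."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.in_rotation = set(self.instances)
        self.connections = {}   # connection id -> instance serving it
        self._cursor = 0        # round-robin position
        self._ids = count(1)

    def mark_unhealthy(self, instance):
        # A single-instance deployment is always considered in rotation.
        if len(self.instances) > 1:
            self.in_rotation.discard(instance)
        # Existing connections are deliberately left untouched.

    def mark_healthy(self, instance):
        self.in_rotation.add(instance)

    def open_connection(self):
        """Route a new TCP connection to the next instance in rotation."""
        candidates = [i for i in self.instances if i in self.in_rotation]
        if not candidates:
            # Assumption: this case is not described in the text.
            raise RuntimeError("no instances in rotation")
        target = candidates[self._cursor % len(candidates)]
        self._cursor += 1
        conn_id = next(self._ids)
        self.connections[conn_id] = target
        return conn_id

    def target_of(self, conn_id):
        return self.connections[conn_id]


lb = LoadBalancerModel(["IN_0", "IN_1"])
first = lb.open_connection()     # lands on IN_0
lb.mark_unhealthy("IN_0")        # IN_0 drops out of rotation
second = lb.open_connection()    # new connections now go to IN_1
third = lb.open_connection()
print(lb.target_of(first), lb.target_of(second), lb.target_of(third))
```

The connection opened before IN_0 dropped out keeps its original target, while every later connection goes to IN_1, which matches the "existing connections survive, new connections move" behavior described above.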
Q: How can you determine if a role instance was recycled or moved to a new server?
A: There is no direct way to know if an instance was recycled. Fabric-initiated restarts (e.g., OS updates) will raise the Stopping/OnStop events, but for unexpected shutdowns you will not receive these events. There are some strategies to detect these events:
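One generic way to tell a clean stop from an unexpected one is a persisted "clean stop" marker. The Python sketch below is an illustration under assumptions and not necessarily one of the strategies referred to above: the function names and the state file are hypothetical, and in a real role the equivalent logic would live in the role's startup and OnStop handlers, writing to whatever storage suits the deployment.

```python
import json
import os


def check_previous_run(state_path):
    """Classify how the previous run ended, then mark the current run as in progress."""
    if not os.path.exists(state_path):
        # No state at all: first start, or the state did not follow the instance.
        previous = "no_previous_state"
    else:
        with open(state_path) as f:
            clean = json.load(f).get("clean_stop", False)
        previous = "clean_stop" if clean else "unexpected_stop"
    # Until a clean stop is recorded, this run counts as not cleanly stopped.
    with open(state_path, "w") as f:
        json.dump({"clean_stop": False}, f)
    return previous


def record_clean_stop(state_path):
    """Call from the shutdown handler, i.e. where Stopping/OnStop would fire."""
    with open(state_path, "w") as f:
        json.dump({"clean_stop": True}, f)
```

On startup, a result of "unexpected_stop" means the previous run never reached its shutdown handler, while "no_previous_state" means the marker itself is missing, for example because the file lived on storage that did not survive the restart.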
I will be following up with a couple more blog posts to go deeper into a couple of these topics – troubleshooting using various logs from the VM, and more details about how the Load Balancer works.