Editor's Note: This post was written by Apurva Joshi, Program Manager on the Windows Azure Web Sites Team.
How many times have you been woken up in the middle of the night for an issue that was resolved simply by restarting your web site? Wouldn’t it be nice to detect certain conditions and recover automatically?
With recent updates to Windows Azure Web Sites (WAWS), we have tried to address these questions. There are some new enhancements to the “Always ON” feature, and with these enhancements comes the ability to automatically recycle the worker process hosting your web application. We call this the “Auto Healing” feature, and here is how it works:
You simply define the triggers in the root web.config file of your web site and configure the actions to be performed when these triggers are hit. At a high level, your configuration section will have the following structure:
NOTE: Just like “Always ON”, this feature is ONLY available with the Standard instances.
Let us break down the available options by scenario.
(A detailed explanation of all supported elements and attributes is at the end of the post.)
Consider a scenario where you need to recycle your application automatically after it has served X requests in Y amount of time. You know that it just doesn’t scale well after a huge influx of requests in a short amount of time. You want to detect this condition, recycle the worker process automatically, and log an event.
You simply edit the root web.config file for your application with the following sample configuration. (If you have an existing web.config file, please copy the <monitoring> section under the existing <system.webServer> section.)
The above configuration will recycle the worker process once it has served 1000 requests in 10 minutes. It will also log an event in the eventlog.xml file (found in the Logfiles folder of your web root directory). Having an event logged helps you track down occurrences of an auto-healed web site and provides important forensic information for troubleshooting or root cause analysis. When the first request comes in, we start the timeInterval clock. We then start counting occurrences. If the count reaches the maximum before the timeInterval expires, we take an action. If the time interval expires, we reset both the timer and the count. The effect of this is that, given the above configuration, something like this could happen:
00:00:00 – First request arrives
00:09:59 – 998 requests are served
00:10:00 – Timer expires and is reset to 0
00:10:01 – 999 requests are served
In this scenario, we did not have 1000 requests occur in either the first or second timeInterval window, so no action is taken.
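To make the window behaviour concrete, here is a minimal Python sketch of the counting rule described above. It is only a model for illustration, not the actual WAWS implementation: the class name `FixedWindowTrigger`, the helpers `spread` and `run`, and the assumptions that the action fires when the count reaches the maximum and that a new window starts with the first request after the previous one expires are all ours.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FixedWindowTrigger:
    """Hypothetical model of a count/timeInterval trigger (not WAWS code)."""
    max_count: int                      # e.g. 1000 requests
    interval_seconds: int               # e.g. 600 seconds (10 minutes)
    window_start: Optional[int] = None  # time of the first request in the window
    count: int = 0

    def record(self, now: int) -> bool:
        """Record one request at time `now` (seconds); True means 'take action'."""
        expired = (
            self.window_start is not None
            and now - self.window_start >= self.interval_seconds
        )
        if self.window_start is None or expired:
            # First request, or the interval expired: reset timer and count.
            self.window_start = now
            self.count = 0
        self.count += 1
        if self.count >= self.max_count:
            # Assumed: reaching the maximum inside the window fires the action.
            self.window_start = None
            self.count = 0
            return True
        return False


def spread(n: int, start: int) -> List[int]:
    """n request timestamps spread evenly over start .. start + 599 seconds."""
    return [start + (i * 599) // (n - 1) for i in range(n)]


def run(trigger: FixedWindowTrigger, timestamps: List[int]) -> List[int]:
    """Return the timestamps at which the trigger fired."""
    return [t for t in timestamps if trigger.record(t)]


# The timeline above: 998 requests in the first window, 999 in the second.
quiet = run(FixedWindowTrigger(1000, 600), spread(998, 0) + spread(999, 600))
# A burst of 1000 requests inside a single 10-minute window.
burst = run(FixedWindowTrigger(1000, 600), spread(1000, 0))
print(quiet, burst)  # [] [599]
```

Under these assumptions, 1997 requests in 20 minutes never fire the trigger because neither window reaches 1000 on its own, while 1000 requests inside one window do.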
NOTE: If you have multiple instances of your web site, it will only restart the worker process for the instance that has hit this trigger and not all instances.
Example of an event logged in eventlog.xml file.
Consider a scenario where the performance of your application starts degrading and several pages start taking longer to render. You would like to detect this situation and recycle the worker process automatically.
The above configuration will recycle the worker process when it detects that 20 requests have taken more than 45 seconds to execute in the last 2 minutes. It is important to note that the slowRequests trigger is evaluated at the end of each request's execution, which makes it equally important to set timeInterval to a higher value than timeTaken.
Consider a scenario where you would like to be notified when your web site starts throwing specific HTTP status codes, sub-status codes, or win32 status codes. You could choose to recycle or simply log an event in the eventlog.xml file (found inside the Logfiles folder of your web site's content root).
You simply edit the root web.config file for your application with the following sample configuration:
The above configuration will log an event in the eventlog.xml file when it detects that 10 requests resulted in an HTTP status code of 500 with a sub-status code of 100 in the last 30 seconds.
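The interplay between timeTaken and timeInterval can be sketched in Python. This is a simplified model under our own assumptions, not the WAWS implementation: the function `slow_request_fires` is hypothetical, requests are handled one after another, and the same reset-on-expiry window as the request-count trigger is assumed.

```python
from typing import List, Optional, Tuple


def slow_request_fires(
    requests: List[Tuple[float, float]],  # (start_time, duration) in seconds
    time_taken: float,
    count: int,
    time_interval: float,
) -> List[float]:
    """Return the completion times at which the modelled trigger would fire."""
    # A request is only evaluated when it finishes, so work with completion times
    # of the requests that ran longer than time_taken.
    completions = sorted(s + d for s, d in requests if d > time_taken)
    fired: List[float] = []
    window_start: Optional[float] = None
    seen = 0
    for done in completions:
        if window_start is None or done - window_start >= time_interval:
            # Start a fresh window at this slow request.
            window_start, seen = done, 0
        seen += 1
        if seen >= count:
            fired.append(done)
            window_start, seen = None, 0
    return fired


# Five back-to-back requests that each take 50 seconds (slower than 45 seconds).
sequential = [(i * 50.0, 50.0) for i in range(5)]

print(slow_request_fires(sequential, time_taken=45, count=2, time_interval=120))
# [100.0, 200.0]
print(slow_request_fires(sequential, time_taken=45, count=2, time_interval=30))
# []
```

In this model, a timeInterval shorter than timeTaken means consecutive slow requests always complete in different windows, so the count never accumulates and the trigger never fires.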
NOTE: If you have multiple instances of your web site, it will only log an event for the instance that has hit this trigger and not all instances. Optionally, you can choose to recycle instead of just logging an event. Recycling logs an event by default.
Consider a scenario where you are troubleshooting a memory leak in your web site and would like to perform a custom action, such as generating memory dumps, sending an email notification, or generating memory dumps and then recycling the process.
The above configuration will execute a custom action to run procdump.exe and generate mini memory dumps when it detects that the worker process has reached 800MB of private bytes. Auto healing will not trigger on certain HTTP error codes that come from http.sys (the kernel driver), where the request never makes it into the worker process pipeline. Some examples of such status codes are: 304, 302, 400 (many 400s but not all), 503, etc.
NOTE: If you have multiple instances of your web site, it will only generate memory dumps for the instance that has hit this trigger and not all instances. Optionally, you can choose to run a custom action that will send an email, etc. Also note that procdump.exe is not available by default in the root of your web site (d:\home); it is something you will have to xcopy deploy with your web site.
Example of an event logged in eventlog.xml file for action type of recycle.
Finally, if you would like to configure a trigger on a specific page/URL, you can use our FREB module and follow the steps outlined in the following blog
This approach will incur a 5-10% performance hit and will require you to enable FREB.
NOTE: The above approach is effective only in Standard mode, since we automatically disable FREB after 1 hour in Shared and Free modes.
The following is the list of supported configurations and their meanings.