Recently I was working on a very interesting case. My customer deployed his website on an Azure Cloud Service, with the same codebase deployed to both the Production and Staging environments. He had also configured a third-party monitoring service to keep an eye on these environments.
After some time he started seeing alerts that the website in his Staging environment was unreachable externally. Here are a few symptoms he noticed about the issue:
We confirmed that there were no issues from the Azure backend side, so it looked like an application-specific issue. We logged into the VM and investigated the following logs:
WaAppAgent.Log: This log contains status updates and heartbeat notifications and is updated every 2-3 seconds. It provides a historical view of the instance's status and will tell you when the instance was not in the Ready state. We found that the VM was in good health at all the reported times.
IIS Logs: The IIS logs show very little activity, but we noticed entries with a 200 response during the reported times with very high response times for small static files.
8/9/2013 at 4:33 AM GMT - 2013-08-09 04:33:22 W3SVC839472807 RDXXXXXXXXXXXX IP.x.x.x GET / - 80 - IP.x.x.x HTTP/1.1 - - - xxx.cloudapp.net 200 0 64 0 79 17224
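Spotting these slow entries is easier with a small script than by eye. Below is a minimal sketch that parses a W3C-format IIS log line and flags requests that returned 200 but took suspiciously long. The field order is an assumption inferred from the entry above; real IIS logs declare their exact order in a `#Fields:` header line, which should be used instead when available.

```python
# Assumed W3C field order for this site's IIS logs (check the "#Fields:" header).
FIELDS = [
    "date", "time", "s-sitename", "s-computername", "s-ip",
    "cs-method", "cs-uri-stem", "cs-uri-query", "s-port", "cs-username",
    "c-ip", "cs-version", "cs(User-Agent)", "cs(Cookie)", "cs(Referer)",
    "cs-host", "sc-status", "sc-substatus", "sc-win32-status",
    "sc-bytes", "cs-bytes", "time-taken",
]

def parse_line(line):
    """Split one space-delimited log line into a dict keyed by field name."""
    return dict(zip(FIELDS, line.split()))

def is_slow(entry, threshold_ms=5000):
    """A 200 response that still took longer than threshold_ms is suspicious."""
    return entry["sc-status"] == "200" and int(entry["time-taken"]) > threshold_ms

line = ("2013-08-09 04:33:22 W3SVC839472807 RDXXXXXXXXXXXX IP.x.x.x GET / - 80 - "
        "IP.x.x.x HTTP/1.1 - - - xxx.cloudapp.net 200 0 64 0 79 17224")
entry = parse_line(line)
print(entry["time-taken"])  # 17224 -- about 17 seconds for a GET of "/"
print(is_slow(entry))       # True
```

In W3C format, time-taken is recorded in milliseconds, so 17224 here means roughly 17 seconds for a request that normally completes almost instantly.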
Application & System Event Logs: We looked into the Application and System event logs and they looked clean apart from a few generic entries. But we saw one Information-level log which was a bit interesting. It suggests that the AppPool was idle and was shut down due to no activity, and this was happening, after 20 minutes of inactivity, at each reported incident time.
Log Name: System
Source: Microsoft-Windows-WAS
Event ID: 5186
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: xxxxxxxxxxxxx
Description: A worker process with process id of '12168' serving application pool 'DefaultAppPool' was shutdown due to inactivity. Application Pool timeout configuration was set to 20 minutes. A new worker process will be started when needed.
Individually these logs did not make much sense and could not tell us more about the issue. But when we combined the IIS logs with the Event ID 5186 entries, we found the following pattern:
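The correlation we did by hand can be sketched as follows: for each slow IIS request, find the most recent Event ID 5186 (worker process shutdown) that preceded it. If the slow request arrived shortly after a shutdown, it was most likely the first hit against a cold AppPool. The timestamps below are illustrative placeholders, not the customer's actual data.

```python
from datetime import datetime, timedelta

# Illustrative timestamps only -- substitute real parsed event/log times.
shutdowns = [datetime(2013, 8, 9, 4, 33, 10)]      # WAS Event ID 5186 times
slow_requests = [datetime(2013, 8, 9, 4, 33, 22)]  # slow IIS log entries

def cold_start_hits(shutdowns, requests, window=timedelta(minutes=5)):
    """Return the requests that arrived shortly after an AppPool shutdown."""
    hits = []
    for req in requests:
        prior = [s for s in shutdowns if s <= req]
        if prior and req - max(prior) <= window:
            hits.append(req)
    return hits

print(cold_start_hits(shutdowns, slow_requests))  # every slow request follows a shutdown
```

When every slow request lines up just after a 5186 event, the high response times are explained by the new worker process spinning up on demand.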
So all these symptoms suggest the following:
So it seems that the third-party monitoring service probably has a fairly short timeout: after 20 minutes of inactivity the AppPool was shut down, the next request had to wait for a new worker process to start, and the monitor declared the Cloud Service unreachable even though it was still responding, just slowly.
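One common mitigation, sketched below, is a keep-alive pinger that requests the site more often than the 20-minute idle timeout so the worker process is never shut down for inactivity. The URL and interval here are assumptions for illustration; the cleaner fix is to raise or disable the AppPool's idle timeout in the IIS configuration itself.

```python
import time
import urllib.request

# Assumed values for illustration: the site's public URL and a ping interval
# chosen to stay well under the 20-minute AppPool idle timeout.
SITE_URL = "http://xxx.cloudapp.net/"
INTERVAL_SECONDS = 10 * 60
IDLE_TIMEOUT_SECONDS = 20 * 60

def ping_forever():
    """Periodically request the site so the AppPool never goes idle."""
    while True:
        try:
            with urllib.request.urlopen(SITE_URL, timeout=30) as resp:
                print("keep-alive ping:", resp.status)
        except OSError as exc:
            print("keep-alive ping failed:", exc)
        time.sleep(INTERVAL_SECONDS)
```

The interval must be shorter than the idle timeout, otherwise the worker process can still be recycled between pings and the cold-start delay returns.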