Troubleshooting Scenario 3 – Role Stuck in Busy

Continuing from the troubleshooting series at Windows Azure PaaS Compute Diagnostics Data, this blog post will describe how to troubleshoot a role that is stuck in the Busy state with the error message “Busy (Waiting for role to start... Application startup tasks are running.)”.

 

Symptom

You are deploying a service to Azure and your service never starts correctly and never gets to the Ready state. It is stuck forever in the Busy state.

image

 

Get the Big Picture

Just like in Scenario 1 we want to RDP onto the Azure VM and get a high level overview of what is happening on the VM. We see WaHostBootstrapper running, but we do not see WaWorkerHost.exe and watching task manager for a minute or two we don’t see any other changes. From the notes in Scenario 1 we know that this probably means a problem with a startup task.

image

 

Check the logs

We know from Windows Azure PaaS Compute Diagnostics Data that the WaHostBootstrapper logs are in C:\Resources\WaHostBootstrapper.log. Checking that folder we see only one WaHostBootstrapper.log file, and no instances of WaHostBootstarpper.log.old.<index>. This means that WaHostBootstrapper has only tried to start one time and it isn’t recycling due to an error.

WaHostBootstrapper log:

I am going to skip over all of the stuff that we expect to see in every host bootstrapper log (see Scenario 2 for more info) and go straight to the bottom of the log because I am interested in whatever host bootstrapper is doing right now, not what it did in the past. These are the last two lines in the file:

[00002968:00001680, 2013/09/05, 21:47:15.487, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd="E:\approot\StartupTasks\Startup.cmd"
[00002968:00001680, 2013/09/05, 21:47:15.487, INFO ] Executing "E:\approot\StartupTasks\Startup.cmd" .

We can compare the timestamp in host bootstrapper (2013/09/05, 21:47:15.487) to the current system time (2013/09/05, 22:13:34.581) to see that the host bootstrapper has been trying to run E:\approot\StartupTasks\Startup.cmd for about 30 minutes. Notice that the task type is 0 (“Executing Startup Task type=0”) which is a simple startup task which causes the host bootstrapper to wait until the task finishes before continuing execution (see the taskType entry at https://msdn.microsoft.com/en-us/library/windowsazure/gg557552.aspx). 

Note that we can also get this same information from the event viewer. We can see that the last thing that Azure is doing is trying to run the startup task:

image

 

Check the startup task and test the fix

Now that we know where we are failing lets go check the startup task. Here are the contents of e:\approot\StartupTasks\Startup.cmd:

ECHO Press Enter When Ready
PAUSE
REM Other commands here...

This one is pretty simple and obvious to tell what is wrong, but if this were more complicated logic then we could run the startup task manually, attach a debugger to it, etc. Now it would be nice to test a quick fix to the startup task to make sure it fixes the problem without having to completely redeploy the solution. Using the information at How to Modify a Running Azure Service we know that we can terminate WaHostBootstrapper, replace the Startup.cmd, and then wait.

I modified Startup.cmd to remove the PAUSE line, and terminated WaHostBootstrapper, and waited for a minute, and now I see WaWorkerHost.exe running:

image

And the WaHostBootstrapper.log shows that my startup task finished correctly and the role host process (WaWorkerHost) was started:

[00002844:00002244, 2013/09/06, 14:28:55.386, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd="E:\approot\StartupTasks\Startup.cmd"
[00002844:00002244, 2013/09/06, 14:28:55.386, INFO ] Executing "E:\approot\StartupTasks\Startup.cmd" .
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Program "E:\approot\StartupTasks\Startup.cmd" exited with 0. Working Directory = E:\approot\
[00002844:00002244, 2013/09/06, 14:28:55.401, ERROR] <- GetDebugger=0x1
[00002844:00002244, 2013/09/06, 14:28:55.401, ERROR] <- GetRoleHostDebugger=0x1
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Executing base\x64\WaWorkerHost.exe .
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Role host process PID: 4036.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Registering client with PID 4036.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Client WaWorkerHost.exe (4036) registered.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Client process 4036 is the role host.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Role host process registered.