It’s been a while since we posted on the blog, but a question on the forum today is a good opportunity to break the silence. The question is what to do when your program fails on the cluster.
Recently, Microsoft Research held its Silicon Valley TechFair. The event is an opportunity to open the research lab to a broad Bay Area audience who can see first-hand how Microsoft Research is advancing the state of the art in computer science. One of the demos in the data-intensive computing track featured the DryadLINQ job browser. The cluster job browser is an application designed for analyzing and troubleshooting the performance of applications that run on large clusters inside data centers. Part of this generic tool is specialized for visualizing, monitoring, profiling and debugging DryadLINQ jobs.
Unfortunately, the tool has not been released externally so it won’t help with today’s forum question.
To track errors with the external release of DryadLINQ, a first step is usually to check the log files (STDOUT.TXT/STDERR.TXT) in the working directory associated with the Dryad job manager.
For Dryad on Windows HPC Server, the execution of a DryadLINQ query translates into an HPC job on the cluster. The first task of the HPC job (Task ID = 1) is the Dryad job manager, and the remaining tasks map to Dryad vertices. The task name provides information about the vertex. For each task, the HPC implementation of Dryad creates a working directory, \\NodeName\XC\UserName\JobID\TaskID\WD, which contains the executable files and resources required by the Dryad application. The working directory also contains the standard output (STDOUT.TXT) and standard error (STDERR.TXT) files, which often reveal the root cause of an error.
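Since the working directory follows a fixed pattern, it can be assembled from the node name, user name, job ID and task ID. Here is a minimal Python sketch of that pattern. The function name working_dir and the sample values (NODE07, alice, 1234) are hypothetical and only serve as an illustration; they are not part of Dryad or HPC Server.

```python
def working_dir(node, user, job_id, task_id):
    """Build the UNC path of a Dryad task's working directory.

    Follows the layout described above:
    \\\\NodeName\\XC\\UserName\\JobID\\TaskID\\WD
    """
    parts = [node, "XC", user, str(job_id), str(task_id), "WD"]
    # A UNC path starts with two backslashes, then backslash-separated parts.
    return "\\\\" + "\\".join(parts)


# Task 1 is the Dryad job manager; the values below are made up.
print(working_dir("NODE07", "alice", 1234, 1))
```

The sketch only formats a string, so it does not check that the node, job or task actually exists.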
To find the source of a failure, start by launching the HPC Job Manager. Make a note of your job ID and use the task properties to find out which compute node Task 1 (i.e., the Dryad job manager) ran on. Then go to \\<nodename>\xc\<user-alias>\<job-id>\1\WD and look at the log files. I recommend starting with STDOUT.TXT since it often contains valuable clues. If it does not provide enough information, single out a failed vertex (i.e., an HPC task in the job with an ID strictly greater than one). Its logs are in \\<nodename>\xc\<user-alias>\<job-id>\<task-id>\WD, where the node name is the machine on which the failed task ran.
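Once you have the working directory of the job manager or of a failed vertex, a quick first pass over the two log files can be scripted. The following Python sketch is one possible way to do it, not an official tool: the function name suspicious_lines and the keyword list are assumptions chosen for illustration, and the actual wording of Dryad's log messages may differ.

```python
from pathlib import Path

# Assumption: illustrative keywords only; adjust to what your logs contain.
KEYWORDS = ("error", "exception", "fail")
LOG_NAMES = ("STDOUT.TXT", "STDERR.TXT")


def suspicious_lines(work_dir, keywords=KEYWORDS):
    """Return (file name, line number, text) for log lines that mention a keyword."""
    hits = []
    for name in LOG_NAMES:
        log = Path(work_dir) / name
        if not log.is_file():
            continue  # skip a log file that is not present
        # errors="replace" keeps the scan going if the encoding is unexpected.
        with log.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(word in line.lower() for word in keywords):
                    hits.append((name, number, line.rstrip()))
    return hits
```

On a Windows machine with access to the cluster share, you would pass the working directory as a raw string, for example suspicious_lines(r"\\NODE07\xc\alice\1234\1\WD"), where the node, user and job ID are again made-up values. The match is a plain case-insensitive substring test, so expect some false positives; it is meant to point you at lines worth reading, not to replace reading the logs.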