Michael has done a great job introducing Dryad.  Now, let us take a look at what DryadLINQ is.

DryadLINQ provides a new programming model for large-scale data-parallel computing. Its most distinctive feature is its complete integration with a traditional high-level programming language: the data model of DryadLINQ is just typed .NET objects, and a DryadLINQ program is just a sequential program (written in C#, VB, or F#) composed of LINQ queries. The beauty of this approach is that writing a program for a compute cluster is not much different from writing one for a single computer. Any .NET developer should be able to write a DryadLINQ program with minimal additional learning. A concrete example may help, so here is a complete DryadLINQ program:

using System;
using System.Linq;

// One parsed record from the web server log.
public class LogEntry {
    public string user;
    public string ip;
    public string page;

    public LogEntry(LineRecord line) {
        // Log lines are space-separated; pick out the fields we need.
        string[] fields = line.line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}

// Number of visits a user made to a page.
public class PageCount {
    public string user;
    public string page;
    public int count;

    public PageCount(string user, string page, int count) {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

public class Demo {
    public static void Main(string[] args) {
        string tableUri = "dfs://sherwood/yuanbyu/logs.pt";
        var logs = PartitionedTable.Get<LineRecord>(tableUri);

        var q = TopPages(logs, "michael");
        // Enumerating the query is what triggers its execution.
        foreach (var x in q) { Console.WriteLine("{0}\t{1}", x.page, x.count); }
    }

    // The 10 pages most frequently visited by the given user.
    static IQueryable<PageCount> TopPages(IQueryable<LineRecord> logs, string name) {
        return logs.Where(x => x.line.Contains(name))
                   .Select(x => new LogEntry(x))
                   .GroupBy(x => x.page)
                   .Select(g => new PageCount(name, g.Key, g.Count()))
                   .OrderByDescending(x => x.count)
                   .Take(10);
    }
}

 

The above program computes, from a web server log, the 10 web pages most frequently visited by Michael. The program looks no different from a normal LINQ program: you use .NET classes to define your data types, and familiar language constructs such as methods and loops to write the program. You also get to use the large collection of .NET libraries in your programs. Because LINQ queries are integrated into the language, invoking a LINQ query is just a method call, and within a LINQ operator you can easily invoke any .NET method. Perhaps I am biased, but this is the best data-parallel programming model I have seen so far.
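To make the "no different from a normal LINQ program" point concrete, here is a minimal sketch that runs the same TopPages query on a single machine with plain LINQ to Objects. It does not use DryadLINQ at all: the LineRecord class below is a hypothetical stand-in for DryadLINQ's type of the same name, and the log lines are made up, laid out only so that the page, user, and IP land in fields 5, 8, and 9 as the program above expects.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for DryadLINQ's LineRecord: one line of text.
public class LineRecord {
    public string line;
    public LineRecord(string line) { this.line = line; }
}

public class LogEntry {
    public string user;
    public string ip;
    public string page;

    public LogEntry(LineRecord line) {
        string[] fields = line.line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}

public class PageCount {
    public string user;
    public string page;
    public int count;

    public PageCount(string user, string page, int count) {
        this.user = user;
        this.page = page;
        this.count = count;
    }

    public override string ToString() {
        return user + " " + page + " " + count;
    }
}

public static class LocalDemo {
    // Builds a made-up log line: page in field 5, user in field 8, IP in field 9.
    static LineRecord MakeLine(string user, string ip, string page) {
        return new LineRecord(
            string.Format("- - - - - {0} - - {1} {2}", page, user, ip));
    }

    // Same query as in the program above, unchanged.
    static IQueryable<PageCount> TopPages(IQueryable<LineRecord> logs, string name) {
        return logs.Where(x => x.line.Contains(name))
                   .Select(x => new LogEntry(x))
                   .GroupBy(x => x.page)
                   .Select(g => new PageCount(name, g.Key, g.Count()))
                   .OrderByDescending(x => x.count)
                   .Take(10);
    }

    public static void Main() {
        // An in-memory array plays the role of the partitioned table.
        IQueryable<LineRecord> logs = new[] {
            MakeLine("michael", "10.0.0.1", "/index.html"),
            MakeLine("michael", "10.0.0.1", "/papers.html"),
            MakeLine("michael", "10.0.0.2", "/index.html"),
            MakeLine("yuan", "10.0.0.3", "/index.html"),
        }.AsQueryable();

        foreach (var pc in TopPages(logs, "michael")) {
            Console.WriteLine(pc);
        }
    }
}
```

With this sample data the loop lists /index.html (2 visits) ahead of /papers.html (1 visit). The only thing that changes between this sketch and the cluster version is where the IQueryable comes from: an in-memory array here, a partitioned table there.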

So, how do we get the LINQ queries in a program like the one above to run, as if by magic, on a large compute cluster of thousands of machines? This is where Dryad and DryadLINQ come in. Dryad is our distributed execution engine, which provides efficient and reliable execution of programs on large compute clusters. DryadLINQ automatically and transparently translates the LINQ queries in the program into a distributed execution plan and hands the plan and the program to Dryad for execution. In some sense, DryadLINQ is the parallel compiler and runtime, and Dryad is the ‘operating system’ of a compute cluster. The basic DAG computation model that Dryad uses looks to me like the ‘instruction set’ of a compute cluster.
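What makes this kind of translation possible in the first place is a standard .NET mechanism: a query over an IQueryable is not executed when it is written, but recorded as an expression tree that a LINQ provider can inspect. The sketch below uses only the standard library to print the operators recorded in such a tree. It says nothing about how DryadLINQ actually builds its execution plan, which is far more involved; the sample strings are made up.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

public static class ExpressionDemo {
    public static void Main() {
        // Made-up data; AsQueryable() gives us an IQueryable over an array.
        IQueryable<string> lines =
            new[] { "michael /index.html", "yuan /papers.html" }.AsQueryable();

        // Building the query does not run it. It only records the operators.
        IQueryable<string> q = lines.Where(l => l.Contains("michael"))
                                    .OrderBy(l => l.Length)
                                    .Take(10);

        // Walk the recorded chain of operator calls, outermost first.
        // For Queryable operators, the first argument is the source query.
        Expression e = q.Expression;
        while (e is MethodCallExpression call) {
            Console.WriteLine(call.Method.Name);  // Take, then OrderBy, then Where
            e = call.Arguments[0];
        }
    }
}
```

A provider that sees the query in this form is free to decide how and where each operator runs, which is the hook that a system like DryadLINQ relies on.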

As mentioned by Michael in his recent post, Dryad has been doing the heavy lifting for all of Bing’s data mining. While DryadLINQ is still a research system, it has been used within Microsoft by many people for many applications. (We will probably write a number of blog posts about these applications in the near future.) The system has been reasonably stable. We are actively improving both Dryad and DryadLINQ based on our users’ feedback.

In this post, we have barely touched on the technical details of the Dryad and DryadLINQ systems; future posts will cover them in more depth. If you are eager to learn more in the meantime, take a look at our project web page, which contains many papers and presentations about the systems. Dryad and DryadLINQ are also available free for non-commercial use from this site. So, please try them out!