In January 2006, I published a first draft of “Google's MapReduce Programming Model -- Revisited”, and I blogged about this effort last summer: “Ana, cata, para, hylo, … – get it right!”. I have finally revised the paper; in fact, it is nearly a full rewrite, but I am sure you can “google” for the old version somehow, if you would like to see the first generation. I have been amazed by the public interest in this “reverse engineering” effort, and I hope the revision will again be useful in the context of “Google Code for Educators” and the like.
The highlights of the revision are:
1. The aggregators of Google’s Sawzall DSL are covered.
2. More details on parallelization are included.
3. Correctness of distribution is informally discussed.
4. The style should now be OK for non-Haskell geeks.
Two points of the revision are worth calling out:
1. Unless someone convinces me otherwise, I cannot see how Sawzall is implemented on top of MapReduce. (I am talking about the implementations discussed in the papers, and I am referring to quotes in the Sawzall paper that may suggest such layering.)
2. Sawzall is entirely based on list homomorphisms. This is not mentioned anywhere in the Sawzall paper and, indeed, it took some real work to substantiate the claim. This is good news because the parallel-programming community understands list homomorphisms pretty well.
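To make the notion concrete: a list homomorphism is a function h satisfying h (xs ++ ys) = h xs ⊕ h ys for some associative operator ⊕, and it is exactly this property that licenses splitting the input, reducing the parts independently, and combining the partial results. Here is a minimal sketch in Python (the chunking and all names are mine, not from the paper) that checks the property for a Sawzall-style “sum” aggregator:

```python
from functools import reduce

def hom(f, op, unit, xs):
    """List homomorphism: map f over xs, then fold with the associative
    operator op. When op is associative with unit 'unit', we have
    hom(f, op, unit, xs + ys) == op(hom(f, op, unit, xs),
                                    hom(f, op, unit, ys))."""
    return reduce(op, (f(x) for x in xs), unit)

plus = lambda a, b: a + b
xs = list(range(10))
chunks = [xs[:3], xs[3:7], xs[7:]]  # an arbitrary split of the input

# Reduce the whole list sequentially ...
sequential = hom(lambda x: x, plus, 0, xs)
# ... or reduce each chunk separately (in parallel, conceptually)
# and combine the partial results.
partials = [hom(lambda x: x, plus, 0, c) for c in chunks]
combined = reduce(plus, partials, 0)

assert sequential == combined == 45
```

The same scheme covers the other Sawzall aggregators as long as the combining operator is associative (with counters and maxima as further obvious instances).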
As a bonus feature of this post, let me also include some C# 3.0 code that describes a non-parallel abstraction for MapReduce computations. (This code is part of the source-code distribution that comes with the paper.) It should not be too surprising that LINQ style comes nicely into play in this context. What I found very appealing (and I had not thought of it before doing the exercise) is that I did not need any sort of dictionary data type; cf. the type Data.Map.Map in the Haskell code. Instead, I could use a nested, generic, interface-based type: IEnumerable applied to IGrouping, in turn applied to the intermediate key and value domains.
public abstract class MapReduce<k1, k2, v1, v2, v3>
{
    // The problem-specific map and reduce functions, to be
    // supplied by subclasses.
    public abstract IEnumerable<KeyValuePair<k2, v2>> mAP(k1 key, v1 value);
    public abstract Maybe<v3> rEDUCE(k2 key, IEnumerable<v2> values);

    public IEnumerable<KeyValuePair<k2, v3>> Compute(
        IEnumerable<KeyValuePair<k1, v1>> input)
    {
        // Compose the three phases of a MapReduce computation.
        return ReducePerKey(GroupByKey(MapPerKey(input)));
    }

    private IEnumerable<KeyValuePair<k2, v2>> MapPerKey(
        IEnumerable<KeyValuePair<k1, v1>> input)
    {
        return
            from x in input
            from y in mAP(x.Key, x.Value)
            select y;
    }

    private IEnumerable<IGrouping<k2, v2>> GroupByKey(
        IEnumerable<KeyValuePair<k2, v2>> intermediate)
    {
        return
            from x in intermediate
            group x.Value by x.Key into g
            select g;
    }

    private IEnumerable<KeyValuePair<k2, v3>> ReducePerKey(
        IEnumerable<IGrouping<k2, v2>> grouped)
    {
        // Assumes rEDUCE yields a value for every group; groups that
        // reduce to "nothing" would need to be filtered out.
        return
            from x in grouped
            let k2 = x.Key
            let v3 = rEDUCE(k2, x)
            select new KeyValuePair<k2, v3>(k2, v3.Value);
    }
}
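For readers without a C# compiler at hand, the same three-phase structure (map per key, group by key, reduce per key) can be sketched in Python; the word-count example and all helper names below are mine, not from the paper's distribution:

```python
from itertools import groupby

def map_reduce(map_fn, reduce_fn, inputs):
    # Phase 1: map each input key/value pair to intermediate pairs.
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    # Phase 2: group intermediate values by key (sort first, since
    # itertools.groupby only merges adjacent equal keys).
    intermediate.sort(key=lambda kv: kv[0])
    grouped = [(k, [v for _, v in g])
               for k, g in groupby(intermediate, key=lambda kv: kv[0])]
    # Phase 3: reduce the value list of each key; None plays the role
    # of the Maybe type's "nothing" and such groups are dropped.
    return [(k, r) for k, vs in grouped
                   if (r := reduce_fn(k, vs)) is not None]

# Word count: documents in, word frequencies out.
docs = [("doc1", "a rose is a rose"), ("doc2", "a daisy")]
result = map_reduce(
    lambda _, text: [(w, 1) for w in text.split()],
    lambda _, ones: sum(ones),
    docs)

assert dict(result) == {"a": 3, "rose": 2, "is": 1, "daisy": 1}
```

Note that, as in the C# version, no dictionary type appears in the abstraction itself: grouping yields a sequence of (key, values) pairs, the Python analogue of IEnumerable of IGrouping.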
BTW, I get two standard questions when I am talking about this stuff:
Q: Ralf, are you applying for a job at Google?
A: No. :-)
Q: Ralf, are you subliminally bashing the MapReduce/Sawzall papers?
A: These papers and the ideas they report on are awesome, so I am not bashing them at all. Instead, I am just trying to hijack the popularity of the subject to show off my favorite programming language and perhaps get more folks aboard the Haskell train. Or, as my (German) wife always says: “Was ich selber denk und tu, trau ich jedem andren zu” (roughly: “Whatever I think and do myself, I assume of everyone else, too”).