Sign In
Scalability Notes
[Read -> Think -> Write]
Translate This Page
Translate this page
Powered by
Microsoft® Translator
Options
Blog Home
Email Blog Author
Share this
RSS for posts
RSS for comments
Search
Advanced search options...
Search In:
Everything
Blogs
Forums
People
Groups
Places
Pages
Date range:
All Time
Last Year
Last 6 Months
Last 3 Months
Last Month
Last Week
Last Two Days
Tags
database
distributed system
engineering
hpc
network
parallel
scalability
search
Archive
Archives
December 2010
(1)
September 2010
(1)
August 2010
(1)
April 2010
(1)
February 2010
(2)
January 2010
(4)
December 2009
(1)
November 2009
(1)
October 2009
(1)
September 2009
(1)
August 2009
(4)
June 2009
(2)
May 2009
(1)
April 2009
(1)
March 2009
(2)
February 2009
(4)
January 2009
(1)
Map/Reduce - in Functional Programming & Parallel Processing Perspectives
MSDN Blogs
>
Scalability Notes
>
Map/Reduce - in Functional Programming & Parallel Processing Perspectives
Map/Reduce - in Functional Programming & Parallel Processing Perspectives
changl
10 Nov 2009 1:48 AM
Comments
0
Map/Reduce - in Functional Programming & Parallel Processing Perspectives
Map/Reduce
is a very popular term pair in today's technical community, mainly due to the popularity of its "inventor" - Google.
But in fact, the terms and concepts of map & reduce exist in programming language community long before G company's successful paper "
MapReduce: Simplified Data Processing on Large Clusters
", which appeared in OSDI04.
In this article, I want to summarize what this term pair means in functional programming literature and parallel processing literature respectively.
I - Map/Reduce in Functional Programming Perspective
Functional Programming
has long history in academia, but not been massively accepted in developer communities yet. It has some beautiful features, compared with our daily use imperative language.
Higher Order Function
is one of them. Basically, it means that function can be used as input parameters or return value for a function definition.
Among various higher order functions,
map
,
fold
and
filter
are the most popular ones:
-
Map
is a higher order function that applies a given function(a.k.a transformer) element-wise to a list of elements and returns a list of results.
Transformer
is a function applies to each element and will produce one or more new elements.
for example:
map (toLower) "abcDEFG12!@#"
will produces output:
"abcdefg12!@#"
-
Fold
(a.k.a.
Reduce
,
Accumulate
) is a higher order function that processes (using a combiner function) a list of elements in some order and build up a return value.
Combiner
is a function that is applied to two elements and produces a result that can be combined using combiner with the remaining elements in the list.
for example:
fold (+) 0 [1..5]
will produces output:
15
, which is the sum of all the elements.
-
Filter
is a higher-order function that processes a list of elements in some order to produce a result containing exactly those original elements for which a given predicate returns the Boolean value true.
Predicate
is a function that takes one element as input parameter and return either true or false.
for example:
filter (isAlpha) "$#!+abcDEF657"
will produces output:
"abcDEF"
Essentially, these three higher order functions apply an operation on some list/array and produce some results: map transform each element, filter filtering some elements and reduce combine all the elements.
Pure functional language, such as haskell/lisp, and some mixed language, such as python, have build-in functions named exactly as Map/Reduce. C# 3.0 introduces some functional features in LINQ subsystem, where Map is called
Select
and Reduce is called
Aggregate
.
More concrete examples can be found in [2].
II - Map/Reduce in Parallel Processing Perspective
Map/Reduce is a
Programming Model
& also an
Implementation Runtime
. The
programming model
is what you can use to express your computation tasks while
implementation runtime
is those software components that realize what the model claims.
This model is called map/reduce, but their meanings are somewhat different:
- the
map
semantic is the same as in functional programming language: the
transformer
(the
mapper
in Google's paper) is applied to each element of the list
- the
reduce
semantic differs. Here, the
combiner
(the
reducer
in Google's paper) is applied to multiple sub sections of the elements in the list and thus produces multiple reduce results, but in functional programming language it is applied to all the elements and only produces one result.
Conceptually, how the elements are divided into multiple sub sections?
To resolve this problem, this model introduces some structure on the elements that are produced by
mapper
and consumed by
reducer
- each element/record has two parts:
key
and
value
. Then all the elements are divided according to the key. The records with the same key form a sub section and are passed to a
reducer
function as a whole.
From implementation's perspective, the most important advantage of this
Programming Model
is that - it enables
automatic
parallelization
and
distribution
of large scale data processing:
- mapper is applied to each record, it's a data parallel problem by itself, we just need to distribute input data in record boundary among processing nodes.
- reducer is applied to some sub section, we just need to distribute those sub sections among process nodes.
Another implementation problem is
fail over
- what to do when failure happened?
Simple! It just
re-execute
the failed specific mapper/reducer, other mappers/reducers won't be bothered at all. Because there is no communication among mappers and reducers respectively, this solution is semantically correct for mapper/reducer.
Since the input of mapper is persisted in reliable storage system, failed mapper only need to re-execute that mapper. But the input of reducer (also the output of mapper) is persisted in worker's local storage system, re-executed reducer may found some input unavailable (for example intermediate data node crashed). In this situation, failed reducer need to re-execute both some mappers and that reducer.
[
Reference
]
1.
Functional Programming
2.
higher order functions - map, fold and filter
3.
Map/Reduce/Filter in Python
4.
Map/Reduce in PHP
5.
Google's MapReduce Programming Model — Revisited
6.
MapReduce: Simplified Data Processing on Large Clusters
7.
Map-Reduce-Merge: Simplified Relational Data Processingon Large Clusters
0 Comments
parallel
Blog - Comment List MSDN TechNet
Comments
Loading...
Leave a Comment
Name
Comment
Please add 5 and 3 and type the answer here:
Post