I made a post the other day pointing people at Project Daytona, Microsoft Research’s iterative MapReduce. I thought I’d make a quick post about what MapReduce actually is in the first place.

I guess the big name that everybody automatically associates with MapReduce is Google. They made it into big news. You can think of Google’s MapReduce as the MapReduce, but the term really describes a general approach to solving a specific class of problems that involve huge amounts of data. The roots of this can be traced back to LISP in the 70s. The two functions, “Map” and “Reduce”, give the process its name, though you’ll find the purists saying Google got it wrong and the names should really be “Map” and “Accumulate”. Doesn’t make for such a catchy name though, does it?

There are more implementations of MapReduce than just Google’s, in the same way there are more implementations of relational databases than just Oracle’s. I think because Google granted a license for its implementation to Hadoop, there are people around who believe Google has the IP on not only its implementation, but on the whole idea. That’s not the case. There are many independent MapReduce projects going on all over the world, some of them from commercial vendors, some of them from the open source movement. It’s just like choosing SQL Server or MySQL as a relational database: one from Microsoft, one from the open source movement.

Every idea has its day. The idea of parallel processing and getting hundreds or even thousands of computers to work on large data analytics problems certainly existed back in the 70s when the initial theories were developed. The practicalities of acquiring hundreds of computers to actually do something – hmmm, that was a different game entirely. But right now, anybody in the world could spend considerably less than $100 and acquire hundreds of computers for a couple of hours to do parallel processing on a large amount of data. The Cloud has brought this change about.

Although a lot of people want to analyse large chunks of data, not all that many of them want to write the code to do it. It was really the collision of two notions, the need for parallel processing and the low acquisition cost of the computing power to do it, that let Google catch the world’s imagination. It created a library that means, as a developer, you don’t have to understand the internals of parallel processing. All the various MapReduce projects going on around the world are about creating libraries, protocols, code etc. that make it easy to submit a large amount of data, tell the computers how you want to analyse it and then get the results back fairly quickly.

One thing to be aware of is that not all large volumes of data, or indeed the analytics you wish to perform on them, are suitable MapReduce candidates. It’s only those situations where you can break the problem down into smaller chunks that can be worked on in parallel that fit well with MapReduce. You may come across commentary on the Internet from relational database fundamentalists who illustrate the scenarios where MapReduce fails and relational databases succeed, even with absolutely huge volumes of data. Horses for courses: I wouldn’t use a Formula 1 car to move 60 people from one place to another. I wouldn’t use a bus to set a lap-record on a racing circuit (even though that might actually be pretty good fun…).


A contrived example might be that you want to find the most common word in “The Lord of the Rings”. If you fed this data into a traditional function it would have to process each word sequentially, keep a count and then determine the biggest count at the end. Or you could develop a filesystem that could distribute the data across several computers, let’s say several hundred computers, chop the data into chunks and feed the chunks into each computer one line at a time. Each computer could very quickly count, in parallel with all the others, how often each word appeared in the chunk it processed (the Map step). It could then forward those counts to a function that would add together the results of the many hundreds of calculations that were repeated over and over on small chunks of the data, and pick the word with the biggest total (the Reduce step). Note that each computer has to pass on its counts, not just its own favourite word: the most popular word in one chunk is not necessarily the most popular word in the whole book. I think you can see, the answer would come back more quickly. You can probably begin to see now why MapReduce is something that Google would be interested in, with its primary business being Internet Search.
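To make the word-count example a bit more concrete, here is a minimal sketch in Python. It is not how Google, Hadoop or Daytona do it. It assumes the whole text fits in memory as a list of lines, and it uses a pool of threads on one machine as a stand-in for the hundreds of computers. The function names (split_into_chunks, map_chunk, reduce_counts, most_common_word) are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """Deal the lines out into n_chunks roughly equal lists."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial tallies by adding the counts per word."""
    return left + right


def most_common_word(lines, n_chunks=4):
    """Return (word, count) for the most frequent word, or None if no words."""
    chunks = split_into_chunks(lines, n_chunks)
    # Threads stand in for separate machines here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    totals = reduce(reduce_counts, partial_counts, Counter())
    top = totals.most_common(1)
    return top[0] if top else None
```

Calling most_common_word(["the ring the", "the hobbit", "one ring"], n_chunks=2) counts each chunk separately and then adds the tallies together. The point of the shape is that map_chunk never needs to know about any other chunk, which is what makes the work easy to spread out.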

But let’s say you have 200 computers acting on your data in parallel and one of them has a hardware failure. Ooops. Now you have to handle the failure, reassign the work and back out anything that might affect the result. You can imagine how difficult that problem is to solve. That’s the sort of thing MapReduce projects do, so that you don’t have to.
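As a rough illustration of the reassignment idea, here is a single-process Python simulation. It is only a sketch under some loud assumptions: the workers are just names in a list, a failure is simulated by raising a made-up WorkerFailed exception, and run_with_reassignment and flaky_sum are hypothetical helpers, not part of any real MapReduce library.

```python
class WorkerFailed(Exception):
    """Raised when a (simulated) worker dies while processing a chunk."""


def run_with_reassignment(chunks, task, workers, max_attempts=3):
    """Run task(worker, chunk) for every chunk, reassigning failed chunks.

    A result is recorded only when a task finishes, so a half-done
    attempt on a failed worker never affects the final answer.
    """
    results = {}
    pending = [(index, 0) for index in range(len(chunks))]
    while pending:
        index, attempts = pending.pop(0)
        # Pick a different worker on each retry of the same chunk.
        worker = workers[(index + attempts) % len(workers)]
        try:
            results[index] = task(worker, chunks[index])
        except WorkerFailed:
            if attempts + 1 >= max_attempts:
                raise
            pending.append((index, attempts + 1))
    return [results[i] for i in range(len(chunks))]


def flaky_sum(worker, chunk):
    """Demo task: the worker named 'bad' always fails, the others add up."""
    if worker == "bad":
        raise WorkerFailed(worker)
    return sum(chunk)
```

With workers ["bad", "good"] and chunks [[1, 2], [3, 4], [5]], the chunks first handed to "bad" fail, go back on the queue and are finished by "good". A real system also has to notice the failure in the first place, cope with slow machines and cope with lost data, which is where the hard work is.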

So it’s really the advent of Cloud Computing that has given MapReduce broad appeal. Distributed parallel processing used to be the preserve of the mega-corporation with huge datacentres and millions of dollars to spend on the problem. That was actually still largely the case in 2004 when Google produced the paper, but that is no longer the situation.

If you want to understand the technology itself, a good place to start is the description at Wikipedia. There is also, of course, the Google Research paper. I think a good way to get to grips with it, if you have 50 minutes free, is to watch this UC Berkeley lecture video. Once you’ve got that lot in your head, downloading the code and samples from Microsoft’s new iterative MapReduce RTP (Research Technology Preview) and playing with them will be less arduous.

Planky – GBR-257