Introduction to MapReduce
A few years back, thinking that you could have a cluster in your garage would have been crazy. Programming your own implementation of a reliable and powerful distributed system is feasible, but be ready to spend some months on it. Luckily, big companies and their need to handle increasing quantities of data led us to accessible solutions for cloud computing. The last groundbreaking solution in date, effective on clusters of cheap computers and developed by Google, is MapReduce. This article is yet another post on MapReduce, except that it is aimed at tech-savvy and non tech-savvy people, as it covers in details the different steps of a MapReduce iteration. It also explains how MapReduce is related to functional programming, why it enables parallel computing, and finally how the work is being distributed between workers during an iteration.
What is MapReduce doing?
MapReduce is a cloud computing framework developed by Google. It allows to compute operations on a cluster of computers, and therefore speed up computation time. It has been made possible by another technology developed at Google, the Google File System. As the GFS enables the storage of large quantities of data on a network of computers, it is possible to share files easily between different tasks on different computers. The idea behind MapReduce, as for any distributed system, is to split up the task to do into smaller subtasks, and have several machines compute these subtasks in parallel. MapReduce relies heavily on the GFS to partition the input data and share information between workers. For more details on MapReduce, like for instance how MapReduce uses GFS or how workers are communicating, please have look at the reference section at the end of this article.
Map and reduce operators in functional programming
MapReduce is a programing paradigm, that allows to split up a task into smaller subtasks that can be executed in parallel and therefore run faster compared to a single computer execution. Programming using the MapReduce paradigm requires a bit of training, but once you get that pattern on your mind, things are quite straightforward. The idea of the map and reducer operators is not new, and was borrowed from functional programming.
A map operator, or mapper, takes as input a set of items and an operation, and applies that operation on all the pairs, one by one. Suppose that the input is a set of numbers from one to six, and that the operation is the multiplication by the number two:
the input is:
<1, 2, 3, 4, 5, 6>,
the ouput is:
<2, 4, 6, 8, 10, 12>.
A reduce operator, or reducer, takes as input a set of items and an operation, and applies that operation on all these items, reducing them to only one item. Suppose that the input is a set is of numbers from one to six, and the operation is the sum:
the input is:
<1, 2, 3, 4, 5, 6>,
the ouput is:
Associativity is key to parallelism
There is something interesting to notice about the map and reduce operators: they are associative. Associativity is a notion of mathematics which applies to binary operators (here binary does not relates to the binary base, but to the fact that some operators take only two operands). For instance, the + operator is a binary operators, as it takes two operands at a time. It is also associative, because the order in which the operations are performed does not matter:
(1 + 2) + 5 is the same as 1 + (2 + 5).
So suppose we have two computers. Using the associativity property, we can split up the above summation into two smaller subtasks:
computer 1: task #1: 1 + 2 = 3,
computer 2: task #2: 5,
and then merge the intermediate results:
computer 1: task #3: 3 + 5 = 8.
Because the subtasks #1 and #2 are independent due to the associativity of the + operator, they can be computed on separate computers and their intermediate results can be merged later during another subtask, here subtask #3. If there is something you remember about parallel computation today, it has to be this:
Associativity is key to parallelism.
(Note: I shamelessly stole these words from my Algorithm Analysis professor at the University of Oklahoma).
Associativity is also the core idea behind MapReduce: map and reduce operations can be performed in any order on the input, as they are associative. This means that the workload can be divided among several computers. Thus, if we can express a computation only with of map and reduce operations, then we can split up the work in pieces and compute these pieces on multiple computers in parallel. This is how Google got a highly reliable distributed computing platform out of functional programming.
Map and reduce operators in MapReduce
MapReduce is using the map and reduce operators a bit differently than in functional programming. Indeed, each item is not just a value, but a (key, value) pair. The key is important, because it is used to group the data. Moreover, MapReduce tasks are always formed by one map operation followed by one reduce operations. Therefore, to compute something with MapReduce, you need to specify your input data, your map operation, and your reduce operation.
A map operator takes an input pair, and produces as output a set of (key, value) pairs. These pairs are called the intermediate pairs. The intermediate pairs are treated, and all pairs with the same keys are grouped together, whether or not they come from the same mapper. This creates a set where pairs are no longer simple (key, value) pair, but (key, <v1, v2, …>), as several pairs can have the same key.
A reduce operator takes as input a key and the set of values to which it is associated. The authors of the MapReduce paper say that “this allows [them] to handle lists of values that are too large to ft in memory“.
Now let’s take an example to see how this apply in the real world. Suppose that our input is now a string, and that we want to count the number of consonants and vowels. Please consider that this example only intents to clarify the different steps of a MapReduce iteration, and not to offer the best efficiency. Figure 1 below shows the different steps for the string “CODECAPSULE”. First of all, each letter is considered as a value, and receives a unique number id for key: this forms the input data. These data are then partitioned and directed to three distinct mappers. These mappers treat each letter and determine whether it is a consonant or a vowel. Intermediate (key, value) pairs are outputted, still with the letter as value, but this time with either consonant or vowel for key. Then, a very important step is taking place: all pairs of similar keys are merged together. This means that all pairs with consonant keys are merged together, and all pairs with vowel keys are merged together. This guaranties that all pairs with similar keys will be treated by a single reducer. A simple way to implementing this is to use a hash function on the keys. The reducers then work on the new (key, values) pairs, and count the number of consonants and vowels. The output pairs of the reduce step, which are the number of letters in each of the two families, are merge to form the output data.
Hadoop, an open-source implementation of MapReduce
Now that we have seen how cool MapReduce is, I have to tell you about a little drawback. MapReduce is a patented piece of software, and is not available for you to use. But don’t be sad, because the Apache Foundation, helped by Yahoo Inc., are working on an open source implementation of the MapReduce paradigm using Java, fast and reliable. This implementation, called Hadoop, is available for free. So you can perform distributed computations, given that you have several computers that you can use to install Hadoop on.
References to go further
- A great start on how to install the latest version of Hadoop is Michael Noll’s excellent tutorial, available here.
- Another good start is a tutorial by Travis Hegner, in which he explains how he installed Hadoop on six computers under Ubuntu, available here.
- Google released a publication regarding their conception of MapReduce, available here.