distributed computing

Counting with MapReduce seems straightforward. All that is needed is to map the pairs to the same intermediate key and let the reducer take care of counting all the items. But wait, what if we have millions of items? Then one reducer, that is to say one process on one computer, will be forced to handle millions of pairs at once. Not only is this going to be very slow, which defeats the purpose of having a cluster, but there is a more important problem: what if the data is too big to fit in memory? Here I show how to count elements using MapReduce in a way that really splits the task between multiple workers.

The one-iteration solution

Let us have a look at the solution discussed above. This solution counts the items in a dataset in only one MapReduce iteration. Note that the values are replaced with the value 1. Indeed, since counting does not require keeping track of the values, they are all changed to a common simple value to simplify computations. This solution seems pretty sweet, except that, as we can see in Figure 1, mapping all the pairs to the same intermediate key gives one reducer, and one reducer only, the entire workload of counting the items. This can be efficient if the dataset is small. But there are cases in which the dataset is so big that it does not even fit into the memory of a single computer, or so big that the computation on a single reducer is going to be very slow when we need to know the count as soon as possible. As we will see in the next section, there is a way to improve both workload balance and computation time, at the cost of an additional iteration.

Figure 1: MapReduce - Counting with one iteration
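
The one-iteration solution can be sketched as a small in-memory simulation in plain Python. This is only a sketch under stated assumptions, not a real MapReduce framework: the names `map_fn`, `reduce_fn`, `run_one_iteration` and the intermediate key `"count"` are made up for illustration, and the shuffle step is reduced to a dictionary of lists.

```python
from collections import defaultdict

COUNT_KEY = "count"  # hypothetical name for the single intermediate key


def map_fn(key, value):
    """Map every input pair to the same intermediate key, replacing the value with 1."""
    yield COUNT_KEY, 1


def reduce_fn(key, values):
    """Count the items by summing the 1s received for the key."""
    return key, sum(values)


def run_one_iteration(pairs):
    """Simulate one MapReduce iteration in memory: map, shuffle, reduce."""
    # Shuffle: group the intermediate values by intermediate key.
    groups = defaultdict(list)
    for key, value in pairs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Each distinct intermediate key corresponds to one reduce call.
    return [reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()]


pairs = [("a", "apple"), ("b", "banana"), ("c", "cherry"), ("d", "date")]
result = run_one_iteration(pairs)
print(result)  # [('count', 4)]
```

Because there is only one intermediate key, `groups` ends up holding a single list with one entry per input pair. In a real cluster that list is what a single reducer would have to receive and process, which is exactly the workload problem described above.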

A few years back, thinking that you could have a cluster in your garage would have been crazy. Programming your own implementation of a reliable and powerful distributed system is feasible, but be ready to spend some months on it. Luckily, big companies and their need to handle ever-increasing quantities of data have given us accessible solutions for cloud computing. The most recent groundbreaking solution, developed by Google and effective on clusters of cheap computers, is MapReduce. This article is yet another post on MapReduce, except that it is aimed at both tech-savvy and non-tech-savvy readers, as it covers in detail the different steps of a MapReduce iteration. It also explains how MapReduce is related to functional programming, why it enables parallel computing, and finally how the work is distributed between workers during an iteration.

Efficient counting with MapReduce

Introduction to MapReduce