MapReduce is a style of computing that has been implemented in several systems, including Google's internal implementation (simply called MapReduce) and the popular open-source implementation Hadoop, which can be obtained, along with the HDFS file system, from the Apache Foundation. You can use an implementation of MapReduce to manage many large-scale computations in a way that is tolerant of hardware faults. All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, coordinates the tasks that execute Map or Reduce, and deals with the possibility that one of these tasks will fail to execute.
In brief, a MapReduce computation executes as follows:
Some number of Map tasks are each given one or more chunks from a distributed file system. These Map tasks turn their chunks into sequences of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.
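The three steps above can be sketched as a minimal single-machine program. The classic illustration is word count; the function names (`map_fn`, `shuffle`, `reduce_fn`) are illustrative, not part of any MapReduce API:

```python
from collections import defaultdict

# Map: turn each chunk (here, a line of text) into key-value pairs.
def map_fn(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle: group all values by key, as the master controller would.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine all the values associated with one key.
def reduce_fn(key, values):
    return key, sum(values)

chunks = ["the cat sat", "the dog sat"]
pairs = [pair for chunk in chunks for pair in map_fn(chunk)]
result = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
# result: {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

In a real cluster the map calls and the reduce calls would run as separate tasks on different machines, but the data flow is exactly this.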
Flow of Map Reduce Algorithm
The following diagram summarizes the flow of the MapReduce algorithm:
In the above MapReduce flow:
- The input data is divided into some number of chunks, depending on the amount of data and the processing capacity of each individual unit.
- Each chunk is then passed to a mapper function. Note that all the chunks are processed simultaneously, which is what gives the algorithm its parallelism.
- After that, a shuffle step groups together all the intermediate results that share the same key.
- Finally, the reducers combine the grouped values to produce a consolidated output according to the reduce logic.
- This design scales well: as the input data grows, we can keep increasing the number of parallel processing units.
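The "all chunks processed simultaneously" point can be sketched with Python's standard `concurrent.futures`; the chunk data and the `map_chunk` name are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    # Each call is independent of the others, so every
    # chunk can be mapped at the same time.
    return len(chunk.split())

chunks = ["we love football", "records broken today", "merry christmas"]

# One worker per chunk: the pool runs the mapper on all chunks in parallel.
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    counts = list(pool.map(map_chunk, chunks))
# counts: [3, 3, 2]
```

Because the mappers share no state, adding more data simply means adding more chunks and more workers.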
While processing large sets of data, we must address scalability and efficiency in the application code that does the processing.
The MapReduce algorithm (or flow) is highly effective at handling big data.
Let us take a simple example and use MapReduce to solve a problem. Say you are processing a large amount of data and trying to find out what percentage of your user base was talking about games.
First, we will identify the keywords that we are going to map from the data to conclude that a record is related to games.
Next, we will write a mapping function to identify such patterns in our data. For example, the keywords can be "Gold medals", "Bronze medals", "Silver medals", "Olympic football", "basketball", "cricket", etc.
Let us take the following chunks from a big data set and see how to process them.
“Hi, how are you”
“We love football”
“He is an awesome football player”
“Merry Christmas”
“Olympics will be held in China”
“Records broken today in Olympics”
“Yes, we won 2 Gold medals”
“He qualified for Olympics”
Mapping Phase
So the map phase of our algorithm will be as follows:
Declare a function "Map"
Loop: For each word equal to "football"
Increment counter
Return the key-value pair "football" => counter
In the same way, we can define any number of mapping functions for other keywords:
"Olympics", "Gold medals", "cricket", etc.
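The mapping pseudocode above can be rendered as a short Python sketch. The `make_mapper` helper is my own illustration for defining one mapper per keyword; it is not part of any MapReduce library:

```python
def make_mapper(keyword):
    # Build a Map function that counts occurrences of one keyword in a chunk.
    def mapper(chunk):
        counter = 0
        for word in chunk.lower().split():
            if word == keyword.lower():
                counter += 1
        return keyword, counter
    return mapper

map_football = make_mapper("football")
# map_football("He is an awesome football player") returns ("football", 1)
```

Calling `make_mapper` with "Olympics", "cricket", and so on gives the other mapping functions without repeating the loop.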
Reducing Phase
The reduce function will accept the input from all these mappers in the form of key-value pairs and then process it. So, the input to the reduce function will look like the following:
- reduce(“football”=>2)
- reduce(“Olympics”=>3)
Our algorithm will continue with the following steps:
i. Declare a function "Reduce" that accepts the key-value pairs from the map functions.
ii. For each key-value pair, add the value to a counter.
iii. Return "games" => counter.
At the end, we will get the output "games" => 5 (2 from "football" plus 3 from "Olympics").
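Putting the phases together on the eight chunks above gives the full pipeline. This is a single-process sketch, and only the "football" and "Olympics" mappers are used, matching the reduce inputs shown earlier:

```python
chunks = [
    "Hi, how are you",
    "We love football",
    "He is an awesome football player",
    "Merry Christmas",
    "Olympics will be held in China",
    "Records broken today in Olympics",
    "Yes, we won 2 Gold medals",
    "He qualified for Olympics",
]

def map_keyword(keyword, chunk):
    # Map phase: count occurrences of one keyword in one chunk.
    return keyword, sum(1 for word in chunk.split() if word == keyword)

def reduce_games(pairs):
    # Reduce phase: fold all keyword counts into a single "games" total.
    counter = 0
    for _keyword, count in pairs:
        counter += count
    return "games", counter

pairs = [map_keyword(kw, chunk)
         for kw in ("football", "Olympics")
         for chunk in chunks]
result = reduce_games(pairs)
# result: ("games", 5)
```

"football" contributes 2 and "Olympics" contributes 3, so the reducer emits `("games", 5)`, the consolidated output described above.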