The figure offers an outline of how processes, tasks, and files interact. Taking advantage of a library provided by a MapReduce system such as Hadoop, the user program forks a Master Controller process and some number of Worker processes at different compute nodes.
Normally, a Worker handles either Map tasks (a Map worker) or Reduce tasks (a Reduce worker), but not both. The Master has many responsibilities. One is to create some number of Map tasks and some number of Reduce tasks, these numbers being selected by the user program.
The Master assigns these tasks to Worker processes. It is reasonable to create one Map task for every chunk of the input file(s), but we may wish to create fewer Reduce tasks.
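As a concrete, if simplified, illustration, the following Python sketch shows one way a Master might represent these tasks; the Task record and its field names are hypothetical, not Hadoop's actual data structures:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Task:
        kind: str                     # "map" or "reduce"
        task_id: int
        status: str = "idle"          # idle -> executing -> completed
        worker: Optional[int] = None  # Worker currently executing this task

    def create_tasks(num_chunks: int, num_reduce_tasks: int) -> list[Task]:
        """One Map task per input chunk; a smaller, user-chosen
        number of Reduce tasks."""
        map_tasks = [Task("map", i) for i in range(num_chunks)]
        reduce_tasks = [Task("reduce", j) for j in range(num_reduce_tasks)]
        return map_tasks + reduce_tasks

    # For example, 1,000 input chunks but only 20 Reduce tasks.
    tasks = create_tasks(num_chunks=1000, num_reduce_tasks=20)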
The reason for limiting the number of Reduce tasks is that each Map task must create an intermediate file for each Reduce task, so the number of intermediate files is the product of the two task counts; with 10,000 Map tasks and 1,000 Reduce tasks, for instance, there would already be ten million intermediate files. The Master keeps track of the status of each Map and Reduce task (idle, executing at a particular Worker, or completed).
A Worker process reports to the Master when it finishes a task, and the Master then schedules a new task for that Worker.
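Continuing the hypothetical sketch above (and reusing its Task records), the Master's bookkeeping and rescheduling might look like this; it is illustrative only, not Hadoop's implementation:

    def on_task_finished(tasks, worker_id):
        """Handle a Worker's completion report, then return the next idle
        task assigned to that Worker, or None if nothing remains."""
        for t in tasks:
            if t.worker == worker_id and t.status == "executing":
                t.status = "completed"   # record the finished task
        for t in tasks:
            if t.status == "idle":       # schedule work on the free Worker
                t.status = "executing"
                t.worker = worker_id
                return t
        return None                      # nothing left to schedule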
Each Map task is assigned one or more chunks of the input file(s) and executes the user-written Map code on them. The Map task creates one file for each Reduce task on the local disk of the Worker that executes it.
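The choice of file is typically made by hashing the intermediate key. The sketch below (illustrative only; run_map_task and its parameters are hypothetical) shows a Map task partitioning its key-value output into one local file per Reduce task:

    import os
    import zlib

    def run_map_task(chunk, user_map, num_reduce_tasks, out_dir):
        """Apply the user's Map function to one chunk and write one local
        intermediate file per Reduce task, chosen by hashing each key."""
        files = [open(os.path.join(out_dir, f"part-{r}"), "w")
                 for r in range(num_reduce_tasks)]
        for record in chunk:
            for key, value in user_map(record):
                # A stable hash picks the destination Reduce task.
                r = zlib.crc32(key.encode()) % num_reduce_tasks
                files[r].write(f"{key}\t{value}\n")
        for f in files:
            f.close()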
The Master is informed of the location and size of each of these files, and of the Reduce task for which each is destined. When the Master assigns a Reduce task to a Worker process, it gives that task all the files that form its input. The Reduce task executes the user-written Reduce code and writes its output to a file that is part of the surrounding distributed file system.
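To complete the sketch, a Reduce task might read every file destined for it, group values by key, and apply the user's Reduce function; run_reduce_task is hypothetical, and user_reduce is assumed to yield output values for a key:

    from collections import defaultdict

    def run_reduce_task(input_paths, user_reduce, output_path):
        """Read all intermediate files destined for this Reduce task, group
        values by key, apply the user's Reduce code, and write the result."""
        groups = defaultdict(list)
        for path in input_paths:
            with open(path) as f:
                for line in f:
                    key, value = line.rstrip("\n").split("\t", 1)
                    groups[key].append(value)
        # In a real run, output_path would name a file in the surrounding
        # distributed file system rather than on local disk.
        with open(output_path, "w") as out:
            for key in sorted(groups):
                for result in user_reduce(key, groups[key]):
                    out.write(f"{key}\t{result}\n")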