MapReduce is a programming model for processing data in parallel across an entire cluster. It can be used through high-level scripting tools such as Hive and Pig, or from programming languages such as Scala and Python.

A MapReduce program consists of Mappers and Reducers, which are separate scripts or functions that you write. The model is built on two functions.

  • Map() filters and sorts the data, organizing it into groups. It emits its result as key-value pairs, which are later processed by Reduce().
  • Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
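To make the contract between the two functions concrete, here is a minimal single-process sketch in Python. It is not how a real cluster runs a job: `run_mapreduce`, `map_fn`, `reduce_fn` and the sample `READINGS` are all hypothetical names and data invented for illustration. The map function filters out incomplete records and emits key-value pairs, and the reduce function combines all the values for one key into a single tuple.

```python
from collections import defaultdict

# Hypothetical sample records: (city, temperature reading or None if missing).
READINGS = [("Oslo", 3), ("Cairo", 31), ("Oslo", 7), ("Cairo", None), ("Cairo", 28)]


def map_fn(record):
    """Filter out incomplete records and emit (key, value) pairs."""
    city, temperature = record
    if temperature is not None:
        yield city, temperature


def reduce_fn(key, values):
    """Combine all values for one key into a single, smaller result."""
    yield key, max(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)

    results = []
    for key in sorted(grouped):
        results.extend(reduce_fn(key, grouped[key]))
    return results


max_per_city = run_mapreduce(READINGS, map_fn, reduce_fn)
```

Here `max_per_city` holds one tuple per city. On a cluster, the grouping step in the middle would be handled by the framework rather than by a dictionary in memory.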

Let's consider a word count example to explain the MapReduce process. We want to find the number of occurrences of each word. First, the input is split to distribute the work among all the map nodes, as shown in the figure. Each word is then identified and mapped to the number one, which creates key-value pairs, also called tuples. The first mapper node receives three words: Deer, Bear, and River. Its output is therefore three key-value pairs with three distinct keys, each with the value one. The mapping process is the same on every node. These tuples are then passed to the reducer nodes. Before they arrive, a partitioner carries out the shuffling, so that all the tuples with the same key are sent to the same node.
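The map and shuffle steps can be sketched in plain Python as follows. This is an illustration under assumptions, not framework code: only the first split (Deer, Bear, River) comes from the text, while the other two splits, the number of reducers, and the names `map_words`, `partition` and `shuffle` are assumed for the example. The partitioner uses a CRC32 hash modulo the number of reducers, which is one simple way to guarantee that equal keys land on the same node.

```python
import zlib
from collections import defaultdict

# First split is from the text; the other two are assumed for illustration.
SPLITS = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
NUM_REDUCERS = 2  # assumed cluster size


def map_words(split):
    """Map step: emit a (word, 1) tuple for every word in one input split."""
    return [(word, 1) for word in split.split()]


def partition(key, num_reducers):
    """Pick a reducer for a key using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_outputs, num_reducers):
    """Send every tuple to the reducer chosen by the partitioner."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_outputs:
        for key, value in pairs:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers


mapped = [map_words(split) for split in SPLITS]
shuffled = shuffle(mapped, NUM_REDUCERS)
```

After this runs, `mapped[0]` contains the three tuples produced by the first mapper, and each word appears in exactly one entry of `shuffled`, with its ones collected into a list.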

Each reducer node counts the pairs that share a key and stores that count as the value for the key. In the example, the two pairs with the key ‘Bear’ are reduced to a single tuple whose value is the count, 2. All the output tuples are then collected and written to the output file.
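The reduce step for word count might look like the sketch below. The grouped input is an assumption: only the two ‘Bear’ pairs are stated in the text, and the other counts are made up to complete the example. The names `reduce_counts` and `write_output` are hypothetical, and an in-memory buffer stands in for the output file.

```python
import io

# Grouped tuples as they might arrive after shuffling.
# Only "Bear" is taken from the text; the other entries are assumed.
GROUPED = {
    "Bear": [1, 1],
    "Car": [1, 1, 1],
    "Deer": [1, 1],
    "River": [1, 1],
}


def reduce_counts(key, values):
    """Reduce step: collapse all the ones for a key into a single count."""
    return key, sum(values)


def write_output(results, out):
    """Write one tab-separated 'word<TAB>count' line per output tuple."""
    for word, count in results:
        out.write(f"{word}\t{count}\n")


results = [reduce_counts(key, values) for key, values in sorted(GROUPED.items())]

buffer = io.StringIO()  # stands in for the output file
write_output(results, buffer)
```

Passing an open file object instead of `buffer` would write the same lines to disk.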

MapReduce has several advantages:

  • It’s easy to use, even for programmers with no experience in distributed processing.
  • A developer can express a wide variety of technical problems in MapReduce.
  • It simplifies large-scale computation over large volumes of data.
  • Scalability
  • Cost-effectiveness
  • Flexibility
  • Speed
  • Parallel processing
