03 March 2016

Writing MapReduce in Python using Hadoop Streaming

In this Blog we will be discussing execution of MapReduce application in Python using Hadoop Streaming.

We will be learning about streaming feature of hadoop which allow developers to write Mapreduce applications in other languages like Python and C++.

We will be starting our discussion with hadoop streaming which has enabled users to write MapReduce applications in a pythonic way.

We have used hadoop-2.6.0 for execution of the MapReduce Job.

Hadoop streaming can run MapReduce jobs in practically any language .

It Uses Unix Streams as communication mechanism between Hadoop and your code from any language that can read standard input and write standard output.

Steps implemented in Hadoop Streaming:

Mappers and Reducers are used as executables which read the input from stdin and emit the output to stdout.

First, the Mapper task converts the input into lines and places it into the stdin of the process.

After this the Mapper collects the output of the process from stdout and converts it into key/value pair.

These key/value pairs are the actual output from the Mapper task.

First it converts the key/value pairs into lines and put it into the stdin of the process.

Then the reducer collects the line output from the stdout of the process and prepare key/value pairs as the final output of the reduce job.

Steps for Execution:

Step 1: We need to create one input file and store that in HDFS,refer the below screenshot for the same

Step 2: We need to first write mapper class as shown below and save it in our local directory,in our case we have saved it in /home/acadgild directory.

Creating the mapper file and writing script:

#!/usr/bin/env python

import sys

for line in sys.stdin:

line = line.strip()

words = line.split()

for word in words:

print '%s\t%s' % (word, 1)

Step 3: We need to write reducer class in Python as shown below and save it in our local directory,in our case we have saved the file in /home/acadgild directory.

Creating the Reducer file and writing script in it:

#!/usr/bin/env python

from operator import itemgetter

import sys

current_word = None

current_count = 0

word = None

for line in sys.stdin:

line = line.strip()

word, count = line.split('\t', 1)

try:

count = int(count)

except ValueError:

continue

if current_word == word:

current_count += count

else:

if current_word:

print '%s\t%s' % (current_word, current_count)

current_count = count

current_word = word

if current_word == word:

print '%s\t%s' % (current_word, current_count)

Step 4: Execution of MapReduce using Hadoop Streaming Jar.

1	hadoop jar hadoop-streaming-2.6.0.jar -file /home/acadgild/my_first_mapper.py -mapper /home/acadgild/my_first_mapper.py -file /home/acadgild/my_first_reducer.py -reducer /home/acadgild/my_first_reducer.py -input /my_input -output /my_output3

Step 5: View the output from HDFS using the below command.

We hope this blog helped you in understanding hadoop streaming and executing MapReduce program in Python.

Keep visiting our website www.acadgild.com for more blogs on big data technologies.

AcadGild

Writing MapReduce in Python using Hadoop Streaming

Steps implemented in Hadoop Streaming:

Steps for Execution:

Related

Satyam

Related Posts

Leave a Reply

Big Data and Hadoop Developer 2016 | Big Data as Career Path | Introduction to Big Data and Hadoop

Steps implemented in Hadoop Streaming:

Steps for Execution:

Share this:

Related

Satyam

Related Posts

Leave a Reply