Tutorial address: http://www.showmeai.tech/tutorials/84
Article address: http://www.showmeai.tech/article-detail/170
Notice: All rights reserved. For reprints, please contact the platform and the author, and indicate the source.
1. Introduction
In this tutorial, ShowMeAI explains in detail how to use Hadoop MapReduce for data statistics. For background on Hadoop and MapReduce, you can review the ShowMeAI article Detailed Explanation of the Distributed Platform Hadoop and MapReduce.
Although Hadoop is most commonly programmed in Java, Hadoop jobs can also be written in Python, C++, Ruby, and other languages. This example walks through a MapReduce program in Python that counts the frequency of each word in the input files.
- Input: text files
- Output: words and their frequencies, separated by \t
2. MapReduce code implemented in Python
To implement MapReduce in Python, we use the Hadoop Streaming API, which passes data between the Map and Reduce functions through stdin (standard input) and stdout (standard output).
We read the input data from Python's sys.stdin and write our output to sys.stdout; Hadoop Streaming takes care of everything else.
An abstract Hadoop big data processing flow is shown in the figure below:
For the task in this article, we break the process down in more detail. The full Hadoop MapReduce flow is shown in the figure below:
From the figure above, the steps our core code must implement for the current task are:
- Map: emit (word, count) key-value pairs, one per word occurrence
- Reduce: aggregate the values of the same word (key) to produce the final counts
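To make these two phases concrete, here is a minimal trace (the example words are chosen purely for illustration) of how one input line flows through the pipeline:

input line:    "foo bar foo"
after Map:     (foo, 1), (bar, 1), (foo, 1)
after sorting: (bar, 1), (foo, 1), (foo, 1)
after Reduce:  (bar, 1), (foo, 2)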
Let's take a look at how to implement the Map and Reduce phases in Python.
2.1 Map phase: mapper.py
Here, we assume the Python script used in the Map phase is stored at ShowMeAI/hadoop/code/mapper.py :
#!/usr/bin/env python
import sys

# Read lines from standard input
for line in sys.stdin:
    # Strip leading/trailing whitespace
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Emit each word with a count of 1, separated by a tab
    for word in words:
        print("%s\t%s" % (word, 1))
Explanation of the above code:
- The script reads lines from STDIN.
- It splits each line into words and writes each word with its count to STDOUT.
- The Map script does not total the word counts; it simply emits 1 for every occurrence (the aggregation is done in the Reduce phase).
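For example (with a sample input line invented here for illustration), feeding a single line to the mapper emits one tab-separated pair per word, leaving duplicates unmerged:

echo "hello world hello" | python mapper.py

Output:

hello	1
world	1
hello	1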
To make the script executable, grant mapper.py execute permission:
chmod +x ShowMeAI/hadoop/code/mapper.py
2.2 Reduce phase: reducer.py
Here, we assume the Python script used in the Reduce phase is stored at ShowMeAI/hadoop/code/reducer.py :
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Read the (sorted) mapper output from standard input
for line in sys.stdin:
    line = line.strip()
    # Parse the tab-separated word and count emitted by mapper.py
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # If count is not a number, ignore the line
        continue
    # Because the input is sorted by key, identical words are adjacent
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Don't forget the final word
if word == current_word:
    print("%s\t%s" % (current_word, current_count))
The script reads the output of mapper.py as its input, totals the occurrences of each word, and writes the final result to STDOUT.
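Because the reducer only aggregates adjacent identical keys, it expects its input to be sorted. A quick local illustration (the input values here are invented for demonstration):

printf "bar\t1\nfoo\t1\nfoo\t1\nfoo\t1\n" | python reducer.py

Output:

bar	1
foo	3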
To make the script executable, grant reducer.py execute permission:
chmod +x ShowMeAI/hadoop/code/reducer.py
3. Testing the MapReduce process locally
Before submitting the data processing job to the cluster, we usually run a simple local test. With the help of Linux pipes (cat data | map | sort | reduce), we chain the stages together to verify that the mapper.py and reducer.py scripts we wrote work correctly. This kind of test helps ensure the final output is what we expect.
The test commands are as follows:
cd ShowMeAI/hadoop/code/
echo "foo foo quux labs foo bar quux" | python mapper.py
echo "foo foo quux labs foo bar quux" | python mapper.py | sort -k1,1 | python reducer.py
The sort step orders the mapper output by key, so that identical words are adjacent and can be aggregated in the reduce stage.
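If both scripts behave correctly, the second command above should print each distinct word with its total count, in the tab-separated form the reducer emits:

bar	1
foo	3
labs	1
quux	2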
4. Running python code in Hadoop cluster
4.1 Data preparation
We compute word frequency statistics over the following three files. First, download them from the following addresses:
- Plain Text UTF-8 http://www.gutenberg.org/ebooks/4300.txt.utf-8
- Plain Text UTF-8 http://www.gutenberg.org/ebooks/5000.txt.utf-8
- Plain Text UTF-8 http://www.gutenberg.org/ebooks/20417.txt.utf-8
Place the files in the ShowMeAI/hadoop/datas/ directory.
4.2 Execution procedure
Copy the local data files to the distributed file system HDFS.
bin/hadoop dfs -copyFromLocal ShowMeAI/hadoop/datas hdfs_in
List the files on HDFS:
bin/hadoop dfs -ls
View specific files:
bin/hadoop dfs -ls /user/showmeai/hdfs_in
Execute MapReduce job:
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file ShowMeAI/hadoop/code/mapper.py -mapper ShowMeAI/hadoop/code/mapper.py \
    -file ShowMeAI/hadoop/code/reducer.py -reducer ShowMeAI/hadoop/code/reducer.py \
    -input /user/showmeai/hdfs_in/* -output /user/showmeai/hdfs_out
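Note: the contrib/streaming/ path above matches older Hadoop 1.x layouts. On Hadoop 2.x/3.x installations the streaming jar usually lives under share/hadoop/tools/lib/ instead, so an equivalent invocation (a sketch; adjust the paths to your installation) would look roughly like:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -file ShowMeAI/hadoop/code/mapper.py -mapper mapper.py \
    -file ShowMeAI/hadoop/code/reducer.py -reducer reducer.py \
    -input /user/showmeai/hdfs_in/* -output /user/showmeai/hdfs_out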
Example job output: (screenshot omitted)
Check whether the results appear in the target directory /user/showmeai/hdfs_out:
bin/hadoop dfs -ls /user/showmeai/hdfs_out
View results:
bin/hadoop dfs -cat /user/showmeai/hdfs_out/part-00000
Output: (screenshot of the word counts omitted)
5. Mapper and Reducer code optimization
5.1 Iterators and generators in Python
The optimization of the MapReduce code here is mainly based on iterators and generators. Readers unfamiliar with them can refer to the Python section of ShowMeAI → Graphical Python Iterators and Generators.
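As a one-function refresher (a minimal sketch, not taken from the referenced article): a generator uses yield to produce values lazily, one at a time, which is what lets the optimized scripts below stream arbitrarily large inputs without loading them into memory:

def read_lines(file):
    # Yield one stripped line at a time instead of
    # building a list of all lines in memory.
    for line in file:
        yield line.strip()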
5.2 Optimizing the Mapper and Reducer code
mapper.py

#!/usr/bin/env python
import sys

def read_input(file):
    # Generator: yield the words of one line at a time
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print("%s%s%d" % (word, separator, 1))

if __name__ == "__main__":
    main()

reducer.py

#!/usr/bin/env python
from operator import itemgetter
from itertools import groupby
import sys

def read_mapper_output(file, separator='\t'):
    # Generator: yield (word, count) pairs parsed from the mapper output
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby collects consecutive pairs that share the same word
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # Ignore lines whose count is not a number
            pass

if __name__ == "__main__":
    main()
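As a sanity check, the optimized scripts can be verified with the same local pipeline used in Section 3, and should produce identical results:

echo "foo foo quux labs foo bar quux" | python mapper.py | sort -k1,1 | python reducer.py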
Let's briefly demonstrate how groupby works in the code above:
from itertools import groupby
from operator import itemgetter

things = [('2009-09-02', 11), ('2009-09-02', 3), ('2009-09-03', 10),
          ('2009-09-03', 4), ('2009-09-03', 22), ('2009-09-06', 33)]

sss = groupby(things, itemgetter(0))
for key, items in sss:
    print(key)
    for subitem in items:
        print(subitem)
    print('-' * 20)
Result:
2009-09-02
('2009-09-02', 11)
('2009-09-02', 3)
--------------------
2009-09-03
('2009-09-03', 10)
('2009-09-03', 4)
('2009-09-03', 22)
--------------------
2009-09-06
('2009-09-06', 33)
--------------------
In the code:
- groupby(things, itemgetter(0)) groups the items by column 0 (the key)
- groupby(things, itemgetter(1)) groups the items by column 1
- groupby(things) groups by the entire row

Note that groupby only merges consecutive items with equal keys, as the sketch below shows.
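A small caveat worth demonstrating (a standalone example, not from the original article): because groupby only merges consecutive equal keys, unsorted input yields multiple groups for the same key; this is exactly why Hadoop sorts the mapper output before the reduce phase:

from itertools import groupby

data = ['a', 'b', 'a']
# Unsorted input: the same key appears in two separate groups
print([k for k, g in groupby(data)])          # ['a', 'b', 'a']
# Sorted input: each key forms exactly one group
print([k for k, g in groupby(sorted(data))])  # ['a', 'b']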
6. References
- The parameter problem of the split function in Python: http://segmentfault.com/q/1010000000311861
- Writing an Hadoop MapReduce Program in Python: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- The -k parameter of the shell sort command: http://blog.chinaunix.net/uid-25513153-id-200481.html