WordCount in MapReduce, implemented in Python

Posted by ryanb on Wed, 31 Jul 2019 16:54:25 +0200


Introduction

  • Hadoop is a top-level Apache project that addresses the problem of long data-processing times, and the MapReduce parallel-processing framework is an important member of Hadoop. Because Hadoop itself is implemented in Java, Java programs dominate big-data processing. However, if you want to use deep-learning algorithms inside MapReduce, Python is an easier language for deep learning and data mining. Based on these considerations, this article presents a Python implementation of the WordCount experiment in MapReduce. The code comes from a blogger's CSDN post; the reference link is at the end of the article.

Hadoop Streaming

This article mainly uses Hadoop Streaming, a utility shipped with Hadoop. First, a brief introduction to Hadoop Streaming.

The Role of Streaming

  • The greatest advantage of the Hadoop Streaming framework is that map and reduce programs written in any language can run on a Hadoop cluster; the map/reduce programs only need to read from standard input (stdin) and write to standard output (stdout).
  • Secondly, it is easy to debug on a single machine: the map/reduce pipeline can be simulated locally by chaining pipes, so the map/reduce programs can be debugged before ever touching the cluster.
    # cat inputfile | mapper | sort | reducer > output
  • Finally, the Streaming framework provides rich parameter control for job submission, applied directly through streaming parameters without modifying any Java code; many higher-level MapReduce features can be achieved by adjusting streaming parameters.

Limitations of Streaming

By default, Streaming can only handle text data. For binary data, a better approach is to encode the binary keys and values into text with base64.
The mapper and reducer must convert between their standard input/output and the framework's key/value pairs, which involves copying and parsing data and introduces a certain amount of overhead.
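As a minimal sketch of the base64 workaround mentioned above (the helper names `encode_kv`/`decode_kv` are our own, not part of Hadoop), binary keys and values can be wrapped into tab-separated text lines that Streaming handles like any other text:

```python
import base64


def encode_kv(key, value):
    """Wrap a binary key/value pair into one tab-separated text line."""
    return "%s\t%s" % (base64.b64encode(key).decode("ascii"),
                       base64.b64encode(value).decode("ascii"))


def decode_kv(line):
    """Recover the binary key/value pair from an encoded line."""
    k, v = line.strip().split("\t", 1)
    return base64.b64decode(k), base64.b64decode(v)
```

The mapper would encode before printing and the reducer decode after reading, so arbitrary bytes survive the text-only pipeline.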

Relevant parameters of Streaming command

# hadoop jar hadoop-streaming-2.6.5.jar [common options] [streaming options]

The common options and streaming options are documented at:
https://www.cnblogs.com/shay-zhangjin/p/7714868.html
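For illustration (the input/output paths and reducer count here are made up; the flag names follow the Hadoop 2.x Streaming documentation), a submission mixing a generic option with streaming options might look like this. Note that generic `-D` options must come before the streaming options:

```shell
# generic options (such as -D) come first, then the streaming options
hadoop jar hadoop-streaming-2.6.5.jar \
    -D mapreduce.job.reduces=2 \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```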

Python implements MapReduce's WordCount

  1. First, write the mapper.py script:
#!/usr/bin/env python  
  
import sys  
  
# input comes from STDIN (standard input)  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
    # split the line into words  
    words = line.split()  
    # increase counters  
    for word in words:  
        # write the results to STDOUT (standard output);  
        # what we output here will be the input for the  
        # Reduce step, i.e. the input for reducer.py  
        #  
        # tab-delimited; the trivial word count is 1  
        print('%s\t%s' % (word, 1))

This script does not compute the total number of times each word appears; it simply emits "1" for every occurrence, even though a word may occur many times in the input, and leaves the summation to the subsequent Reduce step (or program). Remember to grant executable permission to mapper.py: chmod 777 mapper.py
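The map logic above can be checked without Hadoop at all. This small in-memory sketch (the function name `map_words` is ours, not part of the scripts) emits the same (word, 1) pairs:

```python
def map_words(lines):
    """Emit a (word, 1) pair for every word, like mapper.py does on stdin."""
    pairs = []
    for line in lines:
        # same strip/split logic as mapper.py
        for word in line.strip().split():
            pairs.append((word, 1))
    return pairs
```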

  2. Next, write the reducer.py script:
#!/usr/bin/env python  
  
import sys  
  
current_word = None  
current_count = 0  
word = None  
  
# input comes from STDIN  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
  
    # parse the input we got from mapper.py  
    word, count = line.split('\t', 1)  
  
    # convert count (currently a string) to int  
    try:  
        count = int(count)  
    except ValueError:  
        # count was not a number, so silently  
        # ignore/discard this line  
        continue  
  
    # this IF-switch only works because Hadoop sorts map output  
    # by key (here: word) before it is passed to the reducer  
    if current_word == word:  
        current_count += count  
    else:  
        if current_word:  
            # write result to STDOUT  
            print('%s\t%s' % (current_word, current_count))
        current_count = count  
        current_word = word  
  
# do not forget to output the last word if needed!  
if current_word == word:  
    print('%s\t%s' % (current_word, current_count))

Store the code in /usr/local/hadoop/reducer.py. The script reads the output of mapper.py from STDIN, sums the occurrences of each word, and writes the final counts to STDOUT.
Also, note the script permissions: chmod 777 reducer.py
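Because Hadoop guarantees the reducer sees its input sorted by key, the same aggregation can be expressed compactly with itertools.groupby. This sketch (the name `reduce_counts` is our own) mirrors what reducer.py computes:

```python
from itertools import groupby


def reduce_counts(sorted_pairs):
    """Sum counts over runs of equal words, assuming the input is already
    sorted by word -- the same guarantee Hadoop gives reducer.py."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(sorted_pairs, key=lambda kv: kv[0])]
```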

  3. Before running the MapReduce task, it is recommended to check that the scripts run correctly:
root@localhost:/root/pythonHadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py  
foo      1  
foo      1  
quux     1  
labs     1  
foo      1  
bar      1  
quux     1  
root@localhost:/root/pythonHadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py  
bar     1  
foo     3  
labs    1  
quux    2  

If the output matches the above, the scripts work as expected and you can run the full MapReduce job.

  4. Run the Python scripts on the Hadoop platform:
[root@node01 pythonHadoop]
hadoop jar contrib/hadoop-streaming-2.6.5.jar \
    -mapper mapper.py \
    -file mapper.py \
    -reducer reducer.py \
    -file reducer.py \
    -input /ooxx/* \
    -output /ooxx/output/
  5. Finally, execute hdfs dfs -cat /ooxx/output/part-00000 to view the output.
    The results are not shown here. The input file hello.txt can be produced with echo or downloaded from the Internet; naturally, different data sets give different counts.

Reference article: https://blog.csdn.net/crazyhacking/article/details/43304499

Topics: Hadoop Python Java Apache