Introduction
- Hadoop is a top-level Apache project that addresses the problem of long data-processing times, and the MapReduce parallel processing framework is an important member of Hadoop. Because Hadoop itself is implemented in Java, Java programs are the most common choice for big-data processing on it. However, Python is an easier language for deep learning and data mining, so if you want to use deep-learning algorithms in MapReduce, Python is attractive. With that in mind, this article presents a Python implementation of the WordCount experiment in MapReduce. The code comes from a blogger's CSDN post; the reference link is at the end.
Hadoop Streaming
This article mainly relies on Hadoop Streaming, a utility provided by Hadoop. First, let's introduce Hadoop Streaming.
The Role of Streaming
- The greatest advantage of the Hadoop Streaming framework is that map and reduce programs written in any language can run on a Hadoop cluster; a map/reduce program only has to read from standard input (stdin) and write to standard output (stdout).
- Secondly, it is easy to debug on a single machine: the whole map/reduce pipeline can be simulated locally by connecting the programs with pipes, so the map/reduce program can be debugged before it ever touches the cluster:
```
cat inputfile | mapper | sort | reducer > output
```
- Finally, the Streaming framework offers rich parameter control for job submission; many of MapReduce's higher-level features can be used simply by adjusting Streaming parameters, without modifying any Java code.
Limitations of Streaming
By default, Streaming can only handle text data. For binary data, a better approach is to Base64-encode the keys and values into text; a short sketch of this idea follows this section.
The mapper and reducer must convert to and from standard input and output, which involves extra data copying and parsing and therefore adds a certain amount of overhead.
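As a minimal sketch of the Base64 workaround mentioned above (my own illustration, not from the original article; the key name and payload are made up), a mapper could encode each binary value before emitting it:

```python
#!/usr/bin/env python
# Minimal sketch, assuming you control the matching decoder on the
# reducer side: Base64-encode binary values so they pass safely
# through Streaming's text-only channel.
import base64
import sys

def emit(key, raw_value):
    # encode the binary payload as ASCII text; the reducer would
    # recover it with base64.b64decode(value)
    encoded = base64.b64encode(raw_value).decode('ascii')
    sys.stdout.write('%s\t%s\n' % (key, encoded))

# hypothetical binary record, used only to demonstrate the encoding
emit('image-001', b'\x89PNG\r\n\x1a\n')
```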
Relevant parameters of the Streaming command
```
hadoop jar hadoop-streaming-2.6.5.jar [generic options] [streaming options]
```
The generic options and the Streaming options are described at the following site:
https://www.cnblogs.com/shay-zhangjin/p/7714868.html
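For example, a submission might combine one generic option with the usual Streaming options (an illustrative sketch, not from the original article; the paths and the reducer count are placeholders):

```
hadoop jar hadoop-streaming-2.6.5.jar \
    -D mapred.reduce.tasks=2 \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```

Note that generic options such as -D must come before the Streaming options.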
Implementing MapReduce WordCount in Python
- First, write the mapper.py script:
```python
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
```
This script does not total up the number of times each word appears; it immediately emits "1" for every word it encounters, even though a given word may occur many times in the input, and leaves the summation to the subsequent Reduce step (reducer.py). Remember to grant executable permission to mapper.py: chmod 777 mapper.py
- Next, write the reducer.py script:
```python
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
```
Store the code in /usr/local/hadoop/reducer.py. This script reads mapper.py's results from STDIN, sums the occurrences of each word, and writes the totals to STDOUT.
Also, note the script permissions: chmod 777 reducer.py
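Because Hadoop guarantees that the reducer's input arrives sorted by key, the same grouping logic can also be written with itertools.groupby. The following alternative is my own sketch, not part of the original article; it omits the malformed-count guard of reducer.py above, so it assumes clean mapper output:

```python
#!/usr/bin/env python
# Alternative reducer sketch: rely on the sorted-by-key guarantee and
# let itertools.groupby gather consecutive lines that share a word.
import sys
from itertools import groupby

def parse(stdin):
    for line in stdin:
        # each line is "word<TAB>count", as produced by mapper.py
        word, _, count = line.strip().partition('\t')
        yield word, count

for word, group in groupby(parse(sys.stdin), key=lambda pair: pair[0]):
    total = sum(int(count) for _, count in group)
    print('%s\t%d' % (word, total))
```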
- Before running the MapReduce task, it is recommended to verify locally that the scripts run correctly:
```
root@localhost:/root/pythonHadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py
foo     1
foo     1
quux    1
labs    1
foo     1
bar     1
quux    1
root@localhost:/root/pythonHadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py
bar     1
foo     3
labs    1
quux    2
```
If the output matches the above, the scripts work as intended and you can go on to run the MapReduce job.
- Run the Python scripts on the Hadoop platform:
```
[root@node01 pythonHadoop]# hadoop jar contrib/hadoop-streaming-2.6.5.jar \
    -mapper mapper.py -file mapper.py \
    -reducer reducer.py -file reducer.py \
    -input /ooxx/* -output /ooxx/output/
```
- Finally, execute hdfs dfs -cat /ooxx/output/part-00000 to view the output.
The results themselves are not shown here. The input file hello.txt can be generated with echo or downloaded from the Internet; naturally, different data sets produce different results.
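For example, a small test input can be generated and uploaded to HDFS before submitting the job (a sketch; the /ooxx path matches the command above, and the file name is arbitrary):

```
echo "foo foo quux labs foo bar quux" > hello.txt
hdfs dfs -mkdir -p /ooxx
hdfs dfs -put hello.txt /ooxx/
```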
Reference article: https://blog.csdn.net/crazyhacking/article/details/43304499