A first experience of running a Hadoop MapReduce job
A simple MapReduce job needs three things: a map function, a reduce function, and some code to run the job:
package com.grits.hadoop.learning;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each input line into tokens and emits (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job; args[0] is the input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
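A sketch of how this job might be packaged and submitted; the jar name and the HDFS paths are placeholder assumptions, not values required by the code above:

# Package WordCount into a jar (for example with Maven), then submit it to the cluster
hadoop jar wordcount.jar com.grits.hadoop.learning.WordCount /user/grits/input /user/grits/output
# The reducer output lands in part-r-* files under the output directory
hadoop fs -cat /user/grits/output/part-r-00000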
Core configuration files
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
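A minimal, hedged example of what these four files might contain for a single-node (pseudo-distributed) setup; the host, port, and replication factor below are illustrative assumptions, not required values:

<!-- core-site.xml: the default file system -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: one replica is enough on a single node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: enable the shuffle service that MapReduce needs -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>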
Scaling out
The input data needs to be stored in the distributed file system (HDFS). Using YARN, Hadoop's resource management system, Hadoop moves the MapReduce computation to the machines that each store a portion of the data.
MapReduce job
MapReduce job = input data + MapReduce program + configuration information
Task classification
Hadoop divides the job into tasks of two types: map tasks and reduce tasks. The tasks run on cluster nodes and are scheduled by YARN. If a task fails, it is automatically rescheduled on a different node.
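As an illustration of this failure handling, the number of attempts before a task is considered permanently failed is configurable; the values shown below are the usual defaults, listed here as an assumed mapred-site.xml snippet rather than something the job above changes:

<!-- mapred-site.xml: retry a failed task up to 4 times before failing the job -->
<property>
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>
</property>
<property>
  <name>mapreduce.reduce.maxattempts</name>
  <value>4</value>
</property>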
input split
Hadoop divides the input of a MapReduce job into fixed-length pieces called input splits, or simply "splits".
Hadoop creates one map task per split, and that task runs the user-defined map function over every record in the split.
A MapReduce job therefore usually splits the input data set into independent chunks that are processed by the map tasks in a completely parallel manner.
Split granularity
In general, the finer the splits, the better the load balancing of the job. However, if the splits are too small, the total time spent managing splits and creating map tasks starts to dominate the overall job execution time.
For most jobs, a reasonable split size is the size of an HDFS block, 128 MB by default.
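If a job really does need a different split size, FileInputFormat lets the driver constrain it. A hedged sketch, reusing the Job object from the WordCount driver above; the 64 MB cap is just an example value:

// The effective split size is roughly max(minSize, min(maxSize, blockSize)).
FileInputFormat.setMinInputSplitSize(job, 1L);
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // cap splits at 64 MB
// The same limits exist as configuration properties:
//   mapreduce.input.fileinputformat.split.minsize
//   mapreduce.input.fileinputformat.split.maxsize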
data locality optimization
Hadoop gets the best performance by running a map task on the node where its input data already resides, which avoids consuming valuable cluster bandwidth. This is the data locality optimization.
Cross rack map tasks
Sometimes every node holding an HDFS replica of a map task's input split is busy running other map tasks. The job scheduler then looks for a free map slot on another node in the same rack as one of the replicas; only occasionally, when even that is impossible, does it use a node in a different rack, which causes network transfer between racks.
Why should the optimal split size equal the HDFS block size?
If a split spanned two blocks, it would be unlikely that any single HDFS node stored both of them, so part of the split would have to be transferred over the network to the node running the map task. That is clearly less efficient than running the whole map task on local data.
The reduce task does not enjoy data locality. The input of a single reduce task usually comes from the output of all mappers; when there are multiple reduce tasks, each map task partitions its output, producing one partition per reduce task.
The reduce output is normally written to HDFS for reliable storage: the first replica is stored on the local node, and the other replicas are stored on nodes in other racks.
The number of reduce tasks is not determined by the size of the input data, but is specified independently
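The sketch below shows how map output could be routed to a chosen number of reducers: a hypothetical FirstLetterPartitioner (an illustration, not part of the standard library) plus the driver calls that register it and fix the reducer count at an assumed value of 4. It would sit inside the WordCount class above.

// Hypothetical partitioner: assign each word to a reduce task based on its first character.
public static class FirstLetterPartitioner extends org.apache.hadoop.mapreduce.Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // charAt returns the Unicode code point; the mask keeps the result non-negative.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver, before submitting the job:
job.setPartitionerClass(FirstLetterPartitioner.class);
job.setNumReduceTasks(4); // the number of reduce tasks is chosen explicitly, not derived from the input size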
combiner function
The combiner function can help reduce the amount of data transferred between the map and reduce tasks
// Enable the combiner function as follows:
job.setCombinerClass(XXXReducer.class);
Hadoop Streaming
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so MapReduce programs can be written in any language that reads standard input and writes standard output.
Streaming is naturally suited to text processing. The map input is passed to the map program line by line over standard input, and the map program writes its result lines to standard output; each output key-value pair is a single line with the key and value separated by a tab. The reduce program reads the same tab-separated format from standard input, with the lines already sorted by key by the Hadoop framework, and finally writes its results to standard output.
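A sketch of how a streaming job might be submitted; the jar location, the input and output paths, and the choice of cat as mapper and wc as reducer are illustrative assumptions:

# Mapper passes lines through unchanged; reducer reports line/word/byte counts of its sorted input
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/grits/input \
    -output /user/grits/streaming-output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc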