Initial experience of Hadoop MapReduce operation

Posted by les4017 on Sat, 18 Dec 2021 17:57:06 +0100

A simple MapReduce job requires a map function, a reduce function and some code to run the job

package com.grits.hadoop.learning;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every whitespace-separated token in the input line
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // Sum the counts for each word and emit (word, total)
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Core configuration files

core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
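
As a quick illustration (a minimal sketch, assuming the four files listed above are on the Hadoop classpath; otherwise built-in defaults apply), a Configuration object picks up core-site.xml automatically, and individual properties such as fs.defaultFS can be read or overridden in code:

// Minimal sketch: reading and overriding configuration properties in code.
// Assumes core-site.xml is on the classpath; without it, built-in defaults are used.
import org.apache.hadoop.conf.Configuration;

public class ConfPeek {
    public static void main(String[] args) {
        Configuration conf = new Configuration();       // loads core-default.xml and core-site.xml
        System.out.println(conf.get("fs.defaultFS"));   // e.g. hdfs://namenode:8020, or file:/// by default
        conf.set("mapreduce.job.reduces", "2");         // properties can also be overridden per job in code
    }
}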

Scaling out

The data needs to be stored in a distributed file system. Using YARN, Hadoop's resource management system, Hadoop can move the MapReduce computation to the machines that store part of the data

MapReduce job

MapReduce job = input data + MapReduce program + configuration information

Task classification

Hadoop divides the job into a number of tasks of two types: map tasks and reduce tasks. These tasks run on cluster nodes and are scheduled by YARN. If a task fails, it is automatically rescheduled on a different node

input split

Hadoop divides the input data of a MapReduce job into fixed-size pieces called input splits, or just "splits"
Hadoop creates one map task for each split, and the task runs the user-defined map function on every record in the split

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner

Split granularity

Generally, the finer-grained the splits, the better the load balancing of the job. However, if the splits are too small, the total time spent managing the splits and creating the map tasks starts to dominate the overall execution time of the job

For most jobs, a good split size tends to be the size of an HDFS block (128 MB by default)
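
As a rough sketch of how this can be tuned (the 64 MB and 256 MB values below are only illustrative), the WordCount driver above could bound the split size explicitly; with the default settings the split size simply equals the HDFS block size:

// Sketch only: bounding the input split size in the driver (values are illustrative).
// The effective split size is roughly max(minSplitSize, min(maxSplitSize, blockSize)).
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB lower bound
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB upper bound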

data locality optimization

Hadoop gets the best performance by running a map task on the node where its input data is stored, without consuming valuable cluster bandwidth

Cross rack map tasks

Sometimes all the nodes storing replicas of the HDFS blocks for a map task's input split are busy running other map tasks. The job scheduler then looks for a free map slot on a node in the same rack as one of the blocks; if even that is not possible, an off-rack node is used, which causes network transfer between racks

Why should the optimal split size be the same as the block size?

If a split spanned two blocks, it would be unlikely that any single HDFS node stores both of them, so part of the split would have to be transferred over the network to the node running the map task. This is clearly less efficient than running the whole map task on local data

Reduce tasks do not have the advantage of data locality. The input to a single reduce task is normally the output from all of the mappers; when there are multiple reduce tasks, each map task partitions its output, creating one partition per reduce task

The output of the reduce tasks is normally stored in HDFS for reliability: the first replica is stored on the local node, and the other replicas are stored on nodes in other racks

The number of reduce tasks is not determined by the size of the input data, but is specified independently
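
To make this concrete, here is a hedged sketch (WordPartitioner and its hash rule are made up for illustration; by default Hadoop uses HashPartitioner): the number of reduce tasks is set explicitly on the job, and a Partitioner decides which reduce task each map output key goes to.

// Sketch only: choosing the number of reducers and how map output is partitioned.
// WordPartitioner is a hypothetical example; Hadoop's default is HashPartitioner.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route each word to a reduce task by its hash (same idea as the default HashPartitioner)
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver:
//     job.setNumReduceTasks(2);                       // the number of reduce tasks is chosen independently
//     job.setPartitionerClass(WordPartitioner.class); // one partition per reduce task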

combiner function

The combiner function helps reduce the amount of data transferred between the mappers and the reducers by aggregating the map output locally before it is sent over the network. In WordCount the reducer can double as the combiner because summing counts is commutative and associative

// The combiner is enabled on the job as follows
job.setCombinerClass(XXXReducer.class);

Hadoop Streaming

Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so MapReduce programs can be written in any language that can read standard input and write standard output

Streaming is naturally suited to text processing. The map input data is passed to the map function over standard input, line by line, and the map function writes its result lines to standard output, each line being a tab-separated key-value pair. The reduce function reads lines in the same format from standard input, already sorted by key by the framework, and writes its results to standard output
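
Streaming mappers and reducers are usually written in scripting languages, but any executable that follows this stdin/stdout protocol works. As a hedged sketch in Java (the class name StreamWordCountMapper is made up; the tab-separated output is the part that matters), a Streaming-style word-count mapper could look like this:

// Sketch of a Streaming-style mapper: read lines from stdin, write "key<TAB>value" lines to stdout.
// The class name is illustrative; any executable following this protocol can be used with Hadoop Streaming.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t" + 1);   // tab-separated key-value pair
                }
            }
        }
    }
}

The matching reducer would read these tab-separated lines from standard input, already sorted by key by the framework, and emit one total per word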

Topics: Big Data, Hadoop