Hadoop note 3: MapReduce

Posted by dsinicco on Wed, 12 Jan 2022 20:15:45 +0100

MapReduce is a distributed computing framework. It was originally designed by Google engineers on top of GFS, mainly to solve the problem of computing over massive data sets in the search domain.

Following that design, Doug Cutting built the MapReduce framework on top of HDFS.

MapReduce keeps programmers away from the details of distributed programming, such as task scheduling, logical splitting, and data locality, so they can focus on their business logic.

MapReduce consists of two phases: Map and Reduce. Users only need to implement a map(inKey, inVal) -> (outKey, outVal) function and a reduce(inKey, inVals) -> (outKey, outVal) function to complete a distributed computation.

MapReduce framework

Responsibilities of the JobTracker/ResourceManager

The ResourceManager replaced the JobTracker after Hadoop 2.x introduced YARN.

  • Knows which machines it manages, i.e., which NodeManagers belong to it.
  • Detects the status of each NodeManager through RPC heartbeats.
  • Allocates and schedules tasks; the ResourceManager supports fine-grained allocation, e.g. how much memory and how many compute resources a task needs.

Responsibilities of the TaskTracker/NodeManager

Similarly, the NodeManager replaced the TaskTracker after Hadoop introduced YARN.

The NodeManager receives tasks from the ResourceManager and executes them; Map and Reduce tasks run here.

MapReduce execution steps

  • First, the massive data to be processed is usually stored in HDFS.
  • When this data is read in, it is split line by line.
  • The input to the Map job is each line of data: the key is the byte offset of the start of the line (relative to the whole file), and the value is the content of the line.
  • The logic of the Map job is implemented by the user; for each line it emits some number of key-value pairs (as defined by the user).
  • After the Map jobs finish, all of their results go through an aggregation step, the shuffle. This merges the Map outputs: key-value pairs with the same key (but possibly different values) are grouped together. For example, key1-value1 and key1-value2 become key1-(value1, value2). This guarantees that every key in the merged result is unique.
  • The Reduce job receives the merged key-value groups and produces new key-value pairs; the processing logic is customized by the user. Typically we aggregate or otherwise compute over the values to produce a single output value.
  • The output of Reduce is saved to a file, which is the final result of the MapReduce computation (a short trace follows this list).
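For example, with the two input lines "hello world" and "hello cat", the key-value pairs flow roughly like this:

(hello,1) (world,1) (hello,1) (cat,1)      <- Map output
(cat,[1]) (hello,[1,1]) (world,[1])        <- after the shuffle (grouped by key)
(cat,1) (hello,2) (world,1)                <- Reduce output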

In Hadoop, to make serialization easier, LongWritable represents the long type and Text represents the String type. These types can be converted to and from the corresponding Java basic types.
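For instance, a quick sketch of wrapping and unwrapping such values (not part of the original example, just the standard constructor/getter API):

LongWritable offset = new LongWritable(100L);   // wrap a Java long
long n = offset.get();                          // unwrap back to a long

Text word = new Text("hello");                  // wrap a Java String
String s = word.toString();                     // unwrap back to a String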

MapReduce case: word count

The following simple word-count program (the "hello world" of MapReduce) demonstrates the MapReduce API.

We prepare a simple text file, words.txt; our goal is to count the occurrences of each word in the file (words are separated by spaces):

hello world
I am a small cat
I love hadoop
I love cat
hello hadoop
hello cat

First, upload the text file to HDFS(/park/words.txt):

[root@hadoop1 ~]# hadoop fs -mkdir /park
[root@hadoop1 ~]# hadoop fs -put words.txt /park/words.txt
[root@hadoop1 ~]# hadoop fs -ls /park
Found 1 items
-rw-r--r--   1 root supergroup         77 2018-03-24 12:18 /park/words.txt
[root@hadoop1 ~]# hadoop fs -cat /park/words.txt
hello world
I am a small cat
I love hadoop
I love cat
hello hadoop
hello cat

Next, we write a Map job:

package cn.lazycat.bdd.hadoop.mapreduce.wordcount;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * The Map job class must extend Mapper (note: from the org.apache.hadoop.mapreduce package).
 * The four generic parameters are the input key/value and output key/value types:
 * the input key is a LongWritable, the byte offset at the start of each line;
 * the input value is Text, the content of the line;
 * the output key is the word itself;
 * the output value is the number of times the word appears on this line.
 */
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    /**
     * The user-defined map method
     * @param key the input key
     * @param value the input value
     * @param context the object used to emit output
     */
    @Override
    protected void map(LongWritable key, Text value,
                       Context context)
            throws IOException, InterruptedException {

        // Gets the content of a line of text read
        String line = value.toString();

        // Split according to the space to get each word
        String words[] = line.split(" ");

        for (String word : words) {
            // For each word, an output is generated
            context.write(new Text(word), new LongWritable(1));
        }
    }
}

Reduce job:

package cn.lazycat.bdd.hadoop.mapreduce.wordcount;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * The Reduce job class.
 * Its input key/value types are the output types of the Map job, and its output types are those of the final result.
 * In this case the input values are the counts (1) emitted for each word; the Reduce job
 * sums these counts and finally produces each word together with its total count.
 */
public class WordCountReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    /**
     * reduce Job code
     * @param key key from map
     * @param values Shuffled values
     * @param context Output results
     */
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
                          Context context)
            throws IOException, InterruptedException {

        long sum = 0;   // Number of occurrences of words
        for (LongWritable value : values) {
            sum += value.get();
        }

        // Write data
        context.write(key, new LongWritable(sum));
    }
}

With only the Mapper and Reducer we cannot run anything yet. We also need a Job (driver) class to act as the entry point of the whole job; this class needs a main method that Hadoop uses to start the job.

The implementation of this class is as follows:

package cn.lazycat.bdd.hadoop.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountJob {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Hadoop configuration
        Configuration conf = new Configuration();
        // Hadoop Job class
        // Pass in the conf object and a name for the job
        Job job = Job.getInstance(conf, "wc");
        // Specify the entry class of the Job
        job.setJarByClass(WordCountJob.class);
        // Specify Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Specifies the output key value type of Mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Specifies the output key value type of the Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Specify the input data (read from HDFS)
        // A directory is given here; Hadoop reads every file in it (non-recursively)
        FileInputFormat.addInputPath(job,
                new Path("hdfs://192.168.117.51:9000/park"));
        // Specify the output location (written to HDFS)
        // Note: the output directory must not exist beforehand, or an exception is thrown
        FileOutputFormat.setOutputPath(job,
                new Path("hdfs://192.168.117.51:9000/out"));
        // Start task
        job.waitForCompletion(true);
    }

}

If you run this under Windows, remember to put winutils.exe under Hadoop's bin directory and hadoop.dll under C:/Windows/System32 (these files come with the Hadoop Eclipse plug-in), otherwise an exception will be thrown.

If everything goes well and the job succeeds, we can see its output under /out:

[root@hadoop1 ~]# hadoop fs -ls /out
Found 2 items
-rw-r--r--   3 LazyCat supergroup          0 2018-03-24 14:24 /out/_SUCCESS
-rw-r--r--   3 LazyCat supergroup         59 2018-03-24 14:24 /out/part-r-00000

_SUCCESS is a marker file indicating that the job completed successfully. The part-r-xxxxx files contain the actual output; the result may consist of multiple such files.

We can view the results:

[root@hadoop1 ~]# hadoop fs -cat /out/part-r-00000
I   3
a   1
am  1
cat 3
hadoop  2
hello   3
love    2
small   1
world   1

As you can see, the output is the key-value pairs emitted by the Reduce job, one pair per line.

We can also package the code into a jar and run it on a cluster node with the hadoop command:

[root@hadoop1 ~]# hadoop jar wc.jar

Here wc.jar is the project jar I built; note that WordCountJob must be set as the jar's main class.

If it succeeds, the log output looks similar to this (some lines omitted):

18/03/24 14:43:44 INFO input.FileInputFormat: Total input paths to process : 1
18/03/24 14:43:45 INFO mapreduce.JobSubmitter: number of splits:1
18/03/24 14:43:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1521642825141_0002
18/03/24 14:43:45 INFO impl.YarnClientImpl: Submitted application application_1521642825141_0002
18/03/24 14:43:45 INFO mapreduce.Job: The url to track the job: http://hadoop1:8088/proxy/application_1521642825141_0002/
18/03/24 14:43:45 INFO mapreduce.Job: Running job: job_1521642825141_0002
18/03/24 14:43:58 INFO mapreduce.Job: Job job_1521642825141_0002 running in uber mode : false
18/03/24 14:43:58 INFO mapreduce.Job:  map 0% reduce 0%
18/03/24 14:44:05 INFO mapreduce.Job:  map 100% reduce 0%
18/03/24 14:44:10 INFO mapreduce.Job:  map 100% reduce 100%
18/03/24 14:44:11 INFO mapreduce.Job: Job job_1521642825141_0002 completed successfully
18/03/24 14:44:11 INFO mapreduce.Job: Counters: 49
    File System Counters
        ...
    Job Counters
        ...
    Map-Reduce Framework
        ...
    Shuffle Errors
        ...
    File Input Format Counters
        ...
    File Output Format Counters
        ...

The result of the job is the same as above.

MapReduce job execution process

To understand how a MapReduce job executes in Hadoop, you first need to understand the three roles involved:

  • Client Node: the node that runs the driver code. It may be outside the Hadoop cluster, or it may be one of the cluster's DataNodes.
  • ResourceManager: the cluster-wide resource scheduler, provided by the YARN framework. It assigns tasks and resources to the NodeManagers. In a Hadoop cluster it usually lives on the same machine as the NameNode.
  • NodeManager: the executor of tasks. It launches a Java process from the jar file, runs the task code, and reads the task's resources from a shared file system (such as HDFS).

The specific process of operation is as follows:

Submit job

The Job is submitted from the Client Node, as we can see in the source code above. The key step is the submit() method: the Job creates a JobSubmitter instance, which is responsible for actually submitting the job.

If you dig into the JobSubmitter source code, you will find that it requests a job ID from the ResourceManager. This ID uniquely identifies the job.

After obtaining the job ID, the JobSubmitter computes the input split information (split offsets) and uploads the resources the job needs (the MapReduce jar, the configuration, and the split information) to HDFS.

Finally, the submitJob() method is called to actually submit the job.
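For reference, waitForCompletion(true), which we call in WordCountJob, wraps this submission and then blocks while printing progress. A minimal sketch of the alternative, asynchronous style using the public Job API (job setup omitted; the polling loop is purely illustrative):

Job job = Job.getInstance(new Configuration(), "wc");
// ... setJarByClass / setMapperClass / setReducerClass / input and output paths as before ...
job.submit();                                   // hands the job to the JobSubmitter
while (!job.isComplete()) {                     // poll the cluster for status
    System.out.printf("map %.0f%%  reduce %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(5000);                         // wait a little between polls
}
System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");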

Job initialization

After submitJob() is called, the ResourceManager selects (through the YARN scheduler) a compute node (NodeManager) suitable for the job, and tells that NodeManager to start a container (think of a container as a space holding the code for one piece of work plus the resources that work needs; each container has a main class to execute).

The NodeManager then starts the MRAppMaster process in that container. This process is responsible for initializing the whole job.

The MRAppMaster process downloads the split information from HDFS (uploaded by the Client Node when the job was submitted).

The MRAppMaster then creates one map task object for each input split it downloaded, plus as many reduce task objects as the job's reduce-task property specifies. It also asks the ResourceManager for containers for these map and reduce tasks, to hold the resources the tasks need.

Task containers are usually on the same node as the AppMaster, but if the job is very large the scheduler will try to allocate task containers on other nodes as well.

The MRAppMaster also creates bookkeeping objects to track job progress, so that progress can be reported back to the Client Node for logging at any time.

Task execution

After the MRAppMaster has a task container, it tells the NodeManager to start the container.

The container's main class is the YarnChild process. Before starting the task it downloads the job configuration, jar files, and any other resources the job needs from HDFS. Then it runs the map task or reduce task.

Progress and status updates

While a task runs under YARN, it reports its progress and status to the AppMaster through the umbilical interface every 3 seconds. The Client Node polls the AppMaster every second (set by mapreduce.client.progressmonitor.pollinterval) to receive the latest completion status and display it to the user.

Job completion

Inside waitForCompletion(), the Client Node checks every 5 seconds (set by mapreduce.client.completion.pollinterval) whether the job has completed.

After the job completes, the AppMaster container and the task containers are cleaned up.

MapReduce serialization

The cluster's whole workflow involves a large amount of serialization and deserialization. At the bottom layer Hadoop implements serialization and deserialization via AVRO and wraps it with some convenient APIs.

Hadoop wraps the Java basic types in serializable classes that implement the Writable interface. The following comparison table lists them:

Java basic type           Writable implementation      Serialized size (bytes)
null                      NullWritable.get()           -
boolean                   BooleanWritable              1
byte                      ByteWritable                 1
short                     ShortWritable                2
int                       IntWritable                  4
int (variable length)     VIntWritable                 1~5
float                     FloatWritable                4
long                      LongWritable                 8
long (variable length)    VLongWritable                1~9
double                    DoubleWritable               8

For these types, write() serializes the value and readFields() deserializes it.
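A small round-trip sketch (not from the original post) showing write() serializing into a byte stream and readFields() restoring the value:

import org.apache.hadoop.io.LongWritable;

import java.io.*;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        LongWritable out = new LongWritable(42L);

        // Serialize: write() pushes the value into a DataOutput
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        out.write(new DataOutputStream(buf));

        // Deserialize: readFields() restores the value from a DataInput
        LongWritable in = new LongWritable();
        in.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(in.get());   // prints 42
    }
}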

Serialization case: traffic statistics

Here is a concrete example to show the use of serialization.

This case implements a traffic-statistics function. The data file has the following format:

  • flow.txt
13877779999 bj zs 2145
13766668888 sh ls 1028
13766668888 sh ls 9987
13877779999 bj zs 5678
13544445555 sz ww 10577
13877779999 sh zs 2145
13766668888 sh ls 9987

The first column is the user's mobile phone number, the second the location where the traffic was generated, the third the user's name, and the last the amount of traffic generated.

First push the data to HDFS:

[root@hadoop1 ~]# hadoop fs -put data/flow.txt /park/flow
[root@hadoop1 ~]# hadoop fs -cat /park/flow
13877779999 bj zs 2145
13766668888 sh ls 1028
13766668888 sh ls 9987
13877779999 bj zs 5678
13544445555 sz ww 10577
13877779999 sh zs 2145
13766668888 sh ls 9987

As usual, we can write the MapReduce code directly (without custom objects). In the Map task we output the phone number and name from each line as the key and the traffic value as the value (ignoring the location):

public class FlowMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value,
                       Context context)
            throws IOException, InterruptedException {

        String line = value.toString();

        // Take out three required data
        String data[] = line.split(" ");
        String phone = data[0];
        String name = data[2];
        int flow = Integer.parseInt(data[3]);

        context.write(new Text(phone + " " + name), new IntWritable(flow));

    }
}

In the reduce task, we can output the mobile phone number and name as key and the total traffic value as value:

public class FlowReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }

        context.write(key, new IntWritable(sum));
    }
}

Run the MapReduce job and check the output:

[root@hadoop1 ~]# hadoop fs -cat /out/part-r-00000
13544445555 ww  10577
13766668888 ls  21002
13877779999 zs  9968

However, the phone number, user name, and traffic can obviously be abstracted into a class in OOP terms. To use such a class as Map/Reduce input or output, we need Hadoop's serialization mechanism.

A custom serializable object must implement the Writable interface, which means implementing the write(DataOutput) and readFields(DataInput) methods for serialization and deserialization.

DataOutput and DataInput provide a series of write() and read() methods. Note that fields must be read back in exactly the order they were written. Strings can be written and read with writeUTF(String) and readUTF().

The following is the code of this custom class:

package cn.lazycat.bdd.hadoop.mapreduce.flow.update;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements Writable {

    private String phone;
    private String addr;
    private String name;
    private long flow;

    @Override
    public void write(DataOutput out) throws IOException {

        out.writeUTF(phone);
        out.writeUTF(addr);
        out.writeUTF(name);
        out.writeLong(flow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {

        phone = in.readUTF();
        addr = in.readUTF();
        name = in.readUTF();
        flow = in.readLong();
    }

    // getter, setter omitted
}

Then modify the map method so that the output value becomes the entire FlowBean object:

public class UpdateFlowMapper
        extends Mapper<LongWritable, Text, Text, FlowBean> {

    @Override
    protected void map(LongWritable key, Text value,
                       Context context)
            throws IOException, InterruptedException {

        String line = value.toString();
        String temp[] = line.split(" ");

        String phone = temp[0];
        String addr = temp[1];
        String name = temp[2];
        long flow = Long.parseLong(temp[3]);

        FlowBean bean = new FlowBean();
        bean.setPhone(phone);
        bean.setAddr(addr);
        bean.setName(name);
        bean.setFlow(flow);

        context.write(new Text(phone), bean);
    }
}

The reduce task receives the phone number and the FlowBean objects, and outputs the phone number together with the summed traffic:

public class UpdateFlowReduce
        extends Reducer<Text, FlowBean, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values,
                          Context context)
            throws IOException, InterruptedException {

        String info = key.toString();
        long sum = 0;
        for (FlowBean flowBean : values) {
            sum += flowBean.getFlow();
        }

        context.write(new Text(info), new LongWritable(sum));
    }
}

The final result is exactly the same as the version without custom objects.

Partitioner: partitioning

So far each of our MapReduce jobs produced a single output file. Sometimes, however, we want to split the output according to some field.

Take the traffic-statistics case above: if I want separate result files per region, e.g. the traffic for bj and sh in different files, I need partitioning.

Partitioning is an important step of the shuffle. It distributes the Map results to different Reduce tasks according to some rule, so that we get one output file per partition.

Partitioner is the base class for partitioners. To customize partitioning, extend this class.

The default partitioner is HashPartitioner. It computes:

reducer = (key.hashCode() & Integer.MAX_VALUE) % numberReduceTasks;

This assigns records to reducers by the hash of the key: the spread of keys over reducers is essentially arbitrary, but every record with the same key always goes to the same reducer.

By default the number of reduce tasks is 1; in other words there is only one reducer, which is why every MapReduce job above produced a single output file.
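For reference, the default partitioner is implemented roughly like this (see org.apache.hadoop.mapreduce.lib.partition.HashPartitioner):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then bucket by reducer count
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}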

Partitioning case: traffic statistics by region

The following case adapts the traffic-statistics job so that the final results are stored separately by region.

This requires a custom partition rule that partitions by addr. The following defines a partitioner class:

public class FlowPartitioner extends Partitioner<Text, FlowBean> {

    // Save city reducer number
    private static Map<String, Integer> map;

    static {
        map = new HashMap<>();
        map.put("bj", 0);
        map.put("sh", 1);
        map.put("sz", 2);
    }

    @Override
    public int getPartition(Text text, FlowBean flowBean,
                            int numPartitions) {

        if (flowBean != null) {
            // Partition by address
            String addr = flowBean.getAddr();
            return map.get(addr);
        }
        else {
            return 3;
        }

    }
}

getPartition() returns which reducer should handle the given key-value pair; the return value is the reducer number (counting from 0).

Before starting the Job, you need to manually specify this partition class:

// Set custom partition class
job.setPartitionerClass(FlowPartitioner.class);
// Set number of partitions
job.setNumReduceTasks(4);

Note that the number of reduce tasks must be at least the largest partition number your partitioner can return plus 1, otherwise an exception will be thrown (for example, returning 4 when the number of reduce tasks is also 4 is invalid, because valid partition numbers are then 0 through 3).

The MapReduce code does not need to be changed. Run the job below, and then view the output of the job:

[root@hadoop1 ~]# hdls /out
Found 5 items
-rw-r--r--   3 LazyCat supergroup          0 2018-03-24 23:02 /out/_SUCCESS
-rw-r--r--   3 LazyCat supergroup         20 2018-03-24 23:02 /out/part-r-00000
-rw-r--r--   3 LazyCat supergroup         41 2018-03-24 23:02 /out/part-r-00001
-rw-r--r--   3 LazyCat supergroup         21 2018-03-24 23:02 /out/part-r-00002
-rw-r--r--   3 LazyCat supergroup          0 2018-03-24 23:02 /out/part-r-00003

You will see four output files: part-r-00000 holds the results for "bj", part-r-00001 for "sh", part-r-00002 for "sz", and part-r-00003 would hold the (invalid) results for records whose flowBean is null.

Sorting

In the shuffle process, in addition to partitioning, we can also sort.

By default, keys are sorted in dictionary order. We can customize this rule.

The data is sorted after the Map runs and before it enters Reduce.

Sorting case: calculate everyone's total profit

Now suppose we need to calculate everyone's total profit (income minus cost) and sort the results by profit.

The form of data is as follows:

  • profit
1 ls 2850 100
2 ls 3566 200
3 ls 4555 323
1 zs 19000 2000
2 zs 28599 3900
3 zs 34567 5000
1 ww 355 10
2 ww 555 222
3 ww 667 192

The first column is the month, the second the name, the third the income, and the fourth the cost.

To implement this, two MapReduce jobs are required: the first computes everyone's total profit, and the second sorts the results.

The first job is simple: the map emits the name as the key and the month's profit as the value; reduce sums the profits per name and outputs the totals.

The following is a code snippet of the task map and reduce:

// map:
@Override
protected void map(LongWritable key, Text value,
                    Context context)
        throws IOException, InterruptedException {

    String data[] = value.toString().split(" ");
    String name = data[1];
    long profit = Long.parseLong(data[2]) - Long.parseLong(data[3]);

    context.write(new Text(name), new LongWritable(profit));
}
// reduce:
@Override
protected void reduce(Text key, Iterable<LongWritable> values,
                        Context context)
        throws IOException, InterruptedException {

    long sum = 0;
    for (LongWritable val : values) {
        sum += val.get();
    }

    context.write(key, new LongWritable(sum));
}

Results of the first MapReduce job:

[root@hadoop1 ~]# hdcat /out/part-r-00000
ls  10348
ww  1153
zs  71266

Then we need to write the second MapReduce to sort.

We can write a ProfitBean class that implements the WritableComparable interface, which extends both Writable and Comparable. When such an object is used as the key, the shuffle will sort it according to the user-defined rule.

package cn.lazycat.bdd.hadoop.mapreduce.profit;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class ProfitBean
    implements WritableComparable<ProfitBean> {

    private String name;
    private long profit;

    @Override
    public int compareTo(ProfitBean o) {
        return Long.compare(o.getProfit(), profit);   // descending; avoids overflow from casting a long difference to int
    }

    @Override
    public void write(DataOutput out) throws IOException {

        out.writeUTF(name);
        out.writeLong(profit);

    }

    @Override
    public void readFields(DataInput in) throws IOException {

        name = in.readUTF();
        profit = in.readLong();
    }

    @Override
    public String toString() {
        return name + "\t" + profit;
    }

    // setter, getter omitted

}

The second MapReduce job then only needs the shuffle stage for its sorting. In the map job we convert each row into a ProfitBean object; in the reduce phase we simply output that object:

// map
@Override
protected void map(LongWritable key, Text value, 
                    Context context)
        throws IOException, InterruptedException {
    String data[] = value.toString().split(" ");
    ProfitBean bean = new ProfitBean();
    bean.setName(data[0]);
    bean.setProfit(Long.parseLong(data[1]));
    context.write(bean, NullWritable.get());
}
// reduce
@Override
protected void reduce(ProfitBean key, Iterable<NullWritable> values,
                        Context context)
        throws IOException, InterruptedException {

    context.write(new Text(key.toString()), NullWritable.get());

}

The result is sorted:

[root@hadoop1 ~]# hdcat /out/part-r-00000
zs  71266
ls  10348
ww  1153

As you can see, sometimes a single MapReduce job cannot meet our needs; we may end up chaining quite a few MapReduce jobs.

In the sorting job above we could even drop our own Reduce job, since it does so little: if no reducer class is set, Hadoop uses its built-in identity Reducer, which passes the Map output through unchanged while the shuffle still sorts it (see the sketch below).
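A minimal sketch of the driver changes, assuming the map class is called ProfitSortMapper (the name is illustrative):

job.setMapperClass(ProfitSortMapper.class);   // the map() shown above
// no job.setReducerClass(...): Hadoop falls back to the identity Reducer,
// which forwards every (key, value) pair; the shuffle still sorts by
// ProfitBean.compareTo(), so the output stays ordered
job.setOutputKeyClass(ProfitBean.class);
job.setOutputValueClass(NullWritable.class);

Note that this relies on keeping at least one reduce task; calling job.setNumReduceTasks(0) would make the job map-only and skip the shuffle sort entirely.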

Combiner: merging Map output

Because the Map job processes every row, each Map task can produce a great deal of output. The Combiner's job is to pre-merge the Map output so as to reduce the Reduce job's workload.

The Combiner performs a "mini" merge right after the Map job (note: it does not merge different keys into one; it merges multiple values of the same key). It cannot merge data across all Map tasks, though.

Upgrade WordCount: using Combiner

In the previous WordCount, every count emitted on the Map side is 1, so Reduce has to merge a huge number of (word, 1) pairs, which is rather wasteful. If we merge some of this data on the Map side first, the pressure on the Reduce side drops considerably.

A Combiner is written exactly like a Reducer: it extends the Reducer class, and in this case even the logic is identical.

package cn.lazycat.bdd.hadoop.mapreduce.wordcount.combiner;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountCombiner
        extends Reducer<Text, LongWritable, Text, LongWritable> {


    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
                          Context context)
            throws IOException, InterruptedException {

        long sum = 0;
        for (LongWritable val : values) {
            sum += val.get();
        }

        context.write(key, new LongWritable(sum));
    }
}

Next, we need to declare the use of this Combiner in the startup class:

// Specify Combiner
job.setCombinerClass(WordCountCombiner.class);

Running the job again, you will find the result is exactly the same as before.

Shuffle

Shuffle is the "heart" of MapReduce. Its function is to merge and sort the key of the output of the Map job, and then send it to the Reduce job as input. But shuffle has some more complex mechanisms.

Each Map task has a ring cache, which is 100m by default (set by io.sort.mb). Once the cache content reaches the threshold (IO. Sort. Spin. Percent), it is 0.80 by default, which refers to 0.8 times the maximum size. A background thread is started to spill the contents of the cache to disk. At this point, Map will continue to write data to the cache. However, if the cache is full, the Map job will be blocked.

This process is very interesting. You can think of the buffer as a "disk". There are two capitals on this disc. One cylinder head slides over the disc to generate data, and the other slides over the disc to remove data. At the beginning, the column head generating data acts first. When the whole disk is about to be filled with data, the column head removing data starts to move and "catch up" with the column head writing data. If the column head writing data fills the whole disk with data, it will stop and wait for the column head reading data to read the data before moving.

The overflow write process will poll and write the data to mapred local. Dir.

Before overflow, the thread will first partition the overflow data according to the partition rules, and in the partition, it will perform "internal sorting" according to the key.

If the user specifies Combiner, the data will be merged after "internal sorting", making the result of Map job more compact.

Every time the data reaches the buffer threshold, a new "overflow file" will be generated. Therefore, multiple overflow files may be generated after the end of the Map task. These files are merged before the Shuffle process ends. The merged file data is well partitioned and sorted.

If there are more than 3 overflow files, Combine again after merging the overflow files. If it is less than 3, Hadoop does not think it necessary to Combine once.

In other words, the Combiner may be run repeatedly in the Shuffle to ensure the compactness of the Map job results as much as possible.
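These thresholds can be tuned per job. A hedged sketch using the Hadoop 2.x property names (the older io.sort.* keys map onto these; check mapred-default.xml for your version):

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 200);             // ring buffer size in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // spill once the buffer is 80% full
conf.setInt("mapreduce.map.combine.minspills", 3);         // min spill files before combining again
Job job = Job.getInstance(conf, "wc");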

The Reducer fetches its partitions of the (merged) spill files over HTTP and reads the data of those partitions.

Map output that never spilled to disk is handed to the Reduce job directly from memory.

InputFormat

We have covered the whole Map, Shuffle, and Reduce process. The next question is: how does MapReduce read and write data files?

MapReduce uses an InputFormat to read data.

At the start of a MapReduce job, the InputFormat generates InputSplits (logical chunks) and breaks each split into records, which become the Mapper's input.

The old-API org.apache.hadoop.mapred.InputFormat is an interface whose methods look roughly like this (signatures simplified):

/**
 * Logical segmentation of files.
 * Note that there is no physical segmentation. InputSplit includes the location of the file, where to start and where to end.
 */
InputSplit[] getSplits(JobConf job);

/**
 * Describes how the data is read, i.e. how much content is read at a time and how.
 * The RecordReader reports how much has been read, along with the Key and Value that were read, i.e. the actual record content.
 */
RecordReader<K, V> getRecordReader(InputSplit split, JobConf job);

By default MapReduce uses TextInputFormat, a subclass of FileInputFormat, to read files.

Besides TextInputFormat there is KeyValueTextInputFormat, which also reads text files: if each line contains a separator (tab by default), the line is split at that separator into a key and a value.

SequenceFileInputFormat reads sequence files, a binary format Hadoop uses to store data. It has two subclasses: SequenceFileAsBinaryInputFormat, which delivers keys and values as BytesWritable, and SequenceFileAsTextInputFormat, which delivers them as Text.

SequenceFileInputFilter reads only the records of a sequence file that pass a filter, specified via setFilterClass(). Three filters are built in: RegexFilter, PercentFilter, and MD5Filter. RegexFilter keeps records whose key matches a regular expression; PercentFilter keeps every f-th record, given a frequency parameter f; MD5Filter keeps records based on the MD5 hash of the key and a frequency parameter.

If you want to customize how a file is split and read, you can extend FileInputFormat.

FileInputFormat already implements the getSplits() method; it traverses the job's input files and computes their total size.
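As a quick illustration of a non-default reader, here is a hedged sketch of switching a job to KeyValueTextInputFormat and changing its separator (the property name below is the Hadoop 2.x one used by KeyValueLineRecordReader):

Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", " ");
Job job = Job.getInstance(conf, "kv");
job.setInputFormatClass(KeyValueTextInputFormat.class);
// The Mapper now receives Text/Text pairs: the key is the text before the
// first separator on the line, the value is the rest of the line.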

Custom InputFormat case: score statistics

Sometimes, the default InputFormat may not meet our needs. A typical example is reading multiple rows of data at once.

For example, suppose we want to compute each student's total score. The data format is as follows:

Zhang San
Chinese 97
Mathematics 77
English 69
Li Si
Chinese 87
Mathematics 57
English 63
Wang Wu
Chinese 47
Mathematics 54
English 39

In the Map task we need to read 4 lines of data at a time. Obviously the default InputFormat does not meet this need.

Implementing the logical splitting entirely by hand would make the code too complex, so we usually extend an existing class. In this example we can extend FileInputFormat, the base class for file input; that way we do not have to write the splitting code ourselves.

We do need to implement the createRecordReader() method ourselves. The RecordReader object it returns controls how many lines are read at a time and holds the key and value that were read.

RecordReader is an abstract class with several methods we must implement. The following implementation reads 4 lines at a time:

package cn.lazycat.bdd.hadoop.mapreduce.format;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

import java.io.IOException;

public class ScoreRecordReader extends RecordReader<Text, Text> {

    // Line reader for reading files
    private LineReader reader;

    // key per line
    private Text key;

    // value per line
    private Text value;

    // Because multiple rows need to be read at once, a separator is needed to distinguish different rows of data
    private static String separator = "|";

    // Contains next data
    private boolean hasNext = true;

    /**
     * Initialize RecordReader
     */
    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {

        // Get chunk object
        FileSplit fileSplit = (FileSplit) split;
        // Get the path of the file backing this split
        Path path = fileSplit.getPath();

        // Configuration object
        Configuration conf = new Configuration();
        // Get file system object
        FileSystem fs = path.getFileSystem(conf);
        // Get the file stream to be operated in the file system
        FSDataInputStream in = fs.open(path);
        // Create a row reader object
        // BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        reader = new LineReader(in);

    }

    /**
     * Judge whether the next key value exists.
     * If so, assign values to key and value.
     */
    @Override
    public boolean nextKeyValue() throws IOException {
        // You need to instantiate a pair of key values
        key = new Text();
        value = new Text();

        // Read 4 rows at a time
        Text cur = new Text();
        int len;
        for (int i = 0; i < 4; ++i) {
            len = reader.readLine(cur);
            if (len == 0) {  // Indicates that the last line has been read
                hasNext = false;   // There is no data
                break;
            }
            else {
                if (i == 0) {  // The first line of the 4 lines should be the key value
                    key.set(cur);
                }
                else {        // The remaining lines in line 4 should be appended to value
                    // Append to the end, with separator as the partition
                    value.append(cur.getBytes(), 0, cur.getLength());
                    value.append(separator.getBytes(), 0,
                            separator.length());
                }
            }
            cur.clear();
        }
        return hasNext;
    }

    /**
     * Returns the current key
     */
    @Override
    public Text getCurrentKey() {
        return key;
    }

    /**
     * Returns the current value
     */
    @Override
    public Text getCurrentValue() {
        return value;
    }

    /**
     * Returns the current progress
     */
    @Override
    public float getProgress() {
        return hasNext ? 0f : 1f;
    }

    /**
     * Close the method of execution
     */
    @Override
    public void close() throws IOException {
        reader.close();
    }
}

Custom InputFormat Code:

package cn.lazycat.bdd.hadoop.mapreduce.format;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ScoreInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
               TaskAttemptContext context) {
        return new ScoreRecordReader();
    }
}

With this in place, the Map task's input is Text/Text: the key is the student's name, and the value is all of that student's scores.

Here is the map code (no reduce is required):

@Override
protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {

    String name = key.toString();
    double sum = 0;
    String data[] = value.toString().split("\\|");

    for (String str : data) {
        String scoreStr = str.split(" ")[1];
        sum += Double.parseDouble(scoreStr);
    }
    context.write(new Text(name), new DoubleWritable(sum));
}

This InputFormat needs to be registered in the startup class:

job.setInputFormatClass(ScoreInputFormat.class);

Then you can see the output:

[root@hadoop1 ~]# hdcat /out/part-r-00000
Zhang San   243.0
Li Si   207.0
Wang Wu 140.0

MultipleInputs

MultipleInputs can combine several inputs and feed them all to the Map stage.

This class provides static methods:

// Specify the data source and InputFormat
addInputPath(job, path, inputFormatClass);
// Specify the data source, InputFormat and Mapper
addInputPath(job, path, inputFormatClass, mapperClass);

If the input lives in several different directories, we can add each of them by calling this method, and use a different InputFormat for data from different sources.

We can even specify a different mapperClass for each path, so that different Mappers process the files in different directories (see the sketch below).
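A hedged sketch of wiring two input directories, each with its own InputFormat and Mapper (classes from org.apache.hadoop.mapreduce.lib.input; the paths and ScoreMapper are illustrative, while WordCountMapper and ScoreInputFormat were defined earlier):

MultipleInputs.addInputPath(job, new Path("/park/words"),
        TextInputFormat.class, WordCountMapper.class);
MultipleInputs.addInputPath(job, new Path("/park/scores"),
        ScoreInputFormat.class, ScoreMapper.class);
// MultipleInputs configures the job's InputFormat internally, so
// FileInputFormat.addInputPath() is not called in addition.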

OutputFormat

At the end of a MapReduce job, the OutputFormat determines how the Reduce output is written.

Hadoop ships with a number of OutputFormats; if none is specified, TextOutputFormat is used by default.

Common OutputFormats:

  • FileOutputFormat: implements the OutputFormat interface for writing files.
    • MapFileOutputFormat: writes output as MapFiles, which carry a partial index.
    • SequenceFileOutputFormat: writes binary key-value data, optionally compressed.
    • SequenceFileAsBinaryOutputFormat: writes raw binary keys and values into a sequence file.
    • TextOutputFormat: a text format with one tab-delimited key-value pair per line.
    • MultipleOutputFormat: an abstract class that writes key-value pairs to files chosen per record.
      • MultipleTextOutputFormat: writes multiple text files in the standard line-per-record, tab-delimited format.
      • MultipleSequenceFileOutputFormat: writes multiple (optionally compressed) sequence files.

Use job.setOutputFormatClass(outputFormatClass) to specify which OutputFormat to use.

Sometimes the default OutputFormats do not meet our needs and we have to write our own. As with InputFormat, we usually extend the abstract class FileOutputFormat.

OutputFormat also declares the getOutputCommitter() and checkOutputSpecs() methods; FileOutputFormat already implements both, and you can override them if you need finer control.

In general, though, we only change the output format itself, so overriding the getRecordWriter() method is enough.

The RecordWriter object, the counterpart of RecordReader, defines how records are written.

WordCount upgrade

Next we upgrade the WordCount job so that the whole output is written on a single line in the form key1-value1#key2-value2#....

As usual, first define the RecordWriter class:

package cn.lazycat.bdd.hadoop.mapreduce.wordcount.format;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import java.io.IOException;

public class WordCountRecordWriter extends RecordWriter<Text, LongWritable> {
    
    private FSDataOutputStream out;

    /**
     * Need to receive output stream from FS
     */
    public WordCountRecordWriter(FSDataOutputStream out) {
        this.out = out;
    }

    @Override
    public void write(Text key, LongWritable value) throws IOException {
        // Output according to business requirements
        out.writeBytes(key.toString() + "-" + value + "#");   // writeBytes avoids the length prefix that writeUTF would insert
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        out.close();
    }
}

Then define the OutputFormat class:

package cn.lazycat.bdd.hadoop.mapreduce.wordcount.format;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountOutputFormat extends FileOutputFormat<Text, LongWritable> {

    @Override
    public RecordWriter<Text, LongWritable> getRecordWriter(TaskAttemptContext job)
            throws IOException {
        // Get conf object
        Configuration conf = job.getConfiguration();
        // Get path
        Path path = getDefaultWorkFile(job, "");
        // Get file system object
        FileSystem fs = path.getFileSystem(conf);

        FSDataOutputStream out = fs.create(path, false);

        return new WordCountRecordWriter(out);
    }
}

Of course, you need to register this OutputFormat in the startup class:

job.setOutputFormatClass(WordCountOutputFormat.class);

The output then looks like this:

I-3#a-1#am-1#cat-3#hadoop-2#hello-3#love-2#small-1#world-1#

MultipleOutputs

MultipleOutputs enables a Reduce job to produce multiple outputs.

We mainly use one static method and one instance method of this class:

// Register a named output in the driver; the tag, the OutputFormat, and the key/value classes are required
MultipleOutputs.addNamedOutput(job, "flag", outputFormatClass, keyClass, valueClass);
// Write to a named output; "flag" selects which output file, key and value are the record to write
outputs.write("flag", key, value);

To use it, keep an instance of MultipleOutputs as a field of the Reducer and initialize it in the Reducer's setup() method:

@Override
protected void setup(Context context) {
    outputs = new MultipleOutputs<>(context);
}

In the reduce method we can then produce different output files as the business requires. Instead of calling context.write(), we call outputs.write():

outputs.write("flag", key, value);

The named outputs must also be registered in the driver class:

MultipleOutputs.addNamedOutput(job, "flag", outputFormatClass, keyClass, valueClass);

Note that the tags registered here must match, one for one, the tags used in the Reducer.

This way, different output files can even use different OutputFormats.

The resulting output files are named like flag-r-00000.
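Putting it together, here is a hedged sketch of a Reducer that routes the traffic-statistics output into per-city files (the class name and tags are illustrative; FlowBean is the bean defined earlier, and every tag written must have been registered with addNamedOutput() in the driver):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CityFlowReducer
        extends Reducer<Text, FlowBean, Text, LongWritable> {

    private MultipleOutputs<Text, LongWritable> outputs;

    @Override
    protected void setup(Context context) {
        // Bind MultipleOutputs to this task's context
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        String city = "bj";                 // overwritten by the records below
        for (FlowBean bean : values) {
            sum += bean.getFlow();
            city = bean.getAddr();          // e.g. "bj", "sh", "sz"
        }
        // Route the record to the named output whose tag matches the city
        outputs.write(city, key, new LongWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        outputs.close();                    // flush and close all named outputs
    }
}

In the driver, one named output would be registered per city tag, for example:

MultipleOutputs.addNamedOutput(job, "bj", TextOutputFormat.class, Text.class, LongWritable.class);
MultipleOutputs.addNamedOutput(job, "sh", TextOutputFormat.class, Text.class, LongWritable.class);
MultipleOutputs.addNamedOutput(job, "sz", TextOutputFormat.class, Text.class, LongWritable.class);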

GroupingComparator: grouping

By default, after the Map finishes, values with the same key are merged (key1-value1, key1-value2 -> key1-(value1, value2)). This process is called grouping.

We can also specify our own grouping rule by writing a custom grouping comparator class.

In practice we usually extend WritableComparator (which implements the RawComparator interface that a grouping comparator must satisfy) and override its compare() method.

The prototype of this method is as follows:

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

b1 and b2 hold the two Map output keys (note: in serialized form), s1/s2 are their offsets and l1/l2 their lengths. Here we define the comparison rule for keys; two keys that compare as equal (return 0) end up in the same group.

Case: counting words that start with a-n vs. o-z

This case modifies the grouping rule of WordCount, which requires defining our own grouping comparator class.

package cn.lazycat.bdd.hadoop.mapreduce.wordcount.grouping;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;

import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

public class WCGroupingComparator extends WritableComparator {

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        Text key1 = new Text();
        Text key2 = new Text();

        // Deserialize b1 and b2
        deSerial(key1, b1, s1, l1);
        deSerial(key2, b2, s2, l2);

        // If key1 and key2 both start with a-n or o-z, they are considered equal
        if (chargeRegEquals(key1, key2, "[a-n][a-z]*")
                || chargeRegEquals(key1, key2, "[o-z][a-z]*")) {
            return 0;
        }
        else {  // Other cases are considered unequal
            return -1;
        }
    }

    // Deserialize byte array
    private void deSerial(Text text, byte[] b, int s, int l) {
        DataInput in;
        try {
            in = new DataInputStream(new ByteArrayInputStream(b, s, l));
            // Deserialization
            text.readFields(in);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Determines whether both text matches a regular expression
    private boolean chargeRegEquals(Text text1, Text text2, String regex) {
        return text1.toString().matches(regex)
                && text2.toString().matches(regex);
    }
}

Register this grouping comparator in the driver class:

job.setGroupingComparatorClass(WCGroupingComparator.class);

SortComparator: sorting on the Reduce side

Generally, sorting is done while the Map task runs (during the spill and merge), so we rarely need to sort again on the Reduce side.

But it can still be done.

This again means writing a WritableComparator subclass, much like the one described above; the difference is that compare() must establish an ordering, returning a negative number, zero, or a positive number.

Register the comparator by calling the job's setSortComparatorClass() method (a sketch follows).
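A minimal sketch, assuming a Text key, of a descending sort comparator and its registration:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescTextSortComparator extends WritableComparator {

    public DescTextSortComparator() {
        super(Text.class, true);   // true: let the base class deserialize keys for compare()
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -((Text) a).compareTo((Text) b);   // negate to reverse the natural (ascending) order
    }
}

// In the driver:
// job.setSortComparatorClass(DescTextSortComparator.class);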