MapReduce notes - serialization case

Posted by lnt on Tue, 04 Jan 2022 19:29:57 +0100

Serialization

When hosts transmit data to each other, they cannot send an object directly to another host; the content of the object has to be packed into a packet in some form and sent over the network. The most common approach is to encode the object as a string (or byte sequence) whose layout both sides have agreed on in advance, so that each host can reconstruct the object information sent by the other. Serialization also frees the communication from the restriction that both sides use the same language.

Case

Case preparation

The case runs in stand-alone mode.

Case description: count the uplink traffic, downlink traffic and total traffic of each phone number. Read every line from the input file, keep the useful fields, sum them per phone number, and finally write the result to the specified output path.

The input file is as follows. One field (the domain name) is optional, and fields are separated by "\t" (TAB). The fields are [phone number, IP, domain name, uplink traffic, downlink traffic, status code].

13012345678    192.168.200.1    www.xunxun.games        1345    14567    200
13012345679    192.168.200.1    2345    94567    200
13012345672    192.168.200.1    www.xunxun.games        2345    34567    200
13012345678    192.168.200.1    www.xunxun.games        4345    64567    200
13012345674    192.168.200.1    www.xunxun.games        7345    84567    200
13012345678    192.168.200.1    www.xunxun.games        4345    24567    200
13012345674    192.168.200.1    www.xunxun.games        5345    24567    200

The output file is as follows. The fields are [phone number, uplink traffic, downlink traffic, total traffic].

13012345672    2345    34567    36912
13012345674    12690    109134    121824
13012345678    10035    103701    113736
13012345679    2345    94567    96912
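
For example, the three input records for 13012345678 contribute 1345 + 4345 + 4345 = 10035 of uplink traffic and 14567 + 64567 + 24567 = 103701 of downlink traffic, for a total of 113736, which is the third output line above.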

Code writing

FlowBean

To transfer the uplink and downlink traffic between hosts, we first need an object that carries the required fields.

Then create a package com.xunn.mapreduce.writable under the java folder, and create a class named FlowBean in it. The details are as follows; be careful not to import the wrong packages.

There are a few points to note when writing the class:

  1. Implement the Writable interface so that the class can be serialized
  2. Because the class is used to transfer data, override the serialization and deserialization methods (serialization: write, deserialization: readFields), and keep the field order identical in both
  3. Provide a no-argument constructor
  4. Override the toString method so the output can be printed in the format agreed by both sides
  5. Generate getters and setters for the private fields, and add an extra setSumFlow() overload that simply adds upFlow and downFlow

package com.xunn.mapreduce.writable;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;


public class FlowBean implements Writable {

    // Uplink, downlink and total traffic
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    // No-argument constructor
    public FlowBean() {
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    public void setSumFlow() {
        this.sumFlow = this.upFlow + this.downFlow;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        // Serialization order must be consistent with deserialization order
        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }
}
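
To see the write/readFields symmetry in isolation, here is a minimal sketch (not part of the case code; the class name RoundTripDemo is just for illustration) that round-trips a FlowBean through an in-memory byte stream, which is essentially what the framework does when it ships the object between hosts. ByteArrayOutputStream simply stands in for the network.

package com.xunn.mapreduce.writable;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RoundTripDemo {
    public static void main(String[] args) throws IOException {
        FlowBean original = new FlowBean();
        original.setUpFlow(1345);
        original.setDownFlow(14567);
        original.setSumFlow();

        // Serialization: the sender writes the fields into a byte stream
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        // Deserialization: the receiver reads the fields back in the same order
        FlowBean copy = new FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));

        System.out.println(copy); // prints the tab-separated fields: 1345 14567 15912
    }
}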

Mapper

Compared with the case in the previous article, the main addition here is the encapsulation step, and we also need a way to handle or bypass the optional field when splitting each line. The rest is the same as before and is not repeated here.
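The mapper below bypasses it by indexing from the end of the split array: a complete line splits into 6 fields [phone, IP, domain, up, down, status], so split[split.length - 3] and split[split.length - 2] point at the uplink and downlink traffic; a line without the domain name splits into only 5 fields, and the same expressions still land on the traffic fields, whereas a fixed index such as split[3] would not.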

package com.xunn.mapreduce.writable;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    private Text outK = new Text();
    private FlowBean outV = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {

        // Get one line, e.g.: 13012345678  192.168.200.1  www.xunxun.games  2345  34567  200
        String line = value.toString();

        // Split the line, e.g.: [13012345678, 192.168.200.1, www.xunxun.games, 2345, 34567, 200]
        String[] split = line.split("\t");

        // Grab the desired fields, e.g.: 13012345678, [2345, 34567]
        String phone = split[0];
        String up = split[split.length - 3]; // indexing from the end bypasses the optional field that may be missing
        String down = split[split.length - 2];

        // encapsulation
        outK.set(phone);
        outV.setUpFlow(Long.parseLong(up));
        outV.setDownFlow(Long.parseLong(down));
        outV.setSumFlow();

        // write
        context.write(outK, outV);
    }
}

Reducer

The reducer is mainly responsible for adding up the data. Grouping records with the same phone number (the key) is done by the framework itself with its own map-like data structure, so we do not have to do it ourselves; this was already seen in the previous case. The difference this time is that the values are a collection of FlowBean objects: iterate over them, extract upFlow and downFlow from each, and accumulate the sums.

Then encapsulate the totals and write the result.
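For example, for the key 13012345674 the reducer receives the values (7345, 84567) and (5345, 24567); the uplink and downlink totals are 12690 and 109134, and setSumFlow() gives 121824, matching the output shown earlier.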

package com.xunn.mapreduce.writable;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    private FlowBean outV = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context) throws IOException, InterruptedException {

        // Traversal set values
        long totalUp = 0;
        long totalDown = 0;
        for (FlowBean value : values) {
            totalUp += value.getUpFlow();
            totalDown += value.getDownFlow();
        }

        // encapsulation
        outV.setUpFlow(totalUp);
        outV.setDownFlow(totalDown);
        outV.setSumFlow();

        // write
        context.write(key, outV);
    }
}

Driver

The Driver follows the same procedure as in the previous case; just specify the types in the corresponding parameters. If anything is unclear, see the first two MapReduce articles. The Driver code is as follows.

package com.xunn.mapreduce.writable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        // Get job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Set jar package
        job.setJarByClass(FlowDriver.class);

        // Associate mapper and reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        // Set the output kv of the map
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // Set the final output kv
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Set input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\DARoom\\Hadoop\\somefiles\\inputD2"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\DARoom\\Hadoop\\somefiles\\output3"));

        // Submit the job; passing true prints progress information while it runs
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

When the input data and output path are ready, you can run in stand-alone mode.
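In stand-alone mode the result appears as a file named part-r-00000 under the output directory; note that Hadoop refuses to start the job if the output directory already exists, so delete output3 before re-running.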

Topics: Big Data mapreduce