Some Notes for Hadoop Learning (4) - MapReduce

Posted by toddg on Thu, 25 Jul 2019 07:35:47 +0200


Some details about mapreduce

If you package the MapReduce program into a jar and try to run it on Linux with the command java -cp xxx.jar <main class name>, an error is usually reported: the required Hadoop dependency jars are missing from the classpath.

Instead, use the command hadoop jar xxx.jar <class name>. When the client main method is started on a cluster machine with hadoop jar xx.jar mr.wc.JobSubmitter, the hadoop jar command adds the jar packages and configuration files from the Hadoop installation directory on that machine to the runtime classpath.

Then the new Configuration() statement in our client main method loads the configuration files found on the classpath, so parameters such as fs.defaultFS, mapreduce.framework.name and yarn.resourcemanager.hostname are picked up automatically, and all the jar packages of the local Hadoop installation are referenced as well.
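As a quick illustration (a minimal sketch, not part of the original post; the printed values depend on your own *-site.xml files), the client can check what the Configuration picked up from the classpath:

    Configuration conf = new Configuration();
    System.out.println(conf.get("fs.defaultFS"));               // e.g. hdfs://<namenode>:9000
    System.out.println(conf.get("mapreduce.framework.name"));   // "yarn", or "local" if not configured
    System.out.println(conf.get("yarn.resourcemanager.hostname"));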

MapReduce also has a local job runner: in standalone mode it simulates the map and reduce tasks with multiple threads instead of submitting the job to YARN.

If you submit a job from Linux or Windows without any extra configuration, it runs locally by default.

If you want jobs submitted from Linux to go to YARN by default, add the following to the configuration file hadoop/etc/hadoop/mapred-site.xml:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
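The same choice can also be made in client code (a minimal sketch, assuming you prefer to set it there rather than in mapred-site.xml; the hostname and port below are placeholders):

    Configuration conf = new Configuration();
    conf.set("mapreduce.framework.name", "yarn");           // or "local" for the local job runner
    conf.set("yarn.resourcemanager.hostname", "hdp-01");    // placeholder hostname
    conf.set("fs.defaultFS", "hdfs://hdp-01:9000");         // placeholder NameNode address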

If a key or value is a class of your own, that class must implement Writable: the overridden write method converts the fields you want to serialize into binary and writes them into its DataOutput parameter, and the other overridden method, readFields, is used for deserialization.

Notice that during deserialization an object is first constructed with the class's no-argument constructor, and then its fields are restored by the readFields method.
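Conceptually (a minimal sketch, not the framework's actual code; dataInput stands for the DataInput the framework supplies), deserialization works like this:

    FlowBean bean = new FlowBean();   // the no-argument constructor is called first
    bean.readFields(dataInput);       // then the fields are restored from the stream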

DataOutput is also a stream interface, wrapped by Hadoop; when you use it yourself, you attach an underlying stream such as a FileOutputStream so the bytes actually get written somewhere.

DataOutput uses writeUTF("string") to write a string. Encoded this way, the string is prefixed with its length (two bytes). Because of character encoding, Hadoop reads those first two bytes when parsing to learn how long the string is. If you instead used write(string.getBytes()), the reader would not know how many bytes the string actually occupies.
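For example (a minimal sketch, not from the original post, using the plain java.io classes; demo.dat is just a placeholder file name):

    try (DataOutputStream out = new DataOutputStream(new FileOutputStream("demo.dat"))) {
        out.writeUTF("hello");             // a 2-byte length followed by the string bytes
        out.write("hello".getBytes());     // only the raw bytes, with no length information
    }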

In the reduce phase, if an object is written to HDFS, its toString method is called, so you can override the toString method of your class.

For example, the following class can be serialized by Hadoop:

package mapreduce2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {

private int up;//Upstream traffic
private int down;//Downstream flow
private int sum;//Total flow
private String phone;//Telephone number

//A no-argument constructor is required so Hadoop can create the object during deserialization
public FlowBean() {
}

public FlowBean(int up, int down, String phone) {
    this.up = up;
    this.down = down;
    this.sum = up + down;
    this.phone = phone;
}
public int getUp() {
    return up;
}
public void setUp(int up) {
    this.up = up;
}
public int getDown() {
    return down;
}
public void setDown(int down) {
    this.down = down;
}
public int getSum() {
    return sum;
}
public void setSum(int sum) {
    this.sum = sum;
}
public String getPhone() {
    return phone;
}
public void setPhone(String phone) {
    this.phone = phone;
}
@Override
public void readFields(DataInput di) throws IOException {
    //Note that the order of reading is the same as that of writing.
    this.up = di.readInt();
    this.down = di.readInt();
    this.sum = di.readInt();
    this.phone = di.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
    out.writeInt(this.up);
    out.writeInt(this.down);
    out.writeInt(this.sum);
    out.writeUTF(this.phone);
}
@Override
public String toString() {
    return "Telephone number"+this.phone+" Total flow"+this.sum;
}

}

When a ReduceTask has processed all of its input (that is, after its last call to reduce), its cleanup method is called once.
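To show where cleanup fits (a minimal sketch, not part of the exercise below; imports omitted, they are the same as in the reducer shown later), the lifecycle methods of a Reducer run in this order:

public class LifecycleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        //called once, before the first reduce() call of this ReduceTask
    }
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        //called once per key group
    }
    @Override
    protected void cleanup(Context context) {
        //called once, after the last reduce() call of this ReduceTask
    }
}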

Application exercise: compute page-access statistics and output the top N pages by total number of accesses.

Solution 1: use only one reduce task together with the cleanup method. In the reduce phase, do not write results directly to HDFS; store them in a TreeMap instead.

After the reduce calls have finished, cleanup writes the first N entries of the TreeMap (5 by default, see the top.n parameter below) to HDFS.

package cn.edu360.mr.page.topn;

public class PageCount implements Comparable<PageCount>{


private String page;
private int count;

public void set(String page, int count) {
    this.page = page;
    this.count = count;
}

public String getPage() {
    return page;
}
public void setPage(String page) {
    this.page = page;
}
public int getCount() {
    return count;
}
public void setCount(int count) {
    this.count = count;
}

@Override
public int compareTo(PageCount o) {
    //Sort by count in descending order; if counts are equal, compare page names
    return o.getCount() - this.count == 0 ? this.page.compareTo(o.getPage()) : o.getCount() - this.count;
}


}

The Mapper class:

package cn.edu360.mr.page.topn;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageTopnMapper extends Mapper<LongWritable, Text, Text, IntWritable>{


@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    String[] split = line.split(" ");
    context.write(new Text(split[1]), new IntWritable(1));
}

}

The Reducer class:

package cn.edu360.mr.page.topn;

import java.io.IOException;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PageTopnReducer extends Reducer<Text, IntWritable, Text, IntWritable>{


TreeMap<PageCount, Object> treeMap = new TreeMap<>();

@Override
protected void reduce(Text key, Iterable<IntWritable> values,
        Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
    int count = 0;
    for (IntWritable value : values) {
        count += value.get();
    }
    PageCount pageCount = new PageCount();
    pageCount.set(key.toString(), count);
    
    treeMap.put(pageCount,null);
    
}
@Override
protected void cleanup(Context context)
        throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    //The number of entries to output can be read from the configuration in cleanup
    int topn = conf.getInt("top.n", 5);
    
    
    Set<Entry<PageCount, Object>> entrySet = treeMap.entrySet();
    int i= 0;
    
    for (Entry<PageCount, Object> entry : entrySet) {
        context.write(new Text(entry.getKey().getPage()), new IntWritable(entry.getKey().getCount()));
        i++;
        if(i==topn) return;
    }   
}

}

Then comes the JobSubmitter class. Note that there are several ways to configure the Configuration.

The first is to load a configuration file:

    Configuration conf = new Configuration();
    conf.addResource("xx-oo.xml");

Then define the property in the xx-oo.xml file:

<configuration>
    <property>
        <name>top.n</name>
        <value>6</value>
    </property>
</configuration>


The second is to set the parameter directly in code:

    //Set the value directly
    conf.setInt("top.n", 3);
    //Or take it from the arguments passed to the java main program
    conf.setInt("top.n", Integer.parseInt(args[0]));

The third is to read the parameter from a properties file:

    Properties props = new Properties();
    props.load(JobSubmitter.class.getClassLoader().getResourceAsStream("topn.properties"));
    conf.setInt("top.n", Integer.parseInt(props.getProperty("top.n")));

Then configure the parameters in topn.properties

    top.n=5

The JobSubmitter class (it runs locally by default):

package cn.edu360.mr.page.topn;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSubmitter {

public static void main(String[] args) throws Exception {

    /**
     * Parsing parameters by loading the *-site.xml file under classpath
     */
    Configuration conf = new Configuration();
    conf.addResource("xx-oo.xml");
    
    /**
     * Setting parameters by code
     */
    //conf.setInt("top.n", 3);
    //conf.setInt("top.n", Integer.parseInt(args[0]));
    
    /**
     * Getting parameters through property profile
     */
    /*Properties props = new Properties();
    props.load(JobSubmitter.class.getClassLoader().getResourceAsStream("topn.properties"));
    conf.setInt("top.n", Integer.parseInt(props.getProperty("top.n")));*/
    
    Job job = Job.getInstance(conf);

    job.setJarByClass(JobSubmitter.class);

    job.setMapperClass(PageTopnMapper.class);
    job.setReducerClass(PageTopnReducer.class);
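    //Solution 1 relies on a single ReduceTask; 1 is the default number of reduce tasks,
    //or it can be set explicitly with job.setNumReduceTasks(1)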

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(job, new Path("F:\\mrdata\\url\\input"));
    FileOutputFormat.setOutputPath(job, new Path("F:\\mrdata\\url\\output"));

    job.waitForCompletion(true);

}

}

Additional Java knowledge

TreeMap: the entries put into it are automatically sorted by key.

There are two ways to customize the ordering of a TreeMap. The first is to pass in a Comparator (note that the FlowBean used in this example is a slightly different version whose constructor takes the phone number first and which has a getAmountFlow() method):

import java.util.Comparator;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;

public class TreeMapTest {


public static void main(String[] args) {
    
    TreeMap<FlowBean, String> tm1 = new TreeMap<>(new Comparator<FlowBean>() {
        @Override
        public int compare(FlowBean o1, FlowBean o2) {
            //If the total traffic of the two classes is the same, the phone numbers will be compared.
            if( o2.getAmountFlow()-o1.getAmountFlow()==0){
                return o1.getPhone().compareTo(o2.getPhone());
            }
            //If traffic differs, order from large to small
            return o2.getAmountFlow()-o1.getAmountFlow();
        }
    });
    FlowBean b1 = new FlowBean("1367788", 500, 300);
    FlowBean b2 = new FlowBean("1367766", 400, 200);
    FlowBean b3 = new FlowBean("1367755", 600, 400);
    FlowBean b4 = new FlowBean("1367744", 300, 500);
    
    tm1.put(b1, null);
    tm1.put(b2, null);
    tm1.put(b3, null);
    tm1.put(b4, null);
    //Traverse the entries of the TreeMap
    Set<Entry<FlowBean,String>> entrySet = tm1.entrySet();
    for (Entry<FlowBean,String> entry : entrySet) {
        System.out.println(entry.getKey() +"\t"+ entry.getValue());
    }
}

}

The second way is to implement the Comparable interface in the class itself:

package cn.edu360.mr.page.topn;

public class PageCount implements Comparable<PageCount>{


private String page;
private int count;

public void set(String page, int count) {
    this.page = page;
    this.count = count;
}

public String getPage() {
    return page;
}
public void setPage(String page) {
    this.page = page;
}
public int getCount() {
    return count;
}
public void setCount(int count) {
    this.count = count;
}

@Override
public int compareTo(PageCount o) {
    //Sort by count in descending order; if counts are equal, compare page names
    return o.getCount() - this.count == 0 ? this.page.compareTo(o.getPage()) : o.getCount() - this.count;
}


}

Original address https://www.cnblogs.com/wpbing/archive/2019/07/25/11242866.html

Topics: Java Hadoop Apache xml