Some Notes on Hadoop Learning (4) - MapReduce
Some details about MapReduce
If you package the MapReduce program into a jar, copy it to Linux and run it with the command

java -cp xxx.jar <main class name>

it will fail with errors complaining about missing dependency jars.

Use the command hadoop jar xxx.jar <class name> instead. When the client main method is started on a cluster machine with hadoop jar xx.jar mr.wc.JobSubmitter, the hadoop jar command adds the jar packages and configuration files from the local Hadoop installation directory to the runtime classpath.

The new Configuration() statement in our client main method then loads the configuration files found on the classpath, which supply parameters such as fs.defaultFS, mapreduce.framework.name and yarn.resourcemanager.hostname, and all of the local Hadoop installation's jars are available as well.
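For example, with the class name used in the text:

java -cp xx.jar mr.wc.JobSubmitter       (fails: the Hadoop dependency jars are missing from the classpath)
hadoop jar xx.jar mr.wc.JobSubmitter     (works: the hadoop command puts the local Hadoop jars and config files on the classpath)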
MapReduce also has a local job runner: the job runs in a single JVM, simulating the map and reduce tasks with multiple threads, without being submitted to YARN.

Whether you submit a job from Linux or from Windows, it runs locally by default. If you want jobs submitted on Linux to go to YARN by default, you need to add the following to the configuration file hadoop/etc/hadoop/mapred-site.xml:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
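The cluster-related parameters mentioned above (fs.defaultFS, mapreduce.framework.name, yarn.resourcemanager.hostname) can also be set in the client code instead of in the XML files; a minimal sketch, with placeholder host names that are not from the original:

Configuration conf = new Configuration();
// file system that input/output paths refer to (placeholder address)
conf.set("fs.defaultFS", "hdfs://namenode:9000");
// submit the job to YARN instead of running it with the local job runner
conf.set("mapreduce.framework.name", "yarn");
// host where the ResourceManager runs (placeholder)
conf.set("yarn.resourcemanager.hostname", "resourcemanager-host");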
If the key or value is a class of your own, that class must implement Writable: in the overridden write method you convert the fields you want to serialize into binary and put them into the DataOutput parameter; the other overridden method, readFields, is used for deserialization.

Note that during deserialization an object is first constructed with the class's no-argument constructor and is then restored by the readFields method.

DataOutput is also a stream, wrapped by Hadoop; if you use it yourself, you wrap a FileOutputStream in it (for example through a DataOutputStream).

DataOutput writes a string with writeUTF("string"). With this encoding the length of the string is written in the first two bytes, before the string's bytes; because of character-encoding issues, Hadoop reads those two bytes first when parsing, so it knows how long the string is. If you instead wrote the raw bytes with write(string.getBytes()), there would be no way to tell how many bytes the string actually occupies.
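A minimal sketch of this difference using plain java.io streams (the file name demo.dat is just a placeholder):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteUtfDemo {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("demo.dat"))) {
            // writeUTF first writes the encoded length in two bytes, then the string bytes,
            // so a reader knows exactly how many bytes belong to the string
            out.writeUTF("hello");
            // write(byte[]) writes only the raw bytes; a reader cannot tell where the string ends
            out.write("hello".getBytes());
        }
    }
}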
In the reduce phase, when an object is written to HDFS, the object's toString method is called, so you can override toString in your class to control the output format.

For example, the following class can be serialized by Hadoop:
package mapreduce2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {

    private int up;       // upstream traffic
    private int down;     // downstream traffic
    private int sum;      // total traffic
    private String phone; // telephone number

    // no-argument constructor required by Hadoop for deserialization
    public FlowBean() {
    }

    public FlowBean(int up, int down, String phone) {
        this.up = up;
        this.down = down;
        this.sum = up + down;
        this.phone = phone;
    }

    public int getUp() { return up; }
    public void setUp(int up) { this.up = up; }
    public int getDown() { return down; }
    public void setDown(int down) { this.down = down; }
    public int getSum() { return sum; }
    public void setSum(int sum) { this.sum = sum; }
    public String getPhone() { return phone; }
    public void setPhone(String phone) { this.phone = phone; }

    @Override
    public void readFields(DataInput in) throws IOException {
        // note: fields must be read in the same order they were written
        this.up = in.readInt();
        this.down = in.readInt();
        this.sum = in.readInt();
        this.phone = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(this.up);
        out.writeInt(this.down);
        out.writeInt(this.sum);
        out.writeUTF(this.phone);
    }

    @Override
    public String toString() {
        return "phone number " + this.phone + " total traffic " + this.sum;
    }
}
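To illustrate how the overridden toString ends up in the output, here is a minimal, hypothetical reducer sketch (the class name FlowSumReducer and the grouping by phone number are assumptions, not from the original post); the default text output format calls toString on the key and the value when writing the reducer output to HDFS:

package mapreduce2;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// hypothetical reducer: key = phone number, values = the FlowBeans recorded for that number
public class FlowSumReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        int upSum = 0;
        int downSum = 0;
        for (FlowBean bean : values) {
            upSum += bean.getUp();
            downSum += bean.getDown();
        }
        // TextOutputFormat calls toString() on this FlowBean when writing the result to HDFS
        context.write(key, new FlowBean(upSum, downSum, key.toString()));
    }
}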
After a reduce task has processed all of its data, its cleanup method is also called once.

Application exercise: find the top n pages by total number of accesses.

Solution 1: use a single reduce task together with the cleanup method. During the reduce phase the results are not written to HDFS directly but accumulated in a TreeMap.

After the reduce calls are finished, the first entries of the TreeMap (the top n, five by default) are written to HDFS in cleanup.
package cn.edu360.mr.page.topn;

public class PageCount implements Comparable<PageCount> {

    private String page;
    private int count;

    public void set(String page, int count) {
        this.page = page;
        this.count = count;
    }

    public String getPage() { return page; }
    public void setPage(String page) { this.page = page; }
    public int getCount() { return count; }
    public void setCount(int count) { this.count = count; }

    @Override
    public int compareTo(PageCount o) {
        // order by count, descending; break ties by page name so equal counts are not lost
        return o.getCount() - this.count == 0
                ? this.page.compareTo(o.getPage())
                : o.getCount() - this.count;
    }
}
The Mapper class:

package cn.edu360.mr.page.topn;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageTopnMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // the page url is the second space-separated field of the line; emit (url, 1)
        String[] split = line.split(" ");
        context.write(new Text(split[1]), new IntWritable(1));
    }
}
The Reducer class:
package cn.edu360.mr.page.topn;
import java.io.IOException;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class PageTopnReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    TreeMap<PageCount, Object> treeMap = new TreeMap<>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        PageCount pageCount = new PageCount();
        pageCount.set(key.toString(), count);
        // accumulate the results in the TreeMap instead of writing them out immediately
        treeMap.put(pageCount, null);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // the configuration is available in cleanup, so the number of entries
        // to output (top.n) can be read from it
        Configuration conf = context.getConfiguration();
        int topn = conf.getInt("top.n", 5);

        Set<Entry<PageCount, Object>> entrySet = treeMap.entrySet();
        int i = 0;
        for (Entry<PageCount, Object> entry : entrySet) {
            context.write(new Text(entry.getKey().getPage()),
                    new IntWritable(entry.getKey().getCount()));
            i++;
            if (i == topn) {
                return;
            }
        }
    }
}
Then comes the JobSubmitter class. Note that the Configuration can be populated in several ways.

The first is to load a configuration file:

Configuration conf = new Configuration();
conf.addResource("xx-oo.xml");

and then put the parameter in the xx-oo.xml file:

<property>
    <name>top.n</name>
    <value>6</value>
</property>
The second way is to set the parameter directly in code:

// set the value directly
conf.setInt("top.n", 3);
// or pass it in as a command-line argument of the java main program
conf.setInt("top.n", Integer.parseInt(args[0]));
The third way is to read the parameter from a properties file:

Properties props = new Properties();
props.load(JobSubmitter.class.getClassLoader().getResourceAsStream("topn.properties"));
conf.setInt("top.n", Integer.parseInt(props.getProperty("top.n")));

and then configure the parameter in topn.properties:
top.n=5
The JobSubmitter class (with no cluster configuration it runs the job locally by default):

package cn.edu360.mr.page.topn;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class JobSubmitter {

    public static void main(String[] args) throws Exception {

        /**
         * Way 1: parameters are parsed from the *-site.xml files and other resources on the classpath
         */
        Configuration conf = new Configuration();
        conf.addResource("xx-oo.xml");

        /**
         * Way 2: set parameters in code
         */
        //conf.setInt("top.n", 3);
        //conf.setInt("top.n", Integer.parseInt(args[0]));

        /**
         * Way 3: read parameters from a properties file
         */
        /*Properties props = new Properties();
        props.load(JobSubmitter.class.getClassLoader().getResourceAsStream("topn.properties"));
        conf.setInt("top.n", Integer.parseInt(props.getProperty("top.n")));*/

        Job job = Job.getInstance(conf);

        job.setJarByClass(JobSubmitter.class);
        job.setMapperClass(PageTopnMapper.class);
        job.setReducerClass(PageTopnReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("F:\\mrdata\\url\\input"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\mrdata\\url\\output"));

        job.waitForCompletion(true);
    }
}
Additional Java background

Entries put into a TreeMap are sorted automatically by key.

There are two ways to customize the ordering. The first is to pass a Comparator to the constructor (the FlowBean used in this example is a variant with a phone/up/down constructor and a getAmountFlow method, not the class defined above):
import java.util.Comparator;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;

public class TreeMapTest {

    public static void main(String[] args) {
        TreeMap<FlowBean, String> tm1 = new TreeMap<>(new Comparator<FlowBean>() {
            @Override
            public int compare(FlowBean o1, FlowBean o2) {
                // if the total traffic of the two beans is the same, compare the phone numbers
                if (o2.getAmountFlow() - o1.getAmountFlow() == 0) {
                    return o1.getPhone().compareTo(o2.getPhone());
                }
                // otherwise order by total traffic, from large to small
                return o2.getAmountFlow() - o1.getAmountFlow();
            }
        });

        FlowBean b1 = new FlowBean("1367788", 500, 300);
        FlowBean b2 = new FlowBean("1367766", 400, 200);
        FlowBean b3 = new FlowBean("1367755", 600, 400);
        FlowBean b4 = new FlowBean("1367744", 300, 500);

        tm1.put(b1, null);
        tm1.put(b2, null);
        tm1.put(b3, null);
        tm1.put(b4, null);

        // iterate over the TreeMap entries
        Set<Entry<FlowBean, String>> entrySet = tm1.entrySet();
        for (Entry<FlowBean, String> entry : entrySet) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
The second is for the class itself to implement the Comparable interface:
package cn.edu360.mr.page.topn;

public class PageCount implements Comparable<PageCount> {

    private String page;
    private int count;

    public void set(String page, int count) {
        this.page = page;
        this.count = count;
    }

    public String getPage() { return page; }
    public void setPage(String page) { this.page = page; }
    public int getCount() { return count; }
    public void setCount(int count) { this.count = count; }

    @Override
    public int compareTo(PageCount o) {
        // order by count, descending; break ties by page name so equal counts are not lost
        return o.getCount() - this.count == 0
                ? this.page.compareTo(o.getPage())
                : o.getCount() - this.count;
    }
}
Original address https://www.cnblogs.com/wpbing/archive/2019/07/25/11242866.html