In an earlier post, the blogger shared the implementation and principle of the WordCount word-statistics example in MapReduce. This post continues with another classic case, mobile traffic analysis, to help readers understand and use the Hadoop platform in practical work.
I. Requirements
Below is a mobile traffic log. From it we need to compute, for each mobile phone number, the upstream traffic, the downstream traffic, and the total traffic.
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com Video website 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn Information safety 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com Site statistics 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com Search engines 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com Site statistics 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com Comprehensive portal 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com Comprehensive portal 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com Search engines 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com Search engines 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 2 2 120 120 200
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157993055 13560436666 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
II. Implementation
Field description: in the log above, the second column is the mobile phone number; counting from the end of a line, the third-to-last and second-to-last columns are the upstream traffic and downstream traffic respectively, and the last column is the HTTP status code. Note that the number of middle columns varies from record to record (some records carry a URL and a category, others do not), so the traffic fields must be indexed from the end of the line rather than from the beginning.
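To make the indexing-from-the-end point concrete, here is a minimal standalone sketch (plain Java, no Hadoop required) that parses one tab-separated record the same way the mapper below does; the sample line is taken from the log above:

import java.util.Arrays;

public class ParseDemo {
    public static void main(String[] args) {
        // A record without a URL column; other records have more middle columns,
        // which is why the traffic fields are indexed from the end of the line.
        String line = "1363157995052\t13826544101\t5C-0E-8B-C7-F1-E0:CMCC\t120.197.40.4\t4\t0\t264\t0\t200";
        String[] fields = line.split("\t");
        String phoneNbr = fields[1];                             // second column
        long upFlow = Long.parseLong(fields[fields.length - 3]); // third-to-last column
        long dFlow = Long.parseLong(fields[fields.length - 2]);  // second-to-last column
        System.out.println(phoneNbr + "\t" + upFlow + "\t" + dFlow + "\t" + (upFlow + dFlow));
        // Prints: 13826544101	264	0	264
    }
}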
First, create the input directory on HDFS:

hdfs dfs -mkdir -p /user/hadoop/flowcount
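Then upload the log file into it (the local file name flow.log here is only an example; use whatever your log file is called):

hdfs dfs -put flow.log /user/hadoop/flowcount
hdfs dfs -ls /user/hadoop/flowcount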
FlowBean (analysis output bean)
package com.empire.hadoop.mr.flowcount;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/**
 * FlowBean.java: the traffic bean. Because MapReduce needs to serialize
 * intermediate results for transmission, the class must implement the
 * Writable interface; if sorting were also needed, it would implement the
 * WritableComparable interface instead.
 *
 * @author arron November 24, 2018, 9:40:40 PM
 */
public class FlowBean implements Writable {

    private long upFlow;
    private long dFlow;
    private long sumFlow;

    //During deserialization the no-arg constructor is called by reflection,
    //so it must be defined explicitly
    public FlowBean() {
    }

    public FlowBean(long upFlow, long dFlow) {
        this.upFlow = upFlow;
        this.dFlow = dFlow;
        this.sumFlow = upFlow + dFlow;
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getdFlow() {
        return dFlow;
    }

    public void setdFlow(long dFlow) {
        this.dFlow = dFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    /**
     * Serialization method
     */
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(dFlow);
        out.writeLong(sumFlow);
    }

    /**
     * Deserialization method. Note: the fields must be read in exactly the
     * same order in which they were written
     */
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        dFlow = in.readLong();
        sumFlow = in.readLong();
    }

    @Override
    public String toString() {
        return upFlow + "\t" + dFlow + "\t" + sumFlow;
    }
}
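As the class comment notes, Writable is enough for plain serialization, but sorting requires WritableComparable. For reference, a minimal sketch of that variant (not used in this job): the class declaration would become public class FlowBean implements WritableComparable<FlowBean> (importing org.apache.hadoop.io.WritableComparable), write() and readFields() would stay exactly as above, and one method would be added; the descending-by-total-traffic ordering is only an example:

    @Override
    public int compareTo(FlowBean o) {
        // Example ordering: larger total traffic sorts first
        return Long.compare(o.getSumFlow(), this.getSumFlow());
    }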
MapReduce main program
package com.empire.hadoop.mr.flowcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * FlowCount.java: analyzes the mobile log to produce the total upstream
 * traffic, total downstream traffic and total traffic for each mobile
 * phone number
 *
 * @author arron November 24, 2018, 9:43:23 PM
 */
public class FlowCount {

    static class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            //Convert the line to a string
            String line = value.toString();
            //Split into fields
            String[] fields = line.split("\t");
            //Take out the phone number
            String phoneNbr = fields[1];
            //Take out the upstream and downstream traffic, indexed from the
            //end of the line because the number of middle columns varies
            long upFlow = Long.parseLong(fields[fields.length - 3]);
            long dFlow = Long.parseLong(fields[fields.length - 2]);

            context.write(new Text(phoneNbr), new FlowBean(upFlow, dFlow));
        }
    }

    static class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

        //Input groups look like: <183323, bean1><183323, bean2><183323, bean3>...
        @Override
        protected void reduce(Text key, Iterable<FlowBean> values, Context context)
                throws IOException, InterruptedException {
            long sum_upFlow = 0;
            long sum_dFlow = 0;

            //Traverse all beans, accumulating the upstream and downstream
            //traffic separately
            for (FlowBean bean : values) {
                sum_upFlow += bean.getUpFlow();
                sum_dFlow += bean.getdFlow();
            }

            FlowBean resultBean = new FlowBean(sum_upFlow, sum_dFlow);
            context.write(key, resultBean);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        /*
         * conf.set("mapreduce.framework.name", "yarn");
         * conf.set("yarn.resourcemanager.hostname", "mini1");
         */
        Job job = Job.getInstance(conf);

        /* job.setJar("/home/hadoop/wc.jar"); */
        //Specify the local path of this program's jar
        job.setJarByClass(FlowCount.class);

        //Specify the Mapper/Reducer business classes this job uses
        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);

        //Specify the kv types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        //Specify the kv types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        //Specify the directory of the job's original input files
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        //Specify the directory for the job's output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        //Submit the job configuration and the jar containing the job's java
        //classes to yarn to run
        /* job.submit(); */
        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
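One design note on the mapper: it allocates a new Text and a new FlowBean for every input record, which is simple and correct. A common MapReduce optimization, shown below purely as a sketch (it is not part of the original code), reuses the output objects across map() calls; this is safe because context.write() serializes the key and value immediately, so they can be refilled on the next call:

    // Sketch of an object-reuse variant of FlowCountMapper (an optimization,
    // not the original code). Reduces per-record garbage collection pressure.
    static class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

        private final Text outKey = new Text();
        private final FlowBean outValue = new FlowBean();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            long upFlow = Long.parseLong(fields[fields.length - 3]);
            long dFlow = Long.parseLong(fields[fields.length - 2]);

            //Refill the reused objects instead of allocating new ones
            outKey.set(fields[1]);
            outValue.setUpFlow(upFlow);
            outValue.setdFlow(dFlow);
            outValue.setSumFlow(upFlow + dFlow);
            context.write(outKey, outValue);
        }
    }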
III. Packaging and running
Package and run the program in the same way as in the previous WordCount post.
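Assuming the packaged jar is named flowcount.jar (an illustrative name; use whatever your build produces), the job takes the input and output directories as arguments, and the output directory must not already exist:

hadoop jar flowcount.jar com.empire.hadoop.mr.flowcount.FlowCount /user/hadoop/flowcount /user/hadoop/flowcountount

The run log is as follows: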
18/11/25 06:03:38 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/11/25 06:03:39 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/11/25 06:03:39 INFO input.FileInputFormat: Total input files to process : 5
18/11/25 06:03:39 INFO mapreduce.JobSubmitter: number of splits:5
18/11/25 06:03:40 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/11/25 06:03:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1543096217465_0001
18/11/25 06:03:41 INFO impl.YarnClientImpl: Submitted application application_1543096217465_0001
18/11/25 06:03:41 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1543096217465_0001/
18/11/25 06:03:41 INFO mapreduce.Job: Running job: job_1543096217465_0001
18/11/25 06:03:51 INFO mapreduce.Job: Job job_1543096217465_0001 running in uber mode : false
18/11/25 06:03:51 INFO mapreduce.Job:  map 0% reduce 0%
18/11/25 06:04:00 INFO mapreduce.Job:  map 20% reduce 0%
18/11/25 06:04:13 INFO mapreduce.Job:  map 100% reduce 0%
18/11/25 06:04:14 INFO mapreduce.Job:  map 100% reduce 100%
18/11/25 06:04:15 INFO mapreduce.Job: Job job_1543096217465_0001 completed successfully
18/11/25 06:04:15 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=4171
        FILE: Number of bytes written=1193767
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=11574
        HDFS: Number of bytes written=594
        HDFS: Number of read operations=18
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Killed map tasks=1
        Launched map tasks=5
        Launched reduce tasks=1
        Data-local map tasks=5
        Total time spent by all maps in occupied slots (ms)=79442
        Total time spent by all reduces in occupied slots (ms)=11115
        Total time spent by all map tasks (ms)=79442
        Total time spent by all reduce tasks (ms)=11115
        Total vcore-milliseconds taken by all map tasks=79442
        Total vcore-milliseconds taken by all reduce tasks=11115
        Total megabyte-milliseconds taken by all map tasks=81348608
        Total megabyte-milliseconds taken by all reduce tasks=11381760
    Map-Reduce Framework
        Map input records=110
        Map output records=110
        Map output bytes=3945
        Map output materialized bytes=4195
        Input split bytes=624
        Combine input records=0
        Combine output records=0
        Reduce input groups=21
        Reduce shuffle bytes=4195
        Reduce input records=110
        Reduce output records=21
        Spilled Records=220
        Shuffled Maps =5
        Failed Shuffles=0
        Merged Map outputs=5
        GC time elapsed (ms)=1587
        CPU time spent (ms)=4710
        Physical memory (bytes) snapshot=878612480
        Virtual memory (bytes) snapshot=5069615104
        Total committed heap usage (bytes)=623616000
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=10950
    File Output Format Counters
        Bytes Written=594
Analysis results:
[hadoop@centos-aaron-h1 ~]$ hadoop fs -ls /user/hadoop/flowcountount
Found 2 items
-rw-r--r--   2 hadoop supergroup          0 2018-11-25 06:04 /user/hadoop/flowcountount/_SUCCESS
-rw-r--r--   2 hadoop supergroup        594 2018-11-25 06:04 /user/hadoop/flowcountount/part-r-00000
[hadoop@centos-aaron-h1 ~]$ hadoop fs -cat /user/hadoop/flowcountount/part-r-00000
13480253104    900      900      1800
13502468823    36675    551745   588420
13560436666    5580     4770     10350
13560439658    10170    29460    39630
13602846565    9690     14550    24240
13660577991    34800    3450     38250
13719199419    1200     0        1200
13726230503    12405    123405   135810
13726238888    12405    123405   135810
13760778710    600      600      1200
13826544101    1320     0        1320
13922314466    15040    18600    33640
13925057413    55290    241215   296505
13926251106    1200     0        1200
13926435656    660      7560     8220
15013685858    18295    17690    35985
15920133257    15780    14680    30460
15989002119    9690     900      10590
18211575961    7635     10530    18165
18320173382    47655    12060    59715
84138413       20580    7160     27740
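A quick sanity check on these numbers: the run log shows "Total input files to process : 5" and "Map input records=110", i.e., the input was five copies of the 22-record sample, so each number's totals are five times its per-sample sums. For example, 13560439658 appears twice per copy, with upstream 1116 + 918 = 2034 and downstream 954 + 4938 = 5892; multiplying by 5 gives 10170 and 29460, and 10170 + 29460 = 39630, exactly matching its output row.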
That is all the content of this post. If you found the article helpful, please give it a like; if you are interested in big data or other server-side technologies, please follow the blogger, and you are welcome to exchange ideas with the blogger at any time.