Setting up a Hadoop environment in Eclipse, with hands-on examples and resources

Posted by clown[NOR] on Fri, 21 Jan 2022 03:56:50 +0100

First, build the Hadoop 2.7.1 development environment for Eclipse. The resources used are linked as follows:

Install Hadoop 2.7.1 in a Windows environment

Building a Hadoop development environment in Eclipse

With these in place, we can develop Hadoop programs in Eclipse.

Contents

I. Introduction to the MapReduce model

1. Map and Reduce functions

2. MapReduce architecture

3. MapReduce workflow

4. MapReduce application execution process

II. MapReduce in practice

1. Data deduplication

2. Data sorting

3. Average score

4. Single-table association

III. Summary

 

I. Introduction to the MapReduce model

MapReduce abstracts the complex parallel computation running on a large-scale cluster into two functions: Map and Reduce. It follows a "divide and conquer" strategy: a large-scale dataset stored in the distributed file system is cut into many independent splits, and these splits can be processed in parallel by multiple Map tasks.

1. Map and Reduce functions
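
Conceptually, Map takes a <k1, v1> pair read from an input split and produces a set of intermediate <k2, v2> pairs; the framework then groups all values that share the same k2, and Reduce turns each group into the final <k3, v3> output pairs. As a minimal illustrative sketch only (the classic word count, which is not one of this article's exercises), the two functions look roughly like this with the Hadoop classes used later in this article:

// Minimal word-count sketch (illustrative only). It assumes the same imports as the
// later examples: org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper/Reducer,
// java.io.IOException and java.util.StringTokenizer.
public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // Map: emit (word, 1) for every word in the line
        }
    }
}

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // Reduce: sum the counts that the framework grouped under the same word
        }
        context.write(key, new IntWritable(sum)); // emit (word, total count)
    }
}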

2. MapReduce architecture

The MapReduce architecture is mainly composed of four parts: Client, JobTracker, TaskTracker and Task.

1) Client

The MapReduce program written by the user is submitted to the JobTracker through the Client, and the user can view the job's running status through the interfaces the Client provides.

2) JobTracker

JobTracker is responsible for resource monitoring and job scheduling. It monitors the health of all TaskTrackers and jobs; once a failure is found, it transfers the corresponding tasks to other nodes. JobTracker also tracks task execution progress, resource usage and other information and reports it to the task scheduler; when resources become idle, the scheduler chooses suitable tasks to use them.

3) TaskTracker

TaskTracker periodically reports the resource usage and task progress on its node to the JobTracker through a "heartbeat". At the same time, it receives commands sent by the JobTracker and performs the corresponding operations (such as starting a new Task or killing a Task). TaskTracker uses "slots" to divide up the resources (CPU, memory, etc.) on its node. A Task gets the chance to run only after it obtains a slot, and the job of the Hadoop scheduler is to allocate the idle slots on each TaskTracker to Tasks. There are two kinds of slots, Map slots and Reduce slots, which are used by Map Tasks and Reduce Tasks respectively.

4) Task

Tasks are divided into Map Tasks and Reduce Tasks, both of which are started by the TaskTracker.

3. MapReduce workflow

1) Workflow overview

 

  • There is no communication between different Map tasks
  • There will be no information exchange between different Reduce tasks
  • Users cannot explicitly send messages from one machine to another
  • All data exchange is realized through the MapReduce framework itself

2) MapReduce execution stages

4. MapReduce application execution process

 

II. MapReduce in practice

1. Data deduplication

"Data De duplication" is mainly to grasp and use the idea of parallelization to screen data meaningfully. The seemingly complicated tasks such as counting the number of data types on the big data set and calculating the access place from the website log will involve data De duplication.

1.1 Example description

Deduplicate the data in the data files; each line of a data file is one record. The sample input is as follows:

1) file1:

2012-3-1 a

2012-3-2 b

2012-3-3 c

2012-3-4 d

2012-3-5 a

2012-3-6 b

2012-3-7 c

2012-3-3 c

2) file2:

2012-3-1 b

2012-3-2 a

2012-3-3 b

2012-3-4 d

2012-3-5 a

2012-3-6 c

2012-3-7 d

2012-3-3 c

The sample output is as follows:

2012-3-1 a

2012-3-1 b

2012-3-2 a

2012-3-2 b

2012-3-3 b

2012-3-3 c

2012-3-4 d

2012-3-5 a

2012-3-6 b

2012-3-6 c

2012-3-7 c

1.2 Solution approach

Map stage: use the text of each line as the key of the output key-value pair (the value is left empty).

Reduce stage: the framework groups identical keys together during the shuffle, so each distinct line reaches the reducer exactly once; writing out each key once removes the duplicates (for example, "2012-3-5 a" appears in both file1 and file2 but is output only once).

1.3 Code

package datadeduplicate.pers.xls.datadeduplicate;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.log4j.BasicConfigurator;

public class Deduplication {
    public static void main(String[] args) throws Exception {
    			BasicConfigurator.configure(); //Automatically and quickly use the default Log4j environment
    			//The classes of self-defined mapper and reducer must be passed, the path of input and output must be specified, and the type of output < K3, V3 > must be specified
    			//1 create the Job first: it needs a Configuration and a job name
                Configuration conf=new Configuration();
                String jobName=Deduplication.class.getSimpleName();
                Job job = Job.getInstance(conf, jobName);
                //2 assemble the customized MyMapper and MyReducer together
                //3 read the input from HDFS: FileInputFormat lives in the mapreduce.lib.input package
                FileInputFormat.setInputPaths(job, new Path(args[0]));
                //4 specify the class that parses the input into <K1, V1> key-value pairs
                //*This can be omitted, because TextInputFormat is already the default input format
                job.setInputFormatClass(TextInputFormat.class);
                //5 specify the custom mapper class
                job.setMapperClass(MyMapper.class);
                //6 specify the type of key2 and value2 output from the map < K2, V2 >
                //*The following two calls can be omitted when the <K2, V2> types are the same as the <K3, V3> types
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(Text.class);
                //7 partitioning (1 partition by default), sorting and grouping use the defaults; the combiner (reduction) here reuses the reducer class
                job.setCombinerClass(MyReducer.class);
                //Next, take the reduce step
                //8 specify a custom reduce class
                job.setReducerClass(MyReducer.class);
                //9 specifies the < K3, V3 > type of output
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);
                //10 specify the class that outputs < K3, V3 >
                //*This call can be omitted, because TextOutputFormat is already the default output format
                job.setOutputFormatClass(TextOutputFormat.class);
                //11 specify output path
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                //12 when packaging the program to run on the cluster, register the driver class with the job
                job.setJarByClass(Deduplication.class);
                //13 submit the MapReduce job to the resource manager and wait for it to finish
                job.waitForCompletion(true);
    }
    private static class MyMapper extends Mapper<Object, Text, Text, Text>{
        private static Text line=new Text();
        @Override
        protected void map(Object k1, Text v1,Mapper<Object, Text, Text, Text>.Context context) throws IOException, InterruptedException {
            line=v1;//v1 is each row of data and assigned to line
            context.write(line, new Text(""));
         }
    }
    private static class MyReducer extends Reducer<Text, Text, Text, Text>
    {
        @Override
        protected void reduce(Text k2, Iterable<Text> v2s,Reducer<Text, Text, Text, Text>.Context context) throws IOException, InterruptedException {
             context.write(k2, new Text(""));
         }
    }
}

1.4 Running results

Package the project into a runnable jar, and upload the input data files to the HDFS file system:
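
For reference, the upload can be done with a command of the form hdfs dfs -put file1 file2 /dedup_in (the target directory name here is only illustrative; use whatever input path will be passed to the job).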

 

On the Linux system, enter the hadoop command in the terminal and run the jar on the established Hadoop node:
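
For example, a command of the following form submits the job (the jar name and the HDFS paths are only illustrative):

hadoop jar deduplication.jar datadeduplicate.pers.xls.datadeduplicate.Deduplication /dedup_in /dedup_out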

Check the output folder of the HDFS file system in Eclipse: the previously specified output directory has been generated, and part-r-00000 holds the output of the run.

2. Data sorting
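
A brief description of this exercise, inferred from the code below: the input files contain one integer per line, and the job outputs all of the numbers in ascending order, each preceded by its rank.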

package dararank.pers.xls.datarank;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.log4j.BasicConfigurator;
import java.io.IOException;
 
public class DataRank {
    /**
     * The Mapper outputs each number read from the data file directly as the map output key
     */
    public static class forSortedMapper extends Mapper<Object, Text, IntWritable, IntWritable> {
        private IntWritable mapperValue = new IntWritable(); //Holds the number parsed from each line; it becomes the map output key
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString(); //Get the read value and convert it to String
            mapperValue.set(Integer.parseInt(line)); //Convert String to Int type
            context.write(mapperValue,new IntWritable(1)); //Mark each occurrence of the number as (number, 1)
        }
    }
 
    /**
     * The Reducer writes the rank as the output key and the input number as the output value
     */
 public static class forSortedReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>{
        private IntWritable postion = new IntWritable(1); //Stores the current rank, starting from 1
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            for (IntWritable item :values){ //The same number may appear several times, so each occurrence is written out with its own rank
                context.write(postion,key); //Write rank and specific number
                System.out.println(postion + "\t"+ key);
                postion = new IntWritable(postion.get()+1); //Rank plus 1
            }
        }
    }
 
 
    public static void main(String[] args) throws Exception {
 
    	BasicConfigurator.configure(); //Automatically and quickly use the default Log4j environment
        
    	Configuration conf = new Configuration(); //Set the configuration of MapReduce
        String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
        if(otherArgs.length < 2){
            System.out.println("Usage: datarank <in> [<in>...] <out>");
            System.exit(2);
        }
        //Set job
        //Job job = new Job(conf);
        Job job = Job.getInstance(conf);
        job.setJarByClass(DataRank.class);
        job.setJobName("DataRank");
        //Set the class that handles map and reduce
        job.setMapperClass(forSortedMapper.class);
        job.setReducerClass(forSortedReducer.class);
        //Set the output key and value types
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        //Set I / O path
        for (int i = 0; i < otherArgs.length-1;++i){
            FileInputFormat.addInputPath(job,new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length-1]));
        System.exit(job.waitForCompletion(true)?0:1);
    }
 
}
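
How the sorting happens: the mapper only emits each number as a key, and the MapReduce framework itself sorts keys during the shuffle, so with the default single reduce task the reduce input is already in ascending order and the reducer merely attaches ranks. As an illustrative trace (not the output of a real run): an input containing the lines 2, 6, 6 and 3 would yield the pairs (1, 2), (2, 3), (3, 6) and (4, 6), with the rank first and the number second, separated by a tab in the actual output.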

3. Average score
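
A brief description of this exercise, inferred from the code below: the input consists of "name score" records (for example, one file per subject); the job gathers all of a student's scores and outputs the student's name together with the average score.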

package averagescoreapp.pers.xls.averagescoreapp;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.log4j.BasicConfigurator;
 
/**
 * Average score
 *
 */
public class AverageScoreApp {
 
	public static class Map extends Mapper<Object, Text, Text, IntWritable>{
		@Override
		protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
			//The structure of the score records is:
			// Zhang San    80
			// Li Si        82
			// Wang Wu      86
			StringTokenizer tokenizer = new StringTokenizer(value.toString(), "\n");
			while(tokenizer.hasMoreElements()) {
				StringTokenizer lineTokenizer = new StringTokenizer(tokenizer.nextToken());
				String name = lineTokenizer.nextToken(); //name
				String score = lineTokenizer.nextToken();//score
				context.write(new Text(name), new IntWritable(Integer.parseInt(score)));
			}
		}
	}
	
	public static class Reduce extends Reducer<Text, IntWritable, Text, DoubleWritable>{
		@Override
		protected void reduce(Text key, Iterable<IntWritable> values,Reducer<Text, IntWritable, Text, DoubleWritable>.Context context)
				throws IOException, InterruptedException {
			//reduce the data structure entered here is:
			// Zhang San < 80,85,90 >
			// Li Si < 82,88,94 >
			// Wang Wu < 86,80,92 >
			int sum = 0;//Total score of all courses
			double average = 0;//Average score
			int courseNum = 0; //Number of courses
			for(IntWritable score:values) {
				sum += score.get();
				courseNum++;
			}
			average = (double) sum / courseNum; //use floating-point division so the average is not truncated
			context.write(new Text(key), new DoubleWritable(average));
		}
	}
	
	public static void main(String[] args) throws Exception{
		BasicConfigurator.configure(); //Automatically and quickly use the default Log4j environment
		Configuration conf = new Configuration();
		String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
        if(otherArgs.length < 2){
            System.out.println("Usage: AverageScoreRank <in> [<in>...] <out>");
            System.exit(2);
        }
		Job job = Job.getInstance(conf);
		job.setJarByClass(AverageScoreApp.class);
		job.setMapperClass(Map.class);
		job.setReducerClass(Reduce.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(DoubleWritable.class);
		
		 //Set input / output path
        for (int i = 0; i < otherArgs.length-1;++i){
            FileInputFormat.addInputPath(job,new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length-1]));
		
		System.exit(job.waitForCompletion(true)?0:1);
	}
 
}
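
As an illustrative trace (not the output of a real run): with the record structure shown in the comments, a student appearing as "Zhang San 80", "Zhang San 85" and "Zhang San 90" across the subject files would be reduced to a single line pairing "Zhang San" with 85.0, since the division above is done in floating point.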

4. Single-table association
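
A brief description of this exercise, inferred from the code below: the input is a single child-parent table whose first line is the header "child parent"; the job joins the table with itself (a self-join) to list every grandchild-grandparent pair. For example, assuming rows such as "Tom Lucy" and "Lucy Mary", the output would contain "Tom Mary".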

package singletabblerelation.pers.xls.singletablerelation;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.log4j.BasicConfigurator;

public class SingleTableRelation {
    public static int time = 0;
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable key, Text value, Context context)throws java.io.IOException, InterruptedException {
        	// Identification of left and right tables
            int relation;
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            String child = tokenizer.nextToken();
            String parent = tokenizer.nextToken();
            if (child.compareTo("child") != 0) {
                // Left table
                relation = 1;
                context.write(new Text(parent), new Text(relation + "+" + child));
                // Right table
                relation = 2;
                context.write(new Text(child), new Text(relation + "+" + parent));
            }
        }

    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values,
                Reducer<Text, Text, Text, Text>.Context output)
                throws java.io.IOException, InterruptedException {
            int grandchildnum = 0;
            int grandparentnum = 0;
            List<String> grandchilds = new ArrayList<>();
            List<String> grandparents = new ArrayList<>();
            /** Output header */
            if (time == 0) {
                output.write(new Text("grandchild"), new Text("grandparent"));
                time++;
            }
            for (Text val : values) {
                String record = val.toString();
                char relation = record.charAt(0);
                // Take out the child corresponding to the key at this time
                if (relation == '1') {
                    String child = record.substring(2);
                    grandchilds.add(child);
                    grandchildnum++;
                }
                // Take out the parent corresponding to the key at this time
                else {
                    String parent = record.substring(2);
                    grandparents.add(parent);
                    grandparentnum++;
                }
            }
            if (grandchildnum != 0 && grandparentnum != 0) {
                for (int i = 0; i < grandchildnum; i++)
                    for (int j = 0; j < grandparentnum; j++)
                        output.write(new Text(grandchilds.get(i)), new Text(
                                grandparents.get(j)));
            }

        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
			BasicConfigurator.configure(); //Automatically and quickly use the default Log4j environment
			//The custom Mapper and Reducer classes must be set, the input and output paths must be specified, and the output <K3, V3> types must be declared
            Configuration conf=new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
            if(otherArgs.length < 2){
                System.out.println("Usage: SingleTableRelation <in> [<in>...] <out>");
                System.exit(2);
            }
            String jobName=SingleTableRelation.class.getSimpleName();
            //Create the Job: it needs a Configuration and a job name
             Job job = Job.getInstance(conf, jobName);
	        job.setJarByClass(SingleTableRelation.class);
	        job.setMapperClass(Map.class);
	        job.setReducerClass(Reduce.class);
	        job.setOutputKeyClass(Text.class);
	        job.setOutputValueClass(Text.class);
	        //Set I / O path
	        for (int i = 0; i < otherArgs.length-1;++i){
	            FileInputFormat.addInputPath(job,new Path(otherArgs[i]));
	        }
	        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length-1]));       
	        System.exit((job.waitForCompletion(true) ? 0 : 1));
    }
}
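
How the self-join works: for every child-parent record, the map emits the record twice, once keyed by the parent with a "1+child" value (left table) and once keyed by the child with a "2+parent" value (right table). For any person used as a key, the reducer therefore receives that person's children (tag 1) and that person's parents (tag 2) in a single group, and the Cartesian product of the two lists is exactly the set of grandchild-grandparent pairs that pass through that person. Note that the header line relies on the static time counter, so it is written once per reduce task (once in total with the default single reducer).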

III. Summary

Hadoop is a distributed infrastructure that achieves efficient computing and storage in a distributed way; its core design lies in HDFS and MapReduce.
HDFS implements a distributed file system on the cluster, while MapReduce implements distributed computing and task processing on the cluster. HDFS provides file operation and storage support during MapReduce task processing, and MapReduce implements the distribution, tracking and execution of tasks on top of HDFS and collects the results; the two interact to accomplish the main tasks of a Hadoop distributed cluster.
Through these four hands-on exercises, I have further mastered the application of the Hadoop architecture in real-world scenarios.

Topics: Eclipse Hadoop hdfs mapreduce