Distributed computing framework: MapReduce

Posted by warren on Thu, 10 Feb 2022 19:39:51 +0100

Introduction:

MapReduce is a cluster-based, high-performance parallel computing platform: a software framework for parallel computation and a parallel programming model and methodology.

Characteristics:

① Reliable distribution: operations on the data set are distributed across multiple nodes in the cluster for reliability, and each node periodically reports back its completed tasks and latest status

② Encapsulated implementation details: developers program against the framework API and write only the distributed business logic

③ Cross-language programming capabilities are provided

Main functions of MapReduce:

1.1 Data partitioning and computing task scheduling

1.2 Data/code co-location

1.3 System optimization

1.4 Error detection and recovery

Operation process of MapReduce:

The execution of a MapReduce job mainly involves the following steps:

1) The job code is formally submitted and the input data source is split into input splits

2) The master schedules workers to execute map tasks

3) The map task in each worker reads its input split

4) The worker executes the map task and saves the task output locally

5) The master schedules workers to execute reduce tasks, and the reduce workers read the output files of the map tasks

6) The reduce tasks are executed and their output is saved to HDFS

Detailed explanation of the execution process

Map: reads data from the HDFS input directory, splits it into input splits, and distributes the splits to different machines. Each machine partitions its map output, sorts each partition by key, optionally performs local aggregation (Combiner), and saves the task output locally

Reduce: copies the map output data from remote nodes, then merges and processes the data by key, and finally writes the output

The Map stage consists of a number of Map Tasks; the process is as follows:
■ Input data format parsing: InputFormat
■ Input data processing: Mapper
■ Data partitioning: Partitioner (a minimal custom Partitioner sketch follows this list)
■ Sorting data by key
■ Local aggregation: Combiner (equivalent to a local reducer, optional)
■ Saving the task output locally
The Reduce stage consists of a number of Reduce Tasks; the process is as follows:
■ Remote copying of data
■ Sorting data by key and merging files
■ Data processing: Reducer
■ Data output format: OutputFormat
The process between the Mapper output of the Map stage and the reduce computation of the Reduce stage is usually called the shuffle
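
As an illustration of the Partitioner step above, here is a minimal sketch (the class and package names are hypothetical and not part of the WordCount example below): it routes words beginning with a-m to the first reducer and everything else to the second. It would be registered on the Job with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).

package com.tianliangedu.partitioner; // hypothetical package, mirroring the article's layout
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
// Custom Partitioner sketch: words starting with a-m go to reducer 0, all others to reducer 1
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // fall back to a single partition when only one reducer is configured
        if (numPartitions < 2) {
            return 0;
        }
        String word = key.toString();
        char first = word.isEmpty() ? 'z' : Character.toLowerCase(word.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}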

MapReduce Java API application

1. MapReduce development process

➢ Build a development environment (basically the same as the HDFS development environment)
➢ Code against the MapReduce framework; the program consists of three parts: Map, Reduce, and Driver
➢ Compile and package the source code, together with its dependencies, into a single jar
➢ Upload the jar to the runtime environment
➢ Run the hadoop jar command (now replaced by yarn jar; the new command is recommended for submitting jobs)
The specific submission command is:
yarn jar testhdfs-jar-with-dependencies.jar com.tianliangedu.driver.WordCount /tmp/tianliangedu/input /tmp/tianliangedu/output3
➢ View the execution process via the YARN web UI
➢ View the execution results

2. WordCount code implementation

Mapper: the encapsulation of the Map process in the MapReduce computing framework

Text: Hadoop's counterpart to Java's String class, used by Hadoop to process text strings

IntWritable: Hadoop's counterpart to Java's Integer class, used by Hadoop to process integers

Context: Hadoop's context object, used for operations such as emitting key/value output from the Map, accessing distributed cache data, and passing distributed parameters

StringTokenizer: a utility class for operating on String objects; it splits strings on whitespace characters
    2.1 Map class writing

    package com.tianliangedu.mapper;
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    public class MyTokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        // Temporarily store the word frequency count of each transmitted word, which is 1, eliminating the space for repeated applications
        private final static IntWritable one = new IntWritable(1);
        // Temporarily store the value of each transmitted word to save the space for repeated application
        private Text word = new Text();
        // The specific implementation of the core map method is to process < key, value > pairs one by one
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Initializes the StringTokenizer with the string value of each line
            StringTokenizer itr = new StringTokenizer(value.toString());
            // Loop to get each element separated by each blank character
            while (itr.hasMoreTokens()) {
                // Put each obtained element into the word Text object
                word.set(itr.nextToken());
                // Output the map one by one through the context object
                context.write(word, one);
            }
        }
    }

    2.2 Reduce class writing

    Reducer: the encapsulation of the Reduce process in the MapReduce computing framework

    package com.tianliangedu.reducer;
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    //The reduce class implements the reduce function
    public class IntSumReducer extends
                    Reducer<Text, IntWritable, Text, IntWritable> {
             private IntWritable result = new IntWritable();
             //The specific implementation of the core reduce method is processed one by one < key, list (V1, V2) >
             public void reduce(Text key, Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
                    //Temporarily calculate the sum in each key group
                    int sum = 0;
                    //Enhanced for, obtain the value of each element in the iterator in turn, that is, the word frequency value one by one
                    for (IntWritable val : values) {
                          //sum up each word frequency value in the key group
                          sum += val.get();
                    }
                    //Put the value completed by the key group sum into the result IntWritable so that the output can be serialized
                    result.set(sum);
                    //Output the calculation results one by one
                    context.write(key, result);
             }
       }

    2.3 Driver class writing

    ➢ Configuration: the same Configuration class used with HDFS; responsible for loading and passing parameters
    ➢ Job: the abstraction of one round of MapReduce work, i.e., the management of the entire execution process of a MapReduce job
    ➢ FileInputFormat: a utility class used to specify the input data path of the job
    ➢ FileOutputFormat: a utility class used to specify the output data path of the job
    package com.tianliangedu.driver;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import com.tianliangedu.mapper.MyTokenizerMapper;
    import com.tianliangedu.reducer.IntSumReducer;
    public class WordCountDriver {
       // How to start mr driver
       public static void main(String[] args) throws Exception {
           // Get cluster configuration parameters
           Configuration conf = new Configuration();
           // Set to this job instance
           Job job = Job.getInstance(conf, "dawn WordCount");
           // Specifies that the main class of this execution is WordCount
          job.setJarByClass(WordCountDriver.class);
           // Specify map class
          job.setMapperClass(MyTokenizerMapper.class);
           // Specify the combiner class or not. If specified, it is generally the same as the reducer class
          job.setCombinerClass(IntSumReducer.class);
           // Specify reducer class
          job.setReducerClass(IntSumReducer.class);
           // Specify the types of key and value of job output. If the output types of map and reduce are different, you need to reset the class types of key and value of map output
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
           // Specifies the path to the input data
          FileInputFormat.addInputPath(job, new Path(args[0]));
           // Specify the output path and require that the output path must not exist
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
           // Specify the job execution mode. The client submitting the task will not exit until the task is completed!
          System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
    }

    2.4 Running MapReduce in a locally simulated distributed computing environment

    Running remotely just to test code is cumbersome, and, like other newer frameworks that support simulating distributed computing locally, MapReduce has supported running in a local environment since version 2.x.

    For details, please refer to the auxiliary data set "06 - operation steps of simulating mapreduce parallel computing in local environment".
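
    As a minimal sketch (not taken from the referenced data set; the class name LocalWordCountDriver is hypothetical and the property names assume Hadoop 2.x), a local run can reuse the Mapper and Reducer above while pointing the job at the local framework and the local file system:

    package com.tianliangedu.driver; // hypothetical class, placed alongside the article's driver for illustration
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import com.tianliangedu.mapper.MyTokenizerMapper;
    import com.tianliangedu.reducer.IntSumReducer;
    public class LocalWordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // run the MapReduce framework in-process instead of on YARN
            conf.set("mapreduce.framework.name", "local");
            // read input and write output through the local file system
            conf.set("fs.defaultFS", "file:///");
            Job job = Job.getInstance(conf, "local WordCount");
            job.setJarByClass(LocalWordCountDriver.class);
            job.setMapperClass(MyTokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // args[0] and args[1] are local paths here; the output path must not already exist
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }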

    2.5 Packaging with Maven

    Package the code with the Maven command, based on the Maven plug-ins configured in the project.
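
    A typical packaging command (assuming the project's pom.xml is configured, e.g. via the maven-assembly-plugin, to produce the *-jar-with-dependencies.jar used below) would be:

    mvn clean package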

    2.6 Upload to the runtime environment

    Use the rz command to upload the runnable package to the cluster environment.

    2.7 Run the WordCount program

    The specific submission command is:
    yarn jar testhdfs-jar-with-dependencies.jar com.tianliangedu.driver.WordCount /tmp/tianliangedu/input /tmp/tianliangedu/output3

    2.8 Check the execution process

    The Web access address is: http://cluster1.hadoop:8088/ui2/#/yarn-apps/apps

    2.9 View the execution results

    3. Standard code implementation

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
//Start mr's driver class
public class WordCountDriver {
       //Map class, which implements the map function
       public static class MyTokenizerMapper extends
                    Mapper<Object, Text, Text, IntWritable> {
             //Temporarily store the word frequency count of each transmitted word, which is 1, eliminating the space for repeated applications
             private final static IntWritable one = new IntWritable(1);
             //Temporarily store the value of each transmitted word to save the space for repeated application
             private Text word = new Text();
             //The specific implementation of the core map method is to process < key, value > pairs one by one
             public void map(Object key, Text value, Context context)
                          throws IOException, InterruptedException {
                    //Initializes the StringTokenizer with the string value of each line
                    StringTokenizer itr = new StringTokenizer(value.toString());
                    //Loop to get each element separated by each blank character
                    while (itr.hasMoreTokens()) {
                          //Put each obtained element into the word Text object
                          word.set(itr.nextToken());
                          //Output the map one by one through the context object
                          context.write(word, one);
                    }
             }
       }
       //The reduce class implements the reduce function
       public static class IntSumReducer extends
                    Reducer<Text, IntWritable, Text, IntWritable> {
             private IntWritable result = new IntWritable();
             //The specific implementation of the core reduce method is processed one by one < key, list (V1, V2) >
             public void reduce(Text key, Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
                    //Temporarily calculate the sum in each key group
                    int sum = 0;
                    //Enhanced for, obtain the value of each element in the iterator in turn, that is, the word frequency value one by one
                    for (IntWritable val : values) {
                          //sum up each word frequency value in the key group
                          sum += val.get();
                    }
                    //Put the value completed by the key group sum into the result IntWritable so that the output can be serialized
                    result.set(sum);
                    //Output the calculation results one by one
                    context.write(key, result);
             }
       }
       //Start the driver method of mr
       public static void main(String[] args) throws Exception {
             //Get cluster configuration parameters
             Configuration conf = new Configuration();
             //Set to this job instance
             Job job = Job.getInstance(conf, "dawn WordCount");
             //By specifying the relevant bytecode object, find the main jar package to which it belongs
             job.setJarByClass(WordCountDriver.class);
             //Specify map class
             job.setMapperClass(MyTokenizerMapper.class);
             //Specify the combiner class or not. If specified, it is generally the same as the reducer class
             job.setCombinerClass(IntSumReducer.class);
             //Specify reducer class
             job.setReducerClass(IntSumReducer.class);
             //Specify the types of key and value of job output. If the output types of map and reduce are different, you need to reset the class types of key and value of map output
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             //Specifies the path to the input data
             FileInputFormat.addInputPath(job, new Path(args[0]));
             //Specify the output path and require that the output path must not exist
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             //Specify the job execution mode. The client submitting the task will not exit until the task is completed!
             System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
}

MapReduce Shell application

1. Second-level commands of MapReduce

mapred is the first-level command. Type mapred and press Enter to view the second-level commands:

2. Third-level commands of MapReduce

Type the first-level command mapred followed by any second-level command to view the third-level commands:

3. MapReduce shell application

  • View currently executing job tasks

  • First submit a WordCount task, and then use mapred job -list to view the task list

  • Kill the execution of a task

To terminate the execution of a task immediately for some reason, use mapred job -kill <job ID>.

Scenario: first submit a WordCount job, and then terminate it with mapred job -kill <job ID>:

  • View the log of a job

Using the mapred shell command, you can view a job's logs by its job ID.

The command format is mapred job -logs <job ID>.
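
For illustration, a typical sequence (the job ID shown is hypothetical; use the ID reported by mapred job -list) would be:

mapred job -list
mapred job -kill job_1644480000000_0001
mapred job -logs job_1644480000000_0001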

MapReduce technical characteristics 
1. Scale "out" horizontally, not "up" vertically
➢ Clusters are built from cheap, easily expandable low-end commodity servers rather than expensive, hard-to-expand high-end servers
➢ For large-scale data processing and storage, the emphasis is on the aggregate capacity of the cluster rather than the processing power of a single machine; capacity is increased by adding machine nodes horizontally
2. Failure is considered normal
➢ With large numbers of ordinary servers, node hardware and software failures are the norm
➢ A variety of effective error detection and recovery mechanisms are provided: when a compute node fails, its work is automatically transferred to other nodes, which seamlessly take over the failed node's computing tasks
➢ After a failed node recovers, it automatically and seamlessly rejoins the cluster without manual reconfiguration by the administrator
3. Move computation to the data (data locality)
➢ Code/data co-location is used: computation and data are placed on the same machine node or the same rack, exploiting data locality
➢ This avoids transferring data across machine nodes or racks and improves efficiency
4. Process data sequentially and avoid random access
➢ Sequential disk access is much faster than random access, so MapReduce is designed for sequential, disk-based processing of large-scale data
➢ By having the cluster's many storage nodes access data simultaneously, high-throughput parallel batch processing of large data sets is achieved
5. Speculative execution
➢ A job consists of several Map Tasks and Reduce Tasks, and the completion time of the whole job depends on the slowest task; some tasks may run very slowly because of hardware or software problems on their node
➢ With speculative execution, when a task is found to run much more slowly than the average, a backup attempt of that task is started and run in parallel, and whichever attempt finishes first has its result adopted (a configuration sketch follows this list)
6. Smooth and seamless scalability
➢ Compute nodes can be flexibly added or removed to adjust computing capacity
➢ Computing performance grows nearly linearly with the number of nodes
7. Hide the underlying system details from application developers
➢ Parallel programming is difficult: it requires handling complex and tedious details such as distributed storage management, data distribution, data communication and synchronization, and collection of results, in addition to multithreading
➢ MapReduce provides an abstraction that shields programmers from these system-level details; programmers only need to focus on the business logic, and the framework handles the actual execution
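
For reference, here is a minimal sketch (the helper class is hypothetical and not part of the article) of configuring speculative execution per job; the property names are the standard Hadoop 2.x keys, and true is the default for both:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
// Hypothetical helper: builds a Job with speculative execution explicitly configured
public class SpeculativeJobFactory {
    public static Job newJob(String name) throws IOException {
        Configuration conf = new Configuration();
        // allow backup attempts for straggling map tasks
        conf.setBoolean("mapreduce.map.speculative", true);
        // allow backup attempts for straggling reduce tasks
        conf.setBoolean("mapreduce.reduce.speculative", true);
        return Job.getInstance(conf, name);
    }
}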

Topics: Hadoop hdfs mapreduce