Hadoop 2.6.5 Mapper class source code analysis

Posted by warydig on Mon, 31 Jan 2022 03:11:39 +0100

Mapper class

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package org.apache.hadoop.mapreduce;

import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Stable;

@Public
@Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Mapper() {
    }

    protected void setup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    }

    protected void map(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        context.write(key, value);
    }

    protected void cleanup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    }

    public void run(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        this.setup(context);

        try {
            while(context.nextKeyValue()) {
                this.map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            this.cleanup(context);
        }

    }

    public abstract class Context implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context() {
        }
    }
}

The Mapper class has four main methods plus an inner Context class:

  • protected void setup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context)
  • protected void map(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context)
  • protected void cleanup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context)
  • public void run(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context)
  • public abstract class Context implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

1. setup()

protected void setup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
}

It is called once at the start of the task to perform initialization work, such as loading global files or establishing database connections.

2. cleanup()

protected void cleanup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
}

It is called once at the end of the task to perform finishing work, such as closing files, closing database connections, or emitting any remaining key-value pairs after all map() calls have completed.

3. map()

protected void map(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
	context.write(key, value);
}

This is the method that subclasses of Mapper are expected to override; the processing logic of the map phase goes here. The default implementation is the identity function: it writes each input key-value pair to the output unchanged.

The map method takes three parameters. Mapper itself has four type parameters — KEYIN, VALUEIN, KEYOUT and VALUEOUT — which represent the input key and value types and the output key and value types, respectively. Context is the task context: it carries the job's configuration and status, receives the map results, and provides the write() method used to emit them.

Four Context methods are important to map() and run():

  • context.nextKeyValue();
    Reads the next input pair; its return value is not the key-value pair itself but a boolean indicating whether more data was read
  • context.getCurrentKey();
    Returns the key read by the most recent nextKeyValue() call
  • context.getCurrentValue();
    Returns the value read by the most recent nextKeyValue() call
  • context.write(key, value);
    Emits a key-value pair as the output of the map stage
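To see how these four calls cooperate with run(), setup() and cleanup(), here is a minimal plain-Java simulation. It deliberately avoids any Hadoop dependency: SimpleContext and SimpleMapper are made-up stand-ins for this sketch, not Hadoop classes, but the run() skeleton is copied from the decompiled source above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MapperProtocolDemo {

    // Plays the role of Mapper.Context: iterates input pairs and collects output.
    static class SimpleContext {
        private final Iterator<String[]> input;
        private String[] current;
        final List<String> output = new ArrayList<>();

        SimpleContext(List<String[]> pairs) { this.input = pairs.iterator(); }

        // Like context.nextKeyValue(): advances and reports whether a pair was read.
        boolean nextKeyValue() {
            if (!input.hasNext()) return false;
            current = input.next();
            return true;
        }
        String getCurrentKey()   { return current[0]; }
        String getCurrentValue() { return current[1]; }
        void write(String key, String value) { output.add(key + "=" + value); }
    }

    // Plays the role of Mapper: run() has the same skeleton as the Hadoop source.
    static class SimpleMapper {
        protected void setup(SimpleContext context)   { context.write("setup", "once"); }
        protected void map(String key, String value, SimpleContext context) {
            context.write(key, value.toUpperCase());   // overridden "user logic"
        }
        protected void cleanup(SimpleContext context) { context.write("cleanup", "once"); }

        public void run(SimpleContext context) {
            setup(context);                            // called exactly once
            try {
                while (context.nextKeyValue()) {       // once per input pair
                    map(context.getCurrentKey(), context.getCurrentValue(), context);
                }
            } finally {
                cleanup(context);                      // called exactly once
            }
        }
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[]{"0", "hello"});
        pairs.add(new String[]{"6", "world"});
        SimpleContext ctx = new SimpleContext(pairs);
        new SimpleMapper().run(ctx);
        System.out.println(ctx.output); // prints [setup=once, 0=HELLO, 6=WORLD, cleanup=once]
    }
}
```

Running the demo shows the once-only bracketing of setup/cleanup around the per-record map calls.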

4. run()

public void run(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
	this.setup(context);

	try {
		while(context.nextKeyValue()) {
			this.map(context.getCurrentKey(), context.getCurrentValue(), context);
		}
	} finally {
		this.cleanup(context);
	}
}

In the run method, setup() executes first, then map() processes the data, and after all the data has been processed, cleanup() finishes up. It is worth noting that setup() and cleanup() are invoked by the framework exactly once per task, not once per record like map().

run() implements the template method design pattern: the Mapper class fixes the order in which its methods execute, and users subclass it and override individual methods to meet different needs. Because the skeleton of the algorithm is encapsulated, users can concentrate on the specific logic of each step.

The run method is called by default during job execution, and its flow matches what we would expect: initialize first, call map() on each key-value pair while input data remains, and finally execute the teardown method.

Template method pattern:
The pattern distinguishes the template method from the basic (primitive) methods. Here run() is the template method: it calls the other methods to realize the overall logic and is not meant to be altered — template methods are conventionally declared final so that subclasses cannot override them (although Hadoop actually leaves run() overridable). setup(), map() and cleanup() are the basic methods: they are declared protected to indicate that subclasses are expected to implement or override them.

In addition, the pattern includes the concept of a hook method: the superclass grants the subclass an authorization to alter part of the basic flow by overriding the hook, which is sometimes very useful for tailoring specific logic.
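The terms above can be made concrete with a small, Hadoop-independent sketch. Report, ShortReport and the method names are all invented for this example: build() is the final template method, body()/header()/footer() are the basic methods, and includeFooter() is a hook a subclass may override to skip a step.

```java
// Template method pattern: a final method fixes the algorithm's order,
// protected basic methods supply the steps, and a hook gates an optional step.
abstract class Report {
    // Template method: final, so subclasses cannot change the order of steps.
    public final String build() {
        StringBuilder sb = new StringBuilder();
        sb.append(header());
        sb.append(body());
        if (includeFooter()) {        // hook method controls the optional step
            sb.append(footer());
        }
        return sb.toString();
    }

    protected abstract String body();                   // must be supplied by subclass
    protected String header() { return "[header]"; }    // overridable default
    protected String footer() { return "[footer]"; }    // overridable default
    protected boolean includeFooter() { return true; }  // hook with a default answer
}

class ShortReport extends Report {
    @Override protected String body() { return "[body]"; }
    @Override protected boolean includeFooter() { return false; } // subclass uses the hook
}

public class TemplateMethodDemo {
    public static void main(String[] args) {
        System.out.println(new ShortReport().build()); // prints [header][body]
    }
}
```

The analogy to Mapper: build() corresponds to run(), and the protected steps correspond to setup()/map()/cleanup().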

5. Context

public abstract class Context implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
	public Context() {
	}
}

The Context class implements the MapContext interface; MapContext in turn extends TaskInputOutputContext.


Topics: Big Data, Hadoop, MapReduce