Java HDFS API programming II

Posted by alecapone on Thu, 23 Dec 2021 23:39:28 +0100

Design patterns in Java: the template pattern

The template pattern defines the skeleton of an algorithm (the general flow, expressed as abstract steps) and hands the concrete implementation over to subclasses.
In other words, the template only defines which steps run and in what order; the template method does not care how each step is implemented. The concrete implementation is completed by subclasses: there can be many subclasses, and each one can implement different behavior.

Define a template class:

package com.ruozedata.pattern.template;

public abstract class Mapper {

    //setUp(), mapper() and clearUp() are abstract methods;
    //they will be implemented later by concrete subclasses

    /**
     * Initialization operation: open the refrigerator door
     */
    abstract void setUp();

    /**
     * Specific business logic: put elephants, dogs, pigs, etc
     */
    abstract void mapper();

    /**
     * Operation of resource release: close the refrigerator door
     */
    abstract void clearUp();


    /**
     * The template method: run() fixes the execution order of the steps above:
     * initialization, business logic, then resource release
     */
    public void run(){
        setUp();

        mapper();

        clearUp();
    }
}

Define a subclass that implements the abstract methods of the template abstract class:

package com.ruozedata.pattern.template;

public class SubMapper extends Mapper{
    void setUp() {
        System.out.println("SubMapper.setUp");
    }

    void mapper() {
        System.out.println("SubMapper.mapper");
    }

    void clearUp() {
        System.out.println("SubMapper.clearUp");
    }
}

Define a second subclass that also implements the abstract methods of the template abstract class; it has the same structure as the subclass above, but it can implement different behavior:

package com.ruozedata.pattern.template;

public class SubMapper2 extends Mapper{
    void setUp() {
        System.out.println("SubMapper2.setUp");
    }

    void mapper() {
        System.out.println("SubMapper2.mapper");
    }

    void clearUp() {
        System.out.println("SubMapper2.clearUp");
    }
}

Define a client class to run the subclasses implemented above:

package com.ruozedata.pattern.template;

public class Client {
    public static void main(String[] args) {
        SubMapper subMapper = new SubMapper();
        subMapper.run();

        SubMapper2 subMapper2 = new SubMapper2();
        subMapper2.run();
    }
}

Run results:

SubMapper.setUp
SubMapper.mapper
SubMapper.clearUp
SubMapper2.setUp
SubMapper2.mapper
SubMapper2.clearUp

Using the HDFS API to implement word count (WC)

Breaking down the requirement

Word frequency statistics means: given one file or a batch of files, count how many times each word occurs.
When you receive a feature like this, do not start by thinking about how to write the code. First analyze the feature and its requirements: describe the steps clearly as 1, 2, 3, 4 in plain language and write them down. Development is then just a matter of translating those steps into code. The high-level design, including which technical framework implements each step, is what matters, so the idea must be clear.
Now analyze this feature: use the HDFS API to implement word count (WC).
In addition, big data processing follows a three-part pattern:
1. Input
2. Processing
3. Output
Everything follows this flow.

Break the feature down as follows:
Step 1: input: use the HDFS API to read the files.

Step 2: processing: word frequency
1. Read the file content line by line. Split each line on a specified separator, which turns it into a list of words.
2. Assign each word an initial count of 1. For example, for the line wc,word,hello,word separated by commas, every occurrence gets a count of 1:
(wc,1)
(word,1)
(hello,1)
(word,1)
Each occurrence above counts as 1. What we want is the total number of occurrences of each word, so how do we add them up?
3. Put the split words into a cache, for example a map of <word, count>: when a word has appeared once it is <word, 1>, when it has appeared twice it is <word, 2>. That map is the cache.
4. Traverse the contents of the map cache. That is the word frequency result (a standalone sketch of this counting logic follows the step list below).

Step 3: output: write the result wherever you want it:
1. Print to the console
2. Write to the local file system
3. Write to the HDFS file system
4. Write to a MySQL database
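
To make step 2 concrete before any HDFS code is involved, here is a minimal, self-contained sketch of the counting logic applied to a single hard-coded line (the class name, the sample line and the variable names are illustrative only):

package com.ruozedata.hadoop.hdfs;

import java.util.HashMap;
import java.util.Map;

//Hypothetical standalone sketch of step 2: split a line and accumulate counts in a map
public class WordCountSketch {
    public static void main(String[] args) {
        //one sample line, comma-separated, as in the example above
        String line = "wc,word,hello,word";

        //the "cache": word -> number of occurrences
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : line.split(",")) {
            Integer value = counts.get(word);
            counts.put(word, value == null ? 1 : value + 1); //first time: 1, otherwise: previous count + 1
        }

        //traverse the cache: this is the word frequency result
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}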

The above skeleton has been defined. Let's implement it.

Code implementation

1. First, define a Mapper interface (it could also be an abstract class). It only declares the operation and does not care about the concrete implementation.

package com.ruozedata.hadoop.hdfs;

public interface Mapper {
    /**
     * map operates on elements one by one.
     * Here each element is one line of data: map is called once for each line read in.
     */
    public void map(String line,Context context);
}

This interface defines only a map method. Its job is to take one line of data; the intermediate data and the result data are placed in the Context cache. So line can be understood as one line of input and context as a cache that temporarily holds data.

2. Define the Context cache
Context holds a cacheMap object, a HashMap instance whose two type parameters are the key and the value. In other words it can store <key, value> pairs; that is the cache.
The code is as follows:

package com.ruozedata.hadoop.hdfs;

import java.util.HashMap;
import java.util.Map;

public class Context {
    private Map<Object,Object> cacheMap = new HashMap<Object, Object>();

    //get method
    public Map<Object, Object> getCacheMap() {
        return cacheMap;
    }

    /**
     * set method
     * Write data to cache
     * @param key  word
     * @param value frequency
     */
    public void write(Object key, Object value) {
        cacheMap.put(key, value);
    }
    /**
     * The number of times to get the word from the cache
     * @param key word
     * @return  frequency
     */
    public Object get(Object key) {
        return cacheMap.get(key);
    }
}
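
As a quick illustration of how this cache behaves (the ContextDemo class and the sample word are hypothetical, purely for illustration): looking up a key that is not in the cache returns null, and writing an existing key again simply overwrites the stored count.

package com.ruozedata.hadoop.hdfs;

//Hypothetical quick check of Context behavior
public class ContextDemo {
    public static void main(String[] args) {
        Context context = new Context();
        System.out.println(context.get("hello"));  //null: the word is not in the cache yet

        context.write("hello", 1);                 //first occurrence
        //read the current count, add 1 and write it back, exactly what WordCountMapper will do below
        context.write("hello", Integer.parseInt(context.get("hello").toString()) + 1);
        System.out.println(context.get("hello"));  //2
    }
}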

3. Define a class WordCountMapper that implements the Mapper interface above. The concrete logic is completed by WordCountMapper.
It receives one line of data and the cache. Split the line on spaces, which gives an array of words, then traverse the array. For each word, look up its value in the cache by key: if there is no value yet, write the word (the key) into the cache with a value of 1; if the word is already there, take out its value, add 1, and write it back.

package com.ruozedata.hadoop.hdfs;

public class WordCountMapper implements Mapper {

    public void map(String line, Context context) {
        String[] splits = line.split(" ");
        for (String word : splits) {
            Object value = context.get(word);
            if (null == value){ //the word is not in the cache yet
                context.write(word,1);
            } else { //the word exists: add 1 to the stored count and write it back
                context.write(word,Integer.parseInt(value.toString()) + 1);
            }
        }
    }

}

4. Define a class HDFSWCAPI01 that reads the files, processes the data, and then produces the output.
Create a Configuration, set the relevant HDFS properties on it, and obtain a FileSystem from it; with a Configuration you have a FileSystem entry point.

package com.ruozedata.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;


public class HDFSWCAPI01 {
    public static void main(String[] args) throws Exception{

        //Configuration and FileSystem
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS","hdfs://hadoop001:9000");
        configuration.set("dfs.replication","1");
        System.setProperty("HADOOP_USER_NAME","ruoze");
        FileSystem fileSystem = FileSystem.get(configuration);

        //Read data input
        Path input = new Path("/hdfsapi/test3/");

        WordCountMapper mapper = new WordCountMapper();
        Context context = new Context();

        //Recursive listing: the path may be a single file or a directory; a directory may contain many files, including files in subdirectories
        RemoteIterator<LocatedFileStatus> iterator = fileSystem.listFiles(input, true);

        //iterator.next() returns one file at a time; the counts from every file keep accumulating in the context cache
        while (iterator.hasNext()){
            LocatedFileStatus status = iterator.next();
            FSDataInputStream in = fileSystem.open(status.getPath());

            BufferedReader read = new BufferedReader(new InputStreamReader(in));

            String line = "";
            while ((line = read.readLine()) != null){
                System.out.println(line);

                mapper.map(line,context);

            }
            read.close();
            in.close();

            //Get the data from the context cache and loop over the <key, value> entries in cacheMap
            Map<Object, Object> cacheMap = context.getCacheMap();
            for (Map.Entry<Object, Object> entry : cacheMap.entrySet()) {
                System.out.println(entry.getKey() + "\t" + entry.getValue());
            }
        }
    }
}

Case-insensitive word count and polymorphism

To count words case-insensitively, just copy the WordCountMapper above into a new class CaseIgnoreWordCountMapper; the only change needed is a call to line.toLowerCase().

package com.ruozedata.hadoop.hdfs;

public class CaseIgnoreWordCountMapper implements Mapper {

    public void map(String line, Context context) {
        String[] splits = line.toLowerCase().split(" ");
        for (String word : splits) {
            Object value = context.get(word);
            if (null == value){ //the word is not in the cache yet
                context.write(word,1);
            } else { //the word exists: add 1 to the stored count and write it back
                context.write(word,Integer.parseInt(value.toString()) + 1);
            }
        }
    }
}

Then declare the variable as Mapper mapper = new CaseIgnoreWordCountMapper(); and case will be ignored when the program runs. This is also a use of polymorphism.

package com.ruozedata.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;


public class HDFSWCAPI01 {
    public static void main(String[] args) throws Exception{

        //Configuration and FileSystem
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS","hdfs://hadoop001:9000");
        configuration.set("dfs.replication","1");
        System.setProperty("HADOOP_USER_NAME","ruoze");
        FileSystem fileSystem = FileSystem.get(configuration);

        //Read data input
        Path input = new Path("/hdfsapi/test3/");

        //WordCountMapper mapper = new WordCountMapper();  -- this variant does not ignore case
        //Declaring the variable with the Mapper interface type is the polymorphic part:
        //swapping the implementation for CaseIgnoreWordCountMapper makes the count case-insensitive
        Mapper mapper = new CaseIgnoreWordCountMapper();
        Context context = new Context();

        //Recursive listing: the path may be a single file or a directory; a directory may contain many files, including files in subdirectories
        RemoteIterator<LocatedFileStatus> iterator = fileSystem.listFiles(input, true);

        //iterator.next() returns one file at a time; the counts from every file keep accumulating in the context cache
        while (iterator.hasNext()){
            LocatedFileStatus status = iterator.next();
            FSDataInputStream in = fileSystem.open(status.getPath());

            BufferedReader read = new BufferedReader(new InputStreamReader(in));

            String line = "";
            while ((line = read.readLine()) != null){
                System.out.println(line);

                mapper.map(line,context);

            }
            read.close();
            in.close();

            System.out.println("\n\n");

            //TODO: later, the results could be written to a file in HDFS (see the sketch after this class)
            //Path result = new Path("/hdfsapi/result/result.txt");

            //Get the data from the context cache and loop over the <key, value> entries in cacheMap
            Map<Object, Object> cacheMap = context.getCacheMap();
            for (Map.Entry<Object, Object> entry : cacheMap.entrySet()) {
                System.out.println(entry.getKey() + "\t" + entry.getValue());
            }

        }
    }
}
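
For the TODO above, one option is to write the aggregated counts to a file in HDFS instead of only printing them. Below is a minimal sketch, not part of the original program, that assumes the /hdfsapi/result/result.txt path from the commented-out line; it would go after the file-reading while loop inside main, so that only the final totals are written:

//sketch: write the contents of the context cache to an HDFS file
Path result = new Path("/hdfsapi/result/result.txt");
FSDataOutputStream out = fileSystem.create(result, true); //true = overwrite the file if it already exists
for (Map.Entry<Object, Object> entry : context.getCacheMap().entrySet()) {
    out.write((entry.getKey() + "\t" + entry.getValue() + "\n").getBytes("UTF-8"));
}
out.close();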

Code transformation

The approach above is not flexible enough. It can be improved with a configuration file: put everything that may change into the configuration file, read it at startup, and instantiate the required class by reflection.
1. Create a file named wc.properties under resources:

INPUT_PATH=/hdfsapi/test3/
OUTPUT_PATH=/hdfsapi/result/
HDFS_URI=hdfs://hadoop001:9000
# Which Mapper implementation to use; the program loads this class at runtime via reflection
MAPPER_CLASS=com.ruozedata.hadoop.hdfs.WordCountMapper

2. Create a utility class ParamsUtils to read the configuration file above:

package com.ruozedata.hadoop.hdfs;

import java.io.IOException;
import java.util.Properties;

public class ParamsUtils {
    private static Properties properties = new Properties();

    static {
        try {
            properties.load(ParamsUtils.class.getClassLoader().getResourceAsStream("wc.properties"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    //get method
    public static Properties getProperties() {
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(getProperties().getProperty("MAPPER_CLASS"));
        System.out.println(getProperties().getProperty("INPUT_PATH"));
    }
}

In the code above, getProperties().getProperty("MAPPER_CLASS") hard-codes the key string. Optionally, the keys can be wrapped in a constants class:

package com.ruozedata.hadoop.hdfs;

public class Constants {
    public static final String INPUT_PATH = "INPUT_PATH";
    public static final String OUTPUT_PATH = "OUTPUT_PATH";
    public static final String HDFS_URI = "HDFS_URI";
    public static final String MAPPER_CLASS = "MAPPER_CLASS";
}

Then use the constants class:

package com.ruozedata.hadoop.hdfs;

import java.io.IOException;
import java.util.Properties;

public class ParamsUtils {
    private static Properties properties = new Properties();

    static {
        try {
            properties.load(ParamsUtils.class.getClassLoader().getResourceAsStream("wc.properties"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    //get method
    public static Properties getProperties() {
        return properties;
    }

    public static void main(String[] args) {
//        System.out.println(getProperties().getProperty("MAPPER_CLASS"));
//        System.out.println(getProperties().getProperty("INPUT_PATH"));
        System.out.println(getProperties().getProperty(Constants.MAPPER_CLASS));
        System.out.println(getProperties().getProperty(Constants.INPUT_PATH));
        System.out.println(getProperties().getProperty(Constants.HDFS_URI));
        System.out.println(getProperties().getProperty(Constants.OUTPUT_PATH));
    }
}

Output results:

com.ruozedata.hadoop.hdfs.WordCountMapper
/hdfsapi/test3/
hdfs://hadoop001:9000
/hdfsapi/result/

Finally, test it. HDFSWCAPI02 is based on HDFSWCAPI01; the only changes are that the input path is now read from the properties file and that the Mapper class is obtained by reflection from MAPPER_CLASS. Nothing else needs to change.

package com.ruozedata.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.Properties;


public class HDFSWCAPI02 {
    public static void main(String[] args) throws Exception{

        //Get configuration
        Properties properties = ParamsUtils.getProperties();
        Path input = new Path(properties.getProperty(Constants.INPUT_PATH));

        //Configuration and FileSystem
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS","hdfs://hadoop001:9000");
        configuration.set("dfs.replication","1");
        System.setProperty("HADOOP_USER_NAME","ruoze");
        FileSystem fileSystem = FileSystem.get(configuration);

        //The input path is already read from the properties file above, so it is no longer hard-coded here:
        //Path input = new Path("/hdfsapi/test3/");


        //The MAPPER_CLASS entry in the configuration file names the Mapper implementation,
        //so that class is loaded here via reflection
        Class<?> aClass = Class.forName(properties.getProperty(Constants.MAPPER_CLASS));

        //Instantiate it with aClass.newInstance(); because Class<?> carries no compile-time type,
        //the new instance is cast to the Mapper interface (its parent type)
        Mapper mapper = (Mapper) aClass.newInstance();
        Context context = new Context();

        //Recursive listing: the path may be a single file or a directory; a directory may contain many files, including files in subdirectories
        RemoteIterator<LocatedFileStatus> iterator = fileSystem.listFiles(input, true);

        //iterator.next() returns one file at a time; the counts from every file keep accumulating in the context cache
        while (iterator.hasNext()){
            LocatedFileStatus status = iterator.next();
            FSDataInputStream in = fileSystem.open(status.getPath());

            BufferedReader read = new BufferedReader(new InputStreamReader(in));

            String line = "";
            while ((line = read.readLine()) != null){
                System.out.println(line);

                mapper.map(line,context);

            }
            read.close();
            in.close();

            System.out.println("\n\n");

            //TODO: later, the results could be written to a file in HDFS (see the sketch above)
            //Path result = new Path("/hdfsapi/result/result.txt");

            //Get the data from the context cache and loop over the <key, value> entries in cacheMap
            Map<Object, Object> cacheMap = context.getCacheMap();
            for (Map.Entry<Object, Object> entry : cacheMap.entrySet()) {
                System.out.println(entry.getKey() + "\t" + entry.getValue());
            }

        }
    }
}

With this, all input comes from the configuration file, including which Mapper class to use. The statistics above are case-sensitive; to switch to the case-insensitive class, just change wc.properties from:
MAPPER_CLASS=com.ruozedata.hadoop.hdfs.WordCountMapper
to:
MAPPER_CLASS=com.ruozedata.hadoop.hdfs.CaseIgnoreWordCountMapper
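
One loose end: HDFSWCAPI02 still hard-codes the HDFS address even though wc.properties already defines HDFS_URI (and OUTPUT_PATH). If you want the address to come from the configuration file as well, a small sketch of the Configuration setup in main, using the existing Constants.HDFS_URI key, might look like this:

//sketch: take the HDFS address from wc.properties instead of hard-coding it
Properties properties = ParamsUtils.getProperties();

Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", properties.getProperty(Constants.HDFS_URI));
configuration.set("dfs.replication", "1");
System.setProperty("HADOOP_USER_NAME", "ruoze");
FileSystem fileSystem = FileSystem.get(configuration);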

Topics: Java hdfs mapreduce