Hadoop | HDFS learning notes | Setting up the HDFS Java API environment | Operating the HDFS file system from Java | multiple examples

Posted by TomT64 on Tue, 18 Jan 2022 07:33:14 +0100

reference material

Video materials

Operating environment

  • Windows10
  • JDK8
  • IDEA 2021.6 professional
  • Hadoop3.1.3
  • CentOS7
  • A fully distributed Hadoop cluster with three nodes

1, Preparing the HDFS Java API environment

1.1 preparing the Hadoop environment on Windows

Hadoop 3.1.3 official download address: Click download, then unzip the archive after downloading.

Hadoop is primarily developed for Linux. winutils.exe simulates the Linux directory environment, so when Hadoop runs on Windows it needs this auxiliary program.

Download these support files into the bin folder of the Hadoop directory: Click download

After that, configure the Hadoop environment variables just as in a Linux environment: first set HADOOP_HOME, then add %HADOOP_HOME%\bin to the system PATH variable.

To do this graphically, right-click This PC, click Properties, then Advanced system settings, then Environment Variables.

After configuration, view the version in cmd

hadoop version


Configuration completed

1.2 connecting the local machine to the cluster nodes

1.2.1 domain name mapping

Configure domain name mapping locally so that you can connect directly to the corresponding IP through its host name.

Configure domain name mapping for Windows 10 in the hosts file at:

C:\WINDOWS\System32\drivers\etc\hosts

As shown in the figure, each entry takes the form IP + space + host name.
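
For example, the three node entries might look like this (the IP addresses below are made up for illustration; hadoop101 matches the host name used in the code later):

192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103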

If the node is a virtual machine running on this machine, you can look up its IP inside the virtual machine with:

ifconfig

In the output, pick the interface whose name starts with "e", for example ens33 in VMware or eth0 in a Docker container.

1.2.2 routing and forwarding

Since the author built the three nodes as Docker containers inside a virtual machine, the situation is:

  • The virtual machine can reach the Docker containers running inside it
  • The local machine can reach the virtual machine

To connect the local machine directly to the Docker containers inside the virtual machine, you need routing and forwarding. Please refer to this blog post: Docker | use the host to ping the docker container in the virtual machine | route and forward

After the configuration is complete, you can ping the cluster nodes directly from the host, so the local program can later connect to the cluster's HDFS.

1.3 creating a Maven project with IDEA

Project structure:

pom.xml, which configures the Maven project dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.uni</groupId>
    <artifactId>HDFSLearn</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.3</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.30</version>
        </dependency>
    </dependencies>
</project>

log4j.properties, which configures the log4j log output:

log4j.rootLogger=INFO, stdout  
log4j.appender.stdout=org.apache.log4j.ConsoleAppender  
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout  
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n  
log4j.appender.logfile=org.apache.log4j.FileAppender  
log4j.appender.logfile.File=target/spring.log  
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout  
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

So far, the IDEA project has been built.

2, HDFS Java API operation cases

2.1 creating folders

Related methods in the org.apache.hadoop.fs.FileSystem source code:

public static boolean mkdirs(FileSystem fs, Path dir, FsPermission permission)
public abstract boolean mkdirs(Path f, FsPermission permission) throws IOException
public boolean mkdirs(Path f) throws IOException

Parameter Description:

  • Path f: the directory to create on HDFS
  • FsPermission permission: the permission level for the current user

Test code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;

public class HdfsClient {
    private static Configuration configuration;
    private static String HDFS_PATH = "hdfs://hadoop101:8020";  // HDFS connection address
    private static String HDFS_USER = "root";                   // HDFS connection user
    private static FileSystem fs;                                // HDFS file operation object
    // Create the connection once, in singleton style
    static { configuration = new Configuration(); }
    public FileSystem getFileSystem(){
        try{
            if(fs == null)
                fs = FileSystem.get(new URI(HDFS_PATH), configuration, HDFS_USER);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return fs;
    }
    @Test
    public void testMkdir() throws IOException{
        FileSystem fs = getFileSystem();
        fs.mkdirs(new Path("/uni/testMkdir/"));
        fs.close();
    }
}
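
The overload that takes an FsPermission (from org.apache.hadoop.fs.permission) lets you set the permissions of the new directory explicitly. A minimal sketch in the same HdfsClient class, with a made-up /uni/testPermission path:

@Test
public void testMkdirWithPermission() throws IOException {
    FileSystem fs = getFileSystem();
    // rwxr-xr-x for owner / group / others
    fs.mkdirs(new Path("/uni/testPermission/"),
              new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.READ_EXECUTE));
    fs.close();
}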

2.2 uploading files

Related methods in the org.apache.hadoop.fs.FileSystem source code:

public void copyFromLocalFile(Path src, Path dst)
public void moveFromLocalFile(Path[] srcs, Path dst)
public void copyFromLocalFile(boolean delSrc, Path src, Path dst)
public void copyFromLocalFile(boolean delSrc, boolean overwrite,Path[] srcs, Path dst)
public void copyFromLocalFile(boolean delSrc, boolean overwrite,Path src, Path dst)

Parameter Description:

  • boolean delSrc: whether to delete the local source file after uploading
  • boolean overwrite: whether to overwrite existing content at the HDFS destination
  • Path src / Path[] srcs: the local source path(s)
  • Path dst: the HDFS destination path

Test code:

@Test
public void testCopyFromLocalFile() throws IOException{
    FileSystem fs = getFileSystem();
    Path localFile = new Path("test.txt");
    Path distPath = new Path("/uni/testMkdir");
    fs.copyFromLocalFile(localFile, distPath);
    fs.close();
}
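
The overload with delSrc and overwrite works the same way. A minimal sketch reusing the same test.txt and target directory:

@Test
public void testCopyFromLocalFileOverwrite() throws IOException {
    FileSystem fs = getFileSystem();
    // delSrc = false: keep the local file; overwrite = true: replace the HDFS copy if it already exists
    fs.copyFromLocalFile(false, true, new Path("test.txt"), new Path("/uni/testMkdir"));
    fs.close();
}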

2.3 downloading files

Related methods in the org.apache.hadoop.fs.FileSystem source code:

public void copyToLocalFile(boolean delSrc, Path src, Path dst)
public void copyToLocalFile(boolean delSrc, Path src, Path dst,boolean useRawLocalFileSystem) throws IOException 
public void copyToLocalFile(Path src, Path dst) throws IOException

Parameter Description:

  • boolean delSrc: whether to delete the source file on HDFS after downloading
  • Path src: the HDFS source path
  • Path dst: the local destination path
  • boolean useRawLocalFileSystem: whether to use the raw local file system, which skips writing the local .crc (cyclic redundancy check) checksum file
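
Test code (a minimal sketch; the HDFS path reuses the test.txt uploaded in 2.2, and the local name test_download.txt is just an example):

@Test
public void testCopyToLocalFile() throws IOException {
    FileSystem fs = getFileSystem();
    // delSrc = false: keep the file on HDFS
    // useRawLocalFileSystem = true: do not write a local .crc checksum file
    fs.copyToLocalFile(false, new Path("/uni/testMkdir/test.txt"), new Path("test_download.txt"), true);
    fs.close();
}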

2.4 deleting files

Related methods in the org.apache.hadoop.fs.FileSystem source code:

public boolean delete(Path f) throws IOException
public abstract boolean delete(Path f, boolean recursive) throws IOException
public boolean deleteOnExit(Path f) throws IOException
public void deleteSnapshot(Path path, String snapshotName)

Parameter Description:

  • Path f: the path to delete
  • boolean recursive: whether to delete recursively; must be true when the path is a non-empty directory
  • String snapshotName: the snapshot name used by deleteSnapshot

Test code:

@Test
public void testDelete(){
    FileSystem fs = getFileSystem();
    try{
        fs.delete(new Path("/uni/testMkdir/test.txt"),true);
        fs.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
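
Deleting a directory works the same way, but recursive must be true when the directory is not empty. A minimal sketch that removes the /uni/testMkdir directory created earlier:

@Test
public void testDeleteDir() throws IOException {
    FileSystem fs = getFileSystem();
    // recursive = true is required because the directory may still contain files
    fs.delete(new Path("/uni/testMkdir"), true);
    fs.close();
}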

2.5 renaming and moving files

Related methods in the org.apache.hadoop.fs.FileSystem source code:

public abstract boolean rename(Path src, Path dst) throws IOException

Test method:

@Test
public void testMove(){
    FileSystem fs = getFileSystem();
    try{
        fs.rename(new Path("/uni/testMkdir"), new Path("/uni/MyDir"));
        fs.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
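
rename also moves a file between directories when the destination path points somewhere else. A minimal sketch with made-up paths that continue the /uni/MyDir example:

@Test
public void testMoveFile() throws IOException {
    FileSystem fs = getFileSystem();
    // Move /uni/MyDir/test.txt to the root directory, keeping its name
    fs.rename(new Path("/uni/MyDir/test.txt"), new Path("/test.txt"));
    fs.close();
}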

2.6 obtaining HDFS file information

Related methods in the org.apache.hadoop.fs.FileSystem source code:

public RemoteIterator<LocatedFileStatus> listFiles(final Path f, final boolean recursive)

Test method:
Get HDFS file information

@Test
public void testDetail() throws IOException {
    FileSystem fs = getFileSystem();
    // Get all file information
    RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);
    while (listFiles.hasNext()) {
        LocatedFileStatus fileStatus = listFiles.next();

        System.out.println("===========" + fileStatus.getPath() +"=========");
        System.out.println(fileStatus.getPermission());
        System.out.println(fileStatus.getOwner());
        System.out.println(fileStatus.getGroup());
        System.out.println(fileStatus.getLen());
        System.out.println(fileStatus.getModificationTime());
        System.out.println(fileStatus.getReplication());
        System.out.println(fileStatus.getBlockSize());
        System.out.println(fileStatus.getPath().getName());
    }
    fs.close();
}

Get block information of HDFS file:

@Test
public void testBlockLocations() throws IOException {
    FileSystem fs = getFileSystem();
    // Get the block locations of every file
    RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);
    while (listFiles.hasNext()) {
        LocatedFileStatus fileStatus = listFiles.next();
        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        System.out.println(Arrays.toString(blockLocations));
    }
    fs.close();
}

2.7 distinguishing files from folders

@Test
public void testFile() throws IOException {
    FileSystem fs = getFileSystem();
    FileStatus[] listStatus = fs.listStatus(new Path("/"));
    for (FileStatus status : listStatus) {
        if(status.isFile())
            System.out.println("file:" + status.getPath().getName());
        else
            System.out.println("catalogue:" + status.getPath().getName());
    }
    fs.close();
}

2.8 HDFS - API configuration parameter priority

2.8.1 through the configuration file

hdfs-site.xml

You can place the Hadoop configuration file hdfs-site.xml in the project's resources folder, and the API will then run according to this configuration.
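
A minimal sketch of such a file (the replication value of 1 is only an example, not a recommendation):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>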

2.8.2 through Configuration object

In addition to configuration files, you can also set parameters directly on the Configuration object.

This class comes from the package org.apache.hadoop.conf.

Example: set the number of replicas to 2

Configuration configuration = new Configuration();
configuration.set("dfs.replication", "2");
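
For the setting to take effect, this Configuration must be the one passed to FileSystem.get. A minimal sketch reusing the hadoop101 address and root user from the earlier examples:

Configuration configuration = new Configuration();
configuration.set("dfs.replication", "2");
FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:8020"), configuration, "root");
// Files uploaded through this fs instance are stored with 2 replicas
fs.copyFromLocalFile(new Path("test.txt"), new Path("/uni/testMkdir"));
fs.close();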

2.8.3 priority issues

The priority from high to low is:

Configuration set in the API program's code > hdfs-site.xml in the project's resources folder > hdfs-site.xml on the cluster > hdfs-default.xml on the cluster

Topics: Java Hadoop hdfs