Big data with Hadoop 3.x: running environment setup (hands-on cluster build)

Posted by misteraven on Thu, 28 Oct 2021 01:39:47 +0200

🌹 Before we begin

Xiao Yuan has started a series of Hadoop tutorials to take you into big data from zero; your attention is appreciated (based on the blogger's notes from the Silicon Valley Hadoop 3.x course) ❤️❤️
Part 1: A graphical overview of Hadoop for big data
Part 2: Hadoop template virtual machine configuration for big data, illustrated
Part 3: Hadoop running environment setup for big data (hands-on cluster build) — this article
Part 4: the blogger is busy preparing it

💝 Installation package preparation

❤️❤️ The blogger has prepared all the installation packages you need to build the cluster; downloading them from Alibaba Cloud is much faster!!!

🚀 1. Hadoop running environment setup (development focus)

💒 1.1 template virtual machine environment preparation

0) Install the template virtual machine: IP address 192.168.10.100, hostname hadoop100, 4 GB of memory, 50 GB of disk

1) The hadoop100 virtual machine must meet the following requirements

  • 1. Installing with yum requires that the virtual machine can access the Internet. Before running yum, test the network connection

    [root@hadoop100 ~]# ping www.baidu.com
    PING www.baidu.com (14.215.177.39) 56(84) bytes of data.
    64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=1 ttl=128 time=8.60 ms
    64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=2 ttl=128 time=7.72 ms
    
  • 2. Install epel-release
    Note: Extra Packages for Enterprise Linux (EPEL) is an extra package repository for Red Hat family operating systems (RHEL, CentOS and Scientific Linux). It provides many rpm packages that cannot be found in the official repositories.

    [root@hadoop100 ~]# yum install -y epel-release
    
  • 3. Note: if you installed the minimal Linux system, you also need to install the following tools; if you installed the standard Linux desktop edition, you can skip this step.
    net-tools: a collection of network utilities, including ifconfig and other commands

    [root@hadoop100 ~]# yum install -y net-tools 
    

    vim: Editor

    [root@hadoop100 ~]# yum install -y vim
    

2) Turn off the firewall and disable it from starting at boot

[root@hadoop100 ~]# systemctl stop firewalld
[root@hadoop100 ~]# systemctl disable firewalld.service

Note: in enterprise development, the firewall on individual servers is usually turned off; the company sets up a hardened firewall at the network boundary instead.
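An optional sanity check (CentOS 7 uses systemd, so these commands are available): confirm that the firewall service is stopped and will not come back after a reboot.

[root@hadoop100 ~]# systemctl status firewalld
[root@hadoop100 ~]# systemctl is-enabled firewalld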

3) Create an ovo user and set its password (skip this if you already created the user earlier)

[root@hadoop100 ~]# useradd ovo
[root@hadoop100 ~]# passwd ovo

4) Give the ovo user root privileges, so that sudo can be used later to run commands as root

[root@hadoop100 ~]# vim /etc/sudoers

Modify the /etc/sudoers file and add a line below the %wheel line, as follows:

## Allow root to run any commands anywhere
root    ALL=(ALL)     ALL

## Allows people in group wheel to run all commands
%wheel  ALL=(ALL)       ALL
ovo   ALL=(ALL)     NOPASSWD:ALL

Note: do not put the ovo line directly under the root line. sudoers rules are applied in order: you would configure ovo for passwordless sudo first, but when processing reaches the %wheel line that rule takes over and a password is required again. So the ovo line must go below the %wheel line.
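A quick optional check of the edit (both commands are standard on CentOS 7): visudo -c validates the sudoers syntax, and sudo -l -U ovo lists the sudo rules that now apply to the ovo user.

[root@hadoop100 ~]# visudo -c
[root@hadoop100 ~]# sudo -l -U ovo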

5) Create folders under the /opt directory and change their owner and group

  • 1. Create the module and software folders in the / opt directory

    [root@hadoop100 ~]# mkdir /opt/module
    [root@hadoop100 ~]# mkdir /opt/software
    
  • 2. Change the owner and group of the module and software folders to the ovo user

    [root@hadoop100 ~]# chown ovo:ovo /opt/module 
    [root@hadoop100 ~]# chown ovo:ovo /opt/software
    
  • 3. View the owner and group of module and software folders

    [root@hadoop100 ~]# cd /opt/
    [root@hadoop100 opt]# ll
     total 12
    drwxr-xr-x. 2 ovo  ovo  4096 Oct 24 17:18 module
    drwxr-xr-x. 2 root root 4096 Oct  7  2017 rh
    drwxr-xr-x. 2 ovo  ovo  4096 Oct 24 17:18 software
    

6) Uninstall the JDK that ships with the virtual machine. Note: if your virtual machine was installed from the minimal image, you can skip this step.

[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps 
  • rpm -qa: query all installed rpm packages
  • grep -i: ignore case
  • xargs -n1: pass only one argument at a time
  • rpm -e --nodeps: force removal of the package without checking dependencies
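Optionally, run the query again afterwards to confirm that nothing Java-related is left; the command should print no output.

[root@hadoop100 ~]# rpm -qa | grep -i java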

7) Restart the virtual machine

[root@hadoop100 ~]# reboot

🚔 1.2 cloning virtual machines

1) Using the template machine hadoop100, clone three virtual machines: hadoop102, hadoop103 and hadoop104

Note: shut down hadoop100 before cloning

  • After shutdown, right-click the virtual machine and open the clone wizard from the Manage menu
  • Select "Create a full clone"
    - Modify the virtual machine name
  • Finish the clone

2) Modify the cloned machine's IP, using hadoop102 as the example

  • 1. Modify the static IP of the cloned virtual machine

    [root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
    

    Change to

    DEVICE=ens33
    TYPE=Ethernet
    ONBOOT=yes
    BOOTPROTO=static
    NAME="ens33"
    IPADDR=192.168.10.102
    PREFIX=24
    GATEWAY=192.168.10.2
    DNS1=192.168.10.2
    
  • 2. Check the VMware virtual network editor: Edit -> Virtual Network Editor -> VMnet8

  • 3. Check the IP address of the Windows adapter "VMware Network Adapter VMnet8"

  • 4. Make sure the address in the Linux ifcfg-ens33 file, the subnet in the virtual network editor, and the Windows VMnet8 adapter address are all on the same 192.168.10.x network

3) Modify the cloned machine's hostname, again using hadoop102 as the example

  • 1. Modify host name

    [root@hadoop100 ~]# vim /etc/hostname
    hadoop102
    
  • 2. Configure the hostname mapping for the Linux clone by opening the /etc/hosts file (if you already configured this following the previous article, you can skip it!)

    [root@hadoop100 ~]# vim /etc/hosts
    

    Add the following

    192.168.10.100 hadoop100
    192.168.10.101 hadoop101
    192.168.10.102 hadoop102
    192.168.10.103 hadoop103
    192.168.10.104 hadoop104
    192.168.10.105 hadoop105
    192.168.10.106 hadoop106
    192.168.10.107 hadoop107
    192.168.10.108 hadoop108
    

4) Restart the cloned machine hadoop102

[root@hadoop100 ~]# reboot
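After the reboot, it is worth checking that the new hostname and static IP took effect (ifconfig comes from the net-tools package installed earlier; the values assume the hadoop102 example above).

[root@hadoop102 ~]# hostname
hadoop102
[root@hadoop102 ~]# ifconfig ens33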

5) Modify the Windows hosts file (if you already configured this following the previous article, you can skip it!)

  • 1. Enter C:\Windows\System32\drivers\etc

  • 2. Copy the hosts file to the desktop

  • 3. Open the desktop hosts file and add the following

    192.168.10.100 hadoop100
    192.168.10.101 hadoop101
    192.168.10.102 hadoop102
    192.168.10.103 hadoop103
    192.168.10.104 hadoop104
    192.168.10.105 hadoop105
    192.168.10.106 hadoop106
    192.168.10.107 hadoop107
    192.168.10.108 hadoop108
    
  • 4. Copy the desktop hosts file back into C:\Windows\System32\drivers\etc, overwriting the original hosts file

🎪 1.3 installing the JDK on hadoop102

1) Uninstall any existing JDK
Note: before installing the JDK, be sure to remove the virtual machine's bundled JDK first. (If you already uninstalled it earlier, you can skip this.)

[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps

2) Use the XShell transport tool to import the JDK into the software folder under the opt directory

3) Check whether the software package is successfully imported in the opt directory under the Linux system

[ovo@hadoop102 ~]$ ls /opt/software/

You should see jdk-8u212-linux-x64.tar.gz in the listing.

4) Unzip the JDK to the / opt/module directory

[ovo@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

5) Configure JDK environment variables

  • 1. Create / etc/profile.d/my_env.sh file

    [ovo@hadoop102 ~]$ sudo vim /etc/profile.d/my_env.sh
    

    Add the following

    #JAVA_HOME
    export JAVA_HOME=/opt/module/jdk1.8.0_212
    export PATH=$PATH:$JAVA_HOME/bin
    
  • 2. Exit after saving

    :wq
    
  • 3. source the /etc/profile file so that the new environment variable PATH takes effect

    [ovo@hadoop102 ~]$ source /etc/profile
    

6) Test whether the JDK is installed successfully

[ovo@hadoop102 ~]$ java -version

If the Java version information is printed, the installation succeeded.

Note: restart (if java -version can be used, there is no need to restart)

[ovo@hadoop102 ~]$ sudo reboot
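In either case, a quick sanity check that the environment variables resolved (the paths assume the install location used above):

[ovo@hadoop102 ~]$ echo $JAVA_HOME
/opt/module/jdk1.8.0_212
[ovo@hadoop102 ~]$ which java
/opt/module/jdk1.8.0_212/bin/java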

⌚ 1.4 installing Hadoop on hadoop102

1) Use the XShell file transfer tool to import hadoop-3.1.3.tar.gz into the software folder under the opt directory (just like dragging the JDK, there is no screenshot)

2) Enter the Hadoop installation package path

[ovo@hadoop102 ~]$ cd /opt/software/

3) Unzip the installation file under / opt/module

[ovo@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

4) Check whether the decompression is successful

[ovo@hadoop102 software]$ ls /opt/module/
hadoop-3.1.3

5) Add Hadoop to environment variable

  • 1. Obtain Hadoop installation path

    [ovo@hadoop102 hadoop-3.1.3]$ pwd
    /opt/module/hadoop-3.1.3
    
  • 2. Open / etc/profile.d/my_env.sh file

    [ovo@hadoop102 hadoop-3.1.3]$ sudo vim /etc/profile.d/my_env.sh
    

    Add the following at the end of the my_env.sh file (press shift+g to jump to the end):

    #HADOOP_HOME
    export HADOOP_HOME=/opt/module/hadoop-3.1.3
    export PATH=$PATH:$HADOOP_HOME/bin
    export PATH=$PATH:$HADOOP_HOME/sbin
    

    Save and exit with :wq

  • 3. Make the modified file take effect

    [ovo@hadoop102 hadoop-3.1.3]$ source /etc/profile
    

6) Test for successful installation

[ovo@hadoop102 hadoop-3.1.3]$ hadoop version
Hadoop 3.1.3

7) Restart (restart the virtual machine if the Hadoop command cannot be used)

[ovo@hadoop102 hadoop-3.1.3]$ sudo reboot

⌛ 1.5 Hadoop directory structure

1) View Hadoop directory structure

[ovo@hadoop102 hadoop-3.1.3]$ ll

2) Important directory

  • bin directory: stores scripts for operating Hadoop related services (hdfs, yarn, mapred)
  • etc Directory: Hadoop configuration file directory, which stores Hadoop configuration files
  • lib Directory: Hadoop's native libraries (used to compress and decompress data)
  • sbin Directory: stores scripts for starting or stopping Hadoop related services
  • share Directory: stores the dependent jar packages, documents, and official cases of Hadoop

⚡ 2. Hadoop operation modes

1) Hadoop official website: http://hadoop.apache.org

2) Hadoop operation modes include: local mode, pseudo-distributed mode and fully distributed mode.

  • Local mode: runs on a single machine, only for demonstrating the official examples. Not used in production.
  • Pseudo-distributed mode: also runs on a single machine, but with all the components of a Hadoop cluster; one server simulates a distributed environment. Some companies on a tight budget use it for testing; it is not used in production.
  • Fully distributed mode: multiple servers form a distributed environment. Used in production.

🌟 2.1 fully distributed operation mode (development focus)

analysis:

  • 1. Prepare 3 virtual machines (turn off firewall, static IP, host name)
  • 2. Install JDK
  • 3. Configure environment variables
  • 4. Install Hadoop
  • 5. Configure environment variables
  • 6. Configure cluster
  • 7. Start daemons individually (single-point start)
  • 8. Configure ssh
  • 9. Start the whole cluster and test it

♐ 2.2.1 virtual machine preparation

See sections 1.1 and 1.2 for details.

☁️ 2.2.2 writing cluster distribution script xsync

1) scp (secure copy)

scp definition: scp can copy data between servers. (from server1 to server2)

  • 1. Basic syntax (important!!!)

    scp      -r          $pdir/$fname             $user@$host:$pdir/$fname
    command  recursive   source path/file name    destination user@host:destination path/file name
  • 2. Case practice

    Prerequisite: the /opt/module and /opt/software directories already exist on hadoop102, hadoop103 and hadoop104, and their owner and group have been changed to ovo:ovo

    [ovo@hadoop102 ~]$ sudo chown ovo:ovo -R /opt/module
    


    (a) On hadoop102, push the /opt/module/jdk1.8.0_212 directory from hadoop102 to hadoop103.

    [ovo@hadoop102 ~]$ scp -r /opt/module/jdk1.8.0_212  ovo@hadoop103:/opt/module
    

    (b) On hadoop103, pull the /opt/module/hadoop-3.1.3 directory from hadoop102 to hadoop103.

    [ovo@hadoop103 ~]$ scp -r ovo@hadoop102:/opt/module/hadoop-3.1.3 /opt/module/
    

    (c) On hadoop103, copy everything under the /opt/module directory of hadoop102 to hadoop104.

    [ovo@hadoop103 opt]$ scp -r ovo@hadoop102:/opt/module/* ovo@hadoop104:/opt/module
    

2) rsync remote synchronization tool

rsync is mainly used for backup and mirroring. It is fast, avoids copying identical content, and supports symbolic links.
Difference between rsync and scp: rsync is faster than scp because it only transfers the files that differ, while scp copies everything.

  • 1. Basic syntax

    rsync    -av        $pdir/$fname             $user@$host:$pdir/$fname
    command  options    source path/file name    destination user@host:destination path/file name

    Description of the options:

    Option   Function
    -a       archive mode copy
    -v       show the transfer progress
  • 2. Case practice

    (a) Delete / opt/module/hadoop-3.1.3/wcinput in Hadoop 103

    [ovo@hadoop103 hadoop-3.1.3]$ rm -rf wcinput/
    

    (b) Synchronize / opt/module/hadoop-3.1.3 in Hadoop 102 to Hadoop 103

    [ovo@hadoop102 module]$ rsync -av hadoop-3.1.3/ ovo@hadoop103:/opt/module/hadoop-3.1.3/
    

3) xsync cluster distribution script

  • 1. Requirement: loop over all cluster nodes and copy a file to the same directory on each

  • 2. Requirement analysis:
    (a) The raw rsync command to copy would be:

    rsync  -av     /opt/module  		ovo@hadoop103:/opt/
    

    (b) Desired usage: xsync <file(s) to synchronize>
    (c) The script should be usable from any directory, so it is placed in a directory that is on the global PATH

    [ovo@hadoop102 ~]$ echo $PATH
    /usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ovo/.local/bin:/home/ovo/bin:/opt/module/jdk1.8.0_212/bin
    
  • 3. Script implementation

    (a) Create an xsync file in the / home/ovo/bin directory

    [ovo@hadoop102 opt]$ cd /home/ovo
    [ovo@hadoop102 ~]$ mkdir bin
    [ovo@hadoop102 ~]$ cd bin
    [ovo@hadoop102 bin]$ vim xsync
    

    Write the following code in this file

    #!/bin/bash
    
    #1. Check the number of arguments
    if [ $# -lt 1 ]
    then
        echo Not Enough Arguments!
        exit;
    fi
    
    #2. Loop over every machine in the cluster
    for host in hadoop102 hadoop103 hadoop104
    do
        echo ====================  $host  ====================
        #3. Loop over all the files/directories and send them one by one
    
        for file in $@
        do
            #4. Check whether the file exists
            if [ -e $file ]
                then
                    #5. Get the parent directory
                    pdir=$(cd -P $(dirname $file); pwd)
    
                    #6. Get the name of the current file
                    fname=$(basename $file)
                    ssh $host "mkdir -p $pdir"
                    rsync -av $pdir/$fname $host:$pdir
                else
                    echo $file does not exist!
            fi
        done
    done
    

    (b) Give the xsync script execute permission

    [ovo@hadoop102 bin]$ chmod +x xsync
    

    (c) Test script

    [ovo@hadoop102 ~]$ xsync /home/ovo/bin
    

    (d) Copy the script to / bin for global invocation

    [ovo@hadoop102 bin]$ sudo cp xsync /bin/
    

    (e) Synchronize environment variable configuration (root owner)

    [ovo@hadoop102 ~]$ sudo ./bin/xsync /etc/profile.d/my_env.sh
    

    Note: when using sudo, you must give xsync its full (relative or absolute) path.
    Make the environment variables take effect

    [ovo@hadoop103 bin]$ source /etc/profile
    [ovo@hadoop104 opt]$ source /etc/profile
    

🎈 2.2.3 passwordless SSH login configuration

1) Configure ssh

  • 1. Basic syntax: ssh <hostname of the other machine>

  • 2. What to do when "Host key verification failed" (or a host-key prompt) appears during an SSH connection

    [ovo@hadoop102 ~]$ ssh hadoop103
    

    If the following appears

    Are you sure you want to continue connecting (yes/no)? 
    

    Type yes and press Enter

  • 3. Return to Hadoop 102

    [ovo@hadoop103 ~]$ exit
    

2) Passwordless login configuration

  • 1. How passwordless login works

  • 2. Generate public key and private key

    [ovo@hadoop102 .ssh]$ pwd
    /home/ovo/.ssh
    
    [ovo@hadoop102 .ssh]$ ssh-keygen -t rsa
    

    Then press Enter three times; two files will be generated: id_rsa (the private key) and id_rsa.pub (the public key)

  • 3. Copy the public key to the target machine for password free login

    [ovo@hadoop102 .ssh]$ ssh-copy-id hadoop102
    [ovo@hadoop102 .ssh]$ ssh-copy-id hadoop103
    [ovo@hadoop102 .ssh]$ ssh-copy-id hadoop104
    

    Note:

    • You also need to repeat this for the ovo account on hadoop103 so that it can log in to hadoop102, hadoop103 and hadoop104 without a password.
    • You also need to repeat this for the ovo account on hadoop104 so that it can log in to hadoop102, hadoop103 and hadoop104 without a password.
    • You also need to repeat this for the root account on hadoop102 so that it can log in to hadoop102, hadoop103 and hadoop104 without a password (see the sketch below).
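    A minimal sketch of that repetition, assuming ssh-keygen -t rsa has already been run for the account in question (run it once as ovo on hadoop103, once as ovo on hadoop104, and once as root on hadoop102):

    # distribute this account's public key to all three nodes
    for host in hadoop102 hadoop103 hadoop104
    do
        ssh-copy-id "$host"
    done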

💐 2.2.4 cluster configuration

1) Cluster deployment planning

Note:

  • NameNode and SecondaryNameNode should not be installed on the same server
  • ResourceManager also consumes a lot of memory, so it should not be placed on the same machine as NameNode or SecondaryNameNode

The plan used in the rest of this article (it matches the configuration files below):

hadoop102: NameNode, DataNode, NodeManager
hadoop103: ResourceManager, DataNode, NodeManager
hadoop104: SecondaryNameNode, DataNode, NodeManager

2) Configuration file description

Hadoop configuration files come in two kinds: default configuration files and custom configuration files. Only when you want to change a default value do you need to edit the custom configuration file and set the corresponding property.

  • Default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml, packaged inside the corresponding Hadoop jars
  • Custom configuration files: core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml, located under $HADOOP_HOME/etc/hadoop; users modify them according to the project's requirements

3) Configure cluster

  • 1. Core configuration file: core-site.xml

    [ovo@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
    [ovo@hadoop102 hadoop]$ vim core-site.xml
    

    The contents of the file are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
        <!-- Specify the address of the NameNode -->
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://hadoop102:8020</value>
        </property>
    
        <!-- Specify the directory where Hadoop stores its data -->
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/opt/module/hadoop-3.1.3/data</value>
        </property>
    
        <!-- Configure ovo as the static user for HDFS web UI access -->
        <property>
            <name>hadoop.http.staticuser.user</name>
            <value>ovo</value>
        </property>
    </configuration>
    
  • 2.HDFS configuration file: hdfs-site.xml

    [ovo@hadoop102 hadoop]$ vim hdfs-site.xml
    

    The contents of the file are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
    	<!-- NameNode (nn) web UI address -->
    	<property>
            <name>dfs.namenode.http-address</name>
            <value>hadoop102:9870</value>
        </property>
    	<!-- SecondaryNameNode (2nn) web UI address -->
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>hadoop104:9868</value>
        </property>
    </configuration>
    
  • 3.YARN configuration file: yarn-site.xml

    [ovo@hadoop102 hadoop]$ vim yarn-site.xml
    

    The contents of the file are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
        <!-- Specify the shuffle service for MapReduce -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    
        <!-- Specify the address of the ResourceManager -->
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop103</value>
        </property>
    
        <!-- Inheritance of environment variables -->
        <property>
            <name>yarn.nodemanager.env-whitelist</name>
            <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>
    </configuration>
    
  • 4.MapReduce configuration file: mapred-site.xml

    [ovo@hadoop102 hadoop]$ vim mapred-site.xml
    

    The contents of the file are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
    	<!-- Run MapReduce programs on YARN -->
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
    

4) Distribute the configured Hadoop configuration file on the cluster

[ovo@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/

5) Check on hadoop103 and hadoop104 that the files were distributed

[ovo@hadoop103 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
[ovo@hadoop104 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

🎁 2.2.5 starting the cluster

1) Configure workers

[ovo@hadoop102 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

Add the following to the file:

hadoop102
hadoop103
hadoop104

Note: no space is allowed at the end of the content added in the file, and no empty line is allowed in the file.
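An optional way to spot accidental trailing spaces or blank lines: cat -A (from GNU coreutils) prints a $ at the end of every line, so stray whitespace stands out.

[ovo@hadoop102 hadoop]$ cat -A /opt/module/hadoop-3.1.3/etc/hadoop/workers
hadoop102$
hadoop103$
hadoop104$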

Synchronize the configuration files to all nodes

[ovo@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc

2) Start cluster

  • 1. If the cluster is being started for the first time, format the NameNode on the hadoop102 node

    Note: formatting the NameNode generates a new cluster ID. If the DataNodes still carry the old ID, the NameNode and DataNode cluster IDs no longer match and the cluster cannot find its previous data.

    Important: if the cluster fails while running and the NameNode has to be reformatted, be sure to stop the NameNode and DataNode processes first, delete the data and logs directories on all machines, and only then format (see the cleanup sketch after this list).

    [ovo@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
    
  • 2. Start HDFS

    [ovo@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
    
  • 3. Start YARN on the node where the ResourceManager is configured (note: hadoop103)

    [ovo@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
    
  • 4. View the NameNode of HDFS on the web side
    (a) Enter in the browser: http://hadoop102:9870
    (b) View data information stored on HDFS

  • 5. View YARN's ResourceManager on the web side
    (a) Enter in the browser: http://hadoop103:8088
    (b) View Job information running on YARN
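The cleanup mentioned in the formatting note above can be scripted. A minimal sketch, assuming the directory layout used in this article and that all NameNode/DataNode processes have already been stopped:

#!/bin/bash
# wipe the HDFS data and logs on every node before re-formatting the NameNode
for host in hadoop102 hadoop103 hadoop104
do
    ssh "$host" "rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs"
done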

😉 2.2.6 configuring the history server

To be able to look back at how programs ran historically, you need to configure the history server. The specific steps are as follows:

1) Configure mapred-site.xml

[ovo@hadoop102 hadoop]$ vim mapred-site.xml

Add the following configuration to this file.

<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>

<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>

2) Distribute the configuration

[ovo@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/mapred-site.xml

3) Start the history server on hadoop102

[ovo@hadoop102 hadoop]$ mapred --daemon start historyserver

4) Check whether the history server is started

[ovo@hadoop102 hadoop]$ jps

5) To view JobHistory: http://hadoop102:19888/jobhistory

🌈 2.2.7 configuring log aggregation

Log aggregation concept: after an application finishes running, its log information is uploaded to HDFS.

Benefit of log aggregation: you can conveniently inspect the details of a program run, which is helpful for development and debugging.

Note: to enable the log aggregation function, you need to restart NodeManager, ResourceManager and HistoryServer.

The specific steps to enable log aggregation are as follows:

1) Configure yarn-site.xml

[ovo@hadoop102 hadoop]$ vim yarn-site.xml

Add the following configuration to this file.

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set log aggregation server address -->
<property>  
    <name>yarn.log.server.url</name>  
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Set the log retention time to 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

2) Distribute the configuration

[ovo@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/yarn-site.xml

3) Close NodeManager, ResourceManager, and HistoryServer

[ovo@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
[ovo@hadoop102 hadoop-3.1.3]$ mapred --daemon stop historyserver

4) Start NodeManager, ResourceManager, and HistoryServer

[ovo@hadoop103 ~]$ start-yarn.sh
[ovo@hadoop102 ~]$ mapred --daemon start historyserver

5) Delete the existing output file on HDFS

[ovo@hadoop102 ~]$ hadoop fs -rm -r /output

6) Run the WordCount example program (use whatever input/output paths you like)

[ovo@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output

7) View log
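The aggregated logs can be browsed from the history server page, or pulled from the command line with the yarn CLI; the application ID below is a placeholder, copy the real one from the YARN web UI (http://hadoop103:8088).

[ovo@hadoop102 ~]$ yarn logs -applicationId <application_id>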

🎸 2.2.8 summary of cluster start / stop modes

1) Start / stop each module separately (requires ssh to be configured)

  • 1. Overall start / stop of HDFS

    start-dfs.sh/stop-dfs.sh
    
  • 2. Overall start / stop of YARN

    start-yarn.sh/stop-yarn.sh
    

2) Each service component starts / stops one by one

  • 1. Start / stop HDFS components respectively

    hdfs --daemon start/stop namenode/datanode/secondarynamenode
    
  • 2. Start / stop YARN

    yarn --daemon start/stop  resourcemanager/nodemanager
    

🎉 2.2.9 writing common scripts for Hadoop clusters

1) Hadoop cluster start/stop script (covering HDFS, YARN and the history server): myhadoop.sh

[ovo@hadoop102 ~]$ cd /home/ovo/bin
[ovo@hadoop102 bin]$ vim myhadoop.sh

Enter the following

#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
        echo " =================== start-up hadoop colony ==================="

        echo " --------------- start-up hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- start-up yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- start-up historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== close hadoop colony ==================="

        echo " --------------- close historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- close yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- close hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

Save and exit, then make the script executable

[ovo@hadoop102 bin]$ chmod +x myhadoop.sh
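Typical usage once the script is on the PATH (it is distributed to all three machines in step 3 below):

[ovo@hadoop102 ~]$ myhadoop.sh start
[ovo@hadoop102 ~]$ myhadoop.sh stop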

2) Script to view the Java processes on all three servers: jpsall

[ovo@hadoop102 ~]$ cd /home/ovo/bin
[ovo@hadoop102 bin]$ vim jpsall

Enter the following

#!/bin/bash

for host in hadoop102 hadoop103 hadoop104
do
        echo =============== $host ===============
        ssh $host jps 
done

Save and exit, then make the script executable

[ovo@hadoop102 bin]$ chmod +x jpsall

3) Distribute the / home/ovo/bin directory to ensure that custom scripts can be used on all three machines

[ovo@hadoop102 ~]$ xsync /home/ovo/bin/

👏 2.2.10 cluster time synchronization

If the servers can reach the public Internet, cluster time synchronization is not required, because every server periodically calibrates itself against public time servers;

If the servers are on an isolated intranet, cluster time synchronization must be configured, otherwise the clocks drift apart over time and the cluster ends up running tasks out of step.

1) Requirement

Pick one machine to act as the time server, and have all the other machines synchronize with it periodically. In production the sync interval depends on how time-sensitive the tasks are; to see the effect quickly, the test environment here synchronizes once a minute.

2) Time server configuration (must be root)

  • 1. Check the ntpd service status on all nodes, and whether it is enabled to start at boot

    [ovo@hadoop102 ~]$ sudo systemctl status ntpd
    [ovo@hadoop102 ~]$ sudo systemctl start ntpd
    [ovo@hadoop102 ~]$ sudo systemctl is-enabled ntpd
    
  • 2. Modify the /etc/ntp.conf configuration file on hadoop102

    [ovo@hadoop102 ~]$ sudo vim /etc/ntp.conf
    

    Make the following changes:

    (a) Change 1: authorize all machines on the 192.168.10.0-192.168.10.255 network segment to query and synchronize time from this machine. Change

    #restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
    

    to (remove the # and change the segment to 192.168.10.0)

    restrict 192.168.10.0 mask 255.255.255.0 nomodify notrap
    

    (b) Change 2: the cluster is on an isolated LAN and should not use Internet time sources. Change

    server 0.centos.pool.ntp.org iburst
    server 1.centos.pool.ntp.org iburst
    server 2.centos.pool.ntp.org iburst
    server 3.centos.pool.ntp.org iburst
    

    to (comment all four lines out with a # sign)

    #server 0.centos.pool.ntp.org iburst
    #server 1.centos.pool.ntp.org iburst
    #server 2.centos.pool.ntp.org iburst
    #server 3.centos.pool.ntp.org iburst
    

    (c) Addition 3: add the following so that, if this node loses its network connection, it can still use its local clock as the time source for the other nodes in the cluster

    server 127.127.1.0
    fudge 127.127.1.0 stratum 10
    
  • 3. Modify the /etc/sysconfig/ntpd file on hadoop102

    [ovo@hadoop102 ~]$ sudo vim /etc/sysconfig/ntpd
    

    Add the following contents (synchronize the hardware time with the system time)

    SYNC_HWCLOCK=yes
    
  • 4. Restart ntpd service

    [ovo@hadoop102 ~]$ sudo systemctl start ntpd
    
  • 5. Enable the ntpd service to start at boot

    [ovo@hadoop102 ~]$ sudo systemctl enable ntpd
    

3) Other machine configurations (must be root)

  • 1. Stop the ntpd service on all other nodes and disable it at boot

    [ovo@hadoop103 ~]$ sudo systemctl stop ntpd
    [ovo@hadoop103 ~]$ sudo systemctl disable ntpd
    [ovo@hadoop104 ~]$ sudo systemctl stop ntpd
    [ovo@hadoop104 ~]$ sudo systemctl disable ntpd
    
  • 2. Configure other machines to synchronize with the time server once a minute

    [ovo@hadoop103 ~]$ sudo crontab -e
    

    The scheduled tasks are as follows:

    */1 * * * * /usr/sbin/ntpdate hadoop102
    
  • 3. Change the time on one of the other machines (to test the sync)

    [ovo@hadoop103 ~]$ sudo date -s "2022-9-11 11:11:11"
    
  • 4. One minute later, check whether that machine has synchronized with the time server

    [ovo@hadoop103 ~]$ sudo date
    

⭐ 2.2.11 frequently asked questions

1) Common port number

Port name                                      Hadoop 2.x    Hadoop 3.x
HDFS NameNode internal communication port      8020/9000     8020/9000/9820
HDFS NameNode web UI port (for user queries)   50070         9870
YARN web UI port for viewing running tasks     8088          8088
History server communication port              19888         19888
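To confirm that a daemon is actually listening on one of these ports, a quick check on the relevant host helps (here the NameNode web port on hadoop102; ss is part of iproute2, and netstat from net-tools works the same way):

[ovo@hadoop102 ~]$ sudo ss -lntp | grep 9870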

2) Common configuration files

Hadoop 2.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, slaves
Hadoop 3.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, workers

🌸 3. Common errors and solutions

1) The firewall was not turned off, or YARN was not started

INFO client.RMProxy: Connecting to ResourceManager at hadoop108/192.168.10.108:8032

2) Host name configuration error

3) IP address configuration error

4) ssh is not configured properly

5) The cluster was started with a mix of root and ovo users (inconsistent users)

6) Careless modification of configuration file

7) Unrecognized host name

java.net.UnknownHostException: hadoop102: hadoop102
        at java.net.InetAddress.getLocalHost(InetAddress.java:1475)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:146)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)

Solution:

  • (1) Add "192.168.10.102 hadoop102" to the /etc/hosts file
  • (2) Do not use special names such as hadoop or hadoop000 for the hostname

8) Only one of the DataNode and NameNode processes runs at a time (typically a cluster ID mismatch; see the formatting note in 2.2.5).
9) A command has no effect: when a command is pasted from Word, ordinary hyphens may have been replaced by long dashes, which makes the command fail.

Solution: try not to paste code from Word.

10) jps shows that a process is not running, yet restarting the cluster reports that the process is already started.

Reason: temporary files for the previously started processes are left in the /tmp directory under the Linux root. Delete the cluster-related temporary files and restart the cluster.

11) jps is not recognized

Reason: the global environment variables for Hadoop and Java have not taken effect. Solution: run source /etc/profile.

12) Port 8088 cannot be reached

[ovo@hadoop102 desktop]$ cat /etc/hosts

Comment out the following lines

#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         hadoop102

Topics: Big Data Hadoop