Big data with Hadoop 3.x: running environment setup (hands-on cluster build)

Posted by misteraven on Thu, 28 Oct 2021 01:39:47 +0200

🌹 Before we begin

Xiao Yuan has started a series of Hadoop tutorials to take you into big data from zero; your attention is appreciated (based on the blogger's notes from the Silicon Valley Hadoop 3.x course) ❤️❤️
Part 1: A graphical overview of Hadoop for big data
Part 2: Hadoop template virtual machine configuration for big data, illustrated
Part 3: Hadoop running environment setup for big data (hands-on cluster build) — this article
Part 4: the blogger is busy preparing it

💝 Installation package preparation

❤️❤️ The blogger has prepared all the installation packages you need to build the cluster; downloading them from Alibaba Cloud is much faster!!!

🚀 1. Hadoop running environment setup (development focus)

💒 1.1 template virtual machine environment preparation

0) Install the template virtual machine: IP address 192.168.10.100, hostname hadoop100, 4 GB of memory, 50 GB of disk

1) The hadoop100 virtual machine must meet the following requirements

  • 1. Installing with yum requires that the virtual machine can access the Internet. Before running yum, test the network connection

    [root@hadoop100 ~]# ping www.baidu.com
    PING www.baidu.com (14.215.177.39) 56(84) bytes of data.
    64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=1 ttl=128 time=8.60 ms
    64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=2 ttl=128 time=7.72 ms
    
  • 2. Install epel-release
    Note: Extra Packages for Enterprise Linux (EPEL) is an extra package repository for Red Hat family operating systems (RHEL, CentOS and Scientific Linux). It provides many rpm packages that cannot be found in the official repositories.

    [root@hadoop100 ~]# yum install -y epel-release
    
  • 3. Note: if you installed the minimal Linux system, you also need to install the following tools; if you installed the standard Linux desktop edition, you can skip this step.
    net-tools: a collection of network utilities, including ifconfig and other commands

    [root@hadoop100 ~]# yum install -y net-tools 
    

    vim: Editor

    [root@hadoop100 ~]# yum install -y vim
    

2) Turn off the firewall and disable it from starting at boot

[root@hadoop100 ~]# systemctl stop firewalld
[root@hadoop100 ~]# systemctl disable firewalld.service

Note: in enterprise development, the firewall on individual servers is usually turned off; the company sets up a hardened firewall at the network boundary instead.
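An optional sanity check (CentOS 7 uses systemd, so these commands are available): confirm that the firewall service is stopped and will not come back after a reboot.

[root@hadoop100 ~]# systemctl status firewalld
[root@hadoop100 ~]# systemctl is-enabled firewalld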

3) Create an ovo user and set its password (skip this if you already created the user earlier)

[root@hadoop100 ~]# useradd ovo
[root@hadoop100 ~]# passwd ovo

4) Give the ovo user root privileges, so that sudo can be used later to run commands as root

[root@hadoop100 ~]# vim /etc/sudoers

Modify the /etc/sudoers file and add a line below the %wheel line, as follows:

## Allow root to run any commands anywhere
root    ALL=(ALL)     ALL

## Allows people in group wheel to run all commands
%wheel  ALL=(ALL)       ALL
ovo   ALL=(ALL)     NOPASSWD:ALL

Note: do not put the ovo line directly under the root line. sudoers rules are applied in order: you would configure ovo for passwordless sudo first, but when processing reaches the %wheel line that rule takes over and a password is required again. So the ovo line must go below the %wheel line.
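A quick optional check of the edit (both commands are standard on CentOS 7): visudo -c validates the sudoers syntax, and sudo -l -U ovo lists the sudo rules that now apply to the ovo user.

[root@hadoop100 ~]# visudo -c
[root@hadoop100 ~]# sudo -l -U ovo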

5) Create folders under the /opt directory and change their owner and group

  • 1. Create the module and software folders in the / opt directory

    [root@hadoop100 ~]# mkdir /opt/module
    [root@hadoop100 ~]# mkdir /opt/software
    
  • 2. Change the owner and group of the module and software folders to the ovo user

    [root@hadoop100 ~]# chown ovo:ovo /opt/module 
    [root@hadoop100 ~]# chown ovo:ovo /opt/software
    
  • 3. View the owner and group of module and software folders

    [root@hadoop100 ~]# cd /opt/
    [root@hadoop100 opt]# ll
     total 12
    drwxr-xr-x. 2 ovo  ovo  4096 Oct 24 17:18 module
    drwxr-xr-x. 2 root root 4096 Oct  7  2017 rh
    drwxr-xr-x. 2 ovo  ovo  4096 Oct 24 17:18 software
    

6) Uninstall the JDK that ships with the virtual machine. Note: if your virtual machine was installed from the minimal image, you can skip this step.

[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps 
  • rpm -qa: query all installed rpm packages
  • grep -i: ignore case
  • xargs -n1: pass only one argument at a time
  • rpm -e --nodeps: force removal of the package without checking dependencies
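Optionally, run the query again afterwards to confirm that nothing Java-related is left; the command should print no output.

[root@hadoop100 ~]# rpm -qa | grep -i java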

7) Restart the virtual machine

[root@hadoop100 ~]# reboot

🚔 1.2 cloning virtual machines

1) Using the template machine hadoop100, clone three virtual machines: hadoop102, hadoop103 and hadoop104

Note: shut down hadoop100 before cloning

  • After shutdown, right-click the virtual machine and open the clone wizard from the Manage menu
  • Select "Create a full clone"
    - Modify the virtual machine name
  • Finish the clone

2) Modify the cloned machine's IP, using hadoop102 as the example

  • 1. Modify the static IP of the cloned virtual machine

    [root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
    

    Change to

    DEVICE=ens33
    TYPE=Ethernet
    ONBOOT=yes
    BOOTPROTO=static
    NAME="ens33"
    IPADDR=192.168.10.102
    PREFIX=24
    GATEWAY=192.168.10.2
    DNS1=192.168.10.2
    
  • 2. Check the VMware virtual network editor: Edit -> Virtual Network Editor -> VMnet8

  • 3. Check the IP address of the Windows adapter "VMware Network Adapter VMnet8"

  • 4. Make sure the address in the Linux ifcfg-ens33 file, the subnet in the virtual network editor, and the Windows VMnet8 adapter address are all on the same 192.168.10.x network

3) Modify the cloned machine's hostname, again using hadoop102 as the example

  • 1. Modify host name

    [root@hadoop100 ~]# vim /etc/hostname
    hadoop102
    
  • 2. Configure the hostname mapping for the Linux clone by opening the /etc/hosts file (if you already configured this following the previous article, you can skip it!)

    [root@hadoop100 ~]# vim /etc/hosts
    

    Add the following

    192.168.10.100 hadoop100
    192.168.10.101 hadoop101
    192.168.10.102 hadoop102
    192.168.10.103 hadoop103
    192.168.10.104 hadoop104
    192.168.10.105 hadoop105
    192.168.10.106 hadoop106
    192.168.10.107 hadoop107
    192.168.10.108 hadoop108
    

4) Restart the cloned machine hadoop102

[root@hadoop100 ~]# reboot
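After the reboot, it is worth checking that the new hostname and static IP took effect (ifconfig comes from the net-tools package installed earlier; the values assume the hadoop102 example above).

[root@hadoop102 ~]# hostname
hadoop102
[root@hadoop102 ~]# ifconfig ens33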

5) Modify the Windows hosts file (if you already configured this following the previous article, you can skip it!)

  • 1. Enter C:\Windows\System32\drivers\etc

  • 2. Copy the hosts file to the desktop

  • 3. Open the desktop hosts file and add the following

    192.168.10.100 hadoop100
    192.168.10.101 hadoop101
    192.168.10.102 hadoop102
    192.168.10.103 hadoop103
    192.168.10.104 hadoop104
    192.168.10.105 hadoop105
    192.168.10.106 hadoop106
    192.168.10.107 hadoop107
    192.168.10.108 hadoop108
    
  • 4. Copy the desktop hosts file back into C:\Windows\System32\drivers\etc, overwriting the original hosts file

🎪 1.3 installing the JDK on hadoop102

1) Uninstall any existing JDK
Note: before installing the JDK, be sure to remove the virtual machine's bundled JDK first. (If you already uninstalled it earlier, you can skip this.)

[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps

2) Use the XShell transport tool to import the JDK into the software folder under the opt directory

3) Check whether the software package is successfully imported in the opt directory under the Linux system

[ovo@hadoop102 ~]$ ls /opt/software/

You should see jdk-8u212-linux-x64.tar.gz in the listing.

4) Unzip the JDK to the / opt/module directory

[ovo@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

5) Configure JDK environment variables

  • 1. Create / etc/profile.d/my_env.sh file

    [ovo@hadoop102 ~]$ sudo vim /etc/profile.d/my_env.sh
    

    Add the following

    #JAVA_HOME
    export JAVA_HOME=/opt/module/jdk1.8.0_212
    export PATH=$PATH:$JAVA_HOME/bin
    
  • 2. Exit after saving

    :wq
    
  • 3. source the /etc/profile file so that the new environment variable PATH takes effect

    [ovo@hadoop102 ~]$ source /etc/profile
    

6) Test whether the JDK is installed successfully

[ovo@hadoop102 ~]$ java -version

If the Java version information is printed, the installation succeeded.

Note: restart (if java -version can be used, there is no need to restart)

[ovo@hadoop102 ~]$ sudo reboot
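In either case, a quick sanity check that the environment variables resolved (the paths assume the install location used above):

[ovo@hadoop102 ~]$ echo $JAVA_HOME
/opt/module/jdk1.8.0_212
[ovo@hadoop102 ~]$ which java
/opt/module/jdk1.8.0_212/bin/java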

⌚ 1.4 installing Hadoop on hadoop102

1) Use the XShell file transfer tool to import hadoop-3.1.3.tar.gz into the software folder under the opt directory (just like dragging the JDK, there is no screenshot)

2) Enter the Hadoop installation package path

[ovo@hadoop102 ~]$ cd /opt/software/

3) Unzip the installation file under / opt/module

[ovo@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

4) Check whether the decompression is successful

[ovo@hadoop102 software]$ ls /opt/module/
hadoop-3.1.3

5) Add Hadoop to environment variable

  • 1. Obtain Hadoop installation path

    [ovo@hadoop102 hadoop-3.1.3]$ pwd
    /opt/module/hadoop-3.1.3
    
  • 2. Open / etc/profile.d/my_env.sh file

    [ovo@hadoop102 hadoop-3.1.3]$ sudo vim /etc/profile.d/my_env.sh
    

    Add the following at the end of the my_env.sh file (press shift+g to jump to the end):

    #HADOOP_HOME
    export HADOOP_HOME=/opt/module/hadoop-3.1.3
    export PATH=$PATH:$HADOOP_HOME/bin
    export PATH=$PATH:$HADOOP_HOME/sbin
    

    Save and exit with :wq

  • 3. Make the modified file take effect

    [ovo@hadoop102 hadoop-3.1.3]$ source /etc/profile
    

6) Test for successful installation

[ovo@hadoop102 hadoop-3.1.3]$ hadoop version
Hadoop 3.1.3

7) Restart (restart the virtual machine if the Hadoop command cannot be used)

[ovo@hadoop102 hadoop-3.1.3]$ sudo reboot

⌛ 1.5 Hadoop directory structure

1) View Hadoop directory structure

[ovo@hadoop102 hadoop-3.1.3]$ ll

2) Important directory

  • bin directory: stores scripts for operating Hadoop related services (hdfs, yarn, mapred)
  • etc Directory: Hadoop configuration file directory, which stores Hadoop configuration files
  • lib Directory: Hadoop's native libraries (used to compress and decompress data)
  • sbin Directory: stores scripts for starting or stopping Hadoop related services
  • share Directory: stores the dependent jar packages, documents, and official cases of Hadoop

⚡ 2. Hadoop operation modes

1) Hadoop official website: http://hadoop.apache.org

2) Hadoop operation modes include: local mode, pseudo-distributed mode and fully distributed mode.

  • Local mode: runs on a single machine, only for demonstrating the official examples. Not used in production.
  • Pseudo-distributed mode: also runs on a single machine, but with all the components of a Hadoop cluster; one server simulates a distributed environment. Some companies on a tight budget use it for testing; it is not used in production.
  • Fully distributed mode: multiple servers form a distributed environment. Used in production.

🌟 2.1 fully distributed operation mode (development focus)

analysis:

  • 1. Prepare 3 virtual machines (turn off firewall, static IP, host name)
  • 2. Install JDK
  • 3. Configure environment variables
  • 4. Install Hadoop
  • 5. Configure environment variables
  • 6. Configure cluster
  • 7. Start daemons individually (single-point start)
  • 8. Configure ssh
  • 9. Start the whole cluster and test it

♐ 2.2.1 virtual machine preparation

See sections 1.1 and 1.2 for details.

☁️ 2.2.2 writing cluster distribution script xsync

1) scp (secure copy)

scp definition: scp can copy data between servers. (from server1 to server2)

  • 1. Basic syntax (important!!!)

    scp      -r          $pdir/$fname             $user@$host:$pdir/$fname
    command  recursive   source path/file name    destination user@host:destination path/file name
  • 2. Case practice

    Prerequisite: the /opt/module and /opt/software directories already exist on hadoop102, hadoop103 and hadoop104, and their owner and group have been changed to ovo:ovo

    [ovo@hadoop102 ~]$ sudo chown ovo:ovo -R /opt/module
    


    (a) On hadoop102, push the /opt/module/jdk1.8.0_212 directory from hadoop102 to hadoop103.

    [ovo@hadoop102 ~]$ scp -r /opt/module/jdk1.8.0_212  ovo@hadoop103:/opt/module
    

    (b) On hadoop103, pull the /opt/module/hadoop-3.1.3 directory from hadoop102 to hadoop103.

    [ovo@hadoop103 ~]$ scp -r ovo@hadoop102:/opt/module/hadoop-3.1.3 /opt/module/
    

    (c) On hadoop103, copy everything under the /opt/module directory of hadoop102 to hadoop104.

    [ovo@hadoop103 opt]$ scp -r ovo@hadoop102:/opt/module/* ovo@hadoop104:/opt/module
    

2) rsync remote synchronization tool

rsync is mainly used for backup and mirroring. It is fast, avoids copying identical content, and supports symbolic links.
Difference between rsync and scp: rsync is faster than scp because it only transfers the files that differ, while scp copies everything.

  • 1. Basic syntax

    rsync    -av        $pdir/$fname             $user@$host:$pdir/$fname
    command  options    source path/file name    destination user@host:destination path/file name

    Description of the options:

    Option   Function
    -a       archive mode copy
    -v       show the transfer progress
  • 2. Case practice

    (a) Delete / opt/module/hadoop-3.1.3/wcinput in Hadoop 103

    [ovo@hadoop103 hadoop-3.1.3]$ rm -rf wcinput/
    

    (b) Synchronize / opt/module/hadoop-3.1.3 in Hadoop 102 to Hadoop 103

    [ovo@hadoop102 module]$ rsync -av hadoop-3.1.3/ ovo@hadoop103:/opt/module/hadoop-3.1.3/
    

3) xsync cluster distribution script

  • 1. Requirement: loop over all cluster nodes and copy a file to the same directory on each

  • 2. Requirement analysis:
    (a) The raw rsync command to copy would be:

    rsync  -av     /opt/module  		ovo@hadoop103:/opt/
    

    (b) Desired usage: xsync <file(s) to synchronize>
    (c) The script should be usable from any directory, so it is placed in a directory that is on the global PATH

    [ovo@hadoop102 ~]$ echo $PATH
    /usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ovo/.local/bin:/home/ovo/bin:/opt/module/jdk1.8.0_212/bin
    
  • 3. Script implementation

    (a) Create an xsync file in the / home/ovo/bin directory

    [ovo@hadoop102 opt]$ cd /home/ovo
    [ovo@hadoop102 ~]$ mkdir bin
    [ovo@hadoop102 ~]$ cd bin
    [ovo@hadoop102 bin]$ vim xsync
    

    Write the following code in this file

    #!/bin/bash
    
    #1. Check the number of arguments
    if [ $# -lt 1 ]
    then
        echo Not Enough Arguments!
        exit;
    fi
    
    #2. Loop over every machine in the cluster
    for host in hadoop102 hadoop103 hadoop104
    do
        echo ====================  $host  ====================
        #3. Loop over all the files/directories and send them one by one
    
        for file in $@
        do
            #4. Check whether the file exists
            if [ -e $file ]
                then
                    #5. Get the parent directory
                    pdir=$(cd -P $(dirname $file); pwd)
    
                    #6. Get the name of the current file
                    fname=$(basename $file)
                    ssh $host "mkdir -p $pdir"
                    rsync -av $pdir/$fname $host:$pdir
                else
                    echo $file does not exist!
            fi
        done
    done
    

    (b) Give the xsync script execute permission

    [ovo@hadoop102 bin]$ chmod +x xsync
    

    (c) Test script

    [ovo@hadoop102 ~]$ xsync /home/ovo/bin
    

    (d) Copy the script to / bin for global invocation

    [ovo@hadoop102 bin]$ sudo cp xsync /bin/
    

    (e) Synchronize environment variable configuration (root owner)

    [ovo@hadoop102 ~]$ sudo ./bin/xsync /etc/profile.d/my_env.sh
    

    Note: when using sudo, you must give xsync its full (relative or absolute) path.
    Make the environment variables take effect

    [ovo@hadoop103 bin]$ source /etc/profile
    [ovo@hadoop104 opt]$ source /etc/profile
    

🎈 2.2.3 passwordless SSH login configuration

1) Configure ssh

  • 1. Basic syntax: ssh <hostname of the other machine>

  • 2. What to do when "Host key verification failed" (or a host-key prompt) appears during an SSH connection

    [ovo@hadoop102 ~]$ ssh hadoop103
    

    If the following appears

    Are you sure you want to continue connecting (yes/no)? 
    

    Type yes and press Enter

  • 3. Return to Hadoop 102

    [ovo@hadoop103 ~]$ exit
    

2) Passwordless login configuration

  • 1. How passwordless login works

  • 2. Generate public key and private key

    [ovo@hadoop102 .ssh]$ pwd
    /home/ovo/.ssh
    
    [ovo@hadoop102 .ssh]$ ssh-keygen -t rsa
    

    Then press Enter three times; two files will be generated: id_rsa (the private key) and id_rsa.pub (the public key)

  • 3. Copy the public key to the target machine for password free login

    [ovo@hadoop102 .ssh]$ ssh-copy-id hadoop102
    [ovo@hadoop102 .ssh]$ ssh-copy-id hadoop103
    [ovo@hadoop102 .ssh]$ ssh-copy-id hadoop104
    

    Note:

    • You also need to repeat this for the ovo account on hadoop103 so that it can log in to hadoop102, hadoop103 and hadoop104 without a password.
    • You also need to repeat this for the ovo account on hadoop104 so that it can log in to hadoop102, hadoop103 and hadoop104 without a password.
    • You also need to repeat this for the root account on hadoop102 so that it can log in to hadoop102, hadoop103 and hadoop104 without a password (see the sketch below).
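    A minimal sketch of that repetition, assuming ssh-keygen -t rsa has already been run for the account in question (run it once as ovo on hadoop103, once as ovo on hadoop104, and once as root on hadoop102):

    # distribute this account's public key to all three nodes
    for host in hadoop102 hadoop103 hadoop104
    do
        ssh-copy-id "$host"
    done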

💐 2.2.4 cluster configuration

1) Cluster deployment planning

Note:

  • NameNode and SecondaryNameNode should not be installed on the same server
  • ResourceManager also consumes a lot of memory, so it should not be placed on the same machine as NameNode or SecondaryNameNode

The plan used in the rest of this article (it matches the configuration files below):

hadoop102: NameNode, DataNode, NodeManager
hadoop103: ResourceManager, DataNode, NodeManager
hadoop104: SecondaryNameNode, DataNode, NodeManager

2) Configuration file description

Hadoop configuration files come in two kinds: default configuration files and custom configuration files. Only when you want to change a default value do you need to edit the custom configuration file and set the corresponding property.

  • Default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml, packaged inside the corresponding Hadoop jars
  • Custom configuration files: core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml, located under $HADOOP_HOME/etc/hadoop; users modify them according to the project's requirements

3) Configure cluster

  • 1. Core configuration file: core-site.xml

    [ovo@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
    [ovo@hadoop102 hadoop]$ vim core-site.xml
    

    The contents of the file are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
        <!-- Specify the address of the NameNode -->
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://hadoop102:8020</value>
        </property>
    
        <!-- Specify the directory where Hadoop stores its data -->
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/opt/module/hadoop-3.1.3/data</value>
        </property>
    
        <!-- Configure ovo as the static user for HDFS web UI access -->
        <property>
            <name>hadoop.http.staticuser.user</name>
            <value>ovo</value>
        </property>
    </configuration>
    
  • 2.HDFS configuration file: hdfs-site.xml

    [ovo@hadoop102 hadoop]$ vim hdfs-site.xml
    

    The contents of the file are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
    	<!-- NameNode (nn) web UI address -->
    	<property>
            <name>dfs.namenode.http-address</name>
            <value>hadoop102:9870</value>
        </property>
    	<!-- SecondaryNameNode (2nn) web UI address -->
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>hadoop104:9868</value>
        </property>
    </configuration>
    
  • 3.YARN configuration file: yarn-site.xml

    [ovo@hadoop102 hadoop]$ vim yarn-site.xml
    

    The contents of the file are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
        <!-- Specify the shuffle service for MapReduce -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    
        <!-- Specify the address of the ResourceManager -->
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop103</value>
        </property>
    
        <!-- Inheritance of environment variables -->
        <property>
            <name>yarn.nodemanager.env-whitelist</name>
            <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>
    </configuration>
    
  • 4.MapReduce configuration file: mapred-site.xml

    [ovo@hadoop102 hadoop]$ vim mapred-site.xml
    

    The contents of the file are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
    	<!-- Run MapReduce programs on YARN -->
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
    

4) Distribute the configured Hadoop configuration file on the cluster

[ovo@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/

5) Check on hadoop103 and hadoop104 that the files were distributed

[ovo@hadoop103 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
[ovo@hadoop104 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

🎁 2.2.5 starting the cluster

1) Configure workers

[ovo@hadoop102 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

Add the following to the file:

hadoop102
hadoop103
hadoop104

Note: no space is allowed at the end of the content added in the file, and no empty line is allowed in the file.
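An optional way to spot accidental trailing spaces or blank lines: cat -A (from GNU coreutils) prints a $ at the end of every line, so stray whitespace stands out.

[ovo@hadoop102 hadoop]$ cat -A /opt/module/hadoop-3.1.3/etc/hadoop/workers
hadoop102$
hadoop103$
hadoop104$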

Synchronize the configuration files to all nodes

[ovo@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc

2) Start cluster

  • 1. If the cluster is being started for the first time, format the NameNode on the hadoop102 node

    Note: formatting the NameNode generates a new cluster ID. If the DataNodes still carry the old ID, the NameNode and DataNode cluster IDs no longer match and the cluster cannot find its previous data.

    Important: if the cluster fails while running and the NameNode has to be reformatted, be sure to stop the NameNode and DataNode processes first, delete the data and logs directories on all machines, and only then format (see the cleanup sketch after this list).

    [ovo@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
    
  • 2. Start HDFS

    [ovo@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
    
  • 3. Start YARN on the node where the ResourceManager is configured (note: hadoop103)

    [ovo@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
    
  • 4. View the NameNode of HDFS on the web side
    (a) Enter in the browser: http://hadoop102:9870
    (b) View data information stored on HDFS

  • 5. View YARN's ResourceManager on the web side
    (a) Enter in the browser: http://hadoop103:8088
    (b) View Job information running on YARN
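The cleanup mentioned in the formatting note above can be scripted. A minimal sketch, assuming the directory layout used in this article and that all NameNode/DataNode processes have already been stopped:

#!/bin/bash
# wipe the HDFS data and logs on every node before re-formatting the NameNode
for host in hadoop102 hadoop103 hadoop104
do
    ssh "$host" "rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs"
done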

😉 2.2.6 configuring the history server

To be able to look back at how programs ran historically, you need to configure the history server. The specific steps are as follows:

1) Configure mapred-site.xml

[ovo@hadoop102 hadoop]$ vim mapred-site.xml

Add the following configuration to this file.

<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>

<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>

2) Distribute the configuration

[ovo@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/mapred-site.xml

3) Start the history server on hadoop102

[ovo@hadoop102 hadoop]$ mapred --daemon start historyserver

4) Check whether the history server is started

[ovo@hadoop102 hadoop]$ jps

5) To view JobHistory: http://hadoop102:19888/jobhistory

🌈 2.2.7 configuring log aggregation

Log aggregation concept: after an application finishes running, its log information is uploaded to HDFS.

Benefit of log aggregation: you can conveniently inspect the details of a program run, which is helpful for development and debugging.

Note: to enable the log aggregation function, you need to restart NodeManager, ResourceManager and HistoryServer.

The specific steps to enable log aggregation are as follows:

1) Configure yarn-site.xml

[ovo@hadoop102 hadoop]$ vim yarn-site.xml

Add the following configuration to this file.

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set log aggregation server address -->
<property>  
    <name>yarn.log.server.url</name>  
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Set the log retention time to 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

2) Distribute the configuration

[ovo@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/yarn-site.xml

3) Close NodeManager, ResourceManager, and HistoryServer

[ovo@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
[ovo@hadoop102 hadoop-3.1.3]$ mapred --daemon stop historyserver

4) Start NodeManager, ResourceManager, and HistoryServer

[ovo@hadoop103 ~]$ start-yarn.sh
[ovo@hadoop102 ~]$ mapred --daemon start historyserver

5) Delete the existing output file on HDFS

[ovo@hadoop102 ~]$ hadoop fs -rm -r /output

6) Run the WordCount example program (use whatever input/output paths you like)

[ovo@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output

7) View log
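The aggregated logs can be browsed from the history server page, or pulled from the command line with the yarn CLI; the application ID below is a placeholder, copy the real one from the YARN web UI (http://hadoop103:8088).

[ovo@hadoop102 ~]$ yarn logs -applicationId <application_id>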

🎸 2.2.8 summary of cluster start / stop modes

1) Start / stop each module separately (requires ssh to be configured)

  • 1. Overall start / stop of HDFS

    start-dfs.sh/stop-dfs.sh
    
  • 2. Overall start / stop of YARN

    start-yarn.sh/stop-yarn.sh
    

2) Each service component starts / stops one by one

  • 1. Start / stop HDFS components respectively

    hdfs --daemon start/stop namenode/datanode/secondarynamenode
    
  • 2. Start / stop YARN

    yarn --daemon start/stop  resourcemanager/nodemanager
    

🎉 2.2.9 writing common scripts for Hadoop clusters

1) Hadoop cluster start/stop script (covering HDFS, YARN and the history server): myhadoop.sh

[ovo@hadoop102 ~]$ cd /home/ovo/bin
[ovo@hadoop102 bin]$ vim myhadoop.sh

Enter the following

#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
        echo " =================== start-up hadoop colony ==================="

        echo " --------------- start-up hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- start-up yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- start-up historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== close hadoop colony ==================="

        echo " --------------- close historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- close yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- close hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

Save and exit, then make the script executable

[ovo@hadoop102 bin]$ chmod +x myhadoop.sh
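Typical usage once the script is on the PATH (it is distributed to all three machines in step 3 below):

[ovo@hadoop102 ~]$ myhadoop.sh start
[ovo@hadoop102 ~]$ myhadoop.sh stop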

2) Script to view the Java processes on all three servers: jpsall

[ovo@hadoop102 ~]$ cd /home/ovo/bin
[ovo@hadoop102 bin]$ vim jpsall

Enter the following

#!/bin/bash

for host in hadoop102 hadoop103 hadoop104
do
        echo =============== $host ===============
        ssh $host jps 
done

Save and exit, then make the script executable

[ovo@hadoop102 bin]$ chmod +x jpsall

3) Distribute the / home/ovo/bin directory to ensure that custom scripts can be used on all three machines

[ovo@hadoop102 ~]$ xsync /home/ovo/bin/

👏 2.2.10 cluster time synchronization

If the servers can reach the public Internet, cluster time synchronization is not required, because every server periodically calibrates itself against public time servers;

If the servers are on an isolated intranet, cluster time synchronization must be configured, otherwise the clocks drift apart over time and the cluster ends up running tasks out of step.

1) Requirement

Pick one machine to act as the time server, and have all the other machines synchronize with it periodically. In production the sync interval depends on how time-sensitive the tasks are; to see the effect quickly, the test environment here synchronizes once a minute.

2) Time server configuration (must be root)

  • 1. Check the ntpd service status on all nodes, and whether it is enabled to start at boot

    [ovo@hadoop102 ~]$ sudo systemctl status ntpd
    [ovo@hadoop102 ~]$ sudo systemctl start ntpd
    [ovo@hadoop102 ~]$ sudo systemctl is-enabled ntpd
    
  • 2. Modify the /etc/ntp.conf configuration file on hadoop102

    [ovo@hadoop102 ~]$ sudo vim /etc/ntp.conf
    

    Make the following changes:

    (a) Change 1: authorize all machines on the 192.168.10.0-192.168.10.255 network segment to query and synchronize time from this machine. Change

    #restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
    

    to (remove the # and change the segment to 192.168.10.0)

    restrict 192.168.10.0 mask 255.255.255.0 nomodify notrap
    

    (b) Change 2: the cluster is on an isolated LAN and should not use Internet time sources. Change

    server 0.centos.pool.ntp.org iburst
    server 1.centos.pool.ntp.org iburst
    server 2.centos.pool.ntp.org iburst
    server 3.centos.pool.ntp.org iburst
    

    to (comment all four lines out with a # sign)

    #server 0.centos.pool.ntp.org iburst
    #server 1.centos.pool.ntp.org iburst
    #server 2.centos.pool.ntp.org iburst
    #server 3.centos.pool.ntp.org iburst
    

    (c) Addition 3: add the following so that, if this node loses its network connection, it can still use its local clock as the time source for the other nodes in the cluster

    server 127.127.1.0
    fudge 127.127.1.0 stratum 10
    
  • 3. Modify the /etc/sysconfig/ntpd file on hadoop102

    [ovo@hadoop102 ~]$ sudo vim /etc/sysconfig/ntpd
    

    Add the following contents (synchronize the hardware time with the system time)

    SYNC_HWCLOCK=yes
    
  • 4. Restart ntpd service

    [ovo@hadoop102 ~]$ sudo systemctl start ntpd
    
  • 5. Enable the ntpd service to start at boot

    [ovo@hadoop102 ~]$ sudo systemctl enable ntpd
    

3) Other machine configurations (must be root)

  • 1. Stop the ntpd service on all other nodes and disable it at boot

    [ovo@hadoop103 ~]$ sudo systemctl stop ntpd
    [ovo@hadoop103 ~]$ sudo systemctl disable ntpd
    [ovo@hadoop104 ~]$ sudo systemctl stop ntpd
    [ovo@hadoop104 ~]$ sudo systemctl disable ntpd
    
  • 2. Configure other machines to synchronize with the time server once a minute

    [ovo@hadoop103 ~]$ sudo crontab -e
    

    The scheduled tasks are as follows:

    */1 * * * * /usr/sbin/ntpdate hadoop102
    
  • 3. Change the time on one of the other machines (to test the sync)

    [ovo@hadoop103 ~]$ sudo date -s "2022-9-11 11:11:11"
    
  • 4. One minute later, check whether that machine has synchronized with the time server

    [ovo@hadoop103 ~]$ sudo date
    

⭐ 2.2.11 frequently asked questions

1) Common port number

Port name                                      Hadoop 2.x    Hadoop 3.x
HDFS NameNode internal communication port      8020/9000     8020/9000/9820
HDFS NameNode web UI port (for user queries)   50070         9870
YARN web UI port for viewing running tasks     8088          8088
History server communication port              19888         19888
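To confirm that a daemon is actually listening on one of these ports, a quick check on the relevant host helps (here the NameNode web port on hadoop102; ss is part of iproute2, and netstat from net-tools works the same way):

[ovo@hadoop102 ~]$ sudo ss -lntp | grep 9870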

2) Common configuration files

Hadoop 2.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, slaves
Hadoop 3.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, workers

🌸 3. Common errors and solutions

1) The firewall was not turned off, or YARN was not started

INFO client.RMProxy: Connecting to ResourceManager at hadoop108/192.168.10.108:8032

2) Host name configuration error

3) IP address configuration error

4) ssh is not configured properly

5) The cluster was started with a mix of root and ovo users (inconsistent users)

6) Careless modification of configuration file

7) Unrecognized host name

java.net.UnknownHostException: hadoop102: hadoop102
        at java.net.InetAddress.getLocalHost(InetAddress.java:1475)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:146)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)

Solution:

  • (1) Add "192.168.10.102 hadoop102" to the /etc/hosts file
  • (2) Do not use special names such as hadoop or hadoop000 for the hostname

8) Only one of the DataNode and NameNode processes runs at a time (typically a cluster ID mismatch; see the formatting note in 2.2.5).
9) A command has no effect: when a command is pasted from Word, ordinary hyphens may have been replaced by long dashes, which makes the command fail.

Solution: try not to paste code from Word.

10) jps shows that a process is not running, yet restarting the cluster reports that the process is already started.

Reason: temporary files for the previously started processes are left in the /tmp directory under the Linux root. Delete the cluster-related temporary files and restart the cluster.

11) jps is not recognized

Reason: the global environment variables for Hadoop and Java have not taken effect. Solution: run source /etc/profile.

12) Port 8088 cannot be reached

[ovo@hadoop102 desktop]$ cat /etc/hosts

Comment out the following lines

#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         hadoop102

Topics: Big Data Hadoop