Hadoop installation complete

Posted by cesarcesar on Mon, 07 Feb 2022 09:17:24 +0100

Hadoop installation on Linux (stand-alone)

Download Hadoop

Hadoop 3.x download address: http://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/

Upload to Linux via FTP

Extract the archive

tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

Configure HADOOP environment variables

Create custom profile

vim /etc/profile.d/my_env.sh

Configure HADOOP_HOME environment variable

export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Make configuration effective

source /etc/profile

HADOOP local operation

Create test file

mkdir wcinput
cd wcinput
vim word.txt
# Enter the following test contents
hadoop yarn
hadoop mapreduce
# Save and exit: :wq

Execute test file

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount wcinput/ wcoutput

View results

cat wcoutput/part-r-00000

Hadoop installation on a Linux cluster

Write cluster distribution script

scp secure copy

  • scp can copy data between servers.
  • Basic syntax: scp -r $pdir/$fname $user@$host:$pdir/$fname
  • Example: scp -r /opt/module/jdk1.8.0_212 root@hadoop103:/opt/module


rsync remote synchronization

  • rsync is mainly used for backup and mirroring. It is fast, avoids copying unchanged content, and supports symbolic links.
  • Copying files with rsync is faster than with scp: rsync only transfers files that differ, while scp copies everything.
  • Basic syntax: rsync -av $pdir/$fname $user@$host:$pdir/$fname (-a: archive copy; -v: show the copy process)
  • Example: rsync -av hadoop-3.1.3/ root@hadoop103:/opt/module/hadoop-3.1.3/


Synchronize files by script

The script should be usable from any path, so place it in a directory that is on the PATH declared in the global environment variables.

Create an xsync file in the /home/root/bin directory (vim xsync). The script code is as follows:

#!/bin/bash

# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Argument!
    exit;
fi

# 2. Traverse all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ==================== $host ====================
    # 3. Traverse all files/directories and send them one by one
    for file in $@
    do
        # 4. Check whether the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the name of the current file
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done

Grant the script executable permission: chmod 777 xsync

Configure ssh password free login in the cluster

Normal ssh connection login

Basic syntax: ssh <IP address of the other machine>

Generate public and private keys

Command: ssh-keygen -t rsa, then press Enter three times; two files will be generated: id_rsa (private key) and id_rsa.pub (public key)

Copy the public key to the target machine for password free login

Command: ssh-copy-id <target machine IP>

Hadoop cluster configuration

Cluster deployment planning

  • NameNode and SecondaryNameNode should not be installed on the same server
  • The ResourceManager also consumes a lot of memory and should not be placed on the same machine as the NameNode or SecondaryNameNode.

Profile description

Hadoop configuration files come in two types: default configuration files and custom configuration files. A user only needs to edit a custom configuration file when overriding a default value.

Default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml, mapred-default.xml

Custom configuration files:

  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml
  • mapred-site.xml
  • These four configuration files are stored under $HADOOP_HOME/etc/hadoop; users can modify them according to project requirements.

Hadoop cluster configuration core

Core configuration file: core-site.xml

# Modify custom profile
vim core-site.xml
<!-- The file contents are as follows -->
     <!-- Specify the NameNode address -->
     <!-- Specify the Hadoop data storage directory -->
     <!-- Configure the static user for HDFS web login as atguigu -->
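The property elements themselves did not survive in this post; a minimal core-site.xml matching the comments above is sketched below. The NameNode host (hadoop102), port 8020, and the data directory path follow this post's cluster layout and are assumptions to adjust for your environment.

```xml
<configuration>
    <!-- Specify the NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Specify the Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>
    <!-- Static user for HDFS web login -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>atguigu</value>
    </property>
</configuration>
```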

HDFS configuration file: hdfs-site.xml

# Modify custom profile
vim hdfs-site.xml
<!-- The file contents are as follows -->
    <!-- NameNode web access address -->
    <!-- SecondaryNameNode web access address -->
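The actual properties are missing from this post; a sketch consistent with the comments above follows. The NameNode on hadoop102:9870 matches the web URL given later in this post; placing the SecondaryNameNode on hadoop104:9868 is an assumption consistent with the deployment plan (NN and 2NN on different servers).

```xml
<configuration>
    <!-- NameNode web access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- SecondaryNameNode web access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>
```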

YARN configuration file: yarn-site.xml

# Modify custom profile
vim yarn-site.xml
<!-- The file contents are as follows -->
     <!-- Specify shuffle for MapReduce -->
     <!-- Specify the ResourceManager address -->
     <!-- Environment variable inheritance -->
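The property bodies were lost here; the sketch below matches the comments above. The ResourceManager on hadoop103 matches the web URL given later in this post; the env-whitelist value is the one commonly used with Hadoop 3.1.x.

```xml
<configuration>
    <!-- Specify shuffle for MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Environment variable inheritance -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
```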


MapReduce configuration file: mapred-site.xml

# Modify custom profile
vim mapred-site.xml
	<!-- Specify that the MapReduce program runs on YARN -->
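The property itself is missing from this post; the standard setting that makes MapReduce jobs run on YARN is:

```xml
<configuration>
    <!-- Specify that the MapReduce program runs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```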

Configure workers

  • No trailing spaces are allowed in the file, and it must not contain blank lines.
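The post does not show the workers file itself; for the three-node cluster used throughout (hadoop102-104), $HADOOP_HOME/etc/hadoop/workers would contain exactly:

```
hadoop102
hadoop103
hadoop104
```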

The cluster synchronizes all node profiles

Start Hadoop cluster

Initial formatting

  • If the cluster is being started for the first time, format the NameNode on the hadoop102 node; this generates the data and logs directories, which indicates formatting succeeded.

  • Formatting the NameNode generates a new cluster ID. If the DataNodes still hold the old ID, the cluster IDs of NameNode and DataNode become inconsistent and the cluster cannot find its past data. If the cluster reports errors while running and the NameNode must be reformatted, be sure to stop the NameNode and DataNode processes first, and delete the data and logs directories on all machines before formatting.

    Format command: hdfs namenode -format

Start HDFS

Command: sbin/start-dfs.sh

If startup fails with an error saying the HDFS user variables are not defined, edit the start and stop scripts respectively, add the definitions at the top, and restart:

$ vim sbin/start-dfs.sh
$ vim sbin/stop-dfs.sh
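The variable definitions to add were not preserved in this post; when running HDFS as root, the lines typically added at the top of start-dfs.sh and stop-dfs.sh are:

```shell
# Assumed fix for "user not defined" errors when running HDFS as root:
# declare which user runs each HDFS daemon
HDFS_NAMENODE_USER=root
HDFS_DATANODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
```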


Similarly, if start-yarn.sh reports an error, add the corresponding configuration:

$ vim sbin/start-yarn.sh
$ vim sbin/stop-yarn.sh
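As above, the exact lines are not preserved; when running YARN as root, the variables typically added at the top of start-yarn.sh and stop-yarn.sh are:

```shell
# Assumed fix for "user not defined" errors when running YARN as root:
# declare which user runs each YARN daemon
YARN_RESOURCEMANAGER_USER=root
YARN_NODEMANAGER_USER=root
```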


After modifying the configuration, delete the data and logs directories and reformat the NameNode: hdfs namenode -format

On the node where the NameNode is configured, start HDFS: sbin/start-dfs.sh

On the node where the ResourceManager is configured, start YARN: sbin/start-yarn.sh

View the HDFS NameNode web UI in a browser: http://hadoop102:9870/

View the YARN ResourceManager web UI in a browser: http://hadoop103:8088/

Cluster basic function test

Upload files to cluster

Command: hadoop fs -put <local file path> <cluster directory>

hadoop fs -put wcinput/word.txt /input

Execute the wordcount program

Command: hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount <source directory> <destination directory>

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output

If errors are reported when running the job on the cluster (the original error text is not preserved here), configure the Hadoop home directory in mapred-site.xml and restart the cluster.
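Since the error text is missing, the exact fix cannot be confirmed; a common cause with the example jar on YARN is that HADOOP_MAPRED_HOME is not visible to the containers, fixed by adding the following to mapred-site.xml (the path follows this post's install location):

```xml
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3</value>
</property>
```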


Configure history server

In order to view the historical operation of the program, you need to configure the history server.

Configure mapred-site.xml

# Modify custom profile
vim mapred-site.xml
<!-- Add the following file contents -->
    <!-- Historical server address -->
    <!-- History server web address -->
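The property bodies were lost here; matching the comments above and the hadoop102:19888 address used later in this post, the additions to mapred-site.xml would look roughly like:

```xml
<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>
<!-- History server web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>
```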

Cluster distribution configuration

xsync mapred-site.xml

Start the history server on the NameNode server

mapred --daemon start historyserver

Check whether the history server is started: jps


Browser access http://hadoop102:19888/jobhistory

Configure log aggregation

Log aggregation concept: after the application runs, upload the program running log information to the HDFS system.

Benefits of log aggregation function: you can easily view the details of program operation, which is convenient for development and debugging.

To enable the log aggregation function, you need to restart NodeManager, ResourceManager and HistoryServer.

Configure yarn-site.xml

The configuration on the NameNode server is as follows

# Modify custom profile
vim yarn-site.xml
<!-- Add the following file contents -->
    <!-- Enable log aggregation -->
    <!-- Set log aggregation server address -->
    <!-- Set the log retention time to 7 days -->
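The property bodies were lost here; a sketch matching the comments above (log server URL derived from the history server address used in this post; 604800 seconds = 7 days):

```xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set the log aggregation server address -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Retain logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
```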

Cluster distribution configuration

xsync yarn-site.xml

Restart NodeManager, ResourceManager, and HistoryServer

# stop
sbin/stop-yarn.sh
mapred --daemon stop historyserver
# start
sbin/start-yarn.sh
mapred --daemon start historyserver

Delete the previous output directory, re-execute the wordcount program, and view the logs

Historical server address http://hadoop102:19888/jobhistory

Summary of cluster start / stop modes

Each module starts / stops separately (ssh configuration is a prerequisite)

Overall start / stop of HDFS

start-dfs.sh / stop-dfs.sh
Overall start / stop of YARN

start-yarn.sh / stop-yarn.sh
Each service component starts / stops one by one

Start / stop HDFS components respectively

hdfs --daemon start/stop namenode/datanode/secondarynamenode

Start / stop YARN components respectively

yarn --daemon start/stop resourcemanager/nodemanager

Write common scripts for Hadoop cluster

Hadoop cluster startup and shutdown script (including HDFS, YARN and HistoryServer): myhadoop.sh

cd /root/bin
vim myhadoop.sh

The script code is as follows

#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
    echo " =================== starting hadoop cluster ==================="

    echo " --------------- starting hdfs ---------------"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " =================== stopping hadoop cluster ==================="

    echo " --------------- stopping historyserver ---------------"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

Grant script execution permission

chmod +x myhadoop.sh 

Cluster distribution ensures that user-defined scripts can be used on cluster machines

xsync myhadoop.sh

Write a script to view the Java processes on every cluster server: jpsall.sh

cd /root/bin
vim jpsall.sh

The script code is as follows

#!/bin/bash

for host in hadoop102 hadoop103 hadoop104
do
    echo =============== $host ===============
    ssh $host jps
done

Grant script execution permission

chmod +x jpsall.sh 

Cluster distribution ensures that user-defined scripts can be used on cluster machines

xsync jpsall.sh

Topics: Big Data Hadoop hdfs mapreduce