Hadoop installation and deployment

Posted by ma5ect on Wed, 29 Dec 2021 15:04:18 +0100

Download and install

Official website: https://hadoop.apache.org/

System: CentOS 7

Download hadoop-3.1.3.tar.gz

Extract the archive to /opt/module.

Then configure the environment variables:

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
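
For these exports to take effect, they need to live somewhere the shell loads them from. One common convention (an assumption here, not something Hadoop requires) is a dedicated file under /etc/profile.d/:

# Assumption: the exports above are saved in /etc/profile.d/my_env.sh
sudo vim /etc/profile.d/my_env.sh
source /etc/profile
hadoop version    # should report Hadoop 3.1.3 if PATH is set correctly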

Three servers are used here, with hostnames hadoop102, hadoop103, and hadoop104.

Cluster configuration

1) Cluster deployment planning

Note:

- NameNode and SecondaryNameNode should not be installed on the same server.

- The ResourceManager also consumes a lot of memory, so it should not run on the same machine as the NameNode or SecondaryNameNode.

        hadoop102             hadoop103                       hadoop104
HDFS    NameNode, DataNode    DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager           ResourceManager, NodeManager    NodeManager

2) Configuration file description

Hadoop configuration files come in two types: default configuration files and custom configuration files. A default value only needs to be overridden in the corresponding custom file when you want to change it.

(1) Default configuration files:

Default file            Location inside the Hadoop jars
core-default.xml        hadoop-common-3.1.3.jar/core-default.xml
hdfs-default.xml        hadoop-hdfs-3.1.3.jar/hdfs-default.xml
yarn-default.xml        hadoop-yarn-common-3.1.3.jar/yarn-default.xml
mapred-default.xml      hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml
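
To read one of these default files, the jar can be inspected directly (a quick sketch; the jar path assumes the standard share/hadoop layout of the binary distribution):

cd $HADOOP_HOME
# Print core-default.xml straight from the jar without unpacking it
unzip -p share/hadoop/common/hadoop-common-3.1.3.jar core-default.xml | less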

(2) Custom configuration files:

core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop; modify their configuration there according to project requirements.

3) Configure the cluster

(1) Core configuration file

Configure core-site.xml:

cd $HADOOP_HOME/etc/hadoop
vim core-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <!-- Specify the NameNode address -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop102:8020</value>
  </property>
  <!-- Specify the Hadoop data storage directory -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/module/hadoop-3.1.3/data</value>
  </property>
  <!-- Configure the static user used for HDFS web UI login -->
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>username</value>
  </property>
</configuration>

Note: username is the user name you use to log in to CentOS 7.
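
After the file has been edited (and later distributed), hdfs getconf gives a quick check that Hadoop actually picks up the value:

hdfs getconf -confKey fs.defaultFS
# expected: hdfs://hadoop102:8020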

(2) HDFS configuration file

Configure hdfs-site.xml:

 vim hdfs-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration> 
  <!-- NameNode web UI address -->
  <property>
    <name>dfs.namenode.http-address</name> 
    <value>hadoop102:9870</value> 
  </property>  
  <!-- SecondaryNameNode web UI address -->
  <property>
    <name>dfs.namenode.secondary.http-address</name> 
    <value>hadoop104:9868</value> 
  </property> 
</configuration>

(3) YARN configuration file

Configure yarn-site.xml:

vim yarn-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration> 
  <!-- Specify the MapReduce shuffle service -->
  <property> 
    <name>yarn.nodemanager.aux-services</name>  
    <value>mapreduce_shuffle</value> 
  </property>  
  <!-- Specify the ResourceManager address -->
  <property> 
    <name>yarn.resourcemanager.hostname</name>  
    <value>hadoop103</value> 
  </property>  
  <!-- Inheritance of environment variables -->  
  <property> 
    <name>yarn.nodemanager.env-whitelist</name>  
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value> 
  </property> 
</configuration>

(4) MapReduce configuration file

Configure mapred-site.xml:

vim mapred-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration> 
  <!-- Run MapReduce programs on YARN -->
  <property> 
    <name>mapreduce.framework.name</name>  
    <value>yarn</value> 
  </property> 
</configuration>

4) Distribute the configured Hadoop configuration files across the cluster

xsync /opt/module/hadoop-3.1.3/etc/hadoop/
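
Note that xsync is not part of Hadoop; it is a custom rsync wrapper typically placed in ~/bin. A minimal sketch, assuming rsync is installed on every node, passwordless ssh to hadoop103 and hadoop104 is configured, and the script is run from hadoop102:

#!/bin/bash
# Minimal xsync sketch: copy each argument to the other cluster nodes with rsync.
if [ $# -lt 1 ]; then
    echo "Usage: xsync <file-or-dir>..."
    exit 1
fi
for host in hadoop103 hadoop104; do
    echo "====================  $host  ===================="
    for file in "$@"; do
        if [ -e "$file" ]; then
            pdir=$(cd -P "$(dirname "$file")" && pwd)   # resolve the parent directory
            fname=$(basename "$file")
            ssh "$host" "mkdir -p $pdir"
            rsync -av "$pdir/$fname" "$host:$pdir"
        else
            echo "$file does not exist!"
        fi
    done
done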

5) Go to hadoop103 and hadoop104 to check that the files were distributed

cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

Start the cluster together

1) Configure workers

vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

Add the following to the file:

hadoop102
hadoop103
hadoop104

Note: the entries in this file must have no trailing spaces, and the file must not contain blank lines.

Synchronize the configuration files to all nodes:

xsync /opt/module/hadoop-3.1.3/etc

2) Start the cluster

(1) If the cluster is being started for the first time, format the NameNode on the hadoop102 node. (Note: formatting the NameNode generates a new cluster ID; if the DataNodes still carry the old ID, the IDs no longer match and the cluster cannot find its past data. If the cluster reports errors while running and the NameNode has to be reformatted, be sure to stop the namenode and datanode processes first, delete the data and logs directories on all machines, and only then format.)

hdfs namenode -format
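
If a reformat does become necessary, the cleanup described above amounts to the following (destructive: it erases all HDFS data; the paths assume the data directory configured in core-site.xml plus the default logs location):

# Run on EVERY node before reformatting
rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs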

(2) Start HDFS (on hadoop102, from the Hadoop installation directory)

sbin/start-dfs.sh

(3) Start YARN on the node where the ResourceManager is configured (hadoop103)

sbin/start-yarn.sh

(4) View the HDFS NameNode on the web UI

(a) Enter in the browser: http://hadoop102:9870

(b) View the data stored on HDFS

(5) View YARN's ResourceManager on the web UI

(a) Enter in the browser: http://hadoop103:8088

(b) View the Job information running on YARN

3) Basic cluster test

(1) Upload files to cluster

Upload a small file:

hadoop fs -mkdir /input
hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
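
This assumes $HADOOP_HOME/wcinput/word.txt already exists from an earlier local test; if it does not, a sample file is easy to create first (the contents below are purely illustrative):

mkdir -p $HADOOP_HOME/wcinput
echo "hello hadoop hello mapreduce" > $HADOOP_HOME/wcinput/word.txt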

Upload a large file:

hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /

Execute the wordcount example (from the Hadoop installation directory):

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
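
When the job finishes, wordcount leaves its results in part-r-* files under the output directory, which can be read back directly:

hadoop fs -ls /output
hadoop fs -cat /output/part-r-00000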

Configure history server

To view the historical runs of a program, you need to configure the history server. The specific steps are as follows:

1) Configure mapred-site.xml

vim mapred-site.xml

Add the following configuration to this file.

<!-- History server address -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop102:10020</value>
</property>

<!-- History server web UI address -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop102:19888</value>
</property>

2) Distribute the configuration

xsync $HADOOP_HOME/etc/hadoop/mapred-site.xml

3) Start the history server on hadoop102

mapred --daemon start historyserver

4) Check whether the history server is started

jps
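
If it started successfully, the process list on hadoop102 should now include a JobHistoryServer entry (illustrative output; the PID will differ):

15327 JobHistoryServer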

5) View JobHistory

http://hadoop102:19888/jobhistory

Configure log aggregation

Log aggregation concept: after an application finishes running, its log information is uploaded to HDFS.

Benefit of log aggregation: you can conveniently view the details of a program run, which makes development and debugging easier.

Note: to enable the log aggregation function, you need to restart NodeManager, ResourceManager and HistoryServer.

The specific steps to enable log aggregation are as follows:

1) Configure yarn-site.xml

 vim yarn-site.xml

Add the following configuration to this file.

<!-- Enable log aggregation -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<!-- Set log aggregation server address -->
<property> 
  <name>yarn.log.server.url</name> 
  <value>http://hadoop102:19888/jobhistory/logs</value>
</property>

<!-- Set the log retention time to 7 days (604800 s = 7 x 24 x 3600) -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>

2) Distribute the configuration

xsync $HADOOP_HOME/etc/hadoop/yarn-site.xml

3) Close NodeManager, ResourceManager, and HistoryServer

On hadoop103, from the Hadoop installation directory:

sbin/stop-yarn.sh

On hadoop102 (where the history server runs):

mapred --daemon stop historyserver

4) Start NodeManager, ResourceManager, and HistoryServer

On hadoop103:

start-yarn.sh

On hadoop102:

mapred --daemon start historyserver

5) Delete the existing output directory on HDFS

On hadoop102:

hadoop fs -rm -r /output

6) Execute the WordCount program

From the Hadoop installation directory on hadoop102:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output

7) View the logs

(1) History server address

http://hadoop102:19888/jobhistory

(2) Locate the job in the historical task list

(3) View the task's running log

(4) View the details of the run log

Cluster start/stop summary

1) Start/stop each module separately (requires passwordless ssh to be configured)

(1) Overall start/stop of HDFS

start-dfs.sh/stop-dfs.sh

(2) Overall start/stop of YARN

start-yarn.sh/stop-yarn.sh

2) Start/stop individual service components

(1) Start/stop an HDFS component

hdfs --daemon start/stop namenode/datanode/secondarynamenode

(2) Start/stop a YARN component

yarn --daemon start/stop resourcemanager/nodemanager
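
For example, to restart a single DataNode after a configuration change without touching the rest of the cluster (run on the node that hosts that DataNode):

hdfs --daemon stop datanode
hdfs --daemon start datanode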

Write common scripts for the Hadoop cluster

1) Hadoop cluster start/stop script (covering HDFS, YARN, and the history server): myhadoop.sh

cd /home/username/bin

# Replace username with your own user name
vim myhadoop.sh

Enter the following:

#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi
case $1 in
"start")
        echo " =================== starting hadoop cluster ==================="

        echo " --------------- starting hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== stopping hadoop cluster ==================="

        echo " --------------- stopping historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

Save and exit, then grant the script execute permission:

chmod +x myhadoop.sh
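
The whole stack can then be started or stopped with one command (assuming ~/bin is on the PATH, which CentOS 7's default ~/.bash_profile provides):

myhadoop.sh start
myhadoop.sh stop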

2) Script to view the Java processes on all three servers: jpsall

cd /home/username/bin
vim jpsall

Enter the following:

#!/bin/bash
for host in hadoop102 hadoop103 hadoop104
do
        echo =============== $host ===============
        ssh $host jps | grep -v Jps
done

Save and exit, then grant the script execute permission:

chmod +x jpsall
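
With the cluster fully up, jpsall should roughly reproduce the deployment plan from the beginning of this article (PIDs are illustrative):

=============== hadoop102 ===============
4582 NameNode
4723 DataNode
5012 NodeManager
5233 JobHistoryServer
=============== hadoop103 ===============
3698 DataNode
3854 ResourceManager
4021 NodeManager
=============== hadoop104 ===============
3512 DataNode
3641 SecondaryNameNode
3799 NodeManager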

3) Distribute the /home/username/bin directory so that the custom scripts can be used on all three machines

xsync /home/username/bin/

Common port numbers

Port name                               Hadoop 2.x     Hadoop 3.x
NameNode internal communication port    8020 / 9000    8020 / 9000 / 9820
NameNode HTTP UI                        50070          9870
View task execution (YARN web UI)       8088           8088
History server web UI                   19888          19888

Topics: Hadoop