Hadoop distributed environment
0. Preliminary preparation
Create a normal user
```bash
# Create the fzk user
useradd fzk
# Set the fzk user's password
passwd fzk
# Give the fzk user root privileges so that sudo can later be used to run commands as root:
# in the /etc/sudoers file, add the following line below the %wheel entry
fzk ALL=(ALL) NOPASSWD:ALL
```
SSH passwordless login
```bash
# Generate a public/private key pair
ssh-keygen -t rsa
# Copy the public key to the target machine to enable passwordless login
ssh-copy-id hadoop152
```
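Both the start-dfs.sh / start-yarn.sh scripts and the cluster scripts later in this document assume that every node can reach every other node over SSH without a password. A minimal sketch for repeating the key setup against all three hosts (run as the same user on each node; assumes the hostnames are already resolvable, e.g. via /etc/hosts):

```bash
# Generate a key (skip if ~/.ssh/id_rsa already exists), then push it to every node, including this one.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in hadoop151 hadoop152 hadoop153; do
    ssh-copy-id "$host"   # prompts once for each host's password
done
```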
1. Construction of distributed basic environment
Cluster deployment planning
| | hadoop151 | hadoop152 | hadoop153 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
Environment setup steps
- Step 1: prepare three hosts (firewall disabled, static IP, hostname configured) with the following IP / hostname mapping:

  192.168.37.151 hadoop151
  192.168.37.152 hadoop152
  192.168.37.153 hadoop153

- Step 2: install the JDK, install Hadoop, and configure the environment variables (a sketch of the environment-variable configuration follows this list)
- Step 3: modify the configuration files: hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, workers, and the bin/hdfs launcher script
- etc/hadoop/hadoop-env.sh

  ```sh
  export JAVA_HOME=/opt/software/jdk/jdk1.8.0_281
  export HDFS_NAMENODE_USER=root
  export HDFS_DATANODE_USER=root
  export HDFS_SECONDARYNAMENODE_USER=root
  export YARN_RESOURCEMANAGER_USER=root
  export YARN_NODEMANAGER_USER=root
  export HADOOP_SECURE_DN_USER=root
  ```
- etc/hadoop/core-site.xml

  ```xml
  <configuration>
      <!-- Address of the NameNode -->
      <property>
          <name>fs.defaultFS</name>
          <value>hdfs://hadoop151:8020</value>
      </property>
      <!-- Hadoop data storage directory -->
      <property>
          <name>hadoop.tmp.dir</name>
          <value>/opt/software/hadoop/data</value>
      </property>
      <!-- Static user used for HDFS web UI login is root -->
      <property>
          <name>hadoop.http.staticuser.user</name>
          <value>root</value>
      </property>
  </configuration>
  ```
- etc/hadoop/hdfs-site.xml

  ```xml
  <configuration>
      <!-- NameNode web UI address -->
      <property>
          <name>dfs.namenode.http-address</name>
          <value>hadoop151:9870</value>
      </property>
      <!-- SecondaryNameNode web UI address -->
      <property>
          <name>dfs.namenode.secondary.http-address</name>
          <value>hadoop153:9868</value>
      </property>
  </configuration>
  ```
- etc/hadoop/yarn-site.xml

  ```xml
  <configuration>
      <!-- Use mapreduce_shuffle as the auxiliary service for the MapReduce shuffle -->
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
      </property>
      <!-- Address of the ResourceManager -->
      <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>hadoop152</value>
      </property>
      <!-- Environment variables inherited by containers -->
      <property>
          <name>yarn.nodemanager.env-whitelist</name>
          <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
      </property>
  </configuration>
  ```
- etc/hadoop/mapred-site.xml

  ```xml
  <configuration>
      <!-- Run MapReduce programs on YARN -->
      <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
      </property>
  </configuration>
  ```
- etc/hadoop/workers

  ```
  hadoop151
  hadoop152
  hadoop153
  ```
- bin/hdfs

  Change HADOOP_SHELL_EXECNAME="hdfs" to HADOOP_SHELL_EXECNAME="root"
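Step 2 above only names the environment-variable configuration; a minimal sketch of what it can look like on every node, assuming the JDK and Hadoop were unpacked to the /opt/software paths used elsewhere in this document (the file name /etc/profile.d/my_env.sh is a hypothetical choice, and the paths are assumptions to adjust):

```bash
# /etc/profile.d/my_env.sh -- source it or log in again afterwards
export JAVA_HOME=/opt/software/jdk/jdk1.8.0_281
export HADOOP_HOME=/opt/software/hadoop/hadoop-3.1.4
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```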
Start the cluster
- Step 1: if the cluster is being started for the first time, format the NameNode on the hadoop151 node
- Command: hdfs namenode -format
- Note: formatting the NameNode generates a new cluster ID. If it no longer matches the cluster ID held by the DataNodes, the cluster cannot find its previous data. If the cluster fails during operation and the NameNode has to be reformatted, first stop the NameNode and DataNode processes, delete the data and logs directories on all machines, and only then reformat.
- Step 2: start HDFS on the machine where the NameNode is configured
- Command: sbin/start-dfs.sh
- Step 3: start YARN on the machine where the ResourceManager is configured
- Command: sbin/start-yarn.sh
- Step 4: view the HDFS NameNode in the web UI
- http://192.168.37.151:9870
- Step 5: view the YARN ResourceManager in the web UI (a quick smoke test follows this list)
- http://192.168.37.152:8088
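To confirm that HDFS and YARN work together after these steps, a quick smoke test is to upload a file and run one of the MapReduce examples bundled with Hadoop (the example jar version must match the installed release; 3.1.4 is assumed here):

```bash
# Upload a small file to HDFS and run the bundled wordcount example on YARN.
hdfs dfs -mkdir -p /input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar \
    wordcount /input /output
hdfs dfs -cat /output/part-r-00000 | head
```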
2. Configure the history server and log aggregation
Environment setup
- Step 1: set up the distributed basic environment from section 1
- Step 2: add the following configuration to etc/hadoop/mapred-site.xml and copy the changed file to the other two hosts

  ```xml
  <!-- History server address -->
  <property>
      <name>mapreduce.jobhistory.address</name>
      <value>hadoop151:10020</value>
  </property>
  <!-- History server web UI address -->
  <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>hadoop151:19888</value>
  </property>
  ```
- Step 3: add the following configuration to etc/hadoop/yarn-site.xml and copy the changed file to the other two hosts

  ```xml
  <!-- Enable log aggregation -->
  <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
  </property>
  <!-- Log aggregation server address -->
  <property>
      <name>yarn.log.server.url</name>
      <value>http://hadoop151:19888/jobhistory/logs</value>
  </property>
  <!-- Keep aggregated logs for 7 days -->
  <property>
      <name>yarn.log-aggregation.retain-seconds</name>
      <value>604800</value>
  </property>
  ```
Startup
- Start the history server on the machine where JobHistory is configured
- mapred --daemon start historyserver
- View JobHistory in the web UI (a log-viewing sketch follows this list)
- http://192.168.37.151:19888/jobhistory
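With log aggregation enabled and the history server running, the aggregated container logs of a finished job can also be pulled from the command line; a minimal sketch (the application ID below is a placeholder; take a real one from the list):

```bash
# List finished applications, then fetch the aggregated logs of one of them.
yarn application -list -appStates FINISHED
yarn logs -applicationId application_1600000000000_0001 | less
```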
3. Cluster start / stop mode summary
Start / stop each module as a whole (passwordless SSH is a prerequisite)
- Overall start / stop HDFS
- start-dfs.sh
- stop-dfs.sh
- Overall start / stop of YARN
- start-yarn.sh
- stop-yarn.sh
Start / stop individual service components
- Start / stop HDFS components separately
- hdfs --daemon start/stop namenode/datanode/secondarynamenode
- Start / stop YARN components separately
- yarn --daemon start/stop resourcemanager/nodemanager
- Start / stop JobHistory
- mapred --daemon start/stop historyserver
4. Write common scripts for Hadoop cluster
Hadoop cluster startup and shutdown script
- Start HDFS, YARN and the HistoryServer: myhadoop.sh

  ```bash
  #!/bin/bash
  if [ $# -lt 1 ]
  then
      echo "No Args Input..."
      exit
  fi
  case $1 in
  "start")
      echo " =================== starting hadoop cluster ==================="
      echo " --------------- starting hdfs ---------------"
      ssh hadoop151 "start-dfs.sh"
      echo " --------------- starting yarn ---------------"
      ssh hadoop152 "start-yarn.sh"
      echo " --------------- starting historyserver ---------------"
      ssh hadoop151 "mapred --daemon start historyserver"
      ;;
  "stop")
      echo " =================== stopping hadoop cluster ==================="
      echo " --------------- stopping historyserver ---------------"
      ssh hadoop151 "mapred --daemon stop historyserver"
      echo " --------------- stopping yarn ---------------"
      ssh hadoop152 "stop-yarn.sh"
      echo " --------------- stopping hdfs ---------------"
      ssh hadoop151 "stop-dfs.sh"
      ;;
  *)
      echo "Input Args Error..."
      ;;
  esac
  ```
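A short usage note, assuming the script is saved to a directory on the PATH (the ~/bin location is an assumption):

```bash
chmod +x ~/bin/myhadoop.sh
myhadoop.sh start   # bring up HDFS, YARN and the history server
myhadoop.sh stop    # shut everything down again
```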
Script to view the Java processes on the three servers
- View the Java processes with jps: jpsall.sh

  ```bash
  #!/bin/bash
  for host in hadoop151 hadoop152 hadoop153
  do
      echo =============== $host ===============
      ssh $host jps
  done
  ```
5. Common port numbers
Port name | Hadoop2.x | Hadoop3.x |
---|---|---|
NameNode internal communication port | 8020 / 9000 | 8020 / 9000 / 9820 |
NameNode HTTP UI | 50070 | 9870 |
Port for viewing MapReduce task execution | 8088 | 8088 |
History server communication port | 19888 | 19888 |
6. High availability environment (HA)
Cluster planning
hadoop151 | hadoop152 | hadoop153 |
---|---|---|
NameNode | NameNode | NameNode |
JournalNode | JournalNode | JournalNode |
DataNode | DataNode | DataNode |
ZK | ZK | ZK |
ResourceManager | ResourceManager | |
NodeManager | NodeManager | NodeManager |
Configure HDFS-HA cluster
- Step 1: core-site.xml configuration

  ```xml
  <!-- Assemble the NameNode addresses into a cluster named mycluster -->
  <property>
      <name>fs.defaultFS</name>
      <value>hdfs://mycluster</value>
  </property>
  <!-- Storage directory for files Hadoop generates at runtime -->
  <property>
      <name>hadoop.tmp.dir</name>
      <value>/opt/software/hadoop/hadoop-3.1.4/data/tmp</value>
  </property>
  <!-- JournalNode server storage directory -->
  <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/opt/software/hadoop/hadoop-3.1.4/data/tmp/journalnode</value>
  </property>
  ```
- Step 2: hdfs-site.xml configuration

  ```xml
  <!-- Logical name of the nameservice -->
  <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
  </property>
  <!-- NameNodes in the cluster -->
  <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2,nn3</value>
  </property>
  <!-- RPC addresses of nn1, nn2 and nn3 -->
  <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>hadoop151:8020</value>
  </property>
  <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>hadoop152:8020</value>
  </property>
  <property>
      <name>dfs.namenode.rpc-address.mycluster.nn3</name>
      <value>hadoop153:8020</value>
  </property>
  <!-- HTTP addresses of nn1, nn2 and nn3 -->
  <property>
      <name>dfs.namenode.http-address.mycluster.nn1</name>
      <value>hadoop151:9870</value>
  </property>
  <property>
      <name>dfs.namenode.http-address.mycluster.nn2</name>
      <value>hadoop152:9870</value>
  </property>
  <property>
      <name>dfs.namenode.http-address.mycluster.nn3</name>
      <value>hadoop153:9870</value>
  </property>
  <!-- Location on the JournalNodes where the NameNode edits are stored -->
  <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://hadoop151:8485;hadoop152:8485;hadoop153:8485/mycluster</value>
  </property>
  <!-- Failover proxy provider: the class clients use to find the Active NameNode -->
  <property>
      <name>dfs.client.failover.proxy.provider.mycluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <!-- Disable permission checking (do not disable this in production; configure it properly, see the official documentation) -->
  <property>
      <name>dfs.permissions.enabled</name>
      <value>false</value>
  </property>
  <!-- Fencing method, so that only one NameNode can respond at a time -->
  <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
  </property>
  <!-- Private key for the passwordless SSH required by sshfence -->
  <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/root/.ssh/id_rsa</value>
  </property>
  ```
- Step 3: distribute core-site.xml and hdfs-site.xml to the other machines with scp
- Step 4: start the JournalNode on all nodes
- hdfs --daemon start journalnode
- Step 5: format the NameNode on one of the nodes
- hdfs namenode -format
- Step 6: copy the NameNode metadata directory contents to the other, unformatted NameNodes (run on each of them)
- hdfs namenode -bootstrapStandby
- Step 7: start the NameNode and DataNode on all nodes
- hdfs --daemon start namenode
- hdfs --daemon start datanode
- Step 8: set one of the NameNodes to the Active state, for example nn1 (a state-check sketch follows this list)
- hdfs haadmin -transitionToActive nn1
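To check which NameNode is currently Active (failover is still manual at this point), the HA admin tool can be queried; a minimal sketch:

```bash
# Query the HA state of each configured NameNode.
for nn in nn1 nn2 nn3; do
    echo -n "$nn: "
    hdfs haadmin -getServiceState $nn
done
```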
Configuring Zookeeper clusters
Deploy Zookeeper on the hadoop151, hadoop152 and hadoop153 nodes
- Step 1: install the JDK
- Step 2: copy the Zookeeper archive to the machines and extract it
- Step 3: create a zkData directory under the Zookeeper installation directory to store data
- Step 4: create a myid file in the zkData directory to identify each Zookeeper instance
- Write 1, 2 and 3 into the myid files on hadoop151, hadoop152 and hadoop153 respectively
- Step 5: in the <Zookeeper installation path>/conf directory, copy zoo_sample.cfg to zoo.cfg and modify it

  ```properties
  # Change the data storage path
  dataDir=/<Zookeeper installation path>/zkData

  # Add the Zookeeper cluster configuration
  # Format: server.A=B:C:D
  #   A is a number: in cluster mode it is the server id written in the myid file
  #   B is the server's IP address (or hostname)
  #   C is the port this server uses to exchange information with the cluster Leader
  #   D is the port used to elect a new Leader when the current Leader goes down
  server.1=hadoop151:2888:3888
  server.2=hadoop152:2888:3888
  server.3=hadoop153:2888:3888
  ```
- Step 6: distribute the entire installed Zookeeper directory to the hadoop152 and hadoop153 machines
- Step 7: set the numbers in the myid files on hadoop151, hadoop152 and hadoop153 to 1, 2 and 3 respectively
- Step 8: start Zookeeper on each of the three machines
- bin/zkServer.sh start
- Step 9: check the status: one machine should report Mode: leader and the other two Mode: follower (a helper-script sketch follows this list)
- bin/zkServer.sh status
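Starting and checking Zookeeper on three machines one by one gets repetitive; a small helper in the spirit of myhadoop.sh and jpsall.sh (the zk.sh name and the installation path are assumptions):

```bash
#!/bin/bash
# zk.sh - pass start, stop or status; assumes Zookeeper lives at the same path on every node.
ZK_HOME=/opt/software/zookeeper
for host in hadoop151 hadoop152 hadoop153
do
    echo "=============== $host ==============="
    ssh $host "$ZK_HOME/bin/zkServer.sh $1"
done
```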
Configure HDFS-HA automatic failover
- Step 1: add the following to hdfs-site.xml

  ```xml
  <!-- Enable automatic failover -->
  <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
  </property>
  ```
- Step 2: add the following to core-site.xml

  ```xml
  <!-- Zookeeper quorum used for automatic failover -->
  <property>
      <name>ha.zookeeper.quorum</name>
      <value>hadoop151:2181,hadoop152:2181,hadoop153:2181</value>
  </property>
  ```
- Step 3: hadoop-env.sh configuration

  ```sh
  export JAVA_HOME=/opt/software/jdk/jdk1.8.0_281
  export HDFS_NAMENODE_USER=root
  export HDFS_DATANODE_USER=root
  export HDFS_SECONDARYNAMENODE_USER=root
  export YARN_RESOURCEMANAGER_USER=root
  export YARN_NODEMANAGER_USER=root
  export HADOOP_SECURE_DN_USER=root
  export HDFS_ZKFC_USER=root
  export HDFS_JOURNALNODE_USER=root
  ```
- Step 4: distribute core-site.xml, hdfs-site.xml and hadoop-env.sh to the other machines with scp
- Step 5: start Zookeeper on each of the three machines
- bin/zkServer.sh start
- Step 6: initialize ZKFC
- hdfs zkfc -formatZK
- Step 7: start the cluster
- start-dfs.sh
- Step 8: check that all processes have started; jps on each node should show six processes (a failover-test sketch follows this list)

  Jps, QuorumPeerMain, NameNode, DataNode, JournalNode, DFSZKFailoverController
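A simple way to confirm that automatic failover works is to kill the Active NameNode process and watch another NameNode take over; a sketch:

```bash
# On the node whose NameNode is currently Active:
hdfs haadmin -getServiceState nn1                    # confirm which nn is Active first
kill -9 $(jps | awk '$2 == "NameNode" {print $1}')   # simulate a NameNode crash

# From any node, one of the remaining NameNodes should report "active" shortly after:
hdfs haadmin -getServiceState nn2
hdfs haadmin -getServiceState nn3
```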
Configure YARN-HA cluster
- Step 1: yarn-site.xml configuration

  ```xml
  <!-- Use mapreduce_shuffle as the auxiliary service for the MapReduce shuffle -->
  <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
  </property>
  <!-- Enable ResourceManager HA -->
  <property>
      <name>yarn.resourcemanager.ha.enabled</name>
      <value>true</value>
  </property>
  <!-- Declare the two ResourceManagers -->
  <property>
      <name>yarn.resourcemanager.cluster-id</name>
      <value>cluster1</value>
  </property>
  <property>
      <name>yarn.resourcemanager.ha.rm-ids</name>
      <value>rm1,rm2</value>
  </property>
  <property>
      <name>yarn.resourcemanager.hostname.rm1</name>
      <value>hadoop151</value>
  </property>
  <property>
      <name>yarn.resourcemanager.hostname.rm2</name>
      <value>hadoop152</value>
  </property>
  <property>
      <name>yarn.resourcemanager.webapp.address.rm1</name>
      <value>hadoop151:8088</value>
  </property>
  <property>
      <name>yarn.resourcemanager.webapp.address.rm2</name>
      <value>hadoop152:8088</value>
  </property>
  <!-- Address of the Zookeeper cluster -->
  <property>
      <name>hadoop.zk.address</name>
      <value>hadoop151:2181,hadoop152:2181,hadoop153:2181</value>
  </property>
  <!-- Enable automatic recovery -->
  <property>
      <name>yarn.resourcemanager.recovery.enabled</name>
      <value>true</value>
  </property>
  <!-- Environment variables inherited by containers -->
  <property>
      <name>yarn.nodemanager.env-whitelist</name>
      <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <!-- Enable log aggregation -->
  <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
  </property>
  <!-- Keep aggregated logs for 7 days -->
  <property>
      <name>yarn.log-aggregation.retain-seconds</name>
      <value>604800</value>
  </property>
  ```
- Step 2: distribute yarn-site.xml to the other machines with scp
- Step 3: start YARN (Zookeeper and HDFS must already be running; a state-check sketch follows this list)
- start-yarn.sh
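Whether ResourceManager HA is working can be checked the same way as for the NameNodes, using the RM admin tool; a minimal sketch:

```bash
# One ResourceManager should report active, the other standby.
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
```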
Start and stop the cluster after installation
Startup
- Step 1: start Zookeeper on each of the three machines
- bin/zkServer.sh start
- Step 2: start the NameNodes, DataNodes, JournalNodes, ZKFCs, etc. from any node
- start-all.sh
Shutdown
- Step 1: stop the NameNodes, DataNodes, JournalNodes, ZKFCs, etc. from any node
- stop-all.sh
- Step 2: stop Zookeeper on each of the three machines (a combined helper-script sketch follows this list)
- bin/zkServer.sh stop
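These start/stop sequences can be wrapped into one helper in the same style as myhadoop.sh; a sketch, assuming Zookeeper is installed at /opt/software/zookeeper on every node (both the script name and the path are assumptions):

```bash
#!/bin/bash
# myha.sh - start or stop the whole HA cluster; pass start or stop.
ZK_HOME=/opt/software/zookeeper
case $1 in
"start")
    for host in hadoop151 hadoop152 hadoop153; do
        ssh $host "$ZK_HOME/bin/zkServer.sh start"
    done
    ssh hadoop151 "start-all.sh"   # NameNodes, DataNodes, JournalNodes, ZKFC, YARN
    ;;
"stop")
    ssh hadoop151 "stop-all.sh"
    for host in hadoop151 hadoop152 hadoop153; do
        ssh $host "$ZK_HOME/bin/zkServer.sh stop"
    done
    ;;
*)
    echo "Usage: myha.sh start|stop"
    ;;
esac
```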