How to build a Hadoop cluster

Posted by gum1982 on Mon, 17 Jan 2022 03:59:43 +0100


1. Map host names to IP addresses

Open the hosts file

vi /etc/hosts

Add the host names and their IP addresses:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.192.30.40    hadoop101
10.192.30.41    hadoop102
10.192.30.42    hadoop103

After that, hadoop101 will resolve to 10.192.30.40, hadoop102 to 10.192.30.41, and hadoop103 to 10.192.30.42.
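A quick way to confirm that the mapping works is to ping each host name (for example, from hadoop101):

ping -c 1 hadoop101
ping -c 1 hadoop102
ping -c 1 hadoop103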

2. Write the xsync cluster distribution script

Create a bin directory under /root and an xsync file inside it. The file contents are as follows:

#!/bin/bash
#1 get the number of input parameters; exit if there are none
pcount=$#
if ((pcount == 0)); then
    echo "no args"
    exit 1
fi

#2 get the file name
p1=$1
fname=$(basename "$p1")
echo fname=$fname

#3 get the absolute path of the parent directory
pdir=$(cd -P "$(dirname "$p1")"; pwd)
echo pdir=$pdir

#4 get the current user name
user=$(whoami)

#5 loop over the other nodes and sync the file
for ((host=102; host<104; host++)); do
    echo ------------------- hadoop$host --------------
    rsync -rvl "$pdir/$fname" "$user@hadoop$host:$pdir"
done

Add execute permission to the script.
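For example, assuming the script was saved as /root/bin/xsync:

chmod +x /root/bin/xsync

If /root/bin is on the PATH, the script can then be called from anywhere, e.g. xsync /etc/hosts to push the hosts file to hadoop102 and hadoop103.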

3. SSH passwordless login configuration

1. Configure ssh

Basic syntax: ssh <IP address of the target machine>. For example:

ssh 10.192.30.41

If the first ssh connection asks you to confirm the host key (or reports "Host key verification failed"), simply type yes at the prompt.

2. Passwordless key configuration

Generate public and private keys:

ssh-keygen -t rsa

Press Enter three times; two files will be generated: id_rsa (private key) and id_rsa.pub (public key).
Copy the public key to the target machines for passwordless login:

ssh-copy-id hadoop101
ssh-copy-id hadoop102
ssh-copy-id hadoop103

After running each command, enter the password when prompted to complete the passwordless configuration.

Note:
You also need to configure passwordless login from hadoop102 to hadoop101, hadoop102 and hadoop103;
and likewise from hadoop103 to hadoop101, hadoop102 and hadoop103.
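To verify, ssh from one node to another; if the keys are set up correctly you get a shell without being asked for a password, for example from hadoop101:

ssh hadoop102
exit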

4. Install the JDK

Extract the archive:

tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt/module/

Add the environment variables:

vi /etc/profile
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin

Apply the changes:

source /etc/profile
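Verify that the JDK is on the PATH:

java -version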

5. Install Hadoop

Extract the archive:

tar -zxvf hadoop-2.10.1.tar.gz -C /opt/module/

Add Hadoop to the environment variables by opening /etc/profile and appending:

##HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-2.10.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Make the changes take effect:

source /etc/profile

Verify the installation:

hadoop version

6. Configure the Hadoop cluster

All of the following files are in the /opt/module/hadoop-2.10.1/etc/hadoop directory.

Configure hadoop-env.sh:

Get the JDK installation path on the system:

echo $JAVA_HOME
/opt/module/jdk1.8.0_144

Set the JAVA_HOME path in hadoop-env.sh:

export JAVA_HOME=/opt/module/jdk1.8.0_144

Configure core-site.xml (the properties below go inside the <configuration> element):

<!-- Address of the NameNode for HDFS -->
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://hadoop101:9000</value>
</property>

<!-- Directory where Hadoop stores files generated at run time -->
<property>
	<name>hadoop.tmp.dir</name>
	<value>/opt/module/hadoop-2.10.1/data/tmp</value>
</property>

<!-- Buffer size used when reading and writing files -->
<property>
	<name>io.file.buffer.size</name>
	<value>131072</value>
</property>

Configure hdfs-site.xml:

<!-- Number of HDFS replicas -->
<property>
	<name>dfs.replication</name>
	<value>3</value>
</property>

<!-- Directory where the NameNode stores fsimage files -->
<property>
	<name>dfs.namenode.name.dir</name>
	<value>file://${hadoop.tmp.dir}/dfs/name</value>
</property>

<!-- Directory where the DataNode stores data blocks -->
<property>
	<name>dfs.datanode.data.dir</name>
	<value>file://${hadoop.tmp.dir}/dfs/data</value>
</property>

7. Start a single-node cluster

1. Format the NameNode

(Format only before the first start. Do not keep reformatting afterwards: a reformat generates a new cluster ID, so the NameNode and the existing DataNodes can no longer find each other. If you must reformat, delete the data and log directories first.)

[root@hadoop101 hadoop-2.10.1]# bin/hdfs namenode -format
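If you ever do need to reformat, stop the cluster and delete the data and log directories first, then run the format command again. A sketch, using the directory layout from this guide:

[root@hadoop101 hadoop-2.10.1]# rm -rf data/ logs/
[root@hadoop101 hadoop-2.10.1]# bin/hdfs namenode -format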

2. Start HDFS:

[root@hadoop101 hadoop-2.10.1]# sbin/start-dfs.sh

8. View nodes

1. Check whether the startup is successful:

Note: jps is a JDK command, not a Linux command; it is not available if the JDK is not installed.

[root@hadoop101 hadoop-2.10.1]# jps
13586 NameNode
13668 DataNode
13786 Jps

2. View the HDFS file system in the browser:

http://10.192.30.40:50070/explorer.html#/

The generated logs are in /opt/module/hadoop-2.10.1/logs.

9. Operate the cluster

1. Create a new folder:

Create an input folder on the HDFS file system:

[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -mkdir -p /user/input

2. Upload file:

Upload the test file content to the file system:

[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -put ./aaa.txt /user/input

3. Download files:

Download the test file to the local /opt directory:

[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -get /user/input/aaa.txt /opt/

4. Delete file:

[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -rm -r /user/input/aaa.txt
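You can list an HDFS directory at any point to confirm that an operation took effect, for example:

[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -ls /user/input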

10. Start YARN

Configure yarn-env.sh

Configure JAVA_HOME:

export  JAVA_HOME=/opt/module/jdk1.8.0_144

Configure yarn-site.xml:

<!-- How the reducer obtains data -->
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>

<!-- Class implementing the mapreduce_shuffle auxiliary service -->
<property>
	<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<!-- Hostname of the YARN ResourceManager -->
<property>
	<name>yarn.resourcemanager.hostname</name>
	<value>hadoop101</value>
</property>

<!-- Total physical memory (MB) available to the NodeManager -->
<property>
	<name>yarn.nodemanager.resource.memory-mb</name>
	<value>8192</value>
</property>

<!-- Local directories where YARN stores localized program files -->
<property>
	<name>yarn.nodemanager.local-dirs</name>
	<value>/opt/module/data</value>
</property>

<!-- Number of CPU vcores the NodeManager can allocate to containers -->
<property>
	<name>yarn.nodemanager.resource.cpu-vcores</name>
	<value>2</value>
</property>

Configure mapred-env.sh

Configure JAVA_HOME:

export JAVA_HOME=/opt/module/jdk1.8.0_144

Configure mapred-site.xml

Rename mapred-site.xml.template to mapred-site.xml:

[root@hadoop101 hadoop]# mv mapred-site.xml.template mapred-site.xml
[root@hadoop101 hadoop]# vi mapred-site.xml
<!-- Run MapReduce on YARN -->
<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
</property>

Start YARN

Before starting YARN, make sure the NameNode and DataNode are already running:

[root@hadoop101 hadoop-2.10.1]# sbin/start-yarn.sh

Check whether the startup is successful:

[root@hadoop101 hadoop-2.10.1]# jps
4227 NodeManager
3268 NameNode
3974 ResourceManager
4363 Jps
3356 DataNode

View the YARN web UI (MapReduce job progress can be tracked here):

http://10.192.30.40:8088/cluster

11. Start the history server

Start history server

[root@hadoop101 hadoop-2.10.1]# sbin/mr-jobhistory-daemon.sh start historyserver

Run a MapReduce job (word count):

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /user/input /user/output

View the results:

[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -cat /user/output/*
beijing 1
hebei   1
hello   1
world   1
xiaoming        2
xiaozhang       1
yantai  1
zhangjiakou     1

View the job history in the history server web UI:

http://10.192.30.40:19888/jobhistory

12. Configure log aggregation

Log aggregation: after an application finishes, its log files are uploaded to HDFS.
Benefit: you can easily inspect the details of how a program ran, which is convenient for development and debugging.
Note: to enable log aggregation you need to restart the NodeManager, ResourceManager and JobHistoryServer.

Configure yarn-site.xml:

<!-- Log aggregation enabled -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>

<!-- Log retention time is set to 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>

Stop the NodeManager, ResourceManager, and JobHistoryServer:

[root@hadoop101 hadoop-2.10.1]# sbin/yarn-daemon.sh stop resourcemanager
[root@hadoop101 hadoop-2.10.1]# sbin/yarn-daemon.sh stop nodemanager
[root@hadoop101 hadoop-2.10.1]# sbin/mr-jobhistory-daemon.sh stop historyserver

Start the NodeManager, ResourceManager, and JobHistoryServer:

[root@hadoop101 hadoop-2.10.1]# sbin/yarn-daemon.sh start resourcemanager
[root@hadoop101 hadoop-2.10.1]# sbin/yarn-daemon.sh start nodemanager
[root@hadoop101 hadoop-2.10.1]# sbin/mr-jobhistory-daemon.sh start historyserver

Delete the output directory and re-run the wordcount job; you can then click "logs" on the history server page to view the MapReduce logs.
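Concretely, using the same paths as above:

[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -rm -r /user/output
[root@hadoop101 hadoop-2.10.1]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /user/input /user/output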

13. Hadoop cluster planning

        hadoop101             hadoop102                       hadoop103
HDFS    NameNode, DataNode    DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager           ResourceManager, NodeManager    NodeManager

Configure the cluster

Configure hdfs-site.xml

Add the following configuration:

<!-- SecondaryNameNode host configuration -->
<property>
	<name>dfs.namenode.secondary.http-address</name>
	<value>hadoop103:50090</value>
</property>

Configure yarn-site.xml

Change the YARN ResourceManager host to hadoop102:

<!-- Hostname of the YARN ResourceManager -->
<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>hadoop102</value>
</property>

Stop Hadoop

Stop the NodeManager, ResourceManager, and JobHistoryServer:

[root@hadoop101 hadoop-2.10.1]# sbin/stop-yarn.sh  
[root@hadoop101 hadoop-2.10.1]# sbin/mr-jobhistory-daemon.sh stop historyserver

Stop the NameNode and DataNode:

[root@hadoop101 hadoop-2.10.1]# sbin/stop-dfs.sh

Delete the data and logs directories:

[root@hadoop101 hadoop-2.10.1]# rm -rf data/
[root@hadoop101 hadoop-2.10.1]# rm -rf logs/

Format the NameNode:

[root@hadoop101 hadoop-2.10.1]# bin/hdfs namenode -format

Distribute the configured Hadoop directory to the other nodes in the cluster:

xsync /opt/module/hadoop-2.10.1/

14. Start the cluster

Configure slaves

/opt/module/hadoop-2.10.1/etc/hadoop/slaves

Add the following contents to the file:

hadoop101
hadoop102
hadoop103

Note: the host names must not have trailing spaces, and the file must not contain blank lines.
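One way to spot stray whitespace is cat -A, which marks the end of every line with $ (and tabs with ^I); the output should look like:

[root@hadoop102 hadoop]# cat -A slaves
hadoop101$
hadoop102$
hadoop103$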

Synchronize the configuration file to all nodes:

[root@hadoop102 hadoop]# xsync slaves

Start HDFS (on hadoop101):

[root@hadoop101 hadoop-2.10.1]# sbin/start-dfs.sh

Start YARN (on hadoop102):

[root@hadoop102 hadoop-2.10.1]# sbin/start-yarn.sh

Note: start YARN on hadoop102. If the NameNode and ResourceManager are not on the same machine, do not start YARN on the NameNode;
start it on the machine where the ResourceManager runs.

View the SecondaryNameNode web UI:

http://10.192.30.42:50090/status.html

15. View the processes

hadoop101:

[root@hadoop101 hadoop-2.10.1]# jps
13937 Jps
13732 NodeManager
13189 NameNode
13901 JobHistoryServer
13374 DataNode

hadoop102:

[root@hadoop102 hadoop-2.10.1]# jps
3456 NodeManager
3142 DataNode
3832 Jps
3323 ResourceManager

hadoop103:

[root@hadoop103 hadoop]# jps
7377 Jps
7049 SecondaryNameNode
6938 DataNode
7198 NodeManager

16. Precautions

1. Modify hadoop-daemon.sh in the sbin directory: change HADOOP_PID_DIR=/tmp to HADOOP_PID_DIR=/opt/module/hadoop-2.10.1/tmp.
Then modify yarn-daemon.sh and set YARN_PID_DIR to /opt/module/hadoop-2.10.1/tmp as well.
/tmp is a Linux temporary directory that is cleaned up periodically. When Hadoop stops the cluster it looks for the pid files; if they have been cleaned up,
the daemons cannot be stopped normally, so the pid file location needs to be changed.

2. jps shows that a process is not running, but restarting the cluster reports that the process is already started.
The cause is a stale pid file for that process left in the /tmp directory. Delete the cluster-related pid files (and any leftover processes), then restart the cluster.

3. jps shows that the DataNode process is still running, but stopping the cluster reports "no datanode to stop". This is caused by not deleting the contents of /data/data/dfs when the NameNode was reformatted. Delete /data/data/* and restart.

17. Parameter summary

hdfs-site.xml

<!-- Number of HDFS replicas -->
<property>
        <name>dfs.replication</name>
        <value>3</value>
</property>
<!-- SecondaryNameNode host configuration -->
<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>10.192.30.42:50090</value>
</property>
<!-- Directory where the NameNode stores fsimage files -->
<property>
  		<name>dfs.namenode.name.dir</name>
  		<value>file://${hadoop.tmp.dir}/dfs/name</value>
</property>
<!-- Directory where the DataNode stores data blocks -->
<property>
 		 <name>dfs.datanode.data.dir</name>
  		<value>file://${hadoop.tmp.dir}/dfs/data</value>
</property>

core-site.xml

<!-- Address of the NameNode for HDFS -->
<property>
		<name>fs.defaultFS</name>
		<value>hdfs://hadoop101:9000</value>
</property>

<!-- Directory where Hadoop stores files generated at run time -->
<property>
        <name>hadoop.tmp.dir</name>
        <value>/data/data</value>
</property>
<!-- Buffer size used when reading and writing files -->
<property>
 		<name>io.file.buffer.size</name>
 		<value>131072</value>
</property>

yarn-site.xml

<!-- Site specific YARN configuration properties -->
<!-- How the reducer obtains data -->
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>

<!-- Class implementing the mapreduce_shuffle auxiliary service -->
 <property>
 		<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
 		<value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>

<!-- Hostname of the YARN ResourceManager -->
<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>hadoop102</value>
</property>

<!-- Log aggregation enabled -->
<property>
		<name>yarn.log-aggregation-enable</name>
		<value>true</value>
</property>

<!-- Log retention time is set to 7 days -->
<property>
		<name>yarn.log-aggregation.retain-seconds</name>
		<value>604800</value>
</property>

<!-- Total physical memory (MB) available to the NodeManager -->
 <property>
 		<name>yarn.nodemanager.resource.memory-mb</name>
 		<value>8192</value>
 </property>

<!-- Local directories where YARN stores localized program files -->
 <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>/opt/module/data</value>
 </property>

<!-- Number of CPU vcores the NodeManager can allocate to containers -->
 <property>
   		<name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>2</value>
 </property>


mapred-site.xml

<!-- Run MapReduce on YARN -->
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>

Topics: Big Data Hadoop