How to Build a Hadoop Cluster
1. Map host names to IP addresses
Open the hosts file
vi /etc/hosts
Add the host names and their corresponding IP addresses:
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
10.192.30.40 hadoop101
10.192.30.41 hadoop102
10.192.30.42 hadoop103
After that, hadoop101 will automatically resolve to 10.192.30.40, hadoop102 to 10.192.30.41, and hadoop103 to 10.192.30.42.
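A quick way to check the mapping (an illustrative example, not part of the original steps) is to ping one of the hosts by name; the resolved IP should match the /etc/hosts entry:
ping -c 1 hadoop102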
2. Write xsync cluster distribution script
Create a bin directory under the /root directory and an xsync file inside it. The file contents are as follows:
#!/bin/bash
# 1. Get the number of input parameters; exit directly if there are none
pcount=$#
if ((pcount==0)); then
  echo no args;
  exit;
fi
# 2. Get the file name
p1=$1
fname=`basename $p1`
echo fname=$fname
# 3. Get the absolute path of the parent directory
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir
# 4. Get the current user name
user=`whoami`
# 5. Loop over the target hosts
for ((host=102; host<104; host++)); do
  echo ------------------- hadoop$host --------------
  rsync -rvl $pdir/$fname $user@hadoop$host:$pdir
done
Add execution permission to the script.
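For example, assuming the script was saved as /root/bin/xsync:
chmod +x /root/bin/xsync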
3. Configure passwordless SSH login
1. Configure ssh
Basic syntax:
ssh <IP address of the other machine>, for example:
ssh 10.192.30.41
If "Host key verification failed" appears during the SSH connection, simply type yes.
2. Passwordless key configuration
Generate public and private keys:
ssh-keygen -t rsa
Press Enter three times; two files will be generated: id_rsa (the private key) and id_rsa.pub (the public key).
Copy the public key to the target machine for password free login:
ssh-copy-id hadoop101
ssh-copy-id hadoop102
ssh-copy-id hadoop103
After running each command, enter the password at the prompt to complete the passwordless configuration.
Note:
You also need to configure passwordless login from hadoop102 to hadoop101, hadoop102, and hadoop103;
and passwordless login from hadoop103 to hadoop101, hadoop102, and hadoop103.
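A quick sanity check (an illustrative example, not part of the original steps): a remote command that runs without a password prompt confirms the keys were copied correctly.
ssh hadoop102 hostname   # should print hadoop102 without asking for a password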
4. Install the JDK
Extract the archive:
tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt/module/
Add environment variable:
vi /etc/profile

#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
Apply the changes:
source /etc/profile
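You can then verify that the JDK is on the PATH, for example:
java -version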
5. Install Hadoop
Extract the archive:
tar -zxvf hadoop-2.10.1.tar.gz -C /opt/module/
Add Hadoop to the environment variables by opening the /etc/profile file:
##HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-2.10.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Make the changes take effect:
source /etc/profile
Test for successful installation:
hadoop version
6. Configure the Hadoop cluster
In the /opt/module/hadoop-2.10.1/etc/hadoop directory:
Configure hadoop-env.sh:
Get the JDK installation path on the Linux system:
echo $JAVA_HOME
/opt/module/jdk1.8.0_144
Modify JAVA_HOME path:
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure core-site.xml:
<!-- Address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop101:9000</value>
</property>
<!-- Storage directory for files Hadoop generates at run time -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/module/hadoop-2.10.1/data/tmp</value>
</property>
<!-- I/O buffer size used when reading and writing files -->
<property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
</property>
Configure hdfs-site.xml:
<!-- Number of HDFS replicas -->
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<!-- Where the NameNode stores fsimage files -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file://${hadoop.tmp.dir}/dfs/name</value>
</property>
<!-- Where HDFS data blocks are stored -->
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file://${hadoop.tmp.dir}/dfs/data</value>
</property>
7. Start a single-node cluster
1. Format NameNode
(Format only when starting for the first time. Do not format again later; repeated formatting causes the NameNode and DataNodes to lose track of each other. If you must reformat, delete the data and logs directories first.)
[root@hadoop101 hadoop-2.10.1]# bin/hdfs namenode -format
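If you ever do need to reformat later, a minimal sketch (matching the cleanup commands used in section 13) is to stop HDFS and remove the old data and logs directories first:
sbin/stop-dfs.sh
rm -rf data/ logs/
bin/hdfs namenode -format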
2. Start node:
[root@hadoop101 hadoop-2.10.1]# sbin/start-dfs.sh
8. Check the nodes
1. Check whether the startup is successful:
Note: jps is a JDK command, not a Linux command; it cannot be used unless the JDK is installed.
[root@hadoop101 hadoop-2.10.1]# jps
13586 NameNode
13668 DataNode
13786 Jps
2. View HDFS file system on the Web:
http://10.192.30.40:50070/explorer.html#/
The generated logs are in /opt/module/hadoop-2.10.1/logs
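For example, to inspect the NameNode log (Hadoop 2.x log files follow the pattern hadoop-<user>-<daemon>-<host>.log, so the exact name here assumes the root user on hadoop101):
tail -n 50 /opt/module/hadoop-2.10.1/logs/hadoop-root-namenode-hadoop101.log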
9. Operate the cluster
1. Create a new folder:
Create an input folder on the HDFS file system:
[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -mkdir -p /user/input
2. Upload file:
Upload the test file content to the file system:
[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -put ./aaa.txt /user/input
3. Download files:
Download the test file to the local /opt directory:
[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -get /user/input/aaa.txt /opt/
4. Delete file:
[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -rm -r /user/input/aaa.txt
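Other common HDFS operations follow the same pattern, for example listing a directory:
[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -ls /user/input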
10. Start YARN
Configure yarn-env.sh
Configure JAVA_HOME:
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure yarn-site.xml
<!-- How the Reducer obtains data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Class that implements the mapreduce_shuffle service -->
<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- Hostname of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop101</value>
</property>
<!-- Total physical memory available to the NodeManager -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
</property>
<!-- Directory for YARN's localized program files -->
<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/opt/module/data</value>
</property>
<!-- Number of CPU vcores the NodeManager can allocate to containers -->
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
</property>
Configure mapred-env.sh
Configure JAVA_HOME:
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure mapred-site.xml
Rename mapred-site.xml.template to mapred-site.xml:
[root@hadoop101 hadoop]# mv mapred-site.xml.template mapred-site.xml
[root@hadoop101 hadoop]# vi mapred-site.xml
<!-- Run MapReduce on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
Start YARN
Before starting, make sure that the NameNode and DataNode are already running.
[root@hadoop101 hadoop-2.10.1]# sbin/start-yarn.sh
Check whether the startup is successful:
[root@hadoop101 hadoop-2.10.1]# jps
4227 NodeManager
3268 NameNode
3974 ResourceManager
4363 Jps
3356 DataNode
View the YARN web page (MapReduce job progress can be monitored here):
http://10.192.30.40:8088/cluster
11. Start the history server
Start history server
[root@hadoop101 hadoop-2.10.1]# sbin/mr-jobhistory-daemon.sh start historyserver
Run a MapReduce job (word count):
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /user/input /user/output
View execution results
[root@hadoop101 hadoop-2.10.1]# bin/hdfs dfs -cat /user/output/*
beijing 1
hebei 1
hello 1
world 1
xiaoming 2
xiaozhang 1
yantai 1
zhangjiakou 1
View the history server information in YARN:
http://10.192.30.40:19888/jobhistory
12. Configure log aggregation
Log aggregation concept: after an application finishes running, its log information is uploaded to HDFS.
Benefit of log aggregation: you can easily view the details of a program run, which is convenient for development and debugging.
Note: enabling log aggregation requires restarting the NodeManager, ResourceManager, and HistoryServer.
Configure yarn-site.xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Retain logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
Stop the NodeManager, ResourceManager, and HistoryServer
[root@hadoop101 hadoop-2.10.1]# sbin/yarn-daemon.sh stop resourcemanager
[root@hadoop101 hadoop-2.10.1]# sbin/yarn-daemon.sh stop nodemanager
[root@hadoop101 hadoop-2.10.1]# sbin/mr-jobhistory-daemon.sh stop historyserver
Start the NodeManager, ResourceManager, and HistoryServer
[root@hadoop101 hadoop-2.10.1]# sbin/yarn-daemon.sh start resourcemanager
[root@hadoop101 hadoop-2.10.1]# sbin/yarn-daemon.sh start nodemanager
[root@hadoop101 hadoop-2.10.1]# sbin/mr-jobhistory-daemon.sh start historyserver
Delete the output directory and rerun the wordcount job (see the example below). You can then click Logs on the history server to view the MapReduce logs.
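For example, reusing the commands from the earlier steps:
bin/hdfs dfs -rm -r /user/output
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /user/input /user/output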
13. Hadoop cluster planning
| | hadoop101 | hadoop102 | hadoop103 |
| --- | --- | --- | --- |
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
Configure the cluster
Configure hdfs-site.xml
Add the following configuration:
<!-- Host of the Hadoop secondary NameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop103:50090</value>
</property>
Configure yarn-site.xml
Change the address of YARN's ResourceManager to hadoop102:
<!-- Hostname of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop102</value>
</property>
Stop Hadoop
Stop the NodeManager, ResourceManager, and HistoryServer:
[root@hadoop101 hadoop-2.10.1]# sbin/stop-yarn.sh
[root@hadoop101 hadoop-2.10.1]# sbin/mr-jobhistory-daemon.sh stop historyserver
Stop the NameNode and DataNode:
[root@hadoop101 hadoop-2.10.1]# sbin/stop-dfs.sh
Delete the data and logs directories:
[root@hadoop101 hadoop-2.10.1]# rm -rf data/
[root@hadoop101 hadoop-2.10.1]# rm -rf logs/
Format namenode
[root@hadoop101 hadoop-2.10.1]# bin/hdfs namenode -format
Distribute the configured Hadoop files on the cluster:
xsync /opt/module/hadoop-2.10.1/
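To confirm the distribution reached the other nodes, a quick check (illustrative only, not part of the original steps) is to look at the ResourceManager hostname setting on hadoop102:
ssh hadoop102 grep -A 1 yarn.resourcemanager.hostname /opt/module/hadoop-2.10.1/etc/hadoop/yarn-site.xml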
14. Start the cluster
Configure slaves
/opt/module/hadoop-2.10.1/etc/hadoop/slaves
Add the following contents to the file:
hadoop101
hadoop102
hadoop103
Note: the lines added to this file must not have trailing spaces, and the file must not contain blank lines.
Synchronize the configuration file to all nodes:
[root@hadoop102 hadoop]# xsync slaves
Start HDFS:
[root@hadoop101 hadoop-2.10.1]# sbin/start-dfs.sh
Start YARN:
sbin/start-yarn.sh
Note: start YARN on hadoop102. If the NameNode and ResourceManager are not on the same machine, YARN cannot be started on the NameNode host; it must be started on the machine where the ResourceManager runs.
View the SecondaryNameNode in a web browser:
http://10.192.30.42:50090/status.html
15. View processes
hadoop101:
[root@hadoop101 hadoop-2.10.1]# jps
13937 Jps
13732 NodeManager
13189 NameNode
13901 JobHistoryServer
13374 DataNode
hadoop102:
[root@hadoop102 hadoop-2.10.1]# jps
3456 NodeManager
3142 DataNode
3832 Jps
3323 ResourceManager
hadoop103:
[root@hadoop103 hadoop]# jps
7377 Jps
7049 SecondaryNameNode
6938 DataNode
7198 NodeManager
16. Precautions
1. Modify hadoop-daemon.sh in the sbin directory:
Change HADOOP_PID_DIR=/tmp to HADOOP_PID_DIR=/opt/module/hadoop-2.10.1/tmp
Then modify yarn-daemon.sh:
Change YARN_PID_DIR to /opt/module/hadoop-2.10.1/tmp
/tmp is a Linux temporary directory that is cleaned periodically. When Hadoop stops the cluster it looks for the pid files; if they have been cleaned up, the cluster cannot be stopped normally, so the pid file location needs to be changed.
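A minimal preparation step, assuming the two scripts are edited by hand as described above, is to create the new pid directory (on every node) before restarting:
mkdir -p /opt/module/hadoop-2.10.1/tmp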
2. jps shows that a process is not running, yet restarting the cluster reports that the process is already started.
The cause is a leftover pid file for that process in the /tmp directory under the Linux root directory. Delete the pid files of the cluster-related processes and restart the cluster.
3. jps shows that the DataNode process is still running, but stopping the cluster prints "no datanode to stop". This happens because the contents of /data/data/dfs were not deleted when the NameNode was formatted; just delete /data/data/*.
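A minimal recovery sketch, assuming hadoop.tmp.dir points to /data/data as in the parameter summary below:
sbin/stop-dfs.sh            # stop HDFS first
rm -rf /data/data/*         # clear the stale NameNode/DataNode directories
bin/hdfs namenode -format   # reformat, since the NameNode directory was also cleared
sbin/start-dfs.sh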
17. Parameter summary
hdfs-site.xml
<!-- Number of HDFS replicas -->
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<!-- Host of the Hadoop secondary NameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>10.192.30.42:50090</value>
</property>
<!-- Where the NameNode stores fsimage files -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file://${hadoop.tmp.dir}/dfs/name</value>
</property>
<!-- Where HDFS data blocks are stored -->
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file://${hadoop.tmp.dir}/dfs/data</value>
</property>
core-site.xml
<!-- Address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop101:9000</value>
</property>
<!-- Storage directory for files Hadoop generates at run time -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/data/data</value>
</property>
<!-- I/O buffer size used when reading and writing files -->
<property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
</property>
yarn-site.xml
<!-- Site specific YARN configuration properties -->
<!-- How the Reducer obtains data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Class that implements the mapreduce_shuffle service -->
<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- Hostname of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop102</value>
</property>
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Retain logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
<!-- Total physical memory available to the NodeManager -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
</property>
<!-- Directory for YARN's localized program files -->
<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/opt/module/data</value>
</property>
<!-- Number of CPU vcores the NodeManager can allocate to containers -->
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
</property>
mapred-site.xml
<!-- Run MapReduce on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>