Building a Hadoop cluster with CentOS 7

Posted by defx on Sun, 12 Dec 2021 16:58:54 +0100

Prerequisites:

        1. Prepare three virtual machines; see: Building a CentOS 7 cluster environment (using 3 machines as an example)

        2. Configure passwordless SSH login between the virtual machines; see: CentOS 7 cluster passwordless login configuration

        3. Install the JDK on each virtual machine; see: CentOS 7 cluster installation of JDK 1.8

We are now ready to build the Hadoop cluster. The following operations are performed on the first machine, node1:

1. Upload and extract the Hadoop installation package

1.1 Upload the installation package to /develop/software

mkdir -p /develop/software
mkdir -p /develop/server
cd /develop/software
rz

1.2 Extract the installation package to /develop/server

tar -zxvf hadoop-2.7.5.tar.gz -C /develop/server/

1.3 Switch to the extraction directory and view the extracted files

cd /develop/server/hadoop-2.7.5
ll

2. Modify the Hadoop configuration files

2.1 Switch to Hadoop's etc/hadoop directory and open hadoop-env.sh

cd /develop/server/hadoop-2.7.5/etc/hadoop/
ll
vim hadoop-env.sh

2.2 Configure the JDK path in hadoop-env.sh

export JAVA_HOME=/develop/server/jdk1.8.0_241

2.3 Configure core-site.xml (Hadoop's core configuration file). Add the following inside <configuration></configuration>:

<!-- Default Hadoop file system (NameNode address) -->
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://node1:8020</value>
</property>
<!-- Hadoop data storage directory -->
<property>
	<name>hadoop.tmp.dir</name>
	<value>/develop/server/hadoop-2.7.5/data/tempdata</value>
</property>
<!-- I/O buffer size -->
<property>
	<name>io.file.buffer.size</name>
	<value>4096</value>
</property>
<!-- HDFS trash retention interval, in minutes (7 days) -->
<property>
	<name>fs.trash.interval</name>
	<value>10080</value>
</property>

2.4 Configure hdfs-site.xml (HDFS's core configuration file). Add the following inside <configuration></configuration>. Note that the SecondaryNameNode and the NameNode should not be placed on the same machine.

<!-- SecondaryNameNode host and port -->
<property>
	<name>dfs.namenode.secondary.http-address</name>
	<value>node2:50090</value>
</property>
<!-- NameNode web UI address and port -->
<property>
	<name>dfs.namenode.http-address</name>
	<value>node1:50070</value>
</property>
<!-- NameNode metadata storage location -->
<property>
	<name>dfs.namenode.name.dir</name>
	<value>file:///develop/server/hadoop-2.7.5/data/nndata</value>
</property>
<!-- DataNode data storage location -->
<property>
	<name>dfs.datanode.data.dir</name>
	<value>file:///develop/server/hadoop-2.7.5/data/dndata</value>
</property>
<!-- NameNode edits file storage path -->
<property>
	<name>dfs.namenode.edits.dir</name>
	<value>file:///develop/server/hadoop-2.7.5/data/nn/edits</value>
</property>
<!-- Checkpoint directory -->
<property>
	<name>dfs.namenode.checkpoint.dir</name>
	<value>file:///develop/server/hadoop-2.7.5/data/snn/name</value>
</property>
<!-- Checkpoint edits directory -->
<property>
	<name>dfs.namenode.checkpoint.edits.dir</name>
	<value>file:///develop/server/hadoop-2.7.5/data/dfs/snn/edits</value>
</property>
<!-- Number of block replicas -->
<property>
	<name>dfs.replication</name>
	<value>3</value>
</property>
<!-- HDFS file permissions -->
<property>
	<name>dfs.permissions</name>
	<value>true</value>
</property>
<!-- HDFS block size: 128 MB -->
<property>
	<name>dfs.blocksize</name>
	<value>134217728</value>
</property>

2.5 Copy mapred-site.xml.template and rename it to mapred-site.xml

cp mapred-site.xml.template mapred-site.xml
ll

2.6 Configure mapred-site.xml (MapReduce's core configuration file). Add the following inside <configuration></configuration>:

<!-- Framework used to run MapReduce jobs (YARN) -->
<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>
<!-- Enable the MapReduce uber (small job) mode -->
<property>
	<name>mapreduce.job.ubertask.enable</name>
	<value>true</value>
</property>
<!-- Job history server address and port -->
<property>
	<name>mapreduce.jobhistory.address</name>
	<value>node1:10020</value>
</property>
<!-- Job history server web UI address and port -->
<property>
	<name>mapreduce.jobhistory.webapp.address</name>
	<value>node1:19888</value>
</property>

2.7 Configure mapred-env.sh and specify JAVA_HOME

export JAVA_HOME=/develop/server/jdk1.8.0_241

2.8 Configure yarn-site.xml (YARN's core configuration file). Add the following inside <configuration></configuration>:

<!-- Host of the YARN master node (ResourceManager) -->
<property>
	<name>yarn.resourcemanager.hostname</name>
	<value>node1</value>
</property>
<!-- Auxiliary shuffle service for MapReduce -->
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>
<!-- Enable log aggregation -->
<property>
	<name>yarn.log-aggregation-enable</name>
	<value>true</value>
</property>
<!-- How long aggregated logs are kept on HDFS, in seconds (7 days) -->
<property>
	<name>yarn.log-aggregation.retain-seconds</name>
	<value>604800</value>
</property>
<!-- YARN cluster memory allocation settings -->
<property>
	<name>yarn.nodemanager.resource.memory-mb</name>
	<value>2048</value>
</property>
<property>
	<name>yarn.scheduler.minimum-allocation-mb</name>
	<value>2048</value>
</property>
<property>
	<name>yarn.nodemanager.vmem-pmem-ratio</name>
	<value>2.1</value>
</property>

2.9 Configure the slaves file. It records the host names of the cluster nodes. Delete the original content and add the following:

node1
node2
node3

3. Create the data storage directories

mkdir -p /develop/server/hadoop-2.7.5/data/tempdata
mkdir -p /develop/server/hadoop-2.7.5/data/nndata
mkdir -p /develop/server/hadoop-2.7.5/data/dndata
mkdir -p /develop/server/hadoop-2.7.5/data/nn/edits
mkdir -p /develop/server/hadoop-2.7.5/data/snn/name
mkdir -p /develop/server/hadoop-2.7.5/data/dfs/snn/edits

4. Distribute the files

4.1 Distribute the installed and configured Hadoop to the other two machines

cd /develop/server
scp -r hadoop-2.7.5/ node2:$PWD
scp -r hadoop-2.7.5/ node3:$PWD

4.2 Check the distributed files on the other two machines

cd /develop/server/
ll

 

4.3 Configure the Hadoop environment variables (this must be done on all three machines)

vim /etc/profile.d/my_env.sh
# HADOOP_HOME
export HADOOP_HOME=/develop/server/hadoop-2.7.5
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

4.4 Refresh the environment variables

source /etc/profile

4.5 Repeat steps 4.3 and 4.4 on the other two machines
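Alternatively, the environment file can be copied from node1 instead of being edited by hand on each node. A minimal sketch, assuming the passwordless SSH login set up in the prerequisites (the file name my_env.sh follows section 4.3):

# Run on node1: copy the environment file to the other two nodes
scp /etc/profile.d/my_env.sh node2:/etc/profile.d/
scp /etc/profile.d/my_env.sh node3:/etc/profile.d/
# Then log in to node2 and node3 and refresh the environment there
source /etc/profile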

5. Start the Hadoop cluster

5.1 HDFS must be formatted before it is started for the first time. Execute the following command on node1:

hadoop namenode -format

5.2 Start the services. The commands below start each daemon individually.

5.2.1 Start the NameNode by executing the following command on node1

hadoop-daemon.sh start namenode

5.2.2 Start the DataNodes. On node1, node2, and node3, run the following command

hadoop-daemon.sh start datanode

5.2.3 Start the ResourceManager on node1

yarn-daemon.sh start resourcemanager

5.2.4 Start the YARN NodeManager on node1, node2, and node3 with the following command

yarn-daemon.sh start nodemanager

5.2.5 Start the SecondaryNameNode on node2

hadoop-daemon.sh start secondarynamenode

5.2.6 Start the JobHistoryServer on node1

mr-jobhistory-daemon.sh start historyserver

5.3 Check that the services started by running the jps command on each of the three machines
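With the roles assigned above, jps should list roughly the following daemons on each node (the process IDs below are only illustrative and will differ on your machines):

# node1
jps
# 1201 NameNode
# 1345 DataNode
# 1502 ResourceManager
# 1688 NodeManager
# 1803 JobHistoryServer
# 1950 Jps

# node2: DataNode, NodeManager, SecondaryNameNode (plus Jps)
# node3: DataNode, NodeManager (plus Jps)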

 

5.4 Shut down the services

To shut down a service, change start to stop in the corresponding command above.
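For example, the daemons started in section 5.2 can be stopped with:

# On node1
hadoop-daemon.sh stop namenode
yarn-daemon.sh stop resourcemanager
mr-jobhistory-daemon.sh stop historyserver
# On node2
hadoop-daemon.sh stop secondarynamenode
# On every node
hadoop-daemon.sh stop datanode
yarn-daemon.sh stop nodemanager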

6. One-click startup script: to be improved; a rough sketch follows
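Until a polished script is written, here is a minimal sketch to be run from node1. It assumes passwordless SSH between the nodes and that the Hadoop scripts are on the PATH on every node (as set up in section 4); the file name start-cluster.sh is only a suggestion.

#!/bin/bash
# start-cluster.sh - rough one-click start for the cluster in this guide; run on node1

# HDFS
hadoop-daemon.sh start namenode
for host in node1 node2 node3; do
    ssh "$host" "source /etc/profile; hadoop-daemon.sh start datanode"
done
ssh node2 "source /etc/profile; hadoop-daemon.sh start secondarynamenode"

# YARN
yarn-daemon.sh start resourcemanager
for host in node1 node2 node3; do
    ssh "$host" "source /etc/profile; yarn-daemon.sh start nodemanager"
done

# MapReduce job history server
mr-jobhistory-daemon.sh start historyserver

Hadoop also ships start-dfs.sh and start-yarn.sh under $HADOOP_HOME/sbin; they read the slaves file from section 2.9 and start the HDFS and YARN daemons across the whole cluster, which is usually the simpler option.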

The Hadoop cluster is now set up.

Topics: Linux Big Data Hadoop