National big data skills competition for college students (Hadoop cluster construction)

Posted by johnsonzhang on Thu, 23 Dec 2021 09:25:26 +0100

Preface

This blog walks through building a Hadoop cluster, based on material from previous years' National College Students' Big Data Skills Competition and this year's training; each step originally had a corresponding screenshot. The post is only meant as a record of the build process ~ if anything is lacking, you are welcome to point it out so we can learn and improve together. Links to the materials are attached below.

Data link

Training link for building a Hadoop cluster in the 4th National Big Data Skills Competition for College Students:
https://www.qingjiaoclass.com/market/detail/4486
Baidu network disk link for all the environment tools:
https://pan.baidu.com/s/1oOW7WqHK4fiqv4Xja5f7gQ
Extraction code: vvi7

When practicing the Hadoop cluster build, try to take a snapshot at every step. If an error comes up that you cannot solve, you can roll back instead of rebuilding everything from scratch, which is very troublesome.

Prepare three virtual machines in VMware and switch their network mode to bridged before configuring them.

1. Turn off the firewall and SELinux

Temporarily:

systemctl stop firewalld
setenforce 0

Permanently:

systemctl disable firewalld
vi /etc/sysconfig/selinux
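
In /etc/sysconfig/selinux, the permanent switch is the SELINUX line; change it to:

SELINUX=disabled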


2. Configure the bridged network mode and set a static IP address

vi /etc/sysconfig/network-scripts/ifcfg-ens33
nmcli c reload ifcfg-ens33
nmcli con up ens33
ifconfig
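
For reference, a minimal sketch of a static-IP ifcfg-ens33 for the master (192.168.43.130 matches the hosts mapping used later; GATEWAY and DNS1 are assumptions, use your bridged network's actual values):

TYPE=Ethernet
BOOTPROTO=static
NAME=ens33
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.43.130
NETMASK=255.255.255.0
GATEWAY=192.168.43.1
DNS1=192.168.43.1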



3. Clone two virtual machines and modify the IP address

Clone virtual machine
Modify the IP addresses of the two cloned virtual machines to:
192.168.43.140 and 192.168.43.150
Open SecureCRT and remotely log in to the three virtual machines

01 local YUM source
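
As a minimal sketch (the mount point and repo file name here are assumptions), a local YUM source can be set up by mounting the CentOS installation ISO and pointing a .repo file at it:

mount /dev/cdrom /mnt
vi /etc/yum.repos.d/local.repo

Contents of local.repo:

[local]
name=local
baseurl=file:///mnt
gpgcheck=0
enabled=1

Then rebuild the cache:

yum clean all
yum makecache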


02 basic environment configuration

1. Firewall

systemctl stop firewalld
systemctl status firewalld

2. Host name and mapping

2.1 modify host name

This cluster build has three nodes: one master node (master) and two slave nodes (slave1 and slave2). You need to modify the host name on each of the three nodes.

1. Take the master node as an example; first switch to the root user: su
2. Modify the host name to master:

hostnamectl set-hostname master

3. Make it take effect immediately (start a new shell so the prompt picks up the new name):

bash


4. To permanently modify the host name, edit the /etc/sysconfig/network file

vi /etc/sysconfig/network

The contents are as follows:

NETWORKING=yes
HOSTNAME=master

Save the file and restart the computer: reboot
Check whether it works: hostname

2.2 add mapping

This lets each node reach the others by their host names.

The hosts file maps each node's host name to its IP address, so that nodes can quickly find and access one another. It must be configured on all three virtual machine nodes.

vi /etc/hosts
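
Add the same mappings that are used again in the ZooKeeper section:

192.168.43.130 master master.root
192.168.43.140 slave1 slave1.root
192.168.43.150 slave2 slave2.root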

3. Time synchronization

3.1 time zone

1. Check your machine time:

date

2. Select time zone:

tzselect
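
Note that tzselect only prints the TZ value; it does not change any system setting by itself. To make the chosen zone stick, append what it prints to your profile (Asia/Shanghai here is an example assumption, use whatever tzselect reported):

echo "TZ='Asia/Shanghai'; export TZ" >> /etc/profile
source /etc/profile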

3.2 time synchronization protocol NTP

1. Install ntp on all three machines:

yum install -y ntp


2. The master serves as the NTP server; modify its ntp configuration file (run on the master only).

vi /etc/ntp.conf

At the end of the file (around line 58), add:

server 127.127.1.0
fudge 127.127.1.0 stratum 10

3. Restart the ntp service.

/bin/systemctl restart ntpd.service


4. Synchronize the other machines (slave1, slave2)
Wait about five minutes, then synchronize each slave's clock against the master:

ntpdate master


4. Scheduled task crontab

4.1 description

  • Asterisk (*): matches all possible values; for example, an asterisk in the month field means the command runs every month, subject to the constraints of the other fields.
  • Comma (,): separates the values of a list, for example "1,2,5,7,8,9".
  • Dash (-): a range of integers, for example "2-6" means "2,3,4,5,6".
  • Slash (/): an interval frequency; for example "0-23/2" in the hour field means every two hours.
  • The slash can also be combined with the asterisk: "*/10" in the minute field means every ten minutes.


Enter

crontab -h

to see the command's usage.

4.2 case: scheduled task

Requirement: synchronize time with the master every 10 minutes.

Write a scheduled task:

crontab -e

Type i to enter insert mode, then enter:

*/10 * * * * /usr/sbin/ntpdate master

To view a scheduled task list:

crontab -l

03 remote login via SSH

The goal is to let the master node log in to the two slave nodes over SSH without a password.
1. The three machines execute respectively

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa


2. Execute only on the master node

cd .ssh/
ls

id_dsa.pub is the public key, id_dsa is the private key

3. Copy the public key file to authorized_keys file: (master only)

cat id_dsa.pub >> authorized_keys

Verify the SSH loopback login on the master:

ssh master    # enter yes at the prompt the first time
exit
ssh master    # no yes confirmation is needed this time
exit


4. Execute the following commands on the master node:

scp ./authorized_keys root@slave1:~/.ssh/
scp ./authorized_keys root@slave2:~/.ssh/


5. Verify the passwordless SSH login to the slaves:

ssh slave1
exit

04 language environment - Java

Install JDK

All three virtual machines need to be installed
1. Create the working directory

mkdir -p /usr/java 

2. Transfer the JDK package to the root directory of each machine (I use SecureCRT)
Under Windows, copy the jdk installation package, open SecureCRT's file-transfer window, right-click and paste, then wait for the transfer to complete.

3. Decompression

tar -zxvf jdk-8u171-linux-x64.tar.gz -C /usr/java/

4. Modify environment variables

vi /etc/profile

Add the following to line 55 of the file

export JAVA_HOME=/usr/java/jdk1.8.0_171 
export PATH=$PATH:$JAVA_HOME/bin


Apply the profile and check the JDK version:

source /etc/profile
java -version
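
If everything is in place, the output should look roughly like this (a sketch; the exact build suffix may differ):

java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)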


Installation succeeded!

05 coordination service ZooKeeper installation

Note: step 1 is performed on all three virtual machines, steps 2 to 10 on the master only, step 11 on slave1 and slave2, and step 12 on all three virtual machines

1. Modify the host-name-to-IP-address mapping (all three virtual machines)

vi /etc/hosts

192.168.43.130 master master.root
192.168.43.140 slave1 slave1.root
192.168.43.150 slave2 slave2.root

2. Upload the zookeeper installation package to the root directory of the virtual machine
3. Create a new directory

mkdir -p /usr/zookeeper


4. Decompression

tar -zxvf zookeeper-3.4.10.tar.gz -C /usr/zookeeper/

5. Modify the /etc/profile file to configure the ZooKeeper environment variables.

vi /etc/profile

Add on line 58

export ZOOKEEPER_HOME=/usr/zookeeper/zookeeper-3.4.10 
export PATH=$PATH:$ZOOKEEPER_HOME/bin


6. Enter ZooKeeper's conf configuration folder and make a copy of zoo_sample.cfg named zoo.cfg

cd /usr/zookeeper/zookeeper-3.4.10/conf/ 
cp -p zoo_sample.cfg zoo.cfg


7. Configure zoo.cfg

vi zoo.cfg

Modify line 12:

dataDir=/usr/zookeeper/zookeeper-3.4.10/zkdata

At the end of the file (line 29), add the following:

dataLogDir=/usr/zookeeper/zookeeper-3.4.10/zkdatalog

server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888

8. In the ZooKeeper directory, create the two folders required by the configuration, zkdata and zkdatalog.

mkdir zkdata
mkdir zkdatalog


9. Enter the zkdata folder and create the file myid, which holds the server number. On the master host, set the server id to 1.

cd zkdata
vi myid
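
Equivalently, without opening an editor (the file must contain only the server number):

echo 1 > myid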


10. ZooKeeper is now configured on the master node. Remotely copy the configured installation directory to the corresponding directory on each of the other cluster nodes (here, the two slave nodes):

scp -r /usr/zookeeper root@slave1:/usr/ 
scp -r /usr/zookeeper root@slave2:/usr/


11. Edit the myid file on the slaves: slave1 gets 2 and slave2 gets 3

cd /usr/zookeeper/zookeeper-3.4.10/zkdata
vi myid



12. Start the ZooKeeper cluster
On each virtual machine, first enter the ZooKeeper directory:

cd /usr/zookeeper/zookeeper-3.4.10/

Then run on each node:

bin/zkServer.sh start 
bin/zkServer.sh status
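
A healthy node prints something like the following (a sketch; the Mode line is what matters):

ZooKeeper JMX enabled by default
Using config: /usr/zookeeper/zookeeper-3.4.10/bin/../conf/zoo.cfg
Mode: follower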

If one node is the leader and the other two are followers, the installation succeeded~


06 Hadoop installation

Extract the installation package and configure environment variables

1. Copy the Hadoop installation package to the root directory, create a working directory, and unzip the Hadoop installation package

mkdir -p /usr/hadoop
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/hadoop/

2. Configure environment variables and modify the / etc/profile file

vi /etc/profile

Add on line 60

#HADOOP
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.3
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib
export PATH=$PATH:$HADOOP_HOME/bin


3. Apply the configuration file

source /etc/profile

Configure Hadoop components

1. Enter Hadoop's etc/hadoop configuration directory

cd /usr/hadoop/hadoop-2.7.3/etc/hadoop


2. Edit the hadoop-env.sh file

vi hadoop-env.sh

Add the following on line 28

export JAVA_HOME=/usr/java/jdk1.8.0_171


3. Edit core-site.xml

vi core-site.xml

Add the following (replacing the file's empty <configuration> element):

<configuration>
<property>
  <name>fs.default.name</name>
   <value>hdfs://master:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
   <value>/usr/hadoop/hadoop-2.7.3/hdfs/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
  <name>io.file.buffer.size</name>
   <value>131072</value>
</property>
<property>
  <name>fs.checkpoint.period</name>
   <value>60</value>
</property>
<property>
  <name>fs.checkpoint.size</name>
   <value>67108864</value>
</property>
</configuration>


4. Edit yarn-site.xml

vi yarn-site.xml

Add the following between the <configuration> tags:
<property>
 <name>yarn.resourcemanager.address</name>
   <value>master:18040</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>  <value>master:18030</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:18088</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:18025</value>
</property>
<property>
 <name>yarn.resourcemanager.admin.address</name>
 <value>master:18141</value>
</property>
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>
<property>
 <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>


5. Edit hdfs-site.xml

vi hdfs-site.xml

Add the following (replacing the file's empty <configuration> element):

<configuration>
<property>
 <name>dfs.replication</name>
   <value>2</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/hadoop/hadoop-2.7.3/hdfs/name</value>
   <final>true</final>
 </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/hadoop/hadoop-2.7.3/hdfs/data</value>
    <final>true</final>
 </property>
 <property>
  <name>dfs.namenode.secondary.http-address</name>
   <value>master:9001</value>
 </property>
 <property>
   <name>dfs.webhdfs.enabled</name>
    <value>true</value>
 </property>
 <property>
   <name>dfs.permissions</name>
   <value>false</value>
</property>
</configuration>


6. Edit mapred-site.xml
Hadoop does not ship with this file; copy mapred-site.xml.template to mapred-site.xml.

cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml

Add the following between the <configuration> tags:

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>


7. Set the node files (their contents are shown below)

vi slaves

vi master
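
With the host names used in this guide, slaves lists the worker nodes, one per line:

slave1
slave2

and master holds the master's host name:

master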


8. Distribute hadoop

scp -r /usr/hadoop root@slave1:/usr/
scp -r /usr/hadoop root@slave2:/usr/

9. Format HDFS (run on the master virtual machine only)

hadoop namenode -format

When a line like "Storage directory ... has been successfully formatted." appears, congratulations ~ the format succeeded!

10. Start the cluster
Run the start command on the master host only; it brings up the daemons on the slave nodes as well.
Start the cluster by entering the Hadoop directory and executing:

sbin/start-all.sh

or, if $HADOOP_HOME/sbin is on your PATH:

start-all.sh

View the processes:

jps
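
With this configuration, the daemons you should see are roughly the following (a sketch; jps also lists itself as Jps, and the process IDs will differ). On the master:

NameNode
SecondaryNameNode
ResourceManager
QuorumPeerMain

On slave1 and slave2:

DataNode
NodeManager
QuorumPeerMain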


11. Browser access: http://192.168.43.130:50070 (the master's IP, port 50070)
If the NameNode web page appears, the cluster has been built successfully~~

07 Add or delete nodes

Topics: Big Data Hadoop