Building Hadoop on a virtual machine (pseudo-distributed and distributed setup)

Posted by many_pets on Sat, 01 Jan 2022 19:05:16 +0100

After a semester of learning Hadoop, I have finally chewed through this big bone, tears!!!
This article is more of a summary of my Hadoop learning.

1, Preparatory work

1. hadoop compressed package

Download the compressed package from the Hadoop official website and have it ready. I use version 2.7.1.

2. jdk compressed package

This is the Java runtime. Hadoop is built on Java, so you need to configure a Java environment.

3. Xshell

It works together with Xftp to transfer files to the virtual machine.

4. Xftp

Used together with Xshell to transfer files.

I will write a separate tutorial on how to use these two main tools.

5. VM virtual machine

I won't go into detail here: if you cannot even install a virtual machine, learning Hadoop will not get far. There are many installation tutorials online that you can follow to install the VM. I use VMware Workstation 12 Pro here.

2, Hadoop pseudo-distributed

1. Install Java

Create a new jvm folder in /usr/local:

cd /usr/local
mkdir jvm

Enter the jvm directory, transfer the JDK archive to the virtual machine with Xftp, and unpack it into the current directory with tar -zxvf, as sketched below.
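A sketch of the extraction, with placeholders for the archive name since it depends on the JDK build you downloaded:

cd /usr/local/jvm
tar -zxvf <jdk-archive>.tar.gz       # placeholder for the JDK archive you downloaded
mv <extracted-jdk-folder> jdk        # rename so the path matches the JAVA_HOME used below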

2. Configure environment variables

Edit the environment variables with vi ~/.bashrc and add the following lines (my installation directory: /usr/local/jvm/jdk):

export JAVA_HOME=/usr/local/jvm/jdk      
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$PATH:${JAVA_HOME}/bin

Run source ~/.bashrc so that .bashrc takes effect immediately, then check whether the environment variables are configured correctly with java -version.
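For example (the exact version string printed depends on the JDK you installed):

source ~/.bashrc
java -version      # prints the Java version if JAVA_HOME and PATH are set correctly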

3. Install Hadoop

Transfer the Hadoop archive to the virtual machine with Xftp and extract it so that the installation path is /usr/local/hadoop, as sketched below.
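A sketch of the extraction, assuming the archive name of the 2.7.1 release mentioned above:

cd /usr/local
tar -zxvf hadoop-2.7.1.tar.gz        # archive name assumed; use the one you downloaded
mv hadoop-2.7.1 hadoop               # rename so the installation path is /usr/local/hadoop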

Modify the configuration files core-site.xml and hdfs-site.xml. Both initially contain only an empty configuration element:

<configuration>
</configuration>

core-site.xml modification (/usr/local/hadoop/etc/hadoop directory)

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml modification (/usr/local/hadoop/etc/hadoop directory)

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

4. Format NameNode

After the configuration is done, format the NameNode:

$ cd /usr/local/hadoop
$ ./bin/hdfs namenode -format

If formatting succeeds, you will see "successfully formatted" and "Exiting with status 0"; "Exiting with status 1" indicates an error.

If this step reports Error: JAVA_HOME is not set and could not be found., it means the JAVA_HOME environment variable was not set earlier. Set JAVA_HOME as described above first, otherwise the subsequent steps cannot proceed. If you have already set JAVA_HOME in the .bashrc file as described and still get Error: JAVA_HOME is not set and could not be found., go to the Hadoop installation directory and edit the configuration file "/usr/local/hadoop/etc/hadoop/hadoop-env.sh": find the line "export JAVA_HOME=${JAVA_HOME}" and change it to the concrete Java installation path, for example "export JAVA_HOME=/usr/lib/jvm/default-java" (in this guide that would be /usr/local/jvm/jdk), then start Hadoop again.
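A minimal sketch of that edit, assuming the JDK path used earlier in this guide:

# /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jvm/jdk     # replace the ${JAVA_HOME} placeholder with the absolute JDK path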

5. Configure Hadoop environment variables (for start-dfs.sh and start-yarn.sh)

Open the environment variable configuration with vi ~/.bashrc and add the following paths (the export PATH line is best placed last):

export HADOOP_HOME=/usr/local/hadoop
export PATH=${HADOOP_HOME}/sbin:${HADOOP_HOME}/bin:$PATH

Run source ~/.bashrc to make the configuration take effect immediately, then enter hadoop version to check whether the configuration succeeded.

You can then run start-all.sh from anywhere to start the Hadoop services.

6. Start Hadoop

Run start-all.sh to start Hadoop (it is recommended to go back to the root directory first); a quick sketch follows.
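A quick sketch of starting the daemons and checking them:

cd ~                # optional: go back to the root directory as recommended above
start-all.sh        # starts the HDFS and YARN daemons
jps                 # NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager should be listed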

7. Access HDFS via Web port

Note that if you cannot reach the web UI, the likely reasons are that your NameNode did not start (a mis-configured file) or that the firewall has not been turned off. Here is how to deal with the firewall:

systemctl stop firewalld
systemctl disable firewalld

The first command stops the firewall, and the second disables it permanently. After that you can access the web UI at localhost:50070.
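If the page still does not load, a quick sketch for checking both causes mentioned above:

jps                           # NameNode must appear; if not, re-check core-site.xml and hdfs-site.xml
systemctl status firewalld    # the firewall should be reported as stopped after the commands above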

3, Hadoop distributed

The example I demonstrate here uses one master and one slave. The principle is the same for larger clusters: adding more slaves just means repeating the slave steps, as long as the hostnames and IPs are not duplicated. I add comments in the configuration below to show where extra slaves would go.

0. Install Java

As in the pseudo-distributed setup, install the Java environment on the master host first.

1. Modify hostname

Because a distributed Hadoop cluster consists of multiple machines, each machine's hostname must be unique.

Edit the hostname with vi /etc/hostname, then run hostname $(cat /etc/hostname) so that the new hostname takes effect immediately, as sketched below.
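For example, on the master (repeat on each slave with its own name):

vi /etc/hostname                 # replace the contents with the machine's hostname, e.g. master
hostname $(cat /etc/hostname)    # apply the new hostname without rebooting
hostname                         # verify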

Add the IP addresses of all machines in the cluster to /etc/hosts (all slaves also need this step):

[root@slave hadoop]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.117.150 master				# master
192.168.117.151 slave				# slave
#192.168.117.152 slave02	
#192.168.117.153 slave03

2. Configure environment variables

Edit the environment variables with vi ~/.bashrc and add the following lines (my installation directory: /usr/local/jvm/jdk):

export JAVA_HOME=/usr/local/jvm/jdk      
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:${PATH}

Run source ~/.bashrc so that .bashrc takes effect immediately, then check whether the environment variables are configured correctly with java -version.

All slaves also need to configure and source these environment variables.

3. Set up passwordless SSH

After setting up passwordless login, you no longer have to enter a password every time you start Hadoop or connect to a slave, which makes it easy for the master to log in to the slaves.
On the master, run cd ~/.ssh to check whether the .ssh folder exists. If it does not, run ssh localhost and enter the password; connecting to localhost over ssh creates the .ssh folder. In that folder, run ssh-keygen -t rsa and press Enter at each prompt (three times); the key pair is then generated in this directory. A sketch follows.
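A sketch of those steps on the master:

ssh localhost        # only needed if ~/.ssh does not exist yet; creates it after you enter the password
cd ~/.ssh
ssh-keygen -t rsa    # press Enter three times to accept the defaults; generates id_rsa and id_rsa.pub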

3.1 Passwordless login to localhost

Then, in the .ssh directory, run cat id_rsa.pub >> authorized_keys so that ssh localhost no longer asks for a password.

[root@slave .ssh]# cat id_rsa.pub >> authorized_keys
[root@slave .ssh]# ls
authorized_keys  id_rsa  id_rsa.pub  known_hosts     # authorized_keys has been added
[root@slave .ssh]# ssh localhost
Last login: Wed Dec  1 15:33:55 2021 from ::1		# connected without a password

3.2 Passwordless login to the slaves

Copy the generated key to the slave:

ssh-copy-id -i root@slave	
#ssh-copy-id -i root@slave02
#ssh-copy-id -i root@slave03

The part after root@ is the hostname of your slave machine. After copying, run ssh root@slave to verify.

[root@master ~]# ssh slave
Last login: Wed Dec  1 16:02:37 2021 from 192.168.117.150
[root@slave ~]# exit
 Logout
Connection to slave closed.

You can see that password verification is no longer required

4. Configure Hadoop

On the master host, modify the core-site.xml file (/usr/local/hadoop/etc/hadoop directory):

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
        <!-- port 9000 on the master must be reachable, so change localhost to the master's hostname -->
    </property>
</configuration>

Modify the hdfs-site.xml file (/usr/local/hadoop/etc/hadoop directory):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
    <!-- optional: adding the following property does no harm -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:50090</value>
        <!-- change localhost to the master's hostname -->
    </property>
    
</configuration>

Modify the slaves configuration file on the master (/usr/local/hadoop/etc/hadoop directory) and add the hostnames of the slaves:

[root@master hadoop]# cat slaves 
slave		#My slave hostname
#slave02
#slave03

5. Copy file

Copy the configured hadoop and jdk directories from the master to the slave:

scp -r <File directory to be copied> <Slave hostname>:<Copy to the directory of the slave>
scp -r /usr/local/hadoop slave:/usr/local   # copy the master's hadoop directory to /usr/local on the slave
#scp -r /usr/local/hadoop slave02:/usr/local
#scp -r /usr/local/hadoop slave03:/usr/local

scp -r /usr/local/jvm slave:/usr/local   # copy the master's jvm directory to /usr/local on the slave
#scp -r /usr/local/jvm slave02:/usr/local
#scp -r /usr/local/jvm slave03:/usr/local

Don't forget the ~/.bashrc environment variable configuration on each slave: use the same exports as on the master, and run source ~/.bashrc after editing (a sketch follows)!!!
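Roughly, on each slave (the exports mirror the master's ~/.bashrc from the earlier steps):

vi ~/.bashrc          # add the same JAVA_HOME, JRE_HOME, CLASSPATH, HADOOP_HOME and PATH exports as on the master
source ~/.bashrc      # make them take effect
java -version         # quick sanity check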

6. Format NameNode

$ cd /usr/local/hadoop
$ ./bin/hdfs namenode -format

The format operation only needs to be run once, on the master host.
If it succeeds, you will see "successfully formatted" and "Exiting with status 0"; "Exiting with status 1" indicates an error.

If Error: JAVA_HOME is not set and could not be found. appears, fix it as described in the pseudo-distributed section: set JAVA_HOME in ~/.bashrc, or hard-code the JDK path in /usr/local/hadoop/etc/hadoop/hadoop-env.sh.
The hdfs command is in Hadoop's bin directory.

7. Verify that the cluster is started successfully

Run start-all.sh on the master; after it starts, enter jps:

[root@master hadoop]# jps
13845 NameNode
14182 ResourceManager
14875 Jps
14029 SecondaryNameNode

Enter jps on the slave:

[root@slave ~]# jps
10405 DataNode
11045 Jps
10509 NodeManager

You can see that the NameNode has started on the master and the DataNode has started on the slave.

The cluster is set up successfully
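As an extra check (not part of the original steps, but a standard HDFS command), you can also run the following on the master:

hdfs dfsadmin -report    # lists the live DataNodes; each slave should appear once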

8. Access HDFS via Web port

systemctl stop firewalld
systemctl disable firewalld

The first command stops the firewall and the second disables it permanently; you can then access the web UI on port 50070 (localhost:50070 on the master). Click "Datanodes" at the top of the web interface to see the slaves' information.

If the DataNodes are not displayed even though jps shows a DataNode process on each slave, check /etc/hosts: every machine in the cluster must have the IPs of the entire cluster in its hosts file.

Then run stop-all.sh, restart the network on each machine with service network restart, start the cluster again and reload the web page; or simply reboot the virtual machines. A sketch follows.
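Roughly, the sequence would be:

stop-all.sh                # on the master
service network restart    # on every machine in the cluster
start-all.sh               # on the master again, then refresh the web page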

Topics: Hadoop Distribution hdfs