Building a distributed Hadoop environment

Posted by caspert_ghost on Sun, 21 Jun 2020 11:11:30 +0200

I wrote this note with reference to Lin Ziyu's teaching documents. For details, please refer to the Database Laboratory of Xiamen University.

Environment used for this hands-on Hadoop build: Ubuntu 14.04 64-bit * 3, JDK 1.8, Hadoop 2.6.5 (Apache)

1, Preparation before installing Hadoop

1. Set up the root account

The first time, you need to set a root password.
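A minimal sketch of this step, using standard Ubuntu commands:

sudo passwd root        # set the root password (you will be prompted to enter it twice)
su root                 # check that you can switch to the root user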

Then enable root login from the graphical login screen (the greeter).

Add the following lines to the LightDM greeter configuration file (on Ubuntu 14.04 this is commonly a file such as /usr/share/lightdm/lightdm.conf.d/50-ubuntu.conf; the original highlights these lines in a red box in a screenshot):

 

user-session=ubuntu

greeter-show-manual-login=true

allow-guest=false

Restart the machine after saving, then log in as root with the password you set.

 

2. Configure network

 

Create three Ubuntu 14.04 64-bit virtual machines under VMware: one Master and two Slaves. Put all three on the same network segment and make sure they can ping each other.

Modify each virtual machine's hostname: # vim /etc/hostname (install vim yourself if it is not present)

IP mapping: # vim /etc/hosts (map each hostname to its host IP)
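For example, the /etc/hosts entries could look like the following on all three machines; the IP addresses are placeholders and must be replaced with your actual addresses:

192.168.1.101   master
192.168.1.102   slave1
192.168.1.103   slave2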

 

3. Switch to the Aliyun (Alibaba Cloud) apt source

Step 1: enter the /etc/apt directory

cd /etc/apt

Step 2: back up the sources.list file

sudo cp sources.list sources.list.bak

Step 3: edit the sources.list file and replace its contents with the Aliyun source

sudo vim sources.list

Source entries for each Ubuntu version can be found here:

https://www.cnblogs.com/dtiove/p/5917263.html
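For reference, a typical Aliyun source list for Ubuntu 14.04 (trusty) looks like the lines below; treat this as a sketch and verify against the page above before using it:

deb http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse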

Step 4: update the package index

apt-get update

4. Install SSH

1. Step 1: install the SSH server:

apt-get install openssh-server

After installation, you can use the following command to log in to the machine:

ssh localhost

2. Step 2: configure passwordless SSH login:

First exit the ssh session and return to the original terminal window. Then use ssh-keygen to generate a key and copy it to each node's authorized keys:

Note: this is configured on the Master.

exit                           # Exit the ssh localhost session
cd ~/.ssh/                     # If this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa              # Press Enter at every prompt

ssh-copy-id -i ~/.ssh/id_rsa.pub slave1
# Copy the public key to slave1. If there is a slave2, run the command again with slave2; you will be asked for each node's password.
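You can now verify the passwordless login from the Master; it should drop into a shell on the Slave without asking for a password:

ssh slave1      # should log in without a password prompt
exit            # return to the Master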

3. Step 3: allow root login over SSH:

gedit /etc/ssh/sshd_config

Change line 28 from

#PermitRootLogin prohibit-password

to

PermitRootLogin yes
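Restart the SSH service afterwards so the change takes effect, for example:

service ssh restart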

5. Installing the Java environment

1. Install JDK

sudo apt-get install default-jre default-jdk

2. Configure environment variables

vim ~/.bashrc

Add the following line by itself at the top of the file (note that there must be no spaces around the equals sign "="), then save and exit:

export JAVA_HOME=/usr/lib/jvm/default-java

Next, to make the environment variable take effect immediately, run:

source ~/.bashrc

3. Verify that the environment is configured correctly

echo $JAVA_HOME     # Test variable value
java -version

2, Install Hadoop

1. Download hadoop

hadoop download address: https://mirrors.cnnic.cn/apache/hadoop/common/

Here I use 2.6.5. After downloading, transfer the archive to the Master node.

2. Decompress hadoop

sudo tar -zxf ~/download/hadoop-2.6.5.tar.gz -C /usr/local    # Extract to /usr/local (adjust the file name and path to the archive you downloaded)
cd /usr/local/
sudo mv ./hadoop-2.6.5/ ./hadoop            # Rename the folder to hadoop
sudo chown -R hadoop ./hadoop               # Give ownership to the user that will run Hadoop (here a user named hadoop)

3. Check that Hadoop works

Hadoop is ready to use once extracted. Run the following commands to check whether Hadoop is usable; on success the Hadoop version information is printed:

cd /usr/local/hadoop
./bin/hadoop version

3, Configure Hadoop files

1, Configure PATH variable

First, run vim ~/.bashrc and add the following line at the top. The path points to the bin and sbin directories of the extracted Hadoop, which makes the commands used later easier to run:

export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin


Then run source ~/.bashrc to make the configuration take effect.
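Afterwards you can confirm that the PATH entry works by running Hadoop from any directory:

cd ~
hadoop version      # should print the version without needing the ./bin/ prefix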

2, Configure cluster environment

Cluster/distributed mode requires modifying five configuration files in /usr/local/hadoop/etc/hadoop. For more settings, see the official documentation. Here only the settings necessary for a normal startup are configured: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.

1. File slaves: this file lists the hostnames of the DataNodes, one per line. The default content is localhost. In a distributed configuration you can keep localhost or delete it; deleting it means the Master node acts only as a NameNode.

In this tutorial the Master node is used only as a NameNode, so the original localhost in the file is deleted and only two lines are added: slave1 and slave2. The 1 and 2 are just node numbers; if you add another node, add slave3, and so on.

Every node must be listed here; otherwise the unlisted nodes will not start the related processes.
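With the two Slaves used here, the slaves file therefore contains just:

slave1
slave2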

 

2. File core-site.xml: change it to the following configuration (note that master in the fourth line is the hostname you mapped in /etc/hosts; the same applies to master in the files below):

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://master:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>file:/usr/local/hadoop/tmp</value>
                <description>Abase for other temporary directories.</description>
        </property>
</configuration>

 

3. File hdfs-site.xml: dfs.replication is generally set to 3.

<configuration>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>master:50090</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:/usr/local/hadoop/tmp/dfs/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:/usr/local/hadoop/tmp/dfs/data</value>
        </property>
</configuration>

 

4. File mapred-site.xml (you may need to rename it first; the default file name is mapred-site.xml.template — see the rename command after the configuration below). Then modify the configuration as follows:

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>master:10020</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>master:19888</value>
        </property>
</configuration>
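The rename mentioned above can be done like this, assuming the default template name shipped with Hadoop 2.x:

cd /usr/local/hadoop/etc/hadoop
mv mapred-site.xml.template mapred-site.xml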

 

5. File yarn-site.xml:

<configuration>
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>master</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

 

6. After configuring, copy the /usr/local/hadoop folder on the Master to every node.

First compress the configured Hadoop on the Master:

cd /usr/local
tar -zcf ~/hadoop.master.tar.gz ./hadoop    # The archive is created in the home directory

 

Then use the scp command to send the compressed archive to slave1 and the other nodes:

scp ~/hadoop.master.tar.gz slave1:/home/hadoop    # Repeat for slave2 and any other Slave nodes

 

At this point the Hadoop setup on the Master node is complete. Next, on each Slave node, locate the copied archive and extract it:

sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local
sudo chown -R hadoop /usr/local/hadoop      # Match the ownership used on the Master (adjust to the user that runs Hadoop)

 

At this point the Slave nodes are done as well. Now start Hadoop.

4, Hadoop startup

1. For the first start, you need to format the NameNode on the Master node:

hdfs namenode -format       # Initialization is required for the first run, not later

 

2. Then Hadoop can be started; this must be done on the Master node:

start-all.sh
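start-all.sh is a convenience wrapper; if you prefer, you can start HDFS and YARN separately, and also start the JobHistory server configured in mapred-site.xml (standard Hadoop 2.x scripts):

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver    # serves the job history web UI on master:19888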

 

3. The jps command lets you view the processes started on each node. If everything is correct, you should see the NameNode, ResourceManager, and SecondaryNameNode processes on the Master node.

4. On the Slave nodes you should see the DataNode and NodeManager processes.
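For example, jps output might look roughly like this (the leading numbers are process IDs and will differ on your machines; a JobHistoryServer entry appears on the Master only if you started it):

# On master
2786 NameNode
2977 SecondaryNameNode
3126 ResourceManager
3485 Jps

# On slave1
2513 DataNode
2640 NodeManager
2875 Jps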

5. Validate cluster

The absence of any of these processes indicates an error. In addition, run hdfs dfsadmin -report on the Master node to check whether the DataNodes started normally; if "Live datanodes" is not 0, the cluster has started successfully.

You can also view the status of the DataNodes and the NameNode through the web UI at http://master:50070/ . If something is wrong, check the startup logs to find the cause.

At this point the Hadoop cluster is complete, and you can run some distributed examples on it, as sketched below.
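As a quick smoke test, the grep example bundled with Hadoop can be run on the new cluster. This is a minimal sketch that assumes the /usr/local/hadoop layout above and uses /user/hadoop as the HDFS working directory:

hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml /user/hadoop/input
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/input /user/hadoop/output 'dfs[a-z.]+'
hdfs dfs -cat /user/hadoop/output/*      # shows the configuration keys matching the pattern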

Topics: Hadoop ssh xml sudo