Big data Hadoop installation and configuration

Posted by jamesh on Sat, 12 Feb 2022 04:41:59 +0100

Big data, Spark, Hadoop, python


1, Hadoop pseudo-distributed configuration

1. Create Hadoop user:

sudo useradd -m hadoop -s /bin/bash     # Create hadoop user
sudo passwd hadoop          # Change Password
sudo adduser hadoop sudo    # Add administrator privileges

Log out and log back in as the hadoop user, then update apt and install vim:

sudo apt-get update         # Update apt
sudo apt-get install vim    # Install vim

Install SSH and configure passwordless login:

sudo apt-get install openssh-server # Ubuntu includes the SSH client by default; only the SSH server needs to be installed
cd ~
mkdir .ssh                  # The folder may already exist; that is fine
cd ~/.ssh/
ssh-keygen -t rsa           # Press Enter at every prompt
cat id_rsa.pub >> authorized_keys  # Add the key to the authorized list

Now ssh localhost logs in without a password. (Without the key configured above, you would have to enter the hadoop user's password each time.)
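
A quick check that the key setup works:

ssh localhost    # Should log in without prompting for a password
exit             # Return to the original shell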

Note: both cluster and single-node modes require SSH login (similar to remote login: you log in to a Linux host and run commands on it). Ubuntu installs the SSH client by default, so only the SSH server needs to be added.

2. Configure Java environment

Hadoop needs a Java environment. For JDK configuration, see another blog post: https://blog.csdn.net/Acegem/article/details/120852985?spm=1001.2014.3001.5502.
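
For reference, a minimal ~/.bashrc setup might look like the following sketch; the JDK path is an assumption, so adjust it to wherever your JDK is actually installed:

export JAVA_HOME=/usr/lib/jvm/default-java   # Assumed JDK location; adjust to your system
export PATH=$JAVA_HOME/bin:$PATH

source ~/.bashrc    # Apply the changes
java -version       # Should print the installed JDK version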

3. Install Hadoop

Hadoop can be downloaded from the official site https://dlcdn.apache.org/hadoop/common/ ; here we use hadoop-3.1.3, whose download link is https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz . A domestic mirror such as http://mirrors.cnnic.cn/apache/hadoop/common/ also works.
Unpack the downloaded hadoop-3.1.3 package into /usr/local/:

sudo tar -zxvf hadoop-3.1.3.tar.gz -C /usr/local  # Extract into /usr/local
cd /usr/local/
sudo mv ./hadoop-3.1.3/ ./hadoop     # Rename the folder to hadoop
sudo chown -R hadoop ./hadoop        # Give ownership to the hadoop user

Test:

cd /usr/local/hadoop
./bin/hadoop version

4. Pseudo distributed configuration

Hadoop's configuration files live in /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying two of them: core-site.xml and hdfs-site.xml.
Modify core-site.xml:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Modify hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

5. Start Hadoop

cd /usr/local/hadoop
bin/hdfs namenode -format       # Format the NameNode
sbin/start-dfs.sh               # Start the HDFS daemons
jps                             # Check whether startup succeeded

If startup succeeds, jps lists the following processes: NameNode, DataNode, and SecondaryNameNode.
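
As an additional check (assuming the Hadoop 3.x default port), the NameNode web interface should be reachable on port 9870:

curl -s http://localhost:9870 | head -n 5   # Or open http://localhost:9870 in a browser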
Notes:
Error 1: if this step reports "Error: JAVA_HOME is not set and could not be found.", the JAVA_HOME environment variable was not set properly.
If you already set JAVA_HOME in ~/.bashrc as in the earlier steps and still get the error, open the configuration file /usr/local/hadoop/etc/hadoop/hadoop-env.sh, find the line "export JAVA_HOME=${JAVA_HOME}", change it to the concrete Java installation path, for example "export JAVA_HOME=/usr/lib/jvm/default-java", and then start Hadoop again.
Error 2: if starting Hadoop prints many "ssh: Could not resolve hostname xxx" messages, this is not an SSH problem. It can be fixed by setting Hadoop environment variables: add the following two lines to ~/.bashrc (the procedure is the same as for JAVA_HOME, where HADOOP_HOME is the Hadoop installation directory):

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
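
After editing, apply the new variables to the current shell:

source ~/.bashrc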

Test:
Run a MapReduce example (the grep example from the bundled examples jar):

bin/hdfs dfs -mkdir -p /user/hadoop     # Create HDFS directory
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put etc/hadoop/*.xml input  # Take the configuration file as input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
bin/hdfs dfs -cat output/*                # View output

You can also retrieve the running results locally:

rm -r ./output    # Delete the local output folder (if any) first
./bin/hdfs dfs -get output ./output     # Copy the output folder on HDFS to the local computer
cat ./output/*

When Hadoop runs a program, the output directory must not already exist; otherwise it reports "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". To run the example again, first delete the output folder:

./bin/hdfs dfs -rm -r output    # Delete output folder

2, Hadoop cluster configuration

Suppose there are two machines:

Master  192.168.1.121
Slave1  192.168.1.122

Hadoop cluster configuration process:

Select one machine as the Master and configure network mapping on all hosts
Create the hadoop user, install the SSH server, and install the Java environment on the Master host
Install Hadoop on the Master host and complete its configuration
Create the hadoop user, install the SSH server, and install the Java environment on the other hosts
Copy the Hadoop directory from the Master host to the other hosts
Start and use Hadoop

Create the hadoop user, install the SSH server, and install the Java environment on all hosts:

sudo useradd -m hadoop -s /bin/bash     # Create hadoop user
sudo passwd hadoop          # Change Password
sudo adduser hadoop sudo    # Add administrator privileges
# Log out and log in using the Hadoop user
sudo apt-get update         # Update apt
sudo apt-get install vim    # Install vim
sudo apt-get install openssh-server  # Install ssh

Then configure the Java JDK environment, as in the pseudo-distributed setup.
Configure network mapping on all hosts:

sudo vim /etc/hostname      # Modify host name
sudo vim /etc/hosts         # Modify the mapping relationship between host and IP
sudo reboot                 # Restart to make the network configuration effective
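
With the two example machines above, /etc/hosts on every node would contain entries like these (a sketch using the example IPs; adjust to your own network):

192.168.1.121   Master
192.168.1.122   Slave1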

On the Master host:

cd ~/.ssh
ssh-keygen -t rsa              # Press Enter at every prompt
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # Passwordless login to the Master itself
scp ~/.ssh/id_rsa.pub hadoop@Slave1:/home/hadoop/ # Transfer the public key to Slave1

Then execute on the Slave1 node:

cd ~
mkdir .ssh                                 # Skip if the folder already exists
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys # Add the Master's public key
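
Back on the Master, logging in to Slave1 should now work without a password:

ssh Slave1    # Run on the Master; should not prompt for a password
exit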

Configure the Hadoop cluster on the Master node (located in / usr/local/hadoop/etc/hadoop):

File workers (named slaves in Hadoop 2.x; in Hadoop 3.1.3 it is workers):

Delete the original localhost and write the host names of all Slave nodes, one per line.
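
For the example cluster above, the file contains a single line:

Slave1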

File core-site.xml:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Abase for other temporary directories.</description>
</property>

File hdfs-site.xml:

<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:50090</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

File mapred-site.xml (in Hadoop 2.x you first need to run cp mapred-site.xml.template mapred-site.xml; in Hadoop 3.x the file already exists):

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

File yarn-site.xml:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>Master</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

After configuration, copy the Hadoop folder from the Master host to each node:

cd /usr/local
rm -r ./hadoop/tmp  # Delete Hadoop temporary files
sudo tar -zcf ./hadoop.tar.gz ./hadoop
scp ./hadoop.tar.gz Slave1:/home/hadoop

Execute on Slave1:

sudo tar -zxf ~/hadoop.tar.gz -C /usr/local
sudo chown -R hadoop:hadoop /usr/local/hadoop

Finally, you can start hadoop on the Master host:

cd /usr/local/hadoop/
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
jps             # Judge whether the startup is successful

If startup succeeds, the Master node runs the NameNode, SecondaryNameNode, and ResourceManager processes, and the Slave node runs the DataNode and NodeManager processes.
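
You can also verify from the Master that the DataNode on Slave1 registered with the NameNode:

bin/hdfs dfsadmin -report    # Run in /usr/local/hadoop; should report one live DataNode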

Execute the grep example on the Master host:

bin/hdfs dfs -mkdir -p /user/hadoop
bin/hdfs dfs -put etc/hadoop input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
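
When you are done, the cluster can be shut down on the Master; a minimal sketch (run inside /usr/local/hadoop):

sbin/stop-yarn.sh    # Stop the YARN daemons
sbin/stop-dfs.sh     # Stop the HDFS daemons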

Topics: Big Data Hadoop hdfs