Hadoop basic configuration and pseudo-distributed setup

Posted by FFEMTcJ on Tue, 26 Oct 2021 16:07:53 +0200



1. Basic environment preparation

  1. Operating system preparation

    • Install an Ubuntu virtual machine

      Mirror address: https://mirrors.nju.edu.cn/ubuntu-releases/18.04/ubuntu-18.04.6-desktop-amd64.iso

    • If using Docker

      docker pull ubuntu:18.04
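
      A running container is also needed to work in. A minimal sketch (the container name and the port mapping are illustrative; publishing port 9870 makes the NameNode web UI used later reachable from the host):

      docker run -it --name hadoop-pseudo -p 9870:9870 ubuntu:18.04 /bin/bash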

  2. Change the apt sources

    • Back up the original source list

      • sudo cp /etc/apt/sources.list /etc/apt/sources.list_backup
        
    • Open the source list (the gedit command opens Ubuntu's graphical editor; under Docker replace it with vi, or use the editor-free alternative sketched at the end of this step)

      • sudo gedit /etc/apt/sources.list
        
    • Add the following at the beginning of the file

      • # Aliyun sources
        deb http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
        deb http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse
        deb http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse
        deb http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse
        deb http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse
        deb-src http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
        deb-src http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse
        deb-src http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse
        deb-src http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse
        deb-src http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse
        
    • Refresh the package list

      • sudo apt-get update
        sudo apt-get upgrade
        sudo apt-get install build-essential
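
    • Editor-free alternative for Docker

      In a container without gedit, the same change can be made non-interactively before the refresh step above. A sketch, assuming the stock sources.list still points at archive.ubuntu.com and security.ubuntu.com, as in the official ubuntu:18.04 image (drop sudo if you are already root):

      • sudo sed -i 's,http://archive.ubuntu.com/ubuntu,http://mirrors.aliyun.com/ubuntu,g' /etc/apt/sources.list
        sudo sed -i 's,http://security.ubuntu.com/ubuntu,http://mirrors.aliyun.com/ubuntu,g' /etc/apt/sources.list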
        
  3. Install java

    • Install OpenJDK directly from the package manager

      • sudo apt-get update
        sudo apt-get install openjdk-8-jdk
        
    • Verify java

      • java -version
        
  4. Install other software

    • Install ssh

      • sudo apt-get install ssh
        
      • Generate ssh key

        ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
        

        Grant the local machine passwordless ssh access to itself

        ssh-copy-id localhost
        

        After pressing Enter, you will be asked whether to add localhost to the list of known hosts. Type "yes" and press Enter

        Then enter your local user's password

      • Test ssh access

        ssh localhost
        

    • Install rsync

      • sudo apt-get install rsync
        
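    • Note for Docker users

      In a container the ssh service is usually not started automatically, so start it before running ssh-copy-id and ssh localhost above (a sketch; drop sudo if you are already root):

      • sudo service ssh start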

2. Single-machine Hadoop installation and configuration

  1. hadoop installation

    • Download the Hadoop tarball

      • cd ~
        wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
        
    • Extract it in place

      • hadoop@ubuntu:~$ tar -zxvf hadoop-3.2.2.tar.gz
        
    • Check that the extraction succeeded

      • hadoop@ubuntu:~$ ls
        Desktop    Downloads         hadoop-3.2.2         Music     Public     Videos
        Documents  examples.desktop  hadoop-3.2.2.tar.gz  Pictures  Templates
        
      • Confirm that a hadoop-3.2.2 folder is now present

    • Configure JAVA_HOME environment variable

      • gedit ~/hadoop-3.2.2/etc/hadoop/hadoop-env.sh
        
      • Find the line "# export JAVA_HOME=" in the editor

      Change it to export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 (remove the leading #), then save and exit

      • PS: the path above can be found by running the following command and reading its output

        update-alternatives --config java
        
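      • Alternatively, the export line can be appended without opening an editor (a sketch, assuming OpenJDK 8 is installed at the default Ubuntu path shown above):

        echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/hadoop-3.2.2/etc/hadoop/hadoop-env.sh
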
  2. Hadoop configuration - single-machine pseudo-distributed

    A word of caution before you start: a single stray space, punctuation mark, or wrong letter case in these files is often enough to make the cluster fail later, so double-check everything you type!

    • core-site.xml

      • open editor

        gedit ~/hadoop-3.2.2/etc/hadoop/core-site.xml
        
      • Insert the following configuration (the # comments are explanations only and are not valid XML, so do not copy them into the file)

        <configuration>                              # The configuration root element
            <property>                               # One property
                <name>fs.defaultFS</name>            # The property's name
                <value>hdfs://localhost:9000</value> # The property's value, here the hdfs address
            </property>                              # The closing tag must contain the slash
        </configuration>
        

    • hdfs-site.xml

      • gedit ~/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
        

        Write the following

        <configuration>
            <property>
                <name>dfs.replication</name>   # Number of block replicas
                <value>1</value>               # A single machine keeps only one replica
            </property>
        </configuration>
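
    • Optional: check that the XML is well formed

      Given the warning above about stray characters, it can help to validate both files before starting anything. A sketch, assuming you are willing to install libxml2-utils for the xmllint tool (no output means the file is well formed):

      • sudo apt-get install libxml2-utils
        xmllint --noout ~/hadoop-3.2.2/etc/hadoop/core-site.xml
        xmllint --noout ~/hadoop-3.2.2/etc/hadoop/hdfs-site.xml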
        

3. Hadoop operation

  1. Start hadoop and run tasks

    • Format NameNode and start

      • format

        ~/hadoop-3.2.2/bin/hdfs namenode -format
        
      • Start HDFS (start-dfs.sh launches the NameNode, DataNode and SecondaryNameNode)

        ~/hadoop-3.2.2/sbin/start-dfs.sh
        
      • View cluster status

        hadoop@ubuntu:~$ jps
        10960 NameNode
        11529 Jps
        11115 DataNode
        11372 SecondaryNameNode
        
      • Browse the web interface of NameNode

        Open http://localhost:9870 directly in the virtual machine's browser

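      • Check from the command line

        If no browser is available (for example inside a headless Docker container), a rough equivalent is to ask the NameNode for a cluster report (a sketch):

        ~/hadoop-3.2.2/bin/hdfs dfsadmin -report
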
    • Perform some hdfs operations (these will come up constantly later, so learn them by heart)

      hdfs is a file system; working with it is much like using your computer's file manager, with operations such as creating directories and moving files (a few more commands are sketched after the listing below)

      • Create hdfs directory

        ~/hadoop-3.2.2/bin/hdfs dfs -mkdir /user     # -mkdir <path> creates a directory
        ~/hadoop-3.2.2/bin/hdfs dfs -mkdir /user/input
        
      • Copy input files to hdfs

        ~/hadoop-3.2.2/bin/hdfs dfs -put ~/hadoop-3.2.2/etc/hadoop/*.xml /user/input  # -put <local> <hdfs path> copies the local file(s) onto hdfs
        

        This copies every file with the .xml suffix from the local etc/hadoop/ directory into the /user/input directory of the hdfs file system

      • Check the hdfs file

        hadoop@ubuntu:~$ ~/hadoop-3.2.2/bin/hdfs dfs -ls /user/input   # -ls a lists all files in directory a
        Found 9 items
        -rw-r--r--   1 hadoop supergroup       9213 2021-10-26 06:54 /user/input/capacity-scheduler.xml
        -rw-r--r--   1 hadoop supergroup        867 2021-10-26 06:54 /user/input/core-site.xml
        -rw-r--r--   1 hadoop supergroup      11392 2021-10-26 06:54 /user/input/hadoop-policy.xml
        -rw-r--r--   1 hadoop supergroup        849 2021-10-26 06:54 /user/input/hdfs-site.xml
        -rw-r--r--   1 hadoop supergroup        620 2021-10-26 06:54 /user/input/httpfs-site.xml
        -rw-r--r--   1 hadoop supergroup       3518 2021-10-26 06:54 /user/input/kms-acls.xml
        -rw-r--r--   1 hadoop supergroup        682 2021-10-26 06:54 /user/input/kms-site.xml
        -rw-r--r--   1 hadoop supergroup        758 2021-10-26 06:54 /user/input/mapred-site.xml
        -rw-r--r--   1 hadoop supergroup        690 2021-10-26 06:54 /user/input/yarn-site.xml
        
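      • A few more everyday commands

        Other hdfs dfs operations follow the same pattern (a sketch; the paths here are only examples):

        ~/hadoop-3.2.2/bin/hdfs dfs -cat /user/input/core-site.xml   # print a file
        ~/hadoop-3.2.2/bin/hdfs dfs -get /user/input ~/input-copy    # copy an hdfs directory to the local disk
        ~/hadoop-3.2.2/bin/hdfs dfs -rm -r /user/input               # delete a directory and its contents
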
    • Run MapReduce

      • Run the sample that comes with hadoop

        ~/hadoop-3.2.2/bin/hadoop jar ~/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar grep /user/input /user/output 'dfs[a-z.]+'
        

        Breaking the command down:

        • ~/hadoop-3.2.2/bin/hadoop: invokes the hadoop program
        • jar: tells hadoop to run a jar package
        • ~/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar: the path of the jar package to execute
        • grep: the sub-program inside the jar package to run, related to the jar the way jar is related to hadoop (when you later write your own hadoop program and specify the main class in the jar, this argument can be omitted)
        • /user/input: the input path
        • /user/output: the output path. Both paths are on hdfs, and note that /user/output must not exist in advance (a clean-up sketch appears after the results below)
        • 'dfs[a-z.]+': a parameter of the grep example. What grep does is filter text, and this regular expression tells it which pattern to match.
      • View run results:

        • First copy the results from hdfs to the local file system

          ~/hadoop-3.2.2/bin/hdfs dfs -get /user/output ~/output
          
        • Read results

          cat ~/output/*    # cat is a command that displays text
          
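        • Or read the output directly on hdfs

          The results can also be printed straight from hdfs, and because /user/output must not exist when the job starts, remove it before re-running the example (a sketch):

          ~/hadoop-3.2.2/bin/hdfs dfs -cat /user/output/*
          ~/hadoop-3.2.2/bin/hdfs dfs -rm -r /user/output
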
    • Finally, shut down the hadoop cluster. Always shut it down before modifying the configuration

      ~/hadoop-3.2.2/sbin/stop-dfs.sh
      

Topics: Docker Hadoop Ubuntu