Hadoop storage and analysis

Posted by ZephyrWest on Mon, 03 Jan 2022 12:59:49 +0100

Apache Hadoop

Background

With the growth of the information Internet and the Internet of Things, the interconnection of everything has become inevitable. This has driven the evolution from monolithic architectures to highly concurrent distributed architectures, and data storage has likewise evolved from stand-alone storage to distributed storage.

  • Java Web: to cope with high concurrency and distribution, stacks such as LNMP (Linux, Nginx, MySQL, PHP) were adopted.

  • Massive data storage | data analysis: storage scheme (HDFS), computing schemes (MapReduce, Storm, Spark, Flink)

Big data background

Distributed: services that communicate across machines and across processes are said to be distributed.

  • Storage
    • Stand-alone storage: limited capacity, poor scalability, and weak disaster recovery
    • Distributed storage: a storage cluster reads and writes massive data in parallel, improving system throughput. Distributed file storage schemes for traditional business scenarios include file storage and block storage.

  • Computing (analysis)
    • Stand-alone analysis/computing: slow, constrained by the memory, CPU and network of a single machine
    • Distributed computing: computing tasks are handed to a dedicated computing cluster, breaking the bottleneck of a single machine, running in parallel and pooling the power of many CPUs. This makes it possible to analyze the data effectively within a bounded time.

Hadoop appears

To solve the problems brought by massive data, the prototype of Hadoop was built inside the early Nutch project, drawing on Google's papers "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters". Nutch initially contained two modules, NDFS (Nutch Distributed File System) and MapReduce, which solved the project's storage and computation problems respectively. These modules were later split out of Nutch into an independent project, which was eventually renamed Hadoop.

  • HDFS: Hadoop Distributed File System

  • MapReduce: map (distribute) -> reduce (summarize). MapReduce is Hadoop's general-purpose distributed parallel computing framework.

Doug Cutting, known as the father of Hadoop, is a chairman of the Apache Software Foundation and the creator of Lucene, Nutch, Hadoop and other projects. Hadoop started as part of Nutch, a subproject of Apache Lucene. Lucene is an open-source full-text search engine toolkit; Nutch, built on Lucene, adds web crawling and parsing and can be used to build a search engine. To be usable in practice, however, such an engine must respond very quickly and be able to analyze and process hundreds of millions of web pages in a short time, which requires distributed task processing, failure recovery and load balancing. Doug Cutting later adopted the techniques from Google's two papers, "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters", and the resulting system was named Hadoop.

Download address: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz

Environment construction

Environmental preparation

  • Install a virtual machine with 64-bit CentOS 7

  • Install JDK and configure environment variables

    ① Install jdk

[root@CentOS ~]# rpm -ivh  jdk-8u191-linux-x64.rpm
 warning: jdk-8u191-linux-x64.rpm: Header V3 RSA/SHA256 Signature, key ID ec551f03: NOKEY
 Preparing...                          ################################# [100%]
Updating / installing...
   1:jdk1.8-2000:1.8.0_191-fcs        ################################# [100%]
Unpacking JAR files...
        tools.jar...
        plugin.jar...
        javaws.jar...
        deploy.jar...
        rt.jar...
        jsse.jar...
        charsets.jar...
        localedata.jar...

By default, the JDK is installed under /usr/java.

② Configure environment variables

[root@CentOS ~]# vi .bashrc 
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
# Reload variables using the source command
[root@CentOS ~]# source ~/.bashrc
  • Configure host name
[root@CentOS ~]# vi /etc/hostname
CentOS

reboot is required after modifying the host name

  • Configure the mapping relationship between host name and IP

    ① View ip

[root@CentOS ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:3c:e6:31 brd ff:ff:ff:ff:ff:ff
    inet 192.168.73.130/24 brd 192.168.73.255 scope global noprefixroute dynamic ens33
       valid_lft 1427sec preferred_lft 1427sec
    inet6 fe80::fffe:2129:b1f8:2c9b/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

You can see that the address of the network card ens33 is 192.168.73.130. Now map the host name to this IP in /etc/hosts:

[root@CentOS ~]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.73.130 CentOS
  • Configure SSH password free authentication

    ① Generate the public-private key pair required for authentication

[root@CentOS ~]# ssh-keygen -t rsa -P ''
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:KIx5N+++qLzziq6LaCBT2g5Dqcq0j+TxGV3jJXs7cwc root@CentOS
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|                 |
|  .              |
| o.+   .         |
|o+o + ++S.       |
|O....ooo=  E     |
|*B.. . o..  .    |
|*+=o+  o.o.. .   |
|**++=*o.+o+ .    |
+----[SHA256]-----+

② Add the public key to the trust list to enable password-free authentication

[root@CentOS ~]# ssh-copy-id CentOS
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'centos (192.168.73.130)' can't be established.
ECDSA key fingerprint is SHA256:WnqQLGCjyJjgb9IMEUUhz1RLkpxvZJxzEZjtol7iLac.
ECDSA key fingerprint is MD5:45:05:12:4c:d6:1b:0c:1a:fc:58:00:ec:12:7e:c1:3d.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@centos's password:
Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'CentOS'"
and check to make sure that only the key(s) you wanted were added.

③ Test whether password-free SSH login works

[root@CentOS ~]# ssh root@CentOS
Last failed login: Fri Sep 25 14:19:39 CST 2020 from centos on ssh:notty
There was 1 failed login attempt since the last successful login.
Last login: Fri Sep 25 11:58:52 2020 from 192.168.73.1

If you do not need to enter a password, SSH password free authentication is successful!

  • Turn off firewall
[root@CentOS ~]# systemctl stop firewalld.service # Stop the service
[root@CentOS ~]# systemctl disable firewalld.service # Disable start on boot
[root@CentOS ~]# firewall-cmd --state # Check firewall status
not running

HADOOP installation

  • Unzip and install Hadoop (download: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz)
[root@CentOS ~]# tar -zxf hadoop-2.9.2.tar.gz -C /usr/
  • Configure HADOOP_HOME environment variable
[root@CentOS ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2/
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
# Reload HADOOP_HOME environment variable
[root@CentOS ~]# source .bashrc
  • Configure the Hadoop configuration files: etc/hadoop/{core-site.xml|hdfs-site.xml|slaves}

① Configure core-site.xml

[root@CentOS ~]# cd /usr/hadoop-2.9.2/
[root@CentOS hadoop-2.9.2]# vi etc/hadoop/core-site.xml
<!--NameNode Access portal-->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://CentOS:9000</value>
</property>
<!--hdfs Working base directory-->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>

② Configure hdfs-site.xml

[root@CentOS ~]# cd /usr/hadoop-2.9.2/
[root@CentOS hadoop-2.9.2]# vi etc/hadoop/hdfs-site.xml
<!--Block replication factor-->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!--Configure the physical host of the Secondary NameNode-->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>CentOS:50090</value>
</property>

③ Configure the slaves text file

vi etc/hadoop/slaves

CentOS
  • Start HDFS system

① The first time you start HDFS, you must format the file system to prepare it for later startups. Note that this is only needed on the very first startup; skip this step whenever you start HDFS again in the future!

[root@CentOS ~]# hdfs namenode -format
...
20/09/25 14:31:23 INFO common.Storage: Storage directory /usr/hadoop-2.9.2/hadoop-root/dfs/name has been successfully formatted.
...

Formatting creates the image file that the NameNode service loads when it starts.

② Start HDFS service

The startup scripts are located in the sbin directory. Because sbin has already been added to PATH, you can start HDFS directly with the start-dfs.sh script; to shut the HDFS system down, use stop-dfs.sh.

[root@CentOS ~]# start-dfs.sh
Starting namenodes on [CentOS]
CentOS: starting namenode, logging to /usr/hadoop-2.9.2/logs/hadoop-root-namenode-CentOS.out
CentOS: starting datanode, logging to /usr/hadoop-2.9.2/logs/hadoop-root-datanode-CentOS.out
Starting secondary namenodes [CentOS]
CentOS: starting secondarynamenode, logging to /usr/hadoop-2.9.2/logs/hadoop-root-secondarynamenode-CentOS.out

After a successful startup, you can list the Java processes with the jps tool shipped with the JDK. Normally you should see three services: DataNode, NameNode and SecondaryNameNode.

[root@CentOS ~]# jps
3457 DataNode
3691 SecondaryNameNode
3325 NameNode
4237 Jps
[root@CentOS ~]#

Finally, you can open the web page embedded in the NameNode service to view the running status of HDFS. By default it listens on port 50070, i.e. http://CentOS:50070.

HDFS architecture

brief introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but also differs from them in important ways. HDFS is highly fault-tolerant and is designed to be deployed on low-cost machines. It provides high-throughput access to application data and is well suited to applications with large data sets. HDFS relaxes some POSIX constraints in order to allow streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch search engine project and is now part of the Apache Hadoop Core project.

Its high fault tolerance, low hardware requirements and high-throughput access make it a good fit for applications with very large data sets.

Architecture

NameNode & DataNodes

HDFS has a master/slave architecture. An HDFS cluster contains one NameNode, the master service, which manages the file system namespace and handles clients' access to files. In addition there are a number of DataNodes, each responsible for the blocks stored on the host it runs on. HDFS exposes a file system namespace and lets user data be stored in files. Internally, HDFS splits each file into 1 to N blocks, which are stored on a set of DataNodes. The NameNode performs the namespace operations such as opening, closing and renaming files and directories, and determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from clients, and also create, delete and replicate blocks on instruction from the NameNode.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

NameNode: keeps the cluster's metadata in memory (the file namespace, i.e. the directory tree, and the block-to-DataNode mapping)

DataNode: it is responsible for responding to the client's read-write request for the data block and reporting its own status information to the NameNode.

Block: the unit into which HDFS splits files, 128MB by default. A file has at most one block smaller than 128MB (the last one)

Replication factor: to avoid losing blocks when a DataNode goes down, HDFS keeps multiple replicas of each block; the default is 3

HDFS is not good at storing small files

Because the NameNode keeps all metadata in the memory of a single machine, large numbers of small files consume a disproportionate amount of metadata and waste NameNode memory.

| Case | NameNode | DataNode |
| --- | --- | --- |
| 1 file of 128MB | metadata for 1 block mapping | 128MB disk storage × (replication factor) |
| 1000 files totalling 128MB | metadata for 1000 block mappings | 128MB disk storage × (replication factor) |
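
The concepts above can be checked against a real file with a short sketch like the following (the class name BlockInfo and the example path are illustrative; it assumes core-site.xml and hdfs-site.xml have been copied onto the classpath, as in the Java API section later in this article, and that the file already exists in HDFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.addResource("core-site.xml");
        conf.addResource("hdfs-site.xml");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/dir1/hadoop-2.9.2.tar.gz");  // any existing HDFS file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size : " + status.getBlockSize());   // 134217728 (128MB) by default
        System.out.println("replication: " + status.getReplication()); // 1 in this single-node setup

        // One BlockLocation per block, listing the DataNodes that hold its replicas
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(loc.getOffset() + "+" + loc.getLength()
                    + " on " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}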

HDFS rack awareness

Distributed clusters usually contain a large number of machines. Due to the limitations of rack slots and switch network ports, large distributed clusters usually span several racks, and machines in multiple racks form a distributed cluster. The network speed between machines in racks is usually higher than that between machines across racks, and the network communication between machines in racks is usually limited by the network bandwidth between upper switches.

In the design of Hadoop, the security and efficiency of data are considered. By default, three copies of data files are stored on HDFS. The storage strategy is:

The first block copy is placed in the data node where the client is located (if the client is not in the cluster, a suitable data node is randomly selected from the whole cluster to store it).

The second replica is placed on other data nodes in the same rack as the node where the first replica is located

The third replica is placed on nodes in different racks

In this way, if local data is damaged, a node can fetch the data from a neighbouring node in the same rack, which is faster than fetching across racks; and if the whole rack's network fails, the data can still be found on nodes in other racks. To reduce overall bandwidth consumption and read latency, HDFS tries to let a reader read the replica closest to it: if there is a replica in the same rack as the reader, that replica is read, and if the HDFS cluster spans multiple data centers, the client prefers a replica in the local data center.

reference resources: https://www.cnblogs.com/zwgblog/p/7096875.html

SecondaryNameNode & NameNode

![NameNode and Secondary NameNode checkpoint process](https://img-blog.csdnimg.cn/62f5b900e0ae46e8b2ad2f1f3a44fe59.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQyMDc0OTQ5,size_16,color_FFFFFF,t_70)

fsimage: a binary file stored on the disk of the host running the NameNode service; it records the metadata.

edits: a binary file stored on the disk of the host running the NameNode service; it records modifications to the metadata.

When the NameNode service starts, it loads the fsimage and edits files, merges them to obtain the latest metadata, and writes out a new fsimage with an empty edits file. Once the service is running, the fsimage is not updated again; every change is instead appended to edits. After long uptime this makes the edits file very large and a NameNode restart very slow. Hadoop HDFS therefore introduces the Secondary NameNode to help the NameNode merge the metadata while it is running.

The NameNode stores modifications to the file system as a log appended to a native file system file, edits. When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file. It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file. Since NameNode merges fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster. Another side effect of a larger edits file is that next restart of NameNode takes longer.

The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode.

The start of the checkpoint process on the secondary NameNode is controlled by two configuration parameters.

  • dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two consecutive checkpoints, and
  • dfs.namenode.checkpoint.txns, set to 1 million by default, defines the number of uncheckpointed transactions on the NameNode which will force an urgent checkpoint, even if the checkpoint period has not been reached.

The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode's directory. So that the check pointed image is always ready to be read by the primary NameNode if necessary.

NameNode startup process

SafeMode of NameNode

During startup the NameNode enters a special state called safe mode. While HDFS is in safe mode, data blocks are not replicated. The NameNode receives Heartbeat and Blockreport messages from the DataNodes; a DataNode's block report lists all the blocks held on that physical host. At startup the NameNode checks whether each reported block meets the configured minimum number of replicas (default 1); a block is considered safe only once it reaches that minimum. After the proportion of safe blocks reaches 99.9%, the NameNode waits a further 30 seconds and then automatically exits safe mode. It then checks which blocks have fewer replicas than configured and issues replication instructions to copy those blocks.

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.

Note: HDFS enters and exits safe mode automatically at startup. In production, HDFS is sometimes put into safe mode manually in order to carry out maintenance on the servers.

[root@CentOS ~]# hdfs dfsadmin -safemode get
Safe mode is OFF
[root@CentOS ~]# hdfs dfsadmin -safemode enter
Safe mode is ON
[root@CentOS ~]# hdfs dfs -put hadoop-2.9.2.tar.gz /
put: Cannot create file/hadoop-2.9.2.tar.gz._COPYING_. Name node is in safe mode.
[root@CentOS ~]# hdfs dfsadmin -safemode leave
Safe mode is OFF
[root@CentOS ~]# hdfs dfs -put hadoop-2.9.2.tar.gz /

SSH password free authentication

SSH is an application-layer security protocol designed for remote login; it provides security for sessions and other network services and is more reliable than its predecessors. Using SSH effectively prevents information leakage during remote administration. Two authentication methods are provided:

  • Password-based authentication: a remote host could impersonate the target host and intercept the user's credentials.
  • Key-based authentication: what is authenticated is the identity of the machine itself (its key pair).

![SSH key-based authentication flow](https://img-blog.csdnimg.cn/2237283e1aad45549dcfc9b9bfacacbf.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQyMDc0OTQ5,size_16,color_FFFFFF,t_70)

① To generate public-private key pairs, RSA or DSA algorithm can be selected

[root@CentOS ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:qWX5zumy1JS1f1uxPb3Gr+5e8F0REVueJew/WYrlxwc root@CentOS
The key's randomart image is:
+---[RSA 2048]----+
|             ..+=|
|              .o*|
|            .. +.|
|         o o .E o|
|        S o .+.*+|
|       + +  ..o=%|
|      . . o   o+@|
|       ..o .   ==|
|        .+=  +*+o|
+----[SHA256]-----+

By default, id_rsa (the private key) and id_rsa.pub (the public key) are generated in the ~/.ssh directory.

② Add the local public key to the trusted-key list of the target host

[root@CentOS ~]# ssh-copy-id root@CentOS
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'centos (192.168.73.130)' can't be established.
ECDSA key fingerprint is SHA256:WnqQLGCjyJjgb9IMEUUhz1RLkpxvZJxzEZjtol7iLac.
ECDSA key fingerprint is MD5:45:05:12:4c:d6:1b:0c:1a:fc:58:00:ec:12:7e:c1:3d.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@centos's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@CentOS'"
and check to make sure that only the key(s) you wanted were added.

By default, the local public key is appended to the ~/.ssh/authorized_keys file on the remote target host.

Trash recycle bin

To avoid losing data through accidental deletion, HDFS can be configured with a trash (garbage collection) feature when the cluster is set up. With it enabled, deleting a file does not remove it immediately; the file is merely moved into a trash directory and is only deleted after a configured interval has passed. A user who deleted a file by mistake can therefore restore it by moving it out of the trash before it expires.

  • To enable the trash feature, add the following configuration to core-site.xml and restart HDFS
<!--Trash retention: 5 minutes-->
<property>
  <name>fs.trash.interval</name>
  <value>5</value>
</property>
[root@CentOS hadoop-2.9.2]# hdfs dfs -rm -r -f /jdk-8u191-linux-x64.rpm
20/09/25 20:09:24 INFO fs.TrashPolicyDefault: Moved: 'hdfs://CentOS:9000/jdk-8u191-linux-x64.rpm' to trash at: hdfs://CentOS:9000/user/root/.Trash/Current/jdk-8u191-linux-x64.rpm

directory structure

[root@CentOS ~]# tree -L 1 /usr/hadoop-2.9.2/
/usr/hadoop-2.9.2/
├── bin  # System script, hdfs, hadoop, yarn
├── etc  # Configuration directory xml, text file
├── include # Some C header files need no attention
├── lib  # Third party native C implementation
├── libexec # When hadoop runs, load the configured script
├── LICENSE.txt
├── logs # System operation log directory, troubleshooting!
├── NOTICE.txt
├── README.txt
├── sbin  # Admin scripts, usually used to start/stop services, e.g. start-dfs.sh, stop-dfs.sh
└── share # hadoop operation depends on jars and embedded webapp 

HDFS practice

HDFS Shell command (frequently used)

√ Print the Hadoop classpath

[root@CentOS ~]# hdfs classpath

√ format NameNode

[root@CentOS ~]# hdfs namenode -format

dfsadmin command

① Use -report with -live or -dead to view the status of the DataNodes in the cluster

[root@CentOS ~]# hdfs dfsadmin -report  -live 

② Use -safemode enter|leave|get etc. to operate safe mode

[root@CentOS ~]# hdfs dfsadmin -safemode get
Safe mode is OFF

③ View cluster network topology

[root@CentOS ~]# hdfs dfsadmin -printTopology
Rack: /default-rack
   192.168.73.130:50010 (CentOS)

For more information, please refer to: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfsadmin

Check the status of a directory

[root@CentOS ~]# hdfs fsck /

√ DFS command

[root@CentOS ~]# hdfs dfs -command options
Or, in the older form:
[root@CentOS ~]# hadoop fs -command options
-appendToFile

Append /root/anaconda-ks.cfg to /aa.log

[root@CentOS ~]# hdfs dfs -appendToFile /root/anaconda-ks.cfg /aa.log
[root@CentOS ~]# hdfs dfs -appendToFile /root/anaconda-ks.cfg /aa.log
-cat

view file contents

[root@CentOS ~]# hdfs dfs -cat /aa.log
 Equivalent to:
[root@CentOS ~]# hdfs dfs -cat hdfs://CentOS:9000/aa.log
-chmod

Modify file permissions

[root@CentOS ~]# hdfs dfs -chmod -R u+x  /aa.log
[root@CentOS ~]# hdfs dfs -chmod -R o+x  /aa.log
[root@CentOS ~]# hdfs dfs -chmod -R a+x  /aa.log
[root@CentOS ~]# hdfs dfs -chmod -R a-x  /aa.log
-copyFromLocal/-copyToLocal

copyFromLocal uploads a local file to HDFS; copyToLocal downloads a file from HDFS

[root@CentOS ~]# hdfs dfs -copyFromLocal jdk-8u191-linux-x64.rpm /
[root@CentOS ~]# rm -rf jdk-8u191-linux-x64.rpm
[root@CentOS ~]# hdfs dfs -copyToLocal /jdk-8u191-linux-x64.rpm /root/
[root@CentOS ~]# ls
anaconda-ks.cfg  hadoop-2.9.2.tar.gz  jdk-8u191-linux-x64.rpm
-moveToLocal/-moveFromLocal

moveToLocal downloads the file and then deletes it from HDFS; moveFromLocal uploads the file and then deletes the local copy

[root@CentOS ~]# hdfs dfs -moveFromLocal jdk-8u191-linux-x64.rpm /dir1
[root@CentOS ~]# ls
anaconda-ks.cfg  hadoop-2.9.2.tar.gz
[root@CentOS ~]# hdfs dfs -moveToLocal /dir1/jdk-8u191-linux-x64.rpm /root
moveToLocal: Option '-moveToLocal' is not implemented yet.
-put/get

File upload / download

[root@CentOS ~]# hdfs dfs -get /dir1/jdk-8u191-linux-x64.rpm /root
[root@CentOS ~]# ls
anaconda-ks.cfg  hadoop-2.9.2.tar.gz  jdk-8u191-linux-x64.rpm
[root@CentOS ~]# hdfs dfs -put hadoop-2.9.2.tar.gz /dir1

For more commands, use

[root@CentOS ~]# hdfs dfs -help command

For example, touchz creates an empty file:

[root@CentOS ~]# hdfs dfs -touchz /dir1/Helloworld.java
[root@CentOS ~]# hdfs dfs -ls /dir1/
Found 5 items
-rw-r--r--   1 root supergroup          0 2020-09-25 23:47 /dir1/Helloworld.java
drwxr-xr-x   - root supergroup          0 2020-09-25 23:07 /dir1/d1
drwxr-xr-x   - root supergroup          0 2020-09-25 23:09 /dir1/d2
-rw-r--r--   1 root supergroup  366447449 2020-09-25 23:43 /dir1/hadoop-2.9.2.tar.gz
-rw-r--r--   1 root supergroup  176154027 2020-09-25 23:41 /dir1/jdk-8u191-linux-x64.rpm

Operating HDFS from the Java API (for understanding)

① Set up the development environment: create a Maven project (without selecting any archetype) and add the following dependency to the pom.xml file

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.9.2</version>
</dependency>

② Configure windows development environment (very important)

  • Unzip hadoop-2.9.2 into a directory on Windows, for example under C:/
  • Add a HADOOP_HOME environment variable to Windows pointing at that directory
  • Copy all the files from the bin directory of the hadoop-window-master.zip package into the %HADOOP_HOME%/bin directory, overwriting the existing files
  • Restart IDEA, otherwise the IDE will not pick up the newly configured HADOOP_HOME environment variable

③ It is recommended to copy the core-site.xml and hdfs-site.xml files into the project's resources directory

④ Configure the mapping relationship between host name and IP on Windows (omitted)

⑤ Create FileSystem and Configuration objects

public static FileSystem fs=null;
public static Configuration conf=null;
static {
    try {
        conf= new Configuration();
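        // load the copies of core-site.xml / hdfs-site.xml placed in the resources directory (step ③)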
        conf.addResource("core-site.xml");
        conf.addResource("hdfs-site.xml");
        fs=FileSystem.get(conf);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

File upload

Path src = new Path("file://xx path "");
Path dst = new Path("/");
fs.copyFromLocalFile(src,dst);
-
InputStream in=new FileInputStream("file://xx path "");
Path dst = new Path("/xx route");
OutputStream os=fs.create(dst);
IOUtils.copyBytes(in,os,1024,true);

File download

Path dst = new Path("file://xx path "");
Path src = new Path("/xx route");
fs.copyToLocalFile(src,dst);
- 
Path dst = new Path("/xx route");
InputStream in= fs.open(dst);
OutputStream os=new FileOutputStream("file://xx path "");
IOUtils.copyBytes(in,os,1024,true);    

Delete file

Path dst = new Path("/system");
fs.delete(dst,true);

recycle bin

Path dst = new Path("/aa.log");
Trash trash=new Trash(fs,conf);
trash.moveToTrash(dst);

All files

Path dst = new Path("/");
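// fs.listFiles(path, true): the boolean makes the listing recursive, descending into subdirectories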
RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(dst, true);
while (listFiles.hasNext()){
    LocatedFileStatus fileStatus = listFiles.next();
    System.out.println(fileStatus.getPath()+" "+fileStatus.isDirectory()+" "+fileStatus.getLen());
}

All files or folders

Path dst = new Path("/");
FileStatus[] fileStatuses = fs.listStatus(dst);
for (FileStatus fileStatus : fileStatuses) {
    System.out.println(fileStatus.getPath()+" "+fileStatus.isDirectory());
}

MapReduce

summary

MapReduce is Hadoop's parallel computing framework; it borrows ideas from functional programming and vector programming. Hadoop makes full use of the computing resources (CPU, memory, network and a little disk) of the hosts that run the storage nodes (DataNodes) to perform parallel computation. The MapReduce framework starts a Node Manager on every physical host that runs a DataNode to manage that host's computing resources; by default the resources are divided into 8 equal shares, each abstracted as a Container, which mainly serves as a unit of resource isolation. The framework also runs a Resource Manager on one host as the central service that manages the computing resources of the whole cluster.

Process analysis

When a user submits a computing task to the MapReduce framework, the framework splits it into a Map phase and a Reduce phase (the vector-programming idea of splitting a task into two stages). At submission time the framework starts one task manager per job, the MRAppMaster (this occupies one unit of computing resources), which manages the execution of the Map-phase and Reduce-phase tasks. During execution, computing resources are allocated in each phase according to that phase's parallelism (each allocated resource runs one YarnChild), and the MRAppMaster monitors and manages the tasks of both phases.

ResourceManager: be responsible for the unified scheduling of task resources, manage NodeManager resources, and start MRAppMaster

NodeManager: used to manage the computing resources on the local machine. By default, the computing resources on the local machine will be divided into 8 equal parts, and each equal part will be abstracted into a Container

MRAppMaster: for any task to be executed, there will be an MRAppMaster responsible for the execution and monitoring of YarnChild tasks.

YarnChild: it refers to the MapTask or ReduceTask executed specifically.

During task execution, the system starts the MRAppMaster and the YarnChild processes to carry out the task; once the task finishes, the MRAppMaster and YarnChild processes exit automatically.

Environment construction

① Configure the resource manager (YARN)

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/yarn-site.xml
<!--Configure the Shuffle service, the core mechanism of the MapReduce computing framework-->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!--Configure the target host where the resource manager is located-->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>CentOS</value>
</property>
<!--Turn off physical memory check-->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!--Turn off virtual memory check-->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

② Configure MapReduce computing framework

[root@CentOS ~]# mv /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml.template  /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml
[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml
<!--The resource manager used by the MapReduce framework-->
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>

③ Start computing service

[root@CentOS ~]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/hadoop-2.9.2/logs/yarn-root-resourcemanager-CentOS.out
CentOS: starting nodemanager, logging to /usr/hadoop-2.9.2/logs/yarn-root-nodemanager-CentOS.out
[root@CentOS ~]# jps
13078 SecondaryNameNode
12824 DataNode
1080 ResourceManager
12681 NameNode
1195 NodeManager
1262 Jps

④ You can access the ResourceManager's embedded WebUI page: http://CentOS:8088

MapReduce task development

background

Suppose we have the following log table and need to count the number of clicks in each category.

| Log level | Category | Click date |
| --- | --- | --- |
| INFO | /product/xxxx1 | 2020-09-28 10:10:00 |
| INFO | /product/xxxx2 | 2020-09-28 12:10:00 |
| INFO | /cart/xxxx2 | 2020-09-28 12:10:00 |
| INFO | /order/xxxx | 2020-09-28 12:10:00 |

If we can treat the above logs as a table in the database, we can use the following SQL to solve this problem:

select category,sum(1) from t_click group by category

If we use the MapReduce model described above, Map can play the role of group and Reduce can play the role of sum. The raw data has the following format:

INFO /product/xxx/1?name=zhangsan 2020-09-28 10:10:00
INFO /product/xxx/1?name=zhangsan 2020-09-28 10:10:00
INFO /cart/xxx/1?name=lisi 2020-09-28 10:10:00
INFO /order/xxx/1?name=zhangsan 2020-09-28 10:10:00
INFO /product/xxx/1?name=zhaoliu 2020-09-28 10:10:00
INFO /cart/xxx/1?name=win7 2020-09-28 10:10:00

Implementation

① Write Mapper logic

package com.baizhi.click;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * 1. First understand the data format and where it is stored; that determines how it is read and how the Mapper is written.
 *    TextInputFormat<LongWritable, Text>: reads from a file system (local or HDFS);
 *    the key is the byte offset, the value is one line of text.
 * 2. Then decide what you want: count the number of clicks per category,
 *    i.e. key = category, value = hit count.
 */
public class URLMapper extends Mapper<LongWritable, Text,Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
         String line=value.toString();
         String url=line.split(" ")[1];
         //Get category
         int endIndex=url.indexOf("/",1);
         String category=url.substring(0,endIndex);
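         // e.g. for the line "INFO /product/xxx/1?name=zhangsan 2020-09-28 10:10:00"
         // url is "/product/xxx/1?name=zhangsan" and category is "/product"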

         //Output the result of conversion
        context.write(new Text(category),new IntWritable(1));
    }

}

② Reducer logic

package com.baizhi.click;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Iterator;

/**
 * 1. Decide which Mapper output is being aggregated; that determines the Reducer's input Key and Value types.
 *
 * 2. Decide the format in which the final result is written out; for Key/Value output,
 *    only the toString() of the key and the value matters.
 *
 *    TextOutputFormat<key, value>: writes the results to a file system (local or HDFS).
 */
public class URLReducer extends Reducer<Text, IntWritable,Text,IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
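        // "values" holds all the 1s the Mapper emitted for this category; summing them gives the click count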
        int total=0;
        for (IntWritable value : values) {
            total+=value.get();
        }
        context.write(key,new IntWritable(total));
    }
}

③ Encapsulating Job objects

public class URLCountApplication extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        //1. Create a Job object
        Configuration conf = getConf();
        Job job= Job.getInstance(conf,"URLCountApplication");

        //2. Tell the job the data format
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        //3. Set data path
        TextInputFormat.addInputPath(job,new Path("/demo/click"));
        //The output directory is created automatically; if it already exists before execution, the job is aborted
        TextOutputFormat.setOutputPath(job,new Path("/demo/result"));

        //4. Set processing logic
        job.setMapperClass(URLMapper.class);
        job.setReducerClass(URLReducer.class);

        //5. Set the output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //6. Submit job
        return job.waitForCompletion(true)?1:0;
    }
    
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new URLCountApplication(),args);
    }
}

Task release

Remote deployment
  • You need to add the following code to the job
job.setJarByClass(URLCountApplication.class);

This sets the class (and therefore the jar) that the program is loaded from, because after packaging the task is submitted with the hadoop jar command.

[root@CentOS ~]# yarn jar MapReduce-1.0-SNAPSHOT.jar com.baizhi.click.URLCountApplication
 Or:
[root@CentOS ~]# hadoop jar MapReduce-1.0-SNAPSHOT.jar com.baizhi.click.URLCountApplication

Tip: if this packaging-and-submitting cycle feels cumbersome, you can use Maven's SSH plugins (wagon-ssh) to upload the jar to the server and run the submission commands automatically, as in the pom.xml below.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>MapReduce</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.9.2</version>
        </dependency>
    </dependencies>
    <build>
        <extensions>
            <extension>
                <groupId>org.apache.maven.wagon</groupId>
                <artifactId>wagon-ssh</artifactId>
                <version>2.10</version>
            </extension>
        </extensions>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>wagon-maven-plugin</artifactId>
                <version>1.0</version>
                <executions>
                    <execution>
                        <id>upload-deploy</id>
                        <!-- Run upload-single and sshexec during the package phase -->
                        <phase>package</phase>
                        <goals>
                            <goal>upload-single</goal>
                            <goal>sshexec</goal>
                        </goals>
                        <configuration>
                            <!-- Files to deploy -->
                            <fromFile>target/${project.artifactId}-${project.version}.jar</fromFile>
                            <!-- Deployment target: scp://user:password@host/path -->
                            <url>
                                <![CDATA[ scp://root:123456@CentOS/root/ ]]>
                            </url>
                            <!-- Shell commands to execute -->
                            <commands>
                                <command> hadoop fs -rm -r -f /demo/result </command>
                                <command> hadoop jar MapReduce-1.0-SNAPSHOT.jar com.baizhi.click.URLCountApplication </command>
                            </commands>
                            <displayCommandOutputs>true</displayCommandOutputs>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Local simulation

Without any YARN environment, a job can be run directly in local simulation mode. On Windows you generally need to modify the NativeIO source code; since the 2.9.2 source package may be hard to obtain, you can use the 2.6.0 sources instead and change line 557 (the access method) as follows:

public static boolean access(String path, AccessRight desiredAccess)
    throws IOException {
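    // always report that the current user may access the given path, bypassing the Windows permission check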
    return true;
}

Add log4j.properties to the resources directory:

log4j.rootLogger=INFO,CONSOLE
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender 
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout 
log4j.appender.CONSOLE.layout.ConversionPattern=%p %d{yyyy-MM-dd HH:mm:ss,SSS} %C -%m%n
Cross platform submission

① Copy core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml into the project's resources directory

② Add the following configuration to mapred-site.xml

<!--Enable cross-platform submission-->
<property>
    <name>mapreduce.app-submission.cross-platform</name>
    <value>true</value>
</property>

③ Modify job code

conf.addResource("core-site.xml");
conf.addResource("hdfs-site.xml");
conf.addResource("yarn-site.xml");
conf.addResource("mapred-site.xml");
conf.set("mapreduce.job.jar","file:///xxx.jar");

InputFormat & OutputFormat

Overall design

![InputFormat / OutputFormat overall design](https://img-blog.csdnimg.cn/dc1d87f1fba84b88810d7e7704714d8c.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQyMDc0OTQ5,size_16,color_FFFFFF,t_70)

InputFormat

This class is the top-level abstraction Hadoop provides for input. It defines the split (slice) calculation logic and the logic for reading the data of each split.

public abstract class InputFormat<K, V> {
    public InputFormat() {
    }
    //Calculate slice / data split logical interval
    public abstract List<InputSplit> getSplits(JobContext var1) 
        throws IOException, InterruptedException;
    //Realize the reading logic of the logical interval and pass the read data to Mapper
    public abstract RecordReader<K, V> createRecordReader(InputSplit var1, 
                                                          TaskAttemptContext var2) 
        throws IOException, InterruptedException;
}

Hadoop ships with several ready-made implementations of the InputFormat interface, mainly:

  • CompositeInputFormat - mainly realizes the join of large-scale data sets on the Map side
  • DBInputFormat - mainly provides the reading implementation for RDBMS databases, mainly for Oracle and MySQL databases.
  • FileInputFormat - pre implementation for distributed file systems.

TextInputFormat

By default, the file is cut in 128MB. The cut interval is called a Split, and then the interval data is read by using the LineRecordReader. The LineRecordReader will provide Mapper with each line of text data as value, and provide the byte offset of the value in the text line. The offset is a Long type parameter, which usually has no effect.

Note: for all subclasses of the default FileInputFormat that do not override getSplits, the computed split size falls in the range (0, 140.8MB], because when calculating splits the underlying code evaluates (remaining file size / 128MB > 1.1) ? cut a new split : put the remainder into the current split.

Refer to the above case for the code.
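
The split decision described in the note can be sketched as follows; this is a simplified illustration of the logic in FileInputFormat (the method and variable names are mine, not the actual Hadoop source), using the default 128MB block size:

// prints the splits that would be produced for a file of the given length
static void printSplits(long fileLength) {
    long splitSize = 128L * 1024 * 1024;                  // default block size, used as the split size
    long bytesRemaining = fileLength;
    while ((double) bytesRemaining / splitSize > 1.1) {   // the 1.1 "slop" factor from the note above
        System.out.println("split of " + splitSize + " bytes");
        bytesRemaining -= splitSize;
    }
    if (bytesRemaining > 0) {
        // the remainder, at most 1.1 * 128MB = 140.8MB, becomes the last split
        System.out.println("last split of " + bytesRemaining + " bytes");
    }
}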

NLineInputFormat

By default, the file is cut according to N lines. The cut interval is called a Split, and then the interval data is read by using the LineRecordReader. The LineRecordReader will provide Mapper with each line of text data as value, and provide the byte offset of the value in the text line. The offset is a Long type parameter, which usually has no effect.

It overrides getSplits of FileInputFormat, so when using NLineInputFormat we generally need to set the number of lines per split.

 NLineInputFormat.setNumLinesPerSplit(job,1000);
public class URLCountApplication extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        //1. Create a Job object
        Configuration conf = getConf();

        Job job= Job.getInstance(conf,"URLCountApplication");

        //2. Tell the job the data format
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job,1000);
        job.setOutputFormatClass(TextOutputFormat.class);

        //3. Set data path
        TextInputFormat.addInputPath(job,new Path("D:/data/click"));
        //The output directory is created automatically; if it already exists before execution, the job is aborted
        TextOutputFormat.setOutputPath(job,new Path("D:/data/result"));

        //4. Set processing logic
        job.setMapperClass(URLMapper.class);
        job.setReducerClass(URLReducer.class);

        //5. Set the output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //6. Submit job
        return job.waitForCompletion(true)?1:0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new URLCountApplication(),args);
    }
}

KeyValueTextInputFormat

By default, the file is cut into 128MB splits; each split is then read with a KeyValueLineRecordReader, which hands the Mapper a key and a value, both of type Text. By default each line is split into key and value at the first tab; if the separator is not found, the whole line becomes the key and the value is empty.

conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR,",");
public class URLCountApplication extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        //1. Create a Job object
        Configuration conf = getConf();
        conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR,",");

        Job job= Job.getInstance(conf,"AvgCostAplication");

        //2. Tell the job the data format
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        //3. Set data path
        TextInputFormat.addInputPath(job,new Path("file:///D:/data/keyvalue"));
        //The output directory is created automatically; if it already exists before execution, the job is aborted
        TextOutputFormat.setOutputPath(job,new Path("file:///D:/data/result"));

        //4. Set processing logic
        job.setMapperClass(AvgCostMapper.class);
        job.setReducerClass(AvgCostReducer.class);

        //5. Set the output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        //6. Submit job
        return job.waitForCompletion(true)?1:0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new URLCountApplication(),args);
    }
}

MultipleInputs

This is a composite input format, mainly used to combine several inputs with different InputFormats; all of the Mappers involved must produce the same map-output key/value types.

public class SumCostApplication extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        //1. Create a Job object
        Configuration conf = getConf();
        conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR,",");
        Job job= Job.getInstance(conf,"SumCostCountApplication");

        //2. Tell the job the data format
        job.setOutputFormatClass(TextOutputFormat.class);

        //3. Set data path
        TextOutputFormat.setOutputPath(job,new Path("file:///D:/data/result"));
        //4. Set processing logic
        MultipleInputs.addInputPath(job,new Path("file:///D:/data/mul/keyvalue"), KeyValueTextInputFormat.class,KeyVlaueCostMapper.class);
        MultipleInputs.addInputPath(job,new Path("file:///D:/data/mul/text"), TextInputFormat.class,TextCostMapper.class);
        job.setReducerClass(CostSumReducer.class);

        //5. Set the output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        //6. Submit job
        return job.waitForCompletion(true)?1:0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new SumCostApplication(),args);
    }
}

CombineFileInputFormat

All of the FileInputFormats mentioned above compute file splits per file, which means that if a directory contains many small files, there will be far too many Map tasks in the first phase; the default FileInputFormat is therefore not friendly to small files. Hadoop provides the CombineFileInputFormat class specifically for computing splits in small-file scenarios, where multiple small files can map to the same split. The files must all have the same format. In practice we can use CombineTextInputFormat, which behaves exactly like TextInputFormat except for how splits are calculated.

public class URLCountApplication extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        //1. Create a Job object
        Configuration conf = getConf();

        Job job= Job.getInstance(conf,"URLCountApplication");

        //2. Tell the job the data format
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        //3. Set data path
        CombineTextInputFormat.addInputPath(job,new Path("file:///D:/data/click"));
        //The output directory is created automatically; if it already exists before execution, the job is aborted
        TextOutputFormat.setOutputPath(job,new Path("file:///D:/data/result"));

        //4. Set processing logic
        job.setMapperClass(URLMapper.class);
        job.setReducerClass(URLReducer.class);

        //5. Set the output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //6. Submit job
        return job.waitForCompletion(true)?1:0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new URLCountApplication(),args);
    }
}

DBInputFormat

It is mainly responsible for reading the data in RDBMS. At present, it only supports MySQL/Oracle databases

public class UserDBWritable implements DBWritable {
    private Boolean sex;
    private Double salary;
    /**
     * Used by DBOutputFormat when writing back to the database (not needed here, since we only read)
     * @param statement
     * @throws SQLException
     */
    public void write(PreparedStatement statement) throws SQLException {

    }

    public void readFields(ResultSet resultSet) throws SQLException {
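        // called once per row of the query result to map the columns onto this object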
        this.sex=resultSet.getBoolean("sex");
        this.salary=resultSet.getDouble("salary");
    }

    public Boolean getSex() {
        return sex;
    }

    public void setSex(Boolean sex) {
        this.sex = sex;
    }

    public Double getSalary() {
        return salary;
    }

    public void setSalary(Double salary) {
        this.salary = salary;
    }
}
public class DBAvgSalaryApplication extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        //1. Create a Job object
        Configuration conf = getConf();
        //Set parallelism
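        // DBInputFormat uses this value to split the query result into 5 ranges, one per map task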
        conf.setInt(MRJobConfig.NUM_MAPS,5);
        DBConfiguration.configureDB(conf,"com.mysql.jdbc.Driver",
                                    "jdbc:mysql://localhost:3306/test","root","123456");

        Job job= Job.getInstance(conf,"DBAvgSalaryApplication");

        //2. Tell the job the data format
        job.setInputFormatClass(DBInputFormat.class);

        String query="select sex,salary from t_user";
        String countQuery="select count(*) from t_user";
        DBInputFormat.setInput(job,UserDBWritable.class,query,countQuery);

        job.setOutputFormatClass(TextOutputFormat.class);

        //3. Set data path
        //The output directory is created automatically; if it already exists before execution, the job is aborted
        TextOutputFormat.setOutputPath(job,new Path("D:/data/result"));

        //4. Set processing logic
        job.setMapperClass(UserAvgMapper.class);
        job.setReducerClass(UserAvgReducer.class);

        //5. Set the output key and value
        job.setMapOutputKeyClass(BooleanWritable.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        //6. Submit job
        return job.waitForCompletion(true)?1:0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new DBAvgSalaryApplication(),args);
    }
}
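
The driver above references UserAvgMapper and UserAvgReducer, which are not listed in the original text. A minimal hedged sketch of what they might look like, assuming the job computes the average salary per sex (the two public classes would live in separate files; imports are shown once for brevity, and the true=male mapping is an assumption):

import java.io.IOException;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

//DBInputFormat hands each row to the Mapper as <LongWritable, UserDBWritable>
public class UserAvgMapper extends Mapper<LongWritable, UserDBWritable, BooleanWritable, DoubleWritable> {
    @Override
    protected void map(LongWritable key, UserDBWritable value, Context context)
            throws IOException, InterruptedException {
        context.write(new BooleanWritable(value.getSex()), new DoubleWritable(value.getSalary()));
    }
}

//Averages the salary values that arrive for each sex
public class UserAvgReducer extends Reducer<BooleanWritable, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(BooleanWritable key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double total = 0;
        int count = 0;
        for (DoubleWritable v : values) {
            total += v.get();
            count++;
        }
        //Assumed mapping: true -> male, false -> female
        context.write(new Text(key.get() ? "male" : "female"), new DoubleWritable(total / count));
    }
}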

OutputFormat

This is the top-level abstraction that Hadoop provides for output. It defines the write logic that sends the Reduce-side output to an external system, performs the output check (only for file systems), and returns an OutputCommitter that guarantees the output can be committed correctly.

public abstract class OutputFormat<K, V> {

  //Create RecordWriter
  public abstract RecordWriter<K, V> 
    getRecordWriter(TaskAttemptContext context
                    ) throws IOException, InterruptedException;

  //Check whether the output directory is valid
  public abstract void checkOutputSpecs(JobContext context
                                        ) throws IOException, 
                                                 InterruptedException;

  //Returns a submitter
  public abstract 
  OutputCommitter getOutputCommitter(TaskAttemptContext context
                                     ) throws IOException, InterruptedException;
}


TextOutputFormat

Write the output of the Reducer directly to the file system, where the toString methods of key and value will be called when writing.

DBOutputFormat

Write the output of the Reducer directly to the database system.

public class URLCountDBWritable implements DBWritable {
    private String category;
    private Integer count;

    public URLCountDBWritable(String category, Integer count) {
        this.category = category;
        this.count = count;
    }

    public URLCountDBWritable() {
    }

    public String getCategory() {
        return category;
    }

    public void setCategory(String category) {
        this.category = category;
    }

    public Integer getCount() {
        return count;
    }

    public void setCount(Integer count) {
        this.count = count;
    }

    public void write(PreparedStatement statement) throws SQLException {
        statement.setString(1,category);
        statement.setInt(2,count);
    }

    public void readFields(ResultSet resultSet) throws SQLException {

    }
}
public class URLCountApplication extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        //1. Create a Job object
        Configuration conf = getConf();
        DBConfiguration.configureDB(conf,"com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost:3306/test",
                "root","123456");
        Job job= Job.getInstance(conf,"URLCountApplication");

        //2. Tell the job the data format
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(DBOutputFormat.class);

        //3. Set data path
        TextInputFormat.addInputPath(job,new Path("file:///D:/data/click"));
        DBOutputFormat.setOutput(job,"url_click","url_category","url_count");

        //4. Set processing logic
        job.setMapperClass(URLMapper.class);
        job.setReducerClass(URLReducer.class);

        //5. Set the output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(URLCountDBWritable.class);
        job.setOutputValueClass(NullWritable.class);

        //6. Submit job
        return job.waitForCompletion(true)?1:0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new URLCountApplication(),args);
    }
}
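
This job expects the Reducer to emit URLCountDBWritable as the key and NullWritable as the value, so the URLReducer wired in here must differ from the Text/IntWritable version used with TextOutputFormat. The original text does not list it; a hedged sketch of a compatible reducer could be:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class URLReducer extends Reducer<Text, IntWritable, URLCountDBWritable, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        //DBOutputFormat calls write(PreparedStatement) on the key object to fill the
        //url_category and url_count columns declared via DBOutputFormat.setOutput
        context.write(new URLCountDBWritable(key.toString(), total), NullWritable.get());
    }
}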

JedisOutputFormat

A custom OutputFormat example that writes the Reducer output to Redis through the Jedis client; the Redis host and port are passed to the tasks through the job Configuration.

public class JedisOutputFormat extends OutputFormat<String,String> {
    public final static String JEDIS_HOST="jedis.host";
    public final static String JEDIS_PORT="jedis.port";

    public static void setOutput(Job job, String host, Integer port) {
        job.getConfiguration().set(JEDIS_HOST,host);
        job.getConfiguration().setInt(JEDIS_PORT,port);
    }

    public RecordWriter<String, String> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        Configuration config = context.getConfiguration();
        String host=config.get(JEDIS_HOST);
        Integer port=config.getInt(JEDIS_PORT,6379);
        return new JedisRecordWriter(host,port);
    }

    public void checkOutputSpecs(JobContext context) throws IOException, InterruptedException {}

    public OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException, InterruptedException {
        return new FileOutputCommitter(FileOutputFormat.getOutputPath(context),
                context);
    }
}
public class JedisRecordWriter  extends RecordWriter<String,String> {
    private Jedis jedis=null;

    public JedisRecordWriter(String host, Integer port) {
        jedis=new Jedis(host,port);
    }

    public void write(String key, String value) throws IOException, InterruptedException {
        jedis.set(key,value);
    }

    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        jedis.close();
    }
}

public class URLCountApplication extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        //1. Create a Job object
        Configuration conf = getConf();

        Job job= Job.getInstance(conf,"URLCountApplication");

        //2. Tell the job the data format
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(JedisOutputFormat.class);

        //3. Set data path
        TextInputFormat.addInputPath(job,new Path("file:///D:/data/click"));
        JedisOutputFormat.setOutput(job,"CentOS",6379);
        //4. Set processing logic
        job.setMapperClass(URLMapper.class);
        job.setReducerClass(URLReducer.class);

        //5. Set the output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(String.class);
        job.setOutputValueClass(String.class);

        //6. Submit job
        return job.waitForCompletion(true)?1:0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new URLCountApplication(),args);
    }
}
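
Note that JedisOutputFormat is parameterized with String keys and values, so the Reducer feeding it must emit String/String pairs (as the setOutputKeyClass/setOutputValueClass calls above indicate). A hedged sketch of such a reducer, with the class name URLRedisReducer chosen purely for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class URLRedisReducer extends Reducer<Text, IntWritable, String, String> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        //JedisRecordWriter stores each pair with jedis.set(key, value)
        context.write(key.toString(), String.valueOf(total));
    }
}

Whatever class the driver registers with setReducerClass must declare these output types.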

Dependency resolution

  • Runtime dependencies (YARN container / child-process dependencies)

Option 1

Make the dependent jar packages available on all compute nodes (the hosts where NodeManager runs), for example by shipping them with the job via -libjars:

[root@CentOS ~]# hadoop jar xxx.jar <main class> -libjars <dependency1.jar>,<dependency2.jar>

Option 2

[root@CentOS ~]# hdfs dfs -mkdir /libs
[root@CentOS ~]# hdfs dfs -put mysql-connector-java-5.1.46.jar /libs
conf.setStrings("tmpjars","/libs/xxx1.jar,/libs/xxx2.jar,...");
  • Submit-time dependencies (client side)

You need to configure the HADOOP_CLASSPATH environment variable (e.g. in /root/.bashrc). This kind of dependency usually arises during the split calculation phase on the client, for example when DBInputFormat needs the JDBC driver to run its count query.

HADOOP_CLASSPATH=/root/mysql-connector-java-5.1.46.jar
export HADOOP_CLASSPATH
[root@CentOS ~]# source .bashrc 
[root@CentOS ~]# hadoop classpath #View the class path of hadoop
/usr/hadoop-2.6.0/etc/hadoop:/usr/hadoop-2.6.0/share/hadoop/common/lib/*:/usr/hadoop-2.6.0/share/hadoop/common/*:/usr/hadoop-2.6.0/share/hadoop/hdfs:/usr/hadoop-2.6.0/share/hadoop/hdfs/lib/*:/usr/hadoop-2.6.0/share/hadoop/hdfs/*:/usr/hadoop-2.6.0/share/hadoop/yarn/lib/*:/usr/hadoop-2.6.0/share/hadoop/yarn/*:/usr/hadoop-2.6.0/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.6.0/share/hadoop/mapreduce/*:`/root/mysql-connector-java-5.1.46.jar`:/usr/hadoop-2.6.0/contrib/capacity-scheduler/*.jar

InputFormat/OutputFormat and Mapper and Reducer

(Figure: how InputFormat/OutputFormat plug into the Mapper and Reducer; the original image is unavailable.)

MapReduce Shuffle

definition

In MapReduce, the way the data produced in the Mapper phase is handed over to the Reducer phase is the most critical process in the framework; this process is called shuffle. Its core steps include data partitioning, sorting, local aggregation (Combiner), buffering in the ring buffer, spilling to disk, fetching, and merge sorting.

(Figure: the MapReduce shuffle process; the original image is unavailable.)

common problem

1. Can MapReduce implement global sorting?

By default, MapReduce cannot produce a globally sorted result: the default HashPartitioner only guarantees that the data inside each partition is sorted by key, not across partitions. Global sorting can still be achieved with the following approaches:

  • Set the number of reduce tasks (setNumReduceTasks) to 1, so that all data falls into a single partition and is therefore fully sorted; this only suits small data sets.
  • Use a custom partition strategy that partitions by key ranges instead of by hash; ordering between the ranges then yields a global order, but hand-picked ranges easily lead to uneven distribution and data skew in the calculation.
  • Use the TotalOrderPartitioner provided by Hadoop, which first samples the input and then computes the partition boundaries (see the sketch after the reference link below).

reference resources: https://blog.csdn.net/lalaguozhe/article/details/9211919
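
A hedged sketch of the TotalOrderPartitioner approach; the input path, partition file path, sampling parameters, reducer count and Text key type are illustrative assumptions, and the input keys must have the same type as the Map output keys for the sampling to be valid:

//Assumption: a SequenceFile input whose keys are Text, with a Map phase that keeps the keys unchanged
Job job = Job.getInstance(getConf(), "TotalOrderSort");
job.setInputFormatClass(SequenceFileInputFormat.class);
SequenceFileInputFormat.addInputPath(job, new Path("/data/sorted-input"));
job.setNumReduceTasks(3);
job.setMapOutputKeyClass(Text.class);

//Sample the input and write the partition boundaries that TotalOrderPartitioner will read
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/tmp/partitions.lst"));
InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);

job.setPartitionerClass(TotalOrderPartitioner.class);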

2. How to interfere with MapReduce's partition policy?

Generally speaking, the partition strategy is rarely changed in practice, because with big data the first concern is an even distribution of data to prevent data skew, and hashing is usually the best choice for that. If you do need to override the default partitioner, call:

job.setPartitionerClass(CustomHashPartitioner.class); //pass in your Partitioner implementation
public class CustomHashPartitioner<K, V> extends Partitioner<K, V> {

  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}

3. How to solve the problem of data skew in MapReduce calculation (interview hot issue)?

Scenario: counting the population of Asian countries, taking China and Japan as examples. The natural design uses the country as the key and the citizen record as the value, so during the MapReduce computation all Chinese records fall into one partition because they share the key "China"; the data is therefore heavily skewed.

4. What determines the parallelism of Map and Reduce?

The parallelism of the Map side is determined by the input split calculation; the parallelism of the Reduce side is set explicitly with job.setNumReduceTasks(n).

5. MapReduce tuning strategy

1) Avoid computing over many small files: merge them into large files offline before the MapReduce analysis, or use CombineTextInputFormat.

2) Tune the ring (circular) buffer parameters to reduce the spill I/O of Map tasks; the buffer cannot be enlarged without limit, and JVM GC pressure must also be considered.

3) Turn on Map output compression and compress the spill files with GZIP; this trades CPU for a large reduction in the network bandwidth used during the Reduce-side shuffle.

//Enable Map output compression (takes effect when running on a real cluster)
conf.setBoolean("mapreduce.map.output.compress",true);
conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);

4) If conditions permit, turn on the Map-side pre-aggregation mechanism (Combiner), which runs part of the Reduce logic locally on the Map side in advance and can greatly improve performance. This optimization is not suitable for every scenario: computing an average, for example, cannot simply push the Reduce logic to the Map side. Two constraints apply (a registration sketch follows below):

1) The Combiner's input and output types must be consistent, i.e. the pre-aggregation logic must not change the Map output type / Reduce input type.

2) It must not change the business result; averaging is the classic case where the types are compatible but the computed result would be wrong.

Advantages: fewer records to sort (saving the memory used for sorting) and far less data to transfer during the Reduce-side shuffle, saving bandwidth.
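
A minimal sketch of registering the pre-aggregation step, assuming the Reducer (such as the URLReducer of the first job above) consumes and emits the same <Text, IntWritable> pairs and can therefore double as the Combiner:

//Valid only because summation is associative and the Combiner keeps the
//Map output types <Text, IntWritable> unchanged
job.setCombinerClass(URLReducer.class);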

5) Adjust the amount of resources managed by each NodeManager appropriately (memory is specified in MB):

yarn.nodemanager.resource.memory-mb=32768
yarn.nodemanager.resource.cpu-vcores=16

Or let the NodeManager detect the hardware resources automatically:

yarn.nodemanager.resource.detect-hardware-capabilities=true

6) If a job runs many small tasks sequentially, consider the JVM reuse mechanism, which lets one JVM execute several tasks in sequence instead of starting a new JVM for every task.

mapreduce.job.jvm.numtasks=2

Hadoop HA build

summary

  • NameNode HA makes the storage layer highly available; ResourceManager HA makes the computation layer highly available.

(Figure: Hadoop HA architecture; the original image is unavailable.)

preparation

  • Install three CentOS-7 64-bit hosts (with JDK installed, passwordless SSH, IP/hostname mapping, firewall disabled, etc.)

Host and service startup mapping table

| host | services |
| --- | --- |
| CentOSA | NameNode, zkfc, DataNode, JournalNode, Zookeeper, NodeManager |
| CentOSB | NameNode, zkfc, DataNode, JournalNode, Zookeeper, NodeManager, ResourceManager |
| CentOSC | DataNode, JournalNode, Zookeeper, NodeManager, ResourceManager |

Host information

| host name | IP address |
| --- | --- |
| CentOSA | 192.168.234.133 |
| CentOSB | 192.168.234.134 |
| CentOSC | 192.168.234.135 |

JDK installation and configuration

[root@CentOSX ~]# rpm -ivh jdk-8u171-linux-x64.rpm
[root@CentOSX ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export CLASSPATH
export PATH
[root@CentOSX ~]# source .bashrc

IP host name mapping

[root@CentOSX ~]# vi /etc/hosts

192.168.234.133 CentOSA
192.168.234.134 CentOSB
192.168.234.135 CentOSC

Turn off firewall

[root@CentOSX ~]# systemctl stop firewalld
[root@CentOSX ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@CentOSX ~]# firewall-cmd --state
not running

SSH password free authentication

[root@CentOSX ~]# ssh-keygen -t rsa
[root@CentOSX ~]# ssh-copy-id CentOSA
[root@CentOSX ~]# ssh-copy-id CentOSB
[root@CentOSX ~]# ssh-copy-id CentOSC

Zookeeper

[root@CentOSX ~]# tar -zxf zookeeper-3.4.6.tar.gz -C /usr/
[root@CentOSX ~]# mkdir /root/zkdata

[root@CentOSA ~]# echo 1 >> /root/zkdata/myid
[root@CentOSB ~]# echo 2 >> /root/zkdata/myid
[root@CentOSC ~]# echo 3 >> /root/zkdata/myid

[root@CentOSX ~]# touch /usr/zookeeper-3.4.6/conf/zoo.cfg
[root@CentOSX ~]# vi /usr/zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000
dataDir=/root/zkdata
clientPort=2181
initLimit=5
syncLimit=2
server.1=CentOSA:2887:3887
server.2=CentOSB:2887:3887
server.3=CentOSC:2887:3887

[root@CentOSX ~]# /usr/zookeeper-3.4.6/bin/zkServer.sh start zoo.cfg
[root@CentOSX ~]# /usr/zookeeper-3.4.6/bin/zkServer.sh status zoo.cfg
JMX enabled by default
Using config: /usr/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: `follower|leader`
[root@CentOSX ~]# jps
5879 `QuorumPeerMain`
7423 Jps

Build Hadoop cluster (HDFS)

Extract and configure HADOOP_HOME

[root@CentOSX ~]# tar -zxf hadoop-2.9.2.tar.gz -C /usr/
[root@CentOSX ~]# vi .bashrc
HADOOP_HOME=/usr/hadoop-2.9.2
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
[root@CentOSX ~]# source .bashrc

Configure core-site.xml: vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml

<!--Configure the NameNode service ID (nameservice)-->
<property>		
      <name>fs.defaultFS</name>		
      <value>hdfs://mycluster</value>	
</property>
<property>		
     <name>hadoop.tmp.dir</name>		
     <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>    
</property>
<property>		
     <name>fs.trash.interval</name>		
     <value>30</value>    
</property>
<!--Configure rack scripts-->
<property>		
     <name>net.topology.script.file.name</name>		
     <value>/usr/hadoop-2.9.2/etc/hadoop/rack.sh</value>    
</property>
<!--Configure the ZooKeeper quorum-->
<property>   
	<name>ha.zookeeper.quorum</name>
	<value>CentOSA:2181,CentOSB:2181,CentOSC:2181</value> 
</property>
<!--Configure SSH fencing and the private key location-->
<property>
     <name>dfs.ha.fencing.methods</name>
     <value>sshfence</value>
</property>
<property>
     <name>dfs.ha.fencing.ssh.private-key-files</name>
     <value>/root/.ssh/id_rsa</value>
</property>

Configure rack scripts

[root@CentOSX ~]# touch /usr/hadoop-2.9.2/etc/hadoop/rack.sh
[root@CentOSX ~]# chmod u+x /usr/hadoop-2.9.2/etc/hadoop/rack.sh
[root@CentOSX ~]# vi /usr/hadoop-2.9.2/etc/hadoop/rack.sh
while [ $# -gt 0 ] ; do
	  nodeArg=$1
	  exec</usr/hadoop-2.9.2/etc/hadoop/topology.data
	  result="" 
	  while read line ; do
		ar=( $line ) 
		if [ "${ar[0]}" = "$nodeArg" ] ; then
		  result="${ar[1]}"
		fi
	  done 
	  shift 
	  if [ -z "$result" ] ; then
		echo -n "/default-rack"
	  else
		echo -n "$result "
	  fi
done
[root@CentOSX ~]# touch /usr/hadoop-2.9.2/etc/hadoop/topology.data
[root@CentOSX ~]# vi /usr/hadoop-2.9.2/etc/hadoop/topology.data
192.168.234.133 /rack01
192.168.234.134 /rack01
192.168.234.135 /rack03

[root@CentOSX ~]# /usr/hadoop-2.9.2/etc/hadoop/rack.sh 192.168.234.133

Configure hdfs-site.xml: vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml

<property>
	<name>dfs.replication</name>
	<value>3</value>
</property> 
<!--Turn on automatic failover-->
<property>
	<name>dfs.ha.automatic-failover.enabled</name>
	<value>true</value>
</property>
<!--Spell out the nameservice referenced in core-site.xml-->
<property>
	<name>dfs.nameservices</name>
	<value>mycluster</value>
</property>
<property>
	<name>dfs.ha.namenodes.mycluster</name>
	<value>nn1,nn2</value>
</property>
<property>
	<name>dfs.namenode.rpc-address.mycluster.nn1</name>
	<value>CentOSA:9000</value>
</property>
<property>
	 <name>dfs.namenode.rpc-address.mycluster.nn2</name>
	 <value>CentOSB:9000</value>
</property>
<!--Configure the JournalNode (shared edits) addresses-->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://CentOSA:8485;CentOSB:8485;CentOSC:8485/mycluster</value>
</property>
<!--Implementation class for failover-->
<property>
	<name>dfs.client.failover.proxy.provider.mycluster</name>
	<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

Configure slaves: vi /usr/hadoop-2.9.2/etc/hadoop/slaves

CentOSA
CentOSB
CentOSC

Because this is CentOS 7, an extra package must be installed (psmisc provides the fuser command used by sshfence); otherwise the NameNode cannot fail over automatically.

[root@CentOSX ~]# yum install -y psmisc 

Start HDFS (cluster initialization)

[root@CentOSX ~]# hadoop-daemon.sh start journalnode (wait 10s)
[root@CentOSA ~]# hdfs namenode -format
[root@CentOSA ~]# hadoop-daemon.sh start namenode
[root@CentOSB ~]# hdfs namenode -bootstrapStandby
[root@CentOSB ~]# hadoop-daemon.sh start namenode
#To register the Namenode information in zookeeper, you only need to execute the following instructions on either CentOSA or B
[root@CentOSA|B ~]# hdfs zkfc -formatZK
[root@CentOSA ~]# hadoop-daemon.sh start zkfc
[root@CentOSB ~]# hadoop-daemon.sh start zkfc
[root@CentOSX ~]# hadoop-daemon.sh start datanode

View rack information

[root@CentOSC ~]# hdfs dfsadmin -printTopology
Rack: /rack01
   192.168.73.131:50010 (CentOSA)
   192.168.73.132:50010 (CentOSB)

Rack: /rack03
   192.168.73.133:50010 (CentOSC)

Resource Manager setup

Configure yarn-site.xml: vi /usr/hadoop-2.9.2/etc/hadoop/yarn-site.xml

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>CentOSB</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>CentOSC</value>
</property>
<property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>CentOSA:2181,CentOSB:2181,CentOSC:2181</value>
</property>
<!--Turn off physical memory check-->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!--Turn off virtual memory check-->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

Configure mapred-site.xml: mv /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml.template /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml

vi /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml

<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>

Start | stop the YARN services

[root@CentOSB ~]# yarn-daemon.sh start|stop resourcemanager
[root@CentOSC ~]# yarn-daemon.sh start|stop resourcemanager
[root@CentOSX ~]# yarn-daemon.sh start|stop nodemanager

Topics: Big Data