HDFS troubleshooting
1. NameNode fault handling
Requirement:
The NameNode process dies and its stored data is lost. How do we recover the NameNode?
Fault simulation
- Kill the NameNode process with kill -9
- Delete the data stored by the NameNode:
[codecat@hadoop102 dfs]$ rm -rf /opt/module/hadoop-3.1.3/data/dfs/name/*
Solution
- Copy the data from the SecondaryNameNode into the NameNode's storage directory:
[codecat@hadoop102 current]$ scp -r codecat@hadoop104:/opt/module/hadoop-3.1.3/data/dfs/namesecondary/* ./name/
- Restart NameNode
[codecat@hadoop102 current]$ hdfs --daemon start namenode
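The recovery steps above can be collected into one small script. This is a hedged sketch, not a definitive procedure: it assumes the hostnames and paths used in this walkthrough, and DRY_RUN=1 only prints the commands instead of running them.

```shell
# Hedged sketch of the NameNode recovery above; run on hadoop102.
# DRY_RUN=1 prints the commands instead of executing them.
recover_namenode() {
  local snn_host=codecat@hadoop104
  local snn_dir=/opt/module/hadoop-3.1.3/data/dfs/namesecondary
  local nn_dir=/opt/module/hadoop-3.1.3/data/dfs/name
  run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi; }
  # copy the SecondaryNameNode checkpoint into the NameNode directory
  run scp -r "$snn_host:$snn_dir/*" "$nn_dir/"
  # bring the NameNode back up
  run hdfs --daemon start namenode
}
DRY_RUN=1 recover_namenode
```

After a real run, jps (or the NameNode web UI) can confirm the process is back.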
2. Cluster safe mode & disk repair
2.1 Safe mode
In safe mode, the file system accepts only read requests and rejects change requests such as deletion and modification.
2.2 Scenarios that put the NameNode in safe mode
- The NameNode is in safe mode while loading the fsimage and replaying the edit logs
- The NameNode is in safe mode while receiving DataNode registrations (block reports)
2.3 Conditions for exiting safe mode
- dfs.namenode.safemode.min.datanodes: minimum number of available DataNodes; 0 by default
- dfs.namenode.safemode.threshold-pct: the fraction of blocks that must satisfy their minimum replication, out of all blocks in the system; 0.999f by default (so losing more than one block in a thousand keeps the NameNode in safe mode)
- dfs.namenode.safemode.extension: stabilization time after the thresholds are met; 30000 ms (30 s) by default
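To make the threshold concrete, here is a hedged illustration (not Hadoop code) of how threshold-pct gates safe mode: with 1000 blocks total and the default threshold 0.999, at least 999 blocks must have reported their minimum replicas before safe mode can end.

```shell
# Hedged arithmetic sketch of dfs.namenode.safemode.threshold-pct (0.999):
# 998 of 1000 blocks reported -> 0.998 < 0.999 -> safe mode stays on.
total=1000
reported=998   # two blocks still missing
awk -v t="$total" -v r="$reported" 'BEGIN {
  if (r / t >= 0.999) print "threshold met: leave safe mode after extension"
  else                print "below threshold: stay in safe mode"
}'
```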
2.4 Basic syntax
- View safe mode status: hdfs dfsadmin -safemode get
- Enter safe mode status: hdfs dfsadmin -safemode enter
- Leave safe mode status: hdfs dfsadmin -safemode leave
- Wait for safe mode status: hdfs dfsadmin -safemode wait
2.5 Case analysis
2.5.1 Start the cluster and enter safe mode
Immediately after the cluster starts, attempting to delete data from it fails with a message that the cluster is in safe mode.
2.5.2 Disk repair
How to handle damaged data blocks when the cluster enters safe mode.
- Delete two blocks' metadata on hadoop102, hadoop103, and hadoop104 respectively
[codecat@hadoop102 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
[codecat@hadoop102 subdir0]$ rm -rf blk_1073741826_1002.meta
[codecat@hadoop102 subdir0]$ rm -rf blk_1073741834_1010.meta
[codecat@hadoop103 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
[codecat@hadoop103 subdir0]$ rm -rf blk_1073741826_1002.meta
[codecat@hadoop103 subdir0]$ rm -rf blk_1073741834_1010.meta
[codecat@hadoop104 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
[codecat@hadoop104 subdir0]$ rm -rf blk_1073741826_1002.meta
[codecat@hadoop104 subdir0]$ rm -rf blk_1073741834_1010.meta
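The per-host deletions above can also be expressed as one loop. This is a hedged sketch using the hostnames, block names, and block-pool path from this walkthrough; the echo makes it a dry run, so remove it to actually run ssh and rm.

```shell
# Hedged sketch: the same six deletions as a loop over the three hosts.
# The echo turns this into a dry run that only prints the commands.
dir=/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
for host in hadoop102 hadoop103 hadoop104; do
  for blk in blk_1073741826_1002.meta blk_1073741834_1010.meta; do
    echo ssh "$host" rm -f "$dir/$blk"
  done
done
```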
- Restart the cluster
Safe mode is on: the number of reporting blocks does not meet the threshold.
- Leave safe mode
[codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode leave
- Delete the metadata of the damaged files, i.e. remove the corrupted files from HDFS (hdfs fsck / lists corrupted blocks, and hdfs fsck / -delete removes the affected files)
- The cluster returns to normal
2.5.3 Simulating the wait safe mode state
- View the current mode
[codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode get
Safe mode is OFF
- First enter safe mode
[codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode enter
Safe mode is ON
- Create and execute the following script
[codecat@hadoop102 hadoop-3.1.3]$ vim safemode.sh
#!/bin/bash
hdfs dfsadmin -safemode wait
hadoop fs -put /opt/module/hadoop-3.1.3/NOTICE.txt /
[codecat@hadoop102 hadoop-3.1.3]$ chmod 777 safemode.sh
[codecat@hadoop102 hadoop-3.1.3]$ ./safemode.sh
- Open another window and leave safe mode, which unblocks the waiting script
[codecat@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -safemode leave
Safe mode is OFF
- The file has now been uploaded to the HDFS cluster
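The blocking behavior of -safemode wait can be simulated locally with no cluster. In this hedged sketch, a stub hdfs shell function (not the real CLI) blocks while a flag file exists, standing in for safe mode being ON; a background job plays the role of the second window that leaves safe mode.

```shell
# Hedged local simulation of the "-safemode wait" pattern above.
flag=$(mktemp)
hdfs() {
  # stub, NOT the real hdfs CLI: wait until "safe mode" (the flag) goes away
  while [ -e "$flag" ]; do sleep 0.1; done
  echo "Safe mode is OFF"
}
( sleep 0.3; rm -f "$flag" ) &   # plays the second window running -safemode leave
hdfs dfsadmin -safemode wait      # blocks until the flag is removed
echo "put would run here"         # stands in for the hadoop fs -put step
wait
```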