HDFS troubleshooting
1. NameNode fault handling
Requirement:
The NameNode process dies and its stored data is lost. How do we recover the NameNode?
Fault simulation
- Kill the NameNode process with kill -9
- Delete the data stored by the NameNode:
[codecat@hadoop102 dfs]$ rm -rf /opt/module/hadoop-3.1.3/data/dfs/name/*
Solution
- Copy the data from the SecondaryNameNode into the NameNode's storage directory:
[codecat@hadoop102 current]$ scp -r codecat@hadoop104:/opt/module/hadoop-3.1.3/data/dfs/namesecondary/* ./name/
- Restart NameNode
[codecat@hadoop102 current]$ hdfs --daemon start namenode
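The recovery steps above can be collected into one small script. This is a hedged sketch, not a definitive procedure: it assumes the hostnames and paths used in this walkthrough, and DRY_RUN=1 only prints the commands instead of running them.

```shell
# Hedged sketch of the NameNode recovery above; run on hadoop102.
# DRY_RUN=1 prints the commands instead of executing them.
recover_namenode() {
  local snn_host=codecat@hadoop104
  local snn_dir=/opt/module/hadoop-3.1.3/data/dfs/namesecondary
  local nn_dir=/opt/module/hadoop-3.1.3/data/dfs/name
  run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi; }
  # copy the SecondaryNameNode checkpoint into the NameNode directory
  run scp -r "$snn_host:$snn_dir/*" "$nn_dir/"
  # bring the NameNode back up
  run hdfs --daemon start namenode
}
DRY_RUN=1 recover_namenode
```

After a real run, jps (or the NameNode web UI) can confirm the process is back.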
2. Cluster safe mode & disk repair
2.1 Safe mode
In safe mode, the file system accepts only read requests and rejects change requests such as deletion and modification.
2.2 Scenarios that put the NameNode in safe mode
- The NameNode is in safe mode while loading the fsimage and replaying the edit logs
- The NameNode is in safe mode while receiving DataNode registrations (block reports)
2.3 Conditions for exiting safe mode
- dfs.namenode.safemode.min.datanodes: minimum number of available DataNodes; 0 by default
- dfs.namenode.safemode.threshold-pct: the fraction of blocks that must satisfy their minimum replication, out of all blocks in the system; 0.999f by default (so losing more than one block in a thousand keeps the NameNode in safe mode)
- dfs.namenode.safemode.extension: stabilization time after the thresholds are met; 30000 ms (30 s) by default
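To make the threshold concrete, here is a hedged illustration (not Hadoop code) of how threshold-pct gates safe mode: with 1000 blocks total and the default threshold 0.999, at least 999 blocks must have reported their minimum replicas before safe mode can end.

```shell
# Hedged arithmetic sketch of dfs.namenode.safemode.threshold-pct (0.999):
# 998 of 1000 blocks reported -> 0.998 < 0.999 -> safe mode stays on.
total=1000
reported=998   # two blocks still missing
awk -v t="$total" -v r="$reported" 'BEGIN {
  if (r / t >= 0.999) print "threshold met: leave safe mode after extension"
  else                print "below threshold: stay in safe mode"
}'
```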
2.4 Basic syntax
- View safe mode status: hdfs dfsadmin -safemode get
- Enter safe mode status: hdfs dfsadmin -safemode enter
- Leave safe mode status: hdfs dfsadmin -safemode leave
- Wait for safe mode status: hdfs dfsadmin -safemode wait
2.5 Case analysis
2.5.1 Start the cluster and enter safe mode
Immediately after the cluster starts, attempting to delete data from it fails with a message that the cluster is in safe mode.
2.5.2 Disk repair
How to handle damaged data blocks when the cluster enters safe mode.
- Delete two blocks' metadata on hadoop102, hadoop103, and hadoop104 respectively
[codecat@hadoop102 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
[codecat@hadoop102 subdir0]$ rm -rf blk_1073741826_1002.meta
[codecat@hadoop102 subdir0]$ rm -rf blk_1073741834_1010.meta
[codecat@hadoop103 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
[codecat@hadoop103 subdir0]$ rm -rf blk_1073741826_1002.meta
[codecat@hadoop103 subdir0]$ rm -rf blk_1073741834_1010.meta
[codecat@hadoop104 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
[codecat@hadoop104 subdir0]$ rm -rf blk_1073741826_1002.meta
[codecat@hadoop104 subdir0]$ rm -rf blk_1073741834_1010.meta
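The per-host deletions above can also be expressed as one loop. This is a hedged sketch using the hostnames, block names, and block-pool path from this walkthrough; the echo makes it a dry run, so remove it to actually run ssh and rm.

```shell
# Hedged sketch: the same six deletions as a loop over the three hosts.
# The echo turns this into a dry run that only prints the commands.
dir=/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
for host in hadoop102 hadoop103 hadoop104; do
  for blk in blk_1073741826_1002.meta blk_1073741834_1010.meta; do
    echo ssh "$host" rm -f "$dir/$blk"
  done
done
```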
- Restart the cluster
Safe mode is on: the number of reporting blocks does not meet the threshold.
- Leave safe mode
[codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode leave
- Delete the metadata of the damaged files, i.e. remove the corrupted files from HDFS (hdfs fsck / lists corrupted blocks, and hdfs fsck / -delete removes the affected files)
- The cluster returns to normal
2.5.3 Simulating the wait safe mode state
- View the current mode
[codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode get
Safe mode is OFF
- First enter safe mode
[codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode enter
Safe mode is ON
- Create and execute the following script
[codecat@hadoop102 hadoop-3.1.3]$ vim safemode.sh
#!/bin/bash
hdfs dfsadmin -safemode wait
hadoop fs -put /opt/module/hadoop-3.1.3/NOTICE.txt /
[codecat@hadoop102 hadoop-3.1.3]$ chmod 777 safemode.sh
[codecat@hadoop102 hadoop-3.1.3]$ ./safemode.sh
- Open another window and leave safe mode, which unblocks the waiting script
[codecat@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -safemode leave
Safe mode is OFF
- The file has now been uploaded to the HDFS cluster
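The blocking behavior of -safemode wait can be simulated locally with no cluster. In this hedged sketch, a stub hdfs shell function (not the real CLI) blocks while a flag file exists, standing in for safe mode being ON; a background job plays the role of the second window that leaves safe mode.

```shell
# Hedged local simulation of the "-safemode wait" pattern above.
flag=$(mktemp)
hdfs() {
  # stub, NOT the real hdfs CLI: wait until "safe mode" (the flag) goes away
  while [ -e "$flag" ]; do sleep 0.1; done
  echo "Safe mode is OFF"
}
( sleep 0.3; rm -f "$flag" ) &   # plays the second window running -safemode leave
hdfs dfsadmin -safemode wait      # blocks until the flag is removed
echo "put would run here"         # stands in for the hadoop fs -put step
wait
```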