Production optimization of Hadoop

Posted by DaveMate on Mon, 27 Sep 2021 12:41:22 +0200

HDFS troubleshooting

1. NameNode fault handling

  1. Requirements:

    The NameNode process has died and its stored data is lost. How can the NameNode be recovered?

  2. Fault simulation

    • kill -9 NameNode process
    • Delete data stored in NameNode:
      [codecat@hadoop102 dfs]$ rm -rf /opt/module/hadoop-3.1.3/data/dfs/name/*
      
  3. Solution

    1. Copy the data from the SecondaryNameNode into the NameNode's original data storage directory
      [codecat@hadoop102 current]$ scp -r codecat@hadoop104:/opt/module/hadoop-3.1.3/data/dfs/namesecondary/* ./name/
      
    2. Restart NameNode
      [codecat@hadoop102 current]$ hdfs --daemon start namenode
      
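The two recovery steps above can be combined into one script. This is a minimal sketch: the host (hadoop104) and paths come from this article's cluster and are assumptions for any other environment. By default it only prints each command as a dry run; set DRY_RUN=0 on a real cluster to actually execute them.

```shell
#!/bin/bash
# Sketch of the NameNode recovery steps above (hostnames/paths assumed from
# this article's cluster). DRY_RUN=1 (the default) only prints each command.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi; }

NAME_DIR=/opt/module/hadoop-3.1.3/data/dfs/name
# Step 1: copy the SecondaryNameNode checkpoint into the NameNode's name dir
run scp -r "codecat@hadoop104:/opt/module/hadoop-3.1.3/data/dfs/namesecondary/*" "$NAME_DIR/"
# Step 2: restart the NameNode daemon
run hdfs --daemon start namenode
```

Note that data written after the last SecondaryNameNode checkpoint cannot be recovered this way, since the checkpoint lags behind the live edit log.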

2. Cluster security mode & disk repair

2.1 safe mode

In safe mode, the file system accepts only data read requests and rejects change requests such as deletion and modification

2.2 scenarios for entering safe mode

  1. The NameNode is in safe mode while it loads the image file (fsimage) and edit logs
  2. The NameNode is in safe mode while it waits for DataNodes to register again

2.3 conditions for exiting safe mode

  1. dfs.namenode.safemode.min.datanodes: minimum number of available datanodes, 0 by default
  2. dfs.namenode.safemode.threshold-pct: the minimum percentage of blocks in the system that must have the minimum number of replicas. The default is 0.999f, i.e. out of every 1000 blocks, at most one may be missing
  3. dfs.namenode.safemode.extension: stabilization time. The default value is 30000ms, i.e. 30s
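
These thresholds can be overridden in hdfs-site.xml. A sketch of the corresponding entries (the property names are the real HDFS keys; the values below simply restate the defaults listed above):

```xml
<!-- hdfs-site.xml: safe-mode exit thresholds (values shown are the defaults) -->
<property>
  <name>dfs.namenode.safemode.min.datanodes</name>
  <value>0</value>
</property>
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
</property>
<property>
  <name>dfs.namenode.safemode.extension</name>
  <value>30000</value>
</property>
```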

2.4 basic syntax

  • View safe mode status: hdfs dfsadmin -safemode get
  • Enter safe mode status: hdfs dfsadmin -safemode enter
  • Leave safe mode status: hdfs dfsadmin -safemode leave
  • Wait for safe mode status: hdfs dfsadmin -safemode wait
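
These commands are easy to script around. A small sketch of checking the status programmatically: `hdfs dfsadmin -safemode get` prints "Safe mode is ON" or "Safe mode is OFF"; the call is stubbed here with echo so the parsing step runs standalone, and on a real cluster the echo would be replaced with the actual hdfs command.

```shell
# Hypothetical helper around `hdfs dfsadmin -safemode get`.
safemode_status() {
  # On a real cluster: hdfs dfsadmin -safemode get
  echo "Safe mode is OFF"
}

# Branch on the last word of the status line
if safemode_status | grep -q "ON$"; then
  echo "cluster is read-only (safe mode)"
else
  echo "cluster accepts writes"
fi
```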

2.5 case analysis

2.5.1 start the cluster and enter the safe mode

If you try to delete data immediately after the cluster starts, HDFS reports that the cluster is in safe mode and refuses the request

2.5.2 disk repair

How to handle damaged data blocks when the cluster enters safe mode

  1. Delete the metadata of two blocks on hadoop102, hadoop103 and hadoop104 respectively

    [codecat@hadoop102 subdir0]$ pwd
    /opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
    [codecat@hadoop102 subdir0]$ rm -rf blk_1073741826_1002.meta 
    [codecat@hadoop102 subdir0]$ rm -rf blk_1073741834_1010.meta 
    
    [codecat@hadoop103 subdir0]$ pwd
    /opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
    [codecat@hadoop103 subdir0]$ rm -rf blk_1073741826_1002.meta 
    [codecat@hadoop103 subdir0]$ rm -rf blk_1073741834_1010.meta 
    
    [codecat@hadoop104 subdir0]$ pwd
    /opt/module/hadoop-3.1.3/data/dfs/data/current/BP-349834019-192.168.150.102-1629042571631/current/finalized/subdir0/subdir0
    [codecat@hadoop104 subdir0]$ rm -rf blk_1073741826_1002.meta 
    [codecat@hadoop104 subdir0]$ rm -rf blk_1073741834_1010.meta 
    
  2. Restart the cluster
    It is found that safe mode is on, because the number of available blocks does not meet the threshold.

  3. Leave safe mode

    [codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode leave
    
  4. Delete the metadata of the damaged files, i.e. remove the corrupted files from HDFS

  5. The cluster returns to normal
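
Before deleting metadata in step 4, `hdfs fsck` can report which files the corrupt blocks belong to. A hedged sketch: the report below is stubbed with example file paths (only the block IDs match the ones deleted above); on a real cluster it would come from the fsck command shown in the comment.

```shell
# Stubbed corrupt-block report; on a real cluster this would be:
#   report=$(hdfs fsck / -list-corruptfileblocks)
# The file paths here are illustrative examples.
report="blk_1073741826 /example/file1.txt
blk_1073741834 /example/file2.txt"

# Extract the file paths so they can be removed with `hadoop fs -rm`:
echo "$report" | awk '{print $2}'
```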

2.5.3 simulating wait safe mode

Simulate waiting for safe mode to be turned off

  1. View current mode
    [codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode get
    Safe mode is OFF
    
  2. Enter safe mode first
    [codecat@hadoop102 subdir0]$ hdfs dfsadmin -safemode enter
    Safe mode is ON
    
  3. Create and execute the following script
    [codecat@hadoop102 hadoop-3.1.3]$ vim safemode.sh
    
    #!/bin/bash
    hdfs dfsadmin -safemode wait
    hadoop fs -put /opt/module/hadoop-3.1.3/NOTICE.txt /
    
    [codecat@hadoop102 hadoop-3.1.3]$ chmod 777 safemode.sh
    [codecat@hadoop102 hadoop-3.1.3]$ ./safemode.sh 
    
  4. Open another window and leave safe mode; the waiting script then continues
    [codecat@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -safemode leave
    Safe mode is OFF
    
  5. The script's blocked upload completes, and the uploaded data now appears on the HDFS cluster

Topics: Big Data Hadoop