Design and implementation of a block placement policy for HDFS multi-rack distribution

Posted by polymnia on Sun, 23 Jan 2022 08:42:49 +0100

Preface

As we all know, HDFS keeps three replicas of each block to ensure high data availability, and the placement of those three replicas is carefully designed: two replicas are placed on the same rack (on different nodes), and the third is placed on another rack. Under this strategy, the data can tolerate the crash of a node, or even of an entire rack. Note that the "rack" here is a concept defined for HDFS by the cluster admin. It is logical: it can simply be one physical rack, or a set of racks. What matters is that racks are failure domains isolated from each other.

In theory, the default HDFS block placement policy can tolerate the loss of one rack. In practice, while operating a large-scale cluster, the default policy still cannot fully meet the requirements of high data availability. For example, the author recently kept running into user-visible missing blocks on a production cluster, caused by all three replicas of a block being unavailable at the same time. The root cause turned out to be that the cluster was being upgraded rack by rack, with each rack offline for up to an hour. During such a window, the occasional death of a machine on another rack produced these sporadic missing blocks. The essential problem is that losing one rack makes the two replicas on that rack unavailable at once, which greatly raises the probability that a block becomes unreadable. In view of this, we set out to transform the existing block placement strategy to solve the problem.

Block placement policy for HDFS multi-rack distribution

We know that HDFS decides block locations based on the internal implementation of its block placement policy, together with the topology configured by the admin. The topology specifies which nodes belong to which racks.

If we don't want to modify the existing HDFS placement policy to solve the problem described above, changing the cluster's topology can also help: mapping each logical rack in the topology to a larger group of physical racks mitigates the issue to a certain extent. But note that this is mitigation, not a complete solution. Even though each topology rack now spans more physical racks, two replicas can still land on the same physical rack.
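For reference, rack membership is normally fed to HDFS through a topology mapping. Below is a minimal sketch using Hadoop's built-in TableMapping; the file path and host names are illustrative, not taken from the author's cluster:

<!-- core-site.xml -->
<property>
  <name>net.topology.node.switch.mapping.impl</name>
  <value>org.apache.hadoop.net.TableMapping</value>
</property>
<property>
  <name>net.topology.table.file.name</name>
  <value>/etc/hadoop/conf/topology.table</value>
</property>

# /etc/hadoop/conf/topology.table: one "host rack" pair per line
dn-001.example.com  /rack-a
dn-002.example.com  /rack-a
dn-003.example.com  /rack-b

Making a logical rack such as /rack-a span several physical racks is exactly the mitigation described above: it reduces, but does not eliminate, the chance that two replicas share one physical rack.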

Therefore, the fundamental solution to distributing blocks across racks is to change the block placement policy itself. The target distribution is very simple: the three replicas should land on three different racks. With that layout, the cluster can tolerate the loss of an entire rack without losing data availability.
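To make the difference concrete (rack and node names are illustrative):

default policy   : replica1 -> /rack-a/dn1, replica2 -> /rack-a/dn2, replica3 -> /rack-b/dn3   (2 + 1)
multi-rack policy: replica1 -> /rack-a/dn1, replica2 -> /rack-b/dn2, replica3 -> /rack-c/dn3   (1 + 1 + 1)

Under the default layout, losing /rack-a leaves only a single live replica; under the multi-rack layout, losing any one rack always leaves two.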

Policy implementation of multi-rack distribution

Since the implementation logic of the default HDFS block placement policy is complex, we prefer not to change its process logic directly. Instead, we add a new policy class that extends it and override a few of its methods to obtain a block placement policy that spreads replicas across racks.

Several key methods will be involved here:

  • chooseLocalRack
  • chooseRemoteRack
  • isGoodDatanode
  • verifyBlockPlacement
  • isMovable

The five methods above are central to HDFS block placement. The first two need to be overridden first, because under the multi-rack distribution strategy the semantics of chooseLocalRack become identical to those of chooseRemoteRack. The chooseRemoteRack implementation is in fact the same as in the default policy; the author copied it anyway in case this method needs to change later. The code is as follows:

// The class lives in the blockmanagement package so that it can access
// package-private members of the parent class (e.g. isGoodDatanode below).
// The imports cover all excerpts of this class shown in this article.
package org.apache.hadoop.hdfs.server.blockmanagement;

import java.util.Collection;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.fs.StorageType;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.net.Node;
import org.apache.hadoop.net.NodeBase;

public class BlockPlacementPolicyWithRackSeparated extends BlockPlacementPolicyDefault {

  // per-thread failure counter, used by parts of the policy elided below
  private static ThreadLocal<Integer> CHOOSE_TARGET_FAILED_COUNT =
      new ThreadLocal<Integer>() {
        @Override
        protected Integer initialValue() {
          return 0;
        }
      };

  @Override
  protected DatanodeStorageInfo chooseLocalRack(Node localMachine, Set<Node> excludedNodes,
      long blocksize, int maxNodesPerRack, List<DatanodeStorageInfo> results,
      boolean avoidStaleNodes, EnumMap<StorageType, Integer> storageTypes)
      throws NotEnoughReplicasException {

    // no local machine, so choose a random machine
    if (localMachine == null) {
      return chooseRandom(NodeBase.ROOT, excludedNodes, blocksize,
          maxNodesPerRack, results, avoidStaleNodes, storageTypes);
    }
    final String localRack = localMachine.getNetworkLocation();

    try {
      // override the default chooseLocalRack semantics: choose a node from a remote rack
      return chooseRandom("~" + localRack, excludedNodes,
          blocksize, maxNodesPerRack, results, avoidStaleNodes, storageTypes);
    } catch (NotEnoughReplicasException e) {
      // no node available outside the local rack; fall back to anywhere in the cluster
      return chooseRandom(NodeBase.ROOT, excludedNodes, blocksize,
          maxNodesPerRack, results, avoidStaleNodes, storageTypes);
    }
  }

  @Override
  protected void chooseRemoteRack(int numOfReplicas, DatanodeDescriptor localMachine,
      Set<Node> excludedNodes, long blocksize, int maxReplicasPerRack,
      List<DatanodeStorageInfo> results, boolean avoidStaleNodes,
      EnumMap<StorageType, Integer> storageTypes) throws NotEnoughReplicasException {
    int oldNumOfReplicas = results.size();
    // randomly choose nodes from the remote racks
    try {
      chooseRandom(numOfReplicas, "~" + localMachine.getNetworkLocation(),
          excludedNodes, blocksize, maxReplicasPerRack, results,
          avoidStaleNodes, storageTypes);
    } catch (NotEnoughReplicasException e) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Failed to choose remote rack (location = ~"
            + localMachine.getNetworkLocation() + "), fallback to local rack", e);
      }
      // fall back to the local rack for the replicas still missing
      chooseRandom(numOfReplicas - (results.size() - oldNumOfReplicas),
          localMachine.getNetworkLocation(), excludedNodes, blocksize,
          maxReplicasPerRack, results, avoidStaleNodes, storageTypes);
    }
  }
  ...
}
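Once this class is on the NameNode's classpath, it can be plugged in through HDFS's standard policy switch in hdfs-site.xml; dfs.block.replicator.classname is the stock configuration key, while the class name is of course specific to this example:

<!-- hdfs-site.xml -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithRackSeparated</value>
</property>

The NameNode instantiates the policy once at startup, so a restart (or a failover) is required for the change to take effect.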

The next method is isGoodDatanode. After a candidate target is selected (e.g. in chooseRemoteRack), this method decides whether the candidate is a qualified datanode; if not, the next candidate is tried immediately. In the new policy implementation, we additionally check the candidate's rack to guarantee that each replica's location belongs to a distinct rack.

...
  @Override
  boolean isGoodDatanode(DatanodeDescriptor node, int maxTargetPerRack,
      boolean considerLoad, List<DatanodeStorageInfo> results, boolean avoidStaleNodes) {
    if (!super.isGoodDatanode(node, maxTargetPerRack, considerLoad, results, avoidStaleNodes)) {
      return false;
    }

    // collect the candidate's rack plus the racks of all replicas chosen so far;
    // reject the candidate as soon as any rack repeats
    Set<String> rackNames = new HashSet<>();
    rackNames.add(node.getNetworkLocation());

    for (DatanodeStorageInfo info : results) {
      // Set.add returns false if the rack is already present
      if (!rackNames.add(info.getDatanodeDescriptor().getNetworkLocation())) {
        LOG.warn("Chosen node rejected since another chosen node is on the same rack.");
        return false;
      }
    }

    return true;
  }
...

Finally, there are the verifyBlockPlacement and isMovable methods. The former judges whether the placement of an existing block's replicas already satisfies the multi-rack distribution. The latter is used by the Balancer to decide whether a replica may be moved from one location to another: if the move would break the block placement policy, it is not allowed. The code is as follows:

...
  @Override
  public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
      int numberOfReplicas) {
    if (locs == null) {
      locs = DatanodeDescriptor.EMPTY_ARRAY;
    }

    if (!clusterMap.hasClusterEverBeenMultiRack()) {
      // only one rack
      return new BlockPlacementStatusDefault(1, 1);
    }

    // the placement is satisfied once the replicas cover
    // min(3, numberOfReplicas) distinct racks
    int minRacks = Math.min(3, numberOfReplicas);
    // count the distinct racks among the replica locations
    Set<String> racks = new TreeSet<>();
    for (DatanodeInfo dn : locs) {
      racks.add(dn.getNetworkLocation());
    }
    return new BlockPlacementStatusDefault(racks.size(), minRacks);
  }

  @Override
  public boolean isMovable(Collection<DatanodeInfo> locs, DatanodeInfo source,
      DatanodeInfo target) {
    // racks occupied after the proposed move: drop the source's rack, add the target's
    Set<String> rackNames = new HashSet<>();
    for (DatanodeInfo dn : locs) {
      rackNames.add(dn.getNetworkLocation());
    }
    rackNames.remove(source.getNetworkLocation());
    rackNames.add(target.getNetworkLocation());

    // movable only if every replica still sits on its own rack
    return rackNames.size() >= locs.size();
  }
...

The above are the main methods needed to implement the new block placement policy for multi-rack distribution. The author has verified the correctness of this logic in unit tests, but has not yet deployed the policy on a production cluster, so treat it as a reference implementation.
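As an independent sanity check (not the author's unit test), one can verify from the client side how many distinct racks each block of a file actually spans, using only the public FileSystem API; the class name below is made up for illustration:

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of the policy: prints the rack spread per block.
public class RackSpreadCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path(args[0]));
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      Set<String> racks = new HashSet<>();
      for (String path : loc.getTopologyPaths()) {
        // a topology path looks like /rack/host:port; strip the trailing host part
        racks.add(path.substring(0, path.lastIndexOf('/')));
      }
      System.out.println("offset=" + loc.getOffset()
          + " replicas=" + loc.getHosts().length
          + " distinctRacks=" + racks.size());
    }
  }
}

With the new policy in effect, distinctRacks should equal the replication factor (up to the number of racks in the cluster) for newly written files.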

Migration from old block placement to new block placement

Does implementing the multi-rack placement policy above mean the problem is perfectly solved? The answer is no; some things are not as simple as expected. The multi-rack policy fixes the block distribution of newly written data, but the large number of blocks already placed under the old policy still need their placement transformed. How to migrate the blocks of the old placement smoothly and transparently is a difficult problem in its own right.

For the migration of block placement, the author investigated the available methods, which fall into two categories:

  • Server-based migration scheme. The main idea is to let the NameNode do the work. This is convenient, but it carries significant hidden dangers: the pace at which the NN re-places blocks is not easy to control, so the migration can increase the NN's load and degrade its normal request-handling performance. The known server-side mechanism relies on the replicationQueuesInitializer thread that runs after the service starts to scan blocks and check their placement; whenever a block is found to violate the placement policy, re-replication to a compliant placement is triggered.
  • Client-based migration scheme. Compared with the server-side solution, a client-driven migration is easier to throttle and has less impact. There are two options (a command-line sketch of both follows this list):
    1) Use the existing Balancer tool to migrate blocks. The Balancer already follows the configured placement policy when moving blocks (see HDFS-9007), so it can help with the migration. The drawback is low efficiency: the Balancer only relocates data between nodes with unbalanced utilization. If the cluster's data is already well balanced, the amount of data the Balancer will touch is limited, and the migration will fall short of expectations.
    2) Path-level block migration based on fsck. The Balancer migrates block by block; we can instead drive migration at the path (directory or file) level: specify a path, collect the file blocks under it, migrate them, and then proceed to the next path. The batch size of each path can be tuned to the actual number of files, which makes the whole migration smoother and less disruptive. However, fsck only gained this path-level capability in version 3.3 (HDFS-14053), so on earlier versions we need to backport the patch.
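For illustration only, driving the two options might look like the following; the -replicate flag is the one added by HDFS-14053, so verify the exact fsck usage on your build:

# Option 1: Balancer-driven migration (respects the placement policy per HDFS-9007)
hdfs balancer -threshold 5

# Option 2: path-level migration via fsck (needs 3.3+ or a backport of HDFS-14053)
hdfs fsck /data/path1 -replicate
hdfs fsck /data/path2 -replicate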

OK, that is the main content of this article: a new multi-rack block placement policy implementation to improve data availability, along with the problem of migrating the block placement of old data. I hope you found it useful.

Topics: Hadoop hdfs