Shuffle Output Tracker
As an auxiliary component of the shuffle module, MapOutputTracker plays an important role in the whole shuffle process, and it has come up more than once in the earlier articles of this series. For example, when DAGScheduler submits a stage it wraps the stage into a TaskSet, but some partitions may already have been computed (a stage can be submitted several times because of failures, and part of its partitions may have succeeded earlier), so those partitions do not need to be recomputed; only the missing ones do. Clearly some component is needed to maintain the success or failure status of shuffle tasks and the location of the computed results.
In addition, in the shuffle read phase, a reduce partition depends on the output data of multiple map partitions. So when we read the data of a reduce partition we need to know which map outputs it depends on, where each block physically lives, what its blockId is, and how much data in each block belongs to this reduce partition. MapOutputTracker is the component that maintains all of this information, so we can now see why it matters.
MapOutputTracker.scala
The main classes and helper classes of the MapOutputTracker component all live in this file. I will first outline what each class does, and then focus on the key class.
- ShuffleStatus encapsulates the shuffle output state of one stage. Its major member is the mapStatuses array, indexed by map partition number, which stores the output status of each map partition. For MapStatus itself, see MapStatus.scala; I will not expand on it here.
- MapOutputTrackerMessage, the message class for rpc requests, has two implementations: GetMapOutputStatuses, for fetching all output statuses of a shuffle, and StopMapOutputTracker, for asking the driver to stop the MapOutputTrackerMasterEndpoint.
- MapOutputTrackerMasterEndpoint: if you are familiar with spark's rpc module, this class will look familiar. It is an rpc server that registers itself with RpcEnv under a well-known name and handles certain messages, namely the two messages mentioned above.
- MapOutputTracker is an abstract class that only defines some operation interfaces. Perhaps its most important job is to maintain a sequence number, epoch, which represents a consistent global view of the map output state. Whenever a map output changes, this value is incremented by one. The executor side syncs the latest epoch to decide whether its locally cached map output state has gone stale (a minimal sketch of this epoch handshake follows this list).
- MapOutputTrackerMaster runs on the driver side and implements most of the functionality of the MapOutputTracker class; it is the core class.
- MapOutputTrackerWorker runs on the executor side and mainly wraps the rpc-call logic.
Generally speaking, the core class is MapOutputTrackerMaster, and the other classes are helpers built around it. So we will focus on MapOutputTrackerMaster; I do not intend to dig into the other classes, which readers should be able to understand easily on their own.
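To make the epoch mechanism mentioned above more concrete, here is a minimal self-contained sketch of the idea: the driver bumps the epoch whenever map output changes, and the executor drops its cached statuses when it sees a newer epoch. The names epoch and mapStatuses follow the real class, but the snippet is a simplified illustration, not the actual Spark implementation.

import scala.collection.mutable

// Simplified sketch of epoch-based cache invalidation.
// mapStatuses stands in for the executor-side map-output cache (toy value type).
class EpochTrackerSketch {
  private val epochLock = new AnyRef
  private var epoch: Long = 0
  private val mapStatuses = mutable.Map.empty[Int, Array[String]]

  // Driver side: every change to map output bumps the epoch.
  def incrementEpoch(): Unit = epochLock.synchronized { epoch += 1 }

  // Executor side: tasks carry the driver's latest epoch; if it is newer than
  // the local one, the locally cached statuses are considered stale and dropped.
  def updateEpoch(newEpoch: Long): Unit = epochLock.synchronized {
    if (newEpoch > epoch) {
      epoch = newEpoch
      mapStatuses.clear()
    }
  }
}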
MapOutputTrackerMaster
findMissingPartitions
As mentioned above, DAGScheduler calls this method when wrapping a stage into a TaskSet, to find the partitions of the stage that still need to be computed.
def findMissingPartitions(shuffleId: Int): Option[Seq[Int]] = {
  shuffleStatuses.get(shuffleId).map(_.findMissingPartitions())
}
ShuffleStatus.findMissingPartitions
def findMissingPartitions(): Seq[Int] = synchronized {
  val missing = (0 until numPartitions).filter(id => mapStatuses(id) == null)
  assert(missing.size == numPartitions - _numAvailableOutputs,
    s"${missing.size} missing, expected ${numPartitions - _numAvailableOutputs}")
  missing
}
These two pieces of code are very simple: they just look up the internal map structure, nothing more.
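For context, this is roughly how ShuffleMapStage consults the tracker when DAGScheduler decides which tasks to (re)submit; the fragment is paraphrased and simplified from the Spark source, so treat it as a sketch rather than the exact code. A shuffle the tracker does not know about yet means every partition is still missing.

// Paraphrased from ShuffleMapStage (simplified, not compilable on its own):
override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    .getOrElse(0 until numPartitions)
}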
Besides these, there are methods such as registerShuffle, registerMapOutput, unregisterMapOutput, unregisterShuffle, removeOutputsOnHost and so on. The methods themselves are very simple, nothing more than inserting into, updating or looking up the internal map structure; the key is to know when they are invoked. Understanding that gives a much better picture of MapOutputTracker's role in the overall spark framework. The call sites are easy to locate with an IDE such as Idea, so I will not expand much here and just summarize briefly (a condensed sketch of the call sites follows the list):
- registerShuffle: DAGScheduler registers the shuffle corresponding to a stage when it creates the ShuffleMapStage.
- registerMapOutput: after a ShuffleMapTask completes, the information about its map output is registered.
- removeOutputsOnHost: removes all map output information related to a host; usually invoked when that host is lost.
- removeOutputsOnExecutor: likewise, removes all map output information of an executor; typically invoked when that executor is lost.
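To give a sense of when these methods fire, here are condensed fragments of the DAGScheduler call sites, paraphrased and simplified from the Spark source; surrounding names such as shuffleDep, smt and hostToUnregisterOutputs come from the enclosing DAGScheduler code, and the fragments are not compilable on their own.

// 1. createShuffleMapStage: register the shuffle when the stage is created
if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
  mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
}

// 2. handleTaskCompletion: after a ShuffleMapTask succeeds, record its output
mapOutputTracker.registerMapOutput(
  shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)

// 3. removeExecutorAndUnregisterOutputs: drop outputs when a host or executor is lost
hostToUnregisterOutputs match {
  case Some(host) => mapOutputTracker.removeOutputsOnHost(host)
  case None => mapOutputTracker.removeOutputsOnExecutor(execId)
}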
getMapSizesByExecutorId
Let's look at another important method. When reading data in the reduce phase, a task first needs to know which map outputs it depends on, so it sends a message to the MapOutputTrackerMasterEndpoint component on the driver side to fetch the map output statuses. After a series of method calls, the following method is eventually invoked:
def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int)
    : Seq[(BlockManagerId, Seq[(BlockId, Long)])] = {
  logDebug(s"Fetching outputs for shuffle $shuffleId, partitions $startPartition-$endPartition")
  shuffleStatuses.get(shuffleId) match {
    case Some(shuffleStatus) =>
      // Convert the mapStatus array to (BlockManagerId, Seq[(BlockId, Long)]) tuples
      shuffleStatus.withMapStatuses { statuses =>
        MapOutputTracker.convertMapStatuses(shuffleId, startPartition, endPartition, statuses)
      }
    case None =>
      Seq.empty
  }
}
Then let's look at MapOutputTracker.convertMapStatuses. This method is also very simple: it splits the output of each map partition up by reduce partition, so the number of (BlockId, Long) tuples produced equals the number of map partitions times the number of requested reduce partitions (a toy illustration follows the code).
def convertMapStatuses(
    shuffleId: Int,
    startPartition: Int,
    endPartition: Int,
    statuses: Array[MapStatus]): Seq[(BlockManagerId, Seq[(BlockId, Long)])] = {
  assert(statuses != null)
  // Used to store the results
  val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(BlockId, Long)]]
  // The final number of (BlockId, Long) tuples equals map partitions * reduce partitions
  for ((status, mapId) <- statuses.zipWithIndex) {
    if (status == null) {
      val errorMessage = s"Missing an output location for shuffle $shuffleId"
      logError(errorMessage)
      throw new MetadataFetchFailedException(shuffleId, startPartition, errorMessage)
    } else {
      for (part <- startPartition until endPartition) {
        splitsByAddress.getOrElseUpdate(status.location, ArrayBuffer()) +=
          ((ShuffleBlockId(shuffleId, mapId, part), status.getSizeForBlock(part)))
      }
    }
  }
  splitsByAddress.toSeq
}
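As a toy illustration of that grouping (plain Scala, not Spark code; the block-id strings and ToyStatus type are made up for the example), three map outputs queried for two reduce partitions yield 3 * 2 = 6 (blockId, size) tuples, grouped by the executor that produced each map output:

import scala.collection.mutable.{ArrayBuffer, HashMap}

object ConvertMapStatusesToy {
  // sizes is indexed by reduce partition id
  case class ToyStatus(location: String, sizes: Array[Long])

  def main(args: Array[String]): Unit = {
    val statuses = Array(
      ToyStatus("exec-1", Array(10L, 20L)),
      ToyStatus("exec-1", Array(5L, 5L)),
      ToyStatus("exec-2", Array(100L, 1L)))

    val splitsByAddress = new HashMap[String, ArrayBuffer[(String, Long)]]
    for ((status, mapId) <- statuses.zipWithIndex; part <- 0 until 2) {
      splitsByAddress.getOrElseUpdate(status.location, ArrayBuffer()) +=
        ((s"shuffle_0_${mapId}_$part", status.sizes(part)))
    }
    // exec-1 ends up with 4 tuples, exec-2 with 2 tuples
    splitsByAddress.foreach { case (loc, blocks) => println(s"$loc -> $blocks") }
  }
}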
getPreferredLocationsForShuffle
Let's look at another important method. We know that a reduce-side partition generally depends on the output of multiple map-side partitions, but the amount of data contributed by each map partition differs. Take an extreme example: suppose a reduce partition depends on the output of 10 map-side partitions, but 10,000 records come from one of them while only one record comes from each of the others. In that case you should obviously prefer to schedule this reduce task onto the executor holding the 10,000 records. The example is crude and may not be exact, but it is enough to illustrate what this method is for.
def getPreferredLocationsForShuffle(dep: ShuffleDependency[_, _, _], partitionId: Int)
    : Seq[String] = {
  // First check a few configuration thresholds; only if they all hold is the preferred location computed
  if (shuffleLocalityEnabled && dep.rdd.partitions.length < SHUFFLE_PREF_MAP_THRESHOLD &&
      dep.partitioner.numPartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {
    // Key call
    val blockManagerIds = getLocationsWithLargestOutputs(dep.shuffleId, partitionId,
      dep.partitioner.numPartitions, REDUCER_PREF_LOCS_FRACTION)
    if (blockManagerIds.nonEmpty) {
      blockManagerIds.get.map(_.host)
    } else {
      Nil
    }
  } else {
    Nil
  }
}
As you can see, the key call is getLocationsWithLargestOutputs. Next, let's look at that method.
The comments in the code below already make the logic clear, and it is very simple. For example, say the total amount of data a reduce partition has to read is 100m, and the map outputs on one executor together contain 20m of data belonging to that reduce partition; that is at least 0.2 of the total, so the executor can become a preferred location. Simple enough. Note, however, that this method computes preferred locations with the executor as the smallest unit, whereas the caller, getPreferredLocationsForShuffle, merely takes the host out of each returned BlockManagerId and hands that back to the upper caller. One problem is that a host (i.e. a physical node) may run more than one executor, so the returned result may contain duplicate hosts. Another is that, since the host is what is ultimately returned as the preferred location, why not compute preferred locations with the host as the smallest unit in the first place, i.e. add up all the data related to the reduce partition on a host and treat the host as a preferred location if its share exceeds 0.2? That seems more reasonable and is also more likely to produce preferred locations at all. Take an extreme example: five executors run on one host and each holds 0.1 of the partition's data, while five other hosts each run a single executor also holding 0.1. With executor-level grouping there is no preferred location, yet the host with five executors obviously should be one. A toy sketch of this host-level variant follows the method below.
def getLocationsWithLargestOutputs(
    shuffleId: Int,
    reducerId: Int,
    numReducers: Int,
    fractionThreshold: Double)
    : Option[Array[BlockManagerId]] = {
  val shuffleStatus = shuffleStatuses.get(shuffleId).orNull
  // Non-null check on shuffleStatus
  if (shuffleStatus != null) {
    shuffleStatus.withMapStatuses { statuses =>
      // Non-empty check on the mapStatus array
      if (statuses.nonEmpty) {
        // HashMap to add up sizes of all blocks at the same location
        // Records, per executor, the amount of map output data belonging to this reduce partition
        val locs = new HashMap[BlockManagerId, Long]
        var totalOutputSize = 0L
        var mapIdx = 0
        while (mapIdx < statuses.length) {
          val status = statuses(mapIdx)
          // status may be null here if we are called between registerShuffle, which creates an
          // array with null entries for each output, and registerMapOutputs, which populates it
          // with valid status entries. This is possible if one thread schedules a job which
          // depends on an RDD which is currently being computed by another thread.
          if (status != null) {
            val blockSize = status.getSizeForBlock(reducerId)
            if (blockSize > 0) {
              locs(status.location) = locs.getOrElse(status.location, 0L) + blockSize
              totalOutputSize += blockSize
            }
          }
          mapIdx = mapIdx + 1
        }
        // An executor becomes a preferred location if the data it holds for this reduce partition
        // accounts for at least fractionThreshold (default 0.2) of the partition's total data
        val topLocs = locs.filter { case (loc, size) =>
          size.toDouble / totalOutputSize >= fractionThreshold
        }
        // Return if we have any locations which satisfy the required threshold
        if (topLocs.nonEmpty) {
          return Some(topLocs.keys.toArray)
        }
      }
    }
  }
  None
}
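To make the host-level alternative suggested above concrete, here is a minimal standalone sketch. It uses toy types rather than Spark's BlockManagerId and MapStatus, and it is only an illustration of the idea, not a patch to the real method: per-executor sizes are first summed per host, then the same 0.2-style threshold is applied.

import scala.collection.mutable.HashMap

object HostLevelPreferredLocs {
  // Keys are (host, executorId); values are bytes of this reduce partition held by that executor.
  def preferredHosts(sizesByExecutor: Map[(String, String), Long],
                     fractionThreshold: Double = 0.2): Seq[String] = {
    val totalOutputSize = sizesByExecutor.values.sum
    val sizesByHost = new HashMap[String, Long]
    for (((host, _), size) <- sizesByExecutor) {
      sizesByHost(host) = sizesByHost.getOrElse(host, 0L) + size
    }
    // A host qualifies if its aggregated share reaches the threshold.
    sizesByHost.collect {
      case (host, size) if size.toDouble / totalOutputSize >= fractionThreshold => host
    }.toSeq
  }

  def main(args: Array[String]): Unit = {
    // The extreme example from the text: five executors on host-a each hold 10% of the data,
    // and five other hosts hold 10% each. Executor-level grouping finds no preferred location,
    // but host-level grouping picks host-a (50% >= 20%).
    val sizes = (1 to 5).map(i => ("host-a", s"exec-$i") -> 10L).toMap ++
      (1 to 5).map(i => (s"host-$i", s"exec-b$i") -> 10L).toMap
    println(preferredHosts(sizes)) // prints List(host-a)
  }
}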
summary
As usual, let's close with a brief summary of the role of MapOutputTracker:
- Maintain the status and location information of all shuffle map outputs
- Find out which partitions of a stage have not yet been computed
- Provide the preferred locations for a reduce partition
- Tell a reduce partition which map outputs it depends on, where they are located, and how much data relevant to it each map output contains