Spark Source Reading -- Shuffle Process Analysis

Posted by johnh on Fri, 14 Jun 2019 20:53:13 +0200

ShuffleManager (1)

In this article, let's take a look at the ShuffleManager, another important module in the Spark kernel. Shuffle is arguably one of the most important concepts in distributed computing; it is required for joins, aggregation, de-duplication, and so on. One of the main reasons Spark performs better than MapReduce is its optimized shuffle process: on the one hand, Spark's shuffle makes better use of memory (that is, execution memory, which we discussed earlier when analyzing memory management); on the other hand, the files spilled to disk during the shuffle are sorted and indexed. Of course, another major reason for Spark's high performance is the optimization of the computing chain: by fusing chains of map-type steps together, it greatly reduces the amount of intermediate data written to disk, which is also where Spark differs significantly from MapReduce.
The new version of Spark's ShuffleManager defaults to SortShuffleManager.

Code for the SparkEnv initialization section:

val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)

ShuffleMapTask.runTask

To read the ShuffleManager source code, we should start from where the ShuffleManager is called. Think of shuffle as a two-step process: writing and reading. Writing happens in the map phase, where data is classified into different partitions according to the partitioning rules; reading happens in the reduce phase, where each partition pulls its own data from the output of the map phase. We can basically follow this line of thought when analyzing the ShuffleManager source code. Let's analyze the writing process first, because in a complete shuffle, data must be written before it can be read.
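As a concrete illustration (my own minimal example, not from the original post), a reduceByKey job has exactly this two-phase structure: the map stage writes shuffle output, and the next stage reads it.

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-demo").setMaster("local[*]"))
    val counts = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // shuffle boundary: ShuffleMapTasks write, the next stage reads
    counts.saveAsTextFile("output")         // hypothetical output path
    sc.stop()
  }
}
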
Recalling the earlier analysis of how a job runs, we should remember that a job is split into tasks that are then executed on the executor side, and that a shuffle stage is split into ShuffleMapTasks, which is where the shuffle write happens. Let's take a look at the code:

You can see that a shuffle writer is obtained through ShuffleManager.getWriter, and it writes the data computed by the RDD to disk.

override def runTask(context: TaskContext): MapStatus = {
// Deserialize the RDD using the broadcast variable.
val threadMXBean = ManagementFactory.getThreadMXBean
val deserializeStartTime = System.currentTimeMillis()
val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
  threadMXBean.getCurrentThreadCpuTime
} else 0L
val ser = SparkEnv.get.closureSerializer.newInstance()
// Deserialize the RDD and the ShuffleDependency -- a key step
// (taskBinary is a broadcast variable holding the serialized (rdd, dep) pair)
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
_executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
_executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
  threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
} else 0L

var writer: ShuffleWriter[Any, Any] = null
try {
  // shuffle Manager
  val manager = SparkEnv.get.shuffleManager
  // Get a shuffle writer
  writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
  // Note that rdd.iterator is the core method that triggers the RDD computation
  // SortShuffleWriter's write method can be divided into several steps:
  // 1. Write the data computed by the upstream RDD (obtained by calling rdd.iterator) into an in-memory buffer
  // 2. If the memory threshold is exceeded during writing, the buffer is spilled to disk, possibly producing multiple spill files
  // 3. Finally, the spill files and the data remaining in memory are merge-sorted and written out as one large data file
  //    The sort order is first by partition, then by key
  // 4. During the final merged write, each time a partition has been fully written the writer is flushed
  //    and the offset of that partition's data within the file is recorded
  // So when a task has finished writing its data, there are two files on disk: a data file, and an index
  // file that records the offset of each reducer partition's data
  writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  // Mainly deletes the intermediate spill files and releases the acquired execution memory back to the memory manager
  writer.stop(success = true).get
} catch {
  case e: Exception =>
    try {
      if (writer != null) {
        writer.stop(success = false)
      }
    } catch {
      case e: Exception =>
        log.debug("Could not stop writer", e)
    }
    throw e
}
}

SortShuffleManager.getWriter

Here a different ShuffleWriter object is returned depending on the shuffle handle type. In most cases it is a SortShuffleWriter, so let's look directly at the SortShuffleWriter.write method.

/** Get a writer for a given partition. Called on executors by map tasks. */
// Gets a shuffle writer; called on the executor side by map tasks
override def getWriter[K, V](
  handle: ShuffleHandle,
  mapId: Int,
  context: TaskContext): ShuffleWriter[K, V] = {
numMapsForShuffle.putIfAbsent(
  handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
val env = SparkEnv.get
handle match {
  case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
    new UnsafeShuffleWriter(
      env.blockManager,
      shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
      context.taskMemoryManager(),
      unsafeShuffleHandle,
      mapId,
      context,
      env.conf)
  case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
    new BypassMergeSortShuffleWriter(
      env.blockManager,
      shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
      bypassMergeSortHandle,
      mapId,
      context,
      env.conf)
  case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
    new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
}
}
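
How is the handle type decided in the first place? Below is a simplified, self-contained sketch of the selection rules as I understand them from SortShuffleManager.registerShuffle; the exact conditions and thresholds are assumptions and may differ across Spark versions.

// Sketch only -- not Spark's actual types.
final case class ShuffleDepInfo(
  mapSideCombine: Boolean,
  hasAggregator: Boolean,
  serializerSupportsRelocation: Boolean,
  numReducePartitions: Int)

sealed trait HandleKind
case object BypassMergeSortHandle extends HandleKind // -> BypassMergeSortShuffleWriter
case object SerializedHandle extends HandleKind      // -> UnsafeShuffleWriter
case object BaseHandle extends HandleKind            // -> SortShuffleWriter

def chooseHandle(dep: ShuffleDepInfo, bypassMergeThreshold: Int = 200): HandleKind =
  if (!dep.mapSideCombine && dep.numReducePartitions <= bypassMergeThreshold)
    BypassMergeSortHandle   // few partitions and no map-side combine: skip sorting entirely
  else if (dep.serializerSupportsRelocation && !dep.hasAggregator &&
           dep.numReducePartitions <= (1 << 24))
    SerializedHandle        // tungsten-sort path over serialized records
  else
    BaseHandle              // general case handled by SortShuffleWriter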

SortShuffleWriter.write

Summarize the main logic of this approach:

  • Gets a sorter (an ExternalSorter), constructed with different parameters depending on whether map-side aggregation is required
  • Inserts the data into the sorter; this process may spill multiple files to disk
  • Gets the output data file name based on the shuffle id and the map partition id
  • Merges the spilled disk files and the sorted in-memory data and writes them to a single file, returning the offset of each reduce-side partition's data within that file
  • Writes the offsets into an index file and renames the data file from its temporary name to its final name
  • Finally, a MapStatus object is built as the return value of ShuffleMapTask.runTask
  • There is also some cleanup in the stop method: recording disk I/O time and deleting the intermediate spill files

      override def write(records: Iterator[Product2[K, V]]): Unit = {
      sorter = if (dep.mapSideCombine) {
        // With map-side combine, the user must have specified an aggregator (and possibly a key ordering)
        require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
        new ExternalSorter[K, V, C](
          context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
      } else {
        // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
        // care whether the keys get sorted in each partition; that will be done on the reduce side
        // if the operation being run is sortByKey.
        new ExternalSorter[K, V, V](
          context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
      }
      // Write all of the map output into the sorter;
      // multiple spill files may be generated during this process
      sorter.insertAll(records)
    
      // Don't bother including the time to open the merged output file in the shuffle write time,
      // because it just opens a single file, so is typically too fast to measure accurately
      // (see SPARK-3570).
      // mapId is the partitionId of the RDD on the shuffleMap side
      // Get the shuffle output file name of this map partition
      val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
      // Add a uuid suffix
      val tmp = Utils.tempFileWith(output)
      try {
        val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
        // This step merge-sorts the spill files on disk together with the data still in memory
        // and writes the result to a single file, which at this point still has a temporary file name
        val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
        // This step writes the index file and renames (moves) the temporary index and data files to their final names
        shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
        // Returns a status object containing the shuffle server id and the offset of each partition's data in the file
        mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
      } finally {
        if (tmp.exists() && !tmp.delete()) {
          logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
        }
      }
      }

IndexShuffleBlockResolver

Let's first look at how the shuffle output file name is obtained. It is provided by the IndexShuffleBlockResolver component, which internally delegates the allocation of file names to the DiskBlockManager inside the BlockManager. As mentioned earlier in the BlockManager analysis, the DiskBlockManager manages file-name allocation as well as the creation and deletion of the directories and subdirectories used by Spark. We can see that data files and index files follow different naming rules, defined in ShuffleDataBlockId and ShuffleIndexBlockId respectively.

def getDataFile(shuffleId: Int, mapId: Int): File = {
  blockManager.diskBlockManager.getFile(ShuffleDataBlockId(shuffleId, mapId, NOOP_REDUCE_ID))
}

private def getIndexFile(shuffleId: Int, mapId: Int): File = {
  blockManager.diskBlockManager.getFile(ShuffleIndexBlockId(shuffleId, mapId, NOOP_REDUCE_ID))
}
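
Concretely, these block ids translate into file names along the following lines (my reading of the ShuffleDataBlockId / ShuffleIndexBlockId name definitions; treat the exact format as an assumption):

// Assumed format: "shuffle_<shuffleId>_<mapId>_<reduceId>" plus a suffix
val shuffleId = 0
val mapId = 3
val NOOP_REDUCE_ID = 0
val dataFileName  = s"shuffle_${shuffleId}_${mapId}_${NOOP_REDUCE_ID}.data"   // e.g. shuffle_0_3_0.data
val indexFileName = s"shuffle_${shuffleId}_${mapId}_${NOOP_REDUCE_ID}.index"  // e.g. shuffle_0_3_0.index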

ExternalSorter.insertAll

Following the order of calls in SortShuffleWriter, let's first look at the ExternalSorter.insertAll method:

  • First, two cases are distinguished depending on whether map-side combine is required, each using a different in-memory data structure: a PartitionedAppendOnlyMap when map-side combine is needed, and a PartitionedPairBuffer when it is not. PartitionedAppendOnlyMap is a map structure implemented on top of an array with linear probing.
  • The records are then inserted one by one into that in-memory structure, applying map-side combine where applicable

      def insertAll(records: Iterator[Product2[K, V]]): Unit = {
      // TODO: stop combining if we find that the reduction factor isn't high
      val shouldCombine = aggregator.isDefined
    
      // Merge on map side
      if (shouldCombine) {
        // Combine values in-memory first using our AppendOnlyMap
        val mergeValue = aggregator.get.mergeValue
        val createCombiner = aggregator.get.createCombiner
        var kv: Product2[K, V] = null
        val update = (hadValue: Boolean, oldValue: C) => {
          if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
        }
        while (records.hasNext) {
          addElementsRead()
          kv = records.next()
          // Insert a piece of data into the memory buffer
          map.changeValue((getPartition(kv._1), kv._1), update)
          // If the estimated size exceeds the threshold, the data is spilled to disk, producing a file
          // The memory footprint is checked after every record is inserted
          maybeSpillCollection(usingMap = true)
        }
      } else {// No map-side combine
        // Stick values into our buffer
        while (records.hasNext) {
          addElementsRead()
          val kv = records.next()
          buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
          maybeSpillCollection(usingMap = false)
        }
      }
      }

AppendOnlyMap.changeValue

Let's look at a slightly more complex structure, AppendOnlyMap:

  • The null key is handled as a special case first
  • The key's hash is computed and then taken modulo the capacity. Note that since the capacity is an integer power of 2, taking the modulo is equivalent to a bitwise AND with capacity - 1, the same trick used in java.util.HashMap (see the sketch after this list).
  • If no old value exists at the probed slot, the new value is inserted directly
  • If an old value exists, it is updated
  • If a hash collision occurs, probing continues forward, and the step size grows with each collision
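
Here is a quick sketch of the power-of-two trick mentioned above (my own illustration): for a non-negative hash and a capacity that is a power of two, masking with capacity - 1 gives the same slot as the modulo operation, but more cheaply.

val capacity = 64                  // AppendOnlyMap capacities are powers of two
val mask = capacity - 1
val hash = "someKey".hashCode.abs  // kept non-negative so the comparison below holds
assert((hash & mask) == (hash % capacity))  // both pick the same slot in [0, capacity - 1]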

As you can see, the design of this structure is quite refined. One heavyweight method deserves attention here: incrementSize checks the current number of elements and grows the table if a threshold is exceeded. Growing the table is relatively expensive, since it is a full rehash and redistribution of the data. Note that the collision-handling strategy must be exactly the same when inserting new data and when redistributing during a rehash, otherwise inconsistencies could occur.

// Insert a kv pair into the array.
def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
assert(!destroyed, destructionMessage)
val k = key.asInstanceOf[AnyRef]
// Handle the case where the key is null
if (k.eq(null)) {
  // If this is the first time you insert a null value, you need to increase the size by 1
  if (!haveNullValue) {
    incrementSize()
  }
  nullValue = updateFunc(haveNullValue, nullValue)
  haveNullValue = true
  return nullValue
}
var pos = rehash(k.hashCode) & mask
// Probing for hash collisions
// This is an accelerated probing scheme: move 1 slot after the first collision,
// 2 slots after the second collision, 3 after the third, and so on
var i = 1
while (true) {
  val curKey = data(2 * pos)
  if (curKey.eq(null)) {// If the old value does not exist, insert it directly
    val newValue = updateFunc(false, null.asInstanceOf[V])
    data(2 * pos) = k
    data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
    incrementSize()
    return newValue
  } else if (k.eq(curKey) || k.equals(curKey)) {// If old values exist, they need to be updated
    val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
    data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
    return newValue
  } else {// Hash collision: keep probing forward with an increasing step
    val delta = i
    pos = (pos + delta) & mask
    i += 1
  }
}
null.asInstanceOf[V] // Never reached but needed to keep compiler happy
}

ExternalSorter.maybeSpillCollection

Back in ExternalSorter's insertAll method, the memory footprint is checked after each record is inserted to determine whether the in-memory data needs to be spilled to disk.
This method calls map.estimateSize to estimate the memory footprint of the data inserted so far. The ability to track and estimate the memory footprint is implemented in the SizeTracker trait. I mentioned it earlier when analyzing MemoryStore: when inserting object-type data into memory, an intermediate structure called DeserializedValuesHolder is used, which internally holds a SizeTrackingVector, a class that tracks and estimates object sizes by mixing in the SizeTracker trait.

private def maybeSpillCollection(usingMap: Boolean): Unit = {
var estimatedSize = 0L
if (usingMap) {
  estimatedSize = map.estimateSize()
  if (maybeSpill(map, estimatedSize)) {
    map = new PartitionedAppendOnlyMap[K, C]
  }
} else {
  estimatedSize = buffer.estimateSize()
  if (maybeSpill(buffer, estimatedSize)) {
    buffer = new PartitionedPairBuffer[K, C]
  }
}

if (estimatedSize > _peakMemoryUsedBytes) {
  _peakMemoryUsedBytes = estimatedSize
}
}

ExternalSorter.maybeSpill

First it checks whether the current memory usage exceeds the threshold. If it does, more execution memory is requested from the memory manager; if not enough execution memory can be acquired, the data still has to be spilled to disk.

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
var shouldSpill = false
// Only check once for every 32 records written
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
  // Claim up to double our current memory from the shuffle memory pool
  val amountToRequest = 2 * currentMemory - myMemoryThreshold
  // Request Execution Memory from Memory Manager
  val granted = acquireMemory(amountToRequest)
  myMemoryThreshold += granted
  // If we were granted too little memory to grow further (either tryToAcquire returned 0,
  // or we already had more memory than myMemoryThreshold), spill the current collection
  // A spill is required if memory usage still exceeds the (possibly raised) threshold
  shouldSpill = currentMemory >= myMemoryThreshold
}
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
// Actually spill
if (shouldSpill) {
  _spillCount += 1
  logSpillage(currentMemory)
  // Spill to disk
  spill(collection)
  _elementsRead = 0
  _memoryBytesSpilled += currentMemory
  // Release memory
  releaseMemory()
}
shouldSpill
}
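
A small worked example of the doubling request above (assuming the initial threshold is the 5 MB default of spark.shuffle.spill.initialMemoryThreshold; the exact default is an assumption here):

val myMemoryThreshold = 5L * 1024 * 1024                       // assumed initial threshold: 5 MB
val currentMemory     = 6L * 1024 * 1024                       // collection currently estimated at 6 MB
val amountToRequest   = 2 * currentMemory - myMemoryThreshold  // request 7 MB more
// If the memory manager grants only, say, 512 KB, the new threshold is 5.5 MB;
// currentMemory (6 MB) still exceeds it, so the collection is spilled to disk.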

ExternalSorter.spill

Continuing into the spill method called above:

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
// Get a sorted iterator
val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
// Write data to disk file
val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
spills += spillFile
}

WritablePartitionedPairCollection.destructiveSortedWritablePartitionedIterator

This method returns an iterator sorted by partition and key; the actual sorting logic lives in AppendOnlyMap.destructiveSortedIterator.
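The ordering used here compares the partition id first and falls back to the key comparator only when the partitions are equal. A minimal sketch of that idea (the real code lives in WritablePartitionedPairCollection; the name below is illustrative):

import java.util.Comparator

// Order (partitionId, key) pairs by partition first, then by key.
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val partitionDiff = a._1 - b._1
      if (partitionDiff != 0) partitionDiff else keyComparator.compare(a._2, b._2)
    }
  }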

AppendOnlyMap.destructiveSortedIterator

This code has two parts. First, the array is compacted so that the sparsely distributed entries are moved to the front of the array.
Then the array is sorted with the comparator, which compares by partition first, and by key when the partitions are equal.
Finally, it returns an iterator that is a thin wrapper over the array. With this, we have a general picture of the sorting logic of AppendOnlyMap.

def destructiveSortedIterator(keyComparator: Comparator[K]): Iterator[(K, V)] = {
destroyed = true
// Pack KV pairs into the front of the underlying array
// This code transfers all sparse data to the header of the array, compressing it
var keyIndex, newIndex = 0
while (keyIndex < capacity) {
  if (data(2 * keyIndex) != null) {
    data(2 * newIndex) = data(2 * keyIndex)
    data(2 * newIndex + 1) = data(2 * keyIndex + 1)
    newIndex += 1
  }
  keyIndex += 1
}
assert(curSize == newIndex + (if (haveNullValue) 1 else 0))

// Sort data by comparator
new Sorter(new KVArraySortDataFormat[K, AnyRef]).sort(data, 0, newIndex, keyComparator)

new Iterator[(K, V)] {
  var i = 0
  var nullValueReady = haveNullValue
  def hasNext: Boolean = (i < newIndex || nullValueReady)
  def next(): (K, V) = {
    if (nullValueReady) {
      nullValueReady = false
      (null.asInstanceOf[K], nullValue)
    } else {
      val item = (data(2 * i).asInstanceOf[K], data(2 * i + 1).asInstanceOf[V])
      i += 1
      item
    }
  }
}
}

ExternalSorter.spillMemoryIteratorToDisk

Back in the ExternalSorter.spill method, once we have the sorted iterator we can spill the data to disk.
I won't paste the code of this method; instead, here is a summary of the main steps, followed by a short sketch:

  • First, get a temporary block's BlockId and a temporary file name from the DiskBlockManager
  • Get a disk writer (a DiskBlockObjectWriter) through the BlockManager; it encapsulates the logic for writing files with the Java stream API
  • Loop over the records and write them to disk, flushing periodically (a batch is flushed after every fixed number of records)
  • If an exception occurs, the partially written file is rolled back
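
The following is a condensed, self-contained analogy of that spill loop (it does not use Spark's DiskBlockObjectWriter; names and the batch size are illustrative):

import java.io.{BufferedOutputStream, DataOutputStream, File, FileOutputStream}

// Write (partition, value) records to a temp file in batches, flushing periodically,
// and delete the partial file if anything goes wrong.
def spillToTempFile(records: Iterator[(Int, String)], batchSize: Int = 10000): File = {
  val file = File.createTempFile("spill-demo", ".tmp")
  val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
  var written = 0
  var success = false
  try {
    records.foreach { case (partition, value) =>
      out.writeInt(partition)                     // records arrive already sorted by partition and key
      out.writeUTF(value)
      written += 1
      if (written % batchSize == 0) out.flush()   // periodic flush, as in the third step above
    }
    success = true
    file
  } finally {
    out.close()
    if (!success) file.delete()                   // roll back the partial file, as in the last step
  }
}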

Summary

To summarize the whole process of spilling data to disk through ExternalSorter:

  • First, records are inserted one by one into the internal map (or buffer) structure
  • Each insert checks the memory usage; if it exceeds the threshold and not enough execution memory can be acquired, the data currently in memory is spilled to disk
  • When spilling, the data is sorted by partition and key: data belonging to the same partition is grouped together and then ordered by key according to the provided comparator. A disk writer is obtained through the DiskBlockManager and BlockManager to write the data out as a file, and the information about the spilled file is recorded
  • Multiple spill files may be produced over the course of the whole write

ExternalSorter.writePartitionedFile

Summarize the main steps:

  • Still get a disk writer through blockManager
  • Merge-sort the previously spilled disk files together with the data still in memory into an iterator grouped by partition
  • Loop over that iterator and write the data to disk; each time a partition has been fully written, a flush is performed to push the data from the OS file buffer to disk, the current file length is taken, and the offset of that partition's data within the file is recorded

      def writePartitionedFile(
        blockId: BlockId,
        outputFile: File): Array[Long] = {
    
      // Track location of each range in the output file
      val lengths = new Array[Long](numPartitions)
      val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
        context.taskMetrics().shuffleWriteMetrics)
    
      // If no data has been spilled to disk before,
      // just write the in-memory data straight to disk
      if (spills.isEmpty) {
        // Case where we only have in-memory data
        val collection = if (aggregator.isDefined) map else buffer
        // Returns the sorted iterator
        val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
        while (it.hasNext) {
          val partitionId = it.nextPartition()
          while (it.hasNext && it.nextPartition() == partitionId) {
            it.writeNext(writer)
          }
          // Commit once a full partition has been written
          val segment = writer.commitAndGet()
          // Record the length of this partition's data in the file
          lengths(partitionId) = segment.length
        }
      } else {// There are spill files on disk
        // We must perform merge-sort; get an iterator by partition and write everything directly.
        // partitionedIterator wraps an iterator that merges the individual spill files with the in-memory data;
        // this wrapped iterator is the key to the merge ordering (see the sketch after this method)
        for ((id, elements) <- this.partitionedIterator) {
          if (elements.hasNext) {
            for (elem <- elements) {
              writer.write(elem._1, elem._2)
            }
            // After each partition is written, flush once and read the current file offset.
            // This is the offset of the partition that was just written;
            // when the reduce side pulls data, it uses this offset to locate exactly the data it needs.
            val segment = writer.commitAndGet()
            lengths(id) = segment.length
          }
        }
      }
    
      writer.close()
      // Update some statistics after writing
      context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
      context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
      context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)
    
      // Returns the length of each reduce-side partition's data in the file
      lengths
      }
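
The merge that partitionedIterator performs across the spill files and the remaining in-memory data is essentially a k-way merge. Below is a minimal, self-contained sketch of that idea (my own simplification, not the actual ExternalSorter.mergeSort):

import scala.collection.mutable

// Merge several iterators, each already sorted by `ord`, into one sorted iterator.
def mergeSort[T](sortedIterators: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
  val heads = sortedIterators.filter(_.hasNext).map(_.buffered)
  // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest head first.
  val heap = mutable.PriorityQueue.empty[BufferedIterator[T]](
    Ordering.by[BufferedIterator[T], T](_.head)(ord.reverse))
  heap.enqueue(heads: _*)
  new Iterator[T] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): T = {
      val it = heap.dequeue()
      val elem = it.next()
      if (it.hasNext) heap.enqueue(it)
      elem
    }
  }
}

// Usage: merging three already-sorted streams, much like per-partition spill data.
val merged = mergeSort(Seq(Iterator(1, 4, 7), Iterator(2, 3, 9), Iterator(5, 6, 8)))
// merged.toList == List(1, 2, 3, 4, 5, 6, 7, 8, 9)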

IndexShuffleBlockResolver.writeIndexFileAndCommit

Returning once more to the SortShuffleWriter.write method, the last step calls IndexShuffleBlockResolver.writeIndexFileAndCommit.
The main purpose of this method is to write the offset of each partition into an index file, and to rename the temporary index file and the temporary data file to their final names (the rename is an atomic operation).
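As a concrete picture of what ends up in the index file (my own illustration of the layout described above): the index stores cumulative offsets, and reduce partition i owns the byte range [offsets(i), offsets(i + 1)) of the data file.

// partitionLengths as returned by writePartitionedFile (example values)
val lengths = Array(120L, 0L, 300L)
// The index file stores the cumulative offsets 0, 120, 120, 420
val offsets = lengths.scanLeft(0L)(_ + _)
// A reducer for partition 2 would therefore fetch bytes [120, 420) of the data file.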

Summary

The shuffle write process can be summarized in a few main steps:

  • First, while the data is being written, multiple data files may be spilled to disk because of insufficient memory. Each spill file is sorted by partition and key, which lays the groundwork for the subsequent merge sort.
  • Second, these small spill files and the data remaining in memory are merge-sorted and written to one large file, recording the offset of each partition's data in the file during the write.
  • Finally, an index file is written that records the offset of each reduce-side partition in the data file, so that a reducer can quickly locate the data its partition needs when pulling data.
