// Sort the data and write it into the in-memory buffer according to the sorting strategy.
// If the in-memory collection grows beyond the threshold during sorting,
// spill it to a data file on disk.
sorter.insertAll(records)
Let's start with a macro view of the Map side. We can divide it into three cases, according to whether aggregator.isDefined (an aggregation function is defined) and ordering.isDefined (an ordering is defined); a compact summary follows the list.
- No aggregation and no sorting: data is first written to a separate file per partition, and the files are finally merged into a single file in partition order. Suitable for a small number of partitions. Combining multiple buckets into the same file reduces the number of map output files, saves disk I/O and improves performance.
- No aggregation but with sorting: data is sorted by partition (or by partition and key) in the cache, then merged into a single file in partition order. Suitable when the number of partitions is large. Combining multiple buckets into the same file reduces the number of map output files, saves disk I/O and improves performance. When the cache exceeds a threshold, its data is spilled to disk.
- Both aggregation and sorting: data is first aggregated by key in the cache, then sorted by partition (or by partition and key), and finally merged into a single file in partition order. Combining multiple buckets into the same file reduces the number of map output files, saves disk I/O and improves performance. When the cache exceeds a threshold, its data is spilled to disk. Reading and aggregating records one at a time keeps memory usage low.
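To keep the three cases straight, here is a compact sketch (just a summary of the list above, not Spark code) of which in-memory structure is chosen and how the data ends up ordered:
// A rough summary of the three map-side cases; the strings are descriptive only.
def mapSideStrategy(aggregatorDefined: Boolean, orderingDefined: Boolean): (String, String) =
  (aggregatorDefined, orderingDefined) match {
    case (false, false) => ("PartitionedPairBuffer", "sort by partition id only")
    case (false, true)  => ("PartitionedPairBuffer", "sort by (partition id, key)")
    case (true, _)      => ("PartitionedAppendOnlyMap, aggregating by key", "sort by (partition id, key)")
  }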
Let's take a closer look at insertAll:
def insertAll(records: Iterator[Product2[K, V]]): Unit = {
// If an aggregation function is defined, shouldCombine is true
val shouldCombine = aggregator.isDefined
// Does the external sort need to aggregate?
if (shouldCombine) {
// mergeValue is a merge function of Value
val mergeValue = aggregator.get.mergeValue
// createCombiner is a function that generates Combiner
val createCombiner = aggregator.get.createCombiner
var kv: Product2[K, V] = null
// update is a closure that captures kv
val update = (hadValue: Boolean, oldValue: C) => {
// If the key already has a value, merge oldValue with the new value kv._2;
// otherwise use kv._2 to create the initial value.
if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
}
while (records.hasNext) {
addElementsRead()
kv = records.next()
// First use our AppendOnlyMap
// Aggregate value in memory
map.changeValue((getPartition(kv._1), kv._1), update)
// Write to disk when exceeding threshold
maybeSpillCollection(usingMap = true)
}
} else {
// Insert Value directly into the buffer
while (records.hasNext) {
addElementsRead()
val kv = records.next()
buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
maybeSpillCollection(usingMap = false)
}
}
}
createCombiner can be seen as creating the initial combined value from kv._2. mergeValue can be understood like a combiner in MapReduce, i.e. a Map-side reduce that first aggregates the values sharing the same key.
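As a concrete (and simplified) illustration, here is roughly what the two functions look like for a word-count style reduceByKey(_ + _); the Int types and the explicit newValue parameter are simplifications for clarity, not the actual closure above:
val createCombiner: Int => Int = v => v                  // the first value seen for a key becomes the combiner
val mergeValue: (Int, Int) => Int = (c, v) => c + v      // later values are folded into the combiner

// The same shape as the `update` closure in insertAll, with the new value passed explicitly:
def update(hadValue: Boolean, oldValue: Int, newValue: Int): Int =
  if (hadValue) mergeValue(oldValue, newValue) else createCombiner(newValue)

// update(hadValue = false, oldValue = 0, newValue = 3) == 3
// update(hadValue = true,  oldValue = 3, newValue = 4) == 7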
Aggregation algorithm
Let's take a closer look at the aggregation operation section:
Call stack:
- util.collection.SizeTrackingAppendOnlyMap.changeValue
- util.collection.AppendOnlyMap.changeValue
- util.collection.AppendOnlyMap.incrementSize
- util.collection.AppendOnlyMap.growTable
- util.collection.AppendOnlyMap.incrementSize
- util.collection.SizeTracker.afterUpdate
- util.collection.SizeTracker.takeSample
- util.collection.AppendOnlyMap.changeValue
First, the changeValue function of AppendOnlyMap:
util.collection.SizeTrackingAppendOnlyMap.changeValue
override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
// Using aggregation algorithm to get new Value
val newValue = super.changeValue(key, updateFunc)
// Update Sampling of AppendOnlyMap Size
super.afterUpdate()
// Return results
newValue
}
util.collection.AppendOnlyMap.changeValue
Aggregation algorithm:
def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
assert(!destroyed, destructionMessage)
val k = key.asInstanceOf[AnyRef]
if (k.eq(null)) {
if (!haveNullValue) {
incrementSize()
}
nullValue = updateFunc(haveNullValue, nullValue)
haveNullValue = true
return nullValue
}
// pos is derived from the rehashed hashCode of k, masked by mask
// data(2*pos) is where k is stored
// data(2*pos+1) is where the value for k is stored
var pos = rehash(k.hashCode) & mask
var i = 1
while (true) {
// Get the value curKey at the location of k in data
val curKey = data(2 * pos)
if (curKey.eq(null)) {
// If curKey is empty
// Create a brand-new value from kv._2 (via createCombiner)
val newValue = updateFunc(false, null.asInstanceOf[V])
data(2 * pos) = k
data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
// Increase the size counter (may trigger table growth)
incrementSize()
return newValue
} else if (k.eq(curKey) || k.equals(curKey)) {
// If k and curKey are equal
// Aggregate oldValue (data(2 * pos + 1)) with the new value (kv._2)
val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
return newValue
} else {
// curKey is not null and not equal to k,
// i.e. a hash collision:
// keep probing forward until one of the two cases above occurs.
val delta = i
pos = (pos + delta) & mask
i += 1
}
}
null.asInstanceOf[V]
}
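Note the probing scheme in the else branch: on the i-th collision the position advances by i slots (1, then 2, then 3, ...), wrapped around by the mask. A tiny standalone sketch of the probe sequence it generates, assuming a capacity of 8 (mask 7):
// Reproduces only the probing arithmetic from changeValue above.
def probeSequence(startPos: Int, mask: Int, steps: Int): Seq[Int] = {
  var pos = startPos
  (1 to steps).map { i =>
    pos = (pos + i) & mask   // i-th collision: move forward by i, wrap with the mask
    pos
  }
}
// probeSequence(startPos = 5, mask = 7, steps = 3) == Vector(6, 0, 3)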
util.collection.AppendOnlyMap.incrementSize
Now let's look at how the size is incremented and when the table grows:
private def incrementSize() {
curSize += 1
// When curSize is greater than the threshold growThreshold,
// Call growTable()
if (curSize > growThreshold) {
growTable()
}
}
util.collection.AppendOnlyMap.growTable
protected def growTable() {
// Create newData with double the capacity
val newCapacity = capacity * 2
require(newCapacity <= MAXIMUM_CAPACITY, s"Can't contain more than ${growThreshold} elements")
val newData = new Array[AnyRef](2 * newCapacity)
// Generating newMask
val newMask = newCapacity - 1
var oldPos = 0
while (oldPos < capacity) {
// Rehash the entries of the old data array with newMask
// and copy them into newData
if (!data(2 * oldPos).eq(null)) {
val key = data(2 * oldPos)
val value = data(2 * oldPos + 1)
var newPos = rehash(key.hashCode) & newMask
var i = 1
var keepGoing = true
while (keepGoing) {
val curKey = newData(2 * newPos)
if (curKey.eq(null)) {
newData(2 * newPos) = key
newData(2 * newPos + 1) = value
keepGoing = false
} else {
val delta = i
newPos = (newPos + delta) & newMask
i += 1
}
}
}
oldPos += 1
}
// Update the references
data = newData
capacity = newCapacity
mask = newMask
growThreshold = (LOAD_FACTOR * newCapacity).toInt
}
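A quick aside on why the masking works: capacity is always a power of two, so mask = capacity - 1 keeps exactly the low bits of the hash, which is equivalent to a non-negative modulo. A tiny check:
val capacity = 16
val mask = capacity - 1                        // binary 1111
val hash = 123456789                           // any non-negative hash value
assert((hash & mask) == (hash % capacity))     // both equal 5 here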
util.collection.SizeTracker.afterUpdate
Let's go back to the call to super.afterUpdate() in SizeTrackingAppendOnlyMap.changeValue, which samples the size of the AppendOnlyMap. Re-estimating the size of the AppendOnlyMap after every single operation such as insert or update would severely degrade performance, so a sampling-based estimate is used instead:
protected def afterUpdate(): Unit = {
numUpdates += 1
// If numUpdates reaches the threshold nextSampleNum,
// take a sample
if (nextSampleNum == numUpdates) {
takeSample()
}
}
util.collection.SizeTracker.takeSample
private def takeSample(): Unit = {
samples.enqueue(Sample(SizeEstimator.estimate(this), numUpdates))
// Only two samples are used.
if (samples.size > 2) {
samples.dequeue()
}
val bytesDelta = samples.toList.reverse match {
// Estimate the amount of change per update
case latest :: previous :: tail =>
(latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
// If less than 2 samples, no change is assumed.
case _ => 0
}
// Update bytesPerUpdate
bytesPerUpdate = math.max(0, bytesDelta)
// Raise the threshold for the next sample
nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
}
Let's look at the function that estimates the size of AppendOnlyMap:
def estimateSize(): Long = {
assert(samples.nonEmpty)
// Extrapolate the size change since the last sample
val extrapolatedDelta = bytesPerUpdate * (numUpdates - samples.last.numUpdates)
// Last sampled size plus the extrapolated change
(samples.last.size + extrapolatedDelta).toLong
}
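To make the sampling arithmetic concrete, here is a small worked example with made-up numbers:
// sample 1: estimated size 1000 bytes after 10 updates
// sample 2: estimated size 1800 bytes after 20 updates
val bytesPerUpdate = (1800.0 - 1000.0) / (20 - 10)    // 80 bytes per update
val (lastSize, lastNumUpdates) = (1800L, 20L)
val numUpdates = 26                                   // 6 updates since the last sample
val estimatedSize = (lastSize + bytesPerUpdate * (numUpdates - lastNumUpdates)).toLong
// estimatedSize == 1800 + 80 * 6 == 2280 bytes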
Write Buffer
Now let's go back to insertAll and take a closer look at how values are inserted directly into the buffer.
Call stack:
- util.collection.PartitionedPairBuffer.insert
- util.collection.PartitionedPairBuffer.growArray
util.collection.PartitionedPairBuffer.insert
def insert(partition: Int, key: K, value: V): Unit = {
// When the buffer is full, call growArray()
if (curSize == capacity) {
growArray()
}
data(2 * curSize) = (partition, key.asInstanceOf[AnyRef])
data(2 * curSize + 1) = value.asInstanceOf[AnyRef]
curSize += 1
afterUpdate()
}
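The buffer stores everything in one flat Array[AnyRef]: record i occupies slots 2*i and 2*i+1. A minimal sketch of that layout (the names and values are purely illustrative):
val data = new Array[AnyRef](2 * 4)            // room for 4 records
def putRecord(i: Int, partition: Int, key: String, value: Int): Unit = {
  data(2 * i) = (partition, key)               // even slot: the (partition, key) pair
  data(2 * i + 1) = Int.box(value)             // odd slot: the value
}
putRecord(0, 1, "a", 10)
putRecord(1, 0, "b", 20)
// data now holds: (1,"a"), 10, (0,"b"), 20, null, null, null, null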
util.collection.PartitionedPairBuffer.growArray
private def growArray(): Unit = {
if (capacity >= MAXIMUM_CAPACITY) {
throw new IllegalStateException(s"Can't insert more than ${MAXIMUM_CAPACITY} elements")
}
val newCapacity =
if (capacity * 2 < 0 || capacity * 2 > MAXIMUM_CAPACITY) { // Overflow
MAXIMUM_CAPACITY
} else {
capacity * 2
}
// Create a new array with double the capacity
val newArray = new Array[AnyRef](2 * newCapacity)
// copy
System.arraycopy(data, 0, newArray, 0, 2 * capacity)
data = newArray
capacity = newCapacity
resetSamples()
}
Spill
Now let's go back to insertAll to see how data is written to disk when the threshold is exceeded:
Call stack:
- util.collection.ExternalSorter.maybeSpillCollection
- util.collection.Spillable.maybeSpill
- util.collection.Spillable.spill
- util.collection.ExternalSorter.spillMemoryIteratorToDisk
- util.collection.Spillable.spill
- util.collection.Spillable.maybeSpill
util.collection.ExternalSorter.maybeSpillCollection
private def maybeSpillCollection(usingMap: Boolean): Unit = {
var estimatedSize = 0L
if (usingMap) {
estimatedSize = map.estimateSize()
if (maybeSpill(map, estimatedSize)) {
map = new PartitionedAppendOnlyMap[K, C]
}
} else {
estimatedSize = buffer.estimateSize()
if (maybeSpill(buffer, estimatedSize)) {
buffer = new PartitionedPairBuffer[K, C]
}
}
if (estimatedSize > _peakMemoryUsedBytes) {
_peakMemoryUsedBytes = estimatedSize
}
}
util.collection.Spillable.maybeSpill
protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
var shouldSpill = false
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
// If it is greater than the threshold value
// amountToRequest is the memory space to be applied for
val amountToRequest = 2 * currentMemory - myMemoryThreshold
val granted = acquireMemory(amountToRequest)
myMemoryThreshold += granted
// If we were granted too little memory
// (tryToAcquire may return 0 or less than requested),
// currentMemory can still be >= myMemoryThreshold,
// in which case we should spill
shouldSpill = currentMemory >= myMemoryThreshold
}
// If the number of elements read exceeds the force-spill threshold,
// we should also spill
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
if (shouldSpill) {
// Increment the spill count
_spillCount += 1
logSpillage(currentMemory)
// Spill operation
spill(collection)
// Reset the element-read counter
_elementsRead = 0
// Add to the spilled-memory metric
// and release the memory
_memoryBytesSpilled += currentMemory
releaseMemory()
}
shouldSpill
}
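To see what the doubling request means in numbers, here is a worked example with illustrative values (assuming an initial memory threshold of 5 MB and a collection currently estimated at 6 MB):
val myMemoryThreshold = 5L * 1024 * 1024
val currentMemory = 6L * 1024 * 1024
val amountToRequest = 2 * currentMemory - myMemoryThreshold   // 7 MB: enough to double current usage
// If the memory manager grants less than 1 MB, the new threshold stays below 6 MB,
// currentMemory >= myMemoryThreshold still holds, and the collection is spilled.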
util.collection.Spillable.spill
spill writes the in-memory collection to a sorted file; SortShuffleWriter.write later calls sorter.writePartitionedFile to merge these spill files.
override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
// Create an iterator over the in-memory collection;
// this part is explained further below.
val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
// Generate spill file,
// And add it to the array
val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
spills += spillFile
}
util.collection.ExternalSorter.spillMemoryIteratorToDisk
private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)
: SpilledFile = {
// Generate temporary files and blockId
val (blockId, file) = diskBlockManager.createTempShuffleBlock()
// These values are reset after each flush
var objectsWritten: Long = 0
val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics
val writer: DiskBlockObjectWriter =
blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)
// Record the size of each batch, in the order written to disk
val batchSizes = new ArrayBuffer[Long]
// Record how many elements each partition has
val elementsPerPartition = new Array[Long](numPartitions)
// Flush writer content to disk,
// And update related variables
def flush(): Unit = {
val segment = writer.commitAndGet()
batchSizes += segment.length
_diskBytesSpilled += segment.length
objectsWritten = 0
}
var success = false
try {
// Traversing memory collections
while (inMemoryIterator.hasNext) {
val partitionId = inMemoryIterator.nextPartition()
require(partitionId >= 0 && partitionId < numPartitions,
s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
inMemoryIterator.writeNext(writer)
elementsPerPartition(partitionId) += 1
objectsWritten += 1
// When the number of elements written reaches the batch serialization size,
// flush
if (objectsWritten == serializerBatchSize) {
flush()
}
}
if (objectsWritten > 0) {
// Write after traversal
// flush
flush()
} else {
writer.revertPartialWritesAndClose()
}
success = true
} finally {
if (success) {
writer.close()
} else {
writer.revertPartialWritesAndClose()
if (file.exists()) {
if (!file.delete()) {
logWarning(s"Error deleting ${file}")
}
}
}
}
SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
}
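To summarize what this method produces, here is a hedged sketch (SpilledFileInfo is an illustrative stand-in, not the actual Spark case class) of the bookkeeping for a spill with serializerBatchSize = 3, 4 records in partition 0 and 3 records in partition 1:
case class SpilledFileInfo(
  batchSizes: Array[Long],             // serialized byte length of each flushed batch, in write order
  elementsPerPartition: Array[Long])   // number of spilled records per partition

val info = SpilledFileInfo(
  batchSizes = Array(120L, 110L, 40L),        // illustrative byte counts for batches of 3, 3 and 1 records
  elementsPerPartition = Array(4L, 3L))
// Both arrays are needed later to re-read the spill file batch by batch
// and to locate each partition's records.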
Sort
Let's go back to SortShuffleWriter.write:
// During the external sort,
// part of the data may still be in memory
// while the rest is in one or more spill files;
// they need to be merged into one big file
val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
Call stack:
- util.collection.ExternalSorter.writePartitionedFile
- util.collection.ExternalSorter.destructiveSortedWritablePartitionedIterator
- util.collection.ExternalSorter.partitionedIterator
- partitionedDestructiveSortedIterator
util.collection.ExternalSorter.writePartitionedFile
Let's take a closer look at writePartitionedFile, which writes all the data added to the ExternalSorter into a single disk file:
def writePartitionedFile(
blockId: BlockId,
outputFile: File): Array[Long] = {
// Track the length of each partition in the output file
val lengths = new Array[Long](numPartitions)
val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
context.taskMetrics().shuffleWriteMetrics)
if (spills.isEmpty) {
// When all the data is still in memory (no spills)
val collection = if (aggregator.isDefined) map else buffer
val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
while (it.hasNext) {
val partitionId = it.nextPartition()
while (it.hasNext && it.nextPartition() == partitionId) {
it.writeNext(writer)
}
val segment = writer.commitAndGet()
lengths(partitionId) = segment.length
}
} else {
// Otherwise, merge-sort must be done.
// Get a partition iterator
// And write all the data directly
for ((id, elements) <- this.partitionedIterator) {
if (elements.hasNext) {
for (elem <- elements) {
writer.write(elem._1, elem._2)
}
val segment = writer.commitAndGet()
lengths(id) = segment.length
}
}
}
writer.close()
context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)
lengths
}
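The returned lengths array is what ultimately drives the shuffle index: conceptually, cumulative sums of the partition lengths give the byte offsets inside the single output file. A small sketch with illustrative numbers:
val lengths = Array(100L, 0L, 250L, 75L)      // bytes written for partitions 0..3
val offsets = lengths.scanLeft(0L)(_ + _)     // 0, 100, 100, 350, 425
// A reducer fetching partition 2 reads bytes [offsets(2), offsets(3)) = [100, 350).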
util.collection.ExternalSorter.destructiveSortedWritablePartitionedIterator
In writePartitionedFile, an iterator is created with destructiveSortedWritablePartitionedIterator:
val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
We also saw it earlier, in util.collection.Spillable.spill:
val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
Let's look at destructiveSortedWritablePartitionedIterator:
def destructiveSortedWritablePartitionedIterator(keyComparator: Option[Comparator[K]])
: WritablePartitionedIterator = {
// Generate the underlying sorted iterator
val it = partitionedDestructiveSortedIterator(keyComparator)
new WritablePartitionedIterator {
private[this] var cur = if (it.hasNext) it.next() else null
def writeNext(writer: DiskBlockObjectWriter): Unit = {
writer.write(cur._1._2, cur._2)
cur = if (it.hasNext) it.next() else null
}
def hasNext(): Boolean = cur != null
def nextPartition(): Int = cur._1._1
}
}
You can see that WritablePartitionedIterator is essentially a proxy over the iterator returned by partitionedDestructiveSortedIterator. Instead of returning each record, writeNext takes a DiskBlockObjectWriter and writes the record to it. We will set partitionedDestructiveSortedIterator aside for the moment and keep going.
util.collection.ExternalSorter.partitionedIterator
The other branch calls partitionedIterator to get an iterator over partitions and writes all the data out directly. Let's take a closer look at partitionedIterator:
def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
val usingMap = aggregator.isDefined
val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
if (spills.isEmpty) {
// When there are no spill files
// (following the writePartitionedFile flow above, we never reach this branch)
if (!ordering.isDefined) {
// If keys don't need to be sorted,
// order by partition only
groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
} else {
// Otherwise, sort by both partition and key
groupByPartition(destructiveIterator(
collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
}
} else {
// When there are spill files,
// merge the spilled temporary files with the data still in memory
merge(spills, destructiveIterator(
collection.partitionedDestructiveSortedIterator(comparator)))
}
}
Let's first look at the spills.isEmpty case, which splits into two sub-cases:
- Sort by partition only:
partitionedDestructiveSortedIterator is passed None, meaning keys are not sorted. Sorting by partition is always done inside partitionedDestructiveSortedIterator; we will come back to it later.
groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
After the sort, the records are grouped by partition:
private def groupByPartition(data: Iterator[((Int, K), C)])
: Iterator[(Int, Iterator[Product2[K, C]])] =
{
val buffered = data.buffered
(0 until numPartitions).iterator.map(p => (p, new IteratorForPartition(p, buffered)))
}
IteratorForPartition is an iterator over a single partition:
private[this] class IteratorForPartition(partitionId: Int, data: BufferedIterator[((Int, K), C)])
extends Iterator[Product2[K, C]]
{
override def hasNext: Boolean = data.hasNext && data.head._1._1 == partitionId
override def next(): Product2[K, C] = {
if (!hasNext) {
throw new NoSuchElementException
}
val elem = data.next()
(elem._1._2, elem._2)
}
}
- Sort by partition and key:
groupByPartition(destructiveIterator( collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
Here keyComparator is passed into partitionedDestructiveSortedIterator:
private val keyComparator: Comparator[K] = ordering.getOrElse(new Comparator[K] {
override def compare(a: K, b: K): Int = {
val h1 = if (a == null) 0 else a.hashCode()
val h2 = if (b == null) 0 else b.hashCode()
if (h1 < h2) -1 else if (h1 == h2) 0 else 1
}
})
The keys are sorted by their hashCode, and groupByPartition is then called to group the records by partition.
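One thing worth noting: because this comparator only looks at hashCode, distinct keys with the same hash compare as equal. That is enough to bring equal keys next to each other for aggregation, but it is not a semantic ordering of the keys. A tiny illustration:
val h1 = "Aa".hashCode
val h2 = "BB".hashCode
// h1 == h2 == 2112, so keyComparator treats "Aa" and "BB" as equal,
// even though the keys themselves differ.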
When spill files exist, the comparator method is used:
private def comparator: Option[Comparator[K]] = {
// If sorting or aggregation is required
if (ordering.isDefined || aggregator.isDefined) {
Some(keyComparator)
} else {
None
}
}
partitionedDestructiveSortedIterator
Now let's take a look at partitionedDestructiveSortedIterator. It is a method of the trait WritablePartitionedPairCollection, which is implemented by both PartitionedAppendOnlyMap and PartitionedPairBuffer. In partitionedIterator you can see that:
val usingMap = aggregator.isDefined
val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
If aggregation is required, the PartitionedAppendOnlyMap is used; otherwise, the PartitionedPairBuffer.
util.collection.PartitionedPairBuffer.partitionedDestructiveSortedIterator
Let's start with the simpler PartitionedPairBuffer.partitionedDestructiveSortedIterator:
override def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
: Iterator[((Int, K), V)] = {
val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
// Sort the data
new Sorter(new KVArraySortDataFormat[(Int, K), AnyRef]).sort(data, 0, curSize, comparator)
iterator
}
We can see that:
val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
If a keyComparator is passed in, it is wrapped by partitionKeyComparator, which performs a secondary sort by partition and then by key. If the incoming keyComparator is None, only the partition is sorted:
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] = {
new Comparator[(Int, K)] {
override def compare(a: (Int, K), b: (Int, K)): Int = {
val partitionDiff = a._1 - b._1
if (partitionDiff != 0) {
partitionDiff
} else {
keyComparator.compare(a._2, b._2)
}
}
}
}
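A quick, self-contained sketch of the secondary sort on hypothetical data: records tagged with (partition, key) end up ordered by partition id first and by key within each partition:
val records = Array((1, "b"), (0, "z"), (1, "a"), (0, "c"))
val sorted = records.sortWith { (a, b) =>
  if (a._1 != b._1) a._1 < b._1 else a._2 < b._2   // partition first, then key
}
// sorted: (0,"c"), (0,"z"), (1,"a"), (1,"b")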
The data is then sorted with Sorter, which uses TimSort internally; we will dig into that in a future post.
Finally, the iterator is returned, which simply traverses the data in pairs:
private def iterator(): Iterator[((Int, K), V)] = new Iterator[((Int, K), V)] {
var pos = 0
override def hasNext: Boolean = pos < curSize
override def next(): ((Int, K), V) = {
if (!hasNext) {
throw new NoSuchElementException
}
val pair = (data(2 * pos).asInstanceOf[(Int, K)], data(2 * pos + 1).asInstanceOf[V])
pos += 1
pair
}
}
util.collection.PartitionedAppendOnlyMap.partitionedDestructiveSortedIterator
def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
: Iterator[((Int, K), V)] = {
val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
destructiveSortedIterator(comparator)
}
util.collection.AppendOnlyMap.destructiveSortedIterator
def destructiveSortedIterator(keyComparator: Comparator[K]): Iterator[(K, V)] = {
destroyed = true
// Compact all key-value pairs to the front of the array
var keyIndex, newIndex = 0
while (keyIndex < capacity) {
if (data(2 * keyIndex) != null) {
data(2 * newIndex) = data(2 * keyIndex)
data(2 * newIndex + 1) = data(2 * keyIndex + 1)
newIndex += 1
}
keyIndex += 1
}
assert(curSize == newIndex + (if (haveNullValue) 1 else 0))
new Sorter(new KVArraySortDataFormat[K, AnyRef]).sort(data, 0, newIndex, keyComparator)
// Return the new Iterator
new Iterator[(K, V)] {
var i = 0
var nullValueReady = haveNullValue
def hasNext: Boolean = (i < newIndex || nullValueReady)
def next(): (K, V) = {
if (nullValueReady) {
nullValueReady = false
(null.asInstanceOf[K], nullValue)
} else {
val item = (data(2 * i).asInstanceOf[K], data(2 * i + 1).asInstanceOf[V])
i += 1
item
}
}
}
}