Differences between Spark coalesce and repartition

Posted by Fabio9999 on Sat, 11 May 2019 03:27:02 +0200

Source package: org.apache.spark.rdd

def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[T] = null): RDD[T]

Return a new RDD that is reduced into numPartitions partitions.

This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

Translation:

Returns a new RDD reduced to numPartitions partitions. This results in a narrow dependency: for example, if you go from 1,000 partitions down to 100, no shuffle occurs; instead, each of the 100 new partitions claims about 10 of the current partitions. However, if you merge partitions drastically, for example down to a single partition, the computation may end up running on fewer nodes than you would like (only one node when numPartitions = 1), losing parallelism. To avoid this, pass true as the second (shuffle) parameter. This adds one more shuffle step during the repartitioning, but it means the upstream partitions can still be executed in parallel.
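A quick way to observe this difference is RDD.toDebugString, which prints the lineage and reveals whether a shuffle stage appears. A minimal sketch, assuming a spark-shell session where sc is already defined (the partition counts are illustrative):

    val rdd = sc.parallelize(1 to 1000, 1000)

    // Narrow dependency: each of the 100 new partitions claims ~10 parent partitions.
    val narrow = rdd.coalesce(100)
    println(narrow.toDebugString)   // no ShuffledRDD appears in the lineage

    // Drastic coalesce with shuffle = true: upstream tasks still run with full
    // parallelism, then a shuffle funnels the results into a single partition.
    val drastic = rdd.coalesce(1, shuffle = true)
    println(drastic.toDebugString)  // the lineage now contains a ShuffledRDD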

 

Be careful:

The second parameter, shuffle = true, also lets coalesce produce more partitions than before. For example, if you have a small number of partitions, say 100, calling coalesce(1000, shuffle = true) will use a HashPartitioner to redistribute the data into 1,000 partitions across the cluster nodes. This is useful for increasing parallelism.
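A minimal sketch of this growth behavior, again assuming an existing SparkContext sc:

    val small = sc.parallelize(1 to 10000, 100)
    small.coalesce(1000).partitions.length                  // still 100: cannot grow without a shuffle
    small.coalesce(1000, shuffle = true).partitions.length  // 1000: redistributed by a HashPartitioner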

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.

If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.

Translation:

* Returns an RDD with exactly numPartitions partitions; this can increase or decrease the parallelism of this RDD. Internally, it uses a shuffle to redistribute the data. If you are only reducing the number of partitions, consider using coalesce instead, which can avoid the shuffle.
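This relationship is visible in the Spark source itself: repartition simply delegates to coalesce with shuffle forced on (paraphrased from org.apache.spark.rdd.RDD; the withScope wrapper is omitted for brevity):

    def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
      coalesce(numPartitions, shuffle = true)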

Reference resources: https://blog.csdn.net/dax1n/article/details/53431373 

coalesce and repartition both repartition an RDD.
The difference: coalesce does not shuffle by default (shuffle = false), while repartition always shuffles.
Examples:
    val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 2)
    rdd1.partitions.length              // view the number of partitions: 2
    val rdd2 = rdd1.repartition(3)      // rdd2.partitions.length: 3 (always shuffles)
    val rdd3 = rdd1.coalesce(3, false)  // rdd3.partitions.length: 2 (no shuffle, so the count cannot grow)
    val rdd4 = rdd1.coalesce(3, true)   // rdd4.partitions.length: 3 (the shuffle allows growth)
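One practical consequence of the rules above: when writing results to a single output file, keep shuffle = true (or equivalently call repartition(1)) so the upstream computation still runs in parallel and only the final write is funneled into one task. The output path below is hypothetical:

    val result = rdd1.map(_ * 2)
    // coalesce(1) without a shuffle would pull the map itself onto a single task;
    // shuffle = true lets the map run on all upstream partitions first.
    result.coalesce(1, shuffle = true).saveAsTextFile("/tmp/single-file-output")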

Topics: Big Data Spark Apache