Spark Source Reading: Shuffle Process Analysis

ShuffleManager (1) In this article, let's take a look at ShuffleManager, another important module in the Spark kernel. Shuffle is arguably one of the most important concepts in distributed computing; it is required for data joining, aggregation, de-duplication, and so on. On the other hand, one of the main reasons why Spark performs better t ...
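As a minimal sketch of why shuffle matters, the word count below uses `reduceByKey`, which forces a shuffle: every record with the same key must be routed to the same partition before it can be aggregated. The local `SparkSession` setup and the tiny dataset are illustrative assumptions, not part of the article.

```scala
import org.apache.spark.sql.SparkSession

// Local session purely for illustration; any master would do.
val spark = SparkSession.builder().master("local[2]").appName("shuffle-sketch").getOrCreate()

// reduceByKey is a wide transformation: it repartitions the data by key,
// which is exactly the shuffle step this series analyzes.
val counts = spark.sparkContext
  .parallelize(Seq("a", "b", "a", "c", "a"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collectAsMap()

spark.stop()
```

Narrow transformations such as `map` never need this data movement; only the key-based regrouping does, which is why joins, aggregations, and de-duplication all pay the shuffle cost.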

Posted by johnh on Fri, 14 Jun 2019 20:53:13 +0200

Differences Between Spark coalesce and repartition

Source package: org.apache.spark.rdd def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[T] = null): RDD[T] Return a new RDD that is reduced into numPartitions partitions. This results in a narrow dependency, e.g. if you g ...
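A minimal sketch of the practical difference, assuming a local SparkSession (value names like `mergedParts` are mine, not from the article): `repartition(n)` is implemented as `coalesce(n, shuffle = true)`, so it always shuffles and can increase the partition count, while plain `coalesce(n)` only merges existing partitions through a narrow dependency.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("coalesce-vs-repartition").getOrCreate()

// Start with 8 partitions so both directions (shrink and grow) are visible.
val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 8)

// coalesce without shuffle: narrow dependency, existing partitions are merged.
val mergedParts = rdd.coalesce(2).getNumPartitions
// repartition = coalesce(..., shuffle = true): a full shuffle, so it can grow the count.
val grownParts = rdd.repartition(16).getNumPartitions
// Without a shuffle, coalesce cannot increase the partition count; this stays at 8.
val cappedParts = rdd.coalesce(16).getNumPartitions

spark.stop()
```

This is why shrinking after a filter is cheap with `coalesce`, while rebalancing skewed data (or growing parallelism) requires `repartition` and its shuffle cost.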

Posted by Fabio9999 on Sat, 11 May 2019 03:27:02 +0200

Converting RDDs to DataFrames in Spark SQL

Case: 1. Convert an RDD into a DataFrame programmatically, by dynamically building the schema from metadata (RDD2DataFrameProgrammatically). 2. Convert an RDD into a DataFrame by reflection (RDD2DataFrameByReflecting). After an RDD is converted to a DataFrame, we can use Spark SQL to query any data that can be built into an RDD, such as data on HDFS. This function is extre ...

Posted by webdata on Mon, 06 May 2019 13:45:04 +0200