Spark transformation operators and a case study

Posted by ciaran on Sun, 10 Oct 2021 12:07:26 +0200

Transformation operators

map: processes one element at a time. Elements within a partition are processed in order; there is no ordering guarantee across partitions

(handling the data of a partition one element at a time is powerful but comparatively inefficient)

val result1: RDD[Int] = rdd.map(num => {
      num * 2
    })

mapPartitions: fetches one whole partition at a time and runs the computation over that partition

Highly efficient, but because of object references the data of a partition is released only after the whole partition has been processed. With little memory and a large amount of data, this can cause an out-of-memory error

val result2: RDD[Int] = rdd.mapPartitions(x => { x.map(_ * 2) })
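The difference between the two operators can be sketched as follows. This is a minimal sketch, assuming a local Spark master; the object and method names are hypothetical and are not part of the original post. The last method shows something that is awkward to express with map alone: producing one result per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object MapPartitionsSketch {
  // The function is called once per element
  def doubledWithMap(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * 2)

  // The function is called once per partition and receives an iterator over it
  def doubledWithMapPartitions(rdd: RDD[Int]): RDD[Int] =
    rdd.mapPartitions(iter => iter.map(_ * 2))

  // One value per non-empty partition: the partition maximum
  def partitionMaxima(rdd: RDD[Int]): RDD[Int] =
    rdd.mapPartitions(iter => if (iter.hasNext) Iterator(iter.max) else Iterator.empty)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("mapPartitions-sketch")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2) // partitions: (1, 2) and (3, 4)
    println(doubledWithMap(rdd).collect().mkString(","))
    println(doubledWithMapPartitions(rdd).collect().mkString(","))
    println(partitionMaxima(rdd).collect().mkString(","))
    sc.stop()
  }
}
```

Returning a lazy iterator from mapPartitions, as above, avoids materializing the whole partition in memory at once.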

glom: turns the data of each partition into an array

val glomRdd: RDD[Array[Int]] = rdd.glom()
    val maxRDD: RDD[Int] = glomRdd.map(
      data => data.max
    )
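A typical use of glom is to compute something per partition and then combine the results. The sketch below sums the maximum of each partition; the object and method names are hypothetical, and the filter on empty arrays is an added safeguard, because max fails on an empty partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object GlomSketch {
  // Sum of the per-partition maxima; empty partitions are skipped
  def sumOfPartitionMaxima(rdd: RDD[Int]): Int =
    rdd.glom().filter(_.nonEmpty).map(_.max).collect().sum

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("glom-sketch")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2) // partitions: (1, 2) and (3, 4)
    println(sumOfPartitionMaxima(rdd)) // 2 + 4
    sc.stop()
  }
}
```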

groupBy: every element in the data source is assigned a key by the given function, and elements with the same key are placed in the same group

    val rdd: RDD[String] = sc.makeRDD(List("hello java", "spark", "scala"), 2)
    def func1(string: String):Char = {
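The snippet above is cut off, so here is a self-contained sketch of groupBy. It assumes the intent was to group the strings by their first character; that key function, and the object and method names, are assumptions and not taken from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object GroupBySketch {
  // Assumed key function: the first character of the string
  def firstChar(s: String): Char = s.charAt(0)

  // Groups the strings by key and sorts each group so the result is deterministic
  def groupByFirstChar(rdd: RDD[String]): Map[Char, List[String]] =
    rdd.groupBy(s => firstChar(s)).mapValues(_.toList.sorted).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("groupBy-sketch")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List("hello java", "spark", "scala"), 2)
    groupByFirstChar(rdd).foreach(println)
    sc.stop()
  }
}
```

groupBy involves a shuffle, because elements with the same key may start out in different partitions.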

filter: keeps the elements for which the given function, which returns a Boolean, evaluates to true

    val rdd: RDD[String] = sc.makeRDD(List("hello java", "spark", "scala"), 2)
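The filter call itself is missing above. A minimal sketch, assuming the goal is to keep only the strings that start with "s" (the predicate and all names here are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object FilterSketch {
  // Keeps the elements for which the predicate returns true
  def startsWithS(rdd: RDD[String]): RDD[String] = rdd.filter(_.startsWith("s"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("filter-sketch")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List("hello java", "spark", "scala"), 2)
    val kept = startsWithS(rdd)
    println(kept.collect().mkString(","))
    // filter does not change the number of partitions
    println(kept.getNumPartitions)
    sc.stop()
  }
}
```

Because the partition count stays the same, partitions can become uneven after filtering when most of the removed elements come from a few partitions.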


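sample: draws a random sample from the data set

    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7), 2)
    rdd.sample(
      false, // Whether an element is put back after being drawn; true means sampling with replacement
      0.4, // Without replacement: the probability that each element is selected. With replacement: the expected number of times each element is drawn, so even a value greater than 1 does not guarantee that an element is drawn
      1 // Random number seed; optional
    )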


distinct: removes duplicate elements

    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 1, 3, 3, 4, 5, 6, 7), 2)
    rdd.distinct()
    // Implemented in the Spark source as:
    // case _ => map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)

coalesce: reduce partitions

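(without shuffle, whole partitions are merged as they are, so the resulting partitions may be unevenly sized, i.e. skewed)

repartition: internally calls coalesce(numPartitions, shuffle = true)

It can increase the number of partitions. It delegates to coalesce with shuffle always set to true, whereas coalesce itself defaults to shuffle = false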

    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2)
    rdd.coalesce(2, true).saveAsTextFile("out")
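The effect of the shuffle flag on the partition count can be checked with getNumPartitions. This is a minimal sketch with hypothetical object and method names: without shuffle, coalesce can shrink the partition count but cannot grow it, while repartition can.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object CoalesceSketch {
  // Partition counts after: coalesce down, coalesce up (no shuffle), repartition up
  def partitionCounts(rdd: RDD[Int]): (Int, Int, Int) = {
    val shrunk = rdd.coalesce(2)      // no shuffle: partitions are merged
    val notGrown = rdd.coalesce(6)    // no shuffle: the count cannot increase
    val grown = rdd.repartition(6)    // coalesce(6, shuffle = true)
    (shrunk.getNumPartitions, notGrown.getNumPartitions, grown.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 3)
    println(partitionCounts(rdd))
    sc.stop()
  }
}
```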

sortBy: performs a shuffle. By default the number of partitions is unchanged and the order is ascending

val rdd: RDD[(Int, String)] = sc.makeRDD(List((1, "2"), (3, "4"), (2, "6")), 2)
    // It's through shuffle
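The sortBy call is not shown above. A minimal sketch, assuming the pairs are sorted by their Int key (the key choice and all names are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object SortBySketch {
  // Ascending order is the default
  def ascending(rdd: RDD[(Int, String)]): RDD[(Int, String)] = rdd.sortBy(_._1)

  // Pass ascending = false for descending order
  def descending(rdd: RDD[(Int, String)]): RDD[(Int, String)] =
    rdd.sortBy(_._1, ascending = false)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("sortBy-sketch")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List((1, "2"), (3, "4"), (2, "6")), 2)
    println(ascending(rdd).collect().mkString(","))
    println(descending(rdd).collect().mkString(","))
    // The partition count is unchanged unless numPartitions is passed explicitly
    println(ascending(rdd).getNumPartitions)
    sc.stop()
  }
}
```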

Double-value operations on two RDDs (intersection, union, subtract, zip)

    // intersection, union and subtract require both RDDs to have the same element type; zip allows different types
    // Intersection
    val rddInt: RDD[Int] = rdd1.intersection(rdd2)
    // Union
    val rddUnion: RDD[Int] = rdd1.union(rdd2)
    // Difference
    val rddSub: RDD[Int] = rdd1.subtract(rdd2)
    // Zip
    val rddZip: RDD[(Int, Int)] = rdd1.zip(rdd2)


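partitionBy: repartitions key-value data with the given partitioner

    val rddNew: RDD[(Int, Int)] = rdd.map((_, 1))
    // HashPartitioner is built in; a custom Partitioner can be written instead
    val result: RDD[(Int, Int)] = rddNew.partitionBy(new HashPartitioner(2))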

groupByKey: takes no function; the first element of each pair is used as the key

The difference between groupBy and groupByKey is that groupBy keeps the complete K-V pair in each group, while groupByKey keeps only the values

    val GroupRDD: RDD[(Int, Iterable[(Int, String)])] = rdd.groupBy(_._1)
    // (1,CompactBuffer((1,spark), (1,hello)))(3,CompactBuffer((3,scala)))(2,CompactBuffer((2,java)))

    val GroupByRDD: RDD[(Int, Iterable[String])] = rdd.groupByKey()
    // (1,CompactBuffer(spark, hello))(3,CompactBuffer(scala))(2,CompactBuffer(java))

Supplement: the difference between groupByKey and reduceByKey

From the perspective of shuffle: groupByKey redistributes all the data. It shuffles and repartitions every record, and only afterwards can a map compute the statistics
A shuffle must write its data to disk; the data cannot be held in memory while waiting
reduceByKey pre-aggregates within each partition first (combine: map-side pre-aggregation, similar to MapReduce), so less data goes through the shuffle and performance is higher
From the perspective of function: groupByKey only groups, so any other operation can follow it and it is more flexible. reduceByKey aggregates the values of the same key; if only grouping is required, it cannot be used

    val rdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 1), (2, 1), (1, 1), (3, 1)), 1)
    rdd.reduceByKey(_ + _).collect().foreach(println)
    rdd.groupByKey().map({ case (word, iter) => (word, iter.sum) }).collect().foreach(println)
  • aggregateByKey: takes two parameter lists. The first holds the initial value; the second holds two functions: the first is applied within each partition, the second merges results across partitions
val rdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 1), (2, 1), (1, 1), (3, 1)), 2)
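To make the two functions visible, it helps to use a different rule inside and between partitions. The sketch below takes the maximum per key within each partition and then sums those maxima across partitions. The sample data and all names are hypothetical, and the initial value 0 assumes the values are non-negative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object AggregateByKeySketch {
  // Within a partition: max per key. Between partitions: sum of the maxima.
  def sumOfPartitionMaxima(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.aggregateByKey(0)(
      (acc, value) => math.max(acc, value), // intra-partition rule
      (left, right) => left + right         // inter-partition rule
    ).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("aggregateByKey-sketch")
    val sc = new SparkContext(conf)
    // Partition 0: (a,1) (a,2) (b,3); partition 1: (b,4) (b,5) (a,6)
    val rdd = sc.makeRDD(
      List(("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5), ("a", 6)), 2)
    // a: max 2 in partition 0 plus max 6 in partition 1; b: 3 plus 5
    sumOfPartitionMaxima(rdd).foreach(println)
    sc.stop()
  }
}
```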

foldByKey: used when the calculation rule within a partition and between partitions is the same

It is equivalent to aggregateByKey with the same function passed for both parameters of the second parameter list

val rdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 1), (2, 1), (1, 1), (3, 1)), 1)
    rdd.foldByKey(0)(_ + _).collect.foreach(println)

Use aggregateByKey to find the average value for each key

    val rdd: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("a", 3), ("b", 1), ("b", 5)), 2)
    /*
     * Initial value (0, 0): the first 0 is the running sum of the values, the second 0 is the number of values seen for the key in the partition. Each step produces an updated accumulator.
     */
    val rdd1: RDD[(String, (Int, Int))] = rdd.aggregateByKey((0, 0))(
      // First step in a partition: init_tuple is (0, 0) and value is the 1 from ("a", 1). 0 + value gives the running sum and 0 + 1 gives the count, so the result is (1, 1)
      (init_tuple, value) => {
        (init_tuple._1 + value, init_tuple._2 + 1)
      },
      // Merge the results of the partitions: sums are added to sums and counts to counts
      (part1, part2) => {
        (part1._1 + part2._1, part1._2 + part2._2)
      }
    )
    val rdd2: RDD[(String, Int)] = rdd1.mapValues({ case (x, y) => {
      x / y // integer division
    }})

You can use combineByKey instead:

    val rdd: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("a", 3), ("b", 1), ("b", 5)), 2)

    /*
     * Three parameters are required
     * First parameter: converts the structure of the first value seen for a key
     * Second parameter: calculation rule within a partition
     * Third parameter: calculation rule between partitions
     */
    val rdd2: RDD[(String, (Int, Int))] = rdd.combineByKey(
      x => (x, 1),
      (init_tuple: (Int, Int), value) => {
        (init_tuple._1 + value, init_tuple._2 + 1)
      },
      (part1: (Int, Int), part2: (Int, Int)) => {
        (part1._1 + part2._1, part1._2 + part2._2)
      }
    )
  • join, leftOuterJoin, rightOuterJoin, cogroup (cogroup does not multiply out matching records the way join does; for each key it collects the values from each of the two RDDs into its own Iterable)
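The difference between these operators shows up when a key is repeated on one side or missing on the other. This is a minimal sketch with hypothetical sample data and names:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object JoinSketch {
  type Pairs = RDD[(String, Int)]

  // Only keys present on both sides; one output record per matching pair
  def inner(l: Pairs, r: Pairs): Set[(String, (Int, Int))] =
    l.join(r).collect().toSet

  // Every key of the left side; a missing right value becomes None
  def leftOuter(l: Pairs, r: Pairs): Set[(String, (Int, Option[Int]))] =
    l.leftOuterJoin(r).collect().toSet

  // One record per key, holding the values of each side as separate collections
  def grouped(l: Pairs, r: Pairs): Map[String, (List[Int], List[Int])] =
    l.cogroup(r)
      .mapValues { case (ls, rs) => (ls.toList.sorted, rs.toList.sorted) }
      .collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("join-sketch")
    val sc = new SparkContext(conf)
    val left = sc.makeRDD(List(("a", 1), ("b", 2), ("c", 3)))
    val right = sc.makeRDD(List(("a", 4), ("a", 5), ("b", 6)))
    inner(left, right).foreach(println)     // "a" appears twice, "c" is dropped
    leftOuter(left, right).foreach(println) // "c" is kept with None
    grouped(left, right).foreach(println)   // "a" appears once with both right values
    sc.stop()
  }
}
```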

Case study: the top three cities by ad clicks in each province

// timestamp  province  city  user  ad
// 1635304263 Guangdong Shenzhen yan2 N
    val rdd: RDD[String] = sc.textFile("C:\\Users\\93134\\Desktop\\a.txt")
    // Key each line by (province, city) with a count of 1
    val mapRdd: RDD[((String, String), Int)] = rdd.map(line => {
      val datas: Array[String] = line.split(" ")
      ((datas(1), datas(2)), 1)
    })

    // ((Jiangxi, Nanchang), 115)
    val reduceRdd: RDD[((String, String), Int)] = mapRdd.reduceByKey(_ + _)

    // (Jiangxi, (Nanchang, 115))
    val mapRdd2: RDD[(String, (String, Int))] = reduceRdd.map {
      // Accessing fields with _._1 is not recommended; pattern matching can be used directly
      case ((province, city), sum) => {
        (province, (city, sum))
      }
    }
    // (Jiangxi, CompactBuffer((Ji'an, 104), (Ganzhou, 121), (Nanchang, 115)))
    val groupRdd: RDD[(String, Iterable[(String, Int)])] = mapRdd2.groupByKey()

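    // (Jiangxi, List((Ganzhou, 121), (Nanchang, 115), (Ji'an, 104)))
    val resultRdd: RDD[(String, List[(String, Int)])] = groupRdd.mapValues(values => {
      // Sort by click count in descending order and keep the top three
      values.toList.sortBy(_._2)(Ordering.Int.reverse).take(3)
    })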
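To try the whole pipeline without the input file, the steps can be chained and fed from an in-memory list. This is a sketch under assumptions: the sample lines are made up, the field layout (timestamp, province, city, user, ad, separated by single spaces) is taken from the comments above, and the object and method names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only
object TopCitiesSketch {
  // For each province, the n cities with the most clicks, in descending order
  def topCities(lines: RDD[String], n: Int): RDD[(String, List[(String, Int)])] =
    lines
      .map { line =>
        val fields = line.split(" ")
        ((fields(1), fields(2)), 1) // ((province, city), 1)
      }
      .reduceByKey(_ + _) // clicks per (province, city)
      .map { case ((province, city), count) => (province, (city, count)) }
      .groupByKey()
      .mapValues(_.toList.sortBy { case (_, count) => -count }.take(n))

  // Made-up input: `clicks` identical lines for one province and city
  def sampleLines(province: String, city: String, clicks: Int): List[String] =
    List.fill(clicks)(s"1635304263 $province $city user1 A")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("top-cities-sketch")
    val sc = new SparkContext(conf)
    val lines = sc.makeRDD(
      sampleLines("Jiangxi", "Nanchang", 2) ++ sampleLines("Jiangxi", "Ganzhou", 3) ++
        sampleLines("Jiangxi", "Ji'an", 1) ++ sampleLines("Jiangxi", "Jiujiang", 4) ++
        sampleLines("Guangdong", "Shenzhen", 1), 2)
    topCities(lines, 3).collect().foreach(println)
    sc.stop()
  }
}
```

Grouping with groupByKey and sorting each group in memory is fine when a province has few cities; for very large groups a bounded top-n aggregation would be the safer design.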

Topics: Scala Big Data Spark