Transformation operators
map: within a partition, elements are processed in order; across partitions there is no ordering guarantee
(map fetches and processes one element at a time within each partition, which is flexible but less efficient)
```scala
val result1: RDD[Int] = rdd.map(num => {
  println(num)
  num * 2
})
```
mapPartitions: fetches an entire partition at a time and processes it within that partition
It is more efficient, but a partition's data is only released after the whole partition has been processed (because of the object references held by the iterator). With limited memory and a large data volume this can lead to out-of-memory errors.
```scala
val result2: RDD[Int] = rdd.mapPartitions(iter => {
  iter.map(_ * 2)
})
```
glom: turns the data of each partition into an array
```scala
val glomRdd: RDD[Array[Int]] = rdd.glom()
val maxRDD: RDD[Int] = glomRdd.map(data => data.max)
println(maxRDD.collect().sum)
```
groupBy: each element in the data source is grouped by the key produced by the given function
```scala
val rdd: RDD[String] = sc.makeRDD(List("hello java", "spark", "scala"), 2)

def func1(string: String): Char = {
  string(0)
}

rdd.groupBy(func1).collect().foreach(println)
```
filter: keeps the elements for which the given predicate returns true
```scala
val rdd: RDD[String] = sc.makeRDD(List("hello java", "spark", "scala"), 2)
rdd.filter(_.startsWith("s")).collect().foreach(println)
```
sample: draws a random sample from the data source
```scala
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7), 2)
println(rdd.sample(
  false, // whether elements are put back after being drawn; true means sampling with replacement
  0.4,   // the probability that each element is drawn; a high probability still does not guarantee an element is drawn
  1      // random number seed; optional
).collect().mkString(","))
```
distinct: removes duplicate elements
```scala
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 1, 3, 3, 4, 5, 6, 7), 2)
// Internally: map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)
rdd.distinct().collect().foreach(print)
```
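For reference, a minimal sketch of the manual equivalent that the comment above describes, applied to the same rdd:

```scala
// Manual equivalent of distinct(): pair each element with null,
// keep one value per key via reduceByKey, then drop the null
val manualDistinct: RDD[Int] = rdd
  .map(x => (x, null))
  .reduceByKey((x, _) => x)
  .map(_._1)
manualDistinct.collect().foreach(print)
```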
coalesce: reduces the number of partitions
(If shuffle is not enabled, whole partitions are simply merged, which can lead to data skew afterwards.)
repartition: implemented as coalesce with shuffle = true
It can therefore also increase the number of partitions, since it always calls coalesce with shuffle = true (see the sketch after the example below).
```scala
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2)
rdd.coalesce(2, true).saveAsTextFile("out")
```
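A minimal sketch of increasing the partition count with repartition (assuming the same sc); getNumPartitions is used only to show the effect:

```scala
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2)

// repartition(4) is equivalent to coalesce(4, shuffle = true)
val moreParts: RDD[Int] = rdd.repartition(4)
println(moreParts.getNumPartitions) // 4

// Without shuffle, coalesce cannot increase the partition count; it stays at 2
println(rdd.coalesce(4).getNumPartitions) // 2
```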
sortBy: performs a shuffle; by default the number of partitions stays the same and the sort order is ascending
```scala
val rdd: RDD[(Int, String)] = sc.makeRDD(List((1, "2"), (3, "4"), (2, "6")), 2)
// sortBy goes through a shuffle
rdd.sortBy(_._2, false).collect().foreach(println)
```
Double-value (two-RDD) operations: intersection, union, difference, zip
intersection, union, and subtract require both RDDs to have the same element type; zip does not.
```scala
// intersection
val rddInt: RDD[Int] = rdd1.intersection(rdd2)
// union
val rddUnion: RDD[Int] = rdd1.union(rdd2)
// difference (subtract)
val rddSub: RDD[Int] = rdd1.subtract(rdd2)
// zip
val rddZip: RDD[(Int, Int)] = rdd1.zip(rdd2)
```
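For reference, a self-contained sketch with assumed sample inputs (note that zip additionally requires the two RDDs to have the same number of partitions and the same number of elements per partition):

```scala
val rdd1: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
val rdd2: RDD[Int] = sc.makeRDD(List(3, 4, 5, 6), 2)

println(rdd1.intersection(rdd2).collect().mkString(",")) // 3,4 (order may vary)
println(rdd1.union(rdd2).collect().mkString(","))        // 1,2,3,4,3,4,5,6
println(rdd1.subtract(rdd2).collect().mkString(","))     // 1,2 (order may vary)
println(rdd1.zip(rdd2).collect().mkString(","))          // (1,3),(2,4),(3,5),(4,6)
```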
partitionBy: repartitions a key-value RDD according to the given Partitioner
```scala
import org.apache.spark.HashPartitioner

val rddNew: RDD[(Int, Int)] = rdd.map((_, 1))
// HashPartitioner is the built-in partitioner; a custom Partitioner can be used instead (see the sketch below)
val result: RDD[(Int, Int)] = rddNew.partitionBy(new HashPartitioner(2))
result.saveAsTextFile("out")
```
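As noted in the comment, the built-in HashPartitioner can be replaced by your own Partitioner. A minimal sketch, reusing rddNew from above (the class name, routing rule, and output path are illustrative):

```scala
import org.apache.spark.Partitioner

// Hypothetical example: send key 1 to partition 0 and every other key to partition 1
class MyPartitioner extends Partitioner {
  override def numPartitions: Int = 2

  override def getPartition(key: Any): Int = key match {
    case 1 => 0
    case _ => 1
  }
}

val customResult: RDD[(Int, Int)] = rddNew.partitionBy(new MyPartitioner)
customResult.saveAsTextFile("out_custom")
```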
groupByKey: no grouping function is passed; the first element of each key-value pair is automatically used as the key
The difference between groupBy and groupByKey is that groupBy keeps the complete key-value pairs in the groups, while groupByKey keeps only the values.
```scala
val GroupRDD: RDD[(Int, Iterable[(Int, String)])] = rdd.groupBy(_._1)
// (1,CompactBuffer((1,spark), (1,hello))) (3,CompactBuffer((3,scala))) (2,CompactBuffer((2,java)))
val GroupByRDD: RDD[(Int, Iterable[String])] = rdd.groupByKey()
// (1,CompactBuffer(spark, hello)) (3,CompactBuffer(scala)) (2,CompactBuffer(java))
```
Supplement: the difference between groupByKey and reduceByKey
From the shuffle perspective: groupByKey shuffles and repartitions all of the data first; only afterwards can a map be used to do the aggregation.
Shuffle data must be spilled to disk; it cannot simply be held in memory while waiting.
reduceByKey pre-aggregates within each partition (a combine step, similar to map-side combining in MapReduce), so less data goes through the shuffle and performance is higher.
From the functional perspective: groupByKey only groups, and any follow-up operation can then be applied, so it is more flexible. reduceByKey aggregates the values of the same key; if only grouping (without aggregation) is needed, it cannot be used.
```scala
val rdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 1), (2, 1), (1, 1), (3, 1)), 1)
rdd.reduceByKey(_ + _).collect().foreach(println)
rdd.groupByKey().map({
  case (word, iter) => (word, iter.sum)
}).collect().foreach(println)
```
- aggregateByKey: takes two parameter lists. The first parameter list is the initial value; the second parameter list takes two functions: the first is the intra-partition operation and the second is the inter-partition operation.
```scala
val rdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 1), (2, 1), (1, 1), (3, 1)), 2)
rdd.aggregateByKey(0)(math.max(_, _), _ + _).collect.foreach(println)
```
foldByKey: used when the intra-partition and inter-partition calculation rules are the same
It is equivalent to aggregateByKey with the same function passed for both parameters of the second parameter list.
```scala
val rdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 1), (2, 1), (1, 1), (3, 1)), 1)
rdd.foldByKey(0)(_ + _).collect.foreach(println)
```
Use aggregateByKey to compute the average value for each key
```scala
val rdd: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("a", 3), ("b", 1), ("b", 5)), 2)

/**
 * Initial value (0, 0): the first 0 accumulates the sum of the values for a key, the second 0 counts
 * how many times the key appears in the partition. Each intra-partition step updates this accumulator.
 */
val rdd1: RDD[(String, (Int, Int))] = rdd.aggregateByKey((0, 0))(
  // Intra-partition step: for ("a", 1) the first call computes (0 + 1, 0 + 1) = (1, 1):
  // the first slot adds the value, the second slot adds one to the count
  (init_tuple, value) => {
    (init_tuple._1 + value, init_tuple._2 + 1)
  },
  // Inter-partition step: merge the partial results, adding the sums and adding the counts
  (part1, part2) => {
    (part1._1 + part2._1, part1._2 + part2._2)
  }
)

val rdd2: RDD[(String, Int)] = rdd1.mapValues({
  case (sum, cnt) => sum / cnt
})
```
You can use combineByKey instead:
```scala
val rdd: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("a", 3), ("b", 1), ("b", 5)), 2)

/**
 * combineByKey takes three parameters:
 * First parameter: converts the first value of each key into a new structure
 * Second parameter: intra-partition calculation rule
 * Third parameter: inter-partition calculation rule
 */
val rdd2: RDD[(String, (Int, Int))] = rdd.combineByKey(
  x => (x, 1),
  (init_tuple: (Int, Int), value: Int) => {
    (init_tuple._1 + value, init_tuple._2 + 1)
  },
  (part1: (Int, Int), part2: (Int, Int)) => {
    (part1._1 + part2._1, part1._2 + part2._2)
  }
)
```
- join, leftOuterJoin, rightOuterJoin, cogroup (cogroup does not expand the data the way join can when keys repeat; for each key, the values from the two RDDs are gathered into two separate iterators)
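A minimal sketch of these four operators on two small key-value RDDs (the sample data below is assumed):

```scala
val rdd1: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 2), ("c", 3)))
val rdd2: RDD[(String, Int)] = sc.makeRDD(List(("a", 4), ("b", 5), ("d", 6)))

// join: only keys present in both RDDs, values paired up
rdd1.join(rdd2).collect().foreach(println)          // (a,(1,4)) (b,(2,5))

// leftOuterJoin: all keys of rdd1; the right-hand value becomes an Option
rdd1.leftOuterJoin(rdd2).collect().foreach(println) // (a,(1,Some(4))) (b,(2,Some(5))) (c,(3,None))

// rightOuterJoin: all keys of rdd2; the left-hand value becomes an Option
rdd1.rightOuterJoin(rdd2).collect().foreach(println) // (a,(Some(1),4)) (b,(Some(2),5)) (d,(None,6))

// cogroup: for every key, the values from each RDD are gathered into their own iterator
rdd1.cogroup(rdd2).collect().foreach(println)       // (a,(CompactBuffer(1),CompactBuffer(4))) ...
```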
Case: the top three cities by advertisement clicks in each province
```scala
// Data format: timestamp province city user ad
// e.g. 1635304263 Guangdong Shenzhen yan2 N
val rdd: RDD[String] = sc.textFile("C:\\Users\\93134\\Desktop\\a.txt")

// ((province, city), 1)
val mapRdd: RDD[((String, String), Int)] = rdd.map(line => {
  val datas: Array[String] = line.split(" ")
  ((datas(1), datas(2)), 1)
})

// ((Jiangxi, Nanchang), 115)
val reduceRdd: RDD[((String, String), Int)] = mapRdd.reduceByKey(_ + _)

// (Jiangxi, (Nanchang, 115))
val mapRdd2: RDD[(String, (String, Int))] = reduceRdd.map({
  // Using _1/_2 accessors here is not recommended; pattern matching is clearer
  case ((province, city), sum) => (province, (city, sum))
})

// (Jiangxi, CompactBuffer((Ji'an, 104), (Ganzhou, 121), (Nanchang, 115)))
val groupRdd: RDD[(String, Iterable[(String, Int)])] = mapRdd2.groupByKey()

// (Jiangxi, List((Ganzhou, 121), (Nanchang, 115), (Ji'an, 104)))
val resultRdd: RDD[(String, List[(String, Int)])] = groupRdd.mapValues(values => {
  values.toList.sortBy(_._2)(Ordering.Int.reverse).take(3)
})

resultRdd.collect().foreach(println)
sc.stop()
```