Spark chasing Wife Series (Pair RDD Episode 2)

Posted by mj_23 on Sat, 05 Feb 2022 13:37:41 +0100

After a busy day, I didn't do anything

Small talk:

I didn't do anything today. Before I knew it, it's already the fifth day of the lunar new year. I'll be taking subject 4 of the driving test in a few days. I hope to get my driver's license soon.

combineByKey

 

First, let's explain the meaning of each parameter:

  1. createCombiner: a function that creates the initial combiner within a group. In plain terms, it initializes the incoming data: it takes the current value as a parameter and performs some conversion on it. It is called once for the first occurrence of each key in each partition.
  2. mergeValue: a function that merges values within a partition. It accumulates the result of createCombiner with each subsequent value for the same key in that partition.
  3. mergeCombiners: a function that merges the results of multiple partitions. It aggregates the per-partition results for the same key.
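To make the three parameters concrete, the simplified signature of combineByKey in PairRDDFunctions looks roughly like this (there are also overloads that take a partitioner or a number of partitions):

def combineByKey[C](
    createCombiner: V => C,      // first value for a key (in a partition) => accumulator of type C
    mergeValue: (C, V) => C,     // fold later values for the same key into C, within a partition
    mergeCombiners: (C, C) => C  // merge per-partition accumulators for the same key
  ): RDD[(K, C)]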

combineByKey() traverses all the elements in each partition, so each element's key has either been seen before in that partition or is new. If it is a new key, createCombiner is used to create the initial value of the accumulator for that key. Note that this happens the first time a key appears in each partition, not just the first time it appears in the RDD.

If the key has already appeared in that partition, mergeValue is used to merge the accumulator's current value for that key with the new value.

Just talking about the meaning of the parameters without giving code would be hand-waving.

Let's practice this function through a case

The task: find the average score of our classmate Yuanzi. Her scores in three subjects are 97, 96, and 95, for a total of 288 points, so the average is 96.

Let's look at the code:

import org.apache.spark.{SparkConf, SparkContext}

val wordCount = new SparkConf().setMaster("local").setAppName("WordCount")
val sparkContext = new SparkContext(wordCount)
val value1 = sparkContext.makeRDD(
  List(("Yuanzi", 97), ("Yuanzi", 96), ("Yuanzi", 95)), 2)
val value = value1.combineByKey(
  // createCombiner: convert each score into (score, 1)
  score => (score, 1),
  // mergeValue: within a partition, (score, 1) becomes (score + score, 1 + 1)
  (tuple1: (Int, Int), v) => (tuple1._1 + v, tuple1._2 + 1),
  // mergeCombiners: across partitions, the scores are added and the counts are added
  (tuple2: (Int, Int), tuple3: (Int, Int)) =>
    (tuple2._1 + tuple3._1, tuple2._2 + tuple3._2))
val value2 = value.map {
  case (name, (score, num)) => (name, score / num)
}
value2.collect().foreach(println(_))
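Running this code should print (Yuanzi,96), since the total score is 288 and 288 / 3 = 96.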

It may be hard to understand just by looking at the code, so let's explain it step by step.

1. First, create the RDD

       

 val value1 = sparkContext.makeRDD(List(("Yuanzi", 97), ("Yuanzi", 96), ("Yuanzi", 95)),2)

2. Call the combineByKey operator

val value = value1.combineByKey(
  // createCombiner: the values grouped under the key are 97, 96, 95
  // average score = total score / number of subjects,
  // so first convert each score into (score, 1)
  score => (score, 1),
  // mergeValue: within a partition, (score, 1) becomes (score + score, 1 + 1)
  // partition 1: (97, 1); partition 2: (96 + 95, 2)
  (tuple1: (Int, Int), v) => (tuple1._1 + v, tuple1._2 + 1),
  // mergeCombiners: across partitions, the scores are added and the counts are added
  (tuple2: (Int, Int), tuple3: (Int, Int)) =>
    (tuple2._1 + tuple3._1, tuple2._2 + tuple3._2))

3. Calculate the average

    

// the type of value is RDD[(String, (Int, Int))]
// the String is "Yuanzi"; the first Int is the total score, the second Int is the number of subjects
val value2 = value.map {
  case (name, (score, num)) => (name, score / num)
}

Take a look at an illustration of the case above.

 

The diagram above shows the behavior of this operator more clearly.

After grouping by key, the grouped data is first converted (createCombiner).

Within each partition, the data in the same group is then added together: as the scores are summed, the count is summed as well (mergeValue).

Finally, the data for the same key is merged across partitions; the data here is the result already computed inside each partition (mergeCombiners).
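Assuming the two partitions split as in the comments above (97 alone in the first partition; 96 and 95 in the second), the computation traces roughly like this:

// Partition 1: ("Yuanzi", 97)              Partition 2: ("Yuanzi", 96), ("Yuanzi", 95)
// createCombiner: 97 => (97, 1)            createCombiner: 96 => (96, 1)
//                                          mergeValue: ((96, 1), 95) => (191, 2)
// mergeCombiners: ((97, 1), (191, 2)) => (288, 3)
// map: ("Yuanzi", (288, 3)) => ("Yuanzi", 288 / 3) = ("Yuanzi", 96)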

sortByKey

When called on a (K, V) RDD where K implements the Ordered trait, it returns an RDD sorted by key.
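For example, a custom key type can be sorted by sortByKey as long as it extends Ordered. A minimal sketch, assuming a hypothetical Student key sorted by age (reusing the sparkContext from the earlier example):

// Hypothetical key class: sortable because it extends Ordered
case class Student(name: String, age: Int) extends Ordered[Student] {
  // ascending by age
  override def compare(that: Student): Int = this.age - that.age
}

val students = sparkContext.makeRDD(
  List((Student("b", 20), 1), (Student("a", 18), 2)))
students.sortByKey().collect().foreach(println(_))
// prints (Student(a,18),2) then (Student(b,20),1)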

Let's first look at the operator's definition:

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

A shuffle occurs when this operator is used, because under the hood it creates a new ShuffledRDD. You can also choose ascending or descending order via the first parameter, ascending.

Let's get to know it through code:

val wordCount = new SparkConf().setMaster("local").setAppName("WordCount")
val sparkContext = new SparkContext(wordCount)
val value1 = sparkContext.makeRDD(List(("a", 97), ("b", 96), ("c", 95)), 2)
value1.sortByKey().collect().foreach(println(_))

The default is ascending order, so the result is

        (a,97) (b,96) (c,95)
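If we want descending order instead, pass false for the ascending parameter (a minimal sketch reusing value1 from above):

// ascending = false sorts the keys in descending order
value1.sortByKey(false).collect().foreach(println(_))
// prints (c,95), (b,96), (a,97)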

mapValues

Sometimes we only want to operate on the value part of a pair RDD, and manipulating the whole tuple is cumbersome. In that case we can use the mapValues(func) function provided by Spark, which behaves like map { case (x, y) => (x, func(y)) }.
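A minimal sketch, reusing the sparkContext from above (the 3-point bonus is just an illustrative assumption):

val scores = sparkContext.makeRDD(List(("Yuanzi", 97), ("Yuanzi", 96), ("Yuanzi", 95)), 2)
// mapValues applies the function to the value only; the key stays unchanged
val withBonus = scores.mapValues(score => score + 3)
withBonus.collect().foreach(println(_))
// prints (Yuanzi,100), (Yuanzi,99), (Yuanzi,98)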

Summary:

Today's post turned out rough; I don't feel it's well written. Tomorrow I must study hard, write better articles, and make good use of everything I've accumulated.

Tomorrow I'll write about a TopN case and the action operators. If possible, I'll also talk about RangePartitioner.

I'm still four or five hundred views short of 20,000. Although this post isn't great, I hope to see 20,000 views when I wake up tomorrow.

If I break 20,000, I must keep blogging well; it would be too embarrassing to let it slide.

Topics: Big Data Spark