RDD Operators
1 - What is an operator? An operator is another name for an RDD API / method / behavior
2 - What are the two classes of operators? Transformation and action
3 - Transformation characteristics: returns a new RDD and is lazily evaluated (delayed loading)
- Which operators are transformations? See the table below: map, filter, etc.
- Transformations can be further classified:
e.g. glom: view the elements of each partition
1 - RDDs whose elements are single values:
map, groupBy, filter, flatMap, distinct
Beijing order case: local + Standalone cluster
2 - Double-value (binary) operators, whose input parameter is also an RDD:
union, intersection
3 - RDDs whose elements are key-value pairs:
groupByKey, reduceByKey, sortByKey
4 - Action characteristics: executes immediately and produces output; it is the last link in the computation chain
- Which operators are actions? See the table below
e.g. collect, reduce, first, take,
takeSample, takeOrdered, top, count
Operators are also known as [functions], [methods], or [APIs]
Transformation
-
Returns a new [RDD]. All transformation functions (operators) are [lazy] (delayed loading) and are not executed immediately, e.g. [flatMap], [map] and [reduceByKey] in the wordcount example (a small laziness sketch follows the table below).
transformation | meaning |
---|---|
map(func) | Returns a new RDD formed by passing each input element through the function func |
filter(func) | Returns a new RDD formed by selecting the input elements on which func returns true |
flatMap(func) | Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element) |
mapPartitions(func) | Similar to map, but runs separately on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U] |
mapPartitionsWithIndex(func) | Similar to mapPartitions, but func also takes an integer parameter representing the partition index, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U] |
sample(withReplacement, fraction, seed) | Samples the data according to the proportion given by fraction, with or without replacement; seed specifies the seed of the random number generator |
union(otherDataset) | Returns a new RDD containing the union of the source RDD and the argument RDD |
intersection(otherDataset) | Returns a new RDD containing the intersection of the source RDD and the argument RDD |
distinct([numTasks]) | Returns a new RDD containing the distinct elements of the source RDD |
groupByKey([numTasks]) | Called on a (K,V) RDD; returns a (K, Iterable[V]) RDD |
reduceByKey(func, [numTasks]) | Called on a (K,V) RDD; returns a (K,V) RDD in which the values of each key are aggregated with the given reduce function func. As with groupByKey, the number of reduce tasks can be set through the second optional parameter |
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | Called on a (K,V) RDD; returns a (K,U) RDD in which the values of each key are aggregated using the given combine functions and a neutral zeroValue |
sortByKey([ascending], [numTasks]) | Called on a (K,V) RDD where K implements the Ordered interface; returns a (K,V) RDD sorted by key |
sortBy(func, [ascending], [numTasks]) | Similar to sortByKey, but more flexible |
join(otherDataset, [numTasks]) | Called on RDDs of types (K,V) and (K,W); returns a (K,(V,W)) RDD with all pairs of elements for each key |
cogroup(otherDataset, [numTasks]) | Called on RDDs of types (K,V) and (K,W); returns an RDD of type (K, (Iterable[V], Iterable[W])) |
cartesian(otherDataset) | Cartesian product |
pipe(command, [envVars]) | Pipes each partition of the RDD through a shell command |
coalesce(numPartitions) | Reduces the number of partitions of the RDD to the specified value; useful after filtering down a large dataset |
repartition(numPartitions) | Repartitions the RDD |
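
As noted above, transformations are lazy: calling one only records the lineage, and nothing runs until an action is triggered. A minimal supplementary sketch (not part of the original notes), assuming a SparkContext `sc` like the one created in the setup code further below:

```python
# Lazy evaluation sketch: map() only builds a new RDD, collect() triggers the job.
rdd1 = sc.parallelize([1, 2, 3, 4])
rdd2 = rdd1.map(lambda x: x * 10)   # transformation: returns a new RDD, no job is submitted yet
print(rdd2)                         # prints an RDD object, not the data
print(rdd2.collect())               # action: triggers the actual computation
# expected: [10, 20, 30, 40]
```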
Usage of common operators:
Transformation operators
-
Value type (single-value elements)
-
map
-
groupBy
-
filter
-
flatMap
-
distinct
-
```python
import os
from pyspark import SparkConf, SparkContext

os.environ['SPARK_HOME'] = '/export/server/spark'
# When there are multiple Python versions, not specifying one is likely to cause errors
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Practice: the glom function shows the concrete elements of each partition
rdd1 = sc.parallelize([5,6,4,7,3,8,2,9,1,10])
rdd2 = rdd1.glom()
print(rdd2.collect())
>>result: [[5, 6], [4, 7], [3, 8], [2, 9, 1, 10]]

# The default parallelism is 4 because the local[*] machine has four cores
rdd1.getNumPartitions()
>>result: 4

# You can also specify the number of partitions, e.g. 3
rdd1 = sc.parallelize([1,2,3,4,5,6,7,8,9], 3)
rdd1.getNumPartitions()
>>result: 3

#1. map operator, way 1
rdd2 = rdd1.map(lambda x: x+1)
print(rdd2.collect())
>>result: [2, 3, 4, 5, 6, 7, 8, 9, 10]

#1. map operator, way 2
def add(x):
    return x+1

rdd2 = rdd1.map(add)
print(rdd2.collect())
>>result: [2, 3, 4, 5, 6, 7, 8, 9, 10]

#2. groupBy operator
rdd1 = sc.parallelize([1,2,3,4])
rdd2 = rdd1.groupBy(lambda x: 'even' if x%2==0 else 'odd')
print(rdd2.collect())
>>result: [('even', <pyspark.resultiterable.ResultIterable object at 0x7f9e1c0e33d0>), ('odd', <pyspark.resultiterable.ResultIterable object at 0x7f9e0e2ae0d0>)]
rdd3 = rdd2.mapValues(lambda x: list(x))
print(rdd3.collect())
>>result: [('even', [2, 4]), ('odd', [1, 3])]

#3. filter operator
rdd1 = sc.parallelize([1,2,3,4,5,6,7,8,9])
rdd2 = rdd1.filter(lambda x: True if x>4 else False)
print(rdd2.collect())
>>result: [5, 6, 7, 8, 9]

#4. flatMap operator
rdd1 = sc.parallelize(["a b c","d e f","h i j"])
rdd2 = rdd1.flatMap(lambda line: line.split(" "))
print(rdd2.collect())
>>result: ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j']

#5. distinct operator
rdd1 = sc.parallelize([1,2,3,3,3,5,5,6])
rdd1.distinct().collect()
>>result: [1, 5, 2, 6, 3]
```
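
The table above also lists mapPartitions and mapPartitionsWithIndex, which have no example in these notes. A supplementary sketch of how they could be used, reusing the SparkContext `sc` from the setup block above:

```python
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6], 3)

# mapPartitions: the function receives an iterator over one whole partition
def add_one(partition):
    return (x + 1 for x in partition)

print(rdd1.mapPartitions(add_one).collect())
# expected: [2, 3, 4, 5, 6, 7]

# mapPartitionsWithIndex: the function also receives the partition index
def tag_partition(index, partition):
    return ((index, x) for x in partition)

print(rdd1.mapPartitionsWithIndex(tag_partition).collect())
# expected: [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
```

mapPartitions is useful when per-partition setup (for example, opening a database connection) is expensive and should be done once per partition rather than once per element.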
Double-value type (two RDDs as input)
-
union
-
intersection
```python
# union operator
rdd1 = sc.parallelize([("a", 1), ("b", 2)])
print(rdd1.collect())
>>result: [('a', 1), ('b', 2)]
rdd2 = sc.parallelize([("c", 1), ("b", 3)])
print(rdd2.collect())
>>result: [('c', 1), ('b', 3)]
rdd3 = rdd1.union(rdd2)
print(rdd3.collect())
>>result: [('a', 1), ('b', 2), ('c', 1), ('b', 3)]

# intersection operator
rdd2 = sc.parallelize([("a", 1), ("b", 3)])
rdd3 = rdd1.intersection(rdd2)
rdd3.collect()
>>result: [('a', 1)]
```
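
A supplementary detail not shown above: union keeps duplicates, so pair it with distinct() when a set-style union is wanted. A small sketch, reusing `sc`:

```python
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])
print(rdd1.union(rdd2).collect())
# expected: [1, 2, 3, 3, 4, 5]   (3 appears twice: union does not deduplicate)
print(rdd1.union(rdd2).distinct().collect())
# expected (order may vary): [1, 2, 3, 4, 5]
```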
Key-value type
-
groupByKey
-
reduceByKey
-
sortByKey
```python
# groupByKey operator 1
rdd = sc.parallelize([("a",1),("b",2),("c",3),("d",4)])
rdd.groupByKey().collect()
>>result:
[('b', <pyspark.resultiterable.ResultIterable at 0x7f9e1c0e33a0>),
 ('c', <pyspark.resultiterable.ResultIterable at 0x7f9e0e27d430>),
 ('a', <pyspark.resultiterable.ResultIterable at 0x7f9e0e27df10>),
 ('d', <pyspark.resultiterable.ResultIterable at 0x7f9e0e27d340>)]
result = rdd.groupByKey().collect()
result[1]
>>result: ('c', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2ae3d0>)
result[1][1]
>>result: <pyspark.resultiterable.ResultIterable at 0x7f9e0e2ae3d0>
list(result[1][1])
>>result: [3]

# groupByKey operator 2, additional supplementary case
rdd = sc.parallelize([("M",'zs'),("F",'ls'),("M",'ww'),("F",'zl')])
rdd2 = rdd.groupByKey()
rdd2.collect()
>>result:
[('M', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2bc550>),
 ('F', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2bcfa0>)]
ite = rdd2.collect()
for x in ite:
    print('Gender is', x[0], 'people are:', list(x[1]))
>>result:
Gender is M people are: ['zs', 'ww']
Gender is F people are: ['ls', 'zl']

# groupByKey operator 3
sc.parallelize([('hadoop', 1), ('hadoop', 5), ('spark', 3), ('spark', 6)])
>>result: ParallelCollectionRDD[60] at readRDDFromFile at PythonRDD.scala:274
rdd1 = sc.parallelize([('hadoop', 1), ('hadoop', 5), ('spark', 3), ('spark', 6)])
rdd2 = rdd1.groupByKey()
rdd2.collect()
>>result:
[('hadoop', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2bc490>),
 ('spark', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2bc670>)]
rdd2.mapValues(lambda value: sum(list(value)))
>>result: PythonRDD[67] at RDD at PythonRDD.scala:53
rdd3 = rdd2.mapValues(lambda value: sum(list(value)))
rdd3.collect()
>>result: [('hadoop', 6), ('spark', 9)]

# reduceByKey operator
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
rdd.reduceByKey(lambda x, y: x+y).collect()
>>result: [('b', 1), ('a', 2)]

# sortByKey operator
sc.parallelize([('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)])
>>result: ParallelCollectionRDD[75] at readRDDFromFile at PythonRDD.scala:274
rdd1 = sc.parallelize([('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)])
rdd2 = rdd1.sortByKey()
rdd2.collect()
>>result: [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
print(rdd1.sortByKey(False))
>>result: PythonRDD[90] at RDD at PythonRDD.scala:53
print(rdd1.sortByKey(False).collect())
>>result: [('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3)]
print(rdd1.sortByKey(True, 2).glom().collect())
>>result: [[('1', 3), ('2', 5), ('a', 1)], [('b', 2), ('d', 4)]]

tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5), ('whose', 6), ('fleece', 7), ('was', 8), ('white', 9)]
rdd1 = sc.parallelize(tmp2)
rdd2 = rdd1.sortByKey(True, 1, keyfunc=lambda k: k.upper())
rdd2.collect()
>>result: [('a', 3), ('fleece', 7), ('had', 2), ('lamb', 5), ('little', 4), ('Mary', 1), ('was', 8), ('white', 9), ('whose', 6)]
```
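
The transformation table also lists join, cogroup and aggregateByKey for key-value RDDs, which have no example in these notes. A supplementary sketch, reusing `sc`:

```python
# join: inner join on the key, producing (K, (V, W)) pairs
rdd1 = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
rdd2 = sc.parallelize([("a", 4), ("b", 5), ("b", 6)])
print(rdd1.join(rdd2).collect())
# expected (order may vary): [('a', (1, 4)), ('b', (2, 5)), ('b', (2, 6))]

# cogroup: groups the values of both RDDs per key into a pair of iterables
rdd3 = rdd1.cogroup(rdd2).mapValues(lambda vs: (list(vs[0]), list(vs[1])))
print(rdd3.collect())
# expected (order may vary): [('a', ([1], [4])), ('b', ([2], [5, 6])), ('c', ([3], []))]

# aggregateByKey: per-key aggregation starting from a zero value, with one
# function applied inside each partition and another to combine partitions
rdd4 = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
print(rdd4.aggregateByKey(0, lambda acc, v: acc + v, lambda a, b: a + b).collect())
# expected (order may vary): [('a', 3), ('b', 3)]
```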
Action
-
Returns something that is [not] an RDD; it can save or output results. All action operators [execute immediately], e.g. [saveAsTextFile] in the wordcount example.
action | meaning |
---|---|
reduce(func) | Aggregates all elements of the RDD through the function func; func must be commutative and associative so that it can be computed correctly in parallel |
collect() | Returns all elements of the dataset as an array to the driver |
count() | Returns the number of elements in the RDD |
first() | Returns the first element of the RDD (similar to take(1)) |
take(n) | Returns an array with the first n elements of the dataset |
takeSample(withReplacement, num, [seed]) | Returns an array of num elements randomly sampled from the dataset, with or without replacement; seed specifies the seed of the random number generator |
takeOrdered(n, [ordering]) | Returns the first n elements in natural order or using a custom ordering |
saveAsTextFile(path) | Saves the elements of the dataset as a text file to HDFS or another supported file system; Spark calls toString on each element to convert it to a line of text in the file |
saveAsSequenceFile(path) | Saves the elements of the dataset to the given directory in Hadoop SequenceFile format, on HDFS or any other Hadoop-supported file system |
saveAsObjectFile(path) | Saves the elements of the dataset to the given directory using Java serialization |
countByKey() | For an RDD of type (K,V), returns a (K, Int) map giving the number of elements for each key |
foreach(func) | Runs the function func on each element of the dataset |
foreachPartition(func) | Runs the function func on each partition of the dataset |
```python
# countByValue operator
x = sc.parallelize([1, 3, 1, 2, 3])
y = x.countByValue()
print(type(y))
>>result: <class 'collections.defaultdict'>
print(y)
>>result: defaultdict(<class 'int'>, {1: 2, 3: 2, 2: 1})

# collect operator
rdd = sc.parallelize([1,3,5,2,6,7,11,9,10], 3)
rdd.map(lambda x: x + 1).collect()
>>result: [2, 4, 6, 3, 7, 8, 12, 10, 11]
x = rdd.map(lambda x: x + 1).collect()
print(type(x))
>>result: <class 'list'>

# reduce operator
rdd1 = sc.parallelize([1,2,3,4,5])
rdd1.collect()
>>result: [1, 2, 3, 4, 5]
x = rdd1.reduce(lambda x, y: x+y)
print(x)
>>result: 15

# fold operator
rdd.glom().collect()
>>result: [[1, 3, 5], [2, 6, 7], [11, 9, 10]]
rdd1 = sc.parallelize([1,2,3,4,5], 3)
rdd1.glom().collect()
>>result: [[1], [2, 3], [4, 5]]
rdd1.fold(10, lambda x, y: x+y)
>>result: 55

# first operator
sc.parallelize([2, 3, 4]).first()
>>result: 2

# take operator
sc.parallelize([2, 3, 4, 5, 6]).take(2)
>>result: [2, 3]
sc.parallelize([2, 3, 4, 5, 6]).take(10)
>>result: [2, 3, 4, 5, 6]
sc.parallelize([5,3,1,1,6]).take(2)
>>result: [5, 3]
sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
>>result: [91, 92, 93]

# top operator
x = sc.parallelize([1, 3, 1, 2, 3])
x.top(3)
>>result: [3, 3, 2]

# count operator
sc.parallelize([2, 3, 4]).count()
>>result: 3

# foreach operator
words = sc.parallelize(
    ["scala", "java", "hadoop", "spark", "akka",
     "spark vs hadoop", "pyspark", "pyspark and spark and python"])
words.foreach(lambda x: print(x))
>>result:
pyspark
pyspark and spark and python
akka
spark vs hadoop
hadoop
spark
scala
java

# saveAsTextFile operator
data = sc.parallelize([1,2,3], 2)
data.glom().collect()
>>result: [[1], [2, 3]]
data.saveAsTextFile("hdfs://node1:8020/output/file1")
```
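
takeSample, takeOrdered and countByKey appear in the outline and the action table but have no example above. A supplementary sketch (not from the original notes), reusing `sc`:

```python
rdd = sc.parallelize([5, 3, 1, 4, 2])

# takeSample(withReplacement, num, [seed]): returns a random sample as a Python list
print(rdd.takeSample(False, 3, seed=1))      # 3 distinct elements; which ones depends on the seed

# takeOrdered(n, [key]): the n smallest elements, optionally ordered by a key function
print(rdd.takeOrdered(3))                    # expected: [1, 2, 3]
print(rdd.takeOrdered(3, key=lambda x: -x))  # expected: [5, 4, 3] (largest first)

# countByKey(): for a (K, V) RDD, counts how many elements each key has
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 7)])
print(pairs.countByKey())                    # expected: defaultdict(<class 'int'>, {'a': 2, 'b': 1})
```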
-
collect
-
reduce
-
first
-
take
-
takeSample
-
takeOrdered
-
top
-
count
-
For the operators above (collect through count), the Executors send their execution results back to the Driver.
foreach and saveAsTextFile do not send results back to the Driver.
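
The action table also lists foreachPartition, which, like foreach, runs on the Executors and sends nothing back to the Driver, but calls the function once per partition. A minimal supplementary sketch, reusing `sc`:

```python
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)

def handle_partition(partition):
    # `partition` is an iterator over one partition's elements.
    # In local mode the print output appears in the console; in cluster mode
    # it goes to the Executor logs, not to the Driver.
    print("partition:", list(partition))

rdd.foreachPartition(handle_partition)
```

foreachPartition is typically preferred over foreach when each call needs a costly resource (such as a database connection), since the resource can be created once per partition.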