What are the RDD operators in Spark?

Posted by kingdm on Sat, 12 Feb 2022 17:44:38 +0100

Operators of RDD
1 - What is an operator? An API, method, or behavior of an RDD.
2 - What classes of operators are there? transformation and action.
3 - transformation features: returns a new RDD and is lazily evaluated (delayed loading).
    - Which operators are transformations? See the table below: map, filter, etc.
    - Further classification of transformations
      eg: glom - shows the elements of each partition
      1 - The elements of the RDD are single values:
            map, groupBy, filter, flatMap, distinct
            Beijing order case: local + standalone cluster
      2 - Binary operators whose input parameter is also an RDD:
            union, intersection
      3 - The elements of the RDD are key-value pairs:
            groupByKey, reduceByKey, sortByKey
4 - action features: executes immediately and outputs results; it is the last link in the calculation chain.
    - Which operators are actions? See the table below.
      eg: collect, reduce, first, take,
          takeSample, takeOrdered, top, count

An operator is also called a [function], [method], or [API].

Transformation

  • Returns a new [RDD]. All transformation functions (operators) are [lazily evaluated (delayed loading)] and are not executed immediately, such as [flatMap], [map] and [reduceByKey] in wordcount; see the sketch below.
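
A minimal sketch of this lazy behavior, assuming a SparkContext named sc like the one created in the examples further down: the map call only records the transformation, and nothing runs until an action such as collect is called.

#Lazy evaluation sketch (sc is assumed to be an existing SparkContext)
rdd = sc.parallelize([1, 2, 3, 4])
mapped = rdd.map(lambda x: x * 10)   #transformation: only builds the lineage, no job runs yet
print(mapped.collect())              #action: triggers the actual computation -> [10, 20, 30, 40]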

| transformation | meaning |
| --- | --- |
| map(func) | Returns a new RDD formed by passing each input element through the function func |
| filter(func) | Returns a new RDD formed by selecting the input elements for which func returns true |
| flatMap(func) | Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element) |
| mapPartitions(func) | Similar to map, but runs separately on each partition of the RDD; when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U] |
| mapPartitionsWithIndex(func) | Similar to mapPartitions, but func also takes an integer parameter representing the index of the partition; when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U] |
| sample(withReplacement, fraction, seed) | Samples the data at the ratio specified by fraction, with or without replacement; seed specifies the seed of the random number generator |
| union(otherDataset) | Returns a new RDD containing the union of the source RDD and the argument RDD |
| intersection(otherDataset) | Returns a new RDD containing the intersection of the source RDD and the argument RDD |
| distinct([numTasks]) | Returns a new RDD containing the distinct elements of the source RDD |
| groupByKey([numTasks]) | Called on a (K,V) RDD, returns a (K, Iterable[V]) RDD |
| reduceByKey(func, [numTasks]) | Called on a (K,V) RDD, returns a (K,V) RDD in which the values for each key are aggregated using the given reduce function; as with groupByKey, the number of reduce tasks can be set through the second optional argument |
| aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | Called on a (K,V) RDD, returns a (K,U) RDD in which the values for each key are aggregated using the given combine functions and a neutral "zero" value |
| sortByKey([ascending], [numTasks]) | Called on a (K,V) RDD where K implements the Ordered interface, returns a (K,V) RDD sorted by key |
| sortBy(func, [ascending], [numTasks]) | Similar to sortByKey, but more flexible |
| join(otherDataset, [numTasks]) | Called on RDDs of type (K,V) and (K,W), returns a (K, (V,W)) RDD with all pairs of elements for each key |
| cogroup(otherDataset, [numTasks]) | Called on RDDs of type (K,V) and (K,W), returns an RDD of type (K, (Iterable[V], Iterable[W])) |
| cartesian(otherDataset) | Cartesian product of the two RDDs |
| pipe(command, [envVars]) | Pipes each partition of the RDD through a shell command |
| coalesce(numPartitions) | Decreases the number of partitions in the RDD to the specified value; useful after filtering down a large dataset |
| repartition(numPartitions) | Reshuffles the RDD into the specified number of partitions |
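
mapPartitions and aggregateByKey appear in the table but are not demonstrated in the sections below; a minimal sketch, assuming the same SparkContext sc that the later examples create:

#mapPartitions: func receives an iterator over one whole partition
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)
def part_sum(it):
    yield sum(it)                                #one output value per partition
print(rdd.mapPartitions(part_sum).collect())     #-> [6, 15]

#aggregateByKey: zero value + seqOp (within a partition) + combOp (across partitions)
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
print(pairs.aggregateByKey(0,
                           lambda acc, v: acc + v,
                           lambda a, b: a + b).collect())   #-> [('a', 3), ('b', 3)], key order may vary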

Usage of common operators:

transformation operator

  • Value type (ValueType)

    • map

    • groupBy

    • filter

    • flatMap

    • distinct

import os
from pyspark import SparkConf, SparkContext
os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When multiple Python versions are installed, errors are likely if the interpreter is not specified explicitly
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON
conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
sc = SparkContext(conf=conf)
#Practice: the glom function shows the concrete elements held by each partition
rdd1=sc.parallelize([5,6,4,7,3,8,2,9,1,10])
rdd2=rdd1.glom()
print(rdd2.collect())
>>result: [[5, 6], [4, 7], [3, 8], [2, 9, 1, 10]]
#The default parallelism is 4 because local[*] uses all 4 cores of this machine
rdd1.getNumPartitions()
>>result:  4
#You can also specify the number of partitions, e.g. 3
rdd1=sc.parallelize([1,2,3,4,5,6,7,8,9],3)
rdd1.getNumPartitions()
>>result:  3
    
#1. map operator, way 1: with a lambda
rdd2=rdd1.map(lambda x:x+1)
print(rdd2.collect())
>>result:[2, 3, 4, 5, 6, 7, 8, 9, 10]
    
#1. map operator, way 2: with a named function
def add(x):
    return x+1
rdd2=rdd1.map(add)
print(rdd2.collect())
>>result: [2, 3, 4, 5, 6, 7, 8, 9, 10]

#2. groupBy operator,
rdd1=sc.parallelize([1,2,3,4])
rdd2=rdd1.groupBy(lambda x: 'even' if x%2==0 else 'odd')
print(rdd2.collect())
>>result: [('even', <pyspark.resultiterable.ResultIterable object at 0x7f9e1c0e33d0>), ('odd', <pyspark.resultiterable.ResultIterable object at 0x7f9e0e2ae0d0>)]
    
rdd3=rdd2.mapValues(lambda x:list(x))
print(rdd3.collect())
>>result: [('even', [2, 4]), ('odd', [1, 3])]
  
#3. filter operator
rdd1=sc.parallelize([1,2,3,4,5,6,7,8,9])
rdd2=rdd1.filter(lambda x:True if x>4 else False)   #keep elements greater than 4 (equivalent to lambda x: x > 4)
print(rdd2.collect())
>>result: [5, 6, 7, 8, 9]
    
#4. flatMap operator
rdd1=sc.parallelize(["a b c","d e f","h i j"])
rdd2=rdd1.flatMap(lambda line:line.split(" "))
print(rdd2.collect())
>>result: ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j']
    
#5. distinct operator
rdd1 = sc.parallelize([1,2,3,3,3,5,5,6])
rdd1.distinct().collect()
>>result:  [1, 5, 2, 6, 3]

Double value type (the operator takes another RDD as input)

  • union

  • intersection

#union operator
rdd1 = sc.parallelize([("a", 1), ("b", 2)])
print(rdd1.collect())
>>result:[('a', 1), ('b', 2)]
    
rdd2 = sc.parallelize([("c",1),("b",3)])
print(rdd2.collect())
>>result:[('c', 1), ('b', 3)]
    
rdd3=rdd1.union(rdd2)
print(rdd3.collect())
>>result:[('a', 1), ('b', 2), ('c', 1), ('b', 3)]
    
#intersection operator
rdd2 = sc.parallelize([("a",1),("b",3)])
rdd3=rdd1.intersection(rdd2)
rdd3.collect()
>>result: [('a', 1)]

 

Key value type

  • groupByKey

  • reduceByKey

  • sortByKey

#groupByKey operator 1
rdd = sc.parallelize([("a",1),("b",2),("c",3),("d",4)])
rdd.groupByKey().collect()
>>result: 
[('b', <pyspark.resultiterable.ResultIterable at 0x7f9e1c0e33a0>),
 ('c', <pyspark.resultiterable.ResultIterable at 0x7f9e0e27d430>),
 ('a', <pyspark.resultiterable.ResultIterable at 0x7f9e0e27df10>),
 ('d', <pyspark.resultiterable.ResultIterable at 0x7f9e0e27d340>)]
result=rdd.groupByKey().collect()
result[1]
>>result:  ('c', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2ae3d0>)
result[1][1]
>>result:  <pyspark.resultiterable.ResultIterable at 0x7f9e0e2ae3d0>
list(result[1][1])
>>result:  [3]
    
#groupByKey operator 2, an additional example
rdd = sc.parallelize([("M",'zs'),("F",'ls'),("M",'ww'),("F",'zl')])
rdd2=rdd.groupByKey()
rdd2.collect()
>>result:  
[('M', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2bc550>),
 ('F', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2bcfa0>)]
ite=rdd2.collect()
for x in ite : print('Gender is',x[0],'People are:',list(x[1]))
>>result: Gender is M People are: ['zs', 'ww']
	    Gender is F People are: ['ls', 'zl']
      
#groupByKey operator 3
sc.parallelize([('hadoop', 1), ('hadoop', 5), ('spark', 3), ('spark', 6)])
>>result:  ParallelCollectionRDD[60] at readRDDFromFile at PythonRDD.scala:274
rdd1=sc.parallelize([('hadoop', 1), ('hadoop', 5), ('spark', 3), ('spark', 6)])
rdd2=rdd1.groupByKey()
rdd2.collect()
>>result:  
[('hadoop', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2bc490>),
 ('spark', <pyspark.resultiterable.ResultIterable at 0x7f9e0e2bc670>)]
rdd2.mapValues(lambda value:sum(list(value)))
>>result:  PythonRDD[67] at RDD at PythonRDD.scala:53
rdd3=rdd2.mapValues(lambda value:sum(list(value)))
rdd3.collect()
>>result:  [('hadoop', 6), ('spark', 9)]
    
#reduceByKey operator
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
rdd.reduceByKey(lambda x,y:x+y).collect()
>>result:  [('b', 1), ('a', 2)]
    
#sortByKey operator
sc.parallelize([('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)])
>>result:  ParallelCollectionRDD[75] at readRDDFromFile at PythonRDD.scala:274
rdd1=sc.parallelize([('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)])
rdd2=rdd1.sortByKey()
rdd2.collect()
>>result:  [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
print(rdd1.sortByKey(False))
>>result:  PythonRDD[90] at RDD at PythonRDD.scala:53
print(rdd1.sortByKey(False).collect())
>>result:  [('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3)]
print(rdd1.sortByKey(True,2).glom().collect())
>>result:  [[('1', 3), ('2', 5), ('a', 1)], [('b', 2), ('d', 4)]]
tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5), ('whose', 6), ('fleece', 7), ('was', 8), ('white', 9)]

rdd1=sc.parallelize(tmp2)
rdd2=rdd1.sortByKey(True,1,keyfunc=lambda k:k.upper())
rdd2.collect()
>>result:  
[('a', 3),
 ('fleece', 7),
 ('had', 2),
 ('lamb', 5),
 ('little', 4),
 ('Mary', 1),
 ('was', 8),
 ('white', 9),
 ('whose', 6)]
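
join and sortBy are listed in the transformation table but have no example above; a minimal sketch:

#join: inner join of two key-value RDDs on the key
rdd_a = sc.parallelize([("a", 1), ("b", 2)])
rdd_b = sc.parallelize([("a", 3), ("a", 4), ("c", 5)])
print(rdd_a.join(rdd_b).collect())                #-> [('a', (1, 3)), ('a', (1, 4))], order may vary

#sortBy: sort by an arbitrary key function instead of the pair key
rdd_c = sc.parallelize([("a", 3), ("b", 1), ("c", 2)])
print(rdd_c.sortBy(lambda kv: kv[1]).collect())   #-> [('b', 1), ('c', 2), ('a', 3)]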

Action

  • Returns a result that is [not] an RDD; results can be saved or output. All Action operators [execute immediately], such as [saveAsTextFile] in wordcount.

| action | meaning |
| --- | --- |
| reduce(func) | Aggregates all elements of the RDD using the function func; the function must be commutative and associative so that it can be computed correctly in parallel |
| collect() | Returns all elements of the dataset as an array to the driver program |
| count() | Returns the number of elements in the RDD |
| first() | Returns the first element of the RDD (similar to take(1)) |
| take(n) | Returns an array with the first n elements of the dataset |
| takeSample(withReplacement, num, [seed]) | Returns an array with num elements randomly sampled from the dataset, with or without replacement; seed specifies the seed of the random number generator |
| takeOrdered(n, [ordering]) | Returns the first n elements in natural order or using a custom ordering |
| saveAsTextFile(path) | Writes the elements of the dataset as a text file to HDFS or another supported file system; Spark calls toString on each element to convert it to a line of text in the file |
| saveAsSequenceFile(path) | Writes the elements of the dataset as a Hadoop SequenceFile to the given directory on HDFS or another Hadoop-supported file system |
| saveAsObjectFile(path) | Writes the elements of the dataset to the given directory using Java serialization |
| countByKey() | For an RDD of type (K,V), returns a (K, Int) map with the number of elements for each key |
| foreach(func) | Runs the function func on each element of the dataset (usually for side effects such as updating an accumulator or writing to external storage) |
| foreachPartition(func) | Runs the function func on each partition of the dataset |
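
countByKey and foreachPartition from the table are not shown in the examples below; a minimal sketch:

#countByKey: returns a dict with the number of elements per key (collected to the driver)
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(pairs.countByKey())                        #-> defaultdict(<class 'int'>, {'a': 2, 'b': 1})

#foreachPartition: func receives an iterator over one partition and runs on the executors
def show_partition(it):
    print(list(it))
sc.parallelize([1, 2, 3, 4], 2).foreachPartition(show_partition)
#in local mode this prints [1, 2] and [3, 4], in arbitrary order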

 

#countByValue operator
x = sc.parallelize([1, 3, 1, 2, 3])
y = x.countByValue()
print(type(y))
>>result:  <class 'collections.defaultdict'>
print(y)
>>result:  defaultdict(<class 'int'>, {1: 2, 3: 2, 2: 1})
    
#collect operator
rdd = sc.parallelize([1,3,5,2,6,7,11,9,10],3)
rdd.map(lambda x: x + 1).collect()
>>result:  [2, 4, 6, 3, 7, 8, 12, 10, 11]
x=rdd.map(lambda x: x + 1).collect()
print(type(x))
>>result:  <class 'list'>
    
#reduce operator
rdd1 = sc.parallelize([1,2,3,4,5])
rdd1.collect()
>>result:   [1, 2, 3, 4, 5]
x=rdd1.reduce(lambda x,y:x+y)
print(x)
>>result:  15

#fold operator
rdd.glom().collect()         #partitions of the earlier rdd (9 elements, 3 partitions)
>>result:   [[1, 3, 5], [2, 6, 7], [11, 9, 10]]
rdd1 = sc.parallelize([1,2,3,4,5], 3)
rdd1.glom().collect()
>>result:   [[1], [2, 3], [4, 5]]
#fold applies the zero value 10 once per partition (3 partitions here) and once more when merging on the driver: 10*4 + (1+2+3+4+5) = 55
rdd1.fold(10,lambda x,y:x+y)
>>result:   55
    
#first operator
sc.parallelize([2, 3, 4]).first()
>>result:   2
    
#take operator
sc.parallelize([2, 3, 4, 5, 6]).take(2)
>>result:   [2, 3]
sc.parallelize([2, 3, 4, 5, 6]).take(10)
>>result:   [2, 3, 4, 5, 6]
sc.parallelize([5,3,1,1,6]).take(2) 
>>result:   [5, 3]
sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
>>result:   [91, 92, 93]
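
takeSample and takeOrdered are listed in the action table and in the summary at the end but have no example here; a minimal sketch (the sampled elements depend on the seed):

#takeSample / takeOrdered
rdd = sc.parallelize([5, 3, 1, 1, 6])
print(rdd.takeSample(False, 2, seed=1))      #2 elements sampled without replacement
print(rdd.takeOrdered(3))                    #-> [1, 1, 3], the 3 smallest elements
print(rdd.takeOrdered(3, key=lambda x: -x))  #-> [6, 5, 3], the 3 largest elements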
    
#top operator
x = sc.parallelize([1, 3, 1, 2, 3])
x.top(3)
>>result:   [3, 3, 2]
    
#count operator
sc.parallelize([2, 3, 4]).count()
>>result:   3
    
#foreach operator
words = sc.parallelize(
    ["scala",
     "java",
     "hadoop",
     "spark",
     "akka",
     "spark vs hadoop",
     "pyspark",
     "pyspark and spark and python"])
#foreach runs on the executors; in local mode the output appears in this console, in arbitrary order
words.foreach(lambda x:print(x))
>>result:  
pyspark
pyspark and spark and python
akka
spark vs hadoop
hadoop
spark
scala
java

#saveAsTextFile operator
data = sc.parallelize([1,2,3], 2)
data.glom().collect()
>>result:   [[1], [2, 3]]
data.saveAsTextFile("hdfs://node1:8020/output/file1")
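
To check the result, the saved part files can be read back with textFile, using the same HDFS path as above:

#Read the saved text files back; each line comes back as a string
back = sc.textFile("hdfs://node1:8020/output/file1")
print(back.collect())                        #-> ['1', '2', '3'] (split across the part files)
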
  • collect

  • reduce

  • first

  • take

  • takeSample

  • takeOrdered

  • top

  • count

For the operators above, the Executors send the execution results back to the Driver.

foreach and saveAsTextFile do not send results back to the Driver.
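
A minimal illustration of that difference: collect returns a Python list to the Driver, while foreach returns None and only has side effects on the Executors.

rdd = sc.parallelize([1, 2, 3])
res1 = rdd.collect()                   #the list of elements is sent back to the driver
print(res1)                            #-> [1, 2, 3]
res2 = rdd.foreach(lambda x: x + 1)    #runs on the executors; nothing is sent back
print(res2)                            #-> None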

Topics: Big Data Spark