1. Theoretical basis
Spark operators can be divided into two categories:
Transformation (conversion) operators: a transformation does not trigger the submission of a job; it only describes an intermediate step of the job. Transformation operations are deferred (lazy): transforming one RDD into another RDD is not executed immediately, but waits until an Action operation is encountered, which actually triggers the computation.
Action operators: these operators trigger SparkContext to submit a job. An Action operator causes Spark to submit the job and output the resulting data out of the Spark system.
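As a quick illustration of this deferred behavior, here is a minimal sketch (the variable names and app name are only for demonstration): the map call returns immediately without running anything, and only collect() triggers the job.

from pyspark import SparkContext

sc = SparkContext("local", "Lazy Demo")
rdd = sc.parallelize([1, 2, 3])
# Transformation: only records the computation, nothing runs yet
doubled = rdd.map(lambda x: x * 2)
# Action: triggers SparkContext to submit the job and return the data
print(doubled.collect())   # [2, 4, 6]
sc.stop()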
Level 1: Transformation - map
Task description
The task of this level: use Spark's map operator to complete the conversion operations according to the requirements.
Relevant knowledge
In order to complete this task, you need to master: how to use the map operator.
map
map passes each data item of the original RDD through a user-defined function f and maps it to a new element.
Each box in the figure represents an RDD partition. The partition on the left is mapped to a new RDD partition on the right through the user-defined function f: T -> U. In practice, however, this function f, together with the other functions in the same Stage, will not operate on the data until an Action operator is triggered.
map case
sc = SparkContext("local", "Simple App")
data = [1, 2, 3, 4, 5, 6]
rdd = sc.parallelize(data)
print(rdd.collect())
rdd_map = rdd.map(lambda x: x * 2)
print(rdd_map.collect())
Output:
[1, 2, 3, 4, 5, 6]
[2, 4, 6, 8, 10, 12]
Note: the elements (1, 2, 3, 4, 5, 6) of rdd1 are transformed into rdd2 (2, 4, 6, 8, 10, 12) by the map operator (x -> x * 2).
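map also accepts an ordinary named function instead of a lambda. A minimal sketch under that assumption (the function name double is illustrative):

from pyspark import SparkContext

def double(x):
    return x * 2

sc = SparkContext("local", "Simple App")
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
print(rdd.map(double).collect())   # [2, 4, 6, 8, 10, 12]
sc.stop()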
Programming requirements
Please read the code on the right carefully and complete the code in the Begin - End area according to the hints in the method. The specific tasks are as follows:
Requirement: use the map operator to convert the rdd data (1, 2, 3, 4, 5) according to the following rules:
An even number is converted to the square of the number; an odd number is converted to the cube of the number.
Test description
After completing the code, click Evaluate, and the platform will test the code you wrote. You pass when your result matches the expected output.
Start your mission. I wish you success!
Code
from pyspark import SparkContext

if __name__ == "__main__":
    #********** Begin **********#

    # 1. Initialize SparkContext, which is the entry point of a Spark program
    sc = SparkContext("local", "Simple App")

    # 2. Create a List of 1 to 5
    data = [1, 2, 3, 4, 5]

    # 3. Create an rdd through SparkContext parallelization
    rdd = sc.parallelize(data)

    # 4. Use rdd.collect() to collect the elements of the rdd
    print(rdd.collect())

    """
    Use the map operator to convert the rdd data (1, 2, 3, 4, 5) according to the following rules:
        an even number is converted to the square of the number;
        an odd number is converted to the cube of the number.
    """
    # 5. Use the map operator to complete the above requirement
    rdd_map = rdd.map(lambda x: x * x if x % 2 == 0 else x * x * x)

    # 6. Use rdd.collect() to collect the elements after the map transformation
    print(rdd_map.collect())

    # 7. Stop the SparkContext
    sc.stop()

    #********** End **********#
Level 2: Transformation - mapPartitions
Task description
The task of this level: use Spark's mapPartitions operator to complete the conversion operations according to the requirements.
Relevant knowledge
In order to complete this task, you need to master: how to use mapPartitions operator.
mapPartitions
mapPartitions passes the function an iterator over each partition, so the function can operate on all the elements of a whole partition at once through that iterator.
Each box in the figure represents an RDD partition. The partition on the left is mapped to a new RDD partition on the right through the user-defined function f: T - > U.
mapPartitions and map
map: a traversal operator that visits every element in the RDD; the traversal unit is a single record.
mapPartitions: a traversal operator whose traversal unit is a partition, that is, the data of a whole partition is loaded into memory and processed by one function call per partition.
So which of the two operators traverses an RDD more efficiently?
mapPartitions is generally more efficient, because the function is invoked once per partition instead of once per record.
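A common reason to prefer mapPartitions is that per-partition setup work is done once per partition instead of once per record. A minimal sketch, where make_connection and handle_partition are illustrative names and the "connection" is a stand-in for any expensive resource:

from pyspark import SparkContext

def make_connection():
    # placeholder for an expensive resource (e.g. a database connection)
    return {"open": True}

def handle_partition(iterator):
    conn = make_connection()   # executed once per partition
    for x in iterator:
        yield x * 2            # a real job would use conn here
    # release conn here if needed

if __name__ == "__main__":
    sc = SparkContext("local", "mapPartitions Demo")
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)               # 2 partitions
    print(rdd.mapPartitions(handle_partition).collect())      # [2, 4, 6, 8, 10, 12]
    sc.stop()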
mapPartitions case
from pyspark import SparkContext

def f(iterator):
    result = []
    for x in iterator:
        result.append(x * 2)
    return result

if __name__ == "__main__":
    sc = SparkContext("local", "Simple App")
    data = [1, 2, 3, 4, 5, 6]
    rdd = sc.parallelize(data)
    print(rdd.collect())
    partitions = rdd.mapPartitions(f)
    print(partitions.collect())
Output:
[1, 2, 3, 4, 5, 6]
[2, 4, 6, 8, 10, 12]
mapPartitions(): the function passed in receives an iterator over the elements of an RDD partition, and it must return an iterable (here a list, which is iterable, is returned).
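Because the function only needs to return something iterable, it can also be written as a generator with yield, which avoids building the whole list in memory first. A minimal sketch under that assumption:

from pyspark import SparkContext

def f(iterator):
    # yield results one by one instead of collecting them into a list
    for x in iterator:
        yield x * 2

sc = SparkContext("local", "Simple App")
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
print(rdd.mapPartitions(f).collect())   # [2, 4, 6, 8, 10, 12]
sc.stop()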
Programming requirements
Please read the code on the right carefully and supplement the code in the Begin - End area according to the tips in the method. The specific tasks are as follows:
Requirement: use mapPartitions operator to convert rdd data ("dog", "salmon", "salmon", "rat", "elephant") according to the following rules:
Combine each string with its length into a tuple, for example: dog --> (dog, 3), salmon --> (salmon, 6)
Test description
After supplementing the code, click the evaluation, and the platform will test the code you wrote. When your result is consistent with the expected output, it is passed.
Start your mission. I wish you success!
from pyspark import SparkContext

#********** Begin **********#
def f(iterator):
    result = []
    for x in iterator:
        result.append((x, len(x)))
    return result
#********** End **********#

if __name__ == "__main__":
    # 1. Initialize SparkContext, which is the entry point of a Spark program
    sc = SparkContext("local", "Simple App")

    # 2. Create a List with the contents ("dog", "salmon", "salmon", "rat", "elephant")
    data = ["dog", "salmon", "salmon", "rat", "elephant"]

    # 3. Create an rdd through SparkContext parallelization
    rdd = sc.parallelize(data)

    # 4. Use rdd.collect() to collect the elements of the rdd
    print(rdd.collect())

    """
    Use the mapPartitions operator to convert the rdd data ("dog", "salmon", "salmon", "rat", "elephant")
    according to the following rule:
        combine each string with its length into a tuple, for example:
        dog --> (dog, 3)    salmon --> (salmon, 6)
    """
    # 5. Use the mapPartitions operator to complete the above requirement
    partitions = rdd.mapPartitions(f)

    # 6. Use rdd.collect() to collect the elements after the mapPartitions transformation
    print(partitions.collect())

    # 7. Stop the SparkContext
    sc.stop()
Level 3: Transformation - filter
Task description
The task of this level: use Spark's filter operator to complete the conversion operations according to the requirements.
Relevant knowledge
In order to complete this task, you need to master: how to use the filter operator.
filter
The filter function is used to filter elements: the function f is applied to each element, elements for which f returns true are retained in the new RDD, and elements for which f returns false are filtered out. The internal implementation is equivalent to generating:
FilteredRDD(this, sc.clean(f))
The following code is the essential implementation of the function:
def filter(self, f):
    """
    Return a new RDD containing only the elements that satisfy a predicate.

    >>> rdd = sc.parallelize([1, 2, 3, 4, 5])
    >>> rdd.filter(lambda x: x % 2 == 0).collect()
    [2, 4]
    """
    def func(iterator):
        return filter(fail_on_stopiteration(f), iterator)
    return self.mapPartitions(func, True)
Each box in the figure above represents an RDD partition, and T can be of any type. The user-defined filter function f is applied to each data item, and the data items for which it returns true are retained. For example, V2 and V3 are filtered out and V1 is retained; to distinguish it, the retained item is written as V'1.
filter case
sc = SparkContext("local", "Simple App")
data = [1, 2, 3, 4, 5, 6]
rdd = sc.parallelize(data)
print(rdd.collect())
rdd_filter = rdd.filter(lambda x: x > 2)
print(rdd_filter.collect())
Output:
[1, 2, 3, 4, 5, 6]
[3, 4, 5, 6]
Note: rdd1 ([1, 2, 3, 4, 5, 6]) is converted into rdd2 ([3, 4, 5, 6]) by the filter operator (x -> x > 2).
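Since filter is itself a transformation, it can be chained with other transformations before a single action triggers the whole computation. A minimal sketch (keep the even numbers, then square them):

from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
# nothing runs until collect() is called
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(result.collect())   # [4, 16, 36]
sc.stop()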
Programming requirements
Please read the code on the right carefully and complete the code in the Begin - End area according to the hints in the method. The specific tasks are as follows:
Requirement: use the filter operator to filter the rdd data (1, 2, 3, 4, 5, 6, 7, 8) according to the following rule:
Filter out all odd numbers in the rdd.
Test description
After completing the code, click Evaluate, and the platform will test the code you wrote. You pass when your result matches the expected output.
Start your mission. I wish you success!
# -*- coding: UTF-8 -*-
from pyspark import SparkContext

if __name__ == "__main__":
    #********** Begin **********#

    # 1. Initialize SparkContext, which is the entry point of a Spark program
    sc = SparkContext("local", "Simple App")

    # 2. Create a List of 1 to 8
    data = [1, 2, 3, 4, 5, 6, 7, 8]

    # 3. Create an rdd through SparkContext parallelization
    rdd = sc.parallelize(data)

    # 4. Use rdd.collect() to collect the elements of the rdd
    print(rdd.collect())

    """
    Use the filter operator to convert the rdd data (1, 2, 3, 4, 5, 6, 7, 8)
    according to the following rule:
        filter out all odd numbers in the rdd
    """
    # 5. Use the filter operator to complete the above requirement
    rdd_filter = rdd.filter(lambda x: x % 2 == 0)

    # 6. Use rdd.collect() to collect the elements after the filter transformation
    print(rdd_filter.collect())

    # 7. Stop the SparkContext
    sc.stop()

    #********** End **********#
Level 4: Transformation - flatMap
Task description
The task of this level: use Spark's flatMap operator to complete the conversion operations according to the requirements.
Relevant knowledge
In order to complete this task, you need to master: how to use flatMap operator.
flatMap
flatMap converts each element of the original RDD into new elements through the function f, and flattens the elements of the generated collections into a single collection. Internally it creates:
FlatMappedRDD(this, sc.clean(f))
The figure above shows the flatMap operation on one RDD partition. The function passed to flatMap is f: T -> U, where T and U can be any data types. The data in the partition is converted into new data through the user-defined function f. The outer large box can be regarded as an RDD partition, and the small boxes represent collections. V1, V2 and V3, stored together in one collection (for example an array or other container) as a single RDD data item, are converted to V'1, V'2 and V'3; the original array or container is then disassembled, and each piece of the disassembled data becomes a separate data item in the new RDD.
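A typical use of flatMap is splitting each data item into several pieces and flattening all the pieces into one RDD, for example splitting lines of text into words. A minimal sketch (the sample sentences are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
lines = sc.parallelize(["hello spark", "hello world"])
# each line produces several words; flatMap merges them into one RDD
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())   # ['hello', 'spark', 'hello', 'world']
sc.stop()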
flatMap case
sc = SparkContext("local", "Simple App")
data = [["m"], ["a", "n"]]
rdd = sc.parallelize(data)
print(rdd.collect())
flat_map = rdd.flatMap(lambda x: x)
print(flat_map.collect())
Output:
[['m'], ['a', 'n']]
['m', 'a', 'n']
flatMap: flattens multiple collections into one collection.
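The difference from map is easy to see on the same input: map keeps the nested structure, while flatMap flattens it. A minimal sketch:

from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
rdd = sc.parallelize([["m"], ["a", "n"]])
print(rdd.map(lambda x: x).collect())       # [['m'], ['a', 'n']]  -- structure kept
print(rdd.flatMap(lambda x: x).collect())   # ['m', 'a', 'n']      -- structure flattened
sc.stop()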
Programming requirements
Please read the code on the right carefully and complete the code in the Begin - End area according to the hints in the method. The specific tasks are as follows:
Requirement: use the flatMap operator to convert the rdd data ([1, 2, 3], [4, 5, 6], [7, 8, 9]) according to the following rules:
Merge the elements in the RDD, for example: ([1,2,3],[4,5,6]) --> (1,2,3,4,5,6); ([2,3],[4,5],[6]) --> (2,3,4,5,6)
Test description
After completing the code, click Evaluate, and the platform will test the code you wrote. You pass when your result matches the expected output.
Start your mission. I wish you success!
# -*- coding: UTF-8 -*-
from pyspark import SparkContext

if __name__ == "__main__":
    #********** Begin **********#

    # 1. Initialize SparkContext, which is the entry point of a Spark program
    sc = SparkContext("local", "Simple App")

    # 2. Create a List of [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

    # 3. Create an rdd through SparkContext parallelization
    rdd = sc.parallelize(data)

    # 4. Use rdd.collect() to collect the elements of the rdd
    print(rdd.collect())

    """
    Use the flatMap operator to convert the rdd data ([1, 2, 3], [4, 5, 6], [7, 8, 9])
    according to the following rule:
        merge the elements in the RDD, for example:
        ([1,2,3],[4,5,6]) --> (1,2,3,4,5,6)
        ([2,3],[4,5],[6]) --> (2,3,4,5,6)
    """
    # 5. Use the flatMap operator to complete the above requirement
    flat_map = rdd.flatMap(lambda x: x)

    # 6. Use rdd.collect() to collect the elements after the flatMap transformation
    print(flat_map.collect())

    # 7. Stop the SparkContext
    sc.stop()

    #********** End **********#
Level 5: Transformation - distinct
Task description
The task of this level: use Spark's distinct operator to complete the relevant operations as required.
Relevant knowledge
In order to complete this task, you need to master: how to use the distinct operator.
distinct
distinct removes duplicate elements from the RDD.
Each box in the figure above represents an RDD partition, and the data is deduplicated through the distinct function. For example, after the duplicate values V1 and V1 are collapsed, only one copy of V1 is retained.
distinct case
sc = SparkContext("local", "Simple App")
data = ["python", "python", "python", "java", "java"]
rdd = sc.parallelize(data)
print(rdd.collect())
distinct = rdd.distinct()
print(distinct.collect())
Output:
['python', 'python', 'python', 'java', 'java']
['python', 'java']
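distinct needs to shuffle the data in order to compare elements across partitions; PySpark's distinct also accepts an optional numPartitions argument to control how many partitions the resulting RDD has. A minimal sketch (the element order in the output may vary):

from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
rdd = sc.parallelize(["python", "python", "java", "java"], 4)
deduped = rdd.distinct(2)              # deduplicate into 2 partitions
print(deduped.collect())               # e.g. ['python', 'java']
print(deduped.getNumPartitions())      # 2
sc.stop()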
Programming requirements
Please read the code on the right carefully and complete the code in the Begin - End area according to the hints in the method. The specific tasks are as follows:
Requirement: use the distinct operator to deduplicate the data in the rdd.
Test description
After completing the code, click Evaluate, and the platform will test the code you wrote. You pass when your result matches the expected output.
Start your mission. I wish you success!
# -*- coding: UTF-8 -*-
from pyspark import SparkContext

if __name__ == "__main__":
    #********** Begin **********#

    # 1. Initialize SparkContext, which is the entry point of a Spark program
    sc = SparkContext("local", "Simple App")

    # 2. Create a List with the contents (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)
    data = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]

    # 3. Create an rdd through SparkContext parallelization
    rdd = sc.parallelize(data)

    # 4. Use rdd.collect() to collect the elements of the rdd
    print(rdd.collect())

    """
    Use the distinct operator to convert the rdd data (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)
    according to the following rule:
        deduplicate the elements, for example:
        1,2,3,3,2,1 --> 1,2,3
        1,1,1,1     --> 1
    """
    # 5. Use the distinct operator to complete the above requirement
    distinctResult = rdd.distinct()

    # 6. Use rdd.collect() to collect the elements after the distinct transformation
    print(distinctResult.collect())

    # 7. Stop the SparkContext
    sc.stop()

    #********** End **********#