Spark operator - Python

Posted by blackcow on Fri, 22 Oct 2021 09:24:04 +0200

1. Theoretical basis

Spark operators can be divided into:

Transformation operators: a transformation does not trigger job submission; it only describes an intermediate step of the job. Transformation operations are deferred (lazy): transforming one RDD into another RDD is not executed immediately, but only when an Action operation finally triggers the computation.

Action operators: these operators make the SparkContext submit a job. An Action triggers Spark to submit the job and to output the resulting data out of the Spark system (for example, back to the driver or to storage).
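As a quick illustration of this laziness (a minimal sketch, not part of the graded exercises): the map call below only records the transformation, and nothing is computed until collect() is called.

    from pyspark import SparkContext

    sc = SparkContext("local", "Lazy Demo")
    rdd = sc.parallelize([1, 2, 3])
    # Transformation: nothing runs yet, Spark only records the lineage.
    doubled = rdd.map(lambda x: x * 2)
    # Action: collect() triggers the job and returns the result to the driver.
    print(doubled.collect())  # [2, 4, 6]
    sc.stop()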

Level 1: Transformation - map

Task description

The task of this level: use Spark's map operator to complete the required conversion.
Relevant knowledge

In order to complete this task, you need to master: how to use the map operator.
map

map transforms each data item of the original RDD into a new element through the user-defined function f.

Each box in the figure represents an RDD partition. A partition on the left is mapped to a new RDD partition on the right through the user-defined function f: T -> U. In practice, however, f (like the other functions in the same Stage) does not operate on the data until an Action operator is triggered.
map case

    from pyspark import SparkContext

    sc = SparkContext("local", "Simple App")
    data = [1, 2, 3, 4, 5, 6]
    rdd = sc.parallelize(data)
    print(rdd.collect())
    rdd_map = rdd.map(lambda x: x * 2)
    print(rdd_map.collect())

Output:

[1, 2, 3, 4, 5, 6]
[2, 4, 6, 8, 10, 12]

Note: the elements (1, 2, 3, 4, 5, 6) of rdd1 are transformed into rdd2 (2, 4, 6, 8, 10, 12) by the map operator (x -> x * 2).
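Because f: T -> U may return a different type than it receives, map can also change the element type. A small sketch (not part of the exercise):

    from pyspark import SparkContext

    sc = SparkContext("local", "Map Type Change")
    rdd = sc.parallelize([1, 2, 3])
    # f: T -> U may return a different type; here each int becomes an (int, str) tuple.
    pairs = rdd.map(lambda x: (x, "even" if x % 2 == 0 else "odd"))
    print(pairs.collect())  # [(1, 'odd'), (2, 'even'), (3, 'odd')]
    sc.stop()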
Programming requirements

Please read the code below carefully and complete the code in the Begin - End area according to the hints in the comments. The specific task is as follows:

Requirement: use the map operator to convert the rdd data (1, 2, 3, 4, 5) according to the following rules:

An even number is converted to the square of the number;
An odd number is converted to the cube of the number.

Test description

After completing the code, click Evaluate; the platform will test your code, and you pass when your output matches the expected output.

Start your mission. I wish you success!

code

from pyspark import SparkContext

if __name__ == "__main__":
     #********** Begin **********#
    # 1. Initialize SparkContext, which is the entry of Spark Program
    sc = SparkContext("local", "Simple App")
    # 2. Create a List of 1 to 5
    data = [1, 2, 3, 4, 5]
    # 3. Create rdd through SparkContext parallelization
    rdd = sc.parallelize(data)
    # 4. Use rdd.collect() to collect the elements of rdd.
    print(rdd.collect())
    """
    use map Operator, will rdd Data (1, 2, 3, 4, 5) Perform conversion according to the following rules::
    Requirements:
        Even numbers are converted to the square of the number
        The odd number is converted to the cube of the number
    """
    # 5. Use the map operator to complete the above requirements
    rdd_map = rdd.map(lambda x: x * x if x % 2 == 0 else x * x * x)
    # 6. Use rdd.collect() to collect the elements that complete the map transformation
    print(rdd_map.collect())
    # 7. Stop SparkContext
    sc.stop()
    #********** End **********#

Level 2: Transformation - mapPartitions

Task description

The task of this level: use Spark's mapPartitions operator to complete the required conversion.
Relevant knowledge

In order to complete this task, you need to master: how to use mapPartitions operator.
mapPartitions

mapPartitions passes an iterator over each partition to the user-defined function, so the function can operate on all the elements of one partition at a time through that iterator.

Each box in the figure represents an RDD partition. A partition on the left is mapped to a new RDD partition on the right through the user-defined function f: T -> U.
mapPartitions and map

map: a traversal operator that visits every element in the RDD; the unit of traversal is a single record.

mapPartitions: a traversal operator whose unit of traversal is the partition; the data of a whole partition is loaded (and exposed to the function as an iterator) before traversal, and the function can also change the structure of the RDD.

So which of the two operators traverses an RDD more efficiently? Generally mapPartitions, because the user function is invoked once per partition rather than once per element, which cuts function-call overhead, at the price of handling a whole partition's data at a time (see the sketch below).
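The gain is most visible when each call needs some setup work, since that setup then runs once per partition. A minimal sketch (expensive_setup is a hypothetical placeholder for things like opening a database connection):

    from pyspark import SparkContext

    def expensive_setup():
        # Hypothetical placeholder for per-partition setup work
        # (e.g. opening a database connection or loading a model).
        return lambda x: x * 2

    def process_partition(iterator):
        transform = expensive_setup()  # runs once per partition
        for x in iterator:
            yield transform(x)         # runs once per element

    if __name__ == "__main__":
        sc = SparkContext("local", "mapPartitions Setup Demo")
        rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)  # 2 partitions
        print(rdd.mapPartitions(process_partition).collect())  # [2, 4, 6, 8, 10, 12]
        sc.stop()

The trade-off noted above is memory: working a partition at a time means a larger unit of data is handled in one go, so very large partitions are the main risk.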
mapPartitions case

from pyspark import SparkContext

def f(iterator):
    # Collect the doubled value of every element in the partition.
    result = []
    for x in iterator:
        result.append(x * 2)
    return result

if __name__ == "__main__":
    sc = SparkContext("local", "Simple App")
    data = [1, 2, 3, 4, 5, 6]
    rdd = sc.parallelize(data)
    print(rdd.collect())
    partitions = rdd.mapPartitions(f)
    print(partitions.collect())

Output:

[1, 2, 3, 4, 5, 6]
[2, 4, 6, 8, 10, 12]

mapPartitions(): the function passed in receives an iterator over the elements of one partition of the rdd, and it must return an iterable, for example a list or a generator (as sketched below).
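Instead of building a list, the function can equally well be written as a generator, which is also an iterator and avoids materializing the whole partition at once; a sketch equivalent to the case above:

    def f(iterator):
        # Yield results one by one instead of collecting them into a list.
        for x in iterator:
            yield x * 2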
Programming requirements

Please read the code below carefully and complete the code in the Begin - End area according to the hints in the comments. The specific task is as follows:

Requirement: use mapPartitions operator to convert rdd data ("dog", "salmon", "salmon", "rat", "elephant") according to the following rules:

Combine each string with its length into a tuple, for example:

    dog  -->  (dog,3)
    salmon   -->  (salmon,6)

Test description

After completing the code, click Evaluate; the platform will test your code, and you pass when your output matches the expected output.

Start your mission. I wish you success!

from pyspark import SparkContext

#********** Begin **********#
def f(iterator):
    # Pair every string in the partition with its length.
    result = []
    for x in iterator:
        result.append((x, len(x)))
    return result


#********** End **********#

if __name__ == "__main__":
    # 1. Initialize SparkContext, which is the entry of Spark Program
    sc = SparkContext("local", "Simple App")
    # 2. Create a List with the contents ("dog", "salmon", "salmon", "rat", "elephant")
    data = ["dog", "salmon", "salmon", "rat", "elephant"]
    # 3. Create rdd through SparkContext parallelization
    rdd = sc.parallelize(data)
    # 4. Use rdd.collect() to collect the elements of rdd.
    print(rdd.collect())
    """
    use mapPartitions Operator, will rdd Data ("dog", "salmon", "salmon", "rat", "elephant") Perform conversion according to the following rules::
    Requirements:
        Combine the string with the length of the string into a tuple, for example:
        dog  -->  (dog,3)
        salmon   -->  (salmon,6)
    """
    # 5. Use mapPartitions operator to complete the above requirements
    partitions = rdd.mapPartitions(f)
    # 6. Use rdd.collect() to collect the elements that complete mapPartitions conversion
    print(partitions.collect())
    # 7. Stop SparkContext
    sc.stop()


Level 3: Transformation - filter

Task description

The task of this level: use Spark's filter operator to complete the required conversion.
Relevant knowledge

In order to complete this task, you need to master: how to use the filter operator.
filter

The filter function is used to filter elements: the function f is applied to each element, elements for which it returns true are kept in the new RDD, and elements for which it returns false are filtered out. The internal implementation is equivalent to generating:

FilteredRDD(this,sc.clean(f))

The following code is the essential implementation of the function:

   def filter(self, f):
        """
        Return a new RDD containing only the elements that satisfy a predicate.
        >>> rdd = sc.parallelize([1, 2, 3, 4, 5])
        >>> rdd.filter(lambda x: x % 2 == 0).collect()
        [2, 4]
        """
        def func(iterator):
            return filter(fail_on_stopiteration(f), iterator)
        return self.mapPartitions(func, True)

Each box in the figure represents an RDD partition, and T can be of any type. The user-defined filter function f is applied to each data item, and only the items for which it returns true are retained. For example, V2 and V3 are filtered out while V1 is retained and, to mark the distinction, renamed V'1.
filter case

    from pyspark import SparkContext

    sc = SparkContext("local", "Simple App")
    data = [1, 2, 3, 4, 5, 6]
    rdd = sc.parallelize(data)
    print(rdd.collect())
    rdd_filter = rdd.filter(lambda x: x > 2)
    print(rdd_filter.collect())

Output:

[1, 2, 3, 4, 5, 6]
[3, 4, 5, 6]

Note: rdd1 ([1, 2, 3, 4, 5, 6]) is converted into rdd2 ([3, 4, 5, 6]) by the filter operator (x -> x > 2).
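The predicate does not have to be a lambda; filter also accepts a named function, and because it returns a new RDD it can be chained with other transformations. A small sketch (not part of the exercise):

    from pyspark import SparkContext

    def is_even(x):
        # Keep the element only when this predicate returns True.
        return x % 2 == 0

    sc = SparkContext("local", "Filter Demo")
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
    # filter keeps the even numbers, then map scales them.
    result = rdd.filter(is_even).map(lambda x: x * 10)
    print(result.collect())  # [20, 40, 60]
    sc.stop()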
Programming requirements

Please read the code below carefully and complete the code in the Begin - End area according to the hints in the comments. The specific task is as follows:

Requirement: use the filter operator to filter the data (1, 2, 3, 4, 5, 6, 7, 8) in the rdd according to the following rule:

Filter out all odd numbers in the rdd (keep only the even numbers).

Test description

After completing the code, click Evaluate; the platform will test your code, and you pass when your output matches the expected output.

Start your mission. I wish you success!

# -*- coding: UTF-8 -*-
from pyspark import SparkContext

if __name__ == "__main__":
   #********** Begin **********#
    # 1. Initialize SparkContext, which is the entry of Spark Program
    sc = SparkContext("local", "Simple App")
    # 2. Create a List of 1 to 8
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    # 3. Create rdd through SparkContext parallelization
    rdd = sc.parallelize(data)
    # 4. Use rdd.collect() to collect the elements of rdd.
    print(rdd.collect())
    """
    use filter Operator, will rdd Data (1, 2, 3, 4, 5, 6, 7, 8) Perform conversion according to the following rules::
    Requirements:
        Filter out rdd Odd number in
    """
    # 5. Use the filter operator to complete the above requirements
    rdd_filter = rdd.filter(lambda x: x % 2 == 0)
    # 6. Use rdd.collect() to collect the elements that complete the filter transformation
    print(rdd_filter.collect())
    # 7. Stop SparkContext
    sc.stop()
    #********** End **********#

Level 4: Transformation - flatMap

Task description

The task of this level: use Spark's flatMap operator to complete the required conversion.
Relevant knowledge

In order to complete this task, you need to master: how to use flatMap operator.
flatMap

flatMap converts each element of the original RDD into new elements through the function f, and then merges the elements of all the generated collections into a single collection. Internally it creates:

FlatMappedRDD(this,sc.clean(f))

The figure shows the flatMap operation on one RDD partition. The function passed to flatMap is f: T -> U, where T and U can be any data types. The data in the partition is converted into new data through the user-defined function f. The outer large box can be regarded as an RDD partition and each small box as a collection: V1, V2 and V3, stored together as an array or another container, form one data item of the RDD. After they are converted into V'1, V'2 and V'3, the original array or container is dismantled and each piece of the dismantled data becomes a separate data item of the new RDD.
flatMap case

    from pyspark import SparkContext

    sc = SparkContext("local", "Simple App")
    data = [["m"], ["a", "n"]]
    rdd = sc.parallelize(data)
    print(rdd.collect())
    flat_map = rdd.flatMap(lambda x: x)
    print(flat_map.collect())

Output:

[['m'], ['a', 'n']]
['m', 'a', 'n']

flatMap: merges the nested collections into one flat collection.
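A common way to see the difference from map: with the same splitting function, map keeps one output element per input element, while flatMap flattens the results into a single collection. A small sketch (not part of the exercise):

    from pyspark import SparkContext

    sc = SparkContext("local", "flatMap vs map")
    rdd = sc.parallelize(["hello world", "hi spark"])
    # map: one output element (a list of words) per input line.
    print(rdd.map(lambda line: line.split(" ")).collect())
    # [['hello', 'world'], ['hi', 'spark']]
    # flatMap: the word lists are flattened into a single collection.
    print(rdd.flatMap(lambda line: line.split(" ")).collect())
    # ['hello', 'world', 'hi', 'spark']
    sc.stop()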
Programming requirements

Please read the code below carefully and complete the code in the Begin - End area according to the hints in the comments. The specific task is as follows:

Requirement: use the flatMap operator to convert the rdd data ([1, 2, 3], [4, 5, 6], [7, 8, 9]) according to the following rules:

Merge the elements of the rdd, for example:

    ([1,2,3],[4,5,6])  -->  (1,2,3,4,5,6)
    ([2,3],[4,5],[6])  -->  (2,3,4,5,6)

Test description

After completing the code, click Evaluate; the platform will test your code, and you pass when your output matches the expected output.

Start your mission. I wish you success!

# -*- coding: UTF-8 -*-
from pyspark import SparkContext

if __name__ == "__main__":
    #********** Begin **********#
       
    # 1. Initialize SparkContext, which is the entry of Spark Program
    sc = SparkContext("local", "Simple App")
 
    # 2. Create a List of [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
 
    # 3. Create rdd through SparkContext parallelization
    rdd = sc.parallelize(data)
 
    # 4. Use rdd.collect() to collect the elements of rdd.
    print(rdd.collect())
 
    """
        use flatMap Operator, will rdd Data ([1, 2, 3], [4, 5, 6], [7, 8, 9]) Perform conversion according to the following rules::
        Requirements:
            merge RDD For example:
                            ([1,2,3],[4,5,6])  -->  (1,2,3,4,5,6)
                            ([2,3],[4,5],[6])  -->  (1,2,3,4,5,6)
        """
    # 5. Use the flatMap operator to complete the above requirements
    flat_map = rdd.flatMap(lambda x: x)
 
    # 6. Use rdd.collect() to collect the elements that complete the flatMap transformation
    print(flat_map.collect())
 
    # 7. Stop SparkContext
    sc.stop()
 
    #********** End **********#

Level 5: Transformation - distinct

Task description

The task of this level: use Spark's distinct operator to complete the required operations.
Relevant knowledge

In order to complete this task, you need to master: how to use the distinct operator.
distinct

distinct removes duplicate elements from the RDD.

Each box in the figure represents an RDD partition, and the data is deduplicated through the distinct function. For example, of the duplicate items V1 and V1, only one copy is retained.
distinct case

    from pyspark import SparkContext

    sc = SparkContext("local", "Simple App")
    data = ["python", "python", "python", "java", "java"]
    rdd = sc.parallelize(data)
    print(rdd.collect())
    distinct = rdd.distinct()
    print(distinct.collect())

Output:

['python', 'python', 'python', 'java', 'java']
['python', 'java']
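distinct is implemented with a shuffle, so the elements must be hashable (tuples are fine) and the output order is not guaranteed; an optional numPartitions argument controls the parallelism of that shuffle. A small sketch (not part of the exercise):

    from pyspark import SparkContext

    sc = SparkContext("local", "Distinct Demo")
    # Tuples are hashable, so key/value pairs can be deduplicated as well.
    rdd = sc.parallelize([("a", 1), ("a", 1), ("b", 2)])
    print(rdd.distinct().collect())   # [('a', 1), ('b', 2)] (order may vary)
    # numPartitions controls how many partitions the deduplicated RDD has.
    print(rdd.distinct(2).collect())
    sc.stop()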

Programming requirements

Please read the code below carefully and complete the code in the Begin - End area according to the hints in the comments. The specific task is as follows:

Requirement: use the distinct operator to deduplicate the data in the rdd.
Test description

After completing the code, click Evaluate; the platform will test your code, and you pass when your output matches the expected output.

Start your mission. I wish you success!

# -*- coding: UTF-8 -*-
from pyspark import SparkContext

if __name__ == "__main__":
   #********** Begin **********#
    # 1. Initialize SparkContext, which is the entry of Spark Program
    sc = SparkContext("local", "Simple App")
    # 2. Create a List with contents of (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)
    data = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
    # 3. Create rdd through SparkContext parallelization
    rdd = sc.parallelize(data)
    # 4. Use rdd.collect() to collect the elements of rdd
    print(rdd.collect())
    """
       use distinct Operator, will rdd Data (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1) Perform conversion according to the following rules::
       Requirements:
           Element de duplication, for example:
                        1,2,3,3,2,1  --> 1,2,3
                        1,1,1,1,     --> 1
       """
    # 5. Use distinct operator to complete the above requirements
    distinctResult = rdd.distinct()
    # 6. Use rdd.collect() to collect the elements that complete the distinct transformation
    print(distinctResult.collect())
    # 7. Stop SparkContext
    sc.stop()
    #********** End **********#

Topics: Python Big Data Spark