Introduction to Spark development

Posted by switchdoc on Sun, 09 Jan 2022 10:15:36 +0100

What is Spark

The whole Hadoop ecosystem is divided into distributed file system HDFS, computing framework MapReduce and resource scheduling framework Yan. However, with the development of the times, MapReduce's high-intensity disk IO, network communication frequency and dead write make it seriously slow down the operation speed of the whole ecosystem. Spark was born.

First, let's talk about how the original MapReduce slows down the pace:
For a task of word frequency statistics, the map task will first disperse the word frequency information of the current node, then land the temporary information on the disk, and then the reduce task will read the data from the disk for shuffling. Shuffling will map the same key to the same node performing the reduce task, which will involve a lot of network communication. After the reduce task is executed, the data will be landed on the disk again.

You can see that a MapReduce phase will go through at least two network IO+N network communications, which are also two main time-consuming tasks. Especially for neural network or graph computing, which requires frequent iteration, it will involve a lot of disk IO and network communication.

Spark improves this by using DAG graph and memory calculation.

DAG directed acyclic graph

When referring to DAG directed acyclic graph, we need to first mention its basic constituent element RDD, also known as elastic distributed data set. It will summarize and abstract the data files of all intermediate links in a unified way, so that all map s and reduce can be connected more smoothly. In the specific Spark development process, it is the description information of disk data. When an action operation is encountered, Spark will calculate the data according to the information in RDD.

There are two main operations in RDD: conversion and action. The transformation operation will start to build DAG directed acyclic graph between multiple RDDS, which describes the association relationship between RDDS. If so, we will read from the text file readme Read data from MD and create readme_rdd, this is the first node of a directed acyclic graph. Then we add a line to find the word "Python" in the file, and we will convert readme_rdd into python_rdd.

However, we should keep in mind that rdd is evaluated lazily. Transformation only records the operation relationship of data, and does not really perform operations. As for why, spark can combine some operations to reduce the steps of calculating data. If we start the action operation of rdd, spark will really execute according to the contents of this DAG.

However, MapReduce is not completely replaced by spark because of the memory capacity problem. Spark may not handle some problems properly

spark environment construction

Use the following command to download and build the spark environment

wget --no-check-certificate
tar zxvf spark-3.2.0-bin-hadoop3.2.tgz

Then you can use it directly

spar k needs to run in the jvm environment, so jdk1.0 is required Above version 8

Spark Development

Using pyspark

We can directly call pyspark provided by spark to execute the statement, which will initialize a SparkContext object sc by default, and we can directly call its method to create rdd.

lines = sc.textFile("")

If we want to write a python script to operate spark, we need to use the spark submit component provided by spark:



Spark initialization

To use Spark in python, you need to configure SparkConf, which mainly focuses on two parameters: the Url of the Spark cluster and the name of the driver service.

from pyspark import SparkConf, rdd
from pyspark.context import SparkContext

# spark configuration
conf = SparkConf().setMaster("local").setAppName("app")

Then you can create a SparkContext object based on this SparkConf. With this RDD, we can create an RDD and perform a series of operations.

sc = SparkContext(conf=conf)

Create RDD

There are three main input sources for RDD:

  • Local file system
file = sc.textFile("file:[Absolute path]")

Alternatively, we can directly use relative paths to create RDD S as in the following example:

lines = sc.textFile("")
  • Amazon S3
    To access S3 data in Spark, you must first set the S3 access credentials to AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variable
file = sc.textFile("s3n:[Absolute path]")
  • HDFS
file = sc.textFile("hdfs://[ip:port] / [absolute path] ")

Of course, in addition to reading from disk, it can also convert the data in memory into RDD:

memory_rdd = sc.parallelize(["pandas","china","Chinese","chinese food"])

Data reading and saving

The creation of RDD must be inseparable from data reading. It can be read in the following file formats:

  • text file
  • JSON
  • SequenceFile
  • protobuf

At present, only the first two common file formats are introduced.

text file

For local text files, we can directly use the just textFile. At the same time, pyspark also supports us to operate the entire folder. It will read the entire folder, then create it with the file name as the Key and the file content as the Value, and enter RDD. Here is an example:

Read directory,This will form key value pairs<file name,File content>
dir_rdd = sc.wholeTextFiles("file:///home/ik/software/spark/python_demo")
print(dir_rdd.keys().collect()) # Output key

The output is as follows:


It is consistent with the following results, but it will not recursively read the contents of the folder.

To save a file is to call the saveAsTextFile method, which will create a directory under the path.


There will be four files in this directory, and the file indicated by the red arrow is the content of the file.

Json file

The interface of the Json file is essentially consistent with the text file, but we need to call the Python loader.

json_rdd = sc.textFile("file:///home/ik/software/spark/python_demo/JsonTest1.json")
data = x:json.loads(x))

RDD conversion operation

RDD is read-only, so its conversion operation is actually to extract some data from the parent RDD and form a new RDD after conversion. This mainly involves two methods: map and filter.

  • map: receive a function, use this function for each element of RDD, and take the return result of the function as the value of the element in RDD
nums = sc.parallelize([1,2,3,4,5])
for number in x:x*x).collect():
  • Filter: receive a function to filter the data in RDD
python_line = lines.filter(lambda line:"Python" in line)
  • flatmap: it also receives a function, but it divides the original element into two elements according to the function rules
sentence = sc.parallelize(["hello world","hi world"])
words = sentence.flatMap(lambda line:line.split(" "))
print(words.collect())	# ['hello', 'world', 'hi', 'world']

RDD actions

Action operation will really trigger Spark operation, mainly through the following methods:

  • Count: returns the count result
lines = sc.textFile("")
  • take: it receives an integer operation that represents extracting several elements from the RDD. Then we can print these elements
# take get 10 data
rdd_set = python_line.take(10)
for line in rdd_set:
  • collect: it will get all the elements in the RDD, but note that if the object is too large, the memory will explode
# Transfer function, mainly filter
def filter_func(lines):
    return "Python" in lines

python_line2 = lines.filter(filter_func)
for line in python_line2.collect():

Collection operation

Collection operations are mainly as follows, so I won't introduce them more:

rdd1 = sc.parallelize([1,3,5,7,9,1])
rdd2 = sc.parallelize([2,4,6,8,1])
# distinct de duplication
# The intersection runtime returns only duplicate elements
print (rdd1.intersection(rdd2).collect())
# A subtract returns an element that exists only in the first rdd but not in the second rdd
# cartesian calculation of cartesian product

mysql read

  # mysql configuration (to be modified)
    prop = {'user': 'xxx', 
            'password': 'xxx', 
            'driver': 'com.mysql.cj.jdbc.Driver'}
    # database address (need to be modified)
    url = 'jdbc:mysql://host:port/database'
    # Read table
    data =, table='tb_newCity', properties=prop)
    # Print data type
    # Display data
    # Close spark session

Topics: Big Data Hadoop Spark