Basic operation of Spark Streaming in PySpark

Posted by andrew10181 on Fri, 11 Feb 2022 04:41:05 +0100

Preface

Stream data has the following characteristics:
• data arrives quickly and continuously, and its potential size may be unbounded
• data sources are numerous and formats are complex
• the volume of data is large, but little attention is paid to storage: once processed, data is either discarded or archived
• the focus is on the overall value of the data rather than on individual records
• data may arrive out of order or incomplete; the system cannot control the order in which the data to be processed arrives

Stream computing (the value of data decreases as time passes):
Acquire massive data from different data sources in real time, and extract valuable information through real-time analysis and processing

Stream computing processing flow (emphasizing real-time behavior):
Real-time data collection -> real-time data computation -> real-time query service

  • Real-time data collection: massive data from multiple data sources is collected; real-time behavior, low latency, stability and reliability must be guaranteed
  • Real-time data computation: the collected data is analyzed and computed in real time, and the results are fed back in real time
  • Real-time query service: the results produced by the stream computing framework can be queried, displayed or stored by users in real time

A stream processing system differs from a traditional data processing system as follows:

  • A stream processing system handles real-time data, while a traditional data processing system handles pre-stored static data
  • Users get real-time results from a stream processing system, whereas a traditional data processing system returns results about some moment in the past
  • A stream processing system does not require users to actively issue queries; its real-time query service can actively push real-time results to users

Spark Streaming operation:
Spark Streaming can integrate multiple input data sources, such as Kafka, Flume, HDFS, and even ordinary TCP sockets. The processed data can be stored in a file system or a database, or displayed on a dashboard.

Basic principle of Spark Streaming:
The real-time input data stream is split into time slices (at second granularity), and each slice of data is then processed by the Spark engine in a batch-like manner (pseudo real-time, second-level response).

The main abstraction of Spark Streaming:
DStream (Discretized Stream) represents a continuous stream of data. Internally, the input data of Spark Streaming is divided into segments according to the time slice (e.g. 1 second), and each segment of data is converted into an RDD in Spark. These segments are the DStream, and operations on a DStream are ultimately converted into operations on the corresponding RDDs.

Basic operation

The basic steps for writing a Spark Streaming program are (a minimal skeleton follows this list):

  1. Define the input source by creating an input DStream
  2. Define the stream computation by applying transformation and output operations to the DStream
  3. Call StreamingContext.start() to start receiving and processing data
  4. Call StreamingContext.awaitTermination() to wait for processing to end (either manually or because of an error)
  5. StreamingContext.stop() can be called to end the stream computation manually
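A minimal skeleton putting these five steps together (a sketch only; the socket source, host/port, app name and batch interval are placeholder choices, following the word-count pattern used later in this article):

import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# 1. Create the context and an input DStream (socket source as an example)
sc = SparkContext(master="local[2]", appName="SkeletonExample")
ssc = StreamingContext(sc, 1)                       # 1-second batch interval
lines = ssc.socketTextStream("localhost", 9999)     # placeholder host/port

# 2. Define the stream computation with transformations and an output operation
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
counts.pprint()

# 3. Start receiving and processing data
ssc.start()
# 4. Wait for processing to finish (here: at most 10 seconds)
ssc.awaitTermination(timeout=10)
# 5. Stop the streaming computation manually
ssc.stop(stopSparkContext=True, stopGraceFully=True)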

RDD queue stream

Knowledge points:
Differences among foreachRDD, foreachPartition and foreach in Spark:
The main difference is their scope: foreachRDD acts on the RDD of each batch interval in the DStream, foreachPartition acts on each partition of the RDD of each batch interval, and foreach acts on each element of the RDD of each batch interval.
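A minimal sketch of the three levels, using a one-batch DStream built from an RDD queue (the same queueStream technique as the program below). Opening one database connection per partition is the usual reason to prefer foreachPartition for database output:

import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(master="local", appName="ForeachLevels")
ssc = StreamingContext(sc, 1)

# A one-batch DStream built from an RDD queue, just to have something to act on
stream = ssc.queueStream([sc.parallelize(range(10), 2)])

def handle_element(element):
    # foreach level: called once per element (rdd.foreach(handle_element) would do the same)
    print(element)

def handle_partition(iterator):
    # foreachPartition level: called once per partition with an iterator over its
    # elements -- the natural place to open a single database connection
    for element in iterator:
        handle_element(element)

def handle_rdd(rdd):
    # foreachRDD level: called once per batch interval with that batch's whole RDD
    rdd.foreachPartition(handle_partition)

stream.foreachRDD(handle_rdd)

ssc.start()
ssc.awaitTermination(timeout=5)
ssc.stop(stopSparkContext=True, stopGraceFully=False)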
The process of writing the output to a MySQL table wordcount(word, count) is shown below:

%%writefile rdd_queue.py

import findspark  
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql.session import SparkSession
import time
import os
import pymysql

# Environment configuration
conf = SparkConf().setAppName("RDD Queue").setMaster("local")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 2)


# Create RDD data stream
rddQueue = []
for i in range(5):
    rddQueue += [ssc.sparkContext.parallelize([j for j in range(100 + i)], 10)]
    time.sleep(1)
    
inputStream = ssc.queueStream(rddQueue)
reducedStream = inputStream.map(lambda x: (x % 10, 1)).reduceByKey(lambda x, y: x + y)
# Method 1: save the values to text files under the given path
# reducedStream.saveAsTextFiles("file://" + os.getcwd() + "/process/") 
# Method 2: save data to mysql database
def db_func(records):

    # Connect to the MySQL database
    print("Connecting to database!")
    db = pymysql.connect(host='localhost', user='root', passwd='123456', db='spark')
    cursor = db.cursor()

    def do_insert(p):
        # Build and execute the insert statement for one (word, count) pair
        print("Inserting data!")
        sql = "insert into wordcount(word, count) values ('%s', '%s')" % (str(p[0]), str(p[1]))
        try:
            cursor.execute(sql)
            db.commit()
        except:
            # Roll back on error
            db.rollback()

    for item in records:
        do_insert(item)
    db.close()
            
def func(rdd):
    
    repartitionedRDD = rdd.repartition(3)
    repartitionedRDD.foreachPartition(db_func)

reducedStream.pprint()
reducedStream.foreachRDD(func)

ssc.start()
time.sleep(6)  # Let a few batches be processed before stopping
ssc.stop(stopSparkContext=True, stopGraceFully=True)
%run rdd_queue.py
-------------------------------------------
Time: 2021-05-11 15:01:30
-------------------------------------------
(0, 10)
(1, 10)
(2, 10)
(3, 10)
(4, 10)
(5, 10)
(6, 10)
(7, 10)
(8, 10)
(9, 10)

-------------------------------------------
Time: 2021-05-11 15:01:32
-------------------------------------------
(0, 11)
(1, 10)
(2, 10)
(3, 10)
(4, 10)
(5, 10)
(6, 10)
(7, 10)
(8, 10)
(9, 10)

Word frequency statistics (stateless operation)

Note: with stateless transformations, each count only covers the word frequencies of the words that arrive in the current batch; it has nothing to do with previous batches and is not accumulated.
The program and a sample of its output:

%%writefile sscrun.py
import findspark  
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create environment
sc = SparkContext(master="local[2]", appName="NetworkWordCount")
ssc = StreamingContext(sc, 1)  # Monitor data every second

# Listen for port data (socket data source)
lines = ssc.socketTextStream("localhost", 9999)
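# Test tip: in another terminal, start a simple text server, e.g. `nc -lk 9999`,
# and type words into it; each line typed arrives as one element of `lines`.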
# Split the lines into words
words = lines.flatMap(lambda line: line.split(" "))
# Count word frequencies
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print each batch's counts
wordCounts.pprint()

# Start the streaming computation
ssc.start()
ssc.awaitTermination(timeout=10)
%run sscrun.py
-------------------------------------------
Time: 2021-05-11 14:46:11
-------------------------------------------
('', 1)
('1', 1)
('\x1b[F', 1)
('\x1b[F1', 1)

Word frequency statistics (stateful operation)

Note: with stateful transformation operations, the word frequency statistics of the current batch are continuously accumulated on top of the statistics of previous batches, so the final counts are the cumulative word frequencies over all batches.

Update function:
words.updateStateByKey(updateFunc, numPartitions=None, initialRDD=None)
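As a quick plain-Python illustration (not Spark code) of the calling convention: for every key in each batch, updateStateByKey passes the list of new values that arrived for that key together with the key's previous state (None the first time), and keeps whatever the function returns as the new state:

def updatefun(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)

# Batch 1: the word "spark" appears twice -> new_values = [1, 1], no previous state
print(updatefun([1, 1], None))   # 2
# Batch 2: "spark" appears once more  -> new_values = [1], previous state is 2
print(updatefun([1], 2))         # 3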
%%writefile calculator.py

import findspark  
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import os 
# Create environment
sc = SparkContext(master="local[2]", appName="NetworkWordCounts")
ssc = StreamingContext(sc, 3)  # 3-second batch interval
# Listen for port data (socket data source)
lines = ssc.socketTextStream("localhost", 9999)
# Split the lines into words
words = lines.flatMap(lambda line: line.split(" "))

# Set checkpoint directory (required for stateful operations; written under the current working directory)
ssc.checkpoint("file://" + os.getcwd() + "/checkpoint")

# State update function
def updatefun(new_values, last_sum):

    # Add the values that arrived for a key in the current batch to the key's previous state
    # e.g. (first batch): new_values = [1, 1, 1], last_sum = None
    return sum(new_values) + (last_sum or 0)  # Treat a missing previous state as zero

# Count word frequencies
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.updateStateByKey(updateFunc=updatefun)
wordCounts.pprint()

ssc.start()
ssc.awaitTermination()
%run calculator.py
-------------------------------------------
Time: 2021-05-10 15:09:45
-------------------------------------------
('1', 5)
('12', 3)
('54', 1)
('4', 1)
('11', 1)
('6', 1)
('5', 1)

Windows (window transformation operations)

Function signature:
lines.reduceByKeyAndWindow(
    func,                 # Reduce function (combines values entering the window)
    invFunc,              # Inverse reduce function (removes values leaving the window)
    windowDuration,       # Window length
    slideDuration=None,   # Slide interval (how often the window moves)
    numPartitions=None,
    filterFunc=None,
)
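A brief numeric sketch (plain Python, not Spark code) of why the inverse function helps: when the window slides, the new window result can be updated incrementally from the previous one instead of re-reducing every batch in the window:

old_window_count = 10     # count for some key over the previous window
new_batch_count = 3       # count in the batch that just entered the window
expired_batch_count = 2   # count in the batch that just slid out of the window

# func adds the entering batch, invFunc removes the expired one
new_window_count = old_window_count + new_batch_count - expired_batch_count
print(new_window_count)   # 11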

Calculation diagram: (figure not shown)

%%writefile top.py

import findspark  
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.session import SparkSession

def get_countryname(line):
    country_name = line.strip()  # Eliminate spaces

    if country_name == 'usa':
        output = 'USA'
    elif country_name == 'ind':
        output = 'India'
    elif country_name == 'aus':
        output = 'Australia'
    else:
        output = 'Unknown'

    return (output, 1)

# Set parameters
batch_interval = 1
window_length = 6 * batch_interval  # Window length
frequency = 3 * batch_interval      # Slide interval

sc = SparkContext(master="local[2]", appName="NetworkWordCount")
ssc = StreamingContext(sc, batch_interval)

ssc.checkpoint("checkpoint")
# Monitor port data
lines = ssc.socketTextStream("localhost", 9999)

addFunc = lambda x, y: x+y
invAddFunc = lambda x, y: x-y
word_counts = lines.map(get_countryname).reduceByKeyAndWindow(addFunc, invAddFunc, window_length, frequency)
word_counts.pprint()

ssc.start()
ssc.awaitTermination()
%run top.py

References

Interpretation of foreachRDD, foreachPartition and foreach in Spark
Interaction between Python and MySQL
