PySpark in Practice: Python Big Data Processing, Part 3

Posted by cunoodle2 on Sat, 29 Jan 2022 14:37:23 +0100

Shared variables

  • broadcast variable

    • Broadcast variables let a program cache a read-only variable on each machine in the cluster instead of shipping a copy with every task. They are an efficient way to share data such as a global configuration.

    • from pyspark.sql import SparkSession
      spark = SparkSession.builder.master("local[*]").appName("RDD Demo").getOrCreate()
      sc = spark.sparkContext
      conf = {"ip": "192.168.1.1", "key": "cumt"}
      # Create a broadcast variable
      brVar = sc.broadcast(conf)
      a = brVar.value  # Get the broadcast variable's value
      print(a)         # {'ip': '192.168.1.1', 'key': 'cumt'}
      print(a["key"])  # cumt
      # To "update" a broadcast variable, unpersist it (removing cached copies on the executors), then broadcast again
      brVar.unpersist()
      conf["key"] = "jackwang"
      brVar = sc.broadcast(conf)  # Broadcast again
      a = brVar.value  # Get the new broadcast value
      print(a)  # {'ip': '192.168.1.1', 'key': 'jackwang'}
      # destroy() removes the broadcast data and metadata; the variable cannot be used afterwards
      brVar.destroy()
      
  • accumulator

    • An accumulator is a variable that can only be updated through an associative add operation, which makes it efficient to maintain in parallel; it is handy for counting events during job execution, e.g. while debugging. Tasks on different nodes add to it with the add method (or +=). To get an accurate value, trigger only one action on the RDD; if the RDD must be reused, cache or persist it first so the lineage is not recomputed (which would add to the accumulator again).

    • rdd = sc.range(1, 101)
      # Create an accumulator with initial value 0
      acc = sc.accumulator(0)
      def fcounter(x):
          global acc
          if x % 2 == 0:
              acc += 1
              # Accumulators only support addition:
              # acc -= 1  would raise "unsupported operand type(s) for -="
      rdd_counter = rdd.map(fcounter)
      print(acc.value)  # 0: map is lazy, so fcounter has not been executed yet
      # Persist so that repeated actions do not re-run fcounter and inflate the accumulator
      rdd_counter.persist()
      print(rdd_counter.count())  # 100
      print(acc.value)            # 50
      print(rdd_counter.count())  # 100
      print(acc.value)            # 50
      

DataFrames and Spark SQL

  • A DataFrame holds both the data itself and its schema (structure information)
  • Supports nested data types such as arrays and maps (see the sketch after this list)
  • Can be built on top of many different data sources
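
As a quick illustration of the nested-type support mentioned above, here is a minimal sketch that builds a DataFrame with an array column and a map column; the column names and sample rows are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, MapType)

spark = SparkSession.builder.master("local[1]").appName("NestedTypes").getOrCreate()

# Hypothetical rows: a name, a list of scores, and a map of string tags
data = [("Jack", [90, 85], {"team": "A"}),
        ("Smith", [75], {"team": "B"})]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True),
    StructField("tags", MapType(StringType(), StringType()), True)])

df = spark.createDataFrame(data, schema)
df.printSchema()
# Expected schema (approximately):
# root
#  |-- name: string (nullable = true)
#  |-- scores: array (nullable = true)
#  |    |-- element: integer (containsNull = true)
#  |-- tags: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: string (valueContainsNull = true)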

Create DataFrames

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[1]').appName('DataFrames').getOrCreate()
sc = spark.sparkContext

# Create a DataFrame from a list
a = [('Jack', 32),('Smith', 33)]
df = spark.createDataFrame(a)
print(df.collect()) #[Row(_1='Jack', _2=32), Row(_1='Smith', _2=33)]
df.show()
+-----+---+
|   _1| _2|
+-----+---+
| Jack| 32|
|Smith| 33|
+-----+---+

# Add column names
df2 = spark.createDataFrame(a, ['name', 'age'])
print(df2.collect())	#[Row(name='Jack', age=32), Row(name='Smith', age=33)]
df2.show() 
+-----+---+
| name|age|
+-----+---+
| Jack| 32|
|Smith| 33|
+-----+---+

# DataFrame with an explicit schema (typed columns)
from pyspark.sql.types import *
a = [('Jack', 32),('Smith', 33)]
rdd = sc.parallelize(a)
# StructField(name, dataType, nullable)
schema = StructType([
   StructField("name", StringType(), True),
   StructField("age", IntegerType(), True)])
df = spark.createDataFrame(rdd, schema)
# Simplified version
df2 = spark.createDataFrame(rdd,"name:string,age:int")
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)

Basic usage of Spark SQL

  • spark.sql() runs a SQL query
  • spark.udf.register() registers a custom function (UDF)
  • df.select().where() filters rows and columns
  • df.agg({"field": "aggregate function"}).show() aggregates a column
  • df.describe() shows summary statistics (count, mean, stddev, min, max)
  • spark.read.format('json').load() reads a file
  • df.join(deptDf, peopleDf.deptId == deptDf.id, 'inner') joins two DataFrames
  • df.na.fill({'name': 'unknown', 'salary': 0}).show() fills null values
  • df.withColumn("timestamp", df.Id.cast("timestamp")) changes a column's type

# Query with SQL
a = [('Jack', 32),('Smith', 33),('Li Si', 36)]
rdd = sc.parallelize(a)
df = spark.createDataFrame(rdd, "name: string, age: int")
df.createOrReplaceTempView("user") # Create a temporary view

df2 = spark.sql("select count(*) as counter from user")
df2.show()
+-------+
|counter|
+-------+
|      3|
+-------+
df2 = spark.sql("select *,age+1 as next from user where age < 36")
df2.show()
+-----+---+----+
| name|age|next|
+-----+---+----+
| Jack| 32|  33|
|Smith| 33|  34|
+-----+---+----+

# Custom function
strlen = spark.udf.register("strLen", lambda x: len(x)) # Register a UDF; the returned function can also be applied to columns
a = [('Jack', 32),('Smith', 33),('Li Si', 36)]
rdd = sc.parallelize(a)
df = spark.createDataFrame(rdd, "name: string, age: int")
df.createOrReplaceTempView("user")
df2 = spark.sql("select *,strLen(name) as len from user")
df2.show()
+-----+---+---+
| name|age|len|
+-----+---+---+
| Jack| 32|  4|
|Smith| 33|  5|
|Li Si| 36|  5|
+-----+---+---+

# Select specific columns
df.select("name").show()

# Filter rows; the output shown is from the first statement
df.select("name").where(strlen("name") > 2).show()
df.filter(df.age > 32).show()
 +-----+
 | name|
 +-----+
 | Jack|
 |Smith|
 |Li Si|
 +-----+

# Aggregate: find the maximum with agg({field name: aggregate function})
df.agg({"age": "max"}).show()
 +--------+
 |max(age)|
 +--------+
 |      36|
 +--------+
  
# View count, mean, standard deviation, min and max of a column
df.describe(['age']).show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 3|
|   mean|33.666666666666664|
| stddev| 2.081665999466133|
|    min|                32|
|    max|                36|
+-------+------------------+

# Read / write Parquet and CSV files
df.write.parquet("myuser.parquet")
spark.read.parquet("myuser.parquet").show()
df.write.csv('user.csv', mode='append')
spark.read.csv('user.csv').show()

# Read a JSON file
df = spark.read.format('json').load('hdfs://localhost:9000/user.json')
df.show()
+---+------+------+---------+------+
|age|deptId|gender|     name|salary|
+---+------+------+---------+------+
| 32|    01|  male|Zhang San|  5000|
| 33|    01|  male|    Li Si|  6000|
| 38|    01|female|  Wang Wu|  5500|
| 42|    02|  male|     Jack|  7000|
| 27|    02|female|    Smith|  6500|
| 45|    02|female|     Lily|  9500|
+---+------+------+---------+------+
# Print the column types
print(df.dtypes)
[('age', 'bigint'), ('deptId', 'string'), ('gender', 'string'), ('name', 'string'), ('salary', 'bigint')]

# Pivot
df2 = df.groupBy("deptId").pivot("gender").sum("salary")
df2.show()
+------+------+-----+
|deptId|female| male|
+------+------+-----+
|    01|  5500|11000|
|    02| 16000| 7000|
+------+------+-----+

# Conditional selection: between and like
df.select("name",df.salary.between(6000,9500)).show()
df.select("name","age").where(df.name.like("Smi%")).show()

# Join query: compute the average salary and maximum age by department and gender
# User table
a = [   
  ('01','Zhang San', 'male',32,5000),
  ('01','Li Si', 'male',33,6000),
  ('01','Wang Wu', 'female',38,5500),
  ('02','Jack', 'male',42,7000),
  ('02','Smith', 'female',27,6500),
  ('02','Lily', 'female',45,9500)
]
rdd = sc.parallelize(a)
peopleDf = spark.createDataFrame(rdd, \
  "deptId:string,name:string,gender:string,age:int,salary:int")
# Department table
b = [  
  ('01','Sales Department'),
  ('02','R & D department')
]
rdd2 = sc.parallelize(b)
deptDf = spark.createDataFrame(rdd2, "id:string,name:string")

#The third parameter of the join function defaults to inner. Other options are:
# inner, cross, outer, full, full_outer, left, left_outer, 
# right, right_outer, left_semi, and left_anti.
peopleDf.join(deptDf, peopleDf.deptId == deptDf.id,'inner') \
  .groupBy(deptDf.name, peopleDf.gender) \
  .agg({"salary": "avg", "age": "max"}) \
  .sort(deptDf.name, peopleDf.gender) \
  .show()
+----------------+------+-----------+--------+
|            name|gender|avg(salary)|max(age)|
+----------------+------+-----------+--------+
|R & D department|female|     8000.0|      45|
|R & D department|  male|     7000.0|      42|
|Sales Department|female|     5500.0|      38|
|Sales Department|  male|     5500.0|      33|
+----------------+------+-----------+--------+


# Get all column names
peopleDf.columns

# Remove duplicate rows
peopleDf.distinct().show()

# Drop a column
peopleDf.drop("gender").show()

# Subtract one DataFrame from another (keeping duplicates)
df1 = spark.createDataFrame(
        [("a", 1), ("a", 1), ("a", 1), ("a", 2), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark.createDataFrame([("a", 1), ("b", 3)], ["C1", "C2"])
df1.exceptAll(df2).show()
 +---+---+
 | C1| C2|
 +---+---+
 |  a|  1|
 |  a|  1|
 |  a|  2|
 |  c|  4|
 +---+---+

# Find intersection
df1.intersectAll(df2).show()

# Null replacement
a = [   
  ('01','Zhang San', 'male',32,5000),
  ('01', None, 'male',33,6000),
  ('01','Wang Wu', 'female',36,None),
  ('02','Jack', 'male',42,7000),
  ('02','Smith', 'female',27,6500),
  ('02','Lily', 'female',45,None),
]
rdd = sc.parallelize(a)
peopleDf = spark.createDataFrame(rdd,\
   "deptId:string,name:string,gender:string,age:int,salary:int")
# Replace null values
peopleDf.na.fill({'name': 'unknown','salary': 0, }).show()
+------+---------+------+---+------+
|deptId|     name|gender|age|salary|
+------+---------+------+---+------+
|    01|Zhang San|  male| 32|  5000|
|    01|  unknown|  male| 33|  6000|
|    01|  Wang Wu|female| 36|     0|
|    02|     Jack|  male| 42|  7000|
|    02|    Smith|female| 27|  6500|
|    02|     Lily|female| 45|     0|
+------+---------+------+---+------+

# Convert to JSON format
peopleDf.toJSON().collect()
['{"deptId":"01","name":"Zhang San","gender":"male","age":32,"salary":5000}',
 '{"deptId":"01","gender":"male","age":33,"salary":6000}',
 '{"deptId":"01","name":"Wang Wu","gender":"female","age":36}',
 '{"deptId":"02","name":"Jack","gender":"male","age":42,"salary":7000}',
 '{"deptId":"02","name":"Smith","gender":"female","age":27,"salary":6500}',
 '{"deptId":"02","name":"Lily","gender":"female","age":45}']

# Add a computed column and rename an existing column
peopleDf.withColumn("age2",peopleDf.age+1) \
        .withColumnRenamed("name","full name") \
        .show()
    
# Date processing
df = spark.createDataFrame(sc.parallelize([("2016-08-26",)]),"Id:string")
df.show()
+----------+
|        Id|
+----------+
|2016-08-26|
+----------+
df2 = df.withColumn("Timestamp", df.Id.cast("timestamp"))
df3 = df2.withColumn("Date", df.Id.cast("date"))
df3.show()
+----------+--------------------+----------+
|        Id|           Timestamp|      Date|
+----------+--------------------+----------+
|2016-08-26|2016-08-26 00:00:...|2016-08-26|
+----------+--------------------+----------+

df = spark.createDataFrame([('2020-05-10',),('2020-05-09',)], ['date'])
from pyspark.sql.functions import add_months
df.select(add_months(df.date, 1).alias('next_month')).show()
+----------+
|next_month|
+----------+
|2020-06-10|
|2020-06-09|
+----------+

Write a Spark program and submit it

This program estimates Pi with a Monte Carlo simulation. Inscribe a circle of radius 1 in a square of side length 2: the square's area is 4 and the circle's area is Pi*R^2 = Pi. Take the center of the circle as the origin, pick n points uniformly at random in the square, and count how many fall inside the circle; that fraction approximates Pi/4, so Pi is roughly 4 * count / n.

from pyspark.sql import SparkSession
from random import random
from operator import add
spark = SparkSession.builder \
        .master("local[*]") \
        .appName("Pi Demo") \
        .getOrCreate()
sc = spark.sparkContext
#############################################
if __name__ == "__main__":
    n = 100000 * 20
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0
    count = sc.parallelize(range(1, n + 1), 20).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()

Important spark-submit parameters

  • --master: the cluster manager, e.g. local[*], yarn, or spark://HOST:PORT
  • --deploy-mode: client (the default; the Driver starts on the submitting machine and the driver logic runs there) or cluster (cluster mode is not supported for Python applications on standalone clusters)
  • --class CLASS_NAME: the main class, for Java and Scala programs
  • --name NAME: application name, used to distinguish different programs
  • --jars JARS: comma-separated jar packages (dependent resources) shipped with the application
  • --py-files PY_FILES: comma-separated .zip, .egg, or .py files placed on the PYTHONPATH
  • --driver-memory MEM: memory for the Driver (example invocations follow this list)
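
As a quick reference, a minimal submission of the Pi program above might look like the following; the script name pi_demo.py, the dependency archive deps.zip, and the memory value are assumptions for illustration.

# Run locally
spark-submit --master local[*] --name "Pi Demo" pi_demo.py

# Submit to YARN in client mode, shipping extra Python dependencies (deps.zip is hypothetical)
spark-submit \
  --master yarn \
  --deploy-mode client \
  --name "Pi Demo" \
  --driver-memory 1g \
  --py-files deps.zip \
  pi_demo.py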

Topics: Big Data Hadoop Spark