Shared variables
-
Broadcast variables
-
Broadcast variables let a program cache a read-only variable on each machine in the cluster instead of shipping a copy with every task. They are a more efficient way to share data, such as a global configuration, across tasks.
-
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("RDD Demo").getOrCreate()
sc = spark.sparkContext

conf = {"ip": "192.168.1.1", "key": "cumt"}
brVar = sc.broadcast(conf)    # Create a broadcast variable
a = brVar.value               # Get the broadcast variable's value
print(a)                      # {'ip': '192.168.1.1', 'key': 'cumt'}
print(a["key"])               # cumt
brVar.unpersist()

# Update the broadcast variable
conf["key"] = "jackwang"
brVar = sc.broadcast(conf)    # Broadcast again
a = brVar.value               # Get the new broadcast value
print(a)                      # {'ip': '192.168.1.1', 'key': 'jackwang'}

# destroy() removes the broadcast variable's data and metadata;
# the variable cannot be used after it is destroyed
brVar.destroy()
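The value of a broadcast variable is mainly read inside tasks, so every executor fetches it once rather than receiving a copy per task. A minimal usage sketch, assuming the spark and sc objects above (brVar was destroyed, so the configuration is broadcast again; the sample IP list is made up for illustration):

# Broadcast the configuration again and reference it inside a transformation
conf = {"ip": "192.168.1.1", "key": "cumt"}
brVar = sc.broadcast(conf)

ips = sc.parallelize(["192.168.1.1", "10.0.0.5", "192.168.1.1"])
# Tasks read brVar.value instead of capturing their own copy of conf
matches = ips.map(lambda ip: (ip, ip == brVar.value["ip"]))
print(matches.collect())
# [('192.168.1.1', True), ('10.0.0.5', False), ('192.168.1.1', True)]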
-
Accumulators
-
An accumulator is a variable that tasks can only add to through an associative operation, which lets Spark apply the updates efficiently in parallel. During debugging, it can be used to count events that occur while a job executes. Tasks on different nodes add values to the accumulator with the add method (or +=), but only the driver reads its value. To ensure the value is accurate, trigger only one action on the RDD that updates it; if multiple actions are needed, cache or persist the RDD first to cut off the lineage so the update logic is not re-executed.
-
rdd = sc.range(1, 101)

# Create an accumulator with initial value 0
acc = sc.accumulator(0)

def fcounter(x):
    global acc
    if x % 2 == 0:
        acc += 1
        # acc -= 1  would raise: unsupported operand type(s) for -=

rdd_counter = rdd.map(fcounter)
print(acc.value)            # 0, the fcounter logic has not been executed yet

# Ensure the accumulator value is obtained correctly across multiple actions
rdd_counter.persist()
print(rdd_counter.count())  # 100
print(acc.value)            # 50
print(rdd_counter.count())  # 100
print(acc.value)            # 50
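For contrast, a minimal sketch of the pitfall the note above warns about: without persist(), every action recomputes the map stage and the accumulator is incremented again (it reuses the rdd defined above; the second accumulator and function names are hypothetical):

# Pitfall sketch: no persist(), so each action re-runs the map stage
acc2 = sc.accumulator(0)

def fcounter2(x):
    global acc2
    if x % 2 == 0:
        acc2 += 1

rdd_counter2 = rdd.map(fcounter2)   # not persisted
rdd_counter2.count()                # first action: acc2 becomes 50
rdd_counter2.count()                # second action: map re-runs, acc2 becomes 100
print(acc2.value)                   # 100, i.e. double-counted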
-
DataFrames and Spark SQL
- Include both the data itself and its structure (schema) information
- Support nested data types such as array and map (a sketch follows the creation example below)
- Can be built on multiple data sources
Create DataFrames
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').appName('DataFrames').getOrCreate()
sc = spark.sparkContext

# Create a DataFrame from a list
a = [('Jack', 32), ('Smith', 33)]
df = spark.createDataFrame(a)
print(df.collect())   # [Row(_1='Jack', _2=32), Row(_1='Smith', _2=33)]
df.show()
# +-----+---+
# |   _1| _2|
# +-----+---+
# | Jack| 32|
# |Smith| 33|
# +-----+---+

# Add structure information (column names)
df2 = spark.createDataFrame(a, ['name', 'age'])
print(df2.collect())  # [Row(name='Jack', age=32), Row(name='Smith', age=33)]
df2.show()
# +-----+---+
# | name|age|
# +-----+---+
# | Jack| 32|
# |Smith| 33|
# +-----+---+

# DataFrame with an explicit (typed) schema
from pyspark.sql.types import *

a = [('Jack', 32), ('Smith', 33)]
rdd = sc.parallelize(a)
# StructField(name, type, nullable)
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
df = spark.createDataFrame(rdd, schema)
# Simplified version
df2 = spark.createDataFrame(rdd, "name:string,age:int")
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)
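As mentioned in the bullets above, DataFrames also support nested data types. A minimal sketch of a schema with an array column and a map column (the column names and sample rows are made up for illustration):

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, MapType)

# Hypothetical example: one array column and one map column
nested_schema = StructType([
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True),
    StructField("attrs", MapType(StringType(), StringType()), True)])

data = [("Jack", [80, 90], {"city": "Xuzhou"}),
        ("Smith", [75], {"city": "Beijing"})]
ndf = spark.createDataFrame(data, nested_schema)
ndf.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- scores: array (nullable = true)
#  |    |-- element: integer (containsNull = true)
#  |-- attrs: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: string (valueContainsNull = true)

# Access nested elements by index and key
ndf.select(ndf.scores[0], ndf.attrs["city"]).show()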
Basic usage of Spark SQL
- spark.sql() runs SQL queries
- spark.udf.register() registers custom functions
- df.select().where() filters rows
- df.agg({field name: aggregate function}).show() aggregates
- df.describe() shows summary statistics for columns
- spark.read.format('json').load() reads a file
- df.join(deptDf, peopleDf.deptId == deptDf.id, 'inner') joins two DataFrames
- df.na.fill({'name': 'unknown', 'salary': 0}).show() fills null values
- df.withColumn("timestamp", df.Id.cast("timestamp")) changes a column's type
# Query
a = [('Jack', 32), ('Smith', 33), ('Li Si', 36)]
rdd = sc.parallelize(a)
df = spark.createDataFrame(rdd, "name: string, age: int")
df.createOrReplaceTempView("user")    # Create a temporary view
df2 = spark.sql("select count(*) as counter from user")
df2.show()
# +-------+
# |counter|
# +-------+
# |      3|
# +-------+
df2 = spark.sql("select *, age+1 as next from user where age < 36")
df2.show()
# +-----+---+----+
# | name|age|next|
# +-----+---+----+
# | Jack| 32|  33|
# |Smith| 33|  34|
# +-----+---+----+

# Custom function
strlen = spark.udf.register("strLen", lambda x: len(x))  # register a custom function
a = [('Jack', 32), ('Smith', 33), ('Li Si', 36)]
rdd = sc.parallelize(a)
df = spark.createDataFrame(rdd, "name: string, age: int")
df.createOrReplaceTempView("user")
df2 = spark.sql("select *, strLen(name) as len from user")
df2.show()
# +-----+---+---+
# | name|age|len|
# +-----+---+---+
# | Jack| 32|  4|
# |Smith| 33|  5|
# |Li Si| 36|  2|
# +-----+---+---+

# Select specific columns
df.select("name").show()

# Find and filter
df.select("name").where(strlen("name") > 2).show()
# +-----+
# | name|
# +-----+
# | Jack|
# |Smith|
# +-----+
df.filter(df.age > 32).show()

# Aggregate: find the maximum with agg({field name: aggregate function})
df.agg({"age": "max"}).show()
# +--------+
# |max(age)|
# +--------+
# |      36|
# +--------+

# View the count, mean, standard deviation, minimum and maximum of a field
df.describe(['age']).show()
# +-------+------------------+
# |summary|               age|
# +-------+------------------+
# |  count|                 3|
# |   mean|33.666666666666664|
# | stddev| 2.081665999466133|
# |    min|                32|
# |    max|                36|
# +-------+------------------+

# Read / write parquet files
df.write.parquet("myuser.parquet")
spark.read.parquet("myuser.parquet").show()
df.write.csv('user.csv', 'append')
# spark.read.csv() reads CSV files back

# Read a json file
df = spark.read.format('json').load('hdfs://localhost:9000/user.json')
df.show()
# +---+------+------+---------+------+
# |age|deptId|gender|     name|salary|
# +---+------+------+---------+------+
# | 32|    01|  male|Zhang San|  5000|
# | 33|    01|  male|    Li Si|  6000|
# | 38|    01|female|  Wang Wu|  5500|
# | 42|    02|  male|     Jack|  7000|
# | 27|    02|female|    Smith|  6500|
# | 45|    02|female|     Lily|  9500|
# +---+------+------+---------+------+

# Print the field types
print(df.dtypes)
# [('age', 'bigint'), ('deptId', 'string'), ('gender', 'string'),
#  ('name', 'string'), ('salary', 'bigint')]

# Pivot
df2 = df.groupBy("deptId").pivot("gender").sum("salary")
df2.show()
# +------+------+-----+
# |deptId|female| male|
# +------+------+-----+
# |    01|  5500|11000|
# |    02| 16000| 7000|
# +------+------+-----+

# Conditional selection
df.select("name", df.salary.between(6000, 9500)).show()
df.select("name", "age").where(df.name.like("Smi%")).show()

# Join query: average salary and maximum age of men and women in each department
# User table
a = [
    ('01', 'Zhang San', 'male', 32, 5000),
    ('01', 'Li Si', 'male', 33, 6000),
    ('01', 'Wang Wu', 'female', 38, 5500),
    ('02', 'Jack', 'male', 42, 7000),
    ('02', 'Smith', 'female', 27, 6500),
    ('02', 'Lily', 'female', 45, 9500)
]
rdd = sc.parallelize(a)
peopleDf = spark.createDataFrame(rdd,
    "deptId:string,name:string,gender:string,age:int,salary:int")
# Department table
b = [
    ('01', 'Sales Department'),
    ('02', 'R & D department')
]
rdd2 = sc.parallelize(b)
deptDf = spark.createDataFrame(rdd2, "id:string,name:string")
# The third parameter of join() defaults to 'inner'. Other options are:
# inner, cross, outer, full, full_outer, left, left_outer,
# right, right_outer, left_semi, and left_anti.
peopleDf.join(deptDf, peopleDf.deptId == deptDf.id, 'inner') \
    .groupBy(deptDf.name, peopleDf.gender) \
    .agg({"salary": "avg", "age": "max"}) \
    .sort(deptDf.name, peopleDf.gender) \
    .show()
# +----------------+------+-----------+--------+
# |            name|gender|avg(salary)|max(age)|
# +----------------+------+-----------+--------+
# |R & D department|female|     8000.0|      45|
# |R & D department|  male|     7000.0|      42|
# |Sales Department|female|     5500.0|      38|
# |Sales Department|  male|     5500.0|      33|
# +----------------+------+-----------+--------+

# Get all column names
peopleDf.columns

# Remove duplicate rows
peopleDf.distinct().show()

# Drop a column
peopleDf.drop("gender").show()

# Subtract one DataFrame from another
df1 = spark.createDataFrame(
    [("a", 1), ("a", 1), ("a", 1), ("a", 2), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark.createDataFrame([("a", 1), ("b", 3)], ["C1", "C2"])
df1.exceptAll(df2).show()
# +---+---+
# | C1| C2|
# +---+---+
# |  a|  1|
# |  a|  1|
# |  a|  2|
# |  c|  4|
# +---+---+

# Find the intersection
df1.intersectAll(df2).show()

# Null value replacement
a = [
    ('01', 'Zhang San', 'male', 32, 5000),
    ('01', None, 'male', 33, 6000),
    ('01', 'Wang Wu', 'female', 36, None),
    ('02', 'Jack', 'male', 42, 7000),
    ('02', 'Smith', 'female', 27, 6500),
    ('02', 'Lily', 'female', 45, None),
]
rdd = sc.parallelize(a)
peopleDf = spark.createDataFrame(rdd,
    "deptId:string,name:string,gender:string,age:int,salary:int")
# Replace null values
peopleDf.na.fill({'name': 'unknown', 'salary': 0}).show()
# +------+---------+------+---+------+
# |deptId|     name|gender|age|salary|
# +------+---------+------+---+------+
# |    01|Zhang San|  male| 32|  5000|
# |    01|  unknown|  male| 33|  6000|
# |    01|  Wang Wu|female| 36|     0|
# |    02|     Jack|  male| 42|  7000|
# |    02|    Smith|female| 27|  6500|
# |    02|     Lily|female| 45|     0|
# +------+---------+------+---+------+

# Convert to JSON format
peopleDf.toJSON().collect()
# ['{"deptId":"01","name":"Zhang San","gender":"male","age":32,"salary":5000}',
#  '{"deptId":"01","gender":"male","age":33,"salary":6000}',
#  '{"deptId":"01","name":"Wang Wu","gender":"female","age":36}',
#  '{"deptId":"02","name":"Jack","gender":"male","age":42,"salary":7000}',
#  '{"deptId":"02","name":"Smith","gender":"female","age":27,"salary":6500}',
#  '{"deptId":"02","name":"Lily","gender":"female","age":45}']

# Add a computed column and rename a column
peopleDf.withColumn("age2", peopleDf.age + 1) \
    .withColumnRenamed("name", "full name") \
    .show()

# Date processing
df = spark.createDataFrame(sc.parallelize([("2016-08-26",)]), "Id:string")
df.show()
# +----------+
# |        Id|
# +----------+
# |2016-08-26|
# +----------+
df2 = df.withColumn("Timestamp", df.Id.cast("timestamp"))
df3 = df2.withColumn("Date", df.Id.cast("date"))
df3.show()
# +----------+--------------------+----------+
# |        Id|           Timestamp|      Date|
# +----------+--------------------+----------+
# |2016-08-26|2016-08-26 00:00:...|2016-08-26|
# +----------+--------------------+----------+

from pyspark.sql.functions import add_months
df = spark.createDataFrame([('2020-05-10',), ('2020-05-09',)], ['date'])
df.select(add_months(df.date, 1).alias('next_month')).show()
# +----------+
# |next_month|
# +----------+
# |2020-06-10|
# |2020-06-09|
# +----------+
Write a Spark program and submit it
This program estimates Pi with a Monte Carlo simulation. Inscribe a circle of radius 1 in a square with side length 2: the square's area is 4 and the circle's area is Pi*R^2 = Pi. Take the circle's center as the origin of the coordinate axes and pick n points in the square uniformly at random; the fraction of points that land inside the circle approximates Pi/4, so Pi ≈ 4 * (points inside the circle) / n.
from pyspark.sql import SparkSession
from random import random
from operator import add

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Pi Demo") \
    .getOrCreate()
sc = spark.sparkContext

if __name__ == "__main__":
    n = 100000 * 20

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = sc.parallelize(range(1, n + 1), 20).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()
Important spark-submit parameters (an example command follows the list)
- --master: local, yarn, or spark://HOST:PORT
- --deploy-mode: client (default; the Driver starts on the client and the program logic runs there) or cluster (cluster mode does not support Python programs)
- --class CLASS_NAME: main class, for Java and Scala programs
- --name: application name, used to distinguish different programs
- --jars JARS: comma-separated jar packages (the program code and its dependencies packaged as jars)
- --py-files PY_FILES: comma-separated .zip, .egg, or .py files, placed on the PYTHONPATH
- --driver-memory MEM: memory for the Driver
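Putting these together, a hedged example of submitting the Pi program above with spark-submit (the file name pi_demo.py, the memory size, and the master URL are assumptions for illustration, not from the source):

# pi_demo.py is an assumed file name containing the Pi program above
spark-submit \
  --master local[*] \
  --name "Pi Demo" \
  --driver-memory 1g \
  pi_demo.py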