Spark SQL -- Spark SQL performance optimization

Posted by abgoosht on Fri, 13 Mar 2020 08:27:46 +0100


1. Cache table data in memory

Performance tuning is mainly about putting data into memory. Once data is cached, Spark can read values directly from memory, which improves performance. For an RDD, use rdd.cache or rdd.persist to cache it. A DataFrame is backed by an RDD, so it can be cached in the same way.
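As a quick RDD-level illustration (a minimal sketch; the app name and sample data below are made up), cache() is simply persist() with the default MEMORY_ONLY storage level, while persist() lets you pick the storage level explicitly:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CacheSketch").master("local[*]").getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000)
rdd.cache()                               // shorthand for persist(StorageLevel.MEMORY_ONLY)
rdd.count()                               // caching is lazy; the first action materializes it
rdd.unpersist()                           // release the cached blocks

val df = spark.range(1000).toDF("id")
df.persist(StorageLevel.MEMORY_AND_DISK)  // persist() allows choosing a storage level explicitly
df.count()
df.unpersist()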

Add cache:

sparkSession.sqlContext.cacheTable("tableName")   // in application code, via the session's SQLContext
spark.sqlContext.cacheTable("tableName")          // in spark-shell
spark.catalog.cacheTable("tableName")             // via the Catalog API
dataFrame.cache()                                 // cache the DataFrame directly

Clear cache:

spark.sqlContext.clearCache                       // clear all cached tables
spark.sqlContext.uncacheTable("tableName")        // remove a single table from memory

For example:

scala> val mysqlDF = spark.read.format("jdbc").option("url","jdbc:mysql://192.168.15.131:3306/company?serverTimezone=UTC&characterEncoding=utf-8").option("user","root").option("password","123456").option("dbtable","emp").option("driver","com.mysql.jdbc.Driver").load
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
Sun Aug 25 03:59:07 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
mysqlDF: org.apache.spark.sql.DataFrame = [id: int, ename: string ... 2 more fields]

// A view does not hold the data itself, so first register the DataFrame as a temporary table:
scala> mysqlDF.registerTempTable("emp")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select * from emp").show
+---+-----+------+----+
| id|ename|deptno| sal|
+---+-----+------+----+
|  1|  Tom|    10|2500|
|  2| Andy|    11|1000|
|  3| Lily|    10|1500|
|  4| Lucy|    11| 500|
+---+-----+------+----+
 
// Mark the table as cacheable; the data is not actually cached yet (caching is lazy)
scala> spark.sqlContext.cacheTable("emp")

// First query: reads from MySQL and fills the cache
scala> spark.sql("select * from emp").show
+---+-----+------+----+
| id|ename|deptno| sal|
+---+-----+------+----+
|  1|  Tom|    10|2500|
|  2| Andy|    11|1000|
|  3| Lily|    10|1500|
|  4| Lucy|    11| 500|
+---+-----+------+----+

// Read from cache
scala> spark.sql("select * from emp").show
+---+-----+------+----+
| id|ename|deptno| sal|
+---+-----+------+----+
|  1|  Tom|    10|2500|
|  2| Andy|    11|1000|
|  3| Lily|    10|1500|
|  4| Lucy|    11| 500|
+---+-----+------+----+
 
// Clear the cache
scala> spark.sqlContext.clearCache

2. Parameter optimization

spark.sql.inMemoryColumnarStorage.compressed
Default: true
When enabled, Spark SQL automatically selects a compression codec for each column based on statistics of the data.

spark.sql.inMemoryColumnarStorage.batchSize
Default: 10000
The batch size for columnar caching. A larger batch size improves memory utilization and compression, but also increases the risk of OOM (Out Of Memory) errors when caching data.
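Both settings can be adjusted on a running session, for example (a sketch; the batch size of 20000 is only an illustrative value):

spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")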

Other performance-related configuration options (manual tuning of these is generally not recommended, as they may be adjusted automatically in future versions):

spark.sql.files.maxPartitionBytes
Default: 128 MB
The maximum number of bytes to pack into a single partition when reading files.

spark.sql.files.openCostInBytes
Default: 4 MB
The estimated cost of opening a file, measured in the number of bytes that could be scanned in the same time. Used when packing multiple files into a single partition. It is better to overestimate this value: partitions with small files will then be scheduled ahead of partitions with large files.
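Both file-reading parameters can be set on the session as well; the values below are simply the documented defaults written out in bytes (a sketch, not a tuning recommendation):

spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")  // 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", "4194304")      // 4 MB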

spark.sql.autoBroadcastJoinThreshold
Default: 10 MB
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. Broadcasting can be disabled by setting this value to -1. Note that statistics are currently only supported for Hive Metastore tables on which the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
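For example (a minimal sketch: dept is a hypothetical small dimension table, emp is the temporary table registered above), automatic broadcasting can be disabled and a broadcast requested explicitly with the broadcast() hint:

import org.apache.spark.sql.functions.broadcast

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")   // disable automatic broadcast joins
val emp = spark.table("emp")                                   // the temporary table registered earlier
val dept = spark.table("dept")                                 // hypothetical small dimension table
val joined = emp.join(broadcast(dept), Seq("deptno"))          // explicitly broadcast the small side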

spark.sql.shuffle.partitions
Default: 200
Configures the number of partitions used when shuffling data for joins or aggregations. It can be set when building the SparkSession, for example:
val spark = SparkSession.builder().appName("SparkSQLDemo4").config("spark.sql.shuffle.partitions", 400).enableHiveSupport().getOrCreate()
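
The same setting can also be changed on an existing session, for example in spark-shell:

spark.conf.set("spark.sql.shuffle.partitions", "400")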


Topics: Spark SQL Scala MySQL