Spark DataFrame supports all basic SQL join operations, such as inner, left outer, right outer, left anti, left semi, cross, and self join. A Spark SQL join is a wide transformation: the data is shuffled across the network, so a join that is not carefully designed can cause serious performance problems.
On the other hand, Spark SQL join operations come with more optimizations by default (thanks to DataFrames & Datasets), although there are still some performance considerations to keep in mind when using them.
In this article, you will learn the different join syntaxes and how to use the different join types on DataFrames and Datasets. The examples are in Scala.
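For example, when one side of a join is small, a common way to avoid shuffling the large side is to broadcast the small one to every executor. Below is a minimal sketch (not part of the original examples) using the broadcast hint from org.apache.spark.sql.functions, applied to the empDF and deptDF DataFrames created later in this article:
import org.apache.spark.sql.functions.broadcast
// Hint Spark to broadcast the small dept DataFrame so the join
// can be executed without shuffling the larger emp DataFrame.
empDF.join(broadcast(deptDF), empDF("emp_dept_id") === deptDF("dept_id"), "inner")
  .show(false)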
Spark SQL Join Types and Syntax
All Spark SQL join syntaxes are listed below:
join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
The rest of this tutorial explains the join types using syntax 6, which takes the right DataFrame/Dataset, a join expression, and the join type as a string.
For the syntaxes that take a joinType string argument (4 and 6), you can pass either a JoinType or one of the join strings from the table below. When you use a JoinType, you need to import org.apache.spark.sql.catalyst.plans._, as it defines the JoinType objects.
JOINTYPE | JOIN STRING | EQUIVALENT SQL JOIN
---|---|---
Inner.sql | inner | INNER JOIN
FullOuter.sql | outer, full, fullouter, full_outer | FULL OUTER JOIN
LeftOuter.sql | left, leftouter, left_outer | LEFT JOIN
RightOuter.sql | right, rightouter, right_outer | RIGHT JOIN
Cross.sql | cross |
LeftAnti.sql | anti, leftanti, left_anti |
LeftSemi.sql | semi, leftsemi, left_semi |
All of these are defined as JoinType objects; to use them, import org.apache.spark.sql.catalyst.plans.{Inner, LeftOuter, ...}.
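As a minimal sketch of what that can look like (not from the original article; note that the DataFrame API itself takes the join type as a string, so the JoinType object is used here only through its .sql string), applied to the empDF and deptDF DataFrames created in the next section:
import org.apache.spark.sql.catalyst.plans.Inner
// Sketch: Inner.sql evaluates to the string "INNER", which is a valid
// value for the joinType argument of DataFrame.join.
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), Inner.sql)
  .show(false)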
Before jumping into the Spark SQL join examples, first create the emp and dept DataFrames. Here, the emp_id column is unique in emp, dept_id is unique in the dept dataset, and emp_dept_id in emp points to dept_id in the dept dataset.
val emp = Seq((1,"Smith",-1,"2018","10","M",3000), (2,"Rose",1,"2010","20","M",4000), (3,"Williams",1,"2010","10","M",1000), (4,"Jones",2,"2005","10","F",2000), (5,"Brown",2,"2010","40","",-1), (6,"Brown",2,"2010","50","",-1) ) val empColumns = Seq("emp_id","name","superior_emp_id","year_joined", "emp_dept_id","gender","salary") import spark.sqlContext.implicits._ val empDF = emp.toDF(empColumns:_*) empDF.show(false) val dept = Seq(("Finance",10), ("Marketing",20), ("Sales",30), ("IT",40) ) val deptColumns = Seq("dept_name","dept_id") val deptDF = dept.toDF(deptColumns:_*) deptDF.show(false)
Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+
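A side note on the usingColumn/usingColumns variants of the syntax (2-4): they require the join column to have the same name in both datasets. Since our key columns are named differently (emp_dept_id vs dept_id), a minimal sketch, assuming we rename one side first, would look like this (not part of the original examples):
// Sketch: align the key names, then join on the shared column name.
empDF.join(deptDF.withColumnRenamed("dept_id", "emp_dept_id"),
    Seq("emp_dept_id"), "inner")
  .show(false)
Unlike the join-expression form, the usingColumns form keeps a single emp_dept_id column in the result.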
Inner Join
Inner join is the default and the most commonly used join. It joins two DataFrames/Datasets on the specified columns, and rows that have no match in the other dataset are dropped from the result.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner") .show(false)
When this inner join is applied to our datasets, the emp record with emp_dept_id=50 and the dept record with dept_id=30 are dropped because they have no match. Below is the output of the join expression above.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
Full Outer Join
Outer, also called full or fullouter, join returns all rows from both Spark DataFrames/Datasets; where the join expression does not match, null is returned for the respective columns.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer").show(false) empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full").show(false) empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"fullouter").show(false)
Left Outer Join
Spark left (a.k.a. leftouter) join returns all rows from the left DataFrame/Dataset regardless of whether a match is found on the right. Where the join expression does not match, null is assigned to the right-side columns, and unmatched records from the right dataset are dropped.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"left") .show(false) empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftouter") .show(false)
In our dataset, emp_dept_id=50 in empDF has no matching record in dept, so the dept columns (dept_name & dept_id) are null for that row, and dept_id=30 from dept is dropped. Below is the result of the join expression above.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |null     |null   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
Right Outer Join
Spark right (a.k.a. rightouter) join is the opposite of left join: all rows from the right DataFrame/Dataset are returned regardless of whether a match is found on the left. Where the join expression does not match, null is assigned to the left-side columns, and unmatched records from the left dataset are dropped.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"right") .show(false) empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"rightouter") .show(false)
In our example, dept_id=30 in the right dataset dept has no match in the left dataset emp, so this record has null in the emp columns; emp_dept_id=50, which has no match in dept, is dropped. Below is the output.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|null  |null    |null           |null       |null       |null  |null  |Sales    |30     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
Left Semi Join
Spark leftsemi join is similar to inner join; the difference is that leftsemi returns only the columns of the left dataset and ignores all columns from the right dataset. In other words, this join returns only the rows from the left dataset that have a match in the right dataset on the join expression; rows without a match are ignored from both datasets.
The same result can be achieved with an inner join followed by selecting only the left dataset's columns (see the sketch after the output below), but leftsemi is more efficient.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"), "leftsemi") .show(false)
output
leftsemi join
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+
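As a minimal sketch of the equivalence mentioned above (not part of the original article), the same rows can be produced with an inner join followed by selecting only empDF's columns; leftsemi avoids carrying the dept columns through the shuffle:
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
  .select(empDF.columns.map(empDF(_)): _*)   // keep only the left-side columns
  .show(false)
Note that if a dept_id matched multiple emp rows is fine, but if the right side had duplicate keys, this inner-join version would return duplicate left rows that leftsemi would not.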
Left Anti Join
Left anti join does exactly the opposite of leftsemi: it returns only the columns from the left dataset for rows that have no match in the right dataset.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftanti") .show(false)
output
+------+-----+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+-----+---------------+-----------+-----------+------+------+
|6     |Brown|2              |2010       |50         |      |-1    |
+------+-----+---------------+-----------+-----------+------+------+
Self Join
No discussion of Spark joins is complete without a self join. Although there is no dedicated self-join type, we can join a DataFrame to itself using any of the join types explained above. Below is an inner self join.
empDF.as("emp1").join(empDF.as("emp2"), col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner") .select(col("emp1.emp_id"),col("emp1.name"), col("emp2.emp_id").as("superior_emp_id"), col("emp2.name").as("superior_emp_name")) .show(false)
Here, by joining the emp dataset with itself, we look up the superior_emp_id and the superior's name for every employee.
+------+--------+---------------+-----------------+
|emp_id|name    |superior_emp_id|superior_emp_name|
+------+--------+---------------+-----------------+
|2     |Rose    |1              |Smith            |
|3     |Williams|1              |Smith            |
|4     |Jones   |2              |Rose             |
|5     |Brown   |2              |Rose             |
|6     |Brown   |2              |Rose             |
+------+--------+---------------+-----------------+
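Note that the inner self join drops employees who have no superior (Smith has superior_emp_id = -1). A minimal sketch (not from the original article) that keeps such employees simply uses a left join instead:
empDF.as("emp1").join(empDF.as("emp2"),
    col("emp1.superior_emp_id") === col("emp2.emp_id"), "left")   // keep employees without a superior
  .select(col("emp1.emp_id"), col("emp1.name"),
    col("emp2.emp_id").as("superior_emp_id"),
    col("emp2.name").as("superior_emp_name"))
  .show(false)
With this variant, Smith appears in the result with null superior columns.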
Using SQL expressions
Since Spark SQL supports native SQL syntax, we can also write join operations after creating temporary tables on the DataFrames and using spark.sql().
empDF.createOrReplaceTempView("EMP") deptDF.createOrReplaceTempView("DEPT") //SQL JOIN val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id") joinDF.show(false) val joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id") joinDF2.show(false)
source code
package com.sparkbyexamples.spark.dataframe.join

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object JoinExample extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
    (2,"Rose",1,"2010","20","M",4000),
    (3,"Williams",1,"2010","10","M",1000),
    (4,"Jones",2,"2005","10","F",2000),
    (5,"Brown",2,"2010","40","",-1),
    (6,"Brown",2,"2010","50","",-1)
  )
  val empColumns = Seq("emp_id","name","superior_emp_id","year_joined","emp_dept_id","gender","salary")
  import spark.sqlContext.implicits._
  val empDF = emp.toDF(empColumns:_*)
  empDF.show(false)

  val dept = Seq(("Finance",10),
    ("Marketing",20),
    ("Sales",30),
    ("IT",40)
  )
  val deptColumns = Seq("dept_name","dept_id")
  val deptDF = dept.toDF(deptColumns:_*)
  deptDF.show(false)

  println("Inner join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
    .show(false)

  println("Outer join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer")
    .show(false)

  println("full join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full")
    .show(false)

  println("fullouter join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"fullouter")
    .show(false)

  println("right join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"right")
    .show(false)

  println("rightouter join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"rightouter")
    .show(false)

  println("left join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"left")
    .show(false)

  println("leftouter join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftouter")
    .show(false)

  println("leftanti join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftanti")
    .show(false)

  println("leftsemi join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftsemi")
    .show(false)

  println("cross join")
  empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"cross")
    .show(false)

  println("Using crossJoin()")
  empDF.crossJoin(deptDF).show(false)

  println("self join")
  empDF.as("emp1").join(empDF.as("emp2"),
      col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
    .select(col("emp1.emp_id"),col("emp1.name"),
      col("emp2.emp_id").as("superior_emp_id"),
      col("emp2.name").as("superior_emp_name"))
    .show(false)

  empDF.createOrReplaceTempView("EMP")
  deptDF.createOrReplaceTempView("DEPT")

  //SQL JOIN
  val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
  joinDF.show(false)

  val joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
  joinDF2.show(false)
}
conclusion
In this tutorial, you have learned how to use the Spark SQL join types INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF join, along with the corresponding Scala examples.