Spark SQL implementation principles - logical plan optimization - operator push-down: the EliminateOuterJoin rule
This rule optimizes outer join operations, eliminating the outer join where possible by converting it into an inner join or another join type. The EliminateOuterJoin rule can take effect only when the join operation is followed by a filter operation (in terms of the logical plan tree, when the Filter node is the parent of the Join node).
Logic of EliminateOuterJoin rule
The execution logic of the EliminateOuterJoin rule can be described by the following cases (a minimal sketch of this decision table is given right after the list):
1. If the join is of full outer type and filter conditions are applied to both sides of the join, convert full outer to inner.
2. If the join is of left outer type and a filter condition is applied to the right side of the join, convert left outer to inner.
3. If the join is of right outer type and a filter condition is applied to the left side of the join, convert right outer to inner.
4. If the join is of full outer type and a filter condition is applied only to the left side of the join, convert full outer to left outer.
5. If the join is of full outer type and a filter condition is applied only to the right side of the join, convert full outer to right outer.
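Here "a filter condition is applied to one side" means the Filter contains a condition that references only that side's output and cannot be satisfied by null values; this is made precise in the rule's source code shown later. As a rough illustration only (plain Scala, not Spark's code), the decision table looks like this:

```scala
// Standalone sketch of the decision table above; the names mirror Spark's join types
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

// filtersLeft / filtersRight: a null-filtering condition references only that side
def eliminateOuter(joinType: JoinType, filtersLeft: Boolean, filtersRight: Boolean): JoinType =
  joinType match {
    case FullOuter if filtersLeft && filtersRight => Inner      // case 1
    case LeftOuter if filtersRight                => Inner      // case 2
    case RightOuter if filtersLeft                => Inner      // case 3
    case FullOuter if filtersLeft                 => LeftOuter  // case 4
    case FullOuter if filtersRight                => RightOuter // case 5
    case other                                    => other      // nothing to optimize
  }
```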
As we know, inner joins are generally used with caution in real development, because an inner join drops unmatched rows. So why is it safe for Spark SQL to rewrite an outer join as an inner join? Let's continue the analysis.
Effect of optimization rules
Let's look at the actual effect of each of the above cases through examples, to get a deeper understanding of the optimization process.
1. If it is of full outer type and the Filter operation is used on both sides, convert full outer to inner
```scala
import spark.implicits._

case class Person(id: String, age: Int)
case class Record(no: String, age: Int)

val data1 = Seq(Person("Michael", 9), Person("Andy", 7), Person("Justin", 11))
val data2 = Seq(Record("Michael", 29), Record("Andy2", 30), Record("Justin2", 11))

val ds1 = spark.createDataset(data1)
val ds2 = spark.createDataset(data2)

// Add filter conditions on the datasets on both sides of the join
ds1.join(ds2, ds1("id") === ds2("no"), "fullouter")
  .where(ds1("age") >= 9 and ds2("age") >= 20)
  .explain(true)
```
In the above code we use a fullouter join and add filter conditions on both sides of the join. The TRACE log shows the logical plan before and after the EliminateOuterJoin rule is applied: the plan on the left is the original one, the plan on the right is the optimized one, and FullOuter has been converted to Inner.
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Filter ((age#147 >= 9) && (age#152 >= 20))   Filter ((age#147 >= 9) && (age#152 >= 20))
!+- Join FullOuter, (id#146 = no#151)         +- Join Inner, (id#146 = no#151)
    :- LocalRelation [id#146, age#147]           :- LocalRelation [id#146, age#147]
    +- LocalRelation [no#151, age#152]           +- LocalRelation [no#151, age#152]
```
You can see that the join type has changed from FullOuter to Inner. So why is it valid to convert a full outer join into an inner join here?
Let's break the unoptimized processing into two steps. The first step is the full outer join itself, which produces the following result:
| id | no | ds1.age | ds2.age |
|---|---|---|---|
| Michael | Michael | 9 | 29 |
| null | Justin2 | null | 11 |
| Justin | null | 11 | null |
| Andy | null | 7 | null |
| null | Andy2 | null | 30 |
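This intermediate result can be reproduced by running the same full outer join without the filter, for example:

```scala
// Show the raw full outer join result; unmatched cells appear as null
ds1.join(ds2, ds1("id") === ds2("no"), "fullouter").show()
```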
The second step is the filter, after which only one record is left:
| id | no | ds1.age | ds2.age |
|---|---|---|---|
| Michael | Michael | 9 | 29 |
From this analysis we can see that a full outer join pads unmatched rows with null values. Because the filter conditions on both sides cannot be satisfied by null values, every null-padded row is discarded, which is exactly what an inner join produces. Rewriting the join as an inner join lets data be filtered out earlier and speeds up the whole processing pipeline. The remaining cases follow similar reasoning.
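Note that the rule only fires when the filter predicates cannot be satisfied by null values (the canFilterOutNull check shown in the implementation below). As a sketch of a case where the rewrite should not happen, consider a predicate that accepts nulls, such as isNull; the plan is expected to keep the FullOuter join here (worth verifying with explain(true) on your Spark version):

```scala
// Assumption for illustration: isNull is satisfied by the null-padded rows that a full
// outer join produces, so this filter cannot be used to downgrade the join type.
ds1.join(ds2, ds1("id") === ds2("no"), "fullouter")
  .where(ds2("age").isNull)
  .explain(true)
```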
2. If it is of left outer type and the Filter operation is used on the right side, convert left outer to inner
We use leftouter type join and filter the right side. The code is as follows:
```scala
ds1.join(ds2, ds1("id") === ds2("no"), "leftouter").where(ds2("age") >= 20).explain(true)
```
By viewing the TRACE log, you can see that the Join in the logical plan is optimized to the Inner type.
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Filter (age#152 >= 20)                       Filter (age#152 >= 20)
!+- Join LeftOuter, (id#146 = no#151)         +- Join Inner, (id#146 = no#151)
    :- LocalRelation [id#146, age#147]           :- LocalRelation [id#146, age#147]
    +- LocalRelation [no#151, age#152]           +- LocalRelation [no#151, age#152]
```
The logic of this optimization is similar to the first case. A left outer join keeps every row of the left dataset; where a left row has no match on the right, the right-side columns are filled with null. Filtering on the right-side columns then discards those null-padded rows, i.e. exactly the left rows that have no match on the right. What remains is equivalent to the result of an inner join.
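The same rewrite applies when the query is written in SQL instead of the DataFrame API. A small sketch, assuming the temp-view names persons and records (made up here) have been registered:

```scala
// Register the datasets as temp views (names are illustrative) and run the equivalent SQL
ds1.createOrReplaceTempView("persons")
ds2.createOrReplaceTempView("records")
spark.sql(
  """SELECT *
    |FROM persons p LEFT OUTER JOIN records r ON p.id = r.no
    |WHERE r.age >= 20""".stripMargin).explain(true)
```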
3. If it is of right outer type and the Filter operation is used on the left side, convert right outer to inner
Execute the following code for a rightouter join:
```scala
ds1.join(ds2, ds1("id") === ds2("no"), "rightouter").where(ds1("age") >= 9).explain(true)
```
The optimized logical plan can be obtained:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Filter (age#147 >= 9)                        Filter (age#147 >= 9)
!+- Join RightOuter, (id#146 = no#151)        +- Join Inner, (id#146 = no#151)
    :- LocalRelation [id#146, age#147]           :- LocalRelation [id#146, age#147]
    +- LocalRelation [no#151, age#152]           +- LocalRelation [no#151, age#152]
```
The logic of this case is similar to case 2, so we won't repeat the analysis here.
4. If it is of full outer type and the Filter operation is used only on the left side, convert full outer to left outer
Execute a fullouter join:
```scala
ds1.join(ds2, ds1("id") === ds2("no"), "fullouter").where(ds1("age") >= 9).explain(true)
```
The optimized logical plan of this operation is as follows:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Filter (age#147 >= 9)                        Filter (age#147 >= 9)
!+- Join FullOuter, (id#146 = no#151)         +- Join LeftOuter, (id#146 = no#151)
    :- LocalRelation [id#146, age#147]           :- LocalRelation [id#146, age#147]
    +- LocalRelation [no#151, age#152]           +- LocalRelation [no#151, age#152]
```
Filtering on the left-side columns discards every row whose left side was padded with null, i.e. the rows that exist only in the right dataset (refer to the full outer result table in case 1; note that matched rows are never null-padded). Only rows that actually come from the left dataset and satisfy the condition remain, which is exactly what a left outer join followed by the same filter produces.
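A quick sanity check on this toy data (an illustrative snippet, not part of the original example): filtering the left side of the full outer join should return exactly the same rows as a left outer join followed by the same filter.

```scala
// Both plans should produce the same result set on this data
val fullThenFilter = ds1.join(ds2, ds1("id") === ds2("no"), "fullouter").where(ds1("age") >= 9)
val leftThenFilter = ds1.join(ds2, ds1("id") === ds2("no"), "leftouter").where(ds1("age") >= 9)
assert(fullThenFilter.except(leftThenFilter).count() == 0)
assert(leftThenFilter.except(fullThenFilter).count() == 0)
```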
5. If it is of full outer type and the Filter operation is used only on the right side, convert full outer to right outer
Execute the following code:
```scala
ds1.join(ds2, ds1("id") === ds2("no"), "fullouter").where(ds2("age") >= 20).explain(true)
```
The optimized logical plan is as follows:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Filter (age#152 >= 20)                       Filter (age#152 >= 20)
!+- Join FullOuter, (id#146 = no#151)         +- Join RightOuter, (id#146 = no#151)
    :- LocalRelation [id#146, age#147]           :- LocalRelation [id#146, age#147]
    +- LocalRelation [no#151, age#152]           +- LocalRelation [no#151, age#152]
```
Filtering on the right-side columns discards every row whose right side was padded with null, i.e. the rows that exist only in the left dataset (again, see the full outer result table in case 1; matched rows are never null-padded). Only rows that actually come from the right dataset and satisfy the condition remain, which is equivalent to a right outer join followed by the same filter.
Implementation of the rule
The core implementation of the optimization rule (abridged) is as follows:
```scala
object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
  ...
  private def buildNewJoinType(filter: Filter, join: Join): JoinType = {
    // Collect the filter conditions (plus inferred constraints) to check against the rule
    val conditions = splitConjunctivePredicates(filter.condition) ++ filter.constraints
    // Conditions that reference only the left dataset
    val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
    // Conditions that reference only the right dataset
    val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
    lazy val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
    lazy val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)

    // Return the optimized join type according to the existing join type
    join.joinType match {
      case RightOuter if leftHasNonNullPredicate => Inner
      case LeftOuter if rightHasNonNullPredicate => Inner
      case FullOuter if leftHasNonNullPredicate && rightHasNonNullPredicate => Inner
      case FullOuter if leftHasNonNullPredicate => LeftOuter
      case FullOuter if rightHasNonNullPredicate => RightOuter
      case o => o
    }
  }

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Only a Filter sitting directly on top of an outer Join is eligible for this rule
    case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _)) =>
      val newJoinType = buildNewJoinType(f, j)
      // If the join type is unchanged, keep the original Filter; otherwise rebuild it over the new join
      if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
  }
}
```
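The key helper here is canFilterOutNull: a condition "filters out null" if it can never evaluate to true when the attribute it references is null. A conceptual sketch of that idea in plain Scala (not Spark's actual implementation), using Option[Int] to stand in for a nullable column:

```scala
// SQL-style three-valued predicates over a nullable value
def geq20(v: Option[Int]): Option[Boolean] = v.map(_ >= 20)       // age >= 20
def isNullPred(v: Option[Int]): Option[Boolean] = Some(v.isEmpty) // age IS NULL

// A predicate filters out null if it is never true on a null input
def filtersOutNull(p: Option[Int] => Option[Boolean]): Boolean =
  !p(None).contains(true)

println(filtersOutNull(geq20))      // true  -> the join type can be downgraded
println(filtersOutNull(isNullPred)) // false -> the join type is kept
```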
Summary
Outer joins, and full outer joins in particular, are expensive. When a null-filtering condition is applied on top of one, the EliminateOuterJoin rule can downgrade it to a cheaper join type. There are many more rules involved in optimizing join logical plans, which will be analyzed in later articles.