Spark SQL workflow source code analysis (III) analysis stage (based on Spark 3.3.0)

Posted by fredroines on Thu, 24 Feb 2022 15:56:16 +0100

Preface

This article is part of the author's original column on the big data technology system. Please credit the source when quoting, and feel free to point out any shortcomings or mistakes in the comments. Thank you!

For the table of contents and references of this column, see Big data technology system

Contents

Spark SQL workflow source code analysis (I) overview (based on Spark 3.3.0)

Spark SQL workflow source code analysis (II) parsing phase (based on Spark 3.3.0)

Spark SQL workflow source code analysis (III) analysis stage (based on Spark 3.3.0)

Spark SQL workflow source code analysis (IV) optimization stage (based on Spark 3.3.0)

Spark SQL workflow source code analysis (V) planning stage (based on Spark 3.3.0)

Related articles

An article explaining the Spark 3.x Catalog system

What is the Table Catalog API in Spark 3.x?

What improvements does the DataSource V2 API bring, and how does it differ from V1?

What are we going to do?

From the previous parts it is easy to see that:

The parsing phase converts the SQL statement into an AST. This depends only on the SQL text itself: the same SQL statement always yields the same AST.

The analysis stage takes the generated AST further. After all, the AST is derived purely from the SQL text, while the same SQL statement can mean different things in different environments. What we need to do here is adapt the plan to its environment.

How do we adapt it to the environment?

First, let's take stock of what we have; everything else follows from what is in hand.

What do we have?

After the parsing phase, we get an AST:

In addition, we have a configured environment:

val spark = SparkSession.builder.master("local[*]").appName("SparkSQLExample").getOrCreate()

We have also prepared the data source and the schema.

The data source comes from here:

val df = spark.read.json(DATA_PATH)

schema information comes from here:

df.createTempView("t_user")

Let's see what key information the above code can bring us:

  1. We know a temporary view: t_user
  2. We inferred the field name and field type from the JSON file:
Field name | Field type
name | StringType
age | LongType
sex | StringType
addr | ArrayType

The above field types are internal types from Apache Spark.
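As a quick sanity check, we can ask Spark to print the inferred schema directly. The snippet below reuses the df from the example above; the output shown is only an approximation (for instance, the element type of addr is assumed to be string):

  // Print the schema Spark inferred from the JSON file
  df.printSchema()
  // root
  //  |-- addr: array (nullable = true)
  //  |    |-- element: string (containsNull = true)
  //  |-- age: long (nullable = true)
  //  |-- name: string (nullable = true)
  //  |-- sex: string (nullable = true)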

Look at the source code implementation

Find the entry point

  def sql(sqlText: String): DataFrame = withActive {
    val tracker = new QueryPlanningTracker
    val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
      // Parsing phase
      sessionState.sqlParser.parsePlan(sqlText)
    }
    Dataset.ofRows(self, plan, tracker)
  }
  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
    : DataFrame = sparkSession.withActive {
    val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
    qe.assertAnalyzed()
    new Dataset[Row](qe, RowEncoder(qe.analyzed.schema))
  }
  // analyzed is initialized lazily
  def assertAnalyzed(): Unit = analyzed
  lazy val analyzed: LogicalPlan = executePhase(QueryPlanningTracker.ANALYSIS) {
    // This is the entrance to the 'analysis' phase
    sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
  }
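The same analyzed plan can also be inspected from user code through QueryExecution, which is handy when following along with this article. A minimal sketch, reusing the spark session and the t_user view from our example:

  // Trigger parsing + analysis and print the analyzed logical plan
  val qe = spark.sql("SELECT * FROM t_user").queryExecution
  println(qe.analyzed.treeString)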

Analyzer.executeAndCheck

def executeAndCheck(plan: LogicalPlan, tracker: QueryPlanningTracker): LogicalPlan = {
    // If the plan has already been analyzed, return it directly
    if (plan.analyzed) return plan
    // A ThreadLocal[Int] is used here to guard against the analyzer calling itself recursively
    AnalysisHelper.markInAnalyzer {
      // Execute
      val analyzed = executeAndTrack(plan, tracker)
      try {
        // Check the analysis result after execution
        checkAnalysis(analyzed)
        analyzed
      } catch {
        // Attach the analyzed plan to the AnalysisException and rethrow
        case e: AnalysisException =>
          val ae = e.copy(plan = Option(analyzed))
          ae.setStackTrace(e.getStackTrace)
          throw ae
      }
    }
  }

You can see that the core logic of the analysis phase consists of two steps:

  1. Execute
  2. Check

Let's look at execution first.

Execution

  def executeAndTrack(plan: TreeType, tracker: QueryPlanningTracker): TreeType = {
    // Set up the tracker. For the query plan tracker, please refer to the first lecture
    QueryPlanningTracker.withTracker(tracker) {
      execute(plan)
    }
  }
/**
 * Executes the batches of rules defined by the subclass.
 * The batches are executed serially, using the defined execution strategies.
 * Within each batch, the rules are also executed serially.
 */
def execute(plan: TreeType): TreeType = {
    var curPlan = plan
    // This is used to count some operation information, such as how much time it takes, running rule batches, etc
    val queryExecutionMetrics = RuleExecutor.queryExecutionMeter
    // After the rule or rule batch is applied, log the changes of the logical plan
    val planChangeLogger = new PlanChangeLogger[TreeType]()
    // It's also the query plan tracker. See Lecture 1
    val tracker: Option[QueryPlanningTracker] = QueryPlanningTracker.get
    // Measurement information before execution
    val beforeMetrics = RuleExecutor.getCurrentMetrics()

    // This is used to check the integrity of the logical plan, mainly by checking whether the ID (exprId) of the named expression is unique and not repeated
    if (!isPlanIntegral(plan, plan)) {
      throw QueryExecutionErrors.structuralIntegrityOfInputPlanIsBrokenInClassError(
        this.getClass.getName.stripSuffix("$"))
    }

    batches.foreach { batch =>
      // The logical plan this rule batch starts from
      val batchStartPlan = curPlan
      // Number of iterations
      var iteration = 1
      // Record the current logical plan
      var lastPlan = curPlan
      // Flag indicating whether to keep looping
      var continue = true

      // Run to a fixed point (or the maximum number of iterations specified in the execution strategy)
      while (continue) {
        curPlan = batch.rules.foldLeft(curPlan) {
          case (plan, rule) =>
            // Start time of rule run
            val startTime = System.nanoTime()
            // Apply rules to logical plans
            val result = rule(plan)
            // Run time of the rule
            val runTime = System.nanoTime() - startTime
            // If the result differs from the input, the rule took effect
            val effective = !result.fastEquals(plan)

            if (effective) {
              // Record the number of effective rule applications
              queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName)
              // Record the effective running time
              queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime)
              // Log how the rule changed the plan
              planChangeLogger.logRule(rule.ruleName, plan, result)
            }
            // Record the execution time
            queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, runTime)
            // Record the number of times the rule was executed
            queryExecutionMetrics.incNumExecution(rule.ruleName)

            // Use the query plan tracker to record some time-related information
            tracker.foreach(_.recordRuleInvocation(rule.ruleName, runTime, effective))

            // After running each rule, check the integrity of the logical plan
            if (effective && !isPlanIntegral(plan, result)) {
              throw QueryExecutionErrors.structuralIntegrityIsBrokenAfterApplyingRuleError(
                rule.ruleName, batch.name)
            }

            result
        }
        // Increment the iteration count
        iteration += 1
        if (iteration > batch.strategy.maxIterations) {
          // Only log a warning if the batch actually ran more than once
          if (iteration != 2) {
            val endingMsg = if (batch.strategy.maxIterationsSetting == null) {
              "."
            } else {
              s", please set '${batch.strategy.maxIterationsSetting}' to a larger value."
            }
            val message = s"Max iterations (${iteration - 1}) reached for batch ${batch.name}" +
              s"$endingMsg"
            if (Utils.isTesting || batch.strategy.errorOnExceed) {
              throw new RuntimeException(message)
            } else {
              logWarning(message)
            }
          }
          // Check the idempotency of a single rule batch
          if (batch.strategy == Once &&
            Utils.isTesting && !excludedOnceBatches.contains(batch.name)) {
            checkBatchIdempotence(batch, curPlan)
          }
          continue = false
        }

        // If the logical plan has not changed, exit the loop
        if (curPlan.fastEquals(lastPlan)) {
          logTrace(
            s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
          continue = false
        }
        lastPlan = curPlan
      }
      // Log how the logical plan changed from the start of the batch to now
      planChangeLogger.logBatch(batch.name, batchStartPlan, curPlan)
    }
    // Log print some measurement information
    planChangeLogger.logMetrics(RuleExecutor.getCurrentMetrics() - beforeMetrics)

    curPlan
  }

The flow chart is as follows:

The whole process seems simple, and the core logic is actually the following line of code:

val result = rule(plan)

This applies a specific rule to the current logical plan and yields a new logical plan.

Seeing this, it becomes clear that the execution loop itself is just scaffolding; the real work is done by Batch and Rule.
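If you want to watch the rules fire one by one, the PlanChangeLogger used above can be surfaced through a configuration key. A hedged sketch, assuming the key is spark.sql.planChangeLog.level (available in Spark 3.x; verify against your version):

  // Log every rule/batch that changes the plan at WARN level so that it
  // shows up with the default log configuration
  spark.conf.set("spark.sql.planChangeLog.level", "WARN")
  spark.sql("SELECT name, age FROM t_user").collect()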

What exactly are Batch and Rule?

Rule Batch

In the source code, batches represents the sequence of rule batches:

  /**
   * Define the batch sequence of rules to be overridden by the implementation.
   */
  protected def batches: Seq[Batch]

A rule batch consists of a name, an execution strategy, and a list of rules.

  protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)

Execution strategy

  abstract class Strategy {

    /** Maximum number of iterations performed */
    def maxIterations: Int

    /** Whether to throw an exception after exceeding the maximum number of times */
    def errorOnExceed: Boolean = false

    /** The key in SQLConf used to configure the maximum number of iterations */
    def maxIterationsSetting: String = null
  }

There are two kinds of execution strategy: FixedPoint and Once.

fixedPoint

  // If the plan cannot be resolved within maxIterations, the analyzer throws an exception telling the user to increase the value of SQLConf.ANALYZER_MAX_ITERATIONS.
  protected def fixedPoint =
    FixedPoint(
      conf.analyzerMaxIterations,
      errorOnExceed = true,
      maxIterationsSetting = SQLConf.ANALYZER_MAX_ITERATIONS.key)

SQLConf.ANALYZER_MAX_ITERATIONS corresponds to the configuration item spark.sql.analyzer.maxIterations.

  case class FixedPoint(
    override val maxIterations: Int,
    override val errorOnExceed: Boolean = false,
    override val maxIterationsSetting: String = null) extends Strategy

FixedPoint represents the policy that runs to a fixed point or the maximum number of iterations, whichever comes first.

The rule batch for FixedPoint(1) should only run once.

Once

  case object Once extends Strategy { val maxIterations = 1 }

Once represents a strategy that runs only once and is expected to be idempotent.

Rule

abstract class Rule[TreeType <: TreeNode[_]] extends SQLConfHelper with Logging {

  // The integer ID of the rule is used to trim unnecessary tree traversal
  protected lazy val ruleId = RuleIdCollection.getRuleId(this.ruleName)

  /** The name of the current rule, which is automatically inferred according to the class name */
  val ruleName: String = {
    val className = getClass.getName
    if (className endsWith "$") className.dropRight(1) else className
  }

  /** The core application function */
  def apply(plan: TreeType): TreeType
}

TreeType represents any subclass of TreeNode, and TreeNode is the parent of all tree structures in Spark SQL. For example, the logical plan in the previous source code is one of its subclasses.
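To make the contract concrete, here is a minimal, purely illustrative Rule (not part of Spark): it logs the plan it receives and returns it unchanged, so in the terms of execute above it is never "effective". The name NoopLoggingRule is made up for this sketch.

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  // A do-nothing rule: Rule mixes in Logging, so logInfo is available,
  // and ruleName is derived from the class name as shown above
  object NoopLoggingRule extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = {
      logInfo(s"$ruleName visited:\n${plan.treeString}")
      plan
    }
  }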

What are the rules?

We can find the answer in Analyzer.scala:

  override def batches: Seq[Batch] = Seq(
    Batch("Substitution", fixedPoint,
      OptimizeUpdateFields,
      CTESubstitution,
      WindowsSubstitution,
      EliminateUnions,
      SubstituteUnresolvedOrdinals),
    Batch("Disable Hints", Once,
      new ResolveHints.DisableHints),
    Batch("Hints", fixedPoint,
      ResolveHints.ResolveJoinStrategyHints,
      ResolveHints.ResolveCoalesceHints),
    Batch("Simple Sanity Check", Once,
      LookupFunctions),
    Batch("Keep Legacy Outputs", Once,
      KeepLegacyOutputs),
    Batch("Resolution", fixedPoint,
      ResolveTableValuedFunctions(v1SessionCatalog) ::
      ResolveNamespace(catalogManager) ::
      new ResolveCatalogs(catalogManager) ::
      ResolveUserSpecifiedColumns ::
      ResolveInsertInto ::
      ResolveRelations ::
      ResolvePartitionSpec ::
      ResolveFieldNameAndPosition ::
      AddMetadataColumns ::
      DeduplicateRelations ::
      ResolveReferences ::
      ResolveExpressionsWithNamePlaceholders ::
      ResolveDeserializer ::
      ResolveNewInstance ::
      ResolveUpCast ::
      ResolveGroupingAnalytics ::
      ResolvePivot ::
      ResolveOrdinalInOrderByAndGroupBy ::
      ResolveAggAliasInGroupBy ::
      ResolveMissingReferences ::
      ExtractGenerator ::
      ResolveGenerate ::
      ResolveFunctions ::
      ResolveAliases ::
      ResolveSubquery ::
      ResolveSubqueryColumnAliases ::
      ResolveWindowOrder ::
      ResolveWindowFrame ::
      ResolveNaturalAndUsingJoin ::
      ResolveOutputRelation ::
      ExtractWindowExpressions ::
      GlobalAggregates ::
      ResolveAggregateFunctions ::
      TimeWindowing ::
      SessionWindowing ::
      ResolveInlineTables ::
      ResolveLambdaVariables ::
      ResolveTimeZone ::
      ResolveRandomSeed ::
      ResolveBinaryArithmetic ::
      ResolveUnion ::
      typeCoercionRules ++
      Seq(ResolveWithCTE) ++
      extendedResolutionRules : _*),
    Batch("Remove TempResolvedColumn", Once, RemoveTempResolvedColumn),
    Batch("Apply Char Padding", Once,
      ApplyCharTypePadding),
    Batch("Post-Hoc Resolution", Once,
      Seq(ResolveCommandsWithIfExists) ++
      postHocResolutionRules: _*),
    Batch("Remove Unresolved Hints", Once,
      new ResolveHints.RemoveAllHints),
    Batch("Nondeterministic", Once,
      PullOutNondeterministic),
    Batch("UDF", Once,
      HandleNullInputsForUDF,
      ResolveEncodersInUDF),
    Batch("UpdateNullability", Once,
      UpdateAttributeNullability),
    Batch("Subquery", Once,
      UpdateOuterReferences),
    Batch("Cleanup", fixedPoint,
      CleanupAliases),
    Batch("HandleAnalysisOnlyCommand", Once,
      HandleAnalysisOnlyCommand)
  )
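The extendedResolutionRules slot at the end of the Resolution batch is the hook for user-defined rules. As a hedged sketch, a custom rule such as the illustrative NoopLoggingRule above can be injected through SparkSessionExtensions.injectResolutionRule when the session is built:

  import org.apache.spark.sql.SparkSession

  // injectResolutionRule adds the rule to the analyzer's extendedResolutionRules
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("SparkSQLExample")
    .withExtensions(ext => ext.injectResolutionRule(_ => NoopLoggingRule))
    .getOrCreate()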

Since at the bottom everything comes down to running rules, the following table is organized rule by rule.

Rule | Rule batch | Execution strategy | Description
OptimizeUpdateFields | Substitution | fixedPoint | Optimizes chains of UpdateFields expressions, so it looks more like an optimizer rule. However, when manipulating deeply nested schemas the UpdateFields expression tree can become too complex to analyze, so UpdateFields has to be optimized as early as possible, at the start of analysis.
CTESubstitution | Substitution | fixedPoint | Analyzes WITH nodes and substitutes child plans with CTE definitions or CTE references as follows: 1. in legacy mode, or if the query is a SQL command or DML statement, substitute with the CTE definition, i.e. inline the CTE; 2. otherwise substitute with a CTE reference (CTERelationRef); after query analysis the InlineCTE rule decides whether to inline. For each main query and subquery, all CTE definitions that are not inlined after this substitution are grouped under a WithCTE node; any main query or subquery that contains no CTEs, or whose CTEs were all inlined, contains no WithCTE node. When present, the WithCTE node sits in the same place as the outermost WITH node. CTE definitions inside a WithCTE node are kept in resolution order, which guarantees that for any valid CTE query they appear in topological order of their dependencies (given definitions A and B where B references A, A is guaranteed to appear before B); otherwise the user query is invalid and relation resolution will throw an analysis exception later.
WindowsSubstitution | Substitution | fixedPoint | Substitutes child plans with WindowSpecDefinitions, where a WindowSpecDefinition is the specification of a window function.
EliminateUnions | Substitution | fixedPoint | Removes a Union operator from the plan when it has only a single child.
SubstituteUnresolvedOrdinals | Substitution | fixedPoint | Replaces ordinals in ORDER BY or GROUP BY with UnresolvedOrdinal expressions, where UnresolvedOrdinal represents an unresolved ordinal used by ORDER BY or GROUP BY, e.g. GROUP BY 1 or ORDER BY 1.
DisableHints | Disable Hints | Once | Removes all hints in the plan when the configuration item spark.sql.optimizer.disableHints is set. This runs at the very beginning of the Analyzer to disable the hint feature.
ResolveJoinStrategyHints | Hints | fixedPoint | The list of allowed join strategy hints is defined in JoinStrategyHint.strategies. A join strategy hint can be given a sequence of relation aliases, e.g. MERGE(a, c), BROADCAST(a). A hint plan node is inserted on top of any relation (whether aliased or not), subquery or common table expression that matches the specified name. Hint resolution works by recursively traversing the query plan to find a relation or subquery that matches one of the specified relation aliases; the traversal does not go past view references, WITH clauses or subquery aliases. This rule must run before common table expressions. Note: the allowed join strategies are 1. BROADCAST ("BROADCAST", "BROADCASTJOIN", "MAPJOIN"); 2. SHUFFLE_MERGE ("SHUFFLE_MERGE", "MERGE", "MERGEJOIN"); 3. SHUFFLE_HASH ("SHUFFLE_HASH"); 4. SHUFFLE_REPLICATE_NL ("SHUFFLE_REPLICATE_NL"); the category is outside the parentheses and the concrete hint strings are inside.
ResolveCoalesceHints | Hints | fixedPoint | The COALESCE hint family accepts the names "COALESCE", "REPARTITION" and "REPARTITION_BY_RANGE".
LookupFunctions | Simple Sanity Check | Once | Checks whether the function identifiers of unresolved function references are defined in the function registry. Note that this rule does not attempt to resolve the UnresolvedFunctions; it only performs a simple existence check based on the function identifier, to quickly identify undefined functions without triggering relation resolution, which in some cases could lead to a potentially expensive partition/schema discovery process. To avoid duplicate external function lookups, external function identifiers are stored in the local hash set externalFunctionNameSet.
KeepLegacyOutputs | Keep Legacy Outputs | Once | When spark.sql.legacy.keepCommandOutputSchema is set to true, Spark keeps the output schema of commands such as SHOW DATABASES unchanged. Note: applies to SHOW TABLES / SHOW NAMESPACES / DESCRIBE NAMESPACE / SHOW TBLPROPERTIES.
ResolveTableValuedFunctions | Resolution | fixedPoint | Rule for resolving references to table-valued functions.
ResolveNamespace | Resolution | fixedPoint | Resolves commands such as SHOW TABLES and SHOW FUNCTIONS. Note: SHOW TABLES / SHOW TABLE EXTENDED / SHOW VIEWS / SHOW FUNCTIONS / ANALYZE TABLES.
ResolveCatalogs | Resolution | fixedPoint | Resolves the catalog from the multi-part identifier in a SQL statement and, if the resolved catalog is not the session catalog, converts the statement into the corresponding v2 command.
ResolveUserSpecifiedColumns | Resolution | fixedPoint | Resolves user-specified columns.
ResolveInsertInto | Resolution | fixedPoint | Resolves INSERT INTO statements.
ResolveRelations | Resolution | fixedPoint | Replaces unresolved relations (tables and views) with concrete relations from the catalog. Note: this rule kicks in when the logical plan contains an UnresolvedRelation, which is exactly the case in our example.
ResolvePartitionSpec | Resolution | fixedPoint | Resolves UnresolvedPartitionSpec to ResolvedPartitionSpec in partition-related commands.
ResolveFieldNameAndPosition | Resolution | fixedPoint | Rule for resolving, normalizing and rewriting field names and positions according to the case sensitivity of the command.
AddMetadataColumns | Resolution | fixedPoint | Adds metadata columns to the output of a child relation when a node is missing resolved attributes. References to metadata columns are resolved using the columns in LogicalPlan.metadataOutput, but before a relation is replaced its output does not include the metadata columns; unless this rule adds them to the relation's output, the analyzer would conclude that nothing produces the column. The rule only adds metadata columns when a node is already resolved but lacks input from its children, which ensures metadata columns are not added to the plan unless they are used; and by checking only resolved nodes it ensures that * expansion has already happened, so * does not accidentally select a metadata column. The rule resolves operators downwards to avoid projecting metadata columns prematurely.
DeduplicateRelations | Resolution | fixedPoint | Removes duplicate relations from a LogicalPlan.
ResolveReferences | Resolution | fixedPoint | Replaces unresolved attributes with concrete AttributeReferences taken from the children of the logical plan node.
ResolveExpressionsWithNamePlaceholders | Resolution | fixedPoint | Resolves expressions that contain name placeholders.
ResolveDeserializer | Resolution | fixedPoint | Replaces UnresolvedDeserializer with a deserialization expression that has been resolved against the given input attributes.
ResolveNewInstance | Resolution | fixedPoint | Resolves a NewInstance by finding and attaching the outer scope when the object being constructed is an inner class.
ResolveUpCast | Resolution | fixedPoint | Replaces UpCast with Cast, throwing an exception if the conversion might truncate.
ResolveGroupingAnalytics | Resolution | fixedPoint | Resolves grouping functions (grouping analytics).
ResolvePivot | Resolution | fixedPoint | Resolves PIVOT (turning rows into columns).
ResolveOrdinalInOrderByAndGroupBy | Resolution | fixedPoint | In many SQL dialects, ordinal positions are valid in ORDER/SORT BY and GROUP BY clauses, and this rule converts those ordinal positions into the corresponding expressions from the select list. The support was introduced in Spark 2.0. Ordering references or group-by expressions that are not integer literals but are foldable expressions are ignored, and when spark.sql.orderByOrdinal / spark.sql.groupByOrdinal is set to false the positions are ignored as well. Note: before Spark 2.0, literals in ORDER/SORT BY and GROUP BY clauses had no effect on the result.
ResolveAggAliasInGroupBy | Resolution | fixedPoint | Replaces unresolved expressions in grouping keys with resolved ones from the SELECT clause. This rule should run after ResolveReferences has been applied.
ResolveMissingReferences | Resolution | fixedPoint | In many SQL dialects it is valid to sort by attributes that do not appear in the SELECT clause. This rule detects such queries and adds the required attributes to the original projection so that they are available during sorting, then adds another projection to remove them after the sort. The HAVING clause can likewise use grouping columns that are not shown in the SELECT.
ExtractGenerator | Resolution | fixedPoint | Extracts a Generator from the projectList of a Project operator and creates a Generate operator under the Project. This rule throws an AnalysisException when: 1. a generator is nested inside an expression, e.g. SELECT explode(list) + 1 FROM tbl; 2. there is more than one generator in the projectList, e.g. SELECT explode(list), explode(list) FROM tbl; 3. a generator is found in an operator other than Project or Generate, e.g. SELECT * FROM tbl SORT BY explode(list).
ResolveGenerate | Resolution | fixedPoint | Rewrites table-generating expressions that still need one or more of the following to be resolved: concrete attribute references for their output; relocation from the SELECT clause (i.e. from Project) into the Generate clause; the names of the output attributes, extracted from the Alias or MultiAlias expressions that wrap the Generator.
ResolveFunctions | Resolution | fixedPoint | Replaces an unresolved function with a concrete LogicalPlan, and an UnresolvedFunction with a concrete Expression.
ResolveAliases | Resolution | fixedPoint | Replaces UnresolvedAlias with concrete aliases.
ResolveSubquery | Resolution | fixedPoint | Resolves and rewrites subqueries inside expressions. Note: CTEs are handled in CTESubstitution.
ResolveSubqueryColumnAliases | Resolution | fixedPoint | Replaces unresolved column aliases of subqueries with projections.
ResolveWindowOrder | Resolution | fixedPoint | Checks and adds an order to AggregateWindowFunctions.
ResolveWindowFrame | Resolution | fixedPoint | Checks and adds a suitable window frame for all window functions.
ResolveNaturalAndUsingJoin | Resolution | fixedPoint | Removes natural joins and USING joins by computing the output columns from both sides and then applying a projection on top of an ordinary join to eliminate the natural or USING join.
ResolveOutputRelation | Resolution | fixedPoint | Resolves the columns of an output table from the data in the logical plan. This rule will: 1. reorder columns when writing by name; 2. insert casts when data types do not match; 3. insert aliases when column names do not match; 4. detect plans that are incompatible with the output table and raise an AnalysisException.
ExtractWindowExpressions | Resolution | fixedPoint | Extracts WindowExpressions from the projectList of a Project operator and the aggregateExpressions of an Aggregate operator, and creates a separate Window operator for each distinct WindowSpecDefinition. This rule handles three cases: 1. a Project whose projectList contains WindowExpressions; 2. an Aggregate whose aggregateExpressions contain WindowExpressions; 3. a Filter -> Aggregate pattern representing a HAVING clause over a GROUP BY, where the Aggregate has WindowExpressions in its aggregateExpressions.
GlobalAggregates | Resolution | fixedPoint | Converts a projection containing aggregate expressions into an Aggregate.
ResolveAggregateFunctions | Resolution | fixedPoint | Looks for aggregate expressions that are not inside an Aggregate operator, e.g. those in a HAVING clause or an ORDER BY clause. These expressions are pushed down into the underlying Aggregate operator and then projected away on top of the original operator.
TimeWindowing | Resolution | fixedPoint | Uses the Expand operator to map a time column to multiple time windows. Because it is non-trivial to compute how many windows a time column can map to, we over-estimate the number of windows and filter out rows whose time column is not within the time window.
SessionWindowing | Resolution | fixedPoint | Maps the time column to a session window.
ResolveInlineTables | Resolution | fixedPoint | Replaces an UnresolvedInlineTable with a LocalRelation.
ResolveLambdaVariables | Resolution | fixedPoint | Resolves the lambda variables exposed by higher-order functions, in two steps: 1. bind the anonymous variables exposed by the higher-order function to the lambda function's arguments, creating named and typed lambda variables; this step also checks for duplicate argument names and verifies the number of arguments; 2. resolve the lambda variables used in the lambda function's expression tree. Note that variables from outside the current lambda are allowed; these can be lambda variables defined in an outer scope or attributes produced by the plan's children. If a name is duplicated, the one defined in the innermost scope is used.
ResolveTimeZone | Resolution | fixedPoint | Replaces TimeZoneAwareExpressions that have no time zone id with copies that use the session's local time zone.
ResolveRandomSeed | Resolution | fixedPoint | Sets the seed for random number generation.
ResolveBinaryArithmetic | Resolution | fixedPoint | For addition: 1. if both sides are intervals, keep it unchanged; 2. otherwise, if one side is a date and the other an interval, turn it into DateAddInterval; 3. otherwise, if one side is an interval, turn it into TimeAdd; 4. otherwise, if one side is a date, turn it into DateAdd; 5. otherwise keep it unchanged. For subtraction: 1. if both sides are intervals, keep it unchanged; 2. otherwise, if the left side is a date and the right side an interval, turn it into DateAddInterval(l, -r); 3. otherwise, if the right side is an interval, turn it into TimeAdd(l, -r); 4. otherwise, if one side is a timestamp, turn it into SubtractTimestamps; 5. otherwise, if the right side is a date, turn it into DateDiff/SubtractDates; 6. otherwise, if the left side is a date, turn it into DateSub; 7. otherwise keep it unchanged. For multiplication: if one side is an interval, turn it into MultiplyInterval, otherwise keep it unchanged. For division: if the left side is an interval, turn it into DivideInterval, otherwise keep it unchanged.
ResolveUnion | Resolution | fixedPoint | Resolves the different children of a Union to a common set of columns.
typeCoercionRules | Resolution | fixedPoint | A group of type coercion rules; when spark.sql.ansi.enabled is set to true, the ANSI variant of the rules is used (see the next section).
ResolveWithCTE | Resolution | fixedPoint | Updates CTE references with the resolved output attributes of the corresponding CTE definitions.
extendedResolutionRules | Resolution | fixedPoint | An extension point that can be overridden to provide additional rules.
RemoveTempResolvedColumn | Remove TempResolvedColumn | Once | Removes all TempResolvedColumns in the query plan. This is the last safety net, in case some rule in the main resolution batch failed to remove them; it should run right after the main resolution batch.
ApplyCharTypePadding | Apply Char Padding | Once | Performs string padding for char type comparisons: when comparing a char type column/field with a string literal or another char type column/field, the shorter side is right-padded to the length of the longer one.
ResolveCommandsWithIfExists | Post-Hoc Resolution | Once | Handles commands whose table or temporary view could not be resolved. These commands support an ifExists flag so that they do not fail when the relation is unresolved; if the ifExists flag is set to true, the logical plan is resolved to a NoopCommand. Note: DROP TABLE / DROP VIEW / UNCACHE TABLE / DROP FUNCTION.
postHocResolutionRules | Post-Hoc Resolution | Once | An extension point providing rules for post-hoc resolution. Note that these rules run in a single Once batch, which executes after the normal resolution batch.
RemoveAllHints | Remove Unresolved Hints | Once | Removes all hints, eliminating invalid hints provided by the user. This must run after all other hint rules have been executed.
PullOutNondeterministic | Nondeterministic | Once | Pulls nondeterministic expressions out of LogicalPlan nodes that are not Project or Filter, puts them into an inner Project, and finally projects them away in an outer Project.
HandleNullInputsForUDF | UDF | Once | Handles null primitive inputs of a UDF correctly by adding an extra If expression to perform the null check. When a user defines a UDF with primitive parameters, there is no way to tell whether a primitive parameter was null, so we assume a null primitive input should propagate: if the input is null, the UDF returns null.
ResolveEncodersInUDF | UDF | Once | Resolves a UDF's encoders by giving the attributes explicitly, to handle cases where the data type of the input value differs from the encoder's internal schema, which could lead to data loss. For example, if the actual data type is Decimal(30, 0), the encoder should not cast the input value to Decimal(38, 18). The resolved encoders are then used to deserialize internal rows into Scala values.
UpdateAttributeNullability | UpdateNullability | Once | Updates the nullability of attributes in a resolved LogicalPlan using the nullability of the corresponding attributes in its children's output. This step is needed because users can use a resolved AttributeReference in the Dataset API, and outer joins can change the nullability of an AttributeReference. Without this rule, the nullable field of a column that can be NULL could effectively be treated as non-nullable, leading to illegal optimizations (such as NULL propagation) and wrong answers. See SPARK-13484 and SPARK-13801 for concrete queries that hit this case.
UpdateOuterReferences | Subquery | Once | Pushes aggregate expressions in a subquery that reference the outer query block down to the outer query block for evaluation, and updates those outer references as AttributeReferences to attributes in the outer query block.
CleanupAliases | Cleanup | fixedPoint | Removes unnecessary aliases from the plan. Basically, Alias is only needed as a top-level expression in Project (project list), Aggregate (aggregate expressions) or Window (window expressions). Note that if an expression has other expression parameters that are not among its children, e.g. RuntimeReplaceable, the alias transformation in this rule cannot be applied to those parameters.
HandleAnalysisOnlyCommand | HandleAnalysisOnlyCommand | Once | Marks a command as analyzed so that its child plans are removed and are not touched by the optimizer. This rule should run after all other analysis rules.

Resolution order

The rules in the table above are applied in order.

The sequence is shown in the following figure:

typeCoercionRules

Among the rules above there is one special group. Remember the Hive SQL ⇒ Spark SQL ⇒ ANSI SQL evolution mentioned in part II ( Spark SQL workflow source code analysis (II) parsing phase (based on Spark 3.3.0) )? This group of rules is related to it.

ANSI SQL

When spark.sql.ansi.enabled is set to true, Spark tends to handle SQL the ANSI way.
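A small illustration of what handling SQL the ANSI way means in practice; the behavior sketched below is assumed for Spark 3.3 and the exact error message may differ across versions:

  // With ANSI mode off (the default), an invalid cast silently returns NULL
  spark.conf.set("spark.sql.ansi.enabled", "false")
  spark.sql("SELECT CAST('abc' AS INT)").show()   // prints NULL

  // With ANSI mode on, the same cast fails at runtime
  spark.conf.set("spark.sql.ansi.enabled", "true")
  spark.sql("SELECT CAST('abc' AS INT)").show()   // throws a cast error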

Rule | Description
InConversion | Handles type coercion for IN expressions with and without subqueries: 1. in the first case, find the common type by comparing the left-hand-side (LHS) expression types with the corresponding right-hand-side (RHS) expression types derived from the output of the subquery expression's plan, and inject appropriate casts into the LHS and RHS of the IN expression; 2. in the second case, convert the value and the IN list expressions to the common operator type by looking at all the argument types and finding the closest type all arguments can be cast to; if no common operator type is found, return the original expression and let the type-checking phase raise an analysis error.
PromoteStringLiterals | Promotes string literals that appear in arithmetic, comparison and datetime expressions.
DecimalPrecision | Calculates and propagates the precision of fixed-precision decimals. Hive has a number of rules for this, based on standard SQL and MS SQL: https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf https://msdn.microsoft.com/en-us/library/ms190476.aspx
FunctionArgumentConversion | Makes sure the argument types of different functions are as expected.
ConcatCoercion | Coerces the types of Concat's children to the expected type. If spark.sql.function.concatBinaryAsString is false and all child types are binary, the expected type is binary; otherwise it is string.
MapZipWithCoercion | Coerces the key types of the two different MapType arguments of a MapZipWith expression to a common type.
EltCoercion | Coerces the types of Elt's children to the expected type. If spark.sql.function.eltOutputAsString is false and all child types are binary, the expected type is binary; otherwise it is string.
CaseWhenCoercion | Coerces the types of the different branches of a CASE WHEN statement to a common type.
IfCoercion | Coerces the types of the different branches of an If statement to a common type.
StackCoercion | Coerces NullTypes in a Stack expression to the column type of the corresponding position.
Division | Hive only performs integral division with the DIV operator; the arguments to / are always converted to fractional types.
IntegralDivision | The DIV operator always returns a long value. This rule casts integral inputs to long to avoid overflow during computation.
ImplicitTypeCasts | Casts types according to the expected input types of an Expression.
DateTimeOperations | Handles the datetime functions (datetime_funcs); see Spark SQL functions.scala source code analysis (VIII) DateTime functions (based on Spark 3.3.0).
WindowFrameCoercion | Casts a WindowFrame to the type of its operand.
GetDateFieldOperations | When getting a date field from a timestamp column, casts the column to date type. This is Spark's effort to simplify the implementation: in the default type coercion rules the implicit cast rule does this job, but ANSI implicit casting does not allow converting timestamp to date, so this extra rule is needed to make extracting date fields from timestamp columns work.
HIVE SQL

When spark.sql.ansi.enabled is set to false, Spark tends to handle SQL the Hive way.

Rule | Description
InConversion | Handles type coercion for IN expressions with and without subqueries: 1. in the first case, find the common type by comparing the left-hand-side (LHS) expression types with the corresponding right-hand-side (RHS) expression types derived from the output of the subquery expression's plan, and inject appropriate casts into the LHS and RHS of the IN expression; 2. in the second case, convert the value and the IN list expressions to the common operator type by looking at all the argument types and finding the closest type all arguments can be cast to; if no common operator type is found, return the original expression and let the type-checking phase raise an analysis error.
PromoteStrings | Promotes string literals that appear in arithmetic expressions.
DecimalPrecision | Calculates and propagates the precision of fixed-precision decimals. Hive has a number of rules for this, based on standard SQL and MS SQL: https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf https://msdn.microsoft.com/en-us/library/ms190476.aspx
BooleanEquality | Changes numeric values to booleans so that expressions such as true = 1 can be evaluated.
FunctionArgumentConversion | Makes sure the argument types of different functions are as expected.
ConcatCoercion | Coerces the types of Concat's children to the expected type. If spark.sql.function.concatBinaryAsString is false and all child types are binary, the expected type is binary; otherwise it is string.
MapZipWithCoercion | Coerces the key types of the two different MapType arguments of a MapZipWith expression to a common type.
EltCoercion | Coerces the types of Elt's children to the expected type. If spark.sql.function.eltOutputAsString is false and all child types are binary, the expected type is binary; otherwise it is string.
CaseWhenCoercion | Coerces the types of the different branches of a CASE WHEN statement to a common type.
IfCoercion | Coerces the types of the different branches of an If statement to a common type.
StackCoercion | Coerces NullTypes in a Stack expression to the column type of the corresponding position.
Division | Hive only performs integral division with the DIV operator; the arguments to / are always converted to fractional types.
IntegralDivision | The DIV operator always returns a long value. This rule casts integral inputs to long to avoid overflow during computation.
ImplicitTypeCasts | Casts types according to the expected input types of an Expression.
DateTimeOperations | Handles the datetime functions (datetime_funcs); see Spark SQL functions.scala source code analysis (VIII) DateTime functions (based on Spark 3.3.0).
WindowFrameCoercion | Casts a WindowFrame to the type of its operand.
StringLiteralCoercion | A special rule that supports string literals as the second argument of the date_add/date_sub functions, to keep backward compatibility as a temporary workaround.
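For contrast, a sketch of the Hive-style behavior with spark.sql.ansi.enabled left at false; the exact results are an assumption based on the rules above (BooleanEquality and string promotion) and may vary by version:

  spark.conf.set("spark.sql.ansi.enabled", "false")
  // BooleanEquality lets a boolean be compared with a numeric literal
  spark.sql("SELECT true = 1").show()   // true
  // The string operand is promoted to a numeric type before the addition
  spark.sql("SELECT 1 + '2'").show()    // 3.0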

Back to the original example

Due to space constraints, it is impossible for us to explain in detail how each rule is implemented.

So let's go back to our original example and walk through how it is analyzed; once this one is clear, the other rules can be understood by analogy.

In our example, two rules play a central role:

  1. ResolveRelations
  2. ResolveReferences

ResolveRelations

The ResolveRelations rule replaces unresolved relations (where a relation means a table or a view) with concrete relations from the catalog.

So how is it implemented?

In fact, there are three steps. The specific flow chart is as follows:

Let's follow the source code to understand the above flow chart:

apply

Let's take a look at the implementation of the apply function. As mentioned earlier, this is the core function in the Rule.

def apply(plan: LogicalPlan)
        : LogicalPlan = plan.resolveOperatorsUpWithPruning(AlwaysProcess.fn, ruleId) {

First of all, what is resolveOperatorsUpWithPruning?

Recursively applying a partial function

/**
 * Returns a copy of this node where the rule has first been recursively applied to all of its
 * children and then to the node itself (post-order, bottom-up). When the rule does not apply
 * to a given node, it is left unchanged. This function is similar to transformUp, but it
 * skips subtrees that are already marked as analyzed.
 * @param rule   The function used to transform this node and its children.
 * @param cond   A lambda expression used to prune the tree traversal. If `cond.apply` returns
 *               false on an operator T, processing of T and its subtree is skipped; otherwise
 *               T and its subtree are processed recursively.
 * @param ruleId The unique id of the rule, used to prune unnecessary tree traversals. For an
 *               unknown rule there is no pruning. Otherwise, if the rule (with id ruleId) has
 *               been marked as ineffective on operator T, T and its subtree are skipped. Do not
 *               pass it if the rule is not purely functional and reads an initial state that
 *               differs between invocations.
 */
 def resolveOperatorsUpWithPruning(cond: TreePatternBits => Boolean,
    ruleId: RuleId = UnknownRuleId)(rule: PartialFunction[LogicalPlan, LogicalPlan])
  : LogicalPlan = {
    // Proceed only if the plan is not yet analyzed, the pruning condition allows processing this subtree, and this node and its subtree have not been marked ineffective for the rule with id ruleId
    if (!analyzed && cond.apply(self) && !isRuleIneffective(ruleId)) {
      // To prevent nested calls, a ThreadLocal[Int] is used to record the depth of the call
      AnalysisHelper.allowInvokingTransformsInAnalyzer {
        // Returns a copy of the current node, recursively applied to all its child nodes
        // The input of each rule is the output of its child nodes after applying the rule
        val afterRuleOnChildren = mapChildren(_.resolveOperatorsUpWithPruning(cond, ruleId)(rule))
        // If the logical plan has not changed
        val afterRule = if (self fastEquals afterRuleOnChildren) {
          CurrentOrigin.withOrigin(origin) {
            // Apply rule to logical plan
            rule.applyOrElse(self, identity[LogicalPlan])
          }
        } else {
          CurrentOrigin.withOrigin(origin) {
            // Apply rules to processed logical plan
            rule.applyOrElse(afterRuleOnChildren, identity[LogicalPlan])
          }
        }
        if (self eq afterRule) {
          // Mark the rule (with id ruleId) as ineffective for this tree node and its subtree
          self.markRuleAsIneffective(ruleId)
          self
        } else {
          // Copy node label
          afterRule.copyTagsFrom(self)
          afterRule
        }
      }
    } else {
      self
    }
  }

We can see that the core of this part is the partial function rule. Let's take a look at its implementation.

After the parsing phase is completed, our logical plan is as follows:

'Project [*]
+- 'UnresolvedRelation [t_user], [], false

Clearly, the branch of the partial function that matches our plan is the following one:

    case u: UnresolvedRelation =>
        lookupRelation(u).map(resolveViews).getOrElse(u)

Find the relation

We need to look up the corresponding relation (table or view) from the catalog.

  /**
   * If it is a v1 table in the session catalog, resolve the relation to a v1 relation, otherwise to a v2 relation.
   * This is used to resolve DML commands and SELECT queries.
   */
  private def lookupRelation(
        u: UnresolvedRelation,
        timeTravelSpec: Option[TimeTravelSpec] = None): Option[LogicalPlan] = {
      // Find temporary view
      lookupTempView(u.multipartIdentifier, u.isStreaming, timeTravelSpec.isDefined).orElse {
        // If not found, try the relation cache
        // If we parse database objects (relationships, functions, etc.) in the view, we may need to expand single or multi part identifiers using the current catalog and namespace after the view is created.
        expandIdentifier(u.multipartIdentifier) match {
          case CatalogAndIdentifier(catalog, ident) =>
            val key = catalog.name +: ident.namespace :+ ident.name
            // Find from relationship cache
            AnalysisContext.get.relationCache.get(key).map(_.transform {
              case multi: MultiInstanceRelation =>
                val newRelation = multi.newInstance()
                newRelation.copyTagsFrom(multi)
                newRelation
            }).orElse {
              // Not found in the cache: load the table and create the relation manually
              val table = CatalogV2Util.loadTable(catalog, ident, timeTravelSpec)
              val loaded = createRelation(catalog, ident, table, u.options, u.isStreaming)
              // Update cache
              loaded.foreach(AnalysisContext.get.relationCache.update(key, _))
              loaded
            }
          case _ => None
        }
      }
    }
  }

How is a temporary view looked up?

private def lookupTempView(
        identifier: Seq[String],
        isStreaming: Boolean = false,
        isTimeTravel: Boolean = false): Option[LogicalPlan] = {
      // If we are resolving a view, and the name was not a temporary view when the view was created, return None early
      if (isResolvingView && !isReferredTempViewName(identifier)) return None

      val tmpView = identifier match {
        // In our example, there is only one identifier: t_user
        // Find temporary views through SessionCatalog
        case Seq(part1) => v1SessionCatalog.lookupTempView(part1)
        case Seq(part1, part2) => v1SessionCatalog.lookupGlobalTempView(part1, part2)
        case _ => None
      }

      // After finding it, validate it
      tmpView.foreach { v =>
        if (isStreaming && !v.isStreaming) {
          throw QueryCompilationErrors.readNonStreamingTempViewError(identifier.quoted)
        }
        if (isTimeTravel) {
          val target = if (v.isStreaming) "streams" else "views"
          throw QueryCompilationErrors.timeTravelUnsupportedError(target)
        }
      }
      tmpView
    }

How is the metadata of the corresponding temporary view found in the SessionCatalog?

If SessionCatalog is unfamiliar to you, take a look at this blog first: An article explaining the Spark 3.x Catalog system

  def lookupTempView(table: String): Option[SubqueryAlias] = {
    val formattedTable = formatTableName(table)
    getTempView(formattedTable).map { view =>
      SubqueryAlias(formattedTable, view)
    }
  }

Format the name of the table first

  protected[this] def formatTableName(name: String): String = {
    if (conf.caseSensitiveAnalysis) name else name.toLowerCase(Locale.ROOT)
  }
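This is why identifier lookup is case-insensitive by default. A small sketch (spark.sql.caseSensitive defaults to false):

  // With case-insensitive analysis, T_USER is lower-cased and still
  // resolves to the temporary view registered as t_user
  spark.conf.set("spark.sql.caseSensitive", "false")
  spark.sql("SELECT name FROM T_USER").show()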

Get the corresponding metadata information and convert it into the corresponding logical plan

  def getTempView(name: String): Option[View] = synchronized {
    getRawTempView(name).map(getTempViewPlan)
  }
  def getRawTempView(name: String): Option[TemporaryViewRelation] = synchronized {
    tempViews.get(formatTableName(name))
  }

So the logic for looking up a temporary view is simply to fetch it from a Map cache inside the SessionCatalog. Let's take a look at that cache:

  protected val tempViews = new mutable.HashMap[String, TemporaryViewRelation]

It is now clear that the metadata information of the temporary view we want to find comes from the HashMap cache.

So, the question is: when was the data added?

When was the data added?

Obviously, it comes from the following code in our example:

    df.createTempView("t_user")

This step creates the temporary view we need. Let's see how it is implemented.

  @throws[AnalysisException]
  def createTempView(viewName: String): Unit = withPlan {
    createTempViewCommand(viewName, replace = false, global = false)
  }
private def createTempViewCommand(
      viewName: String,
      replace: Boolean,
      global: Boolean): CreateViewCommand = {
    val viewType = if (global) GlobalTempView else LocalTempView

    val tableIdentifier = try {
      
      // Readers of part II will find this familiar; it is basically the same logic as the parsing stage
      sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName)
    } catch {
      case _: ParseException => throw QueryCompilationErrors.invalidViewNameError(viewName)
    }
    // After parsing the view, build a command to create the view for later execution
    CreateViewCommand(
      name = tableIdentifier,
      userSpecifiedColumns = Nil,
      comment = None,
      properties = Map.empty,
      originalText = None,
      plan = logicalPlan,
      allowExisting = false,
      replace = replace,
      viewType = viewType,
      isAnalyzed = true)
  }

So where does the execution actually happen?

  @inline private def withPlan(logicalPlan: LogicalPlan): DataFrame = {
    Dataset.ofRows(sparkSession, logicalPlan)
  }
  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
    sparkSession.withActive {
      val qe = sparkSession.sessionState.executePlan(logicalPlan)
      qe.assertAnalyzed()
      new Dataset[Row](qe, RowEncoder(qe.analyzed.schema))
  }

What do we see? Isn't this the beginning of this lecture on source code analysis?

We seem to have come full circle.

The difference is that this time the plan being analyzed is the command that creates the view, whereas this article has so far been analyzing our SQL query.

Therefore, the following logic is similar to the above logic. In essence, rules are applied one by one!

So what are the core differences?

Since we are dealing with a command that creates a view, and a command must eventually be executed, the key point lies in the run method of CreateViewCommand:

override def run(sparkSession: SparkSession): Seq[Row] = {
    if (!isAnalyzed) {
      throw QueryCompilationErrors.logicalPlanForViewNotAnalyzedError()
    }
    // This is a plan that has completed the analysis phase
    val analyzedPlan = plan

    // If the user specified columns and their number differs from the analyzed plan's output, throw an exception
    if (userSpecifiedColumns.nonEmpty &&
        userSpecifiedColumns.length != analyzedPlan.output.length) {
      throw QueryCompilationErrors.createViewNumColumnsMismatchUserSpecifiedColumnLengthError(
        analyzedPlan.output.length, userSpecifiedColumns.length)
    }

    // Get SessionCatalog
    val catalog = sparkSession.sessionState.catalog

    // A permanent view is not allowed to reference temporary objects.
    // This should be called after qe.assertAnalyzed(), i.e. when the child nodes are fully resolved
    verifyTemporaryObjectsNotExists(isTemporary, name, analyzedPlan, referredTempFunctions)
    verifyAutoGeneratedAliasesNotExists(analyzedPlan, isTemporary, name)

    // Temporary view, our example is this type
    if (viewType == LocalTempView) {
      val aliasedPlan = aliasPlan(sparkSession, analyzedPlan)
      val tableDefinition = createTemporaryViewRelation(
        name,
        sparkSession,
        replace,
        catalog.getRawTempView,
        originalText,
        analyzedPlan,
        aliasedPlan,
        referredTempFunctions)
      // Call the SessionCatalog method to create a temporary view
      catalog.createTempView(name.table, tableDefinition, overrideIfExists = replace)
      // Global temporary view
    } else if (viewType == GlobalTempView) {
      val db = sparkSession.sessionState.conf.getConf(StaticSQLConf.GLOBAL_TEMP_DATABASE)
      val viewIdent = TableIdentifier(name.table, Option(db))
      val aliasedPlan = aliasPlan(sparkSession, analyzedPlan)
      val tableDefinition = createTemporaryViewRelation(
        viewIdent,
        sparkSession,
        replace,
        catalog.getRawGlobalTempView,
        originalText,
        analyzedPlan,
        aliasedPlan,
        referredTempFunctions)
      catalog.createGlobalTempView(name.table, tableDefinition, overrideIfExists = replace)
      // Permanent view whose name already exists in the catalog
    } else if (catalog.tableExists(name)) {
      val tableMetadata = catalog.getTableMetadata(name)
      if (allowExisting) {
        // Handles SQL like `CREATE VIEW IF NOT EXISTS v0 AS SELECT ...`: do nothing when the target view already exists
      } else if (tableMetadata.tableType != CatalogTableType.VIEW) {
        throw QueryCompilationErrors.tableIsNotViewError(name)
      } else if (replace) {
        // Detect a cyclic view reference on CREATE OR REPLACE VIEW
        val viewIdent = tableMetadata.identifier
        checkCyclicViewReference(analyzedPlan, Seq(viewIdent), viewIdent)

        // When replacing an existing view, you should kill the cache
        logDebug(s"Try to uncache ${viewIdent.quotedString} before replacing.")
        CommandUtils.uncacheTableOrView(sparkSession, viewIdent.quotedString)

        // Handles `CREATE OR REPLACE VIEW v0 AS SELECT ...`
        // Nothing from the old view is needed: drop it and create a new one
        catalog.dropTable(viewIdent, ignoreIfNotExists = false, purge = false)
        catalog.createTable(prepareTable(sparkSession, analyzedPlan), ignoreIfExists = false)
      } else {
        // Handles `CREATE VIEW v0 AS SELECT ...`
        // Throw an exception when the target view already exists
        QueryCompilationErrors.viewAlreadyExistsError(name)
      }
    } else {
      // If it does not exist, create the view
      catalog.createTable(prepareTable(sparkSession, analyzedPlan), ignoreIfExists = false)
    }
    Seq.empty[Row]
  }
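From the Dataset API, the LocalTempView and GlobalTempView branches above correspond to two different calls. A small sketch (the view names are arbitrary, and global_temp is the default value of spark.sql.globalTempDatabase):

  // LocalTempView branch: visible only in the current SparkSession
  df.createTempView("t_user_local")
  spark.sql("SELECT * FROM t_user_local").show()

  // GlobalTempView branch: registered under the global_temp database
  df.createGlobalTempView("t_user_global")
  spark.sql("SELECT * FROM global_temp.t_user_global").show()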

What does the SessionCatalog method for creating a temporary view look like?

  def createTempView(
      name: String,
      viewDefinition: TemporaryViewRelation,
      overrideIfExists: Boolean): Unit = synchronized {
    val table = formatTableName(name)
    if (tempViews.contains(table) && !overrideIfExists) {
      throw new TempTableAlreadyExistsException(name)
    }
    tempViews.put(table, viewDefinition)
  }

The tempViews here is the HashMap we mentioned earlier, so we have found where the data comes from.
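The overrideIfExists flag is also what distinguishes the two Dataset API methods; a hedged sketch:

  // createTempView passes overrideIfExists = false, so registering the same
  // name twice throws a TempTableAlreadyExistsException (an AnalysisException)
  // df.createTempView("t_user")   // a second call with the same name would fail

  // createOrReplaceTempView passes overrideIfExists = true and simply
  // overwrites the entry in the tempViews map
  df.createOrReplaceTempView("t_user")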

Now another question comes. What does the data look like?

Let's set that question aside for a moment and look at the last of the three steps: resolving the view.

Resolve the view

/**
 * The current catalog and namespace may be different from when the view was created, so we must
 * resolve the view's logical plan here with the catalog and namespace stored in the view
 * metadata. This is done by keeping the catalog and namespace in AnalysisContext: the analyzer
 * looks at AnalysisContext.catalogAndNamespace when resolving relations with a single-part name.
 * If AnalysisContext.catalogAndNamespace is non-empty, the analyzer expands single-part names
 * with it instead of the current catalog and namespace.
 */
private def resolveViews(plan: LogicalPlan): LogicalPlan = plan match {
      // The view's child should be a logical plan parsed from `desc.viewText`; the variable `viewText` must be defined, otherwise we throw an error when generating the View operator.
      case view @ View(desc, isTempView, child) if !child.resolved =>
        // Resolve all unresolved relationships and views in child nodes
        val newChild = AnalysisContext.withAnalysisContext(desc) {
          // Depth of nested views
          val nestedViewDepth = AnalysisContext.get.nestedViewDepth
          // Maximum allowable depth
          val maxNestedViewDepth = AnalysisContext.get.maxNestedViewDepth
          if (nestedViewDepth > maxNestedViewDepth) {
            throw QueryCompilationErrors.viewDepthExceedsMaxResolutionDepthError(
              desc.identifier, maxNestedViewDepth, view)
          }
          SQLConf.withExistingConf(View.effectiveSQLConf(desc.viewSQLConfigs, isTempView)) {
            // Resolve the child node
            executeSameContext(child)
          }
        }
        // Fail the analysis eagerly: outside the AnalysisContext, unresolved operators inside the view may be resolved incorrectly.
        checkAnalysis(newChild)
        view.copy(child = newChild)
      // An alias subquery object that encapsulates the table name
      case p @ SubqueryAlias(_, view: View) =>
        // Recursively resolve the nested view and copy the node
        p.copy(child = resolveViews(view))
      case _ => plan
    }

This is essentially a recursive resolution of nested views; there is not much more to say about it.
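
The depth limit seen above comes from an ordinary SQL configuration; here is a small sketch (view names v0..v5 are made up) showing how nesting accumulates:

// The maximum depth is read from spark.sql.view.maxNestedViewDepth (default 100).
println(spark.conf.get("spark.sql.view.maxNestedViewDepth"))

// Stack a few temp views on top of each other; resolving the topmost one walks
// this chain recursively, one AnalysisContext nesting level per view.
spark.sql("CREATE OR REPLACE TEMP VIEW v0 AS SELECT 1 AS id")
(1 to 5).foreach { i =>
  spark.sql(s"CREATE OR REPLACE TEMP VIEW v$i AS SELECT id FROM v${i - 1}")
}
spark.sql("SELECT * FROM v5").show()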

After applying rules

After the above series of steps, once the ResolveRelations rule has been applied, the generated logical plan looks like this:

'Project [*]
+- SubqueryAlias t_user
   +- View (`t_user`, [addr#7,age#8L,name#9,sex#10])
      +- Relation [addr#7,age#8L,name#9,sex#10] json

Now let's come back to the small question raised above: what does the data look like?

What does the data look like?

Before the command that creates the view is executed, we already have a logical plan of type LogicalRelation, so the key question is how that plan gets printed as shown above.

LogicalRelation.simpleString

  override def simpleString(maxFields: Int): String = {
    s"Relation ${catalogTable.map(_.identifier.unquotedString).getOrElse("")}" +
      s"[${truncatedString(output, ",", maxFields)}] $relation"
  }

AttributeReference.toString

  override def toString: String = s"$name#${exprId.id}$typeSuffix$delaySuffix"

Combining these two pieces of code clears up a small puzzle: the #7, #8 and so on above are expression IDs of NamedExpressions, which we have mentioned many times in this series. If you have forgotten, it is worth looking back at the earlier articles.
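
If you want to see these expression IDs yourself, the analyzed plan's output attributes carry them (these are developer/internal APIs, and the concrete numbers will differ from run to run); a quick sketch:

// Print each output attribute of the analyzed plan together with its expression ID.
val analyzed = spark.sql("SELECT * FROM t_user").queryExecution.analyzed
analyzed.output.foreach { attr =>
  // Prints lines such as "addr#7: array<string>"; the number after '#' is attr.exprId.id.
  println(s"${attr.name}#${attr.exprId.id}: ${attr.dataType.simpleString}")
}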

ResolveReferences

The ResolveReferences rule replaces unresolved attributes with concrete AttributeReferences taken from the children of a logical plan node.
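
Its effect is easy to observe by comparing the plan before and after analysis; a small sketch (the exprId numbers will vary, and the printed trees are abbreviated in the comments):

val qe = spark.sql("SELECT name FROM t_user").queryExecution
// Before analysis: 'Project ['name] ... -- name is still an UnresolvedAttribute.
println(qe.logical)
// After analysis:  Project [name#9] ... -- resolved to an AttributeReference.
println(qe.analyzed)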

So, how is it realized?

Still the same, let's look at the apply method first:

apply

def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
      AlwaysProcess.fn, ruleId) {
      // Skip this plan if not all of its children have been resolved yet
      case p: LogicalPlan if !p.childrenResolved => p

      // Wait for rule 'DeduplicateRelations' to resolve conflicting attributes first
      case p: LogicalPlan if hasConflictingAttrs(p) => p

      // If the projection contains an * sign, expand it.
      case p: Project if containsStar(p.projectList) =>
        p.copy(projectList = buildExpandedProjectList(p.projectList, p.child))

Project [*]

The Project node in our example clearly contains a * (star).

It is worth noting that after the ResolveRelations rule, the logical plan we get is a tree in which only the Project node (the root) matches the star-expansion case above. Applying a rule is a recursive, bottom-up process: the child nodes are handled first, then the parent nodes.

    private def buildExpandedProjectList(
      exprs: Seq[NamedExpression],
      child: LogicalPlan): Seq[NamedExpression] = {
      exprs.flatMap {
        // Using the DataFrame/Dataset API: testData2.groupBy($"a", $"b").agg($"*")
        case s: Star => expand(s, child)
        // Using SQL without the ResolveAliases rule having run: SELECT * FROM testData2 GROUP BY a, b
        case UnresolvedAlias(s: Star, _) => expand(s, child)
        // Any other expression that contains a star somewhere inside it
        case o if containsStar(o :: Nil) => expandStarExpression(o, child) :: Nil
        case o => o :: Nil
      }.map(_.asInstanceOf[NamedExpression])
    }

From Project [*] we know that the incoming exprs is just this single *. From the second article we know that * is parsed into an UnresolvedStar, a subclass of Star, so the first branch is clearly taken in the call above.
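
The DataFrame API ends up on the same branch, since a column named "*" is also turned into a star expression; a quick sketch:

import org.apache.spark.sql.functions.col

// Both queries carry an UnresolvedStar into buildExpandedProjectList and take its first branch.
spark.sql("SELECT * FROM t_user").printSchema()
spark.table("t_user").select(col("*")).printSchema()

Now let's look at the expand helper that this first branch calls: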

private def expand(s: Star, plan: LogicalPlan): Seq[NamedExpression] = {
      // This is used to capture the analysis exceptions thrown inside the closure, and some origin information will be attached to the exception
      withPosition(s) {
        try {
          // Call UnresolvedStar.expand
          s.expand(plan, resolver)
        } catch {
          case e: AnalysisException =>
            AnalysisContext.get.outerPlan.map {
              // Only Project and Aggregate can have star expressions
              case u @ (_: Project | _: Aggregate) =>
                Try(s.expand(u.children.head, resolver)) match {
                  case Success(expanded) => expanded.map(wrapOuterReference)
                  case Failure(_) => throw e
                }
              // Do not use external plans to parse star expressions
              // Because the use of star is invalid
              case _ => throw e
            }.getOrElse { throw e }
        }
      }
    }

UnresolvedStar.expand

Let's take a closer look at the expand method of UnresolvedStar:

  override def expand(
      input: LogicalPlan,
      resolver: Resolver): Seq[NamedExpression] = {
    // If no table (qualifier) is specified, use all non-hidden attributes
    if (target.isEmpty) return input.output

    // If a table is specified, the hidden attribute must also be used
    val hiddenOutput = input.metadataOutput.filter(_.supportsQualifiedStar)
    val expandedAttributes = (hiddenOutput ++ input.output).filter(
      matchedQualifier(_, target.get, resolver))

    if (expandedAttributes.nonEmpty) return expandedAttributes

    // Try to resolve it as a struct expansion.
    val attribute = input.resolve(target.get, resolver)
    if (attribute.isDefined) {
      // If the target resolves to an attribute of the child, it must be a struct; expand it
      attribute.get.dataType match {
        case s: StructType => s.zipWithIndex.map {
          case (f, i) =>
            val extract = GetStructField(attribute.get, i)
            Alias(extract, f.name)()
        }

        case _ =>
          throw QueryCompilationErrors.starExpandDataTypeNotSupportedError(target.get)
      }
    } else {
      val from = input.inputSet.map(_.name).mkString(", ")
      val targetString = target.get.mkString(".")
      throw QueryCompilationErrors.cannotResolveStarExpandGivenInputColumnsError(
        targetString, from)
    }
  }
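
As an aside, the struct-expansion branch near the end of this method can be exercised with a qualified star over a struct column; a small sketch (the view name t_struct is made up):

// A struct column registered under a made-up view name.
spark.sql("SELECT named_struct('x', 1, 'y', 2) AS s").createOrReplaceTempView("t_struct")

// `s` resolves to a struct attribute, so `s.*` is expanded field by field
// via GetStructField wrapped in an Alias, yielding the columns x and y.
spark.sql("SELECT s.* FROM t_struct").printSchema()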

Back to our example: since ours is a bare *, not something like a.*, the target is empty and we return at the very first step, i.e. we return input.output. Which raises the question: what is input?

input

From the earlier source code analysis of AnalysisHelper.resolveOperatorsUpWithPruning (the previous section on recursively applying partial functions), we know that:

The input of each rule is the output of its child nodes after applying the rule

Therefore, input is the output of the child node of Project [*] after applying the rule, that is

+- SubqueryAlias t_user
   +- View (`t_user`, [addr#7,age#8L,name#9,sex#10])
      +- Relation [addr#7,age#8L,name#9,sex#10] json

Applying the rule did not change SubqueryAlias or its child nodes.

Therefore, input.output is SubqueryAlias.output.

SubqueryAlias.output

  override def output: Seq[Attribute] = {
    // Build the qualifier list: the identifier's qualifier plus the alias
    val qualifierList = identifier.qualifier :+ alias
    child.output.map(_.withQualifier(qualifierList))
  }

What SubqueryAlias.output does is attach the qualifier (the identifier parts plus the alias) to the child node's output attributes.
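
The practical consequence of that qualifier is that columns can also be referenced through the alias; a quick sketch:

// Both statements resolve to the same AttributeReference; the qualified form works
// because SubqueryAlias.output attached the qualifier [t_user] to the attribute.
spark.sql("SELECT name FROM t_user").show()
spark.sql("SELECT t_user.name FROM t_user").show()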

What does the final parsed logical plan look like?

Project [addr#8, age#9L, name#10, sex#11]
+- SubqueryAlias t_user
   +- View (`t_user`, [addr#8,age#9L,name#10,sex#11])
      +- Relation [addr#8,age#9L,name#10,sex#11] json

As usual, let's draw a table to match the above output with the specific source code:

| Printed text | Corresponding source code | Explanation |
| --- | --- | --- |
| Project | org.apache.spark.sql.catalyst.plans.logical.Project | Projection object |
| [addr#8, age#9L, name#10, sex#11] | SubqueryAlias.output | The child node's attributes with qualifiers attached; the number after # is the expression ID of the NamedExpression |
| SubqueryAlias | org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias | An alias subquery object that wraps the table name |
| t_user | SubqueryAlias.alias | The table alias |
| View | org.apache.spark.sql.catalyst.plans.logical.View | View object |
| (`t_user`, [addr#8,age#9L,name#10,sex#11]) | View.simpleString | The table identifier and attribute sequence; the number after # is the expression ID of the NamedExpression |
| Relation [addr#8,age#9L,name#10,sex#11] json | LogicalRelation.simpleString | The attribute sequence and relation name; the number after # is the expression ID of the NamedExpression |

Why is it different from the output of Analyzed Logical Plan?

In the first article, we already showed the final output of the analysis stage ahead of time:

== Analyzed Logical Plan ==
addr: array<string>, age: bigint, name: string, sex: string
Project [addr#7, age#8L, name#9, sex#10]
+- SubqueryAlias t_user
   +- View (`t_user`, [addr#7,age#8L,name#9,sex#10])
      +- Relation [addr#7,age#8L,name#9,sex#10] json

Why is it different from the logical plan printed above?

We can find the answer in QueryExecution.writePlans:

append("\n== Analyzed Logical Plan ==\n")
    try {
      if (analyzed.output.nonEmpty) {
        append(
          truncatedString(
            analyzed.output.map(o => s"${o.name}: ${o.dataType.simpleString}"), ", ", maxFields)
        )
        append("\n")
      }
      QueryPlan.append(analyzed, append, verbose, addSuffix, maxFields)

You can see that the attribute names and types are printed before the analyzed logical plan itself.
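
This whole section of output can be reproduced from user code with explain(extended = true), which prints all four plan sections:

// Prints "== Parsed Logical Plan ==", "== Analyzed Logical Plan ==" (preceded by the
// "addr: array<string>, age: bigint, ..." schema line), "== Optimized Logical Plan =="
// and "== Physical Plan ==".
spark.sql("SELECT * FROM t_user").explain(extended = true)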

Summary

This article walked through the source code of the analysis stage of the Spark SQL workflow.

We made our goal clear: we need to further analyze the AST generated in the parsing phase.

After setting the goal, we first took stock of what we had at hand: a temporary view t_user, plus the field names and field types inferred from the JSON file.

Then we began the source code analysis of the main flow, starting from the entry point and working through execution and checking.

Along the way we met the concept of rule batches and walked through all of them.

Due to space constraints, it is impossible for us to explain in detail how each rule is implemented.

So we returned to our original example; working through this simple, concrete case lets the other rules be understood by analogy.

We analyzed in detail how the two rules ResolveRelations and ResolveReferences act on our logical plan step by step, completing the analysis stage.

By now, you should have a clear picture of the basic flow of the analysis stage.

If you have read this far, please give the post a like, a favorite, and a share. Thank you very much!

Topics: Big Data Spark SQL