Hive code analysis report: semantic analysis ④

Posted by julieb on Tue, 23 Nov 2021 17:55:20 +0100

2021SC@SDUSC

catalogue

Overview & & Review

Code analysis: method of generating QB

Summary:

Overview & & Review

As for the generation of QB by semantic parsing AST, we have always focused on code reading, and the content is scattered and split. Now, after reviewing the flow chart of HIVE compiler, we can see that semantic parsing is mainly to convert AST Tree into QueryBlock. Why should it be converted into QueryBlock? From the previous analysis, we can see that the AST Tree is still very abstract and does not carry the information related to tables and fields. For semantic analysis, the AST Tree can be stored in QueryBlock modules and carry the corresponding metadata information to prepare for the generation of logical execution plans.

Earlier, we analyzed that before entering the sql compiler, the compiler first judged whether the hive.semantic.analyzer.hook parameters were set up, so that some methods could be used to predict the statements, and then called the compiler module sem.analyze(tree, ctx):

Among them, sem is obtained by basesemanticeanalyzer sem = semanticalanalyzerfactory.get (querystate, tree).

hive has a variety of compilation methods for sql with different functions.

For example:

Explain uses explain semantic analyzer

DDL adopts ddlsemantica analyzer

Load uses loadsemantic analyzer, etc

The factory class setting can isolate these different functions and increase scalability. For example, one day, you need to add a compilation process of import data, develop an importsemantic analyzer class and register it in the semantic analyzer factory.

However, we mostly use query. This time, the source code analysis also focuses on query. Therefore, we enter the default option. At this time, the semantic analyzer.analyzeinternal method is mainly used for compilation.

In the above analysis, we know that the analyzeInternal function will eventually call the genResolvedParseTree function. Last time, we just took a brief look at its structure. This time, we will take a closer look at the logic of generating QB.

Code analysis: method of generating QB

After last overall browsing, we can focus the code logic of generating QB on the following code:

Phase1Ctx ctx_1 = initPhase1Ctx();

    preProcessForInsert(child, qb);

  

    if (!doPhase1(child, qb, ctx_1, plannerCtx)) {

    

      return false;

    }

    LOG.info("Completed phase 1 of Semantic Analysis");

Last time, the main function of this part of the code is to decompose the ASTTree into the corresponding QB. If the phase1Result error returns false, where can we draw this conclusion?

The second line of the code, preprocessor for insert (child, QB); Call the pre-processing before insertion. At this time, the QB bit initializes the blank QB. It is obvious that the fourth line of the code: if doPhase1 is executed successfully, a QB will be obtained.

Next, enter the doPhase1 method:

@SuppressWarnings({"fallthrough", "nls"})

  public boolean doPhase1(ASTNode ast, QB qb, Phase1Ctx ctx_1, PlannerContext plannerCtx)

      throws SemanticException {

            . . . . . . . . . . . . . . . . Slightly........

//token of type select     

case HiveParser.TOK_SELECT:

//Mark qb

qb.countSel();

        qbp.setSelExprForClause(ctx_1.dest, ast);

        . . . . . . . . . . . . . . . . . Slightly......

//where type token

case HiveParser.TOK_WHERE:

 // The reason why ast.getChild(0) is used to process where children is because it is complementary to the previous HiveParser.g structure.       

 qbp.setWhrExprForClause(ctx_1.dest, ast);

        if (!SubQueryUtils.findSubQueries((ASTNode) ast.getChild(0)).isEmpty())

            queryProperties.setFilterWithSubQuery(true);

        break;

      . . . . . . . . . . . . . . . . Slightly........

    // The following code is the same as above, matching different types of token s

case HiveParser.TOK_GROUPBY:

      case HiveParser.TOK_ROLLUP_GROUPBY:

      case HiveParser.TOK_CUBE_GROUPBY:

      case HiveParser.TOK_GROUPING_SETS:

      . . . . . . . . . . . . Slightly........

//Traverse the AST tree. Here, recursion is used to call the doPhase1() function. The purpose is that the recursion ends at the leaves of the tree

  if (!skipRecursion) {

//Call getChildCount() of AST tree to get the number of child nodes

      int child_count = ast.getChildCount();
//Recursively call doPhase1 to traverse AST

      for (int child_pos = 0; child_pos < child_count && phase1Result; ++child_pos) {

        phase1Result = phase1Result && doPhase1(

            (ASTNode)ast.getChild(child_pos), qb, ctx_1, plannerCtx);

      }

    }

    . . . . . . . . . . . . . . . Slightly.........

Due to too much space, the detailed code of information from the tree during recursive traversal is omitted. Here are two functions for analysis

The first is the doPhase1GetColumnAliasesFromSelect() function. The main function of this function in the process of doPhase1() traversing the tree is to obtain the alias of the column:

①private void doPhase1GetColumnAliasesFromSelect(

      ASTNode selectExpr, QBParseInfo qbp, String dest) throws SemanticException {

    if (isInsertInto(qbp, dest)) {

      ASTNode tblAst = qbp.getDestForClause(dest);

      String tableName = getUnescapedName((ASTNode) tblAst.getChild(0));

      Table targetTable = null;

//The condition determines whether the clause is a select statement. After all, the alias is only used in the select statement
     

 try {

        if (isValueClause(selectExpr)) {

          targetTable = db.getTable(tableName, false);

          replaceDefaultKeyword((ASTNode) selectExpr.getChild(0).getChild(0).getChild(1), targetTable, qbp.getDestSchemaForClause(dest));

        } else if (updating(dest)) {

          targetTable = db.getTable(tableName, false);

          replaceDefaultKeywordForUpdate(selectExpr, targetTable);

        }

      } catch (Exception e) {

        if (e instanceof SemanticException) {

          throw (SemanticException) e;

        } else {

          throw (new RuntimeException(e));

        }

      }

}

//Get the child node count of the expression tree, cycle and judge the relevant conditions

 for (int i = 0; i < selectExpr.getChildCount(); ++i) {

      ASTNode selExpr = (ASTNode) selectExpr.getChild(i);

      if ((selExpr.getToken().getType() == HiveParser.TOK_SELEXPR)

          && (selExpr.getChildCount() == 2)) {

//Get the alias and load QB

 String columnAlias = unescapeIdentifier(selExpr.getChild(1).getText());

        qbp.setExprToColumnAlias((ASTNode) selExpr.getChild(0), columnAlias);

      }

    }

  }

② doPhase1GetAllAggregations function: used to obtain aggregate clause information

private void doPhase1GetAllAggregations(ASTNode expressionTree,HashMap<String, ASTNode> aggregations, List<ASTNode> wdwFns, ASTNode wndParent) throws SemanticException { 

//Because we now have scalar subqueries, we can get subquery expressions in "having". In general, we don't want to include aggregations in subqueries

  int exprTokenType = expressionTree.getToken().getType();

    if(exprTokenType == HiveParser.TOK_SUBQUERY_EXPR) {

    .........

      return;

}

Summary:

In this paper, we mainly enter the doPhase1() function, which is the first stage of semantic analysis, because our main goal is to load the ast data into QB. The main idea of doPhase1 in this stage is to recursively traverse the AST and establish some necessary mapping relationships, so as to pass some key information to QB, such as the alias information of tables and sub queries, the name of internal clauses, aggregation operation information, etc, Furthermore, all the above mapping relationships are saved in QB/QBParseInfo.

 

Topics: Hadoop hive Data Warehouse