Hive code analysis report: semantic analysis ⑤

Posted by clewis4343 on Wed, 01 Dec 2021 21:52:13 +0100

2021SC@SDUSC

catalogue

summary

Supplementary description doPhase1()

getMetaData(QB, ReadEntity) analysis

summary

In the last article, I analyzed the doPhase1() function, which is the initial stage of semantic analysis. Its goal is to load the AST data into QB. The main idea of doPhase1 is to traverse the AST recursively and establish the necessary mapping relationships, passing key information to QB, such as the aliases of tables and subqueries, the names of internal clauses, and aggregation information. All of these mappings are saved in QB/QBParseInfo.
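As a rough illustration of the mapping idea, consider the following sketch. It is not Hive's actual QBParseInfo (the class name, the String-valued maps, and the "insclause-0" destination label here are simplified stand-ins), but the setter names mirror those called in the doPhase1 excerpt below: each clause is registered under a destination name so later phases can retrieve the corresponding AST subtree.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sketch of the mapping idea behind QBParseInfo:
// each clause of the query is registered under a destination name,
// so later compilation phases can look up the subtree for that clause.
public class ParseInfoSketch {
  private final Map<String, String> destToSelExpr = new HashMap<>();
  private final Map<String, String> destToWhereExpr = new HashMap<>();

  void setSelExprForClause(String dest, String astLabel) {
    destToSelExpr.put(dest, astLabel); // record the SELECT subtree for this clause
  }

  void setWhrExprForClause(String dest, String astLabel) {
    destToWhereExpr.put(dest, astLabel); // record the WHERE subtree for this clause
  }

  String getSelForClause(String dest) { return destToSelExpr.get(dest); }
  String getWhrForClause(String dest) { return destToWhereExpr.get(dest); }

  public static void main(String[] args) {
    ParseInfoSketch qbp = new ParseInfoSketch();
    qbp.setSelExprForClause("insclause-0", "TOK_SELECT subtree");
    qbp.setWhrExprForClause("insclause-0", "TOK_WHERE subtree");
    System.out.println(qbp.getSelForClause("insclause-0")); // prints "TOK_SELECT subtree"
  }
}
```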

Supplementary description doPhase1()

Let's take a look at the general structure and workflow of the code here

public boolean doPhase1(ASTNode ast, QB qb, Phase1Ctx ctx_1, PlannerContext plannerCtx)

      throws SemanticException {

      // ... omitted ...

        case HiveParser.TOK_SELECT: // a SELECT token

        qb.countSel(); // count this SELECT clause in qb

        qbp.setSelExprForClause(ctx_1.dest, ast);

        // ... omitted ...

       case HiveParser.TOK_WHERE: // a WHERE token

        // Why is ast.getChild(0) used to handle the children of WHERE? This
        // matches the grammar in HiveParser.g: the WHERE clause's predicate
        // expression is the first child of TOK_WHERE.

        qbp.setWhrExprForClause(ctx_1.dest, ast);

        if (!SubQueryUtils.findSubQueries((ASTNode) ast.getChild(0)).isEmpty())

            queryProperties.setFilterWithSubQuery(true);

        break;

      // ... omitted ...

      case HiveParser.TOK_GROUPBY:

      case HiveParser.TOK_ROLLUP_GROUPBY:

      case HiveParser.TOK_CUBE_GROUPBY:

      case HiveParser.TOK_GROUPING_SETS:

      // ... omitted ...

          if (!skipRecursion) {

      // Iterate over the rest of the children

      int child_count = ast.getChildCount();

      for (int child_pos = 0; child_pos < child_count && phase1Result; ++child_pos) {

        // Recurse

        phase1Result = phase1Result && doPhase1(

            (ASTNode)ast.getChild(child_pos), qb, ctx_1, plannerCtx);

      }

    }

    // ... omitted ...

General process:

doPhase1 switches on the token type of each node in the AST and fills in QB data differently for each case. The for loop then walks over the node's children and calls doPhase1 recursively on each one, so the method performs a depth-first search over the whole tree.

After this round of depth-first traversal, a QB tree without metadata has been generated.
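The traversal pattern above can be sketched in isolation. The following is a minimal, self-contained imitation, not Hive code: the Node class, the QueryInfo counters, and the TOK_* constants are hypothetical stand-ins for Hive's ASTNode, QB, and HiveParser token types. It shows the same shape as doPhase1: switch on the current node's token type, record what was learned, then recurse depth-first over the children while the result stays true.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of doPhase1's traversal pattern (hypothetical types,
// not Hive's actual classes): dispatch on token type, then depth-first
// recursion over the children.
public class Phase1Sketch {
  static final int TOK_OTHER = 0, TOK_SELECT = 1, TOK_WHERE = 2, TOK_GROUPBY = 3;

  static class Node {
    final int type;
    final List<Node> children = new ArrayList<>();
    Node(int type) { this.type = type; }
    Node add(Node c) { children.add(c); return this; }
  }

  static class QueryInfo { // stands in for QB/QBParseInfo
    int selectCount, whereCount, groupByCount;
  }

  static boolean doPhase1(Node ast, QueryInfo qb) {
    switch (ast.type) { // fill in information per token type
      case TOK_SELECT:  qb.selectCount++;  break;
      case TOK_WHERE:   qb.whereCount++;   break;
      case TOK_GROUPBY: qb.groupByCount++; break;
      default: break;
    }
    boolean result = true;
    // depth-first: recurse into every child while the result stays true
    for (int i = 0; i < ast.children.size() && result; i++) {
      result = result && doPhase1(ast.children.get(i), qb);
    }
    return result;
  }

  public static void main(String[] args) {
    Node root = new Node(TOK_OTHER)
        .add(new Node(TOK_SELECT))
        .add(new Node(TOK_WHERE).add(new Node(TOK_OTHER)))
        .add(new Node(TOK_GROUPBY));
    QueryInfo qb = new QueryInfo();
    doPhase1(root, qb);
    System.out.println(qb.selectCount + " " + qb.whereCount + " " + qb.groupByCount);
    // prints "1 1 1"
  }
}
```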

getMetaData(QB, ReadEntity) analysis

After doPhase1 has executed, we have the QB and QBParseInfo, but they contain only keywords and table names; these cannot yet be mapped to file paths on HDFS. A mapping to metadata (metaData) is therefore needed, and SemanticAnalyzer invokes the getMetaData() function.

private void getMetaData(QB qb, ReadEntity parentInput)

      throws HiveException {

    LOG.info("Get metadata for source tables");

According to the log message, the goal of this function is to obtain metadata for the source tables.

List<String> tabAliases = new ArrayList<String>(qb.getTabAliases());

The list above takes a copy of the aliases, because the set of aliases may be modified during processing.

 Map<String, ObjectPair<String, ReadEntity>> aliasToViewInfo =

        new HashMap<String, ObjectPair<String, ReadEntity>>();

The purpose of the map above is to track view aliases, view names, and their read entities.

For example, for a query like 'select * from V3', where V3 -> V2, V2 -> V1, V1 -> t,

the map is used to track the input dependencies and their parents.
   

 Map<String, String> sqAliasToCTEName = new HashMap<String, String>();

    for (String alias : tabAliases) {

      String tabName = qb.getTabNameForAlias(alias);

      String cteName = tabName.toLowerCase();

      // Get table details from the tabNameToTabObject cache
      Table tab = getTableObjectByName(tabName, false);

      if (tab != null) {

        // do a deep copy, in case downstream changes it.

        tab = new Table(tab.getTTable().deepCopy());

      }

      if (tab == null ||

          tab.getDbName().equals(SessionState.get().getCurrentDatabase())) {

        Table materializedTab = ctx.getMaterializedTable(cteName);

        if (materializedTab == null) {

          // we first look for this alias from CTE, and then from catalog.

          CTEClause cte = findCTEFromName(qb, cteName);

          if (cte != null) {

            if (!cte.materialize) {

              addCTEAsSubQuery(qb, cteName, alias);

              sqAliasToCTEName.put(alias, cteName);

              continue;

            }

            tab = materializeCTE(cteName, cte);

          }

        } else {

          tab = materializedTab;

        }

      }

      // ... omitted ...
      // (in the full source, the statement below sits inside a catch block
      //  wrapping the lookup logic above, where e is the caught HiveException)
      throw new SemanticException(e.getMessage(), e);

    }

  }
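The lookup order that the excerpt above implements for each alias can be condensed into a sketch. The code below is a simplified, hypothetical imitation, not Hive's implementation (the real code also checks the database name, rewrites the QB, and builds ReadEntity objects): first try the materialized-table cache, then the CTE definitions (inlining the CTE as a subquery unless it is marked for materialization), and finally fall back to the table found in the catalog.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sketch of the table-resolution order in the
// excerpt above: materialized-table cache -> CTE definition -> catalog.
public class TableResolutionSketch {
  static String resolve(String tabName,
                        Map<String, String> materialized,  // cteName -> temp table
                        Map<String, Boolean> cteDefs) {    // cteName -> materialize?
    String cteName = tabName.toLowerCase();
    String mat = materialized.get(cteName);
    if (mat != null) {
      return "materialized:" + mat;           // reuse the already-materialized CTE
    }
    Boolean materialize = cteDefs.get(cteName);
    if (materialize != null) {
      return materialize
          ? "materialize-cte:" + cteName      // corresponds to materializeCTE(...)
          : "inline-as-subquery:" + cteName;  // corresponds to addCTEAsSubQuery(...)
    }
    return "catalog:" + tabName;              // a plain table from the metastore
  }

  public static void main(String[] args) {
    Map<String, String> materialized = new HashMap<>();
    Map<String, Boolean> cteDefs = new HashMap<>();
    cteDefs.put("cte1", false);
    materialized.put("cte2", "tmp_cte2");
    System.out.println(resolve("cte1", materialized, cteDefs)); // inline-as-subquery:cte1
    System.out.println(resolve("CTE2", materialized, cteDefs)); // materialized:tmp_cte2
    System.out.println(resolve("t", materialized, cteDefs));    // catalog:t
  }
}
```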

The term CTE appears many times in this function, and many functions in the program involve CTEs. What is a CTE? A CTE (common table expression) is a construct commonly provided by database systems.

A common table expression (CTE) can be thought of as a temporary result set defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. A CTE is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query. Unlike a derived table, however, a CTE can be self-referencing and can be referenced multiple times in the same query.

A CTE consists of an expression name representing the CTE, the AS keyword, and a SELECT statement. Once defined, a CTE can be referenced like a table or view in a SELECT, INSERT, UPDATE, or DELETE statement. A CTE can also be used in a CREATE VIEW statement as part of its defining SELECT statement.
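As an illustration of that syntax, here is a hypothetical HiveQL sketch (the table and column names are made up for the example): the CTE q1 is defined with WITH ... AS (SELECT ...) and then referenced like a table in the outer query.

```sql
-- define a CTE named q1, then reference it like an ordinary table;
-- src, key, and value are illustrative names
WITH q1 AS (
  SELECT key, value
  FROM src
  WHERE key = '5'
)
SELECT *
FROM q1;
```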

Summary:

getMetaData(): obtains the metadata (mainly schema information) of the source tables and destination tables.
The metadata obtained is also stored in QB/QBParseInfo.
① Get the metadata of each source table; if a "table" is actually a view, rewrite it using the view's definition;
② Recursively obtain the metadata of the source tables in each subquery;
③ Get the metadata of every destination table / HDFS directory / local directory.

At this point, relatively complete information has been loaded into QB.

Topics: Hadoop hive Data Warehouse