Source code analysis of the database middleware Sharding-JDBC: SQL parsing (1), lexical analysis

Posted by javauser on Tue, 11 Jun 2019 21:59:50 +0200


1. Overview

A SQL parsing engine is an essential component of database middleware. Starting with release 1.5.0.M1, Sharding-JDBC replaced its Druid-based SQL parsing engine with a self-developed one. The new engine parses only the sharding-relevant context, taking a "half-understanding" approach to SQL, which further improves performance and compatibility while reducing code complexity (don't worry if this is unclear for now; a later article will explain the advantage). MyCAT, another Chinese database middleware, currently uses Druid as its SQL parsing engine and is likewise developing its own.

Some readers may be intimidated at the sight of SQL parsing. Please stay calm and read on patiently. The "SQL parsing" topic is split into five relatively short articles, so you can work through it in a relatively relaxed way:

  1. Lexical analysis

  2. Insert SQL parsing

  3. Query SQL parsing

  4. Update SQL parsing

  5. Delete SQL parsing

The SQL parsing engine, located under the parsing package (shown in the figure above), consists of two components:

  1. Lexer: the lexical parser.

  2. Parser: the SQL (syntax) parser.

Both are parsers; the difference is scope. Lexer performs only lexical analysis: it does not understand context and simply splits the string into N tokens. Parser builds on Lexer's output to understand the SQL. An analogy:

SQL : SELECT * FROM t_user  
Lexer : [SELECT] [ * ] [FROM] [t_user]  
Parser : This is a [SELECT] statement that queries table [t_user] and returns all of its fields [ * ]. 

Not fully clear yet? That's fine. The protagonist of this article is Lexer, and we will understand it bit by bit through the source code. It totals only about 1,400 lines, comments included, so there is actually not much of it.

2. Lexer Lexical Parser

Lexer principle: parse the SQL sequentially, splitting the string into N tokens.

The core code is as follows:

// Lexer.java
public class Lexer {

    /**
     * Input string
     * i.e. the SQL text to parse
     */
    @Getter
    private final String input;
    /**
     * Lexical Markup Dictionary
     */
    private final Dictionary dictionary;
    /**
     * Current parsing offset within the SQL
     */
    private int offset;
    /**
     * Current lexical markers
     */
    @Getter
    private Token currentToken;

    /**
     * Analyse the next lexical marker.
     *
     * @see #currentToken
     * @see #offset
     */
    public final void nextToken() {
        skipIgnoredToken();
        if (isVariableBegin()) { // variable
            currentToken = new Tokenizer(input, dictionary, offset).scanVariable();
        } else if (isNCharBegin()) { // N'...' Unicode string (SQL Server)
            currentToken = new Tokenizer(input, dictionary, ++offset).scanChars();
        } else if (isIdentifierBegin()) { // Keyword + Literals.IDENTIFIER
            currentToken = new Tokenizer(input, dictionary, offset).scanIdentifier();
        } else if (isHexDecimalBegin()) { // Hexadecimal
            currentToken = new Tokenizer(input, dictionary, offset).scanHexDecimal();
        } else if (isNumberBegin()) { // Number (integer + floating point)
            currentToken = new Tokenizer(input, dictionary, offset).scanNumber();
        } else if (isSymbolBegin()) { // Symbol
            currentToken = new Tokenizer(input, dictionary, offset).scanSymbol();
        } else if (isCharsBegin()) { // Strings, such as "abc"
            currentToken = new Tokenizer(input, dictionary, offset).scanChars();
        } else if (isEnd()) { // End
            currentToken = new Token(Assist.END, "", offset);
        } else { // Analysis of errors, unqualified lexical markers
            currentToken = new Token(Assist.ERROR, "", offset);
        }
        offset = currentToken.getEndPosition();
        // System.out.println("| " + currentToken.getLiterals() + " | " + currentToken.getType() + " | " + currentToken.getEndPosition() + " |");
    }
    /**
     * Skip Ignored Lexical Markers
     * 1. Blank space
     * 2. SQL Hint
     * 3. SQL Notes
     */
    private void skipIgnoredToken() {
        // Blank space
        offset = new Tokenizer(input, dictionary, offset).skipWhitespace();
        // SQL Hint
        while (isHintBegin()) {
            offset = new Tokenizer(input, dictionary, offset).skipHint();
            offset = new Tokenizer(input, dictionary, offset).skipWhitespace();
        }
        // SQL annotations
        while (isCommentBegin()) {
            offset = new Tokenizer(input, dictionary, offset).skipComment();
            offset = new Tokenizer(input, dictionary, offset).skipWhitespace();
        }
    }
}
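The skipping logic can be sketched in isolation. Below is a hypothetical, simplified re-implementation (the method names and the comment syntaxes handled here are assumptions for illustration, not Sharding-JDBC's actual Tokenizer): skip whitespace, then any number of `/* ... */` blocks or `-- ...` line comments, each followed by more whitespace.

```java
// Hypothetical simplified sketch of skipIgnoredToken(); not the actual Tokenizer code.
final class SkipIgnoredSketch {

    // Advance past whitespace and return the new offset.
    static int skipWhitespace(final String input, int offset) {
        while (offset < input.length() && Character.isWhitespace(input.charAt(offset))) {
            offset++;
        }
        return offset;
    }

    // Skip whitespace, block comments (/* ... */) and line comments (-- ...), in any order.
    static int skipIgnored(final String input, int offset) {
        offset = skipWhitespace(input, offset);
        while (input.startsWith("/*", offset) || input.startsWith("--", offset)) {
            if (input.startsWith("/*", offset)) {
                int end = input.indexOf("*/", offset + 2);
                offset = end < 0 ? input.length() : end + 2;
            } else {
                int end = input.indexOf('\n', offset);
                offset = end < 0 ? input.length() : end + 1;
            }
            offset = skipWhitespace(input, offset);
        }
        return offset;
    }
}
```

After the loop, offset points at the first character of the next meaningful token, which is exactly the state nextToken() relies on.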

Tokens (lexical tokens) are produced one after another through the #nextToken() method. Let's run it once and see which Tokens the SQL below is split into.

SQL : SELECT i.* FROM t_order o JOIN t_order_item i ON o.order_id=i.order_id WHERE o.user_id=? AND o.order_id=?
| literals | TokenType class | TokenType value | endPosition |
| --- | --- | --- | --- |
| SELECT | DefaultKeyword | SELECT | 6 |
| i | Literals | IDENTIFIER | 8 |
| . | Symbol | DOT | 9 |
| * | Symbol | STAR | 10 |
| FROM | DefaultKeyword | FROM | 15 |
| t_order | Literals | IDENTIFIER | 23 |
| o | Literals | IDENTIFIER | 25 |
| JOIN | DefaultKeyword | JOIN | 30 |
| t_order_item | Literals | IDENTIFIER | 43 |
| i | Literals | IDENTIFIER | 45 |
| ON | DefaultKeyword | ON | 48 |
| o | Literals | IDENTIFIER | 50 |
| . | Symbol | DOT | 51 |
| order_id | Literals | IDENTIFIER | 59 |
| = | Symbol | EQ | 60 |
| i | Literals | IDENTIFIER | 61 |
| . | Symbol | DOT | 62 |
| order_id | Literals | IDENTIFIER | 70 |
| WHERE | DefaultKeyword | WHERE | 76 |
| o | Literals | IDENTIFIER | 78 |
| . | Symbol | DOT | 79 |
| user_id | Literals | IDENTIFIER | 86 |
| = | Symbol | EQ | 87 |
| ? | Symbol | QUESTION | 88 |
| AND | DefaultKeyword | AND | 92 |
| o | Literals | IDENTIFIER | 94 |
| . | Symbol | DOT | 95 |
| order_id | Literals | IDENTIFIER | 103 |
| = | Symbol | EQ | 104 |
| ? | Symbol | QUESTION | 105 |
|  | Assist | END | 105 |

Sharp-eyed readers may already have spotted Tokenizer. Yes, it is Lexer's close companion, responsible for the actual tokenizing.

To summarize the Lexer nextToken() method: skipIgnoredToken() skips Tokens that should be ignored, the isXXXBegin() methods determine the type of the next Token, and a Tokenizer is then delegated to scan it and return the Token. (A possible optimization: instead of constructing a new Tokenizer(...) for every scan, a Lexer could hold a single Tokenizer.)

Because different databases deviate slightly from the SQL standard, each database has its own Lexer subclass.

Each subclass implements its database's specific SQL syntax by overriding methods of the base Lexer.
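The override pattern can be sketched minimally. The class and method names below are illustrative only, not the real Sharding-JDBC hierarchy: a base lexer ships a dialect-neutral default, and a dialect subclass replaces it.

```java
// Illustrative sketch of the dialect-override pattern; NOT the real Sharding-JDBC classes.
abstract class BaseLexerSketch {

    protected final String input;
    protected int offset;

    BaseLexerSketch(final String input) {
        this.input = input;
    }

    char currentChar() {
        return offset < input.length() ? input.charAt(offset) : (char) 0;
    }

    // Dialect-neutral default: no variable syntax at all.
    boolean isVariableBegin() {
        return false;
    }
}

// A MySQL-flavored subclass switches the hook on for @userVar / @@systemVar.
final class MySQLStyleLexerSketch extends BaseLexerSketch {

    MySQLStyleLexerSketch(final String input) {
        super(input);
    }

    @Override
    boolean isVariableBegin() {
        return '@' == currentChar();
    }
}
```

The base class keeps the shared scanning loop; each dialect only overrides the small predicates it disagrees on, which is why adding a new database dialect is cheap.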

3. Token lexical markers

We have seen Token's example above. There are three attributes:

  • TokenType: Lexical Markup Type

  • String literals: Lexical literals

  • int endPosition: The end position of literals in SQL

TokenType lexical marker types are divided into four major categories:

  • DefaultKeyword: Lexical keywords

  • Literals: Lexical literal markers

  • Symbol: Lexical Symbol Marker

  • Assist: Lexical Auxiliary Markers

3.1 DefaultKeyword

Different databases have their own unique lexical keywords, such as MySQL's well-known paging Limit.

Let's take MySQL as an example. When a MySQL Lexer is created, both the DefaultKeyword values and the MySQL-specific MySQLKeyword values are loaded (the Oracle, PostgreSQL, and SQL Server Lexers work the same way). The core code is as follows:

// MySQLLexer.java
public final class MySQLLexer extends Lexer {

    /**
     * Dictionaries
     */
    private static Dictionary dictionary = new Dictionary(MySQLKeyword.values());
    
    public MySQLLexer(final String input) {
        super(input, dictionary);
    }
}

// Dictionary.java
public final class Dictionary {

    /**
     * Lexical Keyword Map
     */
    private final Map<String, Keyword> tokens = new HashMap<>(1024);
    
    public Dictionary(final Keyword... dialectKeywords) {
        fill(dialectKeywords);
    }

    /**
     * Load the default lexical keywords plus the dialect-specific lexical keywords.
     * All databases share the same default keywords; the dialect keywords differ per database.
     *
     * @param dialectKeywords Dialect lexical keywords
     */
    private void fill(final Keyword... dialectKeywords) {
        for (DefaultKeyword each : DefaultKeyword.values()) {
            tokens.put(each.name(), each);
        }
        for (Keyword each : dialectKeywords) {
            tokens.put(each.toString(), each);
        }
    }
}
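The direction of the lookup matters later: when scanIdentifier() finishes a word, it asks the dictionary for a token type and falls back to Literals.IDENTIFIER when the word is not a registered keyword. A hypothetical simplified version of that findTokenType fallback (the keyword set and string return type here are assumptions for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical simplified Dictionary: keyword lookup with an IDENTIFIER fallback.
final class DictionarySketch {

    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("SELECT", "FROM", "WHERE", "GROUP", "ORDER", "LIMIT"));

    // Return "KEYWORD" when literals is a registered keyword, otherwise the fallback type.
    static String findTokenType(final String literals, final String defaultType) {
        return KEYWORDS.contains(literals.toUpperCase()) ? "KEYWORD" : defaultType;
    }
}
```

Keywords are matched case-insensitively by normalizing the word before the map lookup, which is why `select` and `SELECT` tokenize identically.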

Keyword and Literals.IDENTIFIER are parsed by the same code path, so we cover them together in the Literals.IDENTIFIER section.

3.2 Literals lexical literal markers

Literals lexical literal markers are divided into six types:

  • IDENTIFIER: Identifiers

  • VARIABLE: Variables

  • CHARS: String

  • HEX: Hexadecimal

  • INT: Integer

  • FLOAT: Floating Point

3.2.1 Literals.IDENTIFIER identifiers

Identifiers. For example: table names, column names, and so on.

The core code of Literals.IDENTIFIER and Keyword is parsed as follows:

// Lexer.java
private boolean isIdentifierBegin() {
   return isIdentifierBegin(getCurrentChar(0));
}
private boolean isIdentifierBegin(final char ch) {
   return CharType.isAlphabet(ch) || '`' == ch || '_' == ch || '$' == ch;
}

// Tokenizer.java
/**
* Scan identifier.
*
* @return Identifier Marker
*/
public Token scanIdentifier() {
   // Backquoted field, such as `id` in SELECT `id` FROM t_user
   if ('`' == charAt(offset)) {
       int length = getLengthUntilTerminatedChar('`');
       return new Token(Literals.IDENTIFIER, input.substring(offset, offset + length), offset + length);
   }
   int length = 0;
   while (isIdentifierChar(charAt(offset + length))) {
       length++;
   }
   String literals = input.substring(offset, offset + length);
   // Handle ORDER / GROUP used as a table name
   if (isAmbiguousIdentifier(literals)) {
       return new Token(processAmbiguousIdentifier(offset + length, literals), literals, offset + length);
   }
   // Look literals up in the keyword dictionary: return the Keyword type if found, otherwise Literals.IDENTIFIER
   return new Token(dictionary.findTokenType(literals, Literals.IDENTIFIER), literals, offset + length);
}
/**
* Compute the length up to the terminating character.
*
* @see #hasEscapeChar(char, int) handles cases such as SELECT a AS `b``c` FROM table, where a doubled `` is an escaped backquote rather than the terminator; without this check the first ` of the pair would be misjudged as the end.
* @param terminatedChar terminating character
* @return length
*/
private int getLengthUntilTerminatedChar(final char terminatedChar) {
   int length = 1;
   while (terminatedChar != charAt(offset + length) || hasEscapeChar(terminatedChar, offset + length)) {
       if (offset + length >= input.length()) {
           throw new UnterminatedCharException(terminatedChar);
       }
       if (hasEscapeChar(terminatedChar, offset + length)) {
           length++;
       }
       length++;
   }
   return length + 1;
}
/**
* Is the character at offset the first of a doubled (escaped) terminator character?
*
* @param charIdentifier character
* @param offset position
* @return whether it is
*/
private boolean hasEscapeChar(final char charIdentifier, final int offset) {
   return charIdentifier == charAt(offset) && charIdentifier == charAt(offset + 1);
}
private boolean isIdentifierChar(final char ch) {
   return CharType.isAlphabet(ch) || CharType.isDigital(ch) || '_' == ch || '$' == ch || '#' == ch;
}
/**
* Is it an ambiguous identifier?
* For example, in "SELECT * FROM group", "group" is a table name, not the GROUP keyword.
*
* @param literals identifier
* @return whether it is
*/
private boolean isAmbiguousIdentifier(final String literals) {
   return DefaultKeyword.ORDER.name().equalsIgnoreCase(literals) || DefaultKeyword.GROUP.name().equalsIgnoreCase(literals);
}
/**
* Get the lexical tag type corresponding to the identifier causing ambiguity
*
* @param offset position
* @param literals identifier
* @return Lexical Marker Types
*/
private TokenType processAmbiguousIdentifier(final int offset, final String literals) {
   int i = 0;
   while (CharType.isWhitespace(charAt(offset + i))) {
       i++;
   }
   if (DefaultKeyword.BY.name().equalsIgnoreCase(String.valueOf(new char[] {charAt(offset + i), charAt(offset + i + 1)}))) {
       return dictionary.findTokenType(literals);
   }
   return Literals.IDENTIFIER;
}
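The ORDER/GROUP lookahead can be demonstrated on its own. The helper below is hypothetical (a simplification of processAmbiguousIdentifier, with assumed names): it peeks past whitespace after the word and treats it as a keyword only when "BY" follows.

```java
// Hypothetical sketch of the ORDER / GROUP disambiguation lookahead.
final class AmbiguousIdentifierSketch {

    // true when the characters after endOffset (skipping whitespace) spell "BY"
    // case-insensitively, i.e. the preceding ORDER / GROUP was used as a keyword,
    // not as a table name.
    static boolean isKeywordUse(final String sql, final int endOffset) {
        int i = endOffset;
        while (i < sql.length() && Character.isWhitespace(sql.charAt(i))) {
            i++;
        }
        return sql.regionMatches(true, i, "BY", 0, 2);
    }
}
```

So `SELECT * FROM group` keeps `group` as an identifier, while `GROUP BY user_id` resolves `GROUP` to its keyword type.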

3.2.2 Literals.VARIABLE variable

Variables. For example: SELECT @@VERSION.

The core code is parsed as follows:

// Lexer.java
/**
* Is it the start of a variable?
* Variables are supported by MySQL and SQL Server.
*
* @see Tokenizer#scanVariable()
* @return whether it is
*/
protected boolean isVariableBegin() {
   return false; // base default; dialect Lexers such as the MySQL Lexer override this
}

// Tokenizer.java
/**
* Scan a variable.
* In MySQL, @ denotes a user variable and @@ a system variable.
* In SQL Server, variables are also prefixed with @.
*
* @return variable token
*/
public Token scanVariable() {
   int length = 1;
   if ('@' == charAt(offset + 1)) {
       length++;
   }
   while (isVariableChar(charAt(offset + length))) {
       length++;
   }
   return new Token(Literals.VARIABLE, input.substring(offset, offset + length), offset + length);
}
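A hypothetical standalone version of this scan (the allowed variable characters here, letters, digits and underscore, are an assumption for illustration):

```java
// Hypothetical simplified scanVariable(): '@name' or '@@name' followed by identifier characters.
final class VariableScanSketch {

    static String scanVariable(final String input, final int offset) {
        int length = 1; // the leading '@'
        if (offset + 1 < input.length() && '@' == input.charAt(offset + 1)) {
            length++; // '@@' system variable
        }
        while (offset + length < input.length() && isVariableChar(input.charAt(offset + length))) {
            length++;
        }
        return input.substring(offset, offset + length);
    }

    private static boolean isVariableChar(final char ch) {
        return Character.isLetterOrDigit(ch) || '_' == ch;
    }
}
```

Note how the length counter mirrors the real Tokenizer: the token is always a single substring from offset to offset + length.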

3.2.3 Literals.CHARS String

Character string. For example, SELECT "123".

The core code is parsed as follows:

// Lexer.java
/**
* Is it the start of an N'...' string?
* Currently SQL Server specific: Unicode string constants in SQL Server must carry the N prefix.
*
* @see Tokenizer#scanChars()
* @return whether it is
*/
private boolean isNCharBegin() {
   return isSupportNChars() && 'N' == getCurrentChar(0) && '\'' == getCurrentChar(1);
}
private boolean isCharsBegin() {
   return '\'' == getCurrentChar(0) || '\"' == getCurrentChar(0);
}

// Tokenizer.java
/**
* Scan a string.
*
* @return string token
*/
public Token scanChars() {
   return scanChars(charAt(offset));
}
private Token scanChars(final char terminatedChar) {
   int length = getLengthUntilTerminatedChar(terminatedChar);
   return new Token(Literals.CHARS, input.substring(offset + 1, offset + length - 1), offset + length);
}
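Note how the substring drops the surrounding quotes. A hypothetical self-contained version that also applies the doubled-terminator escape rule described for getLengthUntilTerminatedChar (the unescaping-into-a-builder step is an addition for illustration, not what the quoted code does):

```java
// Hypothetical simplified scanChars(): return the string contents, unquoting and
// collapsing a doubled quote ('' or "") into one literal quote character.
final class CharsScanSketch {

    static String scanChars(final String input, final int offset) {
        final char quote = input.charAt(offset);
        final StringBuilder out = new StringBuilder();
        int i = offset + 1;
        while (i < input.length()) {
            if (quote == input.charAt(i)) {
                if (i + 1 < input.length() && quote == input.charAt(i + 1)) {
                    out.append(quote); // escaped (doubled) quote
                    i += 2;
                    continue;
                }
                break; // terminator reached
            }
            out.append(input.charAt(i++));
        }
        return out.toString();
    }
}
```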

3.2.4 Literals.HEX hexadecimal

Hexadecimal numbers. For example: SELECT 0x1A.

The core code is parsed as follows:

// Lexer.java
/**
* Is it hexadecimal?
*
* @see Tokenizer#scanHexDecimal()
* @return whether it is
*/
private boolean isHexDecimalBegin() {
   return '0' == getCurrentChar(0) && 'x' == getCurrentChar(1);
}

// Tokenizer.java
/**
* Scan hexadecimal numbers.
*
* @return Hexadecimal Number Marker
*/
public Token scanHexDecimal() {
   int length = HEX_BEGIN_SYMBOL_LENGTH;
   // negative
   if ('-' == charAt(offset + length)) {
       length++;
   }
   while (isHex(charAt(offset + length))) {
       length++;
   }
   return new Token(Literals.HEX, input.substring(offset, offset + length), offset + length);
}
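A hypothetical standalone version (HEX_BEGIN_SYMBOL_LENGTH is the length of the "0x" prefix, i.e. 2; the negative-sign branch is omitted for brevity):

```java
// Hypothetical simplified scanHexDecimal(): "0x" prefix followed by hex digits.
final class HexScanSketch {

    private static final int HEX_BEGIN_SYMBOL_LENGTH = 2; // "0x"

    static String scanHexDecimal(final String input, final int offset) {
        int length = HEX_BEGIN_SYMBOL_LENGTH;
        while (offset + length < input.length() && isHexChar(input.charAt(offset + length))) {
            length++;
        }
        return input.substring(offset, offset + length);
    }

    private static boolean isHexChar(final char ch) {
        return (ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F');
    }
}
```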

3.2.5 Literals.INT Integers

Integer. For example: SELECT * FROM t_user WHERE id = 1.

Literals.INT and Literals.FLOAT are parsed by the same code, so we cover them together under Literals.FLOAT.

3.2.6 Literals.FLOAT Floating Points

Floating-point numbers. For example: SELECT * FROM t_user WHERE id = 1.0.
Floating-point literals come in several forms: "1.0", "1.0F", and "7.823E5" (scientific notation).

The core code is parsed as follows:

// Lexer.java
/**
* Is it the start of a number?
* '-' needs special treatment: ".2" is treated as a decimal with the leading 0 omitted, but "-.2" must not be treated as a negative decimal, otherwise expressions break.
* For example, "SELECT a-.2" must tokenize as "SELECT" / "a" / "-" / ".2".
*
* @return whether it is
*/
private boolean isNumberBegin() {
   return CharType.isDigital(getCurrentChar(0)) // number
           || ('.' == getCurrentChar(0) && CharType.isDigital(getCurrentChar(1)) && !isIdentifierBegin(getCurrentChar(-1)) // Floating Point
           || ('-' == getCurrentChar(0) && ('.' == getCurrentChar(0) || CharType.isDigital(getCurrentChar(1))))); // negative
}

// Tokenizer.java
/**
* Scan a number.
* Parsing a number yields one of two results: an integer or a floating-point number.
*
* @return number token
*/
public Token scanNumber() {
   int length = 0;
   // negative
   if ('-' == charAt(offset + length)) {
       length++;
   }
   // Floating Point
   length += getDigitalLength(offset + length);
   boolean isFloat = false;
   if ('.' == charAt(offset + length)) {
       isFloat = true;
       length++;
       length += getDigitalLength(offset + length);
   }
    // Scientific notation, such as SELECT 7.823E5
   if (isScientificNotation(offset + length)) {
       isFloat = true;
       length++;
       if ('+' == charAt(offset + length) || '-' == charAt(offset + length)) {
           length++;
       }
       length += getDigitalLength(offset + length);
   }
    // Float type suffix, such as SELECT 1.333F
   if (isBinaryNumber(offset + length)) {
       isFloat = true;
       length++;
   }
   return new Token(isFloat ? Literals.FLOAT : Literals.INT, input.substring(offset, offset + length), offset + length);
}

Pay special attention to "-" here: next to numbers it can be either a negative sign or the subtraction operator (leaving scientific notation aside):

  • ".2" parses as the single number ".2"

  • "-.2" must not parse as the number "-.2", but as the symbol "-" followed by the number ".2"
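These two rules can be checked with a hypothetical predicate (a simplification of isNumberBegin; the exact character classes are assumptions): a '.' starts a number only when a digit follows and no identifier character precedes it, and a '-' starts a number only when a digit follows directly.

```java
// Hypothetical simplified isNumberBegin(): decides whether the character at
// offset starts a numeric literal rather than a symbol.
final class NumberBeginSketch {

    static boolean isNumberBegin(final String sql, final int offset) {
        final char c = sql.charAt(offset);
        if (Character.isDigit(c)) {
            return true;
        }
        if ('.' == c) { // ".2" style float, unless the '.' follows an identifier (e.g. "a.2")
            return offset + 1 < sql.length() && Character.isDigit(sql.charAt(offset + 1))
                    && (offset == 0 || !Character.isLetter(sql.charAt(offset - 1)));
        }
        if ('-' == c) { // negative number only when a digit follows directly, so "-.2" stays split
            return offset + 1 < sql.length() && Character.isDigit(sql.charAt(offset + 1));
        }
        return false;
    }
}
```

Running it over "SELECT a-.2": the '-' at index 8 is rejected (it is followed by '.', so it becomes a Symbol), while the '.' at index 9 is accepted and ".2" scans as one number.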

3.3 Symbol lexical symbols

Lexical symbols. For example: "{", "}", ">=", and so on.

The core code is parsed as follows:

// Lexer.java
/**
* Is it a symbol?
*
* @see Tokenizer#scanSymbol()
* @return whether it is
*/
private boolean isSymbolBegin() {
   return CharType.isSymbol(getCurrentChar(0));
}

// CharType.java
/**
* Judge whether it is a symbol or not.
*
* @param ch Characters to be judged
* @return Is it a symbol?
*/
public static boolean isSymbol(final char ch) {
   return '(' == ch || ')' == ch || '[' == ch || ']' == ch || '{' == ch || '}' == ch || '+' == ch || '-' == ch || '*' == ch || '/' == ch || '%' == ch || '^' == ch || '=' == ch
           || '>' == ch || '<' == ch || '~' == ch || '!' == ch || '?' == ch || '&' == ch || '|' == ch || '.' == ch || ':' == ch || '#' == ch || ',' == ch || ';' == ch;
}

// Tokenizer.java
/**
* Scan a symbol.
*
* @return symbol token
*/
public Token scanSymbol() {
   int length = 0;
   while (CharType.isSymbol(charAt(offset + length))) {
       length++;
   }
   String literals = input.substring(offset, offset + length);
    // Greedy scan, then shrink from the right until the literals match a known symbol. For example, literals ";;" is split into two ";" tokens; matching forward one character at a time instead, "<=" would wrongly resolve to "<" + "=".
   Symbol symbol;
   while (null == (symbol = Symbol.literalsOf(literals))) {
       literals = input.substring(offset, offset + --length);
   }
   return new Token(symbol, literals, offset + length);
}
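The shrink-from-the-right idea can be shown with a hypothetical symbol table (the real Symbol enum is much larger; the set below is an assumption for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical simplified scanSymbol(): greedy scan over symbol characters, then
// shrink from the right until the remaining literals match a known symbol.
final class SymbolScanSketch {

    private static final Set<String> SYMBOLS = new HashSet<>(Arrays.asList(
            "<=", ">=", "<>", "!=", "=", "<", ">", ".", ",", ";", "?", "(", ")"));

    static String scanSymbol(final String input, final int offset) {
        int length = 0;
        while (offset + length < input.length() && isSymbolChar(input.charAt(offset + length))) {
            length++;
        }
        String literals = input.substring(offset, offset + length);
        while (!SYMBOLS.contains(literals)) { // e.g. ";;" shrinks to ";"
            literals = input.substring(offset, offset + --length);
        }
        return literals;
    }

    private static boolean isSymbolChar(final char ch) {
        return "<>=!.,;?()".indexOf(ch) >= 0;
    }
}
```

Shrinking from the longest candidate gives longest-match semantics: "<=" is kept whole, while a run like ";;" correctly yields one ";" per call.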

3.4 Assist Lexical Auxiliary Markers

Assist lexical auxiliary markers are divided into two types:

  • END: end of parsing

  • ERROR: parsing error

4. Afterword

Simpler than you imagined, isn't it? Parser-related articles are coming next!
