Go compilation principle series 3 (lexical analysis)

Posted by andycole on Wed, 05 Jan 2022 21:24:16 +0100

Preface

The previous article introduced the core technique behind lexical analysis, the deterministic finite automaton (DFA), together with the usage and working principles of two common lexical analyzers. With that background, reading Go's lexical analysis source code becomes much easier.

This article mainly covers the following:

  1. Where Go's compilation entry file is and what it does
  2. Where lexical analysis sits in the Go compilation process and what its detailed flow looks like
  3. Writing a test Go source file, running lexical analysis on it, and examining the results

Source code analysis

Compilation entry of Go

To better understand how Go's compilation process reaches lexical analysis, let's first look at where Go's compilation entry file is and what it does.

Go's compilation entry file is located at: src/cmd/compile/main.go -> gc.Main(archInit)

gc.Main(archInit) is a long function. The first part of it reads the parameters passed in on the command line and updates the compilation options and configuration accordingly. Then you will see the line below:

lines := parseFiles(flag.Args())

This is the line where lexical analysis and syntax analysis happen: each source file is parsed into a syntax tree here, which the later stages of compilation then work on.

Open the parseFiles(flag.Args()) function and you can see the following (I have omitted the later part of the code and focus on the content related to lexical analysis):

func parseFiles(filenames []string) uint {
	noders := make([]*noder, 0, len(filenames))
	// Limit the number of simultaneously open files.
	sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)

	for _, filename := range filenames {
		p := &noder{
			basemap: make(map[*syntax.PosBase]*src.PosBase),
			err:     make(chan syntax.Error),
		}
		noders = append(noders, p)

		go func(filename string) {
			sem <- struct{}{}
			defer func() { <-sem }()
			defer close(p.err)
			base := syntax.NewFileBase(filename)

			f, err := os.Open(filename)
			if err != nil {
				p.error(syntax.Error{Msg: err.Error()})
				return
			}
			defer f.Close()

			p.file, _ = syntax.Parse(base, f, p.error, p.pragma, syntax.CheckBranches) // errors are tracked via p.error
		}(filename)
	}
	......
}

We know that during Go compilation, each source file is eventually parsed into a syntax tree. As the first few lines of the code above show, parseFiles first creates multiple goroutines to compile the source files, but it limits how many files can be open at the same time:

sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)
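Sending into the buffered channel blocks once GOMAXPROCS(0)+10 goroutines already hold a slot, so the channel works as a counting semaphore that caps how many files are open and being parsed at the same time. Below is a minimal standalone sketch of the same pattern (the file names, the printed message and the sync.WaitGroup are illustrative; the real parseFiles instead synchronizes later by collecting each noder's results):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	filenames := []string{"a.go", "b.go", "c.go"} // made-up file names

	// Buffered channel used as a counting semaphore, mirroring parseFiles.
	sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)

	var wg sync.WaitGroup
	for _, name := range filenames {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}            // acquire a slot (blocks when all slots are taken)
			defer func() { <-sem }()     // release the slot when finished
			fmt.Println("parsing", name) // stand-in for lexing/parsing one file
		}(name)
	}
	wg.Wait()
}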

It then traverses the source files and performs lexical and syntax analysis on them concurrently, which is what the for loop and the go func do. Inside the go func, the source file's base information is initialized first, mainly recording the file name and line/column information, so that if an error is encountered during lexical or syntax analysis, its location can be reported. This mainly involves the following structures:

type PosBase struct {
	pos       Pos
	filename  string
	line, col uint32
}

type Pos struct {
	base      *PosBase
	line, col uint32
}

Next, the source file is opened and the parser is initialized. The parser is initialized here because, in the Go compiler, lexical analysis and syntax analysis are bundled together: initializing the parser also initializes the lexical analyzer. Inside the go func we can step into the syntax.Parse function:

func Parse(base *PosBase, src io.Reader, errh ErrorHandler, pragh PragmaHandler, mode Mode) (_ *File, first error) {
	defer func() {
		if p := recover(); p != nil {
			if err, ok := p.(Error); ok {
				first = err
				return
			}
			panic(p)
		}
	}()

	var p parser
	p.init(base, src, errh, pragh, mode) // initialization
	p.next()                             // the lexical analyzer scans the source and fetches the first token
	return p.fileOrNil(), p.first        // the syntax analyzer pulls the remaining tokens and builds the syntax tree
}

You can see that the initialization operation of syntax analysis is called:

p.init(base, src, errh, pragh, mode)

Go to p.init and you will see a line of code that initializes the lexical analyzer

p.scanner.init(...Here are the parameters to initialize the lexical analyzer)

As you can see, the parser calls the lexical analyzer's init method through p.scanner. Looking at the parser's structure, the lexical analyzer's structure (scanner) is embedded in it. (This article focuses on the lexical analyzer, so the meaning of the parser's other fields is not covered here; they will be introduced in detail in the article on syntax analysis.)

// Parser structure
type parser struct {
	file  *PosBase 
	errh  ErrorHandler
	mode  Mode
	pragh PragmaHandler
	scanner //Embedded lexical analyzer

	base   *PosBase 
	first  error 
	errcnt int  
	pragma Pragma  

	fnest  int
	xnest  int 
	indent []byte
}
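
Because scanner is embedded in parser as an anonymous field, the scanner's fields and methods are promoted onto the parser, which is why the parser can write p.scanner.init(...) and later simply use p.next(), p.tok and so on. A tiny sketch of this embedding mechanism, using made-up types, looks like this:

package main

import "fmt"

// A made-up stand-in for the real syntax.scanner.
type scanner struct {
	tok string
}

func (s *scanner) next() { s.tok = "_Name" } // pretend a token was just matched

// Embedding promotes the scanner's fields and methods onto the parser,
// which is why the real parser can call p.next() and read p.tok directly.
type parser struct {
	scanner // embedded lexical analyzer
}

func main() {
	var p parser
	p.next()           // promoted from the embedded scanner
	fmt.Println(p.tok) // prints "_Name"
}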

Now that we understand the relationship between syntax analysis and lexical analysis, let's look at the lexical analysis process itself.

Lexical analysis process

The code location of lexical analysis is:

src/cmd/compile/internal/syntax/scanner.go

The lexical analyzer is implemented through a structure. Its structure is as follows:

type scanner struct {
	source      // source is also a struct; it records information about the source file being scanned, such as the byte array of its contents and the character and position currently being scanned (recall that lexical analysis reads the source file character by character, from left to right)
	mode   uint // controls whether comments are parsed
	nlsemi bool // if set '\n' and EOF translate to ';'

	// current token, valid after calling next()
	line, col uint     // position of the currently scanned character (initial value 0)
	blank     bool     // line is blank up to col
	tok       token    // the token matched for the currently scanned string (Go defines all supported token types as constants)
	lit       string   // source text of the token; e.g. when "if" is recognized in the source, tok is _If and lit is "if"
	bad       bool     // if there is a syntax error, lit may be incorrect
	kind      LitKind  // if the matched string is a literal, this identifies which kind of literal it is, such as IntLit, FloatLit or RuneLit
	op        Operator // similar to kind: if the recognized token is an operator, this identifies which operator it is
	prec      int      // valid if tok is _Operator, _AssignOp, or _IncOp
}

type source struct {
	in   io.Reader
	errh func(line, col uint, msg string)

	buf       []byte // Byte array of source file contents
	ioerr     error  // Error message read from file
	b, r, e   int    // buffer indices (see comment above)
	line, col uint   // Position information of the currently scanned character
	ch        rune   // Currently scanned characters
	chw       int    // width of ch
}
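
The b, r and e indices together with ch and chw implement buffered, rune-at-a-time reading: r advances through buf while chw records how many bytes the current character occupies, and line and col are updated as characters are consumed. The real source refills buf from the io.Reader as needed; the sketch below (simplified, no refilling, illustrative variable names) shows only the core idea of decoding runes and tracking positions:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	buf := []byte("ab\n世") // made-up source contents, including a multi-byte rune
	r := 0                 // next read position in buf, like source.r
	line, col := 1, 1      // position of the current character, like source.line / source.col

	for r < len(buf) {
		ch, chw := utf8.DecodeRune(buf[r:]) // current character and its width in bytes (cf. source.ch / source.chw)
		fmt.Printf("line %d col %d: %q (width %d)\n", line, col, ch, chw)
		r += chw
		if ch == '\n' {
			line++
			col = 1
		} else {
			col++
		}
	}
}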

Now that we know what each field in the lexical analyzer's struct means, let's look at what kinds of tokens Go defines.

Token

A Token is the smallest lexical unit with independent meaning in a programming language. Tokens mainly include keywords, user-defined identifiers, operators, delimiters, comments and so on. They are all defined in src/cmd/compile/internal/syntax/tokens.go; an excerpt is shown below (these tokens are defined as constants):

const (
	_    token = iota
	_EOF       // EOF

	// names and literals
	_Name    // name
	_Literal // literal

	// operators and operations
	// _Operator is excluding '*' (_Star)
	_Operator // op
	_AssignOp // op=
	_IncOp    // opop
	_Assign   // =
	......

	// delimiters
	_Lparen    // (
	_Lbrack    // [
	_Lbrace    // {
	_Rparen    // )
	_Rbrack    // ]
	_Rbrace    // }
	......

	// keywords
	_Break       // break
	_Case        // case
	_Chan        // chan
	_Const       // const
	_Continue    // continue
	_Default     // default
	_Defer       // defer
	......

	// empty line comment to exclude it from .String
	tokenCount //
)

The three most important attributes of each Token are: the type of the lexical unit, the text of the Token as it appears in the source code, and the position of the Token. Comments and semicolons are two special tokens. Ordinary comments generally do not affect program semantics, so comments can often be ignored (the mode field in the scanner struct controls whether comments are parsed).

All tokens fall into roughly four categories (see the sketch after this list):

  1. Special tokens. For example: _EOF
  2. Tokens for basic literals. For example: IntLit, FloatLit, ImagLit, etc.
  3. Operators. For example: Add // +, Sub // -, Mul // *
  4. Keywords. For example: _Break // break, _Case // case
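
The exported go/token package, which is used later in this article, exposes roughly the same categorization through the IsLiteral, IsOperator and IsKeyword methods and the Lookup function. A quick, runnable way to see it:

package main

import (
	"fmt"
	"go/token"
)

func main() {
	for _, tok := range []token.Token{token.EOF, token.INT, token.ADD, token.BREAK} {
		fmt.Printf("%-6s literal=%-5v operator=%-5v keyword=%v\n",
			tok, tok.IsLiteral(), tok.IsOperator(), tok.IsKeyword())
	}
	// Lookup decides whether an identifier is actually a keyword,
	// much like the compiler's scanner does after matching a name.
	fmt.Println(token.Lookup("break"), token.Lookup("foo")) // break IDENT
}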

Lexical analysis implementation

In the lexical analysis part, there are two core methods, one is nextch() and the other is next()

As we know, lexical analysis reads the source file character by character; the nextch() function is what continuously reads the source contents from left to right, one character at a time.

Below is part of the nextch() function. It mainly fetches the next unprocessed character and updates the scanning position:

func (s *source) nextch() {
redo:
	s.col += uint(s.chw)
	if s.ch == '\n' {
		s.line++
		s.col = 0
	}

	// fast common case: at least one ASCII character
	if s.ch = rune(s.buf[s.r]); s.ch < sentinel {
		s.r++
		s.chw = 1
		if s.ch == 0 {
			s.error("invalid NUL character")
			goto redo
		}
		return
	}

	// slower general case: add more bytes to buffer if we don't have a full rune
	for s.e-s.r < utf8.UTFMax && !utf8.FullRune(s.buf[s.r:s.e]) && s.ioerr == nil {
		s.fill()
	}

	// EOF
	if s.r == s.e {
		if s.ioerr != io.EOF {
			// ensure we never start with a '/' (e.g., rooted path) in the error message
			s.error("I/O error: " + s.ioerr.Error())
			s.ioerr = nil
		}
		s.ch = -1
		s.chw = 0
		return
	}

......
}

The next() function segments the character stream and matches the corresponding token based on the scanned characters, following the deterministic finite automaton idea introduced in the previous article. Some of the core code of next() is shown below:

func (s *scanner) next() {
	nlsemi := s.nlsemi
	s.nlsemi = false

redo:
	// skip white space
	s.stop()
	startLine, startCol := s.pos()
	for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
		s.nextch()
	}

	// token start
	s.line, s.col = s.pos()
	s.blank = s.line > startLine || startCol == colbase
	s.start()
	if isLetter(s.ch) || s.ch >= utf8.RuneSelf && s.atIdentChar(true) {
		s.nextch()
		s.ident()
		return
	}

	switch s.ch {
	case -1:
		if nlsemi {
			s.lit = "EOF"
			s.tok = _Semi
			break
		}
		s.tok = _EOF

	case '\n':
		s.nextch()
		s.lit = "newline"
		s.tok = _Semi

	case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
		s.number(false)

	case '"':
		s.stdString()
......
}

A complete description of what these two methods do is:

  1. The lexical analyzer calls nextch() to fetch the latest unprocessed character.
  2. Based on the scanned character, next() decides what kind of token is being scanned. For example, when it scans a letter such as 'a', it tries to match an identifier, i.e. the s.ident() method called inside next(), which also checks whether the identifier is a keyword.
  3. If the scanned character is a digit, it tries to match a basic literal (such as IntLit, FloatLit or ImagLit).
  4. Once next() has recognized a token, it is handed to the syntax analyzer, which then calls the lexical analyzer's next() function again to fetch the next token. (So the lexical analyzer does not translate the entire source file into tokens in one pass and hand them over; instead, the syntax analyzer requests tokens one by one through next().)

We can see this line of code in the next() function

for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
	s.nextch()
}

It skips spaces, tabs, carriage returns and, when no semicolon needs to be inserted, newlines in the source file.

To see how identifiers, literals and strings are recognized, look at the internal implementations of the ident(), number() and stdString() methods; the code is not pasted here. The underlying idea is exactly the deterministic finite automaton introduced in the previous article.
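
To make the DFA idea concrete, here is a heavily simplified, ASCII-only sketch of an ident()/number()-style scanner. All names and token kinds here are made up for illustration; the real methods additionally handle Unicode identifiers, numeric bases, exponents, multi-character operators and much more:

package main

import "fmt"

type tokenKind string

// keywords mirrors the check ident() performs after matching a name.
var keywords = map[string]bool{"if": true, "func": true, "package": true}

// scanOne reads one token starting at src[i] and returns its kind, its literal
// text, and the index just past it. Each branch is effectively a small DFA that
// keeps consuming characters while they fit the pattern.
func scanOne(src string, i int) (tokenKind, string, int) {
	start := i
	switch c := src[i]; {
	case isLetter(c):
		for i < len(src) && (isLetter(src[i]) || isDigit(src[i])) {
			i++
		}
		lit := src[start:i]
		if keywords[lit] {
			return "keyword", lit, i
		}
		return "name", lit, i
	case isDigit(c):
		for i < len(src) && isDigit(src[i]) {
			i++
		}
		return "literal", src[start:i], i
	default:
		// single-character operator/delimiter; the real scanner also merges
		// multi-character operators such as "==" or ":=" here
		return "op", string(c), i + 1
	}
}

func isLetter(c byte) bool { return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c == '_' }
func isDigit(c byte) bool  { return c >= '0' && c <= '9' }

func main() {
	src := "if a1 == 666"
	for i := 0; i < len(src); {
		if src[i] == ' ' {
			i++ // skip white space, like the loop at the top of next()
			continue
		}
		kind, lit, next := scanOne(src, i)
		fmt.Printf("%s => %q\n", kind, lit)
		i = next
	}
}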

Next, starting from the Go compiler's entry point, the flow of lexical analysis can be drawn as a flow chart to make the whole process easier to follow.

If the source code alone still does not give you a clear picture of the lexical analyzer, let's actually use it next, through the test file and the standard library provided by Go, to see how it works.

Test lexical analysis process

There are two ways to test lexical analysis: directly compile and execute the lexical analyzer test file that ships with Go, or use the standard library provided by Go.

Lexical analyzer test file: src/cmd/compile/internal/syntax/scanner_test.go
Lexical analyzer standard library provided by Go: src/go/scanner/scanner.go

Next, I will write a source file myself and pass it to the lexical analyzer to see how it parses and what the parsing result is

Test the lexical analyzer through the test file

We can directly compile and execute the TestScanner method in src/cmd/compile/internal/syntax/scanner_test.go. Its source code is as follows (comments are included in the code):

func TestScanner(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping test in short mode")
	}

	filename := *src_ // can be changed via -src flag
	//Here you can choose an absolute path to the source file you want to parse
	src, err := os.Open("/Users/shulv/studySpace/GolangProject/src/data_structure_algorithm/SourceCode/Token/aa.go")
	if err != nil {
		t.Fatal(err)
	}
	defer src.Close()

	var s scanner
	s.init(src, errh, 0) //Initialize lexical parser
	for {
		s.next() //Get the token (the next function will call the nextch() method to continuously get the next character until a token is matched)
		if s.tok == _EOF {
			break
		}
		if !testing.Verbose() {
			continue
		}
		switch s.tok { // the token that was obtained
		case _Name, _Literal: // identifier or basic literal
			// Print the file name, line, column, token, and the source text corresponding to the token
			fmt.Printf("%s:%d:%d: %s => %s\n", filename, s.line, s.col, s.tok, s.lit)
		case _Operator:
			fmt.Printf("%s:%d:%d: %s => %s (prec = %d)\n", filename, s.line, s.col, s.tok, s.op, s.prec)
		default:
			fmt.Printf("%s:%d:%d: %s\n", filename, s.line, s.col, s.tok)
		}
	}
}

The test function first opens your source file and passes its contents to the lexical analyzer's initialization function. Then, in an endless loop, it keeps calling next() to obtain tokens until the terminator _EOF is reached, at which point it breaks out of the loop.

The contents of the file I want the lexical analyzer to parse are as follows:

package Token

import "fmt"

func testScanner()  {
	a := 666
	if a == 666 {
		fmt.Println("Learning Scanner")
	}
}

Then run the test method with the following commands (you can print more information if you like, e.g. the fields of the scanner struct):

# cd /usr/local/go/src/cmd/compile/internal/syntax
# go test -v -run="TestScanner"

Print results:
=== RUN   TestScanner
parser.go:1:1: package
parser.go:1:9: name => Token
parser.go:1:14: ;
parser.go:3:1: import
parser.go:3:8: literal => "fmt"
parser.go:3:13: ;
parser.go:5:1: func
parser.go:5:6: name => testScanner
parser.go:5:17: (
parser.go:5:18: )
parser.go:5:21: {
parser.go:6:2: name => a
parser.go:6:4: :=
parser.go:6:7: literal => 666
parser.go:6:10: ;
parser.go:7:2: if
parser.go:7:5: name => a
parser.go:7:7: op => == (prec = 3)
parser.go:7:10: literal => 666
parser.go:7:14: {
parser.go:8:3: name => fmt
parser.go:8:6: .
parser.go:8:7: name => Println
parser.go:8:14: (
parser.go:8:15: literal => "Learning Scanner"
parser.go:8:33: )
parser.go:8:34: ;
parser.go:9:2: }
parser.go:9:3: ;
parser.go:10:1: }
parser.go:10:2: ;
--- PASS: TestScanner (0.00s)
PASS
ok  	cmd/compile/internal/syntax	0.007s

Test the lexical analyzer through the standard library

Another test method is through the standard library provided by Go. Here I demonstrate how to test the lexical analyzer with the methods in the standard library

You need to write a piece of code to call the methods in the standard library to implement a lexical analysis process. An example is as follows:

package Token

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func TestScanner1()  {
	src := []byte("cos(x)+2i*sin(x) //Comment ") / / the content I want to parse (of course, you can also use a byte array of the file content)
	//Initialize scanner
	var s scanner.Scanner
	fset := token.NewFileSet() //Initialize a file set (I'll explain this below)
	file := fset.AddFile("", fset.Base(), len(src)) //Adds a file to the character set
	s.Init(file, src, nil, scanner.ScanComments) //The third parameter is mode. I passed ScanComments, which means that the annotation needs to be parsed. Generally, the annotation can not be parsed
	//scanning
	for  {
		pos, tok, lit := s.Scan() //This is equivalent to the next() function
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit) //fset.Position(pos): get position information
	}
}

Execute the above code to obtain the following results:

1:1     IDENT   "cos"
1:4     (       ""
1:5     IDENT   "x"
1:6     )       ""
1:7     +       ""
1:8     IMAG    "2i"
1:10    *       ""
1:11    IDENT   "sin"
1:14    (       ""
1:15    IDENT   "x"
1:16    )       ""
1:18    ;       "\n"
1:18    COMMENT "//Comment"

You will notice that the methods used here are completely different from those in the test file. That is because the standard library implements a lexical analyzer of its own instead of reusing the scanner code inside the Go compiler. My understanding is that the code inside the compiler cannot be exposed as a public API; for safety, its methods have to remain private.

If you look at the standard library's lexical analyzer, you will find that its implementation differs from the one in the Go compiler, but the core idea is the same (character scanning and token recognition). The difference lies in how the files to be parsed are handled. In Go, multiple files make up a package, and multiple packages are then linked into an executable file, so the files belonging to a single package can be regarded as Go's basic compilation unit. For this reason, the lexical analyzer provided by Go also defines the FileSet and File objects to describe file sets and files:

type FileSet struct {
	mutex sync.RWMutex // protects the file set
	base  int          // base offset for the next file
	files []*File      // list of files in the order added to the set
	last  *File        // cache of last file looked up
}

type File struct {
	set  *FileSet
	name string // file name as provided to AddFile
	base int    // Pos value range for this file is [base...base+size]
	size int    // file size as provided to AddFile

	// lines and infos are protected by mutex
	mutex sync.Mutex
	lines []int // Lines contain the offset of the first character for each line (the first entry is always 0)
	infos []lineInfo
}

Their role is to record information about the files being parsed, similar to what the source struct inside the lexical analyzer's scanner struct does. The difference is that the Go compiler creates multiple goroutines to compile multiple files concurrently, whereas the standard library keeps the files to be parsed in a file set; you can see that the FileSet struct holds a slice of the files to be parsed (files []*File).

The following briefly introduces the relationship between FileSet and File, and how the position information of a Token is calculated.

FileSet and File

The corresponding relationship between FileSet and File is shown in the figure:

Image source: go-ast-book

The Pos type in the figure represents the subscript position of the array. Each File element in the FileSet corresponds to an interval of the underlying array. There is no intersection between different files, and there may be filling space between adjacent files

Each File mainly consists of a file name, base and size. base corresponds to the File's starting Pos index within the FileSet, so base and base+size define the start and end positions of the File within the FileSet array. Within a File, positions are addressed with an offset; offset + File.base converts an offset inside the File into a Pos, which is a global offset within the FileSet. Conversely, given a Pos you can find the File it belongs to and the offset inside that File.

The position of each Token produced by lexical analysis is defined as a Pos, and through the Pos and the corresponding FileSet the owning File can easily be found. The line and column number are then computed from the source file and the File's offsets (in the implementation, a File only stores the starting offset of each line, not the original source data). The underlying type of Pos is int, with semantics similar to a pointer, so 0 plays the same role as a null pointer: it is defined as NoPos and represents an invalid Pos.

Source: go-ast-book
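
The conversions between offset, Pos and (line, column) described above can be tried out directly with the exported go/token API. A small sketch (the file name and contents are made up):

package main

import (
	"bytes"
	"fmt"
	"go/token"
)

func main() {
	src := []byte("package Token\n\nvar a = 666\n") // made-up file contents

	fset := token.NewFileSet()
	// base is this file's starting Pos inside the set's global "array".
	file := fset.AddFile("aa.go", fset.Base(), len(src))
	file.SetLinesForContent(src) // record where each line starts

	offset := bytes.Index(src, []byte("666")) // byte offset of the literal inside the file
	pos := file.Pos(offset)                   // offset + base => global Pos
	fmt.Println(fset.Position(pos))           // Pos => file:line:column, e.g. aa.go:3:9
	fmt.Println(file.Offset(pos) == offset)   // Pos => back to the same offset: true
	fmt.Println(fset.File(pos).Name())        // Pos => the owning *token.File: aa.go
}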

From the relationship between FileSet and File, we can see that the lexical analyzer in the Go standard library can parse multiple source files through a file set.
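
For example, the sketch below adds two made-up source files to one FileSet and scans them both. Because they share the set, every token position is unique across the whole set, and fset.Position resolves each Pos back to the right file, line and column:

package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	// Two made-up source files sharing one FileSet.
	sources := map[string]string{
		"a.go": "package a\nvar x = 1\n",
		"b.go": "package b\nfunc f() {}\n",
	}

	fset := token.NewFileSet()
	for name, src := range sources {
		file := fset.AddFile(name, fset.Base(), len(src))

		var s scanner.Scanner
		s.Init(file, []byte(src), nil, 0)
		for {
			pos, tok, lit := s.Scan()
			if tok == token.EOF {
				break
			}
			fmt.Printf("%-12s %-8s %q\n", fset.Position(pos), tok, lit)
		}
	}
}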

Summary

This article started from the Go compiler's entry file, walked step by step through the source code of lexical analysis in the Go compiler, and then exercised the lexical analyzer through both the test file and the lexical analyzer standard library provided by Go. Hopefully it gives you a clear picture of Go's lexical analysis process.

The lexical analysis part is relatively simple and involves little core content. The really difficult parts are syntax analysis and the abstract syntax tree that come afterwards. If you are interested, stay tuned for the follow-up articles.

Topics: Go Back-end compiler