Create a programming language yourself

Posted by bettydailey on Mon, 17 Jan 2022 16:22:01 +0100

In the last section, we talked about flex. Today, we talk about bison

flex performs lexical analysis: it recognizes tokens such as your keywords and operator symbols. bison performs syntax analysis: it handles the grammar you define, that is, the meaning expressed when the keywords and symbols you defined are arranged in a certain format

Bison files are laid out much like Flex files, and their rules are written in a notation that is generally called BNF

Before introducing Bison, let's briefly talk about BNF

1. What is BNF grammar
BNF is the abbreviation of Backus-Naur Form (also called Backus normal form). It is a set of symbols used to describe the grammar of a computer language

In BNF grammar

Angle brackets (<>): the enclosed item is required.
Square brackets ([]): the enclosed item is optional.
Braces ({}): the enclosed item can be repeated zero or more times.
Vertical bar (|): choose either the item on its left or the one on its right.
(::=): means "is defined as".

For example, this is the BNF description of the switch statement in Java

<switch statement> ::= switch ( <expression> ) <switch block>
<switch block> ::= { <switch block statement groups><switch labels> }
<switch block statement groups> ::= <switch block statement group> | <switch block statement groups> <switch block statement group>

2.Bison
Like Flex, a Bison grammar lives in its own file, and the two are usually used together. Bison files use .y as the file extension
In order to write a parser, we need some way to describe the rules the parser uses to convert a series of tokens into a parse tree. A Bison file has the following layout

%{
C declarations
%}
Bison declarations
%%
Grammar rules
%%
Additional C code

'%%', '%{' and '%}' are punctuation marks that appear in every Bison grammar file to separate its parts.

C declarations: here you can use C to define the types and variables used in the actions. You can use preprocessor commands to define macros and use #include to pull in header files that do any of these things.

Bison declarations: these declare the names of the terminal and nonterminal symbols. They can also describe operator precedence and the data types of the semantic values of the various symbols.

Grammar rules: these define how to construct each nonterminal symbol from its parts.

Additional C code: this can contain any C code you want to use. Usually the definition of the lexical analyzer yylex goes here, together with subroutines called by the actions in the grammar rules. In a simple program, all the rest of the program can be put here.

<exp> ::= <factor>
    | <exp> + <factor>

<factor> ::= <number>
    | <factor> * <number>

Each line is a rule that explains how to form a branch of the syntax tree. The rules are recursive, that is, a branch of the tree can be built from similar branches. The rules also depend on one another: <exp> is built from <factor>, so * groups more tightly than +, and this layering is what creates operator precedence.

In fact, even a full walk through all of Bison would not make it sink in. A simple example works better

To stay close to the code a compiled language needs, I wrote an example that parses a function call. It reads a call to a function named print from the file a.txt. The arguments can be any number of integers, and the program calculates their sum and prints it

It doesn't matter whether the following code can be understood or not, as long as it can run normally

//a.txt
print(1,2,3,4,5,6,7,8)

The content above is what we want to parse. print can take any number of numbers

Then create a flex file and define the tokens we want to use. Remember, every token and character you use has to be defined there, otherwise bison will not recognize it

//test-flex.l
%{
#include "header.h"
#include <io.h>

#include "test-bison.tab.h"
extern int yyerror(char*);
%}

%%

[0-9]+		{ yylval.vINTEGER = atoi(yytext); return NUMBER; }
"print"		{ return PRINT; }
"("		{ return '('; }
")"		{ return ')'; }
","		{ return ','; }
%%

int yywrap()
{
	return 1;
}

The content above defines the lexer. [0-9]+ matches a run of digits; its action stores the value in yylval and returns the token NUMBER. The token name is arbitrary. "print" returns PRINT, and the other symbols are returned as the characters themselves

Then there is the most important part, the bison file test-bison.y

%{
#include "header.h"
extern int yylex();

#define YYDEBUG 1

int yyerror(char* str)
{
	/* print through "%s" so a '%' in the message is not treated as a format */
	printf("%s\n", str);
	return(1);
}
void add(vector<int>* list)
{
	int ret=0;
	for (std::vector<int>::iterator it = list->begin(); it != list->end(); ++it) {
		ret+=(*it);
	}
	printf("%d\n",ret);
}

%}

%token <vINTEGER>NUMBER
%token PRINT
%type <vLIST> add_list
%start start
%%

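start: print_definition
	;

print_definition: PRINT '(' add_list ')'
	{
		add($3);	/* sum and print the collected numbers */
	}
	;

add_list: NUMBER
	{
		$$ = new vector<int>();	/* first number: create the list */
		$$->push_back($1);
	}
	| add_list ',' NUMBER
	{
		$1->push_back($3);	/* later numbers: append to the existing list */
		$$ = $1;		/* pass the same list up explicitly */
	}
	;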
%%

To make the types used in bison available, you also need to define them, here in header.h

#ifndef HEADER
#define HEADER
#include "stdafx.h"
#include <string>
#include <iostream>
#include<vector>
using namespace std;
typedef struct YYSTYPE
{
	int  vINTEGER;
	vector<int>* vLIST;
} YYSTYPE;

#endif

And then another entry file

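#include "stdafx.h"
#include <stdio.h>
extern int yyparse();
extern FILE* yyin;
extern void yyrestart(FILE* F);
int main()
{
	FILE* file;
	file = fopen("a.txt", "r");
	if (file == NULL)	/* nothing to parse if a.txt is missing */
	{
		perror("a.txt");
		return 1;
	}
	yyin = file;
	yyrestart(yyin);
	yyparse();
	fclose(file);
	return 0;
}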

OK, all the required files are in place. Next, run

win_bison -d -v  test-bison.y
win_flex --nounistd test-flex.l

If nothing is output, the generation is complete

Then you'll find three new files: lex.yy.c, test-bison.tab.c and test-bison.tab.h. These are the files used for parsing
Then just compile and run. If everything works, the program prints the sum of the numbers in a.txt, which is 36 for the input above

On the surface, bison may look like a tool for matching regular patterns. In fact, it embodies the core of compiler theory, which cannot be explained in a few words. From a practical point of view, though, you don't have to understand all of its syntax and principles to use it. You just need to remember a few commonly used keywords and the overall structure of a bison file, which is enough to get your work done

Explanation of specific bison constructs

%token
%token declares a token of your language, such as a keyword or a value type like the one I used in the example above

%token <vINTEGER>NUMBER

This means that the semantic value of the token NUMBER is stored in the vINTEGER member. Of course, a token can also be declared without any type, which simply says that it is a token

%token PRINT

%type
It declares the type of the value returned by a rule (a nonterminal)

%type <vLIST> add_list

%left
This one is relatively simple, and I didn't use it above. It sets the associativity of your operators. The following line means that + and - are left-associative

%left '-' '+'

%start
Names the rule where parsing starts, that is, the entry point of the grammar

YYSTYPE
It is a C structure that defines the data types the semantic values need

int yyerror(char* str)
Called to report a syntax error found during parsing

The contents between %{ and %} are copied to the C file as is

First

When yacc starts parsing the file, it begins at start. The rule used by start is print_definition, so it moves on to print_definition. The print_definition rule starts with PRINT and is followed by '(' add_list ')'.

add_list is the more important one. You can pass any number of numbers to print because add_list is recursive. The recursion has the following form, where | means "or"

XXX:aaa{
		
	}|XXX',' aaa
	{
		
	}

$n refers to the value of the n-th symbol in the rule. For example, $1 refers to NUMBER in the example

push_back($1);

$$ is the value that the rule returns. The following line means that add_list returns a vector<int>*

$$ = new vector<int>();

The type of add_list was declared above

%type <vLIST> add_list

add_list first matches the first number inside (). After that, the recursive alternative of add_list matches: the 1,2,3,4,5,6,7,8 in the file follows the comma-separated pattern, so the rule is applied again and again, appending one number each time, until all of them are consumed

Then add() is executed: the vector<int>* that was just built is passed to add, which prints the result

These are the basic usage and rules of bison. If you want to know more, see below
http://dinosaur.compilertools.net/ covers it in more detail

OK, these are the basics needed for parsing a program file

If you want to see a larger example, you can look at the bison file of menthol, which I am developing

Next time I'll start talking about how to write a real programming language

Topics: C++ Programming