LLVM official tutorial Chap 3

Posted by BrianG on Sun, 23 Jan 2022 22:13:01 +0100

Note:

  1. LLVM 3.7 or later is required.
  2. These tutorials are written bottom-up, so they can be hard to follow on a first reading. Reading them more than once helps.

set up

First, some setup: we add a virtual codegen() method to every AST node class.

/// ExprAST - Base class for all expression nodes.
class ExprAST {
public:
  virtual ~ExprAST() {}
  virtual Value *codegen() = 0;
};

/// NumberExprAST - Expression class for numeric literals like "1.0".
class NumberExprAST : public ExprAST {
  double Val;

public:
  NumberExprAST(double Val) : Val(Val) {}
  virtual Value *codegen();
};
...

The codegen() method emits the IR for that AST node, together with everything it depends on, and returns an LLVM Value object.

Value is the class LLVM uses to represent an SSA (Static Single Assignment) value.
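
To make the recursive shape of codegen() concrete without pulling in LLVM, here is a minimal sketch. ToyExpr, ToyNumber, ToyAdd and the text they emit are made up for illustration; the real methods return a Value * and create instructions through the IRBuilder.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the tutorial's AST classes. Instead of an
// llvm::Value*, codegen() returns the textual name of the value it produced
// and appends any instructions it needs to Out.
struct ToyExpr {
  virtual ~ToyExpr() = default;
  virtual std::string codegen(std::vector<std::string> &Out) const = 0;
};

struct ToyNumber : ToyExpr {
  double Val;
  explicit ToyNumber(double V) : Val(V) {}
  // A constant needs no instruction; it is used directly as an operand.
  std::string codegen(std::vector<std::string> &) const override {
    return std::to_string(Val);
  }
};

struct ToyAdd : ToyExpr {
  std::unique_ptr<ToyExpr> LHS, RHS;
  ToyAdd(std::unique_ptr<ToyExpr> L, std::unique_ptr<ToyExpr> R)
      : LHS(std::move(L)), RHS(std::move(R)) {}
  // Generate both operands first, then emit one instruction whose result
  // name is defined exactly once (the "single assignment" in SSA).
  std::string codegen(std::vector<std::string> &Out) const override {
    std::string L = LHS->codegen(Out);
    std::string R = RHS->codegen(Out);
    std::string Name = "%t" + std::to_string(Out.size());
    Out.push_back(Name + " = fadd double " + L + ", " + R);
    return Name;
  }
};

// Generates code for (1 + 2) + 3 and checks that two instructions come out.
inline void toyCodegenExample() {
  std::unique_ptr<ToyExpr> Inner(
      new ToyAdd(std::unique_ptr<ToyExpr>(new ToyNumber(1.0)),
                 std::unique_ptr<ToyExpr>(new ToyNumber(2.0))));
  ToyAdd Root(std::move(Inner), std::unique_ptr<ToyExpr>(new ToyNumber(3.0)));
  std::vector<std::string> Out;
  std::string Result = Root.codegen(Out);
  assert(Out.size() == 2);
  assert(Result == "%t1");
}
```

Each node only knows how to generate itself; nested expressions work because a parent calls codegen() on its children before emitting its own instruction.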

The second step is to add a LogErrorV function that reports errors found during code generation.

Value *LogErrorV(const char *Str) {
  LogError(Str);
  return nullptr;
}

We also need a few LLVM data structures:

  • TheContext owns many of LLVM's core data structures, such as the type table and the constant value table.
  • Builder is a helper object that tracks the current insertion point and provides methods for creating new instructions.
  • TheModule contains functions and global variables. It is the top-level structure that LLVM IR uses to hold code, and it owns the memory for all of the IR we generate.
  • NamedValues tracks which values are defined in the current scope. For now only function parameters can be referenced, so when the code for a function body is generated, all of its parameters are stored in NamedValues.

static LLVMContext TheContext;
static IRBuilder<> Builder(TheContext);
static std::unique_ptr<Module> TheModule;
static std::map<std::string, Value *> NamedValues;

Expression code generation

  1. numerical value

    In LLVM IR, numeric constants are represented by the ConstantFP class, which stores the value internally in an APFloat (APFloat can hold floating-point constants of arbitrary precision). This code simply creates and returns a ConstantFP. Note that constants in LLVM IR are uniqued and shared, which is why the API uses the foo::get(...) idiom rather than new foo(...) or foo::Create(...).

    Value *NumberExprAST::codegen() {
      return ConstantFP::get(TheContext, APFloat(Val));
    }
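
The "uniqued and shared" idea behind foo::get(...) can be sketched without LLVM. ToyConstant and its Pool are hypothetical names; the pool plays roughly the role that the LLVM context plays for ConstantFP, and keying on a plain double is a simplification.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical miniature of a uniqued constant. None of these names are
// LLVM's; the point is only the shape of the get() idiom.
class ToyConstant {
  double Val;
  // Private constructor: callers cannot write `new ToyConstant(...)`.
  explicit ToyConstant(double V) : Val(V) {}

public:
  // Owns every constant ever requested, keyed by value.
  using Pool = std::map<double, std::unique_ptr<ToyConstant>>;

  double value() const { return Val; }

  // Return the one shared object for this value, creating it on first use.
  static const ToyConstant *get(Pool &P, double V) {
    std::unique_ptr<ToyConstant> &Slot = P[V];
    if (!Slot)
      Slot.reset(new ToyConstant(V));
    return Slot.get();
  }
};

// Asking twice for the same value yields the same object.
inline void toyConstantExample() {
  ToyConstant::Pool P;
  const ToyConstant *A = ToyConstant::get(P, 1.0);
  const ToyConstant *B = ToyConstant::get(P, 1.0);
  assert(A == B);
  assert(P.size() == 1);
}
```

Because equal constants are the same object, comparing two of them is a pointer comparison, which is one reason a factory function fits better here than a public constructor.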
    
  2. variable

    As mentioned above, in the current design the only variables that can be referenced are function parameters, and they are stored in the NamedValues map. So all we need to do is look up the Value for the given variable name. How parameters get added to NamedValues is covered in the function code generation section below.

    Value *VariableExprAST::codegen() {
      // Look this variable up in the function.
      Value *V = NamedValues[this->Name];
      if (!V)
        LogErrorV("Unknown variable name");
      return V;
    }
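
One detail of the lookup above: operator[] on a std::map default-inserts an entry (here a null pointer) when the key is missing, so looking up an unknown name also adds it to the table. A possible alternative, sketched with a hypothetical ToyValue stand-in for llvm::Value, is to use find():

```cpp
#include <cassert>
#include <map>
#include <string>

// ToyValue is a hypothetical stand-in for llvm::Value so that the snippet
// compiles without LLVM.
struct ToyValue {};

using ToySymbolTable = std::map<std::string, ToyValue *>;

// Lookup that leaves the table untouched when the name is unknown.
inline ToyValue *lookupVariable(const ToySymbolTable &Table,
                                const std::string &Name) {
  auto It = Table.find(Name);
  return It == Table.end() ? nullptr : It->second;
}

// Contrasts find() with operator[] on a missing key.
inline void lookupExample() {
  ToyValue X;
  ToySymbolTable Table;
  Table["x"] = &X;

  assert(lookupVariable(Table, "y") == nullptr);
  assert(Table.size() == 1); // find() did not add "y"

  assert(Table["y"] == nullptr);
  assert(Table.size() == 2); // operator[] inserted a null entry for "y"
}
```

For this chapter the difference is harmless, since NamedValues is cleared for every function and a null entry still reports "Unknown variable name".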
    
  3. Binary expression

    As built by the parser in the previous chapter, a binary expression holds an LHS, an RHS and an operator Op. LHS and RHS are expressions with their own codegen() methods, so we simply call those recursively. The interesting part is how Op is handled.

    Value *BinaryExprAST::codegen() {
      Value *L = LHS->codegen();
      Value *R = RHS->codegen();
      if (!L || !R)
        return nullptr;
    
      switch (Op) {
      case '+':
        return Builder.CreateFAdd(L, R, "addtmp");
      case '-':
        return Builder.CreateFSub(L, R, "subtmp");
      case '*':
        return Builder.CreateFMul(L, R, "multmp");
      case '<':
        L = Builder.CreateFCmpULT(L, R, "cmptmp");
        // Convert bool 0/1 to double 0.0 or 1.0
        return Builder.CreateUIToFP(L, Type::getDoubleTy(TheContext),
                                    "booltmp");
      default:
        return LogErrorV("invalid binary operator");
      }
    }
    

    This is where the LLVM builder class starts to show its value. IRBuilder knows where to insert the newly created instruction. All we have to do is specify which instruction to create (for example, CreateFAdd), which operands to use (L and R here), and optionally provide a name for the generated instruction.

    Note that these names (addtmp, subtmp, and so on) are only hints. If several values are given the same name, LLVM automatically appends an increasing numeric suffix so that each one stays unique.
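
A toy model of this renaming is sketched below. It is not LLVM's implementation: ToyNamer is a made-up class, and the single counter shared by all names is an assumption chosen to reproduce the suffixes seen in IR printed by the official tutorial (for example multmp, multmp1, multmp2, addtmp, multmp3, addtmp4). Treat the exact suffixes as an implementation detail.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model of name uniquing inside one function. NOT LLVM's code, only an
// approximation of the visible behaviour.
class ToyNamer {
  std::set<std::string> Used;
  unsigned LastUnique = 0; // one counter shared by all names (an assumption)

public:
  // Returns Hint if it is still free, otherwise Hint plus a numeric suffix.
  std::string unique(const std::string &Hint) {
    std::string Name = Hint;
    // On a clash, keep trying the next counter value until the name is free.
    while (!Used.insert(Name).second)
      Name = Hint + std::to_string(++LastUnique);
    return Name;
  }
};

// The first use of a hint keeps it unchanged; later uses get a suffix.
inline void toyNamerExample() {
  ToyNamer N;
  assert(N.unique("addtmp") == "addtmp");
  assert(N.unique("addtmp") == "addtmp1");
}
```

The practical consequence is that code generation never has to invent unique names itself; it can pass the same hint every time.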

    LLVM instructions are subject to strict rules: for example, the left and right operands of an add instruction must have the same type, and the result type must match the operand types. Because all values in Kaleidoscope are doubles, the code for add, sub and mul is very simple.

    On the other hand, LLVM specifies that the fcmp instruction always returns an 'i1' value (a one-bit integer). The problem is that Kaleidoscope wants the value to be 0.0 or 1.0. To get these semantics, we combine the fcmp instruction with a uitofp instruction, which converts its input integer into a floating-point value by treating the input as unsigned. In contrast, if we used the sitofp instruction, Kaleidoscope's '<' operator would return 0.0 or -1.0, depending on the input value.
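
The difference between the two conversions can be modelled in plain C++. The helpers below are illustrative only (they are not LLVM APIs) and represent an i1 as a bool. The sketch also spells out that the ULT predicate used above is "unordered or less than", so it is true when either operand is NaN.

```cpp
#include <cassert>
#include <cmath>

// uitofp on an i1: read the single bit as an unsigned number, so a set bit
// becomes 1.0.
inline double toyUIToFP(bool Bit) { return Bit ? 1.0 : 0.0; }

// sitofp on an i1: read the single bit as a two's-complement number. The
// only bit is the sign bit, so a set bit becomes -1.0.
inline double toySIToFP(bool Bit) { return Bit ? -1.0 : 0.0; }

// fcmp ult: true when L < R, or when either operand is NaN (unordered).
inline bool toyFCmpULT(double L, double R) {
  return std::isnan(L) || std::isnan(R) || L < R;
}

// Kaleidoscope's '<' as generated in this chapter: compare, then widen the
// one-bit result to a double.
inline double toyLessThan(double L, double R) {
  return toyUIToFP(toyFCmpULT(L, R));
}

// Shows the 1.0 versus -1.0 outcome for a true comparison.
inline void toyCompareExample() {
  assert(toyLessThan(1.0, 2.0) == 1.0);
  assert(toyLessThan(2.0, 1.0) == 0.0);
  assert(toySIToFP(toyFCmpULT(1.0, 2.0)) == -1.0);
}
```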

  4. function call

    Generating code for function calls with LLVM is quite straightforward. The code below first looks up the function name in the LLVM Module's symbol table. Recall that the LLVM Module is the container that holds the functions we are JIT'ing. By giving each function the same name the user specified, we can let the LLVM symbol table resolve function names for us. Once we have the callee, we check the argument count, recursively codegen each argument, and create an LLVM call instruction.

    Value *CallExprAST::codegen() {
      // Look up the name in the global module table.
      Function *CalleeF = TheModule->getFunction(this->Callee);
      if (!CalleeF)
        return LogErrorV("Unknown function referenced");
    
      // Report an error if the argument count does not match.
      if (CalleeF->arg_size() != this->Args.size())
        return LogErrorV("Incorrect # arguments passed");
    
      std::vector<Value *> ArgsV;
      for (unsigned i = 0, e = this->Args.size(); i != e; ++i) {
        // Each Args[i] points to an ExprAST, so codegen() dispatches virtually.
        ArgsV.push_back(this->Args[i]->codegen());
        if (!ArgsV.back())
          return nullptr;
      }
    
      return Builder.CreateCall(CalleeF, ArgsV, "calltmp");
    }
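
The two checks that happen before any call instruction is created can be modelled as follows. This is a sketch under simplifying assumptions: ToyModule is a hypothetical table that only records each function's parameter count, and checkCall is a made-up helper. The error strings are the ones used in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy model: the "module" only remembers how many parameters each function
// takes. A hypothetical type, not llvm::Module.
using ToyModule = std::map<std::string, std::size_t>;

// Returns an empty string when the call is well formed, else the error text.
inline std::string checkCall(const ToyModule &M, const std::string &Callee,
                             std::size_t NumArgs) {
  auto It = M.find(Callee);
  if (It == M.end())
    return "Unknown function referenced";
  if (It->second != NumArgs)
    return "Incorrect # arguments passed";
  return "";
}

// A known function with the right arity passes; everything else is reported.
inline void checkCallExample() {
  ToyModule M;
  M["sin"] = 1;
  assert(checkCall(M, "sin", 1).empty());
  assert(checkCall(M, "sin", 2) == "Incorrect # arguments passed");
  assert(checkCall(M, "cos", 1) == "Unknown function referenced");
}
```

Only the number of arguments needs checking because every Kaleidoscope value is a double, so argument types can never mismatch.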
    
  5. END

    This wraps up our handling of the four basic expression types in Kaleidoscope. Feel free to go in and add more. For example, browsing the LLVM Language Reference turns up several other interesting instructions that are easy to plug into this basic framework.

Function code generation

Code generation for prototypes and functions has to handle a number of details, which makes the code less elegant than expression code generation. First, let's look at code generation for prototypes, which are used both for function bodies and for external function declarations. The code starts with:

Function *PrototypeAST::codegen() {
  // Make the function type:  double(double,double) etc.
  // Note: this->Args is a std::vector<std::string> of parameter names.
  std::vector<Type*> Doubles(this->Args.size(),
                             Type::getDoubleTy(TheContext));
  FunctionType *FT =
    FunctionType::get(Type::getDoubleTy(TheContext), Doubles, false);

  Function *F =
    Function::Create(FT, Function::ExternalLinkage, Name, TheModule.get());

This code packs a lot into a few lines. First, note that this function returns a "Function *" rather than a "Value *". Because a prototype really describes the external interface of a function (not a value computed by an expression), it makes sense for it to return the LLVM Function it corresponds to.

The code above consists of three statements:

  1. First statement

    Because all function parameters in Kaleidoscope are of type double, the first statement creates a vector of N LLVM double types, one for each parameter.

  2. Second statement

    It then uses the FunctionType::get method to create a function type that takes N doubles as parameters and returns one double. The false argument indicates that the function is not variadic (it does not take a variable number of arguments).

    Note that types in LLVM are uniqued, just like constants, so you don't "new" a type, you "get" it.

  3. Third statement

    The last statement actually creates the IR Function corresponding to the prototype.

    It specifies the type to use (a function returning double and taking N doubles), the linkage, the function name, and the module to insert into.

    "Function::ExternalLinkage" means that the function may be defined outside the current module and/or that it can be called by functions outside the module.

    The Name passed in is the name the user specified. Because "TheModule" is specified, this name is registered in TheModule's symbol table.

  // Set names for all arguments.
  unsigned Idx = 0;
  for (auto &Arg : F->args())
    Arg.setName(this->Args[Idx++]);

  return F;
}

Finally, we set the name of each function argument according to the names given in the prototype. This step isn't strictly necessary, but keeping the names consistent makes the IR more readable and lets later code refer to the arguments directly by name, rather than having to look them up in the Prototype AST.

Now we have a function prototype without a body. This is how LLVM IR represents function declarations.
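
As a rough illustration, an extern such as sin(x) shows up in printed IR along the lines of "declare double @sin(double)". The helper below only builds that text; renderDeclaration is a hypothetical function for this post, not an LLVM API, and the exact printed form is up to LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: build the textual form of a body-less function whose
// return value and parameters are all double, as in Kaleidoscope.
inline std::string renderDeclaration(const std::string &Name,
                                     std::size_t NumArgs) {
  std::string Text = "declare double @" + Name + "(";
  for (std::size_t I = 0; I != NumArgs; ++I) {
    if (I != 0)
      Text += ", ";
    Text += "double";
  }
  return Text + ")";
}

// One parameter type per argument, separated by commas.
inline void renderDeclarationExample() {
  assert(renderDeclaration("sin", 1) == "declare double @sin(double)");
  assert(renderDeclaration("atan2", 2) ==
         "declare double @atan2(double, double)");
}
```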

For extern statements in Kaleidoscope, this is all we need to do. For function definitions, however, we also need to codegen and attach a function body.

Function *FunctionAST::codegen() {
  // First, check for an existing function from a previous 'extern' declaration.
  // PrototypeAST::codegen registered the name in the module through
  // Function::Create(..., Name, TheModule.get()).
  Function *TheFunction = TheModule->getFunction(this->Proto->getName());

  if (!TheFunction)
    TheFunction = this->Proto->codegen();

  if (!TheFunction)
    return nullptr;

  if (!TheFunction->empty())
    return (Function*)LogErrorV("Function cannot be redefined.");

For a function definition, we first search the module's symbol table for an existing version of the function, in case one has already been created with an 'extern' statement. If Module::getFunction returns null, no previous version exists, so we codegen one from the prototype. In either case, we want to make sure the function has no body yet before we start.

Next, we create the function body:

  // Create a new basic block to start insertion into.
  BasicBlock *BB = BasicBlock::Create(TheContext, "entry", TheFunction);
  Builder.SetInsertPoint(BB);

  // Record the function arguments in the NamedValues map.
  NamedValues.clear();
  for (auto &Arg : TheFunction->args())
    // getName() returns a StringRef; newer LLVM releases need the explicit
    // conversion to std::string for the map key.
    NamedValues[std::string(Arg.getName())] = &Arg;

Now we get to the point where the Builder is set up. The first line creates a new basic block (named "entry") and inserts it into TheFunction.

The second line then tells the builder that new instructions should be inserted at the end of this new basic block.

Basic blocks in LLVM are an important part of functions: they define the Control Flow Graph. Since we don't have any control flow yet, our functions contain only one block for now. We will fix this in Chapter 5 :).

Next, we add the function arguments to the NamedValues map (after first clearing it) so that VariableExprAST nodes can access them.
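
Why the map is cleared first can be seen in a small model. ToyValue stands in for llvm::Value and enterFunction is a made-up helper that mirrors the loop above; without the clear(), a parameter of an earlier function would still resolve inside a later one.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct ToyValue {}; // hypothetical stand-in for llvm::Value

using ToyScope = std::map<std::string, ToyValue *>;

// Mirror of the tutorial's loop: forget the previous function's parameters,
// then record the current ones by name. ArgStorage owns the toy arguments.
inline void enterFunction(ToyScope &NamedValues,
                          const std::vector<std::string> &ArgNames,
                          std::vector<ToyValue> &ArgStorage) {
  NamedValues.clear();
  ArgStorage.assign(ArgNames.size(), ToyValue());
  for (std::size_t I = 0; I != ArgNames.size(); ++I)
    NamedValues[ArgNames[I]] = &ArgStorage[I];
}

// After entering a second function, the first function's names are gone.
inline void enterFunctionExample() {
  ToyScope NamedValues;
  std::vector<ToyValue> Storage;

  enterFunction(NamedValues, {"a", "b"}, Storage);
  assert(NamedValues.count("a") == 1);

  enterFunction(NamedValues, {"x"}, Storage);
  assert(NamedValues.count("a") == 0);
  assert(NamedValues.count("x") == 1);
}
```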

  // Body is an expression; its codegen() recursively generates its operands.
  if (Value *RetVal = Body->codegen()) {
    // Finish off the function.
    Builder.CreateRet(RetVal);

    // Validate the generated code, checking for consistency.
    verifyFunction(*TheFunction);

    return TheFunction;
  }

Once the insertion point is set up and the NamedValues map is populated, we call the codegen() method on the root expression of the function. If no error occurs, this emits code to compute the expression into the entry block and returns the computed value. Assuming there is no error, we then create an LLVM ret instruction, which completes the function. Once the function is built, we call verifyFunction, which LLVM provides. It runs a variety of consistency checks on the generated code to determine whether our compiler is doing everything right. Using it is important: it can catch a lot of bugs. Once the function is finished and validated, we return it.

The only piece left is handling the error case. For simplicity, we handle it by deleting the function we produced, using the eraseFromParent method. This lets the user redefine a function they previously typed in incorrectly: if we didn't delete it, it would live on in the symbol table with a body, preventing future redefinition.

  // Error reading body, remove function.
  TheFunction->eraseFromParent();
  return nullptr;
}
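
The overall control flow of FunctionAST::codegen() (reuse an existing declaration, refuse a second body, erase on failure) can be modelled without LLVM. This is only a sketch: ToyModule, DefineResult and defineFunction are hypothetical, and the BodyOk flag stands in for Body->codegen() returning a non-null Value.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy module: maps a function name to whether it already has a body.
// A model of the control flow only; no IR is produced.
using ToyModule = std::map<std::string, bool>;

enum class DefineResult { Defined, Redefinition, BodyError };

inline DefineResult defineFunction(ToyModule &M, const std::string &Name,
                                   bool BodyOk) {
  auto It = M.find(Name);
  if (It == M.end())
    It = M.emplace(Name, false).first; // like Proto->codegen(): no body yet
  if (It->second)
    return DefineResult::Redefinition; // a body is already attached
  if (!BodyOk) {
    M.erase(It);                       // like eraseFromParent()
    return DefineResult::BodyError;
  }
  It->second = true;                   // body attached and verified
  return DefineResult::Defined;
}

// A failed definition leaves nothing behind, so the user can try again.
inline void defineFunctionExample() {
  ToyModule M;
  assert(defineFunction(M, "foo", false) == DefineResult::BodyError);
  assert(M.count("foo") == 0);
  assert(defineFunction(M, "foo", true) == DefineResult::Defined);
  assert(defineFunction(M, "foo", true) == DefineResult::Redefinition);
}
```

In this model, as in the code above, a failed body also removes a function that was only declared with extern beforehand, since the same entry is erased.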

Topics: C++ llvm