[Model Inference] Quantization implementation sharing 1: a detailed explanation of the min-max symmetric quantization algorithm

Posted by newbie79 on Wed, 15 Dec 2021 18:35:13 +0100


  Hello, I'm Jizhi Horizon. This article analyzes the implementation of the min-max symmetric quantization algorithm, taking Tengine's implementation as an example.

  Tengine is an excellent end-to-end deep learning inference framework open-sourced by OPEN AI LAB. Its core is implemented mainly in C, with the wrapper and tooling code written in C++. Quantization is an essential optimization step for inference acceleration, and a mature inference framework will usually split the quantization module out into an independent set of tools, as Tengine, NCNN, Ascend and Cambricon do. This is mainly because the quantization process is not strongly tied to the hardware, and decoupling it leaves room to do more with it.

  The min-max and KL quantization algorithms are the baseline and de facto standard for hardware vendors adapting inference engines. KL quantization is widely used; NVIDIA's TensorRT, for example, adopts a KL-based calibration strategy. Min-max, introduced here, is characterized by simple logic and good results, which makes it an appropriate start for this series on quantization implementations. Let's study the concrete implementation of the min-max quantization strategy in Tengine.

1. Quantization usage

  Quantization is mainly divided into activation (dynamic) quantization and weight & bias (static) quantization. Quantizing weights & biases has a large impact on accuracy, while quantizing activations has a smaller impact on the overall result, but it is still needed to reach a satisfactory end-to-end effect. In general, weights & biases are quantized per channel, while activations are quantized per layer. Why is that? Convolution dominates the workload, and for convolution, quantizing activations per channel would conflict with the parameter sharing of the convolution kernel; so activations are generally quantized per layer to fit convolution parameter sharing.
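
  To make the two granularities concrete, here is a minimal sketch (not Tengine code; the tensor layout and function names are assumed for illustration) of computing one min-max scale for a whole activation tensor versus one scale per output channel of a weight tensor:

#include <algorithm>
#include <cmath>
#include <vector>

// Per-layer (per-tensor) scale: a single scale for the whole activation tensor.
float per_layer_scale(const std::vector<float>& activation)
{
    float max_abs = 0.f;
    for (float v : activation)
        max_abs = std::max(max_abs, std::fabs(v));
    return max_abs / 127.f; // symmetric int8, zero_point = 0
}

// Per-channel scales: one scale per output channel of a weight tensor
// laid out as [out_channels, elems_per_channel].
std::vector<float> per_channel_scales(const std::vector<float>& weight, int out_channels)
{
    const int cstep = (int)weight.size() / out_channels;
    std::vector<float> scales(out_channels);
    for (int ch = 0; ch < out_channels; ch++)
    {
        float max_abs = 0.f;
        for (int j = 0; j < cstep; j++)
            max_abs = std::max(max_abs, std::fabs(weight[ch * cstep + j]));
        scales[ch] = max_abs / 127.f;
    }
    return scales;
}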

  Here are the command-line parameters required by the Tengine quantization tool (an example invocation is shown after the list):

  • Input model: the input fp32 tmfile model file;
  • Output model: the generated int8 tmfile model file;
  • Calib images: the calibration images used for activation quantization;
  • Scale file: the generated calibration table file;
  • Algorithm: the quantization algorithm; MIN-MAX, KL, ACIQ, DFQ and EQ can be selected;
  • Dims: the input shape of the calibration images; three dimensions (c, h, w) are passed in, and n = 1 is hard-coded;
  • Mean: mean values for image preprocessing;
  • Scale: scale values for image preprocessing;
  • BGR2RGB: channel-order conversion;
  • Center crop: center cropping during image preprocessing;
  • Letter box: letterbox preprocessing, i.e. resizing the image while keeping its aspect ratio;
  • YOLOv5 focus: a YOLOv5-style Focus preprocessing step;
  • Thread num: the number of threads used during quantization;
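
  For reference, a typical invocation looks roughly like the following (the tool name and flags follow the examples in the Tengine quant tool documentation; verify them against the tool's -h output for your version):

./quant_tool_int8 -m ./mobilenet_fp32.tmfile -i ./dataset -o ./mobilenet_int8.tmfile \
                  -g 3,224,224 -w 104.007,116.669,122.679 -s 0.017,0.017,0.017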

2. Min-max quantization

  Min-max is the simplest quantization algorithm. Its main logic is to take the min/max of the data, derive a symmetric scale with zero point 0, and map fp32 values to int8:

scale = max(|min_val|, |max_val|) / 127
zero_point = 0
value_i8 = clip(round(value_fp32 / scale), -127, 127)

  The main code implementing the min-max method in Tengine is as follows:

case ALGORITHM_MIN_MAX:{
    if (quant_tool.scale_file.empty()){
        quant_tool.scale_file = "table_minmax.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* Evaluate quantitative losses */
    if (quant_tool.evaluate){
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

  The main quantization interfaces are quant_tool.activation_quant_tool() and save_graph_i8_perchannel. For min-max, these two interfaces do two things respectively:

  (1) activation value quantization, generating table_minmax.scale;

  (2) weight & bias quantization, generating scale_weight.txt and scale_bias.txt.

2.1 Activation value quantization

  When reading the Tengine source code, you must keep a firm grip on struct graph* ir_graph; the graph structure is the essence.

  Activation value quantization is a dynamic process: it needs to obtain the data distribution of each layer at runtime, which is why a certain number of calibration images must be fed in.

  Let's look at the preprocessing module first; this part is shared with the other quantization algorithms:

// Bind input_tensor to the input_data address; input_tensor points into ir_graph->tensor_list. Note: keep this step in mind, otherwise the later code is hard to follow
tensor_t input_tensor = get_graph_input_tensor(ir_graph, 0, 0);

if (set_tensor_shape(input_tensor, dims, 4) < 0){
    fprintf(stderr, "Set input tensor shape failed\n");
    return -1;
}

if (set_tensor_buffer(input_tensor, input_data.data(), img_size * sizeof(float)) < 0){
    fprintf(stderr, "Set input tensor buffer failed\n");
    return -1;
}

// prerun graph, do some initialization configuration
if (prerun_graph_multithread(ir_graph, this->opt) < 0){
    fprintf(stderr, "Prerun multithread graph failed.\n");
    return -1;
}

// Image preprocessing fills input_data; since input_data was bound to input_tensor & ir_graph->tensor_list[0] above, modifying input_data modifies ir_graph->tensor_list
get_input_data_cv(imgs_list[nums].c_str(), input_data.data(), img_c, img_h, img_w, mean, scale, sw_RGB, center_crop, letterbox_rows, letterbox_cols, focus);

  Then run the graph and record the intermediate activation values into ir_graph->tensor_list[i]:

if (run_graph(ir_graph, 1) < 0){
    fprintf(stderr, "Run graph failed\n");
    return -1;
}

  Collect the min and max values of the activation tensors:

/* get the min/max value of activation tensor */
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* act_tensor = ir_graph->tensor_list[i];
    if (act_tensor->tensor_type == TENSOR_TYPE_VAR || act_tensor->tensor_type == TENSOR_TYPE_INPUT){
        float* start_addr = (float*)act_tensor->data;
        float* end_addr = (float*)act_tensor->data + act_tensor->elem_num;
        max_activation[i] = std::max(max_activation[i], *std::max_element(start_addr, end_addr));
        min_activation[i] = std::min(min_activation[i], *std::min_element(start_addr, end_addr));}
}

  Calculate the quantization scale of the activation values; for Softmax layers the scale is fixed to 1/127.f:

/* save the calibration file with min-max algorithm */
FILE* fp_minmax = fopen("table_minmax.scale", "wb");
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* t = ir_graph->tensor_list[i];
    if (t->tensor_type == TENSOR_TYPE_VAR || t->tensor_type == TENSOR_TYPE_INPUT){
        float act_scale = 1.f;
        int act_zero_point = 0;

        act_scale = std::max(std::abs(max_activation[i]), std::abs(min_activation[i])) / 127.f;

        /* the scale of softmax is always scale = 1 / 127.f */
        for (int j = 0; j < ir_graph->node_num; j++){
            struct node* noden = ir_graph->node_list[j];
            struct tensor* tensor_tmp = get_ir_graph_tensor(ir_graph, noden->output_tensors[0]);

            if (!(tensor_tmp->tensor_type == TENSOR_TYPE_INPUT || tensor_tmp->tensor_type == TENSOR_TYPE_VAR))
                continue;

            std::string tmp_op_name = get_op_name_from_type(noden->op.type);
            std::string cur_name = t->name;
            std::string tmp_name = tensor_tmp->name;

            if ((cur_name == tmp_name) && tmp_op_name == "Softmax"){
                act_scale = 1 / 127.f;
                break;}
        }

        fprintf(fp_minmax, "%s %f %d\n", ir_graph->tensor_list[i]->name, act_scale, act_zero_point);}
}
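
  Given the fprintf format above ("%s %f %d"), each entry in table_minmax.scale is a single line of tensor name, scale and zero point. With hypothetical tensor names and values, the file looks like this (the last line shows the fixed 1/127 Softmax scale):

input_data 0.018921 0
conv1_relu 0.047244 0
prob_softmax 0.007874 0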

2.2 Weight & bias quantization

  Weight & bias quantization differs from activation quantization: activation quantization needs inference over the calibration images to obtain the dynamic distribution of the input data, whereas weights & biases are static, so the basic quantization pass does not need to run forward inference.

2.2.1 Create a graph

  Load the tmfile and build the graph:

struct graph* ir_graph = (struct graph*)create_graph(nullptr, "tengine", model_file);
if (nullptr == ir_graph){
    fprintf(stderr, "Create graph failed.\n");
    return -1;}

2.2.2 Optimize activation quantization scales

  This is mainly the quant inplace optimization, a scale-handling strategy for non-convolution operators: when inplace is enabled, operators that do not change the value distribution, such as Flatten, Reshape, Squeeze, Clip, Slice, Pooling with pool_method == 0 and ReLU with negative_slope == 0, reuse the quantization scale of their output tensor for their input, propagated backwards recursively.

if (inplace == 0){
    for (int i = 0; i < ir_graph->tensor_num; i++){
        struct tensor* ir_tensor = ir_graph->tensor_list[i];
        if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
            ir_tensor->scale = layer_scale[ir_tensor->name];
            ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];}}
    }
    else{
        std::tr1::unordered_map<std::string, bool> layer_pass;
        for (int i = ir_graph->tensor_num - 1; i >= 0; i--){
            struct tensor* ir_tensor = ir_graph->tensor_list[i];
            if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
                if (layer_pass[ir_tensor->name] == false){
                    uint32_t ir_node_idx = ir_tensor->producer;
                    struct node* t_node = ir_graph->node_list[ir_node_idx];

                    std::string op_name = get_op_name_from_type(t_node->op.type);

                    bool poolTrue = false;
                    bool reluTrue = false;
                    if (op_name == "Pooling"){
                        struct pool_param* pool_param = (struct pool_param*)t_node->op.param_mem;
                        if (pool_param->pool_method == 0)
                            poolTrue = true;
                    }
                    else if (op_name == "ReLU"){
                        struct relu_param* relu_param = (struct relu_param*)t_node->op.param_mem;
                        if (relu_param->negative_slope == 0.f)
                            reluTrue = true;
                    }
                    if (op_name == "Flatten" || op_name == "Reshape" || op_name == "Squeeze" || op_name == "Clip" || op_name == "Slice" || poolTrue || reluTrue){
                        struct tensor* t_in_tensor = ir_graph->tensor_list[t_node->input_tensors[0]];
                        if (layer_scale[ir_tensor->name] != 0){
                            ir_tensor->scale = layer_scale[ir_tensor->name];
                            ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];

                            if (t_in_tensor->tensor_type == TENSOR_TYPE_VAR || t_in_tensor->tensor_type == TENSOR_TYPE_INPUT){
                                recursion_pass_through(ir_graph, ir_tensor->name, t_in_tensor, layer_used, layer_scale, layer_zeropoint, layer_pass);}}
                    }
                    else{
                        ir_tensor->scale = layer_scale[ir_tensor->name];
                        ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];
                    }
                    layer_pass[ir_tensor->name] = true;}}}
}

2.2.3 Weight & bias quantization

  The whole process is similar to activation quantization: first search for the min and max values, then round and clip. Here, besides computing the scale, the actual rounding and clipping has to be performed, because the int8 tmfile quantized model file needs to be generated. Another thing to note is that weights are quantized to int8 while biases are quantized to int32: the accumulated int8 products of the weight/activation matrix multiplication can easily overflow int8, so the accumulation is stored in int32, and the int32 bias is then added to that accumulator.

  Besides the above, there is another difference from activation quantization: activation quantization is per layer (perLayer), while weight & bias quantization is per channel (perChannel).
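
  As a sanity check on why the bias scale must be weight_scale * input_scale (as the code below sets it), here is a minimal sketch, not Tengine code, of the int8 arithmetic that consumes these quantized values for one output element:

#include <cstdint>

// One output element of a quantized conv / fully-connected layer with per-channel weights.
// x_q and w_q are int8; the accumulator and the bias are int32.
float quantized_dot(const int8_t* x_q, const int8_t* w_q, int len,
                    int32_t bias_i32, float input_scale, float weight_scale_ch)
{
    int32_t acc = 0; // int8 * int8 products are accumulated in int32 to avoid overflow
    for (int i = 0; i < len; i++)
        acc += (int32_t)x_q[i] * (int32_t)w_q[i];

    // every product carries a factor of (input_scale * weight_scale_ch), so a bias
    // quantized with scale = weight_scale_ch * input_scale adds on the same scale
    acc += bias_i32;

    // dequantize back to fp32
    return (float)acc * input_scale * weight_scale_ch;
}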

  Weight int8 quantization:

/* quantize the weight data from fp32 to int8 */
if (op_name == "Convolution" || op_name == "FullyConnected" || op_name == "Deconvolution"){
    struct tensor* weight_tensor = ir_graph->tensor_list[noden->input_tensors[1]];

    int channel_num = weight_tensor->dims[0];
    int cstep = int(weight_tensor->elem_num / channel_num);
    float* weight_data = (float*)weight_tensor->data;
    int8_t* i8_weight_data = (int8_t*)sys_malloc(weight_tensor->elem_num * sizeof(int8_t));

    float* weight_scale_list = (float*)sys_malloc(channel_num * sizeof(float));
    int* weight_zp_list = (int*)sys_malloc(channel_num * sizeof(int));

    fprintf(fp_weight, "%s ", weight_tensor->name);
    /* calculate the quant scale value of weight perchannel, scale = abs(min, max) / 127 */
    if (internal){
        // TODO
        for (int ch = 0; ch < channel_num; ch++){
            weight_scale_list[ch] = weight_tensor->scale_list[ch];
            weight_zp_list[ch] = 0;}
    }
    else{
        for (int ch = 0; ch < channel_num; ch++){
            float* weight_data_ch_start = weight_data + ch * cstep;
            float* weight_data_ch_end = weight_data + (ch + 1) * cstep;
            float weight_max = *std::max_element(weight_data_ch_start, weight_data_ch_end);
            float weight_min = *std::min_element(weight_data_ch_start, weight_data_ch_end);

            weight_scale_list[ch] = std::max(std::abs(weight_max), std::abs(weight_min)) / 127.f;
            weight_zp_list[ch] = 0;
            fprintf(fp_weight, "%8.8f ", weight_scale_list[ch]);
        }
        fprintf(fp_weight, "\n");
    }

    /* quantize the value of weight from Float32 to Int8, value_i8 = (value_fp32 / scale).round().clip(-127, 127) */
    for (int ch = 0; ch < channel_num; ch++){
        for (int j = 0; j < cstep; j++){
            if (weight_data[ch * cstep + j] == 0 || weight_scale_list[ch] == 0)
                i8_weight_data[ch * cstep + j] = 0;
            else{
                float int8_data = round(weight_data[ch * cstep + j] / weight_scale_list[ch]);
                int8_data = int8_data > 127.f ? 127.f : int8_data;
                int8_data = int8_data < -127.f ? -127.f : int8_data;
                i8_weight_data[ch * cstep + j] = int8_t(int8_data);}}
    }

    weight_tensor->scale_list = weight_scale_list;
    weight_tensor->zp_list = weight_zp_list;
    weight_tensor->data_type = TENGINE_DT_INT8;
    weight_tensor->elem_size = sizeof(int8_t); // int8, signed char
    weight_tensor->data = i8_weight_data;
    weight_tensor->quant_param_num = channel_num;
}

  Bias int32 quantization:

/* quantize the weight data from fp32 to int32 */
if (noden->input_num > 2){
    struct tensor* input_tensor = ir_graph->tensor_list[noden->input_tensors[0]];
    struct tensor* bias_tensor = ir_graph->tensor_list[noden->input_tensors[2]];
    float* bias_scale_list = (float*)sys_malloc(bias_tensor->dims[0] * sizeof(float));
    int* bias_zp_list = (int*)sys_malloc(bias_tensor->dims[0] * sizeof(int32_t));
    float* bias_data = (float*)bias_tensor->data;
    int* int32_bias_data = (int*)sys_malloc(bias_tensor->elem_num * sizeof(int32_t));

    int bstep = int(bias_tensor->elem_num / channel_num);

    fprintf(fp_bias, "%s ", bias_tensor->name);

    /* calculate the quant scale value of bias perchannel, scale = scale_weight * scale_in */
    for (int ch = 0; ch < channel_num; ch++){
        bias_scale_list[ch] = weight_scale_list[ch] * input_tensor->scale;
        bias_zp_list[ch] = 0;
        fprintf(fp_bias, "%8.8f ", bias_scale_list[ch]);
    }
    fprintf(fp_bias, "\n");

    /* quantize the value of bias from Float32 to Int32, value_i32 = (value_fp32 / scale).round() */
    for (int ch = 0; ch < channel_num; ch++){
        for (int bi = 0; bi < bstep; bi++){
            if (bias_data[ch * bstep + bi] == 0 || bias_scale_list[ch] == 0)
                int32_bias_data[ch * bstep + bi] = 0;
            else
                int32_bias_data[ch * bstep + bi] = int(round(bias_data[ch * bstep + bi] / bias_scale_list[ch]));}
    }

    bias_tensor->scale_list = bias_scale_list;
    bias_tensor->zp_list = bias_zp_list;
    bias_tensor->data_type = TENGINE_DT_INT32;
    bias_tensor->elem_size = sizeof(int32_t); // int32, signed int
    bias_tensor->data = int32_bias_data;
    bias_tensor->quant_param_num = channel_num;
}

  That covers the quantization of weights & biases.

  The above has walked through the implementation of the min-max quantization algorithm, using Tengine as the main example. I hope this sharing is of some help to your learning.

