[Model Inference] Quantization Implementation Sharing 2: Detailed Explanation of the KL Symmetric Quantization Algorithm

Posted by nkzle on Fri, 17 Dec 2021 19:00:27 +0100



  Hello, I'm Jizhi Horizon. This article analyzes the implementation of the KL symmetric quantization algorithm, taking Tengine's implementation as an example.

  An earlier article, "[Model Inference] Quantization Implementation Sharing 1: Detailed Explanation of the min-max Symmetric Quantization Algorithm", has already covered the basics; interested readers can refer to it. This is the sequel to that article and the second one on quantization implementation.

  I won't go over the quantization background again, since it has been covered at length in previous articles. Let's get started directly.

1. KL quantization principle

  KL quantization is a quantization method that uses the KL divergence to measure the similarity between the real data distribution and the quantized data distribution. It is the quantization strategy NVIDIA TensorRT adopts for activation values. The main logic of KL quantization is as follows:

[Figure 1: activation distribution with saturation threshold |T|; values beyond ±|T| map to ±127]

  • Unlike MIN-MAX, KL does not map [min, max] directly to [-127, 127]. Instead, it searches for a threshold |T| < max(|max|, |min|) and maps [-T, T] to [-127, 127]. The assumption is that as long as the threshold is chosen properly, values beyond it can be discarded without a large loss of accuracy;
  • Values beyond the threshold ±|T| are mapped directly to the threshold. For example, the three red dots in the figure above are mapped directly to -127. This mapping relationship is called saturation.

  The KL quantization method abstracts the float32 value distribution and the int8 value distribution into two distributions, updates both of them with the threshold |T|, and measures their similarity with the KL divergence. The smaller the KL divergence, the more similar the two distributions are, which means that threshold |T| is the better choice. For symmetric quantization, the Scale can then be computed from this threshold, while the Zero_point is always zero.
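  To make the saturation mapping concrete, here is a minimal sketch (not Tengine code; the helper name is mine) of deriving a symmetric scale from a chosen threshold T and applying it with clipping; values outside ±T saturate to ±127:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Minimal sketch: symmetric quantization with a saturation threshold T.
// Anything outside [-T, T] is clipped to the edge (saturation); Zero_point stays 0.
int8_t quantize_symmetric(float x, float T)
{
    const float scale = T / 127.0f;
    float q = std::round(x / scale);
    q = std::min(127.0f, std::max(-127.0f, q)); // saturate out-of-range values
    return static_cast<int8_t>(q);
}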

  The figure below shows the pseudocode of the KL divergence calibration in TensorRT, which also neatly summarizes the whole KLD quantization process. (Note: the figure below is Figure 2 and will be referred to later.)

[Figure 2: TensorRT KL divergence calibration pseudocode]
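  Since the figure itself is not reproduced here, the following is a paraphrase of that calibration pseudocode as commonly presented in NVIDIA's public TensorRT calibration material; treat it as a reference sketch rather than a verbatim reproduction of Figure 2:

Input: FP32 histogram 'bin' with 2048 bins

for i in range(128, 2048):
    reference_distribution_P = [bin[0], ..., bin[i-1]]
    outliers_count = sum(bin[i], ..., bin[2047])
    reference_distribution_P[i-1] += outliers_count    # fold outliers into the edge bin
    P /= sum(P)                                        # normalize

    candidate_distribution_Q = quantize [bin[0], ..., bin[i-1]] into 128 levels
    expand candidate_distribution_Q back to i bins
    Q /= sum(Q)                                        # normalize

    divergence[i] = KL_divergence(reference_distribution_P, candidate_distribution_Q)

Find the index m for which divergence[m] is minimal
threshold = (m + 0.5) * (width of a bin)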

2. KL quantization implementation

  This section describes the implementation of KL quantization in Tengine.

  The main process is as follows:

  (1) Activation quantization: first compute min and max, then search with the KL strategy to generate the activation calibration table. fp32 to int8;

  (2) Weight quantization: uses the min-max quantization strategy. fp32 to int8;

  (3) Bias quantization: uses the activation quantization scale to quantize to int32. fp32 to int32;

  Weight and bias quantization has one more step than activation quantization: besides computing the Scale, the Scale is also applied to the values to quantize them directly and generate the int8 tmfile.
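  As a rough sketch of what applying the Scale directly to the values means (a hypothetical helper, not Tengine's actual code), per-tensor weight quantization to int8 could look like this; bias quantization proceeds analogously but stores int32 values:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch only: directly quantize an fp32 weight buffer to int8 with a given scale.
std::vector<int8_t> quantize_weights(const std::vector<float>& w, float scale)
{
    std::vector<int8_t> out(w.size());
    for (size_t i = 0; i < w.size(); i++)
    {
        float q = std::round(w[i] / scale);
        q = std::min(127.0f, std::max(-127.0f, q));
        out[i] = static_cast<int8_t>(q);
    }
    return out;
}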

  The main code implementing KL quantization in Tengine is as follows:

case ALGORITHM_KL:{
    if (quant_tool.scale_file.empty()){
        quant_tool.scale_file = "table_kl.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* Evaluate quantitative losses */
    if (quant_tool.evaluate){
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

  The main quantization search interfaces are quant_tool.activation_quant_tool() and save_graph_i8_perchannel. For KL quantization, these two interfaces do two things respectively:

  (1) activation quantization, generating table_kl.scale;

  (2) weight & bias quantization, generating scale_weight.txt, scale_bias.txt, and the int8 tmfile;

  Since the min/max computation inside activation quantization and the weight & bias quantization process follow the same logic and share the same code as MIN-MAX quantization, they are not covered again here; interested readers can refer to "[Model Inference] Quantization Implementation Sharing 1: Detailed Explanation of the min-max Symmetric Quantization Algorithm". This article focuses on the KL search strategy used in activation quantization.

  The entry point of the KL quantization search strategy is here:

quant_tool.activation_quant_tool();

  Min and max are computed first, mainly using the std::max_element and std::min_element interfaces, which will not be discussed here. After obtaining min and max, the KL search strategy starts.
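  A minimal sketch of that min/max step, assuming a plain float buffer rather than Tengine's tensor structures:

#include <algorithm>

// Sketch: per-tensor min/max of an activation buffer using the STL.
void tensor_min_max(const float* data, int elem_num, float& min_v, float& max_v)
{
    min_v = *std::min_element(data, data + elem_num);
    max_v = *std::max_element(data, data + elem_num);
}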

2.1 building the probability histogram

  The first round builds the probability histogram and runs the first round of the KL computation. Later rounds do not rebuild the histogram from scratch; they accumulate onto the histogram built in the first round. Therefore, the more calibration images are used, the closer the final probability histogram gets to the real distribution.

/* calculate hist */
uint32_t inum = 0;
for (int i = 0; i < ir_graph->tensor_num; i++)
{
    struct tensor* ir_tensor = ir_graph->tensor_list[i];
    if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT)
    {
        /* the per-tensor absolute max defines the histogram range */
        float step_max = std::abs(max_activation[i]);
        if (std::abs(min_activation[i]) > step_max)
            step_max = std::abs(min_activation[i]);
        float step_bin = step_max / 2048.0f;

        std::vector<float> every_edge;
        if (nums == imgs_list.size() - 1)
        {
            /* build the bin edges and push a new histogram for this tensor */
            for (int j = 0; j < 2048; j++)
            {
                float edge_float = (step_bin * (j + 0.5f));
                every_edge.push_back(edge_float);
            }
            hist_edge.push_back(every_edge);
            hist_gram.push_back(histCount((float*)ir_tensor->data, ir_tensor->elem_num, step_max));
        }
        else
        {
            /* accumulate onto the histogram already built for this tensor */
            std::vector<uint32_t> hist_tmp;
            hist_tmp = histCount((float*)ir_tensor->data, ir_tensor->elem_num, step_max);
            for (int j = 0; j < 2048; j++)
                hist_gram[inum][j] += hist_tmp[j];
        }
        tensor_hist[i] = inum;
        hist_tensor[inum] = i;
        inum++;
    }
}

  Take a look at the histCount interface:

std::vector<uint32_t> histCount(float* data, uint32_t elem_num, float abs_max)
{
    /* absolute-value histogram with 2048 bins over [0, abs_max]; exact zeros are skipped */
    float bin_scale = abs_max / 2047.f;
    int bin_zp = 0;
    std::vector<uint32_t> hist(2048);
    for (int i = 0; i < elem_num; i++)
    {
        if (data[i] != 0)
        {
            uint32_t hist_idx = round(std::abs(data[i]) / bin_scale);
            hist[hist_idx]++;
        }
    }
    return hist;
}

  Finally, normalize the obtained probability histogram:

distribution = normalize_histogram(distribution_in);

  The implementation of histogram normalization is also very simple:

std::vector<float> normalize_histogram(std::vector<uint32_t>& histogram)
{
    std::vector<float> histogram_out(histogram.size());
    const size_t length = histogram.size();
    float sum = 0;

    /* note: bin 0 is skipped in both the sum and the output */
    for (size_t i = 1; i < length; i++)
        sum += histogram[i];

    for (size_t i = 1; i < length; i++)
        histogram_out[i] = float(histogram[i] / sum);

    return histogram_out;
}
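  As a quick usage sketch (assuming the histCount and normalize_histogram helpers above are visible in the same file), bucketing a toy activation buffer and normalizing it looks like this:

#include <cstdint>
#include <cstdio>
#include <vector>

// Assumes histCount() and normalize_histogram() from above are declared in this file.
int main()
{
    std::vector<float> acts = {0.1f, -0.4f, 0.4f, 1.2f, -2.0f, 2.0f};
    float abs_max = 2.0f; // max(|min|, |max|)

    std::vector<uint32_t> hist = histCount(acts.data(), acts.size(), abs_max);
    std::vector<float> dist = normalize_histogram(hist);

    printf("edge bin (2047) probability: %f\n", dist[2047]); // where |x| == abs_max lands
    return 0;
}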

2.2 calculating P

  The following logic requires looking back at Figure 2: first compute P, then compute Q, and finally compute the KL divergence.

  First compute the simulated quantization distribution P. The search runs incrementally with the threshold going from target_bin = 128 up to 2048, and the part that overflows is folded into the edge bin. P can be regarded as the fp32 data distribution before quantization, that is, the real distribution:

// get P
fill(quantize_distribution.begin(), quantize_distribution.end(), 0.0f);
const float num_per_bin = static_cast<float>(threshold) / static_cast<float>(target_bin);

for (int i = 0; i < target_bin; i++)
{
    /* source-bin range [start, end) covered by target bin i */
    const float start = static_cast<float>(i) * num_per_bin;
    const float end = start + num_per_bin;

    /* fractional part of the left boundary bin */
    const int left_upper = static_cast<int>(ceil(start));
    if (static_cast<float>(left_upper) > start)
    {
        const float left_scale = static_cast<float>(left_upper) - start;
        quantize_distribution[i] += left_scale * distribution[left_upper - 1];
    }

    /* fractional part of the right boundary bin */
    const int right_lower = static_cast<int>(floor(end));
    if (static_cast<float>(right_lower) < end)
    {
        const float right_scale = end - static_cast<float>(right_lower);
        quantize_distribution[i] += right_scale * distribution[right_lower];
    }

    /* bins fully inside the range */
    for (int j = left_upper; j < right_lower; j++)
        quantize_distribution[i] += distribution[j];
}
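  To see what the fractional boundary handling does, here is a hand-traced toy case (numbers invented purely for illustration): merging a 3-bin distribution {d0, d1, d2} (threshold = 3) into target_bin = 2 gives num_per_bin = 1.5, and the loop above produces:

// Hand trace of the merge with threshold = 3, target_bin = 2, num_per_bin = 1.5:
//   i = 0 covers [0.0, 1.5):  P[0] = d0 + 0.5 * d1   (all of bin 0, left half of bin 1)
//   i = 1 covers [1.5, 3.0):  P[1] = 0.5 * d1 + d2   (right half of bin 1, all of bin 2)
// Each source bin's mass is split in proportion to how much of it falls inside the target bin.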

2.3 calculating Q

  Then compute the real quantization distribution Q, alongside P, with the threshold again going from target_bin = 128 up to 2048. Q can be regarded as the int8 data distribution after quantization, that is, the quantized distribution:

// get Q
std::vector<float> expand_distribution(threshold, 0);
for (int i = 0; i < target_bin; i++)
{
    const float start = static_cast<float>(i) * num_per_bin;
    const float end = start + num_per_bin;
    float count = 0;

    /* count how many non-empty source bins fall into target bin i */
    const int left_upper = static_cast<int>(ceil(start));
    float left_scale = 0;
    if (static_cast<float>(left_upper) > start)
    {
        left_scale = static_cast<float>(left_upper) - start;
        if (distribution[left_upper - 1] != 0)
            count += left_scale;
    }

    const int right_lower = static_cast<int>(floor(end));
    float right_scale = 0;
    if (static_cast<float>(right_lower) < end)
    {
        right_scale = end - static_cast<float>(right_lower);
        if (distribution[right_lower] != 0)
            count += right_scale;
    }

    for (int j = left_upper; j < right_lower; j++)
    {
        if (distribution[j] != 0)
            count++;
    }

    /* spread the merged mass evenly back over the non-empty source bins */
    const float expand_value = quantize_distribution[i] / count;

    if (static_cast<float>(left_upper) > start)
    {
        if (distribution[left_upper - 1] != 0)
            expand_distribution[left_upper - 1] += expand_value * left_scale;
    }
    if (static_cast<float>(right_lower) < end)
    {
        if (distribution[right_lower] != 0)
            expand_distribution[right_lower] += expand_value * right_scale;
    }
    for (int j = left_upper; j < right_lower; j++)
    {
        if (distribution[j] != 0)
            expand_distribution[j] += expand_value;
    }
}

2.4 calculating the KL divergence

  Next, compute the KL divergence between the real distribution P and the quantized distribution Q:

const float kl_divergence = compute_kl_divergence(t_distribution, expand_distribution);

  The interface for the KL divergence computation is also very simple:

float compute_kl_divergence(std::vector<float>& dist_a, std::vector<float>& dist_b)
{
    const size_t length = dist_a.size();
    float result = 0;

    /* KL(A || B) = sum_i a_i * log(a_i / b_i); bins where b_i == 0 are penalized with +1 */
    for (size_t i = 0; i < length; i++)
    {
        if (dist_a[i] != 0)
        {
            if (dist_b[i] == 0)
                result += 1;
            else
                result += dist_a[i] * log(dist_a[i] / dist_b[i]);
        }
    }
    return result;
}
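  A quick sanity check of this helper on two tiny hand-made distributions (again assuming compute_kl_divergence above is visible in the same file):

#include <cstdio>
#include <vector>

// Assumes compute_kl_divergence() from above is declared in this file.
int main()
{
    std::vector<float> p = {0.5f, 0.3f, 0.2f};
    std::vector<float> q = {0.4f, 0.4f, 0.2f};

    // KL(P || Q) = 0.5*log(0.5/0.4) + 0.3*log(0.3/0.4) + 0.2*log(0.2/0.2) ≈ 0.0253
    printf("KL divergence: %f\n", compute_kl_divergence(p, q));
    return 0;
}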

  Finally, we want the target_bin that minimizes the KL divergence. Since it is searched in the 128 --> 2048 loop, the implementation can be written as follows:

// the best num of bin
if (kl_divergence < min_kl_divergence)
{
    min_kl_divergence = kl_divergence;
    target_threshold = threshold;
}

  In this way we obtain the target_bin we are after, which is target_threshold here.
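  Putting the pieces together, the overall search has roughly the following skeleton (a structural sketch only; the P/Q construction and KL computation from the previous sections are elided as comments):

#include <cfloat>
#include <vector>

// Structural sketch of the KL threshold search over the normalized 2048-bin histogram.
// Not Tengine's verbatim loop; `distribution` feeds steps 1-3 below, which are elided here.
int search_kl_threshold(const std::vector<float>& distribution)
{
    const int target_bin = 128;
    float min_kl_divergence = FLT_MAX;
    int target_threshold = target_bin;

    for (int threshold = target_bin; threshold < 2048; threshold++)
    {
        // 1. build the reference distribution P from the first `threshold` bins
        //    (mass beyond the threshold is folded into the edge bin) and normalize;
        // 2. merge those bins into 128 levels, expand them back to `threshold`
        //    bins to get the quantized distribution Q, and normalize;
        // 3. measure their similarity with compute_kl_divergence(P, Q).
        float kl_divergence = 0.0f; // placeholder for steps 1-3 above

        if (kl_divergence < min_kl_divergence)
        {
            min_kl_divergence = kl_divergence;
            target_threshold = threshold;
        }
    }
    return target_threshold; // the bin whose edge becomes the activation threshold
}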

2.5 calculating the Scale

  After obtaining target_threshold, computing the Scale is straightforward; just do it directly:

float act_scale = hist_edge[i][threshold_bin] / fake_quant_set;    // fake_quant_set = 127
int act_zero_point = 0;

  Again, since this is symmetric quantization, only the Scale needs to be computed; the zero_point is always zero.

  Then we can save our activation quantization calibration table table_kl.scale. Once again, the subsequent weight & bias quantization is the same as MIN-MAX, and the MIN-MAX quantization method was introduced in the previous article, so it is not repeated here.

  The above completes the walkthrough of a practical KL divergence quantization implementation. I hope my sharing helps your learning a little.


Topics: C++ Algorithm AI Deep Learning tengine