Huffman tree sorting

Posted by ununium on Tue, 28 Dec 2021 11:49:57 +0100

Huffman coding

Huffman coding is mainly used in information compression. It is a coding method with high compression efficiency at present. When implementing Huffman coding, binary tree is used for implementation.

Here is a simple example of Huffman coding. For an article, extract all the words and the number of occurrences. Then how to encode these words according to the number of occurrences of these words, so as to make the final article composed of coding as short as possible.

Optimal binary tree

Before introducing how to realize Huffman coding, first learn and understand the optimal binary tree in the binary tree. Let's first define some of these nouns:

  1. Path: the branch from one node to another in the tree constitutes the path between the two nodes.
  2. Path length: the number of branches on the path.
  3. Tree path length: the sum of the path lengths from the tree root to each node.
  4. Weighted path length of a node: the product of the path length from the node to the tree root and the weight on the node.
  5. Weighted path length of tree( W P L WPL WPL): the sum of the weighted path lengths of all leaf nodes in the tree.

After that, the weighted path length of the tree is mainly used( W P L WPL WPL), here is a simple example.

In the above example W P L = 5 ∗ 2 + 3 ∗ 3 + 4 ∗ 2 = 27 WPL=5*2+3*3+4*2=27 WPL=5∗2+3∗3+4∗2=27.

Next, the definition of optimal binary tree is given, assuming that the binary tree is composed of n n n leaves, each leaf node with weight w i w_i wi, then the weighted path length W P L WPL The smallest binary tree of WPL is called optimal binary tree or Huffman tree.

As defined above, and W P L WPL The definition of WPL shows that for the optimal binary tree, the lower the weight value of the leaf node, the greater its depth, that is, it should be at the bottom, while the higher the weight value of the leaf node, the smaller its depth. Only in this way can it be guaranteed W P L WPL WPL should be as small as possible.

Huffman tree structure

When constructing Huffman tree, first of all, our original data only has the symbol represented by each leaf node and the corresponding weight value. At the same time, we also know that we should put the lower weight in the deeper position and the higher weight in the shallower position.

Referring to the previous establishment process of binary tree, it is generally established from top to bottom. However, Huffman tree is just the opposite. It is built from bottom to top, because to meet the needs of location mentioned earlier, the establishment process of Huffman tree is as follows.

  1. According to the given n n n weights ( w 1 , w 2 , . . . , w n ) (w_1,w_2,...,w_n) (w1, w2,..., wn) composition n n Set of n binary trees F = { T 1 , T 2 , . . . , T n } F=\left\{T_1,T_2,...,T_n \right\} F={T1, T2,..., Tn}, where each binary tree T i T_i Only one of Ti  with weight is w i w_i The root node of wi , and the left and right subtrees are empty.
  2. stay F F In F, select the tree with the smallest weight of the two root nodes as the left and right subtrees to construct a new binary tree, and set the weight of its root node as the sum of the weight of the root nodes of its left and right subtrees
  3. stay F F Delete the two trees in F and add the new binary tree F F In F.
  4. Repeat 2 and 3 until F F F contains only one tree.

According to the above process, multiple binary trees can be combined into a binary tree. Here, the above algorithm is simply executed through several examples.

As for the code implementation part, the static linked list is used to implement it, but it is generally divided into two structures, the first is the node in the tree, and the second is the whole Huffman tree.

For each node, since the post order encoding process is from bottom to top and the decoding process is from top to bottom, for each node, it is necessary to know who its left and right children and parents are, as well as the encoding, original data, weight and other information. Therefore, the node class is defined as follows.

#include <iostream>
#include <string>
using namespace std;

// node
typedef struct HuffmanNode
{
    char data;  // Coded symbol
    int weight; // Weight, frequency
    int left;   // Left child
    int right;  // Right child
    int parent; // parents 
    string code;    // Coded symbol

    // Constructor
    HuffmanNode()
    {
        // cout << "node" << endl;
        weight = -1;
        left = -1;
        right = -1;
        parent = -1;
        data = '#';
        code = "";
    }
}HuffmanNode;

In the implementation of Huffman tree, static linked list is used, that is, array is used to save each node. Suppose at the beginning n n n leaf nodes, then according to the previous algorithm, two nodes are combined into one, and a total of two leaf nodes will be generated n − 1 n-1 n − 1 new node, so there will be 2 n − 1 2n-1 2n − 1 node. So array development 2 n − 1 2n-1 2n − 1 space is enough. During initialization, the front n n n spaces are used to store leaf nodes, and the subsequent spaces are stored in order according to the generated nodes.

// tree
typedef struct HuffmanTree
{
    string info;    // Original information
    HuffmanNode* tree;  // Huffman tree
    int num;    // Number of leaf nodes

    // Constructor
    HuffmanTree()
    {
        info = "";
        tree = NULL;
        num = 0;
    }
}

In the previous algorithm, we need to find the two root nodes with the lowest weight each time, and then merge the two binary trees. Here, the small one is on the left and the large one is on the right. In the search process, use directly C + + C++ References in C + + and simple violent traversal of existing nodes. However, it should be noted that you need to find nodes that have not been added, that is, nodes without parents.

// Find the node with the smallest weight and the second smallest weight
    void Select2Min(int pos, int& min1, int& min2)
    {
        int w1 = 1000000;
        // Find the node with the lowest weight
        for (int i = 0; i < pos; i++)
        {
            // cout << tree[i].parent << " " << tree[i].weight << endl;
            // No parents and less weight
            if (tree[i].parent == -1 && w1 > tree[i].weight)
            {
                w1 = tree[i].weight;
                min1 = i;
            }
        }

        int w2 = 1000000;
        // Find the node with the second smallest weight
        for (int i = 0; i < pos; i++)
        {
            if (i == min1)  // Avoid the same as before
                continue;

            if (tree[i].parent == -1 && w2 > tree[i].weight)
            {
                w2 = tree[i].weight;
                min2 = i;
            }
        }
    }

After finding the subscript of the two nodes with the lowest weight, next you need to generate a new node, set the parents of the two nodes as the subscript of the new node according to the operation, set the relationship between the left child and the right child of the new node, and update the weight value of the new node.

// Establish Huffman tree
    void createTree(char element[], int weights[], int n)
    {
        if (n < 2)  // Prevent array out of bounds
            return;

        // initialization
        num = n;    // Quantity to be coded
        tree = new HuffmanNode[2 * n - 1];    // Open up space
        if (tree == NULL)   // Successful application
            return;
        for (int i = 0; i < n; i++) // Initialize leaf node first
        {
            tree[i].data = element[i];
            tree[i].weight = weights[i];
            // cout << tree[i].data << " " << tree[i].weight << endl;
        }

        for (int i = n; i < 2 * n - 1; i++)
        {
            int min1 = -1;  // Minimum node
            int min2 = -1;  // The second small node
            Select2Min(i, min1, min2);  // Select the smallest and second smallest nodes
            // cout << min1 << " " << min2 << endl;
            tree[i].weight = tree[min1].weight + tree[min2].weight; // Weight calculation
            // Relationship confirmation
            tree[min1].parent = i;  
            tree[min2].parent = i;
            tree[i].left = min1;
            tree[i].right = min2;
        }
    }

Huffman tree coding

After the construction and establishment of Huffman tree, the coding process is not completed. In general, what we want is to encode the leaf nodes. In the Huffman tree, we construct a total tree and save all the nodes to be encoded in the total tree in the form of leaf nodes. Then, when coding, start from the root node and traverse the left child once, the coding will be added " 0 " "0" "0". If the child traverses to the right once, the code will be added " 1 " "1" "1".

For convenience, the coding is usually traversed from the leaf node from bottom to top, and each leaf node can go up. However, the code obtained in this way is reversed. Finally, you need to transpose the obtained code once.

// Calculate leaf code
    void getCode()
    {
        if (tree == NULL)   // Pointer is not null
            return;

        // Coding node by node
        for (int i = 0; i < num; i++)
        {
            int now = i;    // Child node
            int par = tree[i].parent;   // Parent node
            while (par != -1)   // All the way up to the root node
            {
                if (now == tree[par].left)  // If the child node is the left child of the parent node, add 0
                    tree[i].code += "0";
                if (now == tree[par].right) // If the child node is the right child of the parent node, add 1
                    tree[i].code += "1";
                now = par;  // Update node and parent node
                par = tree[par].parent;
            }
            reverse(tree[i].code.begin(), tree[i].code.end());  // String inversion
            // cout << tree[i].data << " " << tree[i].code << endl;
        }
    }

After obtaining the code, the next step is the example, that is, coding according to a string of information entered by the user. This implementation is very simple. Search directly according to the words in the information. After finding the corresponding leaf node, add its code.

// code
    string encode(string s)
    {
        if (tree == NULL)   // Pointer is not null
            return "NULL";
        string ans = "";    // Coding results
        // Character by character encoding
        for (int i = 0; i < (int)s.size(); i++)
        {
            // Match according to each character, find the corresponding code and add it
            for (int j = 0; j < num; j++)
            {
                if (tree[j].data == s[i])   // Judge whether the leaf node character matches the current character
                    ans += tree[j].code;
            }
        }
        return ans;
    }

Huffman tree decoding

For the decoding process, that is to analyze the original meaning according to a string of codes input by the user. First, ignore the error. Assuming that the encoding is correct, the decoding starts from the root node. If an error occurs 0 0 0 traverses the left child of the current node. If encountered 1 1 1 traverses the right child of the current node. If a leaf node is encountered during traversal, it means that a symbol has been successfully parsed, and then return to the root node to continue the previous encoding and decoding.

Next, consider the error. First, after turning left or right, there is no node, that is, a non-existent node is accessed. At this time, it indicates that there is an error in the intermediate code. After that, there is redundant data or less data at the end. In this case, the intermediate coding is normal. However, after all codes are parsed, it is found that the current pointer does not point to the root node, which indicates that the last part of the codes are not parsed successfully.

// decode
    string decode(string s)
    {
        if (tree == NULL)   // Pointer is not null
            return "NULL";

        string ans = "";    // Decoding result
        int index = 2 * num - 2;    // Decoding from the root node down
        // Decode character by character
        for (int i = 0; i < (int)s.size(); i++)
        {
            if (s[i] == '0')    // Turn left at 0
                index = tree[index].left;
            if (s[i] == '1')    // Turn right at 1
                index = tree[index].right;

            if (index == -1)    // If a nonexistent node is accessed, it indicates decoding error
                return "error";

            // If a leaf node is encountered, a character is decoded successfully, and then traversal starts from the root node again
            if (tree[index].left == -1 && tree[index].right == -1)
            {
                ans += tree[index].data;
                index = 2 * num - 2;
            }
                
        }
        if (index != 2 * num - 2)   // Redundant encoding at the end is also a decoding error
            return "error";
        return ans;

Topics: Algorithm data structure Binary tree