Efficiency problems of the C++ standard library in some scenarios
In the scenario described below, the C++ standard library containers unordered_map, map, multiset, and unordered_multiset are not efficient.
Group repetition count
There are 100 million records with two columns, group_id and attribute, and the records are sorted by group_id. The task is to count, for each row, how many times its attribute value has occurred so far within the same group_id, producing a new result column. For example:
group_id | attribute | result |
---|---|---|
G001 | A | 1 |
G001 | A | 2 |
G001 | B | 1 |
G002 | C | 1 |
G002 | B | 1 |
G002 | A | 1 |
G002 | B | 2 |
Performance test code
The following code is used to test the performance of std::map, std::unordered_map, std::multiset, and std::unordered_multiset; the complete benchmark program is listed at the end of this article.
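The core measurement loop for the map-based containers is the stl_map template shown here, excerpted (with light comments) from the full listing at the end; the multiset variant differs only in that it inserts the value into the set and reads the count with count().

```cpp
// Excerpt from the full listing: count occurrences per group with any
// map-like container (std::map or std::unordered_map).
template <typename TMap>
void stl_map(const std::vector<std::string>& group_ids,
             const std::vector<std::string>& attrs,
             std::vector<int>& result) {
    TMap occurs;
    std::string group_id;
    for (size_t i = 0; i < group_ids.size(); ++i) {
        if (group_id != group_ids[i]) {
            occurs.clear();              // new group: discard previous counts
            group_id = group_ids[i];
        }
        auto value = attrs[i];
        if (occurs.find(value) == occurs.end()) {
            occurs[value] = 1;
        } else {
            ++occurs[value];
        }
        result[i] = occurs[value];       // current repetition count of this value
    }
}
```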
Performance test results
Running the Release build several times yields the following results:
Implementation | Time to process 100 million records (Release build) |
---|---|
std::unordered_multiset | 15.0s |
std::multiset | 21.7s |
std::unordered_map | 3.9s |
std::map | 7.3s |
According to the above test results:
- The multiset variants are slower than the map variants; explaining this requires looking at their concrete implementations;
- The red-black-tree containers (map and multiset) are slower than the hash-based containers (unordered_map and unordered_multiset), which matches the algorithmic analysis, although constructing the hash table itself is also time-consuming.
Debug and Release builds perform very differently
Implementation | Release build, 100 million records | Debug build, 100 million records |
---|---|---|
std::unordered_multiset | 15.0s | 310s |
std::multiset | 21.7s | 357s |
std::unordered_map | 3.9s | 141s |
std::map | 7.3s | 189s |
The test data shows that the Debug build is roughly 20 times slower than the Release build.
Implementation of a hash map for the specific scenario
### Optimization ideas
The scenario in this article requires an associative container (a hash map or a multiset) to record the number of occurrences of each distinct value, and it has the following characteristics:
- The container does not need to be large: the number of rows sharing the same group_id is under 1000, so even if every attribute value is distinct the container never needs to hold more than 1000 elements;
- The container needs to be emptied (or destroyed and recreated) frequently, because whenever a new group_id is scanned the previous contents become useless;
- Each element is small, because the key is short and the mapped value is a UInt32 or UInt64.
Borrowing from the hash map in CH (ClickHouse), we can implement a hash map optimized for this specific scenario. The key optimizations are:
- Make the clear operation of the hash map cheap;
- Avoid allocating and freeing memory;
- Avoid copying memory when the container grows.
Speed up the clear operation
By introducing a version counter, clearing the container becomes an O(1) operation. A typical hash map's clear() destroys the elements one by one and then reclaims the memory; see, for example, the clear() of the doubly linked list std::list in the reference code at the end of this article.
The optimized hash map clears the container simply by executing version += 1: each element carries a version field, and any element whose version does not equal the current version is treated as nonexistent (already cleared).
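As a minimal illustration of the idea (a sketch only, not the full implementation shown below), a versioned cell and an O(1) clear can look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch of version-based clearing: a cell is "alive" only if its version
// matches the table's current version, so clear() is a single increment.
struct VersionedCell {
    std::string key;
    uint32_t count = 0;
    uint32_t version = 0;   // 0 means "never used"
};

struct VersionedTable {
    static constexpr std::size_t SIZE = 512;
    VersionedCell cells[SIZE];
    uint32_t current_version = 1;

    void clear() { ++current_version; }          // O(1): no element is touched
    bool alive(const VersionedCell& c) const {
        return c.version == current_version;     // stale cells count as empty
    }
};
```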
Avoid allocating and freeing memory
Define a local array inside the function so that it lives in stack memory. Allocating stack memory is very cheap: the compiler knows all the local variables of a function in advance and can compute the total size of the stack frame, so executing SP -= size effectively allocates the memory, and SP + <offset> is the address of each local variable.
Stack memory also needs no explicit release: when the function returns, SP += size effectively frees it.
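For illustration only (the function names here are made up for the example, and n is assumed to be between 0 and 1000), replacing a heap-allocated scratch buffer with a fixed-size local array removes one allocation and one deallocation per call:

```cpp
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// The heap version pays for an allocation (and zeroing) plus a free on every
// call; the stack version only adjusts the stack pointer.
int sum_first_n_heap(int n) {
    std::vector<int> scratch(1000);                  // heap allocation every call
    std::iota(scratch.begin(), scratch.end(), 1);    // scratch = 1, 2, ..., 1000
    return std::accumulate(scratch.begin(), scratch.begin() + n, 0);
}                                                    // heap free when the vector is destroyed

int sum_first_n_stack(int n) {
    int scratch[1000];                               // part of the stack frame; no allocator call
    std::iota(std::begin(scratch), std::end(scratch), 1);
    return std::accumulate(scratch, scratch + n, 0);
}                                                    // released implicitly when the frame is popped
```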
Avoid memory copy during capacity expansion
Because the scenario lets us estimate a maximum capacity in advance, we allocate that much memory on the stack once, so the table never needs to grow and never copies its contents.
Implementation code
```cpp
template <typename KeyTy>
struct OccurCountNode {
    KeyTy key;
    uint32_t count = 0;
    uint32_t version = 0;
};

void my_hash_map(const std::vector<std::string>& group_ids,
                 const std::vector<std::string>& attrs,
                 std::vector<int>& result) {
    const unsigned char DEGREE = 9;
    const size_t MAP_SIZE = 1UL << DEGREE;   // 512 slots
    const size_t MASK = MAP_SIZE - 1;
    const size_t TAIL = 10;                  // extra slots so linear probing can run past MAP_SIZE
    int version = 1;                         // current "generation"; bumping it clears the table
    // The whole table lives in stack memory and is never reallocated.
    OccurCountNode<std::string> occurs[MAP_SIZE + TAIL];
    auto hasher = std::hash<std::string>();
    auto emplace = [&hasher, &occurs, &version, MASK, MAP_SIZE, TAIL](
                       const std::string& key) -> OccurCountNode<std::string>* {
        size_t idx = hasher(key) & MASK;
        while (idx < MAP_SIZE + TAIL) {
            auto& item = occurs[idx];
            if (item.version != version) {   // stale cell: treat as empty and claim it
                item.version = version;
                item.key = key;
                item.count = 1;
                return &item;
            }
            if (item.version == version && item.key == key) {  // existing key: bump the count
                item.count++;
                return &item;
            }
            idx++;                           // collision: linear probing
        }
        return nullptr;
    };
    std::string group_id;
    for (size_t i = 0; i < group_ids.size(); ++i) {
        if (group_id != group_ids[i]) {
            version++;                       // O(1) clear when a new group starts
            group_id = group_ids[i];
        }
        auto value = attrs[i];
        OccurCountNode<std::string>* find_result = emplace(value);
        // assign occurs[value] to result
        result[i] = find_result->count;
    }
}
```
Performance test data
Implementation | Time to process 100 million records (Release build) |
---|---|
std::unordered_multiset | 15.4s |
std::multiset | 23.0s |
std::unordered_map | 4.0s |
std::map | 8.5s |
Hash map optimized for the specific scenario | 1.9s |
Why don't generic containers do this
Generic container classes must handle many different situations, so they carry work that is "unnecessary" in a specific scenario and costs performance there.
For example, std::unordered_map (in the standard library implementation used here) stores its elements in a doubly linked list. The advantage is that it never needs one large contiguous allocation and never copies the old elements to new memory when it grows; the cost is slower element access in some cases, and the current scenario is exactly such a case.
Generic containers also cannot keep their storage on the stack, because stack space is limited. In our specific scenario, however, we can guarantee an upper bound on the container size, which makes stack memory a good fit.
Full performance test code: STL containers and the hash table optimized for the specific scenario
The complete code is listed below, including the performance tests of the four STL associative containers and our hash map optimized for the specific scenario.
```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using namespace std::chrono;

// Count occurrences per group with a map-like container (std::map / std::unordered_map).
template <typename TMap>
void stl_map(const std::vector<std::string>& group_ids,
             const std::vector<std::string>& attrs,
             std::vector<int>& result) {
    TMap occurs;
    std::string group_id;
    for (size_t i = 0; i < group_ids.size(); ++i) {
        if (group_id != group_ids[i]) {
            occurs.clear();
            group_id = group_ids[i];
        }
        auto value = attrs[i];
        if (occurs.find(value) == occurs.end()) {
            occurs[value] = 1;
        } else {
            ++occurs[value];
        }
        // assign occurs[value] to result
        result[i] = occurs[value];
    }
}

// Count occurrences per group with a multiset-like container
// (std::multiset / std::unordered_multiset).
template <typename TMultiSet>
void stl_multiset(const std::vector<std::string>& group_ids,
                  const std::vector<std::string>& attrs,
                  std::vector<int>& result) {
    TMultiSet occurs;
    std::string group_id;
    for (size_t i = 0; i < group_ids.size(); ++i) {
        if (group_id != group_ids[i]) {
            occurs.clear();
            group_id = group_ids[i];
        }
        auto value = attrs[i];
        occurs.insert(value);
        // assign occurs[value] to result
        result[i] = occurs.count(value);
    }
}

template <typename KeyTy>
struct OccurCountNode {
    KeyTy key;
    uint32_t count = 0;
    uint32_t version = 0;
};

// Count occurrences per group with the scenario-optimized hash map described above.
void my_hash_map(const std::vector<std::string>& group_ids,
                 const std::vector<std::string>& attrs,
                 std::vector<int>& result) {
    const unsigned char DEGREE = 9;
    const size_t MAP_SIZE = 1UL << DEGREE;
    const size_t MASK = MAP_SIZE - 1;
    const size_t TAIL = 10;
    int version = 1;
    OccurCountNode<std::string> occurs[MAP_SIZE + TAIL];
    auto hasher = std::hash<std::string>();
    auto emplace = [&hasher, &occurs, &version, MASK, MAP_SIZE, TAIL](
                       const std::string& key) -> OccurCountNode<std::string>* {
        size_t idx = hasher(key) & MASK;
        while (idx < MAP_SIZE + TAIL) {
            auto& item = occurs[idx];
            if (item.version != version) {
                item.version = version;
                item.key = key;
                item.count = 1;
                return &item;
            }
            if (item.version == version && item.key == key) {
                item.count++;
                return &item;
            }
            idx++;
        }
        return nullptr;
    };
    std::string group_id;
    for (size_t i = 0; i < group_ids.size(); ++i) {
        if (group_id != group_ids[i]) {
            version++;
            group_id = group_ids[i];
        }
        auto value = attrs[i];
        OccurCountNode<std::string>* find_result = emplace(value);
        // assign occurs[value] to result
        result[i] = find_result->count;
    }
}

using OccursCountFunc = void (*)(const std::vector<std::string>& group_ids,
                                 const std::vector<std::string>& attrs,
                                 std::vector<int>& result);

long long test_perf(const std::vector<std::string>& group_ids,
                    const std::vector<std::string>& attrs,
                    std::vector<int>& result, OccursCountFunc func) {
    auto start = system_clock::now();
    func(group_ids, attrs, result);
    for (size_t i = 0; i < 10; ++i) {
        std::cout << result[i] << ", ";
    }
    auto end = system_clock::now();
    return duration_cast<milliseconds>(end - start).count();
}

int main() {
    const size_t DATA_SIZE = 100000000;
    std::vector<std::string> group_ids(DATA_SIZE);
    std::vector<std::string> attrs(DATA_SIZE);
    std::vector<int> result1(DATA_SIZE);
    std::vector<int> result2(DATA_SIZE);
    std::vector<int> result3(DATA_SIZE);
    std::vector<int> result4(DATA_SIZE);
    std::vector<int> result5(DATA_SIZE);
    // Prepare test data: 20 rows per group, attribute drawn from 5 values.
    char buf[50];
    std::string values[] = { "A", "B", "C", "D", "E" };
    for (size_t i = 0; i < DATA_SIZE; ++i) {
        snprintf(buf, 50, "G%010zu", (i / 20) + 1);
        group_ids[i] = std::string(buf);
        attrs[i] = values[rand() % 5];
    }
    // Performance test.
    std::cout << "Performance test starts\n";
    std::cout << "stl_hash_multiset - ";
    std::cout << "Cost: " << test_perf(group_ids, attrs, result1,
        stl_multiset<std::unordered_multiset<std::string>>) << std::endl;
    std::cout << "stl_multiset - ";
    std::cout << "Cost: " << test_perf(group_ids, attrs, result2,
        stl_multiset<std::multiset<std::string>>) << std::endl;
    std::cout << "stl_hash_map - ";
    std::cout << "Cost: " << test_perf(group_ids, attrs, result3,
        stl_map<std::unordered_map<std::string, int>>) << std::endl;
    std::cout << "stl_map - ";
    std::cout << "Cost: " << test_perf(group_ids, attrs, result4,
        stl_map<std::map<std::string, int>>) << std::endl;
    std::cout << "my_hash_map" << std::endl;
    std::cout << "Cost: " << test_perf(group_ids, attrs, result5, my_hash_map) << std::endl;
    std::cout << "Performance test ended\n";
    // Validate that all implementations produce the same results.
    for (size_t i = 0; i < DATA_SIZE; ++i) {
        if (result1[i] != result2[i] || result2[i] != result3[i] ||
            result3[i] != result4[i] || result4[i] != result5[i]) {
            std::cout << "Error: " << result1[i] << ", " << result2[i] << ", "
                      << result3[i] << ", " << result4[i] << ", " << result5[i] << std::endl;
        }
    }
}
```
CH's Hash Map
```cpp
// The HashTable template
template
<
    typename Key,
    typename Cell,
    typename Hash,
    typename Grower,
    typename Allocator
>
class HashTable :
    private boost::noncopyable,
    protected Hash,
    protected Allocator,
    protected Cell::State,
    protected ZeroValueStorage<Cell::need_zero_value_storage, Cell>  /// empty base optimization
{
    ...
};
```
As shown in the code, the HashTable template accepts five template parameters:
- Key - the key type
- Cell - the cell type that holds the mapped value
- Hash - hash function
- Grower - growth strategy
- Allocator - memory allocator
Clearable Hash Map
ClearableHashMap is a thin derived class that instantiates the HashTable template. The key point is that ClearableHashMapCell<Key, Mapped, Hash> is used as the Cell type; ClearableHashMapCell defines a version field, which is what makes the "fast clear" possible.
ClearableHashMapCell adds a version field on top of HashMapCell. Every time the ClearableHashMap is cleared, its version is incremented by 1, so any cell whose version field does not equal the current version is considered "cleared".
```cpp
template
<
    typename Key,
    typename Mapped,
    typename Hash = DefaultHash<Key>,
    typename Grower = HashTableGrower<>,
    typename Allocator = HashTableAllocator
>
class ClearableHashMap : public HashTable<Key, ClearableHashMapCell<Key, Mapped, Hash>, Hash, Grower, Allocator>
{
public:
    Mapped & operator[](const Key & x)
    {
        typename ClearableHashMap::LookupResult it;
        bool inserted;
        this->emplace(x, it, inserted);

        if (inserted)
            new (&it->getMapped()) Mapped();

        return it->getMapped();
    }

    void clear()
    {
        ++this->version;
        this->m_size = 0;
    }
};
```
Clearable Hash Map based on stack memory
The stack-memory-based clearable hash map is implemented through a special memory allocator, AllocatorWithStackMemory. The stack-memory variant is defined as follows:
```cpp
template <typename Key, typename Mapped, typename Hash, size_t initial_size_degree>
using ClearableHashMapWithStackMemory = ClearableHashMap<
    Key,
    Mapped,
    Hash,
    HashTableGrower<initial_size_degree>,
    HashTableAllocatorWithStackMemory<
        (1ULL << initial_size_degree)
        * sizeof(ClearableHashMapCell<Key, Mapped, Hash>)>>;
```
HashTableAllocatorWithStackMemory is in turn built on AllocatorWithStackMemory:
```cpp
/** We are going to use the entire memory we allocated when resizing a hash
  * table, so it makes sense to pre-fault the pages so that page faults don't
  * interrupt the resize loop. Set the allocator parameter accordingly.
  */
using HashTableAllocator = Allocator<true /* clear_memory */, true /* mmap_populate */>;

template <size_t initial_bytes = 64>
using HashTableAllocatorWithStackMemory = AllocatorWithStackMemory<HashTableAllocator, initial_bytes>;
```
AllocatorWithStackMemory keeps an array in stack memory (as part of the allocator object). If the requested size is no larger than that array, the address of the array is returned and the array itself serves as the "allocated" memory; if the requested size is larger, the allocation falls back to the usual path, heap memory obtained through the underlying Allocator template.
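A minimal sketch of this idea (a simplified stand-in with hypothetical names, not CH's actual AllocatorWithStackMemory interface) might look like this:

```cpp
#include <cstddef>
#include <cstdlib>

// Simplified "stack first, heap as fallback" allocator. CH's real
// AllocatorWithStackMemory wraps another allocator and has a richer interface;
// this only illustrates the core decision.
template <std::size_t StackBytes>
struct StackFirstAllocator {
    alignas(std::max_align_t) char buffer[StackBytes];  // lives wherever the allocator object lives (e.g. the stack)

    void* alloc(std::size_t size) {
        if (size <= StackBytes)
            return buffer;              // small request: hand out the embedded buffer
        return std::malloc(size);       // large request: fall back to the heap
    }

    void free(void* ptr, std::size_t size) {
        if (size > StackBytes)
            std::free(ptr);             // only heap allocations need an explicit free
    }
};
```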
Hasher
CH generally uses CityHash as the hash function.
HashTableGrower
HashTableGrower encodes the hash table's memory layout strategy: the table occupies one contiguous block of memory, the slot index is obtained by AND-ing the hash value with a mask, and on a collision the index is incremented by 1 (linear probing) until a free slot is found.
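A simplified sketch of such a grower (hypothetical names, loosely modeled on the approach of a power-of-two size, a mask, and +1 probing; not CH's exact interface):

```cpp
#include <cstddef>
#include <cstdint>

// Power-of-two growth/placement policy: size = 2^degree, place() masks the
// hash, next() probes linearly, and the table doubles when it gets too full.
struct SimpleGrower {
    uint8_t degree = 8;                                   // table size is 2^degree

    std::size_t size() const { return std::size_t(1) << degree; }
    std::size_t mask() const { return size() - 1; }
    std::size_t place(std::size_t hash) const { return hash & mask(); }      // initial slot from the hash
    std::size_t next(std::size_t pos) const   { return (pos + 1) & mask(); } // linear probing on collision
    bool overflow(std::size_t elems) const    { return elems > size() / 2; } // grow at 50% load
    void grow() { ++degree; }                             // doubling keeps the mask trick valid
};
```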
Reference code
clear() of the doubly linked list std::list (MSVC STL)
```cpp
void clear() noexcept { // erase all
    auto& _My_data = _Mypair._Myval2;
    _My_data._Orphan_non_end();
    _Node::_Free_non_head(_Getal(), _My_data._Myhead);
    _My_data._Myhead->_Next = _My_data._Myhead;
    _My_data._Myhead->_Prev = _My_data._Myhead;
    _My_data._Mysize = 0;
}

template <class _Alnode>
static void _Free_non_head(_Alnode& _Al, _Nodeptr _Head) noexcept {
    // free a list starting at _First and terminated at nullptr
    _Head->_Prev->_Next = nullptr;

    auto _Pnode = _Head->_Next;
    for (_Nodeptr _Pnext; _Pnode; _Pnode = _Pnext) {
        _Pnext = _Pnode->_Next;
        _Freenode(_Al, _Pnode);
    }
}

template <class _Alnode>
static void _Freenode(_Alnode& _Al, _Nodeptr _Ptr) noexcept {
    // destroy all members in _Ptr and deallocate with _Al
    allocator_traits<_Alnode>::destroy(_Al, _STD addressof(_Ptr->_Myval));
    _Freenode0(_Al, _Ptr);
}

template <class _Alnode>
static void _Freenode0(_Alnode& _Al, _Nodeptr _Ptr) noexcept {
    // destroy pointer members in _Ptr and deallocate with _Al
    static_assert(is_same_v<typename _Alnode::value_type, _List_node>, "Bad _Freenode0 call");
    _Destroy_in_place(_Ptr->_Next);
    _Destroy_in_place(_Ptr->_Prev);
    allocator_traits<_Alnode>::deallocate(_Al, _Ptr, 1);
}
```