C++ Notes Sharing - Data Structure II - Hash Table

Posted by argrafic on Sat, 19 Feb 2022 02:43:16 +0100



What is a hash table?

A binary search tree finds a value by comparing the target with the current node, starting from the root and walking downward until the target is found or a null node is reached; the time complexity is O(log n). In a hash table, each value is bound to a key: feeding the key into the hash function yields the storage address of the target value directly. Ignoring hash conflicts, the time complexity is O(1).
A hash table can be understood as a special kind of array. An ordinary array stores its elements one after another from subscript 0 up to len-1, so their addresses are contiguous. A hash table uses an array larger than the data it holds; the hash function maps each key to an address in that array, so the occupied slots are not contiguous.

For example, suppose we have an integer array int nums[20]. Accessing nums[3] takes two steps:
1. Get the first address of the nums array.
2. Offset from that address by sizeof(int)*3 bytes to reach nums[3] and read the element.

Now suppose we have a hash table hashmap<string, int> myhash that stores people's heights:
myhash["Zhang San"] = 175 and myhash["Li Si"] = 180 mean Zhang San is 175 cm tall and Li Si is 180 cm tall.
Getting Zhang San's height with myhash["Zhang San"] takes two steps:
1. index = H("Zhang San"): the hash function H maps the key to an index into the array inside the hash table.
2. Access the memory area that index points to and read Zhang San's height.

Methods for constructing a hash function:

1. Digit analysis method; 2. mid-square method; 3. folding method; 4. division-and-remainder method. Method 4 is the most commonly used. However, no matter which method is chosen, different keys can still be mapped to the same address by the hash function. This situation is called a hash conflict.

Points to consider when designing a hash function:
1. The time needed to compute the hash address (the hash function itself should not be too complicated).
2. The length of the keyword; if the key is long, consider the folding method or the division-and-remainder method.
3. The length of the hash table.
4. Whether the keys are distributed uniformly and whether they follow a pattern.
5. Under the conditions above, the designed hash function should minimize conflicts.

For a detailed explanation of the construction methods, see: Data Structure - Hash Table
Link from: CSDN clear ice

Resolving hash conflicts:

No matter how ingeniously the hash function is designed, some special keys will always lead to hash conflicts, especially in dynamic lookup tables.

Common methods for resolving hash conflicts:
1. Open addressing: for a conflicting key, repeatedly compute Hi = (H(key) + di) MOD m until a non-conflicting slot is found.
2. Chaining (separate chaining): keys that hash to the same index are linked into a list built at that index, so all conflicting data hang off the same bucket.
3. Public overflow area: set aside a dedicated storage area for conflicting data; this suits situations with little data and few conflicts.
4. Rehashing: prepare several hash functions; if the first hash function conflicts, use the second; if the second also conflicts, use the third, and so on.

We focus on 1 (open addressing) and 2 (chaining).

Insertion and rehash of hash table elements

First we need a concept, the load factor: loadFactor = n/m, the number of stored elements divided by the table length.

1. When loadFactor <= 1, the expected lookup complexity of the hash table is O(1). Therefore, every time we add an element to the hash table, we must ensure loadFactor < 1 before adding.
2. The larger the load factor, the longer the average search. So is a smaller load factor α always better? No: a table of length 100 holding only one element searches very fast but wastes space. α = 0.75 is usually considered the best combined use of time and space.

rehash refers to adjusting the hash table's space and hash function when, after inserting an element, the load factor exceeds the threshold.
Many hash map implementations (Java's HashMap, for example) rehash when the load factor reaches 0.75; note that C++'s std::unordered_map instead rehashes when the load factor would exceed max_load_factor(), which defaults to 1.0. During a rehash, a new bucket array twice the size of the original one is allocated, and every element of the original bucket array is re-hashed into the new one. Because the size of the new array has changed, a new hash function (a new modulus) is needed, and each existing key and value must be mapped again into the new array.

Deletion of hash table elements:

When deleting an element of a hash table, if chaining is used to handle conflicts, the element can simply be removed from its bucket's list. With open addressing this will not work: there may have been conflicts earlier, and if you physically delete element 1 of a probe chain, the conflicting elements probed after it can no longer be found. So the element cannot be deleted directly; instead the slot is typically marked as deleted (lazy deletion).

Extension questions:

Q: why should the number of buckets in a hash table be prime?

First, be clear on a concept: the number of buckets of a hash table refers to the value p in the hash function H(key) = key % p constructed by the division-and-remainder method. It is neither the number of stored elements nor the length of the hash table. Many articles blur this concept, so it must be stated clearly.
For division and remainder method, please refer to: Data structure Hash table (Hash table)
Link from: CSDN clear ice

A: when the hash function is built with the division-and-remainder method, a key is processed by taking it modulo p, and the remainder is the storage address. If p is a composite number, the probability of address collisions is higher, which hurts the hash table's efficiency; choosing a prime reduces the collision probability, spreads the entries evenly over the table, and gives higher search efficiency.

To be continued.

Topics: C++ Algorithm data structure hash