Hash principle, conflicts, and solutions

Posted by jrmontg on Thu, 14 Oct 2021 05:03:24 +0200

In the Java programming language there are two basic storage structures: arrays and references (Java's analogue of pointers). All data structures can be built from these two primitives, and HashMap is no exception. Consider a program that puts a number of key-value pairs into a HashMap, as in the following snippet:

HashMap<String, Object> m = new HashMap<String, Object>();  
m.put("a", "rrr1");  
m.put("b", "tt9");  
m.put("c", "tt8");  
m.put("d", "g7");  
m.put("e", "d6");  
m.put("f", "d4");  
m.put("g", "d4");  
m.put("h", "d3");  
m.put("i", "d2");  
m.put("j", "d1");  
m.put("k", "1");  
m.put("o", "2");  
m.put("p", "3");  
m.put("q", "4");  
m.put("r", "5");  
m.put("s", "6");  
m.put("t", "7");  
m.put("u", "8");  
m.put("v", "9");

HashMap uses a so-called "hash algorithm" to determine the storage location of each element. When the program executes map.put(key, value), the system calls the key's hashCode() method to obtain its hash code — every Java object has a hashCode() method through which its hash code can be obtained. After obtaining the object's hash code, the system determines the storage location of the element according to that hash code. The source code is as follows:

public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            // Check whether the bucket at the computed index already contains an
            // Entry with the same hash and an equal key. If so, the new value
            // overwrites the old one and the old value is returned.
            // Two keys with the same hashCode map to the same index; if the keys
            // are not equal, a hash conflict occurs, and the bucket then stores an
            // Entry chain rather than a single Entry. The system can only traverse
            // the chain in order until it finds the Entry being searched for - if
            // that Entry sits at the end of the chain (i.e. it was the first Entry
            // put into the bucket), the loop must run to the end of the chain.
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                return oldValue;
            }
        }
        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

An important internal interface is used in the above program: Map.Entry. Each Map.Entry is in fact a key-value pair. As the program shows, when the system decides where to store a key-value pair in the HashMap, it does not consider the value at all; it computes the storage location of each Entry from the key alone. This confirms the earlier conclusion: we can treat the values in a Map as attachments to the keys — once the system has determined where a key is stored, the value is simply saved there. To demonstrate hash conflicts, I deliberately modified the HashMap program: the initial size of a HashMap is 16, but I put more than 16 elements into it and blocked its resize() method so it could not expand. At that point the underlying Entry[] table of the HashMap takes the chained structure described below (the figure from the original post is omitted here).
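
To make the conflict concrete, here is a small self-contained demo of my own (not from the original post). The strings "Aa" and "BB" are a well-known pair with identical hashCode values in Java, so they always map to the same bucket and form an Entry chain:

import java.util.HashMap;

// My own illustration of a hash conflict (not part of the original post).
public class CollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" are a classic colliding pair: both hash to 2112.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112

        HashMap<String, String> m = new HashMap<String, String>();
        m.put("Aa", "first");
        m.put("BB", "second");

        // Both keys share one bucket and live on one Entry chain, yet lookups
        // still succeed, because equals() tells the keys apart within the chain.
        System.out.println(m.get("Aa")); // first
        System.out.println(m.get("BB")); // second
    }
}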

Buckets in HashMap take the form of singly linked lists. One problem every hash table must solve is hash-value conflicts, and there are two common approaches: chaining (the linked-list method) and open addressing. Chaining organizes objects with the same hash value into a linked list stored in the slot corresponding to that hash value; open addressing uses a probing algorithm to find the next available slot when a slot is already occupied. java.util.HashMap adopts chaining, using a singly linked list. The core code that forms the list is as follows:

void addEntry(int hash, K key, V value, int bucketIndex) {  
    Entry<K,V> e = table[bucketIndex];  
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);  
    if (size++ >= threshold)  
        resize(2 * table.length);  
}

The method is short, but it contains a deliberate design: the system always places a newly added Entry at index bucketIndex of the table array. If an Entry already exists at that index, the new Entry points to the existing one, producing an Entry chain; if there is no Entry there (the variable e above is null), the new Entry points to null and no chain is formed.

       When there are no hash conflicts and no linked lists have formed, HashMap looks up elements quickly: the get() method locates an element directly. Once a chain forms, however, a bucket stores an Entry chain rather than a single Entry, and the system can only traverse the chain in order until it finds the target Entry. If the target sits at the end of the chain (meaning it was the first Entry put into that bucket), the loop must run to the end of the chain to find it.

       A HashMap is created with a default load factor of 0.75, a compromise between time and space costs. Increasing the load factor reduces the memory occupied by the hash table (the Entry array) but raises the cost of looking up data, and lookups are the most frequent operation (both get() and put() perform one); decreasing the load factor speeds up lookups but increases the memory occupied by the table.

1. HashMap overview

HashMap is a hash-table-based implementation of the Map interface. The implementation provides all of the optional map operations and permits null values and the null key. (HashMap is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) The class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

It is worth noting that HashMap is not thread-safe. If you need a thread-safe map, you can obtain one by wrapping the HashMap with the static method Collections.synchronizedMap:

Map map = Collections.synchronizedMap(new HashMap());
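
One caveat worth remembering (this is standard, documented Collections.synchronizedMap behavior, shown here with my own example): individual calls are synchronized by the wrapper, but iteration over the wrapped map must still be guarded manually.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

Map<String, Object> map = Collections.synchronizedMap(new HashMap<String, Object>());
map.put("a", "rrr1");

// put/get are synchronized by the wrapper, but iteration is not:
// the javadoc requires synchronizing on the map itself while iterating.
synchronized (map) {
    for (Map.Entry<String, Object> e : map.entrySet()) {
        System.out.println(e.getKey() + "=" + e.getValue());
    }
}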

2. Data structure of HashMap

The bottom layer of HashMap is based mainly on an array plus linked lists. Its very fast lookups come from computing the hash code to determine the storage location: HashMap computes the hash value from the key's hashCode, and identical hashCodes always yield the same hash value. When many objects are stored, however, different objects may produce the same hash value, which leads to the so-called hash conflict. Anyone who has studied data structures knows there are many ways to resolve hash conflicts; HashMap's bottom layer resolves them with linked lists.

In the figure from the original post (omitted here), the purple part represents the hash table, also called the hash array. Each element of the array is the head node of a singly linked list, and the lists are used to resolve conflicts: when different keys map to the same array position, their entries are placed in that position's list.

Let's look at the code of the Entry class in HashMap:

 /**
  * Entry is a node of a singly linked list - the list that implements
  * HashMap's chained storage.
  * It implements the Map.Entry interface: getKey(), getValue(),
  * setValue(V value), equals(Object o), hashCode().
  **/
 static class Entry<K,V> implements Map.Entry<K,V>
 {
     final K key;
     V value;
     // Points to the next node
     Entry<K,V> next;
     final int hash;

     // Constructor.
     // Parameters: hash value (h), key (k), value (v), next node (n)
     Entry(int h, K k, V v, Entry<K,V> n)
     {
         value = v;
         next = n;
         key = k;
         hash = h;
     }

     public final K getKey()
     {
         return key;
     }

     public final V getValue()
     {
         return value;
     }

     public final V setValue(V newValue)
     {
         V oldValue = value;
         value = newValue;
         return oldValue;
     }

     // Two Entry objects are equal if both their keys and their values
     // are equal; otherwise equals() returns false.
     public final boolean equals(Object o)
     {
         if(!(o instanceof Map.Entry)) return false;
         Map.Entry e = (Map.Entry) o;
         Object k1 = getKey();
         Object k2 = e.getKey();
         if(k1 == k2 || (k1 != null && k1.equals(k2)))
         {
             Object v1 = getValue();
             Object v2 = e.getValue();
             if(v1 == v2 || (v1 != null && v1.equals(v2))) return true;
         }
         return false;
     }

     // Implements hashCode()
     public final int hashCode()
     {
         return(key == null ? 0 : key.hashCode()) ^ (value == null ? 0 : value.hashCode());
     }

     public final String toString()
     {
         return getKey() + "=" + getValue();
     }

     // Called when the value of an entry already in the HashMap is overwritten.
     // Does nothing here.
     void recordAccess(HashMap<K,V> m)
     {}

     // Called when an entry is removed from the HashMap.
     // Does nothing here.
     void recordRemoval(HashMap<K,V> m)
     {}
 }

HashMap is in essence an array of Entry objects. An Entry holds a key and a value; its next field is another Entry, used to chain entries together when hash conflicts occur.

3. HashMap source code analysis

1. Key attributes

Let's take a look at some key attributes in the HashMap class:

transient Entry[] table; // the array of entries that stores the elements

transient int size; // the number of stored key-value pairs

int threshold; // the resize threshold: when the actual size exceeds it, the table is expanded; threshold = loadFactor * capacity

final float loadFactor; // the load factor
 
transient int modCount; // the number of structural modifications

The loadFactor describes how full the hash table is allowed to become.

The larger the load factor, the more elements are packed in. The advantage is higher space utilization, but the chance of conflicts grows: the linked lists get longer and longer, and lookup efficiency drops.

Conversely, the smaller the load factor, the fewer elements are packed in. The chance of conflicts shrinks, but space is wasted: the data in the table becomes too sparse, and many slots trigger expansion before they are ever used.

The greater the chance of conflicts, the higher the cost of lookups.

We therefore need to strike a balance between "chance of conflict" and "space utilization". This balance is, in essence, the famous time-space trade-off of data structures.

If the machine has plenty of memory and you want faster queries, set the load factor smaller; conversely, if memory is tight and query speed is not critical, set it a little larger. Usually, though, there is no need to set it at all: the default of 0.75 is fine.
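
As a hedged sketch of that advice (the constructor is the standard two-argument one; the numeric values are only illustrative):

import java.util.HashMap;
import java.util.Map;

// Plenty of memory, lookup-heavy workload: a lower load factor keeps chains short.
Map<String, Object> fastLookup = new HashMap<String, Object>(64, 0.5f);

// Memory-tight workload: a higher load factor packs the table more densely.
Map<String, Object> compact = new HashMap<String, Object>(16, 0.9f);

// Usually neither is needed - the default is equivalent to:
Map<String, Object> standard = new HashMap<String, Object>(16, 0.75f);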

2. Construction method

Here are some construction methods of HashMap:

public HashMap(int initialCapacity, float loadFactor)
{
    // Validate the arguments
    if(initialCapacity < 0) throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
    if(initialCapacity > MAXIMUM_CAPACITY) initialCapacity = MAXIMUM_CAPACITY;
    if(loadFactor <= 0 || Float.isNaN(loadFactor)) throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
    // Find a power of 2 >= initialCapacity
    int capacity = 1; // starting capacity
    while(capacity < initialCapacity) // capacity becomes the smallest power of 2 that is >= initialCapacity
        capacity <<= 1;
    this.loadFactor = loadFactor;
    threshold = (int)(capacity * loadFactor);
    table = new Entry[capacity];
    init();
}
public HashMap(int initialCapacity)
{
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}
public HashMap()
{
    this.loadFactor = DEFAULT_LOAD_FACTOR;
    threshold = (int)(DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR);
    table = new Entry[DEFAULT_INITIAL_CAPACITY];
    init();
}

When a HashMap is constructed with an explicit load factor and initial capacity, the first constructor is called; otherwise the defaults are used: a default initial capacity of 16 and a default load factor of 0.75. Note the while loop in the first constructor: its purpose is to make the capacity a power of 2 - specifically, the smallest power of 2 greater than or equal to initialCapacity. Why the capacity must be a power of 2 is explained later.
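
To see the rounding in isolation, here is a standalone sketch of my own that mirrors the while loop quoted above (not the literal JDK method):

// Standalone version of the capacity rounding performed in the constructor above.
static int roundUpToPowerOfTwo(int initialCapacity) {
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;
    return capacity;
}

// roundUpToPowerOfTwo(15) == 16
// roundUpToPowerOfTwo(16) == 16
// roundUpToPowerOfTwo(17) == 32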

Let's focus on the two most heavily used methods of HashMap: put and get.

3. Store data

Let's look at how HashMap stores data, starting with its put method:

public V put(K key, V value) {
    // If the key is null, add the key-value pair to table[0].
    if (key == null)
        return putForNullKey(value);
    // Otherwise compute the key's hash value and add the pair to the linked
    // list of the corresponding bucket.
    int hash = hash(key.hashCode());
    // Find the index in the table for this hash value
    int i = indexFor(hash, table.length);
    // Walk the Entry chain. If a pair with an equal key already exists,
    // replace the old value with the new one and return the old value.
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) { // equal keys: overwrite and return the old value
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    // Modification count + 1
    modCount++;
    // Add the key-value pair to table[i]
    addEntry(hash, key, value, i);
    return null;
}

The above program uses an important internal interface: Map.Entry. Each Map.Entry is in fact a key-value pair. As the program shows, when the system decides where to store a key-value pair in the HashMap, it does not consider the Entry's value at all; it computes and determines each Entry's storage location from the key alone. This again illustrates the earlier conclusion: the values in a Map can be regarded as attachments to the keys - once the system has decided where the key is stored, the value is simply saved there.

Let's analyze this function step by step. The first two statements handle the case where the key is null; look at the putForNullKey(value) method:

 private V putForNullKey(V value)
 {
     for(Entry<K,V> e = table[0]; e != null; e = e.next)
     {
         if(e.key == null)
         { // if an entry with a null key already exists, overwrite its value
             V oldValue = e.value;
             e.value = value;
             e.recordAccess(this);
             return oldValue;
         }
     }
     modCount++;
     addEntry(0, null, value, 0); // a null key gets hash value 0
     return null;
 }

Note: if the key is null, its hash value is 0, and the entry is stored at index 0 of the array, i.e. table[0].
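
A quick usage sketch of this special case (my own example):

import java.util.HashMap;

HashMap<String, String> m = new HashMap<String, String>();
m.put(null, "stored in table[0]");   // null key: hash 0, bucket 0
System.out.println(m.get(null));     // prints "stored in table[0]"
m.put(null, "overwritten");          // same slot: the old value is replaced
System.out.println(m.get(null));     // prints "overwritten"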

Let's return to the put method. After the null check, it computes the hash code from the key's hashCode value. The following is the function that computes the hash code:

//Compute the hash value from the key's hashCode
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

After the hash code is obtained, the index at which the element should be stored in the array is computed from it. The function that computes the index is as follows:

static int indexFor(int h, int length) { // compute the index from the hash value and the array length
    return h & (length-1); // this is not arbitrary: hash & (length-1) guarantees the computed index stays within the bounds of the array
}

This deserves a closer look. For hashing into a table we would naturally think of taking the hash value modulo the length (the division method), and that is exactly what Hashtable does; it spreads the elements fairly uniformly, but the modulo requires a division, which is inefficient. HashMap replaces the modulo with h & (length-1), which achieves a uniform spread with much better efficiency - one of HashMap's improvements over Hashtable.
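
A small check of my own confirming the equivalence (valid for non-negative hash values and a power-of-two length):

// For a power-of-two length, the bit mask gives the same result as the
// modulo, but without a division (illustrative check, non-negative h only).
int length = 16; // must be a power of two
for (int h = 0; h < 1000; h++) {
    if ((h & (length - 1)) != (h % length))
        throw new AssertionError("mismatch at h=" + h);
}
System.out.println("h & (length-1) == h % length for all tested h");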

     Next, why must the capacity of the hash table be an integer power of 2? First, if length is a power of 2, h & (length-1) is equivalent to taking h modulo length, which preserves the uniformity of the hash while improving efficiency. Second, if length is a power of 2 it is even, so length-1 is odd and its lowest bit is 1; the lowest bit of h & (length-1) can then be either 0 or 1 (depending on h), i.e. the result can be even or odd, which keeps the hash uniform. If length were odd, length-1 would be even with a lowest bit of 0, so the lowest bit of h & (length-1) would always be 0 - the result could only be even, every hash value would land on an even index, and nearly half of the array would be wasted. Taking length as a power of 2 therefore reduces the chance that different hash values collide and lets the elements spread evenly through the hash table.

This looks very simple, but there is a subtlety to it. An example illustrates:

Assume the array lengths are 15 and 16 and the optimized hash codes are 8 and 9 (in binary, 8 = 1000 and 9 = 1001). The results of the & operation are as follows:

 h & (table.length-1)        hash     table.length-1        result
       8 & (15-1):           1000  &  1110               =  1000  (index 8)
       9 & (15-1):           1001  &  1110               =  1000  (index 8, collision)
       ----------------------------------------------------------------------
       8 & (16-1):           1000  &  1111               =  1000  (index 8)
       9 & (16-1):           1001  &  1111               =  1001  (index 9)

As the example shows, when ANDed with 15-1 (1110), 8 and 9 produce the same result: they land at the same array position and collide. Both are stored at the same index and form a linked list, so a query has to traverse that list to fetch 8 or 9, which reduces query efficiency. We can also see that with an array length of 15, every hash value is ANDed with 1110, so the last bit of the result is always 0: the odd positions 0001, 0011, 0101, 0111, 1001, 1011 and 1101 can never hold an element, wasting a great deal of space. Worse, the number of usable array positions becomes far smaller than the array length, which further increases the chance of collisions and slows down queries. When the array length is 16, i.e. 2 to the n-th power, every bit of 2^n - 1 is 1, so the masked index preserves the low bits of the hash unchanged; combined with the hash(int h) method, which mixes the high bits of the key's hashCode into the low bits, only values with genuinely identical hash values end up at the same array position and form a linked list.

     Therefore, when the array length is a power of 2, different keys rarely compute the same index, the data spreads evenly across the array, collisions are rare, and queries seldom have to traverse a linked list at a given position, so query efficiency is higher.
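
The earlier 15-vs-16 table can be reproduced with a few lines of my own (illustrative):

// Reproduces the example: with length 15 the hashes 8 and 9 collide at
// index 8; with length 16 they spread to indices 8 and 9.
for (int length : new int[] {15, 16}) {
    for (int h : new int[] {8, 9}) {
        System.out.printf("%d & (%d-1) = %d%n", h, length, h & (length - 1));
    }
}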

        Following the source of the put method above: when the program tries to put a key-value pair into a HashMap, it first determines the Entry's storage location from the return value of the key's hashCode(). If the hashCode() return values of two Entries' keys are the same, their storage locations are the same. If the two keys additionally compare equal via equals(), the newly added Entry's value overwrites the original Entry's value, but the key is not replaced. If the two keys return false from equals(), the new Entry forms an Entry chain with the original Entry in the collection, with the new Entry at the head of the chain. For the details, read on to the description of the addEntry() method.

1 void addEntry(int hash, K key, V value, int bucketIndex) {
2         Entry<K,V> e = table[bucketIndex]; // if the slot already holds an entry, that entry becomes the next node of the new entry
3         table[bucketIndex] = new Entry<>(hash, key, value, e);
4         if (size++ >= threshold) // if the size exceeds the threshold, expand
5             resize(2 * table.length); // grow by a factor of 2
6     }

The parameter bucketIndex is the index computed by the indexFor function. Line 2 fetches the Entry currently stored at index bucketIndex. Line 3 builds a new Entry from hash, key and value, places it at index bucketIndex, and sets the previous occupant of that slot as the new Entry's next, forming a linked list.

Lines 4 and 5 check whether size has reached the threshold after the put; if it has, the table must be expanded, and HashMap doubles its capacity.

4. Resize

The resize() method is as follows:

Resize the HashMap; newCapacity is the new capacity after adjustment:

 1     void resize(int newCapacity) {
 2         Entry[] oldTable = table;
 3         int oldCapacity = oldTable.length;
 4         if (oldCapacity == MAXIMUM_CAPACITY) {
 5             threshold = Integer.MAX_VALUE;
 6             return;
 7        }
 8 
 9         Entry[] newTable = new Entry[newCapacity];
10         transfer(newTable);//Used to move all the elements of the original table to the newTable
11         table = newTable;  //Then assign newTable to table
12         threshold = (int)(newCapacity * loadFactor);//Recalculate threshold
13     }

This creates a new underlying array for the HashMap. Line 10 above calls the transfer method to move all elements of the old table into the new one, recomputing each element's index in the new array.

As a HashMap accumulates more and more elements, the probability of hash conflicts rises, because the length of the array is fixed. To keep queries efficient, the array must therefore be expanded. Array expansion also appears in ArrayList; it is a common operation. And after the HashMap array is expanded, the most performance-hungry step appears: the data in the original array must be recomputed into positions in the new array and moved there. That is what resize does.

   When does a HashMap expand? When the number of elements exceeds array size * loadFactor. The default loadFactor is 0.75, a compromise value: by default the array size is 16, so once the HashMap holds more than 16 * 0.75 = 12 elements, the array grows to 2 * 16 = 32 (doubling), and the position of every element is recalculated. Expansion requires copying the array, which is a very costly operation, so if we can predict the number of elements the HashMap will hold, presetting the capacity can noticeably improve its performance, as the sketch below shows.
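
A hedged sketch of such presizing (the sizing formula is the common idiom, not something quoted from this post): to hold n entries without any resize, the initial capacity must be at least n / loadFactor.

import java.util.HashMap;
import java.util.Map;

// If we expect about 1000 entries, presizing avoids repeated
// resize()/transfer() passes while filling the map (illustrative values).
int expectedEntries = 1000;
int initialCapacity = (int) (expectedEntries / 0.75f) + 1; // >= n / loadFactor
Map<String, Object> m = new HashMap<String, Object>(initialCapacity);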

5. Data reading

public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

With the storage hash algorithm above as a foundation, this code is easy to understand. The source shows that when getting an element from a HashMap, the key's hashCode is first used to find the position in the array, and then the required element is located within that position's linked list via the key's equals method.
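
A short usage sketch of my own tying this together:

import java.util.HashMap;

HashMap<String, Object> m = new HashMap<String, Object>();
m.put("a", "rrr1");
System.out.println(m.get("a")); // "rrr1": the hash locates the bucket, equals() finds the entry
System.out.println(m.get("z")); // null: the bucket is empty or the key is not in its chain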

6. Performance parameters of HashMap

HashMap contains the following constructors:

HashMap(): builds a HashMap with an initial capacity of 16 and a load factor of 0.75.

HashMap(int initialCapacity): builds a HashMap with an initial capacity of initialCapacity and a load factor of 0.75.

HashMap(int initialCapacity, float loadFactor): builds a HashMap with the specified initial capacity and the specified load factor.

The basic constructor of HashMap HashMap(int initialCapacity, float loadFactor) has two parameters: initial capacity initialCapacity and load factor loadFactor.

initialCapacity: the initial capacity of the HashMap, i.e. the length of the underlying array.

loadFactor: the load factor, defined as the actual number of elements in the hash table (n) divided by the capacity of the hash table (m).

The load factor measures how fully the space of a hash table is used: the larger the load factor, the fuller the table, and vice versa. For a hash table that resolves conflicts by chaining, the average time to find an element is O(1 + α), where α is the load factor. A larger load factor therefore uses space more fully at the cost of lower search efficiency; a load factor that is too small leaves the table data too sparse and wastes space badly.

In the implementation of HashMap, the threshold field determines the maximum number of elements the table may hold before it is resized:

threshold = (int)(capacity * loadFactor);  

By the definition of the load factor, threshold is the maximum number of elements allowed at the given loadFactor and capacity; when more are stored, the map resizes, which lowers the actual load factor again. The default load factor of 0.75 is a balanced choice between space and time efficiency. When the number of elements exceeds this limit, the resized HashMap has twice the previous capacity.

Topics: Java data structure