HashCode & HashMap perturbation function, initialization capacity, load factor, expansion element splitting

Posted by glowball on Tue, 22 Feb 2022 14:56:00 +0100

1. Why does hashcode use 31 as a multiplier?

The hashCode method of the String class is as follows:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char[] val = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

The method above hard-codes the multiplier 31. When reading the source of String's hashCode method, you will naturally ask: why 31?

  • A hash function should use a prime multiplier; this is established hash-function theory for reducing collisions.
  • If the multiplier were even and the multiplication overflowed, information would be lost, because multiplying by an even number is equivalent to a left shift (the low bits are filled with 0).
  • 31 performs well: the multiplication can be replaced by a shift and a subtraction, 31 * i == (i << 5) - i, and current JVMs apply this optimization automatically.
  • 31 is a small prime; experimentally, it gives a very low collision probability and a very uniform distribution of hash values.
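The two claims above can be checked directly: the 31-multiplier recurrence reproduces String.hashCode, and the shift-subtract identity holds. A minimal sketch (the class and method names here are mine, not from the JDK):

```java
// Recompute a String's hash with the 31-multiplier loop and confirm it
// matches String.hashCode(); also check the identity 31 * i == (i << 5) - i
// that the JVM can apply as an optimization.
public class ThirtyOneDemo {
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // same recurrence as String.hashCode
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(manualHash(s) == s.hashCode()); // true
        int i = 12345;
        System.out.println(31 * i == (i << 5) - i);        // true
    }
}
```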
2. HashMap
2.1 HashMap perturbation function

In Java 8, HashMap computes the hash of a key as follows:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

In the get method, the index of a key is computed as follows, where n is the length of the table array:

(n - 1) & hash

Looking at the source of HashMap's hash method, you may wonder why a perturbation function is used at all. Why not use the key's hashCode value directly?

  • First, the range of hashCode is [-2^31, 2^31 - 1], i.e. [-2147483648, 2147483647], nearly 4 billion values. An array that large cannot be allocated; there is not enough memory.
  • Therefore the hashCode is perturbed: (h = key.hashCode()) ^ (h >>> 16) shifts the hash right by 16 bits, filling the high bits with 0, which moves the high half into the low half, and then XORs it with the original value (same bits give 0, different bits give 1). This mixes the high and low bits of the original hash, giving the low 16 bits better dispersion and more randomness. Finally the AND with (array size - 1) zeroes all the high bits; with the default size of 16, only the last four bits remain as the index.
    The calculation works as follows; for example, take the original hash value 00000000 11111111 00000000 00001010:
hashCode()                            00000000 11111111 00000000 00001010    original hash
hashCode() >>> 16                     00000000 00000000 00000000 11111111    shifted right 16 bits
hashCode() ^ (hashCode() >>> 16)      00000000 11111111 00000000 11110101    XOR of high and low halves (same bits 0, different bits 1)
(hash value) & (16 - 1)               00000000 00000000 00000000 00000101    AND with 00001111 (1 only if both bits are 1)
                                                                   = 0101
                                                                   = 5
  • In short, the perturbation function increases randomness, hashes elements more uniformly, and reduces collisions.
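To make the perturb-then-mask steps concrete, here is a minimal sketch. The class name, the helper name perturb, and the sample keys are mine; the perturbation expression itself matches the JDK 8 source quoted above:

```java
// Apply the JDK 8 perturbation and then the (n - 1) & hash index step.
public class PerturbDemo {
    // Same expression as HashMap.hash(Object) in JDK 8.
    static int perturb(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16; // table length, always a power of two
        for (String key : new String[] {"zuio", "plop", "foo"}) {
            int index = (n - 1) & perturb(key);
            // The mask keeps only the low bits, so index is always in [0, n).
            System.out.println(key + " -> " + index);
        }
    }
}
```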
2.2 HashMap initialization capacity

The default initial capacity of HashMap is 16, the maximum capacity is 2 to the 30th power, and the capacity must always be a power of 2. Only a power of 2 minus 1 has every low bit set to 1, so the AND with the hash value spreads indexes over the whole range.

/**
 * The default initial capacity - MUST be a power of two.
 */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;

If the value we pass is not a power of 2, say 17, what does HashMap do?

In the HashMap constructor, the tableSizeFor method is called to compute the smallest power of two that is greater than or equal to the value we passed.

public HashMap(int initialCapacity, float loadFactor) {
    ...
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY ? MAXIMUM_CAPACITY : n + 1);
}
  • MAXIMUM_CAPACITY = 1 << 30 is the upper bound, i.e. the largest table a Map can have.

  • tableSizeFor shifts n right by 1, 2, 4, 8 and 16 bits and ORs each result back in (a bit is 0 only if both input bits are 0). This fills every bit below the highest set bit with 1; adding 1 then yields an exact power of two.

  • The threshold calculation for an initial capacity of 17 is shown below; the final result is 31 + 1 = 32.

    17 - 1          10000
    n >>> 1         01000
    n | n >>> 1     11000
    n >>> 2         00110
    n | n >>> 2     11110
    n >>> 4         00001
    n | n >>> 4     11111
    n >>> 8         00000
    n | n >>> 8     11111
    n >>> 16        00000
    n | n >>> 16    11111 = 31
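A standalone copy of tableSizeFor (the method body and the MAXIMUM_CAPACITY constant are taken from the JDK source; the wrapping class is mine) confirms the walkthrough above:

```java
// Standalone copy of the JDK's tableSizeFor for experimentation.
public class TableSizeDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY ? MAXIMUM_CAPACITY : n + 1);
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(17)); // 32, as derived above
        System.out.println(tableSizeFor(16)); // 16: a power of two maps to itself
        System.out.println(tableSizeFor(1));  // 1
    }
}
```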
    
2.3 HashMap load factor

static final float DEFAULT_LOAD_FACTOR = 0.75f;

The load factor decides how full the map may get before it expands. With a capacity of 16, expansion is triggered once more than 16 * 0.75 = 12 elements are stored. The reason is that a table is rarely filled evenly: some buckets collide and grow linked lists even while others are empty, and long lists cost the map its array-like performance. Resizing at a reasonable threshold keeps hash collisions down. 0.75 is only the default; it can be adjusted when constructing the HashMap. For example, to trade more space for time, you can lower the load factor and reduce collisions further.
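The threshold arithmetic above can be sketched as follows. This is a simplified model of capacity * loadFactor, not the JDK code, and the class name is mine:

```java
// Simplified model of the resize threshold: capacity * loadFactor.
public class ThresholdDemo {
    public static void main(String[] args) {
        float loadFactor = 0.75f;
        int capacity = 16;
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold); // 12: exceeding 12 elements triggers a resize
        // A smaller load factor resizes earlier: more space used, fewer collisions.
        System.out.println((int) (capacity * 0.5f)); // 8
    }
}
```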

2.4 HashMap expansion element splitting

During expansion, HashMap must redistribute the existing elements into the new array. In JDK 1.7 the index of every element had to be recomputed from its hash; JDK 1.8 optimizes this recomputation away, which speeds up the split.

String key = "zuio";
int hash = key.hashCode() ^ (key.hashCode() >>> 16);
System.out.println("Perturbed hash of zuio: " + Integer.toBinaryString(hash));
System.out.println("Index (binary) with capacity 16: " + Integer.toBinaryString(hash & (16 - 1)));
System.out.println("Index (decimal) with capacity 16: " + ((16 - 1) & hash));
System.out.println("zuio hash AND old capacity 16: " + (hash & 16));
System.out.println("Index (binary) with capacity 32: " + Integer.toBinaryString(hash & (32 - 1)));
System.out.println("Index (decimal) with capacity 32: " + ((32 - 1) & hash));

String key2 = "plop";
int hash2 = key2.hashCode() ^ (key2.hashCode() >>> 16);
System.out.println("Perturbed hash of plop: " + Integer.toBinaryString(hash2));
System.out.println("Index (binary) with capacity 16: " + Integer.toBinaryString(hash2 & (16 - 1)));
System.out.println("Index (decimal) with capacity 16: " + ((16 - 1) & hash2));
System.out.println("plop hash AND old capacity 16: " + (hash2 & 16));
System.out.println("Index (binary) with capacity 32: " + Integer.toBinaryString(hash2 & (32 - 1)));
System.out.println("Index (decimal) with capacity 32: " + ((32 - 1) & hash2));

// The output is:
// Perturbed hash of zuio: 1110010011100110011000
// Index (binary) with capacity 16: 1000
// Index (decimal) with capacity 16: 8
// zuio hash AND old capacity 16: 16
// Index (binary) with capacity 32: 11000
// Index (decimal) with capacity 32: 24

// Perturbed hash of plop: 1101001000110011101001
// Index (binary) with capacity 16: 1001
// Index (decimal) with capacity 16: 9
// plop hash AND old capacity 16: 0
// Index (binary) with capacity 32: 1001
// Index (decimal) with capacity 32: 9

From these two examples we can draw a conclusion: AND the hash value with the old capacity. If the result is 0, the element keeps its index; if not, the new index is the old index plus the old capacity. (I have only shown two examples here; trying more keys leads to the same conclusion.) The core of HashMap's resize method reflects exactly this:

final Node<K,V>[] resize() {
    ...
    // AND the hash value with the old capacity
    if ((e.hash & oldCap) == 0) {
        // keeps its original index
        if (loTail == null)
            loHead = e;
        else
            loTail.next = e;
        loTail = e;
    }
    else {
        // original index + old capacity
        if (hiTail == null)
            hiHead = e;
        else
            hiTail.next = e;
        hiTail = e;
    }
}

Each expansion doubles the array length, so in binary the mask (length - 1) gains one digit: 16 - 1 is 1111, while 32 - 1 is 11111, one bit more. That extra bit is ANDed with the corresponding bit of the hash value: if the result is 0 the index stays the same, and if it is 1 the new index is the old index plus the old array length. Examples:

The extra bit ANDs with the hash bit as 1, so the new index = old index + old array length:
old mask 16 - 1 in binary:       0000 0000 0000 0000 0000 0000 0000 1111
new mask 32 - 1 in binary:       0000 0000 0000 0000 0000 0000 0001 1111
perturbed hash of "zuio":        0000 0000 0011 1001 0011 1001 1001 1000
                                 extra-bit AND result: 1

The extra bit ANDs with the hash bit as 0, so the new index = old index:
old mask 16 - 1 in binary:       0000 0000 0000 0000 0000 0000 0000 1111
new mask 32 - 1 in binary:       0000 0000 0000 0000 0000 0000 0001 1111
perturbed hash of "plop":        0000 0000 0011 0100 1000 1100 1110 1001
                                 extra-bit AND result: 0
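The split rule can be checked for any key with a small sketch. The class name, helper name, and extra sample keys are mine; the perturbation matches the JDK 8 hash():

```java
// Check the split rule: when capacity doubles from oldCap to 2 * oldCap,
// an element's new index is either its old index (when hash & oldCap == 0)
// or its old index + oldCap.
public class SplitDemo {
    static int perturb(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int oldCap = 16;
        for (String key : new String[] {"zuio", "plop", "alpha", "hashmap"}) {
            int hash = perturb(key.hashCode());
            int oldIdx = hash & (oldCap - 1);
            int newIdx = hash & (2 * oldCap - 1);
            int expected = oldIdx + ((hash & oldCap) == 0 ? 0 : oldCap);
            System.out.println(key + ": old=" + oldIdx + " new=" + newIdx
                    + " ruleHolds=" + (newIdx == expected)); // ruleHolds is always true
        }
    }
}
```

The rule holds because the new mask (2 * oldCap - 1) is the old mask plus exactly one extra bit, which is the oldCap bit of the hash.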

Topics: Java HashMap hashcode