JDK8: HashMap source code analysis: hash method

Posted by kidintraffic on Mon, 07 Mar 2022 22:59:18 +0100

1, Overview

We know that in a HashMap, the position at which a key-value pair is stored in the internal array depends on the hashCode value of the key K; this is because HashMap's hash algorithm is built on top of the hashCode value.

Three concepts should be distinguished here: the hashCode value, the hash value (and the hash method that produces it), and the array subscript.

hashCode value: the return value of the hashCode method of the key object K in the KV pair (if not overridden, the default implementation from the Object class is used).

Hash value: the result of a further operation on the hashCode value. That operation is the hash method.

Array subscript: computed from the hash value and the array length, using the formula: hash & (array length - 1) = subscript.

What we want to discuss is the hash method mentioned above. It is a static method in the HashMap class, and its code is very simple: only two lines, with easy syntax. The intent behind it, however, is well worth studying in depth.

We first discuss how the array subscript is calculated, which helps us understand the design of the hash method.

 

2, Array subscript calculation
 

hash: hash value
length: array length
Calculation formula: hash & (length - 1)

If length=16 and hash=1, the subscript is calculated as follows:
    0000 0000 0000 0000 0000 0000 0000 1111    16-1 = 15; the binary of 15 is 1111
    0000 0000 0000 0000 0000 0000 0000 0001    binary of 1
    0000 0000 0000 0000 0000 0000 0000 0001    AND of the two lines above; the result is 1

If length=16 and hash=17, the subscript is calculated as follows:
    0000 0000 0000 0000 0000 0000 0000 1111    16-1 = 15; the binary of 15 is 1111
    0000 0000 0000 0000 0000 0000 0001 0001    binary of 17
    0000 0000 0000 0000 0000 0000 0000 0001    AND of the two lines above; the result is 1

If length=16 and hash=33, the subscript is calculated as follows:
    0000 0000 0000 0000 0000 0000 0000 1111    16-1 = 15; the binary of 15 is 1111
    0000 0000 0000 0000 0000 0000 0010 0001    binary of 33
    0000 0000 0000 0000 0000 0000 0000 0001    AND of the two lines above; the result is 1

If length=32 and hash=1, the subscript is calculated as follows:
    0000 0000 0000 0000 0000 0000 0001 1111    32-1 = 31; the binary of 31 is 11111
    0000 0000 0000 0000 0000 0000 0000 0001    binary of 1
    0000 0000 0000 0000 0000 0000 0000 0001    AND of the two lines above; the result is 1

If length=32 and hash=17, the subscript is calculated as follows:
    0000 0000 0000 0000 0000 0000 0001 1111    32-1 = 31; the binary of 31 is 11111
    0000 0000 0000 0000 0000 0000 0001 0001    binary of 17
    0000 0000 0000 0000 0000 0000 0001 0001    AND of the two lines above; the result is 17

If length=32 and hash=33, the subscript is calculated as follows:
    0000 0000 0000 0000 0000 0000 0001 1111    32-1 = 31; the binary of 31 is 11111
    0000 0000 0000 0000 0000 0000 0010 0001    binary of 33
    0000 0000 0000 0000 0000 0000 0000 0001    AND of the two lines above; the result is 1


(1,16)->1, (17,16)->1, (33,16)->1
(1,32)->1, (17,32)->17, (33,32)->1

Summary:
For a power-of-two length, hash & (length - 1) has the same effect as hash % length (for non-negative hash values).
We know that a HashMap doubles its capacity when it resizes, so the array length is always a power of 2 to the N. If the array length were not a power of two, repeating the calculation above would show that the AND no longer behaves like taking the remainder.
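This equivalence can be checked with a small sketch (the class name is mine; note the assumption of a non-negative hash, since for negative values `&` and `%` disagree in Java):

```java
public class MaskVsMod {
    public static void main(String[] args) {
        // For a power-of-two length, hash & (length - 1) == hash % length
        // (for non-negative hash values).
        int length = 16;
        for (int hash : new int[]{1, 17, 33, 65535, 123456789}) {
            System.out.println(hash + " -> mask=" + (hash & (length - 1))
                    + ", mod=" + (hash % length));
        }
        // Counter-example with a non-power-of-two length: 15 - 1 = 14 = 1110,
        // so the mask can never produce an odd index, while 1 % 15 == 1.
        System.out.println(1 & (15 - 1)); // prints 0, but 1 % 15 == 1
    }
}
```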

After walking through HashMap's array addressing, we know:

On put(K,V), the index is computed from K's hash value and the current array length, which determines where in the array the key-value pair is stored.

On get(K), the index is computed the same way from K's hash value and the current array length, which determines where in the array to look for V.

As the examples above show, when several hash values map to the same index for the current array length, the corresponding entries are all stored at the same array slot. This situation is called a hash collision. When a collision occurs, the entries at that slot are organized as a linked list in insertion order, which means that get(K) must first locate the array slot and then traverse the linked list, comparing keys with equals one by one, to determine the V to return.

Assuming the hash values differ, the larger the HashMap array, the smaller the probability of collision.

When the array length is 16 and the hash values are 1, 17 and 33, all three map to subscript 1. Consider this: as long as the array length never changes, hash values larger than the array length (17 and 33) behave exactly like the hash value 1 as far as this HashMap is concerned.

Likewise, when the array length is 65536 and the hash values are 1, 65537 and 131073, all three map to subscript 1. Again, as long as the array length never changes, hash values larger than the array length (65537 and 131073) behave exactly like the hash value 1.

In the two scenarios above, if we assume hash value = hashCode value, then whether we write our own hashCode or use the default one, hashCode values larger than the array length do not produce the effect we hope for (that distinct hashCodes land in distinct, collision-free slots), because after masking, such a value may collide with at least one value smaller than the array length.

What scenario avoids collisions entirely? One where the maximum array length is known in advance (the data to be stored never exceeds it), the initial capacity is set to that maximum at construction time, every key's hash value is distinct, and every hash value is less than the array length. Do such scenarios exist? Yes, but most real application scenarios are neither so coincidental nor so deliberately arranged.
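A minimal sketch of such a "coincidental" scenario (my own illustration, not from the original article): Integer.hashCode() returns the int value itself, and for values below 65536 the hash method leaves it unchanged, so small consecutive Integer keys each occupy their own slot in a length-16 table:

```java
import java.util.HashMap;
import java.util.Map;

public class NoCollisionDemo {
    public static void main(String[] args) {
        // Integer.hashCode() is the int value itself; for values below 65536
        // the spreading step changes nothing, so keys 0..7 land in slots 0..7
        // of a length-16 table: no collisions at all.
        Map<Integer, String> map = new HashMap<>(16);
        for (int i = 0; i < 8; i++) {
            map.put(i, "value" + i);
            System.out.println(i + " -> slot " + (i & 15));
        }
    }
}
```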

Let's take a look at the most commonly used scenarios:

Map<String,Object> user = new HashMap<String,Object>(); // The default internal data length is 16
user.put("name", "kevin");
user.put("sex", 1);
user.put("level", 1);
user.put("phone", "13000000000");
user.put("address", "beijing fengtai");

Using String keys is probably the most common case. What are the hashCode values of these keys, and what indices would they map to if we addressed directly by hashCode?

System.out.println("\"name\".hashCode()	: " +"name".hashCode());
System.out.println("\"sex\".hashCode()	: " +"sex".hashCode());
System.out.println("\"level\".hashCode()	: " +"level".hashCode());
System.out.println("\"phone\".hashCode()	: " +"phone".hashCode());
System.out.println("\"address\".hashCode()	: " +"address".hashCode());
 
System.out.println("--------*****---------");
 
System.out.println("\"name\".hashCode() & (16 - 1) 	:" + ("name".hashCode() & (16 - 1)));
System.out.println("\"sex\".hashCode() & (16 - 1) 	:" + ("sex".hashCode() & (16 - 1)));
System.out.println("\"level\".hashCode() & (16 - 1)	:" + ("level".hashCode() & (16 - 1)));
System.out.println("\"phone\".hashCode() & (16 - 1) 	:" + ("phone".hashCode() & (16 - 1)));
System.out.println("\"address\".hashCode() & (16 - 1) :" + ("address".hashCode() & (16 - 1)));
 
Output result:
 
"name".hashCode()	: 3373707
"sex".hashCode()	: 113766
"level".hashCode()	: 102865796
"phone".hashCode()	: 106642798
"address".hashCode()	: -1147692044
--------*****---------
"name".hashCode() & (16 - 1) 	:11
"sex".hashCode() & (16 - 1) 	:6
"level".hashCode() & (16 - 1)	:4
"phone".hashCode() & (16 - 1) 	:14
"address".hashCode() & (16 - 1) :4

Although these hashCode values look large and essentially random, two of the five keys ("level" and "address") still collide at index 4. This means that these seemingly huge hashCode values have exactly the same effect on the initial length-16 array as values in the range [0-15].

However, in the case of a small amount of data, even if there is a collision, the performance disadvantage is almost negligible.

However, the subscript calculation above rests on an assumption: that the index is computed directly from K's hashCode and the array length. In reality it is computed from K's hash value and the array length.

So what is the hash value of K and how to calculate it? Next, let's discuss the calculation process of hash value and hash method.

3, hash method analysis

static final int hash(Object key) {
     int h;
     return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}


1. If the key is null, the hash value is 0. HashMap allows null as a key. Because the hash value of null is always 0 and null==null is true, a HashMap contains at most one null key, and that key always lands at the first slot of the array. If a hash collision forms a linked list at that slot, however, the node for the null key can be anywhere in the list (it depends on insertion order, and its position may change on every resize).
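A small sketch of the null-key behaviour (class name mine):

```java
import java.util.HashMap;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, Object> map = new HashMap<>();
        map.put(null, "first");
        map.put(null, "second"); // null == null, so this overwrites the first entry
        System.out.println(map.size());    // 1: at most one null key
        System.out.println(map.get(null)); // second
    }
}
```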

2. If the key is non-null, take the key's hashCode value h, shift h right by 16 bits unsigned, XOR the two, and the result is the final hash value.

What we can confirm from the code so far: the hashCode value h is the basis of the calculation, and two bit operations (unsigned right shift and XOR) are applied on top of it.

Let's rerun the test on the user map's keys, this time through the hash method:

System.out.println("hash(\"name\")	: " + hash("name"));
System.out.println("hash(\"sex\")	: " + hash("sex"));
System.out.println("hash(\"level\")	: " + hash("level"));
System.out.println("hash(\"phone\")	: " + hash("phone"));
System.out.println("hash(\"address\")	: " + hash("address"));
 
System.out.println("--------*****---------");
 
System.out.println("hash(\"name\") & (16 - 1) 	:" + (hash("name") & (16 - 1)));
System.out.println("hash(\"sex\") & (16 - 1) 		:" + (hash("sex") & (16 - 1)));
System.out.println("hash(\"level\") & (16 - 1)	:" + (hash("level") & (16 - 1)));
System.out.println("hash(\"phone\") & (16 - 1) 	:" + (hash("phone") & (16 - 1)));
System.out.println("hash(\"address\") & (16 - 1) 	:" + (hash("address") & (16 - 1)));
 
Output result:
 
hash("name")	: 3373752
hash("sex")	: 113767
hash("level")	: 102866341
hash("phone")	: 106642229
hash("address")	: -1147723677
--------*****---------
hash("name") & (16 - 1) 	:8
hash("sex") & (16 - 1) 		:7
hash("level") & (16 - 1)	:5
hash("phone") & (16 - 1) 	:5
hash("address") & (16 - 1) 	:3

Let's compare the two output results

"name".hashCode()	    : 3373707               hash("name")	: 3373752
"sex".hashCode()	    : 113766                hash("sex")	        : 113767
"level".hashCode()	    : 102865796             hash("level")	: 102866341
"phone".hashCode()	    : 106642798             hash("phone")	: 106642229
"address".hashCode()	    : -1147692044           hash("address")	: -1147723677
--------*****---------
"name".hashCode() & (16 - 1) 	:11                 hash("name") & (16 - 1) 	:8
"sex".hashCode() & (16 - 1) 	:6                  hash("sex") & (16 - 1) 	:7
"level".hashCode() & (16 - 1)	:4                  hash("level") & (16 - 1)	:5
"phone".hashCode() & (16 - 1) 	:14                 hash("phone") & (16 - 1) 	:5
"address".hashCode() & (16 - 1) :4                  hash("address") & (16 - 1) 	:3

Although the output of the hash method differs from using hashCode directly, the array subscripts still collide (two keys map to 5). So the hash method cannot eliminate collisions (in fact, collision itself is not a bug; we just want to reduce it as much as possible). Then why not use hashCode directly, instead of going through this bit operation to produce a hash value?
 

Let's walk through the calculation of h ^ (h >>> 16) with a few examples.

If h=17, h ^ (h >>> 16) is calculated as follows:
    0000 0000 0000 0000 0000 0000 0001 0001    binary of h (17) [the high 16 bits are all zero]
    0000 0000 0000 0000 0000 0000 0000 0000    binary of h >>> 16 (17 shifted right 16 bits, unsigned)
    0000 0000 0000 0000 0000 0000 0001 0001    XOR of the two lines above: h ^ (h >>> 16)
Conclusion: when h=17, h ^ (h >>> 16) = 17. Nothing changed.

If h=65535, h ^ (h >>> 16) is calculated as follows:
    0000 0000 0000 0000 1111 1111 1111 1111    h [the high 16 bits are all zero]
    0000 0000 0000 0000 0000 0000 0000 0000    h >>> 16
    0000 0000 0000 0000 1111 1111 1111 1111    h ^ (h >>> 16)
Conclusion: when h=65535, h ^ (h >>> 16) = 65535. Nothing changed.

If h=65536, h ^ (h >>> 16) is calculated as follows:
    0000 0000 0000 0001 0000 0000 0000 0000     h [the high 16 bits contain a 1]
    0000 0000 0000 0000 0000 0000 0000 0001     h >>> 16
    0000 0000 0000 0001 0000 0000 0000 0001     h ^ (h >>> 16)
Conclusion: when h=65536, h ^ (h >>> 16) = 65537. The hash value differs from the original.

If h=1147904, h ^ (h >>> 16) is calculated as follows:
    0000 0000 0001 0001 1000 0100 0000 0000     h [the high 16 bits contain two 1s, not adjacent]
    0000 0000 0000 0000 0000 0000 0001 0001     h >>> 16
    0000 0000 0001 0001 1000 0100 0001 0001     h ^ (h >>> 16)
Conclusion: when h=1147904, h ^ (h >>> 16) = 1147921. The hash value differs from the original.

Now let's look at a negative number. Negative numbers are a bit more involved, mainly because converting between their binary and decimal forms uses two's complement (invert the bits and add one).

If h=-5, h ^ (h >>> 16) is calculated as follows:
    0000 0000 0000 0000 0000 0000 0000 0101    start from the binary of 5
    1111 1111 1111 1111 1111 1111 1111 1010    invert the bits

    1111 1111 1111 1111 1111 1111 1111 1011    add 1: this is the binary of h (-5)
    0000 0000 0000 0000 1111 1111 1111 1111    binary of h >>> 16 (-5 shifted right 16 bits, unsigned)
    1111 1111 1111 1111 0000 0000 0000 0100    XOR of the two lines above: h ^ (h >>> 16)
                                               Now convert this binary back to decimal:
    1111 1111 1111 1111 0000 0000 0000 0011    subtract 1
    0000 0000 0000 0000 1111 1111 1111 1100    invert the bits: this is 65532, and a minus sign is required
Conclusion: when h=-5, h ^ (h >>> 16) = -65532. The hash value differs dramatically from the original.

1) From the examples above we can draw the following conclusions:
When h (the hashCode value) is in [0, 65535], the bit operations leave h unchanged.
When h is greater than 65535, the result of the bit operations differs from h.

When h is negative, the result of the bit operations also differs from h.
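These conclusions can be verified directly (the helper name `spread` is mine; its body is the non-null branch of HashMap's hash method):

```java
public class SpreadExamples {
    // Same operation as HashMap.hash applies to a non-null key.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        System.out.println(spread(17));      // 17      (h <= 65535: unchanged)
        System.out.println(spread(65535));   // 65535   (unchanged)
        System.out.println(spread(65536));   // 65537   (high bit mixed into low bits)
        System.out.println(spread(1147904)); // 1147921
        System.out.println(spread(-5));      // -65532  (negative: changes dramatically)
    }
}
```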

2) The hashCode values of the keys in the user map above are not in [0, 65535], so their computed hash values differ from their hashCode values.

 

Why does h ^ (h >>> 16) = h when h is in [0, 65535]?

The binary walkthroughs above make this clear. For a value in [0, 65535], the high 16 bits are all zero. After the unsigned right shift by 16, those high 16 bits become the low 16 bits and the new high 16 bits are filled with zeros, so h >>> 16 is 32 zero bits. XORing any integer N with all zeros yields N again. For a value outside [0, 65535], the high 16 bits contain at least one 1 bit; after the shift, the high bits are still filled with zeros, but the low 16 bits now contain that 1 bit, so the XOR changes the result.

The hashCode values of the keys in our user map are all either greater than 65535 or negative, so the final hash values differ from the hashCodes: the non-zero high bits of the hashCode take part in the XOR.

 

4, Function of hash method

The examples above have already shown that the hash method cannot eliminate collisions.

"name".hashCode()	    : 3373707               hash("name")	: 3373752
"sex".hashCode()	    : 113766                hash("sex")	        : 113767
"level".hashCode()	    : 102865796             hash("level")	: 102866341
"phone".hashCode()	    : 106642798             hash("phone")	: 106642229

Moreover, comparing the results shows that although the hash value differs from the hashCode, for positive hashCode values the hash result is not far from the original. So why not simply use the hashCode as the hash value? Why go through h ^ (h >>> 16) at all?

The comments in the hash method are described as follows:
 

/**
* Computes key.hashCode() and spreads (XORs) higher bits of hash
* to lower.  Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.)  So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/

The general meaning is:

During index calculation, the only binary bits that effectively participate are the low bits covered by the length mask, so sets of hashes that differ only in higher bits will always collide.

Shifting and XORing lets the high bits of the hashCode participate in the index calculation.

This particular transform is a trade-off between speed, utility, and quality of bit-spreading.

Many hashCode implementations are already reasonably distributed, and when large numbers of collisions do occur, the tree structure keeps lookups fast.

So HashMap uses the cheapest possible bit operations to let the high bits take part in addressing; the cost of one shift and one XOR is negligible.

The comment also mentions Float keys, which I take to mean Float objects used as map keys. I tested that as well:
 

System.out.println("Float.valueOf(0.1f).hashCode()		:" + Float.valueOf(0.1f).hashCode());
System.out.println("Float.valueOf(1.3f).hashCode()		:" + Float.valueOf(1.3f).hashCode());
System.out.println("Float.valueOf(1.4f).hashCode()	:" + Float.valueOf(1.4f).hashCode());
System.out.println("Float.valueOf(100000.3f).hashCode()	:" + Float.valueOf(100000.3f).hashCode());
System.out.println("Float.valueOf(100000.4f).hashCode()	:" + Float.valueOf(100000.4f).hashCode());
System.out.println("Float.valueOf(-100.3f).hashCode()	:" + Float.valueOf(-100.3f).hashCode());
System.out.println("Float.valueOf(-100.4f).hashCode()	:" + Float.valueOf(-100.4f).hashCode());
 
System.out.println("--------*****---------");
 
System.out.println("hash(0.1f)		:" + hash(0.1f));
System.out.println("hash(1.3f)		:" + hash(1.3f));
System.out.println("hash(1.4f)	:" + hash(1.4f));
System.out.println("hash(100000.3f)	:" + hash(100000.3f));
System.out.println("hash(100000.4f)	:" + hash(100000.4f));
System.out.println("hash(-100.3f)	:" + hash(-100.3f));
System.out.println("hash(-100.4f)	:" + hash(-100.4f));
 
Output result:
 
Float.valueOf(0.1f).hashCode()		:1036831949
Float.valueOf(1.3f).hashCode()		:1067869798
Float.valueOf(1.4f).hashCode()	:1068708659
Float.valueOf(100000.3f).hashCode()	:1203982374
Float.valueOf(100000.4f).hashCode()	:1203982387
Float.valueOf(-100.3f).hashCode()	:-1027040870
Float.valueOf(-100.4f).hashCode()	:-1027027763
--------*****---------
hash(0.1f)	:1036841217
hash(1.3f)	:1067866560
hash(1.4f)	:1068698752
hash(100000.3f)	:1203967973
hash(100000.4f)	:1203967984
hash(-100.3f)	:-1027056814
hash(-100.4f)	:-1027076603

The floating-point values in the example cover a fairly wide range, yet the generated hashCode values are not that far apart, and consequently neither are the final hash values. So how do we prove that spreading reduces collisions? I don't fully understand this point; it seems I will have to read the analyses of the experts.
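One concrete way to see the effect the JDK comment describes (my own illustration, not from the original article): consecutive whole-number Float keys such as 100f..107f differ only in bits 16 and above, so without spreading they all collide at index 0 in a length-16 table, while after h ^ (h >>> 16) they occupy distinct slots:

```java
public class FloatSpreadDemo {
    // Same operation as HashMap.hash applies to a non-null key
    // (the real method is package-private in java.util).
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int mask = 16 - 1; // small table
        for (float f = 100f; f < 108f; f++) {
            // Float.hashCode() is floatToIntBits; for 100f..107f the low
            // 16 bits are all zero, so the raw index is always 0.
            int raw = Float.valueOf(f).hashCode() & mask;
            int spread = hash(f) & mask; // high bits pulled down: distinct slots
            System.out.println(f + "f -> raw=" + raw + ", spread=" + spread);
        }
    }
}
```

For 100f through 107f the raw index prints 0 every time, while the spread indices come out as 8, 10, 12, 14, 0, 2, 4, 6: eight keys, eight different slots.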

 

Topics: Java Programming data structure linked list