HashMap and Load Factor

Posted by techiefreak05 on Sun, 09 Jun 2019 18:49:42 +0200

HashMap stores data as <key, value> pairs in an array. Conflicts (two keys mapping to the same slot) are resolved by chaining: each array slot holds the head Entry of a singly linked list.

The source code for Entry is as follows:

static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        final int hash;
 
        // Constructors, getters, setters, etc. omitted
 
        public final boolean equals(Object o) {
            if (!(o instanceof Map.Entry))
                return false;
            Map.Entry e = (Map.Entry)o;
            Object k1 = getKey();
            Object k2 = e.getKey();
            if (k1 == k2 || (k1 != null && k1.equals(k2))) {
                Object v1 = getValue();
                Object v2 = e.getValue();
                if (v1 == v2 || (v1 != null && v1.equals(v2)))
                    return true;
            }
            return false;
        }
 
        public final int hashCode() {
            return (key==null   ? 0 : key.hashCode()) ^
                   (value==null ? 0 : value.hashCode());
        }
 
    }

Let's explore the secrets of HashMap step by step by reading the source code!

1) "Spot the problems, raise the questions!"

1) capacity in the HashMap constructor

capacity is the size of HashMap's internal Entry array. Why must the size of the Entry array be a power of 2?

// Find a power of 2 >= initialCapacity
int capacity = 1;
while (capacity < initialCapacity)
      capacity <<= 1;
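As a sketch, the constructor loop can be wrapped in a small helper (the class and method names here are hypothetical, not part of HashMap) to see what it computes:

```java
public class CapacityDemo {
    // Same loop as the HashMap constructor:
    // the smallest power of 2 that is >= initialCapacity
    static int roundUpToPowerOfTwo(int initialCapacity) {
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(10)); // prints 16
        System.out.println(roundUpToPowerOfTwo(16)); // prints 16
        System.out.println(roundUpToPowerOfTwo(17)); // prints 32
    }
}
```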

2) Construct the loadFactor in HashMap

threshold = (int)(capacity * loadFactor);

threshold is the maximum number of entries the HashMap can hold before it resizes; note that it is not the size of the internal array.

Why do we need a loadFactor, and how should it be set properly?
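For a concrete feel, here is a minimal sketch of the threshold formula (ThresholdDemo is a hypothetical name), using the JDK defaults of capacity 16 and loadFactor 0.75f:

```java
public class ThresholdDemo {
    // Same formula as the constructor: threshold = (int)(capacity * loadFactor)
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        // With the defaults (capacity 16, loadFactor 0.75f),
        // the map resizes once it holds more than 12 entries.
        System.out.println(threshold(16, 0.75f)); // prints 12
        System.out.println(threshold(16, 0.5f));  // prints 8
    }
}
```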

 

3) HashMap's put method

public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value); // when the key is null, store it in the first bucket: table[0]
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
 
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
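A quick usage example of put's contract, using plain java.util.HashMap (no internals): put returns the previous value for the key, or null if there was none, and a null key is allowed.

```java
import java.util.HashMap;

public class PutDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        System.out.println(map.put("a", 1));  // prints null (no previous mapping)
        System.out.println(map.put("a", 2));  // prints 1 (the old value is returned)
        System.out.println(map.put(null, 3)); // prints null; the null key goes to table[0]
        System.out.println(map.get(null));    // prints 3
    }
}
```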


Here we focus on two things:

1)hash

static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

What does hash do?

 

2)indexFor

static int indexFor(int h, int length) {
        return h & (length - 1);
    }

What is the purpose of indexFor?

 

2) "Clear the confusion, part the clouds and see the sun!"

After computing hash, the key's hashCode still needs to be mapped into the index range of table (table is HashMap's Entry[]). This is effectively the modulo operation h % length.

It relies on a mathematical rule: if length is a power of 2, then taking h modulo length is equivalent to the bitwise AND of h and (length - 1), that is, h % length <=> h & (length - 1).

Bit operations are of course faster than division, which explains why the size of the Entry array is a power of 2.
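A minimal sketch verifying the equivalence for non-negative h (IndexForDemo is a hypothetical name; indexFor mirrors the HashMap method above):

```java
public class IndexForDemo {
    // Same as HashMap.indexFor: mask with (length - 1)
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16; // must be a power of two for the equivalence to hold
        for (int h = 0; h < 100; h++) {
            if (h % length != indexFor(h, length))
                throw new AssertionError("mismatch at h=" + h);
        }
        System.out.println("h % length == h & (length - 1) for all tested h");
    }
}
```

Note that for negative h the two differ: in Java, % can yield a negative remainder while the mask never does, which is one more reason HashMap masks instead of dividing.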

 

Why not pass the key's hashCode straight to indexFor? Why hash it again first?

I did a simple experiment as follows:

// Uses the hash(int) method shown above
int[] hashcode = new int[] { 100000001, 100000011, 100000111, 100001111, 100011111, 100111111, 101111111, 111111111 };
for (int c : hashcode)
    System.out.println(hash(c));


The array holds eight sample hashCodes. Divide each by 10 and the remainder is always 1: they all collide!

However, after hash, the corresponding values are:

94441116; 94441110; 94441204; 94440071; 94558923; 94592891; 107664244; 117165177

Divide the above values by 10 and the remainders are 6, 0, 4, 1, 3, 1, 4, 7: only 2 collisions (the repeated 1 and the repeated 4).

We can see that hash distributes hashCodes more evenly and guards against poorly written hash functions!

From the above we can also see that, when taking the remainder, the high bits have little influence. For example, 1048592, 1048832, 1052672, 1114112, and 2097152 are all divisible by 16 (remainder 0)!

So we need a way to make the high bits influence the remainder as well, and that is exactly what hash does.
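The point can be checked directly: the five numbers above differ only in their high bits, so plain masking with (16 - 1) sends them all to bucket 0, while applying the JDK 6 supplemental hash first spreads them across several buckets (HighBitsDemo is a hypothetical name; hash is copied from the article):

```java
import java.util.HashSet;
import java.util.Set;

public class HighBitsDemo {
    // JDK 6's supplemental hash, as shown earlier in the article
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int[] codes = { 1048592, 1048832, 1052672, 1114112, 2097152 };
        Set<Integer> raw = new HashSet<Integer>();
        Set<Integer> spread = new HashSet<Integer>();
        for (int c : codes) {
            raw.add(c & 15);          // plain masking: every value lands in bucket 0
            spread.add(hash(c) & 15); // after hash: the high bits now matter
        }
        System.out.println(raw.size());    // prints 1
        System.out.println(spread.size()); // several distinct buckets
    }
}
```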

Detailed reference: http://marystone.iteye.com/blog/709945

 

For the loadFactor, I also ran a simple experiment:

import java.util.HashSet;
import java.util.Set;

public static void main(String[] args) {
    String s = "a aa aaa b bb bbb c cc ccc d dd ddd e ee eee f ff fff g gg ggg h hh hhh abc bcd cde def efg fgh ghi hij ijk ";
    String[] ss = s.split(" ");
    int size = ss.length; // number of words
    Set<Integer> indexS = new HashSet<Integer>();
    int conflict = 0;
    for (int i = 0; i < ss.length; i++) {
        // uses the hash(int) method shown earlier
        int index = hash(ss[i].hashCode()) % size;
        if (indexS.contains(index))
            conflict++;
        else
            indexS.add(index);
    }
    System.out.println("Number of conflicts: " + conflict);
}


When the divisor used for the modulo is size, the number of conflicts is 13.

When the divisor is 2 * size, the number of conflicts is 9.

When the divisor is 3 * size, the number of conflicts is 6.

In HashMap, the table size (the divisor) is governed by the loadFactor.

Taking it to extremes:

If the loadFactor is very small, the map's table must be expanded constantly, which gives a larger divisor and fewer conflicts!

If the loadFactor is very large, the table fills up without ever needing expansion, so conflicts pile up and the chains used to resolve them grow longer and longer.

Original Link: http://www.cnblogs.com/huangfox/archive/2012/07/06/2579614.html