Why Netty's FastThreadLocal is fast

Posted by v0id on Mon, 14 Oct 2019 13:53:46 +0200

Preface

Recently, while reading the Netty source code, I came across a class called FastThreadLocal. The JDK already has its own ThreadLocal class, so the name suggests that Netty's version is faster. Where does the speedup come from, and why? Here is a brief analysis.

Performance testing

ThreadLocal is mainly used in multi-threaded code to hold data that belongs to the current thread. Each thread sees only its own value, so callers do not need to worry about synchronization. To see where the two implementations differ, two scenarios are tested: multiple threads operating on the same ThreadLocal, and multiple ThreadLocals used by a single thread.

1. Multiple threads operate on the same ThreadLocal

The same test is run against ThreadLocal and FastThreadLocal. The relevant code is as follows:

public static void test2() throws Exception {
        CountDownLatch cdl = new CountDownLatch(10000);
        ThreadLocal<String> threadLocal = new ThreadLocal<String>();
        long starTime = System.currentTimeMillis();
        for (int i = 0; i < 10000; i++) {
            new Thread(new Runnable() {

                @Override
                public void run() {
                    threadLocal.set(Thread.currentThread().getName());
                    for (int k = 0; k < 100000; k++) {
                        threadLocal.get();
                    }
                    cdl.countDown();
                }
            }, "Thread" + (i + 1)).start();
        }
        cdl.await();
        System.out.println(System.currentTimeMillis() - starTime + "ms");
    }

The code above starts 10,000 threads. Each thread sets its own name on the ThreadLocal and then calls get 100,000 times; a CountDownLatch is used to measure the total elapsed time. The result is about 1,000 ms.
Next, test FastThreadLocal. The code is similar:

public static void test2() throws Exception {
        CountDownLatch cdl = new CountDownLatch(10000);
        FastThreadLocal<String> threadLocal = new FastThreadLocal<String>();
        long starTime = System.currentTimeMillis();
        for (int i = 0; i < 10000; i++) {
            new FastThreadLocalThread(new Runnable() {

                @Override
                public void run() {
                    threadLocal.set(Thread.currentThread().getName());
                    for (int k = 0; k < 100000; k++) {
                        threadLocal.get();
                    }
                    cdl.countDown();
                }
            }, "Thread" + (i + 1)).start();
        }

        cdl.await();
        System.out.println(System.currentTimeMillis() - starTime);
    }

The result is again about 1,000 ms, so in this scenario there is no measurable difference between the two kinds of ThreadLocal. The second scenario is tested below.

2. Multiple ThreadLocals under a single thread

The same test is run against ThreadLocal and FastThreadLocal. The relevant code is as follows:

    public static void test1() throws InterruptedException {
        int size = 10000;
        ThreadLocal<String> tls[] = new ThreadLocal[size];
        for (int i = 0; i < size; i++) {
            tls[i] = new ThreadLocal<String>();
        }
        
        new Thread(new Runnable() {
            @Override
            public void run() {
                long starTime = System.currentTimeMillis();
                for (int i = 0; i < size; i++) {
                    tls[i].set("value" + i);
                }
                for (int i = 0; i < size; i++) {
                    for (int k = 0; k < 100000; k++) {
                        tls[i].get();
                    }
                }
                System.out.println(System.currentTimeMillis() - starTime + "ms");
            }
        }).start();
    }

The code above creates 10,000 ThreadLocals, then uses a single thread to set a value on each one and to call get on each one 100,000 times. The result is about 2,000 ms.
Next, test FastThreadLocal. The code is similar:

    public static void test1() {
        int size = 10000;
        FastThreadLocal<String> tls[] = new FastThreadLocal[size];
        for (int i = 0; i < size; i++) {
            tls[i] = new FastThreadLocal<String>();
        }
        
        new FastThreadLocalThread(new Runnable() {

            @Override
            public void run() {
                long starTime = System.currentTimeMillis();
                for (int i = 0; i < size; i++) {
                    tls[i].set("value" + i);
                }
                for (int i = 0; i < size; i++) {
                    for (int k = 0; k < 100000; k++) {
                        tls[i].get();
                    }
                }
                System.out.println(System.currentTimeMillis() - starTime + "ms");
            }
        }).start();
    }

The result is about 30 ms, a gap of nearly two orders of magnitude. Of course, the difference only shows up with this many ThreadLocals and this many accesses. The rest of this post looks at how ThreadLocal works and why FastThreadLocal is faster.

ThreadLocal mechanism

Since set and get are the methods used most often, let's look at their source code in turn:

    public void set(T value) {
        Thread t = Thread.currentThread();
        ThreadLocalMap map = getMap(t);
        if (map != null)
            map.set(this, value);
        else
            createMap(t, value);
    }
    
    ThreadLocalMap getMap(Thread t) {
        return t.threadLocals;
    }

First, get the current thread, then read its threadLocals field, which is a ThreadLocalMap. If the map is null, a new one is created to hold the value; otherwise the value is stored in the existing map with the current ThreadLocal as the key. ThreadLocalMap.set looks like this:

private void set(ThreadLocal<?> key, Object value) {

            // We don't use a fast path as with get() because it is at
            // least as common to use set() to create new entries as
            // it is to replace existing ones, in which case, a fast
            // path would fail more often than not.

            Entry[] tab = table;
            int len = tab.length;
            int i = key.threadLocalHashCode & (len-1);

            for (Entry e = tab[i];
                 e != null;
                 e = tab[i = nextIndex(i, len)]) {
                ThreadLocal<?> k = e.get();

                if (k == key) {
                    e.value = value;
                    return;
                }

                if (k == null) {
                    replaceStaleEntry(key, value, i);
                    return;
                }
            }

            tab[i] = new Entry(key, value);
            int sz = ++size;
            if (!cleanSomeSlots(i, sz) && sz >= threshold)
                rehash();
        }

In short, ThreadLocalMap stores its entries in an array, much like HashMap. Each ThreadLocal is assigned a threadLocalHashCode when it is created, and the slot is computed as threadLocalHashCode & (len-1), which is equivalent to a modulo because the array length is a power of two. Hash collisions are therefore possible. HashMap handles a collision with array + linked list, whereas ThreadLocalMap uses linear probing: nextIndex steps through the array until it finds the key or an empty slot, which gets slower as collisions accumulate. Now let's look at the get method.

    public T get() {
        Thread t = Thread.currentThread();
        ThreadLocalMap map = getMap(t);
        if (map != null) {
            ThreadLocalMap.Entry e = map.getEntry(this);
            if (e != null) {
                @SuppressWarnings("unchecked")
                T result = (T)e.value;
                return result;
            }
        }
        return setInitialValue();
    }

Likewise, get first obtains the current thread and its ThreadLocalMap, then looks up the value in the ThreadLocalMap with the current ThreadLocal as the key:

        private Entry getEntry(ThreadLocal<?> key) {
            int i = key.threadLocalHashCode & (table.length - 1);
            Entry e = table[i];
            if (e != null && e.get() == key)
                return e;
            else
                return getEntryAfterMiss(key, i, e);
        }
        
        private Entry getEntryAfterMiss(ThreadLocal<?> key, int i, Entry e) {
            Entry[] tab = table;
            int len = tab.length;

            while (e != null) {
                ThreadLocal<?> k = e.get();
                if (k == key)
                    return e;
                if (k == null)
                    expungeStaleEntry(i);
                else
                    i = nextIndex(i, len);
                e = tab[i];
            }
            return null;
        }

As in set, the array index is computed from the hash code. If the entry at that index holds the key, it is returned directly; otherwise the array is traversed from that point on. From this analysis we can draw the following conclusions:
1. The ThreadLocalMap is stored in each Thread, with the ThreadLocal as the key. When multiple threads operate on the same ThreadLocal, each thread simply gets one entry in its own ThreadLocalMap, so there are no collisions.
2. When collisions do occur, ThreadLocalMap resolves them by traversing the array, which hurts performance significantly.
3. FastThreadLocal avoids collisions in a different way, which is where its performance gain comes from.
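
To make point 2 concrete, here is a minimal sketch of linear probing. LinearProbeTable and its methods are hypothetical names invented for this illustration; it is not the JDK's ThreadLocalMap (no weak keys, no stale-entry cleanup, no resizing). It only shows how the number of inspected slots grows when several keys land on the same index.

```java
/**
 * Hypothetical, simplified open-addressing table with linear probing.
 * Not the JDK's ThreadLocalMap: no weak keys, no cleanup, no resizing.
 */
final class LinearProbeTable {
    private final Object[] keys;
    private final Object[] values;

    LinearProbeTable(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    /** Stores the pair and returns the number of slots inspected. */
    int put(Object key, int hash, Object value) {
        int mask = keys.length - 1;
        int i = hash & mask;
        int probes = 1;
        while (keys[i] != null && keys[i] != key) {
            if (probes == keys.length) {
                throw new IllegalStateException("table is full");
            }
            i = (i + 1) & mask; // same idea as nextIndex(i, len)
            probes++;
        }
        keys[i] = key;
        values[i] = value;
        return probes;
    }

    /** Returns the number of slots inspected to find the key, or -1 if absent. */
    int probesToFind(Object key, int hash) {
        int mask = keys.length - 1;
        int i = hash & mask;
        for (int probes = 1; probes <= keys.length; probes++) {
            if (keys[i] == key) {
                return probes;
            }
            if (keys[i] == null) {
                return -1;
            }
            i = (i + 1) & mask;
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbeTable table = new LinearProbeTable(8);
        Object a = new Object(), b = new Object(), c = new Object();
        // Hashes 3, 11 and 19 all map to slot 3 in a table of length 8.
        System.out.println(table.put(a, 3, "a"));
        System.out.println(table.put(b, 11, "b"));
        System.out.println(table.put(c, 19, "c"));
        System.out.println(table.probesToFind(c, 19));
    }
}
```

With three keys colliding on the same slot, the third key costs three probes to insert and three to find again, while a key with no collision costs one.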
Let's continue to see how FastThreadLocal achieves performance optimization.

Why Netty's FastThreadLocal is fast

Netty provides two classes, FastThreadLocal and FastThreadLocalThread; FastThreadLocalThread extends Thread. As before, we look at the source code of the commonly used set and get methods:

    public final void set(V value) {
        if (value != InternalThreadLocalMap.UNSET) {
            set(InternalThreadLocalMap.get(), value);
        } else {
            remove();
        }
    }

    public final void set(InternalThreadLocalMap threadLocalMap, V value) {
        if (value != InternalThreadLocalMap.UNSET) {
            if (threadLocalMap.setIndexedVariable(index, value)) {
                addToVariablesToRemove(threadLocalMap, this);
            }
        } else {
            remove(threadLocalMap);
        }
    }

set first checks whether the value is InternalThreadLocalMap.UNSET (if so, the variable is removed instead). Otherwise the value is stored in an InternalThreadLocalMap, which is obtained as follows:

    public static InternalThreadLocalMap get() {
        Thread thread = Thread.currentThread();
        if (thread instanceof FastThreadLocalThread) {
            return fastGet((FastThreadLocalThread) thread);
        } else {
            return slowGet();
        }
    }

    private static InternalThreadLocalMap fastGet(FastThreadLocalThread thread) {
        InternalThreadLocalMap threadLocalMap = thread.threadLocalMap();
        if (threadLocalMap == null) {
            thread.setThreadLocalMap(threadLocalMap = new InternalThreadLocalMap());
        }
        return threadLocalMap;
    }
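
The two lookup paths in get() above can be tried out without Netty. The sketch below uses hypothetical names (SlotMap, MapCarryingThread, SlotMaps) that are not Netty classes. The fast path reads a plain field on a Thread subclass. The slow path here falls back to a JDK ThreadLocal, which is an assumption of this sketch, since the source of slowGet is not shown above.

```java
/** Hypothetical per-thread container; stands in for InternalThreadLocalMap. */
final class SlotMap {
    final Object[] slots = new Object[32]; // arbitrary size for this sketch
}

/** Hypothetical Thread subclass that carries its map in a plain field. */
class MapCarryingThread extends Thread {
    SlotMap slotMap; // only read and written by the thread itself

    MapCarryingThread(Runnable task) {
        super(task);
    }
}

final class SlotMaps {
    // Assumed fallback for ordinary threads: a regular JDK ThreadLocal.
    private static final ThreadLocal<SlotMap> FALLBACK =
            ThreadLocal.withInitial(SlotMap::new);

    private SlotMaps() {
    }

    /** Returns the calling thread's map, creating it on first use. */
    static SlotMap current() {
        Thread thread = Thread.currentThread();
        if (thread instanceof MapCarryingThread) {
            // Fast path: a field read, no hashing involved.
            MapCarryingThread carrier = (MapCarryingThread) thread;
            if (carrier.slotMap == null) {
                carrier.slotMap = new SlotMap();
            }
            return carrier.slotMap;
        }
        // Slow path: goes through the JDK ThreadLocal lookup.
        return FALLBACK.get();
    }
}
```

Under these assumptions, code running on an ordinary Thread still works, but it pays for the regular ThreadLocal lookup first. That would explain why the tests earlier in this post create FastThreadLocalThread instances rather than plain threads.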

As you can see, the InternalThreadLocalMap is likewise stored in the thread, here a FastThreadLocalThread. The difference is that the slot is not derived from a hash code of the ThreadLocal; instead, the index field of the FastThreadLocal is used directly. The index is assigned when the FastThreadLocal is instantiated:

    private final int index;

    public FastThreadLocal() {
        index = InternalThreadLocalMap.nextVariableIndex();
    }

Then enter the nextVariableIndex method:

    static final AtomicInteger nextIndex = new AtomicInteger();
     
    public static int nextVariableIndex() {
        int index = nextIndex.getAndIncrement();
        if (index < 0) {
            nextIndex.decrementAndGet();
            throw new IllegalStateException("too many thread-local indexed variables");
        }
        return index;
    }
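
The allocation scheme in the snippet above can be tried in isolation. In this minimal sketch, IndexedKey is a hypothetical stand-in for FastThreadLocal that keeps only the index logic; the counter and the overflow check follow the code shown above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for FastThreadLocal; keeps only the index logic. */
final class IndexedKey {
    // One counter shared by every instance, so each key gets a distinct index.
    private static final AtomicInteger NEXT_INDEX = new AtomicInteger();

    final int index;

    IndexedKey() {
        index = nextIndex();
    }

    private static int nextIndex() {
        int index = NEXT_INDEX.getAndIncrement();
        if (index < 0) {
            // The counter wrapped around: undo the increment and fail.
            NEXT_INDEX.decrementAndGet();
            throw new IllegalStateException("too many indexed variables");
        }
        return index;
    }

    public static void main(String[] args) {
        IndexedKey first = new IndexedKey();
        IndexedKey second = new IndexedKey();
        IndexedKey third = new IndexedKey();
        // Keys created one after another receive consecutive indexes.
        System.out.println(first.index + " " + second.index + " " + third.index);
    }
}
```

Because no two keys ever share an index, two variables can never compete for the same array slot, which is what removes the need for any collision handling.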

InternalThreadLocalMap holds a static nextIndex counter that generates array indexes. Because it is static, every FastThreadLocal gets its own unique index, and indexes are handed out consecutively. Now let's see how setIndexedVariable in InternalThreadLocalMap works:

    public boolean setIndexedVariable(int index, Object value) {
        Object[] lookup = indexedVariables;
        if (index < lookup.length) {
            Object oldValue = lookup[index];
            lookup[index] = value;
            return oldValue == UNSET;
        } else {
            expandIndexedVariableTableAndSet(index, value);
            return true;
        }
    }

indexedVariables is an Object array that stores the values, and index is used directly as the array subscript. If index is beyond the end of the array, the array is expanded first. The get method reads just as quickly through the index of the FastThreadLocal:

    public final V get(InternalThreadLocalMap threadLocalMap) {
        Object v = threadLocalMap.indexedVariable(index);
        if (v != InternalThreadLocalMap.UNSET) {
            return (V) v;
        }

        return initialize(threadLocalMap);
    }
    
    public Object indexedVariable(int index) {
        Object[] lookup = indexedVariables;
        return index < lookup.length ? lookup[index] : UNSET;
    }

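Reading directly by subscript is very fast, but it has a cost: space may be wasted. Because indexes are global, a thread's array must be long enough to cover the largest index it touches, even if that thread uses only a few FastThreadLocals.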
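
The space cost is easy to see with a small model of the indexed array. IndexedSlots below is a hypothetical class, not Netty's InternalThreadLocalMap. Its growth policy (round up to the next power of two) and the initial size of 32 used in main are assumptions made for this illustration.

```java
import java.util.Arrays;

/** Hypothetical model of an index-addressed value array. */
final class IndexedSlots {
    static final Object UNSET = new Object();

    private Object[] slots;

    IndexedSlots(int initialCapacity) {
        if (initialCapacity < 1) {
            throw new IllegalArgumentException("initialCapacity must be >= 1");
        }
        slots = new Object[initialCapacity];
        Arrays.fill(slots, UNSET);
    }

    /** Stores the value and returns true if the slot was previously unset. */
    boolean set(int index, Object value) {
        if (index < 0 || index >= (1 << 30)) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        if (index >= slots.length) {
            grow(index);
        }
        Object old = slots[index];
        slots[index] = value;
        return old == UNSET;
    }

    /** O(1) read: a bounds check and an array access. */
    Object get(int index) {
        return index >= 0 && index < slots.length ? slots[index] : UNSET;
    }

    int capacity() {
        return slots.length;
    }

    private void grow(int index) {
        // Assumed policy: smallest power of two strictly greater than index.
        int newCapacity = Integer.highestOneBit(index) << 1;
        Object[] grown = Arrays.copyOf(slots, newCapacity);
        Arrays.fill(grown, slots.length, newCapacity, UNSET);
        slots = grown;
    }

    public static void main(String[] args) {
        IndexedSlots perThread = new IndexedSlots(32);
        // This thread uses a single variable, but its index is 9999.
        perThread.set(9999, "value");
        System.out.println("slots allocated: " + perThread.capacity());
    }
}
```

In this model, a thread that stores one value at index 9999 ends up with 16,384 slots, all but one of them unused. That is the space-for-time trade.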

Summary

From the above analysis, heavy read and write traffic across many ThreadLocals can run into performance problems. FastThreadLocal achieves O(1) reads by trading space for time. One open question remains: why doesn't ThreadLocalMap simply use the same structure as HashMap (array + red-black tree)?
