JVM Source Analysis: the Thread-Local Allocation Buffer (TLAB)

Posted by exoskeleton on Wed, 03 Jul 2019 18:35:43 +0200

Originally published on Jianshu by "Occupy the wolf". Please credit the original source when reprinting.

Background

Before introducing TLAB, consider a question:
When an object is created, memory of the required size must be requested from the heap. If many threads request memory concurrently, a locking mechanism or atomic pointer bumping (CAS) is needed to guarantee that the same block of memory is not handed out twice. Memory allocation is an extremely frequent operation during JVM execution, so this contention is bound to degrade performance.

Therefore, TLAB was introduced in the HotSpot 1.6 implementation.

What is TLAB

TLAB stands for ThreadLocalAllocBuffer: a block of memory private to a thread. When the virtual machine parameter -XX:+UseTLAB is set (it is on by default), each thread requests a memory block of a specified size during thread initialization for its exclusive use. Every thread thus owns a separate buffer, and when it needs to allocate it does so from its own buffer, so there is no contention and allocation efficiency improves greatly. When the buffer's remaining capacity is insufficient, the thread requests a fresh block from the Eden area; only this refill still requires an atomic operation. The core of the technique is to improve memory allocation performance by eliminating the bulk of the atomic operations.
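The fast path can be pictured as a simple bump-the-pointer allocator over a private buffer. Here is a minimal, illustrative sketch; the names MiniTlab and allocate are assumptions for this example, not HotSpot code:

```cpp
#include <cstddef>

// Illustrative sketch of TLAB-style bump-pointer allocation
// (assumed names, not HotSpot code).
struct MiniTlab {
    char* start;  // beginning of the thread-private buffer
    char* top;    // next free byte
    char* end;    // one past the last usable byte

    // Fast path: no lock or CAS is needed, because the buffer
    // belongs to exactly one thread.
    void* allocate(std::size_t size) {
        if (static_cast<std::size_t>(end - top) >= size) {
            void* obj = top;
            top += size;  // bump the pointer
            return obj;
        }
        // Buffer exhausted: the caller must refill from Eden,
        // and only that refill requires an atomic operation.
        return nullptr;
    }
};
```

Within the buffer, allocation is just a comparison and an addition, which is why the per-object cost is so low.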

TLAB implementation

The implementation is at /Users/zhanjun/openjdk/hotspot/src/share/vm/memory/threadLocalAllocBuffer.hpp:

// ThreadLocalAllocBuffer: a descriptor for thread-local storage used by
// the threads for allocation.
//            It is thread-private at any time, but maybe multiplexed over
//            time across multiple threads. The park()/unpark() pair is
//            used to make it avaiable for such multiplexing.
class ThreadLocalAllocBuffer: public CHeapObj<mtThread> {
  friend class VMStructs;
private:
  HeapWord* _start;                              // address of TLAB
  HeapWord* _top;                                // address after last allocation
  HeapWord* _pf_top;                             // allocation prefetch watermark
  HeapWord* _end;                                // allocation end (excluding alignment_reserve)
  size_t    _desired_size;                       // desired size   (including alignment_reserve)
  size_t    _refill_waste_limit;                 // hold onto tlab if free() is larger than this

Each thread allocates a large chunk from Eden, say 100KB, as its own TLAB. _start is the TLAB's starting address, _end is its end, and _top is the current allocation pointer; clearly _start <= _top <= _end.

_desired_size refers to the memory size of TLAB.

_refill_waste_limit is the maximum amount of space allowed to be wasted when a TLAB is retired. Suppose it is 5KB; then in general:
1. If the current TLAB has 96KB allocated and 4KB left, and a new object needs 6KB, the TLAB clearly cannot satisfy the request. Since discarding it wastes only 4KB, which is within _refill_waste_limit, the TLAB can simply be handed back to Eden management and a new TLAB requested.
2. If the current TLAB has 90KB allocated and 10KB left, and a new object needs 11KB, the TLAB again cannot satisfy the request. But 10KB exceeds the limit, so the TLAB cannot simply be discarded; it is kept, and this object is allocated directly from the shared Eden area instead.
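The two cases above boil down to a single comparison. This is an illustrative sketch (the function name is an assumption), mirroring the logic of allocate_from_tlab_slow() analyzed later in this article:

```cpp
#include <cstddef>

// Illustrative retire-or-keep decision (assumed name, not HotSpot code).
bool should_retire_tlab(std::size_t free_bytes, std::size_t refill_waste_limit) {
    // Small remainder: cheap to waste, so retire the TLAB and get a new one.
    // Large remainder: too expensive to waste, so keep the TLAB and send
    // this one allocation to the shared Eden area instead.
    return free_bytes <= refill_waste_limit;
}
```

With a 100KB TLAB and a 5KB limit: 4KB free means retire (case 1); 10KB free means keep the TLAB and allocate in Eden (case 2).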

When a thread created with new Thread() is started from Java code, the following code is eventually triggered:

// The first routine called by a new Java thread
void JavaThread::run() {
  // initialize thread-local alloc buffer related fields
  this->initialize_tlab();
  // used to test validitity of stack trace backs
  this->record_base_of_stack_pointer();
  // Record real stack base and size.
  this->record_stack_base_and_size();
  // Initialize thread local storage; set before calling MutexLocker
  this->initialize_thread_local_storage();
  this->create_stack_guard_pages();
  this->cache_global_variables();

In JavaThread::run(), the first step is the call to this->initialize_tlab(), which initializes the TLAB and is implemented as follows:

void initialize_tlab() {
    if (UseTLAB) {
      tlab().initialize();
    }
  }

Here tlab() returns the thread's ThreadLocalAllocBuffer object, and its initialize() method does the actual work:

void ThreadLocalAllocBuffer::initialize() {
  initialize(NULL,                    // start
             NULL,                    // top
             NULL);                   // end

  set_desired_size(initial_desired_size());

  // Following check is needed because at startup the main (primordial)
  // thread is initialized before the heap is.  The initialization for
  // this thread is redone in startup_initialization below.
  if (Universe::heap() != NULL) {
    size_t capacity   = Universe::heap()->tlab_capacity(myThread()) / HeapWordSize;
    double alloc_frac = desired_size() * target_refills() / (double) capacity;
    _allocation_fraction.sample(alloc_frac);
  }

  set_refill_waste_limit(initial_refill_waste_limit());

  initialize_statistics();
}

1. Set the current TLAB's _desired_size, computed by the initial_desired_size() method;
2. Set the current TLAB's _refill_waste_limit, computed by the initial_refill_waste_limit() method;
3. Initialize the statistics fields, such as _number_of_refills, _fast_refill_waste, _slow_refill_waste, _gc_waste and _slow_allocations.

How the _desired_size field is calculated
size_t ThreadLocalAllocBuffer::initial_desired_size() {
  size_t init_sz;

  if (TLABSize > 0) {
    init_sz = MIN2(TLABSize / HeapWordSize, max_size());
  } else if (global_stats() == NULL) {
    // Startup issue - main thread initialized before heap initialized.
    init_sz = min_size();
  } else {
    // Initial size is a function of the average number of allocating threads.
    unsigned nof_threads = global_stats()->allocating_threads_avg();

    init_sz  = (Universe::heap()->tlab_capacity(myThread()) / HeapWordSize) /
                      (nof_threads * target_refills());
    init_sz = align_object_size(init_sz);
    init_sz = MIN2(MAX2(init_sz, min_size()), max_size());
  }
  return init_sz;
}
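Plugging some numbers into the last branch above makes the formula concrete. All input values here (capacity, thread count, target refills, bounds) are invented for illustration, and object-size alignment is omitted:

```cpp
#include <algorithm>
#include <cstddef>

// Re-expression of the sizing formula above in plain C++; inputs are
// assumed example values, not real HotSpot defaults. Alignment omitted.
std::size_t initial_desired_size_sketch(std::size_t tlab_capacity_words,
                                        unsigned nof_threads,
                                        unsigned target_refills,
                                        std::size_t min_sz,
                                        std::size_t max_sz) {
    // Share the TLAB-eligible heap among all allocating threads, assuming
    // each thread will refill its TLAB target_refills times.
    std::size_t init_sz = tlab_capacity_words / (nof_threads * target_refills);
    // Clamp to [min_sz, max_sz], as MIN2(MAX2(...)) does in the source.
    return std::min(std::max(init_sz, min_sz), max_sz);
}
```

For example, with a capacity of 64M words, 32 allocating threads and 50 target refills, each TLAB starts at 67108864 / (32 * 50) = 41943 words (integer division).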

TLABSize is set to 256K by default in the arguments module and can also be set via a JVM parameter, but even when set explicitly it is capped: the smaller of TLABSize and max_size is used, where max_size is calculated as follows:

const size_t ThreadLocalAllocBuffer::max_size() {
  // TLABs can't be bigger than we can fill with a int[Integer.MAX_VALUE].
  // This restriction could be removed by enabling filling with multiple arrays.
  // If we compute that the reasonable way as
  //    header_size + ((sizeof(jint) * max_jint) / HeapWordSize)
  // we'll overflow on the multiply, so we do the divide first.
  // We actually lose a little by dividing first,
  // but that just makes the TLAB  somewhat smaller than the biggest array,
  // which is fine, since we'll be able to fill that.

  size_t unaligned_max_size = typeArrayOopDesc::header_size(T_INT) +
                              sizeof(jint) *
                              ((juint) max_jint / (size_t) HeapWordSize);
  return align_size_down(unaligned_max_size, MinObjAlignment);
}

So a TLAB can never be larger than what an int[Integer.MAX_VALUE] occupies. Why? When a TLAB is retired, its unused remainder is filled with a dummy object, typically an int array, so that the heap remains parsable for the GC. A single filler array can describe at most Integer.MAX_VALUE elements, so a TLAB bigger than that could not be covered by one filler; as the source comment notes, the restriction could be lifted by filling with multiple arrays.
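With assumed 64-bit constants, the cap from the max_size() code above can be recomputed roughly as follows; the heap word size and int-array header size are assumptions for illustration, not values read from HotSpot:

```cpp
#include <cstddef>
#include <cstdint>

// Rough recomputation of max_size() with assumed 64-bit constants.
std::size_t max_tlab_size_words() {
    const std::size_t HeapWordSize = 8;       // 64-bit heap word (assumed)
    const std::size_t header_words = 2;       // int[] header in words (assumed)
    const std::size_t max_jint = 2147483647;  // Integer.MAX_VALUE
    // Divide before multiplying to avoid overflow, as the HotSpot
    // comment explains; the result is in heap words.
    return header_words + sizeof(std::int32_t) * (max_jint / HeapWordSize);
}
```

Under these assumptions the cap comes out to about 1G heap words, roughly the footprint of the largest possible int array.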

How the _refill_waste_limit field is calculated
size_t initial_refill_waste_limit()  { 
    return desired_size() / TLABRefillWasteFraction; 
}

The calculation logic is simple; TLABRefillWasteFraction defaults to 64, so by default at most 1/64 of a TLAB may be wasted when it is retired.
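Worked through with the numbers mentioned earlier (a 256K TLAB and the default fraction of 64), the limit comes out to 4K:

```cpp
#include <cstddef>

// initial_refill_waste_limit() restated in plain C++; the fraction of 64
// is the default from the text, the 256K desired size is an example.
std::size_t initial_refill_waste_limit_sketch(std::size_t desired_size) {
    const std::size_t TLABRefillWasteFraction = 64;  // default
    return desired_size / TLABRefillWasteFraction;
}
```

A 256K TLAB may therefore waste at most 256K / 64 = 4K when it is retired.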

Memory allocation

Now suppose we new an object needing 1KB; let's see, step by step, how it is allocated.

instanceOop instanceKlass::allocate_instance(TRAPS) {
  assert(!oop_is_instanceMirror(), "wrong allocation path");
  bool has_finalizer_flag = has_finalizer(); // Query before possible GC
  int size = size_helper();  // Query before forming handle.
  KlassHandle h_k(THREAD, as_klassOop());
  instanceOop i;
  i = (instanceOop)CollectedHeap::obj_allocate(h_k, size, CHECK_NULL);
  if (has_finalizer_flag && !RegisterFinalizersAtInit) {
    i = register_finalizer(i, CHECK_NULL);
  }
  return i;
}

The entry point for object memory allocation is instanceKlass::allocate_instance(), which delegates the heap allocation to CollectedHeap::obj_allocate():

oop CollectedHeap::obj_allocate(KlassHandle klass, int size, TRAPS) {
  debug_only(check_for_valid_allocation_state());
  assert(!Universe::heap()->is_gc_active(), "Allocation during gc not allowed");
  assert(size >= 0, "int won't convert to size_t");
  HeapWord* obj = common_mem_allocate_init(klass, size, CHECK_NULL);
  post_allocation_setup_obj(klass, obj);
  NOT_PRODUCT(Universe::heap()->check_for_bad_heap_word_value(obj, size));
  return (oop)obj;
}

Where the common_mem_allocate_init() method eventually calls the CollectedHeap::common_mem_allocate_noinit() method, as follows:

HeapWord* CollectedHeap::common_mem_allocate_noinit(KlassHandle klass, size_t size, TRAPS) {

  // Clear unhandled oops for memory allocation.  Memory allocation might
  // not take out a lock if from tlab, so clear here.
  CHECK_UNHANDLED_OOPS_ONLY(THREAD->clear_unhandled_oops();)

  if (HAS_PENDING_EXCEPTION) {
    NOT_PRODUCT(guarantee(false, "Should not allocate with exception pending"));
    return NULL;  // caller does a CHECK_0 too
  }

  HeapWord* result = NULL;
  if (UseTLAB) {
    result = allocate_from_tlab(klass, THREAD, size);
    if (result != NULL) {
      assert(!HAS_PENDING_EXCEPTION,
             "Unexpected exception, will result in uninitialized storage");
      return result;
    }
  }
  bool gc_overhead_limit_was_exceeded = false;
  result = Universe::heap()->mem_allocate(size,
                                          &gc_overhead_limit_was_exceeded);

The value of UseTLAB decides whether to attempt the allocation in the TLAB. Unless UseTLAB has been explicitly disabled via a JVM parameter, allocate_from_tlab() is called to attempt a TLAB allocation; the attempt can fail, for example when the TLAB's remaining capacity is insufficient. The implementation of allocate_from_tlab():

HeapWord* CollectedHeap::allocate_from_tlab(KlassHandle klass, Thread* thread, size_t size) {
  assert(UseTLAB, "should use UseTLAB");

  HeapWord* obj = thread->tlab().allocate(size);
  if (obj != NULL) {
    return obj;
  }
  // Otherwise...
  return allocate_from_tlab_slow(klass, thread, size);
}

As the implementation shows, ThreadLocalAllocBuffer's allocate() method is tried first; if it returns NULL, allocate_from_tlab_slow() is called instead. As its name suggests, that is the slower allocation path.

The allocate method of ThreadLocalAllocBuffer is implemented as follows:

inline HeapWord* ThreadLocalAllocBuffer::allocate(size_t size) {
  invariants();
  HeapWord* obj = top();
  if (pointer_delta(end(), obj) >= size) {
    // successful thread-local allocation
#ifdef ASSERT
    // Skip mangling the space corresponding to the object header to
    // ensure that the returned space is not considered parsable by
    // any concurrent GC thread.
    size_t hdr_size = oopDesc::header_size();
    Copy::fill_to_words(obj + hdr_size, size - hdr_size, badHeapWordVal);
#endif // ASSERT
    // This addition is safe because we know that top is
    // at least size below end, so the add can't wrap.
    set_top(obj + size);

    invariants();
    return obj;
  }
  return NULL;
}

Whether the allocation succeeds is determined by checking that the current TLAB's remaining capacity (the distance from top to end) is at least the requested size; if so, top is simply bumped past the new object. If the remaining capacity is insufficient, NULL is returned, indicating that the fast-path allocation failed.

The slow path, allocate_from_tlab_slow(), is implemented as follows:

HeapWord* CollectedHeap::allocate_from_tlab_slow(KlassHandle klass, Thread* thread, size_t size) {

  // Retain tlab and allocate object in shared space if
  // the amount free in the tlab is too large to discard.
  if (thread->tlab().free() > thread->tlab().refill_waste_limit()) {
    thread->tlab().record_slow_allocation(size);
    return NULL;
  }

  // Discard tlab and allocate a new one.
  // To minimize fragmentation, the last TLAB may be smaller than the rest.
  size_t new_tlab_size = thread->tlab().compute_size(size);

  thread->tlab().clear_before_allocation();

  if (new_tlab_size == 0) {
    return NULL;
  }

  // Allocate a new TLAB...
  HeapWord* obj = Universe::heap()->allocate_new_tlab(new_tlab_size);
  if (obj == NULL) {
    return NULL;
  }

 // Some code was removed
  return obj;
}

1. If the current TLAB's remaining capacity is greater than the waste threshold, the TLAB is kept: NULL is returned, the object will be allocated directly in the shared Eden area, and the slow allocation is recorded;
2. If the remaining capacity is at most the waste threshold, the current TLAB can be discarded;
3. A new TLAB is then requested from the Eden area via the allocate_new_tlab() method:

HeapWord* GenCollectedHeap::allocate_new_tlab(size_t size) {
  bool gc_overhead_limit_was_exceeded;
  return collector_policy()->mem_allocate_work(size /* size */,
                                               true /* is_tlab */,
                                               &gc_overhead_limit_was_exceeded);
}

The memory request is ultimately served through the current heap's collector policy via mem_allocate_work().
