So far in this series, we have learned about Resilience4j and its Retry, RateLimiter, and TimeLimiter modules. In this article, we will explore the Bulkhead module. We'll see what problem it solves, when and how to use it, and look at a few examples.
Code example
This article is accompanied by a working code example on GitHub.
What is Resilience4j?
Please refer to the description in the previous article for a quick overview of how Resilience4j works in general.
What is fault isolation?
A few years ago, we had a production issue where one of the servers stopped responding to health checks and the load balancer took the server out of the pool.
Just as we began investigating the issue, there was a second alert: another server had stopped responding to health checks and had also been taken out of the pool.
Within a few minutes, every server had stopped responding to health checks and our service was completely down.
We were using Redis to cache some data for a couple of features supported by the application. As we found out later, the Redis cluster had a problem at around the same time and had stopped accepting new connections. We were using the Jedis library to connect to Redis, and that library's default behavior was to block the calling thread indefinitely until a connection was established.
Our service was hosted on Tomcat, and its default request-handling thread pool size was 200 threads. So every request that went through a code path that connected to Redis ended up blocking the thread indefinitely.
Within minutes, all 2000 threads across the cluster had blocked indefinitely - there were no free threads left to even respond to the load balancer's health checks.
The service itself supported several features, and not all of them required access to the Redis cache. But when this one area had a problem, it ended up impacting the entire service.
This is exactly the problem that fault isolation solves: it prevents a problem in one area of the service from affecting the entire service.
While what happened to our service was an extreme example, we can see how a slow upstream dependency can impact an unrelated area of the calling service.
If we had had a limit of, say, 20 concurrent requests to Redis set on each of the server instances, only those threads would have been affected when the Redis connectivity issue occurred. The remaining request-handling threads could have continued serving other requests.
The idea behind fault isolation is to set a limit on the number of concurrent calls we make to a remote service. We treat calls to different remote services as different, isolated pools and set a limit on how many calls can be made concurrently in each.
The term bulkhead itself comes from its usage in ships, where the bottom portion of a ship is divided into sections separated from each other. If there is a breach and water starts flowing in, only that section gets filled with water. This prevents the entire ship from sinking.
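To make the idea concrete, here is a minimal, library-free sketch of a bulkhead built on a plain java.util.concurrent.Semaphore. The class and method names are illustrative, not part of Resilience4j:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Minimal sketch of the bulkhead idea: at most maxConcurrentCalls callers
// may be inside the remote call at the same time.
class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T execute(Supplier<T> remoteCall) {
        if (!permits.tryAcquire()) {
            // Equivalent of a BulkheadFullException with maxWaitDuration = 0
            throw new IllegalStateException("Bulkhead is full");
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release(); // free the permit for waiting callers
        }
    }
}
```

Threads that cannot acquire a permit fail fast instead of piling up, which is exactly what would have saved the Tomcat threads in the Redis incident above.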
Resilience4j bulkhead concept
resilience4j-bulkhead works in the same way as the other Resilience4j modules. We provide it the code we want to execute as a functional construct - a lambda expression that makes a remote call, or a Supplier of some value retrieved from a remote service, etc. - and the bulkhead decorates it with code to control the number of concurrent calls.
Resilience4j provides two types of bulkheads - SemaphoreBulkhead and ThreadPoolBulkhead.
SemaphoreBulkhead internally uses java.util.concurrent.Semaphore to control the number of concurrent calls and executes our code on the current thread.
ThreadPoolBulkhead uses a thread from a thread pool to execute our code. It internally uses a java.util.concurrent.ArrayBlockingQueue and a java.util.concurrent.ThreadPoolExecutor to control the number of concurrent calls.
SemaphoreBulkhead
Let's look at the configurations associated with the semaphore bulkhead and what they mean.
maxConcurrentCalls determines the maximum number of concurrent calls we can make to the remote service. We can think of this value as the number of permits that the semaphore is initialized with.
Any thread which attempts to call the remote service over this limit can either get a BulkheadFullException immediately or wait some time for a permit to be released by another thread. This is determined by the maxWaitDuration value.
When there are multiple threads waiting for permits, the fairCallHandlingEnabled configuration determines whether the waiting threads acquire permits in first-in, first-out order.
Finally, the writableStackTraceEnabled configuration lets us reduce the amount of information in the stack trace when a BulkheadFullException occurs. This can be useful because without it, our logs could get filled with a lot of similar information when the exception occurs multiple times. Usually, when reading logs, just knowing that a BulkheadFullException has occurred is enough.
ThreadPoolBulkhead
coreThreadPoolSize, maxThreadPoolSize, keepAliveDuration, and queueCapacity are the main configurations associated with the ThreadPoolBulkhead. ThreadPoolBulkhead internally uses these configurations to construct a ThreadPoolExecutor.
The internal ThreadPoolExecutor executes incoming tasks using one of the available, free threads. If no thread is free to execute an incoming task, the task is enqueued for execution later when a thread becomes available. If the queueCapacity has been reached, then the remote call is rejected with a BulkheadFullException.
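The mechanics described above can be sketched with the JDK classes that ThreadPoolBulkhead builds on. This is an illustration of the queuing behavior, not Resilience4j's actual internal code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BulkheadPoolSketch {
    // A fixed pool with a bounded queue: up to maxSize tasks run concurrently,
    // up to queueCapacity tasks wait, and anything beyond that is rejected.
    static ThreadPoolExecutor build(int coreSize, int maxSize, int queueCapacity) {
        return new ThreadPoolExecutor(
            coreSize, maxSize,
            20, TimeUnit.MILLISECONDS,                // keep-alive for threads above coreSize
            new ArrayBlockingQueue<>(queueCapacity)); // bounded queue of waiting tasks
        // The default rejection handler throws RejectedExecutionException when both
        // the pool and the queue are full; ThreadPoolBulkhead surfaces that
        // condition as a BulkheadFullException instead.
    }
}
```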
ThreadPoolBulkhead also has a writableStackTraceEnabled configuration to control the amount of information in the stack trace of BulkheadFullException.
Using the Resilience4j bulkhead module
Let's see how to use the various features available in the resilience4j-bulkhead module.
We will use the same example as the previous articles in this series. Assume that we are building a website for an airline to allow its customers to search for and book flights. Our service talks to a remote service encapsulated by the class FlightSearchService.
SemaphoreBulkhead
When using the semaphore-based bulkhead, BulkheadRegistry, BulkheadConfig, and Bulkhead are the main abstractions we work with.
BulkheadRegistry is a factory for creating and managing Bulkhead objects.
BulkheadConfig encapsulates the maxConcurrentCalls, maxWaitDuration, writableStackTraceEnabled, and fairCallHandlingEnabled configurations. Each Bulkhead object is associated with a BulkheadConfig.
The first step is to create a BulkheadConfig:
BulkheadConfig config = BulkheadConfig.ofDefaults();
This creates a BulkheadConfig with default values for maxConcurrentCalls (25), maxWaitDuration (0s), writableStackTraceEnabled (true), and fairCallHandlingEnabled (true).
Let's say we want to limit the number of concurrent calls to 2 and that we are willing to wait 2s for a thread to acquire a permit:
BulkheadConfig config = BulkheadConfig.custom()
  .maxConcurrentCalls(2)
  .maxWaitDuration(Duration.ofSeconds(2))
  .build();
Then we create a Bulkhead:
BulkheadRegistry registry = BulkheadRegistry.of(config);
Bulkhead bulkhead = registry.bulkhead("flightSearchService");
Now let's express our code to run a flight search as a Supplier and decorate it with the bulkhead:
Supplier<List<Flight>> flightsSupplier = 
  () -> service.searchFlightsTakingOneSecond(request);
Supplier<List<Flight>> decoratedFlightsSupplier = 
  Bulkhead.decorateSupplier(bulkhead, flightsSupplier);
Finally, let's call the decorated operation a few times to understand how the bulkhead works. We can use CompletableFuture to simulate concurrent flight search requests from users:
for (int i=0; i<4; i++) {
  CompletableFuture
    .supplyAsync(decoratedFlightsSupplier)
    .thenAccept(flights -> System.out.println("Received results"));
}
The timestamps and thread names in the output show that of the four concurrent requests, the first two requests went through immediately:
Searching for flights; current time = 11:42:13 187; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 11:42:13 187; current thread = ForkJoinPool.commonPool-worker-5
Flight search successful at 11:42:13 226
Flight search successful at 11:42:13 226
Received results
Received results
Searching for flights; current time = 11:42:14 239; current thread = ForkJoinPool.commonPool-worker-9
Searching for flights; current time = 11:42:14 239; current thread = ForkJoinPool.commonPool-worker-7
Flight search successful at 11:42:14 239
Flight search successful at 11:42:14 239
Received results
Received results
The third and fourth requests were able to acquire permits only 1s later, after the previous requests had completed.
If a thread is not able to acquire a permit within the 2s maxWaitDuration we specified, a BulkheadFullException is thrown:
Caused by: io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
	at io.github.resilience4j.bulkhead.BulkheadFullException.createBulkheadFullException(BulkheadFullException.java:49)
	at io.github.resilience4j.bulkhead.internal.SemaphoreBulkhead.acquirePermission(SemaphoreBulkhead.java:164)
	at io.github.resilience4j.bulkhead.Bulkhead.lambda$decorateSupplier$5(Bulkhead.java:194)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
	... 6 more
Apart from the first line, the other lines in the stack trace don't add much value. If the BulkheadFullException occurs multiple times, these stack trace lines would keep repeating in our log files.
We can reduce the amount of information generated in the stack trace by setting the writableStackTraceEnabled configuration to false:
BulkheadConfig config = BulkheadConfig.custom()
  .maxConcurrentCalls(2)
  .maxWaitDuration(Duration.ofSeconds(1))
  .writableStackTraceEnabled(false)
  .build();
Now, when a BulkheadFullException occurs, only a single line is present in the stack trace:
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-5
io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
Flight search successful at 12:27:58 699
Flight search successful at 12:27:58 699
Received results
Received results
Similar to the other Resilience4j modules we have seen, the Bulkhead also provides additional methods like decorateCheckedSupplier(), decorateCompletionStage(), decorateRunnable(), decorateConsumer(), etc., so we can provide our code in constructs other than a Supplier.
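For example, a task that returns no value can be decorated with decorateRunnable() in the same way. This is a small sketch; the helper class name is illustrative:

```java
import io.github.resilience4j.bulkhead.Bulkhead;

class RunnableDecorationExample {
    // Wraps a no-result task with the same concurrency limit a Supplier would get.
    static Runnable decorate(Bulkhead bulkhead, Runnable task) {
        return Bulkhead.decorateRunnable(bulkhead, task);
    }
}
```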
ThreadPoolBulkhead
When using the thread pool-based bulkhead, ThreadPoolBulkheadRegistry, ThreadPoolBulkheadConfig, and ThreadPoolBulkhead are the main abstractions we work with.
ThreadPoolBulkheadRegistry is a factory for creating and managing ThreadPoolBulkhead objects.
ThreadPoolBulkheadConfig encapsulates the coreThreadPoolSize, maxThreadPoolSize, keepAliveDuration, and queueCapacity configurations. Each ThreadPoolBulkhead object is associated with a ThreadPoolBulkheadConfig.
The first step is to create a ThreadPoolBulkheadConfig:
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.ofDefaults();
This creates a ThreadPoolBulkheadConfig with default values for coreThreadPoolSize (number of available processors - 1), maxThreadPoolSize (number of available processors), keepAliveDuration (20ms), and queueCapacity (100).
Suppose we want to limit the number of concurrent calls to 2:
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .build();
Then we create a ThreadPoolBulkhead:
ThreadPoolBulkheadRegistry registry = ThreadPoolBulkheadRegistry.of(config);
ThreadPoolBulkhead bulkhead = registry.bulkhead("flightSearchService");
Now let's express our code to run a flight search as a Supplier and decorate it with the bulkhead:
Supplier<List<Flight>> flightsSupplier = 
  () -> service.searchFlightsTakingOneSecond(request);
Supplier<CompletionStage<List<Flight>>> decoratedFlightsSupplier = 
  ThreadPoolBulkhead.decorateSupplier(bulkhead, flightsSupplier);
Unlike SemaphoreBulkhead.decorateSupplier(), which returned a Supplier<List<Flight>>, ThreadPoolBulkhead.decorateSupplier() returns a Supplier<CompletionStage<List<Flight>>>. This is because ThreadPoolBulkhead does not execute the code synchronously on the current thread.
Finally, let's call the decorated operation a few times to understand how the bulkhead works:
for (int i=0; i<3; i++) {
  decoratedFlightsSupplier
    .get()
    .whenComplete((r,t) -> {
      if (r != null) {
        System.out.println("Received results");
      }
      if (t != null) {
        t.printStackTrace();
      }
    });
}
The timestamps and thread names in the output show that while the first two requests executed immediately, the third request was queued and executed later by one of the threads that freed up:
Searching for flights; current time = 16:15:00 097; current thread = bulkhead-flightSearchService-1
Searching for flights; current time = 16:15:00 097; current thread = bulkhead-flightSearchService-2
Flight search successful at 16:15:00 136
Flight search successful at 16:15:00 135
Received results
Received results
Searching for flights; current time = 16:15:01 151; current thread = bulkhead-flightSearchService-2
Flight search successful at 16:15:01 151
Received results
If there are no free threads and no capacity left in the queue, a BulkheadFullException is thrown:
Exception in thread "main" io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
	at io.github.resilience4j.bulkhead.BulkheadFullException.createBulkheadFullException(BulkheadFullException.java:64)
	at io.github.resilience4j.bulkhead.internal.FixedThreadPoolBulkhead.submit(FixedThreadPoolBulkhead.java:157)
... other lines omitted ...
We can use the writableStackTraceEnabled configuration to reduce the amount of information generated in the stack trace:
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .writableStackTraceEnabled(false)
  .build();
Now, when a BulkheadFullException occurs, only a single line is present in the stack trace:
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-5
io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
Flight search successful at 12:27:58 699
Flight search successful at 12:27:58 699
Received results
Received results
Context propagation
Sometimes we store data in a ThreadLocal variable and read it in different areas of the code. We do this to avoid explicitly passing data as parameters between method chains, especially when the value is not directly related to the core business logic we are implementing.
For example, we might want to record the current user ID or transaction ID or a request tracking ID in each log statement to make it easier to search the log. Using ThreadLocal is a useful technique for such scenarios.
When using ThreadPoolBulkhead, since our code is not executed on the current thread, the data we stored in ThreadLocal variables will not be available in the executing thread.
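This is not specific to Resilience4j; it is how ThreadLocal behaves with any thread pool, as this plain-JDK sketch shows (the class name and tracking id value are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalLossDemo {
    static final ThreadLocal<String> TRACKING_ID = new ThreadLocal<>();

    // Sets a value on the calling thread, then reads it from a pool thread.
    static String readFromPoolThread() throws Exception {
        TRACKING_ID.set("abc-123"); // visible only to the current thread
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // The pool thread has its own (empty) ThreadLocal slot, so this is null.
            return pool.submit(TRACKING_ID::get).get();
        } finally {
            pool.shutdown();
        }
    }
}
```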
Let's look at an example to understand this problem. First, we define a RequestTrackingIdHolder class, a wrapper class around ThreadLocal:
class RequestTrackingIdHolder {
  static ThreadLocal<String> threadLocal = new ThreadLocal<>();

  static String getRequestTrackingId() {
    return threadLocal.get();
  }

  static void setRequestTrackingId(String id) {
    if (threadLocal.get() != null) {
      threadLocal.set(null);
      threadLocal.remove();
    }
    threadLocal.set(id);
  }

  static void clear() {
    threadLocal.set(null);
    threadLocal.remove();
  }
}
The static methods make it easy to set and get the value stored on the ThreadLocal. Next, we set a request tracking id before calling the bulkhead-decorated flight search operation:
for (int i=0; i<2; i++) {
  String trackingId = UUID.randomUUID().toString();
  System.out.println("Setting trackingId " + trackingId + " on parent, main thread before calling flight search");
  RequestTrackingIdHolder.setRequestTrackingId(trackingId);
  decoratedFlightsSupplier
    .get()
    .whenComplete((r,t) -> {
      // other lines omitted
    });
}
The sample output shows that this value was not available in the bulkhead-managed thread:
Setting trackingId 98ff99df-466a-47f7-88f7-5e31fc8fcb6b on parent, main thread before calling flight search
Setting trackingId 6b98d73c-a590-4a20-b19d-c85fea783caf on parent, main thread before calling flight search
Searching for flights; current time = 19:53:53 799; current thread = bulkhead-flightSearchService-1; Request Tracking Id = null
Flight search successful at 19:53:53 824
Received results
Searching for flights; current time = 19:53:54 836; current thread = bulkhead-flightSearchService-1; Request Tracking Id = null
Flight search successful at 19:53:54 836
Received results
To solve this problem, ThreadPoolBulkhead provides a ContextPropagator. ContextPropagator is an abstraction for retrieving, copying, and cleaning up values across thread boundaries. It defines an interface with methods to get a value from the current thread (retrieve()), copy it to the new executing thread (copy()), and finally clean it up on the executing thread (clear()).
Let's implement a RequestTrackingIdPropagator:
class RequestTrackingIdPropagator implements ContextPropagator {
  @Override
  public Supplier<Optional> retrieve() {
    System.out.println("Getting request tracking id from thread: " + Thread.currentThread().getName());
    return () -> Optional.of(RequestTrackingIdHolder.getRequestTrackingId());
  }

  @Override
  public Consumer<Optional> copy() {
    return optional -> {
      System.out.println("Setting request tracking id " + optional.get() + " on thread: " + Thread.currentThread().getName());
      optional.ifPresent(s -> RequestTrackingIdHolder.setRequestTrackingId(s.toString()));
    };
  }

  @Override
  public Consumer<Optional> clear() {
    return optional -> {
      System.out.println("Clearing request tracking id on thread: " + Thread.currentThread().getName());
      optional.ifPresent(s -> RequestTrackingIdHolder.clear());
    };
  }
}
We provide the ContextPropagator to the ThreadPoolBulkhead by setting it on the ThreadPoolBulkheadConfig:
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .contextPropagator(new RequestTrackingIdPropagator())
  .build();
Now, the sample output shows that the request tracking id was made available in the bulkhead-managed thread:
Setting trackingId 71d44cb8-dab6-4222-8945-e7fd023528ba on parent, main thread before calling flight search
Getting request tracking id from thread: main
Setting trackingId 5f9dd084-f2cb-4a20-804b-038828abc161 on parent, main thread before calling flight search
Getting request tracking id from thread: main
Setting request tracking id 71d44cb8-dab6-4222-8945-e7fd023528ba on thread: bulkhead-flightSearchService-1
Searching for flights; current time = 20:07:56 508; current thread = bulkhead-flightSearchService-1; Request Tracking Id = 71d44cb8-dab6-4222-8945-e7fd023528ba
Flight search successful at 20:07:56 538
Clearing request tracking id on thread: bulkhead-flightSearchService-1
Received results
Setting request tracking id 5f9dd084-f2cb-4a20-804b-038828abc161 on thread: bulkhead-flightSearchService-1
Searching for flights; current time = 20:07:57 542; current thread = bulkhead-flightSearchService-1; Request Tracking Id = 5f9dd084-f2cb-4a20-804b-038828abc161
Flight search successful at 20:07:57 542
Clearing request tracking id on thread: bulkhead-flightSearchService-1
Received results
Bulkhead events
Both Bulkhead and ThreadPoolBulkhead have an EventPublisher to generate the following types of events:
- BulkheadOnCallPermittedEvent
- BulkheadOnCallRejectedEvent and
- BulkheadOnCallFinishedEvent
We can listen to these events and record them, for example:
Bulkhead bulkhead = registry.bulkhead("flightSearchService");
bulkhead.getEventPublisher().onCallPermitted(e -> System.out.println(e.toString()));
bulkhead.getEventPublisher().onCallFinished(e -> System.out.println(e.toString()));
bulkhead.getEventPublisher().onCallRejected(e -> System.out.println(e.toString()));
The sample output shows the contents of the record:
2020-08-26T12:27:39.790435: Bulkhead 'flightSearch' permitted a call.
... other lines omitted ...
2020-08-26T12:27:40.290987: Bulkhead 'flightSearch' rejected a call.
... other lines omitted ...
2020-08-26T12:27:41.094866: Bulkhead 'flightSearch' has finished a call.
Bulkhead metrics
SemaphoreBulkhead
Bulkhead exposes two metrics:
- the maximum number of available permissions (resilience4j.bulkhead.max.allowed.concurrent.calls), and
- the number of available permissions (resilience4j.bulkhead.available.concurrent.calls).
The bulkhead.max.allowed metric is the same as the maxConcurrentCalls we configure on the BulkheadConfig.
First, we create BulkheadConfig, BulkheadRegistry, and Bulkhead as before. Then, we create a MeterRegistry and bind BulkheadRegistry to it:
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedBulkheadMetrics.ofBulkheadRegistry(registry)
  .bindTo(meterRegistry);
After running the bulkhead-decorated operation a few times, we display the captured metrics:
Consumer<Meter> meterConsumer = meter -> {
  String desc = meter.getId().getDescription();
  String metricName = meter.getId().getName();
  Double metricValue = StreamSupport.stream(meter.measure().spliterator(), false)
    .filter(m -> m.getStatistic().name().equals("VALUE"))
    .findFirst()
    .map(m -> m.getValue())
    .orElse(0.0);
  System.out.println(desc + " - " + metricName + ": " + metricValue);
};
meterRegistry.forEachMeter(meterConsumer);
Here are some sample outputs:
The maximum number of available permissions - resilience4j.bulkhead.max.allowed.concurrent.calls: 8.0
The number of available permissions - resilience4j.bulkhead.available.concurrent.calls: 3.0
ThreadPoolBulkhead
ThreadPoolBulkhead exposes five metrics:
- the current length of the queue (resilience4j.bulkhead.queue.depth),
- the current size of the thread pool (resilience4j.bulkhead.thread.pool.size),
- the core and maximum sizes of the thread pool (resilience4j.bulkhead.core.thread.pool.size and resilience4j.bulkhead.max.thread.pool.size), and
- the capacity of the queue (resilience4j.bulkhead.queue.capacity).
First, we create ThreadPoolBulkheadConfig, ThreadPoolBulkheadRegistry, and ThreadPoolBulkhead as before. Then, we create a MeterRegistry and bind the ThreadPoolBulkheadRegistry to it:
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedThreadPoolBulkheadMetrics.ofThreadPoolBulkheadRegistry(registry)
  .bindTo(meterRegistry);
After running the bulkhead-decorated operation a few times, we display the captured metrics:
The queue capacity - resilience4j.bulkhead.queue.capacity: 5.0
The queue depth - resilience4j.bulkhead.queue.depth: 1.0
The thread pool size - resilience4j.bulkhead.thread.pool.size: 5.0
The maximum thread pool size - resilience4j.bulkhead.max.thread.pool.size: 5.0
The core thread pool size - resilience4j.bulkhead.core.thread.pool.size: 3.0
In a real application, we would export this data to a monitoring system periodically and analyze it on a dashboard.
Pitfalls and good practices when implementing bulkheads
Make the bulkhead a singleton
All calls to a given remote service should pass through the same Bulkhead instance. For a given remote service, Bulkhead must be a singleton.
If we don't enforce this, some areas of our codebase may call the remote service directly, bypassing the Bulkhead. To prevent this, the actual call to the remote service should be in a core, internal layer, and other areas should use the bulkhead decorator exposed by that internal layer.
How can we ensure that a new developer understands this intent in the future? Check out Tom's article, which shows one way of solving such problems: organizing the package structure to make such intents clear. Additionally, it shows how to enforce the intent by codifying it in ArchUnit tests.
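As a sketch of this structure, a client class can own the single concurrency limit and keep the remote call private so it cannot be bypassed. In a real codebase the Semaphore here would be a Resilience4j Bulkhead; the class, method names, and return values are illustrative:

```java
import java.util.concurrent.Semaphore;

// The only class allowed to talk to the remote flight-search service.
// Every caller must go through search(), so all calls share one limit.
final class FlightSearchClient {
    private static final FlightSearchClient INSTANCE = new FlightSearchClient();
    private final Semaphore permits = new Semaphore(20); // one shared limit

    private FlightSearchClient() {}

    static FlightSearchClient getInstance() {
        return INSTANCE;
    }

    String search(String request) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("Bulkhead full");
        }
        try {
            return remoteSearch(request); // the only call site of the remote service
        } finally {
            permits.release();
        }
    }

    private String remoteSearch(String request) {
        return "flights for " + request; // stands in for the real remote call
    }
}
```

Because the constructor is private and remoteSearch() is private, no other area of the codebase can reach the remote service without passing through the shared limit.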
Combining with other Resilience4j modules
It is more effective to combine a bulkhead with one or more of the other Resilience4j modules like retry and rate limiter. For example, we may want to retry after some delay if there is a BulkheadFullException.
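As a sketch (assuming resilience4j-retry is also on the classpath), a retry that triggers only on BulkheadFullException can wrap the bulkhead-decorated supplier like this; the helper class and retry settings are illustrative:

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadFullException;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.time.Duration;
import java.util.function.Supplier;

class RetryingBulkheadExample {
    static Supplier<String> decorate(Bulkhead bulkhead, Supplier<String> flightsSupplier) {
        // Retry up to 3 times, waiting 1s between attempts, but only when
        // the bulkhead rejected the call.
        RetryConfig retryConfig = RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofSeconds(1))
            .retryExceptions(BulkheadFullException.class)
            .build();
        Retry retry = Retry.of("flightSearchRetry", retryConfig);

        // Order matters: the bulkhead guards the remote call,
        // and the retry wraps the bulkhead.
        Supplier<String> withBulkhead = Bulkhead.decorateSupplier(bulkhead, flightsSupplier);
        return Retry.decorateSupplier(retry, withBulkhead);
    }
}
```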
Conclusion
In this article, we learned how to use the Bulkhead module of Resilience4j to set limits on our concurrent calls to remote services. We learned why this is important and saw some practical examples of how to configure it.
You can play around with a complete application demonstrating these ideas using the code on GitHub.
This article is translated from: Implementing Bulkhead with Resilience4j - Reflectoring