Spring Cloud upgrade road - Hoxton - 4. Use Resilience4j to achieve instance-level isolation and circuit breaking

Posted by cedricm on Fri, 05 Jun 2020 05:40:36 +0200

Why Hystrix is not enabled

Because we replaced @SpringCloudApplication with @SpringBootApplication on our entry class, Spring Cloud Circuit Breaker is not enabled, so the Hystrix dependency that gets pulled in has no effect. For details, see Section 2 of this series: The way to upgrade Spring Cloud - Hoxton - 2. Annotation modification of entry class and transformation of OpenFeign

Using Resilience4j to achieve instance-level isolation and circuit breaking

Why do we need instance-level circuit breaking? Some instances of a microservice may be temporarily unavailable, and when we retry we do not want to hit those instances again. Spring Cloud Circuit Breaker by default breaks at the microservice level, so when only some instances of a microservice are unavailable, the breaker is still likely to open for the whole microservice. A typical case is a rolling release: if it is handled badly, a microservice-level breaker makes the whole microservice unavailable even though some of its instances are healthy. So we need instance-level circuit breaking rather than microservice-level circuit breaking.

Why do we need instance-level thread isolation? To prevent a single faulty or slow instance from blocking all business threads.

Spring Cloud Circuit Breaker exposes only a limited part of Resilience4j, and we want to use more of its features, such as thread isolation. Moreover, Spring Cloud Circuit Breaker makes microservice-level circuit breaking easy but instance-level circuit breaking hard: its configuration is keyed by microservice name and offers no extension point, so implementing it would mean modifying code in too many places. So we abandoned Spring Cloud Circuit Breaker.

Fortunately, Resilience4j officially provides its own Spring Cloud starter, which auto-configures the core beans for all of its features and is easy to use. We use this starter and its configuration to implement instance-level isolation and circuit breaking.

Adding the dependency

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-cloud2</artifactId>
    <version>${resilience4j-spring-cloud2.version}</version>
</dependency>

After that, the following beans are registered automatically: BulkheadRegistry, ThreadPoolBulkheadRegistry, CircuitBreakerRegistry, RateLimiterRegistry, and RetryRegistry. Their configuration properties classes and prefixes are:

  • io.github.resilience4j.bulkhead.autoconfigure.BulkheadProperties : prefix resilience4j.bulkhead
  • io.github.resilience4j.bulkhead.autoconfigure.ThreadPoolBulkheadProperties : prefix resilience4j.thread-pool-bulkhead
  • io.github.resilience4j.circuitbreaker.autoconfigure.CircuitBreakerProperties : prefix resilience4j.circuitbreaker
  • io.github.resilience4j.ratelimiter.autoconfigure.RateLimiterProperties : prefix resilience4j.ratelimiter
  • io.github.resilience4j.retry.autoconfigure.RetryProperties : prefix resilience4j.retry

We mainly use two of them here: CircuitBreaker for instance-level circuit breaking, and ThreadPoolBulkhead for instance-level thread isolation.

How to configure and use

Circuit breaker related configuration: CircuitBreaker

A CircuitBreaker has three normal states: CLOSED, OPEN, and HALF_OPEN. Two further states, DISABLED and FORCED_OPEN, are set manually and are not used here. CLOSED means the circuit breaker is closed and requests are processed as usual. OPEN means the circuit breaker is open: any request is rejected with a CallNotPermittedException. HALF_OPEN lets a limited number of trial requests through to decide whether to return to CLOSED or go back to OPEN.
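
As a rough illustration of how these states relate, here is a minimal Java sketch of the transitions between CLOSED, OPEN, and HALF_OPEN. It is a simplified model, not Resilience4j's implementation: the class BreakerStateSketch and its methods are hypothetical names, time is passed in explicitly, and the verdict on the trial calls is reduced to a boolean.

```java
import java.time.Duration;

/** Simplified model of the three normal circuit breaker states (hypothetical class). */
final class BreakerStateSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final Duration waitInOpen; // plays the role of waitDurationInOpenState
    private State state = State.CLOSED;
    private long openedAtMillis;

    BreakerStateSketch(Duration waitInOpen) {
        this.waitInOpen = waitInOpen;
    }

    /** Trips the breaker, e.g. because the failure rate reached the threshold. */
    void trip(long nowMillis) {
        state = State.OPEN;
        openedAtMillis = nowMillis;
    }

    /** Returns true if a call may proceed at the given time. */
    boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= waitInOpen.toMillis()) {
            state = State.HALF_OPEN; // the wait is over: let trial calls through
        }
        return state != State.OPEN;
    }

    /** Reports whether the trial calls made while HALF_OPEN were healthy. */
    void onTrialResult(boolean healthy, long nowMillis) {
        if (state != State.HALF_OPEN) {
            return;
        }
        if (healthy) {
            state = State.CLOSED;
        } else {
            trip(nowMillis);
        }
    }

    State state() {
        return state;
    }
}
```

In this model a rejected call corresponds to tryAcquire returning false, which is where the real library throws CallNotPermittedException.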

The CircuitBreaker uses a sliding window to record the outcome of requests and decides from it whether to open or close. There are two kinds of sliding window:

  • Count-based sliding window: uses a ring array of size N to record the results of the latest N requests.
  • Time-based sliding window: records the results of the requests made in the last N seconds.

| Configuration item | Default | Description |
| --- | --- | --- |
| failureRateThreshold | 50 | Failure rate threshold, in percent. When the failure rate is equal to or greater than this value, the circuit breaker becomes OPEN. |
| slowCallDurationThreshold | 60000 [ms] | Slow call duration. A call that takes longer than this is recorded as a slow call. |
| slowCallRateThreshold | 100 | Slow call rate threshold, in percent. When the percentage of slow calls is equal to or greater than this value, the circuit breaker becomes OPEN. |
| permittedNumberOfCallsInHalfOpenState | 10 | Number of requests allowed through while the circuit breaker is HALF_OPEN. |
| slidingWindowType | COUNT_BASED | Sliding window type. COUNT_BASED is a count-based sliding window, TIME_BASED is a time-based sliding window. |
| slidingWindowSize | 100 | Sliding window size. With COUNT_BASED, the default of 100 means the last 100 requests. With TIME_BASED, the default of 100 means the requests of the last 100 seconds. |
| minimumNumberOfCalls | 100 | Minimum number of requests. The circuit breaker is only evaluated once the sliding window contains at least this many requests. |
| waitDurationInOpenState | 60000 [ms] | Time to wait before moving from OPEN to HALF_OPEN. |
| automaticTransitionFromOpenToHalfOpenEnabled | false | If true, the circuit breaker moves from OPEN to HALF_OPEN automatically once the wait time has elapsed, even if no request arrives. |
| recordExceptions | empty | List of exceptions that are recorded as failures. Any exception in this list, or a subclass of one, counts as a failure when thrown during a call. Once the list is set, all other exceptions are not considered failures. Exceptions listed in ignoreExceptions are never considered failures. If the list is empty, all exceptions count as failures. |
| ignoreExceptions | empty | Exception whitelist. Exceptions in this list and their subclasses are not counted as request failures, even if they are also listed in recordExceptions. The whitelist is empty by default. |
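
To make the count-based window concrete, here is a minimal Java sketch of a ring array that records the latest N results and applies minimumNumberOfCalls and failureRateThreshold. It is an illustration under simplifying assumptions, not the library's code: CountWindowSketch and its methods are hypothetical names, slow calls are ignored, and the threshold check uses "equal to or greater than".

```java
/** Count-based sliding window over the last N call results (hypothetical class). */
final class CountWindowSketch {
    private final boolean[] failed;           // ring array: true = failed call
    private final int minimumNumberOfCalls;
    private final float failureRateThreshold; // in percent
    private int next;     // slot that will be overwritten next
    private int recorded; // filled slots, at most failed.length
    private int failures; // failed calls currently inside the window

    CountWindowSketch(int size, int minimumNumberOfCalls, float failureRateThreshold) {
        this.failed = new boolean[size];
        this.minimumNumberOfCalls = minimumNumberOfCalls;
        this.failureRateThreshold = failureRateThreshold;
    }

    /** Records one call result, evicting the oldest result once the window is full. */
    void record(boolean callFailed) {
        if (recorded == failed.length) {
            if (failed[next]) {
                failures--;
            }
        } else {
            recorded++;
        }
        failed[next] = callFailed;
        if (callFailed) {
            failures++;
        }
        next = (next + 1) % failed.length;
    }

    /** Failure rate in percent over the results currently in the window. */
    float failureRate() {
        return recorded == 0 ? 0f : failures * 100f / recorded;
    }

    /** True once enough calls were recorded and the failure rate reached the threshold. */
    boolean shouldOpen() {
        return recorded >= minimumNumberOfCalls && failureRate() >= failureRateThreshold;
    }
}
```

Note that a single early failure cannot open the breaker in this model: nothing is evaluated until minimumNumberOfCalls results have been recorded.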

The default configuration we use here is:

resilience4j.circuitbreaker:
  configs:
    default:
      # Whether to register with health indicator of Actuator
      registerHealthIndicator: true
      slidingWindowSize: 10
      minimumNumberOfCalls: 5
      slidingWindowType: TIME_BASED
      permittedNumberOfCallsInHalfOpenState: 3
      automaticTransitionFromOpenToHalfOpenEnabled: true
      waitDurationInOpenState: 2s
      failureRateThreshold: 30
      recordExceptions:
        - java.lang.Exception

With this configuration, all exceptions and their subclasses count as failures by default. The sliding window is time-based and records the requests of the last 10 seconds. The circuit breaker is only evaluated once at least 5 requests have been made within those 10 seconds. When the failure rate reaches 30% or more, the circuit breaker becomes OPEN. Once OPEN, it moves to HALF_OPEN automatically after 2 seconds.
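
The arithmetic behind this configuration can be sketched in a few lines of Java. The sketch below hard-codes the values from the YAML above (10 second window, at least 5 calls, 30% threshold) and keeps one entry per call. This is a simplification for illustration only: TimeWindowSketch is a hypothetical class, and the real library aggregates calls differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Time-based sliding window using the values configured above (hypothetical class). */
final class TimeWindowSketch {
    private static final long WINDOW_MILLIS = 10_000;        // slidingWindowSize: 10 (seconds)
    private static final int MINIMUM_CALLS = 5;              // minimumNumberOfCalls: 5
    private static final float FAILURE_RATE_THRESHOLD = 30f; // failureRateThreshold: 30

    /** One recorded call: when it happened and whether it failed. */
    private static final class Call {
        final long atMillis;
        final boolean failed;

        Call(long atMillis, boolean failed) {
            this.atMillis = atMillis;
            this.failed = failed;
        }
    }

    private final Deque<Call> calls = new ArrayDeque<>();

    void record(long nowMillis, boolean failed) {
        calls.addLast(new Call(nowMillis, failed));
    }

    /** Drops calls older than the window, then applies the minimum-calls and threshold checks. */
    boolean shouldOpen(long nowMillis) {
        while (!calls.isEmpty() && calls.peekFirst().atMillis <= nowMillis - WINDOW_MILLIS) {
            calls.removeFirst();
        }
        if (calls.size() < MINIMUM_CALLS) {
            return false;
        }
        long failures = calls.stream().filter(c -> c.failed).count();
        return failures * 100f / calls.size() >= FAILURE_RATE_THRESHOLD;
    }
}
```

For example, 2 failures out of 5 calls inside the window give a 40% failure rate, which is enough to open the breaker, while 2 failures out of 4 calls do not, because fewer than 5 calls were seen.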

Configuration related to ThreadPoolBulkhead: Create and configure a ThreadPoolBulkhead

| Configuration item | Default | Description |
| --- | --- | --- |
| maxThreadPoolSize | Runtime.getRuntime().availableProcessors() | Maximum thread pool size |
| coreThreadPoolSize | Runtime.getRuntime().availableProcessors() - 1 | Core thread pool size |
| queueCapacity | 100 | Queue size |
| keepAliveDuration | 20 [ms] | How long idle threads beyond the core size wait for new tasks before terminating |

The default configuration we use here is:

resilience4j.thread-pool-bulkhead:
  configs:
    default:
      maxThreadPoolSize: 50
      coreThreadPoolSize: 10
      queueCapacity: 1
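
To see what these three settings mean in practice, the following sketch uses a plain java.util.concurrent.ThreadPoolExecutor with a bounded queue. It assumes the bulkhead behaves like such a pool. BulkheadPoolSketch and its methods are hypothetical names, and the tasks simply block on a latch to simulate slow calls.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Bounded thread pool that rejects work once threads and queue are full (hypothetical class). */
final class BulkheadPoolSketch {

    /** Builds a pool with the given core size, maximum size, and queue capacity. */
    static ThreadPoolExecutor newPool(int coreSize, int maxSize, int queueCapacity) {
        return new ThreadPoolExecutor(coreSize, maxSize, 20, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), new ThreadPoolExecutor.AbortPolicy());
    }

    /** Submits blocking tasks until one is rejected and returns how many were accepted. */
    static int acceptedBeforeRejection(ThreadPoolExecutor pool) {
        CountDownLatch release = new CountDownLatch(1);
        int accepted = 0;
        try {
            while (true) {
                pool.execute(() -> {
                    try {
                        release.await(); // simulate a call that never returns in time
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                accepted++;
            }
        } catch (RejectedExecutionException e) {
            return accepted;
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }
}
```

With the values configured above (core 10, maximum 50, queue 1), acceptedBeforeRejection(newPool(10, 50, 1)) returns 51 in this model: 50 running calls plus 1 queued call. Further calls to the same instance are rejected immediately instead of piling up behind a slow instance.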

Integrating with OpenFeign

We need to apply the CircuitBreaker and ThreadPoolBulkhead after the FeignClient call has been made and the target instance has been selected. In other words, we need the instance chosen for this request and the name of the microservice, so that we can load the corresponding CircuitBreaker and ThreadPoolBulkhead, wrap the request with them, and then execute the call.
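
The lookup described here can be sketched with the standard library alone: one guard object per instance, keyed by "service:host:port", created on first use from the service-level settings. PerInstanceRegistrySketch, its generic guard type T, and the factory function are hypothetical stand-ins for the Resilience4j registries, used only to show the keying scheme.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Keeps one guard (breaker, bulkhead, ...) per service instance (hypothetical class). */
final class PerInstanceRegistrySketch<T> {
    private final Map<String, T> byInstance = new ConcurrentHashMap<>();
    private final Function<String, T> factory; // builds a guard from the service name

    PerInstanceRegistrySketch(Function<String, T> factory) {
        this.factory = factory;
    }

    /** Builds the key "service:host:port" from the service name and the resolved request URL. */
    static String instanceId(String serviceName, String resolvedUrl) {
        URI uri = URI.create(resolvedUrl);
        // getPort() returns -1 when the URL carries no explicit port
        return serviceName + ":" + uri.getHost() + ":" + uri.getPort();
    }

    /** Returns the guard for the instance behind the URL, creating it on first use. */
    T get(String serviceName, String resolvedUrl) {
        return byInstance.computeIfAbsent(instanceId(serviceName, resolvedUrl),
                id -> factory.apply(serviceName));
    }
}
```

Because the key includes host and port, a failing instance only affects its own guard, while all instances of a service are still created from the same service-level settings.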

According to org.springframework.cloud.openfeign.loadbalancer.DefaultFeignLoadBalancerConfiguration, the core implementation of the Feign client is org.springframework.cloud.openfeign.loadbalancer.FeignBlockingLoadBalancerClient:

@Bean
@ConditionalOnMissingBean
public Client feignClient(BlockingLoadBalancerClient loadBalancerClient) {
	return new FeignBlockingLoadBalancerClient(new Client.Default(null, null),
			loadBalancerClient);
}

Check the source code of FeignBlockingLoadBalancerClient:

@Override
public Response execute(Request request, Request.Options options) throws IOException {
	final URI originalUri = URI.create(request.url());
	//Microservice name
	String serviceId = originalUri.getHost();
	Assert.state(serviceId != null,
			"Request URI does not contain a valid hostname: " + originalUri);
	//Select an instance from the load balancer
	ServiceInstance instance = loadBalancerClient.choose(serviceId);
	if (instance == null) {
		String message = "Load balancer does not contain an instance for the service "
				+ serviceId;
		if (LOG.isWarnEnabled()) {
			LOG.warn(message);
		}
		return Response.builder().request(request)
				.status(HttpStatus.SERVICE_UNAVAILABLE.value())
				.body(message, StandardCharsets.UTF_8).build();
	}
	//Modify the original url
	String reconstructedUrl = loadBalancerClient.reconstructURI(instance, originalUri)
			.toString();
	//Build a new Request
	Request newRequest = Request.create(request.httpMethod(), reconstructedUrl,
			request.headers(), request.body(), request.charset(),
			//This RequestTemplate can get the microservice name
			request.requestTemplate());
	return delegate.execute(newRequest, options);
}

Therefore, we can replace the default implementation with our own FeignBlockingLoadBalancerClient and wrap the call in its delegate. However, because of a small bug in Sleuth, the RequestTemplate is lost, so we cannot get the microservice name. For this, see my PR: replace method for deprecation and keep reference of requestTemplate. The fix will not be merged into the Hoxton release, however, so we need to create a class with the same fully qualified name to override it: org.springframework.cloud.sleuth.instrument.web.client.feign.TracingFeignClient

Request build() {
    if (headers == null) {
        return delegate;
    }
    String url = delegate.url();
    byte[] body = delegate.body();
    Charset charset = delegate.charset();
    //Keep requestTemplate
    return Request.create(delegate.httpMethod(), url, headers, body, charset, delegate.requestTemplate());
}

After that, we provide a FeignBlockingLoadBalancerClient that applies CircuitBreaker and ThreadPoolBulkhead, and tune the HttpClient:

@Bean
public HttpClient getHttpClient() {
    // Pooled connections live for at most 5 minutes
    PoolingHttpClientConnectionManager poolingConnectionManager = new PoolingHttpClientConnectionManager(5, TimeUnit.MINUTES);
    // Maximum total connections
    poolingConnectionManager.setMaxTotal(1000);
    // Maximum concurrent connections per route
    poolingConnectionManager.setDefaultMaxPerRoute(1000);

    HttpClientBuilder httpClientBuilder = HttpClients.custom();
    httpClientBuilder.setConnectionManager(poolingConnectionManager);
    // Keep connections alive for as long as the Keep-Alive response header allows
    httpClientBuilder.setKeepAliveStrategy(new DefaultConnectionKeepAliveStrategy());
    return httpClientBuilder.build();
}

@Bean
public FeignBlockingLoadBalancerClient feignBlockingLoadBalancerCircuitBreakableClient(HttpClient httpClient, BlockingLoadBalancerClient loadBalancerClient, BulkheadRegistry bulkheadRegistry, ThreadPoolBulkheadRegistry threadPoolBulkheadRegistry, CircuitBreakerRegistry circuitBreakerRegistry, RateLimiterRegistry rateLimiterRegistry, RetryRegistry retryRegistry, Tracer tracer) {
    return new FeignBlockingLoadBalancerClient(new CircuitBreakableClient(
            httpClient,
            bulkheadRegistry,
            threadPoolBulkheadRegistry,
            circuitBreakerRegistry,
            rateLimiterRegistry,
            retryRegistry,
            tracer),
            loadBalancerClient);
}

@Log4j2
public static class CircuitBreakableClient extends feign.httpclient.ApacheHttpClient {
    private final BulkheadRegistry bulkheadRegistry;
    private final ThreadPoolBulkheadRegistry threadPoolBulkheadRegistry;
    private final CircuitBreakerRegistry circuitBreakerRegistry;
    private final RateLimiterRegistry rateLimiterRegistry;
    private final RetryRegistry retryRegistry;
    private final Tracer tracer;

    public CircuitBreakableClient(HttpClient httpClient, BulkheadRegistry bulkheadRegistry, ThreadPoolBulkheadRegistry threadPoolBulkheadRegistry, CircuitBreakerRegistry circuitBreakerRegistry, RateLimiterRegistry rateLimiterRegistry, RetryRegistry retryRegistry, Tracer tracer) {
        super(httpClient);
        this.bulkheadRegistry = bulkheadRegistry;
        this.threadPoolBulkheadRegistry = threadPoolBulkheadRegistry;
        this.circuitBreakerRegistry = circuitBreakerRegistry;
        this.rateLimiterRegistry = rateLimiterRegistry;
        this.retryRegistry = retryRegistry;
        this.tracer = tracer;
    }

    @Override
    public Response execute(Request request, Request.Options options) throws IOException {
        String serviceName = request.requestTemplate().feignTarget().name();
        URL url = new URL(request.url());
        String instanceId = serviceName + ":" + url.getHost() + ":" + url.getPort();

        //Each instance gets its own circuit breaker and bulkhead, so circuit breaking happens per instance. All instances of a service share that service's Resilience4j configuration
        ThreadPoolBulkhead threadPoolBulkhead;
        CircuitBreaker circuitBreaker;
        try {
            threadPoolBulkhead = threadPoolBulkheadRegistry.bulkhead(instanceId, serviceName);
        } catch (ConfigurationNotFoundException e) {
            threadPoolBulkhead = threadPoolBulkheadRegistry.bulkhead(instanceId);
        }
        try {
            circuitBreaker = circuitBreakerRegistry.circuitBreaker(instanceId, serviceName);
        } catch (ConfigurationNotFoundException e) {
            circuitBreaker = circuitBreakerRegistry.circuitBreaker(instanceId);
        }
        //Capture the current span so the traceId is kept on the bulkhead thread
        Span span = tracer.currentSpan();
        Supplier<CompletionStage<Response>> completionStageSupplier = ThreadPoolBulkhead.decorateSupplier(threadPoolBulkhead,
                CircuitBreaker.decorateSupplier(circuitBreaker, () -> {
                    try (Tracer.SpanInScope scope = tracer.withSpanInScope(span)) {
                        log.info("call url: {} -> {}", request.httpMethod(), request.url());
                        Response execute = super.execute(request, options);
                        //Any status other than 200 is reported to the circuit breaker as a failure
                        if (execute.status() != HttpStatus.OK.value()) {
                            throw new ResponseWrapperException(execute.toString(), execute);
                        }
                        return execute;
                    } catch (ResponseWrapperException e) {
                        //Rethrow as-is so the wrapped Response is not lost
                        throw e;
                    } catch (Exception e) {
                        throw new ResponseWrapperException(e.getMessage(), e);
                    }
                })
        );

        try {
            return Try.ofSupplier(completionStageSupplier).get().toCompletableFuture().join();
        } catch (CompletionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof ResponseWrapperException) {
                ResponseWrapperException responseWrapperException = (ResponseWrapperException) cause;
                //Hand the original non-200 response back to the caller
                if (responseWrapperException.getResponse() instanceof Response) {
                    return (Response) responseWrapperException.getResponse();
                }
            }
            throw new ResponseWrapperException(cause.getMessage(), cause);
        }
    }
}
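
The last step of execute relies on a detail of CompletableFuture: join() wraps whatever the task threw in a CompletionException, so the caller has to unwrap it to see the original failure. Here is a minimal, self-contained sketch of that pattern using only the standard library. JoinUnwrapSketch and callAndUnwrap are hypothetical names, and a plain ExecutorService stands in for the bulkhead's thread pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

/** Runs a call on another thread and surfaces its original exception (hypothetical class). */
final class JoinUnwrapSketch {

    /** Executes the call on the pool, waits for it, and rethrows the unwrapped failure. */
    static <T> T callAndUnwrap(Supplier<T> call, ExecutorService pool) {
        try {
            return CompletableFuture.supplyAsync(call, pool).join();
        } catch (CompletionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause; // the exception thrown inside the call
            }
            throw new IllegalStateException(cause);
        }
    }
}
```

Without the unwrapping, callers would only ever see CompletionException and could not tell a rejected call from a failed HTTP request.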

In this way, we have integrated OpenFeign with instance-level circuit breaking and thread isolation.
