Troubleshooting of Apache HttpClient connection pool leakage
Problem background
-
The main business of the business system is a data aggregation management platform, in which one function of the system is to synchronize all resources (large synchronization for short)
-
The service synchronization data request data tool is Feign adapted to Apache HttpClient. This request encapsulation is based on the service adaptation service encapsulation request api at that time
-
Feign version: 10.10 one
Problem source
- In the production environment, the large synchronization function (more than 20 tasks) found that it was stuck when running more than half of the tasks. This problem was not found in the test environment
Synchronous interface
public interface SyncHelper { Order syncOrder(); void syncAllAccount(); void syncSingleAccount(Long accountId); default boolean enableSync() { return true; } }
Implementation of large synchronization function
@Slf4j @Component public class SyncAccountResourceListener { @Autowired private final List<SyncHelper> helpers; // Single thread pool private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor(new ThreadFactoryBuilder() .setDaemon(false) .setNameFormat("h3c-sync-resource-%d") .build() ); public void sync(){ for (SyncHelper helper : helpers) { if (Thread.currentThread().isInterrupted()) { log.error("[{}] sync task interrupted,account:[{}]", className, accountId); continue; } Future<?> future = EXECUTOR.submit(() -> helper.syncSingleAccount(accountId)); try { future.get(helper.getTimeOut(), TimeUnit.SECONDS); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } catch (ExecutionException|TimeoutException e) { log.error("[{}] sync error,account:[{}]", className, accountId, e); } finally { future.cancel(true); } } } }
Troubleshooting steps
I wanted to solve the problem as quickly as possible. The synchronization progress list on the system shows that they are all stuck in the same synchronization class. Then I took a rough look at the codes of relevant synchronization classes and found that there are no relevant codes that may lead to dead circulation
Try to reproduce
- Test the big synchronization in the test environment and find no problem (including requesting round-trip data log and database sql print log), and successfully complete all synchronization tasks
- Then do unit tests for the stuck synchronization classes and execute them repeatedly. The results show that there is no problem
At this point, the question is even more confused. If it cannot be replicated in the test environment and local unit tests, how can there be related problems in production?
deadlock
At first, I didn't check the deadlock problem, because most of the synchronization didn't use multithreading
Possible causes
- Multithreading is used to run a task in a large synchronous resource using a single thread thread pool, and then the task times out. TimeOut fails to handle the task interruption, resulting in all subsequent tasks being blocked
- I see that some colleagues use multithreading to synchronize data, which is not very reasonable, similar to the following code:
List<SysDept> deptList = ...... List<CompletableFuture<CmdbUsageReport>> futureList = new ArrayList<>(); deptList.forEach(t -> futureList.add( CompletableFuture.supplyAsync(() -> { // Time consuming task return report; }, ioPool))); CompletableFuture.allOf(futureList.toArray(new CompletableFuture[]{})).join(); List<CmdbUsageReport> reportList = futureList.stream().map(CompletableFuture::join).filter(Objects::nonNull).collect(toList()); // ......
Since the thread problem is checked, directly use the relevant analysis tools to analyze it and see what's going on
Analyze thread status
Use arthas or visual VM to view the thread status of synchronization tasks
Start the corresponding process of arthas attach
java -jar arthas-boot.jar
thread --all view simple information about all threads
Use the single thread pool to run synchronization tasks. The thread executing the thread pool also has a user-defined name, which is `h3c-sync-resource-0`(process ID 250 ,Thread status is `WAITING` )
thread 250 view the details of the synchronization process
"h3c-sync-resource-0" - Thread t@195 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) - parking to wait for <66bb3d00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:393) at org.apache.http.pool.AbstractConnPool.access$300(AbstractConnPool.java:70) at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:253) - locked <637b83f5> (a org.apache.http.pool.AbstractConnPool$2) at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:198) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:306) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:282) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at feign.httpclient.ApacheHttpClient.execute(ApacheHttpClient.java:83) at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:119) at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:89) at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100) at com.sun.proxy.$Proxy360.osAggregates(Unknown Source)
According to the thread information, you can see that the key code is stuck at the place requested by feign, and then you can see that it is apache http client dead lock deadlock
AbstractConnPool.java:393 at org apache. http. pool. AbstractConnPool. getPoolEntryBlocking
Looking at the code in detail, it is found that the problem is caused by the idle connection blocking waiting of the http client connection pool
Source tracking
Now visit the http client official website to see a simple demo example. The demo case access address is: httpcomponents-client-quickstart
You will see a simple use case:
try (CloseableHttpClient httpclient = HttpClients.createDefault()) { HttpGet httpGet = new HttpGet("http://httpbin.org/get"); // The underlying HTTP connection is still held by the response object // to allow the response content to be streamed directly from the network socket. // In order to ensure correct deallocation of system resources // the user MUST call CloseableHttpResponse#close() from a finally clause. // Please note that if response content is not fully consumed the underlying // connection cannot be safely re-used and will be shut down and discarded // by the connection manager. try (CloseableHttpResponse response1 = httpclient.execute(httpGet)) { System.out.println(response1.getCode() + " " + response1.getReasonPhrase()); HttpEntity entity1 = response1.getEntity(); // do something useful with the response body // and ensure it is fully consumed EntityUtils.consume(entity1); } }
- The official simply uses the form of java7 try with resources to use httpclient and automatically release resources after use
- The official document also explains that if the connection is safe, recycled and reused, you need to use entityutils Consumer consumes the response content and closes the stream
- Finally, the thread blocking and waiting problem may be caused by not consuming the response content in time
Track Feign source code
Now let's take a look at the general process and source code of Apache HttpClient converting Feign request. The method of HttpClient converting Feign Response: Feign HttpClient. ApacheHttpClient. toFeignBody
Response.Body toFeignBody(HttpResponse httpResponse) { final HttpEntity entity = httpResponse.getEntity(); if (entity == null) { return null; } return new Response.Body() { @Override public Integer length() { // If it is transfer encoding: chunked length must return null return entity.getContentLength() >= 0 && entity.getContentLength() <= Integer.MAX_VALUE ? (int) entity.getContentLength() : null; } @Override public boolean isRepeatable() { return entity.isRepeatable(); } @Override public InputStream asInputStream() throws IOException { // Simply pass InputStream return entity.getContent(); } @SuppressWarnings("deprecation") @Override public Reader asReader() throws IOException { return new InputStreamReader(asInputStream(), UTF_8); } @Override public Reader asReader(Charset charset) throws IOException { Util.checkNotNull(charset, "charset should not be null"); return new InputStreamReader(asInputStream(), charset); } @Override public void close() throws IOException { // Resource recovery method EntityUtils.consume(entity); } }; }
Because the response content needs to be completely consumed before it can return to the connection pool to reuse the connection, org apache. http. util. EntityUtils. The approximate code of consumption is as follows:
public static void consume(final HttpEntity entity) throws IOException { if (entity == null) { return; } if (entity.isStreaming()) { // Still in the transmission state, get the stream final InputStream inStream = entity.getContent(); // close directly closes the recycle resource if (inStream != null) { inStream.close(); } } }
EntityUtils. The connection process of consumer consuming response content and safely reusing is as follows. If you are interested, you can have a look. There will be no long discussion here:
org.apache.http.util.EntityUtils.consume org.apache.http.impl.execchain.ResponseEntityProxy.getContent Packaged into automatic release connection EofSensorInputStream org.apache.http.conn.EofSensorInputStream.close org.apache.http.conn.EofSensorInputStream.checkClose org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed org.apache.http.impl.execchain.ResponseEntityProxy.releaseConnection
feign.AsyncResponseHandler#handleResponse
void handleResponse(CompletableFuture<Object> resultFuture, String configKey, Response response, Type returnType, long elapsedTime) { // copied fairly liberally from SynchronousMethodHandler boolean shouldClose = true; try { if (logLevel != Level.NONE) { // If the log level is not NONE, the log will be output response = logger.logAndRebufferResponse(configKey, logLevel, response, elapsedTime); } if (Response.class == returnType) { if (response.body() == null) { resultFuture.complete(response); } else if (response.body().length() == null || response.body().length() > MAX_RESPONSE_BUFFER_SIZE) { // If it is transfer encoding: chunked length must return null, resulting in the following finally block not closing the recycling of resources shouldClose = false; resultFuture.complete(response); } else { // Ensure the response body is disconnected // InputStream is converted to byte [] to recycle resources and back to response final byte[] bodyData = Util.toByteArray(response.body().asInputStream()); resultFuture.complete(response.toBuilder().body(bodyData).build()); } } else if (response.status() >= 200 && response.status() < 300) { if (isVoidType(returnType)) { resultFuture.complete(null); } else { final Object result = decode(response, returnType); shouldClose = closeAfterDecode; resultFuture.complete(result); } } else if (decode404 && response.status() == 404 && !isVoidType(returnType)) { final Object result = decode(response, returnType); shouldClose = closeAfterDecode; resultFuture.complete(result); } else { resultFuture.completeExceptionally(errorDecoder.decode(configKey, response)); } } catch (final IOException e) { if (logLevel != Level.NONE) { logger.logIOException(configKey, logLevel, e, elapsedTime); } resultFuture.completeExceptionally(errorReading(response.request(), response, e)); } catch (final Exception e) { resultFuture.completeExceptionally(e); } finally { if (shouldClose) { ensureClosed(response.body()); } } }
feign.slf4j.Slf4jLogger.logAndRebufferResponse
protected Response logAndRebufferResponse(String configKey, Level logLevel, Response response, long elapsedTime) throws IOException { if (logger.isDebugEnabled()) { // DEBUG level calls the parent class method for log output return super.logAndRebufferResponse(configKey, logLevel, response, elapsedTime); } return response; }
feign.Logger.logAndRebufferResponse
protected Response logAndRebufferResponse(String configKey, Level logLevel, Response response, long elapsedTime) throws IOException { String reason = response.reason() != null && logLevel.compareTo(Level.NONE) > 0 ? " " + response.reason() : ""; int status = response.status(); log(configKey, "<--- HTTP/1.1 %s%s (%sms)", status, reason, elapsedTime); if (logLevel.ordinal() >= Level.HEADERS.ordinal()) { // Log output at a level greater than HEADERS for (String field : response.headers().keySet()) { for (String value : valuesOrEmpty(response.headers(), field)) { log(configKey, "%s: %s", field, value); } } int bodyLength = 0; if (response.body() != null && !(status == 204 || status == 205)) { // HTTP 204 No Content "...response MUST NOT include a message-body" // HTTP 205 Reset Content "...response MUST NOT include an entity" if (logLevel.ordinal() >= Level.FULL.ordinal()) { log(configKey, ""); // CRLF } // It is crucial to read the InputStream and convert it to byte [], and call InputStream close for resource recovery byte[] bodyData = Util.toByteArray(response.body().asInputStream()); bodyLength = bodyData.length; if (logLevel.ordinal() >= Level.FULL.ordinal() && bodyLength > 0) { log(configKey, "%s", decodeOrDefault(bodyData, UTF_8, "Binary data")); } log(configKey, "<--- END HTTP (%s-byte body)", bodyLength); // Because the current response Body () is consumed and saved back to response return response.toBuilder().body(bodyData).build(); } else { log(configKey, "<--- END HTTP (%s-byte body)", bodyLength); } } return response; }
feign.Util.toByteArray
public static byte[] toByteArray(InputStream in) throws IOException { checkNotNull(in, "in"); try { ByteArrayOutputStream out = new ByteArrayOutputStream(); copy(in, out); return out.toByteArray(); } finally { // closeable close closes the recycle resource ensureClosed(in); } }
track Feign Slf4jLogger `feign/Logger.java:84` , find `feign.Logger#logAndRebufferResponse ` will read the content of 'body' once greater than`HEADERS` level:`logLevel.ordinal() >= Level.HEADERS.ordinal()`,rearwards `byte[] bodyData = Util.toByteArray(response.body().asInputStream());`,Make a copy of the data, and then close Drop the original `InputStream`
Thinking about reading source code
After the source code analysis arrives at this place, it is found that the corresponding content will be automatically released under normal circumstances:
-
Enable log (Feign log level is greater than / equal to headers & & logger level is less than / equal to DEBUG)
- The original response information will be copied Body() and release InputStream
-
Close log (Feign log level is less than headers & & logger level is greater than DEBUG)
- Response. class == returnType && (response.body(). length() == null || response. body(). length() > MAX_ RESPONSE_ BUFFER_ Size) in this case, the response information will not be released
- If closeAfterDecode is false, the response information will not be released
Verification problem
The Feign log level of the production environment is full (greater than HEADERS), but the log level enabled by the Logger is INFO. Try the following steps to reproduce the problem
- Build Apache httpclient to minimize the number of connection pools
CLIENT = new ApacheHttpClient( HttpClientBuilder.create() .setSSLSocketFactory( new SSLConnectionSocketFactory(trustAllSslSocketFactory(), (hostname, session) -> true) ) // Set maximum route .setMaxConnPerRoute(1) // Sets the maximum total number of connections .setMaxConnTotal(1) .setDefaultRequestConfig( RequestConfig.custom() .setConnectTimeout(CONNECT_TIMEOUT) .setSocketTimeout(REQ_TIMEOUT) .build() ) .useSystemProperties() .build()
- The unit test Logger level configuration of local synchronization is equal to info & & feign, and the log level is greater than / equal to HEADERS. The unit test result synchronization task will deadlock after a period of time
- The unit test Logger level configuration of local synchronization is equal to debug & & feign, and the log level is less than HEADERS. The unit test result synchronization task will deadlock after a period of time
- The configuration of the locally synchronized unit test Logger level is equal to debug & & feign, and the log level is greater than or equal to headers (this is the configuration of the test and development environment). No deadlock is found in the unit test result synchronization task after a period of time
After unit testing and source code analysis, the problem result is very obvious: the connection pool connection cannot be safely reused because the response information is not fully consumed and released
Trace the root cause of the problem
Review feign. Com from the source code analysis above Asyncresponsehandler#handleresponse source code, closeAfterDecode is true by default, so the response content will not be released only under the following conditions:
Response.class == returnType && (response.body().length() == null || response.body().length() > MAX_RESPONSE_BUFFER_SIZE)
Check whether the item code has a Response return type and the length of the request Response body is empty
Recall that when requesting the corresponding api, the token will be asynchronously verified and refreshed. To verify the token, there is no need to process the return value. Previously, the Response is used and the return code is judged. The following are some key codes:
Verify token Feign interface
@RequestLine("GET /sys/identity/v2/tokens") @Headers({"X-Subject-Token: {token}"}) Response verifyToken(@Param("token") String token);
Verify token code
@Override public boolean verifyToken(H3CClientConfig config, String token) { IdentityApi identityApi = Feigns.h3c(IdentityApi.class, config); Response response = identityApi.verifyToken(token); return response.status() == HttpStatus.OK.value(); }
Request response log
GET https://19.50.81.200:8100/sys/identity/v2/tokens HTTP/1.1 200 OK Date: Sun, 05 Sep 2021 02:55:09 GMT Content-Type: application/json; charset=UTF-8 Transfer-Encoding: chunked Connection: keep-alive X-Subject-Token: eyJjdHkiOiJKV1QiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2IiwiYWxnIjoiZGlyIn0..yWaex8RVEvBvw1D-kk_LMQ.IwiMpoRWSPsUZiEpr09tL7WDZ1-vIRZqFsGqQq2CV4wBp6S8mBIhICI3Ce2sE_TLA_A2oX_NnMpAf5D4C_DwunJaiJ3lnD51Sg1bxWao_gXnPS7JdfpyaRXY-rtPMaxs-0FisUuyVlKfQh3Ab8t3WsCLzU9Yz7sQ367CKtW1z32ttafrWRlotLN0y7XX3ZRz7Ttznm2cZ5Ae79MEPQF1-hbKiGoz4B8kR1NRgeL-arlpa8qtgERYEEtr-VtJgydDpylusItc_uOtPqwEh0HAgYQjJovF75pej5WlCgdzYVQMr08OGT0JnBrReWYxl0h2P0xxZQtNcM2d0T54TebvvRhQKRyywvasQ064FS4B4mGN-8E3TZkxSfSfr4OWZ1Nmwpr3xFGBOSVpKf5-AufCoXPW3yGu3vFSpCahoKq01n9_gd4UbKLE82Cwou4uZf4VMZ7A7hOAdWYo_geb1bTzLUyTdDSUgbS8XiiYCOpaA4euv409ELE22U77F940M2DO2y8lbaDuk4iAv3QIp5gCGg.9pzTvRPM-FAMa-17a2J5kQ
I did not process the Response response in the above code, but just judged the Response code. However, through the request round-trip log, I found that the Response information of the request Response is a little different: there is no content length field and the Response header is transfer encoding: chunked
About HTTP chunked
The response header of chunked InputStream is transfer encoding: chunked, which is an unknown length and has no content length field. The length of the last data block must be 0. It can also be understood that the http message ends with 0\r\n\r\n.
Just meet the conditions for leakage class == returnType && (Response.body(). length() == null || Response. body(). length() > MAX_ Response_ BUFFER_ Size) if the Response is not released in time, the connection of the connection pool cannot be safely reused.
Problem summary
Finally, it was found that the root cause of the problem was that the request connection pool could not be released normally due to the leakage of the input stream. The return value of the feign request used the Response and was not wrapped in the form of try with resources, resulting in the failure to execute the close method
Key to troubleshooting:
- Feign source code needs to have a certain debugging and analysis ability
- Master conventional application performance analysis methods
- Understand the differences between production and test environments in business. It is best to keep the configuration of test environment and production consistent as far as possible to find problems as soon as possible
Solution
When the Feign request return value uses the Response response type, use java7 try with resources or try finally to release resources in time