Positioning process of lock not released due to OOM (in combination with Arthas)

Posted by ArneR on Thu, 31 Oct 2019 01:54:50 +0100

This paper is based on Spring Cloud Fincheley SR4, Arthas 3.1.4

Recently, it is found that the service returns slowly, and feign's fallback is triggered. Check the log and find that the retry is triggered. What triggered the retry is found through the exception stack:

Caused by: feign.RetryableException: connect timed out executing GET http://test-service/test-api
        at feign.FeignException.errorExecuting(FeignException.java:65) ~[feign-core-9.7.0.jar!/:?]
        at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:105) ~[feign-core-9.7.0.jar!/:?]
        at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:77) ~[feign-core-9.7.0.jar!/:?]
        at feign.hystrix.HystrixInvocationHandler$1.run(HystrixInvocationHandler.java:107) ~[feign-hystrix-9.7.0.jar!/:?]
        at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:302) ~[hystrix-core-1.5.18.jar!/:1.5.18]
        at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:298) ~[hystrix-core-1.5.18.jar!/:1.5.18]
        at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:46) ~[rxjava-1.3.8.jar!/:1.3.8]
        ... 27 more
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[?:?]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[?:?]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[?:?]
        at java.net.Socket.connect(Socket.java:591) ~[?:?]
        at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?]
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:?]
        at sun.net.www.http.HttpClient.New(HttpClient.java:341) ~[?:?]
        at sun.net.www.http.HttpClient.New(HttpClient.java:362) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1248) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1515) ~[?:?]
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527) ~[?:?]
        at feign.Client$Default.convertResponse(Client.java:150) ~[feign-core-9.7.0.jar!/:?]
        at feign.Client$Default.execute(Client.java:72) ~[feign-core-9.7.0.jar!/:?]
        at org.springframework.cloud.sleuth.instrument.web.client.feign.TracingFeignClient.execute(TracingFeignClient.java:91) ~[spring-cloud-sleuth-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE]
        at org.springframework.cloud.sleuth.instrument.web.client.feign.LazyTracingFeignClient.execute(LazyTracingFeignClient.java:55) ~[spring-cloud-sleuth-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE]
        at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:103) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE]
        at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:88) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE]
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:287) ~[spring-retry-1.2.4.RELEASE.jar!/:?]
        at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:180) ~[spring-retry-1.2.4.RELEASE.jar!/:?]
        at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:88) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE]
        at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:54) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE]
        at com.netflix.client.AbstractLoadBalancerAwareClient$1.call(AbstractLoadBalancerAwareClient.java:104) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5]
        at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:303) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5]
        at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:287) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5]
        at rx.internal.util.ScalarSynchronousObservable$3.call(ScalarSynchronousObservable.java:231) ~[rxjava-1.3.8.jar!/:1.3.8]
        at rx.internal.util.ScalarSynchronousObservable$3.call(ScalarSynchronousObservable.java:228) ~[rxjava-1.3.8.jar!/:1.3.8]
        at rx.Observable.unsafeSubscribe(Observable.java:10327) ~[rxjava-1.3.8.jar!/:1.3.8]

It is found that the connection timeout of calling microservice test service. Note that the connection timeout is not the read timeout. Next, first check the ribbon connection timeout configuration to see if the configured connection timeout is too short.

#ribbon default connection timeout
ribbon.ConnectTimeout=500

It is not short to establish a tcp connection within 500ms, so there is no problem with the configuration. Call the interface of the microservice instance manually from the current instance to see if it is OK:

curl http://IP of test service: port of test service

Access found successful, no blocking. Prove that there is no problem with the network connection. Let's look at the netstat network connection status:

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 test-5c665cc496-6jkx5:54644 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54666 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54680 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54088 ip-10-238-8-68.eu-central-1.compute.internal:8211 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54674 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54682 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54652 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp       32      0 test-5c665cc496-6jkx5:47092 244-213-201-91.test.com:https CLOSE_WAIT  
tcp        0      0 test-5c665cc496-6jkx5:54662 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54624 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp       32      0 test-5c665cc496-6jkx5:47098 244-213-201-91.test.com:https CLOSE_WAIT  
tcp        0      0 test-5c665cc496-6jkx5:54672 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54630 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:58758 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:8251 10-238-8-117.api-gateway.test1.svc.cluster.local:43132 FIN_WAIT2   
tcp        0      0 test-5c665cc496-6jkx5:54648 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:58778 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54646 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54628 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54650 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54632 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:54638 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED 
tcp        0      0 test-5c665cc496-6jkx5:59434 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED 
tcp       32      0 test-5c665cc496-6jkx5:47104 244-213-201-91.test.com:https CLOSE_WAIT  
tcp        1      0 test-5c665cc496-6jkx5:54146 ip-10-238-8-68.eu-central-1.compute.internal:8211 CLOSE_WAIT  
tcp       32      0 test-5c665cc496-6jkx5:47100 244-213-201-91.test.com:https CLOSE_WAIT  
tcp       32      0 test-5c665cc496-6jkx5:47106 244-213-201-91.test.com:https CLOSE_WAIT

Found:

There is no link with any instance ip of the current service test microservice.
There's not a lot of network connections, there's not a lot of timed ﹣ waiting, there's also a lot of close ﹣ wait.

Guess that the ip address of the called service test is not up-to-date.

Now it's time to launch Athas and start Arthas. Let's see the real ip address of feign call: Because we use sleuth, we use the feigh client monitored by sleuth to see the calling ip:

watch org.springframework.cloud.sleuth.instrument.web.client.feign.TracingFeignClient execute params[0].url()

The corresponding code is:

package org.springframework.cloud.sleuth.instrument.web.client.feign;

final class TracingFeignClient implements Client {


	@Override public Response execute(Request request, Request.Options options)
			throws IOException {
			    //.....
			}
}

Through observation, we find that it is indeed the previous ip. So why hasn't it been updated? Let's check the relevant code of EurekaClient. Refer to another article: Spring Cloud Eureka full explanation (4) - core process - service and instance list access details . let's take a look at the PollingServerListUpdater class. The thread pool responsible for updating is:

int coreSize = poolSizeProp.get();
ThreadFactory factory = (new ThreadFactoryBuilder())
        .setNameFormat("PollingServerListUpdater-%d")
        .setDaemon(true)
        .build();
_serverListRefreshExecutor = new ScheduledThreadPoolExecutor(coreSize, factory);

Thankfully, this thread has a name. Let's look at the thread list through Arthas:

[arthas@24]$ thread
thread
Threads Total: 736, NEW: 0, RUNNABLE: 81, BLOCKED: 0, WAITING: 634, TIMED_WAITIN
G: 21, TERMINATED: 0                                                            
ID     NAME                GROUP        PRIORI STATE %CPU   TIME   INTER DAEMON 
11     Log4j2-TF-2-AsyncLo main         5      TIMED 39     138:46 false true   
17     Log4j2-TF-5-AsyncLo main         5      TIMED 37     137:27 false true   
1045   as-command-execute- system       10     RUNNA 19     0:0    false true   
1027   AsyncAppender-Worke system       9      WAITI 0      0:0    false true   
68     AsyncResolver-boots main         5      TIMED 0      0:0    false true   
264    AsyncResolver-boots main         5      WAITI 0      0:0    false true   
1025   Attach Listener     system       9      RUNNA 0      0:0    false true   
9      Common-Cleaner      InnocuousThr 8      TIMED 0      0:0    false true   
151    DataPublisher       main         5      TIMED 0      0:2    false true   
107    DestroyJavaVM       main         5      RUNNA 0      0:37   false false  
69     DiscoveryClient-0   main         5      TIMED 0      0:1    false true   
70     DiscoveryClient-1   main         5      WAITI 0      0:1    false true   
748    DiscoveryClient-Cac main         5      WAITI 0      0:9    false true   
751    DiscoveryClient-Hea main         5      WAITI 0      0:7    false true   
71     DiscoveryClient-Ins main         5      TIMED 0      0:0    false true   
24     Druid-ConnectionPoo main         5      WAITI 0      0:0    false true   
21     Druid-ConnectionPoo main         5      WAITI 0      0:0    false true   
25     Druid-ConnectionPoo main         5      TIMED 0      0:0    false true   
22     Druid-ConnectionPoo main         5      TIMED 0      0:0    false true   
20     Druid-ConnectionPoo main         5      TIMED 0      0:0    false true   
67     Eureka-JerseyClient main         5      WAITI 0      0:0    false true   
3      Finalizer           system       8      WAITI 0      0:0    false true   
119    ForkJoinPool.common main         5      WAITI 0      0:9    false true   
30     I/O dispatcher 1    main         5      RUNNA 0      0:2    false false  
155    NFLoadBalancer-Ping main         5      TIMED 0      0:0    false true   
150    PollingServerListUp main         5      WAITI 0      0:0    false true   
157    PollingServerListUp main         5      WAITI 0      0:0    false true   
2      Reference Handler   system       10     RUNNA 0      0:0    false true   
146    RibbonApacheHttpCli main         5      TIMED 0      0:0    false true   
135    RxComputationSchedu main         5      TIMED 0      1:44   false true   
132    RxIoScheduler-1 (Ev main         5      TIMED 0      0:0    false true   
4      Signal Dispatcher   system       9      RUNNA 0      0:0    false true   
114    SimplePauseDetector main         5      TIMED 0      2:10   false true   
115    SimplePauseDetector main         5      TIMED 0      2:10   false true   
116    SimplePauseDetector main         5      TIMED 0      2:12   false true

Find that the keyword PollingServerListUp has two threads, 150 and 157. See what they are doing.

[arthas@24]$ thread 150
thread 150
"PollingServerListUpdater-0" Id=150 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@778af81b
    at java.base@11.0.4/jdk.internal.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@778af81b
    at java.base@11.0.4/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
    at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1170)
    at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
    at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
    at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
    at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.4/java.lang.Thread.run(Thread.java:834)

Affect(row-cnt:0) cost in 322 ms.
[arthas@24]$ thread 157
thread 157
"PollingServerListUpdater-1" Id=157 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@3adcda2
    at java.base@11.0.4/jdk.internal.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@3adcda2
    at java.base@11.0.4/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
    at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
    at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
    at java.base@11.0.4/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
    at com.netflix.loadbalancer.BaseLoadBalancer.setServersList(BaseLoadBalancer.java:475)
    at com.netflix.loadbalancer.DynamicServerListLoadBalancer.setServersList(DynamicServerListLoadBalancer.java:156)
    at com.netflix.loadbalancer.DynamicServerListLoadBalancer.updateAllServerList(DynamicServerListLoadBalancer.java:267)
    at com.netflix.loadbalancer.DynamicServerListLoadBalancer.updateListOfServers(DynamicServerListLoadBalancer.java:250)
    at com.netflix.loadbalancer.DynamicServerListLoadBalancer$1.doUpdate(DynamicServerListLoadBalancer.java:62)
    at com.netflix.loadbalancer.PollingServerListUpdater$1.run(PollingServerListUpdater.java:116)
    at java.base@11.0.4/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base@11.0.4/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.4/java.lang.Thread.run(Thread.java:834)

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@6ab1c5f9

It is found that PollingServerListUpdater-1 has been waiting to acquire the lock. It seems that the problem is here. Check the corresponding Ribbon code and find out:

PollingServerListUpdater-1 needs to obtain the write lock of allServerLock
The read lock of allServerLock is obtained only by runPinger (for each instance ping, the instance list needs to be read).

If there is a problem with Ping, let's look at the corresponding code BaseLoadBalancer to obtain locks:

public void runPinger() throws Exception {
    //Omit useless code
    allLock = allServerLock.readLock();
    allLock.lock();
    allServers = allServerList.toArray(new Server[allServerList.size()]);
    allLock.unlock();

We can see that the code here does not use the routine of try{lock} finally {unlock}. If the intermediate code is abnormal, the lock cannot be released. And here is the re-entry lock. You need to unlock as many times as you want. If you want to unlock as many times as you want, other threads will not get it.

allServers = allServerList.toArray(new Server[allServerList.size()]); no IO is involved, just data transformation, and no null pointer exception. The guess is that an OOM exception occurred, causing no memory allocation. After checking the log, we found OOM.

This tells us to try{lock} finally {unlock} for locks. Even if the code does not throw any exception, the lock may not be released in case of OOM.

I think it's better to fix this code. Put forward an issue to ribbon.

Topics: Programming Java Spring Druid network

Programmer Think

Positioning process of lock not released due to OOM (in combination with Arthas)

Hot Topics