This paper is based on Spring Cloud Fincheley SR4, Arthas 3.1.4
Recently, it is found that the service returns slowly, and feign's fallback is triggered. Check the log and find that the retry is triggered. What triggered the retry is found through the exception stack:
Caused by: feign.RetryableException: connect timed out executing GET http://test-service/test-api at feign.FeignException.errorExecuting(FeignException.java:65) ~[feign-core-9.7.0.jar!/:?] at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:105) ~[feign-core-9.7.0.jar!/:?] at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:77) ~[feign-core-9.7.0.jar!/:?] at feign.hystrix.HystrixInvocationHandler$1.run(HystrixInvocationHandler.java:107) ~[feign-hystrix-9.7.0.jar!/:?] at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:302) ~[hystrix-core-1.5.18.jar!/:1.5.18] at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:298) ~[hystrix-core-1.5.18.jar!/:1.5.18] at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:46) ~[rxjava-1.3.8.jar!/:1.3.8] ... 27 more Caused by: java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?] at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[?:?] at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[?:?] at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[?:?] at java.net.Socket.connect(Socket.java:591) ~[?:?] at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?] at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:?] at sun.net.www.http.HttpClient.New(HttpClient.java:341) ~[?:?] at sun.net.www.http.HttpClient.New(HttpClient.java:362) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1248) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1515) ~[?:?] at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527) ~[?:?] at feign.Client$Default.convertResponse(Client.java:150) ~[feign-core-9.7.0.jar!/:?] at feign.Client$Default.execute(Client.java:72) ~[feign-core-9.7.0.jar!/:?] at org.springframework.cloud.sleuth.instrument.web.client.feign.TracingFeignClient.execute(TracingFeignClient.java:91) ~[spring-cloud-sleuth-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.cloud.sleuth.instrument.web.client.feign.LazyTracingFeignClient.execute(LazyTracingFeignClient.java:55) ~[spring-cloud-sleuth-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:103) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:88) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:287) ~[spring-retry-1.2.4.RELEASE.jar!/:?] at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:180) ~[spring-retry-1.2.4.RELEASE.jar!/:?] at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:88) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:54) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at com.netflix.client.AbstractLoadBalancerAwareClient$1.call(AbstractLoadBalancerAwareClient.java:104) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5] at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:303) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5] at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:287) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5] at rx.internal.util.ScalarSynchronousObservable$3.call(ScalarSynchronousObservable.java:231) ~[rxjava-1.3.8.jar!/:1.3.8] at rx.internal.util.ScalarSynchronousObservable$3.call(ScalarSynchronousObservable.java:228) ~[rxjava-1.3.8.jar!/:1.3.8] at rx.Observable.unsafeSubscribe(Observable.java:10327) ~[rxjava-1.3.8.jar!/:1.3.8]
It is found that the connection timeout of calling microservice test service. Note that the connection timeout is not the read timeout. Next, first check the ribbon connection timeout configuration to see if the configured connection timeout is too short.
#ribbon default connection timeout ribbon.ConnectTimeout=500
It is not short to establish a tcp connection within 500ms, so there is no problem with the configuration. Call the interface of the microservice instance manually from the current instance to see if it is OK:
curl http://IP of test service: port of test service
Access found successful, no blocking. Prove that there is no problem with the network connection. Let's look at the netstat network connection status:
Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State tcp 0 0 test-5c665cc496-6jkx5:54644 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54666 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54680 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54088 ip-10-238-8-68.eu-central-1.compute.internal:8211 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54674 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54682 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54652 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 32 0 test-5c665cc496-6jkx5:47092 244-213-201-91.test.com:https CLOSE_WAIT tcp 0 0 test-5c665cc496-6jkx5:54662 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54624 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 32 0 test-5c665cc496-6jkx5:47098 244-213-201-91.test.com:https CLOSE_WAIT tcp 0 0 test-5c665cc496-6jkx5:54672 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54630 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:58758 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:8251 10-238-8-117.api-gateway.test1.svc.cluster.local:43132 FIN_WAIT2 tcp 0 0 test-5c665cc496-6jkx5:54648 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:58778 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54646 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54628 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54650 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54632 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54638 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:59434 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED tcp 32 0 test-5c665cc496-6jkx5:47104 244-213-201-91.test.com:https CLOSE_WAIT tcp 1 0 test-5c665cc496-6jkx5:54146 ip-10-238-8-68.eu-central-1.compute.internal:8211 CLOSE_WAIT tcp 32 0 test-5c665cc496-6jkx5:47100 244-213-201-91.test.com:https CLOSE_WAIT tcp 32 0 test-5c665cc496-6jkx5:47106 244-213-201-91.test.com:https CLOSE_WAIT
Found:
- There is no link with any instance ip of the current service test microservice.
- There's not a lot of network connections, there's not a lot of timed ﹣ waiting, there's also a lot of close ﹣ wait.
Guess that the ip address of the called service test is not up-to-date.
Now it's time to launch Athas and start Arthas. Let's see the real ip address of feign call: Because we use sleuth, we use the feigh client monitored by sleuth to see the calling ip:
watch org.springframework.cloud.sleuth.instrument.web.client.feign.TracingFeignClient execute params[0].url()
The corresponding code is:
package org.springframework.cloud.sleuth.instrument.web.client.feign; final class TracingFeignClient implements Client { @Override public Response execute(Request request, Request.Options options) throws IOException { //..... } }
Through observation, we find that it is indeed the previous ip. So why hasn't it been updated? Let's check the relevant code of EurekaClient. Refer to another article: Spring Cloud Eureka full explanation (4) - core process - service and instance list access details . let's take a look at the PollingServerListUpdater class. The thread pool responsible for updating is:
int coreSize = poolSizeProp.get(); ThreadFactory factory = (new ThreadFactoryBuilder()) .setNameFormat("PollingServerListUpdater-%d") .setDaemon(true) .build(); _serverListRefreshExecutor = new ScheduledThreadPoolExecutor(coreSize, factory);
Thankfully, this thread has a name. Let's look at the thread list through Arthas:
[arthas@24]$ thread thread Threads Total: 736, NEW: 0, RUNNABLE: 81, BLOCKED: 0, WAITING: 634, TIMED_WAITIN G: 21, TERMINATED: 0 ID NAME GROUP PRIORI STATE %CPU TIME INTER DAEMON 11 Log4j2-TF-2-AsyncLo main 5 TIMED 39 138:46 false true 17 Log4j2-TF-5-AsyncLo main 5 TIMED 37 137:27 false true 1045 as-command-execute- system 10 RUNNA 19 0:0 false true 1027 AsyncAppender-Worke system 9 WAITI 0 0:0 false true 68 AsyncResolver-boots main 5 TIMED 0 0:0 false true 264 AsyncResolver-boots main 5 WAITI 0 0:0 false true 1025 Attach Listener system 9 RUNNA 0 0:0 false true 9 Common-Cleaner InnocuousThr 8 TIMED 0 0:0 false true 151 DataPublisher main 5 TIMED 0 0:2 false true 107 DestroyJavaVM main 5 RUNNA 0 0:37 false false 69 DiscoveryClient-0 main 5 TIMED 0 0:1 false true 70 DiscoveryClient-1 main 5 WAITI 0 0:1 false true 748 DiscoveryClient-Cac main 5 WAITI 0 0:9 false true 751 DiscoveryClient-Hea main 5 WAITI 0 0:7 false true 71 DiscoveryClient-Ins main 5 TIMED 0 0:0 false true 24 Druid-ConnectionPoo main 5 WAITI 0 0:0 false true 21 Druid-ConnectionPoo main 5 WAITI 0 0:0 false true 25 Druid-ConnectionPoo main 5 TIMED 0 0:0 false true 22 Druid-ConnectionPoo main 5 TIMED 0 0:0 false true 20 Druid-ConnectionPoo main 5 TIMED 0 0:0 false true 67 Eureka-JerseyClient main 5 WAITI 0 0:0 false true 3 Finalizer system 8 WAITI 0 0:0 false true 119 ForkJoinPool.common main 5 WAITI 0 0:9 false true 30 I/O dispatcher 1 main 5 RUNNA 0 0:2 false false 155 NFLoadBalancer-Ping main 5 TIMED 0 0:0 false true 150 PollingServerListUp main 5 WAITI 0 0:0 false true 157 PollingServerListUp main 5 WAITI 0 0:0 false true 2 Reference Handler system 10 RUNNA 0 0:0 false true 146 RibbonApacheHttpCli main 5 TIMED 0 0:0 false true 135 RxComputationSchedu main 5 TIMED 0 1:44 false true 132 RxIoScheduler-1 (Ev main 5 TIMED 0 0:0 false true 4 Signal Dispatcher system 9 RUNNA 0 0:0 false true 114 SimplePauseDetector main 5 TIMED 0 2:10 false true 115 SimplePauseDetector main 5 TIMED 0 2:10 false true 116 SimplePauseDetector main 5 TIMED 0 2:12 false true
Find that the keyword PollingServerListUp has two threads, 150 and 157. See what they are doing.
[arthas@24]$ thread 150 thread 150 "PollingServerListUpdater-0" Id=150 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@778af81b at java.base@11.0.4/jdk.internal.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@778af81b at java.base@11.0.4/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081) at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1170) at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base@11.0.4/java.lang.Thread.run(Thread.java:834) Affect(row-cnt:0) cost in 322 ms. [arthas@24]$ thread 157 thread 157 "PollingServerListUpdater-1" Id=157 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@3adcda2 at java.base@11.0.4/jdk.internal.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@3adcda2 at java.base@11.0.4/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885) at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917) at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240) at java.base@11.0.4/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959) at com.netflix.loadbalancer.BaseLoadBalancer.setServersList(BaseLoadBalancer.java:475) at com.netflix.loadbalancer.DynamicServerListLoadBalancer.setServersList(DynamicServerListLoadBalancer.java:156) at com.netflix.loadbalancer.DynamicServerListLoadBalancer.updateAllServerList(DynamicServerListLoadBalancer.java:267) at com.netflix.loadbalancer.DynamicServerListLoadBalancer.updateListOfServers(DynamicServerListLoadBalancer.java:250) at com.netflix.loadbalancer.DynamicServerListLoadBalancer$1.doUpdate(DynamicServerListLoadBalancer.java:62) at com.netflix.loadbalancer.PollingServerListUpdater$1.run(PollingServerListUpdater.java:116) at java.base@11.0.4/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base@11.0.4/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base@11.0.4/java.lang.Thread.run(Thread.java:834) Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@6ab1c5f9
It is found that PollingServerListUpdater-1 has been waiting to acquire the lock. It seems that the problem is here. Check the corresponding Ribbon code and find out:
- PollingServerListUpdater-1 needs to obtain the write lock of allServerLock
- The read lock of allServerLock is obtained only by runPinger (for each instance ping, the instance list needs to be read).
If there is a problem with Ping, let's look at the corresponding code BaseLoadBalancer to obtain locks:
public void runPinger() throws Exception { //Omit useless code allLock = allServerLock.readLock(); allLock.lock(); allServers = allServerList.toArray(new Server[allServerList.size()]); allLock.unlock();
We can see that the code here does not use the routine of try{lock} finally {unlock}. If the intermediate code is abnormal, the lock cannot be released. And here is the re-entry lock. You need to unlock as many times as you want. If you want to unlock as many times as you want, other threads will not get it.
allServers = allServerList.toArray(new Server[allServerList.size()]); no IO is involved, just data transformation, and no null pointer exception. The guess is that an OOM exception occurred, causing no memory allocation. After checking the log, we found OOM.
This tells us to try{lock} finally {unlock} for locks. Even if the code does not throw any exception, the lock may not be released in case of OOM.
I think it's better to fix this code. Put forward an issue to ribbon.