Hello, I'm brother Lei.
First, let's talk about the problem: the API of Intranet sandbox environment is stuck for one week, and all APIs are unresponsive
At the beginning, when the test complained about the slow response of the environment, we restarted the application and the application returned to normal, so we didn't deal with it. But later, the frequency of problems became more and more frequent, and more and more colleagues began to complain, so they felt that there might be problems in the code and began to check.
Firstly, it is found that the developed local ide has no problems. When the application is stuck, the database and redis are normal, and there are no special error logs. I began to suspect that it was the sandbox environment machine problem (the test environment itself was fragile)
So ssh goes to the server and executes the following command
top

At this time, I found that the machine was still normal, but my heart was still 😖, So I'm going to look at the jvm stack information
First, let's look at the threads that consume resources
Execute top -H -p 12798

Find the first three relatively resource consuming threads
jstack viewing heap memory
Hexadecimal 31ff of jstack 12798 |grep 12799

I didn't see any problem. I looked up and down 10 lines, so I implemented it

You can see that some threads are in lock state. However, there are no business-related codes, which are ignored. There is no clue at this time. Think about it. Decided to give up the stuck machine
In order to protect the scene of the accident, dump all heap memory of the problem process, and then restart the test environment application in the debug mode. It is planned to directly debug the problem machine remotely when the problem reappears
The next day, the problem reappeared, so the operation and maintenance department was informed to forward nginx, remove the problem application, and debug tomcat remotely.
I found an interface at random. The breakpoint was at the interface entrance. The tragedy began and nothing happened! The API waited for a service response and did not enter the breakpoint. At this time, I was a little confused and calmed down for a while. I set a breakpoint in the aop before the entrance. I debug ged again. This time, I entered the breakpoint. f8 N times later, I found that the card owner was running the redis command. Continue to follow, and finally find a problem at a place in jedis:
/** * Returns a Jedis instance to be used as a Redis connection. The instance can be newly created or retrieved from a * pool. * * @return Jedis instance ready for wrapping into a {@link RedisConnection}. */ protected Jedis fetchJedisConnector() { try { if (usePool && pool != null) { return pool.getResource(); } Jedis jedis = new Jedis(getShardInfo()); // force initialization (see Jedis issue #82) jedis.connect(); return jedis; } catch (Exception ex) { throw new RedisConnectionFailureException("Cannot get Jedis connection", ex); } }
Top pool After getresource(), the thread starts to wait
public T getResource() { try { return internalPool.borrowObject(); } catch (Exception e) { throw new JedisConnectionException("Could not get a resource from the pool", e); } } return internalPool.borrowObject(); This code should be a rental code followed by public T borrowObject(long borrowMaxWaitMillis) throws Exception { this.assertOpen(); AbandonedConfig ac = this.abandonedConfig; if (ac != null && ac.getRemoveAbandonedOnBorrow() && this.getNumIdle() < 2 && this.getNumActive() > this.getMaxTotal() - 3) { this.removeAbandoned(ac); } PooledObject<T> p = null; boolean blockWhenExhausted = this.getBlockWhenExhausted(); long waitTime = 0L; while(p == null) { boolean create = false; if (blockWhenExhausted) { p = (PooledObject)this.idleObjects.pollFirst(); if (p == null) { create = true; p = this.create(); } if (p == null) { if (borrowMaxWaitMillis < 0L) { p = (PooledObject)this.idleObjects.takeFirst(); } else { waitTime = System.currentTimeMillis(); p = (PooledObject)this.idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS); waitTime = System.currentTimeMillis() - waitTime; } } if (p == null) { throw new NoSuchElementException("Timeout waiting for idle object"); }
There is a piece of code
if (p == null) { if (borrowMaxWaitMillis < 0L) { p = (PooledObject)this.idleObjects.takeFirst(); } else { waitTime = System.currentTimeMillis(); p = (PooledObject)this.idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS); waitTime = System.currentTimeMillis() - waitTime; } }
Borrowmaxwaitmillis < 0 will be executed all the time, and then cycle all the time. It is suspected that this value is not configured
Find the redis pool configuration and find that MaxWaitMillis is not configured. The else code is also an Exception after configuration, which can not solve the problem
Continue with F8
public E takeFirst() throws InterruptedException { this.lock.lock(); Object var2; try { Object x; while((x = this.unlinkFirst()) == null) { this.notEmpty.await(); } var2 = x; } finally { this.lock.unlock(); } return var2; }
When I found the word lock here, I began to suspect that all request APIs were blocked
So install Arthas on the ssh server again (Arthas is an open source Java diagnostic tool of Alibaba)
Execute thread command

It is found that a large number of http NiO threads are waiting. http nio-8083-exec - this thread is actually the tomcat thread that comes out of the http request
Randomly find a thread to view the heap memory
thread -428

It can be confirmed that the api keeps turning around, which is caused by the code for redis to obtain the connection,
Reading this memory code, all threads are waiting for @53e5504e this object to release the lock. Therefore, jstack searched 53e5504e globally and did not find the thread where the object was located.
Since then. The cause of the problem can be determined to be the problem of redis connection acquisition. However, the reason why the connection cannot be obtained is uncertain
Execute the thread - B of arthas again (thread - B, find the thread currently blocking other threads)

no result. This is different from what I thought. I should be able to find a blocking thread. So I looked at the document of this command and found the following sentence

Well, we happen to be the latter....
Organize your thoughts again. This time, modify the redis pool configuration, set the connection acquisition timeout to 2s, and then observe what the application did when the problem reappears again.
Add a configuration
JedisConnectionFactory jedisConnectionFactory = new JedisConnectionFactory(); ....... JedisPoolConfig config = new JedisPoolConfig(); config.setMaxWaitMillis(2000); ....... jedisConnectionFactory.afterPropertiesSet();
Restart the service and wait....
Another day, again
ssh server, check tomcat accesslog and find that a large number of api requests appear 500,
org.springframework.data.redis.RedisConnectionFailureException: Cannot get Jedis connection; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource fr om the pool at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.fetchJedisConnector(JedisConnectionFactory.java:140) at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:229) at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:57) at org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:128) at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:91) at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:78) at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:177) at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:152) at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:85) at org.springframework.data.redis.core.DefaultHashOperations.get(DefaultHashOperations.java:48)
Find the source, 500 places for the first time
The following codes were found
....... Cursor c = stringRedisTemplate.getConnectionFactory().getConnection().scan(options); while (c.hasNext()) { .....,, }
Analyze the code, stringredistemplate getConnectionFactory(). After getconnection () obtains the redisConnection in the pool, there is no follow-up operation, that is, the links in the redis connection pool are not released or returned to the link pool after being leased. Although the business has been processed and the redisConnection is idle, the status of the redisConnection in the pool has not returned to the idle state

Normal should be

Since then, the problem has been found.
Summary: spring stringRedisTemplate encapsulates the normal redis operations, but does not support Commands such as Scan SetNx. In this case, you need to get jedis Connection for some special Commands
Use stringredistemplate getConnectionFactory(). Getconnection () is not recommended
We can use
stringRedisTemplate.execute(new RedisCallback<Cursor>() { @Override public Cursor doInRedis(RedisConnection connection) throws DataAccessException { return connection.scan(options); } });
To execute
Or after using connection, use
RedisConnectionUtils.releaseConnection(conn, factory);
To release the connection
At the same time, it is not recommended to use the keys command in redis. The redis pool should be configured reasonably. Otherwise, there is no error log and no error report. It is very difficult to locate the problem.