Oversold by 100 bottles of Feitian Maotai! Was a misused Redis distributed lock to blame?

Posted by terandle on Wed, 09 Feb 2022 11:53:43 +0100


Using distributed locks based on Redis is nothing new today.

This article analyzes an incident caused by a Redis distributed lock in one of our real projects and describes how we fixed it. Flash-sale (rush purchase) orders in our project are protected by a distributed lock. During one campaign, operations ran a flash sale of Feitian Maotai with 100 bottles in stock, and we oversold by 100 bottles! You know how scarce Feitian Maotai is!

The incident was classified as a P0 major incident. We could only accept it calmly, and the performance rating of the whole project team was docked.

After the incident, the CTO named me to lead the investigation and the fix. All right, let's go.

Accident scene

After some digging, I learned that this flash-sale interface had never oversold before. So why did it oversell this time?

The difference is that earlier flash sales were not for scarce goods, whereas this one was for Feitian Maotai. Event-tracking data showed that essentially every metric had doubled, so you can imagine the enthusiasm! Without further ado, here is the core code; the confidential parts have been replaced with pseudocode.

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    String key = "key:" + request.getSeckillId();
    try {
        // Acquire the lock: set the key only if absent, with a 10s expiry
        Boolean lockFlag = redisTemplate.opsForValue().setIfAbsent(key, "val", 10, TimeUnit.SECONDS);
        if (lockFlag) {
            // Call the user service over HTTP for user-related verification
            // User activity verification

            // Inventory verification
            Object stock = redisTemplate.opsForHash().get(key + ":info", "stock");
            assert stock != null;
            if (Integer.parseInt(stock.toString()) <= 0) {
                // Business exception
            } else {
                redisTemplate.opsForHash().increment(key + ":info", "stock", -1);
                // Generate order
                // Publish the order-created event
                // Build response VO
            }
        }
    } finally {
        // Release lock
        redisTemplate.delete(key);
        // Build response VO
    }
    return response;
}

The code above relies on the lock's 10s expiry to give the business logic enough time to run, and on the try-finally block to make sure the lock is released promptly. Inventory is also checked inside the business code. It looks safe. Don't jump to conclusions yet; let's keep analyzing.

Cause of accident

The Feitian Maotai flash sale attracted a large number of new users who downloaded and registered on our app. Many of them were bonus hunters who use professional tooling to mass-register accounts, farm promotions, and place fake orders. Of course, our user system was well prepared: it was connected to several defenses, including Alibaba Cloud human-machine verification, three-factor authentication, and our self-developed risk control system, which together blocked a large number of illegitimate users. I have to give that a thumbs-up. But it is also why the user service was running under sustained high load.

When the flash sale started, a flood of user verification requests hit the user service, and the user service gateway briefly responded slowly. Some requests took more than 10s to return. Because we had set the HTTP response timeout to 30s, the interface stayed blocked in user verification. After 10s the distributed lock expired, so a new request could acquire the lock; in other words, the lock was overwritten. When the blocked requests finally finished, they ran the release logic and released locks held by other threads, which let yet more requests compete for the lock. This is an extremely vicious cycle. At that point only the inventory check stood between us and overselling, but the inventory check was not atomic: it used get-and-compare. That is how the overselling tragedy happened.

Accident analysis

After careful analysis, we found that this flash-sale interface has serious safety risks under high concurrency, concentrated in three places:

  • No fault tolerance for failures in dependent systems

The user service was under heavy load and its gateway responded slowly, but we had no countermeasure for that. This was the trigger of the overselling.

  • The seemingly safe distributed lock was not safe at all

Although we used SET key value [EX seconds] [PX milliseconds] [NX|XX], if thread A runs for a long time and has not released the lock by the time it expires, thread B can acquire the lock. When thread A finishes, it releases the lock, but what it actually releases is thread B's lock. Now thread C can acquire the lock. If thread B then finishes and releases the lock, it actually releases the lock set by thread C. This is the direct cause of the overselling.

  • Non-atomic inventory verification

Because the inventory check is not atomic, its result is inaccurate under concurrency. This is the root cause of the overselling.

From the analysis above, the root problem is that the inventory check depends heavily on the distributed lock. As long as the lock is set and deleted correctly, the inventory check works fine. Once the lock is no longer safe and reliable, the inventory check is useless.
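
The thread A / B / C sequence described above can be replayed deterministically without Redis. The sketch below is only an illustration: `ExpiringLockStore` and its manual clock are hypothetical stand-ins for `SET key value NX EX` and an unconditional `DEL`, not our production code.

```java
import java.util.HashMap;
import java.util.Map;

class LockExpiryDemo {

    /** In-memory stand-in for Redis SET NX EX plus an unconditional DEL (illustration only). */
    static class ExpiringLockStore {
        private final Map<String, Long> expiresAt = new HashMap<>();
        private long nowSeconds = 0; // manual clock keeps the replay deterministic

        void advance(long seconds) {
            nowSeconds += seconds;
        }

        /** Succeeds only when the key is absent or its TTL has already run out. */
        boolean setIfAbsent(String key, long ttlSeconds) {
            Long expiry = expiresAt.get(key);
            if (expiry != null && expiry > nowSeconds) {
                return false;
            }
            expiresAt.put(key, nowSeconds + ttlSeconds);
            return true;
        }

        /** Unconditional delete, like the original finally block: no ownership check. */
        void delete(String key) {
            expiresAt.remove(key);
        }
    }

    /** Replays the sequence and reports whether A, B and C each acquired the lock. */
    static boolean[] replay() {
        ExpiringLockStore store = new ExpiringLockStore();
        String key = "key:1";

        boolean a = store.setIfAbsent(key, 10); // t=0: A acquires, then blocks on user verification
        store.advance(12);                      // t=12: A's lock already expired at t=10
        boolean b = store.setIfAbsent(key, 10); // B acquires; its lock should last until t=22
        store.delete(key);                      // A finishes and deletes B's lock
        boolean c = store.setIfAbsent(key, 10); // still t=12: C acquires while B is working
        return new boolean[] {a, b, c};
    }

    public static void main(String[] args) {
        boolean[] r = replay();
        System.out.println("A=" + r[0] + " B=" + r[1] + " C=" + r[2]);
    }
}
```

All three acquisitions succeed, so B and C run the critical section at the same time even though a "lock" is in place.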

Solution

Now that we know the causes, we can prescribe the right remedy.

Implement a relatively safe distributed lock
Definition of relatively safe: every set is paired with its own del, so a client never deletes a lock that belongs to someone else. In practice, even with set and del paired one to one, absolute business safety still cannot be guaranteed, because the lock's expiry is always bounded. The only alternatives are to set no expiry or a very long one, and both bring problems of their own, so they are not worthwhile. To implement a relatively safe distributed lock, we must rely on the value stored under the key: on release, the uniqueness of that value guarantees that we do not delete a lock we do not own. We implement an atomic get-and-compare with a Lua script, as follows:

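public void safedUnLock(String key, String val) {
    // Compare the stored value with our token and delete only on a match, atomically inside Redis.
    // "in" is a reserved word in Lua, so the local variable is named "expected".
    String luaScript = "local expected = ARGV[1] local curr = redis.call('get', KEYS[1]) if expected == curr then redis.call('del', KEYS[1]) end return 'OK'";
    RedisScript<String> redisScript = RedisScript.of(luaScript, String.class);
    // Assumes the template serializes values as plain strings, so val matches what was stored on lock
    redisTemplate.execute(redisScript, Collections.singletonList(key), val);
}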

With this Lua script, the lock is released safely.
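
To see why comparing the value before deleting helps, here is a minimal in-memory sketch. It is not Redis code: `OwnerCheckedLock` is a hypothetical model in which `ConcurrentHashMap.remove(key, value)` plays the role of the Lua compare-and-delete, and fixed tokens replace random UUIDs so the run is reproducible.

```java
import java.util.concurrent.ConcurrentHashMap;

class OwnerCheckedLockDemo {

    /** In-memory model of a token-guarded lock (illustration only, no real TTL handling). */
    static class OwnerCheckedLock {
        private final ConcurrentHashMap<String, String> locks = new ConcurrentHashMap<>();

        /** Analogue of SET key token NX: succeeds only if the key is free. */
        boolean tryLock(String key, String token) {
            return locks.putIfAbsent(key, token) == null;
        }

        /** Analogue of the Lua script: removes the key only if it still holds our token. */
        boolean unlock(String key, String token) {
            return locks.remove(key, token);
        }

        /** Simulates TTL expiry: the key disappears regardless of owner. */
        void expire(String key) {
            locks.remove(key);
        }
    }

    /** Returns {A's late unlock removed a lock, C acquired while B holds the lock}. */
    static boolean[] replay() {
        OwnerCheckedLock lock = new OwnerCheckedLock();
        String key = "key:1";
        lock.tryLock(key, "token-A");                      // A acquires
        lock.expire(key);                                  // A's lock times out while A is blocked
        lock.tryLock(key, "token-B");                      // B acquires
        boolean lateUnlock = lock.unlock(key, "token-A"); // A finishes: token mismatch, nothing deleted
        boolean cAcquired = lock.tryLock(key, "token-C"); // C is rejected because B still holds the lock
        return new boolean[] {lateUnlock, cAcquired};
    }

    public static void main(String[] args) {
        boolean[] r = replay();
        System.out.println("A's late unlock removed a lock: " + r[0] + ", C acquired: " + r[1]);
    }
}
```

Both values are false: A's late release is a no-op, so B keeps its lock. Note that the expiry itself still lets A and B overlap, which is why the lock is only relatively safe.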

Implement safe inventory verification

If we look more closely at concurrency, we find that operations such as get-and-compare or read-and-save are not atomic. To make them atomic, we could again use a Lua script. In our case, however, each flash-sale order can contain only one bottle, so we do not need Lua and can rely on the atomicity of Redis commands themselves. The reason is:

// Redis applies the decrement and returns the resulting value as a single atomic operation
Long currStock = redisTemplate.opsForHash().increment(key + ":info", "stock", -1);

Notice that the separate inventory check in the original code was entirely superfluous, a case of "drawing legs on a snake".
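
The decrement-then-inspect idea can be tried locally. In the sketch below, which is an illustration rather than our service code, an `AtomicInteger` stands in for the Redis hash field and `decrementAndGet` for the atomic increment by -1; the class name, pool size, and buyer count are arbitrary assumptions.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class AtomicStockDemo {

    /** Runs `buyers` concurrent purchase attempts against `stock` units and returns the orders created. */
    static int sell(int stock, int buyers) {
        AtomicInteger remaining = new AtomicInteger(stock); // stands in for the Redis "stock" field
        AtomicInteger orders = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        CountDownLatch done = new CountDownLatch(buyers);

        for (int i = 0; i < buyers; i++) {
            pool.execute(() -> {
                // Decrement first, then inspect the returned value: every caller sees a distinct result
                if (remaining.decrementAndGet() >= 0) {
                    orders.incrementAndGet(); // "generate order"
                }
                done.countDown();
            });
        }

        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for buyers", e);
        } finally {
            pool.shutdown();
        }
        return orders.get();
    }

    public static void main(String[] args) {
        System.out.println("orders created: " + sell(100, 1000));
    }
}
```

Because each caller receives a distinct post-decrement value, exactly 100 of the 1000 attempts see a non-negative result, whatever the thread scheduling. The counter itself ends up negative, just as the Redis field does in this approach.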

Improved code
After the analysis above, we decided to create a new DistributedLocker class to encapsulate the distributed lock.

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    String key = "key:" + request.getSeckillId();
    // Unique token that identifies this request as the lock owner
    String val = UUID.randomUUID().toString();
    try {
        Boolean lockFlag = distributedLocker.lock(key, val, 10, TimeUnit.SECONDS);
        if (!Boolean.TRUE.equals(lockFlag)) {
            // Business exception: the lock was not acquired, reject the request here
        }

        // User activity verification
        // Inventory verification relies on the atomicity of Redis itself
        Long currStock = stringRedisTemplate.opsForHash().increment(key + ":info", "stock", -1);
        if (currStock < 0) { // Stock was already exhausted before this request
            // Business exception
            log.error("[Rush order] No inventory");
        } else {
            // Generate order
            // Publish the order-created event
            // Build response
        }
    } finally {
        // Deletes the lock only if it still holds our token
        distributedLocker.safedUnLock(key, val);
        // Build response
    }
    return response;
}

Deep thinking

Is the distributed lock necessary?
After the improvement, we can see that Redis's atomic stock deduction alone is enough to prevent overselling. That's right. However, without the lock layer, every request would run through the whole business logic, and because that logic depends on other systems, the pressure on them would grow. That costs performance and makes the service less stable. The distributed lock intercepts part of the traffic.

Choosing a distributed lock
Someone suggested using RedLock to implement the distributed lock. RedLock is more reliable, but it sacrifices some performance. In this scenario, the gain in reliability is worth far less than the gain in performance. For scenarios with very high reliability requirements, RedLock is an option.

Rethinking whether the distributed lock is necessary
Because the bug had to be fixed and released urgently, we made the optimization, load-tested it in the test environment, and deployed it to production right away. It proved successful: performance improved slightly, and no overselling occurred even when the distributed lock failed. Is there still room for optimization? Yes, there is! Since the service is deployed as a cluster, we can split the inventory across the servers in the cluster and notify each server of its share by broadcast. The gateway layer hashes the user ID to decide which server handles a request. Stock deduction and checking can then be done in the application's local cache, which improves performance further.

// Initialized in advance via message; ConcurrentHashMap provides efficient thread safety
private static final ConcurrentHashMap<Long, Boolean> SECKILL_FLAG_MAP = new ConcurrentHashMap<>();
// Populated in advance via message. AtomicInteger itself is atomic, so a plain HashMap can be used here
// (this assumes the map is fully populated before requests arrive and is not modified afterwards)
private static final Map<Long, AtomicInteger> SECKILL_STOCK_MAP = new HashMap<>();

...

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;

    Long seckillId = request.getSeckillId();
    if (!Boolean.TRUE.equals(SECKILL_FLAG_MAP.get(seckillId))) {
        // Business exception
    }
    // User activity verification
    // Inventory verification
    if (SECKILL_STOCK_MAP.get(seckillId).decrementAndGet() < 0) {
        SECKILL_FLAG_MAP.put(seckillId, false);
        // Business exception
    }
    // Generate order
    // Publish the order-created event
    // Build response
    return response;
}

With this change, the request path no longer depends on Redis at all, and performance and safety improve further. Of course, this scheme does not handle complex scenarios such as machines being dynamically added to or removed from the cluster. If those must be considered, it is better to go straight back to the distributed-lock solution.
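
The scheme assumes two steps that the snippet does not show: splitting the stock across servers, and routing each user to a fixed server. A minimal sketch of both follows. `StockSharding`, `split`, and `route` are hypothetical names, and the even split and the modulo hash are assumptions made for illustration, not the algorithm our gateway actually uses.

```java
import java.util.Arrays;

class StockSharding {

    /** Splits the total stock as evenly as possible; the first (total % servers) shards get one extra unit. */
    static int[] split(int total, int servers) {
        int[] shards = new int[servers];
        Arrays.fill(shards, total / servers);
        for (int i = 0; i < total % servers; i++) {
            shards[i]++;
        }
        return shards;
    }

    /** Maps a user ID to a server index; floorMod keeps the index non-negative for any hash value. */
    static int route(long userId, int servers) {
        return Math.floorMod(Long.hashCode(userId), servers);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(100, 3))); // prints [34, 33, 33]
        System.out.println(route(42L, 3));                  // the same user always maps to the same server
    }
}
```

One consequence of this design is that a server can run out of its share while other servers still have stock, so some users may be turned away before the total is sold.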

Summary

Overselling scarce goods is without question a major incident. If the oversold quantity is large, it can have a very serious business and social impact on the platform. This incident taught me that no line of code in a project can be taken lightly; otherwise, in some scenarios, code that normally works fine becomes a fatal killer. When designing a solution, a developer must think it through thoroughly. How can we do that? Only by continuous learning!


Topics: Java Redis Distribution