How to locate a RocketMQ problem

Posted by csousley on Mon, 20 Dec 2021 03:15:52 +0100

Problem analysis

The first thing to consider is whether the consumer thread is stuck. A thread can be stuck because:

  1. A stop-the-world pause occurred, caused either by GC or by some other safepoint operation (such as jstack, periodic safepoint polling, etc.; please refer to my article JVM related - SafePoint and Stop The World)
  2. The thread takes too long to process a message, for example because it cannot obtain a lock or is blocked on some IO

I collected a JFR recording covering that time window (for more about JFR, please refer to my other series, JFR total solution) and found:

  1. During this time period there were no GC or other stop-the-world safepoint events with long pauses (JFR event screenshots omitted here).

  2. During this time the consumer thread was parked, and its stack shows that it had no messages to consume (screenshot omitted).
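For reference, a minimal sketch of how such a recording can be captured in-process with the standard jdk.jfr API (the two-minute window, settings name and output path are illustrative, not what we actually used; running jcmd <pid> JFR.start from outside the process works just as well):

import java.nio.file.Paths;
import java.time.Duration;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class JfrCapture {
    public static void main(String[] args) throws Exception {
        //"profile" is the higher-detail built-in settings file; "default" has lower overhead
        Configuration config = Configuration.getConfiguration("profile");
        try (Recording recording = new Recording(config)) {
            recording.setDuration(Duration.ofMinutes(2)); //illustrative capture window
            recording.start();
            Thread.sleep(Duration.ofMinutes(2).toMillis()); //in a real service this is just normal workload time
            //GC, safepoint and thread-park events can then be inspected in JDK Mission Control
            recording.dump(Paths.get("/tmp/consumer.jfr"));
        }
    }
}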

Since the application itself looks fine, let's see whether something is wrong on the RocketMQ side. For RocketMQ Broker logs, we generally care about:

  1. Message persistence latency statistics; if something is abnormal here, the Java mmap-related parameters need tuning, please refer to:
  2. Message persistence errors: check the storeerr log
  3. Lock problems: check the lock log

Which broker should we look at? As mentioned earlier, a hashKey is specified when sending to this Topic, so we can work out which queue (and hence which broker) to check from the message's hashKey:

//24 = total number of queues for this Topic (8 queues per broker)
int hashCode = "our hashKey".hashCode();
log.info("{}", Math.abs(hashCode % 24));

We found the message's hashKey, and the code above gives 20, i.e. queue 20. From the earlier description, **we know that each broker has 8 queues, and queue 20 corresponds to the queue on broker-2 shown earlier, i.e. the queue with queueId = 5 on broker-2**. Let's locate the problem in broker-2's logs.

In broker-2's lock log we found anomalies like the one below. There were many similar entries, spanning about 54s, which matches the thread park time and the message delay:

2021-07-01 07:11:47 WARN AdminBrokerThread_10 - tryLockBatch, message queue locked by other client. Group: consumption group OtherClientId: 10.238.18.6@29 NewClientId: 10.238.18.122@29 MessageQueue [topic=news topic, brokerName=broker-2, queueId=5]

This log means that instance 10.238.18.122@29 failed to lock queueId = 5 because 10.238.18.6@29 was still holding the lock. So why did this happen?

Principle of RocketMQ multi-queue sequential consumption

To realize multi-queue sequential consumption, RocketMQ first requires a hashKey to be specified; based on this hashKey, each message is put into a specific queue. When a consumer consumes this queue with sequential consumption enabled, the queue is consumed by a single thread, which guarantees ordering within that queue.
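A minimal sketch of this pattern with the Apache RocketMQ Java client (the group names, topic, name server address and hashKey below are placeholders, not the ones from our system):

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeOrderlyContext;
import org.apache.rocketmq.client.consumer.listener.ConsumeOrderlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerOrderly;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.selector.SelectMessageQueueByHash;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class OrderedConsumptionSketch {
    public static void main(String[] args) throws Exception {
        //Producer side: route every message with the same hashKey to the same queue
        DefaultMQProducer producer = new DefaultMQProducer("example_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.start();
        String hashKey = "order-123"; //e.g. an order id
        Message msg = new Message("example_topic", "tagA", hashKey,
                "payload".getBytes(StandardCharsets.UTF_8));
        //SelectMessageQueueByHash picks the queue with index abs(hashKey.hashCode() % queueCount)
        producer.send(msg, new SelectMessageQueueByHash(), hashKey);
        producer.shutdown();

        //Consumer side: an orderly listener makes each queue consumed by one thread at a time
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("example_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("example_topic", "*");
        consumer.registerMessageListener(new MessageListenerOrderly() {
            @Override
            public ConsumeOrderlyStatus consumeMessage(List<MessageExt> msgs,
                                                       ConsumeOrderlyContext context) {
                msgs.forEach(m -> System.out.println(new String(m.getBody(), StandardCharsets.UTF_8)));
                return ConsumeOrderlyStatus.SUCCESS;
            }
        });
        consumer.start();
    }
}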

So how does RocketMQ guarantee that each queue is consumed by a single thread at a time? Each Broker maintains a lock table:

private final ConcurrentMap<String/* group */, ConcurrentHashMap<MessageQueue, LockEntry>> mqLockTable =
        new ConcurrentHashMap<String, ConcurrentHashMap<MessageQueue, LockEntry>>(1024);

It is a ConcurrentMap<consumer group name, ConcurrentHashMap<message queue, lock entry>>. The lock object, LockEntry, looks like this:

RebalanceLockManager.java:

//Read the system property rocketmq.broker.rebalance.lockMaxLiveTime; the default is 60s (60000 ms)
private final static long REBALANCE_LOCK_MAX_LIVE_TIME = Long.parseLong(System.getProperty(
        "rocketmq.broker.rebalance.lockMaxLiveTime", "60000"));
static class LockEntry {
    //RocketMQ client unique id
    private String clientId;
    private volatile long lastUpdateTimestamp = System.currentTimeMillis();
    //getters and setters omitted
    public boolean isLocked(final String clientId) {
        boolean eq = this.clientId.equals(clientId);
        return eq && !this.isExpired();
    }
    public boolean isExpired() {
        // The entry expires once REBALANCE_LOCK_MAX_LIVE_TIME has elapsed since the last update
        boolean expired =
            (System.currentTimeMillis() - this.lastUpdateTimestamp) > REBALANCE_LOCK_MAX_LIVE_TIME;
        return expired;
    }
}

The RocketMQ client sends LOCK_BATCH_MQ requests to the Broker. The Broker wraps each client request into a LockEntry and tries to update this map: if the update succeeds, the lock is acquired; if not, it is not. The Broker's detailed update logic is below (read it if you are interested, or skip it without losing the thread; there is a diagram later that summarizes it):

public boolean tryLock(final String group, final MessageQueue mq, final String clientId) {
    //this client does not currently hold the lock for this queue, so try to acquire it
    if (!this.isLocked(group, mq, clientId)) {
        try {
            //take the in-process lock; each broker maintains its own queue lock table and does not share it
            this.lock.lockInterruptibly();
            try {
                //look up the entry: create it if absent, otherwise check ownership and expiry
                ConcurrentHashMap<MessageQueue, LockEntry> groupValue = this.mqLockTable.get(group);
                if (null == groupValue) {
                    groupValue = new ConcurrentHashMap<>(32);
                    this.mqLockTable.put(group, groupValue);
                }
                LockEntry lockEntry = groupValue.get(mq);
                if (null == lockEntry) {
                    lockEntry = new LockEntry();
                    lockEntry.setClientId(clientId);
                    groupValue.put(mq, lockEntry);
                    log.info("tryLock, message queue not locked, I got it. Group: {} NewClientId: {} {}",
                        group,
                        clientId,
                        mq);
                }
                if (lockEntry.isLocked(clientId)) {
                    lockEntry.setLastUpdateTimestamp(System.currentTimeMillis());
                    return true;
                }
                String oldClientId = lockEntry.getClientId();
                if (lockEntry.isExpired()) {
                    lockEntry.setClientId(clientId);
                    lockEntry.setLastUpdateTimestamp(System.currentTimeMillis());
                    log.warn(
                        "tryLock, message queue lock expired, I got it. Group: {} OldClientId: {} NewClientId: {} {}",
                        group,
                        oldClientId,
                        clientId,
                        mq);
                    return true;
                }
                //Here is the log we just saw
                log.warn(
                    "tryLock, message queue locked by other client. Group: {} OtherClientId: {} NewClientId: {} {}",
                    group,
                    oldClientId,
                    clientId,
                    mq);
                return false;
            } finally {
                this.lock.unlock();
            }
        } catch (InterruptedException e) {
            log.error("putMessage exception", e);
        }
    } else {
        //already locked by this client; isLocked() above refreshed the lease timestamp
    }
    return true;
}
//Check whether this client already holds a non-expired lock on the queue
private boolean isLocked(final String group, final MessageQueue mq, final String clientId) {
    //Get by consumer group name
    ConcurrentHashMap<MessageQueue, LockEntry> groupValue = this.mqLockTable.get(group);
    //If not null
    if (groupValue != null) {
        //Try to get lockEntry to see if it exists
        LockEntry lockEntry = groupValue.get(mq);
        if (lockEntry != null) {
            //entry exists: check that it belongs to this client and has not expired
            boolean locked = lockEntry.isLocked(clientId);
            if (locked) {
                lockEntry.setLastUpdateTimestamp(System.currentTimeMillis());
            }
            return locked;
        }
    }
    return false;
}

Each MQ client periodically sends LOCK_BATCH_MQ requests and keeps track locally of all the queues for which it holds the lock:

ProcessQueue.java:

//Interval at which LOCK_BATCH_MQ is sent periodically; the default is 20s (20000 ms)
public final static long REBALANCE_LOCK_INTERVAL = Long.parseLong(System.getProperty("rocketmq.client.rebalance.lockInterval", "20000"));

ConsumeMessageOrderlyService.java:

if (MessageModel.CLUSTERING.equals(ConsumeMessageOrderlyService.this.defaultMQPushConsumerImpl.messageModel())) {
    this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {
        @Override
        public void run() {
            ConsumeMessageOrderlyService.this.lockMQPeriodically();
        }
    }, 1000 * 1, ProcessQueue.REBALANCE_LOCK_INTERVAL, TimeUnit.MILLISECONDS);
}

The flow chart is as follows:

(flow chart image missing)

When ConsumeMessageOrderlyService is shut down, all queues held by this instance are unlocked:

public void shutdown() {
    this.stopped = true;
    this.scheduledExecutorService.shutdown();
    this.consumeExecutor.shutdown();
    if (MessageModel.CLUSTERING.equals(this.defaultMQPushConsumerImpl.messageModel())) {
        this.unlockAllMQ();
    }
}
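This unlock path only runs if the consumer is actually shut down. A minimal sketch (illustrative, assuming a plain Java application; the group, topic and address are placeholders) of making sure DefaultMQPushConsumer#shutdown is called when the process is asked to stop:

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeOrderlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerOrderly;

public class GracefulShutdownSketch {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("example_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("example_topic", "*");
        //no-op orderly listener, just so the sketch is complete
        consumer.registerMessageListener(
                (MessageListenerOrderly) (msgs, ctx) -> ConsumeOrderlyStatus.SUCCESS);
        consumer.start();
        //On a normal SIGTERM (e.g. Kubernetes draining the pod within its grace period)
        //the hook calls consumer.shutdown(), which runs ConsumeMessageOrderlyService#shutdown
        //and unlocks the queues, instead of leaving the Broker-side lock to expire on its own
        Runtime.getRuntime().addShutdownHook(
                new Thread(consumer::shutdown, "rocketmq-consumer-shutdown"));
    }
}

Note that this only helps when the old instance receives a catchable termination signal and has time to run the hook; a hard kill still leaves the lock to time out on the Broker.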

Cause of the problem

Here, the interval at which the client periodically sends LOCK_BATCH_MQ is left at its default of 20s, and the expiration time of the Broker-side lock is also left at its default of 60s.

Our cluster orchestration uses Kubernetes (k8s) and supports instance migration. When the cluster is under heavy load, new nodes (think of them as virtual machines) are added automatically and new service instances are deployed on them. When the load on some services drops, service instances are scaled in; fewer nodes are then needed and some are reclaimed, but a reclaimed node may still host service instances that cannot be scaled in, and those instances have to be migrated to other nodes. This is what happened to our business instance.

When the problem occurred, such a migration took place: the old instance was killed without ConsumeMessageOrderlyService#shutdown being run, so the lock was never actively released, and the new instance could only take over the queue and start consuming after the lock expired on the Broker. Since the old instance renews the lock at most every 20s, the remaining lease at the moment it was killed was somewhere between 40s and 60s, which is consistent with the roughly 54s of tryLockBatch warnings and message delay we observed.

Solving the problem

  1. Add graceful shutdown logic for the RocketMQ client in the next release (along the lines of the shutdown-hook sketch shown earlier)
  2. Configure rocketmq.client.rebalance.lockInterval on all service instances (RocketMQ clients) to shorten the lock-renewal heartbeat (for example to 5s), and configure rocketmq.broker.rebalance.lockMaxLiveTime on the RocketMQ Broker to shorten the lock expiration time (for example to 15s), keeping the expiration time at three times the heartbeat interval (the usual "three heartbeats" rule of thumb in cluster design); a configuration sketch follows this list
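A minimal sketch of how these two settings could be applied. The property names come from the source shown above; the concrete 5s / 15s values are just the illustrative ones from point 2:

//Client side (service instance JVM flag): shorten the lock-renewal heartbeat to 5s
//  -Drocketmq.client.rebalance.lockInterval=5000
//Broker side (broker JVM flag): shorten the lock lease to 15s (= 3 x heartbeat)
//  -Drocketmq.broker.rebalance.lockMaxLiveTime=15000

//Optional sanity check on the client at startup (brokerLockMaxLiveTime must match the Broker setting)
long lockInterval = Long.parseLong(
        System.getProperty("rocketmq.client.rebalance.lockInterval", "20000"));
long brokerLockMaxLiveTime = 15_000L;
if (brokerLockMaxLiveTime < 3 * lockInterval) {
    throw new IllegalStateException("lock lease should be at least 3 x the renewal heartbeat");
}

Both values are read through System.getProperty, so passing them as -D flags on the respective JVMs is enough; no code change is required.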