Lessons learned from being bitten repeatedly by performance bottlenecks caused by heavy online log output

Posted by ttroutmpr on Mon, 01 Nov 2021 09:47:57 +0100

Because our online services run at a large scale (hundreds of millions of requests and hundreds of thousands of daily active users) and the business logic is complex, we log at the info level in production, which produces a very large volume of logs (the total log volume per microservice is roughly 1000 GB per hour). We frequently hit performance bottlenecks caused by slow log writing. Below is a list of the log performance improvements and log-reduction conventions we adopted, for your reference. The logging framework we use is Log4j2.

1. Use Log4j2 asynchronous logging

First, when log volume is large, we recommend using asynchronous logging for output. The advantage is that, as long as the RingBuffer is not full, calls to the logging API do not block. The disadvantage is a greater chance of losing logs: for example, when the process exits abnormally (e.g. kill -9), the LogEvents still sitting in the RingBuffer that have not yet been written to the file are lost.

Here is a brief look at how Log4j2 asynchronous logging works: it is built on the Disruptor, a high-performance ring-buffer data structure with many low-level optimizations (for details see my other series, High concurrency data structure (disruptor)). Log4j2 uses it as follows:

In short, multiple threads log through the Log4j2 facade class org.apache.logging.log4j.Logger; each call is wrapped into an org.apache.logging.log4j.core.LogEvent and published to the Disruptor's ring buffer. On the consumer side, a single thread consumes these LogEvents and writes them to the corresponding Appender. Suppose we have only one Appender, configured to write all logs to a file:

<RollingFile name="file" append="true"
             filePattern="./app.log-%d{yyyy.MM.dd.HH}" immediateFlush="false">
    <PatternLayout pattern="${logFormat}"/>
    <Policies>
        <TimeBasedTriggeringPolicy interval="1" modulate="true"/>
    </Policies>
    <DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>

The asynchronous logger is configured as follows (includeLocation is set to false to avoid the cost of capturing the call stack on every log call, which is discussed later); the log level here is info:

 <AsyncRoot level="info" includeLocation="false">
    <appender-ref ref="file"/>
</AsyncRoot>

The Log4j2 component properties are additionally configured as follows:

log4j2.component.properties:

# When there are no logs to consume, the single log-writing thread waits for new LogEvents via the SLEEP strategy.
# This keeps CPU consumption low while the consumer can still be woken promptly when logs arrive. Analyzed in detail later.
# We additionally changed the Thread.yield() step of this strategy to Thread.onSpinWait() (see section 3)
AsyncLoggerConfig.WaitStrategy=SLEEP
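
With this in place, nothing changes in the application code: logging still goes through the facade, and the call returns as soon as the LogEvent has been published to the RingBuffer. A minimal usage sketch (the class name is made up for illustration):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class DemoService {
    private static final Logger LOGGER = LogManager.getLogger(DemoService.class);

    public void handleRequest(String requestId) {
        // With the AsyncRoot configuration above, this call only wraps the message into a
        // LogEvent and publishes it to the Disruptor RingBuffer; the actual file write
        // happens later on the single consumer thread.
        LOGGER.info("handle request {}", requestId);
    }
}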

2. Turn off includeLocation and put the code location into the log message itself

In the log we may want to output the class, method, and line where the log statement lives. That requires capturing the stack at run time. Capturing the stack, whether via Throwable.getStackTrace() before Java 9 or via StackWalker since Java 9, is a very CPU-intensive operation, and with a high log volume it becomes a serious performance bottleneck. The reasons are:

  1. Capturing the stack means switching from running Java code to running JVM (native) code, similar to a JNI call, and this transition has a performance cost.
  2. Before Java 9 the stack was captured by creating a new exception; since Java 9 it is captured via StackWalker. Both approaches, up to the current Java 17, have serious performance problems under high concurrency and eat a lot of CPU; the problem mainly lies in how the underlying JVM handles symbols and the constant pool when materializing stack frames.

Therefore we do not have the framework print the class, method, and line in the log. Instead, you can write the class name and method name into the log message yourself to locate the problem code quickly. This kind of template is usually available in the IDE; for example, I customized such a shortcut template in IntelliJ IDEA.

After typing the shortcut cm and pressing Tab, it expands into a string containing the class name and method name.
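
For illustration, a sketch of what such a log statement looks like after the template expands (the class and method names are made up; the point is that the location is baked into the message string, so Log4j2 never has to walk the stack):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class OrderService {
    private static final Logger LOGGER = LogManager.getLogger(OrderService.class);

    public void placeOrder(String orderId) {
        // "OrderService-placeOrder" was typed in by the IDE live template at edit time,
        // not captured at run time, so includeLocation can stay false.
        LOGGER.info("OrderService-placeOrder start processing order: {}", orderId);
    }
}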

3. Configure the Disruptor wait strategy as SLEEP, preferably with its Thread.yield() changed to Thread.onSpinWait() (this change only applies to x86 deployments)

What the Disruptor consumer does is keep checking whether a message has arrived, which really means checking whether a status bit (the sequence) is ready; once it is ready, the message is read and consumed. How to keep checking is the wait strategy. The Disruptor provides several wait strategies, which anyone familiar with multiprocessor programming will recognize. Simply put, a wait strategy decides how a thread should wait and give up CPU resources while no work has arrived, yet start working as soon as possible once it does. In Log4j2, asynchronous logging is built on the Disruptor, and the environment variable AsyncLoggerConfig.WaitStrategy configures the Disruptor's wait strategy. The latest version of Log4j2 currently supports the following:

switch (strategyUp) {
    case "SLEEP":
        final long sleepTimeNs =
                parseAdditionalLongProperty(propertyName, "SleepTimeNs", 100L);
        final String key = getFullPropertyKey(propertyName, "Retries");
        final int retries =
                PropertiesUtil.getProperties().getIntegerProperty(key, 200);
        return new SleepingWaitStrategy(retries, sleepTimeNs);
    case "YIELD":
        return new YieldingWaitStrategy();
    case "BLOCK":
        return new BlockingWaitStrategy();
    case "BUSYSPIN":
        return new BusySpinWaitStrategy();
    case "TIMEOUT":
        return new TimeoutBlockingWaitStrategy(timeoutMillis, TimeUnit.MILLISECONDS);
    default:
        return new TimeoutBlockingWaitStrategy(timeoutMillis, TimeUnit.MILLISECONDS);
}

Here we use SleepingWaitStrategy, the most balanced option. In most applications today the number of threads is much larger than the number of CPUs, and often even the number of RUNNABLE threads exceeds the number of CPUs. With the wait/notify-based BlockingWaitStrategy, when traffic suddenly arrives after an idle period and the business peak hits, the log-consuming thread is not woken up in time because the CPUs have already been grabbed by the many RUNNABLE business threads. With the more aggressive BusySpinWaitStrategy (which keeps calling Thread.onSpinWait()) or YieldingWaitStrategy (which keeps calling Thread.yield() after an initial spin), CPU consumption stays high even when the system is idle. What we want is a progressive wait strategy, such as:

  1. Spin for a certain number of iterations first, to avoid thread-switching overhead when logs are plentiful.
  2. After that number of iterations is exceeded, start calling Thread.onSpinWait() or Thread.yield() so that the current thread yields CPU resources, to cope with intermittent log peaks.
  3. After step 2 has run for a certain number of iterations, block the current thread with something like wait() or Thread.sleep() to keep CPU consumption low when log volume is low.

SleepingWaitStrategy is exactly such a strategy: its step 2 is Thread.yield(), and its step 3 parks the thread briefly (LockSupport.parkNanos) instead of spinning. On top of this we change the Thread.yield() call to Thread.onSpinWait(). The reason is that we deploy on x86, and on x86, once Thread.onSpinWait() has been called enough times to be JIT-compiled, the call is intrinsified into the x86 PAUSE instruction. See the JVM source:

x86.ad

instruct onspinwait() %{
  match(OnSpinWait);
  ins_cost(200);

  format %{
    $$template
    $$emit$$"pause\t! membar_onspinwait"
  %}
  ins_encode %{
    __ pause();
  %}
  ins_pipe(pipe_slow);
%}

Recall that the CPU does not always operate on main memory directly: it reads a cache line and then works on the copy in its own cache. Code that keeps checking whether a status bit is ready therefore keeps reading its own CPU cache, and it will keep seeing the old value until the current CPU learns over the bus that the cache line has been invalidated. When the business is busy, the bus may also be busy, so during step 2 of SleepingWaitStrategy the consumer may fail to observe the status-bit update in time and drop through to step 3.

The PAUSE instruction (reference: https://www.felixcloutier.com... ) is a special instruction designed exactly for this kind of waiting. It tells the processor that the code sequence being executed is a loop that repeatedly checks whether a status bit is ready, i.e. a spin-wait loop. With this hint, the processor avoids the memory-order violation (and the resulting penalty) that usually occurs when such a loop exits, skips unnecessary speculative reads inside the loop, and does not reorder the loop's instructions, all of which improves the execution efficiency of the spin-wait loop.

This instruction makes Thread.onSpinWait() more efficient than Thread.yield() in a spin-wait loop, which is why we change SleepingWaitStrategy's Thread.yield() to Thread.onSpinWait().
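
For reference, here is a minimal sketch of what the modified strategy looks like. The structure mirrors the Disruptor's own SleepingWaitStrategy (step 1 spin, step 2 onSpinWait, step 3 park); the class name is ours, and applying it in practice means patching the bundled SleepingWaitStrategy rather than registering a new class with Log4j2:

import java.util.concurrent.locks.LockSupport;

import com.lmax.disruptor.AlertException;
import com.lmax.disruptor.Sequence;
import com.lmax.disruptor.SequenceBarrier;
import com.lmax.disruptor.WaitStrategy;

public final class SpinWaitSleepingWaitStrategy implements WaitStrategy {
    private final int retries;      // iterations of step 1 + step 2 before falling back to parking
    private final long sleepTimeNs; // how long step 3 parks each time

    public SpinWaitSleepingWaitStrategy(final int retries, final long sleepTimeNs) {
        this.retries = retries;
        this.sleepTimeNs = sleepTimeNs;
    }

    @Override
    public long waitFor(final long sequence, final Sequence cursor,
                        final Sequence dependentSequence, final SequenceBarrier barrier)
            throws AlertException {
        long availableSequence;
        int counter = retries;
        // Keep checking the "status bit" (the dependent sequence) until the slot we want is ready.
        while ((availableSequence = dependentSequence.get()) < sequence) {
            counter = applyWaitMethod(barrier, counter);
        }
        return availableSequence;
    }

    @Override
    public void signalAllWhenBlocking() {
        // Nothing to do: this strategy never blocks on a lock or condition.
    }

    private int applyWaitMethod(final SequenceBarrier barrier, int counter) throws AlertException {
        barrier.checkAlert();
        if (counter > 100) {
            --counter;                          // step 1: pure spin
        } else if (counter > 0) {
            --counter;
            Thread.onSpinWait();                // step 2: was Thread.yield() in the original
        } else {
            LockSupport.parkNanos(sleepTimeNs); // step 3: park briefly when logs are quiet
        }
        return counter;
    }
}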

4. Customize the exception-formatting plugin to avoid the further performance problems caused by printing overly long exception stacks

The JVM does have the flag -XX:+OmitStackTraceInFastThrow, which automatically omits the stack trace once the same exception has been thrown many times. But its downsides are:

  1. When a new exception first appears during a traffic peak, a large number of full stack traces are still printed before the optimization kicks in.
  2. For a long-running process, the stack may have been omitted long ago: for example, exception A is thrown many times on January 1, and when it is thrown again on March 1 there is no stack trace at all, making it hard to find where the original stack was ever logged.

Since our project uses many frameworks, some of them asynchronous, exception stacks can be very, very long (some exceed 1000 lines). What suits us best is:

  1. A stack trace is output for every exception,
  2. but the stack trace only contains frames from the packages we care about; frames from other packages are omitted.

However, Log4j2 only provides a blacklist configuration for packages, i.e. you can only specify which packages should be omitted from the stack trace.

What we actually need is to turn this blacklist into a whitelist. So, based on the official source code, we inverted the blacklist check into a whitelist check and wrote a custom exception-formatting plugin:

import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.config.Configuration;
import org.apache.logging.log4j.core.config.plugins.Plugin;
import org.apache.logging.log4j.core.impl.CustomizedThrowableProxyRenderer;
import org.apache.logging.log4j.core.impl.ThrowableProxy;
import org.apache.logging.log4j.core.pattern.ConverterKeys;
import org.apache.logging.log4j.core.pattern.PatternConverter;
import org.apache.logging.log4j.core.pattern.ThrowablePatternConverter;

@Plugin(name = "CustomizedThrowablePatternConverter", category = PatternConverter.CATEGORY)
// The converter keys are %cusEx, %cusThrowable, and %cusException
@ConverterKeys({ "cusEx", "cusThrowable", "cusException" })
public class CustomizedThrowablePatternConverter extends ThrowablePatternConverter {
    public static CustomizedThrowablePatternConverter newInstance(final Configuration config, final String[] options) {
        return new CustomizedThrowablePatternConverter(config, options);
    }
    private CustomizedThrowablePatternConverter(final Configuration config, final String[] options) {
        super("CustomizedThrowable", "throwable", options, config);
    }

    @Override
    public void format(final LogEvent event, final StringBuilder toAppendTo) {
        final ThrowableProxy proxy = event.getThrownProxy();
        final Throwable throwable = event.getThrown();
        if ((throwable != null || proxy != null) && options.anyLines()) {
            if (proxy == null) {
                super.format(event, toAppendTo);
                return;
            }
            final int len = toAppendTo.length();
            if (len > 0 && !Character.isWhitespace(toAppendTo.charAt(len - 1))) {
                toAppendTo.append(' ');
            }
            // CustomizedThrowableProxyRenderer is a copy of Log4j2's ThrowableProxyRenderer with the blacklist check inverted into a whitelist
            CustomizedThrowableProxyRenderer.formatExtendedStackTraceTo(proxy, toAppendTo, options.getIgnorePackages(),
                    options.getTextRenderer(), getSuffix(event), options.getSeparator());
        }
    }
}
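
The core of CustomizedThrowableProxyRenderer is simply that inverted package check. A minimal sketch of the idea (the method and parameter names here are illustrative, not actual Log4j2 signatures):

import java.util.List;

public final class WhitelistStackFilter {

    private WhitelistStackFilter() {
    }

    // Log4j2's own renderer suppresses frames whose class name starts with one of the
    // configured (blacklist) packages; here the check is flipped so that only frames
    // from the configured packages are kept.
    static boolean suppress(final StackTraceElement element, final List<String> keepPackages) {
        if (keepPackages == null || keepPackages.isEmpty()) {
            return false; // nothing configured: keep every frame
        }
        final String className = element.getClassName();
        for (final String pkg : keepPackages) {
            if (className.startsWith(pkg)) {
                return false; // frame belongs to a package we care about: keep it
            }
        }
        return true; // everything else collapses into "... suppressed N lines"
    }
}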

Then, using the layout pattern %cusEx{filters(java, com.mycompany)}, the stack trace only contains frames whose packages start with java or com.mycompany. The output stack looks like this:

    ... suppressed 27 lines
    at com.mycompany.spring.cloud.parent.web.common.undertow.jfr.JFRTracingFilter.doFilter(JFRTracingFilter.java:40) ~[spring-cloud-parent-web-common-2020.0.3-SNAPSHOT.jar!/:2020.0.3-SNAPSHOT]
    at com.mycompany.spring.cloud.parent.web.common.undertow.jfr.LazyJFRTracingFilter.doFilter(LazyJFRTracingFilter.java:23) ~[spring-cloud-parent-web-common-2020.0.3-SNAPSHOT.jar!/:2020.0.3-SNAPSHOT]
    ... suppressed 46 lines
Caused by: com.alibaba.fastjson.JSONException: write javaBean error, fastjson version 1.2.75, class com.hopegaming.factsCenter.query.revo.controller.frontend.share.matches.MatchController$EventVO, fieldName : data
    ... suppressed 8 lines
    ... 74 more
Caused by: java.lang.NullPointerException
    at com.mycompany.OrderController.postOrder(OrderController.java:103) ~[classes!/:2020.0.3-SNAPSHOT]
    ... suppressed 13 lines
    ... 74 more

5. Monitor RingBuffer usage: if usage stays above a certain percentage for a period of time, the application is too busy writing logs, needs to be scaled out, and part of its traffic should temporarily be diverted away

Since the RingBuffer of the Log4j2 Disruptor is a ring buffer, its capacity is limited. We did not change its size and use the default configuration. Looking at the source code:

AsyncLoggerConfigDisruptor.java

public synchronized void start() {
    // other code omitted
    ringBufferSize = DisruptorUtil.calculateRingBufferSize("AsyncLoggerConfig.RingBufferSize");
    disruptor = new Disruptor<>(factory, ringBufferSize, threadFactory, ProducerType.MULTI, waitStrategy);
}

DisruptorUtil.java

private static final int RINGBUFFER_MIN_SIZE = 128;
private static final int RINGBUFFER_DEFAULT_SIZE = 256 * 1024;
private static final int RINGBUFFER_NO_GC_DEFAULT_SIZE = 4 * 1024;

static int calculateRingBufferSize(final String propertyName) {
    // If ThreadLocal reuse is enabled the default is 4 * 1024 slots, otherwise 256 * 1024 slots
    int ringBufferSize = Constants.ENABLE_THREADLOCALS ? RINGBUFFER_NO_GC_DEFAULT_SIZE : RINGBUFFER_DEFAULT_SIZE;
    // Read the system properties and the log4j2.component.properties file for the value of propertyName (here AsyncLoggerConfig.RingBufferSize)
    final String userPreferredRBSize = PropertiesUtil.getProperties().getStringProperty(propertyName,
            String.valueOf(ringBufferSize));
    try {
        int size = Integer.parseInt(userPreferredRBSize);
        // If the configured size is less than the minimum of 128 slots, use 128
        if (size < RINGBUFFER_MIN_SIZE) {
            size = RINGBUFFER_MIN_SIZE;
            LOGGER.warn("Invalid RingBufferSize {}, using minimum size {}.", userPreferredRBSize,
                    RINGBUFFER_MIN_SIZE);
        }
        ringBufferSize = size;
    } catch (final Exception ex) {
        LOGGER.warn("Invalid RingBufferSize {}, using default size {}.", userPreferredRBSize, ringBufferSize);
    }
    // Round up to the nearest power of two, because modulo by 2^n can be computed as a bitwise AND with 2^n - 1, which is faster
    return Integers.ceilingNextPowerOfTwo(ringBufferSize);
}
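
As an aside, here is a tiny illustration (not Log4j2 code; the numbers are arbitrary) of why the size is rounded up to a power of two: with a power-of-two size, indexing into the ring can use a bitmask instead of a modulo.

public class PowerOfTwoIndexDemo {
    public static void main(String[] args) {
        final int bufferSize = 256 * 1024;                          // 2^18 slots, the default without ThreadLocal reuse
        final long sequence = 1_000_003L;                           // some ever-increasing sequence number
        final int modIndex = (int) (sequence % bufferSize);         // slot index via modulo
        final int maskIndex = (int) (sequence & (bufferSize - 1));  // slot index via bitwise AND
        // For a power-of-two size the two results are always equal, and the AND avoids a division.
        System.out.println(modIndex + " == " + maskIndex);
    }
}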

If ThreadLocal reuse is enabled, LogEvents are generated by reusing the LogEvent stored in each thread's ThreadLocal instead of creating a new one every time. However, consider the following:

logger.info("{}", someObj);

The reused LogEvent would hold a strong reference to someObj, so if the thread never logs again, someObj can never be garbage collected. That is why, for web applications, Log4j2 does not enable ThreadLocal LogEvent reuse by default:

Constants.java

public static final boolean IS_WEB_APP = PropertiesUtil.getProperties().getBooleanProperty(
            "log4j2.is.webapp", isClassAvailable("javax.servlet.Servlet"));
public static final boolean ENABLE_THREADLOCALS = !IS_WEB_APP && PropertiesUtil.getProperties().getBooleanProperty(
            "log4j2.enable.threadlocals", true);

From this we can see that our RingBuffer size is the default 256 * 1024 = 262,144 slots (we run web applications, so ThreadLocal reuse is disabled).

So what does Log4j2 do when the RingBuffer is full? If AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false is configured in log4j2.component.properties, the producing thread waits (actually parks) inside the Disruptor's publish method until the next ring-buffer slot can be produced into. The default is true, which means all log-producing threads try to acquire one and the same global lock (private final Object queueFullEnqueueLock = new Object();):

DisruptorUtil.java

static final boolean ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
        .getBooleanProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", true);
private boolean synchronizeEnqueueWhenQueueFull() {
    return DisruptorUtil.ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL
            // Background thread must never block
            && backgroundThreadId != Thread.currentThread().getId();
}

private final Object queueFullEnqueueLock = new Object();

private void enqueue(final LogEvent logEvent, final AsyncLoggerConfig asyncLoggerConfig) {
    // If AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull = true (the default)
    if (synchronizeEnqueueWhenQueueFull()) {
        synchronized (queueFullEnqueueLock) {
            disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
        }
    } else {
        // If AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull = false
        disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
    }
}

With the default configuration, the thread stack of a blocked producer (the same stack you would see in JFR) looks like this:

"Thread-0" #27 [13136] prio=5 os_prio=0 cpu=0.00ms elapsed=141.08s tid=0x0000022d6f2fbcc0 nid=0x3350 waiting for monitor entry  [0x000000399bcfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:375)
    - waiting to lock <merged>(a java.lang.Object)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
    at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
    at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
    at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
    at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)
    ... business method frames omitted

When it is configured as false, the stack looks like this:

"Thread-0" #27 [18152] prio=5 os_prio=0 cpu=0.00ms elapsed=5.68s tid=0x000002c1fa120e00 nid=0x46e8 runnable  [0x000000eda8efe000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@17-loom/Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17-loom/LockSupport.java:410)
    at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
    at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
    at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:524)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:379)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
    at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
    at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
    at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
    at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)

This is the situation we need to monitor for.

The first thing that comes to mind is metrics collected periodically by an external system. These days every service is pushed to the cloud and toward cloud native deployments, and cloud deployments are likely to store logs on NFS (Network File System), such as AWS EFS. This kind of NFS can dynamically scale its maximum IO throughput, but service growth is hard to predict perfectly and high-concurrency traffic arrives almost instantaneously, so the real peak is hard to judge from periodic IO sampling (for example, IO is sampled once every 5 seconds, but a burst arrives within a single second and blocks most threads in the process; by the time IO is sampled, the pressure no longer looks high). Moreover, because the threads are blocked, CPU usage may not look high either. So it is hard to truly pinpoint a log-traffic problem with externally collected periodic metrics.

So instead we monitor from within the process and expose an interface for external monitoring to poll regularly, such as the Kubernetes pod health check. When a process is under too much log-writing pressure, a new instance is scaled out; once the new instance is up, the overloaded instance is temporarily taken offline in the registry (for example set to OUT_OF_SERVICE in Eureka, or PAUSED in Nacos) so that traffic is forwarded to other instances. Once the log pressure drops, its status is set back to UP and it resumes serving.

Log4j2 creates an independent RingBuffer for every AsyncLogger configuration. Take the following log4j2 configuration as an example:

<!-- everything other than the loggers section omitted -->
 <loggers>
    <!--default logger -->
    <AsyncRoot level="info" includeLocation="true">
        <appender-ref ref="console"/>
    </AsyncRoot>
    <AsyncLogger name="RocketmqClient" level="error" additivity="false" includeLocation="true">
        <appender-ref ref="console"/>
    </AsyncLogger>
    <AsyncLogger name="com.alibaba.druid.pool.DruidDataSourceStatLoggerImpl" level="error" additivity="false" includeLocation="true">
        <appender-ref ref="console"/>
    </AsyncLogger>
    <AsyncLogger name="org.mybatis" level="error" additivity="false" includeLocation="true">
        <appender-ref ref="console"/>
    </AsyncLogger>
</loggers>

This configuration contains four async loggers, and a RingBuffer is created for each of them. Log4j2 has also thought about monitoring async loggers and exposes each of them as an MBean (JMX Managed Bean).

The relevant source code is as follows:

Server.java

private static void registerLoggerConfigs(final LoggerContext ctx, final MBeanServer mbs, final Executor executor)
        throws InstanceAlreadyExistsException, MBeanRegistrationException, NotCompliantMBeanException {

    //Get all configuration values under the loggers tag in the log4j2.xml configuration
    final Map<String, LoggerConfig> map = ctx.getConfiguration().getLoggers();
    //Each key is actually the name of the logger
    for (final String name : map.keySet()) {
        final LoggerConfig cfg = map.get(name);
        final LoggerConfigAdmin mbean = new LoggerConfigAdmin(ctx, cfg);
        //Register a LoggerConfigAdmin for each logger
        register(mbs, mbean, mbean.getObjectName());
        //If it is an asynchronous log configuration, register a RingBufferAdmin
        if (cfg instanceof AsyncLoggerConfig) {
            final AsyncLoggerConfig async = (AsyncLoggerConfig) cfg;
            final RingBufferAdmin rbmbean = async.createRingBufferAdmin(ctx.getName());
            register(mbs, rbmbean, rbmbean.getObjectName());
        }
    }
}

Source code of the MBean class that gets created: RingBufferAdmin.java

public class RingBufferAdmin implements RingBufferAdminMBean {
    private final RingBuffer<?> ringBuffer;
    private final ObjectName objectName;
    //... omit other code we don't care about
    
    public static final String DOMAIN = "org.apache.logging.log4j2";
    String PATTERN_ASYNC_LOGGER_CONFIG = DOMAIN + ":type=%s,component=Loggers,name=%s,subtype=RingBuffer";
    
    //Create RingBufferAdmin. The name format conforms to the name format of Mbean
    public static RingBufferAdmin forAsyncLoggerConfig(final RingBuffer<?> ringBuffer, 
            final String contextName, final String configName) {
        final String ctxName = Server.escape(contextName);
        //For RootLogger, cfgName here is an empty string
        final String cfgName = Server.escape(configName);
        final String name = String.format(PATTERN_ASYNC_LOGGER_CONFIG, ctxName, cfgName);
        return new RingBufferAdmin(ringBuffer, name);
    }
    
    //Gets the size of RingBuffer
    @Override
    public long getBufferSize() {
        return ringBuffer == null ? 0 : ringBuffer.getBufferSize();
    }
    //Gets the remaining size of RingBuffer
    @Override
    public long getRemainingCapacity() {
        return ringBuffer == null ? 0 : ringBuffer.remainingCapacity();
    }
    public ObjectName getObjectName() {
        return objectName;
    }
}

We can view the corresponding MBeans through JConsole (screenshot omitted); the 2f0e140b in the MBean names is the name of the LoggerContext.

Our microservice project uses Spring Boot and integrates Prometheus. We can expose the Log4j2 RingBuffer remaining capacity as a metric to Prometheus with the following code:

import io.micrometer.core.instrument.Gauge;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import lombok.extern.log4j.Log4j2;
import org.apache.commons.lang3.StringUtils;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.jmx.RingBufferAdminMBean;
import org.springframework.beans.factory.ObjectProvider;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.autoconfigure.metrics.export.ConditionalOnEnabledMetricsExport;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.event.ContextRefreshedEvent;
import org.springframework.context.event.EventListener;

import javax.annotation.PostConstruct;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

@Log4j2
@Configuration(proxyBeanMethods = false)
//Only load this when Prometheus is on the classpath and the actuator exposes the prometheus endpoint
@ConditionalOnEnabledMetricsExport("prometheus")
public class Log4j2Configuration {
    @Autowired
    private ObjectProvider<PrometheusMeterRegistry> meterRegistry;
    //Initialize only once
    private volatile boolean isInitialized = false;

    //We need to register after the ApplicationContext has been refreshed
    //The log configuration is initialized before the ApplicationContext is loaded
    //However, wiring up the Prometheus-related beans is complex and changes from version to version, so to keep it simple we register after the whole ApplicationContext has been refreshed
    //The ApplicationContext may be refreshed more than once (e.g. via /actuator/refresh), and there can be multiple ApplicationContexts
    //For simplicity, a simple isInitialized flag is used to check whether this is the first initialization, ensuring we register only once
    @EventListener(ContextRefreshedEvent.class)
    public synchronized void init() {
        if (!isInitialized) {
            //Obtain LoggerContext through LogManager to obtain configuration
            LoggerContext loggerContext = (LoggerContext) LogManager.getContext(false);
            org.apache.logging.log4j.core.config.Configuration configuration = loggerContext.getConfiguration();
            //Get the name of LoggerContext because the name of Mbean contains this
            String ctxName = loggerContext.getName();
            configuration.getLoggers().keySet().forEach(k -> {
                try {
                    //For the RootLogger, its cfgName is an empty string. For good display, we name it root in prometheus
                    String cfgName = StringUtils.isBlank(k) ? "" : k;
                    String gaugeName = StringUtils.isBlank(k) ? "root" : k;
                    Gauge.builder(gaugeName + "_logger_ring_buffer_remaining_capacity", () ->
                    {
                        try {
                            return (Number) ManagementFactory.getPlatformMBeanServer()
                                    .getAttribute(new ObjectName(
                                            //Assemble the name according to the naming method in the Log4j2 source code
                                            String.format(RingBufferAdminMBean.PATTERN_ASYNC_LOGGER_CONFIG, ctxName, cfgName)
                                            //Get the remaining size. Note that this is strictly case sensitive
                                    ), "RemainingCapacity");
                        } catch (Exception e) {
                            log.error("get {} ring buffer remaining size error", k, e);
                        }
                        return -1;
                    }).register(meterRegistry.getIfAvailable());
                } catch (Exception e) {
                    log.error("Log4j2Configuration-init error: {}", e.getMessage(), e);
                }
            });
            isInitialized = true;
        }
    }
}

After adding this code and requesting /actuator/prometheus, we can see the corresponding output:

# other metrics omitted
# HELP root_logger_ring_buffer_remaining_capacity  
# TYPE root_logger_ring_buffer_remaining_capacity gauge
root_logger_ring_buffer_remaining_capacity 262144.0
# HELP org_mybatis_logger_ring_buffer_remaining_capacity  
# TYPE org_mybatis_logger_ring_buffer_remaining_capacity gauge
org_mybatis_logger_ring_buffer_remaining_capacity 262144.0
# HELP com_alibaba_druid_pool_DruidDataSourceStatLoggerImpl_logger_ring_buffer_remaining_capacity  
# TYPE com_alibaba_druid_pool_DruidDataSourceStatLoggerImpl_logger_ring_buffer_remaining_capacity gauge
com_alibaba_druid_pool_DruidDataSourceStatLoggerImpl_logger_ring_buffer_remaining_capacity 262144.0
# HELP RocketmqClient_logger_ring_buffer_remaining_capacity  
# TYPE RocketmqClient_logger_ring_buffer_remaining_capacity gauge
RocketmqClient_logger_ring_buffer_remaining_capacity 262144.0

With this in place, when the value stays at 0 for a period of time (meaning the RingBuffer is full and logs are produced far faster than the consumer can write them to the Appender), we consider the application's log load too high.
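
Building on the same MBean, the health-check idea described earlier (temporarily taking an instance offline when it is too busy writing logs) could be sketched roughly like this. This is a hedged sketch rather than our production code: the threshold, window, and class name are made up, and it only reads the root logger's RemainingCapacity using the same MBean name pattern as the Prometheus gauge above:

import java.lang.management.ManagementFactory;

import javax.management.ObjectName;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.jmx.RingBufferAdminMBean;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class Log4j2RingBufferHealthIndicator implements HealthIndicator {
    // How long the RingBuffer must stay completely full before we report DOWN (assumed value)
    private static final long FULL_WINDOW_MS = 10_000L;
    private volatile long fullSince = -1L;

    @Override
    public Health health() {
        try {
            LoggerContext ctx = (LoggerContext) LogManager.getContext(false);
            // Same MBean name pattern as the Prometheus gauge above; "" selects the root logger
            ObjectName objectName = new ObjectName(
                    String.format(RingBufferAdminMBean.PATTERN_ASYNC_LOGGER_CONFIG, ctx.getName(), ""));
            long remaining = ((Number) ManagementFactory.getPlatformMBeanServer()
                    .getAttribute(objectName, "RemainingCapacity")).longValue();
            if (remaining == 0L) {
                long now = System.currentTimeMillis();
                if (fullSince < 0L) {
                    fullSince = now;
                }
                if (now - fullSince >= FULL_WINDOW_MS) {
                    // The RingBuffer has been full for the whole window: log pressure is too high,
                    // so report DOWN and let the probe / registry integration divert traffic.
                    return Health.down().withDetail("log4j2RingBufferRemaining", remaining).build();
                }
            } else {
                fullSince = -1L;
            }
            return Health.up().withDetail("log4j2RingBufferRemaining", remaining).build();
        } catch (Exception e) {
            return Health.unknown().withException(e).build();
        }
    }
}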


Topics: Java Microservices log4j2