Note the OOM exception resolution process on a single line

Posted by jkatcher on Tue, 01 Oct 2019 13:04:51 +0200

background

Things are like this. This department has maintained an online report query service (referred to as report). For some time, users often feedback to their small partners. Report often fails to open the pages, and after a certain period of time, they recover. Although it is not a trading system, it also causes serious problems to users and wastes a lot of resources, so I intend to take time and energy to focus on solving these problems recently.

process

When this happened again, the login to the server found that the JVM had been carrying out full GC, but could not recover memory at all. At the same time, the log reported java.lang.OutOfMemoryError: Java heap space.
Helplessly, the service can only be restarted to ensure normal access.
Subsequently, by modifying the Tomcat startup script, OOM dump is added to the JVM startup parameters:

vi catalina.sh 
#java jvm config
#JAVA_OPTS="-Xms1024m -Xmx4096m -Xss1024K -XX:PermSize=512m -XX:MaxPermSize=2048m"
JAVA_OPTS="-Xms1024m -Xmx4096m -Xss1024K -XX:PermSize=512m -XX:MaxPermSize=2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/tomcat8999/temp"

Some time later, when another problem occurred, the login server found that the OOM log had been dump out. 24428 is the process number of the current JVM.

Next, copy the log to the local machine, start Eclipse Memory Analyzer, and load java_pid24428.hprof, as shown in Figure 1:

Open Histogram to view the objects of each class

Except for the basic java objects such as char[],String and so on, no custom classes that take up a lot of space are found. Look at Dominator Tree, which lists the largest objects in memory.

Here we find an anomaly.

@ 0x6c3391470 pool-1-thread-3, a thread that occupies 97.5% of the total memory, locates the problem in the ReportQuery Controller class after expansion, which is easy to do.
In fact, Leak Suspects Report can also confirm this point:

Then let's look at the stack details:

OK! Now the problem is ready to be located. ReportQuery Controller. writeDownFile is a method to provide data download for reports. The problem lies in the download link. Next is to find code vulnerabilities.

// Output report content
@Override
public void processRow(List<String> row) throws Exception {
    String value = "\n" + format.format(row.toArray());
    write(out, value, charset);
    atomicTotal.incrementAndGet();
}
...
if (atomicTotal.get() > DOWNLOAD_MAX) {
    throw new RuntimeException(DOWNLOAD_MAX_ERROR);
}

The problem is that in this code, the program originally wanted to limit the maximum number of downloaded lines DOWNLOAD_MAX = 100,000 lines, so it defined a self-increasing Integer, and read a row of data + 1, and finally judged whether the value of Atomic Integer reached DOWNLOAD_MAX, and then decided whether to throw an exception.
The problem is that if the amount of data read far exceeds DOWNLOAD_MAX, it is possible to fill up memory before judging atomicTotal. get ()> DOWNLOAD_MAX.
And let's see that Atomic Integer has a value of 110 66861 at this time.

Class Name | Shallow Heap | Retained Heap | Percentage

java.util.concurrent.atomic.AtomicInteger @ 0x6df9c11e8  11066861 | 16 | 16 | 0.00%

More than 10 million records eventually fill up memory.

The next step is to revamp the code:

// Output report content
@Override
public void processRow(List<String> row) throws Exception {
    String value = "\n" + format.format(row.toArray());
    write(out, value, charset);

    //flush every 1000 rows
    flushPerOneThousand(flushSignal, out);
    //More than 100,000 inspections
    if (checkOverFlow(atomicTotal)) {
        breakOut(out);
    }
}

After modification, flush data will be sent to the file every 1000 rows, and the method will exit immediately after DOWNLOAD_MAX is checked, and the flow will be closed and the exception will be thrown.
After testing, the reconstructed version will not cause OOM when the record set that will have problems before downloading.

[ypdata@sjsa-hadooptest tomcat8999]$ jstat -gc 14754 1000 20000
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT   
44544.0 45568.0  0.0   14588.5 1300992.0 66862.7   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 66862.7   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 66955.7   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 66955.7   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 66957.8   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 66959.8   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 66966.4   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 67059.4   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 67105.6   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 67105.6   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694
44544.0 45568.0  0.0   14588.5 1300992.0 67105.6   848896.0   367583.8  78552.0 76517.4 8152.0 7735.1    349  231.509   5     57.185  288.694

So far, the report stability problem has been solved.

summary

  • On-line system should add dump on oom option to get heap image in the first place
  • After OOM occurs, we should use memory analysis tools to find out the cause of leakage and locate the specific code segment in time.
  • Writing code in peacetime should always tighten a chord, not only pay attention to the realization of business logic, but also need to consider the call of resources (computing, memory, network, disk)

enjoy yourself

Topics: Java jvm Tomcat Eclipse