Tools that come to mind when GC runs continuously and takes too long: jvmkill and jvmquake

Posted by d1223m on Mon, 25 Nov 2019 10:49:12 +0100

Recently a strange problem showed up in an online query microservice: the number of GCs suddenly increased, and so did the time spent in GC. It did not happen all the time, but it kept being reported every so often.

Looking at jstat -gcutil <pid> 10000 shows something like the following:

S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
0.00 100.00 100.00 100.00  44.56  95.10     21    8.678    11  140.498  149.176
0.00 100.00 100.00 100.00  44.56  95.10     31    8.678    11  140.498  169.176
0.00 100.00 100.00 100.00  44.56  95.10     41    8.678    11  140.498  189.176

A class histogram from jmap (e.g. jmap -histo <pid>) shows a large number of com.mysql.jdbc.ByteArrayRow objects, the row objects that JDBC returns for query results:

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       6736991      270490848  [B (java.base@11.0.4)
   2:       2154446       51706704  java.lang.String (java.base@11.0.4)
   3:        184135       30913512  [[B (java.base@11.0.4)
   4:          3720       30535872  [C (java.base@11.0.4)
   5:        102716       20543200  Business database entity POJO class
   6:        543210       17382720  java.sql.Timestamp (java.sql@11.0.4)
   7:        524288       12582912  org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapper
   8:         48220        7123584  [Ljava.lang.Object; (java.base@11.0.4)
   9:         71046        6252048  java.lang.reflect.Method (java.base@11.0.4)
  10:        152288        4873216  java.util.concurrent.ConcurrentHashMap$Node (java.base@11.0.4)
  11:        183671        4408104  com.mysql.jdbc.ByteArrayRow
  12:         14721        4056328  [I (java.base@11.0.4)
  13:          1959        3149560  [J (java.base@11.0.4)

Our guess was that some database query was written badly and pulled in far too large a result set (most likely the query behind the business entity class above). Further digging showed that a scheduled task was triggering a full-table-scan SQL statement, which sent GC pressure soaring without ever producing an OutOfMemoryError.

A problem like this is troublesome to pin down by hand. It would be much better to have a tool that automatically takes the unhealthy instance out of service and dumps its memory, so the root cause can be located and fixed afterwards.

Netflix, a veteran of large-scale Java services and the source of many classic Java frameworks, thought about this kind of tooling long ago.

Dead queries and jvmkill

Netflix runs a lot of Java-based storage, such as Cassandra and Elasticsearch. Most of the time it behaves well, but occasionally a bug or a poorly designed query triggers a full scan or a flood of queries, exhausting memory or driving the JVM into continuous GC. Such an instance serves requests far more slowly and is effectively unhealthy. For this kind of unhealthy instance caused by a "dead query", the first thing to do is kill it promptly to limit the impact on the business, and jvmkill does exactly that.

jvmkill is a native agent built on the JVM Tool Interface (JVMTI) ResourceExhausted callback: when the JVM reaches the OutOfMemoryError: unable to create new native thread or OutOfMemoryError: Java heap space state, the callback kills the JVM process. It can be combined with the JVM startup parameter -XX:+HeapDumpOnOutOfMemoryError so that a heap snapshot is written when the OutOfMemoryError happens, which can then be used to locate the problem.
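
To make the mechanism concrete, here is a minimal sketch of such an agent, written in C against the JVMTI API. It is not the actual jvmkill source; the log message is made up, but the callback signature and capabilities are the standard JVMTI ones, and the agent would be built as a shared library and loaded with -agentpath.

/* Sketch of a jvmkill-style JVMTI agent (not the actual jvmkill source):
 * on ResourceExhausted, send SIGKILL to the current process. */
#include <jvmti.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void JNICALL
on_resource_exhausted(jvmtiEnv *jvmti, JNIEnv *jni, jint flags,
                      const void *reserved, const char *description)
{
    /* React to heap or thread exhaustion, the two states described above. */
    if (flags & (JVMTI_RESOURCE_EXHAUSTED_JAVA_HEAP |
                 JVMTI_RESOURCE_EXHAUSTED_THREADS)) {
        fprintf(stderr, "ResourceExhausted (%s): killing JVM\n", description);
        kill(getpid(), SIGKILL);
    }
}

JNIEXPORT jint JNICALL
Agent_OnLoad(JavaVM *vm, char *options, void *reserved)
{
    jvmtiEnv *jvmti = NULL;
    jvmtiCapabilities caps;
    jvmtiEventCallbacks callbacks;

    if ((*vm)->GetEnv(vm, (void **)&jvmti, JVMTI_VERSION_1_2) != JNI_OK)
        return JNI_ERR;

    /* ResourceExhausted events require these optional capabilities. */
    memset(&caps, 0, sizeof(caps));
    caps.can_generate_resource_exhaustion_heap_events = 1;
    caps.can_generate_resource_exhaustion_threads_events = 1;
    (*jvmti)->AddCapabilities(jvmti, &caps);

    memset(&callbacks, 0, sizeof(callbacks));
    callbacks.ResourceExhausted = on_resource_exhausted;
    (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(callbacks));
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_RESOURCE_EXHAUSTED, NULL);
    return JNI_OK;
}

The real jvmkill does more than this, but the shape is the same: no polling, just a callback that fires when the JVM itself reports resource exhaustion.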

As for recovery: our cluster runs on containers, so once the process is found to have been killed, a new pod (instance) is started automatically.

jvmquake: covering the scenarios jvmkill cannot solve

However, not every problem can be handled this way. The problem described at the beginning is one of them: it only causes continuous GC and never an OutOfMemoryError, so jvmkill never gets a chance to fire.

Netflix gives a similar example with Cassandra, reading an entire table:

cqlsh> PAGING OFF
Disabled Query paging.
cqlsh> SELECT * FROM large_ks.large_table;
OperationTimedOut: errors={}, last_host=some host
cqlsh> SELECT * FROM large_ks.large_table;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.

Using the JVM tools to look at the state of the Cassandra instance:

$ sudo -u cassandra jstat -gcutil $(pgrep -f Cassandra) 100ms
S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
0.00 100.00 100.00 100.00  97.96  95.10     21    8.678    11  140.498  149.176
0.00 100.00 100.00 100.00  97.96  95.10     21    8.678    11  140.498  149.176
0.00 100.00 100.00 100.00  97.96  95.10     21    8.678    11  140.498  149.176
$ grep "stopped" gclog | tail
2019-03-05T02:59:15.111+0000: 1057.680: Total time for which application threads were stopped: 18.0113457 seconds, Stopping threads took: 0.0001798 seconds
2019-03-05T02:59:16.159+0000: 1058.728: Total time for which application threads were stopped: 1.0472826 seconds, Stopping threads took: 0.0001542 seconds
2019-03-05T02:59:40.461+0000: 1083.030: Total time for which application threads were stopped: 24.3016592 seconds, Stopping threads took: 0.0001809 seconds
2019-03-05T03:00:16.263+0000: 1118.832: Total time for which application threads were stopped: 35.8020625 seconds, Stopping threads took: 0.0001307 seconds
2019-03-05T03:00:51.599+0000: 1154.168: Total time for which application threads were stopped: 35.3361231 seconds, Stopping threads took: 0.0001517 seconds
2019-03-05T03:01:09.470+0000: 1172.039: Total time for which application threads were stopped: 17.8703301 seconds, Stopping threads took: 0.0002151 seconds

The GC log shows pauses of more than 20 seconds happening again and again; loading the log into a tool such as GCViewer makes the same picture visible at a glance.

Essentially all of the time goes to GC caused by the dead query, and the instance never recovers on its own. For this situation you want monitoring that proactively kills the process, but no existing JVM startup parameter or simple status check can express the condition.

The design and implementation of jvmquake

Netflix engineers take the view that no matter how heavy the load, the JVM should spend most of its time running business logic, not collecting garbage. If the proportion of business time gets too low, the JVM process should be killed. So they designed a model: a GC time pool. Whenever a GC occurs, its duration is added to the pool (a 200 ms collection adds 200 ms, a 500 ms collection adds 500 ms); time spent running business code is deducted from the pool, with a floor of zero so the pool never goes negative. If the ratio of GC time to business time stays above 1:1, the pool grows without bound; once it crosses a configured limit, the JVM is considered unhealthy. This is the core idea behind jvmquake.

How is GC time measured? Through the two JVM Tool Interface (JVMTI) events GarbageCollectionStart and GarbageCollectionFinish.

jvmquake has three parameters:

  1. GC pool upper limit: as described above, once the pool reaches this limit the JVM is considered unhealthy and the configured action is taken.
  2. Ratio of business time to GC time: the expected ratio of business time to GC time, 1:1 by default. If the expectation is instead 2:1, the update becomes GC pool += 2 * GC time - 1 * business time. In practice a healthy service spends well over 95% of its time on business logic (a ratio of at least 19:1).
  3. Action to take: before jvmquake sends SIGKILL to the JVM process, it can perform some extra steps to preserve the scene for root-cause analysis later. A rough sketch that puts these three parameters together follows.
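
The sketch below illustrates the GC time pool model described above, written as a JVMTI agent in C. It is not the actual jvmquake source, and the 30-second limit, the 1:1 weight and the SIGABRT action are assumed example values for the three parameters.

/* Sketch of the GC time pool (not the actual jvmquake source).
 * GC time fills the pool, business (application) time drains it, and the
 * pool never drops below zero. When the pool exceeds the limit, the agent
 * takes its action, here SIGABRT so the kernel writes a core dump. */
#include <jvmti.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double pool_secs = 0.0;        /* accumulated GC "debt" */
static double limit_secs = 30.0;      /* parameter 1: pool upper limit (example) */
static double runtime_weight = 1.0;   /* parameter 2: expected business:GC ratio */
static struct timespec phase_start;   /* start of the current GC or business phase */

/* Seconds since the last phase change; also restarts the clock. */
static double phase_elapsed(void)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double secs = (now.tv_sec - phase_start.tv_sec)
                + (now.tv_nsec - phase_start.tv_nsec) / 1e9;
    phase_start = now;
    return secs;
}

static void JNICALL on_gc_start(jvmtiEnv *jvmti)
{
    /* Business code ran since the previous GC ended: drain the pool. */
    pool_secs -= phase_elapsed();
    if (pool_secs < 0.0)
        pool_secs = 0.0;
}

static void JNICALL on_gc_finish(jvmtiEnv *jvmti)
{
    /* The GC that just ended fills the pool, weighted as in parameter 2. */
    pool_secs += runtime_weight * phase_elapsed();
    if (pool_secs > limit_secs) {
        fprintf(stderr, "GC pool %.1fs over limit: JVM unhealthy\n", pool_secs);
        kill(getpid(), SIGABRT);  /* parameter 3: the action (core dump here) */
    }
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved)
{
    jvmtiEnv *jvmti = NULL;
    jvmtiCapabilities caps;
    jvmtiEventCallbacks callbacks;

    if ((*vm)->GetEnv(vm, (void **)&jvmti, JVMTI_VERSION_1_2) != JNI_OK)
        return JNI_ERR;

    memset(&caps, 0, sizeof(caps));
    caps.can_generate_garbage_collection_events = 1;
    (*jvmti)->AddCapabilities(jvmti, &caps);

    memset(&callbacks, 0, sizeof(callbacks));
    callbacks.GarbageCollectionStart = on_gc_start;
    callbacks.GarbageCollectionFinish = on_gc_finish;
    (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(callbacks));
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_GARBAGE_COLLECTION_START, NULL);
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_GARBAGE_COLLECTION_FINISH, NULL);
    clock_gettime(CLOCK_MONOTONIC, &phase_start);
    return JNI_OK;
}

Because the GC event callbacks run while the collector has the application stopped, the handlers have to stay trivial: a bit of clock arithmetic and, at worst, raising a signal.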

What action? From full heap dump to core dump

So what should these extra steps be? First of all, we definitely want the kind of full heap dump jmap produces. One option is to have a thread keep allocating large arrays until an OutOfMemoryError occurs, which triggers -XX:+HeapDumpOnOutOfMemoryError and writes the full dump (a rough sketch of this follows). The catch is that after a few kill-and-restart cycles these dumps can fill the disk. Ideally the dump output would go through a pluggable channel so it could be written somewhere other than the local disk (to S3, for example), but unfortunately the JVM offers no such hook. So we turn to core dumps instead.
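
For what it's worth, the "allocate until OOM" trick mentioned above could look roughly like the sketch below. This is an assumption-laden illustration, not jvmquake's actual code: the JNIEnv must belong to a normal Java-capable thread (for example an agent thread started via the JVMTI RunAgentThread call), never a GC callback, and the chunk sizes are arbitrary.

/* Sketch: exhaust the Java heap from native code so that
 * -XX:+HeapDumpOnOutOfMemoryError writes a heap dump before the process dies. */
#include <jni.h>
#include <stdio.h>

#define SLOTS       4096                  /* how many chunks we can pin */
#define CHUNK_ELEMS (16 * 1024 * 1024)    /* 16M longs = 128 MB per chunk */

static void force_heap_oom(JNIEnv *env)
{
    jclass object_cls = (*env)->FindClass(env, "java/lang/Object");
    if (object_cls == NULL)
        return;
    jobjectArray holder = (*env)->NewObjectArray(env, SLOTS, object_cls, NULL);

    for (jsize i = 0; holder != NULL && i < SLOTS; i++) {
        /* Every chunk stays reachable through 'holder', so the GC cannot
         * reclaim it; eventually allocation fails and the JVM throws
         * java.lang.OutOfMemoryError: Java heap space. */
        jlongArray chunk = (*env)->NewLongArray(env, CHUNK_ELEMS);
        if (chunk == NULL || (*env)->ExceptionCheck(env)) {
            fprintf(stderr, "Heap exhausted after %d chunks\n", (int)i);
            break;
        }
        (*env)->SetObjectArrayElement(env, holder, i, chunk);
        (*env)->DeleteLocalRef(env, chunk);
    }
}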

When a process is sent SIGABRT, Linux writes a core dump automatically. The size of the core dump is limited by ulimit -c; to remove the limit, run:

ulimit -c unlimited

We still want to upload the dump to S3. Fortunately, the core dump can be piped into a program of our choosing via the kernel.core_pattern sysctl (see the Linux kernel documentation, chapter "Piping core dumps to a program"), and that program can do the upload:

coredump_uploader.sh

#!/bin/bash
# Depends on zstd (>1.3.6) and aws-cli
set -o errexit
set -o nounset
set -o pipefail

# Set kernel.core_pattern = | coredump_uploader.sh %P %s %E
PID=$1
shift
SIGNAL=$1
shift
EXECUTABLE=$1
shift

# Set these to appropriate values for your environment
readonly BUCKET=
readonly APP_NAME=
readonly INSTANCE_ID=
readonly COREDUMP_BACKOFF_MINS=60

TARGET_STDERR="/proc/${PID}/fd/2"
# Redirect our output into the crashing process's stderr, if possible;
# otherwise fall back to the system logger.
if ! exec 2>>"${TARGET_STDERR}" >&2; then
  exec 2> >(exec 2>&-; logger -t "coredump-uploader-${PID}")
  exec >&2
fi

echo "${0##*/} initiated for PID ${PID}"
mkdir -p /run/coredumps

# Rate-limit uploads: drop stale backoff markers, then bail out if a recent
# core dump was already handled.
find /run/coredumps -mmin "+${COREDUMP_BACKOFF_MINS}" -type f -delete
LAST=$(find /run/coredumps -type f -printf "%t\n" | head -n1)
if [ -n "${LAST}" ]; then
   # Back off without creating a file
   echo "Last core dump was at ${LAST}; too soon to upload another."
   exit
fi
# Record this attempt so the next crash within the backoff window is skipped
mktemp -p /run/coredumps dump.XXXXXXXXXX

MACHINE_DATE=$(date "+%s.%N")
HUMAN_DATE=$(date -d "@${MACHINE_DATE}" '+%Y-%m-%dT%H:%M:%S.%N')
TWO_63=9223372036854775808
REVERSE_DATE=$(($TWO_63 - ${MACHINE_DATE/.})) # overflows in 2230

S3_KEY="s3://${BUCKET}/apps/${APP_NAME}/coredumps/${INSTANCE_ID}/${REVERSE_DATE}_${HUMAN_DATE}"
echo "Uploading to: ${S3_KEY}"

echo "Uploading metadata"
EXE_HASH=$(sha256sum /proc/$PID/exe | cut -d' ' -f1)
(
    cat <<EOF
Date: $HUMAN_DATE
PID: $PID
Signal: $SIGNAL
Command: ${EXECUTABLE//!//}
SHA256(Executable): $EXE_HASH
EOF

  lsof -p $PID -Pn
) | aws s3 cp - "${S3_KEY}/meta"

echo "Uploading tarball"
#XXX: which of these files duplicate data already present in the core dump?
#XXX: also, if any of these files are useful, should we get the same from each
# running thread
tar -C /proc/$PID -c \
  cgroup \
  cmdline \
  cpuset \
  comm \
  environ \
  gid_map \
  io \
  maps \
  mountinfo \
  sched \
  schedstat \
  smaps \
  stack \
  stat \
  status \
  uid_map \
  wchan \
| aws s3 cp - "${S3_KEY}/proc.tar"

echo "Uploading core file"
echo "You may see 'Aborted (Core Dumped)' printed to the terminal before completion"
# If you can't use zstd, consider lz4 instead: it has compression throughput
# suitable for streaming uploads.
zstd --adapt | aws s3 cp - "${S3_KEY}/core.zst"
