Logging system EFK follow-up: monitoring online logs

Posted by karldesign on Mon, 23 Sep 2019 05:26:46 +0200

In the previous article, "EFK follow-up of log system: monitor alarm monitoring", we basically finished building and testing the EFK monitoring and alerting system. When we switched the log source to the online (production) logs, however, serious problems appeared: fluentd's CPU usage stayed high, and the alert messages in Kafka grew rapidly, by hundreds of thousands in an instant. We tried raising the online log level to INFO, but the latter problem did not ease, and DingTalk alert messages kept flooding the screen.

Fluent-bit does not support reading from the tail of a file, and our logs are kept for 30 days (those older than 7 days are compressed), so the initial read volume is very large. Fluent-bit had also lost its offsets before, so we decided to use fluentd to collect and forward the logs.

fluentd collector

First, upgrade fluentd to version 1.6 and modify the corresponding configuration file

1.fluentd:v1.6

FROM fluent/fluentd:v1.6-debian-1
#FROM fluent/fluentd:v1.2
# Add the Elasticsearch, Kafka, and rewrite-tag-filter plugins
USER root
RUN  fluent-gem install fluent-plugin-elasticsearch
RUN  fluent-gem install fluent-plugin-kafka
RUN  fluent-gem install fluent-plugin-rewrite-tag-filter
CMD exec fluentd -c /fluentd/etc/${FLUENTD_CONF} -p /fluentd/plugins $FLUENTD_OPT

2.fluentd.conf

<system>
  log_level error
</system>

# tail collects and parses logs
<source>
  @type tail
  path  "#{ENV['tailInputPath']}"
  exclude_path  "#{ENV['tailExcludePath']}"
  pos_file "#{ENV['tailPosFile']}"
  tag  fb.dapeng
  path_key  tailKey
  #refresh_interval 5s
  read_from_head  true
  multiline_flush_interval 10s
  <parse>
    @type multiline
    format_firstline /^\d{2}-\d{2} \d{2}:\d{2}:\d{2} \d{3}/
    format1 /^(?<logtime>^\d{2}-\d{2} \d{2}:\d{2}:\d{2} \d{3}) (?<threadPool>[^ ]+|Check idle connection Thread) (?<level>[^ ]+) \[(?<sessionTid>\w*)\] - (?<message>.*)/
  </parse>
</source>

<filter fb.dapeng>
  @type record_transformer
  enable_ruby
  <record>
    hostname production
    tag ${record["tailKey"].split('/')[3]}
  </record>
  remove_keys tailKey
</filter>
<filter fb.dapeng>
  @type grep
  <regexp>
    key level
    pattern /^\w+$/
  </regexp>
</filter>
<filter error.fb.dapeng>
    @type grep
    <regexp>
        key sessionTid
        pattern /^[0-9a-f]{16}$/
    </regexp>
</filter>
<match fb.dapeng>
  @type copy
  <store>
    @type kafka2
    brokers "#{ENV['kafkaHost']}:9092"
    topic_key efk
    default_topic efk
    <buffer efk>
        flush_interval 5s
    </buffer>
    <format>
        @type json
    </format>
    compression_codec gzip
    required_acks -1
    max_send_retries 3
  </store>
  <store>
    @type rewrite_tag_filter
    <rule>
      key     level
      pattern /^ERROR$/
      tag     error.fb.dapeng
    </rule>
  </store>
</match>

<match error.fb.dapeng>
    @type kafka2
    brokers "#{ENV['kafkaHost']}:9092"
    topic_key efk_error
    default_topic efk_error
    <buffer efk_error>
        flush_interval 5s
    </buffer>
    <format>
        @type json
    </format>
    compression_codec gzip
    required_acks -1
    max_send_retries 3
</match>
<source>
  @type kafka_group
  brokers "#{ENV['kafkaHost']}:9092"
  consumer_group efk_consumer
  topics efk
  format json
  start_from_beginning false
  max_wait_time 5
  max_bytes 1500000
</source>

# kafka_group uses the Kafka topic name as the tag of each event, so the match below is on efk
<match efk>
    @type elasticsearch
    hosts "#{ENV['esHost']}:9200"
    index_name dapeng_log_index
    type_name  dapeng_log
    #content_type application/x-ndjson
    buffer_type file
    buffer_path /tmp/buffer_file
    buffer_chunk_limit 10m
    buffer_queue_limit 512
    flush_mode interval
    flush_interval 5s
    request_timeout 5s
    flush_thread_count 2
    reload_on_failure true
    resurrect_after 30s
    reconnect_on_error true
    with_transporter_log true
    logstash_format true
    logstash_prefix dapeng_log_index
    template_name dapeng_log_index
    template_file  /fluentd/etc/template.json
    num_threads 2
    utc_index  false
</match>
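
To make the parsing and routing rules in fluentd.conf easier to check, here is a minimal Python sketch of what the multiline parser, the record_transformer, the two grep filters, and the rewrite_tag_filter rule do to a single event. It is not how fluentd executes them: the regexes are translated from Ruby to Python syntax, the sample log line, the service name orderService, and the function names parse and route are made up for illustration, and the DOTALL flag assumes that the multiline parser lets "." match newlines.

```python
import re

FLAGS = re.ASCII | re.DOTALL  # assumption: approximates Ruby's \d, \w and multiline "."

# Python translation of format1 (Ruby's (?<name>...) becomes (?P<name>...)).
FORMAT1 = re.compile(
    r"^(?P<logtime>\d{2}-\d{2} \d{2}:\d{2}:\d{2} \d{3}) "
    r"(?P<threadPool>[^ ]+|Check idle connection Thread) "
    r"(?P<level>[^ ]+) \[(?P<sessionTid>\w*)\] - (?P<message>.*)",
    FLAGS,
)

# Hypothetical sample data, not taken from the real logs.
SAMPLE_PATH = "/var/logs/orderService/orderService.2019-09-23.log"
SAMPLE_EVENT = (
    "09-23 05:26:46 123 biz-pool-1 ERROR [ac1f0a3b5d7e9c01] - order failed\n"
    "    at com.example.Foo.bar(Foo.java:42)"
)


def parse(event_text, tail_key):
    """Mimic <parse> plus the record_transformer filter for one event."""
    m = FORMAT1.match(event_text)
    if m is None:
        return None  # the line does not match format1; no record in this sketch
    record = m.groupdict()
    record["hostname"] = "production"
    # "/var/logs/<service>/<file>".split("/") -> ["", "var", "logs", "<service>", ...]
    record["tag"] = tail_key.split("/")[3]
    return record


def route(record):
    """Return the Kafka topics this record would be written to."""
    # grep filter on fb.dapeng: level must consist of word characters only
    if not re.fullmatch(r"\w+", record["level"], re.ASCII):
        return []
    topics = ["efk"]
    # rewrite_tag_filter re-tags ERROR events; the grep filter on
    # error.fb.dapeng then requires a 16-digit lowercase hex sessionTid
    if record["level"] == "ERROR" and re.fullmatch(
        r"[0-9a-f]{16}", record["sessionTid"]
    ):
        topics.append("efk_error")
    return topics


if __name__ == "__main__":
    rec = parse(SAMPLE_EVENT, SAMPLE_PATH)
    print(rec["tag"], rec["level"], route(rec))
```

With the sample event, parse yields the tag orderService and the level ERROR, and route returns both efk and efk_error. An INFO event, or an ERROR event with an empty sessionTid, reaches efk only.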

3.dc-all.yml environment variable configuration

      - tailInputPath=/var/logs/*/*.%Y-%m-%d.log
      - tailExcludePath=["/var/logs/*/fluent*.log","/var/logs/*/console.log","/var/logs/*/gc*.log"]
      - tailPosFile=/fluentd/etc/logs.pos
      - kafkaHost=<Kafka server IP>
      - esHost=<Elasticsearch server IP>

Since fluentd 1.4, "#{ENV['env_key']}" can be used in the configuration file to read the value of the environment variable env_key.

Here we use fluentd's tail input with strftime placeholders in the watched path, so *.%Y-%m-%d.log matches only the current day's logs.
This makes it safe to set read_from_head true and read each file from the beginning without worrying about the initial read volume. It also removes a risk that comes with reading from the tail: when the log rolls over and the watch list has not yet been refreshed (refresh_interval), the lines written at the start of the new file would be lost.
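
The effect of the strftime placeholders can be illustrated with a short Python sketch. It expands the date placeholders for a given day and then globs, which is assumed here to approximate how the tail input builds its watch list. The directory layout, the file names, and the function names watched_files and demo are hypothetical.

```python
import glob
import os
import tempfile
from datetime import date

# Relative form of tailInputPath (/var/logs/*/*.%Y-%m-%d.log)
TAIL_INPUT_PATH = "*/*.%Y-%m-%d.log"


def watched_files(root, today):
    """Expand the strftime placeholders for `today`, then glob under root."""
    pattern = os.path.join(root, today.strftime(TAIL_INPUT_PATH))
    return sorted(glob.glob(pattern))


def demo():
    """Create yesterday's log, today's log, and console.log; list what is watched."""
    with tempfile.TemporaryDirectory() as root:
        service_dir = os.path.join(root, "orderService")  # hypothetical service
        os.makedirs(service_dir)
        for name in (
            "orderService.2019-09-22.log",
            "orderService.2019-09-23.log",
            "console.log",
        ):
            open(os.path.join(service_dir, name), "w").close()
        found = watched_files(root, date(2019, 9, 23))
        return [os.path.basename(p) for p in found]


if __name__ == "__main__":
    print(demo())
```

Only the file dated 2019-09-23 is selected, so reading from the head covers at most one day of logs per service.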

The rsync file synchronization pitfall

After switching to fluentd for log collection, the number of Kafka messages dropped, but not enough, and alert messages still flooded the screen repeatedly. On inspection we found duplicate data in the alerts, meaning that fluentd was collecting the same logs more than once. We sync the logs from the online environment to the environment where EFK runs using rsync, so we suspected that rsync rewrites the whole file on every sync: the file name stays the same, but fluentd may read the file from the beginning after each sync, producing duplicate logs.

First we ran tail -f on a log file and saw no change in the followed file after an rsync run. After stopping tail -f and starting it again, we found that the file had indeed changed. We also watched the message count in Kafka and found that after every sync the offset of efk_error grew by the same amount, which explains why the DingTalk alerts repeat every time. This largely confirmed that rsync causes fluentd to collect the files from the beginning after each sync.
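
The tail -f observation is what one would expect if each sync replaces the file instead of appending to it. The Python sketch below simulates that hypothesis and does not invoke rsync: simulate_sync (a made-up helper) writes a temporary copy and renames it over the original, so the name stays the same while the inode changes. It assumes a POSIX file system. A collector that tracks files by inode would then see a new file and, with read_from_head true, read it from the start.

```python
import os
import tempfile


def simulate_sync(path, content):
    """Hypothetical stand-in for a sync that replaces the file by rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(content)
    os.replace(tmp, path)  # same name, different inode


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.log")
        with open(path, "w") as f:
            f.write("line1\n")

        follower = open(path)  # plays the role of a running `tail -f`
        follower.read()        # consume what is already there
        inode_before = os.stat(path).st_ino

        simulate_sync(path, "line1\nline2\n")
        inode_after = os.stat(path).st_ino

        # The old handle is still bound to the old inode and sees nothing new.
        seen_by_old_handle = follower.read()
        follower.close()

        # Reopening by name picks up the replaced file.
        with open(path) as f:
            seen_after_reopen = f.read()

        return inode_before != inode_after, seen_by_old_handle, seen_after_reopen


if __name__ == "__main__":
    print(demo())
```

The already-open handle reads nothing after the replacement, while a fresh open returns the full new content, matching the behaviour seen with tail -f before and after restarting it.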

By querying rsync documents