Logging system EFK follow-up: monitoring online logs

Posted by karldesign on Mon, 23 Sep 2019 05:26:46 +0200

In the previous article, Log system EFK follow-up: monitoring and alarming, we basically completed the construction and testing of the EFK monitoring and alarm system. But as soon as we switched the log source to the online logs, big problems appeared: fluentd's CPU usage stayed high, and the alarm messages in kafka grew explosively, hundreds of thousands of them in an instant. We tried raising the online log level to INFO, but the problem was not alleviated, and the DingTalk alarm messages kept flooding the screen.

fluent-bit does not support starting from the end of a file, and our logs are kept for 30 days (compressed after the first 7 days), so the initial read volume is really large; we had also seen fluent-bit lose its offsets before. So we decided to use fluentd to collect and forward the logs instead.

fluentd collector

First, upgrade fluentd to version 1.6 and modify the corresponding configuration files.

1. Dockerfile


FROM fluent/fluentd:v1.6-debian-1
#FROM fluent/fluentd:v1.2
# Add the elasticsearch, kafka and rewrite-tag-filter plug-ins
USER root
RUN  fluent-gem install fluent-plugin-elasticsearch
RUN  fluent-gem install fluent-plugin-kafka
RUN  fluent-gem install fluent-plugin-rewrite-tag-filter
CMD exec fluentd -c /fluentd/etc/${FLUENTD_CONF} -p /fluentd/plugins $FLUENTD_OPT
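
With the Dockerfile updated, rebuild the image and restart the collector so the new plug-ins take effect. A minimal sketch; the image tag and the compose service name (fluentd) are assumptions, adjust them to your setup:

# image tag and service name are placeholders
docker build -t fluentd:v1.6-dapeng .
docker-compose -f dc-all.yml up -d fluentd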

2. fluentd configuration file
<system>
  log_level error
</system>

# tail collects and parses logs
<source>
  @type tail
  path  "#{ENV['tailInputPath']}"
  exclude_path  "#{ENV['tailExcludePath']}"
  pos_file "#{ENV['tailPosFile']}"
  tag  fb.dapeng
  path_key  tailKey
  #refresh_interval 5s
  read_from_head  true
  multiline_flush_interval 10s
  <parse>
    @type multiline
    format_firstline /^\d{2}-\d{2} \d{2}:\d{2}:\d{2} \d{3}/
    format1 /^(?<logtime>^\d{2}-\d{2} \d{2}:\d{2}:\d{2} \d{3}) (?<threadPool>[^ ]+|Check idle connection Thread) (?<level>[^ ]+) \[(?<sessionTid>\w*)\] - (?<message>.*)/
  </parse>
</source>
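
# For reference, a log line matching format1 above looks like this
# (the thread pool name and sessionTid values are illustrative):
#   09-23 05:26:46 123 dapeng-biz-pool-1 ERROR [0123456789abcdef] - something went wrong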

<filter fb.dapeng>
  @type record_transformer
  enable_ruby true   # needed for the ruby expression in ${...} below
  <record>
    hostname production
    # /var/logs/<service>/<file>.log -> element 3 of the split path is the service name
    tag ${record["tailKey"].split('/')[3]}
  </record>
  remove_keys tailKey
</filter>

# drop events whose level did not parse as a single word
<filter fb.dapeng>
  @type grep
  <regexp>
    key level
    pattern /^\w+$/
  </regexp>
</filter>

# only error events carrying a valid 16-hex-digit sessionTid go to the alarm topic
<filter error.fb.dapeng>
  @type grep
  <regexp>
    key sessionTid
    pattern /^[0-9a-f]{16}$/
  </regexp>
</filter>
<match fb.dapeng>
  @type copy
  # first copy: every event goes to the efk topic
  <store>
    @type kafka2
    brokers "#{ENV['kafkaHost']}:9092"
    topic_key efk
    default_topic efk
    <buffer efk>
      flush_interval 5s
    </buffer>
    <format>
      @type json
    </format>
    compression_codec gzip
    required_acks -1
    max_send_retries 3
  </store>
  # second copy: ERROR events are re-tagged so the error pipeline picks them up
  <store>
    @type rewrite_tag_filter
    <rule>
      key     level
      pattern /^ERROR$/
      tag     error.fb.dapeng
    </rule>
  </store>
</match>

<match error.fb.dapeng>
  @type kafka2
  brokers "#{ENV['kafkaHost']}:9092"
  topic_key efk_error
  default_topic efk_error
  <buffer efk_error>
    flush_interval 5s
  </buffer>
  <format>
    @type json
  </format>
  compression_codec gzip
  required_acks -1
  max_send_retries 3
</match>
# fluentd consumer side: read the messages back from kafka and forward them to elasticsearch
<source>
  @type kafka_group
  brokers "#{ENV['kafkaHost']}:9092"
  consumer_group efk_consumer
  topics efk
  format json
  start_from_beginning false
  max_wait_time 5
  max_bytes 1500000
</source>

# with the kafka_group input, each event's tag is the kafka topic it was read from, hence <match efk> below
<match efk>
    @type elasticsearch
    hosts "#{ENV['esHost']}:9200"   # elasticsearch listens on 9200 (9092 is kafka's port)
    index_name dapeng_log_index
    type_name  dapeng_log
    #content_type application/x-ndjson
    buffer_type file
    buffer_path /tmp/buffer_file
    buffer_chunk_limit 10m
    buffer_queue_limit 512
    flush_mode interval
    flush_interval 5s
    request_timeout 5s
    flush_thread_count 2
    reload_on_failure true
    resurrect_after 30s
    reconnect_on_error true
    with_transporter_log true
    logstash_format true
    logstash_prefix dapeng_log_index
    template_name dapeng_log_index
    template_file  /fluentd/etc/template.json
    num_threads 2
    utc_index  false
</match>
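
Once both sides are running, two quick sanity checks are possible; a sketch assuming kafka's bundled CLI tools and placeholder host addresses:

# is the efk_consumer group keeping up with the efk topic?
kafka-consumer-groups.sh --bootstrap-server <kafka server IP>:9092 --describe --group efk_consumer

# are the daily indices being created? (logstash_format appends the date to the prefix)
curl -s 'http://<es server IP>:9200/_cat/indices/dapeng_log_index-*?v'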

3. dc-all.yml environment variable configuration

      - tailInputPath=/var/logs/*/*.%Y-%m-%d.log
      - tailExcludePath=["/var/logs/*/fluent*.log","/var/logs/*/console.log","/var/logs/*/gc*.log"]
      - tailPosFile=/fluentd/etc/logs.pos
      - kafkaHost=<kafka server IP>
      - esHost=<es server IP>
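
A quick way to confirm that the variables actually reached the container (the compose service name fluentd is an assumption):

docker-compose -f dc-all.yml exec fluentd env | grep -E 'tail|kafkaHost|esHost'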

Since fluentd 1.4, "#{ENV['env_key']}" is supported in the configuration file to read the value of env_key from the environment.

Here we choose fluentd's tail input with an strftime-formatted path: *.%Y-%m-%d.log only matches the current day's log files.
This way read_from_head true can be turned on without worrying about reading too much history. It also avoids the pitfall of tailing from the end: when a new log file appears and the watch list has not yet refreshed (refresh_interval), reading from the tail would lose the lines written at the beginning of that file.
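
For example, on 23 September 2019 the pattern only picks up that day's files (service and file names are illustrative):

# /var/logs/*/*.%Y-%m-%d.log expands on that day to match e.g.
/var/logs/order-service/order-service.2019-09-23.log
/var/logs/user-service/user-service.2019-09-23.log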

The pitfall of rsync file synchronization

After switching to fluentd for log collection, the kafka message volume dropped, but not enough: the alarm messages kept flooding the screen. Observation showed duplicate data in the alarms, meaning fluentd was collecting duplicate logs. We sync the online logs to the host where EFK runs via rsync, so the suspicion fell there: does rsync rewrite the whole file on every sync? Even though the file name never changes, fluentd might read the file from the beginning again after each sync, producing the duplicates.

To verify, we ran tail -f on one of the log files: the running listener saw no change after an rsync pass, but after restarting it the file content had indeed changed. At the same time, watching the message counts in kafka showed that after every sync the offset of efk_error grew by the same amount, which explains the repeated DingTalk alarms. So it was basically confirmed: rsync rewrites the file on each sync, causing fluentd to collect it again from the beginning.
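
The offset growth can be observed with kafka's bundled GetOffsetShell tool, run before and after an rsync pass (the broker address is a placeholder):

kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <kafka server IP>:9092 --topic efk_error --time -1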

Searching the rsync documentation, we found the --append parameter, which makes rsync transfer only the newly appended data instead of rewriting the entire file:

rsync -a --append /src/ /dist/
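
Note that --append only transfers data when the destination file is shorter than the source, and it assumes the existing prefix is unchanged, which holds for append-only log files. If the beginning of a file could change, rsync also offers --append-verify, which checks the existing data before appending.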

With that, the problem was solved, as expected.


Topics: Linux rsync kafka JSON ElasticSearch