Elastic search 429, logstash does not update kafka queue status

Posted by !jazz on Sat, 25 May 2019 20:26:52 +0200

Recently, a new problem has arisen. When using filebeat to load historical log files, a large amount of data will be added to the kafka queue. logstash will read a large number of event s from the queue and send them to the ES side. The initial manifestations are:
logstash keeps sending messages to ES, and the data in ES keeps increasing. The amount of log writing is much larger than the original amount of log writing, and there is no sign of stopping.
After checking the consumer checker of kafka, it is found that offset of queue has not been offset. That is, Kafka has not been confirmed by logstash's consumption, believing that the news has not been consumed. We know that in the whole ELK process, logstash's input takes data from Kafka and sends it to ES through filter and output plug-in. Only after ES returns successfully, logstash will update kakfa's offset. If ES returns unsuccessfully, logstash will not update kafka's offset, and will try to resend the message until it succeeds. This problem has caused a serious problem here, that is, the information in kakfka has not been consumed, logstash has been retrying, but in fact the data has been inserted into ES, resulting in the increasing index of ES, system load evarage has been increasing, and the response is slower.

To discover this problem, you need to open the debug log of logstash (add it to the logstash.yml file):

log.level: info

Similar prints will be found in the log:

[ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[INFO ] [logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"**es_rejected_execution_exception**", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@742302a0 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6ae23e0c[Running, pool size = 8, active threads = 8, queued tasks = 206, completed tasks = 5502508]]"})
[ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://elastic:xxxxxx@10.60.33.105:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}

If the REST API of elastic search returns the status code 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elastic search.transport.TransportService), this means that ES can no longer process incoming messages (ES cluster is not able to keep up with the incoming requests,)
At the same time, view the running status of ES cluster through x-pack:

It is found that although the CPU utilization rate is only 60%, the load average of the system has reached 10, which means that the system has been running at 1.2 times the load. (Servers are 8 cores, 2.2G main frequency)
(In multiprocessor systems, the average load is determined by the number of cores. Calculated at 100% load, 1.00 represents a single processor, while 2.00 indicates two dual processors, and 4.00 indicates that the host has four processors.

What does this say? Because CPU usage is not particularly high, but the load of the system is very high, we guess that the main load of the system is on io. Go back and observe logstash's log:

[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-07-10T11:17:02,084][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action
[2017-07-10T11:17:02,085][ERROR][logstash.outputs.elasticsearch] Action

We found that in a lot of time, 429 errors were repeatedly prompted, representing logstash sending bulk reuqest to ES continuously and quickly. Looking at the logstash configuration, we found that the default configuration is:

# How many events to retrieve from inputs before sending to filters+workers#
pipeline.batch.size: 125
## How long to wait before dispatching an undersized batch to filters+workers# Value is in milliseconds.#
pipeline.batch.delay: 5

The unit here is milliseconds, the instance of each logstash, which sends 200 request s per second to the ES cluster, which causes the network io overload of the ES cluster.

So, does that mean that we need to shorten the interval between batches and increase the size of each batch? For example, batch.size = 1500, batch.delay = 500, has prevented 429 problems?
After testing, this is feasible.

But why does the ES cluster report errors in the case of 200 bulk request s? Can't any web server handle 200 requests?
Looking at the configuration document of ES, we found that there was a missing data, vi/etc/security/limits.conf:

For each request, whether bulk request or other, ES has a thread to process the request. Correspondingly, the thread opens a file socket to generate the corresponding segment file. If the number of threads (npro) and the number of files (nofile) that ES can use are not configured, 429 problems may arise.

After modifying the configuration of ES, batch delay is reduced and found to be invalid. The current solution still has to set batch.delay to more than 200 ms.

Topics: ElasticSearch kafka REST network