Deep dive into Elasticsearch: batch document operations

Posted by bandit on Sun, 12 Dec 2021 11:12:07 +0100

1. Batch query

Elasticsearch is already fast, but it can be even faster. Combining multiple requests into one avoids the network latency and overhead of processing each request separately. If you need to retrieve many documents from Elasticsearch, the multi-get (mget) API lets you put all of those retrieval requests into a single request, which retrieves the documents much faster than requesting them one by one.

If you want to fetch 100 documents one by one, you have to send 100 network requests, which is expensive. If you fetch the same 100 documents in one batch, you only need a single network request, so the network overhead drops roughly 100-fold.

# One by one query
GET /test_index/_doc/1
GET /test_index/_doc/2

The mget API expects a docs array as a parameter; each element contains the metadata of a document to retrieve, namely _index, _type, and _id. If you only want one or more specific fields, you can use the _source parameter to specify their names.

1. Create three test documents:

PUT /test_index/_doc/1
{
  "test_field":"test1"
}

PUT /test_index/_doc/2
{
  "test_field":"test2"
}

PUT /test_index/_doc/3
{
  "test_field":"test3"
}

2. Batch query:

GET /_mget
{
  "docs":[
    {
      "_index":"test_index",
      "_type":"_doc",
      "_id":"1"
    },
    {
      "_index":"test_index",
      "_type":"_doc",
      "_id":"2"
    }
  ]
}
#! Deprecation: [types removal] Specifying types in multi get requests is deprecated.
{
  "docs" : [
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test field" : "test1"
      }
    },
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 1,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test field" : "test2"
      }
    }
  ]
}

3. Types have been removed in newer versions of Elasticsearch, and specifying a type in an mget request is deprecated, so send the request again without it:

GET /_mget
{
  "docs":[
    {
      "_index":"test_index",
      "_id":"1"
    },
      {
      "_index":"test_index",
      "_id":"2"
    }
 ]
}

4. If all of the documents you are querying live in the same index, you can use the following shorter syntax:

GET /test_index/_doc/_mget
{
   "ids": [1, 2]
}
#! Deprecation: [types removal] Specifying types in multi get requests is deprecated.
{
  "docs" : [
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test field" : "test1"
      }
    },
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 1,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test field" : "test2"
      }
    }
  ]
}

Again, because types have been removed in newer versions of Elasticsearch, you can omit the type:

GET /test_index/_mget
{
   "ids": [1, 2]
}
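
As mentioned above, each entry in the docs array also accepts a _source parameter when you only need some of the fields, or none at all. A small illustrative sketch against the documents created earlier (response omitted):

GET /test_index/_mget
{
  "docs": [
    { "_id": "1", "_source": ["test_field"] },
    { "_id": "2", "_source": false }
  ]
}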

mget matters: generally speaking, whenever you need to fetch multiple documents in one go, use the batch API so that the number of network round trips is reduced as much as possible. This can improve performance several times over, or even by an order of magnitude.

2. Batch create / update / delete documents

In the same way that mget allows us to retrieve multiple documents at once, the bulk API allows multiple create, index, update, or delete requests in a single step.

The bulk request body has a slightly different format from other requests, as shown below:

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

The action/metadata line specifies what to do with which document.

action must be one of the following options:

  • create

    Create the document only if it does not already exist; the operation fails if it does.

  • index

    Create a new document or replace an existing document.

  • update

    Partially update a document.

  • delete

    Delete a document.

The metadata should specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted. The delete action is the only one that is not followed by a request body line.

Note: the bulk API has strict JSON syntax requirements. Each JSON object must sit on a single line (it cannot be spread across multiple lines), consecutive JSON objects must be separated by a newline character, and the body must end with a final newline.
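
If you send a bulk request from outside Kibana, for example with curl, you also have to set the Content-Type header yourself and make sure the newlines are preserved. A minimal sketch, assuming the action and document lines are saved in a hypothetical local file named requests:

# "requests" is assumed to contain the action/metadata and document lines, one per line, ending with a final newline.
# --data-binary is used instead of -d because -d strips newlines.
curl -s -H "Content-Type: application/x-ndjson" -X POST "localhost:9200/_bulk" --data-binary "@requests"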

2.1 Deleting documents

1. Create test data:

PUT /test_index/_doc/1
{
  "test_field":"test1"
}

PUT /test_index/_doc/2
{
  "test_field":"test2"
}

PUT /test_index/_doc/3
{
  "test_field":"test3"
}

2. Delete a single document:

POST /_bulk
{ "delete":{"_index":"test_index","_type":"_doc","_id":1}}

3. Delete two documents in one batch, i.e. make two delete requests in a single step:

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"delete":{"_index":"test_index","_type":"_doc","_id":3}}

2.2 Forcing document creation

Make multiple delete and create requests in a single step:

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"delete":{"_index":"test_index","_type":"_doc","_id":3}}
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test_field":"test1"}
{"create":{"_index":"test_index","_type":"_doc","_id":3}}
{"test_field":"test3"}

2.3 Indexing documents

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test_field":"test1"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test_field":"test2"}

2.4 Fully replacing documents

Issuing an index request for an _id that already exists fully replaces the stored document:

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test_field":"test1"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test_field":"test2"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test_field":"test22"}

2.5 Partially updating documents

Putting all of the operations together, a complete bulk request takes the following form:

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test_field":"test1"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test_field":"test2"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test_field":"test22"}
{"update":{"_index":"test_index","_type":"_doc","_id":"1"}}
{"doc":{"test_field":"bulk test2"}}

The Elasticsearch response contains an items array with the result of each sub-request, in the same order as the requests were listed:

#! Deprecation: [types removal] Specifying types in bulk requests is deprecated.
{
  "took" : 77,
  "errors" : false,
  "items" : [
    {
      "delete" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_version" : 9,
        "result" : "deleted",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 20,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "create" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_version" : 10,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 21,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_version" : 7,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 22,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "index" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_version" : 8,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 23,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "update" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_version" : 11,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 24,
        "_primary_term" : 1,
        "status" : 200
      }
    }
  ]
}

Each sub-request is executed independently, so the failure of one sub-request does not affect the success of the others. If any sub-request fails, the top-level errors flag is set to true and the error details are reported under the corresponding item:

POST /_bulk
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test_field":"test1"}
{"index":{"_index":"test_index","_type":"_doc","_id":1}}
{"test_field":"test2"}

{
  "took" : 7,
  "errors" : true,
  "items" : [
    {
      "create" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "status" : 409,
        "error" : {
          "type" : "version_conflict_engine_exception",
          "reason" : "[1]: version conflict, document already exists (current version [11])",
          "index_uuid" : "lokVYUtTTJG2TwWqCmyxzw",
          "shard" : "0",
          "index" : "test_index"
        }
      }
    },
    {
      "index" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_version" : 12,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 26,
        "_primary_term" : 1,
        "status" : 200
      }
    }
  ]
}

In the response we see that the create for document 1 failed because the document already exists, while the subsequent index request, which also operates on document 1, succeeded. This also means that a bulk request is not atomic and cannot be used for transaction control: each sub-request is processed separately, so the success or failure of one request does not affect the others.
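
When a bulk request contains many items, scanning the whole items array for failures by hand is tedious. The general-purpose filter_path response-filtering parameter can be used to return only the error-related parts of the response; a minimal sketch (repeating the failing create from above):

POST /_bulk?filter_path=errors,items.*.error
{"create":{"_index":"test_index","_id":1}}
{"test_field":"test1"}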

2.6 Do not specify the index repeatedly

Perhaps you are bulk indexing log data into the same index and type, but specifying the same metadata for every document is wasteful. Instead, just like the mget API, the bulk request accepts a default /_index or /_index/_type in the URL:

POST /test_index/_doc/_bulk
{"index":{}}
{"test_field":"test1"}

Because types have been removed in newer versions of Elasticsearch, you can omit the type:

POST /test_index/_bulk
{"index":{}}
{"test_field":"test2"}

The entire bulk request has to be loaded into memory by the node that receives it, so the larger the request, the less memory is left for other requests. There is an optimal size for bulk requests; beyond it, performance stops improving and may even degrade. That optimal size is not a fixed value: it depends entirely on your hardware, on the size and complexity of your documents, and on the current indexing and search load.

Fortunately, the sweet spot is easy to find: bulk-index typical documents while gradually increasing the batch size. When performance starts to drop, your batch size is too large. A good starting point is 1,000 to 5,000 documents per batch; if your documents are very large, use fewer documents per batch.

It is also useful to keep an eye on the physical size of your bulk requests: a thousand 1 KB documents is completely different from a thousand 1 MB documents. A good bulk request is roughly 5-15 MB in size.

Topics: Big Data ElasticSearch