stack - es - official document - Pagination

Posted by lazytiger on Wed, 09 Mar 2022 05:45:23 +0100

There is no perfect program in the world, but we are not discouraged, because writing programs is a continuous pursuit of perfection.
- Hou's workshop

Paging search results

  • By default, searches return the first 10 matching hits. To page through a larger set of results, you can use the search API's from and size parameters. The from parameter defines the number of hits to skip and defaults to 0. The size parameter is the maximum number of hits to return. Together, these two parameters define a page of results (a client-side sketch follows the request below).
GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
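As a quick illustration of how from and size combine into pages: for a 1-based page number n, from is (n - 1) * size. Below is a minimal sketch using the Python elasticsearch client; the endpoint and the fetch_page helper are assumptions for illustration, not part of the official example.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def fetch_page(page: int, page_size: int = 20):
    """Fetch one page of hits; page numbers are 1-based."""
    return es.search(
        from_=(page - 1) * page_size,  # number of hits to skip
        size=page_size,                # maximum hits to return
        query={"match": {"user.id": "kimchy"}},
    )

first_page = fetch_page(1)["hits"]["hits"]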
  • Avoid using from and size to page too deeply or request too many results at once. Search requests usually span multiple shards. Each shard must load its requested hits and the hits for any previous pages into memory. For deep pages or large result sets, these operations can significantly increase memory and CPU usage, resulting in degraded performance or node failures.
  • By default, you cannot use from and size to page through more than 10,000 hits. This limit is a safeguard set by the index.max_result_window index setting. If you need to page through more than 10,000 hits, use the search_after parameter instead.

Warning: Elasticsearch uses Lucene's internal doc IDs as tiebreakers. These internal doc IDs can be completely different across replicas of the same data. When paging search hits, you might occasionally see that documents with the same sort values are not ordered consistently.

Search after

  • You can use the search_after parameter to retrieve the next page of hits using a set of sort values from the previous page.
  • Using search_after requires multiple search requests with the same query and sort values. If a refresh occurs between these requests, the order of your results may change, causing inconsistent results across pages. To prevent this, you can create a point in time (PIT) to preserve the current index state over your searches.
POST /my-index-000001/_pit?keep_alive=1m
  • The API returns a PIT ID.
{
  "id": "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="
}
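From the Python client, the same call might look like the sketch below; the endpoint is an assumption, and open_point_in_time is the client method wrapping the _pit endpoint.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Open a point in time on the index, kept alive for one minute.
pit = es.open_point_in_time(index="my-index-000001", keep_alive="1m")
pit_id = pit["id"]  # pass this as pit.id in subsequent search requests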
  • To get the first page of results, submit a search request with a sort argument. If using a PIT, specify the PIT ID in the pit.id parameter and omit the target data stream or index from the request path.

Important: All PIT search requests add an implicit sort tiebreaker field called _shard_doc, which can also be provided explicitly. If you cannot use a PIT, we recommend that you include a tiebreaker field in your sort. This tiebreaker field should contain a unique value for each document. If you don't include a tiebreaker field, your paged results could miss or duplicate hits.
Note: search_after requests have an optimization that makes them faster when the sort order is _shard_doc and total hits are not tracked. If you want to iterate over all documents regardless of the order, this is the most efficient option.
Note: If the sort field is a date in some target data streams or indices but a date_nanos field in other targets, use the numeric_type parameter to convert the values to a single resolution, and the format parameter to specify a date format for the sort field. Otherwise, Elasticsearch won't interpret the search_after parameter correctly in each request.

GET /_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id":  "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
    "keep_alive": "1m"
  },
  "sort": [ 
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type" : "date_nanos" }}
  ]
}
  • The search response includes an array of sort values for each hit. If you used a PIT, a tiebreaker is included as the last sort value for each hit. This tiebreaker, called _shard_doc, is automatically added to every search request that uses a PIT. The _shard_doc value is the combination of the shard index within the PIT and Lucene's internal doc ID; it is unique per document and constant within a PIT. You can also add the tiebreaker explicitly in the search request to customize it:
GET /_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id":  "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
    "keep_alive": "1m"
  },
  "sort": [ 
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos"}},
    {"_shard_doc": "desc"}
  ]
}
{
  "pit_id" : "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
  "took" : 17,
  "timed_out" : false,
  "_shards" : ...,
  "hits" : {
    "total" : ...,
    "max_score" : null,
    "hits" : [
      ...
      {
        "_index" : "my-index-000001",
        "_id" : "FaslK3QBySSL_rrj9zM5",
        "_score" : null,
        "_source" : ...,
        "sort" : [                                
          "2021-05-20T05:30:04.832Z",
          4294967298                              
        ]
      }
    ]
  }
}
  • To get the next page of results, rerun the previous search using the sort values of the last hit (including the tiebreaker) as the search_after argument. If using a PIT, use the latest PIT ID in the pit.id parameter. The search's query and sort arguments must remain unchanged. If provided, the from argument must be 0 (the default) or -1.
GET /_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id":  "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
    "keep_alive": "1m"
  },
  "sort": [
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos"}}
  ],
  "search_after": [                                
    "2021-05-20T05:30:04.832Z",
    4294967298
  ],
  "track_total_hits": false                        
}
  • You can repeat this process to get additional pages of results. If using a PIT, you can extend its retention period using each search request's keep_alive parameter.
  • When you're finished, you should delete your PIT. A complete end-to-end sketch follows the DELETE request below.
DELETE /_pit
{
    "id" : "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="
}
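Tying these steps together, here is a minimal sketch of the whole search_after flow with the Python client: open a PIT, page until the hits run out, then close the PIT. The endpoint is an assumption; the index, query, and sort are carried over from the examples above.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Open a PIT to freeze the index state while paging.
pit_id = es.open_point_in_time(index="my-index-000001", keep_alive="1m")["id"]

search_after = None
try:
    while True:
        # Omit search_after entirely on the first page.
        extra = {"search_after": search_after} if search_after else {}
        resp = es.search(
            size=10000,
            query={"match": {"user.id": "elkbee"}},
            pit={"id": pit_id, "keep_alive": "1m"},   # extends the PIT's retention
            sort=[{"@timestamp": {"order": "asc"}}],  # _shard_doc tiebreaker is implicit
            track_total_hits=False,                   # faster when totals are not needed
            **extra,
        )
        hits = resp["hits"]["hits"]
        if not hits:
            break                          # no more pages
        for hit in hits:
            ...                            # process each hit here
        search_after = hits[-1]["sort"]    # last hit's sort values seed the next page
        pit_id = resp["pit_id"]            # always reuse the most recent PIT ID
finally:
    es.close_point_in_time(id=pit_id)      # delete the PIT when finished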

Scroll through search results

Important: We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).

  • While a search request returns a single "page" of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as a cursor in a traditional database.
  • Scrolling is not intended for real-time user requests, but rather for processing large amounts of data, for example to reindex the contents of one data stream or index into a new data stream or index with a different configuration.
  • Some of the officially supported clients provide helpers for scrolled searches and reindexing.
  • To use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the "search context" alive (see Keep search context active), e.g. ?scroll=1m.
POST /my-index-000001/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
  • The result of the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.
POST /_search/scroll                                                               
{
  "scroll" : "1m",                                                                 
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}
  • The size parameter allows you to configure the maximum number of hits returned with each batch of results. Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty. A client-side loop sketch follows the notes below.

Important: The initial search request and each subsequent scroll request return a _scroll_id. While the _scroll_id may change between requests, it doesn't always change; in any case, only the most recently received _scroll_id should be used.
Note: If the request specifies aggregations, only the initial search response will contain the aggregation results.
Note: Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

GET /_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}
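The client-side loop mentioned above might look like this sketch with the Python client; the endpoint, index, and query are assumptions carried over from the earlier examples.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# The initial search opens the scroll context for one minute.
resp = es.search(
    index="my-index-000001",
    scroll="1m",
    size=100,
    query={"match": {"message": "foo"}},
)
scroll_id = resp["_scroll_id"]

try:
    while resp["hits"]["hits"]:            # stop when a batch comes back empty
        for hit in resp["hits"]["hits"]:
            ...                            # process each hit here
        # Each scroll call returns the next batch and renews the timeout.
        resp = es.scroll(scroll_id=scroll_id, scroll="1m")
        scroll_id = resp["_scroll_id"]     # always use the most recent ID
finally:
    es.clear_scroll(scroll_id=scroll_id)   # free the search context early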

Keep search context active

  • A scroll returns all documents that matched the search at the time of the initial search request. It ignores any subsequent changes to those documents. The scroll_id identifies a search context that keeps track of everything Elasticsearch needs to return the correct documents. The search context is created by the initial request and kept alive by subsequent requests.
  • The scroll parameter (passed to the search request and to each scroll request) tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m, see time units) does not need to be long enough to process all the data; it just needs to be long enough to process the previous batch of results. Each scroll request (with the scroll parameter) sets a new expiry time. If a scroll request doesn't pass in the scroll parameter, the search context will be freed as part of that scroll request.
  • Normally, the background merge process optimizes the index by merging together smaller segments to create new, bigger segments. Once the smaller segments are no longer needed, they are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted, since they are still in use.

Note: Keeping older segments alive means that more disk space and file handles are needed. Ensure that you have configured your nodes to have ample free file handles. See File descriptors.

  • Additionally, if a segment contains deleted or updated documents, the search context must keep track of whether each document in the segment was live at the time of the initial search request. Make sure your nodes have sufficient heap space if you have many open scrolls on an index that is subject to ongoing deletes or updates.

Note: To prevent issues caused by having too many scrolls open, users are not allowed to open more scrolls than a certain limit. By default, the maximum number of open scroll contexts is 500. This limit can be updated with the search.max_open_scroll_context cluster setting.

  • You can use the node stats API to see how many search contexts are open:
GET /_nodes/stats/indices/search
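The relevant counter in the response is open_contexts under each node's indices.search stats. A small sketch reading it with the Python client (the endpoint is an assumption):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Equivalent to GET /_nodes/stats/indices/search
stats = es.nodes.stats(metric="indices", index_metric="search")
for node_id, node in stats["nodes"].items():
    print(node_id, node["indices"]["search"]["open_contexts"])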

Clear scroll

  • Search contexts are deleted automatically when the scroll timeout is exceeded. However, keeping scrolls open has a cost, as discussed in the previous section, so scrolls should be explicitly cleared with the clear scroll API as soon as they are no longer needed:
DELETE /_search/scroll
{
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
  • Multiple scroll IDs can be passed as an array:
DELETE /_search/scroll
{
  "scroll_id" : [
    "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==",
    "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB"
  ]
}
  • All search contexts can be cleared with the _all parameter:
DELETE /_search/scroll/_all
  • The scroll_id can also be passed as a query string parameter or in the request body. Multiple scroll IDs can be passed as comma-separated values:
DELETE /_search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==,DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB

Sliced scroll

  • When paging through a large number of documents, it can be helpful to split the search into multiple slices that can be consumed independently:
GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "id": 0,                      
    "max": 2                      
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
  • The result from the first request returns documents belonging to the first slice (id: 0), and the result from the second request returns documents belonging to the second slice. Since the maximum number of slices is set to 2, the union of the results of the two requests is equivalent to the results of a scroll query without slicing. By default the splitting is done first on the shards, then locally on each shard using the _id field. The local splitting follows the formula slice(doc) = floorMod(hashCode(doc._id), max).
  • Each scroll is independent and can be processed in parallel, like any scroll request.
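A sketch of that parallel consumption with the Python client and a thread pool; the endpoint, index, and query are assumptions, as is the availability of slice as a search body argument in your client version.

from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint
MAX_SLICES = 2

def consume_slice(slice_id: int) -> int:
    """Scroll through one slice and return the number of hits seen."""
    resp = es.search(
        index="my-index-000001",
        scroll="1m",
        slice={"id": slice_id, "max": MAX_SLICES},
        query={"match": {"message": "foo"}},
    )
    seen = 0
    scroll_id = resp["_scroll_id"]
    try:
        while resp["hits"]["hits"]:
            seen += len(resp["hits"]["hits"])
            resp = es.scroll(scroll_id=scroll_id, scroll="1m")
            scroll_id = resp["_scroll_id"]
    finally:
        es.clear_scroll(scroll_id=scroll_id)
    return seen

# Each slice runs its own independent scroll; the union of all
# slices equals the unsliced result set.
with ThreadPoolExecutor(max_workers=MAX_SLICES) as pool:
    totals = list(pool.map(consume_slice, range(MAX_SLICES)))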

Note: If the number of slices is bigger than the number of shards, the slice filter is very slow on the first calls: it has a complexity of O(N) and a memory cost of N bits per slice, where N is the total number of documents in the shard. After a few calls, the filter should be cached and subsequent calls should be faster, but you should limit the number of sliced queries you perform in parallel to avoid a memory explosion.

  • The point in time API supports a more efficient splitting strategy and is not affected by this problem. When possible, it is recommended to use a sliced point-in-time search instead of a sliced scroll.
  • Another way to avoid this cost is to use the doc_values of another field for the slicing. The field must have the following properties:
    • The field is numeric.
    • doc_values is enabled on this field
    • Each document should contain a value. If the specified field of the document has more than one value, the first value is used.
    • The value of each document should be set once when the document is created and not updated. This ensures deterministic results for each slice.
    • The cardinality of the field should be high. This ensures that each slice gets roughly the same number of documents.
GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "field": "@timestamp",
    "id": 0,
    "max": 10
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
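The same request from the Python client, under the earlier assumptions (field-based slicing on the @timestamp doc_values, per the properties listed above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Field-based slicing: split on @timestamp doc_values instead of _id.
resp = es.search(
    index="my-index-000001",
    scroll="1m",
    slice={"field": "@timestamp", "id": 0, "max": 10},  # slice 0 of 10
    query={"match": {"message": "foo"}},
)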
  • For append-only, time-based indices, the timestamp field can be used safely.


Topics: ElasticSearch