[es] Three query approaches: from/size, search_after and scroll

Posted by biopv on Thu, 27 Jan 2022 10:08:15 +0100

1, The difference between the three

  1. from size:

    1. Deep pagination becomes a problem when the page offset is deep or the size is very large. As a self-protection mechanism, es caps from + size at max_result_window, which defaults to 10000; a query beyond that limit returns an error
    2. The implementation is similar to LIMIT in MySQL: to fetch the 10001st document, es must collect the first 10001 hits and then discard all but the requested page. (Poor performance at depth, but simple to implement and fine for small result sets)
  2. search after

    1. The drawback of search_after is that it cannot jump to an arbitrary page; it can only page forward one page at a time (newly indexed data is visible in real time). At least one unique, non-duplicated field must be included in the sort (typically _id or a time field)
    2. When using search_after, from must be set to 0 or -1
  3. scroll

    1. Efficient rolling query. The first request keeps a snapshot and a cursor (scroll_id) in memory that records where the query left off; each subsequent request consumes from that cursor. (Good performance, but not real-time; generally used for bulk data export or index rebuilding)
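For orientation, the three approaches can be sketched at the request-DSL level (Kibana Dev Tools style; the index name and field name are taken from the examples below, and the search_after value is a placeholder for the previous page's last sort value):

```json
// from/size: the second page of 1000
GET /audit2/_search
{ "from": 1000, "size": 1000, "sort": [{ "operationtime": "desc" }] }

// search_after: from stays 0/-1; resume after the last hit's sort values
GET /audit2/_search
{ "size": 1000, "sort": [{ "operationtime": "desc" }], "search_after": [1643249295000] }

// scroll: keep a snapshot cursor alive for 1 minute
GET /audit2/_search?scroll=1m
{ "size": 1000, "sort": [{ "operationtime": "desc" }] }
```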

2, Code test class

  1. from size

package com.example.es.test;

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortBuilders;
import org.elasticsearch.search.sort.SortOrder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * @author 
 * @Description  es from/size usage
 * @date 2022/01/26 10:04
 */
public class ESTest_from_size {

    public static final Logger logger = LoggerFactory.getLogger(ESTest_from_size.class);

    public static void main(String[] args) throws Exception{
        long startTime = System.currentTimeMillis();
        // Create ES client
        RestHighLevelClient esClient = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http"))
        );
        // 1. Create searchRequest
        SearchRequest searchRequest = new SearchRequest("audit2");
        // 2. Specify query criteria
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder(); // without trackTotalHits(true), the reported total hit count is capped at 10000
        // The first page in the UI corresponds to from = 0 in es
        sourceBuilder.from(0);
        // Number of documents per page
        sourceBuilder.size(1000);
        // Set unique sort value positioning
        sourceBuilder.sort(SortBuilders.fieldSort("operationtime").order(SortOrder.DESC));
        //Add the sourceBuilder object to the search request
        searchRequest.source(sourceBuilder);
        // Send request
        SearchResponse searchResponse = esClient.search(searchRequest, RequestOptions.DEFAULT);
        SearchHit[] hits = searchResponse.getHits().getHits();
        List<Map<String, Object>> result = new ArrayList<>();
        if (hits != null && hits.length > 0) {
            for (SearchHit hit : hits) {
                // Get required data
                Map<String, Object> sourceAsMap = hit.getSourceAsMap();
                result.add(sourceAsMap);
            }
        }
        logger.info("The number of data queried is:{}", result.size());
        // Close client
        esClient.close();
        logger.info("Running time: " + (System.currentTimeMillis() - startTime) + "ms");
    }
}

Operation results:

10:08:40.466 [main] INFO com.example.es.test.ESTest_searchAfter - The number of data queried is 1000
10:08:40.474 [main] INFO com.example.es.test.ESTest_searchAfter - Running time: 1506ms

Phenomenon:

If a from/size query reaches past 10000 documents, an error is reported
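The 10000 cap is the index setting max_result_window, which can be raised per index when deep from/size pages are genuinely needed, at the cost of extra heap and CPU for each deep query. A sketch against the example index:

```json
PUT /audit2/_settings
{
  "index": { "max_result_window": 200000 }
}
```

Raising it only moves the wall; for anything genuinely deep, search_after or scroll below are the intended tools.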

  2. search after

package com.example.es.test;

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortBuilders;
import org.elasticsearch.search.sort.SortOrder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * @author
 * @Description  es search_after usage
 * @date 2022/01/11 14:04
 */
public class ESTest_searchAfter {

    public static final Logger logger = LoggerFactory.getLogger(ESTest_searchAfter.class);

    public static void main(String[] args) throws Exception{
        long startTime = System.currentTimeMillis();
        // Create ES client
        RestHighLevelClient esClient = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http"))
        );
        // 1. Create searchRequest
        SearchRequest searchRequest = new SearchRequest("audit2");
        // 2. Specify query criteria
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder().trackTotalHits(true); // trackTotalHits(true) must be set, otherwise the reported total is capped at 10000
        //Set the number of data queried per page
        sourceBuilder.size(1000);
        // Set unique sort value positioning
        sourceBuilder.sort(SortBuilders.fieldSort("operationtime").order(SortOrder.DESC)); // the sort values double as the search_after cursor
        //Add the sourceBuilder object to the search request
        searchRequest.source(sourceBuilder);
        // Send request
        SearchResponse searchResponse = esClient.search(searchRequest, RequestOptions.DEFAULT);
        SearchHit[] hits1 = searchResponse.getHits().getHits();
        List<Map<String, Object>> result = new ArrayList<>();
        if (hits1 != null && hits1.length > 0) {
            do {
                for (SearchHit hit : hits1) {
                    // Get required data
                    Map<String, Object> sourceAsMap = hit.getSourceAsMap();
                    result.add(sourceAsMap);
                }
                // Take the sort values of the last hit on this page; the next request resumes after them
                Object[] lastNum = hits1[hits1.length - 1].getSortValues();
                // Set the search_after cursor to those values
                sourceBuilder.searchAfter(lastNum);
                searchRequest.source(sourceBuilder);
                // Fetch the next page and refresh the hits array before iterating again
                searchResponse = esClient.search(searchRequest, RequestOptions.DEFAULT);
                hits1 = searchResponse.getHits().getHits();
            } while (hits1.length != 0);
        }
        }
        logger.info("The number of data queried is:{}", result.size());
        // Close client
        esClient.close();
        logger.info("Running time: " + (System.currentTimeMillis() - startTime) + "ms");
    }

}

Operation results:

16:11:44.057 [main] INFO com.example.es.test.ESTest_searchAfter - The number of data queried is 64000
16:11:44.061 [main] INFO com.example.es.test.ESTest_searchAfter - Running time: 20979ms

Phenomenon: the audit2 index holds 69873 documents, and the console logs progress for every 1000 hits, yet only 64000 records come back: 5873 documents are lost. As with from/size, an error is also reported if size exceeds 10000.
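Two common causes of this kind of loss are worth checking: collecting a stale hits array inside the paging loop (so one page is duplicated while later pages never reach the result list), and sorting on a field whose values are not unique, since search_after resumes strictly after the last sort value and can skip documents that share the same operationtime at a page boundary. The usual mitigation for the second cause is a unique tiebreaker in the sort; a sketch of the request body (placeholder field names from the example above):

```json
{
  "size": 1000,
  "sort": [
    { "operationtime": { "order": "desc" } },
    { "_id": { "order": "desc" } }
  ]
}
```

Note that sorting on _id relies on fielddata and is discouraged in newer versions; a dedicated unique keyword field, or a point-in-time with _shard_doc, is the cleaner choice.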

My own question: since search_after cannot jump to an arbitrary page and can only be read page by page, doesn't the back end still end up returning all of the data when the front end calls this interface? If instead the front end issued a query each time the user scrolls down, the back end would only return as many pages as are actually scrolled through, which should save query time. As written, search_after fetches everything in one call: internally it pages, but what is ultimately returned is the full data set. I am unsure how this should be wired up with the front end.
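The integration the question points at is a cursor contract: each back-end call returns one page plus the last hit's sort values, and the front end echoes those values back on the next scroll, so no single call ever materializes the whole result set. A minimal sketch of that contract in plain Java, with an in-memory int array standing in for the index and Page/fetchPage as hypothetical names (no es client involved):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cursor-based paging contract for a search_after-style back end:
// the server returns one page per call plus an opaque cursor; the client sends
// the cursor back only when the user scrolls further.
public class CursorPagingSketch {

    // One page of results plus the cursor for the next call (null = no more pages)
    static final class Page {
        final List<Integer> items;
        final Integer nextCursor;
        Page(List<Integer> items, Integer nextCursor) {
            this.items = items;
            this.nextCursor = nextCursor;
        }
    }

    // Stand-in for the es query: "size items whose sort value is strictly below
    // the cursor", i.e. a descending sort resumed with search_after
    static Page fetchPage(int[] sortedDesc, Integer cursor, int size) {
        List<Integer> items = new ArrayList<>();
        for (int v : sortedDesc) {
            if (cursor != null && v >= cursor) continue; // skip up to and including the cursor
            if (items.size() == size) break;
            items.add(v);
        }
        // A full page means there may be more; its last sort value becomes the next cursor
        Integer next = items.size() == size ? items.get(items.size() - 1) : null;
        return new Page(items, next);
    }

    public static void main(String[] args) {
        int[] data = {90, 80, 70, 60, 50, 40, 30};         // pretend index, sorted descending
        Page page1 = fetchPage(data, null, 3);             // first call: no cursor
        System.out.println(page1.items);                   // [90, 80, 70]
        Page page2 = fetchPage(data, page1.nextCursor, 3); // user scrolled: echo cursor back
        System.out.println(page2.items);                   // [60, 50, 40]
    }
}
```

A real controller would serialize page.nextCursor (the Object[] from getSortValues()) to the client and feed it into sourceBuilder.searchAfter() on the next request.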

  3. scroll

package com.example.es.test;

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.*;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortBuilders;
import org.elasticsearch.search.sort.SortOrder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;


/**
 * @author 
 * @Description  scroll query implemented in Java
 * @date 2021/12/08 14:09
 */
public class ESTest_Scroll {

    public static final Logger logger = LoggerFactory.getLogger(ESTest_Scroll.class);

    public static void main(String[] args) throws Exception{
        long startTime = System.currentTimeMillis();
        // Create ES client
        RestHighLevelClient esClient = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http"))
        );
        // 1. Create searchRequest
        SearchRequest searchRequest = new SearchRequest("audit2");
        // 2. Specify scroll information
        searchRequest.scroll(TimeValue.timeValueMinutes(1L));
        // 3. Specify query criteria
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.size(1000);
        searchSourceBuilder.sort(SortBuilders.fieldSort("operationtime").order(SortOrder.DESC));//Multi condition query
        searchRequest.source(searchSourceBuilder);
        //4. Get the returned result scrollId, source
        SearchResponse searchResponse = esClient.search(searchRequest, RequestOptions.DEFAULT); //Initialize the search context by sending an initial search request
        String scrollId = searchResponse.getScrollId();
        SearchHit[] searchHits = searchResponse.getHits().getHits();
        List<Map<String, Object>> result = new ArrayList<>();
        for (SearchHit hit: searchHits) {
            result.add(hit.getSourceAsMap());
        }
        // As with the REST API, two kinds of request are needed: the initial search
        // returns the first page together with a scrollId,
        // and that scrollId is then used to pull each following page
        while (true) {
            //5. Loop: create a SearchScrollRequest carrying the last returned scroll identifier and the scroll keep-alive
            // The scrollId is used to query the next page
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            //6. Specifies the lifetime of the scrollId
            scrollRequest.scroll(TimeValue.timeValueMinutes(1L));
            //7. Execute the query to get the returned results
            SearchResponse scrollResp = esClient.scroll(scrollRequest, RequestOptions.DEFAULT);
            //8. Judge whether the data is queried and output
            SearchHit[] hits = scrollResp.getHits().getHits();
            //Cycle output next page
            if (hits != null && hits.length > 0) {
                for (SearchHit hit : hits) {
                    result.add(hit.getSourceAsMap());
                }
            } else {
                //9. Judge that no data is found and exit the cycle
                break;
            }
        }
        //After the export finishes, clear the scroll context so the server can release the snapshot and cursor
        //10. Create ClearScrollRequest
        ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
        //11. Specify scrollId
        clearScrollRequest.addScrollId(scrollId);
        //12. Delete scrollId
        ClearScrollResponse clearScrollResponse = esClient.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
        //13. Output results
        boolean succeeded = clearScrollResponse.isSucceeded();
        logger.info("delete scrollId: {}", succeeded);
        logger.info("Total number of queries:{}", result.size());
        // Close client
        esClient.close();
        logger.info("Running time: " + (System.currentTimeMillis() - startTime) + "ms");
    }

}

Operation results:

16:20:54.794 [main] INFO com.example.es.test.ESTest_Scroll - delete scrollId: true
16:20:54.795 [main] INFO com.example.es.test.ESTest_Scroll - Total number of queries: 69873
16:20:54.797 [main] INFO com.example.es.test.ESTest_Scroll - Running time: 5716ms

Phenomenon:

The audit2 index contains 69873 documents in total, and all 69873 records are returned; not one is lost. As before, an error is reported if size exceeds 10000. It is curious that search_after loses data while scroll does not miss a single record.

Topics: Java ElasticSearch Spring Boot Back-end