[Elasticsearch] Full text queries query_string and other string queries

Posted by Inkeye on Mon, 07 Mar 2022 23:47:04 +0100

1. General

Reprinted, with modifications, from: https://zhuanlan.zhihu.com/p/143957734

I wrote this up because I ran into the following query:

GET /xxx-000001/_search
{
  "query": {
    "bool": {
      "filter": {
        "query_string": {
          "query": """srcAddress:192*"""
        }
      }
    }
  }
}

Can a query like this actually return results? As I remembered it, queries were always written out level by level. I ran it against ES and it really did return results, so I set out to learn how this kind of query works.

PS: the end of the article has a summary of all the full text queries.

2. multi_match query - multi-field version of match

# 1. Query "content" and "content.ik_smart_analyzer" at the same time; returns document 3
GET /tehero_index/_doc/_search
{
  "query": {
    "multi_match": {
      "query": "system",
      "fields": [
        "content",
        "content.ik_smart_analyzer"
      ]
    }
  }
}

# 2. Query all fields at the same time to get all documents
GET /tehero_index/_doc/_search
{
  "query": {
    "multi_match": {
      "query": "system",
      "fields": [
        "content",
        "content.ik_smart_analyzer",
        "content.ik_max_analyzer"
      ]
    }
  }
}

Note that the relationship between the multiple fields is OR, which is equivalent to the MySQL condition [where field1 = "search term" or field2 = "search term" or field3 = "search term"].

field^number: boosts this field (the weight affects the relevance score). For now it is enough to know that this option exists. Relevance scoring is an important and difficult topic that will be explained systematically later.

GET /tehero_index/_doc/_search
{
  "query": {
    "multi_match": {
      "query": "system",
      "fields": [
        "content",
        "content.ik_smart_analyzer^3",
        "content.ik_max_analyzer"
      ]
    }
  }
}
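
As a rough illustration of what the ^3 boost does, here is a minimal Python sketch. It assumes the default multi_match type (best_fields), where the document score is the highest per-field score, and it assumes a boost simply multiplies that field's score. The raw scores are made up, and `best_fields_score` is a hypothetical helper, not an Elasticsearch API.

```python
def best_fields_score(raw_scores, boosts):
    """Return the highest boosted per-field score (tie_breaker assumed to be 0)."""
    return max(score * boosts.get(field, 1.0) for field, score in raw_scores.items())

# Hypothetical per-field scores of one document for the search term.
raw = {
    "content": 0.8,
    "content.ik_smart_analyzer": 0.5,
    "content.ik_max_analyzer": 0.6,
}

unboosted = best_fields_score(raw, {})
boosted = best_fields_score(raw, {"content.ik_smart_analyzer": 3.0})
print(unboosted, boosted)
```

Without the boost the "content" field decides the score. With ^3 the "content.ik_smart_analyzer" field wins, so documents matching that field move up in the ranking.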

2.1 SQL statement corresponding to the multi_match query

GET /tehero_index/_doc/_search
{
  "query": {
    "multi_match": {
      "query": "Systematics",
      "fields": [
        "content.ik_smart_analyzer",
        "content.ik_max_analyzer"
      ]
    }
  }
}

DSL Execution Analysis:

  1. The search keyword "systematics" is segmented differently depending on the analyzer of each searched field: the "content.ik_smart_analyzer" field (field1 for short) yields one token [systematics]; the "content.ik_max_analyzer" field (field2 for short) yields three tokens [systematics, systematics, systematics].
  2. The tokens of the search term are looked up in the PostingList of the corresponding field, which is equivalent to the SQL statements: [select id from field1 PostingList where Token = "systematics"] [select id from field2 PostingList where Token in ("systematics", "systematics", "systematics")];
  3. Finally, the two retrieved posting lists are merged to get the matching documents.
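
The three steps above can be sketched in a few lines of Python. This is only a toy model of an inverted index: the `POSTINGS` data, the token names and the helper functions are all invented for illustration and say nothing about how Elasticsearch stores its data.

```python
# Toy inverted index: field -> token -> list of document ids (invented data).
POSTINGS = {
    "field1": {"alpha": [3], "beta": [1, 2]},
    "field2": {"alpha": [3], "al": [1, 3, 4], "pha": [2]},
}

def lookup(field, tokens):
    """Step 2: union of the posting lists of `tokens` in one field (Token IN (...))."""
    ids = set()
    for token in tokens:
        ids.update(POSTINGS[field].get(token, []))
    return ids

def multi_match(tokens_by_field):
    """Step 3: merge the per-field results (OR across fields)."""
    merged = set()
    for field, tokens in tokens_by_field.items():
        merged |= lookup(field, tokens)
    return sorted(merged)

# Step 1 is assumed to have happened already: each field has its own token list.
result = multi_match({"field1": ["alpha"], "field2": ["alpha", "al", "pha"]})
print(result)
```

The merge here is a plain set union because the fields are combined with OR. Scoring is left out on purpose.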

2, common terms query

This syntax is worth knowing, but it is mainly useful for English; for Chinese it is of little practical use. (PS: the following is translated from the official documentation.)

The query divides the search terms into two groups: more important (low-frequency) terms and less important (high-frequency) terms, such as stop words. First, it searches for documents that match the more important terms. These terms appear in fewer documents and have a greater impact on relevance. It then runs a second query for the less important terms, which appear frequently and have little impact on relevance. That second query only scores documents in the result set of the first query, rather than calculating the relevance score of every matching document. In this way, high-frequency terms can improve the relevance calculation without the cost of poor performance. If the query consists only of high-frequency terms, a single query is executed as an AND (conjunction) query; in other words, all terms are required.

# Words with a document frequency greater than 0.1% (such as "this" and "is") are treated as common terms.
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "low_freq_operator": "and"
            }
        }
    }
}

Roughly equivalent to:

GET /_search
{
    "query": {
        "bool": {
            "must": [
            { "term": { "body": "nelly"}},
            { "term": { "body": "elephant"}},
            { "term": { "body": "cartoon"}}
            ],
            "should": [
            { "term": { "body": "the"}},
            { "term": { "body": "as"}},
            { "term": { "body": "a"}}
            ]
        }
    }
}
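
To make the split concrete, here is a minimal Python sketch of how terms could be divided by cutoff_frequency and turned into the bool query above. It assumes "low_freq_operator": "and", the document counts are invented, and both helper functions are hypothetical; the real query does this inside Elasticsearch.

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff_frequency=0.001):
    """Split terms into (low_freq, high_freq) by the share of documents containing them."""
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        (high if ratio > cutoff_frequency else low).append(term)
    return low, high

def to_bool_query(field, low, high):
    """Low-frequency terms are required; high-frequency terms only add to the score."""
    if not low:
        # Only high-frequency terms: all of them become required.
        return {"bool": {"must": [{"term": {field: t}} for t in high]}}
    return {
        "bool": {
            "must": [{"term": {field: t}} for t in low],
            "should": [{"term": {field: t}} for t in high],
        }
    }

# Invented document frequencies for an index of 10,000 documents.
doc_freq = {"nelly": 3, "the": 9000, "elephant": 7, "as": 5000, "a": 8000, "cartoon": 9}
terms = "nelly the elephant as a cartoon".split()
low, high = split_by_frequency(terms, doc_freq, total_docs=10_000)
query = to_bool_query("body", low, high)
print(query)
```

With these counts, the cutoff of 0.001 corresponds to 10 documents, so "nelly", "elephant" and "cartoon" end up in must and "the", "as" and "a" in should.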

Summary: the purpose of the common terms query is to improve the accuracy of search results while preserving retrieval performance. (High-frequency stop words such as "the" and "a" can still be retrieved.)

# A quick look at how stop words such as "the", "a" and "as" are analyzed
GET /_analyze
{
  "text": ["the a as"],
  "analyzer": "ik_max_word"
}
# Result: empty. The ik_max_word analyzer filters these stop words out; retrieving such words is the case the common terms query is meant for.
{
  "tokens": []
}

3, query_string query

It lets us specify AND | OR | NOT conditions in a single query string and, like the multi_match query, supports multi-field search.

# 1. Retrieve documents containing both tokens [systematics, es]; the result is empty
GET /tehero_index/_doc/_search
{
    "query": {
        "query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "Systematics AND es"
        }
    }
}
# 2. Retrieve documents containing either of the tokens [systematics, es]; returns documents 1, 2 and 4
GET /tehero_index/_doc/_search
{
    "query": {
        "query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "Systematics OR es"
        }
    }
}

With the groundwork above, the query_string query is easy to understand. Statement 1 is equivalent to the SQL condition [where Token = "systematics" and Token = "es"]. Note: 1. The operators [AND | OR | NOT] must be written in all capitals; 2. Each search term is still segmented by the analyzer of the corresponding field, and a single search term behaves like a match query.

GET /tehero_index/_doc/_search
{
    "query": {
        "query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "System programming OR es"
        }
    }
}

For example, in the query above, the single search term "System programming" is still segmented into two tokens by the "ik_smart" analyzer, and a match query is run on it, so the DSL above retrieves all documents.

4, simple_query_string query

Similar to query_string, but it ignores invalid syntax, never throws an exception, and discards the invalid parts of the query.
simple_query_string supports the following special characters:

+ means AND, equivalent to AND in query_string
| means OR, equivalent to OR in query_string
- negates a single token, equivalent to NOT in query_string
"" runs a match_phrase query on the search term
* at the end of a term means a prefix query

The DSL statements below illustrate each of them:

4.1 + means AND, equivalent to AND in query_string

# 1. Retrieves document 4
GET /tehero_index/_doc/_search
{
    "query": {
        "simple_query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "Systematics + interval"
        }
    }
}

4.2 | means OR, equivalent to OR in query_string

# 2. Retrieves documents 1, 2 and 4
GET /tehero_index/_doc/_search
{
    "query": {
        "simple_query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "Systematics | interval"
        }
    }
}

4.3 - negates a single token, equivalent to NOT in query_string

# 3. Retrieves documents 1 and 2
GET /tehero_index/_doc/_search
{
    "query": {
        "simple_query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "Systematics -interval",
            "default_operator": "and"
        }
    }
}

Note the parameter "default_operator": "and". The default value of this parameter is or.
The SQL condition corresponding to the DSL above is: [where Token = "systematics" and Token <> "interval"]

4.4 "" runs a match_phrase query on the search term

# 4. Retrieves document 2
GET /tehero_index/_doc/_search
{
    "query": {
        "simple_query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "\"Systematic programming concerns\""
        }
    }
}
# 5. Retrieve all documents
GET /tehero_index/_doc/_search
{
    "query": {
        "simple_query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "Systematic programming concerns"
        }
    }
}

Analysis: "query": "\"Systematic programming concerns\"" runs a match_phrase query on the search term.

4.5 * at the end of a term means a prefix query, similar to a match_phrase_prefix query

# 6. Retrieves document 3
GET /tehero_index/_doc/_search
{
    "query": {
        "simple_query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "system"
        }
    }
}
# 7. Retrieves all documents; similar to a match_phrase_prefix query
GET /tehero_index/_doc/_search
{
    "query": {
        "simple_query_string" : {
            "fields" : ["content.ik_smart_analyzer"],
            "query" : "system*"
        }
    }
}

5, Summary
By now, we have covered all the query statements under Full text queries:

1) match query: the standard query for full-text search, including fuzzy matching and phrase or proximity queries. Important parameter: operator (or/and), which controls the Boolean relationship between tokens.
2) match_phrase query: similar to the match query, but used to match exact phrases or words. Important parameter: slop, the allowed distance between tokens.
3) match_phrase_prefix query: similar to the match_phrase query, but performs a wildcard (prefix) search on the last token against the inverted index. Important parameter: max_expansions, which limits the number of expanded matches; the default value is 50 and the minimum value is 1.
4) multi_match query: the multi-field version of the match query. It is widely used in practice and reduces the complexity of DSL statements. It also has several query types, which TeHero will explain later.
5) common terms query: of little significance for Chinese retrieval.
6) query_string query and simple_query_string query: effectively a combination of the query statements above. They are flexible to use and make DSL easy to write. However, TeHero believes these two queries have an obvious drawback that resembles SQL injection: if a user includes operators such as OR or * in the search terms, the user may get data that should not have been returned. Use with caution!
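
One way to reduce the injection risk mentioned in point 6 is to neutralize operators in user input before it reaches query_string. The Python sketch below is an assumption-laden example, not an official recipe: `escape_query_string` is a hypothetical helper, the character list follows the reserved characters named in the query_string documentation, and each character is escaped individually in the same spirit as Lucene's own escape helper.

```python
import re

# "<" and ">" cannot be escaped in query_string, so they are removed instead.
_UNESCAPABLE = re.compile(r"[<>]")
# Reserved characters, each escaped with a backslash.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')
# Upper-case boolean operators written as bare words.
_OPERATORS = re.compile(r"\b(AND|OR|NOT)\b")

def escape_query_string(user_input):
    """Return user_input with query_string operators neutralized."""
    text = _UNESCAPABLE.sub("", user_input)
    text = _RESERVED.sub(r"\\\1", text)
    # Lower-case operators are treated as ordinary words rather than operators.
    return _OPERATORS.sub(lambda m: m.group(1).lower(), text)

print(escape_query_string("srcAddress:192*"))
print(escape_query_string("es OR *"))
```

The escaped string can then be placed in the "query" value of a query_string request. Whether to escape, strip, or reject such input is a design choice; escaping keeps the user's characters searchable as literals.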