A deep dive into Elasticsearch: how to control the precision of full-text search results

Posted by Zaxnyd on Sat, 18 Dec 2021 21:07:08 +0100

1. Data preparation

1. Prepare the data by building the blog post documents:

POST /forum/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01","tag" : ["java", "hadoop"] ,"view_cnt" : 30 }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02","tag" : ["java"],"view_cnt" : 50 }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01","tag" : ["hadoop"],"view_cnt" : 100 }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02","tag" : ["java", "elasticsearch"],"view_cnt" : 80 }
{ "index": { "_id": 5 }}
{ "articleID" : "QQPX-R-3956-#aD9", "userID" : 2, "hidden": true, "postDate": "2017-01-03","tag" : ["java", "spark"],"view_cnt" : 90 }

2. Add a title field to the post data:

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }

2. Match query [match]

The match query is the core query. Whatever field you need to search, match should be your first choice. It is a high-level full-text query, which means that it can handle both full-text fields and exact-value fields.

1. Word query

Use the match query to search for a single word in the full-text field.

Example: search for blogs with java in the title

GET /forum/_search
{
  "query": {
    "match": {
      "title": "java"
    }
  }
}

In the search results, document 2 is the most relevant because its title field is the shortest, so java makes up a larger share of its content.

{
    "max_score" : 0.38845783,
    "hits" : [
      {
        "_id" : "2",
        "_score" : 0.38845783,
        "_source" : {
          "title" : "this is java blog"
        }
      },
      {
        "_id" : "1",
        "_score" : 0.32969955,
        "_source" : {
          "title" : "this is java and elasticsearch blog"
        }
      },
      {
        "_id" : "4",
        "_score" : 0.32969955,
        "_source" : {
          "title" : "this is java, elasticsearch, hadoop blog"
        }
      }
    ]
}

2. Multi word query

The match query performs full-text retrieval, which makes it different from the term query: instead of looking for an exact value, it analyzes the query text and searches for the resulting terms. Of course, if the field being searched is a not_analyzed type (a keyword field in current versions), the match query is equivalent to a term query.

Example: search for blogs with java or elasticsearch in the title

GET /forum/_search
{
  "query": {
    "match": {
      "title": "java elasticsearch"
    }
  }
}

{
    "max_score" : 0.6593991,
    "hits" : [
      {
        "_id" : "1",
        "_score" : 0.6593991,
        "_source" : {
          "title" : "this is java and elasticsearch blog"
        }
      },
      {
        "_id" : "4",
        "_score" : 0.6593991,
        "_source" : {
          "title" : "this is java, elasticsearch, hadoop blog"
        }
      },
      {
        "_id" : "2",
        "_score" : 0.38845783,
        "_source" : {
          "title" : "this is java blog"
        }
      },
      {
        "_id" : "3",
        "_score" : 0.38845783,
        "_source" : {
          "title" : "this is elasticsearch blog"
        }
      }
    ]
}

Result analysis:

Documents 1 and 4 both contain java and elasticsearch, so they rank first. Their scores are identical because both titles have the same length (six terms). Document 2 contains only java and document 3 contains only elasticsearch, so they rank lower, again with identical scores because both titles have four terms.

Because the match query has to look for two words (["java","elasticsearch"]), it actually executes two term queries internally and combines their results into the final output. To do this, it wraps the two term queries in a bool query.

The example above tells us something important: a document matches as long as its title field contains at least one of the specified terms. The more terms that match, the more relevant the document is.
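
The rewrite described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Elasticsearch code: analyze is a simplified stand-in for the standard analyzer, and rewrite_match_or and matching_ids are hypothetical helper names. It shows the bool/should form that a multi-word match query is equivalent to, and checks which of the sample titles it would match.

```python
import re

# Titles from the sample data, keyed by document _id.
TITLES = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

def analyze(text):
    """Toy stand-in for the standard analyzer: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())

def rewrite_match_or(field, query_text):
    """Express a multi-word match query as a bool query with one should/term clause per word."""
    return {"bool": {"should": [{"term": {field: t}} for t in analyze(query_text)]}}

def matching_ids(bool_query, field="title"):
    """Return the ids of documents whose title contains at least one should term."""
    terms = {clause["term"][field] for clause in bool_query["bool"]["should"]}
    return sorted(i for i, title in TITLES.items() if terms & set(analyze(title)))

rewritten = rewrite_match_or("title", "java elasticsearch")
print(rewritten)
print(matching_ids(rewritten))
```

In this model every title except the spark one matches, which agrees with the four hits returned by the real query.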

3. Improve accuracy [and]

Matching any document that contains any of the query terms can produce a long tail of seemingly irrelevant results. It is a shotgun approach to search. Often we only want documents that contain all of the terms: not java OR elasticsearch, but java AND elasticsearch.

The match query also accepts an operator parameter, which defaults to or. We can change it to and so that every specified term must match, which a plain match query cannot achieve.

Example: search for blogs with both java and elasticsearch in the title

GET /forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch",
        "operator": "and"
      }
    }
  }
}

This matches the following 2 documents:

"title" : "this is java and elasticsearch blog"
"title" : "this is java, elasticsearch, hadoop blog"

This query excludes documents 2 and 3 because each of them contains only one of the two terms.
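
To make the difference between the two operators concrete, here is a minimal Python sketch. It is a toy model rather than Elasticsearch itself: analyze is a simplified tokenizer and match is a hypothetical helper that treats or as "any term present" and and as "every term present".

```python
import re

# Titles from the sample data, keyed by document _id.
TITLES = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

def analyze(text):
    """Toy tokenizer: lowercase and split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))

def match(title, query_text, operator="or"):
    """Return True if the title satisfies the query under the given operator."""
    query_terms = analyze(query_text)
    doc_terms = analyze(title)
    if operator == "and":
        return query_terms <= doc_terms   # every query term must be present
    return bool(query_terms & doc_terms)  # any query term is enough

or_hits = [i for i, t in TITLES.items() if match(t, "java elasticsearch")]
and_hits = [i for i, t in TITLES.items() if match(t, "java elasticsearch", "and")]
print("or :", or_hits)
print("and:", and_hits)
```

With or the model keeps documents 1 to 4; with and only documents 1 and 4 remain, as in the result above.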

4. Control accuracy [minimum_should_match]

Choosing between all and any is a little too black and white. If the user supplies 5 query terms and a document contains only 4 of them, what happens? Setting the operator parameter to and would exclude that document.

Sometimes this is exactly what we want, but in most full-text search scenarios we want to include documents that may be relevant while excluding those that are less so. In other words, we need something in between.

The match query supports the minimum_should_match parameter, which lets us specify how many terms must match for a document to be considered relevant. We can set it to an absolute number, but it is more common to set it to a percentage, because we cannot control how many words users type when searching.

Example: search for blogs containing at least 3 of the 4 keywords java, elasticsearch, spark, hadoop

GET /forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"
      }
    }
  }
}

It matches only the following document:

"title" : "this is java, elasticsearch, hadoop blog"
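
The arithmetic behind "75%" can be sketched as follows. This is a simplified model that assumes only positive integers and positive percentages, with the percentage result rounded down; required_matches and count_hits are hypothetical helper names, not part of any Elasticsearch API.

```python
import re

# Titles from the sample data, keyed by document _id.
TITLES = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

def analyze(text):
    """Toy tokenizer: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())

def required_matches(minimum_should_match, term_count):
    """Turn a positive integer or a positive percentage string into a required term count."""
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return term_count * percent // 100  # percentage results are rounded down
    return int(minimum_should_match)

def count_hits(query_text, minimum_should_match):
    """Return ids of titles containing at least the required number of query terms."""
    query_terms = set(analyze(query_text))
    needed = required_matches(minimum_should_match, len(query_terms))
    return [i for i, title in TITLES.items()
            if len(query_terms & set(analyze(title))) >= needed]

print(required_matches("75%", 4))
print(count_hits("java elasticsearch spark hadoop", "75%"))
```

Here 75% of 4 terms gives 3 required terms, and only document 4 contains that many. With 3 query terms the same percentage would round down from 2.25 to 2, which is why a percentage adapts to queries of different lengths.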

To fully understand how match handles multi-word queries, we need to look at how the bool query combines multiple query clauses.

5. A question about the combined filter

A bool filter consists of three parts:

must: all clauses must match; equivalent to AND.

must_not: none of the clauses may match; equivalent to NOT.

should: at least one clause must match; equivalent to OR.

I covered the combined filter in my previous post. Here is a question for you: in a combined filter, does at least one of the conditions in the should block have to match? Can a document match none of them? If so, when?

By default, none of the should clauses is required to match, with one exception: when there is no must clause, at least one should clause must match.

In other words, if the must condition is satisfied, the should conditions do not have to match. If there is no must clause, at least one should condition must match.

Example 1:

GET /forum/_search
{
  "query": {
    "constant_score": {
      "filter": {
          "bool": {
            "must":     { "term": { "title": "java" }},
            "must_not": { "term": { "title": "spark"  }},
            "should": [
                        { "term": { "title": "hadoop" }},
                        { "term": { "title": "elasticsearch"}}
            ]
          }
        }
    }
  }
}

It matches the following three documents:

"title" : "this is java and elasticsearch blog"
// The next document matches none of the should conditions, but it is still returned
"title" : "this is java blog"
"title" : "this is java, elasticsearch, hadoop blog"

This shows that when a must clause is present, the should conditions do not have to match.

Example 2:

GET /forum/_search
{
  "query": {
    "constant_score": {
      "filter": {
          "bool": {
            "must_not": { "term": { "title": "spark"  }},
            "should": [
                        { "term": { "title": "hadoop" }},
                        { "term": { "title": "elasticsearch"}}
            ]
          }
        }
    }
  }
}

It will match the following 3 documents:

"title" : "this is java and elasticsearch blog"
"title" : "this is elasticsearch blog"
"title" : "this is java, elasticsearch, hadoop blog"

When there is no must clause, at least one of the should conditions must match.
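
The rule illustrated by the two examples can be modelled in a short Python sketch. It is a toy evaluator over the sample titles, not Elasticsearch code: bool_filter is a hypothetical helper, each clause is reduced to a single term, and the default for minimum_should_match is an assumption that mirrors the rule stated above (the model ignores filter clauses).

```python
import re

# Titles from the sample data, keyed by document _id.
TITLES = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

def analyze(text):
    """Toy tokenizer: lowercase and split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))

def bool_filter(must=(), must_not=(), should=(), minimum_should_match=None):
    """Return ids of titles that pass a bool of single-term clauses."""
    if minimum_should_match is None:
        # Assumed default: should is optional when a must clause is present,
        # otherwise at least one should clause has to match.
        minimum_should_match = 1 if should and not must else 0
    hits = []
    for doc_id, title in TITLES.items():
        terms = analyze(title)
        if not all(t in terms for t in must):
            continue
        if any(t in terms for t in must_not):
            continue
        if sum(t in terms for t in should) >= minimum_should_match:
            hits.append(doc_id)
    return hits

example_1 = bool_filter(must=["java"], must_not=["spark"], should=["hadoop", "elasticsearch"])
example_2 = bool_filter(must_not=["spark"], should=["hadoop", "elasticsearch"])
print(example_1)
print(example_2)
```

Example 1 keeps document 2 even though it matches no should term, while Example 2 drops it because, without a must clause, one should term becomes mandatory.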

6. Combined query

In the combined filter section, we discussed how to use the bool filter to combine multiple filters with and, or, and not logic. The bool query does the same for queries, with one important difference.

A filter makes a binary decision: should this document appear in the results or not? A query is more subtle. It decides not only whether a document should be included, but also how relevant that document is.

Like the filter, the bool query accepts multiple query clauses under the must, must_not, and should parameters.

GET /forum/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "java" }},
      "must_not": { "match": { "title": "spark"  }},
      "should": [
                  { "match": { "title": "hadoop" }},
                  { "match": { "title": "elasticsearch"   }}
      ]
    }
  }
}

The query above returns every document whose title field contains the term java but does not contain spark. So far, this works much like the bool filter.

The difference lies in the two should clauses: a document does not have to contain hadoop or elasticsearch, but if it does, it is considered more relevant.

GET /forum/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "java" }},
      "must_not": { "match": { "title": "spark"  }},
      "should": [
                  { "match": { "title": "hadoop" }},
                  { "match": { "title": "elasticsearch"   }}
      ]
    }
  }
}

{
    "max_score" : 2.2356422,
    "hits" : [
      {
        "_id" : "4",
        "_score" : 2.2356422,
        "_source" : {
          "title" : "this is java, elasticsearch, hadoop blog"
        }
      },
      {
        "_id" : "1",
        "_score" : 0.97797304,
        "_source" : {
          "title" : "this is java and elasticsearch blog"
        }
      },
      {
        "_id" : "2",
        "_score" : 0.57843524,
        "_source" : {
          "title" : "this is java blog"
        }
      }
    ]
}

The should clauses affect the relevance score. Once the must clause is satisfied, the should clauses do not have to match, but the more of them that match, the higher the document's relevance score:

Ranked first: document 4, which contains java plus both should keywords, hadoop and elasticsearch
Ranked second: document 1, which contains java plus elasticsearch from should
Ranked third: document 2, which contains java but none of the should keywords

How is the score calculated?

The bool query calculates the relevance _score of each document by adding together the _score of every matching must and should clause. Older Elasticsearch releases additionally scaled this sum by a coordination factor (the number of matching clauses divided by the total number of must and should clauses); the scores listed above are plain sums.

The must_not clauses do not affect the score; their only job is to exclude documents that should not be returned.
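
As a rough check of the scores listed above, the following Python sketch adds up per-clause BM25 scores with no further division. It rests on assumptions that the post does not state: the default BM25 parameters (k1 = 1.2, b = 0.75), all five documents in a single shard, and a simplified tokenizer in place of the standard analyzer. The names term_score and bool_score are hypothetical.

```python
import math
import re

# Titles from the sample data, keyed by document _id.
TITLES = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

K1, B = 1.2, 0.75  # assumed BM25 defaults

def analyze(text):
    """Toy tokenizer: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())

DOCS = {doc_id: analyze(title) for doc_id, title in TITLES.items()}
AVG_LEN = sum(len(terms) for terms in DOCS.values()) / len(DOCS)

def term_score(term, doc_id):
    """BM25 score of one term clause against one document (0.0 if the term is absent)."""
    terms = DOCS[doc_id]
    freq = terms.count(term)
    if freq == 0:
        return 0.0
    n = sum(term in t for t in DOCS.values())  # number of documents containing the term
    idf = math.log(1 + (len(DOCS) - n + 0.5) / (n + 0.5))
    norm = K1 * (1 - B + B * len(terms) / AVG_LEN)
    return (K1 + 1) * idf * freq / (freq + norm)

def bool_score(doc_id, must, should):
    """Sum the scores of the must and should clauses; non-matching clauses add 0."""
    return sum(term_score(t, doc_id) for t in must + should)

for doc_id in (4, 1, 2):
    print(doc_id, round(bool_score(doc_id, ["java"], ["hadoop", "elasticsearch"]), 4))
```

Under these assumptions the sums come out close to 2.2356, 0.9780, and 0.5784 for documents 4, 1, and 2, in line with the _score values in the response. Document 4 gains the most because hadoop appears in only one title and therefore carries a higher weight.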

7. Control accuracy [minimum_should_match]

All must clauses must match and all must_not clauses must not match, but how many should clauses have to match? By default, none of the should clauses is required to match, with one exception: when there is no must clause, at least one should clause must match.

Just as we can control the precision of the match query, we can use the minimum_should_match parameter to control how many should clauses need to match. It can be either an absolute number or a percentage:

GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java"}},
        { "match": { "title": "elasticsearch"}},
        { "match": { "title": "hadoop"}},
        { "match": { "title": "spark"}}
      ],
      "minimum_should_match": 3
    }
  }
}

It will match the following 1 document:

"title" : "this is java, elasticsearch, hadoop blog"

If a document matched all four should clauses, it would be considered more relevant than a document matching only three.
