Elasticsearch in depth: implementing multi-field queries with the most_fields strategy

Posted by rowanparker on Wed, 22 Dec 2021 20:17:16 +0100

1. Difference between best_fields and most_fields

1. best_fields:

When the search terms express a single concept, such as "brown fox", the words mean more together than they do individually. The more of the query words a document contains in the same field, the better the match, and the document's score is taken from its best-matching field.

2. most_fields:

A common technique for fine-tuning relevance is to index the same data into several fields, each with its own analysis chain. One field can hold the original words without stemming, while another holds stemmed or accent-stripped forms.

The main field may contain stemmed terms, synonyms, and words stripped of their diacritics (accents), and is used to match as many documents as possible. The same text is indexed into other fields to provide more precise matching. These other fields act as signals that raise the relevance score of each matching document: the more fields that match, the better.

2. The most_fields strategy

Full-text search is often described as a battle between recall and precision. The goal is to present the most relevant documents to the user on the first page of results:

  • Recall: returning all relevant documents
  • Precision: not returning irrelevant documents

To improve recall, we widen the search: we return not only documents that exactly match the user's search terms, but also every document we believe is relevant to the query. If a user searches for "quick brown fox", a document containing the words "fast foxes" is a perfectly reasonable result.
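The trade-off between the two measures can be made concrete with a small calculation. The sketch below is only an illustration: the document IDs and the helper name precision_recall are made up for this example and are not part of Elasticsearch.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: documents 1 to 4 are the ones relevant to the query.
relevant_docs = {1, 2, 3, 4}

# Strict matching returns few documents, all of them relevant.
narrow = precision_recall({1, 2}, relevant_docs)

# Loose matching returns every relevant document, plus two irrelevant ones.
broad = precision_recall({1, 2, 3, 4, 5, 6}, relevant_docs)

print("narrow:", narrow)
print("broad: ", broad)
```

Widening the match raises recall at the cost of precision, which is why the extra signal fields described below are needed to push the best documents back to the top.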

A common way to fine-tune full-text relevance is to index the same text in several ways, each of which provides a different relevance signal. The main field holds terms in their broadest-matching form so that it matches as many documents as possible. For example, we can do the following:

  • Use stemming to index words like jumps, jumping, and jumped in their root form, jump. Then even a user who searches for jumped still finds documents containing jumping.
  • Include synonyms such as jump, leap, and hop.
  • Remove diacritics (accents): for example, ésta, está, and esta are all indexed in the accent-free form esta.
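As a rough illustration of the last point, the following sketch folds accented forms into one accent-free form using only the Python standard library. It is a stand-in for what a token filter such as asciifolding does at index time, not the Elasticsearch implementation; the function name fold_accents is made up for this example.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


forms = ["ésta", "está", "esta"]
folded = {fold_accents(word) for word in forms}
print(folded)  # {'esta'}
```

All three spellings collapse to the same term, so a search for any of them can match documents containing the others.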

To set up the multi-field mapping, the first step is to index the field twice: once in stemmed form and once unstemmed. Multi-fields make this possible:

DELETE /blog

# The title field uses the english analyzer, which stems words.
# The title.std field uses the standard analyzer, so it is not stemmed.
PUT /blog
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "properties": {
      "title": {
        "type":     "text",
        "analyzer": "english",
        "fields": {
          "std": {
            "type":     "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

Then index some documents:

PUT /blog/_doc/1
{ "title": "My rabbit jumps" }

PUT /blog/_doc/2
{ "title": "Jumping jack rabbits" }

Here, a simple match query looks for jumping rabbits in the title field:

GET /blog/_search
{
  "query": {
    "match": {
      "title": "jumping rabbits"
    }
  }
}

Because the title field uses the english analyzer, this query looks for documents containing the two stemmed terms jump and rabbit. The title field of both documents contains both terms, so the two documents receive the same score:

"hits" : [
    {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.36464313,
        "_source" : {
            "title" : "My rabbit jumps"
        }
    },
    {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.36464313,
        "_source" : {
            "title" : "Jumping jack rabbits"
        }
    }
]
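To see why both titles match on the stemmed field, and how an unstemmed field treats the same query, the analysis step can be imitated in a few lines of Python. This is only a sketch: toy_stem is a deliberately crude suffix stripper written for these two example documents, not the stemmer used by the english analyzer, and standard_tokens only approximates the standard analyzer.

```python
import re


def standard_tokens(text):
    """Lowercase and split on non-letters (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Toy suffix stripper for this example only; NOT the real english stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def english_tokens(text):
    """Tokenize, then stem each token with the toy stemmer."""
    return [toy_stem(token) for token in standard_tokens(text)]


docs = {1: "My rabbit jumps", 2: "Jumping jack rabbits"}
query = "jumping rabbits"


def matching_terms(analyze):
    """Map each document ID to the query terms it shares under one analyzer."""
    query_terms = set(analyze(query))
    return {doc_id: sorted(query_terms & set(analyze(text)))
            for doc_id, text in docs.items()}


print("title    ", matching_terms(english_tokens))
print("title.std", matching_terms(standard_tokens))
```

With stemming, both documents share jump and rabbit with the query. Without it, only document 2 contains the exact terms jumping and rabbits.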

If we query only the title.std field, only document 2 matches. But if we query both fields and combine their scores with a bool query, both documents match (thanks to the title field) and document 2 scores higher (thanks to the title.std field). Because we want to combine the scores of all matching fields, we use the most_fields type, which makes the multi_match query wrap the two field clauses in a bool query instead of a dis_max query:

GET /blog/_search
{
  "query": {
    "multi_match": {
      "query":  "jumping rabbits",
      "type":   "most_fields",
      "fields": [ "title", "title.std" ]
    }
  }
}

Both documents match, and document 2 now scores higher than document 1:

"hits" : [
    {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.7509375,
        "_source" : {
            "title" : "Jumping jack rabbits"
        }
    },
    {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.36464313,
        "_source" : {
            "title" : "My rabbit jumps"
        }
    }
]
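The bool wrapping mentioned above can be sketched by building the roughly equivalent query body by hand. The helper most_fields_as_bool is a hypothetical name used only for this illustration; it produces one match clause per field inside a bool should, which is the shape a most_fields multi_match is generally described as expanding to.

```python
import json


def most_fields_as_bool(query, fields):
    """Build a bool query with one match clause per field."""
    return {
        "bool": {
            "should": [{"match": {field: {"query": query}}} for field in fields]
        }
    }


body = {"query": most_fields_as_bool("jumping rabbits", ["title", "title.std"])}
print(json.dumps(body, indent=2))
```

Because the clauses sit in should, a document matching either field is returned, and the scores of all matching clauses are added together rather than only the best one being kept.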

The broadly matching title field brings in as many documents as possible, which improves recall, while the title.std field acts as a signal that places the most relevant documents at the top of the results.

3. Controlling each field's contribution to the final score

Each field's contribution to the final score can be controlled with a custom boost value. For example, boosting the title field makes it the most important field and, in turn, reduces the effect of the other signal fields:

GET /blog/_search
{
  "query": {
    "multi_match": {
      "query":  "jumping rabbits",
      "type":   "most_fields",
      "fields": [ "title^10", "title.std" ]
    }
  }
}

The title field has a boost value of 10, which makes it more important than the title.std field.
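The effect of the boost can be illustrated with a simplified model in which each field's score is multiplied by its boost before the scores are added. The per-field scores below are made up for illustration, and the helper names are hypothetical; actual scoring is done by Lucene and involves more than this weighted sum.

```python
def parse_field(spec):
    """Split a field spec such as "title^10" into (name, boost)."""
    name, _, boost = spec.partition("^")
    return name, float(boost) if boost else 1.0


def combined_score(field_scores, field_specs):
    """Sum each field's score multiplied by its boost (simplified model)."""
    total = 0.0
    for spec in field_specs:
        name, boost = parse_field(spec)
        total += boost * field_scores.get(name, 0.0)
    return total


# Made-up per-field scores for one document, for illustration only.
scores = {"title": 0.5, "title.std": 1.5}

plain = combined_score(scores, ["title", "title.std"])
boosted = combined_score(scores, ["title^10", "title.std"])
print(plain, boosted)  # 2.0 6.5
```

In this model the title field supplies a quarter of the unboosted total but most of the boosted one, so title.std still breaks ties without dominating the ranking.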
