Elasticsearch series: prefix search and fuzzy search

Posted by jariizumi on Sun, 22 Mar 2020 03:14:31 +0100

Outline

This article introduces several ways to do partial matching. The drop-down suggestions that pop up as you type in a browser search box are built on the same principle.

Prefix Query

In the searches covered so far, the term is the smallest matching unit: the words stored in the inverted index. Now let's look at partial matching, where we match only a part of one term, the equivalent of MySQL's "where content like '%love%'". In a database it is obvious at a glance that such a query cannot use an index and is very inefficient.

Elasticsearch handles this kind of search specially and supports several partial-matching formats. This time we focus on prefix matching against not_analyzed exact-value (keyword) fields.

Prefix Search Syntax

Typical fields that call for prefix search include zip codes, product serial numbers, courier tracking numbers, and certificate numbers. The values themselves carry some logical categorization, such as a prefix indicating a region or a year. Let's take zip codes as an example:


# Create only one postcode field of type keyword
PUT /demo_index
{
    "mappings": {
        "address": {
            "properties": {
                "postcode": {
                    "type":  "keyword"
                }
            }
        }
    }
}

# Import zip codes for some examples
POST /demo_index/address/_bulk
{ "index": { "_id": 1 }}
{ "postcode" : "510000"}
{ "index": { "_id": 2 }}
{ "postcode" : "514000"}
{ "index": { "_id": 3 }}
{ "postcode" : "527100"}
{ "index": { "_id": 4 }}
{ "postcode" : "511500"}
{ "index": { "_id": 5 }}
{ "postcode" : "511100"}

Example prefix search:

GET /demo_index/address/_search
{
  "query": {
    "prefix": {
      "postcode": {
        "value": "511"
      }
    }
  }
}

The search returns two hits, as expected:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "demo_index",
        "_type": "address",
        "_id": "5",
        "_score": 1,
        "_source": {
          "postcode": "511100"
        }
      },
      {
        "_index": "demo_index",
        "_type": "address",
        "_id": "4",
        "_score": 1,
        "_source": {
          "postcode": "511500"
        }
      }
    ]
  }
}

Prefix Search Principle

The prefix query does not calculate a relevance score; _score is fixed at 1. The only difference from running prefix as a filter is that a filter can cache its bitset.
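
A minimal sketch of the filter form, wrapping the same prefix in a constant_score query so it runs in filter context (reusing demo_index from above):

GET /demo_index/address/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "prefix": {
          "postcode": "511"
        }
      }
    }
  }
}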

Let's analyze the search process for the example:

  1. When a document is indexed, an inverted index is built first. A keyword field is not analyzed; its value is indexed as-is. A simplified example table:

postcode   doc ids
510000     1
514000     2
527100     3
511500     4
511100     5

  2. With a full-text (match) search, the search string "511" matches no term in the index, so the result is empty.
  3. With a prefix search, Elasticsearch scans the term dictionary: it reaches "511100", keeps scanning, finds "511500", and continues until the entire inverted index has been traversed, then returns the results.

From this process we can see that match search performance remains high, while prefix search is relatively slow because it traverses the index; still, some scenarios can only be served by prefix search. The longer the prefix, the fewer documents it matches and the better the performance. A prefix that is too short, say a single character, matches too much data and hurts performance, which is worth keeping in mind.

Wildcards and regular searches

Wildcard and regular-expression searches behave like prefix search, but are more versatile.

Wildcard

Wildcard symbols: ? matches any single character, * matches zero or more characters. Example:

GET /demo_index/address/_search
{
  "query": {
    "wildcard": {
      "postcode": {
        "value": "*110*"
      }
    }
  }
}
Regular Search

Here the search string is written as a regular expression, in the usual format:

GET /demo_index/address/_search
{
  "query": {
    "regexp": {
      "postcode": {
        "value": "[0-9]11.+"
      }
    }
  }
}

Both are advanced syntaxes that let us write more flexible query requests, but they perform poorly and are not used much.

Instant Search

When we use a search engine, relevant suggestions appear in the search box, for example when searching for "Elasticsearch" on Google:

The browser captures every input event: each character typed sends a request to the backend, which uses the current input as a prefix, fetches the top 10 results related to current hot topics, and returns them to help you complete your input. Baidu has a similar feature.

The implementation is based on prefix search, although the backend at Google/Baidu is far more complex. We can simulate instant search from the Elasticsearch side:

GET /demo_index/website/_search 
{
  "query": {
    "match_phrase_prefix": {
      "title": "Elasticsearch q"
    }
  }
}

The principle is match_phrase, except that the last term is searched as a prefix. For the search string "Elasticsearch q", Elasticsearch runs a normal match query for "Elasticsearch", then treats "q" as a prefix, scanning the inverted index for all terms starting with "q", and finally finds the documents that contain both "Elasticsearch" and a term starting with "q".

Of course, this query supports the slop parameter.
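
A sketch of the same query with slop (value illustrative), which allows the matched terms to sit a few positions apart:

GET /demo_index/website/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "Elasticsearch q",
        "slop": 2
      }
    }
  }
}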

max_expansions parameter

In the prefix-query section we mentioned the performance risk of prefixes that are too short. We can mitigate that problem with the max_expansions parameter; a recommended value is 50, as in the following example:

GET /demo_index/website/_search 
{
  "query": {
    "match_phrase_prefix": {
      "postcode": {
        "query": "Elasticsearch q",
        "max_expansions": 50
      }
    }
  }
}

max_expansions controls how many terms the prefix is allowed to expand to: Elasticsearch finds the first term matching the prefix "q", then collects matching terms in alphabetical order until no more match or the count reaches max_expansions.

When we search on Google, each keystroke fires a request, so max_expansions keeps the number of matched terms bounded while we keep typing, until the full query is entered or a suitable suggestion is picked, at which point we click the search button to search the web.

So when using match_phrase_prefix, remember to set max_expansions, otherwise performance will be terrible when the user has typed only the first character.

Application of ngram

The queries we have used so far require no special index settings; this approach is called query-time implementation. Its non-intrusiveness and flexibility usually come at the cost of search performance. The other approach is called index-time: the index settings are intrusive and part of the search work is prepared ahead of time, which helps performance a lot. If a feature has strict latency requirements, moving work from query time to index time is good practice.

Whether prefix search is good enough depends on the scenario. If it sits at the entrance of a top-level feature that carries most of the traffic, the index-time approach is recommended. Let's look at ngrams first.

What are ngrams

A prefix query matches terms one by one; the whole process is somewhat blind and scans a lot, so performance is relatively low. But if we split the keywords in advance by certain lengths, we can return to the efficient world of match queries. An ngram is essentially a sliding window that splits a keyword; the window length is configurable. Taking "Elastic" as an example, the n-grams for the seven possible lengths are:

  • Length 1: [E,l,a,s,t,i,c]
  • Length 2: [El,la,as,st,ti,ic]
  • Length 3: [Ela,las,ast,sti,tic]
  • Length 4: [Elas,last,asti,stic]
  • Length 5: [Elast,lasti,astic]
  • Length 6: [Elasti,lastic]
  • Length 7: [Elastic]

As you can see, the longer the length, the fewer grams are produced. Each gram is added to the inverted index, so a match search can hit it.
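
As a quick sketch, the _analyze API can demonstrate this sliding window with an inline ngram tokenizer definition (a fixed length of 3 here, chosen purely for illustration):

GET /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3
  },
  "text": "Elastic"
}

The returned tokens are exactly the length-3 row above: Ela, las, ast, sti, tic.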

There is also a special edge ngram, which keeps only the grams anchored at the start of the token:

  • Length 1: E
  • Length 2: El
  • Length 3: Ela
  • Length 4: Elas
  • Length 5: Elast
  • Length 6: Elasti
  • Length 7: Elastic

This kind of split fits our search habits particularly well.

Case

  1. Create an index specifying a filter
PUT /demo_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type":   "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type":    "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

The filter definition means: for any term this token filter receives, generate edge n-grams with a minimum length of 1 and a maximum of 20.

  2. Apply the custom analyzer autocomplete, which uses this token filter, to the title field mapping
PUT /demo_index/_mapping/_doc
{
  "properties": {
      "title": {
          "type":     "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
      }
  }
}
  3. We can test the results
GET /demo_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "love you"
}

Response results:

{
  "tokens": [
    {
      "token": "l",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "lo",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "lov",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "love",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "y",
      "start_offset": 5,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "yo",
      "start_offset": 5,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "you",
      "start_offset": 5,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

The test results are as expected.

  4. Add a little more test data
PUT /demo_index/_doc/_bulk
{ "index": { "_id": "1"} }
{ "title" : "love"}
{ "index": { "_id": "2"}}
{ "title" : "love me"}
{ "index": { "_id": "3"}}
{ "title" : "love you"}
{ "index": { "_id": "4"}}
{ "title" : "love every one"}
  5. Use a simple match query
GET /demo_index/_doc/_search
{
    "query": {
        "match": {
            "title": "love ev"
        }
    }
}

Response results:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.83003354,
    "hits": [
      {
        "_index": "demo_index",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.83003354,
        "_source": {
          "title": "love every one"
        }
      },
      {
        "_index": "demo_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.41501677,
        "_source": {
          "title": "love"
        }
      }
    ]
  }
}

With a plain match query, documents that merely contain "love" also come back, full-text style, just with a lower score.
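
As a side note, a sketch of one way to tighten the plain match is the operator parameter set to and, which requires every token to be present, though unlike match_phrase below it places no constraint on positions:

GET /demo_index/_doc/_search
{
  "query": {
    "match": {
      "title": {
        "query": "love ev",
        "operator": "and"
      }
    }
  }
}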

  6. Use match_phrase

match_phrase is recommended here: it requires every term to be present and the positions to line up consecutively, which meets our expectations.

GET /demo_index/_doc/_search
{
    "query": {
        "match_phrase": {
            "title": "love ev"
        }
    }
}

As we can see, most of the work is done at index time; the query side only needs match or match_phrase, which is much more efficient than a prefix query.

Search Hints

Elasticsearch also supports the completion suggest type to implement search hints, also known as auto completion.

completion suggest principle

At index time, specify the field type as completion. Elasticsearch generates a list of all possible completions for the field and stores them in a finite state transducer, an optimized graph structure.

At search time, Elasticsearch walks the graph character by character along the matching path. Once the user input is consumed, it follows every possible path to its end, generates the suggestion list, and caches that list in memory.

completion suggest is much faster in performance than any word-based query.

Example

  1. Define a title.suggest subfield of type completion
PUT /music
{
  "mappings": {
    "children" :{
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "suggest": {
              "type":"completion"
            }
          }
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

  2. Insert some sample data
PUT /music/children/_bulk
{ "index": { "_id": "1"} }
{ "title":"children music London Bridge", "content":"London Bridge is falling down"}
{ "index": { "_id": "2"}} 
{"title":"children music Twinkle", "content":"twinkle twinkle little star"} 
{ "index": { "_id": "3"}} 
{"title":"children music sunshine", "content":"you are my sunshine"} 
  3. The search request:
GET /music/children/_search
{
  "suggest": {
    "my-suggest": {
      "prefix": "children music",
      "completion": {
        "field":"title.suggest"
      }
    }
  }
}

The response is as follows (abridged):

{
  "took": 26,
  "timed_out": false,
  "suggest": {
    "my-suggest": [
      {
        "text": "children music",
        "offset": 0,
        "length": 14,
        "options": [
          {
            "text": "children music London Bridge",
            "_index": "music",
            "_type": "children",
            "_id": "1",
            "_score": 1,
            "_source": {
              "title": "children music London Bridge",
              "content": "London Bridge is falling down"
            }
          },
          {
            "text": "children music Twinkle",
            "_index": "music",
            "_type": "children",
            "_id": "2",
            "_score": 1,
            "_source": {
              "title": "children music Twinkle",
              "content": "twinkle twinkle little star"
            }
          },
          {
            "text": "children music sunshine",
            "_index": "music",
            "_type": "children",
            "_id": "3",
            "_score": 1,
            "_source": {
              "title": "children music sunshine",
              "content": "you are my sunshine"
            }
          }
        ]
      }
    ]
  }
}

The returned options can then be fed to the front-end page as suggestions, for example to populate the browser's drop-down box.
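
For a drop-down you usually want to cap the number of options; a sketch using the completion suggester's size parameter, with _source disabled to keep the response small (same music index as above):

GET /music/children/_search
{
  "_source": false,
  "suggest": {
    "my-suggest": {
      "prefix": "children m",
      "completion": {
        "field": "title.suggest",
        "size": 10
      }
    }
  }
}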

Fuzzy Search

Fuzzy search can be used to correct misspelled words. Example:

GET /music/children/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "teath",
        "fuzziness": 2
      }
    }
  }
}

fuzziness: the maximum number of character edits to correct. The default is 2, which is also the cap: setting it larger has no effect. It cannot grow without bound, because a word with too many errors cannot be sensibly corrected.
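
To bound the cost of fuzzy matching, a sketch with two additional parameters (values illustrative): prefix_length requires the leading characters to match exactly, and max_expansions caps the number of term variants generated:

GET /music/children/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "teath",
        "fuzziness": 2,
        "prefix_length": 1,
        "max_expansions": 50
      }
    }
  }
}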

Common usage: nest fuzziness inside a match query and set it to AUTO.

GET /music/children/_search
{
  "query": {
    "match": {
      "name": {
        "query": "teath",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}

Knowing about it is enough.

Summary

This article introduced the basic usage of prefix, wildcard, and regular-expression search, and briefly explained the performance impact of prefix search and how to control it. Index-time partial search and search hints with ngram are classic practices. Finally, the common usage of fuzzy search was covered in passing.


Topics: Programming ElasticSearch Google Java MySQL