ElasticSearch index details

Posted by ErcFrtz on Sun, 09 Feb 2020 11:55:43 +0100

Still using MySQL for full-text search? Give Elasticsearch a try!

Note: the following is based on Elasticsearch 7.x, which differs from older versions.

1. Format description

Elasticsearch exposes its data interface over HTTP. The basic URL format is:

http://localhost:9200/{index}/{type}/{id}  
  • index: the index name, analogous to a table in a relational database
  • type: the type name. Note that since 7.x the type concept has been deprecated and "_doc" is used by default; 8.x no longer supports specifying a type in the request
  • id: the document ID. It can be omitted, in which case Elasticsearch generates one automatically
  • document: the JSON serialization of the stored object
  • metadata: the envelope Elasticsearch wraps around a document, generally as follows. The data under "_source" is the document we stored
{
    "_index" :   "website",
    "_type" :    "_doc",
    "_id" :      "123",
    "_version" : 1, 
    "found" :    true,
    "_source" :  {
            "title": "My first blog entry",
            "text":  "Just trying this out...",
            "date":  "2014/01/01"
    }
}

Request format description ("#" and "//" below denote comments, added to make the examples easier to follow):

POST http://localhost:9200/demo/_doc  # request method and path
# The following is the request body, sent with Content-Type "application/json"
{"name": "Xiaohong","age": 25}

The same request expressed with OkHttp:

private OkHttpClient okHttpClient = new OkHttpClient();
public String post() {
        String url = "http://localhost:9200/demo/_doc";
        String json = "{\"name\": \"Xiaohong\",\"age\": 25}";
        MediaType JSON = MediaType.parse("application/json");
        RequestBody requestBody = RequestBody.create(JSON, json);
        Request request = new Request.Builder().url(url).post(requestBody).build();
        try {
            Response response = okHttpClient.newCall(request).execute();
            // Elasticsearch returns 201 Created when the document is indexed successfully
            if (response.code() == 200 || response.code() == 201) {
                return response.body().string();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
 }

2. Basic data operation

Insert data (POST)

http://localhost:9200/demo/_doc/1  # Specify id as 1
http://localhost:9200/demo/_doc  # Auto-generate id
{
  "name": "Xiaohong",
  "age": 25
}

Query (GET)

http://localhost:9200/demo/_doc/1?pretty # Returns the document wrapped in metadata
http://localhost:9200/demo/_doc/1/_source # Returns the document only
http://localhost:9200/demo/_doc/1?_source=name # Returns the specified field, with metadata
http://localhost:9200/demo/_doc/1/_source?_source=name # Returns the specified field of the document only
http://localhost:9200/demo/_doc/_count # Counts the documents of this type
http://localhost:9200/demo/_doc/_search?size=1&from=1 # Paged query; from starts at 0, size defaults to 10
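The from/size paging above can also be sent in the request body instead of the URL; a sketch against the same demo index:

```json
POST http://localhost:9200/demo/_doc/_search
{
  "from": 1,
  "size": 1
}
```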

Update (POST)

http://localhost:9200/demo/_doc/1 # Overwrite (full) update of the document with the given id
{
  "name": "Xiao Ming",
  "age": 18
}

http://localhost:9200/demo/_doc/1/_update # Partial update of the document with the given id
{
   "doc":{
       "name":"Xiaohong"
   }
}

# A script can be used in the update API to modify fields of the source; inside the script the document is referenced as ctx._source.
http://localhost:9200/demo/_doc/1/_update # Increment the age field of the document with the given id
{
 	"script":"ctx._source.age+=1"
}
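Scripts can also take parameters, which keeps the script itself constant across different values. A sketch (the parameter name n is illustrative):

```json
POST http://localhost:9200/demo/_doc/1/_update
{
  "script": {
    "source": "ctx._source.age += params.n",
    "params": { "n": 5 }
  }
}
```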

DELETE

http://localhost:9200/demo/_doc/1 # Specify the id to delete
http://localhost:9200/demo # Specify the index to delete

Batch processing

  • index and type can be declared in the JSON body or directly in the URL, as shown below ("_id" may be omitted, in which case ES generates one automatically)
http://localhost:9200/demo/_doc/_bulk 
# Note: every line, including the last, must end with "\n" so Elasticsearch can parse the body
{ "index": { "_id": "3"}}
{"name":"Xiao Ming","age":29,"like":"Basketball"}
{ "index": { "_id": "4" }}
{"name":"Xiaohong","age":30,"like":"Volleyball"}

For more, see Basic operation and requirements of batch processing.
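To send a bulk request from code, the body must be assembled as newline-delimited JSON. A minimal sketch in Java (the class and method names are mine; it only builds the body string, which could then be POSTed to /_bulk with the OkHttp code from section 1):

```java
// Builds an NDJSON body for the Elasticsearch _bulk endpoint.
public class BulkBodyDemo {
    static String bulkBody() {
        StringBuilder sb = new StringBuilder();
        // Every line, including the last one, must end with "\n",
        // otherwise Elasticsearch rejects the request.
        sb.append("{ \"index\": { \"_id\": \"3\" } }\n");
        sb.append("{\"name\":\"Xiao Ming\",\"age\":29,\"like\":\"Basketball\"}\n");
        sb.append("{ \"index\": { \"_id\": \"4\" } }\n");
        sb.append("{\"name\":\"Xiaohong\",\"age\":30,\"like\":\"Volleyball\"}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(bulkBody());
    }
}
```

The bulk body should be sent with Content-Type "application/x-ndjson".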

3. bool query

If you need operations like SQL's AND, OR, and =, the query interfaces above are not enough. Elasticsearch provides the bool query, in the following format:

POST  http://localhost:9200/demo/_doc/_search   # demo is the index name; the rest is the default search endpoint
{
  "query": {
      "bool": {
        "must":     { "match": { "name": "Xiao Ming" }}, # must match
        "must_not": { "match": { "name":  "Xiaohong" }}, # must not match
        "should":   { "match": { "like": "Basketball" }}, # optional match; matching raises the document's score
        "filter":   { "range": { "age" : { "gt" : 18 }} } # non-scoring filter, here a range condition
    }
  }
}

The meaning of range is as follows:

  • gt: > greater than
  • lt: < less than
  • gte: >= greater than or equal to
  • lte: <= less than or equal to
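Both bounds can be combined in a single range clause; a sketch that filters ages in [18, 30] (the bounds are illustrative):

```json
POST http://localhost:9200/demo/_doc/_search
{
  "query": {
    "bool": {
      "filter": { "range": { "age": { "gte": 18, "lte": 30 } } }
    }
  }
}
```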

What if there are multiple must conditions? Simply turn the value of "must" into an array. "must_not", "should", and "filter" support multiple conditions the same way:

POST  http://localhost:9200/demo/_doc/_search
{
  "query": { 
    "bool": { 
      "must": [
        { "match": { "name":   "Xiao Ming"        }},
        { "match": { "age": 29}}
      ],
      "filter":   { "range": { "age" : { "gt" : 18 }} }
    }
  }
}

bool queries can be nested. For example, to query documents whose like matches Basketball and whose age is 26:

POST  http://localhost:9200/demo/_doc/_search
{
  "query": {
       "bool":{ 
           "must":{"match": {"like": {"query": "Basketball"}}},
            "filter":{
                "bool":{
                    "must": { "match": { "age": 26 }}
                }
            }
       }
  }
}

If a bool query contains only a filter, you can use constant_score instead; it skips scoring entirely and can speed up the query.

POST  http://localhost:9200/demo/_doc/_search
# constant_score replaces a bool query that contains only a filter. Here we filter documents whose "age" is 20
{
  "query": {
    "constant_score":   {
        "filter": {
            "term": { "age": 20 }
        }
    }
  }
}

# To find multiple exact values, e.g. documents whose age is 20 or 30, use "terms" (plural):
{
  "query": {
    "constant_score":   {
        "filter": {
            "terms": { "age": [20,30] }
        }
    }
  }
}

However, exact matching on a text field, e.g. matching only documents whose like is exactly "Basketball", does not work this way. For details, see Exact value lookup.

  • Filters: filters are fast, skip relevance scoring, and are easy to cache, so prefer filters wherever possible
  • term: on an analyzed text field, term matches individual tokens, so it behaves as "contains" rather than "equals". To query a text field exactly, map the field as not-analyzed; for details, see Exact query text
  • match_phrase: solves the problem that an analyzed field cannot match an exact phrase such as "quick brown fox"; see phrase match
  • slop: tells match_phrase how far apart the query terms may be while still counting the document as a match; see Mix up
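A sketch of match_phrase with slop, POSTed to the same _search endpoint (the phrase and slop value are illustrative); with slop 1, the terms may be one position apart and still count as a phrase match:

```json
POST http://localhost:9200/demo/_doc/_search
{
  "query": {
    "match_phrase": {
      "like": {
        "query": "Basketball Volleyball",
        "slop": 1
      }
    }
  }
}
```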

4. Full text index

At this point some readers will ask: isn't this just moving SQL's features into Elasticsearch? Why not just use SQL? The syntax so far is only an introduction; what comes next is the real point.

4.1 basic index format

Here is the simplest full-text search:

GET http://localhost:9200/demo/_doc/_search?q=name:Xiaohong # Pass the query directly via the "q" parameter
POST http://localhost:9200/demo/_doc/_search
# "and" means every analyzed term must match; likewise "or" means matching any one term is enough
{
  "query": {
    "match": {
      "like": {   # The field to search
        "query": "Volleyball and basketball", # The text to search; how it is split into terms depends on the analyzer (covered later). Here it becomes "Volleyball" and "Basketball"
        "operator": "and"  # Every term in "query" must match
      }
    }
  }
}

# To require only a minimum proportion of the terms to match, pass "minimum_should_match": "<percent>%"
{
  "query": {
    "match": {
      "like": {
        "query": "Volleyball, basketball, badminton, table tennis",
        "minimum_should_match": "75%"  // At least 3 of the 4 terms must match
      }
    }
  }
}

You can also combine this with a bool query: below, the content field must contain all three words full, text, and search; if it also contains Elasticsearch or Lucene, the document gets a higher score.

For details, see Boosting query clauses.

POST http://localhost:9200/demo/_doc/_search 
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "content": { 
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [ 
                { "match": { "content": "Elasticsearch" }},
                { "match": { "content": "Lucene"        }}
            ]
        }
    }
}

4.2 Analyzers

As mentioned above, full-text search results depend heavily on the analyzer (the author stepped into plenty of pitfalls here); with the wrong analyzer, the results may differ from what you expect. So what is analysis (word segmentation)? Let's look at a demo:

POST  http://localhost:9200/demo/_analyze
{
  "text":"full text search", # The phrase to analyze
  "analyzer": "standard"
}

"standard" specifies which analyzer to use on the phrase. The result is as follows; each token is one term produced by the analysis:

{
  "tokens": [
    {
      "token": "full",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "text",
      "start_offset": 5,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "search",
      "start_offset": 10,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Clearly, the "standard" analyzer splits "full text search" into three terms. So how does Elasticsearch match queries against the index? Take the same phrase and draw it as a table:

Query term ╲ Stored terms | full | text | search
full                      |  ×   |      |
text                      |      |  ×   |

As long as the terms produced from the query appear among the terms produced at index time, the search matches, so the analyzer is crucial to search quality. How do you configure an analyzer? For details, see Configure analyzer.

What happens when "standard" meets Chinese? Here things get more interesting: Chinese is more complex, since a word is made of multiple characters written together without spaces, unlike English where words are space-separated. For example, the Chinese for "telephone" is "电话"; analyzing it with "standard" gives the following result.

{
  "tokens": [
    {
      "token": "电",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "话",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    }
  ]
}

It splits every character apart. So a search for "电话" may match documents where "电" and "话" appear separately rather than together, which is no good! Fortunately, Elasticsearch supports analyzer plugins; below we install the Chinese analyzer ik_smart.

4.3 IK Chinese analyzer

From the Elasticsearch installation directory, install as follows (the plugin version must match your Elasticsearch version). For details, see the plugin's README:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip

Or install from a local file (on Windows; the path should point to the downloaded IK plugin zip):

elasticsearch-7.5.1\bin\elasticsearch-plugin install file:///D:/elasticsearch/elasticsearch-7.5.1-windows-x86_64.zip

After installation, set the mapping of the fields to be indexed. If the index already exists, it must be deleted first; for details, see DELETE above.

PUT  http://localhost:9200/demo
{
  "mappings": {
     "dynamic":true, # Whether unmapped fields may be added dynamically
     "properties": { # Field definitions (fixed key)
      "like": { # The field to configure
        "type":     "text", # The field type
        "analyzer": "ik_smart", # The analyzer used when indexing
         "search_analyzer": "ik_smart" # The analyzer used when searching
      }
    }
  }
}

We can check the index's field types and analyzers as follows to make sure the mapping took effect:

GET http://localhost:9200/demo/_mapping # View field attribute mapping for the demo index

Try the analysis again:

POST  http://localhost:9200/demo/_analyze
{
  "text":"电话",
   "analyzer": "ik_smart"
}

# give the result as follows
{
  "tokens": [
    {
      "token": "电话",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

When searching, you can also specify the analyzer explicitly (unnecessary if search_analyzer was already set in the mapping), so that documents where the characters of a word appear separately are no longer matched.

POST  http://localhost:9200/demo/_doc/_search 
{
    "query": {
        "match": {
            "contentText":{
                "query":       "Basketball and volleyball",
                "operator":    "and",
                "analyzer": "ik_smart"
            }
        }
    }
}

The IK plugin provides two analyzers: ik_smart and ik_max_word.

  • ik_max_word: splits the text at the finest granularity. For example, it splits "中华人民共和国国歌" (National Anthem of the People's Republic of China) into every plausible combination, such as 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 国歌; suitable for term queries.

  • ik_smart: splits at the coarsest granularity. For example, it splits the same text into just 中华人民共和国 and 国歌; suitable for phrase queries.

5. Sort

Search results often need to be sorted. Here we sort by "age" in descending order:

# Search "Basketball" in the "like" field, sorted by "age" in descending order
GET http://localhost:9200/demo/_doc/_search?sort=age:desc&q=like:Basketball 
# Search "Basketball and volleyball" in the "like" field, sorted by "age" in descending order 
POST http://localhost:9200/demo/_doc/_search
{
  "query": {
      "match":{
          "like":{
              "query":"Basketball and volleyball",
              "operator":"and"
          }
      }
  },
   "sort": { "age": { "order": "desc" }} # desc: descending; asc: ascending
}

If you get the following error:

Fielddata is disabled on text fields by default. Set fielddata=true on [age] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.    

set the mapping of the field to be sorted:

PUT http://localhost:9200/demo/_mapping  # Update the mapping for the index
{
  "properties": {
    "age": { 
      "type":     "text",
      "fielddata": true
    }
  }
}
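As the error message itself suggests, an alternative to enabling fielddata is sorting on a keyword (or numeric) field. With default dynamic mapping, a text field gets a .keyword sub-field that can be sorted directly; a sketch sorting on name:

```json
POST http://localhost:9200/demo/_doc/_search
{
  "query": { "match_all": {} },
  "sort": { "name.keyword": { "order": "asc" } }
}
```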

For more details, see sort

6, postscript

Of course, Elasticsearch can do far more than this; for reasons of space, this article does not cover many of its interfaces. For more, refer to the ElasticSearch documentation, where you can find the docs for each version.

Finally, the novel coronavirus outbreak is particularly serious at the moment. I wish everyone good fortune and long life, a safe and healthy year, and promotions and raises!

------WeiWq is recorded in Guangzhou on February 9, 2020.


Topics: ElasticSearch JSON SQL Attribute