Elasticsearch Basics

Posted by tommyinnn on Thu, 10 Feb 2022 23:27:31 +0100

Understand Elasticsearch

The principle of search is to build an inverted index (also called a reverse index): an index built from the keywords in the article content, with each keyword pointing to the title of the article. Take classical poetry as an example. If you are asked to recall poems related to "moonlight", it is not easy, because the index in our heads is a forward index: first the poem's name, dynasty and author, then its content. Yet the line "Moonlight before my bed" in "Quiet Night Thoughts" contains "moonlight". We could index "moonlight" to the line "Moonlight before my bed", but that would create too many index entries for one poem; likewise "hometown" would point to "I bow my head and think of my hometown", which we cannot remember either. Instead, we map "moonlight" and "hometown" to the title "Quiet Night Thoughts". Once we think of "Quiet Night Thoughts", the whole poem follows.
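The idea can be sketched in a few lines of Python. This is only a toy illustration, not how Lucene stores its index: the poem texts, the second title and the function name build_inverted_index are made up for the example.

```python
from collections import defaultdict

# Forward index: title -> content (toy data, assumed for illustration).
poems = {
    "Quiet Night Thoughts": "moonlight before my bed i bow my head and think of my hometown",
    "Spring Dawn": "in spring sleep i miss the dawn birds sing everywhere",
}

def build_inverted_index(docs):
    """Map each keyword to the set of titles whose content contains it."""
    index = defaultdict(set)
    for title, content in docs.items():
        for word in content.split():
            index[word].add(title)
    return index

inverted = build_inverted_index(poems)
print(sorted(inverted["moonlight"]))  # ['Quiet Night Thoughts']
print(sorted(inverted["hometown"]))   # ['Quiet Night Thoughts']
```

Looking up a keyword is now a single dictionary access, instead of a scan over every poem.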

Elasticsearch is a wrapper around Lucene that provides search functionality. It simplifies index creation and access through an API, and implements a distributed search engine.

Install Elasticsearch.

Some concepts: an index is similar to a database, a type is similar to a data table, and a document is like a record.

ES also uses a master-slave architecture, which implements data sharding and replication. Only creating indexes and types needs to go through the master. Data writes follow a simple routing rule and can be routed to any node in the cluster, so the write load is spread across the whole cluster.
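A rough sketch of such a routing rule is "hash the routing key, then take it modulo the number of primary shards", where the routing key defaults to the document ID. The code below is an assumption-laden illustration: zlib.crc32 is only a stand-in hash (Elasticsearch uses its own hash function), and pick_shard is a hypothetical helper name.

```python
import zlib

def pick_shard(routing_key, number_of_primary_shards):
    """Return the primary shard number for a routing key.

    zlib.crc32 is a stand-in hash chosen for this sketch (assumption);
    it is not the hash function Elasticsearch actually uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards

# Route document IDs 1..8 across 4 primary shards.
shards = [pick_shard(str(doc_id), 4) for doc_id in range(1, 9)]
print(shards)
```

Because the shard number depends on the shard count, changing the number of primary shards would send existing documents to different shards than the ones they were written to.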

The typical application of ES is the ELK log analysis system, in which E is Elasticsearch, L is Logstash (a log collection system) and K is Kibana (a data visualization platform). For example, with 1,000 machines, checking the logs machine by machine after a system failure is very troublesome.

After installing ES, you need to create an index, also known as creating mapping.

Create statement: PUT /index_name (to add a mapping to an existing index, use index_name/_mapping/type_name with only the mapping part). The request body in JSON format is as follows (ES 2.x syntax):

{
  "settings": {
    "index": {
      "number_of_shards": "4",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "type_name": {
      "_ttl": {
        "enabled": false
      },
      "dynamic": false,
      "_all": {
        "enabled": false
      },
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

Basic concepts

  1. Cluster

A cluster contains multiple nodes that provide external services. The cluster to which each node belongs is determined by the cluster name in the configuration file.

  2. Node

Each node in the cluster also has a name. By default it is randomly assigned, but you can also specify it yourself. Nodes in an ES cluster are managed and communicate by node name.

  3. Index

An index is a collection of documents with the same structure, which is equivalent to a database in MySQL.

  4. Type

An index can correspond to one or more types. A type can be regarded as a logical partition of the index, which is equivalent to a table in MySQL.

  5. Document

A document is a JSON-formatted string stored in ES. Each document has a document ID; if you do not specify an ID yourself, the system generates one automatically. The index/type/id of a document must be unique. A document is equivalent to a row in MySQL.

  6. Field

A document contains multiple fields, and each field has a field type, which is similar to a column in MySQL.

  7. Shard

Shards in ES are divided into primary shards and replica shards.

Primary shard: when a document is saved, it is stored in a primary shard first and then copied to the replica shards. By default, an index has five primary shards. You can specify the number of shards yourself, but once the index is created, the number of primary shards cannot be changed.

Replica shard: each primary shard has zero or more replicas, which are copies of the primary shard. Replica shards provide high availability: when a primary shard goes down, one of its replicas is promoted to primary. They can also improve performance by serving search requests. For this reason, a primary shard is never placed on the same node as its own replica.

  8. Replica

Replication prevents single points of failure. It enables failover and ensures high availability of the system.

  9. Mapping

A mapping describes how data is stored in each field; defining a mapping is the process of defining the document types and the fields that are stored and indexed. Each document in an index has a type, and each type has its own mapping. A mapping defines the data type of each field in the document structure.

Use GET /index/_mapping/type to obtain the mapping information of the corresponding /index/type.

This is similar to getting a table structure, as follows:

GET /shop_product_es/_mapping/shop_product_es_type
{
  "shop_product_es": {
    "mappings": {
      "shop_product_es_type": {
        "dynamic": "false",
        "_all": {
          "enabled": false
        },
        "properties": {
          "secondCategory": {
            "type": "integer"
          },
          "avgPrice": {
            "type": "double"
          }
        }
      }
    }
  }
}

Word segmentation

To see how a piece of text is segmented, use the command index_name/_analyze, with the text to be segmented in the JSON parameters.

Elasticsearch's default analyzer splits Chinese text into individual characters, so at search time the keywords are also broken apart character by character before the index is looked up. The query results are not satisfactory.
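To see why the default behavior hurts Chinese search, here is a rough Python approximation of it. This is a simplification assumed for illustration, not the real standard analyzer: it emits each CJK character as its own token, keeps runs of ASCII letters and digits together, and lowercases everything. The function name rough_standard_tokens is hypothetical.

```python
import re

# One token per CJK character, or one token per run of ASCII letters/digits.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")

def rough_standard_tokens(text):
    """Approximate the default segmentation: single CJK characters, lowercased words."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

print(rough_standard_tokens("小米手机 Xiaomi 8"))
# ['小', '米', '手', '机', 'xiaomi', '8']
```

A search for the two-character word "小米" is therefore looked up as two independent characters, which is why a dedicated Chinese analyzer is usually installed.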

Query statement

You can query with URL parameters or with DSL statements.

URL parameter search

This method is similar to an ordinary GET request. The parameters are appended to the URL, and multiple parameters are separated by the & symbol. For example: GET index/type/_search?parameters

Query all

Command: GET index/type/_search

return:

{
  "took": 7, //Query time in milliseconds
  "timed_out": false, //Whether the query timed out. A timeout does not stop the query from executing; it only tells the coordinating node to return the results collected so far and close the connection
  "_shards": {
    "total": 5, //The number of shards queried. The index is divided into five shards, so a search request hits all five of them
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2, //The total number of matching documents
    "max_score": 1, //The highest relevance score among the hits
    "hits": [ //The matching documents
      {
        "_index": "school",
        "_type": "student",
        "_id": "2",
        "_score": 1,
        "_source": {
          "name": "houyi",
          "age": 23,
          "class": 2,
          "gender": "male"
        }
      },
      {
        "_index": "school",
        "_type": "student",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "Lv Bu",
          "age": 21,
          "class": 2,
          "gender": "male"
        }
      }
    ]
  }
}

Multi index, multi type search

Specify particular indexes and types in the URL for multi-index, multi-type search.

  1. /_search to search all types in all indexes

  2. /school/_search to search all types in the school index

  3. /school,ad/_search to search all types in the school and ad indexes

  4. /school/student/_search to search the student type in the school index

  5. /s*,a*/_search to search all types in all indexes starting with s or a

  6. /school,ad/student,phone/_search to search the student and phone types in the school and ad indexes

  7. /_all/student,phone/_search to search the student and phone types in all indexes

Query by criteria

To query documents whose name is houyi, use: GET /school/student/_search?q=name:houyi

More query parameters are as follows:

q: the query string, for example q=syslog

df: the default field to use when no field prefix is given in the query

analyzer: the analyzer used when parsing the query string

lowercase_expanded_terms: whether expanded terms are lowercased when searching; the default is true

analyze_wildcard: whether wildcard and prefix queries are analyzed; the default is false

default_operator: the default relationship between multiple conditions, AND or OR; the default is OR

lenient: if set to true, failures in field type conversion are ignored; the default is false

explain: include an explanation of how the score was computed in each returned hit

_source: whether the _source is included; _source_include and _source_exclude are also supported

fields: return only the specified fields; multiple fields are separated by commas

sort: sort by field name, such as fieldName:asc or fieldName:desc

track_scores: score tracking; when sorting, true means scoring information is still returned

timeout: the timeout setting

terminate_after: the maximum number of documents to collect in each shard; if set, the returned result contains a terminated_early field

from: the offset of the first hit to return; the default is 0; used as the paging start

size: the number of hits to return; the default is 10

search_type: the type of search, either dfs_query_then_fetch or query_then_fetch; the default is query_then_fetch

todo: supplementary example.
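As a small supplementary example, the URL form can be assembled with Python's standard library. This is only a sketch: build_search_url is a hypothetical helper, and the index, type and field names reuse the school/student example from above.

```python
from urllib.parse import urlencode

def build_search_url(index, doc_type, params):
    """Build a URL-parameter search path such as /index/type/_search?q=..."""
    path = f"/{index}/{doc_type}/_search"
    # safe=":" keeps field:value pairs readable instead of encoding the colon.
    query_string = urlencode(params, safe=":")
    return f"{path}?{query_string}" if query_string else path

url = build_search_url(
    "school",
    "student",
    {"q": "name:houyi", "sort": "age:desc", "from": 0, "size": 10},
)
print(url)  # /school/student/_search?q=name:houyi&sort=age:desc&from=0&size=10
```

The resulting path can be sent with any HTTP client as a GET request to the cluster address.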

Query DSL

https://juejin.im/post/5d2d300b6fb9a07ec56ea9bb

The search command is index_name/type_name/_search, with the search criteria filled in as JSON parameters. Note that the index name is not the cluster name.

GET /index_name/type_name/_search
{
    "query":{
        "bool":{
            "must":{
                "match":{
                    "last_name":"smith"
                }
            }
        }
    }
}

The request parameter is a JSON string whose outermost node is query.

A compound query combines multiple conditions. For example, a bool statement can contain must, must_not, should and filter clauses:

GET /index/type/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"name": "phone"}}
      ],
      "must_not": [
        {"match": {"color": "red"}}
      ],
      "should": [
        {"match": {"price": 5000}}
      ],
      "filter": {
        "term": {"label": "phone"}
      }
    }
  }
}

must: the document must match the query.

must_not: the document must not match the query.

should: if the document matches, its relevance score is increased.

filter: the document must match, but the clause does not affect the relevance score.

In fact, there are two kinds of structured statements we can use: structured queries and structured filters.
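The way the bool clauses combine can be sketched in plain Python. This is a simplified model under stated assumptions: each clause is a Python predicate, and the "score" just counts matching must and should clauses, which is far cruder than real relevance scoring. The function bool_search and the sample documents are hypothetical.

```python
def bool_search(docs, must=(), must_not=(), should=(), filters=()):
    """Return (score, doc) pairs, best first. Each clause is a predicate doc -> bool."""
    results = []
    for doc in docs:
        # must and filter clauses are both mandatory.
        if not all(clause(doc) for clause in list(must) + list(filters)):
            continue
        # Any matching must_not clause excludes the document.
        if any(clause(doc) for clause in must_not):
            continue
        # must and should contribute to the score; filter clauses do not.
        score = len(must) + sum(1 for clause in should if clause(doc))
        results.append((score, doc))
    return sorted(results, key=lambda pair: pair[0], reverse=True)

phones = [
    {"name": "phone", "color": "red", "price": 5000, "label": "phone"},
    {"name": "phone", "color": "blue", "price": 5000, "label": "phone"},
    {"name": "phone", "color": "black", "price": 3000, "label": "phone"},
]
hits = bool_search(
    phones,
    must=[lambda d: d["name"] == "phone"],
    must_not=[lambda d: d["color"] == "red"],
    should=[lambda d: d["price"] == 5000],
    filters=[lambda d: d["label"] == "phone"],
)
print([(score, doc["color"]) for score, doc in hits])  # [(2, 'blue'), (1, 'black')]
```

The red phone is excluded by must_not, and the blue phone ranks first because it also satisfies the optional should clause.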

Query documents where a field is empty

Similar to the WHERE condition in SQL: field_name IS NULL

GET index/type/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "Field name"
        }
      }
    }
  }
}

Conversely, to query documents where the field is not empty:

GET index/type/_search
{
  "query": {
    "bool": {
      "must": {
        "exists": {
          "field": "Field name"
        }
      }
    }
  }
}

Paging query

For example, to start from the first document and return 25 items per page.

In the URL: index/type/_search?from=0&size=25

It can also be written in JSON query statements:

{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "secondCategoryName"
        }
      }
    }
  },
  "from":0,
  "size":12
}
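The from value for a given page is just an offset computed from the page number and the page size. A minimal sketch, assuming 1-based page numbers; page_params is a hypothetical helper name:

```python
def page_params(page, page_size):
    """Translate a 1-based page number into the from/size pair used for paging."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}

print(page_params(1, 25))  # {'from': 0, 'size': 25}
print(page_params(3, 25))  # {'from': 50, 'size': 25}
```

The returned dictionary can be merged into the JSON query body or appended to the URL as parameters.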

Grouping statistics

Similar to a GROUP BY statement in SQL.

POST shop_product_es/shop_product_es_type/_search

{
  "aggs": {
    "group_by_firstcategory": {
      "terms": {
        "field": "firstCategory"
      },
      "aggs": {
        "secondCategory": {
          "terms": {
            "field": "secondCategory"
          },
          "aggs": {
            "thirdCategory": {
              "terms": {
                "field": "thirdCategory"
              }
            }
          }
        }
      }
    }
  }
}
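What the nested terms aggregation computes can be mimicked in plain Python: group by the first field, then group each bucket by the next field, and so on. The sketch below is an in-memory analogy with made-up sample data; nested_counts and the "sub" key are hypothetical names, not the shape of the actual ES response.

```python
from collections import Counter

products = [  # toy data, assumed for illustration
    {"firstCategory": 1, "secondCategory": 11, "thirdCategory": 111},
    {"firstCategory": 1, "secondCategory": 11, "thirdCategory": 112},
    {"firstCategory": 1, "secondCategory": 12, "thirdCategory": 121},
    {"firstCategory": 2, "secondCategory": 21, "thirdCategory": 211},
]

def nested_counts(docs, fields):
    """Count documents per value of fields[0], then recurse into the remaining fields."""
    if not fields:
        return {}
    field, rest = fields[0], fields[1:]
    buckets = {}
    for key, count in Counter(doc[field] for doc in docs).items():
        members = [doc for doc in docs if doc[field] == key]
        buckets[key] = {"doc_count": count, "sub": nested_counts(members, rest)}
    return buckets

result = nested_counts(products, ["firstCategory", "secondCategory", "thirdCategory"])
print(result[1]["doc_count"])             # 3
print(result[1]["sub"][11]["doc_count"])  # 2
```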

Keywords in detail

  1. match_all query

This query simply matches all documents.

GET /ad/phone/_search
{
  "query": {
    "match_all": {}
  }
}

  2. match query

Supports both full-text search and exact queries, depending on whether the field supports full-text search.

Full-text search:

GET /ad/phone/_search
{
  "query": {
    "match": {
      "ad": "a red"
    }
  }
}

Full-text search first segments the query string: a red is split into a and red, which are then matched against the inverted index, so this statement finds all three documents.

Exact query:

GET /ad/phone/_search
{
  "query": {
    "match": {
      "price": "6000"
    }
  }
}

For exact-value queries, you can use a filter clause instead of query, because filter results are cached.

operator parameter:

A match query also accepts an operator parameter. By default the operator is or. We can change it to and, so that all the specified terms must match.

GET /ad/phone/_search
{
  "query": {
    "match": {
      "ad": {
        "query": "a red",
        "operator": "and"
      }
    }
  }
}

Match precision:

A match query supports the minimum_should_match parameter, which specifies the number of terms that must match for a document to be considered relevant. We can set it to a specific number (the value counts terms as they appear in the inverted index). More commonly we set it to a percentage, because we cannot control how many words users enter when searching.

GET /ad/phone/_search
{
  "query": {
    "match": {
      "ad": {
        "query": "a red",
        "minimum_should_match": "2"
      }
    }
  }
}

Only documents that match both terms, a and red, are returned.

If minimum_should_match is 1, a document is returned as long as it matches one of the terms.
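A rough model of this rule in Python, under simplifying assumptions: text is split on whitespace and lowercased, percentages are rounded down, and negative values and the other advanced forms of minimum_should_match are not handled. The helper names required_matches and matches are hypothetical.

```python
def required_matches(term_count, minimum_should_match):
    """Resolve an integer or a percentage string to a required number of terms."""
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return term_count * percent // 100  # rounded down
    return int(minimum_should_match)

def matches(doc_text, query, minimum_should_match):
    """True if enough query terms occur in the document text."""
    query_terms = query.lower().split()
    doc_terms = set(doc_text.lower().split())
    needed = required_matches(len(query_terms), minimum_should_match)
    found = sum(1 for term in query_terms if term in doc_terms)
    return found >= needed

print(matches("a red phone", "a red", "2"))     # True
print(matches("a blue phone", "a red", "2"))    # False
print(matches("a blue phone", "a red", "50%"))  # True
```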

  3. multi_match query

Multi-field query, for example querying documents that contain the word red in the color or ad field.

GET /ad/phone/_search
{
  "query": {
    "multi_match": {
      "query": "red",
      "fields": ["color", "ad"]
    }
  }
}

  4. range query

Range query. Operators: gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to).

Query documents with a price greater than 4000 and less than 6000:

GET /ad/phone/_search
{
  "query": {
    "range": {
      "price": {
        "gt": 4000,
        "lt": 6000
      }
    }
  }
}

  5. term query

Exact-value query. Query documents whose price field is equal to 6000:

GET /ad/phone/_search
{
  "query": {
    "term": {
      "price": {
        "value": "6000"
      }
    }
  }
}

Query documents whose name field is equal to iphone 8:

GET /ad/phone/_search
{
  "query": {
    "term": {
      "name": {
        "value": "iphone 8"
      }
    }
  }
}

As a result, no matching documents are found.

The reason is that a term query looks up the exact term in the inverted index. It does not run the analyzer on the query; it only matches against the inverted index. The name field is of type text, so iphone 8 was split into iphone and 8 at index time. When we use term to query iphone 8, there is no term iphone 8 in the inverted index, so no matching document is found.

The difference between term and match queries

A term query does no word segmentation and matches the inverted index directly.

A match query segments the query first. When querying iphone 8, the text is split into iphone and 8, and each term is matched against the inverted index. The results therefore include both the iphone 8 and the xiaomi 8 documents.

Another thing to note is that term queries do not go through the analyzer but match the inverted index directly, so the results depend on how the analyzer split the text at index time. For example, add a new document of type /ad/phone whose name field is Oppo. If you then use term to query Oppo, you will not find the document. This is because ES uses the standard analyzer by default, which converts tokens to lowercase after segmentation. Therefore the document cannot be found with Oppo, but it can be found with lowercase oppo.

GET /ad/phone/_search
{
  "query": {
    "term": {
      "name": {
        "value": "Oppo" //Change to oppo to find out the newly added documents
      }
    }
  }
}

The results of term-like queries always depend on the analyzer in use. Understanding how the chosen analyzer segments text helps us write correct query statements.
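The difference can be reproduced with a tiny in-memory inverted index. This is a sketch under simplifying assumptions: analyze only splits on whitespace and lowercases, which is a crude stand-in for the standard analyzer, and the three documents are made up to mirror the examples above.

```python
from collections import defaultdict

def analyze(text):
    """Crude stand-in for the standard analyzer: whitespace split plus lowercase."""
    return text.lower().split()

docs = {1: "iphone 8", 2: "xiaomi 8", 3: "Oppo"}

# Index time: every token of the name field goes into the inverted index.
index = defaultdict(set)
for doc_id, name in docs.items():
    for token in analyze(name):
        index[token].add(doc_id)

def term_query(value):
    """Look up the value as-is, without analyzing it."""
    return set(index.get(value, set()))

def match_query(text):
    """Analyze the query text, then return documents matching any token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits

print(term_query("iphone 8"))           # set()
print(sorted(match_query("iphone 8")))  # [1, 2]
print(term_query("Oppo"))               # set()
print(term_query("oppo"))               # {3}
```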

  6. terms query

A terms query is like a term query, but lets you specify multiple values to match. If the field contains any of the specified values, the document satisfies the condition.

GET /ad/phone/_search
{
  "query": {
    "terms": {
      "ad": ["red", "blue"]
    }
  }
}

  7. exists query and missing query

Used to find documents in which the specified field has a value (exists) or has no value (missing).

The name field has a value:

GET /ad/phone/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": {
          "field": "name"
        }
      }
    }
  }
}

The name field has no value (the missing query was removed in ES 5.0; in later versions use must_not with exists instead):

GET /ad/phone/_search
{
  "query": {
    "bool": {
      "filter": {
        "missing": {
          "field": "name"
        }
      }
    }
  }
}

  8. match_phrase query

Phrase query. The query a red matches ad fields that contain the phrase a red, with the terms adjacent and in the same order; it does not match documents containing "a <other word> red".

GET /ad/phone/_search
{
  "query": {
    "match_phrase": {
      "ad": "a red"
    }
  }
}

  9. scroll query

Similar to a paging query, but jumping to an arbitrary page is not supported; results can only be read page by page. A scroll query is not meant for real-time user requests, but for processing large amounts of data.

POST /ad/phone/_search?scroll=1m
{
  "query": {
    "match_all": {}
  },
  "size": 1,
  "from": 0
}

The return value contains a _scroll_id:

"_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAQFlV6T3VqY2NaVDBLRG5uZXdiZ0hFYUEAAAAAAAAAERZVek91amNjWlQwS0RubmV3YmdIRWFBAAAAAAAAABIWVXpPdWpjY1pUMEtEbm5ld2JnSEVhQQAAAAAAAAATFlV6T3VqY2NaVDBLRG5uZXdiZ0hFYUEAAAAAAAAAFBZVek91amNjWlQwS0RubmV3YmdIRWFB"

Pass this _scroll_id in the next query to fetch the next page of documents:

POST /_search/scroll 
{
    "scroll" : "1m", 
    "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAYFlV6T3VqY2NaVDBLRG5uZXdiZ0hFYUEAAAAAAAAAGRZVek91amNjWlQwS0RubmV3YmdIRWFBAAAAAAAAABYWVXpPdWpjY1pUMEtEbm5ld2JnSEVhQQAAAAAAAAAXFlV6T3VqY2NaVDBLRG5uZXdiZ0hFYUEAAAAAAAAAFRZVek91amNjWlQwS0RubmV3YmdIRWFB" 
}
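The page-by-page pattern is a loop that keeps passing the latest _scroll_id until a page comes back with no hits. The sketch below shows only that control flow and runs against a fake in-memory response table rather than a real cluster; scroll_all, fetch_next and the fake scroll IDs are assumptions made for the example.

```python
def scroll_all(first_page, fetch_next):
    """Yield every hit, following _scroll_id until a page comes back empty.

    fetch_next is any callable that takes a scroll ID and returns the next page
    (in real use it would send POST /_search/scroll; here it is injected).
    """
    page = first_page
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        page = fetch_next(page["_scroll_id"])

# Fake responses keyed by scroll ID, so the sketch runs without a server.
fake_pages = {
    "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "2"}]}},
    "s2": {"_scroll_id": "s3", "hits": {"hits": []}},
}
first = {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}]}}

ids = [hit["_id"] for hit in scroll_all(first, fake_pages.__getitem__)]
print(ids)  # ['1', '2']
```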

  10. multi_get query

Multiple documents can be fetched based on index, type (optional) and id (and optionally routing). If fetching a document fails, the error information is included in the response.

GET /ad/phone/_mget
{
  "ids": ["1","8"]
}

  11. bulk batch operation

A bulk operation can create, index, update and delete multiple documents in a single API call, which can greatly improve indexing speed.

The bulk request body is as follows:

{ action: { metadata }}\n 
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

action must be one of the following:

create: create the document only if it does not already exist

index: create a new document or replace an existing document

update: partially update a document

delete: delete a document

When indexing, creating, updating or deleting, you must specify the document's _index, _type and _id metadata.

For example:

    PUT _bulk
    { "create" : { "_index" : "ad", "_type" : "phone", "_id" : "6" }}
    { "name" : "bulk" }
    { "index" : { "_index" : "ad", "_type" : "phone", "_id" : "6" }}
    { "name" : "bulk" }
    { "delete":{  "_index" : "ad", "_type" : "phone", "_id" : "1"}}
    { "update":{  "_index" : "ad", "_type" : "phone", "_id" : "3"}}
    { "doc" : {"name" : "huawei p20"}}

return:

    {
      "took": 137,
      "errors": true, //true if any operation failed
      "items": [ //The result of each operation, in the same order as the request
        {
          //create failed: the document already exists
          "create": { 
            "_index": "ad",
            "_type": "phone",
            "_id": "6",
            "status": 409,
            "error": {
              "type": "version_conflict_engine_exception",
              "reason": "[phone][6]: version conflict, document already exists (current version [2])",
              "index_uuid": "9F5FHqgISYOra_P09HReVQ",
              "shard": "2",
              "index": "ad"
            }
          }
        },
        {
          //The index document already exists and will be overwritten
          "index": { 
            "_index": "ad",
            "_type": "phone",
            "_id": "6",
            "_version": 3,
            "result": "updated",
            "_shards": {
              "total": 2,
              "successful": 1,
              "failed": 0
            },
            "_seq_no": 6,
            "_primary_term": 5,
            "status": 200
          }
        },
        {
          //delete  
          "delete": { 
            "_index": "ad",
            "_type": "phone",
            "_id": "1",
            "_version": 1,
            "result": "not_found",
            "_shards": {
              "total": 2,
              "successful": 1,
              "failed": 0
            },
            "_seq_no": 4,
            "_primary_term": 5,
            "status": 404
          }
        },
        {
          //modify  
          "update": { 
            "_index": "ad",
            "_type": "phone",
            "_id": "3",
            "_version": 3,
            "result": "noop",
            "_shards": {
              "total": 2,
              "successful": 1,
              "failed": 0
            },
            "status": 200
          }
        }
      ]
    }

A bulk request is not atomic and cannot be used to implement transactions. Each operation is processed separately, so the success or failure of one operation does not affect the others.
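The newline-delimited request body can be produced with the standard json module. A minimal sketch, assuming each operation is given as an (action, metadata, source) tuple; to_bulk_body is a hypothetical helper, and the index, type and IDs repeat the example above.

```python
import json

def to_bulk_body(operations):
    """Serialize (action, metadata, source_or_None) tuples into a bulk request body."""
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # delete has no request-body line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # every line, including the last, ends with \n

operations = [
    ("index", {"_index": "ad", "_type": "phone", "_id": "6"}, {"name": "bulk"}),
    ("delete", {"_index": "ad", "_type": "phone", "_id": "1"}, None),
    ("update", {"_index": "ad", "_type": "phone", "_id": "3"}, {"doc": {"name": "huawei p20"}}),
]
body = to_bulk_body(operations)
print(body)
```

Each JSON object must stay on a single line, which is why json.dumps is used without indentation.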

  12. fuzzy query

Fuzzy query: computes the spelling similarity (edit distance) between indexed terms and the keyword.

GET /ad/phone/_search
{
  "query": {
    "fuzzy": {
      "color": {
        "value": "res",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}

Parameters:

fuzziness: the maximum edit distance; AUTO by default;

prefix_length: the number of initial characters that will not be "fuzzified". This helps to reduce the number of terms that must be examined. The default value is 0;

max_expansions: the maximum number of terms that the fuzzy query will expand to. The default value is 50; setting it lower helps to optimize the query;

transpositions: whether fuzzy transpositions (ab -> ba) are supported. The default is false.
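To make fuzziness, prefix_length and transpositions concrete, here is a plain-Python edit distance with an optional transposition step. It is a textbook dynamic-programming sketch, not the automaton-based implementation Lucene uses; edit_distance and fuzzy_match are hypothetical names.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; optionally count an adjacent swap (ab -> ba) as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]

def fuzzy_match(term, value, fuzziness=2, prefix_length=0):
    """The first prefix_length characters must be identical; the rest may differ."""
    if term[:prefix_length] != value[:prefix_length]:
        return False
    return edit_distance(term, value) <= fuzziness

print(edit_distance("res", "red"))                     # 1
print(edit_distance("ab", "ba"))                       # 2
print(edit_distance("ab", "ba", transpositions=True))  # 1
print(fuzzy_match("red", "res", fuzziness=2, prefix_length=1))  # True
print(fuzzy_match("bed", "res", fuzziness=2, prefix_length=1))  # False
```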

  13. wildcard query

A wildcard query gives fuzzy search without word segmentation: searching for "Xiaomi" only finds records that contain "Xiaomi" as a contiguous string. The premise is that the field being searched is not analyzed (in ES2, set "type": "string" and "index": "not_analyzed"; in ES6, use the keyword type).

Wildcard queries support two wildcards: ? matches a single character and * matches any number of characters.

To avoid extremely slow wildcard queries, do not put * or ? at the beginning of the pattern.

GET /es_index/es_index_type/_search
{
  "query": {
    "wildcard": {
      "color": "r?d"
    }
  }
}

color is one of the fields. The query matches values that have exactly one character between r and d.
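The two wildcards behave like a very small subset of regular expressions applied to the whole term. The sketch below translates a pattern into a Python regex to show the matching rule; it is an illustration only, and wildcard_to_regex and wildcard_match are hypothetical helpers.

```python
import re

def wildcard_to_regex(pattern):
    """Translate ? (exactly one character) and * (any number of characters) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def wildcard_match(pattern, term):
    """The pattern must cover the whole term, not just part of it."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

print(wildcard_match("r?d", "red"))          # True
print(wildcard_match("r?d", "reed"))         # False
print(wildcard_match("r*d", "reed"))         # True
print(wildcard_match("r?d", "a red phone"))  # False
```

The last call shows why the field should not be analyzed: the pattern is compared with each whole stored term, so on an unanalyzed field "a red phone" would need a pattern such as *r?d* to match.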

Reference article:

Elasticsearch basic concept - Nuggets

ElasticSearch - query statement details - Nuggets
