Understand Elasticsearch
The principle behind search is the inverted index (also called a reverse index): an index built from the keywords in a document's content, with each keyword pointing back to the document's title. Take classical poetry as an example. If you are asked to recall poems related to "moonlight", it is not easy, because the index in our heads is a forward index: first the title, dynasty, and author, and only then the content. Yet the line "Moonlight before my bed" in "Quiet Night Thoughts" contains "moonlight". We could index "moonlight" to the line "Moonlight before my bed", but then a single poem would need too many index entries. Likewise, "hometown" would point to "I bow my head and think of my hometown", which we may not remember either. Instead, we can map both "moonlight" and "hometown" to the title "Quiet Night Thoughts". Once we recall "Quiet Night Thoughts", the rest of the poem follows.
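The idea can be sketched in a few lines of Python. This is only a toy model, not how Lucene stores its index: `build_inverted_index` is a hypothetical helper, the second poem is made-up filler, and splitting on whitespace stands in for real word segmentation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every keyword to the set of titles whose content contains it."""
    index = defaultdict(set)
    for title, content in docs.items():
        for word in content.lower().split():
            index[word].add(title)
    return index

# Forward index: title -> content (illustrative data only).
poems = {
    "Quiet Night Thoughts": "moonlight before my bed I bow my head and think of my hometown",
    "Another Poem": "the river flows east",
}

index = build_inverted_index(poems)
print(sorted(index["moonlight"]))  # titles that mention "moonlight"
print(sorted(index["hometown"]))   # titles that mention "hometown"
```

Looking up a keyword is now a single dictionary access instead of a scan over every poem.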
Elasticsearch is a wrapper around Lucene that provides search functionality. It simplifies index creation and access through an API, and it implements a distributed search engine.
Install Elasticsearch.
Some concepts: an index is similar to a database, a type is similar to a table, and a document is like a record.
ES also uses a master-slave architecture, which provides data sharding and backup. Only creating indexes and types needs to go through the master. Data writes follow a simple routing rule and can be routed to any node in the cluster, so the write load is spread across the whole cluster.
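As a rough illustration of that routing rule: the Elasticsearch documentation gives it as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The sketch below assumes that formula. `pick_shard` is a hypothetical helper, and `zlib.crc32` only stands in for the hash function Elasticsearch actually uses, so the shard numbers printed here will not match a real cluster.

```python
import zlib

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Return the primary shard number for a routing value (the document ID by default)."""
    # crc32 is a stand-in hash chosen for this sketch, not the one Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

for doc_id in ["1", "2", "3", "4", "5"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 4))
```

Because the result depends on the number of primary shards, changing that number would send existing IDs to different shards.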
A typical application of ES is the ELK log analysis system, in which E is Elasticsearch, L is Logstash (a log collection system), and K is Kibana (a data visualization platform). For example, with 1000 machines, checking the logs machine by machine after a system failure is very tedious.
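After installing ES, you need to create an index, also known as creating a mapping.
Create mapping statement: index_name/_mapping/type_name. The request body, in JSON format, is as follows (a body that carries both settings and mappings, as this one does, is normally sent when the index itself is created with PUT index_name):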
{
  "settings": {
    "index": {
      "number_of_shards": "4",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "type_name": {
      "_ttl": {
        "enabled": false
      },
      "dynamic": false,
      "_all": {
        "enabled": false
      },
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Basic concepts
1. Cluster
A cluster contains multiple nodes that together provide services externally. The cluster a node belongs to is determined by the cluster name in its configuration file.
2. Node
Each node in the cluster also has a name. By default it is assigned randomly, but you can also set it yourself. Nodes in an ES cluster are managed and communicate by node name.
3. Index
An index is a collection of documents with the same structure, equivalent to a database in MySQL.
4. Type
An index can have one or more types. A type can be regarded as a logical partition of the index, equivalent to a table in MySQL.
5. Document
A document is a JSON-format string stored in ES. Every document has a document ID; if you do not specify one, the system generates it automatically. The index/type/id combination of a document must be unique. A document is equivalent to a row in MySQL.
6. Field
A document contains multiple fields, and each field has a field type. A field is similar to a column in MySQL.
7. Shard
Shards in ES are divided into primary shards and replica shards.
Primary shard: when a document is saved, it is stored in a primary shard first and then copied to the replica shards. By default an index has five primary shards. You can specify the number yourself, but once the index has been created the number of primary shards cannot be changed.
Replica shard: each primary shard has zero or more replicas, which are copies of the primary shard. Replica shards provide high availability: when a primary shard goes down, one of its replicas is chosen as the new primary. Replicas can also improve performance. For this reason, a primary shard is not deployed on the same node as its replicas.
8. Replica
Replication guards against single points of failure. It allows failover and keeps the system highly available.
9. Mapping
A mapping describes how data is stored in each field; defining one is the process of specifying how documents and their fields are stored and indexed. Each document in the index has a type, and each type has its own mapping. A mapping defines the data type of each field in the document structure.
Use GET /index/_mapping/type to obtain the mapping information of the corresponding index/type.
This is similar to viewing a table structure. For example:
GET /shop_product_es/_mapping/shop_product_es_type
return:
{
  "shop_product_es": {
    "mappings": {
      "shop_product_es_type": {
        "dynamic": "false",
        "_all": { "enabled": false },
        "properties": {
          "secondCategory": { "type": "integer" },
          "avgPrice": { "type": "double" }
        }
      }
    }
  }
}
Word segmentation correlation
To see how a piece of text is segmented, use the command index_name/type_name/_analyze and put the text to be segmented in the JSON parameters.
By default, Elasticsearch splits Chinese text into individual characters, so a search breaks the keywords apart before matching them against the index, and the query results are unsatisfactory.
Query statement
You can query either with URL parameters or with DSL statements.
URL parameter search
This method works like an ordinary GET request. The parameters are appended to the URL, and multiple parameters are separated by the & symbol. For example: GET index/type/_search?parameter
Query all
Command: GET index/type/_search
return:
{
  "took": 7,           // query time in milliseconds
  "timed_out": false,  // a timeout does not stop the query from executing; it only tells the coordinating node to return the results collected so far and close the connection
  "_shards": {
    "total": 5,        // number of shards queried; the index is split into five shards, so a search request hits all primary shards
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,        // total number of matching documents
    "max_score": 1,    // highest relevance score among the matches
    "hits": [          // the matching documents
      {
        "_index": "school",
        "_type": "student",
        "_id": "2",
        "_score": 1,
        "_source": { "name": "houyi", "age": 23, "class": 2, "gender": "male" }
      },
      {
        "_index": "school",
        "_type": "student",
        "_id": "1",
        "_score": 1,
        "_source": { "name": "Lv Bu", "age": 21, "class": 2, "gender": "male" }
      }
    ]
  }
}
Multi index, multi type search
Specify particular indexes and types in the URL to search across multiple indexes and types.
-
/_search searches all types in all indexes
-
/school/_search searches all types in the school index
-
/school,ad/_search searches all types in the school and ad indexes
-
/school/student/_search searches the student type in the school index
-
/s*,a*/_search searches all types in all indexes whose names start with s or a
-
/school,ad/student,phone/_search searches the student and phone types in the school and ad indexes
-
/_all/student,phone/_search searches the student and phone types in all indexes
Query by criteria
Command to query documents whose name is houyi: GET /school/student/_search?q=name:houyi
More query parameters are as follows:
parameter | description |
q | Query string, for example: q=syslog |
df | Default field to use when the query does not specify a field prefix |
analyzer | The analyzer used when parsing the query string |
lowercase_expanded_terms | Whether terms are automatically lowercased when searching. The default is true |
analyze_wildcard | Whether wildcard and prefix queries are analyzed. The default is false |
default_operator | The default operator joining multiple conditions, AND or OR. The default is OR |
lenient | If set to true, field type conversion failures are ignored. The default is false |
explain | Include an explanation of how the score was computed with each returned hit |
_source | Whether the _source is included; _source_include and _source_exclude are also supported |
fields | Return only the specified fields of each hit; separate multiple fields with commas |
sort | Sort by field name, such as fieldName:asc or fieldName:desc |
track_scores | Score tracking. When sorting, true means scoring information is still returned |
timeout | Timeout setting |
terminate_after | Maximum number of documents to collect per shard. If set, the response contains a terminated_early field |
from | Offset of the first hit to return from the matching results. The default is 0. Used as the paging start |
size | Number of hits to return. The default is 10 |
search_type | The type of search, either dfs_query_then_fetch or query_then_fetch. The default is query_then_fetch |
todo: supplementary example.
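As one possible illustration of combining these parameters, the sketch below assembles a search URL with Python's standard `urllib.parse.urlencode`. `search_url` is a hypothetical helper, and the school/student index and type are reused from the earlier examples. The parameters are passed as a dict because `from` is a reserved word in Python.

```python
from urllib.parse import urlencode

def search_url(index, doc_type, params=None):
    """Build the path and query string of a URL parameter search."""
    path = f"/{index}/{doc_type}/_search"
    return f"{path}?{urlencode(params)}" if params else path

# First page of 10 students named houyi.
print(search_url("school", "student", {"q": "name:houyi", "from": 0, "size": 10}))
```

`urlencode` percent-encodes reserved characters, so the colon in `name:houyi` is sent as `%3A`.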
Query DSL
https://juejin.im/post/5d2d300b6fb9a07ec56ea9bb
The search command is index_name/type_name/_search, with the search criteria supplied as JSON parameters. Note that index_name is the index name, not the cluster name.
GET /index_name/type_name/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "last_name": "smith" }
      }
    }
  }
}
The request parameter is a JSON string whose outermost node is query.
A compound query combines multiple conditions. For example, a bool statement can include must, must_not, should, and filter clauses:
GET /index/type/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "phone" } }
      ],
      "must_not": [
        { "match": { "color": "red" } }
      ],
      "should": [
        { "match": { "price": 5000 } }
      ],
      "filter": {
        "term": { "label": "phone" }
      }
    }
  }
}
must: the document must contain the query content.
must_not: the document must not contain the query content.
should: if the document matches, its relevance score is increased.
filter: the document must match, but the clause does not affect the relevance score.
In fact, we can use two kinds of structured statements: structured queries and structured filters.
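When such bodies are assembled in application code, a small builder keeps the nesting manageable. The following is a minimal sketch under that assumption: `bool_query` is a hypothetical helper, not part of any Elasticsearch client, and it only produces the JSON body without sending it anywhere.

```python
import json

def bool_query(must=None, must_not=None, should=None, filter_=None):
    """Build a bool query body, leaving out clauses that were not supplied."""
    clauses = {"must": must, "must_not": must_not, "should": should, "filter": filter_}
    return {"query": {"bool": {name: value for name, value in clauses.items() if value}}}

body = bool_query(
    must=[{"match": {"name": "phone"}}],
    must_not=[{"match": {"color": "red"}}],
    filter_={"term": {"label": "phone"}},
)
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the request body of a GET /index/type/_search call.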
Query the result of a blank field
This is similar to the SQL where condition: field_name is null
GET index/type/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "field_name" }
      }
    }
  }
}
Conversely, to query documents where the field is not empty:
GET index/type/_search
{
  "query": {
    "bool": {
      "must": {
        "exists": { "field": "field_name" }
      }
    }
  }
}
Paging query
To start from the first record and return 25 items per page:
In the URL: index/type/_search?from=0&size=25
It can also be written in the JSON query body, here with 12 items per page:
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "secondCategoryName" }
      }
    }
  },
  "from": 0,
  "size": 12
}
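The from value for a given page follows from simple arithmetic: from = (page - 1) * size. A minimal sketch, assuming pages are numbered from 1 (`page_params` is a hypothetical helper):

```python
def page_params(page: int, size: int = 25) -> dict:
    """Translate a 1-based page number into the from/size pair of a search request."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}

print(page_params(1))      # first page, 25 items
print(page_params(3, 12))  # third page, 12 items per page
```

The returned dict can be merged into a JSON query body or encoded into the URL.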
Grouping statistics
This is similar to the group by statement in SQL.
POST shop_product_es/shop_product_es_type/_search
{
  "aggs": {
    "group_by_firstcategory": {
      "terms": { "field": "firstCategory" },
      "aggs": {
        "secondCategory": {
          "terms": { "field": "secondCategory" },
          "aggs": {
            "thirdCategory": {
              "terms": { "field": "thirdCategory" }
            }
          }
        }
      }
    }
  }
}
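To make the nesting concrete, here is a toy in-memory equivalent of a two-level terms aggregation: documents are grouped by one field, then by a second, and each bucket holds a document count. The sample documents and `group_counts` are invented for this sketch; Elasticsearch computes its buckets per shard and merges them, which this does not model.

```python
from collections import Counter, defaultdict

def group_counts(docs, outer_field, inner_field):
    """Count documents per (outer, inner) bucket, like nested terms aggregations."""
    buckets = defaultdict(Counter)
    for doc in docs:
        buckets[doc[outer_field]][doc[inner_field]] += 1
    return buckets

products = [
    {"firstCategory": 1, "secondCategory": 10},
    {"firstCategory": 1, "secondCategory": 10},
    {"firstCategory": 1, "secondCategory": 11},
    {"firstCategory": 2, "secondCategory": 20},
]

for first, inner in group_counts(products, "firstCategory", "secondCategory").items():
    print(first, sum(inner.values()), dict(inner))
```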
Keywords in detail
-
match_all query
This query simply matches all documents.
GET /ad/phone/_search
{
  "query": {
    "match_all": {}
  }
}
-
match query
The match query supports both full-text search and exact queries, depending on whether the field supports full-text retrieval.
Full-text search:
GET /ad/phone/_search
{
  "query": {
    "match": { "ad": "a red" }
  }
}
Full-text retrieval first segments the query string: a red is split into a and red, which are then matched against the inverted index, so this statement finds all three documents.
Exact query:
GET /ad/phone/_search
{
  "query": {
    "match": { "price": "6000" }
  }
}
For exact-value queries, you can use a filter statement instead of a query, because filters are cached.
operator:
The match query also accepts an operator parameter. By default the operator is or. We can change it to and so that all specified terms must match.
GET /ad/phone/_search
{
  "query": {
    "match": {
      "ad": {
        "query": "a red",
        "operator": "and"
      }
    }
  }
}
Controlling precision:
The match query supports the minimum_should_match parameter, which specifies how many terms must match for a document to be considered relevant. We can set it to a specific number (the number of terms that must match in the inverted index). More commonly we set it to a percentage, because we cannot control how many words users enter when searching.
GET /ad/phone/_search
{
  "query": {
    "match": {
      "ad": {
        "query": "a red",
        "minimum_should_match": "2"
      }
    }
  }
}
Only documents that match both terms, a and red, are returned.
If minimum_should_match is 1, a document is returned as long as it matches one of the terms.
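The effect of minimum_should_match can be mimicked with a toy matcher. This is a sketch under simplifying assumptions, not Elasticsearch's implementation: lowercasing and splitting on whitespace stand in for the analyzer, only the integer form of the parameter is handled, and `matches` is a hypothetical helper.

```python
def matches(doc_text: str, query: str, minimum_should_match: int = 1) -> bool:
    """Return True if at least `minimum_should_match` query terms occur in the document."""
    doc_terms = set(doc_text.lower().split())
    query_terms = query.lower().split()
    hits = sum(1 for term in query_terms if term in doc_terms)
    return hits >= minimum_should_match

print(matches("a red phone", "a red", 2))
print(matches("a blue phone", "a red", 2))
print(matches("a blue phone", "a red", 1))
```

With a value of 2 the blue phone is rejected because only the term a matches; with a value of 1 it is accepted.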
-
multi_match query
Multi-field query, for example querying documents that contain the word red in the color or ad field.
GET /ad/phone/_search
{
  "query": {
    "multi_match": {
      "query": "red",
      "fields": ["color", "ad"]
    }
  }
}
-
range query
Range query. Operators: gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to).
Query documents whose price is greater than 4000 and less than 6000:
GET /ad/phone/_search
{
  "query": {
    "range": {
      "price": {
        "gt": 4000,
        "lt": 6000
      }
    }
  }
}
-
term query
Exact value query: query the documents whose price field equals 6000.
GET /ad/phone/_search
{
  "query": {
    "term": {
      "price": { "value": "6000" }
    }
  }
}
Query the documents whose name field equals iphone 8:
GET /ad/phone/_search
{
  "query": {
    "term": {
      "name": { "value": "iphone 8" }
    }
  }
}
As a result, no matching documents are found.
The reason is that a term query looks up the exact term in the inverted index. It does not run the analyzer; it only matches against the inverted index. The name field is of type text, so iphone 8 was split into iphone and 8 at index time. When we use term to query iphone 8, there is no term iphone 8 in the inverted index, so no document matches.
The difference between term and match queries
A term query does no word segmentation and matches the inverted index directly.
A match query segments the query string first. When querying iphone 8, the string is split into iphone and 8, and each is matched against the inverted index. The results therefore include both the iphone 8 and xiaomi 8 documents.
Another thing to note: because term queries skip the analyzer and match the inverted index directly, their results depend on how the analyzer split the text at index time. For example, add a new document of type /ad/phone whose name field is Oppo. If you then use term to query Oppo, the document is not found. This is because ES uses the standard analyzer by default, which lowercases terms after segmentation. So the document cannot be found with Oppo, but it can be found with lowercase oppo.
GET /ad/phone/_search
{
  "query": {
    "term": {
      "name": {
        "value": "Oppo"  // change to oppo to find the newly added document
      }
    }
  }
}
The results of term and similar queries all depend on the chosen analyzer. Understanding how it segments text helps when writing query statements.
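The behaviour described above can be reproduced with a toy inverted index. This is a sketch, not Elasticsearch code: `analyze` approximates the standard analyzer with lowercasing and whitespace splitting, and the three sample documents mirror the names used in this section.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase, then split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Index every analyzed term of the name field to the IDs that contain it."""
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for term in analyze(name):
            index[term].add(doc_id)
    return index

def term_query(index, value):
    """Look the value up as-is, without running the analyzer."""
    return set(index.get(value, set()))

def match_query(index, text):
    """Analyze the text first, then return documents matching any resulting term."""
    result = set()
    for term in analyze(text):
        result |= index.get(term, set())
    return result

index = build_index({"1": "iphone 8", "2": "xiaomi 8", "3": "Oppo"})
print(term_query(index, "iphone 8"))           # nothing: no such single term exists
print(sorted(match_query(index, "iphone 8")))  # both documents containing "8"
print(term_query(index, "Oppo"), term_query(index, "oppo"))
```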
-
terms query
The terms query is like the term query, but lets you specify multiple values to match. If the field contains any of the specified values, the document qualifies.
GET /ad/phone/_search
{
  "query": {
    "terms": {
      "ad": ["red", "blue"]
    }
  }
}
-
exists query and missing query
Used to find documents in which the specified field has a value (exists) or has no value (missing).
The name field has a value:
GET /ad/phone/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": { "field": "name" }
      }
    }
  }
}
The name field has no value:
GET /ad/phone/_search
{
  "query": {
    "bool": {
      "filter": {
        "missing": { "field": "name" }
      }
    }
  }
}
-
match_phrase query
Phrase query. The query a red matches documents whose ad field contains the phrase a red, with the terms adjacent and in the same order. The query string is still analyzed, but documents containing "a other-word red" are not returned.
GET /ad/phone/_search
{
  "query": {
    "match_phrase": { "ad": "a red" }
  }
}
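The adjacency requirement is what separates a phrase query from a plain match. A toy check, assuming lowercasing and whitespace splitting in place of the real analyzer and ignoring options such as slop (`phrase_match` is a hypothetical helper):

```python
def phrase_match(doc_text: str, phrase: str) -> bool:
    """Return True if the phrase terms occur consecutively and in order in the document."""
    doc_terms = doc_text.lower().split()
    phrase_terms = phrase.lower().split()
    n = len(phrase_terms)
    # Slide a window of n terms over the document and compare it with the phrase.
    return any(doc_terms[i:i + n] == phrase_terms for i in range(len(doc_terms) - n + 1))

print(phrase_match("a red phone", "a red"))
print(phrase_match("a big red phone", "a red"))
```

The second document contains both terms, so a match query would return it, but the phrase check fails because another word sits between them.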
-
scroll query
The scroll query is similar to a paging query, but it does not support jumping to an arbitrary page; results can only be read page by page. Scroll queries are not meant for real-time user requests but for processing large amounts of data.
POST /ad/phone/_search?scroll=1m
{
  "query": {
    "match_all": {}
  },
  "size": 1,
  "from": 0
}
The return value contains a _scroll_id:
"_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAQFlV6T3VqY2NaVDBLRG5uZXdiZ0hFYUEAAAAAAAAAERZVek91amNjWlQwS0RubmV3YmdIRWFBAAAAAAAAABIWVXpPdWpjY1pUMEtEbm5ld2JnSEVhQQAAAAAAAAATFlV6T3VqY2NaVDBLRG5uZXdiZ0hFYUEAAAAAAAAAFBZVek91amNjWlQwS0RubmV3YmdIRWFB"
Use this _scroll_id in the next query to fetch the next page of documents:
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAYFlV6T3VqY2NaVDBLRG5uZXdiZ0hFYUEAAAAAAAAAGRZVek91amNjWlQwS0RubmV3YmdIRWFBAAAAAAAAABYWVXpPdWpjY1pUMEtEbm5ld2JnSEVhQQAAAAAAAAAXFlV6T3VqY2NaVDBLRG5uZXdiZ0hFYUEAAAAAAAAAFRZVek91amNjWlQwS0RubmV3YmdIRWFB"
}
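The page-by-page pattern can be sketched as a loop that keeps passing the latest _scroll_id back until a page comes back empty. The sketch is deliberately independent of any HTTP client: `scroll_all` takes two caller-supplied functions, and `fake_start` and `fake_next` are an in-memory stand-in for the two requests above, invented so the example runs on its own.

```python
def scroll_all(start_scroll, next_page):
    """Yield every hit, page by page, until a page comes back empty."""
    response = start_scroll()
    while response["hits"]["hits"]:
        yield from response["hits"]["hits"]
        response = next_page(response["_scroll_id"])

# In-memory stand-in for a cluster holding two documents, one per page.
pages = [[{"_id": "1"}], [{"_id": "2"}], []]

def fake_start():
    return {"_scroll_id": "s0", "hits": {"hits": pages[0]}}

def fake_next(scroll_id):
    page_number = int(scroll_id[1:]) + 1
    return {"_scroll_id": f"s{page_number}", "hits": {"hits": pages[page_number]}}

ids = [hit["_id"] for hit in scroll_all(fake_start, fake_next)]
print(ids)
```

In real use, `start_scroll` would send the initial search with the scroll parameter and `next_page` would send the POST /_search/scroll request.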
-
multi_get query
Multiple documents can be fetched by index, type (optional), and id (and optionally routing). If fetching one of the documents fails, the error information is included in the response.
GET /ad/phone/_mget
{
  "ids": ["1", "8"]
}
-
bulk batch operation
The bulk API can create, index, update, and delete multiple documents in a single API call, which can greatly improve indexing speed.
The bulk request body has the following format:
{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n
...
action must be one of the following:
action | description |
create | Create the document only if it does not already exist |
index | Create a new document or replace an existing one |
update | Partially update a document |
delete | Delete a document |
When indexing, creating, updating, or deleting a document, you must specify its _index, _type, and _id metadata.
For example:
PUT _bulk
{ "create" : { "_index" : "ad", "_type" : "phone", "_id" : "6" }}
{ "doc" : {"name" : "bulk"}}
{ "index" : { "_index" : "ad", "_type" : "phone", "_id" : "6" }}
{ "doc" : {"name" : "bulk"}}
{ "delete":{ "_index" : "ad", "_type" : "phone", "_id" : "1"}}
{ "update":{ "_index" : "ad", "_type" : "phone", "_id" : "3"}}
{ "doc" : {"name" : "huawei p20"}}
return:
{
  "took": 137,
  "errors": true,  // true if any of the operations failed
  "items": [       // results of each operation, in the same order as the request
    {              // create failed: the document already exists
      "create": {
        "_index": "ad",
        "_type": "phone",
        "_id": "6",
        "status": 409,
        "error": {
          "type": "version_conflict_engine_exception",
          "reason": "[phone][6]: version conflict, document already exists (current version [2])",
          "index_uuid": "9F5FHqgISYOra_P09HReVQ",
          "shard": "2",
          "index": "ad"
        }
      }
    },
    {              // index: the document already exists and is overwritten
      "index": {
        "_index": "ad",
        "_type": "phone",
        "_id": "6",
        "_version": 3,
        "result": "updated",
        "_shards": { "total": 2, "successful": 1, "failed": 0 },
        "_seq_no": 6,
        "_primary_term": 5,
        "status": 200
      }
    },
    {              // delete
      "delete": {
        "_index": "ad",
        "_type": "phone",
        "_id": "1",
        "_version": 1,
        "result": "not_found",
        "_shards": { "total": 2, "successful": 1, "failed": 0 },
        "_seq_no": 4,
        "_primary_term": 5,
        "status": 404
      }
    },
    {              // update
      "update": {
        "_index": "ad",
        "_type": "phone",
        "_id": "3",
        "_version": 3,
        "result": "noop",
        "_shards": { "total": 2, "successful": 1, "failed": 0 },
        "status": 200
      }
    }
  ]
}
A bulk request is not atomic and cannot be used to implement transactions. Each operation is executed separately, so the success or failure of one does not affect the others.
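Since the body is newline-delimited JSON rather than a single JSON document, it is usually assembled line by line. A minimal sketch with the standard `json` module: `bulk_body` is a hypothetical helper, each operation is given as an (action, metadata, source) tuple, and delete passes None because it has no request body line. The body must end with a newline.

```python
import json

def bulk_body(operations):
    """Serialize (action, metadata, source) tuples into a bulk request body."""
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # delete has no request body line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the final newline is required

operations = [
    ("index", {"_index": "ad", "_type": "phone", "_id": "6"}, {"name": "bulk"}),
    ("delete", {"_index": "ad", "_type": "phone", "_id": "1"}, None),
    ("update", {"_index": "ad", "_type": "phone", "_id": "3"}, {"doc": {"name": "huawei p20"}}),
]
body = bulk_body(operations)
print(body, end="")
```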
-
fuzzy query
The fuzzy query computes the spelling similarity between indexed terms and the keyword.
GET /ad/phone/_search
{
  "query": {
    "fuzzy": {
      "color": {
        "value": "res",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}
Parameters:
fuzziness: the maximum edit distance, AUTO by default;
prefix_length: the number of leading characters that will not be "fuzzified". This helps reduce the number of terms that must be examined. The default is 0;
max_expansions: the maximum number of terms the fuzzy query will expand to. The default is 50; a smaller value helps optimize the query;
transpositions: whether fuzzy transpositions (ab -> ba) are supported. The default is false.
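To show what edit distance means for the res example, the sketch below computes plain Levenshtein distance (insertions, deletions, substitutions; no transpositions) and applies the fuzziness and prefix_length rules on top. `edit_distance` and `fuzzy_match` are hypothetical helpers written for illustration; Lucene uses a far more efficient automaton-based approach.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # delete char_a
                           cur[j - 1] + 1,                     # insert char_b
                           prev[j - 1] + (char_a != char_b)))  # substitute
        prev = cur
    return prev[-1]

def fuzzy_match(term: str, value: str, fuzziness: int = 2, prefix_length: int = 0) -> bool:
    """True if term shares the required prefix with value and is within the edit distance."""
    if term[:prefix_length] != value[:prefix_length]:
        return False
    return edit_distance(term, value) <= fuzziness

print(edit_distance("res", "red"))
print(fuzzy_match("red", "res", fuzziness=2, prefix_length=1))
print(fuzzy_match("bed", "res", fuzziness=2, prefix_length=1))
```

Here bed is within two edits of res but is rejected because its first character differs, which is the effect of prefix_length: 1.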
-
wildcard query, fuzzy search
A wildcard query gives fuzzy search without word segmentation. Searching for "Xiaomi", for example, only returns records whose value contains "Xiaomi" as a continuous string. The precondition is that the searched field is not analyzed (in ES2, set "type": "string" and "index": "not_analyzed"; in ES6, use the keyword type).
In a wildcard query, ? matches a single character and * matches any sequence of characters, including none.
To avoid extremely slow wildcard queries, a wildcard term should not begin with * or ?.
GET /es_index/es_index_type/_search
{
  "query": {
    "wildcard": {
      "color": "r?d"
    }
  }
}
color is a field name. The query matches values that have exactly one character between r and d.
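The two wildcard characters behave much like shell globbing, so Python's standard `fnmatch.fnmatchcase` can serve as a quick way to try patterns out. This is an analogy rather than the Elasticsearch matcher: `fnmatch` additionally treats `[...]` as a character class, which the sketch assumes the pattern does not use.

```python
from fnmatch import fnmatchcase

colors = ["red", "rod", "reed", "rd", "blue"]

# ? stands for exactly one character.
print([c for c in colors if fnmatchcase(c, "r?d")])

# * stands for any run of characters, including none.
print([c for c in colors if fnmatchcase(c, "r*d")])
```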