Still using MySQL for full-text search? Try Elasticsearch!
Note: this article is based on Elasticsearch 7.x, which differs from older versions.
1. Format description
Elasticsearch exposes its data API over HTTP. The basic URL format is as follows:
http://localhost:9200/{index}/{type}/{id}
- Index: the index name, roughly comparable to a table in a relational database
- Type: the type name. Note that since 7.x types are deprecated and "_doc" is used by default; 8.x no longer supports specifying a type in the request
- id: the document id. It may be omitted, in which case Elasticsearch generates one automatically
- Document: the JSON serialization of an object
- Metadata: the envelope Elasticsearch wraps around a document, generally as shown below. The value under "_source" is the document we stored
{
  "_index": "website",
  "_type": "_doc",
  "_id": "123",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "My first blog entry",
    "text": "Just trying this out...",
    "date": "2014/01/01"
  }
}
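The URL format described above can be sketched as a small helper. This is only an illustration: the class name `EsUrls` and method `docUrl` are hypothetical, and the base address is the local node assumed throughout this article.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class EsUrls {
    // Assumed local node address, taken from the examples in this article.
    static final String BASE = "http://localhost:9200";

    // Builds {index}/_doc/{id}. A null or empty id is left off the URL,
    // so Elasticsearch generates one when the document is inserted.
    static String docUrl(String index, String id) {
        String url = BASE + "/" + index + "/_doc";
        return (id == null || id.isEmpty()) ? url : url + "/" + id;
    }

    public static void main(String[] args) {
        System.out.println(docUrl("demo", "1"));
        System.out.println(docUrl("demo", null));
    }
}
```

Keeping the type fixed at "_doc" matches the 7.x behaviour described in the list above.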
Request format description (to make the examples easier to read, "#" and "//" mark comments):
POST http://localhost:9200/demo/_doc    # request method and request path
# The request body follows; the content type is "application/json"
{"name": "Xiaohong", "age": 25}
The same request expressed with OkHttp:
import java.io.IOException;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

private OkHttpClient okHttpClient = new OkHttpClient();

public String post() {
    String url = "http://localhost:9200/demo/_doc";
    String json = "{\"name\": \"Xiaohong\",\"age\": 25}";
    MediaType JSON = MediaType.parse("application/json");
    RequestBody requestBody = RequestBody.create(JSON, json);
    Request request = new Request.Builder().url(url).post(requestBody).build();
    // try-with-resources closes the response once the body has been read
    try (Response response = okHttpClient.newCall(request).execute()) {
        // Elasticsearch returns 201 when a document is created
        if (response.code() == 200 || response.code() == 201) {
            return response.body().string();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return "";
}
2. Basic data operation
Insert data (POST)
http://localhost:9200/demo/_doc/1    # Specify the id as 1
http://localhost:9200/demo/_doc      # Auto-generate the id
{ "name": "Xiaohong", "age": 25 }
Query (GET)
http://localhost:9200/demo/_doc/1?pretty                  # Returns the document with its metadata
http://localhost:9200/demo/_doc/1/_source                 # Returns the document only
http://localhost:9200/demo/_doc/1?_source=name            # Returns the specified field, with metadata
http://localhost:9200/demo/_doc/1/_source?_source=name    # Returns only the specified field of the document
http://localhost:9200/demo/_doc/_count                    # Number of documents of this type
http://localhost:9200/demo/_doc/_search?size=1&from=1     # Paged query: "from" starts at 0, "size" defaults to 10
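Because the paging offset is zero-based, converting a page number into the "from" parameter is a common source of off-by-one errors. A minimal sketch, assuming 1-based page numbers (the `Paging` class and its methods are hypothetical, not part of any Elasticsearch client):

```java
// Hypothetical paging helper for the size/from parameters shown above.
final class Paging {
    // Converts a 1-based page number into the zero-based "from" offset.
    static int fromOffset(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be >= 1");
        }
        return (page - 1) * size;
    }

    // Builds the paged search URL in the same form as the example above.
    static String searchUrl(String index, int page, int size) {
        return "http://localhost:9200/" + index + "/_doc/_search?size=" + size
                + "&from=" + fromOffset(page, size);
    }

    public static void main(String[] args) {
        // Second page with one hit per page
        System.out.println(searchUrl("demo", 2, 1));
    }
}
```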
Update (POST)
http://localhost:9200/demo/_doc/1            # Overwrite update; specify the id to update
{ "name": "Xiao Ming", "age": 18 }

http://localhost:9200/demo/_doc/1/_update    # Partial update; specify the id to update
{ "doc": { "name": "Xiaohong" } }

# A script can be used in the update API to change fields of the source, which the script sees as ctx._source.
http://localhost:9200/demo/_doc/1/_update    # Increment the age field; specify the id to update
{ "script": "ctx._source.age+=1" }
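To make the difference between the partial update and the scripted update concrete, here is a local simulation on plain Java maps. It does not talk to Elasticsearch; the class `UpdateSemantics` is hypothetical, and the merge is shallow, which is a simplification of what the server does with nested objects.

```java
import java.util.HashMap;
import java.util.Map;

// Local simulation only; no request is sent to Elasticsearch.
final class UpdateSemantics {
    // Partial update: fields in "doc" overwrite or extend the stored source,
    // and fields not mentioned in "doc" are kept (shallow merge, a simplification).
    static Map<String, Object> partialUpdate(Map<String, Object> source, Map<String, Object> doc) {
        Map<String, Object> merged = new HashMap<>(source);
        merged.putAll(doc);
        return merged;
    }

    // Local equivalent of the script "ctx._source.age+=1".
    // Assumes "age" is present and stored as an Integer.
    static Map<String, Object> incrementAge(Map<String, Object> source) {
        Map<String, Object> updated = new HashMap<>(source);
        int age = (Integer) source.get("age");
        updated.put("age", age + 1);
        return updated;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new HashMap<>();
        source.put("name", "Xiao Ming");
        source.put("age", 18);
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", "Xiaohong");
        System.out.println(partialUpdate(source, doc));
        System.out.println(incrementAge(source));
    }
}
```

By contrast, the overwrite update replaces the whole stored document with the request body.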
Delete (DELETE)
http://localhost:9200/demo/_doc/1    # Delete the document with the given id
http://localhost:9200/demo           # Delete the given index
Batch processing
- The index and type can be declared in the JSON or directly in the URL, as shown below ("_id" may be omitted, in which case ES creates it automatically)
POST http://localhost:9200/demo/_doc/_bulk    # Every line, including the last, must end with "\n" so that Elasticsearch can read the data
{ "index": { "_id": "3" }}
{ "name": "Xiao Ming", "age": 29, "like": "Basketball" }
{ "index": { "_id": "4" }}
{ "name": "Xiaohong", "age": 30, "like": "Volleyball" }
For more, see "Basic operation and requirements of batch processing"
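The newline requirement is easy to get wrong when the bulk body is assembled by hand. A minimal sketch of a body builder follows; `BulkBody` is a hypothetical helper, and each source document is assumed to be a single-line JSON string.

```java
// Hypothetical helper that assembles a bulk request body.
final class BulkBody {
    // Each entry is {id, sourceJson}. Every line, including the last,
    // is terminated with "\n" as the bulk endpoint requires.
    static String build(String[][] idAndSource) {
        StringBuilder sb = new StringBuilder();
        for (String[] pair : idAndSource) {
            sb.append("{\"index\":{\"_id\":\"").append(pair[0]).append("\"}}\n");
            sb.append(pair[1]).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String body = build(new String[][] {
            {"3", "{\"name\":\"Xiao Ming\",\"age\":29,\"like\":\"Basketball\"}"},
            {"4", "{\"name\":\"Xiaohong\",\"age\":30,\"like\":\"Volleyball\"}"}
        });
        System.out.print(body);
    }
}
```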
3. bool query
If you need SQL-like operations such as AND, OR, and =, the query endpoints above are not enough. Elasticsearch provides the bool query, in the following format:
POST http://localhost:9200/demo/_doc/_search    # demo is the index name; the rest is the default endpoint
{
  "query": {
    "bool": {
      "must":     { "match": { "name": "Xiao Ming" }},      # must: the document has to match
      "must_not": { "match": { "name": "Xiaohong" }},       # must_not: the document must not match
      "should":   { "match": { "like": "Basketball" }},     # should: optional; a match raises the score of the hit
      "filter":   { "range": { "age": { "gt": 18 }}}        # filter: restricts results, here to a range of values, without scoring
    }
  }
}
The range operators mean the following:
- gt: > greater than
- lt: < less than
- gte: >= greater than or equal to
- lte: <= less than or equal to
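The four operators can be read as a plain predicate. The sketch below mirrors their meaning locally for integer values; `RangeFilter` is a hypothetical class, and a `null` bound stands for an operator that was not specified.

```java
// Local illustration of the range operators; not an Elasticsearch API.
final class RangeFilter {
    // Each bound is optional: null means the operator was not given.
    static boolean matches(int value, Integer gt, Integer gte, Integer lt, Integer lte) {
        if (gt != null && !(value > gt)) return false;
        if (gte != null && !(value >= gte)) return false;
        if (lt != null && !(value < lt)) return false;
        if (lte != null && !(value <= lte)) return false;
        return true;
    }

    public static void main(String[] args) {
        // { "range": { "age": { "gt": 18 } } } applied to age 18 and age 19
        System.out.println(matches(18, 18, null, null, null));
        System.out.println(matches(19, 18, null, null, null));
    }
}
```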
What if there are several must conditions? In that case the value of "must" can be turned into an array. Likewise, "must_not", "should", and "filter" also accept multiple conditions
POST http://localhost:9200/demo/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "Xiao Ming" }},
        { "match": { "age": 29 }}
      ],
      "filter": { "range": { "age": { "gt": 18 }}}
    }
  }
}
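When the number of conditions is only known at run time, the single-object and array forms can be produced by the same code. A minimal sketch using plain string concatenation (the `BoolQuery` class is hypothetical; a real project would more likely use a JSON library):

```java
import java.util.List;

// Hypothetical builder for the request body of a bool query.
final class BoolQuery {
    // One clause is emitted as-is; several clauses become a JSON array.
    static String clauses(List<String> items) {
        return items.size() == 1 ? items.get(0) : "[" + String.join(",", items) + "]";
    }

    // jsonValue must already be valid JSON, e.g. "\"Xiao Ming\"" or "29".
    static String match(String field, String jsonValue) {
        return "{\"match\":{\"" + field + "\":" + jsonValue + "}}";
    }

    static String build(List<String> must, List<String> filter) {
        return "{\"query\":{\"bool\":{\"must\":" + clauses(must)
                + ",\"filter\":" + clauses(filter) + "}}}";
    }

    public static void main(String[] args) {
        System.out.println(build(
            List.of(match("name", "\"Xiao Ming\""), match("age", "29")),
            List.of("{\"range\":{\"age\":{\"gt\":18}}}")));
    }
}
```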
bool queries can be nested. For example, to find documents whose age is 26 and whose "like" is basketball:
POST http://localhost:9200/demo/_doc/_search
{
  "query": {
    "bool": {
      "must": { "match": { "like": { "query": " Basketball" }}},
      "filter": {
        "bool": {
          "must": { "match": { "age": 26 }}
        }
      }
    }
  }
}
If a bool query contains only a filter, you can use "constant_score" instead. It skips scoring and can speed up the query
POST http://localhost:9200/demo/_doc/_search
# A "constant_score" query replaces a bool query that has only a filter clause. This one filters documents whose "age" is 20
{
  "query": {
    "constant_score": {
      "filter": { "term": { "age": 20 } }
    }
  }
}
# To look up several exact values, for example documents whose age is 20 or 30, use "terms" with an array
{
  "query": {
    "constant_score": {
      "filter": { "terms": { "age": [20, 30] } }
    }
  }
}
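Note that a single exact value goes in a "term" clause, while several values need a "terms" clause with an array. A small sketch that picks the right clause from the number of values (the `ExactFilter` class is hypothetical and handles integer values only):

```java
// Hypothetical builder for a constant_score query with an exact-value filter.
final class ExactFilter {
    // One value produces "term"; two or more produce "terms" with an array.
    static String constantScore(String field, int... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        String clause;
        if (values.length == 1) {
            clause = "{\"term\":{\"" + field + "\":" + values[0] + "}}";
        } else {
            StringBuilder list = new StringBuilder();
            for (int i = 0; i < values.length; i++) {
                if (i > 0) list.append(',');
                list.append(values[i]);
            }
            clause = "{\"terms\":{\"" + field + "\":[" + list + "]}}";
        }
        return "{\"query\":{\"constant_score\":{\"filter\":" + clause + "}}}";
    }

    public static void main(String[] args) {
        System.out.println(constantScore("age", 20));
        System.out.println(constantScore("age", 20, 30));
    }
}
```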
However, an exact lookup on a text field, for example matching only documents whose "like" is exactly "Basketball", will not work this way. For details, see "Exact value lookup"
- Filters: filters are fast, do not compute the relevance of results, and are easy to cache. Use filters wherever possible
- term: on an analyzed text field it behaves as "contains" rather than "equals". To query a text field exactly, the field must not be analyzed ("not_analyzed" in older versions; in 7.x this means mapping it as the "keyword" type). For details, see "Exact query text"
- match_phrase: solves the problem that an analyzed field cannot be searched exactly for a phrase (such as "quick brown fox"); see "phrase match"
- slop: this parameter tells match_phrase how far apart the query terms may be while the document still counts as a match; see "Mixing It Up"
4. Full text index
At this point some readers will ask: isn't this just moving the features of SQL into Elasticsearch? Then why not simply use SQL? So far this has only been a brief introduction to the syntax. Don't worry, the main point comes next.
4.1 basic index format
Here is the simplest full-text index:
GET http://localhost:9200/demo/_doc/_search?q=name:Xiaohong    # Pass the "q" parameter directly

POST http://localhost:9200/demo/_doc/_search
# "and" means every token must match; "or" means any one token is enough
{
  "query": {
    "match": {
      "like": {                                   # The field to search
        "query": "Volleyball and basketball ",    # The text to search for. How it is split into tokens depends on the analyzer, covered later; assume it becomes "Volleyball" and "Basketball"
        "operator": "and"                         # Every token of the query must be present
      }
    }
  }
}

# To require only a minimum share of tokens to match, pass "minimum_should_match": "$percent%"
{
  "query": {
    "match": {
      "like": {
        "query": "Volleyball, basketball, badminton, table tennis",
        "minimum_should_match": "75%"             // At least 3/4 of the tokens must match
      }
    }
  }
}
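To see what a percentage means for a given query, the required number of matching tokens can be worked out by hand. The sketch below assumes a positive percentage and that the computed number is rounded down, which is how the Elasticsearch documentation describes percentage values; `MinShouldMatch` is a hypothetical class.

```java
// Hypothetical calculator for a positive "minimum_should_match" percentage.
final class MinShouldMatch {
    // Integer division rounds down, matching the assumed rounding rule.
    static int required(int tokenCount, int percent) {
        if (tokenCount < 0 || percent < 0 || percent > 100) {
            throw new IllegalArgumentException("expected tokenCount >= 0 and 0 <= percent <= 100");
        }
        return tokenCount * percent / 100;
    }

    public static void main(String[] args) {
        // Four tokens at 75%, as in the example above
        System.out.println(required(4, 75));
    }
}
```

With four tokens and 75% this gives 3, which agrees with the "3/4" comment in the request above.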
You can also query with bool, as follows: the content field must contain all three words full, text, and search. If the content field also contains Elasticsearch or Lucene, the document gets a higher score.
For details, see "Boosting Query Clauses"
POST http://localhost:9200/demo/_doc/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": { "query": "full text search", "operator": "and" }
        }
      },
      "should": [
        { "match": { "content": "Elasticsearch" }},
        { "match": { "content": "Lucene" }}
      ]
    }
  }
}
4.2 analyzer
The analyzer was mentioned above and deserves a few words here, because full-text search results depend heavily on it; without understanding it, the results will differ from what you expect (the author fell into quite a few pits here). So what is word segmentation? Let's look at a demo
POST http://localhost:9200/demo/_analyze
{
  "text": "full text search",    # The phrase to analyze
  "analyzer": "standard"
}
"standard" names the analyzer used to analyze the phrase. The result is shown below; each token is one piece produced by the segmentation
{
  "tokens": [
    { "token": "full",   "start_offset": 0,  "end_offset": 4,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "text",   "start_offset": 5,  "end_offset": 9,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "search", "start_offset": 10, "end_offset": 16, "type": "<ALPHANUM>", "position": 2 }
  ]
}
Clearly, the "standard" analyzer splits "full text search" into three words. So how does Elasticsearch perform full-text search? Taking the same phrase, lay the tokens out in a table
Query ╲ store | full | text | search |
---|---|---|---|
full | √ | × | × |
text | × | √ | × |
As long as a token of the query also appears among the tokens of the indexed content, the search counts as a hit, so the analyzer matters a great deal for search results. For how to configure an analyzer, see "Configure analyzer"
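The matching rule in the table can be imitated locally. The sketch below is only a rough stand-in for the "standard" analyzer on English text (lowercase, split on anything that is not a letter or digit) and ignores scoring entirely; the `TokenMatch` class is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Local imitation of token matching; it does not use Elasticsearch or Lucene.
final class TokenMatch {
    // Rough approximation of "standard" for ASCII text: lowercase, then split
    // on every run of characters that is not a letter or a digit.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // requireAll = true behaves like "operator": "and"; false behaves like "or".
    static boolean matches(String query, String stored, boolean requireAll) {
        Set<String> q = tokens(query);
        Set<String> s = tokens(stored);
        if (q.isEmpty()) return false;
        if (requireAll) return s.containsAll(q);
        for (String t : q) {
            if (s.contains(t)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("full", "full text search", false));
        System.out.println(matches("full index", "full text search", true));
    }
}
```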
What happens if the text is Chinese and we use the standard analyzer? Let's try it. Chinese is more complicated: a word usually consists of several characters written together, whereas English words are separated by spaces. For example, "telephone" in Chinese is "电话". Segmenting it with "standard" gives the following result.
{
  "tokens": [
    { "token": "电", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "话", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 }
  ]
}
It separates every word. If I want to search "xx mobile XX", the result may contain "XX hand XXX machine XXX", which is not connected together. It's very sad! It doesn't matter. elasticSearch supports the installation of the analyzer plug-in. Here is the "Chinese word breaker" ik_smart
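The problem can be reproduced locally with a one-token-per-character split, which is roughly what "standard" produces for Chinese text. This is a simulation under that assumption, using the word 手机 (mobile phone) as the query; the `CharTokens` class is hypothetical and the strings are written as Unicode escapes to avoid source-encoding issues.

```java
import java.util.ArrayList;
import java.util.List;

// Local simulation of per-character tokens; not the real "standard" analyzer.
final class CharTokens {
    // Emits one token per code point.
    static List<String> perCharacter(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // True when every character token of the query occurs somewhere in the stored text.
    static boolean containsAll(String query, String stored) {
        return perCharacter(stored).containsAll(perCharacter(query));
    }

    public static void main(String[] args) {
        String query = "\u624B\u673A";                 // 手机 (mobile phone)
        String stored = "\u624B\u8868\u548C\u673A\u5668"; // 手表和机器 (watches and machines)
        // Token-wise the stored text matches, although the word itself never occurs in it.
        System.out.println(containsAll(query, stored));
        System.out.println(stored.contains(query));
    }
}
```

A dictionary-based analyzer avoids this by keeping the two characters together as one token.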
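4.3 IK Chinese analyzer
From the Elasticsearch installation directory, install the plugin as follows. The plugin version must match your Elasticsearch version (the README example below uses v6.3.0). For details, see the README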
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
Or install from a local file (on Windows). The file should be the downloaded IK plugin package that matches your Elasticsearch version
elasticsearch-7.5.1\bin\elasticsearch-plugin install file:///D:/elasticsearch/elasticsearch-7.5.1-windows-x86_64.zip
After installation, set the mapping for the fields that need to be indexed. If the index was created earlier, it must be deleted first; see the delete section above
PUT http://localhost:9200/demo
{
  "mappings": {
    "dynamic": true,                        # Whether unmapped fields may be created dynamically
    "properties": {                         # Field properties (fixed key)
      "like": {                             # The field to configure
        "type": "text",                     # The field type
        "analyzer": "ik_smart",             # The analyzer used when indexing
        "search_analyzer": "ik_smart"       # The analyzer used when searching
      }
    }
  }
}
We can view the field types and analyzers in the index mapping as follows, to make sure the setting took effect
GET http://localhost:9200/demo/_mapping    # View the field mapping of the demo index
Try word segmentation again
POST http://localhost:9200/demo/_analyze
{
  "text": "电话",
  "analyzer": "ik_smart"
}
# The result is as follows
{
  "tokens": [
    { "token": "电话", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 }
  ]
}
When searching, you can specify a word breaker (if you have specified search analyzer before, you don't need to specify it again), so that the result of "XXX basket XXX ball" won't appear.
POST http://localhost:9200/demo/_doc/_search { "query": { "match": { "contentText":{ "query": "Basketball and volleyball", "operator": "and", "analyzer": "ik_smart" } } } }
The IK plugin provides two analyzers: ik_smart and ik_max_word.
- ik_max_word: splits the text at the finest granularity. For example, it splits "中华人民共和国国歌" (National Anthem of the People's Republic of China) into pieces such as "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民", "共和国", "共和", and "国歌". It exhausts every possible combination and is suitable for Term Query;
- ik_smart: splits the text at the coarsest granularity. For example, it splits "中华人民共和国国歌" into "中华人民共和国" and "国歌". It is suitable for Phrase query.
5. sort
Search results often need to be sorted as well. Here we sort by "age" in descending order
# Search for "Basketball" in the "like" field and sort by "age" in descending order
GET http://localhost:9200/demo/_doc/_search?sort=age:desc&q=like:Basketball

# Search for "Basketball and volleyball" in the "like" field and sort by "age" in descending order
POST http://localhost:9200/demo/_doc/_search
{
  "query": {
    "match": {
      "like": { "query": "Basketball and volleyball", "operator": "and" }
    }
  },
  "sort": { "age": { "order": "desc" }}    # desc: descending order; asc: ascending order
}
If you get the following error
Fielddata is disabled on text fields by default. Set fielddata=true on [age] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
set the mapping of the field to be sorted (alternatively, as the error message suggests, use a keyword field instead)
PUT http://localhost:9200/demo/_mapping    # Set the mapping for the index
{
  "properties": {
    "age": {
      "type": "text",
      "fielddata": true
    }
  }
}
For more details, see "sort"
6. postscript
Of course, Elasticsearch can do far more than this. For reasons of space, this article does not list many endpoints. For more, please refer to the Elasticsearch documentation, where you can find the documents for each version.
Finally, the new coronavirus has been spreading widely of late and the epidemic is very serious. I wish you all good fortune, long life, peace year after year, promotions, and raises!
------ Written by WeiWq in Guangzhou on February 9, 2020.