2021-05-30_ElasticSearch-7.6.x (database / index, table / type, record / document)
1, Topic introduction
-
Big data tools:
ElasticSearch, Hadoop, HBase, Spark
-
Data cleaning: ELK
ELK stack, E: ElasticSearch, L: Logstash, K: Kibana
-
Comparison of search approaches:
SQL: LIKE '%course%' is very slow on large data sets, and even with an index the leading wildcard keeps it slow.
ElasticSearch: the search engine behind use cases such as Baidu, GitHub, and Taobao e-commerce (comparing products across shops).
- Installation
- Ecosystem
- IK tokenizer (Chinese word segmentation)
- Operating ES through RESTful requests
- CRUD
- SpringBoot integration with ElasticSearch (starting from principle analysis)
- Crawler to fetch data: JD.com and Taobao data
- Hands-on project: simulated full-text search
In the future, whenever you need search over a large amount of data, you can use ES.
2, Founder Doug Cutting
Storage + computation
Content:
-
Lucene: an open-source full-text retrieval library
-
In 2003, Google published an academic paper publicly introducing its Google File System (GFS), a special-purpose file system Google designed to store its massive volume of search data.
-
In 2004, Doug Cutting implemented a distributed file storage system based on Google's GFS paper and named it NDFS (Nutch Distributed File System).
-
In 2004, Google published a technical academic paper to introduce its MapReduce programming model, which is used for parallel analysis and operation of large-scale data sets (greater than 1TB).
-
In 2005, Doug Cutting implemented MapReduce-based processing in the Nutch search engine.
-
In 2006, Yahoo hired Doug Cutting. He upgraded NDFS and MapReduce and renamed the project Hadoop (NDFS was renamed HDFS, the Hadoop Distributed File System).
-
This is the origin of Hadoop, a big data framework.
-
In 2006, Google introduced BigTable, its distributed data storage system: a non-relational database used to handle massive amounts of data.
-
Doug Cutting brought the BigTable design into the Hadoop ecosystem and named the implementation HBase.
Google | Hadoop |
---|---|
GFS | HDFS |
MapReduce | MapReduce |
BigTable | HBase |
summary
Lucene is an information retrieval toolkit (a jar package), not a complete search engine system. (Solr is also built on it.)
It includes: index structures, tools for reading and writing indexes, sorting and search rules, and utility classes.
Relationship between Lucene and ElasticSearch:
ElasticSearch wraps and enhances Lucene, making it easier to use.
3, ElasticSearch overview
**Keywords:** highly scalable, distributed full-text search engine, near real-time storage and retrieval of data, RESTful API
Who is using:
- Wikipedia
- The Guardian (news website)
- Stack Overflow (programming Q&A forum)
- GitHub (open source code management)
- E-commerce website
- Log data analysis: Logstash collects the logs and ES performs the complex data analysis (ELK stack: ElasticSearch data storage and search + Logstash data collection and filtering + Kibana visual analysis)
- Commodity price monitoring website
- BI systems (Business Intelligence)
- In China: on-site search (e-commerce, recruitment, portals, etc.), internal IT system search (OA, CRM, ERP, etc.), data analysis (a popular ES use case)
4, ElasticSearch and Solr differences
-
Introduction to ElasticSearch:
Full text search, structured search, analysis
Highlight keywords, error correction
User log analysis
Lucene based
-
Solr introduction:
Runs standalone in a Servlet container such as Jetty or Tomcat
An enterprise search server developed on top of Lucene; it essentially wraps Lucene
Provides a web-service-like API
-
Lucene introduction:
Lucene is a mature, free, open-source toolkit for the Java development environment
Lucene itself is currently, and has been for years, the most popular Java information retrieval library
ElasticSearch and Solr summary
- ES works basically out of the box and is very simple; Solr installation is slightly more complicated.
- Solr uses Zookeeper for distributed management, while ES has built-in distributed coordination.
- Solr supports more data formats, such as JSON, XML and CSV, while ES only supports JSON.
- Solr officially provides more features, while ES focuses on core functionality; advanced features are mostly provided by third-party plugins (for example, the graphical interface relies on Kibana).
- Solr queries are fast, but updating the index is slow (i.e. inserts and deletes are slow), so it suits query-heavy applications such as e-commerce.
- ES builds its index quickly (queries are comparatively slower), which makes it fast for real-time search; it is used by Facebook, Sina, and similar sites.
- Solr is a strong solution for traditional search applications, but ES is better suited to emerging real-time search applications.
- Solr is relatively mature and has a larger, more established community of users, developers and contributors, while ES has fewer developers and maintainers, updates very quickly, and has a higher learning cost.
5, ElasticSearch installation
- Note: ES 7.x requires at least JDK 1.8. Minimum setup: the ElasticSearch service, a client, and the UI tool Kibana. For Java development, the ElasticSearch version must be consistent with the version of the corresponding Java client jar.
download
Official website: https://www.elastic.co/cn/elasticsearch/
Download versions from the official website: https://www.elastic.co/cn/downloads/past-releases#elasticsearch
The ELK trio can be unpacked and used directly (the head plugin is a web project, so a front-end environment is needed: Node.js and npm; Python 2 may also be required)
Install on Windows
-
Unzip and it is ready to use
-
Familiar with Directory:
bin        # startup scripts
config     # configuration files
  log4j2.properties   # log configuration
  jvm.options         # JVM-related configuration
  elasticsearch.yml   # ElasticSearch configuration: default port 9200, cross-domain settings
lib        # related jar packages
modules    # functional modules
plugins    # plugins
logs       # log output
-
Start: run elasticsearch.bat under bin from cmd, or double-click elasticsearch.bat; then open localhost:9200 in the browser.
-
Access test
Install the visual interface plugin elasticsearch-head
- If you do not have a front-end environment yet, install the Node.js/npm toolchain first (the head plugin is a Vue-based web project).
-
Download address: http://github.com/mobz/elasticsearch-head/
-
After decompression, open the home directory with cmd
node -v
npm -v
npm install
# Optional: use the cnpm Taobao mirror instead
npm install -g cnpm --registry=https://registry.npm.taobao.org
cnpm -v
cnpm install
npm run start
-
If you access 127.0.0.1:9100 directly there will be cross-domain problems; elasticsearch.yml needs to be modified:
http.cors.enabled: true
http.cors.allow-origin: "*"
-
Restart ES service and connect 127.0.0.1:9100 again
If ES is thought of as a database, then an index corresponds to a database (library), which stores documents (the rows of data) and types.
-
Treat head only as a data-browsing tool; all subsequent queries are done in Kibana.
6, Kibana overview and installation
ELK introduction - Understanding
-
ELK is the acronym of Elasticsearch, Logstash and Kibana. It is also known as Elastic Stack on the market.
-
ElasticSearch is a near real-time search platform framework based on Lucene; it is distributed and interacts via RESTful APIs. For big-data full-text search scenarios such as Baidu and Google, ElasticSearch can serve as the underlying support framework, which shows how powerful its search capability is. It is commonly referred to simply as ES.
-
Logstash is ELK's central data-flow engine. It collects data in different formats from different sources (files / data stores / MQ), filters it, and outputs it to different destinations (files / MQ / Redis / ElasticSearch / Kafka, etc.).
-
Kibana can display the data of elasticsearch through a friendly interface and provide the function of real-time analysis.
-
Many developers agree that whenever ELK is mentioned it is taken as shorthand for a log-analysis technology stack. In fact, ELK is not limited to log analysis; it supports any data analysis and collection scenario. Log analysis and collection is just the most representative use case, not the only one.
- Data collection and cleaning (Logstash) -> analysis, search and storage (ElasticSearch) -> visualization (Kibana)
Introduction to Kibana
-
Kibana is an open-source analytics and visualization platform for ElasticSearch, used to search and view data stored in ElasticSearch indexes interactively. Kibana supports advanced data analysis and display through a variety of charts, and makes massive data easier to understand. It is simple to operate, with a browser-based user interface that lets you quickly create dashboards showing ElasticSearch query results in real time. Setting up Kibana is very simple: without coding or additional infrastructure, you can finish the Kibana installation and start monitoring ElasticSearch indexes within a few minutes.
-
Official website: https://www.elastic.co/cn/kibana
-
The version of Kibana must be consistent with ElasticSearch.
-
decompression
-
start-up
-
Step 3: access test, localhost:5601
-
Development tool test (Post, curl, head, Google browser plug-in, Kibana)
-
Chinese localization: modify kibana.yml under *\kibana-7.6.1-windows-x86_64\config
#i18n.locale: "en"
i18n.locale: "zh-CN"
7, ES core concept
summary
ElasticSearch-related concepts: cluster, node, index, type, document, shard, mapping.
ElasticSearch is document-oriented. Below is a comparison between a relational (row-oriented) database and ElasticSearch; in ES, everything is JSON.
ElasticSearch | Relational DB |
---|---|
indexes | databases |
types | tables |
documents | rows |
fields | columns |
An ElasticSearch cluster can contain multiple indexes (databases); each index can contain multiple types (tables); each type contains multiple documents (rows); and each document contains multiple fields (columns).
-
Physical design:
An index is divided into multiple shards, which are distributed across nodes;
a single ES instance by itself is already a cluster, with the default cluster name "elasticsearch"
-
Logical design:
Document
The individual pieces of data; each document is a JSON object of key-value pairs (fastjson can convert Java objects to and from this JSON automatically; see the sketch after this list)
Type
Can be left unspecified; ES will guess the field types
Index
Comparable to a database: a collection of documents
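Below is a minimal, hypothetical sketch of the document-as-JSON idea using the fastjson library mentioned above; the User class and its fields are made up for illustration.

```java
import com.alibaba.fastjson.JSON;

public class FastjsonDemo {

    // Hypothetical document class; its fields become the JSON keys stored in ES.
    public static class User {
        private String name;
        private int age;

        public User() {}
        public User(String name, int age) { this.name = name; this.age = age; }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        // Serialize a Java object into the JSON document that would be indexed.
        String json = JSON.toJSONString(new User("zhout", 3));
        System.out.println(json);   // {"age":3,"name":"zhout"}

        // Deserialize a JSON document returned by ES back into a Java object.
        User user = JSON.parseObject(json, User.class);
        System.out.println(user.getName() + ", " + user.getAge());
    }
}
```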
-
Physical design: how nodes and shards work
Inverted index
-
All irrelevant data is filtered out completely, which improves efficiency: a search only looks at documents that contain the query terms.
-
An elasticsearch index is composed of multiple Lucene indexes.
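A minimal sketch of the inverted-index idea described above (not Lucene's actual implementation): each term is mapped to the set of document IDs that contain it, so a search only touches relevant documents. The class name and the document texts are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexDemo {

    // term -> IDs of the documents that contain the term
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Split a document into terms and record the document ID under each term.
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a term: only documents containing it are returned, everything else is skipped.
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexDemo idx = new InvertedIndexDemo();
        idx.addDocument(1, "study every day good good up to forever");
        idx.addDocument(2, "to forever study every day");

        System.out.println(idx.search("forever")); // [1, 2]
        System.out.println(idx.search("good"));    // [1]
    }
}
```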
Summary key points
- Index
- Field types (mapping)
- Documents
- Shards (Lucene inverted indexes)
8, IK tokenizer plugin
What is the IK tokenizer?
-
Word segmentation means splitting a passage of Chinese (or other text) into keywords. When searching, both the query and the data in the database or index are segmented, and then the segments are matched. The default analyzer treats each Chinese character as a separate word, which obviously does not meet our needs, so we install the Chinese IK tokenizer to solve this problem.
-
It is recommended to use IK for Chinese word segmentation.
-
IK provides two segmentation algorithms: ik_smart (coarsest splitting, fewest terms) and ik_max_word (finest-grained splitting, most terms).
IK tokenizer installation
-
GitHub address: https://github.com/medcl/elasticsearch-analysis-ik/releases
-
After downloading, extract it into the ElasticSearch plugins directory: *\elasticsearch-7.6.1\plugins\ik
-
Restart ES and observe the startup log
-
elasticsearch-plugin list shows the loaded plugins
-
kibana test
# Original Kibana example
GET _search
{
  "query": {
    "match_all": {}
  }
}

# ik_smart
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "I like the paper version very much java book"
}

# ik_max_word
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "I like the paper version very much java book"
}
Adding a custom dictionary to the IK tokenizer
# Step 1: path: elasticsearch-7.6.1\plugins\ik\config\IKAnalyzer.cfg.xml
#   Edit it to include: <entry key="ext_dict">zhout.dic</entry>
# Step 2: create the file elasticsearch-7.6.1\plugins\ik\config\zhout.dic
#   Add the custom term: paper version
-
Restart ES
-
In the cmd window you will see a message showing that zhout.dic was loaded
[2021-05-29T16:24:12,601][INFO ][o.w.a.d.Monitor ] [LAPTOP-HRUQ7L7V] [Dict Loading] *\elasticsearch-7.6.1\plugins\ik\config\zhout.dic
9, Rest style description, about index operation
method | url address | description |
---|---|---|
PUT | localhost:9200/index name/type name/document id | Create document (specify document id) |
POST | localhost:9200/index name/type name | Create document (random document id) |
POST | localhost:9200/index name/type name/document id/_update | Modify document |
DELETE | localhost:9200/index name/type name/document id | Delete document |
GET | localhost:9200/index name/type name/document id | Query document (by document id) |
POST | localhost:9200/index name/type name/_search | Query all data |
Basic operation of index
- Field types:
- String types: text, keyword
- Numeric types: long, integer, short, byte, double, float, half_float, scaled_float
- Date type: date
- Boolean type: boolean
- Binary type: binary
- etc.
- If your document does not specify field types, ES will assign default field types for us
- Extension: the _cat APIs (e.g. GET _cat/indices, GET _cat/health) return a lot of current information about ES.
# Syntax: PUT /index name/~type name~/document id + data
# Syntax: PUT /index name + rules

# Example 1: create an index (and a document)
PUT /test1/type1/1
{
  "name": "zhout",
  "age": 3
}

# Example 2: set field types (create the mapping rules)
PUT /test2
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "age": { "type": "long" },
      "birthday": { "type": "date" }
    }
  }
}

# Example 3: get the rules and detailed information of the index
GET test2

# Example 4: add a document using the default _doc type
PUT /test3/_doc/1
{
  "name": "zhout",
  "age": 13,
  "birth": "1994-07-17"
}

# Example 5: get the index
GET test3

# Example 6: get health values and index information
GET _cat/health
GET _cat/indices?v
To modify and submit, you can use PUT again to overwrite the document; there are also other ways (POST _update).
# Syntax: PUT   create or overwrite-update
# Syntax: POST ... /_update   partial update

# Example 1: overwrite update
PUT /test3/_doc/1
{
  "name": "zhouzhou",
  "age": 13,
  "birth": "1994-07-17"
}

# Example 2: the newer way, using POST _update
POST /test3/_doc/1/_update
{
  "doc": {
    "name": "zhouzhouzhou"
  }
}
Deleting: ES decides whether to delete the whole index or a single document record according to the request path.
# Syntax: DELETE

# Example 1: delete the index
DELETE test1
10, Basic operation of documents (key points)
Basic operation of documents
# Syntax: GET              read data
# Syntax: PUT              overwrite data
# Syntax: POST ... /_update  partial update (recommended)

# Example 1: create a document
PUT /zhout/user/1
{
  "name": "Paper version",
  "age": 23,
  "desc": "A meal is as fierce as a tiger. At a glance, the salary is 2500",
  "tags": ["technical nerd", "warm", "Straight man"]
}

# Example 2: continue creating
PUT /zhout/user/2
{
  "name": "Zhang San",
  "age": 3,
  "desc": "Outlaw maniac",
  "tags": ["make friends", "Travel", "scumbag"]
}

# Example 3: continue creating
PUT /zhout/user/3
{
  "name": "Li Si",
  "age": 30,
  "desc": "mmp, I don't know how to describe it",
  "tags": ["Pretty girl", "Travel", "sing"]
}

# Example 4: get data
GET zhout/user/1

# Example 5: POST without _update behaves the same as PUT (overwrite)
POST zhout/user/1
{
  "doc": {
    "name": "Paper version"
  }
}

# Example 6: POST _update (partial update)
POST zhout/user/1/_update
{
  "doc": {
    "name": "Paper version"
  }
}
Simple search
-
hits contains:
- The index and document information
- The total number of query results
- The specific documents found
- The data inside hits can be iterated over
- _score: can be used to judge which result matches the query best
# Syntax: GET   read data

# Example 1: simple get
GET zhout/user/1

# Example 2: conditional search
GET zhout/user/_search?q=name:Paper version

# Example 3: add a document
PUT /zhout/user/1
{
  "name": "Paper version paper version 2",
  "age": 23,
  "desc": "A meal is as fierce as a tiger. At a glance, the salary is 2500",
  "tags": ["technical nerd", "warm", "Straight man"]
}

# Example 4: build the query as a JSON body
GET zhout/user/_search
{
  "query": {
    "match": {
      "name": "Paper version"
    }
  }
}
Complex search (sorting, paging, highlighting, fuzzy query, exact query)
-
Data indexes start from 0, the same as the other data structures you have learned
- /search/{current}/{pagesize}, where from = (current - 1) * pagesize and size = pagesize
-
gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to)
# Syntax: GET   read data
# Multiple search conditions, range filters, and result shaping

# Example 1: result filtering via _source (like SELECT name, desc)
GET zhout/user/_search
{
  "query": {
    "match": { "name": "Paper version" }
  },
  "_source": ["name", "desc"]
}

# Example 2: sort results via sort and a field
GET zhout/user/_search
{
  "query": {
    "match": { "name": "Paper version" }
  },
  "sort": [
    { "age": { "order": "desc" } }
  ]
}

# Example 3: paging via from (offset of the first result) and size (results per page)
GET zhout/user/_search
{
  "query": {
    "match": { "name": "Paper version" }
  },
  "sort": [
    { "age": { "order": "desc" } }
  ],
  "from": 0,
  "size": 1
}

# Example 4: boolean query with bool and must (AND in SQL)
GET zhout/user/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "Paper version" } },
        { "match": { "age": "23" } }
      ]
    }
  }
}

# Example 5: boolean query with bool and should (OR in SQL)
GET zhout/user/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "Paper version" } },
        { "match": { "age": "23" } }
      ]
    }
  }
}

# Example 6: boolean query with bool and must_not (NOT in SQL)
GET zhout/user/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "age": "23" } }
      ]
    }
  }
}

# Example 7: filter with range, gte, lte
GET zhout/user/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "Paper version" } }
      ],
      "filter": {
        "range": {
          "age": { "gte": 10, "lte": 30 }
        }
      }
    }
  }
}
Match multiple criteria
-
A term query looks up the specified term directly and exactly in the inverted index.
-
About analysis:
- term: direct query, exact
- match: the query is processed by the analyzer first (the document is analyzed at index time, and then the query is matched against the analyzed terms)
-
Two field types: text (analyzed, i.e. split by the tokenizer) and keyword (not analyzed; matched as a whole).
# Syntax: GET   read data

# Example 1: match multiple values separated by spaces; a document matches if any one of them matches
GET zhout/user/_search
{
  "query": {
    "match": { "tags": "Male Technology" }
  }
}

# Example 2: add test data
PUT testdb
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "desc": { "type": "keyword" }
    }
  }
}

PUT testdb/_doc/1
{
  "name": "Paper version Java name",
  "desc": "First data in paper version"
}

PUT testdb/_doc/2
{
  "name": "Paper version Java name",
  "desc": "First data in paper version 2"
}

# Example 3: analysis with the keyword analyzer (_analyze, not split)
GET _analyze
{
  "analyzer": "keyword",
  "text": "Paper version Java name"
}

# Example 4: analysis with the standard analyzer (_analyze, split)
GET _analyze
{
  "analyzer": "standard",
  "text": "Paper version Java name"
}

# Example 5: exact query 1; name is of type text, so it was analyzed with the standard analyzer
GET testdb/_search
{
  "query": {
    "term": { "name": "paper" }
  }
}

# Example 6: exact query 2; desc is of type keyword, matched as a whole
GET testdb/_search
{
  "query": {
    "term": { "desc": "First data in paper version" }
  }
}
Exact matching of multiple value queries
-
MySQL can also do this kind of matching, but its matching speed is too slow
- Match by criteria
- Exact match
- Interval range matching
- Matching field filtering
- Multi condition query
- Highlight query
# Exact matching of multiple values

# Example 1: add data
PUT testdb/_doc/3
{
  "t1": "22",
  "t2": "2020-4-7"
}

PUT testdb/_doc/4
{
  "t1": "22",
  "t2": "2020-4-7"
}

# Example 2: exact query on multiple values
GET testdb/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "t1": { "value": "22" } } },
        { "term": { "t1": { "value": "33" } } }
      ]
    }
  }
}

# Example 3: highlight query via highlight and fields; matching terms in the results are highlighted
GET zhout/user/_search
{
  "query": {
    "match": { "name": "Paper version" }
  },
  "highlight": {
    "fields": { "name": {} }
  }
}

# Example 4: custom highlight tags via highlight, pre_tags, post_tags
GET zhout/user/_search
{
  "query": {
    "match": { "name": "Paper version" }
  },
  "highlight": {
    "pre_tags": "<p class='key' style='color:red'>",
    "post_tags": "</p>",
    "fields": { "name": {} }
  }
}
11, Integrated SpringBoot
Find official documents
-
Address: https://www.elastic.co/guide/en/elasticsearch/client/index.html
-
Java REST Client - Recommended: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.x/index.html
-
Java API - native API
-
Find the dependency:
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.6.1</version>
</dependency>
-
Initialize the client object
RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(
                new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")));

client.close();
Integration testing (see the sketch after the following list)
- Create index
- Determine whether the index exists
- Delete index
- Create documents
- CRUD operations on documents
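A minimal sketch of those test steps with the 7.6.x high-level REST client, assuming a local ES at localhost:9200; the index name test_index and the document content are made up for illustration, so verify the calls against the official Java REST Client documentation linked above.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

public class EsClientSketch {

    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // 1. Create an index (hypothetical name "test_index")
        client.indices().create(new CreateIndexRequest("test_index"), RequestOptions.DEFAULT);

        // 2. Check whether the index exists
        boolean exists = client.indices().exists(new GetIndexRequest("test_index"), RequestOptions.DEFAULT);
        System.out.println("index exists: " + exists);

        // 3. Create a document with id "1" and a JSON body
        IndexRequest indexRequest = new IndexRequest("test_index").id("1")
                .source("{\"name\":\"zhout\",\"age\":3}", XContentType.JSON);
        client.index(indexRequest, RequestOptions.DEFAULT);

        // 4. Read the document back
        GetResponse getResponse = client.get(new GetRequest("test_index", "1"), RequestOptions.DEFAULT);
        System.out.println(getResponse.getSourceAsString());

        // 5. Delete the document
        client.delete(new DeleteRequest("test_index", "1"), RequestOptions.DEFAULT);

        client.close();
    }
}
```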