KS1_S09_ElasticSearch-7.6.x (library, table, record document)

Posted by nsantos on Tue, 08 Feb 2022 00:07:55 +0100


1, Topic introduction

  1. Common big data tools:

    ElasticSearch, Hadoop, HBase, Spark

  2. Data cleaning, ELK

    ELK technology, E: ElasticSearch, L: Logstash, K: Kibana

  3. Comparison of search methods:

    SQL: like '%course%' is very slow on large data sets, and an index does not help a leading-wildcard match
    ElasticSearch: the search solution behind cases like Baidu, GitHub, and Taobao e-commerce

    1. Survey the alternatives
    2. Installation
    3. Ecosystem
    4. IK word splitter
    5. Operating ES in RESTful style
    6. CRUD
    7. SpringBoot integration with ElasticSearch (starting from principle analysis!)
    8. Crawling data from JD.com and Taobao
    9. Hands-on project: a simulated full-text retrieval

Going forward, whenever you need search, you can use ES! (best suited to large data volumes)

2, Founder Doug Cutting

Storage + computation

Content

  1. Lucene, full-text retrieval function, open source

  2. In 2003, Google published an academic paper and publicly introduced its own Google File System GFS (Google File System), which is a special file system designed by Google to store massive search data.

  3. In 2004, Doug Cutting implemented a distributed file storage system based on Google's GFS paper and named it NDFS (Nutch Distributed File System).

  4. In 2004, Google published a technical academic paper to introduce its MapReduce programming model, which is used for parallel analysis and operation of large-scale data sets (greater than 1TB).

  5. In 2005, Doug Cutting implemented MapReduce in the Nutch search engine.

  6. In 2006, Yahoo hired Doug Cutting. He upgraded NDFS and MapReduce and renamed the project Hadoop (NDFS was likewise renamed HDFS, the Hadoop Distributed File System)

  7. This is the origin of Hadoop, a big data framework.

  8. In 2006, Google introduced its BigTable, a distributed data storage system: a non-relational database used to process massive data.

  9. Doug Cutting brought BigTable's design into the Hadoop system and named it HBase.

    Google       Hadoop
    GFS          HDFS
    MapReduce    MapReduce
    BigTable     HBase

summary

Lucene is an information retrieval toolkit: a jar package, not a complete search engine system. (Solr is one engine built on it.)

It includes: index structures, tools for reading and writing indexes, sorting, and search rules - utility classes.

Relationship between Lucene and ElasticSearch:

ElasticSearch is encapsulated and enhanced based on Lucene (easy to use).

3, ElasticSearch overview

**Keywords:** highly scalable, distributed full-text search engine; stores and retrieves data in near real time; RESTful API

Who is using:

  1. Wikipedia
  2. The Guardian (foreign news website)
  3. Stack Overflow (foreign program exception discussion forum)
  4. GitHub (open source code management)
  5. E-commerce website
  6. Log data analysis: the ELK stack - Logstash collects and filters the logs, ElasticSearch stores and runs complex analysis on the data, and Kibana provides visual analysis
  7. Commodity price monitoring website
  8. BI systems (Business Intelligence)
  9. In China: on-site search (e-commerce, recruitment, portals, etc.), internal IT system search (OA, CRM, ERP, etc.), data analysis (a popular ES use case)

4, ElasticSearch and Solr differences

  1. Introduction to ElasticSearch:

    Full text search, structured search, analysis

    Highlight keywords, error correction

    User log analysis

    Lucene based

  2. Solr introduction:

    Runs independently in Servlet containers such as Jetty and Tomcat

    An enterprise search server developed on top of Lucene; in essence it wraps Lucene

    Provides a web-service-like API

  3. Lucene introduction:

    Lucene is a mature, free, open-source toolkit for the Java development environment

    Lucene is currently, and has been for years, the most popular Java information retrieval library.

ElasticSearch and Solr summary

  1. ES basically works out of the box and is very simple; Solr installation is slightly more complicated.
  2. Solr uses Zookeeper for distributed management, while ES has distributed coordination built in
  3. Solr supports more data formats, such as JSON, XML and CSV, while ES only supports JSON
  4. Solr officially provides more features, while ES concentrates on core functionality; advanced features mostly come from third-party plugins, e.g. the graphical interface relies on Kibana.
  5. Solr queries are fast, but index updates (inserts and deletes) are slow, so it suits query-heavy applications such as e-commerce
    1. ES builds indexes fast (at some cost to raw query speed), making real-time queries fast; it powers searches like Facebook's and Sina's.
    2. Solr is a strong solution for traditional search applications, but ES is better suited to emerging real-time search applications.
  6. Solr is more mature, with a larger and more established community of users, developers and contributors, while ES has fewer developers and maintainers, updates very quickly, and carries a higher learning cost.

5, Elastsearch installation

  • Note: ES 7.x requires at least JDK 1.8. You also need an ElasticSearch client and the Kibana interface tool. For Java development, the ElasticSearch version must match the version of the corresponding Java core jar packages.

download

Official website: https://www.elastic.co/cn/elasticsearch/

Download versions from the official website: https://www.elastic.co/cn/downloads/past-releases#elasticsearch

The three ELK components can be decompressed and used right away. (elasticsearch-head is a web project and needs a front-end environment: Node.js, npm.)

Installing on Windows

  1. Decompress and it is ready to use

  2. Get familiar with the directory layout:

    bin						# startup scripts
    config					# configuration files
    	log4j2				# log configuration
    	jvm.options 		# JVM-related configuration
    	elasticsearch.yml	# elasticsearch configuration file; default port 9200; cross-origin settings go here
    lib						# related jar packages
    modules					# function modules
    plugins					# plugins
    logs					# logs
    
  3. Start ES by running elasticsearch.bat in bin from cmd (or double-click it), then open localhost:9200 in a browser


  4. Access test
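
    A successful visit returns JSON along these lines (a sketch: the name, uuid and build details will differ on your machine):

    {
      "name" : "LAPTOP-...",
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "...",
      "version" : {
        "number" : "7.6.1",
        ...
      },
      "tagline" : "You Know, for Search"
    }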

Install the visual interface plugin elasticsearch-head

  • If you have no front-end environment yet, first install the basics this Vue-based project needs (Node.js, npm).
  1. Download address: http://github.com/mobz/elasticsearch-head/

  2. After decompression, open the home directory with cmd

    npm install 
    
    # node -v
    # npm -v
    # npm install -g cnpm --registry=https://registry.npm.taobao.org
    # cnpm -v
    # cnpm install
    
    npm run start
    
  3. If you visit 127.0.0.1:9100 directly, you will hit cross-origin problems; modify elasticsearch.yml:

    http.cors.enabled: true
    http.cors.allow-origin: "*"
    
  4. Restart ES service and connect 127.0.0.1:9100 again

    When ES is understood as a database, you can create an index (think: a database), which stores documents (the data in the library) organized by types

  5. Use head just as a data-browsing tool; all subsequent queries go through Kibana

6, Kibana overview and installation

ELK introduction - Understanding

  1. ELK is the acronym of Elasticsearch, Logstash and Kibana. It is also known as Elastic Stack on the market.

  2. ElasticSearch is a distributed, near-real-time search platform framework based on Lucene, accessed through a RESTful interface. For big-data full-text search scenarios like Baidu and Google, ElasticSearch can serve as the underlying support framework, which shows how powerful its search capability is. It is commonly just called ES.

  3. Logstash is ELK's central data-flow engine. It collects data in different formats from different sources (files / data stores / MQ), filters it, and outputs it to different destinations (files / MQ / redis / elasticsearch / kafka, etc.).

  4. Kibana displays elasticsearch's data through a friendly interface and provides real-time analysis.

  5. Whenever ELK comes up, developers generally take it to mean a log-analysis technology stack. In fact, ELK is not limited to log analysis; it supports any other data analysis and collection scenario. Log analysis and collection is just the most representative use, not the only one.


  • Data collection and cleaning (Logstash) -> analysis, search, storage (ElasticSearch) -> visualization (Kibana)

Introduction to Kibana

  • Kibana is an open-source analysis and visualization platform for ElasticSearch, used to search and view data stored in ElasticSearch indexes interactively. Kibana supports advanced data analysis and display through a variety of charts, making massive data easier to understand. It is simple to operate, with a browser-based user interface that lets you quickly create dashboards that show ElasticSearch query results in real time. Setting up Kibana is very simple: with no coding or additional infrastructure, you can complete the installation and start monitoring ElasticSearch indexes within a few minutes.

  • Official website: https://www.elastic.co/cn/kibana

  • The Kibana version must be consistent with the ElasticSearch version.

  1. Decompress

  2. Start it up

  3. Access test: localhost:5601

  4. Development tool options for testing (Postman, curl, head, Chrome plugins, Kibana)

  5. To switch the UI to Chinese, edit kibana.yml under the *\kibana-7.6.1-windows-x86_64\config path:

    #i18n.locale: "en"
    i18n.locale: "zh-CN"
    

7, ES core concept

summary

ElasticSearch related concepts: what are clusters, nodes, indexes, types, documents, shards, and mappings.

elasticsearch is document-oriented. Below is a comparison between a relational database and elasticsearch. Everything is JSON

ElasticSearch    Relational DB
indexes          databases
types            tables
documents        rows
fields           columns

The elastic search (cluster) can contain multiple indexes (databases), each index can contain multiple types (tables), each type contains multiple documents (rows), and each document contains multiple fields (columns).

  1. Physical design:

    An index is divided into multiple shards;

    even a single node is itself a cluster, with the default "cluster_name": "elasticsearch"

  2. Logical design:

    Documents

    Just individual pieces of data

    Each key-value pair forms a JSON object; fastjson converts them automatically

    Types

    Can be left unset; ES will guess and set the field type

    Indexes

    The "database"

  3. Physical design: how nodes and shards work

Inverted index

  1. It filters out all irrelevant data entirely, improving query efficiency

  2. An elasticsearch index is composed of multiple Lucene indexes.
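
A small illustration (two hypothetical documents): doc 1 contains "study elasticsearch" and doc 2 contains "study hadoop". The inverted index maps each term to the documents containing it:

    study          -> doc 1, doc 2
    elasticsearch  -> doc 1
    hadoop         -> doc 2

A query for "elasticsearch" reads only that one posting list and never touches doc 2.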

Summary key points

  1. Indexes
  2. Field types (mapping)
  3. Documents
  4. Shards (Lucene indexes, which use inverted indexes)

8, IK word breaker plug-in

What is an IK word breaker?

  1. Word segmentation splits a passage of Chinese (or other text) into keywords. When searching, ES segments both our query text and the data in the database or index library, then matches the segments. The default analyzer treats each Chinese character as a separate word, which obviously does not meet the requirements, so we install the Chinese word splitter ik to solve this problem.

  2. It is recommended to use ik Chinese word segmentation.

  3. IK provides two segmentation algorithms: ik_smart, the coarsest segmentation (fewest tokens), and ik_max_word, the most fine-grained segmentation (most tokens).

ik word splitter installation

  1. GitHub address: https://github.com/medcl/elasticsearch-analysis-ik/releases

  2. After downloading, extract it into the elasticsearch plugins directory: *\elasticsearch-7.6.1\plugins\ik

  3. Restart ES and watch the startup output


  4. Run elasticsearch-plugin list to view the loaded plugins

  5. kibana test

    # Original kibana example
    GET _search
    {
      "query": {
        "match_all": {}
      }
    }
    
    # ik_smart
    GET _analyze
    {
      "analyzer": "ik_smart",
      "text": "I like the paper version very much java book"
    }
    
    # ik_max_word
    GET _analyze
    {
      "analyzer": "ik_max_word",
      "text": "I like the paper version very much java book"
    }
    

ik word splitter: adding your own dictionary

# Step 1: 
Path: elasticsearch-7.6.1\plugins\ik\config\IKAnalyzer.cfg.xml
Edit it to contain: <entry key="ext_dict">zhout.dic</entry>

# Step 2: 
Create a file: elasticsearch-7.6.1\plugins\ik\config\zhout.dic
Add the custom word: paper version

  1. Restart ES

  2. The cmd window will log that zhout.dic was loaded

[2021-05-29T16:24:12,601][INFO ][o.w.a.d.Monitor          ] [LAPTOP-HRUQ7L7V] [Dict Loading] *\elasticsearch-7.6.1\plugins\ik\config\zhout.dic
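
A quick check in Kibana (assuming the zhout.dic entry from Step 2 was loaded): the custom word should now come back from ik_smart as a single token instead of being split:

# With zhout.dic loaded, the custom word stays intact as one term
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "I like the paper version very much java book"
}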

9, Rest style description, about index operation

method     url address                                                describe
PUT        localhost:9200/index name/type name/document id            create document (specify document id)
POST       localhost:9200/index name/type name                        create document (random document id)
POST       localhost:9200/index name/type name/document id/_update    modify document
DELETE     localhost:9200/index name/type name/document id            remove document
GET        localhost:9200/index name/type name/document id            query document (specify document id)
POST       localhost:9200/index name/type name/_search                query all data

Basic operation of index

  • Field types:
    1. String types: text, keyword
    2. Numeric types: long, integer, short, byte, double, float, half_float, scaled_float
    3. Date type: date
    4. Boolean type: boolean
    5. Binary type: binary
    6. and so on
  • If your document does not specify field types, ES will assign default field types for us
  • Extension: the GET _cat/ endpoints return a lot of current information about ES (health, indices, etc.)
# Syntax PUT / index name / ~ type name ~ / document id + data
# Syntax PUT / index name + rule

# Example 1, create an index
PUT /test1/type1/1
{
  "name" : "zhout",
  "age": 3
}

# Example 2: set field type: create rule
PUT /test2
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "long"
      },
      "birthday": {
        "type": "date"
      }
    }
  }
}

# Example 3: obtain rules and specific information
GET test2

# Example 4, add a document using the default _doc type
PUT /test3/_doc/1
{
  "name": "zhout",
  "age": 13,
  "birth": "1994-07-17"
}

# Example 5: get index
GET test3

# Example 6: get health value
GET _cat/health
GET _cat/indices?v

To modify a document, you can submit with PUT again, which overwrites it; there are other ways as well

# Syntax PUT: create anew or fully overwrite
# Syntax POST .../_update: partial update

# Example 1: overwrite update
PUT /test3/_doc/1
{
  "name": "zhouzhou",
  "age": 13,
  "birth": "1994-07-17"
}

# Example 2: partial update using POST _update
POST /test3/_doc/1/_update
{
  "doc": {
    "name": "zhouzhouzhou"
  }
}

DELETE removes either an index or a document record, depending on the request path.

# Syntax DELETE

# Example 1: DELETE
DELETE test1

10, Basic operation of documents (key points)

Basic operation of documents

# Syntax GET 	 fetch data
# Syntax PUT 	 overwrite data
# Syntax POST .../_update 	 update data; this method is recommended	

# Example 1, create an index
PUT /zhout/user/1
{
  "name": "Paper version",
  "age": 23,
  "desc": "A meal is as fierce as a tiger. At a glance, the salary is 2500",
  "tags": ["technical nerd","warm","Straight man"]
}

# Example 2, continue to create
PUT /zhout/user/2
{
  "name": "Zhang San",
  "age": 3,
  "desc": "Outlaw maniac",
  "tags": ["make friends","Travel","scumbag"]
}

# Example 3, continue to create
PUT /zhout/user/3
{
  "name": "Li Si",
  "age": 30,
  "desc": "mmp,I don't know how to describe it",
  "tags": ["Pretty girl","Travel","sing"]
}

# Example 4, obtaining data
GET zhout/user/1

# Example 5: POST without _update overwrites the whole document, just like PUT
POST zhout/user/1
{
  "doc": {
    "name": "Paper version"
  }
}

# Example 6: POST _update modifies only the fields in doc
POST zhout/user/1/_update
{
  "doc": {
    "name": "Paper version"
  }
}

Simple search

  • hits:

    • Index and document information
    • Total number of query results
    • The specific documents found
    • You can iterate over the returned data
    • _score: use it to judge which result matches best (see the sample response sketch after the examples below)
    # Syntax GET 	 GET data
    
    
    # Example 1, simple search
    GET zhout/user/1
    
    # Example 2: conditional search
    GET zhout/user/_search?q=name:Paper version
    
    # Example 3, overwrite document 1 with new data
    PUT /zhout/user/1
    {
      "name": "Paper version paper version 2",
      "age": 23,
      "desc": "A meal is as fierce as a tiger. At a glance, the salary is 2500",
      "tags": ["technical nerd","warm","Straight man"]
    }
    
    # Example 4: build the query as a JSON request body
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      }
    }
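
    For reference, a _search response is shaped roughly like this (a sketch; took, scores and totals depend on your data):

    {
      "took": 1,
      "timed_out": false,
      "_shards": { ... },
      "hits": {
        "total": { "value": 1, "relation": "eq" },
        "max_score": 1.0,
        "hits": [
          {
            "_index": "zhout",
            "_type": "user",
            "_id": "1",
            "_score": 1.0,
            "_source": { "name": "Paper version", ... }
          }
        ]
      }
    }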
    

Complex search: select with sorting, paging, highlighting, fuzzy query, exact query

  • Result offsets start from 0, the same as the data structures you already know

    • /search/{current}/{pagesize}
  • gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to)

    # Syntax GET 	 GET data
    # Multiple search conditions, range intervals, and shaping the returned result
    
    
    # Example 1, result filtering via _source (like SELECT name, desc in SQL)
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "_source": ["name", "desc"]
    }
    
    # Example 2, sorting results - through sort and fields
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "sort": [
        {
          "age": {
            "order": "desc"
          }
        }
      ]
    }
    
    # Example 3, paging - from (starting offset) and size (documents per page) control what is returned
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "sort": [
        {
          "age": {
            "order": "desc"
          }
        }
      ],
      "from": 0,
      "size": 1
    }
    
    # Example 4: Boolean query - bool + must requires every condition to match (AND in sql)
    GET zhout/user/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "name": "Paper version"
              }
            },
            {
              "match": {
                "age": "23"
              }
            }
          ]
        }
      }
    }
    
    # Example 5: Boolean query - bool + should matches if any condition matches (OR in sql)
    GET zhout/user/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "name": "Paper version"
              }
            },
            {
              "match": {
                "age": "23"
              }
            }
          ]
        }
      }
    }
    
    # Example 6, Boolean query - bool + must_not excludes matching documents (NOT in sql)
    GET zhout/user/_search
    {
      "query": {
        "bool": {
          "must_not": [
            {
              "match": {
                "age": "23"
              }
            }
          ]
        }
      }
    }
    
    # Example 7, filter - filter + range with gte/lte bounds the value range
    GET zhout/user/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "name": "Paper version"
              }
            }
          ],
          "filter": {
            "range": {
              "age": {
                "gte": 10,
                "lte": 30
              }
            }
          }
        }
      }
    }
    

Match multiple criteria

  • A term query looks up the exact term directly in the inverted index.

  • About tokenization:

    • term: direct exact lookup
    • match: parsed by the analyzer first (the query text is analyzed, then matched against the analyzed document)
  • Two field types: text (analyzed into tokens) and keyword (stored whole, not analyzed)

    # Syntax GET 	 GET data
    
    
    # Example 1, matching multiple values - values separated by spaces; a document matches if any one of them hits
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "tags": "Male Technology"
        }
      }
    }
    
    # Example 2: add test data
    PUT testdb
    {
      "mappings": {
        "properties": {
          "name": {
            "type": "text"
          },
          "desc":{
            "type": "keyword"
          }
        }
      }
    }
    
    PUT testdb/_doc/1
    {
      "name": "Paper version Java name",
      "desc": "First data in paper version"
    }
    
    PUT testdb/_doc/2
    {
      "name": "Paper version Java name",
      "desc": "First data in paper version 2"
    }
    
    # Example 3, analysis via _analyze - the keyword analyzer does not split the text
    GET _analyze
    {
      "analyzer": "keyword",
      "text": "Paper version Java name"
      
    }
    
    # Example 4, analysis via _analyze - the standard analyzer splits the text
    GET _analyze
    {
      "analyzer": "standard",
      "text": "Paper version Java name"
      
    }
    
    # Example 5a: exact query - name has type text, so it was analyzed by the standard tokenizer
    GET testdb/_search
    {
      "query": {
        "term": {
          "name": "paper"
        }
      }
    }
    
    # Example 5b: exact query - desc has type keyword, so it is stored as one whole term
    GET testdb/_search
    {
      "query": {
        "term": {
          "desc": "First data in paper version"
        }
      }
    }
    
    

Exact matching with multiple-value queries

  • MySQL can also do this kind of matching, but far more slowly

    • Match by criteria
    • Exact match
    • Interval range matching
    • Matching field filtering
    • Multi condition query
    • Highlight query
    # Exact matching of multiple value queries
    
    # Example 1: adding data
    PUT testdb/_doc/3
    {
      "t1": "22",
      "t2": "2020-4-7"
    }
    
    PUT testdb/_doc/4
    {
      "t1": "22",
      "t2": "2020-4-7"
    }
    
    # Example 2: accurately query multiple condition values
    GET testdb/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "term": {
                "t1": {
                  "value": "22"
                }
              }
            },
            {
              "term": {
                "t1": {
                  "value": "33"
                }
              }
            }
          ]
        }
      }
    }
    
    # Example 3: highlight query - highlight + fields; matched terms come back wrapped in <em> tags by default
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "highlight": {
        "fields": {
          "name": {}
        }
      }
    }
    
    # Example 4: highlight custom labels - highlight, pre_tags , post_tags 
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "highlight": {
        "pre_tags": "<p class='key' style='color:red'>", 
        "post_tags": "</p>", 
        "fields": {
          "name": {}
        }
      }
    }
    

11, Integrated SpringBoot

Find official documents

  1. Address: https://www.elastic.co/guide/en/elasticsearch/client/index.html

  2. Java REST Client - Recommended: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.x/index.html

  3. Java API - native API

  4. The native dependency:

    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.6.1</version>
    </dependency>
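
    Spring Boot manages a default elasticsearch version of its own, so it is common to pin the version through the elasticsearch.version property so that the client matches your server (shown for 7.6.1; adjust to your setup):

    <properties>
        <elasticsearch.version>7.6.1</elasticsearch.version>
    </properties>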
    
  5. Initialize the client object

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    
    RestHighLevelClient client = new RestHighLevelClient(
            RestClient.builder(
                    new HttpHost("localhost", 9200, "http"),
                    new HttpHost("localhost", 9201, "http")));
    
    // close the client when you are finished with it
    client.close();
    
    

integration testing

  1. Create an index
  2. Check whether an index exists
  3. Delete an index
  4. Create documents
  5. CRUD operations on documents (a runnable sketch of these steps follows)
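
A minimal sketch of the above steps using the 7.6.1 high-level client; the index name test_index and the sample document are assumptions for illustration, not from the original notes:

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
    import org.elasticsearch.action.get.GetRequest;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.indices.CreateIndexRequest;
    import org.elasticsearch.client.indices.GetIndexRequest;
    import org.elasticsearch.common.xcontent.XContentType;
    
    public class EsIntegrationSketch {
        public static void main(String[] args) throws Exception {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")));
    
            // 1. Create an index
            client.indices().create(new CreateIndexRequest("test_index"), RequestOptions.DEFAULT);
    
            // 2. Check whether the index exists
            boolean exists = client.indices().exists(new GetIndexRequest("test_index"), RequestOptions.DEFAULT);
            System.out.println("index exists: " + exists);
    
            // 4. Create a document (explicit id, JSON source)
            client.index(new IndexRequest("test_index").id("1")
                    .source("{\"name\":\"zhout\",\"age\":3}", XContentType.JSON), RequestOptions.DEFAULT);
    
            // 5. Read the document back (one of the CRUD operations)
            System.out.println(client.get(new GetRequest("test_index", "1"),
                    RequestOptions.DEFAULT).getSourceAsString());
    
            // 3. Delete the index to clean up
            client.indices().delete(new DeleteIndexRequest("test_index"), RequestOptions.DEFAULT);
    
            client.close();
        }
    }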

Topics: ElasticSearch