KS1_S09_ElasticSearch-7.6.x (library, table, record document)

Posted by nsantos on Tue, 08 Feb 2022 00:07:55 +0100


1, Topic introduction

  1. Common big data tools:

    ElasticSearch, Hadoop, HBase, Spark

  2. Data cleaning, ELK

    ELK technology, E: ElasticSearch, L: Logstash, K: Kibana

  3. Comparison of search methods:

    SQL: like '%course%' is very slow on large data sets, and an index does not help a leading-wildcard match
    ElasticSearch: the search solution behind cases like Baidu, GitHub, and Taobao e-commerce

    1. Survey the alternatives
    2. Installation
    3. Ecosystem
    4. IK word splitter
    5. Operating ES in RESTful style
    6. CRUD
    7. SpringBoot integration with ElasticSearch (starting from principle analysis!)
    8. Crawling data from JD.com and Taobao
    9. Hands-on project: a simulated full-text retrieval

Going forward, whenever you need search, you can use ES! (best suited to large data volumes)

2, Founder Doug Cutting

Storage + computation

Content

  1. Lucene, full-text retrieval function, open source

  2. In 2003, Google published an academic paper and publicly introduced its own Google File System GFS (Google File System), which is a special file system designed by Google to store massive search data.

  3. In 2004, Doug Cutting implemented a distributed file storage system based on Google's GFS paper and named it NDFS (Nutch Distributed File System).

  4. In 2004, Google published a technical academic paper to introduce its MapReduce programming model, which is used for parallel analysis and operation of large-scale data sets (greater than 1TB).

  5. In 2005, Doug Cutting implemented MapReduce in the Nutch search engine.

  6. In 2006, Yahoo hired Doug Cutting. He upgraded NDFS and MapReduce and renamed the project Hadoop (NDFS was likewise renamed HDFS, the Hadoop Distributed File System)

  7. This is the origin of Hadoop, a big data framework.

  8. In 2006, Google introduced its BigTable, a distributed data storage system: a non-relational database used to process massive data.

  9. Doug Cutting brought BigTable's design into the Hadoop system and named it HBase.

    Google       Hadoop
    GFS          HDFS
    MapReduce    MapReduce
    BigTable     HBase

summary

Lucene is an information retrieval toolkit: a jar package, not a complete search engine system. (Solr is one engine built on it.)

It includes: index structures, tools for reading and writing indexes, sorting, and search rules - utility classes.

Relationship between Lucene and ElasticSearch:

ElasticSearch is encapsulated and enhanced based on Lucene (easy to use).

3, ElasticSearch overview

**Keywords:** highly scalable, distributed full-text search engine; stores and retrieves data in near real time; RESTful API

Who is using:

  1. Wikipedia
  2. The Guardian (foreign news website)
  3. Stack Overflow (foreign program exception discussion forum)
  4. GitHub (open source code management)
  5. E-commerce website
  6. Log data analysis: the ELK stack - Logstash collects and filters the logs, ElasticSearch stores and runs complex analysis on the data, and Kibana provides visual analysis
  7. Commodity price monitoring website
  8. BI systems (Business Intelligence)
  9. In China: on-site search (e-commerce, recruitment, portals, etc.), internal IT system search (OA, CRM, ERP, etc.), data analysis (a popular ES use case)

4, ElasticSearch and Solr differences

  1. Introduction to ElasticSearch:

    Full text search, structured search, analysis

    Highlight keywords, error correction

    User log analysis

    Lucene based

  2. Solr introduction:

    Runs independently in Servlet containers such as Jetty and Tomcat

    An enterprise search server developed on top of Lucene; in essence it wraps Lucene

    Provides a web-service-like API

  3. Lucene introduction:

    Lucene is a mature, free, open-source toolkit for the Java development environment

    Lucene is currently, and has been for years, the most popular Java information retrieval library.

ElasticSearch and Solr summary

  1. ES basically works out of the box and is very simple; Solr installation is slightly more complicated.
  2. Solr uses Zookeeper for distributed management, while ES has distributed coordination built in
  3. Solr supports more data formats, such as JSON, XML and CSV, while ES only supports JSON
  4. Solr officially provides more features, while ES concentrates on core functionality; advanced features mostly come from third-party plugins, e.g. the graphical interface relies on Kibana.
  5. Solr queries are fast, but index updates (inserts and deletes) are slow, so it suits query-heavy applications such as e-commerce
    1. ES builds indexes fast (at some cost to raw query speed), making real-time queries fast; it powers searches like Facebook's and Sina's.
    2. Solr is a strong solution for traditional search applications, but ES is better suited to emerging real-time search applications.
  6. Solr is more mature, with a larger and more established community of users, developers and contributors, while ES has fewer developers and maintainers, updates very quickly, and carries a higher learning cost.

5, Elastsearch installation

  • Note: ES 7.x requires at least JDK 1.8. You also need an ElasticSearch client and the Kibana interface tool. For Java development, the ElasticSearch version must match the version of the corresponding Java core jar packages.

download

Official website: https://www.elastic.co/cn/elasticsearch/

Download versions from the official website: https://www.elastic.co/cn/downloads/past-releases#elasticsearch

The three ELK components can be decompressed and used right away. (elasticsearch-head is a web project and needs a front-end environment: Node.js, npm.)

Installing on Windows

  1. Decompress and it is ready to use

  2. Get familiar with the directory layout:

    bin						# startup scripts
    config					# configuration files
    	log4j2				# log configuration
    	jvm.options 		# JVM-related configuration
    	elasticsearch.yml	# elasticsearch configuration file; default port 9200; cross-origin settings go here
    lib						# related jar packages
    modules					# function modules
    plugins					# plugins
    logs					# logs
    
  3. Start ES by running elasticsearch.bat in bin from cmd (or double-click it), then open localhost:9200 in a browser


  4. Access test
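
    A successful visit returns JSON along these lines (a sketch: the name, uuid and build details will differ on your machine):

    {
      "name" : "LAPTOP-...",
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "...",
      "version" : {
        "number" : "7.6.1",
        ...
      },
      "tagline" : "You Know, for Search"
    }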

Install the visual interface plugin elasticsearch-head

  • If you have no front-end environment yet, first install the basics this Vue-based project needs (Node.js, npm).
  1. Download address: http://github.com/mobz/elasticsearch-head/

  2. After decompression, open the home directory with cmd

    npm install 
    
    # node -v
    # npm -v
    # npm install -g cnpm --registry=https://registry.npm.taobao.org
    # cnpm -v
    # cnpm install
    
    npm run start
    
  3. If you visit 127.0.0.1:9100 directly, you will hit cross-origin problems; modify elasticsearch.yml:

    http.cors.enabled: true
    http.cors.allow-origin: "*"
    
  4. Restart ES service and connect 127.0.0.1:9100 again

    When ES is understood as a database, you can create an index (think: a database), which stores documents (the data in the library) organized by types

  5. Use head just as a data-browsing tool; all subsequent queries go through Kibana

6, Kibana overview and installation

ELK introduction - Understanding

  1. ELK is the acronym of Elasticsearch, Logstash and Kibana. It is also known as Elastic Stack on the market.

  2. ElasticSearch is a distributed, near-real-time search platform framework based on Lucene, accessed through a RESTful interface. For big-data full-text search scenarios like Baidu and Google, ElasticSearch can serve as the underlying support framework, which shows how powerful its search capability is. It is commonly just called ES.

  3. Logstash is ELK's central data-flow engine. It collects data in different formats from different sources (files / data stores / MQ), filters it, and outputs it to different destinations (files / MQ / redis / elasticsearch / kafka, etc.).

  4. Kibana displays elasticsearch's data through a friendly interface and provides real-time analysis.

  5. Whenever ELK comes up, developers generally take it to mean a log-analysis technology stack. In fact, ELK is not limited to log analysis; it supports any other data analysis and collection scenario. Log analysis and collection is just the most representative use, not the only one.


  • Data collection and cleaning (Logstash) -> analysis, search, storage (ElasticSearch) -> visualization (Kibana)

Introduction to Kibana

  • Kibana is an open-source analysis and visualization platform for ElasticSearch, used to search and view data stored in ElasticSearch indexes interactively. Kibana supports advanced data analysis and display through a variety of charts, making massive data easier to understand. It is simple to operate, with a browser-based user interface that lets you quickly create dashboards that show ElasticSearch query results in real time. Setting up Kibana is very simple: with no coding or additional infrastructure, you can complete the installation and start monitoring ElasticSearch indexes within a few minutes.

  • Official website: https://www.elastic.co/cn/kibana

  • The Kibana version must be consistent with the ElasticSearch version.

  1. Decompress

  2. Start it up

  3. Access test: localhost:5601

  4. Development tool options for testing (Postman, curl, head, Chrome plugins, Kibana)

  5. To switch the UI to Chinese, edit kibana.yml under the *\kibana-7.6.1-windows-x86_64\config path:

    #i18n.locale: "en"
    i18n.locale: "zh-CN"
    

7, ES core concept

summary

ElasticSearch related concepts: what are clusters, nodes, indexes, types, documents, shards, and mappings.

elasticsearch is document-oriented. Below is a comparison between a relational database and elasticsearch. Everything is JSON

ElasticSearch    Relational DB
indexes          databases
types            tables
documents        rows
fields           columns

The elastic search (cluster) can contain multiple indexes (databases), each index can contain multiple types (tables), each type contains multiple documents (rows), and each document contains multiple fields (columns).

  1. Physical design:

    An index is divided into multiple shards;

    even a single node is itself a cluster, with the default "cluster_name": "elasticsearch"

  2. Logical design:

    Documents

    Just individual pieces of data

    Each key-value pair forms a JSON object; fastjson converts them automatically

    Types

    Can be left unset; ES will guess and set the field type

    Indexes

    The "database"

  3. Physical design: how nodes and shards work

Inverted index

  1. It filters out all irrelevant data entirely, improving query efficiency

  2. An elasticsearch index is composed of multiple Lucene indexes.
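
A small illustration (two hypothetical documents): doc 1 contains "study elasticsearch" and doc 2 contains "study hadoop". The inverted index maps each term to the documents containing it:

    study          -> doc 1, doc 2
    elasticsearch  -> doc 1
    hadoop         -> doc 2

A query for "elasticsearch" reads only that one posting list and never touches doc 2.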

Summary key points

  1. Indexes
  2. Field types (mapping)
  3. Documents
  4. Shards (Lucene indexes, which use inverted indexes)

8, IK word breaker plug-in

What is an IK word breaker?

  1. Word segmentation splits a passage of Chinese (or other text) into keywords. When searching, ES segments both our query text and the data in the database or index library, then matches the segments. The default analyzer treats each Chinese character as a separate word, which obviously does not meet the requirements, so we install the Chinese word splitter ik to solve this problem.

  2. It is recommended to use ik Chinese word segmentation.

  3. IK provides two segmentation algorithms: ik_smart, the coarsest segmentation (fewest tokens), and ik_max_word, the most fine-grained segmentation (most tokens).

ik word splitter installation

  1. GitHub address: https://github.com/medcl/elasticsearch-analysis-ik/releases

  2. After downloading, extract it into the elasticsearch plugins directory: *\elasticsearch-7.6.1\plugins\ik

  3. Restart ES and watch the startup output


  4. Run elasticsearch-plugin list to view the loaded plugins

  5. kibana test

    # Original kibana example
    GET _search
    {
      "query": {
        "match_all": {}
      }
    }
    
    # ik_smart
    GET _analyze
    {
      "analyzer": "ik_smart",
      "text": "I like the paper version very much java book"
    }
    
    # ik_max_word
    GET _analyze
    {
      "analyzer": "ik_max_word",
      "text": "I like the paper version very much java book"
    }
    

ik word splitter: adding your own dictionary

# Step 1: 
Path: elasticsearch-7.6.1\plugins\ik\config\IKAnalyzer.cfg.xml
Edit it to contain: <entry key="ext_dict">zhout.dic</entry>

# Step 2: 
Create a file: elasticsearch-7.6.1\plugins\ik\config\zhout.dic
Add the custom word: paper version

  1. Restart ES

  2. The cmd window will log that zhout.dic was loaded

[2021-05-29T16:24:12,601][INFO ][o.w.a.d.Monitor          ] [LAPTOP-HRUQ7L7V] [Dict Loading] *\elasticsearch-7.6.1\plugins\ik\config\zhout.dic
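
A quick check in Kibana (assuming the zhout.dic entry from Step 2 was loaded): the custom word should now come back from ik_smart as a single token instead of being split:

# With zhout.dic loaded, the custom word stays intact as one term
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "I like the paper version very much java book"
}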

9, Rest style description, about index operation

method     url address                                                describe
PUT        localhost:9200/index name/type name/document id            create document (specify document id)
POST       localhost:9200/index name/type name                        create document (random document id)
POST       localhost:9200/index name/type name/document id/_update    modify document
DELETE     localhost:9200/index name/type name/document id            remove document
GET        localhost:9200/index name/type name/document id            query document (specify document id)
POST       localhost:9200/index name/type name/_search                query all data

Basic operation of index

  • Field types:
    1. String types: text, keyword
    2. Numeric types: long, integer, short, byte, double, float, half_float, scaled_float
    3. Date type: date
    4. Boolean type: boolean
    5. Binary type: binary
    6. and so on
  • If your document does not specify field types, ES will assign default field types for us
  • Extension: the GET _cat/ endpoints return a lot of current information about ES (health, indices, etc.)
# Syntax PUT / index name / ~ type name ~ / document id + data
# Syntax PUT / index name + rule

# Example 1, create an index
PUT /test1/type1/1
{
  "name" : "zhout",
  "age": 3
}

# Example 2: set field type: create rule
PUT /test2
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "long"
      },
      "birthday": {
        "type": "date"
      }
    }
  }
}

# Example 3: obtain rules and specific information
GET test2

# Example 4, add a document using the default _doc type
PUT /test3/_doc/1
{
  "name": "zhout",
  "age": 13,
  "birth": "1994-07-17"
}

# Example 5: get index
GET test3

# Example 6: get health value
GET _cat/health
GET _cat/indices?v

To modify a document, you can submit with PUT again, which overwrites it; there are other ways as well

# Syntax PUT: create anew or fully overwrite
# Syntax POST .../_update: partial update

# Example 1: overwrite update
PUT /test3/_doc/1
{
  "name": "zhouzhou",
  "age": 13,
  "birth": "1994-07-17"
}

# Example 2: partial update using POST _update
POST /test3/_doc/1/_update
{
  "doc": {
    "name": "zhouzhouzhou"
  }
}

DELETE removes either an index or a document record, depending on the request path.

# Syntax DELETE

# Example 1: DELETE
DELETE test1

10, Basic operation of documents (key points)

Basic operation of documents

# Syntax GET 	 fetch data
# Syntax PUT 	 overwrite data
# Syntax POST .../_update 	 update data; this method is recommended	

# Example 1, create an index
PUT /zhout/user/1
{
  "name": "Paper version",
  "age": 23,
  "desc": "A meal is as fierce as a tiger. At a glance, the salary is 2500",
  "tags": ["technical nerd","warm","Straight man"]
}

# Example 2, continue to create
PUT /zhout/user/2
{
  "name": "Zhang San",
  "age": 3,
  "desc": "Outlaw maniac",
  "tags": ["make friends","Travel","scumbag"]
}

# Example 3, continue to create
PUT /zhout/user/3
{
  "name": "Li Si",
  "age": 30,
  "desc": "mmp,I don't know how to describe it",
  "tags": ["Pretty girl","Travel","sing"]
}

# Example 4, obtaining data
GET zhout/user/1

# Example 5: POST without _update overwrites the whole document, just like PUT
POST zhout/user/1
{
  "doc": {
    "name": "Paper version"
  }
}

# Example 6: POST _update modifies only the fields in doc
POST zhout/user/1/_update
{
  "doc": {
    "name": "Paper version"
  }
}

Simple search

  • hits:

    • Index and document information
    • Total number of query results
    • The specific documents found
    • You can iterate over the returned data
    • _score: use it to judge which result matches best (see the sample response sketch after the examples below)
    # Syntax GET 	 GET data
    
    
    # Example 1, simple search
    GET zhout/user/1
    
    # Example 2: conditional search
    GET zhout/user/_search?q=name:Paper version
    
    # Example 3, overwrite document 1 with new data
    PUT /zhout/user/1
    {
      "name": "Paper version paper version 2",
      "age": 23,
      "desc": "A meal is as fierce as a tiger. At a glance, the salary is 2500",
      "tags": ["technical nerd","warm","Straight man"]
    }
    
    # Example 4: build the query as a JSON request body
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      }
    }
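
    For reference, a _search response is shaped roughly like this (a sketch; took, scores and totals depend on your data):

    {
      "took": 1,
      "timed_out": false,
      "_shards": { ... },
      "hits": {
        "total": { "value": 1, "relation": "eq" },
        "max_score": 1.0,
        "hits": [
          {
            "_index": "zhout",
            "_type": "user",
            "_id": "1",
            "_score": 1.0,
            "_source": { "name": "Paper version", ... }
          }
        ]
      }
    }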
    

Complex search: select with sorting, paging, highlighting, fuzzy query, exact query

  • Result offsets start from 0, the same as the data structures you already know

    • /search/{current}/{pagesize}
  • gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to)

    # Syntax GET 	 GET data
    # Multiple search conditions, range intervals, and shaping the returned result
    
    
    # Example 1, result filtering via _source (like SELECT name, desc in SQL)
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "_source": ["name", "desc"]
    }
    
    # Example 2, sorting results - through sort and fields
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "sort": [
        {
          "age": {
            "order": "desc"
          }
        }
      ]
    }
    
    # Example 3, paging - from (starting offset) and size (documents per page) control what is returned
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "sort": [
        {
          "age": {
            "order": "desc"
          }
        }
      ],
      "from": 0,
      "size": 1
    }
    
    # Example 4: Boolean query - bool + must requires every condition to match (AND in sql)
    GET zhout/user/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "name": "Paper version"
              }
            },
            {
              "match": {
                "age": "23"
              }
            }
          ]
        }
      }
    }
    
    # Example 5: Boolean query - bool + should matches if any condition matches (OR in sql)
    GET zhout/user/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "name": "Paper version"
              }
            },
            {
              "match": {
                "age": "23"
              }
            }
          ]
        }
      }
    }
    
    # Example 6, Boolean query - bool + must_not excludes matching documents (NOT in sql)
    GET zhout/user/_search
    {
      "query": {
        "bool": {
          "must_not": [
            {
              "match": {
                "age": "23"
              }
            }
          ]
        }
      }
    }
    
    # Example 7, filter - filter + range with gte/lte bounds the value range
    GET zhout/user/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "name": "Paper version"
              }
            }
          ],
          "filter": {
            "range": {
              "age": {
                "gte": 10,
                "lte": 30
              }
            }
          }
        }
      }
    }
    

Match multiple criteria

  • A term query looks up the exact term directly in the inverted index.

  • About tokenization:

    • term: direct exact lookup
    • match: parsed by the analyzer first (the query text is analyzed, then matched against the analyzed document)
  • Two field types: text (analyzed into tokens) and keyword (stored whole, not analyzed)

    # Syntax GET 	 GET data
    
    
    # Example 1, matching multiple values - values separated by spaces; a document matches if any one of them hits
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "tags": "Male Technology"
        }
      }
    }
    
    # Example 2: add test data
    PUT testdb
    {
      "mappings": {
        "properties": {
          "name": {
            "type": "text"
          },
          "desc":{
            "type": "keyword"
          }
        }
      }
    }
    
    PUT testdb/_doc/1
    {
      "name": "Paper version Java name",
      "desc": "First data in paper version"
    }
    
    PUT testdb/_doc/2
    {
      "name": "Paper version Java name",
      "desc": "First data in paper version 2"
    }
    
    # Example 3, analysis via _analyze - the keyword analyzer does not split the text
    GET _analyze
    {
      "analyzer": "keyword",
      "text": "Paper version Java name"
      
    }
    
    # Example 4, analysis via _analyze - the standard analyzer splits the text
    GET _analyze
    {
      "analyzer": "standard",
      "text": "Paper version Java name"
      
    }
    
    # Example 5a: exact query - name has type text, so it was analyzed by the standard tokenizer
    GET testdb/_search
    {
      "query": {
        "term": {
          "name": "paper"
        }
      }
    }
    
    # Example 5b: exact query - desc has type keyword, so it is stored as one whole term
    GET testdb/_search
    {
      "query": {
        "term": {
          "desc": "First data in paper version"
        }
      }
    }
    
    

Exact matching with multiple-value queries

  • MySQL can also do this kind of matching, but far more slowly

    • Match by criteria
    • Exact match
    • Interval range matching
    • Matching field filtering
    • Multi condition query
    • Highlight query
    # Exact matching of multiple value queries
    
    # Example 1: adding data
    PUT testdb/_doc/3
    {
      "t1": "22",
      "t2": "2020-4-7"
    }
    
    PUT testdb/_doc/4
    {
      "t1": "22",
      "t2": "2020-4-7"
    }
    
    # Example 2: accurately query multiple condition values
    GET testdb/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "term": {
                "t1": {
                  "value": "22"
                }
              }
            },
            {
              "term": {
                "t1": {
                  "value": "33"
                }
              }
            }
          ]
        }
      }
    }
    
    # Example 3: highlight query - highlight + fields; matched terms come back wrapped in <em> tags by default
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "highlight": {
        "fields": {
          "name": {}
        }
      }
    }
    
    # Example 4: highlight custom labels - highlight, pre_tags , post_tags 
    GET zhout/user/_search
    {
      "query": {
        "match": {
          "name": "Paper version"
        }
      },
      "highlight": {
        "pre_tags": "<p class='key' style='color:red'>", 
        "post_tags": "</p>", 
        "fields": {
          "name": {}
        }
      }
    }
    

11, Integrated SpringBoot

Find official documents

  1. Address: https://www.elastic.co/guide/en/elasticsearch/client/index.html

  2. Java REST Client - Recommended: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.x/index.html

  3. Java API - native API

  4. The native dependency:

    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.6.1</version>
    </dependency>
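
    Spring Boot manages a default elasticsearch version of its own, so it is common to pin the version through the elasticsearch.version property so that the client matches your server (shown for 7.6.1; adjust to your setup):

    <properties>
        <elasticsearch.version>7.6.1</elasticsearch.version>
    </properties>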
    
  5. Initialize the client object

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    
    RestHighLevelClient client = new RestHighLevelClient(
            RestClient.builder(
                    new HttpHost("localhost", 9200, "http"),
                    new HttpHost("localhost", 9201, "http")));
    
    // close the client when you are finished with it
    client.close();
    
    

integration testing

  1. Create an index
  2. Check whether an index exists
  3. Delete an index
  4. Create documents
  5. CRUD operations on documents (a runnable sketch of these steps follows)
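
A minimal sketch of the above steps using the 7.6.1 high-level client; the index name test_index and the sample document are assumptions for illustration, not from the original notes:

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
    import org.elasticsearch.action.get.GetRequest;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.indices.CreateIndexRequest;
    import org.elasticsearch.client.indices.GetIndexRequest;
    import org.elasticsearch.common.xcontent.XContentType;
    
    public class EsIntegrationSketch {
        public static void main(String[] args) throws Exception {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")));
    
            // 1. Create an index
            client.indices().create(new CreateIndexRequest("test_index"), RequestOptions.DEFAULT);
    
            // 2. Check whether the index exists
            boolean exists = client.indices().exists(new GetIndexRequest("test_index"), RequestOptions.DEFAULT);
            System.out.println("index exists: " + exists);
    
            // 4. Create a document (explicit id, JSON source)
            client.index(new IndexRequest("test_index").id("1")
                    .source("{\"name\":\"zhout\",\"age\":3}", XContentType.JSON), RequestOptions.DEFAULT);
    
            // 5. Read the document back (one of the CRUD operations)
            System.out.println(client.get(new GetRequest("test_index", "1"),
                    RequestOptions.DEFAULT).getSourceAsString());
    
            // 3. Delete the index to clean up
            client.indices().delete(new DeleteIndexRequest("test_index"), RequestOptions.DEFAULT);
    
            client.close();
        }
    }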

Topics: ElasticSearch