ElasticSearch Getting Started tutorial notes

Posted by kool_samule on Wed, 22 Dec 2021 13:37:55 +0100

Video tutorial: [Crazy God says Java] ElasticSearch 7.6.x, the latest complete tutorial, easy to understand
Video address: https://www.bilibili.com/video/BV17a4y1x7zq
No freeloading: thanks to Crazy God for sharing the video tutorial

ElasticSearch overview

ElasticSearch, or ES for short, is an open-source, highly scalable, distributed full-text search engine. It can store and retrieve data in near real time, scales well, and can be extended to hundreds of servers handling petabyte-level data. ES is also developed in Java and uses Lucene as its core to implement all indexing and search functions, but its purpose is to hide the complexity of Lucene behind a simple RESTful API, making full-text search easy.

According to the rankings of DB-Engines, an internationally recognized database evaluation site, ElasticSearch surpassed Solr in January 2016 and became the most popular search engine application.

Who is using:

  1. Wikipedia

  2. The Guardian (foreign news website)

  3. Stack Overflow (programming Q&A community)

  4. GitHub (open source code management)

  5. E-commerce website

  6. Log data analysis: Logstash collects the logs, ES performs the complex data analysis; together with Kibana this is the ELK stack (ElasticSearch + Logstash + Kibana)

    ... ...

ES and Solr

Introduction to ElasticSearch

Elasticsearch is a real-time distributed search and analysis engine. It makes it possible for you to process big data at an unprecedented speed.

It is used for full-text search, structured search, analysis, and a mixture of the three:

Wikipedia uses Elasticsearch to provide full-text search with highlighted keywords, as well as search-as-you-type and did-you-mean suggestions.

The Guardian uses Elasticsearch to combine user logs with social network data, providing real-time feedback to its editors so they can see how the public is responding to newly published articles.

Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.

GitHub uses Elasticsearch to search 130 billion lines of code.

But Elasticsearch is not only for large enterprises; it has also helped startups such as DataDog and Klout turn their initial ideas into scalable solutions. Elasticsearch can run on your laptop or process petabytes of data across hundreds of servers. Elasticsearch is built on top of the open-source search engine library Apache Lucene™. Whether in the open-source or proprietary world, Lucene can be regarded as the most advanced, best-performing and most capable search engine library to date.

However, Lucene is just a library. To use it, you must work in Java and integrate it directly into your application. Worse still, Lucene is very complex; you need a deep understanding of information retrieval to grasp how it works.

Elasticsearch is also developed in Java and uses Lucene as its core to implement all indexing and search functions, but its purpose is to hide the complexity of Lucene behind a simple RESTful API and make full-text search simple.

Solr introduction

Solr is a top-level open-source project under Apache, developed in Java. It is a full-text search server based on Lucene. Solr provides a richer query language than Lucene, is configurable and extensible, and optimizes indexing and search performance.

Solr can run standalone in Servlet containers such as Jetty and Tomcat. Indexing with Solr is very simple: POST an XML document describing the fields and their contents to the Solr server, and Solr adds, deletes or updates the index according to that document. Searching only requires an HTTP GET request; the query results are returned in XML, JSON and other formats for you to parse and lay out on a page. Solr does not provide a UI-building feature, but it does provide an administration interface through which you can view Solr's configuration and operation.

Solr is an enterprise search server based on Lucene; in essence it is a wrapper around Lucene.

Solr is a standalone enterprise search application server that exposes an API similar to a web service. Users submit formatted files to the search engine server over HTTP to build indexes, and can also issue search requests and receive the results.

Lucene introduction

Lucene is a sub-project of the Apache Software Foundation's Jakarta project. It is an open-source full-text retrieval engine toolkit, but it is not a complete full-text search engine; rather, it is the architecture of one, providing a complete query engine, an indexing engine and several text analysis engines (English and German). Lucene's purpose is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it. Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation, and it offers a simple yet powerful application programming interface. It is a mature, free and open-source tool in the Java ecosystem and has long been the most popular free Java information retrieval library. People therefore often speak of the "information retrieval library"; although it is related to search engines, the two should not be confused.

Lucene is the architecture of a full-text retrieval engine. What is a full-text search engine?

A full-text search engine is a search engine in the true sense. Representative foreign examples include Google, Fast/AllTheWeb, AltaVista, Inktomi, Teoma and WiseNut; in China the best known is Baidu. They extract information from each website on the Internet (mainly web page text), build a database, retrieve the records matching the user's query, and return the results in some ranked order, so they are the real search engines.

From the perspective of where the search results come from, full-text search engines fall into two types. One type has its own retrieval program (indexer), commonly known as a "spider" or "robot", and builds its own web database; its search results come directly from that database, like the seven engines mentioned above. The other type rents another engine's database and presents the search results in a custom format, such as the Lycos engine.

ElasticSearch and Solr comparison

  • Solr is faster when simply searching existing data

  • When indexing in real time, Solr will cause I/O blocking and poor query performance. ElasticSearch has obvious advantages

  • As the amount of data increases, the search efficiency of Solr will become lower, while ElasticSearch has no significant change

  • Moving the search infrastructure from Solr to ElasticSearch yields close to a 50x improvement in search performance

ElasticSearch vs Solr summary

  1. ES basically works out of the box (unzip and use), and is simple. Solr installation is slightly more complex.
  2. Solr uses Zookeeper for distributed management, while Elasticsearch has built-in distributed coordination.
  3. Solr supports data in more formats, such as JSON, XML and CSV, while Elasticsearch only supports the JSON format.
  4. Solr officially provides more features, while Elasticsearch focuses on the core functionality; advanced features are mostly provided by third-party plug-ins, e.g. a graphical interface relies on Kibana.
  5. Solr queries fast but updates its index slowly (i.e. inserts and deletes are slow), so it suits query-heavy applications such as e-commerce;
    • ES builds its index fast (i.e. real-time queries are fast), so it suits real-time search such as Facebook or Sina Weibo.
    • Solr is a powerful solution for traditional search applications, while Elasticsearch is better suited to emerging real-time search applications.
  6. Solr is relatively mature, with a larger and more established community of users, developers and contributors. Elasticsearch has fewer developers and maintainers, iterates quickly, and has a higher learning cost.

ElasticSearch installation

  • Environment requirements: JDK 1.8 or above, Node.js

  • Download: Official Website https://www.elastic.co/

  • Windows installation package: elasticsearch-7.15.2-windows-x86_64.zip; extract and use

  • Directory structure:

    + bin						startup scripts
    + config					configuration files
    	+ log4j2.properties		log configuration
    	+ jvm.options			JVM configuration; it is recommended to reduce the memory
    	+ elasticsearch.yml		elasticsearch configuration file; default port 9200; cross-origin settings
    + lib						related jar packages
    + logs						log files
    + modules					modules
    + plugins					plug-ins
    
  • Start: double-click /elasticsearch-7.15.2/bin/elasticsearch.bat

    Visit 127.0.0.1:9200 and you get the following page:

    {
      "name" : "CRATER-PC",
      "cluster_name" : "elasticsearch",				// Cluster name. A service is also a cluster
      "cluster_uuid" : "8AWbPdNgRymiY6VXtuiAuA",
      "version" : {
        "number" : "7.15.2",
        "build_flavor" : "default",
        "build_type" : "zip",
        "build_hash" : "93d5a7f6192e8a1a12e154a2b81bf6fa7309da0c",
        "build_date" : "2021-11-04T14:04:42.515624022Z",
        "build_snapshot" : false,
        "lucene_version" : "8.9.0",
        "minimum_wire_compatibility_version" : "6.8.0",
        "minimum_index_compatibility_version" : "6.0.0-beta1"
      },
      "tagline" : "You Know, for Search"
    }
    

elasticsearch-head installation

  • Download address: https://github.com/mobz/elasticsearch-head

  • After decompression, it is a front-end web project

  • cd elasticsearch-head

  • npm install

  • npm run start

  • open http://localhost:9100/

After opening the page there is a cross-origin problem (different port), because the page on port 9100 is connecting to port 9200. Solve it as follows:

  • Stop ElasticSearch, modify elasticsearch.yml, and add the following:

    http.cors.enabled: true
    http.cors.allow-origin: "*"
    

    Restart ElasticSearch; the head connection now succeeds.

    You can create a new index at will. For now, an index can be understood as a database.

    As for complex queries: head is just a data display tool, and writing commands there is not recommended; use Kibana for that later.

Kibana installation

Learn about ELK

ELK is the acronym of Elasticsearch, Logstash and Kibana. It is also called Elastic Stack on the market.

Elasticsearch is a near-real-time search platform framework based on Lucene, distributed, and accessed through a RESTful interface. For big-data full-text search scenarios like Baidu or Google, Elasticsearch can serve as the underlying support framework; clearly the search capability it provides is powerful. In the industry, Elasticsearch is often abbreviated as ES.

Logstash is the central data-flow engine of ELK. It collects data in different formats from different sources (files / data stores / MQ) and, after filtering, supports output to different destinations (files / MQ / Redis / Elasticsearch / Kafka, etc.).

Kibana can display elasticsearch data through friendly pages and provide real-time analysis.

Many developers would say that ELK is the umbrella name for a log-analysis technology stack, but in fact ELK is not limited to log analysis; it supports any scenario of data analysis and collection. Log analysis and collection is simply the most representative use case, not the only one.

Install Kibana

Kibana is an open source analysis and visualization platform for Elasticsearch, which is used to search and view the data interactively stored in Elasticsearch index. With kibana, advanced data analysis and display can be carried out through various charts. Kibana makes massive data easier to understand. It is easy to operate, and the browser based user interface can quickly create a dashboard to display the Elasticsearch query dynamics in real time. Setting up kibana is very simple. Without coding or additional infrastructure, kibana installation can be completed and Elasticsearch index monitoring can be started in a few minutes.

Official website: https://www.elastic.co/cn/kibana

  • Download kibana-7.15.2-windows-x86_64.zip. Note: the Kibana version must match the ES version.

  • After decompression it is a standard project

  • Start: double-click /bin/kibana.bat

  • Visit localhost:5601

  • Find Dev Tools in the left menu bar and use development tools on Kibana

  • For localization (internationalization), there is a zh-CN.json translation file under /x-pack/plugins/translations/translations.

    Edit the config/kibana.yml file with the following modification:

    #i18n.locale: "en"
    i18n.locale: "zh-CN"
    

    Restart Kibana and the language becomes Chinese

ElasticSearch core concepts

ElasticSearch is document-oriented. A comparison between a relational database and ElasticSearch:

| Relational DB    | ElasticSearch |
| ---------------- | ------------- |
| databases        | indexes       |
| tables           | types         |
| rows             | documents     |
| fields (columns) | fields        |

ElasticSearch (cluster) can contain multiple indexes (databases), each index can contain multiple types (tables), each type contains multiple documents (rows), and each document contains multiple fields (columns).

Physical design:

ElasticSearch divides each index into multiple shards in the background, and each shard can be migrated between different servers in the cluster.

Logic design:

An index type contains multiple documents, such as document 1 and document 2. When we index a document, we can locate it in this order: index => type => document ID; that combination addresses one specific document. Note that the ID does not have to be an integer; it is actually a string.
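
For instance, with the crater index and user type created later in these notes, a single document is addressed like this in Kibana Dev Tools:

    GET /crater/user/1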

Documents

As mentioned earlier, elasticsearch is document-oriented, which means the smallest unit of indexing and searching is the document. In elasticsearch, documents have several important properties:

  • Self-contained: a document contains both the fields and the corresponding values, i.e. key: value pairs
  • Hierarchical: a document can contain child documents, which is how complex logical entities are built. A document is a JSON object (fastjson can convert to and from it automatically).
  • Flexible structure: documents do not depend on a predefined schema. In relational databases, fields must be defined in advance before they can be used; in elasticsearch fields are very flexible, and we can sometimes omit a field or add a new field dynamically.
  • Although we can add or omit fields at will, the type of each field still matters. For example, an age field could be a string or an integer. Elasticsearch stores the mapping between fields and types, along with other settings; this mapping is specific to each type of each index, which is why types are sometimes called mapping types in elasticsearch.
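
As a concrete illustration (a hypothetical document, not taken from the tutorial), a self-contained, hierarchical document might look like this; the address object is a nested child document:

    {
      "name": "Crater",
      "age": 24,
      "address": {
        "city": "Beijing",
        "zip": "100000"
      },
      "tags": ["technical nerd", "Straight man"]
    }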

type

Types are logical containers for documents, just as tables are containers for rows in relational databases. The definitions of the fields within a type are called a mapping; for example, name could be mapped to a string type.

We said that documents are schema-less: they do not need to contain every field defined in the mapping. So what does elasticsearch do when a document adds a new field? Elasticsearch automatically adds the field to the mapping, but if it cannot be sure of the field's type, it starts guessing: if the value is 18, elasticsearch assumes it is an integer. Elasticsearch can guess wrong, however, so the safest approach is to define the required mapping in advance. This is the same as with relational databases: define the fields first, then use them, and avoid surprises.

Indexes

It's the database.

An index is a container of mapping types; an index in ElasticSearch is a very large collection of documents. The index stores the fields of its mapping types and other settings, and the data is then stored on the individual shards.

Physical design: how nodes and shards work

A cluster has at least one node, and a node is one ElasticSearch process; a node can hold multiple indexes. By default, a newly created index is composed of 5 primary shards, and each primary shard has a replica shard.

The figure in the original tutorial shows a cluster of three nodes. A primary shard and its replica shard are never placed on the same node, which prevents data loss when a node goes down. In fact, a shard is a Lucene index, a directory of files containing an inverted index, which lets ElasticSearch find the documents containing specific keywords without scanning every document.

Inverted index

ElasticSearch uses a structure called an inverted index, with Lucene's inverted index as the underlying implementation. This structure is well suited to fast full-text search: an inverted index consists of a sorted list of all the unique words that appear in the documents, and for each word, the list of documents that contain it.

  • For example, there are now two documents, each containing the following:

    Study every day, good good up to forever		# Contents of document 1
    To forever, study every day, good good up		# Contents of document 2
    
  • To create an inverted index, we first split each document into separate words (also called terms or tokens), then build a sorted list of all unique terms, and then record which documents each term appears in:

    | term    | doc_1 | doc_2 |
    | ------- | ----- | ----- |
    | Study   | √     | ×     |
    | To      | ×     | √     |
    | every   | √     | √     |
    | forever | √     | √     |
    | day     | √     | √     |
    | study   | ×     | √     |
    | good    | √     | √     |
    | to      | √     | ×     |
    | up      | √     | √     |
  • Now, to search for "to forever", we just look up the documents containing each term:

    | term    | doc_1 | doc_2 |
    | ------- | ----- | ----- |
    | to      | √     | ×     |
    | forever | √     | √     |
  • Both documents match, but the first document has a higher match score (weight) than the second. If there are no other conditions, both documents containing the keywords are returned.

  • Another example: if we search blog posts by their tags, the inverted index has the following structure:

    Blog posts (raw data):

    | Blog post ID | tags          |
    | ------------ | ------------- |
    | 1            | python        |
    | 2            | python        |
    | 3            | linux, python |
    | 4            | linux         |

    Index list (inverted index):

    | tag    | Blog post IDs |
    | ------ | ------------- |
    | python | 1, 2, 3       |
    | linux  | 3, 4          |

    If you search for articles with the python tag, looking up the inverted index is much faster than scanning all the raw data: just find the tag row and read off the related article IDs.

Comparison between Elasticsearch index and Lucene index:

In elasticsearch the word "index" is used frequently, and this is how the term is used: an elasticsearch index is divided into multiple shards, and each shard is a Lucene index, so one elasticsearch index is composed of multiple Lucene indexes. Unless otherwise specified, when we say index we mean an elasticsearch index.

All the next operations are completed in the Console under Dev Tools in kibana.

IK word breaker plug-in

Word segmentation: splitting a piece of text into keywords. When searching, the search text is segmented, the data in the database or index is segmented, and the two are then matched. The default analyzer treats each Chinese character as a separate word; for example, a sentence meaning "my name is Crater" would be split into the single characters "I", "call", "meteor", "stone", "pit", which obviously does not meet the requirements, so we install the Chinese word splitter IK to solve this problem.

IK provides two segmentation algorithms: ik_smart and ik_max_word. ik_smart produces the coarsest (minimum) segmentation, while ik_max_word produces the most fine-grained segmentation.
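
Once the plug-in is installed (next section), the difference can be compared in Kibana Dev Tools with the _analyze API; a quick sketch (the sample text is only an illustration):

    GET _analyze
    {
      "analyzer": "ik_smart",
      "text": "中华人民共和国"
    }

    GET _analyze
    {
      "analyzer": "ik_max_word",
      "text": "中华人民共和国"
    }

ik_smart returns the whole phrase as one term, while ik_max_word also emits every shorter word it can find in its dictionary.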

Install IK word breaker

  1. Download: https://github.com/medcl/elasticsearch-analysis-ik/releases

    Note: the version needs to be corresponding

    elasticsearch-analysis-ik-7.15.2.zip

  2. Go to the elasticsearch-7.15.2/plugins directory, create a new ik directory, and unzip the plug-in into it

  3. Restart ElasticSearch and observe that the plug-in has been loaded

    You can also list installed plug-ins with the elasticsearch-plugin command

  4. Using Kibana test

    ik_smart, minimum segmentation:

    ik_max_word, the most fine-grained segmentation, exhausting all possible combinations in the dictionary:

  5. During testing it turns out that, with either segmentation algorithm, the desired custom words "Mixed form and meaning" and "Mouse tail juice" cannot be split out. Words like these need to be added to a user-defined dictionary.

  6. Customize the dictionary: create a new dictionary file (crater.dic) in the config directory of the IK plug-in, with the following content

    Mixed form and meaning
    Mouse tail juice
    

    Then modify the IKAnalyzer.cfg.xml file in the same directory to register the custom dictionary:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
    	<comment>IK Analyzer Extended configuration</comment>
    	<!--Users can configure their own extended dictionary here -->
    	<entry key="ext_dict">crater.dic</entry>
    	 <!--Users can configure their own extended stop word dictionary here-->
    	<entry key="ext_stopwords"></entry>
    	<!--Users can configure the remote extension dictionary here -->
    	<!-- <entry key="remote_ext_dict">words_location</entry> -->
    	<!--Users can configure the remote extended stop word dictionary here-->
    	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
    </properties>
    

    After a restart, the log shows the custom dictionary being loaded

  7. Test again, the custom words are split out

Rest style description

REST is a software architectural style, not a standard; it only provides a set of design principles and constraints. It is mainly used for client-server interaction. Software designed in this style can be more concise and better layered, and it is easier to implement mechanisms such as caching.

Basic command description

| method | url address | description |
| ------ | ----------- | ----------- |
| PUT | 127.0.0.1:9200/index name/type name/document ID | Create a document (specify document ID) |
| POST | 127.0.0.1:9200/index name/type name | Create a document (random document ID) |
| POST | 127.0.0.1:9200/index name/type name/document ID/_update | Modify a document |
| DELETE | 127.0.0.1:9200/index name/type name/document ID | Delete a document |
| GET | 127.0.0.1:9200/index name/type name/document ID | Query a document by ID |
| POST | 127.0.0.1:9200/index name/type name/_search | Query all data |

Basic operation of index

New index

  1. Create an index

    # In later versions the type name is being removed
    PUT /index name/type name/document ID
    {request body}
    
    # The return value includes
    "_index": 	the current index name
    "_type"		the current type
    "_version"	the current version (1 means it has not been modified)
    

    You can also see the new index on the Overview and Indices pages in elasticsearch-head

    The Data Browser page shows the details of the data. The index entry is added automatically, which is why ES can be treated like a database when you first learn it.

  2. Data type of field

    • String type

      text,keyword

    • Numerical type

      long, integer, short, byte, double, float, half_float, scaled_float

    • Date type

      date

    • Boolean type

      boolean

    • Binary type

      binary

    • And so on

  3. Specify the field types

    This is similar to defining a database table: create the index and define the types of its fields, which can also be seen as establishing the rules (a sketch follows this list).

  4. Get test2 index information

  5. Default data type

    _doc is the default type. Types will be gradually deprecated in future versions; they can be left unspecified or written explicitly as _doc.

    If the document field does not specify a type, ElasticSearch will configure the field type by default.

    Extension: use the GET _cat commands to view ElasticSearch information

    Other commands:

    GET _cat/aliases
    GET _cat/allocation
    GET _cat/count
    GET _cat/fielddata
    GET _cat/health
    GET _cat/indices
    GET _cat/master
    GET _cat/nodeattrs
    GET _cat/nodes
    GET _cat/pending_tasks
    GET _cat/plugins
    GET _cat/recovery
    GET _cat/repositories
    GET _cat/segments
    GET _cat/shards
    GET _cat/snapshots
    GET _cat/tasks
    GET _cat/templates
    GET _cat/thread_pool
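
    Returning to steps 3 and 4 above, here is a hedged sketch of creating an index with explicit field types and then reading it back (the index name test2 follows these notes; the field names are only illustrative):

    PUT /test2
    {
      "mappings": {
        "properties": {
          "name": { "type": "text" },
          "age": { "type": "long" },
          "birthday": { "type": "date" }
        }
      }
    }

    # Get the test2 index information (step 4)
    GET /test2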
    

Modify index

There are two implementation schemes for modification

  • Scheme 1: PUT. Resubmit the whole document with the modified data in the request body so that it overwrites the old data.

    The drawback of this scheme is that if the request body misses a field, that field's data is lost.

  • Scheme 2: POST _update, which only changes the fields supplied in the request body (see the sketch below and the "Modify document" section later)
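
    A minimal sketch of the two schemes, using a hypothetical test3 index:

    # Scheme 1: PUT overwrites the whole document; omitted fields are lost
    PUT /test3/_doc/1
    {
      "name": "Crater",
      "age": 24
    }

    # Scheme 2: POST _update only changes the fields listed under "doc" (ES 7 index-level endpoint)
    POST /test3/_update/1
    {
      "doc": {
        "name": "Crater updated"
      }
    }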

Delete index

  • command

    # Delete index
    DELETE /test1
    

    # remove document
    DELETE /test3/_doc/1
    

Basic operation of document

Simple query

  1. First, add some data

    PUT /crater/user/1
    {
    	"name": "Crater",
    	"age": 24,
    	"desc": "A meal is as fierce as a tiger. At a glance, the salary is two thousand five",
    	"tags": ["technical nerd","Treasure boy","Straight man"]
    }
    
    PUT /crater/user/2
    {
    	"name": "Zhang San",
    	"age": 35,
    	"desc": "Outlaw maniac",
    	"tags": ["Miss Luo","criminal law"]
    }
    
    PUT /crater/user/3
    {
    	"name": "Li Si",
    	"age": 30,
    	"desc": "Indescribable",
    	"tags": ["lady","dance"]
    }
    

  2. Query data

    # command
    GET /crater/user/1
    

Modify document

  1. Update data

    Mode 1:

    # command
    PUT /crater/user/3
    {
    	"name": "Li si123",
    	"age": 30,
    	"desc": "Indescribable",
    	"tags": ["lady","dance"]
    }
    

    Mode 2: Recommended

    # command
    POST /crater/user/3/_update
    {
    	"doc": {
    		"name": "Li Si 456"
    	}
    }
    

Simple condition query

  1. GET query

    # command
    GET /crater/user/_search?q=name:Crater
    

Complex condition query

Conditions, projection (field filtering), sorting, paging, highlighting, fuzzy query, exact query

  • Generally, the GET /crater/user/_search?q=name:Crater form is not used; instead, the query parameters are passed as a JSON request body

    # Add another document for testing
    PUT /crater/user/4
    {
    	"name": "Crater master",
    	"age": 18,
    	"desc": "The operation is as fierce as a tiger, and the salary is 2500 yuan",
    	"tags": ["technical nerd","Treasure boy","Straight man"]
    }
    
    # command
    GET /crater/user/_search
    {
      "query": {
        "match": {
          "name": "Meteorite"
        }
      }
    }
    

  • You can also filter the properties of the query, similar to the projection of the database

    # command
    GET /crater/user/_search
    {
      "query": {
        "match": {
          "name": "Meteorite"
        }
      },
      "_source": ["name","desc"]
    }
    

  • You can add sorting, such as sorting in reverse order according to age

    # command
    GET /crater/user/_search
    {
      "query": {
        "match": {
          "name": "Meteorite"
        }
      },
      "_source": ["name","age"],
      "sort": [
        {
          "age": {
            "order": "desc"
          }
        }
      ]
    }
    

  • You can add paging queries

    # command
    GET /crater/user/_search
    {
      "query": {
        "match": {
          "name": "Meteorite"
        }
      },
      "_source": ["name","age"],
      "sort": [
        {
          "age": {
            "order": "desc"
          }
        }
      ],
      "from": 0,	# Starting page number, starting from 0
      "size": 1		# Number of entries per page
    }
    

Boolean query

  • must (and), multi condition query, all conditions are met

    # command
    GET /crater/user/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "name": "Meteorite"
              }
            },
            {
              "match": {
                "age": "24"
              }
            }
          ]
        }
      }
    }
    

  • should (or) satisfy one condition

    # command
    GET /crater/user/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "name": "Meteorite"
              }
            },
            {
              "match": {
                "age": "24"
              }
            }
          ]
        }
      }
    }
    

  • must_not(not)

    # command
    GET /crater/user/_search
    {
      "query": {
        "bool": {
          "must_not": [
            {
              "match": {
                "name": "Meteorite"
              }
            },
            {
              "match": {
                "age": "24"
              }
            }
          ]
        }
      }
    }
    

  • Use filter together with the query to filter the results

    # command
    GET /crater/user/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "name": "Meteorite"
              }
            }
          ],
          "filter": {
            "range": {
              "age": {
                "gte": 20
              }
            }
          }
        }
      }
    }
    

Match multiple criteria

  • Multiple keywords are separated by spaces; a document matching any one of them is returned, and the score (_score) can then be used to judge and rank the results

    # command
    GET /crater/user/_search
    {
      "query": {
        "match": {
          "tags": "Men's house"
        }
      }
    }
    

Precise query

term query is an accurate search directly through the terms specified in the inverted index

About participle

  • term: direct and accurate query
  • match: the query string is analyzed by the analyzer first (the documents were analyzed when indexed, and the analyzed query is then matched against them)

The two string field types are text and keyword

  • text will be parsed by the word splitter. Before indexing, these texts will be segmented, converted into word combinations, and indexed

  • keyword type fields can only be retrieved by themselves

  • Build test

    PUT testdb
    {
      "mappings": {
        "properties": {
          "name": {
            "type": "text"
          },
          "desc": {
            "type": "keyword"
          }
        }
      }
    }
    
    PUT testdb/_doc/1
    {
      "name": "Crater name",
      "desc": "Crater desc1"
    }
    
    PUT testdb/_doc/2
    {
      "name": "Crater name",
      "desc": "Crater desc2"
    }
    
  • The difference between the two types of word splitter

    # Analyzed: the standard analyzer splits the text into terms
    GET _analyze
    {
      "analyzer": "standard",
      "text": "Crater name"
    }
    

    # Not analyzed: the keyword analyzer keeps the whole text as a single term
    GET _analyze
    {
      "analyzer": "keyword",
      "text": "Crater desc"
    }
    
    

  • test

    # Match to match {"name": "crater"}, you can query two pieces of data
    GET /testdb/_doc/_search
    {
      "query": {
        "match": {
          "name": "Crater"
        }
      }
    }
    
    # Match to match {"desc": "crater"}, no data can be queried,
    # The keyword type field will not be parsed by the word splitter, but can only be retrieved by itself, such as {"desc": "meteorite crater desc1"}
    GET /testdb/_doc/_search
    {
      "query": {
        "match": {
          "desc": "Crater"
        }
      }
    }
    
    # Using term with {"name": "Crater"} or {"desc": "Crater"} finds no data, because term requires an exact match against the indexed terms
    # To get results:
    # for name, term must match one of the analyzed (lowercased) terms, e.g. "crater" or "name"
    # for desc, term must match the full value "Crater desc1" or "Crater desc2"
    GET /testdb/_doc/_search
    {
      "query": {
        "term": {
          "name": "Crater"
        }
      }
    }
    GET /testdb/_doc/_search
    {
      "query": {
        "term": {
          "desc": "Crater"
        }
      }
    }
    
    # By contrast, match with {"name": "Crater"} would still find both documents
    
  • Summary:

    match (fuzzy) and term (exact) describe how strictly the query string is matched.

    For example, "Crater name" is a text field; when it was indexed it was analyzed into separate terms. Therefore term cannot find the whole string "Crater name", but term can find one of the analyzed terms, because those terms sit in the inverted index and can be matched exactly.

    "Crater desc1" is a keyword field; it was not analyzed when indexed and is stored as a single term, so whether you use match or term, the full value is required.

    In short: match and term control whether the query string is analyzed; text and keyword control whether the stored data is analyzed.

Multiple value matching exact query

  • Add test data

    PUT testdb/_doc/3
    {
      "t1": "22",
      "t2": "2021-12-20"
    }
    
    PUT testdb/_doc/4
    {
      "t1": "33",
      "t2": "2021-12-17"
    }
    
  • Accurately query multiple values with term

    GET testdb/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "term": {
                "t1": "22"
              }
            },
            {
              "term": {
                "t1": "33"
              }
            }
          ]
        }
      }
    }
    

Highlight query

  • When querying, use highlight to specify the fields to highlight

    In the query results, the matched words in the specified field are wrapped in <em> tags by default:

    # command
    GET crater/user/_search
    {
      "query": {
        "match": {
          "name": "Meteorite"
        }
      },
      "highlight": {
        "fields": {
          "name": {}
        }
      }
    }
    

  • The <em> tag is the default; you can also specify custom tags

    # command
    GET crater/user/_search
    {
      "query": {
        "match": {
          "name": "Meteorite"
        }
      },
      "highlight": {
        "pre_tags": "<p class='key' style='color:red'>", 
        "post_tags": "</p>", 
        "fields": {
          "name": {}
        }
      }
    }
    

    As you can see, the words are wrapped in custom tags

Integrated SpringBoot

Client documentation: https://www.elastic.co/guide/en/elasticsearch/client/index.html

Java REST Client is recommended: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/index.html

Select the Java High Level REST Client and follow its documentation:

  1. Introducing native dependencies

    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.15.2</version>
    </dependency>
    
  2. initialization

    Bind nodes in a cluster. A node is also a cluster

    RestHighLevelClient client = new RestHighLevelClient(
            RestClient.builder(
                    new HttpHost("localhost", 9200, "http"),
                    new HttpHost("localhost", 9201, "http")));
    

    It needs to be closed after use

    client.close();
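
    Since RestHighLevelClient implements Closeable, a try-with-resources block can also take care of closing it; a small sketch (same localhost:9200 host as above):

    try (RestHighLevelClient client = new RestHighLevelClient(
            RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
        // use the client here; it is closed automatically when the block exits
        System.out.println(client.info(RequestOptions.DEFAULT).getVersion());
    }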
    

test

  • Build the project environment; the most important part is the dependencies

  • Customize ES dependency to ensure consistency with local

    <properties>
        <java.version>1.8</java.version>
        <!-- custom ES version, to stay consistent with the local installation -->
        <elasticsearch.version>7.15.2</elasticsearch.version>
    </properties>
    
  • Inject RestHighLevelClient (client)

    package com.crater.config;
    
    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    
    @Configuration
    public class ElasticSearchClientConfig {
    
        @Bean
        public RestHighLevelClient restHighLevelClient() {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
            );
            return client;
        }
    }
    

Index API test

  • Test index creation

    @SpringBootTest
    class CraterEsApiApplicationTests {
        @Autowired
        @Qualifier("restHighLevelClient")
        private RestHighLevelClient client;
    
        /**
         * Test index creation
         */
        @Test
        void testCreateIndex() throws IOException {
            // 1. Create index request
            CreateIndexRequest request = new CreateIndexRequest("crater_index");
    
            // 2. Execute creation request
            IndicesClient indices = client.indices();
            CreateIndexResponse createIndexResponse = indices.create(request, RequestOptions.DEFAULT);
            System.out.println(createIndexResponse);
        }
    }
    
    # print contents
    org.elasticsearch.client.indices.CreateIndexResponse@38e8e841
    
  • Test whether the index exists (get index)

    /**
     * Test get index
     */
    @Test
    void testExistIndex() throws IOException {
        GetIndexRequest request = new GetIndexRequest("crater_index");
        boolean exists = client.indices().exists(request, RequestOptions.DEFAULT);
        System.out.println(exists);
    }
    
  • Test delete index

    /**
     * Test delete index
     */
    @Test
    void testDeleteIndex() throws IOException {
        DeleteIndexRequest request = new DeleteIndexRequest("crater_index");
        AcknowledgedResponse delete = client.indices().delete(request, RequestOptions.DEFAULT);
        System.out.println(delete.isAcknowledged());
    }
    

Document API testing
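
The tests below serialize a User object with fastjson. These notes do not show that class; a minimal sketch consistent with the JSON seen in the results ({"age":17,"name":"Crater"}) might look like this (hypothetical, field names inferred from the output):

    public class User {
        private String name;
        private int age;

        public User() {}                       // fastjson needs a no-args constructor
        public User(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }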

  • Test add document

    /**
     * Test add document
     */
    @Test
    void testAddDocument() throws IOException {
        // create object
        User user = new User("Crater", 17);
        // Create request
        IndexRequest request = new IndexRequest("crater_index");
    
        // Rule: this corresponds to PUT crater_index/_doc/1
        request.id("1");
        request.timeout(TimeValue.timeValueSeconds(1));
    
        // Data put request JSON
        request.source(JSON.toJSONString(user), XContentType.JSON);
    
        // The client sends a request to get the response result
        IndexResponse indexResponse = client.index(request, RequestOptions.DEFAULT);
        System.out.println(indexResponse.status());
        System.out.println(indexResponse.toString());
    }
    
    # Return value
    CREATED
    IndexResponse[index=crater_index,type=_doc,id=1,version=1,result=created,seqNo=0,primaryTerm=1,shards={"total":2,"successful":1,"failed":0}]
    
    
  • To obtain a document, first determine whether it exists

    /**
     * To obtain a document, first determine whether it exists: GET crater_index/_doc/1
     */
    @Test
    void testIsExist() throws IOException {
        GetRequest getRequest = new GetRequest("crater_index", "1");
        // Do not return the _source context
        getRequest.fetchSourceContext(new FetchSourceContext(false));
        // Do not fetch any stored fields
        getRequest.storedFields("_none_");
    
        boolean exists = client.exists(getRequest, RequestOptions.DEFAULT);
        System.out.println(exists);
    }
    

    Get document

    /**
     * Get document
     */
    @Test
    void testGetDocument() throws IOException {
        GetRequest getRequest = new GetRequest("crater_index", "1");
        GetResponse getResponse = client.get(getRequest, RequestOptions.DEFAULT);
        System.out.println(getResponse.getSourceAsString());
        System.out.println(getResponse);
    }
    
    # Return value
    {"age":17,"name":"Crater"}
    {"_index":"crater_index","_type":"_doc","_id":"1","_version":1,"_seq_no":0,"_primary_term":1,"found":true,"_source":{"age":17,"name":"Crater"}}
    
    
  • Modify document information

    /**
     * Update document information
     */
    @Test
    void testUpdateRequest() throws IOException {
        UpdateRequest request = new UpdateRequest("crater_index", "1");
        request.timeout(TimeValue.timeValueSeconds(1));
    
        User user = new User("Crater YYDS", 3);
        request.doc(JSON.toJSONString(user), XContentType.JSON);
    
        UpdateResponse updateResponse = client.update(request, RequestOptions.DEFAULT);
        System.out.println(updateResponse.status());
    }
    
  • Batch insert

    /**
     * Batch insert
     */
    @Test
    void testBulkRequest() throws IOException {
        BulkRequest bulkRequest = new BulkRequest();
        bulkRequest.timeout("10s");
    
        List<User> users = new ArrayList<>();
        users.add(new User("Crater 01", 14));
        users.add(new User("Crater 02", 14));
        users.add(new User("Crater 03", 14));
        users.add(new User("Crater 04", 14));
        users.add(new User("Crater 05", 14));
    
        // Batch request
        for (int i = 0; i < users.size(); i++) {
            bulkRequest.add(
                // Batch update and batch delete. You can modify different requests here
                new IndexRequest("crater_index")
                .id((i + 1) + "")
                .source(JSON.toJSONString(users.get(i)), XContentType.JSON)
            );
        }
        BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
        System.out.println(bulkResponse.status());
    }
    

Hands-on project: JD (Jingdong) search

Environment construction

  1. Build the project environment according to the test environment

  2. application.properties configuration

    server.port=9090
    # Turn off thymeleaf cache
    spring.thymeleaf.cache=false
    
  3. Copy the static files into the project directory

    File acquisition: follow Crazy God (the tutorial author) to get them. Knowledge is priceless; support the original.

  4. New controller

    @Controller
    public class IndexController {
    
        @RequestMapping({"/","index"})
        public String index() {
            return "index";
        }
    }
    
  5. Start project, access http://127.0.0.1:9090/

Crawler

  1. Search for an item on JD Mall and note the URL of the request

    https://search.jd.com/Search?keyword=java

  2. jsoup package is introduced to parse the data of web pages

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>
    
  3. Create a new HtmlParseUtil class under the utils package

    Test it; the data can be fetched. (The Content entity used below is sketched after this list.)

    @Component
    public class HtmlParseUtil {
        public static void main(String[] args) throws IOException {
            List<Content> javas = new HtmlParseUtil().parseJD("MacBok");
            for (Content java : javas) {
                System.out.println(java);
            }
        }
    
        public List<Content> parseJD(String keyWord) throws IOException{
            String url = "https://search.jd.com/Search?keyword=" + keyWord;
    
            // When parsing a web page, the Document returned by Jsoup is equivalent to the browser's Document object
            Document document = Jsoup.parse(new URL(url), 30000);
            // Action Document
            Element element = document.getElementById("J_goodsList");
            // Get li tag
            Elements elements = element.getElementsByTag("li");
            // Gets the content in the element
            List<Content> goodList  = new ArrayList<>();
            for (Element el : elements) {
                // Image-heavy sites lazy-load images for speed; JD stores the real image URL in the data-lazy-img attribute
                String img = el.getElementsByTag("img").eq(0).attr("data-lazy-img");
                String price = el.getElementsByClass("p-price").eq(0).text();
                String title = el.getElementsByClass("p-name").eq(0).text();
    
                goodList.add(new Content(title, img, price));
            }
            return goodList;
        }
    }
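
    The Content entity is not shown in these notes; a minimal sketch consistent with new Content(title, img, price) and the fields used by the front end might look like this (hypothetical):

    public class Content {
        private String title;
        private String img;
        private String price;

        public Content() {}                    // fastjson needs a no-args constructor
        public Content(String title, String img, String price) {
            this.title = title;
            this.img = img;
            this.price = price;
        }

        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
        public String getImg() { return img; }
        public void setImg(String img) { this.img = img; }
        public String getPrice() { return price; }
        public void setPrice(String price) { this.price = price; }
    }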
    

Write business

  1. Write Service layer

    @Service
    public class ContentService {
        @Autowired
        private RestHighLevelClient restHighLevelClient;
        @Autowired
        private HtmlParseUtil htmlParseUtil;
    
        public Boolean parseContent(String keywords) throws IOException {
            List<Content> contents = htmlParseUtil.parseJD(keywords);
            // Insert the crawled data into ES
            BulkRequest bulkRequest = new BulkRequest();
            bulkRequest.timeout("2m");
            for (int i = 0; i < contents.size(); i++) {
                bulkRequest.add(
                        new IndexRequest("jd_goods")
                                .source(JSON.toJSONString(contents.get(i)), XContentType.JSON)
                );
            }
            BulkResponse bulk = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
            return !bulk.hasFailures();
        }
    }
    
  2. Write Controller layer

    @RestController
    public class ContentController {
        @Autowired
        private ContentService contentService;
    
        @GetMapping("/parse/{keywords}")
        public Boolean parse(@PathVariable("keywords") String keywords) throws IOException {
            return contentService.parseContent(keywords);
        }
    }
    
  3. Calling this interface writes the crawled data into ES (create the jd_goods index first)

  4. Read the data back from ES to implement the search function

    Service layer:

    /**
     * Search ES with pagination
     */
    public List<Map<String, Object>> searchPage(String keyword, int pageNo, int pageSize) throws IOException {
        if (pageNo < 1) {
            pageNo = 1;
        }
    
        // Condition search
        SearchRequest searchRequest = new SearchRequest("jd_goods");
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    
        // Paging: from expects a record offset, so convert the 1-based page number
        sourceBuilder.from((pageNo - 1) * pageSize);
        sourceBuilder.size(pageSize);
    
        // Accurate matching
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title", keyword);
        sourceBuilder.query(termQueryBuilder);
        sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    
        // Perform search
        searchRequest.source(sourceBuilder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    
        // Parse the results
        List<Map<String, Object>> list = new ArrayList<>();
        for (SearchHit documentFields : searchResponse.getHits().getHits()) {
            list.add(documentFields.getSourceAsMap());
        }
        return list;
    }
    

    Controller layer

    @GetMapping("/search/{keywords}/{pageNo}/{pageSize}")
    public List<Map<String, Object>> search(@PathVariable("keywords") String keywords,
                                            @PathVariable("pageNo") int pageNo,
                                            @PathVariable("pageSize") int pageSize) throws IOException {
        return contentService.searchPage(keywords, pageNo, pageSize);
    }
    

    Access this interface and test it

Front end display

Vue is used, but without the vue-cli scaffolding

  1. Copy vue.min.js and axios.min.js to /static/js/

    Main code, see gitee for detailed code

    <!-- Product details -->
    <div class="view grid-nosku">
    
        <div class="product" v-for="result in results">
            <div class="product-iWrap">
                <!--Product cover-->
                <div class="productImg-wrap">
                    <a class="productImg">
                        <img :src="result.img">
                    </a>
                </div>
                <!--Price-->
                <p class="productPrice">
                    <em><b>¥</b>{{result.price}}</em>
                </p>
                <!--title-->
                <p class="productTitle">
                    <a> {{result.title}} </a>
                </p>
                <!-- Shop name -->
                <div class="productShop">
                    <span>Shop: Crazy God said Java </span>
                </div>
                <!-- Transaction information -->
                <p class="productStatus">
                    <span>Monthly sales<em>999</em></span>
                    <span>Reviews <a>3</a></span>
                </p>
            </div>
        </div>
    </div>
    
    <!-- Front end use Vue -->
    <script th:src="@{/js/axios.min.js}"></script>
    <script th:src="@{/js/vue.min.js}"></script>
    <script>
    
    new Vue({
        el: '#app',     // bind the outermost div
        data: {
            keyword: "",    // Search keywords
            results: []     // search result
        },
        methods: {
            searchKey() {
                var keyword = this.keyword;
    
                // Docking back-end interface
                axios.get('search/'+keyword+'/1/10').then(res => {
                    this.results = res.data;
                })
            }
        }
    })
    
    </script>
    
  2. test

Highlight

  1. Add a highlight constructor to the ContentService

    // Highlight 
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    highlightBuilder.requireFieldMatch(false);  // set to false when multiple fields need highlighting
    highlightBuilder.field("title");
    highlightBuilder.preTags("<span style='color:red'>");
    highlightBuilder.postTags("</span>");
    sourceBuilder.highlighter(highlightBuilder);
    
  2. Parse the highlighted field and replace the original content

    for (SearchHit hit : searchResponse.getHits().getHits()) {
        // Resolve highlighted fields
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        HighlightField title = highlightFields.get("title");
        Map<String, Object> sourceAsMap = hit.getSourceAsMap(); // Original results of query
        // Replace the original field with a highlighted field
        if (title != null) {
            Text[] fragments = title.fragments();
            StringBuilder newTitle = new StringBuilder();
            for (Text text : fragments) {
                newTitle.append(text);
            }
            sourceAsMap.put("title", newTitle.toString());      // replace
        }
    
        list.add(sourceAsMap);
    }
    
  3. For the page title, render the value as HTML so the highlight tags take effect

    <!--title-->
    <p class="productTitle">
        <a v-html="result.title">  </a>
    </p>
    
  4. Test, the keyword is highlighted

All done, time to celebrate!!!

Topics: ElasticSearch solr search engine