es learning notes

Posted by funkyfela on Mon, 03 Jan 2022 04:40:46 +0100

1, es introduction

1 es introduction

Introduction: ES is an open-source distributed (full-text) search engine built on Apache Lucene. It provides a simple RESTful API that hides the complexity of Lucene.
ES vs. relational database analogy:

Relational DB:  databases → tables → rows → fields (columns)
Elasticsearch:  indexes (indices) → types → documents → fields

ES is also a distributed document database in which every field can be indexed and searched. It can be scaled horizontally to hundreds of servers to store and process petabytes of data.
It can store, search, and analyze large amounts of data in a very short time, and is usually used as the core engine for complex search scenarios.
ES is designed for high availability and scalability. On the one hand, the system can be expanded by upgrading hardware, which is called Vertical Scale / Scaling Up.
On the other hand, adding more servers is called Horizontal Scale / Scaling Out. Although ES can take advantage of more powerful hardware, vertical scaling has its limits; real scalability comes from horizontal scaling, adding more nodes to the cluster to share the load and increase reliability. ES is distributed by nature: it knows how to manage multiple nodes to scale out and achieve high availability, and the application does not need to change at all.
Gateway represents the persistent storage of the ES index. By default ES keeps the index in memory and persists it to the gateway when memory is full; when the cluster is shut down or restarted, the index data is read back from the gateway. Examples include LocalFileSystem, HDFS, Amazon S3, etc.
Distributed Lucene Directory is a directory made up of a series of Lucene index files. It is responsible for managing these files, including reading and writing data and adding and merging indexes.
River represents the data source. It exists in ES as a plug-in.  
Mapping is very similar to data types in statically typed languages: just as a variable declared as int can only hold int values, a mapping field declared as double can only store double values.
Mapping not only tells ES what type each field is; it also tells ES how the data should be indexed and whether it is indexed at all.
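For example, a minimal sketch of an explicit mapping (the index name my_index and the fields are made up for illustration; the syntax assumes ES 6.x, which still uses a type name under mappings):

PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "price":         { "type": "double" },
        "title":         { "type": "text" },
        "internal_code": { "type": "keyword", "index": false }
      }
    }
  }
}

Here price only accepts doubles, title is analyzed full text, and internal_code is stored but not searchable because index is set to false.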
Search Module, the search module, which supports common search operations.
Index Module, the index module, which supports common index operations.
Discovery is mainly responsible for discovering the cluster's master node and for reacting when a node suddenly leaves or joins, reallocating shards accordingly; this relies on the discovery mechanism.
The default implementation of the discovery mechanism is Zen, which supports unicast and multicast; point-to-point discovery is also supported, and discovery can also be provided by plug-ins, for example EC2 discovery.
Scripting is the scripting module.
Transport handles communication between ES nodes and between clients and the cluster, supporting protocols such as Thrift, Memcached, and HTTP.
RESTful Style API, which exposes the API in a RESTful style.
3rd plugins, representing third-party plug-ins.
Java (Netty), the underlying development / network framework.
JMX, used for monitoring.

2 es node

A node is an instance of ES. Starting ES on a server gives you one node; starting ES on another server gives you another node. You can even start multiple ES processes on one server and thus have multiple nodes on a single server. Multiple nodes can join the same cluster.
When an Elasticsearch node starts, it uses multicast (or unicast, if the user changes the configuration) to find the other nodes in the cluster and establish connections with them.
There are three types of nodes. The first is the client_node, which mainly distributes requests, similar to a router. The second is the master_node, the primary node: all additions, deletions, and shard allocation are handled by the master node (at the bottom layer Elasticsearch has no in-place update operation; the update offered by the upper layer is actually a delete followed by an add); it can also serve search requests. The third is the data_node, which only handles search operations; which data_node a request is assigned to is decided by the client_node, and the data on a data_node is synchronized from the master_node.
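As a quick sanity check (an optional illustration, not from the original notes; the column names assume a 6.x cluster), the _cat API can show which role each node currently plays:

GET _cat/nodes?v&h=name,node.role,master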

3 es shards

An index can store a large amount of data, beyond the hardware limits of a single node. For example, an index with 1 billion documents may occupy 1 TB of disk space, and no single node has that much disk; or a single node handling every search request may respond too slowly.
To solve this, ES can divide an index into multiple pieces, called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional, independent "index" that can be placed on any node in the cluster.
Sharding is important for two main reasons:
1. It allows you to split / scale your content volume horizontally.
2. It allows you to perform distributed, parallel operations across shards (potentially on multiple nodes), improving performance / throughput.
How a shard is distributed and how its documents are aggregated back into search results are managed entirely by ES and are transparent to you as a user.
In a network / cloud environment, failures can occur at any time: a shard or node may go offline or disappear for any reason. In such cases a failover mechanism is very useful and highly recommended. For this purpose, ES lets you create one or more copies of a shard; these are called replica shards, or simply replicas.
Replication is important for two reasons:
(1) It provides high availability in case a shard or node fails. For this reason, note that a replica shard is never placed on the same node as its original / primary shard.
(2) It scales your search volume / throughput, because searches can run in parallel on all replicas.
In short, each index can be split into multiple shards, and an index can also be replicated zero times (meaning no replicas) or more. Once replicated, each index has primary shards (the original shards that are the source of replication) and replica shards (copies of the primary shards). The number of shards and replicas can be specified when the index is created. After the index is created you can change the number of replicas dynamically at any time, but you cannot change the number of shards.
By default, each index in ES has 5 primary shards and 1 replica, which means that if your cluster has at least two nodes, the index will have 5 primary shards plus another 5 replica shards (1 full copy), 10 shards in total. The shards of an index may be stored on one host or on multiple hosts in the cluster, depending on the number of machines; the exact placement of primary and replica shards is decided by ES's internal strategy.
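For example (a minimal sketch, reusing the test index name from the Kibana examples below), the replica count can be changed on a live index through the _settings API, while number_of_shards is fixed at index creation:

PUT test/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}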

4 indexing principle in ES

(1) Traditional relational database
A binary tree has O(log N) search cost, and inserting a new node does not require moving all existing nodes, so a tree-structured index balances insert and query performance. On this basis, and taking into account disk read characteristics (sequential vs. random reads), traditional relational databases use data structures such as B-Tree / B+Tree for their indexes.
(2)ES
Inverted index
Term (word): after a piece of text passes through the analyzer, a stream of words is produced; each of these words is called a term.
Term Dictionary: as the name suggests, it maintains the terms; it can be understood as the set of all terms.
Term Index: an index built over the terms so that a term can be found faster.
Posting List: the inverted list records, for each term, the list of documents in which the term appears and the positions of the term within those documents. Each record is called a posting. From the posting list you can tell which documents contain a given term. (PS: a real posting list stores not only the document ID but also other information, such as term frequency (how many times the term occurs), offsets, etc.)
(PS: if compared with modern Chinese dictionaries, Term is equivalent to words, Term Dictionary is equivalent to the Chinese Dictionary itself, and Term Index is equivalent to the directory index of the dictionary)
We know that every document has an ID; if none is specified when inserting, Elasticsearch generates one automatically.
Elasticsearch builds an inverted index for each field. For example, "Zhang San", "Beijing" and "22" are terms, and [1, 3] is a posting list: an array storing the IDs of all documents that contain a given term.
Knowing the document ID, you can find the document quickly. To quickly find the term that matches a given keyword, we also need an index over the terms themselves; a B-Tree index works well for this (MySQL is the classic example of B-Tree indexing).
The process of looking up a term is roughly the same as looking up a record by ID in MyISAM.
In MyISAM the index and the data are separate: the index gives the address of the record, and the record is then fetched from that address.
In an inverted index, the Term Index gives the position of the term in the Term Dictionary, which in turn points to the Posting List; with the posting list you can then fetch the documents by ID.
(PS: compared with MyISAM, the Term Index corresponds to the index file and the Term Dictionary to the data file.)
(PS: strictly speaking this is three steps, but Term Index and Term Dictionary can be treated as a single step of finding the term. So an inverted index can be understood as: find the posting list for a word, then fetch the document records from the postings in that list.)
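As a tiny illustration (a hypothetical demo index and documents, mirroring the Zhang San / Beijing example above; 6.x syntax with a type name), indexing two documents and running an exact term query shows the inverted index at work: the query looks up the term "Beijing" and returns the documents in its posting list.

PUT demo/doc/1
{ "name": "Zhang San", "city": "Beijing", "age": 22 }

PUT demo/doc/2
{ "name": "Li Si", "city": "Beijing", "age": 30 }

# city is dynamically mapped as text with a keyword sub-field,
# so the exact term lookup is done against city.keyword
GET demo/_search
{
  "query": {
    "term": { "city.keyword": "Beijing" }
  }
}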

2, starting es on linux

Modify network.host in config/elasticsearch.yml: change the IP from 127.0.0.1 to the specific IP, or to 0.0.0.0.
(Note that there must be a space after the colon: network.host: <ip>!!!)
Create a new user:
useradd es
Give the es user ownership of and permissions on the ES directory (as root):
chown -R es:es /elasticsearch-6.8.0
chmod -R 777 /elasticsearch-6.8.0
Note: alternatively, run inside the folder: chmod -X * -R
Switch to es user:
su es
Switch to the es bin directory and start it (in the background with -d, or inside a screen session):
./elasticsearch -d
screen ./elasticsearch

Test es startup success:
ps aux | grep elasticsearch
curl -XGET ip:port
curl -u elastic:password -X GET ip:port
curl -u elastic -XPUT 'http://ip:port/_xpack/security/user/elastic/_password?pretty' -H 'Content-Type: application/json' -d'
{
"password" : "123456"
}
'
Note:
After modifying network.host in config/elasticsearch.yml, startup may fail with errors such as the following:
ERROR: [2] bootstrap checks failed
[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
Solution:
Edit /etc/security/limits.conf and add the following:

* soft nofile 65536
* hard nofile 65536
After this file is modified, the user needs to log in again for it to take effect.

ERROR: [2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
Solution:
Edit /etc/sysctl.conf and add the following:
vm.max_map_count=655360
After saving, run sysctl -p to apply the change, then restart ES.

3, operating ES from kibana

1 query

In the examples below, test is the index name. Kibana console: http://127.0.0.1:5601/

# Cluster correlation
# Query cluster health status
GET _cluster/health
# Query all nodes
GET _cat/nodes
# Query index and partition distribution
GET _cat/shards
# Query all plug-ins
GET _cat/plugins

# Index related query
# Query all indexes and their sizes
GET _cat/indices
# Query an index's mapping structure
GET index_name/_mapping
# Query all indexes (settings and mappings)
GET _all
# Query all indexes with the same prefix
GET base_*/_search
# Query all index templates
GET _template
# View index mapping structure
GET lawzllaw/_mapping
# View index settings
GET lawzllaw/_settings
# View quantity in index
GET lawzllaw/_count


# View how a field's value is analyzed / indexed (can also use "field": "title.keyword")
GET test/_analyze
{
  "field": "title", 
  "text": "Measures for the administration of public rental housing"
}
# "From": 0 and "size": 1 can be added to query all data to limit the quantity
GET test/_search
{
  "query": {
    "match_all": {}
  }
}
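For example (just an illustrative variation of the match_all query above), from and size page through the results:

GET test/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 10
}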
# Elasticsearch queries: match and term
# match analyzes (segments) the query text and returns all documents that match the resulting terms
GET test/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Regulations on public energy conservation"
      }
    }
  }
}

# match_phrase also analyzes the query text, but a matching document must contain all of the resulting terms, in order
# slop parameter: how far apart the terms may be (i.e. how many positions may be skipped) while the document still counts as a match
GET test/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "Regulations on public energy conservation",
        "slop" : 1
      }
    }
  }
}


# multi_match queries several fields at once
GET test/_search
{
  "query": {
    "multi_match": {
        "query" : "bmw",
        "fields" : ["title", "content"]
    }
  }
}
# Scoring: use best_fields when documents whose single best field matches well should score highest;
#         use most_fields when the more fields match, the higher the document should score;
#         use cross_fields when the query terms are expected to be spread across different fields
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "How much is my BMW engine",
      "type": "best_fields",
      "fields": [
        "tag",
        "content"
      ],
      "tie_breaker": 0.3
    }
  }
}


# term means exact match: the query text is not analyzed, and the document must contain the whole search term
# Before using term, check whether the field is analyzed; string fields are analyzed by default
# For an analyzed field, query its keyword sub-field instead, e.g. title.keyword
GET test/_search
{
  "query": {
    "term": {
      "title": "Car maintenance"
    }
  }
}
# bool compound query: must, should, must_not (multi-condition query)
"""
must: the document must match the condition
should: there are usually several conditions under should; a document matches if it satisfies at least one of them
must_not: the document must not match the condition
"""

GET test/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "content": "bmw"
        }
      },
      "must_not": {
        "term": {
          "title": "bmw"
        }
      }
    }
  }
}
GET test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "lease"
          }
        },
        {
          "match": {
            "timeliness": "1"
          }
        }
      ]
    }
  }
}

# should (or query):
GET test/_search
{
    "query": {
        "bool": {
            "must": {
                "bool" : { 
                    "should": [
                        { "match_phrase": { "unitname": "Beijing" }},
                        { "match_phrase": { "unitname": "Shanghai" }} ] 
                }
            }

        }
    }
}

# Range query with sorting: specify the range with gte and lte under range, and sort with asc / desc under sort
GET base_company_event*/_search
{
  "query": {
    "range": {
      "p_stock2201_f033n": {
        "gte": 1000,
        "lte": 2300
      }
    }
  },
  "sort": [
    {
      "p_stock2201_f033n": {
        "order": "asc"
      }
    }
  ]
}
# Aggregation queries
# (1) Range aggregation: use size and aggs; "field" names the attribute, and each from / to pair defines one bucket
GET test/_search
{
  "size": 20,
  "aggs": {
    "p_stock2201_f033n": {
      "range": {
        "field": "p_stock2201_f033n",
        "ranges": [
          {
            "from": 1000,
            "to": 2000
          },{
            "from": 2000,
            "to": 3000
          },{
            "from": 3000,
            "to": 4000
          }
        ]
      }
    }
  }
}

# Aggregate query group
GET test/_search
{
  "size": 0,
  "aggs": {
    "group_by_place": {
      "terms": {
        "field": "timeliness.keyword",
        "size": 10
      }
    }
  }
}


2 modification

POST test/doc/599966     # POST index/doc/id
{
    "timeliness":2
}
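Note that POST index/doc/id as above re-indexes (replaces) the whole document with the given body. For a partial update of a single field on a 6.x cluster, the _update endpoint can be used instead (same hypothetical document id as above):

POST test/doc/599966/_update
{
  "doc": {
    "timeliness": 2
  }
}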

#  Update a field value on all documents matching a condition
POST test/_update_by_query
{
  "query": {
    "match": {
      "effectlevel": "Judicial documents"
    }
  },
  "script": {
    "source": "ctx._source['effectlevel'] = '23'"

  }
}

3 delete

# Delete index
DELETE /test

# Delete documents matching a condition
POST /test/_delete_by_query
{
  "query": {
    "match": {
      "publishtime": "2021-04-05"      
    }
  }
}

# Clear all data in the index (the index itself is kept)
POST test/_delete_by_query
{
  "query": {"match_all": {}}
}

4 new

# Create index:
PUT /people
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "man": {
      "dynamic": "strict",
      "properties": {
        "name": {
          "type": "text"
        },
        "age": {
          "type": "integer"
        },
        "birthday": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        },
        "address":{
          "dynamic": "true",
          "type": "object"
        }
      }
    }
  }
}
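As a quick check of the mapping above (a made-up document), a document that only uses the declared fields indexes fine, while any extra top-level field would be rejected because of "dynamic": "strict":

PUT people/man/1
{
  "name": "Zhang San",
  "age": 22,
  "birthday": "1999-01-01",
  "address": {
    "city": "Beijing"
  }
}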

4, Python operations

# Introducing es
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

body = {}
# Connect es
ES_INDEX = "lawzllaw"
es = Elasticsearch()

# Create index
result = es.indices.create(index="lawzllaw", ignore=400)
# Delete Index
result = es.indices.delete(index='news', ignore=[400, 404])

# insert data
result = es.create(index='lawzllaw', doc_type='', id=1, body=body)
result = es.index(index='lawzllaw', doc_type='', body=body)
# Update data
result = es.update(index='lawzllaw', doc_type='', body=body, id=1)
result = es.index(index='lawzllaw', doc_type='', body=body, id=1)
result = es.update_by_query(index=ES_INDEX, body=body, doc_type="doc")
# Delete data
result = es.delete(index='lawzllaw', doc_type='', id=1)
result = es.delete_by_query(index='lawzllaw', body=body, doc_type='')
# Query: two types of get and search
result = es.get(index="lawzllaw", doc_type="test-type", id=1)
result = es.search(index='lawzllaw', body=body)

# Batch write, delete, update
doc = [
    {'index': {'_index': 'lawzllaw', '_type': 'typeName', '_id': 'idValue'}},
    {'name': 'jack', 'sex': 'male', 'age': 10},
    {'delete': {'_index': 'lawzllaw', '_type': 'typeName', '_id': 'idValue'}},
    {'create': {'_index': 'lawzllaw', '_type': 'typeName', '_id': 'idValue'}},
    {'name': 'lucy', 'sex': 'female', 'age': 20},
    {'update': {'_index': 'lawzllaw', '_type': 'typeName', '_id': 'idValue'}},
    {'doc': {'age': '100'}}
]

es.bulk(index='lawzllaw', doc_type='typeName', body=doc)

# Bulk updates can also be done by assembling a list of actions and writing them all at the end
ACTIONS = []
i = 0
for line in lines:  # lines: the records to be indexed
    action = {
        "_index": self.lawzllaw,
        "_type": self.index_type,
        "_id": i,  # _id can also be omitted and generated automatically
        "_source": {
            "date": line['date'],
            "source": line['source'].decode('utf8'),
            "link": line['link'],
            "keyword": line['keyword'].decode('utf8'),
            "title": line['title'].decode('utf8')}
    }
    i += 1
    ACTIONS.append(action)
success, _ = bulk(self.es, ACTIONS, index=self.lawzllaw, raise_on_error=True)

5, curl operations

6, ES vulnerability introduction

Elasticsearch unauthorized access vulnerability

1. Set x-pack user login authorization and configure ES

$ cat elasticsearch.yml
cluster.name: eryajf-search
node.name: es-node1
path.data: /data/elasticsearch7/data
path.logs: /data/elasticsearch7/log
network.host: 0.0.0.0
http.port: 9200
xpack.security.enabled: true # This configuration indicates that the xpack authentication mechanism is enabled
xpack.security.transport.ssl.enabled: true
cluster.initial_master_nodes: ["es-node1"]

Parameter Description:
xpack.security.enabled: indicates that the xpack authentication mechanism is enabled.
xpack.security.transport.ssl.enabled: if this is not set, es will not start and the following error is reported:
Transport SSL must be enabled if security is enabled on a [basic] license. Please set [xpack.security.transport.ssl.enabled] to [true] or disable security by setting [xpack.security.enabled] to [false]

2. Set user name and password:
After successful startup, we execute the following command to set the password.
./bin/x-pack/setup-passwords interactive
Then, according to the prompt, set the password for the three built-in accounts.
The system can also generate the passwords automatically; to do so, run the following command.
./bin/x-pack/setup-passwords auto

The role permissions of the three built-in accounts are explained as follows:
elastic account: has the superuser role; it is the built-in super user.
kibana account: has the kibana_system role. Kibana uses it to connect to and communicate with Elasticsearch; the Kibana server submits requests as this user to access the cluster monitoring APIs and the Kibana index. It cannot access user indexes.
logstash_system account: has the logstash_system role. Logstash uses it when storing monitoring information in Elasticsearch.
In addition, you can set the password through DSL statements in kibana.
POST _xpack/security/user/kibana/_password
{
"password": "elastic"
}

3. After changing the password, we also need to modify the configuration file of elasticsearch.
http.cors.enabled: true
http.cors.allow-origin: '*'
http.cors.allow-headers: Authorization,X-Requested-With,Content-Length,Content-Type

After editing elasticsearch/config/elasticsearch.yml, save the file and restart es.
Access via the head plug-in: http://localhost:9100/?base_uri=http://localhost:9200&auth_user=elastic&auth_password=password .

Topics: Python Database ElasticSearch