Elasticsearch Series - A Simple Start

Posted by andyg2 on Tue, 19 Nov 2019 01:55:25 +0100

outline

This article mainly introduces the data format of Elasticsearch Document, compares the modeling of Java applications and relational databases, describes the basic cluster state query, the basic CRUD operation example of Document and the bulk batch processing example by writing Restful API on Kibana platform.

Document data format

The data model of Java application system is object-oriented, some objects are complex, traditional business system, data needs to fall into relational database. When designing the model in the database field, complex POJO objects will be designed as one-to-one or one-to-many relationships, which will be flattened. When querying, multiple tables are needed to query and restore the format of POJO objects. Elasticsearch is a document database. Document stores data structure that is consistent with POJO and uses JSON format, which makes querying data more convenient.

Document document data example:

{
  "fullname" : "Three Zhang",
  "text" : "hello elasticsearch",
  "org": {
      "name": "development",
      "desc": "all member are lovely"
  }
}

Restful API makes requests easier

The previous article mentioned that Elasticsearch works with Kibana, and the Dev Tools menu on the Kibana interface allows you to send a Reestful request for Elasticsearch.Subsequent Restful API requests, without exception, are executed on the Kibana platform.

Let's try a few requests for cluster information first

  1. Check cluster health

GET /_cat/health?v

epoch      timestamp cluster        status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1573626290 14:24:50  hy-application yellow          1         1     1  1    0    0       1             0                  -                 50.3%

From the above, you can see the number of node s, shard s, etc. There is also a state of cluster, which shows yellow, why is it yellow? Clusters have green, yellow, and red states, which are defined as follows:

  • green: primary and replica shares of each index are active
  • yellow: the primary share of each index is active, but some replica shares are not active and are not available
  • red: not all primary shard s of indexes are active, some indexes have missing data

Our example starts only one elasticsearch instance, only one node, because the index defaults to five primary shards and five replica shards, and primary shards and replica shards under the same node cannot be assigned on one machine (fault tolerant mechanism), all but one primary shard is assigned and started, replica shard does not have a second node to start, becauseIt's the yellow state.If you want to become a green judgment, start another node.

  1. See what indexes are in the cluster

GET /_cat/indices?v

health status index                  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   location               48G_CgE7TiWomlYsyQW1NQ   5   1          3            0       11kb           11kb
yellow open   company_mem            s9DKUeyWTdCj2J8BaFYXRQ   5   1          3            0       15kb           15kb
yellow open   %{[appname]}           ysT9_OibR5eSRu19olrq_w   5   1         32            0    386.5kb        386.5kb
yellow  open   .kibana                4yS67TTOQGOD7l-uMtICOg   1   0          2            0       12kb           12kb
yellow open   tvs                    EM-SvQdfSaGAXUADmDFHVg   5   1          8            0     16.3kb         16.3kb
yellow open   company_org            wIOqfx5hScavO13vvyucMg   5   1          3            0     14.6kb         14.6kb
yellow open   blog                   n5xmcGSbSamYphzI_LVSYQ   5   1          1            0      4.9kb          4.9kb
yellow open   website                5zZZB3cbRkywC-iTLCYUNg   5   1         12            0     18.2kb         18.2kb
yellow open   files                  _6E1d7BLQmy9-7gJptVp7A   5   1          2            0      7.3kb          7.3kb
yellow open   files-lock             XD7LFToWSKe_6f1EvLNoFw   5   1          1            0        8kb            8kb
yellow open   music                  i1RxpIdjRneNA7NfLjB32g   5   1          3            0     15.1kb         15.1kb
yellow open   book_shop              1CrHN1WmSnuvzkfbVCuOZQ   5   1          4            0     18.1kb         18.1kb
yellow open   avs                    BCS2qgfFT_GqO33gajcg_Q   5   1          0            0      1.2kb          1.2kb
  1. View node information

GET /_cat/nodes?v

Response:

ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.17.137           38          68   0    0.08    0.03     0.05 mdi       *      node-1

We can see the name of the node.

  1. Create Index Command

Create an index named "location"

PUT /location?pretty

Response:

{
  "acknowledged": true,
  "shards_acknowledged": true
}

Look at the index and you can see the index location you just created

health status index                  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   location               48G_CgE7TiWomlYsyQW1NQ   5   1          3            0       11kb           11kb
yellow open   .kibana                4yS67TTOQGOD7l-uMtICOg   1   0          2            0       12kb           12kb
  1. Delete Index Command

Delete the index named "location"

DELETE /location?pretty

Looking at the index again, the index location you just created has been deleted

health status index                  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .kibana                4yS67TTOQGOD7l-uMtICOg   1   0          2            0       12kb           12kb

Is it easy, is the interface friendly?

Getting Started CRUD Operations

Introduces the basic CRUD operation of document, with English songs for children as background

  1. New Songs

The structure of the children's song we designed contains four fields: song name, lyrics content, language type, song length in seconds, which is placed in the JSON string. PUT syntax: <REST Verb> /<Index>/<Type>/<ID> REST Verbs can be PUT, POST, DELETE, and the contents behind the backslash are index name, type name, ID, respectively.

The request is as follows:

PUT /music/children/1
{
    "name": "gymbo",
    "content": "I hava a friend who loves smile, gymbo is his name",
    "language": "english",
    "length": "75"
}

Response content contains index name, type name, ID value, version version version number (Optimistic Lock Control), result type (create/update/deleted three), shard information, etc. If the index does not exist when added, it will be automatically created. Index name is the one specified at request, field type inside the document, according to the automatic mapping type defined by elastic search, and each fiElds are indexed upside down so they can be searched.

Why is the data not equal between total and success? When a document is added, documents are written to primary shard and replica shard, respectively, but since there is only one node and replica is not started, the total number of writes is 2. Primary shard succeeds with a quantity of 1.failed records only primary shard write failures.

The response is as follows:

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
  1. Modify song There are two ways to modify a document: incremental, modifying only the specified field, or replacing the document in its entirety, replacing all the original information
  • Incrementally modify the length value, noting that it is a POST request with _update at the end and doc as a fixed write, with the following requests:
POST /music/children/1/_update
{
  "doc": {
    "length": "76"
  }
}

Response:

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}
  • Replace the document in its entirety with the following request:
PUT /music/children/1
{
    "name": "gymbo",
    "content": "I hava a friend who loves smile, gymbo is his name",
    "language": "english",
    "length": "77"
}

Response:

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}

Careful children's shoes will do. Replacing the grammar of a document in its entirety is the same as creating an index, right!Is the same syntax, whether to create or update depends on whether the ID storage above does not exist, does not exist to create, does exist to update, and successfully updates version+1, but there is a disadvantage of this kind of full replacement, it must take complete attributes, otherwise the attributes are not declared.

Think of a scenario where you want to use this syntax: GET all the attributes first, then update the attributes to be updated, and then call the updated syntax of the full replacement.There are actually two reasons for this: complex operation, query before update, and large message size (relative to incremental update).Therefore, enterprise R&D generally uses incremental methods to make document updates.

  1. Query Songs

Query statement: GET/music/children/1

_source is the content of the JSON, the result of the query:

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "gymbo",
    "content": "I hava a friend who loves smile, gymbo is his name",
    "language": "english",
    "length": "75"
  }
}
  1. Delete song

Delete statement: DELETE/music/children/1

Response results:

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 4,
  "result": "deleted",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 3,
  "_primary_term": 1
}

bulk batch

The additions and deletions mentioned in the previous section are for a single document, and Elasticsearch also has a batch command that can be executed in bulk.

  1. Basic syntax examples for bulk Or use the above nursery songs as the background of the case
POST /music/children/_bulk
{"index":{"_id":"1"}}
{"name": "gymbo", "content": "I hava a friend who loves smile, gymbo is his name", "language": "english", "length": "75"}
{"create":{"_id":"2"}}
{"name": "wake me, shark me", "content": "don't let me sleep too late", "language": "english", "length": "55"}
{ "update": {"_id": "2", "retry_on_conflict" : 3} }
{ "doc" : {"content" : "don't let me sleep too late, gonna get up brightly early in the morning"} }
{ "delete": {"_id": "3" }}

Multiple commands can be executed together. If there are different indexes and types within a bulk request, indexes and types can also be written in body json, with two JSON strings for each operation:

{"action": {"metadata"}}
{"data"}

delete exception, it only needs one json string

There are several types of action s:

  • index: Normal PUT operation, document is created when ID does not exist, full replacement when ID exists
  • Create: Force creation, equivalent to the PUT/index/type/id/_create command
  • update: incremental modification operation performed
  • Delete: delete document action
  1. bulk considerations bulk api has strict syntax requirements, it cannot wrap lines within each json string, and there must be a line wrap between each json string, otherwise syntax errors will be reported. Since bulk is executed in batches with several commands, what should I do when it encounters errors?Will it interrupt? If there is a command execution error in the bulk request, it will skip directly and proceed to the next one. At the same time, the results of each command will be displayed separately in the response message. The correct result will be displayed, and the error will be prompted with the error log accordingly.

  2. bulk performance issues Since bulk is batch processing, there must be some relationship between bulk size and optimal performance. The memory requested by bulk is loaded into memory first. The bulk request depends on the number of command bars and the content of each command. An example diagram of the relationship between bulk and performance (expressing concepts, data is not referenced) is as follows:

The goal of bulk performance optimization is to find this inflection point, and you need to try an optimal size repeatedly, depending on the specific business data characteristics and the amount of concurrency. Common settings range from 1000 to 5000, and the size of bulk requests is controlled from 5 to 15MB (for reference only).

Summary

This article gives a brief introduction to the data format of the document, along with an explanation of the criteria for determining the red, yellow and green states of the Elasticsearch cluster. The focus is on the small CRUD cases and the bulk batch processing examples demonstrated on the kibana platform, which are the most basic and can take a little more time to become familiar with.

Focus on Java high-concurrency, distributed architecture, more technology dry goods to share and learn from, please follow Public Number: Java Architecture Community

Topics: Programming ElasticSearch JSON Java Database