Getting started with Elasticsearch: document data format and the basic ES RESTful API

Posted by wiseass on Fri, 22 Oct 2021 19:39:37 +0200

1. Document data format

1.1 Document Oriented Search and Analysis Engine

1.1.1 Object data stored in database

  • Application data is object-oriented and often has a complex, nested structure
  • To store an object in a relational database it has to be broken up into several flat tables, and every query has to reassemble the object from them, which is cumbersome
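import java.util.Date;

public class Employee {

  private String email;
  private String firstName;
  private String lastName;
  private EmployeeInfo info;
  private Date joinDate;

  // getters and setters omitted for brevity

}

// a top-level class cannot be declared private in Java
class EmployeeInfo {
  
  private String bio; // personality
  private Integer age;
  private String[] interests; // hobbies

  // getters and setters omitted for brevity

}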

EmployeeInfo info = new EmployeeInfo();
info.setBio("curious and modest");
info.setAge(30);
info.setInterests(new String[]{"bike", "climb"});

Employee employee = new Employee();
employee.setEmail("zhangsan@sina.com");
employee.setFirstName("san");
employee.setLastName("zhang");
employee.setInfo(info);
employee.setJoinDate(new Date());

Employee object: it holds the Employee class's own properties plus a nested EmployeeInfo object.

In a relational database this becomes two tables, employee and employee_info, so the data of one Employee object is split into Employee data and EmployeeInfo data:
employee table: email, first_name, last_name, join_date (4 fields)
employee_info table: bio, age, interests (3 fields), plus a foreign key field such as employee_id that references the employee table
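
To make the cost of the relational layout concrete, here is a minimal Java sketch. It assumes plain maps stand in for table rows, uses a made-up id value of 1 for the employee_id link, and stores the interests array as a comma-separated string; all class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: maps stand in for table rows; the id value is made up.
final class RelationalSplitSketch {

    // Row for the employee table (4 data fields plus a primary key).
    static Map<String, Object> employeeRow(long id) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", id);
        row.put("email", "zhangsan@sina.com");
        row.put("first_name", "san");
        row.put("last_name", "zhang");
        row.put("join_date", "2017/01/01");
        return row;
    }

    // Row for the employee_info table (3 data fields plus the foreign key).
    static Map<String, Object> employeeInfoRow(long employeeId) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("employee_id", employeeId);
        row.put("bio", "curious and modest");
        row.put("age", 30);
        row.put("interests", "bike,climb"); // an array has to be flattened somehow
        return row;
    }

    // Every read has to join the two rows back into one nested object.
    static Map<String, Object> reassemble(Map<String, Object> employee, Map<String, Object> info) {
        if (!employee.get("id").equals(info.get("employee_id"))) {
            throw new IllegalArgumentException("rows belong to different employees");
        }
        Map<String, Object> nestedInfo = new LinkedHashMap<>(info);
        nestedInfo.remove("employee_id");
        nestedInfo.put("interests", ((String) info.get("interests")).split(","));
        Map<String, Object> document = new LinkedHashMap<>(employee);
        document.remove("id");
        document.put("info", nestedInfo);
        return document;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = reassemble(employeeRow(1L), employeeInfoRow(1L));
        System.out.println(doc.keySet()); // top-level fields of the rebuilt object
    }
}
```

The reassemble step is the work a document store avoids, because the nested object is stored as it is.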

1.1.2 Object Data Stored in ES

  • ES is document-oriented: a document has the same nested structure as the object it represents. On top of this document structure, ES provides complex indexing, full-text search, aggregation analysis and other features.
  • An ES document is expressed in JSON
{
    "email":      "zhangsan@sina.com",
    "first_name": "san",
    "last_name": "zhang",
    "info": {
        "bio":         "curious and modest",
        "age":         30,
        "interests": [ "bike", "climb" ]
    },
    "join_date": "2017/01/01"
}

This shows the difference between the document format of ES and the relational format of a database.

2. Background: a product management case for an e-commerce website

An e-commerce website needs a back-end system built on ES that provides the following functions:

(1) CRUD operations on product information
(2) Perform simple structured queries
(3) Simple full-text retrieval and complex phrase retrieval can be performed
(4) Highlight the results of full-text retrieval
(5) Simple aggregation analysis of data

2.1 Simple cluster management

ES provides a set of APIs, called the cat API, for viewing a wide range of information about the cluster

2.1.1 Quick check cluster health: GET /_cat/health?v

response

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1488006741 15:12:21  elasticsearch yellow          1         1      1   1    0    0        1             0                  -                 50.0%

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1488007113 15:18:33  elasticsearch green           2         2      2   1    0    0        0             0                  -                100.0%

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1488007216 15:20:16  elasticsearch yellow          1         1      1   1    0    0        1             0                  -                 50.0%
  • What do the cluster health values green, yellow and red mean?
    green: the primary shards and replica shards of every index are all active
    yellow: the primary shards of every index are active, but some replica shards are not active and therefore unavailable
    red: not all primary shards are active, so some indexes are missing data

  • Why is the cluster yellow right now?
    We are running on a single laptop and have started only one ES process, so the cluster has just one node. ES currently holds one index, the one Kibana creates for itself. By default each index gets 5 primary shards and 5 replica shards (one replica per primary), and a primary shard and its replica cannot be placed on the same node, for fault tolerance. Kibana's built-in index has 1 primary shard and 1 replica shard. With only one node, the primary shard is assigned and started, but the replica shard has no second node to start on, so it stays unassigned.

  • A small experiment: start a second ES process so that the cluster has two nodes. The replica shard is then assigned to the new node automatically and the cluster status turns green.
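
The health rule and the single-node experiment can be sketched in Java as follows. This is a simplified model under stated assumptions, not how ES computes health internally: the class, enum and method names are hypothetical, and replica placement is reduced to "at most one copy of a shard per node".

```java
// Sketch only: models the green/yellow/red rule described above, not ES internals.
final class ClusterHealthSketch {

    enum Status { GREEN, YELLOW, RED }

    // A replica is never placed on the node that holds its primary, so a
    // single-node cluster can activate primaries only.
    static int activeReplicas(int nodes, int primaries, int replicasPerPrimary) {
        int placeable = Math.min(replicasPerPrimary, Math.max(nodes - 1, 0));
        return primaries * placeable;
    }

    static Status status(int primaries, int activePrimaries, int replicas, int activeReplicas) {
        if (activePrimaries < primaries) {
            return Status.RED;      // some primary shard is not active
        }
        if (activeReplicas < replicas) {
            return Status.YELLOW;   // all primaries active, some replicas not
        }
        return Status.GREEN;        // every primary and replica is active
    }

    static double activeShardsPercent(int totalShards, int activeShards) {
        return 100.0 * activeShards / totalShards;
    }

    public static void main(String[] args) {
        // The .kibana index: 1 primary shard, 1 replica per primary.
        for (int nodes = 1; nodes <= 2; nodes++) {
            int active = activeReplicas(nodes, 1, 1);
            System.out.println(nodes + " node(s): " + status(1, 1, 1, active)
                    + ", " + activeShardsPercent(2, 1 + active) + "%");
        }
    }
}
```

With one node the sketch reports YELLOW and 50.0%, and with two nodes GREEN and 100.0%, which agrees with the active_shards_percent column in the cat health output above.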

2.1.2 Quickly see which indexes are in the cluster: GET /_cat/indices?v

response

health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .kibana rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb

2.1.3 Create an index: PUT /test_index

response (GET /_cat/indices?v after the index is created)

health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test_index XmS9DTAtSkSZSwWhhGEKkQ   5   1          0            0       650b           650b
yellow open   .kibana    rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb

2.1.4 Delete an index: DELETE /test_index

response (GET /_cat/indices?v after the index is deleted)

health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .kibana rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb

2.2 CRUD operations on products

2.2.1 Add a product: index a new document, PUT /index/type/id

request

PUT /ecommerce/product/1
{
    "name" : "gaolujie yagao",
    "desc" :  "gaoxiao meibai",
    "price" :  30,
    "producer" :      "gaolujie producer",
    "tags": [ "meibai", "fangzhu" ]
}

response

{
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

ES creates the index and the type automatically, so they do not have to be created ahead of time. By default ES also builds an inverted index on every field of the document so that each field can be searched.

2.2.2 Query a product: retrieve a document, GET /index/type/id

request

GET /ecommerce/product/1

response

{
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "gaolujie yagao",
    "desc": "gaoxiao meibai",
    "price": 30,
    "producer": "gaolujie producer",
    "tags": [
      "meibai",
      "fangzhu"
    ]
  }
}

2.2.3 Modify a product: replace the document, PUT /ecommerce/product/1

The drawback of replacing is that the request must carry every field of the document, including the ones that do not change.

2.2.3 Modify a product: update the document, POST /ecommerce/product/1/_update

request

POST /ecommerce/product/1/_update
{
  "doc": {
    "name": "jiaqiangban gaolujie yagao"
  }
}

response

{
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_version": 8,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  }
}
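
The difference between the two ways of modifying a document can be sketched in Java. This is an illustration under assumptions: maps stand in for the stored _source, the merge is shallow (nested objects are not merged field by field here), and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: maps stand in for the stored _source of a document.
final class ReplaceVsUpdateSketch {

    // PUT /index/type/id: the stored document becomes exactly the request body.
    static Map<String, Object> replace(Map<String, Object> stored, Map<String, Object> body) {
        return new LinkedHashMap<>(body);
    }

    // POST /index/type/id/_update with {"doc": ...}: only the given fields change.
    static Map<String, Object> partialUpdate(Map<String, Object> stored, Map<String, Object> doc) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        merged.putAll(doc);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("name", "gaolujie yagao");
        stored.put("desc", "gaoxiao meibai");
        stored.put("price", 30);

        Map<String, Object> change = new LinkedHashMap<>();
        change.put("name", "jiaqiangban gaolujie yagao");

        System.out.println(replace(stored, change));       // desc and price are lost
        System.out.println(partialUpdate(stored, change)); // desc and price are kept
    }
}
```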

2.2.4 Delete a product: delete the document, DELETE /ecommerce/product/1

response

{
  "found": true,
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_version": 9,
  "result": "deleted",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  }
}

A GET /ecommerce/product/1 issued after the deletion reports that the document can no longer be found:

{
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "found": false
}

2.3 Product management: six ways to search

2.3.1 query string search: GET /ecommerce/product/_search

Search for all products. The fields of the response mean the following:

took: how many milliseconds the search took
timed_out: whether the search timed out (here it did not)
_shards: the index is split into 5 shards, so the search request is sent to all of the primary shards (or to a replica shard of any of them)
hits.total: the number of matching documents, here 3
hits.max_score: the score measures how relevant a document is to the search; the more relevant the match, the higher the score
hits.hits: the detailed data of the matching documents

Response:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "2",
        "_score": 1,
        "_source": {
          "name": "jiajieshi yagao",
          "desc": "youxiao fangzhu",
          "price": 25,
          "producer": "jiajieshi producer",
          "tags": [
            "fangzhu"
          ]
        }
      },      
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "3",
        "_score": 1,
        "_source": {
          "name": "zhonghua yagao",
          "desc": "caoben zhiwu",
          "price": 40,
          "producer": "zhonghua producer",
          "tags": [
            "qingxin"
          ]
        }
      }      
    ]
    
  }
}
  • The name "query string search" comes from the fact that the search parameters are passed in the query string of the HTTP request

  • Search for products with yagao in their name, sorted by price in descending order: GET /ecommerce/product/_search?q=name:yagao&sort=price:desc

  • It suits ad-hoc use from command-line tools such as curl, to quickly request the information you want; a complex query, however, is hard to build this way

  • In production environments, query string search is rarely used
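
As an illustration of how the search parameters travel in the query string, the following Java sketch builds the request path for the example above. It assumes Java 10 or later (for the URLEncoder.encode overload that takes a Charset); the class and method names are hypothetical, and sending the request is left to any HTTP client.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: builds the request path; it does not send anything.
final class QueryStringSearchSketch {

    static String searchPath(String index, String type, String query, String sort) {
        return "/" + index + "/" + type + "/_search"
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&sort=" + URLEncoder.encode(sort, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The colon is percent-encoded as %3A in the resulting path.
        System.out.println(searchPath("ecommerce", "product", "name:yagao", "price:desc"));
    }
}
```

Every extra condition has to be squeezed into the q parameter, which is why this style becomes hard to manage for complex queries.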

2.3.2 query DSL (domain-specific language)

DSL: Domain-Specific Language
The query is sent in the HTTP request body as JSON. This is more convenient, supports all kinds of complex syntax, and is far more powerful than query string search.

  • Query all products: "query": { "match_all": {} }
GET /ecommerce/product/_search
{
  "query": { "match_all": {} }
}
  • Query products whose name contains yagao, sorted by price in descending order
GET /ecommerce/product/_search
{
    "query" : {
        "match" : {
            "name" : "yagao"
        }
    },
    "sort": [
        { "price": "desc" }
    ]
}
  • Paged query: there are 3 products in total; with 1 product per page, page 2 starts at offset 1, so the second product is returned
GET /ecommerce/product/_search
{
  "query": { "match_all": {} },
  "from": 1,
  "size": 1
}
  • Return only the specified fields, here the name and price of each product
GET /ecommerce/product/_search
{
  "query": { "match_all": {} },
  "_source": ["name", "price"]
}

The query DSL is better suited to production use because complex queries can be built with it.
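
The paging arithmetic behind from and size can be written down as a small Java sketch. The class and method names are hypothetical, and pages are assumed to be numbered from 1.

```java
// Sketch only: converts a 1-based page number into the from/size pair.
final class PagingSketch {

    static int from(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be positive");
        }
        return (page - 1) * size;
    }

    static String requestBody(int page, int size) {
        return "{ \"query\": { \"match_all\": {} }, \"from\": " + from(page, size)
                + ", \"size\": " + size + " }";
    }

    public static void main(String[] args) {
        System.out.println(requestBody(2, 1)); // page 2 with 1 product per page
    }
}
```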

2.3.3 query filter

Search for products whose name contains yagao and whose price is greater than 25 yuan

GET /ecommerce/product/_search
{
    "query" : {
        "bool" : {
            "must" : {
                "match" : {
                    "name" : "yagao" 
                }
            },
            "filter" : {
                "range" : {
                    "price" : { "gt" : 25 } 
                }
            }
        }
    }
}
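
What the two clauses select can be imitated in memory with a short Java sketch. It assumes three sample products with the names and prices shown in this article (ids 1 to 3), splits names on spaces instead of running a real analyzer, and ignores scoring; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: applies the two conditions in memory to three sample products.
final class BoolFilterSketch {

    static final String[] NAMES = { "gaolujie yagao", "jiajieshi yagao", "zhonghua yagao" };
    static final int[] PRICES = { 30, 25, 40 };

    // must: the name contains the term; filter: price > minPrice (a yes/no test).
    static List<Integer> search(String term, int minPrice) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < NAMES.length; i++) {
            boolean must = Arrays.asList(NAMES[i].split(" ")).contains(term);
            boolean filter = PRICES[i] > minPrice;
            if (must && filter) {
                ids.add(i + 1); // document ids start at 1
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(search("yagao", 25)); // ids of the matching products
    }
}
```

The product priced at exactly 25 is excluded because gt means strictly greater than.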

2.3.4 full-text search

GET /ecommerce/product/_search
{
    "query" : {
        "match" : {
            "producer" : "yagao producer"
        }
    }
}

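The producer field is split into terms when a document is indexed, and an inverted index is built from them (term on the left, ids of the documents that contain it on the right):

special		4
yagao		4
producer	1,2,3,4
gaolujie	1
zhonghua	3
jiajieshi	2

The search string "yagao producer" is likewise split into the terms yagao and producer, and any document containing at least one of them is returned.

response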

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.70293105,
    "hits": [
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "4",
        "_score": 0.70293105,
        "_source": {
          "name": "special yagao",
          "desc": "special meibai",
          "price": 50,
          "producer": "special yagao producer",
          "tags": [
            "meibai"
          ]
        }
      },
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "1",
        "_score": 0.25811607,
        "_source": {
          "name": "gaolujie yagao",
          "desc": "gaoxiao meibai",
          "price": 30,
          "producer": "gaolujie producer",
          "tags": [
            "meibai",
            "fangzhu"
          ]
        }
      },
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "3",
        "_score": 0.25811607,
        "_source": {
          "name": "zhonghua yagao",
          "desc": "caoben zhiwu",
          "price": 40,
          "producer": "zhonghua producer",
          "tags": [
            "qingxin"
          ]
        }
      },
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "2",
        "_score": 0.1805489,
        "_source": {
          "name": "jiajieshi yagao",
          "desc": "youxiao fangzhu",
          "price": 25,
          "producer": "jiajieshi producer",
          "tags": [
            "fangzhu"
          ]
        }
      }
    ]
  }
}
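
The inverted index on the producer field and the way a match query uses it can be sketched in Java. This is a simplification under assumptions: terms are produced by splitting on whitespace rather than by a real analyzer, relevance scoring is left out, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch only: whitespace tokenization and OR matching, without scoring.
final class InvertedIndexSketch {

    // The producer field of the four products, keyed by document id.
    static Map<Integer, String> producers() {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "gaolujie producer");
        docs.put(2, "jiajieshi producer");
        docs.put(3, "zhonghua producer");
        docs.put(4, "special yagao producer");
        return docs;
    }

    // term -> sorted ids of the documents that contain it
    static Map<String, TreeSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    // match query: a document is a hit if it contains any of the query terms.
    static TreeSet<Integer> match(Map<String, TreeSet<Integer>> index, String query) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (String term : query.split("\\s+")) {
            hits.addAll(index.getOrDefault(term, new TreeSet<>()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> index = build(producers());
        System.out.println(index);
        System.out.println(match(index, "yagao producer"));
    }
}
```

All four documents are hits because each contains producer; document 4 is the only one that also contains yagao, which is consistent with it having the highest score in the response.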

2.3.5 phrase search

Phrase search is the opposite of full-text search. Full-text search splits the search string into terms, looks each term up in the inverted index, and returns a document as long as any one of the terms matches.
Phrase search requires the specified field to contain the search string exactly as entered, as a complete phrase, before the document counts as a match and is returned.

GET /ecommerce/product/_search
{
    "query" : {
        "match_phrase" : {
            "producer" : "yagao producer"
        }
    }
}

response

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.70293105,
    "hits": [
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "4",
        "_score": 0.70293105,
        "_source": {
          "name": "special yagao",
          "desc": "special meibai",
          "price": 50,
          "producer": "special yagao producer",
          "tags": [
            "meibai"
          ]
        }
      }
    ]
  }
}
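
The phrase condition can be expressed as "the query terms appear consecutively and in order". A minimal Java sketch, assuming whitespace tokenization and hypothetical class and method names:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: a phrase matches when its terms appear consecutively and in order.
final class PhraseMatchSketch {

    static boolean matchPhrase(String fieldText, String phrase) {
        List<String> field = Arrays.asList(fieldText.split("\\s+"));
        List<String> terms = Arrays.asList(phrase.split("\\s+"));
        // indexOfSubList returns -1 unless terms occurs as a contiguous run in field
        return Collections.indexOfSubList(field, terms) >= 0;
    }

    public static void main(String[] args) {
        String[] producers = {
            "gaolujie producer", "jiajieshi producer", "zhonghua producer", "special yagao producer"
        };
        for (String producer : producers) {
            System.out.println(producer + " -> " + matchPhrase(producer, "yagao producer"));
        }
    }
}
```

Only "special yagao producer" passes, which corresponds to the single hit in the response above.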

2.3.6 highlight search

GET /ecommerce/product/_search
{
    "query" : {
        "match" : {
            "producer" : "producer"
        }
    },
    "highlight": {
        "fields" : {
            "producer" : {}
        }
    }
}
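
In the response, each hit carries a highlight section in which the matched terms of the requested field are wrapped in tags (em tags by default). The effect can be approximated with the Java sketch below; it assumes whole-word matching on space-separated text, and the class and method names are hypothetical.

```java
// Sketch only: wraps whole-word matches in <em> tags, the default highlight tags.
final class HighlightSketch {

    static String highlight(String fieldText, String term) {
        StringBuilder out = new StringBuilder();
        for (String word : fieldText.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word.equals(term) ? "<em>" + word + "</em>" : word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("gaolujie producer", "producer"));
    }
}
```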

2.4 Aggregation Analysis

2.4.1 Calculate the quantity of goods under each tag

GET /ecommerce/product/_search
{
  "aggs": {
    "group_by_tags": {
      "terms": { "field": "tags" }
    }
  }
}

The request above returns an error because tags is a text field. To fix it, set the fielddata property of the text field to true:

PUT /ecommerce/_mapping/product
{
  "properties": {
    "tags": {
      "type": "text",
      "fielddata": true
    }
  }
}

GET /ecommerce/product/_search
{
  "size": 0,
  "aggs": {
    "group_by_tags": {
      "terms": { "field": "tags" }
    }
  }
}

response

{
  "took": 20,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "fangzhu",
          "doc_count": 2
        },
        {
          "key": "meibai",
          "doc_count": 2
        },
        {
          "key": "qingxin",
          "doc_count": 1
        }
      ]
    }
  }
}
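
What the terms aggregation computes here is a count of documents per tag. A minimal in-memory Java sketch, assuming the tags of the four sample products from this article and hypothetical class and method names:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: counts documents per tag, like the doc_count of each bucket.
final class TagCountSketch {

    static final String[][] TAGS = {
        { "meibai", "fangzhu" }, // product 1
        { "fangzhu" },           // product 2
        { "qingxin" },           // product 3
        { "meibai" }             // product 4
    };

    static Map<String, Integer> countByTag(String[][] tagsPerDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] tags : tagsPerDoc) {
            for (String tag : tags) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByTag(TAGS));
    }
}
```

The sketch lists the tags alphabetically because it uses a TreeMap; the counts (fangzhu 2, meibai 2, qingxin 1) are the same as the doc_count values in the response.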

2.4.2 For products whose name contains yagao, count the number of products under each tag

GET /ecommerce/product/_search
{
  "size": 0,
  "query": {
    "match": {
      "name": "yagao"
    }
  },
  "aggs": {
    "all_tags": {
      "terms": {
        "field": "tags"
      }
    }
  }
}

2.4.3 Group first, then compute the average within each group: the average price of the products under each tag

GET /ecommerce/product/_search
{
    "size": 0,
    "aggs" : {
        "group_by_tags" : {
            "terms" : { "field" : "tags" },
            "aggs" : {
                "avg_price" : {
                    "avg" : { "field" : "price" }
                }
            }
        }
    }
}

response

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "fangzhu",
          "doc_count": 2,
          "avg_price": {
            "value": 27.5
          }
        },
        {
          "key": "meibai",
          "doc_count": 2,
          "avg_price": {
            "value": 40
          }
        },
        {
          "key": "qingxin",
          "doc_count": 1,
          "avg_price": {
            "value": 40
          }
        }
      ]
    }
  }
}

2.4.4 Calculate the average price of the goods under each tag and sort them in descending order according to the average price

GET /ecommerce/product/_search
{
    "size": 0,
    "aggs" : {
        "all_tags" : {
            "terms" : { "field" : "tags", "order": { "avg_price": "desc" } },
            "aggs" : {
                "avg_price" : {
                    "avg" : { "field" : "price" }
                }
            }
        }
    }
}
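
Grouping by tag, averaging the price per group and ordering the groups by that average can be sketched in memory as follows. The sample prices and tags are those of the four products in this article; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: terms bucket per tag, avg sub-aggregation, then order by the average.
final class AvgPriceByTagSketch {

    static final int[] PRICES = { 30, 25, 40, 50 };
    static final String[][] TAGS = {
        { "meibai", "fangzhu" }, { "fangzhu" }, { "qingxin" }, { "meibai" }
    };

    static Map<String, Double> avgPriceByTag() {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < PRICES.length; i++) {
            for (String tag : TAGS[i]) {
                groups.computeIfAbsent(tag, t -> new ArrayList<>()).add(PRICES[i]);
            }
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            double avg = group.getValue().stream().mapToInt(Integer::intValue).average().orElse(0);
            averages.put(group.getKey(), avg);
        }
        return averages;
    }

    // Equivalent of "order": { "avg_price": "desc" } on the terms aggregation.
    static List<Map.Entry<String, Double>> sortedByAvgDesc(Map<String, Double> averages) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(averages.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(sortedByAvgDesc(avgPriceByTag()));
    }
}
```

The averages are 27.5 for fangzhu and 40 for both meibai and qingxin, as in the response of 2.4.3, so fangzhu comes last in the descending order.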

So far we have used only the ES RESTful API to learn and explain ES concepts and features, rather than a programming-language client such as the Java API, for the following reasons:

1. The RESTful API is the most important ES API; it lets us experiment, learn, and even work with ES directly in some environments. If we skipped it and went straight to the Java API, you would miss a large part of ES and would not know how its most important API is used.
2. Explaining concepts with the RESTful API is quicker and more convenient, because there is no need to write a lot of Java code each time. This speeds up teaching and makes it easier to focus on the concepts and features of ES itself.
3. The Java API is usually covered in more detail after an ES concept has been explained, showing how to perform the same operations with it.
4. Each chapter is accompanied by a hands-on project, a real system developed entirely in Java.

2.4.5 Group by the specified price ranges, then group by tag within each range, and finally calculate the average price of each group

GET /ecommerce/product/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "range": {
        "field": "price",
        "ranges": [
          {
            "from": 0,
            "to": 20
          },
          {
            "from": 20,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "group_by_tags": {
          "terms": {
            "field": "tags"
          },
          "aggs": {
            "average_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}
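
The three nested levels (price range, then tag, then average) can be imitated in memory with the Java sketch below. It relies on the rule that a range bucket includes its from value and excludes its to value, uses the four sample products from this article, and labels the buckets "0-20", "20-40" and "40-50" for readability (these labels and all class and method names are hypothetical).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: in-memory stand-in for range -> terms -> avg on four sample products.
final class RangeThenTagSketch {

    static final int[] PRICES = { 30, 25, 40, 50 };
    static final String[][] TAGS = {
        { "meibai", "fangzhu" }, { "fangzhu" }, { "qingxin" }, { "meibai" }
    };
    static final int[][] RANGES = { { 0, 20 }, { 20, 40 }, { 40, 50 } };

    // range label -> tag -> average price; "from" is inclusive, "to" is exclusive
    static Map<String, Map<String, Double>> aggregate() {
        Map<String, Map<String, Double>> result = new LinkedHashMap<>();
        for (int[] range : RANGES) {
            Map<String, List<Integer>> byTag = new TreeMap<>();
            for (int i = 0; i < PRICES.length; i++) {
                if (PRICES[i] >= range[0] && PRICES[i] < range[1]) {
                    for (String tag : TAGS[i]) {
                        byTag.computeIfAbsent(tag, t -> new ArrayList<>()).add(PRICES[i]);
                    }
                }
            }
            Map<String, Double> averages = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : byTag.entrySet()) {
                averages.put(e.getKey(),
                        e.getValue().stream().mapToInt(Integer::intValue).average().orElse(0));
            }
            result.put(range[0] + "-" + range[1], averages);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(aggregate());
    }
}
```

Because the upper bound is exclusive, the product priced at 50 falls outside all three ranges; raising the last to value above 50 would include it.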
