Summary of methods for handling associations in ES

Posted by Coco on Mon, 03 Jan 2022 16:54:15 +0100

Preface

This article introduces the ways of handling associations between documents in ES (Elasticsearch).

1, Overview of the approaches

According to Elasticsearch: The Definitive Guide and the official website, ES mainly handles associations in the following ways:

  1. Application-layer association
  2. Denormalized data
  3. Nested objects
  4. Parent-child relationship documents
  5. Terms lookup cross-index query

Type comparison of Join, Nested, Object and Flattened fields

2, Application-layer association

The index data gets no special treatment; the association is resolved by issuing several queries from the application.
In the following example, a question can have multiple answers, and the question data and answer data live in different indexes.

1. Create the question index question_index

PUT question_index
{
  "mappings": {
      "properties": {
        "id":{"type": "keyword"},
        "text":{"type": "keyword"}
      }
    }
}

PUT question_index/_doc/1?refresh
{
  "id":"1",
  "text": "I'm the first question"
}

PUT question_index/_doc/2?refresh
{
  "id":"2",
  "text": "I'm the second question"
}

2. Create the answer index answer_index
Note: pid is the id of the question that an answer belongs to.

PUT answer_index
{
  "mappings": {
      "properties": {
        "pid":{"type": "keyword"},
        "text":{"type": "keyword"}
      }
    }
}

PUT answer_index/_doc/1?refresh 
{
  "pid":"1",
  "text": "Answer 1 to question 1"
}

PUT answer_index/_doc/2?refresh
{
  "pid":"1",
  "text": "Answer 2 to question 1"
}

3. Business scenario
Suppose we need the answers that belong to the first question. We can proceed as follows:

First, look up the id of the matching record by the question text.

GET question_index/_search
{
  "query": {
    "term": {
      "text": {
        "value": "I'm the first question"
      }
    }
  }
}

Then, put the ids returned by the first query into a terms filter and query the answer data from answer_index.

GET answer_index/_search
{
  "query": {
    "terms": {
      "pid": [
        "1"
      ]
    }
  }
}
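
The two steps above can be glued together in application code. The following is a minimal Python sketch under stated assumptions: the function names extract_ids and build_answer_query are hypothetical, and question_response is a hand-written stand-in for the search response of the first query, so no Elasticsearch client is involved.

```python
from typing import Any, Dict, List


def extract_ids(search_response: Dict[str, Any]) -> List[str]:
    """Collect the `id` field from every hit of the first query's response."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [hit["_source"]["id"] for hit in hits]


def build_answer_query(question_ids: List[str]) -> Dict[str, Any]:
    """Build the terms query body that is sent to answer_index."""
    return {"query": {"terms": {"pid": question_ids}}}


# Hand-written stand-in for the response of the question_index search (assumption).
question_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"id": "1", "text": "I'm the first question"}}
        ]
    }
}

answer_query = build_answer_query(extract_ids(question_response))
print(answer_query)
```

In a real application the stand-in response would come from the first search call, and answer_query would be sent as the body of the second one.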

Advantages and disadvantages:
The main advantage of application-layer association is its simplicity: the data structure needs no extra processing, and any two indexes can be joined this way.
The disadvantage is that several queries must be executed for one logical lookup.

Applicable scenarios:
Application-layer association suits cases where the first query returns only a small number of associated values, because a terms query that has to match a large number of values performs poorly.

A terms query matches multiple values: a document is returned as soon as its field matches any one of the values. The number of values is limited, however. By default a terms query accepts at most 65536 terms; the limit can be changed with the index setting index.max_terms_count.
The terms lookup syntax avoids sending a long list of terms in the request body, although terms fetched through a lookup count toward the same limit.
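
When the first query returns more values than a single terms query accepts, the application can split them into batches. The sketch below is illustrative only: chunk_terms and build_batched_queries are hypothetical helpers, and the default of 65536 is assumed to be the value of index.max_terms_count on the target index.

```python
from typing import Any, Dict, Iterator, List

# Assumed limit; adjust it to the index.max_terms_count of the target index.
MAX_TERMS = 65536


def chunk_terms(values: List[str], size: int = MAX_TERMS) -> Iterator[List[str]]:
    """Yield consecutive slices of `values`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(values), size):
        yield values[start:start + size]


def build_batched_queries(field: str, values: List[str],
                          size: int = MAX_TERMS) -> List[Dict[str, Any]]:
    """Build one terms query body per batch of values."""
    return [{"query": {"terms": {field: batch}}}
            for batch in chunk_terms(values, size)]


pids = [str(i) for i in range(150000)]
queries = build_batched_queries("pid", pids)
print(len(queries))  # 3 batches: 65536 + 65536 + 18928 values
```

Each batch is a separate search, so the results still have to be merged in the application, which is one more reason to keep the number of associated values small.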

3, Denormalized data

For the best search performance, store denormalized data when modeling the index: by redundantly storing fields from related documents, association queries are avoided at query time.
In the following example, we want to find blog posts by the name of the user who wrote them.
With the conventional (normalized) approach the index structure looks as follows: blog_index stores only user_id, which is used to associate the user information.

PUT user_index
{
  "mappings": {
     "properties": {
        "id":  {"type": "keyword"},
        "name":   {"type": "keyword"},
        "email":   {"type": "keyword"}
     }
  }
}

PUT blog_index
{
  "mappings": {
      "properties": {
        "title":{"type": "keyword"},
        "body":{"type": "keyword"},
        "user_id":{"type": "keyword"}
      }
    }
}

Denormalized approach:
The user information is stored directly in each blog document through an object field. Storing the data redundantly removes the need for an association query.

PUT blog_index
{
  "mappings": {
      "properties": {
        "title":{"type": "keyword"},
        "body":{"type": "keyword"},
        "user":{
          "properties": {
            "id":  {"type": "keyword"},
            "name":   {"type": "keyword"},
            "email":   {"type": "keyword"}
          }
        }
      }
    }
}

Query the blog posts whose user name is Lao Wan:

GET blog_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "user.name": "Lao Wan"}}
      ]
    }
  }
}

Advantages and disadvantages:
The advantage of denormalization is speed. Each document contains all the information it needs, so no expensive join is required when that information has to be matched in a query.
The disadvantage is that redundant storage takes more disk space, and updating associated data becomes more complex because every copy has to be changed.
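
To illustrate the update cost: when a user is renamed, every blog document that embeds that user has to be rewritten. One way to do this is the _update_by_query API with a Painless script. The Python sketch below only builds and prints such a request body; the helper name build_user_rename and the new name "Lao Wang" are assumptions made for illustration.

```python
import json
from typing import Any, Dict


def build_user_rename(user_id: str, new_name: str) -> Dict[str, Any]:
    """Build an _update_by_query body that rewrites the embedded user name."""
    return {
        # Select every blog document that embeds the given user.
        "query": {"term": {"user.id": user_id}},
        # Overwrite the redundant copy of the name in each selected document.
        "script": {
            "lang": "painless",
            "source": "ctx._source.user.name = params.name",
            "params": {"name": new_name},
        },
    }


body = build_user_rename("1", "Lao Wang")
print("POST blog_index/_update_by_query")
print(json.dumps(body, indent=2))
```

With the normalized layout the same rename would touch a single document in user_index, which is the trade-off described above.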

4, Nested objects

Data association can also be implemented with the nested field type.
The denormalized-data section above already showed how an object field stores associated data redundantly to avoid association queries.

So what is the difference between the two?

Differences between object fields and nested fields:

From the official documentation:
If you need to index an array of objects and maintain the independence of each object in the array, use the nested data type instead of the object data type.
Internally, nested objects index each object in the array as a separate hidden document, which means that each nested object can be queried independently of the others with a nested query.

In short:
Object fields suit single, simple objects. An array of objects can be stored in an object field, but it is flattened internally, so the link between the fields of each individual object is lost and the objects cannot be queried independently.
Nested fields suit arrays of objects.

## 1. Create an index and specify the user field as a nested object
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested" 
      }
    }
  }
}
## 2. Add data
PUT my-index-000001/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
## 3. Query data: look for a user whose first name is Alice and whose last name is Smith.
## With the object type this document would match, because the field values of all array elements are merged.
## With the nested type nothing matches, because each object is isolated from the others.
GET my-index-000001/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}
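
Why the same conditions match under the object type but not under the nested type can be shown with a small simulation. This is a simplified Python model, not Elasticsearch's implementation: it compares exact values and ignores text analysis, and the function names are hypothetical.

```python
from typing import Dict, List

users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def flatten_object_field(objects: List[Dict[str, str]], path: str) -> Dict[str, List[str]]:
    """Model of the object type: merge the values of all array elements per field."""
    flat: Dict[str, List[str]] = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{path}.{key}", []).append(value)
    return flat


def matches_as_object(objects: List[Dict[str, str]], path: str,
                      conditions: Dict[str, str]) -> bool:
    """Each condition is checked against the merged values, in any combination."""
    flat = flatten_object_field(objects, path)
    return all(value in flat.get(f"{path}.{key}", [])
               for key, value in conditions.items())


def matches_as_nested(objects: List[Dict[str, str]],
                      conditions: Dict[str, str]) -> bool:
    """All conditions must hold within one and the same object."""
    return any(all(obj.get(key) == value for key, value in conditions.items())
               for obj in objects)


conditions = {"first": "Alice", "last": "Smith"}
object_hit = matches_as_object(users, "user", conditions)
nested_hit = matches_as_nested(users, conditions)
print(flatten_object_field(users, "user"))
print(object_hit, nested_hit)  # True False
```

The flattened form keeps "Alice" and "Smith" but no longer records that they belong to different users, which is exactly the false match the nested type prevents.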

Advantages and disadvantages:
Object fields suit one-to-one associations.
Nested fields suit one-to-many associations.
Both rely on denormalized, redundantly stored data to avoid association queries.
Both keep the associated data inside the same document.

5, Parent-child relationship documents

A parent-child relationship between documents in an index is built with the join field type.

Official documentation: Join field type
A join field is used to build parent-child relationships within a single index. A set of relationships is defined through the relations property; each relationship consists of a parent name and a child name.

Example:
Create the index my_index and declare the field my_join_field in the mappings with type join.
The relationship is specified through the relations property: the parent name is question and the child name is answer.
Both relationship names are free to choose. When adding data to the index, set the value of my_join_field to one of the names defined here.
The field name my_join_field can also be chosen freely.

PUT my_index
{
  "mappings": {
      "properties": {
        "text":{"type": "keyword"},
        "my_join_field": { 
          "type": "join",
          "relations": {
            "question": "answer" 
          }
        }
      }
    }
}
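
With the mapping above in place, parent and child documents are indexed by setting my_join_field, and a child must be routed to the shard of its parent. The Python sketch below only builds and prints the request lines and bodies; the helper names, the document ids and the sample texts are assumptions made for illustration.

```python
import json
from typing import Any, Dict, Tuple

JOIN_FIELD = "my_join_field"


def parent_request(doc_id: str, text: str) -> Tuple[str, Dict[str, Any]]:
    """A parent (question) document only names its side of the relation."""
    body = {"text": text, JOIN_FIELD: {"name": "question"}}
    return f"PUT my_index/_doc/{doc_id}?refresh", body


def child_request(doc_id: str, parent_id: str, text: str) -> Tuple[str, Dict[str, Any]]:
    """A child (answer) document names its parent and is routed by the parent id."""
    body = {"text": text, JOIN_FIELD: {"name": "answer", "parent": parent_id}}
    return f"PUT my_index/_doc/{doc_id}?routing={parent_id}&refresh", body


def has_child_query(answer_text: str) -> Dict[str, Any]:
    """Find questions that have at least one answer with the given text."""
    return {"query": {"has_child": {"type": "answer",
                                    "query": {"term": {"text": answer_text}}}}}


requests = [
    parent_request("1", "I'm the first question"),
    child_request("3", "1", "Answer 1 to question 1"),
]
for line, body in requests:
    print(line)
    print(json.dumps(body, indent=2))

print("GET my_index/_search")
print(json.dumps(has_child_query("Answer 1 to question 1"), indent=2))
```

The routing parameter on the child request is what keeps parent and child on the same shard; without it the child cannot be joined to its parent.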

Advantages and disadvantages:
With a parent-child relationship built through a join field, parent and child are stored on the same shard of the same index but as separate documents. With object and nested fields, the associated data lives inside one document.
The join field is therefore better suited to scenarios with a large amount of associated data.
Because parents and children are independent documents, it is also easier to add, update and delete them separately.
The main disadvantage is that has_child and has_parent queries perform poorly.

Note ⚠️:
A join field should not be used like a join in a relational database. To keep queries fast in ES, the best practice is to model the data as denormalized documents, that is, to build a wide table through redundant fields.
Every join field and every has_child or has_parent query adds a significant cost to query performance.

6, Terms lookup cross-index query

So far, Terms lookup is the only mechanism I have found that supports association queries across indexes. If you know of others, please leave a comment.

Explanation:
A Terms lookup query fetches the field values of an existing document by its id and then uses those values as the search terms of a second query.

1. Create parameter index and add data

PUT params_index/_doc/1
{
  "group" : "fans",
  "name" : [ 
     "Lao Wan", "Xiao Ming"
  ]
}

2. Create a blog index and add data

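## blog_index was already created in the denormalized-data section; delete it first if it still exists:
## DELETE blog_index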

PUT blog_index
{
  "mappings": {
      "properties": {
        "title":{"type": "keyword"},
        "body":{"type": "keyword"},
        "user_name":{"type": "keyword"}
      }
    }
}

PUT blog_index/_doc/1?refresh
{
  "title": "Lao Wan's first blog",
  "body":"start es The first day of study",
  "user_name": "Lao Wan"
}

PUT blog_index/_doc/2?refresh
{
  "title": "Lao Wan's second blog",
  "body":"study ES Associated query",
  "user_name": "Lao Wan"
}

PUT blog_index/_doc/3?refresh
{
  "title": "Sanya Travel Diary",
  "body":"Seaside punch in",
  "user_name": "Xiao Ming"
}

PUT blog_index/_doc/4?refresh
{
  "title": "King's Diary",
  "body":"Kill the king today",
  "user_name": "Xiao Wang"
}

3. Query the records in blog_index using the values stored in params_index

GET blog_index/_search
{
  "query": {
    "terms": {
      "user_name" : {
            "index" : "params_index",
            "id" : "1",
            "path" : "name"
        }
    }
  }
}
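
Conceptually, the query above does two things: it fetches document 1 from params_index, reads the values under name, and then runs an ordinary terms query on user_name with them. The Python sketch below simulates this with in-memory dictionaries standing in for the two indexes. It is a simplified model, not Elasticsearch's implementation: the function terms_lookup is hypothetical and path is treated as a plain top-level key.

```python
from typing import Any, Dict, List

# In-memory stand-ins for the two indexes, keyed by document id.
params_index: Dict[str, Dict[str, Any]] = {
    "1": {"group": "fans", "name": ["Lao Wan", "Xiao Ming"]},
}
blog_index: Dict[str, Dict[str, Any]] = {
    "1": {"title": "Lao Wan's first blog", "user_name": "Lao Wan"},
    "2": {"title": "Lao Wan's second blog", "user_name": "Lao Wan"},
    "3": {"title": "Sanya Travel Diary", "user_name": "Xiao Ming"},
    "4": {"title": "King's Diary", "user_name": "Xiao Wang"},
}


def terms_lookup(target_docs: Dict[str, Dict[str, Any]], field: str,
                 lookup_index: Dict[str, Dict[str, Any]],
                 lookup_id: str, path: str) -> List[str]:
    """Return ids of target documents whose `field` equals any looked-up value."""
    lookup_doc = lookup_index.get(lookup_id)
    if lookup_doc is None:
        return []  # no lookup document, so no terms to match
    values = lookup_doc.get(path, [])
    if not isinstance(values, list):
        values = [values]
    return sorted(doc_id for doc_id, doc in target_docs.items()
                  if doc.get(field) in values)


matched = terms_lookup(blog_index, "user_name", params_index, "1", "name")
print(matched)  # ['1', '2', '3']
```

Document 4 is left out because Xiao Wang is not among the names stored in the parameter document.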

Summary

This article summarized the ways of handling associations in ES.
The main approaches are:

  1. Application-layer association
  2. Denormalized data
  3. Nested objects
  4. Parent-child relationship documents
  5. Terms lookup cross-index query

Choose among them in a real project according to the advantages, disadvantages and applicable scenarios of each.

Topics: ElasticSearch join