[ClickHouse technology series] - nested data structures in ClickHouse

Posted by sirish on Thu, 16 Dec 2021 04:39:41 +0100

Introduction: This article is translated from Altinity's series of technical articles on ClickHouse. ClickHouse, an open source analysis engine for online analytical processing (OLAP), is widely used by companies at home and abroad because of its excellent query performance, PB level data scale and simple architecture. This series of technical articles will introduce ClickHouse in detail.

preface

This article is translated from Altinity's series of technical articles on ClickHouse. ClickHouse, an open source analysis engine for online analytical processing (OLAP), is widely used by companies at home and abroad because of its excellent query performance, PB level data scale and simple architecture.

The Alibaba cloud EMR-OLAP team has made a series of optimizations based on the open source ClickHouse and provided the cloud hosting service of the open source OLAP analysis engine ClickHouse. EMR ClickHouse is fully compatible with the product features of the open source version. At the same time, it provides cloud product functions such as cluster rapid deployment, cluster management, capacity expansion, capacity reduction and monitoring alarm. On the basis of open source, it optimizes the read-write performance of ClickHouse and improves the ability of rapid integration between ClickHouse and other components of EMR. visit https://help.aliyun.com/docum... Learn more.

Translator: He Yuan (Jinghang), senior product expert of Alibaba cloud computing platform Division

Nested data structures in ClickHouse

In this blog post, we will learn about the nested data structure in ClickHouse for MySQL and how to use it with PMM to view queries.

Nested structures are not common in relational database management systems. Usually, it's just a flat table. Sometimes it is convenient to store unstructured information in a structured database.

We are working to adapt ClickHouse to long-term storage for Percona monitoring and management (PMM), especially to store detailed information about queries. One of the problems we are trying to solve is to count the different errors that cause a particular query to fail.

For example, for a query dated August 17, 2017:

"SELECT foo FROM bar WHERE id=?"

It was executed 1000 times. The error code of 25 failures is "1212", and the error code of 8 failures is "1250". Of course, the traditional method of storing in relational data is to create a table "Date, QueryID, ErrorCode, ErrorCnt" and then Join the table. Unfortunately, columnar databases do not perform well in the case of multiple joins, and it is usually recommended to use denormalized tables.

We can create a column for each possible ErrorCode, but this is not the optimal solution. There may be thousands of columns, and most of the time they are empty.

In this case, ClickHouse proposes a nested data structure. In our case, these can be defined as:

CREATE TABLE queries
(
    Period Date,
    QueryID UInt32,
    Fingerprint String,
    Errors Nested
    (
        ErrorCode String,
        ErrorCnt UInt32
    )
)Engine=MergeTree(Period,QueryID,8192);

The obvious problem with this solution is: how do we insert data into this table? How do we extract it?

Let's start with INSERT. The insertion may be as follows:

INSERT INTO queries VALUES ('2017-08-17',5,'SELECT foo FROM bar WHERE id=?',['1220','1230','1212'],[5,6,2])

This means that the query inserted during August 17, 2017 has 5 errors 1220, 6 errors 1230 and 2 errors 1212.

Then it may produce different errors on different dates:

INSERT INTO queries VALUES ('2017-08-18',5,'SELECT foo FROM bar WHERE id=?',['1220','1240','1258'],[3,2,1])

Let's look at the method of selecting data. Very basic SELECT:

SELECT *
FROM queries
|_____Period_|_QueryID_|_Fingerprint_|_Errors.ErrorCode_______|_Errors.ErrorCnt_|
| 2017-08-17 |       5 | SELECT foo  | ['1220','1230','1212'] | [5,6,2]         |
| 2017-08-18 |       5 | SELECT foo  | ['1220','1240','1260'] | [3,16,12]       |
|____________|_________|_____________|________________________|_________________|

If we want to use more familiar table output, we can use the ARRAY JOIN extension:

SELECT *
FROM queries
ARRAY JOIN Errors
┌─────Period─┬─QueryID─┬─Fingerprint─┬─Errors.ErrorCode─┬─Errors.ErrorCnt─┐
│ 2017-08-17 │       5 │ SELECT foo  │ 1220             │            5    │
│ 2017-08-17 │       5 │ SELECT foo  │ 1230             │            6    │
│ 2017-08-17 │       5 │ SELECT foo  │ 1212             │            2    │
│ 2017-08-18 │       5 │ SELECT foo  │ 1220             │            3    │
│ 2017-08-18 │       5 │ SELECT foo  │ 1240             │           16    │
│ 2017-08-18 │       5 │ SELECT foo  │ 1260             │           12    │
└────────────┴─────────┴─────────────┴──────────────────┴─────────────────┘

However, we usually want to see aggregation in multiple periods, which can be done through traditional aggregation functions:

SELECT 
    QueryID,
    Errors.ErrorCode,
    SUM(Errors.ErrorCnt)
FROM queries
ARRAY JOIN Errors
GROUP BY 
    QueryID,
    Errors.ErrorCode
┌─QueryID─┬─Errors.ErrorCode─┬─SUM(Errors.ErrorCnt)─┐
│       5 │ 1212             │                 2    │
│       5 │ 1230             │                 6    │
│       5 │ 1260             │                12    │
│       5 │ 1240             │                16    │
│       5 │ 1220             │                 8    │
└─────────┴──────────────────┴──────────────────────┘

If we are creative and only return one line per QueryID, we can also do the following:

SELECT 
    QueryID, 
    groupArray((ecode, cnt))
FROM 
(
    SELECT 
        QueryID, 
        ecode, 
        sum(ecnt) AS cnt
    FROM queries 
    ARRAY JOIN 
        Errors.ErrorCode AS ecode, 
        Errors.ErrorCnt AS ecnt
    GROUP BY 
        QueryID, 
        ecode
) 
GROUP BY QueryID
┌─QueryID─┬─groupArray(tuple(ecode, cnt))──────────────────────────────┐
│       5 │ [('1230',6),('1212',2),('1260',12),('1220',8),('1240',16)] │
└─────────┴────────────────────────────────────────────────────────────┘

conclusion

ClickHouse provides a flexible way to store data. Although it is a column database, it can achieve a low degree of structure and provide various functions to extract and aggregate data.

follow-up

You've learned about handling real-time updates in ClickHouse. This series also includes other content:

  • Handle real-time updates in ClickHouse
  • Use the new TTL move to store the data in a suitable place
  • Using Join in ClickHouse materialized view
  • ClickHouse aggregate function and aggregate state
  • Nested data structures in ClickHouse (this article)

Original link
This article is the original content of Alibaba cloud and cannot be reproduced without permission.

Topics: data structure clickhouse