See the official website for details: Roll up and query && Materialized view
Rollup and query
In Doris, Rollup, as a polymer view, can play two roles in query:
- Indexes
- Aggregate data (only for aggregate model, i.e. aggregate key)
However, in order to hit the Rollup, certain conditions must be met, and whether the Rollup can be hit can be determined by executing the PreAggregation value of the scandoe node in the plan, and which Rollup table is hit can be determined by the Rollup field.
Noun interpretation
Base: base table.
Rollup: generally refers to the rollup table created based on the Base table, but in some scenarios, it includes the Base and rollup tables.
Indexes
Doris's prefix index has been introduced in the previous query practice, That is, Doris will separately generate a sorted sparse index data in the underlying storage engine for the first 36 bytes in the Base/Rollup table (if there is a varchar type, it may cause the prefix index to be less than 36 bytes, varchar will truncate the prefix index, and use up to 20 bytes of varchar). The data is also sorted, located with the index, and then binary search in the data, Then, when querying, the prefix index of each Base/Rollup will be matched according to the conditions in the query, and the Base/Rollup with the longest matching prefix index will be selected.
-----> Match left to right +----+----+----+----+----+----+ | c1 | c2 | c3 | c4 | c5 |... |
As shown in the figure above, take the conditions where and on are pushed up and down to ScanNode in the query, match from the first column of the prefix index, and check whether there are these columns in the conditions. If there are, the length of the matching will be accumulated until the matching is not up or 36 bytes end (varchar type columns can only match 20 bytes, and the prefix index will be truncated if they match less than 36 bytes), Then select the Base/Rollup with the longest matching length. For example, a Base table and four rollups are created:
+---------------+-------+--------------+------+-------+---------+-------+ | IndexName | Field | Type | Null | Key | Default | Extra | +---------------+-------+--------------+------+-------+---------+-------+ | test | k1 | TINYINT | Yes | true | N/A | | | | k2 | SMALLINT | Yes | true | N/A | | | | k3 | INT | Yes | true | N/A | | | | k4 | BIGINT | Yes | true | N/A | | | | k5 | DECIMAL(9,3) | Yes | true | N/A | | | | k6 | CHAR(5) | Yes | true | N/A | | | | k7 | DATE | Yes | true | N/A | | | | k8 | DATETIME | Yes | true | N/A | | | | k9 | VARCHAR(20) | Yes | true | N/A | | | | k10 | DOUBLE | Yes | false | N/A | MAX | | | k11 | FLOAT | Yes | false | N/A | SUM | | | | | | | | | | rollup_index1 | k9 | VARCHAR(20) | Yes | true | N/A | | | | k1 | TINYINT | Yes | true | N/A | | | | k2 | SMALLINT | Yes | true | N/A | | | | k3 | INT | Yes | true | N/A | | | | k4 | BIGINT | Yes | true | N/A | | | | k5 | DECIMAL(9,3) | Yes | true | N/A | | | | k6 | CHAR(5) | Yes | true | N/A | | | | k7 | DATE | Yes | true | N/A | | | | k8 | DATETIME | Yes | true | N/A | | | | k10 | DOUBLE | Yes | false | N/A | MAX | | | k11 | FLOAT | Yes | false | N/A | SUM | | | | | | | | | | rollup_index2 | k9 | VARCHAR(20) | Yes | true | N/A | | | | k2 | SMALLINT | Yes | true | N/A | | | | k1 | TINYINT | Yes | true | N/A | | | | k3 | INT | Yes | true | N/A | | | | k4 | BIGINT | Yes | true | N/A | | | | k5 | DECIMAL(9,3) | Yes | true | N/A | | | | k6 | CHAR(5) | Yes | true | N/A | | | | k7 | DATE | Yes | true | N/A | | | | k8 | DATETIME | Yes | true | N/A | | | | k10 | DOUBLE | Yes | false | N/A | MAX | | | k11 | FLOAT | Yes | false | N/A | SUM | | | | | | | | | | rollup_index3 | k4 | BIGINT | Yes | true | N/A | | | | k5 | DECIMAL(9,3) | Yes | true | N/A | | | | k6 | CHAR(5) | Yes | true | N/A | | | | k1 | TINYINT | Yes | true | N/A | | | | k2 | SMALLINT | Yes | true | N/A | | | | k3 | INT | Yes | true | N/A | | | | k7 | DATE | Yes | true | N/A | | | | k8 | DATETIME | Yes | true | N/A | | | | k9 | VARCHAR(20) | Yes | true | N/A | | | | k10 | DOUBLE | Yes | false | N/A | MAX | | | k11 | FLOAT | Yes | false | N/A | SUM | | | | | | | | | | rollup_index4 | k4 | BIGINT | Yes | true | N/A | | | | k6 | CHAR(5) | Yes | true | N/A | | | | k5 | DECIMAL(9,3) | Yes | true | N/A | | | | k1 | TINYINT | Yes | true | N/A | | | | k2 | SMALLINT | Yes | true | N/A | | | | k3 | INT | Yes | true | N/A | | | | k7 | DATE | Yes | true | N/A | | | | k8 | DATETIME | Yes | true | N/A | | | | k9 | VARCHAR(20) | Yes | true | N/A | | | | k10 | DOUBLE | Yes | false | N/A | MAX | | | k11 | FLOAT | Yes | false | N/A | SUM | +---------------+-------+--------------+------+-------+---------+-------+
The prefix indexes of the three tables are
Base(k1 ,k2, k3, k4, k5, k6, k7) rollup_index1(k9),rollup_index2(k9) rollup_index3(k4, k5, k6, k1, k2, k3, k7) rollup_index4(k4, k6, k5, k1, k2, k3, k7)
The conditions on the columns of the prefix index that can be used need to be = < > < = > = in between, and these conditions are juxtaposed, and the relationship uses and connection, for or= Wait until these cannot be hit, and then look at the following query:
SELECT * FROM test WHERE k1 = 1 AND k2 > 3;
There are conditions on k1 and k2. Check that only the first column of Base contains k1 in the condition, so match the longest prefix index, test, explain as follows:
| 0:OlapScanNode | TABLE: test | PREAGGREGATION: OFF. Reason: No AggregateInfo | PREDICATES: `k1` = 1, `k2` > 3 | partitions=1/1 | rollup: test | buckets=1/10 | cardinality=-1 | avgRowSize=0.0 | numNodes=0 | tuple ids: 0
Look at the following query:
SELECT * FROM test WHERE k4 = 1 AND k5 > 3;
With k4 and k5 conditions, check rollup_index3,rollup_ The first column of index4 contains k4, but rollup_ The second column of index3 contains k5, so the matching prefix index is the longest.
| 0:OlapScanNode | TABLE: test | PREAGGREGATION: OFF. Reason: No AggregateInfo | PREDICATES: `k4` = 1, `k5` > 3 | partitions=1/1 | rollup: rollup_index3 | buckets=10/10 | cardinality=-1 | avgRowSize=0.0 | numNodes=0 | tuple ids: 0
Now let's try to match the conditions on the column containing varchar, as follows:
SELECT * FROM test WHERE k9 IN ("xxx", "yyyy") AND k1 = 10;
There are two conditions k9 and k1, rollup_index1 and Rollup_ The first column of index2 contains k9. It is reasonable to select both rollups here to hit the prefix index, and the effect is the same. Just select one randomly (because the varchar here is just 20 bytes, and the prefix index is truncated when it is less than 36 bytes). However, the current policy will continue to match k1 here, because rollup_ The second column of index1 is k1, so rollup is selected_ Index1, in fact, the latter k1 condition will not accelerate. (if conditions other than the prefix index are needed, they can speed up the query by creating a Bloom Filter. Generally, they can be created for string types, because Doris has a Block level for columns, and has a Min/Max index for shaping and dates). The following is the result of explain.
| 0:OlapScanNode | TABLE: test | PREAGGREGATION: OFF. Reason: No AggregateInfo | PREDICATES: `k9` IN ('xxx', 'yyyy'), `k1` = 10 | partitions=1/1 | rollup: rollup_index1 | buckets=1/10 | cardinality=-1 | avgRowSize=0.0 | numNodes=0 | tuple ids: 0
Finally, look at a query that can be hit by multiple rollups:
SELECT * FROM test WHERE k4 < 1000 AND k5 = 80 AND k6 >= 10000;
There are three conditions K4, K5 and K6, rollup_index3 and Rollup_ The first three columns of index4 contain these three columns respectively, so the prefix index length of the two matches is the same. Both can be selected. The current default strategy is to select a rollup created earlier, here is rollup_index3.
| 0:OlapScanNode | TABLE: test | PREAGGREGATION: OFF. Reason: No AggregateInfo | PREDICATES: `k4` < 1000, `k5` = 80, `k6` >= 10000.0 | partitions=1/1 | rollup: rollup_index3 | buckets=10/10 | cardinality=-1 | avgRowSize=0.0 | numNodes=0 | tuple ids: 0
If the above query is slightly modified as:
SELECT * FROM test WHERE k4 < 1000 AND k5 = 80 OR k6 >= 10000;
The query here cannot hit the prefix index. (even any min / Max and bloomfilter indexes in the Doris storage engine do not work)
Aggregate data
Of course, the function of aggregating data of general polymer view is essential. This kind of physical view is very helpful for aggregation query or report query. To hit the polymer view, the following prerequisites are required:
- All columns involved in the query or subquery exist in a separate Rollup.
- If there is a Join in the query or subquery, the type of the Join must be Inner join.
The following are the types of aggregate queries that can hit Rollup,
Column type query type | Sum | Distinct/Count Distinct | Min | Max | APPROX_COUNT_DISTINCT |
---|---|---|---|---|---|
Key | false | true | true | true | true |
Value(Sum) | true | false | false | false | false |
Value(Replace) | false | false | false | false | false |
Value(Min) | false | false | true | false | false |
Value(Max) | false | false | false | true | false |
If the above conditions are met, there will be two stages when judging that the Rollup is hit for the aggregation model:
- First, the Rollup table with the longest hit prefix index is found through conditional matching. See the index strategy above.
- Then compare the number of rows of Rollup and select the smallest Rollup.
The following table and Rollup:
+-------------+-------+--------------+------+-------+---------+-------+ | IndexName | Field | Type | Null | Key | Default | Extra | +-------------+-------+--------------+------+-------+---------+-------+ | test_rollup | k1 | TINYINT | Yes | true | N/A | | | | k2 | SMALLINT | Yes | true | N/A | | | | k3 | INT | Yes | true | N/A | | | | k4 | BIGINT | Yes | true | N/A | | | | k5 | DECIMAL(9,3) | Yes | true | N/A | | | | k6 | CHAR(5) | Yes | true | N/A | | | | k7 | DATE | Yes | true | N/A | | | | k8 | DATETIME | Yes | true | N/A | | | | k9 | VARCHAR(20) | Yes | true | N/A | | | | k10 | DOUBLE | Yes | false | N/A | MAX | | | k11 | FLOAT | Yes | false | N/A | SUM | | | | | | | | | | rollup2 | k1 | TINYINT | Yes | true | N/A | | | | k2 | SMALLINT | Yes | true | N/A | | | | k3 | INT | Yes | true | N/A | | | | k10 | DOUBLE | Yes | false | N/A | MAX | | | k11 | FLOAT | Yes | false | N/A | SUM | | | | | | | | | | rollup1 | k1 | TINYINT | Yes | true | N/A | | | | k2 | SMALLINT | Yes | true | N/A | | | | k3 | INT | Yes | true | N/A | | | | k4 | BIGINT | Yes | true | N/A | | | | k5 | DECIMAL(9,3) | Yes | true | N/A | | | | k10 | DOUBLE | Yes | false | N/A | MAX | | | k11 | FLOAT | Yes | false | N/A | SUM | +-------------+-------+--------------+------+-------+---------+-------+
See the following query:
SELECT SUM(k11) FROM test_rollup WHERE k1 = 10 AND k2 > 200 AND k3 in (1,2,3);
First, judge whether the query can hit the aggregated Rollup table. After checking the above figure, it is OK. Then, the conditions contain K1, K2 and K3 conditions, which are test_ The first three columns of Rollup, rollup1 and rollup2 contain, so the prefix index length is the same. Then compare the number of rows. Obviously, rollup2 has the highest degree of aggregation and the least number of rows, so rollup2 is selected.
| 0:OlapScanNode | | TABLE: test_rollup | | PREAGGREGATION: ON | | PREDICATES: `k1` = 10, `k2` > 200, `k3` IN (1, 2, 3) | | partitions=1/1 | | rollup: rollup2 | | buckets=1/10 | | cardinality=-1 | | avgRowSize=0.0 | | numNodes=0 | | tuple ids: 0 |
Materialized view
Materialized view is a special table that stores the pre calculated data set (according to the defined SELECT statement) in Doris.
Materialized views appear mainly to meet the needs of users. They can not only analyze any dimension of original detailed data, but also quickly analyze and query fixed dimensions.
Applicable scenario
-
Analysis requirements cover two aspects: detailed data query and fixed dimension query.
-
The query involves only a small number of columns or rows in the table.
-
Queries contain some time-consuming processing operations, such as aggregation operations that take a long time.
-
The query needs to match different prefix indexes.
advantage
-
For those queries that frequently use the same sub query results, the performance is greatly improved.
-
Doris automatically maintains the data of materialized view. Whether it is a new import or deletion operation, it can ensure the data consistency of base table and materialized view chart. No additional labor maintenance costs are required.
-
When querying, it will automatically match to the optimal materialized view and read data directly from the materialized view.
Automatic maintenance of materialized view data will cause some maintenance overhead, which will be explained in the limitations of materialized view later.
Materialized view VS Rollup
Before the materialized view function, users generally used the Rollup function to improve query efficiency through pre aggregation. However, Rollup has some limitations. It can't do pre aggregation based on detail model.
Materialized views not only cover the functions of Rollup, but also support richer aggregation functions. So materialized views are actually a superset of Rollup.
In other words, the functions previously supported by ALTER TABLE ADD ROLLUP syntax can now be realized through CREATE MATERIALIZED VIEW.
Using materialized views
Doris system provides a complete set of DDL syntax for materialized views, including creating, viewing and deleting. The syntax of DDL is consistent with PostgreSQL and Oracle.
Create materialized view
First of all, you should decide what kind of materialized view to create according to the characteristics of your query statement. This is not as like as two peas. There are two principles:
- Abstract from the query statement, the grouping and aggregation methods shared by multiple queries are used as the definition of materialized view.
- You do not need to create materialized views for all dimension combinations.
First, if a materialized view is abstracted, and multiple queries can match this materialized view. This materialized view works best. Because the maintenance of materialized views also consumes resources.
If the materialized view only fits a special query, and other queries do not use this materialized view. This will lead to the low cost performance of this materialized view, which not only occupies the storage resources of the cluster, but also can not serve more queries.
Therefore, users need to combine their own query statements and data dimension information to abstract the definition of some materialized views.
The second point is that the actual analysis query does not cover all dimension analysis. Therefore, it is enough to create materialized views for common dimension combinations, so as to achieve a balance between space and time.
You can create materialized views with the following command. Creating a materialized view is an asynchronous operation, that is, after the user successfully submits the creation task, Doris will calculate the stock data in the background until the creation is successful.
CREATE MATERIALIZED VIEW
The specific syntax can be viewed through the following command:
HELP CREATE MATERIALIZED VIEW
Aggregate functions are supported
At present, the aggregate functions supported by materialized view creation statements are:
-
SUM, MIN, MAX (Version 0.12)
-
COUNT, BITMAP_UNION, HLL_UNION (Version 0.13)
-
BITMAP_ The form of union must be: bitmap_ The type of Union (to_bitmap (column)) column can only be integer (largeint is not supported), or BITMAP_UNION(COLUMN) and base table is AGG model.
-
HLL_ The form of union must be HLL_ The type of Union (hll_hash (column)) column cannot be DECIMAL or HLL_UNION(COLUMN) and base table is AGG model.
Update strategy
In order to ensure the data consistency between the materialized view chart and the base table, Doris will synchronize the import, deletion and other operations on the base table to the materialized view table. And through incremental update to improve the update efficiency. Ensure atomicity through transactions.
For example, if the user inserts data into the base table through the INSERT command, the data will be synchronously inserted into the materialized view. When the base table and materialized view table are written successfully, the INSERT command will return successfully.
Query automatic matching
After the materialized view is created successfully, the user's query does not need to be changed, that is, it is still the base table of the query. Doris will automatically select an optimal materialized view according to the current query statement, read data from the materialized view and calculate.
You can use the EXPLAIN command to check whether the current query uses materialized views.
Matching relationship between aggregation in materialized view and aggregation in query:
Materialized view aggregation | Aggregation in query |
---|---|
sum | sum |
min | min |
max | max |
count | count |
bitmap_union | bitmap_union, bitmap_union_count, count(distinct) |
hll_union | hll_raw_agg, hll_union_agg, ndv, approx_count_distinct |
After the aggregation functions of bitmap and hll match the materialized view, the aggregation operator of the query will be rewritten according to the table structure of the materialized view. See example 2 for details.
# Query materialized view
View the materialized views of the current table and their table structure. Use the following command:
MySQL [test]> desc mv_test all; +-----------+---------------+-----------------+----------+------+-------+---------+--------------+ | IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra | +-----------+---------------+-----------------+----------+------+-------+---------+--------------+ | mv_test | DUP_KEYS | k1 | INT | Yes | true | NULL | | | | | k2 | BIGINT | Yes | true | NULL | | | | | k3 | LARGEINT | Yes | true | NULL | | | | | k4 | SMALLINT | Yes | false | NULL | NONE | | | | | | | | | | | mv_2 | AGG_KEYS | k2 | BIGINT | Yes | true | NULL | | | | | k4 | SMALLINT | Yes | false | NULL | MIN | | | | k1 | INT | Yes | false | NULL | MAX | | | | | | | | | | | mv_3 | AGG_KEYS | k1 | INT | Yes | true | NULL | | | | | to_bitmap(`k2`) | BITMAP | No | false | | BITMAP_UNION | | | | | | | | | | | mv_1 | AGG_KEYS | k4 | SMALLINT | Yes | true | NULL | | | | | k1 | BIGINT | Yes | false | NULL | SUM | | | | k3 | LARGEINT | Yes | false | NULL | SUM | | | | k2 | BIGINT | Yes | false | NULL | MIN | +-----------+---------------+-----------------+----------+------+-------+---------+--------------+
You can see the current MV_ There are three materialized views in the test table: mv_1, mv_2 and mv_3, and their table structure.
Delete materialized view
If you no longer need materialized views, you can delete materialized views through the following command:
DROP MATERIALIZED VIEW
The specific syntax can be viewed through the following command:
HELP DROP MATERIALIZED VIEW
Best Practices 1
Using materialized views is generally divided into the following steps:
-
Create materialized view
-
Asynchronously check whether the materialized view is built
-
Query and automatically match materialized views
The first step is to create a materialized view
Suppose the user has a sales record detail table, which stores the transaction id, salesperson, sales store, sales time and amount of each transaction. The table creation statement is:
create table sales_records(record_id int, seller_id int, store_id int, sale_date date, sale_amt bigint) distributed by hash(record_id) properties("replication_num" = "1");
This is sales_ The table structure of records is as follows:
MySQL [test]> desc sales_records; +-----------+--------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------+--------+------+-------+---------+-------+ | record_id | INT | Yes | true | NULL | | | seller_id | INT | Yes | true | NULL | | | store_id | INT | Yes | true | NULL | | | sale_date | DATE | Yes | false | NULL | NONE | | sale_amt | BIGINT | Yes | false | NULL | NONE | +-----------+--------+------+-------+---------+-------+
At this time, if the user often makes an analysis and query on the sales volume of different stores, you can give this sales_ The records table creates a materialized view grouped by sales stores and summing the sales of the same sales stores. The create statement is as follows:
MySQL [test]> create materialized view store_amt as select store_id, sum(sale_amt) from sales_records group by store_id;
If the backend returns the following figure, the task of creating materialized view is submitted successfully.
Query OK, 0 rows affected (0.012 sec)
Step 2: check whether the materialized view is built
Since creating a materialized view is an asynchronous operation, after submitting the task of creating a materialized view, the user needs to check whether the materialized view is built asynchronously through the command. The command is as follows:
SHOW ALTER TABLE ROLLUP FROM db_name; (Version 0.12) SHOW ALTER TABLE MATERIALIZED VIEW FROM db_name; (Version 0.13)
DB in this command_ Name is a parameter. You need to replace it with your real db name. The result of the command is to display all the tasks of creating materialized views for this dB. The results are as follows:
+-------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+-----------+-------------------------------------------------------------------------------------------------------------------------+----------+---------+ | JobId | TableName | CreateTime | FinishedTime | BaseIndexName | RollupIndexName | RollupId | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+-----------+-------------------------------------------------------------------------------------------------------------------------+----------+---------+ | 22036 | sales_records | 2020-07-30 20:04:28 | 2020-07-30 20:04:57 | sales_records | store_amt | 22037 | 5008 | FINISHED | | NULL | 86400 | +-------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+-----------+-------------------------------------------------------------------------------------------------------------------------+----------+---------+
TableName refers to the table from which the materialized view data comes, and RollupIndexName refers to the name of the materialized view. The more important indicator is State.
When the State of the create materialized view task has become FINISHED, it indicates that the materialized view has been created successfully. This means that the materialized view may be automatically matched during query.
Step 3: query
After the materialized view is created, the user will query the sales volume of different stores directly from the materialized view store just created_ Read the aggregated data in AMT. Achieve the effect of improving query efficiency.
The user's query still specifies sales_records table, for example:
SELECT store_id, sum(sale_amt) FROM sales_records GROUP BY store_id;
The above query can automatically match the store_amt. You can use the following command to check whether the current query matches the appropriate materialized view.
EXPLAIN SELECT store_id, sum(sale_amt) FROM sales_records GROUP BY store_id; +-----------------------------------------------------------------------------+ | Explain String | +-----------------------------------------------------------------------------+ | PLAN FRAGMENT 0 | | OUTPUT EXPRS:<slot 2> `store_id` | <slot 3> sum(`sale_amt`) | | PARTITION: UNPARTITIONED | | | | RESULT SINK | | | | 4:EXCHANGE | | | | PLAN FRAGMENT 1 | | OUTPUT EXPRS: | | PARTITION: HASH_PARTITIONED: <slot 2> `store_id` | | | | STREAM DATA SINK | | EXCHANGE ID: 04 | | UNPARTITIONED | | | | 3:AGGREGATE (merge finalize) | | | output: sum(<slot 3> sum(`sale_amt`)) | | | group by: <slot 2> `store_id` | | | | | 2:EXCHANGE | | | | PLAN FRAGMENT 2 | | OUTPUT EXPRS: | | PARTITION: RANDOM | | | | STREAM DATA SINK | | EXCHANGE ID: 02 | | HASH_PARTITIONED: <slot 2> `store_id` | | | | 1:AGGREGATE (update serialize) | | | STREAMING | | | output: sum(`sale_amt`) | | | group by: `store_id` | | | | | 0:OlapScanNode | | TABLE: sales_records | | PREAGGREGATION: ON | | partitions=1/1 | | rollup: store_amt | | tabletRatio=10/10 | | tabletList=22038,22040,22042,22044,22046,22048,22050,22052,22054,22056 | | cardinality=0 | | avgRowSize=0.0 | | numNodes=1 | +-----------------------------------------------------------------------------+ 45 rows in set (0.006 sec)
The most important one is the rollup attribute in OlapScanNode. You can see that the rollup of the current query displays the store_amt. That is, the query has been correctly matched to the materialized view store_amt and read the data directly from the materialized view.
Best practice 2 PV,UV
Business scenario: calculate the UV and PV of advertising
Assuming that the user's original ad Click data is stored in Doris, UV query can be performed by creating a bitmap for ad PV_ Union materialized view to improve query speed.
Through the following statement, first create a table to store the details of advertising click data, including the click events of each click, what advertising is clicked, through what channels, and who the users click.
MySQL [test]> create table advertiser_view_record(time date, advertiser varchar(10), channel varchar(10), user_id int) distributed by hash(time) properties("replication_num" = "1"); Query O K, 0 rows affected (0.014 sec)
The original ad Click data table structure is:
MySQL [test]> desc advertiser_view_record; +------------+-------------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------+-------------+------+-------+---------+-------+ | time | DATE | Yes | true | NULL | | | advertiser | VARCHAR(10) | Yes | true | NULL | | | channel | VARCHAR(10) | Yes | false | NULL | NONE | | user_id | INT | Yes | false | NULL | NONE | +------------+-------------+------+-------+---------+-------+ 4 rows in set (0.001 sec)
- Create materialized view
Since the user wants to query the UV value of the advertisement, that is, it is necessary to accurately de duplicate the users of the same advertisement, the query is generally as follows:
SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel;
- For this UV scene, we can create a scene with bitmap_ The materialized view of union, so as to achieve a pre accurate de duplication effect.
In Doris, count(distinct) aggregates the result and bitmap_union_ The results of count aggregation are completely consistent. And bitmap_union_count equals bitmap_ The result of the union is counted, so if * * count(distinct) * * is involved in the query, create a * * bitmap_union * * the aggregated materialized view can speed up the query.
For this case, you can create a user group based on advertising and channels_ ID is a materialized view for accurate de duplication.
MySQL [test]> create materialized view advertiser_uv as select advertiser, channel, bitmap_union(to_bitmap(user_id)) from advertiser_view_record group by advertiser, channel; Query OK, 0 rows affected (0.012 sec)
- Note: because the user_id is an INT type, so in Doris, you need to pass the field through the function * to first_ Convert bitmap * to bitmap type before * bitmap_union * aggregation.
After creation, the table structure of ad Click details and materialized view chart is as follows:
MySQL [test]> desc advertiser_view_record all; +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+ | IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra | +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+ | advertiser_view_record | DUP_KEYS | time | DATE | Yes | true | NULL | | | | | advertiser | VARCHAR(10) | Yes | true | NULL | | | | | channel | VARCHAR(10) | Yes | false | NULL | NONE | | | | user_id | INT | Yes | false | NULL | NONE | | | | | | | | | | | advertiser_uv | AGG_KEYS | advertiser | VARCHAR(10) | Yes | true | NULL | | | | | channel | VARCHAR(10) | Yes | true | NULL | | | | | to_bitmap(`user_id`) | BITMAP | No | false | | BITMAP_UNION | +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+
- Query automatic matching
When the materialized view is created and the advertisement UV is queried, Doris will automatically advertise from the materialized view just created_ Query data in UV. For example, the original query statement is as follows:
SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel;
- After the materialized view is selected, the actual query will be converted to:
SELECT advertiser, channel, bitmap_union_count(to_bitmap(user_id)) FROM advertiser_uv GROUP BY advertiser, channel;
- Use the EXPLAIN command to check whether Doris matches the materialized view:
MySQL [test]> explain SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel; +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Explain String | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | PLAN FRAGMENT 0 | | OUTPUT EXPRS:<slot 7> `advertiser` | <slot 8> `channel` | <slot 9> bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`) | | PARTITION: UNPARTITIONED | | | | RESULT SINK | | | | 4:EXCHANGE | | | | PLAN FRAGMENT 1 | | OUTPUT EXPRS: | | PARTITION: HASH_PARTITIONED: <slot 4> `advertiser`, <slot 5> `channel` | | | | STREAM DATA SINK | | EXCHANGE ID: 04 | | UNPARTITIONED | | | | 3:AGGREGATE (merge finalize) | | | output: bitmap_union_count(<slot 6> bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`)) | | | group by: <slot 4> `advertiser`, <slot 5> `channel` | | | | | 2:EXCHANGE | | | | PLAN FRAGMENT 2 | | OUTPUT EXPRS: | | PARTITION: RANDOM | | | | STREAM DATA SINK | | EXCHANGE ID: 02 | | HASH_PARTITIONED: <slot 4> `advertiser`, <slot 5> `channel` | | | | 1:AGGREGATE (update serialize) | | | STREAMING | | | output: bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`) | | | group by: `advertiser`, `channel` | | | | | 0:OlapScanNode | | TABLE: advertiser_view_record | | PREAGGREGATION: ON | | partitions=1/1 | | rollup: advertiser_uv | | tabletRatio=10/10 | | tabletList=22084,22086,22088,22090,22092,22094,22096,22098,22100,22102 | | cardinality=0 | | avgRowSize=0.0 | | numNodes=1 | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 45 rows in set (0.030 sec)
- In the EXPLAIN result, you can first see that the rollup attribute value of OlapScanNode is advertiser_uv. That is, the query will directly scan the data of the materialized view. Description: the match was successful.
Secondly, for user_ The ID field count(distinct) is rewritten to bitmap_union_count(to_bitmap). That is to achieve the effect of accurate weight removal through bitmap.
Best practices 3
Business scenario: match richer prefix indexes
The user's original table has three columns (k1, k2, k3). Where K1 and K2 are prefix index columns. At this time, if the user query criteria include where k1=1 and k2=2, the query can be accelerated through the index.
However, in some cases, the user's filter conditions cannot match the prefix index, such as where k3=3. The query speed cannot be improved through the index.
This problem can be solved by creating a materialized view with k3 as the first column.
- Create materialized view
CREATE MATERIALIZED VIEW mv_1 as SELECT k3, k2, k1 FROM tableA ORDER BY k3;
- After the above syntax is created, the materialized view retains complete detailed data, and the prefix index of the materialized view is k3 column. The structure of the table is as follows:
MySQL [test]> desc tableA all; +-----------+---------------+-------+------+------+-------+---------+-------+ | IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra | +-----------+---------------+-------+------+------+-------+---------+-------+ | tableA | DUP_KEYS | k1 | INT | Yes | true | NULL | | | | | k2 | INT | Yes | true | NULL | | | | | k3 | INT | Yes | true | NULL | | | | | | | | | | | | mv_1 | DUP_KEYS | k3 | INT | Yes | true | NULL | | | | | k2 | INT | Yes | false | NULL | NONE | | | | k1 | INT | Yes | false | NULL | NONE | +-----------+---------------+-------+------+------+-------+---------+-------+
- Query matching
At this time, if there are k3 columns in the user's query, the filter criteria are, for example:
select k1, k2, k3 from table A where k3=3;
- At this time, the query will be directly from the MV just created_ 1 read data in materialized view. The materialized view pair k3 has a prefix index, and the query efficiency will be improved.
limitations
-
The parameters of the aggregation function of materialized views do not support expressions. Only single columns are supported. For example, sum(a+b) does not support expressions.
-
If the condition column of the delete statement does not exist in the materialized view, it cannot be deleted. If you must delete data, you need to delete the materialized view before deleting data.
-
Too many materialized views on a single table will affect the import efficiency: when importing data, the materialized views and base table data are updated synchronously. If there are more than 10 materialized views on a table, the import speed may be very slow. This is like importing 10 tables at the same time for a single import.
-
The same column and different aggregation functions cannot appear in a materialized view at the same time. For example, select sum(a), min(a) from table are not supported.
Abnormal error
- DATA_QUALITY_ERR: "The data quality does not satisfy, please check your data" materialized view creation failed due to data quality problems. Note: bitmap type only supports positive integers. If there are negative numbers in the original data, the creation of materialized view will fail