hive summary 4: actual training of table building (analysis of knowledge points involved in table building code)

Posted by soccer022483 on Tue, 08 Mar 2022 22:11:05 +0100

catalogue

1: Table creation code

2: Data model

3: partition by range

4: hash bucket

V other

1: Table creation code

CREATE TABLE `heheyotubehehe_ads_people_kehuduan_experience_data` (
  `day` int(11) NOT NULL DEFAULT "0" COMMENT "",
  `os` varchar(50) NULL COMMENT "",
  `soft_version` varchar(50) NULL COMMENT "",
  `is_app_new` varchar(50) NULL COMMENT " Is it a new user",
  `disp_pv` int(11) NULL DEFAULT "0" COMMENT "Pop up window I show pv",
  `disp_uv` int(11) NULL DEFAULT "0" COMMENT "Pop up window I show uv",
  `agree_clk_uv` int(11) NULL DEFAULT "0" COMMENT "Click the Agree button in the pop-up window uv",
  `disagree_clk_uv` int(11) NULL DEFAULT "0" COMMENT "Pop up a disagree button click uv",
  `second_disp_pv` int(11) NULL DEFAULT "0" COMMENT Private pop-up window II display pv",
  `second_disp_uv` int(11) NULL DEFAULT "0" COMMENT "Pop up window II display uv",
  `second_agree_clk_uv` int(11) NULL DEFAULT "0" COMMENT "Pop up window 2: click the Agree button uv",
  `second_think_clk_uv` int(11) NULL DEFAULT "0" COMMENT "Pop up window 2 disagree button click uv",
  `third_disp_pv` int(11) NULL DEFAULT "0" COMMENT "Pop up window III display pv",
  `third_disp_uv` int(11) NULL DEFAULT "0" COMMENT "Pop up window III display uv",
  `third_check_agreement_clk_uv` int(11) NULL DEFAULT "0" COMMENT "Click the view agreement button in pop-up window 3 uv",
  `third_exit_clk_uv` int(11) NULL DEFAULT "0" COMMENT "Pop up window 3 Click the exit application button uv",
) ENGINE=OLAP
AGGREGATE KEY(`day`, `os`, `soft_version`, `is_app_new`,  `disp_pv`, `disp_uv`, 
`agree_clk_uv`, `disagree_clk_uv`, `second_disp_pv`, 
`second_disp_uv`, `second_agree_clk_uv`, `second_think_clk_uv`,
 `third_disp_pv`, `third_disp_uv`, `third_check_agreement_clk_uv`, 
`third_exit_clk_uv`, `device_disp_pv`, `device_disp_uv`, 
`device_dis_uv`, `device_agree_clk_uv`, `device_disagree_clk_uv`,
 `glide_disp_pv`, `glide_disp_uv`, `glide_dis_uv`)
COMMENT "[heheyotubehehe]Experience report"
PARTITION BY RANGE(`day`)
(PARTITION p202104 VALUES [("20210401"), ("20210501")),
PARTITION p202105 VALUES [("20210501"), ("20210601")),
PARTITION p202106 VALUES [("20210601"), ("20210701")),
PARTITION p202107 VALUES [("20210701"), ("20210801")))
DISTRIBUTED BY HASH(`day`) BUCKETS 10
PROPERTIES (
"replication_num" = "3",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "MONTH",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-360",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p",
"dynamic_partition.replication_num" = "3",
"dynamic_partition.buckets" = "10",
"dynamic_partition.start_day_of_month" = "1",
"in_memory" = "false",
"storage_format" = "V2"
);

[Note: the field starts casually and does not involve trade secrets. If there is infringement, it must be deleted]

2: Data model

aggregate key appears in the code. It belongs to a data model.

OLAP databases with MPP architecture such as Doris usually process a large amount of data by improving concurrency

Doris's data models are mainly divided into three categories: aggregate, uniq and duplicate

Aggregate model: in doris, the aggregation granularity of value is determined by key.

Uniq model: this kind of data has no aggregation requirements, and only needs to ensure the uniqueness of the primary key

Duplicate model: in some multidimensional analysis scenarios, data has neither primary key nor aggregation requirements

Suggestions for data model selection:

1) Through pre aggregation, the Aggregate model can greatly reduce the amount of data to be scanned and the amount of calculation to be queried when aggregating queries,

It is very suitable for report query scenarios with fixed patterns. However, the model is not friendly to count(*) queries.

At the same time, because the aggregation method on the Value column is fixed, semantic correctness needs to be considered in other types of aggregation queries.
2) The Uniq model can ensure the uniqueness of primary key constraints for scenarios that require unique primary key constraints.

However, it is impossible to take advantage of the query advantages brought by pre aggregation such as ROLLUP (because the essence is REPLACE and there is no SUM aggregation).
3) Duplicate is suitable for ad hoc queries of any dimension. Although it is also impossible to take advantage of the characteristics of prepolymerization,

However, it is not constrained by the aggregation model and can give full play to the advantages of the column storage model (only relevant columns are read instead of all Key columns).

3: partition by range

partition by range('day ') appears in the code

doris supports two-level partition storage: the first layer is range partition, and the second layer is hash bucket

partition by range syntax:

partition by range('day')(

Partition , partition name 1 , values , partition range 1

Partition , partition name 2 , values , partition range 2

)

4: hash bucket

distributed by hash('day') buckets 10 in the code

Divide the data into different bucket s according to the hash value.

1) It is recommended to use the column with large discrimination as the bucket to avoid data skew
2) To facilitate data recovery, it is recommended that the size of a single bucket should not be too large and should be kept within 10GB,

Therefore, when creating tables or adding partitions, please consider the number of buckets reasonably. Different partitions can specify different numbers of buckets.

V other

1)engine=olap

2) The fields in parentheses after aggregate key correspond to the fields defined above one by one.

3) properties are fixed choices.

 

reference:

https://blog.csdn.net/m0_37622868/article/details/82849108

https://blog.csdn.net/u012150370/article/details/104783173

 

Topics: Big Data