Maxcompute data generation - detailed explanation of method

Posted by jredsmyth on Fri, 28 Jan 2022 04:59:12 +0100

Introduction: the method of making some simulated data

summary

Making data can be used on some strange occasions. Generally, we have data before we have data-based applications, but conversely, if the application gets another scenario, there is no way to demonstrate without data function.

Generally, more real data can be applied to functional testing and demonstration after desensitization. However, data desensitization is actually complicated (too heavy desensitization data is useless, and too low data is leaked again), so it seems safer to simulate some data by yourself.

I generally encounter two scenarios of creating data. First, a partner or colleague asks for a method of SQL processing data, but there is no data. Second, sometimes there are some scenes of POC, which do not provide real simulation data and need to be simulated by themselves.

classification

If it is a data simulation of a single business scenario, a single table can be satisfied in many cases. However, if you simulate a business scenario or POC test scenario, you need to simulate multiple interrelated tables in a business system.

When creating data, there are usually some user requirements and clear descriptions of business scenarios. There are also some other requirements, such as the number of records in the table, the storage of rows, the generation rules of fields, the value field of fields, the enumeration value of fields, and a small amount of real data.

2.1. A table

The data of a single table may be very simple. For example, we test a function and a section of SQL JOIN logic every day. It may also be very complex. Constructing a table is equivalent to constructing a business system.

2.2. A business system

Compared with a single table, the business system only increases the number of tables. Moreover, because there is a primary foreign key relationship between the tables of the business system, it is necessary to create the code table (dimension table) first, and then rebuild the business table (fact table).

method

The method of creating simulation data is divided into two stages. The first stage is to construct a small table and generate a code table (dimension table), and then the second stage uses Cartesian product to quickly multiply the required amount of data. Among them, the data value filling of the column can be generated by random function.

3.1. Construct a small constant table

Maxcompute's simplest method of creating data is the insert into values statement, which is also the most commonly used statement. Earlier versions that did not support this statement used the method of union all. If you don't want to actually write data to, you can use the from values and with expressions.

Example 1: insert data into a specific partition through the insert... values operation.

Examples of commands are as follows:

--Create partition table srcp. 
create table if not exists srcp (key string,value bigint) partitioned by (p string);

--To partition table srcp Add partition.
alter table srcp add if not exists partition (p='abc');

--To table srcp Specified partition of abc Insert data into.
insert into table srcp partition (p='abc') values ('a',1),('b',2),('c',3);

--Query table srcp. 
select * from srcp where p='abc';

--Returns the result.
+------------+------------+------------+
| key        | value      | p          |
+------------+------------+------------+
| a          | 1          | abc        |
| b          | 2          | abc        |
| c          | 3          | abc        |
+------------+------------+------------+

Example 2: insert data through the values table operation.

Examples of commands are as follows:
--Create partition table srcp. 
create table if not exists srcp (key string,value bigint) partitioned by (p string);

--To table srcp Insert data into.
insert into table srcp partition (p) select concat(a,b), length(a)+length(b),'20170102' from values ('d',4),('e',5),('f',6) t(a,b);

--Query table srcp. 
select * from srcp where p='20170102';

--Returns the result.
+------------+------------+------------+
| key        | value      | p          |
+------------+------------+------------+
| d4         | 2          | 20170102   |
| e5         | 2          | 20170102   |
| f6         | 2          | 20170102   |
+------------+------------+------------+

Values (...), (...) t(a, b) is equivalent to defining a table named T, columns a and B, and data types STRING and BIGINT respectively. The type of the column needs to be derived from the values list.

Example 3: construct a constant table by combining from values or union all.

Examples of commands are as follows:

with t as (select 1 c union all select 2 c) select * from t;
--Equivalent to the following statement.
select * from values (1), (2) t(c);

--Returns the result.
+------------+
| c          |
+------------+
| 1          |
| 2          |
+------------+

The above example comes from:
https://help.aliyun.com/document_detail/73778.html?spm=a2c4g.11186623.6.732.7e477b57ZhLOGj

3.2. Constructing large tables by Cartesian product

As we all know, the expression of Cartesian product can only be used when MAPJOIN is prompted. Therefore, the small constant table constructed in the first step can use MAPJOIN.

Examples of commands are as follows:

--1 construct a constant table (the ordered number I use here is convenient to use where to get the record number of a specified number to multiply the Cartesian product)

create table za1 as 
select c0 from values
 (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15)
,(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)
,(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45)
,(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60)
,(61),(62),(63)
t(c0);

-- 2 Use the constant table to correlate multiple times to construct the required number of records[Let's use a calculator to make a rough calculation N How many power is enough]
create table zb1 as 
select *
  from(
-- 10*63*63=39690
select /*+mapjoin(t2,t3)*/
       1000000 + row_number() over(partition by 1)-1 as c0
  from za1 t1 -- 63
  join za1 t2 -- 63
  join(select c0 from za1 limit 10)t3 -- 10
)t
;

--3 The table constructed in step 2 has reached 10000, and the number of table records reconstructed with this table can easily reach 100 million

3.3. Fill columns with random and ordered values

In essence, data types can be divided into two types: sequence value and enumeration value. Sequence value is an ordered sequence of numbers, using row_number() function, which is mainly defined as the primary key in this scenario. Enumeration values are a few code values (numerical value, amount, code) distributed in records. These enumeration values are mainly filled with random functions. I won't describe other situations that I haven't encountered yet.

Examples of commands are as follows:

-- 1 Ordered value, in this example, the generated data is an ordered value from 1000000-1036689 The sequence can be used as the main foreign key of the business
select /*+mapjoin(t2,t3)*/
       1000000 + row_number() over(partition by 1)-1 as c0
  from za1 t1 -- 63
  join za1 t2 -- 63
  join(select c0 from za1 limit 10)t3 -- 10
;

-- 2 Random value/Fixed value, in this example c2 The column produces a relatively uniform 1-1000 Value of
-- Random numbers generated by random functions are floating-point values and must be converted to bigint
select /*+mapjoin(t2,t3)*/
       1000000 + row_number() over(partition by 1)-1 as c0
      ,1617120000 as c1
      ,cast(round(rand()*999,0) as bigint)+1 as c2
  from za1 t1 -- 63
  join za1 t2 -- 63
  join(select c0 from za1 limit 10)t3 -- 10
;

3.4. Construction of different data types

General data types can be divided into four types: unique value of primary key, enumeration value represented by string, numerical value, date and time. The only difference is that the enumeration value is a number rather than text, and there is no construction date and time. So if it is really necessary, how to implement it.

Time can be constructed into a unixtime, which can be converted into a numerical value. For text type enumeration values, you can first construct a code table, then construct a business table, and then associate them (generally, the business system stores code values instead of a long string).
Examples of commands are as follows:

-- Use code table to convert text
with za as (
select * from values
 (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15)
,(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)
,(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45)
,(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60)
,(61),(62),(63)
t(c0)
)
,ta as (
select * from values ('zhangsan',4),('lisi',5),('wangmazi',6) t(a,b))
select k,a,b,c
  from(
select 100 + row_number() over(partition by 1)-1 as k
      ,cast(round(rand()*3,0) as bigint)+3 as c
  from za  -- 63
 limit 3
)tb join ta on ta.b=tb.c
;
return:
k   a   b   c
101 lisi    5   5
102 wangmazi    6   6
103 zhangsan    4   4

-- utilize unixtimetamp Transfer date and time
with za as (
select * from values
 (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15)
,(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)
,(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45)
,(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60)
,(61),(62),(63)
t(c0)
)
select k
,from_unixtime(1617120000) as t
,from_unixtime(1617120000
+3600000 * c )   -- hour
 as b
,c
  from(
select 100 + row_number() over(partition by 1)-1 as k
      ,cast(round(rand()*3,0) as bigint)+3 as c
  from za  -- 63
 limit 3
)tb 
;
return:
k   t   b   c
100 2021-03-31 00:00:00 2021-03-31 03:00:00 3
101 2021-03-31 00:00:00 2021-03-31 05:00:00 5
102 2021-03-31 00:00:00 2021-03-31 06:00:00 6

practice

4.1. Practical cases

In a POC project in the telecommunications industry some time ago, the customer first gave 80 lines of real data, asked to create more than one billion data, and gave some very special data requirements.

Requirements for original data and data characteristics after processing according to customer requirements

Number of records: the number of records in a single table, 1.6 billion in the original and 170 million after processing;

Number of users: 14 million;

Number of equipment: 230000;

Single line record size: 436KB of original data and 157KB after processing;

Number of single user records (min, Max, median): the minimum value is 1; The maximum value is less than 30000 in the original and 24000 after treatment; The median original value is 51 and 4 after processing;

The following table:

time	category	Number of records	Number of users	Minimum number of user records	Maximum number of user records	Median number of user records	Number of equipment	Single line (KB)
9:00	original	1668486059	14297500	1	31973	51	231272	436
9:00	handle	174817694	13371656	1	2441	4	230860	157

According to the above requirements, the first step is to analyze business requirements. There are 61 columns of original data, but only 10 columns actually participate in data calculation. Therefore, to construct the original table, you only need to construct these 10 columns, and then select one row from the 61 columns of the original records.

Analyze the original data structure and select the data columns involved in the calculation:

create table if not exists t_log10 (
 imei_tac             int      comment 'User equipment ID1' 
,phone7               int      comment 'User equipment ID2'  
,imsi                 string   comment 'User equipment ID3'  
,msisdn               string   comment 'User equipment ID4' 
,tac                  int      comment 'Telecommunication equipment ID1'
,cell_id              int      comment 'Telecommunication equipment ID2'
,Procedure_Type       int      comment 'Business type'
,Procedure_Start_Time bigint   comment 'Business start time, unixtimestamp'
,Procedure_status     int      comment 'Business status, fixed value 1'
,country_code         int      comment 'Country code, fixed value-406'  )
partitioned by (hh string);

In the telecom service, this service scenario describes the registration of the user's mobile device on the base station equipment of the telecom operator. This business calculation uses 10 fields. There are five dimensions related to user equipment, namely user equipment ID (1-4) and country code; There are two dimensions related to telecommunication equipment, namely telecommunication equipment ID (1-2). The other three are related to the occurrence of user equipment and telecom equipment business, namely business type, business status and business start time.

Therefore, after the demand analysis, I think I need to build a user equipment dimension table and telecom base station equipment dimension table, and then build a telecom business fact table (business table) according to these dimension tables.
The first step is to build the dimension (code) table of Telecom base station:

drop table if exists t_tac_lacid;
create table if not exists t_tac_lacid (id bigint,tac bigint,lacid bigint);

insert overwrite table t_tac_lacid
select /*+mapjoin(t2)*/
 row_number() over(partition by 1)+100000 as rn
,t1.c0+6001  as tac
,t2.c0+1201  as lacid
from (select row_number() over(partition by 1)-1 as c0 from zb1 limit 2300)t1
join (select row_number() over(partition by 1)-1 as c0 from zb1 limit 100)t2
;
-- 230000

In this example, through the specific number of records in the zb1 selected area, the result set of the specified number of records is multiplied by the Cartesian product. Because the two IDS need to build a unique primary key, row is used here_ Number window function. When building the primary key, 100000 + is used to build a fixed length ID.
The second step is to build the user equipment dimension (code) table.

drop table if exists t_user;
create table t_user (imei_tac bigint,phone7 bigint,imsi string ,msisdn string);

insert overwrite table t_user
select 
 rn as imei_tac
,cast(substr(to_char(rn),2,7) as bigint)+1000000 as phone7
,substr(MD5(rn), 1,10) as imsi
,substr(MD5(rn),11,10) as msisdn
from(
select /*+mapjoin(t2,t3,t4)*/
 row_number() over(partition by 1)+10000000 as rn
from za1 t1
join za1 t2
join za1 t3
join (select c0 from za1 limit 58) t4
-- limit 100
)t;
-- 14502726
-- 63*63*63*58 = 14502726

In this example, a seemingly real number of records is constructed by using za1 this table four times (in fact, it makes no difference if there are several pieces of data, which is a little boring here). Use row_ The number window function constructs the business primary key, and transforms several forms (MD5 interception) to build different primary key styles. Then the random function is used to construct the base station information. In fact, the base station information is also calculated. These special processing are mainly to build the final result table.

The last step is to build the result table, because we haven't considered the problems of median, extreme value and processed results before. In fact, the final implementation is relatively complex (it's too long to stick out, so you need to ask me for it alone).

The method to meet special requirements is user segmentation:

1) Extreme value: a very small number of user records meet the user extreme value [for example, select 500 users

2) Median, the median must be the number of records of more than half of the users

3) Supplement the number of users, excluding the extremum and median

You need to use hints to improve performance, because the original tables for creating data are very small, and there is usually only one worker in the map stage. Therefore, it is necessary to cut down the data block input in the map phase and increase the resources of map and reduce.

set odps.sql.mapper.cpu=200;
set odps.sql.mapper.memory=8192;
set odps.sql.mapper.split.size=4;
set odps.sql.reducer.cpu=200;
set odps.sql.reducer.memory=8192;

4.2. summary

Creating data scenarios is relatively simple most of the time, but the above special complex situations will also be encountered. However, complex business mainly tests the ability of data processing. How to use basic tables to generate complex tables is still the process of building the relational model of relational database.

To build a single data table, first analyze the dimensions and facts in the business, then build the dimensions and use the dimensions to build the facts.

Original link

This article is the original content of Alibaba cloud and cannot be reproduced without permission.

Topics: SQL

Programmer Think

Maxcompute data generation - detailed explanation of method

summary

classification

method

practice

Hot Topics