Real time data warehouse: real time analysis of MySQL and HBase dimension tables to ClickHouse based on flow computing Oceanus

Posted by irishmike2004 on Wed, 22 Dec 2021 19:17:55 +0100

Real time is the future. Recently, we shared real-time computing services at Tencent cloud computing Oceanus (Flink)~

Project background

This paper introduces how to build a real-time data warehouse by combining MySQL database, stream computing Oceanus (Flink), HBase and cloud data warehouse ClickHouse, read MySQL data through stream computing Oceanus, associate the dimension table in HBase, and finally store the data into cloud data warehouse ClickHouse for index analysis, so as to realize the whole process operation guidance of complete real-time data warehouse.

1. Environmental construction

1.1 create flow computing Oceanus cluster

On the flow calculation Oceanus product activity page 1 yuan to buy Oceanus cluster.

Create a cluster on the [cluster management] - [new cluster] page of Oceanus console, select region, availability area, VPC, log, storage, and set the initial password.

If you have not used VPC before, you need to create logs to store these components. VPC needs and the following MySQL The ES cluster and ES cluster use the same, otherwise it needs to be opened manually (such as peer-to-peer connection).

The created clusters are as follows:

1.2 create a private network VPC

The private network is a logically isolated network space customized by you on Tencent cloud. When building MySQL, EMR, ClickHouse cluster and other services, the selected networks must be consistent before the networks can communicate with each other. Otherwise, you need to use peer-to-peer connection, VPN and other methods to get through the network. Page address: https://console.cloud.tencent.com/vpc/vpc?rid=8

1.3 create cloud MySQL service

Cloud database MySQL It is a high-performance distributed data storage service built by Tencent cloud based on the open source database MySQL, which enables users to more easily set up, operate and expand relational databases in the cloud. Page address: https://console.cloud.tencent.com/cdb

For the page of creating a new MySQL service, note that the selected network is created before.

After creating the MySQL service, you need to modify the binlog_ row_ The image parameter, as shown in the figure, is modified to FULL (the default value is MINIMAL).

After modifying the parameters, log in to MySQL to create the database and database tables required by the example.

1) Create database mysqltestdb

Open the SQL window or click the visualization page to create the database and table.

New database command:
create database mysqltestdb;

Create a new table student based on the newly created database:

create table `student` (
  `id` int(11) not null auto_increment comment 'Primary key id',
  `name` varchar(10) collate utf8mb4_bin default '' comment 'name',
  `age` int(11) default null comment 'Age',
  `create_time` timestamp null default current_timestamp comment 'Data creation time',
  primary key (`id`)
) engine=innodb auto_increment=4 default charset=utf8mb4 collate=utf8mb4_bin row_format=compact comment='Student list'
Insert data into Student table:
 insert into mysqltestdb.student(id,name,age) values(1,"xiaomin",20);

1.4 creating an EMR cluster

EMR is an elastic open source pan Hadoop service hosted in the cloud, supporting Spark, HBase, Presto, Flink, Druid, etc big data Framework, this example mainly needs to use HBase components. Page address https://console.cloud.tencent.com/emr

Install the HBase component in the EMR cluster.

In the production environment, the server configuration can be selected according to the actual situation. In the example, the low configuration server is selected. The network needs to select the VPC network created before, and always keep the service components under the same VPC.

Enter HBase Master node

Click login to enter the server

Enter command:

# Enter HBase command

root@172~# hbase shell

Enter hbase shell and create a new table:

# Create table statement
  create 'dim_hbase', 'cf'

# insert data
  put 'dim_hbase','1','cf:name','MingDeSchool'

1.5 create a cloud data warehouse ClickHouse

Page address: https://console.cloud.tencent.com/cdwch

New cluster

Select the network. Select the VPC network newly created before (still ensure that all services are on the same network)

Log in to ClickHouse

In the previously created EMR, select a virtual machine and click login. It is best to select a node with an external network IP.

Install the ClickHouse client on this machine. ClickHouse client installation tutorial: https://cloud.tencent.com/document/product/1299/49824

Login client

Command example:

clickhouse-client -h User's own ClickHouse service IP --port 9000

New database

create database testdb on cluster default_cluster;

new table

CREATE TABLE testdb.student_school on cluster default_cluster (
`id` Int32,
`name` Nullable(String),
`school_name` Nullable(String),
`Sign` Int8
) ENGINE = ReplicatedCollapsingMergeTree('/clickhouse/tables/{layer}-{shard}/testdb/ student_school, '{replica}', Sign) ORDER BY id;

2. Data cleaning and operation processing

two point one Data preparation

2.1. one MySQL-CDC data and HBase dimension table data

Follow the above steps to create tables and insert data into MySQL and HBase tables.

two point two Create Flink SQL job

Create SQL jobs in the Oceanus console.

2.2. one Source end

MySQL-CDC Source:

--Student information as cdc Source table
CREATE TABLE `student` (
  `id` INT NOT NULL,
  `name` varchar,
  `age` INT,
  proc_time AS PROCTIME(),
  PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'YoursIp', 
  'port' = '3306',
  'username' = 'user name',
  -- User name for database access (required) SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT, SELECT, RELOAD Permissions)
  'password' = 'YoursPassword,
  'database-name' = 'mysqltestdb',
  'table-name' = 'student' 
);

HBase dimension table

--Example use school School information as dimension table
CREATE TABLE dim_hbase (
  rowkey STRING,
  cf ROW <school_name STRING>,  -- If there are multiple column clusters, write cf Row<age INT,name String>
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-1.4',
  'table-name' = 'dim_hbase',
  'zookeeper.quorum' = 'User's own hbase The server zookeeper Addresses, multiple separated by commas'
);

2.2. two Sink end

Create table statement to ClickHouse

--Deposit after Association clickhouse surface
CREATE TABLE `student_school` (
  stu_id INT,
  stu_name STRING,
  school_name STRING,
  PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'connector' = 'clickhouse',
  'url' = 'clickhouse://yourIP:8123',
  -- If ClickHouse The cluster is not configured with an account, and the password can not be specified
  --'username' = 'root',
  --'password' = 'root',
  'database-name' = 'testdb',
  'table-name' = ' student_school ',
  'table.collapsing.field' = 'Sign'
);

2.2. three Perform logical operations

In this example, only a simple Join is performed without complex operations. For detailed operation logic, please refer to:

Stream calculation Oceanus operator and built-in function Or Flink's official website Flink SQL development chapter

INSERT INTO
  student_school
SELECT
  student.id as stu_id,
  student.name as stu_name,
  dim_hbase.cf.school_name
FROM
  student
  JOIN dim_hbas for SYSTEM_TIME as of student.proc_time
  ON CAST(student.school_id AS STRING) = dim_hbase.rowkey;

2.3 result verification

Query whether the data is correct in the ClickHouse database.

select * from testdb.student_school; 

summary

If you use HBase as a dimension table, you need to purchase an EMR cluster. If you don't need an EMR cluster, you can use MySQL, Redis, etc. as a dimension table for association.

Original text transferred from: https://cloud.tencent.com/developer/article/1861802

Topics: MySQL HBase flink clickhouse