Real time is the future. Recently, we shared real-time computing services at Tencent cloud computing Oceanus (Flink)~
Project background
This paper introduces how to build a real-time data warehouse by combining MySQL database, stream computing Oceanus (Flink), HBase and cloud data warehouse ClickHouse, read MySQL data through stream computing Oceanus, associate the dimension table in HBase, and finally store the data into cloud data warehouse ClickHouse for index analysis, so as to realize the whole process operation guidance of complete real-time data warehouse.
1. Environmental construction
1.1 create flow computing Oceanus cluster
On the flow calculation Oceanus product activity page 1 yuan to buy Oceanus cluster.
Create a cluster on the [cluster management] - [new cluster] page of Oceanus console, select region, availability area, VPC, log, storage, and set the initial password.
If you have not used VPC before, you need to create logs to store these components. VPC needs and the following MySQL The ES cluster and ES cluster use the same, otherwise it needs to be opened manually (such as peer-to-peer connection).
The created clusters are as follows:
1.2 create a private network VPC
The private network is a logically isolated network space customized by you on Tencent cloud. When building MySQL, EMR, ClickHouse cluster and other services, the selected networks must be consistent before the networks can communicate with each other. Otherwise, you need to use peer-to-peer connection, VPN and other methods to get through the network. Page address: https://console.cloud.tencent.com/vpc/vpc?rid=8
1.3 create cloud MySQL service
Cloud database MySQL It is a high-performance distributed data storage service built by Tencent cloud based on the open source database MySQL, which enables users to more easily set up, operate and expand relational databases in the cloud. Page address: https://console.cloud.tencent.com/cdb
For the page of creating a new MySQL service, note that the selected network is created before.
After creating the MySQL service, you need to modify the binlog_ row_ The image parameter, as shown in the figure, is modified to FULL (the default value is MINIMAL).
After modifying the parameters, log in to MySQL to create the database and database tables required by the example.
1) Create database mysqltestdb
Open the SQL window or click the visualization page to create the database and table.
New database command:
create database mysqltestdb;
Create a new table student based on the newly created database:
create table `student` ( `id` int(11) not null auto_increment comment 'Primary key id', `name` varchar(10) collate utf8mb4_bin default '' comment 'name', `age` int(11) default null comment 'Age', `create_time` timestamp null default current_timestamp comment 'Data creation time', primary key (`id`) ) engine=innodb auto_increment=4 default charset=utf8mb4 collate=utf8mb4_bin row_format=compact comment='Student list'
Insert data into Student table:
insert into mysqltestdb.student(id,name,age) values(1,"xiaomin",20);
1.4 creating an EMR cluster
EMR is an elastic open source pan Hadoop service hosted in the cloud, supporting Spark, HBase, Presto, Flink, Druid, etc big data Framework, this example mainly needs to use HBase components. Page address https://console.cloud.tencent.com/emr
Install the HBase component in the EMR cluster.
In the production environment, the server configuration can be selected according to the actual situation. In the example, the low configuration server is selected. The network needs to select the VPC network created before, and always keep the service components under the same VPC.
Enter HBase Master node
Click login to enter the server
Enter command:
# Enter HBase command root@172~# hbase shell
Enter hbase shell and create a new table:
# Create table statement create 'dim_hbase', 'cf' # insert data put 'dim_hbase','1','cf:name','MingDeSchool'
1.5 create a cloud data warehouse ClickHouse
Page address: https://console.cloud.tencent.com/cdwch
New cluster
Select the network. Select the VPC network newly created before (still ensure that all services are on the same network)
Log in to ClickHouse
In the previously created EMR, select a virtual machine and click login. It is best to select a node with an external network IP.
Install the ClickHouse client on this machine. ClickHouse client installation tutorial: https://cloud.tencent.com/document/product/1299/49824
Login client
Command example:
clickhouse-client -h User's own ClickHouse service IP --port 9000
New database
create database testdb on cluster default_cluster;
new table
CREATE TABLE testdb.student_school on cluster default_cluster ( `id` Int32, `name` Nullable(String), `school_name` Nullable(String), `Sign` Int8 ) ENGINE = ReplicatedCollapsingMergeTree('/clickhouse/tables/{layer}-{shard}/testdb/ student_school, '{replica}', Sign) ORDER BY id;
2. Data cleaning and operation processing
two point one Data preparation
2.1. one MySQL-CDC data and HBase dimension table data
Follow the above steps to create tables and insert data into MySQL and HBase tables.
two point two Create Flink SQL job
Create SQL jobs in the Oceanus console.
2.2. one Source end
MySQL-CDC Source:
--Student information as cdc Source table CREATE TABLE `student` ( `id` INT NOT NULL, `name` varchar, `age` INT, proc_time AS PROCTIME(), PRIMARY KEY (`id`) NOT ENFORCED ) WITH ( 'connector' = 'mysql-cdc', 'hostname' = 'YoursIp', 'port' = '3306', 'username' = 'user name', -- User name for database access (required) SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT, SELECT, RELOAD Permissions) 'password' = 'YoursPassword, 'database-name' = 'mysqltestdb', 'table-name' = 'student' );
HBase dimension table
--Example use school School information as dimension table CREATE TABLE dim_hbase ( rowkey STRING, cf ROW <school_name STRING>, -- If there are multiple column clusters, write cf Row<age INT,name String> PRIMARY KEY (rowkey) NOT ENFORCED ) WITH ( 'connector' = 'hbase-1.4', 'table-name' = 'dim_hbase', 'zookeeper.quorum' = 'User's own hbase The server zookeeper Addresses, multiple separated by commas' );
2.2. two Sink end
Create table statement to ClickHouse
--Deposit after Association clickhouse surface CREATE TABLE `student_school` ( stu_id INT, stu_name STRING, school_name STRING, PRIMARY KEY (`id`) NOT ENFORCED ) WITH ( 'connector' = 'clickhouse', 'url' = 'clickhouse://yourIP:8123', -- If ClickHouse The cluster is not configured with an account, and the password can not be specified --'username' = 'root', --'password' = 'root', 'database-name' = 'testdb', 'table-name' = ' student_school ', 'table.collapsing.field' = 'Sign' );
2.2. three Perform logical operations
In this example, only a simple Join is performed without complex operations. For detailed operation logic, please refer to:
Stream calculation Oceanus operator and built-in function Or Flink's official website Flink SQL development chapter
INSERT INTO student_school SELECT student.id as stu_id, student.name as stu_name, dim_hbase.cf.school_name FROM student JOIN dim_hbas for SYSTEM_TIME as of student.proc_time ON CAST(student.school_id AS STRING) = dim_hbase.rowkey;
2.3 result verification
Query whether the data is correct in the ClickHouse database.
select * from testdb.student_school;
summary
If you use HBase as a dimension table, you need to purchase an EMR cluster. If you don't need an EMR cluster, you can use MySQL, Redis, etc. as a dimension table for association.
Original text transferred from: https://cloud.tencent.com/developer/article/1861802