Flink Practice Tutorial: Getting Started 5 - Writing to ClickHouse

Posted by szym9341 on Sat, 06 Nov 2021 21:44:06 +0100

Introduction to Stream Compute Oceanus

Stream Compute Oceanus is a powerful tool for real-time analysis in the big data ecosystem. It is an enterprise-grade real-time big data analytics platform based on Apache Flink, featuring one-stop development, seamless integration, sub-second latency, low cost, security, and stability. Stream Compute Oceanus aims to maximize the value of enterprise data and accelerate enterprises' real-time digital transformation.

This article walks through how to use the Datagen connector to generate simulated user video click data, aggregate each user's clicks per minute with a tumbling window, and write the results to ClickHouse.

Prerequisites

Create a Stream Compute Oceanus cluster

Go to the Stream Compute Oceanus console [1], click Cluster Management on the left, then click Create Cluster in the upper left corner to create an Oceanus cluster. For details, see the official documentation on creating an exclusive cluster [2].

Create a ClickHouse cluster

Go to the ClickHouse console [3] and click New Cluster in the upper left corner to create a ClickHouse cluster. For details, see ClickHouse Quick Start [4].

Note: the Oceanus cluster and the ClickHouse cluster must be created in the same VPC.

Create the ClickHouse table:

1. Enter a CVM instance in the same VPC as the ClickHouse cluster and install the ClickHouse client (Internet access is required to download it). See ClickHouse Quick Start [4] for the detailed steps.

# Download the ClickHouse client packages
wget https://repo.yandex.ru/clickhouse/rpm/stable/x86_64/clickhouse-client-20.7.2.30-2.noarch.rpm
wget https://repo.yandex.ru/clickhouse/rpm/stable/x86_64/clickhouse-common-static-20.7.2.30-2.x86_64.rpm

2. Install the client

rpm -ivh *.rpm

3. Log in to the ClickHouse cluster over the TCP port; the server IP address can be found in the console.

clickhouse-client -h xxx.xxx.xxx.xxx --port 9000

4. Create the table on the ClickHouse cluster.

CREATE TABLE default.datagen_to_ck on cluster default_cluster (
win_start     TIMESTAMP,
win_end       TIMESTAMP,
user_id       String,
amount_total  Int16,
Sign          Int8  )
ENGINE = ReplicatedCollapsingMergeTree('/clickhouse/tables/{layer}-{shard}/default/datagen_to_ck', '{replica}',Sign)
ORDER BY (win_start,win_end,user_id);
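The Sign column is what makes in-place updates work on this table: when the Flink connector needs to update an already-written window result, it writes a cancel row (Sign = -1) for the old value followed by a new row with Sign = +1, and ClickHouse eliminates matched +1/-1 pairs during background merges. A rough Python sketch of that collapsing idea (the `collapse` helper and the sample rows are illustrative, not connector code):

```python
from collections import defaultdict

def collapse(rows):
    """Rough sketch of CollapsingMergeTree merge semantics: for each
    sorting key, a row with Sign = -1 cancels a previously written
    Sign = +1 row, so only the latest uncancelled state survives."""
    state = defaultdict(list)
    for key, amount, sign in rows:
        if sign == 1:
            state[key].append(amount)
        elif state[key]:  # sign == -1 cancels one earlier +1 row
            state[key].pop()
    return {k: v for k, v in state.items() if v}

# An update is written as a cancel (-1) of the old row plus a new +1 row
rows = [
    ("user_a", 5, 1),    # initial window aggregate
    ("user_a", 5, -1),   # cancel the old value
    ("user_a", 8, 1),    # write the updated value
    ("user_b", 3, 1),
]
print(collapse(rows))  # → {'user_a': [8], 'user_b': [3]}
```

This is why the ORDER BY key above must uniquely identify a window result (win_start, win_end, user_id): the cancel row has to land on the same sorting key as the row it cancels.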

Stream Compute Oceanus job

1. Create Source

CREATE TABLE random_source ( 
    user_id   VARCHAR,
    amount    INT,
    pre_time  AS CURRENT_TIMESTAMP,
    WATERMARK FOR pre_time AS pre_time - INTERVAL '3' SECOND
  ) WITH ( 
  'connector' = 'datagen', 
  'rows-per-second' = '5',            -- Rows generated per second
  'fields.user_id.length' = '1',      -- Length of random string
  'fields.amount.kind' = 'random',    -- Unbounded random number
  'fields.amount.min' = '1',          -- Minimum value of random number
  'fields.amount.max' = '10'          -- Maximum value of random number
);
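Outside of Flink, the shape of the data this Datagen source emits can be previewed with a small Python sketch (the `gen_rows` helper is hypothetical; it only mimics the field definitions above):

```python
import random
import string
from datetime import datetime, timezone

def gen_rows(n, seed=None):
    """Emit rows shaped like the Datagen source above: user_id is a
    random string of length 1, amount is a bounded random integer in
    [1, 10], and pre_time is the current processing time."""
    rng = random.Random(seed)
    for _ in range(n):
        yield {
            "user_id": rng.choice(string.ascii_lowercase),
            "amount": rng.randint(1, 10),  # 'fields.amount.min'/'max'
            "pre_time": datetime.now(timezone.utc),
        }

for row in gen_rows(5, seed=42):
    print(row["user_id"], row["amount"])
```

With user_id only one character long, many rows share the same key, which gives the per-user window aggregation below something meaningful to sum.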

2. Create Sink

CREATE TABLE clickhouse (
    win_start     TIMESTAMP(3),
    win_end       TIMESTAMP(3),
    user_id       VARCHAR,
    amount_total  BIGINT,
    PRIMARY KEY (win_start,win_end,user_id) NOT ENFORCED -- If the target table defines a primary key, it must also be declared here
) WITH (
    'connector' = 'clickhouse',
    'url' = 'clickhouse://10.0.0.178:8123',
    --'username' = 'root',   -- If the ClickHouse cluster has no account configured, username and password can be omitted
    --'password' = 'root',
    'database-name' = 'default',
    'table-name' = 'datagen_to_ck',
    'table.collapsing.field' = 'Sign'   -- Name of the CollapsingMergeTree Sign column
);

3. Write business SQL

INSERT INTO clickhouse
SELECT
    TUMBLE_START(pre_time,INTERVAL '1' MINUTE) AS win_start,
    TUMBLE_END(pre_time,INTERVAL '1' MINUTE) AS win_end,
    user_id,
    CAST(SUM(amount) AS BIGINT) AS amount_total
FROM random_source
GROUP BY TUMBLE(pre_time,INTERVAL '1' MINUTE),user_id;
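TUMBLE(pre_time, INTERVAL '1' MINUTE) assigns each record to a fixed, non-overlapping one-minute window: win_start is the timestamp floored to the minute and win_end is win_start plus the window size. A small Python sketch of the boundary computation (the `tumble` helper is illustrative):

```python
from datetime import datetime, timedelta

def tumble(ts, size=timedelta(minutes=1)):
    """Return the [win_start, win_end) window that a tumbling window
    of the given size assigns to timestamp ts (aligned to the epoch)."""
    epoch = datetime(1970, 1, 1)
    win_start = ts - (ts - epoch) % size   # floor ts to the window size
    return win_start, win_start + size

start, end = tumble(datetime(2021, 11, 6, 21, 44, 6))
print(start, end)  # → 2021-11-06 21:44:00 2021-11-06 21:45:00
```

Because the windows never overlap, each (win_start, win_end, user_id) combination is emitted at most once per final result, which is exactly what the ClickHouse sorting key expects.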

4. Select Connector

Click [Job Parameters], select the ClickHouse Connector under [Built-in Connectors], then click [Save] > [Publish Draft] to run the job.
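Once the job has been running for a couple of minutes, the results can be spot-checked from the ClickHouse client. This query is a sketch against the table created above; the FINAL modifier forces merging so cancelled +1/-1 row pairs do not appear twice:

```sql
SELECT win_start, win_end, user_id, amount_total
FROM default.datagen_to_ck FINAL
ORDER BY win_start, user_id
LIMIT 10;
```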

Note: on clusters running the newer Flink 1.13 version, users no longer need to select the built-in Connector manually.

Summary

This example uses the Datagen connector to generate random data, uses a tumbling window to compute each user's (user_id) total video clicks per minute (amount_total), and then writes the results to ClickHouse. For more time window examples, see the official Oceanus documentation on window functions [5].

Reference links

1. Oceanus console: https://console.cloud.tencent.com/oceanus/overview

2. Create an exclusive cluster: https://cloud.tencent.com/document/product/849/48298

3. ClickHouse console: https://console.cloud.tencent.com/cdwch?region=ap-guangzhou

4. ClickHouse quick start: https://cloud.tencent.com/document/product/1299/49824

5. Oceanus window function documentation: https://cloud.tencent.com/document/product/849/18077

Topics: flink