Index statistics: real time UVPV statistics based on flow computing Oceanus(Flink)

Posted by kula on Fri, 22 Oct 2021 07:29:47 +0200

Author: Wu Yuntao, senior engineer of Tencent CSIG

Introduction
Recently, I sorted out how to use Flink to realize real-time statistics of UV and PV indicators, and communicated with colleagues in the micro vision Department of the company. Then we simplified the scenario and found that it would be more convenient to use Flink SQL to realize the statistics of these indicators.

I. solution description

1.1 general

In this scheme, combined with the self built Kafka cluster, Tencent cloud computing Oceanus (Flink) and cloud database Redis, the UV and PV indicators of blogs, shopping and other websites are visually analyzed in real time. The analysis indicators include the number of independent visitors (UV), product clicks (PV), conversion rate (conversion rate = transaction times / clicks), etc.

Introduction to related concepts: UV (Unique Visitor): the number of independent visitors. A client visiting your website is a visitor. If a user visits the same page five times, the UV of the page will only be increased by 1, because UV counts the number of users after de duplication, not the number of visits. PV (Page View) : number of hits or page views. If the user visits the same page for 5 times, the PV of the page will be increased by 5.

1.2 scheme structure and advantages

According to the above real-time index statistics scenario, the following architecture diagram is designed:

List of products involved:

Self built Kafka cluster of local data center (IDC)
Private network VPC
Private line access / cloud networking / VPN connection / peer-to-peer connection
Flow calculation Oceanus (Flink)
Cloud database Redis

II. Pre preparation

Purchase the required Tencent cloud resources and get through the network. The self built Kafka cluster needs to adopt VPN connection, special line connection or peer-to-peer connection according to the region where the cluster is located to realize network interconnection.

2.1 create a private network VPC

The private network (VPC) is a logically isolated network space customized on Tencent cloud. When building services such as Oceanus cluster and Redis components, it is recommended to select the same VPC before the network can communicate. Otherwise, the network needs to be connected by means of peer-to-peer connection, NAT gateway and VPN. Please refer to the steps of creating a private network Help documentation.

2.2 create Oceanus cluster

Stream computing Oceanus is a powerful tool for real-time analysis of big data product ecosystem. It is an enterprise level real-time big data analysis platform based on Apache Flink with the characteristics of one-stop development, seamless connection, sub second delay, low cost, security and stability. Stream computing Oceanus aims to maximize the value of enterprise data and accelerate the construction process of real-time digitization of enterprises .

Create a cluster on the [cluster management] - [new cluster] page of Oceanus console, select region, availability area, VPC, log, storage, and set the initial password. The VPC and subnet use the newly created network. After creation, Flink's cluster is as follows:

2.3 create Redis cluster

stay Redis console On the [new instance] page of, create a cluster, select the same private network VPC in the same region and region as other components, and select the same subnet here.

2.4 configure self built Kafka cluster

2.4.1 modify self built Kafka cluster configuration

When connecting a self built Kafka cluster, the bootstrap servers parameter often uses hostname instead of ip to connect. However, the Oceanus cluster on Tencent cloud connected with a self built Kafka cluster is a fully managed cluster. The mapping relationship between the hostname and ip of the self built cluster cannot be resolved on the nodes of the Oceanus cluster, so it is necessary to change the listener address from hostname to ip address.

Configure the advertised.listeners parameter in the config/server.properties configuration file as the IP address. Example:

# 0.10.X and later
advertised.listeners=PLAINTEXT://10.1.0.10:9092
# Versions before 0.10.X
advertised.host.name=PLAINTEXT://10.1.0.10:9092

Restart the Kafka cluster after modification.

! if you use a self built zookeeper address on the cloud, you also need to change the hostname in zk configuration to the IP address form.

2.4.2 sending analog data to topic

In this case, topic is uvpv demo.

1) Kafka client

Enter the self built Kafka cluster node, start the Kafka client and simulate sending data.

./bin/kafka-console-producer.sh --broker-list 10.1.0.10:9092 --topic uvpv-demo
>{"record_type":0, "user_id": 2, "client_ip": "100.0.0.2", "product_id": 101, "create_time": "2021-09-08 16:20:00"}
>{"record_type":0, "user_id": 3, "client_ip": "100.0.0.3", "product_id": 101, "create_time": "2021-09-08 16:20:00"}
>{"record_type":1, "user_id": 2, "client_ip": "100.0.0.1", "product_id": 101, "create_time": "2021-09-08 16:20:00"}

2) Send using script

Script 1: Java code reference: https://cloud.tencent.com/document/product/597/54834

Script 2: Python script. Refer to the python script in the previous case and modify it appropriately: Real time large screen analysis of live video scene based on Tencent cloud Oceanus

2.5 get through the self built IDC cluster to Tencent cloud network communication

The self built Kafka cluster connects Tencent cloud network, and can get through the network communication from the self built IDC to Tencent cloud through the first three ways below.

Dedicated line access
It is applicable to the connection between Local Data Center IDC and Tencent cloud network.
Cloud networking
It is applicable to the connection between Local Data Center IDC and Tencent cloud network, as well as the connection between private network VPC in different regions on the cloud.
VPN connection
It is applicable to the connection between Local Data Center IDC and Tencent cloud network.
Peer to peer connection + NAT Gateway
It is suitable for VPC connection between private networks in different regions on the cloud, not for local IDC to Tencent cloud network.

In this scheme, VPN connection is used to realize the communication between local IDC and cloud network. Reference link: Establish VPC to IDC connection (routing table)

The following network architecture diagram is drawn according to the scheme:

III. scheme realization

3.1 business objectives

The flow calculation Oceanus is used to realize the real-time statistics of website UV, PV and conversion rate indicators. Here, only the following three statistical indicators are listed:

Number of unique visitors to the site. After processing, Oceanus stores the number of independent visitors in Redis through the set type, and also achieves the purpose of de duplication of the data of the same visitor.
The number of hits on the product page of the website. After processing, Oceanus uses the list type to store page hits in Redis.
Conversion rate (conversion rate = number of transactions / hits). After Oceanus processing, it can be stored in Redis with String.

3.2 source data format

Kafka topic: uvpv demo (browsing records)

field	type	meaning
record_type	int	Customer number
user_id	varchar	Customer ip address
client_ip	varchar	room number
product_id	Int	Time of entering the room
create_time	timestamp	Creation time

Kafka is stored internally in json format, and the data format is as follows:

# Browse records
{
  "record_type":0,  # 0 indicates browsing records
  "user_id": 6, 
  "client_ip": "100.0.0.6", 
  "product_id": 101, 
  "create_time": "2021-09-06 16:00:00"
}

# Purchase record
{
  "record_type":1, # 1 means purchase record
  "user_id": 6, 
  "client_ip": "100.0.0.8", 
  "product_id": 101, 
  "create_time": "2021-09-08 18:00:00"
}

3.3 writing Flink SQL jobs

In the example, the acquisition logic of UV, PV and conversion is realized and written to the Sink end.

1. Define Source

 CREATE TABLE `input_web_record` (
    `record_type` INT,
    `user_id` INT,
    `client_ip` VARCHAR,
    `product_id` INT,
    `create_time` TIMESTAMP,
    `times` AS create_time,
    WATERMARK FOR times AS times - INTERVAL '10' MINUTE 
 ) WITH (
     'connector' = 'kafka',   -- Optional 'kafka','kafka-0.11'. Pay attention to select the corresponding built-in  Connector
     'topic' = 'uvpv-demo',  
     'scan.startup.mode' = 'earliest-offset', 
     --'properties.bootstrap.servers' = '82.157.27.147:9092', 
     'properties.bootstrap.servers' = '10.1.0.10:9092',  
     'properties.group.id' = 'WebRecordGroup',  -- Required parameter, Be sure to specify Group ID
     'format' = 'json',
     'json.ignore-parse-errors' = 'true',     -- ignore JSON Structure Parsing exception
     'json.fail-on-missing-field' = 'false'   -- If set to true, If a missing field is encountered, an error will be reported and set to false The missing field is set to null
 );

2. Define Sink

-- UV sink
CREATE TABLE `output_uv` (  
 `userids`   STRING,
 `user_id` STRING
) WITH (
  'connector' = 'redis',          
  'command' = 'sadd',              -- Save using collection uv(Supported commands: set,lpush,sadd,hset,zadd)
  'nodes' = '192.28.28.217:6379',  -- redis Connection address,Cluster mode is used by multiple nodes'',''separate.
  -- 'additional-key' = '<key>',   -- Used to specify hset and zadd of key. hset,zadd Must be set.
  'password' = 'yourpassword'   
);

-- PV sink
CREATE TABLE `output_pv` (  
 `pagevisits`   STRING,
 `product_id` STRING,
 `hour_count` BIGINT
) WITH (
  'connector' = 'redis',          
  'command' = 'lpush',              -- Save using list pv(Supported commands: set,lpush,sadd,hset,zadd)
  'nodes' = '192.28.28.217:6379',   -- redis Connection address,Cluster mode is used by multiple nodes'',''separate.
  -- 'additional-key' = '<key>',    -- Used to specify hset and zadd of key. hset,zadd Must be set.
  'password' = 'yourpassword'   
);

-- Conversion rate sink
CREATE TABLE `output_conversion_rate` (  
 `conversion_rate`   STRING,
 `rate` STRING
) WITH (
  'connector' = 'redis',        
  'command' = 'set',              -- Save using list pv(Supported commands: set,lpush,sadd,hset,zadd)
  'nodes' = '192.28.28.217:6379', -- redis Connection address,Cluster mode is used by multiple nodes'',''separate.
  -- 'additional-key' = '<key>',  -- Used to specify hset and zadd of key. hset,zadd Must be set.
  'password' = 'yourpassword'   
);

3. Business logic

-- Processed UV Indicators, statistics of all time UV
INSERT INTO output_uv 
SELECT 
  'userids' AS `userids`,
  CAST(user_id AS string) AS user_id 
FROM input_web_record ;

-- Processed and obtained PV Indicators, statistics every 10 minutes PV
INSERT INTO output_pv 
SELECT 
  'pagevisits' AS pagevisits, 
  CAST(product_id AS string) AS product_id, 
  SUM(product_id) AS hour_count
FROM input_web_record WHERE record_type = 0 
GROUP BY 
  HOP(times, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE), 
  product_id, 
  user_id;

-- Process and obtain the conversion index, and count the conversion rate every 10 minutes
INSERT INTO output_conversion_rate 
SELECT 
  'conversion_rate' AS conversion_rate, 
  CAST( (((SELECT COUNT(1) FROM input_web_record WHERE record_type=0)*1.0)/SUM(a.product_id)) as string) 
FROM (SELECT * FROM input_web_record where record_type = 1) AS a
GROUP BY  
  HOP(times, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE), 
  product_id;

3.4 result verification

Usually, the statistics of UV and PV indicators will be displayed through the Web site. Here, for simplicity, it is directly displayed in the Redis console Log in to query:

userids: storing UV s

pagevisits: store PV

conversion_rate: storage conversion rate, that is, the number of purchased goods / total page hits.

IV. summary

Collect data through self built Kafka cluster, conduct field accumulation, window aggregation and other operations in real time in flow computing Oceanus (Flink), store the processed data in cloud database Redis, and count the UV, PV and other indicators refreshed in real time. In the design of Kafka json format, this scheme simplifies the browsing records and product purchase records in the same topic for ease of understanding. The focus is to show the whole scheme by opening up the network between self built IDC and Tencent cloud products.
For the ultra large-scale UV weight removal, micro vision colleagues used Redis hyperloglog to realize UV statistics. Compared with the direct use of set type, it has the advantage of minimal memory space occupation. See the link for details: https://cloud.tencent.com/developer/article/1889162 .

Stream computing Oceanus Limited second kill exclusive activity is popular ↓↓

Focus on Tencent public data, official account, technical exchange, latest activities, service exclusive Get~

Topics: flink

Programmer Think