Confluent Platform: ksqlDB real-time stream processing (quick start)

Posted by James25 on Mon, 31 Jan 2022 21:53:23 +0100

1. Introduction to Confluent Platform

  • Confluent Platform is a full-scale event streaming platform that enables you to easily access, store, and manage data as continuous, real-time streams. Built by the original creators of Apache Kafka®, Confluent expands Kafka's benefits with enterprise-grade features while removing the burden of Kafka management and monitoring
  • By integrating historical and real-time data into a single, central source of truth, Confluent makes it easy to build an entirely new category of modern, event-driven applications, gain a universal data pipeline, and unlock powerful new use cases with full scalability, performance, and reliability
  • Confluent Platform lets you focus on how to derive business value from your data rather than worrying about the underlying mechanics, such as how data is transported or integrated between disparate systems. Specifically, Confluent Platform simplifies connecting data sources to Kafka, building streaming applications, and securing, monitoring, and managing your Kafka infrastructure
  • At the core of Confluent Platform is Apache Kafka, the most popular open-source distributed streaming platform

Features

  • Connect to other data systems: Confluent provides connectors for Kafka, and the Kafka Connect API links Kafka to external systems such as databases, key-value stores, search indexes, and file systems.
  • ETL and cached view data (pre-computing query results to speed up reads): the ksqlDB component is Kafka's streaming SQL engine. It provides an easy-to-use yet powerful interactive SQL interface for stream processing on Kafka, with no need to write code in a programming language such as Java or Python. ksqlDB is scalable, elastic, fault-tolerant, and real-time. It supports a wide range of streaming operations, including data filtering, transformation, aggregation, joins, windowing, and sessionization (a session-window sketch follows this list).
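
For example, sessionization can be expressed directly in SQL. A minimal sketch follows; the clicks stream and its columns are hypothetical and only meant to illustrate the syntax:

# hypothetical clickstream stream; the backing topic is created automatically
ksql> CREATE STREAM clicks (userid VARCHAR KEY, pageid VARCHAR)
        WITH (kafka_topic='clicks', value_format='JSON', partitions=1);
# count clicks per user within sessions separated by 60 seconds of inactivity
ksql> SELECT userid, COUNT(*) AS clicks_in_session
        FROM clicks
        WINDOW SESSION (60 SECONDS)
        GROUP BY userid
        EMIT CHANGES;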

2. Rapid deployment: quick start

platform quickstart: https://docs.confluent.io/platform/current/quickstart/ce-quickstart.html

 Category                 | Command
--------------------------|--------------------------------
 Start the services       | confluent local services start
 Stop the services        | confluent local services stop
 Delete services and data | confluent local destroy

a. Unzip the installation package and start the services
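
The steps below assume the Confluent Platform 6.1.1 tarball has already been downloaded and extracted under /opt. A sketch of that step (the download URL follows Confluent's archive naming pattern and should be verified against the official download page):

# download and unpack Confluent Platform 6.1.1 (URL pattern assumed)
curl -O https://packages.confluent.io/archive/6.1/confluent-6.1.1.tar.gz
tar -xzf confluent-6.1.1.tar.gz -C /opt/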

#1. Unzip and configure environment variables
[root@c7-docker confluent-6.1.1]# export CONFLUENT_HOME=/opt/confluent-6.1.1/
[root@c7-docker confluent-6.1.1]# echo 'export CONFLUENT_HOME=/opt/confluent-6.1.1/' >>  /etc/profile
[root@c7-docker confluent-6.1.1]# export PATH=$PATH:$CONFLUENT_HOME/bin
[root@c7-docker confluent-6.1.1]# echo 'export PATH=$PATH:$CONFLUENT_HOME/bin' >> /etc/profile

#2. Install the kafka-connect-datagen connector
#connector doc:	      https://docs.confluent.io/home/connect/overview.html
#Plugin installation directory: /opt/confluent-6.1.1/share/confluent-hub-components
[root@c7-docker confluent-6.1.1]#  grep 'plugin.path' /opt/confluent-6.1.1/etc/ -r
/opt/confluent-6.1.1/etc/kafka/connect-standalone.properties:# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
/opt/confluent-6.1.1/etc/kafka/connect-standalone.properties:plugin.path=/usr/share/java,/opt/confluent-6.1.1/share/confluent-hub-components
/opt/confluent-6.1.1/etc/kafka/connect-distributed.properties:# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
/opt/confluent-6.1.1/etc/kafka/connect-distributed.properties:plugin.path=/usr/share/java,/opt/confluent-6.1.1/share/confluent-hub-components
/opt/confluent-6.1.1/etc/ksqldb/connect.properties:# plugin.path=
/opt/confluent-6.1.1/etc/schema-registry/connect-avro-standalone.properties:# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
/opt/confluent-6.1.1/etc/schema-registry/connect-avro-standalone.properties:plugin.path=share/java,/opt/confluent-6.1.1/share/confluent-hub-components
/opt/confluent-6.1.1/etc/schema-registry/connect-avro-distributed.properties:# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
/opt/confluent-6.1.1/etc/schema-registry/connect-avro-distributed.properties:plugin.path=share/java,/opt/confluent-6.1.1/share/confluent-hub-components
/opt/confluent-6.1.1/etc/kafka-connect-replicator/replicator-connect-distributed.properties:#plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors
/opt/confluent-6.1.1/etc/kafka-connect-replicator/replicator-connect-standalone.properties:#plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors

#Local compilation and installation: git clone https://github.com/confluentinc/kafka-connect-datagen.git
# git checkout v0.4.0
# mvn clean package
# confluent-hub install target/components/packages/confluentinc-kafka-connect-datagen-0.4.0.zip 
### Usage: confluent-hub install : install a component from either Confluent Hub or from a local file
[root@c7-docker confluent-6.1.1]# confluent-hub install --no-prompt confluentinc/kafka-connect-datagen:latest
Running in a "--no-prompt" mode
Implicit acceptance of the license below:
Apache License 2.0
https://www.apache.org/licenses/LICENSE-2.0
Downloading component Kafka Connect Datagen 0.5.0, provided by Confluent, Inc. from Confluent Hub and installing into /opt/confluent-6.1.1//share/confluent-hub-components
Adding installation directory to plugin path in the following files:
  /opt/confluent-6.1.1//etc/kafka/connect-distributed.properties
  /opt/confluent-6.1.1//etc/kafka/connect-standalone.properties
  /opt/confluent-6.1.1//etc/schema-registry/connect-avro-distributed.properties
  /opt/confluent-6.1.1//etc/schema-registry/connect-avro-standalone.properties
Completed
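
To confirm the connector actually landed in the plugin path, list the plugin directory (the installed directory name follows confluent-hub's owner-component convention and may differ slightly by version):

[root@c7-docker confluent-6.1.1]# ls /opt/confluent-6.1.1/share/confluent-hub-components/
confluentinc-kafka-connect-datagen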

#3. Modify the configuration files (ksqlDB's default listener is localhost:8088); otherwise, running SQL queries from a remote Control Center session at http://192.168.56.7:9021/ will report an error
[root@c7-docker confluent-6.1.1]# cd etc/
[root@c7-docker etc]# ls
cli                       confluent-control-center-fe  confluent-metadata-service  kafka                     ksqldb
confluent-common          confluent-hub-client         confluent-rebalancer        kafka-connect-replicator  rest-utils
confluent-control-center  confluent-kafka-mqtt         confluent-security          kafka-rest                schema-registry
[root@c7-docker etc]# grep ':8088' * -r
confluent-control-center/control-center.properties:#confluent.controlcenter.ksql.ksqlDB.url=http://localhost:8088
confluent-control-center/control-center.properties:confluent.controlcenter.ksql.ksqlDB.url=http://192.168.56.117:8088
confluent-control-center/control-center-minimal.properties:#confluent.controlcenter.ksql.ksqlDB.url=http://localhost:8088
confluent-control-center/control-center-minimal.properties:confluent.controlcenter.ksql.ksqlDB.url=http://192.168.56.117:8088
confluent-control-center/control-center-dev.properties:#confluent.controlcenter.ksql.ksqlDB.url=http://localhost:8088
confluent-control-center/control-center-dev.properties:confluent.controlcenter.ksql.ksqlDB.url=http://192.168.56.117:8088
confluent-control-center/control-center-production.properties:#confluent.controlcenter.ksql.ksqlDB.url=http://ksql:8088
confluent-control-center/control-center-production.properties:confluent.controlcenter.ksql.ksqlDB.url=http://192.168.56.117:8088
ksqldb/ksql-server.properties:#listeners=http://0.0.0.0:8088
ksqldb/ksql-server.properties:listeners=http://192.168.56.117:8088
ksqldb/ksql-server.properties:# listeners=http://[::]:8088
ksqldb/ksql-server.properties:# listeners=https://0.0.0.0:8088
ksqldb/ksql-production-server.properties:#listeners=http://0.0.0.0:8088
ksqldb/ksql-production-server.properties:listeners=http://192.168.56.117:8088
ksqldb/ksql-production-server.properties:# listeners=http://[::]:8088
ksqldb/ksql-production-server.properties:# listeners=https://0.0.0.0:8088
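
Instead of editing the files by hand, the same overrides can be appended from the shell. A minimal sketch (replace 192.168.56.117 with the IP of your own host):

# append the listener / ksqlDB URL overrides shown in the grep output above
[root@c7-docker etc]# echo 'listeners=http://192.168.56.117:8088' >> ksqldb/ksql-server.properties
[root@c7-docker etc]# echo 'confluent.controlcenter.ksql.ksqlDB.url=http://192.168.56.117:8088' >> confluent-control-center/control-center.properties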

#4. Start the services and check the logs
[root@c7-docker etc]# confluent local services start
The local commands are intended for a single-node development environment only,
NOT for production usage. https://docs.confluent.io/current/cli/index.html
Using CONFLUENT_CURRENT: /tmp/confluent.007829
Starting ZooKeeper
ZooKeeper is [UP]
Starting Kafka
Kafka is [UP]
Starting Schema Registry
Schema Registry is [UP]
Starting Kafka REST
Kafka REST is [UP]
Starting Connect
Connect is [UP]
Starting ksqlDB Server
ksqlDB Server is [UP]
Starting Control Center
Control Center is [UP]
[root@c7-docker etc]# ls /tmp/confluent.007829
connect  control-center  kafka  kafka-rest  ksql-server  schema-registry  zookeeper
#Data files and log files:
[root@c7-docker lib]# ls /tmp/confluent.007829/connect/
connect.properties  connect.stdout  data  logs
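
To double-check that every component is up, and to follow a component's log while debugging, the confluent CLI also has a status subcommand; a quick sketch using the Connect log shown above:

[root@c7-docker lib]# confluent local services status
[root@c7-docker lib]# tail -f /tmp/confluent.007829/connect/connect.stdout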

b. Control Center web UI operations: create topics and generate test data

  • Visit http://xxx:9021 and operate from the web page
  • Create the topics: pageviews and users
  • Use the installed kafka-connect-datagen connector to generate test data
#a. Generate test data for the pageviews topic in AVRO format:
{
  "name": "datagen-pageviews",
  "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "kafka.topic": "pageviews",
  "max.interval": "100",
  "quickstart": "pageviews"
}

#b. Generate test data for the users topic in AVRO format:
{
  "name": "datagen-users",
  "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "kafka.topic": "users",
  "max.interval": "1000",
  "quickstart": "users"
}
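
If you prefer not to click through the Control Center UI, the same two connectors can be created via the Kafka Connect REST API. This sketch assumes Connect is listening on its default port 8083; the config is identical to the JSON above, with the connector name moved to the top level as the REST API expects:

# create the pageviews datagen connector through the Connect REST API
curl -s -X POST -H 'Content-Type: application/json' http://localhost:8083/connectors -d '{
  "name": "datagen-pageviews",
  "config": {
    "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "kafka.topic": "pageviews",
    "max.interval": "100",
    "quickstart": "pageviews"
  }
}'
# verify the connector is RUNNING (repeat the POST with the datagen-users config for the users topic)
curl -s http://localhost:8083/connectors/datagen-pageviews/status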

c. Using ksqlDB: view data and create tables or streams

ksql doc : https://docs.ksqldb.io/en/latest/concepts/streams/

  • Operating ksqlDB from the web page is equivalent to running it from the command line: ksql http://localhost:8088
######################### Create a table/stream using ksqlDB: 
#1. Create a stream (the underlying Kafka topic is created automatically if it doesn't exist)
# ksql> CREATE STREAM riderLocations (profileId VARCHAR, latitude DOUBLE, longitude DOUBLE)
#       WITH (kafka_topic='locations', value_format='json', partitions=1);

#Parameter Description:
#kafka_topic - 
#	Name of the Kafka topic underlying the stream. In this case it will be automatically created because it doesn't exist yet, but streams may also be created over topics that already exist.
#value_format - 
#	Encoding of the messages stored in the Kafka topic. For JSON encoding, each row will be stored as a JSON object whose keys/values are column names/values.
#partitions - 
#	Number of partitions to create for the locations topic. Note that this parameter is not needed for topics that already exist.

#2. Run a push query over the stream: match coordinates within 5 miles of the given point
# SELECT * FROM riderLocations WHERE GEO_DISTANCE(latitude, longitude, 37.4133, -122.1162) <= 5 EMIT CHANGES;
#   Then open a new session, run the inserts below, and watch the rows appear in the push query in real time
# INSERT INTO riderLocations (profileId, latitude, longitude) VALUES ('c2309eec', 37.7877, -122.4205);
# INSERT INTO riderLocations (profileId, latitude, longitude) VALUES ('18f4ea86', 37.3903, -122.0643);

# ksql>  create stream users( rowkey int key, username varchar) with( KAFKA_TOPIC='users',VALUE_FORMAT='JSON');
# Message
#----------------
# Stream created
#----------------
#ksql> insert into users(username) values('a');
#ksql> insert into users(username) values('b');
#ksql> select 'hello,'+ username as greeting from users emit changes;
#+----------------------------------------------------------------------------------------------------------------------------------------------------------+
#|GREETING                                                                                                                                                  |
#+----------------------------------------------------------------------------------------------------------------------------------------------------------+
#|hello,b                                                                                                                                                   |
#|hello,a                                                                                                                                                   |

# a stream for the pageviews topic :  
ksql> CREATE STREAM pageviews WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='AVRO');

# a table for the users topic:        
ksql> CREATE TABLE  users (id VARCHAR PRIMARY KEY) WITH (KAFKA_TOPIC='users', VALUE_FORMAT='AVRO');

ksql> set 'auto.offset.reset'='earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql>  show topics;
 Kafka Topic                 | Partitions | Partition Replicas
---------------------------------------------------------------
 pageviews                   | 1          | 1
 users                       | 1          | 1
---------------------------------------------------------------
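
Before querying, it can help to confirm that the stream and table were registered. SHOW STREAMS, SHOW TABLES, and DESCRIBE are standard ksqlDB statements:

ksql> SHOW STREAMS;
ksql> SHOW TABLES;
ksql> DESCRIBE pageviews;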

######################### 1. Non-persistent (transient) query
ksql> SELECT * FROM pageviews EMIT CHANGES LIMIT 1;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|VIEWTIME                                          |USERID                                            |PAGEID                                            |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|129321                                            |User_1                                            |Page_40                                           |
Limit Reached
Query terminated


ksql> SELECT * from  users EMIT CHANGES LIMIT 1;
+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
|ID                           |REGISTERTIME                 |USERID                       |REGIONID                     |GENDER                       |
+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
|User_5                       |1489800608800                |User_5                       |Region_5                     |MALE                         |
Limit Reached
Query terminated

######################### 2. Persistent query (as a stream): filter female users from the pageviews stream and save the results to the pageviews_female topic
ksql> SELECT users.id AS userid, pageid, regionid
  FROM pageviews LEFT JOIN users ON pageviews.userid = users.id
  WHERE gender = 'FEMALE'
  EMIT CHANGES LIMIT 1;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|USERID                                            |PAGEID                                            |REGIONID                                          |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|User_3                                            |Page_89                                           |Region_7                                          |
Limit Reached
Query terminated


ksql> CREATE STREAM pageviews_female AS 
  SELECT users.id AS userid, pageid, regionid
  FROM pageviews LEFT JOIN users ON pageviews.userid = users.id
  WHERE gender = 'FEMALE'
  EMIT CHANGES;

ksql> SELECT * from  pageviews_female EMIT CHANGES LIMIT 1;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|USERID                                            |PAGEID                                            |REGIONID                                          |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|User_8                                            |Page_97                                           |Region_4                                          |
Limit Reached
Query terminated

######################### 3. Persistent query: filter rows whose regionid ends in 8 or 9, and save the results to the pageviews_enriched_r8_r9 topic
ksql>  SELECT * FROM pageviews_female
  WHERE regionid LIKE '%_8' OR regionid LIKE '%_9'
  EMIT CHANGES LIMIT 1;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|USERID                                            |PAGEID                                            |REGIONID                                          |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|User_9                                            |Page_95                                           |Region_8                                          |
Limit Reached
Query terminated


ksql> CREATE STREAM pageviews_female_like_89
  WITH (KAFKA_TOPIC='pageviews_enriched_r8_r9', VALUE_FORMAT='AVRO')
  AS SELECT * FROM pageviews_female
  WHERE regionid LIKE '%_8' OR regionid LIKE '%_9'
  EMIT CHANGES;

ksql> SELECT * from  pageviews_female_like_89 EMIT CHANGES LIMIT 1;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|USERID                                            |PAGEID                                            |REGIONID                                          |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|User_8                                            |Page_67                                           |Region_8                                          |
Limit Reached
Query terminated
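
Because the WITH clause above names the backing topic explicitly, you can also inspect the raw records written to it. PRINT is a standard ksqlDB statement (the exact output layout depends on the Avro schema):

ksql> PRINT 'pageviews_enriched_r8_r9' FROM BEGINNING LIMIT 1;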

######################### 4. Persistent query: count page views per gender and regionid over a 30-second tumbling window, keep only counts > 1, and save the result as a table (topic pageviews_regions)
ksql> SELECT gender, regionid, COUNT(*) AS numusers
  FROM pageviews LEFT JOIN users ON pageviews.userid = users.id
  WINDOW TUMBLING (SIZE 30 SECOND)
  GROUP BY gender, regionid
  HAVING COUNT(*) > 1 EMIT CHANGES LIMIT 1;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|GENDER                                            |REGIONID                                          |NUMUSERS                                          |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|FEMALE                                            |Region_6                                          |2                                                 |
Limit Reached
Query terminated


ksql> CREATE TABLE pageviews_regions WITH (KEY_FORMAT='JSON')
  AS SELECT gender, regionid, COUNT(*) AS numusers
  FROM pageviews LEFT JOIN users ON pageviews.userid = users.id
  WINDOW TUMBLING (SIZE 30 SECOND)
  GROUP BY gender, regionid
  HAVING COUNT(*) > 1
  EMIT CHANGES;

ksql> SELECT * from  pageviews_regions EMIT CHANGES LIMIT 1;
+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|KSQL_COL_0                           |WINDOWSTART                          |WINDOWEND                            |NUMUSERS                             |
+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|OTHER|+|Region_3                     |1623913830000                        |1623913860000                        |3                                    |
Limit Reached
Query terminated
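
Each CREATE ... AS SELECT statement above left a persistent query running in the background. To clean up, list the queries and terminate them before dropping the objects; the query id below is only illustrative, take the real one from the SHOW QUERIES output:

ksql> SHOW QUERIES;
ksql> TERMINATE CTAS_PAGEVIEWS_REGIONS_5;
ksql> DROP TABLE pageviews_regions;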
