Flink tutorial (26) - Flink multilingual development

Posted by funguse on Wed, 09 Mar 2022 02:43:43 +0100

01 introduction

The previous post in this series covered Flink's advanced features.

This article mainly explains Flink multilingual development.

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/scala_api_extensions.html

02 Scala-Flink

2.1 requirements

Use Flink to receive the e-commerce click stream log data from Kafka and process it in real time:

  1. Data preprocessing: widen the data, that is, turn it into a wide table for subsequent analysis
  2. Analyze real-time channel hotspots
  3. Analyze real-time channel PV/UV

2.2 preparation


List the topics:
/export/servers/kafka/bin/kafka-topics.sh --list --zookeeper node01:2181

Create the topic:
/export/servers/kafka/bin/kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 2 --partitions 3 --topic pyg

List the topics again:
/export/servers/kafka/bin/kafka-topics.sh --list --zookeeper node01:2181

Start a console consumer:
/export/servers/kafka/bin/kafka-console-consumer.sh --bootstrap-server node01:9092 --from-beginning --topic pyg

Delete the topic (not required):
/export/servers/kafka/bin/kafka-topics.sh --delete --zookeeper node01:2181 --topic pyg

Import the prepared skeleton code.

2.3 code implementation

2.3.1 entry class - Data Analysis

object App {
  def main(args: Array[String]): Unit = {
    //Note: TODO indicates that this step has not been completed during development and needs to be completed later
    //Here is just to use different colors to distinguish the steps
    //TODO 1. Preparing the StreamExecutionEnvironment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //TODO 2. Set environment parameters (Checkpoint / restart policy / whether to use event time...)
    //=================Recommended settings===================
    //Set the state backend for checkpoint state to FsStateBackend: use a local path for local tests and the HDFS path passed as an argument for cluster tests
    //(checking args.isEmpty is an assumed way to tell a local run from a cluster run)
    if (args.isEmpty) {
      env.setStateBackend(new FsStateBackend("file:///D:/ckp"))
    } else {
      env.setStateBackend(new FsStateBackend(args(0))) //Passed in for cluster tests, e.g. hdfs://node01:8020/flink-checkpoint/checkpoint
    }
    //Trigger a Checkpoint every 1000ms. More frequent checkpoints leave less data to replay on recovery, but every Checkpoint costs some IO.
    env.enableCheckpointing(1000) //(checkpointing is disabled unless an interval is set)
    //Set the minimum pause between two checkpoints to 500ms, so that a slow Checkpoint does not run straight into the next one when checkpointing every 1000ms
    //Analogy: a toll gate releases one vehicle every 1s, but two vehicles must stay at least 500m apart
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500) //The default is 0
    //Set whether the whole job fails when an error occurs during a Checkpoint: true fails the job, false does not
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false) //The default is true
    //Set whether the externalized checkpoint is kept when the job is cancelled. By default it is deleted on cancellation
    //ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: the externalized checkpoint is deleted when the job is cancelled
    //ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: the externalized checkpoint is retained when the job is cancelled
    env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    //=================Recommended settings===================

    //=================The defaults are fine here===============
    //Set the checkpoint mode to EXACTLY_ONCE (the default). Note: end-to-end exactly-once also needs support from the Source and Sink
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    //Set the Checkpoint timeout: a Checkpoint that has not completed within 60s is considered failed and is discarded
    env.getCheckpointConfig.setCheckpointTimeout(60000) //The default is 10 minutes
    //Set how many checkpoints may run at the same time
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1) //The default is 1
    //=================The defaults are fine here===============

    //======================Configure restart strategy==============
    //1. If checkpointing is enabled and no restart strategy is configured, the job restarts indefinitely when a non-fatal error occurs
    //2. No restart
    //env.setRestartStrategy(RestartStrategies.noRestart())
    //3. Fixed-delay restart strategy -- used in development
    //As follows: on failure, restart every 10s, up to 3 times
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
      3, // Up to 3 restarts
      org.apache.flink.api.common.time.Time.of(10, TimeUnit.SECONDS) // Restart interval
    ))
    //4. Failure-rate restart strategy -- occasionally used in development
    //As follows: at most 3 failures within 5 minutes, with 10 seconds between restarts
    //env.setRestartStrategy(RestartStrategies.failureRateRestart(
    //  3, // Maximum number of failures per measurement interval
    //  Time.of(5, TimeUnit.MINUTES), //Time interval for failure rate measurement
    //  Time.of(10, TimeUnit.SECONDS) // Time interval between two consecutive restarts
    //))
    //======================Configure restart strategy==============

    //TODO 3.Source-Kafka
    val topic: String = "pyg"
    val schema = new SimpleStringSchema()
    val props:Properties = new Properties()
    props.setProperty("bootstrap.servers","node01:9092")//Kafka broker address, the same one the console consumer above connects to
    props.setProperty("group.id","flink")//Consumer group id (placeholder name, choose your own)
    props.setProperty("auto.offset.reset","latest")//If an offset has been recorded, consume from it; otherwise consume from the latest data
    props.setProperty("flink.partition-discovery.interval-millis","5000")//Dynamic partition discovery: a background thread checks the Kafka partitions every 5s

    val kafkaSource: FlinkKafkaConsumer[String] = new FlinkKafkaConsumer[String](topic,schema,props)
    kafkaSource.setCommitOffsetsOnCheckpoints(true)//When the Checkpoint is executed, the offset will be submitted (one in the Checkpoint and one in the default topic)

    val jsonStrDS: DataStream[String] = env.addSource(kafkaSource)
    // {"count":1,"message":"{\"browserType\":\"Firefox\",\"categoryID\":20,\"channelID\":20,\"city\":\"ZhengZhou\",\"country\":\"china\",\"entryTime\":1577898060000,\"leaveTime\":1577898060000,\"network\":\"Telecom\",\"produceID\":15,\"province\":\"HeBei\",\"source\":\"direct input\",\"userID\":2}","timeStamp":1598754734031}

    //TODO 4. Parse the jsonStr data into the case class Message
    val messageDS: DataStream[Message] = jsonStrDS.map(jsonStr => {
      val jsonObj: JSONObject = JSON.parseObject(jsonStr)
      val count: lang.Long = jsonObj.getLong("count")
      val timeStamp: lang.Long = jsonObj.getLong("timeStamp")
      val clickLogStr: String = jsonObj.getString("message")
      val clickLog: ClickLog = JSON.parseObject(clickLogStr, classOf[ClickLog])
      Message(clickLog, count, timeStamp)
      //The following shortcut cannot be used, because the "message" field is itself a JSON string rather than a nested object
      //val message: Message = JSON.parseObject(jsonStr,classOf[Message])
    })
    //Message(ClickLog(10,10,3,china,HeBei,ZhengZhou, telecom, 360 search jump, Google browser, 15778776460000157789806000,15), 11598754740100)

    //TODO 5. Assign timestamps and watermarks to the data (this can also be done in step 6)
    val watermakerDS: DataStream[Message] = messageDS.assignTimestampsAndWatermarks(
      //Tolerate events that arrive up to 5s out of order
      new BoundedOutOfOrdernessTimestampExtractor[Message](org.apache.flink.streaming.api.windowing.time.Time.seconds(5)) {
        override def extractTimestamp(element: Message): Long = element.timeStamp
      }
    )
    //TODO 6. Data preprocessing
    //To make the later metric calculations easier, preprocess the Message parsed above, for example by widening it with extra fields
    //The preprocessing code can be written inline here, extracted into a method, or extracted into a method of a separate object
    //Convert DataStream[Message] to DataStream[ClickLogWide]
    val clickLogWideDS: DataStream[ClickLogWide] = ProcessTask.process(watermakerDS)
    //Clicklogwide (18,9,10, China, Henan, Luyang, mobile, baidu jump, Google browser, 1577887260001, 577898060000,15,11598758614216, China Henan Luyang, 20200820200830202020083081011,0,0,0,0,0)

    //TODO 7. Real-time metric analysis - sink the results directly to HBase
    //Real-time metric analysis - real-time channel hotspots
    ChannelRealHotTask.process(clickLogWideDS)
    //Real-time metric analysis - real-time channel PV/UV per time period
    ChannelRealPvUvTask.process(clickLogWideDS)

    //TODO 8.execute
    env.execute()
  }
}
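
The interplay between the 1000 ms checkpoint interval and the 500 ms minimum pause configured in the entry class can be illustrated with a small simulation. This is a deliberately simplified model in plain Python, not Flink's actual scheduler: it assumes that a checkpoint is due `interval_ms` after the previous one started, but never starts earlier than `min_pause_ms` after the previous one finished. The function name and the checkpoint durations are made up for illustration.

```python
def checkpoint_start_times(durations_ms, interval_ms=1000, min_pause_ms=500):
    """Return the start time of each checkpoint under a simplified model.

    durations_ms: how long each checkpoint takes (hypothetical values).
    A checkpoint is due interval_ms after the previous start, but it is
    delayed until at least min_pause_ms after the previous one finished.
    """
    starts = []
    next_due = 0
    prev_end = None
    for duration in durations_ms:
        start = next_due
        if prev_end is not None:
            start = max(start, prev_end + min_pause_ms)
        starts.append(start)
        prev_end = start + duration
        next_due = start + interval_ms
    return starts


# Fast checkpoints: the 1000 ms interval alone decides the schedule.
print(checkpoint_start_times([100, 100, 100]))  # [0, 1000, 2000]
# A slow second checkpoint (900 ms) pushes the third one back to 2400 ms.
print(checkpoint_start_times([100, 900, 100]))  # [0, 1000, 2400]
```

In this model the minimum pause only matters when a checkpoint takes longer than the interval minus the pause, which matches the toll-gate analogy in the code comments.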

2.3.2 data preprocessing

To make the later analysis easier, we use Flink to preprocess the click stream log in real time: a few fields are added to the original click stream log so that the downstream business metrics are simpler to compute.

The original click stream log consumed from Kafka has the following fields:

Field name     Description
channelID      channel ID
categoryID     product category ID
produceID      product ID
country        country
city           city
network        network type
source         traffic source
browserType    browser type
entryTime      time of entering the website
leaveTime      time of leaving the website
userID         user ID
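
As a rough illustration of how these fields arrive, the sketch below parses one record in plain Python. It assumes the record layout shown in the sample comment of the entry class: an outer JSON object whose `message` field is itself a JSON string holding the click log. The sample values and the `parse_message` helper are illustrative only.

```python
import json

# A sample record shaped like the Kafka messages in this tutorial.
# Note that "message" is a JSON *string*, not a nested object.
inner = {"channelID": 20, "categoryID": 20, "produceID": 15,
         "country": "china", "province": "HeBei", "city": "ZhengZhou",
         "network": "Telecom", "source": "direct input",
         "browserType": "Firefox", "entryTime": 1577898060000,
         "leaveTime": 1577898060000, "userID": 2}
raw = json.dumps({"count": 1,
                  "message": json.dumps(inner),
                  "timeStamp": 1598754734031})


def parse_message(json_str):
    """Parse the outer record, then parse the embedded click log string."""
    outer = json.loads(json_str)
    click_log = json.loads(outer["message"])  # second parse is required
    return {"clickLog": click_log,
            "count": outer["count"],
            "timeStamp": outer["timeStamp"]}


msg = parse_message(raw)
print(msg["clickLog"]["channelID"], msg["count"], msg["timeStamp"])
```

The two-step parse is the same reason the Scala code calls `JSON.parseObject` twice instead of mapping the whole string onto `Message` in one go.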

On top of the original click stream log fields, we need to add the following fields:

Field name     Description
count          number of user visits
timestamp      user access time
address        country, province and city (concatenated)
isNew          whether the user is new to the channel
isHourNew      whether the user is new to the channel within the hour
isDayNew       whether the user is new to the channel on that day
isMonthNew     whether the user is new to the channel in that month
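
The derived address and time-bucket values can be sketched in plain Python as follows. This is only an approximation of what the preprocessing does: the `format_time` helper stands in for the tutorial's `TimeUtil.parseTime`, and formatting in UTC is an assumption made here so that the result does not depend on the local time zone.

```python
from datetime import datetime, timezone


def format_time(timestamp_ms, pattern):
    """Format an epoch-millisecond timestamp in UTC (assumed time zone)."""
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return dt.strftime(pattern)


def widen(click_log, timestamp_ms):
    """Return the derived address and time-bucket fields for one log."""
    return {
        "address": click_log["country"] + click_log["province"] + click_log["city"],
        "yearMonth": format_time(timestamp_ms, "%Y%m"),
        "yearMonthDay": format_time(timestamp_ms, "%Y%m%d"),
        "yearMonthDayHour": format_time(timestamp_ms, "%Y%m%d%H"),
    }


sample = {"country": "china", "province": "HeBei", "city": "ZhengZhou"}
print(widen(sample, 1598754734031))
```

For the sample timestamp 1598754734031 (2020-08-30 02:32:14 UTC) the three buckets are `202008`, `20200830` and `2020083002`.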

The values of the last four fields cannot be calculated from the click stream log alone. A history table in HBase has to store each user's previous access to each channel, so that the current visit can be compared with it.
The structure of the user_history table is as follows:

Column name      Description                     Example
rowkey           user ID:channel ID              10:220
userid           user ID                         10
channelid        channel ID                      220
lastVisitTime    last access time (timestamp)    1553653555
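
The lookup-and-compare logic behind the four flags can be sketched in plain Python, with a dictionary standing in for the HBase table. This is a sketch under stated assumptions rather than the tutorial's implementation: the comparison rule (a flag is 1 when the formatted current time is later than the formatted last visit time) is a guess at what `TimeUtil.compareDate` does, and times are formatted in UTC.

```python
from datetime import datetime, timezone


def _fmt(ts_ms, pattern):
    """Format an epoch-millisecond timestamp in UTC (assumed time zone)."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime(pattern)


def is_new_flags(history, user_id, channel_id, ts_ms):
    """Return (isNew, isHourNew, isDayNew, isMonthNew) and update the history.

    history: dict mapping "userID:channelID" to the last visit time in ms,
    a stand-in for the user_history table.
    """
    rowkey = f"{user_id}:{channel_id}"
    last = history.get(rowkey)
    if last is None:
        # No previous visit recorded: the user is new in every sense.
        flags = (1, 1, 1, 1)
    else:
        def newer(pattern):
            # Zero-padded digit strings compare correctly as text.
            return 1 if _fmt(ts_ms, pattern) > _fmt(last, pattern) else 0
        flags = (0, newer("%Y%m%d%H"), newer("%Y%m%d"), newer("%Y%m"))
    history[rowkey] = ts_ms  # remember this visit as the new last visit
    return flags


history = {}
t0 = 1598754734031  # 2020-08-30 02:32:14 UTC
print(is_new_flags(history, 10, 220, t0))              # first visit
print(is_new_flags(history, 10, 220, t0 + 60_000))     # same hour
print(is_new_flags(history, 10, 220, t0 + 3_600_000))  # next hour
```

Under this comparison rule a visit on a later day also counts as a new hour, since the hour bucket includes the date.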

/**
 * Author itcast
 * Desc Business tasks of data preprocessing module
 */
object ProcessTask {
  //Convert the watermarked user behavior log into the wide table ClickLogWide and return it
  //Convert DataStream[Message] to DataStream[ClickLogWide]
  def process(watermakerDS: DataStream[Message]): DataStream[ClickLogWide] = {
    import org.apache.flink.api.scala._
    val clickLogWideDS: DataStream[ClickLogWide] = watermakerDS.map(message => {
      val address: String = message.clickLog.country + message.clickLog.province + message.clickLog.city
      val yearMonth: String = TimeUtil.parseTime(message.timeStamp, "yyyyMM")
      val yearMonthDay: String = TimeUtil.parseTime(message.timeStamp, "yyyyMMdd")
      val yearMonthDayHour: String = TimeUtil.parseTime(message.timeStamp, "yyyyMMddHH")

      val (isNew, isHourNew, isDayNew, isMonthNew) = getIsNew(message)

      val clickLogWide = ClickLogWide(
        //Original click log fields, in the order of the ClickLogWide sample output in the entry class
        message.clickLog.channelID,
        message.clickLog.categoryID,
        message.clickLog.produceID,
        message.clickLog.country,
        message.clickLog.province,
        message.clickLog.city,
        message.clickLog.network,
        message.clickLog.source,
        message.clickLog.browserType,
        message.clickLog.entryTime,
        message.clickLog.leaveTime,
        message.clickLog.userID,

        message.count, //Number of user visits
        message.timeStamp, //User access time

        address, //Country, province and city, concatenated
        yearMonth, //Year and month
        yearMonthDay, //Year, month and day
        yearMonthDayHour, //Year, month, day and hour

        isNew, //Whether the user is new to the channel - 0 means no, 1 means yes
        isHourNew, //Whether the user is new to the channel within the hour - 0 means no, 1 means yes
        isDayNew, //Whether the user is new to the channel on that day - 0 means no, 1 means yes
        isMonthNew //Whether the user is new to the channel in that month - 0 means no, 1 means yes
      )
      clickLogWide
    })
    clickLogWideDS
  }

  /*For example, a user accesses this channel for the first time on August 30, 2020:
  for this log, isNew, isHourNew, isDayNew and isMonthNew are all 1.
  If the user visits again on August 30, 2020, or on August 31, 2020,
  isNew is 0 for those logs and the other flags depend on how the visit time compares with the last visit time.
  */
  def getIsNew(msg: Message):(Int,Int,Int,Int) = {
    var isNew: Int = 0 //Is it a new user accessing a channel - 0 means no, 1 means yes
    var isHourNew: Int = 0 //Whether you are a new user of a channel within a certain hour - 0 means no, 1 means yes
    var isDayNew: Int = 0 //Whether you are a new user of a channel on a certain day - 0 means no, 1 means yes
    var isMonthNew: Int = 0//Whether you are a new user of a channel in a certain month - 0 means no, 1 means yes

    //How do we decide each isXxxNew flag of the user for the channel?
    //Record the time of the user's last visit to the channel in an external store, such as HBase
    //For each incoming log, look up the lastVisitTime of that user and channel in HBase
    //No result -- all isXxxNew flags are 1
    //A result -- compare this visit time with lastVisitTime

    //1. Define some HBase constants: table name, column family name and column name
    val tableName = "user_history"
    val columnFamily = "info"
    val rowkey = msg.clickLog.userID + ":" + msg.clickLog.channelID
    val queryColumn = "lastVisitTime"

    //2. Look up lastVisitTime by the user:channel rowkey
    //Note: remember to change the host name in resources/hbase-site.xml, and HBase has to be running
    val lastVisitTime: String = HBaseUtil.getData(tableName,rowkey,columnFamily,queryColumn)

    //3. Check whether lastVisitTime has a value
    if (StringUtils.isBlank(lastVisitTime)) {
      //If lastVisitTime is empty, the user has not accessed the channel before, so every flag is set to 1
      isNew = 1
      isHourNew = 1
      isDayNew = 1
      isMonthNew = 1
    } else {
      //If lastVisitTime is not empty, the user has accessed the channel before, and each isXxxNew flag is set case by case
      //For example, if lastVisitTime is 2020-08-30-11 and the current visit time is 2020-08-30-12, isHourNew=1 and others are 0
      //For example, if lastVisitTime is 2020-08-30 and the current access time is 2020-08-31, isDayNew=1 and others are 0
      //For example, if lastVisitTime is 2020-08 and the current access time is 2020-09, isMonthNew=1 and others are 0
      isNew = 0
      isHourNew = TimeUtil.compareDate(msg.timeStamp,lastVisitTime.toLong,"yyyyMMddHH")
      isDayNew = TimeUtil.compareDate(msg.timeStamp,lastVisitTime.toLong,"yyyyMMdd")
      isMonthNew = TimeUtil.compareDate(msg.timeStamp,lastVisitTime.toLong,"yyyyMM")
    }
    //Don't forget to save this visit time into HBase as lastVisitTime
    HBaseUtil.putData(tableName,rowkey,columnFamily,queryColumn,msg.timeStamp.toString)

    (isNew, isHourNew, isDayNew, isMonthNew)
  }
}


Note:
Start HBase before testing, then log in to the HBase shell:
./hbase shell
List the HBase tables:
list
After the job has run, the table is created. View its data with:
scan "user_history",{LIMIT=>10}

2.3.3 real time channel hotspot

A channel hotspot is the number of times each channel is accessed (clicked).

The counts are cumulative, so every new result has to be added to the historical click count.
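
The accumulation step can be sketched in plain Python, with a dictionary standing in for the HBase table. The `merge_with_history` name is hypothetical, and the counts are stored as strings only to mimic how the tutorial reads and writes HBase cells.

```python
def merge_with_history(store, channel_id, window_count):
    """Add one window's count to the stored total and return the new total.

    store: dict mapping channel ID to the accumulated count as a string,
    a stand-in for the visited column of the HBase table.
    """
    history = store.get(channel_id)
    if not history:
        # Nothing stored yet: this window's count is the total.
        total = window_count
    else:
        total = window_count + int(history)
    store[channel_id] = str(total)
    return total


store = {}
print(merge_with_history(store, "20", 3))  # 3
print(merge_with_history(store, "20", 5))  # 8
print(merge_with_history(store, "7", 1))   # 1
```

A read followed by a write like this is not atomic, so it only gives correct totals when a single writer updates a given channel at a time.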

object ChannelRealHotTask {

  //Define a sample class to encapsulate the channel id and access times
  case class ChannelRealHot(channelId: String, visited: Long)

  //According to the incoming user behavior log width table, statistical analysis of channel access times is carried out, and the results are saved to HBase
  def process(clickLogWideDS: DataStream[ClickLogWide]) = {
    import org.apache.flink.api.scala._
    //1. Extract the channelID and count fields we need and wrap them in the case class
    val result: DataStream[ChannelRealHot] = clickLogWideDS
      .map(clickLogWide => {
        ChannelRealHot(clickLogWide.channelID, clickLogWide.count)
      })
      //2. Group by channel
      .keyBy(_.channelId)
      //3. Window
      //timeWindow(size: Time) or timeWindow(size: Time, slide: Time)
      //Requirement: count the accesses of each channel every 10s
      .timeWindow(Time.seconds(10))
      //4. Aggregate
      .reduce((c1, c2) => {
        ChannelRealHot(c2.channelId, c1.visited + c2.visited)
      })
    //5. Store the results in HBase
    result.addSink(new SinkFunction[ChannelRealHot] {
      override def invoke(value: ChannelRealHot, context: SinkFunction.Context): Unit = {
        //Call HBaseUtil here to save each result (the number of accesses per channel) to HBase
        //-1. Check the last access times of HBase channel first
        val tableName = "channel_realhot"
        val columnFamily = "info"
        val queryColumn = "visited"
        val rowkey = value.channelId
        val historyValueStr: String = HBaseUtil.getData(tableName, rowkey, columnFamily, queryColumn)

        var currentFinalResult = 0L

        //-2. Merge with the historical value
        if (StringUtils.isBlank(historyValueStr)) {
          //If historyValueStr is empty, this window's count is the final result
          currentFinalResult = value.visited
        } else {
          //If historyValueStr is not empty, this window's count plus the historical value is the final result
          currentFinalResult = value.visited + historyValueStr.toLong
        }

        //-3. Save the final result
        HBaseUtil.putData(tableName, rowkey, columnFamily, queryColumn, currentFinalResult.toString)
      }
    })
  }
}

2.3.4 real time channel PV/UV

PV (Page View) is the number of page views: every page refresh counts once.
UV (Unique Visitor) is the number of distinct visitors: the same client is counted only once within the specified period.
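
The difference between the two metrics can be shown with a small plain-Python sketch over a list of (channel, user) visits for one time period. The `pv_uv` helper and the sample visits are illustrative. The Flink job in this section takes a different route to UV: it sums the isHourNew, isDayNew and isMonthNew flags instead of keeping a set of users.

```python
def pv_uv(visits):
    """Return {channel: (pv, uv)} for the visits of one time period.

    visits: iterable of (channel_id, user_id) pairs.
    """
    page_views = {}
    users = {}
    for channel_id, user_id in visits:
        page_views[channel_id] = page_views.get(channel_id, 0) + 1
        users.setdefault(channel_id, set()).add(user_id)
    return {ch: (page_views[ch], len(users[ch])) for ch in page_views}


visits = [("20", 1), ("20", 1), ("20", 2), ("7", 1)]
print(pv_uv(visits))  # {'20': (3, 2), '7': (1, 1)}
```

User 1 refreshes channel 20 twice, so PV for that channel is 3 while UV is only 2.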

The data obtained after statistical analysis are as follows:

object ChannelRealPvUvTask {

  case class ChannelRealPvUv(channelId: String, monthDayHour: String, pv: Long, uv: Long)

  def process(clickLogWideDS: DataStream[ClickLogWide]) = {
    import org.apache.flink.api.scala._
    //Note:
    // Each wide-table log carries three time fields: yearMonth, yearMonthDay and yearMonthDayHour.
    // To compute PV/UV per month, per day and per hour, each log is turned into three records, one per field.
    // That is, every incoming record becomes 3 records, which is what flatMap is for.
    //Analogy:
    //Zhang San, Changping, Beijing, China
    // -->
    //China, Zhang San
    //Zhang San, Beijing, China
    //Zhang San, Changping, Beijing, China

    //1. Data conversion
    val result: DataStream[ChannelRealPvUv] = clickLogWideDS.flatMap(clickLogWide => {
      List(
        ChannelRealPvUv(clickLogWide.channelID, clickLogWide.yearMonth, clickLogWide.count, clickLogWide.isMonthNew),
        ChannelRealPvUv(clickLogWide.channelID, clickLogWide.yearMonthDay, clickLogWide.count, clickLogWide.isDayNew),
        ChannelRealPvUv(clickLogWide.channelID, clickLogWide.yearMonthDayHour, clickLogWide.count, clickLogWide.isHourNew)
      )
    })
    //2. Group by channel and time period
    .keyBy("channelId", "monthDayHour")
    //3. Window (a 10s window is assumed here, matching the channel hotspot task)
    .timeWindow(Time.seconds(10))
    //4. Aggregate
    .reduce((c1, c2) => {
      ChannelRealPvUv(c2.channelId, c2.monthDayHour, c1.pv + c2.pv, c1.uv + c2.uv)
    })
    //5. Save the results to HBase
    //Note: if the performance of HBase can't keep up during the test after class, you can print directly and see the results. Just read the sink below!
    result.addSink(new SinkFunction[ChannelRealPvUv] {

      override def invoke(value: ChannelRealPvUv, context: SinkFunction.Context): Unit = {
        //-1. Check
        val tableName = "channel_pvuv"
        val columnFamily = "info"
        val queryColumn1 = "pv"
        val queryColumn2 = "uv"
        val rowkey = value.channelId + ":" + value.monthDayHour

        val map: Map[String, String] = HBaseUtil.getMapData(tableName,rowkey,columnFamily,List(queryColumn1,queryColumn2))

        //-2. Merge with the historical values
        //Verbose version, kept for reference:
        /* val pvhistoryValueStr: String = map.getOrElse(queryColumn1,null)
        val uvhistoryValueStr: String = map.getOrElse(queryColumn2,null)

        var currentFinalPv = 0L
        var currentFinalUv = 0L

        if (StringUtils.isBlank(pvhistoryValueStr)) {
          //If pvhistoryValueStr is empty, the pv of this channel in this period is the final result
          currentFinalPv = value.pv
        } else {
          //If pvhistoryValueStr is not empty, the pv of this channel in this period plus pvhistoryValueStr is the final result
          currentFinalPv = value.pv + pvhistoryValueStr.toLong
        }

        if (StringUtils.isBlank(uvhistoryValueStr)) {
          //If uvhistoryValueStr is empty, the uv of this channel in this period is the final result
          currentFinalUv = value.uv
        } else {
          //If uvhistoryValueStr is not empty, the uv of this channel in this period plus uvhistoryValueStr is the final result
          currentFinalUv = value.uv + uvhistoryValueStr.toLong
        }*/

        //Shorter version: treat a missing historical value as "0"
        val pvhistoryValueStr: String = map.getOrElse(queryColumn1,"0")
        val uvhistoryValueStr: String = map.getOrElse(queryColumn2,"0")

        val currentFinalPv = value.pv + pvhistoryValueStr.toLong
        val currentFinalUv = value.uv + uvhistoryValueStr.toLong

        //-3. Save
        HBaseUtil.putData(tableName, rowkey, columnFamily, queryColumn1, currentFinalPv.toString)
        HBaseUtil.putData(tableName, rowkey, columnFamily, queryColumn2, currentFinalUv.toString)
      }
    })
  }
}
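
The one-to-three expansion and the per-key aggregation described in the comments of the PV/UV task can be sketched in plain Python. The field names follow the wide table, but the `expand` and `aggregate` helpers and the sample logs are made up for illustration, and windowing is left out.

```python
from collections import defaultdict


def expand(log):
    """Turn one wide-table log into three (channel, period, pv, uv) records."""
    return [
        (log["channelID"], log["yearMonth"], log["count"], log["isMonthNew"]),
        (log["channelID"], log["yearMonthDay"], log["count"], log["isDayNew"]),
        (log["channelID"], log["yearMonthDayHour"], log["count"], log["isHourNew"]),
    ]


def aggregate(logs):
    """Sum pv and uv per (channel, period) key, like keyBy followed by reduce."""
    totals = defaultdict(lambda: [0, 0])
    for log in logs:
        for channel, period, pv, uv in expand(log):
            totals[(channel, period)][0] += pv
            totals[(channel, period)][1] += uv
    return {key: tuple(value) for key, value in totals.items()}


logs = [
    # First visit of a user: new for the month, the day and the hour.
    {"channelID": "20", "yearMonth": "202008", "yearMonthDay": "20200830",
     "yearMonthDayHour": "2020083002", "count": 1,
     "isMonthNew": 1, "isDayNew": 1, "isHourNew": 1},
    # Same user one hour later: only new for the hour.
    {"channelID": "20", "yearMonth": "202008", "yearMonthDay": "20200830",
     "yearMonthDayHour": "2020083003", "count": 1,
     "isMonthNew": 0, "isDayNew": 0, "isHourNew": 1},
]
for key, value in sorted(aggregate(logs).items()):
    print(key, value)
```

Because the month, day and hour strings have different lengths, the three kinds of period never collide as keys, which is why a single `monthDayHour` field can hold all of them.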

03 Py-Flink

3.1 environment preparation

pip install apache-flink

Install it over a good network connection. Many other dependencies have to be downloaded, so the installation can take a long time (an estimated 2 hours).

3.2 official documents

3.3 example code

from pyflink.common.serialization import SimpleStringEncoder
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import StreamingFileSink

def tutorial():
    env = StreamExecutionEnvironment.get_execution_environment()
    # Source: a small in-memory collection of lines
    ds = env.from_collection(
        collection=["hadoop spark flink","hadoop spark","hadoop"],
        type_info=Types.STRING())

    # Split each line into words, map each word to (word, 1), key by word and sum the counts
    result = ds.flat_map(lambda line: line.split(" "), result_type=Types.STRING())\
        .map(lambda word: (word, 1),output_type=Types.ROW([Types.STRING(), Types.INT()]))\
        .key_by(lambda x: x[0],key_type_info=Types.STRING())\
        .reduce(lambda a, b: (a[0], a[1] + b[1]))

    # Sink: write each result row to files under data/output/result1
    result.add_sink(StreamingFileSink
                .for_row_format('data/output/result1', SimpleStringEncoder())
                .build())

    env.execute()

if __name__ == '__main__':
    tutorial()

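from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem
from pyflink.table.expressions import lit

exec_env = ExecutionEnvironment.get_execution_environment()
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)

# Source table: a single STRING column named word, read as CSV from data/input
t_env.connect(FileSystem().path('data/input')) \
    .with_format(OldCsv()
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .create_temporary_table('mySource')

# Sink table: (word, count) rows written as CSV to /tmp/output
# ('mySink' is the sink table name used in the official PyFlink Table API tutorial)
t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')

# Count the occurrences of each word and insert the result into the sink table
tab = t_env.from_path('mySource')
tab.group_by(tab.word) \
   .select(tab.word, lit(1).count) \
   .execute_insert('mySink').wait()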
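
For reference, the running word count that the DataStream example computes over its in-memory collection can be reproduced in plain Python without a Flink installation. The sketch below mimics a keyed reduce by emitting an updated (word, count) pair for every incoming word. The `running_word_count` name is made up, and in a real Flink job the order of updates across different keys may differ.

```python
def running_word_count(lines):
    """Emit an updated (word, count) pair for every word, like a keyed reduce."""
    counts = {}
    updates = []
    for line in lines:
        for word in line.split(" "):
            counts[word] = counts.get(word, 0) + 1
            updates.append((word, counts[word]))
    return updates, counts


updates, final_counts = running_word_count(
    ["hadoop spark flink", "hadoop spark", "hadoop"])
print(updates)
print(final_counts)  # {'hadoop': 3, 'spark': 2, 'flink': 1}
```

The batch Table API example, by contrast, emits only the final count per word.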

04 end

This article walked through simple examples of multilingual development with Flink, in Scala and in Python. Thank you for reading.

Topics: Big Data hive flink