Flink tutorial (12) - Flink advanced API (Time and Watermark)

Posted by cwls1184 on Mon, 07 Mar 2022 16:36:49 +0100

01 Introduction

Earlier posts in this series introduced Flink's unified batch and stream API. Interested readers can refer to those posts.

The previous tutorial covered Window, one of Flink's four cornerstones. As the figure below shows, this article explains another cornerstone, Time:

02 Time

Flink's stream processing involves several notions of time, as shown in the following figure:

Time falls into the following categories:

  • Event time: the time at which the event actually occurred
  • Ingestion time: the time at which the event enters Flink
  • Processing time: the time at which an operator actually processes the event

Event time is the most important of the three: once an event has occurred, its event time never changes, so it best reflects what actually happened.

Why is event time so important?

Suppose you order takeout in an underground garage at 11:59. Because there is no signal underground, the app keeps retrying the submission, and the order only goes through at 12:05, after you have left the garage. If we want the total order amount before 12:00, should this order be counted? It should, because the order was really placed at 11:59, and that is its event time.
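The takeout example can be sketched in plain Java (no Flink involved). The class names and the order amounts are hypothetical, chosen only for illustration; the point is that the same order lands in a different bucket depending on which timestamp is used.

```java
import java.time.LocalTime;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: class names and amounts are made up for illustration.
final class EventVsProcessingTime {

    // One order: when it was placed (event time) and when the system received it (processing time).
    static final class TakeoutOrder {
        final LocalTime eventTime;
        final LocalTime processingTime;
        final int amount;

        TakeoutOrder(LocalTime eventTime, LocalTime processingTime, int amount) {
            this.eventTime = eventTime;
            this.processingTime = processingTime;
            this.amount = amount;
        }
    }

    // Sums the amounts of orders whose chosen timestamp is strictly before the cutoff.
    static int totalBefore(List<TakeoutOrder> orders, LocalTime cutoff, boolean useEventTime) {
        int total = 0;
        for (TakeoutOrder order : orders) {
            LocalTime t = useEventTime ? order.eventTime : order.processingTime;
            if (t.isBefore(cutoff)) {
                total += order.amount;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<TakeoutOrder> orders = Arrays.asList(
                new TakeoutOrder(LocalTime.of(11, 59), LocalTime.of(12, 5), 30),   // the garage order
                new TakeoutOrder(LocalTime.of(11, 30), LocalTime.of(11, 30), 20)); // an ordinary order
        System.out.println("before 12:00 by event time:      " + totalBefore(orders, LocalTime.NOON, true));
        System.out.println("before 12:00 by processing time: " + totalBefore(orders, LocalTime.NOON, false));
    }
}
```

Counting by event time includes the garage order; counting by processing time silently drops it from the "before 12:00" total.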

Event time reflects when the event happened, but data can arrive late or out of order because of network delay and other causes. A mechanism is therefore needed to handle out-of-order and late data to a certain extent. That mechanism is the watermark, which we cover next.

03 Watermark mechanism

3.1 Watermark definition

Watermark: a watermark is an extra timestamp attached to the data stream. It tells Flink how far event time has progressed.

Watermark formula (taking the maximum guarantees that the watermark only ever rises and never falls):

Watermark = maximum event time seen so far - maximum allowed delay (out-of-orderness)
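As a minimal sketch of this formula in plain Java (not Flink's implementation; the class `WatermarkSketch` and its method names are hypothetical), the tracker below keeps the largest event time seen so far and subtracts a fixed delay. Because a running maximum never decreases, neither does the watermark, even when events arrive out of order.

```java
// Hypothetical sketch of the formula above; not Flink's own generator.
final class WatermarkSketch {
    private final long maxDelayMillis;
    private long maxEventTime = Long.MIN_VALUE;

    WatermarkSketch(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    // Observes one event and returns the watermark after it has been seen.
    long onEvent(long eventTimeMillis) {
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
        return maxEventTime - maxDelayMillis;
    }

    public static void main(String[] args) {
        WatermarkSketch sketch = new WatermarkSketch(3_000L);
        long[] eventTimes = {10_000L, 7_000L, 12_000L, 9_000L}; // out of order on purpose
        for (long eventTime : eventTimes) {
            System.out.println("event " + eventTime + " -> watermark " + sketch.onEvent(eventTime));
        }
    }
}
```

The late events 7000 and 9000 leave the watermark unchanged; only a new maximum moves it forward.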

3.2 Role of the watermark

Previously, windows were triggered by system time. For example, a [10:00:00 ~ 10:00:10) window fires as soon as the system time reaches 10:00:10, so delayed data may be lost.

With watermarks, the window fires according to the watermark instead. In other words, the watermark is what triggers the window calculation.

3.3 How the watermark triggers window calculation

A window calculation is triggered when both conditions hold:

  • There is data in the window
  • Watermark >= end time of the window

As noted earlier, Watermark = maximum event time seen so far - maximum allowed delay (out-of-orderness). As long as data keeps arriving, the watermark keeps rising and never falls, so the window calculation is eventually triggered.
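The two trigger conditions can be written down directly. The sketch below is plain Java with hypothetical names, and it assumes tumbling windows aligned to the epoch with no offset; it first finds the window an event belongs to, then checks whether that window should fire for a given watermark.

```java
// Hypothetical sketch; assumes tumbling windows aligned to the epoch, with no offset.
final class TumblingWindowSketch {

    // Start (inclusive) of the tumbling window that contains the timestamp.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // End (exclusive) of the tumbling window that contains the timestamp.
    static long windowEnd(long timestampMillis, long sizeMillis) {
        return windowStart(timestampMillis, sizeMillis) + sizeMillis;
    }

    // The trigger rule from the text: data in the window AND watermark >= window end.
    static boolean shouldFire(boolean windowHasData, long watermarkMillis, long windowEndMillis) {
        return windowHasData && watermarkMillis >= windowEndMillis;
    }

    public static void main(String[] args) {
        long size = 5_000L;
        long eventTime = 12_345L;
        long end = windowEnd(eventTime, size);
        System.out.println("window: [" + windowStart(eventTime, size) + ", " + end + ")");
        System.out.println("fires at watermark 14999: " + shouldFire(true, 14_999L, end));
        System.out.println("fires at watermark 15000: " + shouldFire(true, 15_000L, end));
    }
}
```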

3.4 Watermark illustrated

Trigger formula:

  • Watermark >= end time of the window
  • Watermark = maximum event time seen so far - maximum allowed delay
  • Maximum event time seen so far - maximum allowed delay >= end time of the window
  • Maximum event time seen so far >= end time of the window + maximum allowed delay

As shown in the figure above, the window is [10:00:00 ~ 10:10:00), and the data items arrive in the order C, B, D, A.

Case 1: without the watermark mechanism, B arrives late (at least 2 minutes late), so B is lost.

Case 2: with the watermark mechanism and a maximum allowed delay of 5 minutes:

  • When C arrives, Watermark = max(10:11:00) - 5 min = 10:06:00 < window end time 10:10:00, so the trigger condition is not met
  • When B arrives, Watermark = max(10:11:00, 10:09:00) - 5 min = 10:06:00 < window end time 10:10:00, so the trigger condition is not met
  • When D arrives, Watermark = max(10:11:00, 10:09:00, 10:15:00) - 5 min = 10:10:00 = window end time 10:10:00, so the trigger condition is met and the window fires. B is included in the calculation and is not lost
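The arithmetic of Case 2 can be replayed in a few lines of plain Java. This is only a sketch with hypothetical names; the event times for C, B, and D and the 5-minute delay are taken from the list above.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical replay of Case 2: event times and delay come from the worked example.
final class WatermarkReplay {

    // Returns the name of the first event whose arrival lifts the watermark to the
    // window end (or beyond), or null if no event does.
    static String firstTriggeringEvent(Map<String, LocalTime> arrivals, Duration maxDelay, LocalTime windowEnd) {
        LocalTime maxEventTime = null;
        for (Map.Entry<String, LocalTime> arrival : arrivals.entrySet()) {
            LocalTime eventTime = arrival.getValue();
            if (maxEventTime == null || eventTime.isAfter(maxEventTime)) {
                maxEventTime = eventTime;
            }
            LocalTime watermark = maxEventTime.minus(maxDelay);
            boolean fires = !watermark.isBefore(windowEnd); // watermark >= window end
            System.out.println(arrival.getKey() + " arrives, watermark = " + watermark + ", fires = " + fires);
            if (fires) {
                return arrival.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, LocalTime> arrivals = new LinkedHashMap<>(); // keeps arrival order
        arrivals.put("C", LocalTime.of(10, 11));
        arrivals.put("B", LocalTime.of(10, 9));
        arrivals.put("D", LocalTime.of(10, 15));
        String trigger = firstTriggeringEvent(arrivals, Duration.ofMinutes(5), LocalTime.of(10, 10));
        System.out.println("window fired by: " + trigger);
    }
}
```

The window fires only when D arrives, by which time B is already inside the window.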

Note: the watermark mechanism handles late, out-of-order data only to a certain extent; it cannot handle more severe lateness. If A arrives after its window has already been calculated, A is still lost. To avoid losing A, set the maximum allowed delay a little larger, or use the Allowed Lateness and side output mechanisms covered later.

04 Case demonstration

4.1 Watermark case demonstration

Requirement: order data arrives in the format (order ID, user ID, order amount, timestamp/event time). Every 5 seconds, calculate the total order amount of each user over the last 5 seconds, and add a watermark to handle data delay and disorder to a certain extent.

Core API: DataStream.assignTimestampsAndWatermarks(...)

Periodic generation                                  | Punctuated generation (from special records)
Driven by real (wall-clock) time                     | Driven by data
The generation method is called at regular intervals | The generation method is called each time a timestamp is assigned
Implement AssignerWithPeriodicWatermarks             | Implement AssignerWithPunctuatedWatermarks

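Note: in practice we usually use the BoundedOutOfOrdernessTimestampExtractor provided by Flink directly or, in newer versions, WatermarkStrategy.forBoundedOutOfOrderness(...), as the code below does.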
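To make the difference between the two styles concrete, here is a plain-Java sketch. The `Generator` interface below is a hypothetical stand-in written for this example, not Flink's interface: the periodic style only records the maximum in `onEvent` and emits on a timer tick, while the punctuated style emits as soon as it sees a special marker record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: Generator is a stand-in for illustration, not Flink's interface.
final class GeneratorStyles {

    interface Generator {
        void onEvent(long eventTime, boolean isMarker, List<Long> out);
        void onPeriodicEmit(List<Long> out);
    }

    // Periodic style: remember the maximum in onEvent, emit when the timer ticks.
    static final class Periodic implements Generator {
        private final long maxDelay;
        private long maxEventTime;

        Periodic(long maxDelay) {
            this.maxDelay = maxDelay;
            this.maxEventTime = Long.MIN_VALUE + maxDelay; // avoids underflow before the first event
        }

        @Override
        public void onEvent(long eventTime, boolean isMarker, List<Long> out) {
            maxEventTime = Math.max(maxEventTime, eventTime);
        }

        @Override
        public void onPeriodicEmit(List<Long> out) {
            out.add(maxEventTime - maxDelay);
        }
    }

    // Punctuated style: emit immediately when a special marker record is seen.
    static final class Punctuated implements Generator {
        @Override
        public void onEvent(long eventTime, boolean isMarker, List<Long> out) {
            if (isMarker) {
                out.add(eventTime);
            }
        }

        @Override
        public void onPeriodicEmit(List<Long> out) {
            // nothing to do: this style does not depend on the timer
        }
    }

    // Feeds three events (the second one is a marker), then one timer tick.
    static List<Long> run(Generator generator) {
        List<Long> emitted = new ArrayList<>();
        generator.onEvent(1_000L, false, emitted);
        generator.onEvent(3_000L, true, emitted);
        generator.onEvent(2_000L, false, emitted);
        generator.onPeriodicEmit(emitted);
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println("periodic:   " + run(new Periodic(500L)));
        System.out.println("punctuated: " + run(new Punctuated()));
    }
}
```

With the same input, the periodic generator emits one watermark at the tick (maximum minus delay), and the punctuated generator emits one watermark at the marker record.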

Implementation method 1: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html

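import java.time.Duration;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

/**
 * @author : YangLinWei
 * @createTime: 2022/3/7 11:07 PM
 * <p>
 * Simulates real-time order data in the format (order ID, user ID, order amount, timestamp/event time).
 * Every 5 seconds, calculates the total order amount of each user over the last 5 seconds (time-based tumbling window).
 * A watermark is added to handle data delay and disorder to a certain extent.
 */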
public class WatermakerDemo01 {

    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //2.Source
        //Simulate real-time order data (with delayed and out-of-order events)
        DataStream<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            //volatile: cancel() is called from a different thread than run()
            private volatile boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    //Simulate delay and disorder: the event time lags the current time by 0 to 4 seconds
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;
                    ctx.collect(new Order(orderId, userId, money, eventTime));

                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });

        //3.Transformation
        //-Tell Flink to calculate based on event time
        //env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);// Event time is the default in newer versions
        //-Tell Flink which field holds the event time, because Watermark = current maximum event time - maximum allowed delay (out-of-orderness)
        /*DataStream<Order> watermakerDS = orderDS.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Order>(Time.seconds(3)) {//Maximum allowed delay (out-of-orderness)
                    @Override
                    public long extractTimestamp(Order element) {
                        return element.eventTime;
                        //Specify which field holds the event time; Flink then computes internally:
                        //Watermark = current maximum event time - maximum allowed delay (out-of-orderness)
                    }
        });*/
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((event, timestamp) -> event.getEventTime())
                );

        //At this point the watermark has been assigned; next comes the window calculation
        //Every 5 seconds, calculate the total order amount of each user over the last 5 seconds (time-based tumbling window)
        DataStream<Order> result = watermakerDS
                .keyBy(Order::getUserId)
                //.timeWindow(Time.seconds(5), Time.seconds(5))
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum("money");


        //4.Sink
        result.print();

        //5.execute
        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {
        private String orderId;
        private Integer userId;
        private Integer money;
        private Long eventTime;
    }
}

Operation results:

Implementation method 2:

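import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
//Assumption: FastDateFormat comes from Apache Commons Lang 3
import org.apache.commons.lang3.time.FastDateFormat;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

/**
 * Verification demo: a hand-written watermark generator that prints its state,
 * so the watermark and the contents of each window can be checked.
 *
 * @author : YangLinWei
 * @createTime: 2022/3/7 11:15 PM
 */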
public class WatermakerDemo02 {

    public static void main(String[] args) throws Exception {

        FastDateFormat df = FastDateFormat.getInstance("HH:mm:ss");

        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //2.Source
        //Simulate real-time order data (with delayed and out-of-order events)
        DataStreamSource<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            //volatile: cancel() is called from a different thread than run()
            private volatile boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    //Simulate delay and disorder: the event time lags the current time by 0 to 4 seconds
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;
                    System.out.println("The data sent is: " + userId + " : " + df.format(eventTime));
                    ctx.collect(new Order(orderId, userId, money, eventTime));
                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });

        //3.Transformation
        /*DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((event, timestamp) -> event.getEventTime())
                );*/

        //In production code you can use the built-in strategy above directly
        //For learning and testing, you can implement the generator yourself
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        new WatermarkStrategy<Order>() {
                            @Override
                            public WatermarkGenerator<Order> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                                return new WatermarkGenerator<Order>() {
                                    private int userId = 0;
                                    private long eventTime = 0L;
                                    private final long outOfOrdernessMillis = 3000;
                                    private long maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;

                                    @Override
                                    public void onEvent(Order event, long eventTimestamp, WatermarkOutput output) {
                                        userId = event.userId;
                                        eventTime = event.eventTime;
                                        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
                                    }

                                    @Override
                                    public void onPeriodicEmit(WatermarkOutput output) {
                                        //Watermark = current maximum event time - maximum allowed delay (minus 1 ms, as Flink's built-in bounded-out-of-orderness generator does)
                                        Watermark watermark = new Watermark(maxTimestamp - outOfOrdernessMillis - 1);
                                        System.out.println("key:" + userId + ",system time:" + df.format(System.currentTimeMillis()) + ",Event time:" + df.format(eventTime) + ",Watermark time:" + df.format(watermark.getTimestamp()));
                                        output.emitWatermark(watermark);
                                    }
                                };
                            }
                        }.withTimestampAssigner((event, timestamp) -> event.getEventTime())
                );


        //At this point the watermark has been assigned; next comes the window calculation
        //Every 5 seconds, calculate the total order amount of each user over the last 5 seconds (time-based tumbling window)
       /* DataStream<Order> result = watermakerDS
                 .keyBy(Order::getUserId)
                //.timeWindow(Time.seconds(5), Time.seconds(5))
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum("money");*/

        //Use the code above for the business calculation in production
        //For learning and testing, the code below prints more detail: when a window fires, it outputs the event times of the data in that window
        DataStream<String> result = watermakerDS
                .keyBy(Order::getUserId)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                //Apply the function in apply to the data in the window
                //WindowFunction<IN, OUT, KEY, W extends Window>
                .apply(new WindowFunction<Order, String, Integer, TimeWindow>() {
                    @Override
                    public void apply(Integer key, TimeWindow window, Iterable<Order> input, Collector<String> out) throws Exception {
                        //Prepare a collection to store the event time of the data belonging to the window
                        List<String> eventTimeList = new ArrayList<>();
                        for (Order order : input) {
                            Long eventTime = order.eventTime;
                            eventTimeList.add(df.format(eventTime));
                        }
                        String outStr = String.format("key:%s,Window start end:[%s~%s),Time of events belonging to this window:%s",
                                key.toString(), df.format(window.getStart()), df.format(window.getEnd()), eventTimeList);
                        out.collect(outStr);
                    }
                });
        //4.Sink
        result.print();

        //5.execute
        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {
        private String orderId;
        private Integer userId;
        private Integer money;
        private Long eventTime;
    }
}

Operation results:

4.2 Allowed Lateness case demonstration

Requirement: order data arrives in the format (order ID, user ID, order amount, timestamp/event time).
Every 5 seconds, calculate the total order amount of each user over the last 5 seconds, and add a watermark to handle data delay and disorder to a certain extent. In addition, use OutputTag + allowedLateness so that seriously late data is not lost.
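Before the Flink code, the decision that allowedLateness and the side output make can be sketched in plain Java. This is a simplified model with hypothetical names that uses the same "watermark >= window end" rule as section 3.3 and ignores Flink's internal 1 ms adjustments: an element is classified by comparing the current watermark with the end of the window it belongs to.

```java
// Hypothetical, simplified model of how a late element is treated; not Flink's code.
final class LatenessSketch {

    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, SIDE_OUTPUT }

    // Classifies an element arriving when the watermark is currentWatermark,
    // given the end of the window the element belongs to.
    static Outcome classify(long windowEnd, long allowedLateness, long currentWatermark) {
        if (currentWatermark < windowEnd) {
            return Outcome.ON_TIME;           // the window has not fired yet
        }
        if (currentWatermark < windowEnd + allowedLateness) {
            return Outcome.LATE_BUT_ACCEPTED; // the window fired, but its state is still kept
        }
        return Outcome.SIDE_OUTPUT;           // too late: routed to the side output
    }

    public static void main(String[] args) {
        long windowEnd = 10_000L;
        long allowedLateness = 5_000L;
        System.out.println(classify(windowEnd, allowedLateness, 9_000L));
        System.out.println(classify(windowEnd, allowedLateness, 12_000L));
        System.out.println(classify(windowEnd, allowedLateness, 15_000L));
    }
}
```

In this model the three calls print ON_TIME, LATE_BUT_ACCEPTED, and SIDE_OUTPUT in that order. Without a side output configured, elements in the last category would simply be dropped.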

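API: allowedLateness(...) and sideOutputLateData(...) on the windowed stream, and getSideOutput(...) on the result, as used in the code below.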

Example code:

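import java.time.Duration;
import java.util.Random;
import java.util.UUID;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

/**
 * allowedLateness
 *
 * @author : YangLinWei
 * @createTime: 2022/3/7 11:18 PM
 */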
public class WatermakerDemo03 {

    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //2.Source
        //Simulate real-time order data (with delayed and out-of-order events)
        DataStreamSource<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            //volatile: cancel() is called from a different thread than run()
            private volatile boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    //Simulate delay and disorder: the event time lags the current time by 0 to 9 seconds
                    long eventTime = System.currentTimeMillis() - random.nextInt(10) * 1000;
                    ctx.collect(new Order(orderId, userId, money, eventTime));

                    //TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });


        //3.Transformation
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((event, timestamp) -> event.getEventTime())
                );

        //At this point the watermark has been assigned; next comes the window calculation
        //Every 5 seconds, calculate the total order amount of each user over the last 5 seconds (time-based tumbling window)
        OutputTag<Order> outputTag = new OutputTag<>("Seriouslylate", TypeInformation.of(Order.class));

        SingleOutputStreamOperator<Order> result = watermakerDS
                .keyBy(Order::getUserId)
                //.timeWindow(Time.seconds(5), Time.seconds(5))
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .allowedLateness(Time.seconds(5))
                .sideOutputLateData(outputTag)
                .sum("money");

        DataStream<Order> result2 = result.getSideOutput(outputTag);

        //4.Sink
        result.print("On-time and slightly late data");
        result2.print("Seriously late data");

        //5.execute
        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {
        private String orderId;
        private Integer userId;
        private Integer money;
        private Long eventTime;
    }
}

Operation results:

05 End

This article explained the principles and usage of Time and Watermark. Thank you for reading!
