Flink difficulty analysis: unveiling the mystery of Watermark

Posted by barnbuster on Mon, 10 Jan 2022 23:42:20 +0100

Apache Flink is often described as the ultimate streaming framework. It provides real-time computation with high throughput, low latency and exactly-once semantics, and it also processes batch data on the same streaming engine, so it genuinely unifies batch and stream processing. It is undoubtedly a rising star after Spark and Storm.
However, as soon as you start working with Flink you run into unfamiliar terms such as Watermark (sometimes translated as "water level"). What exactly a Watermark is remains a mystery to many newcomers. Here we uncover that mystery.

1, Time

1.1 time semantics

In stream processing, Flink distinguishes three time semantics according to where the timestamp is generated: event time, ingestion time and processing time.

1.1.1 Event Time

Event time is the time at which the event itself occurred, for example the time a user registered, placed an order or paid for an order. It reflects the real time of the event.

1.1.2 Ingestion Time

Ingestion time is the time at which the data enters the Flink system, that is, the timestamp assigned when the record is ingested.

1.1.3 Processing Time

Processing time is the system time of the machine running an operator instance at the moment that instance processes the data.

1.2 setting time semantics

In Flink, processing time semantics are used by default (this holds for versions before 1.12; from Flink 1.12 on, event time is the default and setStreamTimeCharacteristic() is deprecated). To use event time or ingestion time semantics, call setStreamTimeCharacteristic() on the StreamExecutionEnvironment you created to set the time characteristic of the job.

    // Use event time
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    // Use ingestion time
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

2, Watermark

When streaming data is processed with event time semantics, a record is generated as an event, flows through the source and then reaches the operator, all of which takes some time. In theory, records arrive at the operator in event-time order. In practice, network delay, message backlog and back pressure can put them out of order, and with Kafka in particular no ordering is guaranteed across partitions. A window calculation therefore cannot wait indefinitely for stragglers. There must be a mechanism that triggers the window calculation after a specific time, and that mechanism is the watermark.

2.1 what is watermark?

A watermark is essentially a timestamp. It solves the problem of out-of-order or late-arriving data to a certain extent.

2.2 how to calculate Watermark?

  • Watermark = maximum event time seen so far - maximum allowed data delay (out-of-orderness)

  • Setting the maximum allowed data delay

    // Set time semantics
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // Specify the watermark strategy
    ds.assignTimestampsAndWatermarks(
        // Note: WatermarkStrategy is available from Flink 1.11
        // The parameter is the maximum delay, i.e. the maximum out-of-orderness.
        // Here maxOutOfOrderness = 2s
        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                         // Extract the event time from each record
                         .withTimestampAssigner((e, timestamp) -> e.getEventTime())
    );
    
  • BoundedOutOfOrdernessWatermarks source code analysis

    @Public
    public class BoundedOutOfOrdernessWatermarks<T> implements WatermarkGenerator<T> {
        // Maximum event timestamp seen so far (across all events, not per window)
        private long maxTimestamp;
        // Maximum allowed out-of-orderness (delay), in milliseconds
        private final long outOfOrdernessMillis;

        public BoundedOutOfOrdernessWatermarks(Duration maxOutOfOrderness) {
            // Delay in milliseconds
            this.outOfOrdernessMillis = maxOutOfOrderness.toMillis();

            // Start at the smallest value that does not underflow when the watermark is computed
            this.maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;
        }

        /**
         * Called for every event, not only for events of a particular window.
         * @param event the event
         * @param eventTimestamp the event time of the event
         * @param out the watermark output
         */
        @Override
        public void onEvent(T event, long eventTimestamp, WatermarkOutput out) {
            // Track the maximum event time; Math.max keeps the watermark monotonically increasing
            maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        }

        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
            // Watermark = maximum event time seen so far - maximum delay - 1
            // Subtract 1 because a watermark t means no event with timestamp <= t is still expected;
            // a window is closed on the left and open on the right, so [start, end) fires
            // once the watermark reaches end - 1
            output.emitWatermark(new Watermark(maxTimestamp - outOfOrdernessMillis - 1));
        }
    }
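
To make the bookkeeping above easier to follow, here is a minimal plain-Java sketch of the same logic without any Flink dependency. The class name BoundedDelaySketch and the sample timestamps are assumptions made for illustration; it is a toy model, not the Flink class itself.

```java
import java.time.Duration;

/** Toy model of a bounded-out-of-orderness generator (hypothetical class, not Flink's). */
final class BoundedDelaySketch {
    private final long delayMillis;
    private long maxTimestamp;

    BoundedDelaySketch(Duration maxOutOfOrderness) {
        this.delayMillis = maxOutOfOrderness.toMillis();
        // Start low enough that the subtraction in currentWatermark() cannot underflow
        this.maxTimestamp = Long.MIN_VALUE + delayMillis + 1;
    }

    /** Records an event timestamp; an out-of-order event never lowers the maximum. */
    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    /** The watermark a periodic emit would produce at this moment. */
    long currentWatermark() {
        return maxTimestamp - delayMillis - 1;
    }

    public static void main(String[] args) {
        BoundedDelaySketch gen = new BoundedDelaySketch(Duration.ofSeconds(2));
        // Assumed sample data: 11_000 arrives after 12_000, i.e. out of order
        long[] eventTimes = {10_000L, 12_000L, 11_000L, 15_000L};
        for (long ts : eventTimes) {
            gen.onEvent(ts);
            System.out.println("event=" + ts + " watermark=" + gen.currentWatermark());
        }
    }
}
```

With these sample values the watermark stays at 9999 when the out-of-order event 11_000 arrives, which is the monotonic behaviour the Math.max call is there to guarantee.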
    

2.3 when is the window calculation triggered?

  • Watermark >= window end time
  • Derivation
    Watermark = maximum event time seen so far - maximum allowed data delay (out-of-orderness)
    => maximum event time seen so far - maximum allowed data delay >= window end time
    => maximum event time seen so far >= window end time + maximum allowed data delay
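
The derivation can be checked with a few lines of arithmetic. The sketch below is plain Java, not Flink code, and the names (TriggerArithmetic, windowStart, fires) are hypothetical. It assumes non-negative timestamps and windows aligned at zero. It also keeps the "- 1" from the source code in section 2.2: the emitted watermark is one millisecond lower, and an event-time window fires once the watermark reaches its last timestamp (end - 1), so the two corrections cancel and the final inequality is unchanged.

```java
/** Worked arithmetic for the trigger condition (hypothetical helper names). */
final class TriggerArithmetic {
    /** Start of the tumbling window containing ts; assumes ts >= 0 and no window offset. */
    static long windowStart(long ts, long windowSizeMillis) {
        return ts - (ts % windowSizeMillis);
    }

    /** True if a window ending at windowEnd fires for the given maximum event time. */
    static boolean fires(long maxEventTime, long delayMillis, long windowEnd) {
        long watermark = maxEventTime - delayMillis - 1; // as emitted in section 2.2
        return watermark >= windowEnd - 1;               // end - 1 is the last timestamp in the window
    }

    /** Smallest maximum event time for which the window fires: window end + delay. */
    static long minMaxEventTimeToFire(long windowEnd, long delayMillis) {
        return windowEnd + delayMillis;
    }

    public static void main(String[] args) {
        long size = 5_000L;   // 5s tumbling window (assumed)
        long delay = 2_000L;  // 2s allowed out-of-orderness (assumed)
        long start = windowStart(12_300L, size);
        long end = start + size;
        System.out.println("window=[" + start + "," + end + ")");
        System.out.println("fires at max event time " + minMaxEventTimeToFire(end, delay));
    }
}
```

For an event at 12_300 ms the window is [10000, 15000), and it fires only once some event with a timestamp of at least 17_000 ms has been seen.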

2.4 principle

During window processing in Apache Flink, once time passes the end time of a window, the computation over that window's data (aggregation, grouping and so on) is triggered. Out-of-order data, however, can easily miss the moment its window is computed, which results in data loss. The watermark mechanism solves the problem of out-of-order or late-arriving data to a certain extent.

2.4.1 window calculation problem


As shown in the figure, when event C arrives, its event time is past the end time of window X, so window X triggers its calculation and the new window U receives C. When events D and E arrive afterwards, they are lost because the calculation of window X has already been triggered.

2.4.2 watermark window


Watermarks are now added to the windows shown in the figure. When event C arrives, the watermark is 10:09:00, which is still less than the end time of window X. The trigger condition of window X is not met, so its calculation is not triggered, while the new window U receives C. When events D and E arrive, window X has not fired yet, so they are added to window X. This handles data that is up to 2 seconds out of order. When event F arrives, the watermark becomes 10:10:00, which is greater than or equal to the end time of window X, so the trigger condition is met and window X is calculated.
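
The two scenarios can be replayed in a small simulation. This is a plain-Java sketch and not Flink's window operator. The event timestamps (A=3, B=7, C=11, D=8, E=9, F=12), the window X = [0, 10) and the abstract time units are assumptions chosen to match the description above, since the original figures are not available here.

```java
import java.util.ArrayList;
import java.util.List;

/** Replays events in arrival order against a single window [0, windowEnd). */
final class WindowXSimulation {
    /** Returns the labels that window X accepted before it fired (hypothetical helper). */
    static List<String> run(String[] labels, long[] eventTimes, long windowEnd, long delay) {
        List<String> windowX = new ArrayList<>();
        long maxEventTime = Long.MIN_VALUE + delay + 1;
        boolean fired = false;
        for (int i = 0; i < labels.length; i++) {
            long ts = eventTimes[i];
            if (ts < windowEnd && !fired) {
                windowX.add(labels[i]); // window X is still open, so the event is accepted
            }
            // An event for window X that arrives after the window fired is dropped
            maxEventTime = Math.max(maxEventTime, ts);
            long watermark = maxEventTime - delay - 1;
            if (watermark >= windowEnd - 1) {
                fired = true;
            }
        }
        return windowX;
    }

    public static void main(String[] args) {
        String[] labels = {"A", "B", "C", "D", "E", "F"};
        long[] times = {3, 7, 11, 8, 9, 12}; // assumed event times, arrival order A..F
        System.out.println("delay 0: " + run(labels, times, 10, 0));
        System.out.println("delay 2: " + run(labels, times, 10, 2));
    }
}
```

With a delay of 0 the window fires as soon as C arrives and only A and B are counted. With a delay of 2 the window stays open until F arrives, so D and E are included as well.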

2.5 Watermark setting strategy

2.5.1 AssignerWithPunctuatedWatermarks

A punctuated watermark generates a new watermark when certain specially marked events are observed in the data stream. In this mode the triggering of a window does not depend on the passage of time but on when a marker event is received.

In production, punctuated mode produces a large number of watermarks in high-TPS scenarios, which puts some pressure on downstream operators. Punctuated watermark generation is therefore chosen only in scenarios with strict real-time requirements.
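
As an illustration of the punctuated idea, the following plain-Java sketch emits a watermark only when it sees a marker event. The Event type and its marker flag are hypothetical, and the code is a toy model rather than a Flink API. In terms of the WatermarkGenerator interface from section 2.2, this roughly corresponds to emitting from onEvent and leaving onPeriodicEmit empty.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of punctuated watermark generation (hypothetical types). */
final class PunctuatedSketch {
    /** Hypothetical event: marker flags the special events that carry a watermark. */
    static final class Event {
        final long timestamp;
        final boolean marker;

        Event(long timestamp, boolean marker) {
            this.timestamp = timestamp;
            this.marker = marker;
        }
    }

    /** Returns every watermark emitted while consuming the events in order. */
    static List<Long> emittedWatermarks(List<Event> events) {
        List<Long> emitted = new ArrayList<>();
        long current = Long.MIN_VALUE;
        for (Event e : events) {
            // Emit only on marker events, and never move the watermark backwards
            if (e.marker && e.timestamp > current) {
                current = e.timestamp;
                emitted.add(current);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(1, false));
        events.add(new Event(2, true));
        events.add(new Event(3, false));
        events.add(new Event(5, true));
        events.add(new Event(4, true)); // older marker: ignored to keep watermarks monotonic
        System.out.println(emittedWatermarks(events));
    }
}
```

If every event were a marker, one watermark would be emitted per event, which is the source of the downstream pressure mentioned above.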

2.5.2 AssignerWithPeriodicWatermarks

With periodic watermarks, the system generates a watermark periodically (at a fixed time interval). The interval at which the watermark advances is set by the user (in Flink, through env.getConfig().setAutoWatermarkInterval(...)). Between two watermark updates some messages flow in, and a new watermark can be computed from that data.

In production, the periodic approach must keep generating watermarks based on both elapsed time and the number of accumulated records; otherwise extreme cases can suffer a large delay.

For example, the simplest watermark algorithm takes the largest event time seen so far. This approach is rather crude, however: it tolerates little disorder and easily produces a large number of late events.
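
The effect of that choice can be measured on a small sample. The sketch below is plain Java with assumed sample timestamps in arbitrary units, and LateEventCount is a hypothetical name. It counts how many events arrive at or below the watermark that is in effect on their arrival, once with a delay of 0 (the "largest event time so far" strategy) and once with a delay of 3.

```java
/** Counts late events for a given allowed delay (hypothetical helper). */
final class LateEventCount {
    /** An event is counted as late if its timestamp is <= the watermark when it arrives. */
    static int countLate(long[] eventTimes, long delay) {
        long maxEventTime = Long.MIN_VALUE + delay + 1;
        int late = 0;
        for (long ts : eventTimes) {
            long watermark = maxEventTime - delay - 1; // watermark before this event is processed
            if (ts <= watermark) {
                late++;
            }
            maxEventTime = Math.max(maxEventTime, ts);
        }
        return late;
    }

    public static void main(String[] args) {
        long[] eventTimes = {10, 12, 11, 15, 13, 14, 9}; // assumed arrival order
        System.out.println("delay 0: " + countLate(eventTimes, 0) + " late");
        System.out.println("delay 3: " + countLate(eventTimes, 3) + " late");
    }
}
```

On this sample, four events are late with a delay of 0 and only one with a delay of 3. The price is that windows fire correspondingly later.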

3, Case

package com.hotmail.ithink.watermark;

import com.google.common.collect.Lists;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.commons.lang3.time.FastDateFormat;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.io.Serializable;
import java.util.List;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class WatermarkMain {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        // Window calculation based on Watermark event time
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Get event flow data
        DataStreamSource<OrderEvent> eventDataStream = env.addSource(new SourceFunction<OrderEvent>() {
            private static final long serialVersionUID = 5652749729728486680L;

            // Flag that keeps the source running; volatile because cancel() is called from another thread
            private volatile boolean switchFlag = true;

            @Override
            public void run(SourceContext<OrderEvent> ctx) throws Exception {
                Random random = new Random();
                while (switchFlag) {
                    String orderId = UUID.randomUUID().toString().replaceAll("-", "");
                    int userId = random.nextInt(2);
                    int money = random.nextInt(100);
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;

                    OrderEvent orderEvent = new OrderEvent(orderId, userId, money, eventTime);

                    System.out.println("data: " + orderEvent);

                    // Send element
                    ctx.collect(orderEvent);

                    // Sleep for 1s
                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                switchFlag = false;
            }
        });

        // Add a Watermark to the event flow data and specify the event time
//        SingleOutputStreamOperator<OrderEvent> eventWatermarkDataStream 
//          = eventDataStream.assignTimestampsAndWatermarks(
//                //Set the maximum allowable delay time as 3s
//                WatermarkStrategy.<OrderEvent>forBoundedOutOfOrderness(Duration.ofSeconds(3))
//                        //Set timestamp data
//                        .withTimestampAssigner((e, timestamp) -> e.getEventTime())
//        );

        SingleOutputStreamOperator<OrderEvent> eventWatermarkDataStream = eventDataStream
         .assignTimestampsAndWatermarks(
            new WatermarkStrategy<OrderEvent>() {
                @Override
                public WatermarkGenerator<OrderEvent> createWatermarkGenerator(
                        WatermarkGeneratorSupplier.Context ctx
                ) {
                    return new WatermarkGenerator<OrderEvent>() {
                        /** Maximum allowable delay time */
                        private final int outOfOrdernessMills = 3000;
                        /** User ID**/
                        private Integer userId;
                        /** Event time**/
                        private Long eventTime;
                        /** Maximum event timestamp */
                        private Long maxTimestamp = Long.MIN_VALUE + outOfOrdernessMills + 1;

                        // Time formatting
                        private FastDateFormat df = FastDateFormat.getInstance("HH:mm:ss");

                        @Override
                        public void onEvent(OrderEvent event, long eventTimestamp, WatermarkOutput output) {
                            this.userId = event.userId;
                            this.eventTime = event.eventTime;
                            maxTimestamp = Math.max(maxTimestamp, eventTimestamp);

                            System.out.println("watermark on event: "  + event);
                        }

                        @Override
                        public void onPeriodicEmit(WatermarkOutput out) {
                            long watermarkTimestamp = maxTimestamp - outOfOrdernessMills - 1;

                            // Log only after the first event: eventTime is null until then,
                            // and df.format(null) would throw
                            if (eventTime != null) {
                                String note = String.format("watermark emit key:%s current time:%s " +
                                                "event time:%s watermark:%s",
                                        userId, System.currentTimeMillis(), df.format(eventTime),
                                        df.format(watermarkTimestamp));
                                System.out.println(note);
                            }

                            out.emitWatermark(new Watermark(watermarkTimestamp));
                        }
                    };
                }
        }.withTimestampAssigner((e, timestamp) -> e.getEventTime()));

        // Add window calculation
        SingleOutputStreamOperator<String> outDataStream = eventWatermarkDataStream.keyBy(OrderEvent::getUserId)
                // Use a 5-second tumbling event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                // Specifies the window application function
                .apply(new WindowFunction<OrderEvent, String, Integer, TimeWindow>() {
                    private static final long serialVersionUID = 7034105248794615763L;
                    // Time formatting
                    private FastDateFormat df = FastDateFormat.getInstance("HH:mm:ss");

                    @Override
                    public void apply(Integer key, TimeWindow window, Iterable<OrderEvent> events,
                                      Collector<String> out) throws Exception {
                        List<String> eventTimeList = Lists.newLinkedList();
                        for (OrderEvent event : events) {
                            String time = df.format(event.getEventTime());
                            eventTimeList.add(time);
                        }

                        String windowStartTime = df.format(window.getStart());
                        String windowEndTime = df.format(window.getEnd());

                        String rs = String.format("key:%s window:[%s,%s) window event times:%s",
                                key, windowStartTime, windowEndTime, eventTimeList.toString());

                        out.collect(rs);
                    }
                });

        outDataStream.print("WaterMarkResult::");

        env.execute("WatermarkMain");
    }

    @Data
    @NoArgsConstructor
    @AllArgsConstructor
    public static class OrderEvent implements Serializable {
        private static final long serialVersionUID = 2082940433103599734L;

        /** Order ID */
        private String orderId;
        /** User ID */
        private Integer userId;
        /** amount of money */
        private Integer money;
        /** Event time */
        private Long eventTime;
    }
}
