01 Introduction
The previous posts in this series covered Flink's batch and streaming API. Interested readers can refer to the earlier tutorials:
- Flink tutorial (01) - Flink knowledge map
- Flink tutorial (02) - getting started with Flink
- Flink tutorial (03) - Flink environment construction
- Flink tutorial (04) - getting started with Flink
- Flink tutorial (05) - simple analysis of Flink principle
- Flink tutorial (06) - Flink batch streaming API (Source example)
- Flink tutorial (07) - Flink batch streaming API (Transformation example)
- Flink tutorial (08) - Flink batch streaming API (Sink example)
- Flink tutorial (09) - Flink batch streaming API (Connectors example)
- Flink tutorial (10) - Flink batch streaming API (others)
- Flink tutorial (11) - Flink advanced API (Window)
The previous tutorial covered Window, one of Flink's four cornerstones. As shown in the figure below, this article covers another one, Time:
02 Time
Flink's stream processing involves several different notions of time, as shown in the following figure:
Time falls into the following categories:
- Event time: the time when the event actually occurred
- Ingestion time: the time when the event enters Flink
- Processing time: the time when the event is actually processed / calculated by an operator
Event time is the most important of the three: once an event has occurred, its event time never changes, so it reflects what really happened no matter when the data reaches Flink or is processed.
Why is event time so important?
For example, suppose you order takeout in an underground garage at 11:59. Because there is no signal in the garage, the app keeps retrying the submission, and the order only goes through at 12:05, after you have left the garage. If we now want to total the orders placed before 12:00, should this order be counted? Of course it should, because the data was really generated at 11:59, and that is its event time.
Event time reflects when the event happened. Because data can be delayed by the network and other factors, we need a mechanism that handles out-of-order and late data to a certain extent. That mechanism is the Watermark, which we look at next.
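The takeout example can be sketched in a few lines of plain Java. This is only an illustration: the `TakeoutOrder` class and its method names are hypothetical and are not part of Flink.

```java
import java.time.LocalTime;

/** Hypothetical order record: when it was placed vs. when it reached the system. */
final class TakeoutOrder {
    final LocalTime eventTime;      // time the user placed the order
    final LocalTime processingTime; // time the system actually received it

    TakeoutOrder(LocalTime eventTime, LocalTime processingTime) {
        this.eventTime = eventTime;
        this.processingTime = processingTime;
    }

    /** True if the order counts as "before the cutoff" when judged by event time. */
    boolean countsByEventTime(LocalTime cutoff) {
        return eventTime.isBefore(cutoff);
    }

    /** The same check when judged by processing time. */
    boolean countsByProcessingTime(LocalTime cutoff) {
        return processingTime.isBefore(cutoff);
    }

    public static void main(String[] args) {
        TakeoutOrder order = new TakeoutOrder(LocalTime.of(11, 59), LocalTime.of(12, 5));
        LocalTime noon = LocalTime.NOON;
        System.out.println("by event time:      " + order.countsByEventTime(noon));      // true
        System.out.println("by processing time: " + order.countsByProcessingTime(noon)); // false
    }
}
```

Judged by event time the order belongs to the morning total; judged by processing time it would be wrongly pushed into the afternoon.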
03 Watermark mechanism
3.1 Watermark definition
Watermark: conceptually, an extra time column added to the data. In other words, a Watermark is a timestamp that tells Flink how far event time has progressed.
Watermark formula (because it is based on the maximum event time seen so far, the Watermark can only rise and never falls):
Watermark = maximum event time seen so far - maximum allowed delay (out-of-orderness)
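A minimal stand-alone sketch of the formula, assuming millisecond timestamps. `WatermarkTracker` is a hypothetical helper written for this article, not a Flink class.

```java
/** Stand-alone sketch of a bounded-out-of-orderness watermark (not Flink's own class). */
final class WatermarkTracker {
    private final long maxOutOfOrdernessMillis;
    private long maxEventTimeMillis = Long.MIN_VALUE;

    WatermarkTracker(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    /** Records one event and returns the watermark after seeing it. */
    long onEvent(long eventTimeMillis) {
        // Only the largest event time seen so far matters,
        // so a late event can never pull the watermark backwards.
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
        return maxEventTimeMillis - maxOutOfOrdernessMillis;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(3_000);
        long[] eventTimes = {10_000, 8_000, 12_000, 11_000};
        for (long t : eventTimes) {
            System.out.println("event=" + t + " watermark=" + tracker.onEvent(t));
        }
    }
}
```

With a 3-second bound, the out-of-order events at 8,000 and 11,000 leave the watermark unchanged at 7,000 and 9,000 respectively.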
3.2 Role of the Watermark
Previously, windows were triggered according to system time. For example, the window [10:00:00 ~ 10:00:10)
fires as soon as the system time reaches 10:00:10, so data that arrives late may be lost.
With a Watermark, the window fires according to the Watermark instead. In other words, the Watermark is what triggers window calculation.
3.3 How the Watermark triggers window calculation
The trigger conditions of window calculation are:
- There is data in the window
- Watermark >= window end time
As mentioned earlier, Watermark = maximum event time seen so far - maximum allowed delay (out-of-orderness). As long as data keeps arriving, the maximum event time keeps growing, so the Watermark keeps rising and never falls, and the window calculation is guaranteed to be triggered eventually.
3.4 Watermark illustrated
Trigger formula:
- Watermark >= window end time
- Watermark = maximum event time seen so far - maximum allowed delay (out-of-orderness)
- Substituting: maximum event time seen so far - maximum allowed delay >= window end time
- Rearranging: maximum event time seen so far >= window end time + maximum allowed delay
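The two equivalent forms of the trigger condition can be checked with a small sketch. The class `TriggerThreshold` and the sample values (a window ending at 10:10:00 and a 5-minute bound) are assumptions chosen for illustration; `LocalTime` wraps at midnight, so the sketch only makes sense within one day.

```java
import java.time.Duration;
import java.time.LocalTime;

final class TriggerThreshold {
    /** Rearranged form: the smallest maximum event time that fires the window. */
    static LocalTime minEventTimeToFire(LocalTime windowEnd, Duration maxDelay) {
        return windowEnd.plus(maxDelay);
    }

    /** Direct form: (maxEventTime - maxDelay) >= windowEnd. */
    static boolean fires(LocalTime maxEventTime, LocalTime windowEnd, Duration maxDelay) {
        LocalTime watermark = maxEventTime.minus(maxDelay);
        return !watermark.isBefore(windowEnd);
    }

    public static void main(String[] args) {
        LocalTime windowEnd = LocalTime.of(10, 10);
        Duration maxDelay = Duration.ofMinutes(5);
        System.out.println("threshold: " + minEventTimeToFire(windowEnd, maxDelay));
        System.out.println("10:14:59 fires? " + fires(LocalTime.of(10, 14, 59), windowEnd, maxDelay));
        System.out.println("10:15:00 fires? " + fires(LocalTime.of(10, 15, 0), windowEnd, maxDelay));
    }
}
```

Both forms agree: the window only fires once an event with event time 10:15:00 or later has been seen.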
As shown in the figure above, the window is [10:00:00 ~ 10:10:00), and the records arrive at the window in the order C, B, D, A.
Case 1: without the Watermark mechanism, B arrives late (at least 2 minutes late) and is lost.
Case 2: with the Watermark mechanism and a maximum allowed delay (out-of-orderness) of 5 minutes:
- When C arrives, Watermark = max(10:11:00) - 5 min = 10:06:00 < window end time 10:10:00, so the window is not triggered
- When B arrives, Watermark = max(10:11:00, 10:09:00) - 5 min = 10:06:00 < window end time 10:10:00, so the window is not triggered
- When D arrives, Watermark = max(10:11:00, 10:09:00, 10:15:00) - 5 min = 10:10:00 = window end time 10:10:00, so the trigger condition is met, the window is calculated, and B is not lost
Note: the Watermark mechanism handles out-of-order and late data only to a certain extent; it cannot handle more severe lateness. If A arrives after its window has already been calculated, A is still lost. To keep A, you can set a larger maximum allowed delay (out-of-orderness), or use the allowedLateness and side output mechanism covered later.
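The walkthrough above can be replayed with a small plain-Java simulation. This is a sketch, not Flink code: the class `WatermarkWalkthrough` is hypothetical, and the event time of A (10:08:00) is an assumed value, since the text only says that A belongs to the window and arrives last.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

final class WatermarkWalkthrough {
    static final LocalTime WINDOW_START = LocalTime.of(10, 0);
    static final LocalTime WINDOW_END = LocalTime.of(10, 10); // exclusive
    static final Duration MAX_DELAY = Duration.ofMinutes(5);

    /** Replays events in arrival order and returns one log line per event. */
    static List<String> replay(String[] names, LocalTime[] eventTimes) {
        List<String> log = new ArrayList<>();
        List<String> buffered = new ArrayList<>(); // events waiting in the window
        LocalTime maxEventTime = LocalTime.MIN;
        boolean fired = false;

        for (int i = 0; i < names.length; i++) {
            LocalTime t = eventTimes[i];
            if (t.isAfter(maxEventTime)) {
                maxEventTime = t;
            }
            LocalTime watermark = maxEventTime.minus(MAX_DELAY);

            boolean inWindow = !t.isBefore(WINDOW_START) && t.isBefore(WINDOW_END);
            if (inWindow && fired) {
                log.add(names[i] + " dropped: window already fired");
                continue;
            }
            if (inWindow) {
                buffered.add(names[i]);
            }
            // Fire once: the window holds data and Watermark >= window end time.
            if (!fired && !buffered.isEmpty() && !watermark.isBefore(WINDOW_END)) {
                fired = true;
                log.add(names[i] + " watermark=" + watermark + " fires window with " + buffered);
            } else {
                log.add(names[i] + " watermark=" + watermark + " no trigger");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        String[] names = {"C", "B", "D", "A"};
        LocalTime[] times = {
            LocalTime.of(10, 11), LocalTime.of(10, 9),
            LocalTime.of(10, 15), LocalTime.of(10, 8) // A's time is assumed
        };
        replay(names, times).forEach(System.out::println);
    }
}
```

B is kept because the window waits until D pushes the Watermark to 10:10, while A arrives after the window has fired and is dropped, which is the case the note above describes.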
04 Case demonstration
4.1 Watermark case demonstration
Requirement: order data arrives in the format (order ID, user ID, order amount, timestamp / event time). Every 5 seconds, calculate the total order amount of each user over the last 5 seconds, and add a Watermark to handle delayed and out-of-order data to a certain extent.
Core API: DataStream.assignTimestampsAndWatermarks(...)
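"Every 5 seconds over the last 5 seconds" is a tumbling window: each event falls into exactly one fixed 5-second bucket. A minimal sketch of that bucketing and the per-user sum, in plain Java without Flink, might look like the following. The class `TumblingWindows` and the `{userId, money, eventTime}` array layout are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

final class TumblingWindows {
    /** Start (inclusive) of the tumbling window of the given size that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Sums money per (user, window); each order is {userId, money, eventTimeMillis}. */
    static Map<String, Long> totals(long[][] orders, long sizeMillis) {
        Map<String, Long> totals = new TreeMap<>();
        for (long[] o : orders) {
            long start = windowStart(o[2], sizeMillis);
            String key = "user=" + o[0] + " window=[" + start + "," + (start + sizeMillis) + ")";
            totals.merge(key, o[1], Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        long[][] orders = {
            {1, 20, 1_000}, {1, 30, 4_999}, {1, 5, 5_000}, {2, 7, 1_200}
        };
        totals(orders, 5_000).forEach((k, v) -> System.out.println(k + " total=" + v));
    }
}
```

The order with event time 5,000 lands in the next window because window intervals are closed on the left and open on the right.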
Periodic generation | Punctuated generation (from special records) |
---|---|
Time driven (wall clock) | Data driven |
The generation method is called at regular intervals | The generation method is called every time a timestamp is assigned |
Implement AssignerWithPeriodicWatermarks | Implement AssignerWithPunctuatedWatermarks |
Note: in practice we usually use the BoundedOutOfOrdernessTimestampExtractor provided by Flink directly, or, in Flink 1.11 and later, its replacement WatermarkStrategy.forBoundedOutOfOrderness(...).
Implementation 1 (see https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html):
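The difference between the two generation styles is mainly about when a Watermark is emitted. The sketch below shows this in plain Java. `PeriodicGenerator` and `PunctuatedGenerator` are hypothetical stand-ins written for this article; they only mirror the idea and are not Flink interfaces.

```java
import java.util.OptionalLong;

final class GeneratorStyles {

    /** Periodic style: every event only updates state; a timer decides when to emit. */
    static final class PeriodicGenerator {
        private final long maxDelayMillis;
        private long maxTimestampMillis = Long.MIN_VALUE;

        PeriodicGenerator(long maxDelayMillis) {
            this.maxDelayMillis = maxDelayMillis;
        }

        void onEvent(long timestampMillis) {
            maxTimestampMillis = Math.max(maxTimestampMillis, timestampMillis);
        }

        /** Called at a fixed interval; empty until the first event has been seen. */
        OptionalLong onTimer() {
            return maxTimestampMillis == Long.MIN_VALUE
                    ? OptionalLong.empty()
                    : OptionalLong.of(maxTimestampMillis - maxDelayMillis);
        }
    }

    /** Punctuated style: a watermark is emitted only when a special marker record arrives. */
    static final class PunctuatedGenerator {
        OptionalLong onEvent(long timestampMillis, boolean isMarker) {
            return isMarker ? OptionalLong.of(timestampMillis) : OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        PeriodicGenerator periodic = new PeriodicGenerator(3_000);
        periodic.onEvent(10_000);
        periodic.onEvent(9_000);
        System.out.println("periodic tick:   " + periodic.onTimer());

        PunctuatedGenerator punctuated = new PunctuatedGenerator();
        System.out.println("ordinary record: " + punctuated.onEvent(10_000, false));
        System.out.println("marker record:   " + punctuated.onEvent(11_000, true));
    }
}
```

The periodic style suits steady streams where emitting on a timer is cheap; the punctuated style only makes sense when the data itself carries records that mark progress.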
```java
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

/**
 * @author : YangLinWei
 * @createTime: 2022/3/7 11:07 PM
 * <p>
 * Simulates real-time order data in the format (order ID, user ID, order amount, timestamp / event time).
 * Every 5 seconds, calculates the total order amount of each user over the last 5 seconds
 * (a time-based tumbling window).
 * A Watermark is added to handle delayed and out-of-order data to a certain extent.
 */
public class WatermakerDemo01 {
    public static void main(String[] args) throws Exception {
        // 1. env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Source
        // Simulate real-time order data (delayed and out of order)
        DataStream<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            // volatile because cancel() is called from a different thread than run()
            private volatile boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    // Simulate delay and disorder: shift the event time back by 0 to 4 seconds
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;
                    ctx.collect(new Order(orderId, userId, money, eventTime));
                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });

        // 3. Transformation
        // - Tell Flink to calculate based on event time
        // env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); // event time is the default in newer versions
        // - Tell Flink which field holds the event time, because
        //   Watermark = current maximum event time - maximum allowed delay (out-of-orderness)
        /*DataStream<Order> watermakerDS = orderDS.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Order>(Time.seconds(3)) { // maximum allowed delay
                    @Override
                    public long extractTimestamp(Order element) {
                        return element.eventTime;
                        // Specify which field is the event time; Flink then computes internally:
                        // Watermark = current maximum event time - maximum allowed delay
                    }
                });*/
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((event, timestamp) -> event.getEventTime())
                );

        // At this point the Watermark has been added, so the window can be calculated.
        // Every 5 seconds, total the order amount of each user over the last 5 seconds (tumbling window)
        DataStream<Order> result = watermakerDS
                .keyBy(Order::getUserId)
                //.timeWindow(Time.seconds(5), Time.seconds(5))
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum("money");

        // 4. Sink
        result.print();

        // 5. execute
        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {
        private String orderId;
        private Integer userId;
        private Integer money;
        private Long eventTime;
    }
}
```
Operation results:
Implementation 2 (a hand-written WatermarkGenerator that prints each Watermark, for learning and testing):
```java
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.commons.lang3.time.FastDateFormat;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

/**
 * Check
 *
 * @author : YangLinWei
 * @createTime: 2022/3/7 11:15 PM
 */
public class WatermakerDemo02 {
    public static void main(String[] args) throws Exception {
        FastDateFormat df = FastDateFormat.getInstance("HH:mm:ss");

        // 1. env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Source
        // Simulate real-time order data (delayed and out of order)
        DataStreamSource<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            // volatile because cancel() is called from a different thread than run()
            private volatile boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    // Simulate delay and disorder: shift the event time back by 0 to 4 seconds
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;
                    System.out.println("The data sent is: " + userId + " : " + df.format(eventTime));
                    ctx.collect(new Order(orderId, userId, money, eventTime));
                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });

        // 3. Transformation
        /*DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((event, timestamp) -> event.getEventTime())
                );*/
        // In real development, use the built-in strategy above directly.
        // For learning and testing, the generator can be written by hand:
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        new WatermarkStrategy<Order>() {
                            @Override
                            public WatermarkGenerator<Order> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                                return new WatermarkGenerator<Order>() {
                                    // Last seen user ID and event time, kept only for the log line below
                                    private int userId = 0;
                                    private long eventTime = 0L;
                                    private final long outOfOrdernessMillis = 3000;
                                    // Start value chosen so that the first watermark does not underflow
                                    private long maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;

                                    @Override
                                    public void onEvent(Order event, long eventTimestamp, WatermarkOutput output) {
                                        userId = event.userId;
                                        eventTime = event.eventTime;
                                        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
                                    }

                                    @Override
                                    public void onPeriodicEmit(WatermarkOutput output) {
                                        // Watermark = current maximum event time - maximum allowed delay
                                        // (minus 1 ms, as Flink's built-in bounded-out-of-orderness generator does)
                                        Watermark watermark = new Watermark(maxTimestamp - outOfOrdernessMillis - 1);
                                        System.out.println("key:" + userId
                                                + ",system time:" + df.format(System.currentTimeMillis())
                                                + ",Event time:" + df.format(eventTime)
                                                + ",Watermark time:" + df.format(watermark.getTimestamp()));
                                        output.emitWatermark(watermark);
                                    }
                                };
                            }
                        }.withTimestampAssigner((event, timestamp) -> event.getEventTime())
                );

        // At this point the Watermark has been added, so the window can be calculated.
        // Every 5 seconds, total the order amount of each user over the last 5 seconds (tumbling window)
        /*DataStream<Order> result = watermakerDS
                .keyBy(Order::getUserId)
                //.timeWindow(Time.seconds(5), Time.seconds(5))
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum("money");*/
        // In real development, use the code above for the business calculation.
        // For learning and testing, the code below prints more detail when a window fires:
        // the window bounds and the event times of the records that fell into it.
        DataStream<String> result = watermakerDS
                .keyBy(Order::getUserId)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                // Apply the function to the data in the window
                // WindowFunction<IN, OUT, KEY, W extends Window>
                .apply(new WindowFunction<Order, String, Integer, TimeWindow>() {
                    @Override
                    public void apply(Integer key, TimeWindow window, Iterable<Order> input, Collector<String> out) throws Exception {
                        // Collect the event times of the records that belong to this window
                        List<String> eventTimeList = new ArrayList<>();
                        for (Order order : input) {
                            Long eventTime = order.eventTime;
                            eventTimeList.add(df.format(eventTime));
                        }
                        String outStr = String.format("key:%s,Window start end:[%s~%s),Time of events belonging to this window:%s",
                                key.toString(), df.format(window.getStart()), df.format(window.getEnd()), eventTimeList);
                        out.collect(outStr);
                    }
                });

        // 4. Sink
        result.print();

        // 5. execute
        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {
        private String orderId;
        private Integer userId;
        private Integer money;
        private Long eventTime;
    }
}
```
Operation results:
4.2 Allowed Lateness case demonstration
Requirement: order data arrives in the format (order ID, user ID, order amount, timestamp / event time).
Every 5 seconds, calculate the total order amount of each user over the last 5 seconds, and add a Watermark to handle delayed and out-of-order data to a certain extent. In addition, use OutputTag + allowedLateness so that seriously late data is not lost.
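How a record is treated once allowed lateness is configured can be sketched as a three-way decision based on the current Watermark. The following plain-Java example is a simplified model under stated assumptions: `LatenessClassifier` and its `Outcome` names are hypothetical, and the 1 ms adjustments Flink applies internally to window end times are ignored.

```java
final class LatenessClassifier {
    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, SIDE_OUTPUT }

    /**
     * Classifies a record that belongs to the window ending at windowEndMillis,
     * given the watermark at the moment the record arrives.
     */
    static Outcome classify(long windowEndMillis, long allowedLatenessMillis, long watermarkMillis) {
        if (watermarkMillis < windowEndMillis) {
            return Outcome.ON_TIME;            // window has not fired yet
        }
        if (watermarkMillis < windowEndMillis + allowedLatenessMillis) {
            return Outcome.LATE_BUT_ACCEPTED;  // window fired, but its state is still kept
        }
        return Outcome.SIDE_OUTPUT;            // window state is gone; route to the side output
    }

    public static void main(String[] args) {
        long windowEnd = 10_000;
        long allowedLateness = 5_000;
        for (long watermark : new long[] {9_000, 12_000, 15_000}) {
            System.out.println("watermark=" + watermark + " -> "
                    + classify(windowEnd, allowedLateness, watermark));
        }
    }
}
```

Records in the middle category cause the window to be recalculated with the late record included, while records in the last category are only recoverable through the side output.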
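API: allowedLateness(...) and sideOutputLateData(...) on the windowed stream, plus getSideOutput(...) on the result.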
Example code:
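```java
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

import java.time.Duration;
import java.util.Random;
import java.util.UUID;

/**
 * allowedLateness
 *
 * @author : YangLinWei
 * @createTime: 2022/3/7 11:18 PM
 */
public class WatermakerDemo03 {
    public static void main(String[] args) throws Exception {
        // 1. env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Source
        // Simulate real-time order data (delayed and out of order)
        DataStreamSource<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            // volatile because cancel() is called from a different thread than run()
            private volatile boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    // Simulate delay and disorder: shift the event time back by 0 to 9 seconds,
                    // so some records are later than the 3-second Watermark bound
                    long eventTime = System.currentTimeMillis() - random.nextInt(10) * 1000;
                    ctx.collect(new Order(orderId, userId, money, eventTime));
                    //TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });

        // 3. Transformation
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((event, timestamp) -> event.getEventTime())
                );

        // At this point the Watermark has been added, so the window can be calculated.
        // Every 5 seconds, total the order amount of each user over the last 5 seconds (tumbling window)
        OutputTag<Order> outputTag = new OutputTag<>("Seriouslylate", TypeInformation.of(Order.class));

        SingleOutputStreamOperator<Order> result = watermakerDS
                .keyBy(Order::getUserId)
                //.timeWindow(Time.seconds(5), Time.seconds(5))
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                // Keep window state for 5 more seconds so that late records can still be added
                .allowedLateness(Time.seconds(5))
                // Records that are later than that go to the side output instead of being dropped
                .sideOutputLateData(outputTag)
                .sum("money");

        DataStream<Order> result2 = result.getSideOutput(outputTag);

        // 4. Sink
        result.print("Normal data and non serious data");
        result2.print("Serious late data");

        // 5. execute
        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {
        private String orderId;
        private Integer userId;
        private Integer money;
        private Long eventTime;
    }
}
```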
Operation results:
05 End
This article explained the principles and usage of Time and Watermark in Flink. Thank you for reading!