[Flink] [Chapter 8] ProcessFunction API

Posted by sleepydad on Fri, 21 Jan 2022 09:34:20 +0100

1. Introduction to ProcessFunction

1.1 Description of the API

A function that processes elements of a stream.

For every element in the input stream, processElement(Object, ProcessFunction.Context, Collector) is invoked. This can produce zero or more elements as output.

Implementations can also query the time and set timers through the provided ProcessFunction.Context.

For firing timers, onTimer(long, ProcessFunction.OnTimerContext, Collector) will be invoked. This can again produce zero or more elements as output and register further timers.

NOTE: Access to keyed state and timers (which are also scoped to a key) is only available if the ProcessFunction is applied on a KeyedStream.

NOTE: A ProcessFunction is always a org.apache.flink.api.common.functions.RichFunction. Therefore, access to the org.apache.flink.api.common.functions.RuntimeContext is always available and setup and teardown methods can be implemented. See org.apache.flink.api.common.functions.RichFunction.open(org.apache.flink.configuration.Configuration) and org.apache.flink.api.common.functions.RichFunction.close().

(1) ProcessFunction is a function for processing the elements of a stream.
(2) The processElement() method is called for every element and can emit zero or more output records.
(3) Through ProcessFunction.Context you can query the current time and register or delete timers.
(4) When a timer fires, the onTimer() method runs; it can also emit zero or more records and register further timers.
Notes:
(1) Timers only work on a KeyedStream.
(2) A ProcessFunction is also a RichFunction, so state programming and the lifecycle methods are available as well.

1.2 Type hierarchy

ProcessFunction extends AbstractRichFunction, which implements the RichFunction interface; this is why the RichFunction capabilities described below are always available.

1.3 Application scenarios of ProcessFunction

Since ProcessFunction is an abstract subclass of AbstractRichFunction, it can be used in any scenario where a RichFunction can be used.

Usage scenarios of RichFunction (see the sketch after this list):

  1. Writing to third-party systems (connections opened and closed in the lifecycle methods)
  2. Getting the runtime context for state programming
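
Both scenarios come down to the lifecycle methods (open/close) and the runtime context that RichFunction provides. A minimal sketch, assuming a JDBC connection as the third-party system (the URL is a placeholder) and a simple per-key counter as the state; none of this comes from the original post:

package No08_process;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;

import java.sql.Connection;
import java.sql.DriverManager;

public class _00_RichFunctionScenarios extends RichMapFunction<String, String> {

    private transient Connection conn;          // connection to a third-party system
    private transient ValueState<Long> counter; // keyed state (requires the function to run after keyBy)

    @Override
    public void open(Configuration parameters) throws Exception {
        // Scenario 1: open a connection to an external system once per parallel instance
        conn = DriverManager.getConnection("jdbc:<placeholder-url>"); // placeholder, not a real URL
        // Scenario 2: obtain state handles through the runtime context
        counter = getRuntimeContext().getState(new ValueStateDescriptor<>("seen-count", Long.class));
    }

    @Override
    public String map(String value) throws Exception {
        Long current = counter.value();
        counter.update(current == null ? 1L : current + 1); // per-key element count
        return value;
    }

    @Override
    public void close() throws Exception {
        if (conn != null) {
            conn.close(); // release the external connection on teardown
        }
    }
}

Because the state handle comes from the runtime context, this function has to be applied after keyBy; the connection handling, by contrast, works on any stream.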

Usage scenarios unique to ProcessFunction:

  1. Timers
  2. Side output streams

A ProcessFunction is only for processing data; it cannot repartition the stream or define windows on its own (it is not a replacement for keyBy or window).

Additional note: Flink SQL is implemented using ProcessFunction.

1.4 The eight ProcessFunctions

Flink provides eight process functions, each used with a different kind of stream.
Every one of them is passed as the parameter of the corresponding stream's process() operator, as sketched after the list below.

  • ProcessFunction (any DataStream)
  • KeyedProcessFunction (KeyedStream)
  • CoProcessFunction (ConnectedStreams)
  • ProcessJoinFunction (interval-joined streams)
  • BroadcastProcessFunction (a non-keyed stream connected to a broadcast stream)
  • KeyedBroadcastProcessFunction (a keyed stream connected to a broadcast stream)
  • ProcessWindowFunction (the window function of a WindowedStream, i.e. a window after keyBy)
  • ProcessAllWindowFunction (the window function of an AllWindowedStream, i.e. a window on a non-keyed DataStream)
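
As a quick orientation, the sketch below shows where the first three variants attach (the host and first port are reused from the examples later in this post; the second socket port is made up for the sketch). The remaining five variants plug in analogously via intervalJoin, connect with a broadcast stream, window, and windowAll, and are only listed as comments:

package No08_process;

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class _00_WhereProcessFunctionsAttach {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> left = env.socketTextStream("hadoop102", 9999);
        DataStreamSource<String> right = env.socketTextStream("hadoop102", 8888);

        // ProcessFunction: works on any DataStream
        left.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                out.collect(value);
            }
        }).print("plain");

        // KeyedProcessFunction: only after keyBy
        left.keyBy(new KeySelector<String, String>() {
            @Override
            public String getKey(String s) {
                return s;
            }
        }).process(new KeyedProcessFunction<String, String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                out.collect(ctx.getCurrentKey() + ":" + value);
            }
        }).print("keyed");

        // CoProcessFunction: only after connecting two streams
        left.connect(right).process(new CoProcessFunction<String, String, String>() {
            @Override
            public void processElement1(String value, Context ctx, Collector<String> out) {
                out.collect("left:" + value);
            }
            @Override
            public void processElement2(String value, Context ctx, Collector<String> out) {
                out.collect("right:" + value);
            }
        }).print("connected");

        // The remaining variants attach in the same way:
        //   keyedStream.intervalJoin(otherKeyedStream).between(...).process(ProcessJoinFunction)
        //   stream.connect(broadcastStream).process(BroadcastProcessFunction)
        //   keyedStream.connect(broadcastStream).process(KeyedBroadcastProcessFunction)
        //   keyedStream.window(...).process(ProcessWindowFunction)
        //   stream.windowAll(...).process(ProcessAllWindowFunction)

        env.execute();
    }
}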

2. A demonstration of ProcessFunction's features

package No08_process;

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class _01_ProcessFunctionDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source = env.socketTextStream("hadoop102", 9999);

        // Wire the custom function into the pipeline and run the job
        source.process(new MyProcessFunc()).print();
        env.execute();
    }
    public static class MyProcessFunc extends ProcessFunction<String,String> {
        @Override
        public void open(Configuration parameters) throws Exception {
            // TODO Feature 1: get the runtime context for state programming (open() is inherited from RichFunction)
            RuntimeContext runtimeContext = getRuntimeContext();
            // State handles can then be obtained from it (getState(...) takes a state descriptor)
            //runtimeContext.getState(...);
        }

        @Override
        public void close() throws Exception {
            super.close();
        }

        @Override
        // TODO This method is called for every element in the DataStream. It returns void;
        // whatever should be emitted goes through the Collector.
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            // TODO Feature 3: out emits records to the main output stream
            out.collect(" ");

            // TODO Feature 2: ctx gives the current processing time and registers/deletes processing-time timers
            ctx.timerService().currentProcessingTime();
            ctx.timerService().registerProcessingTimeTimer(1L);
            ctx.timerService().deleteProcessingTimeTimer(1L);
            // TODO Feature 2: ctx also gives the current watermark and registers/deletes event-time timers
            ctx.timerService().currentWatermark();
            ctx.timerService().registerEventTimeTimer(1L);
            ctx.timerService().deleteEventTimeTimer(1L);

            // TODO Feature 4: ctx can also emit records to a side output stream
            //ctx.output(new OutputTag<String>("outPutTag"){}, value);
        }

        @Override
        // TODO Feature 5: onTimer() defines what runs when a timer fires
        // ctx can register further timers; out emits to the main output stream
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            super.onTimer(timestamp, ctx, out);
        }
    }

}

1. Timers

The TimerService object obtained from Context and OnTimerContext has the following methods:

  • long currentProcessingTime(): returns the current processing time
  • long currentWatermark(): returns the timestamp of the current watermark
  • void registerProcessingTimeTimer(long timestamp): registers a processing-time timer for the current key; it fires when the processing time reaches the given timestamp
  • void registerEventTimeTimer(long timestamp): registers an event-time timer for the current key; the timer fires and the callback runs when the watermark is greater than or equal to the registered timestamp
  • void deleteProcessingTimeTimer(long timestamp): deletes a previously registered processing-time timer; if no timer exists for that timestamp, nothing happens
  • void deleteEventTimeTimer(long timestamp): deletes a previously registered event-time timer; if no timer exists for that timestamp, nothing happens

When a timer fires, the onTimer() callback is executed. Note that timers can only be used on keyed streams.

package No08_process;

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class _03_timer {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source = env.socketTextStream("hadoop102", 9999);


        SingleOutputStreamOperator<String> res = source.keyBy(new KeySelector<String, String>() {
            @Override
            public String getKey(String s) throws Exception {
                return s;
            }
        }).process(new MyOnTimerProcessFunc());

        // Timers can only be used on a KeyedStream, hence the keyBy above

        res.print();

        env.execute();
    }


    // TODO Goal: emit something two seconds after each element is processed
    public static class MyOnTimerProcessFunc extends KeyedProcessFunction<String,String,String>{

        @Override
        public void processElement(String value,Context ctx, Collector<String> out) throws Exception {
            out.collect(value);

            // Register a processing-time timer two seconds from now
            ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 2000L);
        }


        // Called when the timer fires
        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            System.out.println("The timer is triggered");
        }
    }


}
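
The example above registers a processing-time timer. For comparison, here is a minimal sketch of the event-time variant (not from the original post). It assumes timestamps and watermarks are assigned upstream; otherwise ctx.timestamp() returns null and the watermark never advances, so the timer would never fire:

package No08_process;

import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class _03_EventTimeTimerSketch extends KeyedProcessFunction<String, String, String> {

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        out.collect(value);

        // Register an event-time timer 2 seconds (event time) after this element's timestamp
        Long ts = ctx.timestamp(); // null if no timestamp was assigned upstream
        if (ts != null) {
            ctx.timerService().registerEventTimeTimer(ts + 2000L);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Fires once the watermark is greater than or equal to the registered timestamp
        out.collect("event-time timer fired for key " + ctx.getCurrentKey() + " at " + timestamp);
    }
}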

2. Side output streams

Most operators of the DataStream API have a single output, i.e. one stream of a single data type. Apart from the split operator, which can only divide a stream into several streams of the same data type, the side-output feature of process functions can produce multiple streams, and those streams may have different data types. A side output is defined by an OutputTag<X> object, where X is the data type of the side output stream. A process function can emit an event to one or more side outputs through its Context object.

package No08_process;

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class _04_SideOutputStream {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> source = env.socketTextStream("hadoop102", 9999);
        KeyedStream<String, String> keyedStream = source.keyBy(new KeySelector<String, String>() {
            @Override
            public String getKey(String s) throws Exception {
                return s;
            }
        });

        //TODO Route readings with a temperature below 30 degrees to a side output stream; readings of 30 degrees or above stay in the main stream
        SingleOutputStreamOperator<String> result = keyedStream.process(new MySplit());

        result.print("high");
        result.getSideOutput(new OutputTag<Tuple2<String,Double>>("<30"){}).print("sideOut");
        env.execute();

    }
    public static class MySplit extends KeyedProcessFunction<String,String,String> {

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            //TODO Extract the temperature from the comma-separated record
            String[] fields = value.split(",");
            double temp = Double.parseDouble(fields[2]);

            if(temp >= 30){
                out.collect(value);
            }else{
                //The side output's data type is not restricted; it is determined by the OutputTag used when emitting
                ctx.output(new OutputTag<Tuple2<String,Double>>("<30"){},new Tuple2<String,Double>(fields[0],temp));
            }
            //TODO Note: the officially recommended way to split a stream is via side outputs, precisely because a side output's data type can differ from the main stream's

        }
    }

}
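
A small refinement worth considering (a sketch, not part of the original code): above, the OutputTag is constructed twice, once when emitting and once when reading the side output, and its id string and type must match in both places. Declaring the tag once as a constant removes that risk; the sketch also uses a plain ProcessFunction to show that side outputs, unlike timers, do not require a keyed stream:

package No08_process;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class _05_SideOutputSharedTag {

    // Declared once; the anonymous subclass preserves the Tuple2 type information despite erasure
    private static final OutputTag<Tuple2<String, Double>> LOW_TEMP =
            new OutputTag<Tuple2<String, Double>>("<30") {};

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source = env.socketTextStream("hadoop102", 9999);

        SingleOutputStreamOperator<String> result = source.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                String[] fields = value.split(",");
                double temp = Double.parseDouble(fields[2]);
                if (temp >= 30) {
                    out.collect(value);                               // main stream
                } else {
                    ctx.output(LOW_TEMP, Tuple2.of(fields[0], temp)); // side output, same tag instance
                }
            }
        });

        result.print("high");
        result.getSideOutput(LOW_TEMP).print("sideOut");              // same tag instance again
        env.execute();
    }
}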

Topics: Big Data flink