1 Environment
1.1 getExecutionEnvironment
Creates an execution environment that represents the context in which the current program is executing. If the program is started independently, this method returns a local execution environment; if the program is submitted to a cluster via the command-line client, it returns the execution environment of that cluster. In other words, getExecutionEnvironment determines which environment to return based on how the program is run, which makes it the most commonly used way to create an execution environment.
// Batch execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Streaming data execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
If the parallelism is not set explicitly, the value configured in flink-conf.yaml takes effect; the default is 1.
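Parallelism can also be set explicitly in code, either for the whole job or per operator. A minimal sketch:

// Job-level parallelism overrides the flink-conf.yaml value
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
// A single operator can override it again, e.g.:
// dataStream.map(...).setParallelism(2);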
1.2 Source
- Read data from collection
// Source: get data from a Collection
DataStream<SensorReading> dataStream = env.fromCollection(
        Arrays.asList(
                new SensorReading("sensor_1", 1547718199L, 35.8),
                new SensorReading("sensor_6", 1547718201L, 15.4),
                new SensorReading("sensor_7", 1547718202L, 6.7),
                new SensorReading("sensor_10", 1547718205L, 38.1)
        )
);
- Read data from file
// Read data from a file
DataStream<String> dataStream = env.readTextFile("/src/main/resources/sensor.txt");
- Read data from Kafka
1 Add the pom dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>Flink_Tutorial</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <flink.version>1.12.1</flink.version>
        <scala.binary.version>2.12</scala.binary.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- kafka -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>
</project>
2 Start ZooKeeper
$ bin/zookeeper-server-start.sh config/zookeeper.properties
3 Start the Kafka server
$ bin/kafka-server-start.sh config/server.properties
4 Start a Kafka console producer
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sensor
5 Write the Java code
public class SourceTest3_Kafka {

    public static void main(String[] args) throws Exception {
        // Create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set parallelism to 1
        env.setParallelism(1);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        // The following parameters are optional
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        // Add Kafka as an external data source
        DataStream<String> dataStream = env.addSource(
                new FlinkKafkaConsumer<String>("sensor", new SimpleStringSchema(), properties));

        // Print the output
        dataStream.print();

        env.execute();
    }
}
6 Run the Java code and enter records in the Kafka producer console
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sensor
>sensor_1,1547718199,35.8
>sensor_6,1547718201,15.4
>
- Custom Source
Pass a custom SourceFunction to the addSource method:
DataStream<SensorReading> dataStream = env.addSource(new MySensorSource());
// Implement a custom SourceFunction
public static class MySensorSource implements SourceFunction<SensorReading> {

    // Flag bit that controls data generation
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<SensorReading> ctx) throws Exception {
        // Define a random number generator
        Random random = new Random();

        // Set the initial temperature of 10 sensors
        HashMap<String, Double> sensorTempMap = new HashMap<>();
        for (int i = 0; i < 10; ++i) {
            sensorTempMap.put("sensor_" + (i + 1), 60 + random.nextGaussian() * 20);
        }

        while (running) {
            for (String sensorId : sensorTempMap.keySet()) {
                // Random fluctuation around the current temperature
                Double newTemp = sensorTempMap.get(sensorId) + random.nextGaussian();
                sensorTempMap.put(sensorId, newTemp);
                ctx.collect(new SensorReading(sensorId, System.currentTimeMillis(), newTemp));
            }
            // Control the output rate
            Thread.sleep(2000L);
        }
    }

    @Override
    public void cancel() {
        this.running = false;
    }
}
1.3 Transform
- Basic conversion operator (map/flatMap/filter)
// 1. map: String => string length (Integer)
DataStream<Integer> mapStream = dataStream.map(new MapFunction<String, Integer>() {
    @Override
    public Integer map(String value) throws Exception {
        return value.length();
    }
});

// 2. flatMap: split strings by comma
DataStream<String> flatMapStream = dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        String[] fields = value.split(",");
        for (String field : fields) {
            out.collect(field);
        }
    }
});

// 3. filter: keep only the data starting with "sensor_1"
DataStream<String> filterStream = dataStream.filter(new FilterFunction<String>() {
    @Override
    public boolean filter(String value) throws Exception {
        return value.startsWith("sensor_1");
    }
});
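Because MapFunction, FlatMapFunction, and FilterFunction are functional interfaces, the same three operators can also be written with lambdas. A sketch (note that the flatMap lambda needs an explicit returns(...) type hint, since the generic Collector type is erased at compile time):

// Assumes imports of org.apache.flink.api.common.typeinfo.Types and org.apache.flink.util.Collector
DataStream<Integer> mapStream = dataStream.map(value -> value.length());

DataStream<String> flatMapStream = dataStream
        .flatMap((String value, Collector<String> out) -> {
            for (String field : value.split(",")) {
                out.collect(field);
            }
        })
        .returns(Types.STRING); // type hint required because the lambda's generic type is erased

DataStream<String> filterStream = dataStream.filter(value -> value.startsWith("sensor_1"));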
- Aggregation operator
DataStream itself has no aggregation methods such as reduce or sum, because in Flink's design all data must be grouped before it is aggregated.
Call keyBy first to obtain a KeyedStream, then call reduce, sum, and so on (group first, then aggregate).
Common aggregation operators include:
- keyBy
- Rolling Aggregation
- reduce
KeyBy
1. KeyBy repartitions the stream;
2. Different keys may end up in the same partition, because partitioning is based on the hash of the key;
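A minimal keyBy sketch using the SensorReading stream from above; when parallelism is greater than 1, the subtask prefix that print() adds to each output line shows which partition a given key was hashed to:

// Key the stream by sensor id; all records with the same id go to the same partition
KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);
keyedStream.print("keyed"); // output lines are prefixed with the index of the emitting subtask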
Rolling Aggregation
These operators aggregate each keyed substream (partition) of a KeyedStream.
- sum()
- min()
- max()
- minBy()
- maxBy()
// Group first, then aggregate
// Grouping
KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);

// Rolling aggregation. The difference between max and maxBy: maxBy also updates the fields other
// than the compared one, taking them from the record holding the maximum, while max only updates
// the compared field and leaves the other fields unchanged
DataStream<SensorReading> resultStream = keyedStream.maxBy("temperature");
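To make the difference concrete, a sketch contrasting the two calls on the same keyedStream:

// max: only the compared field is rolled up; the other fields keep the values of the first record per key
DataStream<SensorReading> maxStream = keyedStream.max("temperature");
// maxBy: the whole record that holds the current maximum is emitted, timestamp included
DataStream<SensorReading> maxByStream = keyedStream.maxBy("temperature");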
reduce
reduce applies to more general aggregation scenarios. In Java, you implement the ReduceFunction functional interface.
For example, extend the rolling-aggregation requirement: within each group, obtain the sensor reading with the highest historical temperature, but keep its timestamp updated to the latest record.
// Group first, then aggregate
// Grouping
KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);

// reduce: a user-defined reduce function; besides keeping the max temperature per sensor,
// the timestamp is updated to that of the newest record
DataStream<SensorReading> resultStream = keyedStream.reduce(
        (curSensor, newSensor) -> new SensorReading(
                curSensor.getId(),
                newSensor.getTimestamp(),
                Math.max(curSensor.getTemperature(), newSensor.getTemperature()))
);
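Equivalently, the lambda can be spelled out as an explicit ReduceFunction implementation (a sketch; assumes an import of org.apache.flink.api.common.functions.ReduceFunction):

DataStream<SensorReading> resultStream = keyedStream.reduce(new ReduceFunction<SensorReading>() {
    @Override
    public SensorReading reduce(SensorReading curSensor, SensorReading newSensor) throws Exception {
        // Keep the highest temperature seen so far, but always carry the newest timestamp
        return new SensorReading(curSensor.getId(), newSensor.getTimestamp(),
                Math.max(curSensor.getTemperature(), newSensor.getTemperature()));
    }
});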
- Multi-stream conversion operators
Multi-stream conversion operators generally include:
- Split and Select (removed in newer versions)
- Connect and CoMap
- Union
Split and Select: splitting a stream
Note: the split and select APIs no longer exist in newer versions of Flink (at least not in Flink 1.12.1!); side outputs via getSideOutput are required instead.
Split & select: split one DataStream into multiple DataStreams.
This functionality is implemented with getSideOutput.
First, define the output tags that classify the side outputs:
// Define the classification tags for getSideOutput
private static final org.apache.flink.util.OutputTag<SensorReading> high =
        new org.apache.flink.util.OutputTag<SensorReading>("high") {
};
private static final org.apache.flink.util.OutputTag<SensorReading> low =
        new org.apache.flink.util.OutputTag<SensorReading>("low") {
};
// Split the stream
SingleOutputStreamOperator<SensorReading> SplitSensorReading = DataSensorReadingmap
        .process(new ProcessFunction<SensorReading, SensorReading>() {
            @Override
            public void processElement(SensorReading sensorReading, Context context,
                                       Collector<SensorReading> collector) throws Exception {
                // Classify by temperature
                if (sensorReading.getTemperature() > 30) {
                    context.output(high, sensorReading);
                } else if (sensorReading.getTemperature() <= 30) {
                    context.output(low, sensorReading);
                } else {
                    collector.collect(sensorReading);
                }
            }
        });

SplitSensorReading.getSideOutput(high).print("high");
SplitSensorReading.getSideOutput(low).print("low");
Connect and CoMap: connecting streams
Connect:
DataStream, DataStream -> ConnectedStreams:
Connects two data streams while keeping their respective types. After connecting, the two streams are merely placed in one ConnectedStreams; their internal data and form remain unchanged, and the two streams stay independent of each other.
CoMap:
ConnectedStreams -> DataStream:
Acts on a ConnectedStreams; the function is the same as map and flatMap, except that the map or flatMap operation is applied to each stream in the ConnectedStreams separately.
// connect: convert the high-temperature stream into a tuple type, then connect and merge it with
// the low-temperature stream and output warning information
SingleOutputStreamOperator<Tuple2<String, Double>> HighTemperatureWarning =
        SplitSensorReading.getSideOutput(high).map(new MapFunction<SensorReading, Tuple2<String, Double>>() {
            @Override
            public Tuple2<String, Double> map(SensorReading sensorReading) throws Exception {
                return new Tuple2<>(sensorReading.getId(), sensorReading.getTemperature());
            }
        });

// Connect the tuple-typed high-temperature stream with the SensorReading-typed low-temperature stream
ConnectedStreams<Tuple2<String, Double>, SensorReading> sensorReadingConnectedStreams =
        HighTemperatureWarning.connect(SplitSensorReading.getSideOutput(low));

// CoMap applies a map operation to each of the two streams
SingleOutputStreamOperator<Object> ResultStream =
        sensorReadingConnectedStreams.map(new CoMapFunction<Tuple2<String, Double>, SensorReading, Object>() {
            @Override
            public Object map1(Tuple2<String, Double> stringDoubleTuple2) throws Exception {
                return new Tuple3<>(stringDoubleTuple2.f0, stringDoubleTuple2.f1, "high temp warning");
            }

            @Override
            public Object map2(SensorReading sensorReading) throws Exception {
                return new Tuple2<>(sensorReading.getId(), sensorReading.getTemperature());
            }
        });

ResultStream.print();
Union: merging multiple streams
DataStream -> DataStream:
Unions two or more DataStreams into a new DataStream that contains the elements of all of them.
Differences between Union and Connect:
1. Connect can combine streams of different data types, but it can only connect two streams;
2. Union can merge multiple streams, but all of them must have the same data type.
// union: merge multiple streams
DataStream<SensorReading> UnionSensorReadingDataStream =
        SplitSensorReading.getSideOutput(high).union(SplitSensorReading.getSideOutput(low));

UnionSensorReadingDataStream.print("Union");
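Because union takes varargs, more than two streams of the same element type can be merged in a single call. A sketch (anotherSensorStream is a hypothetical third DataStream&lt;SensorReading&gt;):

// All unioned streams must carry the same element type
DataStream<SensorReading> allReadings = SplitSensorReading.getSideOutput(high)
        .union(SplitSensorReading.getSideOutput(low), anotherSensorStream); // anotherSensorStream: hypothetical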
1.4 Summary
A transformation operator converts one or more DataStreams into a new DataStream.
Through different transformation operations, a DataStream is transformed, filtered, and aggregated into other streams that meet our business requirements.