Flink DataStream

Posted by Corvin on Fri, 08 Oct 2021 11:13:12 +0200

Preparation

First create the execution environment and, optionally, register a JobListener to observe job submission and completion:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.registerJobListener(new JobListener() {
    @Override
    public void onJobSubmitted(@Nullable JobClient jobClient, @Nullable Throwable throwable) {
        Logger.getLogger("test").info("onJobSubmitted");
    }

    @Override
    public void onJobExecuted(@Nullable JobExecutionResult jobExecutionResult, @Nullable Throwable throwable) {
        Logger.getLogger("test").info("onJobExecuted");
    }
});
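
The examples below all operate on a simple POJO named TestObj with an integer key and a String value. Its definition is not shown here; a minimal sketch consistent with the usage would be:

public class TestObj {
    //Assumed shape: an int key plus a comma-separated String value
    private int key;
    private String value;

    public TestObj(int key, String value) {
        this.key = key;
        this.value = value;
    }

    public int getKey() { return key; }
    public String getValue() { return value; }
}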

1. Operations

1)map(MapFunction<T, R> mapper)

Takes one element and produces one element; any transformation can be applied in between. In the following example the input stream is of type TestObj and the final output is of type String.

List<TestObj> testObjs=new ArrayList<>();
testObjs.add(new TestObj(1,"Apple,Pear"));
testObjs.add(new TestObj(2,"grapefruit,a mandarin orange"));
testObjs.add(new TestObj(3,"cat,tiger"));
testObjs.add(new TestObj(4,"dog,wolf"));
DataStream<TestObj> data=env.fromCollection(testObjs);
data.map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getValue();
    }
}).print();
try {
    env.execute();
} catch (Exception e) {
    e.printStackTrace();
}

result

7> Apple,Pear
8> grapefruit,a mandarin orange
1> cat,tiger
2> dog,wolf

The 7, 8, 1 and 2 in front are the subtask indices; they are followed by the output, which here is the value field of each TestObj.

MapFunction<T, O>

A subinterface of Function. The generic type T is the input type and O is the output type. It contains the single method O map(T value) throws Exception, which takes a T and returns an O; what happens in between is up to you.
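
Since MapFunction has a single abstract method, the same map can also be written as a lambda. A minimal sketch; if Flink cannot infer the result type, append .returns(Types.STRING):

//Lambda form of the map above
data.map(testObj -> testObj.getValue()).print();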

2)flatMap(FlatMapFunction<T, R> flatMapper)

Takes one element and produces zero or more elements. The following example splits the value field of each TestObj on commas and emits each part into the Collector, so twice as many records come out as went in.

data.flatMap(new FlatMapFunction<TestObj, String>() {
    @Override
    public void flatMap(TestObj testObj, Collector<String> collector) throws Exception {
        String[] ss=testObj.getValue().split(",");
        for (String s:ss){
            collector.collect(testObj.getKey()+":"+s);
        }
    }
}).print();

result:

4> 1:Apple
6> 3:cat
7> 4:dog
6> 3:tiger
5> 2:grapefruit
5> 2:a mandarin orange
4> 1:Pear
7> 4:wolf

The number before the colon is the key of the TestObj. Two records with the same key come from the same input record; as you can see, both outputs derived from one input record are still produced by the same subtask.

FlatMapFunction<T, O>

A subinterface of Function. The generic type T is the input type and O is the output type. It contains the single method void flatMap(T value, Collector<O> out) throws Exception; the processed results are emitted through the Collector<O>.
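
The lambda form needs a type hint, because the Collector's type parameter is erased at compile time (Types is org.apache.flink.api.common.typeinfo.Types). A minimal sketch:

//Lambda form; declare the result type explicitly because of type erasure
data.flatMap((TestObj testObj, Collector<String> out) -> {
    for (String s : testObj.getValue().split(",")) {
        out.collect(testObj.getKey() + ":" + s);
    }
}).returns(Types.STRING).print();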

3)filter(FilterFunction<T> filter)

Takes one element and decides, according to user-defined logic, whether to keep it. The following example keeps the TestObj records whose key is a multiple of 2; a map is appended for readable output (otherwise the default toString, essentially the object address, would be printed).

data.filter(new FilterFunction<TestObj>() {
    @Override
    public boolean filter(TestObj testObj) throws Exception {
        return testObj.getKey()%2==0;
    }
}).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+":"+testObj.getValue();
    }
}).print();

result:

1> 2:grapefruit,a mandarin orange
3> 4:dog,wolf

Only the records with keys 2 and 4 are output.

FilterFunction<T>

A subinterface of Function. The generic type T is the input type and the output is a boolean. It contains the single method boolean filter(T value) throws Exception: return true to keep the record, false to discard it.
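
As with map, the filter can also be written as a lambda:

//Lambda form of the filter above
data.filter(testObj -> testObj.getKey() % 2 == 0).print();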

4)assignTimestampsAndWatermarks(WatermarkStrategy<T> watermarkStrategy)

Watermarks are generally used to handle out-of-order events.

data.assignTimestampsAndWatermarks(new WatermarkStrategy<TestObj>() {
    @Override
    public WatermarkGenerator<TestObj> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new WatermarkGenerator<TestObj>() {
            @Override
            public void onEvent(TestObj testObj, long l, WatermarkOutput watermarkOutput) {
                Logger.getLogger("test").info("onEvent: "+testObj.getKey());
                //Called for every event; can inspect and remember the timestamp, or emit a watermark
            }

            @Override
            public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
                Logger.getLogger("test").info("onPeriodicEmit: ");
                //Called periodically; may emit a new watermark
            }
        };
    }
}).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+":"+testObj.getValue();
    }
}).print();

output

September 30, 2021 4:17:03 PM com.test.flink.Test onJobSubmitted
Info: onJobSubmitted
September 30, 2021 4:17:03 PM com.test.flink.Test onEvent
Info: onEvent: 1
September 30, 2021 4:17:03 PM com.test.flink.Test onEvent
Info: onEvent: 2
September 30, 2021 4:17:03 PM com.test.flink.Test onEvent
Info: onEvent: 3
September 30, 2021 4:17:03 PM com.test.flink.Test onEvent
Info: onEvent: 4
September 30, 2021 4:17:03 PM com.test.flink.Test onPeriodicEmit
Info: onPeriodicEmit:  
8> 4:dog,wolf
6> 2:grapefruit,a mandarin orange
7> 3:cat,tiger
5> 1:Apple,Pear
September 30, 2021 4:17:03 PM com.test.flink.Test onJobExecuted
Info: onJobExecuted

As the log shows, onEvent is called once for every record, and onPeriodicEmit fires periodically.
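
In practice you rarely implement WatermarkGenerator by hand; Flink ships with built-in strategies. A minimal sketch using the bounded-out-of-orderness strategy (java.time.Duration); deriving the event timestamp from the key is purely an assumption for illustration:

//Built-in strategy: allow events to arrive up to 5 seconds out of order.
//Using the key as a stand-in event timestamp is illustrative only.
data.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<TestObj>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((testObj, recordTs) -> testObj.getKey() * 1000L))
    .print();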

5)process(ProcessFunction<T, R> processFunction)

Compared with FlatMapFunction, ProcessFunction takes an extra Context parameter, through which the timestamp and current watermark can be obtained. This assumes they have been set beforehand; otherwise you get the uninitialized defaults (null / Long.MIN_VALUE).

data.assignTimestampsAndWatermarks(new WatermarkStrategy<TestObj>() {
    @Override
    public WatermarkGenerator<TestObj> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new WatermarkGenerator<TestObj>() {
            @Override
            public void onEvent(TestObj testObj, long l, WatermarkOutput watermarkOutput) {

            }

            @Override
            public void onPeriodicEmit(WatermarkOutput watermarkOutput) {

            }
        };
    }
}).process(new ProcessFunction<TestObj, String>() {
    @Override
    public void processElement(TestObj testObj, Context context, Collector<String> collector) throws Exception {
        long ts = context.timestamp();
        long cpt = context.timerService().currentProcessingTime();
        long cw = context.timerService().currentWatermark();
        collector.collect(testObj.getKey()+":"+ts+"-"+cpt+"-"+cw);
    }
}).print();

output

1> 4:-9223372036854775808-1633654899323--9223372036854775808
6> 1:-9223372036854775808-1633654899323--9223372036854775808
7> 2:-9223372036854775808-1633654899324--9223372036854775808
8> 3:-9223372036854775808-1633654899323--9223372036854775808

Here -9223372036854775808 is Long.MIN_VALUE: the WatermarkGenerator above never assigns timestamps or emits watermarks, so timestamp() and currentWatermark() still hold their uninitialized defaults, while currentProcessingTime() returns the actual wall-clock time.

2. Partitioning

1)keyBy(KeySelector<T, K> key)

data.keyBy(v->v.getKey()).print();

data.keyBy(TestObj::getKey).print();

data.keyBy(new KeySelector<TestObj, Integer>() {
    @Override
    public Integer getKey(TestObj testObj) throws Exception {
        return testObj.getKey();
    }
}).print();

All three forms are equivalent. Skimming the source code, keyBy appears to partition the stream by the selected key.

DataStream.class

public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key) {
    Preconditions.checkNotNull(key);
    return new KeyedStream(this, (KeySelector)this.clean(key));
}

protected <F> F clean(F f) {
    return this.getExecutionEnvironment().clean(f);
}

KeyedStream.class

public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector) {
    this(dataStream, keySelector, TypeExtractor.getKeySelectorTypes(keySelector, dataStream.getType()));
}

public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
    this(dataStream, new PartitionTransformation(dataStream.getTransformation(), new KeyGroupStreamPartitioner(keySelector, 128)), keySelector, keyType);
}

@Internal
KeyedStream(DataStream<T> stream, PartitionTransformation<T> partitionTransformation, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
    super(stream.getExecutionEnvironment(), partitionTransformation);
    this.keySelector = (KeySelector)this.clean(keySelector);
    this.keyType = this.validateKeyType(keyType);
}

Official description of PartitionTransformation.class

This transformation represents a change of partitioning of the input elements.

This does not create a physical operation, it only affects how upstream operations are connected to downstream operations.

In other words, keyBy itself does not perform any computation. If you remove the subsequent print() and keep only the keyBy, the job cannot be executed and reports: java.lang.IllegalStateException: No operators defined in streaming topology. Cannot execute. Operators such as map, by contrast, can stand on their own.
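
For example, this minimal job defines no operator and fails at execute():

data.keyBy(TestObj::getKey); //partitioning only, no operator defined
env.execute(); //throws: No operators defined in streaming topology. Cannot execute.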

Modify the data a little

List<TestObj> testObjs=new ArrayList<>();
testObjs.add(new TestObj(1,"Apple,Pear"));
testObjs.add(new TestObj(1,"grapefruit,a mandarin orange"));
testObjs.add(new TestObj(3,"cat,tiger"));
testObjs.add(new TestObj(3,"dog,wolf"));
DataStream<TestObj> data=env.fromCollection(testObjs);
data.keyBy(new KeySelector<TestObj, Integer>() {
    @Override
    public Integer getKey(TestObj testObj) throws Exception {
        return testObj.getKey();
    }
}).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+":"+testObj.getValue();
    }
}).print();

output

8> 3:cat,tiger
6> 1:Apple,Pear
8> 3:dog,wolf
6> 1:grapefruit,a mandarin orange

As you can see, after partitioning, records with the same key are processed by the same subtask.

keyBy returns a KeyedStream, a subclass of DataStream, so all DataStream methods are available except the partitioning methods.
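
KeyedStream also adds per-key operations of its own, such as reduce. A minimal sketch that merges the values of records sharing a key (assuming the TestObj constructor shown earlier):

//Per-key reduce: concatenate the values of records with the same key
data.keyBy(TestObj::getKey)
    .reduce((a, b) -> new TestObj(a.getKey(), a.getValue() + "," + b.getValue()))
    .print();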

2)forward()

Keeps the upstream partitioning as-is: each record stays on the same subtask index downstream. In the following example, the stream is first partitioned with keyBy, an operation is applied, and another operation is applied after forward().

data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).forward().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+":"+testObj.getValue();
    }
}).print();

output

8> 3:cat,tiger
8> 3:dog,wolf
6> 1:Apple,Pear
6> 1:grapefruit,a mandarin orange

The records still arrive in the same per-key pairs, on the same subtasks as before.

3)rebalance()

Distributes the upstream records to the downstream partitions in a round-robin fashion.

data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).rebalance().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+":"+testObj.getValue();
    }
}).print();

output

1> 3:dog,wolf
7> 1:grapefruit,a mandarin orange
8> 3:cat,tiger
6> 1:Apple,Pear

After rebalancing, the records are spread across different subtasks.

4)shuffle()

Randomly distributes the upstream records to the downstream partitions.

data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).shuffle().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+":"+testObj.getValue();
    }
}).print();

output

7> 3:Koalas
8> 1:Apple,Pear
8> 1:Grape
8> 3:cat,tiger
8> 3:sheep,cattle
5> 1:grapefruit,a mandarin orange
5> 3:dog,wolf

Some extra records were added to make the random distribution visible; the keys are still 1 and 3.

5)rescale()

Each upstream partition distributes its records to the downstream partitions in a round-robin fashion.

data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
         return testObj;
    }
}).rescale().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
         return testObj.getKey()+":"+testObj.getValue();
    }
}).print().setParallelism(4);

output

1> 1:grapefruit,a mandarin orange
2> 1:Grape
4> 1:Apple,Pear
1> 3:cat,tiger
2> 3:dog,wolf
3> 3:sheep,cattle
4> 3:Koalas

After keyBy there are two partitions, holding 3 and 4 records respectively; the print parallelism is then set to 4. After repartitioning with rescale, each of the two original partitions cycles its records across the new partitions.

6)global()

All upstream data is allocated to the first downstream partition

data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
         return testObj;
    }
}).global().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
         return testObj.getKey()+":"+testObj.getValue();
    }
}).print();

output

1> 1:Apple,Pear
1> 1:grapefruit,a mandarin orange
1> 1:Grape
1> 3:cat,tiger
1> 3:dog,wolf
1> 3:sheep,cattle
1> 3:Koalas

7)broadcast()

Sends every upstream record to every downstream partition.

data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
         return testObj;
    }
}).broadcast().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
         return testObj.getKey()+":"+testObj.getValue();
    }
}).print().setParallelism(3);

output

3> 3:cat,tiger
3> 3:Koalas
3> 1:Grape
3> 3:cat,tiger
3> 3:Koalas
3> 1:Grape
3> 3:sheep,cattle
3> 1:grapefruit,a mandarin orange
3> 3:sheep,cattle
3> 1:grapefruit,a mandarin orange
3> 3:cat,tiger
3> 3:Koalas
3> 1:Grape
3> 3:sheep,cattle
3> 1:grapefruit,a mandarin orange
1> 3:dog,wolf
3> 3:dog,wolf
3> 1:Apple,Pear
2> 3:sheep,cattle
2> 1:grapefruit,a mandarin orange
2> 3:sheep,cattle
2> 1:grapefruit,a mandarin orange
2> 3:dog,wolf
2> 1:Apple,Pear
2> 3:dog,wolf
2> 1:Apple,Pear
2> 3:sheep,cattle
2> 1:grapefruit,a mandarin orange
2> 3:dog,wolf
2> 1:Apple,Pear
2> 3:cat,tiger
2> 3:Koalas
2> 1:Grape
3> 3:cat,tiger
3> 3:Koalas
3> 1:Grape
1> 1:Apple,Pear
1> 3:dog,wolf
1> 1:Apple,Pear
1> 3:cat,tiger
1> 3:Koalas
1> 1:Grape
1> 3:cat,tiger
1> 3:Koalas
1> 1:Grape
1> 3:dog,wolf
1> 1:Apple,Pear
1> 3:cat,tiger
1> 3:Koalas
1> 1:Grape
2> 3:sheep,cattle
2> 1:grapefruit,a mandarin orange
1> 3:sheep,cattle
1> 1:grapefruit,a mandarin orange
1> 3:dog,wolf
1> 1:Apple,Pear

8)partitionCustom(Partitioner<K> partitioner, KeySelector<T, K> keySelector)

Partitions the data with a user-defined Partitioner. The following example partitions records by key modulo 2.

List<TestObj> testObjs=new ArrayList<>();
testObjs.add(new TestObj(1,"Apple,Pear"));
testObjs.add(new TestObj(2,"grapefruit,a mandarin orange"));
testObjs.add(new TestObj(3,"cat,tiger"));
testObjs.add(new TestObj(4,"dog,wolf"));
testObjs.add(new TestObj(5,"sheep,cattle"));
testObjs.add(new TestObj(6,"Grape"));
testObjs.add(new TestObj(7,"Koalas"));
DataStream<TestObj> data=env.fromCollection(testObjs);
data.partitionCustom(new Partitioner<Integer>() {
    @Override
    public int partition(Integer integer, int i) {
        return integer%2;
    }
},TestObj::getKey).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+":"+testObj.getValue();
    }
}).print();

output

1> 2:grapefruit,a mandarin orange
1> 4:dog,wolf
1> 6:Grape
2> 1:Apple,Pear
2> 3:cat,tiger
2> 5:sheep,cattle
2> 7:Koalas

In the end, the odd keys land in one partition and the even keys in another.

3. Output

1)print()

Used many times above; prints the stream to the console.

2)writeToSocket(String hostName, int port, SerializationSchema<T> schema)

Writes the stream to a socket at the given address.

hostName - host address

port - port number

schema - serialization schema
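
A minimal sketch, assuming a socket server is listening on localhost:9999; SimpleStringSchema is Flink's built-in string serializer:

//Serialize each record as a UTF-8 string and send it to the socket
data.map(testObj -> testObj.getKey() + ":" + testObj.getValue())
    .writeToSocket("localhost", 9999, new SimpleStringSchema());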

3)addSink(SinkFunction<T> sinkFunction)

Writes the stream to a custom destination.

data.map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+":"+testObj.getValue();
    }
}).addSink(new SinkFunction<String>() {
    @Override
    public void invoke(String value, Context context) throws Exception {
        //The actual output logic goes here,
        //e.g. write to a file: FileUtils.writeFileUtf8(file, value);
    }
});

4. Windows and the remaining topics will be covered in a later post

Topics: Java flink