Preparation
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.registerJobListener(new JobListener() {
    @Override
    public void onJobSubmitted(@Nullable JobClient jobClient, @Nullable Throwable throwable) {
        Logger.getLogger("test").info("onJobSubmitted");
    }

    @Override
    public void onJobExecuted(@Nullable JobExecutionResult jobExecutionResult, @Nullable Throwable throwable) {
        Logger.getLogger("test").info("onJobExecuted");
    }
});
1. Operation
1)map(MapFunction<T, R> mapper)
Takes one element and emits exactly one element; the function may apply any transformation in between. In the following example the input stream has type TestObj and the output has type String.
List<TestObj> testObjs = new ArrayList<>();
testObjs.add(new TestObj(1, "Apple,Pear"));
testObjs.add(new TestObj(2, "grapefruit,a mandarin orange"));
testObjs.add(new TestObj(3, "cat,tiger"));
testObjs.add(new TestObj(4, "dog,wolf"));
DataStream<TestObj> data = env.fromCollection(testObjs);
data.map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getValue();
    }
}).print();
try {
    env.execute();
} catch (Exception e) {
    e.printStackTrace();
}
result:
7> Apple,Pear
8> grapefruit,a mandarin orange
1> cat,tiger
2> dog,wolf
The numbers 7, 8, 1 and 2 before the > are subtask indexes. What follows is the output, here the value field of each TestObj.
MapFunction<T, O>
A sub-interface of Function. The generic type T is the input type and O is the output type. Its only method is O map(T var1) throws Exception, which takes a value of type T and returns a value of type O; the transformation in between is user-defined.
2)flatMap(FlatMapFunction<T, R> flatMapper)
Takes one element and emits zero, one or more elements. The following example splits the value of each TestObj on commas and passes every piece to the collector, so the output has twice as many elements as the input.
data.flatMap(new FlatMapFunction<TestObj, String>() {
    @Override
    public void flatMap(TestObj testObj, Collector<String> collector) throws Exception {
        String[] ss = testObj.getValue().split(",");
        for (String s : ss) {
            collector.collect(testObj.getKey() + ":" + s);
        }
    }
}).print();
result:
4> 1:Apple
6> 3:cat
7> 4:dog
6> 3:tiger
5> 2:grapefruit
5> 2:a mandarin orange
4> 1:Pear
7> 4:wolf
The number before the colon is the key of the TestObj. Two elements with the same key come from the same input element, and you can see that all pieces of one input element are still emitted by the same subtask.
FlatMapFunction<T, O>
A sub-interface of Function. The generic type T is the input type and O is the output type. Its only method is void flatMap(T var1, Collector<O> var2) throws Exception. Results are emitted by passing them to the Collector<O>.
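To make the one-to-many behavior concrete without a Flink runtime, here is a minimal plain-Java sketch of what the flatMap body in the example does to a single element. FlatMapSketch and explode are hypothetical names, and a List stands in for Flink's Collector.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Flink): one input element fans out into several outputs.
class FlatMapSketch {
    // Mirrors the flatMap body: split the value on commas and prefix each piece with the key.
    static List<String> explode(int key, String value) {
        List<String> out = new ArrayList<>(); // stands in for Collector<String>
        for (String s : value.split(",")) {
            out.add(key + ":" + s);
        }
        return out;
    }

    public static void main(String[] args) {
        // One input element produces two output elements.
        System.out.println(explode(1, "Apple,Pear"));
    }
}
```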
3)filter(FilterFunction<T> filter)
Takes one element and decides, according to a user-defined predicate, whether to keep it. The following example keeps only the TestObj elements whose key is a multiple of 2. The map afterwards is only there for readable output; without it, print() would show the default toString() of TestObj (class name and hash code), unless TestObj overrides it.
data.filter(new FilterFunction<TestObj>() {
    @Override
    public boolean filter(TestObj testObj) throws Exception {
        return testObj.getKey() % 2 == 0;
    }
}).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print();
result:
1> 2:grapefruit,a mandarin orange
3> 4:dog,wolf
Only the elements with keys 2 and 4 are output.
FilterFunction<T>
A sub-interface of Function. The generic type T is the input type; the return type is boolean. Its only method is boolean filter(T var1) throws Exception. Returning true keeps the element and returning false drops it.
4)assignTimestampsAndWatermarks(WatermarkStrategy<T> watermarkStrategy)
Watermarks are generally used to handle out-of-order events.
data.assignTimestampsAndWatermarks(new WatermarkStrategy<TestObj>() {
    @Override
    public WatermarkGenerator<TestObj> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new WatermarkGenerator<TestObj>() {
            @Override
            public void onEvent(TestObj testObj, long l, WatermarkOutput watermarkOutput) {
                // Called for every event: inspect and remember the timestamp, or emit a watermark.
                Logger.getLogger("test").info("onEvent: " + testObj.getKey());
            }

            @Override
            public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
                // Called periodically: may emit a new watermark.
                Logger.getLogger("test").info("onPeriodicEmit: ");
            }
        };
    }
}).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print();
output
September 30, 2021 4:17:03 PM com.test.flink.Test onJobSubmitted
Info: onJobSubmitted
September 30, 2021 4:17:03 PM com.test.flink.Test onEvent
Info: onEvent: 1
September 30, 2021 4:17:03 PM com.test.flink.Test onEvent
Info: onEvent: 2
September 30, 2021 4:17:03 PM com.test.flink.Test onEvent
Info: onEvent: 3
September 30, 2021 4:17:03 PM com.test.flink.Test onEvent
Info: onEvent: 4
September 30, 2021 4:17:03 PM com.test.flink.Test onPeriodicEmit
Info: onPeriodicEmit:
8> 4:dog,wolf
6> 2:grapefruit,a mandarin orange
7> 3:cat,tiger
5> 1:Apple,Pear
September 30, 2021 4:17:03 PM com.test.flink.Test onJobExecuted
Info: onJobExecuted
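It can be seen that onEvent is called once for every element and that onPeriodicEmit is called periodically (once in this short run). Note that this generator only logs; it never calls watermarkOutput.emitWatermark(...), so no watermark is actually emitted here.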
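A generator that does emit watermarks usually has to keep some state between onEvent and onPeriodicEmit. The following plain-Java sketch (no Flink classes) shows one common bookkeeping scheme, "largest timestamp seen minus an allowed lateness". The class name and the millisecond values are made up for illustration; in a real WatermarkGenerator the computed value would be wrapped in a Watermark and passed to watermarkOutput.emitWatermark inside onPeriodicEmit.

```java
// Plain-Java sketch (no Flink): bookkeeping for a bounded-out-of-orderness watermark.
class BoundedLatenessSketch {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen;

    BoundedLatenessSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // Start low enough that currentWatermark() is Long.MIN_VALUE before any event, without overflow.
        this.maxTimestampSeen = Long.MIN_VALUE + maxOutOfOrdernessMs + 1;
    }

    // Counterpart of onEvent: remember the largest event timestamp seen so far.
    void onEvent(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
    }

    // Counterpart of onPeriodicEmit: the value a generator would emit as its watermark.
    long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrdernessMs - 1;
    }

    public static void main(String[] args) {
        BoundedLatenessSketch g = new BoundedLatenessSketch(2_000);
        g.onEvent(10_000);
        g.onEvent(9_000);  // an out-of-order event does not move the watermark backwards
        g.onEvent(12_000);
        System.out.println(g.currentWatermark()); // 12000 - 2000 - 1 = 9999
    }
}
```

The design choice here is that the watermark only ever moves forward, because it is derived from a running maximum.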
5) process(ProcessFunction<T, R> processFunction)
ProcessFunction has one more parameter than FlatMapFunction: a Context, from which the element's timestamp and the current watermark can be read. These values are only meaningful if timestamps and watermarks were assigned upstream; otherwise timestamp() may return null.
data.assignTimestampsAndWatermarks(new WatermarkStrategy<TestObj>() {
    @Override
    public WatermarkGenerator<TestObj> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new WatermarkGenerator<TestObj>() {
            @Override
            public void onEvent(TestObj testObj, long l, WatermarkOutput watermarkOutput) {
            }

            @Override
            public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
            }
        };
    }
}).process(new ProcessFunction<TestObj, String>() {
    @Override
    public void processElement(TestObj testObj, Context context, Collector<String> collector) throws Exception {
        long ts = context.timestamp();
        long cpt = context.timerService().currentProcessingTime();
        long cw = context.timerService().currentWatermark();
        collector.collect(testObj.getKey() + ":" + ts + "-" + cpt + "-" + cw);
    }
}).print();
output
1> 4:-9223372036854775808-1633654899323--9223372036854775808
6> 1:-9223372036854775808-1633654899323--9223372036854775808
7> 2:-9223372036854775808-1633654899324--9223372036854775808
8> 3:-9223372036854775808-1633654899323--9223372036854775808
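The timestamp and watermark fields in this output are both -9223372036854775808, which is Long.MIN_VALUE. As far as I can tell, that is simply the initial value: the strategy above assigns no event timestamps and its generator emits no watermark. The middle number is the processing time in epoch milliseconds. The small plain-Java sketch below (hypothetical class name, no Flink) rebuilds one output line to show where the double dash comes from.

```java
// Plain-Java sketch (no Flink): how one output line of the process example is assembled.
class ProcessOutputSketch {
    // Same concatenation as in processElement: key ":" timestamp "-" processingTime "-" watermark.
    static String format(int key, long ts, long cpt, long cw) {
        return key + ":" + ts + "-" + cpt + "-" + cw;
    }

    public static void main(String[] args) {
        // Long.MIN_VALUE is negative, so the "-" separator is followed by a minus sign.
        System.out.println(format(4, Long.MIN_VALUE, 1633654899323L, Long.MIN_VALUE));
        // prints 4:-9223372036854775808-1633654899323--9223372036854775808
    }
}
```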
2. Partitioning
1)keyBy(KeySelector<T, K> key)
data.keyBy(v -> v.getKey()).print();

data.keyBy(TestObj::getKey).print();

data.keyBy(new KeySelector<TestObj, Integer>() {
    @Override
    public Integer getKey(TestObj testObj) throws Exception {
        return testObj.getKey();
    }
}).print();
These three forms are equivalent. From a quick look at the source code, keyBy appears to partition the stream by key.
DataStream.class
public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key) {
    Preconditions.checkNotNull(key);
    return new KeyedStream(this, (KeySelector)this.clean(key));
}

protected <F> F clean(F f) {
    return this.getExecutionEnvironment().clean(f);
}
KeyedStream.class
public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector) {
    this(dataStream, keySelector, TypeExtractor.getKeySelectorTypes(keySelector, dataStream.getType()));
}

public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
    this(dataStream,
        new PartitionTransformation(dataStream.getTransformation(), new KeyGroupStreamPartitioner(keySelector, 128)),
        keySelector,
        keyType);
}

@Internal
KeyedStream(DataStream<T> stream, PartitionTransformation<T> partitionTransformation, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
    super(stream.getExecutionEnvironment(), partitionTransformation);
    this.keySelector = (KeySelector)this.clean(keySelector);
    this.keyType = this.validateKeyType(keyType);
}
Official description of PartitionTransformation.class:
This transformation represents a change of partitioning of the input elements.
This does not create a physical operation, it only affects how upstream operations are connected to downstream operations.
In other words, keyBy itself does not create an operator. So if you remove the subsequent print() and keep only the keyBy, the job cannot run and fails with java.lang.IllegalStateException: No operators defined in streaming topology. Cannot execute. Methods such as map do not have this problem.
Modify the data a little
List<TestObj> testObjs = new ArrayList<>();
testObjs.add(new TestObj(1, "Apple,Pear"));
testObjs.add(new TestObj(1, "grapefruit,a mandarin orange"));
testObjs.add(new TestObj(3, "cat,tiger"));
testObjs.add(new TestObj(3, "dog,wolf"));
DataStream<TestObj> data = env.fromCollection(testObjs);
data.keyBy(new KeySelector<TestObj, Integer>() {
    @Override
    public Integer getKey(TestObj testObj) throws Exception {
        return testObj.getKey();
    }
}).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print();
output
8> 3:cat,tiger
6> 1:Apple,Pear
8> 3:dog,wolf
6> 1:grapefruit,a mandarin orange
It can be seen that after keyBy, elements with the same key are processed by the same subtask.
The keyBy method returns a KeyedStream, a subclass of DataStream. All DataStream methods are available on it except the partitioning methods.
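Why the same key always lands on the same subtask can be sketched in plain Java. As I understand it, Flink first maps the key's hash to a key group and then maps the key group to a subtask; the 128 passed to KeyGroupStreamPartitioner in the source above looks like the number of key groups (the maximum parallelism). In the sketch below the class name is hypothetical and scramble is only a stand-in for Flink's real hash function, so the concrete subtask numbers will not match a real job.

```java
// Plain-Java sketch (no Flink): key -> key group -> subtask, in two deterministic steps.
class KeyGroupSketch {
    // Stand-in for the hash Flink applies to key.hashCode(); NOT the real function.
    static int scramble(int h) {
        h ^= (h >>> 16);
        h *= 0x45d9f3b;
        h ^= (h >>> 16);
        return h;
    }

    // Step 1: every key belongs to one of maxParallelism key groups.
    static int keyGroup(Object key, int maxParallelism) {
        return Math.floorMod(scramble(key.hashCode()), maxParallelism);
    }

    // Step 2: contiguous ranges of key groups are assigned to the parallel subtasks.
    static int subtaskIndex(Object key, int maxParallelism, int parallelism) {
        return keyGroup(key, maxParallelism) * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        for (int key : new int[] {1, 1, 3, 3}) {
            // Equal keys always print the same subtask index.
            System.out.println(key + " -> subtask " + subtaskIndex(key, 128, 8));
        }
    }
}
```

Because both steps are pure functions of the key, no coordination between subtasks is needed to route equal keys to the same place.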
2)forward()
Keeps the upstream partitioning as it is. The following example partitions with keyBy, applies a map, and then applies another map after forward().
data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).forward().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print();
output
8> 3:cat,tiger
8> 3:dog,wolf
6> 1:Apple,Pear
6> 1:grapefruit,a mandarin orange
The elements with the same key still arrive together on the same subtask.
3)rebalance()
Distributes the upstream data to the downstream partitions in round-robin fashion.
data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).rebalance().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print();
output
1> 3:dog,wolf
7> 1:grapefruit,a mandarin orange
8> 3:cat,tiger
6> 1:Apple,Pear
After rebalancing, elements with the same key are spread over different subtasks.
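Round-robin distribution itself is easy to picture in plain Java. The sketch below uses a hypothetical class name and no Flink classes: each sender keeps a counter and hands consecutive records to consecutive downstream channels. The starting channel is a constructor parameter here; how Flink chooses it is not something this sketch claims to reproduce.

```java
// Plain-Java sketch (no Flink): round-robin channel selection for one sender.
class RoundRobinSketch {
    private final int numChannels;
    private int next;

    RoundRobinSketch(int numChannels, int firstChannel) {
        this.numChannels = numChannels;
        this.next = Math.floorMod(firstChannel, numChannels);
    }

    // Returns the channel for the current record and advances to the following channel.
    int selectChannel() {
        int chosen = next;
        next = (next + 1) % numChannels;
        return chosen;
    }

    public static void main(String[] args) {
        RoundRobinSketch sender = new RoundRobinSketch(3, 2);
        for (int i = 0; i < 4; i++) {
            System.out.print(sender.selectChannel() + " "); // 2 0 1 2
        }
        System.out.println();
    }
}
```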
4)shuffle()
Randomly assigns each upstream element to a downstream partition.
data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).shuffle().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print();
output
7> 3: Koala
8> 1: apples, pears
8> 1: Grapes
8> 3: cat, tiger
8> 3: sheep, cattle
5> 1: grapefruit, orange
5> 3: dog, wolf
To make the random effect visible, some data were added. The keys are still 1 and 3.
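For comparison with round-robin, a random assignment can be sketched in plain Java as below. The class name and the seed are hypothetical, and nothing here uses Flink. With only seven records the distribution is typically uneven, which fits the output above where one subtask received four of the seven records.

```java
import java.util.Random;

// Plain-Java sketch (no Flink): each record independently picks a random downstream channel.
class ShuffleSketch {
    private final Random random;
    private final int numChannels;

    ShuffleSketch(int numChannels, long seed) {
        this.numChannels = numChannels;
        this.random = new Random(seed); // fixed seed only to make the sketch repeatable
    }

    int selectChannel() {
        return random.nextInt(numChannels);
    }

    public static void main(String[] args) {
        ShuffleSketch sender = new ShuffleSketch(8, 42L);
        int[] perChannel = new int[8];
        for (int i = 0; i < 7; i++) {
            perChannel[sender.selectChannel()]++;
        }
        // Number of records each of the 8 channels received.
        System.out.println(java.util.Arrays.toString(perChannel));
    }
}
```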
5)rescale()
Distributes the data of each upstream partition round-robin over its own subset of the downstream partitions.
data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).rescale().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print().setParallelism(4);
output
1> 1: grapefruit, orange
2> 1: Grapes
4> 1: apples, pears
1> 3: cat, tiger
2> 3: dog, wolf
3> 3: sheep, cattle
4> 3: Koala
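After keyBy there are two non-empty partitions, one with 3 elements (key 1) and one with 4 elements (key 3). The downstream is then set to four partitions, and after rescale() each of the two upstream partitions cycles through the new partitions on its own. Note that setParallelism(4) is called on the sink returned by print(), so it sets the parallelism of the print operator only, not of the map that follows rescale().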
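The difference from rebalance is that each upstream subtask only talks to a subset of the downstream subtasks. A minimal plain-Java sketch of that idea follows, assuming the downstream parallelism is a multiple of the upstream parallelism and that each upstream subtask owns a contiguous block of downstream subtasks. The class name, method names and the block layout are assumptions for illustration, not Flink's implementation.

```java
import java.util.Arrays;

// Plain-Java sketch (no Flink): rescale-style routing with upstream 2 and downstream 4.
class RescaleSketch {
    // Downstream subtasks that upstream subtask `up` may send to (assumes downstream % upstream == 0).
    static int[] targets(int up, int upstream, int downstream) {
        int perUpstream = downstream / upstream;
        int[] t = new int[perUpstream];
        for (int i = 0; i < perUpstream; i++) {
            t[i] = up * perUpstream + i;
        }
        return t;
    }

    // The n-th record (0-based) sent by upstream subtask `up` goes round-robin within its own targets.
    static int route(int up, int n, int upstream, int downstream) {
        int[] t = targets(up, upstream, downstream);
        return t[n % t.length];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(targets(0, 2, 4))); // [0, 1]
        System.out.println(Arrays.toString(targets(1, 2, 4))); // [2, 3]
        System.out.println(route(1, 2, 2, 4));                 // third record of upstream 1 -> 2
    }
}
```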
6)global()
Sends all upstream data to the first downstream partition.
data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).global().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print();
output
1> 1: apples, pears
1> 1: grapefruit, orange
1> 1: Grapes
1> 3: cat, tiger
1> 3: dog, wolf
1> 3: sheep, cattle
1> 3: Koala
7)broadcast()
Sends every upstream element to every downstream partition.
data.keyBy(TestObj::getKey).map(new MapFunction<TestObj, TestObj>() {
    @Override
    public TestObj map(TestObj testObj) throws Exception {
        return testObj;
    }
}).broadcast().map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print().setParallelism(3);
output
3> 3: cat, tiger
3> 3: Koala
3> 1: Grapes
3> 3: cat, tiger
3> 3: Koala
3> 1: Grapes
3> 3: sheep, cattle
3> 1: grapefruit, orange
3> 3: sheep, cattle
3> 1: grapefruit, orange
3> 3: cat, tiger
3> 3: Koala
3> 1: Grapes
3> 3: sheep, cattle
3> 1: grapefruit, orange
1> 3: dog, wolf
3> 3: dog, wolf
3> 1: apples, pears
2> 3: sheep, cattle
2> 1: grapefruit, orange
2> 3: sheep, cattle
2> 1: grapefruit, orange
2> 3: dog, wolf
2> 1: apples, pears
2> 3: dog, wolf
2> 1: apples, pears
2> 3: sheep, cattle
2> 1: grapefruit, orange
2> 3: dog, wolf
2> 1: apples, pears
2> 3: cat, tiger
2> 3: Koala
2> 1: Grapes
3> 3: cat, tiger
3> 3: Koala
3> 1: Grapes
1> 1: apples, pears
1> 3: dog, wolf
1> 1: apples, pears
1> 3: cat, tiger
1> 3: Koala
1> 1: Grapes
1> 3: cat, tiger
1> 3: Koala
1> 1: Grapes
1> 3: dog, wolf
1> 1: apples, pears
1> 3: cat, tiger
1> 3: Koala
1> 1: Grapes
2> 3: sheep, cattle
2> 1: grapefruit, orange
1> 3: sheep, cattle
1> 1: grapefruit, orange
1> 3: dog, wolf
1> 1: apples, pears
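The output above has 56 lines. One way to read that number, assuming the default parallelism is 8 (which matches the subtask indexes 1 to 8 in the earlier outputs): the map after broadcast() runs as 8 subtasks, each of them receives all 7 records, and the resulting 56 records are then spread over the 3 print subtasks. The plain-Java sketch below (hypothetical names, no Flink) shows the copy-to-everyone step and the resulting count.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Flink): broadcast delivers every record to every downstream subtask.
class BroadcastSketch {
    // Returns one inbox per downstream subtask; each inbox holds a full copy of the records.
    static List<List<String>> broadcast(List<String> records, int downstreamParallelism) {
        List<List<String>> inboxes = new ArrayList<>();
        for (int i = 0; i < downstreamParallelism; i++) {
            inboxes.add(new ArrayList<>(records));
        }
        return inboxes;
    }

    // Total number of deliveries across all inboxes.
    static int totalDelivered(List<List<String>> inboxes) {
        int total = 0;
        for (List<String> inbox : inboxes) {
            total += inbox.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> records = List.of("r1", "r2", "r3", "r4", "r5", "r6", "r7");
        System.out.println(totalDelivered(broadcast(records, 8))); // 7 * 8 = 56
    }
}
```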
8)partitionCustom(Partitioner<K> partitioner, KeySelector<T, K> keySelector)
Partitions the data with a user-defined partitioner. The following example partitions by the remainder of the key divided by 2.
List<TestObj> testObjs = new ArrayList<>();
testObjs.add(new TestObj(1, "Apple,Pear"));
testObjs.add(new TestObj(2, "grapefruit,a mandarin orange"));
testObjs.add(new TestObj(3, "cat,tiger"));
testObjs.add(new TestObj(4, "dog,wolf"));
testObjs.add(new TestObj(5, "sheep,cattle"));
testObjs.add(new TestObj(6, "Grape"));
testObjs.add(new TestObj(7, "Koalas"));
DataStream<TestObj> data = env.fromCollection(testObjs);
data.partitionCustom(new Partitioner<Integer>() {
    @Override
    public int partition(Integer integer, int i) {
        return integer % 2;
    }
}, TestObj::getKey).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).print();
output
1> 2:grapefruit,a mandarin orange
1> 4:dog,wolf
1> 6:Grape
2> 1:Apple,Pear
2> 3:cat,tiger
2> 5:sheep,cattle
2> 7:Koalas
In the end the even keys are in one partition and the odd keys are in another.
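The partitioner returns a zero-based partition index, while the prefix printed by print() is one-based, which is why even keys (index 0) show up under 1> and odd keys (index 1) under 2>. The plain-Java sketch below restates the same rule with two small safeguards that the example does not need for its data: Math.floorMod keeps the result non-negative for negative keys, and the final modulo keeps it below the number of partitions. The class and method names are hypothetical.

```java
// Plain-Java sketch (no Flink): a parity-based partition rule that stays within range.
class ParityPartitionSketch {
    // Returns a zero-based partition index in [0, numPartitions).
    static int partition(int key, int numPartitions) {
        int parity = Math.floorMod(key, 2); // 0 for even keys, 1 for odd keys, also for negative keys
        return parity % numPartitions;      // stays valid even if there is only one partition
    }

    public static void main(String[] args) {
        for (int key = 1; key <= 7; key++) {
            // The printed prefix in the job output would be this index plus one.
            System.out.println(key + " -> partition index " + partition(key, 8));
        }
        System.out.println(-3 % 2);             // -1: plain % can go negative in Java
        System.out.println(partition(-3, 8));   // 1
    }
}
```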
3. Output
1)print()
Prints to the console; it has been used many times above.
2)writeToSocket(String hostName, int port, SerializationSchema<T> schema)
Writes the stream to a socket at the given address.
hostName - host address
port - port
schema - serialization schema
3)addSink(SinkFunction<T> sinkFunction)
Writes to a user-defined sink.
data.map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey() + ":" + testObj.getValue();
    }
}).addSink(new SinkFunction<String>() {
    @Override
    public void invoke(String value, Context context) throws Exception {
        // The actual output logic goes here,
        // e.g. write to a file (file is defined elsewhere).
        FileUtils.writeFileUtf8(file, value);
    }
});
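Since invoke is called once per record, the write inside it usually needs to append rather than replace. If the helper used in invoke replaces the file content on each call (I have not verified this for FileUtils.writeFileUtf8), only the last record would survive. The following plain-Java sketch shows an appending write that could serve as the body of invoke; AppendLineSketch and the temporary file are hypothetical, and the sketch ignores the case of several parallel subtasks writing to the same file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Plain-Java sketch (no Flink): append one record per call, as a sink's invoke might do.
class AppendLineSketch {
    // Appends the value as a UTF-8 line; creates the file if it does not exist yet.
    static void appendLine(Path file, String value) throws IOException {
        Files.write(file,
                (value + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sink-sketch", ".txt");
        appendLine(file, "1:Apple,Pear");
        appendLine(file, "2:grapefruit,a mandarin orange");
        // Both records are still in the file.
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```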
4. Windows: to be added after I have studied them