A Flink program that consumes data from Kafka, running in Flink on YARN mode, had previously been deployed to both the test and production environments and ran without problems. After the test environment was restarted and the job was redeployed, however, it failed with the following error:
2019-07-01 15:19:25,984 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Sink: Coupon Sink (1/1) (28578957b82c7fccd680cc4fb5fbb7cd) switched from RUNNING to FAILED.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 8 for operator Source: Custom Source -> Sink: Coupon Sink (1/1).}
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 8 for operator Source: Custom Source -> Sink: Coupon Sink (1/1).
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
    ... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to hdfs://cxhadoop/flink/checkpoints/292e9f2140f8abc69acaadb99cfd4c58/chk-8/91154fad-3667-4dd3-9b1d-a503c0054207 in order to obtain the stream state handle
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
    at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
    ... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to hdfs://cxhadoop/flink/checkpoints/292e9f2140f8abc69acaadb99cfd4c58/chk-8/91154fad-3667-4dd3-9b1d-a503c0054207 in order to obtain the stream state handle
    at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)
    at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)
    at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)
    at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
    ... 7 more
Caused by: java.io.IOException: DataStreamer Exception:
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:695)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hdfs.protocol.HdfsConstants
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1413)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1357)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:587)
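Reading a nested trace like the one above comes down to following each "Caused by" link down to the innermost exception. The following is only a minimal sketch of that idea in plain Java. The class name CauseChain and the stand-in exceptions built in main are illustrative assumptions and not part of the Flink job; the intermediate ExecutionException is left out and the messages are shortened.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not part of Flink: walks Throwable.getCause() links.
final class CauseChain {

    // Collects the exception and each of its causes, outermost first.
    static List<Throwable> chain(Throwable top) {
        Objects.requireNonNull(top, "top");
        List<Throwable> links = new ArrayList<>();
        // The size cap guards against a cyclic cause chain.
        for (Throwable t = top; t != null && links.size() < 64; t = t.getCause()) {
            links.add(t);
        }
        return links;
    }

    // Returns the innermost cause, i.e. the last "Caused by" entry in a trace.
    static Throwable rootCause(Throwable top) {
        List<Throwable> links = chain(top);
        return links.get(links.size() - 1);
    }

    public static void main(String[] args) {
        // Stand-ins that mirror the nesting in the log above (abbreviated).
        Throwable root = new NoClassDefFoundError(
                "Could not initialize class org.apache.hadoop.hdfs.protocol.HdfsConstants");
        Throwable streamer = new IOException("DataStreamer Exception: ", root);
        Throwable flush = new IOException(
                "Could not flush and close the file system output stream", streamer);
        Throwable materialize = new Exception("Could not materialize checkpoint 8", flush);

        // Print every link, outermost first, then the innermost one.
        for (Throwable t : chain(materialize)) {
            System.out.println(t.getClass().getName() + ": " + t.getMessage());
        }
        System.out.println("innermost: " + rootCause(materialize).getClass().getName());
    }
}
```

Each entry printed later in the list is the cause of the entry before it, which is the same order in which the "Caused by" lines appear in the log.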
The log shows that writing the checkpoint file failed. The trace is long, but two messages matter most:
Could not materialize checkpoint 8 for operator Source and Could not flush and close the file system output stream. In the trace, the second error appears as the cause of the first, so the problem lies in creating the checkpoint file. Check the checkpoint configuration in Flink:
state.backend: filesystem
# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
#
state.checkpoints.dir: hdfs://cxhadoop/flink/checkpoints
state.checkpoints.num-retained: 20
# Default target directory for savepoints, optional.
#
state.savepoints.dir: hdfs://cxhadoop/flink/savepoints
The file contains one extra line, state.checkpoints.num-retained, which sets the maximum number of checkpoints retained in the checkpoint directory. Our reading was that once this limit is exceeded, new checkpoints cannot be created. Looking at the checkpoint directory, we found that the number of checkpoints had long since passed this limit, and we concluded that this was what made the Flink program fail: every time the program started, it could not create the checkpoint file properly. We therefore commented out this setting so that the limit of 20 no longer applies.
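The check described above, comparing the number of checkpoint directories with the configured limit, can be sketched roughly as follows. This is an illustration under assumptions, not the procedure actually used: the class CheckpointCount, its method names, and the sample listing (including the chk-7 and shared entries) are hypothetical, and a real listing would have to be read from the checkpoint directory on HDFS. The only detail taken from the log is that checkpoint directories are named chk-<n>, as in chk-8.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: counts chk-<n> entries in a directory listing.
final class CheckpointCount {

    // Matches paths whose last segment is chk-<number>, with an optional trailing slash.
    private static final Pattern CHK_DIR = Pattern.compile(".*/chk-\\d+/?");

    // Counts the listed paths that look like checkpoint directories.
    static long countCheckpoints(List<String> listedPaths) {
        return listedPaths.stream()
                .filter(p -> CHK_DIR.matcher(p).matches())
                .count();
    }

    // True when the listing holds more checkpoint directories than the limit.
    static boolean exceedsRetained(List<String> listedPaths, int numRetained) {
        return countCheckpoints(listedPaths) > numRetained;
    }

    public static void main(String[] args) {
        // Hypothetical listing for illustration only.
        List<String> listing = Arrays.asList(
                "hdfs://cxhadoop/flink/checkpoints/292e9f2140f8abc69acaadb99cfd4c58/chk-7",
                "hdfs://cxhadoop/flink/checkpoints/292e9f2140f8abc69acaadb99cfd4c58/chk-8",
                "hdfs://cxhadoop/flink/checkpoints/292e9f2140f8abc69acaadb99cfd4c58/shared");
        int numRetained = 20; // value of state.checkpoints.num-retained in the config above
        System.out.println("checkpoint directories: " + countCheckpoints(listing));
        System.out.println("over the limit: " + exceedsRetained(listing, numRetained));
    }
}
```

Only the last path segment is inspected, so entries that are not chk-<n> directories are ignored by the count.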