Abstract: Apache Flink, the most popular stream-batch unified computing engine, is widely used in real-time ETL, event processing, data analysis, CEP, real-time machine learning, and other fields. Starting with Flink 1.9, the Apache Flink community added support for the Python language on top of the existing Java, Scala, and SQL programming languages. Over the course of Flink 1.9-1.12 and the upcoming 1.13, the PyFlink API has gradually matured and can now satisfy the needs of Python users in most cases. Taking Flink 1.12 as an example, this article shows how to use the Python language to develop Flink jobs with the PyFlink API. Contents include:
- Environment preparation
- Job development
- Job submission
- Troubleshooting
- Summary
GitHub address
https://github.com/apache/flink
Everyone is welcome to give Flink a star on GitHub!
1. Environment Preparation
Step 1: Install Python
PyFlink only supports Python 3.5+. First, verify that Python 3.5+ is installed in your development environment; if not, install it first.
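As a quick check, you can also verify the interpreter version from within Python itself; the snippet below is only a minimal sketch, equivalent to running python3 --version:

# Minimal version check; run it with the interpreter you intend to use for PyFlink
import sys

assert sys.version_info >= (3, 5), "PyFlink requires Python 3.5+"
print(sys.version)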
Step 2: Install JDK
Flink's runtime is implemented in Java, so a JDK is also needed in order to execute Flink jobs. Flink provides full support for both JDK 8 and JDK 11; confirm that one of these JDK versions is installed in your development environment and, if not, install a JDK first.
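If you prefer to check from Python as well, the following minimal sketch simply shells out to the java command (it assumes java is on the PATH; java -version prints its output to stderr):

# Minimal JDK check; assumes the `java` command is available on the PATH
import subprocess

subprocess.run(["java", "-version"], check=True)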
Step 3: Install PyFlink
Next, you need to install PyFlink, which can be installed with the following commands:
# Create a Python virtual environment
python3 -m pip install virtualenv
virtualenv -p `which python3` venv

# Activate the Python virtual environment created above
source venv/bin/activate

# Install PyFlink 1.12
python3 -m pip install apache-flink==1.12.2
2. Job Development
PyFlink Table API Job
Let's start by showing you how to develop a PyFlink Table API job.
1) Create TableEnvironment object
For a Table API job, the user first needs to create a TableEnvironment object. The following example defines a TableEnvironment object; jobs defined with this object run in streaming mode and are executed with the blink planner.
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(environment_settings=env_settings)
2) Configure the execution parameters of the job
The execution parameters of a job can be configured as follows. The following example sets the default parallelism of the job to 4.
t_env.get_config().get_configuration().set_string('parallelism.default', '4')
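Other configuration options can be set in the same way. The sketch below uses assumed, illustrative values (the job name and bundle size are not part of the original example) and should be tuned for your own job:

# Illustrative settings with assumed values; adjust them for your own job
config = t_env.get_config().get_configuration()
config.set_string('pipeline.name', 'my-pyflink-job')          # job name shown in the Web UI
config.set_string('python.fn-execution.bundle.size', '1000')  # records per bundle sent to the Python worker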
3) Create data source table
Next, you need to create a data source table for the job. There are several ways to define data source tables in PyFlink.
Mode 1: from_elements
PyFlink allows users to create a source table from a given list. The following example defines a table with three rows of data, [("hello", 1), ("world", 2), ("flink", 3)], and two columns, named a and b, of type VARCHAR and BIGINT respectively.
tab = t_env.from_elements([("hello", 1), ("world", 2), ("flink", 3)], ['a', 'b'])
Notes:

- This is often used during the testing phase to quickly create a source table and validate the job logic.
- The from_elements method can take multiple parameters: the first specifies the list of data, where each element must be a tuple; the second specifies the schema of the table (see the sketch below).
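For example, instead of only listing the column names, the schema (including the column types mentioned above) can be declared explicitly; a minimal sketch:

# The same table, with the schema (column names and types) declared explicitly
from pyflink.table import DataTypes

tab = t_env.from_elements(
    [("hello", 1), ("world", 2), ("flink", 3)],
    DataTypes.ROW([DataTypes.FIELD("a", DataTypes.STRING()),
                   DataTypes.FIELD("b", DataTypes.BIGINT())]))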
Mode 2: DDL
In addition, the data can come from an external data source. The following example defines a table named my_source, backed by the datagen connector, with two fields of type VARCHAR.
t_env.execute_sql("""
    CREATE TABLE my_source (
        a VARCHAR,
        b VARCHAR
    ) WITH (
        'connector' = 'datagen',
        'number-of-rows' = '10'
    )
""")

tab = t_env.from_path('my_source')
Notes:

- Defining source tables via DDL is currently the most recommended approach, and all connectors supported in the Java Table API & SQL can be used in PyFlink Table API jobs through DDL. For a detailed list of connectors, see Flink's official documentation [1].
- Currently, only a few connector implementations are included in the official Flink distribution, such as FileSystem, DataGen, Print, and BlackHole; most connector implementations, such as Kafka and ES, are not. For connectors that are not included in the official Flink distribution, if they need to be used in a PyFlink job, the user must explicitly specify the corresponding FAT JAR. For Kafka, for example, the JAR package [2] can be used and specified as follows:
# Note: the file:/// prefix cannot be omitted
t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")
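For illustration, a minimal sketch of what a Kafka source table defined via DDL might look like; the topic name, broker address, format, and field names below are placeholders, not part of the original example:

t_env.execute_sql("""
    CREATE TABLE kafka_source (
        a VARCHAR,
        b VARCHAR
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test_source_topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'test_group',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")
tab = t_env.from_path('kafka_source')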
Mode 3: catalog
from pyflink.table.catalog import HiveCatalog

hive_catalog = HiveCatalog("hive_catalog")
t_env.register_catalog("hive_catalog", hive_catalog)
t_env.use_catalog("hive_catalog")

# Assume a table named source_table has already been defined in the Hive catalog
tab = t_env.from_path('source_table')
This is similar to DDL except that the table definition is registered in the catalog beforehand and does not need to be redefined in the job.
4) Define the calculation logic of the job
Mode 1: Through the Table API
Once you have the source table, you can then use the various operations provided in the Table API to define the job's computing logic and perform various transformations on the table, such as:
@udf(result_type=DataTypes.STRING())
def sub_string(s: str, begin: int, end: int):
    return s[begin:end]

transformed_tab = tab.select(sub_string(col('a'), 2, 4))
Mode 2: Through SQL statement
In addition to using the operations provided in the Table API, tables can also be transformed directly with SQL statements; for example, the logic above can be expressed with a SQL statement as well:
t_env.create_temporary_function("sub_string", sub_string)
transformed_tab = t_env.sql_query("SELECT sub_string(a, 2, 4) FROM %s" % tab)
Notes:

- TableEnvironment provides a variety of ways to execute SQL statements, for slightly different purposes:
| Method Name | Description |
|---|---|
| sql_query | Used to execute SELECT statements. |
| sql_update | Used to execute INSERT / CREATE TABLE statements. This method has been deprecated; execute_sql or create_statement_set is recommended instead. |
| create_statement_set | Used to execute multiple SQL statements; a multi-sink job can be written with this method (see the sketch after this table). |
| execute_sql | Used to execute a single SQL statement. execute_sql vs. create_statement_set: the former can only execute a single SQL statement, while the latter can execute multiple SQL statements. execute_sql vs. sql_query: the former can execute various kinds of SQL statements, such as DDL, DML, DQL, SHOW, DESCRIBE, EXPLAIN, USE, etc., while the latter can only execute DQL statements. Even for DQL statements they behave differently: the former generates a Flink job, triggers computation of the table data, and returns a TableResult, while the latter does not trigger any computation; it only applies a logical transformation and returns a Table. |
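As an illustration of create_statement_set, the minimal sketch below writes to two sinks in one job; the table names my_sink_1 and my_sink_2 are placeholders that would first need to be defined via DDL:

# A multi-sink sketch; assumes my_source, my_sink_1 and my_sink_2 have been created via DDL
stmt_set = t_env.create_statement_set()
stmt_set.add_insert_sql("INSERT INTO my_sink_1 SELECT a FROM my_source")
stmt_set.add_insert_sql("INSERT INTO my_sink_2 SELECT b FROM my_source")
stmt_set.execute()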
5) View the execution plan
During the development or debugging of a job, users may need to view the execution plan of the job, which can be done in the following ways.
Mode 1: Table.explain
For example, if you need to know the current execution plan of transformed_tab, you can execute print(transformed_tab.explain()), which produces the following output:
== Abstract Syntax Tree ==
LogicalProject(EXPR$0=[sub_string($0, 2, 4)])
+- LogicalTableScan(table=[[default_catalog, default_database, Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]]])

== Optimized Logical Plan ==
PythonCalc(select=[sub_string(a, 2, 4) AS EXPR$0])
+- LegacyTableSourceScan(table=[[default_catalog, default_database, Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]]], fields=[a])

== Physical Execution Plan ==
Stage 1 : Data Source
	content : Source: PythonInputFormatTableSource(a)

Stage 2 : Operator
	content : SourceConversion(table=[default_catalog.default_database.Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]], fields=[a])
	ship_strategy : FORWARD

Stage 3 : Operator
	content : StreamExecPythonCalc
	ship_strategy : FORWARD
Mode 2: TableEnvironment.explain_sql
Mode 1 views the execution plan of an existing Table object; sometimes, however, no Table object is available, for example:
print(t_env.explain_sql("INSERT INTO my_sink SELECT * FROM %s " % transformed_tab))
Its execution plan is as follows:
== Abstract Syntax Tree ==
LogicalSink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
+- LogicalProject(EXPR$0=[sub_string($0, 2, 4)])
   +- LogicalTableScan(table=[[default_catalog, default_database, Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]]])

== Optimized Logical Plan ==
Sink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
+- PythonCalc(select=[sub_string(a, 2, 4) AS EXPR$0])
   +- LegacyTableSourceScan(table=[[default_catalog, default_database, Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]]], fields=[a])

== Physical Execution Plan ==
Stage 1 : Data Source
	content : Source: PythonInputFormatTableSource(a)

Stage 2 : Operator
	content : SourceConversion(table=[default_catalog.default_database.Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]], fields=[a])
	ship_strategy : FORWARD

Stage 3 : Operator
	content : StreamExecPythonCalc
	ship_strategy : FORWARD

Stage 4 : Data Sink
	content : Sink: Sink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
	ship_strategy : FORWARD
6) Write out the result data
Mode 1: via DDL
Similar to creating a data source table, you can also create a result table by DDL.
t_env.execute_sql("""
    CREATE TABLE my_sink (
        `sum` VARCHAR
    ) WITH (
        'connector' = 'print'
    )
""")

table_result = transformed_tab.execute_insert('my_sink')
Notes:

- When print is used as the sink, the job results are printed to standard output. If you do not need to inspect the output, you can also use blackhole as the sink (see the sketch below).
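For reference, a minimal sketch of such a blackhole sink (the table name is a placeholder); it accepts all records and discards them:

t_env.execute_sql("""
    CREATE TABLE my_blackhole_sink (
        `sum` VARCHAR
    ) WITH (
        'connector' = 'blackhole'
    )
""")

table_result = transformed_tab.execute_insert('my_blackhole_sink')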
Mode 2: collect
You can also collect the results of a table to the client through the collect method and view them one by one.
table_result = transformed_tab.execute()
with table_result.collect() as results:
    for result in results:
        print(result)
Notes:

- This makes it easy to collect the results of the table to the client and inspect them one by one.
- Since the data is eventually collected to the client, it is best to limit the number of rows, for example with transformed_tab.limit(10).execute(), which collects only 10 rows to the client (see the sketch below).
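A minimal sketch of this pattern, combining limit with collect:

# Collect at most 10 rows to the client and print them
with transformed_tab.limit(10).execute().collect() as results:
    for result in results:
        print(result)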
Mode 3: to_pandas
You can also use the to_pandas method to convert the result of a table into a pandas.DataFrame and view it.
result = transformed_tab.to_pandas()
print(result)
You can see the following output:
  _c0
0  32
1  e6
2  8b
3  be
4  4f
5  b4
6  a6
7  49
8  35
9  6b
Notes:

- This is similar to collect: it also collects the results of the table to the client, so it is best to limit the number of rows.
7) Summary
The complete job example is as follows:
from pyflink.table import DataTypes, EnvironmentSettings, StreamTableEnvironment
from pyflink.table.expressions import col
from pyflink.table.udf import udf


def table_api_demo():
    env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
    t_env = StreamTableEnvironment.create(environment_settings=env_settings)
    t_env.get_config().get_configuration().set_string('parallelism.default', '4')

    t_env.execute_sql("""
        CREATE TABLE my_source (
            a VARCHAR,
            b VARCHAR
        ) WITH (
            'connector' = 'datagen',
            'number-of-rows' = '10'
        )
    """)

    tab = t_env.from_path('my_source')

    @udf(result_type=DataTypes.STRING())
    def sub_string(s: str, begin: int, end: int):
        return s[begin:end]

    transformed_tab = tab.select(sub_string(col('a'), 2, 4))

    t_env.execute_sql("""
        CREATE TABLE my_sink (
            `sum` VARCHAR
        ) WITH (
            'connector' = 'print'
        )
    """)

    table_result = transformed_tab.execute_insert('my_sink')

    # 1) Wait for the job to finish. This is needed for local execution; otherwise the script may
    #    exit before the job finishes, causing the minicluster to exit prematurely.
    # 2) When the job is submitted to a remote cluster in detached mode (e.g. YARN/Standalone/K8s),
    #    this call needs to be removed.
    table_result.wait()


if __name__ == '__main__':
    table_api_demo()
The results are as follows:
4> +I(a1)
3> +I(b0)
2> +I(b1)
1> +I(37)
3> +I(74)
4> +I(3d)
1> +I(07)
2> +I(f4)
1> +I(7f)
2> +I(da)
PyFlink DataStream API Job
1) Create a StreamExecutionEnvironment object
For a DataStream API job, the user first needs to define a StreamExecutionEnvironment object.
env = StreamExecutionEnvironment.get_execution_environment()
2) Configure the execution parameters of the job
The execution parameters of a job can be configured as follows. The following example sets the default parallelism of the job to 4.
env.set_parallelism(4)
3) Create a data source
Next, you need to create a data source for the job. There are several ways to define data sources in PyFlink.
Mode 1: from_collection
PyFlink allows users to create a data source from a list. The following example defines a data source containing three records, [(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')], each with two fields of type INT and STRING respectively.
ds = env.from_collection(
    collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
    type_info=Types.ROW([Types.INT(), Types.STRING()]))
Notes:

- This is often used during the testing phase to create a data source easily.
- The from_collection method takes two parameters: the first specifies the list of data; the second specifies the type of the data (see the sketch below).
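For example, the type of the data can also be declared as a tuple type instead of a row type; a minimal sketch:

# The same data declared as a TUPLE type instead of a ROW type
ds = env.from_collection(
    collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
    type_info=Types.TUPLE([Types.INT(), Types.STRING()]))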
Mode 2: Use the connector defined in the PyFlink DataStream API
Additionally, you can use the connectors already supported in the PyFlink DataStream API; note that only the Kafka connector is supported in 1.12.
deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_consumer = FlinkKafkaConsumer(
    topics='test_source_topic',
    deserialization_schema=deserialization_schema,
    properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds = env.add_source(kafka_consumer)
Notes:

- The Kafka connector is not currently included in Flink's official distribution package. If you need to use it in a PyFlink job, you need to explicitly specify the corresponding FAT JAR [2], which can be specified as follows:

# Note: the file:/// prefix cannot be omitted
env.add_jars("file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")

- Even for PyFlink DataStream API jobs, the FAT JARs packaged for the Table & SQL connectors are recommended, to avoid recursive dependency problems.
Mode 3: Use the connector defined in the PyFlink Table API
The following example shows how connectors supported in Table & SQL can be used as data sources in PyFlink DataStream API jobs.
t_env = StreamTableEnvironment.create(stream_execution_environment=env)

t_env.execute_sql("""
    CREATE TABLE my_source (
        a INT,
        b VARCHAR
    ) WITH (
        'connector' = 'datagen',
        'number-of-rows' = '10'
    )
""")

ds = t_env.to_append_stream(
    t_env.from_path('my_source'),
    Types.ROW([Types.INT(), Types.STRING()]))
Notes:

- Because only a few connectors are built into the PyFlink DataStream API, it is recommended to create the data source tables used in PyFlink DataStream API jobs in this way, so that all connectors available in the PyFlink Table API can also be used in PyFlink DataStream API jobs.
- Note that the TableEnvironment must be created via StreamTableEnvironment.create(stream_execution_environment=env), so that the PyFlink DataStream API and the PyFlink Table API share the same StreamExecutionEnvironment object.
4) Define the calculation logic
After generating the DataStream object corresponding to the data source, you can then use the various operations defined in the PyFlink DataStream API to define calculation logic and transform the DataStream object, such as:
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: (i[0] + j[0], i[1]))
5) Write out the result data
Mode 1: print
You can call the print method on the DataStream object to print the results of the DataStream to standard output, for example:
ds.print()
Mode 2: Use the connector defined in the PyFlink DataStream API
You can directly use the connectors already supported in the PyFlink DataStream API; note that 1.12 provides support for the FileSystem, JDBC, and Kafka connectors. Taking Kafka as an example:
serialization_schema = JsonRowSerializationSchema.builder() \
    .with_type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_producer = FlinkKafkaProducer(
    topic='test_sink_topic',
    serialization_schema=serialization_schema,
    producer_config={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds.add_sink(kafka_producer)
Notes:

- The JDBC and Kafka connectors are not currently included in the official Flink distribution. If you need to use them in a PyFlink job, you need to explicitly specify the corresponding FAT JAR. For example, the Kafka connector can use the JAR package [2], which can be specified as follows:

# Note: the file:/// prefix cannot be omitted
env.add_jars("file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")

- The FAT JARs packaged for the Table & SQL connectors are recommended, to avoid recursive dependency problems.
Mode 3: Use the connector defined in the PyFlink Table API
The following example shows how to use a supported connector in Table & SQL as a sink for a PyFlink DataStream API job.
# Option 1: ds is of type Types.ROW
def split(s):
    splits = s[1].split('|')
    for sp in splits:
        yield Row(s[0], sp)

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split, Types.ROW([Types.INT(), Types.STRING()])) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: Row(i[0] + j[0], i[1]))

# Option 2: ds is of type Types.TUPLE
def split(s):
    splits = s[1].split('|')
    for sp in splits:
        yield s[0], sp

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: (i[0] + j[0], i[1]))

# Write ds to the sink
t_env.execute_sql("""
    CREATE TABLE my_sink (
        a INT,
        b VARCHAR
    ) WITH (
        'connector' = 'print'
    )
""")

table = t_env.from_data_stream(ds)
table_result = table.execute_insert("my_sink")
Notes:

- Note that the result type of the DataStream object ds in t_env.from_data_stream(ds) must be a composite type, either Types.ROW or Types.TUPLE, which is why the result type of the flat_map operation needs to be declared explicitly in the job's calculation logic.
- The job needs to be submitted through the job submission methods provided in the PyFlink Table API.
- Since only a few connectors are currently supported in the PyFlink DataStream API, it is recommended to write out results in this way, so that all connectors available in the PyFlink Table API can be used as sinks for PyFlink DataStream API jobs.
6) Summary
The complete job example is as follows:
Mode 1 (for debugging):
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment


def data_stream_api_demo():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(4)

    ds = env.from_collection(
        collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
        type_info=Types.ROW([Types.INT(), Types.STRING()]))

    def split(s):
        splits = s[1].split("|")
        for sp in splits:
            yield s[0], sp

    ds = ds.map(lambda i: (i[0] + 1, i[1])) \
           .flat_map(split) \
           .key_by(lambda i: i[1]) \
           .reduce(lambda i, j: (i[0] + j[0], i[1]))

    ds.print()

    env.execute()


if __name__ == '__main__':
    data_stream_api_demo()
The results are as follows:
3> (2, 'aaa')
3> (2, 'bb')
3> (6, 'aaa')
3> (4, 'a')
3> (5, 'bb')
3> (7, 'a')
Mode 2 (for production jobs):
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment


def data_stream_api_demo():
    env = StreamExecutionEnvironment.get_execution_environment()
    t_env = StreamTableEnvironment.create(stream_execution_environment=env)
    env.set_parallelism(4)

    t_env.execute_sql("""
        CREATE TABLE my_source (
            a INT,
            b VARCHAR
        ) WITH (
            'connector' = 'datagen',
            'number-of-rows' = '10'
        )
    """)

    ds = t_env.to_append_stream(
        t_env.from_path('my_source'),
        Types.ROW([Types.INT(), Types.STRING()]))

    def split(s):
        splits = s[1].split("|")
        for sp in splits:
            yield s[0], sp

    ds = ds.map(lambda i: (i[0] + 1, i[1])) \
           .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
           .key_by(lambda i: i[1]) \
           .reduce(lambda i, j: (i[0] + j[0], i[1]))

    t_env.execute_sql("""
        CREATE TABLE my_sink (
            a INT,
            b VARCHAR
        ) WITH (
            'connector' = 'print'
        )
    """)

    table = t_env.from_data_stream(ds)
    table_result = table.execute_insert("my_sink")

    # 1) Wait for the job to finish. This is needed for local execution; otherwise the script may
    #    exit before the job finishes, causing the minicluster to exit prematurely.
    # 2) When the job is submitted to a remote cluster in detached mode (e.g. YARN/Standalone/K8s),
    #    this call needs to be removed.
    table_result.wait()


if __name__ == '__main__':
    data_stream_api_demo()
3. Job Submission
Flink provides a variety of job deployment modes, such as local, standalone, YARN, and K8s, and PyFlink supports all of them. Please refer to Flink's official documentation [3] for more details.
local
Description: When a job is executed in this way, a minicluster is started and the job is submitted to the minicluster for execution, which is suitable for the job development phase.
Example: python3 table_api_demo.py
standalone
Description: When a job is executed this way, it is submitted to a remote standalone cluster.
Example:
./bin/flink run --jobmanager localhost:8081 --python table_api_demo.py
YARN Per-Job
Description: When a job is executed this way, it is submitted to a remote YARN cluster.
Example:
./bin/flink run --target yarn-per-job --python table_api_demo.py
K8s application mode
Description: When a job is executed in this way, it is submitted to the K8s cluster and executed in application mode.
Example:
./bin/flink run-application \
    --target kubernetes-application \
    --parallelism 8 \
    -Dkubernetes.cluster-id=<ClusterId> \
    -Dtaskmanager.memory.process.size=4096m \
    -Dkubernetes.taskmanager.cpu=2 \
    -Dtaskmanager.numberOfTaskSlots=4 \
    -Dkubernetes.container.image=<PyFlinkImageName> \
    --pyModule table_api_demo \
    --pyFiles file:///path/to/table_api_demo.py
Parameter Description
In addition to the parameters mentioned above, there are other parameters related to PyFlink jobs that can be passed when submitting via flink run.
| Parameter Name | Description | Example |
|---|---|---|
| -py / --python | Specifies the entry point file of the job. | -py file:///path/to/table_api_demo.py |
| -pym / --pyModule | Specifies the entry module of the job. Similar to --python, it can be used when the job's Python files are packaged as a zip and cannot be specified via --python; it is more general than --python. | -pym table_api_demo -pyfs file:///path/to/table_api_demo.py |
| -pyfs / --pyFiles | Specifies one or more Python files (.py/.zip, etc., comma-separated) which are placed on the PYTHONPATH of the Python process when the job executes, so they can be accessed from Python custom functions. | -pyfs file:///path/to/table_api_demo.py,file:///path/to/deps.zip |
| -pyarch / --pyArchives | Specifies one or more archive files (comma-separated) which, when the job executes, are extracted into the working directory of the Python process and can be accessed via relative paths. | -pyarch file:///path/to/venv.zip |
| -pyexec / --pyExecutable | Specifies the path of the Python interpreter used when the job executes. | -pyarch file:///path/to/venv.zip -pyexec venv.zip/venv/bin/python3 |
| -pyreq / --pyRequirements | Specifies the requirements file, which defines the job's dependencies. | -pyreq requirements.txt |
4. Troubleshooting
When you first start developing PyFlink jobs, you will inevitably run into problems of various kinds, so knowing how to troubleshoot them is very important. Next, we introduce some common troubleshooting methods.
client-side exception output
PyFlink jobs follow the same submission process as other Flink jobs: the job is first compiled into a JobGraph on the client side and then submitted to the Flink cluster for execution. If job compilation fails, an exception is thrown when the job is submitted from the client, and you will see output like the following on the client side:
Traceback (most recent call last):
  File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 50, in <module>
    data_stream_api_demo()
  File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 45, in data_stream_api_demo
    table_result = table.execute_insert("my_")
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/table/table.py", line 864, in execute_insert
    return TableResult(self._j_table.executeInsert(table_path, overwrite))
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/py4j/java_gateway.py", line 1285, in __call__
    return_value = get_return_value(
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/util/exceptions.py", line 162, in deco
    raise java_exception
pyflink.util.exceptions.TableException: Sink `default_catalog`.`default_database`.`my_` does not exists
	at org.apache.flink.table.planner.delegation.PlannerBase.translateToRel(PlannerBase.scala:247)
	at org.apache.flink.table.planner.delegation.PlannerBase$$anonfun$1.apply(PlannerBase.scala:159)
	at org.apache.flink.table.planner.delegation.PlannerBase$$anonfun$1.apply(PlannerBase.scala:159)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.flink.table.planner.delegation.PlannerBase.translate(PlannerBase.scala:159)
	at org.apache.flink.table.api.internal.TableEnvironmentImpl.translate(TableEnvironmentImpl.java:1329)
	at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:676)
	at org.apache.flink.table.api.internal.TableImpl.executeInsert(TableImpl.java:572)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)
	at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)
	at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Process finished with exit code 1
For example, the above error indicates that the table named "my_" used in the job does not exist.
TaskManager log file
Some errors only occur while the job is running, for example dirty data or bugs in the implementation of a Python custom function. For these errors, you usually need to check the TaskManager log files; for example, the following error indicates that the opencv library accessed in the user's Python custom function is not installed.
Caused by: java.lang.RuntimeException: Error received from SDK harness for instruction 2: Traceback (most recent call last):
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 253, in _execute
    response = task()
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 310, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 479, in do_instruction
    return getattr(self, request_type)(
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 515, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 977, in process_bundle
    input_op_by_transform_id[element.transform_id].process_encoded(
  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 218, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 330, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "pyflink/fn_execution/beam/beam_operations_fast.pyx", line 71, in pyflink.fn_execution.beam.beam_operations_fast.FunctionOperation.process
  File "pyflink/fn_execution/beam/beam_operations_fast.pyx", line 85, in pyflink.fn_execution.beam.beam_operations_fast.FunctionOperation.process
  File "pyflink/fn_execution/coder_impl_fast.pyx", line 83, in pyflink.fn_execution.coder_impl_fast.DataStreamFlatMapCoderImpl.encode_to_stream
  File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 26, in split
    import cv2
ModuleNotFoundError: No module named 'cv2'

	at org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:177)
	at org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:157)
	at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:251)
	at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
	at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
	at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309)
	at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292)
	at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:782)
	at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Notes:

- In local mode, the TaskManager logs are located in the PyFlink installation directory under site-packages/pyflink/log/. The installation directory can also be found as follows (see also the one-liner sketched below):

>>> import pyflink
>>> print(pyflink.__path__)
['/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink']

In this case the log files are in the /Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/log directory.
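Equivalently, a small sketch that prints the expected log directory of a local PyFlink installation directly:

# Print the log directory of the local PyFlink installation
import os
import pyflink

print(os.path.join(os.path.dirname(os.path.abspath(pyflink.__file__)), 'log'))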
Custom Log
Sometimes the content of the exception log is not enough to pinpoint the problem; in that case, you can print log messages from within the Python custom function. PyFlink allows users to output logs in Python custom functions via the standard logging module, for example:
def split(s):
    import logging
    logging.info("s: " + str(s))
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp
In this way, the input parameters to the split function are printed to the TaskManager log file.
Remote Debugging
During execution, a PyFlink job starts a separate Python process to run Python custom functions, so if you need to debug a Python custom function, you need to do it via remote debugging. See [4] to learn how to debug Python remotely in PyCharm.
1) Install pydevd-pycharm in a Python environment:
pip install pydevd-pycharm~=203.7717.65
2) Set remote debugging parameters in Python custom functions:
def split(s):
    import pydevd_pycharm
    pydevd_pycharm.settrace('localhost', port=6789, stdoutToServer=True, stderrToServer=True)
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp
3) Follow the remote debugging steps in PyCharm; you can refer either to [4] or to the "Code debugging" section of the blog post [5].
Note: Python remote debugging is only supported in the Professional edition of PyCharm.
Community User Mailing List
If you still have not solved the problem after the above steps, you can subscribe to the Flink user mailing list [6] and send your question there. When sending a question to the mailing list, describe it as clearly as possible, ideally with code and data that reproduce the problem; you can refer to this message [7] as an example.
DingTalk Group

In addition, you are welcome to join the PyFlink DingTalk group to discuss PyFlink-related issues.
5. Summary
In this article, we introduced environment preparation, job development, job submission, and troubleshooting for PyFlink API jobs, hoping to help users quickly build Flink jobs in Python. Next, we will continue to publish articles in the PyFlink series to help PyFlink users gain a deeper understanding of PyFlink features, application scenarios, best practices, and more.
In addition, we have launched a questionnaire, and we hope you will actively participate to help us better organize PyFlink-related learning materials. Those who complete it will be entered into a lottery for a custom Flink polo shirt; the draw will take place at 12:00 noon on April 30.
Reference Links
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/
[2] https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.12.0/flink-sql-connector-kafka_2.11-1.12.0.jar
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/cli.html#submitting-pyflink-jobs
[4] https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-debug-config
[6] https://flink.apache.org/community.html#mailing-lists
[7] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/PyFlink-called-already-closed-and-NullPointerException-td42997.html