Flink full development cycle

Posted by Rulkster on Fri, 18 Feb 2022 23:34:26 +0100

Abstract: Apache Flink, as the most popular stream-batch unified computing engine, is widely used in real-time ETL, event processing, data analysis, CEP, real-time machine learning and other fields. Beginning with Flink 1.9, the Apache Flink community started to provide support for the Python language on top of the original Java, Scala and SQL programming languages. Over the releases from Flink 1.9 to 1.12, and in the upcoming 1.13, the PyFlink API has been gradually improved so that it meets the needs of Python users in most cases. Below, we take Flink 1.12 as an example to show how to develop Flink jobs in Python using the PyFlink API. Contents include:

 

  1. Environmental preparation

  2. Job development

  3. Job submission

  4. Problem investigation

  5. Summary
     


 

GitHub address

https://github.com/apache/flink

You are welcome to give the Flink project a star~

 

 

 

1. Environmental Preparation

 

Step 1: Install Python

 

PyFlink only supports Python 3.5+. First, verify that Python 3.5+ is installed in your development environment; if not, install Python 3.5+ first.
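If in doubt, a quick check from Python itself can confirm the interpreter version (a minimal sketch; run it with the interpreter you intend to use for PyFlink):

import sys

# PyFlink requires Python 3.5 or newer
if sys.version_info < (3, 5):
    raise RuntimeError("PyFlink requires Python 3.5+, found %d.%d" % sys.version_info[:2])
print("Python %d.%d.%d is suitable for PyFlink" % sys.version_info[:3])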

 

Step 2: Install JDK

 

Flink runs on the JVM, so you also need to install a JDK in order to execute Flink jobs. Flink provides full support for both JDK 8 and JDK 11. Confirm that one of these JDK versions is installed in your development environment; if not, install a JDK first.
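One way to confirm this from Python is a small check that shells out to `java -version` (a sketch; it simply assumes the java binary is on the PATH, and note that java prints its version banner to stderr):

import subprocess

# java -version writes its output to stderr rather than stdout
result = subprocess.run(["java", "-version"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True)
print(result.stderr or result.stdout)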

 

Step 3: Install PyFlink

 

Next, you need to install PyFlink, which can be installed with the following commands:

 
# Create a Python virtual environment
python3 -m pip install virtualenv
virtualenv -p `which python3` venv

# Use the Python virtual environment created above
source venv/bin/activate

# Install PyFlink 1.12
python3 -m pip install apache-flink==1.12.2
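After installation, a quick sanity check (a minimal sketch) is to import PyFlink from the virtual environment; if the import succeeds, the installation is usable:

# Run with the interpreter of the virtual environment created above
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

print("PyFlink imported successfully")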

 

 

 

2. Job development

 

PyFlink Table API Job

 

Let's start by showing you how to develop a PyFlink Table API job.

 

1) Create TableEnvironment object

 

For a Table API job, the user first needs to create a TableEnvironment object. The following example defines a TableEnvironment object; jobs defined with this object run in streaming mode and are executed with the blink planner.

 

 

env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(environment_settings=env_settings)
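If the job should run in batch mode instead, the settings can be built with in_batch_mode(); a minimal sketch (using BatchTableEnvironment, as available in PyFlink 1.12):

from pyflink.table import BatchTableEnvironment, EnvironmentSettings

# Batch mode with the blink planner
batch_settings = EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build()
batch_t_env = BatchTableEnvironment.create(environment_settings=batch_settings)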

 

2) Configure the execution parameters of the job

 

The execution parameters of a job can be configured as follows. The following example sets the default parallelism of the job to 4.

 

t_env.get_config().get_configuration().set_string('parallelism.default', '4')
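Other options can be set in the same way. For instance, the following sketch assumes the python.fn-execution.bundle.size option, which tunes how many elements a Python UDF processes per bundle; adjust it to whatever options your version supports:

# Assumption: python.fn-execution.bundle.size is available in the PyFlink version in use
t_env.get_config().get_configuration().set_string('python.fn-execution.bundle.size', '1000')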

 

3) Create data source table

 

Next, you need to create a data source table for the job. There are several ways to define data source tables in PyFlink.

 

Mode 1: from_elements

 

PyFlink allows users to create a source table from a given list. The following example defines a table containing three rows of data: [("hello", 1), ("world", 2), ("flink", 3)]. The table has two columns, named a and b, of type VARCHAR and BIGINT respectively.

 

tab = t_env.from_elements([("hello", 1), ("world", 2), ("flink", 3)], ['a', 'b'])

 

Explain:

 

  • This is often used during the testing phase to quickly create a data source table to validate job logic.

 

  • The from_elements method can take multiple parameters: the first specifies the list of data, where each element must be of type tuple; the second specifies the schema of the table (a variant with an explicit schema is sketched below).
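As noted above, the schema can also be spelled out explicitly; a minimal sketch of this variant:

from pyflink.table import DataTypes

# Same data as above, with an explicit schema instead of bare column names
tab = t_env.from_elements(
    [("hello", 1), ("world", 2), ("flink", 3)],
    DataTypes.ROW([DataTypes.FIELD("a", DataTypes.STRING()),
                   DataTypes.FIELD("b", DataTypes.BIGINT())]))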

 

Mode 2: DDL

 

In addition, the data can come from an external data source. The following example defines a table named my_source of type datagen, with two fields of type VARCHAR.

 

 
t_env.execute_sql("""
        CREATE TABLE my_source (
          a VARCHAR,
          b VARCHAR
        ) WITH (
          'connector' = 'datagen',
          'number-of-rows' = '10'
        )
    """)
tab = t_env.from_path('my_source')

 

Explain:

 

  • Defining data source tables via DDL is currently the most recommended approach, and all connectors supported in the Java Table API & SQL can be used in PyFlink Table API jobs through DDL. For a detailed list of connectors, see Flink's official documentation [1].

 

  • Currently, only a few connector implementations are included in the official Flink distribution, such as FileSystem, DataGen, Print, BlackHole, etc. Most connector implementations, such as Kafka and ES, are not. For connectors that are not included in the official Flink distribution, if they need to be used in a PyFlink job, the user has to explicitly specify the corresponding FAT JAR. For Kafka, the JAR package [2] can be used and specified as follows:

 

 

# Note: the file:/// prefix cannot be omitted
t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")
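If a job needs more than one connector JAR, they can be listed together in pipeline.jars; a sketch, assuming the option takes a semicolon-separated list of URLs (the second path is a placeholder):

t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar;"
    "file:///my/jar/path/another-connector.jar")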

 

Mode 3: catalog

 

 
hive_catalog = HiveCatalog("hive_catalog")
t_env.register_catalog("hive_catalog", hive_catalog)
t_env.use_catalog("hive_catalog")

# Suppose a table named source_table has already been defined in the hive catalog
tab = t_env.from_path('source_table')

 

This is similar to DDL except that the table definition is registered in the catalog beforehand and does not need to be redefined in the job.

 

4) Define the calculation logic of the job

 

Mode 1: Through the Table API

 

Once you have the source table, you can then use the various operations provided in the Table API to define the job's computing logic and perform various transformations on the table, such as:

 

@udf(result_type=DataTypes.STRING())
def sub_string(s: str, begin: int, end: int):
    return s[begin:end]

transformed_tab = tab.select(sub_string(col('a'), 2, 4))

 

Mode 2: Through SQL statement

 

In addition to the various operations provided in the Table API, tables can also be transformed directly through SQL statements; for example, the logic described above can also be expressed with a SQL statement:

 

t_env.create_temporary_function("sub_string", sub_string)
transformed_tab = t_env.sql_query("SELECT sub_string(a, 2, 4) FROM %s" % tab)

 

Explain:

 

  • TableEnvironment provides a variety of ways to execute SQL statements for slightly different purposes:

 

Method Name / Instructions

sql_query
Used to execute SELECT statements.

sql_update
Used to execute INSERT / CREATE TABLE statements. This method has been deprecated; execute_sql or create_statement_set is recommended instead.

create_statement_set
Used to execute multiple SQL statements; a multi-sink job can be written with this method (a sketch follows below).

execute_sql
Used to execute a single SQL statement. execute_sql vs create_statement_set: the former can only execute a single SQL statement, while the latter can be used to execute multiple SQL statements. execute_sql vs sql_query: the former can be used to execute all kinds of SQL statements, such as DDL, DML, DQL, SHOW, DESCRIBE, EXPLAIN, USE, etc.; the latter can only execute DQL statements. Even for DQL statements they behave differently: the former generates a Flink job, triggers computation of the Table data and returns a TableResult, while the latter does not trigger any computation, only transforms the Table logically and returns a Table.
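For example, a multi-sink job built with create_statement_set might look like the following minimal sketch, where my_sink_1 and my_sink_2 are hypothetical tables assumed to have been created via DDL beforehand:

stmt_set = t_env.create_statement_set()
# my_sink_1 and my_sink_2 are hypothetical sink tables
stmt_set.add_insert_sql("INSERT INTO my_sink_1 SELECT a FROM my_source")
stmt_set.add_insert_sql("INSERT INTO my_sink_2 SELECT b FROM my_source")
# All statements in the set are submitted as a single job
stmt_set.execute().wait()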

 

5) View the execution plan

 

During the development or debugging of a job, users may need to view the execution plan of the job, which can be done in the following ways.

 

Mode 1: Table.explain

 

For example, when we need to know the current execution plan of transformed_tab, we can execute print(transformed_tab.explain()), which produces the following output:

 

 
== Abstract Syntax Tree ==
LogicalProject(EXPR$0=[sub_string($0, 2, 4)])
+- LogicalTableScan(table=[[default_catalog, default_database, Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]]])

== Optimized Logical Plan ==
PythonCalc(select=[sub_string(a, 2, 4) AS EXPR$0])
+- LegacyTableSourceScan(table=[[default_catalog, default_database, Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]]], fields=[a])

== Physical Execution Plan ==
Stage 1 : Data Source
    content : Source: PythonInputFormatTableSource(a)

    Stage 2 : Operator
        content : SourceConversion(table=[default_catalog.default_database.Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]], fields=[a])
        ship_strategy : FORWARD

        Stage 3 : Operator
            content : StreamExecPythonCalc
            ship_strategy : FORWARD

 

Mode 2: TableEnvironment.explain_sql

 

The first approach views the execution plan of a Table object, but sometimes no ready-made Table object is available. In that case, TableEnvironment.explain_sql can be used to view the execution plan of a SQL statement, for example:

 

print(t_env.explain_sql("INSERT INTO my_sink SELECT * FROM %s " % transformed_tab))

 

Its execution plan is as follows:

 

== Abstract Syntax Tree ==
LogicalSink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
+- LogicalProject(EXPR$0=[sub_string($0, 2, 4)])
   +- LogicalTableScan(table=[[default_catalog, default_database, Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]]])

== Optimized Logical Plan ==
Sink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
+- PythonCalc(select=[sub_string(a, 2, 4) AS EXPR$0])
   +- LegacyTableSourceScan(table=[[default_catalog, default_database, Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]]], fields=[a])

== Physical Execution Plan ==
Stage 1 : Data Source
    content : Source: PythonInputFormatTableSource(a)

    Stage 2 : Operator
        content : SourceConversion(table=[default_catalog.default_database.Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]], fields=[a])
        ship_strategy : FORWARD

        Stage 3 : Operator
            content : StreamExecPythonCalc
            ship_strategy : FORWARD

            Stage 4 : Data Sink
                content : Sink: Sink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
                ship_strategy : FORWARD

 

6) Write out the result data

 

Mode 1: via DDL

 

Similar to creating a data source table, you can also create a result table by DDL.

 

t_env.execute_sql("""
        CREATE TABLE my_sink (
          `sum` VARCHAR
        ) WITH (
          'connector' = 'print'
        )
    """)
table_result = transformed_tab.execute_insert('my_sink')

 

Explain:

 

  • When print is used as the sink, job results are printed to standard output. If you don't need to inspect the output, you can also use blackhole as the sink, for example:
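A minimal sketch of such a blackhole sink, reusing transformed_tab from above:

# The blackhole connector simply discards every record it receives
t_env.execute_sql("""
        CREATE TABLE my_blackhole_sink (
          `sum` VARCHAR
        ) WITH (
          'connector' = 'blackhole'
        )
    """)
table_result = transformed_tab.execute_insert('my_blackhole_sink')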

     

Mode 2: collect

 

You can also collect the results of a table to the client through the collect method and view them one by one.

 

table_result = transformed_tab.execute()
with table_result.collect() as results:
    for result in results:
        print(result)

 

Explain:

 

  • This makes it easy to collect the results of the table to the client and view them.

  • Since the data is eventually collected to the client, it is best to limit the number of rows, for example: transformed_tab.limit(10).execute(), which collects only 10 rows to the client.

 

Mode 3: to_pandas

 

You can also use the to_pandas method to convert the result of a table to a pandas.DataFrame and view it.

 

 
result = transformed_tab.to_pandas()
print(result)

 

You can see the following output:

 

  _c0
0  32
1  e6
2  8b
3  be
4  4f
5  b4
6  a6
7  49
8  35
9  6b

 

Explain:

 

  • This is similar to collect: it also collects the results of the table to the client, so it's best to limit the number of results, for example as sketched below.
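For example, a small sketch reusing transformed_tab from above:

# Collect at most 10 rows to the client before converting to a pandas.DataFrame
result = transformed_tab.limit(10).to_pandas()
print(result)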

 

7) Summary

 

The complete job example is as follows:

 

from pyflink.table import DataTypes, EnvironmentSettings, StreamTableEnvironment
from pyflink.table.expressions import col
from pyflink.table.udf import udf


def table_api_demo():
    env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
    t_env = StreamTableEnvironment.create(environment_settings=env_settings)
    t_env.get_config().get_configuration().set_string('parallelism.default', '4')

    t_env.execute_sql("""
            CREATE TABLE my_source (
              a VARCHAR,
              b VARCHAR
            ) WITH (
              'connector' = 'datagen',
              'number-of-rows' = '10'
            )
        """)

    tab = t_env.from_path('my_source')

    @udf(result_type=DataTypes.STRING())
    def sub_string(s: str, begin: int, end: int):
        return s[begin:end]

    transformed_tab = tab.select(sub_string(col('a'), 2, 4))

    t_env.execute_sql("""
            CREATE TABLE my_sink (
              `sum` VARCHAR
            ) WITH (
              'connector' = 'print'
            )
        """)

    table_result = transformed_tab.execute_insert('my_sink')

    # 1) Wait for the job to finish. This is needed for local execution; otherwise the script may
    #    exit before the job finishes, causing the minicluster to exit prematurely.
    # 2) When the job is submitted to a remote cluster (e.g. YARN/Standalone/K8s) in detached mode,
    #    this call needs to be removed.
    table_result.wait()


if __name__ == '__main__':
    table_api_demo()

 

The results are as follows:

 

4> +I(a1)
3> +I(b0)
2> +I(b1)
1> +I(37)
3> +I(74)
4> +I(3d)
1> +I(07)
2> +I(f4)
1> +I(7f)
2> +I(da)

 

PyFlink DataStream API Job

 

1) Create StreamExecutionEnvironment object

 

For a DataStream API job, the user first needs to define a StreamExecutionEnvironment object.

 

 
env = StreamExecutionEnvironment.get_execution_environment()

 

2) Configure the execution parameters of the job

 

The execution parameters of a job can be configured as follows. The following example sets the default parallelism of the job to 4.

 

 
env.set_parallelism(4)

 

3) Create a data source

 

Next, you need to create a data source for the job. There are several ways to define data sources in PyFlink.

 

Mode 1: from_collection

 

PyFlink allows users to create a source DataStream from a list. The following example defines a DataStream containing three rows of data: [(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')]. It has two columns, of type INT and STRING respectively.

 

 
ds = env.from_collection(
        collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
        type_info=Types.ROW([Types.INT(), Types.STRING()]))

 

Explain:

 

  • This is often used during the testing phase to create a data source easily.

  • The from_collection method takes two parameters: the first specifies the list of data; the second specifies the type of the data (see the sketch below).
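To illustrate the second parameter, the same data could also be declared with a tuple type; a minimal sketch:

# Same data as above, typed as a tuple instead of a row
ds_tuple = env.from_collection(
    collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
    type_info=Types.TUPLE([Types.INT(), Types.STRING()]))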

 

Mode 2: Use the connector defined in the PyFlink DataStream API

 

Additionally, you can use connectors already supported in the PyFlink DataStream API; note that only the Kafka connector is supported in 1.12.

 

 
from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.datastream.connectors import FlinkKafkaConsumer

deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_consumer = FlinkKafkaConsumer(
    topics='test_source_topic',
    deserialization_schema=deserialization_schema,
    properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds = env.add_source(kafka_consumer)

 

Explain:

 

  • Kafka connector is not currently included in Flink's official distribution package. If you need to use it in a PyFlink job, you need to explicitly specify the appropriate FAT JAR [2]. JAR packages can be specified as follows:

 

 

# Note: the file:/// prefix cannot be omitted
env.add_jars("file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")

 

  • Even for PyFlink DataStream API jobs, the FAT JARs packaged for the Table & SQL connectors are recommended, to avoid recursive dependency issues.

 

Mode 3: Use the connector defined in the PyFlink Table API

 

The following example shows how connectors supported in Table & SQL can be used as sources for PyFlink DataStream API jobs.

 

t_env = StreamTableEnvironment.create(stream_execution_environment=env)

t_env.execute_sql("""
        CREATE TABLE my_source (
          a INT,
          b VARCHAR
        ) WITH (
          'connector' = 'datagen',
          'number-of-rows' = '10'
        )
    """)

ds = t_env.to_append_stream(
    t_env.from_path('my_source'),
    Types.ROW([Types.INT(), Types.STRING()]))

 

Explain:

 

  • Because the PyFlink DataStream API has only a few built-in connectors, it is recommended to create the data source tables used in PyFlink DataStream API jobs in this way, so that all connectors available in the PyFlink Table API can also be used in PyFlink DataStream API jobs.

 

  • It is important to note that the TableEnvironment needs to be created via StreamTableEnvironment.create(stream_execution_environment=env), so that the PyFlink DataStream API and the PyFlink Table API share the same StreamExecutionEnvironment object.

 

4) Define the calculation logic of the job

 

After generating the DataStream object corresponding to the data source, you can then use the various operations defined in the PyFlink DataStream API to define calculation logic and transform the DataStream object, such as:

 

def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: (i[0] + j[0], i[1]))

 

5) Write out the result data

 

Mode 1: print

 

You can call the print method on the DataStream object to print the results of the DataStream to standard output, for example:

 

 
ds.print()

 

Mode 2: Use the connector defined in the PyFlink DataStream API

 

You can directly use connectors already supported in the PyFlink DataStream API; note that 1.12 provides support for the FileSystem, JDBC and Kafka connectors. Taking Kafka as an example:

 

from pyflink.common.serialization import JsonRowSerializationSchema
from pyflink.datastream.connectors import FlinkKafkaProducer

serialization_schema = JsonRowSerializationSchema.builder() \
    .with_type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_producer = FlinkKafkaProducer(
    topic='test_sink_topic',
    serialization_schema=serialization_schema,
    producer_config={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds.add_sink(kafka_producer)

 

Explain:

 

  • The JDBC and Kafka connectors are not currently included in the official Flink distribution. If you need to use them in a PyFlink job, you need to explicitly specify the corresponding FAT JAR. For Kafka, for example, the JAR package [2] can be used and specified as follows:

 

# Note: the file:/// prefix cannot be omitted
env.add_jars("file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")

 

  • The FAT JARs packaged for the Table & SQL connectors are recommended, to avoid recursive dependency issues.

 

Mode 3: Use the connector defined in the PyFlink Table API

 

The following example shows how to use a supported connector in Table & SQL as a sink for a PyFlink DataStream API job.

 

 
# Option 1: ds is of type Types.ROW
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield Row(s[0], sp)

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split, Types.ROW([Types.INT(), Types.STRING()])) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: Row(i[0] + j[0], i[1]))

# Option 2: ds is of type Types.TUPLE
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: (i[0] + j[0], i[1]))

# Write ds to the sink
t_env.execute_sql("""
        CREATE TABLE my_sink (
          a INT,
          b VARCHAR
        ) WITH (
          'connector' = 'print'
        )
    """)

table = t_env.from_data_stream(ds)
table_result = table.execute_insert("my_sink")

 

Explain:

 

  • Note that the result type of the ds object passed to t_env.from_data_stream(ds) must be a composite type, either Types.ROW or Types.TUPLE; this is why the result type of the flat_map operation in the job's calculation logic needs to be declared explicitly.

 

  • The job needs to be submitted via the job submission methods provided in the PyFlink Table API.

 

  • Since the PyFlink DataStream API currently supports only a few connectors, it is recommended to define the sinks used in PyFlink DataStream API jobs in this way, so that all connectors available in the PyFlink Table API can be used as sinks for PyFlink DataStream API jobs.

 

6) Summary

 

The complete job example is as follows:

 

Mode 1 (for debugging):

 

 
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment


def data_stream_api_demo():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(4)

    ds = env.from_collection(
        collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
        type_info=Types.ROW([Types.INT(), Types.STRING()]))

    def split(s):
        splits = s[1].split("|")
        for sp in splits:
            yield s[0], sp

    ds = ds.map(lambda i: (i[0] + 1, i[1])) \
           .flat_map(split) \
           .key_by(lambda i: i[1]) \
           .reduce(lambda i, j: (i[0] + j[0], i[1]))

    ds.print()

    env.execute()


if __name__ == '__main__':
    data_stream_api_demo()

 

The results are as follows:

 

 
3> (2, 'aaa')
3> (2, 'bb')
3> (6, 'aaa')
3> (4, 'a')
3> (5, 'bb')
3> (7, 'a')

 

Mode 2 (for production jobs):

 

 
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment


def data_stream_api_demo():
    env = StreamExecutionEnvironment.get_execution_environment()
    t_env = StreamTableEnvironment.create(stream_execution_environment=env)
    env.set_parallelism(4)

    t_env.execute_sql("""
            CREATE TABLE my_source (
              a INT,
              b VARCHAR
            ) WITH (
              'connector' = 'datagen',
              'number-of-rows' = '10'
            )
        """)

    ds = t_env.to_append_stream(
        t_env.from_path('my_source'),
        Types.ROW([Types.INT(), Types.STRING()]))

    def split(s):
        splits = s[1].split("|")
        for sp in splits:
            yield s[0], sp

    ds = ds.map(lambda i: (i[0] + 1, i[1])) \
           .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
           .key_by(lambda i: i[1]) \
           .reduce(lambda i, j: (i[0] + j[0], i[1]))

    t_env.execute_sql("""
            CREATE TABLE my_sink (
              a INT,
              b VARCHAR
            ) WITH (
              'connector' = 'print'
            )
        """)

    table = t_env.from_data_stream(ds)
    table_result = table.execute_insert("my_sink")

    # 1) Wait for the job to finish. This is needed for local execution; otherwise the script may
    #    exit before the job finishes, causing the minicluster to exit prematurely.
    # 2) When the job is submitted to a remote cluster (e.g. YARN/Standalone/K8s) in detached mode,
    #    this call needs to be removed.
    table_result.wait()


if __name__ == '__main__':
    data_stream_api_demo()

 

 

 

3. Job submission

 

Flink provides a variety of job deployment methods, such as local, standalone, YARN, K8s, etc. PyFlink also supports these. Please refer to Flink's official document [3] for more details.

 

local

 

Description: When a job is executed in this way, a minicluster is started and the job is submitted to the minicluster for execution, which is suitable for the job development phase.

 

Example: python3 table_api_demo.py

 

standalone

 

Description: When a job is executed this way, it is submitted to a remote standalone cluster.

 

Example:

 

./bin/flink run --jobmanager localhost:8081 --python table_api_demo.py

 

YARN Per-Job

 

Description: When a job is executed this way, it is submitted to a remote YARN cluster.

 

Example:

 

./bin/flink run --target yarn-per-job --python table_api_demo.py

 

K8s application mode

 

Description: When a job is executed in this way, it is submitted to the K8s cluster and executed in application mode.

 

Example:

 

./bin/flink run-application \
    --target kubernetes-application \
    --parallelism 8 \
    -Dkubernetes.cluster-id=<ClusterId> \
    -Dtaskmanager.memory.process.size=4096m \
    -Dkubernetes.taskmanager.cpu=2 \
    -Dtaskmanager.numberOfTaskSlots=4 \
    -Dkubernetes.container.image=<PyFlinkImageName> \
    --pyModule table_api_demo \
    --pyFiles file:///path/to/table_api_demo.py

 

Parameter Description

 

In addition to the parameters mentioned above, there are other parameters associated with the PyFlink job when submitted via flink run.

 

Parameter Name / Purpose Description / Example

-py / --python
Specifies the entry point Python file of the job.
Example: -py file:///path/to/table_api_demo.py

-pym / --pyModule
Specifies the entry module of the job. Its purpose is similar to --python; it can be used when the job's Python files are packaged (e.g. as a zip) and cannot be specified via --python, and it is more general than --python.
Example: -pym table_api_demo -pyfs file:///path/to/table_api_demo.py

-pyfs / --pyFiles
Specifies one or more Python files (.py/.zip etc., comma-separated). When the job runs, they are placed on the PYTHONPATH of the Python process and can be accessed from Python custom functions.
Example: -pyfs file:///path/to/table_api_demo.py,file:///path/to/deps.zip

-pyarch / --pyArchives
Specifies one or more archive files (comma-separated). When the job runs, they are unzipped into the working directory of the Python process and can be accessed via relative paths.
Example: -pyarch file:///path/to/venv.zip

-pyexec / --pyExecutable
Specifies the path of the Python interpreter used to execute the job.
Example: -pyarch file:///path/to/venv.zip -pyexec venv.zip/venv/bin/python3

-pyreq / --pyRequirements
Specifies a requirements file, which defines the job's dependencies.
Example: -pyreq requirements.txt

 

4. Problem Investigation

 

When getting started with PyFlink job development, you will inevitably run into all kinds of problems, and learning how to troubleshoot them is very important. Next, we introduce some common troubleshooting methods.

 

Client-side exception output

 

PyFlink jobs follow the same submission process as other Flink jobs: the job is first compiled into a JobGraph on the client side and then submitted to the Flink cluster for execution. If there is a problem compiling the job, an exception is thrown when the job is submitted on the client side, and you will see output like this on the client:

 

 
Traceback (most recent call last):  File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 50, in <module>    data_stream_api_demo()  File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 45, in data_stream_api_demo    table_result = table.execute_insert("my_")  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/table/table.py", line 864, in execute_insert    return TableResult(self._j_table.executeInsert(table_path, overwrite))  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/py4j/java_gateway.py", line 1285, in __call__    return_value = get_return_value(  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/util/exceptions.py", line 162, in deco    raise java_exceptionpyflink.util.exceptions.TableException: Sink `default_catalog`.`default_database`.`my_` does not exists     at org.apache.flink.table.planner.delegation.PlannerBase.translateToRel(PlannerBase.scala:247)     at org.apache.flink.table.planner.delegation.PlannerBase$$anonfun$1.apply(PlannerBase.scala:159)     at org.apache.flink.table.planner.delegation.PlannerBase$$anonfun$1.apply(PlannerBase.scala:159)     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)     at scala.collection.Iterator$class.foreach(Iterator.scala:891)     at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)     at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)     at scala.collection.AbstractIterable.foreach(Iterable.scala:54)     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)     at scala.collection.AbstractTraversable.map(Traversable.scala:104)     at org.apache.flink.table.planner.delegation.PlannerBase.translate(PlannerBase.scala:159)     at org.apache.flink.table.api.internal.TableEnvironmentImpl.translate(TableEnvironmentImpl.java:1329)     at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:676)     at org.apache.flink.table.api.internal.TableImpl.executeInsert(TableImpl.java:572)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)     at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)     at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)     at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)     at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)     at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)     at java.lang.Thread.run(Thread.java:748)
Process finished with exit code 1

 

The above error indicates that the table named "my_" used in the job does not exist.

 

TaskManager log file

 

Some errors do not occur until the job is running, for example dirty data or bugs in the implementation of a Python custom function. For these errors, you often need to check the TaskManager log files. For example, the following error shows that the opencv library accessed in a Python custom function is not installed.

 

 
Caused by: java.lang.RuntimeException: Error received from SDK harness for instruction 2: Traceback (most recent call last):  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 253, in _execute    response = task()  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 310, in <lambda>    lambda: self.create_worker().do_instruction(request), request)  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 479, in do_instruction    return getattr(self, request_type)(  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 515, in process_bundle    bundle_processor.process_bundle(instruction_id))  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 977, in process_bundle    input_op_by_transform_id[element.transform_id].process_encoded(  File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 218, in process_encoded    self.output(decoded_value)  File "apache_beam/runners/worker/operations.py", line 330, in apache_beam.runners.worker.operations.Operation.output  File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output  File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive  File "pyflink/fn_execution/beam/beam_operations_fast.pyx", line 71, in pyflink.fn_execution.beam.beam_operations_fast.FunctionOperation.process  File "pyflink/fn_execution/beam/beam_operations_fast.pyx", line 85, in pyflink.fn_execution.beam.beam_operations_fast.FunctionOperation.process  File "pyflink/fn_execution/coder_impl_fast.pyx", line 83, in pyflink.fn_execution.coder_impl_fast.DataStreamFlatMapCoderImpl.encode_to_stream  File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 26, in split    import cv2ModuleNotFoundError: No module named 'cv2'
    at org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:177)    at org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:157)    at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:251)    at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)    at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)    at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309)    at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292)    at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:782)    at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)    at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)    ... 1 more

 

Explain:

 

  • In local mode, the TaskManager log is located in the PyFlink installation directory, under site-packages/pyflink/log/. It can also be located with the following commands:

     

     

     

>>> import pyflink
>>> print(pyflink.__path__)
['/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink']

The log files are then located in the /Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/log directory.

 

 

Custom Log

 

Sometimes the content of the exception log is not enough to locate the problem; in that case you can print some log messages inside the Python custom function. PyFlink supports outputting logs in Python custom functions via the standard logging module, for example:

 

 
def split(s):
    import logging
    logging.info("s: " + str(s))
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

 

In this way, the input parameters to the split function are printed to the TaskManager log file.

 

Remote Debugging

 

A PyFlink job starts a separate Python process to execute Python custom functions at runtime, so if you need to debug a Python custom function, you need to do so via remote debugging. See [4] to learn how to debug Python remotely in PyCharm.

 

1) Install pydevd-pycharm in a Python environment:

 

pip install pydevd-pycharm~=203.7717.65

 

2) Set remote debugging parameters in Python custom functions:

 

 
def split(s):
    import pydevd_pycharm
    pydevd_pycharm.settrace('localhost', port=6789, stdoutToServer=True, stderrToServer=True)
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

 

3) Follow the remote-debugging steps in PyCharm; you can refer to [4], or to the "Code debugging" section of the blog post [5].

 

Note: Python remote debugging is only supported in PyCharm Professional.

 

Community User Mailing List

 

If the problem is still not solved after the above steps, you can also subscribe to the Flink user mailing list [6] and send the problem there. When sending a problem to the mailing list, describe it as clearly as possible; it is best to include code and data that reproduce the problem. You can refer to this message [7] as an example.

 

DingTalk group

 

In addition, you are welcome to join the PyFlink Exchange Group to share PyFlink related issues.

 

 

 

 

5. Summary

 

In this article, we introduced environment preparation, job development, job submission and troubleshooting for PyFlink API jobs, hoping to help users quickly build their own Flink jobs in Python. Going forward, we will continue to publish articles in the PyFlink series to help PyFlink users gain deeper insight into PyFlink features, application scenarios, best practices, and more.

 

In addition, we have launched a questionnaire and hope you will actively participate to help us better organize PyFlink-related learning materials. After completing the questionnaire, you can enter a draw for a custom Flink polo shirt; the draw opens at 12:00 on April 30.

 

 

Reference Links

 

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/

[2] https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.12.0/flink-sql-connector-kafka_2.11-1.12.0.jar

[3] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/cli.html#submitting-pyflink-jobs

[4] https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-debug-config

[5] https://mp.weixin.qq.com/s?__biz=MzIzMDMwNTg3MA==&mid=2247485386&idx=1&sn=da24e5200d72e0627717494c22d0372e&chksm=e8b43eebdfc3b7fdbd10b49e6749cb761b7aa5f8ddc90b34eb3170119a8bbb3ddd7327acb712&scene=178&cur_album_id=1386152464113811456#rd

[6] https://flink.apache.org/community.html#mailing-lists

[7] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/PyFlink-called-already-closed-and-NullPointerException-td42997.html

Topics: flink