Flink (38): API calls of the Flink Table API and Flink SQL (continued)

Posted by svguerin3 on Sat, 18 Dec 2021 19:29:05 +0100

Contents

0. Links to related articles

1. Table query

1.1. Calling the Table API

1.2. SQL query

2. Convert DataStream to table

2.1. Code implementation

2.2. Correspondence between data type and Table schema

3. Convert DataSet to table

4. Create a Temporary View

0. Links to related articles

1. Table query

Using connectors to external systems, we can read and write data and register tables in the environment's Catalog. We can then run query transformations on those tables.

Flink provides us with two query methods: Table API and SQL.

1.1. Calling the Table API

The Table API is a query API embedded in the Scala and Java languages. Unlike SQL, Table API queries are not expressed as strings; instead they are built up step by step in the host language.

The Table API is based on the Table class, which represents a table and provides a complete set of transformation methods. Each method returns a new Table object representing the result of applying that transformation to the input Table. Relational transformations can therefore be composed through multiple method calls into a chained structure, for example Table.select(...).filter(...), where select(...) picks the specified fields from the Table and filter(...) applies a filter condition.

The implementation in Scala code is as follows:

// scan the registered "inputTable"
val sensorTable: Table = tableEnv.from("inputTable")

// select the id and temperature fields and keep only sensor_1
val resultTable: Table = sensorTable
    .select("id, temperature")
    .filter("id = 'sensor_1'")

The implementation in Java code is as follows:

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// register Orders table
// scan registered Orders table
Table orders = tableEnv.from("Orders");

// compute revenue for all customers from France
Table revenue = orders
    .filter($("cCountry").isEqual("FRANCE"))
    .groupBy($("cID"), $("cName"))
    .select($("cID"), $("cName"), $("revenue").sum().as("revSum"));

// emit or convert Table
// execute query

1.2. SQL query

Flink's SQL integration is based on Apache Calcite, which implements the SQL standard. In Flink, SQL query statements are defined as regular strings. The result of a SQL query is a new Table.

The code implementation is as follows:

val resultSqlTable: Table = tableEnv.sqlQuery("select id, temperature from inputTable where id = 'sensor_1'")

Or:

val resultSqlTable: Table = tableEnv.sqlQuery(
  """
    |select id, temperature
    |from inputTable
    |where id = 'sensor_1'
  """.stripMargin)

Or:

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// register Orders table
// compute revenue for all customers from France
Table revenue = tableEnv.sqlQuery(
    "SELECT cID, cName, SUM(revenue) AS revSum " +
    "FROM Orders " +
    "WHERE cCountry = 'FRANCE' " +
    "GROUP BY cID, cName"
  );
// emit or convert Table
// execute query
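
To actually emit the result, the Table can be executed and its rows printed directly; a minimal Scala sketch, assuming the revenue table defined above and Flink 1.11+:

// collect the query result to the client and print it
revenue.execute().print()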


// ============================================================================

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// register "Orders" table
// register "RevenueFrance" output table
// compute revenue for all customers from France and emit to "RevenueFrance"
tableEnv.executeSql(
    "INSERT INTO RevenueFrance " +
    "SELECT cID, cName, SUM(revenue) AS revSum " +
    "FROM Orders " +
    "WHERE cCountry = 'FRANCE' " +
    "GROUP BY cID, cName"
  );
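
The output table referenced by the INSERT INTO statement has to be registered first. One possible way, shown here only as a sketch in Scala using SQL DDL, where the column types and the 'print' connector are assumptions for illustration:

// register the "RevenueFrance" sink table via DDL (types and connector assumed)
tableEnv.executeSql(
  """
    |CREATE TABLE RevenueFrance (
    |  cID BIGINT,
    |  cName STRING,
    |  revSum BIGINT
    |) WITH (
    |  'connector' = 'print'
    |)
  """.stripMargin)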

Of course, aggregations can also be added. For example, we can count the number of temperature readings for each sensor:

// Implementation of TableAPI
val aggResultTable = sensorTable
    .groupBy('id)
    .select('id, 'id.count as 'count)

// Implementation of FlinkSQL
val aggResultSqlTable = tableEnv.sqlQuery("select id, count(id) as cnt from inputTable group by id")

Here, the fields specified in the Table API are preceded by a single quote ('). This is the Expression syntax defined in the Table API (Scala symbols) and is a convenient way to refer to the fields of a table.

Fields can also be written as a plain string enclosed in double quotes, or in this single-quote-plus-field-name form; the latter form is generally used in the code that follows.
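
The two styles side by side, as a sketch based on the sensorTable defined earlier:

// expression (symbol) style: a single quote before each field name
val bySymbols: Table = sensorTable.select('id, 'temperature)

// string style: the whole field list as one double-quoted string
val byString: Table = sensorTable.select("id, temperature")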

2. Convert DataStream to table

Flink allows us to convert between a Table and a DataStream: starting from a DataStream, we can read the data source as a stream, map it into a case class, and then convert it into a Table. The column fields of the Table are the fields of the case class, so there is no need to define a schema separately.

2.1. Code implementation

The implementation in code is very simple: just call tableEnv.fromDataStream(). By default, the schema of the converted Table corresponds one-to-one to the field definitions in the DataStream; the fields can also be specified separately.

This allows us to reorder fields, rename them, or select only some of them, which is equivalent to a map operation (or a select in the Table API).

The codes are as follows:

// read the source file as a stream and map each line into the SensorReading case class
val inputStream: DataStream[String] = env.readTextFile("sensor.txt")
val dataStream: DataStream[SensorReading] = inputStream
  .map(data => {
    val dataArray = data.split(",")
    SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
  })

// convert with the default schema (fields taken from the case class)
val sensorTable: Table = tableEnv.fromDataStream(dataStream)

// convert while selecting and renaming fields
val sensorTable2 = tableEnv.fromDataStream(dataStream, 'id, 'timestamp as 'ts)

2.2. Correspondence between data type and Table schema

In the example in the previous section, the data type in the DataStream is matched to the table's schema through the field names of the case class (name-based mapping), so fields can also be renamed with as.

The other way of matching is position-based mapping. When mapping by position, new field names can be specified directly.

// Name-based correspondence:
val sensorTable = tableEnv.fromDataStream(dataStream, 'timestamp as 'ts, 'id as 'myId, 'temperature)

// Position-based correspondence:
val sensorTable = tableEnv.fromDataStream(dataStream, 'myId, 'ts)

Flink's DataStream and DataSet APIs support multiple data types.

Composite types, such as tuples (built-in Scala and Java tuples), POJOs, Scala case classes, and Flink's Row type, allow nested data structures with multiple fields that can be accessed in Table expressions. Other types are treated as atomic types.

For tuple types and atomic types, position-based mapping is generally the better choice; if you insist on using names, that is also possible: for tuple types the default names are "_1" and "_2", while for atomic types the default name is "f0".
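
A short sketch of these defaults, assuming the env and tableEnv used earlier:

// tuple stream: default column names are _1 and _2;
// position-based mapping can rename them directly
val tupleStream: DataStream[(String, Long)] = env.fromElements(("sensor_1", 1547718199L))
val tupleTable = tableEnv.fromDataStream(tupleStream)                  // columns: _1, _2
val renamedTupleTable = tableEnv.fromDataStream(tupleStream, 'id, 'ts) // renamed by position

// atomic-type stream: the single column is named f0 by default
val tempStream: DataStream[Double] = env.fromElements(35.8, 36.1)
val tempTable = tableEnv.fromDataStream(tempStream)                    // column: f0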

3. Convert DataSet to table

// Gets the DataSet and specifies it as the Row type
DataSet<Row> trainData = env
        .readTextFile(path)
        .map(new RichMapFunction<String, Row>() {
            @Override
            public Row map(String value) throws Exception {
                JSONObject json = JSONObject.parseObject(value);

                Row row = new Row(4);
                row.setField(0, json.getLongValue("user_id"));
                row.setField(1, json.getLongValue("stall_id"));
                row.setField(2, json.getDoubleValue("score"));
                row.setField(3, json.getLongValue("stall_classify"));

                return row;
            }
        })
        .returns(new RowTypeInfo(
                BasicTypeInfo.LONG_TYPE_INFO,
                BasicTypeInfo.LONG_TYPE_INFO,
                BasicTypeInfo.DOUBLE_TYPE_INFO,
                BasicTypeInfo.LONG_TYPE_INFO
        ));

// Convert a DataSet into a Table, method 1
Table trainTable_1 = tableEnv.fromDataSet(trainData, "user_id, stall_id, score, stall_classify");

// Convert a DataSet into a Table, method 2 (used in Alink; DataSetConversionUtil is a utility class in Alink)
Table trainTable_2 = DataSetConversionUtil.toTable(
        envId,
        trainData,
        new String[]{"user_id", "stall_id", "score", "stall_classify"}
);

4. Create a Temporary View

The first way to create a temporary view is to convert it directly from a DataStream. As before, the fields can be taken over directly, or they can be specified explicitly during the conversion.

The code is as follows:

tableEnv.createTemporaryView("sensorView", dataStream)
tableEnv.createTemporaryView("sensorView", dataStream, 'id, 'temperature, 'timestamp as 'ts)

In addition, of course, you can also create views based on tables:

tableEnv.createTemporaryView("sensorView", sensorTable)

The schema of a View and a Table are identical. In fact, in the Table API, a View and a Table can be considered equivalent.
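
Once registered, the view can be referenced in SQL just like any other table; a minimal sketch:

val viewResult: Table = tableEnv.sqlQuery("select id, temperature from sensorView where id = 'sensor_1'")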

Topics: Big Data flink