Playing Hudi Docker Demo based on Ubuntu -- Spark write and query

Posted by dilum on Fri, 11 Feb 2022 17:01:45 +0100

Brief introduction

The last article, Playing Hudi Docker Demo based on Ubuntu (2) -- writing test data to Kafka, described how to write test data to the Kafka cluster.
This article describes how to use Spark to consume the Kafka data and write it to HDFS. Hudi is brought into Spark as a jar package.

Types of Hudi tables and queries

| Table type    | Supported query types                                          |
| ------------- | -------------------------------------------------------------- |
| Copy On Write | Snapshot queries, incremental queries                          |
| Merge On Read | Snapshot queries, incremental queries, read optimized queries  |

1. Table type

  • Copy On Write (COW): data is stored in columnar files (e.g. parquet), and historical data is merged synchronously at write time.
  • Merge On Read (MOR): data is stored in a mix of columnar files (e.g. parquet) and row-based files (e.g. avro). Incremental data is first written to the row-based files and is later merged, synchronously or asynchronously, into new columnar files. (A minimal Spark write sketch follows the trade-off table below.)
| Trade-off           | CopyOnWrite                                     | MergeOnRead                                    |
| ------------------- | ----------------------------------------------- | ---------------------------------------------- |
| Data latency        | Higher                                          | Lower                                          |
| Query latency       | Lower                                           | Higher                                         |
| Update cost (I/O)   | Higher (the entire parquet file is rewritten)   | Lower (increments are appended to log files)   |
| Write amplification | Larger                                          | Smaller (depends on the compaction strategy)   |
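
As a side note, the table type can also be set when writing through the Spark DataSource API instead of DeltaStreamer. The sketch below is not part of the demo: df is a hypothetical DataFrame holding stock_ticks records, and the record key / precombine / partition field names (key, ts, date) are assumptions for illustration only.

import org.apache.spark.sql.SaveMode

// df: hypothetical DataFrame with the stock_ticks fields
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").   // or "COPY_ON_WRITE"
  option("hoodie.datasource.write.recordkey.field", "key").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "date").
  option("hoodie.table.name", "stock_ticks_mor").
  mode(SaveMode.Append).
  save("/user/hive/warehouse/stock_ticks_mor")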

2. Query type

  • Snapshot Queries: query the latest snapshot of the table as of the last successful commit or compaction.
  • Incremental Queries: query only the data newly written after a specified commit or compaction.
  • Read Optimized Queries: limited to MergeOnRead tables; query only the data that has already been compacted into columnar base files. (A minimal read sketch with the query type option follows the table below.)

For a MergeOnRead table, choosing between the two query types involves the following trade-off:

| Trade-off     | Snapshot Queries | Read Optimized Queries |
| ------------- | ---------------- | ---------------------- |
| Data latency  | Lower            | Higher                 |
| Query latency | Higher           | Lower                  |
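
For illustration, when reading a Hudi table through the Spark DataSource API the query type is chosen with the query type read option, just as the incremental queries later in this article do. This is only a minimal sketch, to be run inside the spark-shell started below once the table exists; it assumes the read_optimized value behaves like querying the _ro Hive table:

import org.apache.hudi.DataSourceReadOptions

// Read the MergeOnRead table with an explicit query type:
// snapshot (default), read_optimized, or incremental
val roDF = spark.read.format("org.apache.hudi").
  option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL).
  load("/user/hive/warehouse/stock_ticks_mor")

roDF.createOrReplaceTempView("stock_ticks_mor_ro_tmp")
spark.sql("select symbol, max(ts) from stock_ticks_mor_ro_tmp group by symbol").show(false)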

Detailed steps

1. Enter the adhoc-2 container

sudo docker exec -it adhoc-2 /bin/bash

2. Execute spark-submit

Execute the following spark-submit command to start DeltaStreamer: it consumes data from the Kafka cluster and writes it to HDFS in COPY_ON_WRITE mode; the table name is stock_ticks_cow.

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts  \
  --target-base-path /user/hive/warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

Execute the following spark-submit command to start DeltaStreamer: it consumes data from the Kafka cluster and writes it to HDFS in MERGE_ON_READ mode; the table name is stock_ticks_mor.

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_mor \
  --target-table stock_ticks_mor \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --disable-compaction

3. View the HDFS files

stock_ticks_cow
The stock_ticks_cow table is partitioned by date. Under the partition directory there is a metadata file and a parquet data file, and the commit information can be seen in the .hoodie directory (list the layout with hdfs dfs -ls -R /user/hive/warehouse/stock_ticks_cow, the same command used again after the second batch below).

stock_ticks_mor
The stock_ticks_mor table looks similar at this point: the first batch only produces parquet base files, and log files appear only once updates are written.

4. Synchronize to the Hive metastore

Run the Hive sync tool for both tables. For the MergeOnRead table, the sync registers two Hive tables, one suffixed _ro and one suffixed _rt (see the table list in the next step).

/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive \
  --pass hive \
  --partitioned-by dt \
  --base-path /user/hive/warehouse/stock_ticks_cow \
  --database default \
  --table stock_ticks_cow

/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive \
  --pass hive \
  --partitioned-by dt \
  --base-path /user/hive/warehouse/stock_ticks_mor \
  --database default \
  --table stock_ticks_mor

5. Spark SQL query

Enter spark shell:

$SPARK_INSTALL/bin/spark-shell \
  --jars $HUDI_SPARK_BUNDLE \
  --master local[2] \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --deploy-mode client \
  --driver-memory 1G \
  --executor-memory 3G \
  --num-executors 1 \
  --packages org.apache.spark:spark-avro_2.11:2.4.4
  • stock_ticks_cow is the CopyOnWrite table
  • stock_ticks_mor_ro is the MergeOnRead table's view for read optimized queries
  • stock_ticks_mor_rt is the MergeOnRead table's view for snapshot queries
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1644547729231).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("show tables").show(100, false)
+--------+------------------+-----------+
|database|tableName         |isTemporary|
+--------+------------------+-----------+
|default |stock_ticks_cow   |false      |
|default |stock_ticks_mor_ro|false      |
|default |stock_ticks_mor_rt|false      |
+--------+------------------+-----------+

## Run max timestamp query against COW table
scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+

## Projection Query
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211022538859  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20220211022538859  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+

# Merge-On-Read Queries:
# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+

# Run Snapshot Query. Notice that the latest timestamp is again 10:29
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+

# Run Read Optimized and Snapshot project queries
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211022707523  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20220211022707523  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211022707523  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20220211022707523  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+

6. Write the second batch of data to Kafka

Exit the docker container and run the following on the Ubuntu command line:

cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P

7. Enter the adhoc-2 container and execute spark-submit to write the second batch of data to the Hudi tables

Enter the adhoc-2 container:

sudo docker exec -it adhoc-2 /bin/bash

Write the second batch of data to the Hudi CopyOnWrite table:

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

View the HDFS directory:

hdfs dfs -ls -R /user/hive/warehouse/stock_ticks_cow

Write the second batch of data to the Hudi MergeOnRead table:

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_mor \
  --target-table stock_ticks_mor \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --disable-compaction

View the HDFS directory:

hdfs dfs -ls -R /user/hive/warehouse/stock_ticks_mor

8. Spark SQL query

Enter spark shell:

$SPARK_INSTALL/bin/spark-shell \
  --jars $HUDI_SPARK_BUNDLE \
  --master local[2] \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --deploy-mode client \
  --driver-memory 1G \
  --executor-memory 3G \
  --num-executors 1 \
  --packages org.apache.spark:spark-avro_2.11:2.4.4
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1644571477181).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

# 1. Query CopyOnWrite table in snapshot mode
scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:59:00|
+------+-------------------+

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211022538859  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20220211064632375  |GOOG  |2018-08-31 10:59:00|9021  |1227.1993|1227.215|
+-------------------+------+-------------------+------+---------+--------+

# 2. Query CopyOnWrite table in incremental mode
scala> import org.apache.hudi.DataSourceReadOptions
scala> val hoodieIncViewDF =  spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20220211064632000").load("/user/hive/warehouse/stock_ticks_cow")

scala> hoodieIncViewDF.registerTempTable("stock_ticks_cow_incr_tmp1")

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow_incr_tmp1 where  symbol = 'GOOG'").show(100, false);
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211064632375  |GOOG  |2018-08-31 10:59:00|9021  |1227.1993|1227.215|
+-------------------+------+-------------------+------+---------+--------+


# 3. Query MergeOnRead table in read optimized mode
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+


scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211022538859  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20220211022538859  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+

# 4. Query MergeOnRead table in snapshot mode
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:59:00|
+------+-------------------+


scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211022538859  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20220211064632375  |GOOG  |2018-08-31 10:59:00|9021  |1227.1993|1227.215|
+-------------------+------+-------------------+------+---------+--------+

# 5. Query MergeOnRead table in incremental mode
scala> val hoodieIncViewDF =  spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20220211064632000").load("/user/hive/warehouse/stock_ticks_mor")


scala> hoodieIncViewDF.registerTempTable("stock_ticks_mor_incr_tmp1")


scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_incr_tmp1 where  symbol = 'GOOG'").show(100, false);
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211064632375  |GOOG  |2018-08-31 10:59:00|9021  |1227.1993|1227.215|
+-------------------+------+-------------------+------+---------+--------+

Note that for the MergeOnRead table, the read optimized query and the snapshot query now return different results: the second batch went into log files that have not been compacted yet (compaction was disabled with --disable-compaction), so only the snapshot query sees it.
Read optimized query:

+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211022538859  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20220211022538859  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|   <<<<<<<<<<<<<<<<<<
+-------------------+------+-------------------+------+---------+--------+

Snapshot query:

+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20220211022538859  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20220211064632375  |GOOG  |2018-08-31 10:59:00|9021  |1227.1993|1227.215|    <<<<<<<<<<<<<<<<<<
+-------------------+------+-------------------+------+---------+--------+

This shows the difference between read optimized query and snapshot query.

Topics: Big Data Docker Spark Ubuntu Hudi