Apache RocketMQ + Hudi Quick Build Lakehouse

Posted by az_wraith on Tue, 15 Feb 2022 20:04:01 +0100

Table of Contents

  • Background Knowledge
  • Architectural Evolution in the Big Data Age
  • RocketMQ Connector & Stream
  • Apache Hudi
  • Building Lakehouse Practices

The title of this article contains three keywords: Lakehouse, RocketMQ, and Hudi. We will start from the overall Lakehouse architecture, then step by step analyze the reasons behind this architecture and the characteristics of its components, and finally walk through the practice of building a Lakehouse.

Background Knowledge

1. Lakehouse architecture

Lakehouse was originally proposed by Databricks, which requires a Lakehouse architecture to have the following features:

(1) Transaction support

Many data pipelines within an enterprise often read and write data concurrently. Support for ACID transactions ensures consistency when multiple parties read and write data concurrently.

(2) Schema enforcement and governance

Lakehouse should have a way to support schema enforcement and evolution, support DW schemas such as the star or snowflake model, be able to reason about data integrity, and have robust governance and auditing mechanisms;

(3) Openness

The storage formats used are open and standardized, such as Parquet, and provide APIs for a variety of tools and engines, including machine learning and Python/R libraries, so that they can access the data directly and effectively;

(4) BI support

Lakehouse allows BI tools to be used directly on the source data. This improves data freshness, reduces latency, and reduces the cost of operating two copies of the data in both a data lake and a data warehouse.

(5) Separation of storage from computation

In practice, this means that storage and computing use separate clusters, so these systems can scale to support greater user concurrency and data volumes. Some modern data warehouses also have this attribute;

(6) Support for multiple data types, from unstructured to structured data

Lakehouse can be used to store, optimize, analyze, and access many data applications including image, video, audio, text, and semi-structured data.

(7) Support various workloads

Includes data science, machine learning, and SQL and analysis. Multiple tools may be required to support these workloads, but their underlying layers depend on the same data repository;

(8) End-to-end streaming

Real-time reports are standard applications in many enterprises. Streaming support eliminates the need to build separate systems dedicated to serving real-time data applications.

From the above description of the Lakehouse architecture, we can see that for any single capability we can assemble a solution from a combination of open source products. However, at present there does not seem to be a general-purpose solution that supports all of these capabilities. Next, let's look at the mainstream data processing architectures of the big data age.

Architectural Evolution in the Big Data Age

1. Open Source Products in the Big Data Age

There are many kinds of open source products in the big data era, including RocketMQ and Kafka in the messaging field; Flink, Spark, and Storm in the computing field; and HDFS, HBase, Redis, ElasticSearch, Hudi, DeltaLake, etc. in the storage field.

Why are there so many open source products? In the big data age, with ever-growing data volumes and varied business requirements, this wide range of products gives architects choices to support all kinds of scenarios. However, the sheer number of products also creates difficulties for architects: selection is hard, trial-and-error costs are high, learning costs are high, and architectures become complex.

2. Current Mainstream Multilayer Architecture

Processing scenarios in the big data field include data analysis, BI, scientific computing, machine learning, and metric monitoring. For different scenarios, businesses choose different computing and storage engines according to their characteristics. For example, a combination of binlog + CDC + RocketMQ + Flink + HBase + ELK can be used for BI and dashboard visualization.

(1) Advantages of a multi-tier architecture: support for a wide range of business scenarios;

(2) Disadvantages of multi-tier architecture:

  • Long processing links and high latency;
  • More copies of the data, doubling the cost;
  • High learning costs;

The main reason for these disadvantages of the multilayer architecture is that the storage and computing links are too long. This raises several questions:

  • Do we really need so many solutions to support a wide range of business scenarios? Is the Lakehouse architecture a unified solution?
  • Can the multiple storage layers be merged? Can Hudi support multiple storage needs?
  • Can the computing layers in a multi-tier architecture be merged? Can RocketMQ Stream fuse the messaging and computing layers?

Current mainstream multitier architecture

3. Lakehouse architecture generation

The Lakehouse architecture is an upgraded version of the multi-tier architecture: it reduces the storage layer to a single level and further compresses the computing layer by fusing the messaging and computing layers, with RocketMQ Stream taking on the computing role. This gives the new architecture shown in the figure below. In the new architecture, message ingress and egress are implemented by RocketMQ Connector, and the message computing layer is implemented by RocketMQ Stream, so the intermediate flow of message computing is completed inside RocketMQ. The computed results are written into the lake, i.e. into Hudi, through the RocketMQ-Hudi-connector; Hudi supports multiple indexes and provides a unified API to output data to different products.

Lakehouse architecture

Below we analyze the characteristics of this architecture.

(1) Advantages of the Lakehouse architecture:

  • Shorter links, more suitable for real-time scenarios, high data freshness;
  • Controllable costs, with reduced storage costs;
  • Low learning cost, friendly to programmers;
  • Greatly reduced operational complexity;

(2) Disadvantages of Lakehouse architecture

The requirements on stability and ease of use are high for both the messaging product and the data lake product; in addition, the messaging product needs to support computing scenarios, and the data lake product needs to provide strong indexing capabilities.

(3) Selection

In the Lakehouse architecture we chose RocketMQ as the messaging product and Hudi as the data lake product.

At the same time, RocketMQ Stream can be used to integrate the computing layer into the RocketMQ cluster, reducing the computing layer to a single level that can satisfy most large and medium-sized data processing scenarios.

Next we will step by step analyze the features of both RocketMQ and Hudi products.

RocketMQ Connector & Stream

RocketMQ History Diagram

RocketMQ graduated from Apache incubation in 2017, RocketMQ 4.0 completed its initial cloud transformation in 2018, and in 2021 RocketMQ 5.0 fully integrated messages, events, and streams.

1. Business Messaging Preferences

RocketMQ has become the preferred messaging product in the business world as "a messaging product that lets you sleep soundly at night", mainly thanks to the following product features:

(1) Financial-grade high reliability

It has withstood the traffic peaks of Alibaba's Double 11;

(2) Minimalist architecture

As shown in the figure below, the architecture of RocketMQ consists of two main parts: the metadata cluster (NameServer Cluster) and the compute-storage cluster (Broker Cluster).

RocketMQ Architecture

NameServer nodes are stateless and can be scaled horizontally very easily. Broker nodes use a primary/standby approach to ensure high data reliability, support one-primary-multiple-standby deployments, and offer flexible configuration.

How to set it up: a RocketMQ cluster can be started with a few simple commands:

Jar:

nohup sh bin/mqnamesrv &
nohup sh bin/mqbroker -n localhost:9876 &

On K8S:

kubectl apply -f example/rocketmq_cluster.yaml

(3) Very low operating and maintenance costs

RocketMQ has low maintenance costs and provides a good CLI tool, MQAdmin. MQAdmin offers rich command support covering cluster health checks, cluster inbound/outbound traffic control, and many other aspects. For example, the mqadmin clusterList command returns the state of every node in the current cluster (production and consumption traffic, latency, queue length, disk watermark, etc.); the mqadmin updateBrokerConfig command can set the readable/writable state of a broker node or topic in real time, so temporarily unavailable nodes can be dynamically removed to achieve traffic migration for production and consumption.

(4) Rich message types

The types of messages supported by RocketMQ include normal messages, transaction messages, deferred messages, timed messages, sequential messages, and so on, easily supporting big data and business scenarios.
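
As a minimal sketch of how these message types are used, the snippet below sends a normal message and a deferred message with the RocketMQ Java client; the producer group, topic name, and payload are just placeholders:

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

public class MessageTypeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder producer group; point namesrvAddr at your own cluster
        DefaultMQProducer producer = new DefaultMQProducer("demo_producer_group");
        producer.setNamesrvAddr("localhost:9876");
        producer.start();

        // Normal message
        producer.send(new Message("demo_topic", "TagA", "hello".getBytes()));

        // Deferred message: delay level 3 maps to a 10s delay with the default broker configuration
        Message delayed = new Message("demo_topic", "TagA", "hello later".getBytes());
        delayed.setDelayTimeLevel(3);
        producer.send(delayed);

        producer.shutdown();
    }
}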

(5) High throughput and low latency

In stress-test scenarios with primary/standby synchronous replication, each Broker node can drive the disks to full utilization while keeping p99 latency at the millisecond level.

2. Overview of RocketMQ 5.0

RocketMQ 5.0 is a cloud-native hyper-converged platform for messaging, events, and streaming, born in the cloud and growing in the cloud, with the following features:

(1) Lightweight SDK

  • Full support for the cloud-native communication standard, the gRPC protocol;
  • Stateless Pop consumption mode, friendly to multiple languages and easy to integrate (a consumption sketch follows below);
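
For the Pop-based consumption model, a minimal sketch with the RocketMQ 5.x lightweight Java SDK might look as follows; the endpoint, topic, and group names are placeholders, and the exact builder methods should be checked against the client version you use:

import java.time.Duration;
import java.util.Collections;
import java.util.List;

import org.apache.rocketmq.client.apis.ClientConfiguration;
import org.apache.rocketmq.client.apis.ClientServiceProvider;
import org.apache.rocketmq.client.apis.consumer.FilterExpression;
import org.apache.rocketmq.client.apis.consumer.SimpleConsumer;
import org.apache.rocketmq.client.apis.message.MessageView;

public class PopConsumeDemo {
    public static void main(String[] args) throws Exception {
        ClientServiceProvider provider = ClientServiceProvider.loadService();
        // gRPC endpoint of the proxy; 8081 is a common default, adjust to your deployment
        ClientConfiguration config = ClientConfiguration.newBuilder()
                .setEndpoints("localhost:8081")
                .build();
        SimpleConsumer consumer = provider.newSimpleConsumerBuilder()
                .setClientConfiguration(config)
                .setConsumerGroup("demo_group")
                .setAwaitDuration(Duration.ofSeconds(10))
                .setSubscriptionExpressions(Collections.singletonMap("demo_topic", FilterExpression.SUB_ALL))
                .build();
        // Pop a batch of messages, process them, then ack; the client holds no queue assignment state
        List<MessageView> messages = consumer.receive(16, Duration.ofSeconds(30));
        for (MessageView message : messages) {
            System.out.println("received: " + message.getMessageId());
            consumer.ack(message);
        }
        consumer.close();
    }
}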

(2) Minimalist architecture

  • No external dependencies, reducing the operation and maintenance burden;
  • Loose coupling between nodes allows any service node to migrate at any time.

(3) Storage-compute separation that can be split or combined

  • Broker is upgraded to a truly stateless service node, with no binding;
  • Broker and Store nodes are deployed separately and scaled independently;
  • Multi-protocol standard support, no vendor lock-in;
  • Storage and compute can be split or combined to adapt to a variety of business scenarios and reduce the operation and maintenance burden;

As shown in the figure below, the computing cluster (Brokers) mainly includes the abstract model and the corresponding protocol adaptation, as well as consumption management and governance capabilities. The Store part is mainly divided into message storage, the CommitLog (multi-type, multi-modal message storage), and index storage, the Index (multivariate indexes). By making full use of cloud storage and placing the CommitLog and Index on a cloud file system, storage and computing are naturally separated.

(4) Multimode storage support

  • Satisfies high availability demands in different basic scenarios;
  • Take full advantage of cloud infrastructure to reduce costs;

(5) Cloud native infrastructure:

  • Cloud-native observability, standardized on OpenTelemetry;
  • One-click deployment and scaling delivery on Kubernetes.

RocketMQ 5.0: 2021 milestones and future planning

3. RocketMQ Connector

a. Traditional data streams

(1) Disadvantages of traditional data streams

  • Producer and consumer code must be implemented by each team itself, which is costly;
  • Data synchronization tasks are not managed uniformly;
  • Repeated development and uneven code quality;

(2) Solution: RocketMQ Connector

  • Jointly build and reuse data synchronization task code;
  • Unified management and scheduling to improve resource utilization;

b. RocketMQ Connector data synchronization process

Compared to traditional data streams, RocketMQ Connector data streams are characterized by unified management of sources and sinks, open source code, and an active community.

4. RocketMQ Connector architecture

As shown in the figure above, the RocketMQ Connector architecture mainly consists of the Runtime and Workers, together with ecosystem Sources & Sinks.

(1) Standard: OpenMessaging

(2) Ecology: supports most products in the data field, such as ActiveMQ, Cassandra, ES, JDBC, JMS, MongoDB, Kafka, RabbitMQ, MySQL, Flume, HBase, Redis, etc.

(3) Components: the Manager performs unified management and scheduling; if there are multiple tasks, all of them are load balanced and evenly distributed across different Workers, and Workers can be scaled horizontally.

5. RocketMQ Stream

RocketMQ Stream is the product that compresses the computing layer into a single layer. It supports common operators such as window, join, and dimension table, and is compatible with Flink SQL and UDF/UDAF/UDTF.
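
A rough sketch of a RocketMQ Stream job is shown below, following the rocketmq-streams DSL; the namespace, pipeline, topic, and group names are placeholders, and the exact method names may differ between rocketmq-streams versions, so treat this as an approximation rather than the definitive API:

import org.apache.rocketmq.streams.client.StreamBuilder;
import org.apache.rocketmq.streams.client.source.DataStreamSource;

public class StreamDemo {
    public static void main(String[] args) {
        // Namespace and pipeline names are placeholders
        DataStreamSource source = StreamBuilder.dataStream("demo_namespace", "demo_pipeline");
        source
            .fromRocketmq("demo_topic", "demo_consumer_group", "localhost:9876")
            .map(message -> message + " [processed]")   // simple per-record transformation
            .toPrint(1)                                 // print results; a real job would typically sink back to RocketMQ or storage
            .start();
    }
}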

Apache Hudi

Hudi is a streaming data lake platform that supports fast updates on large amounts of data. It has a built-in table format and a transactional storage layer, a series of table services, data services (out-of-the-box ingestion tools), and complete operation and monitoring tools. Hudi can offload storage to Alibaba Cloud OSS or AWS S3.

Hudi features include:

  • Transactional write, MVCC/OCC concurrency control;
  • Native support for record level updates and deletions;
  • Query-oriented optimization: automatic small-file management, a design optimized for incremental pulls, automatic compaction, and clustering to optimize file layout;

Apache Hudi is a complete data lake platform. Its characteristics are:

  • Each module is tightly integrated and self-managed;
  • Writes with Spark, Flink, and Java;
  • Queries with Spark, Flink, Hive, Presto, Trino, Impala, AWS Athena/Redshift, etc.;
  • Out-of-the-box tools/services for data operations.

Apache Hudi is optimized for three main types of scenarios:

1. Streaming stack

(1) Incremental processing;

(2) Fast and efficient;

(3) Row-oriented;

(4) Not optimized for scans;

2. Batch Stack

(1) Batch processing;

(2) Inefficient;

(3) Scan-oriented, columnar storage formats;

3. Incremental Processing Stack

(1) Incremental processing;

(2) Fast and efficient;

(3) Scan-oriented, columnar formats.

Building Lakehouse Practices

This section only describes the main processes and configuration items; the operational details of the local setup can be found in the appendix.

1. Preparations

RocketMQ version: 4.9.0

rocketmq-connect-hudi version: 0.0.1-SNAPSHOT

Hudi version: 0.8.0

2. Build RocketMQ-Hudi-connector

(1) Download:

git clone https://github.com/apache/roc...

(2) Configuration:

Set the connector-plugin path in /data/lakehouse/rocketmq-externals/rocketmq-connect/rocketmq-connect-runtime/target/distribution/conf/connect.conf

(3) Compilation:

cd rocketmq-externals/rocketmq-connect-hudi
mvn clean install -DskipTests -U

The rocketmq-connect-hudi-0.0.1-SNAPSHOT-jar-with-dependencies.jar produced by the build is the rocketmq-hudi-connector we need to use.

3. Running

(1) Start or use an existing RocketMQ cluster and initialize the metadata topics (the mqadmin commands to create them are shown in the appendix):

  • connector-cluster-topic (cluster information)
  • connector-config-topic (configuration information)
  • connector-offset-topic (sink consumption progress)
  • connector-position-topic (source data processing progress; to keep messages in order, each topic can have only one queue)

(2) Start the RocketMQ connector runtime

cd /data/lakehouse/rocketmq-externals/rocketmq-connect/rocketmq-connect-runtime
  sh ./run_worker.sh    ## Multiple Workers can be started

(3) Configure and launch the RocketMQ-hudi-connector task

Send a request to the RocketMQ connector runtime to create the task:

curl http://${runtime-ip}:${runtime-port}/connectors/${rocketmq-hudi-sink-connector-name}?config='{"connector-class":"org.apache.rocketmq.connect.hudi.connector.HudiSinkConnector","topicNames":"topicc","tablePath":"file:///tmp/hudi_connector_test","tableName":"hudi_connector_test_table","insertShuffleParallelism":"2","upsertShuffleParallelism":"2","deleteParallelism":"2","source-record-converter":"org.apache.rocketmq.connect.runtime.converter.RocketMQConverter","source-rocketmq":"127.0.0.1:9876","src-cluster":"DefaultCluster","refresh-interval":"10000","schemaPath":"/data/lakehouse/config/user.avsc"}'

Successful startup prints the following log:
2021-09-06 16:23:14 INFO pool-2-thread-1 - Open HoodieJavaWriteClient successfully

(4) At this point, data produced to the source topic is automatically written to the corresponding Hudi table and can be queried through Hudi's API (see the Spark read example in the appendix).

4. Configuration resolution

(1) The RocketMQ connector runtime needs the RocketMQ cluster information and the connector plug-in location, including: the worker node id (workerId), the port for receiving REST commands (httpPort), the RocketMQ cluster address (namesrvAddr), the local configuration storage directory (storePathRootDir), and the connector plug-in directory (pluginPaths). A complete connect.conf example is shown in the appendix.

RocketMQ connector configuration table

(2) The Hudi task needs the Hudi table path (tablePath) and table name (tableName), as well as the Avro schema file used by Hudi (schemaPath).
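
The schema file referenced by schemaPath is a standard Avro schema. A minimal sketch of what such a user.avsc might contain is shown below; the record name and fields are made-up examples rather than the schema shipped with the connector:

{
  "type": "record",
  "name": "User",
  "namespace": "org.apache.rocketmq.connect.hudi.demo",
  "fields": [
    { "name": "id",   "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "age",  "type": "int" }
  ]
}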

Hudi Task Configuration Table


Appendix: Build Lakehouse demo on your local Mac system

Components involved: rocketmq, rocketmq-connector-runtime, rocketmq-connect-hudi, hudi, hdfs, avro, spark-shell

0. Start HDFS

Download the hadoop package:

https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz

cd /Users/osgoo/Documents/hadoop-2.10.1
vi core-site.xml
<configuration>
<property>
 <name>fs.defaultFS</name>
 <!-- You can use the hostname command to view the host name; here the host name is hadoop1 -->
 <value>hdfs://localhost:9000</value>
</property>
<!-- Overrides the default configuration in core-default.xml -->
</configuration>

vi hdfs-site.xml
<configuration>
<property>
        <name>dfs.replication</name>
        <value>1</value>
  </property>
</configuration>

./bin/hdfs namenode -format
./sbin/start-dfs.sh 
jps    ## check that namenode and datanode are running
lsof -i:9000
./bin/hdfs dfs -mkdir -p /Users/osgoo/Downloads

1. Start the rocketmq cluster and create the rocketmq-connector built-in topics
Quick Start: https://rocketmq.apache.org/docs/quick-start/
sh mqadmin updatetopic -t connector-cluster-topic -n localhost:9876 -c DefaultCluster
sh mqadmin updatetopic -t connector-config-topic -n localhost:9876 -c DefaultCluster
sh mqadmin updatetopic -t connector-offset-topic -n localhost:9876 -c DefaultCluster
sh mqadmin updatetopic -t connector-position-topic -n localhost:9876 -c DefaultCluster

2. Create the source topic for data entering the lake: testhudi1
sh mqadmin updatetopic -t testhudi1 -n localhost:9876 -c DefaultCluster

3. Compile rocketmq-connect-hudi-0.0.1-SNAPSHOT-jar-with-dependencies.jar
cd rocketmq-connect-hudi
mvn clean install -DskipTests -U

4. Start the rocketmq-connector runtime
Configure connect.conf:
--------------
workerId=DEFAULT_WORKER_1
storePathRootDir=/Users/osgoo/Downloads/storeRoot

## Http port for user to access REST API
httpPort=8082

# Rocketmq namesrvAddr
namesrvAddr=localhost:9876

# Source or sink connector jar file dir; the default value is rocketmq-connect-sample
pluginPaths=/Users/osgoo/Downloads/connector-plugins
---------------
Copy rocketmq-hudi-connector.jar to pluginPaths=/Users/osgoo/Downloads/connector-plugins

sh run_worker.sh

5. Configure the into-the-lake (sink) task
curl http://localhost:8082/connectors/rocketmq-connect-hudi?config='{"connector-class":"org.apache.rocketmq.connect.hudi.connector.HudiSinkConnector","topicNames":"testhudi1","tablePath":"hdfs://localhost:9000/Users/osgoo/Documents/base-path7","tableName":"t7","insertShuffleParallelism":"2","upsertShuffleParallelism":"2","deleteParallelism":"2","source-record-converter":"org.apache.rocketmq.connect.runtime.converter.RocketMQConverter","source-rocketmq":"127.0.0.1:9876","source-cluster":"DefaultCluster","refresh-interval":"10000","schemaPath":"/Users/osgoo/Downloads/user.avsc"}'

6. Send a message to testhudi1
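
A minimal producer sketch for this step, using the RocketMQ Java client, is shown below. It assumes the sink expects a payload whose fields match the configured user.avsc (the id/name/age fields follow the made-up example schema above), so adjust the payload to your actual schema:

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class LakehouseProducerDemo {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("lakehouse_demo_group");
        producer.setNamesrvAddr("localhost:9876");
        producer.start();

        // Payload format is an assumption: a record matching the fields in user.avsc
        String payload = "{\"id\":1,\"name\":\"alice\",\"age\":30}";
        SendResult result = producer.send(new Message("testhudi1", payload.getBytes()));
        System.out.println("send status: " + result.getSendStatus());

        producer.shutdown();
    }
}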


7. Read with Spark

cd /Users/osgoo/Downloads/spark-3.1.2-bin-hadoop3.2/bin

./spark-shell \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "t7"
val basePath = "hdfs://localhost:9000/Users/osgoo/Documents/base-path7"

val tripsSnapshotDF = spark.
  read.
  format("hudi").
  load(basePath + "/*")
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

spark.sql("select * from hudi_trips_snapshot").show()

You are welcome to join the discussion and exchange with other RocketMQ enthusiasts:

DingTalk group (scan the QR code to join)

Topics: Alibaba Cloud, Message Queue, Cloud Native, Open Source