Business Data Diversion for Flink Real-Time Projects

Posted by apervizi on Mon, 21 Feb 2022 18:40:24 +0100

In the previous article we obtained the split output streams of the business data: the side-output stream of DIM-layer dimension data and the main stream of DWD-layer fact data. The next step is to route each stream to its target storage: the DIM layer is written to HBase, and the DWD layer is written back to Kafka.

1. Sink the split dimension tables to HBase

From the previous article, the dimension data is in the side-output stream hbaseDs and the fact data stays in the main stream derived from filterDs, as shown below:

//5. Split the stream dynamically: fact tables go to Kafka, dimension tables go to HBase
OutputTag<JSONObject> hbaseTag = new OutputTag<JSONObject>(TableProcess.SINK_TYPE_HBASE){};
//Process with the custom TableProcessFunction; the main stream carries the Kafka-bound data
SingleOutputStreamOperator<JSONObject> kafkaTag = filterDs.process(new TableProcessFunction(hbaseTag));
//Dimension data is retrieved from the side output
DataStream<JSONObject> hbaseDs = kafkaTag.getSideOutput(hbaseTag);
filterDs.print("json str --->>");

The process is as follows:

Custom RichSinkFunction class: DimSink.java

  • Initialize phoenix connection
  • Save data

1.1 Configuration

In the BaseDbTask job we have already obtained the HBase-bound output stream, so we can now set up everything needed to write to HBase.

Add the Phoenix dependency:

<dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-spark</artifactId>
    <version>5.0.0-HBase-2.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.glassfish</groupId>
            <artifactId>javax.el</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Modify hbase-site.xml. Because a separate schema is used, hbase-site.xml also has to be included in the IDEA project.

To enable the mapping between HBase namespaces and Phoenix schemas, this configuration file must be added to the program. The same two namespace-mapping properties below (phoenix.schema.isNamespaceMappingEnabled and phoenix.schema.mapSystemTablesToNamespace) must also be added to the hbase-site.xml used by HBase and Phoenix on the Linux servers, then distributed with xsync.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop101:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop101,hadoop102,hadoop103</value>
    </property>
    <property>
        <name>hbase.table.sanity.checks</name>
        <value>false</value>
    </property>
    <property>
        <name>phoenix.schema.isNamespaceMappingEnabled</name>
        <value>true</value>
    </property>
    <property>
        <name>phoenix.schema.mapSystemTablesToNamespace</name>
        <value>true</value>
    </property>
</configuration>

1.2 Create Namespace

Execute in phoenix

create schema GMALL_REALTIME;
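
The DimSink class in the next section reads its Phoenix connection string and schema name from a GmallConfig constants class that is not shown in this article. A minimal sketch of what it might contain, using the zookeeper quorum from hbase-site.xml (the package name and the port 2181 are assumptions):

package com.zhangbao.gmall.realtime.common;

/**
 * Project-wide constants used by DimSink (sketch only; actual values depend on the environment).
 */
public class GmallConfig {
    //Phoenix JDBC URL built from the HBase zookeeper quorum (port assumed to be 2181)
    public static final String PHOENIX_SERVER = "jdbc:phoenix:hadoop101,hadoop102,hadoop103:2181";
    //Schema created in section 1.2
    public static final String HBASE_SCHEMA = "GMALL_REALTIME";
}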

1.3 DimSink.java

Custom addSink class

package com.zhangbao.gmall.realtime.app.func;
import com.alibaba.fastjson.JSONObject;
import com.google.common.base.Strings;
import com.zhangbao.gmall.realtime.common.GmallConfig;
import lombok.extern.log4j.Log4j2;
import org.apache.commons.lang3.StringUtils;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
/**
 * @author: zhangbao
 * @date: 2021/9/4 12:23
 * @desc: Write dimension tables to hbase
 **/
@Log4j2
public class DimSink extends RichSinkFunction<JSONObject> {
    private Connection conn = null;

    @Override
    public void open(Configuration parameters) throws Exception {
        log.info("Establishing phoenix connection...");
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        conn = DriverManager.getConnection(GmallConfig.PHOENIX_SERVER);
        log.info("Phoenix connection successful!");
    }

    @Override
    public void invoke(JSONObject jsonObject, Context context) throws Exception {
        String sinkTable = jsonObject.getString("sink_table");
        JSONObject data = jsonObject.getJSONObject("data");
        PreparedStatement ps = null;
        if (data != null && data.size() > 0) {
            try {
                //Generate a phoenix upsert statement, which covers both insert and update
                String sql = generateUpsert(data, sinkTable.toUpperCase());
                log.info("Executing phoenix sql -->{}", sql);
                ps = conn.prepareStatement(sql);
                ps.executeUpdate();
                conn.commit();
                log.info("Phoenix sql executed successfully");
            } catch (SQLException throwables) {
                throwables.printStackTrace();
                throw new RuntimeException("Phoenix sql execution failed!");
            } finally {
                if (ps != null) {
                    ps.close();
                }
            }
        }
    }

    //Generate the upsert sql, e.g. upsert into SCHEMA.TABLE(id,name) values('11','22')
    private String generateUpsert(JSONObject data, String sinkTable) {
        StringBuilder sql = new StringBuilder();
        sql.append("upsert into " + GmallConfig.HBASE_SCHEMA + "." + sinkTable + "(");
        //Join the column names
        sql.append(StringUtils.join(data.keySet(), ",")).append(")");
        //Fill in the values
        sql.append("values('" + StringUtils.join(data.values(), "','") + "')");
        return sql.toString();
    }
}

Then register the sink in the main program:

//6. Write dimension tables into hbase
hbaseDs.addSink(new DimSink());

1.4 Test

  • Services that need to be started

hdfs,zk,kafka,Maxwell,hbase,BaseDbTask.java

  • Modify configuration data: gmall2021_realtime.table_process
INSERT INTO `gmall2021_realtime`.`table_process` (`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_trademark', 'insert', 'hbase', 'dim_base_trademark', 'id,tm_name', 'id', NULL);

This configuration row means that when data is inserted into the base_trademark table, it is synchronized to HBase as dimension data and the target HBase table is created automatically (a sketch of the matching TableProcess bean follows at the end of this section).

  • Modify table data in business library: gmall2021.base_trademark
INSERT INTO `gmall2021`.`base_trademark` (`id`, `tm_name`, `logo_url`) VALUES ('15', '55', '55');
  • View phoenix data: select * from GMALL_REALTIME.BASE_TRADEMARK;

The data has been synchronized to hbase in real time.
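
The columns of the table_process row above correspond to the TableProcess configuration bean referenced by the split code (TableProcess.SINK_TYPE_HBASE). A minimal sketch of that bean, assuming a Lombok @Data POJO in the project's bean package (the actual class from the previous article may differ):

package com.zhangbao.gmall.realtime.bean;

import lombok.Data;

/**
 * Configuration bean mirroring gmall2021_realtime.table_process (sketch; field names follow the table columns).
 */
@Data
public class TableProcess {
    //Sink type constants used when splitting the stream
    public static final String SINK_TYPE_HBASE = "hbase";
    public static final String SINK_TYPE_KAFKA = "kafka";

    private String sourceTable;
    private String operateType;
    private String sinkType;
    private String sinkTable;
    private String sinkColumns;
    private String sinkPk;
    private String sinkExtend;
}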

2. Sink the split fact tables to Kafka

2.1 Define a new method in MyKafkaUtil

Define a new producer method in MyKafkaUtil that lets the serialization schema choose the topic dynamically; if no topic is specified, records go to the default topic default_data.

/**
 * Produce to different topics dynamically; if the schema does not set a topic, records go to the default topic
 * @param serializationSchema serialization schema that can choose the target topic per record
 */
public static <T> FlinkKafkaProducer<T> getKafkaBySchema(KafkaSerializationSchema<T> serializationSchema) {
    Properties pros = new Properties();
    pros.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, KAFKA_HOST);
    return new FlinkKafkaProducer<>(DEFAULT_TOPIC, serializationSchema, pros, FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
}
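
The method references two constants, KAFKA_HOST and DEFAULT_TOPIC, defined elsewhere in MyKafkaUtil. Assuming the broker list from the consumer command in section 2.2 and the default topic mentioned above, they might look like this:

// Assumed constants inside MyKafkaUtil (broker list and topic name are illustrative)
private static final String KAFKA_HOST = "hadoop101:9092,hadoop102:9092,hadoop103:9092";
private static final String DEFAULT_TOPIC = "default_data";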

Use it in the main task BaseDbTask:

//7. Write fact data back to kafka
FlinkKafkaProducer<JSONObject> kafkaBySchema = MyKafkaUtil.getKafkaBySchema(new KafkaSerializationSchema<JSONObject>() {
    @Override
    public void open(SerializationSchema.InitializationContext context) throws Exception {
        System.out.println("kafka serialize open");
    }
    @Override
    public ProducerRecord<byte[], byte[]> serialize(JSONObject jsonObject, @Nullable Long timestamp) {
        //The sink_table field decides which topic each record is written to
        String sinkTopic = jsonObject.getString("sink_table");
        return new ProducerRecord<>(sinkTopic, jsonObject.getJSONObject("data").toString().getBytes());
    }
});
kafkaTag.addSink(kafkaBySchema);

2.2 Test

  • Services that need to be started

hdfs,zk,kafka,Maxwell,hbase,BaseDbTask.java

  • Modify configuration information: gmall2021_realtime.table_process
INSERT INTO `gmall2021_realtime`.`table_process` (`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('order_info', 'insert', 'kafka', 'dwd_order_info', 'id,consignee,consignee_tel,total_amount,order_status,user_id,payment_way,delivery_address,order_comment,out_trade_no,trade_body,create_time,operate_time,expire_time,process_status,tracking_no,parent_order_id,img_url,province_id,activity_reduce_amount,coupon_reduce_amount,original_total_amount,feight_fee,feight_fee_reduce,refundable_time', 'id', NULL);

This row means that when data is inserted into the order_info table, it is synchronized to Kafka under the topic dwd_order_info.

  • Start kafka consumers to see if there is data coming in

[zhangbao@hadoop101 root]$ cd /opt/module/kafka/bin/

[zhangbao@hadoop101 bin]$ ./kafka-console-consumer.sh --bootstrap-server hadoop101:9092,hadoop102:9092,hadoop103:9092 --topic dwd_order_info

  • Finally, start the Business Data Generation Service: mock-db-0.0.1-SNAPSHOT.jar

Remember to change the data generation date in the mock configuration file first: 2021-09-12.

The kafka consumer now shows incoming records, which confirms the whole pipeline is working.

3. Introduction to Operator Selection

| Function           | Convertible Structure | Filterable Data | Side Output | open | Available State | Output To           |
| ------------------ | --------------------- | --------------- | ----------- | ---- | --------------- | ------------------- |
| MapFunction        | Yes                   |                 |             |      |                 | Downstream operator |
| FilterFunction     |                       | Yes             |             |      |                 | Downstream operator |
| RichMapFunction    | Yes                   |                 |             | Yes  | Yes             | Downstream operator |
| RichFilterFunction |                       | Yes             |             | Yes  | Yes             | Downstream operator |
| ProcessFunction    | Yes                   | Yes             | Yes         | Yes  | Yes             | Downstream operator |
| SinkFunction       | Yes                   | Yes             |             |      |                 | External            |
| RichSinkFunction   | Yes                   | Yes             |             | Yes  | Yes             | External            |
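
As the table shows, ProcessFunction is the only operator that supports side outputs, which is why the split in section 1 uses it. The real TableProcessFunction was built in the previous article and reads its routing rules from table_process; the following is only a simplified sketch of the side-output mechanics, with the sink_type field on the record assumed for illustration:

import com.alibaba.fastjson.JSONObject;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class TableProcessFunctionSketch extends ProcessFunction<JSONObject, JSONObject> {
    private final OutputTag<JSONObject> hbaseTag;

    public TableProcessFunctionSketch(OutputTag<JSONObject> hbaseTag) {
        this.hbaseTag = hbaseTag;
    }

    @Override
    public void processElement(JSONObject jsonObject, Context ctx, Collector<JSONObject> out) throws Exception {
        //Assumed field: in the real project the sink type comes from the table_process config lookup
        String sinkType = jsonObject.getString("sink_type");
        if (TableProcess.SINK_TYPE_HBASE.equals(sinkType)) {
            //Dimension data goes to the hbase side output
            ctx.output(hbaseTag, jsonObject);
        } else {
            //Fact data stays in the main stream and is later written to kafka
            out.collect(jsonObject);
        }
    }
}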

Topics: Big Data HBase flink computer