DWM layer and DWS layer
1, Design
Design ideas
At the DWD layer we previously split the data by stream into independent Kafka topics. Before deciding how to process the data next, we need to think about which metrics we want to compute in real time.
Real-time computing is different from offline computing: its development, operation, and maintenance costs are much higher, so we must decide, based on the actual situation, whether it is necessary to build a large, all-encompassing middle layer like the one in an offline data warehouse.
If such a layer is not necessary, we only need to roughly plan which metrics must be computed in real time. Outputting these metrics in the form of topic-oriented wide tables is the job of the DWS layer.
Requirements analysis
| Statistical topic | Metric | Output mode | Calculation source | Source layer |
|---|---|---|---|---|
| Visitor | PV | Visualization dashboard | page_log, directly countable | dwd |
| Visitor | UV | Visualization dashboard | page_log, filtered and deduplicated | dwm |
| Visitor | Bounce rate | Visualization dashboard | page_log, behavior analysis | dwm |
| Visitor | Pages entered | Visualization dashboard | needs identification of the session-start mark | dwd |
| Visitor | Continuous visit duration | Visualization dashboard | page_log, directly countable | dwd |
| Commodity | Click | Multidimensional analysis | page_log, directly countable | dwd |
| Commodity | Favorite | Multidimensional analysis | favorites table | dwd |
| Commodity | Add to cart | Multidimensional analysis | cart table | dwd |
| Commodity | Place an order | Visualization dashboard | order wide table | dwm |
| Commodity | Payment | Multidimensional analysis | payment wide table | dwm |
| Commodity | Refund | Multidimensional analysis | refund table | dwd |
| Commodity | Comment | Multidimensional analysis | comment table | dwd |
| Region | PV | Multidimensional analysis | page_log, directly countable | dwd |
| Region | UV | Multidimensional analysis | page_log, filtered and deduplicated | dwm |
| Region | Place an order | Visualization dashboard | order wide table | dwm |
| Keyword | Search keyword | Visualization dashboard | page access log, directly available | dwd |
| Keyword | Clicked-commodity keyword | Visualization dashboard | re-aggregation of commodity-topic orders | dws |
| Keyword | Ordered-commodity keyword | Visualization dashboard | re-aggregation of commodity-topic orders | dws |
Actual requirements will be more numerous than this; here the real-time calculations focus mainly on the metrics needed for the visualization dashboard.
What is the positioning of the DWM layer? The DWM layer mainly serves the DWS layer: some requirements that go straight from the DWD layer to the DWS layer involve a fair amount of computation, and the results of that computation are likely to be reused by several DWS topics. Therefore part of the DWD data is condensed into a DWM layer, which mainly involves business data such as:
- Unique visitor (UV) calculation
- Bounce (jump-out) detail calculation
- Order wide table
- Payment wide table
2, DWM visitor UV calculation
Demand analysis and ideas
UV, short for Unique Visitor, means an independent visitor. In real-time computing it can also be called DAU (Daily Active User), because UV in real-time computing usually refers to the number of distinct visitors on the current day.
How do we identify the day's visitors from the user behavior log? There are two points:
- First, identify the first page the visitor opens, which marks the visitor entering the application.
- Second, since a visitor may enter the application many times in one day, we need to deduplicate within the day; the core of that filter is sketched below, and the full DauApp follows.
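The following is a minimal sketch of the de-duplication idea only, distilled from the full DauApp below: the stream is keyed by mid, and a keyed ValueState holding the date of the first visit (configured with a 1-day TTL in open()) decides whether a record is kept. The names here are illustrative; the complete, runnable class is in the code section that follows.

```java
// Core of the per-mid de-duplication (sketch only; see the full DauApp below).
// lastVisitDateState is a keyed ValueState<String> created in open() with a 1-day TTL.
@Override
public boolean filter(JSONObject value) throws Exception {
    String curDate = new SimpleDateFormat("yyyy-MM-dd").format(value.getLong("ts"));
    String lastDate = lastVisitDateState.value();
    if (lastDate == null || !lastDate.equals(curDate)) {
        lastVisitDateState.update(curDate);   // first visit today -> keep the record
        return true;
    }
    return false;                             // already counted today -> drop it
}
```

The full implementation additionally checks last_page_id so that only session-start pages are considered.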
code
package com.atguigu.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.atguigu.utils.MyKafkaUtil;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.text.SimpleDateFormat;

/*
 * Data flow: web/app -> nginx -> springboot -> Kafka -> FlinkApp(LogBaseApp) -> Kafka -> FlinkApp(DauApp) -> Kafka
 * Services:  Nginx, Logger, ZooKeeper, Kafka, LogBaseApp, DauApp, consumer (dwm_unique_visit), MockLog
 */
public class DauApp {

    public static void main(String[] args) throws Exception {

        //1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        //1.1 Set the state backend
        //env.setStateBackend(new FsStateBackend("hdfs://hadoop102:9000/gmall/dwd_log/ck"));
        //1.2 Enable checkpointing
        //env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(60000L);

        //2. Read the Kafka dwd_page_log topic and create a stream
        String groupId = "unique_visit_app";
        String sourceTopic = "dwd_page_log";
        String sinkTopic = "dwm_unique_visit";
        FlinkKafkaConsumer<String> kafkaSource = MyKafkaUtil.getKafkaSource(sourceTopic, groupId);
        DataStreamSource<String> kafkaDS = env.addSource(kafkaSource);

        //3. Convert each row of data into a JSON object
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.process(new ProcessFunction<String, JSONObject>() {
            @Override
            public void processElement(String s, Context context, Collector<JSONObject> collector) throws Exception {
                try {
                    JSONObject jsonObject = JSON.parseObject(s);
                    collector.collect(jsonObject);
                } catch (Exception e) {
                    e.printStackTrace();
                    //Dirty data goes to a side output
                    context.output(new OutputTag<String>("dirty") {
                    }, s);
                }
            }
        });

        /*
        //Alternative: parse with a MapFunction and simply drop dirty data
        kafkaDS.map(new MapFunction<String, JSONObject>() {
            @Override
            public JSONObject map(String s) throws Exception {
                try {
                    return JSON.parseObject(s);
                } catch (Exception e) {
                    e.printStackTrace();
                    System.out.println("Dirty data found: " + s);
                    return null;
                }
            }
        });
        */

        //4. Group by mid
        KeyedStream<JSONObject, String> keyedStream = jsonObjDS.keyBy(
                jsonObject -> jsonObject.getJSONObject("common").getString("mid"));

        //5. Filter out data that is not the first visit of the day
        SingleOutputStreamOperator<JSONObject> filterDS = keyedStream.filter(new UvRichFilterFunction());

        //6. Write to the DWM-layer Kafka topic
        filterDS.map(json -> json.toString()).addSink(MyKafkaUtil.getKafkaSink(sinkTopic));

        //7. Start the task
        env.execute();
    }

    public static class UvRichFilterFunction extends RichFilterFunction<JSONObject> {

        private ValueState<String> firstVisitState;
        private SimpleDateFormat simpleDateFormat;

        @Override
        public void open(Configuration parameters) throws Exception {
            simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
            ValueStateDescriptor<String> stringValueStateDescriptor =
                    new ValueStateDescriptor<>("visit-state", String.class);
            //State TTL configuration: keep the state for 24 hours
            StateTtlConfig stateTtlConfig = StateTtlConfig.newBuilder(Time.days(1))
                    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                    .build();
            stringValueStateDescriptor.enableTimeToLive(stateTtlConfig);
            firstVisitState = getRuntimeContext().getState(stringValueStateDescriptor);
        }

        @Override
        public boolean filter(JSONObject jsonObject) throws Exception {
            //Take out the last visited page
            String lastPageId = jsonObject.getJSONObject("page").getString("last_page_id");
            //Only pages without a previous page can be the start of a visit
            if (lastPageId == null || lastPageId.length() <= 0) {
                //Fetch the state data
                String firstVisitData = firstVisitState.value();
                //Extract the event time and convert it to a date
                Long ts = jsonObject.getLong("ts");
                String curDate = simpleDateFormat.format(ts);
                //Keep the record only if this mid has not been seen today
                if (firstVisitData == null || !firstVisitData.equals(curDate)) {
                    firstVisitState.update(curDate);
                    return true;
                } else {
                    return false;
                }
            } else {
                return false;
            }
        }
    }
}
3, DWM visitor bounce detail calculation
Demand analysis and ideas
Bounce (jump out)
A bounce means that a visitor exits after viewing a single page of the site and does not go on to visit any other page. The bounce rate is the number of bounces divided by the number of visits; for example, 300 bounces out of 1,000 visits gives a bounce rate of 30%.
By tracking the bounce rate we can see whether newly acquired visitors are engaged quickly, compare the quality of users brought in through different channels, and compare the bounce rate before and after an application optimization to judge the effect of the improvement.
How to detect bounce behavior
First we need to identify which behaviors are bounces, that is, find the last page visited by each bouncing visitor. A bounce has two characteristics:
- The page is the first page of the user's current visit.
This can be determined from whether the page carries a previous page (last_page_id): if last_page_id is empty, this is the first page the visitor opened in this visit.
- For a certain period after that first page (a threshold we set ourselves), the user does not go on to visit any other page.
The first characteristic is easy to recognize: just keep the records whose last_page_id is empty. Judging the second one is more troublesome. It cannot be decided from a single record, because it is the combination of a record that exists and a record that does not exist, and the conclusion "this record does not exist" must itself be derived rather than observed. What makes it trickier still is that the record does not have to be absent forever, only absent within a certain time range. So how do we recognize such a combination with a timeout?
The simplest way is Flink's built-in CEP library, which is well suited to recognizing one event from a combination of several records.
A user bounce event is essentially the combination of a condition event and a timeout event (a distilled sketch of this pattern follows the sample records below).
{"common":{"mid":"101"},"page":{"page_id":"home"},"ts":1000000000}
{"common":{"mid":"102"},"page":{"page_id":"home"},"ts":1000000001}
{"common":{"mid":"102"},"page":{"page_id":"home","last_page_id":"aa"},"ts":1000000020}
{"common":{"mid":"102"},"page":{"page_id":"home","last_page_id":"aa"},"ts":1000000030}
code
package com.atguigu.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.atguigu.utils.MyKafkaUtil;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternFlatSelectFunction;
import org.apache.flink.cep.PatternFlatTimeoutFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.util.List;
import java.util.Map;

public class UserJumpDetailApp {

    public static void main(String[] args) throws Exception {

        //1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        //1.1 Set the state backend
        //env.setStateBackend(new FsStateBackend("hdfs://hadoop102:9000/gmall/dwd_log/ck"));
        //1.2 Enable checkpointing
        //env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(60000L);

        //2. Read the Kafka dwd_page_log topic and create a stream
        String sourceTopic = "dwd_page_log";
        String groupId = "userJumpDetailApp";
        String sinkTopic = "dwm_user_jump_detail";
        FlinkKafkaConsumer<String> kafkaSource = MyKafkaUtil.getKafkaSource(sourceTopic, groupId);
        DataStreamSource<String> kafkaDS = env.addSource(kafkaSource);

        //Test data (uncomment to run without Kafka)
        //DataStream<String> kafkaDS = env.fromElements(
        //        "{\"common\":{\"mid\":\"101\"},\"page\":{\"page_id\":\"home\"},\"ts\":1000000000} ",
        //        "{\"common\":{\"mid\":\"102\"},\"page\":{\"page_id\":\"home\"},\"ts\":1000000001}",
        //        "{\"common\":{\"mid\":\"102\"},\"page\":{\"page_id\":\"good_list\",\"last_page_id\":" +
        //                "\"home\"},\"ts\":1000000020} ",
        //        "{\"common\":{\"mid\":\"102\"},\"page\":{\"page_id\":\"good_list\",\"last_page_id\":" +
        //                "\"detail\"},\"ts\":1000000025} ");
        //DataStream<String> kafkaDS = env.socketTextStream("hadoop103", 9999);

        //Extract the timestamp from the data and generate watermarks
        WatermarkStrategy<JSONObject> watermarkStrategy = WatermarkStrategy.<JSONObject>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
                    @Override
                    public long extractTimestamp(JSONObject jsonObject, long l) {
                        return jsonObject.getLong("ts");
                    }
                });

        //3. Convert the data into JSON objects
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.process(new ProcessFunction<String, JSONObject>() {
            @Override
            public void processElement(String s, Context context, Collector<JSONObject> collector) throws Exception {
                try {
                    JSONObject jsonObject = JSON.parseObject(s);
                    collector.collect(jsonObject);
                } catch (Exception e) {
                    context.output(new OutputTag<String>("dirty") {
                    }, s);
                }
            }
        }).assignTimestampsAndWatermarks(watermarkStrategy);

        //4. Partition by mid
        KeyedStream<JSONObject, String> keyedStream = jsonObjDS.keyBy(
                jsonObject -> jsonObject.getJSONObject("common").getString("mid"));

        //5. Define the pattern sequence
        Pattern<JSONObject, JSONObject> pattern = Pattern.<JSONObject>begin("start").where(new SimpleCondition<JSONObject>() {
            @Override
            public boolean filter(JSONObject jsonObject) throws Exception {
                //A session start has no previous-page information
                String lastPageId = jsonObject.getJSONObject("page").getString("last_page_id");
                return (lastPageId == null || lastPageId.length() <= 0);
            }
        }).followedBy("follow").where(new SimpleCondition<JSONObject>() {
            @Override
            public boolean filter(JSONObject value) throws Exception {
                //Any later page that carries a last_page_id means the user did not bounce
                String lastPageId = value.getJSONObject("page").getString("last_page_id");
                return lastPageId != null && lastPageId.length() > 0;
            }
        }).within(Time.seconds(10));

        //6. Apply the pattern sequence to the stream
        PatternStream<JSONObject> patternStream = CEP.pattern(keyedStream, pattern);

        //7. Extract the matched events and the timeout events
        OutputTag<String> timeOutTag = new OutputTag<String>("timeOut") {
        };
        SingleOutputStreamOperator<Object> selectDS = patternStream.flatSelect(timeOutTag,
                new PatternFlatTimeoutFunction<JSONObject, String>() {
                    @Override
                    public void timeout(Map<String, List<JSONObject>> map, long l, Collector<String> collector) throws Exception {
                        //Timed-out "start" events are exactly the bounce records we want
                        collector.collect(map.get("start").get(0).toString());
                    }
                },
                new PatternFlatSelectFunction<JSONObject, Object>() {
                    @Override
                    public void flatSelect(Map<String, List<JSONObject>> map, Collector<Object> collector) throws Exception {
                        //Nothing is emitted here: fully matched data means the user did not bounce, so it is not needed
                    }
                });

        //8. Write the bounce data (the side output) to Kafka
        FlinkKafkaProducer<String> kafkaSink = MyKafkaUtil.getKafkaSink(sinkTopic);
        selectDS.getSideOutput(timeOutTag).addSink(kafkaSink);

        //9. Start the task
        env.execute();
    }
}
4, DWM commodity order wide table
Demand analysis and ideas
Orders are important objects of statistical analysis. There are many dimensions of statistical needs around orders, such as users, regions, commodities, categories, brands, etc.
In order to make statistical calculation more convenient and to reduce joins between large tables, the data related to orders is integrated into an order wide table during real-time computation.
What data needs to be integrated with orders?
(Figure: data sources of the order wide table; fact data flows into Kafka (DWD layer), dimension data flows into HBase.)
As shown in the figure above, the data has been split into fact data and dimension data. Fact data (green) flows into Kafka streams (the DWD layer), while dimension data (blue) goes into HBase for medium- and long-term storage. In the DWM layer we need to join the real-time fact data with the dimension data to form a wide table. Two kinds of association are involved: fact data with fact data, and fact data with dimension data.
- Joining fact data with fact data is really a join between two streams.
- Joining fact data with dimension data is really a lookup against an external data source from inside the stream computation.
Both are sketched in miniature below and implemented in full in the code that follows.
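In miniature, the two kinds of association look roughly like this. The sketch is distilled from the full OrderWideApp and DimUtil code later in this section; the stream and field names are taken from that code, and the synchronous map-based lookup in step 2 is only for illustration (the later sections replace it with caching and asynchronous I/O):

```java
// 1) Fact-to-fact: a stream-to-stream join, here an event-time interval join
SingleOutputStreamOperator<OrderWide> orderWideDS = orderInfoKeyedStream
        .intervalJoin(orderDetailKeyedStream)
        .between(Time.seconds(-5), Time.seconds(5))          // tolerate out-of-order arrival
        .process(new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
            @Override
            public void processElement(OrderInfo info, OrderDetail detail,
                                        Context ctx, Collector<OrderWide> out) {
                out.collect(new OrderWide(info, detail));
            }
        });

// 2) Fact-to-dimension: look up an external store (Phoenix/HBase) for each record
SingleOutputStreamOperator<OrderWide> withUserDim = orderWideDS.map(orderWide -> {
    JSONObject user = DimUtil.getDimInfo("DIM_USER_INFO", orderWide.getUser_id().toString());
    orderWide.setUser_gender(user.getString("GENDER"));      // copy the needed dimension fields
    return orderWide;
});
```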
code implementation
Receive order and order detail data from Kafka's dwd layer
package com.atguigu.app.dwm;

import com.alibaba.fastjson.JSON;
import com.atguigu.bean.OrderDetail;
import com.atguigu.bean.OrderInfo;
import com.atguigu.bean.OrderWide;
import com.atguigu.utils.MyKafkaUtil;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

import java.text.SimpleDateFormat;

/**
 * Mock -> MySQL(binlog) -> Maxwell -> Kafka(ods_base_db_m)
 * -> DbBaseApp(modify configuration, Phoenix)
 * -> Kafka(dwd_order_info, dwd_order_detail) -> OrderWideApp
 */
public class OrderWideApp {

    public static void main(String[] args) throws Exception {

        //1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        //1.1 Set the state backend
        //env.setStateBackend(new FsStateBackend("hdfs://hadoop102:9000/gmall/dwd_log/ck"));
        //1.2 Enable checkpointing
        //env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(60000L);

        //2. Read the Kafka order and order-detail topics: dwd_order_info, dwd_order_detail
        String orderInfoSourceTopic = "dwd_order_info";
        String orderDetailSourceTopic = "dwd_order_detail";
        String orderWideSinkTopic = "dwm_order_wide";
        String groupId = "order_wide_group";

        FlinkKafkaConsumer<String> orderInfoKafkaSource = MyKafkaUtil.getKafkaSource(orderInfoSourceTopic, groupId);
        DataStreamSource<String> orderInfoKafkaDS = env.addSource(orderInfoKafkaSource);

        FlinkKafkaConsumer<String> orderDetailKafkaSource = MyKafkaUtil.getKafkaSource(orderDetailSourceTopic, groupId);
        DataStreamSource<String> orderDetailKafkaDS = env.addSource(orderDetailKafkaSource);

        //3. Convert each row of data into a JavaBean, extract the timestamp, and generate watermarks
        WatermarkStrategy<OrderInfo> orderInfoWatermarkStrategy = WatermarkStrategy.<OrderInfo>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<OrderInfo>() {
                    @Override
                    public long extractTimestamp(OrderInfo orderInfo, long l) {
                        return orderInfo.getCreate_ts();
                    }
                });
        WatermarkStrategy<OrderDetail> orderDetailWatermarkStrategy = WatermarkStrategy.<OrderDetail>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<OrderDetail>() {
                    @Override
                    public long extractTimestamp(OrderDetail orderDetail, long l) {
                        return orderDetail.getCreate_ts();
                    }
                });

        KeyedStream<OrderInfo, Long> orderInfoWithIdKeyedStream = orderInfoKafkaDS.map(jsonStr -> {
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            //Convert the JSON string into a JavaBean
            OrderInfo orderInfo = JSON.parseObject(jsonStr, OrderInfo.class);
            //Take out the creation time field and split it by the space
            String create_time = orderInfo.getCreate_time();
            String[] createTimeArr = create_time.split(" ");
            orderInfo.setCreate_date(createTimeArr[0]);
            orderInfo.setCreate_hour(createTimeArr[1]);
            orderInfo.setCreate_ts(sdf.parse(create_time).getTime());
            return orderInfo;
        }).assignTimestampsAndWatermarks(orderInfoWatermarkStrategy)
                .keyBy(OrderInfo::getId);

        KeyedStream<OrderDetail, Long> orderDetailWithOrderKeyedStream = orderDetailKafkaDS.map(jsonStr -> {
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            OrderDetail orderDetail = JSON.parseObject(jsonStr, OrderDetail.class);
            orderDetail.setCreate_ts(sdf.parse(orderDetail.getCreate_time()).getTime());
            return orderDetail;
        }).assignTimestampsAndWatermarks(orderDetailWatermarkStrategy)
                .keyBy(OrderDetail::getOrder_id);

        //4. Interval join of the two streams
        SingleOutputStreamOperator<OrderWide> orderWideDS = orderInfoWithIdKeyedStream.intervalJoin(orderDetailWithOrderKeyedStream)
                .between(Time.seconds(-5), Time.seconds(5)) //In production, to avoid losing data, set this to the maximum network delay
                .process(new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
                    @Override
                    public void processElement(OrderInfo orderInfo, OrderDetail orderDetail, Context context, Collector<OrderWide> collector) throws Exception {
                        collector.collect(new OrderWide(orderInfo, orderDetail));
                    }
                });

        //Test printing
        orderWideDS.print(">>>>>>>>>>>>>>>");

        //5. Join the dimension tables
        //6. Write the data to Kafka dwm_order_wide
        //7. Start the task
        env.execute();
    }
}
Test results:
Phoenix query utility class (PhoenixUtil)
package com.atguigu.utils;

import com.alibaba.fastjson.JSONObject;
import com.atguigu.common.GmallConfig;
import org.apache.commons.beanutils.BeanUtils;

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class PhoenixUtil {

    //JDBC connection, reused across queries
    private static Connection connection;

    //Initialize the connection
    private static void init() {
        try {
            Class.forName(GmallConfig.PHOENIX_DRIVER);
            connection = DriverManager.getConnection(GmallConfig.PHOENIX_SERVER);
            //Set the Phoenix schema (database) used by this connection
            connection.setSchema(GmallConfig.HBASE_SCHEMA);
        } catch (Exception e) {
            e.printStackTrace();
            throw new RuntimeException("Failed to get connection!");
        }
    }

    public static <T> List<T> queryList(String sql, Class<T> cls) {
        //Initialize the connection on first use
        if (connection == null) {
            init();
        }
        PreparedStatement preparedStatement = null;
        ResultSet resultSet = null;
        try {
            //Compile the SQL
            preparedStatement = connection.prepareStatement(sql);
            //Execute the query
            resultSet = preparedStatement.executeQuery();
            //Get the metadata of the result set
            ResultSetMetaData metaData = resultSet.getMetaData();
            int columnCount = metaData.getColumnCount();
            ArrayList<T> list = new ArrayList<>();
            while (resultSet.next()) {                       //iterate over the rows
                T t = cls.newInstance();
                for (int i = 1; i <= columnCount; i++) {     //iterate over the columns
                    BeanUtils.setProperty(t, metaData.getColumnName(i), resultSet.getObject(i));
                }
                list.add(t);
            }
            return list;
        } catch (Exception throwables) {
            throw new RuntimeException("Failed to query dimension information");
        } finally {
            if (preparedStatement != null) {
                try {
                    preparedStatement.close();
                } catch (SQLException throwables) {
                    throwables.printStackTrace();
                }
            }
            if (resultSet != null) {
                try {
                    resultSet.close();
                } catch (SQLException throwables) {
                    throwables.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(queryList("select * from GMALL200821_REALTIME.DIM_BASE_TRADEMARK", JSONObject.class));
    }
}
test result
Dimension query utility class (DimUtil)
package com.atguigu.utils;

import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.List;

//Utility class for querying dimension tables
//select * from t where id = '' and name = '';
public class DimUtil {

    public static JSONObject getDimInfo(String tableName, Tuple2<String, String>... columnValues) {

        if (columnValues.length <= 0) {
            //Throw an exception and stop: at least one query condition is required
            throw new RuntimeException("When querying dimension data, please set at least one query condition");
        }

        //Build the where clause from the query conditions
        StringBuilder whereSql = new StringBuilder(" where ");
        for (int i = 0; i < columnValues.length; i++) {
            Tuple2<String, String> columnValue = columnValues[i];
            String column = columnValue.f0;
            String value = columnValue.f1;
            whereSql.append(column).append(" = '").append(value).append("'");
            //If this is not the last condition, append "and"
            if (i < columnValues.length - 1) {
                whereSql.append(" and ");
            }
        }

        String querySql = "select * from " + tableName + whereSql.toString();
        System.out.println(querySql);

        //Query the dimension data in Phoenix
        List<JSONObject> queryList = PhoenixUtil.queryList(querySql, JSONObject.class);
        return queryList.get(0);
    }

    public static JSONObject getDimInfo(String tableName, String value) {
        return getDimInfo(tableName, new Tuple2<>("id", value));
    }

    public static void main(String[] args) {
        System.out.println(getDimInfo("DIM_BASE_TRADEMARK", "20"));
    }
}
test result
Optimization 1: cache-aside (bypass) cache
Every dimension lookup above goes straight to Phoenix/HBase. To reduce that pressure we put a cache-aside layer in front of it: first look the dimension record up in Redis under a key of the form tableName:id; on a miss, query Phoenix, write the result back to Redis as a JSON string, and give it a 24-hour expiry.
Code implementation - DimUtil
package com.atguigu.utils;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.java.tuple.Tuple2;
import redis.clients.jedis.Jedis;

import java.util.List;

//Utility class for querying dimension tables, now with a Redis cache in front of Phoenix
//select * from t where id = '' and name = '';
/**
 * Redis cache design:
 * 1. What to store: the dimension data as a JSON string
 * 2. Which type to use: String, Set or Hash?
 * 3. Key design: String -> tableName:id, Set -> tableName, Hash -> tableName
 * The Set/Hash options are ruled out because we need to set an expiration time
 * for each individual dimension record, so we use String keys of the form tableName:id.
 */
public class DimUtil {

    public static JSONObject getDimInfo(String tableName, Tuple2<String, String>... columnValues) {

        if (columnValues.length <= 0) {
            //Throw an exception and stop: at least one query condition is required
            throw new RuntimeException("When querying dimension data, please set at least one query condition");
        }

        //Build the Phoenix where clause and the Redis key at the same time
        StringBuilder whereSql = new StringBuilder(" where ");
        StringBuilder redisKey = new StringBuilder(tableName).append(":");
        for (int i = 0; i < columnValues.length; i++) {
            Tuple2<String, String> columnValue = columnValues[i];
            String column = columnValue.f0;
            String value = columnValue.f1;
            whereSql.append(column).append(" = '").append(value).append("'");
            redisKey.append(value);
            //If this is not the last condition, append "and" / ":"
            if (i < columnValues.length - 1) {
                whereSql.append(" and ");
                redisKey.append(":");
            }
        }

        //Get a Redis connection and try the cache first
        Jedis jedis = RedisUtil.getJedis();
        String dimJsonStr = jedis.get(redisKey.toString());
        if (dimJsonStr != null && dimJsonStr.length() > 0) {
            jedis.close();
            return JSON.parseObject(dimJsonStr);
        }

        //Cache miss: query the dimension data in Phoenix
        String querySql = "select * from " + tableName + whereSql.toString();
        System.out.println(querySql);
        List<JSONObject> queryList = PhoenixUtil.queryList(querySql, JSONObject.class);
        JSONObject dimJsonObj = queryList.get(0);

        //Write the data back to Redis with a 24-hour expiry, then return the connection to the pool
        jedis.set(redisKey.toString(), dimJsonObj.toString());
        jedis.expire(redisKey.toString(), 24 * 60 * 60);
        jedis.close();

        return dimJsonObj;
    }

    public static JSONObject getDimInfo(String tableName, String value) {
        return getDimInfo(tableName, new Tuple2<>("id", value));
    }

    public static void main(String[] args) {
        System.out.println(getDimInfo("DIM_BASE_TRADEMARK", "20"));
    }
}
Code implementation - RedisUtil
package com.atguigu.utils;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisUtil {

    public static JedisPool jedisPool = null;

    public static Jedis getJedis() {
        if (jedisPool == null) {
            JedisPoolConfig jedisPoolConfig = new JedisPoolConfig();
            jedisPoolConfig.setMaxTotal(100);               //maximum number of connections
            jedisPoolConfig.setBlockWhenExhausted(true);    //wait when the pool is exhausted
            jedisPoolConfig.setMaxWaitMillis(2000);         //maximum waiting time
            jedisPoolConfig.setMaxIdle(5);                  //maximum number of idle connections
            jedisPoolConfig.setMinIdle(5);                  //minimum number of idle connections
            jedisPoolConfig.setTestOnBorrow(true);          //ping-pong test when borrowing a connection

            jedisPool = new JedisPool(jedisPoolConfig, "hadoop103", 6379, 1000);
            System.out.println("Connection pool opened");
            return jedisPool.getResource();
        } else {
            //System.out.println("Active connections in the pool: " + jedisPool.getNumActive());
            return jedisPool.getResource();
        }
    }
}
After redis is started
Optimization 2: asynchronous query
In Flink stream processing we often need to interact with external systems, for example to complete the fields of a fact record from a dimension table.
For example, in an e-commerce scenario, a commodity's sku_id needs to be associated with attributes of that commodity such as its category and manufacturer; in a logistics scenario, given a package id, we need to associate the package's attributes, shipment information, receipt information, and so on.
By default, in a Flink MapFunction a single parallel subtask can only interact with the external system synchronously: send a request to the external store, block on I/O, wait for the response, then send the next request. Such synchronous interaction spends most of its time waiting on the network. Processing efficiency can be raised by increasing the parallelism of the MapFunction, but higher parallelism means more resources, which is not a great solution.
Flink introduced Async I/O in version 1.2. In asynchronous mode, I/O operations are made asynchronous: a single parallel subtask can send multiple requests in a row and process whichever response returns first, so consecutive requests do not block on each other, which greatly improves streaming throughput.
Async I/O is a popular feature contributed to the community by Alibaba; it addresses the situation where network latency to external systems becomes the bottleneck of the whole job.
In practice, the asynchronous query hands the dimension-table lookup off to a separate thread pool, so the operator is not blocked by any single query; one parallel subtask can have many lookups in flight at once, which raises concurrency.
This approach is aimed specifically at operations involving network I/O, where it reduces the time wasted waiting on requests. A sketch of one possible asynchronous dimension function follows; the full OrderWideApp then applies it to every dimension table.
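The main program below relies on a DimAsyncFunction helper (imported from com.atguigu.app.func) whose source code is not reproduced in this section. As a rough, hedged sketch of what such a function can look like, here is one possible shape built on Flink's RichAsyncFunction plus a dedicated thread pool, reusing the DimUtil cache-aside lookup from above; the pool size, timeout handling, and logging are assumptions, not necessarily the course's exact implementation:

```java
package com.atguigu.app.func;

import com.alibaba.fastjson.JSONObject;
import com.atguigu.utils.DimUtil;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: an async dimension-join function. T is the stream element type (e.g. OrderWide).
public abstract class DimAsyncFunction<T> extends RichAsyncFunction<T, T> {

    private final String tableName;                  // Phoenix dimension table to query
    private transient ExecutorService executorService;

    public DimAsyncFunction(String tableName) {
        this.tableName = tableName;
    }

    // Subclasses decide which key to look up and how to merge the dimension row in
    public abstract String getKey(T input);

    public abstract void join(T input, JSONObject dimInfo) throws Exception;

    @Override
    public void open(Configuration parameters) {
        // Thread pool that performs the blocking Phoenix/Redis lookups (size is an assumption)
        executorService = Executors.newFixedThreadPool(8);
    }

    @Override
    public void asyncInvoke(T input, ResultFuture<T> resultFuture) {
        executorService.submit(() -> {
            try {
                // Cache-aside lookup (Redis first, then Phoenix) via DimUtil above
                JSONObject dimInfo = DimUtil.getDimInfo(tableName, getKey(input));
                if (dimInfo != null) {
                    join(input, dimInfo);
                }
                resultFuture.complete(Collections.singletonList(input));
            } catch (Exception e) {
                resultFuture.completeExceptionally(e);
            }
        });
    }

    @Override
    public void timeout(T input, ResultFuture<T> resultFuture) {
        // Called when no result arrives within the timeout given to AsyncDataStream
        System.out.println("TimeOut: " + input);
        resultFuture.complete(Collections.singletonList(input));
    }
}
```

AsyncDataStream.unorderedWait(stream, function, 60, TimeUnit.SECONDS) then wires such a function into the pipeline, which is exactly how the main program below applies it to each dimension table.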
Code implementation - OrderWideApp
package com.atguigu.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.atguigu.app.func.DimAsyncFunction;
import com.atguigu.bean.OrderDetail;
import com.atguigu.bean.OrderInfo;
import com.atguigu.bean.OrderWide;
import com.atguigu.utils.MyKafkaUtil;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.concurrent.TimeUnit;

/**
 * Mock -> MySQL(binlog) -> Maxwell -> Kafka(ods_base_db_m) -> DbBaseApp(modify configuration, Phoenix)
 * -> Kafka(dwd_order_info, dwd_order_detail) -> OrderWideApp(join dimensions, Redis)
 */
public class OrderWideApp {

    public static void main(String[] args) throws Exception {

        //1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        //1.1 Set the state backend
        //env.setStateBackend(new FsStateBackend("hdfs://hadoop102:8020/gmall/dwd_log/ck"));
        //1.2 Enable checkpointing
        //env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(60000L);

        //2. Read the Kafka order and order-detail topics: dwd_order_info, dwd_order_detail
        String orderInfoSourceTopic = "dwd_order_info";
        String orderDetailSourceTopic = "dwd_order_detail";
        String orderWideSinkTopic = "dwm_order_wide";
        String groupId = "order_wide_group";

        FlinkKafkaConsumer<String> orderInfoKafkaSource = MyKafkaUtil.getKafkaSource(orderInfoSourceTopic, groupId);
        DataStreamSource<String> orderInfoKafkaDS = env.addSource(orderInfoKafkaSource);

        FlinkKafkaConsumer<String> orderDetailKafkaSource = MyKafkaUtil.getKafkaSource(orderDetailSourceTopic, groupId);
        DataStreamSource<String> orderDetailKafkaDS = env.addSource(orderDetailKafkaSource);

        //3. Convert each row of data into a JavaBean and extract the timestamp to generate watermarks
        WatermarkStrategy<OrderInfo> orderInfoWatermarkStrategy = WatermarkStrategy.<OrderInfo>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<OrderInfo>() {
                    @Override
                    public long extractTimestamp(OrderInfo element, long recordTimestamp) {
                        return element.getCreate_ts();
                    }
                });
        WatermarkStrategy<OrderDetail> orderDetailWatermarkStrategy = WatermarkStrategy.<OrderDetail>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<OrderDetail>() {
                    @Override
                    public long extractTimestamp(OrderDetail element, long recordTimestamp) {
                        return element.getCreate_ts();
                    }
                });

        KeyedStream<OrderInfo, Long> orderInfoWithIdKeyedStream = orderInfoKafkaDS.map(jsonStr -> {
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            //Convert the JSON string into a JavaBean
            OrderInfo orderInfo = JSON.parseObject(jsonStr, OrderInfo.class);
            //Take out the creation time field and split it by the space
            String create_time = orderInfo.getCreate_time();
            String[] createTimeArr = create_time.split(" ");
            orderInfo.setCreate_date(createTimeArr[0]);
            orderInfo.setCreate_hour(createTimeArr[1]);
            orderInfo.setCreate_ts(sdf.parse(create_time).getTime());
            return orderInfo;
        }).assignTimestampsAndWatermarks(orderInfoWatermarkStrategy)
                .keyBy(OrderInfo::getId);

        KeyedStream<OrderDetail, Long> orderDetailWithOrderIdKeyedStream = orderDetailKafkaDS.map(jsonStr -> {
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            OrderDetail orderDetail = JSON.parseObject(jsonStr, OrderDetail.class);
            orderDetail.setCreate_ts(sdf.parse(orderDetail.getCreate_time()).getTime());
            return orderDetail;
        }).assignTimestampsAndWatermarks(orderDetailWatermarkStrategy)
                .keyBy(OrderDetail::getOrder_id);

        //4. Interval join of the two streams
        SingleOutputStreamOperator<OrderWide> orderWideDS = orderInfoWithIdKeyedStream.intervalJoin(orderDetailWithOrderIdKeyedStream)
                .between(Time.seconds(-5), Time.seconds(5)) //In production, to avoid losing data, set this to the maximum network delay
                .process(new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
                    @Override
                    public void processElement(OrderInfo orderInfo, OrderDetail orderDetail, Context context, Collector<OrderWide> collector) throws Exception {
                        collector.collect(new OrderWide(orderInfo, orderDetail));
                    }
                });

        //Test printing
        //orderWideDS.print(">>>>>>>>>");

        //5. Join the dimension tables
        //5.1 Join the user dimension
        SingleOutputStreamOperator<OrderWide> orderWideWithUserDS = AsyncDataStream.unorderedWait(orderWideDS,
                new DimAsyncFunction<OrderWide>("DIM_USER_INFO") {
                    @Override
                    public String getKey(OrderWide orderWide) {
                        return orderWide.getUser_id().toString();
                    }

                    @Override
                    public void join(OrderWide orderWide, JSONObject dimInfo) throws ParseException {
                        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
                        //Take the birthday out of the user dimension and turn it into an age
                        String birthday = dimInfo.getString("BIRTHDAY");
                        long currentTS = System.currentTimeMillis();
                        Long ts = sdf.parse(birthday).getTime();
                        Long ageLong = (currentTS - ts) / 1000L / 60 / 60 / 24 / 365;
                        orderWide.setUser_age(ageLong.intValue());
                        //Take the gender out of the user dimension
                        String gender = dimInfo.getString("GENDER");
                        orderWide.setUser_gender(gender);
                    }
                }, 60, TimeUnit.SECONDS);
        //orderWideWithUserDS.print("Users>>>>>>");

        //5.2 Join the province (region) dimension
        SingleOutputStreamOperator<OrderWide> orderWideWithProvinceDS = AsyncDataStream.unorderedWait(orderWideWithUserDS,
                new DimAsyncFunction<OrderWide>("DIM_BASE_PROVINCE") {
                    @Override
                    public String getKey(OrderWide orderWide) {
                        return orderWide.getProvince_id().toString();
                    }

                    @Override
                    public void join(OrderWide orderWide, JSONObject dimInfo) throws Exception {
                        //Extract the dimension fields and set them into orderWide
                        orderWide.setProvince_name(dimInfo.getString("NAME"));
                        orderWide.setProvince_area_code(dimInfo.getString("AREA_CODE"));
                        orderWide.setProvince_iso_code(dimInfo.getString("ISO_CODE"));
                        orderWide.setProvince_3166_2_code(dimInfo.getString("ISO_3166_2"));
                    }
                }, 60, TimeUnit.SECONDS);
        //orderWideWithProvinceDS.print("Province>>>>>>>>");

        //5.3 Join the SKU dimension
        SingleOutputStreamOperator<OrderWide> orderWideWithSkuDS = AsyncDataStream.unorderedWait(orderWideWithProvinceDS,
                new DimAsyncFunction<OrderWide>("DIM_SKU_INFO") {
                    @Override
                    public void join(OrderWide orderWide, JSONObject jsonObject) throws Exception {
                        orderWide.setSku_name(jsonObject.getString("SKU_NAME"));
                        orderWide.setCategory3_id(jsonObject.getLong("CATEGORY3_ID"));
                        orderWide.setSpu_id(jsonObject.getLong("SPU_ID"));
                        orderWide.setTm_id(jsonObject.getLong("TM_ID"));
                    }

                    @Override
                    public String getKey(OrderWide orderWide) {
                        return String.valueOf(orderWide.getSku_id());
                    }
                }, 60, TimeUnit.SECONDS);
        //orderWideWithSkuDS.print("sku>>>>");

        //5.4 Join the SPU dimension
        SingleOutputStreamOperator<OrderWide> orderWideWithSpuDS = AsyncDataStream.unorderedWait(orderWideWithSkuDS,
                new DimAsyncFunction<OrderWide>("DIM_SPU_INFO") {
                    @Override
                    public void join(OrderWide orderWide, JSONObject jsonObject) throws Exception {
                        orderWide.setSpu_name(jsonObject.getString("SPU_NAME"));
                    }

                    @Override
                    public String getKey(OrderWide orderWide) {
                        return String.valueOf(orderWide.getSpu_id());
                    }
                }, 60, TimeUnit.SECONDS);
        //orderWideWithSpuDS.print("spu");

        //5.5 Join the trademark (brand) dimension
        SingleOutputStreamOperator<OrderWide> orderWideWithTmDS = AsyncDataStream.unorderedWait(orderWideWithSpuDS,
                new DimAsyncFunction<OrderWide>("DIM_BASE_TRADEMARK") {
                    @Override
                    public void join(OrderWide orderWide, JSONObject jsonObject) throws Exception {
                        orderWide.setTm_name(jsonObject.getString("TM_NAME"));
                    }

                    @Override
                    public String getKey(OrderWide orderWide) {
                        return String.valueOf(orderWide.getTm_id());
                    }
                }, 60, TimeUnit.SECONDS);

        //5.6 Join the category dimension
        SingleOutputStreamOperator<OrderWide> orderWideWithCategory3DS = AsyncDataStream.unorderedWait(orderWideWithTmDS,
                new DimAsyncFunction<OrderWide>("DIM_BASE_CATEGORY3") {
                    @Override
                    public void join(OrderWide orderWide, JSONObject jsonObject) throws Exception {
                        orderWide.setCategory3_name(jsonObject.getString("NAME"));
                    }

                    @Override
                    public String getKey(OrderWide orderWide) {
                        return String.valueOf(orderWide.getCategory3_id());
                    }
                }, 60, TimeUnit.SECONDS);
        orderWideWithCategory3DS.print("3DS>>>>");

        //6. Write the data to Kafka dwm_order_wide
        orderWideWithCategory3DS.map(JSON::toJSONString)
                .addSink(MyKafkaUtil.getKafkaSink(orderWideSinkTopic));

        //7. Start the task
        env.execute();
    }
}
5, DWM payment wide table
Demand analysis and ideas
The main reason for building a payment wide table is that the payment table contains no order detail information: the payment amount is not broken down by commodity, so payments cannot be analyzed at the commodity level.
The core of this wide table is therefore to associate the payment information with the order wide table. In the code below this is done with an interval join keyed by order id, using a window of (-15 minutes, 0) relative to the payment time, since a payment always happens after its order is created and here is assumed to arrive within 15 minutes of order creation.
Code implementation - payment entity class - PaymentInfo
package com.atguigu.bean;

import lombok.Data;

import java.math.BigDecimal;

@Data
public class PaymentInfo {
    Long id;
    Long order_id;
    Long user_id;
    BigDecimal total_amount;
    String subject;
    String payment_type;
    String create_time;
    String callback_time;
}
Code implementation - payment wide table entity class - PaymentWide
package com.atguigu.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.commons.beanutils.BeanUtils;

import java.lang.reflect.InvocationTargetException;
import java.math.BigDecimal;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class PaymentWide {

    Long payment_id;
    String subject;
    String payment_type;
    String payment_create_time;
    String callback_time;
    Long detail_id;
    Long order_id;
    Long sku_id;
    BigDecimal order_price;
    Long sku_num;
    String sku_name;
    Long province_id;
    String order_status;
    Long user_id;
    BigDecimal total_amount;
    BigDecimal activity_reduce_amount;
    BigDecimal coupon_reduce_amount;
    BigDecimal original_total_amount;
    BigDecimal feight_fee;
    BigDecimal split_feight_fee;
    BigDecimal split_activity_amount;
    BigDecimal split_coupon_amount;
    BigDecimal split_total_amount;
    String order_create_time;

    String province_name;        //from the province dimension table
    String province_area_code;
    String province_iso_code;
    String province_3166_2_code;

    Integer user_age;            //user information
    String user_gender;

    Long spu_id;                 //to be joined as dimension data
    Long tm_id;
    Long category3_id;
    String spu_name;
    String tm_name;
    String category3_name;

    public PaymentWide(PaymentInfo paymentInfo, OrderWide orderWide) {
        mergeOrderWide(orderWide);
        mergePaymentInfo(paymentInfo);
    }

    public void mergePaymentInfo(PaymentInfo paymentInfo) {
        if (paymentInfo != null) {
            try {
                //BeanUtils.copyProperties(dest, orig): copy the payment fields into this object
                BeanUtils.copyProperties(this, paymentInfo);
                payment_create_time = paymentInfo.create_time;
                payment_id = paymentInfo.id;
            } catch (IllegalAccessException | InvocationTargetException e) {
                e.printStackTrace();
            }
        }
    }

    public void mergeOrderWide(OrderWide orderWide) {
        if (orderWide != null) {
            try {
                BeanUtils.copyProperties(this, orderWide);
                order_create_time = orderWide.create_time;
            } catch (IllegalAccessException | InvocationTargetException e) {
                e.printStackTrace();
            }
        }
    }
}
Code implementation - payment wide table processing main program - PaymentWideApp
package com.atguigu.app.dwm;

import com.alibaba.fastjson.JSON;
import com.atguigu.bean.OrderWide;
import com.atguigu.bean.PaymentInfo;
import com.atguigu.bean.PaymentWide;
import com.atguigu.utils.MyKafkaUtil;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.text.ParseException;
import java.text.SimpleDateFormat;

/**
 * Mock -> MySQL(binlog) -> Maxwell -> Kafka(ods_base_db_m) -> DbBaseApp(modify configuration, Phoenix)
 * -> Kafka(dwd_order_info, dwd_order_detail) -> OrderWideApp(join dimensions, Redis) -> Kafka(dwm_order_wide)
 * -> PaymentWideApp -> Kafka(dwm_payment_wide)
 */
public class PaymentWideApp {

    public static void main(String[] args) throws Exception {

        //1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        //1.1 Set the state backend
        //env.setStateBackend(new FsStateBackend("hdfs://hadoop102:8020/gmall/dwd_log/ck"));
        //1.2 Enable checkpointing
        //env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(60000L);
        //Change the HDFS user name if needed
        //System.setProperty("HADOOP_USER_NAME", "atguigu");

        //2. Read the Kafka topics dwd_payment_info and dwm_order_wide
        String groupId = "payment_wide_group";
        String paymentInfoSourceTopic = "dwd_payment_info";
        String orderWideSourceTopic = "dwm_order_wide";
        String paymentWideSinkTopic = "dwm_payment_wide";

        DataStreamSource<String> paymentKafkaDS = env.addSource(MyKafkaUtil.getKafkaSource(paymentInfoSourceTopic, groupId));
        DataStreamSource<String> orderWideKafkaDS = env.addSource(MyKafkaUtil.getKafkaSource(orderWideSourceTopic, groupId));

        //3. Convert the data into JavaBeans and extract the timestamps to generate watermarks
        SingleOutputStreamOperator<PaymentInfo> paymentInfoDS = paymentKafkaDS
                .map(jsonStr -> JSON.parseObject(jsonStr, PaymentInfo.class))
                .assignTimestampsAndWatermarks(WatermarkStrategy.<PaymentInfo>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<PaymentInfo>() {
                            @Override
                            public long extractTimestamp(PaymentInfo element, long recordTimestamp) {
                                SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                                try {
                                    return sdf.parse(element.getCreate_time()).getTime();
                                } catch (ParseException e) {
                                    e.printStackTrace();
                                    throw new RuntimeException("Time format error!!");
                                }
                            }
                        }));

        SingleOutputStreamOperator<OrderWide> orderWideDS = orderWideKafkaDS
                .map(jsonStr -> JSON.parseObject(jsonStr, OrderWide.class))
                .assignTimestampsAndWatermarks(WatermarkStrategy.<OrderWide>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<OrderWide>() {
                            @Override
                            public long extractTimestamp(OrderWide element, long recordTimestamp) {
                                SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                                try {
                                    return sdf.parse(element.getCreate_time()).getTime();
                                } catch (ParseException e) {
                                    e.printStackTrace();
                                    throw new RuntimeException("Time format error!!");
                                }
                            }
                        }));

        //4. Group by order id
        KeyedStream<PaymentInfo, Long> paymentInfoKeyedStream = paymentInfoDS.keyBy(PaymentInfo::getOrder_id);
        KeyedStream<OrderWide, Long> orderWideKeyedStream = orderWideDS.keyBy(OrderWide::getOrder_id);

        //5. Interval join of the two streams
        SingleOutputStreamOperator<PaymentWide> paymentWideDS = paymentInfoKeyedStream.intervalJoin(orderWideKeyedStream)
                .between(Time.minutes(-15), Time.seconds(0))
                .process(new ProcessJoinFunction<PaymentInfo, OrderWide, PaymentWide>() {
                    @Override
                    public void processElement(PaymentInfo paymentInfo, OrderWide orderWide, Context ctx, Collector<PaymentWide> out) throws Exception {
                        out.collect(new PaymentWide(paymentInfo, orderWide));
                    }
                });

        //Print test
        paymentWideDS.print(">>>>>>>>>>");

        //6. Write the data to Kafka dwm_payment_wide
        paymentWideDS.map(JSON::toJSONString).addSink(MyKafkaUtil.getKafkaSink(paymentWideSinkTopic));

        //7. Start the task
        env.execute();
    }
}
Code implementation - date conversion tool class - DateTimeUtil
package com.atguigu.utils;

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;

//Date/time conversion utility. DateTimeFormatter is thread-safe, unlike SimpleDateFormat,
//so a single shared formatter can be reused here.
public class DateTimeUtil {

    private final static DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static String toYMDhms(Date date) {
        LocalDateTime localDateTime = LocalDateTime.ofInstant(date.toInstant(), ZoneId.systemDefault());
        return formatter.format(localDateTime);
    }

    public static Long toTs(String YmDHms) {
        LocalDateTime localDateTime = LocalDateTime.parse(YmDHms, formatter);
        return localDateTime.toInstant(ZoneOffset.of("+8")).toEpochMilli();
    }
}
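A quick usage sketch (the values are only illustrative):

```java
// Hypothetical usage of DateTimeUtil
Date now = new Date();
String text = DateTimeUtil.toYMDhms(now);   // e.g. "2021-08-22 10:15:30"
Long ts = DateTimeUtil.toTs(text);          // back to epoch milliseconds (interpreted as UTC+8)
System.out.println(text + " -> " + ts);
```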