Data service analysis of Spark project

Posted by plimpton on Sun, 17 Nov 2019 20:47:13 +0100

Suspicious object analysis

A suspicious object is identified by comparing the time differences and the coordinate distances between the points of its trace. If the implied movement is plausible, the current object can be considered not suspicious; if it is suspicious, the result is saved first and the object is entered into the suspicious-object library.

  1. Acquire the data for the given time period;
  2. Aggregate the data by the RSFZ (ID-card number) of each monitored object, keeping its SBBH, TGSJ and JWZB fields;
  3. Apply the decision rule to each monitored object's aggregated data, and put the unique identifiers of the records that match the rule into an accumulator;
  4. Read the unique identifiers back from the accumulator, query the related records from the table, and store them in MySQL.

Code

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.util.CollectionAccumulator;
import scala.Tuple2;

import java.util.Date;
import java.util.List;

public class VerifyCompute {

    public static void main(String[] args) {
        // SparkSession / JavaSparkContext setup (not shown in the original post).
        SparkSession spark = SparkSession.builder()
                .appName("VerifyCompute")
                .enableHiveSupport()
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Query the records of the target time period (the TGSJ condition is elided here).
        String sql = "select * from t_alldata t where t.tgsj";
        Dataset<Row> allFromTime = spark.sql(sql);

        JavaRDD<Row> resultRDD = allFromTime.javaRDD();
        /**
         * Extract the required columns from each row and convert the result to a PairRDD keyed by RSFZ.
         */
        JavaPairRDD<String, String> pairRDD = resultRDD.mapToPair(new PairFunction<Row, String, String>() {
            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                String rsfz = row.getAs("RSFZ");
                String id = row.getAs("ID");
                String tgsj = row.getAs("TGSJ");
                String sbbh =row.getAs("SBBH");
                String jwzb = row.getAs("JWZB");

                String tuple01 = rsfz;
                String tuple02 = id + "_" + jwzb + "_" + tgsj+"_"+sbbh;
                return new Tuple2<String, String>(tuple01, tuple02);
            }
        });

        /**
         * Concatenate all records that share the same ID-card number (each record is
         * id + "_" + jwzb + "_" + tgsj + "_" + sbbh), so that the pass times and the
         * longitude/latitude coordinates of one person can be compared pairwise.
         */
        JavaPairRDD<String, String> reduceRDD = pairRDD.reduceByKey(new Function2<String, String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public String call(String v1, String v2) throws Exception {
                return v1 + "&" + v2;
            }
        });

        //Create an accumulator to collect the ids of the records that match the rule.
        CollectionAccumulator<String> acc = jsc.sc().collectionAccumulator();

        /**
         * RDDs and DataFrames cannot be used inside map/foreach operators,
         * so the matched ids are collected through an accumulator instead.
         */
        reduceRDD.foreach(new VoidFunction<Tuple2<String, String>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public void call(Tuple2<String, String> s2) throws Exception {
                String IDCard = s2._1;
                String[] values = s2._2.split("&");

                for (int i = 0; i < values.length; i++) {
                    for (int k = i + 1; k < values.length; k++) {
                        String value1 = values[i];
                        String value2 = values[k];

                        String[] item1 = value1.split("_");
                        String[] item2 = value2.split("_");

                        String id1 = item1[0];
                        String lon1 = item1[1];
                        String lat1 =item1[2];
                        String tgsj1 = item1[3];

                        String id2 = item2[0];
                        String lon2 = item2[1];
                        String lat2 =item2[2];
                        String tgsj2 = item2[3];

                        double subHour = TimeUtils.getSubHour(tgsj1, tgsj2);
                        double distance = MapUtils.getLongDistance(Double.valueOf(lon1), Double.valueOf(lat1),Double.valueOf(lon2),Double.valueOf(lat2));

                        Integer speed = SpeedUtils.getSpeed(distance, subHour);

                        // Core operation: flag the pair if the implied speed exceeds the threshold.
                        if (speed != null && speed > 5) {
                            acc.add(id1 + "_" + id2);

                            /**
                             * Querying a DataFrame/RDD here would throw a NullPointerException:
                             * DataFrames and RDDs must not be used inside map/foreach operators,
                             * which is why only the ids are collected and the lookup is done on the driver.
                             */
                        }
                    }
                }
            }
        });

        List<String> accValue = acc.value();

        for(String id: accValue) {
            Dataset<Row> resultDF3 = spark.sql("select RSFZ,GRXB,PSQK, TGSJ," +
                    "SBBH,JWZB from t_alldata where id in (" + id.split("_")[0] + "," + id.split("_")[1] + ")");
            resultDF3.show(20);

            // Add the creation-time column (CJSJ in the MySQL table) before writing.
            Dataset<Row> resultDF4 = resultDF3.withColumn("CJSJ", functions.lit(new Date().getTime()));
            resultDF4.write()
                    .format("jdbc")
                    .option("url", "jdbc:mysql://bigdata03:3306/test?characterEncoding=UTF-8")
                    .option("dbtable", "t_verify_result")
                    .option("user", "root")
                    .option("password", "123456")
                    .mode(SaveMode.Append)
                    .save();
        }
    }
}
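
The code relies on three helper classes, TimeUtils, MapUtils and SpeedUtils, whose implementations are not shown in the post (they are part of the linked repository). A minimal sketch of what they might look like, with the method names and signatures taken from the calls above and the bodies being assumptions (timestamp format, kilometres, haversine distance):

// Assumed sketches of the helper classes used above; the real implementations
// are in the linked repository and may differ. Each class goes in its own source file.
public class TimeUtils {
    // Difference between two timestamps, in hours (format assumed to be "yyyy-MM-dd HH:mm:ss").
    public static double getSubHour(String tgsj1, String tgsj2) throws Exception {
        java.text.SimpleDateFormat fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long diffMillis = Math.abs(fmt.parse(tgsj1).getTime() - fmt.parse(tgsj2).getTime());
        return diffMillis / (1000.0 * 60 * 60);
    }
}

public class MapUtils {
    // Great-circle (haversine) distance between two lon/lat points, in kilometres.
    public static double getLongDistance(double lon1, double lat1, double lon2, double lat2) {
        double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * earthRadiusKm * Math.asin(Math.sqrt(a));
    }
}

public class SpeedUtils {
    // Average speed implied by covering the given distance in the given number of hours.
    public static Integer getSpeed(double distance, double subHour) {
        if (subHour == 0) {
            return null; // no speed can be derived from identical timestamps
        }
        return (int) Math.round(distance / subHour);
    }
}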

Running the first approach

  1. Create the result table in the MySQL database
CREATE TABLE t_verify_result(
RSFZ text,
GRXB text,
PSQK text,
TGSJ text,
SBBH text,
JWZB text,
CJSJ text
)charset utf8 collate utf8_general_ci;
  2. Reuse the data of the t_people_together table in Hive.
  3. Execute with spark-submit
./spark-submit \
--master spark://bigdata01:7077 \
--class com.monitor.compare.VerifyCompute \
--deploy-mode client \
/root/monitoranalysis-1.0-SNAPSHOT.jar

In this way, the data is stored in the database in the form rsfz | grxb | psqk | tgsj | sbbh | jwzb | cjsj.

The second way

This approach does the processing with pure SQL.
First, define a custom UDF:

import org.apache.spark.sql.api.java.UDF1;

public class ComputeUDF implements UDF1<String, String> {

    @Override
    public String call(String s) throws Exception {
        StringBuilder sbResult = new StringBuilder();

        String[] values = s.split("&");

        // Traverse the records pairwise, extract the pass time and the longitude/latitude,
        // and compute the speed implied by each pair of records.
        for (int i = 0; i < values.length; i++) {
            for (int k = i+1; k < values.length; k++) {
                String value1 = values[i];
                String value2 = values[k];
                String[] item1 = value1.split("_");
                String[] item2 = value2.split("_");

                String id1 = item1[0];
                String lon1 = item1[1];
                String lat1 = item1[2];
                String tgsj1 = item1[3];

                String id2 = item2[0];
                String lon2 = item2[1];
                String lat2 = item2[2];
                String tgsj2 = item2[3];

                //System.out.println("id= "+id1+" ,lon= "+lon1+" ,lat= "+lat1+" ,tgsj= "+tgsj1);
                double subHour = TimeUtils.getSubHour(tgsj1, tgsj2);

                // System.out.println("subHour= "+subHour);
                double distanct = MapUtils.getLongDistance(Double.valueOf(lon1), Double.valueOf(lat1), Double.valueOf(lon2), Double.valueOf(lat2));

                //System.out.println("distanct= "+distanct);
                Integer speed = SpeedUtils.getSpeed(distanct, subHour);

                //System.out.println("speed= "+speed);
                if (speed != null && speed > 5) {
                    if (sbResult.length() > 0) {
                        sbResult.append("&").append(id1 + "_" + id2);
                    } else {
                        //append to StringBuilder for the first time.
                        sbResult.append(id1 + "_" + id2);
                    }
                }
            }
        }
        return sbResult.toString().length() > 0 ? sbResult.toString() : null;
    }
}
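
Before getVerify can be called from SQL, the UDF has to be registered on the SparkSession. A one-line sketch, assuming the same spark session as in the first approach:

// Register ComputeUDF under the name used in the SQL queries below
// (DataTypes comes from org.apache.spark.sql.types.DataTypes).
spark.udf().register("getVerify", new ComputeUDF(), DataTypes.StringType);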

Then use the registered UDF in the Spark program:

Dataset<Row> sqlDF = spark.sql("select rsfz,verifyValue from(" +
                "select rsfz,getVerify(concat_ws('&',collect_set(concat_ws('_',id,jwzb,tgsj)))) " +
                "as verifyValue from t_people_together group by rsfz) where verifyValue is not null");

If the SQL statement above is hard to follow, it can be split into three steps:

spark.sql("select RSFZ,concat_ws('_',ID,JWZB,TGSJ) as concatValue from t_alldata").show(false);

spark.sql("select RSFZ,concat_ws('&',collect_set(concat_ws('_',ID,JWZB,TGSJ))) " +
               "as concatValue from t_alldata group by RSFZ").show(false);

getVerify returns id1_id2, so the result is <RSFZ, id1_id2>:
spark.sql("select RSFZ,getVerify(concat_ws('&',collect_set(concat_ws('_',ID,JWZB,TGSJ)))) " +
                "as verifyValue from t_alldata group by RSFZ").show(false);

Running the second approach

  1. Create the result table in the MySQL database
CREATE TABLE t_verify_result2(
RSFZ text,
GRXB text,
PSQK text,
TGSJ text,
SBBH text,
JWZB text,
CJSJ text
)charset utf8 collate utf8_general_ci;
  2. Submit with spark-submit
./spark-submit \
--master spark://bigdata01:7077 \
--class com.monitor.compare.VerifyCompute2 \
--deploy-mode client \
/root/monitoranalysis-1.0-SNAPSHOT.jar

Example code

https://github.com/yy1028500451/MonitorAnalysis/tree/master/src/main/java/com/monitor/compare

Topics: Spark SQL MySQL Database