Spark + parse text + recursion + pattern matching + broadcast filtering

Posted by shan111 on Tue, 04 Jan 2022 15:35:44 +0100

Contents

Requirement: given a list of tables, find out how many and which of them are referenced in the code

Reframed as: given a list of keywords, count how many hits each gets in the files

Approach: Spark parses the keyword list into rdd1; Spark parses the data under the file directory into rdd2; rdd1 join rdd2 (broadcast)

On to the code:

Step 1: create SparkContext

Step 2: read the files under the target path

Step 3: read the keyword list

Step 4: merge the contents of the files

Step 5: word processing [segmentation, prefix filtering, suffix filtering]

Step 6: core computing [broadcast join, count, filter, de-duplication]

Step 7: close SparkContext

Summary

Tip 1: sc.textFile reads files and returns their lines; sc.wholeTextFiles reads files and returns (path, content) pairs, i.e. each file path paired with its content. Both accept glob patterns for directories and files, such as c:/dir/*/*

Tip 2: glob matching does not recurse into subdirectories two or more levels down, e.g. when specifying /opt/*

Tip 3: to cover directories at several depths, build an array of path patterns and join them into one comma-separated string of paths

Tip 4: to split on several separators at once, combine them with | in the regex

Requirement: given a list of tables, find out how many and which of them are referenced in the code

Reframed as: given a list of keywords, count how many hits each gets in the files

Approach: Spark parses the keyword list into rdd1; Spark parses the data under the file directory into rdd2; rdd1 join rdd2 (broadcast)

(The original plan was to collect the files in the directory with Flume, sink them to HDFS, and then compute with Spark. It turned out that Spark alone covers the whole job.)

Given a list of keywords, search a set of files to find out how many places match and which keywords are used.

The list.txt file looks like this:

table_a
table_b
table_c

Directory to query:

game
├── lagou
│   ├── notes
│   │   ├── Kafka.pdf
│   │   ├── Redis01.pdf
│   │   └── Redis06.pdf
│   ├── servers
│   │   ├── apache-maven-3.6.3
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── README.txt
│   │   ├── flume-1.9.0
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── tools
│   │   ├── hadoop-2.9.2
│   │   │   ├── bin
│   │   │   ├── etc
│   │   │   └── share
│   │   ├── hbase-1.3.1
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── README.txt
│   │   ├── hive-2.3.7
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── scripts
│   │   ├── kafka_2.12-1.0.2
│   │   │   ├── bin
│   │   │   ├── config
│   │   │   └── site-docs
│   │   ├── spark-2.4.5
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── yarn
│   │   └── zookeeper-3.4.14
│   │       ├── bin
│   │       ├── conf
│   │       └── zookeeper-server
│   └── software
│       ├── azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz
│       ├── cdh
│       │   ├── 5 -> 5.7.6
│       │   ├── 5.7 -> 5.7.6
│       │   └── 5.7.6
│       ├── clickhouse2
│       ├── flink-1.11.1-bin-scala_2.11.tgz
│       └── nohup.out
└── rh
    └── devtoolset-8
        ├── enable
        └── root
            ├── bin -> usr/bin
            ├── etc
            ├── home
            ├── opt
            ├── root
            ├── usr
            └── var

Experimental environment

IntelliJ IDEA
scala        2.11.8
spark        2.4.5
hadoop        2.9.2
maven        3.6.3
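
For reference, a minimal sbt build matching these versions might look like the sketch below. The post builds with Maven and does not show its build file, so the project name here is a placeholder.

// build.sbt -- versions taken from the environment table above; the name is a placeholder
name := "spark-keyword-scan"
version := "0.1"
scalaVersion := "2.11.8"

// Spark 2.4.5 built for Scala 2.11 is all the snippets below need
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"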

On to the code:

Step 1: create SparkContext

// Spark entry point and the RDD type used below
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

Step 2: read the files under the target path

    val srcFile0 = "file:///G:\\client\\TsProj\\src\\game\\*";
    val srcFile1 = "file:///G:\\client\\TsProj\\src\\game\\*\\*.ts";
    val srcFile2 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*.ts";
    val srcFile3 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*\\*.ts";
    val srcFile4 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*.ts";
    val srcFile5 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*\\*.ts";
    val srcFile6 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*\\*\\*.ts";
    val fileList = Array(srcFile2, srcFile3, srcFile4, srcFile5, srcFile6);
    // wholeTextFiles accepts a comma-separated list of paths/globs and returns (path, content) pairs
    val content_map: RDD[(String, String)] = sc.wholeTextFiles(fileList.mkString(","))

Step 3: read the keyword list

    val listFile = "file:///C:\\Users\\pc\\Desktop\\table_list.txt";
    // each line may carry extra pipe-delimited fields; keep only the table name
    val listData = sc.textFile(listFile).map(x => x.split("\\|")(0)).collect();
    listData.foreach(println)

Step 4: merge the contents of the files

    // print each file's path, keep its content, and split everything into lines
    val arr_lines: Array[String] = content_map.collect().map {
      content =>
        println(s"path = ${content._1}")
        content._2
    }.flatMap(_.split("\n+"))

    /**
     * path = file:/G:/client/TsProj/src/game/entity/component/BagComponent.ts
     * path = file:/G:/client/TsProj/src/game/entity/component/BattleArrayComponent.ts
     * path = file:/G:/client/TsProj/src/game/entity/component/CardBaoQiComponent.ts
     */

    arr_lines.take(5).foreach { l => println(s"Each line==> ${l}") }
    /**
     * Each line==> import { component } from "../../../../framework/entity/component";
     * Each line==> import { ysj_ts } from "../../../data/Pb/Gen/Pb";
     * Each line==> import { s } from '../../global/GameConfig';
     * Each line==> import { itemconfigTR } from "data/FB/itemconfig-t-r";
     * Each line==> import { arraymap } from "framework/common/arraymap";
     */

    println(s"Number of rows = ${arr_lines.length}");
    /**
     * Number of rows = 68243
     */

Step 5: word processing [segmentation, prefix filtering, suffix filtering]

    // Split on whitespace, dot, equals sign and semicolon
    val words = arr_lines.flatMap(_.trim.split("\\s+|\\.|=|;"))
    words.take(3).foreach { w => println(s"Every word==> ${w}") }

    /**
     * Every word==> import
     * Every word==> {
     * Every word==> component
     */

    println(s"Number of words = ${words.length}");
    /**
     * Number of words = 230893
     */

    // Filter by keyword prefix
    val filerStartWord = words.filter(_.startsWith("get"))
    println(s"Filter prefix==> ${filerStartWord.length}");

    /**
     * Filter prefix==> 5664
     */

    filerStartWord.take(3).foreach { w => println(s"Per prefix==> ${w}") }
    /**
     * Per prefix==> getinstance()
     * Per prefix==> getinstance()
     * Per prefix==> getinstance()
     */

    // Filter by keyword suffix
    val filerEndWord = words.filter(_.endsWith("TBS()"))
    println(s"Filter suffix==> ${filerEndWord.length}");

    /** Filter suffix==> 197 */
    filerEndWord.foreach { w => println(s"Per suffix==> ${w}") }

    /**
     * Per suffix==> getmainlevelconfigTBS()
     * Per suffix==> getmainlevelsettingconfigTBS()
     * Per suffix==> getelementalresonanceconfigureTBS()
     */

Step 6: core computing [broadcast join, count, filter, de-duplication]

    // Broadcast the keyword list
    val listBC = sc.broadcast(listData)
    // Join against the broadcast list, count matches, filter, de-duplicate
    val tuples = filerEndWord.map { iter =>
      val tabInfo: Array[String] = listBC.value
      //println(iter)
      var cnt = 0;
      for (tab <- tabInfo) {
        if (iter.toLowerCase.contains("get" + tab.toLowerCase + "tbs")) {
          cnt = cnt + 1;
        }
      }
      (iter, cnt);
    }.filter(_._2 > 0).distinct

    tuples.take(3).foreach(println(_))
    /**
     * (getleveldifficultyconfigTBS(),1)
     * (getmonsterviewconfigTBS(),1)
     */
    
    println(s"matching = ${tuples.length}")
    /**
     * matching = 2
     */
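
Note that Step 4 collected everything to the driver, so words, filerEndWord and tuples above are plain Scala arrays and the broadcast variable is not strictly required there. Below is a sketch of the same Step 6 logic with the data kept as RDDs, where broadcasting the keyword list actually pays off; it reuses the names defined above and is an assumed variant, not part of the original post.

    // Same pipeline without collect(): the data stays distributed across executors
    val linesRDD  = content_map.values.flatMap(_.split("\n+"))
    val wordsRDD  = linesRDD.flatMap(_.trim.split("\\s+|\\.|=|;"))
    val suffixRDD = wordsRDD.filter(_.endsWith("TBS()"))

    val listBC2 = sc.broadcast(listData)   // keyword list is shipped once per executor
    val tuplesRDD = suffixRDD.map { word =>
      // count how many keywords from the broadcast list this word matches
      val cnt = listBC2.value.count(tab => word.toLowerCase.contains("get" + tab.toLowerCase + "tbs"))
      (word, cnt)
    }.filter(_._2 > 0).distinct()

    println(s"matching = ${tuplesRDD.count()}")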
    

Step 7: close SparkContext

    // Close SparkContext
    sc.stop()

Summary

Tip 1: sc.textFile reads files and returns an RDD of lines (content only);
sc.wholeTextFiles reads files and returns an RDD of (path, content) tuples, i.e. each file path paired with its content.
Both accept glob patterns for directories and files, such as

c:/dir/*/*
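
For illustration, the difference in return types looks like this (the path is the placeholder from above):

// textFile: one element per line across all matched files; file boundaries are lost
val lines: RDD[String] = sc.textFile("file:///c:/dir/*/*")
// wholeTextFiles: one element per file, as a (path, content) pair
val files: RDD[(String, String)] = sc.wholeTextFiles("file:///c:/dir/*/*")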

Tip 2: glob matching does not recurse into subdirectories two or more levels down.
For example, specifying /opt/*:

opt
├── lagou
│   ├── notes
│   │   ├── Kafka.pdf
│   │   └── 267.pdf
│   ├── servers
│   │   ├── apache-maven-3.6.3
│   │   └── zookeeper-3.4.14
│   └── software
│       ├── apache-maven-3.6.3-bin.tar.gz
│       ├── azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz
│       ├── cdh
│       ├── clickhouse
│       ├── derby.log
│       ├── mysql-community-test-5.7.26-1.el7.x86_64.rpm
│       └── nohup.out
└── rh
    └── devtoolset-8
        ├── enable
        └── root

Only the files directly under [opt] and the files in [opt/lagou, opt/rh] are found; the deeper directories [notes, servers, software] cannot be reached with this pattern. Tip 3 below is the workaround used here, and a recursive-listing alternative is sketched right after this paragraph.
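
One possible alternative, not used in the original post, is to list the files recursively with the Hadoop FileSystem API and hand the collected paths to Spark; a rough sketch, assuming the directory is visible to the filesystem Spark is configured with (local path or HDFS URI):

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer

// Walk /opt recursively and collect every file path, however deep it sits
val fs = FileSystem.get(sc.hadoopConfiguration)
val found = ArrayBuffer[String]()
val it = fs.listFiles(new Path("/opt"), true)   // true = recurse into subdirectories
while (it.hasNext) {
  found += it.next().getPath.toString
}
// hand the full list to Spark as one comma-separated string, as in Tip 3
val contents = sc.wholeTextFiles(found.mkString(","))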

Tip 3: to cover directories at several depths, build an array of path patterns and join them into a single comma-separated string of paths

val srcFile2 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*.ts";
val srcFile3 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*\\*.ts";
val srcFile4 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*.ts";
val srcFile5 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*\\*.ts";
val srcFile6 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*\\*\\*.ts";
val fileList = Array(srcFile2, srcFile3, srcFile4, srcFile5, srcFile6);
val value: RDD[(String, String)] = sc.wholeTextFiles(fileList.mkString(","))

Tip 4: to split on several separators at once, combine them with | in the regex

val words = parts.flatMap(_.trim.split("\\s+|\\.|=|;"))

Four separators are handled here: runs of whitespace, [.], [=] and [;]; the string is split wherever any of them matches, as shown below.
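
For example, with a made-up sample line (for illustration only; the identifiers are hypothetical):

// split on any of: runs of whitespace, '.', '=' or ';'
val sample = "const cfg = TableManager.getInstance().getItemConfigTBS();"
val tokens = sample.trim.split("\\s+|\\.|=|;").filter(_.nonEmpty)
tokens.foreach(println)   // const, cfg, TableManager, getInstance(), getItemConfigTBS()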

Topics: Scala Big Data Spark