Contents
Requirement: given a number of tables, find which of them are used in the code
Reframed idea: given keywords, count how many hits they get in the files
Step 2: read the files under the target path
Step 4: merge the contents of the matched files
Step 6: core computation [broadcast join, count, filter, deduplication]
Tip 2: wildcard matching does not recurse more than two levels deep, e.g. when specifying /opt/*
Tip 4: to split on multiple separators at once, join them with |
Requirement: given a number of tables, find which of them are used in the code
Reframed idea: given keywords, count how many hits they get in the files
Technology selection: Spark parses the keyword list into rdd1; Spark parses the files under the directory into rdd2; rdd1 join rdd2 (broadcast join)
(The original plan was to collect the files in the directory with Flume, sink them to HDFS, and then compute with Spark; it turned out that Spark alone covers the whole job.)
In short: take the words from a list and find, across a set of files, how many places they appear and which words are actually used, as sketched below.
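To make the plan concrete before walking through the steps, here is a minimal sketch of the idea, assuming a SparkContext called sc and placeholder paths (the real paths and the full step-by-step code follow):

// Minimal sketch of the plan; paths are placeholders, not the real ones used below.
val keywords = sc.textFile("file:///path/to/table_list.txt").collect()   // small side: keyword list to the driver
val keywordsBC = sc.broadcast(keywords)                                  // broadcast the small side
val files = sc.wholeTextFiles("file:///path/to/src/*/*.ts")              // big side: (path, content) per file, stays distributed
val hits = files.flatMap { case (path, content) =>
  keywordsBC.value.filter(k => content.contains(k)).map(k => (k, path))  // (keyword, file) for every hit
}
hits.distinct().collect().foreach(println)                               // which keywords are used, and where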
The list.txt file looks like this:
table_a
table_b
table_c
Queried Directory:
game
├── lagou
│   ├── notes
│   │   ├── Kafka.pdf
│   │   ├── Redis01.pdf
│   │   └── Redis06.pdf
│   ├── servers
│   │   ├── apache-maven-3.6.3
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── README.txt
│   │   ├── flume-1.9.0
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── tools
│   │   ├── hadoop-2.9.2
│   │   │   ├── bin
│   │   │   ├── etc
│   │   │   └── share
│   │   ├── hbase-1.3.1
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── README.txt
│   │   ├── hive-2.3.7
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── scripts
│   │   ├── kafka_2.12-1.0.2
│   │   │   ├── bin
│   │   │   ├── config
│   │   │   └── site-docs
│   │   ├── spark-2.4.5
│   │   │   ├── bin
│   │   │   ├── conf
│   │   │   └── yarn
│   │   └── zookeeper-3.4.14
│   │       ├── bin
│   │       ├── conf
│   │       └── zookeeper-server
│   └── software
│       ├── azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz
│       ├── cdh
│       │   ├── 5 -> 5.7.6
│       │   ├── 5.7 -> 5.7.6
│       │   └── 5.7.6
│       ├── clickhouse2
│       ├── flink-1.11.1-bin-scala_2.11.tgz
│       └── nohup.out
└── rh
    └── devtoolset-8
        ├── enable
        └── root
            ├── bin -> usr/bin
            ├── etc
            ├── home
            ├── opt
            ├── root
            ├── usr
            └── var
Experimental environment
IntelliJ IDEA, Scala 2.11.8, Spark 2.4.5, Hadoop 2.9.2, Maven 3.6.3
Now for the code:
Step 1: create SparkContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf()
  .setAppName(this.getClass.getCanonicalName.init)
  .setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
Step 2: read the files under the target path
val srcFile0 = "file:///G:\\client\\TsProj\\src\\game\\*"
val srcFile1 = "file:///G:\\client\\TsProj\\src\\game\\*\\*.ts"
val srcFile2 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*.ts"
val srcFile3 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*\\*.ts"
val srcFile4 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*.ts"
val srcFile5 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*\\*.ts"
val srcFile6 = "file:///G:\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*\\*\\*.ts"
val fileList = Array(srcFile2, srcFile3, srcFile4, srcFile5, srcFile6)
val content_map: RDD[(String, String)] = sc.wholeTextFiles(fileList.mkString(","))
Step 3: read keyword list
val listFile = "file:///C:\\Users\\pc\\Desktop\\table_list.txt"
val listData = sc.textFile(listFile).map(x => x.split("\\|")(0)).collect()
listData.foreach(println)
Step 4: merge the contents of the matched files
val arr_lines: Array[String] = content_map.collect().map { content =>
  println(s"path = ${content._1}")
  content._2
}.flatMap(_.split("\n+"))
/**
 * path = file:/G:/client/TsProj/src/game/entity/component/BagComponent.ts
 * path = file:/G:/client/TsProj/src/game/entity/component/BattleArrayComponent.ts
 * path = file:/G:/client/TsProj/src/game/entity/component/CardBaoQiComponent.ts
 */
arr_lines.take(5).foreach { l => println(s"Each line==> ${l}") }
/**
 * Each line==> import {component} from "../../../../framework/entity/component";
 * Each line==> import {ysj_ts} from "../../../data/Pb/Gen/Pb";
 * Each line==> import {s} from '../../global/GameConfig';
 * Each line==> import {itemconfig_tr} from "data/FB/itemconfig-t-r";
 * Each line==> import {arraymap} from "framework/common/arraymap";
 */
println(s"Number of rows = ${arr_lines.length}")
/**
 * Number of rows = 68243
 */
Step 5: word processing [segmentation, prefix filtering, suffix filtering]
// Split on [whitespace, dot, equals sign, semicolon]
val words = arr_lines.flatMap(_.trim.split("\\s+|\\.|=|;"))
words.take(3).foreach { w => println(s"Every word==> ${w}") }
/**
 * Every word==> import
 * Every word==> {
 * Every word==> component
 */
println(s"Number of words = ${words.length}")
/**
 * Number of words = 230893
 */
// Filter by keyword prefix
val filerStartWord = words.filter(_.startsWith("get"))
println(s"Filter prefix==> ${filerStartWord.length}")
/**
 * Filter prefix==> 5664
 */
filerStartWord.take(3).foreach { w => println(s"Per prefix==> ${w}") }
/**
 * Per prefix==> getinstance()
 * Per prefix==> getinstance()
 * Per prefix==> getinstance()
 */
// Filter by keyword suffix
val filerEndWord = words.filter(_.endsWith("TBS()"))
println(s"Filter suffix==> ${filerEndWord.length}")
/** Filter suffix==> 197 */
filerEndWord.foreach { w => println(s"Per suffix==> ${w}") }
/**
 * Per suffix==> getmainlevelconfigTBS()
 * Per suffix==> getmainlevelsettingconfigTBS()
 * Per suffix==> getelementalresonanceconfigureTBS()
 */
Step 6: core computation [broadcast join, count, filter, deduplication]
// Broadcast variable
val listBC = sc.broadcast(listData)
// Join, count, filter, deduplicate
val tuples = filerEndWord.map { iter =>
  val tabInfo: Array[String] = listBC.value
  //println(iter)
  var cnt = 0
  for (tab <- tabInfo) {
    if (iter.toLowerCase.contains("get" + tab.toLowerCase + "tbs")) {
      cnt = cnt + 1
    }
  }
  (iter, cnt)
}.filter(_._2 > 0).distinct
tuples.take(3).foreach(println(_))
/**
 * (getleveldifficultyconfigTBS(),1)
 * (getmonsterviewconfigTBS(),1)
 */
println(s"matching = ${tuples.length}")
/**
 * matching = 2
 */
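Side note: because Step 4 calls collect(), Steps 5 and 6 actually run on driver-side Arrays, so the broadcast variable is not strictly needed there. Below is a minimal sketch (my own variant, not the code above) that keeps the data as RDDs, so the broadcast really does save shipping the keyword list with every task:

// Sketch: keep everything as RDDs; only the small matched result is collected at the end.
val linesRDD = content_map.flatMap { case (_, content) => content.split("\n+") }
val wordsRDD = linesRDD.flatMap(_.trim.split("\\s+|\\.|=|;"))
val candidates = wordsRDD.filter(w => w.startsWith("get") && w.endsWith("TBS()"))
val matched = candidates.map { w =>
  val hitCount = listBC.value.count(tab => w.toLowerCase.contains("get" + tab.toLowerCase + "tbs"))
  (w, hitCount)
}.filter(_._2 > 0).distinct()
matched.collect().foreach(println)   // only the small result comes back to the driver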
Step 7: close SparkContext
// Close SparkContext
sc.stop()
Summary
Tip 1: sc.textFile reads files and returns just their contents, one line per element;
sc.wholeTextFiles reads files and returns [(path, content)] pairs, i.e. one tuple of file path and whole file content per file.
Both support wildcard (glob) matching of directories and files, for example
c:/dir/*/*
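A quick illustration of the difference, using the made-up c:/dir example above:

// Illustrative only: same glob, different return shapes.
val lines: RDD[String] = sc.textFile("file:///c:/dir/*/*")                   // one element per line, the path is lost
val files: RDD[(String, String)] = sc.wholeTextFiles("file:///c:/dir/*/*")   // one element per file: (path, whole content)
files.take(1).foreach { case (path, content) => println(s"$path -> ${content.length} chars") }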
Tip 2: wildcard matching does not recurse more than two levels deep.
For example, specifying /opt/*:
opt
├── lagou
│   ├── notes
│   │   ├── Kafka.pdf
│   │   └── 267.pdf
│   ├── servers
│   │   ├── apache-maven-3.6.3
│   │   └── zookeeper-3.4.14
│   └── software
│       ├── apache-maven-3.6.3-bin.tar.gz
│       ├── azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz
│       ├── cdh
│       ├── clickhouse
│       ├── derby.log
│       ├── mysql-community-test-5.7.26-1.el7.x86_64.rpm
│       └── nohup.out
└── rh
    └── devtoolset-8
        ├── enable
        └── root
Only the files directly under [opt] and under [opt/lagou, opt/rh] are found; the deeper directories [notes, servers, software] are not reached. Tip 3 below shows how to work around this.
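In other words, each /* in the pattern covers exactly one extra level, which is what Tip 3 exploits. A rough sketch (paths are illustrative):

// Each extra /* reaches one level deeper; none of them recurses on its own.
sc.wholeTextFiles("file:///opt/*")       // files under opt and under opt/lagou, opt/rh
sc.wholeTextFiles("file:///opt/*/*")     // one level deeper: notes, servers, software
sc.wholeTextFiles("file:///opt/*/*/*")   // and so on, one level per wildcard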
Tip 3: to cover directories at more depths, build an array of glob patterns and splice them into a single comma-separated path string, as shown below.
val srcFile2 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*.ts"
val srcFile3 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*\\*.ts"
val srcFile4 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*.ts"
val srcFile5 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*\\*.ts"
val srcFile6 = "file:///G:\\ysj\\client\\TsProj\\src\\game\\*\\*\\*\\*\\*\\*\\*.ts"
val fileList = Array(srcFile2, srcFile3, srcFile4, srcFile5, srcFile6)
val value: RDD[(String, String)] = sc.wholeTextFiles(fileList.mkString(","))
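If hand-writing one pattern per depth gets tedious, the same list could presumably be generated instead; a small sketch under the same path convention as above (names globList and rdd are mine):

// Sketch: generate the glob patterns for depths 2 to 6 instead of writing them out by hand.
val base = "file:///G:\\ysj\\client\\TsProj\\src\\game"
val globList = (2 to 6).map(depth => base + "\\*" * depth + "\\*.ts")
val rdd: RDD[(String, String)] = sc.wholeTextFiles(globList.mkString(","))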
Tip 4: to split on multiple separators at once, join them with | in the regex.
val words = parts.flatMap(_.trim.split("\\s+|\\.|=|;"))
Four separators are supported here: runs of whitespace, [.], [=], [;]; a word is cut wherever any of them matches.
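A quick check of that split regex on a made-up line of TypeScript-like source:

// Made-up input line; any of the four separators cuts a word, empty fragments are dropped for readability.
val sample = "this.cfg = GameData.getFooTBS(); let n = 1"
sample.trim.split("\\s+|\\.|=|;").filter(_.nonEmpty).foreach(println)
// prints, one per line: this, cfg, GameData, getFooTBS(), let, n, 1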