Spark Day10: Spark Streaming
01 - [understand] - yesterday's course content review
Practical exercise: using the DMP advertising industry as the background, the processing of advertising click data is divided into two parts [advertising data ETL conversion and business report development], as follows:
[Premise]: complete the case exercise with SparkSQL and write the code.
1. Advertising data ETL transformation
   - Data flow: JSON text data -> DataFrame: extract the IP address, parse it and convert it to province and city -> save to a Hive partition table
   - Data source: file system (HDFS, LocalFS) text files with JSON-format data; parse the IP address and convert it to province and city
   - Implementation: with DSL programming you can call SQL-like statements and functions, and also RDD transformation functions such as mapPartitions
   - Data sink: Hive partition table
2. Business report analysis
   - [Premise]: by default, each run analyzes the data of the previous day
   - Data flow: Hive partition table -> DataFrame: report analysis and statistics according to the business -> MySQL database table
   - [Note]:
     a. When loading data, filter so that only the previous day's data is obtained
     b. For report analysis, SQL programming is relatively easy; DSL programming can also be considered
     c. When saving data, the external data source interface provided by SparkSQL cannot be used directly; use native JDBC instead:
        dataframe.rdd.foreachPartition(iter => saveToMySQL(iter))
   - [Extension]: when writing data to a MySQL table, use upsert [if the primary key exists, update; if it does not exist, insert]
     - Method 1: use REPLACE instead of INSERT
       REPLACE INTO db_test.tb_wordcount (word, count) VALUES(?, ?)
       This has some limitations, such as the need to list all columns
     - Method 2: ON DUPLICATE KEY UPDATE
       INSERT INTO ods_qq_group_members (gid, uin, datadate) VALUES (111, 1111111, '2016-11-29')
       ON DUPLICATE KEY UPDATE gid=222, uin=22222, datadate='2016-11-29'
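To make the upsert write concrete, here is a minimal sketch of a saveToMySQL helper called from foreachPartition. It assumes the db_test.tb_wordcount table from Method 1 (word as primary key), a hypothetical host node1.itcast.cn and credentials, and rows already shaped as (word, count) tuples; adapt these to the real environment.

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}

// Sketch of an upsert writer, used as: reportRDD.foreachPartition(iter => saveToMySQL(iter))
// Table name, host and credentials below are assumptions for illustration only.
def saveToMySQL(iter: Iterator[(String, Int)]): Unit = {
  var conn: Connection = null
  var pstmt: PreparedStatement = null
  try {
    // One connection per partition, not per record
    conn = DriverManager.getConnection(
      "jdbc:mysql://node1.itcast.cn:3306/db_test", "root", "123456")
    // Upsert: insert a new word, or add to its count when the primary key already exists
    val sql = "INSERT INTO db_test.tb_wordcount (word, count) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE count = count + VALUES(count)"
    pstmt = conn.prepareStatement(sql)
    iter.foreach { case (word, count) =>
      pstmt.setString(1, word)
      pstmt.setInt(2, count)
      pstmt.addBatch()
    }
    pstmt.executeBatch()
  } finally {
    if (pstmt != null) pstmt.close()
    if (conn != null) conn.close()
  }
}
```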
02 - [understand] - outline of today's course content
Starting today we enter the streaming data analysis modules of the Spark framework.
- SparkCore and SparkSQL: offline batch analysis; the data being analyzed is static and does not change
- SparkStreaming and StructuredStreaming: real-time streaming data analysis; the data is generated continuously and is analyzed as soon as it is generated
First, we learn the SparkStreaming streaming computing module, which processes streaming data with a batch-processing idea for real-time analysis.
1. Streaming overview of stream computing
   - Current application scenarios of streaming computing
   - Lambda architecture: offline plus real-time
   - Streaming computation modes
   - SparkStreaming's computing idea
2. Introductory case
   - Word frequency statistics: running the official example
   - Programming implementation: SparkStreaming introductory programming
   - How SparkStreaming works: processing streaming data with the idea of batches
3. DStream: discretized stream
   - What is a DStream? DStream = Seq[RDD]
   - DStream operations: functions divided into two categories, transformation functions and output functions
   - States of streaming applications
03 - [understand] - data structure abstraction of each module in Spark framework
The Spark framework is a unified analysis engine that contains many modules, and each module has a data structure to encapsulate data.
In Spark 1.x, the three main modules each encapsulate data with their own structure:
- SparkCore: RDD
- SparkSQL: DataFrame/Dataset
- SparkStreaming: DStream
Since Spark 2.x, it is recommended to use SparkSQL to analyze both offline data and streaming data with Dataset/DataFrame. The StructuredStreaming module appeared, which encapsulates streaming data into a Dataset and analyzes it with DSL and SQL.
04 - [understand] - Streaming overview: application scenarios of streaming computing
- 1) E-commerce real-time large screens: every year on Double Eleven, the real-time order sales and product quantities of Taobao and JD are displayed on a large screen
- 2) Product recommendation: the JD and Taobao shopping sites have product recommendation modules in the shopping cart, on product detail pages and elsewhere
- 3) Industrial big data: in modern workshops, networked equipment reports its own operating status; at the application layer, these data are used to analyze operating status and robustness and to show the completion and running status of workpieces
- 4) Cluster monitoring: large clusters and platforms generally need to be monitored
Specifically, the application scenarios of streaming computing are as follows:
05 - [Master] - Streaming overview: Lambda architecture
Lambda architecture is a real-time big data processing framework proposed by Nathan Marz, the author of Storm.
- Its goal is a big data architecture that can perform both offline data analysis and real-time data computation.
The Lambda architecture is divided into three layers:
- Layer 1: Batch Layer, the batch layer, for offline data analysis
- Layer 2: Speed Layer, the speed layer, for real-time data computation
- Layer 3: Serving Layer, the serving layer, which serves the results of offline analysis and real-time computation to the outside, for example for visual display
The Lambda architecture combines offline computing and real-time computing, incorporates a series of architecture principles such as immutability, read/write separation and complexity isolation, and can integrate Hadoop, Kafka, Storm, Spark, HBase and other big data components.
For instance [the annual Double 11 shopping festival and the user transaction orders of that day]:
- Point 1: compute the total sales amount (totalAmt) of transaction orders in real time and display it on the large screen in real time
- Point 2: after 11.11, in the early morning of 11.12, start to analyze the previous day's transaction data offline, e.g. which province spent the most, or in which city female consumers spent the most
Whether it is real-time computing or offline analysis, the results need to be displayed in the end, and the architecture must provide data for both kinds of computation and analysis.
The Lambda architecture solves this problem by decomposing it into a three-tier architecture: Batch Layer, Speed Layer and Serving Layer.
06 - [Master] - Streaming overview: streaming data computation modes
At present, there are several streaming computing frameworks in the big data field:
- 1) Storm framework
  - Alibaba used this framework for Double 11 a few years ago
- 2) Samza, open-sourced by LinkedIn
  - It relies heavily on Kafka and is rarely used by domestic companies
- 3) SparkStreaming
  - A streaming computing framework based on SparkCore; not widely used at present
- 4) Flink framework
  - Currently the most popular framework in the field of big data streaming computing, widely promoted especially in China; it is used by all major companies, with high real-time performance and large throughput, especially at Alibaba
- 5) StructuredStreaming
  - The streaming data processing module of the SparkSQL framework
  - Introduced in Spark 2.0 and relatively good; many companies that already use SparkSQL choose StructuredStreaming directly when streaming data needs real-time processing
Different stream processing frameworks have different characteristics and suit different scenarios; there are mainly the following two modes.
In general, a streaming computing engine (framework) processes streaming data in one of two modes:
- Mode 1: Native stream processing
All input records are processed one by one as they arrive; Storm and Flink mentioned above adopt this method.
Each piece of data is processed as soon as it is generated, so such frameworks process data very fast and have high real-time performance.
- Mode 2: Micro Batch
Micro batch processing divides streaming data into many batches, often according to time intervals, such as 1 second, for processing and analysis
For StructuredStreaming in Spark:
- By default it belongs to the micro-batch mode, processing data batch by batch
- Starting from Spark 2.3, Continuous Processing analyzes data in the native stream pattern
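A minimal StructuredStreaming sketch (not part of this course's code) of how the two modes are selected via the trigger; the socket source on node1.itcast.cn:9999 is just a placeholder.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerModeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TriggerModeDemo")
      .master("local[3]")
      .getOrCreate()

    // Streaming DataFrame from a socket source (placeholder host/port)
    val lines = spark.readStream
      .format("socket")
      .option("host", "node1.itcast.cn")
      .option("port", 9999)
      .load()

    lines.writeStream
      .format("console")
      // Default mode: micro-batch, one batch per trigger interval
      .trigger(Trigger.ProcessingTime("5 seconds"))
      // Since Spark 2.3: switch to continuous (native stream) processing instead
      // .trigger(Trigger.Continuous("1 second"))
      .start()
      .awaitTermination()
  }
}
```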
07 - [Master] - Streaming overview: SparkStreaming's computing idea
Spark Streaming is an important framework in the Spark ecosystem, built on top of Spark Core. The following figure shows the position of Spark Streaming in the Spark ecosystem.
The official definition of the Spark Streaming module:
SparkStreaming makes it easy for users to build scalable, fault-tolerant streaming applications.
SparkStreaming is a real-time computing framework based on SparkCore. It can consume data from many data sources and process data in real time. It has the characteristics of high throughput and strong fault tolerance.
For Spark Streaming, the stream data is divided by a time interval (BatchInterval) into many parts; each part is a Batch, and each batch of data is treated as an RDD for fast analysis and processing.
- First, divide the streaming data according to the time interval batchInterval, for example 1 second
- Second, the divided data forms batches, and each batch of data is regarded as an RDD
- Third, when processing the streaming data, only each batch's RDD needs to be processed, which is just RDD data analysis and processing
Data structure: DStream, which encapsulates the streaming data and is essentially a collection of a series of RDDs; a DStream divides the data stream into batches according to time intervals such as seconds or minutes.
The streaming data is divided into many batches according to [X seconds]. Each Batch data is encapsulated in RDD for processing and analysis, and finally each Batch data is output.
For the current version of Spark Streaming, the minimum Batch Size is between 0.5 and 5 seconds, so Spark Streaming can meet quasi-real-time streaming computing scenarios.
08 - [Master] - running official word frequency statistics of introductory cases
SparkStreaming officially provides an example case. Function description: consume data in real time from a TCP Socket data source and perform word frequency statistics (WordCount) on each batch of data. The flow chart is as follows:
1. Data source: TCP Socket, from which real-time data is read and then analyzed in real time
2. Data sink: console, to which the result data is output
3. Function: real-time statistics over each batch of data, time interval BatchInterval: 1s
Run the officially provided case with the [$SPARK_HOME/bin/run-example] command. The effect is as follows:
The specific steps are as follows:
The SparkStreaming module is used for streaming data processing, which lies between Batch (offline) processing and RealTime (real-time) processing.
09 - [Master] - Streaming programming module of introductory case
Based on the IDEA integrated development environment, programming implementation: read the streaming data in real time from the TCP Socket, count the word frequency of the data in each batch, and WordCount.
In the Spark framework, each module has its own data structure and its own program entry:
- SparkCore: RDD, SparkContext
- SparkSQL: DataFrame/Dataset, SparkSession (SQLContext in Spark 1.x)
- SparkStreaming: DStream, StreamingContext
  - Parameter: the interval for splitting the streaming data, BatchInterval, e.g. 1s or 5s (in the demo)
  - The bottom layer is still a SparkContext, and each batch of data is treated as an RDD
According to the official documentation, there are two ways to build a StreamingContext instance object. The screenshot is as follows:
- The first way: pass a SparkConf object
- The second way: pass an existing SparkContext object (see the sketch below)
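A minimal sketch of the two construction styles, assuming the same local-mode settings as the demo; only one SparkContext may exist per JVM, so pick one of the two ways.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("StreamingDemo")
  .setMaster("local[3]")

// First way: pass a SparkConf; the SparkContext is created internally
val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(5))

// Second way: reuse an already created SparkContext
// (use this instead of the first way, not in addition to it)
// val sc: SparkContext = new SparkContext(sparkConf)
// val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
```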
For SparkStreaming streaming applications, the code logic is roughly as follows:
Write the SparkStreaming program module, build the StreamingContext streaming context instance object, start the streaming application and wait for termination
package cn.itcast.spark.start

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Based on the IDEA integrated development environment, programmed to read streaming data in real time
 * from a TCP Socket and perform word frequency statistics on the data of each batch.
 */
object _01StreamingWordCount {

  def main(args: Array[String]): Unit = {
    // TODO: 1. Build the StreamingContext instance object, passing the time interval BatchInterval
    val ssc: StreamingContext = {
      // Create a SparkConf object and set the application properties
      val sparkConf: SparkConf = new SparkConf()
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[3]")
      // Pass the sparkConf object and the time interval batchInterval: 5 seconds
      new StreamingContext(sparkConf, Seconds(5))
    }

    // TODO: 2. Define the data source, obtain the streaming data and encapsulate it into a DStream

    // TODO: 3. Call transformation functions on the DStream (similar to the RDD transformation functions) according to the business requirements

    // TODO: 4. Define the data sink and output the result data of each batch

    // TODO: 5. Start the streaming application and wait for termination
    ssc.start()  // Start the streaming application: start consuming data from the source in real time, processing it and outputting results
    // Once started, a streaming application keeps running unless it terminates abnormally or is terminated deliberately
    ssc.awaitTermination()
    // When the streaming application stops, the resources need to be closed
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
10 - [Master] - code implementation and test run of introductory cases
Each streaming application (whether SparkStreaming, StructuredStreaming, or Flink) has three core steps
- Step 1: data source (Source): where the streaming data is consumed from in real time
- Step 2: data transformation (Transformation): process the data according to the business by calling functions
- Step 3: data sink (Sink): save the processing result data to an external system
package cn.itcast.spark.start

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Based on the IDEA integrated development environment, programmed to read streaming data in real time
 * from a TCP Socket and perform word frequency statistics on the data of each batch.
 */
object _01StreamingWordCount {

  def main(args: Array[String]): Unit = {
    // TODO: 1. Build the StreamingContext instance object, passing the time interval BatchInterval
    val ssc: StreamingContext = {
      // Create a SparkConf object and set the application properties
      val sparkConf: SparkConf = new SparkConf()
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[3]")
      // Pass the sparkConf object and the time interval batchInterval: 5 seconds
      new StreamingContext(sparkConf, Seconds(5))
    }

    // TODO: 2. Define the data source, obtain the streaming data and encapsulate it into a DStream
    /*
      def socketTextStream(
        hostname: String,
        port: Int,
        storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
      ): ReceiverInputDStream[String]
     */
    val inputDStream: ReceiverInputDStream[String] = ssc.socketTextStream("node1.itcast.cn", 9999)

    // TODO: 3. Call transformation functions on the DStream (similar to the RDD transformation functions) according to the business requirements
    /*
      spark hive hive spark spark hadoop
     */
    val resultDStream: DStream[(String, Int)] = inputDStream
      // Split each line into words
      .flatMap(line => line.trim.split("\\s+"))
      // Convert each word into a (word, 1) tuple
      .map(word => word -> 1)
      // Group by word and aggregate within each group
      /*
        (spark, 1) (spark, 1)  ->  (spark, [1, 1])  ->  (spark, 2)
        (hive, 1)  (hive, 1)   ->  (hive, [1, 1])   ->  (hive, 2)
       */
      .reduceByKey((tmp, item) => tmp + item)

    // TODO: 4. Define the data sink and output the result data of each batch
    resultDStream.print()

    // TODO: 5. Start the streaming application and wait for termination
    ssc.start()  // Start the streaming application: start consuming data from the source in real time, processing it and outputting results
    // Once started, a streaming application keeps running unless it terminates abnormally or is terminated deliberately
    ssc.awaitTermination()
    // When the streaming application stops, the resources need to be closed
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
Screenshot of operation result monitoring:
11 - [Master] - how SparkStreaming works in the introductory case
When SparkStreaming processes streaming data, it divides the data into micro batches according to the time interval, and each batch of data is treated as RDD for processing and analysis.
- Step 1: create StreamingContext
When the SparkStreaming application starts (streamingContext.start), the StreamingContext streaming context instance object is created first, building the environment of the whole streaming application; the bottom layer is still a SparkContext.
From the Jobs tab of the Web UI, you can see that Job-0 is the Receiver, which keeps running; it runs as a Task and occupies 1 CPU core.
- Step 2: the receiver receives data
After each Receiver starts, it receives data in real time from the data source (such as a TCP Socket) and divides the received streaming data into many blocks according to a time interval.
The time interval by which the Receiver divides the streaming data into blocks is the BlockInterval; its default value is 200 ms and it is set via the property [spark.streaming.blockInterval]. Suppose the Batch interval is set to 1 s: how many blocks does each batch have by default? 1s = 1000ms = 200ms * 5, so 5 blocks. The batch data is treated as one RDD, and this RDD then has 5 partitions.
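A minimal sketch, assuming local mode, of where this property is set; with the 5-second batch interval used in the demo and the 500 ms value below, each batch RDD would have 5s / 500ms = 10 partitions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("BlockIntervalDemo")
  .setMaster("local[3]")
  // Default is 200ms; a larger value means fewer blocks, i.e. fewer partitions per batch RDD
  .set("spark.streaming.blockInterval", "500ms")

// BatchInterval = 5s, BlockInterval = 500ms  =>  10 blocks per batch, 10 partitions per batch RDD
val ssc = new StreamingContext(sparkConf, Seconds(5))
```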
- Step 3: receive the Block Report
The Receiver reports the block information of the received data in real time. When the BatchInterval is reached, the StreamingContext treats the data blocks within the corresponding time range as one RDD and submits a job through the SparkContext to process the data.
The streaming data is processed in this cycle, as shown in the following figure:
12 - [Master] - what is DStream
The SparkStreaming module encapsulates streaming data in the data structure DStream (Discretized Stream, a continuous data stream), which represents both the continuous input data stream and the result data streams obtained after applying various Spark operators.
From the Web UI it can be seen that when a function is called on a DStream, the bottom layer operates on RDDs, and the functions in DStream are largely the same as those in RDD.
13 - [learn] - overview of DStream Operations function
DStream is similar to RDD. It contains many functions for data processing and output. It is mainly divided into two categories:
- First: Transformation function
In SparkStreaming there are three main types of transformations on streams:
  - Transform the data in a stream: map, flatMap, filter
  - Aggregate statistics over the data in a stream: count, reduce, countByValue, ...
  - Combine two streams: union, join, cogroup
- Second: Output function
The foreachRDD function outputs the result RDD of each batch in the DStream; the print function used earlier also calls foreachRDD under the hood. The screenshot is as follows:
There are two important functions in DStream, transform and foreachRDD, which operate directly on the RDD of each batch of data. They are closer to the bottom layer and perform better, and their use is strongly recommended:
14 - [Master] - use of transform function in DStream
Understand the transform function through the source code. There are two method overloads. The declaration is as follows:
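Roughly, the two overloads look like this (simplified signatures; the RDD-only variant is the one used in the code below):

```scala
// Simplified from org.apache.spark.streaming.dstream.DStream
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
def transform[U: ClassTag](transformFunc: (RDD[T], Time) => RDD[U]): DStream[U]
```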
Next, use the transform function to modify the word frequency statistics program. The specific code is as follows:
package cn.itcast.spark.rdd

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Based on the IDEA integrated development environment, programmed to read streaming data in real time
 * from a TCP Socket and perform word frequency statistics on the data of each batch.
 */
object _03StreamingTransformRDD {

  def main(args: Array[String]): Unit = {
    // TODO: 1. Build the StreamingContext instance object, passing the time interval BatchInterval
    val ssc: StreamingContext = {
      // a. Create a SparkConf object and set basic application information
      val sparkConf = new SparkConf()
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[3]")
      // b. Create the instance object and set the BatchInterval
      new StreamingContext(sparkConf, Seconds(5))
    }

    // TODO: 2. Define the data source, obtain the streaming data and encapsulate it into a DStream
    /*
      def socketTextStream(
        hostname: String,
        port: Int,
        storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
      ): ReceiverInputDStream[String]
     */
    val inputDStream: DStream[String] = ssc.socketTextStream(
      "node1.itcast.cn", 9999, storageLevel = StorageLevel.MEMORY_AND_DISK
    )

    // TODO: 3. Call transformation functions on the DStream (similar to the RDD transformation functions) according to the business requirements
    /*
      TODO: If an operation can be done on the RDD, do not do it on the DStream. When a function called
            on the DStream also exists on RDD, prefer the RDD operation.
        def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
     */
    // Here rdd is the RDD of each batch of data in the DStream
    val resultDStream: DStream[(String, Int)] = inputDStream.transform { rdd =>
      val resultRDD: RDD[(String, Int)] = rdd
        // Split each line into words by separator
        .flatMap(line => line.split("\\s+"))
        // Convert each word into a (word, 1) tuple, indicating that the word occurred once
        .map(word => word -> 1)
        // Group by word and aggregate (reduce/sum) within each group
        .reduceByKey((tmp, item) => tmp + item)
      // Return the processing result RDD of each batch
      resultRDD
    }

    // TODO: 4. Define the data sink and output the result data of each batch
    resultDStream.print()

    // TODO: 5. Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
In the Web UI monitoring, view the DAG diagram of the Job executed for each batch of data; it directly shows the operations performed on the RDD.
15 - [Master] - use of foreachRDD function in DStream
The foreachRDD function outputs the result data RDD of a DStream. Similar to the transform function, it operates on the RDD of each batch of data. The source code declaration is as follows:
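Roughly, the two overloads look like this (simplified signatures; the variant with Time is used in the code below to label each batch):

```scala
// Simplified from org.apache.spark.streaming.dstream.DStream
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit
```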
Continue to modify the word frequency statistics code and customize the output data. The specific codes are as follows:
package cn.itcast.spark.output

import org.apache.commons.lang3.time.FastDateFormat
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Based on the IDEA integrated development environment, programmed to read streaming data in real time
 * from a TCP Socket and perform word frequency statistics on the data of each batch.
 */
object _04StreamingOutputRDD {

  def main(args: Array[String]): Unit = {
    // TODO: 1. Build the StreamingContext instance object, passing the time interval BatchInterval
    val ssc: StreamingContext = {
      // a. Create a SparkConf object and set basic application information
      val sparkConf = new SparkConf()
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[3]")
        // TODO: set the file output committer algorithm version to 2
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // b. Create the instance object and set the BatchInterval
      new StreamingContext(sparkConf, Seconds(5))
    }

    // TODO: 2. Define the data source, obtain the streaming data and encapsulate it into a DStream
    /*
      def socketTextStream(
        hostname: String,
        port: Int,
        storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
      ): ReceiverInputDStream[String]
     */
    val inputDStream: DStream[String] = ssc.socketTextStream(
      "node1.itcast.cn", 9999, storageLevel = StorageLevel.MEMORY_AND_DISK
    )

    // TODO: 3. Call transformation functions on the DStream (similar to the RDD transformation functions) according to the business requirements
    /*
      TODO: If an operation can be done on the RDD, do not do it on the DStream. When a function called
            on the DStream also exists on RDD, prefer the RDD operation.
        def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
     */
    // Here rdd is the RDD of each batch of data in the DStream
    val resultDStream: DStream[(String, Int)] = inputDStream.transform { rdd =>
      val resultRDD: RDD[(String, Int)] = rdd
        .filter(line => null != line && line.trim.length > 0)
        .flatMap(line => line.trim.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey((tmp, item) => tmp + item)
      // Return the result RDD
      resultRDD
    }

    // TODO: 4. Define the data sink and output the result data of each batch
    //resultDStream.print()
    /*
      def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit
        rdd:  the RDD with the processing results of each batch
        time: the time at which the batch was generated, a Long value
     */
    resultDStream.foreachRDD((rdd, time) => {
      // Print the generation time of each batch
      val batchTime: String = FastDateFormat.getInstance("yyyy/MM/dd HH:mm:ss").format(time.milliseconds)
      println("-------------------------------------------")
      println(s"Batch Time: ${batchTime}")
      println("-------------------------------------------")
      // TODO: only output when the result RDD has data; otherwise do nothing
      if (!rdd.isEmpty()) {
        // When outputting the result RDD: reduce the number of partitions first, then operate on each partition
        val resultRDD: RDD[(String, Int)] = rdd.coalesce(1)
        resultRDD.cache()
        // Print the result RDD to the console
        resultRDD.foreachPartition(iter => iter.foreach(println))
        // Save the result RDD to files
        resultRDD.saveAsTextFile(s"datas/streaming-wc-${time.milliseconds}")
        resultRDD.unpersist()
      }
    })

    // TODO: 5. Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
16 - [understand] - three statuses of SparkStreaming streaming applications
When using SparkStreaming for real-time business processing, different functions need to be used depending on the business requirements. For concrete business, the SparkStreaming computing framework mainly distinguishes three categories, each handled with different functions (a short sketch of what each looks like follows the list):
- Business type 1: stateless (Stateless)
- Business type 2: stateful (State)
- Business type 3: window statistics (Window)
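A minimal sketch (not the course code) of what the three kinds of processing look like on a DStream of (word, 1) pairs; wordDStream, the checkpoint directory and the 5-second batch interval are assumptions for illustration.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// wordDStream is assumed to be a DStream[(String, Int)] of (word, 1) pairs,
// built on a StreamingContext with a 5-second batch interval.
def statusDemo(ssc: StreamingContext, wordDStream: DStream[(String, Int)]): Unit = {
  // 1. Stateless: each batch is aggregated independently of all other batches
  val perBatch: DStream[(String, Int)] = wordDStream.reduceByKey(_ + _)

  // 2. Stateful: accumulate counts across batches (requires a checkpoint directory)
  ssc.checkpoint("datas/streaming-ckpt")
  val updateFunc = (values: Seq[Int], state: Option[Int]) => Option(values.sum + state.getOrElse(0))
  val total: DStream[(String, Int)] = wordDStream.updateStateByKey(updateFunc)

  // 3. Window statistics: aggregate the last 10 seconds of data, sliding every 5 seconds
  //    (window and slide durations must be multiples of the batch interval)
  val windowed: DStream[(String, Int)] =
    wordDStream.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(5))

  perBatch.print()
  total.print()
  windowed.print()
}
```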
Appendix I. creating Maven module
1) Maven project structure
2) POM file content
Contents of the Maven project's POM file (dependencies):
<!-- Specify the repository locations, in order: aliyun, cloudera and jboss repositories -->
<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.11.12</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.4.5</spark.version>
    <hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
    <hbase.version>1.2.0-cdh5.16.2</hbase.version>
    <kafka.version>2.0.0</kafka.version>
    <mysql.version>8.0.19</mysql.version>
</properties>

<dependencies>
    <!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark SQL dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming integration with Kafka 0.8.2.1 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-8_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming and Kafka 0.10.0 integration dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Hadoop Client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <!-- HBase Client dependency -->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-hadoop2-compat</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <!-- Kafka Client dependency -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- Convert IP addresses to provinces and cities -->
    <dependency>
        <groupId>org.lionsoul</groupId>
        <artifactId>ip2region</artifactId>
        <version>1.7.2</version>
    </dependency>
    <!-- MySQL Client dependency -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.version}</version>
    </dependency>
    <dependency>
        <groupId>c3p0</groupId>
        <artifactId>c3p0</artifactId>
        <version>0.9.1.2</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <!-- Maven compiler plugins -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>