Spark Day11: Spark Streaming
01 - [understand] - yesterday's course content review
Main explanation: Spark Streaming module quick start
1. Streaming: overview of stream computing
   - Streaming application scenarios: real-time reports (RealTime Report), real-time incremental ETL, real-time alerting and monitoring, real-time search recommendation, etc.
   - Big data architecture: the Lambda architecture combines offline analysis and real-time computation and is divided into three layers:
     - Batch layer (BatchLayer)
     - Speed layer (SpeedLayer)
     - Serving layer (ServingLayer)
   - Streaming data processing modes:
     - Mode 1: native stream processing, one record processed at a time as it arrives
     - Mode 2: micro-batch processing (Micro-Batch), the stream is divided into small batches and each small batch is processed quickly
   - SparkStreaming computational idea: streaming data is divided into many batches (Batch) by time interval (BatchInterval); each batch of data is treated as an RDD and processed, so DStream = Seq[RDD/Batch]
2. Quick start: word frequency statistics (WordCount)
   - Requirement: use SparkStreaming to analyze streaming data, read data from a TCP Socket, count word frequencies for each batch and print them to the console. [Note that the word counts here are not global but local to each batch.]
   - Official example: run-example
   - The entry point of a SparkStreaming application is the StreamingContext, the streaming context instance object. Development steps: define the data source DStream, process and output the data (call DStream functions), start the streaming application with start, wait for termination with awaitTermination, and finally close the resources with stop
   - Programming is similar to word frequency statistics on RDDs: call functions such as flatMap, map, reduceByKey, etc.
   - How a streaming application runs:
     - When the program runs, the StreamingContext object is created first, with a SparkContext underneath
     - ssc.start starts the receivers (Receivers); each receiver runs as a Task inside an Executor
     - Each receiver receives data from the data source and divides it into Blocks by time interval (BlockInterval, 200 ms by default), stores the Blocks in Executor memory (and in other Executors if multiple replicas are configured), and finally sends a BlockReport to the StreamingContext
     - When a BatchInterval is reached, a Batch is generated; the Blocks received in that interval are assigned to the batch, and the underlying engine processes the batch's data as an RDD
3. Data structure: DStream = Seq[RDD]
   - A DStream encapsulates the data stream; data arrives continuously and is divided into many batches (Batch) by time interval, so DStream = Seq[RDD]
   - Functions: 2 types
     - Transformation functions (Transformation), similar to the transformation functions on RDDs
     - Output functions (Output)
   - Two important functions operate on the RDD of each batch:
     - Transformation function: transform(rdd => rdd)
     - Output function: foreachRDD(rdd => Unit)
   - The word frequency statistics code was modified to use these two functions (a minimal sketch of the quick start is shown below)
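To make the review concrete, here is a minimal sketch of the quick-start word count against a TCP socket source; the host and port (node1.itcast.cn:9999) are hypothetical, and the sketch only illustrates the StreamingContext / DStream / start-await-stop steps listed above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch of yesterday's quick start: per-batch (local) word count from a TCP socket
object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("StreamingWordCountSketch").setMaster("local[3]")
    // BatchInterval = 5 seconds: every 5 seconds the received data forms one Batch (one RDD)
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Data source DStream: read lines from a TCP socket (hypothetical host/port)
    val inputDStream = ssc.socketTextStream("node1.itcast.cn", 9999)

    // Per-batch word frequency statistics, not a global count
    val resultDStream = inputDStream
      .flatMap(line => line.trim.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Output: print the result of each batch to the console
    resultDStream.print()

    // Start the streaming application, wait for termination, and finally close the resources
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```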
02 - [understand] - outline of today's course content
It mainly covers three topics: Kafka integration, an application case (stateful and window operations), and offset management.
1. Kafka integration
   - In real projects, SparkStreaming basically consumes data from Kafka for real-time processing
   - There are 2 sets of integration APIs, because the Kafka Consumer API itself comes in 2 sets
   - Writing code that consumes data from Kafka must be mastered
   - Obtaining the offset information of the data in each batch
2. Application case: Baidu search ranking
   - Perform the relevant initialization: a tool class that creates the StreamingContext object and consumes Kafka data
   - A mock data generator produces user search log data in real time and sends it to Kafka
   - Real-time ETL (stateless)
   - Cumulative statistics (stateful)
   - Window statistics
3. Offset management
   - A big shortcoming of SparkStreaming is that the user must manage the offsets of the data consumed from Kafka; just know this as a knowledge point
03 - [understand] - streaming application technology stack
In actual projects, whether Storm, Spark Streaming, or Flink is used, the main job is to consume data from Kafka in real time and then process and analyze it. The technical architecture of real-time streaming data processing is roughly as follows:
- Data sources (Source):
  - Distributed message queue Kafka
  - Flume integrated with Kafka, calling the Producer API to write data
  - Canal synchronizing MySQL table data to Kafka in real time, with the data formatted as JSON strings
  - .....
- Application runtime: at present, streaming applications in enterprises basically run on Hadoop YARN clusters
- Data sinks: results are written to NoSQL databases such as Redis, HBase, or Kafka

Flume/SDK/Kafka Producer API -> Kafka -> SparkStreaming/Flink/Storm (on Hadoop YARN) -> Redis -> UI
04 - [understand] - Kafka review and the two sets of integration APIs
Apache Kafka: its most basic function is a [message queue] that buffers data and provides publish-subscribe capability (similar to a WeChat official account).
The framework diagram of Kafka is as follows:
1. Service: Broker, one Kafka service started per machine; a cluster has at least 3 machines
2. Dependency on Zookeeper: configuration information is stored in ZK
3. Producer: writes data into Kafka
4. Consumer: subscribes to data from Kafka
5. How data is stored and managed: by Topic; each Topic manages one type of data and is divided into multiple partitions, with a replication mechanism
   - leader replica: handles reads and writes, 1 per partition
   - follower replicas: synchronize data to ensure reliability, 1 or more
Spark Streaming integrates with Kafka through two sets of APIs, because the Kafka Consumer API itself has two sets: the New Consumer API appeared in Kafka 0.9 to make consuming data from Kafka Topics easier, and became stable in version 0.10.
At present, enterprises basically use the Kafka New Consumer API to consume data from Kafka. Core classes: KafkaConsumer and ConsumerRecord.
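For reference, a minimal sketch of the New Consumer API itself (a plain KafkaConsumer, independent of Spark; the broker address, group id and topic name follow the examples below and are assumptions for any other cluster):

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

import scala.collection.JavaConverters._

// Minimal sketch: consume records from a Kafka Topic with the New Consumer API
object KafkaConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "node1.itcast.cn:9092")
    props.put("group.id", "sketch-group")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("wc-topic"))

    // Poll the broker in a loop; each ConsumerRecord carries topic, partition, offset, key and value
    while (true) {
      val records = consumer.poll(java.time.Duration.ofMillis(1000))
      for (record <- records.asScala) {
        println(s"partition=${record.partition()}, offset=${record.offset()}, value=${record.value()}")
      }
    }
  }
}
```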
05 - [Master] - New Consumer API integrated programming
Use the new Consumer API provided by Kafka 0.10+ to integrate with Streaming, consuming Topic data in real time and processing it.
- Add related Maven dependencies:
```xml
<!-- Spark Streaming integration with Kafka 0.10.0 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
```
At present, enterprises basically use New Consumer API integration. The advantages are as follows:
- First, it is similar to the Direct mode in the Old Consumer API
- Second, simple parallelism: Kafka Topic partitions map 1:1 to RDD partitions
Usage of the createDirectStream function in the tool class KafkaUtils (function declaration):
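The declaration in the spark-streaming-kafka-0-10 module is roughly as follows (it is also quoted in the code comments of the program below):

```scala
def createDirectStream[K, V](
    ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V]
  ): InputDStream[ConsumerRecord[K, V]]
```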
Official documents: http://spark.apache.org/docs/2.4.5/streaming-kafka-0-10-integration.html
First, start the Kafka service and create the Topic: wc-topic.
```bash
[root@node1 ~]# zookeeper-daemon.sh start
[root@node1 ~]# kafka-daemon.sh start
[root@node1 ~]# jps
2945 Kafka

# Create the Topic with Kafka Tools, with 1 replica and 3 partitions,
# then start a console producer to send test data to it
kafka-console-producer.sh --topic wc-topic --broker-list node1.itcast.cn:9092
```
The implementation needs to create a location strategy (LocationStrategy) object and a consumer strategy (ConsumerStrategy) object:
```scala
package cn.itcast.spark.kafka

import org.apache.commons.lang3.time.FastDateFormat
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, ConsumerStrategy, KafkaUtils, LocationStrategies, LocationStrategy}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Streaming obtains data from Kafka through the New Consumer API
 */
object _01StreamingSourceKafka {

  def main(args: Array[String]): Unit = {

    // 1. Build the StreamingContext instance object and pass the time interval BatchInterval
    val ssc: StreamingContext = {
      // a. Create a SparkConf object and set basic application information
      val sparkConf = new SparkConf()
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[3]")
        // Set the file output committer algorithm version to 2
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // b. Create an instance object and set BatchInterval
      new StreamingContext(sparkConf, Seconds(5))
    }

    // 2. Define the data source, obtain the streaming data and encapsulate it into a DStream
    // TODO: consume data from Kafka with the New Consumer API
    /*
      def createDirectStream[K, V](
          ssc: StreamingContext,
          locationStrategy: LocationStrategy,
          consumerStrategy: ConsumerStrategy[K, V]
        ): InputDStream[ConsumerRecord[K, V]]
     */
    // a. Location strategy object
    val locationStrategy: LocationStrategy = LocationStrategies.PreferConsistent
    // b. Consumer strategy
    val kafkaParams: Map[String, Object] = Map[String, Object](
      "bootstrap.servers" -> "node1.itcast.cn:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "gui-1001",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val consumerStrategy: ConsumerStrategy[String, String] = ConsumerStrategies.Subscribe(
      Array("wc-topic"), //
      kafkaParams //
    )
    // c. Use the New Consumer API to obtain the data in the Kafka Topic
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      ssc, //
      locationStrategy, //
      consumerStrategy //
    )
    // Only keep the Value (the message) of each record in the Kafka Topic
    val inputDStream: DStream[String] = kafkaDStream.map(record => record.value())

    // 3. Call the conversion functions of DStream (similar to the conversion functions of RDD) according to business requirements
    /*
      def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
     */
    // Here the rdd is the RDD of each batch in the DStream
    val resultDStream: DStream[(String, Int)] = inputDStream.transform{ rdd =>
      rdd
        .filter(line => null != line && line.trim.length > 0)
        .flatMap(line => line.trim.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey((tmp, item) => tmp + item)
    }

    // 4. Define the data sink and output the result data of each batch
    /*
      def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit
     */
    resultDStream.foreachRDD((rdd, time) => {
      //val xx: Time = time
      val format: FastDateFormat = FastDateFormat.getInstance("yyyy/MM/dd HH:mm:ss")
      println("-------------------------------------------")
      println(s"Time: ${format.format(time.milliseconds)}")
      println("-------------------------------------------")
      // Judge whether the result RDD of this batch contains data; only output when it does
      if(!rdd.isEmpty()){
        rdd.coalesce(1).foreachPartition(iter => iter.foreach(println))
      }
    })

    // 5. Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
06 - [understand] - obtain consumption offset information when integrating Kafka
When SparkStreaming integrates Kafka, the data of each batch is encapsulated in a KafkaRDD that contains the metadata of each record, regardless of whether the data was obtained with the Direct approach of the Old Consumer API or with the New Consumer API.
While the streaming application is running, the offset range of each batch of consumed data can be seen in the WEB UI monitoring interface. Can these offsets also be obtained inside the program?
Official documents: http://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html#obtaining-offsets
The offsets can be obtained in code. Modify the previous program so that, for each batch, the offset range of each partition of the consumed Kafka data is obtained:
```scala
package cn.itcast.spark.kafka

import org.apache.commons.lang3.time.FastDateFormat
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Streaming obtains data from Kafka through the New Consumer API and obtains the offset range of each processed batch
 */
object _02StreamingKafkaOffset {

  def main(args: Array[String]): Unit = {

    // 1. Build the StreamingContext instance object and pass the time interval BatchInterval
    val ssc: StreamingContext = {
      // a. Create a SparkConf object and set basic application information
      val sparkConf = new SparkConf()
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[3]")
        // Set the file output committer algorithm version to 2
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // b. Create an instance object and set BatchInterval
      new StreamingContext(sparkConf, Seconds(5))
    }

    // 2. Define the data source, obtain the streaming data and encapsulate it into a DStream
    // TODO: consume data from Kafka with the New Consumer API
    /*
      def createDirectStream[K, V](
          ssc: StreamingContext,
          locationStrategy: LocationStrategy,
          consumerStrategy: ConsumerStrategy[K, V]
        ): InputDStream[ConsumerRecord[K, V]]
     */
    // step1. Location strategy when consuming Topic data from Kafka
    val locationStrategy: LocationStrategy = LocationStrategies.PreferConsistent

    // step2. Consumer strategy, which encapsulates the consumer configuration used when consuming Topic data from Kafka
    /*
      def Subscribe[K, V](
          topics: Iterable[jl.String],
          kafkaParams: collection.Map[String, Object]
        ): ConsumerStrategy[K, V]
     */
    val kafkaParams: collection.Map[String, Object] = Map(
      "bootstrap.servers" -> "node1.itcast.cn:9092", //
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "groop_id_1001",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val consumerStrategy: ConsumerStrategy[String, String] = ConsumerStrategies.Subscribe(
      Array("wc-topic"),
      kafkaParams
    )

    // step3. Use the Kafka New Consumer API to consume data
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      ssc,
      locationStrategy,
      consumerStrategy
    )

    // TODO: first, define an array to store the offsets;
    //  the offset information of each Kafka partition is encapsulated in an OffsetRange object
    var offsetRanges: Array[OffsetRange] = Array.empty[OffsetRange]

    // 3. Call the conversion functions of DStream (similar to the conversion functions of RDD) according to business requirements
    /*
      def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
     */
    // Here the rdd is the RDD of each batch in the DStream
    val resultDStream: DStream[(String, Int)] = kafkaDStream.transform{ rdd =>
      // TODO: the conversion is performed directly on the KafkaDStream, so the rdd here is a KafkaRDD and contains offset information
      // TODO: second, cast the KafkaRDD to HasOffsetRanges to obtain the offset ranges
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
        .map(record => record.value())
        .filter(line => null != line && line.trim.length > 0)
        .flatMap(line => line.trim.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey((tmp, item) => tmp + item)
    }

    // 4. Define the data sink and output the result data of each batch
    /*
      def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit
     */
    resultDStream.foreachRDD((rdd, time) => {
      //val xx: Time = time
      val format: FastDateFormat = FastDateFormat.getInstance("yyyy/MM/dd HH:mm:ss")
      println("-------------------------------------------")
      println(s"Time: ${format.format(time.milliseconds)}")
      println("-------------------------------------------")
      // Judge whether the result RDD of this batch contains data; only output when it does
      if(!rdd.isEmpty()){
        rdd.coalesce(1).foreachPartition(iter => iter.foreach(println))
      }

      // TODO: third, after the current batch has been processed, print the offset information of the current batch
      offsetRanges.foreach{ offsetRange =>
        println(s"topic: ${offsetRange.topic}    partition: ${offsetRange.partition}    " +
          s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}")
      }
    })

    // 5. Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
07 - [understand] - business scenario and requirement description of application case
Imitating the Baidu search hot list, analyze the logs generated when users use Baidu search: [real-time analysis of Baidu search logs]. The main business requirements are as follows:
- Business 1: store search log data on HDFS — perform real-time ETL (extract, transform) on the log data and store it in the HDFS file system;
- Business 2: Baidu hot search ranking Top 10 — accumulate the search-term counts of all users and obtain the Top 10 search terms and their counts;
- Business 3: recent hot searches Top 10 — count the users' search terms over the latest period of time (for example, the last half hour or the last 2 hours) and obtain the Top 10 search terms and their counts;
The directory structure in Maven Project development is as follows:
08 - [Master] - initialization environment and tool class of application case
Before implementing the business logic, first write a program that simulates the log data generated when users use Baidu search, and create a tool class StreamingContextUtils that provides the StreamingContext object and a method for receiving data from Kafka.
- Start the Kafka Broker service and create the Topic [search-log-topic]. The command is as follows:
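A sketch of the commands, assuming the same single-broker setup as before (the ZooKeeper address node1.itcast.cn:2181 is an assumption; adjust the partition and replica counts for your cluster):

```bash
# Create the Topic used by the application case (assumed: 3 partitions, 1 replica)
kafka-topics.sh --create \
  --zookeeper node1.itcast.cn:2181 \
  --topic search-log-topic \
  --partitions 3 \
  --replication-factor 1

# Optionally start a console producer to verify that the Topic accepts data
kafka-console-producer.sh --topic search-log-topic --broker-list node1.itcast.cn:9092
```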
- Simulate log data
Simulate user search log data; the fields are encapsulated in the case class [SearchLog], with the following code:
```scala
package cn.itcast.spark.app.mock

/**
 * Case class encapsulating a user's Baidu search log record
 *
 * @param sessionId Session ID
 * @param ip        IP address
 * @param datetime  Search date and time
 * @param keyword   Search keyword
 */
case class SearchLog(
    sessionId: String, //
    ip: String, //
    datetime: String, //
    keyword: String //
) {
  override def toString: String = s"$sessionId,$ip,$datetime,$keyword"
}
```
The object [MockSearchLogs] simulates the generation of search log data; the specific code is as follows:
```scala
package cn.itcast.spark.app.mock

import java.util.{Properties, UUID}

import org.apache.commons.lang3.time.FastDateFormat
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

import scala.util.Random

/**
 * Simulates the search query log data produced when users use the Baidu search engine; each record contains the fields:
 * uid, ip, search_datetime, search_keyword
 */
object MockSearchLogs {

  def main(args: Array[String]): Unit = {

    // Search keywords, taken from the Baidu hot search list
    val keywords: Array[String] = Array(
      "Wu Zunyou reminded may day not to attend large gatherings",
      "The shopping guide was punished for claiming that the child was not dead",
      "The twin sisters in the brush video are identical twins",
      "Yunnan citizens pleaded guilty to bribery for more than 4 years.6 Hundred million",
      "The Indian man knelt down and begged the police not to take away the oxygen cylinder",
      "SARFT:Support the investigation and handling of Yin-Yang contracts and other issues",
      "75 200 affiliated companies cancelled by first-line artists",
      "The space station's space and core modules were successfully launched",
      "Chinese navy ships warn to drive away US ships",
      "Delhi, India changes dog crematorium to human crematorium",
      "The Ministry of public security sent a working group to Guangxi",
      "A beautiful man was kneeling and pressed by the police for 5 minutes and died",
      "Legendary Wall Street fund manager jumped to death",
      "Apollo 11 astronaut Collins died",
      "Carina Lau apologized to Dou Xiao and he ChaoLian"
    )

    // Kafka producer configuration for sending data to the Topic
    val props = new Properties()
    props.put("bootstrap.servers", "node1.itcast.cn:9092")
    props.put("acks", "1")
    props.put("retries", "3")
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    val producer = new KafkaProducer[String, String](props)

    val random: Random = new Random()
    while (true){
      // Randomly generate a search query log record
      val searchLog: SearchLog = SearchLog(
        getUserId(), //
        getRandomIp(), //
        getCurrentDateTime(), //
        keywords(random.nextInt(keywords.length)) //
      )
      println(searchLog.toString)
      Thread.sleep(100 + random.nextInt(100))

      val record = new ProducerRecord[String, String]("search-log-topic", searchLog.toString)
      producer.send(record)
    }
    // Close the connection (unreachable while the loop above runs forever)
    producer.close()
  }

  /**
   * Randomly generates a user SessionId
   */
  def getUserId(): String = {
    val uuid: String = UUID.randomUUID().toString
    uuid.replaceAll("-", "").substring(16)
  }

  /**
   * Gets the current date and time in the format yyyyMMddHHmmssSSS
   */
  def getCurrentDateTime(): String = {
    val format = FastDateFormat.getInstance("yyyyMMddHHmmssSSS")
    val nowDateTime: Long = System.currentTimeMillis()
    format.format(nowDateTime)
  }

  /**
   * Gets a random IP address
   */
  def getRandomIp(): String = {
    // IP ranges
    val range: Array[(Int, Int)] = Array(
      (607649792, 608174079),     // 36.56.0.0   - 36.63.255.255
      (1038614528, 1039007743),   // 61.232.0.0  - 61.237.255.255
      (1783627776, 1784676351),   // 106.80.0.0  - 106.95.255.255
      (2035023872, 2035154943),   // 121.76.0.0  - 121.77.255.255
      (2078801920, 2079064063),   // 123.232.0.0 - 123.235.255.255
      (-1950089216, -1948778497), // 139.196.0.0 - 139.215.255.255
      (-1425539072, -1425014785), // 171.8.0.0   - 171.15.255.255
      (-1236271104, -1235419137), // 182.80.0.0  - 182.92.255.255
      (-770113536, -768606209),   // 210.25.0.0  - 210.47.255.255
      (-569376768, -564133889)    // 222.16.0.0  - 222.95.255.255
    )
    // Random index into the IP ranges
    val random = new Random()
    val index = random.nextInt(10)
    val ipNumber: Int = range(index)._1 + random.nextInt(range(index)._2 - range(index)._1)
    //println(s"ipNumber = ${ipNumber}")

    // Convert the Int type IP address to IPv4 format
    number2IpString(ipNumber)
  }

  /**
   * Converts an Int type IPv4 address to a string
   */
  def number2IpString(ip: Int): String = {
    val buffer: Array[Int] = new Array[Int](4)
    buffer(0) = (ip >> 24) & 0xff
    buffer(1) = (ip >> 16) & 0xff
    buffer(2) = (ip >> 8) & 0xff
    buffer(3) = ip & 0xff
    // Return the IPv4 address
    buffer.mkString(".")
  }
}
```
- Every SparkStreaming application needs to build a StreamingContext instance object and consume Kafka data with the New Consumer API, so write a tool class [StreamingContextUtils] that provides two methods:
```scala
package cn.itcast.spark.app

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Tool class providing: construction of the streaming application context (StreamingContext instance) and consumption of data from a Kafka Topic
 */
object StreamingContextUtils {

  /**
   * Gets a StreamingContext instance, passing the batch processing interval
   *
   * @param batchInterval batch interval in seconds
   */
  def getStreamingContext(clazz: Class[_], batchInterval: Int): StreamingContext = {
    // i. Create a SparkConf object and set the application configuration information
    val sparkConf = new SparkConf()
      .setAppName(clazz.getSimpleName.stripSuffix("$"))
      .setMaster("local[3]")
      // Set Kryo serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[ConsumerRecord[String, String]]))
      // Set the file output committer algorithm version to 2
      .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    // ii. Create a streaming context object, passing the SparkConf object and the time interval
    val context = new StreamingContext(sparkConf, Seconds(batchInterval))
    // iii. Return
    context
  }

  /**
   * Consumes data from the specified Kafka Topic, starting from the latest (largest) offset by default
   *
   * @param ssc       StreamingContext instance object
   * @param topicName name of the Kafka Topic to consume
   */
  def consumerKafka(ssc: StreamingContext, topicName: String): DStream[ConsumerRecord[String, String]] = {
    // i. Location strategy
    val locationStrategy: LocationStrategy = LocationStrategies.PreferConsistent
    // ii. Which Topics to read
    val topics = Array(topicName)
    // iii. Configuration parameters for consuming Kafka data
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node1.itcast.cn:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "gui_0001",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    // iv. Consumer strategy
    val consumerStrategy: ConsumerStrategy[String, String] = ConsumerStrategies.Subscribe(
      topics,
      kafkaParams
    )
    // v. Obtain data with the New Consumer API, similar to the Direct approach
    val kafkaDStream: DStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      ssc,
      locationStrategy,
      consumerStrategy
    )
    // vi. Return the DStream
    kafkaDStream
  }
}
```
09 - [Master] - ETL storage of real-time data of application cases
Extract the IP address field from the data consumed from the Kafka Topic in real time, call the [ip2region] library to resolve it into province and city, and store the result in HDFS files; the batch interval is set to 10 seconds.
```scala
package cn.itcast.spark.app.etl

import cn.itcast.spark.app.StreamingContextUtils
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.lionsoul.ip2region.{DbConfig, DbSearcher}

/**
 * Consumes Kafka Topic data in real time and saves it to the HDFS file system after ETL (filtering and conversion); the BatchInterval is 10s
 */
object _03StreamingETLHdfs {

  def main(args: Array[String]): Unit = {

    // 1. Create a StreamingContext instance object
    val ssc: StreamingContext = StreamingContextUtils.getStreamingContext(this.getClass, 10)

    // 2. Consume data from Kafka with the New Consumer API
    val kafkaDStream: DStream[ConsumerRecord[String, String]] = StreamingContextUtils.consumerKafka(ssc, "search-log-topic")

    // TODO: 3. Perform ETL conversion on the acquired data: convert the IP address to province and city
    val etlDStream: DStream[String] = kafkaDStream.transform { rdd =>
      val etlRDD: RDD[String] = rdd
        // Filter data
        .filter(record => null != record.value() && record.value().trim.split(",").length == 4)
        // Operate per partition: obtain the IP address in each record and convert it into province and city
        .mapPartitions { iter =>
          // a. Create a DbSearcher object (one per partition)
          val dbSearcher = new DbSearcher(new DbConfig(), "dataset/ip2region.db")
          // b. Parse the IP value of each record in the partition
          iter.map { record =>
            // Get the message value
            val message: String = record.value()
            // Get the IP address value
            val ipValue: String = message.split(",")(1)
            // Resolve the IP address to a region
            val region: String = dbSearcher.btreeSearch(ipValue).getRegion
            val Array(_, _, province, city, _) = region.split("\\|")
            // Concatenate the result string
            s"${message},${province},${city}"
          }
        }
      // Return the converted RDD
      etlRDD
    }

    // 4. Save the data to the HDFS file system
    etlDStream.foreachRDD((rdd, batchTime) => {
      if(!rdd.isEmpty()){
        rdd.coalesce(1).saveAsTextFile(s"datas/streaming/search-logs-${batchTime}")
      }
    })

    // Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
Run the mock log data program and the ETL application, then check the files saved to HDFS after real-time ETL. The screenshot is as follows:
10 - [Master] - updateStateByKey function of application case
The number of occurrences of each search term is accumulated in real time. SparkStreaming provides the function [updateStateByKey] for cumulative statistics; since Spark 1.6 it also provides the [mapWithState] function for state statistics, which performs better and is recommended in practice.
The state is updated per Key by combining each batch of data with the previous state, using a user-defined function [updateFunc]. The schematic diagram is as follows:
For WordCount, the status update logic diagram is as follows:
The key points of the updateStateByKey state-update function:
- First, the state is updated per Key; the Key is the key field, which for this application is the search term
- Second, the update principle:
  - step1: compute the state of the Key in the current batch
  - step2: obtain the previous state of the Key
  - step3: merge the current-batch state with the previous state
For this application the Key is the search term, and the corresponding state (State) has data type Int or Long (see the sketch below).
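A minimal sketch of the update function for this application (assuming the state type is Int; the full program follows):

```scala
// Sketch of updateFunc: how step1/step2/step3 map onto the function passed to updateStateByKey
val updateFunc: (Seq[Int], Option[Int]) => Option[Int] = (values, state) => {
  val currentBatchState = values.sum          // step1: state of the Key in the current batch
  val previousState     = state.getOrElse(0)  // step2: previous state of the Key (0 if it never appeared)
  Some(previousState + currentBatchState)     // step3: merge the two states and return the latest one
}
```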
Implement the real-time cumulative statistics using the updateStateByKey function:
```scala
package cn.itcast.spark.app.state

import cn.itcast.spark.app.StreamingContextUtils
import org.apache.commons.lang3.time.FastDateFormat
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

/**
 * Consumes Kafka Topic data in real time, accumulates the number of searches of each search term, and implements the Baidu search hot list
 */
object _04StreamingUpdateState {

  def main(args: Array[String]): Unit = {

    // 1. Create a StreamingContext instance object
    val ssc: StreamingContext = StreamingContextUtils.getStreamingContext(this.getClass, 10)
    // TODO: set the checkpoint directory
    ssc.checkpoint("datas/streaming/ckpt-1001")

    // 2. Consume data from Kafka with the New Consumer API
    val kafkaDStream: DStream[ConsumerRecord[String, String]] = StreamingContextUtils.consumerKafka(ssc, "search-log-topic")

    // 3. TODO: step1. Aggregate statistics for the current batch data
    val batchReduceDStream: DStream[(String, Int)] = kafkaDStream.transform{ rdd =>
      rdd
        // Get the message information
        .map(record => record.value())
        .filter(msg => null != msg && msg.trim.split(",").length == 4)
        // Extract the search term, each occurrence counted once
        .map(msg => msg.trim.split(",")(3) -> 1)
        // TODO: optimization, aggregate the data inside the current batch first
        .reduceByKey(_ + _)
    }

    // 3. TODO: step2. Aggregate the result of the current batch with the previous state data (state update)
    /*
      def updateStateByKey[S: ClassTag](
          updateFunc: (Seq[V], Option[S]) => Option[S]
        ): DStream[(K, S)]

      - Seq[V]: the collection of values for the Key in the current batch;
        since the current batch has already been aggregated by Key, there is only one value here
        V type: Int
      - Option[S]: the previous state of the Key; if the Key has not appeared before, the state is None
        S type: Int
     */
    val stateDStream: DStream[(String, Int)] = batchReduceDStream.updateStateByKey(
      (values: Seq[Int], state: Option[Int]) => {
        // a. Get the previous state of the Key
        val previousState: Int = state.getOrElse(0)
        // b. Get the state of the Key in the current batch
        val currentState: Int = values.sum
        // c. Merge the two states
        val latestState: Int = previousState + currentState
        // Return the latest state
        Some(latestState)
      }
    )

    // 4. Output the result data of each batch
    stateDStream.foreachRDD((rdd, time) => {
      val format: FastDateFormat = FastDateFormat.getInstance("yyyy/MM/dd HH:mm:ss")
      println("-------------------------------------------")
      println(s"Time: ${format.format(time.milliseconds)}")
      println("-------------------------------------------")
      // Judge whether the result RDD of this batch contains data; only output when it does
      if(!rdd.isEmpty()){
        rdd.coalesce(1).foreachPartition(iter => iter.foreach(println))
      }
    })

    // Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
11 - [Master] - mapWithState function of application case
Spark 1.6 introduced a new state-update function [mapWithState]. Like updateStateByKey, mapWithState maintains state for all keys globally, but if a key receives no input in a batch, its previous state is not returned: only keys whose state has changed are emitted. In this way, even with a large amount of data, the checkpoint does not occupy as much storage as with updateStateByKey, making it more efficient.
A StateSpec object needs to be built to encapsulate the state-update function; it is then passed as the parameter of [mapWithState]. The relevant declarations are as follows:
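The declarations are roughly as follows (they are also quoted in the code comments of the program below):

```scala
// mapWithState on a paired DStream takes a StateSpec that describes the state update
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType]

// StateSpec.function wraps the mapping function:
// (key, value of the current batch, mutable state) => mapped output record
def function[KeyType, ValueType, StateType, MappedType](
    mappingFunction: (KeyType, Option[ValueType], State[StateType]) => MappedType
  ): StateSpec[KeyType, ValueType, StateType, MappedType]
```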
Modify the previous case code to use the mapWithState function for state updates:
```scala
package cn.itcast.spark.app.state

import cn.itcast.spark.app.StreamingContextUtils
import org.apache.commons.lang3.time.FastDateFormat
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.{State, StateSpec, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

/**
 * Consumes Kafka Topic data in real time, accumulates the number of searches of each search term, and implements the Baidu search hot list
 */
object _05StreamingMapWithState {

  def main(args: Array[String]): Unit = {

    // 1. Create a StreamingContext instance object
    val ssc: StreamingContext = StreamingContextUtils.getStreamingContext(this.getClass, 5)
    // TODO: set the checkpoint directory
    ssc.checkpoint("datas/streaming-ckpt-999999")

    // 2. Consume data from Kafka with the New Consumer API
    val kafkaDStream: DStream[ConsumerRecord[String, String]] = StreamingContextUtils.consumerKafka(ssc, "search-log-topic")

    // 3. TODO: step1. Aggregate statistics for the current batch data
    val batchReduceDStream: DStream[(String, Int)] = kafkaDStream.transform{ rdd =>
      rdd
        .filter(record => null != record && record.value().trim.split(",").length == 4)
        .map{ record =>
          // Get each Kafka Topic message
          val msg: String = record.value()
          // Get the search keyword
          val searchWord: String = msg.trim.split(",").last
          // Return a tuple
          searchWord -> 1
        }
        // Group by search term and count the occurrences of each search term within the batch
        .reduceByKey(_ + _) // Performance optimization
    }

    // 3. TODO: step2. Aggregate the result of the current batch with the previous state data (state update)
    /*
      def mapWithState[StateType: ClassTag, MappedType: ClassTag](
          spec: StateSpec[K, V, StateType, MappedType]
        ): MapWithStateDStream[K, V, StateType, MappedType]
     */
    // Build the StateSpec object
    /*
      def function[KeyType, ValueType, StateType, MappedType](
          mappingFunction: (KeyType, Option[ValueType], State[StateType]) => MappedType
        ): StateSpec[KeyType, ValueType, StateType, MappedType]
     */
    val spec: StateSpec[String, Int, Int, (String, Int)] = StateSpec.function(
      (key: String, option: Option[Int], state: State[Int]) => {
        // a. Get the value of the Key in the current batch
        val currentState: Int = option.getOrElse(0)
        // b. Get the previous state
        val previousState: Int = state.getOption().getOrElse(0)
        // c. Merge the states
        val latestState: Int = currentState + previousState
        // d. Update the state
        state.update(latestState)
        // e. Return the Key and the latest state, encapsulated in a tuple
        key -> latestState
      }
    )
    // State-update statistics by Key
    val stateDStream: DStream[(String, Int)] = batchReduceDStream.mapWithState(spec)

    // 4. Output the result data of each batch
    stateDStream.foreachRDD((rdd, time) => {
      val format: FastDateFormat = FastDateFormat.getInstance("yyyy/MM/dd HH:mm:ss")
      println("-------------------------------------------")
      println(s"Time: ${format.format(time.milliseconds)}")
      println("-------------------------------------------")
      // Judge whether the result RDD of this batch contains data; only output when it does
      if(!rdd.isEmpty()){
        rdd.coalesce(1).foreachPartition(iter => iter.foreach(println))
      }
    })

    // Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
12 - [Master] - real time window statistics of application cases
SparkStreaming provides a series of window functions to facilitate the analysis of windowed data. Documentation:
http://spark.apache.org/docs/2.4.5/streaming-programming-guide.html#window-operations
In actual projects, it is often required to count only the most recent data at regular intervals rather than all of the data; this is called trend statistics or window statistics. SparkStreaming provides the corresponding functions. The business logic is as follows:
The window function [window] is declared as follows and takes two parameters: the window size (WindowInterval, the range of data covered by each computation) and the slide size (SliderInterval, how often the computation is performed); both must be integer multiples of the batch interval.
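Its declaration (also quoted in the code comments of the program below) is roughly:

```scala
// Declaration on DStream[T]; both durations must be integer multiples of the batch interval
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
```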
```scala
package cn.itcast.spark.app.window

import cn.itcast.spark.app.StreamingContextUtils
import org.apache.commons.lang3.time.FastDateFormat
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

/**
 * Consumes Kafka Topic data in real time and counts the search terms in recent search logs at regular intervals
 *   Batch interval:       BatchInterval  = 2s
 *   Window size interval: WindowInterval = 4s
 *   Sliding interval:     SliderInterval = 2s
 */
object _06StreamingWindow {

  def main(args: Array[String]): Unit = {

    // 1. Create a StreamingContext instance object
    val ssc: StreamingContext = StreamingContextUtils.getStreamingContext(this.getClass, 2)
    // TODO: set the checkpoint directory
    ssc.checkpoint(s"datas/spark/ckpt-${System.nanoTime()}")

    // 2. Consume data from Kafka with the New Consumer API
    val kafkaDStream: DStream[ConsumerRecord[String, String]] = StreamingContextUtils.consumerKafka(ssc, "search-log-topic")

    // TODO: set the window: size 4 seconds, slide 2 seconds
    /*
      def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
     */
    val windowDStream: DStream[ConsumerRecord[String, String]] = kafkaDStream.window(
      Seconds(4), // Window size
      Seconds(2)  // Sliding size
    )

    // 3. Aggregate the data in the window
    val resultDStream: DStream[(String, Int)] = windowDStream.transform{ rdd =>
      // Here the rdd is the RDD of the data in the window
      rdd
        // Get the message information
        .map(record => record.value())
        .filter(msg => null != msg && msg.trim.split(",").length == 4)
        // Extract the search term, each occurrence counted once
        .map(msg => msg.trim.split(",").last -> 1)
        // TODO: aggregate the data inside the current window
        .reduceByKey(_ + _)
    }

    // 4. Output the result data of each batch
    resultDStream.foreachRDD((rdd, time) => {
      val format: FastDateFormat = FastDateFormat.getInstance("yyyy/MM/dd HH:mm:ss")
      println("-------------------------------------------")
      println(s"Time: ${format.format(time.milliseconds)}")
      println("-------------------------------------------")
      // Judge whether the result RDD of this batch contains data; only output when it does
      if(!rdd.isEmpty()){
        rdd.coalesce(1).foreachPartition(iter => iter.foreach(println))
      }
    })

    // Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
SparkStreaming also provides functions that combine the window settings with the reduceByKey aggregation, which makes programming easier.
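The declaration of reduceByKeyAndWindow used below (also quoted in the code comments) is roughly:

```scala
// Declaration on DStream[(K, V)]: values are reduced by Key over the data falling inside the window
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[(K, V)]
```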
Modify the above code and write the aggregation function and window together:
```scala
package cn.itcast.spark.app.window

import cn.itcast.spark.app.StreamingContextUtils
import org.apache.commons.lang3.time.FastDateFormat
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

/**
 * Consumes Kafka Topic data in real time and counts the search terms in recent search logs at regular intervals
 *   Batch interval:       BatchInterval  = 2s
 *   Window size interval: WindowInterval = 4s
 *   Sliding interval:     SliderInterval = 2s
 */
object _07StreamingReduceWindow {

  def main(args: Array[String]): Unit = {

    // 1. Create a StreamingContext instance object
    val ssc: StreamingContext = StreamingContextUtils.getStreamingContext(this.getClass, 2)
    // TODO: set the checkpoint directory
    ssc.checkpoint(s"datas/spark/ckpt-${System.nanoTime()}")

    // 2. Consume data from Kafka with the New Consumer API
    val kafkaDStream: DStream[ConsumerRecord[String, String]] = StreamingContextUtils.consumerKafka(ssc, "search-log-topic")

    // 3. TODO: convert the batch data
    val etlDStream: DStream[(String, Int)] = kafkaDStream.transform{ rdd =>
      rdd
        .filter(record => null != record && record.value().trim.split(",").length == 4)
        .map{ record =>
          // Get each Kafka Topic message
          val msg: String = record.value()
          // Get the search keyword
          val searchWord: String = msg.trim.split(",").last
          // Return a tuple
          searchWord -> 1
        }
    }

    // TODO: set the window (size 4 seconds, slide 2 seconds) and aggregate the data in the window
    /*
      def reduceByKeyAndWindow(
          reduceFunc: (V, V) => V,
          windowDuration: Duration,
          slideDuration: Duration
        ): DStream[(K, V)]
     */
    val resultDStream: DStream[(String, Int)] = etlDStream.reduceByKeyAndWindow(
      (v1: Int, v2: Int) => v1 + v2, // Aggregate the values after grouping the data in the window by Key
      Seconds(4), // Window size
      Seconds(2)  // Sliding size
    )

    // 4. Output the result data of each batch
    resultDStream.foreachRDD((rdd, time) => {
      val format: FastDateFormat = FastDateFormat.getInstance("yyyy/MM/dd HH:mm:ss")
      println("-------------------------------------------")
      println(s"Time: ${format.format(time.milliseconds)}")
      println("-------------------------------------------")
      // Judge whether the result RDD of this batch contains data; only output when it does
      if(!rdd.isEmpty()){
        rdd.coalesce(1).foreachPartition(iter => iter.foreach(println))
      }
    })

    // Start the streaming application and wait for termination
    ssc.start()
    ssc.awaitTermination()
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```
Appendix I. creating Maven module
1) Maven project structure
2) POM file content
The dependencies in the Maven project's POM file:
```xml
<!-- Specify the repository locations, in order: aliyun, cloudera, and jboss -->
<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.11.12</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.4.5</spark.version>
    <hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
    <hbase.version>1.2.0-cdh5.16.2</hbase.version>
    <kafka.version>2.0.0</kafka.version>
    <mysql.version>8.0.19</mysql.version>
</properties>

<dependencies>
    <!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark SQL dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming integration with Kafka 0.8.2.1 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-8_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming integration with Kafka 0.10.0 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Hadoop Client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <!-- HBase Client dependencies -->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-hadoop2-compat</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <!-- Kafka Client dependency -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- ip2region: converts an IP address to province and city -->
    <dependency>
        <groupId>org.lionsoul</groupId>
        <artifactId>ip2region</artifactId>
        <version>1.7.2</version>
    </dependency>
    <!-- MySQL Client dependency -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.version}</version>
    </dependency>
    <dependency>
        <groupId>c3p0</groupId>
        <artifactId>c3p0</artifactId>
        <version>0.9.1.2</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <!-- Maven compilation plugins -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```