Spark Streaming fundamentals - DStream creation - RDD queue, custom data source, Kafka data source

Posted by RussellReal on Fri, 19 Nov 2021 23:38:38 +0100

Chapter 3 DStream creation

3.1 RDD queue

3.1.1 usage and description

During testing, you can create a DStream with ssc.queueStream(queueOfRDDs). Each RDD pushed into the queue is treated as a batch of data in the DStream and processed like any other stream.
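For reference, a minimal sketch of the call itself (assuming a StreamingContext named ssc and an RDD queue named queue already exist; the complete case practice follows in 3.1.2):

// Assumed to exist: ssc: StreamingContext, queue: mutable.Queue[RDD[Int]]
// oneAtATime = false: every RDD currently in the queue is consumed in the same batch;
// oneAtATime = true (the default) consumes one queued RDD per batch interval.
val stream = ssc.queueStream(queue, oneAtATime = false)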

3.1.2 case practice

Requirement: create several RDDs in a loop and put them into a queue, create a DStream from the queue through Spark Streaming, and compute WordCount.

  • (1) Write code
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object RDDStream {
 def main(args: Array[String]): Unit = {
   //1. Initialize Spark configuration information
   val conf = new SparkConf().setMaster("local[*]").setAppName("RDDStream")
   //2. Initialize StreamingContext with a 4-second batch interval
   val ssc = new StreamingContext(conf, Seconds(4))
   //3. Create the RDD queue
   val rddQueue = new mutable.Queue[RDD[Int]]()
   //4. Create the QueueInputDStream (oneAtATime = false: consume all queued RDDs per batch)
   val inputStream = ssc.queueStream(rddQueue, oneAtATime = false)
   //5. Process the RDD data in the queue: key each number by its last digit
   val mappedStream = inputStream.map(x => (x % 10, 1))
   val reducedStream = mappedStream.reduceByKey(_ + _)
   //6. Print results
   reducedStream.print()
   //7. Start the task
   ssc.start()
   //8. Loop and push RDDs into the RDD queue
   for (i <- 1 to 5) {
     rddQueue += ssc.sparkContext.makeRDD(1 to 300, 10)
     Thread.sleep(2000)
   }
   ssc.awaitTermination()
 }
}
  • (2) Result display. With a 4-second batch interval and one RDD added every 2 seconds, the first batches contain two RDDs (60 occurrences per key), the last non-empty batch contains only one (30 per key), and the final batch is empty.
-------------------------------------------
Time: 1539075280000 ms
-------------------------------------------
(4,60)
(0,60)
(6,60)
(8,60)
(2,60)
(1,60)
(3,60)
(7,60)
(9,60)
(5,60)
-------------------------------------------
Time: 1539075284000 ms
-------------------------------------------
(4,60)
(0,60)
(6,60)
(8,60)
(2,60)
(1,60)
(3,60)
(7,60)
(9,60)
(5,60)
-------------------------------------------
Time: 1539075288000 ms
-------------------------------------------
(4,30)
(0,30)
(6,30)
(8,30)
(2,30)
(1,30)
(3,30)
(7,30)
(9,30)
(5,30)
-------------------------------------------
Time: 1539075292000 ms
-------------------------------------------

3.2 user defined data source

3.2.1 usage and description

  • To define a custom data source, extend Receiver and implement the onStart and onStop methods; see the minimal skeleton below and the full case practice in 3.2.2.
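A minimal skeleton (the class and thread names here are illustrative, not part of the original example) might look like the following. The key points of the Receiver contract are that onStart must return quickly, data is handed to Spark with store(), and onStop/isStopped() are used to shut the receiver down.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    // Must return quickly: start a worker thread that reads from the source and calls store(...)
    new Thread("My Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("placeholder record") // replace with data read from the real source
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing to do here: the worker thread checks isStopped() and exits on its own
  }
}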

3.2.2 case practice

Requirement: implement a custom data source that monitors a given port and collects the data sent to it (for testing, you can send lines to the port with a tool such as netcat: nc -lk 9999).

  • (1) Custom data source
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomerReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

 //Called when the receiver starts: start a thread that reads data and sends it to Spark
 override def onStart(): Unit = {
   new Thread("Socket Receiver") {
     override def run(): Unit = {
       receive()
     }
   }.start()
 }

 //Read data and send it to Spark
 def receive(): Unit = {
   //Create a socket connected to the monitored port
   val socket: Socket = new Socket(host, port)
   //Create a BufferedReader to read data from the port
   val reader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
   //Read the first line
   var input: String = reader.readLine()
   //While the receiver is not stopped and the input is not null, send the data to Spark
   while (!isStopped() && input != null) {
     store(input)
     input = reader.readLine()
   }
   //After leaving the loop, close the resources
   reader.close()
   socket.close()
   //Restart the receiver so that it reconnects
   restart("restart")
 }

 override def onStop(): Unit = {}
}
  • (2) Collect data using a custom data source
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStream {
 def main(args: Array[String]): Unit = {
   //1. Initialize Spark configuration information
   val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamWordCount")
   //2. Initialize StreamingContext
   val ssc = new StreamingContext(sparkConf, Seconds(5))
   //3. Create a DStream from the custom receiver
   val lineStream = ssc.receiverStream(new CustomerReceiver("hadoop102", 9999))
   //4. Split each line of data into words
   val wordStream = lineStream.flatMap(_.split("\t"))
   //5. Map each word to a tuple (word, 1)
   val wordAndOneStream = wordStream.map((_, 1))
   //6. Count the occurrences of each word
   val wordAndCountStream = wordAndOneStream.reduceByKey(_ + _)
   //7. Print the result
   wordAndCountStream.print()
   //8. Start the StreamingContext
   ssc.start()
   ssc.awaitTermination()
 }
}

3.3 Kafka data source (important for interviews and development)

3.3.1 version selection

  • ReceiverAPI: a dedicated Executor receives the data and sends it to other Executors for computation. Because the receiving Executor and the computing Executors work at different speeds, the computing nodes can run out of memory when data is received faster than it can be processed. This mode existed in earlier versions and is no longer applicable in current versions.
  • DirectAPI: the computing Executors consume data from Kafka directly, so the consumption rate is controlled by the computation itself.

3.3.2 Kafka 0-8 Receiver mode (no longer applicable in current versions)

Requirement: read data from Kafka through Spark Streaming, perform a simple computation on it, and print the result to the console.

  • (1) Import dependency
<dependency>
 <groupId>org.apache.spark</groupId>
 <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
 <version>2.4.5</version>
</dependency>
  • (2) Write code
package com.atguigu.kafka

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverAPI {
 def main(args: Array[String]): Unit = {
   //1. Create SparkConf
   val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")
   //2. Create StreamingContext
   val ssc = new StreamingContext(sparkConf, Seconds(3))
   //3. Read Kafka data and create a DStream (Receiver mode)
   val kafkaDStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(
     ssc,
     "linux1:2181,linux2:2181,linux3:2181",
     "atguigu",
     Map[String, Int]("atguigu" -> 1))
   //4. Calculate WordCount
   kafkaDStream.map { case (_, value) =>
     (value, 1)
   }.reduceByKey(_ + _)
     .print()
   //5. Start the task
   ssc.start()
   ssc.awaitTermination()
 }
}

3.3.3 Kafka 0-8 Direct mode (no longer applicable in current versions)

Requirement: read data from Kafka through Spark Streaming, perform a simple computation on it, and print the result to the console.

  • (1) Import dependency
<dependency>
 <groupId>org.apache.spark</groupId>
 <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
 <version>2.4.5</version>
</dependency>
  • (2) Write code (offsets maintained automatically via checkpoint)
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectAPIAuto02 {

 val getSSC1: () => StreamingContext = () => {
   val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")
   val ssc = new StreamingContext(sparkConf, Seconds(3))
   ssc
 }

 def getSSC: StreamingContext = {
   //1. Create SparkConf
   val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")
   //2. Create StreamingContext
   val ssc = new StreamingContext(sparkConf, Seconds(3))
   //Set the checkpoint directory
   ssc.checkpoint("./ck2")
   //3. Define Kafka parameters
   val kafkaPara: Map[String, String] = Map[String, String](
     ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "linux1:9092,linux2:9092,linux3:9092",
     ConsumerConfig.GROUP_ID_CONFIG -> "atguigu"
   )
   //4. Read Kafka data
   val kafkaDStream: InputDStream[(String, String)] =
     KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaPara, Set("atguigu"))
   //5. Calculate WordCount
   kafkaDStream.map(_._2)
     .flatMap(_.split(" "))
     .map((_, 1))
     .reduceByKey(_ + _)
     .print()
   //6. Return the StreamingContext
   ssc
 }

 def main(args: Array[String]): Unit = {
   //Get the StreamingContext from the checkpoint, or create a new one
   val ssc: StreamingContext = StreamingContext.getActiveOrCreate("./ck2", () => getSSC)
   //Start the task
   ssc.start()
   ssc.awaitTermination()
 }
}
  • (3) Write code (maintain offsets manually)
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectAPIHandler {
 def main(args: Array[String]): Unit = {
   //1. Create SparkConf
   val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")
   //2. Create StreamingContext
   val ssc = new StreamingContext(sparkConf, Seconds(3))
   //3. Kafka parameters
   val kafkaPara: Map[String, String] = Map[String, String](
     ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092,hadoop103:9092,hadoop104:9092",
     ConsumerConfig.GROUP_ID_CONFIG -> "atguigu"
   )
   //4. Get the offsets saved at the end of the previous run (in practice read them from external storage such as MySQL)
   val fromOffsets: Map[TopicAndPartition, Long] = Map[TopicAndPartition, Long](TopicAndPartition("atguigu", 0) -> 20)
   //5. Read Kafka data and create a DStream
   val kafkaDStream: InputDStream[String] =
     KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
       ssc,
       kafkaPara,
       fromOffsets,
       (m: MessageAndMetadata[String, String]) => m.message())
   //6. Create an array to hold the offset ranges of the currently consumed data
   var offsetRanges = Array.empty[OffsetRange]
   //7. Capture the offset ranges of the currently consumed data
   val wordToCountDStream: DStream[(String, Int)] = kafkaDStream.transform { rdd =>
     offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
     rdd
   }.flatMap(_.split(" "))
     .map((_, 1))
     .reduceByKey(_ + _)
   //8. Print the offset information and the data
   wordToCountDStream.foreachRDD(rdd => {
     for (o <- offsetRanges) {
       println(s"${o.topic}:${o.partition}:${o.fromOffset}:${o.untilOffset}")
     }
     rdd.foreach(println)
   })
   //9. Start the task
   ssc.start()
   ssc.awaitTermination()
 }
}
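The example above hard-codes the starting offsets; in practice, step 4 would read them from external storage and step 8 would write them back after each batch has been processed. Below is a minimal sketch of such a helper, assuming a hypothetical MySQL table kafka_offsets(topic, partition, untilOffset) and JDBC connection details that are not part of the original example (the MySQL JDBC driver must be on the classpath).

import java.sql.DriverManager
import org.apache.spark.streaming.kafka.OffsetRange

object OffsetStore {
  // Hypothetical JDBC settings; adjust to your environment.
  val url = "jdbc:mysql://hadoop102:3306/spark"
  val user = "root"
  val password = "123456"

  // Persist the until-offset of every consumed partition after a batch finishes.
  def saveOffsets(ranges: Array[OffsetRange]): Unit = {
    val conn = DriverManager.getConnection(url, user, password)
    try {
      val stmt = conn.prepareStatement(
        "REPLACE INTO kafka_offsets(topic, `partition`, untilOffset) VALUES (?, ?, ?)")
      for (o <- ranges) {
        stmt.setString(1, o.topic)
        stmt.setInt(2, o.partition)
        stmt.setLong(3, o.untilOffset)
        stmt.executeUpdate()
      }
    } finally {
      conn.close()
    }
  }
}

In the example above, OffsetStore.saveOffsets(offsetRanges) would be called at the end of the foreachRDD block in step 8, and step 4 would read the saved offsets back instead of hard-coding TopicAndPartition("atguigu", 0) -> 20.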

3.3.4 Kafka 0-10 Direct mode

Requirement: read data from Kafka through Spark Streaming, perform a simple computation on it, and print the result to the console.

  • (1) Import dependency
<dependency>
 <groupId>org.apache.spark</groupId>
 <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
 <version>3.0.0</version>
</dependency>
<dependency>
 <groupId>com.fasterxml.jackson.core</groupId>
 <artifactId>jackson-core</artifactId>
 <version>2.10.1</version>
</dependency>
  • (2) Write code
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectAPI {
 def main(args: Array[String]): Unit = {
   //1. Create SparkConf
   val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")
   //2. Create StreamingContext
   val ssc = new StreamingContext(sparkConf, Seconds(3))
   //3. Define Kafka parameters
   val kafkaPara: Map[String, Object] = Map[String, Object](
     ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "linux1:9092,linux2:9092,linux3:9092",
     ConsumerConfig.GROUP_ID_CONFIG -> "atguigu",
     "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
     "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
   )
   //4. Read Kafka data and create a DStream
   val kafkaDStream: InputDStream[ConsumerRecord[String, String]] =
     KafkaUtils.createDirectStream[String, String](
       ssc,
       LocationStrategies.PreferConsistent,
       ConsumerStrategies.Subscribe[String, String](Set("atguigu"), kafkaPara))
   //5. Take the value of each message
   val valueDStream: DStream[String] = kafkaDStream.map(record => record.value())
   //6. Calculate WordCount
   valueDStream.flatMap(_.split(" "))
     .map((_, 1))
     .reduceByKey(_ + _)
     .print()
   //7. Start the task
   ssc.start()
   ssc.awaitTermination()
 }
}
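With the 0-10 API, offsets can also be committed back to Kafka asynchronously after each batch instead of relying on the consumer's automatic commit. A minimal sketch (assuming "enable.auto.commit" -> "false" is added to kafkaPara above, and reusing kafkaDStream from the example; this replaces the print-based pipeline and is not part of the original code):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaDStream.foreachRDD { rdd =>
  // Capture the offset ranges directly from the Kafka RDD, before any shuffle
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here, e.g. rdd.map(_.value()).foreach(println) ...
  // After the batch has been processed, commit the offsets back to Kafka asynchronously
  kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}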

View Kafka consumption progress

bin/kafka-consumer-groups.sh --describe --bootstrap-server linux1:9092 --group atguigu

Statement: this article is a set of notes taken while learning. If there is any infringement, please let us know and it will be deleted!
Original video address: https://www.bilibili.com/video/BV11A411L7CK

Topics: Big Data kafka Spark