Chapter 3 DStream creation
3.1 RDD queue
3.1.1 usage and description
For testing, you can create a DStream with ssc.queueStream(queueOfRDDs). Each RDD pushed into the queue is consumed as a batch of the DStream.
3.1.2 case practice
Requirement: create several RDDs in a loop and push them into a queue. Create a DStream from the queue with Spark Streaming and compute a WordCount.
- (1) Write code
```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RDDStream {

  def main(args: Array[String]): Unit = {

    //1. Initialize Spark configuration information
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDDStream")

    //2. Initialize the StreamingContext
    val ssc = new StreamingContext(conf, Seconds(4))

    //3. Create the RDD queue
    val rddQueue = new mutable.Queue[RDD[Int]]()

    //4. Create a QueueInputDStream
    val inputStream = ssc.queueStream(rddQueue, oneAtATime = false)

    //5. Process the RDD data in the queue
    val mappedStream = inputStream.map((_, 1))
    val reducedStream = mappedStream.reduceByKey(_ + _)

    //6. Print the results
    reducedStream.print()

    //7. Start the job
    ssc.start()

    //8. Loop and push RDDs into the queue
    for (i <- 1 to 5) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 300, 10)
      Thread.sleep(2000)
    }

    ssc.awaitTermination()
  }
}
```
- (2) Result display
```
-------------------------------------------
Time: 1539075280000 ms
-------------------------------------------
(4,60)
(0,60)
(6,60)
(8,60)
(2,60)
(1,60)
(3,60)
(7,60)
(9,60)
(5,60)

-------------------------------------------
Time: 1539075284000 ms
-------------------------------------------
(4,60)
(0,60)
(6,60)
(8,60)
(2,60)
(1,60)
(3,60)
(7,60)
(9,60)
(5,60)

-------------------------------------------
Time: 1539075288000 ms
-------------------------------------------
(4,30)
(0,30)
(6,30)
(8,30)
(2,30)
(1,30)
(3,30)
(7,30)
(9,30)
(5,30)

-------------------------------------------
Time: 1539075292000 ms
-------------------------------------------
```
3.2 User-defined data source
3.2.1 usage and description
- To build a custom data source, extend Receiver and implement the onStart and onStop methods, as sketched below.
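A minimal sketch of that contract, assuming a dummy receiver that just emits a placeholder string (the full, working socket-based version follows in the case practice):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: shows the Receiver contract (type parameter, storage level, onStart/onStop, store).
class SketchReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  // Called once when the receiver starts: collect data on a separate thread
  // so that onStart itself returns quickly.
  override def onStart(): Unit = {
    new Thread("Sketch Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("some record") // hand each collected record to Spark
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  // Called when the receiver is stopped: release any resources opened in onStart.
  override def onStop(): Unit = {}
}
```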
3.2.2 case practice
Requirement: implement a custom data source that monitors a port and collects the data received on that port.
- (1) Custom data source
```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomerReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  //Called when the receiver starts: read data and forward it to Spark
  override def onStart(): Unit = {
    new Thread("Socket Receiver") {
      override def run(): Unit = {
        receive()
      }
    }.start()
  }

  //Read data and send it to Spark
  def receive(): Unit = {

    //Create a socket
    val socket: Socket = new Socket(host, port)

    //Variable that holds the data received from the port
    var input: String = null

    //Create a BufferedReader to read data from the port
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))

    //Read one line of data
    input = reader.readLine()

    //While the receiver has not been stopped and the input is not null, keep sending data to Spark
    while (!isStopped() && input != null) {
      store(input)
      input = reader.readLine()
    }

    //After leaving the loop, close the resources
    reader.close()
    socket.close()

    //Restart the receiver so that it reconnects
    restart("restart")
  }

  override def onStop(): Unit = {}
}
```
- (2) Collect data using a custom data source
```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStream {

  def main(args: Array[String]): Unit = {

    //1. Initialize Spark configuration information
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamWordCount")

    //2. Initialize the StreamingContext
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //3. Create a DStream from the custom receiver
    val lineStream = ssc.receiverStream(new CustomerReceiver("hadoop102", 9999))

    //4. Split each line into words
    val wordStream = lineStream.flatMap(_.split("\t"))

    //5. Map each word to a tuple (word, 1)
    val wordAndOneStream = wordStream.map((_, 1))

    //6. Count the occurrences of each word
    val wordAndCountStream = wordAndOneStream.reduceByKey(_ + _)

    //7. Print the results
    wordAndCountStream.print()

    //8. Start the StreamingContext
    ssc.start()
    ssc.awaitTermination()
  }
}
```
3.3 Kafka data source (a key topic for interviews and development)
3.3.1 version selection
- Receiver API: a dedicated Executor receives the data and then passes it to other Executors for computation. Because the receiving Executor and the computing Executors run at different speeds, the computing nodes can run out of memory when data is received faster than it is processed. This mode existed in earlier versions and is no longer available in the current version.
- Direct API: the computing Executors consume data from Kafka directly, so they control the consumption rate themselves (see the rate-control sketch below).
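Because Direct mode lets the computing Executors pace themselves, the consumption rate is usually bounded or adapted through Spark configuration. A minimal sketch of the two settings commonly used for this is shown below; the numeric value is a placeholder, not a recommendation.

```scala
import org.apache.spark.SparkConf

object RateControlSketch {
  // Sketch only: rate control for Direct mode.
  val sparkConf: SparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("DirectRateControl")
    // Let Spark adapt the ingestion rate to how fast recent batches were processed
    .set("spark.streaming.backpressure.enabled", "true")
    // Upper bound on records read per Kafka partition per second in Direct mode
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")
}
```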
3.3.2 Kafka 0-8 Receiver mode (not available in the current version)
Requirement: read data from Kafka with Spark Streaming, perform a simple computation on it, and print the result to the console.
- (1) Import dependency
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
```
- (2) Write code
```scala
package com.atguigu.kafka

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverAPI {

  def main(args: Array[String]): Unit = {

    //1. Create SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")

    //2. Create StreamingContext
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    //3. Read Kafka data and create a DStream (Receiver mode)
    val kafkaDStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(
      ssc,
      "linux1:2181,linux2:2181,linux3:2181",
      "atguigu",
      Map[String, Int]("atguigu" -> 1))

    //4. Calculate WordCount
    kafkaDStream.map { case (_, value) => (value, 1) }
      .reduceByKey(_ + _)
      .print()

    //5. Start the job
    ssc.start()
    ssc.awaitTermination()
  }
}
```
3.3.3 Kafka 0-8 Direct mode (not available in the current version)
Requirement: read data from Kafka with Spark Streaming, perform a simple computation on it, and print the result to the console.
- (1) Import dependency
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
```
- (2) Write code (automatically maintain offset)
```scala
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectAPIAuto02 {

  val getSSC1: () => StreamingContext = () => {
    val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    ssc
  }

  def getSSC: StreamingContext = {

    //1. Create SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")

    //2. Create StreamingContext
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    //Set the checkpoint directory
    ssc.checkpoint("./ck2")

    //3. Define Kafka parameters
    val kafkaPara: Map[String, String] = Map[String, String](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "linux1:9092,linux2:9092,linux3:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "atguigu"
    )

    //4. Read Kafka data
    val kafkaDStream: InputDStream[(String, String)] =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaPara, Set("atguigu"))

    //5. Calculate WordCount
    kafkaDStream.map(_._2)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    //6. Return the StreamingContext
    ssc
  }

  def main(args: Array[String]): Unit = {

    //Get the StreamingContext (recover from the checkpoint if one exists)
    val ssc: StreamingContext = StreamingContext.getActiveOrCreate("./ck2", () => getSSC)

    //Start the job
    ssc.start()
    ssc.awaitTermination()
  }
}
```
- (3) Write code (maintain offset manually)
```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectAPIHandler {

  def main(args: Array[String]): Unit = {

    //1. Create SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")

    //2. Create StreamingContext
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    //3. Kafka parameters
    val kafkaPara: Map[String, String] = Map[String, String](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092,hadoop103:9092,hadoop104:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "atguigu"
    )

    //4. Get the offsets saved at the end of the previous run, e.g. from MySQL (hard-coded here)
    val fromOffsets: Map[TopicAndPartition, Long] =
      Map[TopicAndPartition, Long](TopicAndPartition("atguigu", 0) -> 20)

    //5. Read Kafka data and create the DStream
    val kafkaDStream: InputDStream[String] =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
        ssc,
        kafkaPara,
        fromOffsets,
        (m: MessageAndMetadata[String, String]) => m.message())

    //6. Array that stores the offset ranges of the data consumed in the current batch
    var offsetRanges = Array.empty[OffsetRange]

    //7. Capture the offset ranges of the current batch
    val wordToCountDStream: DStream[(String, Int)] = kafkaDStream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    //8. Print the offset information
    wordToCountDStream.foreachRDD(rdd => {
      for (o <- offsetRanges) {
        println(s"${o.topic}:${o.partition}:${o.fromOffset}:${o.untilOffset}")
      }
      rdd.foreach(println)
    })

    //9. Start the job
    ssc.start()
    ssc.awaitTermination()
  }
}
```
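The example above hard-codes the starting offset in step 4 and only prints the new offsets in step 8; in practice step 4 would read the offsets from external storage and step 8 would write them back after each batch. A rough sketch of that write-back, assuming a hypothetical MySQL table `kafka_offsets(topic, partition, until_offset)` and placeholder connection settings:

```scala
import java.sql.DriverManager

import org.apache.spark.streaming.kafka.OffsetRange

object OffsetStoreSketch {

  // Sketch only: persist the offsets of a processed batch to a hypothetical MySQL table
  // kafka_offsets(topic VARCHAR, partition INT, until_offset BIGINT, PRIMARY KEY(topic, partition)).
  def saveOffsets(offsetRanges: Array[OffsetRange]): Unit = {
    // URL, user, and password are placeholders.
    val conn = DriverManager.getConnection("jdbc:mysql://hadoop102:3306/spark", "user", "password")
    try {
      val stmt = conn.prepareStatement(
        "REPLACE INTO kafka_offsets(topic, `partition`, until_offset) VALUES (?, ?, ?)")
      for (o <- offsetRanges) {
        stmt.setString(1, o.topic)
        stmt.setInt(2, o.partition)
        stmt.setLong(3, o.untilOffset)
        stmt.executeUpdate()
      }
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```

On the next startup, the saved values would be loaded to build the fromOffsets map in step 4, so consumption resumes exactly where the last fully processed batch ended.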
3.3.4 Kafka 0-10 Direct mode
Requirement: read data from Kafka with Spark Streaming, perform a simple computation on it, and print the result to the console.
- (1) Import dependency
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.0.0</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.10.1</version>
</dependency>
```
- (2) Write code
```scala
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectAPI {

  def main(args: Array[String]): Unit = {

    //1. Create SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")

    //2. Create StreamingContext
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    //3. Define Kafka parameters
    val kafkaPara: Map[String, Object] = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "linux1:9092,linux2:9092,linux3:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "atguigu",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
    )

    //4. Read Kafka data and create the DStream
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] =
      KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Set("atguigu"), kafkaPara))

    //5. Extract the value of each message
    val valueDStream: DStream[String] = kafkaDStream.map(record => record.value())

    //6. Calculate WordCount
    valueDStream.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    //7. Start the job
    ssc.start()
    ssc.awaitTermination()
  }
}
```
View the Kafka consumption progress:

```bash
bin/kafka-consumer-groups.sh --describe --bootstrap-server linux1:9092 --group atguigu
```
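With the 0-10 API, the application can also commit offsets back to Kafka itself after each processed batch, which is what the consumer group progress shown above reflects. A minimal sketch, reusing the kafkaDStream from the example above and assuming auto-commit has been disabled in kafkaPara (ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG set to false):

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

// Sketch only: after processing each batch, commit its offset ranges back to Kafka asynchronously.
kafkaDStream.foreachRDD { rdd =>
  // RDDs produced by the direct stream carry their Kafka offset ranges.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here ...

  // The input DStream itself can commit offsets to Kafka.
  kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```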
Statement: this article is a set of notes taken while studying. If there is any infringement, please let us know and it will be removed.
Original video address: https://www.bilibili.com/video/BV11A411L7CK