Spark Big Data: Building a Real-Time Analysis Dashboard with Spark + Kafka

Posted by t31os on Fri, 17 Jan 2020 03:40:05 +0100

Building a Real-Time Analysis Dashboard with Spark + Kafka

1. Framework

Spark + Kafka is used to analyze, in real time, the number of male and female shoppers per second: Spark Streaming processes the user shopping log in real time, WebSocket then pushes the data to the browser in real time, and finally the browser displays the received data as it arrives. The overall framework of the case is as follows:

The steps are analyzed in detail below:

  1. The application sends the shopping log to Kafka under the topic "sex". Since only the numbers of male and female shoppers are counted, only the gender attribute of each shopping record needs to be sent. Here the shopping log is sent in a simulated way: the log data is read from a file and sent to Kafka at a fixed interval.
  2. Spark Streaming then reads and processes the messages from the Kafka topic "sex". The data is read in batches according to the size of the sliding window, for example a 5-second window, and then processed.
  3. Spark sends the processed data back to Kafka under the topic "result".
  4. A web application built with Flask receives the messages from the Kafka topic "result".
  5. Flask-SocketIO pushes the data to the clients in real time.
  6. The client browser receives the data in real time with the socket.io JS library and renders it dynamically with the highcharts.js visualization library (the sketch after this list summarizes what travels over each hop).
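
To keep the data flow clear before diving into each step, the following minimal Python sketch (illustrative values only; the exact layout of the "result" message follows from the code later in this article) summarizes what travels over each hop of the pipeline:

# What flows through the pipeline (illustrative values, not real output)

# 1. Producer -> Kafka topic "sex": one gender code per shopping record
sex_message = '1'                      # '0' = female, '1' = male, '2' = unknown

# 2. Spark Streaming -> Kafka topic "result": per-second counts encoded as JSON
result_message = '[{"0":3},{"1":5}]'   # 3 female and 5 male purchases in this second

# 3. Flask-SocketIO -> browser: a 'test_message' event carrying "girls,boys"
socket_event = {'data': '3,5'}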

2. Preparing the Experimental Environment

Environmental requirements

  • Ubuntu: 16.04
  • Spark: 2.1.0
  • Scala: 2.11.8
  • Kafka: 0.8.2.2
  • Python: 3.x (version 3.0 or above)
  • Flask: 0.12.1
  • Flask-SocketIO: 2.8.6
  • kafka-python: 1.3.3
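
Assuming the Python-side packages above are installed (for example via pip install flask flask-socketio kafka-python), a quick sanity check is to import them and print their versions. This snippet is only a convenience, not part of the case itself, and assumes the packages expose a __version__ attribute:

# Sanity check: confirm the Python libraries are importable and report their versions
import flask
import flask_socketio
import kafka   # the kafka-python package

print('Flask:', flask.__version__)
print('Flask-SocketIO:', flask_socketio.__version__)
print('kafka-python:', kafka.__version__)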

3. Operating Kafka from Python

1. Data introduction

The data set used is data_format.zip (click here to download the data_format.zip data set).
This compressed package contains Taobao transaction data from the six months leading up to Double 11 in 2015 (including Double 11); the values have been offset, but this does not affect the results of the experiment. It contains three files: user_log.csv, train.csv and test.csv. Only user_log.csv is used in this case. The fields of the user behavior log user_log.csv are defined as follows:

  1. user_id | buyer ID
  2. item_id | commodity ID
  3. cat_id | commodity category ID
  4. merchant_id | vendor ID
  5. brand_id | brand ID
  6. month | transaction time: month
  7. day | transaction time: day
  8. action | behavior, value range {0,1,2,3}: 0 means click, 1 means add to shopping cart, 2 means purchase, 3 means follow the commodity
  9. age_range | buyer age range: 1 for age < 18, 2 for age [18,24], 3 for age [25,29], 4 for age [30,34], 5 for age [35,39], 6 for age [40,49], 7 and 8 for age >= 50, 0 and NULL for unknown
  10. gender | gender: 0 for female, 1 for male, 2 for unknown
  11. province | province of the shipping address

The data is formatted as follows:

user_id,item_id,cat_id,merchant_id,brand_id,month,day,action,age_range,gender,province
328862,323294,833,2882,2661,08,29,0,0,1,Inner Mongolia
328862,844400,1271,2882,2661,08,29,0,1,1,Shanxi
328862,575153,1271,2882,2661,08,29,0,2,1,Shanxi
328862,996875,1271,2882,2661,08,29,0,1,1,Inner Mongolia
328862,1086186,1271,1253,1049,08,29,0,0,2,Zhejiang
328862,623866,1271,2882,2661,08,29,0,0,2,Heilongjiang
328862,542871,1467,2882,2661,08,29,0,5,2,Sichuan
328862,536347,1095,883,1647,08,29,0,7,1,Jilin

Since the number of male and female shoppers per second is counted in real time, for each shopping record we only need to extract the gender, send it to Kafka, and let Spark Streaming process the received gender values.

2. Data Preprocessing

Python is used to preprocess the data and send it directly to Kafka through a Kafka producer. For this, the library that lets Python talk to Kafka needs to be installed (kafka-python, listed in the environment requirements above, can be installed with pip install kafka-python).

2.1 Creating the Producer

# coding: utf-8
import csv
import time
from kafka import KafkaProducer

# Instantiate a KafkaProducer for delivering messages to Kafka
producer = KafkaProducer(bootstrap_servers='192.168.1.30:9092')
# Open Data File
csvfile = open("../data/user_log.csv", "r", encoding='UTF-8')
# Generate a reader that can be used to read csv files
reader = csv.reader(csvfile)

for line in reader:
    gender = line[9]  # Gender is the 10th field (index 9) in each line of the log
    if gender == 'gender':
        continue  # Skip the header row
    time.sleep(0.1)  # Send one row of data every 0.1 seconds
    # Send the gender value to the Kafka topic 'sex'
    print(gender.encode('utf8'))
    producer.send('sex', gender.encode('utf8'))

2.2 Creating the Consumer

from kafka import KafkaConsumer

consumer = KafkaConsumer('result', bootstrap_servers='192.168.1.30:9092')
for msg in consumer:
    print((msg.value).decode('utf8'))
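
While developing, it can be handy to point a consumer at the "sex" topic first, to confirm that producer.py is actually publishing, before the Spark job produces anything on "result". A minimal sketch, using the same broker address as above:

# Sketch: subscribe to the 'sex' topic to verify that the producer is sending data
from kafka import KafkaConsumer

consumer = KafkaConsumer('sex', bootstrap_servers='192.168.1.30:9092')
for msg in consumer:
    print(msg.value.decode('utf8'))  # should print a stream of gender codes: 0, 1 or 2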

4. Spark Streaming Real-Time Data Processing

  • Copy the spark-streaming-kafka jar package (the kafka010 version) into the jars directory on each node of the cluster.
  • The number of male and female shoppers per second is counted in real time. Spark Streaming receives data such as 1, 1, 0, 2, ..., where 0 stands for female and 1 for male (2 and null values are not considered). Analysis shows that this is a typical wordcount problem computed over the Spark stream: the number of girls is the count of 0s and the number of boys is the count of 1s. Using Spark Streaming's reduceByKeyAndWindow interface with a window size of 1 second and a sliding step of 1 second, the counts of 0 and 1 give the numbers of girls and boys per second (the sketch after this list makes the counting concrete). The processing works as follows:
  1. First, Kafka messages are read once per second;
  2. Then the wordcount algorithm is run on each second of data, counting the occurrences of 0, 1 and 2;
  3. Finally, the result is encapsulated as JSON and sent back to Kafka.
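
To make the windowed wordcount concrete, here is a tiny Python illustration (not part of the pipeline) of what happens to one second of gender codes read from the "sex" topic:

from collections import Counter

# One second's worth of gender codes from the 'sex' topic (made-up sample)
window = ['0', '1', '1', '0', '2', '1', '0']

# The "wordcount": occurrences of each code within the window
counts = Counter(window)      # Counter({'0': 3, '1': 3, '2': 1})

girls = counts.get('0', 0)    # code 0 = female
boys = counts.get('1', 0)     # code 1 = male
print(girls, boys)            # 3 3 -> female and male purchases in this second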

1. Setting the Log Level

import org.apache.spark.internal.Logging
import org.apache.log4j.{Level, Logger}
/** Utility functions for Spark Streaming examples. */
object StreamingExamples extends Logging {
  /** Set reasonable logging levels for streaming if the user has not configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // We first log something to initialize Spark's default logging, then we override the
      // logging level.
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}

2. Spark Streaming Real-Time Data Analysis

The program accepts data from Kafka for real-time streaming analysis, creates a producer, and sends the analysis results back to Kafka; the Python web side then receives the analysis results from Kafka and visualizes them.

import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

import org.json4s._
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.write

// import org.json4s._
// import org.json4s.JsonDSL._
// import org.json4s.jackson.JsonMethods._

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.Interval
import org.apache.spark.streaming.kafka010._

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaWordCount{
    
    def main(args:Array[String]){
//         Set the log level
        StreamingExamples.setStreamingLogLevels()
        val sparkConf = new SparkConf().setAppName("KafkaWordCount")
        val ssc = new StreamingContext(sparkConf, Seconds(1))
//         Set the checkpoint directory
        ssc.checkpoint(".")
        
//         Create consumers, get data from kafka for real-time processing by spark streaming
//         Parameter settings for new version of kafka
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "192.168.1.30:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "1",
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )
//         Topics to subscribe to; multiple topics may be subscribed to at the same time
        val topics = Array("sex")
//         create data source
        val lineMap = KafkaUtils.createDirectStream[String, String](
            ssc,
            PreferConsistent,
            Subscribe[String, String](topics, kafkaParams)
        )
        val lines =lineMap.map(record => record.value)
        val words=lines.flatMap(_.split(" "))
//         val wordCounts = words.map(x => (x, 1L)).reduceByKeyAndWindow(_+_,_-_, Seconds(1), Seconds(1), 1).print
//         Use a window over the stream for real-time analysis
        val wordCounts = words.map(x => (x, 1L)).reduceByKeyAndWindow(_+_,_-_, Seconds(1), Seconds(1), 1).foreachRDD(rdd => {
          if(rdd.count !=0 ){
               implicit val formats = DefaultFormats // declared inside foreachRDD to avoid serialization issues
               val props = new HashMap[String, Object]()
            //       Provide brokers address
                props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.1.30:9092")
            //       Serializer for the value in the key-value pair (required for network transport)
                props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer")
            //       Serializer for the key in the key-value pair (required for network transport)
                props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer")
            //       Producer object, configured with the broker address and the key/value serializers
                val producer = new KafkaProducer[String, String](props)
            //       Collect the RDD into an array of (word, count) pairs and serialize it to JSON
                val str = write(rdd.collect)
              
//               Another way to convert rdd to json
//                 val json=rdd.collect().toList.map{case (word, count) =>(word, count)}
//                 val str=compact(render(json))
                                
//                 println(rdd.collect.getClass().getName()) //Lscala.Tuple2;
//                 println(str)
                val message = new ProducerRecord[String, String]("result", null, str) 
                producer.send(message)
          }
        })
        ssc.start()
        ssc.awaitTermination()
    }
}
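
Based on how app.py (shown in the next section) parses the message, each record that the job above writes to the "result" topic is a JSON array of single-key objects, one per gender code. The small Python sketch below uses a made-up payload of that shape and decodes it the same way the web application does:

import json

# Example payload on the 'result' topic (values are made up; the shape matches app.py's parsing)
message = '[{"0":3},{"1":5},{"2":1}]'

girl = boy = 0
for item in json.loads(message):
    if '0' in item:
        girl = item['0']
    elif '1' in item:
        boy = item['1']
print(girl, boy)  # 3 5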

5. Visualization of Real-Time Analysis Results

  • Flask-SocketIO pushes the data in real time
  • socket.io.js receives the data in real time
  • highcharts.js displays the data

5.1 Flask-SocketIO Pushes Data in Real Time

Spark Streaming receives the real-time shopping log data sent to the Kafka topic 'sex' and processes it in real time; after counting the numbers of male and female shoppers per second, it sends the result to Kafka under the topic 'result'. This section describes how to use Flask-SocketIO to push the result to the browser in real time.
The file directory structure is as follows:

  • The data directory stores the user log data;
  • The scripts directory stores the Kafka producer and consumer;
  • The static/js directory stores the JS frameworks needed by the front end;
  • The templates directory holds the HTML pages;
  • app.py is the web server; it receives the results of the Spark Streaming processing and pushes the data to the browser in real time;
  • External Libraries are the Python libraries this project depends on, generated automatically by PyCharm.

The function of app.py is to act as a simple server: it handles connection requests, processes the data received from Kafka, and pushes it to the browser in real time. The code of app.py is as follows:
import json
from flask import Flask, render_template
from flask_socketio import SocketIO
from kafka import KafkaConsumer

app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret!'
socketio = SocketIO(app)
thread = None
# Instantiate a consumer to receive messages from the topic 'result'
consumer = KafkaConsumer('result', bootstrap_servers='192.168.1.30:9092')

# A background thread that continuously receives Kafka messages and sends them to client browsers
def background_thread():
    girl = 0
    boy = 0
    for msg in consumer:
        data_json = msg.value.decode('utf8')
        data_list = json.loads(data_json)
        for data in data_list:
            if '0' in data.keys():
                girl = data['0']
            elif '1' in data.keys():
                boy = data['1']
            else:
                continue
        result = str(girl) + ',' + str(boy)
        print(result)
        socketio.emit('test_message', {'data': result})


# Handler for the 'test_connect' event sent by the client
@socketio.on('test_connect')
def connect(message):
    print(message)
    global thread
    if thread is None:
        # Open a separate thread to send data to clients
        thread = socketio.start_background_task(target=background_thread)
    socketio.emit('connected', {'data': 'Connected'})

# Visiting http://127.0.0.1:5000/ serves index.html
@app.route("/")
def handle_mes():
    return render_template("index.html")

# main function
if __name__ == '__main__':
    socketio.run(app, debug=True)
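
If you want to confirm the push without opening a browser, a small command-line client can subscribe to the same events. This is only a sketch and assumes the python-socketio client package is installed in a version whose Socket.IO protocol is compatible with the Flask-SocketIO server above; it is not part of the original project:

# Sketch of a console Socket.IO client (assumes a compatible python-socketio package)
import socketio

sio = socketio.Client()

@sio.event
def connect():
    # Mirror what index.html does: announce ourselves so app.py starts its background thread
    sio.emit('test_connect', {'data': 'console client connected'})

@sio.on('test_message')
def on_test_message(message):
    # message['data'] is "girls,boys", e.g. "3,5"
    print(message['data'])

sio.connect('http://127.0.0.1:5000')
sio.wait()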

6. Effect Display

With all the above steps in place, we can start the programs and see the final result. The startup steps are as follows (the Python files need to be started from the command line):
1. Make sure Kafka is running.
2. Run producer.py to simulate the data stream.
3. Start the Spark Streaming job to process the data in real time. Tip: once the real-time processing has started, you can change the topic in consumer.py to "result" and run consumer.py to see the output of the processing.
4. Start app.py.

Use your browser to visit http://127.0.0.1:5000/ to see the final result.

Topics: kafka Spark Apache Python