Checkpoint location analysis under spark structured streaming, connection with Grafana monitoring and submission of Kafka Lag monitoring

Posted by ShashidharNP on Mon, 22 Nov 2021 09:44:12 +0100

1, Introduction to spark structured streaming checkpoint location

Structured Streaming was introduced in spark version 2.0 in 2016. It is an extensible and fault-tolerant stream processing engine based on Spark SQL Engine. Compared with traditional Spark Streaming, due to the reuse of Spark SQL Engine, the code writing method is the same as batch processing API (based on Dataframe and Dataset API), and these APIs are very simple.

Structured Streaming also supports the use of event time to process delayed data by setting watermark; Spark Streaming can only be calculated based on process time, which is obviously not enough.

such as  . withWatermark("timestamp", "10 minutes")   Represents the data in the DataFrame   timestamp   Field as event time. If event time is more than 10 minutes behind process time, these data will not be processed.

Structured Streaming still uses micro batch mode to process data by default, but Spark 2.3 has provided a method called   Continuous Processing   The model can be semantic at least once, and the data end-to-end only takes 1ms.

However, the Web UI of Structured Streaming does not have the same monitoring indicators as Spark Streaming.

Structure of Checkpoint Directory:

1. Checkpoint location is in the source code call chain

Analyzing the source code and viewing the StructuredStreaming startup process, it is found that the DataStreamWriter#start method starts a StreamingQuery.

At the same time, pass the checkpoint location configuration parameter to StreamingQuery management.

The implementation relationship of StreamingQuery interface is as follows:

  • The StreamingQueryWrapper wraps only one non serializable StreamExecution

  • StreamExecution manages the executor of Spark SQL queries

  • MicroBatchExecution microbatch executor

  • Continuous processing (streaming) actuator

So we only need to analyze checkpointLocation's call in StreamExecution.

Note: protected def checkpoint file (Name: String) in StreamExecution: the string method returns the path of $checkpoint file / name for all logic related to checkpoint location

2. MetadataLog (metadata log interface)

Spark provides the org.apache.spark.sql.execution.streaming.MetadataLog interface for unified processing of metadata log information.

The contents of the checkpoint location file are maintained using MetadataLog.

The analysis interface implementation relationship is as follows:

Class function description:

  • NullMetadataLog is an empty log, that is, it is discarded directly without outputting the log

  • HDFSMetadataLog uses HDFS as the metadata log output

  • CommitLog commit log

  • OffsetSeqLog offset log

  • Compatible filestreamlog encapsulates MetadataLog that supports merging and deleting history records by size

  • Logging when the StreamSourceLog file type is used as a data source

  • When the FileStreamSinkLog file type is used as the data receiving end, the log is recorded

  • EsSinkMetadataLog Es is used as the data receiver for logging

The analysis of compactablefilestreamlog#compact consolidation logic is briefly described as follows: suppose 0,1,2,3,4,5,6,7,8,9,10 batches arrive here, the consolidation size is 3, the current consolidation result is' 0,1,2.compact,3,4 ', and the next consolidation result is' 0,1,2.compact,3,4,5.compact', **Note: 5. Compact file content = 2.compact + 3 + 4**last.compact file size will increase infinitely with batch operation

Analyze the logic of deleting expired files by CompactibleFileStreamLog: when the CompactibleFileStreamLog#add method is called, it will determine whether the delete operation is supported by default. Override def add (batchid: long, logs: array [t]): Boolean = {val batchadded = if (iscompressionbatch (batchid, compactinterval)) {/ / merge compact (batchid, logs)} else {super.add (batchid, logs) }If (batchadded & & isdeletingexpiredlog) {/ / adding succeeds and deleting expired files is supported. / / judge whether the current batch is outside the spark.sql.streaming.minBatchesToRetain configuration and within the file retention time during deletion. / / for configuration items, refer to section 4 solution configuration instructions deleteexpiredlog (batchid)} batchadded}

3,   Analyze the contents of the checkpoint location directory

At present, the content of checkpoint location mainly includes the following directories

  • offsets

  • commits

  • metadata

  • sources

  • sinks

3.1 offsets directory

Record the offset in each batch. To ensure that a given batch always contains the same data, write the data to this log record before processing it.

The nth record in this log indicates that it is currently being processed, and the N th-1 entry indicates which offsets have been processed.

// In StreamExecution, Val offsetlog = new offsetseqlog (sparksession, checkpoint file ("offsets"))
// The sample contents of the log are as follows: file path = checkpoint location / offsets / 560504v1{"batchwatermarkms": 0, "batchtimestampms": 1574315160001, "conf": {"spark. SQL. Streaming. Statestore. Providerclass": "org. Apache. Spark. SQL. Execution. Streaming. State. Hdfsbackedstatestoreprovider", "spark. SQL. Streaming. Flatmapgroupswithstate. Stateformatversion": "2", "spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}{"game_dc_real_normal":{"17":279843310,"8":318732102,"11":290676804,"2":292352132,"5":337789356,"14":277147358,"13":334833752,"4":319279439,"16":314038811,"7":361740056,"1":281418138,"10":276872234,"9 ":244398684,"3":334708621,"12":290208334,"15":267180971,"6":296588360,"0":350011707}}

3.2 commitLog directory

Record the completed batch, restart the task, check and compare the completed batch with the offsets batch record, and determine the next running batch

StreamExecution in val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))// The sample contents of the log are as follows: file path = checkpoint location / commits / 560504v1{"nextbatchwatermarkms": 0}

3.3 metadata directory

Metadata is the metadata associated with the entire query. Currently, only the current job id is reserved

StreamExecution in val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))// The sample contents of the log are as follows: file path = checkpoint location / metadata {"Id": "5314beeb-6026-485b-947a-cb088a9c9bac"}

3.4 sources directory

When the sources directory is the data source, read the details of each batch

3.5 sinks directory

When the sinks directory is the sink, write out the details of the batch

In addition, if there is a state calculation in the task, there will also be a state Directory: record status. When there are state operations, such as cumulative aggregation, de duplication, maximum and minimum, this directory will be used to record these state data. Directory structure: checkpoint/state/, checkpoint/state/xxx.snapshot. The new. Snapshot is a file generated by merging the old. Snapshot and. Delta. Structured Streaming periodically generates a. Snapshot file to record the status according to the configuration.

2, Spark Structured Streaming docking Grafana monitoring

Structured Streaming has a   StreamingQueryListener   For asynchronous reporting metrics, this is a Official example:

val spark: SparkSession = ...

spark.streams.addListener(new StreamingQueryListener() {
    override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
        println("Query started: " +
    override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
        println("Query terminated: " +
    override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
        println("Query made progress: " + queryProgress.progress)

Meaning of StreamingQuery API:



If we monitor, we mainly use   onQueryProgress   Method to report data to the monitoring system.

import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}
import com.codahale.metrics.{Gauge, MetricFilter, MetricRegistry}
import org.apache.spark.sql.streaming.StreamingQueryListener

import java.util.concurrent.TimeUnit

class SparkStreamingGraphiteMetrics(prefix: String, graphiteHostName: String, graphitePort: Int) extends StreamingQueryListener {
  val metrics = new MetricRegistry()
  var inputRowsPerSecond = 0D
  var processedRowsPerSecond = 0D
  var numInputRows = 0D
  var triggerExecution = 0L
  var batchDuration = 0L
  var sourceEndOffset = 0L
  var sourceStartOffset = 0L

  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {
    val graphite = new Graphite(new InetSocketAddress(graphiteHostName, graphitePort))
    val reporter: GraphiteReporter = GraphiteReporter
      .prefixedWith(s"spark_structured_streaming_${prefix}") // Index name prefix, easy to use in Grafana
    reporter.start(30, TimeUnit.SECONDS)
    metrics.register(s"inputRowsPerSecond", new Gauge[Double] {
      override def getValue: Double = inputRowsPerSecond
    metrics.register(s"processedRowsPerSecond", new Gauge[Double] {
      override def getValue: Double = processedRowsPerSecond
    metrics.register("numInputRows", new Gauge[Double] {
      override def getValue: Double = numInputRows
    metrics.register("triggerExecution", new Gauge[Long] {
      override def getValue: Long = triggerExecution
    metrics.register("batchDuration", new Gauge[Long] {
      override def getValue: Long = batchDuration
    metrics.register("sourceEndOffset", new Gauge[Long] {
      override def getValue: Long = sourceEndOffset
    metrics.register("sourceStartOffset", new Gauge[Long] {
      override def getValue: Long = sourceStartOffset

  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
    // Assign and report each indicator
    inputRowsPerSecond = event.progress.inputRowsPerSecond
    processedRowsPerSecond = event.progress.processedRowsPerSecond
    numInputRows = event.progress.numInputRows
    triggerExecution = event.progress.durationMs.getOrDefault("triggerExecution", 0L)
    batchDuration = event.progress.batchDuration
    event.progress.sources.foreach(source => {
      sourceEndOffset = source.endOffset.toLong
      sourceStartOffset = source.startOffset.toLong

  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {

Add listening in the main program:


Graphite needs to be started_ Exporter, just find a server with two default ports:

  • 9109 is used to report data, i.e. spark - > graphite_ exporter
  • 9108 is Prometheus from graphite_ The exporter is used to pull data

You also need to create a profile in Prometheus   prometheus.yml   Read data is configured inside

- job_name: 'spark'
  - targets: ['192.168.1.xx:9108']

Finally, after starting the spark program, you can configure the chart in Grafana.

Configure Grafana chart

Like I set it up   prefix   yes   Click, then the Explore module in Grafana can select the Prometheus data source and input indicators   spark_click_inputRowsPerSecond  , Click Query to get the reading rate, as shown in the figure below:


3, Submit Offset to Kafka based on StreamingQueryListener

We can submit offset to kafka on the basis of sparkstreaminggraphitemetry. As shown below


import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.streaming.StreamingQueryListener.QueryProgressEvent

import java.util
import java.util.Properties

class KafkaOffsetCommiter(prefix: String, graphiteHostName: String, graphitePort: Int, kafkaProperties: Properties) extends SparkStreamingGraphiteMetrics(prefix: String, graphiteHostName: String, graphitePort: Int) {

  val kafkaConsumer = new KafkaConsumer[String, String](kafkaProperties)

  // Submit Offset
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Traverse all sources
    event.progress.sources.foreach(source => {

      val objectMapper = new ObjectMapper()
        .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
        .configure(DeserializationFeature.USE_LONG_FOR_INTS, true)

      val endOffset = objectMapper.readValue(source.endOffset, classOf[Map[String, Map[String, Long]]])

      // Traverse each Topic in the Source
      for ((topic, topicEndOffset) <- endOffset) {
        val topicPartitionsOffset = new util.HashMap[TopicPartition, OffsetAndMetadata]()

        //Traverse each Partition in the Topic
        for ((partition, offset) <- topicEndOffset) {
          val topicPartition = new TopicPartition(topic, partition.toInt)
          val offsetAndMetadata = new OffsetAndMetadata(offset)
          topicPartitionsOffset.put(topicPartition, offsetAndMetadata)



Topics: Grafana