Spark's cache and checkpoint

Posted by mazen on Fri, 12 Nov 2021 05:34:07 +0100

To introduce these two mechanisms, let us first write a job that estimates Pi. All we need is the ratio between the number of points that land inside the circle and the number of points that land inside the enclosing square (that ratio, times 4, approximates Pi).

Two parameters are worth highlighting:
slices indicates how many tasks are generated
cnt indicates how many points are generated in each task
Every task carries its own scheduling overhead, so generating a task for every small batch of work leaves us with a huge number of tiny pieces to process. We can reduce slices and increase cnt accordingly to improve efficiency: the total number of points stays the same, but the efficiency is completely different.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.util.Random

object Demo6Pi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("Demo6Pi")
    val sc = new SparkContext(conf)

    // Random.nextDouble() gives a random number between 0 and 1
    // println(Random.nextDouble())

    // slices: how many tasks are generated; cnt: how many points each task generates
    val slices = 100
    val cnt = 10000

    // parallelize takes an optional second parameter, numSlices, which defaults to the
    // parallelism and can be set manually; it determines how many partitions the RDD has,
    // and therefore how many tasks will run
    val seqRDD: RDD[Int] = sc.parallelize(0 until cnt * slices, slices)

    // Randomly generate N points with x and y in [-1, 1]
    val pointRDD: RDD[(Double, Double)] = seqRDD.map(_ => {
      val x: Double = Random.nextDouble() * 2 - 1
      val y: Double = Random.nextDouble() * 2 - 1
      (x, y)
    })

    // Keep only the points that fall inside the unit circle
    val circlePointNum: Long = pointRDD.filter(kv => {
      val x: Double = kv._1
      val y: Double = kv._2
      x * x + y * y <= 1
    }).count()

    println(circlePointNum)
    val pi: Double = circlePointNum.toDouble / (cnt * slices) * 4
    println(pi)
  }
}

Cache -> improving efficiency

The cache lives in the Executor. The Executor holds the CPU and memory, and it is where the program actually runs.

Why use cache

Before we add any cache, the Spark job we run is not that different from MapReduce: it is still map and reduce. Spark divides the work into stages, but the principle underneath is still map and reduce connected by a shuffle.

Here we count the number of students in each class and the number of students of each gender.

val conf: SparkConf = new SparkConf()
  .setMaster("local")
  .setAppName("Demo16Cache")

val sc: SparkContext = new SparkContext(conf)

val stuRDD: RDD[String] = sc.textFile("D:\\BigDaTa\\JAVA_Project\\ShuJia01\\data\\students.txt")

// Count the number of students in each class (column index 4)
stuRDD.map(line => {
  val strings: Array[String] = line.split(",")
  (strings(4), 1)
}).reduceByKey(_ + _)
  .foreach(println)

// Count the number of students of each gender (column index 3)
stuRDD.map(line => {
  val strings: Array[String] = line.split(",")
  (strings(3), 1)
}).reduceByKey(_ + _)
  .foreach(println)

Notice that stuRDD is used twice.
How can we see that it is actually evaluated twice? We make a small adjustment: on top of the stuRDD that reads the data, we add a map that prints a line every time a record passes through.

val conf: SparkConf = new SparkConf()
  .setMaster("local")
  .setAppName("Demo16Cache")

val sc: SparkContext = new SparkContext(conf)

val stuRDD: RDD[String] = sc.textFile("D:\\BigDaTa\\JAVA_Project\\ShuJia01\\data\\students.txt")

// Print "stuRDD" every time a record passes through, so we can see how often the data is read
val stuRDDMap: RDD[String] = stuRDD.map(stu => {
  println("stuRDD")
  stu
})

// stuRDDMap.cache()

// Count the number of students in each class
stuRDDMap.map(line => {
  val strings: Array[String] = line.split(",")
  (strings(4), 1)
}).reduceByKey(_ + _)
  .foreach(println)

// Count the number of students of each gender
stuRDDMap.map(line => {
  val strings: Array[String] = line.split(",")
  (strings(3), 1)
}).reduceByKey(_ + _)
  .foreach(println)

Running this, we can see that "stuRDD" is printed twice for each record (once for the class count and once for the gender count).
In other words, stuRDD is re-evaluated every time its data is used.
Every computation fetches the data from HDFS again, loads it into stuRDD, and only then runs the calculation. That is not how Spark is meant to be used.
One reason Spark is fast is that it does not have to re-fetch the data for every computation. So we can cache the data: once cached, every later use of this RDD no longer has to go back to HDFS.
When a piece of data is used multiple times, it should not have to be read repeatedly (repeated reading is no different from MapReduce).

Using cache

We can cache RDDs that are used multiple times.

val conf: SparkConf = new SparkConf()
  .setMaster("local")
  .setAppName("Demo16Cache")

val sc: SparkContext = new SparkContext(conf)

val stuRDD: RDD[String] = sc.textFile("D:\\BigDaTa\\JAVA_Project\\ShuJia01\\data\\students.txt")

// Same print as before, so we can compare how often the data is read
val stuRDDMap: RDD[String] = stuRDD.map(stu => {
  println("stuRDD")
  stu
})

// Cache the RDD that is used more than once
stuRDDMap.cache()

// Count the number of students in each class
stuRDDMap.map(line => {
  val strings: Array[String] = line.split(",")
  (strings(4), 1)
}).reduceByKey(_ + _)
  .foreach(println)

// Count the number of students of each gender
stuRDDMap.map(line => {
  val strings: Array[String] = line.split(",")
  (strings(3), 1)
}).reduceByKey(_ + _)
  .foreach(println)

After running this version, we find that "stuRDD" is printed only once per record, even though we still print both the class counts and the gender counts.

Cache storage levels

Looking at the source, cache simply calls the persist method.

Clicking into persist, we can see that it uses MEMORY_ONLY by default.

Clicking into StorageLevel, we can see that many storage levels are available.

The difference between cache and persist


Both cache and persist can be used for caching. The difference is that cache can only use MEMORY_ONLY, the memory-based strategy, while persist can be given many different storage strategies.

Selection of cache policy


Here:
DISK_ONLY means the data is cached on disk only
MEMORY_ONLY means the data is cached in memory only
The _SER suffix means the data is stored in serialized form, which is more compact but costs extra CPU to read back
The _2 suffix means the cached data is replicated, i.e. one backup copy is kept

1. When the data is small and memory is sufficient, choose MEMORY_ONLY.
2. When the data is a bit larger and cannot fit entirely in memory, choose MEMORY_AND_DISK_SER: as much data as possible is still cached in memory, which is the most efficient, and the rest spills to disk (see the sketch below).
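
As a small illustration, here is a sketch reusing the stuRDDMap from the cache example above (note that an RDD's storage level can only be set once, so pick one level per RDD):

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
// stuRDDMap.cache()

// With persist we can choose any other strategy, for example memory first,
// spilling to disk, stored in serialized form:
stuRDDMap.persist(StorageLevel.MEMORY_AND_DISK_SER)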

unpersist: releasing the cache

Spark also includes Spark Streaming, where data keeps arriving continuously. If we keep caching without ever releasing anything, we will run out of memory sooner or later, so at some point the cache has to be released.

When the cached data is no longer needed, we release the cache with unpersist.
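
A minimal sketch, again reusing stuRDDMap from the example above:

// Cache the RDD while it is still being reused
stuRDDMap.cache()

// ... jobs that use stuRDDMap several times ...

// When it is no longer needed, drop the cached blocks from the Executors
stuRDDMap.unpersist()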

Checkpoint -> fault tolerance

Checkpoint persists the data to HDFS: as long as HDFS does not go down, the data is not lost.

We saw above that cached data lives in the Executor, but an Executor can go down; once it does, our cached data disappears and has to be recomputed.
Instead, we can write the data to HDFS. This lowers efficiency, but it gives us fault tolerance.

Here we only need two steps: set a checkpoint directory on the SparkContext and call checkpoint on the RDD we want to protect.
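
A minimal sketch of those two steps, still based on the stuRDDMap example above (the HDFS path is only a placeholder, not a real cluster address):

// Step 1: tell Spark where on HDFS to keep the checkpoint data
sc.setCheckpointDir("hdfs://master:9000/spark/checkpoint")

// Step 2: mark the RDD that should be checkpointed
stuRDDMap.checkpoint()

// Any action that runs afterwards triggers the normal job, plus an extra job
// that recomputes stuRDDMap and writes it to the checkpoint directory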

However, this time "stuRDD" is printed twice, which comes down to how checkpoint works.

Principle of checkpoint


1. The job first runs from start to finish as usual, and the RDDs that need to be checkpointed are marked.
2. A new job is then launched that recomputes the marked RDDs and writes them to HDFS.
Cache, by contrast, writes the RDD into memory as soon as it is first computed.

Optimization

It is best to cache the RDD before calling checkpoint. That way the extra job only has to copy the data from memory to HDFS, and the recomputation can be skipped.
Once the checkpoint job finishes, the RDD's lineage (its dependencies) is cut off.
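
A sketch of this pattern (same placeholder path as before):

sc.setCheckpointDir("hdfs://master:9000/spark/checkpoint")

// Cache first, so the data stays in memory once the first job has computed it
stuRDDMap.cache()

// Then checkpoint: the extra job only has to copy the cached blocks to HDFS
// instead of recomputing the whole lineage
stuRDDMap.checkpoint()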

Therefore, cache is mainly used to improve efficiency and checkpoint is mainly used for fault tolerance

Thank you for reading. I am shuaihe, a senior majoring in big data. I wish you happiness.

Topics: Big Data Spark