Spark caching: collect, cache, and persist

Posted by Brian W on Tue, 30 Jun 2020 05:44:38 +0200


collect, cache, and persist all either gather an RDD's data or keep it in storage. This post notes what each one is for.


  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

The collect operation converts all the elements of an RDD into an Array. It is generally used for test output in local mode and is not recommended in cluster mode. As the source comment says, collect should only be used when the resulting Array is expected to be small, because all the data is loaded into the driver's memory. In a local test this has little impact, but in cluster mode, if the driver has been given too little memory, it can easily run into an OOM.
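As a minimal sketch of the point above, the following runs collect on a deliberately small RDD and contrasts it with take, which bounds how much data reaches the driver. The object name CollectSketch, the local[2] master, and the sample data are assumptions made for illustration, and the code needs spark-core on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectSketch {
  // Returns (all elements via collect, a bounded sample via take).
  def run(): (Array[Int], Array[Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("collect-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, numSlices = 4)
      // Acceptable here: 1000 Ints easily fit in the driver's memory.
      val all = rdd.collect()
      // Bounded alternative: only the first 5 elements are sent to the driver.
      val firstFive = rdd.take(5)
      (all, firstFive)
    } finally {
      sc.stop()
    }
  }
}
```

On a large RDD in cluster mode, take(n) or writing the result out to storage is usually the safer choice than collect.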



  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()

cache is simply the most basic form of persist. It can be thought of as one overload of persist, because the source code defines persist like this:

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

As you can see, cache just calls the no-argument persist. The typical use is to cache RDDs that are reused repeatedly and take up little space, a bit like the small tables that are broadcast in a map join. MEMORY_ONLY means the data is stored only in memory, so you need to consider the size of the RDD being cached.
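A small sketch of that relationship: after cache(), the RDD reports MEMORY_ONLY as its storage level, and later actions can reuse the stored partitions. CacheSketch and the sample data are hypothetical names chosen for this example. getStorageLevel is the RDD method that reports the current level.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Returns (level before cache, level after cache, total count, count of even squares).
  def run(): (StorageLevel, StorageLevel, Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-sketch")
    val sc = new SparkContext(conf)
    try {
      val squares = sc.parallelize(1 to 100).map(x => x * x)
      val before = squares.getStorageLevel // nothing assigned yet
      squares.cache()                      // same as persist(StorageLevel.MEMORY_ONLY)
      val after = squares.getStorageLevel
      // cache() is lazy: the first action computes and stores the partitions,
      // the second action can read them back instead of recomputing.
      val total = squares.count()
      val evens = squares.filter(_ % 2 == 0).count()
      (before, after, total, evens)
    } finally {
      sc.stop()
    }
  }
}
```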



  /**
   * Mark this RDD for persisting using the specified level.
   *
   * @param newLevel the target storage level
   * @param allowOverride whether to override any existing level with the new one
   */
  private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    // If this is the first time this RDD is marked for persisting, register it
    // with the SparkContext for cleanups and accounting. Do this only once.
    if (storageLevel == StorageLevel.NONE) {
      sc.cleaner.foreach(_.registerRDDForCleanup(this))
      sc.persistRDD(this)
    }
    storageLevel = newLevel
    this
  }

Compared with cache, persist offers a more flexible choice. The first parameter, newLevel, is the StorageLevel to use. The second, allowOverride, controls whether an RDD's existing storage level may be changed within a Spark job. It is rarely needed, and this two-argument form is private; user code calls the public persist(newLevel). So which storage levels are available?
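The check in the source above can be seen from user code. In this sketch, which assumes a local master and uses the hypothetical name PersistLevelSketch, asking for a different level on an already-persisted RDD is rejected, while calling unpersist first resets the level so a new one can be assigned.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistLevelSketch {
  // Returns (was the level change rejected?, level after unpersist + persist).
  def run(): (Boolean, StorageLevel) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-level-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 10)
      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      // A different level on an RDD that already has one throws
      // UnsupportedOperationException, as in the source shown above.
      val rejected = Try(rdd.persist(StorageLevel.DISK_ONLY))
        .failed.toOption.exists(_.isInstanceOf[UnsupportedOperationException])
      // unpersist resets the level to NONE, so a new level can be assigned.
      rdd.unpersist()
      rdd.persist(StorageLevel.DISK_ONLY)
      (rejected, rdd.getStorageLevel)
    } finally {
      sc.stop()
    }
  }
}
```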


The StorageLevel class

class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)
  extends Externalizable {
  // ... (class body omitted)
}

The class definition shows that the StorageLevel constructor takes five parameters:

useDisk: use disk. When the RDD is too large to fit in memory, its data can be written to the executors' local disks.

useMemory: use memory. This is the mode used by cache and the no-argument persist().

useOffHeap: use off-heap memory, that is, memory outside the JVM heap. I am not yet familiar with this part of the JVM and will dig into it later.

deserialized: whether the data is kept as deserialized objects. When space is tight, setting this to false stores the data in serialized form, which reduces memory consumption.

replication: the number of copies, 1 by default. If the cached data is large and recomputing it after a task failure is expensive, it can be set to 2 to improve fault tolerance. A common case is a large log-processing job, where a second copy guards against having to rerun the job after OOM, I/O, or other errors.
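The five settings can be read back from any StorageLevel through its public accessors, which makes it easy to check what a named level actually does. The sketch below assumes spark-core is on the classpath. StorageLevelFlags, flags, and custom are hypothetical names, while StorageLevel.apply is the factory that builds a level from the same five values.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelFlags {
  // (useDisk, useMemory, useOffHeap, deserialized, replication)
  def flags(level: StorageLevel): (Boolean, Boolean, Boolean, Boolean, Int) =
    (level.useDisk, level.useMemory, level.useOffHeap, level.deserialized, level.replication)

  // Disk + memory, on-heap, serialized, two copies:
  // the same settings as the predefined MEMORY_AND_DISK_SER_2.
  val custom: StorageLevel = StorageLevel(true, true, false, false, 2)
}
```

For example, StorageLevelFlags.flags(StorageLevel.MEMORY_ONLY) shows memory only, deserialized, one copy.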


The StorageLevel companion object

object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
  // ... (remaining members omitted)
}

From these five parameters, the companion object predefines a set of storage levels. The most commonly used is MEMORY_ONLY, which suits small data that fits in memory. Medium and large data suits MEMORY_AND_DISK_SER, and if the job is too expensive to restart after a failure, consider MEMORY_AND_DISK_SER_2. Serialization saves space, but it also adds CPU time for serializing and deserializing, so choose between MEMORY_AND_DISK_SER and MEMORY_AND_DISK according to the scenario.

Usage:
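A minimal usage sketch, assuming a local master and made-up sample data (the original example is not available here, and PersistUsageSketch is a hypothetical name): pass the chosen level to persist, run the actions that reuse the RDD, then release it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistUsageSketch {
  // Returns (number of records, length of the longest word).
  def run(): (Long, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-usage-sketch")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("spark", "cache", "persist", "spark"))
      // Keep serialized partitions in memory and spill to disk if they do not fit.
      val lengths = words.map(w => (w, w.length)).persist(StorageLevel.MEMORY_AND_DISK_SER)
      val total = lengths.count()            // first action materializes the data
      val longest = lengths.map(_._2).max()  // second action reuses it
      lengths.unpersist()                    // release the storage when done
      (total, longest)
    } finally {
      sc.stop()
    }
  }
}
```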



Usage scenarios in summary:

1. For local testing, collect is the usual choice.

2. When the RDD is small, use cache.

3. When the RDD is large and CPU is limited, use persist(MEMORY_AND_DISK); if a restart is expensive, use MEMORY_AND_DISK_2 instead.

4. When the RDD is large and CPU is sufficient, use persist(MEMORY_AND_DISK_SER); if a restart is expensive, use MEMORY_AND_DISK_SER_2 instead.
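Scenarios 2 to 4 in the list above can be written down as a small decision function. This is only a sketch of the article's rule of thumb: LevelChooser and its three Boolean parameters are hypothetical, and judging what counts as "large" data or "sufficient" CPU is left to the caller.

```scala
import org.apache.spark.storage.StorageLevel

object LevelChooser {
  // largeData:     the RDD may not fit comfortably in memory
  // cpuToSpare:    there is CPU headroom for serializing and deserializing
  // costlyRestart: recomputing after a failure would be expensive
  def choose(largeData: Boolean, cpuToSpare: Boolean, costlyRestart: Boolean): StorageLevel =
    (largeData, cpuToSpare, costlyRestart) match {
      case (false, _, _)        => StorageLevel.MEMORY_ONLY // what cache() uses
      case (true, false, false) => StorageLevel.MEMORY_AND_DISK
      case (true, false, true)  => StorageLevel.MEMORY_AND_DISK_2
      case (true, true, false)  => StorageLevel.MEMORY_AND_DISK_SER
      case (true, true, true)   => StorageLevel.MEMORY_AND_DISK_SER_2
    }
}
```

A caller would then write something like rdd.persist(LevelChooser.choose(largeData = true, cpuToSpare = true, costlyRestart = false)).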

These are the common scenarios, worth considering whenever an RDD needs to be reused. Memory-only modes are prone to OOM, while disk-backed modes add running time because of disk I/O, so both need to be weighed. Finally, remember to call unpersist to release the space once the RDD is no longer needed.
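To confirm that unpersist really releases an RDD, one option is to look at SparkContext.getPersistentRDDs, which lists the RDDs currently marked as persistent. The sketch below assumes a local master, and UnpersistSketch is a hypothetical name.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Returns (tracked while cached?, tracked after unpersist?, level after unpersist).
  def run(): (Boolean, Boolean, StorageLevel) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("unpersist-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100).cache()
      rdd.count() // materialize the cached partitions
      val trackedWhileCached = sc.getPersistentRDDs.contains(rdd.id)
      rdd.unpersist() // drop the stored blocks and reset the level
      val trackedAfterRelease = sc.getPersistentRDDs.contains(rdd.id)
      (trackedWhileCached, trackedAfterRelease, rdd.getStorageLevel)
    } finally {
      sc.stop()
    }
  }
}
```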
