Spark, the way of God - a detailed explanation of RDD creation

Posted by Matty999555 on Tue, 01 Feb 2022 16:47:04 +0100

3.2 RDD programming

In Spark, an RDD is represented as an object, and RDDs are transformed through method calls on that object. After defining an RDD through a series of transformations, you can call an action to trigger its computation. An action can return a result to the application (count, collect, etc.) or save data to a storage system (saveAsTextFile, etc.). RDD computation in Spark is lazy: it is executed only when an action is encountered, so that multiple transformations (the computation logic) can be pipelined at run time. To use Spark, a developer writes a Driver program, which is submitted to the cluster and schedules work on the Workers. The Driver defines one or more RDDs and calls actions on them; the Workers execute the computation tasks for the RDD partitions.
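
A minimal sketch of this lazy behaviour (assuming an already created SparkContext named sc): the map below only records lineage, and nothing is computed until the collect action is called.

val nums = sc.parallelize(1 to 4)        // create an RDD from a local collection
val doubled = nums.map(_ * 2)            // transformation: only the lineage is recorded, no job runs yet
println(doubled.toDebugString)           // prints the lineage of the RDD
val result = doubled.collect()           // action: this is where the job is actually executed
result.foreach(println)                  // 2 4 6 8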

3.2.1 creation method of RDD

  1. Convert a local collection into an RDD (parallelize / makeRDD)

  2. Load external data

3.2.1.1 creating an RDD from a collection

 

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author:   Hang.Z
 * Description:
 *  - Once an RDD is created, the distributed data set can be operated on in parallel
 *  - An important parameter of parallelize and makeRDD is the number of partitions the dataset is divided into
 *  - Spark runs one task per partition; normally Spark sets the number of partitions automatically based on your cluster
 */
object CreateRDD {
  def main(args: Array[String]): Unit = {
    // Create the Spark environment
    val conf = new SparkConf()
    conf.setAppName(CreateRDD.getClass.getSimpleName).setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Create an RDD with makeRDD
    val rdd1: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7))
    // parallelize also creates an RDD
    val rdd2: RDD[String] = sc.parallelize(Seq[String]("scala", "java", "c++", "SQL"))
    // Collect the data to the driver and print it
    rdd1.collect().foreach(println)
    // Release resources
    sc.stop()
  }
}

Examples

package com._51doit.spark.day02

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author:   Hang.Z
 * Date:     21/06/02
 * Description:
 * Generate an RDD from a local collection
 *  1 The default number of partitions of the RDD is the number of available cores, i.e. the N in local[N]
 *  2 Controlling the number of partitions:
 *     1) local[N], where N <= the number of cores
 *     2) parallelize(seq, num), num > 0; preferably num should not be greater than the number of cores
 *     [One core handles one Task: with 6 cores and 7 partitions, 7 tasks are generated and one Task has to wait;
 *      with only 4 partitions, 4 tasks are generated and resources are wasted.
 *      Concurrency is 7 while parallelism is 6.]
 */
object _03MakeRDD {
  // Set the log level
  Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit = {
    // Build the environment; this.getClass.getSimpleName is used as the application name
    val conf: SparkConf = new SparkConf()
      .setMaster("local[6]")  // use 6 of this machine's cores
      .setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val arr = Array(1,2,3,4)
    val mp = Map[String,Int](("wb",23),("duanlang",34))
    // Create an RDD with 4 partitions
    val rdd1: RDD[Int] = sc.parallelize(arr, 4)
    // Create an RDD with 7 partitions
    val rdd2: RDD[(String, Int)] = sc.parallelize(mp.toList, 7)
    val size1: Int = rdd1.partitions.size
    val size2: Int = rdd2.partitions.size

    // The explicitly requested number of partitions overrides the default (the number of available cores)
    println(size1) // 4 partitions --> 4 tasks
    println(size2) // 7 partitions --> 7 tasks, but only 6 run at a time with local[6]

    sc.stop()

  }

}
package com._51doit.spark.day02

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author:   Hang.Z
 * Date:     21/06/02
 * Description:
 * Generate an RDD from a local collection
 *  1 The default number of partitions of the RDD is the number of available cores, i.e. the N in local[N]
 *
 */
object _02MakeRDD {
  // Set the log level
  Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit = {
    // Build the environment; this.getClass.getSimpleName is used as the application name
    val conf: SparkConf = new SparkConf()
      .setMaster("local[6]")  // allocate 6 cores; local[*] would use all 16 cores of the author's machine
      .setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val arr = Array(1,2,3,4)
    val mp = Map[String,Int](("wb",23),("duanlang",34))
    // Create an RDD without specifying the number of partitions
    val rdd1: RDD[Int] = sc.parallelize(arr)
    // Create an RDD without specifying the number of partitions
    val rdd2: RDD[(String, Int)] = sc.parallelize(mp.toList)
    val size1: Int = rdd1.partitions.size
    val size2: Int = rdd2.partitions.size

    /**
     * The default number of partitions for an RDD built from a local collection is the number of available cores:
     * 6 here because of local[6]; with local[*] it would be all 16 cores of the author's machine
     */
    println(size1)// 6 partitions --> 6 tasks
    println(size2)// 6 partitions

    sc.stop()

  }

}

Summary:

An RDD created from a local collection always has partitions.

1. By default, the number of partitions is the number of available cores.

2. The number of available cores can be changed through the master setting, e.g. local[*] or local[N].

3. The number of partitions can also be specified explicitly when the RDD is created, as in the snippet below.

Note: the default number of partitions corresponds to the number of cores available on the node.

// Set the log level
  Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit = {
    // Build the environment; this.getClass.getSimpleName is used as the application name
    val conf: SparkConf = new SparkConf()
      .setMaster("local[6]")  // allocate 6 cores
      .setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val arr = Array(1,2,3,4)
    val mp = Map[String,Int](("wb",23),("duanlang",34))
    // Explicitly request 4 partitions, overriding the default
    val rdd1: RDD[Int] = sc.makeRDD(arr, 4)
    println(rdd1.partitions.size) // 4
    sc.stop()
  }

3.2.1.2 reading external files

The source can be the local file system, HDFS, Cassandra, HBase, Amazon S3, etc.

Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CreatRDD_File {
  // Set the console log level
  Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit = {
    // SparkUtil.getSc is the author's helper that returns a SparkContext
    val sc: SparkContext = SparkUtil.getSc
    // 1 Read a file under this project (relative path) and create an RDD
    val rdd1: RDD[String] = sc.textFile("spark-core/data/a.txt")
    // 2 Read a file on the local disk via an absolute path and create an RDD
    val rdd2: RDD[String] = sc.textFile("D://word.txt")
    // 3 Read a file in the HDFS distributed file system and create an RDD
    val rdd3: RDD[String] = sc.textFile("hdfs://doit01:8020/word.txt")
    // 4 Load the files in a directory that match a wildcard pattern and create an RDD
    val rdd4: RDD[String] = sc.textFile("spark-core/data/a*.txt")
    // 5 Load whole files so that the RDD contains (fileName, fileContent) tuples
    val rdd5: RDD[(String, String)] = sc.wholeTextFiles("spark-core/data/")
    // Collect the data and print it
    rdd5.collect().foreach(println)
    sc.stop()
  }
}

1. The URL can point to the local file system, hdfs://..., s3n://..., etc.

2. If a local file system path is used, the file must exist at that path on every worker node.

3. All file-based methods support directories, compressed files, and wildcards (*). For example:

textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

4. textFile also takes an optional second parameter, the minimum number of partitions. By default each block of the file corresponds to one partition (for HDFS the default block size is 128 MB). You can request more partitions than there are blocks, but not fewer. A short sketch follows.
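
A minimal sketch of the second parameter (the wildcard path and the partition count 4 are only illustrative, and sc is an existing SparkContext):

// Load all matching .txt files and ask for at least 4 partitions
val lines = sc.textFile("spark-core/data/*.txt", 4)
println(lines.partitions.size)   // at least 4, depending on the number and size of the input blocks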

3.2.1.3 conversion from other RDDs

 

val conf = new SparkConf()
conf.setAppName("wc").setMaster("local")
val sc = new SparkContext(conf)
// Create an RDD from a file
val rdd1: RDD[String] = sc.textFile("d://word.txt")
// Each transformation returns a new RDD
val rdd2: RDD[String] = rdd1.flatMap(_.split("\\s+"))
// Each transformation returns a new RDD
val rdd3: RDD[(String, Iterable[String])] = rdd2.groupBy(word => word)
// Each transformation returns a new RDD
val rdd4: RDD[(String, Int)] = rdd3.map(tp => {
  (tp._1, tp._2.size)
})
sc.stop()

3.2.2 RDD partition

An RDD has partitions from the moment it is created, and these partitions are the basis of parallel computation on the RDD. In general, each partition is encapsulated into one Task, so if there are enough resources the number of partitions equals the parallelism of the RDD computation. If there are many partitions but not enough cores, the number of partitions is greater than the degree of parallelism.
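
A minimal sketch of this distinction (the object name, the local[2] master and the 4-partition RDD are only illustrative): with 2 cores and 4 partitions, 4 tasks are created but at most 2 of them run at the same time.

import org.apache.spark.{SparkConf, SparkContext}

object PartitionsVsParallelism {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitions-vs-parallelism").setMaster("local[2]") // 2 cores
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(1 to 100, 4)   // 4 partitions --> 4 tasks
    println(rdd.partitions.size)        // 4: one task per partition
    println(sc.defaultParallelism)      // 2: at most 2 tasks run concurrently
    sc.stop()
  }
}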

3.2.2.1 number of partitions

1. The number of partitions of an RDD created from a collection

def main(args: Array[String]): Unit = {
    val sc: SparkContext = SparkUtil.getSc
    // Create an RDD
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6))
    // Manually set the number of partitions
    // val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 6)
    // Check the number of partitions of the RDD; if none is specified, the default is used
    /**
     * 1 In local mode the default number of partitions is the number of available cores in the current environment
     *   1) setMaster("local[*]")  all local cores
     *   2) setMaster("local[8]")  8 cores allocated
     *   3) conf.set("spark.default.parallelism", "4")  set explicitly via this parameter
     */
    println(rdd.partitions.size)
    sc.stop()
  }

2. The number of partitions of an RDD created by loading files

def main(args: Array[String]): Unit = {
    val sc: SparkContext = SparkUtil.getSc
    // 1 Read a file under this project and create an RDD
    val rdd1: RDD[String] = sc.textFile("spark-core/data/a.txt")
    // 2 Read the files in an HDFS directory and create an RDD (e.g. files of 1K, 70M and 212M give 4 partitions)
    val rdd3: RDD[String] = sc.textFile("hdfs://doit01:8020/wc/input/")
    /**
     * An RDD created from files has at least 2 partitions (minPartitions defaults to 2)
     * The split size is computed with the FileInputFormat logic:
     * -- FileInputFormat.getSplits(JobConf job, int numSplits)
     * -- val goalSize: Long = totalSize / (if (numSplits == 0) 1 else numSplits).toLong
     * -- this.computeSplitSize(goalSize, minSize, blockSize)
     * -- protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
     *      return Math.max(minSize, Math.min(goalSize, blockSize));
     *    }
     */
    println(rdd3.partitions.size)
    sc.stop()
  }

3.2.2.2 partition data division

Collection RDD

def main(args: Array[String]): Unit = {
    val sc: SparkContext = SparkUtil.getSc
    val ls = List(1,2,3,4,5)
    //Create RDD and specify partition as 3
    val rdd: RDD[Int] = sc.makeRDD(ls, 3)
    //Generate 3 result files [1] [2,3] [4,5]
    rdd.saveAsTextFile("data/output1")
    sc.stop()
  }
---------------------------------------Source code analysis--------------------------
1) val rdd: RDD[Int] = sc.makeRDD(ls, 3)
2) makeRDD(ls, 3)--> 
     def makeRDD[T: ClassTag](
      seq: Seq[T], // data
      numSlices: Int = defaultParallelism): RDD[T] = withScope {  //3
    parallelize(seq, numSlices) // (data,3)
  } 
3) parallelize(seq,numSlices) -->
         def parallelize[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    assertNotStopped()
    new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]()) // Look at the two parameters here
  }
4)  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())-->
       override def getPartitions: Array[Partition] = {
    val slices = ParallelCollectionRDD.slice(data, numSlices).toArray   // Look here 
    slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
  }
5)slice(data, numSlices)---> 
      case _ =>   // Look here
        val array = seq.toArray // To prevent O(n^2) operations for List etc
        positions(array.length, numSlices).map { case (start, end) =>  // View method
            array.slice(start, end).toSeq
        }.toSeq
6)  positions(array.length, numSlices) --->
      // Core data-allocation logic, here with length = seq.length = 5 and numSlices = 3
      def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
        (0 until numSlices).iterator.map { i =>          // i in [0, 3)
          val start = ((i * length) / numSlices).toInt
          val end = (((i + 1) * length) / numSlices).toInt
          (start, end)
        }
      }
   The (start, end) ranges produced for length = 5, numSlices = 3:
(0,1)
(1,3)
(3,5)
7)  array.slice --->
override def slice(from : scala.Int, until : scala.Int) : scala.Array[T] = { /* compiled code */ }
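
As a check on this logic, the small sketch below re-implements positions as a standalone helper (a copy for illustration, not Spark's private method) and reproduces the three ranges above for a 5-element collection split into 3 partitions:

object PositionsDemo {
  // Standalone copy of the slicing logic used by ParallelCollectionRDD
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Array(1, 2, 3, 4, 5)
    positions(data.length, 3).foreach { case (start, end) =>
      println(s"($start,$end) -> ${data.slice(start, end).mkString("[", ",", "]")}")
    }
    // (0,1) -> [1]
    // (1,3) -> [2,3]
    // (3,5) -> [4,5]
  }
}
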
  • File RDD

When reading file data, the data is sliced and partitioned according to Hadoop's file-splitting rules; the slicing rules are different from the rules used to actually read the data.

When a file is loaded, the task splits are computed first; the number of task splits determines the number of partitions (based on the split size).

class HadoopRDD[K, V](
    sc: SparkContext,
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)

  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
defaultParallelism defaults to the number of cores of the current machine, so here:
******************************
minPartitions = 2
******************************
val rdd: RDD[String] = sc.textFile("D:\\spark_data")
1  textFile returns a hadoopFile; the task-split computation is Hadoop's split logic
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }
2 TextInputFormat   
  public class TextInputFormat extends FileInputFormat<LongWritable, Text> 
3 FileInputFormat
4  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException { //  numSplits = 2
    (numSplits = 2 here is the minPartitions value passed in by the caller of getSplits)
5  long goalSize = totalSize / (long)(numSplits == 0 ? 1 : numSplits); 
    goalSize = (273 + 5) / 2 = 139 (truncated to a long)
6     long splitSize = this.computeSplitSize(goalSize, minSize, blockSize);    
7
      protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
    
     long minSize = Math.max(job.getLong("mapreduce.input.fileinputformat.split.minsize", 1L), this.minSplitSize);

8  splitSize = Math.max(1, Math.min(139B, 128M))
    splitSize = 139B
9  Calculate the number of task slices 
    for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
                        splitHosts = this.getSplitHostsAndCachedHosts(blkLocations, length - bytesRemaining, splitSize, clusterMap);
                        splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, splitHosts[0], splitHosts[1]));
                    }    
    
 5B file   ------------>  1 split
 273B file ------------>  2 splits  (one 139B split, then the remaining 134B becomes a second split)
 1 + 2 = 3 task splits in total, i.e. 3 partitions
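
A minimal sketch of this arithmetic (the 5B and 273B file sizes, the 128MB block size and minPartitions = 2 are taken from the walkthrough above; the helpers below only mirror FileInputFormat's logic for illustration, they are not Spark's code):

object SplitSizeDemo {
  // Mirrors FileInputFormat.computeSplitSize: max(minSize, min(goalSize, blockSize))
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  // Mirrors the split loop: cut off splitSize chunks while the rest is more than 10% of a split,
  // then the remainder becomes the last split
  def numSplits(fileLength: Long, splitSize: Long): Int = {
    var bytesRemaining = fileLength
    var splits = 0
    while (bytesRemaining.toDouble / splitSize > 1.1) {
      splits += 1
      bytesRemaining -= splitSize
    }
    if (bytesRemaining > 0) splits += 1
    splits
  }

  def main(args: Array[String]): Unit = {
    val fileSizes = Seq(5L, 273L)                // two input files: 5B and 273B
    val totalSize = fileSizes.sum                // 278
    val goalSize  = totalSize / 2                // numSplits (minPartitions) = 2  --> 139
    val splitSize = computeSplitSize(goalSize, minSize = 1L, blockSize = 128L * 1024 * 1024) // 139
    val splits    = fileSizes.map(numSplits(_, splitSize))
    println(s"splitSize = $splitSize B")                                          // 139 B
    println(splits.mkString(" + ") + " = " + splits.sum + " splits (partitions)") // 1 + 2 = 3
  }
}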
