3.2 RDD programming
In Spark, an RDD is represented as an object, and RDDs are transformed through method calls on that object. After defining an RDD through a series of transformations, you call an action to trigger its computation. An action either returns a result to the application (count, collect, etc.) or saves data to a storage system (saveAsTextFile, etc.). RDD computation is lazy: it runs only when an action is encountered, so multiple transformations (the computation logic) can be pipelined at run time. To use Spark, a developer writes a Driver program, which is submitted to the cluster to schedule work on the Workers. The Driver defines one or more RDDs and calls actions on them; the Workers perform the computation of the RDD partitions.
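A minimal sketch of this flow (the object name and output path are illustrative, not taken from the text above): the map transformation only records the computation in the Driver, and work is shipped to the Workers only when the count and saveAsTextFile actions are called.

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyEvalSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val nums = sc.makeRDD(1 to 10)          // define an RDD in the Driver
    val squared = nums.map(n => n * n)      // transformation: only records the lineage, nothing runs yet
    println(squared.count())                // action: triggers the pipelined computation and returns a result
    squared.saveAsTextFile("data/squares")  // action: saves the data to the storage system
    sc.stop()
  }
}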
3.2.1 ways to create an RDD
- Convert a local collection into an RDD
- Load external data
- Transform an existing RDD into a new RDD (see 3.2.1.3)
3.2.1.1 creating an RDD from a collection
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author: Hang.Z
 * Description:
 *  - Once the RDD is created, the distributed dataset can be operated on in parallel
 *  - Another important parameter of parallelize and makeRDD is the number of partitions the dataset is divided into
 *  - Spark runs one task for each partition; normally Spark sets the number of partitions automatically based on your cluster
 */
object CreateRDD {
  def main(args: Array[String]): Unit = {
    // Create the Spark environment
    val conf = new SparkConf()
    conf.setAppName(CreateRDD.getClass.getSimpleName).setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Create an RDD with makeRDD
    val rdd1: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7))
    // parallelize creates an RDD as well
    val rdd2: RDD[String] = sc.parallelize(Seq[String]("scala", "java", "c++", "SQL"))
    // Collect the data and print it locally
    rdd1.collect().foreach(println)
    // Release the resources
    sc.stop()
  }
}
Examples
package com._51doit.spark.day02

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author: Hang.Z
 * Date: 21/06/02
 * Description:
 *   Generate an RDD from a local collection.
 *   1. The default number of partitions is the number of available cores, i.e. the N in local[N] (here local[6] -> 6).
 */
object _02MakeRDD {
  // Set the log level
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Environment; this.getClass.getSimpleName is the class name
    val conf: SparkConf = new SparkConf()
      .setMaster("local[6]") // this machine has 16 cores; use 6 of them
      .setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val arr = Array(1, 2, 3, 4)
    val mp = Map[String, Int](("wb", 23), ("duanlang", 34))
    // Create RDDs without specifying the number of partitions
    val rdd1: RDD[Int] = sc.parallelize(arr)
    val rdd2: RDD[(String, Int)] = sc.parallelize(mp.toList)
    val size1: Int = rdd1.partitions.size
    val size2: Int = rdd2.partitions.size
    // The default number of partitions is the available cores: 6 with local[6]
    // (it would be 16 with local[*] on this 16-core machine)
    println(size1) // 6 partitions --> 6 tasks
    println(size2) // 6 partitions
    sc.stop()
  }
}
package com._51doit.spark.day02

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author: Hang.Z
 * Date: 21/06/02
 * Description:
 *   Generate an RDD from a local collection.
 *   1. The default number of partitions is the number of available cores, i.e. the N in local[N] (here local[6] -> 6).
 *   2. Controlling the number of partitions:
 *      1) local[N], with N <= the number of cores
 *      2) parallelize(seq, num), with num > 0; ideally num should not exceed the number of cores
 *         [One core handles one task: with 6 cores and 7 partitions, 7 tasks are generated and one task has to wait;
 *          with only 4 partitions, only 4 tasks are generated and 2 cores sit idle, wasting resources.
 *          Concurrency is 7 while parallelism is 6.]
 */
object _03MakeRDD {
  // Set the log level
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Environment; this.getClass.getSimpleName is the class name
    val conf: SparkConf = new SparkConf()
      .setMaster("local[6]") // this machine has 16 cores; use 6 of them
      .setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val arr = Array(1, 2, 3, 4)
    val mp = Map[String, Int](("wb", 23), ("duanlang", 34))
    // Create an RDD with 4 partitions
    val rdd1: RDD[Int] = sc.parallelize(arr, 4)
    // Create an RDD with 7 partitions
    val rdd2: RDD[(String, Int)] = sc.parallelize(mp.toList, 7)
    val size1: Int = rdd1.partitions.size
    val size2: Int = rdd2.partitions.size
    println(size1) // 4 partitions --> 4 tasks
    println(size2) // 7 partitions --> 7 tasks
    sc.stop()
  }
}
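To make the effect of the explicit partition counts visible, an optional sketch (placed before sc.stop() in the example above) can tag every element with the index of the partition it lands in via mapPartitionsWithIndex; since rdd2 has only 2 elements, 5 of its 7 partitions simply stay empty.

// Show which partition each element of rdd1 and rdd2 ends up in
rdd1.mapPartitionsWithIndex((idx, it) => it.map(e => s"rdd1 partition $idx -> $e")).collect().foreach(println)
rdd2.mapPartitionsWithIndex((idx, it) => it.map(e => s"rdd2 partition $idx -> $e")).collect().foreach(println)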
Summary:
An RDD created from a local collection has partitions by default:
1. The default number of partitions is the number of available cores.
2. The number of available cores can be changed through the master setting, e.g. local[*] or local[N].
3. The number of partitions can also be specified explicitly when the RDD is created.
Note: by default, the number of partitions created corresponds to the number of available cores of the node.
// Set the log level
Logger.getLogger("org").setLevel(Level.ERROR)

def main(args: Array[String]): Unit = {
  // Environment; this.getClass.getSimpleName is the class name
  val conf: SparkConf = new SparkConf()
    .setMaster("local[6]") // this machine has 16 cores; use 6 of them
    .setAppName(this.getClass.getSimpleName)
  val sc = new SparkContext(conf)
  val arr = Array(1, 2, 3, 4)
  val mp = Map[String, Int](("wb", 23), ("duanlang", 34))
  // Specify 4 partitions when creating the RDD
  val rdd1: RDD[Int] = sc.makeRDD(arr, 4)
  println(rdd1.partitions.size) // 4
  sc.stop()
}
3.2.1.2 reading external files
The source can be a local file system, HDFS, Cassandra, HBase, Amazon S3, etc.
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
object CreatRDD_File {
  // Set the console log level
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val sc: SparkContext = SparkUtil.getSc
    // 1. Read a file under this project (relative path) and create an RDD
    val rdd1: RDD[String] = sc.textFile("spark-core/data/a.txt")
    // 2. Read a file on the local disk (absolute path) and create an RDD
    val rdd2: RDD[String] = sc.textFile("D://word.txt")
    // 3. Read a file in the HDFS distributed file system and create an RDD
    val rdd3: RDD[String] = sc.textFile("hdfs://doit01:8020/word.txt")
    // 4. Load the files in a directory that match a wildcard pattern and create an RDD
    val rdd4: RDD[String] = sc.textFile("spark-core/data/a*.txt")
    // 5. Load whole files as (fileName, content) tuples and create an RDD
    val rdd5: RDD[(String, String)] = sc.wholeTextFiles("spark-core/data/")
    // Collect the data and print it
    rdd5.collect().foreach(println)
    sc.stop()
  }
}
1. The URL can be a file on the local file system, hdfs://..., s3n://..., and so on.
2. If a local file system path is used, the path must exist on every worker node.
3. All of the file-based methods support directories, compressed files, and wildcards (*). For example:
textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
4. textFile also takes an optional second parameter for the number of partitions. By default, each block corresponds to one partition (for HDFS, the default block size is 128 MB). You can request more partitions than blocks, but not fewer.
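As a short sketch of point 4 (reusing the sc and the HDFS path from the example above; the partition count 8 is arbitrary):

// Default: a small single-block file gets min(defaultParallelism, 2) partitions
val defaultParts: RDD[String] = sc.textFile("hdfs://doit01:8020/word.txt")
// Request at least 8 partitions; asking for more partitions than blocks is allowed, fewer is not
val eightParts: RDD[String] = sc.textFile("hdfs://doit01:8020/word.txt", 8)
println(defaultParts.partitions.size)
println(eightParts.partitions.size) // at least 8: minPartitions is a lower bound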
3.2.1.3 creating an RDD from an existing RDD
val conf = new SparkConf()
conf.setAppName("wc").setMaster("local")
val sc = new SparkContext(conf)
// Create the initial RDD from a file
val rdd1: RDD[String] = sc.textFile("d://word.txt")
// Each transformation returns a new RDD
val rdd2: RDD[String] = rdd1.flatMap(_.split("\\s+"))
val rdd3: RDD[(String, Iterable[String])] = rdd2.groupBy(word => word)
val rdd4: RDD[(String, Int)] = rdd3.map(tp => {
  (tp._1, tp._2.size)
})
sc.stop()
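As written, the snippet only chains transformations; following the lazy-evaluation rule from 3.2, nothing actually runs until an action is added, for example before sc.stop():

// Action: triggers the whole textFile -> flatMap -> groupBy -> map pipeline and prints the word counts
rdd4.collect().foreach(println)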
3.2.2 RDD partition
An RDD has partitions as soon as it is created; partitions are the basis of an RDD's parallel computation. In general, each partition is wrapped into one Task. If there are enough resources, the number of RDD partitions equals the parallelism of the RDD computation. If there are many partitions but not enough cores, the number of partitions is greater than the degree of parallelism.
3.2.2.1 number of partitions
1. Number of partitions of an RDD created from a collection
def main(args: Array[String]): Unit = {
  val sc: SparkContext = SparkUtil.getSc
  // Create an RDD
  val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6))
  // Manually set the number of partitions
  // val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 6)
  /**
   * Check the number of partitions of the RDD. If no partition count is specified, the RDD uses the default:
   * in local mode this is the number of available cores of the current environment
   *   1) setMaster("local[*]")  all local cores
   *   2) setMaster("local[8]")  8 cores allocated
   *   3) conf.set("spark.default.parallelism", "4")  set through a parameter
   */
  println(rdd.partitions.size)
  sc.stop()
}
2. Number of partitions of an RDD created by loading a file
def main(args: Array[String]): Unit = {
  val sc: SparkContext = SparkUtil.getSc
  // 1. Read a file under this project and create an RDD
  val rdd1: RDD[String] = sc.textFile("spark-core/data/a.txt")
  // 2. Read the files in an HDFS directory (e.g. files of 1K, 70M and 212M --> 4 partitions) and create an RDD
  val rdd3: RDD[String] = sc.textFile("hdfs://doit01:8020/wc/input/")
  /**
   * An RDD created from files has at least 2 partitions.
   * splitSize computation logic:
   *   -- FileInputFormat.getSplits(JobConf job, int numSplits)
   *   -- val goalSize: Long = totalSize / (if (numSplits == 0) 1 else numSplits).toLong
   *   -- this.computeSplitSize(goalSize, minSize, blockSize)
   *   -- protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
   *        return Math.max(minSize, Math.min(goalSize, blockSize));
   *      }
   */
  println(rdd3.partitions.size)
  sc.stop()
}
3.2.2.2 how data is divided among partitions
Collection RDD
def main(args: Array[String]): Unit = {
  val sc: SparkContext = SparkUtil.getSc
  val ls = List(1, 2, 3, 4, 5)
  // Create an RDD and specify 3 partitions
  val rdd: RDD[Int] = sc.makeRDD(ls, 3)
  // Generates 3 result files: [1] [2,3] [4,5]
  rdd.saveAsTextFile("data/output1")
  sc.stop()
}

--------------------------------- Source code analysis ---------------------------------
1) val rdd: RDD[Int] = sc.makeRDD(ls, 3)

2) makeRDD(ls, 3) -->
   def makeRDD[T: ClassTag](
       seq: Seq[T],                         // the data
       numSlices: Int = defaultParallelism  // 3
     ): RDD[T] = withScope {
     parallelize(seq, numSlices)            // (data, 3)
   }

3) parallelize(seq, numSlices) -->
   def parallelize[T: ClassTag](
       seq: Seq[T],
       numSlices: Int = defaultParallelism): RDD[T] = withScope {
     assertNotStopped()
     new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]()) // note the seq and numSlices parameters here
   }

4) new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]()) -->
   override def getPartitions: Array[Partition] = {
     val slices = ParallelCollectionRDD.slice(data, numSlices).toArray // look here
     slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
   }

5) slice(data, numSlices) -->
   case _ =>
     val array = seq.toArray // To prevent O(n^2) operations for List etc.
     positions(array.length, numSlices).map { case (start, end) => // see the positions method
       array.slice(start, end).toSeq
     }.toSeq

6) positions(array.length, numSlices) -->
   // The logic that assigns the data to the partitions
   def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
     (0 until numSlices).iterator.map { i =>  // i in [0, 3), length = seq.length = 5
       val start = ((i * length) / numSlices).toInt
       val end = (((i + 1) * length) / numSlices).toInt
       (start, end)
     }
   }
   // Produces the ranges (0,1), (1,3), (3,5)

7) array.slice -->
   override def slice(from: scala.Int, until: scala.Int): scala.Array[T] = { /* compiled code */ }
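The same start/end arithmetic can be reproduced outside Spark to see why List(1, 2, 3, 4, 5) split into 3 slices gives [1], [2,3], [4,5]; this is a standalone sketch of the positions formula above, not a call into Spark's code.

object SliceSketch {
  def main(args: Array[String]): Unit = {
    val data = Array(1, 2, 3, 4, 5)
    val numSlices = 3
    val slices = (0 until numSlices).map { i =>
      // Same formula as positions(length, numSlices)
      val start = ((i * data.length.toLong) / numSlices).toInt
      val end = (((i + 1) * data.length.toLong) / numSlices).toInt
      data.slice(start, end).toList
    }
    slices.foreach(println) // List(1), List(2, 3), List(4, 5)
  }
}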
File RDD
When file data is read, the data is split into partitions according to Hadoop's file-splitting rules, and the splitting rules are different from the rules used to read the data.
When the file is loaded, the task splits are computed; the number of task splits is the number of partitions (determined by the computed splitSize).
class HadoopRDD[K, V](
    sc: SparkContext,
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
// defaultParallelism defaults to all cores of the current machine, so minPartitions = 2

val rdd: RDD[String] = sc.textFile("D:\\spark_data")

1) textFile returns a hadoopFile; the task-split code is Hadoop's split logic:
   def textFile(
       path: String,
       minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
     assertNotStopped()
     hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
       minPartitions).map(pair => pair._2.toString).setName(path)
   }

2) public class TextInputFormat extends FileInputFormat<LongWritable, Text>

3) FileInputFormat computes the splits:
   public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
   // numSplits = 2, the minPartitions passed in by the caller (textFile)

4) long goalSize = totalSize / (long)(numSplits == 0 ? 1 : numSplits);
   // goalSize = (273 + 5) / 2 = 139

5) long splitSize = this.computeSplitSize(goalSize, minSize, blockSize);

6) protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
     return Math.max(minSize, Math.min(goalSize, blockSize));
   }
   long minSize = Math.max(job.getLong("mapreduce.input.fileinputformat.split.minsize", 1L), this.minSplitSize);

7) Math.max(1, Math.min(139, 128M)) --> splitSize = 139 B

8) Compute the number of task splits:
   for (bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
     splitHosts = this.getSplitHostsAndCachedHosts(blkLocations, length - bytesRemaining, splitSize, clusterMap);
     splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, splitHosts[0], splitHosts[1]));
   }

   The 5 B file --> 1 split
   The 273 B file --> 2 splits
   1 + 2 = 3 task splits (partitions) in total
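To double-check the 1 + 2 result, the goalSize / splitSize / 1.1 rule above can be mirrored in a small standalone sketch (the two file sizes, 5 B and 273 B, come from the walkthrough; the rest follows the formulas shown, not Hadoop's actual classes).

object SplitCountSketch {
  def main(args: Array[String]): Unit = {
    val fileSizes = Seq(5L, 273L)            // the two files under D:\spark_data
    val numSplits = 2                        // minPartitions
    val blockSize = 128L * 1024 * 1024       // 128 MB
    val minSize = 1L
    val goalSize = fileSizes.sum / numSplits // (5 + 273) / 2 = 139
    val splitSize = math.max(minSize, math.min(goalSize, blockSize)) // 139
    val splitsPerFile = fileSizes.map { length =>
      var bytesRemaining = length
      var count = 0
      while (bytesRemaining.toDouble / splitSize > 1.1) { count += 1; bytesRemaining -= splitSize }
      if (bytesRemaining > 0) count += 1     // the final, possibly smaller, split
      count
    }
    println(splitsPerFile)     // List(1, 2)
    println(splitsPerFile.sum) // 3 task splits --> 3 partitions
  }
}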