SparkCore learning notes

1. RDD overview 1.1 What is an RDD? RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. In code it is an abstract class that represents a resilient, immutable, partitioned collection whose elements can be computed in parallel. 1.2 RDD features: (1) Resilience (elastici ...
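A minimal, self-contained sketch of these properties in local mode (illustrative only, not from the post): the RDD is partitioned, transformations return new RDDs rather than mutating old ones, and partitions are computed in parallel.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-basics"))

    // A partitioned collection: here 2 partitions, processed in parallel.
    val rdd = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)

    // Immutability: map returns a new RDD instead of modifying rdd.
    val doubled = rdd.map(_ * 2)

    println(doubled.getNumPartitions)        // 2
    println(doubled.collect().mkString(",")) // 2,4,6,8
    sc.stop()
  }
}
```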

Posted by lancet2003 on Tue, 04 Jan 2022 03:33:45 +0100

Spring Boot integrated with Spark: running and submitting Spark tasks on YARN (spark on yarn)

Preface: A previous project integrated Spark with Spring Boot, running in standalone mode; I wrote a blog about it: https://blog.csdn.net/qq_41587243/article/details/112918052?spm=1001.2014.3001.5501 The same scheme is used now, except that Spark jobs are submitted to the production YARN cluster, and Kerberos authentication is required, ...
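The post is truncated, but a common pattern for this setup is a keytab login before creating the session. A minimal sketch, assuming keytab-based Kerberos authentication — the principal, keytab path, and app name below are hypothetical placeholders:

```scala
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

object YarnKerberosSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical principal and keytab path; replace with your cluster's values.
    UserGroupInformation.loginUserFromKeytab(
      "etl_user@EXAMPLE.COM", "/etc/security/keytabs/etl_user.keytab")

    // master("yarn") requires HADOOP_CONF_DIR/YARN_CONF_DIR to be visible to the app.
    val spark = SparkSession.builder()
      .appName("springboot-spark-on-yarn")
      .master("yarn")
      .config("spark.submit.deployMode", "client")
      .getOrCreate()

    spark.range(10).count()
    spark.stop()
  }
}
```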

Posted by not_john on Mon, 03 Jan 2022 23:15:07 +0100

[Review] RDD action operators

3. RDD action operators. An action operator is a method that triggers Job execution. 1. reduce — function signature: def reduce(f: (T, T) => T): T — function description: aggregates all the elements in the RDD, first aggregating data within partitions and then aggregating across partitions. val rdd: RDD[Int] = sc.m ...
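A runnable sketch of reduce and the two-step aggregation described above (local mode; the data is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("reduce-demo"))

    // Two partitions: [1, 2] and [3, 4].
    val rdd = sc.makeRDD(List(1, 2, 3, 4), numSlices = 2)

    // reduce sums within each partition first (3 and 7), then across partitions (10).
    println(rdd.reduce(_ + _)) // 10
    sc.stop()
  }
}
```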

Posted by vapour_ on Mon, 03 Jan 2022 10:46:22 +0100

Using SparkLauncher to launch Spark jobs from code

Background: The project needs to process many files, some of which are several GB in size, so files of this kind are handled by a dedicated Spark program. For unified processing, the program therefore needs to launch Spark jobs from code to handle large files. Implementation scheme: After investigation, it was found ...
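A minimal SparkLauncher sketch of this scheme (the paths, class name, and arguments below are hypothetical placeholders):

```scala
import org.apache.spark.launcher.SparkLauncher

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setSparkHome("/opt/spark")               // hypothetical install path
      .setAppResource("/jobs/big-file-job.jar") // jar containing the Spark program
      .setMainClass("com.example.BigFileJob")   // hypothetical main class
      .setMaster("yarn")
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g")
      .addAppArgs("/data/input/huge.csv")       // hypothetical input file
      .startApplication()

    // Wait for a terminal state (FINISHED, FAILED, or KILLED).
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Final state: ${handle.getState}")
  }
}
```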

Posted by nitestryker on Mon, 03 Jan 2022 02:54:31 +0100

Big Data Hadoop: Spark on Hive and Hive on Spark

1. Differences between Spark on Hive and Hive on Spark. 1) Spark on Hive: Hive serves only as storage (the metastore), while Spark is responsible for SQL parsing, optimization, and execution. In other words, Spark SQL uses Hive statements to operate on Hive tables, with Spark RDDs running underneath. The steps are as follows: ...
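A minimal Spark-on-Hive sketch, assuming hive-site.xml is on the classpath (the table name is hypothetical): Spark SQL does the parsing, optimization, and execution, while Hive only supplies the metastore and storage.

```scala
import org.apache.spark.sql.SparkSession

object SparkOnHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-on-hive")
      .enableHiveSupport() // connect to the Hive metastore; Hive is storage only
      .getOrCreate()

    // Spark SQL parses, optimizes, and executes this; Hive just stores the table.
    spark.sql("SELECT COUNT(*) FROM default.some_table").show() // hypothetical table
    spark.stop()
  }
}
```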

Posted by joejoejoe on Mon, 03 Jan 2022 02:40:47 +0100

Spark Learning Notes - Creating RDDs

By default, Spark divides a job into multiple tasks and sends them to Executor nodes for parallel computation. The number of tasks that can be computed in parallel is called the parallelism, and it can be specified when building the RDD. However, the number of tasks a job is split into is not necessarily equal to the number executed in paral ...
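A small sketch of specifying parallelism at RDD creation time (local mode; the file path is a hypothetical placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreateRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("create-rdd"))

    // From a collection: ask for 4 partitions explicitly.
    val fromSeq = sc.parallelize(1 to 100, numSlices = 4)
    println(fromSeq.getNumPartitions) // 4

    // From a file: minPartitions is only a lower bound, so the actual partition
    // count (and hence task count) may differ from what you request.
    val fromFile = sc.textFile("data/input.txt", minPartitions = 3) // hypothetical path
    println(fromFile.getNumPartitions)

    sc.stop()
  }
}
```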

Posted by mona02 on Sun, 02 Jan 2022 09:42:25 +0100

Introduction to Spark - the Spark runtime environment

Reference link: https://www.bilibili.com/video/BV11A411L7CK?p=11 Spark runtime environment: As a data processing framework and computing engine, Spark is designed to run in all common cluster environments. The mainstream environment in production in China is YARN, but container environments are becoming more and more popular. Local mode: The s ...
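For the Local mode the excerpt introduces, a minimal sketch (illustrative, everything runs in one JVM):

```scala
import org.apache.spark.sql.SparkSession

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs everything in one JVM, one worker thread per logical core.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-mode-demo")
      .getOrCreate()

    println(spark.sparkContext.parallelize(1 to 10).sum()) // 55.0
    spark.stop()
  }
}
```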

Posted by FireyIce01 on Thu, 23 Dec 2021 14:01:46 +0100

Big data (8y) Spark 3.0 kernel

1. Operating mechanism of the Spark on YARN deployment mode: After the task is submitted, the Driver is started; the Driver registers the application with the cluster manager; the cluster manager allocates Executors according to the task's configuration and starts them; the Driver begins executing the main function, and when an action operator is ...
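The excerpt breaks off at the point where an action triggers a Job. A tiny local-mode sketch of that lazy-then-trigger behavior (illustrative, not from the post):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyActionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("lazy-demo"))

    // Transformations only build the lineage graph; nothing executes yet.
    val mapped = sc.parallelize(1 to 4).map(_ * 2)

    // The action submits a Job, which is then split into stages and tasks.
    println(mapped.collect().mkString(",")) // 2,4,6,8
    sc.stop()
  }
}
```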

Posted by Molarmite on Sat, 18 Dec 2021 03:54:26 +0100

spark io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s)

Error log: WARN TaskSetManager: Lost task 29.0 in stage 22.0 (TID 1851, wn108-cdlcns.bjduloineequ3adfbkrpgi4p2c.shax.internal.chinacloudapp.cn, executor 25): FetchFailed(BlockManagerId(2, wn107-cdlcns.bjduloineequ3adfbkrpgi4p2c.shax.internal.chinacloudapp.cn, 7447, None), shuffleId=12, mapId=4, reduceId=29, message= org.apach ...
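The post is cut off before its fix, but commonly suggested mitigations for netty direct-memory exhaustion during shuffle fetch look like the following; these settings and values are assumptions to tune, not the article's confirmed solution:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleOomMitigation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-oom-mitigation")
      // More off-heap headroom per executor (holds netty direct buffers).
      .config("spark.executor.memoryOverhead", "2g")
      // Smaller in-flight shuffle fetch buffers (default is 48m).
      .config("spark.reducer.maxSizeInFlight", "24m")
      // Let netty fall back to heap buffers instead of direct memory.
      .config("spark.shuffle.io.preferDirectBufs", "false")
      .getOrCreate()

    // ... job code ...
    spark.stop()
  }
}
```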

Posted by godster on Fri, 17 Dec 2021 02:33:23 +0100

Spark source code reading 04 - the Local run mode in Spark's runtime architecture

Local run mode - basic introduction: Spark's Local run mode is also called pseudo-distributed mode. It is called Local mode because all Spark processes in this mode run inside the JVM of a single local machine, without any resource manager. It mainly uses multiple threads on a single machine to simulate Spa ...
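A short sketch of the Local-mode master URLs and the thread-based simulation described above (illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalMasterUrls {
  def main(args: Array[String]): Unit = {
    // "local"    -> a single worker thread
    // "local[4]" -> four worker threads simulating four cores
    // "local[*]" -> one worker thread per logical core
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("local-urls"))
    println(sc.defaultParallelism) // 4 with local[4]
    sc.stop()
  }
}
```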

Posted by mhenke on Sun, 12 Dec 2021 23:20:27 +0100