SparkCore learning notes
I. RDD overview
1.1 What is RDD
RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. In code it is an abstract class that represents a resilient, immutable, partitioned collection whose elements can be computed in parallel.
1.2 RDD features
(1) Resilience (elasticity) ...
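As a minimal sketch of these ideas (not from the original notes; the data, the partition count, and the local master are assumptions), the snippet below builds a partitioned RDD and computes on it in parallel:

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative placeholders
    val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-basics")
    val sc = new SparkContext(conf)

    // An RDD is an immutable, partitioned collection; here it has 2 partitions
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

    // map builds a new RDD; the partitions are processed in parallel when collect runs
    val doubled = rdd.map(_ * 2)
    println(doubled.collect().mkString(","))   // 2,4,6,8

    sc.stop()
  }
}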
Posted by lancet2003 on Tue, 04 Jan 2022 03:33:45 +0100
Spring Boot integrates Spark and submits Spark tasks to YARN (spark on yarn)
preface
The previous project integrated Spark with Spring Boot and ran in standalone mode. I wrote a blog about it before, link:
https://blog.csdn.net/qq_41587243/article/details/112918052?spm=1001.2014.3001.5501
The same scheme is used now, but the Spark jobs are submitted to the production YARN cluster, and Kerberos authentication is required, ...
Posted by not_john on Mon, 03 Jan 2022 23:15:07 +0100
[review] Action operator of RDD
3. RDD action operators
A so-called action operator is simply a method that triggers the execution of a Job.
1. reduce
function signature
def reduce(f: (T, T) => T): T
function description
Aggregates all elements in the RDD, first aggregating data within each partition and then aggregating the results across partitions
val rdd: RDD[Int] = sc.m ...
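A minimal sketch of reduce in the same spark-shell style as the excerpt (sc is the SparkContext; the numbers and the partition count of 2 are illustrative assumptions):

import org.apache.spark.rdd.RDD

// Two partitions: (1, 2) and (3, 4)
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)

// Aggregate within partitions first (1+2=3, 3+4=7), then across partitions (3+7=10)
val sum: Int = rdd.reduce(_ + _)
println(sum)   // 10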
Posted by vapour_ on Mon, 03 Jan 2022 10:46:22 +0100
Using SparkLauncher to invoke Spark operation in code
background
The project needs to process many files, some of which are several GB in size. Such files are therefore handled by a dedicated Spark program, and for unified processing it is necessary to invoke Spark jobs from code to handle the large files.
Implementation scheme
After investigation, it is foun ...
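As a hedged sketch of invoking a Spark job from code with SparkLauncher (the jar path, main class, master, and memory setting are placeholder assumptions, not values from the original post):

import org.apache.spark.launcher.SparkLauncher

object LaunchJob {
  def main(args: Array[String]): Unit = {
    // All paths and names below are placeholders
    val handle = new SparkLauncher()
      .setAppResource("/path/to/your-spark-job.jar")   // assumed jar location
      .setMainClass("com.example.FileProcessor")       // assumed main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g")
      .startApplication()                              // returns a SparkAppHandle

    // Poll the handle until the job reaches a final state
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Job finished with state: ${handle.getState}")
  }
}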
Posted by nitestryker on Mon, 03 Jan 2022 02:54:31 +0100
Spark on Hive and Hive on Spark for Big Data Hadoop
1. Differences between Spark on Hive and Hive on Spark
1) Spark on Hive
In Spark on Hive, Hive only plays the storage role (tables and metadata), while Spark is responsible for SQL parsing, optimization, and execution. In other words, Spark SQL uses Hive-compatible statements to operate on Hive tables, with Spark RDDs running underneath. The steps are as follows: ...
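A minimal sketch of this pattern (not from the original article; the table name is an assumption): Spark SQL reads a Hive table through the Hive metastore, while Hive itself executes nothing.

import org.apache.spark.sql.SparkSession

// enableHiveSupport lets Spark SQL use the Hive metastore; Spark does the parsing and execution
val spark = SparkSession.builder()
  .appName("spark-on-hive-demo")
  .enableHiveSupport()
  .getOrCreate()

// "db.some_table" is a placeholder Hive table
spark.sql("SELECT COUNT(*) FROM db.some_table").show()

spark.stop()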
Posted by joejoejoe on Mon, 03 Jan 2022 02:40:47 +0100
Spark Learning Notes - creation of RDD
By default, Spark divides a job into multiple tasks and sends them to Executor nodes for parallel computation. The number of tasks that can be computed in parallel is called the parallelism, and it can be specified when the RDD is built. However, the number of tasks a job is split into is not necessarily equal to the number of tasks executed in paral ...
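A small sketch of specifying the number of partitions when building an RDD (spark-shell style; the data, the partition counts, and the file path are illustrative assumptions):

// The second argument sets the number of partitions the data is split into
val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8), 4)
println(rdd.getNumPartitions)   // 4

// For files, the second argument is only a minimum; the actual count may be larger
val lines = sc.textFile("data/input.txt", 2)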
Posted by mona02 on Sun, 02 Jan 2022 09:42:25 +0100
Introduction to spark - Spark operating environment
Reference link
https://www.bilibili.com/video/BV11A411L7CK?p=11
Spark operating environment
As a data processing framework and computing engine, Spark is designed to run in all common cluster environments. In domestic production work the mainstream environment is YARN, but container environments are becoming more and more popular.
Local mode
The s ...
Posted by FireyIce01 on Thu, 23 Dec 2021 14:01:46 +0100
Big data (8y) Spark3.0 kernel
1. Operation mechanism of Spark On YARN deployment mode
After the task is submitted, the Driver is started; the Driver registers the application with the cluster manager; the cluster manager allocates Executors according to the task's configuration file and starts them; the Driver then begins to execute the main function, and when an action operator is ...
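To make this flow concrete, here is a hedged sketch of a minimal application (the app name and data are assumptions): the main function below runs in the Driver, the transformations are only recorded, and the action is what triggers job submission to the Executors.

import org.apache.spark.{SparkConf, SparkContext}

object YarnFlowDemo {
  def main(args: Array[String]): Unit = {
    // When submitted with spark-submit --master yarn, this main method runs in the Driver
    val sc = new SparkContext(new SparkConf().setAppName("yarn-flow-demo"))

    val rdd = sc.makeRDD(1 to 100)      // only a plan is recorded, nothing runs yet
    val squared = rdd.map(x => x * x)   // still lazy

    // The action below triggers job submission: tasks are scheduled onto the Executors
    println(squared.sum())

    sc.stop()
  }
}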
Posted by Molarmite on Sat, 18 Dec 2021 03:54:26 +0100
spark io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s)
1. Error log:
WARN TaskSetManager: Lost task 29.0 in stage 22.0 (TID 1851, wn108-cdlcns.bjduloineequ3adfbkrpgi4p2c.shax.internal.chinacloudapp.cn, executor 25): FetchFailed(BlockManagerId(2, wn107-cdlcns.bjduloineequ3adfbkrpgi4p2c.shax.internal.chinacloudapp.cn, 7447, None), shuffleId=12, mapId=4, reduceId=29, message=
org.apach ...
Posted by godster on Fri, 17 Dec 2021 02:33:23 +0100
Spark source code reading 04 - Local operation mode of spark operation architecture
Local operation mode
Basic introduction
Spark's Local mode is also called local running mode or pseudo-distributed mode. It is called Local because in this mode all Spark processes run inside the JVM of a single local machine, without any resource manager; it mainly uses multiple threads on one machine to simulate spa ...
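A minimal hedged sketch of Local mode (the thread count implied by local[*] and the app name are assumptions): setting the master to local[*] makes Spark simulate the cluster with threads inside a single JVM.

import org.apache.spark.{SparkConf, SparkContext}

object LocalModeDemo {
  def main(args: Array[String]): Unit = {
    // local[*] uses as many worker threads as there are CPU cores; no resource manager is involved
    val conf = new SparkConf().setMaster("local[*]").setAppName("local-mode-demo")
    val sc = new SparkContext(conf)

    println(sc.makeRDD(1 to 10).map(_ + 1).collect().mkString(","))

    sc.stop()
  }
}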
Posted by mhenke on Sun, 12 Dec 2021 23:20:27 +0100