Spark Day03: Core programming - RDD operators
1: RDD operators
RDD operators, also known as RDD methods, fall into two main categories: transformation operators and action operators.
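The difference between the two categories can be sketched with a plain-Python analogy (this is not the Spark API): a transformation only describes work, while an action forces it, much like Python's lazy map object versus consuming it with list:

```python
data = [1, 2, 3, 4]

# Like a transformation: nothing is computed yet, the work is only described.
doubled = map(lambda x: x * 2, data)

# Like an action: consuming the lazy object triggers the computation.
result = list(doubled)
print(result)  # [2, 4, 6, 8]
```

In Spark the same shape holds: rdd.map(...) returns a new RDD without running anything, and an action such as collect() triggers the job.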
2: RDD transformation operators
By the kind of data they process, transformation operators are divided into single-value, double-value, and key-value types.
2.1: map (value transformation)
package com.atg ...
Posted by lightpace on Tue, 15 Feb 2022 06:38:42 +0100
Spark Day06: Spark Core - Spark kernel scheduling and a Spark SQL quick start
Spark Day06: Spark Core
01 - [understand] - course content review
It mainly covers three topics: Sogou log analysis, external data sources (HBase and MySQL), and shared variables.
1. Sogou log analysis
Business analysis of official Sogou search logs based on Spark Core (RDD)
Data format:
Text file data, where each record is the log data o ...
Posted by JamesThePanda on Mon, 14 Feb 2022 11:48:56 +0100
What are RDD operators in Spark?
RDD operators: 1 - What is an operator? An API, a method, a behavior. 2 - What classes of operators are there? Transformations and actions. 3 - Transformation features: they produce a new RDD and are lazily evaluated. Which transformation operators are there? See the table: map, filter, etc. Transformations can be classified further, e.g. glom - the elements of each partit ...
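The per-partition flavor of operators like glom can be sketched in plain Python (an analogy, not the Spark API; the glom function and the partition lists are illustrative): glom collects the elements of each partition into a single array-valued element.

```python
def glom(partitions):
    # Analogy for RDD.glom(): each partition's elements
    # become one list-valued element of the result.
    return [list(p) for p in partitions]

# An "RDD" with two partitions.
parts = [[1, 2], [3, 4, 5]]
glommed = glom(parts)
print(glommed)  # [[1, 2], [3, 4, 5]]

# A common use: take a per-partition maximum, then reduce globally.
print(max(max(p) for p in glommed))  # 5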
Posted by kingdm on Sat, 12 Feb 2022 17:44:38 +0100
Playing with the Hudi Docker demo on Ubuntu -- Spark write and query
Brief introduction
The previous article, Playing with the Hudi Docker Demo on Ubuntu (2) -- Writing Test Data to Kafka, described how to write test data to the Kafka cluster. This article describes how to use Spark to consume the Kafka data and write it to HDFS. Hudi is brought into Spark as a Jar package.
Types of Hudi tables and queries
Tabl ...
Posted by dilum on Fri, 11 Feb 2022 17:01:45 +0100
Spark: setting up a PySpark development environment in Jupyter Notebook
Notes
Basic environment
JDK 8, Python 3.7
Setting up the Spark environment on Windows
First install JDK 8 and Python 3; this will not be repeated here
Install Hadoop 2.7.7
Download address: http://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz and decompress it. Download Hadoop's winutils: https://github.com/stevelou ...
Posted by cl77 on Fri, 11 Feb 2022 15:37:58 +0100
Basic operation of SparkStreaming in PySpark
Preface
Stream data has the following characteristics:
• Data arrives quickly and continuously, and its potential size may be unbounded
• Data sources are numerous and formats are complex
• The volume of data is large, but storage matters less: once processed, it is either discar ...
Posted by andrew10181 on Fri, 11 Feb 2022 04:41:05 +0100
Scala learning notes - type parameters
Multiple bounds
A type variable can have both an upper and a lower bound at once, written as:
T >: Lower <: Upper
A type cannot have multiple upper bounds or multiple lower bounds at the same time; however, you can still require a type to mix in multiple traits, like this:
T <: Comparable[T] with Serializable with Cloneable
Ther ...
Posted by mbhcool on Tue, 08 Feb 2022 06:46:44 +0100
Spark chasing Wife Series (Pair RDD Episode 2)
After a busy day, I didn't do anything
Small talk:
I didn't do anything today. Unconsciously, it's the fifth day of the lunar new year. I'll start taking subject 4 in 5678 days. I hope to get my driver's license early
combineByKey
First, an explanation of each parameter:
createCombiner: a function that creates a combiner withi ...
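The three-function contract of combineByKey can be sketched in plain Python (an analogy, not Spark's API; combine_by_key and the (sum, count) combiner below are illustrative): createCombiner builds the initial combiner from a value, mergeValue folds further values into it within a partition, and mergeCombiners merges combiners across partitions.

```python
def combine_by_key(pairs_by_partition, create_combiner, merge_value, merge_combiners):
    # Phase 1: combine within each partition (map side).
    partials = []
    for part in pairs_by_partition:
        acc = {}
        for k, v in part:
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        partials.append(acc)
    # Phase 2: merge the per-partition combiners (reduce side).
    out = {}
    for acc in partials:
        for k, c in acc.items():
            out[k] = merge_combiners(out[k], c) if k in out else c
    return out

# Classic use: per-key (sum, count), from which an average can be computed.
parts = [[("a", 1), ("a", 2)], [("a", 3), ("b", 4)]]
sums = combine_by_key(
    parts,
    lambda v: (v, 1),                                # createCombiner
    lambda c, v: (c[0] + v, c[1] + 1),               # mergeValue
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),   # mergeCombiners
)
print(sums)  # {'a': (6, 3), 'b': (4, 1)}
```

The two-phase shape is why combineByKey can pre-aggregate on the map side before any shuffle.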
Posted by mj_23 on Sat, 05 Feb 2022 13:37:41 +0100
Spark chasing Wife Series (RDD mapping)
I finally settled down and got back to writing seriously
Small talk:
This article will talk about some operators in Spark RDD, which are all about mapping.
Specifically, there are three operators: map, mapPartitions, and mapPartitionsWithIndex ...
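How the three mapping operators differ can be sketched in plain Python (an analogy, not the Spark API; the helper names are made up): map applies a function to each element, mapPartitions hands the function an iterator over a whole partition, and mapPartitionsWithIndex additionally passes the partition index.

```python
def rdd_map(partitions, f):
    # Like RDD.map: f is applied element by element.
    return [[f(x) for x in p] for p in partitions]

def rdd_map_partitions(partitions, f):
    # Like RDD.mapPartitions: f receives an iterator over one partition
    # and returns an iterator (called once per partition, not per element).
    return [list(f(iter(p))) for p in partitions]

def rdd_map_partitions_with_index(partitions, f):
    # Like RDD.mapPartitionsWithIndex: f also gets the partition index.
    return [list(f(i, iter(p))) for i, p in enumerate(partitions)]

parts = [[1, 2], [3, 4]]
print(rdd_map(parts, lambda x: x * 2))                  # [[2, 4], [6, 8]]
print(rdd_map_partitions(parts, lambda it: (x + 1 for x in it)))
# [[2, 3], [4, 5]]
print(rdd_map_partitions_with_index(parts, lambda i, it: ((i, x) for x in it)))
# [[(0, 1), (0, 2)], [(1, 3), (1, 4)]]
```

The practical difference in Spark: mapPartitions lets you pay a per-partition cost (opening a connection, building a buffer) once instead of once per element.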
Posted by stokie-rich on Thu, 03 Feb 2022 07:00:04 +0100
Spark - the road to mastery: a detailed explanation of RDD creation
3.2 RDD programming
In Spark, an RDD is represented as an object, and RDDs are transformed through method calls on that object. After defining an RDD through a series of transformations, you can call an action to trigger its computation. An action can return a result to the application (count, collect, etc.) or save the data to a storage system ...
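This describe-then-trigger model can be sketched minimally in plain Python (not the Spark API; FakeRDD is a made-up illustrative class): transformations only record work, and actions run the recorded chain.

```python
class FakeRDD:
    # Minimal sketch of lazy transformation chaining, not Spark's RDD class.
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops  # recorded transformations, not yet applied

    def map(self, f):       # "transformation": returns a new FakeRDD, runs nothing
        return FakeRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):    # "transformation"
        return FakeRDD(self._data, self._ops + (("filter", p),))

    def _run(self):
        out = self._data
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

    def collect(self):      # "action": triggers evaluation, returns the result
        return self._run()

    def count(self):        # "action": triggers evaluation, returns a number
        return len(self._run())

rdd = FakeRDD([1, 2, 3, 4]).map(lambda x: x * x).filter(lambda x: x > 4)
print(rdd.collect())  # [9, 16]
print(rdd.count())    # 2
```

Building rdd above costs almost nothing; all the work happens inside collect and count, which mirrors how Spark defers execution until an action is called.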
Posted by Matty999555 on Tue, 01 Feb 2022 16:47:04 +0100