Spark Day03: core programming - RDD operators

1: RDD operators. RDD operators, also known as RDD methods, fall into two main categories: transformation operators and action operators. 2: RDD transformation operators. By the kind of data they process, transformation operators are divided into single-value, double-value, and key-value types. 2.1: map (single-value transformation) package com.atg ...
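A minimal sketch of the split the excerpt describes, shown for the map transformation; the local master, data, and object name are illustrative, not the article's com.atg code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("map-demo"))

    val rdd = sc.parallelize(Seq(1, 2, 3, 4))
    // Transformation: lazy, returns a new RDD without running a job.
    val doubled = rdd.map(_ * 2)
    // Action: triggers the actual computation and returns a result.
    println(doubled.collect().mkString(", ")) // 2, 4, 6, 8

    sc.stop()
  }
}
```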

Posted by lightpace on Tue, 15 Feb 2022 06:38:42 +0100

Spark Day06: Spark Core kernel scheduling and a Spark SQL quick start

Spark Day06: Spark Core 01 - [understand] - course content review. It mainly covers three topics: Sogou log analysis, external data sources (HBase and MySQL), and shared variables. 1. Sogou log analysis: business analysis of the official search logs based on Spark Core (RDD). Data format: text file data, where each record is the log data o ...
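The shared variables the review mentions come in two kinds, broadcast variables and accumulators. A minimal sketch with made-up data (the stop-word filter is illustrative, not the Sogou analysis itself):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("shared"))

    // Broadcast variable: read-only value shipped once to each executor.
    val stopWords = sc.broadcast(Set("the", "a"))
    // Accumulator: counter written on executors, aggregated on the driver.
    val dropped = sc.longAccumulator("dropped words")

    val words = sc.parallelize(Seq("the", "spark", "a", "log"))
    val kept = words.filter { w =>
      val keep = !stopWords.value.contains(w)
      if (!keep) dropped.add(1)
      keep
    }

    println(kept.collect().mkString(", ")) // spark, log
    println(s"dropped: ${dropped.value}")  // 2
    sc.stop()
  }
}
```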

Posted by JamesThePanda on Mon, 14 Feb 2022 11:48:56 +0100

What is an RDD operator in Spark

Operators of RDD. 1. What is an operator? An API, a method, a behavior. 2. What are the classes of operators? Transformation and action. 3. Transformation features: a transformation produces a new RDD and is lazily evaluated. Which operators are transformations? See the table: map, filter, etc. Transformations can be classified further, e.g. glom, which gathers the elements of each partit ...
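A short sketch of the glom transformation the outline singles out, assuming a local run with illustrative data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GlomDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("glom"))

    val rdd = sc.parallelize(1 to 8, numSlices = 4)
    // glom gathers the elements of each partition into an Array,
    // turning RDD[Int] into RDD[Array[Int]].
    val perPartition = rdd.glom()
    perPartition.collect().foreach(a => println(a.mkString("[", ",", "]")))
    // e.g. [1,2] [3,4] [5,6] [7,8]

    // Typical use: compute a per-partition maximum, then combine on the driver.
    println(perPartition.map(_.max).collect().mkString(", "))
    sc.stop()
  }
}
```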

Posted by kingdm on Sat, 12 Feb 2022 17:44:38 +0100

Playing with the Hudi Docker Demo on Ubuntu -- Spark write and query

Brief introduction. The previous article, Playing with the Hudi Docker Demo on Ubuntu (2) -- writing test data to Kafka, described how to write test data to the Kafka cluster. This article describes how to use Spark to consume the Kafka data and write it to HDFS. Hudi is brought into Spark as a Jar package. Types of Hudi tables and queries. Tabl ...
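A hedged sketch of the pipeline the article describes: consuming a Kafka topic with Structured Streaming and writing each micro-batch to a Hudi table on HDFS. The broker, topic, paths, and Hudi option values below are placeholders rather than the demo's exact configuration, and the Hudi Spark bundle jar must be on the classpath:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object KafkaToHudi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hudi").getOrCreate()

    // Read the topic as a stream (placeholder broker and topic).
    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafkabroker:9092")
      .option("subscribe", "stock_ticks")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

    // Write each micro-batch to a Hudi table on HDFS (placeholder path and key fields).
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("hudi")
        .option("hoodie.table.name", "stock_ticks")
        .option("hoodie.datasource.write.recordkey.field", "key")
        .option("hoodie.datasource.write.precombine.field", "timestamp")
        .mode("append")
        .save("hdfs://namenode:8020/user/hudi/stock_ticks")

    source.writeStream
      .foreachBatch(writeBatch)
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()
      .awaitTermination()
  }
}
```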

Posted by dilum on Fri, 11 Feb 2022 17:01:45 +0100

Spark: integrating a PySpark development environment into Jupyter Notebook

Record. Basic environment: JDK 8, Python 3.7. Setting up the Spark environment on Windows: first install JDK 8 and Python 3, which will not be repeated here. Install Hadoop 2.7. Download address: http://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz and decompress it. Download winutils for Hadoop: https://github.com/stevelou ...
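One common Windows workaround implied by the winutils step, sketched in Scala; the C:\hadoop path is a placeholder for wherever hadoop-2.7.7 plus winutils.exe was unpacked, and setting the HADOOP_HOME environment variable achieves the same thing:

```scala
import org.apache.spark.sql.SparkSession

object LocalWindowsSpark {
  def main(args: Array[String]): Unit = {
    // Point Hadoop at the directory whose bin\ contains winutils.exe,
    // equivalent to setting the HADOOP_HOME environment variable.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-windows")
      .getOrCreate()

    println(spark.range(5).count()) // smoke test, prints 5
    spark.stop()
  }
}
```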

Posted by cl77 on Fri, 11 Feb 2022 15:37:58 +0100

Basic operations of Spark Streaming in PySpark

Basic operations of Spark Streaming in PySpark. Preface. Stream data has the following characteristics: data arrives quickly and continuously, and its potential size may be unbounded; data sources are numerous and formats are complex; the volume of data is large, but storage matters less, since once processed it will either be discar ...
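The article itself works in PySpark; for consistency with the other examples here, a minimal socket word count showing the same DStream basics in Scala (host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least 2 local threads: one to receive the stream, one to process it.
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-wc")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.print() // print the counts of each batch

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // run until killed
  }
}
```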

Posted by andrew10181 on Fri, 11 Feb 2022 04:41:05 +0100

Scala learning notes - type parameters

Multiple bounds. A type variable can have both an upper and a lower bound, written as: T >: Lower <: Upper. There cannot be multiple upper bounds or multiple lower bounds at the same time; however, you can still require a type to implement several traits at once, like this: T <: Comparable[T] with Serializable with Cloneable Ther ...
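A small sketch of both points, assuming Scala 2 syntax; java.io.Serializable and java.lang.Cloneable are used explicitly so the compound bound also holds for Java types such as java.util.Date:

```scala
object BoundsDemo {
  // Both a lower and an upper bound: T must be a supertype of Null
  // and a subtype of AnyRef, so returning null is legal.
  def firstOrNull[T >: Null <: AnyRef](xs: List[T]): T = xs.headOption.orNull

  // Only one upper bound is allowed, but a compound type can require
  // several traits at once (the excerpt's Comparable/Serializable/Cloneable).
  def smallest[T <: Comparable[T] with java.io.Serializable with java.lang.Cloneable](a: T, b: T): T =
    if (a.compareTo(b) <= 0) a else b

  def main(args: Array[String]): Unit = {
    println(firstOrNull(List("a", "b")))     // a
    println(firstOrNull(List.empty[String])) // null
    // java.util.Date happens to implement all three interfaces.
    val d1 = new java.util.Date(0L)
    val d2 = new java.util.Date(1000L)
    println(smallest(d1, d2) == d1)          // true
  }
}
```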

Posted by mbhcool on Tue, 08 Feb 2022 06:46:44 +0100

Spark Chasing Wife Series (Pair RDD Episode 2)

After a busy day, I didn't get anything done. Small talk: I did nothing today. Before I knew it, it's the fifth day of the lunar new year. I'll start subject 4 of the driving test in a few days, and I hope to get my driver's license early. combineByKey. First, the meaning of each parameter: createCombiner: a function that creates a combiner withi ...
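A minimal combineByKey sketch matching the parameter names the article explains, here computing a per-key average over made-up scores:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("cbk"))

    val scores = sc.parallelize(Seq(("a", 90), ("a", 80), ("b", 70), ("b", 60), ("a", 70)))

    // Average per key: the combiner is a (sum, count) pair.
    val avg = scores.combineByKey(
      (v: Int) => (v, 1),                                          // createCombiner: first value seen in a partition
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue: fold another value into the combiner
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: merge combiners across partitions
    ).mapValues { case (sum, count) => sum.toDouble / count }

    avg.collect().foreach(println) // (a,80.0), (b,65.0)
    sc.stop()
  }
}
```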

Posted by mj_23 on Sat, 05 Feb 2022 13:37:41 +0100

Spark Chasing Wife Series (RDD mapping)

I finally settled down and started studying properly. Small talk: this article talks about some operators in Spark RDD, all of which are about mapping. Specifically, there are three operators: map, mapPartitions and mapPartitionsWithIndex ...
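A compact sketch contrasting the three mapping operators, assuming a local run with illustrative data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MappingDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("mapping"))
    val rdd = sc.parallelize(1 to 6, numSlices = 2)

    // map: the function is called once per element.
    val m = rdd.map(_ + 1)

    // mapPartitions: called once per partition with an iterator, so
    // per-partition setup (e.g. opening a connection) happens only once.
    val mp = rdd.mapPartitions(iter => iter.map(_ + 1))

    // mapPartitionsWithIndex: like mapPartitions, plus the partition number.
    val mpi = rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => s"partition $idx -> $x"))

    println(m.collect().mkString(", "))
    println(mp.collect().mkString(", "))
    mpi.collect().foreach(println)
    sc.stop()
  }
}
```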

Posted by stokie-rich on Thu, 03 Feb 2022 07:00:04 +0100

Spark, the way of God - a detailed explanation of RDD creation

3.2 RDD programming. In Spark, an RDD is represented as an object, and RDDs are transformed through method calls on that object. After defining an RDD through a series of transformations, you can call an action to trigger its computation. Actions can return a result to the application (count, collect, etc.) or save data to the storage system ...
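A minimal sketch of the creation paths this section introduces: from an in-memory collection, from external storage, and from another RDD via a transformation (file paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("create"))

    // 1. From an in-memory collection (makeRDD is an alias for parallelize).
    val fromSeq  = sc.parallelize(Seq(1, 2, 3, 4))
    val fromSeq2 = sc.makeRDD(Seq(5, 6, 7, 8))

    // 2. From external storage (lazy; nothing is read yet).
    val fromFile = sc.textFile("data/input.txt")

    // 3. From another RDD, via a transformation; an action triggers the job.
    val derived = fromSeq.map(_ * 10)
    println(derived.count())              // action returning a result to the application
    derived.saveAsTextFile("data/output") // action saving data to the storage system

    sc.stop()
  }
}
```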

Posted by Matty999555 on Tue, 01 Feb 2022 16:47:04 +0100