Spark SQL: a Spark-based API for operating on structured data

Introduction to Spark SQL: Spark SQL is one of the most complex components in the Spark stack. It lets a Spark program operate on structured data, that is, run SQL queries. Specifically, Spark SQL has three important features: 1. Spark SQL supports reading multiple structured data formats, such as JSON and Parquet ...
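A minimal sketch of the idea in Scala (the file paths and view name below are placeholders, not taken from the article):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSqlReadExample")
      .master("local[*]")
      .getOrCreate()

    // Spark SQL reads several structured formats directly.
    val jsonDf    = spark.read.json("/tmp/people.json")
    val parquetDf = spark.read.parquet("/tmp/people.parquet")
    parquetDf.printSchema()

    // Register a temporary view and query it with SQL.
    jsonDf.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 20").show()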

Posted by Xurion on Thu, 23 Sep 2021 06:12:04 +0200

Spark Learning Notes - core operators

Spark Learning Notes - core operators (2): the distinct operator. /** * Return a new RDD containing the distinct elements in this RDD. */ def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope { def removeDuplicatesInPartition(partition: Iterator[T]): Iterator[T] = { // Create an instance of extern ...
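A small usage sketch for distinct (the sample data and partition counts are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DistinctExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Seq(1, 2, 2, 3, 3, 3), numSlices = 3)

    // distinct returns a new RDD with duplicate elements removed;
    // the number of output partitions can be set explicitly.
    val unique = rdd.distinct(numPartitions = 2)
    println(unique.collect().sorted.mkString(", "))   // 1, 2, 3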

Posted by Mikester on Wed, 22 Sep 2021 17:46:44 +0200

Spark related introduction - extracting a Hive table

Environment used in this document: a CentOS server, Jupyter with the spylon-kernel Scala kernel, spark-2.4.0, scala-2.11.12, hadoop-2.6.0. Main contents: Spark reads data from Hive tables, including reading a Hive table directly with SQL, and reading Hive tables and Hive partitioned tables through HDFS files. Initialize the SparkSession t ...
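A sketch of the SparkSession initialization and the direct SQL read (the database, table and partition values below are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ReadHiveTable")
      .enableHiveSupport()   // needs a reachable Hive metastore
      .getOrCreate()

    // Read a Hive table directly with SQL.
    spark.sql("SELECT * FROM demo_db.demo_table LIMIT 10").show()

    // A partitioned table is filtered on its partition column the same way.
    spark.sql("SELECT * FROM demo_db.demo_table_p WHERE dt = '2021-09-20'").show()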

Posted by flash99 on Mon, 20 Sep 2021 18:18:29 +0200

Inside Spark Technology: detailed explanation of Shuffle

Next, we will look at the implementation in more detail. Shuffle is undoubtedly a key point of performance tuning. This article analyzes the implementation details of Spark Shuffle in depth from the perspective of the source code. The upper boundary of each Stage requires either reading data from external storage or r ...
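As a small illustration (not taken from the article's source analysis), a wide dependency such as reduceByKey is exactly where such a stage boundary appears:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ShuffleBoundary").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // map is a narrow transformation and stays inside the first stage;
    // reduceByKey needs all values of a key together, so stage 1 performs the
    // shuffle write and stage 2 starts by reading that shuffle output.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)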

Posted by rbrown on Wed, 08 Sep 2021 04:50:23 +0200

Collect, Cache and Persist in Spark caching

All of them gather data or pull data into storage; here we note their respective roles. Collect: /** * Return an array that contains all of the elements in this RDD. * * @note This method should only be used if the resulting array is expected to be small, as * all the data is loaded into the driver's memory. ...
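A short sketch contrasting the three calls (the data is made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("CollectCachePersist").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000)

    // cache keeps the RDD in memory (shorthand for persist(MEMORY_ONLY)).
    rdd.cache()

    // persist does the same with an explicit storage level.
    val doubled = rdd.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

    // collect pulls all elements back to the driver; only safe for small results.
    val small = doubled.filter(_ % 100 == 0).collect()
    println(small.mkString(", "))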

Posted by Brian W on Tue, 30 Jun 2020 05:44:38 +0200

Hundred-billion-scale data warehouse project (warehouse theory: product dimension data loading with a zipper table)

Product dimension data loading (zipper table). Zipper table design: 1. Collect the full data of the current day and store it in the ND (current day) table. 2. Take yesterday's full data out of the history table and store it in the OD (last day's data) table. 3. ND - OD is the data added and changed on ...
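A simplified sketch of step 3 under assumed names (the nd/od tables and their layout are illustrative only): rows present in today's full extract but absent from yesterday's are the added or changed records to merge into the zipper table.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ZipperDelta").enableHiveSupport().getOrCreate()

    val nd = spark.table("dw.product_nd")   // full data of the current day
    val od = spark.table("dw.product_od")   // full data of the previous day

    // ND - OD: records added or changed today.
    val delta = nd.except(od)
    delta.show()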

Posted by eagle1771 on Fri, 05 Jun 2020 06:27:04 +0200

Spark SQL -- Spark SQL performance optimization

Article contents: 1. Caching table data in memory; 2. Parameter optimization. 1. Caching table data in memory: performance tuning is mainly about putting data into memory; caching data in memory improves performance because values are read directly from memory. For an RDD, use rdd.cache or rdd.persist to cac ...
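A minimal sketch of caching table data with Spark SQL (the view name is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CacheTable").master("local[*]").getOrCreate()

    spark.range(0, 100000).toDF("id").createOrReplaceTempView("numbers")

    // Cache the table so later queries read it from memory.
    spark.catalog.cacheTable("numbers")        // or: spark.sql("CACHE TABLE numbers")
    spark.sql("SELECT COUNT(*) FROM numbers WHERE id % 2 = 0").show()

    // Release the memory once the table is no longer needed.
    spark.catalog.uncacheTable("numbers")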

Posted by abgoosht on Fri, 13 Mar 2020 08:27:46 +0100

Integrating Flume and Spark Streaming in push mode

1. Architecture. 2. Flume configuration: create a new configuration file under $FLUME_HOME/conf: flume_push_streaming.conf. The configuration idea is as follows: for the source, select netcat and configure the host name and port; for the sink, select avro and configure the host name and port; for the channel, s ...
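On the Spark Streaming side, a minimal sketch of the push model (it assumes the spark-streaming-flume connector shipped with older Spark releases; the host and port are placeholders matching whatever the avro sink is configured to push to):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val conf = new SparkConf().setAppName("FlumePushWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Spark acts as an avro receiver; Flume's avro sink pushes events to it.
    val flumeStream = FlumeUtils.createStream(ssc, "localhost", 41414)

    flumeStream.map(e => new String(e.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()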

Posted by phithe on Fri, 06 Mar 2020 12:11:34 +0100

Spark Big Data: Building a Real-Time Analysis Dashboard with Spark + Kafka

Building a real-time analysis dashboard with Spark + Kafka. I. Framework: Spark + Kafka is used to analyze, in real time, the number of male and female students shopping per second; Spark Streaming processes the user shopping log in real time, and a websocket then pushes the data to the browser in real ti ...
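A hedged sketch of just the Spark Streaming + Kafka piece (the broker address, topic name, and the assumption that each record's value is the gender field of one shopping log line are placeholders, not taken from the article):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("ShoppingDashboard").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "dashboard",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("shopping-log"), kafkaParams))

    // Count purchases per gender in each 1-second batch.
    stream.map(record => (record.value, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()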

Posted by t31os on Fri, 17 Jan 2020 03:40:05 +0100

How to write results to MySQL in Spark

The Spark mentioned here covers Spark Core, Spark SQL and Spark Streaming; the operations are in fact the same for all of them. The following shows code from an actual project. Method 1: write the entire DataFrame to MySQL at once (the DataFrame's schema should match the field names defined in the MySQL table). Dat ...
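A hedged sketch of that first method (the JDBC URL, credentials, table name and columns are placeholders):

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("WriteToMySQL").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    val props = new Properties()
    props.put("user", "root")
    props.put("password", "secret")
    props.put("driver", "com.mysql.jdbc.Driver")

    // The DataFrame's schema must match the column definitions of the MySQL table.
    df.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:mysql://localhost:3306/test", "users", props)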

Posted by blacksheepradio on Wed, 11 Dec 2019 06:08:20 +0100