Introduction to Spark SQL
Spark SQL is one of the most complex components in the Spark stack. It lets a Spark program operate on structured data using SQL queries. Specifically, Spark SQL has the following three important features:
1. Spark SQL supports reading multiple structured data formats, such as JSON, Parquet ...
Posted by Xurion on Thu, 23 Sep 2021 06:12:04 +0200
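As a sketch of that first feature, reading JSON and Parquet files and querying them with SQL might look like the following (the paths and view name are placeholders, and an active SparkSession is assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-formats")
  .getOrCreate()

// Read structured data in different formats; the paths are placeholders.
val jsonDF    = spark.read.json("/data/events.json")
val parquetDF = spark.read.parquet("/data/events.parquet")

// Register a temp view and query it with SQL.
jsonDF.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()
```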
Environment used in this article
CentOS server
Jupyter with the Scala kernel spylon-kernel
Main contents of this article
Spark reads data from Hive tables, covering: querying a Hive table directly with SQL, and reading Hive tables and Hive partitioned tables through their HDFS files. Initialize the SparkSession t ...
Posted by flash99 on Mon, 20 Sep 2021 18:18:29 +0200
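Both approaches described above can be sketched as follows (database, table, partition, and storage format are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

// A SparkSession with Hive support enabled.
val spark = SparkSession.builder()
  .appName("read-hive")
  .enableHiveSupport()
  .getOrCreate()

// Approach 1: query the Hive table directly with SQL (names hypothetical).
val df = spark.sql("SELECT * FROM mydb.orders WHERE dt = '2021-09-20'")

// Approach 2: read the table's files straight from HDFS, which also works
// for a single partition directory (assuming the table is stored as Parquet).
val fromHdfs = spark.read.parquet(
  "hdfs:///user/hive/warehouse/mydb.db/orders/dt=2021-09-20")
```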
Next, we look at some of the finer implementation details.
Shuffle is undoubtedly a key point of performance tuning. This article analyzes the implementation details of Spark Shuffle in depth from the source code.
The upper boundary of each stage requires either reading data from external storage or r ...
Posted by rbrown on Wed, 08 Sep 2021 04:50:23 +0200
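To make the stage boundary concrete: a wide dependency such as reduceByKey forces a shuffle, so the sketch below runs as two stages, and the second stage's upper boundary is a shuffle read rather than external storage (paths are placeholders, an active SparkSession is assumed):

```scala
// Stage 0: upper boundary is external storage (HDFS read), then narrow ops.
val words = spark.sparkContext.textFile("hdfs:///input/words.txt")
val counts = words
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)  // wide dependency: shuffle write ends stage 0
// Stage 1: upper boundary is the shuffle read produced above.
counts.saveAsTextFile("hdfs:///output/wordcount")
```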
These operators all gather data and pull it back to the driver for storage; note their respective roles.
* Return an array that contains all of the elements in this RDD.
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
Posted by Brian W on Tue, 30 Jun 2020 05:44:38 +0200
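A minimal usage example of the collect() documented above, with the caveat from the scaladoc spelled out (an active SparkSession is assumed):

```scala
val rdd = spark.sparkContext.parallelize(1 to 5)

// collect() pulls every element back into the driver's memory,
// so it is only safe when the result is known to be small.
val arr: Array[Int] = rdd.collect()

// For large datasets, prefer take(n) or writing out with saveAsTextFile
// instead of collecting everything to the driver.
```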
Product dimension data loading (zipper table)
Zipper table design:
1. Collect the full data for the current day and store it in the ND (current day) table.
2. Take yesterday's full data out of the history table and store it in the OD (last day's data) table.
3. ND - OD is the data added and changed on ...
Posted by eagle1771 on Fri, 05 Jun 2020 06:27:04 +0200
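The ND/OD steps above can be sketched in Spark SQL; all table and column names here are assumptions, and EXCEPT stands in for the "ND - OD" set difference (an active SparkSession with Hive support is assumed):

```scala
// Step 1: today's full snapshot as the ND view (source table hypothetical).
spark.sql("""CREATE OR REPLACE TEMP VIEW nd AS
             SELECT product_id, product_name, price
             FROM ods_product WHERE dt = '2020-06-05'""")

// Step 2: yesterday's full snapshot as the OD view, taken from the history
// table (open records conventionally marked with end_date = '9999-12-31').
spark.sql("""CREATE OR REPLACE TEMP VIEW od AS
             SELECT product_id, product_name, price
             FROM dim_product_his WHERE end_date = '9999-12-31'""")

// Step 3: ND - OD = rows that are new or changed today.
val changed = spark.sql(
  """SELECT product_id, product_name, price FROM nd
    |EXCEPT
    |SELECT product_id, product_name, price FROM od""".stripMargin)
```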
1. Cache table data in memory
2. Parameter optimization
1. Cache table data in memory
Performance tuning is mainly about keeping data in memory: cached data can be read directly from memory, which improves performance. For an RDD, use rdd.cache or rdd.persist to cac ...
Posted by abgoosht on Fri, 13 Mar 2020 08:27:46 +0100
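Both the RDD-level and the Spark SQL table-level caching mentioned above can be sketched like this (paths and the table name are placeholders, an active SparkSession is assumed):

```scala
// RDD-level caching.
val rdd = spark.sparkContext.textFile("hdfs:///input/data.txt")
rdd.cache()   // shorthand for persist(StorageLevel.MEMORY_ONLY)
rdd.count()   // the first action materializes the cache

// Table-level caching in Spark SQL (table name hypothetical).
spark.catalog.cacheTable("my_table")
spark.sql("SELECT count(*) FROM my_table").show() // served from the in-memory cache
spark.catalog.uncacheTable("my_table")            // release the cache when done
```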
Create a new configuration file under $FLUME_HOME/conf: flume_push_streaming.conf
The configuration idea is as follows:
source: use netcat, and configure the host name and port
sink: use avro, and configure the host name and port
channel s ...
Posted by phithe on Fri, 06 Mar 2020 12:11:34 +0100
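Following that idea, flume_push_streaming.conf might look like the sketch below; the agent name, host names, and ports are all placeholders:

```
# Name the agent's components (agent name 'a1' is an assumption).
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, with host name and port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# sink: avro, with host name and port (pushing to the streaming receiver).
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 41414

# channel: memory.
a1.channels.c1.type = memory

# Bind the source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```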
Spark+Kafka Build Real-Time Analysis Dashboard
Spark and Kafka are used to analyze, in real time, the number of male and female students shopping per second: Spark Streaming processes the user shopping log in real time, then WebSocket is used to push the data to the browser in real ti ...
Posted by t31os on Fri, 17 Jan 2020 03:40:05 +0100
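The consuming side of such a pipeline might be sketched as follows, using the spark-streaming-kafka-0-10 integration; the broker address, topic, group id, and the assumption that the log line's last comma-separated field is the shopper's gender are all hypothetical:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("shop-dashboard")
val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer"  -> classOf[StringDeserializer],
  "value.deserializer"-> classOf[StringDeserializer],
  "group.id"          -> "dashboard")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("shop-log"), kafkaParams))

// Count shoppers per gender in each 1-second batch
// (assuming gender is the last field of the log line).
val counts = stream
  .map(record => (record.value().split(",").last, 1))
  .reduceByKey(_ + _)

counts.print() // the real dashboard would push this over WebSocket instead

ssc.start()
ssc.awaitTermination()
```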
The Spark mentioned here covers Spark Core, Spark SQL, and Spark Streaming; the operations are in fact the same for all of them. The following shows code from an actual project.
Method 1: write the entire DataFrame to MySQL in one go (the DataFrame's schema must match the field names defined in the MySQL table)
Posted by blacksheepradio on Wed, 11 Dec 2019 06:08:20 +0100
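Method 1 can be sketched with the DataFrame JDBC writer; the connection URL, table name, and credentials are placeholders, `df` stands for the DataFrame to be saved, and the MySQL JDBC driver is assumed to be on the classpath:

```scala
import java.util.Properties

// Connection properties (placeholder credentials).
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Write the whole DataFrame to MySQL in one call; the DataFrame's
// column names must match the columns of the target table.
df.write
  .mode("append") // or "overwrite" to replace the table's contents
  .jdbc("jdbc:mysql://localhost:3306/test", "orders", props)
```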