Custom partitioning and sorting within partitions

Simple wordCount. Suppose our file contains data such as: spark spark hive hadoop spark spark hive hadoop spark spark hive hadoop spark spark hive hadoop ...
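As a minimal sketch of what such a job might look like, here is word count followed by a custom partitioner with in-partition sorting. The input path and the first-letter partitioner are illustrative assumptions, not taken from the article; repartitionAndSortWithinPartitions applies the Partitioner and sorts by key inside each resulting partition.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: route each word by its first character.
class FirstLetterPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    math.abs(key.toString.headOption.getOrElse(' ').toInt) % parts
}

object WordCountPartitioned {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wc").setMaster("local[*]"))
    val counts = sc.textFile("data/words.txt")  // assumed input path
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
    // Repartition with the custom partitioner, sorting by key within each partition.
    counts.repartitionAndSortWithinPartitions(new FirstLetterPartitioner(3))
      .saveAsTextFile("output/wc")
    sc.stop()
  }
}
```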

Posted by dieselmachine on Fri, 15 Oct 2021 01:41:08 +0200

Spark's transformation operators, with a worked example

Transformation operator map: elements within a partition are processed in order, while different partitions run with no ordering between them (map fetches one element at a time, which is flexible but inefficient). val result1: RDD[Int] = rdd.map(num => { println(num); num * 2 }) mapPartitions: fetch one partition at a time and cal ...
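A self-contained sketch of the contrast the excerpt draws (local master and sample data are assumptions): map is invoked once per element, mapPartitions once per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapVsMapPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-demo").setMaster("local[2]"))
    val rdd = sc.makeRDD(1 to 8, numSlices = 2)

    // map: the function runs once per element.
    val doubled = rdd.map(num => num * 2)

    // mapPartitions: the function runs once per partition; the iterator
    // walks that partition's elements.
    val doubledByPartition = rdd.mapPartitions(iter => iter.map(_ * 2))

    println(doubled.collect().mkString(","))            // 2,4,6,8,10,12,14,16
    println(doubledByPartition.collect().mkString(",")) // same result, fewer calls
    sc.stop()
  }
}
```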

Posted by ciaran on Sun, 10 Oct 2021 12:07:26 +0200

Spark advanced: using DataFrame and Dataset

Spark advanced (V): using DataFrame and Dataset. DataFrame is a programming abstraction provided by Spark SQL. Like RDD, it is a distributed data collection; unlike RDD, a DataFrame's data is organized into named columns, just like the tables in a relational database. In addition, many kinds of data can be transformed into D ...
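A small sketch of the two abstractions, assuming a toy Person case class and local mode (neither comes from the article): the DataFrame carries named columns checked at runtime, the Dataset adds compile-time types.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; any case class with named fields works.
case class Person(name: String, age: Int)

object DataFrameDatasetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-ds").master("local[*]").getOrCreate()
    import spark.implicits._

    // DataFrame: rows with named columns, like a relational table.
    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    df.filter($"age" > 26).show()

    // Dataset: a typed view of the same kind of data.
    val ds = Seq(Person("alice", 30), Person("bob", 25)).toDS()
    ds.filter(_.age > 26).show()

    spark.stop()
  }
}
```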

Posted by StormS on Thu, 07 Oct 2021 10:51:16 +0200

Spark big data analysis practice - company sales data analysis

Requirements. Suppose a company provides you with the following data: three .txt files, namely date data, order-header data, and order-detail data. You are asked to perform the following analyses on the data provided. 1. Compute the yearly number of sales orders and the total sales amount across all orders. 2. ...
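One possible shape of requirement 1, under invented assumptions (tab-delimited files, an orderId/dateId/amount order-header layout, and a dateId/year date layout — all hypothetical): join orders to dates, then aggregate per year.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SalesAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sales").master("local[*]").getOrCreate()

    // Assumed columns: orderId, dateId, amount.
    val orders = spark.read.option("sep", "\t").csv("data/order_header.txt")
      .toDF("orderId", "dateId", "amount")
    // Assumed columns: dateId, year.
    val dates = spark.read.option("sep", "\t").csv("data/date.txt")
      .toDF("dateId", "year")

    // Requirement 1: order count and total sales per year.
    orders.join(dates, "dateId")
      .groupBy("year")
      .agg(count("orderId").as("orders"),
           sum(col("amount").cast("double")).as("totalSales"))
      .orderBy("year")
      .show()

    spark.stop()
  }
}
```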

Posted by bodzan on Wed, 06 Oct 2021 21:39:02 +0200

Spark learning put into practice - machine learning - predicting music tags with Spark ML's logistic regression (a multiclass classification problem)

The third example uses Spark ML's logistic regression to predict music tags. This is a multiclass classification problem, i.e. there are more than two possible predicted labels. For an introduction to Spark ML and its main concepts, see: Spark ML learning notes - Spark MLlib and Spark ML. 3.1 Data preparation 3.1.1 Dataset file ...
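A hedged pipeline sketch for such a task, with an assumed CSV dataset and made-up feature columns (tempo, loudness, duration); Spark ML's LogisticRegression fits a multinomial model when the indexed label has more than two classes.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object MusicTagLR {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("music-lr").master("local[*]").getOrCreate()

    // Hypothetical schema: numeric audio features plus a string "tag" label.
    val data = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("data/music_features.csv")

    val indexer = new StringIndexer().setInputCol("tag").setOutputCol("label")
    val assembler = new VectorAssembler()
      .setInputCols(Array("tempo", "loudness", "duration")) // assumed columns
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
    val model = new Pipeline().setStages(Array(indexer, assembler, lr)).fit(train)
    model.transform(test).select("tag", "prediction").show(10)

    spark.stop()
  }
}
```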

Posted by sanfly on Tue, 28 Sep 2021 23:38:15 +0200

Spark SQL: a Spark-based API for operating on structured data

Introduction to Spark SQL. Spark SQL is one of the most complex components in the Spark stack. It provides the ability to operate on structured data inside a Spark program, that is, SQL queries. Specifically, Spark SQL has the following three important features: 1. Spark SQL supports reading multiple structured data formats, such as JSON, Parquet ...
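A minimal sketch of the first feature plus a SQL query, assuming a JSON file at data/people.json with name and age fields (both hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

    // Read a structured format directly; path and schema are assumptions.
    val people = spark.read.json("data/people.json")

    // Register a temporary view and query it with plain SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age").show()

    spark.stop()
  }
}
```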

Posted by Xurion on Thu, 23 Sep 2021 06:12:04 +0200

Spark learning notes - core operators

Spark learning notes - core operators (2). The distinct operator: /** * Return a new RDD containing the distinct elements in this RDD. */ def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope { def removeDuplicatesInPartition(partition: Iterator[T]): Iterator[T] = { // Create an instance of extern ...
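To show the operator from the caller's side, a sketch using the classic formulation of distinct as map → reduceByKey → map (the source quoted above uses a newer per-partition ExternalAppendOnlyMap path, but the result is the same):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct").setMaster("local[2]"))
    val rdd = sc.makeRDD(Seq(1, 2, 2, 3, 3, 3), numSlices = 2)

    // distinct shuffles so that equal elements meet in one partition.
    println(rdd.distinct(2).collect().sorted.mkString(","))  // 1,2,3

    // The classic equivalent: key each element, keep one value per key.
    val manual = rdd.map(x => (x, null)).reduceByKey((a, _) => a, 2).map(_._1)
    println(manual.collect().sorted.mkString(","))           // 1,2,3
    sc.stop()
  }
}
```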

Posted by Mikester on Wed, 22 Sep 2021 17:46:44 +0200

Spark notes - reading Hive tables

Environment used in this article: a CentOS server; Jupyter with the Scala kernel spylon-kernel; spark-2.4.0; scala-2.11.12; hadoop-2.6.0. Main contents: Spark reads data from Hive tables, covering reading a Hive table directly with SQL, and reading Hive tables and Hive partitioned tables through their HDFS files. Initialize the sparksession t ...
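A hedged sketch of the two reading styles the excerpt names, assuming a Hive table mydb.sales partitioned by dt and stored as Parquet (database, table, path, and format are all assumptions):

```scala
import org.apache.spark.sql.SparkSession

object ReadHiveTable {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() connects the session to the Hive metastore.
    val spark = SparkSession.builder()
      .appName("read-hive")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    // Option 1: read the table directly with SQL.
    val bySql = spark.sql("SELECT * FROM mydb.sales WHERE dt = '2021-09-01'")
    bySql.show(5)

    // Option 2: read a partition's files straight from HDFS (assumed path and format).
    val byPath = spark.read.parquet("hdfs:///user/hive/warehouse/mydb.db/sales/dt=2021-09-01")
    byPath.show(5)

    spark.stop()
  }
}
```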

Posted by flash99 on Mon, 20 Sep 2021 18:18:29 +0200

Inside Spark Technology: a detailed explanation of Shuffle

Next we look at some finer implementation details. Shuffle is undoubtedly a key point of performance tuning, and this article analyzes the implementation of Spark Shuffle in depth from the perspective of the source code. The upper boundary of each Stage requires either reading data from external storage or r ...
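A small way to see a shuffle-induced stage boundary for yourself (local mode and toy data are assumptions): reduceByKey inserts a ShuffledRDD, and toDebugString shows where the lineage breaks into a new stage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleBoundaryDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle").setMaster("local[2]"))
    val words = sc.makeRDD(Seq("spark", "hive", "spark"), 2).map((_, 1))

    // reduceByKey introduces a ShuffledRDD, which marks a stage boundary.
    val counts = words.reduceByKey(_ + _)

    // toDebugString prints the lineage; the indentation step shows where the
    // new stage begins, i.e. where shuffle data is written and read back.
    println(counts.toDebugString)
    sc.stop()
  }
}
```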

Posted by rbrown on Wed, 08 Sep 2021 04:50:23 +0200

Collect, Cache and Persist in Spark

All of them either gather data or pull it into storage; here we mark out their respective roles. Collect: /** * Return an array that contains all of the elements in this RDD. * * @note This method should only be used if the resulting array is expected to be small, as * all the data is loaded into the driver's memory. ...
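A sketch that puts the three side by side (local mode and toy data are assumptions): cache is shorthand for persist at the default storage level, persist lets you choose the level, and collect pulls results to the driver.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object CollectCachePersist {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[2]"))
    val rdd = sc.makeRDD(1 to 1000).map(_ * 2)

    // cache() == persist(StorageLevel.MEMORY_ONLY) for RDDs: keep the data on
    // the executors so later actions reuse it instead of recomputing the lineage.
    rdd.cache()

    // persist lets you pick the storage level explicitly, e.g. spill to disk.
    val persisted = sc.makeRDD(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)

    // collect() pulls everything to the driver; only safe for small results.
    val small = rdd.filter(_ % 500 == 0).collect()
    println(small.mkString(","))
    println(persisted.count())
    sc.stop()
  }
}
```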

Posted by Brian W on Tue, 30 Jun 2020 05:44:38 +0200