2021-07-23 / 2021-07-26 revisit the zipper table | learn Spark SQL | learn Git Commit

Posted by WhiteShade6 on Sat, 15 Jan 2022 08:08:28 +0100

Zipper Table

First the day's data is loaded into the incremental table, then it is joined with the existing history table into a temporary table, and finally the temporary table is written back into the history table. The temporary table is built in three steps:
1. Determine the expired records (rows whose key also appears in the increment have their dw_end_date closed at the batch date).
2. Determine the valid records (the increment's rows become the current versions, open-ended at '9999-12-31').
3. Combine the two result sets with UNION ALL.

DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
    -- Step 1: close out (expire) the records that changed on this day
    SELECT a.orderid,
    a.createtime,
    a.modifiedtime,
    a.status,
    a.dw_start_date,
    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2016-08-21' THEN '2016-08-21' ELSE a.dw_end_date END AS dw_end_date
    FROM dw_orders_his a
    left outer join (SELECT * FROM ods_orders_inc WHERE day = '2016-08-21') b
    ON (a.orderid = b.orderid)
    
    UNION ALL
    
    -- Step 2: take the day's records from the increment as the current valid versions
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    modifiedtime AS dw_start_date,
    '9999-12-31' AS dw_end_date
    FROM ods_orders_inc
    WHERE day = '2016-08-21'
    
) x
ORDER BY orderid,dw_start_date;
 
INSERT OVERWRITE TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;
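
Once the history table has been refreshed, the current snapshot can be read back by filtering on the open-ended end date. A minimal sketch, assuming a SparkSession named spark with Hive support so that dw_orders_his is visible:

// Sketch: reading the current snapshot from the zipper table.
// Rows whose dw_end_date is still the open-ended '9999-12-31' are the latest versions.
spark.sql(
  """SELECT orderid, status, dw_start_date, dw_end_date
    |FROM dw_orders_his
    |WHERE dw_end_date = '9999-12-31'""".stripMargin
).show()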

Spark SQL

DataFrame

What is DataFrame

DataFrame is a distributed data set built on RDD, similar to a two-dimensional table in a traditional database.
The main difference between DataFrame and RDD is that the former carries schema metadata: each column of the two-dimensional table represented by a DataFrame has a name and a type.

RDD ==> DataFrame

When developing programs in IDEA, converting between RDDs and DataFrames/DataSets requires importing import spark.implicits._. The spark here is not a package name in Scala but the name of the SparkSession object that was created, so the SparkSession object must be created before the import. The spark object cannot be declared with var, because Scala only supports importing members of val-decorated (stable) objects.
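
A minimal sketch of that setup in a standalone application (the application name and master below are only placeholders):

// Create the SparkSession first, then import its implicits.
// Note: spark must be a val; members of a var cannot be imported.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-df-demo")      // placeholder application name
  .master("local[*]")          // placeholder master for local testing
  .getOrCreate()

import spark.implicits._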

//In actual development, an RDD is usually converted to a DataFrame through a case class
scala> case class User(name:String, age:Int)
defined class User
scala> sc.makeRDD(List(("zhangsan",30), ("lisi",40))).map(t=>User(t._1, t._2)).toDF.show

DataFrame ==> RDD

//A DataFrame is essentially a wrapper around an RDD, so the underlying RDD can be obtained directly
scala> val df = sc.makeRDD(List(("zhangsan",30), ("lisi",40))).map(t=>User(t._1, t._2)).toDF

df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> val rdd = df.rdd

rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[46] 
at rdd at <console>:25

scala> val array = rdd.collect

array: Array[org.apache.spark.sql.Row] = Array([zhangsan,30], [lisi,40])

//Note: the RDD storage type obtained at this time is Row

scala> array(0)

res28: org.apache.spark.sql.Row = [zhangsan,30]

scala> array(0)(0)

res29: Any = zhangsan

scala> array(0).getAs[String]("name")

res30: String = zhangsan
	

DataSet

DataSet is a strongly typed data set, and the corresponding type information needs to be provided.

RDD ==> DataSet

Spark SQL can automatically convert an RDD that contains case class instances into a DataSet. The case class defines the structure of the table: its attributes become the column names of the table through reflection. Case classes can contain complex structures such as Seq or Array, as the sketch after the next example shows.

scala> case class User(name:String, age:Int)

defined class User

scala> sc.makeRDD(List(("zhangsan",30), ("lisi",49))).map(t=>User(t._1, t._2)).toDS

res11: org.apache.spark.sql.Dataset[User] = [name: string, age: int]
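
A quick sketch of a case class with a nested Seq field (Order is made up here for illustration; the nested column shows up as an array type in the schema):

case class Order(id: Int, items: Seq[String])

// The Seq field is reflected into an array<string> column.
sc.makeRDD(List(Order(1, Seq("apple", "pear")))).toDS.printSchema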

DataSet ==> RDD

A DataSet is also a wrapper around an RDD, so the underlying RDD can be obtained directly.

scala> case class User(name:String, age:Int)

defined class User

scala> sc.makeRDD(List(("zhangsan",30), ("lisi",49))).map(t=>User(t._1, t._2)).toDS

res11: org.apache.spark.sql.Dataset[User] = [name: string, age: int]

scala> val rdd = res11.rdd

rdd: org.apache.spark.rdd.RDD[User] = MapPartitionsRDD[51] at rdd at 
<console>:25

scala> rdd.collect

res12: Array[User] = Array(User(zhangsan,30), User(lisi,49))

Relationship among RDD, DataFrame and DataSet

Similarities

∙ RDD, DataFrame and DataSet are all distributed, resilient data sets on the Spark platform, and all of them make processing very large data convenient.
∙ All three are lazily evaluated: creating them and applying transformations such as map does not execute anything immediately; traversal only starts when an action such as foreach is encountered.
∙ The three share many common functions, such as filter, sorting, etc.
∙ When working with DataFrame and Dataset, the import spark.implicits._ package is required (best imported right after creating the SparkSession object).
∙ All three automatically cache computations according to Spark's memory situation, so even with a large amount of data there is no need to worry about memory overflow.
∙ All three have the concept of partitions.
∙ Both DataFrame and DataSet can use pattern matching to obtain the value and type of each field (see the sketch below).
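
For example, a rough sketch (df is the DataFrame and ds a Dataset[User] built as in the examples above, with spark.implicits._ in scope):

import org.apache.spark.sql.Row

// DataFrame: match on the generic Row and annotate the field types
df.map { case Row(name: String, age: Int) => s"$name is $age" }.show()

// Dataset[User]: match directly on the case class
ds.map { case User(name, age) => s"$name is $age" }.show()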

Differences among the three

  1. RDD
    RDD is generally used together with spark mllib
    RDD does not support SparkSQL operations

  2. DataFrame
    Unlike RDD and Dataset, the type of every row of a DataFrame is fixed as Row, so the value of a column cannot be
    accessed directly; each field can only be obtained by parsing the Row
    DataFrame and DataSet are generally not used together with spark mllib
    Both DataFrame and DataSet support SparkSQL operations such as select and groupBy, and both can be registered
    as temporary tables / views for SQL statement operations
    DataFrame and DataSet also support some particularly convenient save methods, such as saving as CSV with the
    header included, so that the field name of each column is clear at a glance (see the sketch after this list)

  3. DataSet
    Dataset and DataFrame have exactly the same member functions; the only difference is the data type of each row.
    DataFrame is in fact a special case of DataSet: type DataFrame = Dataset[Row]
    So a DataFrame can also be called Dataset[Row]. The type of each row is Row, which is left unparsed: which
    fields a row contains and what type each field has are unknown, and a specific field can only be taken out
    with the getAs method mentioned above or with the pattern matching mentioned in the similarities.
    In a Dataset the type of each row is not fixed; after defining a case class, the information of each row can
    be accessed freely.
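
As an illustration of the save methods mentioned in the list, a rough sketch (reusing df from the earlier examples; the output path /tmp/users_csv is only an example):

// Write a DataFrame as CSV with a header row, then read it back.
df.write
  .option("header", "true")
  .mode("overwrite")
  .csv("/tmp/users_csv")

val back = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/users_csv")

back.printSchema()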

Converting among the three
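
Putting the conversions together, a rough sketch (reusing the User case class from above and assuming spark.implicits._ has been imported):

val rdd = sc.makeRDD(List(("zhangsan", 30), ("lisi", 40)))

val df   = rdd.map(t => User(t._1, t._2)).toDF()   // RDD       ==> DataFrame
val ds   = df.as[User]                             // DataFrame ==> DataSet
val df2  = ds.toDF()                               // DataSet   ==> DataFrame
val rdd1 = ds.rdd                                  // DataSet   ==> RDD[User]
val rdd2 = df.rdd                                  // DataFrame ==> RDD[Row]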

Git Commit

Commit specification

Use the Git Commit Guidelines as the commit specification.

Commit Message format

<type>(<scope>): <subject>
<Blank line>
<body>
<Blank line>
<footer>

Type

feat: a new feature

fix: a bug fix

docs: documentation-only changes

style: formatting (changes that do not affect how the code runs)

refactor: refactoring (a code change that neither adds a feature nor fixes a bug)

test: adding tests

chore: changes to the build process or auxiliary tooling

Scope

It describes the scope affected by this commit, i.e. briefly which parts of the project are touched by the change. Judging from real projects, it has basically become a required item.

Subject

It briefly describes the change. A short summary is enough, because the details are given in the Body later. It is best to follow three rules:

Start with a verb in the imperative present tense, e.g. change, not changed or changes

Do not capitalize the first letter

No period (.) at the end

Body

The Body expands on the Subject, describing the change in more detail. It should include the motivation for the modification and a comparison of the behaviour before and after.

Footer

The Footer mainly holds information about incompatible (breaking) changes and the Issues closed by this commit.
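
Putting the parts together, a hypothetical commit message following this format might look like this (the scope, wording and issue number are made up for illustration):

feat(user): add password reset endpoint

Allow users to request a password reset link by email.
Previously the only recovery path was contacting support.

Closes #123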
