MySQL 8 window function syntax and example explanation

Posted by gloveny on Mon, 17 Jan 2022 01:48:16 +0100

1, What is the window function?

Window function is a special kind of function, which is used with the over keyword. As the name suggests, the window function is executed only in the window that meets the conditions. The object is each record in the window.

A window can be understood as a collection of records, which contains several records that meet constraints. Windows are also divided into static windows and dynamic windows. Static windows mean that the records in the window are determined according to the constraints and will not change, that is, the window size is fixed; The records in the dynamic window will change dynamically according to the current records and constraints, that is, the window size is changing.

2, Simple application of window function

1. Table building

create table store
(
storeId varchar(255),
storeName varchar(255),
productId varchar(255),
productName varchar(255),
productPrice float,
productNum int
)

Table data can be self fetched
Link: https://pan.baidu.com/s/1GkJUghCVkUR8CfvJGHiYgg
Extraction code: yuan
This table shows the book sales of each store of a company

2. Basic syntax of window function

Basic syntax: sequence number function "over" (partition by [grouping column 1, grouping column 2, ····] order by [sorting column 1, sorting column 2, ····] asc|desc)

over keyword: used to specify the window range of function execution. If nothing is written in the parentheses, it means that the window contains all rows that meet the WHERE condition, and the window function calculates based on all rows; If it is not empty, the following four syntax are supported to set the window.

  • partition by [grouping column 1, grouping column 2, ···]: this clause is used to specify which fields (one or more) the window is grouped according to,
  • order by [sort column 1, sort column 2, ···] asc|desc: this clause is used to specify which fields the records in the window are sorted according to. When there are multiple fields after it, the priority is to sort according to "sort column 1", and then sort according to "sort column 2".
  • frame: this clause is used to specify the partition in the window, and there are few usage scenarios; For details, please refer to Interpretation of new features | MySQL 8.0 window function framework usage
  • Window: when the clause after over is too long, you can use the window keyword to alias the current window.

3. Sequence number function

(1)row_number()

Demand: arrange the sales volume of each store in descending order,

When we use the group by clause to sort, the result will only retain the first data of each group. As follows:

select * from store 
GROUP BY storeId
ORDER BY productNum desc

Use the window function to sort within the group. As a result, each piece of data in the group will be retained, so it is easy to sort within the group

select *,row_number() over w as sorted
from store
window  w  as (partition by storeId ORDER BY productNum desc)--The current window is aliased

(2) Rank () and deny_ rank()

These two functions are related to row_number differs in that the same value is treated differently when sorting.

rank() function

select *,rank() over w as sorted
from store
window  w  as (partition by storeId ORDER BY productNum desc)--The current window is aliased


dense_rank()

select *,dense_rank() over w as sorted
from store
window  w  as (partition by storeId ORDER BY productNum desc)


Summary: rank() will generate records with the same serial number, and may generate serial number gaps; And dense_rank() will also produce records with the same sequence number, but will not produce sequence number gaps.

4. Distribution function

percent_rank()

Usage: related to the previous RANK() function, each line is calculated according to the following formula: (rank - 1) / (rows - 1). Where rank is the sequence number generated by RANK() function, and rows is the total number of records in the current window.

select *,rank() over w as row_rank,percent_rank() over w as percent_row_rank
from store
window  w as (partition by storeId ORDER BY productNum desc)--The current window is aliased

From the results, percent_ row_ The rank column brings in the rank value (row_rank column) and the rows value according to the formula (rank - 1) / (rows - 1) (the windows with storeId 1,2,3 only contain 4 records, that is, rows= 4).


cume_dist()

Usage: the number of rows in the window less than or equal to the rank value of the current row / the number of total rows in the window.

select *,rank() over w as row_rank,cume_dist() over w as cume_rank
from store
window  w as (partition by storeId ORDER BY productNum desc)  --The current window is aliased


From the results, cume_ The rank column follows the formula: the number of rows in the group less than or equal to the current rank value / the total number of rows in the group. Readers can substitute the data for verification.

5. Front and back functions

Purpose: lag (field 1,n) returns the value corresponding to field 1 of the record in line n before the current line in the window; Lead (field 2,n) returns the value corresponding to field 2 of the record in line n (lead) after the current line.

select store.*,
lag(productName,2) over w as lag_product,
lead(productName,2) over w as lead_product ,
row_number() over() as `Line number` 
from store
window  w as (partition by storeId ORDER BY productNum desc)


From the results, we can see that there are three windows, and there are four pieces of data in each window. Take the first window as an example. There is no data before the first data, so lag_ The product attribute is NULL, and there is only one data before the second data, so lag_ The product attribute is also NULL; The second record after the first data is the data with line number equal to 3. The product name attribute value of this data is "operating system", so the lead of the first data_ The product attribute value is "operating system". Similarly, the lead of the second data_ The product attribute value is "Introduction to database system".

6. Head tail function

Head and tail function - first_ Val (field) / last_ Val (field).
Purpose: get the value of the specified field of the first / last data in the window.

select store.*,
first_value(productName) over w as first_productName,
last_value(productName) over w as last_productName
from store
window  w as ( partition by storeId order by productNum desc rows BETWEEN unbounded preceding and unbounded following)
--use frame Clause to specify the boundary of the partition


From the results, we can see that first_value() can directly get the first data of the current window, last_value () can directly get the last record of the current partition.
However, when the partition boundary is not specified using the frame clause, last_value () often returns the data of the current row (the current row is the last row in the partition). Readers can write code for verification.

7. Other functions

nth_value(expr,n)
Usage: returns the value of expr corresponding to the nth data in the window. Expr can be an expression or a column name.

select store.*,
nth_value(productName,2) over w as second_productName,
nth_value(productName,3) over w as third_productName
from store
window  w as ( partition by storeId order by productNum desc rows BETWEEN unbounded preceding and unbounded following)
--use frame Clause to specify the boundary of the partition


As can be seen from the results, nth_value(expr,n) can obtain the value corresponding to the expr attribute of the first data in the current window.
However, when we do not use the frame clause to specify the boundary of the current window, the boundary of the window corresponding to each record is different, which will affect nth_ The return value of value. The author gives an example here, which can be verified by readers in other cases.

select store.*,nth_value(productName,2) over w as second_productName,nth_value(productName,3) over w as third_productName
from store
window  w as ( partition by storeId order by productNum desc)
--not used frame clause

The results are as follows. There are many application scenarios for this function, such as finding out the second and third place of students in a class.

ntile(n)
Usage: divide the ordered data in the window into n and record the bucket number.

select store.*,ntile(2) over w as block
from store
window  w as ( partition by storeId order by productNum desc)

This function is widely used in data analysis. For example, due to the large amount of data, it is necessary to evenly distribute the data to N parallel processes for calculation respectively. At this time, NTILE(N) can be used to group the data. Since the number of records is not necessarily divided by N, the data may not be completely even, and then redistribute the data of different bucket numbers.

8. Aggregate function as window function

select store.*,
sum(productNum) over w as sum_col ,
count(*) over w as count_col, 
avg(productNum) over w as avg_col,
max(productNum) over w as max_col
from store
window  w as ( partition by storeId order by productNum desc)

summary

The functions of window functions are extremely powerful and the application scenarios are complex. This paper just briefly introduces the syntax and examples of various window functions. On this basis, readers should understand the meaning and application of each function in combination with the actual business scenario.

Topics: Database SQL