grammar
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [ORDER BY col_list] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT number]
WHERE
Similar to SQL
Extended RLIKE supports regular expressions
sort
order by
Global sorting, only one Reducer
sort by
Each reducer is sorted internally. Each reducer is ordered internally, but not necessarily globally
distribute by
Partition field, which controls that a specific row is output to the specified reducer and appears before sort by
cluster by
When the distribution by and sort by fields are consistent, cluster by can be used instead, but cluster by can only be arranged in ascending order, and asc or desc cannot be specified
function
function | explain |
---|---|
NVL | Assign a value to the data whose value is NULL, and its format is NVL (value, default_value). Its function is to return default if value is NULL_ Value, otherwise return the value of value. If both parameters are NULL, return NULL. |
CASE WHEN THEN ELSE END | case column when val1 then ... when val2 then ...else ...end case when condition1 then ...when condition2 ...else ...end |
CONCAT | concat(str1, str2,... strN) |
CONCAT_WS | CONCAT upgrade, CONCAT(separator, [string | array(string)] +) |
EXPLODE(col) | Split the complex Array or Map structure in the hive column into multiple rows. |
LATERAL VIEW | Usage: final view udtf (expression) tablealias as columnalias Explanation: used with split, expand and other udtfs. It can split a column of data into multiple rows of data. On this basis, it can aggregate the split data. |
NVL
Assuming that there are the following data, the score table, and those who do not take the exam are null by default
Zhang San 100 Li Si Wang WU80
Count the scores of all students. If you don't take the exam, the score will be 0
select name,NVL(score,0) as score from test10;
CASE WHEN THEN ELSE END
Statistics on the number of men and women in various sectors are as follows:
name dept_id sex name of a fictitious monkey with supernatural powers A male sea A male Song song B male Miss Luo Yu feng A female Sister Ting B female Tingting B female
select dept_id, sum(case sex when 'male' then 1 else 0 end) as man_count, sum(case sex when 'female' then 1 else 0 end) as woman_count from test11 group by dept_id; ## case when another way to write select dept_id, sum(case when sex='male' then 1 else 0 end)as man_count, sum(case when sex='female' then 1 else 0 end) as woman_count from test11 group by dept_id;
CONCAT/CONCAT_WS
Continue to use case when case data. Requirements: splice each row of data into the following format data to form a row of text, which is used for each column and split.
name of a fictitious monkey with supernatural powers,A,male ...
select CONCAT(name,',',dept_id,',',sex) from test11; ## Equivalent to select CONCAT_WS(',',name,dept_id,sex) from test11;
EXPLODE/LATERAL VIEW
Case:
Suppose we have a table that describes movie information
movie category <Suspect tracking Suspense,action,science fiction,plot <Lie to me> Suspense,gangster ,action,psychology,plot <War wolf 2 Warfare,action,disaster
Create table statement
create table test12( movie string, category array<string> )row format delimited fields terminated by '\t' collection items terminated by ',' lines terminated by '\n';
Query: splitting a single field
hive (study)> select explode(category) as category from test12; OK category Suspense action science fiction plot Suspense gangster action psychology plot Warfare action disaster Time taken: 0.225 seconds, Fetched: 12 row(s)
If we want to display the movie name and classification column of the table data in a row, we need to use the profile table
select movie, category_name from test12 lateral view explode(category) movie_info_tmp as category_name;
Windowing function
over()
Case data, order data
name orderdate cost jack,2017-01-01,10 tony,2017-01-02,15 jack,2017-02-03,23 tony,2017-01-04,29 jack,2017-01-05,46 jack,2017-04-06,42 tony,2017-01-07,50 jack,2017-01-08,55 mart,2017-04-08,62 mart,2017-04-09,sixty-eight neil,2017-05-10,12 mart,2017-04-11,75 neil,2017-06-12,80 mart,2017-04-13,ninety-four
Requirements:
(1) Query the customers and total number of people who purchased in April 2017
(2) Query the customer's purchase details and monthly total purchase amount
(3) In the above scenario, the cost of each customer is accumulated according to the date
(4) Query the last purchase time of each customer
(5) Order information 20% before query
Problem solving:
(1) Query the customers and total number of people who purchased in April 2017
Analysis: the key point is to find out that the total number of customers purchased in April 2017 is the aggregate de duplication value, and the customers purchased are specific customers. We can get the specific list if we first group the duplicates, and then use over() to open the window as a whole to get the total number of people.
select name, count(name) over() as total_count from test13 where orderdate between '2017-04-01' and '2017-04-30' group by name;
(2) Query the customer's purchase details and monthly total purchase amount
Analysis: the current table is the purchase details of customers, and the total monthly purchase amount is the cumulative purchase of each customer every month. Window opening is to open the window every month according to each customer to calculate the cumulative value
### Because they are all in the same year, the month (order date) partition can be used select name,orderdate,cost ,sum(cost) over(partition by month(orderdate)) as month_cost_sum from test13; ## Standard answer select name,orderdate,cost ,sum(cost) over(partition by substring(orderdate,1,7)) as month_cost_sum from test13;
(3) In the above scenario, the cost of each customer is accumulated according to the date
select name,orderdate,cost ,sum(cost) over(partition by name order by orderdate) as cost_sum from test13;
extend
select *, --Add all rows sum(cost) over() as sample1, --Press name Grouping, intra group data addition sum(cost) over(partition by name) as sample2, --Press name Grouping, intra group data accumulation sum(cost) over(partition by name order by orderdate) as sample3, --and sample3 equally,Aggregation from starting point to current row sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 , --Aggregate the current row with the previous row sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --The current line and the previous and subsequent lines sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6, --Current line and all subsequent lines sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 from test13;
(4) Query the last purchase time of each customer
select name, orderdate, lag(orderdate,1,'1970-01-01') over(partition by name order by orderdate) as last_orderdate from test13;
(5) Order information 20% before query
select * from (select name,orderdate,cost ,ntile(5) over(order by orderdate) sort from test13)t1 where sort = 1;
Rank function
function | explain |
---|---|
rank() | If the sorting is the same, there will be duplication, and the total number will remain the same |
dense_rank() | If the sorting is the same, there will be duplication and the total number will be reduced |
row_number() | In order |
Data: subject transcript
name subject score Monkey King language eighty-seven Monkey King math ninety-five Monkey king English sixty-eight Sea language ninety-four Ocean mathematics fifty-six Sea English eighty-four Song language sixty-four Song mathematics eighty-six Song English eighty-four Tingting language sixty-five Tingting mathematics eighty-five Tingting English seventy-eight
Compare the effects of the previous function
rank(): duplicates occur when the sorting is the same, and the total number remains the same.
The score tied for sixth, no seventh. The total number is 12, and the total number remains unchanged.
hive (study)> select *,rank() over(order by score) from test14; ------------------------- test14.name test14.subject test14.score rank_window_0 sea mathematics fifty-six.0 1 Song song language sixty-four.0 2 Tingting language sixty-five.0 3 Monkey king English sixty-eight.0 4 Tingting English seventy-eight.0 5 Song song English eighty-four.0 6 sea English eighty-four.0 6 Tingting mathematics eighty-five.0 8 Song song mathematics eighty-six.0 9 Monkey King language eighty-seven.0 10 sea language ninety-four.0 11 Monkey King mathematics ninety-five.0 12 Time taken: 18.422 seconds, Fetched: 12 row(s)
dense_rank(): duplicates occur when the sorting is the same, and the total quantity decreases.
The number 6 is tied, and then the sequence is continued. Due to the juxtaposition of the total quantity, the total quantity is reduced.
hive (study)> select *,dense_rank() over(order by score) from test14; ----------------------- test14.name test14.subject test14.score dense_rank_window_0 sea mathematics fifty-six.0 1 Song song language sixty-four.0 2 Tingting language sixty-five.0 3 Monkey king English 68.0 4 Tingting English seventy-eight.0 5 Song song English eighty-four.0 6 sea English eighty-four.0 6 Tingting mathematics eighty-five.0 7 Song song mathematics eighty-six.0 8 Monkey King language eighty-seven.0 9 sea language 94.0 10 Monkey King mathematics ninety-five.0 11
row_number(): arranged in order without repetition.
There will be no juxtaposition.
hive (study)> select *,row_number() over(order by score) from test14; ------------------ test14.name test14.subject test14.score row_number_window_0 sea mathematics 56.0 1 Song song language 64.0 2 Tingting language 65.0 3 Monkey king English 68.0 4 Tingting English 78.0 5 Song song English 84.0 6 sea English 84.0 7 Tingting mathematics 85.0 8 Song song mathematics 86.0 9 Monkey King language 87.0 10 sea language 94.0 11 Monkey King mathematics 95.0 12
Requirements:
Calculate the score ranking of each subject.
Analysis: rank each discipline
select name,subject,score, rank() over(partition by subject order by score) as rank from test14;
Other common functions
Date function
function | explain | case |
---|---|---|
unix_timestamp | Returns the unix timestamp of the current time, deprecated. Use current_timestamp substitution | select unix_timestamp(); select unix_timestamp("2020-10-28",'yyyy-MM-dd'); |
from_unixtime | Timestamp to date | select from_unixtime(1603843200); |
current_date | Get current date | select current_date(); |
current_timestamp | Get current date and time | select current_timestamp; |
to_date | Extract date section | select to_date('2020-10-28 12:12:12'); |
year | The year portion of the extraction date | select year('2020-10-28 12:12:12'); |
month | Extract the month portion of the date | select month('2020-10-28 12:12:12'); |
day | Extract the day portion of the date | select day('2020-10-28 12:12:12'); |
hour | Hour of extraction date | select hour('2020-10-28 12:12:12'); |
minute | Minutes of extraction date | select minute('2020-10-28 12:13:14'); |
second | The second of the extracted date | select second('2020-10-28 12:13:14'); |
weekofyear | The specified date is the week ordinal of the current year | select weekofyear('2020-10-28 12:13:14'); |
dayofmonth | The specified date is the day ordinal of the current month | select dayofmonth('2020-10-28 12:13:14'); |
months_between | Month between two dates | select months_between('2020-04-01','2020-10-28'); |
add_months | Date plus or minus month | select add_months('2020-10-28',-3); |
datediff | Number of days between two dates | select datediff('2020-11-04','2020-10-28'); |
date_add | Date plus days | select date_add('2020-10-28',4); |
date_sub | Date minus days | select date_sub('2020-10-28',-4); |
last_day | Date is the last day of the month | select last_day('2020-02-30'); |
date_format | format date | select date_format('2020-10-28 12:12:12','yyyy/MM/dd HH:mm:ss'); |
Numerical function
function | explain | case |
---|---|---|
round | rounding | select round(3.14); select round(3.54); |
ceil | Round up | select ceil(3.14); |
floor | Round down | select floor(3.14); |
String operation function
function | explain | case |
---|---|---|
upper | Capitalize | select upper('low'); |
lower | Turn lowercase | select lower('low'); |
length | length | select length("atguigu"); |
trim | Remove spaces before and after | select trim(" atguigu "); |
lpad | Fill to the left to the specified length | select lpad('atguigu',9,'g'); |
rpad | Fill to the right to the specified length | select rpad('atguigu',9,'g'); |
regexp_replace | Use regular expression to match the target string. Replace after successful matching! | SELECT regexp_replace('2020/10/25', '/', '-'); |
Set function
function | explain | case |
---|---|---|
size | Number of elements in the collection | select size(friends) from test3; |
map_keys | Returns the key in the map | select map_keys(children) from test3; |
map_values | Returns the value in the map | select map_values(children) from test3; |
array_contains | Determine whether the array contains an element | select array_contains(friends,'bingbing') from test3; |
sort_array | Sort the elements in the array | select sort_array(friends) from test3; |
grouping sets | Multidimensional analysis |
grouping sets
It can be understood as optimizing the SQL writing method of multiple groups by
select col1,col2,count(*) as count from tab1 group by col1,col2 union select col1,null,count(*) as count from tab1 group by col1; ## Equivalent to select col1,col2,count(*) as count from tab1 group by col1,col2 grouping sets((col1,col2),(col1));