Introduction to big data -- Hive data query

Posted by kingconnections on Sun, 16 Jan 2022 10:29:28 +0100

grammar

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

WHERE

Similar to SQL
Extended RLIKE supports regular expressions

sort

order by

Global sorting, only one Reducer

sort by

Each reducer is sorted internally. Each reducer is ordered internally, but not necessarily globally

distribute by

Partition field, which controls that a specific row is output to the specified reducer and appears before sort by

cluster by

When the distribution by and sort by fields are consistent, cluster by can be used instead, but cluster by can only be arranged in ascending order, and asc or desc cannot be specified

function

function	explain
NVL	Assign a value to the data whose value is NULL, and its format is NVL (value, default_value). Its function is to return default if value is NULL_ Value, otherwise return the value of value. If both parameters are NULL, return NULL.
CASE WHEN THEN ELSE END	case column when val1 then ... when val2 then ...else ...end case when condition1 then ...when condition2 ...else ...end
CONCAT	concat(str1, str2,... strN)
CONCAT_WS	CONCAT upgrade, CONCAT(separator, [string \| array(string)] +)
EXPLODE(col)	Split the complex Array or Map structure in the hive column into multiple rows.
LATERAL VIEW	Usage: final view udtf (expression) tablealias as columnalias Explanation: used with split, expand and other udtfs. It can split a column of data into multiple rows of data. On this basis, it can aggregate the split data.

NVL

Assuming that there are the following data, the score table, and those who do not take the exam are null by default

Zhang San 100
 Li Si 
Wang WU80

Count the scores of all students. If you don't take the exam, the score will be 0

select name,NVL(score,0) as score from test10;

CASE WHEN THEN ELSE END

Statistics on the number of men and women in various sectors are as follows:

name dept_id sex
 name of a fictitious monkey with supernatural powers A male
 sea A male
 Song song B male
 Miss Luo Yu feng A female
 Sister Ting B female
 Tingting B female

select
    dept_id,
    sum(case sex when 'male' then 1 else 0 end) as man_count,
    sum(case sex when 'female' then 1 else 0 end) as woman_count
from 
test11 
group by dept_id;

## case when another way to write
select 
    dept_id,
    sum(case when sex='male' then 1 else 0 end)as man_count,
    sum(case when sex='female' then 1 else 0 end) as woman_count
from 
    test11
group by 
    dept_id;

CONCAT/CONCAT_WS

Continue to use case when case data. Requirements: splice each row of data into the following format data to form a row of text, which is used for each column and split.

name of a fictitious monkey with supernatural powers,A,male
...

select CONCAT(name,',',dept_id,',',sex) from test11;
## Equivalent to
select CONCAT_WS(',',name,dept_id,sex) from test11;

EXPLODE/LATERAL VIEW

Case:
Suppose we have a table that describes movie information

movie category
<Suspect tracking	Suspense,action,science fiction,plot
<Lie to me>	Suspense,gangster ,action,psychology,plot
<War wolf 2	Warfare,action,disaster

Create table statement

create table test12(
movie string,
category array<string>
)row format delimited
fields terminated by '\t'
collection items terminated by ','
lines terminated by '\n';

Query: splitting a single field

hive (study)> select explode(category) as category from test12;
OK
category
 Suspense
 action
 science fiction
 plot
 Suspense
 gangster 
action
 psychology
 plot
 Warfare
 action
 disaster
Time taken: 0.225 seconds, Fetched: 12 row(s)

If we want to display the movie name and classification column of the table data in a row, we need to use the profile table

select 
    movie,
    category_name
from 
    test12 
lateral view explode(category) movie_info_tmp as category_name;

Windowing function

over()
Case data, order data

name orderdate cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,sixty-eight
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,ninety-four

Requirements:
(1) Query the customers and total number of people who purchased in April 2017
(2) Query the customer's purchase details and monthly total purchase amount
(3) In the above scenario, the cost of each customer is accumulated according to the date
(4) Query the last purchase time of each customer
(5) Order information 20% before query
Problem solving:
(1) Query the customers and total number of people who purchased in April 2017

Analysis: the key point is to find out that the total number of customers purchased in April 2017 is the aggregate de duplication value, and the customers purchased are specific customers. We can get the specific list if we first group the duplicates, and then use over() to open the window as a whole to get the total number of people.

select 
    name,
    count(name) over() as total_count
from 
test13
where orderdate between '2017-04-01' and '2017-04-30'
group by name;

(2) Query the customer's purchase details and monthly total purchase amount

Analysis: the current table is the purchase details of customers, and the total monthly purchase amount is the cumulative purchase of each customer every month. Window opening is to open the window every month according to each customer to calculate the cumulative value

### Because they are all in the same year, the month (order date) partition can be used
select 
    name,orderdate,cost
    ,sum(cost) over(partition by month(orderdate)) as month_cost_sum
from test13;
## Standard answer
select 
    name,orderdate,cost
    ,sum(cost) over(partition by substring(orderdate,1,7)) as month_cost_sum
from test13;

(3) In the above scenario, the cost of each customer is accumulated according to the date

select 
    name,orderdate,cost
    ,sum(cost) over(partition by name order by orderdate) as cost_sum
from test13;

extend

select 
    *,
    --Add all rows
    sum(cost) over() as sample1,
    
    --Press name Grouping, intra group data addition
    sum(cost) over(partition by name) as sample2,
    
    --Press name Grouping, intra group data accumulation
    sum(cost) over(partition by name order by orderdate) as sample3,
    
    --and sample3 equally,Aggregation from starting point to current row
    sum(cost) over(partition by name order by orderdate rows between 
    UNBOUNDED PRECEDING and current row ) as sample4 ,
    
    --Aggregate the current row with the previous row
    sum(cost) over(partition by name order by orderdate rows between 1 
    PRECEDING and current row) as sample5,
    
    --The current line and the previous and subsequent lines
    sum(cost) over(partition by name order by orderdate rows between 1 
    PRECEDING AND 1 FOLLOWING ) as sample6,
    
    --Current line and all subsequent lines
    sum(cost) over(partition by name order by orderdate rows between current 
    row and UNBOUNDED FOLLOWING ) as sample7 
from
test13;

(4) Query the last purchase time of each customer

select 
    name,
    orderdate,
    lag(orderdate,1,'1970-01-01') over(partition by name order by orderdate) as last_orderdate
    from 
test13;

(5) Order information 20% before query

select * from (select 
    name,orderdate,cost 
    ,ntile(5) over(order by orderdate) sort
from test13)t1
where sort = 1;

Rank function

function	explain
rank()	If the sorting is the same, there will be duplication, and the total number will remain the same
dense_rank()	If the sorting is the same, there will be duplication and the total number will be reduced
row_number()	In order

Data: subject transcript

name subject score
 Monkey King language eighty-seven
 Monkey King math ninety-five
 Monkey king English sixty-eight
 Sea language ninety-four
 Ocean mathematics fifty-six
 Sea English eighty-four
 Song language sixty-four
 Song mathematics eighty-six
 Song English eighty-four
 Tingting language sixty-five
 Tingting mathematics eighty-five
 Tingting English seventy-eight

Compare the effects of the previous function
rank(): duplicates occur when the sorting is the same, and the total number remains the same.
The score tied for sixth, no seventh. The total number is 12, and the total number remains unchanged.

hive (study)> select *,rank() over(order by score) from test14;
-------------------------
test14.name     test14.subject  test14.score    rank_window_0
 sea    mathematics    fifty-six.0    1
 Song song    language    sixty-four.0    2
 Tingting    language    sixty-five.0    3
 Monkey king English    sixty-eight.0    4
 Tingting    English    seventy-eight.0    5
 Song song    English    eighty-four.0    6
 sea    English    eighty-four.0    6
 Tingting    mathematics    eighty-five.0    8
 Song song    mathematics    eighty-six.0    9
 Monkey King language    eighty-seven.0    10
 sea    language    ninety-four.0    11
 Monkey King mathematics    ninety-five.0    12
Time taken: 18.422 seconds, Fetched: 12 row(s)

dense_rank(): duplicates occur when the sorting is the same, and the total quantity decreases.
The number 6 is tied, and then the sequence is continued. Due to the juxtaposition of the total quantity, the total quantity is reduced.

hive (study)> select *,dense_rank() over(order by score) from test14;
-----------------------
test14.name     test14.subject  test14.score    dense_rank_window_0
 sea    mathematics    fifty-six.0    1
 Song song    language    sixty-four.0    2
 Tingting    language    sixty-five.0    3
 Monkey king English    68.0    4
 Tingting    English    seventy-eight.0    5
 Song song    English    eighty-four.0    6
 sea    English    eighty-four.0    6
 Tingting    mathematics    eighty-five.0    7
 Song song    mathematics    eighty-six.0    8
 Monkey King language    eighty-seven.0    9
 sea    language    94.0    10
 Monkey King mathematics    ninety-five.0    11

row_number(): arranged in order without repetition.
There will be no juxtaposition.

hive (study)> select *,row_number() over(order by score) from test14;
------------------
test14.name     test14.subject  test14.score    row_number_window_0
 sea    mathematics    56.0    1
 Song song    language    64.0    2
 Tingting    language    65.0    3
 Monkey king English    68.0    4
 Tingting    English    78.0    5
 Song song    English    84.0    6
 sea    English    84.0    7
 Tingting    mathematics    85.0    8
 Song song    mathematics    86.0    9
 Monkey King language    87.0    10
 sea    language    94.0    11
 Monkey King mathematics    95.0    12

Requirements:
Calculate the score ranking of each subject.

Analysis: rank each discipline

select 
    name,subject,score,
    rank() over(partition by subject order by score) as rank 
from test14;

Other common functions

Date function

function	explain	case
unix_timestamp	Returns the unix timestamp of the current time, deprecated. Use current_timestamp substitution	select unix_timestamp(); select unix_timestamp("2020-10-28",'yyyy-MM-dd');
from_unixtime	Timestamp to date	select from_unixtime(1603843200);
current_date	Get current date	select current_date()；
current_timestamp	Get current date and time	select current_timestamp;
to_date	Extract date section	select to_date('2020-10-28 12:12:12');
year	The year portion of the extraction date	select year('2020-10-28 12:12:12');
month	Extract the month portion of the date	select month('2020-10-28 12:12:12');
day	Extract the day portion of the date	select day('2020-10-28 12:12:12');
hour	Hour of extraction date	select hour('2020-10-28 12:12:12');
minute	Minutes of extraction date	select minute('2020-10-28 12:13:14');
second	The second of the extracted date	select second('2020-10-28 12:13:14');
weekofyear	The specified date is the week ordinal of the current year	select weekofyear('2020-10-28 12:13:14');
dayofmonth	The specified date is the day ordinal of the current month	select dayofmonth('2020-10-28 12:13:14');
months_between	Month between two dates	select months_between('2020-04-01','2020-10-28');
add_months	Date plus or minus month	select add_months('2020-10-28',-3);
datediff	Number of days between two dates	select datediff('2020-11-04','2020-10-28');
date_add	Date plus days	select date_add('2020-10-28',4);
date_sub	Date minus days	select date_sub('2020-10-28',-4);
last_day	Date is the last day of the month	select last_day('2020-02-30');
date_format	format date	select date_format('2020-10-28 12:12:12','yyyy/MM/dd HH:mm:ss');

Numerical function

function	explain	case
round	rounding	select round(3.14); select round(3.54);
ceil	Round up	select ceil(3.14);
floor	Round down	select floor(3.14);

String operation function

function	explain	case
upper	Capitalize	select upper('low');
lower	Turn lowercase	select lower('low');
length	length	select length("atguigu");
trim	Remove spaces before and after	select trim(" atguigu ");
lpad	Fill to the left to the specified length	select lpad('atguigu',9,'g');
rpad	Fill to the right to the specified length	select rpad('atguigu',9,'g');
regexp_replace	Use regular expression to match the target string. Replace after successful matching!	SELECT regexp_replace('2020/10/25', '/', '-');

Set function

function	explain	case
size	Number of elements in the collection	select size(friends) from test3;
map_keys	Returns the key in the map	select map_keys(children) from test3;
map_values	Returns the value in the map	select map_values(children) from test3;
array_contains	Determine whether the array contains an element	select array_contains(friends,'bingbing') from test3;
sort_array	Sort the elements in the array	select sort_array(friends) from test3;
grouping sets	Multidimensional analysis

grouping sets
It can be understood as optimizing the SQL writing method of multiple groups by

select col1,col2,count(*) as count from tab1 group by col1,col2
union 
select col1,null,count(*) as count from tab1 group by col1;

## Equivalent to
select col1,col2,count(*) as count from tab1  group by col1,col2
grouping sets((col1,col2),(col1));

Topics: Big Data hive

Programmer Think