Introduction to big data -- Hive data query

Posted by kingconnections on Sun, 16 Jan 2022 10:29:28 +0100

grammar

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

WHERE

Similar to SQL
Extended RLIKE supports regular expressions

sort

order by

Global sorting, only one Reducer

sort by

Each reducer is sorted internally. Each reducer is ordered internally, but not necessarily globally

distribute by

Partition field, which controls that a specific row is output to the specified reducer and appears before sort by

cluster by

When the distribution by and sort by fields are consistent, cluster by can be used instead, but cluster by can only be arranged in ascending order, and asc or desc cannot be specified

function

functionexplain
NVLAssign a value to the data whose value is NULL, and its format is NVL (value, default_value). Its function is to return default if value is NULL_ Value, otherwise return the value of value. If both parameters are NULL, return NULL.
CASE WHEN THEN ELSE ENDcase column when val1 then ... when val2 then ...else ...end
case when condition1 then ...when condition2 ...else ...end
CONCATconcat(str1, str2,... strN)
CONCAT_WSCONCAT upgrade, CONCAT(separator, [string | array(string)] +)
EXPLODE(col)Split the complex Array or Map structure in the hive column into multiple rows.
LATERAL VIEWUsage: final view udtf (expression) tablealias as columnalias
Explanation: used with split, expand and other udtfs. It can split a column of data into multiple rows of data. On this basis, it can aggregate the split data.

NVL

Assuming that there are the following data, the score table, and those who do not take the exam are null by default

Zhang San 100
 Li Si 
Wang WU80

Count the scores of all students. If you don't take the exam, the score will be 0

select name,NVL(score,0) as score from test10; 

CASE WHEN THEN ELSE END

Statistics on the number of men and women in various sectors are as follows:

name dept_id sex
 name of a fictitious monkey with supernatural powers A male
 sea A male
 Song song B male
 Miss Luo Yu feng A female
 Sister Ting B female
 Tingting B female
select
    dept_id,
    sum(case sex when 'male' then 1 else 0 end) as man_count,
    sum(case sex when 'female' then 1 else 0 end) as woman_count
from 
test11 
group by dept_id;

## case when another way to write
select 
    dept_id,
    sum(case when sex='male' then 1 else 0 end)as man_count,
    sum(case when sex='female' then 1 else 0 end) as woman_count
from 
    test11
group by 
    dept_id; 

CONCAT/CONCAT_WS

Continue to use case when case data. Requirements: splice each row of data into the following format data to form a row of text, which is used for each column and split.

name of a fictitious monkey with supernatural powers,A,male
...
select CONCAT(name,',',dept_id,',',sex) from test11;
## Equivalent to
select CONCAT_WS(',',name,dept_id,sex) from test11;

EXPLODE/LATERAL VIEW

Case:
Suppose we have a table that describes movie information

movie category
<Suspect tracking	Suspense,action,science fiction,plot
<Lie to me>	Suspense,gangster ,action,psychology,plot
<War wolf 2	Warfare,action,disaster

Create table statement

create table test12(
movie string,
category array<string>
)row format delimited
fields terminated by '\t'
collection items terminated by ','
lines terminated by '\n';

Query: splitting a single field

hive (study)> select explode(category) as category from test12;
OK
category
 Suspense
 action
 science fiction
 plot
 Suspense
 gangster 
action
 psychology
 plot
 Warfare
 action
 disaster
Time taken: 0.225 seconds, Fetched: 12 row(s)

If we want to display the movie name and classification column of the table data in a row, we need to use the profile table

select 
    movie,
    category_name
from 
    test12 
lateral view explode(category) movie_info_tmp as category_name;

Windowing function

over()
Case data, order data

name orderdate cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,sixty-eight
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,ninety-four

Requirements:
(1) Query the customers and total number of people who purchased in April 2017
(2) Query the customer's purchase details and monthly total purchase amount
(3) In the above scenario, the cost of each customer is accumulated according to the date
(4) Query the last purchase time of each customer
(5) Order information 20% before query
Problem solving:
(1) Query the customers and total number of people who purchased in April 2017

Analysis: the key point is to find out that the total number of customers purchased in April 2017 is the aggregate de duplication value, and the customers purchased are specific customers. We can get the specific list if we first group the duplicates, and then use over() to open the window as a whole to get the total number of people.

select 
    name,
    count(name) over() as total_count
from 
test13
where orderdate between '2017-04-01' and '2017-04-30'
group by name; 

(2) Query the customer's purchase details and monthly total purchase amount

Analysis: the current table is the purchase details of customers, and the total monthly purchase amount is the cumulative purchase of each customer every month. Window opening is to open the window every month according to each customer to calculate the cumulative value

### Because they are all in the same year, the month (order date) partition can be used
select 
    name,orderdate,cost
    ,sum(cost) over(partition by month(orderdate)) as month_cost_sum
from test13;
## Standard answer
select 
    name,orderdate,cost
    ,sum(cost) over(partition by substring(orderdate,1,7)) as month_cost_sum
from test13;

(3) In the above scenario, the cost of each customer is accumulated according to the date

select 
    name,orderdate,cost
    ,sum(cost) over(partition by name order by orderdate) as cost_sum
from test13;

extend

select 
    *,
    --Add all rows
    sum(cost) over() as sample1,
    
    --Press name Grouping, intra group data addition
    sum(cost) over(partition by name) as sample2,
    
    --Press name Grouping, intra group data accumulation
    sum(cost) over(partition by name order by orderdate) as sample3,
    
    --and sample3 equally,Aggregation from starting point to current row
    sum(cost) over(partition by name order by orderdate rows between 
    UNBOUNDED PRECEDING and current row ) as sample4 ,
    
    --Aggregate the current row with the previous row
    sum(cost) over(partition by name order by orderdate rows between 1 
    PRECEDING and current row) as sample5,
    
    --The current line and the previous and subsequent lines
    sum(cost) over(partition by name order by orderdate rows between 1 
    PRECEDING AND 1 FOLLOWING ) as sample6,
    
    --Current line and all subsequent lines
    sum(cost) over(partition by name order by orderdate rows between current 
    row and UNBOUNDED FOLLOWING ) as sample7 
from
test13;

(4) Query the last purchase time of each customer

select 
    name,
    orderdate,
    lag(orderdate,1,'1970-01-01') over(partition by name order by orderdate) as last_orderdate
    from 
test13;

(5) Order information 20% before query

select * from (select 
    name,orderdate,cost 
    ,ntile(5) over(order by orderdate) sort
from test13)t1
where sort = 1;

Rank function

functionexplain
rank()If the sorting is the same, there will be duplication, and the total number will remain the same
dense_rank()If the sorting is the same, there will be duplication and the total number will be reduced
row_number()In order

Data: subject transcript

name subject score
 Monkey King language eighty-seven
 Monkey King math ninety-five
 Monkey king English sixty-eight
 Sea language ninety-four
 Ocean mathematics fifty-six
 Sea English eighty-four
 Song language sixty-four
 Song mathematics eighty-six
 Song English eighty-four
 Tingting language sixty-five
 Tingting mathematics eighty-five
 Tingting English seventy-eight

Compare the effects of the previous function
rank(): duplicates occur when the sorting is the same, and the total number remains the same.
The score tied for sixth, no seventh. The total number is 12, and the total number remains unchanged.

hive (study)> select *,rank() over(order by score) from test14;
-------------------------
test14.name     test14.subject  test14.score    rank_window_0
 sea    mathematics    fifty-six.0    1
 Song song    language    sixty-four.0    2
 Tingting    language    sixty-five.0    3
 Monkey king English    sixty-eight.0    4
 Tingting    English    seventy-eight.0    5
 Song song    English    eighty-four.0    6
 sea    English    eighty-four.0    6
 Tingting    mathematics    eighty-five.0    8
 Song song    mathematics    eighty-six.0    9
 Monkey King language    eighty-seven.0    10
 sea    language    ninety-four.0    11
 Monkey King mathematics    ninety-five.0    12
Time taken: 18.422 seconds, Fetched: 12 row(s)

dense_rank(): duplicates occur when the sorting is the same, and the total quantity decreases.
The number 6 is tied, and then the sequence is continued. Due to the juxtaposition of the total quantity, the total quantity is reduced.

hive (study)> select *,dense_rank() over(order by score) from test14;
-----------------------
test14.name     test14.subject  test14.score    dense_rank_window_0
 sea    mathematics    fifty-six.0    1
 Song song    language    sixty-four.0    2
 Tingting    language    sixty-five.0    3
 Monkey king English    68.0    4
 Tingting    English    seventy-eight.0    5
 Song song    English    eighty-four.0    6
 sea    English    eighty-four.0    6
 Tingting    mathematics    eighty-five.0    7
 Song song    mathematics    eighty-six.0    8
 Monkey King language    eighty-seven.0    9
 sea    language    94.0    10
 Monkey King mathematics    ninety-five.0    11

row_number(): arranged in order without repetition.
There will be no juxtaposition.

hive (study)> select *,row_number() over(order by score) from test14;
------------------
test14.name     test14.subject  test14.score    row_number_window_0
 sea    mathematics    56.0    1
 Song song    language    64.0    2
 Tingting    language    65.0    3
 Monkey king English    68.0    4
 Tingting    English    78.0    5
 Song song    English    84.0    6
 sea    English    84.0    7
 Tingting    mathematics    85.0    8
 Song song    mathematics    86.0    9
 Monkey King language    87.0    10
 sea    language    94.0    11
 Monkey King mathematics    95.0    12

Requirements:
Calculate the score ranking of each subject.

Analysis: rank each discipline

select 
    name,subject,score,
    rank() over(partition by subject order by score) as rank 
from test14;

Other common functions

Date function

functionexplaincase
unix_timestampReturns the unix timestamp of the current time, deprecated. Use current_timestamp substitutionselect unix_timestamp();
select unix_timestamp("2020-10-28",'yyyy-MM-dd');
from_unixtimeTimestamp to dateselect from_unixtime(1603843200);
current_dateGet current dateselect current_date();
current_timestampGet current date and timeselect current_timestamp;
to_dateExtract date sectionselect to_date('2020-10-28 12:12:12');
yearThe year portion of the extraction dateselect year('2020-10-28 12:12:12');
monthExtract the month portion of the dateselect month('2020-10-28 12:12:12');
dayExtract the day portion of the dateselect day('2020-10-28 12:12:12');
hourHour of extraction dateselect hour('2020-10-28 12:12:12');
minuteMinutes of extraction dateselect minute('2020-10-28 12:13:14');
secondThe second of the extracted dateselect second('2020-10-28 12:13:14');
weekofyearThe specified date is the week ordinal of the current yearselect weekofyear('2020-10-28 12:13:14');
dayofmonthThe specified date is the day ordinal of the current monthselect dayofmonth('2020-10-28 12:13:14');
months_betweenMonth between two datesselect months_between('2020-04-01','2020-10-28');
add_monthsDate plus or minus monthselect add_months('2020-10-28',-3);
datediffNumber of days between two datesselect datediff('2020-11-04','2020-10-28');
date_addDate plus daysselect date_add('2020-10-28',4);
date_subDate minus daysselect date_sub('2020-10-28',-4);
last_dayDate is the last day of the monthselect last_day('2020-02-30');
date_formatformat dateselect date_format('2020-10-28 12:12:12','yyyy/MM/dd HH:mm:ss');

Numerical function

functionexplaincase
roundroundingselect round(3.14);
select round(3.54);
ceilRound upselect ceil(3.14);
floorRound downselect floor(3.14);

String operation function

functionexplaincase
upperCapitalizeselect upper('low');
lowerTurn lowercaseselect lower('low');
lengthlengthselect length("atguigu");
trimRemove spaces before and afterselect trim(" atguigu ");
lpadFill to the left to the specified lengthselect lpad('atguigu',9,'g');
rpadFill to the right to the specified lengthselect rpad('atguigu',9,'g');
regexp_replaceUse regular expression to match the target string. Replace after successful matching!SELECT regexp_replace('2020/10/25', '/', '-');

Set function

functionexplaincase
sizeNumber of elements in the collectionselect size(friends) from test3;
map_keysReturns the key in the mapselect map_keys(children) from test3;
map_valuesReturns the value in the mapselect map_values(children) from test3;
array_containsDetermine whether the array contains an elementselect array_contains(friends,'bingbing') from test3;
sort_arraySort the elements in the arrayselect sort_array(friends) from test3;
grouping setsMultidimensional analysis

grouping sets
It can be understood as optimizing the SQL writing method of multiple groups by

select col1,col2,count(*) as count from tab1 group by col1,col2
union 
select col1,null,count(*) as count from tab1 group by col1;

## Equivalent to
select col1,col2,count(*) as count from tab1  group by col1,col2
grouping sets((col1,col2),(col1));

Topics: Big Data hive