Hive common function summary

Posted by jerk on Thu, 30 Dec 2021 17:53:54 +0100

function

View all built-in functions

show functions;

How to use query function

desc function [extended]Detailed display function name

UDF one in one out is measured by line
UDAF enters one more place
UDTF one in many out

UDF

NVL: assign a value to the data whose value is NULL. Its format is NVL (value, default_value). Its function is to return default if value is NULL_ Value, otherwise return the value of value. If both parameters are NULL, return NULL.

CASE WHEN THEN ELSE END

data source

dept_id sex
 name of a fictitious monkey with supernatural powers A male
 sea A male
 Song song B male
 Miss Luo Yu feng A female
 Sister Ting B female
 Tingting B female

select
 dept_id,
 sum(case sex when 'male' then 1 else 0 end) male_count,
 sum(case sex when 'female' then 1 else 0 end) female_count
from emp_sex
group by dept_id;

Equivalent to if (sex = 'male', 1,0)

CONCAT

Splice string

CONCAT_WS(separator, str1, str2,...)

The first is a separator

COLLECT_SET(col)

The function only accepts basic data types. Its main function is to de summarize the values of a field and generate an Array type field.

SET can be replaced by list, and the difference is whether to de duplicate

UDTF

explode

You can split an array into multiple rows of data

Split (field name, ")

Same as java split

lateral view

Usage: final view udtf (expression) tablealias as columnalias

Explanation: used with split, expand and other udtfs. It can split a column of data into multiple rows of data. On this basis, it can aggregate the split data.

Example:

string pageid	Array adid_list
"front_page"	[1, 2, 3]
"contact_page"	[3, 4, 5]

SELECT pageid, adid 
    FROM pageAds LATERAL VIEW explode(adid_list) adTable(Profile table name) AS adid;

string pageid	int adid
"front_page"	1
"front_page"	2
"front_page"	3
"contact_page"	3
"contact_page"	4
"contact_page"	5

Window function

Over

The window function needs to be used with over

mysql8. Window functions are supported after 0

OVER(): Specifies the size of the data window in which the analysis function works. The size of the data window may change with the change of rows

partition by: can be understood as being divided into several windows

CURRENT ROW: CURRENT ROW

N predicting: data in the previous n rows

n FOLLOWING: data in the next n rows

UNBOUNDED: starting point,

Unbounded forecasting means starting from the front and UNBOUNDED FOLLOWING means ending at the back

LAG(col,n, default): data in the nth row ahead

LEAD(col,n, default_val): data of the nth row in the future

NTILE(n): distribute the rows of the ordered window to the groups of the specified data. Each group has a number, starting from 1. For each row, NTILE returns the number of the group to which the row belongs. Note: n must be of type int.

Data preparation: name, orderdate, cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

(1) Query the customers and total number of people who purchased in April 2017

select name,count(*) over () 
from business
where substring(orderdate,1,7) = '2017-04'
group by name;

(2) Query the customer's purchase details and monthly total purchase amount

select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) 
from business;

The cost of each customer is accumulated according to the date

select *,sum(cost) over(partition by name order by orderdate)from business;
orderby Equivalent to accumulation
select *,sum(cost) over(partition by name  rows between  ubounded preceding[Start line] and current row[Current row])from business;

result

jack	2017-01-05	46	46
jack	2017-01-08	55	101
jack	2017-01-01	10	111
jack	2017-04-06	42	153
jack	2017-02-03	23	176
mart	2017-04-13	94	94
mart	2017-04-11	75	169
mart	2017-04-09	68	237
mart	2017-04-08	62	299
neil	2017-05-10	12	12
neil	2017-06-12	80	92
tony	2017-01-04	29	29
tony	2017-01-02	15	44
tony	2017-01-07	50	94

Note: if you encounter the same value, it will be treated as a window function

example

id
1
2
3
3
select id ,sum(id) over(order by id) from test;
id sum(id)
1  1
2  3
3  6
3  6

Rank

Duplicate sorting is allowed when RANK(), and the total number will not change

DENSE_ Duplicate sorting is allowed during rank() sorting, and the total number will be reduced

ROW_NUMBER() cannot be repeated and will be calculated in order

select *,rank() over(order by score desc) from score;
Sun WuKong	mathematics	95	1
 sea	language	94	2
 Sun WuKong	language	87	3
 Song song	mathematics	86	4
 Tingting	mathematics	85	5
 Song song	English	84	6
 sea	English	84	6
 Tingting	English	78	8
 Sun WuKong	English	68	9
 Tingting	language	65	10
 Song song	language	64	11
 sea	mathematics	56	12

Song song is on the sea. Because of the shuffle reverse overflow, the data are the same and not sorted

Other common functions

Common date functions

unix_timestamp: returns the timestamp of the current or specified time (obsolete)

select unix_timestamp();
unix_timestamp(void) is deprecated. Use current_timestamp instead.
unix_timestamp(void) is deprecated. Use 
1628579776
----------------------------------------
select unix_timestamp('2021-8-10','YYYY-MM-DD');
1609027200

current_timestamp

hive (default)> select current_timestamp;
OK
_c0
2021-08-10 15:17:37.051

from_unixtime(time,format): convert timestamp to date format

select from_unixtime(1603843200);
2020-10-28 00:00:00

current_date: get the current date
select current_date;

to_date: extract the date part

select to_date('2020-10-28 12:12:12');
2020-10-28
----------------------------------------
The same is true below, but the date format must be correct

Year: get year
select year('2020-10-28 12:12:12');

Month: get month
select month('2020-10-28 12:12:12');

Day: get day
select day('2020-10-28 12:12:12');

hour: when getting
select hour('2020-10-28 12:12:12');

minute: get score
select minute('2020-10-28 12:12:12');

Second: gets the second
select second('2020-10-28 12:12:12');

weekofyear: the current time is the week of the year

Day of month: the current time is the day of the month

select weekofyear('2021-8-10 12:12:12');
32
----------------------------------------
select dayofmonth('2021-8-35 12:12:12');
4
//Even if it exceeds the normal month / date size, no error will be reported, and it will automatically enter the next month

months_between: the month between two dates, before minus after

select months_between('2020-04-01','2021-10-28');
-18.87096774

add_months: date plus or minus month

select add_months('2020-8-28',10);
2021-06-28

datediff: the number of days between two dates

select datediff('2020-11-04','2021-8-28');
-297

date_add: date plus days

date_sub: date minus days

select date_add('2020-10-28',4);
select date_sub('2020-10-28',-4);

last_day: the last day of the month of the date

select last_day('2020-08-40');
2020-09-30

date_format(date/timestamp/string, fmt) formats the date

select date_format('2020-10-28 12:12:12','yyyy/MM/dd HH:mm:ss');
2020/10/28 12:12:12

Data rounding

Round: round

select round(3.14);
3
select round(3.54);
4

ceil: round up

select ceil(3.14);
select ceil(3.54);
4

floor: round down

select floor(3.14);
select floor(3.54);
3

String operation

It is similar to Java's string method

upper: convert to uppercase

lower: convert to lowercase

Length: length
select length("atguigu");

trim: remove spaces before and after
select trim(" atguigu ");

lpad: fill to the left to the specified length
select lpad('test',5,'g');

rpad: fill right to the specified length
select rpad('test',5,'g');

regexp_replace: use regular expression to match the target string. Replace after matching successfully!

sub: intercept string

split: splits the string by the specified character

Set function

Similar to Java collection functions

size: the number of elements in the collection
select size(col) from table_name;

map_keys: returns the key in the map
select map_keys(col) from table_name;

map_values: returns the value in the map
select map_values(col) from table_name;

array_contains: determines whether an element is included in the array
select array_contains(col,'test') from table_name;

sort_array: sort the elements in the array
select sort_array(col) from table_name;

Topics: Big Data hive

Programmer Think