Hive learning notes - Chapter 7 functions

Posted by holstead on Sun, 16 Jan 2022 06:56:57 +0100

1. System built-in functions

(1) View the functions provided by the system

show functions;

(2) Displays the usage of the built-in function

desc function upper;

(3) Displays the usage of the built-in function in detail

desc function extended upper;

2. Common built-in functions

2.1 empty field assignment

(1) Function description

NVL: assign a value to the data whose value is NULL. Its format is NVL (value, default_value). Its function is to return default if value is NULL_ Value, otherwise return the value of value. If both parameters are NULL, return NULL.

(2) Data preparation: use employee table

(3) Query: if the comm of the employee is NULL, use - 1 instead

select comm,nvl(comm, -1) from emp;

(4) Query: if the comm of the employee is NULL, the leader id will be used instead

select comm, nvl(comm,mgr) from emp;

2.2 CASE WHEN

(1) Data preparation

(2) Demand

Find out the number of men and women in different departments. The results are as follows:

A   2    1
B   1    2

(3) Create local emp_sex.txt, import data

vim emp_sex.txt
#Add data
 name of a fictitious monkey with supernatural powers	A	male
 sea	A	male
 Song song	B	male
 Miss Luo Yu feng	A	female
 Sister Ting	B	female
 Tingting	B	female

(4) Create hive table and import data

#Create Hive table
create table emp_sex(
	name string, 
	dept_id string, 
	sex string) 
row format delimited fields terminated by "\t";
#Import data
load data inpath '/mario/hive/7/emp_sex.txt' into table emp_sex;

(5) Query data by demand

select 
	dept_id,
	sum(case sex when 'male' then 1 else 0 end) male_count,
	sum(case sex when 'female' then 1 else 0 end) female_count
from 
	emp_sex
group by
	dept_id;

2.3 line transfer

(1) Description of correlation function

CONCAT(string A/col, string B/col...): returns the result of connecting input strings. Any input string is supported;

CONCAT_WS(separator, str1, str2,...): it is a special form of CONCAT(). The first parameter is the separator between the remaining parameters. The delimiter can be the same string as the remaining parameters. If the delimiter is NULL, the return value will also be NULL. This function skips any NULL and empty strings after the delimiter parameter. The separator will be added between the connected strings;

COLLECT_SET(col): the function only accepts basic data types. Its main function is to de summarize the values of a field and generate an array type field.

(2) Data preparation

(3) Demand

Classify people with the same constellation and blood type. The results are as follows:

sagittarius,A       sea|Miss Luo Yu feng
 Aries,A       Sun WuKong|Zhu Bajie
 Aries,B       Song song|Aoi Sora

(4) Create a local constellation Txt, import data

vim constellation.txt
#Add data
 Sun WuKong	Aries	A
 sea	sagittarius	A
 Song song	Aries	B
 Zhu Bajie	Aries	A
 Miss Luo Yu feng	sagittarius	A

(5) Create hive table and import data

#Create Hive table
create table person_info(
	name string, 
	constellation string, 
	blood_type string) 
row format delimited fields terminated by "\t";
#Import data
load data inpath "/mario/hive/7/constellation.txt" into table person_info;

(6) Query data by demand

select
	t1.base,
	concat_ws('|', collect_set(t1.name)) name
from
	(select
		name,
		concat(constellation, ",", blood_type) base
	from
		person_info) t1
group by
	t1.base;

2.4 column to row

(1) Function description

Expand (Col): split the complex array or map structure in the hive column into multiple rows.

LATERAL VIEW

Usage: final view udtf (expression) tablealias as columnalias

Explanation: used with split, expand and other udtfs. It can split a column of data into multiple rows of data. On this basis, it can aggregate the split data.

(2) Data preparation

(3) Demand

Expand the array data in the movie category. The results are as follows:

<Suspect tracking    Suspense
<Suspect tracking    action
<Suspect tracking    science fiction
<Suspect tracking    plot
<Lie to me>  Suspense
<Lie to me>  gangster 
<Lie to me>  action
<Lie to me>  psychology
<Lie to me>  plot
<War wolf 2     Warfare
<War wolf 2     action
<War wolf 2     disaster

(4) Create local movie Txt, import data

vim movie.txt
#Add data
<Suspect tracking	Suspense,action,science fiction,plot
<Lie to me>	Suspense,gangster ,action,psychology,plot
<War wolf 2	Warfare,action,disaster

(5) Create hive table and import data

#Create Hive table
create table movie_info(
	movie string, 
	category string) 
row format delimited fields terminated by "\t";
#Import data
load data local inpath "/opt/module/datas/movie.txt" into table movie_info;

(6) Query data by demand

select
	m.movie,
	tbl.cate
from
	movie_info m
lateral view
	explode(split(category, ",")) tbl as cate;

2.5 window function (windowing function)

(1) Description of correlation function

**OVER(): * * specifies the data window size of the analysis function. The data window size may change with the change of rows.

**CURRENT ROW: * * CURRENT ROW

**N predicting: * * n rows of data before

**n FOLLOWING: * * next n rows of data

**UNBOUNDED: * * starting point, UNBOUNDED predicting indicates the starting point from the front, UNBOUNDED FOLLOWING indicates the end point from the back

**LAG(col,n,default_val): * * data in the nth row ahead

**LEAD(col,n, default_val): * * data in the next nth row

**NTILE(n): * * distribute the rows of the ordered window to the groups of the specified data. Each group has a number, starting from 1. For each row, NTILE returns the number of the group to which the row belongs. Note: n must be of type int.

(2) Data preparation: name, orderdate, cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

(3) Demand

  1. Query the customers and total number of people who purchased in April 2017
  2. Query the customer's purchase details and monthly total purchase amount
  3. In the above scenario, the cost of each customer is accumulated according to the date
  4. Query the last purchase time of each customer
  5. Order information 20% before query

(4) Create local business Txt, import data

vi business.txt

(5) Create hive table and import data

#Create Hive table
create table business(
	name string, 
	orderdate string,
	cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
#Load data
load data inpath "/mario/hive/7/business.txt" into table business;

(6) Query data by demand

Query the customers and total number of people who purchased in April 2017

select name,count(*) over () 
from business 
where substring(orderdate,1,7) = '2017-04' 
group by name;

Query the customer's purchase details and monthly total purchase amount

select name,orderdate,cost,sum(cost) 
over(partition by month(orderdate)) 
from
	business;

In the above scenario, the cost of each customer is accumulated according to the date

select name,orderdate,cost, 
sum(cost) over() as sample1,--Add all rows 
sum(cost) over(partition by name) as sample2,--Press name Grouping, intra group data addition 
sum(cost) over(partition by name order by orderdate) as sample3,--Press name Grouping, intra group data accumulation 
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--and sample3 equally,Aggregation from starting point to current row 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --Aggregate the current row with the previous row 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--The current line and the previous and subsequent lines 
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --Current line and all subsequent lines 
from business;

Rows must follow the Order by clause to limit the sorting results, and use a fixed number of rows to limit the number of data rows in the partition

View the customer's last purchase time

SELECT name,
       orderdate,
       cost,
       lag(orderdate,1,'1900-01-01') over(partition BY name
                                          ORDER BY orderdate) AS time1
FROM business;

Order information 20% before query

select * from (
	select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
	from business
) t
where sorted = 1;

2.6 Rank

(1) Function description

If RANK() sorting is the same, it will be repeated, and the total number will not change

DENSE_ If rank() sorting is the same, it will be repeated and the total number will be reduced

ROW_NUMBER() is calculated in order

(2) Data preparation

(3) Demand

Calculate the score ranking of each subject.

(4) Create local score Txt, import data

vi score.txt

(5) Create hive table and import data

#Create Hive table
create table score(
	name string,
	subject string, 
	score int
) 
row format delimited fields terminated by "\t";
#Load data
load data inpath '/mario/hive/7/score.txt' into table score;

(6) Query data by demand

SELECT name,
       subject,
       score,
       rank() over(partition BY subject
                   ORDER BY score DESC) rp,
              dense_rank() over(partition BY subject
                                ORDER BY score DESC) drp,
                           row_number() over(partition BY subject
                                             ORDER BY score DESC) rmp
FROM score;

Extension: find the top three students in each subject?

2.7 date correlation function

(1)current_date returns the current date

select current_date();

(2)date_ add, date_ Addition and subtraction of sub date

#Date 90 days from today
select date_add(current_date(), 90);
#Date 90 days before today
select date_sub(current_date(), 90);

(3) Date difference between two dates

#The difference between today and June 4, 1990
SELECT datediff(CURRENT_DATE(), "1990-06-04");

3. User defined functions

(1) Hive comes with some functions, such as max/min, but the number is limited. You can easily expand it by customizing UDF.

(2) When the built-in function provided by Hive cannot meet your business processing needs, you can consider using user-defined function (UDF).

(3) There are three types of user-defined functions:

​ UDF(User-Defined-Function)

One in and one out

​ UDAF(User-Defined Aggregation Function)

Aggregate function, one more in and one out

Similar to: count/max/min

​ UDTF(User-Defined Table-Generating Functions)

One in and many out

Such as lateral view explore()

(4) Official document address

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

(5) Programming steps:

Inherit org apache. hadoop. hive. ql.exec. UDF

You need to implement the evaluate function; The evaluate function supports overloading;

Create a function in hive's command line window

Add jar

add jar linux_jar_path

Create function

create [temporary] function [dbname.]function_name AS class_name;

d. delete the function in the command line window of hive

Drop [temporary] function [if exists] [dbname.]function_name;

(6) Note: UDF must have a return type, which can return null, but the return type cannot be void;

4. Customize UDF functions

(1) Create a Maven project Hive

(2) Import dependency

<dependencies>
	<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
	<dependency>
		<groupId>org.apache.hive</groupId>
		<artifactId>hive-exec</artifactId>
		<version>3.1.2</version>
	</dependency>
</dependencies>

(3) Create a class

package com.atguigu.hive;
import org.apache.hadoop.hive.ql.exec.UDF;

public class Lower extends UDF {
	public String evaluate (final String s) {
		
		if (s == null) {
			return null;
		}		
		return s.toLowerCase();
	}

}

(4) Upload the jar package to the server / opt / module / jars / UDF jar

(5) Add the jar package to hive's classpath

add jar /opt/module/datas/udf.jar;

(6) Create a temporary function associated with the developed java class

create temporary function mylower as "com.atguigu.hive.Lower";

(7) You can use the custom function strip in hql

select ename, mylower(ename) lowername from emp;

Topics: hive