hive operation summary

Posted by codects on Mon, 09 Dec 2019 18:16:24 +0100

1. There are many fields in the hive table, and one or several of them need to be removed from a large number of fields. It is not beautiful to write out all the fields, which can be completed by using regular

Set hive.support.quoted.identifiers to None, and you can use regular expressions to get the fields of the table

set hive.support.quoted.identifiers=None;
select `(search_word|user_id)?+.+` from user_action_table;

2. build tables

create table if not exists table(
    user_id bigint comment 'user id',
    goods_id bigint comment 'commodity id'
 ) comment 'Transaction schedule'
 partitioned by (par string comment 'Partition key is par')
 stored as orc;	

3. Add a column

alter table user_action_table add columns ( rec_source string comment 'Commodity source')

4. Insert data into partition

insert overwrite table user_action_table partition (par='1024')
select * from other_table 

5. Write sub query. When hive writes complex statements, he can split them into sub statements, write them separately, and then combine them

with tmp_table as (
select count(*) from table group by user_id),

tmp_table2 as (
select count(*) from table group by goods_id)

6. ranking

select
user_id,
goods_id,
row_number() over (partition by user_id order by action_num desc ) as rank_number
from
user_action_table

7.hive uses Cartesian product. It's better not to use it. Find other ways to replace it. It's very slow

set hive.strict.checks.cartesian.product = false;
set hive.mapred.mode = 'nonstrict' ;

select T1.*, T2.* from

(select * from table1) T1

cross join

(select * from table2) T2

on 1=1;

8. Calculate the proportion, such as the proportion of the value of the current row in the sum of all values in the column. Make good use of the window function. The field after partition by is similar to the field of group by.

select
user_id,
goods_id,
action_num
action_num*1.00/sum(action_num) over (partition by user_id) as user_percent
from tmp_table

9. Filter the data in Table 1 and table 2

select a.user_id,a.goods_id from 
(select user_id, goods_id  from table_1) as a
left join
(select user_id, goods_id from table_2) as b
on a.user_id = b.user_id and a.goods_id = b.goods_id
where b.user_id is null and b.goods_id is null

 

Topics: Big Data hive