Basic operation of Hive

Posted by PyroX on Sun, 23 Feb 2020 09:38:46 +0100


4, DDL data definition

4.1 create database

Create a database. The default storage path of a database on HDFS is /user/hive/warehouse/*.db.

To avoid an error when the database to be created already exists, add an if not exists check. (standard practice)

create database if not exists db_hive;

Create a database and specify its storage location on HDFS by adding location 'path':

create database db_hive2 location '/db_hive2.db';

4.2 query database

  1. Display database information

    desc database db_hive;
    
  2. Display extended database details. This also shows the database's additional properties (identical to the above if no extra properties are set)

    desc database extended db_hive;
    

4.3 modify database

The ALTER DATABASE command sets key-value pairs in a database's DBPROPERTIES to describe the database's properties.

Other metadata information of the database is unchangeable, including the database name and the directory location of the database.

alter database db_hive set dbproperties('createtime'='2019/7/31');

See the modification result

desc database extended db_hive;

4.4 delete database

  1. Delete empty database

    drop database db_hive2;
    
  2. Use if exists to check whether the database exists before deleting, avoiding an error

    drop database if exists db_hive2;
    
  3. If the database is not empty, use the cascade keyword to force deletion

    drop database db_hive cascade;
    

4.5 create table

  1. Create table syntax

    CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name    -- create table
     [(col_name data_type [COMMENT col_comment], ...)]    -- define columns, optionally with column comments
     [COMMENT table_comment]                               -- add a comment to the table
     [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]  -- create a partitioned table
     [CLUSTERED BY (col_name, col_name, ...)               -- create a bucketed table
     [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]  -- not commonly used
     [ROW FORMAT row_format]                               -- specify the row format
     [STORED AS file_format]                               -- specify the storage file type
     [LOCATION hdfs_path]                                  -- specify the table's storage location on HDFS
    
  2. Managed table (internal table)

    When we drop a managed table, Hive also deletes the table's data.

    Standard create table statement

    create table if not exists student2(
    id int, name string
    )
    row format delimited fields terminated by '\t'
    stored as textfile
    location '/user/hive/warehouse/student';
    

    Create a table based on the query results (the query results will be added to the newly created table using MapReduce)

    create table if not exists student3 as select id, name from student;
    

    Create a table from an existing table structure

    create table if not exists student4 like student;
    

    Query the table's type

    desc formatted student2;
    Table Type:             MANAGED_TABLE 
    

    When student2 is dropped, the student2 data on HDFS is also deleted.

    drop table student2;
    
  3. External table

    External tables typically hold shared source data (for example, a company's massive log files) that many developers analyze but must not delete at will; the intermediate and result tables of those analyses are stored as managed tables. When an external table is dropped in Hive, only the metadata describing the table is deleted; the underlying data is left intact.

    Create dept.txt and emp.txt

    dept.txt
    10	ACCOUNTING	1700
    20	RESEARCH	1800
    30	SALES	1900
    40	OPERATIONS	1700
    emp.txt
    7369	SMITH	CLERK	7902	1980-12-17	800.00		20
    7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
    7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
    7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
    7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
    7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
    7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
    7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		20
    7839	KING	PRESIDENT		1981-11-17	5000.00		10
    7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
    7876	ADAMS	CLERK	7788	1987-5-23	1100.00		20
    7900	JAMES	CLERK	7698	1981-12-3	950.00		30
    7902	FORD	ANALYST	7566	1981-12-3	3000.00		20
    7934	MILLER	CLERK	7782	1982-1-23	1300.00		10
    

    Create department table

    create external table if not exists default.dept(
    deptno int,
    dname string,
    loc int
    )
    row format delimited fields terminated by '\t';
    

    Create employee table

    create external table if not exists default.emp(
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string, 
    sal double, 
    comm double,
    deptno int)
    row format delimited fields terminated by '\t';
    

    Load data into the two external tables

    load data local inpath '/opt/module/datas/dept.txt' into table default.dept;
    load data local inpath '/opt/module/datas/emp.txt' into table default.emp;
    

    View formatted table information

    desc formatted dept;
    Table Type:             EXTERNAL_TABLE
    

    Drop a table and check whether the data on HDFS is deleted (for an external table, it is not).
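
    A quick check, assuming the default warehouse path for the default.dept table created above:

    drop table dept;
    -- dept disappears from Hive's metadata, but its data file remains:
    dfs -ls /user/hive/warehouse/dept;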

  4. Converting between managed and external tables

    Query the table's type

    desc formatted student2;
    Table Type:             MANAGED_TABLE 
    

    Modify internal table student2 to external table

    alter table student2 set tblproperties('EXTERNAL'='TRUE');
    

    Query the table's type

    desc formatted student2;
    Table Type:             EXTERNAL_TABLE
    

    Modify external table student2 to internal table

    alter table student2 set tblproperties('EXTERNAL'='FALSE');
    

    Query the table's type

    desc formatted student2;
    Table Type:             MANAGED_TABLE
    

    Note: ('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed expressions and case sensitive!

4.6 partitioned tables

A partitioned table corresponds to a separate folder on the HDFS file system, under which all of that partition's data files are stored. In Hive, a partition is a subdirectory.

  1. Basic operation of partition table

    Create partition table

    create table dept_partition(
    deptno int, dname string, loc string
    )
    partitioned by (month string)	-- set month as the partition field
    row format delimited fields terminated by '\t';
    

    Load data into partition table

    -- load the same data into different partitions
    load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');
    load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');
    load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707');
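    -- Assumed resulting layout on HDFS (default warehouse path), one subdirectory per partition:
    --   /user/hive/warehouse/dept_partition/month=201707/
    --   /user/hive/warehouse/dept_partition/month=201708/
    --   /user/hive/warehouse/dept_partition/month=201709/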
    

    Query data in partition table

    -- query specific partitions
    select * from dept_partition where month='201709' or month='201708';
    -- without a partition filter, data from all partitions is returned together
    select * from dept_partition;
    

    Add partitions

    -- Add a partition
    alter table dept_partition add partition(month='201706');
    -- When adding multiple partitions, separate them with spaces
    alter table dept_partition add partition(month='201705') partition(month='201704');
    

    Delete partitions

    -- Delete a partition
    alter table dept_partition drop partition (month='201704');
    -- Comma separated when deleting multiple partitions
    alter table dept_partition drop partition (month='201705'), partition (month='201706');
    

    Check how many partitions the partition table has

    show partitions dept_partition;
    

    View partition table structure

    desc formatted dept_partition;
    
  2. Secondary (two-level) partitioning

    Create secondary partition table

    create table dept_partition2(
    deptno int, dname string, loc string
    )
    partitioned by (month string, day string) -- set month and day as partition fields
    row format delimited fields terminated by '\t';
    

    Load data normally

    load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709', day='13');
    

    Query partition data

    select * from dept_partition2 where month='201709' and day='13';
    
  3. Create directories on HDFS and upload data into the partition directories directly. There are then three ways to associate the partitioned table with the data (the upload step itself is sketched below)
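
    For methods ① and ②, the upload itself can be done from the hive CLI, for example (a sketch; the day value must match the partition you later repair or add):

    dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=11;
    dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;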

    ① Repair after uploading data

     msck repair table dept_partition2;
    

    ② Add partition after uploading data

    alter table dept_partition2 add partition(month='201709', day='11');
    

    ③ load data to partition after creating folder

    load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709',day='10');
    

4.7 modifying tables

  1. rename table

    ALTER TABLE table_name RENAME TO new_table_name
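
    For example (renaming dept_partition2 from above; the new name is hypothetical):

    alter table dept_partition2 rename to dept_partition2_bak;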
    
  2. Add, modify, and delete table partitions

    See 4.6

  3. Add / modify / replace column information

    Update a column

    ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
    -- Example: rename deptdesc to desc and change its type to int
    alter table dept_partition change column deptdesc desc int;
    

    Add and replace columns

    ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...) 
    -- Add column example
    alter table dept_partition add columns(deptdesc string);
    -- Replace columns example (note: replace redefines all of the table's columns)
    alter table dept_partition replace columns(deptno string, dname string, loc string);
    
  4. Delete table

    drop table dept_partition;
    

5, DML data operation

5.1 data import

  1. Load data into a table

    Syntax

     load data [local] inpath 'path' [overwrite] into table student [partition (partcol1=val1,...)];
    

    Explanation:

    load data: load data.

    local: load from the local filesystem; if omitted, load from HDFS.

    inpath: the path of the data to load.

    overwrite: overwrite the data already in the table; if omitted, append instead.

    into table: the table to load into.

    student: the specific target table.

    partition: upload into the specified partition.

    Note: loading from HDFS moves the file, while loading from the local filesystem copies it.
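
    For instance, using the student table from earlier (both file paths are assumptions):

    -- copy from the local filesystem and append to the table
    load data local inpath '/opt/module/datas/student.txt' into table student;
    -- move the file from HDFS and overwrite the table's existing data
    load data inpath '/user/hive/student.txt' overwrite into table student;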

  2. Insert data into a table through a query statement

    Basic way to insert data

    insert into table student partition(month='201709') values(1,'wangwu');
    

    Use the result of a single-table query as the data to insert

    insert overwrite table student partition(month='201708') 
    select id, name from student where month='201709';
    

    Multi-insert mode: insert the query results of one table into multiple partitions

    from student
    insert overwrite table student partition(month='201707')
    select id, name where month='201709'
    insert overwrite table student partition(month='201706')
    select id, name where month='201709';
    
  3. Create table and load data in query statement (As Select)

    create table if not exists student3
    as select id, name from student;
    
  4. Specify the load data path through Location when creating the table

    create table if not exists student5(
    id int, name string
    )
    row format delimited fields terminated by '\t'
    location '/user/hive/warehouse/student5';
    
  5. Import data to the specified Hive table (Note: export first, then import)

    import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';
    

5.2 data export

  1. Insert export

    Export query results to local

    insert overwrite local directory '/opt/module/datas/export/student'
    select * from student;
    

    Format the query results and export them to local

    insert overwrite local directory '/opt/module/datas/export/student1'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    select * from student;
    

    Export the query results to HDFS (without the local keyword)

    insert overwrite directory '/user/hive/warehouse/student3'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
    select * from student;
    
  2. Hadoop command export to local

    dfs -get /user/hive/warehouse/student3/000000_0 /opt/module/datas/export/student3.txt;
    
  3. Hive shell command export (use hive -f/-e to run a statement or script and redirect the output with > to a file, without entering the hive client)

    hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt;
    
  4. Export to HDFS

    export table student to '/user/hive/warehouse/export/student';
    
  5. Sqoop export

    Follow up instructions

5.3 clear data in the table (Truncate)

Truncate can only clear data in managed tables; it cannot delete data in external tables.

truncate table student;

6, Query

Query syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]	-- grouping
  [ORDER BY col_list]	-- global sort
  [CLUSTER BY col_list	-- bucketing (DISTRIBUTE and SORT on the same field)
   | [DISTRIBUTE BY col_list]	-- specify the partitioning field
   [SORT BY col_list] ]	-- sort within each partition
  [LIMIT number]	-- limit the number of rows returned

6.1 basic query (Select From)

  1. Full table and specific column queries

    select * from emp;
    select empno, ename from emp;
    
  2. Column alias (as can be omitted)

    select ename AS name, deptno dn from emp;
    
  3. Arithmetic operators

    Operator	Description
    A+B 	A plus B
    A-B 	A minus B
    A*B 	A times B
    A/B 	A divided by B
    A%B 	A modulo B
    A&B 	bitwise AND of A and B
    A|B 	bitwise OR of A and B
    A^B 	bitwise XOR of A and B
    ~A  	bitwise NOT of A

    -- after finding all employees' salaries, display each salary plus 1
    select sal+1 from emp;
  4. Limit statement

    -- the LIMIT clause restricts the number of rows returned
    select * from emp limit 5;
    

6.2 Where statement

The WHERE clause filters rows using a condition and immediately follows the FROM clause.

  1. Comparison operators (Between/In/Is Null)

    Operator	Description
    A=B 	TRUE if A equals B, FALSE otherwise
    A<=>B	TRUE if A and B are both NULL; FALSE if only one of them is NULL; otherwise the same as =
    A<>B, A!=B	NULL if A or B is NULL; TRUE if A is not equal to B, FALSE otherwise
    A<B 	NULL if A or B is NULL; TRUE if A is less than B, FALSE otherwise
    A<=B	NULL if A or B is NULL; TRUE if A is less than or equal to B, FALSE otherwise
    A>B 	NULL if A or B is NULL; TRUE if A is greater than B, FALSE otherwise
    A>=B	NULL if A or B is NULL; TRUE if A is greater than or equal to B, FALSE otherwise
    A [NOT] BETWEEN B AND C	TRUE if A is [not] within the range [B, C]; NULL if any of A, B, or C is NULL
    A IS NULL	TRUE if A is NULL, FALSE otherwise
    A IS NOT NULL	TRUE if A is not NULL, FALSE otherwise
    A IN (value1, value2)	TRUE if A equals any of the listed values
    A [NOT] LIKE B	TRUE if string A [does not] match the SQL wildcard pattern B
    A RLIKE B, A REGEXP B	B is a regular expression; TRUE if A matches B, FALSE otherwise
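
A quick illustration of the NULL-safe operator (a minimal sketch; the values are my own):

select null = null;    -- returns NULL
select null <=> null;  -- returns true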

Find out all employees with salary equal to 5000

select * from emp where sal =5000;

Query employee information with salary between 500 and 1000

select * from emp where sal between 500 and 1000;

Query all employees whose comm is NULL

select * from emp where comm is null;

Query employee information with salary of 1500 or 5000

select * from emp where sal IN (1500, 5000);
  2. Like and RLike

    % represents zero or more characters (any characters).

    _ represents exactly one character.

    The RLIKE clause is a Hive extension that allows matching with regular expressions.

    Find employees whose salary starts with 2

    select * from emp where sal LIKE '2%';
    

    Find employees whose salary has 2 as its second digit

    select * from emp where sal LIKE '_2%';
    

    Find employees whose salary contains the digit 2

    select * from emp where sal RLIKE '[2]';
    
  3. And/Or/Not

    Query employees whose salary is greater than 1000 and whose department is 30

    select * from emp where sal>1000 and deptno=30;
    

    Query employees whose salary is greater than 1000 or whose department is 30

    select * from emp where sal>1000 or deptno=30;
    

    Query employees in departments other than 20 and 30

    select * from emp where deptno not IN(30, 20);
    

6.3 grouping

  1. Group By statement

    GROUP BY statements are usually used with aggregate functions: rows are grouped by one or more columns, and then an aggregate is computed for each group.

    Calculate the average salary of each department in the emp table

    select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
    

    Calculate the maximum salary for each job in each department of emp

    select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;
    
  2. Having statement

    How having differs from where:

    where filters the table's rows when querying data; having filters the columns of the query results.

    where cannot be followed by aggregate functions, while having can.

    having is only used with group by statistics statements.

    Find the departments whose average salary is greater than 2000

    select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;
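
    For contrast, moving the aggregate into where is invalid:

    -- invalid: aggregate functions cannot appear in WHERE
    -- select deptno, avg(sal) avg_sal from emp where avg(sal) > 2000 group by deptno;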
    

6.4 Join statement

  1. Equi-join

    Hive supports standard SQL JOIN statements, but only equi-joins; non-equi-joins are not supported.

    Join the employee and department tables on equal department numbers, querying the employee number, employee name, and department name:

    select empno, ename, dname from emp e join dept d on e.deptno=d.deptno;
    
  2. Table aliases

    Using aliases simplifies queries, and prefixing columns with the table alias improves execution efficiency.

    Join the employee and department tables

    select e.empno,e.ename,d.dname from emp e join dept d on e.deptno=d.deptno;
    
  3. Inner join

    Only rows with a match on the join condition in both joined tables are kept.

    select e.empno,e.ename,d.dname from dept d join emp e on d.deptno=e.deptno;
    
  4. Left outer join

    All rows from the table on the left of the JOIN operator that satisfy the WHERE clause are returned.

    select e.empno, e.ename, d.dname from emp e left join dept d on e.deptno=d.deptno;
    
  5. Right outer join

    All rows from the table on the right of the JOIN operator that satisfy the WHERE clause are returned.

    select e.empno, e.ename, d.dname from emp e right join dept d on e.deptno=d.deptno;
    
  6. Full outer join

    All rows from both tables that satisfy the WHERE clause are returned. Where either table has no matching value for the join field, NULL is used instead.

    select e.empno,e.ename, d.dname from emp e full join dept d on e.deptno=d.deptno;
    
  7. Multi-table join

    Joining n tables requires at least n-1 join conditions.

    Hive always joins tables from left to right; where it can, it caches a table and reuses the same MapReduce job rather than launching additional ones.

    select e.empno,e.ename,d.dname,l.loc_name
    from emp e join dept d on e.deptno=d.deptno
    join location l on d.loc=l.loc;
    
  8. Cartesian product

    Every row of one table is joined with every row of the other table (rarely used); see the sketch below.
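
    A minimal sketch, pairing every employee row with every department row (hive.mapred.mode must be nonstrict, its default, since strict mode rejects Cartesian products):

    select empno, dname from emp, dept;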

6.5 sorting

  1. Global Order (Order By)

    Order By: Global sorting, which means there is only one Reducer in the end

    ASC (ascend): ascending (default)

    DESC (descend): descending

    Query employee information in ascending order of salary

    select * from emp order by sal;
    

    Query employee information in descending order of salary

    select * from emp order by sal desc;
    
  2. Sort by alias

    Sort by 2 times the employee's salary

    select ename, sal*2 twosal from emp order by twosal;
    
  3. Multiple column sorting

    Sort by department and salary in ascending order

    select ename, deptno, sal from emp order by deptno, sal ;
    
  4. Sort By within each MapReduce

    Note: the data in each Reducer (Partition) in Sort By is random.

    Sort By: sort within each Reducer, not for global result set. Usually used in combination with Distribute By

    Set the number of reduce

    set mapreduce.job.reduces=3;
    

    View the number of reduce settings

    set mapreduce.job.reduces;
    

    View employee information in descending order of employee number

    select * from emp sort by empno desc;
    

    Import query results into file (sorted by department number descending)

    insert overwrite local directory '/opt/module/datas/sortby-result' select * from emp sort by deptno desc;
    
  5. Distribute By

    Distribute By: similar to partition in MR, it is used in combination with sort by.

    Hive requires the DISTRIBUTE BY statement to be written before the SORT BY statement.

    When testing distribute by, you must allocate multiple reducers; otherwise the effect of distribute by is not visible.

    First by department number, then by employee number in descending order.

    set mapreduce.job.reduces=3;
    insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
    
  6. Cluster By

    cluster by combines the functions of distribute by and sort by. However, it sorts in ascending order only; the sort direction cannot be specified as ASC or DESC.

    Distributing by department number does not guarantee one fixed partition per value: departments 20 and 30, for example, may be placed in the same partition.

    -- The following two writing methods are equivalent
    select * from emp cluster by deptno;
    select * from emp distribute by deptno sort by deptno;
    

6.6 bucketing and sampling queries

  1. Bucket table data storage

    Partitioning targets the data's storage paths; bucketing targets the data files.

    Bucketing is another technique for decomposing a data set into parts that are easier to manage.

    Create bucket table

    create table stu_buck(id int, name string)
    clustered by(id) 
    into 4 buckets
    row format delimited fields terminated by '\t';
    

    View table structure

    desc formatted stu_buck;
    Num Buckets:            4     
    

    Set the bucketing properties, otherwise the bucketing effect will not be visible

    -- enable bucketing
    set hive.enforce.bucketing=true;
    -- let the system determine the number of reducers automatically
    set mapreduce.job.reduces=-1;
    

    Load data into the bucketed table

    load data local inpath '/opt/module/datas/stu_buck.txt' into table stu_buck;
    
  2. Bucket sampling query

    For very large datasets, sometimes users need to use a representative query result rather than all the results. Hive can meet this requirement by sampling tables.

    Syntax: tablesample is a sampling statement, TABLESAMPLE(BUCKET x OUT OF y).

    y must be a multiple or a factor of the table's bucket count (because of the division). Hive determines the sampling proportion from y. For example, with a table divided into 4 buckets: when y=2, data from 4/2 = 2 buckets is sampled; when y=8, data from 4/8 = 1/2 of one bucket is sampled.

    x indicates the bucket from which sampling starts. When several buckets are sampled, the next bucket number is the current one plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples 4/2 = 2 buckets: the 1st (x) and the 3rd (x+y).

    Note: x must be less than or equal to y; otherwise the last bucket sampled would exceed the total number of buckets.
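
    For example, this violates the rule and the query fails:

    -- fails: x (4) is greater than y (2)
    select * from stu_buck tablesample(bucket 4 out of 2 on id);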

    Query the data in the table stu_buck.

    -- 4 buckets in total; sample 4/2 = 2 buckets, starting with the 1st bucket, then the 3rd.
    select * from stu_buck tablesample(bucket 1 out of 2 on id);
    

6.7 other common query functions

  1. Empty field assignment

    NVL: assigns a value to NULL data, in the form NVL(string1, replace_with). The value of string1 is taken first; if string1 is NULL, NVL returns the value of replace_with, which can be a default value or another column.

    If an employee's comm is NULL, use -1 instead

    select nvl(comm,-1) from emp;
    

    If an employee's comm is NULL, use the leader's id (mgr) instead

    select nvl(comm,mgr) from emp;
    
  2. CASE WHEN

    Requirement: count how many men and women are in each department.

    Create hive table and import data

    create table emp_sex(
    name string, 
    dept_id string, 
    sex string) 
    row format delimited fields terminated by "\t";
    load data local inpath '/opt/module/datas/emp_sex.txt' into table emp_sex;
    

    Query data by demand

    select 
      dept_id,
      -- note the use of CASE WHEN
      sum(case sex when 'male' then 1 else 0 end) male_count,
      sum(case sex when 'female' then 1 else 0 end) female_count
    from 
      emp_sex
    group by
      dept_id;
    
  3. Row to column

    CONCAT(string A/col, string B/col...): returns the concatenation of its input strings; any number of inputs is supported.

    CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first parameter is the separator placed between the remaining parameters, and it can be any string. If the separator is NULL, the return value is also NULL.

    COLLECT_SET(col): accepts only primitive types; it deduplicates and aggregates the values of a field into an array-typed field.
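
    Quick illustrations of the first two functions (the values are my own):

    select concat('a', '-', 'b');          -- a-b
    select concat_ws('|', 'a', 'b', 'c');  -- a|b|c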

    Requirement: group people who have the same constellation and blood type.

    -- Source data (name, constellation, blood type)
     Monkey King	Aries	A
     Dahai	Sagittarius	A
     Songsong	Aries	B
     Pig Bajie	Aries	A
     Fengjie	Sagittarius	A
     -- Expected output
     Sagittarius,A	Dahai|Fengjie
     Aries,A	Monkey King|Pig Bajie
     Aries,B	Songsong
    

    Create the hive table and import data

    create table person_info(
    name string, 
    constellation string, 
    blood_type string) 
    row format delimited fields terminated by "\t";
    load data local inpath "/opt/module/datas/person_info.txt" into table person_info;
    

    Query data by demand

    select
        t1.c_b,
        concat_ws('|', collect_set(t1.name)) name -- aggregate each group's name values into an array and join with |
    from
        (select
            name,
            concat(constellation, ",", blood_type) c_b -- merge constellation and blood type
        from
            person_info) t1 -- subquery alias t1
    group by
        t1.c_b;
    
  4. Column to rows

    EXPLODE(col): split the complex array or map structure in the hive column into multiple rows.

    LATERAL VIEW

    Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

    Explanation: used together with UDTFs such as split and explode; it splits one column of data into multiple rows, and aggregation can then be applied to the split data.

    Requirement: expand the array data in movie classification.

    --Raw data (movie	categories)
     Suspect Tracking	suspense,action,science fiction,plot
     Lie to Me	suspense,police,action,psychology,plot
     Wolf 2	war,action,disaster
     --Expected data
     Suspect Tracking	suspense
     Suspect Tracking	action
     Suspect Tracking	science fiction
     Suspect Tracking	plot
     Lie to Me	suspense
     Lie to Me	police
     Lie to Me	action
     Lie to Me	psychology
     Lie to Me	plot
     Wolf 2	war
     Wolf 2	action
     Wolf 2	disaster
    

    Create hive table and import data

    create table movie_info(
        movie string, 
        category array<string>) 
    row format delimited fields terminated by "\t"
    collection items terminated by ",";
    
    load data local inpath "/opt/module/datas/movie.txt" into table movie_info;
    

    Query data by demand

    select
        movie,
        category_name
    from 
        movie_info lateral view explode(category) table_tmp as category_name;
        -- explode(category) splits the category column into rows; lateral view joins each split value back to the other columns of its original row.
    
  5. Window functions

    OVER(): specifies the window of data that the analytic function works on; the window size may change from row to row

    CURRENT ROW: the current row

    n PRECEDING: the n rows before the current row

    n FOLLOWING: the n rows after the current row

    UNBOUNDED: the boundary; UNBOUNDED PRECEDING means starting from the first row, and UNBOUNDED FOLLOWING means extending to the last row

    LAG(col,n): the value of col from the nth row before the current row

    LEAD(col,n): the value of col from the nth row after the current row

    NTILE(n): distributes the rows of an ordered partition into n groups, numbered starting from 1; for each row, NTILE returns the number of the group the row belongs to. Note: n must be of type int.

    Raw data

    jack,2017-01-01,10
    tony,2017-01-02,15
    jack,2017-02-03,23
    tony,2017-01-04,29
    jack,2017-01-05,46
    jack,2017-04-06,42
    tony,2017-01-07,50
    jack,2017-01-08,55
    mart,2017-04-08,62
    mart,2017-04-09,68
    neil,2017-05-10,12
    mart,2017-04-11,75
    neil,2017-06-12,80
    mart,2017-04-13,94
    

    Create hive table and import data

    create table business(
    name string, 
    orderdate string,
    cost int
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
    load data local inpath "/opt/module/datas/business.txt" into table business;
    

    Requirements

    (1) Query the customers who made purchases in April 2017 and the total number of such customers

    select name,count(*) over () 
    from business 
    where substring(orderdate,1,7) = '2017-04' 
    group by name;
    

(2) Query each customer's purchase details and the total purchase amount per month

select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;

(3) In the above scenario, cost should be accumulated by date

select *,sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW ) from business;

(4) Query each customer's previous purchase date

select *, lag(orderdate,1) over(distribute by name sort by orderdate) from business;

(5) Query the order information in the first 20% of the time

select *  from (
    select name, orderdate, cost, ntile(5) over(sort by orderdate) gid 
    from business
) t
where gid = 1;


  6. Rank

    RANK(): ties receive the same rank and the total count is unchanged (gaps follow ties)

    DENSE_RANK(): ties receive the same rank and the sequence is compressed (no gaps follow ties)

    ROW_NUMBER(): numbers the rows consecutively in order

    Requirement: compute the ranking of scores within each subject.

    Create hive table and import data

    create table score(
    name string,
    subject string, 
    score int) 
    row format delimited fields terminated by "\t";
    load data local inpath '/opt/module/datas/score.txt' into table score;
    

    Query results according to requirements

    select name,
    subject,
    score,
    rank() over(partition by subject order by score desc) rp,
    dense_rank() over(partition by subject order by score desc) drp,
    row_number() over(partition by subject order by score desc) rmp
    from score;
    

Topics: Database hive Mobile Hadoop