Big Data Foundations: Hive -- DDL and DML Statements

Posted by KindMan on Tue, 02 Nov 2021 05:46:50 +0100

Author: duktig

Blog: https://duktig.cn (first published there)

Excellent, yet still working hard. May you enjoy what you give and what you get.

See github knowledge base for more articles: https://github.com/duktig666/knowledge

Background

After learning Hadoop, did you find writing MapReduce programs complex, with a high development cost for analysis and statistics? Then let's learn about Hive, another member of the Hadoop ecosystem, and see how to use a SQL-like language to quickly query and analyze data.

In the last article, we covered an overview of Hive. In this article, we'll learn about Hive's DDL and DML statements (the key points).

DDL

Create database

CREATE DATABASE [IF NOT EXISTS] database_name 
[COMMENT database_comment] 
[LOCATION hdfs_path] 
[WITH DBPROPERTIES (property_name=property_value, ...)];

Notes:

  • To avoid an error when the database to be created already exists, add the IF NOT EXISTS clause
  • COMMENT: a comment describing the database
  • LOCATION: the storage path of the database on HDFS (the default is /user/hive/warehouse/*.db)
  • WITH DBPROPERTIES: specifies database parameters (not commonly used)

Example:

create database if not exists test; 
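
A fuller sketch using the optional clauses; the comment, path, and property values here are purely illustrative:

create database if not exists test 
comment 'test database' 
location '/user/hive/warehouse/test.db' 
with dbproperties ('creator'='duktig'); 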

Query databases

# Show all databases 
show databases; 
# Filter the databases shown with a pattern 
show databases like 'db_hive*'; 

# Display database information 
desc database test; 
# Display extended database details (including DBPROPERTIES) 
desc database extended test; 

Switch database

use test; 

Modify the database

Users can use the ALTER DATABASE command to set key-value pairs (DBPROPERTIES) on a database to describe its attributes.

alter database test set dbproperties('createtime'='20170830'); 
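
The new property can then be verified with the extended description shown earlier:

desc database extended test; 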

Delete database

# Delete an empty database 
drop database test;
# If the database may not exist, use IF EXISTS to avoid an error 
drop database if exists test; 
# If the database is not empty, use the CASCADE keyword to force deletion 
drop database test cascade; 

Create table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[STORED AS file_format] 
[LOCATION hdfs_path] 
[TBLPROPERTIES (property_name=property_value, ...)] 
[AS select_statement] 

Field interpretation (a combined sketch follows the list):

  • EXTERNAL: lets the user create an external table. When a managed (internal) table is dropped, its metadata and data are deleted together; dropping an external table deletes only the metadata, not the data.
  • LOCATION: specifies the location of the table's data
  • COMMENT: adds comments to the table and its columns
  • PARTITIONED BY: creates a partitioned table
  • CLUSTERED BY: creates a bucketed table
  • ROW FORMAT row_format: defines the row format
  • TBLPROPERTIES (property_name=property_value, ...): additional table properties
  • AS: followed by a query statement; the table is created from the query results
  • SORTED BY: not commonly used; sorts one or more columns within each bucket
  • STORED AS: specifies the storage file type
    • Common storage file types: SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), RCFILE (columnar storage format)
    • If the file data is plain text, use STORED AS TEXTFILE; if the data needs to be compressed, use STORED AS SEQUENCEFILE
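
A minimal sketch combining several of these clauses; the table name, columns, and HDFS path are hypothetical:

create external table if not exists access_log( 
ip string comment 'client ip', 
url string comment 'requested url' 
) 
comment 'raw access log' 
partitioned by (dt string) 
row format delimited fields terminated by '\t' 
stored as textfile 
location '/user/hive/warehouse/access_log'; 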

Managed table

The tables created by default are so-called managed tables, sometimes called internal tables. For these tables, Hive controls (more or less) the life cycle of the data. By default, Hive stores their data in a subdirectory of the directory defined by the configuration item hive.metastore.warehouse.dir (for example, /user/hive/warehouse).

When we drop a managed table, Hive deletes the table's data as well. Managed tables are therefore not suitable for sharing data with other tools.

External table

Because the table is external, Hive does not assume full ownership of the data. Dropping the table does not delete the data; only the metadata describing the table is removed.

Usage scenarios for managed and external tables

Collected website logs flow into HDFS text files regularly every day. Extensive statistical analysis is done on top of an external table (the raw log table); the intermediate tables and result tables are stored as managed tables, with data entering them via SELECT + INSERT. A sketch of this pattern follows.
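
A minimal sketch of the pattern; the table names and log location are hypothetical:

# Raw logs stay in place on HDFS; dropping this table keeps the data 
create external table if not exists raw_log(line string) 
location '/data/logs/website'; 
# Intermediate and result tables are managed; data enters via SELECT + INSERT 
create table if not exists log_count(cnt bigint); 
insert overwrite table log_count select count(*) from raw_log; 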

Converting between managed and external tables

# Query the table type (Table Type: MANAGED_TABLE) 
desc formatted test; 
# Convert the managed table test to an external table 
alter table test set tblproperties('EXTERNAL'='TRUE'); 

Note: ('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed spellings and case sensitive!
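
Correspondingly, converting back to a managed table uses the FALSE value:

alter table test set tblproperties('EXTERNAL'='FALSE'); 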

Modify table

# Rename a table 
ALTER TABLE table_name RENAME TO new_table_name;
# Update a column
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]; 
# Add or replace columns 
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...); 

Note:

  • ADD adds new fields; they are positioned after all existing columns (but before any partition columns)
  • REPLACE replaces all fields in the table (concrete sketches follow)
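
Concrete sketches against a hypothetical table test; the column names are illustrative:

# Rename the table 
alter table test rename to test2; 
# Rename column id to uid and change its type to string 
alter table test2 change column id uid string; 
# Add a new column 
alter table test2 add columns (remark string comment 'remark info'); 
# Replace all columns with a new definition 
alter table test2 replace columns (uid string, name string); 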

Delete table

drop table test; 

DML

Data import

Load data into a table (load)

load data [local] inpath 'Data path' [overwrite] into table test [partition (partcol1=val1,...)]; 

Field interpretation (a concrete sketch follows the list):

  • load data: load data into the table
  • local: load from the local file system into the Hive table; without it, data is loaded from HDFS
  • inpath: the path of the data to load
  • overwrite: overwrite the existing data in the table; without it, data is appended
  • into table: the table to load into
  • test: the specific table name
  • partition: upload into the specified partition
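
A concrete sketch, assuming a hypothetical local file /opt/module/hive/data/test.txt:

load data local inpath '/opt/module/hive/data/test.txt' overwrite into table test; 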

Insert data into a table through a query statement (insert)

# Create a table 
create table student_par(id int, name string) row format delimited fields terminated by '\t'; 
# Basic insert data 
insert into table student_par values(1,'wangwu'),(2,'zhaoliu');
# Basic mode insertion (based on the query results of a single table) 
insert overwrite table student_par select id, name from student where month='201709'; 
# Multi table (multi partition) insertion mode (based on query results of multiple tables) 
from student 
insert overwrite table student partition(month='201707') 
select id, name where month='201709' 
insert overwrite table student partition(month='201706') 
select id, name where month='201709'; 

insert into: appends data to a table or partition; the original data is not deleted

insert overwrite: overwrites the existing data in the table

Note: insert does not support inserting partial fields

Create tables and load data in query statements (As Select)

create table if not exists student3 as select id, name from student; 

Create a table based on the query results (the query results are added to the newly created table)

Import data into the specified Hive table

Note: the data must be exported with EXPORT first, and then imported.

import table student2 from '/user/hive/warehouse/export/student'; 
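
A paired sketch showing the required order (the export path matches the example in "Export to HDFS" below):

# Export first... 
export table student to '/user/hive/warehouse/export/student'; 
# ...then import into another table 
import table student2 from '/user/hive/warehouse/export/student'; 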

Data export

Insert export

# Export query results to local 
insert overwrite local directory '/opt/module/hive/data/export/student' 
select * from student; 
# Format and export query results to local 
insert overwrite local directory 
'/opt/module/hive/data/export/student1' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
select * from student; 
# Export the query results to HDFS (without the local keyword) 
insert overwrite directory '/user/atguigu/student2' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  
select * from student; 

Hadoop command export to local

dfs -get /user/hive/warehouse/student/student.txt 
/opt/module/data/export/student3.txt; 

Hive Shell command export

bin/hive -e 'select * from default.student;' > 
 /opt/module/hive/data/export/student4.txt; 

Export to HDFS

export table default.student to '/user/hive/warehouse/export/student'; 

Clear data in table (Truncate)

Note: Truncate can only delete data in managed tables, not data in external tables

truncate table student; 

Basic query

Data preparation (for the queries below)

Raw data preparation

dept.txt:

10 ACCOUNTING 1700 
20 RESEARCH 1800 
30 SALES 1900 
40 OPERATIONS 1700

emp.txt:

7369 SMITH CLERK 7902 1980-12-17 800.00 20 
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30 
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30 
7566 JONES MANAGER 7839 1981-4-2 2975.00 20 
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30 
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30 
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10 
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20 
7839 KING PRESIDENT 1981-11-17 5000.00 10 
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30 
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20 
7900 JAMES CLERK 7698 1981-12-3 950.00 30 
7902 FORD ANALYST 7566 1981-12-3 3000.00 20 
7934 MILLER CLERK 7782 1982-1-23 1300.00 10 

Use the test database

use test;

Create department table

create table if not exists dept( 
deptno int, 
dname string, 
loc int 
) 
row format delimited fields terminated by '\t'; 

Create employee table

create table if not exists emp( 
empno int, 
ename string, 
job string, 
mgr int, 
hiredate string,  
sal double,  
comm double, 
deptno int
) 
row format delimited fields terminated by '\t'; 

Import data

load data local inpath './dept.txt' overwrite into table dept; 
load data local inpath './emp.txt' overwrite into table emp; 

Full table and specific column queries

# Full table query 
select * from emp; 
select empno, ename, job, mgr, hiredate, sal, comm, deptno from emp; 
# Query specific columns 
select empno, ename from emp; 

Notes:

  • SQL is case insensitive.
  • SQL can be written on one line or across multiple lines.
  • Keywords cannot be abbreviated or split across lines.
  • Clauses are generally written on separate lines.
  • Use indentation to improve the readability of statements.

Alias

Renaming a column or table can simplify calculations. The alias directly follows the column name, or the keyword AS can be added between the column name and the alias.

select ename AS name, deptno dn from emp; 

Some query keywords that need special attention

Most keywords in Hive are the same as in MySQL, so they are not listed one by one. Below are some keywords that deserve special attention.

RLIKE

The RLIKE clause is a Hive extension of LIKE that lets you specify match conditions with the more powerful Java regular expression syntax.

# Find employees whose name contains an A 
select * from emp where ename rlike '[A]'; 
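
Java regular expressions also support anchors; for example, a sketch matching names that begin with A:

select * from emp where ename rlike '^A'; 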

Group By group statement

The GROUP BY statement is usually used together with aggregate functions: it groups the results by one or more columns and then applies an aggregate to each group.

# Calculate the average salary of each department in emp table 
select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno; 
# Calculate the maximum salary for each position in each department of emp 
select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job; 

Having statement

Differences between having and where:

  1. Aggregate (grouping) functions cannot be used after where, but they can be used after having.
  2. having can only be used in group by statements.
# Find the average salary of each department 
select deptno, avg(sal) from emp group by deptno; 
# Departments whose average salary is greater than 2000 
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000; 

join query

Joins work the same as in MySQL; the full outer join deserves particular attention.
Full outer join: returns all records from all tables that satisfy the conditions of the WHERE clause. If the specified field of either table has no matching value, NULL is used instead.

select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno; 

Sorting

Global sorting

Order By: global sorting with only one Reducer (ascending, ASC, by default)

select * from emp order by sal; 
select * from emp order by sal desc; 

Sort by alias

# Sort by twice the employee's salary 
select ename, sal*2 twosal from emp order by twosal; 

Multiple column sorting

# Sort by department and salary in ascending order 
select ename, deptno, sal from emp order by deptno, sal; 

Sort By within each Reducer

sort by: order by is very inefficient for large data sets. In many cases global ordering is not required, and sort by can be used instead.
sort by produces one sorted file per reducer; each reducer's output is sorted internally, but the global result set is not.

The example operation is as follows:

  1. Set the number of reducers

    set mapreduce.job.reduces=3; 
    
  2. View the number of reducers

    set mapreduce.job.reduces; 
    
  3. View employee information in descending order according to department number

    select * from emp sort by deptno desc; 
    
  4. Import the query results into the file (sorted by department number in descending order)

    insert overwrite local directory './sortby-result' select * from emp sort by deptno desc; 
    

Distribute By

distribute by: in some cases we need to control which reducer a particular row goes to, usually for a subsequent aggregation operation. The distribute by clause does this. distribute by is similar to partitioning (custom partitioners) in MapReduce and is used in combination with sort by.

To test distribute by, you must allocate multiple reducers; otherwise you cannot see its effect.

The example operation is as follows:

First partition by department number, then sort by employee number in descending order.

set mapreduce.job.reduces=3; 
insert overwrite local directory './distribute-result' 
select * from emp distribute by deptno sort by empno desc; 

Notes:

  • The partitioning rule of distribute by is to take the hash code of the partition column modulo the number of reducers; rows with the same remainder go into the same partition (a worked example follows).
  • Hive requires that the DISTRIBUTE BY clause be written before the SORT BY clause.
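
For example, with set mapreduce.job.reduces=3; and assuming the hash code of an int equals its value (as in Hive's default hashing for integers), rows with deptno 10, 20, and 30 go to reducers 10 % 3 = 1, 20 % 3 = 2, and 30 % 3 = 0 respectively, i.e. three different partitions.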

Cluster By

When the distribute by and sort by columns are the same, cluster by can be used instead.

In addition to the function of distribute by, cluster by also performs the function of sort by. However, the sort is ascending only; ASC or DESC cannot be specified.

# The following two statements are equivalent 
select * from emp cluster by deptno; 
select * from emp distribute by deptno sort by deptno;

Note: partitioning by department number does not mean each number gets its own fixed partition; for example, departments 20 and 30 may be assigned to the same partition.

Topics: Database Big Data hive