Hive data types, database-related operations, table-related operations, data import and export

Posted by luanne on Sun, 23 Jan 2022 00:23:28 +0100

Hive data types

1. Basic data types
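
Hive's primitive data types include TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, and BINARY. They map closely to Java's primitive types, and STRING behaves like a variable-length varchar.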

2. Collection data types
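
Hive has three collection types: ARRAY, MAP, and STRUCT. ARRAY and MAP are similar to Java's Array and Map, while STRUCT groups a set of named fields (like a C struct). Collection types can be nested arbitrarily.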

Case practice

(1) Suppose a table has the following row; we use JSON to represent its data structure. Accessed in Hive, the data looks like this:

{
	"name": "songsong",
	"friends": ["bingbing", "lili"],       // ARRAY (list)
	"children": {                          // MAP (key-value)
			"xiao song": 18,
			"xiaoxiao song": 19
		},
	"address": {                           // STRUCT
			"street": "hui long guan",
			"city": "beijing"
		}
}

(2) Based on the above data structure, we create the corresponding table in Hive and import data. Create the local test file test.txt:

songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

(3) Create a test table test in Hive

create table test(
name string,
friends array<string>,
children map<string, int>,
address struct<street:string, city:string>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

Field interpretation:

  • Column separator: row format delimited fields terminated by ','
  • Separator for MAP and ARRAY collection items (the data segmentation symbol): collection items terminated by '_'
  • Separator between key and value in a MAP: map keys terminated by ':'
  • Line separator: lines terminated by '\n';

(4) Import the text data into the test table

load data local inpath '/opt/module/hive/datas/test.txt' into table test;

(5) Access the data in the three collection columns. The following shows how to access an ARRAY, a MAP, and a STRUCT

select friends[1],children['xiao song'],address.city from test where name="songsong";
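
Given the test.txt data above, this query should return lili, 18, and beijing for the songsong row (array indexes start at 0, so friends[1] is the second element).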

3. Type conversion

1) Implicit type conversion rules are as follows

(1) Any integer type can be implicitly converted to a wider type, e.g. TINYINT to INT, and INT to BIGINT.
(2) All integer types, FLOAT, and STRING can be implicitly converted to DOUBLE.
(3) TINYINT, SMALLINT, and INT can all be converted to FLOAT.
(4) BOOLEAN cannot be converted to any other type.

2) Explicit data type conversion can be performed with the CAST operator

For example, CAST('1' AS INT) converts the string '1' to the integer 1. If the cast fails, e.g. CAST('X' AS INT), the expression returns NULL.
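
A short sketch illustrating both rules (the literal values are chosen just for illustration):

select '1' + 2;               -- implicit: the STRING '1' is promoted to DOUBLE, result 3.0
select cast('1' as int) + 2;  -- explicit: result 3
select cast('X' as int);      -- the cast fails and returns NULL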

Database related operations

1. Create database

CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value,...)];
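
A hedged example putting the optional clauses together (the database name, comment, and HDFS path are made up for illustration):

create database if not exists db_hive
comment 'example database'
location '/user/hive/warehouse/db_hive.db'
with dbproperties ('createtime'='20220122');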

2. Query database

1) Display database

show databases;

2) View database details

desc database [extended] db_hive;    -- extended: show detailed information

3) Switch current database

use database_name;

3. Modify database

Users can use the ALTER DATABASE command to set key-value pairs in a database's DBPROPERTIES to describe the property information of that database.

For example: alter database db_hive set dbproperties('createtime'='20220122');

4. Delete database

1) Delete empty database

drop database database_name;

2) If the database to be deleted may not exist, it is best to use if exists to check for it and avoid an error

drop database if exists database_name;

3) If the database is not empty, use the cascade keyword to force deletion

drop database database_name cascade;

Table related operations

1. Table building

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]
(1) CREATE TABLE creates a table with the specified name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option ignores this exception.
(2) EXTERNAL lets users create an external table and specify a path to the actual data (LOCATION). When an internal (managed) table is deleted, its metadata and data are deleted together; when an external table is deleted, only the metadata is removed and the data is kept.
(3) COMMENT adds comments to the table and its columns.
(4) PARTITIONED BY creates a partitioned table.
(5) CLUSTERED BY creates a bucketed table.
(6) SORTED BY (not commonly used) sorts one or more columns within each bucket.
(7) ROW FORMAT DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, ...)]
	Users can specify a custom SerDe when creating a table or use the built-in one; if neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. SerDe is short for Serialize/Deserialize: Hive uses the SerDe to serialize and deserialize row objects and to determine the data of each column of the table.
(8) STORED AS specifies the storage file type. Common types are SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), and RCFILE (columnar storage format).
	If the file data is plain text, use STORED AS TEXTFILE; if the data needs to be compressed, use STORED AS SEQUENCEFILE.
(9) LOCATION specifies the table's storage location on HDFS.
(10) AS is followed by a query statement; the table is created from the query results.
(11) LIKE allows users to copy an existing table structure without copying its data.

2. Managed table

Theory: tables created by default are so-called managed tables, sometimes called internal tables. Hive controls (more or less) the life cycle of the data in such a table, and by default stores the table data under the directory defined by hive.metastore.warehouse.dir (/user/hive/warehouse).

When we delete a managed table, Hive also deletes the data in it, so managed tables are not suitable for sharing data with other tools.

(1) Normal create table

create table if not exists student(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student';

(2) Create a table based on the query results (the query results are added to the newly created table)

create table if not exists student2 as select id,name from student;

(3) Create a table based on an existing table structure

create table if not exists student3 like student;

(4) Query the type of a table

desc formatted table_name;

3. External table

Theory: because the table is external, Hive does not assume it fully owns the data. Deleting the table does not delete the data; only the metadata describing the table is removed.

Usage scenario analysis: website logs from mobile devices are regularly streamed into HDFS text files every day. Most statistical analysis is done on top of the external table (the raw log table), while the intermediate and result tables are stored as internal tables, with data entering them through SELECT + INSERT.
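
A hedged sketch of this scenario (the table name, column, and HDFS path are invented for illustration):

create external table if not exists ods_log(
    line string
)
row format delimited fields terminated by '\t'
location '/warehouse/log/ods_log';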

4. Conversion between managed table and external table

(1) Query the type of a table

desc formatted table_name;

(2) Modify the internal table student2 to an external table

alter table student2 set tblproperties('EXTERNAL'='TRUE');

(3) Modify the external table student2 to the internal table

alter table student2 set tblproperties('EXTERNAL'='FALSE');
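
After either statement, desc formatted student2; should show the Table Type as EXTERNAL_TABLE or MANAGED_TABLE respectively. Note that 'TRUE' and 'FALSE' in the TBLPROPERTIES are fixed, case-sensitive values.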

5. Rename table

ALTER TABLE table_name RENAME TO new_table_name
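
For example, continuing with the student2 table (the new name is invented for illustration):

alter table student2 rename to student2_bak;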

6. Add / modify / replace column information

(1) Modify column

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment][FIRST|AFTER column_name]

(2) Add and replace columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
Note: ADD COLUMNS adds new fields, positioned after all existing columns (but before partition columns).
	REPLACE COLUMNS replaces all the fields in the table.
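
A hedged sketch of these statements against the student table from earlier (the column names added here are invented for illustration):

alter table student change column name stu_name string;                 -- rename/retype an existing column
alter table student add columns (age int comment 'student age');        -- append a new column after the existing ones
alter table student replace columns (id int, stu_name string, age int); -- redefine all columns of the table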

7. Delete table

drop table table_name;

DML data operation

1. Data import

1) Loading data into a table

load data [local] inpath 'data path' [overwrite] into table student [partition (partcol1=val1,...)];

  • load data: load data into a table
  • local: load data from the local file system into Hive; otherwise, load from HDFS
  • inpath: the path of the data to be loaded
  • overwrite: overwrite the existing data in the table; otherwise, append
  • into table: the table to load into
  • student: a specific table
  • partition: load into the specified partition (a combined example is sketched after this list)
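
A hedged sketch combining several of these options, assuming a student table partitioned by month (the path and partition value are invented for illustration):

load data local inpath '/opt/module/hive/datas/student.txt'
overwrite into table student partition (month='201709');
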
2) Insert data into a table through a query statement (insert)

① Create a table first

create table student_par(id int, name string) row format delimited fields terminated by '\t';

② Basic insert data

insert into table student_par values(1,'wangwu'),(2,'zhaoliu');

③ Basic mode insertion (based on the query results of a single table)

insert overwrite table student_par select id, name from student where month='201709';

Insert into: insert into a table or partition by appending data. The original data will not be deleted

insert overwrite: it will overwrite the existing data in the table

Note: insert does not support inserting partial fields

④ Multi table (multi partition) insertion mode (based on query results of multiple tables)

from student
insert overwrite table student partition(month='201707')
select id, name where month='201709'
insert overwrite table student partition(month='201706')
select id, name where month='201709';

3) Create tables and load data in query statements (As Select)

create table if not exists student3 as select id,name from student;
4) When creating a table, specify the load data path through Location

① Upload data to hdfs

hive (default)> dfs -mkdir /student;
hive (default)> dfs -put /opt/module/datas/student.txt /student;

② Create a table and specify the location on hdfs

hive (default)> create external table if not exists student5(
                id int, name string
                )
                row format delimited fields terminated by '\t'
                location '/student';

5) Import data into a specified Hive table

Note: export the data first, and then import it.

hive (default)> import table student2 from '/user/hive/warehouse/export/student';
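
The matching export statement that produces this directory (shown again in the data export section below) is:

export table default.student to '/user/hive/warehouse/export/student';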

2. Data export

1) Insert export

① Export query results to local

insert overwrite local directory '/opt/module/hive/data/export/student' select * from student;

② Format and export query results to local

insert overwrite local directory '/opt/module/hive/data/export/student1' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;

③ Export query results to HDFS (without the local keyword)

insert overwrite directory '/user/yingzi/student2' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;

2) Hadoop command export to local

dfs -get /user/hive/warehouse/student/student.txt /opt/module/data/export/student3.txt;

3) Hive Shell command export

Basic syntax: hive -f/-e <statement or script file> > output file

[yingzi@hadoop102 hive]$ bin/hive -e 'select * from default.student;' > /opt/module/hive/data/export/student4.txt

4) Export to HDFS

export table default.student to '/user/hive/warehouse/export/student';
5) Clear the data in the table (Truncate)

Note: Truncate can only be used on managed tables; it cannot delete data in external tables

truncate table table_name;


Topics: Database Hadoop hive