Chapter 1 Starting the Hadoop cluster for Hive
Start HDFS HA
(check that the NameNode web UI on port 50070 comes up)
- Start the ZooKeeper service on each machine
zkServer.sh start
- Start HDFS service
sbin/start-dfs.sh
Start YARN HA
(check that the ResourceManager web UI on port 8088 comes up)
- Start YARN
sbin/start-yarn.sh
- Start the ResourceManager on the other HA node
sbin/yarn-daemon.sh start resourcemanager
Chapter 2 JDBC access to Hive
Start the hiveserver2 service
bin/hiveserver2
Start beeline
bin/beeline
Connect to the hiveserver2 service
beeline> !connect jdbc:hive2://server1:10000
Connecting to jdbc:hive2://server1:10000
Enter username for jdbc:hive2://server1:10000: server (enter)
Enter password for jdbc:hive2://server1:10000: (enter)
Connected to: Apache Hive (version 2.3.8)
Driver: Hive JDBC (version 2.3.8)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://server1:10000>
0: jdbc:hive2://server1:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
Chapter 3 local access (bin/hive)
1 - help
bin/hive -help
usage: hive
 -d,--define <key=value>        Variable substitution to apply to Hive commands, e.g. -d A=B or --define A=B
    --database <databasename>   Specify the database to use
 -e <quoted-query-string>       Execute a SQL statement from the command line
 -f <filename>                  Execute SQL statements from a file
 -H,--help                      Print the help information
    --hiveconf <property=value> Use the given value for the given property
    --hivevar <key=value>       Variable substitution to apply to Hive commands, e.g. --hivevar A=B
 -i <filename>                  Initialization SQL file
 -S,--silent                    Silent mode in interactive shell
 -v,--verbose                   Verbose mode (echo executed SQL to the console)
2 - execute SQL statements from the command line
bin/hive -e 'show databases;'
database_name
default
3 - view hdfs file system
hive (default)> dfs -ls /;
4 - view local file system
! ls /opt/module/;
5 - view Hive's command history
[server@server1 hive]$ cat /home/server/.hivehistory
6 - view all configuration information
hive (default)> set;
Chapter 4 Database operations
1 - create database syntax
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
2 - create a database and specify its path on HDFS
hive (default)> create database db_hive2 location '/db_hive2.db';
3 - display database information
desc database default;
4 - display database details
desc database extended db_hive;
5 - switch database
hive (default)> use db_hive;
6 - delete a database
hive> drop database db_hive;
It is safer to check whether the database exists first:
hive> drop database if exists db_hive;
If the database is not empty, use the cascade keyword to force deletion:
drop database db_hive cascade;
Chapter 5 Table operations
1 - create table syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]
2 - create a normal table
create table if not exists student( id int, name string);
3 - create an external table (data stays on HDFS)
When a managed (internal) table is dropped, both its metadata and its data are deleted; dropping an external table deletes only the metadata and leaves the data in place.
create external table stu_external( id int,name string) row format delimited fields terminated by '\t' location '/student';
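The difference shows up when a table is dropped; a quick sketch using the tables above:

```sql
-- Dropping the external table removes only its metadata;
-- the files under /student remain on HDFS:
drop table stu_external;
dfs -ls /student;

-- Dropping a managed table (e.g. student) would delete both the
-- metadata and the data files in the warehouse directory.
```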
4 - create a table based on query results
create table if not exists student3 as select id, name from student;
5 - create a table based on the existing table structure
create table if not exists student4 like student;
6 - query table type
desc formatted student;
The output looks like this:
OK
col_name data_type comment
id int
name string
Detailed Table Information
Database: db_hive
Owner: server
CreateTime: Wed May 05 11:27:47 CST 2021
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://mycluster/user/hive/warehouse/db_hive.db/student
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1620185267
Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.129 seconds, Fetched: 31 row(s)
7 - view table formatted data
desc formatted dept;
8 - delete external table
drop table stu_external;
9 - convert between internal and external tables
alter table student2 set tblproperties('EXTERNAL'='TRUE');
alter table student2 set tblproperties('EXTERNAL'='FALSE');
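Note that 'TRUE' and 'FALSE' here are quoted, case-sensitive string values, not booleans. To confirm the conversion took effect, check the table type:

```sql
desc formatted student2;
-- look for: Table Type: EXTERNAL_TABLE (or MANAGED_TABLE)
```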
Chapter 6 HDFS operations
1 - upload data to HDFS
dfs -mkdir /student;
dfs -put /opt/module/datas/student.txt /student;
Chapter 7 Partitioned tables
1 - partitioned table layout (data is managed in separate directories, e.g. by day)
/user/hive/warehouse/log_partition/20170702/20170702.log
/user/hive/warehouse/log_partition/20170703/20170703.log
/user/hive/warehouse/log_partition/20170704/20170704.log
2 - create partition table
hive (default)> create table dept_partition(
deptno int, dname string, loc string)
partitioned by (month string);
3 - load data into partition table
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707');
Note: the partition must be specified when loading data into a partitioned table.
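If the partition value should come from the data itself, Hive's dynamic partitioning can be used instead; a sketch, assuming a hypothetical unpartitioned staging table dept_staging with columns (deptno, dname, loc, month):

```sql
-- Enable dynamic partitioning (nonstrict allows all partitions
-- to be determined dynamically):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The final column of the select feeds the partition column:
insert into table dept_partition partition(month)
select deptno, dname, loc, month from dept_staging;
```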
4 - query partition table data
select * from dept_partition where month='201709';
5 - multi-partition union query
select * from dept_partition where month='201709'
union
select * from dept_partition where month='201708'
union
select * from dept_partition where month='201707';
6 - add partition
Add a single partition
alter table dept_partition add partition(month='201706');
Add multiple partitions
alter table dept_partition add partition(month='201705') partition(month='201704');
7 - delete partition
Delete a single partition
alter table dept_partition drop partition (month='201704');
Delete multiple partitions
alter table dept_partition drop partition (month='201705'), partition (month='201706');
8 - view partition
show partitions dept_partition;
9 - view partition table structure
desc formatted dept_partition;
10 - create a two-level partitioned table
create table dept_partition2(
deptno int, dname string, loc string)
partitioned by (month string, day string);
11 - load data into partition table
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='201709', day='13');
12 - query two-level partitioned table data
select * from dept_partition2 where month='201709' and day='13';
13 - associate existing HDFS data with a partitioned table
- Upload the data, then run the repair command
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;
hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;
msck repair table dept_partition2;
- Upload the data, then add the partition
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=11;
hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;
hive (default)> alter table dept_partition2 add partition(month='201709', day='11');
- Create the directory, then load data into the partition
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=10;
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709',day='10');
Chapter 8 Adding, changing and replacing columns
1 - syntax
Update column
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
Add and replace columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
2 - add column
alter table dept_partition add columns(deptdesc string);
3 - change a column
alter table dept_partition change column deptdesc desc int;
4 - replace column
alter table dept_partition replace columns(deptno string, dname string, loc string);
Chapter 9 Loading data
1 - syntax
hive> load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,...)];
Chapter 10 Inserting data
1 - basic insert
hive (default)> insert into table student partition(month='201709') values(1,'wangwu'), (2,'zhaoliu');
2 - overwrite insert (will overwrite existing data)
hive (default)> insert overwrite table student partition(month='201708') select id, name from student where month='201709';
3 - multi-table / multi-partition insert
hive (default)> from student
insert overwrite table student partition(month='201707')
select id, name where month='201709'
insert overwrite table student partition(month='201706')
select id, name where month='201709';
Chapter 11 data export and import
1 - import
Note: the data must first be exported with export before it can be imported.
hive (default)> import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';
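The path given to import must point at a directory previously written by export; for example, the data above would first have been produced with:

```sql
export table default.student to '/user/hive/warehouse/export/student';
```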
2 - export query results to the local file system
hive (default)> insert overwrite local directory '/opt/module/datas/export/student' select * from student;
3 - export formatted query results to the local file system
hive (default) > insert overwrite local directory '/opt/module/datas/export/student1' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;
4 - export query results to HDFS (without the local keyword)
hive (default)> insert overwrite directory '/user/atguigu/student2' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;
5 - export to the local file system with Hadoop commands
hive (default)> dfs -get /user/hive/warehouse/student/month=201709/000000_0 /opt/module/datas/export/student3.txt;
6-Hive Shell command export
bin/hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt;
7-export export
hive (default)> export table default.student to '/user/hive/warehouse/export/student';
8-Sqoop export
9 - clear table data
Note: truncate can only be used on managed (internal) tables; it cannot delete data in external tables.
hive (default)> truncate table student;
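A sketch of the restriction, using the tables from earlier chapters (the exact error text varies by Hive version):

```sql
-- Works: student is a managed table
truncate table student;

-- Fails: stu_external is an external table
-- truncate table stu_external;
-- Error: Cannot truncate non-managed table stu_external
```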
Chapter 12 Queries
1 - syntax
Notes:
(1) SQL is case-insensitive.
(2) SQL can be written on one line or across multiple lines.
(3) Keywords cannot be abbreviated or split across lines.
(4) Each clause is usually written on its own line.
(5) Use indentation to improve readability.
[WITH CommonTableExpression (, CommonTableExpression)*]   (Note: only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
2 - arithmetic operators
Operator | Description |
---|---|
A+B | A plus B |
A-B | A minus B |
A*B | A times B |
A/B | A divided by B |
A%B | Remainder of A divided by B |
A&B | Bitwise AND of A and B |
A\|B | Bitwise OR of A and B |
A^B | Bitwise XOR of A and B |
~A | Bitwise NOT of A |
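For example, using the emp table that appears in the following sections, each operator can be applied directly in a select list:

```sql
-- Add 1 to every employee's salary:
select ename, sal + 1 from emp;

-- Remainder of each salary divided by 1000:
select ename, sal % 1000 from emp;
```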
3 - common aggregate functions
- Total number of rows (count)
hive (default)> select count(*) cnt from emp;
- Maximum salary (max)
hive (default)> select max(sal) max_sal from emp;
- Minimum salary (min)
hive (default)> select min(sal) min_sal from emp;
- Sum of salaries (sum)
hive (default)> select sum(sal) sum_sal from emp;
- Average salary (avg)
hive (default)> select avg(sal) avg_sal from emp;
4-limit
A typical query returns many rows; the LIMIT clause restricts the number of rows returned.
hive (default)> select * from emp limit 5;
5 - where clause
Note: the WHERE clause filters out rows that do not satisfy the condition. It follows the FROM clause. Column aliases cannot be used in the WHERE clause.
Case:
Find all employees whose salary is greater than 1000
hive (default)> select * from emp where sal >1000;
6 - comparison operators (Between / In / Is Null)
Operator | Supported data types | Description |
---|---|---|
A=B | Basic data type | If A is equal to B, it returns TRUE, otherwise it returns FALSE |
A<=>B | Basic data type | Returns TRUE if both A and B are NULL, and FALSE if only one of them is NULL; otherwise the result is the same as that of the equal sign (=) operator |
A<>B, A!=B | Basic data type | If A or B is NULL, NULL is returned; If A is not equal to B, it returns TRUE, otherwise it returns FALSE |
A<B | Basic data type | If A or B is NULL, NULL is returned; If A is less than B, it returns TRUE, otherwise it returns FALSE |
A<=B | Basic data type | If A or B is NULL, NULL is returned; If A is less than or equal to B, it returns TRUE, otherwise it returns FALSE |
A>B | Basic data type | If A or B is NULL, NULL is returned; If A is greater than B, it returns TRUE, otherwise it returns FALSE |
A>=B | Basic data type | If A or B is NULL, NULL is returned; If A is greater than or equal to B, it returns TRUE, otherwise it returns FALSE |
A [NOT] BETWEEN B AND C | Basic data type | If any of A, B, or C is NULL, the result is NULL. Returns TRUE if A is greater than or equal to B and less than or equal to C, otherwise FALSE. The NOT keyword reverses the result. |
A IS NULL | All data types | If A is equal to NULL, it returns TRUE, otherwise it returns FALSE |
A IS NOT NULL | All data types | If A is not equal to NULL, it returns TRUE, otherwise it returns FALSE |
A IN (value1, value2, ...) | All data types | Returns TRUE if A equals any value in the list |
A [NOT] LIKE B | STRING type | B is a simple SQL pattern, also known as a wildcard expression. Returns TRUE if A matches it, otherwise FALSE. In B, 'x%' matches values starting with 'x', '%x' matches values ending with 'x', and '%x%' matches values containing 'x' anywhere (beginning, middle, or end). The NOT keyword reverses the result. |
A RLIKE B, A REGEXP B | STRING type | B is a Java-style regular expression. Returns TRUE if A matches it, otherwise FALSE. Matching is done by the JDK regular expression engine; note that the pattern only needs to match a substring of A, not the entire string. |
Case:
(1) Find all employees whose salary equals 5000
hive (default)> select * from emp where sal = 5000;
(2) Query employees whose salary is between 500 and 1000
hive (default)> select * from emp where sal between 500 and 1000;
(3) Query all employees whose comm is null
hive (default)> select * from emp where comm is null;
(4) Query employees whose salary is 1500 or 5000
hive (default)> select * from emp where sal in (1500, 5000);
7-Like and RLike
1) Use the LIKE operator to select values matching a pattern.
2) The pattern can contain ordinary characters and two wildcards:
% matches zero or more characters.
_ matches exactly one character.
3) RLIKE is a Hive extension that allows match conditions to be written with the more powerful Java regular expression syntax.
(1) Find employees whose salary starts with 2
hive (default)> select * from emp where sal LIKE '2%';
(2) Find employees whose salary has 2 as its second digit
hive (default)> select * from emp where sal LIKE '_2%';
(3) Find employees whose salary contains a 2
hive (default)> select * from emp where sal RLIKE '[2]';
8 - logical operators
Operator | Meaning |
---|---|
AND | Logical and |
OR | Logical or |
NOT | Logical not |
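For example, combining conditions with the operators above:

```sql
-- Employees with salary above 1000, excluding department 30:
select * from emp where sal > 1000 and deptno != 30;

-- Employees with salary above 1000, or in department 30:
select * from emp where sal > 1000 or deptno = 30;

-- Employees in neither department 30 nor department 20:
select * from emp where deptno not in (30, 20);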
9 - grouping
9.1 Group By
The GROUP BY statement is usually used with aggregate functions: it groups the result by one or more columns and then applies the aggregate to each group.
Case:
Compute the average salary of each department in the emp table
hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
Compute the maximum salary for each job in each department of emp
hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;
9.2 Having
Differences between having and where:
(1) Aggregate functions cannot be used in a where clause, but they can be used in a having clause.
(2) having can only be used in group by statements.
Case:
Compute the average salary of each department
hive (default)> select deptno, avg(sal) from emp group by deptno;
Find departments whose average salary is greater than 2000
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;
10 - join statement
Hive supports the usual SQL JOIN statements, but only equi-joins; non-equi joins are not supported.
Case:
Join the employee and department tables on the department number, and query the employee number, employee name, and department name:
hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno = d.deptno;
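Only equality predicates are allowed in the ON clause; a sketch of what is and is not accepted:

```sql
-- Accepted (equi-join):
select e.empno, d.dname
from emp e join dept d on e.deptno = d.deptno;

-- Rejected by Hive versions that only support equi-joins:
-- select e.empno, d.dname
-- from emp e join dept d on e.deptno < d.deptno;
```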