1. Introduction to Sqoop
(1) Introduction:
Sqoop is an Apache tool for transferring data between Hadoop and relational database servers.
Import data: import data from relational databases such as MySQL or Oracle into Hadoop storage systems such as HDFS, Hive, and HBase.
Export data: export data from the Hadoop file system to a relational database.
(2) Working mechanism
Sqoop translates the import and export commands into a MapReduce program that needs no reduce task; the generated MapReduce job mainly customizes InputFormat and OutputFormat.
(3) Import and export principles of Sqoop
Data import:
The Sqoop tool imports data through a MapReduce job. In general, each row of a table in the relational database is written to HDFS as one record.
Explanation:
Sqoop obtains the metadata of the database through JDBC, such as the column names and data types of the table to be imported.
These database data types are mapped to Java data types. Based on this information, Sqoop generates a class with the same name as the table; this class handles serialization and holds each row of the table.
Sqoop starts a MapReduce job.
During the input phase of the started job, the contents of the table are read through JDBC and deserialized into instances of the class generated by Sqoop.
Finally, these records are serialized again by the generated class and written to HDFS.
Data export:
Explanation:
First, Sqoop accesses the relational database through JDBC to obtain the metadata of the data to be exported.
Based on the obtained metadata, Sqoop generates a Java class that serves as the carrier for data transmission; this class must support serialization.
Sqoop starts the MapReduce program.
Sqoop uses the generated Java class to read data from HDFS in parallel.
Each map task generates a batch of INSERT statements from the metadata of the exported table and the data it reads, and the map tasks then insert data into the MySQL database in parallel.
Summary: data is read from HDFS and written to the database concurrently. Parallel reading depends on HDFS performance, while parallel writing depends on MySQL performance.
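Although the walkthrough below focuses on imports, the export direction looks like this. A minimal sketch, assuming a MySQL target table student_copy (hypothetical) whose schema matches the data, and an HDFS directory like the /sqoop/demo1 produced by the import examples below, written with Sqoop's default comma delimiter:

sqoop export \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student_copy \
--export-dir /sqoop/demo1 \
--input-fields-terminated-by ',' \
--m 1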
2. Installation and use of Sqoop
First, unpack the installation package and rename the directory for easier operation:
[root@gree139 ~]# tar -zxf /opt/install/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/soft/
[root@gree139 ~]# cd /opt/soft
[root@gree139 soft]# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop146
[root@gree139 soft]# cd ./sqoop146/conf/
[root@gree139 conf]# mv sqoop-env-template.sh sqoop-env.sh
Modify the configuration file so that each entry points to the corresponding installation location on this machine:
[root@gree139 conf]# vi ./sqoop-env.sh

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/soft/hadoop260

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/soft/hadoop260

#set the path to where bin/hbase is available
export HBASE_HOME=/opt/soft/hbase120

#Set the path to where bin/hive is available
export HIVE_HOME=/opt/soft/hive110

#Set the path for where zookeper config dir is
export ZOOCFGDIR=/opt/soft/zookeeper345/conf
After the modification, copy the MySQL JDBC driver jar from Hive's lib directory into Sqoop's lib directory, and then add the environment variables:
[root@gree139 conf]# cp /opt/soft/hive110/lib/mysql-connector-java-5.1.25.jar ../lib/
[root@gree139 conf]# vi /etc/profile

#sqoop
export SQOOP_HOME=/opt/soft/sqoop146
export PATH=$PATH:$SQOOP_HOME/bin
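Reload the profile so the new variables take effect in the current shell:

[root@gree139 conf]# source /etc/profile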
Check the version number and see how Sqoop is used:
[root@gree139 conf]# sqoop version
[root@gree139 conf]# sqoop help
[root@gree139 conf]# sqoop help list-databases
Next, use Sqoop to connect to the database and list its databases (single-line and multi-line forms):
[root@gree139 ~]# sqoop list-databases --connect jdbc:mysql://gree139:3306 --username root --password root

[root@gree139 ~]# sqoop list-databases \
--connect jdbc:mysql://gree139:3306 \
--username root \
--password root
Use Sqoop to import table data from the database into HDFS; --m 1 means a single map task, which produces a single partition (output file):
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--delete-target-dir \
--target-dir /sqoop/demo1 \
--m 1
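To check the result, you can view the imported file directly on HDFS; with a single map task the output is typically a single part-m-00000 file:

hdfs dfs -cat /sqoop/demo1/part-m-00000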
If you set two map tasks and split the data by classId, there will be two output files:
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--delete-target-dir \
--target-dir /sqoop/demo2 \
--split-by classId \
--fields-terminated-by '\t' \
--m 2
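Listing the target directory should show one output file per map task (typically part-m-00000 and part-m-00001):

hdfs dfs -ls /sqoop/demo2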
Import data from MySQL into Hive (first create the Hive table, then import the data).
Create the table kb15.student_mysql in Hive with the same structure as the student table in MySQL:
# Two-step approach: create the Hive table first, then import into it
[root@gree139 ~]# sqoop create-hive-table \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--hive-table kb15.student_mysql

[root@gree139 ~]# sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--hive-table kb15.student_mysql \
--hive-import \
--m 1

# One-step version: creates the Hive table and imports in a single command
[root@gree139 ~]# sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--hive-import \
--hive-database kb15 \
--m 1
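A quick way to confirm the Hive import, assuming the hive CLI is on the PATH:

hive -e "select * from kb15.student_mysql limit 10;"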
To import with a WHERE condition:
[root@gree139 ~]# sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--where "telephone='1392323224'" \
--target-dir /sqoop/demo4 \
--delete-target-dir \
--m 1
Import the result of a free-form SQL query with --query:
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--query 'select id,name,age,gender,telephone,email,classId from student where id>3 and classId=1 and $CONDITIONS' \
--target-dir /sqoop/demo6 \
--delete-target-dir \
--m 1
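Note that when the --query string is wrapped in double quotes, $CONDITIONS must be escaped as \$CONDITIONS so the shell does not expand it, and with more than one map task --split-by is required. A sketch of a parallel variant (the target directory /sqoop/demo6b is just an illustrative name):

sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--query "select id,name,age,gender,telephone,email,classId from student where id>3 and \$CONDITIONS" \
--split-by id \
--target-dir /sqoop/demo6b \
--delete-target-dir \
--m 2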
Implement incremental import (append mode):
# Full import first
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--delete-target-dir \
--target-dir /sqoop/incre1 \
--m 1

-- In MySQL: add two new rows
insert into student(name,age,gender,telephone,email,classId) values
("Liu Yong",24,"male","13695847598","liuyong@qq.com",1),
("Liu Xiaoyong",2,"male","1360000000","liuxiaoyong@qq.com",1);

# Incremental import: only rows with id greater than the last value (9) are appended
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--target-dir /sqoop/incre1 \
--incremental append \
--check-column id \
--last-value 9 \
--m 1
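Instead of tracking --last-value by hand, the incremental import can be saved as a Sqoop job; Sqoop then records the last imported value after each run. A minimal sketch (the job name incre_student is arbitrary):

sqoop job --create incre_student -- import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--target-dir /sqoop/incre1 \
--incremental append \
--check-column id \
--last-value 9 \
--m 1

sqoop job --exec incre_student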
Incremental import in lastmodified mode (used with --append or --merge-key):
-- In MySQL: create a table with an automatically updated timestamp column
create table student2(
id int,
name varchar(32),
last_mod timestamp default current_timestamp on update current_timestamp
);

insert into student2(id,name) values(1,'zhangxiaohua'),(2,'litiechui');
select * from student2;

# Full import first
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student2 \
--delete-target-dir \
--target-dir /sqoop/incre2 \
--m 1

-- In MySQL: add two new rows
insert into student2(id,name) values(3,'zhaodaqiang'),(4,'chenxiaowang');

# Incremental import in lastmodified mode: rows modified after the last value are appended
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student2 \
--target-dir /sqoop/incre2 \
--incremental lastmodified \
--check-column last_mod \
--last-value "2021-11-23 11:56:25" \
--append \
--m 1
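In lastmodified mode, --merge-key can be used instead of --append: rows whose key already exists in the target directory are merged (updated in place) rather than written as duplicates. A sketch using the primary key id:

sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student2 \
--target-dir /sqoop/incre2 \
--incremental lastmodified \
--check-column last_mod \
--last-value "2021-11-23 11:56:25" \
--merge-key id \
--m 1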