1. Introduction to Sqoop
(1) Introduction:
Sqoop is an Apache tool for transferring data between Hadoop and relational database servers.
Import data: import data from relational databases such as MySQL or Oracle into Hadoop storage systems such as HDFS, Hive, and HBase.
Export data: export data from the Hadoop file system to a relational database.
(2) Working mechanism
Sqoop translates the import and export commands into a MapReduce program that needs no reduce task; the generated MapReduce job mainly customizes InputFormat and OutputFormat.
(3) Import and export principles of Sqoop
Data import:
The Sqoop tool imports data through a MapReduce job. In general, each row of a table in the relational database is written to HDFS as one record.
Explanation:
Sqoop obtains the metadata of the database through JDBC, such as the column names and data types of the table to be imported.
These database data types are mapped to Java data types. Based on this information, Sqoop generates a class with the same name as the table; this class handles serialization and holds each row of the table.
Sqoop starts a MapReduce job.
During the input phase of the started job, the contents of the table are read through JDBC and deserialized into instances of the class generated by Sqoop.
Finally, these records are serialized again by the generated class and written to HDFS.
Data export:
Explanation:
First, Sqoop accesses the relational database through JDBC to obtain the metadata of the data to be exported.
Based on the obtained metadata, Sqoop generates a Java class that serves as the carrier for data transmission; this class must support serialization.
Sqoop starts the MapReduce program.
Sqoop uses the generated Java class to read data from HDFS in parallel.
Each map task generates a batch of INSERT statements from the metadata of the exported table and the data it reads, and the map tasks then insert data into the MySQL database in parallel.
Summary: data is read from HDFS and written to the database concurrently. Parallel reading depends on HDFS performance, while parallel writing depends on MySQL performance.
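Although the walkthrough below focuses on imports, the export direction looks like this. A minimal sketch, assuming a MySQL target table student_copy (hypothetical) whose schema matches the data, and an HDFS directory like the /sqoop/demo1 produced by the import examples below, written with Sqoop's default comma delimiter:

sqoop export \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student_copy \
--export-dir /sqoop/demo1 \
--input-fields-terminated-by ',' \
--m 1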
2. Installation and use of Sqoop
First, unpack the installation package and rename the directory for easier operation:
[root@gree139 ~]# tar -zxf /opt/install/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/soft/
[root@gree139 ~]# cd /opt/soft
[root@gree139 soft]# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop146
[root@gree139 soft]# cd ./sqoop146/conf/
[root@gree139 conf]# mv sqoop-env-template.sh sqoop-env.sh
Modify the configuration file so that each entry points to the corresponding installation location on this machine:
[root@gree139 conf]# vi ./sqoop-env.sh

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/soft/hadoop260

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/soft/hadoop260

#set the path to where bin/hbase is available
export HBASE_HOME=/opt/soft/hbase120

#Set the path to where bin/hive is available
export HIVE_HOME=/opt/soft/hive110

#Set the path for where zookeper config dir is
export ZOOCFGDIR=/opt/soft/zookeeper345/conf
After the modification, copy the MySQL JDBC driver jar from Hive's lib directory into Sqoop's lib directory, and then add the environment variables:
[root@gree139 conf]# cp /opt/soft/hive110/lib/mysql-connector-java-5.1.25.jar ../lib/
[root@gree139 conf]# vi /etc/profile

#sqoop
export SQOOP_HOME=/opt/soft/sqoop146
export PATH=$PATH:$SQOOP_HOME/bin
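Reload the profile so the new variables take effect in the current shell:

[root@gree139 conf]# source /etc/profile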
Check the version number and see how Sqoop is used:
[root@gree139 conf]# sqoop version
[root@gree139 conf]# sqoop help
[root@gree139 conf]# sqoop help list-databases
Next, use Sqoop to connect to the database and list its databases (single-line and multi-line forms):
[root@gree139 ~]# sqoop list-databases --connect jdbc:mysql://gree139:3306 --username root --password root

[root@gree139 ~]# sqoop list-databases \
--connect jdbc:mysql://gree139:3306 \
--username root \
--password root
Use Sqoop to import table data from the database into HDFS; --m 1 means a single map task, which produces a single partition (output file):
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--delete-target-dir \
--target-dir /sqoop/demo1 \
--m 1
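To check the result, you can view the imported file directly on HDFS; with a single map task the output is typically a single part-m-00000 file:

hdfs dfs -cat /sqoop/demo1/part-m-00000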
If you set two map tasks and split the data by classId, there will be two output files:
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--delete-target-dir \
--target-dir /sqoop/demo2 \
--split-by classId \
--fields-terminated-by '\t' \
--m 2
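Listing the target directory should show one output file per map task (typically part-m-00000 and part-m-00001):

hdfs dfs -ls /sqoop/demo2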
Import data from MySQL into Hive (first create the Hive table, then import the data).
Create the table kb15.student_mysql in Hive with the same structure as the student table in MySQL:
# Two-step approach: create the Hive table first, then import into it
[root@gree139 ~]# sqoop create-hive-table \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--hive-table kb15.student_mysql

[root@gree139 ~]# sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--hive-table kb15.student_mysql \
--hive-import \
--m 1

# One-step version: creates the Hive table and imports in a single command
[root@gree139 ~]# sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--hive-import \
--hive-database kb15 \
--m 1
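A quick way to confirm the Hive import, assuming the hive CLI is on the PATH:

hive -e "select * from kb15.student_mysql limit 10;"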
To import with a WHERE condition:
[root@gree139 ~]# sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--where "telephone='1392323224'" \
--target-dir /sqoop/demo4 \
--delete-target-dir \
--m 1
Import the result of a free-form SQL query with --query:
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--query 'select id,name,age,gender,telephone,email,classId from student where id>3 and classId=1 and $CONDITIONS' \
--target-dir /sqoop/demo6 \
--delete-target-dir \
--m 1
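Note that when the --query string is wrapped in double quotes, $CONDITIONS must be escaped as \$CONDITIONS so the shell does not expand it, and with more than one map task --split-by is required. A sketch of a parallel variant (the target directory /sqoop/demo6b is just an illustrative name):

sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--query "select id,name,age,gender,telephone,email,classId from student where id>3 and \$CONDITIONS" \
--split-by id \
--target-dir /sqoop/demo6b \
--delete-target-dir \
--m 2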
Implement incremental import (append mode):
# Full import first
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--delete-target-dir \
--target-dir /sqoop/incre1 \
--m 1

-- In MySQL: add two new rows
insert into student(name,age,gender,telephone,email,classId) values
("Liu Yong",24,"male","13695847598","liuyong@qq.com",1),
("Liu Xiaoyong",2,"male","1360000000","liuxiaoyong@qq.com",1);

# Incremental import: only rows with id greater than the last value (9) are appended
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--target-dir /sqoop/incre1 \
--incremental append \
--check-column id \
--last-value 9 \
--m 1
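Instead of tracking --last-value by hand, the incremental import can be saved as a Sqoop job; Sqoop then records the last imported value after each run. A minimal sketch (the job name incre_student is arbitrary):

sqoop job --create incre_student -- import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student \
--target-dir /sqoop/incre1 \
--incremental append \
--check-column id \
--last-value 9 \
--m 1

sqoop job --exec incre_student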
Incremental import in lastmodified mode (used with --append or --merge-key):
-- In MySQL: create a table with an automatically updated timestamp column
create table student2(
id int,
name varchar(32),
last_mod timestamp default current_timestamp on update current_timestamp
);

insert into student2(id,name) values(1,'zhangxiaohua'),(2,'litiechui');
select * from student2;

# Full import first
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student2 \
--delete-target-dir \
--target-dir /sqoop/incre2 \
--m 1

-- In MySQL: add two new rows
insert into student2(id,name) values(3,'zhaodaqiang'),(4,'chenxiaowang');

# Incremental import in lastmodified mode: rows modified after the last value are appended
sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student2 \
--target-dir /sqoop/incre2 \
--incremental lastmodified \
--check-column last_mod \
--last-value "2021-11-23 11:56:25" \
--append \
--m 1
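In lastmodified mode, --merge-key can be used instead of --append: rows whose key already exists in the target directory are merged (updated in place) rather than written as duplicates. A sketch using the primary key id:

sqoop import \
--connect jdbc:mysql://gree139:3306/mybatisdb \
--username root \
--password root \
--table student2 \
--target-dir /sqoop/incre2 \
--incremental lastmodified \
--check-column last_mod \
--last-value "2021-11-23 11:56:25" \
--merge-key id \
--m 1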