A Method of Generating Hive DDL Using Sqoop API

Posted by JayBachatero on Thu, 16 May 2019 20:12:19 +0200

First, create the Hive table.

DDL:

create table testtb (id int,content string, country string) row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;

Importing data

1. Importing with the sqoop shell command

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password hehe --table testtb --fields-terminated-by '|' --lines-terminated-by '\n' --hive-delims-replacement " " -m 1 --target-dir /user/hive/warehouse/test.db/testtb/ --append

(This is a stand-alone setup, so the MySQL connection URL uses localhost.)

2. Importing with the Sqoop API

import com.cloudera.sqoop.SqoopOptions;
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;
import org.apache.sqoop.hive.TableDefWriter;
import org.apache.sqoop.manager.ConnManager;
import org.apache.sqoop.tool.SqoopTool;
import org.apache.sqoop.util.OptionsFileUtil;

import java.lang.reflect.Method;

public class Test {
    public static void import2HDFS() throws Exception {
        String[] args = new String[]{
                "--connect", "jdbc:mysql://localhost:3306/test",
                "--driver", "com.mysql.jdbc.Driver",
                "--username", "root",
                "--password", "hehe",
                "--table", "testtb",
                "--fields-terminated-by", "|",
                "--lines-terminated-by", "\n",
                "--hive-delims-replacement", " ",
                "-m", "1",
                "--target-dir", "/user/hive/warehouse/test.db/testtb/",
                "--append"
        };
        String[] expandArguments = OptionsFileUtil.expandArguments(args);
        SqoopTool tool = SqoopTool.getTool("import");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.0.110:9000");
        Configuration loadPlugins = SqoopTool.loadPlugins(conf);
        Sqoop sqoop = new Sqoop((com.cloudera.sqoop.tool.SqoopTool) tool, loadPlugins);
        System.out.println(Sqoop.runSqoop(sqoop, expandArguments));
    }

}

The three lines of code to trace and debug are:

  • SqoopTool tool = SqoopTool.getTool("import");

    Retrieves one of the tool instances registered in SqoopTool's static initializer block; for "import" this returns an instance of org.apache.sqoop.tool.ImportTool.

  • Sqoop sqoop = new Sqoop((com.cloudera.sqoop.tool.SqoopTool) tool, loadPlugins);

    Sqoop implements the org.apache.hadoop.util.Tool interface. In the constructor, its member variable SqoopTool tool is set to the ImportTool instance obtained above.

  • Sqoop.runSqoop(sqoop, expandArguments);

    (1) The run method of org.apache.hadoop.util.ToolRunner is called, which parses the generic Hadoop parameters.

    public static int run(Configuration conf, Tool tool, String[] args) throws Exception{
        if(conf == null) {
            conf = new Configuration();
        }
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        //set the configuration back, so that Tool can configure itself
        tool.setConf(conf);
        
        //get the args w/o generic hadoop args
        String[] toolArgs = parser.getRemainingArgs();
        return tool.run(toolArgs);
    }
    

    The tool argument here is the Sqoop instance itself.

    (2) run method of Sqoop:

    public int run(String[] args) {
        if (this.options.getConf() == null) {
            this.options.setConf(this.getConf());
        }
    
        try {
            this.options = this.tool.parseArguments(args, (Configuration)null, this.options, false);
            this.tool.appendArgs(this.childPrgmArgs);
            this.tool.validateOptions(this.options);
        } catch (Exception var3) {
            LOG.debug(var3.getMessage(), var3);
            System.err.println(var3.getMessage());
            return 1;
        }
    
        return this.tool.run(this.options);
    }
    

    Here this.tool is the Sqoop instance's member variable of type com.cloudera.sqoop.tool.SqoopTool, which in this case is actually an org.apache.sqoop.tool.ImportTool object.
    (The inheritance chain: the parent class of com.cloudera.sqoop.tool.SqoopTool is org.apache.sqoop.tool.SqoopTool; the parent class of ImportTool is com.cloudera.sqoop.tool.BaseSqoopTool, whose parent is org.apache.sqoop.tool.BaseSqoopTool, whose parent is com.cloudera.sqoop.tool.SqoopTool.)
    Here this.tool.parseArguments calls the method defined in org.apache.sqoop.tool.SqoopTool, parses the arguments into a SqoopOptions object, and that object is then passed as the parameter to ImportTool's run method.
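    The two-level delegation above (ToolRunner.run calls Sqoop.run, which calls ImportTool.run) can be sketched with self-contained stand-in types; all class names below are made up for illustration and are not part of Sqoop or Hadoop:

```java
public class DelegationDemo {
    // Stand-in for org.apache.hadoop.util.Tool
    interface Tool { int run(String[] args); }

    // Stand-in for ImportTool: the concrete tool that does the work
    static class ImportLikeTool implements Tool {
        public int run(String[] args) {
            System.out.println("importing " + args[0]);
            return 0;
        }
    }

    // Stand-in for Sqoop: wraps a concrete tool and delegates run() to it
    static class SqoopLike implements Tool {
        private final Tool tool;
        SqoopLike(Tool tool) { this.tool = tool; }
        public int run(String[] args) {
            // the real Sqoop.run parses and validates the arguments here
            return tool.run(args);
        }
    }

    // Stand-in for ToolRunner.run: the outermost entry point
    static int runTool(Tool tool, String[] args) {
        // the real ToolRunner strips generic Hadoop options here
        return tool.run(args);
    }

    public static void main(String[] args) {
        int rc = runTool(new SqoopLike(new ImportLikeTool()), new String[]{"testtb"});
        System.out.println(rc);  // prints 0
    }
}
```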

    (3) The run method of ImportTool,

    public int run(SqoopOptions options) {
        HiveImport hiveImport = null;
        if (this.allTables) {
            LOG.error("ImportTool.run() can only handle a single table.");
            return 1;
        } else if (!this.init(options)) {
            return 1;
        } else {
            this.codeGenerator.setManager(this.manager);
    
            byte var4;
            try {
                if (options.doHiveImport()) {
                    hiveImport = new HiveImport(options, this.manager, options.getConf(), false);
                }
    
                this.importTable(options, options.getTableName(), hiveImport);
                return 0;
            } catch (...) {
                ...
            }
            ...
        }
    }
    

    It can be seen from the code that if the Hive import option is set, a HiveImport object is created and passed into the importTable method. In ImportTool's importTable method, hiveImport.importTable is called inside the if branch that checks options.doHiveImport(). Inside HiveImport.importTable we can see the target code that produces the DDL:

    TableDefWriter tableWriter = new TableDefWriter(this.options, this.connManager, inputTableName, outputTableName, this.configuration, !debugMode);
    String createTableStr = tableWriter.getCreateTableStmt() + ";\n";
    

    As long as you can obtain these three objects: this.options, this.connManager, and this.configuration, you can use the TableDefWriter class to generate the DDL statement.

    • options: generated in Sqoop's run method by the parseArguments method of org.apache.sqoop.tool.SqoopTool.
    • connManager: passed into the HiveImport constructor when the HiveImport instance is created in ImportTool's run method. It is ImportTool's member variable manager, inherited from com.cloudera.sqoop.tool.BaseSqoopTool, whose parent org.apache.sqoop.tool.BaseSqoopTool defines both connManager and the getManager() accessor. ImportTool's run method calls its init method, which in turn calls the init method of org.apache.sqoop.tool.BaseSqoopTool; there the manager is created by JobData data = new JobData(sqoopOpts, this); this.manager = new ConnFactory(sqoopOpts.getConf()).getManager(data); so once ImportTool's init method has been invoked, the manager member variable is populated and can be retrieved.
    • configuration: an object of the org.apache.hadoop.conf.Configuration class; just set the NameNode address (fs.defaultFS).
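The reflection trick used in the next section (calling a protected init method via setAccessible and then reading the populated member through an inherited accessor) can be illustrated with a self-contained toy class; the class names here are made up for illustration:

```java
import java.lang.reflect.Method;

public class ReflectDemo {
    // Toy stand-ins for the Sqoop tool classes; names are hypothetical
    static class BaseTool {
        protected String manager;
        public String getManager() { return manager; }
    }

    static class ImportLikeTool extends BaseTool {
        // protected init populates the member, like BaseSqoopTool.init
        protected boolean init(String opts) {
            this.manager = "manager-for-" + opts;
            return true;
        }
    }

    public static String run() throws Exception {
        BaseTool tool = new ImportLikeTool();
        // Locate the protected init method on the concrete class
        Method init = tool.getClass().getDeclaredMethod("init", String.class);
        init.setAccessible(true);  // bypass protected access
        if (!(boolean) init.invoke(tool, "testtb")) {
            throw new IllegalStateException("initialization failed");
        }
        // Read the populated member through the inherited accessor
        Method getManager = tool.getClass().getSuperclass().getMethod("getManager");
        return (String) getManager.invoke(tool);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());  // prints "manager-for-testtb"
    }
}
```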

3. A Method of Using Sqoop API to Generate Hive DDL

    public String test1() throws Exception {
        String[] args = new String[]{
                "--connect", "jdbc:mysql://localhost:3306/test",
                "--driver", "com.mysql.jdbc.Driver",
                "--username", "root",
                "--password", "hehe",
                "--table", "testtb",
                "--fields-terminated-by", "|",
                "--lines-terminated-by", "\n",
                "--hive-delims-replacement", " ",
                "-m", "1",
                "--target-dir", "/user/hive/warehouse/test.db/testtb/",
                "--append"
        };
        String[] expandArguments = OptionsFileUtil.expandArguments(args);
        SqoopTool tool = SqoopTool.getTool("import");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.0.110:9000");
        Configuration loadPlugins = SqoopTool.loadPlugins(conf);
        Sqoop sqoop = new Sqoop((com.cloudera.sqoop.tool.SqoopTool) tool, loadPlugins);
        // Get the getManager method of org.apache.sqoop.tool.BaseSqoopTool by reflection
        Method getManager = sqoop.getTool().getClass().getSuperclass().getSuperclass().getMethod("getManager");
        // Get the init method of ImportTool by reflection
        Method importToolInit = sqoop.getTool().getClass().getDeclaredMethod("init",
                new Class[]{com.cloudera.sqoop.SqoopOptions.class});
        // init has protected access, so make it accessible
        importToolInit.setAccessible(true);
        // Get options
        SqoopOptions options = sqoop.getTool().parseArguments(args, conf, sqoop.getOptions(), false);
        // Call the init method of ImportTool to instantiate its manager member variable
        if (!(boolean) importToolInit.invoke(sqoop.getTool(), new Object[]{options})) {
            System.out.println("initialization failed");
            System.exit(1);
        }
        // Call the getManager method of org.apache.sqoop.tool.BaseSqoopTool to get manager
        ConnManager manager = (ConnManager) getManager.invoke(sqoop.getTool());
        // Get the table-building statement
        TableDefWriter tableWriter = new TableDefWriter(sqoop.getOptions()
                , (com.cloudera.sqoop.manager.ConnManager) manager
                , "testtb", "testtb", conf, false);
        return tableWriter.getCreateTableStmt();
    }

Log and output:

19:41:52.778 [main] DEBUG org.apache.sqoop.SqoopOptions - Generated nonce dir: /tmp/sqoop-root/compile/dccf8b05825988dc41eda7a9ac8e040e
19:41:52.810 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.7
19:41:52.825 [main] WARN org.apache.sqoop.tool.BaseSqoopTool - Setting your password on the command-line is insecure. Consider using -P instead.
19:41:52.911 [main] DEBUG org.apache.hadoop.util.Shell - setsid exited with exit code 0
19:41:52.938 [main] DEBUG org.apache.sqoop.ConnFactory - Loaded manager factory: org.apache.sqoop.manager.oracle.OraOopManagerFactory
19:41:52.947 [main] DEBUG org.apache.sqoop.ConnFactory - Loaded manager factory: com.cloudera.sqoop.manager.DefaultManagerFactory
19:41:52.948 [main] DEBUG org.apache.sqoop.ConnFactory - Trying ManagerFactory: org.apache.sqoop.manager.oracle.OraOopManagerFactory
19:41:52.993 [main] DEBUG org.apache.sqoop.manager.oracle.OraOopUtilities - Enabled OraOop debug logging.
19:41:52.994 [main] DEBUG org.apache.sqoop.manager.oracle.OraOopManagerFactory - Data Connector for Oracle and Hadoop can be called by Sqoop!
19:41:52.994 [main] DEBUG org.apache.sqoop.ConnFactory - Trying ManagerFactory: com.cloudera.sqoop.manager.DefaultManagerFactory
19:41:52.995 [main] DEBUG org.apache.sqoop.manager.DefaultManagerFactory - Trying with scheme: jdbc:mysql:
19:41:53.008 [main] INFO org.apache.sqoop.manager.MySQLManager - Preparing to use a MySQL streaming resultset.
19:41:53.019 [main] DEBUG org.apache.sqoop.ConnFactory - Instantiated ConnManager org.apache.sqoop.manager.MySQLManager@6c9f5c0d
19:41:53.023 [main] DEBUG org.apache.sqoop.manager.SqlManager - Execute getColumnInfoRawQuery : SELECT t.* FROM `testtb` AS t LIMIT 1
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
19:41:53.029 [main] DEBUG org.apache.sqoop.manager.SqlManager - No connection paramenters specified. Using regular API for making connection.
19:41:53.400 [main] DEBUG org.apache.sqoop.manager.SqlManager - Using fetchSize for next query: -2147483648
19:41:53.400 [main] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM `testtb` AS t LIMIT 1
19:41:53.422 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column id of type [4, 10, 0]
19:41:53.423 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column content of type [-1, 21845, 0]
19:41:53.423 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column country of type [12, 255, 0]
19:41:53.426 [main] DEBUG org.apache.sqoop.manager.SqlManager - Using fetchSize for next query: -2147483648
19:41:53.426 [main] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM `testtb` AS t LIMIT 1
19:41:53.427 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column id of type INT
19:41:53.427 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column content of type TEXT
19:41:53.427 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column country of type VARCHAR
19:41:53.429 [main] DEBUG org.apache.sqoop.hive.TableDefWriter - Create statement: CREATE TABLE IF NOT EXISTS `testtb` ( `id` INT, `content` STRING, `country` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\174' LINES TERMINATED BY '\012' STORED AS TEXTFILE



The returned string:

CREATE TABLE IF NOT EXISTS `testtb` ( `id` INT, `content` STRING, `country` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\174' LINES TERMINATED BY '\012' STORED AS TEXTFILE
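Note that TableDefWriter emits the delimiters as octal character escapes: '\174' is '|' and '\012' is '\n'. A quick self-contained check:

```java
public class OctalCheck {
    public static void main(String[] args) {
        // Hive DDL octal escapes map back to the original delimiters
        System.out.println((char) 0174 == '|');   // prints true: octal 174 = decimal 124 = '|'
        System.out.println((char) 012 == '\n');   // prints true: octal 12  = decimal 10  = '\n'
    }
}
```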

The CREATE TABLE statement returned by test1 is the generated DDL. In this way the table statement can be obtained through the Sqoop API and then processed further, for example to create an external table or to change the storage format, without hand-writing a lot of code.
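As a sketch of such post-processing (plain string manipulation on the returned DDL; this rewrite rule is an illustrative assumption, not part of the Sqoop API):

```java
public class DdlPostProcess {
    // Rewrite a generated managed-table DDL into an external-table DDL
    // with an explicit LOCATION clause.
    public static String toExternal(String ddl, String location) {
        String external = ddl.replaceFirst("CREATE TABLE", "CREATE EXTERNAL TABLE");
        return external + " LOCATION '" + location + "'";
    }

    public static void main(String[] args) {
        // Sample input: the statement produced by TableDefWriter above
        String ddl = "CREATE TABLE IF NOT EXISTS `testtb` ( `id` INT, `content` STRING,"
                + " `country` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\174'"
                + " LINES TERMINATED BY '\\012' STORED AS TEXTFILE";
        System.out.println(toExternal(ddl, "/user/hive/warehouse/test.db/testtb"));
    }
}
```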

Topics: Apache MySQL hive JDBC