DataX source code analysis - startup class analysis

Posted by supermars on Fri, 17 Dec 2021 23:10:52 +0100

Preface

This analysis is based on DataX version 3.0. Since plug-ins are an important part of DataX, plug-in source code will come up during the analysis. For consistency, the plug-in examples throughout use MySQL, which most people are familiar with.

The JSON job template I used is:

{
    "job":{
        "content":[
            {
                "reader":{
                    "name":"mysqlreader",
                    "parameter":{
                        "column":[
                            "id",
                            "name",
                            "age"
                        ],
                        "connection":[
                            {
                                "jdbcUrl":[
                                    "jdbc:mysql://127.0.0.1:3306/test"
                                ],
                                "table":[
                                    "t_datax_test"
                                ]
                            }
                        ],
                        "password":"11111111",
                        "username":"root"
                    }
                },
                "writer":{
                    "name":"mysqlwriter",
                    "parameter":{
                        "column":[
                            "id",
                            "name",
                            "age"
                        ],
                        "connection":[
                            {
                                "jdbcUrl":"jdbc:mysql://127.0.0.1:3306/test2",
                                "table":[
                                    "t_datax_test"
                                ]
                            }
                        ],
                        "password":"11111111",
                        "username":"root"
                    }
                }
            }
        ],
        "setting":{
            "speed":{
                "channel":"2"
            }
        }
    }
}

In addition, I added a lot of comments while reading the source code, so pay attention to the comments in the code I post. Some of the posted code may contain only the core fragments.

Start class analysis

The startup class of DataX is com.alibaba.datax.core.Engine; the DataX process is started through its main method.

public static void main(String[] args) throws Exception {
        int exitCode = 0;
        try {
            Engine.entry(args);
        } catch (Throwable e) {
            exitCode = 1;
            LOG.error("\n\nAccording to DataX's intelligent analysis, the most likely cause of this task's error is:\n" + ExceptionTracker.trace(e));
            ...

Continue to look at the entry method,

public static void entry(final String[] args) throws Throwable {
        Options options = new Options();
        options.addOption("job", true, "Job config.");
        options.addOption("jobid", true, "Job unique id.");
        options.addOption("mode", true, "Job runtime mode.");

        BasicParser parser = new BasicParser();
        CommandLine cl = parser.parse(options, args);

        //datax running directory/xxx.json
        String jobPath = cl.getOptionValue("job");

        // If the user does not explicitly specify a jobId, datax.py supplies one; the default is -1
        String jobIdString = cl.getOptionValue("jobid");
        RUNTIME_MODE = cl.getOptionValue("mode");

        Configuration configuration = ConfigParser.parse(jobPath);

        long jobId;
        if (!"-1".equalsIgnoreCase(jobIdString)) {
            jobId = Long.parseLong(jobIdString);
        } else {
            // only for dsc & ds & datax 3 update
            String dscJobUrlPatternString = "/instance/(\\d{1,})/config.xml";
            String dsJobUrlPatternString = "/inner/job/(\\d{1,})/config";
            String dsTaskGroupUrlPatternString = "/inner/job/(\\d{1,})/taskGroup/";
            List<String> patternStringList = Arrays.asList(dscJobUrlPatternString,
                    dsJobUrlPatternString, dsTaskGroupUrlPatternString);
            jobId = parseJobIdFromUrl(patternStringList, jobPath);
        }

        boolean isStandAloneMode = "standalone".equalsIgnoreCase(RUNTIME_MODE);
        if (!isStandAloneMode && jobId == -1) {
            // If not in standalone mode, jobId must not be -1
            throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "non-standalone mode must provide a valid jobId in the URL.");
        }
        configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, jobId);

        //Print vmInfo
        VMInfo vmInfo = VMInfo.getVmInfo();
        if (vmInfo != null) {
            LOG.info(vmInfo.toString());
        }

        LOG.info("\n" + Engine.filterJobConfiguration(configuration) + "\n");

        LOG.debug(configuration.toJSON());

        ConfigurationValidate.doValidate(configuration);
        Engine engine = new Engine();
        //After the configuration is initialized, entry creates an Engine instance and calls its start method
        engine.start(configuration);
    }

The first step is to parse the runtime arguments passed to DataX. For example, the run configuration I set in IDEA is:

-mode
standalone
-jobid
-1
-job
/Users/malu/Documents/code/idea_study/DataX/core/target/datax/job/mysql2mysql.json

Naturally, the value of jobPath is /Users/malu/Documents/code/idea_study/DataX/core/target/datax/job/mysql2mysql.json, the value of jobIdString is -1, and the value of RUNTIME_MODE is standalone.

Once the values of these key variables are clear, the rest of the flow is straightforward.
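When the job path comes from a scheduling system instead of a local file, the jobId is extracted from the URL via the regex patterns shown in the entry method. The body of parseJobIdFromUrl is not shown above, so the following is an assumed, simplified sketch of what it presumably does: try each pattern in turn and return the first captured numeric group, falling back to -1.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobIdParserSketch {
    // Assumed sketch of Engine.parseJobIdFromUrl: try each pattern against
    // the job path and return the first captured numeric group, or -1.
    public static long parseJobIdFromUrl(List<String> patterns, String path) {
        for (String p : patterns) {
            Matcher m = Pattern.compile(p).matcher(path);
            if (m.find()) {
                return Long.parseLong(m.group(1));
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList(
                "/instance/(\\d{1,})/config.xml",
                "/inner/job/(\\d{1,})/config",
                "/inner/job/(\\d{1,})/taskGroup/");
        // A dsc-style config URL yields its embedded jobId
        System.out.println(parseJobIdFromUrl(patterns, "/instance/1024/config.xml")); // 1024
        // A local json path matches no pattern, so the fallback -1 is returned
        System.out.println(parseJobIdFromUrl(patterns, "/tmp/mysql2mysql.json"));     // -1
    }
}
```

This also explains the check that follows: a local path produces jobId -1, which is only acceptable in standalone mode.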

Next, let's look at an important method, ConfigParser.parse. It returns an instance of the Configuration class, which is central to DataX: all configuration information is managed by it, so it plays the role of head housekeeper. I plan to write a dedicated introduction to this class later.

/**
     * Specify the Job Configuration path. ConfigParser will parse all the information of Job, Plugin and Core and return it in Configuration
     */
    public static Configuration parse(final String jobPath) {
        //First, parse the basic configuration from the job configuration file, including the reader and writer information, the number of channels, etc.
        Configuration configuration = ConfigParser.parseJobConfig(jobPath);

        //Merge in some of DataX's own configuration, mainly from the core.json file, such as speed-limit settings
        configuration.merge(
                ConfigParser.parseCoreConfig(CoreConstant.DATAX_CONF_PATH),
                false);
        // todo: optimize the config parsing to load only the required plugins
        //The name of the reader plugin; for MySQL it is mysqlreader
        String readerPluginName = configuration.getString(
                CoreConstant.DATAX_JOB_CONTENT_READER_NAME);
        //The name of the writer plugin; for MySQL it is mysqlwriter
        String writerPluginName = configuration.getString(
                CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME);

        String preHandlerName = configuration.getString(
                CoreConstant.DATAX_JOB_PREHANDLER_PLUGINNAME);

        String postHandlerName = configuration.getString(
                CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINNAME);

        Set<String> pluginList = new HashSet<String>();
        pluginList.add(readerPluginName);
        pluginList.add(writerPluginName);

        if(StringUtils.isNotEmpty(preHandlerName)) {
            pluginList.add(preHandlerName);
        }
        if(StringUtils.isNotEmpty(postHandlerName)) {
            pluginList.add(postHandlerName);
        }
        try {
            configuration.merge(parsePluginConfig(new ArrayList<String>(pluginList)), false);
            ...

The comments make the flow clear.
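Note the second argument of merge: passing false means keys already present in the job configuration are not overwritten by the framework defaults merged in from core.json. The real Configuration class works on JSON paths rather than flat maps, but the conflict semantics can be illustrated with a hypothetical Map-based sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigMergeSketch {
    // Hypothetical illustration of Configuration.merge(other, updateWhenConflict):
    // when updateWhenConflict is false, keys already present in the base (job)
    // config win over the incoming defaults.
    public static Map<String, Object> merge(Map<String, Object> base,
                                            Map<String, Object> other,
                                            boolean updateWhenConflict) {
        Map<String, Object> result = new HashMap<>(base);
        for (Map.Entry<String, Object> e : other.entrySet()) {
            if (updateWhenConflict || !result.containsKey(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

So a channel count set in the job file survives the merge, while settings the job file omits are filled in from core.json.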

VMInfo holds some information about the machine itself and is not shown here.

Then comes the filterJobConfiguration method,

public static String filterJobConfiguration(final Configuration configuration) {
        //clone, because it will be modified later
        Configuration jobConfWithSetting = configuration.getConfiguration("job").clone();

        Configuration jobContent = jobConfWithSetting.getConfiguration("content");

        //Filter sensitive information, such as password
        filterSensitiveConfiguration(jobContent);

        jobConfWithSetting.set("content",jobContent);

        //Format json string display
        return jobConfWithSetting.beautify();
    }
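The interesting part is filterSensitiveConfiguration, which masks sensitive values such as passwords before the configuration is written to the log. A rough, hypothetical sketch of the idea (the actual key list and masking in DataX may differ):

```java
import java.util.HashMap;
import java.util.Map;

public class SensitiveFilterSketch {
    // Hypothetical sketch of sensitive-value filtering: replace the values of
    // keys that look sensitive (e.g. "password", "accessKey") with asterisks
    // so they never appear in the log output.
    public static Map<String, String> filterSensitive(Map<String, String> conf) {
        Map<String, String> copy = new HashMap<>(conf);
        for (Map.Entry<String, String> e : copy.entrySet()) {
            String key = e.getKey().toLowerCase();
            if (key.contains("password") || key.contains("accesskey")) {
                e.setValue("*****");
            }
        }
        return copy;
    }
}
```

This is why the clone at the top of filterJobConfiguration matters: the masking mutates the content, and the original configuration must stay intact for the actual run.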

There is nothing special here; it is all basic operations. Next, enter the start method:

/* check job model (job/task) first */
    public void start(Configuration allConf) {

        // Bind column conversion information
        ColumnCast.bind(allConf);

        /**
         * Initialize PluginLoader to obtain various plug-in configurations
         */
        LoadUtil.bind(allConf);

        boolean isJob = !("taskGroup".equalsIgnoreCase(allConf
                .getString(CoreConstant.DATAX_CORE_CONTAINER_MODEL)));
        //JobContainer will set and adjust the value after schedule
        int channelNumber =0;
        AbstractContainer container;
        ...


        //perfTrace is on by default
        boolean traceEnable = allConf.getBool(CoreConstant.DATAX_CORE_CONTAINER_TRACE_ENABLE, true);
        boolean perfReportEnable = allConf.getBool(CoreConstant.DATAX_CORE_REPORT_DATAX_PERFLOG, true);

        //datax shell tasks in standalone mode are not reported
        if(instanceId == -1){
            perfReportEnable = false;
        }

        int priority = 0;
        try {
            priority = Integer.parseInt(System.getenv("SKYNET_PRIORITY"));
        }catch (NumberFormatException e){
            LOG.warn("prioriy set to 0, because NumberFormatException, the value is: "+System.getProperty("PROIORY"));
        }

        Configuration jobInfoConfig = allConf.getConfiguration(CoreConstant.DATAX_JOB_JOBINFO);
        //Initialize PerfTrace
        PerfTrace perfTrace = PerfTrace.getInstance(isJob, instanceId, taskGroupId, priority, traceEnable);
        perfTrace.setJobInfo(jobInfoConfig,perfReportEnable,channelNumber);
        /**
         * There are two implementations: JobContainer and TaskGroupContainer
         * From the perspective of configuration, it is basically JobContainer, so we mainly analyze it
         */
        container.start();

    }

The comments are also clear. The key point here is the PerfTrace class, which tracks performance: DataX records metrics during task execution, such as how much data was transferred and how long it took. Below is an example, part of what DataX prints after executing a task:

2021-11-28 09:11:32.532 [job-0] INFO  StandAloneJobContainerCommunicator - Total 5 records, 39 bytes | Speed 3B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
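The Speed field in that line is essentially the accumulated counters divided by the elapsed time. A hypothetical sketch of the arithmetic (the method name, integer rounding, and formatting are assumptions, not DataX's actual code):

```java
public class SpeedLineSketch {
    // Hypothetical illustration of how a "Speed xB/s, y records/s" figure
    // could be derived from accumulated counters and elapsed seconds,
    // using integer division as the log line's whole numbers suggest.
    public static String speedLine(long bytes, long records, long elapsedSeconds) {
        long bps = elapsedSeconds == 0 ? 0 : bytes / elapsedSeconds;
        long rps = elapsedSeconds == 0 ? 0 : records / elapsedSeconds;
        return String.format("Speed %dB/s, %d records/s", bps, rps);
    }
}
```

For example, 39 bytes and 5 records over 13 seconds gives "Speed 3B/s, 0 records/s", matching the log line above.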

container.start() enters the JobContainer, which will be discussed in the next article.