Preface

This analysis is based on DataX 3.0. Because plug-ins are an essential part of DataX, plug-in source code will come up repeatedly during the analysis; for consistency, all plug-in examples use MySQL, which most readers are familiar with.

The job template JSON file I used is:
```json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["id", "name", "age"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/test"],
                                "table": ["t_datax_test"]
                            }
                        ],
                        "password": "11111111",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["id", "name", "age"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/test2",
                                "table": ["t_datax_test"]
                            }
                        ],
                        "password": "11111111",
                        "username": "root"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "2"
            }
        }
    }
}
```
In addition, I added a lot of comments while reading the source code, so pay attention to the comments in the code I post. Some listings contain only the core fragments.

Startup class analysis

The startup class of DataX is com.alibaba.datax.core.Engine; the DataX process is started through its main method.
```java
public static void main(String[] args) throws Exception {
    int exitCode = 0;
    try {
        Engine.entry(args);
    } catch (Throwable e) {
        exitCode = 1;
        LOG.error("\n\nAccording to DataX's intelligent analysis, the most likely cause of this task's error is:\n"
                + ExceptionTracker.trace(e));
        ...
```
Next, look at the entry method:
```java
public static void entry(final String[] args) throws Throwable {
    Options options = new Options();
    options.addOption("job", true, "Job config.");
    options.addOption("jobid", true, "Job unique id.");
    options.addOption("mode", true, "Job runtime mode.");

    BasicParser parser = new BasicParser();
    CommandLine cl = parser.parse(options, args);
    // Path to the job JSON under the DataX running directory, e.g. .../job/xxx.json
    String jobPath = cl.getOptionValue("job");

    // If the user does not explicitly specify a jobid, datax.py passes the default value -1
    String jobIdString = cl.getOptionValue("jobid");
    RUNTIME_MODE = cl.getOptionValue("mode");

    Configuration configuration = ConfigParser.parse(jobPath);

    long jobId;
    if (!"-1".equalsIgnoreCase(jobIdString)) {
        jobId = Long.parseLong(jobIdString);
    } else {
        // only for dsc & ds & datax 3 update
        String dscJobUrlPatternString = "/instance/(\\d{1,})/config.xml";
        String dsJobUrlPatternString = "/inner/job/(\\d{1,})/config";
        String dsTaskGroupUrlPatternString = "/inner/job/(\\d{1,})/taskGroup/";
        List<String> patternStringList = Arrays.asList(dscJobUrlPatternString,
                dsJobUrlPatternString, dsTaskGroupUrlPatternString);
        jobId = parseJobIdFromUrl(patternStringList, jobPath);
    }

    boolean isStandAloneMode = "standalone".equalsIgnoreCase(RUNTIME_MODE);
    if (!isStandAloneMode && jobId == -1) {
        // Outside standalone mode, jobId must not be -1
        throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR,
                "In non-standalone mode, a valid jobId must be provided in the URL.");
    }

    configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, jobId);

    // Print vmInfo
    VMInfo vmInfo = VMInfo.getVmInfo();
    if (vmInfo != null) {
        LOG.info(vmInfo.toString());
    }

    LOG.info("\n" + Engine.filterJobConfiguration(configuration) + "\n");
    LOG.debug(configuration.toJSON());

    ConfigurationValidate.doValidate(configuration);
    Engine engine = new Engine();
    // After the configuration is initialized, an Engine is instantiated and its start method is called
    engine.start(configuration);
}
```
The first step is parsing the command-line arguments DataX was run with. For example, the run configuration I set up in IDEA is:
-mode standalone -jobid -1 -job /Users/malu/Documents/code/idea_study/DataX/core/target/datax/job/mysql2mysql.json
So jobPath is /Users/malu/Documents/code/idea_study/DataX/core/target/datax/job/mysql2mysql.json, jobIdString is -1, and RUNTIME_MODE is standalone.

Once these key variables are established, the rest of the flow is straightforward.
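To make the argument handling concrete, here is a minimal sketch of parsing "-key value" pairs from an argument array. It uses plain Java instead of the Apache Commons CLI classes (Options/BasicParser) that DataX actually uses, so the class and method names below are my own, not DataX's:

```java
import java.util.HashMap;
import java.util.Map;

public class ArgParseSketch {
    // Walk the argument array in pairs; each "-name" is followed by its value,
    // which is how DataX is invoked: -mode standalone -jobid -1 -job <path>
    public static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (args[i].startsWith("-")) {
                opts.put(args[i].substring(1), args[i + 1]);
            }
        }
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse(new String[]{
                "-mode", "standalone", "-jobid", "-1",
                "-job", "/path/to/mysql2mysql.json"});
        System.out.println(opts.get("mode"));  // standalone
        System.out.println(opts.get("jobid")); // -1
    }
}
```

Note that this toy parser does not distinguish an option's value from a following option, which commons-cli handles properly; it is only meant to show where jobPath, jobIdString and RUNTIME_MODE come from.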
Next, let's look at an important method, ConfigParser.parse. It returns an instance of the Configuration class, which is central to DataX: all configuration information is managed through it, making it something of a head butler. I plan to write a separate introduction to this class later.
```java
/**
 * Given the job configuration path, ConfigParser parses all Job, Plugin and
 * Core information and returns it as a Configuration
 */
public static Configuration parse(final String jobPath) {
    // First parse the basic configuration from the task config file,
    // including the reader/writer settings, the number of channels, etc.
    Configuration configuration = ConfigParser.parseJobConfig(jobPath);

    // Merge in DataX's own configuration, mainly from core.json,
    // e.g. the speed-limit settings
    configuration.merge(
            ConfigParser.parseCoreConfig(CoreConstant.DATAX_CONF_PATH),
            false);
    // TODO: optimize the config to load only the required plugins

    // Name of the reader plugin; for MySQL this is mysqlreader
    String readerPluginName = configuration.getString(
            CoreConstant.DATAX_JOB_CONTENT_READER_NAME);
    // Name of the writer plugin; for MySQL this is mysqlwriter
    String writerPluginName = configuration.getString(
            CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME);

    String preHandlerName = configuration.getString(
            CoreConstant.DATAX_JOB_PREHANDLER_PLUGINNAME);
    String postHandlerName = configuration.getString(
            CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINNAME);

    Set<String> pluginList = new HashSet<String>();
    pluginList.add(readerPluginName);
    pluginList.add(writerPluginName);
    if (StringUtils.isNotEmpty(preHandlerName)) {
        pluginList.add(preHandlerName);
    }
    if (StringUtils.isNotEmpty(postHandlerName)) {
        pluginList.add(postHandlerName);
    }

    try {
        configuration.merge(parsePluginConfig(new ArrayList<String>(pluginList)), false);
        ...
```
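The getString calls above take dotted paths such as job.content[0].reader.name and resolve them against the parsed JSON tree. Here is a toy sketch of that kind of path lookup over nested Maps and Lists; it is my own illustration of the idea, not DataX's actual Configuration implementation:

```java
import java.util.List;
import java.util.Map;

public class PathLookupSketch {
    // Resolve a dotted path like "job.content[0].reader.name" against
    // nested Maps/Lists, splitting on '.' and handling an optional [i] index.
    @SuppressWarnings("unchecked")
    public static Object get(Object root, String path) {
        Object cur = root;
        for (String part : path.split("\\.")) {
            int bracket = part.indexOf('[');
            String key = bracket >= 0 ? part.substring(0, bracket) : part;
            if (!key.isEmpty()) {
                cur = ((Map<String, Object>) cur).get(key);
            }
            if (bracket >= 0) {
                int idx = Integer.parseInt(part.substring(bracket + 1, part.length() - 1));
                cur = ((List<Object>) cur).get(idx);
            }
        }
        return cur;
    }
}
```

Against the job template at the top of this article, get(root, "job.content[0].reader.name") would resolve to "mysqlreader", which is exactly the value parse uses to decide which plugin configs to load.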
The comments make this clear.

VMInfo holds some information about the machine itself and is not shown here.

Next comes the filterJobConfiguration method:
```java
public static String filterJobConfiguration(final Configuration configuration) {
    // Clone, because it will be modified below
    Configuration jobConfWithSetting = configuration.getConfiguration("job").clone();
    Configuration jobContent = jobConfWithSetting.getConfiguration("content");
    // Filter out sensitive information, such as passwords
    filterSensitiveConfiguration(jobContent);
    jobConfWithSetting.set("content", jobContent);
    // Pretty-print as a JSON string
    return jobConfWithSetting.beautify();
}
```
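To illustrate what the sensitive-information filtering amounts to, here is a minimal sketch that recursively masks values stored under a "password" key in a JSON-like tree. The key name and mask string are my own choices for illustration; the real filterSensitiveConfiguration works on DataX's Configuration objects and its own list of sensitive keys:

```java
import java.util.List;
import java.util.Map;

public class MaskSketch {
    // Walk a tree of Maps and Lists; wherever a key named "password"
    // (case-insensitive) is found, replace its value with asterisks.
    @SuppressWarnings("unchecked")
    public static void mask(Object node) {
        if (node instanceof Map) {
            for (Map.Entry<String, Object> e : ((Map<String, Object>) node).entrySet()) {
                if ("password".equalsIgnoreCase(e.getKey())) {
                    e.setValue("*****");
                } else {
                    mask(e.getValue()); // recurse into nested objects/arrays
                }
            }
        } else if (node instanceof List) {
            for (Object item : (List<Object>) node) {
                mask(item);
            }
        }
    }
}
```

Applied to the job template above, this would turn both reader and writer "password": "11111111" entries into "*****" before the configuration is logged.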
Nothing special here; it is all basic operations. Then we enter the start method:
```java
/* check job model (job/task) first */
public void start(Configuration allConf) {
    // Bind column conversion information
    ColumnCast.bind(allConf);

    /**
     * Initialize the PluginLoader so that plugin configurations can be obtained
     */
    LoadUtil.bind(allConf);

    boolean isJob = !("taskGroup".equalsIgnoreCase(allConf
            .getString(CoreConstant.DATAX_CORE_CONTAINER_MODEL)));
    // JobContainer will set and adjust this value after schedule
    int channelNumber = 0;
    AbstractContainer container;
    ...

    // perfTrace is on by default
    boolean traceEnable = allConf.getBool(CoreConstant.DATAX_CORE_CONTAINER_TRACE_ENABLE, true);
    boolean perfReportEnable = allConf.getBool(CoreConstant.DATAX_CORE_REPORT_DATAX_PERFLOG, true);

    // DataX shell tasks in standalone mode are not reported
    if (instanceId == -1) {
        perfReportEnable = false;
    }

    int priority = 0;
    try {
        priority = Integer.parseInt(System.getenv("SKYNET_PRIORITY"));
    } catch (NumberFormatException e) {
        LOG.warn("prioriy set to 0, because NumberFormatException, the value is: "
                + System.getProperty("PROIORY"));
    }

    Configuration jobInfoConfig = allConf.getConfiguration(CoreConstant.DATAX_JOB_JOBINFO);
    // Initialize PerfTrace
    PerfTrace perfTrace = PerfTrace.getInstance(isJob, instanceId, taskGroupId, priority, traceEnable);
    perfTrace.setJobInfo(jobInfoConfig, perfReportEnable, channelNumber);

    /**
     * There are two implementations: JobContainer and TaskGroupContainer.
     * Judging from the configuration, it is almost always JobContainer,
     * so that is what we mainly analyze
     */
    container.start();
}
```
The comments are also clear. The key class here is PerfTrace, which tracks performance: DataX records metrics during task execution, such as how much data was transferred and how much time was spent. Here is an example, part of DataX's output after a task finishes:

```
2021-11-28 09:11:32.532 [job-0] INFO StandAloneJobContainerCommunicator - Total 5 records, 39 bytes | Speed 3B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
```
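As a rough illustration of where those rates come from (the arithmetic below and the elapsed time of about 13 seconds are my own assumptions, not taken from DataX's source), the Speed figures are the running totals divided by the elapsed seconds, truncated to whole units:

```java
public class SpeedSketch {
    // Bytes per second as an integer rate: 39 bytes over 13s -> 3 B/s
    public static long bytesPerSecond(long totalBytes, long elapsedSeconds) {
        return elapsedSeconds == 0 ? 0 : totalBytes / elapsedSeconds;
    }

    // Records per second as an integer rate: 5 records over 13s -> 0 records/s
    public static long recordsPerSecond(long totalRecords, long elapsedSeconds) {
        return elapsedSeconds == 0 ? 0 : totalRecords / elapsedSeconds;
    }
}
```

With those assumed inputs, the sketch reproduces the "Speed 3B/s, 0 records/s" numbers in the log line above.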
container.start() takes us into JobContainer, which will be discussed in the next article.