Background
The project has to process a large number of files, some of which are tens of gigabytes in size. Files of that size are handled by a dedicated Spark program, so to keep the processing unified, the application needs to launch Spark jobs from code to handle the large files.
Implementation scheme
After some investigation, the SparkLauncher class shipped with Spark turned out to be a good fit for submitting Spark jobs from code. The class has quite a few parameters that need attention. Based on what was verified in the project, this article gives a fairly complete usage example with explanations.
First, add the dependency to the project's pom, specifying the version that matches your Spark cluster:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-launcher_2.11</artifactId>
</dependency>
Next, the parameters of the Spark job itself can be kept in a configuration file so that they can be changed without recompiling. The environment here is a CDH cluster with Kerberos authentication enabled, and jobs are submitted in yarn-client mode; the whole application runs on a CDH client (gateway) node. The configuration below is what is mainly used. The paths are only examples and should be replaced with values from your own environment. Each item carries a comment describing it:
# Parameters used by the Spark application
# Directory for the driver log output
driverLogDir=/root/test/logs/
# Keytab file for Kerberos authentication
keytab=/root/test/dw_hbkal.keytab
# Principal for Kerberos authentication
principal=dw_hbkal
# Run the Spark job on a YARN cluster
master=yarn
# yarn-client mode
deployMode=client
# Number of Spark executors and their memory
minExecutors=16
maxExecutors=16
executorMemory=1g
# Driver memory
driverMemory=256M
# Number of cores per Spark executor
executorCores=2
# Main class of the Spark job
mainClass=com.unionpay.css.fcmp.compare.cp.spark.nonprikey.FileCompare
# Jar of the Spark job
jarPath=/root/test/my-spark-job-1.0-SNAPSHOT.jar
# Third-party jars the Spark job depends on
extjars=/root/test/mysql-connector-java-8.0.27.jar,/root/test/jedis-2.8.1.jar
# Cluster configuration files stored on the CDH client node; they determine which cluster the Spark job is submitted to
HADOOP_CONF_DIR=/root/CDH/bjc/CDH/etc/conf/hadoop-conf
JAVA_HOME=/usr/java/jdk1.8.0_141
SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
# YARN queue the Spark job runs in
yarnQueue=mysparkqueue
This configuration can be read in code and combined with SparkLauncher, as in the following example:
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.util.HashMap;
import java.util.concurrent.CountDownLatch;

// Responsible for launching the Spark job
public class SparkJobService {
    private static final Logger logger = LoggerFactory.getLogger(SparkJobService.class);

    static Config config;
    // Spark job parameters
    static String keytabPath;
    static String principal;
    static String master;
    static String deployMode;
    static String minExecutors;
    static String maxExecutors;
    static String executorMemory;
    static String driverMemory;
    static String executorCores;
    static String mainClass;
    static String jarPath;
    static String extjars;
    static String yarnQueue;
    static String HADOOP_CONF_DIR;
    static String JAVA_HOME;
    static String SPARK_HOME;
    static String driverLogDir;

    static {
        config = new Config("job.properties");
        keytabPath = config.getString("keytab");
        principal = config.getString("principal");
        master = config.getString("master");
        deployMode = config.getString("deployMode");
        minExecutors = config.getString("minExecutors");
        maxExecutors = config.getString("maxExecutors");
        executorMemory = config.getString("executorMemory");
        driverMemory = config.getString("driverMemory");
        executorCores = config.getString("executorCores");
        mainClass = config.getString("mainClass");
        jarPath = config.getString("jarPath");
        extjars = config.getString("extjars");
        yarnQueue = config.getString("yarnQueue");
        HADOOP_CONF_DIR = config.getString("HADOOP_CONF_DIR");
        JAVA_HOME = config.getString("JAVA_HOME");
        SPARK_HOME = config.getString("SPARK_HOME");
        driverLogDir = config.getString("driverLogDir");
    }

    public static void main(String[] args) {
        try {
            // Environment for the launcher; can be omitted if these are already set as system environment variables
            HashMap<String, String> env = new HashMap<>();
            env.put("HADOOP_CONF_DIR", HADOOP_CONF_DIR);
            env.put("JAVA_HOME", JAVA_HOME);
            env.put("SPARK_HOME", SPARK_HOME);

            // Arguments passed to the Spark application itself
            String jobArgs1 = "test1";
            String jobArgs2 = "test2";
            //......

            SparkLauncher launcher = new SparkLauncher(env)
                    .addSparkArg("--keytab", keytabPath)
                    .addSparkArg("--principal", principal)
                    .setMaster(master)
                    .setDeployMode(deployMode)
                    .setConf("spark.dynamicAllocation.minExecutors", minExecutors)
                    .setConf("spark.dynamicAllocation.maxExecutors", maxExecutors)
                    .setConf("spark.driver.memory", driverMemory)
                    .setConf("spark.executor.memory", executorMemory)
                    .setConf("spark.executor.cores", executorCores)
                    .setConf("spark.yarn.queue", yarnQueue)
                    .setAppResource(jarPath)
                    .setMainClass(mainClass)
                    .addAppArgs(jobArgs1, jobArgs2);

            // Jars the Spark job depends on, such as the MySQL connector jar...
            for (String jarName : extjars.split(",")) {
                launcher.addJar(jarName);
            }
            launcher.setAppName("SparkJob");

            // Local log of the Spark driver
            launcher.redirectError(new File(driverLogDir + "spark_driver.log"));

            final String[] jobId = new String[]{""};
            // Used to wait for the Spark job to finish
            CountDownLatch latch = new CountDownLatch(1);

            SparkAppHandle sparkAppHandle = launcher.setVerbose(false).startApplication(new SparkAppHandle.Listener() {
                @Override
                public void stateChanged(SparkAppHandle sparkAppHandle) {
                    SparkAppHandle.State state = sparkAppHandle.getState();
                    switch (state) {
                        case SUBMITTED:
                            logger.info("Submit spark job succeeded");
                            // Application id of the Spark job on YARN
                            jobId[0] = sparkAppHandle.getAppId();
                            break;
                        case FINISHED:
                            logger.info("spark job success");
                            break;
                        case FAILED:
                        case KILLED:
                        case LOST:
                            logger.info("spark job failed");
                            break;
                    }
                    if (state.isFinal()) {
                        latch.countDown();
                    }
                }

                @Override
                public void infoChanged(SparkAppHandle sparkAppHandle) {
                }
            });

            // Wait for the Spark job to finish
            latch.await();
        } catch (Exception e) {
            logger.error("error", e);
        } finally {
            //...
        }
    }
}
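The Config class used above is a small project-local helper, not part of Spark, and its implementation is not shown here. A minimal sketch based on java.util.Properties, assuming job.properties sits on the classpath, could look like this:

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical minimal version of the Config helper used in SparkJobService.
// It loads a properties file from the classpath and exposes string lookups.
public class Config {
    private final Properties props = new Properties();

    public Config(String resourceName) {
        try (InputStream in = Config.class.getClassLoader().getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IllegalArgumentException("Resource not found: " + resourceName);
            }
            props.load(in);
        } catch (IOException e) {
            throw new RuntimeException("Failed to load " + resourceName, e);
        }
    }

    public String getString(String key) {
        return props.getProperty(key);
    }
}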
In the code above, pay special attention to how the Spark job parameters are configured: different kinds of parameters go through different method calls. Some are added with addSparkArg, others with setConf. In particular, arguments meant for the Spark application itself must be passed with addAppArgs, which takes a variable-length (varargs) parameter.
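On the application side, the values passed to addAppArgs simply arrive as the args array of the job's main method. The real FileCompare class is not shown in this article, so the skeleton below is purely illustrative:

import org.apache.spark.sql.SparkSession;

// Illustrative skeleton of the Spark job's main class; the actual
// com.unionpay.css.fcmp.compare.cp.spark.nonprikey.FileCompare is not shown here.
public class FileCompareExample {
    public static void main(String[] args) {
        // args[0] and args[1] correspond to jobArgs1 and jobArgs2 passed via addAppArgs
        String input1 = args[0];
        String input2 = args[1];

        SparkSession spark = SparkSession.builder()
                .appName("FileCompareExample")
                .getOrCreate();

        // ... the actual file comparison logic would go here ...

        spark.stop();
    }
}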
The code also sets a local path for the Spark driver log, which makes the logs easy to inspect. The execution state of the Spark job is obtained through the stateChanged callback of SparkAppHandle. In this example the program has to wait for the job to finish, so after submitting it blocks on a CountDownLatch; when stateChanged sees the job reach a final state, the counter is decremented and the whole program ends.
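If a listener feels heavyweight, the same wait can also be done by polling the handle, since SparkAppHandle exposes getState(), getAppId() and State.isFinal(). A sketch along these lines (the one-second interval is an arbitrary choice) would work as well:

// Alternative to the listener + CountDownLatch: poll the handle until the job reaches a final state.
SparkAppHandle handle = launcher.startApplication();
while (!handle.getState().isFinal()) {
    logger.info("spark job {} state: {}", handle.getAppId(), handle.getState());
    Thread.sleep(1000L); // polling interval chosen arbitrarily; Thread.sleep may throw InterruptedException
}
if (handle.getState() == SparkAppHandle.State.FINISHED) {
    logger.info("spark job success");
} else {
    logger.info("spark job failed");
}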
The above is one way to launch Spark jobs from code. If you run into problems, feel free to get in touch and discuss them.