Using SparkLauncher to invoke Spark operation in code

Posted by nitestryker on Mon, 03 Jan 2022 02:54:31 +0100

background

The project needs to deal with many files, and some files have a large number of GB. Therefore, considering that such files are specially written for Spark program processing, for the unified processing of programs, it is necessary to call Spark jobs in code to handle large files.

Implementation scheme

After investigation, it is found that the SparkLauncher class provided by Spark can be used to submit Spark jobs. There are many parameters to note in the use of this class. After project verification, this paper gives a relatively complete use method and description

First, add pom dependencies to the project and add your own version

<dependency>
	<groupId>org.apache.spark</groupId>
     <artifactId>spark-launcher_2.11</artifactId>
</dependency>

Secondly, some parameters of the Spark job itself can be placed in the configuration file and modified flexibly. Here is a CDH cluster configured with kerberos security authentication. The mode used when submitting the Spark job is yarn client. The following configuration is mainly used. The path in the configuration is filled in casually as an example. It is actually filled in according to its own environment. In addition, The whole application is executed at the CDH client node. Each configuration item has a description:

#spark application use
#Log output of driver
driverLogDir=/root/test/logs/
#kerberos authentication keytab file
keytab=/root/test/dw_hbkal.keytab
# keyberos authentication principal
principal=dw_hbkal
# Run spark job on yarn cluster
master=yarn
# Yarn client mode
deployMode=client
# Number of spark executors and memory configuration
minExecutors=16
maxExecutors=16
executorMemory=1g
# driver memory configuration
driverMemory=256M
# Number of core s used by spark executor
executorCores=2
# Main class of spark job
mainClass=com.unionpay.css.fcmp.compare.cp.spark.nonprikey.FileCompare
# jar package for spark job
jarPath=/root/test/my-spark-job-1.0-SNAPSHOT.jar
# Third party jar s that spark jobs depend on
extjars=/root/test/mysql-connector-java-8.0.27.jar,/root/test/jedis-2.8.1.jar
# The cluster configuration file stored on the CHD client indicates which cluster to submit spark jobs to
HADOOP_CONF_DIR=/root/CDH/bjc/CDH/etc/conf/hadoop-conf
JAVA_HOME=/usr/java/jdk1.8.0_141
SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
# yarn queue for spark job execution
yarnQueue=mysparkqueue

The above configuration can be read in the code and used together with SparkLauncher. Please refer to the following example code:

//Responsible for initiating spark operation
public class SparkJobService{

    private static final Logger logger = LoggerFactory.getLogger(SparkJobService.class);
    static Config config;
    //spark task parameters
    static String keytabPath;
    static String principal ;
    static String master;
    static String deployMode;
    static String minExecutods;
    static String maxExecutors;
    static String executorMemory;
    static String driverMemory;
    static String executorCores;
    static String mainClass;
    static String jarPath;
    static String extjars;
    static String yarnQueue;
    static String HADOOP_CONF_DIR;
    static String JAVA_HOME;
    static String SPARK_HOME;
    static String driverLogDir;

    static {
        config = new Config("job.properties");
        keytabPath = config.getString("keytab");
        principal = config.getString("principal");
        master = config.getString("master");
        deployMode = config.getString("deployMode");
        minExecutods = config.getString("minExecutods");
        maxExecutors = config.getString("maxExecutors");
        executorMemory = config.getString("executorMemory");
        driverMemory = config.getString("driverMemory");
        executorCores = config.getString("executorCores");
        mainClass = config.getString("mainClass");
        jarPath = config.getString("jarPath");
        extjars = config.getString("extjars");
        yarnQueue = config.getString("yarnQueue");
        HADOOP_CONF_DIR=config.getString("HADOOP_CONF_DIR");
        JAVA_HOME = config.getString("JAVA_HOME");
        SPARK_HOME = config.getString("SPARK_HOME");
        driverLogDir = config.getString("driverLogDir");
    }

    public static void main(String[] args) {
        try{
            //spark task settings
            //If it is added in the system environment variable, it can not be added
            HashMap<String,String> env = new HashMap();
            env.put("HADOOP_CONF_DIR",HADOOP_CONF_DIR);
            env.put("JAVA_HOME",JAVA_HOME);
            env.put("SPARK_HOME",SPARK_HOME);
			
			String jobArgs1 = "test1";
			String jobArgs2 = "test2"
			//......

            SparkLauncher launcher = new SparkLauncher(env).addSparkArg("--keytab",keytabPath).addSparkArg("--principal",principal).setMaster(master).setDeployMode(deployMode)
                    .setConf("spark.dynamicAllocation.minExecutors",minExecutods).setConf("spark.dynamicAllocation.maxExecutors",maxExecutors).setConf("spark.driver.memory",driverMemory).setConf("spark.executor.memory",executorMemory).setConf("spark.executor.cores",executorCores)
                    .setConf("spark.yarn.queue",yarnQueue)
                    .setAppResource(jarPath).setMainClass(mainClass).addAppArgs(jobArgs1,jobArgs2);

            //Jar dependency in spark job, such as MySQL connector jar...
            for(String jarName : extjars.split(",")){
                launcher.addJar(jarName);
            }
            launcher.setAppName("SparkJob");
            //spark local driver log
            launcher.redirectError(new File(driverLogDir + "spark_driver.log"));
            final String[] jobId = new String[]{""};
            //Used to wait for the spark job to finish
            CountDownLatch latch = new CountDownLatch(1);
            SparkAppHandle sparkAppHandle = launcher.setVerbose(false).startApplication(new SparkAppHandle.Listener() {
                @Override
                public void stateChanged(SparkAppHandle sparkAppHandle) {
                    SparkAppHandle.State state = sparkAppHandle.getState();
                    switch (state){
                        case SUBMITTED:
                            logger.info("Submit spark Job succeeded");
							//jobId of spark job on yarn
                            jobId[0] = sparkAppHandle.getAppId();
                            break;
                        case FINISHED:
                            logger.info("spark job success");
                            break;
                        case FAILED:
                        case KILLED:
                        case LOST:
                            logger.info("spark job failed");
                    }
                    if (state.isFinal())
                        latch.countDown();
                }

                @Override
                public void infoChanged(SparkAppHandle sparkAppHandle) {
                }
            });
			//Wait for Spark job execution to end
            latch.await();
        }catch (Exception e){
            logger.error("error",e);
        }finally {
			//...
        }
    }
}

In the above code, pay special attention to how the spark job parameters are configured. Different parameters use different method calls. Some parameters are added using the addSparkArg method and some using setConf. In particular, if it is a parameter passed to the spark application itself, it needs to be passed using the addAppArgs method. The formal parameter of this method is a variable length parameter.

In addition, the spark local driver log path is set in the code, which is convenient for viewing logs. Obtain the execution status of the spark job through the stateChanged callback function of SparkAppHandle. In this example, you need to wait for the execution of the spark job to end. Therefore, after submitting the job, wait through the CountDownLatch mechanism. In stateChanged, when it is found that the spark job is in the end state, the counter decreases by one and the whole program ends.

The above is an implementation of calling Spark jobs in code, and there are problems that can be exchanged together.

Topics: Java Big Data Spark