Create and run an EMR on EKS cluster

Posted by algy on Tue, 01 Feb 2022 07:56:44 +0100


Creating EMR on EKS is entirely command-line driven; at present there is no UI for these operations. This article demonstrates how to create and run an EMR on EKS cluster from the command line. The process can be divided into two stages: the first stage is to create an EKS cluster, and the second stage is to create an EMR virtual cluster on top of that EKS cluster. The specific steps follow.

Note: during the operation we will obtain a number of values, such as the name of the EKS cluster and the ID of the virtual cluster, that will be used again in later steps. To make the scripts in this article reusable, we extract these values, assign each to a variable, and export it for later reference. The following variables will be generated and referenced during the operation, together with the values used in this example:

Variable name | Value in this example | Description
REGION | us-east-1 | Current AWS region
ZONES | us-east-1a,us-east-1b,us-east-1c | Availability zones assigned to the EKS cluster to be created
EKS_CLUSTER_NAME | it-infrastructure | Name of the EKS cluster to be created
DATALAKE_NAMESPACE | datalake | Kubernetes namespace for the data systems, created on EKS; the EMR on EKS virtual cluster will be placed in this namespace
VIRTUAL_CLUSTER_NAME | emr-cluster-1 | Name of the EMR on EKS virtual cluster to be created
SSH_PUBLIC_KEY | <find under EC2 -> Key pairs> | Name of the key pair whose public key the EKS cluster to be created will use
EXECUTION_ROLE_ARN | <find under IAM -> Roles, Admin in this example> | ARN of the IAM role used to run EMR on EKS jobs
VIRTUAL_CLUSTER_ID | <generated during the process> | ID of the EMR on EKS virtual cluster to be created

The following commands assign values to these global variables (VIRTUAL_CLUSTER_ID will be generated later, so it is not assigned yet):

export REGION="us-east-1"
export ZONES="us-east-1a,us-east-1b,us-east-1c"
export EKS_CLUSTER_NAME="it-infrastructure"
export DATALAKE_NAMESPACE="datalake"
export VIRTUAL_CLUSTER_NAME="emr-cluster-1"
export SSH_PUBLIC_KEY="<your-pub-key-name>"
export EXECUTION_ROLE_ARN="<your-admin-role-arn>"

0. Preconditions

  • Make sure you have a Linux host with the awscli command line installed
  • Make sure the access key configured for awscli belongs to an Admin account; you can confirm the identity currently in use with the command below
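
An optional way to confirm that awscli is configured and pointing at the expected account is to print the identity currently in use:

aws sts get-caller-identity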

1. Install eksctl

eksctl is a command-line tool for working with EKS. We will need it later, so install it first. The installation commands are as follows:

curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
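
You can verify the installation with:

eksctl version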

2. Install kubectl

kubectl is a command-line tool for managing Kubernetes clusters. We will need it later, so install it first. The installation commands are as follows:

curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.20.4/2021-04-12/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$PATH:$HOME/bin
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
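
You can verify the installation with:

kubectl version --client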

3. Create EKS cluster

Next, we will create the EKS cluster named it-infrastructure in us-east-1. The command is as follows:

eksctl create cluster \
    --region $REGION \
    --name $EKS_CLUSTER_NAME \
    --zones $ZONES \
    --node-type m5.xlarge \
    --nodes 5 \
    --with-oidc \
    --ssh-access \
    --ssh-public-key $SSH_PUBLIC_KEY \
    --managed

The following points about the command above deserve attention:

  • $SSH_PUBLIC_KEY is the name of your key pair on AWS; you can find it in the EC2 console under Key pairs, in the Name column;
  • --zones is not a required option. If it is not specified, availability zones are chosen at random, but a randomly chosen AZ sometimes does not have enough capacity for the requested EKS cluster; in that case, specify --zones explicitly to avoid the unavailable zone;
  • --node-type and --nodes are not required either. If they are not specified, the cluster is deployed on two m5.large nodes by default, which is far too little for EMR, so these two options must be set explicitly to give the cluster more resources;

The command takes a long time to run (about 20 minutes). When it finally prints:

EKS cluster "ABC_IT_INFRASTRUCTURE" in "us-east-1" region is ready

the EKS cluster has been created. Note that while this command runs, it creates a large amount of infrastructure through CloudFormation, including IAM roles, a VPC, EC2 instances, and so on. Errors partway through are quite possible, and many operations cannot be rolled back automatically, so keep the CloudFormation console open and watch it continuously; if a stack fails and is not cleaned up, you must delete it manually before re-running the command above.
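
As an optional check once the command completes, you can confirm from the command line that the cluster is active:

eksctl get cluster --region $REGION --name $EKS_CLUSTER_NAME
aws eks describe-cluster --region $REGION --name $EKS_CLUSTER_NAME --query "cluster.status"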

eksctl create cluster has many other configurable options; you can view their detailed descriptions with the following command:

eksctl create cluster -h

4. Check EKS cluster status

After the EKS cluster is created, you can check its status from the command line to make sure it is healthy (this step is optional and can be skipped).

  • View the status of each physical node in the cluster
kubectl get nodes -o wide
  • View the status of cluster POD
kubectl get pods --all-namespaces -o wide
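
eksctl normally writes the access configuration for the new cluster into your local kubeconfig. If kubectl cannot reach the cluster (for example, when working from a different host), you can regenerate the kubeconfig with the standard AWS CLI command:

aws eks update-kubeconfig --region $REGION --name $EKS_CLUSTER_NAME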

5. Create a Namespace

To make resource management easier, we create a dedicated namespace named datalake on the Kubernetes cluster for the data-related systems; the EMR virtual cluster created later will be placed in this namespace:

kubectl create namespace $DATALAKE_NAMESPACE
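
As a quick optional check, confirm that the namespace now exists:

kubectl get namespace $DATALAKE_NAMESPACE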

6. Authorize access to Namespace

By default, EMR on EKS has no permission to access and use a namespace on EKS. We need to create a Kubernetes role, bind that role to a Kubernetes user, and map the service-linked role AWSServiceRoleForAmazonEMRContainers to that user, thereby bridging the permission models of Kubernetes and EMR on EKS. Fortunately, we do not need to perform these operations one by one by hand; a single eksctl command does all of it:

eksctl create iamidentitymapping \
    --region $REGION \
    --cluster $EKS_CLUSTER_NAME \
    --namespace $DATALAKE_NAMESPACE \
    --service-name "emr-containers"

The console output also confirms what was described above:

2021-06-02 12:39:49 [ℹ]  created "datalake:Role.rbac.authorization.k8s.io/emr-containers"
2021-06-02 12:39:49 [ℹ]  created "datalake:RoleBinding.rbac.authorization.k8s.io/emr-containers"
2021-06-02 12:39:49 [ℹ]  adding identity "arn:aws:iam::1234567898765:role/AWSServiceRoleForAmazonEMRContainers" to auth ConfigMap
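
If you want to double-check the result, you can list both the Kubernetes objects and the identity mapping that were just created:

# the Role and RoleBinding created in the data lake namespace
kubectl get role,rolebinding -n $DATALAKE_NAMESPACE

# the identity mappings registered for the EKS cluster
eksctl get iamidentitymapping --region $REGION --cluster $EKS_CLUSTER_NAME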

7. Create Job Execution Role

Running jobs on EMR on EKS requires an IAM role. In this role you configure the resources that EMR on EKS may use, such as S3 buckets, CloudWatch, and other services; these make up the role's policies. The official documentation gives a reference policy configuration, which can be found at: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/creating-job-execution-role.html .

For convenience, this article will directly use the Admin role as the job execution role.
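
If you prefer a dedicated role instead of Admin, the following is a minimal sketch of attaching permissions to a role you have already created in IAM; the role name emr-eks-job-execution-role is a placeholder, and the policy is only roughly along the lines of the sample in the documentation linked above (the wide-open S3 resource in particular should be narrowed for real use). The trust relationship is added in step 8.

# permissions sketch: S3 access for job artifacts/logs and CloudWatch Logs access for log delivery
tee job-execution-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": ["arn:aws:logs:*:*:*"]
    }
  ]
}
EOF

# attach the policy inline to the (pre-existing) job execution role
aws iam put-role-policy \
    --role-name emr-eks-job-execution-role \
    --policy-name emr-eks-job-execution-policy \
    --policy-document file://job-execution-policy.json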

8. Create a Trust Relationship for Role

If you created a dedicated role in step 7, you need to edit that role and add a trust relationship between it and the EMR managed service account. The EMR managed service account is created automatically when a job is submitted, so the service-account part of the trust configuration uses a wildcard pattern.

Fortunately, we do not need to edit the role's Trust Relationships by hand; the following command adds this trust relationship automatically:

aws emr-containers update-role-trust-policy \
   --cluster-name $EKS_CLUSTER_NAME \
   --namespace $DATALAKE_NAMESPACE \
   --role-name <Admin or the-job-execution-role-name-you-created>

Here, replace <Admin or the-job-execution-role-name-you-created> with Admin or the name of the role created in step 7. After the command succeeds, you will see a generated configuration similar to the following on the role's Trust relationships page:

{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::1234567898765:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/1C2DF227CD8E011A693BCF03D7EBD581"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringLike": {
      "oidc.eks.us-east-1.amazonaws.com/id/1C2DF227CD8E011A693BCF03D7EBD581:sub": "system:serviceaccount:kube-system:emr-containers-sa-*-*-1234567898765-3l0vgne6"
    }
  }
}

Even if we chose to use the Admin role as the job execution role in step 7, this step still needs to be executed, with --role-name set to Admin. Otherwise, we have no permission to create a log group or store logs on S3 during job execution.
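
As an optional check, you can print the role's trust policy from the command line and confirm that the Federated principal and the StringLike condition shown above are present:

aws iam get-role --role-name <Admin or the-job-execution-role-name-you-created> --query "Role.AssumeRolePolicyDocument"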

9. Create EMR virtual cluster on EKS

Next, we create the EMR cluster. A more accurate word would actually be "register": completing this step does not bring up an EMR cluster on EKS. What is created here is a virtual cluster; actual resources are only created when the first job is submitted. The commands to create the virtual cluster are as follows:

# create virtual cluster description file
tee $VIRTUAL_CLUSTER_NAME.json <<EOF
{
  "name": "$VIRTUAL_CLUSTER_NAME",
  "containerProvider": {
    "type": "EKS",
    "id": "$EKS_CLUSTER_NAME",
    "info": {
      "eksInfo": {
        "namespace": "$DATALAKE_NAMESPACE"
      }
    }
  }
}
EOF

# create virtual cluster
aws emr-containers create-virtual-cluster --cli-input-json file://./$VIRTUAL_CLUSTER_NAME.json

The commands above first create a cluster description file, $VIRTUAL_CLUSTER_NAME.json, which records the name of the EMR virtual cluster and the EKS cluster and namespace it is built on, and then create the virtual cluster described in that file with aws emr-containers create-virtual-cluster.

If the command executes successfully, a JSON document describing the cluster is printed on the console; the id field is the important one, as it will be used when submitting jobs later. If you did not save it, you can query it at any time with the following command:

aws emr-containers list-virtual-clusters

Assign the obtained id to the global variable VIRTUAL_CLUSTER_ID, which will be referenced many times in later operations:

export VIRTUAL_CLUSTER_ID='<cluster-id>'
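
If you prefer not to copy the id by hand, the following sketch extracts it with a JMESPath query, assuming only one non-terminated virtual cluster carries this name:

export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters \
    --query "virtualClusters[?name=='$VIRTUAL_CLUSTER_NAME' && state=='RUNNING'].id | [0]" \
    --output text)
echo $VIRTUAL_CLUSTER_ID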

10. Submit work to EMR on EKS

Once the virtual cluster has been built, you can submit big data jobs to it. EMR on EKS is container based; unlike classic EMR, operating it by logging in over a shell is not the normal approach (possible but inconvenient). The conventional usage is to treat it as a black box of compute resources and submit jobs to it. The following example command submits a job to EMR on EKS that runs pi.py, the example program shipped with Spark:

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sample-job-name \
--execution-role-arn $EXECUTION_ROLE_ARN \
--release-label emr-6.2.0-latest \
--job-driver '{"sparkSubmitJobDriver": {"entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py","sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"}}' \
--configuration-overrides '{"monitoringConfiguration": {"cloudWatchMonitoringConfiguration": {"logGroupName": "/emr-on-eks/'"$VIRTUAL_CLUSTER_NAME"'", "logStreamNamePrefix": "pi"}}}'

The most important parameter of start-job-run is --job-driver; all information about the job itself is carried in this parameter. According to the documentation, EMR on EKS currently only supports job submission through sparkSubmitJobDriver, i.e. jobs must be expressed in a form acceptable to spark-submit: a jar package plus a class, or a PySpark script. The jar package and its dependent jar files can be stored on S3.
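
For reference, here is a sketch of what the same parameter could look like for a jar-based job; the bucket, jar path, class name, and arguments are hypothetical placeholders:

--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://<your-bucket>/jars/my-spark-app.jar",
    "entryPointArguments": ["<arg1>", "<arg2>"],
    "sparkSubmitParameters": "--class com.example.MyApp --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
  }
}'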

A more elegant way to submit a job is to provide a JSON description file for the job run, centralizing all cluster, job, and job-configuration information in that file, and then execute the command, as shown below:

# create job description file
tee start-job-run-request.json <<EOF
{
  "name": "sample-job-name", 
  "virtualClusterId": "$VIRTUAL_CLUSTER_ID",  
  "executionRoleArn": "$EXECUTION_ROLE_ARN", 
  "releaseLabel": "emr-6.2.0-latest", 
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
    }
  }, 
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G"
         }
      }
    ], 
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED", 
      "cloudWatchMonitoringConfiguration": {
    "logGroupName": "/emr-on-eks/$VIRTUAL_CLUSTER_NAME", 
        "logStreamNamePrefix": "pi"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "s3://glc-emr-on-eks-logs/"
      }
    }
  }
}
EOF
# start job
aws emr-containers start-job-run --cli-input-json file://./start-job-run-request.json

For the preparation of json files, please refer to: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-CLI.html#emr-eks-jobs-submit
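
Whichever submission form you use, start-job-run prints a JSON document that contains the id of the job run. You can then follow the job from the command line:

# list the job runs in the virtual cluster, including their states
aws emr-containers list-job-runs --virtual-cluster-id $VIRTUAL_CLUSTER_ID

# describe a single job run (replace <job-run-id> with the id returned by start-job-run)
aws emr-containers describe-job-run --virtual-cluster-id $VIRTUAL_CLUSTER_ID --id <job-run-id>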

Finally, a word about the EMR cluster configuration. As with a regular EMR cluster, the cluster configuration is submitted as JSON and written into applicationConfiguration, for example the "classification": "spark-defaults" entry in the configuration above. Since EMR on EKS currently only supports Spark, only the following classifications can be configured (a small combined example follows the table):

Classification | Description
core-site | Change values in Hadoop's core-site.xml file.
emrfs-site | Change EMRFS settings.
spark-metrics | Change values in Spark's metrics.properties file.
spark-defaults | Change values in Spark's spark-defaults.conf file.
spark-env | Change values in the Spark environment.
spark-hive-site | Change values in Spark's hive-site.xml file.
spark-log4j | Change values in Spark's log4j.properties file.
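
As a small illustrative sketch (the property values below are arbitrary examples, not recommendations), several classifications can be combined inside applicationConfiguration like this:

"applicationConfiguration": [
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.driver.memory": "2G",
      "spark.sql.shuffle.partitions": "64"
    }
  },
  {
    "classification": "spark-log4j",
    "properties": {
      "log4j.rootCategory": "WARN, console"
    }
  }
]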

11. Deletion and cleaning

The order of deleting and cleaning up the clusters should be the reverse of the creation process: delete the EMR virtual cluster first, then the EKS cluster:

# 1. list all jobs
aws emr-containers list-job-runs --virtual-cluster-id $VIRTUAL_CLUSTER_ID

# 2. cancel running jobs
aws emr-containers cancel-job-run --id <job-run-id> --virtual-cluster-id $VIRTUAL_CLUSTER_ID

# 3. delete virtual cluster
aws emr-containers delete-virtual-cluster --id $VIRTUAL_CLUSTER_ID

# 4. delete eks cluster
eksctl delete cluster --region $REGION --name $EKS_CLUSTER_NAME

Note: in step 4, when deleting the EKS cluster, you must find the NodeInstanceRole resource in the corresponding CloudFormation stack and manually detach all policies attached to that role before the command can complete successfully.
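
A sketch of how that detaching can be done from the command line; the role name is a placeholder to be taken from the NodeInstanceRole resource of the nodegroup's CloudFormation stack:

# list the policies attached to the node instance role
aws iam list-attached-role-policies --role-name <node-instance-role-name>

# detach each attached policy before deleting the cluster
aws iam detach-role-policy --role-name <node-instance-role-name> --policy-arn <policy-arn>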

12. Common errors

  • EKS clusters created with eksctl create cluster consist of two m5.large nodes by default. That configuration can hardly support an EMR cluster, so the node count and node type need to be specified explicitly, as described in step 3;

  • If you encounter an error similar to the following when creating an EKS cluster in step 3:

AWS::EKS::Cluster/ControlPlane: CREATE_FAILED – "Cannot create cluster 'my-bigdata-infra-cluster' because us-east-1e, the targeted availability zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these availability zones: us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f (Service: AmazonEKS; Status Code: 400; Error Code: UnsupportedAvailabilityZoneException; Request ID: 61028748-0cc1-4100-9152-aab79a475fe6; Proxy: null)"

This means that an automatically assigned or explicitly specified AZ is currently unavailable. Replace it with another AZ in the --zones parameter list.

About the author: architect with 15 years of experience in IT system development and architecture, with rich hands-on experience in big data, enterprise application architecture, SaaS, distributed storage, and domain-driven design, and a keen interest in functional programming. He has a deep and broad understanding of the Hadoop/Spark ecosystem, has participated in the development of a commercial Hadoop distribution, and has led teams that built several complete enterprise data platforms. Personal technical blog: https://laurence.blog.csdn.net/ . He is the author of the book "Big Data Platform Architecture and Prototype Implementation: Data Platform Construction in Practice", available on JD.com and Dangdang.

Topics: Spark Kubernetes