2021-12-30: the 58th step on the road to programming

Posted by evolve4 on Sat, 01 Jan 2022 04:07:23 +0100

Contents

1. Introduction to azkaban
2. System architecture of azkaban
3. Installation modes of azkaban
3.1 Solo server installation
3.1.1 Introduction to solo server
3.1.2 Installation steps
3.2 Multi exec server installation
3.2.1 Node layout
3.2.2 Configure mysql
3.2.3 Configure the web server
3.2.4 Configure the exec server
4. Applications of azkaban
4.1 Flow 1.0 job flows
4.1.1 Description
4.1.2 Case 1: print hello world
4.1.3 Case 2: calling a shell script
4.1.4 Case 3: running an MR program
4.1.5 Case 4: workflow demo
4.1.6 Scheduling a hive script with azkaban
4.1.7 Timed scheduling tasks in azkaban
4.2 Flow 2.0 flows
5. Email alerts in azkaban
6. Phone alerts in azkaban

1. Introduction to azkaban

1) Official website

https://azkaban.github.io/
Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.

2) Generation background

1. A complete big data analysis system is usually composed of a large number of task units: shell scripts, mapreduce programs, hive scripts, spark programs, etc.
2. There are temporal ordering and dependencies among the task units: precedence relationships, dependency relationships, and scheduled execution.
3. To organize such a complex execution plan well, a workflow scheduling system is needed to schedule the execution.

3) Characteristics of azkaban

 Compatible with any version of Hadoop
 Easy-to-use web UI
 Simple workflow upload via web and http
 Project workspaces
 Scheduled workflows
 Modular and pluggable
 Authentication and authorization
 Tracking of user actions
 Email alerts on failure and success
 SLA alerting and automatic kill
 Retrying of failed jobs

4) Comparison between azkaban and oozie

The two are roughly equivalent in functionality, but Oozie submits Hadoop and Spark jobs through the interfaces encapsulated under org.apache.hadoop, while Azkaban can run shell statements directly. In terms of security, Oozie may be better.

Workflow definition: Oozie uses xml to define workflows, while Azkaban uses properties files.

Deployment: Oozie is relatively hard to deploy, and it pulls task logs from Yarn.

Azkaban treats a task as successful as long as the process exits normally, even if the task itself hit a bug; Oozie, however, can effectively detect task success and failure.

Operating workflows: Azkaban is operated via the web UI. Oozie supports web, RestApi, and Java API operation.

Permission control: Oozie has basically no permission control, while Azkaban has complete read/write permission control over workflows for its users.

Oozie's actions mainly run inside Hadoop, while Azkaban's actions run on the Azkaban server.

Recording workflow state: Azkaban keeps the state of executing workflows in memory; Oozie saves it in Mysql.

On failure: Azkaban loses all workflows, while Oozie can resume running a failed workflow where it stopped.

5) Common scheduling systems

Simple task scheduling: use linux crontab directly to schedule shell and python scripts

Ready-made open-source task schedulers: oozie, azkaban, airflow, etc.

Complex task scheduling: a self-developed scheduling platform
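For the crontab route above, a single entry is enough; a sketch (the script path and schedule are made-up examples):

```
# entry added via `crontab -e`: run the ETL script every day at 02:00
0 2 * * * /root/etl.sh >> /root/etl.log 2>&1
```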

2. System architecture of azkaban

azkaban consists of three components:

1. web server: provides the web UI, receives jobs submitted by clients, and distributes the jobs to the exec servers
2. exec server: receives the jobs distributed by the web server and executes them
3. mysql: manages data sharing and partial state synchronization between the web and exec servers

3. Installation modes of azkaban

There are three installation modes: source-code installation, solo mode, and multi exec server mode.
1. Source installation mode: refer to the documentation.
2. solo mode: stand-alone mode, i.e. all azkaban processes run on one machine, and there is only one exec server.
3. multi exec server mode: there are multiple exec servers, distributed across different machine nodes.

3.1 Solo Server installation

3.1.1 Introduction to solo server

The solo server is the stand-alone version of azkaban, i.e. a single instance. It is simple to install and easy to learn. Its advantages are:

- Simple installation: no mysql instance is needed; the built-in H2 database is used for storage.
- Easy to start: the web server and executor server run in the same process.
- Fully functional: it contains all of azkaban's features. You can use azkaban in the usual way and install plug-ins for it.

3.1.2 Installation steps

1) Upload, unzip, and rename

[root@xxx01 ~]# tar -zxvf azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C /usr/local/
[root@xxx01 ~]# cd /usr/local/
[root@xxx01 local]# mv azkaban-solo-server-0.1.0-SNAPSHOT/ azkaban-solo

2) Configure environment variables

[root@xxx01 ~]#  vim /etc/profile
​
#azkaban environment
export AZKABAN_HOME=/usr/local/azkaban-solo
export PATH=$AZKABAN_HOME/bin:$PATH
​
​
[root@xxx01 ~]#  source /etc/profile

3) Add user

[root@xxx01 ~]# vim $AZKABAN_HOME/conf/azkaban-users.xml
 Add the following on the fourth line:
<user password="admin" roles="metrics,admin" username="admin"/>

So far, the solo mode has been successfully installed

4) Start azkaban. Note: the startup script must be run from azkaban's home directory

[root@xxx01 azkaban-solo]# ./bin/start-solo.sh 

5) Open browser

Enter ip:8081 in the browser; if the page opens, the installation succeeded

3.2 Multi exec server installation

3.2.1 Node layout

xxx01    webserver
xxx02    execserver
xxx03    execserver

3.2.2 Configure mysql

Step 1) find the create-all-sql-0.1.0-SNAPSHOT.sql script

Option 1: upload azkaban-db-0.1.0-SNAPSHOT.tar.gz from the installation package, unzip it on linux, and find the script inside
Option 2: unzip it on windows and find the script there

Step 2) enter mysql and create an azkaban library

create database azkaban;

Step 3) execute the script

use azkaban;
source /root/create-all-sql-0.1.0-SNAPSHOT.sql

Step 4) ensure that azkaban is authorized remotely

grant all privileges on *.* to root@'%' identified by '@Mmforu45';

Step 5) modify the mysql configuration

(modifying it is recommended; if an error is reported when restarting the service, skip this change)

[root@xxx03 azkaban]# vi /etc/my.cnf
 Add the following under the [mysqld] section:
max_allowed_packet=1024M
[root@xxx03 ~]# systemctl restart mysqld

3.2.3 Configure the web server

Step 1) upload, unzip and rename

[root@xxx01 ~]# tar -zxvf azkaban-web-server-0.1.0-SNAPSHOT.tar.gz -C /usr/local/
[root@xxx01 ~]# cd /usr/local/
[root@xxx01 local]# mv azkaban-web-server-0.1.0-SNAPSHOT/ azkaban-web

Step 2) configure environment variables (optional; it works either way)

Step 3) import the mysql driver package

Enter the azkaban-web directory, create an extlib directory, and upload the mysql driver jar into it:
[root@xxx01 local]# cd azkaban-web
[root@xxx01 azkaban-web]# mkdir extlib

Step 4) generate secret key

[root@qphone01 azkaban-web]# keytool -keystore keystore -alias jetty -genkey -keyalg RSA
​
You will be asked for the keystore password and to confirm it; use 123456 for both (it must match jetty.password below)
 Press Enter through the remaining prompts until "Is ... correct?" appears, then type y
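If you prefer to skip the prompts entirely, keytool can also be driven non-interactively; a sketch (the -dname values are placeholders, and 123456 matches the jetty.password used in azkaban.properties):

```shell
# generate the jetty keystore without interactive prompts
# (assumes the JDK's keytool is on PATH)
if command -v keytool >/dev/null 2>&1; then
  # start from a clean keystore so the alias isn't duplicated on re-runs
  rm -f keystore
  keytool -keystore keystore -alias jetty -genkeypair -keyalg RSA \
    -storepass 123456 -keypass 123456 \
    -dname "CN=xxx01, OU=azkaban, O=azkaban, L=bj, ST=bj, C=CN" \
    -validity 3650
  echo "keystore created"
else
  echo "keytool not found; install a JDK first"
fi
```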

Step 5) configure azkaban.properties

# Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=/usr/local/azkaban-web/web
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=/usr/local/azkaban-web/conf/azkaban-users.xml
# Loader for projects
executor.global.properties=/usr/local/azkaban-exec/conf/global.properties
azkaban.project.dir=projects
# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
jetty.use.ssl=false
jetty.maxThreads=25
jetty.ssl.port=8443
jetty.port=8081
jetty.keystore=keystore
jetty.password=123456
jetty.keypassword=123456
jetty.truststore=keystore
jetty.trustpassword=123456
# Azkaban Executor settings
# mail settings
mail.sender=
mail.host=
# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
# when this parameters set then these parameters are used to generate email links.
# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
# azkaban.webserver.external_hostname=myazkabanhost.com
# azkaban.webserver.external_ssl_port=443
# azkaban.webserver.external_port=8081
job.failure.email=
job.success.email=
lockdown.create.projects=false
cache.directory=cache
# JMX stats
jetty.connector.stats=true
executor.connector.stats=true
# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=xxx03
mysql.database=azkaban
mysql.user=root
mysql.password=@Mmforu45
mysql.numconnections=100
#Multiple Executor
azkaban.use.multiple.executors=true
#azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
azkaban.executorselector.filters=StaticRemainingFlowSize,CpuStatus
azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
azkaban.executorselector.comparator.Memory=1
azkaban.executorselector.comparator.LastDispatched=1
azkaban.executorselector.comparator.CpuUsage=1

Step 6) configure azkaban-users.xml

Add admin user

<azkaban-users>
  <user groups="azkaban" password="azkaban" roles="admin" username="azkaban"/>
  <user password="metrics" roles="metrics" username="metrics"/>
  <user password="admin" roles="metrics,admin" username="admin"/>
  <role name="admin" permissions="ADMIN"/>
  <role name="metrics" permissions="METRICS"/>
</azkaban-users>

3.2.4 Configure the exec server

Step 1) upload, unzip and rename

[root@xxx02 ~]# tar -zxvf azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz -C /usr/local/
[root@xxx02 ~]# cd /usr/local/
[root@xxx02 local]# mv azkaban-exec-server-0.1.0-SNAPSHOT/ azkaban-exec

Step 2) enter the Azkaban exec directory, create the extlib directory, and import the mysql driver package into this directory

[root@xxx02 local]# cd azkaban-exec
[root@xxx02 azkaban-exec]# mkdir extlib

Step 3) modify azkaban.properties

[root@xxx02 azkaban-exec]# vi conf/azkaban.properties

Modify to the following content (note that the path and password of your machine should match)

# Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=/usr/local/azkaban-web/web
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=/usr/local/azkaban-web/conf/azkaban-users.xml
# Loader for projects
executor.global.properties=/usr/local/azkaban-exec/conf/global.properties
azkaban.project.dir=projects
# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
jetty.use.ssl=false
jetty.maxThreads=25
jetty.port=8081
# Where the Azkaban web server is located
azkaban.webserver.url=http://xxx01:8081
# mail settings
mail.sender=
mail.host=
# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
# when this parameters set then these parameters are used to generate email links.
# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
# azkaban.webserver.external_hostname=myazkabanhost.com
# azkaban.webserver.external_ssl_port=443
# azkaban.webserver.external_port=8081
job.failure.email=
job.success.email=
lockdown.create.projects=false
cache.directory=cache
# JMX stats
jetty.connector.stats=true
executor.connector.stats=true
# Azkaban plugin settings
azkaban.jobtype.plugin.dir=/usr/local/azkaban-exec/plugins/jobtypes/
# Azkaban mysql settings by default. Users should configure their own username and password.
#azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
azkaban.executorselector.filters=StaticRemainingFlowSize,CpuStatus
database.type=mysql
mysql.port=3306
mysql.host=xxx03
mysql.database=azkaban
mysql.user=root
mysql.password=@Mmforu45
mysql.numconnections=100
# Azkaban Executor settings
executor.port=12321               
executor.maxThreads=50
executor.flow.threads=30

Step 4) modify the plug-in file

[root@xxx02 azkaban-exec]# vi ./plugins/jobtypes/commonprivate.properties
 Set the following:
execute.as.user=false
memCheck.enabled=false   # turn off the memory check, otherwise an error is raised when free memory is below 3G

At this point azkaban-exec is configured on xxx02; next we can scp it to xxx03:

[root@xxx02 azkaban-exec]# cd ..
[root@xxx02 local]# scp -r azkaban-exec xxx03:/usr/local/

Step 5) start and test (restarting the virtual machines first is recommended)

Azkaban must be started in the following order: start the executors first, then the web server. Otherwise the web project fails to start because it cannot find an executor.

Start the two exec servers first

[root@xxx02 ~]# cd /usr/local/azkaban-exec
[root@xxx02 azkaban-exec]# ./bin/start-exec.sh
​
[root@xxx03 ~]# cd /usr/local/azkaban-exec
[root@xxx03 azkaban-exec]# ./bin/start-exec.sh

Then look at the metadata table executors

Log in to your mysql
 Check whether active is 1 for both rows in the executors table; if not, change it to 1
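Concretely, the check and the fix can be done in one mysql session; a sketch against the azkaban schema (column names as created by the create-all-sql script):

```sql
use azkaban;
select id, host, port, active from executors;
-- if a row shows active = 0, activate it:
update executors set active = 1 where active = 0;
```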

Then start the web server

[root@xxx01 ~]# cd /usr/local/azkaban-web
[root@xxx01 azkaban-web]# ./bin/start-web.sh

Then happily open the web UI: xxx01:8081

4. Applications of azkaban

Azkaban's workflow mechanism comes in two flavors: the old job flow and the new flow flow.

The job flow is called Flow 1.0; the flow flow is called Flow 2.0.

4.1 Flow 1.0 job flows

4.1.1 Description

1. azkaban's job flow files use the .job suffix
	They must contain a type attribute, which must be assigned
		one of the values: command, java, pig
2. jobs executed by azkaban must be packaged in advance, and the package format must be zip
3. format rules inside the flow file:
	1) make sure there are no spaces at the ends of lines
	2) use the utf-8 character set; if it really doesn't work on windows, upload the files to linux, zip them there, download the zip to windows, and upload it to azkaban
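Rule 3.1 (no trailing spaces) is easy to violate invisibly; a quick grep check before zipping, using a hypothetical demo.job:

```shell
# write a sample job file, then scan it for trailing whitespace
printf 'type=command\ncommand=echo "hello"\n' > demo.job
if grep -n '[[:space:]]$' demo.job; then
  echo "trailing whitespace found"
else
  echo "clean"
fi
```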

4.1.2 Case 1: print hello world

1) Create a file with the .job suffix, e.g. helloworld.job, with the following content:

type=command
command=echo "hello world"

Note: the file must be saved with the utf-8 character set

2) Compressed into a zip package

3) Upload to azkaban

1. First create a project
2. Upload the zip to the project
3. Click run job
4. On the ready screen, click execute to run
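The zip from step 2 can be built from the command line; a sketch (python3's zipfile module is used so the `zip` utility need not be installed):

```shell
# create the job file and package it into a zip for upload
printf 'type=command\ncommand=echo "hello world"\n' > helloworld.job
python3 -m zipfile -c helloworld.zip helloworld.job   # create the archive
python3 -m zipfile -l helloworld.zip                  # list its contents
```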

Note the colors:

Gray: not yet run
 Green: ran successfully
 Red: run failed
 Blue: running

4.1.3 Case 2: calling a shell script

1) Write a shell script calculate.sh

#!/usr/bin/bash
sum=0
for i in $(seq 1 100)
do
	sum=$(( $sum + $i ))
done
echo $sum >> /root/sum.log

2) Write a job file a2.job that calls the shell script

type=command
command=/usr/bin/bash calculate.sh

3) Package, upload, test
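Before packaging, the loop can be sanity-checked locally (printing to stdout instead of /root/sum.log); the sum of 1..100 is 5050:

```shell
# same loop as calculate.sh, result printed instead of appended to a log
sum=0
for i in $(seq 1 100); do
  sum=$(( sum + i ))
done
echo "$sum"   # → 5050
```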

4.1.4 Case 3: running an MR program

1) Write a job file a3.job

type=command
command=/usr/local/hadoop/bin/hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.13.2.jar wordcount /input /output

2) Find hadoop-mapreduce-examples-2.6.0-cdh5.13.2.jar and put it in the same location as a3.job

3) Package, upload, test

Note: hdfs and yarn must be started; check that the input directory exists and upload the files to be word-counted

4.1.5 Case 4: workflow demo

1) Create b.sh

#!/bin/bash
echo hello_bbb >/root/b.log
sleep 30s

2) Create jobB.job

type=command
command=/bin/bash b.sh

3) Create a.sh

#!/bin/bash
echo hello_aaa >/root/a.log

4) Create jobA.job

type=command
dependencies=jobB
command=/bin/bash a.sh

5) Package, upload, test
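What Azkaban does with dependencies=jobB can be mimicked locally to preview the order; a sketch (logs go to the current directory instead of /root, and the 30s sleep is dropped):

```shell
# recreate the two scripts
cat > b.sh <<'EOF'
#!/bin/bash
echo hello_bbb > ./b.log
EOF
cat > a.sh <<'EOF'
#!/bin/bash
echo hello_aaa > ./a.log
EOF
# jobA depends on jobB, so Azkaban would run b.sh first, then a.sh
bash b.sh && bash a.sh
cat b.log a.log
```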

4.1.6 Scheduling a hive script with azkaban

1) Create a HQL script: create_table.hql

create database mydb3;
use mydb3;
create table if not exists test1(
sid int,
sname string
)
row format delimited
fields terminated by ',';

2) Create a job file: create_table.job

type=command
command=/usr/local/hive/bin/beeline -u jdbc:hive2://qianfeng02:10000 -n root  -f create_table.hql

Note: the hiveserver2 service must be started on qianfeng02.

3) Package, upload and execute, and then view it

4.1.7 Timed scheduling tasks in azkaban

1) Create an sh script: testcrond.sh

#!/bin/bash
echo "aaaaa" >>/root/crond.log

2) Create a job file: testcrond.job

type=command
command=/usr/bin/bash testcrond.sh

3) Package and upload to azkaban

4) Click run job or execute flow to enter the interface. Instead of clicking execute immediately, click Schedule to set up the scheduled task

5) After setting it up, click the Schedule button under the scheduled task and continue by clicking execute

6) On the new screen, click the job name in the FLOW column to enter the execution plan screen

7) Then click Schedule/Execute Flow to reach the final screen and click execute
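Note: recent Azkaban versions accept a Quartz-style cron expression in the Schedule dialog (the format has a leading seconds field); two hypothetical examples:

```
0 0 12 ? * *      # fire every day at 12:00:00
0 */30 * ? * *    # fire every 30 minutes
```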

4.2 Flow 2.0 flows

Azkaban currently supports both Flow 1.0 and Flow 2.0, but the official documentation recommends Flow 2.0, because Flow 1.0 will be removed in future versions. The main design idea of Flow 2.0 is to provide the flow-level definitions that 1.0 lacks: users can merge all the job/properties files belonging to a given flow into a single flow definition file, whose content uses YAML syntax. It also supports redefining flows inside flows, which are called embedded flows or sub-flows.

4.2.1 Basic structure

The project zip contains multiple flow YAML files, one project YAML file, and optional libraries and source code. The basic structure of a flow YAML file is as follows:

1. All workflows are written in a single file
2. The file is named after the flow, with the .flow suffix, e.g. my-flow-name.flow
3. It contains all the nodes in the DAG
4. Each node can be of a different type, such as flow, hive, hadoopJava, pig, noop, command
5. Each node can have a name, type, config, dependsOn, a nodes section, and other attributes
6. Dependencies are specified by listing them under dependsOn
7. It contains additional configuration related to the flow
8. The properties from flow 1.0 are migrated under config; config entries are written as key-value pairs

Note: you need to write a separate xxxx.project file to tell azkaban to use the workflow 2.0 version:
azkaban-flow-version: 2.0
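Putting the two files together, a minimal Flow 2.0 project can be assembled and zipped like this (file names are examples; python3's zipfile module is used so no zip tool is assumed):

```shell
# minimal Flow 2.0 project skeleton: one .flow file and one .project file
cat > basic.flow <<'EOF'
nodes:
  - name: jobA
    type: command
    config:
      command: echo "hello flow2"
EOF
printf 'azkaban-flow-version: 2.0\n' > basic.project
python3 -m zipfile -c basic.zip basic.flow basic.project
python3 -m zipfile -l basic.zip
```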

4.3 YAML syntax

To use Flow 2.0 for workflow configuration, you first need to understand YAML. YAML is a concise non-markup language with strict formatting requirements; if your format is wrong, a parsing exception is thrown when the file is uploaded to Azkaban.

4.3.1 Basic rules

1. Case sensitive;
2. Indentation expresses hierarchy;
3. The indent width is not fixed: elements aligned at the same indent belong to the same level;
4. # marks a comment;
5. Strings need no quotes by default, but both single and double quotes may be used; in double quotes escape sequences such as \n are interpreted, while single quotes keep them literal;
6. YAML provides several scalar types: integer, floating point, string, null, date, boolean, and time.

4.3.2 Writing objects

# there must be a space between the : symbol and the value
key: value

4.3.3 Writing maps

# All key value pairs written in the same indent belong to a map
key: 
    key1: value1
    key2: value2

# Writing method 2
{key1: value1, key2: value2}

4.3.4 Writing arrays

# Writing method 1: a dash plus a space marks one array item
- a
- b
- c

# Writing method 2
[a,b,c]

4.3.5 Single and double quotes

s1: 'content\n character string'
s2: "content\n character string"

After conversion:
{ s1: 'content\\n character string', s2: "content\n character string" }

4.3.6 Special symbols

A YAML file can contain multiple documents, separated by `---`.

4.3.7 Configuration references

Flow 2.0 recommends defining common parameters under `config` and referencing them via `${}`.
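A sketch of that pattern (the parameter name and value are made up):

```yaml
config:
  log_dir: /tmp/azkaban-logs
nodes:
  - name: jobA
    type: command
    config:
      command: echo "logs go to ${log_dir}"
```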

4.4 Cases

4.4.1 Simple scheduling case

1) Write a xxxx.flow file, e.g. simple.flow (mind the character set, TAB key, etc.)

nodes:
  - name: jobA
    type: command
    config:
      command: echo "this is a simple test"

2) Write the version file: xxx.project, e.g. simple.project

azkaban-flow-version: 2.0

3) Package into an xxx.zip file, upload, and test

4.4.2 Multi-task scheduling

1) Write a xxxx.flow file, e.g. multi.flow (mind the character set, TAB key, etc.)

nodes:
  - name: jobE
    type: command
    config:
      command: echo "This is job E"
    # jobE depends on jobD
    dependsOn: 
      - jobD
    
  - name: jobD
    type: command
    config:
      command: echo "This is job D"
    # jobD depends on jobA,jobB,jobC
    dependsOn:
      - jobA
      - jobB
      - jobC

  - name: jobA
    type: command
    config:
      command: echo "This is job A"

  - name: jobB
    type: command
    config:
      command: echo "This is job B"

  - name: jobC
    type: command
    config:
      command: echo "This is job C"

2) Write the version file: xxx.project, same as before

azkaban-flow-version: 2.0

3) Package into an xxx.zip file, upload, and test

4.4.3 Embedded flow scheduling

1) Write a xxxx.flow file, e.g. embedded.flow (mind the character set, TAB key, etc.)

nodes:
  - name: jobC
    type: command
    config:
      command: echo "This is job C"
    dependsOn:
      - embedded_flow

  - name: embedded_flow
    type: flow
    config:
      prop: value
    nodes:
      - name: jobB
        type: command
        config:
          command: echo "This is job B ${prop}"
        dependsOn:
          - jobA

      - name: jobA
        type: command
        config:
          command: echo "This is job A"

2) Write the version file: xxx.project, same as before

azkaban-flow-version: 2.0

3) Package into an xxx.zip file, upload, and test

5. Email alerts in azkaban

1) Register a mailbox

Sina, Netease, etc. are recommended

2) Open the third-party client protocol pop3/smtp/imap of the mailbox

You enable it by sending an SMS from your phone; remember the authorization password it gives you, and back it up on your computer so you don't forget it.

3) Configure azkaban as a client: conf/azkaban.properties

mail.sender=your mailbox
mail.host=smtp.sina.cn
mail.user=your mailbox
mail.password=the authorization password obtained when enabling pop3/smtp/imap

The following two properties are optional; they have no effect since azkaban 3.0:
job.failure.email=mmforu@sina.cn
job.success.email=mmforu@sina.cn

4) Restart azkaban's service

5) Case test

1. Upload a case
2. Enter the execution interface and click Notification
3. Configure mailboxes to notify on failure and success
4. Execute

6. Phone alerts in azkaban

1) Register a Ruixiang cloud account, preferably completing email verification

2) Go to the Integration page in the CA navigation and select Email

3) Fill in the corresponding information, such as the application name, then click the Get AppKey button

4) Click Notification policy under Configuration, configure the corresponding status information, and click Save

5) Then find the generated Ruixiang cloud mailbox and copy it

6) Test azkaban

1. Upload a case
2. Enter the execution interface and click Notification
3. Configure the Ruixiang cloud mailbox as the one to notify on failure and success
4. Execute

(>...<, only one day left until the three-day New Year's Day holiday ~, go go go ~)

Topics: Big Data Hadoop hdfs Azkaban