Auxiliary System - Workflow Scheduler azkaban

Posted by mikem562 on Sat, 20 Jul 2019 04:06:51 +0200

1. Overview

azkaban: https://azkaban.github.io/

1.1. Why Workflow Scheduling System is Needed

A complete data analysis system usually consists of a large number of task units:
shell script program, java program, mapreduce program, hive script, etc.
Time-dependent and time-dependent relationships exist among task units
In order to organize such complex execution plans well, a workflow scheduling system is needed to schedule execution.
For example, we might have a requirement that a business system produce 20G raw data every day, which we process every day, as follows:
1) Synchronize raw data to HDFS through Hadoop first;
2. Transform the original data with the MapReduce computing framework, and store the generated data in partitioned tables in Hive tables.
3. Multiple tables in Hive need to be JOIN processed to get a large Hive table with detailed data.
4. Various statistical analyses will be performed on the detailed data to get the result report information.
5. Result data from statistical analysis need to be synchronized to the business system for use by business calls.

 


1.2. Workflow Scheduling Implementation

Simple task scheduling: directly using linux's crontab to define;

Complex task scheduling: develop a scheduling platform or use existing open source scheduling systems such as ooize, azkaban, airflow, etc.

1.3. Common workflow scheduling systems

There are many workflow schedulers on the market

In the hadoop domain, common workflow schedulers are Oozie, Azkaban,Cascading,Hamake, etc.

1.4. Characteristic comparison of various dispatching tools

The following table compares the key features of the four hadoop workflow schedulers listed above. Although these workflow schedulers are able to address essentially the same requirement scenarios, there are significant differences in design concepts, target users, application scenarios, and so on. They can be used as reference when making technical choices.

Characteristic

Hamake

Oozie

Azkaban

Cascading

Workflow Description Language

XML

XML (xPDL based)

text file with key/value pairs

Java API

Dependency Mechanism

data-driven

explicit

explicit

explicit

Do you want a web container

No

Yes

Yes

No

Progress tracking

console/log messages

web page

web page

Java API

Hadoop job Scheduling Support

no

yes

yes

yes

Running mode

command line utility

daemon

daemon

API

Pig support

yes

yes

yes

yes

Event Notification

no

no

no

yes

Installation required

no

yes

yes

no

Supported hadoop versions

0.18+

0.20+

currently unknown

0.18+

Retry Support

no

workflownode evel

yes

yes

Run any command

yes

yes

yes

yes

Amazon EMR support

yes

no

currently unknown

yes

1.5, Azkaban vs. Oozie

For the two most popular schedulers on the market, the following detailed comparison is given for reference of technical selection.Overall, ooize is a heavyweight task scheduling system, full-featured, but more complex to configure and use than azkaban.A lightweight dispatcher, azkaban, is a good candidate if you can ignore the loss of some functionality.

function
Both can schedule mapreduce,pig,java, script workflow tasks
Both can perform workflow tasks on a regular basis

Workflow Definition
* Azkaban uses a Properties file to define workflows
Oozie uses XML files to define workflows

Workflow Passage
* Azkaban supports direct parameters such as ${input}
Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}

Timed Execution
* Azkaban's scheduled tasks are time based
Oozie's scheduled execution tasks are based on time and input data

resource management
* Azkaban has strict permission controls, such as user read/write/execute operations on Workflows
Oozie does not have strict permission control at this time

Workflow Execution
Azkaban has two modes of operation, solo server mode(executor server and web server deployed on the same node) and multi server mode(executor server and web server can be deployed on different nodes)
Oozie runs as a workflow server, supporting multiple users and workflows

Workflow Management
* Azkaban supports browser and ajax-style workflow
Oozie supports command line, HTTP REST, Java API, browser operation workflow


2. Introduction to Azkaban

Azkaban is a batch workflow task scheduler from Linkedin open source.Used to run a set of work and processes in a specific order within a workflow.

Azkaban defines a KV file (properties) format to establish dependencies between tasks and provides an easy-to-use web user interface to maintain and track your workflow.

It has the following features:
Web User Interface
Convenient upload workflow
Easy to set up relationships between tasks
Schedule workflow
Authentication/Authorization (work with privileges)
Ability to kill and restart workflow
Modular and pluggable plug-in mechanisms
Project Workspace
Logging and auditing of workflows and tasks
 

3. Azkaban installation and deployment

3.1. Compilation of azkaban

Here we use the azkaban3.51.0 version to recompile and install it with the installation package we need

Note: We need JDK1.8 for compilation here. If the JDK version used by the compilation server is 1.7, remember to switch to jdk1.8, we use jdk8u141 for compilation here

cd /export/softwares/
wget https://github.com/azkaban/azkaban/archive/3.51.0.tar.gz
tar -zxvf 3.51.0.tar.gz -C ../servers/
cd /export/servers/azkaban-3.51.0/
yum -y install git
yum -y install gcc-c++
./gradlew build installDist -x test

The list of installation files required after compilation is as follows

azkaban-exec-server

The installation package we need is available after compilation in the following directory

azkaban-exec-server storage directory

/export/servers/azkaban-3.51.0/azkaban-exec-server/build/distributions

azkaban-web-server

azkaban-web-server storage directory

/export/servers/azkaban-3.51.0/azkaban-web-server/build/distributions

azkaban-solo-server

azkaban-solo-server storage directory

/export/servers/azkaban-3.51.0/azkaban-solo-server/build/distributions

execute-as-user.c

The C program needed in azkaban two server mode is under this path

/export/servers/azkaban-3.51.0/az-exec-util/src/main/c

Database script file

Database script files under this path

/export/servers/azkaban-3.51.0/azkaban-db/build/install/azkaban-db

3.2. Installation and use of azkaban single service mode

Required Software

azkaban-solo-server

Single Service Mode Installation

Step 1: Unzip

Azkaban's solo server uses a single-node mode to start the service. It only needs an azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz installation package to start. All the data information is saved in H2, the default data of azkaban, upload our compressed package, and then modify the configuration file to startMove it

cd /export/softwares
tar -zxvf azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/

Step 2: Modify both profiles

Modify time zone profile
cd /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT/conf
vim azkaban.properties


default.timezone.id=Asia/Shanghai

modify commonprivate.properties configuration file
cd /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT/plugins/jobtypes
vim commonprivate.properties



execute.as.user=false
memCheck.enabled=false

Step 3: Start solo-server

start-up azkaban-solo-server
cd  /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT
bin/start-solo.sh

Step 4: Browser Page Access

Browser Page Access http://node03:8081/

Single Service Mode Usage

Requirements: Use azkaban to schedule our shell scripts and execute shell commands for linux

Create a plain text file foo.job with the following contents


type=command
command=echo "hello world"

Then pack the file as a compressed file, as follows:

azkaban uploads our compressed package

3.3, azkaban two service mode installation

1. Confirm the required software:
Azkaban Web Service Installation Package
azkaban-web-server-0.1.0-SNAPSHOT.tar.gz
 Azkaban Execution Service Installation Package
azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz

Compiled sql script
create-all-sql-0.1.0-SNAPSHOT.sql

C Program File Script

execute-as-user.c program

2. Database preparation

Get into mysql Clients execute the following commands

mysql  -uroot -p

//Execute the following commands:
CREATE DATABASE azkaban;
CREATE USER 'azkaban'@'%' IDENTIFIED BY 'azkaban';    
GRANT all privileges ON azkaban.* to 'azkaban'@'%' identified by 'azkaban' WITH GRANT OPTION; 
flush privileges;
use azkaban; 
source /export/softwares/create-all-sql-0.1.0-SNAPSHOT.sql;

3. Unzip the installation package

decompression azkaban-web-server
cd /export/softwares
tar -zxvf azkaban-web-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/
cd /export/servers
mv azkaban-web-server-0.1.0-SNAPSHOT/ azkaban-web-server-3.51.0

//Unzip azkaban-exec-server
cd /export/softwares
tar -zxvf azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/
cd /export/servers
mv azkaban-exec-server-0.1.0-SNAPSHOT/ azkaban-exec-server-3.51.0

4. Install SSL Security Authentication

Install ssl security certification to allow access to our azkaban web services using https
 Passwords must be entered letter by letter, or pasted


cd /export/servers/azkaban-web-server-3.51.0
keytool -keystore keystore -alias jetty -genkey -keyalg RSA

5. azkaban web server installation

modify azkaban-web-server Profile
cd /export/servers/azkaban-web-server-3.51.0/conf
vim azkaban.properties



# Azkaban Personalization Settings
azkaban.name=Azkaban
azkaban.label=My Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml
# Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
jetty.use.ssl=true
jetty.maxThreads=25
jetty.port=8081



jetty.ssl.port=8443
jetty.keystore=/export/servers/azkaban-web-server-3.51.0/keystore
jetty.password=azkaban
jetty.keypassword=azkaban
jetty.truststore=/export/servers/azkaban-web-server-3.51.0/keystore
jetty.trustpassword=azkaban




# Azkaban Executor settings
# mail settings
mail.sender=
mail.host=
# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
# when this parameters set then these parameters are used to generate email links.
# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
# azkaban.webserver.external_hostname=myazkabanhost.com
# azkaban.webserver.external_ssl_port=443
# azkaban.webserver.external_port=8081
job.failure.email=
job.success.email=
lockdown.create.projects=false
cache.directory=cache
# JMX stats
jetty.connector.stats=true
executor.connector.stats=true
# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=node03
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
mysql.numconnections=100
#Multiple Executor
azkaban.use.multiple.executors=true
#azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
azkaban.executorselector.comparator.Memory=1
azkaban.executorselector.comparator.LastDispatched=1
azkaban.executorselector.comparator.CpuUsage=1



azkaban.activeexecutor.refresh.milisecinterval=10000
azkaban.queueprocessing.enabled=true
azkaban.activeexecutor.refresh.flowinterval=10
azkaban.executorinfo.refresh.maxThreads=10

6. azkaban executor server installation

Step 1: Modify the azkaban-exex-server configuration file

modify azkaban-exec-server Profile
cd /export/servers/azkaban-exec-server-3.51.0/conf
vim azkaban.properties

# Azkaban Personalization Settings
azkaban.name=Azkaban
azkaban.label=My Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml
# Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
jetty.use.ssl=true
jetty.maxThreads=25
jetty.port=8081


jetty.keystore=/export/servers/azkaban-web-server-3.51.0/keystore
jetty.password=azkaban
jetty.keypassword=azkaban
jetty.truststore=/export/servers/azkaban-web-server-3.51.0/keystore
jetty.trustpassword=azkaban


# Where the Azkaban web server is located
azkaban.webserver.url=https://node03:8443
# mail settings
mail.sender=
mail.host=
# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
# when this parameters set then these parameters are used to generate email links.
# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
# azkaban.webserver.external_hostname=myazkabanhost.com
# azkaban.webserver.external_ssl_port=443
# azkaban.webserver.external_port=8081
job.failure.email=
job.success.email=
lockdown.create.projects=false
cache.directory=cache
# JMX stats
jetty.connector.stats=true
executor.connector.stats=true
# Azkaban plugin settings
azkaban.jobtype.plugin.dir=plugins/jobtypes
# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=node03
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
mysql.numconnections=100
# Azkaban Executor settings
executor.maxThreads=50
executor.flow.threads=30

Step 2: Add Plugins

execute-as-user.c

Upload to this directory / export/servers/azkaban-exec-server-3.51.0/plugins/jobtypes

Or just copy us / export/softwares

cp /export/softwares/execute-as-user.c /export/servers/azkaban-exec-server-3.51.0/plugins/jobtypes/


//Then execute-as-user is generated by executing the following command
yum -y install gcc-c++
cd /export/servers/azkaban-exec-server-3.51.0/plugins/jobtypes
gcc execute-as-user.c -o execute-as-user 
chown root execute-as-user
chmod 6050 execute-as-user

Step 3: Modify the configuration file

Modify Profile
cd  /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes
vim commonprivate.properties
execute.as.user=false
memCheck.enabled=false
azkaban.native.lib=/export/servers/azkaban-exec-server-3.51.0/plugins/jobtypes

The final generation is as follows

7. Start Services

Step 1: Start azkaban exec server

cd /export/servers/azkaban-exec-server-3.51.0
bin/start-exec.sh

Step 2: Activate our exec-server

node03 Execute the following commands in any directory of the machine
curl -G "node03:$(<./executor.port)/executor?action=activate" && echo

Step 3: Start azkaban-web-server

cd /export/servers/azkaban-web-server-3.51.0/
bin/start-web.sh

Access address: https://node03:8443

Modifying time zone issues for linux

Because the clock synchronization was done before, you don't have to worry about the time zone, you don't need to modify it.

Note: Configure the time zone on the server node first

  1. Generate the time zone profile Asia/Shanghai using the interactive command tzselect

  2. Copy the time zone file, overwriting the system's local time zone configuration

cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime 

4. Azkaban Actual Warfare

Azkaba's built-in task types support command, java

Command Type Single job Example

Create job Description File

Create a text file and change the name to mycommand.job

Note that the suffix.txt must not be taken with it and saved as UFT-8 without bom

The contents are as follows

type=command
command=echo 'hello world'

Packaging a job resource file into a zip file

Create a project and upload a compressed package

Create a project and upload a job package through azkaban's web management platform

Create project first

Upload zip package

Start execution job

Command Type Multiple job Workflow

1. Create multiple job descriptions with dependencies

First job: foo.job

type=command
command=echo 'foo'

Second job: bar.job depends on foo.job

type=command
dependencies=foo
command=echo 'bar'

2. Put all job resource files into a zip package

3. Create a project in azkaban's web management interface and upload zip packages
4. Start workflow

HDFS Operational Tasks

  1. Create a job description file fs.job

type=command
command=/export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop fs -mkdir /azkaban

2. Package job resource files into zip files

3. Create project and upload job package through azkaban's web management platform

4. Start executing the job

MAPREDUCE Task

Mr s tasks can still be performed using the job type of command

  1. Create a job description file and an mr program jar package (example jar included with hadoop is used directly in the example)

type=command
command=/export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.14.0.jar pi 3 5

2. Put all job resource files into a zip package

3. Create a project in azkaban's web management interface and upload zip packages

4. Start job

HIVE script task

Create job description file and hive script
Hive script: hive.sql
 

create database if not exists azhive;
use azhive;
create table if not exists aztest(id string,name string) row format delimited fields terminated by '\t';

Job Description File: hive.job

type=command
command=/export/servers/hive-1.1.0-cdh5.14.0/bin/hive -f 'hive.sql'

Put all job resource files into a zip package

Create a project in azkaban's web management interface and upload a zip package

Start job

Timed Tasks for azkaban

Using azkaban's scheduler function to schedule our job tasks on a regular basis

*/1 *?* * Run a scheduled task every minute
 0 1? * * Perform this task at 1 a.m. every night
 0 */2? * * Perform this task regularly every two hours
 30 21? * * Perform this task regularly at 9:30 p.m. every night

 

Topics: Jetty MySQL snapshot hive