Unlock Cloud Native AI Skills - Develop Your Machine Learning Workflow

Posted by kakki on Tue, 30 Jul 2019 12:41:36 +0200

In the previous article, Unlocking Cloud Native AI Skills | Building Machine Learning Systems on Kubernetes, we set up a Kubeflow Pipelines installation. In this article we will try it out together and learn how to develop a machine learning workflow on Kubeflow Pipelines using a real example.

Preparation

A machine learning workflow is not only task-driven but also data-driven: it involves data import and preparation, exporting and evaluating model-training checkpoints, and exporting the final model. This requires distributed storage as the medium for passing data between steps; here we use NAS as the distributed storage.

  • Create distributed storage, for example NAS. Here NFS_SERVER_IP needs to be replaced with the address of your real NAS server.
  1. Create the Aliyun NAS service; you can refer to the documentation.
  2. Create the /data directory on the NFS server:
# mkdir -p /nfs
# mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs    # temporarily mount the NAS root
# mkdir -p /nfs/data                               # create /data on the NFS server
# cd /
# umount /nfs

 

  3. Create the corresponding Persistent Volume and Persistent Volume Claim:
# cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: user-susan
  labels:
    user-susan: pipelines
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: NFS_SERVER_IP
    path: "/data"
    
# kubectl create -f nfs-pv.yaml
# Create the corresponding Persistent Volume Claim
# cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user-susan
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
       storage: 5Gi
  selector:
    matchLabels:
      user-susan: pipelines
# kubectl create -f nfs-pvc.yaml

 

Developing the Pipeline

Because the examples provided by Kubeflow Pipelines depend on Google's storage services, users in China cannot really experience Pipelines' capabilities. To address this, the Aliyun Container Service team provides an example of training an MNIST model on NAS storage, so that you can use and learn Kubeflow Pipelines on Aliyun. The example consists of three steps:

  • (1) Download the data
  • (2) Train the model using TensorFlow
  • (3) Export the model

Each of these steps depends on the one before it.

In Kubeflow Pipelines, such a process can be described with Python code; the complete code can be viewed in standalone_pipeline.py.

In our example we use arena_op, an API based on the open-source project Arena that wraps Kubeflow's default container_op. It seamlessly supports the MPI and PS (Parameter Server) modes of distributed training, provides simple access to heterogeneous devices such as GPU and RDMA as well as to distributed storage, and makes it easy to synchronize code from a git source. It is a practical API.

@dsl.pipeline(
  name='pipeline to run jobs',
  description='shows how to run pipeline jobs.'
)
def sample_pipeline(learning_rate='0.01',
    dropout='0.9',
    model_version='1',
    commit='f097575656f927d86d99dd64931042e1a9003cb2'):
  """A pipeline for end to end machine learning workflow."""
  data=["user-susan:/training"]
  gpus=1
# 1. prepare data
  prepare_data = arena.standalone_job_op(
    name="prepare-data",
    image="byrnedo/alpine-curl",
    data=data,
    command="mkdir -p /training/dataset/mnist && \
  cd /training/dataset/mnist && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz")
  # 2. download the source code and train the model
  train = arena.standalone_job_op(
    name="train",
    image="tensorflow/tensorflow:1.11.0-gpu-py3",
    sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
    env=["GIT_SYNC_REV=%s" % (commit)],
    gpus=gpus,
    data=data,
    command='''
    echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/main.py \
    --max_steps 500 --data_dir /training/dataset/mnist \
    --log_dir /training/output/mnist  --learning_rate %s \
    --dropout %s''' % (prepare_data.output, learning_rate, dropout),
    metrics=["Train-accuracy:PERCENTAGE"])
  # 3. export the model
  export_model = arena.standalone_job_op(
    name="export-model",
    image="tensorflow/tensorflow:1.11.0-py3",
    sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
    env=["GIT_SYNC_REV=%s" % (commit)],
    data=data,
    command="echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/export_model.py --model_version=%s --checkpoint_path=/training/output/mnist /training/output/models" % (train.output, model_version))

 

Kubeflow Pipelines converts the above code into a directed acyclic graph (DAG), where each node is a Component and the edges between Components represent their dependencies. The DAG can be seen in the Pipelines UI:

First, let's look at the data preparation step in detail. Here we use the Python API arena.standalone_job_op, for which we need to specify the name of this step, the container image to be used, and data: the data to be used together with the directory where it is mounted inside the container.

Here, data is an array, for example data=["user-susan:/training"], which means that multiple volumes can be mounted. user-susan is the Persistent Volume Claim created earlier, and /training is the mount directory inside the container.

prepare_data = arena.standalone_job_op(
    name="prepare-data",
    image="byrnedo/alpine-curl",
    data=data,
    command="mkdir -p /training/dataset/mnist && \
  cd /training/dataset/mnist && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz")

 

This step simply uses curl to download the data from the specified addresses into /training/dataset/mnist on the distributed storage. Note that /training here is the root directory of the distributed storage, comparable to a familiar root mount point, and /training/dataset/mnist is a subdirectory. The following steps can read the data through the same root mount point.

The second step uses the data that was downloaded to the distributed storage, checks out the code at a fixed commit id via git, and trains the model.

train = arena.standalone_job_op(
    name="train",
    image="tensorflow/tensorflow:1.11.0-gpu-py3",
    sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
    env=["GIT_SYNC_REV=%s" % (commit)],
    gpus=gpus,
    data=data,
    command='''
    echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/main.py \
    --max_steps 500 --data_dir /training/dataset/mnist \
    --log_dir /training/output/mnist  --learning_rate %s \
    --dropout %s''' % (prepare_data.output, learning_rate, dropout),
    metrics=["Train-accuracy:PERCENTAGE"])

 

As you can see, this step is a bit more complicated than data preparation. In addition to the name, image, data and command required in the first step, the model training step also needs to specify:

  • How to get the code: from the point of view of reproducible experiments, being able to trace the exact code that was run is very important. The git source of the code can be specified with sync_source when the API is invoked, and the commit id of the training code can be pinned by setting GIT_SYNC_REV in env.
  • GPU: 0 by default, meaning no GPU is used; an integer value greater than 0 means the step requires that number of GPUs.
  • Metrics: also for the sake of reproducible and comparable experiments, users can export a series of metrics they need and display and compare them intuitively in the Pipelines UI. Usage takes two steps: 1. when calling the API, specify the metric names and their display format, PERCENTAGE or RAW, as an array, for example metrics=["Train-accuracy:PERCENTAGE"]; 2. since Pipelines collects metrics from stdout logs by default, the actual training code must print {metrics name}={value} or {metrics name}:{value}; see the sample code, and the sketch below.
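
As a minimal sketch of that second point (the accuracy variable and value here are hypothetical, not taken from the sample project), the training code only has to print the metric to stdout in one of the two accepted formats:

# Hypothetical snippet inside the training code: emit a metric that Pipelines
# collects from stdout. The name must match the one declared in
# metrics=["Train-accuracy:PERCENTAGE"].
accuracy = 0.9231  # assumed to be computed by the training loop
print("Train-accuracy=%s" % accuracy)  # "Train-accuracy:%s" % accuracy would also be accepted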

It is noteworthy that:

In this step, the same data parameter as in prepare_data, ["user-susan:/training"], is specified, so the training code can read the corresponding data, for example via --data_dir /training/dataset/mnist.

In addition, since this step depends on prepare_data, the dependency between the two steps is expressed by referencing prepare_data.output in the method call.

Finally, export_model generates the model for serving from the checkpoint produced by the train step:

export_model = arena.standalone_job_op(
    name="export-model",
    image="tensorflow/tensorflow:1.11.0-py3",
    sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
    env=["GIT_SYNC_REV=%s" % (commit)],
    data=data,
    command="echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/export_model.py --model_version=%s --checkpoint_path=/training/output/mnist /training/output/models" % (train.output, model_version))

 

export_model is similar to, and even simpler than, the second step, train: it just synchronizes the model-export code from git and exports the model using the checkpoint in the shared directory /training/output/mnist.

The whole workflow looks quite intuitive. Now we can define a Python method that ties the whole process together:

@dsl.pipeline(
  name='pipeline to run jobs',
  description='shows how to run pipeline jobs.'
)
def sample_pipeline(learning_rate='0.01',
    dropout='0.9',
    model_version='1',
    commit='f097575656f927d86d99dd64931042e1a9003cb2'):

 

@dsl.pipeline is the decorator that marks a workflow; it requires two attributes, name and description.

The entry method sample_pipeline defines four parameters: learning_rate, dropout, model_version and commit, which are used in the train and export_model steps above. The parameter values here are actually of type dsl.PipelineParam. They are defined as dsl.PipelineParam so that the native Kubeflow Pipelines UI can turn them into an input form, where the form key is the parameter name and the default value is the parameter's value. Note that the values of dsl.PipelineParam can only be strings or numbers; arrays, maps, and custom types cannot be converted.
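
As a minimal illustration (the pipeline name and parameters in this sketch are made up; only the kfp dsl already used above is assumed):

import kfp.dsl as dsl

@dsl.pipeline(
  name='param demo',
  description='shows how parameter defaults become UI form fields.'
)
def param_demo(learning_rate='0.01',  # string default -> editable form field in the UI
               max_steps='500'):      # another string default; numbers also work
  # Inside the body these arguments are dsl.PipelineParam placeholders and can
  # only be passed through to component commands; arrays, maps or custom
  # objects cannot be declared as pipeline parameters this way.
  pass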

In fact, these parameters can be overridden when the user submits the workflow. Below is the corresponding UI for submitting the workflow:

Submitting the Pipeline

You can submit the Python DSL of the workflow developed above to the Kubeflow Pipelines service running in your own Kubernetes cluster. The actual submission code is quite simple:

import kfp
import kfp.compiler as compiler

KFP_SERVICE = "ml-pipeline.kubeflow.svc.cluster.local:8888"
# EXPERIMENT_NAME, RUN_ID and the parameter values are defined elsewhere in the
# full standalone_pipeline.py script (the parameters come from its CLI arguments).
compiler.Compiler().compile(sample_pipeline, __file__ + '.tar.gz')
client = kfp.Client(host=KFP_SERVICE)
try:
    experiment_id = client.get_experiment(experiment_name=EXPERIMENT_NAME).id
except:
    experiment_id = client.create_experiment(EXPERIMENT_NAME).id
run = client.run_pipeline(experiment_id, RUN_ID, __file__ + '.tar.gz',
                          params={'learning_rate': learning_rate,
                                  'dropout': dropout,
                                  'model_version': model_version,
                                  'commit': commit})

 

Using compiler.compile, the Python code is compiled into a DAG configuration file that the execution engine (Argo) can recognize.

Then, through the Kubeflow Pipelines client, an experiment is created (or an existing one is found) and the previously compiled DAG configuration file is submitted.

Prepare a Python 3 environment in the cluster and install the Kubeflow Pipelines SDK:

# kubectl create job pipeline-client --namespace kubeflow --image python:3 -- sleep infinity
# kubectl  exec -it -n kubeflow $(kubectl get po -l job-name=pipeline-client -n kubeflow | grep -v NAME| awk '{print $1}') bash

 

After entering the Python 3 environment, execute the following commands to submit two tasks with different parameters in succession:

# pip3 install http://kubeflow.oss-cn-beijing.aliyuncs.com/kfp/0.1.14/kfp.tar.gz --upgrade
# pip3 install http://kubeflow.oss-cn-beijing.aliyuncs.com/kfp-arena/kfp-arena-0.4.tar.gz --upgrade
# curl -O https://raw.githubusercontent.com/cheyang/pipelines/update_standalone_sample/samples/arena-samples/standalonejob/standalone_pipeline.py
# python3 standalone_pipeline.py --learning_rate 0.0001 --dropout 0.8 --model_version 2
# python3 standalone_pipeline.py --learning_rate 0.0005 --dropout 0.8 --model_version 3

 

Viewing the run results

Log in to the Kubeflow Pipelines UI at https://{pipeline address}/pipeline/#/experiments, for example:

https://11.124.285.171/pipeline/#/experiments

By clicking the Compare runs button, you can compare the inputs, durations and accuracy of the two experiments. Making experiments traceable is the first step towards making them reproducible, and using Kubeflow Pipelines' own experiment management capability is the way to start down that path.

Summary

The steps required to implement a runnable Kubeflow Pipeline are:

  1. Build the smallest execution units, Components, needed by the Pipeline. If you use the natively defined dsl.ContainerOp, two parts of code are required:
  • Build the runtime code: usually a container image is built for each step, serving as the adapter between Pipelines and the code that actually executes the business logic. What it does is take the input parameters from the Pipelines context, call the business-logic code, and write the output that needs to be passed to the next step to the specified location in the container according to the rules of Pipelines; the underlying workflow component then transfers it. The result is that the runtime code is coupled with the business-logic code. For reference, see an example from Kubeflow Pipelines, and the sketch after the client code below.
  • Build the client code: this step usually looks like the following, and readers familiar with Kubernetes will notice that it is essentially writing a Pod spec:
from kfp import dsl
from kubernetes import client as k8s_client

container_op = dsl.ContainerOp(
    name=name,
    image='<train-image>',
    arguments=[
        '--input_dir', input_dir,
        '--output_dir', output_dir,
        '--model_name', model_name,
        '--model_version', model_version,
        '--epochs', epochs
    ],
    # the file whose content is passed downstream as this step's output
    file_outputs={'output': '/output.txt'}
)
# mount the persistent volume so the step can read and write shared data
container_op.add_volume(k8s_client.V1Volume(
    host_path=k8s_client.V1HostPathVolumeSource(
        path=persistent_volume_path),
    name=persistent_volume_name))
container_op.add_volume_mount(k8s_client.V1VolumeMount(
    mount_path=persistent_volume_path,
    name=persistent_volume_name))
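
For the runtime-code part mentioned above, here is a minimal sketch of what such an adapter baked into <train-image> could look like. The flag names and the /output.txt path mirror the client code above; the train() function is only a stand-in for the business-logic code:

import argparse

def train(input_dir, output_dir, model_name, model_version, epochs):
    # placeholder for the real business-logic code
    return "%s/%s-%s" % (output_dir, model_name, model_version)

if __name__ == '__main__':
    # 1. receive the input parameters handed over by the Pipelines context
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_dir')
    parser.add_argument('--output_dir')
    parser.add_argument('--model_name')
    parser.add_argument('--model_version')
    parser.add_argument('--epochs', type=int)
    args = parser.parse_args()

    # 2. call the business-logic code
    exported_path = train(args.input_dir, args.output_dir,
                          args.model_name, args.model_version, args.epochs)

    # 3. write the value to pass downstream to the location declared in
    #    file_outputs={'output': '/output.txt'}
    with open('/output.txt', 'w') as f:
        f.write(exported_path)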

 

The advantage of the natively defined dsl.ContainerOp is flexibility: because of the open interface between dsl.ContainerOp and Pipelines, users can do many things at the container_op level. But it has the following problems:

  • Low reusability. Every Component needs its own container image to be built and its own runtime code to be developed.
  • High complexity. Users need to understand Kubernetes concepts such as resource limits, PVCs, node selectors, and so on.
  • Distributed training is difficult to support. Because container_op operates on a single container, supporting distributed training requires submitting and managing TFJob-like tasks from within the container_op. This raises two challenges: complexity and security. Complexity is easy to understand; security means that the permission to submit TFJob-like tasks requires extra privileges to be granted to the Pipeline developer.

Another way is to use arena_op, a reusable Component API. It uses generic runtime code, which avoids repeatedly building runtime code for every Component; it simplifies usage through a single generic arena_op API; and it also supports scenarios such as Parameter Server and MPI. We recommend building Pipelines this way, as sketched below.
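
As a rough sketch of that reuse (only the arena.standalone_job_op API already shown above is assumed; the evaluate step and the evaluate.py script on the shared volume are hypothetical, not part of the sample project), adding another step needs no new image build or runtime code:

evaluate = arena.standalone_job_op(
    name="evaluate",
    image="tensorflow/tensorflow:1.11.0-py3",
    data=data,  # reuse the same shared volume ["user-susan:/training"]
    command="echo %s; python /training/code/evaluate.py "
            "--checkpoint_path=/training/output/mnist" % export_model.output)

Because this step references export_model.output, it only runs after the export-model step has finished, in the same way that train depends on prepare_data above.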

  2. Assemble the built Components into a Pipeline.
  3. Compile the Pipeline into the DAG configuration file recognized by the Argo execution engine, submit it to Kubeflow Pipelines, and view the run results in Kubeflow Pipelines' own UI.
