Automating Deep Learning Training with a Kubernetes GPU Cluster

Posted by bobbfwed on Thu, 22 Aug 2019 04:22:26 +0200

Reference Blog: http://www.infoq.com/cn/articles/kubernetes-gpu-cluster-to-automate-deep-learning-trainin
Update 2018.2.4: if you switch to a domestic package mirror, you do not need a proxy to get past the firewall. For reference, see https://github.com/EagleChen/kubernetes_init
Automating deep learning training with a Kubernetes GPU cluster can greatly streamline training models in the cloud. The setup covers a master node, worker nodes, and a client, described in the sections below.

#Disable Firewall
Check the firewall status:

sudo ufw status

Disable ufw:

sudo ufw disable
sudo apt-get remove iptables

#Bypassing the firewall: Shadowsocks-Qt5 + proxychains for proxied terminal access
##Install Shadowsocks-Qt5, then open it from the applications menu

sudo add-apt-repository ppa:hzwhuang/ss-qt5
sudo apt-get update
sudo apt-get install shadowsocks-qt5

Find a free ss account at https://doub.bid/sszhfx/
Open Shadowsocks-Qt5, choose Connection > Add > URI, and paste in the ss link found above.
##Install proxychains

sudo apt-get install proxychains
sudo vi /etc/proxychains.conf

Change the last line from socks4 127.0.0.1 9050 to socks5 127.0.0.1 1080, then test the proxy:

sudo proxychains curl www.google.com
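
The manual edit to proxychains.conf can also be scripted. The following is a minimal Python sketch, assuming the stock file layout in which the last non-comment line is the proxy entry. The function name `switch_proxy` is made up for this example, 127.0.0.1:1080 is assumed to be the local Shadowsocks listener, and the function only transforms text (writing the result back to /etc/proxychains.conf needs root).

```python
def switch_proxy(conf_text, proxy_line="socks5 127.0.0.1 1080"):
    """Replace the last active proxy entry in proxychains.conf text.

    Assumption: the last non-blank, non-comment line is the proxy entry,
    as in the stock file where it is the socks4 default.
    """
    lines = conf_text.splitlines()
    # Walk backwards to find the last line that is neither blank nor a comment.
    for i in range(len(lines) - 1, -1, -1):
        stripped = lines[i].strip()
        if stripped and not stripped.startswith("#"):
            lines[i] = proxy_line
            break
    return "\n".join(lines) + "\n"


# Hypothetical excerpt of a stock configuration file.
sample = "strict_chain\n[ProxyList]\n# defaults set to tor\nsocks4 127.0.0.1 9050\n"
print(switch_proxy(sample))
```

Only the final entry is touched, so comments and chain settings earlier in the file are preserved.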

Install and configure ssr:

wget http://www.djangoz.com/ssr
sudo mv ssr /usr/local/bin
sudo chmod 755 /usr/local/bin/ssr
ssr install
ssr config
ssr start



You can look up the configuration in the Windows client and copy it:

    "server": "",
    "server_ipv6": "::",
    "server_port": 1314,
    "local_address": "",
    "local_port": 1080,

    "password": "JPeFVnmcnT",
    "method": "aes-256-cfb",
    "protocol": "origin",
    "protocol_param": "",
    "obfs": "tls1.2_ticket_auth",
    "obfs_param": "",
    "speed_limit_per_con": 0,
    "speed_limit_per_user": 0,

    "additional_ports" : {}, // only works under multi-user mode
    "additional_ports_only" : false, // only works under multi-user mode
    "timeout": 120,
    "udp_timeout": 60,
    "dns_ipv6": false,
    "connect_verbose_info": 0,
    "redirect": "",
    "fast_open": false

There is also another client, which proxies globally:

https://softs.fun/?dir=Internet Science/PC/Shadowsocks
#Configuration for Master Nodes
##Install dependencies

sudo apt-get update 
sudo apt-get install -y apt-transport-https

##Add the Kubernetes repository to the package manager

sudo proxychains bash -c 'curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -'
sudo bash -c 'cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF'
sudo proxychains apt-get update

##Install docker-engine, kubeadm, kubectl, and kubernetes-cni

sudo proxychains apt-get install -y docker-engine
sudo proxychains apt-get install -y docker.io
sudo proxychains apt-get install -y kubelet kubeadm kubectl kubernetes-cni
sudo groupadd docker
sudo usermod -aG docker $USER

sudo systemctl enable docker && sudo systemctl start docker
sudo systemctl enable kubelet && sudo systemctl start kubelet


Since we want to create a cluster that uses GPUs, we need to enable GPU acceleration on the master node. Before the cluster is initialized, add GPU support to the kubeadm configuration. This step must be performed on every node in the cluster, even on nodes that do not have a GPU.
Put the following commands in a shell script, init-master.sh, and run it.

for file in /etc/systemd/system/kubelet.service.d/*-kubeadm.conf
do
    echo "Found ${file}"
    FILE_NAME=$file
done

echo "Chosen ${FILE_NAME} as kubeadm.conf"
sudo sed -i '/^ExecStart=\/usr\/bin\/kubelet/ s/$/ --feature-gates="Accelerators=true"/' ${FILE_NAME}
#sudo sed -i "s,ExecStart=$,Environment=\"KUBELET_CGROUPS_ARGS=--runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice\"\nExecStart=,g" ${FILE_NAME}
#sudo sed -i "s,ExecStart=$,Environment=\"KUBELET_EXTRA_ARGS=--pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause-amd64:3.1\"\nExecStart=,g" ${FILE_NAME}
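
To make clear what the sed expression does to the kubelet drop-in file, here is a minimal Python sketch of the same text transformation. The function name `add_gpu_feature_gate` and the sample drop-in content are assumptions for illustration. Unlike the sed command, this version skips lines that already carry the flag.

```python
import re

FLAG = '--feature-gates="Accelerators=true"'


def add_gpu_feature_gate(dropin_text):
    """Append the Accelerators feature gate to the kubelet ExecStart line.

    Mirrors the sed expression: only lines starting with
    'ExecStart=/usr/bin/kubelet' are touched. Lines that already
    contain the flag are left alone, so running it twice is harmless.
    """
    out = []
    for line in dropin_text.splitlines():
        if re.match(r"^ExecStart=/usr/bin/kubelet", line) and FLAG not in line:
            line = line + " " + FLAG
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical, shortened drop-in file content.
sample = (
    "[Service]\n"
    "ExecStart=\n"
    "ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS\n"
)
print(add_gpu_feature_gate(sample))
```

The empty `ExecStart=` line, which systemd uses to reset the command, is not matched and stays unchanged.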

Edit the kubelet drop-in file:

sudo vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

Restart kubelet

sudo swapoff -a
sudo sysctl net.bridge.bridge-nf-call-iptables=1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

Check the installed versions:

kubeadm version
kubectl version
kubelet --version

To switch Docker to a domestic registry mirror, see http://blog.csdn.net/Mr_OOO/article/details/67016309
Offline Installation Reference:
Link: https://pan.baidu.com/s/1sniH07N Password: 2bkt

Initialize the master node. You need the IP address of the master node. This step also prints the credentials needed to add worker nodes, so record your token. The join command looks like kubeadm join --token d979a7.33be06ce36e5c892

sudo proxychains kubeadm init --apiserver-advertise-address=<master-ip> --kubernetes-version=v1.9.2
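
If the output of kubeadm init is saved to a file, the token can be pulled out of it rather than copied by hand. This is a minimal sketch that relies on the bootstrap token format (6 characters, a dot, 16 characters, lowercase letters and digits). The function name `find_join_token` and the sample output text are assumptions for illustration.

```python
import re

# Bootstrap tokens have the form <6 chars>.<16 chars>, lowercase letters and digits.
TOKEN_RE = re.compile(r"\b[a-z0-9]{6}\.[a-z0-9]{16}\b")


def find_join_token(init_output):
    """Return the first bootstrap token on a 'kubeadm join' line, or None."""
    for line in init_output.splitlines():
        if "kubeadm join" in line:
            match = TOKEN_RE.search(line)
            if match:
                return match.group(0)
    return None


# Hypothetical excerpt of saved kubeadm init output.
sample = (
    "You can now join any number of machines by running:\n"
    "  kubeadm join --token d979a7.33be06ce36e5c892\n"
)
print(find_join_token(sample))
```

Restricting the search to lines containing `kubeadm join` avoids picking up unrelated strings that happen to match the pattern.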

Disable swap:

sudo swapoff -a  

Since Kubernetes 1.6 switched from ABAC to RBAC-style access control, we need to make the admin credentials available to the user.

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

The following is required every time you log in to the machine. Run it again after every reconnect:

export KUBECONFIG=$HOME/.kube/config

Install a network plug-in so that nodes can communicate with each other. This example uses Weave Net from Weaveworks:

sudo proxychains kubectl apply -f https://git.io/weave-kube-1.6
sudo proxychains kubectl create -f https://git.io/kube-dashboard

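For v1.9.2, use the flannel network plug-in instead. Pass the pod network range at init time (the stock kube-flannel manifest expects 10.244.0.0/16):

sudo kubeadm init --apiserver-advertise-address=<master-ip> --kubernetes-version=v1.9.2 --pod-network-cidr=<pod-cidr>
kubectl create -f kube-flannel.yaml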

##Error Handling
If either of these two commands fails:
sudo systemctl enable docker && sudo systemctl start docker
sudo systemctl enable kubelet && sudo systemctl start kubelet
with an error such as Error starting daemon: Error initializing network controller: Error creating default "bridge" network: *** networks have the same bridge name, run the following:

su root
rm -r /var/lib/docker/network/files/*

Check that all pods are online and make sure everything is running.

kubectl get pods --all-namespaces
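
On a cluster with many pods, scanning the table by eye is error-prone. The following sketch parses the output of `kubectl get pods --all-namespaces -o json` and reports pods whose phase is neither Running nor Succeeded. The function name `pods_not_running` and the sample data are made up for this example.

```python
import json


def pods_not_running(pods_json):
    """List 'namespace/name (phase)' for pods that are not Running or Succeeded."""
    items = json.loads(pods_json).get("items", [])
    problems = []
    for pod in items:
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            meta = pod.get("metadata", {})
            problems.append(
                "%s/%s (%s)" % (meta.get("namespace"), meta.get("name"), phase)
            )
    return problems


# Hypothetical sample standing in for real kubectl output.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "kube-system", "name": "kube-dns-x"},
     "status": {"phase": "Pending"}},
    {"metadata": {"namespace": "kube-system", "name": "etcd-master"},
     "status": {"phase": "Running"}},
]})
print(pods_not_running(sample))
```

In practice the JSON would come from a file or a pipe, for example by saving the kubectl output to pods.json and reading that file.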

If you want to remove the master node, you need to reset it

sudo kubeadm reset

#Configuration for worker nodes
##Install dependencies

sudo apt-get update 
sudo apt-get install -y apt-transport-https

##Add the Kubernetes repository to the package manager

sudo proxychains bash -c 'curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -'
sudo bash -c 'cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF'
sudo proxychains apt-get update

##Install docker-engine, kubeadm, kubectl, and kubernetes-cni

sudo proxychains apt-get install -y docker-engine
sudo proxychains apt-get install -y kubelet kubeadm kubectl kubernetes-cni
sudo groupadd docker
sudo usermod -aG docker $USER

sudo systemctl enable docker && sudo systemctl start docker
sudo systemctl enable kubelet && sudo systemctl start kubelet


Then put the following commands in a shell script, init-work.sh, and run it:

for file in /etc/systemd/system/kubelet.service.d/*-kubeadm.conf
do
    echo "Found ${file}"
    FILE_NAME=$file
done

echo "Chosen ${FILE_NAME} as kubeadm.conf"
sudo sed -i '/^ExecStart=\/usr\/bin\/kubelet/ s/$/ --feature-gates="Accelerators=true"/' ${FILE_NAME}

Restart kubelet:

sudo systemctl daemon-reload
sudo systemctl restart kubelet

##Join the worker to the cluster using the token recorded earlier

sudo kubeadm join --token d979a7.33be06ce36e5c892 <master-ip>:6443

##Check the nodes on the master to see if everything works.

kubectl get nodes
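
The STATUS column of this output shows whether each node has joined successfully. As a sketch, the table can be checked with a few lines of Python. The function name `nodes_not_ready` and the node names in the sample are hypothetical, and the column layout is assumed to include NAME and STATUS headers.

```python
def nodes_not_ready(table_text):
    """Return names of nodes whose STATUS column does not include 'Ready'."""
    rows = [line.split() for line in table_text.strip().splitlines()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    bad = []
    for row in body:
        # STATUS can be a comma-separated list such as Ready,SchedulingDisabled
        if "Ready" not in row[status_col].split(","):
            bad.append(row[name_col])
    return bad


# Hypothetical output; node names are placeholders.
sample = """\
NAME      STATUS     ROLES     AGE       VERSION
master    Ready      master    1d        v1.9.2
worker1   NotReady   <none>    5m        v1.9.2
"""
print(nodes_not_ready(sample))
```

Splitting the status on commas keeps a cordoned but healthy node (Ready,SchedulingDisabled) from being reported as a failure.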

If the newly added node's status is NotReady, check that the firewall is off on both the worker and master nodes.
Also check the following line in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (edit the file with sudo vi):

Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"

Restart kubelet on the node:

sudo systemctl daemon-reload
sudo systemctl restart kubelet

##Remove a worker node
To take a worker node out of service, remove it from the cluster and then reset it. It is good practice to remove the node from the cluster before resetting it.
On the master node:

kubectl delete node <worker node name>

On the worker node:

sudo kubeadm reset

To control the cluster from a client machine rather than from the master, the client must authenticate as the correct user.
##Install kubectl on the client. Skip this step if it is already installed
Download the latest version of the kubectl binary into the current directory.

Make the binary executable and move it onto the PATH:

chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl 

##Copy the master's admin credentials to the client

mkdir -p ~/.kube
scp chase@<master-ip>:~/.kube/config ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config

##Point kubectl at the copied configuration and credentials

export KUBECONFIG=~/.kube/config


sudo kubectl get pods --all-namespaces

#Install Kubernetes dashboard
Check whether the dashboard is already installed. If it is, there is nothing more to install:

kubectl get pods --all-namespaces | grep dashboard

Start the dashboard proxy on the client, then open the dashboard in a browser (kubectl proxy listens on 127.0.0.1:8001 by default):

sudo kubectl proxy

#How to build your GPU container
##Install nvidia-docker

wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i nvidia-docker*.deb && rm nvidia-docker*.deb

# Test with nvidia-smi (the nvidia/cuda image must be downloaded first)
nvidia-docker run --rm nvidia/cuda nvidia-smi

Create a GPU pod. Edit gputestpod.yaml with vi:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/367.48
  containers:
  - name: tensorflow
    image: daocloud.io/daocloud/tensorflow:0.11.0-gpu
    ports:
    - containerPort: 8000
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia/
      readOnly: true

sudo kubectl create -f gputestpod.yaml
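
Hand-edited manifests are sensitive to indentation, so one option is to generate the pod definition programmatically. The following is a sketch, assuming the standard v1 Pod layout (metadata, spec.volumes, spec.containers). The function name `gpu_pod_manifest` and its parameters are hypothetical. kubectl accepts JSON as well as YAML, so the printed output can be saved to a file and passed to kubectl create -f.

```python
import json


def gpu_pod_manifest(name, image, driver_path, port, gpus=1):
    """Build a v1 Pod that mounts the host NVIDIA driver and requests GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "volumes": [
                {"name": "nvidia-driver", "hostPath": {"path": driver_path}}
            ],
            "containers": [{
                "name": "tensorflow",
                "image": image,
                "ports": [{"containerPort": port}],
                # Alpha GPU resource name used with the Accelerators feature gate
                "resources": {"limits": {"alpha.kubernetes.io/nvidia-gpu": gpus}},
                "volumeMounts": [{
                    "name": "nvidia-driver",  # must match the volume name above
                    "mountPath": "/usr/local/nvidia/",
                    "readOnly": True,
                }],
            }],
        },
    }


manifest = gpu_pod_manifest(
    "gpu-test",
    "daocloud.io/daocloud/tensorflow:0.11.0-gpu",
    "/var/lib/nvidia-docker/volumes/nvidia_driver/367.48",
    8000,
)
print(json.dumps(manifest, indent=2))
```

The driver path contains the driver version (367.48 here), so it has to match the driver installed on the node that runs the pod.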

##A slightly more complex example

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tf-jupyter
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tf-jupyter
    spec:
      volumes:
      - hostPath:
          path: /var/lib/nvidia-docker/volumes/nvidia_driver/367.48
        name: nvidia-driver
      containers:
      - name: tensorflow
        image: daocloud.io/daocloud/tensorflow:0.11.0-gpu
        ports:
        - containerPort: 8888
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia/
          name: nvidia-driver
---
apiVersion: v1
kind: Service
metadata:
  name: tf-jupyter-service
  labels:
    app: tf-jupyter
spec:
  selector:
    app: tf-jupyter
  ports:
  - port: 8888
    protocol: TCP
    nodePort: 30061
  type: LoadBalancer

If the service is not created successfully, or reports an error, check the service IP range in the API server manifest:

sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml

Set the range to a value worked out with a network and IP address calculator (one can be found by searching Baidu):

- --service-cluster-ip-range=<service-cidr>
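
Instead of an online calculator, Python's standard ipaddress module can do the same arithmetic. This is a sketch: the helper names `describe_range` and `overlaps` are made up, and the CIDR values are placeholders (10.96.0.0/12 is the kubeadm default service range, 10.244.0.0/16 the usual flannel pod range), not necessarily the ones used in this cluster.

```python
import ipaddress


def describe_range(cidr):
    """Return (first address, last address, address count) for a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return str(net[0]), str(net[-1]), net.num_addresses


def overlaps(cidr_a, cidr_b):
    """True if the two CIDR blocks share any address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


# Placeholder ranges for illustration only.
print(describe_range("10.96.0.0/12"))
print(overlaps("10.96.0.0/12", "10.244.0.0/16"))
```

With `strict=True`, a value whose host bits are set (for example 10.96.0.1/12) raises a ValueError, which catches a common typo before it reaches the API server manifest. The overlap check is useful because the service range should not collide with the pod network.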

Then restart kubelet:

sudo systemctl daemon-reload
sudo systemctl restart kubelet

To view the IP address information for the service, use the following command, substituting the name of your own service (such as tf-jupyter-service) for example-service:

sudo kubectl describe services example-service

To verify that these settings are correct, open the Jupyter Notebook instance at http://IP-of-service:8888.

Now verify that the Jupyter Notebook instance can access the GPU by running the following program in a new terminal. It lists all the devices available to TensorFlow.

from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
print([x.name for x in local_device_protos])

The result is a list of device names. On a working setup it should include a GPU entry (for example /gpu:0) alongside the CPU.
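
To turn this check into a pass/fail test, the device names can be filtered for GPU entries. This is a small sketch. The helper name `gpu_device_names` is made up, and the sample list is hypothetical, since the exact names depend on the TensorFlow version.

```python
def gpu_device_names(device_names):
    """Return the entries that look like GPU devices (case-insensitive 'gpu' match)."""
    return [name for name in device_names if "gpu" in name.lower()]


# Hypothetical device list; in the notebook, pass in
# [x.name for x in device_lib.list_local_devices()] instead.
names = ["/cpu:0", "/gpu:0"]
print(gpu_device_names(names))
```

An empty result means TensorFlow cannot see a GPU, which usually points back to the driver volume mount or the GPU resource limit in the pod definition.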

##Passing parameters (adapted from another source)
First, a word about ENTRYPOINT in a Dockerfile. The official explanation is:

An ENTRYPOINT allows you to configure a container that will run as an executable.

In other words, it makes the container behave like an executable program, and this command runs when the container starts. The general ENTRYPOINT forms are:

ENTRYPOINT ["executable", "param1", "param2"] (the preferred exec form) 
ENTRYPOINT command param1 param2 (shell form)

In short, it is a command followed by several parameters, used the way Docker defines it.
The entry command gives container creation some flexibility. Can you redefine the entry command yourself in k8s to override the one in the Dockerfile? The answer is yes. Look at the following .yaml file:

apiVersion: v1
kind: Pod
metadata:
  name: command-demo
  labels:
    purpose: demonstrate-command
spec:
  containers:
  - name: command-demo-container
    image: debian
    command: ["printenv"]
    args: ["HOSTNAME", "KUBERNETES_PORT"]

This configuration creates a Pod. Note the following two lines under the containers node:

command: ["printenv"]
args: ["HOSTNAME", "KUBERNETES_PORT"]

As the names suggest, these two lines override the ENTRYPOINT behavior defined in the Dockerfile: command replaces the ENTRYPOINT command line, and args supplies its parameters.
When both command and args are written, they naturally override the command line and parameters from the Dockerfile. But what happens when only command or only args is written? The complete set of cases is:

    If neither command nor args is written, the defaults from the Docker image are used.
    If command is written but args is not, the image defaults are ignored and only the command from the .yaml file is run, with no parameters.
    If args is written but command is not, the ENTRYPOINT from the image is run, with the args from the .yaml file as its parameters.
    If both command and args are written, the image defaults are ignored and the .yaml configuration is used.
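
The four cases above can be written down as a small function. This is only a model of the rules for illustration, not code from Kubernetes or Docker. The name `effective_command` and the sample image values are hypothetical.

```python
def effective_command(entrypoint, cmd, command=None, args=None):
    """Resolve what a container runs, following the four cases above.

    entrypoint and cmd come from the image (Dockerfile ENTRYPOINT and CMD);
    command and args come from the pod spec. All are lists of strings.
    """
    if command is None and args is None:
        return entrypoint + cmd            # image defaults
    if command is not None and args is None:
        return list(command)               # image ENTRYPOINT and CMD both ignored
    if command is None:
        return entrypoint + list(args)     # image ENTRYPOINT with pod args
    return list(command) + list(args)      # pod spec replaces everything


# Hypothetical image with ENTRYPOINT ["python"] and CMD ["app.py"].
print(effective_command(["python"], ["app.py"]))
print(effective_command(["python"], ["app.py"], args=["--help"]))
print(effective_command(["python"], ["app.py"], command=["printenv"]))
```

The second case is the one that most often surprises people: setting only command drops the image's default parameters as well, so they must be repeated in args if they are still needed.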

#Some useful commands
##Get commands: basic output

kubectl get services                 # List all services in the current namespace
kubectl get pods --all-namespaces    # List all pods in all namespaces
kubectl get pods -o wide             # List all pods in the current namespace, with more details
kubectl get deployments              # List all deployments
kubectl get deployment my-dep        # List the given deployment

##Describe commands: detailed output

kubectl describe nodes <node-name>
kubectl describe pods <pod-name>

##Delete Resources

kubectl delete -f ./pod.yaml                   # Delete the pod whose type and name are defined in pod.yaml
kubectl delete pod,service baz foo             # Delete pods and services named "baz" and "foo"
kubectl delete pods,services -l name=<myLabel> # Delete pods and services with the label name=<myLabel>
kubectl -n <namespace> delete po,svc --all     # Delete all pods and services in the given namespace

##Open a bash shell in a pod (this can also be done through the UI)

sudo kubectl exec -it <pod-name> -- /bin/bash
