Detailed Pod Scheduling in k8s

Posted by Wireless102 on Sun, 05 Jan 2020 22:12:08 +0100

Usually the default k8s scheduler is used, but in some cases we need a pod to run only on nodes that carry a particular label. The default scheduling policy cannot express this, so we have to specify a scheduling policy ourselves and tell k8s which nodes the pod should be scheduled on.

nodeSelector
Normally, the nodeSelector scheduling strategy is used directly. Labels are the common way to tag resources in k8s: we can give nodes special labels, and nodeSelector will then schedule pods onto the nodes that carry the specified labels.

Here's an example:

First, view the labels on the nodes with the following command:

$ kubectl get nodes --show-labels
NAME      STATUS    ROLES     AGE       VERSION   LABELS
master    Ready     master    147d      v1.10.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=master,node-role.kubernetes.io/master=
node02    Ready     <none>    67d       v1.10.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,course=k8s,kubernetes.io/hostname=node02
node03    Ready     <none>    127d      v1.10.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,jnlp=haimaxy,kubernetes.io/hostname=node03

You can then add a label to the node02 node:

$ kubectl label nodes node02 com=yijiadashuju
node "node02" labeled
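
To confirm that the label took effect, you can list just that node with its labels again:

$ kubectl get nodes node02 --show-labels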

Once a node is labeled, these labels can be used at scheduling time by adding a nodeSelector field under the Pod's spec, containing the label of the node we want the pod scheduled to. For example, to force a Pod onto the node02 node, you can express it with nodeSelector like this: (pod-selector-demo.yaml)

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: busybox-pod
  name: test-busybox
spec:
  containers:
  - command:
    - sleep
    - "3600"
    image: busybox
    imagePullPolicy: Always
    name: test-busybox
  nodeSelector:
    com: yijiadashuju

Then, after applying the pod-selector-demo.yaml file, you can see which node the pod is running on with the following command:

$ kubectl get pod -o wide -n default

You can also use the describe command to see which node the pod was dispatched to:

$ kubectl create -f pod-selector-demo.yaml
pod "test-busybox" created
$ kubectl describe pod test-busybox
Name:         test-busybox
Namespace:    default
Node:         node02/10.151.30.63
......
QoS Class:       BestEffort
Node-Selectors:  com=yijiadashuju
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason                 Age   From               Message
  ----    ------                 ----  ----               -------
  Normal  SuccessfulMountVolume  55s   kubelet, node02    MountVolume.SetUp succeeded for volume "default-token-n9w2d"
  Normal  Scheduled              54s   default-scheduler  Successfully assigned test-busybox to node02
  Normal  Pulling                54s   kubelet, node02    pulling image "busybox"
  Normal  Pulled                 40s   kubelet, node02    Successfully pulled image "busybox"
  Normal  Created                40s   kubelet, node02    Created container
  Normal  Started                40s   kubelet, node02    Started container

As you can see from the output above, the pod was placed on the node02 node by the default-scheduler. However, this scheduling method is mandatory: if node02 does not have enough resources, the pod will stay in the Pending state forever. That is how nodeSelector works.
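
If a pod does get stuck in Pending like that, a quick way to investigate (using the test-busybox pod from this example) is to check its status and then read the Events section at the end of the describe output, which records why the scheduler could not place it:

$ kubectl get pod test-busybox -n default
$ kubectl describe pod test-busybox -n default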


From the introduction above, nodeSelector is clearly convenient to use, but it has an obvious shortcoming: it is not flexible enough and its control granularity is coarse, which causes plenty of inconvenience in practice. Next, let's look at affinity and anti-affinity scheduling.


Affinity and Anti-Affinity Scheduling

The default scheduling process in k8s actually goes through two phases: predicates and priorities. With the default process, k8s schedules pods onto nodes with sufficient resources; with nodeSelector, it schedules pods onto nodes carrying the specified label. In a real production environment, however, we often need to schedule pods onto a group of nodes that have certain labels, and for that we need nodeAffinity (node affinity), podAffinity (pod affinity), and podAntiAffinity (pod anti-affinity).

Affinity can be divided into hard and soft affinity.

Soft affinity: if the rule cannot be satisfied, scheduling still proceeds; the rule is honored when possible, but it is not mandatory.
Hard affinity: the scheduling must satisfy the specified requirement; if no node satisfies it, the pod will not be scheduled and stays Pending.

The rules that can be set are:
Soft policy: preferredDuringSchedulingIgnoredDuringExecution

Hard policy: requiredDuringSchedulingIgnoredDuringExecution

Node Affinity
Node affinity is mainly used to control which nodes a pod can be deployed to and which nodes it cannot be deployed to. It supports simple logical combinations, not just exact equality matching.

Next, let's look at an example where a Deployment manages three pod replicas and nodeAffinity controls their scheduling, as follows: (node-affinity-demo.yaml)

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: affinity
  labels:
    app: affinity
spec:
  replicas: 3
  revisionHistoryLimit: 15
  template:
    metadata:
      labels:
        app: affinity
        role: test
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
          name: nginxweb
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:  # Hard Policy
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values:
                - node03
          preferredDuringSchedulingIgnoredDuringExecution:  # Soft Policy
          - weight: 1
            preference:
              matchExpressions:
              - key: com
                operator: In
                values:
                - yijiadashuju

When this pod is scheduled, the hard requirement is that it must not run on node03; on top of that, if any node carries the label com=yijiadashuju, the pod will preferentially be scheduled onto that node.

Next, look at the node information:

$ kubectl get nodes --show-labels
NAME      STATUS    ROLES     AGE       VERSION   LABELS
master    Ready     master    154d      v1.10.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=master,node-role.kubernetes.io/master=
node02    Ready     <none>    74d       v1.10.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,com=yijiadashuju,course=k8s,kubernetes.io/hostname=node02
node03    Ready     <none>    134d      v1.10.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,jnlp=haimaxy,kubernetes.io/hostname=node03

You can see that the node02 node carries the label com=yijiadashuju, so it will be preferred as required. Next, create the Deployment and check where the pods were scheduled.

$ kubectl create -f node-affinity-demo.yaml
deployment.apps "affinity" created
$ kubectl get pods -l app=affinity -o wide
NAME                        READY     STATUS    RESTARTS   AGE       IP             NODE
affinity-7b4c946854-5gfln   1/1       Running   0          47s       10.244.4.214   node02
affinity-7b4c946854-l8b47   1/1       Running   0          47s       10.244.4.215   node02
affinity-7b4c946854-r86p5   1/1       Running   0          47s       10.244.4.213   node02

As you can see from the results, the pods were all deployed to the node02 node.

Kubernetes currently provides the following operators (a small sketch using some of them follows this list):

In: the label's value is in the given list of values
NotIn: the label's value is not in the given list of values
Gt: the label's value is greater than the given value
Lt: the label's value is less than the given value
Exists: the label exists on the node
DoesNotExist: the label does not exist on the node
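
As a rough illustration (the label keys disktype and gpu-count below are made up and are not part of the example above), a matchExpressions block mixing these operators could look like this; note that Gt and Lt compare the label value as an integer, and Exists/DoesNotExist must not specify any values:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype        # the node must simply have a disktype label, whatever its value
          operator: Exists
        - key: gpu-count       # the node's gpu-count label, read as an integer, must be greater than 1
          operator: Gt
          values:
          - "1"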

If there are multiple entries under nodeSelectorTerms, a node only needs to satisfy any one of them; if a single entry contains multiple matchExpressions, the node must satisfy all of them for the pod to be scheduled.
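
To make this concrete, here is a small fragment assembled from the node labels shown earlier (it is not part of the original files): the two entries under nodeSelectorTerms are ORed, so a node matching either one is acceptable, while the matchExpressions inside the first entry are ANDed, so that entry only matches a node carrying both labels:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:      # term 1: node must have com=yijiadashuju AND course=k8s (node02 above)
        - key: com
          operator: In
          values:
          - yijiadashuju
        - key: course
          operator: In
          values:
          - k8s
      - matchExpressions:      # term 2: OR the node has jnlp=haimaxy (node03 above)
        - key: jnlp
          operator: In
          values:
          - haimaxy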

podAffinity (pod affinity)

Pod affinity is mainly used to decide which pods can be deployed in the same topology domain as which other pods (a topology domain being a group of nodes, such as the nodes in one zone), while pod anti-affinity decides which pods must not be co-located with which other pods; both address placement relationships between pods. Note that inter-pod affinity and anti-affinity require a significant amount of processing, which can noticeably slow down scheduling in large clusters, so they are not recommended for clusters of more than several hundred nodes. Pod anti-affinity also requires nodes to be consistently labeled: every node in the cluster must carry a label matching the topologyKey, and if some or all nodes are missing the specified topologyKey label, unexpected behavior may result.
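
Before relying on a topologyKey such as failure-domain.beta.kubernetes.io/zone (the key used in the example below), it is worth checking that your nodes actually carry that label; one way is to print it as an extra column:

$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone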


Here is an example of affinity between pods:

pods/pod-with-pod-affinity.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0
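
In this manifest, the hard podAffinity rule only allows the pod to be scheduled into a zone (the topology domain defined by topologyKey: failure-domain.beta.kubernetes.io/zone) that already runs at least one pod labeled security=S1, while the soft podAntiAffinity rule asks the scheduler to prefer, with a weight of 100, zones that do not already run a pod labeled security=S2.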

podAntiAffinity (pod anti-affinity)

Here is an example YAML file for pod anti-affinity:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
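
Because the anti-affinity rule here is a hard requirement keyed on kubernetes.io/hostname and matches the Deployment's own app=store label, no two of the three replicas can end up on the same node (on a cluster with fewer than three schedulable nodes, the extra replicas would stay Pending). A quick way to verify the placement after creating the Deployment (the file name redis-cache.yaml below is just an assumed name for the manifest above):

$ kubectl create -f redis-cache.yaml
$ kubectl get pods -l app=store -o wide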

Topics: Linux Kubernetes kubelet Redis