K8S Learning (12) -- Pod scheduling

Posted by whitehat on Wed, 02 Feb 2022 09:45:44 +0100

1, Pod scheduling

By default, the node on which a Pod runs is chosen by the Scheduler component using its scheduling algorithms, and this process is not controlled manually. In practice, however, that is not always enough: we often want to control which nodes certain Pods land on. To do that, we need to understand how Kubernetes schedules Pods. Kubernetes provides four categories of scheduling:

  • Automatic scheduling: the node on which to run is completely calculated by the Scheduler through a series of algorithms

  • Directional scheduling: NodeName, NodeSelector

  • Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity

  • Taint and toleration scheduling: Taints, Tolerations

2, Directional scheduling

Directional scheduling means scheduling the Pod to the desired node by declaring nodeName or nodeSelector on the Pod. Note that this scheduling is mandatory: even if the target node does not exist, the Pod is still assigned to it, and it simply fails to run.

(1)NodeName

NodeName is used to force the constraint to schedule the Pod to the Node with the specified Name. In fact, this method directly skips the Scheduler's scheduling logic and directly schedules the Pod to the Node with the specified Name.

Create a pod-nodename.yaml file:

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  nodeName: node1 # Specify scheduling to node1 node
#Create Pod
kubectl create -f pod-nodename.yaml

#Check the NODE attribute of the Pod. It has indeed been scheduled to node1
kubectl get pods pod-nodename -n mk -o wide

# Next, delete the pod and change the value of nodeName to node5 (there is no node5 node)
kubectl delete -f pod-nodename.yaml

# vim pod-nodename.yaml  change to nodeName: node5 and recreate
kubectl create -f pod-nodename.yaml

# Checking again, the pod has been assigned to node5, but since node5 does not exist, the pod cannot run normally
kubectl get pods pod-nodename -n mk -o wide

(2)NodeSelector

NodeSelector schedules the Pod onto a node that carries the specified labels. It is implemented through the Kubernetes label selector mechanism: before the Pod is created, the scheduler uses the MatchNodeSelector strategy to match labels, finds the target node, and then schedules the Pod onto it. The matching rule is a mandatory constraint.

Next, experiment:

1. First, add labels to the nodes:

kubectl label nodes node1 worker=node1
kubectl label nodes node2 worker=node2

2. Create a pod-nodeselector.yaml file and use it to create a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  nodeSelector: 
    worker: node1 # Schedule to a node with the worker=node1 label
#Create Pod
kubectl create -f pod-nodeselector.yaml

#Check the NODE attribute of the Pod. It has indeed been scheduled to node1
kubectl get pods pod-nodeselector -n mk -o wide

# Next, delete the pod and change the value of nodeSelector to worker: node5 (no node has this label)
kubectl delete -f pod-nodeselector.yaml

# vim pod-nodeselector.yaml  worker: node5
kubectl create -f pod-nodeselector.yaml

# Checking again, the pod cannot run normally and its NODE value is <none>
kubectl get pods -n mk -o wide

# The details show a node selector matching failure message
kubectl describe pods pod-nodeselector -n mk

3, Affinity scheduling

The two directional scheduling methods are convenient to use, but they have a limitation: if no node meets the conditions, the Pod will not run at all, even when there are otherwise usable nodes in the cluster. This restricts their use cases.

To address this, Kubernetes also provides affinity scheduling. It extends NodeSelector: through configuration, nodes that meet the conditions are scheduled to preferentially, but if none do, the Pod can still be scheduled to a node that does not meet them. This makes scheduling more flexible.

Affinity is mainly divided into three categories:

  • nodeAffinity: targets Nodes; solves the problem of which Nodes a Pod can be scheduled to

  • podAffinity: targets Pods; solves the problem of which existing Pods a new Pod can be deployed alongside in the same topology domain

  • podAntiAffinity: targets Pods; solves the problem of which existing Pods a new Pod must not be deployed alongside in the same topology domain

Typical usage scenarios for affinity (anti-affinity):

Affinity: if two applications interact frequently, affinity can be used to place them as close together as possible, reducing the performance loss caused by network communication.

Anti-affinity: when an application is deployed with multiple replicas, anti-affinity can be used to spread the instances across nodes, which improves the availability of the service (see the sketch below).
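As a quick illustration of the anti-affinity scenario, here is a minimal sketch (the Deployment name web, the app: web label and the nginx image are assumptions made only for this example) of spreading three replicas across different nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: mk
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
      affinity:
        podAntiAffinity: # Do not place two pods labelled app=web on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["web"]
            topologyKey: kubernetes.io/hostname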

(1)NodeAffinity

First, let's take a look at the configurable items of NodeAffinity:

pod.spec.affinity.nodeAffinity
  requiredDuringSchedulingIgnoredDuringExecution   The node must meet all specified rules, equivalent to a hard limit
    nodeSelectorTerms        Node selection list
      matchFields            List of node selector requirements by node field
      matchExpressions       List of node selector requirements by node label (recommended)
        key                  Key
        values               Value
        operator             Relational operator; supports Exists, DoesNotExist, In, NotIn, Gt, Lt
  preferredDuringSchedulingIgnoredDuringExecution  Nodes that meet the specified rules are preferred, equivalent to a soft limit (preference)
    preference               A node selector term associated with the corresponding weight
      matchFields            List of node selector requirements by node field
      matchExpressions       List of node selector requirements by node label (recommended)
        key                  Key
        values               Value
        operator             Relational operator; supports In, NotIn, Exists, DoesNotExist, Gt, Lt
    weight                   Preference weight, in the range 1-100
Examples of how the operators are used:

- matchExpressions:
  - key: worker           # Matches nodes that have a label whose key is "worker"
    operator: Exists
  - key: worker           # Matches nodes whose "worker" label value is "xxx" or "yyy"
    operator: In
    values: ["xxx","yyy"]
  - key: worker           # Matches nodes whose "worker" label value is greater than "xxx"
    operator: Gt
    values: ["xxx"]

Next, demonstrate requiredDuringSchedulingIgnoredDuringExecution.

Create pod-nodeaffinity-required.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-required
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:  #Affinity settings
    nodeAffinity: #Set node affinity
      requiredDuringSchedulingIgnoredDuringExecution: # Hard limit
        nodeSelectorTerms:
        - matchExpressions: # Match nodes whose worker label value is in ["node3"]
          - key: worker
            operator: In
            values: ["node3"]
# Create pod
kubectl create -f pod-nodeaffinity-required.yaml

# View pod status (it fails to run)
kubectl get pods pod-nodeaffinity-required -n mk -o wide

# View Pod details
# The events show that scheduling failed because node selection failed
kubectl describe pod pod-nodeaffinity-required -n mk

# Next, delete the pod
kubectl delete -f pod-nodeaffinity-required.yaml

# Modify the file and change values: ["node3"] ----> ["node1"]
vim pod-nodeaffinity-required.yaml

# Recreate the pod
kubectl create -f pod-nodeaffinity-required.yaml

# This time scheduling succeeds and the pod has been scheduled to node1
kubectl get pods pod-nodeaffinity-required -n mk -o wide

Next, let's demonstrate preferredDuringSchedulingIgnoredDuringExecution.

Create pod-nodeaffinity-preferred.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-preferred
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:  #Affinity settings
    nodeAffinity: #Set node affinity
      preferredDuringSchedulingIgnoredDuringExecution: # Soft limit
      - weight: 1
        preference:
          matchExpressions: # Prefer nodes whose worker label value is in ["node3"] (no such node in the current environment)
          - key: worker
            operator: In
            values: ["node3"]
# Create pod
kubectl create -f pod-nodeaffinity-preferred.yaml

# Check the status of the pod (it runs successfully even though no node matches the preference)
kubectl get pod pod-nodeaffinity-preferred -n mk
Notes on NodeAffinity rule settings:
 1. If nodeSelector and nodeAffinity are both defined, the Pod can run on a Node only if both conditions are satisfied
 2. If nodeAffinity specifies multiple nodeSelectorTerms, a node only needs to match one of them
 3. If a nodeSelectorTerms contains multiple matchExpressions, a node must satisfy all of them to match (points 2 and 3 are illustrated in the sketch below)
 4. If the labels of the Node where a Pod is running change so that they no longer satisfy the Pod's node affinity rules, the system ignores the change
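A minimal sketch of how points 2 and 3 combine (the pod name and the disktype label are assumptions used only for illustration): the two nodeSelectorTerms below are ORed together, while the two matchExpressions inside the first term must both be satisfied.

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-combined
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:   # Term 1: the node must satisfy BOTH expressions
          - key: worker
            operator: In
            values: ["node1"]
          - key: disktype
            operator: Exists
        - matchExpressions:   # Term 2: OR the node satisfies this expression instead
          - key: worker
            operator: In
            values: ["node2"]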

(2)PodAffinity

PodAffinity takes Pods that are already running as the reference, and places a newly created Pod in the same topology domain as the reference Pods.

First, let's take a look at the configurable items of PodAffinity:

pod.spec.affinity.podAffinity
  requiredDuringSchedulingIgnoredDuringExecution   Hard limit
    namespaces           Namespaces of the reference pods
    topologyKey          Specifies the scheduling scope
    labelSelector        Label selector
      matchExpressions   List of pod selector requirements by label (recommended)
        key              Key
        values           Value
        operator         Relational operator; supports In, NotIn, Exists, DoesNotExist
      matchLabels        Equivalent to a set of matchExpressions
  preferredDuringSchedulingIgnoredDuringExecution  Soft limit
    podAffinityTerm      Affinity term associated with the corresponding weight
      namespaces
      topologyKey
      labelSelector
        matchExpressions
          key            Key
          values         Value
          operator
        matchLabels
    weight               Preference weight, in the range 1-100

topologyKey specifies the scope of scheduling. For example:
  If set to kubernetes.io/hostname, scheduling is distinguished by Node (hostname)
  If set to beta.kubernetes.io/os, scheduling is distinguished by the operating system type of the Node
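To see which labels (and therefore which candidate topologyKey values) the nodes in the cluster actually carry, the node labels can be listed:

# Show every node together with all of its labels
kubectl get nodes --show-labels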

First, demonstrate requiredDuringSchedulingIgnoredDuringExecution:

1) First create a reference Pod, pod-podaffinity-target.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-target
  namespace: mk
  labels:
    worker: node1 #Set label
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  nodeName: node1 # Place the target pod on node1
# Start the target pod
kubectl create -f pod-podaffinity-target.yaml

# View pod status
kubectl get pods pod-podaffinity-target -n mk

2) Create pod-podaffinity-required.yaml as follows:

apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-required
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:  #Affinity settings
    podAffinity: #Set pod affinity
      requiredDuringSchedulingIgnoredDuringExecution: # Hard limit
      - labelSelector:
          matchExpressions: # Match pods whose worker label value is in ["node3"]
          - key: worker
            operator: In
            values: ["node3"]
        topologyKey: kubernetes.io/hostname

The above configuration means that the new pod must be on the same Node as a pod that has the label worker=node3. Obviously, no such pod exists at present.

# Start pod
kubectl create -f pod-podaffinity-required.yaml

# Check the status of the pod and find that it is not running
kubectl get pods pod-podaffinity-required -n mk

# View details
kubectl describe pods pod-podaffinity-required  -n mk


# Next, modify values: ["node3"] ---- > values: ["node1"]
# This means that the new pod must be on the same Node as the pod with label worker=node1
# Then recreate the pod to see the effect
kubectl delete -f  pod-podaffinity-required.yaml

kubectl create -f pod-podaffinity-required.yaml

# It is found that the Pod is running normally at this time
kubectl get pods pod-podaffinity-required -n mk

The preferredDuringSchedulingIgnoredDuringExecution of PodAffinity is similar to the node affinity case above; a sketch is shown below.
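For reference, a minimal sketch of such a soft PodAffinity rule (the pod name and the weight of 50 are assumptions used only for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-preferred
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # Soft limit
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchExpressions: # Prefer nodes that already run a pod whose worker label value is in ["node1"]
            - key: worker
              operator: In
              values: ["node1"]
          topologyKey: kubernetes.io/hostname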

(3)PodAntiAffinity

PodAntiAffinity takes Pods that are already running as the reference, and keeps a newly created Pod out of the topology domain in which the reference Pods run.

Its configuration options are the same as those of PodAffinity, so they are not repeated here; we go straight to a test case.

1) Continue to use the target pod in the previous case

kubectl get pods -n mk -o wide --show-labels

2) Create pod-podantiaffinity-required.yaml as follows:

apiVersion: v1
kind: Pod
metadata:
  name: pod-podantiaffinity-required
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:  #Affinity settings
    podAntiAffinity: #Set pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution: # Hard limit
      - labelSelector:
          matchExpressions: # Match pods whose worker label value is in ["node1"]
          - key: worker
            operator: In
            values: ["node1"]
        topologyKey: kubernetes.io/hostname

The above configuration means that the new pod must not be on the same Node as the pod with label worker=node1.

# Create pod
kubectl create -f pod-podantiaffinity-required.yaml

# View pod
# The pod is found to have been scheduled to node2
kubectl get pods pod-podantiaffinity-required -n mk -o wide

4, Taints and tolerations

(1) Taints

The previous scheduling methods all work from the Pod's perspective: by adding attributes to the Pod, we decide whether it should be scheduled to particular Nodes. We can also work from the Node's perspective and decide whether Pods are allowed to be scheduled there at all, by adding taints to the Node.

After a Node is tainted, there is a mutually exclusive relationship between the Node and Pods: the Node can refuse to schedule Pods, and can even evict Pods that are already running on it.

The format of a taint is key=value:effect. key and value are the label of the taint, and effect describes what the taint does. Three effects are supported:

  • PreferNoSchedule: Kubernetes tries to avoid scheduling Pods onto a Node with this taint, unless there is no other Node to schedule to

  • NoSchedule: Kubernetes will not schedule new Pods onto a Node with this taint, but Pods already running on the Node are not affected

  • NoExecute: Kubernetes will not schedule new Pods onto a Node with this taint, and it also evicts Pods already running on the Node

Examples of kubectl commands to set and remove taints:

# Set a taint
kubectl taint nodes node1 key=value:effect

# Remove a taint
kubectl taint nodes node1 key:effect-

# Remove all taints with the given key
kubectl taint nodes node1 key-
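To verify which taints a node currently carries, its description can be inspected (node1 is simply the node name used in this environment):

# Show the Taints line of the node description
kubectl describe node node1 | grep Taints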

Demonstration of the effect of taints:

  1. Prepare node node1 (stop node2 so the effect is obvious)

  2. Set a taint on node1: error=mk:PreferNoSchedule; then create pod1 (pod1 can run)

  3. Change the taint on node1 to error=mk:NoSchedule; then create pod2 (pod1 keeps running normally, pod2 fails to be scheduled)

  4. Change the taint on node1 to error=mk:NoExecute; then create pod3 (pod1 is evicted, and pod2 and pod3 cannot be scheduled, so all three pods fail)

# Set a taint on node1 (PreferNoSchedule)
kubectl taint nodes node1 error=mk:PreferNoSchedule

# Create pod1
kubectl run taint-deploy1 --image=tomcat:latest -n mk
kubectl get pods -n mk -o wide

# Change the taint on node1 (remove PreferNoSchedule, set NoSchedule)
kubectl taint nodes node1 error:PreferNoSchedule-
kubectl taint nodes node1 error=mk:NoSchedule

# Create pod2
kubectl run taint-deploy2 --image=tomcat:latest -n mk
kubectl get pods taint-deploy2 -n mk -o wide

# Change the taint on node1 (remove NoSchedule, set NoExecute)
kubectl taint nodes node1 error:NoSchedule-
kubectl taint nodes node1 error=mk:NoExecute

# Create pod3
kubectl run taint-deploy3 --image=tomcat:latest -n mk
kubectl get pods -n mk -o wide
Tips:
 A cluster built with kubeadm adds a taint to the master node by default, so Pods are not scheduled to the master node
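This can be checked on the control-plane node; the node name master below is only an example, and the exact taint key may differ between Kubernetes versions:

# On a kubeadm cluster this typically shows something like node-role.kubernetes.io/master:NoSchedule
kubectl describe node master | grep Taints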

(2) Tolerations

The section above describes the role of taints: by adding taints to Nodes, you can reject Pod scheduling. If you still want to schedule a Pod onto a tainted Node, the Pod must declare tolerations.

A taint rejects, a toleration ignores: the Node rejects Pods through its taints, and a Pod ignores that rejection through its tolerations.

Let's first look at the effect through an example:

  1. Node node1 has been tainted with NoExecute, so at the moment Pods cannot be scheduled to it

  2. Add a toleration to the Pod, and it can then be scheduled

Create pod-toleration.yaml as follows:

apiVersion: v1
kind: Pod
metadata:
  name: pod-toleration
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  tolerations:          # Add a toleration
  - key: "error"        # The key of the taint to be tolerated
    operator: "Equal"   # Operator
    value: "mk"         # The value of the taint to be tolerated
    effect: "NoExecute" # The effect to tolerate; it must match the effect of the taint set on the node
# Check the pod before the toleration is added (it cannot be scheduled)
kubectl get pods -n mk -o wide

# Check the pod after the toleration is added (it runs normally)
kubectl get pods -n mk -o wide

Let's take a look at the detailed configuration of tolerance:

kubectl explain pod.spec.tolerations
......
FIELDS:
  key               # Corresponds to the key of the taint to be tolerated; empty means match all keys
  value             # Corresponds to the value of the taint to be tolerated
  operator          # Key-value operator; supports Equal (default) and Exists
  effect            # Corresponds to the effect of the taint; empty means match all effects
  tolerationSeconds # Toleration time; only effective when effect is NoExecute, and indicates how long the pod may remain on the Node
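As an illustration only (the key error and the 3600-second limit are assumptions), a toleration that uses the Exists operator together with tolerationSeconds might be written like this:

  tolerations:
  - key: "error"            # Tolerate any taint whose key is "error", regardless of its value
    operator: "Exists"      # With Exists, no value is specified
    effect: "NoExecute"
    tolerationSeconds: 3600 # After the taint is applied, the pod may remain on the node for at most 3600 seconds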

Topics: Linux Kubernetes Container