1, Pod scheduling
By default, the Node on which a Pod runs is worked out by the Scheduler component using its scheduling algorithms, and the process cannot be controlled manually. In practice this is not always enough: we often want certain Pods to land on certain Nodes. To do that, we need to understand how Kubernetes schedules Pods. Kubernetes provides four categories of scheduling:
- Automatic scheduling: the Node to run on is calculated entirely by the Scheduler through a series of algorithms
- Directed scheduling: NodeName, NodeSelector
- Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity
- Taint and toleration scheduling: Taints, Tolerations
2, Directed scheduling
Directed scheduling refers to scheduling a Pod to the desired Node by declaring nodeName or nodeSelector on the Pod. Note that this scheduling is mandatory: even if the target Node does not exist, the Pod is still bound to it, and the Pod simply fails to run.
(1) NodeName
NodeName is used to force a Pod onto the Node with the specified name. This approach skips the Scheduler's scheduling logic entirely and binds the Pod directly to the named Node.
Create a pod-nodename.yaml file:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  nodeName: node1   # Specify scheduling to the node1 node
# Create the Pod
kubectl create -f pod-nodename.yaml

# Check the NODE attribute of the Pod; it has indeed been scheduled to node1
kubectl get pods pod-nodename -n mk -o wide

# Next, delete the Pod and change nodeName to node5 (there is no node5 node)
kubectl delete -f pod-nodename.yaml
# vim pod-nodename.yaml  -->  nodeName: node5, then recreate
kubectl create -f pod-nodename.yaml

# Checking again, the Pod has been assigned to node5, but since node5 does not exist the Pod cannot run
kubectl get pods pod-nodename -n mk -o wide
(2) NodeSelector
NodeSelector is used to schedule a Pod onto a Node that carries the specified labels. It is implemented through Kubernetes' label selector mechanism: before the Pod is created, the scheduler uses the MatchNodeSelector scheduling policy to match labels, finds the target Node, and then binds the Pod to it. The matching rule is a mandatory constraint.
Next, experiment:
1. First, add a label to each worker node
kubectl label nodes node1 worker=node1
kubectl label nodes node2 worker=node2
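To confirm the labels took effect, the nodes can be listed together with their labels (either form works):

# List nodes with all their labels
kubectl get nodes --show-labels

# Or filter on the worker label key directly
kubectl get nodes -l worker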
2. Create a pod-nodeselector.yaml file and use it to create a Pod
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  nodeSelector:
    worker: node1   # Schedule to a node with the worker=node1 label
# Create the Pod
kubectl create -f pod-nodeselector.yaml

# Check the NODE attribute of the Pod; it has indeed been scheduled to node1
kubectl get pods pod-nodeselector -n mk -o wide

# Next, delete the Pod and change the nodeSelector value to worker: node5 (no node has this label)
kubectl delete -f pod-nodeselector.yaml
# vim pod-nodeselector.yaml  -->  worker: node5
kubectl create -f pod-nodeselector.yaml

# Checking again, the Pod cannot run and its NODE value is <none>
kubectl get pods -n mk -o wide

# The details show a node selector matching failure
kubectl describe pods pod-nodeselector -n mk
3, Affinity scheduling
The two directed scheduling methods above are convenient to use, but they have a problem: if no qualifying Node exists, the Pod will not run at all, even though there are usable Nodes in the cluster. This limits their usage scenarios.
To address this, Kubernetes also provides affinity scheduling. It extends NodeSelector: through configuration, Nodes that satisfy the conditions are preferred for scheduling, and with soft rules the Pod can still be scheduled to a Node that does not satisfy them, which makes scheduling more flexible.
Affinity is mainly divided into three categories:
- nodeAffinity: targets Nodes; solves the problem of which Nodes a Pod can be scheduled to
- podAffinity: targets Pods; solves the problem of which existing Pods a new Pod can be deployed with in the same topology domain
- podAntiAffinity: targets Pods; solves the problem of which existing Pods a new Pod must not be deployed with in the same topology domain
Usage scenarios for affinity (and anti-affinity):
Affinity: if two applications interact frequently, affinity can be used to place them as close together as possible, reducing the performance cost of network communication.
Anti-affinity: when an application is deployed with multiple replicas, anti-affinity can be used to spread the instances across Nodes, which improves the availability of the service (as sketched below).
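As an illustration of the anti-affinity use case, here is a minimal Deployment sketch; the name web-spread, the namespace, and the image are assumptions for this example and are not used in the later experiments:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-spread                  # hypothetical name for this sketch
  namespace: mk
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-spread
  template:
    metadata:
      labels:
        app: web-spread
    spec:
      containers:
      - name: tomcat
        image: tomcat:latest
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web-spread     # keep replicas of this Deployment apart
            topologyKey: kubernetes.io/hostname   # at most one replica per node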
(1) NodeAffinity
First, let's take a look at the configurable items of NodeAffinity:
pod.spec.affinity.nodeAffinity
  requiredDuringSchedulingIgnoredDuringExecution    the Node must satisfy all specified rules, equivalent to a hard limit
    nodeSelectorTerms      list of node selection terms
      matchFields          list of node selector requirements by node field
      matchExpressions     list of node selector requirements by node label (recommended)
        key        key
        values     values
        operator   relation operator; supports Exists, DoesNotExist, In, NotIn, Gt, Lt
  preferredDuringSchedulingIgnoredDuringExecution   Nodes that satisfy the rules are preferred, equivalent to a soft limit (preference)
    preference             a node selector term associated with the corresponding weight
      matchFields          list of node selector requirements by node field
      matchExpressions     list of node selector requirements by node label (recommended)
        key        key
        values     values
        operator   relation operator; supports In, NotIn, Exists, DoesNotExist, Gt, Lt
    weight                 preference weight, in the range 1-100
Instructions for using the relation operators:

- matchExpressions:
  - key: worker            # matches nodes that have a label with key "worker"
    operator: Exists
  - key: worker            # matches nodes whose label key is "worker" and whose value is "xxx" or "yyy"
    operator: In
    values: ["xxx","yyy"]
  - key: worker            # matches nodes whose label key is "worker" and whose value is greater than "xxx"
    operator: Gt
    values: ["xxx"]
Next, first demonstrate requiredDuringSchedulingIgnoredDuringExecution.
Create pod-nodeaffinity-required.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-required
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:           # Affinity settings
    nodeAffinity:     # Node affinity
      requiredDuringSchedulingIgnoredDuringExecution:   # Hard limit
        nodeSelectorTerms:
        - matchExpressions:   # Match labels whose worker value is in ["node3"]
          - key: worker
            operator: In
            values: ["node3"]
# Create the Pod
kubectl create -f pod-nodeaffinity-required.yaml

# View the Pod status (it fails to run)
kubectl get pods pod-nodeaffinity-required -n mk -o wide

# View Pod details; the scheduling failure shows that node selection failed
kubectl describe pod pod-nodeaffinity-required -n mk

# Next, stop the Pod
kubectl delete -f pod-nodeaffinity-required.yaml

# Modify the file: values: ["node3"] --> ["node1"]
vim pod-nodeaffinity-required.yaml

# Recreate it
kubectl create -f pod-nodeaffinity-required.yaml

# This time scheduling succeeds and the Pod has been placed on node1
kubectl get pods pod-nodeaffinity-required -n mk -o wide
Next, let's demonstrate preferredDuringSchedulingIgnoredDuringExecution.
Create pod-nodeaffinity-preferred.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-preferred
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:           # Affinity settings
    nodeAffinity:     # Node affinity
      preferredDuringSchedulingIgnoredDuringExecution:   # Soft limit
      - weight: 1
        preference:
          matchExpressions:   # Match labels whose worker value is in ["node3"] (no such node in the current environment)
          - key: worker
            operator: In
            values: ["node3"]
# Create the Pod
kubectl create -f pod-nodeaffinity-preferred.yaml

# Check the Pod status (it runs successfully even though no node matches)
kubectl get pod pod-nodeaffinity-preferred -n mk
Notes on NodeAffinity rules (the sketch after this list illustrates rules 2 and 3):

1. If nodeSelector and nodeAffinity are both defined, a Pod can run on a Node only if both conditions are satisfied.
2. If nodeAffinity specifies multiple nodeSelectorTerms, matching any one of them is enough.
3. If one nodeSelectorTerms entry contains multiple matchExpressions, a Node must satisfy all of them to match.
4. If the labels of the Node a Pod is running on change and no longer satisfy the Pod's node affinity requirements, the system ignores this change.
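A minimal sketch of rules 2 and 3, assuming the worker label from the examples above plus a hypothetical disk label used only for illustration:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:            # terms are ORed: matching either term is enough (rule 2)
      - matchExpressions:           # expressions within one term are ANDed: all must hold (rule 3)
        - key: worker
          operator: In
          values: ["node1"]
        - key: disk                 # hypothetical label, only for this sketch
          operator: Exists
      - matchExpressions:
        - key: worker
          operator: In
          values: ["node2"]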
(2) PodAffinity
PodAffinity takes already running Pods as the reference and places a newly created Pod in the same topology domain as the reference Pod.
First, let's take a look at the configurable items of PodAffinity:
pod.spec.affinity.podAffinity
  requiredDuringSchedulingIgnoredDuringExecution    hard limit
    namespaces       namespaces of the reference Pods
    topologyKey      specifies the scheduling scope
    labelSelector
      matchExpressions   list of label selector requirements (recommended)
        key        key
        values     values
        operator   relation operator; supports In, NotIn, Exists, DoesNotExist
      matchLabels      equivalent to the content mapped by multiple matchExpressions
  preferredDuringSchedulingIgnoredDuringExecution   soft limit
    podAffinityTerm    affinity term
      namespaces
      topologyKey
      labelSelector
        matchExpressions
          key        key
          values     values
          operator
        matchLabels
    weight             preference weight, in the range 1-100
topologyKey specifies the scope of scheduling. For example, if it is set to kubernetes.io/hostname, Pods are distinguished by Node; if it is set to beta.kubernetes.io/os, they are distinguished by the Node's operating system type.
Let's demonstrate requiredDuringSchedulingIgnoredDuringExecution first.
1) First create a reference Pod, pod-podaffinity-target.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-target
  namespace: mk
  labels:
    worker: node1   # Set a label
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  nodeName: node1   # Place the target Pod on node1
# Start the target Pod
kubectl create -f pod-podaffinity-target.yaml

# View the Pod status
kubectl get pods pod-podaffinity-target -n mk
2) Create pod-podaffinity-required.yaml, as follows:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-required
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:           # Affinity settings
    podAffinity:      # Pod affinity
      requiredDuringSchedulingIgnoredDuringExecution:   # Hard limit
      - labelSelector:
          matchExpressions:   # Match Pods whose worker label value is in ["node3"]
          - key: worker
            operator: In
            values: ["node3"]
        topologyKey: kubernetes.io/hostname
The above configuration means the new Pod must be on the same Node as a Pod labeled worker=node3. Obviously, no such Pod exists at the moment.
# Start the Pod
kubectl create -f pod-podaffinity-required.yaml

# Check the Pod status and find that it is not running
kubectl get pods pod-podaffinity-required -n mk

# View details
kubectl describe pods pod-podaffinity-required -n mk

# Next, modify values: ["node3"] --> values: ["node1"]
# This means the new Pod must be on the same Node as a Pod labeled worker=node1
# Then recreate the Pod to see the effect
kubectl delete -f pod-podaffinity-required.yaml
kubectl create -f pod-podaffinity-required.yaml

# The Pod now runs normally
kubectl get pods pod-podaffinity-required -n mk
The preferredDuringSchedulingIgnoredDuringExecution of PodAffinity works much like the node-level soft limit shown above; a minimal sketch follows.
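For completeness, here is a sketch of soft Pod affinity following the pattern of the earlier examples; the Pod name is hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-preferred   # hypothetical name for this sketch
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:   # Soft limit
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:       # prefer Nodes already running a Pod labeled worker=node1
            - key: worker
              operator: In
              values: ["node1"]
          topologyKey: kubernetes.io/hostname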
(3) PodAntiAffinity
PodAntiAffinity takes already running Pods as the reference and keeps a newly created Pod out of the topology domain of the reference Pod.
Its configuration and options are the same as those of PodAffinity, so they are not explained again here; we go straight to a test case.
1) Continue to use the target pod in the previous case
kubectl get pods -n mk -o wide --show-labels
2) Create pod-podantiaffinity-required.yaml, as follows:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podantiaffinity-required
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  affinity:             # Affinity settings
    podAntiAffinity:    # Pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution:   # Hard limit
      - labelSelector:
          matchExpressions:   # Match Pods whose worker label value is in ["node1"]
          - key: worker
            operator: In
            values: ["node1"]
        topologyKey: kubernetes.io/hostname
The above configuration means the new Pod must not be on the same Node as a Pod labeled worker=node1.
# Create the Pod
kubectl create -f pod-podantiaffinity-required.yaml

# View the Pod; it has been scheduled to node2
kubectl get pods pod-podantiaffinity-required -n mk -o wide
4, Taints and tolerations
(1) Taints
The scheduling methods so far have all been from the Pod's perspective: by adding attributes to the Pod, we decide whether it should be scheduled to a given Node. We can also decide from the Node's perspective whether to allow Pods on it, by adding taint attributes to the Node.
Once a Node is tainted, there is a mutually exclusive relationship between it and Pods: the Node can refuse to schedule new Pods, and can even evict Pods that are already running on it.
The format of a taint is key=value:effect. Key and value are the label of the taint, and effect describes what the taint does. Three effects are supported:
- PreferNoSchedule: Kubernetes tries to avoid scheduling Pods to a Node with this taint, unless there is no other Node to schedule to
- NoSchedule: Kubernetes will not schedule new Pods to a Node with this taint, but Pods already on the Node are not affected
- NoExecute: Kubernetes will not schedule new Pods to a Node with this taint, and also evicts the Pods already running on it
Examples of setting and removing taints with kubectl:
# Set a taint
kubectl taint nodes node1 key=value:effect

# Remove a taint
kubectl taint nodes node1 key:effect-

# Remove all taints with the given key
kubectl taint nodes node1 key-
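To see which taints a Node currently carries, one option is to look at the Taints field in the node description (node1 here follows the examples above):

# Show the Taints field of the node description
kubectl describe node node1 | grep Taints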
Demonstrating the effect of taints:
- Prepare node node1 (stop node2 so the effect is obvious)
- Set a taint on node1: error=mk:PreferNoSchedule, then create pod1 (pod1 can run)
- Change node1's taint to error=mk:NoSchedule, then create pod2 (pod1 keeps running, pod2 fails to schedule)
- Change node1's taint to error=mk:NoExecute, then create pod3 (pod1 and pod2 are evicted, and pod3 cannot be scheduled)
# Set a taint on node1 (PreferNoSchedule)
kubectl taint nodes node1 error=mk:PreferNoSchedule

# Create pod1
kubectl run taint-deploy1 --image=tomcat:latest -n mk
kubectl get pods -n mk -o wide

# Change the taint on node1 (remove PreferNoSchedule, set NoSchedule)
kubectl taint nodes node1 error:PreferNoSchedule-
kubectl taint nodes node1 error=mk:NoSchedule

# Create pod2
kubectl run taint-deploy2 --image=tomcat:latest -n mk
kubectl get pods taint-deploy2 -n mk -o wide

# Change the taint on node1 (remove NoSchedule, set NoExecute)
kubectl taint nodes node1 error:NoSchedule-
kubectl taint nodes node1 error=mk:NoExecute

# Create pod3
kubectl run taint-deploy3 --image=tomcat:latest -n mk
kubectl get pods -n mk -o wide
Tip: a cluster built with kubeadm adds a taint to the master node by default, so Pods are not scheduled onto the master node.
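You can verify this on your own cluster. The exact taint key depends on the Kubernetes version (older kubeadm clusters use node-role.kubernetes.io/master, newer ones node-role.kubernetes.io/control-plane), so treat the command below as a sketch:

# Replace <master-node-name> with the name shown by "kubectl get nodes"
kubectl describe node <master-node-name> | grep Taints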
(2) Tolerations
The previous section described what taints do: taints can be added to a Node to reject Pod scheduling. If you still want to schedule a Pod onto a tainted Node, you must use a toleration.
A taint rejects, a toleration ignores: the Node uses a taint to refuse Pods, and a Pod uses a toleration to ignore that refusal.
Let's first look at the effect through an example:
- node1 has already been given a NoExecute taint, so at this point Pods cannot be scheduled to it
- By adding a toleration to the Pod, it can be scheduled there again
Create pod-toleration.yaml, as follows:
apiVersion: v1
kind: Pod
metadata:
  name: pod-toleration
  namespace: mk
spec:
  containers:
  - name: tomcat
    image: tomcat:latest
  tolerations:            # Add a toleration
  - key: "error"          # Key of the taint to tolerate
    operator: "Equal"     # Operator
    value: "mk"           # Value of the taint to tolerate
    effect: "NoExecute"   # The effect to tolerate; must match the taint's effect
# Pod status before the toleration is added
kubectl get pods -n mk -o wide

# Pod status after the toleration is added
kubectl get pods -n mk -o wide
Let's take a look at the detailed configuration of tolerations:
kubectl explain pod.spec.tolerations
......
FIELDS:
  key                 # key of the taint to tolerate; empty means match all keys
  value               # value of the taint to tolerate
  operator            # key-value operator; supports Equal (default) and Exists
  effect              # effect of the taint to tolerate; empty means match all effects
  tolerationSeconds   # toleration time; only meaningful when the effect is NoExecute, and indicates how long the Pod may stay on the Node
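As a sketch of these fields, the following toleration matches any taint with key error regardless of its value and limits how long the Pod may remain once such a NoExecute taint appears; the key follows the earlier examples and the 3600-second value is only illustrative:

tolerations:
- key: "error"              # tolerate any taint whose key is "error"...
  operator: "Exists"        # ...regardless of its value (no value field is needed with Exists)
  effect: "NoExecute"
  tolerationSeconds: 3600   # the Pod may stay for 3600 seconds after the taint is added, then it is evicted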