Pod scheduling strategies in Kubernetes
The Scheduler is the Kubernetes component responsible for scheduling: its main task is to assign newly defined pods to nodes in the cluster. This sounds simple, but there are many questions to consider:
- Fairness: how to ensure that each node can be allocated resources
- Efficient utilization of resources: all resources in the cluster are used to the maximum extent
- Efficiency: scheduling performance should be good enough to schedule a large number of pods as quickly as possible
- Flexibility: allows users to control the logic of scheduling according to their own needs
The Scheduler runs as a separate program. After startup, it continuously watches the API Server for pods whose spec.nodeName is empty and creates a binding for each of them, indicating which node the pod should be placed on.
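For reference, pods that have not yet been bound to a node can be listed with a field selector on spec.nodeName (a quick check, assuming the default kubectl context):

```bash
# Pods whose spec.nodeName is still empty, i.e. not yet scheduled
kubectl get pods --all-namespaces --field-selector spec.nodeName=
```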
1, Scheduling process
Scheduling is divided into two steps: filtering and prioritizing.
First, nodes that do not meet the conditions are filtered out; this process is called predicate. Then the remaining nodes are sorted by priority, and finally the node with the highest priority is selected. If any step returns an error, the error is returned directly.
Ⅰ. Predicate
The predicate phase has a series of algorithms that can be used:
- PodFitsResources: checks whether the node's remaining resources are greater than the resources requested by the pod (see the example after this list).
- PodFitsHost: if the pod specifies NodeName, checks whether the node name matches it.
- PodFitsHostPorts: checks whether the ports already in use on the node conflict with the ports requested by the pod.
- PodSelectorMatches: filters out nodes whose labels do not match the pod's node selector.
- NoDiskConflict: the volumes already mounted on the node must not conflict with the volumes specified by the pod, unless both are read-only.
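As a minimal illustration of what PodFitsResources evaluates, here is a pod with explicit resource requests (the request values are made up for the example; the image is reused from the manifests later in this article). A node only passes this predicate if its unrequested CPU and memory cover these amounts:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: request-demo
spec:
  containers:
  - name: app
    image: hub.atguigu.com/library/myapp:v1
    resources:
      requests:
        cpu: "500m"      # node must have at least 0.5 CPU not yet requested
        memory: "256Mi"  # and at least 256Mi of memory not yet requested
```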
If no node is suitable during the predicate phase, the pod stays in the Pending state and scheduling is retried until some node meets the conditions. If multiple nodes pass this step, the priorities process continues: the nodes are ranked by priority.
Ⅱ. Priority
A priority is a set of key-value pairs, where the key is the name of a priority item and the value is its weight (how important that item is). Common priority options include:
- LeastRequestedPriority: the weight is determined from the node's CPU and memory utilization; the lower the utilization, the higher the weight. In other words, this priority favors nodes with a lower proportion of requested resources.
- BalancedResourceAllocation: the closer a node's CPU and memory utilization are to each other, the higher the weight. This should be used together with the item above, not on its own.
- ImageLocalityPriority: favors nodes that already have the pod's images; the larger the total size of the images already present, the higher the weight.
All priority items and their weights are combined by the algorithm to produce the final score.
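Roughly, the idea behind LeastRequestedPriority can be illustrated like this (a sketch of the legacy scoring formula, not a quote of the current implementation): on a node with 4 CPUs of which pods already request 1, the CPU score is (4 - 1) × 10 / 4 = 7.5; memory is scored the same way, and the two scores are averaged.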
Ⅲ. Custom scheduler
Besides the scheduler that ships with Kubernetes, you can also write your own. By setting the spec.schedulerName field to the name of that scheduler, you choose which scheduler handles the pod; an example follows.
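A minimal sketch of such a pod, assuming a custom scheduler registered under the (hypothetical) name my-scheduler:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-scheduler          # hypothetical custom scheduler name
  containers:
  - name: app
    image: hub.atguigu.com/library/myapp:v1
```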
2, Affinity
Affinity is an attribute of a pod (a preference or a hard requirement) that makes the pod attracted to a particular class of nodes.
Ⅰ. Node affinity
pod.spec.nodeAffinity
- preferredDuringSchedulingIgnoredDuringExecution: soft policy.
- requiredDuringSchedulingIgnoredDuringExecution: hard policy.
- These two strategies can be used together.
requiredDuringSchedulingIgnoredDuringExecution
Hard policy: if the conditions are not met, the Pod stays Pending and waits until they are.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: hub.atguigu.com/library/myapp:v1
  affinity:
    nodeAffinity:
      # node affinity hard policy
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - k8s-node02
```
preferredDuringSchedulingIgnoredDuringExecution
Soft policy: if the conditions cannot be met, the pod is scheduled anyway.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: hub.atguigu.com/library/myapp:v1
  affinity:
    nodeAffinity:
      # node affinity soft policy
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: source
            operator: In
            values:
            - k8s-node02
```
- Key/value operators:
- In: the label's value is in the given list
- NotIn: the label's value is not in the given list
- Gt: the label's value is greater than a given value (numeric comparison; see the snippet after this list)
- Lt: the label's value is less than a given value
- Exists: the label exists (the value is not checked)
- DoesNotExist: the label does not exist
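The examples above only use In and NotIn; as a sketch of the numeric operators, the following matchExpressions fragment (with a made-up node label cpu-cores) would only match nodes whose cpu-cores label is greater than 4:

```yaml
# fragment of requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms
- matchExpressions:
  - key: cpu-cores        # hypothetical node label
    operator: Gt
    values:
    - "4"                 # Gt/Lt compare against a single integer value
```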
Ⅱ. Pod affinity
- preferredDuringSchedulingIgnoredDuringExecution: soft policy
- requiredDuringSchedulingIgnoredDuringExecution: hard policy
- The two strategies can also be used together.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-3
  labels:
    app: pod-3
spec:
  containers:
  - name: pod-3
    image: hub.atguigu.com/library/myapp:v1
  affinity:
    podAffinity:
      # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - pod-1
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      # pod anti-affinity
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - pod-2
          topologyKey: kubernetes.io/hostname
```
The affinity / anti-affinity scheduling strategies compare as follows:
Scheduling strategy | Label matched on | Operators | Topology domain support | Scheduling target |
---|---|---|---|---|
nodeAffinity | node | In, NotIn, Exists, DoesNotExist, Gt, Lt | no | a specific host |
podAffinity | pod | In, NotIn, Exists, DoesNotExist | yes | the pod runs in the same topology domain as the specified pod |
podAntiAffinity | pod | In, NotIn, Exists, DoesNotExist | yes | the pod does not run in the same topology domain as the specified pod |
3, Taint and toleration
Node affinity is an attribute of a pod (a preference or a hard requirement) that attracts the pod to a particular class of nodes. Taints do the opposite: they let a node repel a particular class of pods.
Scope: nodes.
Taints and tolerations work together to keep pods away from unsuitable nodes. One or more taints can be applied to each node, which means that pods that cannot tolerate those taints will not be accepted by the node. If tolerations are applied to pods, those pods can (but are not required to) be scheduled onto nodes with matching taints.
Ⅰ. Taint
Composition of Taint
The kubectl taint command can be used to set a taint on a node. Once a taint is set, there is a mutually exclusive relationship between the node and pods: the node can refuse to schedule pods onto it and can even evict pods that are already running on it.
Each taint is composed as follows:
key=value:effect
Each taint has a key and a value as its label, where value can be empty, and effect describes what the taint does. Currently, the taint effect supports the following three options:
- NoSchedule: Kubernetes will not schedule pods onto a node with this taint.
- PreferNoSchedule: Kubernetes will try to avoid scheduling pods onto a node with this taint.
- NoExecute: Kubernetes will not schedule pods onto a node with this taint, and pods already running on the node will be evicted.
Setting, viewing and removing taints
```bash
# Set a taint
kubectl taint nodes node1 key1=value1:NoSchedule
# View taints: look for the Taints field in the node description
kubectl describe node node1
# Remove a taint
kubectl taint nodes node1 key1:NoSchedule-
```
Typical use of taints: a NoExecute taint can evict all pods from the current node, so that node settings can be updated without interrupting the workload (the evicted pods are rescheduled elsewhere). A sketch of this is shown below.
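A minimal sketch, assuming a node named node1 and a made-up taint key maintenance:

```bash
# Evict all pods that do not tolerate this taint and keep new ones away
kubectl taint nodes node1 maintenance=true:NoExecute

# ... perform the node maintenance ...

# Remove the taint so the node can accept pods again
kubectl taint nodes node1 maintenance=true:NoExecute-
```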
Ⅱ. Tolerations
A tainted node repels pods to a degree that depends on the taint's effect (NoSchedule, PreferNoSchedule or NoExecute), so pods will generally not be scheduled onto it. However, we can set tolerations on a pod; a pod with a matching toleration can tolerate the taint and can be scheduled onto the tainted node.
pod.spec.tolerations
tolerations: - key: "key1" operator: "Equal" value: "value1" effect: "NoSchedule" tolerationSeconds: 3600 - key: "key1" operator: "Equal" value: "value1" effect: "NoExecute" - key: "key2" operator: "Exists" effect: "NoSchedule"
- The key, value and effect must match the taint set on the node.
- If operator is Exists, the value is ignored.
- tolerationSeconds describes how long the pod can keep running on the node after it is marked for eviction.
1. When no key is specified, all taint keys are tolerated
```yaml
tolerations:
- operator: "Exists"
```
2. When no effect is specified, all taint effects are tolerated
tolerations: - key: "key" operator: "Exists"
3. When there are multiple master nodes, the following setting can be used to avoid wasting their resources
```bash
kubectl taint nodes Node-Name node-role.kubernetes.io/master=:PreferNoSchedule
```
4, Specify scheduling node
Ⅰ. Pod.spec.nodeName
Pod.spec.nodeName schedules the pod directly onto the specified node, skipping the Scheduler's scheduling policy entirely; the match is forced.
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myweb
spec:
  replicas: 7
  template:
    metadata:
      labels:
        app: myweb
    spec:
      nodeName: k8s-node01
      containers:
      - name: myweb
        image: hub.atguigu.com/library/myapp:v1
        ports:
        - containerPort: 80
```
Ⅱ. Pod.spec.nodeSelector
Pod.spec.nodeSelector: selects nodes through the Kubernetes label-selector mechanism; the scheduler matches labels as part of its scheduling policy and places the pod on a matching node. This is a hard constraint.
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myweb
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: myweb
    spec:
      nodeSelector:
        type: backEndNode1
      containers:
      - name: myweb
        image: harbor/tomcat:8.5-jre8
        ports:
        - containerPort: 80
```
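For the selector above to match anything, at least one node has to carry the type=backEndNode1 label; assuming the node is k8s-node01 (a name borrowed from the earlier example), it could be labeled like this:

```bash
# Add the label that the nodeSelector above expects
kubectl label nodes k8s-node01 type=backEndNode1

# Verify the label
kubectl get nodes --show-labels
```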