Resource management and scheduling in Kubernetes

Posted by mustatin on Thu, 27 Jan 2022 04:43:49 +0100

Background:

I wonder whether there are other folks like me who never worked out their resource ratios in detail when creating applications in the cluster. I ran kubectl top node, saw that every node still had plenty of free resources, and went ahead and created several applications with large resource allocations. Those applications then ran under heavy load... and the result was that many applications in the cluster started to crash.

Why limit resources?

1. Limiting a Pod's resources prevents one application from consuming too many resources and affecting other applications.

2. Effective isolation of resources.

3. Priority of pod scheduling.

4. Efficient and rational utilization of resources.

Resource classification in Kubernetes

Kubernetes classifies resources as compressible or incompressible, depending on whether they can be throttled:

1. Compressible resources: when a compressible resource is insufficient, the Pod only "starves" but does not exit. Example: CPU (arguably GPU as well, just not for GPU-bound applications).

2. Incompressible resources: when an incompressible resource is insufficient, the Pod is killed by the kernel, e.g. OOM (out of memory). Examples: memory, disk.

Note: see Lei Shen's Geek Time column: https://time.geekbang.org/column/article/69678

Resource quota management: LimitRange and ResourceQuota

ResourceQuota

ResourceQuota limits the total resource requests and limits consumed by all Pods in a namespace.
You can refer to: https://kubernetes.io/zh/docs/concepts/policy/resource-quotas/

1. Compute resource quota (CPU and memory)

You can refer to: https://kubernetes.io/zh/docs/tasks/administer-cluster/manage-resources/quota-memory-cpu-namespace/.

Following Lei Shen's advice (https://time.geekbang.org/column/article/69678), the configuration file is still best written in the common general format below:

cat > ResourceQuota.yaml << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-mem
  namespace: quota
spec:
  hard:
    requests.cpu: 1000m
    requests.memory: 1000Mi
    limits.cpu: 2000m
    limits.memory: 2000Mi
EOF
[root@sh-master-01 quota]# kubectl create ns quota
namespace/quota created
[root@sh-master-01 quota]# kubectl apply -f ResourceQuota.yaml 
resourcequota/cpu-mem created
kubectl get resourcequota cpu-mem -n quota
kubectl get resourcequota cpu-mem --namespace=quota --output=yaml


Create a Pod from the official documentation example to test:

cat > quota-mem-cpu-pod.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: quota-mem-cpu-demo
spec:
  containers:
  - name: quota-mem-cpu-demo-ctr
    image: nginx
    resources:
      limits:
        memory: "800Mi"
        cpu: "800m"
      requests:
        memory: "600Mi"
        cpu: "400m"
EOF

Create pod in quota namespace:

kubectl apply -f quota-mem-cpu-pod.yaml --namespace=quota

Check that the container in the Pod is running:

kubectl get pod quota-mem-cpu-demo --namespace=quota

Check the details of ResourceQuota again:

kubectl get resourcequota cpu-mem --namespace=quota --output=yaml 

The output shows the quota together with how much of it has been used. You can see that the Pod's memory and CPU requests and limits do not exceed the quota.
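Roughly, the status block of that output should look like the sketch below (the used values come from the single demo Pod; the exact canonical formatting of the quantities may differ):

status:
  hard:
    limits.cpu: "2"
    limits.memory: 2000Mi
    requests.cpu: "1"
    requests.memory: 1000Mi
  used:
    limits.cpu: 800m
    limits.memory: 800Mi
    requests.cpu: 400m
    requests.memory: 600Mi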

Create another pod:

cat > quota-mem-cpu-pod-2.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: quota-mem-cpu-demo-2
spec:
  containers:
  - name: quota-mem-cpu-demo-2-ctr
    image: nginx
    resources:
      limits:
        memory: "1000Mi"
        cpu: "800m"
      requests:
        memory: "700Mi"
        cpu: "400m"
EOF
kubectl apply -f quota-mem-cpu-pod-2.yaml --namespace=quota

An error is reported and the second Pod cannot be created: the output shows that creating it would push the total memory requests over the memory request quota.
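To see exactly how much of the quota is left before retrying, describe it:

kubectl describe resourcequota cpu-mem -n quota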

2. Storage resource quota

Now the storage resource quota. Here I only limit the total requested storage; for the other parameters that can be set, refer to the official documentation:

cat > storage.yaml << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage
  namespace: quota
spec:
  hard:
    requests.storage: 10Gi
EOF
cat > pvc.yaml << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: quota-mem-cpu-demo-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: cbs-csi
EOF
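Apply the storage quota first (the namespace is already set inside the manifest), otherwise there is nothing to reject the PVC against:

kubectl apply -f storage.yaml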

Then create the PVC in the quota namespace:

kubectl apply -f pvc.yaml -n quota

The following error is reported: the requested 20Gi exceeds the 10Gi quota.

3. Object count quota

cat > objects.yaml << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: objects
  namespace: quota
spec:
  hard:
    pods: 10
    replicationcontrollers: 5
    secrets: 10
    configmaps: 10
    persistentvolumeclaims: 4
    services: 5
    services.loadbalancers: 1
    services.nodeports: 2
    cbs.storageclass.storage.k8s.io/persistentvolumeclaims: 2
EOF

I do not demonstrate the object count quota here. Per the official documentation, set limits on whichever countable objects you need; once the actual count exceeds the configured number, creation fails, which is easy to verify.
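If you do want to verify it, a quick sketch (the ConfigMap name here is made up for illustration): apply the quota, check it, then keep creating objects of one kind until the count exceeds the limit and the API server rejects the request:

kubectl apply -f objects.yaml
kubectl describe resourcequota objects -n quota
kubectl -n quota create configmap demo-cm-11 --from-literal=foo=bar   # rejected once the namespace already holds 10 ConfigMaps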

4. Quota Scopes



I have not used this feature much, so I copied the official material. If you want to dig in, the following example is taken from page 750 of the fifth edition of the Kubernetes Authoritative Guide:

1. Create ResourceQuota scope

Create a namespace named quota-scopes, a ResourceQuota named best-effort, and a ResourceQuota named not-best-effort:

kubectl create ns quota-scopes
kubectl create quota best-effort --hard=pods=10 --scopes=BestEffort -n quota-scopes
kubectl create quota not-best-effort --hard=pods=4,requests.cpu=1000m,requests.memory=1024Mi,limits.cpu=2000m,limits.memory=2048Mi --scopes=NotBestEffort -n quota-scopes


Note: the official examples use YAML. I happened to see kubectl create quota mentioned at http://docs.kubernetes.org.cn/541.html#kubectl_create_quota, so I just used that.

2. Create two Deployments

 kubectl run best-effort-nginx --image=nginx --replicas=8 --namespace=quota-scopes
 
 kubectl run not-best-effort-nginx --image=nginx --replicas=2 --requests=cpu=100m,memory=256Mi --limits=cpu=200m,memory=512Mi --namespace=quota-scopes


Well, there is a pit here: kubectl run still accepted --replicas in 1.16.16, but my 1.21.1 no longer does. After checking the documentation, use the kubectl create deployment command below instead:

kubectl create deployment best-effort-nginx --image=nginx --replicas=8 --namespace=quota-scopes

However, when creating the not-best-effort-nginx Deployment, it seems --requests and --limits cannot be passed on the command line either. So fall back to the dumbest method: create the Deployment first, then edit it to add the resources section:

kubectl create deployment not-best-effort-nginx --image=nginx --replicas=2  --namespace=quota-scopes
kubectl edit deployment not-best-effort-nginx -n quota-scopes

3. Verification:

 kubectl get pods -n quota-scopes
 kubectl describe quota -n quota-scopes

5. Set the resource quota based on the priority class

Note: I have not used this one either; the following comes from the official documentation.
FEATURE STATE: Kubernetes v1.17 [stable]
Pods can be created with a specific priority. By using the scopeSelector field in the quota spec, users can control system resource consumption according to Pod priority.
The quota mechanism only matches and measures a Pod's resource consumption if the quota's scopeSelector selects that Pod.
When a quota is scoped to priority classes via the scopeSelector field, the quota object can only track the following resources:

  • pods
  • cpu
  • memory
  • ephemeral-storage
  • limits.cpu
  • limits.memory
  • limits.ephemeral-storage
  • requests.cpu
  • requests.memory
  • requests.ephemeral-storage

This example creates a quota object and matches it to a Pod with a specific priority. The example works as follows:

  • Pods in the cluster are assigned one of three priority classes: "low", "medium", or "high".
  • One quota object is created for each priority class.

Save the following YAML to a file named quota.yml.

apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-high
  spec:
    hard:
      cpu: "1000"
      memory: 200Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["high"]
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-medium
  spec:
    hard:
      cpu: "10"
      memory: 20Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["medium"]
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-low
  spec:
    hard:
      cpu: "5"
      memory: 10Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["low"]

Apply the file with kubectl create and then inspect the quotas:

kubectl create -f quota.yml
kubectl describe quota

LimitRange

LimitRange constrains, and provides defaults for, the resource requests and limits of individual Pods and containers in a namespace.
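There is no LimitRange manifest in this post, so here is a minimal sketch (the name and values are illustrative): it gives every container in the namespace default requests and limits whenever the Pod spec omits them, and it is created with a plain kubectl apply like any other object:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-mem-cpu
  namespace: quota
spec:
  limits:
  - type: Container
    defaultRequest:      # used as requests when a container does not set its own
      cpu: 250m
      memory: 256Mi
    default:             # used as limits when a container does not set its own
      cpu: 500m
      memory: 512Mi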

1. General example: a Pod with a single container

In everyday use, requests and limits are what constrain a Pod's CPU and memory, as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-test
spec:
  replicas: 2
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: php-test
  template:
    metadata:
      labels:
        app: php-test
    spec:
      containers:
        - name: php-test
          image: ccr.ccs.tencentyun.com/master/php-test:0.1
          env:
          - name: PHP_MEM_LIMIT
            value: "256M"
          envFrom:
          - configMapRef:
              name: deploy
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "256M"
              cpu: "250m"
            limits:
              memory: "2048M"
              cpu: "4000m" 
          livenessProbe:
            httpGet:
              scheme: HTTP
              path: /test.html
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              scheme: HTTP
              path: /test.html
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 30
      imagePullSecrets:                                              
        - name: tencent
---

apiVersion: v1
kind: Service
metadata:
  name: php-test
  labels:
    app: php-test
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: php-test

2. A Pod with multiple containers

Of course, a Pod can contain multiple containers:

apiVersion: v1
kind: Pod
metadata:
  name: kucc1
spec:
  containers:
  - image: nginx
    name: nginx
    resources:
      requests:
        memory: "256M"
        cpu: "250m"
      limits:
        memory: "2048M"
        cpu: "4000m"     
  - image: redis
    name: redis
    resources:
      requests:
        memory: "256M"
        cpu: "250m"
      limits:
        memory: "2048M"
        cpu: "4000m" 	

This Pod's total requests are 512M of memory and 0.5 CPU cores; its total limits are 4096M of memory and 8 CPU cores.

So much for the relationship between a Pod and its containers: one Pod can contain multiple containers.

Resource scheduling and the Pod QoS model

Resource management: compute resources and QoS (quality of service)

When Kubernetes schedules a Pod, kube-scheduler calculates only against the requests values; when the cgroup limits are actually applied, kubelet configures them according to the limits values.
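As a rough sketch of what that means for a single container (assuming cgroup v1; the numbers are illustrative):

    resources:
      requests:
        cpu: 250m      # scheduler: counted against the node's allocatable CPU; kubelet: cpu.shares ≈ 256
        memory: 256Mi  # scheduler: counted against the node's allocatable memory
      limits:
        cpu: 500m      # kubelet: cfs_quota_us = 50000 with cfs_period_us = 100000, i.e. half a core
        memory: 512Mi  # kubelet: memory.limit_in_bytes = 512Mi; exceeding it gets the container OOM-killed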
Kubernetes divides Pods into three QoS levels:

  • BestEffort (lowest priority)
  • Burstable
  • Guaranteed (highest priority)

The lowest QoS level is BestEffort. It is assigned to Pods that set no requests and no limits for any of their containers. Containers running at this level have no resource guarantees: in the worst case they get no CPU time at all, and when memory must be freed for other Pods, these containers are killed first. However, because a BestEffort Pod has no memory limits configured, its containers can use as much memory as they like as long as enough is available.

apiVersion: v1
kind: Pod
metadata:
  name: php-test
  namespace: develop
spec:
  containers:
  - name:  php-test
    image: nginx

A Guaranteed Pod must meet the following conditions:

  • Both CPU and memory have requests and limits set
  • Every container in the Pod sets these resource amounts
  • For each resource in each container, requests and limits are equal

apiVersion: v1
kind: Pod
metadata:
  name: php-test
  namespace: qa
spec:
  containers:
  - name: php-test
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "700m"
      requests:
        memory: "200Mi"
        cpu: "700m"

The Burstable QoS level sits between BestEffort and Guaranteed; every other Pod belongs to it. That includes single-container Pods whose requests and limits differ, Pods where at least one container defines only requests but no limits, and Pods where one container's requests equal its limits while another container specifies neither.

apiVersion: v1
kind: Pod
metadata:
  name: php-test
  namespace: master
spec:
  containers:
  - name: php-test
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"

Note:
Guaranteed > Burstable > BestEffort. In the examples above I used three different namespaces to reflect priority: the master namespace holds my most important production services, qa is the test environment, and develop is the development environment. The best way to avoid resource preemption is to set limits and requests to the same values, which gives the Pod the Guaranteed QoS class.
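You can confirm which QoS class a Pod was actually assigned by reading status.qosClass, e.g. for the Guaranteed example above once it is created:

kubectl get pod php-test -n qa -o jsonpath='{.status.qosClass}'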

How do we control a Pod's resource priority and scheduling priority?

By default, the node a Pod lands on is chosen by the scheduler through a series of algorithms, and the user has no say in the process or the result. Setting the QoS class to Guaranteed only guarantees resource priority. What other ways are there to influence resources and scheduling?

1. NodeSelector directed scheduling

The Pod is scheduled onto worker nodes that carry a given label.

[root@sh-master-01 ~]#  kubectl get nodes --show-labels
NAME           STATUS                     ROLES                  AGE   VERSION   LABELS
sh-master-01   Ready                      control-plane,master   92d   v1.21.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-master-01,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
sh-master-02   Ready                      control-plane,master   92d   v1.21.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-master-02,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
sh-master-03   Ready                      control-plane,master   92d   v1.21.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-master-03,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
sh-work-01     Ready                      <none>                 92d   v1.21.1   IngressProxy=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-work-01,kubernetes.io/os=linux,topology.com.tencent.cloud.csi.cbs/zone=ap-shanghai-2
sh-work-02     Ready                      <none>                 92d   v1.21.1   IngressProxy=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-work-02,kubernetes.io/os=linux,topology.com.tencent.cloud.csi.cbs/zone=ap-shanghai-2
sh-work-03     Ready                      <none>                 92d   v1.21.1   IngressProxy=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-work-03,kubernetes.io/os=linux,topology.com.tencent.cloud.csi.cbs/zone=ap-shanghai-2

Label the node sh-work-01 with zone=shanghai:

[root@sh-master-01 ~]# kubectl label node sh-work-01 zone=shanghai
node/sh-work-01 labeled
[root@sh-master-01 ~]#  kubectl get nodes --show-labels
NAME           STATUS                     ROLES                  AGE   VERSION   LABELS
sh-master-01   Ready                      control-plane,master   92d   v1.21.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-master-01,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
sh-master-02   Ready                      control-plane,master   92d   v1.21.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-master-02,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
sh-master-03   Ready                      control-plane,master   92d   v1.21.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-master-03,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
sh-work-01     Ready                      <none>                 92d   v1.21.1   IngressProxy=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-work-01,kubernetes.io/os=linux,topology.com.tencent.cloud.csi.cbs/zone=ap-shanghai-2,zone=shanghai
sh-work-02     Ready                      <none>                 92d   v1.21.1   IngressProxy=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-work-02,kubernetes.io/os=linux,topology.com.tencent.cloud.csi.cbs/zone=ap-shanghai-2
sh-work-03     Ready                      <none>                 92d   v1.21.1   IngressProxy=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sh-work-03,kubernetes.io/os=linux,topology.com.tencent.cloud.csi.cbs/zone=ap-shanghai-2

Create a new Pod and schedule it onto the zone=shanghai node.
Use the nginx image to test:

kubectl run nginx --image=nginx:latest --port=80 --dry-run=client -o yaml > nodeselector.yaml

Edit nodeselector.yaml with vim and add the nodeSelector configuration:

cat > nodeselector.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: nginx:latest
    name: nginx
    ports:
    - containerPort: 80
    resources: {}
  nodeSelector:
    zone: shanghai
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
EOF

Deploy and verify whether the pod is scheduled to the sh-work-01 node:

kubectl apply -f nodeselector.yaml
kubectl get pods -o wide

Note: nodeSelector is a hard constraint. If no node carries the label, or the target node has no available resources, the Pod stays in Pending state.

2. NodeAffinity: Node affinity

1. Affinity scheduling comes in two flavors, a soft policy and a hard policy:

1. Soft policy - preferredDuringSchedulingIgnoredDuringExecution

     This only expresses a preference; it is not enforced. If no node satisfies the rule, the Pod ignores it and scheduling proceeds anyway: a "meet the conditions if possible, otherwise never mind" policy. You can also define multiple preference rules and give each a weight value to define their precedence.

2. Hard policy - requiredDuringSchedulingIgnoredDuringExecution

The specified rules must be satisfied before the Pod can be scheduled onto a node. It is strict: if no node meets the conditions, the scheduler keeps retrying until one does. In short, "meet my requirements or I will not schedule".

2. About IgnoredDuringExecution

From the official documentation:

IgnoredDuringExecution works like nodeSelector: if a node's labels change at runtime so that the Pod's affinity rules are no longer satisfied, the Pod keeps running on that node.

3. Common syntax operators

The following operators are provided:

  • In: the label's value is in a given list
  • NotIn: the label's value is not in a given list
  • Gt: the label's value is greater than a given value
  • Lt: the label's value is less than a given value
  • Exists: the label exists on the node
  • DoesNotExist: the label does not exist on the node
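A small sketch of how these operators look inside nodeAffinity (the zone and cpu-cores labels here are hypothetical):

    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: Exists        # the node only has to carry a "zone" label, any value
          - key: cpu-cores
            operator: Gt            # Gt/Lt compare the label value as an integer
            values:
            - "8"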

If you specify both nodeSelector and nodeAffinity, both must be satisfied before the Pod can be scheduled onto a candidate node.
If you specify multiple nodeSelectorTerms under nodeAffinity, the Pod can be scheduled onto a node if any one of the nodeSelectorTerms is satisfied.
If a single nodeSelectorTerm contains multiple matchExpressions, the Pod can be scheduled onto a node only if all of them are satisfied.
If you modify or delete the labels of the node a Pod is running on, the Pod is not deleted; in other words, affinity selection only applies at scheduling time.
The weight field in preferredDuringSchedulingIgnoredDuringExecution ranges from 1 to 100. For each node that meets all the other scheduling requirements (resource requests, required affinity expressions, and so on), the scheduler iterates over these preference terms and adds the term's weight to the node's score whenever the node matches the corresponding matchExpressions. That score is then combined with the scores from the node's other priority functions, and the node with the highest total is preferred.

4. Examples

1. Hard policy example

Try a hard requirement with a label value that no node carries:

cat > pod-nodeAffinity.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity
  namespace: default
  labels:
    app: myapp
    type: pod
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node01
EOF
kubectl apply -f pod-nodeAffinity.yaml 
kubectl get pods 


Well, I have no node labeled kubernetes.io/hostname: node01, so the Pod stays in Pending state.
Now change the value to kubernetes.io/hostname: sh-work-01.

Well, as follows

Try deleting and redeploying:

kubectl delete -f pod-nodeAffinity.yaml
kubectl create -f pod-nodeAffinity.yaml


Now what if I modify the values field, changing sh-work-01 to sh-work-02?


The Pod stays where it is! This further verifies the IgnoredDuringExecution behavior.

2. Soft policy example

Even though no node carries the label, the Pod is still scheduled:

cat > pod-prefer.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: pod-prefer
  namespace: default
  labels:
    app: myapp
    type: pod
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node01           
EOF            
[root@sh-master-01 priority]# kubectl apply -f pod-prefer.yaml 
pod/pod-prefer created
[root@sh-master-01 priority]# kubectl get pods
NAME                          READY   STATUS              RESTARTS   AGE
csi-app                       1/1     Running             10         93d
nginx                         1/1     Running             0          62m
php-apache-5b95f8f674-clzn5   1/1     Running             2          86d
pod-nodeaffinity              1/1     Running             0          19m
pod-prefer                    0/1     ContainerCreating   0          3s
[root@sh-master-01 priority]# kubectl get pods -o wide
NAME                          READY   STATUS              RESTARTS   AGE   IP           NODE         NOMINATED NODE   READINESS GATES
csi-app                       1/1     Running             10         93d   10.0.4.204   sh-work-01   <none>           <none>
nginx                         1/1     Running             0          62m   10.0.4.60    sh-work-01   <none>           <none>
php-apache-5b95f8f674-clzn5   1/1     Running             2          86d   10.0.3.64    sh-work-03   <none>           <none>
pod-nodeaffinity              1/1     Running             0          19m   10.0.4.118   sh-work-01   <none>           <none>
pod-prefer                    0/1     ContainerCreating   0          7s    <none>       sh-work-02   <none>           <none>


Well, I have no node labeled kubernetes.io/hostname: node01, yet the Pod was still scheduled, landing on sh-work-02. So the rule is indeed not mandatory.
Next, let's test the weight:

cat > pod-prefer1.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: pod-prefer1
  namespace: default
  labels:
    app: myapp1
    type: pod
spec:
  containers:
  - name: myapp1
    image: ikubernetes/myapp:v1
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - sh-work-01
      - weight: 20
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - sh-work-03
EOF
[root@sh-master-01 priority]# kubectl apply -f pod-prefer1.yaml 
pod/pod-prefer1 created
[root@sh-master-01 priority]# kubectl get pods -o wide
NAME                          READY   STATUS              RESTARTS   AGE     IP           NODE         NOMINATED NODE   READINESS GATES
csi-app                       1/1     Running             10         93d     10.0.4.204   sh-work-01   <none>           <none>
nginx                         1/1     Running             0          67m     10.0.4.60    sh-work-01   <none>           <none>
php-apache-5b95f8f674-clzn5   1/1     Running             2          86d     10.0.3.64    sh-work-03   <none>           <none>
pod-nodeaffinity              1/1     Running             0          23m     10.0.4.118   sh-work-01   <none>           <none>
pod-prefer                    1/1     Running             0          4m57s   10.0.5.181   sh-work-02   <none>           <none>
pod-prefer1                   0/1     ContainerCreating   0          6s      <none>       sh-work-03   <none>           <none>

Well, it was scheduled to sh-work-03, the node with the higher weight.

3. Can soft and hard policies coexist?

An example that combines a hard policy with a soft policy:

cat > pod-prefer2.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - sh-work-01
            - sh-work-02
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - shanghai
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
EOF    
 kubectl apply -f pod-prefer2.yaml
 kubectl get pods -o wide

Of course, whatever the preferredDuringSchedulingIgnoredDuringExecution preferences say, the requiredDuringSchedulingIgnoredDuringExecution conditions must be satisfied first.

3. PodAffinity: Pod affinity and anti-affinity (mutually exclusive) scheduling

Inter-Pod affinity is specified with the podAffinity field under affinity in the PodSpec; inter-Pod anti-affinity with the podAntiAffinity field in the same place.
It is mainly used to co-locate frequently interacting, interdependent Pods on the same node or in the same zone, and conversely to keep resource-heavy Pods apart, spread across different nodes or zones.
For example:

1. First deploy a Pod carrying the labels security=S1 and app=busybox

cat > pod-flag.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: pod-flag
  namespace: default
  labels:
    app: busybox
    security: S1
spec:
  containers:
  - name: busybox
    image: busybox:1.28.4
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
EOF    
kubectl apply -f pod-flag.yaml
kubectl get pods --show-labels

2. Examples of pod affinity:

1. Hard policy (required) affinity

cat > with-pod-affinity.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0
EOF    
kubectl apply -f with-pod-affinity.yaml
kubectl get pods -o wide

2. About topologyKey

As the name suggests, topology here means a topology domain, i.e. a scope such as a node, a rack, a machine room, or a region (Hangzhou, Shanghai, and so on). In practice it corresponds to a label on the Node: topologyKey is the key of that node label (the value does not matter). So topologyKey is really a way of grouping nodes into domains, which lets us spread or co-locate Pods across nodes, machine rooms, and regions. I only have a single zone here, so I stay within it.

Original link: https://blog.csdn.net/asdfsadfasdfsa/article/details/106027367
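For instance, if your nodes carry a zone label, using that label's key as the topologyKey turns "same topology domain" into "same zone" instead of "same node" (a sketch; topology.kubernetes.io/zone is the standard well-known label, substitute whatever zone label your nodes actually have):

    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            security: S1
        topologyKey: topology.kubernetes.io/zone   # co-locate in the same zone, not necessarily the same node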

3. Soft policy (preferred) affinity

Preferred rather than mandatory:

cat > with-pod-affinity1.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity1
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100   # preferred terms need a weight and nest the rule inside podAffinityTerm
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S1
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity1
    image: k8s.gcr.io/pause:2.0
EOF 


The Pods end up in ErrImagePull because I reused the official example without changing its image (k8s.gcr.io/pause:2.0); just use a normal nginx image instead.

3. Mutually exclusive (anti-affinity) scheduling of Pods

Schedule the new Pod into the same zone as the Pod labeled security=S1, but not onto the same node as the Pod labeled app=busybox:

cat > with-pod-antiaffinity.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.com.tencent.cloud.csi.cbs/zone
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - busybox
        topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-antiaffinity
    image: nginx
EOF 
 kubectl apply -f with-pod-antiaffinity.yaml
 kubectl get pods -o wide --show-labels


Well, the Pod was scheduled to a node other than sh-work-02. This part is still a bit confusing to me; I will study it further when I have time.

4. Taints and tolerations

1. Cluster information

[root@sh-master-01 priority]# kubectl get node
NAME           STATUS                     ROLES                  AGE   VERSION
sh-master-01   Ready                      control-plane,master   93d   v1.21.1
sh-master-02   Ready                      control-plane,master   93d   v1.21.1
sh-master-03   Ready                      control-plane,master   93d   v1.21.1
sh-work-01     Ready                      <none>                 93d   v1.21.1
sh-work-02     Ready                      <none>                 93d   v1.21.1
sh-work-03     Ready                      <none>                 93d   v1.21.1
test-01        Ready,SchedulingDisabled   <none>                 86d   v1.21.0

2. Taint the sh-work-01 node (key=work, value=ops, effect=NoSchedule)

kubectl taint nodes sh-work-01 work=ops:NoSchedule
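To confirm the taint is in place:

kubectl describe node sh-work-01 | grep -i taint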

3. Testing with a DaemonSet makes the effect more obvious

cat > nginx-daemonset.yaml << EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-1
  namespace: default
  labels:
    web: nginx-1
spec:
  selector:
    matchLabels:
      web: nginx-1
  template:
    metadata:
      labels:
        web: nginx-1
    spec:
      containers:
      - name: nginx-1
        image: nginx:1.17
        ports:
        - containerPort: 80
EOF       
kubectl apply -f nginx-daemonset.yaml
kubectl get pods -o wide --show-labels


Note: I also tainted the test-01 node. It is already SchedulingDisabled, but a DaemonSet would still place a Pod on it, so the taint makes the comparison easier to see.
Now remove the taint from sh-work-01 and see what happens:

[root@sh-master-01 priority]#  kubectl taint nodes sh-work-01 work:NoSchedule-
node/sh-work-01 untainted
[root@sh-master-01 priority]#  kubectl get pods -o wide --show-labels


Well, with the taint removed from sh-work-01, Pods are scheduled onto that node again.

4. Add tolerations and test

cat > nginx-daemonset-2.yaml << EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-2
  namespace: default
  labels:
    web: nginx-2
spec:
  selector:
    matchLabels:
      web: nginx-2
  template:
    metadata:
      labels:
        web: nginx-2
    spec:
      containers:
      - name: nginx-2
        image: nginx:1.17
        ports:
        - containerPort: 80
      tolerations:
      - key: "work"
        operator: "Equal"
        value: "ops"
        effect: "NoSchedule"        
EOF  
kubectl apply -f nginx-daemonset-2.yaml
kubectl get pods -o wide --show-labels


Well, the tainted nodes now get Pods scheduled onto them, including test-01. So let's add another taint to test-01; I simply do not want anything scheduled there:

kubectl taint nodes test-01 work1=dev:NoSchedule

The Pod on the test-01 node has not been evicted.

What about overwriting the existing taint, changing its value from ops to dev?

kubectl taint nodes test-01 work=dev:NoSchedule --overwrite=true


Still the same.
What should I do?

5. Modify the taint on the test-01 node, changing the effect to NoExecute, and test whether the Pod is evicted

kubectl taint nodes test-01 work=dev:NoExecute --overwrite=true


The eviction succeeded.
For more on taints and tolerations, please refer to: https://www.jianjiacc.cn/archives/6fdbe877.html
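As a side note (not from the test above, just a sketch of the API): a toleration for a NoExecute taint can also carry tolerationSeconds, so the Pod stays on the tainted node only for a limited time before being evicted:

      tolerations:
      - key: "work"
        operator: "Equal"
        value: "dev"
        effect: "NoExecute"
        tolerationSeconds: 300   # evicted 300 seconds after the taint is added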

To sum up:

  1. When building a cluster, do reasonable resource and capacity planning.
  2. Resource priority and allocation can be managed through Pod QoS classes (of course, Pods can still be OOM-killed).
  3. Node labels, affinity, and anti-affinity let you schedule resources sensibly and avoid a cluster-wide resource avalanche.
  4. Different Kubernetes versions still differ in places; read the official documentation as much as possible!

Topics: Kubernetes