Kubernetes resource scheduling

Posted by john-iom on Sat, 25 Dec 2021 05:20:49 +0100

Scheduling

Pod creation workflow

Kubernetes decouples the interaction between its components through a controller architecture built on the list-watch mechanism.
Each component watches the resources it is responsible for; when those resources change, kube-apiserver notifies the watching components, much like a publish/subscribe model.
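
The same list-watch behavior is visible from the command line; for instance, kubectl can hold a watch open on a resource and stream changes as they arrive:

kubectl get pods --watch    # initial list, then a continuous stream of changes, much like a subscription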

Main Pod attributes that affect scheduling

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: default
spec:
...
    containers:
    - image: lizhenliang/java-demo
      name: java-demo
      imagePullPolicy: Always
      livenessProbe:
        initialDelaySeconds: 30
        periodSeconds: 20
        tcpSocket:
          port: 8080
      resources: {}       ## Resource requests/limits are the basis for scheduling
    restartPolicy: Always
    schedulerName: default-scheduler    ## The fields below are the scheduling policies
    nodeName: ""
    nodeSelector: {}
    affinity: {}
    tolerations: []

Impact of resource constraints on pod scheduling

Container resource limit:

  • resources.limits.cpu
  • resources.limits.memory

Container resource requests, the minimum resources the container needs, are used as the basis for resource allocation during scheduling:

  • resources.requests.cpu
  • resources.requests.memory

CPU units can be written either in m (millicores) or as a floating-point number, for example 0.5 = 500m and 1 = 1000m.

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

K8s finds a node with enough unreserved resources, based on the requests values, to schedule this Pod.
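
The scheduler compares these requests against each node's allocatable capacity. One way to inspect that on a node (using the node name from the examples below):

kubectl describe node k8s-node1    # the Allocatable section and the "Allocated resources" summary show capacity vs. current requests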

nodeSelector & nodeAffinity

nodeSelector: used to schedule the Pod onto a Node whose labels match. If no Node has a matching label, scheduling fails.

Purpose:

  • Constrain the Pod to run on a specific node
  • Exactly match node labels

Application scenario:

  • Private Node: manage nodes in groups according to business lines
  • Equipped with special hardware: some nodes are equipped with SSD hard disk and GPU

Example: ensure the Pod is assigned to a node that has an SSD disk

Step 1: add labels to nodes

format: kubectl label nodes <node-name> <label-key>=<label-value>
for example: kubectl label nodes k8s-node1 disktype=ssd
verification: kubectl get nodes --show-labels

Step 2: add the nodeSelector field to the Pod configuration

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
spec:
  nodeSelector:
    disktype: "ssd"
  containers:
  - name: nginx
    image: nginx:1.19

Finally, verify the result:

kubectl get pods -o wide

nodeAffinity: node affinity, which serves the same purpose as nodeSelector but is more flexible and supports richer conditions, such as:

  • Matching supports more logical combinations, not just exact string equality
  • Scheduling rules can be hard or soft rather than only hard requirements
    • Required: must be satisfied
    • Preferred: satisfied on a best-effort basis, not guaranteed

Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - nvidia-tesla
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: group
            operator: In
            values:
            - ai
  containers:
  - name: web
    image: nginx
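
Operators other than In work the same way. As a hedged fragment that would slot under spec.affinity.nodeAffinity (the cpu-core-count label is made up for illustration):

      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu               # Exists only checks that the label is present; values must be omitted
            operator: Exists
          - key: cpu-core-count    # hypothetical label; Gt/Lt compare the label value as an integer
            operator: Gt
            values:
            - "16"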

Taints & Tolerations

Taints: prevent Pods from being scheduled onto specific Nodes
Tolerations: allow a Pod to be scheduled onto a Node that carries taints

Application scenario:

  • Private Node: nodes are grouped and managed by business line; such a Node should not be schedulable by default, and Pods are placed there only when a matching toleration is configured
  • Equipped with special hardware: some nodes have SSD disks or GPUs; such a Node should not be schedulable by default, and Pods are placed there only when a matching toleration is configured
  • Taint-based eviction

Step 1: add a taint to the node

format: kubectl taint node [node] key=value:[effect]
for example: kubectl taint node k8s-node1 gpu=yes:NoSchedule
verification: kubectl describe node k8s-node1 | grep Taint

Where [effect] can be one of:

  • NoSchedule: Pods that do not tolerate the taint will never be scheduled here
  • PreferNoSchedule: the soft version of NoSchedule; the scheduler tries to avoid scheduling intolerant Pods here but does not guarantee it
  • NoExecute: new Pods are not scheduled here, and existing Pods on the Node that do not tolerate the taint are evicted

Step 2: add the tolerations field to the Pod configuration

apiVersion: v1
kind: Pod
metadata:
  name: pod-taints
spec:
  containers:
  - name: pod-taints
    image: busybox:latest
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "yes"
    effect: "NoSchedule"

Remove a taint:

kubectl taint node [node] key:[effect]-

nodeName

nodeName: specifies a Node name; the Pod is bound directly to that Node, bypassing the scheduler

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  labels:
    app: nginx
spec:
  nodeName: k8s-node2
  containers:
  - name: nginx
    image: nginx:1.15
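
To confirm which Node the Pod was bound to (pod name taken from the example above):

kubectl get pod pod-example -o jsonpath='{.spec.nodeName}'    # prints the bound node, here k8s-node2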

Resource scheduling

The scheduling strategies in Kubernetes fall into two groups: global scheduling policies, configured when the scheduler starts, and runtime scheduling policies, which mainly include nodeSelector, node affinity, and pod affinity/anti-affinity. The node affinity, podAffinity/podAntiAffinity, and taints and tolerations features introduced below were in beta as of Kubernetes 1.6.

Set node label

Label is one of the core concepts of Kubernetes. Labels are attached as key/value pairs to objects such as Pod, Service, Deployment, and Node to identify those objects and manage associations between them, such as the association between Node and Pod.
Get all nodes in the current cluster:

[root@master ~]# kubectl  get nodes
NAME                 STATUS   ROLES                  AGE    VERSION
master.example.com   Ready    control-plane,master   2d2h   v1.23.1
node1.example.com    Ready    <none>                 2d2h   v1.23.1
node2.example.com    Ready    <none>                 2d2h   v1.23.1

View node default label:

[root@master ~]# kubectl get node node1.example.com --show-labels
NAME                STATUS   ROLES    AGE    VERSION   LABELS
node1.example.com   Ready    <none>   2d2h   v1.23.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1.example.com,kubernetes.io/os=linux

Set label for the specified node:

[root@master ~]# kubectl  label node node1.example.com disktype=ssd
node/node1.example.com labeled

Confirm whether the node label is set successfully:

[root@master ~]# kubectl get node node1.example.com --show-labels
NAME                STATUS   ROLES    AGE    VERSION   LABELS
node1.example.com   Ready    <none>   2d2h   v1.23.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1.example.com,kubernetes.io/os=linux

[root@master ~]# kubectl get nodes -l disktype=ssd
NAME                STATUS   ROLES    AGE    VERSION
node1.example.com   Ready    <none>   2d2h   v1.23.1
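
If the label is no longer needed, appending a minus sign to the key removes it:

kubectl label node node1.example.com disktype-    # deletes the disktype label from the node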

Select node (nodeSelector)

nodeSelector is currently the simplest runtime scheduling constraint for Pods and is available in Kubernetes 1.7.x and earlier. Pod.spec.nodeSelector selects nodes through the Kubernetes label-selector mechanism: the scheduler matches the labels and places the Pod on the target node, as a mandatory constraint. The nodeAffinity feature described later covers all of nodeSelector's functionality, so Kubernetes intends to deprecate nodeSelector in the future.
nodeSelector example:

Set label

[root@master ~]# kubectl  label node node1.example.com disktype=ssd
node/node1.example.com labeled

List nodes that are not master nodes and whose disktype is ssd:

[root@master ~]# kubectl get nodes -l 'role!=master, disktype=ssd'
NAME                STATUS   ROLES    AGE    VERSION
node1.example.com   Ready    <none>   2d2h   v1.23.1

pod.yml file content:

[root@master ~]# vi pod.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd

To create a pod:

[root@master ~]# kubectl apply -f pod.yml 
pod/nginx created

Check that pod nginx is scheduled to run on the expected node:

[root@master ~]# kubectl get pod nginx -o  wide
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          9m35s   10.244.1.31   node1.example.com   <none>           <none>

Note: if it is not the default namespace, you need to specify a specific namespace, for example:

kubectl -n kube-system get pods -o wide
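
If no node carried the disktype=ssd label, the Pod would instead stay in Pending; the reason can be read from the Pod's events:

kubectl describe pod nginx    # the Events section records a FailedScheduling message when no node matches the nodeSelector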

Built in label example

yaml file content:

[root@master ~]# vi pod.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    kubernetes.io/hostname: node1.example.com

Create the Pod and check that the result meets expectations: the Pod is scheduled onto the preset node node1.example.com.

[root@master ~]# kubectl apply -f pod.yml 
pod/nginx unchanged
[root@master ~]# kubectl get pod nginx -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          63s   10.244.1.32   node1.example.com   <none>           <none>

Affinity and anti-affinity

The nodeSelector described earlier restricts Pod scheduling to specified nodes in a very simple way, purely by label. Affinity and anti-affinity are more flexible in steering Pods to the expected nodes. Compared with nodeSelector, affinity and anti-affinity have the following advantages:

  • The expression syntax is more varied and no longer limited to exact, mandatory matching.
  • Scheduling rules can be soft limits or preferences rather than only hard constraints.
  • You can specify which Pods may be placed in the same or different topology domains.

Affinity is mainly divided into three types: nodeAffinity and inter-Pod affinity/anti-affinity (podAffinity/podAntiAffinity); node affinity is described in detail below.
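
Node affinity and taints are demonstrated in the sections that follow; inter-Pod affinity/anti-affinity is not shown there, so here is a hedged sketch of the fields involved (the app labels and the topology key are chosen purely for illustration):

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
        topologyKey: kubernetes.io/hostname    # must run on the same node as Pods labeled app=cache
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - web
          topologyKey: kubernetes.io/hostname  # prefer nodes that are not already running app=web Pods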

Node affinity

Node affinity was introduced as alpha in Kubernetes 1.2 and covers the nodeSelector functionality. It comes in two forms: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. The former is a hard constraint: if a node's labels change so that it no longer satisfies the Pod's scheduling requirements, the Pod will fail to be scheduled there. The latter is a soft limit or preference: if a node's labels change so that it no longer satisfies the requirement, the Pod will still be scheduled and keep running.

Node affinity example

Set node label:

[root@master ~]# kubectl label nodes node1.example.com  cpu=high
node/node1.example.com labeled

[root@master ~]# kubectl  label node node1.example.com disktype=ssd
node/node1.example.com labeled


[root@master ~]# kubectl label nodes node2.example.com  cpu=low
node/node2.example.com labeled

We expect to deploy the Pod to a machine with an SSD disk (disktype=ssd) and a high CPU configuration (cpu=high).
To view nodes that meet the criteria:

[root@master ~]# kubectl get nodes -l 'cpu=high, disktype=ssd'
NAME                STATUS   ROLES    AGE    VERSION
node1.example.com   Ready    <none>   2d3h   v1.23.1

pod.yml file contents are as follows:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: cpu
            operator: In
            values:
            - high
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent

The result matches expectations: Pod nginx is successfully deployed to the machine with an SSD disk and a high CPU configuration.

[root@master ~]# kubectl  get pod nginx -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          27s   10.244.1.33   node1.example.com  

Taints and tolerations

With node affinity, both the hard and the preferred forms pull Pods toward the expected nodes; taints do the opposite. If a node is marked with taints, Pods will not be scheduled onto it unless they declare a matching toleration. Taints and tolerations are currently in beta.
Typical application scenarios for tainted nodes: reserving the Kubernetes master node for Kubernetes system components, or reserving a group of nodes with special resources for particular Pods, so that ordinary Pods are no longer scheduled onto the tainted nodes.
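
On a kubeadm cluster like the one used in this article, the master node usually already carries such a taint, which can be checked directly (the exact taint name varies by version):

kubectl describe node master.example.com | grep -i taint    # e.g. node-role.kubernetes.io/master:NoSchedule on v1.23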

An example of tainting a node:

[root@master ~]# kubectl taint node node1.example.com cpu=high:NoSchedule
node/node1.example.com tainted


[root@master ~]# kubectl apply -f pod.yml 
pod/nginx created
[root@master ~]# kubectl  get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          6s
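
The reason the Pod is stuck in Pending can be read from its events (exact wording depends on the Kubernetes version):

kubectl describe pod nginx    # Events show a FailedScheduling message referencing the untolerated cpu=high taint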

If you still want a Pod to be scheduled onto the tainted node, you must add a toleration to its spec, for example:

[root@master ~]# vim pod.yml 
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: cpu
            operator: In
            values:
            - high
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "cpu"
    operator: "Equal"
    value: "high"
    effect: "NoSchedule"
    

[root@master ~]# kubectl apply -f pod.yml 
pod/nginx configured
[root@master ~]# kubectl  get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          3m48s

effect has three options, which can be set according to actual needs:

  • NoSchedule: Pods will not be scheduled to nodes marked with the taint.
  • PreferNoSchedule: the "preference" or "soft" version of NoSchedule.
  • NoExecute: once the taint takes effect, Pods already running on the node that have no matching toleration are evicted immediately.

Topics: Kubernetes Container Cloud Native