Pod health checks: the Kubernetes probe mechanism

Posted by Zmodem on Mon, 09 Dec 2019 10:28:44 +0100

1. Where the requirement comes from:

First, let's look at where this requirement comes from: when you migrate an application to Kubernetes, how do you ensure it stays healthy and stable? In fact, it is fairly simple; you can strengthen it in two ways:

1. First, improve the observability of the application;
2. Second, improve the resilience of the application.

Observability can be enhanced in three ways:

1. First, observe the health status of the application in real time;
2. Second, obtain the resource usage of the application;
3. Third, obtain real-time logs of the application to diagnose and analyze problems.

When a problem occurs, the first thing to do is reduce its scope of impact, then debug and diagnose it. Ideally, when problems arise, they can be fully recovered through the self-healing mechanism built into Kubernetes.

Kubernetes provides two probe types: livenessProbe and readinessProbe.

  • livenessProbe (liveness probe): a user-defined rule that determines whether the container is healthy. If the liveness probe decides the container is unhealthy, the kubelet kills the container, and the restart policy determines whether it is restarted. If no liveness probe is configured, the probe is assumed to return success by default.

  • readinessProbe (readiness probe): determines whether the container has finished starting up, i.e., whether the pod's condition matches the expected Ready state. If a probe fails, the pod is removed from the service's Endpoints, that is, taken out of the access layer (marked unavailable), and is not added back to the corresponding endpoint until a later probe succeeds.

What is an Endpoint?
Endpoints is a resource object in the Kubernetes cluster, stored in etcd, that records the access addresses of all pods backing a service.
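For illustration, an Endpoints object for the web-svc service used later in this article might look roughly like this (the pod IP is hypothetical; Kubernetes creates and maintains this object automatically):

```yaml
# Hypothetical Endpoints object; normally managed by Kubernetes itself
# as pods matching the service selector pass their readiness probes.
apiVersion: v1
kind: Endpoints
metadata:
  name: web-svc        # must match the Service name
subsets:
- addresses:
  - ip: 10.244.1.11    # pod IPs currently considered ready
  ports:
  - port: 80
```

You can inspect it with `kubectl get endpoints web-svc`; pods that fail their readiness probe disappear from the addresses list.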

2. Use cases for the liveness and readiness probes:
The liveness probe suits applications that can recover by being restarted, while the readiness probe suits applications that cannot serve traffic immediately after startup.

3. Similarities and differences between the liveness and readiness probes:
Both check the health of a pod based on a file or application inside it. The difference is what happens on failure: liveness restarts the container when a probe fails, whereas readiness marks the pod unavailable after three consecutive failures (the default failureThreshold) and does not restart it.

4. The liveness and readiness probes support three detection methods:

  • 1. httpGet: sends an HTTP GET request; the application is considered healthy when the returned status code is between 200 and 399.
  • 2. exec: executes a command inside the container; the container is considered healthy when the command exits with code 0.
  • 3. tcpSocket: performs a TCP health check against the container's IP and port; the container is considered healthy if the TCP connection can be established.

The first and third methods are very similar; in practice the first and second are the most commonly used.
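The article walks through exec and httpGet examples below but shows no tcpSocket one; a minimal sketch of a tcpSocket liveness probe (the port value is illustrative) might look like this:

```yaml
# Sketch of a tcpSocket liveness probe: the kubelet tries to open a TCP
# connection to port 80 of the container; if the connection cannot be
# established, the probe fails and the container is restarted.
livenessProbe:
  tcpSocket:
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 5
```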

5. Examples of using the probe mechanism:

1. livenessProbe:

Method 1: use the exec method to check whether a specified file exists in the pod. If it exists, the container is considered healthy; otherwise, the pod is restarted according to its restart policy.

## Pod configuration file:

[root@sqm-master yaml]# vim livenss.yaml
kind: Pod
apiVersion: v1
metadata:
  name: liveness
  labels:
    name: liveness
spec:
  restartPolicy: OnFailure    ##Restart the container only when it exits with an error
  containers:
  - name: liveness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300  #Create a file and delete it after 30 seconds
    livenessProbe:      #Perform activity detection
      exec:
        command:
        - cat              #Check for the /tmp/test file: if it exists the container is healthy; if not, the restart policy kicks in
        - /tmp/test
      initialDelaySeconds: 10        #Seconds to wait after the container starts before the first probe
      periodSeconds: 5     #Probe every 5 seconds

Other optional fields in the probe spec:

  • initialDelaySeconds: seconds to wait after the container starts before the first probe is executed.
  • periodSeconds: how often to perform the probe. Default 10 seconds, minimum 1 second.
  • timeoutSeconds: probe timeout. Default 1 second, minimum 1 second.
  • successThreshold: minimum number of consecutive successes after a failure for the probe to be considered successful again. Default 1; must be 1 for liveness; minimum 1.
  • failureThreshold: number of consecutive failures after a success before the probe is considered failed. Default 3, minimum 1.
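For illustration, a probe spec that sets all of these optional fields might look like this (the values are illustrative):

```yaml
livenessProbe:
  exec:
    command: ["cat", "/tmp/test"]
  initialDelaySeconds: 10   # wait 10s after container start
  periodSeconds: 5          # probe every 5s
  timeoutSeconds: 1         # each probe must answer within 1s
  successThreshold: 1       # must be 1 for liveness
  failureThreshold: 3       # 3 consecutive failures -> restart
```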
//Run the pod for testing:
[root@sqm-master yaml]# kubectl  apply -f  livenss.yaml 
pod/liveness created

//Monitor the status of the pod:
Probing starts 10 seconds after the container starts and repeats every 5 seconds.

We can see that the pod keeps restarting; the RESTARTS column shows seven restarts since the pod was started with the command:

/bin/sh -c "touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300"

The cat /tmp/test command returns a success exit code during the first 30 seconds of the container's life. After 30 seconds, cat /tmp/test returns a failure exit code, which triggers the pod's restart policy.
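The exec probe only looks at the command's exit code, which you can reproduce locally (using the same path as the example above):

```shell
# While /tmp/test exists, cat exits with code 0 -> probe success
touch /tmp/test
cat /tmp/test; echo "exit code: $?"

# After the file is removed, cat exits non-zero -> probe failure
rm -f /tmp/test
cat /tmp/test 2>/dev/null; echo "exit code: $?"
```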

//Let's take a look at pod's Events information:
[root@sqm-master ~]# kubectl  describe  pod liveness 


As you can see from the events above, the probe failed because the file was not found in the specified directory, so the container is restarted.

Method 2: run a web service and use httpGet probing to check whether a specified file exists under the web root, which is equivalent to "curl -I <container IP>/healthy". (The path / here refers to the home directory of the web service in the container.)

//pod yaml file:

[root@sqm-master yaml]# vim http-livenss.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    name: mynginx
spec:
  restartPolicy: OnFailure      #Define pod restart policy
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:       #Define the detection mechanism
      httpGet:               #Probe as httpGet
        scheme: HTTP    #Specify the protocol
        path: /healthy       #File under the specified path; if it does not exist, the probe fails
        port: 80
      initialDelaySeconds: 10       #Seconds to wait after the container starts before the first probe
      periodSeconds: 5       #Probe every 5 seconds
---
apiVersion: v1       #Associate a service object
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    name: mynginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

The httpGet detection method has the following optional control fields:

  • host: the host name to connect to; defaults to the pod's IP. You may want to set "Host" in httpHeaders instead of using the IP.
  • scheme: the scheme used for the connection; default HTTP.
  • path: the path on the HTTP server to access.
  • httpHeaders: custom request headers. HTTP allows repeated headers.
  • port: the name or number of the container port to access. A port number must be between 1 and 65535.
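As a sketch, an httpGet probe using these optional fields could look like this (the header name and value are made up for illustration):

```yaml
livenessProbe:
  httpGet:
    scheme: HTTP
    path: /healthy
    port: 80
    httpHeaders:             # custom request headers
    - name: X-Custom-Header  # illustrative header name
      value: check
  initialDelaySeconds: 10
  periodSeconds: 5
```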
//Run the pod:
[root@sqm-master yaml]# kubectl  apply -f  http-livenss.yaml 
pod/web created
service/web-svc created

## View the pod during its first 10 seconds:

For the first 10 seconds the container is alive and returns status code 200.

After 10 seconds, when the probe mechanism starts probing, look at the pod again:

//Viewed pod events:
[root@sqm-master yaml]# kubectl describe  pod web 

You can see that the returned status code is 404: the specified file was not found under the web root, so the probe failed. The pod was restarted four times and its status is Completed, which shows the pod has a problem.

2) Next, let's make the probe succeed:
Modify the pod configuration file:
[root@sqm-master yaml]# vim http-livenss.yaml
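The original post does not show the edit itself. One way to make the probe succeed, assuming the stock nginx image, is to point the probe at a file nginx actually serves, e.g. changing the path:

```yaml
    livenessProbe:
      httpGet:
        scheme: HTTP
        path: /index.html   # nginx serves this file by default, so the probe gets a 200
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5
```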

//Rerun pod:
[root@sqm-master yaml]# kubectl  delete -f  http-livenss.yaml 
pod "web" deleted
service "web-svc" deleted
[root@sqm-master yaml]# kubectl apply -f  http-livenss.yaml 
pod/web created
service/web-svc created
//Finally, we look at the status of the pod and the Events information:
[root@sqm-master yaml]# kubectl  get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
web    1/1     Running   0          5s    10.244.1.11   node01   <none>           <none>
[root@sqm-master yaml]# kubectl  describe  pod web 
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  71s   default-scheduler  Successfully assigned default/web to node01
  Normal  Pulling    71s   kubelet, node01    Pulling image "nginx"
  Normal  Pulled     70s   kubelet, node01    Successfully pulled image "nginx"
  Normal  Created    70s   kubelet, node01    Created container nginx
  Normal  Started    70s   kubelet, node01    Started container nginx

From the pod's state and events, it is now working.

##Test the page header information:
[root@sqm-master yaml]# curl -I 10.244.1.11


The returned status code is 200, indicating the pod is healthy.

2. readinessProbe:

Method 1: use exec to check whether a file exists, the same as for liveness.
//The pod's configuration file is as follows:

[root@sqm-master yaml]# vim readiness.yaml
kind: Pod
apiVersion: v1
metadata:
  name: readiness
  labels:
    name: readiness
spec:
  restartPolicy: OnFailure
  containers:
  - name: readiness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300;
    readinessProbe:   #Define readiness detection method
      exec:
        command:
        - cat
        - /tmp/test
      initialDelaySeconds: 10
      periodSeconds: 5
//Run the pod:
[root@sqm-master yaml]# kubectl apply -f  readiness.yaml 
pod/readiness created

//Check the status of the pod:

//View pod's Events:
[root@sqm-master yaml]# kubectl  describe  pod readiness 


You can see that the file could not be found, so the probe failed. Unlike the liveness mechanism, readiness does not restart the pod; instead, after three consecutive probe failures it sets the container to an unavailable state.

Method 2: httpGet mode.

[root@sqm-master yaml]# vim http-readiness.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web2
  labels:
    name: web2
spec:
  containers:
  - name: web2
    image: nginx
    ports:
    - containerPort: 81
    readinessProbe:
      httpGet:
        scheme: HTTP    #Specify the protocol
        path: /healthy    #Specify the path; if the file does not exist, the probe fails
        port: 81
      initialDelaySeconds: 10
      periodSeconds: 5
---     
apiVersion: v1       
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    name: web2
  ports:
  - protocol: TCP
    port: 81             
    targetPort: 81
//Run pod:
[root@sqm-master yaml]# kubectl apply -f  http-readiness.yaml 
pod/web2 created
service/web-svc created
//View the status of the pod:
[root@sqm-master yaml]# kubectl  get pod -o wide
NAME        READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
readiness   0/1     Completed   0          37m     10.244.2.12   node02   <none>           <none>
web         1/1     Running     0          50m     10.244.1.11   node01   <none>           <none>
web2        0/1     Running     0          2m31s   10.244.1.14   node01   <none>           <none>


From the pod's Events you can see that the HTTP probe failed, so the pod is considered unhealthy.
Instead of restarting it, the readiness probe simply sets the pod to an unavailable state.

Using health probes during rolling updates:

First, use kubectl explain to look at the fields that control the update:

[root@sqm-master ~]#  kubectl  explain deploy.spec.strategy.rollingUpdate


You can see that two parameters are available during a rolling update:

  • maxSurge: controls by how many pods the total may exceed the desired replica count during a rolling update. It can be a percentage or an absolute value, defaulting to 1. If set to 3, up to three extra pods are created during the update (and, of course, each is verified by the probe mechanism before it counts as successfully updated). The larger the value, the faster the upgrade, but the more system resources it consumes.
  • maxUnavailable: controls how many pods may be unavailable during a rolling update. Note that pods removed from the original set do not count against the maxSurge allowance. If set to 3, up to three pods may be unavailable during the upgrade if probes fail. The larger the value, the faster the upgrade, but the lower the availability during it.

Scenarios for maxSurge and maxUnavailable:
1. If you want to upgrade as quickly as possible while keeping the system available and stable, set maxUnavailable to 0 and give maxSurge a larger value.
2. If system resources are tight and the pod load is low, set maxSurge to 0 and give maxUnavailable a larger value to speed up the upgrade. Note that if maxSurge is 0 and maxUnavailable equals DESIRED, the entire service may become unavailable; at that point the rolling update degenerates into a downtime release.
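The bounds these two parameters impose can be computed directly; with the values used later in this article (replicas=10, maxSurge=3, maxUnavailable=3):

```shell
replicas=10; maxSurge=3; maxUnavailable=3

# Upper bound on total pods during the update: desired + maxSurge
echo "max total pods:     $((replicas + maxSurge))"       # prints 13

# Lower bound on available pods: desired - maxUnavailable
echo "min available pods: $((replicas - maxUnavailable))" # prints 7
```

These two numbers match the 13 total and 7 available pods observed in the failed update below.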
1) First we create a deployment resource object:

[root@sqm-master ~]# vim app.v1.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec:
  replicas: 10              #Define 10 copies
  template:
    metadata:
      labels:
        name: my-web
    spec:
       containers:
        - name: my-web
          image: nginx
          args:
          - /bin/sh
          - -c
          - touch /usr/share/nginx/html/test.html; sleep 300000; #Create a file to keep the pod healthy while probing
          ports:
          - containerPort: 80
          readinessProbe:       #Using readiness mechanism
            exec:
              command:
              - cat
              - /usr/share/nginx/html/test.html
            initialDelaySeconds: 10
            periodSeconds: 10
//After running the pod, check the number of pods (10):
[root@sqm-master yaml]# kubectl  get pod -o wide
NAME                      READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7bbd55db99-2g6tp   1/1     Running     0          2m11s   10.244.2.44   node02   <none>           <none>
my-web-7bbd55db99-2jdbz   1/1     Running     0          118s    10.244.2.45   node02   <none>           <none>
my-web-7bbd55db99-5mhcv   1/1     Running     0          2m53s   10.244.1.40   node01   <none>           <none>
my-web-7bbd55db99-77b4v   1/1     Running     0          2m      10.244.1.44   node01   <none>           <none>
my-web-7bbd55db99-h888n   1/1     Running     0          2m53s   10.244.2.41   node02   <none>           <none>
my-web-7bbd55db99-j5tgz   1/1     Running     0          2m38s   10.244.2.42   node02   <none>           <none>
my-web-7bbd55db99-kjgm2   1/1     Running     0          2m25s   10.244.1.42   node01   <none>           <none>
my-web-7bbd55db99-kkmh2   1/1     Running     0          2m38s   10.244.1.41   node01   <none>           <none>
my-web-7bbd55db99-lr896   1/1     Running     0          2m13s   10.244.1.43   node01   <none>           <none>
my-web-7bbd55db99-rpd8v   1/1     Running     0          2m23s   10.244.2.43   node02   <none>       

The probes succeeded and all 10 replicas are running.

2) First update:
Update the nginx image version and set the rolling update policy:

[root@sqm-master yaml]# vim app.v1.yaml 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec: 
  strategy:       #Set the rolling update policy via the rollingUpdate sub-field
    rollingUpdate:
      maxSurge: 3                 #Up to three extra pods may be created during the rolling update
      maxUnavailable: 3     #Up to three pods may be unavailable during the rolling update
  replicas: 10              
  template:
    metadata:
      labels:
        name: my-web
    spec:
       containers:
        - name: my-web
          image: 172.16.1.30:5000/nginx:v2.0    #The updated image is nginx:v2.0 from the private registry
          args:
          - /bin/sh
          - -c
          - touch /usr/share/nginx/html/test.html; sleep 300000; 
          ports:
          - containerPort: 80
          readinessProbe:       
            exec:
              command:
              - cat
              - /usr/share/nginx/html/test.html
            initialDelaySeconds: 10
            periodSeconds: 10
//After applying the yaml file, check the number of pods:
[root@sqm-master yaml]# kubectl  get pod -o wide
NAME                      READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7db8b88b94-468zv   1/1     Running     0          3m38s   10.244.2.57   node02   <none>           <none>
my-web-7db8b88b94-bvszs   1/1     Running     0          3m24s   10.244.1.60   node01   <none>           <none>
my-web-7db8b88b94-c4xvv   1/1     Running     0          3m38s   10.244.2.55   node02   <none>           <none>
my-web-7db8b88b94-d5fvc   1/1     Running     0          3m38s   10.244.1.58   node01   <none>           <none>
my-web-7db8b88b94-lw6nh   1/1     Running     0          3m21s   10.244.2.59   node02   <none>           <none>
my-web-7db8b88b94-m9gbh   1/1     Running     0          3m38s   10.244.1.57   node01   <none>           <none>
my-web-7db8b88b94-q5dqc   1/1     Running     0          3m38s   10.244.1.59   node01   <none>           <none>
my-web-7db8b88b94-tsbmm   1/1     Running     0          3m38s   10.244.2.56   node02   <none>           <none>
my-web-7db8b88b94-v5q2s   1/1     Running     0          3m21s   10.244.1.61   node01   <none>           <none>
my-web-7db8b88b94-wlgwb   1/1     Running     0          3m25s   10.244.2.58   node02   <none>           <none>
//View pod version information:
[root@sqm-master yaml]# kubectl  get deployments. -o wide
NAME     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                        SELECTOR
my-web   10/10   10           10          49m   my-web       172.16.1.30:5000/nginx:v2.0   name=my-web

The probes succeeded and all 10 pods were updated to the new version.

3) Second update:
Update the image version to 3.0 and set the rolling update policy. (The probe will fail.)
The pod configuration file is as follows:

[root@sqm-master yaml]# vim app.v1.yaml 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec:
  strategy:
    rollingUpdate:
      maxSurge: 3                  #Keep the update strategy values at 3
      maxUnavailable: 3
  replicas: 10                       #The number of pods is still 10
  template:
    metadata:
      labels:
        name: my-web
    spec:
       containers:
        - name: my-web
          image: 172.16.1.30:5000/nginx:v3.0   #Image version updated to v3.0 for this test
          args:
          - /bin/sh
          - -c
          - sleep 300000;        #The specified file is not created, so the probe will fail
          ports:
          - containerPort: 80
          readinessProbe:       
            exec:
              command:
              - cat
              - /usr/share/nginx/html/test.html
            initialDelaySeconds: 10
            periodSeconds: 5
//Re-apply the pod configuration file:
[root@sqm-master yaml]# kubectl apply -f  app.v1.yaml 
deployment.extensions/my-web configured
//Check how many pods have been updated:
[root@sqm-master yaml]# kubectl  get pod  -o wide
NAME                      READY   STATUS      RESTARTS   AGE    IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7db8b88b94-468zv   1/1     Running     0          12m    10.244.2.57   node02   <none>           <none>
my-web-7db8b88b94-c4xvv   1/1     Running     0          12m    10.244.2.55   node02   <none>           <none>
my-web-7db8b88b94-d5fvc   1/1     Running     0          12m    10.244.1.58   node01   <none>           <none>
my-web-7db8b88b94-m9gbh   1/1     Running     0          12m    10.244.1.57   node01   <none>           <none>
my-web-7db8b88b94-q5dqc   1/1     Running     0          12m    10.244.1.59   node01   <none>           <none>
my-web-7db8b88b94-tsbmm   1/1     Running     0          12m    10.244.2.56   node02   <none>           <none>
my-web-7db8b88b94-wlgwb   1/1     Running     0          12m    10.244.2.58   node02   <none>           <none>
my-web-849cc47979-2g59w   0/1     Running     0          3m9s   10.244.1.63   node01   <none>           <none>
my-web-849cc47979-2lkb6   0/1     Running     0          3m9s   10.244.1.64   node01   <none>           <none>
my-web-849cc47979-762vb   0/1     Running     0          3m9s   10.244.1.62   node01   <none>           <none>
my-web-849cc47979-dv7x8   0/1     Running     0          3m9s   10.244.2.61   node02   <none>           <none>
my-web-849cc47979-j6nwz   0/1     Running     0          3m9s   10.244.2.60   node02   <none>           <none>
my-web-849cc47979-v5h7h   0/1     Running     0          3m9s   10.244.2.62   node02   <none>           <none>

We can see that the total number of pods is now 13 (the desired 10 plus the 3 extra allowed by maxSurge). Because the probes fail, the 6 new-version pods never become ready; but since maxUnavailable is 3, at most 3 of the original pods are taken down, leaving 7 pods still available. Note, however, that those 7 available pods are still the previous version; they were never updated.

//View pod's updated version information:
[root@sqm-master yaml]# kubectl  get deployments. -o wide
NAME     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                        SELECTOR
my-web   7/10    6            7           58m   my-web       172.16.1.30:5000/nginx:v3.0   name=my-web

Explanation of the columns:
READY: number of ready pods / desired number of pods
UP-TO-DATE: number of pods updated to the new version
AVAILABLE: number of pods currently available

We can see that 6 pods have been updated to the new image (including the 3 extra surge pods), but none of them are available; 7 pods remain available, though they are still running the old version.

Summary:
What does the probe mechanism do during a rolling update?
If you update an application's pods without a probe mechanism, every pod is replaced whether or not the new version is actually ready, which can have serious consequences. After the update the pod status may look normal: to satisfy the controller manager's desired state, READY still shows 1/1, but each pod is a newly created one, so any data inside the old pods is lost.
With a probe mechanism, the probe checks whether the files or applications you specified exist in the container. Only when the condition you set is met does the probe succeed and the pod count as updated; if the probe fails, the new pods are set to an unavailable state, and pods of the previous version remain available to keep the company's service running normally. That is how important the probe mechanism is.
