Adding k8s nodes with kubeadm

Posted by usvpn on Sat, 15 Jan 2022 15:42:46 +0100

From the pod listing below, you can see that there are problems in three places: etcd-master1, kube-apiserver-master1, and kube-flannel-ds-42z5p.

[root@master3 ~]# kubectl get pods   -n kube-system -o wide
NAME                              READY   STATUS             RESTARTS   AGE    IP             NODE              NOMINATED NODE   READINESS GATES
coredns-546565776c-m96fb          1/1     Running            0          46d    10.244.1.3     master2           <none>           <none>
coredns-546565776c-thczd          1/1     Running            0          44d    10.244.2.2     master3           <none>           <none>
etcd-master1                      0/1     CrashLoopBackOff   21345      124d   10.128.4.164   master1           <none>           <none>
etcd-master2                      1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
etcd-master3                      1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-apiserver-master1            0/1     CrashLoopBackOff   21349      124d   10.128.4.164   master1           <none>           <none>
kube-apiserver-master2            1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
kube-apiserver-master3            1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-controller-manager-master1   1/1     Running            11         124d   10.128.4.164   master1           <none>           <none>
kube-controller-manager-master2   1/1     Running            2          124d   10.128.4.251   master2           <none>           <none>
kube-controller-manager-master3   1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-flannel-ds-42z5p             0/1     Error              1568       6d2h   10.128.2.173   bg7.test.com.cn   <none>           <none>
kube-flannel-ds-6g59q             1/1     Running            7          43d    10.128.4.8     wd8.test.com.cn   <none>           <none>
kube-flannel-ds-85hxd             1/1     Running            3          123d   10.128.4.107   wd6.test.com.cn   <none>           <none>
kube-flannel-ds-brd8d             1/1     Running            1          33d    10.128.4.160   wd9.test.com.cn   <none>           <none>
kube-flannel-ds-gmmhx             1/1     Running            3          124d   10.128.4.82    wd5.test.com.cn   <none>           <none>
kube-flannel-ds-lj4g2             1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
kube-flannel-ds-n68dn             1/1     Running            11         124d   10.128.4.164   master1           <none>           <none>
kube-flannel-ds-ppnd7             1/1     Running            4          124d   10.128.4.191   wd4.test.com.cn   <none>           <none>
kube-flannel-ds-tf9lk             1/1     Running            0          33d    10.128.4.170   wd7.test.com.cn   <none>           <none>
kube-flannel-ds-vt5nh             1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-proxy-622c7                  1/1     Running            11         124d   10.128.4.164   master1           <none>           <none>
kube-proxy-7bp72                  1/1     Running            0          7d4h   10.128.2.173   bg7.test.com.cn   <none>           <none>
kube-proxy-8cx5q                  1/1     Running            4          123d   10.128.4.107   wd6.test.com.cn   <none>           <none>
kube-proxy-h2qh5                  1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-proxy-kpkm4                  1/1     Running            7          43d    10.128.4.8     wd8.test.com.cn   <none>           <none>
kube-proxy-lp74p                  1/1     Running            1          33d    10.128.4.160   wd9.test.com.cn   <none>           <none>
kube-proxy-nwsnm                  1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
kube-proxy-psjll                  1/1     Running            4          124d   10.128.4.82    wd5.test.com.cn   <none>           <none>
kube-proxy-v6x42                  1/1     Running            0          33d    10.128.4.170   wd7.test.com.cn   <none>           <none>
kube-proxy-vdfmz                  1/1     Running            4          124d   10.128.4.191   wd4.test.com.cn   <none>           <none>
kube-scheduler-master1            1/1     Running            11         124d   10.128.4.164   master1           <none>           <none>
kube-scheduler-master2            1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
kube-scheduler-master3            1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kuboard-7986796cf8-2g6bs          1/1     Running            0          44d    10.244.1.4     master2           <none>           <none>
metrics-server-677dcb8b4d-pshqw   1/1     Running            0          44d    10.128.4.191   wd4.test.com.cn   <none>           <none>

1. The flannel problem
Reference: resolving the CrashLoopBackOff state of the flannel component in a k8s cluster. That article traces the problem to the ipvs kernel modules not being loaded; you can check with lsmod | grep ip_vs whether they loaded successfully.

[root@master3 net.d]# cat /etc/sysconfig/modules/ipvs.modules
#!/bin/sh
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack_ipv4
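
If the modules are the issue, a minimal sketch for loading them and checking the result (assuming the ipvs.modules script above is already in place):

chmod 755 /etc/sysconfig/modules/ipvs.modules
bash /etc/sysconfig/modules/ipvs.modules
# The ip_vs and conntrack modules should now show up
lsmod | grep -e ip_vs -e nf_conntrack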

But the failure I am seeing here is different:

[root@master3 ~]# kubectl logs kube-flannel-ds-42z5p -n kube-system
I0714 08:58:00.590712       1 main.go:519] Determining IP address of default interface
I0714 08:58:00.687885       1 main.go:532] Using interface with name eth0 and address 10.128.2.173
I0714 08:58:00.687920       1 main.go:549] Defaulting external address to interface address (10.128.2.173)
W0714 08:58:00.687965       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0714 08:58:30.689584       1 main.go:250] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-42z5p': Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-42z5p": dial tcp 10.96.0.1:443: i/o timeout

Reading materials:
Troubleshooting flannel installation in k8s: failed to create subnet manager: error retrieving pod spec for: the server doe
Installing Kubernetes 1.6.1 with flannel using kubeadm on Ubuntu 16.04
Quickly deploy a K8s cluster with kubeadm

Checking the cluster: on a worker node that does not have the problem, the flanneld process is running:

[root@wd5 ~]# ps -ef|grep flannel
root      8359 28328  0 17:13 pts/0    00:00:00 grep --color=auto flannel
root     22735 22714  0 May31 ?        00:26:16 /opt/bin/flanneld --ip-masq --kube-subnet-mgr

The problematic worker node does not have this process. Creating the flannel RBAC resources there fails as well:

[root@bg7 ~]# kubectl create -f https://github.com/coreos/flannel/raw/master/Documentation/kube-flannel-rbac.yml
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Viewing k8s cluster status

[root@master3 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                     ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused   
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused   
etcd-0               Healthy     {"health":"true"}     

To solve "Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused":
Edit /etc/kubernetes/manifests/kube-scheduler.yaml and /etc/kubernetes/manifests/kube-controller-manager.yaml, comment out the --port=0 line in each, then execute systemctl restart kubelet.service. The component status is normal now:

NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok                  
controller-manager   Healthy   ok                  
etcd-0               Healthy   {"health":"true"} 
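
For reference, a minimal sketch of that edit done with sed (assuming the default kubeadm manifest paths; kubelet re-creates the static pods after the restart):

# Comment out the --port=0 argument in both static pod manifests
sed -i 's/^\(\s*\)- --port=0/\1# - --port=0/' /etc/kubernetes/manifests/kube-scheduler.yaml
sed -i 's/^\(\s*\)- --port=0/\1# - --port=0/' /etc/kubernetes/manifests/kube-controller-manager.yaml
systemctl restart kubelet.service
kubectl get cs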

The configuration change above does not fix the abnormal pod status, though.
See also: k8s flannel network problem dial tcp 10.0.0.1:443: i/o timeout
All the healthy nodes have the cni0 virtual network interface, while the problematic node does not:

[root@wd4 ~]# ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.3.1  netmask 255.255.255.0  broadcast 10.244.3.255
        inet6 fe80::44d6:8ff:fe10:9c7e  prefixlen 64  scopeid 0x20<link>
        ether 46:d6:08:10:9c:7e  txqueuelen 1000  (Ethernet)
        RX packets 322756760  bytes 105007395106 (97.7 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 328180837  bytes 158487160202 (147.6 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

This is because the cluster network segment configured in kube-controller-manager.yaml is 10.244.0.0/16.
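
A quick way to confirm that CIDR on a master node (a sketch; the manifest path is the kubeadm default):

grep -- '--cluster-cidr' /etc/kubernetes/manifests/kube-controller-manager.yaml
# expected, matching the flannel network above:
#     - --cluster-cidr=10.244.0.0/16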
Checking the kubelet status on the problematic node shows error messages I had not noticed before:

[root@bg7 net.d]# service kubelet status
Redirecting to /bin/systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2021-07-08 13:40:44 CST; 6 days ago
     Docs: https://kubernetes.io/docs/
 Main PID: 5290 (kubelet)
    Tasks: 45
   Memory: 483.8M
   CGroup: /system.slice/kubelet.service
           └─5290 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --co...

Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.322908    5290 cni.go:364] Error adding longhorn-system_longhorn-csi-plugi...rectory
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.355433    5290 cni.go:364] Error adding longhorn-system_engine-image-ei-e1...rectory
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.372196    5290 cni.go:364] Error adding longhorn-system_longhorn-manager-2...rectory
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.378600    5290 pod_container_deletor.go:77] Container "5ae13a0a2be56237a3f...tainers
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.395855    5290 pod_container_deletor.go:77] Container "ea0b2a805f720628172...tainers
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.411259    5290 pod_container_deletor.go:77] Container "63776660a9ee92b50ee...tainers
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700878    5290 remote_runtime.go:105] RunPodSandbox from runtime service failed: ...
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700942    5290 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "longhorn-csi-...
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700958    5290 kuberuntime_manager.go:733] createPodSandbox for pod "longhorn-csi...
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.701009    5290 pod_workers.go:191] Error syncing pod 3b0799d3-9446-4f51-94...446-4f5
Hint: Some lines were ellipsized, use -l to show in full.

Install the network plugin on the worker node:

[root@bg7 ~]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The connection to the server localhost:8080 was refused - did you specify the right host or port?

The reason for this error is that kubectl on the worker node needs the admin.conf from the master node configured as its kubeconfig:

echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile
source ~/.bash_profile
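
Note that admin.conf must exist on the worker first; it can be copied over from a master node (the same scp used later in this article, with bg7's IP):

scp /etc/kubernetes/admin.conf root@10.128.2.173:/etc/kubernetes/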

[root@bg7 kubernetes]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
podsecuritypolicy.policy/psp.flannel.unprivileged configured
clusterrole.rbac.authorization.k8s.io/flannel unchanged
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
serviceaccount/flannel unchanged
configmap/kube-flannel-cfg unchanged
daemonset.apps/kube-flannel-ds configured

Reference: installing Kubernetes (flannel) with kubeadm
Having found no other fix, I reset the worker node:

systemctl stop kubelet
kubeadm reset
rm -rf /etc/cni/net.d
# If the firewall is turned on, execute
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# Join cluster
kubeadm join 10.128.4.18:16443 --token xfp80m.xx --discovery-token-ca-cert-hash sha256:dee39c2f7c7484af5872018d786626c9a6264da9334xxxxxxxx
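
If the original bootstrap token has since expired, a fresh join command can be printed on a master node (a standard kubeadm command):

kubeadm token create --print-join-command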


The root cause was that access to port 6443 was restricted.

[root@master2 ~]# netstat -ntlp | grep 6443
tcp        0      0 0.0.0.0:16443           0.0.0.0:*               LISTEN      886/haproxy         
tcp6       0      0 :::6443                 :::*                    LISTEN      3006/kube-apiserver
[root@bg7 net.d]# kubectl describe pod kube-flannel-ds-5jhm6 -n kube-system
Name:                 kube-flannel-ds-5jhm6
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 bg7.test.com.cn/10.128.2.173
Start Time:           Thu, 15 Jul 2021 14:17:39 +0800
Labels:               app=flannel
                      controller-revision-hash=68c5dd74df
                      pod-template-generation=2
                      tier=node
Annotations:          <none>
Status:               Running
IP:                   10.128.2.173
IPs:
  IP:           10.128.2.173
Controlled By:  DaemonSet/kube-flannel-ds
Init Containers:
  install-cni:
    Container ID:  docker://f04fdac1c8d9d0f98bd11159aebb42f9870709fd6fa2bb96739f8d255967033a
    Image:         quay.io/coreos/flannel:v0.14.0
    Image ID:      docker-pullable://quay.io/coreos/flannel@sha256:4a330b2f2e74046e493b2edc30d61fdebbdddaaedcb32d62736f25be8d3c64d5
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
    Args:
      -f
      /etc/kube-flannel/cni-conf.json
      /etc/cni/net.d/10-flannel.conflist
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 15 Jul 2021 14:45:18 +0800
      Finished:     Thu, 15 Jul 2021 14:45:18 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/cni/net.d from cni (rw)
      /etc/kube-flannel/ from flannel-cfg (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-wc2lq (ro)
Containers:
  kube-flannel:
    Container ID:  docker://8ab52d4dc3c29d13d7453a33293a8696391f31826afdc1981a1df9c7eafd6994
    Image:         quay.io/coreos/flannel:v0.14.0
    Image ID:      docker-pullable://quay.io/coreos/flannel@sha256:4a330b2f2e74046e493b2edc30d61fdebbdddaaedcb32d62736f25be8d3c64d5
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/bin/flanneld
    Args:
      --ip-masq
      --kube-subnet-mgr
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 15 Jul 2021 15:27:58 +0800
      Finished:     Thu, 15 Jul 2021 15:28:29 +0800
    Ready:          False
    Restart Count:  12
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_NAME:       kube-flannel-ds-5jhm6 (v1:metadata.name)
      POD_NAMESPACE:  kube-system (v1:metadata.namespace)
    Mounts:
      /etc/kube-flannel/ from flannel-cfg (rw)
      /run/flannel from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-wc2lq (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /run/flannel
    HostPathType:  
  cni:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  flannel-cfg:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-flannel-cfg
    Optional:  false
  flannel-token-wc2lq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flannel-token-wc2lq
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Pulled   48m                   kubelet  Container image "quay.io/coreos/flannel:v0.14.0" already present on machine
  Normal   Created  48m                   kubelet  Created container install-cni
  Normal   Started  48m                   kubelet  Started container install-cni
  Normal   Created  44m (x5 over 48m)     kubelet  Created container kube-flannel
  Normal   Started  44m (x5 over 48m)     kubelet  Started container kube-flannel
  Normal   Pulled   28m (x9 over 48m)     kubelet  Container image "quay.io/coreos/flannel:v0.14.0" already present on machine
  Warning  BackOff  3m8s (x177 over 47m)  kubelet  Back-off restarting failed container

journalctl -xeu kubelet shows the full error:

"longhorn-csi-plugin-fw2ck_longhorn-system" network: open /run/flannel/subnet.env: no such file or directory

On the healthy node wd4, /run/flannel/subnet.env looks like this:

[root@wd4 flannel]# cat subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.3.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

As the output below shows, 10.96.0.1 responds to ping, but port 443 cannot be reached:

[root@bg7 ~]# ping 10.96.0.1
PING 10.96.0.1 (10.96.0.1) 56(84) bytes of data.
64 bytes from 10.96.0.1: icmp_seq=1 ttl=64 time=0.034 ms
[root@bg7 ~]# telnet 10.96.0.1 443
Trying 10.96.0.1...
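
A quick TCP reachability check from the problem node, as a sketch using bash's built-in /dev/tcp (10.96.0.1:443 is the in-cluster apiserver service; 10.128.4.251:6443 and 10.128.4.18:16443 are a real apiserver and the haproxy front end seen in the netstat output and join command above):

for ep in 10.96.0.1:443 10.128.4.251:6443 10.128.4.18:16443; do
  host=${ep%:*}; port=${ep#*:}
  # /dev/tcp/<host>/<port> attempts a TCP connection; timeout turns a hang into a failure
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" && echo "$ep reachable" || echo "$ep blocked"
done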

2. etcd-master1

[root@master1 ~]# kubectl logs etcd-master1 -n kube-system
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-07-14 09:56:08.703026 I | etcdmain: etcd Version: 3.4.3
2021-07-14 09:56:08.703052 I | etcdmain: Git SHA: 3cf2f69b5
2021-07-14 09:56:08.703055 I | etcdmain: Go Version: go1.12.12
2021-07-14 09:56:08.703058 I | etcdmain: Go OS/Arch: linux/amd64
2021-07-14 09:56:08.703062 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
2021-07-14 09:56:08.703101 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-07-14 09:56:08.703131 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = 
2021-07-14 09:56:08.703235 C | etcdmain: open /etc/kubernetes/pki/etcd/peer.crt: no such file or directory

This problem is relatively simple: copy the etcd certificates over from another master node. Since the master nodes of a k8s cluster are peers, presumably the certificates can simply be copied across.

[root@master2 ~]# cd /etc/kubernetes/pki/etcd
[root@master2 etcd]# ll
total 32
-rw-r--r-- 1 root root 1017 Mar 12 11:59 ca.crt
-rw------- 1 root root 1675 Mar 12 11:59 ca.key
-rw-r--r-- 1 root root 1094 Mar 12 13:47 healthcheck-client.crt
-rw------- 1 root root 1675 Mar 12 13:47 healthcheck-client.key
-rw-r--r-- 1 root root 1127 Mar 12 13:47 peer.crt
-rw------- 1 root root 1675 Mar 12 13:47 peer.key
-rw-r--r-- 1 root root 1127 Mar 12 13:47 server.crt
-rw------- 1 root root 1675 Mar 12 13:47 server.key
cd /etc/kubernetes/pki/etcd
scp healthcheck-client.crt root@10.128.4.164:/etc/kubernetes/pki/etcd
scp healthcheck-client.key peer.crt peer.key server.crt server.key  root@10.128.4.164:/etc/kubernetes/pki/etcd

To inspect etcd, install the etcdctl command line client on the host with the following commands:

wget https://github.com/etcd-io/etcd/releases/download/v3.4.14/etcd-v3.4.14-linux-amd64.tar.gz
tar -zxf etcd-v3.4.14-linux-amd64.tar.gz
mv etcd-v3.4.14-linux-amd64/etcdctl /usr/local/bin
chmod +x /usr/local/bin/etcdctl
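
A quick check that the client is installed (etcdctl's standard version subcommand):

etcdctl version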

Alternatively, you can exec directly into the etcd docker container:

docker exec -it $(docker ps -f name=etcd_etcd -q) /bin/sh
# View the list of etcd cluster members
# etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
63009835561e0671, started, master1, https://10.128.4.164:2380, https://10.128.4.164:2379, false
b245d1beab861d15, started, master2, https://10.128.4.251:2380, https://10.128.4.251:2379, false
f3f56f36d83eef49, started, master3, https://10.128.4.211:2380, https://10.128.4.211:2379, false

View the health status of the highly available etcd cluster:

[root@master3 application]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 endpoint health
{"level":"warn","ts":"2021-07-14T19:37:51.455+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2684301f-38ba-4150-beab-ed052321a6d9/10.128.4.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+-------------------+--------+------------+---------------------------+
|     ENDPOINT      | HEALTH |    TOOK    |           ERROR           |
+-------------------+--------+------------+---------------------------+
| 10.128.4.211:2379 |   true | 8.541405ms |                           |
| 10.128.4.251:2379 |   true | 8.922941ms |                           |
| 10.128.4.164:2379 |  false | 5.0002425s | context deadline exceeded |
+-------------------+--------+------------+---------------------------+

View the member list of the highly available etcd cluster:

[root@master3 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 member list
+------------------+---------+---------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME   |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------+---------------------------+---------------------------+------------+
| 63009835561e0671 | started | master1 | https://10.128.4.164:2380 | https://10.128.4.164:2379 |      false |
| b245d1beab861d15 | started | master2 | https://10.128.4.251:2380 | https://10.128.4.251:2379 |      false |
| f3f56f36d83eef49 | started | master3 | https://10.128.4.211:2380 | https://10.128.4.211:2379 |      false |
+------------------+---------+---------+---------------------------+---------------------------+------------+

View the leader of the highly available etcd cluster (via endpoint status):

[root@master3 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 endpoint status
{"level":"warn","ts":"2021-07-15T10:24:33.494+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.128.4.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.128.4.164:2379: connect: connection refused\""}
Failed to get the status of endpoint 10.128.4.164:2379 (context deadline exceeded)
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|     ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.128.4.251:2379 | b245d1beab861d15 |   3.4.3 |   25 MB |     false |      false |        16 |   46888364 |           46888364 |        |
| 10.128.4.211:2379 | f3f56f36d83eef49 |   3.4.3 |   25 MB |      true |      false |        16 |   46888364 |           46888364 |        |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Copy the valid certificates to master1 with the commands below, but the problem persists:

scp /etc/kubernetes/pki/ca.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/sa.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/front-proxy-ca.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/etcd/ca.* root@10.128.4.164:/etc/kubernetes/pki/etcd/
scp /etc/kubernetes/admin.conf root@10.128.4.164:/etc/kubernetes/

The next idea is to remove the master node from the cluster and rejoin it.
Reference: remove a master node from the k8s cluster and rejoin it

# Remove the problematic master node from k8s
kubectl drain master1
kubectl delete node master1
# Remove the corresponding member from etcd. Note that the member ID (here 12637f5ec2bd02b8) is obtained from the etcd cluster's member list

etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 12637f5ec2bd02b8

# Note: the following is executed on a healthy master node
mkdir -p /etc/kubernetes/pki/etcd/
scp /etc/kubernetes/pki/ca.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/sa.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/front-proxy-ca.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/etcd/ca.* root@10.128.4.164:/etc/kubernetes/pki/etcd/
scp /etc/kubernetes/admin.conf root@10.128.4.164:/etc/kubernetes/

# Note: the following is executed on the problematic master node
kubeadm reset


# Note: this is also executed on the problematic node
kubeadm join 10.128.4.18:16443 --token xfp80m.tzbnqxoyv1p21687 --discovery-token-ca-cert-hash sha256:dee39c2f7c7484af5872018d786626c9a6264da93346acc9114ffacd0a2782d7 --control-plane

kubectl cordon master1
# At this point, the kube-apiserver-master1 problem is solved as well
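
As an alternative to copying the certificates around with scp, kubeadm can distribute them itself via a certificate key; this is the standard kubeadm control-plane join flow (the endpoint below is the article's haproxy VIP, and the token/hash placeholders are deliberately left unfilled):

# On a healthy master: re-upload the control-plane certs and print the certificate key
kubeadm init phase upload-certs --upload-certs
# On the node being (re)added as a control-plane member
kubeadm join 10.128.4.18:16443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key-printed-above>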

If you accidentally run kubeadm reset on a machine that has no problem, you can see that node (master3 here) change to NotReady:

[root@master1 pki]# kubectl get nodes
NAME              STATUS                        ROLES    AGE     VERSION
bg7.test.com.cn   Ready                         <none>   7d22h   v1.18.9
master1           Ready                         master   6m6s    v1.18.9
master2           Ready,SchedulingDisabled      master   124d    v1.18.9
master3           NotReady,SchedulingDisabled   master   124d    v1.18.9
wd4.test.com.cn   Ready                         <none>   124d    v1.18.9

Solution reference: what to do when a k8s master or worker node mistakenly runs kubeadm reset
The following steps did not work for me; I succeeded by removing the node and then adding it back.

scp /etc/kubernetes/admin.conf root@10.128.2.173:/etc/kubernetes/
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubeadm init --kubernetes-version=v1.18.9 --pod-network-cidr=10.244.0.0/16

echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile
source ~/.bash_profile

3. Node scheduling problem
SchedulingDisabled means the node cannot be scheduled, which is clearly a problem here. Running kubectl uncordon wd9.test.com.cn makes an unschedulable node schedulable again; conversely, kubectl cordon master1 marks the master node as unschedulable.

[root@master1 pki]# kubectl get nodes
NAME              STATUS                     ROLES    AGE    VERSION
bg7.test.com.cn   Ready,SchedulingDisabled   <none>   7d6h   v1.18.9
master1           Ready                      master   124d   v1.18.9
master2           Ready                      master   124d   v1.18.9
master3           Ready                      master   124d   v1.18.9
wd4.test.com.cn   Ready                      <none>   124d   v1.18.9
wd5.test.com.cn   Ready                      <none>   124d   v1.18.9
wd6.test.com.cn   Ready,SchedulingDisabled   <none>   124d   v1.18.9
wd7.test.com.cn   Ready                      <none>   34d    v1.18.9
wd8.test.com.cn   Ready,SchedulingDisabled   <none>   43d    v1.18.9
wd9.test.com.cn   Ready    
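
A minimal sketch for re-enabling scheduling on every node that should accept workloads again (node names taken from the listing above):

for n in bg7.test.com.cn wd6.test.com.cn wd8.test.com.cn wd9.test.com.cn; do
  kubectl uncordon "$n"
done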

Topics: Kubernetes Cloud Native