From the output below, you can see problems in three places: etcd-master1, kube-apiserver-master1, and kube-flannel-ds-42z5p.
[root@master3 ~]# kubectl get pods -n kube-system -o wide
NAME                              READY   STATUS             RESTARTS   AGE    IP             NODE              NOMINATED NODE   READINESS GATES
coredns-546565776c-m96fb          1/1     Running            0          46d    10.244.1.3     master2           <none>           <none>
coredns-546565776c-thczd          1/1     Running            0          44d    10.244.2.2     master3           <none>           <none>
etcd-master1                      0/1     CrashLoopBackOff   21345      124d   10.128.4.164   master1           <none>           <none>
etcd-master2                      1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
etcd-master3                      1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-apiserver-master1            0/1     CrashLoopBackOff   21349      124d   10.128.4.164   master1           <none>           <none>
kube-apiserver-master2            1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
kube-apiserver-master3            1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-controller-manager-master1   1/1     Running            11         124d   10.128.4.164   master1           <none>           <none>
kube-controller-manager-master2   1/1     Running            2          124d   10.128.4.251   master2           <none>           <none>
kube-controller-manager-master3   1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-flannel-ds-42z5p             0/1     Error              1568       6d2h   10.128.2.173   bg7.test.com.cn   <none>           <none>
kube-flannel-ds-6g59q             1/1     Running            7          43d    10.128.4.8     wd8.test.com.cn   <none>           <none>
kube-flannel-ds-85hxd             1/1     Running            3          123d   10.128.4.107   wd6.test.com.cn   <none>           <none>
kube-flannel-ds-brd8d             1/1     Running            1          33d    10.128.4.160   wd9.test.com.cn   <none>           <none>
kube-flannel-ds-gmmhx             1/1     Running            3          124d   10.128.4.82    wd5.test.com.cn   <none>           <none>
kube-flannel-ds-lj4g2             1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
kube-flannel-ds-n68dn             1/1     Running            11         124d   10.128.4.164   master1           <none>           <none>
kube-flannel-ds-ppnd7             1/1     Running            4          124d   10.128.4.191   wd4.test.com.cn   <none>           <none>
kube-flannel-ds-tf9lk             1/1     Running            0          33d    10.128.4.170   wd7.test.com.cn   <none>           <none>
kube-flannel-ds-vt5nh             1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-proxy-622c7                  1/1     Running            11         124d   10.128.4.164   master1           <none>           <none>
kube-proxy-7bp72                  1/1     Running            0          7d4h   10.128.2.173   bg7.test.com.cn   <none>           <none>
kube-proxy-8cx5q                  1/1     Running            4          123d   10.128.4.107   wd6.test.com.cn   <none>           <none>
kube-proxy-h2qh5                  1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kube-proxy-kpkm4                  1/1     Running            7          43d    10.128.4.8     wd8.test.com.cn   <none>           <none>
kube-proxy-lp74p                  1/1     Running            1          33d    10.128.4.160   wd9.test.com.cn   <none>           <none>
kube-proxy-nwsnm                  1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
kube-proxy-psjll                  1/1     Running            4          124d   10.128.4.82    wd5.test.com.cn   <none>           <none>
kube-proxy-v6x42                  1/1     Running            0          33d    10.128.4.170   wd7.test.com.cn   <none>           <none>
kube-proxy-vdfmz                  1/1     Running            4          124d   10.128.4.191   wd4.test.com.cn   <none>           <none>
kube-scheduler-master1            1/1     Running            11         124d   10.128.4.164   master1           <none>           <none>
kube-scheduler-master2            1/1     Running            1          124d   10.128.4.251   master2           <none>           <none>
kube-scheduler-master3            1/1     Running            1          124d   10.128.4.211   master3           <none>           <none>
kuboard-7986796cf8-2g6bs          1/1     Running            0          44d    10.244.1.4     master2           <none>           <none>
metrics-server-677dcb8b4d-pshqw   1/1     Running            0          44d    10.128.4.191   wd4.test.com.cn   <none>           <none>
1. The Flannel problem
Reference: Solution to the CrashLoopBackOff state of the flannel component in a K8s cluster
That article attributes the problem to the IPVS kernel modules failing to load; you can run lsmod | grep ip_vs to check whether they loaded successfully:
[root@master3 net.d]# cat /etc/sysconfig/modules/ipvs.modules
#!/bin/sh
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack_ipv4
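If the modules were the culprit, a minimal check (a sketch) is to reload them via that script and confirm they registered:

# Load the IPVS modules listed in the script above, then verify each one shows up
bash /etc/sysconfig/modules/ipvs.modules
lsmod | grep -e ip_vs -e nf_conntrack_ipv4    # every module should appear in this output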
But the failure I am seeing here is different:
[root@master3 ~]# kubectl logs kube-flannel-ds-42z5p -n kube-system
I0714 08:58:00.590712       1 main.go:519] Determining IP address of default interface
I0714 08:58:00.687885       1 main.go:532] Using interface with name eth0 and address 10.128.2.173
I0714 08:58:00.687920       1 main.go:549] Defaulting external address to interface address (10.128.2.173)
W0714 08:58:00.687965       1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
E0714 08:58:30.689584       1 main.go:250] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-42z5p': Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-42z5p": dial tcp 10.96.0.1:443: i/o timeout
Reading materials:
Troubleshooting of installing flannel in k8s: failed to create subnet manager: error retrieving pod spec for: the server doe
Using kubeadm on Ubuntu 16.04 to install Kubernetes 1.6.1 with flannel
Quickly deploy a K8S cluster with kubeadm
Checking the cluster: on a worker node that does not have the problem, the flanneld process is running:
[root@wd5 ~]# ps -ef|grep flannel
root      8359 28328  0 17:13 pts/0    00:00:00 grep --color=auto flannel
root     22735 22714  0 May31 ?        00:26:16 /opt/bin/flanneld --ip-masq --kube-subnet-mgr
The problematic worker node does not have this process, and trying to apply the flannel manifests there fails:
[root@bg7 ~]# kubectl create -f https://github.com/coreos/flannel/raw/master/Documentation/kube-flannel-rbac.yml
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Viewing the k8s cluster status:
[root@master3 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                      ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
etcd-0               Healthy     {"health":"true"}
Reference: Solve k8s Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
Edit /etc/kubernetes/manifests/kube-scheduler.yaml and /etc/kubernetes/manifests/kube-controller-manager.yaml.
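For reference, a scripted sketch of the same edit, assuming the flag appears as a "- --port=0" list item as it does in stock kubeadm manifests:

# Comment out the --port=0 flag in both static-pod manifests, then restart the kubelet
sed -i 's/^\(\s*\)- --port=0/\1# - --port=0/' /etc/kubernetes/manifests/kube-scheduler.yaml
sed -i 's/^\(\s*\)- --port=0/\1# - --port=0/' /etc/kubernetes/manifests/kube-controller-manager.yaml
systemctl restart kubelet.service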
After commenting out the --port=0 line in each manifest and running systemctl restart kubelet.service, the status is back to normal:
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}
However, this configuration change did not fix the abnormal pod status.
See: k8s flannel network problem, dial tcp 10.0.0.1:443: i/o timeout
All healthy nodes have the cni0 virtual network interface, while the problematic node does not:
[root@wd4 ~]# ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.3.1  netmask 255.255.255.0  broadcast 10.244.3.255
        inet6 fe80::44d6:8ff:fe10:9c7e  prefixlen 64  scopeid 0x20<link>
        ether 46:d6:08:10:9c:7e  txqueuelen 1000  (Ethernet)
        RX packets 322756760  bytes 105007395106 (97.7 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 328180837  bytes 158487160202 (147.6 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
This is because the cluster pod network segment configured in kube-controller-manager.yaml is 10.244.0.0/16.
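A quick way to confirm those values (a sketch; the node name is taken from this cluster, and the expected outputs are assumptions based on the cni0 address above):

grep cluster-cidr /etc/kubernetes/manifests/kube-controller-manager.yaml
# expected here: - --cluster-cidr=10.244.0.0/16
kubectl get node wd4.test.com.cn -o jsonpath='{.spec.podCIDR}'
# expected here: 10.244.3.0/24, matching cni0's 10.244.3.1 address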
Checking the node status shows an error message that I had not noticed before:
[root@bg7 net.d]# service kubelet status
Redirecting to /bin/systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2021-07-08 13:40:44 CST; 6 days ago
     Docs: https://kubernetes.io/docs/
 Main PID: 5290 (kubelet)
    Tasks: 45
   Memory: 483.8M
   CGroup: /system.slice/kubelet.service
           └─5290 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --co...

Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.322908    5290 cni.go:364] Error adding longhorn-system_longhorn-csi-plugi...rectory
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.355433    5290 cni.go:364] Error adding longhorn-system_engine-image-ei-e1...rectory
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.372196    5290 cni.go:364] Error adding longhorn-system_longhorn-manager-2...rectory
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.378600    5290 pod_container_deletor.go:77] Container "5ae13a0a2be56237a3f...tainers
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.395855    5290 pod_container_deletor.go:77] Container "ea0b2a805f720628172...tainers
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.411259    5290 pod_container_deletor.go:77] Container "63776660a9ee92b50ee...tainers
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700878    5290 remote_runtime.go:105] RunPodSandbox from runtime service failed: ...
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700942    5290 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "longhorn-csi-...
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700958    5290 kuberuntime_manager.go:733] createPodSandbox for pod "longhorn-csi...
Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.701009    5290 pod_workers.go:191] Error syncing pod 3b0799d3-9446-4f51-94...446-4f5
Hint: Some lines were ellipsized, use -l to show in full.
Install the network plugin on the worker node:
[root@bg7 ~]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The cause of this error is that the worker node has no kubeconfig: the admin.conf from the master node needs to be configured on the worker node:
echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile
source ~/.bash_profile
[root@bg7 kubernetes]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
podsecuritypolicy.policy/psp.flannel.unprivileged configured
clusterrole.rbac.authorization.k8s.io/flannel unchanged
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
serviceaccount/flannel unchanged
configmap/kube-flannel-cfg unchanged
daemonset.apps/kube-flannel-ds configured
Reference: Installing Kubernetes (flannel) with kubeadm
With no other fix in sight, I resorted to resetting the worker node:
systemctl stop kubelet
kubeadm reset
rm -rf /etc/cni/net.d
# If the firewall is turned on, also flush iptables
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# Join the cluster
kubeadm join 10.128.4.18:16443 --token xfp80m.xx --discovery-token-ca-cert-hash sha256:dee39c2f7c7484af5872018d786626c9a6264da9334xxxxxxxx
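If the original bootstrap token has expired by the time you rejoin, a fresh join command can be printed on any healthy master:

kubeadm token create --print-join-command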
The root cause turned out to be that access to port 6443 was restricted.
[root@master2 ~]# netstat -ntlp | grep 6443
tcp        0      0 0.0.0.0:16443    0.0.0.0:*    LISTEN    886/haproxy
tcp6       0      0 :::6443          :::*         LISTEN    3006/kube-apiserver
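To confirm the restriction from the problem node, a minimal connectivity sketch (addresses taken from this cluster; /healthz may return an auth error depending on apiserver flags, but any HTTP response proves the port is reachable):

curl -k https://10.128.4.164:6443/healthz    # direct to one master's kube-apiserver
curl -k https://10.128.4.18:16443/healthz    # through the haproxy-fronted port shown above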
[root@bg7 net.d]# kubectl describe pod kube-flannel-ds-5jhm6 -n kube-system
Name:                 kube-flannel-ds-5jhm6
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 bg7.test.com.cn/10.128.2.173
Start Time:           Thu, 15 Jul 2021 14:17:39 +0800
Labels:               app=flannel
                      controller-revision-hash=68c5dd74df
                      pod-template-generation=2
                      tier=node
Annotations:          <none>
Status:               Running
IP:                   10.128.2.173
IPs:
  IP:           10.128.2.173
Controlled By:  DaemonSet/kube-flannel-ds
Init Containers:
  install-cni:
    Container ID:  docker://f04fdac1c8d9d0f98bd11159aebb42f9870709fd6fa2bb96739f8d255967033a
    Image:         quay.io/coreos/flannel:v0.14.0
    Image ID:      docker-pullable://quay.io/coreos/flannel@sha256:4a330b2f2e74046e493b2edc30d61fdebbdddaaedcb32d62736f25be8d3c64d5
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
    Args:
      -f
      /etc/kube-flannel/cni-conf.json
      /etc/cni/net.d/10-flannel.conflist
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 15 Jul 2021 14:45:18 +0800
      Finished:     Thu, 15 Jul 2021 14:45:18 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/cni/net.d from cni (rw)
      /etc/kube-flannel/ from flannel-cfg (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-wc2lq (ro)
Containers:
  kube-flannel:
    Container ID:  docker://8ab52d4dc3c29d13d7453a33293a8696391f31826afdc1981a1df9c7eafd6994
    Image:         quay.io/coreos/flannel:v0.14.0
    Image ID:      docker-pullable://quay.io/coreos/flannel@sha256:4a330b2f2e74046e493b2edc30d61fdebbdddaaedcb32d62736f25be8d3c64d5
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/bin/flanneld
    Args:
      --ip-masq
      --kube-subnet-mgr
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 15 Jul 2021 15:27:58 +0800
      Finished:     Thu, 15 Jul 2021 15:28:29 +0800
    Ready:          False
    Restart Count:  12
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_NAME:       kube-flannel-ds-5jhm6 (v1:metadata.name)
      POD_NAMESPACE:  kube-system (v1:metadata.namespace)
    Mounts:
      /etc/kube-flannel/ from flannel-cfg (rw)
      /run/flannel from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-wc2lq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /run/flannel
    HostPathType:
  cni:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  flannel-cfg:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-flannel-cfg
    Optional:  false
  flannel-token-wc2lq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flannel-token-wc2lq
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Pulled   48m                   kubelet  Container image "quay.io/coreos/flannel:v0.14.0" already present on machine
  Normal   Created  48m                   kubelet  Created container install-cni
  Normal   Started  48m                   kubelet  Started container install-cni
  Normal   Created  44m (x5 over 48m)     kubelet  Created container kube-flannel
  Normal   Started  44m (x5 over 48m)     kubelet  Started container kube-flannel
  Normal   Pulled   28m (x9 over 48m)     kubelet  Container image "quay.io/coreos/flannel:v0.14.0" already present on machine
  Warning  BackOff  3m8s (x177 over 47m)  kubelet  Back-off restarting failed container
journalctl -xeu kubelet shows the same underlying error:
... "longhorn-csi-plugin-fw2ck_longhorn-system" network: open /run/flannel/subnet.env: no such file or directory
On a healthy node (wd4) this file exists:
[root@wd4 flannel]# cat subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.3.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
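Flanneld normally writes /run/flannel/subnet.env itself once it can reach the apiserver, so the real fix is restoring that connectivity. As a hedged stopgap on the broken node, though, the file can be created by hand from a healthy node's copy so sandbox creation stops failing; the FLANNEL_SUBNET value below is an assumption and must match the node's actual podCIDR:

# Stopgap sketch: hand-write the subnet file flanneld would normally generate
mkdir -p /run/flannel
cat > /run/flannel/subnet.env <<'EOF'
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.5.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
EOF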
As the output below shows, 10.96.0.1 answers ping, but port 443 cannot be reached:
[root@bg7 ~]# ping 10.96.0.1
PING 10.96.0.1 (10.96.0.1) 56(84) bytes of data.
64 bytes from 10.96.0.1: icmp_seq=1 ttl=64 time=0.034 ms
[root@bg7 ~]# telnet 10.96.0.1 443
Trying 10.96.0.1...
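10.96.0.1:443 is the kubernetes Service ClusterIP, which kube-proxy translates into the real apiserver endpoints; a hedged sketch of how to see where it should land (ipvsadm may need to be installed separately):

kubectl get endpoints kubernetes        # the real apiserver addresses behind 10.96.0.1
ipvsadm -Ln | grep -A 3 10.96.0.1       # if kube-proxy runs in ipvs mode
iptables-save | grep 10.96.0.1          # if kube-proxy runs in iptables mode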
2. etcd-master1
[root@master1 ~]# kubectl logs etcd-master1 -n kube-system
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-07-14 09:56:08.703026 I | etcdmain: etcd Version: 3.4.3
2021-07-14 09:56:08.703052 I | etcdmain: Git SHA: 3cf2f69b5
2021-07-14 09:56:08.703055 I | etcdmain: Go Version: go1.12.12
2021-07-14 09:56:08.703058 I | etcdmain: Go OS/Arch: linux/amd64
2021-07-14 09:56:08.703062 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
2021-07-14 09:56:08.703101 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-07-14 09:56:08.703131 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2021-07-14 09:56:08.703235 C | etcdmain: open /etc/kubernetes/pki/etcd/peer.crt: no such file or directory
This problem is relatively simple: the peer certificate is missing, so you can copy the etcd certificates over from another master node. Since the master nodes of the k8s cluster are peers, presumably they can be copied over directly.
[root@master2 ~]# cd /etc/kubernetes/pki/etcd
[root@master2 etcd]# ll
total 32
-rw-r--r-- 1 root root 1017 Mar 12 11:59 ca.crt
-rw------- 1 root root 1675 Mar 12 11:59 ca.key
-rw-r--r-- 1 root root 1094 Mar 12 13:47 healthcheck-client.crt
-rw------- 1 root root 1675 Mar 12 13:47 healthcheck-client.key
-rw-r--r-- 1 root root 1127 Mar 12 13:47 peer.crt
-rw------- 1 root root 1675 Mar 12 13:47 peer.key
-rw-r--r-- 1 root root 1127 Mar 12 13:47 server.crt
-rw------- 1 root root 1675 Mar 12 13:47 server.key
cd /etc/kubernetes/pki/etcd
scp healthcheck-client.crt root@10.128.4.164:/etc/kubernetes/pki/etcd
scp healthcheck-client.key peer.crt peer.key server.crt server.key root@10.128.4.164:/etc/kubernetes/pki/etcd
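One caveat worth checking after the copy (a hedged sketch): kubeadm's etcd peer and server certificates embed node-specific SANs, so verify on master1 that the copied certificate actually covers that node's hostname and IP (10.128.4.164) and has not expired:

openssl x509 -in /etc/kubernetes/pki/etcd/peer.crt -noout -text | grep -A 1 'Subject Alternative Name'
openssl x509 -in /etc/kubernetes/pki/etcd/peer.crt -noout -dates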
To inspect etcd, install the etcdctl command-line client on the host with the following commands:
wget https://github.com/etcd-io/etcd/releases/download/v3.4.14/etcd-v3.4.14-linux-amd64.tar.gz
tar -zxf etcd-v3.4.14-linux-amd64.tar.gz
mv etcd-v3.4.14-linux-amd64/etcdctl /usr/local/bin
chmod +x /usr/local/bin/etcdctl
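A quick check that the binary works:

etcdctl version    # should report etcdctl version: 3.4.14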
Alternatively, you can exec directly into the etcd docker container:
docker exec -it $(docker ps -f name=etcd_etcd -q) /bin/sh
# View the list of etcd cluster members
# etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
63009835561e0671, started, master1, https://10.128.4.164:2380, https://10.128.4.164:2379, false
b245d1beab861d15, started, master2, https://10.128.4.251:2380, https://10.128.4.251:2379, false
f3f56f36d83eef49, started, master3, https://10.128.4.211:2380, https://10.128.4.211:2379, false
View the health status of the highly available cluster:
[root@master3 application]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 endpoint health
{"level":"warn","ts":"2021-07-14T19:37:51.455+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2684301f-38ba-4150-beab-ed052321a6d9/10.128.4.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+-------------------+--------+------------+---------------------------+
|     ENDPOINT      | HEALTH |    TOOK    |           ERROR           |
+-------------------+--------+------------+---------------------------+
| 10.128.4.211:2379 | true   | 8.541405ms |                           |
| 10.128.4.251:2379 | true   | 8.922941ms |                           |
| 10.128.4.164:2379 | false  | 5.0002425s | context deadline exceeded |
+-------------------+--------+------------+---------------------------+
View the member list of the etcd highly available cluster:
[root@master3 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 member list
+------------------+---------+---------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME   |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------+---------------------------+---------------------------+------------+
| 63009835561e0671 | started | master1 | https://10.128.4.164:2380 | https://10.128.4.164:2379 | false      |
| b245d1beab861d15 | started | master2 | https://10.128.4.251:2380 | https://10.128.4.251:2379 | false      |
| f3f56f36d83eef49 | started | master3 | https://10.128.4.211:2380 | https://10.128.4.211:2379 | false      |
+------------------+---------+---------+---------------------------+---------------------------+------------+
View the leader of the etcd highly available cluster:
[root@master3 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 endpoint status
{"level":"warn","ts":"2021-07-15T10:24:33.494+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.128.4.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.128.4.164:2379: connect: connection refused\""}
Failed to get the status of endpoint 10.128.4.164:2379 (context deadline exceeded)
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|     ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.128.4.251:2379 | b245d1beab861d15 | 3.4.3   | 25 MB   | false     | false      |        16 |   46888364 |           46888364 |        |
| 10.128.4.211:2379 | f3f56f36d83eef49 | 3.4.3   | 25 MB   | true      | false      |        16 |   46888364 |           46888364 |        |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Copying the valid certificates to master1 with the following commands still did not solve the problem:
scp /etc/kubernetes/pki/ca.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/sa.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/front-proxy-ca.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/etcd/ca.* root@10.128.4.164:/etc/kubernetes/pki/etcd/
scp /etc/kubernetes/admin.conf root@10.128.4.164:/etc/kubernetes/
The next idea was to remove the master node from the cluster and rejoin it.
Reference: Remove a master node server from the k8s cluster and rejoin it
# Remove the problematic master node from k8s
kubectl drain master1
kubectl delete node master1
# Remove the corresponding member from etcd. Note: the member ID (12637f5ec2bd02b8 here) comes from the etcd cluster's member list output
etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 12637f5ec2bd02b8
# Note: run these on a healthy master node
mkdir -p /etc/kubernetes/pki/etcd/
scp /etc/kubernetes/pki/ca.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/sa.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/front-proxy-ca.* root@10.128.4.164:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/etcd/ca.* root@10.128.4.164:/etc/kubernetes/pki/etcd/
scp /etc/kubernetes/admin.conf root@10.128.4.164:/etc/kubernetes/
# Note: run the following on the problematic master node
kubeadm reset
kubeadm join 10.128.4.18:16443 --token xfp80m.tzbnqxoyv1p21687 --discovery-token-ca-cert-hash sha256:dee39c2f7c7484af5872018d786626c9a6264da93346acc9114ffacd0a2782d7 --control-plane
kubectl cordon master1
# This also solved the kube-apiserver-master1 problem
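After the rejoin, a hedged verification that the control plane is whole again (commands follow the same etcdctl usage as above):

ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --endpoints=127.0.0.1:2379 member list
kubectl get pods -n kube-system -o wide | grep master1    # the etcd and apiserver pods should now be Running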
If you accidentally run kubeadm reset on a healthy machine, that node breaks; here master3 changed to NotReady:
[root@master1 pki]# kubectl get nodes
NAME              STATUS                        ROLES    AGE    VERSION
bg7.test.com.cn   Ready                         <none>   7d22h  v1.18.9
master1           Ready                         master   6m6s   v1.18.9
master2           Ready,SchedulingDisabled      master   124d   v1.18.9
master3           NotReady,SchedulingDisabled   master   124d   v1.18.9
wd4.test.com.cn   Ready                         <none>   124d   v1.18.9
Solution reference: K8S master and worker nodes that mistakenly executed kubeadm reset
The following steps did not work for me; I succeeded by removing the node and then re-adding it, as in section 2.
scp /etc/kubernetes/admin.conf root@10.128.2.173:/etc/kubernetes/
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubeadm init --kubernetes-version=v1.18.9 --pod-network-cidr=10.244.0.0/16
echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile
source ~/.bash_profile
3. Node scheduling problem
A node whose status includes SchedulingDisabled will not accept new pods, which is clearly a problem here. Run kubectl uncordon wd9.test.com.cn to make an unschedulable node schedulable again.
Conversely, kubectl cordon master1 marks the master node as unschedulable.
[root@master1 pki]# kubectl get nodes
NAME              STATUS                     ROLES    AGE    VERSION
bg7.test.com.cn   Ready,SchedulingDisabled   <none>   7d6h   v1.18.9
master1           Ready                      master   124d   v1.18.9
master2           Ready                      master   124d   v1.18.9
master3           Ready                      master   124d   v1.18.9
wd4.test.com.cn   Ready                      <none>   124d   v1.18.9
wd5.test.com.cn   Ready                      <none>   124d   v1.18.9
wd6.test.com.cn   Ready,SchedulingDisabled   <none>   124d   v1.18.9
wd7.test.com.cn   Ready                      <none>   34d    v1.18.9
wd8.test.com.cn   Ready,SchedulingDisabled   <none>   43d    v1.18.9
wd9.test.com.cn   Ready
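To restore scheduling on the disabled worker nodes (names taken from the listing above):

kubectl uncordon bg7.test.com.cn
kubectl uncordon wd6.test.com.cn
kubectl uncordon wd8.test.com.cn
kubectl uncordon wd9.test.com.cn
kubectl get nodes    # STATUS should now read plain "Ready" on those nodes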