Connecting k8s to an external ceph cluster

Posted by kishore_marti on Mon, 27 Dec 2021 22:53:56 +0100

To deploy stateful services, we need to give k8s a persistent storage scheme, and we use ceph as the underlying storage. Broadly there are two ways to connect k8s to ceph:

static persistent volumes: a storage administrator manually creates the rbd image and a PersistentVolume for it;
dynamic persistent volumes: a StorageClass with a provisioner creates images on demand.

This post records the approach and the problems encountered while connecting a k8s cluster to an external ceph cluster; quite a few issues came up along the way.

Environment preparation

The k8s and ceph environments we use are shown in:
 https://blog.51cto.com/leejia/2495558
 https://blog.51cto.com/leejia/2499684

Static persistent volume

With static provisioning, each time storage is needed the storage administrator has to manually create the corresponding image on the ceph cluster before k8s can use it.

Create ceph secret

A secret for accessing ceph must be added to k8s; it is mainly used when k8s maps the rbd image.
1. On the ceph master node, run the following command to get the base64-encoded key of the admin user (in production, create a dedicated user for k8s instead):

# ceph auth get-key client.admin | base64
QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==

2. Create the secret in k8s via a manifest:

# vim ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
data:
  key: QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==
  
# kubectl apply -f ceph-secret.yaml

Create image

By default, the pool created when ceph is installed is named rbd. Create an image with the following commands, either on a client with ceph installed or directly on the ceph master node:

# rbd create image1 -s 1024
# rbd info rbd/image1
rbd image 'image1':
	size 1024 MB in 256 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.374d6b8b4567
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
	flags:

Create persistent volume

Create on k8s via manifest:

# vim pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  rbd:
    monitors:
      - 172.18.2.172:6789
      - 172.18.2.178:6789
      - 172.18.2.189:6789
    pool: rbd
    image: image1
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
  persistentVolumeReclaimPolicy: Retain
  
# kubectl apply -f pv.yaml
persistentvolume/ceph-pv created

# kubectl get pv
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
ceph-pv   1Gi        RWO,ROX        Retain           Available                                   76s

The main fields are explained below:
1. accessModes:

RWO: ReadWriteOnce, the volume can be mounted read-write by a single node;
ROX: ReadOnlyMany, the volume can be mounted read-only by many nodes;
RWX: ReadWriteMany, the volume can be mounted read-write by many nodes;

2. fsType

If the PersistentVolume's volumeMode is Filesystem, this field specifies the filesystem to use when mounting the volume. If the volume has not been formatted yet and formatting is supported, this value is used to format it.

3. persistentVolumeReclaimPolicy:

There are three reclaim policies (an existing PV's policy can also be changed later with kubectl patch, as sketched after this list):
Delete: the default policy for dynamically provisioned PersistentVolumes. When the user deletes the corresponding PersistentVolumeClaim, the dynamically provisioned volume is deleted automatically.

Retain: suitable when the volume holds important data. With "Retain", deleting the PersistentVolumeClaim does not delete the corresponding PersistentVolume; it moves to the Released state instead, so the data can still be recovered manually.

Recycle: when the user deletes the PersistentVolumeClaim, the data on the volume is scrubbed but the volume itself is kept.
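
The reclaim policy of an existing PV can also be changed after creation with kubectl patch; this is a standard kubectl operation, shown here against the ceph-pv created above:

# kubectl patch pv ceph-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
# kubectl get pv ceph-pv -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'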

Create the persistent volume claim

Create on k8s via manifest:

# vim pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-claim
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi

# kubectl apply -f pvc.yaml

After the claim is created, k8s matches the most suitable PV and binds it to the claim: the PV's capacity must satisfy the claim's request, and the PV's access modes must include the access modes requested by the claim. The PVC above is therefore bound to the PV we just created.

To view the binding of pvc:

# kubectl get pvc
NAME         STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-claim   Bound    ceph-pv   1Gi        RWO,ROX                       13m

Use the persistent volume in a pod

Create on k8s via manifest:

# vim ubuntu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod
spec:
  containers:
  - name: ceph-ubuntu
    image: phusion/baseimage
    command: ["sh", "/sbin/my_init"]
    volumeMounts:
    - name: ceph-mnt
      mountPath: /mnt
      readOnly: false
  volumes:
  - name: ceph-mnt
    persistentVolumeClaim:
      claimName: ceph-claim

# kubectl apply -f ubuntu.yaml
pod/ceph-pod created

Check the pod status: it is stuck in ContainerCreating, and kubectl describe shows the error:

# kubectl get pods
NAME                     READY   STATUS              RESTARTS   AGE
ceph-pod                 0/1     ContainerCreating   0          75s

# kubectl describe pods ceph-pod
Events:
  Type     Reason       Age                   From            Message
  ----     ------       ----                  ----            -------
  Warning  FailedMount  48m (x6 over 75m)     kubelet, work3  Unable to attach or mount volumes: unmounted volumes=[ceph-mnt], unattached volumes=[default-token-tlsjd ceph-mnt]: timed out waiting for the condition
  Warning  FailedMount  8m59s (x45 over 84m)  kubelet, work3  MountVolume.WaitForAttach failed for volume "ceph-pv" : fail to check rbd image status with: (executable file not found in $PATH), rbd output: ()
  Warning  FailedMount  3m13s (x23 over 82m)  kubelet, work3  Unable to attach or mount volumes: unmounted volumes=[ceph-mnt], unattached volumes=[ceph-mnt default-token-tlsjd]: timed out waiting for the condition

This problem occurs because k8s relies on kubelet to attach (rbd map) and detach (rbd unmap) rbd images, and kubelet runs on every k8s node, so the ceph-common package must be installed on every node to give kubelet the rbd command. We installed it from the Alibaba Cloud ceph repository on every machine.
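
A minimal sketch of that install, assuming CentOS 7 nodes and a mimic release of ceph (the repo path is an assumption; adjust the release and OS directory in the baseurl to match your cluster):

# vim /etc/yum.repos.d/ceph.repo
[ceph]
name=Ceph packages
baseurl=https://mirrors.aliyun.com/ceph/rpm-mimic/el7/x86_64/
enabled=1
gpgcheck=0

# yum install -y ceph-common

With ceph-common installed everywhere, describing the pod shows a new error: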

# kubectl describe pods ceph-pod
Events:
  Type     Reason       Age                   From            Message
  ----     ------       ----                  ----            -------
MountVolume.WaitForAttach failed for volume "ceph-pv" : rbd: map failed exit status 6, rbd output: 2020-06-02 17:12:18.575338 7f0171c3ed80 -1 did not load config file, using default settings.
2020-06-02 17:12:18.603861 7f0171c3ed80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
rbd: sysfs write failed
2020-06-02 17:12:18.620447 7f0171c3ed80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable".
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (6) No such device or address
  Warning  FailedMount  15s  kubelet, work3  MountVolume.WaitForAttach failed for volume "ceph-pv" : rbd: map failed exit status 6, rbd output: 2020-06-02 17:12:19.257006 7fc330e14d80 -1 did not load config file, using default settings.

We kept digging for the cause and found two problems to solve:
1) The kernel on the k8s nodes is older than the one on the ceph cluster, and some features enabled on the rbd image are not supported by the older kernel, so they have to be disabled with the following commands:

# rbd info rbd/image1
rbd image 'image1':
	size 1024 MB in 256 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.374d6b8b4567
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
	flags:
	
# rbd  feature disable rbd/image1 exclusive-lock object-map fast-diff deep-flatten	
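
Re-running rbd info afterwards should show only layering left on the features line:

# rbd info rbd/image1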

2) The "unable to find a keyring" error appears because a k8s node has to authenticate with ceph when mapping the image, so the ceph.client.admin.keyring file must be present in /etc/ceph on every k8s node. We created the /etc/ceph directory on each node and wrote a small script to distribute the keyring (a sketch follows the scp command below):

# scp /etc/ceph/ceph.client.admin.keyring root@k8s-node:/etc/ceph
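
A minimal version of that distribution script, assuming the worker node names k8s-work1 to k8s-work3 (hypothetical) and passwordless SSH from the ceph master:

# for node in k8s-work1 k8s-work2 k8s-work3; do ssh root@$node "mkdir -p /etc/ceph"; scp /etc/ceph/ceph.client.admin.keyring root@$node:/etc/ceph/; done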

Check the pod status again; it is finally running:

# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
ceph-pod                 1/1     Running   0          29s

Enter the ubuntu container and check the mounts; the image has been mounted and formatted:

# kubectl exec ceph-pod -it sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
# df -hT
Filesystem              Type     Size  Used Avail Use% Mounted on
overlay                 overlay   50G  3.6G   47G   8% /
tmpfs                   tmpfs     64M     0   64M   0% /dev
tmpfs                   tmpfs    2.9G     0  2.9G   0% /sys/fs/cgroup
/dev/rbd0               ext4     976M  2.6M  958M   1% /mnt
/dev/mapper/centos-root xfs       50G  3.6G   47G   8% /etc/hosts
shm                     tmpfs     64M     0   64M   0% /dev/shm
tmpfs                   tmpfs    2.9G   12K  2.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                   tmpfs    2.9G     0  2.9G   0% /proc/acpi
tmpfs                   tmpfs    2.9G     0  2.9G   0% /proc/scsi
tmpfs                   tmpfs    2.9G     0  2.9G   0% /sys/firmware

On the node where ceph-pod is running, the rbd mapping can also be seen with df:

# df -hT|grep rbd
/dev/rbd0               ext4      976M  2.6M  958M   1% /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/rbd-image-image2

Dynamic persistent volume

With dynamic provisioning, no storage administrator has to intervene: the images used by k8s are created automatically, i.e. storage space is requested and created on demand. One or more StorageClasses must be defined first, and each StorageClass must be configured with a provisioner, which determines the volume plugin used to provision PVs. When a PersistentVolumeClaim requests a StorageClass, that provisioner creates the persistent volume on the corresponding storage.

The volume plugins supported by k8s are listed at:  https://kubernetes.io/zh/docs/concepts/storage/storage-classes/

Create an ordinary user for mapping rbd images from k8s

Create a k8s dedicated pool and user in the ceph cluster:

# ceph osd pool create kube 8192
# ceph auth get-or-create client.kube mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=kube' -o ceph.client.kube.keyring
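
Optionally, double-check the capabilities that were granted (ceph auth get prints the user's key and caps):

# ceph auth get client.kube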

Create a secret for the kube user in the k8s cluster:

# ceph auth get-key client.kube|base64
QVFBS090WmVDcUxvSHhBQWZma1YxWUNnVzhuRTZUcjNvYS9yclE9PQ==

# vim ceph-kube-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-kube-secret
  namespace: default
data:
  key: QVFBS090WmVDcUxvSHhBQWZma1YxWUNnVzhuRTZUcjNvYS9yclE9PQ==
type: kubernetes.io/rbd
  
# kubectl create -f ceph-kube-secret.yaml
# kubectl get secret
NAME                  TYPE                                  DATA   AGE
ceph-kube-secret      kubernetes.io/rbd                     1      68s

Create a StorageClass or use a StorageClass that has already been created

# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: default
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
  
# kubectl apply -f sc.yaml
# kubectl get storageclass
NAME                 PROVISIONER         RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-rbd (default)   kubernetes.io/rbd   Delete          Immediate           false                  6s

The main parameters are explained below:
1. storageclass.beta.kubernetes.io/is-default-class: if set to true, this becomes the default StorageClass; a PVC that does not specify a StorageClass is served by the default one (an existing StorageClass can also be marked as default with kubectl patch, as sketched after this list).
2. adminId: the ceph client ID used to create images in the pool. Defaults to "admin".
3. userId: the ceph client ID used to map rbd images. Defaults to the same value as adminId.
4. imageFormat: the ceph rbd image format, "1" or "2". Defaults to "1".
5. imageFeatures: optional, and only allowed when imageFormat is set to "2". Currently only layering is supported. The default is "", i.e. no features are enabled.
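
The default-class annotation can also be applied to an existing StorageClass with kubectl patch (a standard kubectl operation, using the same beta annotation as above):

# kubectl patch storageclass ceph-rbd -p '{"metadata":{"annotations":{"storageclass.beta.kubernetes.io/is-default-class":"true"}}}'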

Create the persistent volume claim

Since we have set a default StorageClass, we can create the PVC directly. Once created, the PVC sits in Pending status, which triggers the provisioner to create the volume:

# vim pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-sc-claim
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  resources:
    requests:
      storage: 500Mi

# kubectl apply -f pvc.yaml
persistentvolumeclaim/ceph-sc-claim created

# kubectl get pvc
NAME            STATUS    VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-sc-claim   Pending                                        ceph-rbd       50s

After creating the PVC, we found it was never bound to a PV and stayed in Pending. Checking the PVC's events shows the following problem:

# kubectl describe pvc  ceph-sc-claim
Name:          ceph-sc-claim
Namespace:     default
StorageClass:  ceph-rbd
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    <none>
Events:
  Type     Reason              Age                From                         Message
  ----     ------              ----               ----                         -------
  Warning  ProvisioningFailed  5s (x7 over 103s)  persistentvolume-controller  Failed to provision volume with StorageClass "ceph-rbd": failed to get admin secret from ["default"/"ceph-secret"]: failed to get secret from ["default"/"ceph-secret"]: Cannot get secret of type kubernetes.io/rbd

The error tells us that the k8s controller failed to fetch ceph's admin secret: the ceph-secret we created earlier is in the default namespace (and lacks the kubernetes.io/rbd type), while the controller runs in kube-system and has no permission to read it. We therefore recreate ceph-secret in the kube-system namespace, delete the PVC and StorageClass resources, update the StorageClass configuration, and recreate the StorageClass and PVC:

# cat ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
  namespace: kube-system
data:
  key: QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==
type: kubernetes.io/rbd

# kubectl apply -f ceph-secret.yaml
# kubectl get secret ceph-secret -n kube-system
NAME          TYPE                DATA   AGE
ceph-secret   kubernetes.io/rbd   1      19m

# kubectl delete pvc ceph-sc-claim
# kubectl delete sc ceph-rbd
# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
# kubectl apply -f sc.yaml
# kubectl apply -f pvc.yaml

# kubectl describe  pvc ceph-sc-claim
Name:          ceph-sc-claim
Namespace:     default
StorageClass:  ceph-rbd
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    <none>
Events:
  Type     Reason              Age                  From                         Message
  ----     ------              ----                 ----                         -------
  Warning  ProvisioningFailed  33s (x59 over 116m)  persistentvolume-controller  Failed to provision volume with StorageClass "ceph-rbd": failed to create rbd image: executable file not found in $PATH, command output:

Binding a PV still fails, so we keep looking. We have installed ceph-common on every node of the k8s cluster, so why can't the rbd command be found? The analysis shows the following:
When k8s dynamically provisions ceph storage through a StorageClass, it is the controller-manager that needs the rbd command to talk to the ceph cluster, and the default k8s.gcr.io/kube-controller-manager image does not include the ceph rbd client. The k8s project recommends using an external provisioner for this; these are standalone programs that follow the specification defined by k8s.
Following that recommendation, we deploy an external rbd-provisioner. Run the following on the k8s master:

# git clone https://github.com/kubernetes-incubator/external-storage.git
# cd external-storage/ceph/rbd/deploy
# sed -r -i "s/namespace: [^ ]+/namespace: kube-system/g" ./rbac/clusterrolebinding.yaml ./rbac/rolebinding.yaml
# kubectl -n kube-system apply -f ./rbac

# kubectl describe deployments.apps -n kube-system rbd-provisioner
Name:               rbd-provisioner
Namespace:          kube-system
CreationTimestamp:  Wed, 03 Jun 2020 18:59:14 +0800
Labels:             <none>
Annotations:        deployment.kubernetes.io/revision: 1
Selector:           app=rbd-provisioner
Replicas:           1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:       Recreate
MinReadySeconds:    0
Pod Template:
  Labels:           app=rbd-provisioner
  Service Account:  rbd-provisioner
  Containers:
   rbd-provisioner:
    Image:      quay.io/external_storage/rbd-provisioner:latest
    Port:       <none>
    Host Port:  <none>
    Environment:
      PROVISIONER_NAME:  ceph.com/rbd
    Mounts:              <none>
  Volumes:               <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   rbd-provisioner-c968dcb4b (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  6m5s  deployment-controller  Scaled up replica set rbd-provisioner-c968dcb4b to 1
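
If provisioning problems show up later, the rbd-provisioner pod's logs are the first place to look (kubectl logs accepts a deployment reference):

# kubectl -n kube-system logs deploy/rbd-provisioner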

Change the StorageClass's provisioner to the newly deployed one:

# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: ceph.com/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
  
# kubectl delete pvc ceph-sc-claim
# kubectl delete sc ceph-rbd
# kubectl apply -f sc.yaml
# kubectl apply -f pvc.yaml

Wait for the provisioner to allocate storage and bind the PV to the PVC; it took about 3 minutes here, and the binding finally succeeded:

# kubectl get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-sc-claim   Bound    pvc-0b92a433-adb0-46d9-a0c8-5fbef28eff5f   2Gi        RWO            ceph-rbd       7m49s
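
To confirm the result on the ceph side, list the images the provisioner created in the kube pool and inspect one of them (replace <image-name> with a name printed by rbd ls):

# rbd ls -p kube
# rbd info kube/<image-name>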

Use the persistent volume in a pod

Create a pod and view the mount status:

# vim ubuntu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-sc-pod
spec:
  containers:
  - name: ceph-sc-ubuntu
    image: phusion/baseimage
    command: ["/sbin/my_init"]
    volumeMounts:
    - name: ceph-sc-mnt
      mountPath: /mnt
      readOnly: false
  volumes:
  - name: ceph-sc-mnt
    persistentVolumeClaim:
      claimName: ceph-sc-claim

# kubectl apply -f ubuntu.yaml
# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
ceph-sc-pod              1/1     Running   0          24s

# kubectl exec ceph-sc-pod -it  sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
# df -h
Filesystem               Size  Used Avail Use% Mounted on
overlay                   50G  3.8G   47G   8% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                    2.9G     0  2.9G   0% /sys/fs/cgroup
/dev/rbd0                2.0G  6.0M  1.9G   1% /mnt
/dev/mapper/centos-root   50G  3.8G   47G   8% /etc/hosts
shm                       64M     0   64M   0% /dev/shm
tmpfs                    2.9G   12K  2.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                    2.9G     0  2.9G   0% /proc/acpi
tmpfs                    2.9G     0  2.9G   0% /proc/scsi
tmpfs                    2.9G     0  2.9G   0% /sys/firmware
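
As a quick sanity check, write a file into the mounted volume from inside the pod; the data lands on the rbd image:

# echo hello > /mnt/hello.txt
# ls -l /mnt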

After all these twists and turns, we finally managed to connect k8s to the external ceph cluster.

Summary

1. k8s relies on kubelet to attach (rbd map) and detach (rbd unmap) rbd images, and kubelet runs on every k8s node, so the ceph-common package must be installed on every node to provide kubelet with the rbd command.
2. When k8s dynamically provisions ceph storage through a StorageClass, the controller-manager needs the rbd command to interact with the ceph cluster, but the default k8s.gcr.io/kube-controller-manager image does not include the ceph rbd client. The k8s project recommends using an external provisioner, a standalone program that follows the specification defined by k8s.

References

 https://kubernetes.io/zh/docs/concepts/storage/storage-classes/
 https://kubernetes.io/zh/docs/concepts/storage/volumes/
https://groups.google.com/forum/#!topic/kubernetes-sig-storage-bugs/4w42QZxboIA

Reprinted from https://blog.51cto.com/leejia/2501080

Topics: Kubernetes