- 1. General
- 2. Terminology
- 3. nfs-ganesha
- 4. Deployment
- 5. Verification
- 6. Appendix
- 7. Reference
This article touches on quite a few systems, so the conceptual introduction is fairly long. If you just want to use the result, you can jump straight to the deployment practice and pick up the various systems and concepts along the way.
1. General
NFS is a widely used protocol for sharing file systems on Linux and is a standard feature of most shared-storage products; the client is easy to install and use. CephFS, by contrast, requires installing a CephFS client and adding client-side configuration, so NFS is more convenient for end users.
The purpose of this article is to show how to deploy a highly available NFS gateway cluster for CephFS.
- Why cluster NFS?
    - NFS is a stateful service. To make a stateful service highly available, the consistency of its state must be maintained. The state includes:
        - Open files
        - File locks
        - ...
- What mature solutions exist today?
    - Rook is currently the only implementation with complete documentation for exporting CephFS over NFS
        - Service high availability is ensured through Kubernetes Ingress/Service (ing/svc) deployments
        - The file system state is stored in the Ceph cluster
- What are the advantages and disadvantages of the existing solution?
    - Advantage: community support
    - Disadvantage: it can only be deployed through Rook containers, and running Ceph in containers in production is still quite challenging. It is possible to run only the NFS cluster in containers while the Ceph cluster keeps running as plain processes, but that increases delivery complexity.
2. Terminology
- NFS [1]: a stateful service. The following state is maintained between the client and the server:
    - Open files
    - File locks
- NFS service properties [2]:
- Minimum supported version
- Grace period: the number of seconds of grace after the server reboots from an unplanned interruption (from 15 to 600 seconds). This property only affects NFSv4.0 and NFSv4.1 clients (NFSv3 is a stateless protocol, so there is no state to recover). During this period the NFS service only handles reclaiming of old lock state; other requests are not served until the grace period ends. The default grace period is 90 seconds. Reducing it lets NFS clients resume operations faster after a server reboot, but also increases the likelihood that clients will not be able to recover all of their lock state. (A hedged ganesha config sketch follows this list.)
- Enable NFSv4 delegation: selecting this property allows clients to cache files locally and modify them without contacting the server. It is enabled by default and usually improves performance, but can cause problems in rare cases. Disable it only after careful performance measurement of your specific workload has shown that doing so brings a real benefit. This option only affects NFSv4.0 and NFSv4.1 mounts.
- Mount visibility: this property limits how much information about NFS clients' share access lists and remote mounts is exposed. With "full", access to this information is unrestricted. With "restricted", clients can only view the shares they are allowed to access; they cannot see shares defined on the server that other clients can access, or remote mounts made by other clients. The default is "full".
- Maximum supported version
- The Maximum # of server threads defines the maximum number of concurrent NFS requests (from 20 to 1000). This should at least cover the number of concurrent NFS clients you expect. The default value is 500.
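These properties come from a generic NFS server's terminology, but nfs-ganesha exposes comparable knobs. Below is a hedged sketch of how the grace period, protocol versions and delegations might be expressed in ganesha's own configuration; the parameter names (e.g. Grace_Period, Lease_Lifetime) are taken from ganesha's sample configs and should be verified against your ganesha version:

```
# /etc/ganesha/ganesha.conf (fragment) -- illustrative only
NFS_CORE_PARAM {
    # which NFS protocol versions the server speaks (min/max supported version)
    Protocols = 4;
}

NFSv4 {
    # grace period in seconds after an unclean restart
    Grace_Period = 90;
    # lease that clients must renew to keep their state
    Lease_Lifetime = 60;
    # NFSv4 delegations on/off
    Delegations = false;
    Minor_Versions = 1, 2;
}
```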
3. nfs-ganesha
3.1. Introduction
nfs-ganesha [3]: an NFS server that exports different back-end file systems over NFS through an FSAL (File System Abstraction Layer). It supports exporting the following file systems:
- cephfs
- Gluster
- GPFS
- VFS
- XFS
- Lustre
- RadosGW
3.2. Architecture
3.2.1. Overall structure diagram
- Community architecture diagram: https://raw.githubusercontent.com/wiki/nfs-ganesha/nfs-ganesha/images/nfs-arch.png
- A more detailed IBM architecture diagram [4]
3.2.2. Architecture description
To use NFS-Ganesha, focus on the following points.
NFS-Ganesha acts as a client of the various distributed file systems it exports. Like any file system client, it has to cache the directory structure (dentries) and the mapping (inodes) between the local view and the back-end storage medium / remote file system [5]:
- Various distributed file systems that need to be exported as nfs
- MDCACHE: metadata cache for the back-end file system (dentries/inodes)
- FSAL: the File System Abstraction Layer, which unifies the APIs of the different back-end file systems
- nfs: the NFS service that users consume, exported on top of the various distributed storage back ends
- Log: the NFS-Ganesha service log
    - Default log path: /var/log/ganesha/ganesha.log
    - Other log configuration (a sketch follows this list)
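As a hedged illustration of "other log configuration": ganesha's LOG block can change the default verbosity and raise logging per component. The block and component names below follow the ganesha sample log configuration and should be checked against your version:

```
LOG {
    # overall default verbosity
    Default_Log_Level = EVENT;
    COMPONENTS {
        # raise FSAL-related logging when debugging CephFS issues
        FSAL = INFO;
    }
}
```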
3.2.3. Ganesha RADOS cluster design
This is the Ceph-specific part. The design description below is translated from the official documentation [6].
3.2.3.1. Client recovery (single instance)
NFSv4 is a lease-based protocol. After a client establishes a connection to the server, it must renew the lease regularly to maintain its file system state (open files, locks, delegations or layouts).
When an NFS server restarts, all of that state is lost. When the service comes back online, clients detect that the server has restarted, reclaim their previous state from the server and re-establish their connections.
3.2.3.2. Grace period (single instance)
The grace period is, in short, the window during which clients may reclaim their previous state after the connection to the server is interrupted and the server comes back. During this recovery window, clients are only allowed to reclaim old state and may not acquire new state; if the window (the grace period) expires before recovery completes, the connection is simply re-established without the old state.
3.2.3.3. Reboot Epochs
Server operating phases:
- R: Recovery
- N: Normal
Epoch values:
- C: the current epoch, shown as cur in the grace db dump output
- R: when non-zero, the epoch from which recovery is allowed, i.e. a grace period is in effect; 0 means the grace period has ended. It is shown as rec in the grace db dump output.
3.2.3.4. gracedb
State data is recorded in a database; for the rados_cluster backend, the recovery information is stored in a RADOS omap. The ganesha-rados-grace [7] command-line tool operates on this data, for example adding nodes to or removing nodes from the cluster. After joining the cluster, a node carries two flags (a usage sketch follows this list):
- N (NEED): the node has clients that need recovery
- E (ENFORCING): the node is enforcing the grace period
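A hedged sketch of typical ganesha-rados-grace invocations, using the pool and namespace chosen later in this article (cephfs_data / nfs-ns); check the available subcommands with ganesha-rados-grace --help on your version:

```
# dump the current epochs (cur/rec) and per-node N/E flags
ganesha-rados-grace -p cephfs_data --ns nfs-ns

# add a node to / remove a node from the recovery database
ganesha-rados-grace -p cephfs_data --ns nfs-ns add ceph01
ganesha-rados-grace -p cephfs_data --ns nfs-ns remove ceph01
```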
3.2.3.5. Cluster
The client recovery, grace period and gracedb sections above describe how a single server recovers and where its state lives. For a cluster we simply deploy several such servers and keep their state consistent.
Ganesha cluster scenario: several ganesha services are deployed and clients reach one of them through a VIP or DNS. If the ganesha service currently in use goes down, the client fails over to another ganesha service (note that the state of the previous connection is also available to the other nodes, because it lives in the shared cluster database). The new connection then reclaims the original state; if it cannot be reclaimed, the stale state is removed from the database and the connection is rebuilt from scratch.
3.3. Implementation of high availability cluster
- Stateless part: deploy multiple ganesha services and put them behind keepalived + haproxy for high availability and load balancing
- Stateful part: state data is managed through the grace db and stored in the omap of a Ceph RADOS object
With the architecture settled, the deployment can begin.
4. Deployment
4.1. Environmental description
Component | Version | Remarks |
---|---|---|
Operating system | CentOS Linux release 7.8.2003 (Core) | |
Operating system kernel | 3.10.0-1127.el7.x86_64 | |
nfs-ganesha | 2.8.1 | |
ceph | ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable) | |
haproxy | 1.5.18-9 | |
keepalived | 1.3.5-19 | |
Number of nodes | 3 | |
4.2. Installing software
4.2.1. Configuring yum source
```
cat /etc/yum.repos.d/nfs-ganasha.repo
[nfsganesha]
name=nfsganesha
baseurl=https://mirrors.cloud.tencent.com/ceph/nfs-ganesha/rpm-V2.8-stable/nautilus/x86_64/
gpgcheck=0
enable=1
```
4.2.2. Installation
```
yum install -y nfs-ganesha nfs-ganesha-ceph \
    nfs-ganesha-rados-grace nfs-ganesha-rgw \
    haproxy keepalived
```
4.3. ganesha configuration
4.3.1. /etc/ganesha/ganesha.conf
All three nodes need this configuration. Take care to adjust the node-specific items in the configuration file for each node (an example for ceph02 follows the config block).
```
NFS_CORE_PARAM {
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}

MDCACHE {
    Dir_Chunk = 0;
}

EXPORT_DEFAULTS {
    Attr_Expiration_Time = 0;
}

NFSv4 {
    Delegations = false;
    RecoveryBackend = 'rados_cluster';
    Minor_Versions = 1, 2;
}

RADOS_KV {
    # Ceph config file
    ceph_conf = '/etc/ceph/ceph.conf';
    # ceph user that ganesha uses to access the cluster
    userid = admin;
    # ganesha node name
    nodeid = "ceph01";
    # pool storing ganesha state
    pool = "cephfs_data";
    # namespace for ganesha state objects
    namespace = "nfs-ns";
}

RADOS_URLS {
    ceph_conf = '/etc/ceph/ceph.conf';
    userid = admin;
    watch_url = 'rados://cephfs_data/nfs-ns/conf-ceph01';
}

# Watch this ceph rados object so the export configuration can be changed online
%url rados://cephfs_data/nfs-ns/conf-ceph01
```
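For example (an illustration using the node names of this article), on ceph02 only the node-specific entries change:

```
RADOS_KV {
    ceph_conf = '/etc/ceph/ceph.conf';
    userid = admin;
    nodeid = "ceph02";
    pool = "cephfs_data";
    namespace = "nfs-ns";
}

RADOS_URLS {
    ceph_conf = '/etc/ceph/ceph.conf';
    userid = admin;
    watch_url = 'rados://cephfs_data/nfs-ns/conf-ceph02';
}

%url rados://cephfs_data/nfs-ns/conf-ceph02
```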
4.3.2. Create an export file and upload it to ceph
This is a cluster-wide operation and can be performed on any one of the nodes.
```
# create the file
cat conf-ceph
%url "rados://cephfs_data/nfs-ns/export-1"

# upload one object per node
rados put -p cephfs_data -N nfs-ns conf-ceph01 conf-ceph
rados put -p cephfs_data -N nfs-ns conf-ceph02 conf-ceph
rados put -p cephfs_data -N nfs-ns conf-ceph03 conf-ceph
```
4.3.3. Create the first export directory
```
cat export
EXPORT {
    FSAL {
        # ceph user, tied to the access permissions below
        user_id = "admin";
        # the secret key of the user above
        secret_access_key = "AQC5Z1Rh6Nu3BRAAc98ORpMCLu9kXuBh/k3oHA==";
        name = "CEPH";
        # file system name
        filesystem = "cephfs";
    }
    # the path clients use when accessing through nfs
    pseudo = "/test";
    # squash policy
    squash = "no_root_squash";
    # access mode of the nfs directory
    access_type = "RW";
    # cephfs path
    path = "/test001";
    # usually just a number
    export_id = 1;
    transports = "UDP", "TCP";
    protocols = 3, 4;
}
```
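The secret_access_key is the key of the Ceph user referenced by user_id; on a cluster node it can be looked up, for example, with:

```
ceph auth get-key client.admin
```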
Like the configuration above, this export file also needs to be uploaded to the cluster:
rados put -p cephfs_data -N nfs-ns export-1 export
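Optionally, confirm the objects are in place; listing the namespace (the same pool/namespace as above) should show the per-node config objects and the export object:

```
rados -p cephfs_data -N nfs-ns ls
# expected to include: conf-ceph01, conf-ceph02, conf-ceph03, export-1
```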
4.3.4. Join nodes to gracedb cluster
Now for the stateful part: add the three nodes to the gracedb cluster.
⚠️ Note: the node names, pool name and namespace must match the ones used in the configuration file.
ganesha-rados-grace -p cephfs_data --ns nfs-ns add ceph01 ceph02 ceph03
- After executing the command, the status is as follows
```
ganesha-rados-grace -p cephfs_data --ns nfs-ns
cur=5 rec=4
======================================================
ceph01  NE
ceph02  NE
ceph03  NE
```
- Start three ganesha services
systemctl start nfs-ganesha
- A little while after the services have started, check the node status in the gracedb again; it returns to the normal state shown below
```
ganesha-rados-grace -p cephfs_data --ns nfs-ns
cur=5 rec=0
======================================================
ceph01
ceph02
ceph03
```
For the meaning of cur, rec and the N/E flags in the output, see the introduction in the nfs-ganesha chapter.
4.4. haproxy+keepalived
4.4.1. haproxy
```
global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     8000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats

defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 8000

listen stats
    bind 172.16.80.86:9000
    mode http
    stats enable
    stats uri /
    stats refresh 15s
    stats realm Haproxy\ Stats
    stats auth admin:admin

frontend nfs-in
    bind 172.16.80.244:2049
    mode tcp
    option tcplog
    default_backend nfs-back

backend nfs-back
    balance source
    mode    tcp
    log     /dev/log local0 debug
    server  ceph01 172.16.80.86:2049 check
    server  ceph02 172.16.80.136:2049 check
    server  ceph03 172.16.80.10:2049 check
```
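Before restarting haproxy, you can validate the file; haproxy's check mode parses the configuration without starting the service:

```
haproxy -c -f /etc/haproxy/haproxy.cfg
```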
4.4.2. keepalived
```
global_defs {
    router_id CEPH_NFS
}

vrrp_script check_haproxy {
    script "killall -0 haproxy"
    weight -20
    interval 2
    rise 2
    fall 2
}

vrrp_instance VI_0 {
    state BACKUP
    priority 100
    interface eth0
    virtual_router_id 51
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        172.16.80.244/24 dev eth0
    }
    track_script {
        check_haproxy
    }
}
```
Note: the priority values of the three keepalived instances must all be different, while virtual_router_id must be identical on all of them (an example follows).
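For example (the priority value here is an illustrative choice), ceph02 could use the same vrrp_instance block with only the priority lowered:

```
# ceph02: same virtual_router_id, lower priority
vrrp_instance VI_0 {
    state BACKUP
    priority 90
    interface eth0
    virtual_router_id 51
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        172.16.80.244/24 dev eth0
    }
    track_script {
        check_haproxy
    }
}
```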
4.4.2.1. System configuration
If the following is not done, haproxy cannot start properly (it needs to bind the VIP, which may not yet be assigned to the local node):
echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.conf echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf sysctl -p
4.4.2.2. Start service
systemctl restart haproxy keepalived
At this point we have built a highly available NFS gateway for CephFS; the same approach applies to RGW. A quick client-side check is sketched below.
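As a quick sanity check (the VIP and the pseudo path below are the ones configured earlier in this article), a client can mount the export through the VIP:

```
mkdir -p /mnt/test
mount -t nfs4 172.16.80.244:/test /mnt/test
df -h /mnt/test
```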
5. Verification
```
Oct 8 20:52:03 wanggangfeng-dev kernel: nfs: server 172.16.80.244 not responding, still trying
Oct 8 20:53:07 wanggangfeng-dev kernel: nfs: server 172.16.80.244 OK
```
```
# after switching back
cat aaa
cat: aaa: Remote I/O error
# the df mount disappears and needs to be mounted again
```
5.1. Server disconnected while the client is not writing
The client will be stuck for a period of time (about 1 minute) and then recover
5.2. Server disconnected for more than five minutes while the client is under heavy I/O
The troubleshooting that follows is fairly long, so the conclusion comes first.
5.2.1. Test overview
In this test, about five minutes after the disconnect, the NFS-Ganesha client is evicted by CephFS because of the timeout and added to the blacklist. After the network is restored, the client remains stuck. The problem was finally solved by adjusting parameters.
The following settings were added to /etc/ceph/ceph.conf on the NFS-Ganesha servers:
```
# do not blacklist the client when its mds session times out
mds_session_blacklist_on_timeout = false
# do not blacklist the client when it is evicted by the mds
mds_session_blacklist_on_evict = false
# if the client session goes stale, re-establish the connection
client_reconnect_stale = true
```
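These are client-side options read by the libcephfs instance embedded in NFS-Ganesha; a straightforward way to apply them (assuming the settings are only picked up at startup) is to restart the gateway service on each node after editing ceph.conf:

```
systemctl restart nfs-ganesha
```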
5.2.2. Specific test phenomena and troubleshooting
5.2.2.1. Phenomenon
5.2.2.1.1. Client
Kernel hung-task messages appear during heavy read/write (a normal phenomenon when I/O times out), and the ceph cluster reports slow requests: this is caused by the poor performance of the ceph cluster running on virtual machines. The failover itself actually succeeds; you can see the connection with netstat -anolp | grep 2049 on the server currently holding the VIP.
```
Oct 8 21:47:16 wanggangfeng-dev kernel: INFO: task dd:2053 blocked for more than 120 seconds.
Oct 8 21:47:16 wanggangfeng-dev kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 8 21:47:16 wanggangfeng-dev kernel: dd D ffff986436d9acc0 0 2053 2023 0x00000080
Oct 8 21:47:16 wanggangfeng-dev kernel: Call Trace:
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83ed0>] ? bit_wait+0x50/0x50
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab85d89>] schedule+0x29/0x70
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83891>] schedule_timeout+0x221/0x2d0
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa4e69a1>] ? put_prev_entity+0x31/0x400
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa46d39e>] ? kvm_clock_get_cycles+0x1e/0x20
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83ed0>] ? bit_wait+0x50/0x50
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab8547d>] io_schedule_timeout+0xad/0x130
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab85518>] io_schedule+0x18/0x20
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83ee1>] bit_wait_io+0x11/0x50
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83a07>] __wait_on_bit+0x67/0x90
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83ed0>] ? bit_wait+0x50/0x50
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83b71>] out_of_line_wait_on_bit+0x81/0xb0
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa4c7840>] ? wake_bit_function+0x40/0x40
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc0669193>] nfs_wait_on_request+0x33/0x40 [nfs]
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc066e483>] nfs_updatepage+0x153/0x8e0 [nfs]
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc065d5c1>] nfs_write_end+0x171/0x3c0 [nfs]
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa5be044>] generic_file_buffered_write+0x164/0x270
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc06a4e90>] ? nfs4_xattr_set_nfs4_label+0x50/0x50 [nfsv4]
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc06a4e90>] ? nfs4_xattr_set_nfs4_label+0x50/0x50 [nfsv4]
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa5c0872>] __generic_file_aio_write+0x1e2/0x400
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa5c0ae9>] generic_file_aio_write+0x59/0xa0
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc065ca2b>] nfs_file_write+0xbb/0x1e0 [nfs]
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa64c663>] do_sync_write+0x93/0xe0
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa64d150>] vfs_write+0xc0/0x1f0
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa64df1f>] SyS_write+0x7f/0xf0
Oct 8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab92ed2>] system_call_fastpath+0x25/0x2a
```
The dd process does not recover on its own and has to be killed manually; the client recovers some time after that.
client
Oct 8 21:54:22 wanggangfeng-dev kernel: NFS: nfs4_reclaim_open_state: unhandled error -121
5.2.2.1.1.1. After the server is restored
```
# after switching back
cat aaa
cat: aaa: Remote I/O error
# the df mount disappears and needs to be mounted again
```
Ganesha error log (/var/log/ganesha/ganesha.log):
09/10/2021 11:13:02 : epoch 616104eb : ceph01 : ganesha.nfsd-78386[svc_8] posix2fsal_error :FSAL :CRIT :Mapping 108(default) to ERR_FSAL_SERVERFAULT
mds log
```
2021-10-09 11:05:59.220 7fc5b6e8c700 0 log_channel(cluster) log [WRN] : evicting unresponsive client ceph01 (44152), after 302.173 seconds
2021-10-09 11:05:59.220 7fc5b6e8c700 1 mds.0.12 Evicting (and blacklisting) client session 44152 (v1:172.16.80.86:0/3860306704)
2021-10-09 11:05:59.220 7fc5b6e8c700 0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 44152 (v1:172.16.80.86:0/3860306704)
2021-10-09 11:06:04.220 7fc5b6e8c700 0 log_channel(cluster) log [WRN] : evicting unresponsive client ceph01 (49033), after 302.374 seconds
2021-10-09 11:06:04.220 7fc5b6e8c700 1 mds.0.12 Evicting (and blacklisting) client session 49033 (172.16.80.86:0/2196066904)
2021-10-09 11:06:04.220 7fc5b6e8c700 0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 49033 (172.16.80.86:0/2196066904)
2021-10-09 11:09:47.444 7fc5bae94700 0 --2- [v2:172.16.80.10:6818/3513818921,v1:172.16.80.10:6819/3513818921] >> 172.16.80.86:0/2196066904 conn(0x5637be582800 0x5637bcbcf800 crc :-1 s=SESSION_ACCEPTING pgs=465 cs=0 l=0 rev1=1 rx=0 tx=0).handle_reconnect no existing connection exists, reseting client
2021-10-09 11:09:53.035 7fc5bae94700 0 --1- [v2:172.16.80.10:6818/3513818921,v1:172.16.80.10:6819/3513818921] >> v1:172.16.80.86:0/3860306704 conn(0x5637be529c00 0x5637bcc17000 :6819 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
2021-10-09 11:09:53.047 7fc5b8e90700 0 mds.0.server ignoring msg from not-open sessionclient_reconnect(1 caps 1 realms ) v3
2021-10-09 11:09:53.047 7fc5bae94700 0 --1- [v2:172.16.80.10:6818/3513818921,v1:172.16.80.10:6819/3513818921] >> v1:172.16.80.86:0/3860306704 conn(0x5637be50dc00 0x5637bcc14800 :6819 s=OPENED pgs=26 cs=1 l=0).fault server, going to standby
```
The client is added to the blacklist:
```
ceph osd blacklist ls
172.16.80.86:0/4234145215 2021-10-09 11:23:34.169972
```
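As an aside (not part of the original troubleshooting), a blacklist entry can also be removed by hand while testing:

```
ceph osd blacklist rm 172.16.80.86:0/4234145215
```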
Modify blacklist configuration
```
mds_session_blacklist_on_timeout = false
mds_session_blacklist_on_evict = false
```
For configuration details, refer to CEPH FILE SYSTEM CLIENT EVICTION.
ganesha log
```
15/10/2021 10:37:42 : epoch 6168e43f : ceph01 : ganesha.nfsd-359850[svc_102] rpc :TIRPC :EVENT :svc_vc_wait: 0x7f9a88000c80 fd 67 recv errno 104 (will set dead)
15/10/2021 10:37:44 : epoch 6168e43f : ceph01 : ganesha.nfsd-359850[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 6289, old: 6288; deq new: 6287, old: 6287
```
5.2.2.1.1.2. Mount a CEPH fuse client
Disconnect the client and wait for it to time out:
```
x7f28740060c0 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).stop
2021-10-15 14:21:11.445 7f28ad46b700 1 -- 172.16.80.86:0/1706554648 --> [v2:172.16.80.136:3300/0,v1:172.16.80.136:6789/0] -- mon_subscribe({config=0+,mdsmap=246+,monmap=2+}) v3 -- 0x7f28740080d0 con 0x7f2874008600
2021-10-15 14:21:11.445 7f28ad46b700 1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.136:3300/0,v1:172.16.80.136:6789/0] conn(0x7f2874008600 0x7f2874009770 secure :-1 s=READY pgs=149220 cs=0 l=1 rev1=1 rx=0x7f289c0080c0 tx=0x7f289c03dd20).ready entity=mon.2 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2021-10-15 14:21:11.445 7f28a57fa700 10 client.1058179.objecter ms_handle_connect 0x7f2874008600
2021-10-15 14:21:11.445 7f28a57fa700 10 client.1058179.objecter resend_mon_ops
2021-10-15 14:21:11.446 7f28a57fa700 1 -- 172.16.80.86:0/1706554648 <== mon.2 v2:172.16.80.136:3300/0 1 ==== mon_map magic: 0 v1 ==== 422+0+0 (secure 0 0 0) 0x7f289c040310 con 0x7f2874008600
2021-10-15 14:21:11.446 7f28a57fa700 1 -- 172.16.80.86:0/1706554648 <== mon.2 v2:172.16.80.136:3300/0 2 ==== config(0 keys) v1 ==== 4+0+0 (secure 0 0 0) 0x7f289c040510 con 0x7f2874008600
2021-10-15 14:21:11.446 7f28a57fa700 1 -- 172.16.80.86:0/1706554648 <== mon.2 v2:172.16.80.136:3300/0 3 ==== mdsmap(e 251) v1 ==== 693+0+0 (secure 0 0 0) 0x7f289c040a20 con 0x7f2874008600
2021-10-15 14:21:11.446 7f28a57fa700 10 client.1058179.objecter ms_dispatch 0x560568e78330 mdsmap(e 251) v1
2021-10-15 14:21:15.317 7f28a7fff700 10 client.1058179.objecter tick
2021-10-15 14:21:15.565 7f28a6ffd700 1 -- 172.16.80.86:0/1706554648 --> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] -- client_session(request_renewcaps seq 476) v3 -- 0x7f2884045550 con 0x560568fe8180
2021-10-15 14:21:15.893 7f28ad46b700 1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] conn(0x560568fe8180 0x560568fea590 unknown :-1 s=BANNER_CONNECTING pgs=16770 cs=565 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2021-10-15 14:21:15.893 7f28ad46b700 1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] conn(0x560568fe8180 0x560568fea590 crc :-1 s=SESSION_RECONNECTING pgs=16770 cs=565 l=0 rev1=1 rx=0 tx=0).handle_session_reset received session reset full=1
2021-10-15 14:21:15.893 7f28ad46b700 1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] conn(0x560568fe8180 0x560568fea590 crc :-1 s=SESSION_RECONNECTING pgs=16770 cs=565 l=0 rev1=1 rx=0 tx=0).reset_session
2021-10-15 14:21:15.894 7f28a57fa700 0 client.1058179 ms_handle_remote_reset on v2:172.16.80.10:6818/1475563033
2021-10-15 14:21:15.894 7f28a57fa700 10 client.1058179.objecter _maybe_request_map subscribing (onetime) to next osd map
2021-10-15 14:21:15.894 7f28a57fa700 1 -- 172.16.80.86:0/1706554648 --> [v2:172.16.80.136:3300/0,v1:172.16.80.136:6789/0] -- mon_subscribe({osdmap=379}) v3 -- 0x7f2890007ca0 con 0x7f2874008600
2021-10-15 14:21:15.894 7f28a57fa700 10 client.1058179.objecter ms_handle_connect 0x560568fe8180
2021-10-15 14:21:15.894 7f28ad46b700 1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] conn(0x560568fe8180 0x560568fea590 crc :-1 s=READY pgs=17729 cs=0 l=0 rev1=1 rx=0 tx=0).ready entity=mds.0 client_cookie=13aaa6fc82c7d050 server_cookie=a531fa96d3b029a4 in_seq=0 out_seq=0
2021-10-15 14:21:15.895 7f28a57fa700 1 -- 172.16.80.86:0/1706554648 <== mon.2 v2:172.16.80.136:3300/0 4 ==== osd_map(379..390 src has 1..390) v4 ==== 69665+0+0 (secure 0 0 0) 0x7f289c008390 con 0x7f2874008600
2021-10-15 14:21:15.895 7f28a57fa700 10 client.1058179.objecter ms_dispatch 0x560568e78330 osd_map(379..390 src has 1..390) v4
2021-10-15 14:21:15.895 7f28a57fa700 3 client.1058179.objecter handle_osd_map got epochs [379,390] > 378
2021-10-15 14:21:15.895 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 379
2021-10-15 14:21:15.895 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 380
2021-10-15 14:21:15.895 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 381
2021-10-15 14:21:15.895 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 382
2021-10-15 14:21:15.895 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 383
2021-10-15 14:21:15.896 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 384
2021-10-15 14:21:15.896 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 385
2021-10-15 14:21:15.896 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 386
2021-10-15 14:21:15.896 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 387
2021-10-15 14:21:15.896 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 388
2021-10-15 14:21:15.896 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 389
2021-10-15 14:21:15.896 7f28a57fa700 3 client.1058179.objecter handle_osd_map decoding incremental epoch 390
2021-10-15 14:21:15.896 7f28a57fa700 20 client.1058179.objecter dump_active .. 0 homeless
2021-10-15 14:21:16.019 7f289bfff700 1 -- 172.16.80.86:0/1706554648 >> [v2:172.16.80.136:3300/0,v1:172.16.80.136:6789/0] conn(0x7f2874008600 msgr2=0x7f2874009770 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
```
5.2.2.1.2. Server
5.2.2.1.2.1. System log
dmesg
ganesha.nfsd[1400]: segfault at 7f5d8b8a19d0 ip 00007f5d9ac6ab81 sp 00007fffc2be41b0 error 4 in libpthread-2.17.so[7f5d9ac5e000+17000]
2021-10-15 10:15:20.851 7f5d96ffd700 0 client.1008523 destroyed lost open file 0x7f5da8001a10 on 0x10000000018.head(faked_ino=0 ref=3 ll_ref= 1 cap_refs={4=0,1024=0,4096=0,8192=0} open={2=1} mode=100644 size=1571815424/3145728000 nlink=1 btime=0.000000 mtime=2021-10-15 10:15:19.624848 ctime=2021-10-15 10:15:19.624848 caps=- objectset[0x10000000018 ts 0/0 objects 314 dirty_or_tx 0] parents=0x10000000000.head["dddccc"] 0x7f5dd 400d4a0)
After the client is evicted, the status will become stale
{ "id": 1094578, "inst": { "name": { "type": "client", "num": 1094578 }, "addr": { "type": "v1", "addr": "172.16.80.86:0", "nonce": 1231685619 } }, "inst_str": "client.1094578 v1:172.16.80.86:0/1231685619", "addr_str": "v1:172.16.80.86:0/1231685619", "sessions": [ { "mds": 0, "addrs": { "addrvec": [ { "type": "v2", "addr": "172.16.80.10:6818", "nonce": 1475563033 }, { "type": "v1", "addr": "172.16.80.10:6819", "nonce": 1475563033 } ] }, "seq": 0, "cap_gen": 0, "cap_ttl": "2021-10-15 16:28:35.900065", "last_cap_renew_request": "2021-10-15 16:27:35.900065", "cap_renew_seq": 64, "num_caps": 2, "state": "open" } ], "mdsmap_epoch": 251 }
Parameter client_reconnect_stale = true
After setting this parameter, the problem is solved. Its use inside ceph is shown below (only a short piece of context is quoted; see src/client/Client.cc for details): when the MDS resets a session that the client had open and this parameter is true, the client closes the session immediately so that it gets reconnected; if the parameter is false, the session is merely marked stale and nothing else happens.
```
void Client::ms_handle_remote_reset(Connection *con)
{
    ...
        case MetaSession::STATE_OPEN:
        {
            objecter->maybe_request_map(); /* to check if we are blacklisted */
            const auto& conf = cct->_conf;
            if (conf->client_reconnect_stale) {
                ldout(cct, 1) << "reset from mds we were open; close mds session for reconnect" << dendl;
                _closed_mds_session(s);
            } else {
                ldout(cct, 1) << "reset from mds we were open; mark session as stale" << dendl;
                s->state = MetaSession::STATE_STALE;
            }
        }
        break;
    ...
}
```
6. Appendix
6.1. Appendix 1: gracedb data structure
- list
```
rados -p myfs-data0 -N nfs-ns ls
conf-my-nfs.a
grace
rec-0000000000000003:my-nfs.b
conf-my-nfs.b
conf-my-nfs.c
rec-0000000000000003:my-nfs.a
rec-0000000000000003:my-nfs.c
```
- grace omap
```
rados -p myfs-data0 -N nfs-ns listomapvals grace
my-nfs.a
value (1 bytes) :
00000000  00                                                |.|
00000001

my-nfs.b
value (1 bytes) :
00000000  00                                                |.|
00000001

my-nfs.c
value (1 bytes) :
00000000  00                                                |.|
00000001
```
```
rados -p myfs-data0 -N nfs-ns listomapvals rec-0000000000000003:my-nfs.b
7013321091993042946
value (52 bytes) :
00000000  3a 3a 66 66 66 66 3a 31 30 2e 32 34 33 2e 30 2e  |::ffff:10.243.0.|
00000010  30 2d 28 32 39 3a 4c 69 6e 75 78 20 4e 46 53 76  |0-(29:Linux NFSv|
00000020  34 2e 31 20 6b 38 73 2d 31 2e 6e 6f 76 61 6c 6f  |4.1 k8s-1.novalo|
00000030  63 61 6c 29                                       |cal)|
00000034
```
```
ganesha-rados-grace -p cephfs_data --ns nfs-ns --oid conf-ceph03
cur=7021712291677762853 rec=7305734900415688548
======================================================
```
7. Reference
- SUSE Installation of NFS Ganesha
- Haproxy NFS ha only supports active / standby and commercial aloha
- haproxy-nfs-ha
- ha-nfs-cluster-quick-start-guide
- mount hung when use nfs-ganesha+cephfs
- FSAL Ceph Attr_Expiration_Time multiple MDS corrupt metadata
- MDS problem slow requests, cache pressure, damaged metadata after upgrading 14.2.7 to 14.2.8
- Re: mds client reconnect
- CEPH FILE SYSTEM CLIENT EVICTION