CephFS highly available NFS Ganesha gateway

Posted by regexpert on Sun, 17 Oct 2021 19:51:09 +0200

This topic involves quite a few components and a fair amount of conceptual background. If you just want to use it, you can jump straight to the deployment practice and pick up the various systems and concepts along the way.

1. General

NFS is a widely used protocol for sharing file systems on Linux, and it is a standard feature of most shared-file products. The client is easy to install and use. CephFS, by contrast, requires installing a CephFS client and adding client-side configuration, so NFS is more convenient for users.

The purpose of this article is to show how to deploy a highly available NFS gateway cluster for CephFS.

  • Why cluster NFS
    • NFS is a stateful service. To make a stateful service highly available, its state must be kept consistent across nodes. That state includes:
      • Open files
      • File locks
      • ...
  • What mature technical solutions exist today
    • Rook is currently the only implementation of exporting CephFS over NFS with complete, publicly available material
      • High availability of the service is provided through Kubernetes Ingress, Service and Deployment resources
      • The file system state is stored in the Ceph cluster
  • Advantages and disadvantages of the existing solution
    • Advantage: community support
    • Disadvantage: it can only be deployed in containers through Rook, and running Ceph itself in containers in production is quite challenging. Running only the NFS cluster in containers while the Ceph cluster keeps running as plain processes is possible, but it increases the difficulty of delivery.

2. Terminology

  • NFS [1]: a stateful service. The following state is maintained between the client and the server:

    • Open files
    • File locks
  • NFS service properties [2]:

    • Minimum supported version
    • Grace period: the number of seconds (from 15 to 600) that clients have to reclaim their state after the server reboots from an unplanned interruption. This property only affects NFSv4.0 and NFSv4.1 clients (NFSv3 is a stateless protocol, so there is no state to recover). During this period the NFS service only handles reclaims of old lock state; other requests are not processed until the grace period ends. The default grace period is 90 seconds. Reducing the grace period lets NFS clients resume operation faster after a server reboot, but increases the likelihood that clients cannot recover all of their lock state.
    • Enable NFSv4 delegation: allows clients to cache files locally and modify them without contacting the server. This option is enabled by default and usually improves performance, but can cause problems in rare cases. Disable it only after careful performance measurement of the specific workload and after verifying that the change brings a considerable advantage. This option only affects NFSv4.0 and NFSv4.1 mounts.
    • Mount visibility: limits how much information about share access lists and remote mounts is exposed to NFS clients. With "full", everything is visible. With "restricted", clients can only see the shares they are allowed to access, and cannot see shares defined on the server or remote mounts made by other clients. The default is "full".
    • Maximum supported version
    • Maximum # of server threads: the maximum number of concurrent NFS requests (from 20 to 1000). It should at least cover the number of concurrent NFS clients you expect. The default value is 500 (a rough nfs-ganesha mapping follows this list).
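
Several of these properties have rough counterparts in nfs-ganesha's own configuration, which is used later in this article. A minimal sketch (the values are illustrative assumptions and the mapping is approximate, not an exact equivalence):

NFSv4 {
    # grace period in seconds (default 90)
    Grace_Period = 90;
    # disable NFSv4 delegations
    Delegations = false;
}

NFS_CORE_PARAM {
    # rough counterpart of the "maximum number of server threads" property
    Nb_Worker = 64;
    # restrict the supported protocol versions
    Protocols = 4;
}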

3. nfs-ganesha

3.1. Introduction

nfs-ganesha [3] is a user-space NFS server. Through its FSAL (File System Abstraction Layer) it can export different backend file systems over NFS, including:

  • cephfs
  • Gluster
  • GPFS
  • VFS
  • XFS
  • LUSTRE
  • RadosGW

3.2. Architecture

3.2.1. Overall structure diagram

  • Community architecture
    (community architecture diagram: https://raw.githubusercontent.com/wiki/nfs-ganesha/nfs-ganesha/images/nfs-arch.png )
  • A more detailed IBM architecture diagram [4]

3.2.2. Architecture description

To use NFS-Ganesha, focus on the following points.

NFS-Ganesha is a client of the various distributed file systems it exports. Like any file system client, it needs to cache the directory structure (dentries) and the mapping (inodes) between the local view and the backend storage medium / remote file system [5].

  • The various distributed file systems to be exported over NFS
  • MDCACHE: the metadata cache for the backend file system (dentry/inode entries)
  • FSAL: the File System Abstraction Layer, which unifies the APIs of the different backend file systems
  • NFS: the NFS services that users consume, backed by the various distributed storage systems
  • Log: the NFS-Ganesha service log

3.2.3. Ganesha Rados cluster design

This part is specific to Ceph. The design description below is translated from the official documents [6].

3.2.3.1. Client recovery (single instance)

NFSv4 is a lease-based protocol. After a client establishes a connection to the server, it must renew the lease periodically to maintain its file system state (open files, locks, delegations or layouts).
When an NFS service restarts, all of that state is lost. When the service comes back online, the client detects that the server has restarted, reclaims its previous state from the server, and restores the connection.

3.2.3.2. Grace period (single instance)

The grace period, in short, is the window in which clients may reclaim their previous state after the connection is interrupted and the server comes back. If recovery is not completed before the grace period expires, the connection is simply re-established from scratch. During recovery the server refuses to hand out new state to clients and only allows old state to be reclaimed.

3.2.3.3. Reboot Epochs

Epoch states of the service:

  • R: Recovery
  • N: Normal

Epoch values:

  • C: the current epoch, shown as cur in the gracedb dump output
  • R: when non-zero, the recovery epoch whose grace period is in effect; 0 means the grace period has ended. Shown as rec in the gracedb dump output

3.2.3.4. gracedb

Stateful data is recorded in a database; for a Rados cluster the recovery information is stored in RADOS omaps. The ganesha-rados-grace [7] command line tool manipulates this data, for example to add nodes to the cluster or kick them out (a minimal usage sketch follows this list). After joining the cluster, a node carries two flags:

  • N (NEED): the node has clients that need recovery
  • E (ENFORCING): the node is enforcing the grace period
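
A minimal sketch of the membership commands (the pool, namespace and node names match the deployment later in this article):

# dump the current epochs and per-node N/E flags
ganesha-rados-grace -p cephfs_data --ns nfs-ns dump
# add a node to the gracedb cluster
ganesha-rados-grace -p cephfs_data --ns nfs-ns add ceph01
# remove a node that is being decommissioned
ganesha-rados-grace -p cephfs_data --ns nfs-ns remove ceph01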

3.2.3.5. Cluster

The client recovery, grace period and gracedb sections above explain the recovery principle and state storage for a single service. A cluster simply deploys several such services and keeps that state consistent between them.

ganesha cluster scenario: multiple ganesha services are deployed and clients reach one of them through a VIP or DNS name. If the ganesha service currently being used goes down, the client switches to another ganesha service (note that the state of the previous connection is also available to the other nodes of the cluster, since it lives in the shared database). The new connection then restores the original state; if it cannot be restored, the old state is deleted from the database and the connection is rebuilt.

3.3. Implementation of high availability cluster

  • Stateless part: deploy multiple ganesha services and load-balance them for high availability with keepalived + haproxy
  • Stateful part: state data is managed through gracedb and stored in the omap of Ceph RADOS objects

With the architecture settled, deployment can begin.

4. Deployment

4.1. Environmental description

| Component | Version | Remarks |
| --- | --- | --- |
| Operating system | CentOS Linux release 7.8.2003 (Core) | |
| Operating system kernel | 3.10.0-1127.el7.x86_64 | |
| nfs-ganesha | 2.8.1 | |
| ceph | ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable) | |
| haproxy | 1.5.18-9 | |
| keepalived | 1.3.5-19 | |
| Number of nodes | 3 | |

4.2. Installing software

4.2.1. Configuring yum source

cat /etc/yum.repos.d/nfs-ganesha.repo
[nfsganesha]
name=nfsganesha
baseurl=https://mirrors.cloud.tencent.com/ceph/nfs-ganesha/rpm-V2.8-stable/nautilus/x86_64/
gpgcheck=0
enabled=1

4.2.2. Installation

yum install -y nfs-ganesha nfs-ganesha-ceph \
  nfs-ganesha-rados-grace nfs-ganesha-rgw \
  haproxy keepalived

4.3. ganesha configuration

4.3.1. /etc/ganesha/ganesha.conf

All three nodes need this configuration; be sure to adjust the node-specific items in the file on each node (an example of the per-node differences follows the block below).

NFS_CORE_PARAM {
	Enable_NLM = false;
	Enable_RQUOTA = false;
	Protocols = 4;
}

MDCACHE {
	Dir_Chunk = 0;
}

EXPORT_DEFAULTS {
	Attr_Expiration_Time = 0;
}

NFSv4 {
	Delegations = false;
	RecoveryBackend = 'rados_cluster';
	Minor_Versions = 1, 2;
}

RADOS_KV {
    # Ceph configuration file
    ceph_conf = '/etc/ceph/ceph.conf';
    # ceph user that ganesha uses to access the cluster
    userid = admin;
    # name of this ganesha node
    nodeid = "ceph01";
    # pool in which ganesha stores its state
    pool = "cephfs_data";
    # namespace for the ganesha state objects
    namespace = "nfs-ns";
}

RADOS_URLS {
    ceph_conf = '/etc/ceph/ceph.conf';
    userid = admin;
    watch_url = 'rados://cephfs_data/nfs-ns/conf-ceph01';
}

# Watch this RADOS object so that the ganesha configuration can be updated online
%url	rados://cephfs_data/nfs-ns/conf-ceph01
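
For example, on ceph02 only the node-specific values change (a sketch assuming the same pool and namespace as above):

RADOS_KV {
    ...
    nodeid = "ceph02";
}

RADOS_URLS {
    ...
    watch_url = 'rados://cephfs_data/nfs-ns/conf-ceph02';
}

%url	rados://cephfs_data/nfs-ns/conf-ceph02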

4.3.2. Create an export file and upload it to ceph

These rados operations can be run from any one of the nodes.

# create the configuration file
cat conf-ceph
%url "rados://cephfs_data/nfs-ns/export-1"
# upload one configuration object per node
rados put -p cephfs_data -N nfs-ns conf-ceph01 conf-ceph
rados put -p cephfs_data -N nfs-ns conf-ceph02 conf-ceph
rados put -p cephfs_data -N nfs-ns conf-ceph03 conf-ceph
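
As a quick check, listing the namespace should now show the three per-node configuration objects:

rados -p cephfs_data -N nfs-ns ls
# expected at this point: conf-ceph01, conf-ceph02, conf-ceph03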

4.3.3. Create the first export directory

cat export

EXPORT {
    FSAL {
        # ceph user; tied to the access permissions below
        user_id = "admin";
        # secret key of the user above
        secret_access_key = "AQC5Z1Rh6Nu3BRAAc98ORpMCLu9kXuBh/k3oHA==";
        name = "CEPH";
        # cephfs file system name
        filesystem = "cephfs";
    }
    # path that clients use when mounting over NFS
    pseudo = "/test";
    # root squash behaviour
    squash = "no_root_squash";
    # access mode for the export
    access_type = "RW";
    # path inside cephfs
    path = "/test001";
    # unique numeric export id
    export_id = 1;
    transports = "UDP", "TCP";
    protocols = 3, 4;
}

As with the configuration objects above, this export needs to be uploaded to the cluster:

rados put -p cephfs_data -N nfs-ns export-1 export
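
To read an export object back for verification, and (an assumption, not exercised in this article) to nudge a running ganesha into re-reading the configuration object it watches without a restart:

# read the export object back and check its contents
rados -p cephfs_data -N nfs-ns get export-1 /tmp/export-1.check
cat /tmp/export-1.check
# assumption: since ganesha watches conf-ceph0X (watch_url), a notify on that
# object should trigger a configuration reload; otherwise restart nfs-ganesha
rados -p cephfs_data -N nfs-ns notify conf-ceph01 reload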

4.3.4. Join nodes to gracedb cluster

Now for the stateful part: the three nodes are added to the gracedb cluster.
⚠️ Note that the node names, pool name and namespace must match those in the configuration file.

ganesha-rados-grace -p cephfs_data --ns nfs-ns add ceph01 ceph02 ceph03

  • After executing the command, the status looks like this:

ganesha-rados-grace -p cephfs_data --ns nfs-ns
cur=5 rec=4
======================================================
ceph01	NE
ceph02	NE
ceph03	NE

  • Start the ganesha service on all three nodes:

systemctl start nfs-ganesha

  • A short while after the services start, the node status in gracedb returns to normal:

ganesha-rados-grace -p cephfs_data --ns nfs-ns
cur=5 rec=0
======================================================
ceph01
ceph02
ceph03

For the meaning of cur, rec and the N/E flags in the output, see the gracedb section in the nfs-ganesha chapter above.

4.4. haproxy+keepalived

4.4.1. haproxy

global
    log         127.0.0.1 local2

    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     8000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 8000

listen stats
   bind 172.16.80.86:9000
   mode http
   stats enable
   stats uri /
   stats refresh 15s
   stats realm Haproxy\ Stats
   stats auth admin:admin

frontend nfs-in
    bind 172.16.80.244:2049
    mode tcp
    option tcplog
    default_backend             nfs-back

backend nfs-back
    balance     source
    mode        tcp
    log         /dev/log local0 debug
    server      ceph01   172.16.80.86:2049 check
    server      ceph02   172.16.80.136:2049 check
    server      ceph03   172.16.80.10:2049 check
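
Before starting haproxy, the configuration can be syntax-checked as a quick sanity check:

haproxy -c -f /etc/haproxy/haproxy.cfg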

4.4.2. keepalived

global_defs {
   router_id CEPH_NFS
}

vrrp_script check_haproxy {
    script "killall -0 haproxy"
    weight -20
    interval 2
    rise 2
    fall 2
}

vrrp_instance VI_0 {
    state BACKUP
    priority 100
    interface eth0
    virtual_router_id 51
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        172.16.80.244/24 dev eth0
    }
    track_script {
        check_haproxy
    }
}

Note: the priority must be different on each of the three keepalived instances, while virtual_router_id must be the same on all of them (see the example below).
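
For example, the per-node differences could look like this (the priority values are illustrative assumptions):

# On ceph02:
vrrp_instance VI_0 {
    state BACKUP
    priority 90                  # lower than ceph01 (100); ceph03 could use 80
    interface eth0
    virtual_router_id 51         # must match the other nodes
    ...
}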

4.4.2.1. System configuration

Without the following kernel settings, haproxy cannot start normally (it needs to be able to bind the VIP even on nodes that do not currently hold it):

echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.conf
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
sysctl -p

4.4.2.2. Start service

systemctl restart haproxy keepalived

With this, the highly available NFS gateway for CephFS is complete; the same approach works for RGW.

5. Verification
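
For the tests below, a client mount through the VIP is assumed (a sketch; /test is the pseudo path defined in the export above and 172.16.80.244 is the keepalived VIP):

mkdir -p /mnt/nfs-test
mount -t nfs -o vers=4.1 172.16.80.244:/test /mnt/nfs-test
# simple read/write check against the mount
dd if=/dev/zero of=/mnt/nfs-test/aaa bs=1M count=100
cat /mnt/nfs-test/aaa > /dev/null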

5.1. Server failover while the client has no I/O in flight

The client hangs for a short period (about 1 minute) and then recovers:

Oct  8 20:52:03 wanggangfeng-dev kernel: nfs: server 172.16.80.244 not responding, still trying
Oct  8 20:53:07 wanggangfeng-dev kernel: nfs: server 172.16.80.244 OK

5.2. Server disconnected for more than five minutes while the client is under heavy I/O

The investigation that follows is fairly long, so the conclusion comes first.

5.2.1. Test overview

In this test the server stays disconnected for more than five minutes, so the NFS-Ganesha cephfs client times out, is evicted by cephfs, and is added to the blacklist. After the network is restored, the client stays stuck. The problem was finally solved by adjusting parameters.

The following settings are added to /etc/ceph/ceph.conf on the NFS-Ganesha servers:

# do not blacklist clients whose mds session times out
mds_session_blacklist_on_timeout = false
# do not blacklist clients when the mds evicts them
mds_session_blacklist_on_evict = false
# re-establish the connection when the client session becomes stale
client_reconnect_stale = true

5.2.2. Specific test phenomena and troubleshooting

5.2.2.1. Phenomenon

5.2.2.1.1. Client

During heavy reads and writes the kernel reports hung tasks (a normal phenomenon when I/O times out) and the ceph cluster reports slow requests; this is caused by the poor performance of ceph running on virtual machines. The failover itself actually succeeds: running netstat -anolp | grep 2049 on the node holding the VIP shows the client connection.

Oct  8 21:47:16 wanggangfeng-dev kernel: INFO: task dd:2053 blocked for more than 120 seconds.
Oct  8 21:47:16 wanggangfeng-dev kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 21:47:16 wanggangfeng-dev kernel: dd              D ffff986436d9acc0     0  2053   2023 0x00000080
Oct  8 21:47:16 wanggangfeng-dev kernel: Call Trace:
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83ed0>] ? bit_wait+0x50/0x50
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab85d89>] schedule+0x29/0x70
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83891>] schedule_timeout+0x221/0x2d0
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa4e69a1>] ? put_prev_entity+0x31/0x400
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa46d39e>] ? kvm_clock_get_cycles+0x1e/0x20
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83ed0>] ? bit_wait+0x50/0x50
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab8547d>] io_schedule_timeout+0xad/0x130
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab85518>] io_schedule+0x18/0x20
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83ee1>] bit_wait_io+0x11/0x50
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83a07>] __wait_on_bit+0x67/0x90
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83ed0>] ? bit_wait+0x50/0x50
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab83b71>] out_of_line_wait_on_bit+0x81/0xb0
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa4c7840>] ? wake_bit_function+0x40/0x40
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc0669193>] nfs_wait_on_request+0x33/0x40 [nfs]
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc066e483>] nfs_updatepage+0x153/0x8e0 [nfs]
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc065d5c1>] nfs_write_end+0x171/0x3c0 [nfs]
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa5be044>] generic_file_buffered_write+0x164/0x270
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc06a4e90>] ? nfs4_xattr_set_nfs4_label+0x50/0x50 [nfsv4]
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc06a4e90>] ? nfs4_xattr_set_nfs4_label+0x50/0x50 [nfsv4]
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa5c0872>] __generic_file_aio_write+0x1e2/0x400
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa5c0ae9>] generic_file_aio_write+0x59/0xa0
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffc065ca2b>] nfs_file_write+0xbb/0x1e0 [nfs]
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa64c663>] do_sync_write+0x93/0xe0
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa64d150>] vfs_write+0xc0/0x1f0
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaa64df1f>] SyS_write+0x7f/0xf0
Oct  8 21:47:16 wanggangfeng-dev kernel: [<ffffffffaab92ed2>] system_call_fastpath+0x25/0x2a

The dd process does not recover its state on its own and has to be killed manually; after that the client recovers after a while.

Client log:

Oct  8 21:54:22 wanggangfeng-dev kernel: NFS: nfs4_reclaim_open_state: unhandled error -121

5.2.2.1.1.1. After the server is restored
# after the VIP switches back
cat aaa
cat: aaa: Remote I/O error
# the df mount disappears and the share needs to be mounted again

Ganesha log errors in /var/log/ganesha/ganesha.log:

09/10/2021 11:13:02 : epoch 616104eb : ceph01 : ganesha.nfsd-78386[svc_8] posix2fsal_error :FSAL :CRIT :Mapping 108(default) to ERR_FSAL_SERVERFAULT

mds log

2021-10-09 11:05:59.220 7fc5b6e8c700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ceph01 (44152), after 302.173 seconds
2021-10-09 11:05:59.220 7fc5b6e8c700  1 mds.0.12 Evicting (and blacklisting) client session 44152 (v1:172.16.80.86:0/3860306704)
2021-10-09 11:05:59.220 7fc5b6e8c700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 44152 (v1:172.16.80.86:0/3860306704)
2021-10-09 11:06:04.220 7fc5b6e8c700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ceph01 (49033), after 302.374 seconds
2021-10-09 11:06:04.220 7fc5b6e8c700  1 mds.0.12 Evicting (and blacklisting) client session 49033 (172.16.80.86:0/2196066904)
2021-10-09 11:06:04.220 7fc5b6e8c700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 49033 (172.16.80.86:0/2196066904)
2021-10-09 11:09:47.444 7fc5bae94700  0 --2- [v2:172.16.80.10:6818/3513818921,v1:172.16.80.10:6819/3513818921] >> 172.16.80.86:0/2196066904 conn(0x5637be582800 0x5637bcbcf800 crc :-1 s=SESSION_ACCEPTING pgs=465 cs=0 l=0 rev1=1 rx=0 tx=0).handle_reconnect no existing connection exists, reseting client
2021-10-09 11:09:53.035 7fc5bae94700  0 --1- [v2:172.16.80.10:6818/3513818921,v1:172.16.80.10:6819/3513818921] >> v1:172.16.80.86:0/3860306704 conn(0x5637be529c00 0x5637bcc17000 :6819 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
2021-10-09 11:09:53.047 7fc5b8e90700  0 mds.0.server  ignoring msg from not-open sessionclient_reconnect(1 caps 1 realms ) v3
2021-10-09 11:09:53.047 7fc5bae94700  0 --1- [v2:172.16.80.10:6818/3513818921,v1:172.16.80.10:6819/3513818921] >> v1:172.16.80.86:0/3860306704 conn(0x5637be50dc00 0x5637bcc14800 :6819 s=OPENED pgs=26 cs=1 l=0).fault server, going to standby

The client is added to the OSD blacklist:

ceph osd blacklist ls
172.16.80.86:0/4234145215 2021-10-09 11:23:34.169972
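
If a gateway does end up blacklisted, the entry can also be removed manually (an extra recovery step, not part of the procedure in this article):

ceph osd blacklist rm 172.16.80.86:0/4234145215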

Modify blacklist configuration

mds_session_blacklist_on_timeout = false
mds_session_blacklist_on_evict = false

For configuration details, refer to CEPH FILE SYSTEM CLIENT EVICTION.

ganesha log

15/10/2021 10:37:42 : epoch 6168e43f : ceph01 : ganesha.nfsd-359850[svc_102] rpc :TIRPC :EVENT :svc_vc_wait: 0x7f9a88000c80 fd 67 recv errno 104 (will set dead)
15/10/2021 10:37:44 : epoch 6168e43f : ceph01 : ganesha.nfsd-359850[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 6289, old: 6288; deq new: 6287, old: 6287

5.2.2.1.1.2. Mount a ceph-fuse client

Disconnect the client and wait for it to time out:

x7f28740060c0 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).stop
2021-10-15 14:21:11.445 7f28ad46b700  1 -- 172.16.80.86:0/1706554648 --> [v2:172.16.80.136:3300/0,v1:172.16.80.136:6789/0] -- mon_subscribe({config=0+,mdsmap=246+,monmap=2+}) v3 -- 0x7f28740080d0 con 0x7f2874008600
2021-10-15 14:21:11.445 7f28ad46b700  1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.136:3300/0,v1:172.16.80.136:6789/0] conn(0x7f2874008600 0x7f2874009770 secure :-1 s=READY pgs=149220 cs=0 l=1 rev1=1 rx=0x7f289c0080c0 tx=0x7f289c03dd20).ready entity=mon.2 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2021-10-15 14:21:11.445 7f28a57fa700 10 client.1058179.objecter ms_handle_connect 0x7f2874008600
2021-10-15 14:21:11.445 7f28a57fa700 10 client.1058179.objecter resend_mon_ops
2021-10-15 14:21:11.446 7f28a57fa700  1 -- 172.16.80.86:0/1706554648 <== mon.2 v2:172.16.80.136:3300/0 1 ==== mon_map magic: 0 v1 ==== 422+0+0 (secure 0 0 0) 0x7f289c040310 con 0x7f2874008600
2021-10-15 14:21:11.446 7f28a57fa700  1 -- 172.16.80.86:0/1706554648 <== mon.2 v2:172.16.80.136:3300/0 2 ==== config(0 keys) v1 ==== 4+0+0 (secure 0 0 0) 0x7f289c040510 con 0x7f2874008600
2021-10-15 14:21:11.446 7f28a57fa700  1 -- 172.16.80.86:0/1706554648 <== mon.2 v2:172.16.80.136:3300/0 3 ==== mdsmap(e 251) v1 ==== 693+0+0 (secure 0 0 0) 0x7f289c040a20 con 0x7f2874008600
2021-10-15 14:21:11.446 7f28a57fa700 10 client.1058179.objecter ms_dispatch 0x560568e78330 mdsmap(e 251) v1
2021-10-15 14:21:15.317 7f28a7fff700 10 client.1058179.objecter tick
2021-10-15 14:21:15.565 7f28a6ffd700  1 -- 172.16.80.86:0/1706554648 --> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] -- client_session(request_renewcaps seq 476) v3 -- 0x7f2884045550 con 0x560568fe8180
2021-10-15 14:21:15.893 7f28ad46b700  1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] conn(0x560568fe8180 0x560568fea590 unknown :-1 s=BANNER_CONNECTING pgs=16770 cs=565 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2021-10-15 14:21:15.893 7f28ad46b700  1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] conn(0x560568fe8180 0x560568fea590 crc :-1 s=SESSION_RECONNECTING pgs=16770 cs=565 l=0 rev1=1 rx=0 tx=0).handle_session_reset received session reset full=1
2021-10-15 14:21:15.893 7f28ad46b700  1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] conn(0x560568fe8180 0x560568fea590 crc :-1 s=SESSION_RECONNECTING pgs=16770 cs=565 l=0 rev1=1 rx=0 tx=0).reset_session
2021-10-15 14:21:15.894 7f28a57fa700  0 client.1058179 ms_handle_remote_reset on v2:172.16.80.10:6818/1475563033
2021-10-15 14:21:15.894 7f28a57fa700 10 client.1058179.objecter _maybe_request_map subscribing (onetime) to next osd map
2021-10-15 14:21:15.894 7f28a57fa700  1 -- 172.16.80.86:0/1706554648 --> [v2:172.16.80.136:3300/0,v1:172.16.80.136:6789/0] -- mon_subscribe({osdmap=379}) v3 -- 0x7f2890007ca0 con 0x7f2874008600
2021-10-15 14:21:15.894 7f28a57fa700 10 client.1058179.objecter ms_handle_connect 0x560568fe8180
2021-10-15 14:21:15.894 7f28ad46b700  1 --2- 172.16.80.86:0/1706554648 >> [v2:172.16.80.10:6818/1475563033,v1:172.16.80.10:6819/1475563033] conn(0x560568fe8180 0x560568fea590 crc :-1 s=READY pgs=17729 cs=0 l=0 rev1=1 rx=0 tx=0).ready entity=mds.0 client_cookie=13aaa6fc82c7d050 server_cookie=a531fa96d3b029a4 in_seq=0 out_seq=0
2021-10-15 14:21:15.895 7f28a57fa700  1 -- 172.16.80.86:0/1706554648 <== mon.2 v2:172.16.80.136:3300/0 4 ==== osd_map(379..390 src has 1..390) v4 ==== 69665+0+0 (secure 0 0 0) 0x7f289c008390 con 0x7f2874008600
2021-10-15 14:21:15.895 7f28a57fa700 10 client.1058179.objecter ms_dispatch 0x560568e78330 osd_map(379..390 src has 1..390) v4
2021-10-15 14:21:15.895 7f28a57fa700  3 client.1058179.objecter handle_osd_map got epochs [379,390] > 378
2021-10-15 14:21:15.895 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 379
2021-10-15 14:21:15.895 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 380
2021-10-15 14:21:15.895 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 381
2021-10-15 14:21:15.895 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 382
2021-10-15 14:21:15.895 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 383
2021-10-15 14:21:15.896 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 384
2021-10-15 14:21:15.896 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 385
2021-10-15 14:21:15.896 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 386
2021-10-15 14:21:15.896 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 387
2021-10-15 14:21:15.896 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 388
2021-10-15 14:21:15.896 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 389
2021-10-15 14:21:15.896 7f28a57fa700  3 client.1058179.objecter handle_osd_map decoding incremental epoch 390
2021-10-15 14:21:15.896 7f28a57fa700 20 client.1058179.objecter dump_active .. 0 homeless
2021-10-15 14:21:16.019 7f289bfff700  1 -- 172.16.80.86:0/1706554648 >> [v2:172.16.80.136:3300/0,v1:172.16.80.136:6789/0] conn(0x7f2874008600 msgr2=0x7f2874009770 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
5.2.2.1.2. Server
5.2.2.1.2.1. System log

dmesg

ganesha.nfsd[1400]: segfault at 7f5d8b8a19d0 ip 00007f5d9ac6ab81 sp 00007fffc2be41b0 error 4 in libpthread-2.17.so[7f5d9ac5e000+17000]
2021-10-15 10:15:20.851 7f5d96ffd700  0 client.1008523  destroyed lost open file 0x7f5da8001a10 on 0x10000000018.head(faked_ino=0 ref=3 ll_ref=
1 cap_refs={4=0,1024=0,4096=0,8192=0} open={2=1} mode=100644 size=1571815424/3145728000 nlink=1 btime=0.000000 mtime=2021-10-15 10:15:19.624848
 ctime=2021-10-15 10:15:19.624848 caps=- objectset[0x10000000018 ts 0/0 objects 314 dirty_or_tx 0] parents=0x10000000000.head["dddccc"] 0x7f5dd
400d4a0)

After the client is evicted, the status will become stale

{
    "id": 1094578,
    "inst": {
        "name": {
            "type": "client",
            "num": 1094578
        },
        "addr": {
            "type": "v1",
            "addr": "172.16.80.86:0",
            "nonce": 1231685619
        }
    },
    "inst_str": "client.1094578 v1:172.16.80.86:0/1231685619",
    "addr_str": "v1:172.16.80.86:0/1231685619",
    "sessions": [
        {
            "mds": 0,
            "addrs": {
                "addrvec": [
                    {
                        "type": "v2",
                        "addr": "172.16.80.10:6818",
                        "nonce": 1475563033
                    },
                    {
                        "type": "v1",
                        "addr": "172.16.80.10:6819",
                        "nonce": 1475563033
                    }
                ]
            },
            "seq": 0,
            "cap_gen": 0,
            "cap_ttl": "2021-10-15 16:28:35.900065",
            "last_cap_renew_request": "2021-10-15 16:27:35.900065",
            "cap_renew_seq": 64,
            "num_caps": 2,
            "state": "open"
        }
    ],
    "mdsmap_epoch": 251
}

Parameter: client_reconnect_stale = true

Setting this parameter solved the problem. Its use inside ceph is shown below (only a short piece of context; see src/client/Client.cc for details): when the MDS resets a session while this parameter is true, the client closes the session immediately so that it can reconnect; when it is false, the session is only marked stale and nothing else is done.

void Client::ms_handle_remote_reset(Connection *con)
{
  ...
	case MetaSession::STATE_OPEN:
	  {
	    objecter->maybe_request_map(); /* to check if we are blacklisted */
	    const auto& conf = cct->_conf;
	    if (conf->client_reconnect_stale) {
	      ldout(cct, 1) << "reset from mds we were open; close mds session for reconnect" << dendl;
	      _closed_mds_session(s);
	    } else {
	      ldout(cct, 1) << "reset from mds we were open; mark session as stale" << dendl;
	      s->state = MetaSession::STATE_STALE;
	    }
	  }
	  break;
...
}

6. Appendix

6.1. Appendix 1: gracedb data structure

  • list
rados -p myfs-data0 -N nfs-ns  ls
conf-my-nfs.a
grace
rec-0000000000000003:my-nfs.b
conf-my-nfs.b
conf-my-nfs.c
rec-0000000000000003:my-nfs.a
rec-0000000000000003:my-nfs.c
  • grace omap
rados -p myfs-data0 -N nfs-ns  listomapvals grace
my-nfs.a
value (1 bytes) :
00000000  00                                                |.|
00000001

my-nfs.b
value (1 bytes) :
00000000  00                                                |.|
00000001

my-nfs.c
value (1 bytes) :
00000000  00                                                |.|
00000001
rados -p myfs-data0 -N nfs-ns  listomapvals rec-0000000000000003:my-nfs.b
7013321091993042946
value (52 bytes) :
00000000  3a 3a 66 66 66 66 3a 31  30 2e 32 34 33 2e 30 2e  |::ffff:10.243.0.|
00000010  30 2d 28 32 39 3a 4c 69  6e 75 78 20 4e 46 53 76  |0-(29:Linux NFSv|
00000020  34 2e 31 20 6b 38 73 2d  31 2e 6e 6f 76 61 6c 6f  |4.1 k8s-1.novalo|
00000030  63 61 6c 29                                       |cal)|
00000034
ganesha-rados-grace -p cephfs_data --ns nfs-ns --oid conf-ceph03
cur=7021712291677762853 rec=7305734900415688548
======================================================

7. Reference

  1. Client Recovery in NFS Version 4

  2. NFS service properties

  3. nfs-ganesha wiki

  4. NFS-Ganesha for Clustered NAS

  5. Linux VFS

  6. ganesha-rados-cluster-design

  7. gracedb

Topics: Linux Ceph cloud computing