How to identify and solve complex dcache problems

Posted by tefflox on Tue, 18 Jan 2022 13:30:10 +0100

Background: This case is from CentOS 7.6, but the problem actually exists in many kernel versions. Monitoring and controlling the various Linux caches well has always been a hot topic in cloud computing, but these hot spots belong to narrow scenarios and are hard to merge into the mainline Linux baseline. With the gradual stabilization of eBPF, programmable observation of the general Linux kernel may bring new gains. Here is how we troubleshot and solved this problem.

1, Fault phenomenon

The OPPO cloud kernel team found that the CPU consumption of snmpd in the cluster was abnormally high:
snmpd occupied almost a full core for a long time. perf showed the following hotspots:

+   92.00%     3.96%  [kernel]    [k]    __d_lookup 
-   48.95%    48.95%  [kernel]    [k] _raw_spin_lock 
     20.95% 0x70692f74656e2f73                       
        __fopen_internal                              
        __GI___libc_open                              
        system_call                                   
        sys_open                                       
        do_sys_open                                    
        do_filp_open                                   
        path_openat                                    
        link_path_walk                                 
      + lookup_fast                                    
-   45.71%    44.58%  [kernel]    [k] proc_sys_compare 
   - 5.48% 0x70692f74656e2f73                          
        __fopen_internal                               
        __GI___libc_open                               
        system_call                                    
        sys_open                                       
        do_sys_open                                    
        do_filp_open                                   
        path_openat                                    
   + 1.13% proc_sys_compare                                                                                                                     

Almost all the time is spent in kernel mode in calls to __d_lookup. The consumption seen by strace is:

open("/proc/sys/net/ipv4/neigh/kube-ipvs0/retrans_time_ms", O_RDONLY) = 8 <0.000024>------v4 It's fast
open("/proc/sys/net/ipv6/neigh/ens7f0_58/retrans_time_ms", O_RDONLY) = 8 <0.456366>-------v6 Very slow

Further manual testing shows that entering the ipv6 path is very slow:

time cd /proc/sys/net

real 0m0.000s
user 0m0.000s
sys 0m0.000s

time cd /proc/sys/net/ipv6

real 0m2.454s
user 0m0.000s
sys 0m0.509s

time cd /proc/sys/net/ipv4

real 0m0.000s
user 0m0.000s
sys 0m0.000s
The time spent entering the ipv6 path is much greater than that of the ipv4 path.
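This difference can also be quantified with a small program; the sketch below simply times a single open() of each path (the paths come from the strace output above, and the interface names will differ on other machines):

#include <stdio.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>

/* time one open()+close() of a sysctl path, in seconds */
static double time_open(const char *path)
{
	struct timespec a, b;
	int fd;

	clock_gettime(CLOCK_MONOTONIC, &a);
	fd = open(path, O_RDONLY);
	clock_gettime(CLOCK_MONOTONIC, &b);
	if (fd >= 0)
		close(fd);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	printf("ipv4: %.6fs\n",
	       time_open("/proc/sys/net/ipv4/neigh/kube-ipvs0/retrans_time_ms"));
	printf("ipv6: %.6fs\n",
	       time_open("/proc/sys/net/ipv6/neigh/ens7f0_58/retrans_time_ms"));
	return 0;
}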

2, Fault phenomenon analysis

We need to understand why the perf hotspots show __d_lookup and proc_sys_compare consuming so much, and what their flow is.
proc_sys_compare has only one call path, the d_compare callback. From the call chain:

__d_lookup--->if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
.....
	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {

		if (dentry->d_name.hash != hash)
			continue;

		spin_lock(&dentry->d_lock);
		if (dentry->d_parent != parent)
			goto next;
		if (d_unhashed(dentry))
			goto next;

		/*
		 * It is safe to compare names since d_move() cannot
		 * change the qstr (protected by d_lock).
		 */
		if (parent->d_flags & DCACHE_OP_COMPARE) {
			int tlen = dentry->d_name.len;
			const char *tname = dentry->d_name.name;
			if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
				goto next;//caq: if 1 is returned, it is different
		} else {
			if (dentry->d_name.len != len)
				goto next;
			if (dentry_cmp(dentry, str, len))
				goto next;
		}
		....
next:
		spin_unlock(&dentry->d_lock);//caq: move on to the next element on the chain
 	}		

.....
}

The snmpd processes on these cluster machines are the same as on ordinary physical machines, so it is natural to suspect that hlist_bl_for_each_entry_rcu loops too many times, i.e. that parent->d_op->d_compare keeps comparing entries along a long conflict chain. Entering ipv6 would then mean a huge number of comparisons, and traversing such a list also incurs many cache misses. Too many elements on the chain could trigger exactly this, so it needs to be verified:

static inline long hlist_count(const struct dentry *parent, const struct qstr *name)
{
  long count = 0;
  unsigned int hash = name->hash;
  struct hlist_bl_head *b = d_hash(parent, hash);
  struct hlist_bl_node *node;
  struct dentry *dentry;

  rcu_read_lock();
  hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
    count++;
  }
  rcu_read_unlock();
  if (count > COUNT_THRES) {  /* COUNT_THRES: threshold above which the chain is logged */
    printk("hlist_bl_head=%p,count=%ld,name=%s,hash=%u\n", b, count, name->name, name->hash);
  }
  return count;
}
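For reference, here is a minimal sketch of how such a counting helper might be wired up as a jprobe on __d_lookup on a 3.10-era kernel. This is an illustration, not the exact probe we used: COUNT_THRES and hlist_count() are the helper above, and the d_hash()/dentry_hashtable internals it relies on are private to fs/dcache.c, so a real module would have to resolve them separately (for example via kallsyms).

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/dcache.h>

/* jprobe handler: must have the same prototype as __d_lookup */
static struct dentry *jp__d_lookup(const struct dentry *parent,
				   const struct qstr *name)
{
	hlist_count(parent, name);	/* the counting helper shown above */
	jprobe_return();		/* mandatory for jprobe handlers */
	return NULL;			/* never reached */
}

static struct jprobe dlookup_probe = {
	.entry = jp__d_lookup,
	.kp    = { .symbol_name = "__d_lookup" },
};

static int __init dlookup_probe_init(void)
{
	return register_jprobe(&dlookup_probe);
}

static void __exit dlookup_probe_exit(void)
{
	unregister_jprobe(&dlookup_probe);
}

module_init(dlookup_probe_init);
module_exit(dlookup_probe_exit);
MODULE_LICENSE("GPL");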

kprobe results are as follows:

[20327461.948219] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/neigh/ens7f1_46/base_reachable_time_ms,hash=913731689
[20327462.190378] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/neigh/ens7f0_51/retrans_time_ms,hash=913731689
[20327462.432954] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/conf/ens7f0_51/forwarding,hash=913731689
[20327462.675609] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/neigh/ens7f0_51/base_reachable_time_ms,hash=913731689

Judging by the length of the conflict chain, the lookup does land on a very long conflict chain in the dcache hash table. This chain holds 799259 dentries, and they all refer to "ipv6" dentries.
Anyone familiar with how the dcache works knows that all elements on one conflict chain must share the same hash value, and in the dcache that hash is formed from the parent dentry and the name hash:

static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
					unsigned int hash)
{
	hash += (unsigned long) parent / L1_CACHE_BYTES;
	hash = hash + (hash >> D_HASHBITS);
	return dentry_hashtable + (hash & D_HASHMASK);
}
In higher kernel versions it is:
static inline struct hlist_bl_head *d_hash(unsigned int hash)
{
	return dentry_hashtable + (hash >> d_hash_shift);
}

On the surface, only the calculation of dentry->d_name.hash has changed; in fact, the hash stored in dentry->d_name is now salted with the parent pointer at the time the name is hashed. For details, refer to the following patch:

commit 8387ff2577eb9ed245df9a39947f66976c6bcd02
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Jun 10 07:51:30 2016 -0700

    vfs: make the string hashes salt the hash
    
    We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time.  It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.
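Side by side, the lookup-time bucket selection before and after this patch looks roughly like the sketch below. These two helpers belong to different kernel generations and would not compile together; the full_name_hash() prototypes are my assumption of the mainline before/after forms, simplified for comparison. The key point is that the parent pointer participates in bucket selection in both schemes, so same-named dentries that share the same parent still land in the same bucket.

/* Before the patch: hash only the name, fold the parent in later. */
static struct hlist_bl_head *old_bucket(const struct dentry *parent,
					const struct qstr *name)
{
	unsigned int hash = full_name_hash(name->name, name->len);

	return d_hash(parent, hash);	/* old two-argument d_hash() */
}

/* After the patch: salt the hash with the parent pointer up front. */
static struct hlist_bl_head *new_bucket(const struct dentry *parent,
					const struct qstr *name)
{
	unsigned int hash = full_name_hash(parent, name->name, name->len);

	return d_hash(hash);		/* new one-argument d_hash() */
}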

Two questions arise here:

1. Even though the conflict chain is long, our dentry might sit near the front of the chain;
it should not always be so far back.

2. The dentries under proc have common, fixed file names.
Why is there such a long conflict chain at all?

To answer these two questions, we need to analyze the dentries on the conflict chain further.
Starting from the hash list head printed by the kprobe above, we can examine the dentries as follows:

crash> list dentry.d_hash -H 0xffff8a29269dc608 -s dentry.d_sb
ffff89edf533d080
  d_sb = 0xffff89db7fd3c800
ffff8a276fd1e3c0
  d_sb = 0xffff89db7fd3c800
ffff8a2925bdaa80
  d_sb = 0xffff89db7fd3c800
ffff89edf5382a80
  d_sb = 0xffff89db7fd3c800
.....

Because the list is very long, we dump the analysis to a file and find that all dentries on the conflict chain belong to the same super_block, namely 0xffff89db7fd3c800:

crash> list super_block.s_list -H super_blocks -s super_block.s_id,s_nr_dentry_unused >/home/caq/super_block.txt

# grep ffff89db7fd3c800 super_block.txt  -A 2 
ffff89db7fd3c800
  s_id = "proc\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"

0xffff89db7fd3c800 is the proc file system. Why does it create so many ipv6 dentries?
Continue with the crash commands and look at the d_inode corresponding to each dentry:

...
ffff89edf5375b00
  d_inode = 0xffff8a291f11cfb0
ffff89edf06cb740
  d_inode = 0xffff89edec668d10
ffff8a29218fa780
  d_inode = 0xffff89edf0f75240
ffff89edf0f955c0
  d_inode = 0xffff89edef9c7b40
ffff8a2769e70780
  d_inode = 0xffff8a291c1c9750
ffff8a2921969080
  d_inode = 0xffff89edf332e1a0
ffff89edf5324b40
  d_inode = 0xffff89edf2934800
...

We find that these same-named dentries, whose d_name.name is all "ipv6", have different inodes, so these proc files are not hard links of each other; this is normal.
We continue to analyze how the ipv6 path comes into being.
The formation of the /proc/sys/net/ipv6 path is roughly divided into the following steps:

start_kernel-->proc_root_init()//caq: register the proc fs
// proc is mounted by default on Linux, see the kern_mount_data function
pid_ns_prepare_proc-->kern_mount_data(&proc_fs_type, ns);//caq: mount the proc fs
proc_sys_init-->proc_mkdir("sys", NULL);//caq: create the sys directory under /proc
net_sysctl_init-->register_sysctl("net", empty);//caq: create net under /proc/sys
// for init_net:
ipv6_sysctl_register-->register_net_sysctl(&init_net, "net/ipv6", ipv6_rotable);
// for other net_namespaces, creation is generally triggered by a system call:
ipv6_sysctl_net_init-->register_net_sysctl(net, "net/ipv6", ipv6_table);//caq: create ipv6

With this foundation, let's focus on the last step, the ipv6 creation path through the
ipv6_sysctl_net_init function:
ipv6_sysctl_register–>register_pernet_subsys(&ipv6_sysctl_net_ops)–>
register_pernet_operations–>__register_pernet_operations–>
ops_init–>ipv6_sysctl_net_init
A common call stack is as follows:

 :Fri Mar  5 11:18:24 2021,runc:[1:CHILD],tid=125338.path=net/ipv6
 0xffffffffb9ac66f0 : __register_sysctl_table+0x0/0x620 [kernel]
 0xffffffffb9f4f7d2 : register_net_sysctl+0x12/0x20 [kernel]
 0xffffffffb9f324c3 : ipv6_sysctl_net_init+0xc3/0x150 [kernel]
 0xffffffffb9e2fe14 : ops_init+0x44/0x150 [kernel]
 0xffffffffb9e2ffc3 : setup_net+0xa3/0x160 [kernel]
 0xffffffffb9e30765 : copy_net_ns+0xb5/0x180 [kernel]
 0xffffffffb98c8089 : create_new_namespaces+0xf9/0x180 [kernel]
 0xffffffffb98c82ca : unshare_nsproxy_namespaces+0x5a/0xc0 [kernel]
 0xffffffffb9897d83 : sys_unshare+0x173/0x2e0 [kernel]
 0xffffffffb9f76ddb : system_call_fastpath+0x22/0x27 [kernel]
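The per-netns registration above is driven by the pernet_operations mechanism. The following is a simplified sketch of its shape; the names follow the call chain above, but the bodies are stubs for illustration, not the real ipv6 code.

#include <linux/module.h>
#include <net/net_namespace.h>

/* Called once for every net namespace that is created (setup_net -> ops_init),
 * so each new netns registers its own "net/ipv6" tables; the dentries for
 * them are then created lazily on lookup under /proc/sys. */
static int __net_init my_ipv6_sysctl_net_init(struct net *net)
{
	/* register_net_sysctl(net, "net/ipv6", ...) in the real code */
	return 0;
}

/* Called when the netns is destroyed (cleanup_net). */
static void __net_exit my_ipv6_sysctl_net_exit(struct net *net)
{
	/* unregister_net_sysctl_table(...) in the real code */
}

static struct pernet_operations my_ipv6_sysctl_net_ops = {
	.init = my_ipv6_sysctl_net_init,
	.exit = my_ipv6_sysctl_net_exit,
};

static int __init my_init(void)
{
	/* what ipv6_sysctl_register() effectively does */
	return register_pernet_subsys(&my_ipv6_sysctl_net_ops);
}

static void __exit my_exit(void)
{
	unregister_pernet_subsys(&my_ipv6_sysctl_net_ops);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");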

In the dcache, the dentries under /proc/sys/net belonging to different net_namespaces are hashed together.
How, then, is dentry isolation between net_namespaces guaranteed?
Let's look at the corresponding __register_sysctl_table function:

struct ctl_table_header *register_net_sysctl(struct net *net,
	const char *path, struct ctl_table *table)
{
	return __register_sysctl_table(&net->sysctls, path, table);
}

struct ctl_table_header *__register_sysctl_table(
	struct ctl_table_set *set,
	const char *path, struct ctl_table *table)
{
	.....
	for (entry = table; entry->procname; entry++)
		nr_entries++;//caq: calculate the number of items under the table first

	header = kzalloc(sizeof(struct ctl_table_header) +
			 sizeof(struct ctl_node)*nr_entries, GFP_KERNEL);
....
	node = (struct ctl_node *)(header + 1);
	init_header(header, root, set, node, table);
....
	/* Find the directory for the ctl_table */
	for (name = path; name; name = nextname) {
....//caq: traverse to find the corresponding path
	}

	spin_lock(&sysctl_lock);
	if (insert_header(dir, header))//caq: insert into the management structure
		goto fail_put_dir_locked;
....
}

The code is not expanded in full here. The dentries under each sys use a ctl_table_set to determine whether they are visible,
and at lookup time the comparison is as follows:

static int proc_sys_compare(const struct dentry *parent, const struct dentry *dentry,
		unsigned int len, const char *str, const struct qstr *name)
{
....
	return !head || !sysctl_is_seen(head);
}

static int sysctl_is_seen(struct ctl_table_header *p)
{
	struct ctl_table_set *set = p->set;//Get the corresponding set
	int res;
	spin_lock(&sysctl_lock);
	if (p->unregistering)
		res = 0;
	else if (!set->is_seen)
		res = 1;
	else
		res = set->is_seen(set);
	spin_unlock(&sysctl_lock);
	return res;
}

//dentries from a different ctl_table_set are not visible
static int is_seen(struct ctl_table_set *set)
{
	return &current->nsproxy->net_ns->sysctls == set;
}

As the code above shows, if the ctl_table_set of the current process's net_ns does not match the set recorded for the dentry, the compare fails. The set that snmpd belongs to is init_net's, and the sysctls of the vast majority of dentries ahead of it in the conflict chain do not belong to init_net, so almost all of the leading comparisons fail.

So why do the dentries belonging to init_net's /proc/sys/net end up at the tail of the conflict chain?
Because of the following code:

static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
					struct hlist_bl_head *h)
{
	struct hlist_bl_node *first;

	/* don't need hlist_bl_first_rcu because we're under lock */
	first = hlist_bl_first(h);

	n->next = first;//caq: new entries are always inserted at the head of the chain
	if (first)
		first->pprev = &n->next;
	n->pprev = &h->first;

	/* need _rcu because we can have concurrent lock free readers */
	hlist_bl_set_first_rcu(h, n);
}

We now know why snmpd has to traverse the conflict chain to a position so far back. Next we need to find out why there are so many dentries. From observation we found that if docker keeps creating a pause container and destroying it, these ipv6 dentries under net keep accumulating.
One reason for the accumulation is that dentries are not destroyed automatically unless memory pressure triggers reclaim: whatever can be cached stays cached. The other is that there is no limit on the length of a conflict chain.

Then the next question: why don't the ipv4 dentries accumulate? Since ipv6 and ipv4 share the same parent, how many children does that parent have?

Looking at the dentries in the hash bucket, many of their d_parent pointers point to the same dentry, 0xffff8a0a7739fd40.
crash> dentry.d_subdirs 0xffff8a0a7739fd40 ----how many children does this parent dentry have?
  d_subdirs = {
    next = 0xffff8a07a3c6f710, 
    prev = 0xffff8a0a7739fe90
  }
crash> list 0xffff8a07a3c6f710 |wc -l
1598540----------about 1.59 million children

1.59 million children; subtracting the 799259 on the long conflict chain above leaves roughly another 790000. Since the ipv4 path is fast, other dentries under the net directory must also have many children. Is this a widespread problem?

We then checked other machines in the cluster and found the same phenomenon. An excerpt of the prints follows:

 count=158505,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
hlist_bl_head=ffffbd9d5a7a6cc0,count=158507
 count=158507,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
hlist_bl_head=ffffbd9d429a7498,count=158506

As you can see, ffffbd9d429a7498 has a conflict chain of almost the same length as ffffbd9d5a7a6cc0.
Let's analyze the ipv6 chain first; the analysis of the core chain is the same. Picking a dentry off the conflict chain and examining it:

crash> dentry.d_parent,d_name.name,d_lockref.count,d_inode,d_subdirs ffff9b867904f500
  d_parent = 0xffff9b9377368240
  d_name.name = 0xffff9b867904f538 "ipv6"-----this is an "ipv6" dentry
  d_lockref.count = 1
  d_inode = 0xffff9bba4a5e14c0
  d_subdirs = {
    next = 0xffff9b867904f950, 
    prev = 0xffff9b867904f950
  }

d_child is at offset 0x90, so 0xffff9b867904f950 minus 0x90 gives 0xffff9b867904f8c0
crash> dentry 0xffff9b867904f8c0
struct dentry {
......
  d_parent = 0xffff9b867904f500, 
  d_name = {
    {
      {
        hash = 1718513507, 
        len = 4
      }, 
      hash_len = 18898382691
    }, 
    name = 0xffff9b867904f8f8 "conf"------Name is conf
  }, 
  d_inode = 0xffff9bba4a5e61a0, 
  d_iname = "conf\000bles_names\000\060\000.2\000\000pvs.(*Han", 
  d_lockref = {
......
        count = 1----------------the reference count is 1, indicating something still references it
......
  }, 
 ......
  d_subdirs = {
    next = 0xffff9b867904fb90, 
    prev = 0xffff9b867904fb90
  }, 
......
}
Since the reference count is 1, continue to dig down:
crash> dentry.d_parent,d_lockref.count,d_name.name,d_subdirs 0xffff9b867904fb00
  d_parent = 0xffff9b867904f8c0
  d_lockref.count = 1
  d_name.name = 0xffff9b867904fb38 "all"
  d_subdirs = {
    next = 0xffff9b867904ef90, 
    prev = 0xffff9b867904ef90
  }
  Further down:
crash> dentry.d_parent,d_lockref.count,d_name.name,d_subdirs,d_flags,d_inode -x 0xffff9b867904ef00
  d_parent = 0xffff9b867904fb00
  d_lockref.count = 0x0-----------------------------Dig until the reference count is 0
  d_name.name = 0xffff9b867904ef38 "disable_ipv6"
  d_subdirs = {
    next = 0xffff9b867904efa0, --------Empty
    prev = 0xffff9b867904efa0
  }
  d_flags = 0x40800ce-------------The following focuses on this
  d_inode = 0xffff9bba4a5e4fb0

You can see that the full path of this ipv6 dentry is ipv6/conf/all/disable_ipv6, consistent with what the probe saw.
The d_flags value is analyzed against the following definitions:

#define DCACHE_FILE_TYPE        0x04000000 /* Other file type */

#define DCACHE_LRU_LIST     0x80000 /* the dentry is on an LRU list */

#define DCACHE_REFERENCED   0x0040  /* Recently used, don't discard. */
#define DCACHE_RCUACCESS    0x0080  /* Entry has ever been RCU-visible */

#define DCACHE_OP_COMPARE   0x0002
#define DCACHE_OP_REVALIDATE    0x0004
#define DCACHE_OP_DELETE    0x0008
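A quick standalone check with the flag values just listed confirms that these bits OR together to exactly the d_flags value 0x40800ce seen in the crash output (a small user-space verification, not kernel code):

#include <stdio.h>

#define DCACHE_OP_COMPARE	0x0002
#define DCACHE_OP_REVALIDATE	0x0004
#define DCACHE_OP_DELETE	0x0008
#define DCACHE_REFERENCED	0x0040
#define DCACHE_RCUACCESS	0x0080
#define DCACHE_LRU_LIST		0x80000
#define DCACHE_FILE_TYPE	0x04000000

int main(void)
{
	unsigned int flags = DCACHE_FILE_TYPE | DCACHE_LRU_LIST |
			     DCACHE_REFERENCED | DCACHE_RCUACCESS |
			     DCACHE_OP_COMPARE | DCACHE_OP_REVALIDATE |
			     DCACHE_OP_DELETE;

	printf("0x%x\n", flags);	/* prints 0x40800ce */
	return 0;
}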

We see that disable_ipv6 has a reference count of 0 but carries the DCACHE_LRU_LIST flag.
According to the following function:

static void dentry_lru_add(struct dentry *dentry)
{
	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST))) {
		spin_lock(&dcache_lru_lock);
		dentry->d_flags |= DCACHE_LRU_LIST;//caq: this flag means the dentry is on an LRU list
		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
		dentry->d_sb->s_nr_dentry_unused++;//caq: dentries on s_dentry_lru are unused and reclaimable
		dentry_stat.nr_unused++;
		spin_unlock(&dcache_lru_lock);
	}
}

This shows the dentries can be released. Because the machine runs online business, we did not dare to use
echo 2 > /proc/sys/vm/drop_caches
so we wrote a module to release them instead. The main code of the module, modeled on shrink_slab, is as follows:

  spin_lock(orig_sb_lock);
  list_for_each_entry(sb, orig_super_blocks, s_list) {
          if (memcmp(&(sb->s_id[0]), "proc", strlen("proc")) ||
              memcmp(sb->s_type->name, "proc", strlen("proc")) ||
              hlist_unhashed(&sb->s_instances) ||
              (sb->s_nr_dentry_unused < NR_DENTRY_UNUSED_LEN))
                  continue;
          sb->s_count++;
          spin_unlock(orig_sb_lock);
          printk("find proc sb=%p\n", sb);
          shrinker = &sb->s_shrink;

          count = shrinker_one(shrinker, &shrink, 1000, 1000);
          printk("shrinker_one count =%lu,sb=%p\n", count, sb);
          spin_lock(orig_sb_lock);//caq: take sb_lock again
          if (sb_proc)
                  __put_super(sb_proc);
          sb_proc = sb;
  }
  if (sb_proc) {
          __put_super(sb_proc);
          spin_unlock(orig_sb_lock);
  } else {
          spin_unlock(orig_sb_lock);
          printk("can't find the special sb\n");
  }

Both conflict chains were indeed released.
For example, on one node before the release:

[3435957.357026] hlist_bl_head=ffffbd9d5a7a6cc0,count=34686
[3435957.357029] count=34686,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3435957.457039] IPVS: Creating netns size=2048 id=873057
[3435957.477742] hlist_bl_head=ffffbd9d429a7498,count=34686
[3435957.477745] count=34686,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
[3435957.549173] hlist_bl_head=ffffbd9d5a7a6cc0,count=34687
[3435957.549176] count=34687,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3435957.667889] hlist_bl_head=ffffbd9d429a7498,count=34687
[3435957.667892] count=34687,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
[3435958.720110] find proc sb=ffff9b647fdd4000-----------------------Start release
[3435959.150764] shrinker_one count =259800,sb=ffff9b647fdd4000------End of release

After the release on this node:

[3436042.407051] hlist_bl_head=ffffbd9d466aed58,count=101
[3436042.407055] count=101,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3436042.501220] IPVS: Creating netns size=2048 id=873159
[3436042.591180] hlist_bl_head=ffffbd9d466aed58,count=102
[3436042.591183] count=102,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3436042.685008] hlist_bl_head=ffffbd9d4e8af728,count=101
[3436042.685011] count=101,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
[3436043.957221] IPVS: Creating netns size=2048 id=873160
[3436044.043860] hlist_bl_head=ffffbd9d466aed58,count=103
[3436044.043863] count=103,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3436044.137400] hlist_bl_head=ffffbd9d4e8af728,count=102
[3436044.137403] count=102,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
[3436044.138384] IPVS: Creating netns size=2048 id=873161
[3436044.226954] hlist_bl_head=ffffbd9d466aed58,count=104
[3436044.226956] count=104,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3436044.321947] hlist_bl_head=ffffbd9d4e8af728,count=103

Two details can be seen above:

1. The hlist was growing before the release and keeps growing after it.

2. After the release, the dentry of net changes, so its position in the hash list changes.

To sum up: traversal of the hotspot chain is slow because the ctl_table_set that snmpd needs to find belongs to init_net, while the ctl_table_sets of the other dentries on the dcache chain do not match it; and the chain grew so long because, while net_namespaces were being destroyed, something was still accessing ipv6/conf/all/disable_ipv6 and core/somaxconn inside them, so these two dentries were kept on the s_dentry_lru of their super_block.
Finally, which calls access these dentries? The trigger is as follows:

pid=16564,task=exe,par_pid=366883,task=dockerd,count=1958,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4,hlist_bl_head=ffffbd9d429a7498
hlist_bl_head=ffffbd9d5a7a6cc0,count=1960

pid=16635,task=runc:[2:INIT],par_pid=16587,task=runc,count=1960,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4,hlist_bl_head=ffffbd9d5a7a6cc0
hlist_bl_head=ffffbd9d429a7498,count=1959

It can be seen that dockerd and runc are what actually trigger the problem. k8s calls docker to create a pause container, but the cni network parameters are not filled in correctly, so the newly created net_namespace is destroyed almost immediately. unregister_net_sysctl_table is called on destruction, but at the same time runc and exe are accessing two dentries inside that net_namespace, which causes those two dentries to be cached on the super_block's s_dentry_lru list; and because memory is plentiful overall, their number keeps growing.
Note that the affected paths are ipv6/conf/all/disable_ipv6 and core/somaxconn; the dentries under the ipv4 path were not being accessed at that moment, so their ctl_table could be cleaned up in time.
The unlucky snmpd keeps walking the affected chain, so its cpu usage stays high; after a manual drop_caches it recovers immediately. Note that drop_caches cannot be used on production machines, since it drives sys time high and hurts latency-sensitive services.

3, Fault recurrence

1. When memory is ample, slab reclaim is never triggered. k8s calls docker to create different net_namespaces, and because the cni parameters are incorrect, each newly created net_namespace is destroyed immediately. If you frequently see the following log in dmesg:

IPVS: Creating netns size=2048 id=866615

then you need to pay attention to the dentry cache.
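A minimal user-space reproduction sketch, under the assumption that repeatedly creating a net namespace, touching the two sysctl paths named above, and destroying the namespace matches the cluster trigger (run as root and watch /proc/sys/fs/dentry-state grow while it runs):

#define _GNU_SOURCE
#include <sched.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
	for (;;) {
		pid_t pid = fork();

		if (pid == 0) {
			/* new net namespace, as a short-lived pause container would get */
			if (unshare(CLONE_NEWNET) == 0) {
				/* the two paths that runc/exe touched in the fault */
				int fd1 = open("/proc/sys/net/ipv6/conf/all/disable_ipv6", O_RDONLY);
				int fd2 = open("/proc/sys/net/core/somaxconn", O_RDONLY);

				if (fd1 >= 0)
					close(fd1);
				if (fd2 >= 0)
					close(fd2);
			}
			_exit(0);	/* the child exits and its netns is torn down */
		}
		if (pid > 0)
			waitpid(pid, NULL, 0);
	}
	return 0;
}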

4, Fault avoidance or resolution

Possible solutions are:

1. Walk each conflict chain of the dentry_hashtable under RCU and raise an alarm when a chain grows beyond a certain length.

2. Limit the number of cached dentries through a proc parameter.

3. Globally, watch /proc/sys/fs/dentry-state (a minimal reader sketch follows this list).

4. Locally, per super_block, read s_nr_dentry_unused and raise an alarm when it exceeds a threshold; for example code, refer to the implementation of the shrink_slab function.

5. Note the distinction from the negative dentry limit: the dentries accumulating here are positive dentries with inodes, as shown above.

6. Hash buckets are used in many places in the kernel. How do we monitor the length of their conflict chains? Scan them with a module, or find somewhere to record the length of each chain.
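As a minimal sketch for option 3, assuming the documented /proc/sys/fs/dentry-state layout where the first two fields are the total and unused dentry counts:

#include <stdio.h>

int main(void)
{
	long nr_dentry = 0, nr_unused = 0;
	FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

	if (!f)
		return 1;
	/* first two fields: total dentries and unused (LRU) dentries */
	if (fscanf(f, "%ld %ld", &nr_dentry, &nr_unused) == 2)
		printf("nr_dentry=%ld nr_unused=%ld\n", nr_dentry, nr_unused);
	fclose(f);
	/* an nr_unused that keeps climbing with no memory pressure is the
	 * warning sign described in this article */
	return 0;
}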

5, Introduction to the author

Anqing
At OPPO Hybrid Cloud, responsible for Linux kernel, container, virtual machine and other virtualization work.


Topics: Back-end