Virtual File System VFS

Posted by ihw13 on Sat, 29 Jan 2022 08:00:26 +0100

Virtual File System VFS (Part 2)

9 Directory entry object

VFS treats directories as files, so in the path /bin/vi, both bin and vi are files: bin is a special directory file and vi is a regular file. Each component of the path is represented by an inode object. Even though inodes could represent every component uniformly, VFS frequently needs to perform directory-specific operations such as path name lookup. A path name lookup must resolve each component of the path in turn, checking that the component is valid and then finding the next component.

To make these lookups easier, VFS introduces the concept of a directory entry (dentry). Each dentry represents one specific component of a path. In the previous example, /, bin, and vi are all dentry objects; the first two are directories and the last is a regular file. Note that every component of a path, including a regular file, is a dentry object. Resolving a path and walking its components is tedious and time-consuming, and dentry objects make the process easier.

Dentries can also include mount points. In the path /mnt/cdrom/foo, the components /, mnt, cdrom, and foo are all dentry objects. VFS creates dentry objects on the fly, as needed, when it performs directory operations.
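To make the component-by-component view concrete, the following user-space sketch splits an absolute path into the pieces that would each be represented by a dentry. It is only an illustration: split_path(), MAX_COMPONENTS and MAX_NAME are hypothetical names invented here, and the real kernel lookup code works very differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPONENTS 16   /* hypothetical limit for this sketch */
#define MAX_NAME       64   /* hypothetical maximum component length */

/*
 * Split an absolute path into its components. The root "/" is always
 * reported as the first component. Returns the number of components
 * stored in out, or -1 if the path is not absolute or does not fit.
 */
int split_path(const char *path, char out[][MAX_NAME])
{
	int n = 0;
	const char *p = path;

	assert(out != NULL);
	if (path == NULL || path[0] != '/')
		return -1;

	strcpy(out[n++], "/");          /* the root directory entry */

	while (*p != '\0') {
		size_t len;

		while (*p == '/')           /* skip one or more separators */
			p++;
		if (*p == '\0')
			break;

		len = strcspn(p, "/");      /* length of this component */
		if (n >= MAX_COMPONENTS || len >= MAX_NAME)
			return -1;

		memcpy(out[n], p, len);
		out[n][len] = '\0';
		n++;
		p += len;
	}
	return n;
}
```

Called on /mnt/cdrom/foo, the function reports four components (/, mnt, cdrom, foo), matching the four dentry objects named in the text.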

The dentry object is represented by struct dentry, which is defined in <linux/dcache.h>:

struct dentry {
	atomic_t d_count;							/* Usage count */
	unsigned long d_vfs_flags;					/* Dentry cache flags */
	spinlock_t d_lock;							/* Per-dentry lock */
	struct inode  * d_inode;					/* Associated inode */
	struct list_head d_lru;						/* Unused list */
	struct list_head d_child;					/* List of dentries within the parent directory */
	struct list_head d_subdirs;					/* List of subdirectories */
	struct list_head d_alias;					/* List of alias inodes */
	unsigned long d_time;						/* Revalidate time */
	struct dentry_operations  *d_op;			/* Dentry operations table */
	struct super_block * d_sb;					/* Superblock of the file */
	int d_mounted;								/* Is this dentry a mount point? */
	void * d_fsdata;							/* Filesystem-specific data */
	struct rcu_head d_rcu;						/* RCU locking */
	struct dentry * d_parent;					/* Dentry object of the parent directory */
	struct qstr d_name;							/* Dentry name */
	struct hlist_node d_hash;					/* Entry in the hash table */
	unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* Short file name */
};

Unlike the previous two objects (the superblock and the inode), the dentry object has no corresponding on-disk data structure. VFS creates it on the fly from the string form of a path name. Because the dentry object is not actually stored on disk, struct dentry has no flag that records whether the object has been modified.

A dentry object is in one of three states: used, unused, or negative.

  • A used dentry corresponds to a valid inode (d_inode points to the associated inode) and has one or more users (d_count is positive). A used dentry is in use by VFS and points to valid data, so it cannot be discarded.
  • An unused dentry corresponds to a valid inode (d_inode points to an inode), but VFS is not currently using it (d_count is zero). Because the dentry still points to a valid object, it is kept in the cache in case it is needed again. As long as it has not been reclaimed, it does not have to be recreated later, so path lookups are faster than they would be without the cached entry.
  • A negative dentry has no valid inode (d_inode is NULL), either because the inode was deleted or because the path name was never correct. The dentry is kept anyway so that future lookups of the same path resolve quickly. For example, if a process keeps trying to open a file that does not exist, the cached negative dentry lets each attempt fail quickly. A negative dentry can be destroyed if necessary.
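The three states can be summarized in a small user-space model. The types below are simplified stand-ins, not the kernel definitions, and classify_dentry() is a hypothetical helper written only for this illustration; it assumes that an unused dentry is one whose use count has dropped to zero.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
struct toy_inode {
	unsigned long i_ino;
};

struct toy_dentry {
	int d_count;                 /* usage count */
	struct toy_inode *d_inode;   /* NULL for a negative dentry */
};

enum dentry_state { DENTRY_USED, DENTRY_UNUSED, DENTRY_NEGATIVE };

/* Classify a dentry according to the three states described above. */
enum dentry_state classify_dentry(const struct toy_dentry *d)
{
	assert(d != NULL);

	if (d->d_inode == NULL)      /* no valid inode behind this name */
		return DENTRY_NEGATIVE;

	/* valid inode: used if someone holds a reference, unused otherwise */
	return d->d_count > 0 ? DENTRY_USED : DENTRY_UNUSED;
}
```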

It would be very wasteful if VFS walked every component of a path name and resolved each one into a dentry object, only to throw that work away afterwards. The kernel therefore caches dentry objects in the dentry cache (dcache for short).

The dentry cache consists of three parts:

  • Lists of used dentries
  • A doubly linked least recently used (LRU) list of unused and negative dentry objects
  • A hash table and hash function used to quickly resolve a given path into the associated dentry object

The hash table is represented by the dentry_hashtable array. Each element is a pointer to a list of dentry objects that hash to the same value. The size of the array depends on the amount of physical memory in the system.

The actual hash value is computed by d_hash(), which gives a filesystem the opportunity to supply its own hash function.

The hash table is searched with d_lookup(). If a matching dentry object is found in the dcache, it is returned; otherwise a NULL pointer is returned.
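The idea of a hash table of dentries with a lookup that returns NULL on a miss can be sketched in user space as follows. Everything here is a toy model: struct cached_dentry, toy_hash(), toy_d_add() and toy_d_lookup() are hypothetical names, the table size is fixed rather than derived from physical memory, and the hash function is an arbitrary illustrative choice, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_SIZE 8          /* hypothetical; fixed size for the sketch */

struct cached_dentry {
	const char *name;                 /* component name */
	struct cached_dentry *parent;     /* parent directory entry */
	struct cached_dentry *next;       /* next entry in the same bucket */
};

/* Each slot heads a list of entries that hash to the same value. */
struct cached_dentry *toy_hashtable[TOY_HASH_SIZE];

/* Illustrative hash over the parent pointer and the component name. */
unsigned int toy_hash(const struct cached_dentry *parent, const char *name)
{
	uintptr_t h = (uintptr_t)parent >> 4;

	while (*name != '\0')
		h = h * 31 + (unsigned char)*name++;
	return (unsigned int)(h % TOY_HASH_SIZE);
}

/* Insert an entry at the head of its bucket. */
void toy_d_add(struct cached_dentry *d)
{
	unsigned int bucket;

	assert(d != NULL && d->name != NULL);
	bucket = toy_hash(d->parent, d->name);
	d->next = toy_hashtable[bucket];
	toy_hashtable[bucket] = d;
}

/* Return the cached entry for (parent, name), or NULL on a miss. */
struct cached_dentry *toy_d_lookup(const struct cached_dentry *parent,
				   const char *name)
{
	struct cached_dentry *d;

	for (d = toy_hashtable[toy_hash(parent, name)]; d != NULL; d = d->next)
		if (d->parent == parent && strcmp(d->name, name) == 0)
			return d;
	return NULL;
}
```

Keying the lookup on both the parent and the name means that vi under /bin and a vi under some other directory never collide logically, even if they land in the same bucket.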

10 Directory entry operations

The dentry_operations structure describes all the methods that VFS uses to operate on dentries:

struct dentry_operations {
    /* Determines whether the given dentry object is valid. VFS calls this function whenever it is about to use a dentry from the dcache. Most filesystems consider dentry objects in the dcache to be always valid and set this method to NULL */
	int (*d_revalidate)(struct dentry *dentry, struct nameidata *);
    /* Generates a hash value for the dentry. VFS calls this function whenever it adds a dentry to the hash table */
	int (*d_hash) (struct dentry *, struct qstr *);
    /* Called by VFS to compare the two file names name1 and name2 */
	int (*d_compare) (struct dentry *dentry, struct qstr *name1, struct qstr *name2);
    /* Called by VFS when the dentry's d_count reaches 0. The function must be called with dcache_lock and the dentry's d_lock held */
	int (*d_delete)(struct dentry *);
    /* Called by VFS when the dentry object is about to be freed */
	void (*d_release)(struct dentry *);
    /* Called by VFS when the dentry object loses its associated inode */
	void (*d_iput)(struct dentry *, struct inode *);
};
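As an illustration of why name comparison is a per-filesystem method, the following user-space sketch dispatches through a table of function pointers, with one filesystem comparing names exactly and another ignoring case. The types and names (toy_qstr, toy_dentry_ops, names_match() and so on) are hypothetical, and the convention that the comparison returns 0 on a match is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct toy_qstr {
	const char *name;
	size_t len;
};

struct toy_dentry_ops {
	/* returns 0 when the names match, nonzero otherwise (sketch convention) */
	int (*d_compare)(const struct toy_qstr *a, const struct toy_qstr *b);
};

/* Default behaviour: byte-for-byte comparison. */
int compare_exact(const struct toy_qstr *a, const struct toy_qstr *b)
{
	if (a->len != b->len)
		return 1;
	return memcmp(a->name, b->name, a->len) != 0;
}

/* Filesystem-specific behaviour: ignore ASCII case. */
int compare_nocase(const struct toy_qstr *a, const struct toy_qstr *b)
{
	size_t i;

	if (a->len != b->len)
		return 1;
	for (i = 0; i < a->len; i++)
		if (tolower((unsigned char)a->name[i]) !=
		    tolower((unsigned char)b->name[i]))
			return 1;
	return 0;
}

const struct toy_dentry_ops case_sensitive_ops = { .d_compare = compare_exact };
const struct toy_dentry_ops case_insensitive_ops = { .d_compare = compare_nocase };

/* Returns 1 if the two names match under the given operations table. */
int names_match(const struct toy_dentry_ops *ops, const char *a, const char *b)
{
	struct toy_qstr qa, qb;

	assert(a != NULL && b != NULL);
	qa.name = a;
	qa.len = strlen(a);
	qb.name = b;
	qb.len = strlen(b);

	/* fall back to the exact comparison when no method is supplied */
	if (ops == NULL || ops->d_compare == NULL)
		return compare_exact(&qa, &qb) == 0;
	return ops->d_compare(&qa, &qb) == 0;
}
```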

11 File object

The file object represents a file that has been opened by a process. Processes deal directly with files, not with superblocks, inodes, or dentries.

The file object is the in-memory representation of an open file. The object (not the physical file) is created by the open() system call and destroyed by the close() system call. These and the other file-related calls are actually methods defined in the file operations table. Because several processes can open and work on the same file at the same time, one file may have multiple file objects. A file object only represents a process's view of an open file; it points to a dentry object, which in turn points to the inode, and it is the dentry that actually represents the open file. (Is this the reason why a dentry is created only when the file is accessed?) Although a file may have several file objects, its inode and dentry objects are unique.
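The difference between several file objects for one file and several descriptors for one file object can be observed from user space with standard POSIX calls. The sketch below opens the same temporary file twice and also duplicates the first descriptor; file_object_demo() is a hypothetical helper name, and the /tmp path is an assumption about the test environment. On Linux, the second open() is expected to have its own offset, while the dup() descriptor shares the offset of the original.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read 4 bytes through the first descriptor, then report the offsets
 * seen through a second open() of the same file and through a dup()
 * of the first descriptor. Returns 0 on success, -1 on failure.
 */
int file_object_demo(off_t *reopened_off, off_t *dup_off)
{
	char path[] = "/tmp/vfs_demo_XXXXXX";
	char buf[4];
	int fd, fd_again, fd_dup;
	int ret = -1;

	assert(reopened_off != NULL && dup_off != NULL);

	fd = mkstemp(path);                 /* first open of the file */
	if (fd < 0)
		return -1;
	fd_again = open(path, O_RDONLY);    /* independent second open */
	fd_dup = dup(fd);                   /* second descriptor for the first open */

	if (fd_again >= 0 && fd_dup >= 0 &&
	    write(fd, "abcdefgh", 8) == 8 &&
	    lseek(fd, 0, SEEK_SET) == 0 &&
	    read(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf)) {
		*reopened_off = lseek(fd_again, 0, SEEK_CUR);
		*dup_off = lseek(fd_dup, 0, SEEK_CUR);
		ret = 0;
	}

	if (fd_again >= 0)
		close(fd_again);
	if (fd_dup >= 0)
		close(fd_dup);
	close(fd);
	unlink(path);
	return ret;
}
```

The offset is stored per open file rather than per inode, which is why the second open() still reports offset 0 after the first descriptor has read 4 bytes.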

The file object is represented by struct file, which is defined in <linux/fs.h>:

struct file {
	struct list_head	f_list;							/* List of file objects */
	struct dentry		*f_dentry;						/* Associated dentry object */
	struct vfsmount         *f_vfsmnt;					/* Associated mounted filesystem */
	struct file_operations	*f_op;						/* File operations table */
	atomic_t		f_count;							/* Usage count of the file object */
	unsigned int 		f_flags;						/* Flags specified when the file was opened */
	mode_t			f_mode;								/* File access mode */
	loff_t			f_pos;								/* Current file offset */
	struct fown_struct	f_owner;						/* Owner data for asynchronous I/O via signals */
	struct file_ra_state	f_ra;						/* Read-ahead state */

	unsigned long		f_version;						/* Version number */
	void			*f_security;						/* Security module */

	/* needed for tty driver, and maybe others */
	void			*private_data;						/* tty driver hook */

	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;						/* List of eventpoll links */
	spinlock_t		f_ep_lock;							/* eventpoll lock */
};

Like the dentry object, the file object has no corresponding data on disk, so the structure has no flag indicating whether the object is dirty and needs to be written back to disk. The file object points to its associated dentry object through the f_dentry pointer. The dentry in turn points to the associated inode, which records whether the file itself is dirty.

12 File operations

The methods of the file object are described by the file_operations structure, which is defined in <linux/fs.h>:

struct file_operations {
	struct module *owner;
    /* Updates the file pointer to the given offset. Called via the llseek() system call */
	loff_t (*llseek) (struct file *, loff_t, int);
    /* Reads count bytes from the given file at position offset into buf, then updates the file pointer. Called by the read() system call */
	ssize_t (*read) (struct file *file, char __user *buf, size_t count, loff_t *offset);
    /* Begins an asynchronous read of count bytes into buf from the file described by iocb. Invoked via aio_read() */
	ssize_t (*aio_read) (struct kiocb *iocb, char __user *buf, size_t count, loff_t *offset);
    /* Writes count bytes from buf into the given file at position offset, then updates the file pointer. Called by the write() system call */
	ssize_t (*write) (struct file *, const char __user *buf, size_t count, loff_t *offset);
    /* Begins an asynchronous write of count bytes from buf into the file described by iocb. Invoked via aio_write() */
	ssize_t (*aio_write) (struct kiocb *iocb, const char __user *buf, size_t count, loff_t *offset);
    /* Returns the next directory entry in a directory listing. Called by the readdir() system call */
	int (*readdir) (struct file *, void *, filldir_t);
    /* Sleeps, waiting for activity on the given file. Called by the poll() system call */
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
    /* Sends a command and argument pair to a device. Used when the file is an open device node. Called by the ioctl() system call */
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
    /* A variant of ioctl() used by 32-bit programs on 64-bit systems. It is designed to be safe for 32-bit callers on a 64-bit architecture and performs any necessary size conversions */
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    /* Maps the given file into the given address space. Called by the mmap() system call */
	int (*mmap) (struct file *, struct vm_area_struct *);
    /* Creates a new file object and links it to the corresponding inode object. Called by the open() system call */
	int (*open) (struct inode *, struct file *);
    /* Called by VFS whenever the reference count of an open file decreases. Its purpose is filesystem dependent */
	int (*flush) (struct file *);
    /* Called by VFS when the last remaining reference to the file is destroyed */
	int (*release) (struct inode *, struct file *);
    /* Writes all cached data for the given file back to disk. Called by the fsync() system call */
	int (*fsync) (struct file *, struct dentry *, int datasync);
    /* Writes all cached data for the file described by iocb back to disk. Invoked via aio_fsync() */
	int (*aio_fsync) (struct kiocb *, int datasync);
    /* Enables or disables signal notification of asynchronous I/O */
	int (*fasync) (int, struct file *, int);
    /* Manipulates a file lock on the given file */
	int (*lock) (struct file *, int, struct file_lock *);
    /* Reads from the given file into the count buffers described by vector, then advances the file offset. Called by the readv() system call */
	ssize_t (*readv) (struct file *file, const struct iovec *vector, unsigned long count, loff_t *offset);
    /* Writes the count buffers described by vector into the given file, then advances the file offset. Called by the writev() system call */
	ssize_t (*writev) (struct file *file, const struct iovec *vector, unsigned long count, loff_t *offset);
    /* Copies data from one file to another. The copy is performed entirely in the kernel, avoiding unnecessary copies to user space. Called by the sendfile() system call */
	ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void __user *);
    /* Used to send data from one file to another */
	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    /* Gets unused address space in which to map the given file */
	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
};

A filesystem can provide its own implementation of each operation, or use a generic implementation where one exists.
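The pattern of a per-filesystem operations table with an optional generic implementation can be sketched in user space like this. All names here (toy_file, toy_file_ops, toy_generic_read(), toy_vfs_read() and so on) are hypothetical and the in-memory "file" is only a string; the sketch shows the dispatch idea, not the kernel's actual read path.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

struct toy_file_ops {
	long (*read)(struct toy_file *f, char *buf, size_t count);
};

struct toy_file {
	const struct toy_file_ops *f_op;   /* operations table */
	const char *data;                  /* backing data for the sketch */
	size_t size;                       /* length of data */
	size_t f_pos;                      /* current offset */
};

/* Generic read that any toy filesystem may reuse. */
long toy_generic_read(struct toy_file *f, char *buf, size_t count)
{
	size_t left = f->size - f->f_pos;

	if (count > left)
		count = left;
	memcpy(buf, f->data + f->f_pos, count);
	f->f_pos += count;                 /* advance the file pointer */
	return (long)count;
}

/* A filesystem-specific read: reuses the generic one, then upper-cases. */
long toy_upper_read(struct toy_file *f, char *buf, size_t count)
{
	long n = toy_generic_read(f, buf, count);
	long i;

	for (i = 0; i < n; i++)
		buf[i] = (char)toupper((unsigned char)buf[i]);
	return n;
}

const struct toy_file_ops toy_plain_ops = { .read = toy_generic_read };
const struct toy_file_ops toy_upper_ops = { .read = toy_upper_read };
const struct toy_file_ops toy_empty_ops = { .read = NULL };

/* Dispatch through the table; fail if the method is not provided. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t count)
{
	assert(f != NULL && f->f_op != NULL);
	if (f->f_op->read == NULL)
		return -1;
	return f->f_op->read(f, buf, count);
}
```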

13 Data structures associated with filesystems

In addition to the objects above, the kernel uses other standard data structures to manage data related to filesystems. The first is file_system_type, which describes a specific type of filesystem. The second is vfsmount, which describes a mounted instance of a filesystem.

struct file_system_type {
	const char *name;								/* File system name */
	int fs_flags;									/* File system type flag */
	struct super_block *(*get_sb) (struct file_system_type *, int,
				       const char *, void *);		/* Read superblock from disk */
	void (*kill_sb) (struct super_block *);			/* Terminate access to superblock */
	struct module *owner;							/* File system module */
	struct file_system_type * next;					/* Next file system type in the linked list */
	struct list_head fs_supers;						/* Super block object linked list */
};

The get_sb() function reads the superblock from disk and populates the in-memory superblock object when the filesystem is mounted. The remaining fields describe the properties of the filesystem.

There is only one file_system_type per filesystem, regardless of whether the filesystem is mounted at all or how many instances of it are mounted. When the filesystem is actually mounted, a vfsmount structure is created at the mount point. This structure represents a specific mounted instance of the filesystem.

struct vfsmount {
	struct list_head mnt_hash;					/* Hash table list */
	struct vfsmount *mnt_parent;				/* Parent filesystem */
	struct dentry *mnt_mountpoint;				/* Dentry of the mount point */
	struct dentry *mnt_root;					/* Dentry of the root of this filesystem */
	struct super_block *mnt_sb;					/* Superblock of this filesystem */
	struct list_head mnt_mounts;				/* List of child mounts */
	struct list_head mnt_child;					/* Entry in the parent's list of child mounts */
	int mnt_flags;								/* Mount flags */
	/* 4 bytes hole on 64bits arches */
	const char *mnt_devname;					/* Device file name */
	struct list_head mnt_list;					/* List of descriptors */
	struct list_head mnt_expire;				/* Entry in the expiry list */
	struct list_head mnt_share;					/* Entry in the shared mounts list */
	struct list_head mnt_slave_list;			/* List of slave mounts */
	struct list_head mnt_slave;					/* Entry in the slave list */
	struct vfsmount *mnt_master;				/* Master of this slave mount */
	struct mnt_namespace *mnt_ns;				/* Associated namespace */
	int mnt_id;									/* Mount identifier */
	int mnt_group_id;							/* Peer group identifier */
	/*
	 * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
	 * to let these frequently modified fields in a separate cache line
	 * (so that reads of mnt_flags wont ping-pong on SMP machines)
	 */
	atomic_t mnt_count;							/* Usage count */
	int mnt_expiry_mark;						/* True if marked for expiration */
	int mnt_pinned;								/* Pinned count */
	int mnt_ghosts;								/* Ghosts count */
	/*
	 * This value is not stable unless all of the mnt_writers[] spinlocks
	 * are held, and all mnt_writer[]s on this mount have 0 as their ->count
	 */
	atomic_t __mnt_writers;						/* Writers count */
};

The mnt_flags field of vfsmount stores the flags specified at mount time. For example, MNT_NODEV forbids access to device files on this filesystem, and MNT_NOEXEC forbids the execution of binaries on this filesystem.
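Mount flags are typically tested as bits in an integer. The sketch below shows that pattern in user space. The TOY_ constants and helper functions are hypothetical, and the bit values are chosen only for illustration; they should not be read as the kernel's definitions.

```c
#include <assert.h>

/* Illustrative flag bits for this sketch; the kernel defines its own values. */
#define TOY_MNT_NODEV  0x01
#define TOY_MNT_NOEXEC 0x02

/* Returns 1 if executing binaries from the mount is allowed, else 0. */
int toy_may_exec(int mnt_flags)
{
	assert(mnt_flags >= 0);
	return (mnt_flags & TOY_MNT_NOEXEC) == 0;
}

/* Returns 1 if device files on the mount may be accessed, else 0. */
int toy_may_open_device(int mnt_flags)
{
	assert(mnt_flags >= 0);
	return (mnt_flags & TOY_MNT_NODEV) == 0;
}
```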

14 Data structures associated with a process

Each process in the system has its own list of open files, root filesystem, current working directory, mount points, and so on. Three data structures tie the VFS layer to the processes in the system: files_struct, fs_struct, and mnt_namespace.

files_struct is pointed to by the files field of the process descriptor. It contains all per-process information about open files and file descriptors:

struct files_struct {
  /*
   * read mostly part
   */
	atomic_t count;											/* Usage count of structure */
	struct fdtable *fdt;									/* Pointer to other fd tables */
	struct fdtable fdtab;									/* Base fd table */
  /*
   * written part on a separate cache line in SMP
   */
	spinlock_t file_lock ____cacheline_aligned_in_smp;		/* Per-file lock */
	int next_fd;											/* Cache of the next available fd */
	struct embedded_fd_set close_on_exec_init;				/* List of close-on-exec fds */
	struct embedded_fd_set open_fds_init;					/* List of open fds */
	struct file * fd_array[NR_OPEN_DEFAULT];				/* Default array of file objects */
};

The fd_array array holds pointers to the open file objects. NR_OPEN_DEFAULT equals BITS_PER_LONG, which is 64 on a 64-bit architecture, so the array has room for 64 file objects. If a process opens more than 64 file objects, the kernel allocates a new array and points fdt at it.
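The switch from the embedded array to a separately allocated one can be modelled in user space as follows. This is a simplified sketch under stated assumptions: the toy_ types and functions are hypothetical, the table simply doubles in size until the requested descriptor fits, and none of the locking or RCU handling of the real kernel is shown.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NR_OPEN_DEFAULT ((int)(sizeof(long) * CHAR_BIT))

struct toy_fdtable {
	int max_fds;       /* capacity of the fd array */
	void **fd;         /* array of "file object" pointers */
};

struct toy_files {
	struct toy_fdtable *fdt;                 /* table currently in use */
	struct toy_fdtable fdtab;                /* embedded base table */
	void *fd_array[TOY_NR_OPEN_DEFAULT];     /* embedded default array */
};

void toy_files_init(struct toy_files *files)
{
	memset(files->fd_array, 0, sizeof(files->fd_array));
	files->fdtab.max_fds = TOY_NR_OPEN_DEFAULT;
	files->fdtab.fd = files->fd_array;
	files->fdt = &files->fdtab;              /* start with the embedded table */
}

/* Grow the table so that index want fits; returns 0 or -1. */
int toy_expand(struct toy_files *files, int want)
{
	struct toy_fdtable *old = files->fdt, *grown;
	int new_max = old->max_fds;

	while (new_max <= want)
		new_max *= 2;
	grown = malloc(sizeof(*grown));
	if (grown == NULL)
		return -1;
	grown->fd = calloc((size_t)new_max, sizeof(void *));
	if (grown->fd == NULL) {
		free(grown);
		return -1;
	}
	memcpy(grown->fd, old->fd, (size_t)old->max_fds * sizeof(void *));
	grown->max_fds = new_max;
	if (old != &files->fdtab) {              /* free a previously grown table */
		free(old->fd);
		free(old);
	}
	files->fdt = grown;
	return 0;
}

/* Store a file pointer at descriptor fd, growing the table if needed. */
int toy_fd_install(struct toy_files *files, int fd, void *file)
{
	assert(files != NULL);
	if (fd < 0)
		return -1;
	if (fd >= files->fdt->max_fds && toy_expand(files, fd) < 0)
		return -1;
	files->fdt->fd[fd] = file;
	return 0;
}

void toy_files_release(struct toy_files *files)
{
	if (files->fdt != &files->fdtab) {
		free(files->fdt->fd);
		free(files->fdt);
		files->fdt = &files->fdtab;
	}
}
```

Embedding a small default array means that a process with few open files never needs a separate allocation for its descriptor table.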

fs_struct is pointed to by the fs field of the process descriptor. It contains filesystem information related to the process:

struct fs_struct {
	int users;				/* Number of users */
	rwlock_t lock;			/* Lock protecting the structure */
	int umask;				/* umask */
	int in_exec;			/* Currently executing a file */
	struct path root, pwd;	/* Root directory and current working directory */
};

mnt_namespace is pointed to by the mnt_namespace field of the process descriptor. It enables each process to have its own view of the filesystems mounted on the system:

struct mnt_namespace {
	atomic_t		count;			/* Usage count of structure */
	struct vfsmount *	root;		/* Mount point object in the root directory */
	struct list_head	list;		/* Linked list of installation points */
	wait_queue_head_t poll;			/* Waiting queue for polling */
	int event;						/* Event count */
};

The list field is a doubly linked list of the mounted filesystems that make up the namespace. These data structures are all linked from the process descriptor. For most processes, the descriptor points to its own files_struct and fs_struct. Processes created with the clone flag CLONE_FILES or CLONE_FS, however, share these structures, so several process descriptors may point to the same files_struct or fs_struct. Each structure maintains a count field as a reference count, which prevents the structure from being destroyed while a process is still using it.
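The sharing and reference counting described here can be modelled in a few lines of user-space C. The toy_ names are hypothetical and the share_files argument merely stands in for the effect of CLONE_FILES; this is a sketch of the counting idea only, with no locking.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_files {
	int count;       /* number of tasks sharing this structure */
	int nr_open;     /* placeholder payload */
};

struct toy_task {
	struct toy_files *files;
};

int toy_task_init(struct toy_task *t)
{
	assert(t != NULL);
	t->files = malloc(sizeof(*t->files));
	if (t->files == NULL)
		return -1;
	t->files->count = 1;
	t->files->nr_open = 0;
	return 0;
}

/* Create a child that either shares the parent's structure or copies it. */
int toy_clone(const struct toy_task *parent, struct toy_task *child,
	      int share_files)
{
	if (share_files) {
		parent->files->count++;            /* one more user */
		child->files = parent->files;
		return 0;
	}
	child->files = malloc(sizeof(*child->files));
	if (child->files == NULL)
		return -1;
	*child->files = *parent->files;        /* private copy */
	child->files->count = 1;
	return 0;
}

/* Drop the task's reference; returns 1 if the structure was freed. */
int toy_exit(struct toy_task *t)
{
	struct toy_files *f = t->files;

	t->files = NULL;
	if (--f->count == 0) {
		free(f);
		return 1;
	}
	return 0;
}
```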

By default, all processes share the same namespace. A process is given its own copy of the namespace structure only when the CLONE_NEWNS flag is specified in the clone() call. Because most processes do not supply this flag, they all inherit the namespace of their parent, so most systems have only one namespace. The CLONE_NEWNS flag is what makes separate per-process namespaces available.

Topics: Linux kernel