Linux system calls II -- the call mechanism

Posted by xnor82 on Sun, 09 Jan 2022 09:08:57 +0100

0 background

The previous post gave an overview of system calls; interested readers can find it under "Linux system calls I -- overview and introduction". This post focuses on the system call mechanism itself.

Whether it is a GUI (graphical user interface), an application program, or a command-line interface, it is ultimately implemented through system calls.
When we open a file and write to it, or allocate memory, the process switches to kernel mode even though we do not notice it. The kernel then checks the call and, if it passes, performs the requested operation and allocates the corresponding resources.
This mechanism is the system call. A user-mode process initiates the call, switches to kernel mode, completes the work there, returns to user mode, and continues executing. It is also the only legitimate way for user mode to actively switch to kernel mode (exceptions and interrupts switch passively).

Let's take a rough look at the diagram first. Take fork() as an example to see the system call process:

Over the history of Linux, the x86 system call implementation has evolved from int / iret to sysenter / sysexit and then to syscall / sysret. Implementing system calls through software interrupt 0x80 is comparatively slow, so wherever the hardware supports them, sysenter / sysexit or syscall / sysret are used instead. For learning and understanding, however, the software interrupt is the natural starting point, so this post explains the system call mechanism mainly by analyzing software interrupt 0x80.

1 0x80 interrupt

1.1 interrupt processing function entry_INT80_32

**The Linux kernel registers an interrupt handler named entry_INT80_32 for interrupt number 128 (0x80).** Let's look at the code that actually does this.
PS: this analysis is based on Linux kernel 4.9.76 and glibc 2.25.90
File: arch/x86/kernel/traps.c

void __init trap_init(void)
{
 /* ..... other code ... */
/* CONFIG_IA32_EMULATION allows 32-bit code to run on a 64-bit kernel.
   This analysis follows the CONFIG_X86_32 branch further below instead.
#ifdef CONFIG_IA32_EMULATION 
	set_system_intr_gate(IA32_SYSCALL_VECTOR, entry_INT80_compat);
	set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif
*/
#ifdef CONFIG_X86_32
	set_system_intr_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
	set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif
 /* ..... other code ... */
}

IA32_SYSCALL_VECTOR is defined as 0x80 in the header arch/x86/include/asm/irq_vectors.h, and set_system_intr_gate installs the system call gate in the interrupt descriptor table (IDT).
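To make the gate idea more concrete, here is a minimal user-space sketch of how a 32-bit IDT gate descriptor is packed. The names demo_gate and demo_pack_gate are made up for this post, and the handler address and selector passed in are placeholders; the byte layout follows the x86 32-bit interrupt-gate format and is not the kernel's own helper code.

```c
#include <stdint.h>

/* Hypothetical mirror of an 8-byte, 32-bit IDT gate descriptor. */
struct demo_gate {
    uint16_t offset_low;   /* handler address, bits 15..0 */
    uint16_t selector;     /* code segment selector loaded into cs */
    uint8_t  zero;         /* reserved, always 0 */
    uint8_t  type_attr;    /* present bit | DPL | gate type */
    uint16_t offset_high;  /* handler address, bits 31..16 */
};

#define DEMO_GATE_INTERRUPT 0x0E  /* 32-bit interrupt gate */
#define DEMO_GATE_PRESENT   0x80  /* descriptor is valid */

/* Pack a gate; dpl is the lowest privilege level allowed to use "int n". */
static struct demo_gate demo_pack_gate(uint32_t handler, uint16_t selector,
                                       unsigned dpl)
{
    struct demo_gate g;

    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.zero        = 0;
    g.type_attr   = (uint8_t)(DEMO_GATE_PRESENT | ((dpl & 3) << 5) |
                              DEMO_GATE_INTERRUPT);
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

Calling demo_pack_gate(addr, sel, 3) yields a type_attr byte of 0xEE. The DPL of 3 is the important part: it is what allows user-mode code to execute int $0x80 without triggering a general protection fault.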

In other words, the Linux kernel reserves software interrupt 0x80 as the way for user programs to enter the kernel. The hardware uses the interrupt vector number to find the corresponding entry, the interrupt descriptor, in the IDT. But how does the kernel know which system call to execute? The user-side code (normally the C library wrapper) places the system call number in the eax register before executing int $0x80, and the system call's arguments go in the remaining general-purpose registers (ebx, ecx, edx, esi, edi, ebp, in that order).
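The register convention can be sketched as a small wrapper. This is only an illustration: demo_raw_syscall3 is a hypothetical name, the int $0x80 path is compiled only for 32-bit x86 builds (for example with gcc -m32), and on any other target the sketch falls back to the C library's syscall() so that it still compiles and runs.

```c
#include <sys/syscall.h>  /* SYS_getpid and friends */
#include <unistd.h>       /* syscall(), getpid() */

/*
 * Hypothetical helper: issue a system call with up to three arguments.
 * On i386 the number goes in eax and the arguments in ebx, ecx, edx.
 */
static long demo_raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__i386__)
    long ret;

    /* Raw entry: on failure the kernel returns -errno directly in eax. */
    __asm__ __volatile__("int $0x80"
                         : "=a"(ret)
                         : "0"(nr), "b"(a1), "c"(a2), "d"(a3)
                         : "memory");
    return ret;
#else
    /* Assumption: not a 32-bit x86 build, so let libc pick the entry path.
     * Note that syscall() reports errors as -1 plus errno instead. */
    return syscall(nr, a1, a2, a3);
#endif
}
```

For instance, demo_raw_syscall3(SYS_getpid, 0, 0, 0) should return the same value as getpid().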

Therefore, when interrupt 0x80 arrives, execution starts at entry_INT80_32, which is defined in arch/x86/entry/entry_32.S

ENTRY(entry_INT80_32)
	ASM_CLAC
	pushl	%eax			/* pt_regs->orig_ax */
	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest */

	/*
	 * User mode is traced as though IRQs are on, and the interrupt gate
	 * turned them off.
	 */
	TRACE_IRQS_OFF

	movl	%esp, %eax
	call	do_int80_syscall_32
.Lsyscall_32_done:

pushl %eax pushes the system call number held in eax onto the stack (it becomes pt_regs->orig_ax), and SAVE_ALL then pushes the values of the other registers, which hold the arguments of the call. SAVE_ALL is defined as follows:

.macro SAVE_ALL pt_regs_ax=%eax
    cld
    PUSH_GS
    pushl   %fs
    pushl   %es
    pushl   %ds
    pushl   \pt_regs_ax
    pushl   %ebp
    pushl   %edi
    pushl   %esi
    pushl   %edx
    pushl   %ecx
    pushl   %ebx
    movl    $(__USER_DS), %edx
    movl    %edx, %ds
    movl    %edx, %es
    movl    $(__KERNEL_PERCPU), %edx
    movl    %edx, %fs
    SET_KERNEL_GS %edx
.endm

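Once the registers are saved, TRACE_IRQS_OFF tells the kernel's IRQ-state tracing that interrupts are off (the interrupt gate itself already disabled them). movl %esp, %eax then places the current stack pointer, which now points at the saved registers, in eax as the first argument, and call do_int80_syscall_32 invokes do_int80_syscall_32. That is the definition of the interrupt handler entry_INT80_32.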

1.2 interrupt processing function do_syscall_32_irqs_on

Continuing the analysis: the function called next, do_int80_syscall_32, is defined in arch/x86/entry/common.c

/* Handles int $0x80 */
__visible void do_int80_syscall_32(struct pt_regs *regs)
{
	enter_from_user_mode();/* Called on entry from user mode with IRQs off*/
	local_irq_enable();/* unconditionally enable interrupts */
	do_syscall_32_irqs_on(regs);
}

The English comments on the first two calls are mine; I won't look at those calls more closely. The main focus is the function do_syscall_32_irqs_on, also defined in arch/x86/entry/common.c.

static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
{
    struct thread_info *ti = current_thread_info();
    unsigned int nr = (unsigned int)regs->orig_ax;

#ifdef CONFIG_IA32_EMULATION
    current->thread.status |= TS_COMPAT;
#endif

    if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
        /*
         * Subtlety here: if ptrace pokes something larger than
         * 2^32-1 into orig_ax, this truncates it.  This may or
         * may not be necessary, but it matches the old asm
         * behavior.
         */
        nr = syscall_trace_enter(regs);
    }

    if (likely(nr < IA32_NR_syscalls)) {
        /*
         * It's possible that a 32-bit syscall implementation
         * takes a 64-bit parameter but nonetheless assumes that
         * the high bits are zero.  Make sure we zero-extend all
         * of the args.
         */
        regs->ax = ia32_sys_call_table[nr](
            (unsigned int)regs->bx, (unsigned int)regs->cx,
            (unsigned int)regs->dx, (unsigned int)regs->si,
            (unsigned int)regs->di, (unsigned int)regs->bp);
    }

    syscall_return_slowpath(regs);
}

The regs parameter of this function points at the register values that entry_INT80_32 pushed onto the stack one after another. The structure is defined in arch/x86/include/asm/ptrace.h; the __i386__ variant is as follows:

#ifdef __i386__
struct pt_regs {
	unsigned long bx;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
	unsigned long bp;
	unsigned long ax;
	unsigned long ds;
	unsigned long es;
	unsigned long fs;
	unsigned long gs;
	unsigned long orig_ax;
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
};
#else /* __i386__ */
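The field order is simply the push order read from the lowest stack address upward. A user-space mirror of the structure makes the offsets checkable at compile time; demo_pt_regs32 is a hypothetical name, and fixed 32-bit fields are used so the numbers match i386 even when this is compiled on a 64-bit machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the i386 struct pt_regs shown above. */
struct demo_pt_regs32 {
    /* Pushed by SAVE_ALL; the last push (ebx) ends up at the lowest address. */
    uint32_t bx, cx, dx, si, di, bp;
    uint32_t ax;            /* slot preset to -ENOSYS by SAVE_ALL */
    uint32_t ds, es, fs, gs;
    /* Pushed by "pushl %eax" at the top of entry_INT80_32. */
    uint32_t orig_ax;
    /* Pushed by the CPU when the interrupt is taken from user mode. */
    uint32_t ip, cs, flags, sp, ss;
};

/* Eleven 4-byte slots precede orig_ax, and the hardware frame follows it. */
_Static_assert(offsetof(struct demo_pt_regs32, bx) == 0, "ebx is on top");
_Static_assert(offsetof(struct demo_pt_regs32, orig_ax) == 44, "orig_ax slot");
_Static_assert(offsetof(struct demo_pt_regs32, ip) == 48, "hardware frame");
_Static_assert(sizeof(struct demo_pt_regs32) == 68, "17 slots in total");
```

This is why passing the stack pointer in eax is enough: the C code can treat that address as a pointer to the whole structure.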

The function first extracts the system call number nr, then uses it to fetch the handler from the system call table (ia32_sys_call_table) and calls that handler with the arguments saved from the registers: ia32_sys_call_table[nr]((unsigned int)regs->bx, (unsigned int)regs->cx, (unsigned int)regs->dx, (unsigned int)regs->si, (unsigned int)regs->di, (unsigned int)regs->bp). The return value is stored in regs->ax, and syscall_return_slowpath(regs) is called afterwards.
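The dispatch logic can be modelled in user space. Everything prefixed demo_ below is invented for illustration: a three-entry table stands in for ia32_sys_call_table, and the registers are 64 bits wide so that the effect of the (unsigned int) casts is visible.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_NR_SYSCALLS 3   /* stand-in for IA32_NR_syscalls */

typedef long (*demo_sys_call_ptr_t)(unsigned long, unsigned long,
                                    unsigned long, unsigned long,
                                    unsigned long, unsigned long);

/* Hypothetical, trimmed-down register save area. */
struct demo_regs {
    uint64_t bx, cx, dx, si, di, bp;
    int64_t  ax;        /* return value slot */
    uint64_t orig_ax;   /* system call number */
};

static long demo_sys_ni(unsigned long a, unsigned long b, unsigned long c,
                        unsigned long d, unsigned long e, unsigned long f)
{
    (void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
    return -ENOSYS;     /* "not implemented" */
}

static long demo_sys_add(unsigned long a, unsigned long b, unsigned long c,
                         unsigned long d, unsigned long e, unsigned long f)
{
    (void)c; (void)d; (void)e; (void)f;
    return (long)(a + b);
}

static const demo_sys_call_ptr_t demo_table[DEMO_NR_SYSCALLS] = {
    demo_sys_ni, demo_sys_add, demo_sys_ni,
};

static void demo_dispatch(struct demo_regs *regs)
{
    unsigned int nr = (unsigned int)regs->orig_ax;

    /* Stands in for SAVE_ALL pt_regs_ax=$-ENOSYS done by the entry code. */
    regs->ax = -ENOSYS;

    if (nr < DEMO_NR_SYSCALLS) {
        /* The casts drop the high 32 bits; the call zero-extends again. */
        regs->ax = demo_table[nr](
            (unsigned int)regs->bx, (unsigned int)regs->cx,
            (unsigned int)regs->dx, (unsigned int)regs->si,
            (unsigned int)regs->di, (unsigned int)regs->bp);
    }
}
```

Two things are worth noticing: an out-of-range number never indexes the table and simply leaves -ENOSYS in ax, and any garbage in the upper half of a 64-bit register is discarded before the handler sees it.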

1.3 system call table ia32_sys_call_table

The other key piece is the system call table ia32_sys_call_table. In the previous step, the handler was fetched from this table using the system call number.
ia32_sys_call_table is defined in arch/x86/entry/syscall_32.c

/* System call table for i386. */

#include <linux/linkage.h>
#include <linux/sys.h>
#include <linux/cache.h>
#include <asm/asm-offsets.h>
#include <asm/syscall.h>

#define __SYSCALL_I386(nr, sym, qual) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) ;
#include <asm/syscalls_32.h>
#undef __SYSCALL_I386

#define __SYSCALL_I386(nr, sym, qual) [nr] = sym,

extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);

__visible const sys_call_ptr_t ia32_sys_call_table[__NR_syscall_compat_max+1] = {
	/*
	 * Smells like a compiler bug -- it doesn't work
	 * when the & below is removed.
	 */
	[0 ... __NR_syscall_compat_max] = &sys_ni_syscall,
#include <asm/syscalls_32.h>
};

That is all the code in syscall_32.c. The individual table entries are not written out there; they come from the included header. However, syscalls_32.h cannot be found in an unbuilt kernel tree: the header only appears once the kernel has been compiled.
syscalls_32.h is generated from syscall_32.tbl.
File: syscall_32.tbl

#
# 32-bit system call numbers and entry vectors
#
# The format is:
# <number> <abi> <name> <entry point> <compat entry point>
#
# The abi is always "i386" for this file.
#
0	i386	restart_syscall		sys_restart_syscall
1	i386	exit			sys_exit
2	i386	fork			sys_fork			sys_fork
3	i386	read			sys_read
4	i386	write			sys_write
5	i386	open			sys_open			compat_sys_open
...
...
381	i386	pkey_alloc		sys_pkey_alloc
382	i386	pkey_free		sys_pkey_free

The numbers run from 0 to 382, so the table has 383 slots. After a build, the generated arch/x86/include/generated/asm/syscalls_32.h looks like this:

__SYSCALL_I386(0, sys_restart_syscall, )
__SYSCALL_I386(1, sys_exit, )
#ifdef CONFIG_X86_32
__SYSCALL_I386(2, sys_fork, )
#else
__SYSCALL_I386(2, sys_fork, )
#endif
__SYSCALL_I386(3, sys_read, )
__SYSCALL_I386(4, sys_write, )
#ifdef CONFIG_X86_32
__SYSCALL_I386(5, sys_open, )
#else
__SYSCALL_I386(5, compat_sys_open, )
...
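The mapping from one table row to the generated lines can be imitated with a few lines of C. This is only a sketch modelled on the excerpts shown above; demo_emit_entry is a hypothetical helper, the real generator is a shell script, and details such as entry-point qualifiers are ignored here.

```c
#include <stdio.h>

/*
 * Hypothetical helper: turn one syscall_32.tbl row into header text.
 * Returns the number of characters written, or -1 if the row is not an
 * entry (for example a comment line).
 */
static int demo_emit_entry(const char *row, char *out, size_t outlen)
{
    int nr;
    char abi[16], name[64], entry[64], compat[64];
    int fields = sscanf(row, "%d %15s %63s %63s %63s",
                        &nr, abi, name, entry, compat);

    if (fields == 4)    /* no compat entry point: one unconditional line */
        return snprintf(out, outlen, "__SYSCALL_I386(%d, %s, )\n", nr, entry);

    if (fields == 5)    /* compat entry point: pick one at build time */
        return snprintf(out, outlen,
                        "#ifdef CONFIG_X86_32\n"
                        "__SYSCALL_I386(%d, %s, )\n"
                        "#else\n"
                        "__SYSCALL_I386(%d, %s, )\n"
                        "#endif\n",
                        nr, entry, nr, compat);

    return -1;
}
```

Feeding it the row for read produces a single __SYSCALL_I386 line, while the row for open produces the #ifdef CONFIG_X86_32 pair that selects between sys_open and compat_sys_open.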

This shows that syscalls_32.h is generated at build time. Interested readers can look at the script arch/x86/entry/syscalls/syscalltbl.sh and the table arch/x86/entry/syscalls/syscall_32.tbl.
After macro expansion, our system call table is effectively as follows:

__visible const sys_call_ptr_t ia32_sys_call_table[__NR_syscall_compat_max+1] = {
   [0] = sys_restart_syscall,
   [1] = sys_exit,
   [2] = sys_fork,
   [3] = sys_read,
   [4] = sys_write,
   [5] = sys_open,
   ...
};
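The trick of expanding the same list twice with two different macro definitions, once for prototypes and once for table entries, can be tried in isolation. In this sketch the list macro DEMO_SYSCALL_LIST stands in for the generated header (a real #include cannot be shown self-contained), all demo_ names are invented, and the [0 ... N] range designator is a GNU C extension supported by GCC and Clang.

```c
#include <errno.h>

/* Stand-in for the generated syscalls_32.h: one X(nr, sym) line per call. */
#define DEMO_SYSCALL_LIST(X)  \
    X(0, demo_sys_restart)    \
    X(1, demo_sys_exit)       \
    X(3, demo_sys_read)

#define DEMO_NR_MAX 4

typedef long (*demo_call_t)(void);

/* First expansion: emit a prototype for every listed handler. */
#define DEMO_DECLARE(nr, sym) static long sym(void);
DEMO_SYSCALL_LIST(DEMO_DECLARE)
#undef DEMO_DECLARE

static long demo_sys_ni(void) { return -ENOSYS; }

/* Second expansion: emit a designated initializer for every handler. */
#define DEMO_ENTRY(nr, sym) [nr] = sym,
static const demo_call_t demo_call_table[DEMO_NR_MAX + 1] = {
    /* Default every slot first; later initializers override it. */
    [0 ... DEMO_NR_MAX] = demo_sys_ni,
    DEMO_SYSCALL_LIST(DEMO_ENTRY)
};
#undef DEMO_ENTRY

/* Dummy handlers with recognisable return values. */
static long demo_sys_restart(void) { return 100; }
static long demo_sys_exit(void)    { return 101; }
static long demo_sys_read(void)    { return 103; }
```

Slots that the list does not mention (2 and 4 here) keep the default handler, which is how unassigned system call numbers end up returning -ENOSYS.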

1.4 system call sys_open as an example

Here is a simple system call issued through the software interrupt, written as AT&T-syntax inline assembly. It is not rigorous (at the very least it ought to call exit), but it compiles as is.

int main(int argc, char *argv[])
{
  asm ("movl $0x05, %eax\n" /* Set system call number */
       "movl $1, %ebx\n"    /* Set system call parameters */
       "movl $2, %ecx\n"    /* Set system call parameters */
       "int $0x80"          /* Enter system call interrupt */
       );
}

movl $0x05, %eax sets the call number to 0x05, so the call that gets invoked here is sys_open, which is defined in fs/open.c. sys_open takes three parameters:

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
	if (force_o_largefile())
		flags |= O_LARGEFILE;

	return do_sys_open(AT_FDCWD, filename, flags, mode);
}

SYSCALL_DEFINE3 and the related macros are defined as follows:

#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...)                \
        SYSCALL_METADATA(sname, x, __VA_ARGS__)       \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

#define __SYSCALL_DEFINEx(x, name, ...)                                 \
        asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))       \
                __attribute__((alias(__stringify(SyS##name))));         \
                                                                        \
        static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__));  \
                                                                        \
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__));      \
                                                                        \
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))       \
        {                                                               \
                long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__));  \
                __MAP(x,__SC_TEST,__VA_ARGS__);                         \
                __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));       \
                return ret;                                             \
        }                                                               \
                                                                        \
        static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))

SYSCALL_METADATA stores basic information about the call for tracing tools (the kernel must be built with CONFIG_FTRACE_SYSCALLS).

__SYSCALL_DEFINEx assembles the function itself: the name is pasted together as sys##_##open, the parameter list is assembled through __SC_DECL, and the definition after expansion is:

asmlinkage long sys_open(const char __user * filename, int flags, umode_t mode)
{
    if (force_o_largefile())
        flags |= O_LARGEFILE;

    return do_sys_open(AT_FDCWD, filename, flags, mode);
}
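The name pasting and the long-typed wrapper can be tried outside the kernel with a cut-down macro. DEMO_SYSCALL_DEFINE2 below is a simplified stand-in written for this post: it handles exactly two parameters and leaves out the alias, metadata, and checking parts of the real macros.

```c
/*
 * Simplified stand-in for the kernel's SYSCALL_DEFINEx machinery.
 * It emits:
 *   - a forward declaration of the typed body DEMO_SYSC_<name>
 *   - a wrapper demo_sys_<name> that takes register-width longs and
 *     casts them back to the declared parameter types
 *   - the header of the typed body, so a { ... } block can follow
 */
#define DEMO_SYSCALL_DEFINE2(name, t1, a1, t2, a2)          \
    static inline long DEMO_SYSC_##name(t1 a1, t2 a2);      \
    long demo_sys_##name(long a1, long a2)                  \
    {                                                       \
        return DEMO_SYSC_##name((t1)a1, (t2)a2);            \
    }                                                       \
    static inline long DEMO_SYSC_##name(t1 a1, t2 a2)

/* Usage mirrors the kernel style: macro line first, body right after. */
DEMO_SYSCALL_DEFINE2(add, int, x, int, y)
{
    return (long)x + (long)y;
}
```

After preprocessing, the entry point callers see is demo_sys_add(long, long), while the body the author wrote works with the declared int parameters.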

After expansion, do_sys_open(AT_FDCWD, filename, flags, mode) is thus wrapped as the three-parameter sys_open. System calls with other numbers of parameters are wrapped in the same way by the corresponding SYSCALL_DEFINEn macros.

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
	struct open_flags op;
	int fd = build_open_flags(flags, mode, &op);
	struct filename *tmp;

	if (fd)
		return fd;

	tmp = getname(filename);
	if (IS_ERR(tmp))
		return PTR_ERR(tmp);

	fd = get_unused_fd_flags(flags);
	if (fd >= 0) {
		struct file *f = do_filp_open(dfd, tmp, &op);
		if (IS_ERR(f)) {
			put_unused_fd(fd);
			fd = PTR_ERR(f);
		} else {
			fsnotify_open(f);
			fd_install(fd, f);
		}
	}
	putname(tmp);
	return fd;
}

getname copies the file name from user space into kernel space; get_unused_fd_flags then obtains an unused file descriptor, do_filp_open creates the struct file, and fd_install binds fd to the struct file (task_struct->files->fdt[fd] = file). Finally fd is returned.
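do_sys_open leans on the kernel convention that a function returning a pointer can also return an error: a negative errno is encoded in the pointer value, and IS_ERR / PTR_ERR test and decode it. A user-space re-creation of that convention is sketched below; the demo_ names make it clear these are not the kernel's own definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Error codes occupy the top 4095 values of the address range. */
#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *demo_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Decode the errno value back out of an error pointer. */
static inline long demo_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True if the pointer value lies in the reserved error range. */
static inline int demo_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}
```

With this in place, a caller can write the same pattern as above: check demo_is_err(f), and on failure return demo_ptr_err(f) as the negative error code.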

fd is returned to do_syscall_32_irqs_on, which stores it in regs->ax (eax). Control then returns to entry_INT80_32, which continues and finally executes INTERRUPT_RETURN. INTERRUPT_RETURN is defined as iret in arch/x86/include/asm/irqflags.h; together with the register-restore code that precedes it, it restores the context previously pushed on the stack and returns to user mode. The system call is complete.

assembly

In the mainstream system call library (glibc), int 0x80 is used only when the hardware does not support fast system calls (sysenter / syscall), and current hardware does support them. To see int 0x80 in action we therefore write the assembly by hand:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(){
    char * filename = "/tmp/test";
    char * buffer = malloc(80);
    memset(buffer, 0, 80);
    int count;
    __asm__ __volatile__("movl $0x5, %%eax\n\t"
                         "movl %1, %%ebx\n\t"
                         "movl $0, %%ecx\n\t"
                         "movl $0664, %%edx\n\t"
                         "int $0x80\n\t"
                         "movl %%eax, %%ebx\n\t"
                         "movl $0x3, %%eax\n\t"
                         "movl %2, %%ecx\n\t"
                         "movl $80, %%edx\n\t"
                         "int $0x80\n\t"
                         "movl %%eax, %0\n\t"
                         :"=m"(count)
                         :"m"(filename), "m"(buffer)  /* movl needs memory operands here */
                         :"%eax", "%ebx", "%ecx", "%edx", "memory"); /* read() writes to buffer */
    printf("%d\n", count);
    printf("%s\n", buffer);
    free(buffer);
}

This code first invokes the open system call through int 0x80 to obtain fd (returned in eax), then passes fd to read in order to read the file's contents. Curiously, if the buffer lives on the stack (char buffer[80]), the read call fails; it succeeds only when buffer is a global variable or allocated on the heap. Suggestions on the cause are welcome.
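A likely explanation, assuming the program was built as a 64-bit binary: int 0x80 enters the 32-bit path analysed above, where do_syscall_32_irqs_on casts every argument to unsigned int. A pointer therefore only survives if it fits in 32 bits. On x86-64 the stack usually sits far above 4 GiB, whereas globals and the heap of a non-PIE executable typically sit in low memory. The check itself is easy to sketch; demo_survives_int80 is a hypothetical name and the addresses used with it are made-up examples.

```c
#include <stdint.h>

/*
 * Returns 1 if a 64-bit address keeps its value after the
 * (unsigned int) cast applied to each int 0x80 argument,
 * and 0 if its high bits would be lost.
 */
static int demo_survives_int80(uint64_t addr)
{
    return (uint64_t)(uint32_t)addr == addr;
}
```

A low address such as 0x00601040 passes the check, while a typical stack address such as 0x7ffc12345678 does not, so the kernel would see a different (and most likely invalid) buffer address. Building the program with gcc -m32 avoids the truncation altogether.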

Let's run it:

[hezz coding_test 15:37]$touch /tmp/test
[hezz coding_test 15:37]$vi /tmp/test 
[hezz coding_test 15:37]$gcc syscall_asm.c
[hezz coding_test 15:37]$./a.out 
4
111

summary

The traditional system call (int 0x80) is implemented through the interrupt / exception mechanism. Executing the int instruction raises a trap. The hardware looks up the gate in the interrupt descriptor table, switches to the kernel stack (tss.ss0:tss.esp0), and pushes ss / esp / eflags / cs / eip onto the kernel stack in turn (a software interrupt pushes no error code). It then uses the segment selector stored in the gate to find the segment descriptor in the GDT / LDT, loads the selector into cs, and loads the gate's offset into eip. On return, iret pops the eip / cs / eflags / esp / ss pushed earlier and thereby restores the register context of the user-mode caller.

Topics: Linux kernel