2021-2022-1 20212802 Linux kernel principle and analysis week 7

Posted by Elarion on Fri, 05 Nov 2021 23:38:42 +0100

1, The process by which the Linux kernel creates a new process

1. Relevant knowledge

  • The three core functions of an operating system kernel are process management, memory management and the file system; process management is the most central of them.
  • Linux process states differ from those described in operating system textbooks. For example, the ready state and the running state are both TASK_RUNNING: the flag only says the process is runnable, and whether it is actually running depends on whether it currently holds a CPU.
  • fork is called once but returns twice: it returns the pid of the newly created child in the parent process and 0 in the child process.
  • After fork, the child logically has its own copy of the data, heap and stack, while the code segment is shared by both processes. Linux implements this with copy-on-write: the pages are shared at first, and only when the parent or the child writes to a page is that page really duplicated.
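
The last two points can be observed from user space. The following is a minimal sketch, assuming a Linux/POSIX system; the helper name fork_private_copy is made up for this illustration and is not part of any kernel or library API. The child overwrites a variable and reports it through its exit status, while the parent still sees the old value.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the kernel source): fork once, let the
 * child overwrite a local variable, and report what each side observed.
 * Returns the value the parent still sees, or -1 on error.
 * *child_saw receives the value the child reported via its exit status.
 */
int fork_private_copy(int initial, int child_value, int *child_saw)
{
    int data = initial;          /* lives on the stack, duplicated by fork */
    pid_t pid = fork();          /* one call, two returns */

    if (pid < 0)
        return -1;

    if (pid == 0) {              /* child: fork returned 0 */
        data = child_value;      /* the write lands in the child's own copy */
        _exit(data & 0xff);      /* an exit status carries only 8 bits */
    }

    /* parent: fork returned the child's pid (> 0) */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    *child_saw = WEXITSTATUS(status);
    return data;                 /* unchanged by the child's write */
}
```

Calling fork_private_copy(1, 42, &seen) should leave the parent's value at 1 while seen becomes 42, since the two processes no longer share that stack page after the child's write.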

2. Kernel code analysis

SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
	return do_fork(SIGCHLD, 0, 0, NULL, NULL);
#else
	return -EINVAL;
#endif
}
SYSCALL_DEFINE0(vfork)
{
	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
			0, NULL, NULL);
}
#ifdef __ARCH_WANT_SYS_CLONE
#ifdef CONFIG_CLONE_BACKWARDS
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
		 int __user *, parent_tidptr,
		 int, tls_val,
		 int __user *, child_tidptr)
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
		 int __user *, parent_tidptr,
		 int __user *, child_tidptr,
		 int, tls_val)
#elif defined(CONFIG_CLONE_BACKWARDS3)
SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
		int, stack_size,
		int __user *, parent_tidptr,
		int __user *, child_tidptr,
		int, tls_val)
#else
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
		 int __user *, parent_tidptr,
		 int __user *, child_tidptr,
		 int, tls_val)
#endif
{
	return do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr);
}
#endif

The code above shows that fork, vfork and clone all create a new process by calling do_fork; they differ only in the arguments they pass.
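
One way to check this from user space is to issue the raw clone system call with exactly the arguments that sys_fork passes to do_fork. This is only a sketch under assumptions: it targets Linux, it assumes an architecture whose raw clone takes clone_flags as its first argument (every layout listed above except CONFIG_CLONE_BACKWARDS2), and fork_via_raw_clone is a hypothetical helper name.

```c
#define _GNU_SOURCE              /* exposes syscall() in <unistd.h> */
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a child with clone_flags == SIGCHLD and all
 * other arguments zero, which is what sys_fork passes to do_fork.
 * Returns the child's exit status, or -1 on error.
 */
int fork_via_raw_clone(int child_exit_code)
{
    /* Assumption: clone_flags is the first raw argument on this architecture. */
    long pid = syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL);

    if (pid < 0)
        return -1;

    if (pid == 0)                        /* child: behaves like after fork() */
        _exit(child_exit_code & 0xff);

    /* parent: the child sends SIGCHLD on exit, so a plain waitpid works */
    int status = 0;
    if (waitpid((pid_t)pid, &status, 0) != (pid_t)pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The child only calls _exit here because a raw clone bypasses the bookkeeping that the C library's fork wrapper performs; real programs should keep using fork().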

(1)do_fork

long do_fork(unsigned long clone_flags, unsigned long stack_start,
        unsigned long stack_size, int __user *parent_tidptr,
        int __user *child_tidptr)

First, let's look at the parameters of do_fork():

  • clone_flags: flags that control process creation and select which of the parent's resources are copied or shared.
  • stack_start: the address of the child's user-mode stack.
  • regs: a pointer to the pt_regs structure (when a system call occurs, pt_regs holds the register values that are pushed onto the kernel stack in order). This parameter belongs to older kernel versions and does not appear in the signature shown above.
  • stack_size: the size of the user-mode stack; it is usually unnecessary and set to 0.
  • parent_tidptr and child_tidptr: user-space addresses where the child's thread ID is written for the parent and for the child.

For ease of understanding, the key code is shown below in simplified form:

struct task_struct *p;    //Pointer to the child's process descriptor
  int trace = 0;
  long nr;                  //pid of the child process
  ...
  p = copy_process(clone_flags, stack_start, stack_size,
              child_tidptr, NULL, trace);   //Create the child's process descriptor and the other data structures it needs to run

  if (!IS_ERR(p)) {                          //If copy_process succeeded
        struct completion vfork;             //Completion: lets one execution path wait until another has finished something
        struct pid *pid;
        ...
        pid = get_task_pid(p, PIDTYPE_PID);   //Get the struct pid of the new task
        nr = pid_vnr(pid);                    //Get the numeric pid from the struct pid
        ...
        // If clone_flags contains CLONE_VFORK, store the address of the completion vfork in the vfork_done field of the process descriptor. The completion is only initialized here
        if (clone_flags & CLONE_VFORK) {
            p->vfork_done = &vfork;
            init_completion(&vfork);
            get_task_struct(p);
        }

        wake_up_new_task(p);        //Put the child on the scheduler's run queue so that it gets a chance to run on a CPU

        /* forking complete and child started to run, tell ptracer */
        ...
        // If clone_flags contains CLONE_VFORK, the parent sleeps here until the child calls exec or exits. This is where the actual blocking happens
        if (clone_flags & CLONE_VFORK) {
            if (!wait_for_vfork_done(p, &vfork))
                ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
        }

        put_pid(pid);
    } else {
        nr = PTR_ERR(p);        //Error: return the error code instead
    }
    return nr;               //Return the child's pid (this is why fork returns the child's pid in the parent)
}

do_fork() mainly calls copy_process() to copy the parent's process information, obtains the child's pid, and calls wake_up_new_task() to put the child on the scheduler's run queue so that it can later be given a CPU; it also does some auxiliary work depending on clone_flags. copy_process() contains the main code that builds the new process.
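
The CLONE_VFORK handshake (initialize a completion, start the child, block the parent until the child is done) can be imitated in user space. The sketch below is only an analogy, not kernel code: a pipe stands in for the completion, and wait_for_child_done is a hypothetical name chosen for this illustration. It assumes a POSIX system.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * User-space analogy for the CLONE_VFORK handshake: the pipe's write end
 * plays the role of the completion. The parent blocks in read() until the
 * child's write end is closed, which happens when the child exits.
 * Returns 1 if the parent was released by the child's exit, -1 on error.
 */
int wait_for_child_done(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                  /* child */
        close(fds[0]);
        /* the child's own work would go here */
        _exit(0);                    /* exit closes fds[1]: the "complete" step */
    }

    close(fds[1]);                   /* parent keeps only the read end */
    char byte;
    ssize_t n = read(fds[0], &byte, 1);   /* blocks, like wait_for_vfork_done */
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return n == 0 ? 1 : -1;          /* 0 bytes means EOF: the child is done */
}
```

To also release the parent when the child calls exec, as vfork does, the pipe would have to be created with the close-on-exec flag; that variation is left out to keep the sketch short.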

(2)copy_process

static struct task_struct *copy_process(unsigned long clone_flags,
                    unsigned long stack_start,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace)
{
    int retval;
    struct task_struct *p;
    ...
    retval = security_task_create(clone_flags);//Security check
    ...
    p = dup_task_struct(current);   //Copy the PCB: allocate the child's process descriptor and kernel stack
    ftrace_graph_init_task(p);
    ...
    
    retval = -EAGAIN;
    // Check whether the number of processes for the user exceeds the limit
    if (atomic_read(&p->real_cred->user->processes) >=
            task_rlimit(p, RLIMIT_NPROC)) {
        // Check whether the user has relevant permissions, not necessarily root
        if (p->real_cred->user != INIT_USER &&
            !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
            goto bad_fork_free;
    }
    ...
    // Check whether the number of processes exceeds max_threads, which depends on the size of memory
    if (nr_threads >= max_threads)
        goto bad_fork_cleanup_count;

    if (!try_module_get(task_thread_info(p)->exec_domain->module))
        goto bad_fork_cleanup_count;
    ...
    spin_lock_init(&p->alloc_lock);          //Initialize spin lock
    init_sigpending(&p->pending);           //Initialize pending signal 
    posix_cpu_timers_init(p);               //Initialize CPU timer
    ...
    retval = sched_fork(clone_flags, p);  //Initialize the scheduler-related data of the new process and set its state to TASK_RUNNING, with kernel preemption disabled
    ...
    // Copy or share each of the parent's resources, depending on clone_flags
    shm_init_task(p);
    retval = copy_semundo(clone_flags, p);
    ...
	retval = copy_files(clone_flags, p);
    ...
	retval = copy_fs(clone_flags, p);
    ...
	retval = copy_sighand(clone_flags, p);
    ...
	retval = copy_signal(clone_flags, p);
    ...
	retval = copy_mm(clone_flags, p);
    ...
	retval = copy_namespaces(clone_flags, p);
    ...
	retval = copy_io(clone_flags, p);
    ...
	retval = copy_thread(clone_flags, stack_start, stack_size, p);// Initialize the child's kernel stack
    ...
    //If the pid pointer passed in is not the address of the global init_struct_pid, a new pid must be allocated for the child
    if (pid != &init_struct_pid) {
        retval = -ENOMEM;
        pid = alloc_pid(p->nsproxy->pid_ns_for_children);
        if (!pid)
            goto bad_fork_cleanup_io;
    }

    ...
    p->pid = pid_nr(pid);    //Get the numeric pid from the struct pid
    //If clone_flags contains CLONE_THREAD, the child belongs to the same thread group as the parent
    if (clone_flags & CLONE_THREAD) {
        p->exit_signal = -1;
        p->group_leader = current->group_leader; //The child's group leader is the parent's thread group leader
        p->tgid = current->tgid;       //The child inherits the parent's tgid
    } else {
        if (clone_flags & CLONE_PARENT)
            p->exit_signal = current->group_leader->exit_signal;
        else
            p->exit_signal = (clone_flags & CSIGNAL);
        p->group_leader = p;	      //The child is its own group leader
        p->tgid = p->pid;        //Its thread group id tgid is its own pid
    }

    ...
    
    if (likely(p->pid)) {
        ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);

        init_task_pid(p, PIDTYPE_PID, pid);
        if (thread_group_leader(p)) {
            ...
            // Attach the child to the hash lists of its process group and session
            attach_pid(p, PIDTYPE_PGID);
            attach_pid(p, PIDTYPE_SID);
            __this_cpu_inc(process_counts);
        } else {
            ...
        }
        attach_pid(p, PIDTYPE_PID);
        nr_threads++;     //Increase the system-wide thread count
    }
    ...
    return p;             //Returns the created child process descriptor pointer P
    ...
}

copy_process mainly calls dup_task_struct to copy the current task_struct, performs checks and initialization, sets the process state to TASK_RUNNING, copies the parent's process information, calls copy_thread to initialize the child's kernel stack, and sets the child's pid.
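
The CLONE_THREAD branch of copy_process decides whether the new task joins the parent's thread group or starts its own. Below is a reduced model of that branch, under stated assumptions: struct mini_task and assign_thread_group are hypothetical names, only the fields touched by the branch are kept, and the SK_ constants repeat the values of CSIGNAL, CLONE_PARENT and CLONE_THREAD from the kernel's uapi headers.

```c
/* Values copied from the kernel's uapi <linux/sched.h>; SK_ names are ours. */
#define SK_CSIGNAL       0x000000ffUL  /* mask for the exit signal */
#define SK_CLONE_PARENT  0x00008000UL
#define SK_CLONE_THREAD  0x00010000UL

/* Hypothetical, heavily reduced stand-in for struct task_struct. */
struct mini_task {
    int pid;
    int tgid;
    int exit_signal;
    struct mini_task *group_leader;
};

/*
 * Model of the CLONE_THREAD branch in copy_process().
 * child->pid must already be set, as in the kernel code.
 */
void assign_thread_group(struct mini_task *child,
                         struct mini_task *parent,
                         unsigned long clone_flags)
{
    if (clone_flags & SK_CLONE_THREAD) {
        /* new thread: joins the parent's thread group */
        child->exit_signal = -1;
        child->group_leader = parent->group_leader;
        child->tgid = parent->tgid;
    } else {
        /* new process: becomes the leader of its own group */
        if (clone_flags & SK_CLONE_PARENT)
            child->exit_signal = parent->group_leader->exit_signal;
        else
            child->exit_signal = (int)(clone_flags & SK_CSIGNAL);
        child->group_leader = child;
        child->tgid = child->pid;
    }
}
```

With a parent whose pid and tgid are both 100, a child created with SK_CLONE_THREAD keeps tgid 100, while a child created with only a signal number in the flags gets a tgid equal to its own pid.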

(3)dup_task_struct

static struct task_struct *dup_task_struct(struct task_struct *orig)
{
    struct task_struct *tsk;
    struct thread_info *ti;
    int node = tsk_fork_get_node(orig);
    int err;
    tsk = alloc_task_struct_node(node);    //Allocate a process descriptor for the child
    ...
    ti = alloc_thread_info_node(tsk, node); //Allocate the thread_info area (two pages here): thread_info sits at one end and the rest is the kernel stack
    ...
    err = arch_dup_task_struct(tsk, orig);  //Copy the parent's task_struct
    ...
    tsk->stack = ti;                  // Point the child's stack field at the newly allocated area

    setup_thread_stack(tsk, orig);//Initialize the child's thread_info (copy the parent's thread_info, then point its task pointer at the child's process descriptor)
    ...
    return tsk;               // Returns the newly created process descriptor pointer
    ...
}
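
The layout that dup_task_struct sets up, with thread_info and the kernel stack sharing one aligned two-page area, can be modelled in user space. This is a sketch under assumptions: THREAD_SIZE is taken as 8192 bytes as on 32-bit x86, all demo_ names are hypothetical, and the address-masking lookup reflects how kernels of that era located thread_info from the stack pointer rather than anything shown in the excerpt above.

```c
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192u        /* assumption: two 4 KiB pages */

struct demo_task;                /* hypothetical stand-in for task_struct */

struct demo_thread_info {
    struct demo_task *task;      /* back-pointer to the owning descriptor */
};

struct demo_task {
    void *stack;                 /* lowest address of the stack area */
};

/* thread_info and the kernel stack overlay the same THREAD_SIZE block. */
union demo_thread_union {
    struct demo_thread_info thread_info;
    unsigned long stack[THREAD_SIZE / sizeof(unsigned long)];
};

/* Model of dup_task_struct(): allocate the area and link both directions. */
int demo_dup_task(struct demo_task *tsk)
{
    union demo_thread_union *area = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
    if (!area)
        return -1;
    area->thread_info.task = tsk;    /* thread_info -> task */
    tsk->stack = area;               /* task -> thread_info / stack area */
    return 0;
}

/* Any address inside the area maps back to thread_info by masking. */
struct demo_thread_info *thread_info_from_sp(const void *sp)
{
    uintptr_t base = (uintptr_t)sp & ~((uintptr_t)THREAD_SIZE - 1);
    return (struct demo_thread_info *)base;
}
```

Because the area is aligned to its own size, clearing the low bits of any stack address inside it yields the start of the area, which is where thread_info lives in this model.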

(4)copy_thread

dup_task_struct only allocates the kernel stack for the child; copy_thread is what actually fills it in.

int copy_thread(unsigned long clone_flags, unsigned long sp,
    unsigned long arg, struct task_struct *p)
{

    
    struct pt_regs *childregs = task_pt_regs(p);
    struct task_struct *tsk;
    int err;

    p->thread.sp = (unsigned long) childregs;
    p->thread.sp0 = (unsigned long) (childregs+1);
    memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));

    
    if (unlikely(p->flags & PF_KTHREAD)) {
        /* kernel thread */
        memset(childregs, 0, sizeof(struct pt_regs));
      
        p->thread.ip = (unsigned long) ret_from_kernel_thread; //A kernel thread starts executing at ret_from_kernel_thread
        task_user_gs(p) = __KERNEL_STACK_CANARY;
        childregs->ds = __USER_DS;
        childregs->es = __USER_DS;
        childregs->fs = __KERNEL_PERCPU;
        childregs->bx = sp; /* function */
        childregs->bp = arg;
        childregs->orig_ax = -1;
        childregs->cs = __KERNEL_CS | get_kernel_rpl();
        childregs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
        p->thread.io_bitmap_ptr = NULL;
        return 0;
    }

    
    *childregs = *current_pt_regs();//Copy the parent's saved registers on the kernel stack (the part pushed by SAVE_ALL on system call entry)

    childregs->ax = 0;           //Set the child's eax to 0, which is why fork returns 0 in the child
    ...
    p->thread.ip = (unsigned long) ret_from_fork;//Set ip to ret_from_fork, where the child starts executing
    task_user_gs(p) = get_user_gs(current_pt_regs());
    ...
    return err;
}
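
The user-process path of copy_thread boils down to three steps: copy the parent's saved register frame, zero the return-value register, and record where the child resumes. The model below is a simplified sketch; struct demo_regs, struct demo_thread, demo_ret_from_fork and demo_copy_thread are hypothetical names, and the real struct pt_regs contains many more fields.

```c
/* Hypothetical, reduced register frame; the real struct pt_regs is larger. */
struct demo_regs {
    unsigned long ax;   /* return-value register */
    unsigned long ip;   /* user-space instruction pointer */
    unsigned long sp;   /* user-space stack pointer */
};

/* Hypothetical stand-in for the thread.sp / thread.ip pair in task_struct. */
struct demo_thread {
    struct demo_regs *sp;    /* kernel stack pointer: start of the saved frame */
    void (*ip)(void);        /* where the scheduler resumes the child */
};

/* Placeholder for the kernel's ret_from_fork entry point. */
void demo_ret_from_fork(void)
{
}

/* Model of the user-process path of copy_thread(). */
void demo_copy_thread(struct demo_thread *child,
                      struct demo_regs *childregs,
                      const struct demo_regs *parentregs)
{
    *childregs = *parentregs;        /* same user context as the parent */
    childregs->ax = 0;               /* fork() returns 0 in the child */
    child->sp = childregs;           /* kernel stack starts at the frame */
    child->ip = demo_ret_from_fork;  /* first code the child executes */
}
```

After the call, the child's frame matches the parent's except for ax, which explains why both processes continue at the same user-space instruction but see different return values from fork.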

3. gdb debugging

Set breakpoints at the key functions analyzed above:

Execution first stops at sys_clone and then at do_fork; continue stepping from there:

Next it stops at copy_process and then at copy_thread, where the value of p can be inspected:

Finally, ret_from_fork is traced up to syscall_exit, after which gdb cannot step any further.

2, Summary

Creating a process roughly consists of copying the process descriptor, copying the other process resources one by one (using copy-on-write), allocating the child's kernel stack, and initializing the key information on that kernel stack.
A remaining question is how the three system calls fork, vfork and clone differ. From consulting reference material:

  • fork(): the child gets its own copy of the parent's data segment (made lazily through copy-on-write) and shares the code segment; the execution order of parent and child is not defined;
  • vfork(): the child shares the parent's address space, and the parent is suspended until the child calls exec or exits, so the child runs first;
  • clone(): the flags select which of the parent's resources are copied to the child; resources that are not copied are shared, with the child simply pointing to the parent's data structures.

 

 

 
