Linux Namespace Starter Series: Namespace API

Posted by jonabomer on Sun, 29 Mar 2020 05:51:03 +0200

Linux namespaces are a kernel-level mechanism for isolating environments. More formally, a namespace wraps a global system resource in an abstraction so that the processes inside the namespace appear to have their own isolated instance of that resource. The technology attracted little attention at first, but the rise of containers brought it back into view.

There are six types of Linux namespaces:

Classification      System call parameter   Related kernel versions
Mount namespaces    CLONE_NEWNS             Linux 2.4.19
UTS namespaces      CLONE_NEWUTS            Linux 2.6.19
IPC namespaces      CLONE_NEWIPC            Linux 2.6.19
PID namespaces      CLONE_NEWPID            Linux 2.6.24
Network namespaces  CLONE_NEWNET            Started in Linux 2.6.24, completed in Linux 2.6.29
User namespaces     CLONE_NEWUSER           Started in Linux 2.6.23, completed in Linux 3.8

The namespace API consists of three system calls and a set of /proc files, all of which are described in this article. To tell a system call which namespace types to operate on, pass the CLONE_NEW* constants (CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS) in its flags argument. Several constants can be combined with the | (bitwise OR) operator.

In brief, the three system calls do the following:

  • clone(): Creates a new process (it is also the system call used to implement threads) and isolates it in new namespaces according to the flags above.
  • unshare(): Detaches the calling process from a namespace by moving it into a newly created one.
  • setns(): Adds the calling process to an existing namespace.

The sections below show how each of them works.

1. clone()

The prototype of clone() is as follows:

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);
  • child_func: The function that the child process runs.
  • child_stack: The stack space used by the child process.
  • flags: Which CLONE_* flags to apply.
  • arg: The user-supplied argument passed to child_func.

clone() is similar to fork() in that it duplicates the current process, but clone() gives finer-grained control (through flags) over which resources are shared with the child, including virtual memory, open file descriptors, and semaphores. When a CLONE_NEW* flag is specified, a namespace of the corresponding type is created and the new process becomes a member of it.

The clone() prototype above is not the raw system call but a wrapper around it. Inside the kernel, the system call is implemented by the function do_fork(), which has the following form:

long do_fork(unsigned long clone_flags,
	      unsigned long stack_start,
	      unsigned long stack_size,
	      int __user *parent_tidptr,
	      int __user *child_tidptr)

Here clone_flags carries the flags described above.

Here's an example:

/* demo_uts_namespaces.c

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Demonstrate the operation of UTS namespaces.
*/
#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Modify host name in new UTS namespace */

    if (sethostname(arg, strlen(arg)) == -1)
        errExit("sethostname");

    /* Get and display hostname */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in child:  %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation--for example, another
       process might join the namespace. */
     
    sleep(100);

    return 0;           /* Terminates child */
}

/* Define a stack for clone, stack size 1M */
#define STACK_SIZE (1024 * 1024) 

static char child_stack[STACK_SIZE];

int
main(int argc, char *argv[])
{
    pid_t child_pid;
    struct utsname uts;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Call clone() to create a child in a new UTS namespace;
       the child starts executing in the user-defined function childFunc() */

    child_pid = clone(childFunc,
                    child_stack + STACK_SIZE,   /* The stack grows downward, so
                                                   pass a pointer to its top */
                    CLONE_NEWUTS | SIGCHLD, argv[1]);
    if (child_pid == -1)
        errExit("clone");
    printf("PID of child created by clone() is %ld\n", (long) child_pid);

    /* Parent falls through to here */

    sleep(1);           /* Give the child time to change its hostname */

    /* Display the hostname in the parent's UTS namespace; it differs
       from the hostname in the child's UTS namespace */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for the child to terminate */
        errExit("waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}

The program creates a UTS namespace by calling clone() with the CLONE_NEWUTS flag. A UTS namespace isolates two system identifiers, the host name and the NIS domain name, which are set with the sethostname() and setdomainname() system calls respectively and retrieved with uname().

Some key parts of the program are explained below (error checking will be omitted for simplicity).

The program takes one command-line argument. It creates a child process that runs in a new UTS namespace and changes the host name there to the value given on the command line.

The first key part of the main program creates the child process by calling clone():

child_pid = clone(childFunc, 
                  child_stack + STACK_SIZE,   /* Points to start of 
                                                 downwardly growing stack */ 
                  CLONE_NEWUTS | SIGCHLD, argv[1]);

printf("PID of child created by clone() is %ld\n", (long) child_pid);

The child process starts executing in the user-defined function childFunc(), which receives the last argument of clone() (argv[1]) as its own argument. Because flags contains CLONE_NEWUTS, the child runs in a newly created UTS namespace.

The main process then sleeps for a moment to give the child time to change the host name in its UTS namespace. It then calls uname() to retrieve the host name in its own UTS namespace and displays it:

sleep(1);           /* Give child time to change its hostname */

uname(&uts);
printf("uts.nodename in parent: %s\n", uts.nodename);

Meanwhile, the function childFunc() executed by the child process created by clone() first changes the host name to the value provided in the command line parameter, then retrieves and displays the modified host name:

sethostname(arg, strlen(arg));
    
uname(&uts);
printf("uts.nodename in child:  %s\n", uts.nodename);

The child also sleeps for a while before exiting, which keeps the new UTS namespace alive and gives us a chance to experiment with it later.

Run the program to confirm that the parent and child processes are in different UTS namespaces:

$ su                   # Privileges are required to create a UTS namespace
Password: 
# uname -n
antero
# ./demo_uts_namespaces bizarro
PID of child created by clone() is 27514
uts.nodename in child:  bizarro
uts.nodename in parent: antero

Creating any namespace other than a user namespace requires privileges, or more precisely the Linux capability CAP_SYS_ADMIN. In the case of UTS namespaces, this prevents set-user-ID programs from misbehaving because they are fooled by an unexpected host name. If you are not familiar with Linux capabilities, see my earlier article, Introduction to Linux Capabilities: Concepts.

2. /proc files

Each process has a /proc/PID/ns directory that contains one file per namespace type, for example user for the user namespace. Starting with kernel 3.8, each file in this directory is a special symbolic link of the form $namespace:[$namespace-inode-number]: the first part is the namespace type, and the number identifies the namespace instance. These links serve as handles for performing certain operations on the namespace associated with the process.

$ ls -l /proc/$$/ns         # $$ denotes the PID of the current shell
total 0
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 net -> net:[4026531956]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 pid -> pid:[4026531836]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 user -> user:[4026531837]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 uts -> uts:[4026531838]

One use of these symbolic links is to check whether two processes are in the same namespace. If the links of both processes show the same inode number, the processes are in the same namespace; otherwise they are in different namespaces. The files these links point to are special and cannot be accessed directly. They live in a file system called nsfs, which is not visible to users. The stat() system call returns the inode number in the st_ino field of the structure it fills in. In a shell, the stat command (which calls stat() internally) shows the inode information of the file a link points to:

$ stat -L /proc/$$/ns/net
  File: /proc/3232/ns/net
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: 4h/4d	Inode: 4026531956  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-01-17 15:45:23.783304900 +0800
Modify: 2020-01-17 15:45:23.783304900 +0800
Change: 2020-01-17 15:45:23.783304900 +0800
 Birth: -

These symbolic links have another use. If we open one of these files, the namespace is not destroyed as long as the file descriptor stays open, even after all processes in the namespace have terminated. The same effect can be achieved by bind mounting the symbolic link somewhere else in the file system:

$ touch ~/uts
$ mount --bind /proc/27514/ns/uts ~/uts

3. setns()

A process can join an existing namespace with the setns() system call. Its prototype is as follows:

int setns(int fd, int nstype);

More precisely, setns() detaches the calling process from one instance of a particular namespace type and reassociates it with another instance of the same type.

  • fd is a file descriptor for the namespace to join, obtained by opening one of the symbolic links or a file that has been bind mounted to one of the links.
  • nstype lets the caller check the type of the namespace that fd refers to. It can be set to one of the CLONE_NEW* constants mentioned earlier, or to 0 to skip the check. Passing 0 is useful when the caller already knows the namespace type or does not care about it.

Combining setns() with execve() yields a simple but useful tool: a program that joins a specific namespace and then executes a command in it. Here is the example:

/* ns_exec.c 

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Join a namespace and execute a command in the namespace
*/
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

int
main(int argc, char *argv[])
{
    int fd;

    if (argc < 3) {
        fprintf(stderr, "%s /proc/PID/ns/FILE cmd [arg...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY);   /* Get the file descriptor of the namespace you want to join */
    if (fd == -1)
        errExit("open");

    if (setns(fd, 0) == -1)         /* Join the namespace */
        errExit("setns");

    execvp(argv[2], &argv[2]);      /* Execute commands in the joined namespace */
    errExit("execvp");
}

The program takes two or more command-line arguments. The first is the path of a namespace symbolic link (or of a file bind mounted to one of these links). The second is the name of the program to execute in the namespace that the link refers to; any remaining arguments are passed to that program. The key steps are as follows:

fd = open(argv[1], O_RDONLY);   /* Get the file descriptor of the namespace you want to join */

setns(fd, 0);                   /* Join the namespace */

execvp(argv[2], &argv[2]);      /* Execute commands in the joined namespace */

Recall that we bind mounted the UTS namespace created by demo_uts_namespaces at ~/uts. Using the program from this example, a new process can run a shell in that UTS namespace:

$ ./ns_exec ~/uts /bin/bash     # ~/uts is bind mounted to /proc/27514/ns/uts
My PID is: 28788

Verify that the new shell is in the same UTS namespace as the child process created by demo_uts_namespaces:

$ hostname
bizarro
$ readlink /proc/27514/ns/uts
uts:[4026532338]
$ readlink /proc/$$/ns/uts      # $$ denotes the PID of the current shell
uts:[4026532338]

In earlier kernel versions, setns() could not be used to join mount, PID, or user namespaces. Starting with kernel 3.8, setns() supports joining every namespace type.

The util-linux package provides the nsenter command, which runs a newly created process inside specified namespaces. You name the target process on the command line (the -t option) together with the namespace types to enter; nsenter then opens the corresponding symbolic links under /proc, calls setns() to move itself into those namespaces, and calls clone() to run the specified executable. We can use strace to see how it works:

# strace nsenter -t 27242 -i -m -n -p -u /bin/bash
execve("/usr/bin/nsenter", ["nsenter", "-t", "27242", "-i", "-m", "-n", "-p", "-u", "/bin/bash"], [/* 21 vars */]) = 0
............
............
open("/proc/27242/ns/ipc", O_RDONLY)    = 3
open("/proc/27242/ns/uts", O_RDONLY)    = 4
open("/proc/27242/ns/net", O_RDONLY)    = 5
open("/proc/27242/ns/pid", O_RDONLY)    = 6
open("/proc/27242/ns/mnt", O_RDONLY)    = 7
setns(3, CLONE_NEWIPC)                  = 0
close(3)                                = 0
setns(4, CLONE_NEWUTS)                  = 0
close(4)                                = 0
setns(5, CLONE_NEWNET)                  = 0
close(5)                                = 0
setns(6, CLONE_NEWPID)                  = 0
close(6)                                = 0
setns(7, CLONE_NEWNS)                   = 0
close(7)                                = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4deb1faad0) = 4968

4. unshare()

The last system call to be introduced is unshare(), which has the following prototype:

int unshare(int flags);

unshare() is similar to clone(), but it operates on the calling process instead of creating a new one: it first creates the new namespaces requested by the CLONE_NEW* bits in flags and then makes the caller a member of them. The net effect is that the caller leaves its current namespace and joins a new one.

The unshare command that ships with Linux is built on the unshare() system call and is used as follows:

$ unshare [options] program [arguments]

The options select the namespace types to create.

The core of the unshare command is as follows:

/* Initialize 'flags' according to the command-line options */

unshare(flags);

/* Now execute 'program' with 'arguments'; 'optind' is the index
   of the next command-line argument after options */

execvp(argv[optind], &argv[optind]);

The complete implementation of the unshare command is as follows:

/* unshare.c 

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   A simple implementation of the unshare(1) command: unshare
   namespaces and execute a command.
*/

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] program [arg...]\n", pname);
    fprintf(stderr, "Options can be:\n");
    fprintf(stderr, "    -i   unshare IPC namespace\n");
    fprintf(stderr, "    -m   unshare mount namespace\n");
    fprintf(stderr, "    -n   unshare network namespace\n");
    fprintf(stderr, "    -p   unshare PID namespace\n");
    fprintf(stderr, "    -u   unshare UTS namespace\n");
    fprintf(stderr, "    -U   unshare user namespace\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt;

    flags = 0;

    while ((opt = getopt(argc, argv, "imnpuU")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    if (optind >= argc)
        usage(argv[0]);

    if (unshare(flags) == -1)
        errExit("unshare");

    execvp(argv[optind], &argv[optind]);  
    errExit("execvp");
}

Here we use the unshare.c program to run a shell in a new mount namespace:

$ echo $$                             # Displays the PID of the current shell
8490
$ cat /proc/8490/mounts | grep mq     # Display a mount point in the current namespace
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
$ readlink /proc/8490/ns/mnt          # Display the ID of the current namespace 
mnt:[4026531840]
$ ./unshare -m /bin/bash              # Execute a new shell in the newly created mount namespace
$ readlink /proc/$$/ns/mnt            # Display the ID of the new namespace 
mnt:[4026532325]

Comparing the output of the two readlink commands shows that the two shells are in different mount namespaces. Now change a mount point in the new namespace and check whether the mount points in the two namespaces differ:

$ umount /dev/mqueue                  # Remove the mount point in the new namespace
$ cat /proc/$$/mounts | grep mq       # Verify that the mount point is gone
$ cat /proc/8490/mounts | grep mq     # Check whether the mount point still exists in the original namespace
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0

You can see that the mount point /dev/mqueue has disappeared from the new namespace but still exists in the original namespace.

5. Summary

This article examined each component of the namespace API and showed how the components fit together. Subsequent articles will look at the individual namespaces in more depth, especially the PID namespace and the user namespace.

