Linux resource control [notes]

Posted by jbalanski on Wed, 02 Mar 2022 05:27:47 +0100

I Resource isolation

What is a Namespace?

A namespace is a mechanism the Linux kernel uses to isolate kernel resources.

It encapsulates and isolates global system resources.
Processes in different namespaces each have their own independent view of those resources.
Changing a system resource inside a namespace affects only the processes in that namespace;
it has no effect on processes in other namespaces.

Resources in one namespace are invisible to processes in other namespaces.
From the operating system's perspective, multiple processes with the same PID can exist;
because they belong to different namespaces, they do not conflict.
From the user's perspective, only the resources in the user's own namespace are visible.

For example, the ps command lists only the processes in its own namespace.

The kernel first implemented six namespaces.

In order of introduction, they are listed below.

For each namespace, the list gives the global system resource it isolates and the resulting isolation effect in a container context.

1. Mount namespaces

File system mount points
Each container can see a different file system hierarchy

2. UTS namespaces

nodename and domainname
Each container can have its own hostname and domain name

3. PID namespaces

Process ID number space
Each process in a PID namespace has its own independent PID, and each container can have a root process with PID 1. Containers can also be migrated between hosts, because process IDs inside the namespace are independent of the host. As a result, each process in a container has two PIDs: its PID inside the container and its PID on the host.

4. IPC namespaces

Each container has its own System V IPC objects and POSIX message queue file system, so only processes in the same IPC namespace can communicate with each other through these mechanisms

5. Network namespaces

Network related system resources
Each container has its own network devices, IP addresses, IP routing tables, /proc/net directory, port numbers, etc. This lets the same application in several containers on one host each bind to port 80 of its own container.

6. User namespaces

User and group ID space
The user and group IDs of a process inside a user namespace can differ from those on the host. Each container can have its own user and group IDs, and an unprivileged user on the host can be a privileged user inside the user namespace.

What is the purpose of Namespace?

Currently, the Linux kernel provides seven types of namespaces (the six above plus the cgroup namespace), which isolate the following:

Cgroup: cgroup root directory
IPC: System V IPC / POSIX message queue
Network: network device / protocol stack / port
Mount: mount point
PID: process ID
User: user and group ID
UTS: hostname and NIS domain name

View the namespace to which a process belongs

Get the id of an nginx process

[root@galaxy-node-master 2675]#  ps auxfww | grep nginx
root      10091  0.0  0.0 112824   988 pts/0    S+   10:26   0:00          \_ grep --color=auto nginx:
root       2425  0.0  0.0  20104  3644 ?        Ss   09:40   0:00      |   |   \_ nginx: master process /opt/gitlab/embedded/sbin/nginx -p /var/opt/gitlab/nginx
polkitd    2452  0.0  0.0  24356  5668 ?        S    09:40   0:00      |   |       \_ nginx: worker process
polkitd    2454  0.0  0.0  24496  6384 ?        S    09:40   0:00      |   |       \_ nginx: worker process
polkitd    2455  0.0  0.0  24356  5668 ?        S    09:40   0:00      |   |       \_ nginx: worker process
polkitd    2456  0.0  0.0  24356  5668 ?        S    09:40   0:00      |   |       \_ nginx: worker process
polkitd    2457  0.0  0.0  20320  1444 ?        S    09:40   0:00      |   |       \_ nginx: cache manager process

We choose process 2452.

View the namespaces of process 2452

[root@galaxy-node-master 2675]# ls /proc/2452/ns/
ipc  mnt  net  pid  user  uts

The types of these namespace files are symbolic links

[root@galaxy-node-master 2675]# ll /proc/2452/ns/
total 0
lrwxrwxrwx. 1 polkitd input 0 3 February 10:00 ipc -> ipc:[4026532754]
lrwxrwxrwx. 1 polkitd input 0 3 February 10:00 mnt -> mnt:[4026532752]
lrwxrwxrwx. 1 polkitd input 0 3 February 10:00 net -> net:[4026532757]
lrwxrwxrwx. 1 polkitd input 0 3 February 10:00 pid -> pid:[4026532755]
lrwxrwxrwx. 1 polkitd input 0 3 February 10:00 user -> user:[4026531837]
lrwxrwxrwx. 1 polkitd input 0 3 February 10:00 uts -> uts:[4026532753]

Each link target has the format xxx:[inode number].
xxx is the namespace type.
The inode number identifies a namespace: two processes whose links show the same inode number are in the same namespace.
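
The xxx:[inode number] format described above can be taken apart with a few lines of shell. This is only a sketch: ns_type and ns_inode are hypothetical helper names, not existing tools, and they assume the link text has exactly the shape shown in the listing.

```shell
# Hypothetical helpers: split a namespace link target such as
# "mnt:[4026532752]" into its type and its inode number.
ns_type() {
    printf '%s\n' "$1" | sed -n 's/^\([a-z_]*\):\[[0-9]*]$/\1/p'
}

ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9]*\)]$/\1/p'
}

# Example with one of the link targets from the listing above.
ns_type  'mnt:[4026532752]'
ns_inode 'mnt:[4026532752]'
```

On a live system the argument would typically come from readlink, for example ns_inode "$(readlink /proc/self/ns/mnt)".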

View mnt namespace information of a process

The mount point information of mnt namespace is recorded in the following three files

[root@galaxy-node-master 2675]# ll /proc/2452/mount*
-r--r--r--. 1 polkitd input 0 3 February 10:00 /proc/2452/mountinfo
-r--r--r--. 1 polkitd input 0 3 February 10:00 /proc/2452/mounts
-r--------. 1 polkitd input 0 3 February 10:00 /proc/2452/mountstats

The mnt namespace isolates mount points.
The file system layout in each mnt namespace can be modified independently, without affecting the others.

We do an experiment to verify mnt namespace:

First create two directories and create one file each as follows:

[root@localhost ~]# mkdir /root/hosta
[root@localhost ~]# touch /root/hosta/a.txt
[root@localhost ~]# mkdir /root/hostb
[root@localhost ~]# touch /root/hostb/b.txt

View the current contents of the /mnt directory:

[root@localhost ~]# ls /mnt

Open two new terminals:

Perform the following operations in terminal a:

Create new mount namespace and uts namespace and run bash

[root@localhost ~]# unshare --mount --uts bash

Change the host name to hosta

[root@localhost ~]# hostname hosta && exec bash

View the inode numbers of the mnt and uts namespaces of the current process

$$: the PID of the current shell

[root@hosta ~]# readlink /proc/$$/ns/{mnt,uts}

Bind-mount the hosta directory on /mnt

[root@hosta ~]# mount --bind hosta/ /mnt/
[root@hosta ~]# ls /mnt

Go back to the earliest localhost terminal and check:

[root@localhost ~]# ls /mnt

The contents of the /mnt directory have not changed, which shows that the mount namespaces of the localhost terminal and the hosta terminal are isolated from each other

Perform the following operations in terminal b:

Create new mount namespace and uts namespace and run bash

[root@localhost ~]# unshare --mount --uts bash

Change the host name to hostb

[root@localhost ~]# hostname hostb && exec bash

View the inode numbers of the mnt and uts namespaces of the current process

$$: the PID of the current shell

[root@hostb ~]# readlink /proc/$$/ns/{mnt,uts}

Bind-mount the hostb directory on /mnt

[root@hostb ~]# mount --bind hostb/ /mnt/
[root@hostb ~]# ls /mnt
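
The experiment compares namespace inode numbers by eye. A small helper can do the comparison instead. This is a sketch under assumptions: same_ns is a hypothetical name, and reading the ns links of another user's process usually requires root.

```shell
# Hypothetical helper: same_ns PID1 PID2 TYPE
# Succeeds when both processes are in the same namespace of the given type
# (mnt, uts, pid, ...). Returns 2 when a link cannot be read.
same_ns() {
    local a b
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# A process trivially shares every namespace with itself.
if same_ns "$$" "$$" mnt; then
    echo "same mnt namespace"
else
    echo "different mnt namespace (or unreadable)"
fi
```

Run from the localhost terminal with the PID of the bash in terminal a as the second argument, the mnt and uts comparisons should report different namespaces.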

Test pid namespace

--fork: start bash as a child process of unshare (with --pid, the child is the process that becomes PID 1 in the new namespace)

[root@localhost ~]# unshare --pid --uts --mount --fork bash

Change the hostname so that this shell is easy to recognize

[root@localhost ~]# hostname hosta && exec bash
[root@hosta ~]# echo $$

The PID of the current process is 1

-p: show PIDs
-l: show long lines in full (by default they are truncated to the width given by the COLUMNS environment variable)

[root@hosta ~]# pstree -pl
           │                   └─{ModemManager}(924)
           │                     ├─{NetworkManager}(915)
           │                     └─{NetworkManager}(925)

Yet pstree shows that the process with PID 1 is systemd.

This is because /proc here is still the procfs of the original PID namespace, carried over into the new mount namespace created by unshare.

The same applies to the inode numbers under ns:

[root@hosta ~]# readlink /proc/$$/ns/{pid,uts,mnt}

/proc needs to be remounted

[root@hosta liuhongdi]# mount --types proc proc /proc/
[root@hosta liuhongdi]# pstree -pl

Note: if you pass the --mount-proc option to unshare when starting bash, you do not need to mount /proc again.
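
One way to tell whether the mounted /proc belongs to the current PID namespace is to compare a process's own idea of its PID with what /proc/self points to. The function name proc_matches_pidns is hypothetical, and the check is only a sketch of that idea.

```shell
# Hypothetical check: does the mounted /proc match our PID namespace?
proc_matches_pidns() {
    local out shell_pid proc_pid
    # The inner shell prints its own PID, then execs readlink so that
    # /proc/self is resolved by a process that still has that same PID.
    out=$(sh -c 'echo $$; exec readlink /proc/self') || return 2
    shell_pid=$(printf '%s\n' "$out" | sed -n 1p)
    proc_pid=$(printf '%s\n' "$out" | sed -n 2p)
    if [ "$shell_pid" = "$proc_pid" ]; then
        echo "match"    # /proc was mounted in this PID namespace
    else
        echo "stale"    # /proc still shows another PID namespace
    fi
}

proc_matches_pidns || echo "could not read /proc/self"
```

Inside a shell started with unshare --pid --fork and no remount, the two numbers should differ, which is the situation pstree revealed above.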

Check the inode numbers under ns again; they are now displayed correctly

[root@hosta liuhongdi]# readlink /proc/$$/ns/{pid,uts,mnt}

II Resource control

Cgroup overview

Linux has long had the concept of, and the need for, grouping processes, for example session groups and process groups. As requirements grew, such as tracking the memory and I/O usage of a group of processes, cgroups appeared. They serve two main purposes:

  1. Used to uniformly group processes

  2. On the basis of grouping, process monitoring and resource control management are carried out.

cgroup is a Linux mechanism for managing processes in groups. From the user's point of view, cgroups organize all processes in the system into independent trees. Each tree contains every process in the system, each node of a tree is a process group, and each tree is associated with one or more subsystems. The tree groups the processes; the subsystems act on those groups. cgroup consists mainly of the following two parts:

Subsystem: a subsystem is a kernel module. Once associated with a cgroup tree, it performs specific operations on each node (process group) of the tree. A subsystem is often called a "resource controller" because it is mainly used to schedule or limit the resources of each process group. The term is not entirely accurate, because sometimes processes are grouped only to monitor them and observe their status, as with the perf_event subsystem.

Hierarchy: a hierarchy can be understood as a cgroup tree. Each node of the tree is a process group, and each tree is associated with zero or more subsystems. A tree contains all processes in the Linux system, but each process can belong to only one node (process group) of it. The system can have many cgroup trees, each associated with different subsystems. A process can belong to several trees, that is, to several process groups, but those process groups are associated with different subsystems.

At present, Linux supports 12 subsystems. Leaving aside trees that are not associated with any subsystem (as with systemd), Linux can build at most 12 cgroup trees, each associated with one subsystem. Alternatively, you can build a single tree and associate it with all subsystems. A cgroup tree that is not associated with any subsystem only groups processes; what is done on the basis of that grouping is left to the application. systemd is an example.
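
The subsystems a given kernel actually offers can be read from /proc/cgroups, whose columns are subsys_name, hierarchy, num_cgroups and enabled. The sketch below uses a hypothetical helper name, enabled_subsystems, and reads that format from standard input so it can also be tried on sample text.

```shell
# Hypothetical helper: print the names of enabled subsystems.
# Input format (one subsystem per line, header line starts with '#'):
#   subsys_name  hierarchy  num_cgroups  enabled
enabled_subsystems() {
    awk '!/^#/ && $4 == 1 { print $1 }'
}

# On a live system, feed it the kernel's own table.
if [ -r /proc/cgroups ]; then
    enabled_subsystems < /proc/cgroups
fi
```

Piping the result through wc -l gives the number of subsystems available on that machine, which may differ from the 12 mentioned above depending on kernel version and configuration.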

Limit the cpu usage of a process to 50%

1. First write a script that takes up more cpu

while [ True ]; do :; done

2. You can see that the cpu is 100% used after running

20369 root      20   0  113452   1664   1196 R  100.0  0.0   0:10.73 sh

3. Create control group

mkdir /sys/fs/cgroup/cpu/foo

4. Next, use cgroups to control the cpu resources of this process

echo 50000 > /sys/fs/cgroup/cpu/foo/cpu.cfs_quota_us # cpu.cfs_quota_us of 50000 against a cpu.cfs_period_us of 100000 gives 50%
echo 20369 >/sys/fs/cgroup/cpu/foo/tasks

5. CPU usage is now limited to about 50%

20369 root      20   0  113828   1908   1196 R  49.8  0.0   0:33.75 sh
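
The 50% figure follows from dividing the quota by the period. As a rough sketch, with cpu_limit_percent being a hypothetical helper name, the arithmetic looks like this:

```shell
# Hypothetical helper: cpu_limit_percent QUOTA_US PERIOD_US
# Prints the CPU cap as a whole percentage of one CPU.
cpu_limit_percent() {
    local quota=$1 period=$2
    if [ "$quota" -lt 0 ]; then
        echo "unlimited"    # a quota of -1 means no bandwidth limit
    else
        echo $(( quota * 100 / period ))
    fi
}

# The values used in step 4: 50000 us of CPU time per 100000 us period.
cpu_limit_percent 50000 100000
```

A result above 100 means the group may use more than one CPU; for example, a quota of 200000 with a period of 100000 allows two full CPUs.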

6. The cpu control group offers many other controls

[root@foreman ~]# ls /sys/fs/cgroup/cpu/foo/
cgroup.clone_children  cpuacct.usage          cpu.rt_period_us       notify_on_release
cgroup.event_control   cpuacct.usage_percpu   cpu.rt_runtime_us      tasks
cgroup.procs           cpu.cfs_period_us      cpu.shares
cpuacct.stat           cpu.cfs_quota_us       cpu.stat

Limit the use of memory for a process

ls /sys/fs/cgroup/memory/cgtest/*

 cgroup.event_control       # Interface for eventfd
 memory.usage_in_bytes      # Shows the memory currently in use
 memory.limit_in_bytes      # Sets / shows the current memory limit
 memory.failcnt             # Number of times the memory limit has been reached
 memory.max_usage_in_bytes  # Maximum memory usage recorded
 memory.soft_limit_in_bytes # Sets / shows the current soft memory limit
 memory.stat                # Shows memory usage statistics of the current cgroup
 memory.use_hierarchy       # Sets / shows whether the memory usage of child cgroups is counted in the current cgroup
 memory.force_empty         # Makes the system immediately reclaim as much reclaimable memory in the current cgroup as possible
 memory.pressure_level      # Sets memory pressure notification events; used together with cgroup.event_control
 memory.swappiness          # Sets / shows the current swappiness
 memory.move_charge_at_immigrate # Sets whether the memory charged to a process moves with it when it moves to another cgroup
 memory.oom_control         # Sets / shows OOM control settings
 memory.numa_stat           # Shows NUMA-related memory statistics

Write a C program that consumes memory, allocating 1 MB per second

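#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(int argc, char *argv[])
{
    char *p;
    int i = 0;
    while (1) {
        p = (char *)malloc(MB);                 /* request 1 MB */
        if (p == NULL) { perror("malloc"); return 1; }
        memset(p, 0, MB);                       /* touch the pages so they are really used */
        printf("%dM memory allocated\n", ++i);
        sleep(1);                               /* one allocation per second */
    }
    return 0;
}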

# gcc mem-allocate.c -o mem-allocate

Limit memory usage to 50 MB with cgroup (a hard limit, set through memory.limit_in_bytes)

[root@foreman cgtest]# pwd
[root@foreman cgtest]# echo 50M > memory.limit_in_bytes
[root@foreman cgtest]# echo 0 > memory.oom_control # With 0, a process that reaches the limit is killed directly
[root@foreman cgtest]# pgrep mem-allocate
[root@foreman cgtest]# echo 35190 > tasks
# Writing to tasks limits only that one thread ID. To limit a whole thread group, write the PID to cgroup.procs.
# The PID itself and the processes derived from it are then limited, as a whole, to the memory size set in memory.limit_in_bytes.
# Processes spawned by that PID are limited as well. To see which cgroups a process belongs to, use:
# cat /proc/<PID>/cgroup
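
The value 50M written above uses a size suffix, and reading memory.limit_in_bytes back shows the limit in bytes. The following sketch mirrors that conversion for the K, M and G suffixes; to_bytes is a hypothetical helper name and it does not model any rounding the kernel may apply.

```shell
# Hypothetical helper: to_bytes VALUE
# Converts a value with an optional K/M/G suffix (powers of 1024) to bytes.
to_bytes() {
    local v=$1 n
    case $v in
        *[kK]) n=${v%?}; echo $(( n * 1024 )) ;;
        *[mM]) n=${v%?}; echo $(( n * 1024 * 1024 )) ;;
        *[gG]) n=${v%?}; echo $(( n * 1024 * 1024 * 1024 )) ;;
        *)     echo "$v" ;;    # already a plain byte count
    esac
}

to_bytes 50M     # the limit used in this example
```

Comparing this number with memory.usage_in_bytes shows how close the group is to its limit.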

Limiting effect:

[root@foreman ~]# ./mem-allocate
1M memory allocated
2M memory allocated
3M memory allocated
4M memory allocated
49M memory allocated
50M memory allocated
51M memory allocated
# If you want a process that reaches the limit to be suspended rather than killed outright, set oom_kill_disable to 1
[root@foreman cgtest]# cat memory.oom_control #default
oom_kill_disable 0
under_oom 0
[root@foreman cgtest]# echo 1 > memory.oom_control
[root@foreman cgtest]# cat memory.oom_control
oom_kill_disable 1
under_oom 0

Run a script that spawns multiple child processes to consume memory aggressively (memory.limit_in_bytes was set to 300 MB and oom_kill_disable to 0 beforehand)

[root@foreman ~]# cat
sleep 20
while [ True ]; do
    nohup /root/mem-allocate >>/root/mem.log 2>&1 &
    sleep 1
    proc_num=$(pgrep mem-allocate | wc -l)
    if [ "$proc_num" -eq 50 ]; then
        sleep 1000000
    fi
done

After running it, check the limits with systemd-cgtop:

# systemd-cgtop    # shows the resources used by each cgroup
/cgtest     25      -   295.8M
# The output shows that the memory of the 25 tasks is limited to 300 MB

Conceptual understanding:

Suppose process 5678 has been added to the /foo control group. What is the difference between tasks and cgroup.procs?
Speaking of limits on a "process", as above, is not quite precise: the unit the system schedules is the thread.
The tasks file contains thread IDs, while cgroup.procs contains thread group IDs, which are what is generally called process IDs.

Writing an ordinary PID to tasks adds only the thread with that ID to the control group, together with any processes and threads it creates afterwards; the other existing threads of the process are not added.
Writing to cgroup.procs adds all of the process's current threads. If the ID written to cgroup.procs is an ordinary thread ID rather than a thread group ID, the corresponding thread group is found automatically and added.
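
The distinction between thread IDs and thread group IDs can be seen directly under /proc. In this sketch the helper names tgid_of and list_tids are hypothetical; the Tgid field of /proc/<id>/status and the /proc/<pid>/task directory are what they read.

```shell
# Hypothetical helpers for inspecting thread and thread group IDs.

# tgid_of ID: print the thread group ID (the "process ID") that a
# thread or process ID belongs to.
tgid_of() {
    awk '/^Tgid:/ { print $2 }' "/proc/$1/status"
}

# list_tids PID: print every thread ID in the process, one per line.
list_tids() {
    ls "/proc/$1/task"
}

# For the current shell, the thread group ID is its own PID.
tgid_of "$$"
list_tids "$$"
```

For a multithreaded program, list_tids prints several IDs that all map to the same value under tgid_of; those are the IDs that tasks and cgroup.procs contain, respectively.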

After a process joins a control group, the limits of that control group take effect immediately. To find out which control groups a process belongs to, run cat /proc/<pid>/cgroup.
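
Each line of /proc/<pid>/cgroup has the form hierarchy-ID:controller-list:cgroup-path. As a sketch, a hypothetical helper named cgroup_path_for can pull out the path for one controller; it reads the lines from standard input.

```shell
# Hypothetical helper: cgroup_path_for CONTROLLER
# Reads /proc/<pid>/cgroup-style lines on stdin and prints the cgroup
# path of the hierarchy that the given controller is attached to.
cgroup_path_for() {
    awk -v want="$1" '{
        first  = index($0, ":")
        rest   = substr($0, first + 1)          # "controllers:path"
        second = index(rest, ":")
        n = split(substr(rest, 1, second - 1), ctrls, ",")
        for (i = 1; i <= n; i++)
            if (ctrls[i] == want) { print substr(rest, second + 1); exit }
    }'
}

# On a live system, query the current shell.
if [ -r /proc/self/cgroup ]; then
    cgroup_path_for memory < /proc/self/cgroup
fi
```

On systems that mount only the unified (v2) hierarchy the file contains a single line with an empty controller list, so this helper prints nothing there.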

To remove a process from a control group, write its PID to the tasks file of the root cgroup. Because each process belongs to exactly one cgroup in a hierarchy, joining a new cgroup dissolves the previous membership.

To delete a cgroup, remove its directory with rmdir. Before that, all of its processes must have exited and the subsystem resources it holds must have been released; otherwise the directory cannot be removed.
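
Before calling rmdir it can help to check that the group really is empty. The sketch below assumes the cgroup directory is passed as an argument; cgroup_is_empty and remove_cgroup are hypothetical names, and child cgroups, which also block removal, are not checked.

```shell
# Hypothetical helper: cgroup_is_empty DIR
# Succeeds when DIR/cgroup.procs exists and lists no processes.
cgroup_is_empty() {
    [ -r "$1/cgroup.procs" ] && [ -z "$(cat "$1/cgroup.procs")" ]
}

# Hypothetical helper: remove_cgroup DIR
# Removes the cgroup directory only when no process is left in it.
remove_cgroup() {
    if cgroup_is_empty "$1"; then
        rmdir "$1"
    else
        echo "cgroup $1 still has processes" >&2
        return 1
    fi
}
```

For the CPU example earlier, the call would be remove_cgroup /sys/fs/cgroup/cpu/foo, run as root after process 20369 has exited or been moved to the root cgroup.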

All of the operations above access cgroups through the file system. There is also a set of command-line tools for the same purpose.

III Process

Concept of process

Before creating a process, we must understand one concept: what is a process?

Concept of process: the process is the core concept of an operating system. Put simply, a process is a program running in the operating system, and it is the smallest unit of operating system resource management. A process is a dynamic entity: it is one execution of a program. The difference between a process and a program is that a process is dynamic and a program is static; a process is a running program, while a program is executable code stored on disk.

Linux Process Structure

Under Linux, you can view the processes in the current system through the command ps or pstree.

With the general concept of a process in place, the next thing to understand is what a process actually consists of. Saying it is a dynamic entity means that once started it keeps running (for a certain time, of course). So how does it operate, and what additional resources does it need? To answer that, we need to understand the structure of a process.

Linux process structure: a process consists of three parts: the code segment, the data segment and the stack segment. Put another way, it is made up of the program, its data and the Process Control Block (PCB). The PCB is the sole marker of a process's existence: the system knows a process exists because its PCB exists.

The code segment stores the executable code of the program.
The data segment stores the global variables, constants and static variables of the program.
The heap part of the stack segment holds dynamically allocated memory. The stack part is used for function calls: it stores function parameters and the local variables defined inside functions.

The system manages and schedules processes through the PCB. Operations carried out through the PCB include creating a process, executing a program, exiting a process and changing a process's priority. In Linux the PCB is a structure named task_struct, defined in include/linux/sched.h. Whenever a new process is created, an empty task_struct is allocated in memory and filled in with the required information. In early kernels a pointer to the structure was also added to the task[] array, which held all process control blocks; current kernels link the structures into a list instead.

Mode of interprocess communication

  1. Pipes and named pipes: a pipe can be used for communication between related processes, such as parent and child. A named pipe offers the same functions and also allows communication between unrelated processes.
  2. Signal: a signal simulates the interrupt mechanism at the software level. It is a relatively complex communication method used to notify a process that an event has occurred. A process receiving a signal is comparable to a processor receiving an interrupt request.
  3. Message queue: a message queue is a linked list of messages. It overcomes the limited amount of information that the two methods above can carry. Processes with write permission can add messages to the queue according to certain rules; processes with read permission can read messages from it.
  4. Shared memory: arguably the most useful form of inter-process communication. It lets multiple processes access the same memory region, so each process promptly sees updates made by the others. It relies on synchronization primitives such as mutexes and semaphores.
  5. Semaphore: mainly used for synchronization and mutual exclusion between processes and between threads of the same process.
  6. Socket: a more general inter-process communication mechanism. It can also be used between processes on different machines in a network, and it is widely used.
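
The first item in the list, the named pipe, is easy to try from the shell with mkfifo. The function name fifo_demo and the message text are made up for this sketch.

```shell
# Sketch: two commands exchange a message through a named pipe (FIFO).
fifo_demo() {
    local dir fifo msg
    dir=$(mktemp -d) || return 1
    fifo="$dir/demo.fifo"
    mkfifo "$fifo" || return 1

    # Writer: runs in the background and blocks until a reader opens the FIFO.
    echo "hello over fifo" > "$fifo" &

    # Reader: opens the same FIFO by path and receives the message.
    read -r msg < "$fifo"

    wait                # reap the background writer
    rm -rf "$dir"
    echo "$msg"
}

fifo_demo
```

Because the FIFO is found by its path, the writer and reader could just as well be two unrelated programs started from different terminals.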

Relationship between process and thread

  1. A thread can only belong to one process, and a process can have multiple threads, but at least one thread.
  2. Resources are allocated to processes, and all threads of the same process share all resources of the process.
  3. The processor is allocated to threads, that is, threads are really running on the processor.
  4. Threads need to cooperate and synchronize during execution. Threads of different processes synchronize by means of message passing. A thread is an execution unit within a process and a schedulable entity of the process.

The difference between process and thread

Process: each process has its own code and data space (process context), and switching between processes has a large overhead. A process contains 1 to n threads. (A process is the smallest unit of resource allocation.)
Thread: threads of the same process share its code and data segments. Each thread has its own stack and program counter (PC), and switching between threads has a small overhead. (A thread is the smallest unit of CPU scheduling.)

(1) Scheduling: the thread is the basic unit of scheduling and dispatch, and the process is the basic unit of resource ownership
(2) Concurrency: processes can execute concurrently, and so can multiple threads of the same process
(3) Owning resources: a process is an independent unit that owns resources. Threads do not own system resources, but can access the resources of the process they belong to
(4) System overhead: creating or destroying a process requires the system to allocate or reclaim resources for it, so the overhead is significantly greater than for creating or destroying a thread.
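
The "one process, 1 to n threads" relationship can be observed through the Threads field of /proc/<pid>/status. The helper name thread_count is hypothetical; this is only a sketch of where to look.

```shell
# Hypothetical helper: thread_count PID
# Prints how many threads the process currently has.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Every process has at least one thread; a plain shell usually has exactly one.
thread_count "$$"
```

Pointing the same helper at a multithreaded server process gives a number greater than one, while all of those threads share the address space and open files of that single process.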

Topics: Linux namespace cgroup