Analysis of problems encountered when edgemesh hijacks traffic using iptables

Posted by loftystew on Mon, 03 Jan 2022 14:06:03 +0100

Analysis of problems encountered when edgemesh hijacks traffic using iptables

concept

  1. Cloud side communication: cloud container. Access the edge container through serviceName/clusterIP;
  2. Edge cloud communication: in contrast to the above, the edge container accesses the cloud container through serviceName/clusterIP;
  3. Edge to edge communication: the container on the edge host accesses the container of the edge host or other edge hosts through serviceName/clusterIP;

edgemesh architecture

Compared with kubernetes, kubeedge mainly provides the scheme of building clusters across subnets; Its most powerful feature is that kubernetes is not limited to a physical computer room; The host nodes in the cluster can be in different private LANs.

Hosts under different private networks need to communicate with each other, which is mainly limited by the firewall NAT form:

  • Full cone NAT

  • Restricted cone NAT

  • Port restricted cone NAT

  • Symmetric NAT

Among the above networking schemes, the security is enhanced in turn. In a symmetrical conical network, all hosts are behind their own firewalls, so direct penetration access is very difficult. Therefore, direct edgemesh provides two penetration access schemes:

deploy

Edgemesh server: deployment cloud

Edgemesh agent: daemonset cloud + edge

Direct penetration access

For full cone NAT, restricted NAT and restricted cone NAT access of some ports, the following schemes can be directly adopted;

Relay penetration access

For requests that cannot be accessed through direct penetration, the edgemesh agent will forward the traffic to the edgemesh server for relay, because after all edgemesh agents are started, a tunnel network will be established in the edgemesh server in the cloud;

edgemesh principle

On the edge cluster composed of kubernetes+kubeedge, edgemesh agent is deployed regardless of cloud or edge host, which has two main functions:

  • The 53 port is exposed to the public, and the function of querying k8s:service configuration through serviceName to obtain the corresponding clusterIP is realized;
  • Create a dummy device network card on the host, which is exposed to a port on 169.254.96.16:40001 by default, and configure the EDGE-MESH chain of preouting and OUTPUT through iptables to hijack all traffic access.
    • In the following NAT table settings of iptables, we can see the related service target addresses configured in the EDGE-MESH chain: that is, services that do not need EDGEMESH proxy will RETURN to the next line as proxy, that is, dock - > other proxy services (Kube proxy, Kube router...). The declaration of these services is to add the label of noproxy=edgemesh when k8s creating servcie, It will be written to the location of the host iptables by the EDGEMESH agent. These services are actually services that do not require cloud side communication;
    • When the traffic that needs to be represented by edgemesh agent, i.e. cloud side communication, cloud side communication and edge side communication, is hijacked to the traffic, the traffic will be hijacked and forwarded to 169.254.96.16:40001 by the kernel, and then forwarded to the remote edgemesh agent: 20006 or edgemesh server: 20004 through port 20006 of edgemesh agent, The edgemesh of the final target end forwards the traffic to the accessed service container;
[root@edge001 ~]# iptables -t nat -nvL --line
Chain PREROUTING (policy ACCEPT 9734 packets, 833K bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1    32317 3700K EDGE-MESH  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* edgemesh root chain */
2    14436  930K DOCKER     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT 9734 packets, 833K bytes)
num   pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 278 packets, 18244 bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1      893 60171 EDGE-MESH  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* edgemesh root chain */
2       47  3390 DOCKER     all  --  *      *       0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT 278 packets, 18244 bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 MASQUERADE  all  --  *      !docker0  172.17.0.0/16        0.0.0.0/0           
2        0     0 RETURN     all  --  *      *       192.168.122.0/24     224.0.0.0/24        
3        0     0 RETURN     all  --  *      *       192.168.122.0/24     255.255.255.255     
4        0     0 MASQUERADE  tcp  --  *      *       192.168.122.0/24    !192.168.122.0/24     masq ports: 1024-65535
5        0     0 MASQUERADE  udp  --  *      *       192.168.122.0/24    !192.168.122.0/24     masq ports: 1024-65535
6        0     0 MASQUERADE  all  --  *      *       192.168.122.0/24    !192.168.122.0/24    

Chain DOCKER (2 references)
num   pkts bytes target     prot opt in     out     source               destination         
1       27  1896 RETURN     all  --  docker0 *       0.0.0.0/0            0.0.0.0/0           

Chain EDGE-MESH (2 references)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 RETURN     all  --  *      *       0.0.0.0/0            10.254.165.225       /* ignore kubeedge/cloudcore service */
2        0     0 RETURN     all  --  *      *       0.0.0.0/0            10.242.247.106       /* ignore kube-system/traefik-web-ui service */
3        0     0 RETURN     all  --  *      *       0.0.0.0/0            10.243.116.144       /* ignore kube-system/traefik-ingress-service service */
4        0     0 RETURN     all  --  *      *       0.0.0.0/0            10.255.163.200       /* ignore kube-system/prometheus-service service */
5        0     0 RETURN     all  --  *      *       0.0.0.0/0            10.240.83.24         /* ignore kube-system/metrics-server service */
6        0     0 RETURN     all  --  *      *       0.0.0.0/0            10.254.210.250       /* ignore kube-system/kube-dns service */
7        0     0 RETURN     all  --  *      *       0.0.0.0/0            10.240.0.1           /* ignore default/kubernetes service */
8        0     0 EDGE-MESH-TCP  tcp  --  *      *       0.0.0.0/0            10.240.0.0/12        /* tcp service proxy */

Chain EDGE-MESH-TCP (1 references)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            to:169.254.96.16:40001

Normal access analysis

Direct penetration request side packet capture

Environmental Science:

4.4.2.4.2.4.2.4.2.4.2 el7. elrepo. x86_ sixty-four

mkedge2 10.201.82.131 edgemesh agent role: edgenode kernel: 4.15.6-1 el7. elrepo. x86_ sixty-four

Use case:

Start the TCP echo cloud container on mkedge2 and remotely telnet TCP echo cloud 2701 through the busybox container on mkedge3

explain:

From the perspective of actual traffic packet capturing, the host of the temple of heaven computer room can be directly accessed from the office computer room without relay;

Packet capture details

mkedge3 initiates a tcp packet capture request

Receive payload tcp packets on mkedge2

Relay penetration request side packet capture

Environmental Science:

mkmaster1 10.200.50.118 international trade machine room edgemesh server, edgemesh agent role: Master kernel: 5.2.14-1 el7. elrepo. x86_ sixty-four

mkworker3 10.202.42.112 edgemesh agent role: cloudnode kernel: 5.2.14-1 el7. elrepo. x86_ sixty-four

mkedge3 10.136.77.2 edgemesh agent role: edgenode kernel: 3.10.0-514 el7. x86_ sixty-four

Use case:

Start the TCP echo cloud container on mkworker3 and remotely telnet TCP echo cloud 2701 through the busybox container on mkedge3

explain:

Judging from the actual traffic packet capturing, the host of Yizhuang machine room cannot be directly accessed from the office machine room, and the edgemesh server relay of ITC machine room is required;

Packet capture details

mkedge3 initiates a tcp packet capture request

mkmaster1 relay tcp packet capture

Receive payload tcp packets on mkworker3

Failure access analysis

Let's start with the conclusion. It is suspected that the host default kernel is 4.15.6-1 el7. elrepo. x86_ 64 (centos7) has a serious bug calling the system kernel function getsockopt!!!

Failed penetration request packet capture

Environmental Science:

mkmaster1 10.200.50.118 international trade machine room edgemesh server, edgemesh agent role: Master kernel: 5.2.14-1 el7. elrepo. x86_ sixty-four

mkworker1 10.201.82.139 edge mesh agent role: cloudnode kernel: 5.2.14-1 el7. elrepo. x86_ sixty-four

mkworker2 10.201.83.74 edge mesh agent role: cloudnode kernel: 5.2.14-1 el7. elrepo. x86_ sixty-four

mkedge1 10.201.82.148 edgemesh agent role: edgenode kernel: 4.15.6-1 el7. elrepo. x86_ sixty-four

mkedge2 10.201.82.131 edgemesh agent role: edgenode kernel: 4.15.6-1 el7. elrepo. x86_ sixty-four

mkedge3 10.136.77.2 edgemesh agent role: edgenode kernel: 3.10.0-514 el7. x86_ sixty-four

Use case:

Start the TCP echo cloud container on mkworker1, mkworker2, mkedge1 and mkedge2, and remotely access these services through the busybox container on mkedge1. It is found that none of them is available;

present situation:

Judging from the actual traffic packet capturing, the traffic disappeared after the service container sent a request to the edgemesh agent: 40001 port;

  1. After repeated startup and shutdown of these containers, it is true that normal communication cannot be carried out after many tests;
  2. It is suspected that the problem is iptables. After the docker chain is lost, the rule chain cannot be created by restarting the docker service;
  3. Restart the edgemesh agent service. It is found that its container cannot be cleared and is always in Running status. If docker rm -f is forcibly stopped, edgemesh agent will leave zombie processes on the host machine;

Initial troubleshooting stage

What is the wrong version of Go? Container mirroring problem? One by one:

To edgemesh agent: proxy In the Run() function of go, the print log is added before the call of realserveraddress (& conn). At the same time, some log output flag bits are added in this function call. The container is re made for testing. It is found that the output log 555 is not output 666, and the process status exception changes to D:

func (proxy *EdgeProxy) Run() {
	// ensure ipatbles
	proxy.Proxier.Start()

	// start tcp proxy
	for {
		conn, err := proxy.TCPProxy.Listener.Accept()
		if err != nil {
			klog.Warningf("get tcp conn error: %v", err)
			continue
		}
		klog.Info("!!! has workload !!!")
		ip, port, err := realServerAddress(&conn)
        ...
    }
}



// realServerAddress returns an intercepted connection's original destination.
func realServerAddress(conn *net.Conn) (string, int, error) {
	tcpConn, ok := (*conn).(*net.TCPConn)
	if !ok {
		return "", -1, fmt.Errorf("not a TCPConn")
	}
	klog.Info("111")
	file, err := tcpConn.File()
	if err != nil {
		return "", -1, err
	}
	defer file.Close()
	klog.Info("222")
	// To avoid potential problems from making the socket non-blocking.
	tcpConn.Close()
	*conn, err = net.FileConn(file)
	if err != nil {
		return "", -1, err
	}
	klog.Info("333")
	fd := file.Fd()
	klog.Info("444")
	var addr sockAddr
	size := uint32(unsafe.Sizeof(addr))
	err = getSockOpt(int(fd), syscall.SOL_IP, SoOriginalDst, uintptr(unsafe.Pointer(&addr)), &size)
	if err != nil {
		return "", -1, err
	}
	klog.Info("777")
	var ip net.IP
	switch addr.family {
	case syscall.AF_INET:
		ip = addr.data[2:6]
	default:
		return "", -1, fmt.Errorf("unrecognized address family")
	}
	klog.Info("888")
	port := int(addr.data[0])<<8 + int(addr.data[1])
	if err := syscall.SetNonblock(int(fd), true); err != nil {
		return "", -1, nil
	}

	return ip.String(), port, nil
}

func getSockOpt(s int, level int, name int, val uintptr, vallen *uint32) (err error) {
	klog.Info("555")
	_, _, e1 := syscall.Syscall6(syscall.SYS_GETSOCKOPT, uintptr(s), uintptr(level), uintptr(name), uintptr(val), uintptr(unsafe.Pointer(vallen)), 0)
	if e1 != 0 {
		err = e1
	}
	klog.Info("666")
	return
}

Then I began to suspect that there was a problem with the kernel call. First, I wrote a go program: https://gitee.com/Hu-Lyndon/gogetsockopt.git

The following is the kernel on the mkedge3 host: 3.10.0-514 el7. x86_ 64. Start the program on one terminal to listen to port 18011, and then directly telnet the port on another terminal connected to mkedge3. The program can communicate correctly and output the print log address: 192.168.127.1:18011

The following is the test on mkedge1, kernel: 4.15.6-1 el7. elrepo. x86_ 64. After telnet is found, the print log cannot be output, and the program cannot kill -9 in the third terminal window, and the process state changes to D;

If you don't give up, the program is still the same on mkedge2;

I didn't give up. I found a kernel and upgraded it to kernel: 4.19.12-1 el7. elrepo. x86_ 64 host, the result is correct;

Re troubleshooting phase

Never give up: Although I seriously doubt the kernel: 4.15.6-1 el7. elrepo. x86_ 64. Is it the Go problem or the system kernel problem? So I wrote a C program to test: https://gitee.com/Hu-Lyndon/unix-network-programming , in compile-on-kernel 4 15.6 branch / sockopt / sockoptchk c

The above problems still appear, which means that it should be a kernel problem. It is recommended not to install this version of the kernel after O & M. at present, the kernel above 4.19.12 or simply 3.10 has been patch ed.

Relevant information

https://go.dev/play/p/GMAaKucHOr

https://elixir.bootlin.com/linux/latest/C/ident/sys_getsockopt

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/socket.c

https://bugzilla.kernel.org/show_bug.cgi?id=198791

http://patchwork.ozlabs.org/project/netfilter-devel/patch/a4752d0887579496c5db267d9db7ff77719436d8.1518088560.git.pabeni@redhat.com/

https://github.com/cybozu-go/transocks/pull/14

Topics: Linux Docker network