Linux eBPF program composition and communication principle

Posted by crzyman on Tue, 08 Mar 2022 14:34:52 +0100

About the author: daemon Wu, Linux kernel performance optimization engineer, works in a micro mobile phone factory, engaged in mobile phone performance optimization. Motto: unity of knowledge and practice.
Original creation: all kinds of community original good articles delivered by taixiao readers.
Copyright notice: This article was first published in the WeChat official account of "Tai Xiao technology". Please welcome the reprint and reprint. Please keep this statement at the beginning of the article.

catalogue

1. Preface

2. eBPF framework

2.1 tracex4_kern.c

2.2 tracex4_user.c

3. Parse the target file through readelf and llvm objdump

3.1 # read ELF file header

3.2 # print the contents of each paragraph

3.3 # parsing BPF ELF format files through llvm objdump

4. Trace and analyze eBPF program behavior through strace tool

4.1} BPF bytecode loading analysis

4.2} MAP data communication analysis

5. Introduction to bpftool usage

6. References

1. Preface

eBPF compiles C code into BPF bytecode in the Linux kernel and hangs it on hooks such as kprobe/tracepoint. When "hook" is triggered, the Linux kernel runs bytecode to track performance.

2. eBPF framework

<Linux eBPF: kprobes injection of bcc usage and principle > This chapter describes how to call the basic process between BPF and bcc.

In the Linux kernel, there are many examples of bpf programs in the {sample/bpf} directory to} tracex4_kern.c # and # tracex4_user.c ， as an example.

The following figure is the framework of eBPF program, which is divided into program execution flow and data communication flow.

For program execution flow,

trace_kern.c ￠ is the code called when allocating the slab and releasing the slab, which declares the hook processing function and data map. The data map is used for data communication between kernel state and user state, which is compiled into bpf bytecode by LLVM/clang compiler
trace_user.c) it is used to load bpf bytecode and fall into the kernel state. The bpf bytecode is converted into machine sink code through JIT(just in time) compiler. When kprobe or tracepoint tracks a certain kind of event, it executes the hook processing function declared above, obtains the data map and transmits the data to userspace

2.1 tracex4_kern.c

A data map of "my_map" is defined. The data map is composed of # key/value # key value pairs. Here, value is a variable of structure, which is used to obtain the current running time and ip register. The data map is composed of__ attribute__ Declaration is a separate section. When it is compiled into an ELF format file, the structure variable exists in the "maps" section.

The next statement is to allocate the slab, release the hook processing function of the slab and put it separately in kprobe/kmem_cache_free，kretprobe/kmem_cache_alloc_node # in two code snippets.

#include <linux/version.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"
struct pair {
	u64 val;
	u64 ip;
};
struct bpf_map_def SEC("maps") my_map = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(long),
	.value_size = sizeof(struct pair),
	.max_entries = 1000000,
};
SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{
	long ptr = PT_REGS_PARM2(ctx);
	bpf_map_delete_elem(&my_map, &ptr);
	return 0;
}
SEC("kretprobe/kmem_cache_alloc_node")
int bpf_prog2(struct pt_regs *ctx)
{
	long ptr = PT_REGS_RC(ctx);
	long ip = 0;
	/* get ip address of kmem_cache_alloc_node() caller */
	BPF_KRETPROBE_READ_RET_IP(ip, ctx);
	struct pair v = {
		.val = bpf_ktime_get_ns(),
		.ip = ip,
	};
	bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
	return 0;
}
char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;

Compiled using clang o ， file is also an ELF file. You can use llvm dump to output each section table. For example, for the above customized sections, you can use ， SEC to customize a section:

#define SEC(NAME) __attribute__((section(NAME), used))

2.2 tracex4_user.c

A pair structure is defined to accept the data of the hook processing function. First, load {tracex4_kern.o. Then poll to obtain data in the dead cycle, and then print it with # printf #:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>
#include <linux/bpf.h>
#include <sys/resource.h>
#include <bpf/bpf.h>
#include "bpf_load.h"
struct pair {
	long long val;
	__u64 ip;
};
static __u64 time_get_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ull + ts.tv_nsec;
}
static void print_old_objects(int fd)
{
	long long val = time_get_ns();
	__u64 key, next_key;
	struct pair v;
	key = write(1, "\e[1;1H\e[2J", 12); /* clear screen */
	key = -1;
	while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0) {
		bpf_map_lookup_elem(map_fd[0], &next_key, &v);
		key = next_key;
		if (val - v.val < 1000000000ll)
			/* object was allocated more then 1 sec ago */
			continue;
		printf("obj 0x%llx is %2lldsec old was allocated at ip %llx\n",
		       next_key, (val - v.val) / 1000000000ll, v.ip);
	}
}
int main(int ac, char **argv)
{
	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
	char filename[256];
	int i;
	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
		perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
		return 1;
	}
	if (load_bpf_file(filename)) {
		printf("%s", bpf_log_buf);
		return 1;
	}
	for (i = 0; ; i++) {
		print_old_objects(map_fd[1]);
		sleep(1);
	}
}

3. Parse the target file through readelf and llvm objdump

3.1 # read ELF file header

wu@ubuntu:~/linux/samples/bpf$ readelf -h tracex4_kern.o
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:			     ELF64
  Data: 			     2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: Linux BPF
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 8344 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 0 (bytes)
Number of program headers: 0
Size of section headers: 64 (bytes)
Number of section headers: 27
Section header string table index: 1

In the readelf output:

Line 1, ELF Header: name the beginning of ELF file header.

Line 2, Magic magic number, is used to indicate that the file is an ELF target file.

The first byte 7F is a fixed number;
The last three bytes are the ASCII form of E, l and F.

In line 3, CLASS indicates the file type. Here is the 64 bit ELF format.

In line 4, Data indicates the format in which the Data in the file is organized (big end or small end). The Data organization format of different processor platforms may be different. For example, x86 platform is a small end storage format.

Line 5, the current ELF file header version number, where the version number is 1.

Line 6, OS/ABI, indicates the operating system type. ABI is the abbreviation of Application Binary Interface.

Line 7, ABI version number, currently 0.

In line 8, Type indicates the file Type. There are three types of ELF files, one is the Relocatable file as shown above, the other is the executable file, and the other is the shared library.

Line 9, machine platform type. Here is the bpf virtual machine line 10, the version number of the current target file.

Line 11 is the virtual address entry point of the program. Because this is not a runnable program, it is zero here.

Line 12 is the same as line 11. There is no Program Headers in this object file.

Line 13, the beginning of the sections header. Here 8344 is decimal, indicating that it starts from the address offset 0x2098.

Line 14 is a flag associated with the processor, which is 0 on x86 platforms.

Line 15, the number of bytes in the ELF file header.

In line 16, because this is not an executable program, the size here is 0. In line 19, how many section headers are there? Here are 27.

3.2 # print the contents of each paragraph

wu@ubuntu:~/linux/samples/bpf$ readelf -S tracex4_kern.o
There are 27 section headers, starting at offset 0x2098:
Section Headers:
  [Nr] Name		 Type		  Address	    Offset
       Size		 EntSize	  Flags  Link  Info  Align
  [ 0]			 NULL		  0000000000000000  00000000
       0000000000000000  0000000000000000	    0	  0	0
  [ 1] .strtab		 STRTAB 	  0000000000000000  00001f80
       0000000000000115  0000000000000000	    0	  0	1
  [ 2] .text		 PROGBITS	  0000000000000000  00000040
       0000000000000000  0000000000000000  AX	    0	  0	4
  [ 3] kprobe/kmem_cache PROGBITS	  0000000000000000  00000040
       0000000000000048  0000000000000000  AX	    0	  0	8
  [ 4] .relkprobe/kmem_c REL		  0000000000000000  00001870
       0000000000000010  0000000000000010	   26	  3	8
  [ 5] kretprobe/kmem_ca PROGBITS	  0000000000000000  00000088
       00000000000000c0  0000000000000000  AX	    0	  0	8
  [ 6] .relkretprobe/kme REL		  0000000000000000  00001880
       0000000000000010  0000000000000010	   26	  5	8
  [ 7] maps		 PROGBITS	  0000000000000000  00000148
       000000000000001c  0000000000000000  WA	    0	  0	4
  [ 8] license		 PROGBITS	  0000000000000000  00000164
       0000000000000004  0000000000000000  WA	    0	  0	1
  [ 9] version		 PROGBITS	  0000000000000000  00000168
       0000000000000004  0000000000000000  WA	    0	  0	4
  [10] .debug_str	 PROGBITS	  0000000000000000  0000016c
       00000000000001e9  0000000000000001  MS	    0	  0	1
  [11] .debug_loc	 PROGBITS	  0000000000000000  00000355
       0000000000000150  0000000000000000	    0	  0	1
  [12] .rel.debug_loc	 REL		  0000000000000000  00001890
       0000000000000050  0000000000000010	   26	 11	8
  [13] .debug_abbrev	 PROGBITS	  0000000000000000  000004a5
       0000000000000101  0000000000000000	    0	  0	1
  [14] .debug_info	 PROGBITS	  0000000000000000  000005a6
       0000000000000376  0000000000000000	    0	  0	1
  [15] .rel.debug_info	 REL		  0000000000000000  000018e0
       00000000000004b0  0000000000000010	   26	 14	8
  [16] .debug_ranges	 PROGBITS	  0000000000000000  0000091c
       0000000000000030  0000000000000000	    0	  0	1
  [17] .rel.debug_ranges REL		  0000000000000000  00001d90
       0000000000000040  0000000000000010	   26	 16	8
  [18] .BTF		 PROGBITS	  0000000000000000  0000094c
       0000000000000569  0000000000000000	    0	  0	1
  [19] .rel.BTF 	 REL		  0000000000000000  00001dd0
       0000000000000030  0000000000000010	   26	 18	8
  [20] .BTF.ext 	 PROGBITS	  0000000000000000  00000eb5
       0000000000000178  0000000000000000	    0	  0	1
  [21] .rel.BTF.ext	 REL		  0000000000000000  00001e00
       0000000000000140  0000000000000010	   26	 20	8
  [22] .eh_frame	 PROGBITS	  0000000000000000  00001030
       0000000000000050  0000000000000000   A	    0	  0	8
  [23] .rel.eh_frame	 REL		  0000000000000000  00001f40
       0000000000000020  0000000000000010	   26	 22	8
  [24] .debug_line	 PROGBITS	  0000000000000000  00001080
       0000000000000147  0000000000000000	    0	  0	1
  [25] .rel.debug_line	 REL		  0000000000000000  00001f60
       0000000000000020  0000000000000010	   26	 24	8
  [26] .symtab		 SYMTAB 	  0000000000000000  000011c8
       00000000000006a8  0000000000000018	    1	 66	8

The third column represents the Type:

”NULL ": not used, such as the first empty segment of the segment table
"PROGBITS": program data, such as text,. data,. rodata;
Table "relocating", such as. REL rel.text;
"NOBITS": program space without data temporarily, such as bss;
"strtab": string table, such as strtab,. shstrtab;
"symtab": symbol table, such as symtab, including all relevant symbolic information used, such as function name and variable name.

3.3 # parsing BPF ELF format files through llvm objdump

wu@ubuntu:~/linux/samples/bpf$ llvm-objdump -h tracex4_kern.o
tracex4_kern.o: file format ELF64-BPF
Sections:
Idx Name				Size	 VMA		  Type
  0					00000000 0000000000000000
  1 .strtab				00000115 0000000000000000
  2 .text				00000000 0000000000000000 TEXT
  3 kprobe/kmem_cache_free		00000048 0000000000000000 TEXT
  4 .relkprobe/kmem_cache_free		00000010 0000000000000000
  5 kretprobe/kmem_cache_alloc_node	000000c0 0000000000000000 TEXT
  6 .relkretprobe/kmem_cache_alloc_node 00000010 0000000000000000
  7 maps				0000001c 0000000000000000 DATA
  8 license				00000004 0000000000000000 DATA
  9 version				00000004 0000000000000000 DATA
... ...

You can use the llvm tool to decompile the eBPF program, tracex4_kern.o # is a file in ELF format, which is divided into two code segments, as follows:

wu@ubuntu:~/linux/samples/bpf$ llvm-objdump -d -r -print-imm-hex tracex4_kern.o
tracex4_kern.o: file format ELF64-BPF
Disassembly of section kprobe/kmem_cache_free:
0000000000000000 bpf_prog1:
       0:	79 11 68 00 00 00 00 00 r1 = *(u64 *)(r1 + 0x68)
       1:	7b 1a f8 ff 00 00 00 00 *(u64 *)(r10 - 0x8) = r1
       2:	bf a2 00 00 00 00 00 00 r2 = r10
       3:	07 02 00 00 f8 ff ff ff r2 += -0x8
       4:	18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
		0000000000000020:  R_BPF_64_64	my_map
       6:	85 00 00 00 03 00 00 00 call 0x3
       7:	b7 00 00 00 00 00 00 00 r0 = 0x0
       8:	95 00 00 00 00 00 00 00 exit
Disassembly of section kretprobe/kmem_cache_alloc_node:
0000000000000000 bpf_prog2:
       0:	79 12 50 00 00 00 00 00 r2 = *(u64 *)(r1 + 0x50)
       1:	7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 0x8) = r2
       2:	b7 02 00 00 00 00 00 00 r2 = 0x0
       3:	7b 2a f0 ff 00 00 00 00 *(u64 *)(r10 - 0x10) = r2
       4:	79 13 20 00 00 00 00 00 r3 = *(u64 *)(r1 + 0x20)
       5:	07 03 00 00 08 00 00 00 r3 += 0x8
       6:	bf a1 00 00 00 00 00 00 r1 = r10
       7:	07 01 00 00 f0 ff ff ff r1 += -0x10
       8:	b7 02 00 00 08 00 00 00 r2 = 0x8
       9:	85 00 00 00 04 00 00 00 call 0x4
      10:	85 00 00 00 05 00 00 00 call 0x5
      11:	7b 0a e0 ff 00 00 00 00 *(u64 *)(r10 - 0x20) = r0
      12:	79 a1 f0 ff 00 00 00 00 r1 = *(u64 *)(r10 - 0x10)
      13:	7b 1a e8 ff 00 00 00 00 *(u64 *)(r10 - 0x18) = r1
      14:	bf a2 00 00 00 00 00 00 r2 = r10
      15:	07 02 00 00 f8 ff ff ff r2 += -0x8
      16:	bf a3 00 00 00 00 00 00 r3 = r10
      17:	07 03 00 00 e0 ff ff ff r3 += -0x20
      18:	18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
		0000000000000090:  R_BPF_64_64	my_map
      20:	b7 04 00 00 00 00 00 00 r4 = 0x0
      21:	85 00 00 00 02 00 00 00 call 0x2
      22:	b7 00 00 00 00 00 00 00 r0 = 0x0
      23:	95 00 00 00 00 00 00 00 exit
	     ........

Next, check the contents of other segments, such as maps and license:

wu@ubuntu:~/linux/samples/bpf$ llvm-objdump --section=maps  -s tracex4_kern.o
tracex4_kern.o: file format ELF64-BPF
Contents of section maps:
 0000 01000000 08000000 10000000 40420f00  ............@B..
 0010 00000000 00000000 00000000	   ............
wu@ubuntu:~/linux/samples/bpf$ llvm-objdump --section=license  -s tracex4_kern.o
tracex4_kern.o: file format ELF64-BPF
Contents of section license:
 0000 47504c00				   GPL.

The bytecode of the above bpf program cannot be used in x86_ It is directly executed on 64 platform. When loading bpf program, JIT(just in time) compiler needs to be used to translate bpf bytecode into sink code recognized by the host. However, for most opcodes, eBPF instruction set can be mapped with X86 or aarch64 instruction set one by one.

bpf program has customized a set of instructions, which are different from x86, ARM64 and so on. Moreover, the instruction set is not as rich as the two, and there is no floating-point calculation. However, the functions of registers are similar, as follows:

eBPF register	describe
R1-R5	Parameters passed into kernel function by eBPF program
R0	Return value of kernel function ebpf program exit value
R6-R9	It is used as data storage and follows the usage rules of the callee
R10	Stack frame

4. Trace and analyze eBPF program behavior through strace tool

Execute the following command to see the assigned address and time of the slab object:

wu@ubuntu:~/linux/samples/bpf$ sudo strace -v -f -s 128 -o tracex4.txt ./tracex4
obj 0xffff9637e3175cc0 is  1sec old was allocated at ip ffffffff99679a9a
obj 0xffff9637e31750c0 is  1sec old was allocated at ip ffffffff99679a9a
obj 0xffff9637e3175e00 is  1sec old was allocated at ip ffffffff99679a9a
obj 0xffff9637e3175780 is  1sec old was allocated at ip ffffffff99679a9a
... ...

The key system calls traced by strace are as follows:

execve("./tracex4", ["./tracex4"], ["LANG=en_US.UTF-8", "LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca"..., "TERM=xterm", "DISPLAY=localhost:12.0", "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin", "MAIL=/var/mail/root", "LOGNAME=root", "USER=root", "HOME=/root", "SHELL=/bin/bash", "SUDO_COMMAND=/usr/bin/strace -v -f -s 128 -o tracex4.txt ./tracex4", "SUDO_USER=wu", "SUDO_UID=1000", "SUDO_GID=1000"]) = 0
... ...
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH, key_size=8, value_size=16, max_entries=1000000, map_flags=0, inner_map_fd=0, map_name="my_map", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 112) = 4
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=9, insns=[{code=BPF_LDX|BPF_DW|BPF_MEM, dst_reg=BPF_REG_1, src_reg=BPF_REG_1, off=104, imm=0}, {code=BPF_STX|BPF_DW|BPF_MEM, dst_reg=BPF_REG_10, src_reg=BPF_REG_1, off=-8, imm=0}, {code=BPF_ALU64|BPF_X|BPF_MOV, dst_reg=BPF_REG_2, src_reg=BPF_REG_10, off=0, imm=0}, {code=BPF_ALU64|BPF_K|BPF_ADD, dst_reg=BPF_REG_2, src_reg=BPF_REG_0, off=0, imm=0xfffffff8}, {code=BPF_LD|BPF_DW|BPF_IMM, dst_reg=BPF_REG_1, src_reg=BPF_REG_1, off=0, imm=0x4}, {code=BPF_LD|BPF_W|BPF_IMM, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}, {code=BPF_JMP|BPF_K|BPF_CALL, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0x3}, {code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}, {code=BPF_JMP|BPF_K|BPF_EXIT, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}], license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(5, 4, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0}, 112) = 5
openat(AT_FDCWD, "/sys/kernel/debug/tracing/kprobe_events", O_WRONLY|O_APPEND) = 6
write(6, "p:kmem_cache_free kmem_cache_free", 33) = 33
close(6)
openat(AT_FDCWD, "/sys/kernel/debug/tracing/events/kprobes/kmem_cache_free/id", O_RDONLY) = 6
read(6, "2145\n", 256)		 = 5
close(6)
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0 /* PERF_ATTR_SIZE_??? */, config=2145, sample_period=1, sample_type=PERF_SAMPLE_RAW, read_format=0, disabled=0, inherit=0, pinned=0, exclusive=0, exclusive_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=0, exclude_host=0, exclude_guest=0, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, wakeup_events=1, config1=0}, -1, 0, -1, 0) = 6
... ...
openat(AT_FDCWD, "/sys/kernel/debug/tracing/kprobe_events", O_WRONLY|O_APPEND) = 8
write(8, "r:kmem_cache_alloc_node kmem_cache_alloc_node", 45) = 45
openat(AT_FDCWD, "/sys/kernel/debug/tracing/events/kprobes/kmem_cache_alloc_node/id", O_RDONLY) = 8
read(8, "2146\n", 256)		 = 5
close(8)			 = 0
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0 /* PERF_ATTR_SIZE_??? */, config=2146, sample_period=1, sample_type=PERF_SAMPLE_RAW, read_format=0, disabled=0, inherit=0, pinned=0, exclusive=0, exclusive_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=0, exclude_host=0, exclude_guest=0, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, wakeup_events=1, config1=0}, -1, 0, -1, 0) = 8
ioctl(8, PERF_EVENT_IOC_ENABLE, 0) = 0
ioctl(8, PERF_EVENT_IOC_SET_BPF, 7) = 0
write(1, "\33[1;1H\33[2J\0\0", 12) = 12
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=4, key=0x7ffffaf162b0, next_key=0x7ffffaf162b8}, 112) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=4, key=0x7ffffaf162b8, value=0x7ffffaf162d0, flags=BPF_ANY}, 112) = 0
... ...

4.1} BPF bytecode loading analysis

When the first parameter of BPF system is CMDS = BPF_ PROG_ When loading ， it means loading the file in ， ELF64-BPF ， format. Carefully analyze the second parameter ， attr. The prototype of this structure is as follows. It is found that the BPF bytecode is saved:

struct {    /* Used by BPF_PROG_LOAD */
	__u32	      prog_type;
	__u32	      insn_cnt;
	__aligned_u64 insns;	  /* 'const struct bpf_insn *' */
	__aligned_u64 license;	  /* 'const char *' */
	__u32	      log_level;  /* verbosity level of verifier */
	__u32	      log_size;   /* size of user buffer */
	__aligned_u64 log_buf;	  /* user supplied 'char *'
buffer */
	__u32	      kern_version;
				/* checked when prog_type=kprobe
(since Linux 4.1) */
};
 __attribute__((aligned(8)));

When the bpf program is loaded, BPF_PROG_LOAD ＾ represents the specific bpf instruction of the program, corresponding to ＾ bpf_prog1} This code snippet.

The instructions traced by strace are as follows. The opcode of each instruction consists of six parts:

Code (operation code)
dst_ Reg (destination register)
src_ Reg (source register)
Off (offset)
IMM (immediate)

See:

insns=[
{code=BPF_LDX|BPF_DW|BPF_MEM, dst_reg=BPF_REG_1, src_reg=BPF_REG_1, off=104, imm=0},
{code=BPF_STX|BPF_DW|BPF_MEM, dst_reg=BPF_REG_10, src_reg=BPF_REG_1, off=-8, imm=0},
{code=BPF_ALU64|BPF_X|BPF_MOV, dst_reg=BPF_REG_2, src_reg=BPF_REG_10, off=0, imm=0},
{code=BPF_ALU64|BPF_K|BPF_ADD, dst_reg=BPF_REG_2, src_reg=BPF_REG_0, off=0, imm=0xfffffff8},
{code=BPF_LD|BPF_DW|BPF_IMM, dst_reg=BPF_REG_1, src_reg=BPF_REG_1, off=0, imm=0x4},
{code=BPF_LD|BPF_W|BPF_IMM, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0},
{code=BPF_JMP|BPF_K|BPF_CALL, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0x3},
{code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0},
{code=BPF_JMP|BPF_K|BPF_EXIT, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}
]

The instruction format is shown in the figure below. BPF currently has 102 instructions, mainly including three categories:

ALU (64bit and 32bit)
Memory operation
Branch operation

The format of instruction is mainly composed of the following parts:

The lower 3 bits of opcode indicate the instruction type, BPF_LDX，BPF_REG_10. Wait for these macros in the kernel directory, \ tools / include / UAPI / Linux / BPF Definition in H:

#define 	BPF_LDX 	0x01
#define 	BPF_MEM 	0x60
#define 	BPF_DW		0x18
... ...

Take the first bpf instruction as an example:

{code=BPF_LDX|BPF_DW|BPF_MEM, dst_reg=BPF_REG_1, src_reg=BPF_REG_1, off=104, imm=0}

It exactly corresponds to the first instruction parsed by llvm objdump, which is a memory access instruction:

0:       79 11 68 00 00 00 00 00 r1 = *(u64 *)(r1 + 0x68)
BPF_LDX|BPF_MEM|BPF_DW=0x79

This instruction is defined in the kernel as:

/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
	((struct bpf_insn) {					\
		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
		.dst_reg = DST, 				\
		.src_reg = SRC, 				\
		.off   = OFF,					\
		.imm   = 0 })

4.2} MAP data communication analysis

When the first parameter of bpf system is CMDS = bpf_ MAP_ When creating ， it means creating a data map. Carefully analyze the second parameter ， attr. The prototype of this structure includes map type, key size, value e size, etc

union bpf_attr {
struct {    /* Used by BPF_MAP_CREATE */
	__u32	      map_type;
	__u32	      key_size;    /* size of key in bytes */
	__u32	      value_size;  /* size of value in bytes */
	__u32	      max_entries; /* maximum number of entries
in a map */
};

According to the log analysis captured by strace, the type of map is BPF_MAP_TYPE_HASH, the size of key is 8, the size of value is 16, and the access map is bpf_prog1 # the 5th bytecode, in which the imm immediate number is 4, representing map_fd, this is a pseudo instruction, which is a relocatable instruction:

bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH, key_size=8, value_size=16, max_entries=1000000, map_flags=0, inner_map_fd=0, map_name="my_map", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 112) = 4
 0000000000000020:  R_BPF_64_64  my_map
{code=BPF_LD|BPF_DW|BPF_IMM, dst_reg=BPF_REG_1, src_reg=BPF_REG_1, off=0, imm=0x4}

bpf programs can use bpf to access all types of maps_ map_ lookup_ Elem () and bpf_ map_ update_ The elem() function, socket maps, and some other extra maps are used for special purposes.

When the first parameter of bpf system is cmds=BPF_MAP_GET_NEXT_KEY or BPF_MAP_LOOKUP_ELEM ， means traversing the map. Carefully analyze the second parameter ， attr. The prototype of this structure contains the map_fd is the first parameter of bpf system call. cmd=BPF_MAP_CREATE return value, key value, value value, etc.:

struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
commands */
	__u32	      map_fd;
	__aligned_u64 key;
	union {
	__aligned_u64 value;
	__aligned_u64 next_key;
	};
	__u64	      flags;
};
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=4, key=0x7ffffaf162b0, next_key=0x7ffffaf162b8}, 112) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=4, key=0x7ffffaf162b8, value=0x7ffffaf162d0, flags=BPF_ANY}, 112) = 0

5. Introduction to bpftool usage

Bpftool can be used by compiling it with make under the directory of # tools/bpf/bpftool / in the kernel. Check the currently running bpf program, as shown below. You can see that the currently running bpf program is # kprobe event # and # map id:

wu@ubuntu:~/linux/samples/bpf$ sudo bpftool prog show
[sudo] password for wu:
... ...
205: kprobe  tag a6cfc4a29f52a193  gpl
	loaded_at 2021-01-19T11:51:26+0000  uid 0
	xlated 72B  jited 62B  memlock 4096B  map_ids 72
206: kprobe  tag d16c41919f3b767a  gpl
	loaded_at 2021-01-19T11:51:26+0000  uid 0
	xlated 192B  jited 119B  memlock 4096B	map_ids 72

By checking the id of the map, you can see that the hash map is currently used:

wu@ubuntu:~/linux$ sudo bpftool map show
72: hash  name my_map  flags 0x0
	key 8B	value 16B  max_entries 1000000	memlock 88788992B

View all the contents of the map and the value of the corresponding key:

wu@ubuntu:~/linux$ sudo bpftool map dump id 72
key: 80 35 43 e5 26 95 ff ff  value: b9 f3 f3 9e 87 6b 00 00  9a 9a c7 86 ff ff ff ff
key: 80 9c 5f f6 25 95 ff ff  value: 98 85 ac c7 8c 6b 00 00  9a 9a c7 86 ff ff ff ff
key: 00 4c 9b e4 25 95 ff ff  value: e6 8d c2 b7 7e 6b 00 00  9a 9a c7 86 ff ff ff ff
key: 80 21 15 f8 26 95 ff ff  value: 60 df 01 fc 5c 6b 00 00  9a 9a c7 86 ff ff ff ff
key: 80 f5 46 5f 26 95 ff ff  value: 5e e1 73 7b 8d 6b 00 00  9a 9a c7 86 ff ff ff ff
... ...
wu@ubuntu:~/linux$ sudo bpftool map lookup id 72 key 0x80 0x35 0x43 0xe5 0x26 0x95 0xff 0xff
key: 80 35 43 e5 26 95 ff ff  value: b9 f3 f3 9e 87 6b 00 00  9a 9a c7 86 ff ff ff ff

Programmer Think

Linux eBPF program composition and communication principle

1. Preface

2. eBPF framework

2.1 tracex4_kern.c

2.2 tracex4_user.c

3. Parse the target file through readelf and llvm objdump

3.1 # read ELF file header

3.2 # print the contents of each paragraph

3.3 # parsing BPF ELF format files through llvm objdump

4. Trace and analyze eBPF program behavior through strace tool

4.1} BPF bytecode loading analysis

4.2} MAP data communication analysis

5. Introduction to bpftool usage

6. References

Related reading

Hot Topics