ebpf instruction system

Posted by kayess2004 on Sun, 16 Jan 2022 14:03:31 +0100

abstract

This post walks through eBPF's instruction system and then reads an example program written directly in the eBPF instruction set.

ebpf instruction system

References: eBPF opcode encoding (Linux documentation) | Instruction Set (Cilium documentation)

BPF is a general-purpose RISC instruction set. It was originally designed so that programs could be written in a subset of C and compiled into BPF instructions by a compiler back end (such as LLVM); the kernel's JIT compiler can later translate those instructions into native opcodes to achieve the best execution performance inside the kernel.

register

eBPF has eleven 64-bit registers, a program counter, and a BPF stack of 512 bytes. The registers are named r0-r10. Operation is 64-bit by default. Each 64-bit register can also be used as a 32-bit subregister, which is accessible only through the 32-bit ALU (arithmetic logic unit) operations. Such an operation works on the low 32 bits and fills the high 32 bits of the result with zeros.
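The zero-extension rule for 32-bit subregisters can be sketched in plain C. This is only a user-space model under my own assumptions; the helper names `alu32_add` and `alu64_add` are hypothetical and are not kernel APIs.

```c
#include <stdint.h>

/* Hypothetical model of a 32-bit ALU add on a 64-bit register:
 * only the low 32 bits take part, and the upper 32 bits of the
 * result are zero. */
static inline uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	uint32_t lo = (uint32_t)dst + (uint32_t)src; /* wraps modulo 2^32 */

	return (uint64_t)lo; /* zero-extended to 64 bits */
}

/* The 64-bit counterpart, for comparison. */
static inline uint64_t alu64_add(uint64_t dst, uint64_t src)
{
	return dst + src;
}
```

For instance, adding 1 to a register whose low half is 0xffffffff gives 0 in the 32-bit model, while the 64-bit add carries into the upper half.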

The register usage conventions are as follows:

register	use
r0	Holds the exit value of the BPF program. The semantics of the exit value are defined by the program type. When execution returns to the kernel, the exit value is passed as a 32-bit value.
r1-r5	Hold the arguments passed from the BPF program to a kernel helper function. On program entry, r1 points to the program's context (for example, a network program can take the kernel representation of a network packet (skb) as its input parameter).
r6-r9	General-purpose registers (callee-saved, so their values survive helper calls).
r10	The only read-only register. It holds the frame pointer address used to access the BPF stack space.

Other: in the packet-access load instructions (BPF_LD with mode BPF_ABS or BPF_IND), register R6 is an implicit input and must contain a pointer to the sk_buff. Register R0 is an implicit output that contains the data fetched from the packet.

Instruction format

The code implementation is as follows:

struct bpf_insn {
	__u8	code;		/* opcode */
	__u8	dst_reg:4;	/* dest register */
	__u8	src_reg:4;	/* source register */
	__s16	off;		/* signed offset */
	__s32	imm;		/* signed immediate constant */
};
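Each instruction occupies exactly 8 bytes: 1 byte of opcode, 1 byte holding both register numbers, 2 bytes of offset, and 4 bytes of immediate. The following is a minimal sketch of that layout, assuming a little-endian host (where dst_reg sits in the low nibble of the second byte). The function `encode_insn` is a hypothetical helper written for this post, not something the kernel provides.

```c
#include <stdint.h>

/* Hypothetical helper: writes one instruction into 8 bytes using the
 * little-endian layout, with dst_reg in the low nibble of byte 1. */
static inline void encode_insn(uint8_t out[8], uint8_t code, uint8_t dst,
			       uint8_t src, int16_t off, int32_t imm)
{
	uint16_t uoff = (uint16_t)off;
	uint32_t uimm = (uint32_t)imm;

	out[0] = code;                                  /* opcode */
	out[1] = (uint8_t)((src << 4) | (dst & 0x0f));  /* src:4 | dst:4 */
	out[2] = (uint8_t)(uoff & 0xff);                /* off, low byte */
	out[3] = (uint8_t)(uoff >> 8);                  /* off, high byte */
	out[4] = (uint8_t)(uimm & 0xff);                /* imm, byte 0 */
	out[5] = (uint8_t)((uimm >> 8) & 0xff);
	out[6] = (uint8_t)((uimm >> 16) & 0xff);
	out[7] = (uint8_t)(uimm >> 24);                 /* imm, byte 3 */
}
```

With code 0xbf (BPF_ALU64 | BPF_MOV | BPF_X), dst 6 and src 1, the first two bytes come out as bf 16, followed by six zero bytes.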

Instruction type

The op field (the code member of struct bpf_insn) is laid out as follows:

+-------------------------+--------------------+
|       5 bits            |   3 bits           |
|       xxxxx             | instruction class  |
+-------------------------+--------------------+
(MSB)                                      (LSB)

The lower 3 bits of the op field determine the instruction class. The classes cover load and store instructions, arithmetic (ALU) instructions, and jump instructions. (By the way: a word in eBPF is four bytes, that is, 32 bits.)

cBPF class	eBPF class
BPF_LD 0x00	BPF_LD 0x00
BPF_LDX 0x01	BPF_LDX 0x01
BPF_ST 0x02	BPF_ST 0x02
BPF_STX 0x03	BPF_STX 0x03
BPF_ALU 0x04	BPF_ALU 0x04
BPF_JMP 0x05	BPF_JMP 0x05
BPF_RET 0x06	BPF_JMP32 0x06
BPF_MISC 0x07	BPF_ALU64 0x07

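BPF_LD, BPF_LDX: both classes are used for load operations. BPF_LD is used to load a double word (a 64-bit immediate, encoded as a special instruction that spans two instruction slots) and for byte, half-word, and word loads of packet data. The latter were inherited from cBPF, mainly to keep the conversion from cBPF to eBPF efficient, since these loads have optimized JIT code.
BPF_ST, BPF_STX: both classes are used for store operations. BPF_STX stores the data from a register to memory; BPF_ST stores an immediate value to memory.
BPF_ALU, BPF_ALU64: ALU operations in 32-bit and 64-bit mode, respectively.
BPF_JMP, BPF_JMP32: jump instructions. BPF_JMP compares 64-bit operands; BPF_JMP32 compares only the low 32 bits (one word) of its operands.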
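Extracting the class from an opcode byte is a single mask. A small sketch, using the same 0x07 mask as the kernel's BPF_CLASS() macro; the `class_name` lookup table is a hypothetical convenience added here for illustration.

```c
#include <stdint.h>

/* Same mask as the kernel's BPF_CLASS() macro. */
static inline uint8_t insn_class(uint8_t code)
{
	return code & 0x07;
}

/* Hypothetical helper: eBPF class names indexed by the 3-bit class value. */
static inline const char *class_name(uint8_t code)
{
	static const char *const names[8] = {
		"BPF_LD", "BPF_LDX", "BPF_ST", "BPF_STX",
		"BPF_ALU", "BPF_JMP", "BPF_JMP32", "BPF_ALU64",
	};

	return names[insn_class(code)];
}
```

For example, the opcode byte 0xbf has class 0x07 (BPF_ALU64) and 0x85 has class 0x05 (BPF_JMP).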

Operation and jump instructions

When BPF_CLASS(code) == BPF_ALU or BPF_JMP, the op field can be divided into three parts, as shown below:

+----------------+--------+--------------------+
|   4 bits       |  1 bit |   3 bits           |
| operation code | source | instruction class  |
+----------------+--------+--------------------+
(MSB)                                      (LSB)

The source bit (bit 3 of the op field, mask 0x08) can be 0 or 1. Linux defines the following macros for it:

BPF_K     0x00
BPF_X     0x08
// #define BPF_CLASS(code) ((code) & 0x07)

In eBPF, this means:

BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
// #define BPF_SRC(code)   ((code) & 0x08)

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [in eBPF], BPF_OP(code) is one of the following:

BPF_ADD   0x00
BPF_SUB   0x10
BPF_MUL   0x20
BPF_DIV   0x30
BPF_OR    0x40
BPF_AND   0x50
BPF_LSH   0x60
BPF_RSH   0x70
BPF_NEG   0x80
BPF_MOD   0x90
BPF_XOR   0xa0
BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [in eBPF], BPF_OP(code) is one of the following:

BPF_JA    0x00  /* BPF_JMP only */
BPF_JEQ   0x10
BPF_JGT   0x20
BPF_JGE   0x30
BPF_JSET  0x40
BPF_JNE   0x50  /* eBPF only: jump != */
BPF_JSGT  0x60  /* eBPF only: signed '>' */
BPF_JSGE  0x70  /* eBPF only: signed '>=' */
BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
BPF_JSLT  0xc0  /* eBPF only: signed '<' */
BPF_JSLE  0xd0  /* eBPF only: signed '<=' */
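Putting the three parts together, an ALU or jump opcode byte can be split with the same masks the kernel macros BPF_OP(), BPF_SRC() and BPF_CLASS() use. The struct and function names below are hypothetical; this is only a sketch for checking encodings by hand.

```c
#include <stdint.h>

/* Hypothetical container for the three parts of an ALU/JMP opcode byte. */
struct alu_jmp_fields {
	uint8_t op;  /* upper 4 bits, e.g. 0xb0 = BPF_MOV, 0x10 = BPF_JEQ */
	uint8_t src; /* 0x08 = BPF_X (register), 0x00 = BPF_K (immediate) */
	uint8_t cls; /* lower 3 bits, the instruction class */
};

static inline struct alu_jmp_fields split_alu_jmp(uint8_t code)
{
	struct alu_jmp_fields f = {
		.op  = code & 0xf0, /* BPF_OP(code) */
		.src = code & 0x08, /* BPF_SRC(code) */
		.cls = code & 0x07, /* BPF_CLASS(code) */
	};

	return f;
}
```

Splitting 0xbf yields op 0xb0 (BPF_MOV), src 0x08 (BPF_X) and class 0x07 (BPF_ALU64); splitting 0x15 yields op 0x10 (BPF_JEQ), src 0x00 (BPF_K) and class 0x05 (BPF_JMP).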

Load and store instructions

When BPF_CLASS(code) is one of the load and store classes (BPF_LD, BPF_LDX, BPF_ST, BPF_STX), the op field can be divided into three parts, as shown below:

+--------+--------+-------------------+
| 3 bits | 2 bits |   3 bits          |
|  mode  |  size  | instruction class |
+--------+--------+-------------------+
(MSB)                             (LSB)

The size in linux has the following macro definitions:

BPF_W   0x00    /* word=4 byte */
BPF_H   0x08    /* half word */
BPF_B   0x10    /* byte */
BPF_DW  0x18    /* eBPF only, double word */

mode in linux has the following macro definitions:

BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
BPF_ABS     0x20
BPF_IND     0x40
BPF_MEM     0x60
BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
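The load/store opcode byte can be taken apart the same way. A minimal sketch using the masks of the kernel's BPF_SIZE() and BPF_MODE() macros; `ldst_width_bytes` is a hypothetical helper that turns the size bits into a byte count.

```c
#include <stdint.h>

/* Masks as in the kernel's BPF_SIZE() and BPF_MODE() macros. */
static inline uint8_t ldst_size(uint8_t code) { return code & 0x18; }
static inline uint8_t ldst_mode(uint8_t code) { return code & 0xe0; }

/* Hypothetical helper: access width in bytes for a load/store opcode. */
static inline int ldst_width_bytes(uint8_t code)
{
	switch (ldst_size(code)) {
	case 0x00: return 4; /* BPF_W */
	case 0x08: return 2; /* BPF_H */
	case 0x10: return 1; /* BPF_B */
	default:   return 8; /* BPF_DW, 0x18 */
	}
}
```

For example, 0x30 (BPF_LD | BPF_B | BPF_ABS) decodes to mode 0x20 with a 1-byte access, and 0x63 (BPF_STX | BPF_W | BPF_MEM) decodes to mode 0x60 with a 4-byte access.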

ebpf instruction set programming

There are three ways of eBPF programming: BPF instruction set programming, BPF C programming, and BPF front end (BCC, bpftrace).

To demonstrate instructions, we read a piece of code programmed in instruction set mode.

code

Code source: samples/bpf/sock_example.c

Code logic:

main() calls popen(), which starts a process by creating a pipe, forking, and invoking the shell. Because a pipe is unidirectional by definition, the type argument can request either reading or writing, not both.

Here it pings the local host five times over IPv4; the output could be read from the pipe.
Create a map of type BPF_MAP_TYPE_ARRAY.
Write the program in the eBPF instruction set and store the instructions in prog. The next section reads these instructions one by one.
Load the instructions into the kernel with the program type BPF_PROG_TYPE_SOCKET_FILTER.
open_raw_sock returns a raw socket. Through this socket, the link-layer data of the lo network interface can be read directly.
Attach the eBPF program to the socket (where it is used as a filter).
Print some of the information in the map.

/* eBPF example program:
 * - creates arraymap in kernel with key 4 bytes and value 8 bytes
 *
 * - loads eBPF program:
 *   r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)];
 *   *(u32*)(fp - 4) = r0;
 *   // assuming packet is IPv4, lookup ip->proto in a map
 *   value = bpf_map_lookup_elem(map_fd, fp - 4);
 *   if (value)
 *        (*(u64*)value) += 1;
 *
 * - attaches this program to loopback interface "lo" raw socket
 *
 * - every second user space reads map[tcp], map[udp], map[icmp] to see
 *   how many packets of given protocol were seen on "lo"
 */
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <stddef.h>
#include <bpf/bpf.h>
#include "bpf_insn.h"
#include "sock_example.h"

char bpf_log_buf[BPF_LOG_BUF_SIZE];

static int test_sock(void)
{
	int sock = -1, map_fd, prog_fd, i, key;
	long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;

	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),
				256, 0);
	if (map_fd < 0) {
		printf("failed to create map '%s'\n", strerror(errno));
		goto cleanup;
	}

	struct bpf_insn prog[] = {
		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
		BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
		BPF_EXIT_INSN(),
	};
	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);

	prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog, insns_cnt,
				   "GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE);
	if (prog_fd < 0) {
		printf("failed to load prog '%s'\n", strerror(errno));
		goto cleanup;
	}

	sock = open_raw_sock("lo");

	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
		       sizeof(prog_fd)) < 0) {
		printf("setsockopt %s\n", strerror(errno));
		goto cleanup;
	}

	for (i = 0; i < 10; i++) {
		key = IPPROTO_TCP;
		assert(bpf_map_lookup_elem(map_fd, &key, &tcp_cnt) == 0);

		key = IPPROTO_UDP;
		assert(bpf_map_lookup_elem(map_fd, &key, &udp_cnt) == 0);

		key = IPPROTO_ICMP;
		assert(bpf_map_lookup_elem(map_fd, &key, &icmp_cnt) == 0);

		printf("TCP %lld UDP %lld ICMP %lld packets\n",
		       tcp_cnt, udp_cnt, icmp_cnt);
		sleep(1);
	}

cleanup:
	/* maps, programs, raw sockets will auto cleanup on process exit */
	return 0;
}

int main(void)
{
	FILE *f;

	f = popen("ping -4 -c5 localhost", "r");
	(void)f; /* marks f as deliberately unused; the ping output is never read */

	return test_sock();
}

eBPF instruction programming code reading

Let's take out this part of the code and read it separately.

	struct bpf_insn prog[] = {
		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* R6 = R1*/
		BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
		BPF_EXIT_INSN(),
	};

BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),

/* Short form of mov, dst_reg = src_reg */

#define BPF_MOV64_REG(DST, SRC)					\
	((struct bpf_insn) {					\
		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = 0 })

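As you can see, this instruction copies the value of the source register R1 into R6. On entry, R1 points to the program's context, which for a socket filter is the kernel representation of the packet (skb). The copy is needed because the next instruction, BPF_LD_ABS, takes R6 as its implicit input.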

BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
```
/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */

#define BPF_LD_ABS(SIZE, IMM)					\
	((struct bpf_insn) {					\
		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
		.dst_reg = 0,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = IMM })
```
For BPF_LD_ABS, register R6 is an implicit input and register R0 is an implicit output. That is why the macro sets dst_reg and src_reg to 0: this instruction does not use them.

To understand the format of the data packet, refer to: Introduction to MAC header, IP header and TCP header

Read the IP protocol number at this offset. For example, the protocol number of TCP is 6, that of UDP is 17, and that of ICMP is 1. The protocol field is 8 bits wide, hence the BPF_B (byte) load.
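In user-space terms, the byte load at that offset looks roughly like the sketch below. It assumes an Ethernet frame carrying an IPv4 header; the constants restate ETH_HLEN (14) and offsetof(struct iphdr, protocol) (9) so the example compiles without kernel headers, and `ip_proto_from_frame` is a hypothetical stand-in, not what the kernel runs.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN        14 /* same value as ETH_HLEN */
#define IPV4_PROTO_OFFSET   9 /* offsetof(struct iphdr, protocol) */

/* Hypothetical user-space stand-in for the BPF_LD_ABS byte load:
 * returns the IPv4 protocol number, or -1 if the frame is too short. */
static inline int ip_proto_from_frame(const uint8_t *frame, size_t len)
{
	size_t off = ETH_HDR_LEN + IPV4_PROTO_OFFSET; /* byte 23 of the frame */

	if (len <= off)
		return -1;
	return frame[off];
}
```

A frame whose byte 23 is 6 would be counted as TCP, 17 as UDP, and 1 as ICMP.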

Therefore, this instruction places the IP protocol number in the R0 register.

BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
```
/* Memory store, *(uint *) (dst_reg + off16) = src_reg */

#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
	((struct bpf_insn) {					\
		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = OFF,					\
		.imm   = 0 })
```
R10 is the only read-only register; it holds the frame pointer address used to access the BPF stack space. (For the stack frame structure, please refer to: gdb debug stack frame information)

So this instruction saves the contents of the R0 register (the protocol number loaded in the previous step) to the stack. Note the BPF_W size: only the low 32 bits of R0 are stored.

BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),

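The stack grows downward, so the slot just written lies at fp - 4. These two instructions make R2 point to that slot. R2 is the second argument of the helper call that follows, that is, the address of the lookup key.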
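The store at fp - 4 and the pointer placed in R2 can be modelled with an ordinary byte array. This is a rough sketch under my own assumptions: `struct stack_model` and `store_word` are hypothetical names, and the real BPF stack is managed by the kernel, not by user code.

```c
#include <stdint.h>
#include <string.h>

#define BPF_STACK_BYTES 512

/* Hypothetical model of the BPF stack: fp (r10) points just past the end
 * of the buffer, and valid slots live at negative offsets from it. */
struct stack_model {
	uint8_t mem[BPF_STACK_BYTES];
};

/* Models BPF_STX_MEM(BPF_W, r10, src, off): stores the low 32 bits of src
 * at fp + off. Returns the slot's address, which is what r2 = fp + off
 * would hold. */
static inline uint8_t *store_word(struct stack_model *s, int off, uint64_t src)
{
	uint8_t *fp = s->mem + BPF_STACK_BYTES;
	uint32_t low = (uint32_t)src; /* BPF_W keeps only the low 32 bits */

	memcpy(fp + off, &low, sizeof(low));
	return fp + off;
}
```

Calling `store_word(&s, -4, r0)` writes the last four bytes of the 512-byte area and returns their address.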

The expansion of the BPF_ALU64_IMM macro is not listed here. You can look it up in samples/bpf/bpf_insn.h.

The numeric values these macros expand to can be found in include/uapi/linux/bpf.h.

In this way, each of the instructions above expands to a single 64-bit binary number. Isn't that amazing~

BPF_LD_MAP_FD(BPF_REG_1, map_fd),

This instruction is more interesting. Let's have a look.

/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
#define BPF_LD_IMM64(DST, IMM)					\
	BPF_LD_IMM64_RAW(DST, 0, IMM)

#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
	((struct bpf_insn) {					\
		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = (__u32) (IMM) }),			\
	((struct bpf_insn) {					\
		.code  = 0, /* zero is reserved opcode */	\
		.dst_reg = 0,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = ((__u64) (IMM)) >> 32 })

#ifndef BPF_PSEUDO_MAP_FD
# define BPF_PSEUDO_MAP_FD	1
#endif

/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
#define BPF_LD_MAP_FD(DST, MAP_FD)				\
	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)

As you can see, this instruction stores the value of map_fd in the R1 register. At this point you may wonder what the src_reg in the middle is for.

As shown above, if an immediate is simply loaded into a register, src_reg = 0; if the immediate represents a map_fd, src_reg = 1.

This lets the kernel tell whether the immediate in the instruction represents a map_fd. The replace_map_fd_with_map_ptr function, which runs later, relies on this property.

As an aside, the code = 0 of the second slot has the same numeric value as BPF_LD | BPF_W | BPF_IMM, since all three macros are 0. It does not stand for an instruction of its own: zero is a reserved opcode, and the second slot only carries the upper 32 bits of the immediate.
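The way BPF_LD_IMM64_RAW spreads one 64-bit immediate over the imm fields of two instruction slots can be sketched as follows. The helper names `split_imm64` and `join_imm64` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper mirroring how BPF_LD_IMM64_RAW fills the two
 * 32-bit imm fields of its instruction pair. */
static inline void split_imm64(uint64_t imm, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)imm;         /* first slot:  .imm = (__u32) (IMM) */
	*hi = (uint32_t)(imm >> 32); /* second slot: .imm = ((__u64) (IMM)) >> 32 */
}

/* Reassembles the 64-bit value from the two halves. */
static inline uint64_t join_imm64(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

For a small value such as a file descriptor, the high half is simply 0 and the whole value sits in the first slot.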

BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
```
/* Raw code statement block */

#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
	((struct bpf_insn) {					\
		.code  = CODE,					\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = OFF,					\
		.imm   = IMM })
```
The BPF_FUNC_map_lookup_elem macro expands to 1. How the JIT later turns this 1 into a jump to the actual bpf_map_lookup_elem function is a question for a follow-up.

For now, the macro name tells us that this instruction calls the bpf_map_lookup_elem function.

BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
```
/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */

#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
	((struct bpf_insn) {					\
		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
		.dst_reg = DST,					\
		.src_reg = 0,					\
		.off   = OFF,					\
		.imm   = IMM })
```
This instruction means: if the R0 register equals 0, skip the next two instructions.
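The jump offset is counted from the instruction that follows the jump, in 8-byte instruction slots. A small sketch of that arithmetic, where `next_pc` is a hypothetical helper; the slot numbers in the comment assume that BPF_LD_MAP_FD occupies two slots, as its macro expansion shows.

```c
/* Slot indices of prog[] (BPF_LD_MAP_FD takes two slots):
 *   0 mov r6, r1      1 ld_abs         2 stx [fp-4], r0
 *   3 mov r2, r10     4 add r2, -4     5-6 ld_imm64 r1, map_fd
 *   7 call            8 jeq r0, 0, +2  9 mov r1, 1
 *  10 xadd           11 mov r0, 0     12 exit
 */

/* Hypothetical helper: index of the instruction executed after a
 * conditional jump at index pc with offset off. */
static inline int next_pc(int pc, int off, int taken)
{
	return pc + 1 + (taken ? off : 0);
}
```

With the jump at slot 8 and an offset of 2, a taken jump lands on slot 11 (r0 = 0), skipping slots 9 and 10.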

After the call, R0 no longer holds the protocol number. It holds the return value of bpf_map_lookup_elem, which is a pointer to the map value, or 0 (NULL) if the lookup found nothing. So the jump skips the counter update when there is no value to update, matching the if (value) check in the comment at the top of the source file.

BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */

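xadd here is an atomic add to memory: *(u64 *)(r0 + 0) += r1. R0 holds the pointer to the map value and R1 = 1, so the 64-bit counter stored in the map for this protocol number is incremented by one. Neither register is modified.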
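The combined effect of the null check and the xadd can be approximated in user space with C11 atomics. This is only an analogue under my own assumptions: `count_packet` is a hypothetical name, and the kernel executes the BPF instruction itself, not this function.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space analogue of
 *   if (value) xadd *(u64 *)value += 1;
 * value is NULL when the map lookup found no element. */
static inline void count_packet(_Atomic uint64_t *value)
{
	if (value != NULL)
		atomic_fetch_add_explicit(value, 1, memory_order_relaxed);
}
```

The atomic add matters because several CPUs may run the filter at the same time and update the same map entry.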

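Earlier, R1 held map_fd as the first argument of bpf_map_lookup_elem. How does the kernel know that the value is a map_fd? As described above, the BPF_LD_IMM64 instruction that loaded it is marked with src_reg = BPF_PSEUDO_MAP_FD. After the call, R1 is free again, which is why it can be reused here to hold the constant 1.

BPF_MOV64_IMM(BPF_REG_0, 0),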

R0 is the register that holds the BPF program's exit value. This instruction sets the return value: R0 = 0.

BPF_EXIT_INSN()

/* Program exit */

#define BPF_EXIT_INSN()						\
	((struct bpf_insn) {					\
		.code  = BPF_JMP | BPF_EXIT,			\
		.dst_reg = 0,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = 0 })

Run this program

If you want to run this program, you can pull down the source code and compile it.

Pull the source code corresponding to the current linux kernel version. You can refer to: How to get the source code from ubuntu

sudo apt source linux

Then compile the bpf programs under the samples/bpf directory. You can refer to: Run the first bpf program

make M=samples/bpf

Run the program; the output is as follows. (PS: my lo interface is also carrying browser traffic, which explains the TCP counts. The ICMP counter grows by four per ping rather than by one.)

➜  bpf sudo ./sock_example
TCP 0 UDP 0 ICMP 0 packets
TCP 28 UDP 0 ICMP 4 packets
TCP 60 UDP 0 ICMP 4 packets
TCP 100 UDP 0 ICMP 8 packets
TCP 134 UDP 0 ICMP 12 packets
TCP 166 UDP 0 ICMP 16 packets
TCP 228 UDP 0 ICMP 16 packets
TCP 302 UDP 0 ICMP 16 packets
TCP 334 UDP 0 ICMP 16 packets
TCP 366 UDP 0 ICMP 16 packets

Topics: bpf
