Abstract
Understand the eBPF instruction set, then read an example program written directly in eBPF instructions.
eBPF instruction system
References: eBPF opcode encoding (Linux documentation) | Instruction Set (Cilium documentation)
BPF is a general-purpose RISC instruction set. It was originally designed so that programs could be written in a subset of C and compiled into BPF instructions by a compiler back end (such as LLVM). The kernel can then map those instructions through an in-kernel JIT compiler into native opcodes, to get the best execution performance inside the kernel.
Registers
eBPF has eleven 64-bit registers, a program counter, and a 512-byte BPF stack. The registers are named r0 to r10. Operations are 64-bit by default. Each 64-bit register also has a 32-bit subregister, which can only be accessed through special 32-bit ALU (arithmetic logic unit) operations. Such an operation uses the low 32 bits, and when it writes its result the high 32 bits are filled with zeros.
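The difference between a 64-bit operation and a 32-bit subregister operation can be sketched in plain C. This is only a model of the rule described above, not kernel code; the function names `alu64_add` and `alu32_add` are made up for illustration.

```c
#include <stdint.h>

/* 64-bit ALU add: works on the whole register. The 32-bit immediate
 * is sign-extended to 64 bits before it is added. */
uint64_t alu64_add(uint64_t dst, int32_t imm)
{
    return dst + (uint64_t)(int64_t)imm;
}

/* 32-bit ALU add: only the low 32 bits take part, and the upper
 * 32 bits of the result are zero. */
uint64_t alu32_add(uint64_t dst, int32_t imm)
{
    uint32_t low = (uint32_t)dst + (uint32_t)imm;
    return (uint64_t)low;
}
```

With a register holding 0x1FFFFFFFF, adding 1 gives 0x200000000 in 64-bit mode, but 0 in 32-bit mode, because the carry out of bit 31 is discarded and the upper half is cleared.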
The register usage conventions are as follows:
Register | Use |
---|---|
r0 | Holds the return value of a helper function call and the exit value of the BPF program. The semantics of the exit value are defined by the program type. When execution returns to the kernel, the exit value is passed as a 32-bit value. |
r1-r5 | Hold the arguments passed from the BPF program to a kernel helper function. On program entry, r1 points to the program's context (for example, a network program receives the kernel representation of a network packet, the skb, as its input argument). |
r6-r9 | General-purpose registers; their values are preserved across helper function calls. |
r10 | The only read-only register. It contains the frame pointer address used to access the BPF stack space. |
Note: in the packet-access load instructions (BPF_LD with BPF_ABS or BPF_IND mode), register R6 is an implicit input and must contain a pointer to the sk_buff. Register R0 is an implicit output that contains the data fetched from the packet.
Instruction format
The code implementation is as follows:
    struct bpf_insn {
        __u8  code;       /* opcode */
        __u8  dst_reg:4;  /* dest register */
        __u8  src_reg:4;  /* source register */
        __s16 off;        /* signed offset */
        __s32 imm;        /* signed immediate constant */
    };
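The five fields add up to exactly 64 bits (8 + 4 + 4 + 16 + 32). As a rough sketch, they can be packed by hand into one 64-bit number, assuming the little-endian layout in which the opcode occupies the lowest byte. The helper name `bpf_pack` is hypothetical and is not part of any kernel header.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the fields of one instruction into its 64-bit encoding.
 * Assumed layout (little-endian host), from LSB to MSB:
 * opcode:8, dst_reg:4, src_reg:4, off:16, imm:32. */
uint64_t bpf_pack(uint8_t code, uint8_t dst, uint8_t src,
                  int16_t off, int32_t imm)
{
    assert(dst < 16 && src < 16);   /* each register field is 4 bits wide */
    return (uint64_t)code
         | (uint64_t)dst << 8
         | (uint64_t)src << 12
         | (uint64_t)(uint16_t)off << 16
         | (uint64_t)(uint32_t)imm << 32;
}
```

For example, `bpf_pack(0xbf, 6, 1, 0, 0)` encodes `r6 = r1` (opcode 0xbf is BPF_ALU64 | BPF_MOV | BPF_X) and yields 0x16bf.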
Instruction type
The 8-bit code field (the opcode) is laid out as follows:
    +----------------+--------------------+
    |     5 bits     |       3 bits       |
    |     xxxxx      | instruction class  |
    +----------------+--------------------+
    (MSB)                             (LSB)
The lower 3 bits of the code field determine the instruction class. The classes cover load and store instructions, ALU (arithmetic) instructions, and jump instructions. (By the way, a word in eBPF is four bytes, that is, 32 bits.)
cBPF class | eBPF class |
---|---|
BPF_LD 0x00 | BPF_LD 0x00 |
BPF_LDX 0x01 | BPF_LDX 0x01 |
BPF_ST 0x02 | BPF_ST 0x02 |
BPF_STX 0x03 | BPF_STX 0x03 |
BPF_ALU 0x04 | BPF_ALU 0x04 |
BPF_JMP 0x05 | BPF_JMP 0x05 |
BPF_RET 0x06 | BPF_JMP32 0x06 |
BPF_MISC 0x07 | BPF_ALU64 0x07 |
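- BPF_LD, BPF_LDX: both classes are used for load operations. BPF_LD loads a double word (a 64-bit immediate, as a special instruction that spans two instruction slots) and also performs byte, half-word, and word loads of packet data. The packet loads are inherited from cBPF, mainly to keep the conversion from cBPF to eBPF efficient, because they have optimized JIT code. BPF_LDX loads from memory into a register.
- BPF_ST, BPF_STX: both classes are used for store operations. BPF_STX stores the value of a register into memory; BPF_ST stores an immediate value into memory.
- BPF_ALU, BPF_ALU64: ALU operations in 32-bit and 64-bit mode, respectively.
- BPF_JMP, BPF_JMP32: jump instructions. BPF_JMP compares the full 64-bit operands, while BPF_JMP32 compares only their lower 32 bits (one word). The jump offset is encoded the same way in both classes.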
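To make the class table concrete, here is a minimal sketch that maps an opcode byte to the name of its eBPF instruction class. `BPF_CLASS` is redefined locally with the same mask the kernel uses, so the snippet compiles without kernel headers; `bpf_class_name` is a hypothetical helper.

```c
#include <stdint.h>

/* Same mask as the kernel's BPF_CLASS(): the low 3 bits of the opcode. */
#define BPF_CLASS(code) ((code) & 0x07)

/* Return the eBPF class name for an opcode byte. */
const char *bpf_class_name(uint8_t code)
{
    static const char *const names[8] = {
        "BPF_LD", "BPF_LDX", "BPF_ST",    "BPF_STX",
        "BPF_ALU", "BPF_JMP", "BPF_JMP32", "BPF_ALU64",
    };
    return names[BPF_CLASS(code)];
}
```

For instance, opcode 0xbf has its low three bits equal to 7, so it belongs to BPF_ALU64, and opcode 0x85 (low bits 5) belongs to BPF_JMP.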
Operation and jump instructions
When BPF_CLASS(code) == BPF_ALU or BPF_JMP, the code field is divided into three parts, as shown below:
    +----------------+--------+--------------------+
    |     4 bits     | 1 bit  |       3 bits       |
    | operation code | source | instruction class  |
    +----------------+--------+--------------------+
    (MSB)                                      (LSB)
The source bit (the fourth bit counting from the LSB, mask 0x08) can be 0 or 1. Linux defines the following macros for it:
    BPF_K     0x00
    BPF_X     0x08
    // #define BPF_CLASS(code) ((code) & 0x07)
In eBPF, this means:
    BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
    // #define BPF_SRC(code)   ((code) & 0x08)
- If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [in eBPF], BPF_OP(code) is one of the following:
    BPF_ADD   0x00
    BPF_SUB   0x10
    BPF_MUL   0x20
    BPF_DIV   0x30
    BPF_OR    0x40
    BPF_AND   0x50
    BPF_LSH   0x60
    BPF_RSH   0x70
    BPF_NEG   0x80
    BPF_MOD   0x90
    BPF_XOR   0xa0
    BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
    BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
    BPF_END   0xd0  /* eBPF only: endianness conversion */
- If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [in eBPF], BPF_OP(code) is one of the following:
    BPF_JA    0x00  /* BPF_JMP only */
    BPF_JEQ   0x10
    BPF_JGT   0x20
    BPF_JGE   0x30
    BPF_JSET  0x40
    BPF_JNE   0x50  /* eBPF only: jump != */
    BPF_JSGT  0x60  /* eBPF only: signed '>' */
    BPF_JSGE  0x70  /* eBPF only: signed '>=' */
    BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
    BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
    BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
    BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
    BPF_JSLT  0xc0  /* eBPF only: signed '<' */
    BPF_JSLE  0xd0  /* eBPF only: signed '<=' */
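As a small worked example of the three-part layout, an ALU or jump opcode is simply the bitwise OR of an operation, a source bit, and a class. The constants below are copied from the lists above so the snippet compiles without kernel headers; `uses_src_reg` is a hypothetical helper added for illustration.

```c
#include <stdint.h>

/* Values mirrored from the tables above. */
#define BPF_JMP   0x05
#define BPF_ALU64 0x07
#define BPF_K     0x00
#define BPF_X     0x08
#define BPF_ADD   0x00
#define BPF_JEQ   0x10
#define BPF_MOV   0xb0

#define BPF_CLASS(code) ((code) & 0x07)   /* low 3 bits  */
#define BPF_SRC(code)   ((code) & 0x08)   /* source bit  */
#define BPF_OP(code)    ((code) & 0xf0)   /* high 4 bits */

/* Returns 1 if the opcode takes its second operand from src_reg,
 * 0 if it uses the 32-bit immediate. */
int uses_src_reg(uint8_t code)
{
    return BPF_SRC(code) == BPF_X;
}
```

`BPF_ALU64 | BPF_MOV | BPF_X` evaluates to 0xbf (a 64-bit register-to-register move), and `BPF_JMP | BPF_JEQ | BPF_K` evaluates to 0x15 (jump if equal to an immediate).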
Load and store instructions
When BPF_CLASS(code) is a load or store class (BPF_LD, BPF_LDX, BPF_ST, or BPF_STX), the code field is divided into three parts, as shown below:
    +--------+--------+-------------------+
    | 3 bits | 2 bits |      3 bits       |
    |  mode  |  size  | instruction class |
    +--------+--------+-------------------+
    (MSB)                             (LSB)
Linux defines the following macros for the size:
    BPF_W   0x00    /* word = 4 bytes */
    BPF_H   0x08    /* half word */
    BPF_B   0x10    /* byte */
    BPF_DW  0x18    /* eBPF only, double word */
Linux defines the following macros for the mode:
    BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
    BPF_ABS     0x20
    BPF_IND     0x40
    BPF_MEM     0x60
    BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
    BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
    BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
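The size and mode fields can be extracted with two masks. The sketch below mirrors the size values from the list above and uses the masks 0x18 and 0xe0, which follow directly from the bit layout; `bpf_access_bytes` is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

#define BPF_SIZE(code) ((code) & 0x18)   /* the 2 size bits */
#define BPF_MODE(code) ((code) & 0xe0)   /* the 3 mode bits */

#define BPF_W  0x00
#define BPF_H  0x08
#define BPF_B  0x10
#define BPF_DW 0x18

/* Number of bytes that a load/store opcode transfers. */
int bpf_access_bytes(uint8_t code)
{
    switch (BPF_SIZE(code)) {
    case BPF_W: return 4;
    case BPF_H: return 2;
    case BPF_B: return 1;
    default:    return 8;   /* BPF_DW */
    }
}
```

For example, 0x30 (BPF_LD | BPF_B | BPF_ABS) transfers 1 byte in mode 0x20, and 0x63 (BPF_STX | BPF_W | BPF_MEM) transfers 4 bytes in mode 0x60.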
eBPF instruction set programming
There are three ways to write eBPF programs: directly in the BPF instruction set, in BPF C, and through a BPF front end (BCC, bpftrace).
To demonstrate the instructions, we read a piece of code written directly in the instruction set.
Code
Code source: samples/bpf/sock_example.c
Code logic:
- popen() creates a pipe, forks, and invokes the shell to run a command. Because a pipe is one-way by definition, the type argument can specify only reading or only writing, not both. Here it pings the local host 5 times over IPv4, and the output could be read from the returned stream.
- Create a map of type BPF_MAP_TYPE_ARRAY.
- Write the program in the eBPF instruction set; the instructions are stored in prog. The next section reads through these instructions.
- Load the instructions into the kernel with the program type BPF_PROG_TYPE_SOCKET_FILTER.
- open_raw_sock returns a raw socket. Through this socket, the link-layer data of the lo network interface can be read directly.
- Attach the eBPF program to the socket (used as a filter).
- Print some information from the map.
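The popen() step from the list above can be tried on its own. The following is a minimal sketch that actually reads from the pipe, which the sample program never does; `first_output_line` is a hypothetical helper, and the command passed to it is up to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Run cmd through the shell and return the first line of its standard
 * output without the trailing newline, or NULL if nothing could be read.
 * The result lives in a static buffer that the next call overwrites. */
const char *first_output_line(const char *cmd)
{
    static char line[256];
    FILE *f = popen(cmd, "r");      /* "r": we read the command's stdout */
    if (f == NULL)
        return NULL;
    char *got = fgets(line, sizeof(line), f);
    pclose(f);                      /* waits for the command to finish */
    if (got == NULL)
        return NULL;
    line[strcspn(line, "\n")] = '\0';
    return line;
}
```

Calling `first_output_line("ping -4 -c5 localhost")` would return the first line that ping prints, after waiting for ping to finish. The sample program instead leaves the stream open and unread so that ping keeps running in the background while the counters are printed.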
    /* eBPF example program:
     * - creates arraymap in kernel with key 4 bytes and value 8 bytes
     *
     * - loads eBPF program:
     *   r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)];
     *   *(u32*)(fp - 4) = r0;
     *   // assuming packet is IPv4, lookup ip->proto in a map
     *   value = bpf_map_lookup_elem(map_fd, fp - 4);
     *   if (value)
     *        (*(u64*)value) += 1;
     *
     * - attaches this program to loopback interface "lo" raw socket
     *
     * - every second user space reads map[tcp], map[udp], map[icmp] to see
     *   how many packets of given protocol were seen on "lo"
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <assert.h>
    #include <linux/bpf.h>
    #include <string.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <stddef.h>
    #include <bpf/bpf.h>
    #include "bpf_insn.h"
    #include "sock_example.h"

    char bpf_log_buf[BPF_LOG_BUF_SIZE];

    static int test_sock(void)
    {
        int sock = -1, map_fd, prog_fd, i, key;
        long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;

        map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),
                                256, 0);
        if (map_fd < 0) {
            printf("failed to create map '%s'\n", strerror(errno));
            goto cleanup;
        }

        struct bpf_insn prog[] = {
            BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
            BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
            BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
            BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
            BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
            BPF_LD_MAP_FD(BPF_REG_1, map_fd),
            BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
            BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
            BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
            BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
            BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
            BPF_EXIT_INSN(),
        };
        size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);

        prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog, insns_cnt,
                                   "GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE);
        if (prog_fd < 0) {
            printf("failed to load prog '%s'\n", strerror(errno));
            goto cleanup;
        }

        sock = open_raw_sock("lo");

        if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                       sizeof(prog_fd)) < 0) {
            printf("setsockopt %s\n", strerror(errno));
            goto cleanup;
        }

        for (i = 0; i < 10; i++) {
            key = IPPROTO_TCP;
            assert(bpf_map_lookup_elem(map_fd, &key, &tcp_cnt) == 0);

            key = IPPROTO_UDP;
            assert(bpf_map_lookup_elem(map_fd, &key, &udp_cnt) == 0);

            key = IPPROTO_ICMP;
            assert(bpf_map_lookup_elem(map_fd, &key, &icmp_cnt) == 0);

            printf("TCP %lld UDP %lld ICMP %lld packets\n",
                   tcp_cnt, udp_cnt, icmp_cnt);
            sleep(1);
        }

    cleanup:
        /* maps, programs, raw sockets will auto cleanup on process exit */
        return 0;
    }

    int main(void)
    {
        FILE *f;

        f = popen("ping -4 -c5 localhost", "r");
        (void)f; /* the stream is never read; the cast marks f as intentionally unused */

        return test_sock();
    }
eBPF instruction programming code reading
Let's take out this part of the code and read it separately.
    struct bpf_insn prog[] = {
        BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* R6 = R1 */
        BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
        BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
        BPF_LD_MAP_FD(BPF_REG_1, map_fd),
        BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
        BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
        BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
        BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
        BPF_EXIT_INSN(),
    };
- BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
    /* Short form of mov, dst_reg = src_reg */
    #define BPF_MOV64_REG(DST, SRC) \
        ((struct bpf_insn) { \
            .code = BPF_ALU64 | BPF_MOV | BPF_X, \
            .dst_reg = DST, \
            .src_reg = SRC, \
            .off = 0, \
            .imm = 0 })
As you can see, this instruction copies the value of the source register R1 into R6. On entry, R1 holds the program's context, which here is the socket buffer (skb) of the packet, and the BPF_LD_ABS instruction that follows expects that pointer in R6.
- BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
    /* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
    #define BPF_LD_ABS(SIZE, IMM) \
        ((struct bpf_insn) { \
            .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
            .dst_reg = 0, \
            .src_reg = 0, \
            .off = 0, \
            .imm = IMM })
In this packet-load instruction, register R6 is an implicit input and register R0 is an implicit output. That is why the dst_reg and src_reg fields are not used here and are simply set to 0.
To understand the format of the data packet, refer to: Introduction to MAC header, IP header and TCP header
The IP protocol number is read at the given offset: ETH_HLEN skips the 14-byte Ethernet header, and offsetof(struct iphdr, protocol) is the position of the protocol field inside the IP header. For example, the protocol number of TCP is 6, that of UDP is 17, and that of ICMP is 1. The protocol field is 8 bits wide, which is why the load uses BPF_B.
Therefore, this instruction places the IP protocol number in the R0 register.
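What this load fetches can be sketched in ordinary C on a raw Ethernet frame. The two constants are written out by hand (14 is ETH_HLEN, and 9 is the offset of the protocol field in the IPv4 header); `frame_ip_protocol` and its -1 return convention are assumptions of this sketch, not kernel behavior.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN         14   /* Ethernet header length */
#define IP_PROTO_OFFSET   9   /* offsetof(struct iphdr, protocol) */

/* Return the byte that BPF_LD_ABS(BPF_B, ETH_HLEN + 9) would load,
 * or -1 if the frame is too short to contain it. */
int frame_ip_protocol(const uint8_t *frame, size_t len)
{
    size_t pos = ETH_HLEN + IP_PROTO_OFFSET;   /* byte 23 of the frame */
    if (frame == NULL || len <= pos)
        return -1;
    return frame[pos];
}
```

For an IPv4 frame carrying UDP, byte 23 is 17, so the function returns 17, the same value the eBPF program later uses as the map key.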
- BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
    /* Memory store, *(uint *) (dst_reg + off16) = src_reg */
    #define BPF_STX_MEM(SIZE, DST, SRC, OFF) \
        ((struct bpf_insn) { \
            .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \
            .dst_reg = DST, \
            .src_reg = SRC, \
            .off = OFF, \
            .imm = 0 })
R10 is the only read-only register, and it contains the frame pointer address used to access the BPF stack space. (For the structure of a stack frame, refer to: gdb debug stack frame information)
So this instruction saves the contents of the R0 register (the protocol number loaded in the previous step) to the stack. Note the BPF_W size: only the low 32 bits of R0 are stored.
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
The stack grows downward, so after these two instructions R2 = fp - 4 points at the 4-byte stack slot where the protocol number was just stored. R2 is the second argument (the pointer to the key) of the helper call that follows.
The expansion of the BPF_ALU64_IMM macro is not listed here; you can look it up in samples/bpf/bpf_insn.h.
The numeric values that these macros expand to can be found in include/uapi/linux/bpf.h.
Expanded this way, each of the instructions above becomes one 64-bit binary number. Isn't that neat?
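To see what such a 64-bit number contains, here is a minimal sketch that splits one back into its fields. It assumes the little-endian layout with the opcode in the lowest byte; the `insn_*` accessor names are made up for this example.

```c
#include <stdint.h>

/* Field accessors for one 64-bit encoded instruction.
 * Assumed layout from LSB to MSB:
 * opcode:8, dst_reg:4, src_reg:4, off:16, imm:32. */
uint8_t insn_opcode(uint64_t insn) { return (uint8_t)(insn & 0xff); }
uint8_t insn_dst(uint64_t insn)    { return (uint8_t)((insn >> 8) & 0x0f); }
uint8_t insn_src(uint64_t insn)    { return (uint8_t)((insn >> 12) & 0x0f); }
int16_t insn_off(uint64_t insn)    { return (int16_t)(uint16_t)(insn >> 16); }
int32_t insn_imm(uint64_t insn)    { return (int32_t)(uint32_t)(insn >> 32); }
```

Under that assumption, `BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4)` is the number 0xfffffffc00000207: opcode 0x07 (BPF_ALU64 | BPF_ADD | BPF_K), destination register 2, and immediate -4.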
- BPF_LD_MAP_FD(BPF_REG_1, map_fd),
This instruction is more interesting. Let's have a look.
    /* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
    #define BPF_LD_IMM64(DST, IMM) \
        BPF_LD_IMM64_RAW(DST, 0, IMM)
    #define BPF_LD_IMM64_RAW(DST, SRC, IMM) \
        ((struct bpf_insn) { \
            .code = BPF_LD | BPF_DW | BPF_IMM, \
            .dst_reg = DST, \
            .src_reg = SRC, \
            .off = 0, \
            .imm = (__u32) (IMM) }), \
        ((struct bpf_insn) { \
            .code = 0, /* zero is reserved opcode */ \
            .dst_reg = 0, \
            .src_reg = 0, \
            .off = 0, \
            .imm = ((__u64) (IMM)) >> 32 })
    #ifndef BPF_PSEUDO_MAP_FD
    # define BPF_PSEUDO_MAP_FD 1
    #endif
    /* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
    #define BPF_LD_MAP_FD(DST, MAP_FD) \
        BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
As you can see, this instruction loads the value of map_fd into the R1 register. You may wonder what the src_reg in the middle is for.
As shown above, if the instruction simply loads an immediate into a register, src_reg = 0; if the immediate represents a map_fd, src_reg = 1.
In this way, the kernel can tell whether the immediate in the instruction represents a map_fd. The replace_map_fd_with_map_ptr function later relies on this property.
Incidentally, the second slot has code = 0, which is numerically the same as BPF_LD | BPF_W | BPF_IMM (all three constants are 0x00). As the macro comment says, zero is a reserved opcode: the slot only carries the upper 32 bits of the immediate and is not an instruction in its own right.
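The way BPF_LD_IMM64_RAW spreads a 64-bit immediate over the imm fields of two instruction slots can be sketched as follows. The function names `imm64_low`, `imm64_high`, and `imm64_join` are hypothetical; the arithmetic follows the two `.imm` expressions in the macro above.

```c
#include <stdint.h>

/* imm field of the first slot: the low 32 bits of the value. */
int32_t imm64_low(uint64_t value)
{
    return (int32_t)(uint32_t)value;
}

/* imm field of the second slot: the high 32 bits of the value. */
int32_t imm64_high(uint64_t value)
{
    return (int32_t)(uint32_t)(value >> 32);
}

/* Rebuild the 64-bit immediate from the two imm fields. The casts to
 * uint32_t keep a negative low half from sign-extending into the high half. */
uint64_t imm64_join(int32_t low, int32_t high)
{
    return (uint64_t)(uint32_t)high << 32 | (uint32_t)low;
}
```

A small file descriptor such as map_fd fits entirely in the low half, so for BPF_LD_MAP_FD the second slot's imm is simply 0.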
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    /* Raw code statement block */
    #define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \
        ((struct bpf_insn) { \
            .code = CODE, \
            .dst_reg = DST, \
            .src_reg = SRC, \
            .off = OFF, \
            .imm = IMM })
BPF_FUNC_map_lookup_elem expands to 1. How the call to helper number 1 ends up at the address of the bpf_map_lookup_elem function after JIT is a question for later.
For now, the name of the macro shows that this is a call to the bpf_map_lookup_elem helper. Following the register conventions, R1 (the map) and R2 (the pointer to the key) are its arguments, and the result is returned in R0.
- BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
    /* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
    #define BPF_JMP_IMM(OP, DST, IMM, OFF) \
        ((struct bpf_insn) { \
            .code = BPF_JMP | BPF_OP(OP) | BPF_K, \
            .dst_reg = DST, \
            .src_reg = 0, \
            .off = OFF, \
            .imm = IMM })
This instruction means: if the R0 register equals 0, skip the next two instructions.
After the call, R0 no longer holds the protocol number. It holds the return value of bpf_map_lookup_elem, which is a pointer to the map value, or 0 (NULL) if the lookup failed. The jump therefore implements the `if (value)` check from the comment at the top of the program: when the lookup fails, the two instructions that increment the counter are skipped.
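The jump offset is counted in instruction slots, relative to the instruction after the jump. Here is a small sketch of that arithmetic; `jump_target` is a hypothetical helper, and the slot numbers used below are obtained by counting the entries of prog, where BPF_LD_MAP_FD occupies two slots.

```c
#include <assert.h>
#include <stdint.h>

/* Slot index at which execution continues when the jump at slot pc
 * is taken: the offset is relative to the next instruction. */
int jump_target(int pc, int16_t off)
{
    int target = pc + 1 + off;
    assert(target >= 0);   /* a jump cannot land before the program start */
    return target;
}
```

In prog, BPF_JMP_IMM sits in slot 8 (slots 5 and 6 both belong to BPF_LD_MAP_FD), so with an offset of 2 a taken jump lands on slot 11, the `r0 = 0` instruction, skipping `r1 = 1` and the xadd.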
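- BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
xadd is an atomic add to memory, not an exchange of registers. Before it runs, R0 is the pointer to the map value and R1 = 1. The instruction atomically performs *(u64 *)(R0 + 0) += R1, so the 64-bit counter stored in the map grows by 1, while R0 and R1 themselves are unchanged.
R1 held the map reference only for the helper call. BPF_MOV64_IMM has since overwritten it with 1, so nothing needs to identify it as a map_fd at this point.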
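The comment at the top of the sample describes this step as `if (value) (*(u64*)value) += 1;`. A user-space analogue can be sketched with C11 atomics. This is only an illustration of the semantics, not how the kernel implements the instruction, and both function names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Analogue of the null check plus xadd: atomically add 1 to the
 * counter that value points to, if the lookup succeeded. */
void count_packet(_Atomic uint64_t *value)
{
    if (value != NULL)
        atomic_fetch_add(value, 1);
}

/* Demo: start a counter at `start`, count `hits` packets, and also
 * simulate one failed lookup, which must leave the counter untouched. */
uint64_t count_many(uint64_t start, unsigned hits)
{
    _Atomic uint64_t counter = start;
    for (unsigned i = 0; i < hits; i++)
        count_packet(&counter);
    count_packet(NULL);
    return atomic_load(&counter);
}
```

The add has to be atomic because the same map entry can be updated concurrently, for example when packets of one protocol are processed on several CPUs at once.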
- BPF_MOV64_IMM(BPF_REG_0, 0),
R0 is the register that holds the exit value of the BPF program. This instruction sets the return value: R0 = 0.
- BPF_EXIT_INSN()
    /* Program exit */
    #define BPF_EXIT_INSN() \
        ((struct bpf_insn) { \
            .code = BPF_JMP | BPF_EXIT, \
            .dst_reg = 0, \
            .src_reg = 0, \
            .off = 0, \
            .imm = 0 })
Run this program
If you want to run this program, you can fetch the kernel source code and compile it.
Fetch the source code that corresponds to your current Linux kernel version. You can refer to: How to get the source code from ubuntu
    sudo apt source linux
Then compile the BPF programs under the samples/bpf directory. You can refer to: Run the first bpf program
    make M=samples/bpf
Run the program; the output is as follows. (PS: my lo interface is also forwarding browser data, hence the TCP packets.) (Does ping send four ICMP packets at a time?)
    ➜  bpf sudo ./sock_example
    TCP 0 UDP 0 ICMP 0 packets
    TCP 28 UDP 0 ICMP 4 packets
    TCP 60 UDP 0 ICMP 4 packets
    TCP 100 UDP 0 ICMP 8 packets
    TCP 134 UDP 0 ICMP 12 packets
    TCP 166 UDP 0 ICMP 16 packets
    TCP 228 UDP 0 ICMP 16 packets
    TCP 302 UDP 0 ICMP 16 packets
    TCP 334 UDP 0 ICMP 16 packets
    TCP 366 UDP 0 ICMP 16 packets