Basic principles
Simulation principle
- MMIO memory (memory-mapped I/O memory) is a memory type defined by QEMU/KVM. Like ordinary RAM, MMIO memory represents a region of memory in the Guest, and it is also volatile storage.
- The difference between MMIO memory and RAM is that when the Guest reads or writes MMIO memory, in addition to behaving like RAM, the backend triggers a write callback on every write (OUT) and a read callback on every read (IN), just as with port I/O. Hence the name memory-mapped I/O.
- From this description, emulating MMIO in QEMU/KVM comes down to two elements: emulating the Guest's reads and writes of memory, and monitoring the virtual machine's accesses to MMIO memory so that the corresponding callback is triggered. For the former, QEMU/KVM emulates the Guest's access to MMIO memory by executing the same instruction on the Guest's behalf. For the latter, QEMU/KVM must first intercept the Guest's memory accesses: whenever MMIO memory is read or written, the Guest has to trap into KVM so that QEMU/KVM can trigger the callback registered by the user.
- In the concrete implementation, QEMU/KVM emulates MMIO through the following steps:
- Define the MMIO memory. When the Guest exits because of a page fault on that memory, the fault is identified as an MMIO access in the QEMU/KVM page fault handling path, and the MMIO emulation logic is followed.
- On the Host, QEMU/KVM fetches the read or write instruction that the Guest was executing when it exited and emulates it directly, which completes the Guest's memory access.
- Return to QEMU in user space and trigger the MMIO read/write callback registered there.
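- The three steps above can be sketched as a small user-space toy model. This is only an illustration under stated assumptions: every name prefixed with toy_ is hypothetical and is not part of QEMU or KVM, and the real flow is split between the kernel (fault classification, instruction emulation) and QEMU (callbacks).

```c
#include <stdint.h>

/* Hypothetical, simplified model; none of these names are QEMU or KVM APIs. */
typedef void (*toy_write_cb)(void *opaque, uint64_t offset,
                             uint64_t data, unsigned size);

struct toy_mmio_region {
    uint64_t base;        /* guest-physical start of the region */
    uint64_t size;        /* region length in bytes */
    uint8_t *backing;     /* storage that models the RAM-like behaviour */
    toy_write_cb write;   /* user-registered write callback */
    void *opaque;         /* passed back to the callback */
};

/* Example callback: records the last write it was told about. */
struct toy_log {
    uint64_t offset;
    uint64_t data;
    unsigned count;
};

static void toy_log_write(void *opaque, uint64_t offset,
                          uint64_t data, unsigned size)
{
    struct toy_log *log = opaque;

    (void)size;
    log->offset = offset;
    log->data = data;
    log->count++;
}

/* Step 1: classify the faulting guest-physical address. */
static int toy_is_mmio(const struct toy_mmio_region *r, uint64_t gpa)
{
    return gpa >= r->base && gpa - r->base < r->size;
}

/* Steps 2 and 3: perform the store on the guest's behalf, then notify. */
static int toy_handle_write(struct toy_mmio_region *r, uint64_t gpa,
                            uint64_t data, unsigned size)
{
    uint64_t off;
    unsigned i;

    if (size == 0 || size > 8 || !toy_is_mmio(r, gpa) ||
        size > r->size - (gpa - r->base))
        return -1;                    /* not MMIO: ordinary fault handling */

    off = gpa - r->base;
    for (i = 0; i < size; i++)        /* little-endian store, byte by byte */
        r->backing[off + i] = (uint8_t)(data >> (8 * i));

    if (r->write)
        r->write(r->opaque, off, data, size);
    return 0;
}
```

- In this model a one-byte write at offset 0x14 of the region updates the backing storage and then invokes the callback with the region-relative offset; an address outside the region is rejected and would take the ordinary page fault path.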
Hardware foundation
- MMIO memory differs from ordinary memory: every time the Guest reads or writes MMIO memory, it exits with a page fault and hands the access to the backend for processing, much like a sensitive instruction. Since both kinds of exit are caused by page table exceptions, how does KVM tell an MMIO PF from a PF on ordinary memory? The answer is a special flag in the page table entry. The SPTE format on the x86 Intel architecture is as follows:
- Section 28.2.3.1 of the Intel manual contains the following paragraph:
An EPT misconfiguration occurs if the entry is present and a reserved bit is set.
EPT misconfigurations result when an EPT paging-structure entry is configured with settings reserved for future
functionality.
- When an EPT page table entry is marked present but has a reserved bit set, the hardware raises an EPT misconfiguration exit, and software can build special functionality on top of this exit.
- KVM uses this mechanism to implement MMIO PFs on the Intel architecture. Concretely, it sets bits 2:0 of the page table entry to 0b110 and also sets bit 51 and bit 62.
- KVM maintains all SPTEs itself. When the Guest exits because of an EPT misconfiguration, KVM takes the GPA that caused the exit, walks the EPT page table structure and finds the corresponding SPTE. If the bits described above are set, KVM recognizes the exit as an MMIO PF and handles it accordingly.
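- The bit test described above can be sketched as follows. The bit positions (bits 2:0 equal to 0b110, bit 51 and bit 62 set) come from the text; the macro and function names are illustrative assumptions and are not the kernel's own identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the description in the text; names are hypothetical. */
#define TOY_SPTE_LOW_MASK  UINT64_C(0x7)           /* bits 2:0 */
#define TOY_SPTE_WX_BITS   UINT64_C(0x6)           /* bits 2:0 == 0b110 */
#define TOY_SPTE_BIT51     (UINT64_C(1) << 51)
#define TOY_SPTE_BIT62     (UINT64_C(1) << 62)

#define TOY_MMIO_MASK   (TOY_SPTE_LOW_MASK | TOY_SPTE_BIT51 | TOY_SPTE_BIT62)
#define TOY_MMIO_VALUE  (TOY_SPTE_WX_BITS  | TOY_SPTE_BIT51 | TOY_SPTE_BIT62)

/* Returns true when the entry carries the MMIO marker pattern. */
static bool toy_is_mmio_spte(uint64_t spte)
{
    return (spte & TOY_MMIO_MASK) == TOY_MMIO_VALUE;
}
```

- Comparing against a mask rather than testing single bits means that other fields stored in the entry do not affect the result, while an entry whose low three bits are 0b111 is not mistaken for an MMIO entry.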
Data structure
- The emulation of an MMIO PF has two parts: the implementation of the callbacks and the emulation of the memory read/write instructions. They are implemented in QEMU and in the kernel respectively. We take the emulation of the virtio PCI configuration space as an example to analyze the data structures related to MMIO memory.
QEMU
- MemoryRegionOps in QEMU describes the callback hooks of MMIO memory, as follows:
/*
 * Memory region callbacks
 */
struct MemoryRegionOps {
    /* Read from the memory region. @addr is relative to @mr; @size is
     * in bytes. */
    uint64_t (*read)(void *opaque,
                     hwaddr addr,
                     unsigned size);
    /* Write to the memory region. @addr is relative to @mr; @size is
     * in bytes. */
    void (*write)(void *opaque,
                  hwaddr addr,
                  uint64_t data,
                  unsigned size);
    enum device_endian endianness;
    ......
};
- For the legacy virtio PCI implementation (virtio spec 0.95), QEMU defines the following structure to describe the MMIO read and write operations:
static const MemoryRegionOps virtio_pci_config_ops = {
    .read = virtio_pci_config_read,   /* Triggered when the virtio PCI configuration space is read */
    .write = virtio_pci_config_write, /* Triggered when the virtio PCI configuration space is written */
    .impl = {
        .min_access_size = 1,
        .max_access_size = 4,
    },
    .endianness = DEVICE_LITTLE_ENDIAN,
};
- For the modern virtio PCI implementation (virtio spec 1.0/1.1), QEMU defines separate MMIO read and write operations for the PCI common configuration space, the ISR region and the device-specific configuration space:
/* Read/write callbacks of the common configuration space */
static const MemoryRegionOps common_ops = {
    .read = virtio_pci_common_read,
    .write = virtio_pci_common_write,
    .impl = {
        .min_access_size = 1,
        .max_access_size = 4,
    },
    .endianness = DEVICE_LITTLE_ENDIAN,
};
/* Read/write callbacks of the ISR (interrupt status) region */
static const MemoryRegionOps isr_ops = {
    .read = virtio_pci_isr_read,
    .write = virtio_pci_isr_write,
    .impl = {
        .min_access_size = 1,
        .max_access_size = 4,
    },
    .endianness = DEVICE_LITTLE_ENDIAN,
};
/* Read/write callbacks of the device-specific (virtio-net / virtio-blk) configuration space */
static const MemoryRegionOps device_ops = {
    .read = virtio_pci_device_read,
    .write = virtio_pci_device_write,
    .impl = {
        .min_access_size = 1,
        .max_access_size = 4,
    },
    .endianness = DEVICE_LITTLE_ENDIAN,
};
......
- With the help of the structure in the figure below, the code above is easy to understand: it defines read and write callbacks for each virtio PCI configuration region. Once a Guest memory read or write falls into one of these regions, KVM emulates the instruction and returns to user space, where the registered callback is executed.
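- How an access "falls into" one of these regions can be illustrated with a small lookup sketch. The region names mirror the ops structures above, but the offsets and sizes in toy_layout are made up for the example and are not taken from QEMU; the point is only that the callback receives an address relative to the start of its own region, matching the "@addr is relative to @mr" comment in MemoryRegionOps.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one region inside a device BAR. */
struct toy_region {
    const char *name;
    uint64_t offset;   /* start of the region within the BAR */
    uint64_t size;     /* length of the region in bytes */
};

/* Layout values are invented for this example. */
static const struct toy_region toy_layout[] = {
    { "common", 0x0000, 0x1000 },
    { "isr",    0x1000, 0x1000 },
    { "device", 0x2000, 0x1000 },
};

/*
 * Find the region containing bar_offset and compute the region-relative
 * address that would be handed to that region's read/write callback.
 * Returns NULL when no region covers the offset.
 */
static const struct toy_region *toy_find_region(uint64_t bar_offset,
                                                uint64_t *rel)
{
    size_t i;

    for (i = 0; i < sizeof(toy_layout) / sizeof(toy_layout[0]); i++) {
        const struct toy_region *r = &toy_layout[i];

        if (bar_offset >= r->offset && bar_offset - r->offset < r->size) {
            *rel = bar_offset - r->offset;
            return r;
        }
    }
    return NULL;
}
```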
KVM
- KVM is responsible for identifying MMIO PFs and emulating the memory read/write instructions. Here we mainly introduce the relevant data structures.
- TODO
PF process
- We take Guest reads and writes of the device_status field in the virtio PCI common configuration space as an example to analyze the MMIO PF process. device_status is used to synchronize state between the frontend and the backend while the Guest initializes the virtio PCI device. The main states are listed below. When the Guest wants to reset the virtio PCI device, it writes 0 to the device_status field to notify the backend.
/* Status byte for guest to report progress. */
#define VIRTIO_CONFIG_STATUS_RESET 0x00 /* device reset */
#define VIRTIO_CONFIG_STATUS_ACK 0x01
#define VIRTIO_CONFIG_STATUS_DRIVER 0x02
#define VIRTIO_CONFIG_STATUS_DRIVER_OK 0x04
#define VIRTIO_CONFIG_STATUS_FEATURES_OK 0x08
#define VIRTIO_CONFIG_STATUS_FAILED 0x80
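- These values are bit flags that accumulate in the status byte, while a write of 0 resets the device. The sketch below walks through the driver initialization order given in the virtio 1.0 specification (ACK, DRIVER, FEATURES_OK, DRIVER_OK). The helper functions prefixed with toy_ are hypothetical and only illustrate the arithmetic.

```c
#include <stdint.h>

#define VIRTIO_CONFIG_STATUS_RESET       0x00
#define VIRTIO_CONFIG_STATUS_ACK         0x01
#define VIRTIO_CONFIG_STATUS_DRIVER      0x02
#define VIRTIO_CONFIG_STATUS_DRIVER_OK   0x04
#define VIRTIO_CONFIG_STATUS_FEATURES_OK 0x08
#define VIRTIO_CONFIG_STATUS_FAILED      0x80

/* Status byte after a successful initialization sequence. */
static uint8_t toy_status_after_init(void)
{
    uint8_t s = VIRTIO_CONFIG_STATUS_RESET;   /* device starts out reset */

    s |= VIRTIO_CONFIG_STATUS_ACK;            /* guest noticed the device */
    s |= VIRTIO_CONFIG_STATUS_DRIVER;         /* guest has a driver for it */
    s |= VIRTIO_CONFIG_STATUS_FEATURES_OK;    /* feature negotiation done */
    s |= VIRTIO_CONFIG_STATUS_DRIVER_OK;      /* driver is ready */
    return s;
}

/* A device is usable once DRIVER_OK is set and FAILED is not. */
static int toy_device_ready(uint8_t s)
{
    return (s & VIRTIO_CONFIG_STATUS_DRIVER_OK) &&
           !(s & VIRTIO_CONFIG_STATUS_FAILED);
}
```

- Each of these status updates is a separate one-byte write to device_status, so each one goes through the MMIO PF path analyzed in this section.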
Guest
- This article analyzes the dpdk-18.05 source code, in which the read and write functions for the device_status field are:
static uint8_t
modern_get_status(struct virtio_hw *hw)
{
    return rte_read8(&hw->common_cfg->device_status);
}
static void
modern_set_status(struct virtio_hw *hw, uint8_t status)
{
    rte_write8(status, &hw->common_cfg->device_status);
}
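- To see why these helpers end up as a single byte-sized load or store that can trap, here is a simplified stand-in for them. It is an assumption-laden sketch, not the DPDK implementation: toy_read8 and toy_write8 are hypothetical names, and the real rte_read8/rte_write8 may additionally insert I/O memory barriers, which are omitted here.

```c
#include <stdint.h>

/*
 * The volatile qualifier forces the compiler to emit exactly one 8-bit
 * load or store per call and forbids caching the value in a register.
 */
static inline uint8_t toy_read8(const volatile void *addr)
{
    return *(const volatile uint8_t *)addr;
}

static inline void toy_write8(uint8_t value, volatile void *addr)
{
    *(volatile uint8_t *)addr = value;
}
```

- When addr points into MMIO memory, that single store is the instruction that exits to KVM; on ordinary memory, as in a unit test, it is just a normal byte access.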
- testpmd resets the device when it initializes the virtio-net NIC. The call chain is as follows:
main
rte_eal_init
rte_bus_probe
rte_pci_probe
pci_probe_all_drivers
rte_pci_probe_one_driver
eth_virtio_pci_probe
rte_eth_dev_pci_generic_probe
eth_virtio_dev_init
virtio_init_device
vtpci_reset
modern_set_status
Host
KVM
QEMU
Experiment
Guest
- testpmd, DPDK's test program, exercises a NIC through a poll mode driver, so it performs a series of initialization steps on the NIC when it takes the NIC over. When the Guest runs this program on a virtio-net NIC, it reads and writes the virtio PCI configuration space, so running testpmd lets us verify the complete MMIO PF process. The testpmd tool comes from the vpp-dpdk-devel package. The whole test procedure is as follows:
yum install -y vpp-dpdk-devel
ip link set ens5 down
modprobe uio    # load the user-space I/O module
insmod igb_uio.ko    # load the igb_uio user-space I/O module built by DPDK
dpdk-devbind -b igb_uio 00:05.0    # detach the ens5 NIC from its kernel driver and bind it to igb_uio; 00:05.0 is the PCI address of the NIC
gdb testpmd    # debug testpmd under gdb; its arguments are set with "set args" below
set args -l 1-7 -n 2 -- -i --nb-cores=6 --eth-peer=0,82:54:00:d8:42:b0 --forward-mode=rxonly --txq=12 --rxq=12 --txd=1024 --rxd=1024
set print pretty
b modern_set_status
b modern_get_status
r
- The results are as follows:
- Execution stops in modern_set_status, which vtpci_reset calls to reset the virtio-net device. The code is as follows:
void vtpci_reset(struct virtio_hw *hw)
{
    VTPCI_OPS(hw)->set_status(hw, VIRTIO_CONFIG_STATUS_RESET);
    /* flush status write */
    VTPCI_OPS(hw)->get_status(hw);
}
/* Definition of the set_status callback */
const struct virtio_pci_ops modern_ops = {
    .read_dev_cfg = modern_read_dev_config,
    .write_dev_cfg = modern_write_dev_config,
    .get_status = modern_get_status,
    .set_status = modern_set_status,
    ......
};
/* Resetting the virtio-net device actually means writing 0 to the device_status field in the common configuration space */
static void modern_set_status(struct virtio_hw *hw, uint8_t status)
{
    rte_write8(status, &hw->common_cfg->device_status);
}
- Disassembling the modern_set_status function yields two mov instructions:
=> 0x000055555596c360 <+0>: mov 0x48(%rdi),%rax
0x000055555596c364 <+4>: mov %sil,0x14(%rax)
- The second instruction writes the low 8 bits of the rsi register to the memory address at offset 0x14 from the pointer in rax, which is the address of hw->common_cfg->device_status. After gdb executes the first mov instruction, the state is as follows:
- This confirms that the next mov instruction gdb executes is the actual write to MMIO memory. We expect this instruction to trigger an EPT misconfiguration exit into the kernel; after emulating the instruction, the kernel returns to user space, which finally calls virtio_pci_common_write. For the kernel, we need to enable tracing to verify this. For QEMU, we only need to attach gdb to the QEMU process and set a breakpoint on virtio_pci_common_write.
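- The 0x14 displacement in the second mov instruction can be cross-checked against the layout of the common configuration structure. The sketch below declares only the leading fields, using the field names as I recall them from DPDK's virtio_pci.h (treat them as an assumption); struct toy_common_cfg and toy_device_status_offset are hypothetical names, and the trailing queue-related fields are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Leading fields of the virtio 1.0 common configuration structure. */
struct toy_common_cfg {
    uint32_t device_feature_select;   /* offset 0x00 */
    uint32_t device_feature;          /* offset 0x04 */
    uint32_t guest_feature_select;    /* offset 0x08 */
    uint32_t guest_feature;           /* offset 0x0c */
    uint16_t msix_config;             /* offset 0x10 */
    uint16_t num_queues;              /* offset 0x12 */
    uint8_t  device_status;           /* offset 0x14 */
    uint8_t  config_generation;       /* offset 0x15 */
};

/* Byte offset of device_status from the start of the structure. */
static size_t toy_device_status_offset(void)
{
    return offsetof(struct toy_common_cfg, device_status);
}
```

- Four 32-bit fields (16 bytes) followed by two 16-bit fields (4 bytes) place device_status at byte 20, which is the 0x14 seen in the disassembly.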
Host
- Enable the kvm_exit, handle_mmio_page_fault and kvm_emulate_insn trace events:
echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable
echo 1 > /sys/kernel/debug/tracing/events/kvmmmu/handle_mmio_page_fault/enable
echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_emulate_insn/enable
- Attach gdb to the QEMU process and set a breakpoint on virtio_pci_common_write.
- We print the instruction at 0x000055555596c364 in the Guest, then execute the mov instruction and observe the Guest, KVM and QEMU respectively, as follows:
- From the above we can see that when the Guest executes the assembly instruction (0x40 0x88 0x70 0x14) at 0x000055555596c364, it traps into KVM. KVM handles the MMIO page fault in its EPT misconfiguration handling path: it emulates the instruction on the Guest's behalf, which performs the Guest's memory access, and then returns to user space to trigger the virtio_pci_common_write callback.
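- The four instruction bytes can be decoded by hand to confirm that they are the store seen in the disassembly. The sketch below handles only this one encoding (an optional REX prefix, opcode 0x88, a ModRM byte with an 8-bit displacement and no SIB byte). It is a hypothetical helper for illustration and is far simpler than the general x86 instruction emulator in KVM.

```c
#include <stdint.h>

/* Decoded fields of a "mov r8, disp8(reg)" instruction. */
struct toy_mov8 {
    unsigned mod;   /* ModRM bits 7:6, 1 means an 8-bit displacement */
    unsigned reg;   /* ModRM bits 5:3, source register number */
    unsigned rm;    /* ModRM bits 2:0, base register number */
    int8_t disp;    /* signed 8-bit displacement */
};

/* Returns 0 on success, -1 if the bytes are not this exact form. */
static int toy_decode_mov8(const uint8_t *insn, unsigned len,
                           struct toy_mov8 *out)
{
    unsigned i = 0;
    uint8_t modrm;

    if (len >= 1 && (insn[0] & 0xf0) == 0x40)
        i = 1;                              /* skip the REX prefix */
    if (len < i + 3 || insn[i] != 0x88)     /* 0x88: mov r8 to r/m8 */
        return -1;

    modrm = insn[i + 1];
    out->mod = modrm >> 6;
    out->reg = (modrm >> 3) & 7;
    out->rm = modrm & 7;
    if (out->mod != 1 || out->rm == 4)      /* need disp8 form, no SIB */
        return -1;

    out->disp = (int8_t)insn[i + 2];
    return 0;
}
```

- For the bytes 0x40 0x88 0x70 0x14 this gives mod 1, reg 6, rm 0 and a displacement of 0x14. With a REX prefix present, 8-bit register number 6 is sil and base register number 0 is rax, which matches mov %sil,0x14(%rax).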