Rong Tao May 12, 2021
Annotated code for this article: https://github.com/Rtoax/linux-5.10.13
1. Initialization function call relationships
1.1. start_kernel
See the article series "Linux startup process (10): start_kernel initialization (to the initial stage of setup_arch)".
During the kernel startup phase, perf_event_init is called to initialize perf events.
start_kernel()
  ...
  perf_event_init()              <<<<<<<<<< perf relevant
    perf_tp_register()           <<<<<<<<<< perf relevant
    init_hw_breakpoint()         <<<<<<<<<< perf relevant
  arch_call_rest_init()
    rest_init()
      kernel_thread()
        kernel_init() => PID=1

kernel_init()
  ->kernel_init_freeable()
    ->do_basic_setup()
      ->do_initcalls()
        ->do_initcall_level()
          ->do_one_initcall()
            ->xxx__initcall()    <<<<<<<<<< perf relevant
1.2. initcall
Functions marked with the following macros are called in do_initcall_level:
#define early_initcall(fn) /* fn */   __define_initcall(fn, early) /* (fn,early) */

/* The smaller the number, the higher the priority */
#define pure_initcall(fn)             __define_initcall(fn, 0)
#define core_initcall(fn)             __define_initcall(fn, 1)
#define core_initcall_sync(fn)        __define_initcall(fn, 1s)
#define postcore_initcall(fn)         __define_initcall(fn, 2)
#define postcore_initcall_sync(fn)    __define_initcall(fn, 2s)
#define arch_initcall(fn)             __define_initcall(fn, 3)
#define arch_initcall_sync(fn)        __define_initcall(fn, 3s)
#define subsys_initcall(fn)           __define_initcall(fn, 4)
#define subsys_initcall_sync(fn)      __define_initcall(fn, 4s)
#define fs_initcall(fn)               __define_initcall(fn, 5)
#define fs_initcall_sync(fn)          __define_initcall(fn, 5s)
#define rootfs_initcall(fn)           __define_initcall(fn, rootfs)
#define device_initcall(fn)           __define_initcall(fn, 6)
#define device_initcall_sync(fn)      __define_initcall(fn, 6s)
#define late_initcall(fn)             __define_initcall(fn, 7)
#define late_initcall_sync(fn)        __define_initcall(fn, 7s)
The perf-related initcall functions include:
early_initcall(init_hw_perf_events);
arch_initcall(bts_init);
arch_initcall(pt_init);
device_initcall(perf_event_sysfs_init);
device_initcall(amd_ibs_init);
device_initcall(amd_iommu_pc_init);
device_initcall(msr_init);
device_initcall(amd_uncore_init);
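To make the effect of the levels concrete, here is a minimal userspace sketch (not kernel code) that models the ordering. The enum, struct and function names are hypothetical and exist only for this illustration. The sketch assumes that early initcalls run before all numbered levels and that calls within one level keep their table order, whereas in the real kernel the order within a level comes from link order.

```c
#include <stddef.h>

/* Hypothetical userspace model of initcall levels; not kernel code. */
enum initcall_level {
    LEVEL_EARLY = -1,   /* assumption: "early" runs before level 0 */
    LEVEL_PURE = 0,
    LEVEL_CORE = 1,
    LEVEL_POSTCORE = 2,
    LEVEL_ARCH = 3,
    LEVEL_SUBSYS = 4,
    LEVEL_FS = 5,
    LEVEL_DEVICE = 6,
    LEVEL_LATE = 7
};

struct initcall_entry {
    const char *name;
    enum initcall_level level;
};

/* The perf-related initcalls from the text, deliberately in mixed order. */
static const struct initcall_entry perf_initcalls[] = {
    { "msr_init",              LEVEL_DEVICE },
    { "pt_init",               LEVEL_ARCH   },
    { "init_hw_perf_events",   LEVEL_EARLY  },
    { "perf_event_sysfs_init", LEVEL_DEVICE },
    { "bts_init",              LEVEL_ARCH   },
    { "amd_ibs_init",          LEVEL_DEVICE },
    { "amd_iommu_pc_init",     LEVEL_DEVICE },
    { "amd_uncore_init",       LEVEL_DEVICE }
};

#define PERF_INITCALL_COUNT (sizeof(perf_initcalls) / sizeof(perf_initcalls[0]))

/*
 * Walk the levels from lowest to highest and, inside each level, emit the
 * entries in table order. Writes the names into out (which must have room
 * for n pointers) and returns how many names were written.
 */
size_t initcall_run_order(const struct initcall_entry *calls, size_t n,
                          const char **out)
{
    size_t k = 0;

    for (int level = LEVEL_EARLY; level <= LEVEL_LATE; level++)
        for (size_t i = 0; i < n; i++)
            if ((int)calls[i].level == level)
                out[k++] = calls[i].name;
    return k;
}
```

Calling initcall_run_order(perf_initcalls, PERF_INITCALL_COUNT, order) puts init_hw_perf_events first, then the two arch-level entries, then the device-level entries, whatever the order of the table.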
1.3. module
In addition, some perf-related PMUs are initialized through module_init:
static int __init cstate_pmu_init(void)
{
    return cstate_init();
}
module_init(cstate_pmu_init);

static int __init amd_power_pmu_init(void)
{
}
module_init(amd_power_pmu_init);

static int __init rapl_pmu_init(void)
{
}
module_init(rapl_pmu_init);

static int __init intel_uncore_init(void)
{
    uncore_pci_init
        uncore_pci_sub_driver_init
            uncore_pci_pmu_register
                uncore_pmu_register
    uncore_cpu_init
        uncore_msr_pmus_register
            type_pmu_register
                uncore_pmu_register
    uncore_mmio_init
        type_pmu_register
            uncore_pmu_register
}
module_init(intel_uncore_init);
None of these modules are loaded on my system, and I don't think they are essential, so this series does not discuss them.
lsmod | grep -e core -e power -e stat
2. perf_event_init
perf_event_init is called in start_kernel:
asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
{
    perf_event_init();
    ...
    arch_call_rest_init();
}
Calling relationship:
start_kernel()
  ...
  perf_event_init()              <<<<<<<<<< perf relevant
    perf_tp_register()           <<<<<<<<<< perf relevant
    init_hw_breakpoint()         <<<<<<<<<< perf relevant
  arch_call_rest_init()
    rest_init()
      kernel_thread()
        kernel_init() => PID=1

kernel_init()
  ->kernel_init_freeable()
    ->do_basic_setup()
      ->do_initcalls()
        ->do_initcall_level()
          ->do_one_initcall()
            ->xxx__initcall()    <<<<<<<<<< perf relevant
Here it is necessary to introduce the perf types (include/uapi/linux/perf_event.h):
/*
 * attr.type
 */
enum perf_type_id { /* perf type */
    PERF_TYPE_HARDWARE   = 0, /* Hardware */
    PERF_TYPE_SOFTWARE   = 1, /* Software */
    PERF_TYPE_TRACEPOINT = 2, /* Tracepoint */
    PERF_TYPE_HW_CACHE   = 3, /* Hardware cache */
    PERF_TYPE_RAW        = 4, /* RAW */
    PERF_TYPE_BREAKPOINT = 5, /* Breakpoint */

    PERF_TYPE_MAX,            /* non-ABI */
};
These values are passed as the type argument of the PMU registration function perf_pmu_register:
int perf_pmu_register(struct pmu *pmu, const char *name, int type)
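On the userspace side, the same type values are what a program places in attr.type when it opens an event. The following is a minimal sketch of that side, assuming a Linux system with the UAPI header linux/perf_event.h installed. The helper names make_attr and perf_event_open_raw are hypothetical; the syscall number SYS_perf_event_open and the struct fields come from the headers.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build a perf_event_attr whose type field selects
 * one of the perf_type_id classes and whose config selects the event.
 */
struct perf_event_attr make_attr(unsigned int type, unsigned long long config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);      /* lets the kernel check the ABI version */
    attr.type = type;              /* e.g. PERF_TYPE_SOFTWARE */
    attr.config = config;          /* e.g. PERF_COUNT_SW_CPU_CLOCK */
    attr.disabled = 1;             /* start stopped; enable later via ioctl */
    attr.exclude_kernel = 1;       /* count user space only */
    return attr;
}

/*
 * Hypothetical thin wrapper around the raw system call; this sketch goes
 * through syscall(2) directly. Returns a file descriptor, or -1 with
 * errno set.
 */
long perf_event_open_raw(struct perf_event_attr *attr, pid_t pid, int cpu,
                         int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

A typical call would be perf_event_open_raw(&attr, 0, -1, -1, 0) to measure the calling thread on any CPU. Whether it succeeds depends on the system's perf permissions, so the return value has to be checked.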
Note that perf_pmu_register is a very important registration function. A registered PMU is added to the global linked list pmus; the function itself will be covered separately in another article.
static LIST_HEAD(pmus);
The perf_event_init function calls perf_pmu_register to register the following:
perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
perf_pmu_register(&perf_cpu_clock, NULL, -1);
perf_pmu_register(&perf_task_clock, NULL, -1);
perf_tp_register();
init_hw_breakpoint();
Where perf_tp_register is:
static inline void perf_tp_register(void)
{
    perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
#ifdef CONFIG_KPROBE_EVENTS
    perf_pmu_register(&perf_kprobe, "kprobe", -1);
#endif
#ifdef CONFIG_UPROBE_EVENTS
    perf_pmu_register(&perf_uprobe, "uprobe", -1);
#endif
}
Where init_hw_breakpoint is:
int __init init_hw_breakpoint(void)
{
    ...
    perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
    ...
}
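The registration pattern above (a fixed type for well-known PMUs, -1 for the rest, and everything linked onto one global list) can be modelled with a small userspace sketch. This is a toy model, not the kernel implementation: every toy_ name is hypothetical, and the rules that a negative type receives the next free id starting at PERF_TYPE_MAX and that a PMU registered without a name keeps type -1 are assumptions made for illustration.

```c
#include <stddef.h>

#define TOY_TYPE_SOFTWARE 1   /* mirrors PERF_TYPE_SOFTWARE */
#define TOY_TYPE_MAX      6   /* mirrors PERF_TYPE_MAX in the enum above */

/* Toy stand-in for struct pmu. */
struct toy_pmu {
    const char *name;         /* NULL: no name, so no id in this model */
    int type;
    struct toy_pmu *next;
};

/* Toy stand-in for the global list pmus plus the id allocator. */
struct toy_registry {
    struct toy_pmu *head;
    int next_dynamic;         /* next id handed out when type < 0 */
};

void toy_registry_init(struct toy_registry *r)
{
    r->head = NULL;
    r->next_dynamic = TOY_TYPE_MAX;
}

/* Register a PMU: pick its type, then link it at the head of the list. */
int toy_pmu_register(struct toy_registry *r, struct toy_pmu *pmu,
                     const char *name, int type)
{
    pmu->name = name;
    if (name == NULL)
        pmu->type = -1;                   /* assumption: unnamed PMUs get no id */
    else if (type < 0)
        pmu->type = r->next_dynamic++;    /* assumption: dynamic ids start at MAX */
    else
        pmu->type = type;                 /* fixed, well-known type */

    pmu->next = r->head;
    r->head = pmu;
    return 0;
}

/* Look a PMU up by type, the way an attr.type value would be resolved. */
struct toy_pmu *toy_pmu_find(const struct toy_registry *r, int type)
{
    for (struct toy_pmu *p = r->head; p != NULL; p = p->next)
        if (p->type == type)
            return p;
    return NULL;
}
```

Registering "software" with type 1, an unnamed clock PMU with -1, "tracepoint" with 2 and "kprobe" with -1 leaves kprobe with type 6 in this model, and toy_pmu_find(&r, 2) returns the tracepoint entry.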
3. initcall
The PMU registrations performed through initcalls include:
early_initcall(init_hw_perf_events);
arch_initcall(bts_init);
arch_initcall(pt_init);
device_initcall(perf_event_sysfs_init);
device_initcall(amd_ibs_init);
device_initcall(amd_iommu_pc_init);
device_initcall(msr_init);
device_initcall(amd_uncore_init);
They are briefly described below
3.1. init_hw_perf_events
This is the hardware perf initialization, which depends on the architecture and vendor. We only look at Intel here. It is worth noting that the kernel also supports Zhaoxin, a Chinese x86 vendor; this article is only about the technology, so the Zhaoxin-related code is not covered.
static int __init init_hw_perf_events(void)
{
    intel_pmu_init();
    pmu_check_apic();
    perf_events_lapic_init(); /* Register PMU as NMI interrupt */
    register_nmi_handler(NMI_LOCAL, perf_event_nmi_handler, 0, "PMI"); /* Register function */

    pr_info("... version: %d\n", x86_pmu.version);
    pr_info("... bit width: %d\n", x86_pmu.cntval_bits);
    pr_info("... generic registers: %d\n", x86_pmu.num_counters);
    pr_info("... value mask: %016Lx\n", x86_pmu.cntval_mask);
    pr_info("... max period: %016Lx\n", x86_pmu.max_period);
    pr_info("... fixed-purpose events: %d\n", x86_pmu.num_counters_fixed);
    pr_info("... event mask: %016Lx\n", x86_pmu.intel_ctrl);

    perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);
}
early_initcall(init_hw_perf_events);
This function prints the relevant x86 PMU information:
[rongtao@localhost perf]$ dmesg | grep fixed- -A 1 -B 5
[ 0.320139] ... version: 2
[ 0.320140] ... bit width: 48
[ 0.320141] ... generic registers: 4
[ 0.320142] ... value mask: 0000ffffffffffff
[ 0.320143] ... max period: 000000007fffffff
[ 0.320144] ... fixed-purpose events: 3
[ 0.320145] ... event mask: 000000070000000f
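The printed numbers are related by simple bit arithmetic, which the sketch below reproduces. The function names are hypothetical. Placing the fixed-purpose counters from bit 32 upward is an assumption that happens to match the event mask printed above, and reading the max period as a 31-bit all-ones value is inferred from the output rather than from the kernel source.

```c
#include <stdint.h>

/* Assumed bit position of the first fixed-purpose counter in the mask. */
#define FIXED_COUNTER_SHIFT 32

/* All-ones mask covering a counter that is `bits` wide (e.g. 48). */
uint64_t counter_value_mask(unsigned int bits)
{
    if (bits >= 64)
        return UINT64_MAX;
    return (UINT64_C(1) << bits) - 1;
}

/*
 * One enable bit per counter: generic counters occupy the low bits,
 * fixed-purpose counters start at FIXED_COUNTER_SHIFT.
 * Assumes generic < 32 and fixed < 32.
 */
uint64_t counter_enable_mask(unsigned int generic, unsigned int fixed)
{
    uint64_t mask = (UINT64_C(1) << generic) - 1;

    mask |= ((UINT64_C(1) << fixed) - 1) << FIXED_COUNTER_SHIFT;
    return mask;
}
```

With the values from the log, counter_value_mask(48) gives 0000ffffffffffff, counter_value_mask(31) gives 000000007fffffff, and counter_enable_mask(4, 3) gives 000000070000000f.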
perf_event_nmi_handler will be explained separately.
3.2. bts_init
For an introduction to BTS, see "What are Intel LBR, BTS and AET?". In short, BTS (Branch Trace Store) uses cache-as-RAM (CAR) or system DRAM to store more instructions and events. See also Processor Tracing.
Be careful not to confuse it with BTS (Bit Test and Set) in the x86 and amd64 instruction reference.
3.3. pt_init
pt stands for Processor Tracing, which was mentioned in the previous section.
This will be described separately later.
3.4. perf_event_sysfs_init
This is the sysfs initialization of perf_event, done by perf_event_sysfs_init in kernel/events/core.c:
[root@localhost event_source]# pwd
/sys/bus/event_source
[root@localhost event_source]# tree
.
├── devices
│   ├── breakpoint -> ../../../devices/breakpoint
│   ├── cpu -> ../../../devices/cpu
│   ├── kprobe -> ../../../devices/kprobe
│   ├── msr -> ../../../devices/msr
│   ├── power -> ../../../devices/power
│   ├── software -> ../../../devices/software
│   ├── tracepoint -> ../../../devices/tracepoint
│   └── uprobe -> ../../../devices/uprobe
├── drivers
├── drivers_autoprobe
├── drivers_probe
└── uevent
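Each entry under devices is a registered PMU, and each PMU directory exposes a type file holding the numeric id that userspace puts into attr.type. A minimal sketch for reading it follows; the function name read_pmu_type is hypothetical, and the directory is passed in as a parameter so the function can be pointed at /sys/bus/event_source/devices or at a test directory.

```c
#include <stdio.h>

/*
 * Read <devices_dir>/<pmu_name>/type and return the id it contains.
 * Returns -1 if the path is too long, the file cannot be opened, or it
 * does not start with a decimal number.
 */
int read_pmu_type(const char *devices_dir, const char *pmu_name)
{
    char path[256];
    int type = -1;
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path), "%s/%s/type", devices_dir, pmu_name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

On a system like the one shown above, read_pmu_type("/sys/bus/event_source/devices", "kprobe") would return the dynamically assigned id of the kprobe PMU, which can differ between kernels and machines.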
This section will also be introduced in a separate article.
3.5. msr_init
The following is taken from "MSR register of x86 CPU":
MSR (Model Specific Register) is a concept in the x86 architecture. It refers to a family of registers in x86 processors that control CPU operation, switch features on and off, support debugging, trace program execution, monitor CPU performance, and so on.
The precursors of MSRs appeared in the Intel 80386 and 80486 processors. With the Pentium processor, Intel introduced the RDMSR and WRMSR instructions to read and write MSRs, and this is when MSRs were officially introduced. The CPUID instruction was introduced together with RDMSR and WRMSR. It indicates which features a specific CPU provides and therefore whether the MSRs belonging to those features exist. Software can use CPUID to query whether a feature is supported on the current CPU.
Each MSR has an ID, the MSR Index, also known as the MSR address. When RDMSR or WRMSR is executed, the MSR Index is enough for the CPU to identify the target MSR. The index, name and field definitions of each MSR can be found in Volume 4 of the "Intel 64 and IA-32 Architectures Software Developer's Manual".
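From userspace on Linux, an MSR can be read without executing RDMSR directly, through the msr character device: the file offset passed to pread is the MSR Index. The sketch below assumes the msr driver is loaded and the caller has permission to open /dev/cpu/<n>/msr; the function name read_msr is hypothetical.

```c
#define _FILE_OFFSET_BITS 64   /* MSR indexes can exceed a 32-bit signed off_t */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one 64-bit MSR through an msr device node such as /dev/cpu/0/msr.
 * The MSR Index is used as the file offset.
 * Returns 0 on success and -1 on any failure (missing device, no
 * permission, or an index the CPU rejects).
 */
int read_msr(const char *dev_path, uint32_t index, uint64_t *value)
{
    ssize_t n;
    int fd = open(dev_path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = pread(fd, value, sizeof(*value), (off_t)index);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

For example, read_msr("/dev/cpu/0/msr", 0x10, &v) would read the time stamp counter MSR (index 0x10) of CPU 0 when run with sufficient privileges.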
A detailed discussion of MSR will also be in a separate article.
3.6. amd_ibs_init
This performs the APIC initialization for IBS, which is used by perf and oprofile. amd_ibs_init will also be discussed in detail in a separate article, mainly covering perf_event_ibs_init.
3.7. amd_iommu_pc_init
The hardware responsible for DMA remapping is called the IOMMU. By analogy: the MMU is the hardware that virtualizes memory addresses and serves the CPU, while the IOMMU serves I/O devices and is the hardware that virtualizes DMA addresses.
ARM SMMU principle and IOMMU Technology ("VT-d" DMA, I/O virtualization, memory virtualization)
DMAR (DMA remapping) and IOMMU
How is the kernel boot parameter IOMMU different from INTEL_IOMMU
Improving the startup efficiency of KVM heterogeneous virtual machines: pass through, DMA mapping (VFIO, PCI, IOMMU), virtio balloon, asynchronous DMA mapping and preprocessing
A detailed discussion of IOMMU perf will also be in a separate article, which will cover the perf_pmu_register registration process in init_one_iommu.
3.8. amd_uncore_init
The following is taken from "Introduction to Intel microprocessor Uncore architecture":
Intel uses the term "uncore" for the parts of a microprocessor that are not in the processor core but play an essential role in the processor's performance. The core contains the components involved in executing instructions, including the arithmetic logic unit (ALU), floating point unit (FPU), L1 cache and L2 cache. Uncore functions include the QPI controller, L3 cache, snoop agent pipeline, memory controller and Thunderbolt controller. Other bus controllers, such as PCI-E and SPI, are part of the chipset.
Intel's uncore design has its roots in the northbridge chip. The uncore reorganizes the functions that are critical to the processor core and moves them physically closer to it (onto the processor die; some of them originally sat on the northbridge), which reduces their access latency. Northbridge functions that are less critical to the core, such as the PCI-E controller or the power control unit (PCU), are not integrated into the uncore and remain part of the chipset.
Specifically, the uncore microarchitecture is divided into several modular units. The uncore connects to the processor core through an interface called the Cache Box (CBox), which is also the interface to the Last Level Cache (LLC) and is responsible for managing cache coherency. Internal and external QPI links are managed by physical-layer units called PBox. Connections between the PBox, the CBox and one or more integrated memory controllers (iMC, MBox) are managed by the System Config Controller (UBox) and the Router (RBox).
Removing the serial bus controllers from the uncore further helps performance: the uncore clock (UCLK) can run at a base frequency of 2.66 GHz, with an overclocking limit above 3.44 GHz. The higher clock lowers the latency of core accesses to key functional components such as the memory controller (typically reducing the time for a core access to DRAM by 10 nanoseconds or more).
A detailed discussion of uncore will also be included in a separate article.
4. perf_pmu_register
This is the interface for registering a PMU. It will be detailed in other articles in this series.