Page fault in Linux (AMD64 architecture)

Posted by Alanmoss on Tue, 25 Jan 2022 09:22:43 +0100

Applications and the kernel run in a virtual address space. Once the kernel has started, every access to physical memory through a virtual address must first be translated by the CPU's MMU hardware. The overall logic of accessing physical memory through a virtual address is as follows:

After the kernel has started, an application or the kernel accesses memory.
The access triggers the CPU's MMU hardware, which translates the VA (virtual address) into a PA (physical address).
If the translation succeeds and the physical page being accessed is present, the PA is used to access physical memory.
If the translation fails or the physical page is not present, the CPU raises an exception: #PF, vector number 14.
The CPU looks up the #PF handler registered during kernel startup, asm_exc_page_fault(), and jumps to it; the hardware automatically pushes the #PF error code onto the stack.
The #PF entry asm_exc_page_fault() enters exc_page_fault() with the error code pushed by the hardware; the address that caused the page fault is read from the CR2 register.
Processing then continues: if the #PF occurred on a kernel-space address, do_kern_addr_fault() is called; if it occurred on a user-space address, do_user_addr_fault() handles it.
asm_exc_page_fault() is the entry point of the page fault handler and is written mostly in assembly. Once it has done its work, control passes to exc_page_fault(), which is the C entry point.

MMU hardware address translation

After the kernel has started, whenever an application or the kernel accesses memory through a virtual address, the MMU must translate that virtual address into a physical address. The main translation logic is as follows:

When translating a virtual address to a physical address, the MMU first checks whether the TLB already caches a mapping for it. If it does (TLB hit), go to step 3; if it does not (TLB miss), go to step 2.
In step 3, after the TLB hit, the hardware checks permissions. If the permission check passes, the translation succeeds and yields the physical address.
In step 2, on a TLB miss, the hardware walks the page table level by level in memory (if caching is enabled for a given level, the cache is searched first; refer to Page translation of linux). If the walk succeeds, it proceeds to the permission check.
If the page table walk fails, a #PF exception is raised.
Likewise, if address translation succeeds but the permission check fails (step 7), a #PF is raised.

Hardware related to PAGE FAULT

From Interrupts and exceptions in linux (AMD64 Architecture) _1 we know that the page fault vector number is 14, abbreviated #PF. According to the official AMD64 documentation, the main causes of a #PF are as follows:

The physical page that the translation points to is not present.
After a TLB miss, an entry needed during the page table walk is not present.
On an instruction fetch, the page holding the instruction does not have execute permission.
The access fails the paging-protection checks.
With CR4.PSE = 1 or CR4.PAE = 1, a reserved bit in a page table entry is set to 1, so a #PF is raised during address translation.
A data access to a user-mode address fails the protection key check.

A page-translation-table entry or physical page involved in translating the memory access is not present in physical memory. This is indicated by a cleared present bit (P=0) in the translation-table entry.
An attempt is made by the processor to load the instruction TLB with a translation for a nonexecutable page.
The memory access fails the paging-protection checks (user/supervisor, read/write, or both).
A reserved bit in one of the page-translation-table entries is set to 1. A #PF occurs for this reason only when CR4.PSE=1 or CR4.PAE=1.
A data access to a user-mode address caused a protection key violation

CR2

When a #PF occurs, the hardware automatically saves the faulting virtual address in the CR2 register. On a 32-bit CPU, CR2 holds a 32-bit address; on a 64-bit CPU, it holds the 64-bit virtual address.

Page fault error code

The page fault error code describes the specific cause of the #PF. It is not held in a dedicated register: when a #PF occurs, the hardware automatically pushes it onto the handler's stack, and the handler reads it from there.

The individual bits of the error code have the following meanings:

P (Present): Bit 0. When 0, the page fault was caused by a page that is not present. When 1, the page is present and the fault was caused by a page-protection violation.
R/W (Read/Write): Bit 1. When 0, the page fault was caused by a read from memory. When 1, it was caused by a write to memory.
U/S (User/Supervisor): Bit 2. When 0, the fault was caused by a memory access in supervisor mode (CPL = 0, 1, or 2). When 1, it was caused by a memory access in user mode (CPL = 3).
RSV (Reserved): Bit 3. When 1, a reserved bit in a page table entry was found set to 1 during address translation. When 0, no reserved bit was set.
I/D (Instruction/Data): Bit 4. When 1, the page fault was caused by an instruction fetch. When 0, it was caused by a data access.
PK (Protection Key): Bit 5. When 1, the fault was caused by a protection key violation on a user address (see Memory protection keys for linux kernel MPK features).
SS (Shadow Stack): Bit 6. When 1, the fault was caused by a shadow stack access. This bit is only set when CR4.CET = 1.
RMP: Bit 31. When 1, the #PF was caused by an RMP (Reverse Map Table) violation.

asm_exc_page_fault

#PF interrupt function initialization

Interrupts and exceptions in linux (AMD64 Architecture) _2 describes how the interrupt handlers are set up during kernel initialization. The #PF handler is ultimately registered as:

static const __initconst struct idt_data early_pf_idts[] = {
	INTG(X86_TRAP_PF,		asm_exc_page_fault),
};

The registered handler is asm_exc_page_fault, which is the interrupt entry for #PF.

asm_exc_page_fault definition

The definition of asm_exc_page_fault() is somewhat complex, mainly because it is implemented in a mix of assembly and C. It is declared with the macro DECLARE_IDTENTRY_RAW_ERRORCODE (arch\x86\include\asm\idtentry.h):

DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_PF,	exc_page_fault);

X86_TRAP_PF is the #PF interrupt vector number. The header idtentry.h (arch\x86\include\asm\idtentry.h), which defines DECLARE_IDTENTRY_RAW_ERRORCODE, is included by both assembly and C sources. It therefore has two parts: one that is used when the header is included from C and one that is used when it is included from assembly:

#ifndef __ASSEMBLY__  // C part, used when the header is included from C files
... ...

#define DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)			\
	DECLARE_IDTENTRY_ERRORCODE(vector, func)
... ...

#else /* !__ASSEMBLY__ */
... ...    // Assembly part, used when the header is included from assembly files

#define DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)			\
	DECLARE_IDTENTRY_ERRORCODE(vector, func)

... ...

#endif

vector is the interrupt vector number and func is the handler function name; both variants end up invoking DECLARE_IDTENTRY_ERRORCODE:

#ifndef __ASSEMBLY__ //C language part, referenced by c file

... ...

/**
 * DECLARE_IDTENTRY_ERRORCODE - Declare functions for simple IDT entry points
 *				Error code pushed by hardware
 * @vector:	Vector number (ignored for C)
 * @func:	Function name of the entry point
 *
 * Declares three functions:
 * - The ASM entry point: asm_##func
 * - The XEN PV trap entry point: xen_##func (maybe unused)
 * - The C handler called from the ASM entry point
 *
 * Same as DECLARE_IDTENTRY, but has an extra error_code argument for the
 * C-handler.
 */
#define DECLARE_IDTENTRY_ERRORCODE(vector, func)			\
	asmlinkage void asm_##func(void);				\
	asmlinkage void xen_asm_##func(void);				\
	__visible void func(struct pt_regs *regs, unsigned long error_code)

... ...

#else // Assembly part, used when the header is included from assembly files

... ... 

#define DECLARE_IDTENTRY_ERRORCODE(vector, func)			\
	idtentry vector asm_##func func has_error_code=1

... ...

#endif

The assembly part of DECLARE_IDTENTRY_ERRORCODE mainly emits the asm_##func entry point. With vector = X86_TRAP_PF and func = exc_page_fault, the assembly part expands to:

idtentry X86_TRAP_PF asm_exc_page_fault exc_page_fault has_error_code=1

This generates the asm_exc_page_fault function by invoking the idtentry macro, which is defined in assembly; has_error_code=1 indicates that the hardware pushes an error code for this exception. The C part expands to:

asmlinkage void asm_exc_page_fault(void);
asmlinkage void xen_asm_exc_page_fault(void);
__visible void exc_page_fault(struct pt_regs *regs, unsigned long error_code);

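This mainly declares asm_exc_page_fault, which ends up calling exc_page_fault (asm_exc_page_fault ---> exc_page_fault). asmlinkage marks a function that is called from assembly code; on 32-bit x86 it forces arguments to be passed on the stack, while on x86-64 it does not change the register-based calling convention.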

idtentry macro

The idtentry macro is defined with the assembler's .macro directive in the file arch\x86\entry\entry_64.S. The .macro directive is used as follows:

.macro macname macpara
...
...
.endm

macname is the macro name and macpara is a macro parameter; several parameters can be listed one after another. The idtentry macro is defined as follows:

/**
 * idtentry - Macro to generate entry stubs for simple IDT entries
 * @vector:		Vector number
 * @asmsym:		ASM symbol for the entry point
 * @cfunc:		C function to be called
 * @has_error_code:	Hardware pushed error code on stack
 *
 * The macro emits code to set up the kernel context for straight forward
 * and simple IDT entries. No IST stack, no paranoid entry checks.
 */
.macro idtentry vector asmsym cfunc has_error_code:req
SYM_CODE_START(\asmsym)
	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
	ASM_CLAC

	.if \has_error_code == 0
		pushq	$-1			/* ORIG_RAX: no syscall to restart */
	.endif

	.if \vector == X86_TRAP_BP
		/*
		 * If coming from kernel space, create a 6-word gap to allow the
		 * int3 handler to emulate a call instruction.
		 */
		testb	$3, CS-ORIG_RAX(%rsp)
		jnz	.Lfrom_usermode_no_gap_\@
		.rept	6
		pushq	5*8(%rsp)
		.endr
		UNWIND_HINT_IRET_REGS offset=8
.Lfrom_usermode_no_gap_\@:
	.endif

	idtentry_body \cfunc \has_error_code

_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
.endm

vector is the interrupt vector number, asmsym is the assembly entry point for that vector, cfunc is the C part of the handler, and has_error_code (marked :req, so it must always be supplied) tells whether the hardware pushes an error code; for #PF it is 1. For the #PF interrupt, asmsym is asm_exc_page_fault and cfunc is exc_page_fault. The code between

SYM_CODE_START(\asmsym)

...

SYM_CODE_END(\asmsym)

forms the body of asm_exc_page_fault. Because has_error_code is 1 and the vector is #PF rather than X86_TRAP_BP, both .if blocks are skipped, and the last step of asm_exc_page_fault goes straight to idtentry_body:

idtentry_body \cfunc \has_error_code

This invokes the idtentry_body macro.

idtentry_body macro

The idtentry_body macro is defined as follows:

/**
 * idtentry_body - Macro to emit code calling the C function
 * @cfunc:		C function to be called
 * @has_error_code:	Hardware pushed error code on stack
 */
.macro idtentry_body cfunc has_error_code:req

	call	error_entry
	UNWIND_HINT_REGS

	movq	%rsp, %rdi			/* pt_regs pointer into 1st argument*/

	.if \has_error_code == 1
		movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
	.endif

	call	\cfunc

	jmp	error_return
.endm

Because has_error_code is 1, the hardware has already pushed the error code onto the stack, where it occupies the ORIG_RAX slot of pt_regs. The first movq puts the pt_regs pointer into %rdi (the first argument). The next movq loads the page fault error code from the stack into %rsi (the second argument), and the slot is then overwritten with -1. Finally, call \cfunc enters the C part; for a page fault, that is the exc_page_fault function.

exc_page_fault

The exc_page_fault function is defined in the file arch\x86\mm\fault.c:

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
	unsigned long address = read_cr2();
	bool rcu_exit;

	prefetchw(&current->mm->mmap_lock);

	/*
	 * KVM has two types of events that are, logically, interrupts, but
	 * are unfortunately delivered using the #PF vector.  These events are
	 * "you just accessed valid memory, but the host doesn't have it right
	 * now, so I'll put you to sleep if you continue" and "that memory
	 * you tried to access earlier is available now."
	 *
	 * We are relying on the interrupted context being sane (valid RSP,
	 * relevant locks not held, etc.), which is fine as long as the
	 * interrupted context had IF=1.  We are also relying on the KVM
	 * async pf type field and CR2 being read consistently instead of
	 * getting values from real and async page faults mixed up.
	 *
	 * Fingers crossed.
	 *
	 * The async #PF handling code takes care of idtentry handling
	 * itself.
	 */
	if (kvm_handle_async_pf(regs, (u32)address))
		return;

	/*
	 * Entry handling for valid #PF from kernel mode is slightly
	 * different: RCU is already watching and rcu_irq_enter() must not
	 * be invoked because a kernel fault on a user space address might
	 * sleep.
	 *
	 * In case the fault hit a RCU idle region the conditional entry
	 * code reenabled RCU to avoid subsequent wreckage which helps
	 * debugability.
	 */
	rcu_exit = idtentry_enter_cond_rcu(regs);

	instrumentation_begin();
	handle_page_fault(regs, error_code, address);
	instrumentation_end();

	idtentry_exit_cond_rcu(regs, rcu_exit);
}

address = read_cr2() reads the faulting virtual address from the CR2 register.
prefetchw(&current->mm->mmap_lock) prefetches the cache line holding the current process's mm->mmap_lock in anticipation of a write; it does not take the lock.
kvm_handle_async_pf: handles the KVM asynchronous page fault events that are delivered through the #PF vector.
idtentry_enter_cond_rcu: conditional RCU entry handling.
instrumentation_begin: exc_page_fault is marked noinstr, so it may only call non-instrumentable code; instrumentation_begin() opens a region in which calls to instrumentable code are allowed ( https://lwn.net/Articles/877229/ ).
handle_page_fault: processes the page fault further, taking the error code and the virtual address.
instrumentation_end: closes the region opened by instrumentation_begin().
idtentry_exit_cond_rcu: the exit handling that pairs with idtentry_enter_cond_rcu.

DEFINE_IDTENTRY_RAW_ERRORCODE

The DEFINE_IDTENTRY_RAW_ERRORCODE macro is defined as follows:

/**
 * DEFINE_IDTENTRY_RAW_ERRORCODE - Emit code for raw IDT entry points
 * @func:	Function name of the entry point
 *
 * @func is called from ASM entry code with interrupts disabled.
 *
 * The macro is written so it acts as function definition. Append the
 * body with a pair of curly brackets.
 *
 * Contrary to DEFINE_IDTENTRY_ERRORCODE() this does not invoke the
 * idtentry_enter/exit() helpers before and after the body invocation. This
 * needs to be done in the body itself if applicable. Use if extra work
 * is required before the enter/exit() helpers are invoked.
 */
#define DEFINE_IDTENTRY_RAW_ERRORCODE(func)				\
__visible noinstr void func(struct pt_regs *regs, unsigned long error_code)

The definition of the exc_page_fault function therefore expands to:

__visible noinstr void exc_page_fault(struct pt_regs *regs, unsigned long error_code)

The noinstr attribute on the handler places it in a special section that tracing and debug facilities must not instrument. One reason is that low-level entry code may have to save register state before another exception of the same type can occur and overwrite it:

+Non-instrumentable code - noinstr
+---------------------------------
+
+Low level transition code cannot be instrumented before RCU is watching and
+after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
+instrumentation facilities depend on RCU.
+
+Aside of that many architectures have to save register state, e.g. debug or
+cause registers before another exception of the same type can happen. A
+breakpoint in the breakpoint entry code would overwrite the debug registers
+of the inital breakpoint.
+
+Such code has to be marked with the 'noinstr' attribute. That places the
+code into a special section which is taboo for instrumentation and debug
+facilities.

+In a function which is marked 'noinstr' it's only allowed to call into
+non-instrumentable code except when the invocation of instrumentable code
+is annotated with a instrumentation_begin()/instrumentation_end() pair

handle_page_fault

The main processing in handle_page_fault is as follows:

static __always_inline void
handle_page_fault(struct pt_regs *regs, unsigned long error_code,
			      unsigned long address)
{
	trace_page_fault_entries(regs, error_code, address);

	if (unlikely(kmmio_fault(regs, address)))
		return;

	/* Was the fault on kernel-controlled part of the address space? */
	if (unlikely(fault_in_kernel_space(address))) {
		do_kern_addr_fault(regs, error_code, address);
	} else {
		do_user_addr_fault(regs, error_code, address);
		/*
		 * User address page fault handling might have reenabled
		 * interrupts. Fixing up all potential exit points of
		 * do_user_addr_fault() and its leaf functions is just not
		 * doable w/o creating an unholy mess or turning the code
		 * upside down.
		 */
		local_irq_disable();
	}
}

If fault_in_kernel_space() reports that the page fault address is in kernel space, do_kern_addr_fault is called to handle it.
If the address is in user space, do_user_addr_fault is called.

reference material

.macro.

https://lwn.net/Articles/877229/
