1. Log analysis
[ 3537.282130] PC is at do_page_fault+0x40/0x2e0
[ 3537.282130] LR is at do_translation_fault+0x5c/0xd4
[ 3537.282130] pc : [<ffffffc000095704>] lr : [<ffffffc000095a00>] pstate: 800001c5
[ 3537.282130] sp : ffffffc027b38130
[ 3537.282130] x29: ffffffc027b38130 x28: ffffffc027bdc000
[ 3537.282130] x27: ffffffc0009fa0e9 x26: 0000000000000000
[ 3537.282130] x25: 0000000096000005 x24: 0000000000000025
[ 3537.282130] x23: 0000000000000000 x22: ffffffc027b38390
[ 3537.282130] x21: 00000000000002b0 x20: 00000000000002b0
[ 3537.282130] x19: ffffffc027b38390 x18: 000000000000001e
[ 3537.282130] x17: 00000000000101d0 x16: ffffffc0111dccf4
[ 3537.282130] x15: ffffffc0111dcc04 x14: 0000000000000003
[ 3537.282130] x13: 000000004437411e x12: ffffffc000822000
[ 3537.282130] x11: 0000000000000006 x10: 0000000000000007
[ 3537.282130] x9 : 000000000000000e x8 : 00125bbb859b6f00
[ 3537.282130] x7 : 0000000000000012 x6 : ffffffc000cf77d0
[ 3537.282130] x5 : ffffffc00079e3d8 x4 : ffffffc00079e3d8
[ 3537.282130] x3 : ffffffc0000959a4 x2 : ffffffc027b38390
[ 3537.282130] x1 : 0000000096000005 x0 : 00000000800001c5
do_page_fault stack assembly:
ffffffc0000956c4 <do_page_fault>:
ffffffc0000956c4: a9a87bfd stp x29, x30, [sp,#-384]!
ffffffc0000956c8: 910003fd mov x29, sp
SP and x29 at the crash site:
sp : ffffffc027b38130 x29: ffffffc027b38130 x30 (lr): ffffffc000095a00
x30 holds the caller's return address (LR). The stp instruction stores it at [sp-384+8], which is address ffffffc027b38138 once sp has been updated.
Working back from SP, the data at memory address ffffffc027b38138 is ffffffc000095a00.
The LR saved on the stack matches the LR register at the crash site, which indicates that the CPU register data is intact.
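The slot address can also be cross-checked from the instruction word itself. The following is a minimal sketch, not part of the original analysis: it assumes the standard A64 encoding of the 64-bit pre-indexed stp, and the helper name decode_stp_pre_index is made up for illustration.

```python
def decode_stp_pre_index(insn: int) -> dict:
    """Decode a 64-bit 'stp Xt, Xt2, [Xn, #imm]!' instruction word."""
    # Bits 31..22 of the 64-bit pre-indexed STP encoding are fixed.
    if (insn & 0xFFC00000) != 0xA9800000:
        raise ValueError("not a 64-bit pre-indexed stp")
    imm7 = (insn >> 15) & 0x7F
    if imm7 & 0x40:              # sign-extend the 7-bit immediate
        imm7 -= 0x80
    return {
        "rt": insn & 0x1F,           # first register stored
        "rt2": (insn >> 10) & 0x1F,  # second register stored
        "rn": (insn >> 5) & 0x1F,    # base register (31 = sp)
        "offset": imm7 * 8,          # immediate is scaled by 8 bytes
    }


# do_page_fault prologue word taken from the disassembly above.
prologue = decode_stp_pre_index(0xA9A87BFD)

# sp at the crash site is the value AFTER the pre-index update,
# so x29 sits at [sp] and x30 at [sp + 8].
sp_after = 0xFFFFFFC027B38130
x30_slot = sp_after + 8

print(prologue)
print(hex(x30_slot))
```

The decoded offset is the frame size used in the text, and x30_slot is the address whose contents are compared against LR.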
2. DS5-level analysis
Connect DS5 to the cores in turn, in the order cpu0 --> cpu1 --> cpu2 --> cpu3.
DS5 stops each online CPU in turn and loads vmlinux for each one.
After that, you can view the stack of every CPU and dump the current thread information of each CPU:
info stack
The stack shows that the crash begins when the CPU accesses an illegal address, which raises an el1_sync synchronous exception.
During exception handling, the cause is identified as a data abort taken in EL1,
and control jumps to do_mem_abort to handle the page fault.
do_page_fault then detects that the illegal access happened in kernel space, which results in a panic.
So far the crash scene has been the same across multiple occurrences, and the addr parameter passed in when the problem occurs is fairly random.
The current suspicion is that a pointer in a parameter passed from 32-bit user space to the 64-bit kernel is corrupted, which causes an exception when the CPU accesses that address in kernel space.
The main difficulty at present is that DS5 can only capture the call chain after the el1_sync exception. The SPSR and SP of the CPU before the exception have to be deduced from the assembly:
el1_sync
kernel_entry el=1
sp = sp - (288-240)
sp = sp - (15*16)
x21 = sp + 288
x22 = el1 lr
x23 = el1 spsr
store lr to stack [sp + 240] //LR
store x21 to stack [sp + 240 + 8]
store x22 to stack [sp + 256] //PC
store x23 to stack [sp + 256 + 8]
el1_da:
x2 = sp
do_mem_abort:
x29 --> [sp-176]
x30 --> [sp-176+8]
sp = sp - 176
x29 = sp
do_translation_fault:
x29 --> [sp-48]
x30 --> [sp-48+8]
sp = sp - 48
do_page_fault:
x29 --> [sp-384]
x30 --> [sp-384+8]
sp = sp - 384
Because the crash scene described in sections 1 ~ 3 had already been destroyed, its memory could not be read.
When the problem was reproduced, the following valid data was captured:
#0 arch_counter_get_cntvct() at arch_timer.h:153
#1 __delay(cycles = 24000) at delay.c:31
#2 __const_udelay(xloops = <Value currently has no location>) at delay.c:42
#3 panic(fmt = <Value currently has no location>) at panic.c:187
#4 die(str = <Value currently has no location>, regs = (struct pt_regs*) 0xFFFFFFC0297FC050, err = -1778384891) at traps.c:247
#5 __do_kernel_fault(mm = (struct mm_struct*) 0xFFFFFFC029BF8680, addr = 18446743833205608120, esr = 2516582405, regs = (struct pt_regs*) 0xFFFFFFC0297FC050) at fault.c:102
#6 do_translation_fault(addr = 18446743833205608120, esr = 2516582405, regs = (struct pt_regs*) 0xFFFFFFC0297FC050) at fault.c:362
#7 do_mem_abort(addr = 18446743833205608120, esr = 2516582405, regs = (struct pt_regs*) 0xFFFFFFC0297FC050) at fault.c:459
#8 [el1_sync+0xB0]
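DS5 prints addr, esr and err in decimal, which hides the structure of these values. As a rough sketch (the function name decode_esr is invented here, and the field positions assume the AArch64 ESR_EL1 layout for data aborts), they can be converted and split into fields like this:

```python
MASK32 = 0xFFFFFFFF


def decode_esr(esr: int) -> dict:
    """Split an ESR_EL1 value into the fields relevant to a data abort."""
    esr &= MASK32
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class
        "il": (esr >> 25) & 0x1,    # instruction length bit
        "iss": esr & 0x1FFFFFF,     # instruction specific syndrome
        "dfsc": esr & 0x3F,         # data fault status code (low ISS bits)
    }


# Decimal values copied from frames #4 and #5 of the backtrace above.
addr = 18446743833205608120
esr = 2516582405
err = -1778384891

fields = decode_esr(esr)
print(hex(addr))            # fault address as a 64-bit hex value
print(hex(esr), fields)
print(hex(err & MASK32))    # err is the same ESR shown as a signed 32-bit int
```

Under those assumptions, EC 0x25 is a data abort taken without a change of exception level, and DFSC 0x05 is a level 1 translation fault, which fits the path through do_translation_fault. The EC value also equals the 0x25 held in x24 in the register dumps.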
(1)#11: try_to_wake_up
In #10, the sp change and the stacking of x29/x30 are as follows:
#11-x29 --> [#11-sp - 80]
#11-x30 --> [#11-sp - 80 + 8]
#10-sp = #11-sp - 80 = 0xFFFFFFC0297FC170
#11-x29 = 0xFFFFFFC0297FC1C0 (captured by DS5)
#11-x30 = 0xFFFFFFC0000CEC7C (captured by DS5)
#11-SP = 0xFFFFFFC0297FC1C0
LR = X30 = 0xFFFFFFC0000CEC7C
The assembly code is:
ffffffc0000cea74 <try_to_wake_up>:
... ... ... ...
ffffffc0000cec78: 97ffecd8 bl ffffffc0000c9fd8 <ttwu_stat>
-->ffffffc0000cec7c: 14000020 b ffffffc0000cecfc <try_to_wake_up+0x288>
... ... ... ...
x19 register: 0xFFFFFFC012DD3440
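The relationship between the bl instruction and the LR value found in the frame can be checked mechanically. This is an illustrative sketch only: decode_bl is a hypothetical helper, and it assumes the standard A64 bl encoding (26-bit signed word offset).

```python
MASK64 = (1 << 64) - 1


def decode_bl(pc: int, insn: int) -> tuple:
    """Return (branch target, return address) for a 'bl' at address pc."""
    if (insn & 0xFC000000) != 0x94000000:
        raise ValueError("not a bl instruction")
    imm26 = insn & 0x03FFFFFF
    if imm26 & 0x02000000:       # sign-extend the 26-bit immediate
        imm26 -= 0x04000000
    target = (pc + imm26 * 4) & MASK64   # offset is counted in 4-byte words
    link = (pc + 4) & MASK64             # bl writes pc + 4 into x30
    return target, link


# 'bl ttwu_stat' inside try_to_wake_up, copied from the listing above.
target, link = decode_bl(0xFFFFFFC0000CEC78, 0x97FFECD8)
print(hex(target), hex(link))
```

The target is the entry of ttwu_stat and the link value is the x30 that DS5 captured for #11, so the saved return address is consistent with the code.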
(2)#10: ttwu_stat
#10-cpsr = #9-spsr = 0x00000000800001C5
M[4:0] = 0b00101: AArch64, EL1h, system exception mode
M[0] = 0b1: SP_EL1 is used as SP
#10-sp = #9-sp = 0xFFFFFFC0297FC170
#9-lr = 0xFFFFFFC0000CA014, from which the code location is derived:
ffffffc0000c9fd8 <ttwu_stat>:
ffffffc0000c9fd8: a9bb7bfd stp x29, x30, [sp,#-80]!
ffffffc0000c9fdc: 910003fd mov x29, sp
ffffffc0000c9fe0: a90153f3 stp x19, x20, [sp,#16]
ffffffc0000c9fe4: a9025bf5 stp x21, x22, [sp,#32]
ffffffc0000c9fe8: a90363f7 stp x23, x24, [sp,#48]
ffffffc0000c9fec: f90023f9 str x25, [sp,#64]
ffffffc0000c9ff0: 90006656 adrp x22, ffffffc000d91000 <__key.22563>
ffffffc0000c9ff4: aa0003f3 mov x19, x0
ffffffc0000c9ff8: 9102e2d6 add x22, x22, #0xb8
ffffffc0000c9ffc: aa1e03e0 mov x0, x30
ffffffc0000ca000: 2a0103f8 mov w24, w1
ffffffc0000ca004: 2a0203f7 mov w23, w2
ffffffc0000ca008: 97ff185a bl ffffffc000090170 <_mcount>
ffffffc0000ca00c: b00054b5 adrp x21, ffffffc000b5f000 <cpu_worker_pools+0x440>
ffffffc0000ca010: 940a6f4c bl ffffffc000365d40 <debug_smp_processor_id>
--->ffffffc0000ca014: f8605ad4 ldr x20, [x22,w0,uxtw #3]
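The SPSR value can be unpacked bit by bit to confirm the mode reading given above. The sketch below assumes the AArch64 SPSR_EL1 layout (mode in M[4:0], DAIF masks in bits 9..6, NZCV flags in bits 31..28); decode_spsr is a name chosen for this example.

```python
def decode_spsr(spsr: int) -> dict:
    """Unpack the mode, interrupt mask and flag bits of an AArch64 SPSR."""
    mode = spsr & 0x1F
    return {
        "aarch32": bool(mode & 0x10),   # M[4]: 0 means AArch64 state
        "el": (mode >> 2) & 0x3,        # M[3:2]: exception level
        "sp_elx": bool(mode & 0x1),     # M[0]: 1 means SP_ELx (the 'h' in EL1h)
        "D": bool(spsr & (1 << 9)),     # debug exceptions masked
        "A": bool(spsr & (1 << 8)),     # SError masked
        "I": bool(spsr & (1 << 7)),     # IRQ masked
        "F": bool(spsr & (1 << 6)),     # FIQ masked
        "N": bool(spsr & (1 << 31)),
        "Z": bool(spsr & (1 << 30)),
        "C": bool(spsr & (1 << 29)),
        "V": bool(spsr & (1 << 28)),
    }


state = decode_spsr(0x00000000800001C5)
print(state)
```

For 0x800001C5 this gives AArch64, EL1 with SP_EL1 selected, and A/I/F masked, matching the EL1h interpretation in the text.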
With the CPU in EL1 exception mode, the el1_sync --> el1_da path passes the following X0 register to do_mem_abort:
mrs x0, far_el1 // EL1 fault address register (FAR)
The faulting address in X0 is 0x199999940015B4AC.
In daily testing, this address value turns out to be very random.
The current suspicion is that the exception occurs while the CPU executes ldr x20, [x22,w0,uxtw #3] and accesses the address computed from the registers. After the exception is taken, the saved lr points to the instruction that triggered it.
1). Check x22 register data:
The caller's X22 saved in stack frame #10 is stored at [0xFFFFFFC0297FC170 + 32 + 8] = [0xFFFFFFC0297FC198]; the value captured by DS5 is x22: 0x0000000000000000
The X22 register value held in frame #9, as captured by DS5, is 0xFFFFFFC000D910B8.
First, compute the expected x22 value from the code in #10:
adrp x22, ffffffc000d91000 <__key.22563> // computed X22 = ffffffc000d91000
add x22, x22, #0xb8 // computed X22 = ffffffc000d910b8
The computed x22 value is ffffffc000d910b8, which matches the value stored in #9-x22.
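The adrp result does not have to be taken from the disassembler comment; it can be recomputed from the instruction word and its address. A small sketch, assuming the standard A64 adrp encoding (21-bit signed page offset split into immlo and immhi), with decode_adrp as an illustrative name:

```python
MASK64 = (1 << 64) - 1


def decode_adrp(pc: int, insn: int) -> tuple:
    """Return (destination register, page address) for an 'adrp' at pc."""
    if (insn & 0x9F000000) != 0x90000000:
        raise ValueError("not an adrp instruction")
    immlo = (insn >> 29) & 0x3
    immhi = (insn >> 5) & 0x7FFFF
    imm = (immhi << 2) | immlo
    if imm & (1 << 20):          # sign-extend the 21-bit immediate
        imm -= 1 << 21
    page = ((pc & ~0xFFF) + (imm << 12)) & MASK64   # 4 KiB page arithmetic
    return insn & 0x1F, page


# adrp x22 from the ttwu_stat listing, followed by 'add x22, x22, #0xb8'.
rd, page = decode_adrp(0xFFFFFFC0000C9FF0, 0x90006656)
x22 = (page + 0xB8) & MASK64
print(rd, hex(page), hex(x22))
```

The recomputed x22 equals the value DS5 reports for frame #9, which supports the conclusion that x22 itself is not corrupted.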
2). Check x22,w0,uxtw #3
x22 = 0xffffffc000d910b8
w0 = ((unsigned long)w0)<<3
x22 + w0 = 0x199999940015B4AC ?
Back-calculation:
w0 = 0x199999D3FF3CA3F4 ?
w0>>3 = 0x333333A7FE7947E
#8: CPU register state under el1_sync:
PC 0xFFFFFFC000083C30
SP 0xFFFFFFC0297FC050
W0 0x00001317 //Data exception
W1 0xCBD6EEA0 W2 0x0000000C W3 0xCBD701B6 W4 0x00000001
W5 0x0035EEBC W6 0x00CD2E21 W7 0x2064656C W8 0x20706F74
W9 0x7F7F7F7F W10 0xFEFEFEFF W11 0x7F7F7F7F W12 0x01010101
W13 0x00000038 W14 0xFFFFFFFE W15 0x00000000 W16 0x001E1B30
W17 0x00000000 W18 0x00000000 W19 0x00005DC0 W20 0x001DC004
W21 0x00000001 W22 0x001DC068 W23 0x00000056 W24 0x96000005
W25 0x00D91000 W26 0x00B5F000 W27 0x009FA0E9 W28 0x297FC000
W29 0x297FBE10 W30 0x00353AEC
(3) Combine the stack data saved in EL1 mode with the #8 data to walk through the el1_sync code flow
In EL1 mode:
#9-sp = #8-sp + (15*16) + (288-240) = #8-sp + 288 = 0xFFFFFFC0297FC170
From the code, #8-x21 = #8-sp + 288, and the field value is #8-x21 = 0xFFFFFFC0297FC170.
The code derivation is consistent with the field CPU data,
and the derived #9-sp matches the on-site CPU state, so #9-sp is correct.
#9-lr = [#8-sp+240] = [0xFFFFFFC0297FC050+240] = [0xFFFFFFC0297FC140] = (DS5 memory dump) 0xFFFFFFC0000CA014
#9-el1 lr = [#8-sp + 256] = [0xFFFFFFC0297FC050 + 256] = [0xFFFFFFC0297FC150] = (DS5 memory dump) 0xFFFFFFC0000CA014. From the code, x22 = el1 lr, and the field value #9-x22 = 0xFFFFFFC0000CA014 is consistent with the el1 lr data.
#9-spsr = [0xFFFFFFC0297FC050+256+8] = [0xFFFFFFC0297FC158] = (DS5 memory dump) 0x00000000800001C5. From the code, x23 = el1 spsr, and the field value is x23 = 0x00000000800001C5, so the derived data and the CPU state agree.
In EL1h mode, kernel_entry saves registers X0~X29 as they were before the exception, using the following assembly:
sp = sp - (288-240) // = 0xFFFFFFC0297FC140
push x28, x29 // stp \xreg1,\xreg2,[sp,#-16]!
push x26, x27
push x24, x25
push x22, x23
push x20, x21
push x18, x19
push x16, x17
push x14, x15
push x12, x13
push x10, x11
push x8, x9
push x6, x7
push x4, x5
push x2, x3
push x0, x1
DS5 captures the stack range [#9-sp - 48] ~ [#9-sp - 48 - 240],
that is, memory addresses [0xFFFFFFC0297FC140] ~ [0xFFFFFFC0297FC050].
Mapping the kernel_entry pushes back onto the stack gives the register layout:
x28 --> [sp-16]: 0xFFFFFFC0297FC130 = 0xFFFFFFC0297FC000
x29 --> [sp-8]: 0xFFFFFFC0297FC138 = 0xFFFFFFC0297FC170
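The slot addresses used in this step all follow from one layout: x0..x29 in ascending order from the frame base, then the four values stored at offsets 240 to 264. The sketch below simply restates the offsets from the kernel_entry walkthrough in code; frame_slots and FRAME_SIZE are names invented for the example.

```python
FRAME_SIZE = 288   # bytes reserved by kernel_entry, per the walkthrough


def frame_slots(frame_base: int) -> dict:
    """Map each saved register to its address in the exception frame."""
    slots = {f"x{n}": frame_base + 8 * n for n in range(30)}   # x0..x29
    slots["lr"] = frame_base + 240       # x30 of the interrupted code
    slots["sp"] = frame_base + 248       # x21 = sp before the exception
    slots["pc"] = frame_base + 256       # x22 = el1 lr
    slots["pstate"] = frame_base + 264   # x23 = el1 spsr
    return slots


# #8-sp, which is also the regs pointer shown in the backtrace.
slots = frame_slots(0xFFFFFFC0297FC050)
for name in ("x0", "x28", "x29", "lr", "pc", "pstate"):
    print(f"{name:7s} {slots[name]:#018x}")
print(f"{'old sp':7s} {0xFFFFFFC0297FC050 + FRAME_SIZE:#018x}")
```

Reading the DS5 memory dump at these addresses gives the values quoted in the derivation, and the frame base plus 288 is the sp of the interrupted code.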
The distribution of the obtained register data is as follows:
EL1N:0xFFFFFFC0297FC050: X0 0x00000000FFFFFFC0 X1 0x0000000000000000 X2 0x0000000000000000 X3 0x0000000000000200 X4 0x0000000000000000 X5 0x0000000000000044 X6 0xFFFFFFC000CDB33C
EL1N:0xFFFFFFC0297FC088: X7 0x0000000000000000 X8 0xFFFFFFC000CDB33C X9 0x7F7F7F7F7F7F7F7F X10 0x67531F534F4C4444 X11 0x7F7F7F7F7F7F7F7F X12 0x0101010101010101 X13 0x0000000000000028
EL1N:0xFFFFFFC0297FC0C0: X14 0xFFFFFFFFFFFFFFFF X15 0x0000000000000000 X16 0xFFFFFFC0001E1B30 X17 0x0000000000000000 X18 0x0000000000000000 X19 0xFFFFFFC012DD3440 X20 0x0000000000000001
EL1N:0xFFFFFFC0297FC0F8: X21 0xFFFFFFC000B5F000 X22 0xFFFFFFC000D910B8 X23 0x0000000000000000 X24 0x0000000000000000 X25 0xFFFFFFC000D91000 X26 0xFFFFFFC000B5F000 X27 0xFFFFFFC0009FA0E9
EL1N:0xFFFFFFC0297FC130: X28 0xFFFFFFC0297FC000 X29 0xFFFFFFC0297FC170
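With the saved X0 and X22 from this dump, the address computed by ldr x20, [x22,w0,uxtw #3] can be recalculated and compared with the addr argument in the backtrace. This is a sketch under one assumption that the source does not state explicitly: that the X0 slot holds the value w0 had when the ldr faulted. The helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def ldr_uxtw_address(base: int, w: int, shift: int) -> int:
    """Effective address of 'ldr Xt, [Xn, Wm, uxtw #shift]'."""
    index = w & 0xFFFFFFFF            # uxtw: zero-extend the low 32 bits
    return (base + (index << shift)) & MASK64


def index_for_address(base: int, addr: int, shift: int):
    """Return the 32-bit index that reaches addr, or None if none exists."""
    offset = (addr - base) & MASK64
    if offset % (1 << shift) or (offset >> shift) > 0xFFFFFFFF:
        return None
    return offset >> shift


x22 = 0xFFFFFFC000D910B8   # saved X22 from the dump above
x0 = 0x00000000FFFFFFC0    # saved X0 from the dump above

fault = ldr_uxtw_address(x22, x0, 3)
print(hex(fault))
print(index_for_address(x22, 18446743833205608120, 3))
print(index_for_address(x22, 0x199999940015B4AC, 3))
```

For the reproduced crash, the recalculated address equals addr = 18446743833205608120 from frames #5 to #7, so on that assumption the bad value is the index in w0 rather than the base in x22. For the earlier address 0x199999940015B4AC, no zero-extended 32-bit index shifted by 3 reaches it from this base, which is consistent with the question marks in the back-calculation.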
(4)
el1_sync-->kernel_entry el=1
sp = sp - (288-240)
sp = sp - (15*16) //Stack X0-X29 registers
x21 = sp + 288
x22 = el1 lr
x23 = el1 spsr
store lr to stack [sp + 240] //LR
store x21 to stack [sp + 240 + 8]
store x22 to stack [sp + 256] //PC
store x23 to stack [sp + 256 + 8]
el1_da:
x2 = sp
Field stack data:
X19 0xFFFFFFC012DD3440 X20 0x0000000000000001
X21 0xFFFFFFC0297FC170 X22 0xFFFFFFC0000CA014
X23 0x00000000800001C5 X24 0x0000000000000025
X25 0xFFFFFFC000D91000 X26 0xFFFFFFC000B5F000
X27 0xFFFFFFC0009FA0E9 X28 0xFFFFFFC0297FC000
X29 0xFFFFFFC0297FC170
PC 0xFFFFFFC000083C30 SP 0xFFFFFFC0297FC050
Code derivation: #8-sp = #7-sp + 176 = 0xFFFFFFC0297FBFA0 + 176 = 0xFFFFFFC0297FC050. The value of #8-sp derived from #7-sp is consistent with the field CPU SP, and #8-sp is consistent with #8-X21 (x21 = sp + 288), so the data is normal.
(5)
data_bad-->do_mem_abort
x29 --> [sp-176]
x30 --> [sp-176+8]
sp = sp - 176
x29 = sp
Field stack data:
X19 0x0000000096000005 X20 0xFFFFFFC800D90EB8
X21 0xFFFFFFC000B70E90 X22 0xFFFFFFC0297FC050
X23 0x00000000800001C5 X24 0x0000000000000025
X25 0xFFFFFFC000D91000 X26 0xFFFFFFC000B5F000
X27 0xFFFFFFC0009FA0E9 X28 0xFFFFFFC0297FC000
X29 0xFFFFFFC0297FBFA0
PC 0xFFFFFFC000081238 SP 0xFFFFFFC0297FBFA0
Code derivation: #7-sp = #6-sp + 48 = 0xFFFFFFC0297FBF70 + 48 = 0xFFFFFFC0297FBFA0. The value of #7-sp derived from #6-sp is consistent with the field CPU SP, and #7-sp is consistent with #7-X29, so the data is normal.
(6)
do_mem_abort-->do_translation_fault
x29 --> [sp-48]
x30 --> [sp-48+8]
sp = sp - 48
x29 = sp
Field stack data:
X19 0xFFFFFFC0297FC050 X20 0xFFFFFFC800D90EB8
X21 0x0000000096000005 X22 0xFFFFFFC029BF8680
X23 0x00000000800001C5 X24 0x0000000000000025
X25 0xFFFFFFC000D91000 X26 0xFFFFFFC000B5F000
X27 0xFFFFFFC0009FA0E9 X28 0xFFFFFFC0297FC000
X29 0xFFFFFFC0297FBF70
PC 0xFFFFFFC000095A64 SP 0xFFFFFFC0297FBF70
Conclusion:
#6-sp = #5-sp + 48 = 0xFFFFFFC0297FBF40 + 48 = 0xFFFFFFC0297FBF70.
The value of #6-sp derived from #5-sp is consistent with the field stack data,
and #6-sp is consistent with #6-X29, so the data is normal.
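The per-frame checks in steps (3) to (6) can be replayed in one pass by adding each frame size back onto the sp of the frame below it. A minimal sketch using only the sizes and sp values quoted in this section; the names FRAME_STEPS and unwind_sp are illustrative:

```python
MASK64 = (1 << 64) - 1

# (step, bytes added back to reach the caller's sp), as quoted in the text.
FRAME_STEPS = [
    ("#5 -> #6", 48),
    ("#6 -> #7", 48),
    ("#7 -> #8", 176),
    ("#8 -> #9", 288),
]


def unwind_sp(sp: int, steps) -> list:
    """Return the sp of each successive caller, starting from sp."""
    chain = [sp]
    for _, size in steps:
        sp = (sp + size) & MASK64
        chain.append(sp)
    return chain


chain = unwind_sp(0xFFFFFFC0297FBF40, FRAME_STEPS)   # start at #5-sp
for frame, value in zip(range(5, 10), chain):
    print(f"#{frame}-sp = {value:#018x}")
```

Each computed value matches the SP that DS5 reports for the corresponding frame, ending at #9-sp = 0xFFFFFFC0297FC170, the stack pointer of ttwu_stat when the exception was taken.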