A 4-node rac of a provincial power system and a 2-node crash in the morning.
Wed Nov 03 07:59:05 2021 SMON: Restarting fast_start parallel rollback Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x98214CA, kgegpa()+40] [flags: 0x0, count: 1] Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2] Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2] Wed Nov 03 07:59:07 2021 PMON (ospid: 100111): terminating the instance due to error 474 Wed Nov 03 07:59:07 2021 System state dump requested by (instance=2, osid=100111 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_diag_100469_20211103075907.trc Instance terminated by PMON, pid = 100111
When SMON initiates parallel rollback again, the instance is terminated by PMON. There is a hidden error, which is often ignored. PMON (ospid: 100111): terminating the instance due to error 474
[oracle@dbtest ~]$ oerr ora 474 00474, 00000, "SMON process terminated with error" // *Cause: The system cleanup process died // *Action: Warm start instance
Description is an instance of PMON terminating when it finds SMON exception. Then look at the SMON trace file:
*** 2021-11-03 07:59:05.420 Parallel Transaction recovery caught error 30317 SMON: Restarting fast_start parallel rollback *** 2021-11-03 07:59:06.283 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x98214CA, kgegpa()+40] [flags: 0x0, count: 1] DDE previous invocation failed before phase II DDE was called in a 'No Invocation Mode' ----- Start Diag Diagnostic Dump ----- Diag diagnostic dump is performed due to an error in the diagfw code during error handling. DDE is switched to protected mode during the diagnostic dump to prevent recursive errors in the error hadnling code. Dump error and call stack for the diagnostic dump: *** 2021-11-03 07:59:06.306 dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0) ----- Error Stack Dump ----- *** 2021-11-03 07:59:06.306 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2] Not recording an ORA-7445 (previous attmpet failed) Warning! DDE is invoked in protected mode. DDE call is aborted. Warning! DDE is invoked in protected mode. DDE call is aborted. *** 2021-11-03 07:59:06.308 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2] Not recording an ORA-7445 (previous attmpet failed) ssexhd: crashing the process... Background_Core_Dump = partial ksdbgcra: writing core file to directory '/u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/cdump'
When SMON restarts the rollback, the process is crash ed after the exception. Finally, PMON terminates the instance. Continue to check the alert log and find that after some exceptions are reported from 6:52:46, blockrecover starts with some ora-00600[4193] errors.
Wed Nov 03 06:51:58 2021 Archived Log entry 220723 added for thread 2 sequence 33832 ID 0xbc36136c dest 1: Wed Nov 03 06:52:46 2021 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xE559064] [PC:0x98214CA, kgegpa()+40] [flags: 0x0, count: 1] Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xE559064] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2] Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xE559064] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2] Wed Nov 03 06:52:48 2021 Block recovery from logseq 33833, block 2030926 to scn 16125071688907 Recovery of Online Redo Log: Thread 2 Group 8 Seq 33833 Reading mem 0 Mem# 0: +FDATADG/newepmfdp/onlinelog/group_8.278.1083929519 Block recovery completed at rba 33833.2031098.16, scn 3754.1764459724 Block recovery from logseq 33832, block 4457757 to scn 16125071688920 Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0 Mem# 0: +FDATADG/newepmfdp/onlinelog/group_7.279.1083929515 Block recovery interrupted at rba 33832.4613427.16 Block recovery from logseq 33832, block 4613427 to scn 16125071688920 Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0 Mem# 0: +FDATADG/newepmfdp/onlinelog/group_7.279.1083929515 Block recovery interrupted at rba 33832.4769190.16 Block recovery from logseq 33832, block 4769190 to scn 16125071688920 ... Wed Nov 03 07:00:04 2021 Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc (incident=320530): ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Nov 03 07:06:09 2021 Sweep [inc][320530]: completed Wed Nov 03 07:06:16 2021 Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc: ORA-12751: cpu time or run time policy violation ORA-12751: cpu time or run time policy violation ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], [] Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc: ORA-12751: cpu time or run time policy violation ORA-12751: cpu time or run time policy violation ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], [] Wed Nov 03 07:06:22 2021 Dumping diagnostic data in directory=[cdmp_20211103070622], requested by (instance=2, osid=84504 (M000)), summary=[incident=320530]. Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc: ORA-12751: cpu time or run time policy violation ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], [] Wed Nov 03 07:06:28 2021 Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc: ORA-12751: cpu time or run time policy violation ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], [] Wed Nov 03 07:11:57 2021 Sweep [inc2][320530]: completed Wed Nov 03 07:31:57 2021 Suspending MMON action 'undo usage' for 82800 seconds
For ora-00600[4193], I'm too familiar with recovery. However, it is generally encountered when the database is forced to open due to abnormal power failure. The normal operation environment is very rare. Its meaning is as follows:
Meaning: seq# mismatch while adding an undo record to an undo block. This is done by the application of redo. --------------------------------------------------------------------------- Argument Description: a. (ktubhseq): undo record seq# - this is the seq# of the block that this undo record WILL BE APPLIED TO. This is from the Undo Block. It is NOT the seq# of the undo block itself. b. (ktudbseq): redo RECORD seq# - this is the seq# number in the block that this redo WILL BE APPLIED TO. This is from the Redo Record. When we try to apply redo to an undo block (forward changes are made by the application of redo to a block) we check that the seq# in the undo record matches the seq# in the redo record. These seq# should be the same because when we apply a redo record we must apply it to the correct version of the block. We can only apply a redo record to a block that contains the same seq# as in the redo record. If the seq# do not match then this error is raised. This implies some kind of block corruption in either the redo or the undo block. ksesic2(OERI(4193), ksenrg(ubh->ktubhseq), ksenrg(db->ktudbseq)); struct ktubh { kxid ktubhxid; /* txid of tx currently using or last used this block */ ub2 ktubhseq; /* undo block sequence number */ ub1 ktubhcnt; /* high water mark record index, number of undo entries */ ub1 ktubhirb; /* rollback record index, rec index to start the rollback */ ub1 ktubhicl; /* collecting record index, rec index to start retrieving col info */ ub1 ktubhflg; /* dummy */ ub2 ktubhidx[1]; /* byte offset of record in block, grows at runtime */ }; struct ktudb Kernel Transaction Undo Data operation Block (redo) { ub2 ktudbsiz; /* size of entry */ ub2 ktudbspc; /* verification: space left in undo block */ ub2 ktudbflg; /* flag to indicate the kind of redo operation */ kxid ktudbxid; /* current tx id */ ub2 ktudbseq; /* block sequence number */ ub1 ktudbrec; /* new record index for this change */ };
The essence is that ktubhseq and ktudbseq are inconsistent. If ktudbrec and ktubhcnt are inconsistent, it is ORA-00600[4194]. ktubh is the structure in undo block, and ktudb is the information that will be recorded when redo records undo block operations.
Undo Data struct kcbh, 20 bytes @0 ub1 type_kcbh @0 ub1 frmt_kcbh @1 ub1 spare1_kcbh @2 ub1 spare2_kcbh @3 ub4 rdba_kcbh @4 ub4 bas_kcbh @8 ub2 wrp_kcbh @12 ub1 seq_kcbh @14 ub1 flg_kcbh @15 ub2 chkval_kcbh @16 ub2 spare3_kcbh @18 struct ktubh, 142 bytes @20 struct ktubhxid, 8 bytes @20 ub2 ktubhseq @28 ub1 ktubhcnt @30 ub1 ktubhirb @31 ub1 ktubhicl @32 ub1 ktubhflg @33 ub2 ktubhidx[64] @34 ub1 freespace[38] @162 ub1 undodata[7988] @200 ub4 tailchk @8188
Back to the case, you can see the information of ktubh and ktudb from trace.
ktudb records are as follows:
xid: 0x750f.003.0093843a uba: 0x1cd49528.ddee.01 seq: 0xddee rec: 0x02 *** ktuc_diag_dmp: dump of current change vector ktudb redo: siz: 232 spc: 8068 flg: 0x0012 seq: 0xddee rec: 0x02 xid: 0x750f.003.0093843a ktubl redo: slt: 3 rci: 0 opc: 11.1 [objn: 6729 objd: 6729 tsn: 1] Undo type: Regular undo Begin trans Last buffer split: No Temp Object: No Tablespace Undo: No 0x00000000 prev ctl uba: 0x1cd49528.ddee.01 prev ctl max cmt scn: 0x0eaa.692b5862 prev tx cmt scn: 0x0eaa.692b58c4 txn start scn: 0xffff.ffffffff logon user: 0 prev brb: 0 prev bcl: 0 BuExt idx: 0 flg2: 0 KDO undo record: KTB Redo op: 0x04 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: L itl: xid: 0x7607.013.00713adb uba: 0x1d001423.538a.15 flg: C--- lkc: 0 scn: 0x0eaa.68c33db2 KDO Op code: URP row dependencies Disabled xtype: XA flags: 0x00000000 bdba: 0x008011c0 hdba: 0x00801a02 itli: 1 ispac: 0 maxfr: 4858 tabn: 0 slot: 78(0x4e) flag: 0x2c lock: 0 ckix: 23 ncol: 10 nnew: 1 size: 0 col 8: [ 6] c5 02 0b 2a 4d 5c
The undo block ktudb information pointed to by uba is as follows:
xid: 0x750d.009.00abc7f4 seq: 0xd24 cnt: 0x4d ******************************************************************************** UNDO BLK: xid: 0x750d.009.00abc7f4 seq: 0xd24 cnt: 0x4d irb: 0x1 icl: 0x0 flg: 0x0000 Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset --------------------------------------------------------------------------- 0x01 0x1f8c 0x02 0x1f18 0x03 0x1ebc 0x04 0x1e48 0x05 0x1dec 0x06 0x1d78 0x07 0x1d1c 0x08 0x1cc0 0x09 0x1c64 0x0a 0x1bf0 0x0b 0x1b94 0x0c 0x1b20 0x0d 0x1ac4 0x0e 0x1a50 0x0f 0x19f4 0x10 0x1980 0x11 0x1924 0x12 0x18b0 0x13 0x1854 0x14 0x17e0 0x15 0x1784 0x16 0x1710 0x17 0x16b4 0x18 0x1640 0x19 0x15e4 0x1a 0x1570 0x1b 0x1514 0x1c 0x14a0 0x1d 0x144c 0x1e 0x13d8 0x1f 0x137c 0x20 0x1320 0x21 0x12c4 0x22 0x1250 0x23 0x11f4 0x24 0x1180 0x25 0x1124 0x26 0x10b0 0x27 0x1054 0x28 0x0fe0 0x29 0x0f84 0x2a 0x0f10 0x2b 0x0eb4 0x2c 0x0e40 0x2d 0x0de4 0x2e 0x0d70 0x2f 0x0d14 0x30 0x0ca0 0x31 0x0c44 0x32 0x0be0 0x33 0x0b84 0x34 0x0b10 0x35 0x0ab4 0x36 0x0a40 0x37 0x09e4 0x38 0x0970 0x39 0x0914 0x3a 0x08a0 0x3b 0x0844 0x3c 0x07d0 0x3d 0x0774 0x3e 0x0700 0x3f 0x06a4 0x40 0x0630 0x41 0x05d4 0x42 0x0560 0x43 0x0504 0x44 0x0490 0x45 0x0434 0x46 0x03c0 0x47 0x0364 0x48 0x02f0 0x49 0x0294 0x4a 0x0220 0x4b 0x01c4 0x4c 0x0150 0x4d 0x00f4
Here we can see that there is already an active transaction in the undo block, which is not only the problem of seq inconsistency, but also a more serious problem is that a new transaction actually finds an undo block with active transactions to use. We know that the undo segment header will pre allocate up to 5 blocks to the free block pool as preparation for the next transaction. When it is necessary to allocate undo blocks, we will look for the blocks in the free block pool.
Look at the undo segment header block pointed to by the error reporting transaction (the segment header number is 0x750f):
TRN CTL:: seq: 0xddee chd: 0x0003 ctl: 0x001d inc: 0x00000000 nfb: 0x0001 mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe) uba: 0x1cd49528.ddee.01 scn: 0x0eaa.692b5862 Version: 0x01 FREE BLOCK POOL:: uba: 0x1cd49528.ddee.01 ext: 0x16a6 spc: 0x1f84 uba: 0x00000000.ddc3.3e ext: 0x167b spc: 0x4e2 uba: 0x00000000.dc0b.4b ext: 0x47ab spc: 0x320 uba: 0x00000000.7727.48 ext: 0xd13 spc: 0x326 uba: 0x00000000.0000.00 ext: 0x0 spc: 0x0 TRN TBL:: index state cflags wrap# uel scn dba parent-xid nub stmt_num cmt ------------------------------------------------------------------------------------------------ 0x00 9 0x00 0x9416bd 0x0014 0x0eaa.692b7f1d 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564 0x01 9 0x00 0x93546c 0x000f 0x0eaa.692b5bde 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563 0x02 9 0x00 0x9454fb 0x000b 0x0eaa.692b622a 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563 0x03 9 0x00 0x93842a 0x0019 0x0eaa.692b58c4 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563 0x04 9 0x00 0x932b09 0x0007 0x0eaa.692b7254 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564 0x05 9 0x00 0x931008 0x0010 0x0eaa.692b5d5f 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563 0x06 9 0x00 0x948b77 0x0021 0x0eaa.692b59f4 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563 0x07 9 0x00 0x93c1e6 0x0009 0x0eaa.692b7540 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564 0x08 9 0x00 0x93c4b5 0x000d 0x0eaa.692b5d8a 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563 0x09 9 0x00 0x94ac14 0x0000 0x0eaa.692b79f1 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564 0x0a 9 0x00 0x944b33 0x0012 0x0eaa.692b704e 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893564 0x0b 9 0x00 0x9453f2 0x000c 0x0eaa.692b667a 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563 0x0c 9 0x00 0x9468e1 0x001e 0x0eaa.692b6a9d 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563 0x0d 9 0x00 0x94d5e0 0x0020 0x0eaa.692b5f85 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1635893563
You can see that a block in the free block pool is exactly 0x1cd49528. To sum up, the undo block 0x1cd49528 has actually been allocated to the undo segment header 0x750d, and there are already transactions XID: 0x750d 009.00abc7f4 is using the undo block, but the free block pool in the undo section header 0x750f does not clear the block, resulting in ORA-00600[4193].
Generally, the problem of transaction exceptions will trigger the blockrecover operation. The blockrecover occurs at 6:52:46. Is this where the problem starts? PMON trace shows that blockrecover starts from 6:52:49, and file 2608 block 144 is the block address of undo segment header 0x750f:
*** 2021-11-03 06:52:49.027 Block recovery interrupted at rba 33832.4613427.16 Resuming block recovery (PMON) for file 2608 block 144 Block recovery from logseq 33832, block 4613427 to scn 16125071688920 *** 2021-11-03 06:52:49.031 Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0 Block recovery interrupted at rba 33832.4769190.16 Resuming block recovery (PMON) for file 2608 block 144 Block recovery from logseq 33832, block 4769190 to scn 16125071688920 ...
It can be seen from some foreground process trace that the transaction TX-750F0003-0093843A at 6:52:51 can no longer be carried out. The process was smashed.
*** 2021-11-03 06:52:51.294 *** SESSION ID:(2049.549) 2021-11-03 06:52:51.294 *** CLIENT ID:() 2021-11-03 06:52:51.294 *** SERVICE NAME:(SYS$USERS) 2021-11-03 06:52:51.294 *** MODULE NAME:(OGG-REPYCJ-OPEN_DATA_SOURCE) 2021-11-03 06:52:51.294 *** ACTION NAME:() 2021-11-03 06:52:51.294 ksqgtl:newtx: res: 321161d7b0 TX-750F0003-0093843A flag: 0x47 *** 2021-11-03 06:52:52.294 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x98214CA, kgegpa()+40] [flags: 0x0, count: 1] DDE previous invocation failed before phase II DDE was called in a 'No Invocation Mode' ----- Start Diag Diagnostic Dump ----- Diag diagnostic dump is performed due to an error in the diagfw code during error handling. DDE is switched to protected mode during the diagnostic dump to prevent recursive errors in the error hadnling code. Dump error and call stack for the diagnostic dump: *** 2021-11-03 06:52:52.319 dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0) ----- Error Stack Dump ----- *** 2021-11-03 06:52:52.319 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2] Not recording an ORA-7445 (previous attmpet failed) Warning! DDE is invoked in protected mode. DDE call is aborted. Warning! DDE is invoked in protected mode. DDE call is aborted. *** 2021-11-03 06:52:52.321 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x505E9323] [PC:0x981FF3E, kgebse()+776] [flags: 0x2, count: 2] Not recording an ORA-7445 (previous attmpet failed) ssexhd: crashing the process... Shadow_Core_Dump = PARTIAL
The transaction id is the same as ORA-00600[4193] which reported an error in the previous analysis. Note from 6:52:46, the undo segment header 0x750F cannot bind the transaction. The reason is that the undo block in the free block pool does not belong to the undo segment and is replaced by transaction 0x750d 009.00abc7f4 used. This can be verified by redodump.
REDO RECORD - Thread:2 RBA: 0x008429.001ef4a5.0184 LEN: 0x00bc VLD: 0x01 SCN: 0x0eaa.692b839b SUBSCN: 6 11/03/2021 06:52:46 CHANGE #1 TYP:0 CLS:59949 AFN:2608 DBA:0x8c800090 OBJ:4294967295 SCN:0x0eaa.692b832d SEQ:1 OP:14.4 ENC:0 RBL:0 kteop redo - redo operation on extent map SETSTAT: exts:24996 blks:219215 lastmap:0x8c91f248 mapcnt:56 UPDXNT: extent:5798 add:TRUE CHANGE #2 TYP:0 CLS:7 AFN:2608 DBA:0x8c859230 OBJ:4294967295 SCN:0x0eaa.692b832d SEQ:1 OP:14.4 ENC:0 RBL:0 kteop redo - redo operation on extent map ADD: dba:0x1cd49528 len:8 at offset:1 ADDRET: offset:1 ctime:0 CHANGE #3 TYP:0 CLS:13 AFN:115 DBA:0x1cc00002 OBJ:4294967295 SCN:0x0eaa.692b839b SEQ:1 OP:22.2 ENC:0 RBL:0 ktfbhredo - File Space Header Redo: Header Opcode: Save: No Pending Op ... REDO RECORD - Thread:2 RBA: 0x008429.001efbb0.0050 LEN: 0x00bc VLD: 0x01 SCN: 0x0eaa.692b8484 SUBSCN: 4 11/03/2021 06:52:46 CHANGE #1 TYP:0 CLS:59945 AFN:2608 DBA:0x8c800080 OBJ:4294967295 SCN:0x0eaa.692b8427 SEQ:1 OP:14.4 ENC:0 RBL:0 kteop redo - redo operation on extent map SETSTAT: exts:25381 blks:221582 lastmap:0x8c8d46a8 mapcnt:57 UPDXNT: extent:13691 add:TRUE CHANGE #2 TYP:0 CLS:7 AFN:115 DBA:0x1cd3aa10 OBJ:4294967295 SCN:0x0eaa.692b8425 SEQ:1 OP:14.4 ENC:0 RBL:0 kteop redo - redo operation on extent map ADD: dba:0x1cd49528 len:8 at offset:189 ADDRET: offset:189 ctime:0 CHANGE #3 TYP:0 CLS:13 AFN:115 DBA:0x1cc00002 OBJ:4294967295 SCN:0x0eaa.692b8484 SEQ:1 OP:22.2 ENC:0 RBL:0 ktfbhredo - File Space Header Redo: Header Opcode: Save: No Pending Op ... REDO RECORD - Thread:2 RBA: 0x008429.001efd6f.0010 LEN: 0x02a4 VLD: 0x01 SCN: 0x0eaa.692b84b8 SUBSCN: 6 11/03/2021 06:52:46 CHANGE #1 TYP:0 CLS:59946 AFN:115 DBA:0x1cd49528 OBJ:4294967295 SCN:0x0eaa.692b84b8 SEQ:3 OP:5.1 ENC:0 RBL:0 ktudb redo: siz: 92 spc: 6690 flg: 0x0022 seq: 0x0d24 rec: 0x0f xid: 0x750d.009.00abc7f4 ktubu redo: slt: 9 rci: 14 opc: 11.1 objn: 6049924 objd: 6049924 tsn: 9 Undo type: Regular undo Undo type: Last buffer split: No Tablespace Undo: No 0x00000000 KDO undo record: KTB Redo op: 0x02 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: C uba: 0x1cd49528.0d24.07 KDO Op code: DRP row dependencies Disabled xtype: XA flags: 0x00000000 bdba: 0x45707392 hdba: 0x09ab6222 itli: 2 ispac: 0 maxfr: 4858 tabn: 0 slot: 20(0x14)
A series of operations of the undo block were completed at 06:52:46 on 11 / 03 / 2021, which obviously does not comply with the allocation principle of the undo block. It is suspected to be a bug, but mos did not find a similar document:
SCN: 0x0eaa.692b839b, the undo block 0x1cd49528 is assigned to the undo segment whose header address is 0x8c800090 (the segment header number is 0x750f)
SCN: 0x0eaa.692b8484, the undo block 0x1cd49528 is allocated to the undo segment with the undo segment header address of 0x8c800080 (the segment header number is 0x750d)
SCN: 0x0eaa.692b84b8, transaction 0x750d 009.00abc7f4 uses this undo block
undo block allocation principle:
1.If the current extent has more free blocks, return the next free block. 2.Otherwise if the next extent has expired, wrap into the next extent and return the first block in that extent. 3.Otherwise get space from the Undo tablespace. If a free extent is available, allocate the extent to the transaction table and return the first block in the new extent. 4.Steal from an offlined transaction table. If a free extent is available, deallocate it from the offlined transaction table and add it to the current transaction table. Return the first free block from the new extent. 5.Steal from an onlined transaction table. The procedure is similar to (4). 6.Extend the file in the undo tablespace.If the files can be extended, add an extent to the current transaction table and return a block to the requestor 7.Reuse unexpired extents from own transaction table. If all extents are currently busy (i.e.contain uncommitted undo), go to the next step. Otherwise wrap into the next extent 8.Steal unexpired extents from offlined transaction tables. If fail try the onlined transaction tables. 9.If all else fails, report the ORA-1562: "Failed to extend rollback segment number %s" error. Note:Steps 7 and 8 are skipped if the undo retention guarantee (10g and higher) is specified.
To sum up, the solutions we get are as follows:
undo instance of this tablespace
Delete the abnormal undo segment (the segment header is 0x750f)
bbed clear the free block pool of undo segment (segment number 0x750f)
After analyzing the cause and solution of the fault, there is another abnormal problem. Why are undo blocks continuously allocated to two undo segments at almost the same time? When undo blocks are allocated to other undo segments, the free block pool of the previous undo segment is not cleared? According to the awr at that time, there was serious undo contention, and undo ext steel was also very frequent. Looking forward to alert, we found frequent ORA-01628:
undo global data ktusm_stealext_2 0 1,029,644 827,502 undo global data ktusmupst: KSLBEGIN 0 44,653 83,762 Wed Nov 03 06:38:58 2021 Archived Log entry 220714 added for thread 2 sequence 33828 ID 0xbc36136c dest 1: Wed Nov 03 06:39:51 2021 ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$ ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$ ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$ ...... ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$ ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30353_261227539$ ORA-1628: max # extents 32765 reached for rollback segment Wed Nov 03 06:41:12 2021 ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30292_4090763577$ Wed Nov 03 06:41:27 2021 Thread 2 advanced to log sequence 33830 (LGWR switch) Current log# 11 seq# 33830 mem# 0: +FDATADG/newepmfdp/onlinelog/group_11.305.1083929527 Wed Nov 03 06:41:31 2021 Archived Log entry 220716 added for thread 2 sequence 33829 ID 0xbc36136c dest 1: Wed Nov 03 06:41:48 2021 ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30343_1212715771$ Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_j002_77822.trc: ORA-12012: error on auto execute of job 1230 ORA-12008: error in materialized view refresh path ORA-01628: max # extents (32765) reached for rollback segment _SYSSMU30343_1212715771$ ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2809 ORA-06512: at "SYS.DBMS_SNAPSHOT", line 3025 ORA-06512: at "SYS.DBMS_IREFRESH", line 689 ORA-06512: at "SYS.DBMS_REFRESH", line 195 ORA-06512: at line 1 Wed Nov 03 06:43:45 2021 ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30359_11512129$ Wed Nov 03 06:43:45 2021 ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU30359_11512129$
I suspect that the frequent expansion, contraction and stealing of undo extent s lead to the disorder of undo block allocation. Since the normal load of the system is not analyzed, here are some rough follow-up suggestions for the time being:
Undo is set to undo retention guarantee
Reasonably set undo related parameters
Reasonably design transactions to avoid large transactions
Analyze dba_undo_extents, there may be too many very small undo extent s
Original link of ink Sky Wheel: https://www.modb.pro/db/17514... (copy link to browser)
About the author
Li Xiangyu, delivery technical consultant of Yunhe enmo West District, has long served customers in the mobile operator industry, and is familiar with Oracle Performance Optimization, fault diagnosis and special recovery.