Oracle case: a gc buffer busy acquire diagnosis

Posted by gat on Wed, 23 Feb 2022 04:03:01 +0100

This case comes from a production failure of a two node rac of a customer. The phenomenon is that a large area of gc buffer busy acquire leads to business paralysis.
First, check the AWR header information and load profile of node 1:


1 node AWR
Key information points obtained:

For LCPU 256 system, AAS=13379.42/59.91=223, indicating that the system is very busy or has encountered abnormal waiting.

sessions grew several times abnormally, and DB CPU/DB Time accounted for a very low proportion, indicating that an abnormal wait was encountered.

Other indicators are fairly normal.

Since you are waiting for an exception, take a look at the top event section:

1 node AWR

You can see that a large number of wait class es are cluster session s and top event s also see a large number of gc buffer busy acquire wait events. This wait event is very common and will not be explained separately. According to rough calculation, cluster wait events account for about 90% of dbtime.

When you encounter a large number of Cluster waiting events, you must first look at RAC Statistics:

1 node AWR

It can be seen that there are not many blocks and message s transmitted per second, and the traffic is not large, so it is completely unnecessary to check the SQL ordered by Cluster Wait Time. Continue to look down and find that Avg global cache cr block receive time (ms) is too high, reaching 1473 Ms. It is judged that a large number of cluster waits this time are caused by the slow sending of Cr blocks from remote instances.

gc cr block receive time = gc cr block (flush time + build time + send time)
It can be seen from the formula that there is a problem with the gc cr block flush time /build time/send time of the remote instance, so you need to look at the RAC Statistics of the 2-node AWR at this time:

2-node AWR

You can see that Avg global cache cr block flush time (ms) is very high. There are many introductions about the behavior of current block flush redo, which will not be explained here. The behavior of cr block flush redo is usually generated when the cr block needs to be constructed from the current block of the remote instance.

Normally CR block buffer processing does not include the 'gc cr block flush time'. However, when a CR buffer is cloned from a current buffer that has redo pending, a log flush for a CR block transfer is required. A high percentage is indicative of hot blocks with frequent read after write access.
For current/cr flush time, there are usually two possibilities:

LGWR has poor write performance;

LGWR blocked.

So the next step is to directly look at the Background Wait Events and Wait Event Histogram of 2-node AWR to see how the write performance of LGWR is and whether it is stable.

2-node AWR


2-node AWR

It can be seen that the write performance of 2-node LGWR is very stable and the delay is normal. Then it is not the problem of LGWR write performance. It is likely that LGWR is blocked.

2-node AWR

Through the top event, we can see that node 2 has a lot of waiting for log switching to complete, indicating that LGWR is indeed blocked. At this time, it is time to analyze ash. You can directly filter other information to view the ash information of LGWR separately.

SQL> select to_char(sample_time,'yyyy-mm-dd hh24:mi:ss'),program,session_id,event,seq#,BLOCKING_SESSION,BLOCKING_INST_ID from m_ash where program like '%LGWR%' and inst_id=2 order by 1;
TO_CHAR(SAMPLE_TIME PROGRAM                        SESSION_ID EVENT                                SEQ# BLOCKING_SESSION BLOCKING_INST_ID
------------------- ------------------------------ ---------- ------------------------------ ---------- ---------------- ----------------
...
2020-03-07 15:44:40 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:41 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:42 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:43 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:44 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:45 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:46 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:47 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:48 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:49 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:50 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:51 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:52 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:53 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:54 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:55 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:56 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:57 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:58 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
...

It can be seen that from 15:44:40, LGWR began to wait for the CF queue and lasted for a very long time. The blocking session is the session with sid 2224 of node 1. Combined with the previous analysis of AWR, it is speculated that the log switching was carried out at 15:44:40, because the log switching needs to read the control file. Now let's take a look at the information of blocking session.

SQL> select to_char(sample_time,'yyyy-mm-dd hh24:mi:ss'),program,session_id,event,seq#,BLOCKING_SESSION,BLOCKING_INST_ID from m_ash where session_id=2224 and inst_id=1 order by 1;
TO_CHAR(SAMPLE_TIME PROGRAM                   SESSION_ID EVENT                                SEQ# BLOCKING_SESSION BLOCKING_INST_ID
------------------- ------------------------- ---------- ------------------------------ ---------- ---------------- ----------------
...
2020-03-07 15:37:41 oracle@sxdb01 (M000)            2224 control file sequential read          479
2020-03-07 15:37:42 oracle@sxdb01 (M000)            2224                                       566
2020-03-07 15:37:43 oracle@sxdb01 (M000)            2224 control file sequential read          632
2020-03-07 15:37:44 oracle@sxdb01 (M000)            2224 control file sequential read          684
2020-03-07 15:37:45 oracle@sxdb01 (M000)            2224                                       736
2020-03-07 15:37:46 oracle@sxdb01 (M000)            2224                                       781
2020-03-07 15:37:47 oracle@sxdb01 (M000)            2224                                       824
2020-03-07 15:37:48 oracle@sxdb01 (M000)            2224                                       865
2020-03-07 15:37:49 oracle@sxdb01 (M000)            2224                                       915
2020-03-07 15:37:50 oracle@sxdb01 (M000)            2224                                      1031
2020-03-07 15:37:51 oracle@sxdb01 (M000)            2224                                      1183
2020-03-07 15:37:52 oracle@sxdb01 (M000)            2224 control file sequential read         1304
2020-03-07 15:37:53 oracle@sxdb01 (M000)            2224                                      1400
2020-03-07 15:37:54 oracle@sxdb01 (M000)            2224                                      1481
2020-03-07 15:37:55 oracle@sxdb01 (M000)            2224 control file sequential read         1631
2020-03-07 15:37:56 oracle@sxdb01 (M000)            2224 control file sequential read         1834
2020-03-07 15:37:57 oracle@sxdb01 (M000)            2224 control file sequential read         1947
2020-03-07 15:37:58 oracle@sxdb01 (M000)            2224 control file sequential read         2052
2020-03-07 15:37:59 oracle@sxdb01 (M000)            2224 control file sequential read         2159
2020-03-07 15:38:00 oracle@sxdb01 (M000)            2224 control file sequential read         2269
2020-03-07 15:38:01 oracle@sxdb01 (M000)            2224 control file sequential read         2404
2020-03-07 15:38:02 oracle@sxdb01 (M000)            2224 control file sequential read         2517
2020-03-07 15:38:03 oracle@sxdb01 (M000)            2224 control file sequential read         2672
2020-03-07 15:38:04 oracle@sxdb01 (M000)            2224                                      2801
2020-03-07 15:38:05 oracle@sxdb01 (M000)            2224                                      2857
2020-03-07 15:38:06 oracle@sxdb01 (M000)            2224                                      2866
2020-03-07 15:38:07 oracle@sxdb01 (M000)            2224 control file sequential read         2893
2020-03-07 15:38:08 oracle@sxdb01 (M000)            2224                                      3007
2020-03-07 15:38:09 oracle@sxdb01 (M000)            2224 control file sequential read         3111
2020-03-07 15:38:10 oracle@sxdb01 (M000)            2224                                      3184
2020-03-07 15:38:11 oracle@sxdb01 (M000)            2224 control file sequential read         3218
2020-03-07 15:38:12 oracle@sxdb01 (M000)            2224 control file sequential read         3263
2020-03-07 15:38:13 oracle@sxdb01 (M000)            2224                                      3304
..

Here we can see that the blocking process of M000 of node 1 has held the CF lock since 15:37 and has been reading the control file for a very long time. As a result, the LGWR of node 2 cannot hold the CF lock when the log of node 2 is switched. M000 is the slave process of MMON process. We know that MMON process is usually related to AWR. Why do we keep reading control files?

Check the short of M000 in the systemstate dump generated by diag_ Stack information:

Short stack dump: ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pread()+12<-pread()+112<-skgfqio()+532<-ksfd_skgfqio()+756<-ksfd_io()+676<-ksfdread()+640<-kfk_ufs_sync_io()+416<-kfk_submit_io()+260<-kfk_io1()+916<-kfk_transitIO()+2512<-kfioSubmitIO()+408<-kfioRequestPriv()+220<-kfioRequest()+472<-ksfd_kfioRequest()+444<-ksfd_osmio()+2956<-ksfd_io()+1868<-ksfdread()+640<-kccrbp()+496<-kccrec_rbl()+296<-kccrec_read_write()+1680<-kccrrc()+1072<-krbm_cleanup_map()+28<-kgghstmap()+92<-krbm_cleanup_backup_records()+1100<-kraalac_slave_action()+1016<-kebm_slave_main()+744<-ksvrdp()+1928<-opirip()+1972<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
kraalac_slave_action ->krbm_cleanup_backup_records-> krbm_cleanup_map -> kccrrc

Here you can see the control file that M000 was cleaning up the backup records to read. From the action name of M000 in M000 trace or ash, you can see that the action of M000 was Monitor FRA Space at that time. It shows that MMON initiated a slave to do Monitor FRA Space, and the FRA space is insufficient, so it triggered to clean up some backup records in FRA to read the control file.
Knowing why M000 reads the control file, the next question is why it takes so long to read it? Control file too large? Control file read too slow? Or did you find the answer from systemstate dump:

 SO: 0x22e13908e0, type: 10, owner: 0x23012ace68, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
     proc=0x23012ace68, name=FileOpenBlock, file=ksfd.h LINE:6688 ID:, pg=0
    (FOB) 22e13908e0 flags=2560 fib=22fde0d288 incno=0 pending i/o cnt=0
     fname=+FASTDG/sxboss/controlfile/current.256.907844431
     fno=1 lblksz=16384 fsiz=118532

The control file size is lblksz fsiz=16384118532=1.8g. Why is the control file so large?

Because control_file_record_keep_time is set to 365 days.

Solution:

1. Reasonably set the control according to the backup strategy_ file_ record_ keep_ timeļ¼›

2. Rebuild control documents.

The harvest of this case is that we don't know much about the role of MMON process at ordinary times. Tracking MMON through KST trace, we found that MMON plays a lot of roles, not only related to AWR. The action s of MMON tracked by KST trace are as follows:

(MMON) : (infrequent action) : acnum=&#91;133] comment=&#91;deferred controlfile autobackup action]     
(MMON) : (infrequent action) : acnum=&#91;150] comment=&#91;recovery area alert action]     
(MMON) : (infrequent action) : acnum=&#91;167] comment=&#91;undo usage]       
(MMON) : (infrequent action) : acnum=&#91;171] comment=&#91;Block Cleanout Optim, Undo Segment Scan]   
(MMON) : (infrequent action) : acnum=&#91;175] comment=&#91;Flashback Archive RAC Health Check]    
(MMON) : (infrequent action) : acnum=&#91;178] comment=&#91;tune undo retention lob]     
(MMON) : (infrequent action) : acnum=&#91;179] comment=&#91;MMON Periodic LOB MQL Selector]    
(MMON) : (infrequent action) : acnum=&#91;180] comment=&#91;MMON Periodic LOB Spc Analyze ]   
(MMON) : (infrequent action) : acnum=&#91;183] comment=&#91;tablespace alert monitor]      
(MMON) : (infrequent action) : acnum=&#91;197] comment=&#91;OLS Cleanup]       
(MMON) : (infrequent action) : acnum=&#91;205] comment=&#91;Sample Shared Server Activity]     
(MMON) : (infrequent action) : acnum=&#91;212] comment=&#91;Compute cache stats in background]    
(MMON) : (infrequent action) : acnum=&#91;213] comment=&#91;SPM: Auto-purge expired SQL plan baselines]   
(MMON) : (infrequent action) : acnum=&#91;214] comment=&#91;SPM: Check SMB size]     
(MMON) : (infrequent action) : acnum=&#91;215] comment=&#91;SPM: Delete excess sqllog$ batches]    
(MMON) : (infrequent action) : acnum=&#91;219] comment=&#91;KSXM Advance DML Frequencies]     
(MMON) : (infrequent action) : acnum=&#91;220] comment=&#91;KSXM Broadcast DML Frequencies]     
(MMON) : (infrequent action) : acnum=&#91;225] comment=&#91;Cleanup client cache server state in background]  
(MMON) : (infrequent action) : acnum=&#91;226] comment=&#91;MMON TSM Cleanup]      
(MMON) : (infrequent action) : acnum=&#91;296] comment=&#91;alert message cleanup]      
(MMON) : (infrequent action) : acnum=&#91;297] comment=&#91;alert message purge]      
(MMON) : (infrequent action) : acnum=&#91;298] comment=&#91;AWR Auto Flush Task]     
(MMON) : (infrequent action) : acnum=&#91;299] comment=&#91;AWR Auto Purge Task]     
(MMON) : (infrequent action) : acnum=&#91;300] comment=&#91;AWR Auto DBFUS Task]     
(MMON) : (infrequent action) : acnum=&#91;301] comment=&#91;AWR Auto CPU USAGE Task]    
(MMON) : (infrequent action) : acnum=&#91;305] comment=&#91;Advisor delete expired tasks]     
(MMON) : (infrequent action) : acnum=&#91;313] comment=&#91;run-once action driver]      
(MMON) : (infrequent action) : acnum=&#91;319] comment=&#91;metrics monitoring]       
(MMON) : (infrequent action) : acnum=&#91;322] comment=&#91;sql tuning hard kill defense]    
(MMON) : (infrequent action) : acnum=&#91;323] comment=&#91;autotask status check]      
(MMON) : (infrequent action) : acnum=&#91;324] comment=&#91;Maintain AWR Baseline Thresholds Task]    
(MMON) : (infrequent action) : acnum=&#91;325] comment=&#91;WCR: Record Action Switcher]     
(MMON) : (infrequent action) : acnum=&#91;331] comment=&#91;WCR: Replay Action Switcher]     
(MMON) : (infrequent action) : acnum=&#91;338] comment=&#91;SQL Monitoring Garbage Collector]     
(MMON) : (infrequent action) : acnum=&#91;344] comment=&#91;Coordinator autostart timeout]
(MMON) : (infrequent action) : acnum=&#91;348] comment=&#91;ADR Auto Purge Task]     
(MMON) : (infrequent action) : acnum=&#91;41] comment=&#91;reload failed KSPD callbacks]     
(MMON) : (infrequent action) : acnum=&#91;75] comment=&#91;flashcache object keep monitor]     
(MMON) : (interrupt action) : acnum=&#91;108] comment=&#91;Scumnt mount lock]      
(MMON) : (interrupt action) : acnum=&#91;109] comment=&#91;Poll system events broadcast channel]    
(MMON) : (interrupt action) : acnum=&#91;20] comment=&#91;KSB action for ksbxic() calls]    
(MMON) : (interrupt action) : acnum=&#91;2] comment=&#91;KSB action for X-instance calls]    
(MMON) : (interrupt action) : acnum=&#91;306] comment=&#91;MMON Remote action Listener]     
(MMON) : (interrupt action) : acnum=&#91;307] comment=&#91;MMON Local action Listener]     
(MMON) : (interrupt action) : acnum=&#91;308] comment=&#91;MMON Completion Callback Dispatcher]     
(MMON) : (interrupt action) : acnum=&#91;309] comment=&#91;MMON set edition interrupt action]    
(MMON) : (interrupt action) : acnum=&#91;341] comment=&#91;Check for sync messages from other instances]  
(MMON) : (interrupt action) : acnum=&#91;343] comment=&#91;Check for autostart messages from other instances]  
(MMON) : (interrupt action) : acnum=&#91;350] comment=&#91;Process staged incidents]      
(MMON) : (interrupt action) : acnum=&#91;351] comment=&#91;DDE MMON action to schedule async action slaves] 
(MMON) : (interrupt action) : acnum=&#91;39] comment=&#91;MMON request for RLB metrics]    
(MMON) : (requested action) : acnum=&#91;314] comment=&#91;shutdown MMON]       
(MMON) : (requested action) : acnum=&#91;315] comment=&#91;MMON DB open]      
(MMON) : (requested action) : acnum=&#91;321] comment=&#91;ADDM (KEH)]       
(MMON) : (requested action) : acnum=&#91;347] comment=&#91;Job Autostart action force]     
(MMON) : (requested action) : acnum=&#91;349] comment=&#91;Schedule slave to update incident meter]   
(MMON) : (requested action) : acnum=&#91;63] comment=&#91;SGA memory tuning parameter update]    
(MMON) : (timeout action) : acnum=&#91;0] comment=&#91;Monitor Cleanup]       
(MMON) : (timeout action) : acnum=&#91;11] comment=&#91;Update shared pool advice stats]    
(MMON) : (timeout action) : acnum=&#91;154] comment=&#91;Flashback Marker]       
(MMON) : (timeout action) : acnum=&#91;172] comment=&#91;Block Cleanout Optim, Rac specific code]   
(MMON) : (timeout action) : acnum=&#91;173] comment=&#91;BCO:]        
(MMON) : (timeout action) : acnum=&#91;1] comment=&#91;Update KGSTM Translation]      
(MMON) : (timeout action) : acnum=&#91;3] comment=&#91;KSB action for bast checking]    
(MMON) : (timeout action) : acnum=&#91;42] comment=&#91;reconfiguration MMON action]      
(MMON) : (timeout action) : acnum=&#91;63] comment=&#91;SGA memory tuning parameter update]    
(MMON) : (timeout action) : acnum=&#91;69] comment=&#91;SGA memory tuning]

. . .
There are about hundreds of action s, and usually too little attention is paid to MMON.

Original link of ink Sky Wheel: https://www.modb.pro/db/17494... (copy link to browser)

Author about
Li Xiangyu, delivery technical consultant of Yunhe enmo West District, has long served customers in the mobile operator industry, and is familiar with Oracle Performance Optimization, fault diagnosis and special recovery.

Topics: Database Oracle