Symptoms
Business level
The machine room lost power unexpectedly, and after the database was restored, access to the business system was slow.
Database level
Checking the instances shows that only one instance is open; the other instance has not started.
RAC cluster level
Checking the CRS status shows that the stack has been started, but its state cannot be queried. The cluster stack on rac1 is in an abnormal state and cannot be used normally.
Cause analysis
Judging from the symptoms, a human operating error during this period may have left the cluster in an abnormal state. Because CRS can no longer be used normally, the cause has to be analyzed from the CRS logs and the system log.
AIX process alarm log
tail -fn 20 /var/adm/syslog
Start CRS and observe the exception:
./crsctl start cluster -all
alert log
[ohasd(3735686)]CRS-2765:Resource 'ora.diskmon' has failed on server 'rac1'.
diskmon log
2018-03-19 15:05:32.078: [ DISKMON][5439794:515] dskm_clss_ini2: calling clsssinit
2018-03-19 15:05:32.083: [ CSSCLNT]clssscConnect: gipc request failed with 29 (16)
2018-03-19 15:05:32.084: [ CSSCLNT]clsssInitNative: connect failed, rc 29
gipc log
2018-03-19 15:34:40.817: [ default][1]gipcd gipcd START pid=4194436 Oracle Grid IPC Daemon
2018-03-19 15:34:40.817: [ GIPCD][1] gipcdMain: gipcd Started
2018-03-19 15:34:40.817: [ GIPCLIB][1] gipcInitializeF [gipcdMain : gipcd.c : 135]: started with cb 0, flags 0x1
2018-03-19 15:34:40.829: [ COMMCRS][515]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
2018-03-19 15:34:40.829: [ clsdmt][258]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
2018-03-19 15:34:40.829: [ clsdmt][258]Terminating process
2018-03-19 15:34:40.829: [ GIPCD][258] gipcd_ExitCB: Received a shutdown message from agent framework
2018-03-19 15:34:40.829: [ GIPCLIB][258] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:1100f5790, idxPtr:1104a2f20, key:110607df8, flags:0x0
2018-03-19 15:34:40.829: [GIPCXCPT][258] gipcObjectLookupF [gipcPostF : gipc.c : 1898]: search found no matching oid 0000000000000000, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcPostF [gipcd_ExitCB : gipcd.c : 431]: EXCEPTION[ ret gipcretInvalidObject (3) ] failed to post obj 0000000000000000, flags 0x0
2018-03-19 15:34:40.830: [ GIPCLIB][258] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:1100f5790, idxPtr:1104a2f20, key:110607df8, flags:0x0
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcObjectLookupF [gipcPostF : gipc.c : 1898]: search found no matching oid 0000000000000000, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcPostF [gipcd_ExitCB : gipcd.c : 432]: EXCEPTION[ ret gipcretInvalidObject (3) ] failed to post obj 0000000000000000, flags 0x0
2018-03-19 15:34:40.830: [ GIPCLIB][258] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:1100f5790, idxPtr:1104a2f20, key:110607df8, flags:0x0
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcObjectLookupF [gipcPostF : gipc.c : 1898]: search found no matching oid 0000000000000000, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcPostF [gipcd_ExitCB : gipcd.c : 433]: EXCEPTION[ ret gipcretInvalidObject (3) ] failed to post obj 0000000000000000, flags 0x0
mdnsd log
2018-03-19 15:34:41.034: [ default][1]mdnsd mdnsd START pid=5177722
2018-03-19 15:34:41.045: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_MDNSD))
2018-03-19 15:34:41.045: [ clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_MDNSD))
2018-03-19 15:34:41.045: [ clsdmt][515]Terminating process
2018-03-19 15:34:41.045: [ MDNSD][515] clsdm requested mdnsd exit
2018-03-19 15:34:41.046: [ MDNSD][515] mdnsd exit
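Both the gipc and mdnsd logs fail at the same point: clsclisten gets "Permission denied" when it tries to listen on an IPC key. A minimal sketch for collecting the affected keys from these logs follows. The function name denied_ipc_keys is hypothetical, and the log file paths are passed in by the caller rather than assumed.

```shell
# Print the unique IPC keys for which clsclisten reported "Permission denied".
# Usage: denied_ipc_keys LOGFILE [LOGFILE...]
# (hypothetical helper, not part of the Grid Infrastructure tooling)
denied_ipc_keys() {
    # Select only the clsclisten failures, keep the text inside (KEY=...),
    # then collapse duplicates across all the files given.
    sed -n '/clsclisten: Permission denied/s/.*(KEY=\([^)]*\)).*/\1/p' "$@" | sort -u
}
```

Run against the two log excerpts quoted above, this would report rac1DBG_GIPCD and rac1DBG_MDNSD, which points at the IPC socket files rather than at the daemons themselves.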
Solution
- Stop crs on all (both) nodes
$GRID_HOME/bin/crsctl stop crs
OR
$GRID_HOME/bin/crsctl stop crs -f (the force option aborts the cluster resources instead of stopping them cleanly)
- When the node comes back up, clean all the socket files under /var/tmp/.oracle or /tmp/.oracle and adjust them to the right permissions.
- Bring up the crs stack on node 1
$GRID_HOME/bin/crsctl start crs
- Check the status of the crs stack
$GRID_HOME/bin/crsctl check crs
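Before cleaning the socket directory, it can help to see which entries have unexpected ownership. The following is a read-only sketch under stated assumptions: the function name misowned_entries is hypothetical, and the expected owner is supplied by the caller (for example, taken from the healthy node), since files in this directory may legitimately belong to more than one user.

```shell
# List entries under a socket directory that are NOT owned by the expected user.
# Usage: misowned_entries DIR EXPECTED_OWNER
# EXPECTED_OWNER is an assumption supplied by the caller, not taken from the source.
misowned_entries() {
    dir=$1
    owner=$2
    # A missing directory is not an error: there is simply nothing to report.
    [ -d "$dir" ] || return 0
    # Prints the directory itself and any file inside it with a different owner.
    find "$dir" ! -user "$owner" -print
}

# Example call over both candidate locations; "grid" is only a placeholder owner.
# for d in /var/tmp/.oracle /tmp/.oracle; do misowned_entries "$d" grid; done
```

The check only lists files and changes nothing, so it can be run on the healthy node first and the two results compared.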