RAC CRS fails to start after an abnormal power failure

Posted by bseven on Tue, 31 Mar 2020 11:32:51 +0200

Symptoms

Business level

The machine room lost power unexpectedly. After the database was restored, access to the business system was slow.

Database level

Checking the instances shows that only one instance is open; the other has not started.

RAC cluster level

Checking the CRS status shows that CRS appears to have been started, but its status cannot be displayed. The cluster stack on rac1 is in an abnormal state and cannot be used normally.

Cause analysis

Judging from the symptoms, the abnormal cluster state may have been caused by operator error during this period. Because CRS can no longer be used normally, the cause has to be analyzed from the CRS logs and the system log.

AIX process alarm log

tail -fn 20 /var/adm/syslog

Start the cluster with crsctl and watch for the error:

./crsctl start cluster -all

alert log

[ohasd(3735686)]CRS-2765:Resource 'ora.diskmon' has failed on server 'rac1'.

diskmon log

2018-03-19 15:05:32.078: [ DISKMON][5439794:515] dskm_clss_ini2: calling clsssinit
2018-03-19 15:05:32.083: [ CSSCLNT]clssscConnect: gipc request failed with 29 (16)
2018-03-19 15:05:32.084: [ CSSCLNT]clsssInitNative: connect failed, rc 29

gipc log

2018-03-19 15:34:40.817: [ default][1]gipcd gipcd START pid=4194436 Oracle Grid IPC Daemon
2018-03-19 15:34:40.817: [   GIPCD][1] gipcdMain: gipcd Started
2018-03-19 15:34:40.817: [ GIPCLIB][1] gipcInitializeF [gipcdMain : gipcd.c : 135]: started with cb 0, flags 0x1
2018-03-19 15:34:40.829: [ COMMCRS][515]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

2018-03-19 15:34:40.829: [  clsdmt][258]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
2018-03-19 15:34:40.829: [  clsdmt][258]Terminating process
2018-03-19 15:34:40.829: [   GIPCD][258] gipcd_ExitCB: Received a shutdown message from agent framework 
2018-03-19 15:34:40.829: [ GIPCLIB][258] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:1100f5790, idxPtr:1104a2f20, key:110607df8, flags:0x0
2018-03-19 15:34:40.829: [GIPCXCPT][258] gipcObjectLookupF [gipcPostF : gipc.c : 1898]: search found no matching oid 0000000000000000, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcPostF [gipcd_ExitCB : gipcd.c : 431]: EXCEPTION[ ret gipcretInvalidObject (3) ]  failed to post obj 0000000000000000, flags 0x0
2018-03-19 15:34:40.830: [ GIPCLIB][258] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:1100f5790, idxPtr:1104a2f20, key:110607df8, flags:0x0
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcObjectLookupF [gipcPostF : gipc.c : 1898]: search found no matching oid 0000000000000000, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcPostF [gipcd_ExitCB : gipcd.c : 432]: EXCEPTION[ ret gipcretInvalidObject (3) ]  failed to post obj 0000000000000000, flags 0x0
2018-03-19 15:34:40.830: [ GIPCLIB][258] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:1100f5790, idxPtr:1104a2f20, key:110607df8, flags:0x0
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcObjectLookupF [gipcPostF : gipc.c : 1898]: search found no matching oid 0000000000000000, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2018-03-19 15:34:40.830: [GIPCXCPT][258] gipcPostF [gipcd_ExitCB : gipcd.c : 433]: EXCEPTION[ ret gipcretInvalidObject (3) ]  failed to post obj 0000000000000000, flags 0x0

mdnsd log

2018-03-19 15:34:41.034: [ default][1]mdnsd mdnsd START pid=5177722 
2018-03-19 15:34:41.045: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_MDNSD))

2018-03-19 15:34:41.045: [  clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_MDNSD))
2018-03-19 15:34:41.045: [  clsdmt][515]Terminating process
2018-03-19 15:34:41.045: [   MDNSD][515] clsdm requested mdnsd exit
2018-03-19 15:34:41.046: [   MDNSD][515] mdnsd exit
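
The gipcd and mdnsd logs above both show a daemon terminating right after clsclisten reports "Permission denied" on an IPC endpoint. A minimal sketch for pulling those endpoint keys out of a log follows. The function name denied_ipc_keys is hypothetical, and the log is read from standard input so that no log path is assumed.

```shell
#!/bin/sh
# Hypothetical helper: print the IPC KEY names that failed with
# "Permission denied". Reads log text on stdin; no log path is assumed.
denied_ipc_keys() {
    # Keep only the clsclisten failures, extract the text inside (KEY=...),
    # and print each key once.
    grep 'clsclisten: Permission denied' |
        sed -n 's/.*(KEY=\([^)]*\)).*/\1/p' |
        sort -u
}
```

Fed a saved copy of the log on standard input, for example `denied_ipc_keys < saved_copy.log`, it prints one key per line. Each key names an IPC endpoint that the daemon was not allowed to listen on, which is consistent with the socket-file cleanup described in the solution.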

Solution

  1. Stop CRS on all (both) nodes:

$GRID_HOME/bin/crsctl stop crs

OR

$GRID_HOME/bin/crsctl stop crs -f   (the force option aborts the cluster resources instead of stopping them cleanly)

  2. When the node comes back up, clean all the socket files under /var/tmp/.oracle or /tmp/.oracle. Adjust to the right permissions.

  3. Bring up the CRS stack on node 1:

$GRID_HOME/bin/crsctl start crs

  4. Check the status of the CRS stack:

$GRID_HOME/bin/crsctl check crs
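
Before cleaning the socket files, it can help to see which ones exist and who owns them. The following is a minimal, read-only sketch. The function name list_socket_files is hypothetical; it takes the directory as an argument and never deletes anything.

```shell
#!/bin/sh
# Hypothetical helper: list the socket files in one directory tree so that
# their owner and permissions can be reviewed before any cleanup.
# It only prints; it does not remove or change anything.
list_socket_files() {
    dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # -type s matches UNIX domain sockets; regular files are skipped.
    find "$dir" -type s -exec ls -ld {} +
}
```

For example, `list_socket_files /var/tmp/.oracle` prints one ls -ld line per socket. Whether to remove the files or only correct their ownership is left to the referenced Oracle note.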

References

How to remove Network socket files in a RAC Environment for Cluster/Resource startup issues (Doc ID 2099377.1)
