Tuesday, June 1, 2010

Oracle Clusterware Failures: Useful Logs

To determine why a node failed, try the following clusterware logs (in order of usefulness):

  • alert log
  • ocssd log
  • evm log

Oracle Clusterware Parameters: misscount, disktimeout, reboottime

Clusterware timeout parameters:
  • misscount - The maximum time, in seconds, that a network heartbeat can be missed before the cluster enters a reconfiguration to evict the node.
  • disktimeout - The maximum time allowed for a voting file I/O to complete; if this time is exceeded, the voting disk is marked offline.
  • reboottime - The time allowed for a node to complete a reboot after the CSS daemon has been evicted.
Default values for these parameters:
  • misscount = 60 seconds
  • disktimeout = 200 seconds
  • reboottime = 3 seconds 
Commands to check / modify CSS parameters:
  • crsctl get css misscount - check the misscount value
  • crsctl get css disktimeout - check the disktimeout value
  • crsctl get css reboottime - check the reboottime value
  • crsctl set css misscount 120 - set misscount to 120 seconds
  • crsctl set css disktimeout 200 - set disktimeout to 200 seconds
  • crsctl set css reboottime 3 - set reboottime to 3 seconds
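To read all three values in one pass, the get commands can be wrapped in a small script. This is only a sketch: it assumes crsctl is on the PATH of the user running it, and the helper names (build_get_command, get_css_parameter) are made up for illustration. The output format of crsctl differs between Clusterware versions, so the raw text is returned unparsed.

```python
import subprocess

# The three CSS timeout parameters discussed above.
CSS_PARAMETERS = ("misscount", "disktimeout", "reboottime")


def build_get_command(parameter):
    """Return the argument list for `crsctl get css <parameter>`."""
    if parameter not in CSS_PARAMETERS:
        raise ValueError(f"unknown CSS parameter: {parameter}")
    return ["crsctl", "get", "css", parameter]


def get_css_parameter(parameter):
    """Run crsctl and return its raw standard output, stripped.

    Assumes crsctl is installed and on the PATH (hypothetical setup);
    raises FileNotFoundError otherwise.
    """
    result = subprocess.run(
        build_get_command(parameter),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def report_css_parameters():
    """Print the raw crsctl output for each timeout parameter."""
    for name in CSS_PARAMETERS:
        print(f"{name}: {get_css_parameter(name)}")
```

Calling report_css_parameters() on a cluster node prints one line per parameter with whatever crsctl returned for it.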
The get calls above return only non-default values. To confirm the default values, look in ocssd.log for lines such as the following:
clssnmNMInitialize: misscount set to (30)

clssnmNMInitialize: Network heartbeat thresholds are: impending reconfig 15000 ms, reconfig start (misscount) 30000 ms

clssgmInitCMInfo: Wait for remote node termination set to 13 seconds
clssnmNMInitialize: misscount set to (60), impending reconfig threshold set to (56000)
clssnmNMInitialize: diskShortTimeout set to (57000)ms
clssnmNMInitialize: diskLongTimeout set to (200000)ms

clssnmHandleUpdate: diskTimeout set to (200000)ms
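
The values can also be pulled out of ocssd.log with a short script instead of searching by hand. The sketch below matches only the message wording shown in the sample lines above; other Clusterware versions may phrase these messages differently, and the function name and dictionary keys are made up for illustration. When a setting is logged more than once, the last occurrence wins.

```python
import re

# Patterns taken from the sample ocssd.log lines above. Keys are
# illustrative names, not Oracle terminology.
PATTERNS = {
    "misscount_s": re.compile(r"misscount set to \((\d+)\)"),
    "disk_short_timeout_ms": re.compile(r"diskShortTimeout set to \((\d+)\)ms"),
    "disk_long_timeout_ms": re.compile(r"diskLongTimeout set to \((\d+)\)ms"),
    "disk_timeout_ms": re.compile(r"diskTimeout set to \((\d+)\)ms"),
}


def parse_css_timeouts(lines):
    """Return the last logged value of each timeout found in `lines`.

    `lines` is any iterable of strings, for example an open file object.
    """
    values = {}
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                values[key] = int(match.group(1))
    return values
```

Passing an open ocssd.log file object to parse_css_timeouts() returns a dictionary of the values found. The location of ocssd.log depends on the installation.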