Thursday, January 13, 2022

Oracle RAC: kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [osysmond.bin:4024]

While in the middle of installing the Oracle software on RAC, one of the terminals threw an error and terminated the session.


[oracle@rac02 ~]$ 

Message from syslogd@rac02 at Jan 13 15:47:15 ...

 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [osysmond.bin:4024]


In another session, the installation is still going on with no other errors reported; files are still being copied from node1 to node2.





This issue seems to be coming from VMware, and the following KB indicates it is simply a performance (latency) hiccup:

https://kb.vmware.com/s/article/67623
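
If the stalls are benign, one knob worth checking (my suggestion, not something quoted from the KB) is the kernel's soft-lockup threshold; the warning fires at roughly twice kernel.watchdog_thresh, and the default of 10 seconds lines up with the ~21s in the message above.

# Check the current threshold (seconds); the soft lockup warning triggers
# at about 2x this value (2 x 10s default ~= the 21s reported above).
sysctl kernel.watchdog_thresh

# Optionally raise it to tolerate short hypervisor latency spikes;
# add "kernel.watchdog_thresh = 20" to /etc/sysctl.conf to persist.
sysctl -w kernel.watchdog_thresh=20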


Oracle RAC: Oracle RAC not starting upon rebooting.

Oracle RAC does not start upon rebooting. The following shows some symptoms, tests, and what to look for.

Node2 does not return any RAC services status on first startup or after a reboot.

[root@rac02 bin]# ./crsctl stat res -t

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4000: Command Status failed, or completed with errors.
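
CRS-4535 only means the upper stack (CRSD) is unreachable; to see how far the lower stack got, the init resources can be queried from the same Grid home (diagnostic commands I would run at this point; they are not part of the original capture):

./crsctl check crs           # health of the CRS, CSS, and EVM daemons
./crsctl stat res -t -init   # lower-stack resources such as ora.cssd and ora.asm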


Upon starting it all up, a couple of errors appear: the "CRS-1705" voting-file error and the failed start of 'ora.cssd' on rac02. Those indicate issues with the ASM storage provisioned to node2.


[root@rac01 bin]# ./crsctl start cluster -all

CRS-2672: Attempting to start 'ora.cssd' on 'rac01'

CRS-2672: Attempting to start 'ora.diskmon' on 'rac01'

CRS-2672: Attempting to start 'ora.cssd' on 'rac02'

CRS-2672: Attempting to start 'ora.diskmon' on 'rac02'

CRS-2676: Start of 'ora.diskmon' on 'rac01' succeeded

CRS-2676: Start of 'ora.diskmon' on 'rac02' succeeded

CRS-2676: Start of 'ora.cssd' on 'rac01' succeeded

CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac01'

CRS-2672: Attempting to start 'ora.ctssd' on 'rac01'

CRS-2676: Start of 'ora.ctssd' on 'rac01' succeeded

CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac01' succeeded

CRS-2672: Attempting to start 'ora.asm' on 'rac01'

CRS-2676: Start of 'ora.asm' on 'rac01' succeeded

CRS-2672: Attempting to start 'ora.storage' on 'rac01'

CRS-2676: Start of 'ora.storage' on 'rac01' succeeded

CRS-2672: Attempting to start 'ora.crsd' on 'rac01'

CRS-2676: Start of 'ora.crsd' on 'rac01' succeeded

CRS-1705: Found 0 configured voting files but 1 voting files are required, terminating to ensure data integrity; details at (:CSSNM00065:) in /u01/app/oracle/diag/crs/rac02/crs/trace/ocssd.trc

CRS-2674: Start of 'ora.cssd' on 'rac02' failed

CRS-2679: Attempting to clean 'ora.cssd' on 'rac02'

CRS-2681: Clean of 'ora.cssd' on 'rac02' succeeded

CRS-2672: Attempting to start 'ora.cssd' on 'rac02'

CRS-2672: Attempting to start 'ora.diskmon' on 'rac02'

CRS-2676: Start of 'ora.diskmon' on 'rac02' succeeded



CRS-2676: Start of 'ora.cssd' on 'rac02' succeeded

CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac02'

CRS-2672: Attempting to start 'ora.ctssd' on 'rac02'

CRS-2676: Start of 'ora.ctssd' on 'rac02' succeeded

CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac02' succeeded

CRS-2672: Attempting to start 'ora.asm' on 'rac02'

CRS-2676: Start of 'ora.asm' on 'rac02' succeeded

CRS-2672: Attempting to start 'ora.storage' on 'rac02'

CRS-2676: Start of 'ora.storage' on 'rac02' succeeded

CRS-2672: Attempting to start 'ora.crsd' on 'rac02'

CRS-2676: Start of 'ora.crsd' on 'rac02' succeeded
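
The CRS-1705 error points at the CSS trace on rac02; grepping that trace for the voting-file discovery, and checking whether the ASM device node existed at all, confirms the disk was not visible at boot (commands assumed here, with the trace path taken from the error message):

grep -i "voting file" /u01/app/oracle/diag/crs/rac02/crs/trace/ocssd.trc
ls -l /dev/oracleasm/disks/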




All of the RAC-related storage checks appear hung at this point. Once oracleasm scandisks instantiates the disk on node2, the RAC-related disk checks start returning output.



[root@rac02 bin]# oracleasm scandisks

Reloading disk partitions: done

Cleaning any stale ASM disks...

Scanning system for ASM disks...

Instantiating disk "DISK01"
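
Once the scan instantiates the disk, it should also appear in a plain listing (a follow-up check; DISK01 is the disk created for this cluster):

[root@rac02 bin]# oracleasm listdisks
DISK01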



[root@rac02 bin]# ./crsctl query css votedisk

##  STATE    File Universal Id                File Name Disk group

--  -----    -----------------                --------- ---------

 1. ONLINE   cacc790e79514f14bf658a94d092b503 (/dev/oracleasm/disks/DISK01) [DATA]

Located 1 voting disk(s).

 

[root@rac02 bin]# ./ocrcheck

Status of Oracle Cluster Registry is as follows :

Version                  :          4

Total space (kbytes)     :     491684

Used space (kbytes)      :      84360

Available space (kbytes) :     407324

ID                       : 1922254439

Device/File Name         :      +DATA

                                    Device/File integrity check succeeded


                                    Device/File not configured


                                    Device/File not configured


                                    Device/File not configured


                                    Device/File not configured


Cluster registry integrity check succeeded


Logical corruption check succeeded


Oracle Cluster Registry check was cancelled because an ongoing update was detected.


All of the oracleasm configuration seems to be appropriately set.


[root@rac02 ~]# oracleasm status

Checking if ASM is loaded: yes

Checking if /dev/oracleasm is mounted: yes

[root@rac02 ~]# oracleasm configure

ORACLEASM_ENABLED=true

ORACLEASM_UID=oracle

ORACLEASM_GID=oinstall

ORACLEASM_SCANBOOT=true

ORACLEASM_SCANORDER=""

ORACLEASM_SCANEXCLUDE=""

ORACLEASM_SCAN_DIRECTORIES=""

ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"

systemd does have the oracleasm service enabled on startup.

[root@rac02 bin]# systemctl list-unit-files --type=service|grep oracleasm

oracleasm.service                             enabled 


The oracleasm log, "/var/log/oracleasm", was showing "Disk "DISK01" does not exist or is not instantiated" when node2 rebooted, even though oracleasm configure shows the boot-time disk scan is enabled. Scanning the disks manually after the reboot fixes the issue, so the problem has to do with storage and timing. After some googling, my issue seems to match the following note from Oracle Metalink.


Oracle Linux 7: ASM Disks Created on FCOE Target Disks are Not Visible After System Reboot (Doc ID 2065945.1)


I added a delay in "/usr/sbin/oracleasm.init" prior to the scandisk, and it solved my issue. That gives about 20 seconds for the storage to be presented before the scan and disk instantiation.
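
The change itself is tiny; conceptually it is just this line added in the script's start path, right before the disk scan (placement is approximate, since the script contents vary by oracleasm version):

sleep 20    # give the SAN/FCoE storage ~20 seconds to present the LUNs before scandisks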

With the 20-second delay, /var/log/oracleasm shows "Instantiating disk "DISK01"" upon reboot, and the disk is instantiated before the Oracle RAC cluster start-up. If the disk is not instantiated, nothing in the cluster starts up.
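
An alternative that avoids editing the vendor script (my own variation, not from the Metalink note) is a systemd drop-in for the oracleasm.service shown earlier, which delays the whole service instead:

# /etc/systemd/system/oracleasm.service.d/storage-delay.conf (assumed drop-in path)
[Service]
# Wait for the storage to be presented before oracleasm loads and scans disks
ExecStartPre=/bin/sleep 20

After creating the drop-in, run "systemctl daemon-reload" and verify it was picked up with "systemctl cat oracleasm.service".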