Monday, October 3, 2022

Oracle: Dataguard: Automatic Failover and Switchover

 

An example of how to perform an automatic failover, and later a switchover back, on Oracle Data Guard.


Show the configuration. As expected, the primary is orclprm.

[oracle@Primary ~]$ dgmgrl /
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Mon Oct 3 14:23:59 2022
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "ORCLPRM"
Connected as SYSDG.
DGMGRL> connect sys@Standby
Password:
Connected to "ORCLSTB"
Connected as SYSDBA.
DGMGRL> show configuration

Configuration - orcl

Protection Mode: MaxPerformance
Members:
orclprm - Primary database
orclstb - Physical standby database

Fast-Start Failover: Disabled

Configuration Status:
SUCCESS (status updated 32 seconds ago)

Configuring failover produces an expected warning that needs to be addressed: it complains that the observer is not started. The observer is a process that monitors both the primary and the standby (in production it should run on a third host, separate from both databases). So, start it, and open another terminal to perform the rest of the test.

This step sets the fast-start failover lag limit (the maximum number of seconds of data loss allowed in MaxPerformance mode) to 60 seconds and then enables fast-start failover.

DGMGRL> edit configuration set property FastStartFailoverLagLimit=60;
Property "faststartfailoverlaglimit" updated
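Before enabling it, the current fast-start failover settings can also be reviewed, and the failover wait time itself is a separate property; a quick sketch (FastStartFailoverThreshold defaults to 30 seconds, which matches the "Retry for the next 30 seconds" messages in the observer log below):

DGMGRL> show fast_start failover
DGMGRL> edit configuration set property FastStartFailoverThreshold=30;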


DGMGRL> enable fast_start failover;
Warning: ORA-16827: Flashback Database is disabled

Enabled in Potential Data Loss Mode.
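The ORA-16827 warning appears because Flashback Database is off; the broker relies on it to reinstate a failed primary. A minimal sketch to clear it, assuming apply is paused through the broker before flashback is enabled on each database (on older releases the standby may need to be mounted first):

DGMGRL> edit database orclstb set state='APPLY-OFF';
SQL> alter database flashback on;
DGMGRL> edit database orclstb set state='APPLY-ON';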
DGMGRL> show configuration

Configuration - orcl

Protection Mode: MaxPerformance
Members:
orclprm - Primary database
Warning: ORA-16819: fast-start failover observer not started

orclstb - (*) Physical standby database

Fast-Start Failover: Enabled in Potential Data Loss Mode

Configuration Status:
WARNING (status updated 48 seconds ago)


DGMGRL> start observer;
[W000 2022-10-03T14:27:52.856-06:00] FSFO target standby is orclstb
Observer 'Primary.localdomain' started
[W000 2022-10-03T14:27:52.936-06:00] Observer trace level is set to USER
[W000 2022-10-03T14:27:52.936-06:00] Try to connect to the primary.
[W000 2022-10-03T14:27:52.936-06:00] Try to connect to the primary primary.
[W000 2022-10-03T14:27:52.947-06:00] The standby orclstb is ready to be a FSFO target
[W000 2022-10-03T14:27:53.948-06:00] Connection to the primary restored!
[W000 2022-10-03T14:27:55.949-06:00] Disconnecting from database primary.
[W000 2022-10-03T14:29:52.153-06:00] Primary database cannot be reached.
[W000 2022-10-03T14:29:52.153-06:00] Fast-Start Failover threshold has not exceeded. Retry for the next 30 seconds
[W000 2022-10-03T14:29:53.153-06:00] Try to connect to the primary.
[W000 2022-10-03T14:29:54.199-06:00] Primary database cannot be reached.
[W000 2022-10-03T14:29:55.199-06:00] Try to connect to the primary.
[W000 2022-10-03T14:30:20.788-06:00] Primary database cannot be reached.
[W000 2022-10-03T14:30:20.788-06:00] Fast-Start Failover threshold has not exceeded. Retry for the next 2 seconds
[W000 2022-10-03T14:30:21.789-06:00] Try to connect to the primary.
[W000 2022-10-03T14:30:22.839-06:00] Primary database cannot be reached.
[W000 2022-10-03T14:30:22.839-06:00] Fast-Start Failover threshold has expired.
[W000 2022-10-03T14:30:22.839-06:00] Try to connect to the standby.
[W000 2022-10-03T14:30:22.839-06:00] Making a last connection attempt to primary database before proceeding with Fast-Start Failover.
[W000 2022-10-03T14:30:22.839-06:00] Check if the standby is ready for failover.
[S002 2022-10-03T14:30:22.848-06:00] Fast-Start Failover started...

The primary database was killed with SHUTDOWN ABORT, which is what triggered the failover above. The following messages will repeat until the old primary is reinstated: the observer continuously pings the former primary, which is currently down.
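A sketch of that crash simulation, run in a separate terminal on the primary host:

[oracle@Primary ~]$ sqlplus / as sysdba
SQL> shutdown abort;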

2022-10-03T14:30:22.848-06:00
Initiating Fast-Start Failover to database "orclstb"...
[S002 2022-10-03T14:30:22.848-06:00] Initiating Fast-start Failover.
Performing failover NOW, please wait...
Failover succeeded, new primary is "orclstb"
2022-10-03T14:30:50.857-06:00
[S002 2022-10-03T14:30:50.857-06:00] Fast-Start Failover finished...
[W000 2022-10-03T14:30:50.857-06:00] Failover succeeded. Restart pinging.
[W000 2022-10-03T14:30:50.870-06:00] Primary database has changed to orclstb.
[W000 2022-10-03T14:30:50.871-06:00] Try to connect to the primary.
[W000 2022-10-03T14:30:50.871-06:00] Try to connect to the primary standby.
[W000 2022-10-03T14:30:50.962-06:00] The standby orclprm needs to be reinstated
[W000 2022-10-03T14:30:50.962-06:00] Try to connect to the new standby orclprm.
[W000 2022-10-03T14:30:50.962-06:00] Connection to the primary restored!
[W000 2022-10-03T14:30:52.963-06:00] Connection to the new standby restored!
[W000 2022-10-03T14:30:52.963-06:00] Disconnecting from database standby.
[W000 2022-10-03T14:30:53.969-06:00] Failed to ping the new standby.
[W000 2022-10-03T14:30:54.969-06:00] Try to connect to the new standby orclprm.
[W000 2022-10-03T14:30:56.970-06:00] Connection to the new standby restored!
[W000 2022-10-03T14:30:56.983-06:00] Failed to ping the new standby.
[W000 2022-10-03T14:30:57.984-06:00] Try to connect to the new standby orclprm.
[W000 2022-10-03T14:30:59.985-06:00] Connection to the new standby restored!

On a separate SQL*Plus prompt on the primary db, perform a STARTUP MOUNT; then, in DGMGRL, perform a reinstate.
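A sketch of that mount step on the old primary:

[oracle@Primary ~]$ sqlplus / as sysdba
SQL> startup mount;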

DGMGRL> reinstate database orclprm
Reinstating database "orclprm", please wait...
Error: ORA-16657: reinstatement of database in progress

Failed.
Database reinstatement for "orclprm" in progress

DGMGRL> show configuration

Configuration - orcl

Protection Mode: MaxPerformance
Members:
orclstb - Primary database
Warning: ORA-16824: multiple warnings, including fast-start failover-related warnings, detected for the database

orclprm - (*) Physical standby database
Error: ORA-16816: incorrect database role

Fast-Start Failover: Enabled in Potential Data Loss Mode

Configuration Status:
ERROR (status updated 31 seconds ago)


As soon as the old primary is started in mount mode, the observer reinstates it automatically (which is why the manual REINSTATE above failed with ORA-16657: a reinstatement was already in progress) and reports that the old primary was successfully restored.

[W000 2022-10-03T14:34:11.161-06:00] Failed to ping the new standby.
[W000 2022-10-03T14:34:12.161-06:00] Try to connect to the new standby orclprm.
[W000 2022-10-03T14:34:13.161-06:00] Connection to the new standby restored!
[W000 2022-10-03T14:34:46.215-06:00] Try to connect to the primary standby.
[W000 2022-10-03T14:34:47.215-06:00] Connection to the primary restored!
[W000 2022-10-03T14:34:47.215-06:00] Wait for new primary to be ready to reinstate.
[W000 2022-10-03T14:34:48.229-06:00] New primary is now ready to reinstate.
[W000 2022-10-03T14:34:49.230-06:00] Issuing REINSTATE command.

2022-10-03T14:34:49.230-06:00
Initiating reinstatement for database "orclprm"...
Reinstating database "orclprm", please wait...
[W000 2022-10-03T14:35:05.258-06:00] The standby orclprm is ready to be a FSFO target
Reinstatement of database "orclprm" succeeded
2022-10-03T14:35:32.952-06:00
[W000 2022-10-03T14:35:33.305-06:00] Successfully reinstated database orclprm.

With everything healthy again, the new primary remains on the standby database (orclstb) until the switchover takes place.

DGMGRL> show configuration

Configuration - orcl

Protection Mode: MaxPerformance
Members:
orclstb - Primary database
orclprm - (*) Physical standby database

Fast-Start Failover: Enabled in Potential Data Loss Mode

Configuration Status:
SUCCESS (status updated 52 seconds ago)

Perform the switchover back to the original primary.
DGMGRL> disable fast_start failover;
Disabled.
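With fast-start failover disabled, the observer can also be stopped, and the standby's readiness can be checked before switching over; both are standard DGMGRL commands:

DGMGRL> stop observer
DGMGRL> validate database orclprm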

DGMGRL> connect sys as sysdba
Password:
Connected to "ORCLPRM"
Connected as SYSDBA.
DGMGRL> show configuration

Configuration - orcl

Protection Mode: MaxPerformance
Members:
orclstb - Primary database
orclprm - Physical standby database

Fast-Start Failover: Disabled

Configuration Status:
SUCCESS (status updated 73 seconds ago)

DGMGRL> switchover to orclprm
Performing switchover NOW, please wait...
New primary database "orclprm" is opening...
Operation requires start up of instance "orcl" on database "orclstb"
Starting instance "orcl"...
Connected to "ORCLSTB"
ORACLE instance started.
Connected to "ORCLSTB"
Database mounted.
Database opened.
Connected to "ORCLSTB"
Connected to "ORCLPRM"
Switchover succeeded, new primary is "orclprm"
DGMGRL> show configuration

Configuration - orcl

Protection Mode: MaxPerformance
Members:
orclprm - Primary database
orclstb - Physical standby database

Fast-Start Failover: Disabled

Configuration Status:
SUCCESS (status updated 70 seconds ago)


Friday, September 23, 2022

Oracle: DataGuard: Convert Physical Standby to Snapshot Standby

 

A Data Guard physical standby database does not allow DDL or DML. I am going to convert my physical standby to a snapshot standby so it becomes a fully updatable standby database while the primary is taken down for patching.
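For a broker-managed configuration (as in the failover post above), the same conversion can be done in a single DGMGRL command instead of the manual SQL*Plus steps below; a sketch, assuming the orclstb member name:

DGMGRL> convert database orclstb to snapshot standby;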



The standby is open READ ONLY WITH APPLY and cannot be updated.

SQL> alter session set container=pdb1;

Session altered.

SQL> select name, open_mode from v$database;

NAME OPEN_MODE
--------- --------------------
ORCL READ ONLY WITH APPLY


SQL> select name,database_role,open_mode,db_unique_name from v$database;

NAME DATABASE_ROLE OPEN_MODE DB_UNIQUE_NAME
--------- ---------------- -------------------- ------------------------------
ORCL PHYSICAL STANDBY READ ONLY WITH APPLY ORCLSTB


SQL> create table crm (id number);
create table crm (id number)
*
ERROR at line 1:
ORA-16000: database or pluggable database open for read-only access

Converting from Physical Standby to Snapshot Standby


SQL> !hostname
Standby.localdomain

SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ ONLY WITH APPLY


SQL> select name, open_mode from v$pdbs;

NAME
--------------------------------------------------------------------------------
OPEN_MODE
----------
PDB1
READ ONLY


SQL> show pdbs;

CON_ID CON_NAME OPEN MODE RESTRICTED
---------- ------------------------------ ---------- ----------
3 PDB1 READ ONLY NO

SQL> shutdown immediate;
Pluggable Database closed.


SQL> show pdbs; 

CON_ID CON_NAME OPEN MODE RESTRICTED
---------- ------------------------------ ---------- ----------
2 PDB$SEED READ ONLY NO
3 PDB1 MOUNTED

SQL> select name, open_mode from v$database;

NAME OPEN_MODE
--------- --------------------
ORCL READ ONLY WITH APPLY

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

-- Take the database to Mount Mode.

SQL> startup mount;
ORACLE instance started.

Total System Global Area 1560278096 bytes
Fixed Size 9135184 bytes
Variable Size 973078528 bytes
Database Buffers 570425344 bytes
Redo Buffers 7639040 bytes
Database mounted.


SQL> SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, BLOCK#, BLOCKS FROM V$MANAGED_STANDBY;

PROCESS STATUS THREAD# SEQUENCE# BLOCK# BLOCKS
--------- ------------ ---------- ---------- ---------- ----------
ARCH CONNECTED 0 0 0 0
DGRD ALLOCATED 0 0 0 0
DGRD ALLOCATED 0 0 0 0
ARCH CONNECTED 0 0 0 0
ARCH CONNECTED 0 0 0 0
ARCH CONNECTED 0 0 0 0

6 rows selected.
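As a side note, V$MANAGED_STANDBY is deprecated in 12.2 and later in favor of V$DATAGUARD_PROCESS; a roughly equivalent check (column list assumed from the 19c reference):

SQL> select name, role, action from v$dataguard_process;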

-- stop (cancel) the redo apply process

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
Database altered.

SQL> show parameter DB_RECOVERY_FILE_DEST
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
db_recovery_file_dest string /u01/app/oracle/fast_recovery_
area
db_recovery_file_dest_size big integer 12732M





-- Converting the database to Snapshot
SQL> ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
Database altered.

SQL> ALTER DATABASE OPEN READ WRITE;
Database altered.

SQL> select name,database_role,open_mode,db_unique_name from v$database;
NAME DATABASE_ROLE OPEN_MODE DB_UNIQUE_NAME
--------- ---------------- -------------------- ------------------------------
ORCL SNAPSHOT STANDBY READ WRITE ORCLSTB

SQL> select name,open_mode from v$pdbs;
NAME
--------------------------------------------------------------------------------
OPEN_MODE
----------
PDB$SEED
READ ONLY

PDB1
READ WRITE

SQL> col name format a30;
SQL> select name,open_mode from v$pdbs;

NAME OPEN_MODE
------------------------------ ----------
PDB$SEED READ ONLY
PDB1 READ WRITE




-- Testing the Standby
-- from the standby db
SQL> conn sys@pdb1 as sysdba
Enter password: 
Connected.
SQL> show con_name

CON_NAME
------------------------------
PDB1

SQL> create user test identified by test;

User created.

SQL> create table test(id number);

Table created.

SQL> insert into test values(100); 

1 row created.

SQL> commit;

Commit complete.

SQL> select * from test;

ID
----------
100

Note: once converted back to a physical standby, all of the changes will be wiped out as if they never happened. So, technically, this just temporarily keeps the database serviceable while the primary is down.
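Under the hood, the conversion takes a guaranteed restore point, and the revert simply flashes the database back to it. While in snapshot mode it can be inspected with:

SQL> select name, guarantee_flashback_database from v$restore_point;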

To revert to a physical standby, repeat the same steps with the following command. Again, the database has to be in mount mode.

ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
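A fuller sketch of the revert sequence, assuming the instance must be restarted into mount mode after the conversion (as older releases require) and that redo apply should then be resumed:

SQL> shutdown immediate;
SQL> startup mount;
SQL> alter database convert to physical standby;
SQL> shutdown immediate;
SQL> startup mount;
SQL> alter database recover managed standby database disconnect from session;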



Saturday, September 17, 2022

Oracle: Dataguard: Error: ORA-16664: unable to receive the result from a member

In dgmgrl, the physical standby is showing as unreachable (ORA-16664) and Real Time Query is showing OFF.


DGMGRL> show configuration

Configuration - orcl

  Protection Mode: MaxPerformance
  Members:
  orclprm - Primary database
    orclstb - Physical standby database
      Error: ORA-16664: unable to receive the result from a member

Fast-Start Failover:  Disabled

Configuration Status:
ERROR   (status updated 66 seconds ago)

DGMGRL> show database orclprm

Database - orclprm

  Role:               PRIMARY
  Intended State:     TRANSPORT-ON
  Instance(s):
    orcl

Database Status:
SUCCESS

DGMGRL> show database orclstb

Database - orclstb

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      (unknown)
  Apply Lag:          (unknown)
  Average Apply Rate: (unknown)
  Real Time Query:    OFF
  Instance(s):
    orcl

Database Status:
DGM-17016: failed to retrieve status for database "orclstb"
ORA-16664: unable to receive the result from a member


The listener on the primary node was not started, and that is what caused the inability to communicate with the physical standby database.

[oracle@Primary ~]$ lsnrctl status

LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 16-SEP-2022 16:36:15

Copyright (c) 1991, 2019, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
TNS-12541: TNS:no listener
 TNS-12560: TNS:protocol adapter error
  TNS-00511: No listener
   Linux Error: 111: Connection refused
[oracle@Primary ~]$ lsnrctl start


After starting the listener, the physical standby status returns and Real Time Query comes back ON. Note that SHOW CONFIGURATION may still report the cached error for a short while (the status line shows when it was last updated), while querying the standby directly already returns SUCCESS.


[oracle@Primary ~]$ dgmgrl /
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Fri Sep 16 16:36:32 2022
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "ORCLPRM"
Connected as SYSDG.
DGMGRL> show configuration

Configuration - orcl

  Protection Mode: MaxPerformance
  Members:
  orclprm - Primary database
    orclstb - Physical standby database
      Error: ORA-16664: unable to receive the result from a member

Fast-Start Failover:  Disabled

Configuration Status:
ERROR   (status updated 71 seconds ago)

DGMGRL> show database orclstb

Database - orclstb

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          0 seconds (computed 0 seconds ago)
  Average Apply Rate: 2.00 KByte/s
  Real Time Query:    ON
  Instance(s):
    orcl

Database Status:
SUCCESS


 


 

Thursday, January 13, 2022

Oracle RAC: kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [osysmond.bin:4024]

While in the middle of installing Oracle software on RAC, one of the terminals threw an error and terminated the session.


[oracle@rac02 ~]$ 
Message from syslogd@rac02 at Jan 13 15:47:15 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [osysmond.bin:4024]


In another session, the installation was still running and no other errors were reported; files were still being copied from node1 to node2.





This issue appears to come from VMware, and the KB article below indicates that it is simply a performance (latency) hiccup.

https://kb.vmware.com/s/article/67623


Oracle RAC: Oracle RAC not starting upon rebooting.

Oracle RAC is not starting upon reboot. The following shows some symptoms, tests, and what to look for.

Node2 does not return any RAC services status at first startup or after a reboot.

[root@rac02 bin]# ./crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.


Upon starting it all up, there are a couple of errors: CRS-1705 about the voting files, and the failed start of 'ora.cssd' on rac02. These are indications of an issue with the ASM storage presented to node2.


[root@rac01 bin]# ./crsctl start cluster -all
CRS-2672: Attempting to start 'ora.cssd' on 'rac01'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac01'
CRS-2672: Attempting to start 'ora.cssd' on 'rac02'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac02'
CRS-2676: Start of 'ora.diskmon' on 'rac01' succeeded
CRS-2676: Start of 'ora.diskmon' on 'rac02' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac01'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac01'
CRS-2676: Start of 'ora.ctssd' on 'rac01' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac01'
CRS-2676: Start of 'ora.asm' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac01'
CRS-2676: Start of 'ora.storage' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac01'
CRS-2676: Start of 'ora.crsd' on 'rac01' succeeded
CRS-1705: Found 0 configured voting files but 1 voting files are required, terminating to ensure data integrity; details at (:CSSNM00065:) in /u01/app/oracle/diag/crs/rac02/crs/trace/ocssd.trc
CRS-2674: Start of 'ora.cssd' on 'rac02' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'rac02'
CRS-2681: Clean of 'ora.cssd' on 'rac02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac02'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac02'
CRS-2676: Start of 'ora.diskmon' on 'rac02' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac02' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac02'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac02'
CRS-2676: Start of 'ora.ctssd' on 'rac02' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac02' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac02'
CRS-2676: Start of 'ora.asm' on 'rac02' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac02'
CRS-2676: Start of 'ora.storage' on 'rac02' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac02'
CRS-2676: Start of 'ora.crsd' on 'rac02' succeeded




All of the RAC-related storage checks initially appear to hang. After running oracleasm scandisks, which instantiated the disk on node2, the RAC-related disk checks show the disks in their output.



[root@rac02 bin]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "DISK01"

[root@rac02 bin]# ./crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   cacc790e79514f14bf658a94d092b503 (/dev/oracleasm/disks/DISK01) [DATA]
Located 1 voting disk(s).

[root@rac02 bin]# ./ocrcheck
Status of Oracle Cluster Registry is as follows :
Version                  :          4
Total space (kbytes)     :     491684
Used space (kbytes)      :      84360
Available space (kbytes) :     407324
ID                       : 1922254439
Device/File Name         :      +DATA
                                    Device/File integrity check succeeded
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check succeeded
Oracle Cluster Registry check was cancelled because an ongoing update was detected.


All of the oracleasm configuration appears to be set appropriately.


[root@rac02 ~]# oracleasm status
Checking if ASM is loaded: yes
Checking if /dev/oracleasm is mounted: yes

[root@rac02 ~]# oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=oracle
ORACLEASM_GID=oinstall
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER=""
ORACLEASM_SCANEXCLUDE=""
ORACLEASM_SCAN_DIRECTORIES=""
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"

systemd does have the oracleasm service enabled at startup.

[root@rac02 bin]# systemctl list-unit-files --type=service|grep oracleasm
oracleasm.service                             enabled 


The oracleasm log ("/var/log/oracleasm") was showing "Disk "DISK01" does not exist or is not instantiated" when node2 rebooted, even though oracleasm configure shows scan-on-boot enabled. Scanning the disks manually after a reboot fixes the issue, so the problem has to do with storage and timing. After some googling, my issue seems to match the following notes from Oracle Metalink.


Oracle Linux 7: ASM Disks Created on FCOE Target Disks are Not Visible After System Reboot (Doc ID 2065945.1)


I added a roughly 20-second delay in "/usr/sbin/oracleasm.init" prior to the scandisk step, and it solved my issue. That gives the storage about 20 seconds to be presented before the scandisk and disk instantiation run.
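The general shape of that edit, as a sketch; the exact placement inside "/usr/sbin/oracleasm.init" depends on where its start section invokes the scan:

# give the FCoE/SAN storage time to be presented before scanning
sleep 20
# then instantiate the ASM disks
/usr/sbin/oracleasm scandisks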

With the 20 seconds delay, /var/log/oracleasm is showing "Instantiating disk "DISK01" upon reboot. The scandisk initialization attempt passes the Oracle RAC cluster start-up. If the disk is not initiated, nothing start-up in the cluster.