Thursday, September 08, 2011

Enhancing Oracle Active Data Guard.

I think Active Data Guard (ADG) is one of the best features Oracle has ever released and it is remarkably popular considering how easily read work load can be off loaded while guaranteeing data quality (due to the nature of binary replication, good luck to those who rely on logical replication for DR) and ability to ensure zero data loss.

Here are some of the enhancements which can make ADG even more adaptable:

1) Ability to create global temporary tables. Oracle doesn't have to generate redo for GTT's and hence they can be supported on ADG standby. This will enable batch jobs and ETL job's to run on the ADG standby.

2) Continue using parallel threads for redo apply when redo stream encounters new datafile addition. When you add new datafile on primary, redo apply will serialize on creating the datafile on ADG standby. When you add lot of datafiles in quick succession on primary, standby can fall behind by several minutes.

3) When new datafile addition fails on standby for any reason, manual fix requires ADG standby to go in "MOUNT" mode, which is an outage. Most common reason for datafile addition to fail is lack of space, once the space issue is addressed manual fix is to use "create datafile as" command which requires standby to be in "MOUNT" mode.

4) Ability to maintain partial physical standby for ADG. When you use RAC for database consolidation, multiple schemas become part of one big physical database. Not all applications have same business priority, it is very natural that only some of those schema's require read only traffic offload to ADG standby. Instead of creating full physical standby, you only want to create standby with the tablespaces you are interested in and it requires quite a bit of manual scripting and processes involved to take care of new datafiles being created on primary which are not part of the interested tablespaces. Oracle can introduce init parameters like "KEEP_TABLESPACES" and "UNKEEP_TABLESPACES" on the standby which can address this issue by managing new datafiles depending on how it falls into these parameters.

5) Block cleanout that happens as part of commit on primary database is non logged (not recorded in the redolog), which means commit time block cleanout can't happen on the standby. This will cause all read queries on ADG standby to keep visiting undo transaction table (one additional undo block visit for every data segment block that is not cleaned out). I think Oracle should log commit time block cleanout so that standby can also see it and this problem will be resolved.

6) Data Guard role switchover from standby to primary causing ADG standby to go from open read only state to "MOUNT" state and then to "OPEN" state causing all established connections to fail, which can extend total database outage. Although it is not significant, every second counts for availability.

7) Like "CURRENT_SCN" column in v$database, additional column for corresponding time would make it very easier to determine standby lag in real time. All other options currently available are either expensive to query for multiple executions or not fully reliable.

8) When you enable supplemental logging (or force logging) on the primary database, it only gets saved in the control file and this change will not go to the standby unless you regenerate standby control file from the primary or run that command explicitly on the standby too. This can cause huge problem, if you have logical standby or Goldengate being used, after you switchover to the standby. Oracle should record any change in supplemental logging definition to the dictionary also than just to the control file.

9) Ability to determine datafiles impacted by NOLOGGING changes on the standby. Querying "UNRECOVERABLE_TIME" from v$datafile doesn't yield correct result and customer's can falsely rely on this data. Since metadata about blocks impacted by NOLOGGING change is already logged, redo apply on the standby can update standby control file with this information and hence v$datafile can be queried reliably for determining nologging impacted datafiles.

10) Ability to use multiple processes for redo thread merging as part of redo apply for RAC databases. As redo generation is increasing with each release of Oracle and customers increasingly using RAC for database consolidation, one process may not be able to merge redo from all the threads fast enough for recovery slaves to consume. It can be a problem if large number of RAC nodes are being used and redo generation in most of the threads is quite high.

I've opened enhancement request for some of the above points.