Saturday, September 04, 2010

Beware of data loss in BCT based RMAN backups

The block change tracking (BCT) feature, introduced in Oracle 10g, lets RMAN incremental backups go directly to the changed data blocks instead of scanning every block. In 11g it can also be used on an Active Data Guard standby.

Alex Gorbachev has written an excellent white paper about BCT; it is at http://www.pythian.com/documents/Pythian-oracle-block-change.pdf

By default, BCT maintains 8 bitmap versions for each datafile in the change tracking file. The background process CTWR is responsible for maintaining the BCT bitmap file.

CTWR is hooked into the redo apply mechanism. On the primary database it is involved as a user process executes a transaction (internally, Oracle applies the generated redo to the data blocks to change them). On the standby database it is involved as the media recovery process applies the redo.

One of the most popular database backup methods is to take an RMAN incremental level 0 image copy of all datafiles, followed by periodic incremental level 1 backups that are merged into the previous level 0 copy to maintain a rolling full backup of the database. BCT is very useful for this purpose.
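To make the rolling backup idea concrete, here is a minimal sketch in Python. It is a toy model, not how RMAN or CTWR are implemented: a "datafile" is just a list of block contents, the change tracking bitmap is a set of block numbers, and all function names are made up for illustration.

```python
# Toy model only: a "datafile" is a list of block contents, and the
# change-tracking bitmap is a set of block numbers touched since the last backup.

def write_block(datafile, bitmap, block_no, value):
    """Change a block and record it in the change-tracking bitmap."""
    datafile[block_no] = value
    bitmap.add(block_no)

def level1_backup(datafile, bitmap):
    """Copy only the blocks flagged in the bitmap, then clear the bitmap."""
    incremental = {n: datafile[n] for n in sorted(bitmap)}
    bitmap.clear()
    return incremental

def merge_into_copy(image_copy, incremental):
    """Roll the image copy forward by overwriting the backed-up blocks."""
    for block_no, value in incremental.items():
        image_copy[block_no] = value

datafile = ["a0", "b0", "c0", "d0"]
bitmap = set()
image_copy = list(datafile)            # stands in for the level 0 image copy

write_block(datafile, bitmap, 1, "b1")
write_block(datafile, bitmap, 3, "d1")

incremental = level1_backup(datafile, bitmap)
merge_into_copy(image_copy, incremental)

print(sorted(incremental))             # block numbers read for the level 1 backup
print(image_copy == datafile)          # whether the rolled-forward copy matches
```

In this model the merged copy is only as good as the bitmap: the backup never looks at a block that the bitmap did not flag.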

So BCT is a good feature with very light overhead on the primary or the standby database, and it has been available for a few years now with many customers using it. Are there any issues with using this feature?

There is a possibility of data loss in RMAN incremental backups based on BCT.

Data loss scenario 1:
BCT works on the physical standby only when managed recovery is in use, and of course an Active Data Guard license is needed.

Enable BCT on the standby.
Create a test table on the primary and insert one record.
Identify which datafile the test table belongs to.
Use standby managed recovery to bring it current with the primary.
Stop the managed recovery on the standby.
Take an RMAN incremental backup (backup incremental level 1 for recover of copy with tag 'bct1' tablespace test_tbs)
Insert a second row into the test table on the primary and switch the logfile.
Apply the new logs on the standby using traditional recovery (i.e. recover standby database)
Run the RMAN incremental backup again with the same command as above
Merge the RMAN incremental backup with the first image copy (i.e. recover copy of datafile 5 with tag 'bct1')
Offline drop the datafile that holds the test table on the standby and rename it to the RMAN backup copy of that datafile.
Now open the standby in read only mode and select from the test table. The second row will be missing.
Bug# 10094823 was opened for this. It is now fixed and the patch is available for 11.2.0.1.
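One way to picture why the second row goes missing is the toy Python model below. It rests on an assumption that the steps above suggest but that is not confirmed Oracle internals: redo applied by traditional recovery changes the block without the change being recorded in the bitmap. The names and the single-block "datafile" are purely illustrative.

```python
# Toy model of scenario 1. Assumption (not confirmed Oracle internals): redo
# applied by traditional recovery is not recorded in the change-tracking bitmap.

def apply_redo(datafile, bitmap, block_no, value, tracked):
    """Change a block; record it in the bitmap only when tracking is active."""
    datafile[block_no] = value
    if tracked:
        bitmap.add(block_no)

def backup_and_merge(datafile, bitmap, image_copy):
    """Level 1 backup driven by the bitmap, merged straight into the copy."""
    for block_no in sorted(bitmap):
        image_copy[block_no] = datafile[block_no]
    bitmap.clear()

standby_file = [""]                    # block 0 holds the test table
bitmap = set()

apply_redo(standby_file, bitmap, 0, "row1", tracked=True)        # managed recovery
image_copy = list(standby_file)        # first backup run creates the image copy
bitmap.clear()

apply_redo(standby_file, bitmap, 0, "row1,row2", tracked=False)  # traditional recovery
backup_and_merge(standby_file, bitmap, image_copy)               # second backup + merge

print(standby_file)                    # contents of the standby datafile
print(image_copy)                      # contents of the merged backup copy
```

Because the bitmap is empty at the second backup, nothing is copied, and the merged image copy still reflects the state before the second insert.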


Data loss scenario 2:
Suppose you ever have to refresh a datafile on the standby with a copy that has a later checkpoint time (SCN), and you run the RMAN incremental backup right after that, before bringing the rest of the datafiles on the standby to a checkpoint time consistent with the recently refreshed datafile. The backup of the just-refreshed datafile will then be missing data. The loss shows up if that same backup is restored (i.e. the just-refreshed datafile is renamed to the RMAN backup copy) before the next RMAN incremental is run.

This problem can get corrected automatically by the next RMAN incremental, but only if you use the latest backup.

Data loss scenario 3:
Offline drop a datafile on the standby.
Create a test table on production. Make sure that the table extents are in the above datafile.
Copy the datafile from production with a later checkpoint time and leave the datafile in offline status.
Apply a few logs on the standby using managed recovery mode.
Online the datafile.
Apply logs on the standby using managed recovery until the standby is caught up with live.
Take an RMAN incremental backup of that datafile.
Update the incremental backup copy with the "recover copy of datafile with tag" command.
Rename the datafile to the RMAN backup copy.
Apply a few more logs and open the standby in read only mode.
You will now see the test table data missing in that datafile.

Reliability of BCT:
On an 11.2.0.1 standby, I've seen managed standby recovery fail to start until BCT is reset, at least while running the above tests. I'm working with Oracle Support to get all these issues fixed.

As of 11.2.0.1, make sure to get the above-mentioned issues addressed before using BCT on the standby.

Update (09/23/2010):
I've opened an enhancement request with Oracle Support to implement the following features in BCT:

1) Ability to enable/disable BCT at datafile level.
2) Make RMAN compare the checkpoint_time recorded in BCT for a given datafile with the actual checkpoint_time in the datafile header, and use BCT only if the two checkpoint_time values match.
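The check proposed in item 2 could look roughly like the Python sketch below. This is only an illustration of the requested behavior, not something RMAN does today; the function name, its parameters, and the "bct" / "full scan" return values are hypothetical.

```python
from datetime import datetime

def choose_backup_method(bct_checkpoint_time, header_checkpoint_time):
    """Trust the bitmap only when BCT and the datafile header agree.

    Hypothetical helper: falls back to reading every block of the datafile
    whenever the two checkpoint times differ.
    """
    if bct_checkpoint_time == header_checkpoint_time:
        return "bct"
    return "full scan"

header_time = datetime(2010, 9, 4, 12, 0, 0)

# BCT and the datafile header agree, so the bitmap can be used.
in_sync = choose_backup_method(datetime(2010, 9, 4, 12, 0, 0), header_time)

# The datafile was refreshed with a later checkpoint that BCT never saw.
refreshed = choose_backup_method(datetime(2010, 9, 4, 9, 30, 0), header_time)

print(in_sync, refreshed)
```

The point of the fallback is that a mismatch costs only a slower backup, while trusting a stale bitmap can silently drop changed blocks.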



 
