Monday, August 09, 2010

Internals of online index build

Internals of online index build:
Basic premise for online index build, which was introduced in Oracle 8i, was to let DML's to continue while online index build is in progress. It greatly improved database availability for most of the applications. But it suffered from complete hang of DML's during initial and end phases of online index build, which is resolved in 11g.

The way online index build (OIB) works is by creating an IOT journal table to keep track of changes while the OIB is in progress and merge all the changes from journal table to complete index build operation.

If I have a table with the structure of (A NUMBER, B NUMBER, C CHAR(100), D NUMBER) and to create index on (A, B, C) column, Oracle would create journal table like:

create table "ORACLE"."SYS_JOURNAL_18155" (C0 NUMBER, C1 NUMBER, C2 VARCHAR2(100), opcode char(1), partno number, rid rowid, primary key( C0, C1, C2 , rid )) organization index;

Before 11g, OIB will get in the DML queue to lock the table exclusively while preventing the new DML's to go through, once all the active transactions (ones which were initiated before the OIB) are completed, OIB will create the journal IOT table and release the exclusive table lock (it'll still keep the share lock on the table to prevent any other DDL's) for DML's to continue.

As part of journal table creation, Oracle would create an internal trigger on the primary table to record all the changes to the journal table. Along with using all the index columns, journal table will add "ROWID" to that list to make it as primary key.

"OPCODE" column represents the type of operation like "I" for Insert and "D" for Delete.
"PARTNO" column represents partition number of the underlying table.

Among all the changes to a given row for any of the index key columns, only the most recent change for that record is visible in the journal table. Any update operation of index key columns would be converted to "DELETE" and "INSERT" in the journal table.

While rest of the user sessions populate journal table with the ongoing changes for index key columns, OIB session would be reading the table in consistent mode (as of the time journal table is created) to create the index followed by the final merge operation.

During the merge process, Oracle will read the journal table (IOT) leaf blocks from left to right to merge those changes with the index being built. As the journal table leaf block changes are applied, once a leaf block is fully consumed, it's reference will be deleted from the branch block.

This process will continue until all leaf blocks are consumed and when it comes to the last leaf block, Oracle would stop all the DML's again to do the final merge and drop the journal table before releasing the mode 6 exclsuive table lock.

As each leaf block is consumed, Oracle would mark each entry as deleted. If more DML's happen while Oracle is doing the merge, it'll do one more pass of walking through the leaf blocks, this process continues until the merge process is all done.

If there is any long running DML before issuing the OIB, Oracle would wait on the long running transaction and it will prevent upcoming DML's also. This is one of the major problems with OIB. Same thing can happen when it is doing the final merge.

With 11g, significant changes were introduced to address all these problems.

Oracle will still wait for long running transactions to complete before starting the OIB process, but it will not block the upcoming DML's. Oracle will not use exclusive table lock (mode 6) for OIB, which will eliminate any DML hang situations.

As soon as OIB process is initiated, Oracle will create IOT journal table and use internal trigger on the underlying table to keep track of the changes.

Once the journal table is created and the long running transactions are completed, Oracle will create the index by reading the table blocks in "CURRENT" mode, which means any committed changes in a block happened before the current SCN are read instead of the old way of accessing the blocks as of journal table creation time. This will virtually eliminate chance of running into rollback segment too small errors.

Oracle will recreate the internal trigger to directly modify the target index structure once the index build is completed (but the merge job is still pending). This means all user sessions will now directly update the target index instead of updating the journal table. Any change coming through would first be checked in the journal table before consolidating the change for the target index (this is best explanied with the example below).

Record "A" was inserted with rowid "RID" before the merge phase and hence it is tracked in the journal table.
Record "A" was deleted during the merge phase. Now the user session will read the journal table by record "A" with rowid "RID" and apply that change to the target index before deleteing the same record. It will also mark that record as consumed (i.e. deleted) in the IOT journal table leaf block.

While the user sessions continue to update the target index with the ongoing changes, OIB session will do the merge of journal table changes into the final index, as there will not be any ongoing DML changes being tracked in the journal table, this process will complete faster than before (i.e. pre 11g).

If you are creating unique index online, that uniqueness will not be enforced in the IOT journal table, but rather in the target index from the initiation of build/merge phase.

If you are creating either a reverse key index or global hash partitioned index to resolve any index contention (because you have sequentially increasing leading key column), then the OIB process will suffer from the contention because the underlying IOT journal table is created as normal B-tree index (i.e. no reverse key or hash partitioning).

I've created enhancement request 9912950 for this issue with Oracle support. Hopefully they resolve it in 12g.