Data Warehouse Fault Tolerance Part 2: Restarting

In my last post, I described the first "R" in data warehouse fault tolerance: Resuming. As I mentioned in the introduction to this series, my goal is a triage approach where the simple things, such as space errors, are handled effortlessly and without repercussions. But what happens when the errors are not so simple, and Oracle's built-in resuming functionality can't catch them? In these cases, the ETL process will actually fail, and the cause will have to be corrected before the load can be restarted.

There are numerous approaches to crafting sustainable ETL; as a matter of fact, Peter Scott wrote a post by that very title. Jon Mead contributed this one about resuming ETL processes. Just a note of clarification: what Jon describes as "resuming" is what I am describing here as "code-controlled restarting": building smart ETL process flows and instrumented mappings so that a record is kept of what's already been run. This is a required component, and I recommend coding best practices such as these into all ETL processes. But the restartability feature I'm focusing on is "data management restartability", which deals with controlling data sets after failures. So the feature that I'm plugging in for the Restartability phase is Oracle's Flashback functionality.
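
Before moving on to Flashback, here's a rough sketch of what that code-controlled instrumentation might look like; this is just an illustration, and every name in it is hypothetical:

-- Hypothetical batch-control table for code-controlled restarting.
-- Each step records its own status so a rerun knows what it can skip.
CREATE TABLE etl_step_log (
   execution_id  NUMBER         NOT NULL,
   step_name     VARCHAR2(100)  NOT NULL,
   status        VARCHAR2(20)   NOT NULL,   -- RUNNING, COMPLETE or FAILED
   start_time    TIMESTAMP      DEFAULT systimestamp,
   end_time      TIMESTAMP,
   CONSTRAINT etl_step_log_pk PRIMARY KEY (execution_id, step_name)
);

-- Before doing any work, each mapping checks whether its step already ran:
SELECT status
  FROM etl_step_log
 WHERE execution_id = 1001
   AND step_name = 'LOAD_CUSTOMERS';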

Flashback provides the capability to revert the entire database, or smaller portions of it, to a particular point in time. For Oracle, a "point in time" is always referenced by the System Change Number (SCN), an internal clock for the Oracle database. The SCN increments every time a transaction commits, though other sources, such as the SMON process, can increment it as well. The current SCN can be viewed in several data dictionary and dynamic performance views, as well as with the supplied DBMS_FLASHBACK package.

SQL> select current_scn,
  2  dbms_flashback.get_system_change_Number
  3  from v$database;

CURRENT_SCN | GET_SYSTEM_CHANGE_NUMBER
----------- | ------------------------
    2536238 |                  2536238

1 row selected.

SQL> 

We can convert from SCNs to timestamps and back again, but the conversion is not exact. The Oracle documentation states that the mapping is precise to about 3 seconds, which is evident in the example below, where converting an SCN to a timestamp and back returns 2536237 rather than the original 2536238:

SQL> select SCN_TO_TIMESTAMP(2536238) scn
  2  from dual;

SCN
------------------------------
02/09/2010 12.47.26.000000000

1 row selected.

SQL> select TIMESTAMP_TO_SCN('02/09/2010 12.47.26.000000000') ts
  2  from dual;

        TS
----------
   2536237

1 row selected.

SQL>

Even though we access both Flashback Database and Flashback Table with the same general syntax, and specify SCNs for both, the technical implementation under the hood is drastically different. Flashback Table is entirely an undo operation, and is really not a new feature at all. Oracle has always used the undo tablespace (rollback segments before that) to reconstruct the state of tables as of a particular SCN, which is what allows the robust multi-versioning that keeps readers and writers from blocking one another. Flashback Table is just an "opening up" of that multi-versioning, in a manner of speaking, so that any prior SCN can be recovered as long as sufficient undo exists.
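
The same undo-based multi-versioning is visible through Flashback Query, which simply reads a table as it existed at a prior SCN or timestamp without changing anything. A minimal sketch (the SCN and the interval are illustrative values):

-- Read-only view of a table as of a prior SCN, served entirely from undo.
SELECT count(*)
  FROM sh.sales AS OF SCN 2536238;

-- The same idea expressed with a timestamp instead of an SCN.
SELECT count(*)
  FROM sh.sales AS OF TIMESTAMP systimestamp - INTERVAL '15' MINUTE;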

Flashback Database, on the other hand, doesn't use undo at all, relying instead on new database file structures called flashback logs, in conjunction with a small amount of archived redo. Flashback logs contain prior versions of changed blocks; during a flashback, Oracle writes the block versions from just before the SCN of interest back into the datafiles, then applies redo to bring the database forward to the exact SCN.

So what part does Oracle's Flashback technology play in data warehouse fault tolerance, specifically in the area of Restartability? Some aspect of the load will likely need to be "undone" before we can continue, and this is where Flashback fits in neatly, as demonstrated in the following examples.

I created copies of the CUSTOMERS, PRODUCTS and SALES tables from the SH schema and populated them with the rows from there as well. Before I start, I need to enable row movement on the new SALES table. This would need to be done for every table in the data warehouse that is a candidate for Flashback Table:

SQL> alter table target.sales enable row movement;

Table altered.

SQL>
SQL> SELECT count(*) FROM target.products;

  COUNT(*)
----------
       72

1 row selected.

SQL> SELECT count(*) FROM target.customers;

  COUNT(*)
----------
     55500

1 row selected.

SQL> SELECT count(*) FROM target.sales;

  COUNT(*)
----------
    918843

1 row selected.

SQL> 

Next, I'll get ready to execute my code. First, I'll create what's called a "restore point" in the database. This allows me to give a meaningful name to a particular SCN, similar to tagging a release in Subversion. Before each new step in the process, I'll create a restore point so that each phase has a tagged, referenceable SCN. Since I'm using a unique, sequence-generated number for each batch that runs (Jon calls it an "execution ID" in his post above), I'll work that number into the names of my restore points.

SQL> create restore point dw_load_1001;

Restore point created.

SQL>
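
In a real load, of course, the restore point name wouldn't be hard-coded. A small anonymous block like the one below could build the name from the batch's execution ID; the variable and where it comes from are purely illustrative:

-- CREATE RESTORE POINT is DDL, so the name is assembled dynamically.
DECLARE
   v_execution_id NUMBER := 1001;  -- would come from the batch-control sequence
BEGIN
   EXECUTE IMMEDIATE 'create restore point dw_load_' || v_execution_id;
END;
/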

Next... I do the processing that moves the necessary files into place (if any), prepares and loads the ODS tables, etc. After that... I move into the load of the dimensional model itself:

SQL> create restore point load_customers_1001;

Restore point created.

SQL> exec dw_load.load_customers;
Number of records loaded: 0

PL/SQL procedure successfully completed.

SQL> create restore point load_products_1001;

Restore point created.

SQL> exec dw_load.load_products;
Number of records loaded: 72

PL/SQL procedure successfully completed.

SQL> create restore point load_sales_1001;

Restore point created.

SQL> exec dw_load.load_sales;
5 indexes and 0 local index partitions affected on table TARGET.SALES
Number of records loaded: 699999
Rebuild processes for unusable indexes on 28 partitions of table TARGET.SALES executed
No matching unusable global indexes found

PL/SQL procedure successfully completed.

SQL> 

So the data warehouse load ran without error, which means I can assume it was successful, right? Looking back over the log, I see that no rows were actually loaded into the CUSTOMERS table. After researching the issue, I discover that the Change Data Capture process on the source system is experiencing errors, and no rows were published to the CUSTOMERS change set. Since the load didn't technically fail, the process continued on to the fact table load, and it's very likely that many of the rows in the fact table now carry the wrong surrogate key from the CUSTOMERS dimension.
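
For what it's worth, a quick sanity check against the change table would have confirmed the problem before the fact load ran. The publisher schema and change table name below are made up for illustration; COMMIT_TIMESTAMP$ is one of the optional CDC control columns, assuming it was included when the change table was created:

-- Count what the CUSTOMERS change set actually published for today's window.
SELECT count(*)
  FROM cdc_pub.customers_ct
 WHERE commit_timestamp$ >= trunc(sysdate);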

This "aftermath" is exactly what the triage approach I described in earlier postings is meant to avoid. In my experience, ETL load failures and the subsequent aftermath (investigations, data corrections, and reloads) cause more downtime than any hardware or software issue. But with the approach I've put in place, this aftermath shouldn't concern me, because now I can simply "undo" it (pun intended).

SQL> flashback table target.sales to restore point load_sales_1001;

Flashback complete.

SQL> select count(*) from target.sales;

  COUNT(*)
----------
    918843

1 row selected.

SQL> create restore point new_load_customers_1001;

Restore point created.

SQL> exec dw_load.load_customers;
Number of records loaded: 99

PL/SQL procedure successfully completed.

SQL> create restore point new_load_sales_1001;

Restore point created.

SQL> exec dw_load.load_sales;
5 indexes and 0 local index partitions affected on table TARGET.SALES
Number of records loaded: 699999
Rebuild processes for unusable indexes on 28 partitions of table TARGET.SALES executed
No matching unusable global indexes found

PL/SQL procedure successfully completed.

SQL> 

Instead of flashing back, I could try to sort out the issue manually. For instance, if I'm attaching the unique execution ID to every row in the fact table, either directly or through an AUDIT dimension, then I could probably identify and delete the rows from this run. But why would I do that when the Flashback functionality is already available to me?
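
For comparison, that manual cleanup would look something like the following, assuming the fact table carried an EXECUTION_ID column (instrumentation I'm inventing here for illustration, not something in the SH-derived table):

-- Manual alternative to Flashback Table: remove only this batch's rows.
DELETE FROM target.sales
 WHERE execution_id = 1001;

COMMIT;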

My test case above was a simple one; I was able to proceed just by flashing back a single table before restarting the process. In a large enterprise data warehouse, however, the effort involved in a typical aftermath can be staggering, depending on how many fact tables are involved, how many dimension tables track history with SCD Type 2 changes, and so on. Combine that with the possible need to flash back ODS tables, history tables, persistent staging tables, etc. I've seen numerous situations where the exact ramifications are tough to quantify: we know what broke, but we have no idea what needs to be fixed. Perhaps there was a hardware failure in the middle of an ETL load, and it's hard to identify exactly which tables were loaded and which ones weren't. In cases like these, what I really need is the ability to do a complete "do-over": put everything back the way it was prior to the beginning of the load, and simply restart everything.

Enter Flashback Database. I'll demonstrate what's required to enable this feature, and then I'll replay the test case above and solve it from this angle.

I first need to put my database in ARCHIVELOG mode, as archived redo is a required component of the feature:

SQL> startup mount
ORACLE instance started.

Total System Global Area  422670336 bytes
Fixed Size                  1336960 bytes
Variable Size             335546752 bytes
Database Buffers           79691776 bytes
Redo Buffers                6094848 bytes
Database mounted.
SQL> alter database archivelog;

Database altered.

SQL> archive log list
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence     25
Next log sequence to archive   27
Current log sequence           27
SQL> 

Next, I need to configure the Flash Recovery Area, a location on disk where the database will create the flashback logs (along with other recovery-related files):

SQL> alter system set db_recovery_file_dest_size=3G;

System altered.

SQL> Alter system set db_recovery_file_dest='/oracle/flash_recovery_area';

System altered.

SQL>

Finally, I need to set the db_flashback_retention_target parameter, which tells the database how far back the flashback logs need to be able to take us. This parameter is specified in minutes... thanks for the consistency, Oracle. The value of 2880 below works out to two days. After that, I just enable flashback and open the database:

SQL> alter system set db_flashback_retention_target=2880;

System altered.

SQL> alter database flashback on;

Database altered.

SQL> alter database open;

Database altered.

SQL> 

So, Flashback Database should be ready to use. I'll take a quick look and see if the database thinks it's ready:

SQL> select oldest_flashback_scn,
  2         oldest_flashback_time,
  3         startup_time
  4    from v$flashback_database_log
  5         cross join v$instance;

OLDEST_FLASHBACK_SCN | OLDEST_FLASHBACK_TIME  | STARTUP_TIME
-------------------- | ---------------------- | ----------------------
             2912097 | 02/10/2010 12:10:30 AM | 02/10/2010 12:09:08 AM

1 row selected.

Elapsed: 00:00:00.15
SQL>  

Now I'll flash back the entire database to the very first restore point I created, dw_load_1001:

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.

Total System Global Area  422670336 bytes
Fixed Size                  1336960 bytes
Variable Size             343935360 bytes
Database Buffers           71303168 bytes
Redo Buffers                6094848 bytes
Database mounted.
SQL> flashback database to restore point dw_load_1001;

Flashback complete.

SQL> alter database open resetlogs;

Database altered.

SQL> SELECT count(*) FROM target.products;

  COUNT(*)
----------
       72

1 row selected.

SQL> SELECT count(*) FROM target.customers;

  COUNT(*)
----------
     55500

1 row selected.

SQL> SELECT count(*) FROM target.sales;

  COUNT(*)
----------
    918843

1 row selected.

SQL>

So the immediate downside of this approach is that it requires the involvement of the operations team because the database has to be in mount mode, and the data warehouse is not available during this slight outage. However, when compared with the time it might take to sort out and correct massive aftermath scenarios, this seems to be the preferable choice. Is the data warehouse really "available" if data corrections and data reloads are occurring? I would rather involve the operations team for a quick, concrete fix so the reload can complete as soon as possible.

The next "R" is Restoring, though it really involves putting the pieces in place for a scalable Backup strategy. And "Backup" doesn't start with an "R".