Data Warehouse Fault Tolerance Part 1: Resuming

In the introduction to this series of posts, I spoke briefly about data warehouse fault tolerance and the unique challenges resulting from high data volumes combined the batch load window required to create them. I then defined the goal: a layered approach allowing simple errors to be caught early before they turn in to serious conditions.

Resuming is the ability to continue effortlessly after an error. The important thing is that there should be no aftermath from the error: our process should pause gracefully until the error is corrected. The Oracle Database has offered out of the box functionality for resuming since version 9i in the form of Resumable Space Allocation. Resumable operations are supported for SELECT queries, DML and DDL, and can be enabled at either the system or the session level. To enable at the system level, the RESUMABLE_TIMEOUT database parameter should have a non-zero value.

SQL> alter system set resumable_timeout=3600;

System altered.

SQL>

To enable resumable operations at the session level, the statement follows this basic syntax, with the TIMEOUT and NAME clauses being optional:

ALTER SESSION ENABLE RESUMABLE <TIMEOUT n> <NAME string>;

The TIMEOUT value is specified in seconds, and if omitted, the default value of 7200 is used, or 2 hours. The NAME clause gives the resumable session a user-friendly name for when we are monitoring for resumable sessions (as we will see later) to see which of our processes is suspended. Enabling resumable operations for the session level requires that the RESUMABLE permission has been granted:

SQL> grant resumable to stewart;

Grant succeeded.

SQL>

Resumable operations can also be enabled with the Oracle utilities... such as SQL-Loader, Export/Import and Datapump. The command-line parameters RESUMABLE, RESUMABLE_NAME and RESUMABLE_TIMEOUT exist to mimic the functionality mentioned above.

Now for a demonstration. I'll create a situation that is ripe for a space allocation error: I'll put an empty copy of the SALES fact table from the SH schema in a tablespace with only 250K of space:

SQL> create tablespace target datafile '/oracle/oradata/bidw1/target01.dbf' size 250K;

Tablespace created.

SQL> create table target.sales tablespace target as select * from sh.sales where 1=0;

Table created.

SQL>

Now I'll load some records into the table, which should cause it to suspend. To prepare my session, I need to enable resumable operations. Since I always instrument my code, I'll register my process with the database. After that, I have an easy way to guarantee consistency when referring to processes. Now, I can use the registered name for my resumable session as well:

SQL> exec dbms_application_info.set_module('SALES fact load','insert some rows');

PL/SQL procedure successfully completed.

SQL>
SQL> DECLARE
  2     l_module VARCHAR2(48) := sys_context('USERENV','MODULE');
  3  BEGIN
  4     EXECUTE IMMEDIATE
  5     'alter session enable resumable timeout 18000 name '''||l_module||'''';
  6  END;
  7  /

PL/SQL procedure successfully completed.

SQL>

I start loading the records in hopes of a suspended session:

SQL> insert into target.sales select * from sh.sales;

So now, I open up another session, and I start another transaction against the TARGET.SALES table, just to pile on the TARGET tablespace:

SQL> exec dbms_application_info.set_module('SALES fact load2','insert more rows');

PL/SQL procedure successfully completed.

SQL>
SQL> DECLARE
  2     l_module VARCHAR2(48) := sys_context('USERENV','MODULE');
  3  BEGIN
  4     EXECUTE IMMEDIATE
  5     'alter session enable resumable timeout 18000 name '''||l_module||'''';
  6  END;
  7  /

PL/SQL procedure successfully completed.

SQL> insert into target.sales select * from sh.sales;

I'll have a look in the DBA_RESUMABLE view (there is also a USER_RESUMABLE version) for my suspended sessions. Even though I could get all the following information with a single SQL statement, I broke it up for better visibility on the blog:

SQL> select name, start_time, suspend_time, status from dba_resumable;

NAME              | START_TIME           | SUSPEND_TIME         | STATUS
----------------- | -------------------- | -------------------- | ------------
SALES fact load2  | 02/06/10 10:33:33    | 02/06/10 10:33:33    | SUSPENDED
SALES fact load   | 02/06/10 10:29:03    | 02/06/10 10:29:03    | SUSPENDED

2 rows selected.

Elapsed: 00:00:00.07
SQL> select name, sql_text from dba_resumable;

NAME              | SQL_TEXT
----------------- | -----------------------------------------------
SALES fact load2  | insert into target.sales select * from sh.sales
SALES fact load   | insert into target.sales select * from sh.sales

2 rows selected.

SQL> select name, error_msg from dba_resumable;

NAME              | ERROR_MSG
----------------- | ------------------------------------------------------------------------
SALES fact load2  | ORA-01653: unable to extend table TARGET.SALES by 8 in tablespace TARGET
SALES fact load   | ORA-01653: unable to extend table TARGET.SALES by 8 in tablespace TARGET

2 rows selected.

SQL>

The Oracle Database also publishes server alerts concerning suspended transactions using the Server-Generated Alerts infrastructure. This infrastructure uses the AWR toolset, the server package DBMS_SERVER_ALERT for getting and setting metric threshholds, and the queue table ALERT_QUE to hold alerts that have been published from AWR. Custom processes could be written to mine ALERT_QUE for these alerts, but the easiest way to configure and view server alerts is using Oracle Enterprise Manager (OEM). On the Alerts section of the main OEM page, we can see three different alerts generated by the Oracle Database:

all alerts.png
If we click on the "Session Suspended" link, we can see the multiple alerts generated in this category:
suspend alerts.png
Another alert generated indirectly by the suspended transaction is the "Configuration" class event caused by our session "waiting" to proceed. The Oracle wait event interface can show us information about the suspend waits on the system:
SQL> SELECT event,
  2         SUM(time_waited) time_waited,
  3         SUM(total_waits) total_waits,
  4         AVG(average_wait) average_wait
  5    FROM gv$session_event
  6   WHERE lower(event) LIKE '%suspend%'
  7   GROUP BY event
  8   ORDER BY time_waited ASC
  9  /
EVENT TIME_WAITED TOTAL_WAITS AVERAGE_WAIT
statement suspended, wait error to be cleared 305373 1377 221.78

1 row selected.

SQL>


To free up the space issue, I'll enable autoextend on the TARGET tablespace. Then, I'll take a look and see if anything has changed:

SQL> alter database datafile '/oracle/oradata/bidw1/target01.dbf'
  2  autoextend on next 10M maxsize 1000M;

Database altered.

SQL> select status, resume_time, name from dba_resumable;

STATUS       | RESUME_TIME          | NAME
------------ | -------------------- | -----------------
NORMAL       | 02/06/10 10:56:49    | SALES fact load2
NORMAL       | 02/06/10 10:56:49    | SALES fact load

2 rows selected.

SQL>

The Resumable Space Allocation features includes the AFTER SUSPEND trigger, which allows the specification of a system-wide trigger that will fire whenever a transaction is suspended. The typical use for this functionality is alerting as suspended operations don't write anything to the alert log.

UPDATE: I made a mistake here... suspended transactions do in fact cause entries in the alert log, and so does the RESUME process detailed below.

There are some features in the DBMS_RESUMABLE package that may make sense when writing an AFTER SUSPEND trigger:

SQL> desc dbms_resumable
PROCEDURE ABORT
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 SESSIONID                      NUMBER                  IN
FUNCTION GET_SESSION_TIMEOUT RETURNS NUMBER
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 SESSIONID                      NUMBER                  IN
FUNCTION GET_TIMEOUT RETURNS NUMBER
PROCEDURE SET_SESSION_TIMEOUT
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 SESSIONID                      NUMBER                  IN
 TIMEOUT                        NUMBER                  IN
PROCEDURE SET_TIMEOUT
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TIMEOUT                        NUMBER                  IN
FUNCTION SPACE_ERROR_INFO RETURNS BOOLEAN
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 ERROR_TYPE                     VARCHAR2                OUT
 OBJECT_TYPE                    VARCHAR2                OUT
 OBJECT_OWNER                   VARCHAR2                OUT
 TABLE_SPACE_NAME               VARCHAR2                OUT
 OBJECT_NAME                    VARCHAR2                OUT
 SUB_OBJECT_NAME                VARCHAR2                OUT

SQL>

This package adds functionality for writing custom processes in the AFTER SUSPEND trigger. The SPACE_ERROR_INFO function returns specifics about the table and tablespace affected by the space error. A series of checks could be coded enabling specific actions depending on which objects were affected. A suspended process can be ended prematurely with the ABORT procedure, or more time can be added using the SET_TIMEOUT procedure. I actually had one client explain how she had written an AFTER SUSPEND trigger that compiled information about the tablespace affected so that an "ALTER DATABASE... RESIZE..." command could be issued to add more space to the affected datafile. I didn't have the heart to tell her that she had basically written a feature that already existed in the database: AUTOEXTEND.

So what are the best practices to take away from this? Quite simply... all ETL mappings and flows, as well as database maintenance processes, should use Resumable Space Allocation, preferably using the NAME clause in conjunction with DBMS_APPLICATION_INFO. Setting a RESUMABLE_TIMEOUT value at the system level can be scary, because a single suspended transaction could cause locks that reverberate all the way through the system. But is this really a concern in a BI/DW environment? Are there any processes in our batch load window or with any of our operational maintenance processes that we wouldn't want to enable for resumable operations, no matter how many processes back up waiting for them to complete? It could spell bad news if we used any kind of synchronous replication technology to move data to the DW instance, but short of that, I can't think of any. Please let me know if you have alternative viewpoints.

I've never found much reason to use the AFTER SUSPEND trigger though. Data warehouses should have production-type monitoring running already, just like other production systems. OEM is more than satisfactory for basic monitoring and alerting, and with the Server-Generated Alerts introduced in 10g, forms a complete product for Oracle environments. But regardless of which monitoring solution is used, it should be able to issue simple queries against the database and alert based on the results of those queries. A select against the DBA_RESUMABLE table provides all the information required to send out an alert, and with features such as AUTOEXTEND, I just can't see a requirement for the ability to issue procedural code because a transaction is suspended.

UPDATE: as pointed out above, since suspended transactions do in fact show up in the alert log, this is good news for integrating Resumable Space Allocation into an existing environment. Assuming that there's proper alert log monitoring with paging functionality already in place, implementing resumable operations can simply use that infrastructure already in place.

Keep your eyes open for the next of the "Three R's" in BI/DW fault tolerance: Restarting.