Data Warehouse Fault Tolerance Part 1: Resuming

February 8th, 2010 by Stewart Bryson

In the introduction to this series of posts, I spoke briefly about data warehouse fault tolerance and the unique challenges resulting from high data volumes combined with the batch load window required to create them. I then defined the goal: a layered approach allowing simple errors to be caught early before they turn into serious conditions.

Resuming is the ability to continue effortlessly after an error. The important thing is that there should be no aftermath from the error: our process should pause gracefully until the error is corrected. The Oracle Database has offered out-of-the-box functionality for resuming since version 9i in the form of Resumable Space Allocation. Resumable operations are supported for SELECT queries, DML and DDL, and can be enabled at either the system or the session level. To enable them at the system level, the RESUMABLE_TIMEOUT database parameter should have a non-zero value.
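
For example, the following would let any statement on the instance suspend for up to an hour before erroring out (a minimal sketch; the 3600-second value is only illustrative):

ALTER SYSTEM SET resumable_timeout = 3600;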

To enable resumable operations at the session level, the statement follows this basic syntax, with the TIMEOUT and NAME clauses being optional:

ALTER SESSION ENABLE RESUMABLE [TIMEOUT n] [NAME 'string'];

The TIMEOUT value is specified in seconds, and if omitted, the default value of 7200 (2 hours) is used. The NAME clause gives the resumable session a user-friendly name so that, when we monitor for resumable sessions (as we will see later), we can tell which of our processes is suspended. Enabling resumable operations at the session level requires that the RESUMABLE system privilege has been granted:
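
For example (DWLOAD is a hypothetical load schema, standing in for whatever account runs the ETL):

GRANT RESUMABLE TO dwload;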

Resumable operations can also be enabled with the Oracle utilities such as SQL*Loader, Export/Import and Data Pump. The command-line parameters RESUMABLE, RESUMABLE_NAME and RESUMABLE_TIMEOUT mimic the functionality described above.
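
A sketch of what that looks like for SQL*Loader (the login, control file and names are purely illustrative):

sqlldr userid=dwload control=load_sales.ctl resumable=true resumable_name=sales_load resumable_timeout=7200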

Now for a demonstration. I’ll create a situation that is ripe for a space allocation error: I’ll put an empty copy of the SALES fact table from the SH schema in a tablespace with only 250K of space:
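
Something along these lines sets up the demonstration (the datafile path is illustrative, and a TARGET schema is assumed to own the table):

CREATE TABLESPACE target DATAFILE '/u01/oradata/orcl/target01.dbf' SIZE 250K;

CREATE TABLE target.sales
TABLESPACE target
AS SELECT * FROM sh.sales WHERE 1 = 0;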

Now I'll load some records into the table, which should cause the statement to suspend. To prepare my session, I need to enable resumable operations. Since I always instrument my code, I'll register my process with the database. After that, I have an easy way to guarantee consistency when referring to processes. Now, I can use the registered name for my resumable session as well:
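
A minimal sketch, with the module and resumable names being illustrative:

BEGIN
   -- register the process with the database for instrumentation
   dbms_application_info.set_module( module_name => 'load_sales'
                                   , action_name => 'insert target.sales' );
END;
/

ALTER SESSION ENABLE RESUMABLE NAME 'load_sales';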

I start loading the records in hopes of a suspended session:
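
A simple INSERT ... SELECT is more than enough to exhaust a 250K tablespace:

INSERT INTO target.sales
SELECT * FROM sh.sales;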

So now, I open up another session, and I start another transaction against the TARGET.SALES table, just to pile on the TARGET tablespace:
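
In the second session, something along these lines (again, the resumable name is illustrative):

ALTER SESSION ENABLE RESUMABLE NAME 'load_sales_2';

INSERT INTO target.sales
SELECT * FROM sh.sales WHERE rownum <= 10000;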

I’ll have a look in the DBA_RESUMABLE view (there is also a USER_RESUMABLE version) for my suspended sessions. Even though I could get all the following information with a single SQL statement, I broke it up for better visibility on the blog:
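
A query along these lines shows the name, status and error message for each resumable statement:

SELECT name, status, timeout, error_msg
  FROM dba_resumable;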

The Oracle Database also publishes server alerts concerning suspended transactions using the Server-Generated Alerts infrastructure. This infrastructure uses the AWR toolset, the server package DBMS_SERVER_ALERT for getting and setting metric thresholds, and the queue table ALERT_QUE to hold alerts that have been published from AWR. Custom processes could be written to mine ALERT_QUE for these alerts, but the easiest way to configure and view server alerts is using Oracle Enterprise Manager (OEM). On the Alerts section of the main OEM page, we can see three different alerts generated by the Oracle Database:

[Image: all alerts.png]

If we click on the “Session Suspended” link, we can see the multiple alerts generated in this category:

[Image: suspend alerts.png]

Another alert generated indirectly by the suspended transaction is the “Configuration” class event caused by our session “waiting” to proceed. The Oracle wait event interface can show us information about the suspend waits on the system:
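
Suspended sessions wait on the "statement suspended, wait error to be cleared" event, so a query like this surfaces them:

SELECT sid, event, seconds_in_wait
  FROM v$session_wait
 WHERE event = 'statement suspended, wait error to be cleared';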

To resolve the space issue, I'll enable autoextend on the datafile for the TARGET tablespace. Then, I'll take a look and see if anything has changed:
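
Something like the following (the datafile path matches the illustrative one above):

ALTER DATABASE DATAFILE '/u01/oradata/orcl/target01.dbf'
   AUTOEXTEND ON NEXT 10M MAXSIZE UNLIMITED;

SELECT name, status, suspend_time, resume_time
  FROM dba_resumable;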

The Resumable Space Allocation feature includes the AFTER SUSPEND trigger, which allows the specification of a system-wide trigger that fires whenever a transaction is suspended. The typical use for this functionality is alerting, since suspended operations don't write anything to the alert log.

UPDATE: I made a mistake here… suspended transactions do in fact cause entries in the alert log, and so does the RESUME process detailed below.

There are some features in the DBMS_RESUMABLE package that may make sense when writing an AFTER SUSPEND trigger:
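
The package's interface, paraphrased and abridged:

FUNCTION  space_error_info( error_type       OUT VARCHAR2
                          , object_type      OUT VARCHAR2
                          , object_owner     OUT VARCHAR2
                          , table_space_name OUT VARCHAR2
                          , object_name      OUT VARCHAR2
                          , sub_object_name  OUT VARCHAR2 ) RETURN BOOLEAN;
PROCEDURE abort( sessionid IN NUMBER );
PROCEDURE set_timeout( timeout IN NUMBER );
PROCEDURE set_session_timeout( sessionid IN NUMBER, timeout IN NUMBER );
FUNCTION  get_timeout RETURN NUMBER;
FUNCTION  get_session_timeout( sessionid IN NUMBER ) RETURN NUMBER;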

This package adds functionality for writing custom processes in the AFTER SUSPEND trigger. The SPACE_ERROR_INFO function returns specifics about the table and tablespace affected by the space error. A series of checks could be coded to enable specific actions depending on which objects were affected. A suspended process can be ended prematurely with the ABORT procedure, or more time can be added using the SET_TIMEOUT procedure. I actually had one client explain how she had written an AFTER SUSPEND trigger that compiled information about the tablespace affected so that an "ALTER DATABASE… RESIZE…" command could be issued to add more space to the affected datafile. I didn't have the heart to tell her that she had basically written a feature that already existed in the database: AUTOEXTEND.
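
A minimal sketch of such a trigger, using SPACE_ERROR_INFO and SET_TIMEOUT (the trigger name and the tablespace check are illustrative):

CREATE OR REPLACE TRIGGER resumable_notify
   AFTER SUSPEND ON DATABASE
DECLARE
   l_error_type  VARCHAR2(64);
   l_object_type VARCHAR2(64);
   l_owner       VARCHAR2(128);
   l_tablespace  VARCHAR2(128);
   l_object      VARCHAR2(128);
   l_subobject   VARCHAR2(128);
BEGIN
   IF dbms_resumable.space_error_info( l_error_type, l_object_type, l_owner,
                                       l_tablespace, l_object, l_subobject )
   THEN
      -- for example: give statements suspended on the TARGET tablespace extra time to be fixed
      IF l_tablespace = 'TARGET' THEN
         dbms_resumable.set_timeout( 14400 );
      END IF;
   END IF;
END;
/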

So what are the best practices to take away from this? Quite simply… all ETL mappings and flows, as well as database maintenance processes, should use Resumable Space Allocation, preferably using the NAME clause in conjunction with DBMS_APPLICATION_INFO. Setting a RESUMABLE_TIMEOUT value at the system level can be scary, because a single suspended transaction could cause locks that reverberate all the way through the system. But is this really a concern in a BI/DW environment? Are there any processes in our batch load window, or among our operational maintenance processes, that we wouldn't want to enable for resumable operations, no matter how many processes back up waiting for them to complete? It could spell bad news if we used any kind of synchronous replication technology to move data to the DW instance, but short of that, I can't think of any. Please let me know if you have alternative viewpoints.

I've never found much reason to use the AFTER SUSPEND trigger though. Data warehouses should have production-type monitoring running already, just like other production systems. OEM is more than satisfactory for basic monitoring and alerting, and with the Server-Generated Alerts introduced in 10g, forms a complete product for Oracle environments. But regardless of which monitoring solution is used, it should be able to issue simple queries against the database and alert based on the results of those queries. A select against the DBA_RESUMABLE view provides all the information required to send out an alert, and with features such as AUTOEXTEND, I just can't see a requirement for running procedural code simply because a transaction is suspended.

UPDATE: as pointed out above, suspended transactions do in fact show up in the alert log, which is good news for integrating Resumable Space Allocation into an existing environment. Assuming that proper alert log monitoring with paging functionality is already in place, implementing resumable operations can simply leverage that existing infrastructure.

Keep your eyes open for the next of the “Three R’s” in BI/DW fault tolerance: Restarting.

Comments

  1. Uwe Hesse Says:

    Thank you for this very instructive article! Apart from the important technical information, I liked the anecdote about the "self-written AUTOEXTEND" :-) Let me just add one more piece of information: since 10g, if a session gets suspended by the resumable space feature, this leads not only to an entry in DBA_RESUMABLE but also to one in DBA_OUTSTANDING_ALERTS. It also leads to an entry in the alert log file. I am looking forward to your next announced articles in this interesting series!

  2. Stewart Bryson Says:

    @Uwe:

    Thanks for the tip on the alert log Uwe… I missed that one somehow. I've updated the blog posting based on your comments, so again, thanks for pointing this out, and of course, for reading. I haven't had time to look at the DBA_OUTSTANDING_ALERTS view… I assume that's part of the Server-Generated Alerts feature. I'll need to play around with that and see if it affects my approach or not, and what part it might play in a best practices approach.

  3. Niall Litchfield Says:

    Hi Stewart

    dba_outstanding_alerts is indeed a part of the server alerts infrastructure. It is the source for the list of server alerts on the em screenshot you show.

  4. Stewart Bryson Says:

    Thanks for the clarification Niall… and for reading.

  5. Blogroll Report 05/02/2009 – 12/02/2010 « Coskan’s Approach to Oracle Says:

    [...] Stewart Bryson-Data Warehouse Fault Tolerance Part 1: Resuming [...]
