Data Warehouse Fault Tolerance: An Introduction

With so much of the blog devoted to OBIEE, OWB and Essbase lately, I felt it was time for a few database-related postings. In the past, when I've posted database content to the blog, I've usually gravitated toward ETL-related features: those that straddle the line between database administration and ETL development. This time, though, I'm going to take a very different route and discuss data warehouse fault tolerance, in a series of postings about what it means to strive to be fault-free.

Fault tolerance isn't exactly disaster recovery... though there's a lot of overlap. Instead, fault tolerance is the ability to recover from errors, and those errors can result from hardware issues, software issues, general systems issues (network latency, out-of-space errors), and human mistakes. The main point is that BI/DW environments present unique challenges, both for operations and for the development team. I'm not proposing that the divide between transactional and reporting systems is necessarily vast... we still need redundant storage systems and dependable backup strategies. I am proposing, however, that one-size-fits-all approaches to fault tolerance are problematic, and that applying standards that evolved in support of transactional systems may not provide the best protection for BI/DW environments.

The operational teams (DBAs, Unix Admins, Storage Admins, etc.) and the development teams (source system extraction, ETL) have to work together more closely in a BI/DW environment than perhaps they do in OLTP environments. Of course, OLTP developers have to write scalable code... but I think that's within their control for the most part. ETL developers are pushing around millions or billions of rows of data, and because of this, everything needs to be well-oiled: undo space needs to be available, temp space needs to be plentiful, standard operational jobs such as backup and recovery or statistics gathering need to keep the batch load window in mind, and so on. Whereas OLTP code is almost exclusively DML... ETL code is packed full of DDL: partition-exchange loads, index and constraint maintenance, table truncates, the whole gamut.
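
To make that concrete, here's a rough sketch of what a partition-exchange load might look like. The table and partition names (SALES_FACT, SALES_STAGE, SALES_EXT, P_201101) are purely illustrative, but the pattern is the point: the "load" that makes the data visible is a piece of DDL, not a massive insert.

    -- Load the incoming data into a standalone staging table that has
    -- the same column structure as the target fact table.
    INSERT /*+ APPEND */ INTO sales_stage
    SELECT * FROM sales_ext;
    COMMIT;

    -- Build the index(es) that the target partition's local indexes expect.
    CREATE INDEX sales_stage_idx ON sales_stage (product_id);

    -- Swap the staging table into the fact table as a partition.
    -- This is a data dictionary operation, not a data movement.
    ALTER TABLE sales_fact
      EXCHANGE PARTITION p_201101 WITH TABLE sales_stage
      INCLUDING INDEXES WITHOUT VALIDATION;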

So when working with millions or billions of rows, we need to eliminate errors as best we can. Sounds simple enough, but the truth is that errors are going to happen, and there's nothing we can do to wipe them out completely. But we can mitigate. So we need to introduce a triage process: catching and correcting errors as early as possible so that their damage is minimal. In essence: don't let simple errors turn into weekend-long data correction issues, where millions of rows need to be updated or deleted. Let's work smarter, not harder, using every solution available to us, including features present in the Oracle Database, best practices in ETL development, and possible modifications to our backup and recovery strategies.
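
As a small taste of the kind of database feature I have in mind, Oracle's DML error logging lets a large insert keep running while rejected rows are diverted to an error table for later triage, rather than the whole statement failing and rolling back. The table names below (SALES_FACT, SALES_STAGE, ERR$_SALES_FACT) and the load tag are illustrative only.

    -- Create the error-logging table once; it defaults to ERR$_SALES_FACT.
    BEGIN
      DBMS_ERRLOG.CREATE_ERROR_LOG('SALES_FACT');
    END;
    /

    -- Bad rows (conversion errors, constraint violations, etc.) land in
    -- the error table instead of aborting and rolling back the load.
    INSERT INTO sales_fact
    SELECT * FROM sales_stage
    LOG ERRORS INTO err$_sales_fact ('nightly_load')
    REJECT LIMIT UNLIMITED;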

I should note that, when speaking of BI/DW environments, I still have the batch load paradigm squarely in mind. Although the line in the sand is steadily moving toward real time, most data warehouses are still loaded with either batch or mini-batch processing. However, being a fan of near-real-time data warehousing techniques (as my colleague Peter Scott has written from time to time... only reporting from the source system itself is truly real-time), I'll be sure to point out how some of these techniques differ the closer we get to the actual transaction.

I currently have three postings in mind that correspond to the Three R's of Data Warehouse Fault Tolerance. Be on the lookout for the first installment coming soon: Resuming.