Realtime BI Is Not Realtime DW

April 24, 2010 BI (General), Data Warehousing

This posting is slightly edited to clarify that the majority of  the post is about using transactional or transactionally structured sources for reporting.

One of the points I make early on in my talk “Getting Real Data Warehouse Loading as it Happens” is that realtime business intelligence is not the same as realtime data warehousing; having one does not imply having the other.

People can, and do, direct their BI reporting at the source transactional system or systems and for these people there is no data warehouse involved; conversely others populate their data warehouse continuously through a realtime trickle-feed since that is the only viable way to load their data but choose to keep reporting aligned to an event in the past (such as midnight yesterday) so that users do not see changing data throughout the day.

So if it is possible to do realtime BI against the source systems why have a data warehouse? In my opinion the data warehouse is doing three things: it is protecting the transactional source from excessive, potentially sub-optimal queries; it is acting as a platform to conform data from multiple source systems and it allows the use of slowly changing dimensions.

Assuming we need realtime BI but do not want a realtime data warehouse we will need to mix transactional and reporting functionalities on the same system. Although it could be argued that a system such as Oracle’s Exadata V2 allows mixed reporting / transactional workloads few people have implemented Exadata yet and thus, for many people it means that they are mixing two distinctly different workloads on a platform designed for one workload. DML operations that are designed to insert or update a single row as quickly as possible are mixed with queries that access a great number of rows based on a combination of predicates that may well not be indexed on the OLTP system. Furthermore, some of the database features that assist performance in a data warehouse such as bitmap indexes and parallel query can actually inhibit performance of transactional systems. The fundamental problem of mixing workload is that if resource is busy doing one thing it is then not available to do something else and performance for all users suffers.

There are some things that we can do to reduce the BI query impact on the source. We can use data federation in a reporting application such as OBIEE and restrict the transactional source to current data and use a separate data warehouse, perhaps with pre-aggregated data to supply historic information; that is, we remove the need to repeatedly scan the whole of the transactional system to provide historic data that is no longer changing and by reducing the volume of data we are accessing we lessen the conflict with transactional activity. Of course we will still need a process that updates the data warehouse, but that can now be deferred to a later time. However, federation has its own drawbacks; where we once had one query we now have two and the results of these two queries need to be stitched together and that process has to happen on the BI server, where it consumes memory and processing resource to join the result sets. Another thing we could do is to use a replica database to be the source of reporting, thus reducing the impact on the transactional source (but not fully removing it as any from of replication has some impact as it will involve some process executing on or against the source). We could use a standby instance of the OLTP system for reporting, but we need to consider the licensing requirements of the replica, the location of the replica (it may be at a different site) and what happens if we need to use the standby system for its real purpose, business continuity – do we lose BI when that happens? Perhaps a better approach is to build a specific reporting replica. This replica can be the right size and have the appropriate product licensing for a reporting load. It does not have to be complete replica of the source system, just the tables of interest; if we have no reporting interest in audit and journal tables then we need not replicate them.

In fact we can get quite sophisticated with a reporting replica in that we change the database so that it is better geared for reporting; this might include dropping unhelpful indexes and possibly partitioning the data to allow partition pruning to occur. Adding bitmap indexes may be a step too far though as we are essentially receiving row by row updates and inserts and that results in lower throughput caused by table locking (actually multi-row locking) as the bitmaps change.

We can even combine the the federated approach with a database replica and use the replica as source for the recent transactions and the data warehouse for the historic data; and if we chose to implement our replica on a fast database, perhaps an in-memory one, we could get excellent performance with low impact on the transactional source.