Realtime BI Is Not Realtime DW

April 24th, 2010

This posting is slightly edited to clarify that the majority of the post is about using transactional or transactionally structured sources for reporting.

One of the points I make early on in my talk “Getting Real Data Warehouse Loading as it Happens” is that realtime business intelligence is not the same as realtime data warehousing; having one does not imply having the other.

People can, and do, point their BI reporting directly at the source transactional system or systems, and for these people there is no data warehouse involved. Conversely, others populate their data warehouse continuously through a realtime trickle-feed, since that is the only viable way to load their data, but choose to keep reporting aligned to a fixed point in the past (such as midnight yesterday) so that users do not see data changing throughout the day.

So if it is possible to do realtime BI against the source systems, why have a data warehouse at all? In my opinion the data warehouse is doing three things: it protects the transactional source from excessive, potentially sub-optimal queries; it acts as a platform to conform data from multiple source systems; and it allows the use of slowly changing dimensions.
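
As a concrete illustration of that last point, a type 2 slowly changing dimension keeps history by versioning rows rather than overwriting them. A minimal sketch in Oracle SQL (table and column names invented for illustration):

    -- Each change to a customer closes the current row and inserts a
    -- new version, so facts join to the attribute values that were
    -- true at the time of the transaction. (Illustrative names only.)
    CREATE TABLE customer_dim (
      customer_key   NUMBER PRIMARY KEY,   -- surrogate key
      customer_id    NUMBER,               -- natural key from the source
      customer_name  VARCHAR2(100),
      region         VARCHAR2(30),
      effective_from DATE,
      effective_to   DATE,                 -- NULL for the current version
      current_flag   CHAR(1)
    );

    -- When customer 42 moves region: expire the old row, add a new one
    UPDATE customer_dim
       SET effective_to = SYSDATE, current_flag = 'N'
     WHERE customer_id = 42 AND current_flag = 'Y';

    INSERT INTO customer_dim
    VALUES (1001, 42, 'A Customer Ltd', 'North', SYSDATE, NULL, 'Y');

A transactional source, by contrast, simply overwrites the region column and the history is lost.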

Assuming we need realtime BI but do not want a realtime data warehouse, we will need to mix transactional and reporting workloads on the same system. Although it could be argued that a system such as Oracle’s Exadata V2 allows mixed reporting/transactional workloads, few people have implemented Exadata yet, and thus for many people this means mixing two distinctly different workloads on a platform designed for only one of them. DML operations designed to insert or update a single row as quickly as possible are mixed with queries that access a great number of rows based on combinations of predicates that may well not be indexed on the OLTP system. Furthermore, some of the database features that help performance in a data warehouse, such as bitmap indexes and parallel query, can actually inhibit the performance of transactional systems. The fundamental problem with mixing workloads is that if a resource is busy doing one thing it is not available to do something else, and performance suffers for all users.
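
To make the mismatch concrete, here are the two kinds of statement that end up competing for the same resources (a sketch; table and column names invented for illustration):

    -- OLTP: touch exactly one row, via its unique key, as fast as possible
    UPDATE orders
       SET status = 'SHIPPED'
     WHERE order_id = 123456;

    -- BI: scan and aggregate a large slice of the same table, on a
    -- combination of predicates the OLTP schema probably does not index
    SELECT region, product_class, SUM(order_value)
      FROM orders
     WHERE order_date >= DATE '2010-01-01'
       AND channel = 'WEB'
     GROUP BY region, product_class;

The first wants short, indexed access paths; the second wants big scans, and each gets in the other’s way.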

There are some things we can do to reduce the impact of BI queries on the source. We can use data federation in a reporting application such as OBIEE: restrict the transactional source to current data and use a separate data warehouse, perhaps with pre-aggregated data, to supply the historic information. That is, we remove the need to repeatedly scan the whole of the transactional system for historic data that is no longer changing, and by reducing the volume of data we access we lessen the conflict with transactional activity. Of course we will still need a process that updates the data warehouse, but that can now be deferred to a later time. However, federation has its own drawbacks: where we once had one query we now have two, and the results of these two queries need to be stitched together; that process happens on the BI server, where it consumes memory and processing resource to join the result sets.

Another thing we could do is use a replica database as the source for reporting, reducing the impact on the transactional source (but not fully removing it, as any form of replication involves some process executing on or against the source). We could use a standby instance of the OLTP system for reporting, but we need to consider the licensing requirements of the replica, its location (it may be at a different site) and what happens if we need to use the standby system for its real purpose, business continuity – do we lose BI when that happens? Perhaps a better approach is to build a dedicated reporting replica. This replica can be the right size and have the appropriate product licensing for a reporting load, and it does not have to be a complete replica of the source system, just the tables of interest; if we have no reporting interest in audit and journal tables then we need not replicate them.

In fact we can get quite sophisticated with a reporting replica, changing the database so that it is better geared for reporting; this might include dropping unhelpful indexes and partitioning the data to allow partition pruning to occur. Adding bitmap indexes may be a step too far, though: since we are essentially receiving row-by-row updates and inserts, the locking that occurs as the bitmaps change (actually multi-row locking) lowers throughput.
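
For example, on the replica we might range-partition a replicated table by date so that queries over recent data prune away the old partitions. A sketch, using 11g interval partitioning (names invented for illustration):

    -- On the reporting replica only: the table is partitioned by day,
    -- so a date predicate lets the optimizer skip old partitions
    CREATE TABLE orders_rep (
      order_id    NUMBER,
      order_date  DATE,
      region      VARCHAR2(30),
      order_value NUMBER
    )
    PARTITION BY RANGE (order_date)
    INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
    (PARTITION p_initial VALUES LESS THAN (DATE '2010-01-01'));

    -- Partition pruning: only the last seven daily partitions are read
    SELECT region, SUM(order_value)
      FROM orders_rep
     WHERE order_date >= TRUNC(SYSDATE) - 7
     GROUP BY region;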

We can even combine the federated approach with a database replica, using the replica as the source for recent transactions and the data warehouse for the historic data; and if we choose to implement our replica on a fast database, perhaps an in-memory one, we can get excellent performance with low impact on the transactional source.
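
To make that concrete, here is a sketch of the two queries such a federated setup might issue (table names invented for illustration):

    -- Sent to the reporting replica: today's movements only
    SELECT region, SUM(order_value) AS order_value
      FROM orders
     WHERE order_date >= TRUNC(SYSDATE)
     GROUP BY region;

    -- Sent to the data warehouse: pre-aggregated history up to yesterday
    SELECT region, SUM(order_value) AS order_value
      FROM sales_day_agg
     WHERE order_date < TRUNC(SYSDATE)
     GROUP BY region;

The BI server stitches the two result sets together in memory; neither database ever sees the combined query.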

Comments

  1. Fahd Mirza Says:

    Scott, would you please look at the following point and comment:

    Almost all the time, fact tables in a data warehouse should be partitioned. If the data is loaded with respect to time then the partition key should be the date column. All the foreign keys in the fact table should have bitmap indexes on them. Local prefixed indexes, with the partition key at the leading edge, should be created on the partitions, and these indexes should be B-tree. Stats should be gathered after every data load, with the estimate percentage set to DBMS_STATS.AUTO_SAMPLE_SIZE.
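
    For example, something like this (all names are just for illustration):

        -- Range-partition the fact table on the load date
        CREATE TABLE sales_fact (
          sale_date    DATE,
          customer_key NUMBER,
          product_key  NUMBER,
          amount       NUMBER
        )
        PARTITION BY RANGE (sale_date)
        (PARTITION p2010q1 VALUES LESS THAN (DATE '2010-04-01'),
         PARTITION p2010q2 VALUES LESS THAN (DATE '2010-07-01'));

        -- Local bitmap indexes on the foreign key columns
        CREATE BITMAP INDEX sales_fact_cust_bx ON sales_fact (customer_key) LOCAL;
        CREATE BITMAP INDEX sales_fact_prod_bx ON sales_fact (product_key) LOCAL;

        -- Local prefixed B-tree index: partition key on the leading edge
        CREATE INDEX sales_fact_dt_ix ON sales_fact (sale_date, customer_key) LOCAL;

        -- Gather stats after each load
        BEGIN
          DBMS_STATS.GATHER_TABLE_STATS(
            ownname          => USER,
            tabname          => 'SALES_FACT',
            estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
        END;
        /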

    Thanks and regards

  2. Peter Scott Says:

    @Fahd Mirza

    Your quote is not directly relevant to this blog posting – here I am talking about using a transactional source for reporting and not using a data warehouse. But I do see that the third paragraph can sound confusing. I will edit this to clarify it.

    For a data warehouse, I do agree with some of the points you raise – for Oracle this requires Enterprise Edition database licensing plus the additional-cost Partitioning option.

    For a realtime data warehouse there are some challenges to using bitmap indexes on trickle-fed fact data, but these can be overcome by good design, and here partitioning is a very important part. With partitioned tables, all bitmap indexes are local bitmap indexes. I am not sure that I understand the comment about the B-tree indexes with a leading column of the partition key.

    One area of difficulty with very large databases is the overhead of statistics collection at the end of the batch and the time available to refresh them. I suggest you look at the excellent series on partition statistics by Doug Burns.

    Peter

  3. Fahd Mirza Says:

    Thank you for your reply. Can you please direct me to any blog posts or articles regarding indexing in a data warehouse? Or, better, write a blog post on best practices for indexing the dimensions and facts in a data warehouse?

    Thanks and regards

  4. Peter Scott Says:

    @Fahd Mirza
    Two Oracle manuals you might want to look at are the Oracle Data Warehousing Guide and the Oracle Very Large Database Guide. The VLDB Guide was first released as part of the 11gR1 documentation; the DW Guide has been out since (at least) 9i.

    Indexing suggestions for Exadata could be very different to those for more traditionally configured Oracle databases.

  5. Kuldeep Says:

    But how can you separate them? A realtime DW is a basic need of realtime BI... or am I getting it wrong?

  6. Peter Scott Says:

    @Kuldeep – realtime BI can be run using the transactional system as the direct BI source; in this case we do not have a data warehouse.

    In my opinion it is always best to have a data warehouse (but that’s because I like building them :-) )

  7. Matt Hosking Says:

    Peter Scott Says: “Indexing suggestions for Exadata could be very different to those for more traditionally configured Oracle databases.”

    Is that pertinent to both V1 and V2, Peter? I am thinking you may have been mostly considering V2. We have Exadata V1 here (running 11.2) and are using all the “usual suspects” for indexing/partitioning etc.

    Cheers,
    Matt.

  8. Peter Scott Says:

    @Matt
    Well, if you upgraded to 11.2 you have V2 (but without the physical aspects such as flash and a Sun label on the box); the software features are the same.

    We did a retail implementation on V1 for a customer – the advice we had from Oracle was that indexes were not needed (unless testing showed they helped!). We should aim to use as much of the Exadata system as possible to keep things balanced – so partitioning and parallelism, which keep things distributed, really help maximise performance.

    In practice we needed to use some indexes but not as many as a traditional DW.
    V2 of Exadata can off-load more types of activity to the storage cells, and perhaps indexes have more uses there.

    Of course, being on 11g you can always test whether an index is helping – just make the index “invisible”, run the query again, and check the plan and the response time.
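
    For example (index name and query invented for illustration):

        -- Hide the index from the optimizer without dropping it
        ALTER INDEX sales_fact_dt_ix INVISIBLE;

        -- Re-run the query and compare the plan and elapsed time
        SELECT COUNT(*)
          FROM sales_fact
         WHERE sale_date > DATE '2010-01-01';

        SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);

        -- Put it back if it turns out to be helping
        ALTER INDEX sales_fact_dt_ix VISIBLE;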

  9. Matt Hosking Says:

    @Peter..

    Thanks for the useful comments :-)

    “Well, if you upgraded to 11.2 you have V2 (but without the physical aspects such as flash and a Sun label on the box); the software features are the same.”

    ...well, other than the O/S, of course – it would have been great to have Solaris like the rest of the estate here rather than a “rogue” Linux box.

    Still, once it was rebuilt by Oracle to 11.2 a couple of months back, with some further hardware replacements, it has cooled down a lot, is much more stable and hasn’t crashed since (on 11.1 it completely crashed four times).

    Cheers,
    Matt.
