Realtime Data Warehouse Challenges - Part 1

In a previous part of my notes on Realtime Data Warehousing I mentioned some of the challenges of reducing latency. The piece picked up quite a few comments - to which I say thanks to all that posted responses. One of the comments from Matt Hosking mentioned some of the points I was to raise in this posting.

If you ask someone on the outside of developing a (near) realtime data warehouse what the greatest challenge will be they probably would say "capturing the change" since they know we can already "do" data warehouses. I think that is wrong, capturing change is easy; the big problem is applying that change in a timely fashion to a data warehouse that also remains available for query. Adding relatively few rows of new fact to a table is trivial compared to the actions needed to validate, transform, apply keys, index, and publish the fact; and then think about the impact of merging that new fact into existing aggregate tables or materialized views. A lot of moving parts, a lot of challenge.

Realistically, we could populate an "atomic data store style" layer in realtime with what is in effect a versioned (timestamped, journalized or however you term it) replica of the source, a replica which is probably suited for realtime reporting but what we don't get are the features of a data warehouse that we come to expect in a traditional star schema DW. We possibly miss out on: data validation through the ETL process, data enrichment and derived measures, conformed dimensions, slowly changing dimensions (especially type 2 SCD) through surrogate keys. It may well be that you don't actually need a star model, after all one of the viable DW models for an Exadata warehouse is just that; a bunch of conventional tables joined on the natural business keys.

Another point to consider is that it is quite unlikely that all of the fact domains in a data warehouse need to be realtime ones; for example data sourced from a supplier's EDI feed may arrive far less frequently than, say, sales transactions from the company's web-store. Obviously, if we have realtime feed of sales, we must ensure we have all of the dimensional (reference) data loaded before a new transaction arrives, or else develop robust ways to handle this. This is a situation where we need business knowledge; if a new customer can be created at time of purchase (as often is the case for a web sale) we will need a realtime customer feed along with the realtime sales feed, but for banks with strict money laundering regulations customers are registered way before transactions occur, so a timely load of customer is likely to be sufficient.

Not only is it unlikely that all data feeds to a data warehouse need be realtime, it quite likely for some "facts" that only some measures are realtime measures. Consider sales: we know the quantity and the price charged to the customer at the time of the sale, but we may well not know the cost of goods until the time the order is fulfilled.