Realtime Data Warehouse Challenges – Part 1

May 27th, 2010 by Peter Scott

In a previous part of my notes on Realtime Data Warehousing I mentioned some of the challenges of reducing latency. The piece picked up quite a few comments – to which I say thanks to all that posted responses. One of the comments from Matt Hosking mentioned some of the points I was to raise in this posting.

If you ask someone on the outside of developing a (near) realtime data warehouse what the greatest challenge will be they probably would say “capturing the change” since they know we can already “do” data warehouses. I think that is wrong, capturing change is easy; the big problem is applying that change in a timely fashion to a data warehouse that also remains available for query. Adding relatively few rows of new fact to a table is trivial compared to the actions needed to validate, transform, apply keys, index, and publish the fact; and then think about the impact of merging that new fact into existing aggregate tables or materialized views. A lot of moving parts, a lot of challenge.

Realistically, we could populate an “atomic data store style” layer in realtime with what is in effect a versioned (timestamped, journalized or however you term it) replica of the source, a replica which is probably suited for realtime reporting but what we don’t get are the features of a data warehouse that we come to expect in a traditional star schema DW. We possibly miss out on: data validation through the ETL process, data enrichment and derived measures, conformed dimensions, slowly changing dimensions (especially type 2 SCD) through surrogate keys. It may well be that you don’t actually need a star model, after all one of the viable DW models for an Exadata warehouse is just that; a bunch of conventional tables joined on the natural business keys.

Another point to consider is that it is quite unlikely that all of the fact domains in a data warehouse need to be realtime ones; for example data sourced from a supplier’s EDI feed may arrive far less frequently than, say, sales transactions from the company’s web-store. Obviously, if we have realtime feed of sales, we must ensure we have all of the dimensional (reference) data loaded before a new transaction arrives, or else develop robust ways to handle this. This is a situation where we need business knowledge; if a new customer can be created at time of purchase (as often is the case for a web sale) we will need a realtime customer feed along with the realtime sales feed, but for banks with strict money laundering regulations customers are registered way before transactions occur, so a timely load of customer is likely to be sufficient.

Not only is it unlikely that all data feeds to a data warehouse need be realtime, it quite likely for some “facts” that only some measures are realtime measures. Consider sales: we know the quantity and the price charged to the customer at the time of the sale, but we may well not know the cost of goods until the time the order is fulfilled.

Comments

  1. Matt Hosking Says:

    Hi Peter, thanks for the mention, I now see we were indeed closely aligned in our thinking.

    I wonder if Oracle’s purchase of Goldengate will attempt to tackle some of these issues as they was all about realtime/neartime data synchronization across heterogeneous systems.

    Best regards,
    Matt

  2. Matt Hosking Says:

    ..”they were” even

  3. Peter Scott Says:

    @Matt … again you preempt the next piece :-)

    GoldenGate is about replicating a table (or a subset of it) from one place to another. If we are just interested in current values – say the source fact never updates and we report on SCD-1 dimensions we could use that replica as our DW source. But if data changes over time (orders flow through a workflow, or we need to track SCD-2 changes) then we need to build some form of process to consume the changes that arrive in the order that they arrive – and here we have to duck out of pure Goldengate and build some ELT over it – ODI and OWB 11gR2 could do this through a ODI KM for journalized GoldenGate change capture – or we could code our own. More on this later :-)

  4. Abhishek Says:

    what I remember about GoldenGate ( shareplex in particular ) , it actually replicates the changes to the target system exactly in same order as it has occurred in source system.I would say thats a beauty of this product.Also even if the target system is down it keep on accepting the data from source system and would put it to Target system without any issue in the same order , when its up.One of the biggest challenge on Real Time Warehouse is – Designing a kind of Warehouse model where you can refresh your Dim / Fact tables without much dependency on other tables , The moment you are done with fetching the data from any of the Staging table/Dim Table it should be ready to get the next set of rows from OLTP and should not wait for any other tables to get loaded.

  5. Peter Scott Says:

    @Abhishek – isn’t Shareplex a Quest product?

    I agree that the beauty of replication is that the changes are applied in order, but if we need to preserve the intermediate versions still need to develop some ETL to apply those changes to the target.

  6. Matt Josking Says:

    Hi Peter,

    How is the goldengate article coming along? We are busy struggling here with taking mappings from OWB11r1 to 11r2 and them actually running so I could do with some light reading :-)

    Best regards,
    Matt.

  7. Peter Scott Says:

    Hi Matt
    The draft is done – I hope to get it published within the next week.

    Now, if you want help migrating OWB 11.1 to 11.2 we are the people to talk to…

Write a comment





Website Design & Build: tymedia.co.uk