Realtime Data Warehouse Loading

May 6th, 2010 by Peter Scott

Last time I wrote about the use of replication to provide a source for realtime BI. This time I am going to look at putting data in to a data warehouse in “realtime”. Note the quotes, when we speak of realtime there is always some degree of latency: the time from the source transaction to its committal on source database, the time to notice this change, the time to propagate it to the target database. And then there is the time to process the newly arrived change: loading, validating, transforming, publishing, and perhaps aggregating; oh, and then the time to actually query the data and to react to the query. We can strive to reduce this time, we can never totally eradicate it as the laws of physics are not on our side; but latency reduction comes at a cost and it is going to be a business call between the value of knowing something promptly and the cost of knowing it.

Non-realtime data warehouses often use a periodic batch data load paradigm; once a month, a week, a day, or whenever, we execute a batch process that extracts data from the source, identifies what has changed and applies that do the data warehouse. But what if we modify this batch to run much more frequently, say half-hourly. We are moving towards realtime. Loading data more frequently should reduce the volume of data in the individual loads, and less data should equate to a reduced batch times (but this is not likely to be a linear reduction in batch duration). But we are imposing some new challenges on the source extract – we will have frequently running queries that access multiple source rows, this will have an impact on the source performance – we will probably need to modify the extract code to robustly identify the contents of the extract windows, that is we must not miss or duplicate data. In addition we are imposing an impact on the target, we need a good method to publish the received data to the data warehouse without adversely affecting the query workload on the data warehouse and we must ensure that our micro-batch load-to-publish time is substantially shorter than the interval between the micro-batches.

However say “realtime data warehousing” to most people and they think of continuous data capture, possibly through a extract built on streams or SOA messaging from the source, but more than likely through synchronous or asynchronous change data capture using database triggers or redo logs. Again there is going to be some degree of latency between event and the data being replicated to the target. But now we have a design choice on the target. Do we consume the captured changes as continuous trickle-feed process? Or do we run a series of micro-batches to consume the data? By necessity true trickle-feed will move us into row by row processing and possibly significant impact on processes that need to aggregate data. I feel that most continuously-fed data warehouses will use some micro-batch for the majority of the DW transform and publish process, even if trickle-feed processing is used to populate an non-aggregated ODS style layer for the special cases when people need to see “now data”

As I mentioned in a previous blog, captured commit-based change can generate a lot of “noise”; commits associated with no data change, multiple updates to the same row, changes to columns of no interest to the DW system. How we choose to handle this in our load procedures will depend on what the business needs to see: final status for a row within the batch, treating clusters of row changes within a short period (such as a few seconds) as single change, or applying all changes (in the order they happened)

Comments

  1. Bruno Says:

    Good post! Ir would be nice to see another follow up regarding the products available that allow real-time (trickle-feed) and how they work. Thanks!

  2. rayc Says:

    Solid article. Thanks

  3. Nitin Aggarwal Says:

    To me, real time warehousing is far from achievable. Real time warehousing is not just about passing on a committed transaction in source OLTP system to target warehouse. In many cases, Facts in warehouse being loaded require multiple source tables, which could run in million of rows, to joined together to get the required extract. Any single row change to one source table would necessite the extraction query to be executed taking incremental record from one table and joining it to the other tables in full. This approach is not scalable when we talk about extracting every single change, each time, as it happens. At times, daily incremental data capturing and warehousing do struggle to get itself completed in stipulated 24 hrs window, when it is about time to execute the next incremental run. Also, during the load, Datawarehouse is mostly unusable for dashboard reporting because of lot of indexes which needs to be dropped to speed up the inserts/loads.

  4. Emil Says:

    Very interesting topic … however talking about real time data warehouse , I would rather use the wording “real time applications integration reporting” and by stepping on that notion I would build my architecture…Such systems should be design at the beginning to process large amount of data … subsequent adding such (data warehouse) capabilities will always be near to real time and the source system/systems will look like patched one …

  5. Peter Scott Says:

    @Nitin
    Indeed, RT DW is not just about loading a single fact table in realtime, one of the next blogs in this series of mine will talk about just that; here we are not just thinking of facts but dimensional data too.
    Remember too, the key point that I made – true realtime can not be achieved – we can get close though – it depends on the compromises that we need to adopt.

    Even without realtime DW I am seeing many businesses that require 24×7 availability of reporting – global banks reporting against a single DW for example and for that we need to be able use techniques that allow ETL to complete without detriment to reporting. High-availability DW is another topic to follow later this year.

    But if you have an incremental load that takes 24 hours to complete, I feel it needs to be investigated as there could be insufficient hardware capacity or there may be scope to optimise the data load and publish mechanism.

  6. Matt Hosking Says:

    Hi, An intersting read!

    As per a recent reply of mine in a different thread I think this is all fine and dandy when you have a simple (potentially single) source to warehouse relationship but to keep a whole business value chain updated e.g. account for stock movements against an intraday sales movement, is another matter entirely.

    Real “business intelligence” for me comes from being able to amalgamate, sort, translate and present data from across the whole enterprise to be able to make fully quantified decisions and this ultimately means several source systems all with their own interdependencies that need understanding.

    I do not see solution for “realtime/neartime” DW in a BI environment without first de-constructing the relationships, services and links between an organisations many true sources (product masters, stock masters, customer masters, EPOS systems, fulfillment/distribution systems etc) and then attempting to create a load/publish mechanism based on an understanding of all (or at least most) of it.

    The thought occurs that if you just have a single source (eg. EPOS system) that a buisness wants reporting from then is there a requirement for a DW anyway – you could sit OBI EE on top of a log copy of the EPOS system and do away with any of the problems of the ETL batch etc !!? :)

    Cheers,
    Matt.

  7. Peter Scott Says:

    @Matt
    If you just have a single fact source such as EPOS I would be tempted to look at using a replica (perhaps differently indexed) as a source.

    As you say, real-world BI uses disparate sources. Often these have differing periodicities. To use this in a real-time manner means you need to think through the whole architecture – a point Emil mentioned above.

    And, yes, calculating stock levels based on sales is challenging domain, even without realtime! – so much muddies the waters from shrinkage and wastage to alternate suppliers directly fulfilling orders without using your own stock. For some businesses the best place to get stock numbers is a direct feed from the warehouse management system and not attempt to derive it from the basis of what you sell.

    If real-time was was real-easy, more people would do it :-)

  8. Matt Hosking Says:

    @Peter, well yes, I think you just echoed my post. :-)

    ..and just to be clear I don’t think there is any reliable way of getting stock data other than from the Warehouse system – deriving it from sales would be bonkers as you wouldn’t account for any of usual the non-sales related movements within any warehouse ecosystem.

    For me your article just poses more questions in my own mind about the practical constraints of streaming of data to a warehouse. One of the main thoughts I have is the amount of data checking/cleansing and quality work that goes on in staging prior to loading any “core” architecture. We check every sale of products against our product master list, check every new part coming in has a certain set of fields populated (ACP, CRP etc), check every budget entry in a budget dataset has a corresponding location in our location tables etc, check every customer entry on an order against our customer source, not to mention a checksum of every non transactional row of data to see if it has been imported before and a range of calculations to update for example AOV against customers, ARP against products etc.

    In trying to orchestrate this little lot to run in the right order and have updated dimension data to match fact data etc I think it would be a nightmare in any level of CDC/near time environment. Once you spin in a bit of network latency or low packet priority in data from one of many contributing systems this could all end up in a heap with a rapidly filling quarantine bucket !

    As you say, if it was real-easy, more people would do it.

  9. Peter Scott Says:

    @Matt – have you been looking at the draft of my next blog on realtime :-) ?
    In the blog I write about challenges such as data quality, slowly changing dimensions and operational timing.

  10. Matt Hosking Says:

    I shall look forward to reading that then.

    Sadly I think this whole line of exploration will end up as “too difficult for now” albeit it is an interesting academic debate. IMO the moment you have a distributed (and often heterogeneous) multi-source system setup it all just gets way too complicated in the “real” corporate world, not only technically but also from the user perspective in understanding what is going on.

  11. Sangram Aglave Says:

    Nice article. What if the target is a columnar database,, do you think the updates can travel faster than row oriented databases

  12. John Lamont Watson Says:

    Just on the topic of 24/7 availability … I have resolved this problem for a recent client. We created two databases for our warehouse, on both an incremental update would kick off every hour, but they were staggered by 30 minutes. And we programmed the RPD to use the source with the most up to refresh time.

  13. chandra sekhar pathivada Says:

    Its not realistic to implement a real time data warehouse . but this might be applicable for the projects with limited scope. if we are focussing on realtime data from one mart with in a datawarehouse then this might be achievable with a little latency. in case of sql server this can be achiveable through a single process flow in ssis package with rebuilding the updated cube at the last step.
    one of the major challenge i could see here is processing the cube , as there is no 100%gurantee that the cube can be processed with out any failure during every execution , but with some tweeks and depends on the architecure this might be doable.

  14. Peter Scott Says:

    @chandra – I think that even in a “real time” data warehouse- a term that is somewhat fanciful anyway (see other posts and the recent talks posted on our Articles tab) – there is always a significant amount of information that is not and need not be be realtime.

    Loading data as it happens is not the big problem – as you mention, it is the aggregations that are the problem be it OLAP cubes or summary tables.

Write a comment





Website Design & Build: tymedia.co.uk