Realtime Data Warehouse Loading

Last time I wrote about the use of replication to provide a source for realtime BI. This time I am going to look at putting data into a data warehouse in "realtime". Note the quotes: when we speak of realtime there is always some degree of latency. There is the time from the source transaction to its commit on the source database, the time to notice that change, and the time to propagate it to the target database. Then there is the time to process the newly arrived change: loading, validating, transforming, publishing, and perhaps aggregating; oh, and then the time to actually query the data and to react to the query. We can strive to reduce this time, but we can never totally eradicate it, as the laws of physics are not on our side. Latency reduction comes at a cost, and it is going to be a business call to weigh the value of knowing something promptly against the cost of knowing it.
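As a back-of-the-envelope illustration of where that latency goes, here is a trivial Python sketch that just sums some assumed stage times. Every number is made up; the point is simply that end-to-end latency is the sum of the stages, and the business can decide which stages are worth paying to shrink.

```python
# A minimal latency-budget sketch. The stages follow the paragraph above, but every
# number here is an illustrative assumption, not a measurement from a real system.
latency_seconds = {
    "commit on source database": 1,
    "notice the change": 5,
    "propagate to target": 10,
    "load, validate, transform, publish, aggregate": 120,
    "query the data and react": 60,
}

total = sum(latency_seconds.values())
print(f"End-to-end 'realtime' latency: {total} seconds")
for stage, seconds in latency_seconds.items():
    print(f"  {stage:<45} {seconds:>4}s  ({seconds / total:.0%} of total)")
```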

Non-realtime data warehouses often use a periodic batch load paradigm: once a month, a week, a day, or whenever, we execute a batch process that extracts data from the source, identifies what has changed, and applies that to the data warehouse. But what if we run this batch much more frequently, say half-hourly? We are moving towards realtime. Loading data more frequently should reduce the volume of data in each individual load, and less data should equate to shorter batch times (though this is not likely to be a linear reduction in batch duration). But we are imposing some new challenges on the source extract: we will have frequently running queries that access multiple source rows, which will have an impact on source performance, and we will probably need to modify the extract code to robustly identify the contents of each extract window, that is, we must not miss or duplicate data. In addition we are imposing an impact on the target: we need a good way to publish the received data to the data warehouse without adversely affecting its query workload, and we must ensure that our micro-batch load-to-publish time is substantially shorter than the interval between the micro-batches.
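By way of illustration, here is a minimal Python sketch of one common way to make that extract window robust: a persisted high-water mark. It assumes a DB-API style connection (psycopg2-style %s placeholders) and hypothetical table and column names (src_orders, last_update_ts, etl_control); it is a sketch of the pattern, not a definitive implementation.

```python
from datetime import datetime, timedelta, timezone


def load_into_staging(rows):
    """Placeholder for the real staging load (for example, a bulk insert into a staging table)."""
    pass


def run_micro_batch(conn):
    """One micro-batch extract bounded by a persisted high-water mark."""
    with conn.cursor() as cur:
        # 1. Read the high-water mark left by the previous successful batch.
        cur.execute(
            "SELECT last_extracted_ts FROM etl_control WHERE job_name = %s",
            ("orders_load",),
        )
        (low_water,) = cur.fetchone()

        # 2. Close the window slightly in the past so in-flight transactions with
        #    earlier timestamps have time to commit before we read past them.
        high_water = datetime.now(timezone.utc) - timedelta(seconds=30)

        # 3. Extract only the rows inside the half-open window (low_water, high_water].
        cur.execute(
            """
            SELECT order_id, status, last_update_ts
              FROM src_orders
             WHERE last_update_ts >  %s
               AND last_update_ts <= %s
            """,
            (low_water, high_water),
        )
        load_into_staging(cur.fetchall())

        # 4. Advance the high-water mark only after the load succeeds, in the same
        #    transaction, so a failed batch simply replays the same window next time.
        cur.execute(
            "UPDATE etl_control SET last_extracted_ts = %s WHERE job_name = %s",
            (high_water, "orders_load"),
        )
    conn.commit()
```

Because the window is closed against the recorded mark rather than "now minus the interval", a batch that fails or overruns neither skips rows nor extracts them twice.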

However, say "realtime data warehousing" to most people and they think of continuous data capture, possibly through an extract built on streams or SOA messaging from the source, but more likely through synchronous or asynchronous change data capture using database triggers or the redo logs. Again, there is going to be some degree of latency between the event and the data being replicated to the target. But now we have a design choice on the target: do we consume the captured changes as a continuous trickle-feed process, or do we run a series of micro-batches to consume the data? By necessity, true trickle-feed moves us into row-by-row processing, with a possibly significant impact on processes that need to aggregate data. I feel that most continuously-fed data warehouses will use some form of micro-batch for the majority of the DW transform and publish process, even if trickle-feed processing is used to populate a non-aggregated, ODS-style layer for the special cases where people need to see "now" data.
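To make the micro-batch consumption option concrete, here is a rough Python sketch of a consumer that periodically drains a hypothetical change table (cdc_changes, with a monotonically increasing change sequence number csn and a consumed flag). The names and the half-hour interval are assumptions, and a real CDC subscriber API would replace the polling; the point is that changes accumulate and are then processed set-wise rather than row by row.

```python
import time

BATCH_INTERVAL_SECONDS = 30 * 60  # half-hourly micro-batches; purely an assumed interval


def transform_and_publish(changes):
    """Placeholder for the set-based transform and publish step of the DW load."""
    pass


def consume_forever(conn):
    """Drain accumulated captured changes in micro-batches rather than row by row."""
    while True:
        with conn.cursor() as cur:
            # Snapshot how far the capture process has got at this moment.
            cur.execute("SELECT COALESCE(MAX(csn), 0) FROM cdc_changes WHERE consumed = 'N'")
            (high_csn,) = cur.fetchone()

            if high_csn:
                # Pull everything captured up to the snapshot point, in commit order.
                cur.execute(
                    "SELECT csn, table_name, operation, payload FROM cdc_changes "
                    "WHERE consumed = 'N' AND csn <= %s ORDER BY csn",
                    (high_csn,),
                )
                transform_and_publish(cur.fetchall())

                # Mark the batch consumed only once the publish has succeeded.
                cur.execute(
                    "UPDATE cdc_changes SET consumed = 'Y' WHERE consumed = 'N' AND csn <= %s",
                    (high_csn,),
                )
        conn.commit()
        time.sleep(BATCH_INTERVAL_SECONDS)
```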

As I mentioned in a previous blog, captured commit-based change can generate a lot of "noise": commits associated with no data change, multiple updates to the same row, and changes to columns of no interest to the DW system. How we choose to handle this in our load procedures will depend on what the business needs to see: the final status of a row within the batch, clusters of row changes within a short period (such as a few seconds) treated as a single change, or every change applied in the order it happened.
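Here is a toy Python sketch of the first of those options: collapsing the noise so that only the final state of each row survives a micro-batch. The field names (csn, order_id, operation, changed_columns) are invented for the illustration.

```python
def collapse_to_final_state(changes):
    """Keep one change per business key: the last one seen in change-sequence order."""
    latest = {}
    for change in sorted(changes, key=lambda c: c["csn"]):
        # Drop noise: an update that touched no columns the warehouse cares about.
        if change["operation"] == "UPDATE" and not change["changed_columns"]:
            continue
        # Later changes simply overwrite earlier ones for the same row.
        latest[change["order_id"]] = change
    return list(latest.values())
```

The other two options amount to different reductions over the same list: grouping changes whose timestamps fall within a few seconds of each other into a single change, or applying every change in csn order with no reduction at all.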