Realtime Data Warehouse Challenges – Part 2

Last time I mentioned some of the challenges of taking realtime feeds and publishing them into a data warehouse. This time I am going to propose a way to meet those challenges.

But before that I will take a small detour around what Oracle refers to as their Reference Data Warehouse and Business Intelligence Architecture. Here the data warehouse is divided into three "sections". The first is "staging", a local copy of data from the source systems. The second is the "foundation", typically a "process neutral" 3NF representation of the business data such as "customer", "product" and "orders". It is likely to differ in structure from the staged tables because it may well merge data from multiple sources; for example, customer attributes could come from both CRM and ERP systems. This foundation layer is likely to be versioned (that is, whenever a dimensional attribute changes, a new current version is created) and non-aggregated. The data in this layer are our BI jewels; we don't know what future analysis and data mining needs will be, and by aggregating things we lose the flexibility we might need - remember, there is no UNGROUP BY clause in SQL. The third tier is the performance and access layer, where we typically present optimized table structures to the query tools; it is here that we have the aggregated tables, bitmap indexes and all of those other "traditional" data warehouse features. This is not really a revolutionary (or even new) architecture - I have been doing similar things in my data warehouse designs since the 1990s.
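To make the three layers a little more tangible, here is a rough and entirely hypothetical sketch of how the same customer data might appear in each layer. The schema names, table names and columns are invented for this illustration and are not part of Oracle's reference architecture.

```sql
-- Hypothetical illustration of the three layers for customer data.
-- All schema, table and column names are invented for this sketch.

-- Staging: a straight replica of the source CRM table
CREATE TABLE stage.crm_customer (
  customer_no    VARCHAR2(20),   -- source system key
  customer_name  VARCHAR2(200),
  credit_limit   NUMBER,
  last_updated   DATE
);

-- Foundation: process-neutral 3NF, merged from CRM and ERP, versioned
CREATE TABLE foundation.customer (
  customer_no    VARCHAR2(20)  NOT NULL,  -- business key
  version_no     NUMBER        NOT NULL,
  valid_from     DATE          NOT NULL,
  valid_to       DATE,                    -- NULL = current version
  customer_name  VARCHAR2(200),
  credit_limit   NUMBER,
  CONSTRAINT customer_pk PRIMARY KEY (customer_no, version_no)
);

-- Performance/access layer: a denormalised, current-view dimension
-- presented to the query tools
CREATE TABLE perf.dim_customer AS
SELECT customer_no, customer_name, credit_limit
FROM   foundation.customer
WHERE  valid_to IS NULL;
```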

One of the key things to note is that the staging tables for the dimensional data should be complete replicas of the source tables and not just a set of extracted rows provided by the source data owner. This is the ideal place to bring in replication technology, and hence the beginnings of a realtime data warehouse. Facts (or, more accurately in this case, "events") in the staging area are not going to be full replicas of the fact source but rather all of the events that have occurred since the last load; again, this can be achieved by realtime replication. Remember, we only need to replicate the tables of interest and not all of the structures of the source applications. At first glance it might seem extravagant to have effectively three copies of dimensional data (one in each layer) and two of facts - but these days disk is cheap, and it is also (tongue firmly in cheek) a good way to use some more of the disk space you had to buy to get the required data throughput.
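As a small and purely hypothetical sketch, a staged events table might look something like this, holding only the rows delivered by the replication tool since the last load; the table and column names are invented for illustration.

```sql
-- Hypothetical staged events table: not a full replica of the source,
-- just the events captured since the last load by the replication tool.
CREATE TABLE stage.order_events (
  order_no     VARCHAR2(20),
  customer_no  VARCHAR2(20),   -- source business key
  product_no   VARCHAR2(20),
  quantity     NUMBER,
  order_date   DATE,
  captured_at  TIMESTAMP       -- set by the replication capture process
);
```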

Acquiring data in realtime is not going to be our problem, and if we can use the staging tables directly for reporting then we can say "job done" and not worry further. Our problems arise if we need to do significant work on the staged data before we can report over it: we might see problems with data quality, with surrogate key management for dimensions (particularly slowly changing dimensions), and with the need to aggregate facts to improve the performance of the query tool.

I am not going to get into the debate on what to do about data quality; I have blogged about that in the past. The only thing I will say, though, is that the resolution to data quality problems should be in the source system(s) - data warehouses should report the same data as is used in the transactional systems, and if that requires a master data management program then so be it.

I suspect that this next point might be considered heresy, but if you have immutable business keys for your dimensional data (perhaps from a master data management system) then consider using them in the data warehouse. This will reduce the complexity of the ETL processes needed to push data from stage to the data warehouse, and it might also remove a time dependency when pushing dimensional data through to the foundation and access layers in realtime. The need to track type 2 slowly changing dimensions (where we keep a history of change) might force the use of surrogate keys, but other approaches are possible that can avoid surrogate keys on the fact tables. One approach is to split the non-volatile and SCD-1 attributes from the versioned (SCD-2) attributes and store the dimension in two tables: the first table is joined to the fact table on the business key, and the versioned table of SCD-2 attributes is joined to the first dimension table, also on the business key. Queries against the second table will need to also pass a date so that the correct version is selected, but this is not hard to achieve with most query tools (a sketch of this split follows below). By far the easiest thing to do, though, is to avoid SCD-2 altogether; many organizations think they need to implement SCD-2, but when they come to use the system they find that SCD-1 actually fits the reporting requirements of the vast majority of their users.
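For those who do need SCD-2, here is a rough sketch of the split-dimension approach described above. All of the table and column names (dim_product, dim_product_version, fact_orders and so on) are hypothetical and only serve to show the shape of the join on the business key plus an event date.

```sql
-- Hypothetical sketch: stable/SCD-1 attributes in one table, versioned
-- SCD-2 attributes in another, both keyed on the business key rather
-- than a surrogate key.

CREATE TABLE dim_product (
  product_no    VARCHAR2(20) PRIMARY KEY,  -- immutable business key
  product_name  VARCHAR2(200),             -- SCD-1: overwritten on change
  unit_of_sale  VARCHAR2(20)
);

CREATE TABLE dim_product_version (
  product_no     VARCHAR2(20) NOT NULL REFERENCES dim_product,
  valid_from     DATE         NOT NULL,
  valid_to       DATE         NOT NULL,    -- far-future date for current row
  product_group  VARCHAR2(50),             -- SCD-2: history is kept
  list_price     NUMBER,
  CONSTRAINT dim_product_version_pk PRIMARY KEY (product_no, valid_from)
);

-- The fact table joins on the business key; the versioned table is
-- resolved by also passing the event date.
SELECT f.order_date,
       p.product_name,
       v.product_group,
       SUM(f.quantity) AS total_quantity
FROM   fact_orders f
JOIN   dim_product p
       ON  p.product_no = f.product_no
JOIN   dim_product_version v
       ON  v.product_no  = f.product_no
       AND f.order_date >= v.valid_from
       AND f.order_date <  v.valid_to
GROUP BY f.order_date, p.product_name, v.product_group;
```

The point of the split is that the fact table never needs a surrogate key lookup during load; the version resolution is pushed to query time, where the date predicate does the work.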