Realtime Data Warehouse Challenges – Part 2
June 27th, 2010 by Peter Scott
Last time I mentioned some of the challenges of taking realtime feeds and publishing them into a data warehouse. This time I am going propose a way to meet those challenges.
But before that I will take a small detour around what Oracle refers to as their Reference Data Warehouse and Business Intelligence Architecture. Here we are dividing the data warehouse into three “sections”: “staging” which is a local copy of data from the source systems; “foundation” which is typically a “process neutral” 3NF representation of the business data such as “customer”, “product” and “orders”. It is likely to be different in structure from the staged tables in that it could well be the merging of data from multiple sources, for example customer attributes could come from both CRM and ERP systems. This foundation layer is likely to be versioned (that is, whenever a dimensional attribute changes a new, current version is created) and non-aggregated. The data in this layer are our BI jewels; we don’t know what future analysis and data mining needs will be and by aggregating things we lose the flexibility we might need; remember there is no UNGROUP BY clause in SQL. The third tier is performance and access layer where we typically present optimized table structures to the query tools, it’s here that we have the aggregated tables, bitmap indexes and all of those other ‘traditional’ data warehouse features. This not really a revolutionary (or even new) architecture – I have been doing similar things in my data warehouse design since the 1990s.
One of the key things to note is that the staging tables for the dimensional data should be complete replicas of the source tables and not just a set of extracted rows provided by the source data owner. Here is the ideal place to bring in replication technology and hence the beginnings of a real time data warehouse. Fact (or more accurately in this case “events”) in the staging area are not going to be full replicas of the fact source but rather all of the events that have occurred since the last load, agin this can be achieved by realtime replication. Remember we only need to replicate the tables of interest and not all of the structures of the source applications. At first glance it might seem extravagant to have effectively three copies of dimensional data (one in each layer) and two of facts – but these days disk is cheap and it is also (tongue firmly in cheek) a good way to use some more of the disk space you had to buy to get the required data throughput.
Acquiring data in realtime is not going to be our problem, and if we can use the stage tables directly for reporting then we can say “job done” and not worry more. Our problems arise if we need to do significant work on the staged data to report over it. We might see problems with data quality, surrogate key management for dimensions, particularly slowly changing dimensions, and the need to aggregate facts to improve performance of the query tool.
I am not going to get into the debate on what to do about data quality, I have blogged about that in the past. The only thing I will say though is that the resolution to data quality problems should be in the source system(s) – data warehouses should report the same data as used in the transactional systems and if that requires a master data management program then so be it.
I suspect that this next point might be considered heresy, but if you have immutable business keys for your dimensional data (perhaps from a master data management system) then consider using them in the data warehouse – this will reduce the complexity of the ETL processes needed to push data from stage to the data warehouse, it might also remove a time dependency of pushing dimensional data through to the foundation and access layers in realtime. The need to track slowly changing type 2 dimensions (where we keep a history of change) might force the use of surrogate keys, but other approaches are possible that might avoid the need for surrogate keys being used on the fact tables; one approach is to split the non-volatile and SCD-1 attributes from the versioned (SCD-2) attributes and store the dimension in two tables, with the first table joined to the FACT table (on the business key) and the versioned table of SCD2 attributes joined to first dimension table on business key; queries against the second table will need to also pass a date so the correct version is selected, but this is not hard to achieve with most query tools. By far the easiest thing to do, though, is avoid SCD-2 all together; many organizations think they need to implement SCD-2, but when they come to use the system they find that SCD-1 actually fits the reporting requirements of the vast majority of their users.


June 30th, 2010 at 9:02 am
Hi,
Thanks for such a wonderful article. One of the challenges in maintaining a warehouse is how to manage its backup. I mean if you have 15TB of database, then you cannot take full backup of it weekly or even monthly. Can you please, in the light of your experience, let us know as what things do you keep in perspective while on deciding as how to formulate a backup policy for a data warehouse?
regards
June 30th, 2010 at 5:54 pm
What ever you do for a backup policy you need to ensure that it is actually usable to restore your database for when things go wrong, and that means testing the restore process and proving it works.
Backup and recovery is a big topic; there are so many options and technologies, so perhaps we should write an overview article for the blog
July 2nd, 2010 at 7:26 pm
Peter,
For your process neutral foundation layer, I urge you to look at the Data Vault modeling approach (http://danlinstedt.com/about/). I have been using it for many years and found it to be a more flexible and extensible modeling technique for a central historical EDW than 3NF. One benefit is that you model just the data you need for the current project (to feed a data mart say) without boxing yourself in a corner when you need to collect the next bit of data. It definately lends itself to “agile” data warehousing and was designed with the intention of supporting things like realtime data warehousing. Even Bill Inmon (father of DW) supports using this technique.
July 9th, 2010 at 9:19 am
Major issue that any business faces is trying to successfully manage an ever increasing amount of documents, records and files. Most firms have a huge database to deal with that it can be really difficult to keep track of all data. So maintain those data we need proper wire housing of data. Good effort Pitter. ………..
Thanks….
August 19th, 2010 at 3:11 am
These are really good thoughts,very practical and detailed.It is all about balance. Under present technologies and the complext business complexity,to achieve real time DW, we have to make some sacrifice(avoiding SCD-2 and only subset of the whole data) and have a better foundation(Master data and good-quality data in the source system).
According to the DW2.0 of Bill Inmon,it proposed interactive sector. That is the same thought as the staging area here. With respect to DW2.0,the interactive sector is realtime(or near-near-realtime),but can only satisfy operational needs,say,clerical community.
August 19th, 2010 at 3:21 am
“Fact (or more accurately in this case “events”) in the staging area are not going to be full replicas of the fact source but rather all of the events that have occurred since the last load, agin this can be achieved by realtime replication.”
If you only have part of the fact data in the staging area. What can u do with it? Can you elaborate more about this?
October 13th, 2010 at 7:00 pm
how we can use the stage tables directly for reporting in real time database.