Drilling Down in the Oracle Next-Generation Reference DW Architecture

July 16th, 2009 by Mark Rittman

If you’ve read this blog over the past year, or heard me present at conferences on Oracle data warehouse architecture, you’ll probably have seen this architecture diagram at some point. It was put together by Doug Cackett, Simon Griffiths, Kevin Lancaster, Andrew Bond and Keith Laker from Oracle in the UK (and I had some input at its conception):

[Figure: the full next-generation reference DW, BI and EPM architecture diagram]

This is a reference data warehouse, BI and EPM architecture that supports structured and unstructured data sources; access by BI tools such as Oracle BI Enterprise Edition and Oracle EPM Suite; and a three-layer data warehouse made up of staging, foundation, and performance and access layers. The idea is to play to the strengths of the Oracle database, OBIEE and EPM Suite, together with Fusion Middleware, to come up with a “next-generation” BI & DW architecture. This presentation of mine from the recent Oracle BI Symposium in London sets it out in more detail, if you’re interested in reading further.

Setting aside the middleware and BI tools aspects for a while, though, the data warehouse model is something we’ve used a few times recently, with considerable success. When we suggest it, I tend to get a few questions about the actual details of each of the layers; whilst Staging is fairly self-explanatory, the purpose of the foundation and performance and access layers can initially be a bit confusing, so in this posting I’ll try to set out what they are actually for.

[Figure: the three-layer data warehouse – staging, foundation, and performance and access layers]

The Staging layer is the “landing pad” for your incoming data, and is typically made up of regular, heap-organized tables that are populated via extraction routines from a range of data sources. You might also find external tables there, or possibly capture tables populated by Oracle Change Data Capture. Staging areas are particularly important for Extract, Load and Transform (ELT) tools such as Oracle Warehouse Builder and Oracle Data Integrator that use the database together with SQL to transform your data.
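
As a simple illustration, a staging object might be nothing more than an external table over an incoming flat file – a minimal sketch, with the directory object, file and column names invented for the example (the date lands as a raw string and gets converted downstream):

    CREATE TABLE stg_orders_ext (
      order_id      NUMBER,
      customer_code VARCHAR2(20),
      order_date    VARCHAR2(10),   -- landed as-is; converted in the ELT step
      amount        NUMBER(10,2)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY stage_dir   -- assumed database directory object
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
      )
      LOCATION ('orders.csv')
    );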

Now, most people who have worked in data warehousing for the past ten years or so are aware of the dimensional modeling concept (Rittman Mead even helped to organize a seminar on this topic with Kimball University a few months ago). Dimensional modeling, with its facts, dimensions, hierarchies, drill paths and so on, is a great way of presenting data to users and to BI tools such as Oracle BI Enterprise Edition. By embedding the drill paths in the data, and by denormalizing the data into simple tables, you make life easier for the Oracle query optimizer, you minimize the number of joins, and you present users with a data model that’s easy to understand. This is what we’ve got with the Performance and Access Layer in this model.
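
To make that concrete, a typical query against a star in this layer joins one fact table to a handful of denormalized dimensions – a hypothetical example, with the table and column names invented:

    -- The product hierarchy (category > subcategory > product) is embedded
    -- as columns in product_dim, so drilling down just adds a column
    SELECT d.calendar_year,
           p.product_category,
           SUM(f.sales_amount) AS total_sales
    FROM   sales_fact  f
    JOIN   date_dim    d ON d.date_key    = f.date_key
    JOIN   product_dim p ON p.product_key = f.product_key
    GROUP  BY d.calendar_year, p.product_category;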

Whilst a dimensional model is therefore great for information access, it’s not a great way to store data for the long term, particularly if you think your analysis requirements might change over time. What if you choose to use Essbase or Oracle OLAP as your query tool, or indeed a tool like QlikView or Microsoft’s forthcoming Gemini that doesn’t use a dimensional model? What if you want to reconfigure your stars: perhaps make what’s currently a fact into a dimension, change around some hierarchies, or add a new product line that has completely different attributes from your existing product dimension? If you’ve “painted yourself into a corner” with a dimensional model, or worse, aggregated your data or otherwise lost some level of detail, then making changes to your warehouse may well turn out to be tricky.

Because of this, the Foundation Layer in the reference architecture acts as your information management layer, storing your detail-level warehouse data in a business- and query-neutral form that will persist over time, regardless of the analysis requirements and BI tools you have to support. Data in this layer is typically stored in third normal form (3NF), with hierarchies stored as column values rather than hard-coded table structures, so that the design is space-efficient but flexible.
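
For example, a product hierarchy in this layer might be held as parent-child column values that a hierarchical query can walk, rather than as one table per hierarchy level – again just a sketch, with invented names:

    -- The hierarchy lives in the data (parent_code), not the table design,
    -- so levels can be added or reshaped without DDL changes
    CREATE TABLE product (
      product_code  VARCHAR2(20) NOT NULL,   -- business key from the source
      source_system VARCHAR2(10) NOT NULL,
      product_name  VARCHAR2(100),
      parent_code   VARCHAR2(20)             -- NULL at the top of the hierarchy
    );

    SELECT LPAD(' ', 2 * (LEVEL - 1)) || product_name AS hierarchy
    FROM   product
    START WITH parent_code IS NULL
    CONNECT BY PRIOR product_code = parent_code;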

The Foundation Layer typically contains entities for all of the key data items in the warehouse, usually with a “current value” table and a “history of changes” table for each entity. Business keys together with source system identifiers are used in this layer, and all data should be cleansed, transformed and conformed, with any erroneous data being pushed back into the rejected data area of the Staging layer. The aim for this area is that it should be “lossless”: source system codes should be retained, particularly where a new, warehouse-wide coding system would lose detail that might be required for future analysis. Note that in this layer there are no surrogate keys, no slowly changing dimensions and no hard-coded hierarchies; these are aspects of dimensional modeling and should therefore only be found in the performance and access layer.
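
In DDL terms, a customer entity might therefore look something like this – a sketch, with all names invented:

    -- "Current value" table: one row per business key per source system,
    -- and no surrogate keys at this layer
    CREATE TABLE customer (
      customer_code VARCHAR2(20) NOT NULL,
      source_system VARCHAR2(10) NOT NULL,
      customer_name VARCHAR2(100),
      credit_limit  NUMBER(10,2),
      CONSTRAINT customer_pk PRIMARY KEY (customer_code, source_system)
    );

    -- "History of changes" table: every version, start/end dated
    CREATE TABLE customer_history (
      customer_code  VARCHAR2(20) NOT NULL,
      source_system  VARCHAR2(10) NOT NULL,
      customer_name  VARCHAR2(100),
      credit_limit   NUMBER(10,2),
      effective_from DATE NOT NULL,
      effective_to   DATE,            -- NULL marks the open, current version
      CONSTRAINT customer_history_pk
        PRIMARY KEY (customer_code, source_system, effective_from)
    );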

The Performance and Access Layer is where you will find your star schemas, either snowflaked or in denormalized tables, and you do everything you can in this layer to optimize your data for the query tools you are using, safe in the knowledge that you’ve got your original data in the Foundation layer should you need to reconfigure your stars.
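
Because the detail persists in Foundation, the stars here can be rebuilt or reshaped at will; for instance, new dimension rows might be spun off from a foundation entity with something like this (sequence and table names hypothetical):

    -- Surrogate keys appear only in this layer, generated at load time
    INSERT INTO customer_dim (customer_key, customer_code, customer_name)
    SELECT customer_dim_seq.NEXTVAL, c.customer_code, c.customer_name
    FROM   customer c
    WHERE  NOT EXISTS (SELECT 1 FROM customer_dim d
                       WHERE  d.customer_code = c.customer_code);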

My belief is that this approach has several key benefits. Firstly, as much as we OBIEE developers would like to think otherwise, not everything is a dimensional query, especially once you step outside the world of sales analysis, financial analysis and the like. If you use a toolset like OBIEE, you can provide access to both the Performance and Access Layer and the Foundation Layer within the same query environment, either by transforming the 3NF model in the Foundation Layer into a virtual star schema on the fly, or by providing access to the 3NF model in its “raw” form through Oracle BI Publisher (or indeed, Discoverer).
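
One way to picture the “virtual star” idea is as a set of denormalized views over the foundation tables – OBIEE achieves the equivalent in its semantic layer without physical views; table and column names here are invented:

    -- A "virtual dimension" flattening two 3NF foundation tables
    CREATE OR REPLACE VIEW customer_dim_v AS
    SELECT c.customer_code,
           c.customer_name,
           r.region_name
    FROM   customer c
    JOIN   region   r ON r.region_code = c.region_code;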

The key benefit, though, is that your data warehouse should now last beyond the immediate requirements your users have for a dimensional model. Whilst Ralph Kimball would argue that you can gracefully adapt your dimensional model over time to incorporate changes and new data, in my experience this is a far trickier task than maintaining a 3NF intermediate layer and spinning off a new conformed dimensional model if your requirements change significantly. Moreover, as I said earlier, not everything is a dimensional model, and if you make your warehouse data available in a query- and business-neutral form as well as a dimensional form, and particularly if you use technologies such as Exadata, range/hash partitioning and the like, you should be able to provide fast query access for those users who need either transactional-level data or data that doesn’t fit into a dimensional model.

Comments

  1. Kaustubh Says:

    Thanks Mark,

    This was indeed really insightful. I hope you will write a reference on this, as most of the time it is difficult for us to convince users to move to dimensional modeling for fear of losing data, and yet we are always unable to derive the full functionality of the BI tools on databases which are not dimensionally aware.

  2. Jeff Moss Says:

    Pretty much the same as what we’ve put in at my current client – we started it about 5 years ago, so I can tell you it works. Yes, some of the data is dimensional “stars”, but quite a bit doesn’t conform to that approach, hence this multi-layered approach.

    Do you not consider the “history” tables in the Foundation layer (we call ours Business Data Model – BDM) to be SCD2s?

  3. Mark Rittman Says:

    Jeff,

    I guess it’s a semantic thing – you’ve got to model history in the foundation layer somehow, and start- and end-dating the changed records is of course one way (something like the sketch below). I would steer clear of surrogate keys at this level though, and try to make sure that what we’re tracking changes on are entities, not dimensions – it’s the dimensional modeling that I’m trying to get away from at this level, in favour of analysis-neutral entities.
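
    To make that concrete, the change-tracking might look something like this – just a sketch, with the table, column and bind variable names invented for the example:

        -- Close off the current version of the changed entity row...
        UPDATE customer_history
        SET    effective_to = SYSDATE
        WHERE  customer_code = :code
        AND    source_system = :src
        AND    effective_to IS NULL;

        -- ...then insert the new version, open-ended until the next change
        INSERT INTO customer_history
          (customer_code, source_system, customer_name, effective_from, effective_to)
        VALUES
          (:code, :src, :new_name, SYSDATE, NULL);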

    regards, Mark

  4. Sid Says:

    Mark,
    while I can agree with you that not everything is a dimensional query, and that incorporating large changes into a dimensional model in a short period of time is tricky, I am a bit skeptical about the selling point of the foundation layer. A compelling real-life example would be helpful. Also, the foundation layer has to contain both entities and relationships in some form. And shouldn’t this reference architecture contain some mechanism for storing, accessing and updating the metadata?

  5. Dhrubo Says:

    Good one Mark,
    Yes, this does work well – Simon Griffiths gave us some valuable inputs in the run-up to this design about 3 years back, and now we can see it was worth moving to this architecture.

  6. Abhishek Says:

    Somehow I am reading this quite late :) but I just want to make sure I have it clear – what I understand you to be suggesting is another layer after staging, which will have the same entity relationships among its tables as we have in OLTP? So both OLTP and OLAP will have the same relationship diagrams, and the only extra thing OLAP will have is those history tables (I am not sure on which columns you will keep the history, and presumably the columns you choose here would be the only choice in the Performance Layer, which is going to represent the dimensional model)…. Please correct me if I am getting it wrong.

  7. Mark Rittman Says:

    Hi Abhishek,

    Well the second layer (the Foundation Layer) will be normalized, and integrated across all of the sources. Given that the sources may not themselves be properly normalized or designed, and given that there will be several source systems, the foundation layer gives us:

    (a) a properly designed normalized data layer
    (b) that covers the data taken from several source systems,
    (c) that stores history, and
    (d) that keeps all of this in an analysis-neutral design

    This is the way that it differs from the OLTP source that you mention.

    regards

    Mark

  8. Mike Peacey Says:

    Mark,

    We are using a similar architecture to the one described, and tracking history in the Foundation Layer in 3NF tables.

    Up until now, the access layer has been SCD1 – but we have now been asked to provide SCD2 for some users.

    The quandary I have is that it doesn’t seem possible to use the history in Foundation to inform the SCD2. What I am looking at doing is using the SCD1 output to MERGE into the SCD2 versions, using normal start and end date processing – something along the lines of the sketch below.
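
    In outline (table and column names made up for the illustration, and surrogate key handling omitted):

        -- Pass 1: end-date the current SCD2 row where an attribute has
        -- changed, and insert rows for brand-new business keys
        MERGE INTO customer_scd2 tgt
        USING customer_scd1 src
        ON (tgt.customer_code = src.customer_code AND tgt.end_date IS NULL)
        WHEN MATCHED THEN
          UPDATE SET tgt.end_date = TRUNC(SYSDATE)
          WHERE  tgt.customer_name <> src.customer_name
        WHEN NOT MATCHED THEN
          INSERT (customer_code, customer_name, start_date, end_date)
          VALUES (src.customer_code, src.customer_name, TRUNC(SYSDATE), NULL);

        -- Pass 2: open a new current version for each row just end-dated
        INSERT INTO customer_scd2 (customer_code, customer_name, start_date, end_date)
        SELECT s.customer_code, s.customer_name, TRUNC(SYSDATE), NULL
        FROM   customer_scd1 s
        WHERE  NOT EXISTS (SELECT 1 FROM customer_scd2 t
                           WHERE  t.customer_code = s.customer_code
                           AND    t.end_date IS NULL);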

    I had kind of hoped that, having done all the change history in Foundation, I would be able to use this history for the SCD2. But of course, a single dimension in the Reporting layer may be derived from multiple 3NF tables, each of which has separate start and end dates. And the fields in the SCD2 dimension will be a subset of those from the 3NF tables.

    Consequently, the easiest way forward for me is to use the existing SCD1 tables as a basis for the SCD2. I’m unhappy about this, since it means I don’t need to use the history I’m maintaining in Foundation – which makes me ask myself why I bothered with it in the first place :)

    Mike

  9. Nitin Joshi Says:

    Hello Mark
    Forgive me for reading this article and commenting on it so late.

    My question is – how does this approach gel with the deployment of prepackaged analytics such as OBIA? Due to the additional layer, the ETL adapters (it may just be the SIL) may be rendered useless. Also, the implementation cycle will lengthen.

    Thanks,
    Nitin

  10. Mark Rittman Says:

    @Nitin

    In my experience, you can incorporate packaged analytics into this architecture in a few ways.

    1. You can treat the packaged analytics sources, transformations and dimensional layer as a separate silo of information, and then integrate it into the rest of the DW at the “BI Administration and Query Generation” layer – basically, the OBIEE semantic layer and its federated query capabilities.

    2. You can load the source data that the packaged analytics will use into your foundation layer as normal, forgo the SDE and SIL mappings (or amend the SIL mappings to point to your foundation layer tables), and just use the data structures and dashboards provided by the packaged apps. Obviously this will alter the cost-benefit for the packaged apps, and you’ll still need to integrate the structures using the BI tool semantic layer.

    3. You can forgo the foundation layer, and build the rest of your custom warehouse Access and Performance layer to the same design approach and standard as the packaged apps – basically, build out from these.

    For me, the major issue with the packaged apps is the lack of a foundation layer. If they had one, you could just build out from it to incorporate your additional data sources and subject areas. As you say, if you try to incorporate the packaged apps into a proper, three-layer data warehouse, you either have to leave all integration to the BI tool layer, or forgo much of the benefit of the seeded ETL routines that the packaged apps provide (and charge you for).

    Mark
