Real-time BI: Federated OLTP/EDW Reporting

May 16th, 2011 by

The typical approach in Federated OLTP/EDW reporting environments is to use a BI tool such as OBIEE to do horizontal federation. This means combining data from multiple sources at the same grain in a single logical table. One note of clarification: my use of the word “federated” might be a misnomer, and I apologize in advance. As I argued in the last post, the best practice for performance reasons is to actually stream, or “GoldenGate” the source system data to a foundation layer on the data warehouse instance. But old habits die hard, so I’ll continue to refer to this as “federation” even though it may not be technically accurate. Thanks for the latitude.

One of the sources for federation is a classic, batch-loaded EDW, with ETL processes that load conformed dimension tables, followed by fact tables that store the measures and calculations for the enterprise. Oracle Warehouse Builder (OWB), the ETL tool built inside the Oracle Database, is a standard choice for data warehouses built on the Oracle Database, and below, I show a sample process flow of what that batch load might look like:

Batch DW

Logical table sources (LTS’s) are a key feature within the OBIEE semantic model but are often misunderstood. Each LTS represents a single location for data to exist for either a logical fact table, or logical dimension table. A logical table in the BMM can have multiple LTS’s for any of the following reasons:

1. Including different table sources into a single logical table at different levels of granularity. Tables containing data pre-aggregated at a different level in a hierarchy is a common example of this scenario, and is known as “vertical fragmentation”.

2. Including different table sources into a single logical table at the same level of granularity. Having data exist in two different locations, but wanting them to be combined in particular situations, is a common example of this scenario, and is known as “horizontal fragmentation”.

Using horizontal fragmentation in OBIEE, we can map a single logical fact table to multiple LTS’s. For example, suppose we had a physical fact table in our EDW called SALES_FACT. To represent that fact table in the semantic model, we would create a logical fact table in the BMM — called “Sales Fact Realtime” in this example — and create an LTS that maps to the SALES_FACT table. We would also map another LTS which presents this data in the source system as well. As the source system is transactional and likely exists in third-normal form (3NF), the LTS that maps to the transactional schema would likely not be a simple one-to-one relationship. In 3NF, we would likely have to join multiple tables in our source system to represent the logical fact table Sales Fact Realtime:

Source to target fact

We would have to do something comparable with the Customer Dimension:

Source to target dimension

With the two LTS’s, we still need to configure the horizontal fragmentation. For this implementation, I have configured a repository variable called RV_REALTIME_THRESHOLD_DT, with an initialization block that keeps the value consistently at TRUNC(SYSTDATE). I use this variable as the threshold between reporting against the EDW schema and the source system schema.

Init block

Once I have the variable available, I can configure the fragmentation on the fact table to use the threshold to determine the appropriate source for a particular record. This is less complicated with the EDW LTS… simple fragmentation configured for all rows with a transaction date less than the threshold date:

Fragmentation EDW

Whereas only the source system contains the newer rows needed for layering in real-time data… both the EDW and the source system contain historic data, albeit the EDW data is likely transformed to a certain degree. So we have to configure fragmentation using the RV_REALTIME_THRESHOLD_DT variable, but we also have to use that variable as a filter on the source system LTS to make sure we don’t over allocate the data.

Fragmentation OLTP

What’s the result of all this complex mapping among different LTS’s in the BMM? OBIEE understands that each source schema is completely segmented, and the tables in each LTS never join to tables in the other LTS… but they do union. OBIEE will construct a complete query against the transactional schema, in this example, joining between the CUSTOMER_DEMOG_TYPES, CUSTOMERS, POS_TRANS and POS_TRANS_HEADER tables. Additionally, OBIEE will construct another complete query against the EDW schema, in this case, only the tables SALES_FACT and CUSTOMER_DIM. The BI Server then logically unions the results between the two source schemas into a single result set that is returned whenever a user builds a report against the logical tables Customer Dim and Sales Fact Realtime. So I run the following report against my fragmented Sales Fact Realtime:

High level report federated

The interesting part is how OBIEE does the logical union. When the EDW and the transactional schema exist in separate databases, the BI Server issues two different database queries and combines them into a single result set in its own memory space. However, if the schemas exist within the same database, as the Oracle Next-Generation Reference Architecture recommends, then the BI Server is able to issue a single query, transforming the logical union into an actual physical union in the SQL statement, as demonstrated in the statement below. Notice that the SQL threshold has been applied, and the UNION was constructed with a single SQL statement pushed down from the BI Server to the Oracle Database holding the Foundation and Presentation and Access layers in our Oracle architecture:

WITH 
SAWITH0 AS (select T44105.AMOUNT as c1,
     T44042.CUSTOMER_LAST_NAME as c2,
     T48199.CALENDAR_MONTH_NUMBER as c3,
     T48199.CALENDAR_YEAR as c4,
     T48199.SQL_DATE as c5
from 
     GCBC_EDW.DATE_DIM T48199 /* CONFORMED_DATE_DIM */ ,
     GCBC_EDW.CUSTOMER_DIM T44042,
     GCBC_EDW.SALES_FACT T44105
where  ( T44042.CUSTOMER_KEY = T44105.CUSTOMER_KEY and T44105.SALES_DATE_KEY = T48199.DATE_KEY ) ),
SAWITH1 AS (select T43971.SAL_AMT as c1,
     T43901.CUST_LAST_NAME as c2,
     T48199.CALENDAR_MONTH_NUMBER as c3,
     T48199.CALENDAR_YEAR as c4,
     T48199.SQL_DATE as c5
from 
     GCBC_EDW.DATE_DIM T48199 /* CONFORMED_DATE_DIM */ ,
     GCBC_CRM.CUSTOMERS T43901,
     GCBC_POS.POS_TRANS T43971,
     GCBC_POS.POS_TRANS_HEADER T43978
where  ( T43901.CUST_ID = T43978.CUST_ID 
         and T43971.TRANS_ID = T43978.TRANS_ID 
         and T48199.DATE_KEY =  TRUNC(T43978.TRANS_DATE) 
         and T43978.TRANS_DATE >= TO_DATE('2011-05-16 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') 
       )),
SAWITH2 AS ((select concat(D0.c4, D0.c3) as c2,
     D0.c5 as c3,
     D0.c2 as c4,
     D0.c1 as c5
from 
     SAWITH0 D0
union all
select concat(D0.c4, D0.c3) as c2,
     D0.c5 as c3,
     D0.c2 as c4,
     D0.c1 as c5
from 
     SAWITH1 D0)),
SAWITH3 AS (select sum(D3.c5) as c1,
     D3.c2 as c2,
     D3.c3 as c3,
     D3.c4 as c4
from 
     SAWITH2 D3
group by D3.c2, D3.c3, D3.c4)
select distinct 0 as c1,
     D2.c2 as c2,
     D2.c3 as c3,
     D2.c4 as c4,
     D2.c1 as c5
from 
     SAWITH3 D2
order by c2, c4, c3

But OBIEE is also capable of doing the fragmentation equivalent of “partition pruning.” When the BI Server has enough information to know that the entire result set will come from a single source, then the SQL will be issued against only one of the LTS’s. For instance, if I click on one of the “SQL Date” attributes in the above report which will apply a filter on the fragmentation column, the BI Server will know that the result set only comes from the EDW:

WITH
SAWITH0 AS (select sum(T44105.AMOUNT) as c1,
     concat(T48199.CALENDAR_YEAR, T48199.CALENDAR_MONTH_NUMBER) as c2,
     T48199.DATE_KEY as c3,
     T48199.SQL_DATE as c4,
     T44042.CUSTOMER_LAST_NAME as c5
from 
     GCBC_EDW.DATE_DIM T48199 /* CONFORMED_DATE_DIM */ ,
     GCBC_EDW.CUSTOMER_DIM T44042,
                   GCBC_EDW.SALES_FACT T44105
where  ( T44042.CUSTOMER_KEY = T44105.CUSTOMER_KEY 
         and T44042.CUSTOMER_LAST_NAME = 'Carr' 
         and T44105.SALES_DATE_KEY = T48199.DATE_KEY 
         and T48199.SQL_DATE = TO_DATE('2009-07-03' , 'YYYY-MM-DD') 
         and concat(T48199.CALENDAR_YEAR, T48199.CALENDAR_MONTH_NUMBER) = '200907' ) 
group by T44042.CUSTOMER_LAST_NAME, 
         T48199.DATE_KEY, 
         T48199.SQL_DATE, 
         concat(T48199.CALENDAR_YEAR, T48199.CALENDAR_MONTH_NUMBER))
select distinct 0 as c1,
     D1.c2 as c2,
     D1.c3 as c3,
     D1.c4 as c4,
     D1.c5 as c5,
     D1.c1 as c6
from 
     SAWITH0 D1
order by c2, c5, c4, c3

Before closing this section of the real-time discussion, I want to take a minute to identify the strengths and weaknesses of this approach. As far as strengths go, we have several items that register with this solution. First off… this is a low-latency solution. When using the Oracle Next-Generation Reference Architecture, we have the latency of streaming, or “GoldenGating,” the content from the source system to the DW database. With clients we’ve had in the past, this can run anywhere from a few seconds to several minutes, depending on the solution implemented. Additionally, there is no complex logical or physical data modeling and supporting ETL to deliver this solution, as there is with the EDW with a Real-Time Component, which we will explore in the next posting.

As far as weaknesses go, there will be a fair amount of complex RPD semantic-layer modeling. Obviously, the degree of difficulty depends on a number of factors: number of source systems integrated, number of subject areas, complexity of reports delivered, etc. Also, increased complexity of RPD modeling may introduce performance degradation as OLTP schemas have to be transformed “on the fly” to star schemas by the BI Server. But keep in mind… we are typically only doing this for at most a day’s worth of data, so with proper database tuning, this content can usually perform quite well.

Next up: EDW with a Real-Time Component

Comments

  1. GERARD Nicolas Says:

    Hy Stewart,

    Thanks for this article that give me the possibility to ask a little question that I have since a long time ago on the subject.

    You explain a time-based content fragmentation on the date level. I.e. all rapports will work if you have the date column (SQL Date) in your rapport.

    How do you handle all other time attributes (year, month, week, …) ?

    Do you create a huge list of “OR” predicate in your fragmentation content to handle them or do you keep this kind of approach only for the day query grain ?

    Cheers
    Nico

  2. Stewart Bryson Says:

    @Gerard

    The fragmentation always works whether you have the SQL Date in your report or not. Adding SQL Date in this case, because it’s the fragmentation column, allows the BI Server to prune to a single system for performance purposes, but whether it prunes or not has no effect on the content of the returned results. Including the fragmentation column as a filter is not required.

    So for pruning purposes, the best practice is to use shared filters whenever possible to prune to a single system when the report is designed to return specific results. For instance, Finance typically doesn’t want their numbers changing during the day, so it’s common to add filters to all the reports on a Finance dashboard to ensure that intra-day results are not included.

  3. Robert Tooker Says:

    Very interesting article. I know in obiee 10g there was a limitation where you couldn’t have both fragmentation and federation (in the aggregate sense) for one LTS. For example I have a table in my dwh up to cob yesterday (a), and an aggregated version of that table (b). I want to add today’s results (c) using what you have described above. I want the BI server to either combine a+c or b+c depending on the granularity of the query (with b+c being preferable for performance).

    I recall the repository would allow it but Answers would produce an error – I can’t remember the exact error but there is a metalink note saying it is a feature for a future release. Do you know if this has been resolved in obiee 11g?

    On a separate note, this solution only works if your dml at the source is very inserty (technical term) and your dwh structures and etl are straightforward. Imagine you had a status on your POS_TRANS_HEADER table that was ‘in progress’/'closed’/'cancelled’ and you only want to report on ‘closed’. How do you know what you’ve processed in your etl and what you haven’t, or what needs to be logically deleted? I suppose that leads on to your next posting … looking forward to it. Regards,

    Robert

  4. Miguel Escaja Says:

    Very interesting post.

    My View is that essentially this very common problem/use case can be more effectively be solved using the data virtualization technology. Although this technology has been around for a good few years now it’s nethertheless not very well known and quite often forgotten it’s there

    * Although OBIEE is does provide some federation capabilities, they’re very limited in my view (not really easy to elborate in a few lines)
    * No new federation capabilities has been added in 11g compared to what was available in 10g (Oracle migth argue that), a part from the introduction of the Lookup tables, reason being probably Oracle direccion is to provide this capabilities through a separate product as suppossed to add more buil-in features into OBIEE.
    * To me it’s a significant difference between a reporting tool with built-in federation capabilities,i.e, OBIEE and a virtualization tool (again it’d take a whole new post to explain this)
    * The aproach I would take it’d be to provide a Data Access Layer, outside of OBIEE, essentialy a virtualized database to provide a single data access point across all your physical data sources by mean of publishing vialualized data services that could be consumed not only by Reporting system like OBIEE but any other upstream applications.
    * This way all business units acrross the organization would have a single view of the truth from the data stand point.
    * This additional Architecture component will seat in between OBIEE/Upstream Apps and the underlined physical sources.

  5. Stewart Bryson Says:

    @Miguel:

    Thanks for the informative comment. Oracle has a very, very large investment in the new “Big Data” movement, and I think we will see a lot of ways to counter this in future releases of Database 12g and OBIEE. Watch what Jean-Pierre Dijcks has to say about this subject, as he is as plugged into Oracle’s strategy on this as anyone.

    The difficulty in “virtualization”-based federation as you have described comes in what is effectively “predicate-pushing”, to use database terminology. When a dimension table and a fact table exist in separate database structures at this point, the BI Tool has no way of pushing predicates down to those separate systems. So full data sets from both systems have to be brought into the OBIEE tier and joined there, and then filtered.

    If you ever write-up what you have in mind for your virtualization-based architecture, make sure you share it. I’d be keen to see what you have in mind.

  6. Miguel Escaja Says:

    Thanks for sharing the info.

    You got it, the key is to push predicate down to the underlined data sources so much so that without it Federation won’t be doable because as you rightly said a vast amount of data would have to be streamed out of the sources into the OBI Server for merging.

    I still think Oracle solution always relies on movement of data in some way, shape or form, either getting data into Active Monitoring Data Cache or real-time sources like Times ten or Coherence which needs to be persisted in a relation database or into a ROLAP or MOLAP like Essbase. Essensially, the same piece of data is being duplicated at least.

    There isn’t a Data Federation/Virtualization Component in the Oracle Product Stack at the moment other than OBIEE, which is by the way very limited and is not able to push predicate down so then not very useful or only advisable for some tactical solution (Pre-canned dashboard perhaps) definitely not feasible for Add-Hoc Analysis.

    We’re planning to some work using a Data Virtualization Tool. Once the POC is completed I’ll be happy to share the conclusions.

  7. Nitin R Joshi Says:

    Hello Stewart – Very informative article.

    In the example above, going down to the next level of the solution, there will be scenario where record for the same sales will appear from Real Time (updated) and EDW (till yesterday).

    Handling this scenario at OBIEE layer will be very difficult. Can you shed some light on this.

    Thanks,
    Nitin

Write a comment





Website Design & Build: tymedia.co.uk