Incremental refresh of Exalytics aggregates using native BI Server capabilities

One of the key design features of the Exalytics In-Memory Machine is the use of aggregates (pre-calculated summary data), held in the TimesTen In-Memory database. Out of the box (“OotB”) these aggregates are built through the OBIEE tool, and when the underlying data changes they must be rebuilt from scratch.

For OBIEE (Exalytics or not) to make use of aggregate tables in a manner invisible to the user, they must be mapped into the RPD as additional Logical Table Sources for the respective Logical Table in the Business Model and Mapping (BMM) layer. OBIEE will then choose the Logical Table Source that it thinks will give the fastest response time for a query, based on the dimension level at which the query is written.

OBIEE’s capability to load aggregates is provided by the Aggregate Persistence function, scripts for which are generated by the Exalytics Summary Advisor, or the standard tool’s Aggregate Persistence Wizard. The scripts can also be written by hand.

Aggregate Persistence has two great benefits:
  1. It uses the existing metadata model of the RPD to understand where to get the source data for the aggregate from, and how to aggregate it. Because it uses standard RPD metadata, it also means that any data source that is valid for reporting against in OBIEE can be used as a source for the aggregates, and OBIEE will generate the extract SQL automagically. The aggregate creation process becomes source-agnostic. OBIEE will also handle any federation required in creating the aggregates. For example, if there are two source systems (such as Sales, and Stock) but one target aggregate, OBIEE will manage the federation of the aggregated data, just as it would in any query through the front-end.
  2. All of the required RPD work for mapping the aggregate as a new Logical Table Source is done automagically. There is no work on the RPD required by the developer.
However, there are two particular limitations to 'vanilla' Aggregate Persistence:
  1. It cannot do incremental refresh of aggregates. Whenever the underlying data changes, the aggregate must be dropped and rebuilt in entirety. This can be extremely inefficient if only a small proportion of the source data has changed, and can ultimately lead to scalability and batch SLA issues.
  2. Each time that the aggregate is updated, the RPD is modified online. This can mean that batch times take longer than they need to, and is also undesirable in a Production environment.
I have written about alternatives and variations to the OotB approach for refreshing Exalytics aggregates previously here and here, namely:
  1. Loading TimesTen aggregates through bespoke ETL, in tools such as GoldenGate and ODI. TimesTen supports a variety of interfaces - including ODBC and JDBC - and therefore can be loaded by any standard ETL tool. A tool such as GoldenGate can be a good way of implementing a light-touch CDC solution against a source database.
  2. Loading TimesTen aggregates directly using TimesTen’s Load from Oracle functionality, taking advantage of Aggregate Persistence to do the aggregate mapping work in the RPD
In both of these cases, there are downsides to the method. Using bespoke ETL is ultimately very powerful and flexible, but has the overhead of writing the ETL along with requiring manual mapping of the aggregates into the RPD. This mapping work is done in the TimesTen Load from Oracle method, but can only be used against an Oracle source database and where there is a single physical SQL required to load the aggregate.

Refreshing aggregates using native OBIEE functionality alone

Here I present another alternative method for refreshing Exalytics aggregates, this time using OBIEE functionality alone and remaining close to the OotB method. It is based on Aggregate Persistence but varies in two significant ways:
  1. Incremental refresh of the aggregate is possible
  2. No changes are made to the RPD when the aggregate is refreshed
The method still uses the fundamentals of Aggregate Persistence since, as I mentioned above, it has some very significant benefits:
  • BI Server uses (dare I say, leverages), your existing metadata modelling work which is necessary - regardless of your aggregates - for users to report from the unaggregated data.
  • BI Server generates your aggregate refresh ETL code
  • If your source systems change, your aggregate refresh code doesn’t need to - just as reports are decoupled from the source system through the RPD metadata layers, so are your target aggregates
For us to understand the new method, a bit of background and explanation of the technology is required.

Background, part 1 : Aggregate Persistence - under the covers

When Aggregate Persistence runs, it does several things:
  1. Remove aggregates from physical database and RPD mappings
  2. Create the physical aggregate tables and indexes on the target database, for the fact aggregate and supporting dimensions
  3. Update the RPD Physical and Logical (BMM) layers to include the newly built aggregates
  4. Populate the aggregate tables, from source via the BI Server to the aggregate target (TimesTen)
What we are going to do here is pick apart Aggregate Persistence and invoke just part of it. We don’t need to rebuild the physical tables each time we refresh the data, and we don’t need to touch the RPD. We can actually just tell the BI Server to load the aggregate table, using the results of a Logical SQL query. That is, pretty much the same SQL that would be executed if we ran the aggregate query from an analysis in the OBIEE front end.

The command to tell the BI Server to do this is the populate command, which can be found from close inspection of the nqquery.log during execution of normal Aggregate Persistence:

populate "ag_sales_month" mode ( append table connection pool "TimesTen aggregates"."TT_CP") as
select_business_model "Sales"."Fact Sales"."Sale Amount" as "Sale_Amoun000000AD","Sales"."Dim Times"."Month YYYYMM" as "Month_YYYY000000D0" 
from "Sales";

This populate <table> command can be sent by us directly to the BI Server (in exactly the way that a standard create aggregate Aggregate Persistence script would be, with nqcmd etc.) and causes it to load the specified table (using the specified connection pool) using the logical SQL given. The re-creation of the aggregate tables, and the RPD mapping, doesn't get run.

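The syntax of the populate command is undocumented, but from observing the nqquery.log file it follows the pattern of the statement shown above: populate, the name of the target table, a mode clause that names the target connection pool (append table connection pool ...), and then as followed by a SELECT_BUSINESS_MODEL query. The ag_sales_month statement is a very simple example of this, populating an aggregate that holds a single measure summarised by month.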
SELECT_BUSINESS_MODEL was written about by Venkat here, and is BI Server syntax allowing a query directly against the BMM, rather than the Presentation Layer which Logical SQL usually specifies. You can build and test the SELECT_BUSINESS_MODEL clause in OBIEE directly (from Administration -> Issue SQL), in nqcmd, or just by extracting it from the nqquery.log.
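
As a rough illustration of the pattern, here is a minimal Python sketch that assembles a populate statement from its parts. The helper name build_populate and its parameters are hypothetical and are not part of OBIEE. The statement layout is copied from the nqquery.log example above, and since the syntax is undocumented it should be checked against your own log before use.

```python
def build_populate(table, connection_pool, columns, business_model):
    """Assemble a populate statement in the shape observed in nqquery.log.

    table           -- physical aggregate table name, e.g. ag_sales_month
    connection_pool -- fully quoted "database"."pool" of the target
    columns         -- list of (BMM column expression, physical column alias)
    business_model  -- name of the business model to select from
    """
    # Each BMM column is aliased to the physical column it populates.
    select_list = ",".join(f'{expr} as "{alias}"' for expr, alias in columns)
    return (
        f'populate "{table}" mode ( append table connection pool '
        f'{connection_pool}) as\n'
        f'select_business_model {select_list}\n'
        f'from "{business_model}";'
    )


# Values taken from the example statement in the text.
statement = build_populate(
    table="ag_sales_month",
    connection_pool='"TimesTen aggregates"."TT_CP"',
    columns=[
        ('"Sales"."Fact Sales"."Sale Amount"', "Sale_Amoun000000AD"),
        ('"Sales"."Dim Times"."Month YYYYMM"', "Month_YYYY000000D0"),
    ],
    business_model="Sales",
)
print(statement)
```

The column aliases (Sale_Amoun000000AD and so on) are the physical column names that Aggregate Persistence generated, so in practice they are best copied from nqquery.log rather than typed by hand.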

Background, part 2 : Secret Sauce - INACTIVE_SCHEMAS

So, we have seen how we can take advantage of Aggregate Persistence to tell the BI Server to load an aggregate, from any source we’ve modelled in the RPD, without requiring it to delete the aggregate to start with or modify the RPD in any way.

Now, we need a bit of secret sauce to complete the picture and make this method a viable one.

In side-stepping the full Aggregate Persistence sequence, we have one problem. The Logical SQL that we use in the populate statement is going to be parsed by the BI Server to generate the select statement(s) against the source database. However, the BI Server uses its standard query parsing on it, using the metadata defined. Because the aggregates we are loading are already mapped into the RPD then by default the BI Server will probably try to use the aggregate to satisfy the aggregate populate request (because it will judge it the most efficient LTS) – thus loading data straight from the table that we are trying to populate!

The answer is the magical INACTIVE_SCHEMAS variable. What this does is tell OBIEE to ignore one or more Physical schemas in the RPD and, importantly, any associated Logical Table Sources. INACTIVE_SCHEMAS is documented as part of the Double Buffering functionality. It can be used in any logical SQL statement, so is easily demonstrated in an analysis (using Advanced SQL Clauses -> Prefix):

Forcing an OBIEE query to avoid an LTS, using INACTIVE_SCHEMAS.

So when we specify the populate command to update the aggregate, we just include the necessary INACTIVE_SCHEMAS prefix:

SET VARIABLE INACTIVE_SCHEMAS='"TimesTen Aggregates".."EXALYTICS"': 
populate "ag_sales_month" mode ( append table connection pool 
"TimesTen aggregates"."TT_CP") as  
select_business_model "Sales"."Fact Sales"."Sale Amount" as "Sale_Amoun000000AD","Sales"."Dim Times"."Month YYYYMM" as "Month_YYYY000000D0" 
from "Sales";
Why, you could reasonably ask, is this not necessary in a normal OotB aggregate refresh? For the simple reason that in “vanilla” Aggregate Persistence usage the whole aggregate gets deleted from the RPD before it is rebuilt, and therefore when the aggregate query is executed only the base LTS is enabled in the RPD at that point in time.

The final part of the puzzle - Incremental refresh

So, we have a way of telling BI Server to populate a target aggregate without rebuilding it, and we have the workaround necessary to stop it trying to populate the aggregate from itself. The last bit is making sure that we only load the data we want to. If we execute the populate statement as it stands straight from the nqquery.log of the initial Aggregate Persistence run then we will end up with duplicate data in the target aggregate. So we need to do one of the following :
  1. Truncate the table contents before the populate
  2. Use a predicate in the populate Logical SQL so that only selected data gets loaded
To issue a truncate command, you can use the logical SQL command execute physical to get the BI Server to run a command against the target database, for example:
execute physical connection pool "TimesTen Aggregates"."TT_CP" truncate table ag_sales_month
This truncate/load method is appropriate for refreshing dimension aggregate tables, since there won’t usually be an update key as such. However, when refreshing a fact aggregate it is better for performance to use an incremental update and only load data that has changed. This assumes that you can identify the data and have an update key for it. In this example, I have an aggregate table at Month level, and each time I refresh the aggregate I want to load just data for the current month. In my repository I have a dynamic repository variable called THIS_MONTH. To implement the incremental refresh, I just add the appropriate predicate to the SELECT_BUSINESS_MODEL clause of the populate statement:
select_business_model "Sales"."Fact Sales"."Sale Amount" as "Sale_Amoun000000AD","Sales"."Dim Times"."Month YYYYMM" as "Month_YYYY000000D0" 
from "Sales" 
where "Dim Times"."Month YYYYMM" =  VALUEOF("THIS_MONTH")
Making the completed aggregate refresh command to send to the BI Server:
SET VARIABLE DISABLE_CACHE_HIT=1, DISABLE_CACHE_SEED=1, DISABLE_SUMMARY_STATS_LOGGING=1, 
INACTIVE_SCHEMAS='"TimesTen Aggregates".."EXALYTICS"'; 
populate "ag_sales_month" mode ( append table connection pool 
"TimesTen aggregates"."TT_CP") as  
select_business_model "Sales"."Fact Sales"."Sale Amount" as "Sale_Amoun000000AD","Sales"."Dim Times"."Month YYYYMM" as "Month_YYYY000000D0" 
from "Sales" 
where "Dim Times"."Month YYYYMM" =  VALUEOF("THIS_MONTH");
Since there will be data in the table for the current month, I delete this out first, using execute physical:
execute physical connection pool "TimesTen Aggregates"."TT_CP" delete from ag_sales_month where Month_YYYY000000D0 = VALUEOF(THIS_MONTH);
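
To show how the two statements fit together, here is a minimal Python sketch that writes them out in the order they need to run: the delete first, then the populate. The function name build_incremental_refresh and its parameters are hypothetical. The statement text and the SET VARIABLE prefix are copied from the commands above, not from any documented syntax.

```python
def build_incremental_refresh(table, connection_pool, key_column, variable,
                              select_list, business_model, key_expression,
                              inactive_schema):
    """Return a script that clears the current period, then reloads it."""
    # Step 1: delete the rows that are about to be reloaded, so that the
    # append-mode populate does not create duplicates.
    delete_stmt = (
        f"execute physical connection pool {connection_pool} "
        f"delete from {table} where {key_column} = VALUEOF({variable});"
    )
    # Step 2: populate from the base data only. INACTIVE_SCHEMAS stops the
    # BI Server from reading the aggregate it is loading.
    populate_stmt = (
        "SET VARIABLE DISABLE_CACHE_HIT=1, DISABLE_CACHE_SEED=1, "
        "DISABLE_SUMMARY_STATS_LOGGING=1,\n"
        f"INACTIVE_SCHEMAS='{inactive_schema}';\n"
        f'populate "{table}" mode ( append table connection pool\n'
        f"{connection_pool}) as\n"
        f"select_business_model {select_list}\n"
        f'from "{business_model}"\n'
        f'where {key_expression} = VALUEOF("{variable}");'
    )
    return delete_stmt + "\n" + populate_stmt


# Values taken from the month-level example in the text.
script = build_incremental_refresh(
    table="ag_sales_month",
    connection_pool='"TimesTen Aggregates"."TT_CP"',
    key_column="Month_YYYY000000D0",
    variable="THIS_MONTH",
    select_list=('"Sales"."Fact Sales"."Sale Amount" as "Sale_Amoun000000AD",'
                 '"Sales"."Dim Times"."Month YYYYMM" as "Month_YYYY000000D0"'),
    business_model="Sales",
    key_expression='"Dim Times"."Month YYYYMM"',
    inactive_schema='"TimesTen Aggregates".."EXALYTICS"',
)
print(script)
```

Keeping the delete and the populate driven by the same repository variable means the rows removed and the rows reloaded always cover the same month.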

Step-by-step

The method I have described above is implemented in two parts:
  1. Initial build - only needs doing once
    1. Create Aggregate Persistence scripts as normal (for example, with Summary Advisor)
    2. Execute the Aggregate Persistence script to :
      1. Build the aggregate tables in TimesTen
      2. Map the aggregates in the RPD
    3. Create custom populate scripts:
      1. From nqquery.log, extract the full populate statement for each aggregate (fact and associated dimensions)
      2. Add the INACTIVE_SCHEMAS setting to the populate script, specifying the target TimesTen database and schema.
      3. For incremental refresh, add a WHERE clause to the populate logical SQL so that it only fetches the data that will have changed. Repository variables are useful here for holding date values such as current date, week, etc.
      4. If necessary, build an execute physical script to clear down all or part of the aggregate table. This is run prior to the populate script to ensure you do not load duplicate data
  2. Aggregate refresh - run whenever the base data changes
    1. Optionally, execute the execute physical script to prepare the aggregate table (by deleting whatever data is about to be loaded)
    2. Execute the custom populate script from above. Because the aggregates are being built directly from the base data (as enforced by INACTIVE_SCHEMAS), the refresh scripts for multiple aggregates could potentially be run in parallel (e.g. using xargs). A corollary of this is that the method could put additional load on the source database, because it will be hitting it for every aggregate, whereas vanilla Aggregate Persistence will build aggregates from existing lower-level aggregates if it can.
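
The refresh step could be driven from a small wrapper such as the Python sketch below, offered as an alternative to xargs. It assumes one script file per aggregate, each holding the execute physical clear-down followed by the populate statement, and it assumes nqcmd is on the PATH. The DSN, user name, password and file names are placeholders, and the function names nqcmd_args and run_refresh are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def nqcmd_args(dsn, user, password, script_path):
    """Build the nqcmd command line for running one script file."""
    return ["nqcmd", "-d", dsn, "-u", user, "-p", password, "-s", script_path]


def run_refresh(script_paths, dsn, user, password, max_workers=4):
    """Run each aggregate's refresh script in parallel.

    Returns a dict of script path -> nqcmd exit status. The exit status
    alone may not tell you whether every statement succeeded, so the
    nqcmd output and nqquery.log are worth checking too.
    """
    def run_one(path):
        result = subprocess.run(
            nqcmd_args(dsn, user, password, path),
            capture_output=True,
            text=True,
        )
        return path, result.returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, script_paths))


# Show the command that would be issued for one (placeholder) script.
example = nqcmd_args("AnalyticsWeb", "weblogic", "secret",
                     "refresh_ag_sales_month.lsql")
print(" ".join(example))
```

Calling run_refresh with a list of script files then runs up to max_workers of them at once. Keeping max_workers low is one way to limit the extra load on the source database mentioned above.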

Summary

This method is completely valid for use outside of Exalytics too, since only the Summary Advisor is licensed separately. Aggregate Persistence itself is standard OBIEE functionality. For an Exalytics deployment where aggregate definitions and requirements change rapidly, this method would be less appropriate because of the additional work required to modify the scripts. However, for an Exalytics deployment where aggregates change less frequently, it could be very useful.

The approach is not without drawbacks. Maintaining a set of custom populate commands has an overhead (although arguably no more so than a set of Aggregate Persistence scripts), and the flexibility comes at the cost of putting the onus of data validity on the developer. If an aggregate table is omitted from the refresh (for example, a supporting aggregate dimension table) then reports will show erroneous data.

The benefit of this approach is that aggregates can be rapidly built and maintained in a sensible manner. The RPD is modified only in the first step, the initial build. It is then left entirely untouched. This makes refreshes faster and safer; if a refresh fails there is just the data to tidy up, not the RPD too.