Analytic Views as OWB Data Source
March 3rd, 2009 by Peter Scott
Readers of my old blog – much now ported here to the Rittman Mead blog may remember my fondness for things SQL (rather than PL/SQL) and analytic functions in particular. It is not that I can’t “do” PL/SQL it is more that I think there is sometimes a better way. Recently, I came across one such case in an Oracle Warehouse Builder project.
OWB is lovely tool and can produce some really nice SQL and visual mapping diagrams that are so easy to follow: “Take these columns from this table and those columns from another table and join them together on these columns and insert the results into that table“; and simple transforms that can be represented by a CASE construct or SQL function are so easy to add using the expression operator.
The thing I am not too happy with is the adding of PL/SQL processes to the map as either a pre-map or post-map operator (yes, I know that is not their purpose, but I have seen them used like that) or as a operator in the mapping. This can hide the purpose of the map and means that you need to jump out of the OWB design tool and look at the source of the process to find what is going on. On this project the customer had three data sources for one of the dimensions, each with its own set of attributes and effective dates for its attributes. The need was to construct a single attribute history table (similar to a SCD type 2 table) from the three sources. The existing code used a very complex cursor based package to “fill in the nulls” on the target table.
I always go back to “what are we trying to achieve?” when looking at refactoring a process – getting hung-up on optimising the “as now” can miss so many opportunities for other approaches. In this case I saw the opportunity of replacing the “build a table with nulls for missing data then update it to fill in the blanks” method with an analytic view within the database as the OWB source for a simple insert into target OWB mapping.
As a simplified example consider three tables: A,B and C each with a common key column, some attributes and an effective date for the attribute set: A with columns PK, A1, A2, A3, DATE_EFFECTIVE_A; B with columns PK, B1, B2, DATE_EFFECTIVE_B; C with columns PK, C1, C2, DATE_EFFECTIVE_C. We need to create a table, ATTRIBUTE_HISTORY with columns: PK, A1,A2,A3,B1,B2,C1,C2, DATE_EFFECTIVE where the attribute columns report the active attribute values current at DATE_EFFECTIVE
This can be achieved using a full outer join between the three tables and using the analytic function LAST_VALUE(). LAST_VALUE() is one of those often overlooked analytic functions that seems to have little real purpose until you realise that the option to IGNORE NULLS allows you to “copy down” the last non-null value and the windowing clause allows you to only consider rows “before” the current one. The other SQL goody to use in this view is the COALESCE function on the three possible date effective columns to return the non-null date_effective and similarly to return the first non-null value of PK.
So we get
CREATE OR REPLACE VIEW my_view ASSELECT pk ,last_value(a1 ignore nulls) over (partition BY pk order by date_effective) AS A1,last_value(a2 ignore nulls) over (partition BY pk order by date_effective) AS A2, last_value(a3 ignore nulls) over (partition BY pk order by date_effective) AS A3, last_value(b1 ignore nulls) over (partition BY pk order by date_effective) AS B1, last_value(b2 ignore nulls) over (partition BY pk order by date_effective) AS B2, last_value(c1 ignore nulls) over (partition BY pk order by date_effective) AS C1, last_value(c2 ignore nulls) over (partition BY pk order by date_effective) AS C2, date_effective FROM (SELECT COALESCE(a.pk, b.pk,c.pk) AS PK, a.a1,a.a2,a.a3 , b.b1,b.b2 , c.c1,c.c2 , COALESCE(a.date_effective, b.date_effective,c.date_effective) AS date_effective FROM a FULL OUTER JOIN b ON a.pk = b.pk AND a.date_effective = b.date_effective FULL OUTER JOIN c ON c.pk = NVL(a.pk, b.pk) AND c.date_effective = NVL(a.date_effective,b.date_effective) );
In the absence of a windowing clause on the analytic function it defaults to ROWS UNBOUNDED PRECEDING – which is, of course, just what we need here. We just expose this view to OWB and use it as the data source for a straight insert into the target table
There are changes we need to add to the view to use it in a real data warehouse – for example we need to get the current set of attributes from the target table to be the intial values for the columns and we need to code for default values for truely new value of PK, but these are not too complicated to do.
For me the test of the new approach is peformance and in this case the set based operations (even with two full outer joins) was far better than the iterative PL/SQL route it replaced.
Edit to include example query plans
As with any development or test system the plans I obtain may not be representative of “real world” plans. I only use a few rows in my tests, production examples of this technique would be likely to involve a few thousand rows.
Plan with no indexes
---------------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |---------------------------------------------------------------------------------| 0 | SELECT STATEMENT | | 9 | 1350 | 34 (12)| 00:00:01 |
| 1 | WINDOW SORT | | 9 | 1350 | 34 (12)| 00:00:01 |
| 2 | VIEW | | 9 | 1350 | 33 (10)| 00:00:01 |
| 3 | UNION-ALL | | | | | |
|* 4 | HASH JOIN OUTER | | 7 | 1141 | 17 (12)| 00:00:01 |
| 5 | VIEW | | 7 | 1036 | 13 (8)| 00:00:01 |
| 6 | UNION-ALL | | | | | |
|* 7 | HASH JOIN OUTER | | 4 | 136 | 7 (15)| 00:00:01 |
| 8 | TABLE ACCESS FULL| A | 4 | 72 | 3 (0)| 00:00:01 |
| 9 | TABLE ACCESS FULL| B | 4 | 64 | 3 (0)| 00:00:01 |
|* 10 | HASH JOIN ANTI | | 3 | 81 | 7 (15)| 00:00:01 |
| 11 | TABLE ACCESS FULL| B | 4 | 64 | 3 (0)| 00:00:01 |
| 12 | TABLE ACCESS FULL| A | 4 | 44 | 3 (0)| 00:00:01 |
| 13 | TABLE ACCESS FULL | C | 3 | 45 | 3 (0)| 00:00:01 |
|* 14 | HASH JOIN ANTI | | 2 | 118 | 17 (12)| 00:00:01 |
| 15 | TABLE ACCESS FULL | C | 3 | 45 | 3 (0)| 00:00:01 |
| 16 | VIEW | | 7 | 308 | 13 (8)| 00:00:01 |
| 17 | UNION-ALL | | | | | |
|* 18 | HASH JOIN OUTER | | 4 | 88 | 7 (15)| 00:00:01 |
| 19 | TABLE ACCESS FULL| A | 4 | 44 | 3 (0)| 00:00:01 |
| 20 | TABLE ACCESS FULL| B | 4 | 44 | 3 (0)| 00:00:01 |
|* 21 | HASH JOIN ANTI | | 3 | 66 | 7 (15)| 00:00:01 |
| 22 | TABLE ACCESS FULL| B | 4 | 44 | 3 (0)| 00:00:01 |
| 23 | TABLE ACCESS FULL| A | 4 | 44 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------------Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("C"."DATE_EFFECTIVE"(+)=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE") AND "C"."PK"(+)=NVL("A"."PK","B"."PK"))
7 - access("A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE"(+) AND "A"."PK"="B"."PK"(+))
10 - access("A"."PK"="B"."PK" AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE")
14 - access("C"."PK"=NVL("A"."PK","B"."PK") AND "C"."DATE_EFFECTIVE"=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE"))
18 - access("A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE"(+) AND "A"."PK"="B"."PK"(+))
21 - access("A"."PK"="B"."PK" AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE")
With two-column indexes on the PK and Date_effective columns of each of the three tables the plan changes to use nested loops on the indexes.
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9 | 1350 | 19 (16)| 00:00:01 |
| 1 | WINDOW SORT | | 9 | 1350 | 19 (16)| 00:00:01 |
| 2 | VIEW | | 9 | 1350 | 18 (12)| 00:00:01 |
| 3 | UNION-ALL | | | | | |
|* 4 | HASH JOIN OUTER | | 7 | 1141 | 13 (8)| 00:00:01 |
| 5 | VIEW | | 7 | 1036 | 10 (10)| 00:00:01 |
| 6 | UNION-ALL | | | | | |
|* 7 | HASH JOIN OUTER | | 4 | 136 | 7 (15)| 00:00:01 |
| 8 | TABLE ACCESS FULL | A | 4 | 72 | 3 (0)| 00:00:01 |
| 9 | TABLE ACCESS FULL | B | 4 | 64 | 3 (0)| 00:00:01 |
| 10 | NESTED LOOPS ANTI | | 3 | 81 | 3 (0)| 00:00:01 |
| 11 | TABLE ACCESS FULL | B | 4 | 64 | 3 (0)| 00:00:01 |
|* 12 | INDEX RANGE SCAN | INDEX1 | 1 | 11 | 0 (0)| 00:00:01 |
| 13 | TABLE ACCESS FULL | C | 3 | 45 | 3 (0)| 00:00:01 |
| 14 | MERGE JOIN ANTI | | 2 | 118 | 5 (20)| 00:00:01 |
| 15 | TABLE ACCESS BY INDEX ROWID| C | 3 | 45 | 2 (0)| 00:00:01 |
| 16 | INDEX FULL SCAN | INDEX3 | 3 | | 1 (0)| 00:00:01 |
|* 17 | SORT UNIQUE | | 7 | 308 | 3 (34)| 00:00:01 |
| 18 | VIEW | | 7 | 308 | 2 (0)| 00:00:01 |
| 19 | UNION-ALL | | | | | |
| 20 | NESTED LOOPS OUTER | | 4 | 88 | 1 (0)| 00:00:01 |
| 21 | INDEX FULL SCAN | INDEX1 | 4 | 44 | 1 (0)| 00:00:01 |
|* 22 | INDEX RANGE SCAN | INDEX2 | 1 | 11 | 0 (0)| 00:00:01 |
| 23 | NESTED LOOPS ANTI | | 3 | 66 | 1 (0)| 00:00:01 |
| 24 | INDEX FULL SCAN | INDEX2 | 4 | 44 | 1 (0)| 00:00:01 |
|* 25 | INDEX RANGE SCAN | INDEX1 | 1 | 11 | 0 (0)| 00:00:01 |
------------------------------------------------------------------------------------------Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("C"."DATE_EFFECTIVE(+)=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE") AND "C"."PK"(+)=NVL("A"."PK","B"."PK"))
7 - access("A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE"(+) AND "A"."PK"="B"."PK"(+))
12 - access("A"."PK"="B"."PK" AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE")
17 - access("C"."PK"=NVL("A"."PK","B"."PK") AND "C"."DATE_EFFECTIVE"=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE"))
filter("C"."DATE_EFFECTIVE"=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE") AND "C"."PK"=NVL("A"."PK","B"."PK"))
22 - access("A"."PK"="B"."PK"(+) AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE"(+))
25 - access("A"."PK"="B"."PK" AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE")
With any plan you need consider whether it is the appropriate way to access your data – keep an open mind; full table scans or nested loops are not always bad!

March 3rd, 2009 at 5:54 pm
Nice. I agree that at some points when you have to step outside the bounds of the tool, having experience and good knowledge of things like analytic functions like last_value and coalesce are invaluable. I at first also wondered when I would ever use such things….until I couldn’t find an easy way to solve such problems.
March 4th, 2009 at 8:19 am
Very inspiring example!
I’m afraid that because of the complex join condition for table C the optimizer might opt for a nested loop join – which would be slow if C is big. Do you have a possibility to post the execution plan?
March 4th, 2009 at 9:50 am
@Todor
Typically, for this use, the number of rows in each table will be small (low thousands) – we are looking at new attribute changes since the last data load.
If in testing we find that table order is significant we could change the query to give the best performance My ordering was arbitrary.
With my very small test data set I had hash joins in the query plan using unindexed tables A,B & C, adding a two-column index on the PK and Date_effective columns of each source tables changed the plan to include nested loop joins on the indexes.
I will edit the post to include the plans – but my sample data set is small so the plans may not be realistic for “real” data.
With any refactoring you must always compare before and after: are the results of the new method correct, is the new method easier to maintain and does it run at least as well as the old way – we aim for better.
March 4th, 2009 at 2:09 pm
Thank you! I see that hash join is used.
The FOJ of 3 tables doesn’t look very good in the execution plan – tables A,B are full scanned 4 times (in the non-index case). But this should not be an issue with small tables like yours.
I suppose it would look much better with the native FOJ in 11g.
May 7th, 2009 at 1:30 pm
Brilliant analytics always have a set based solution to row based processing. It is usually more performant this way but only as long as the set isn’t to complex to derive.
One thing I did note recently is don’t put a case statement in a where clause. Its clever but very poor in executing!
May 7th, 2009 at 2:57 pm
@Tim
Sometimes a CASE in a where clause can give improve performance – for example it may allow an indexed path to be chosen – I have an old example here
May 8th, 2009 at 8:08 am
Thanks for the advice. It just goes to show that you need to understand the detail and that simplistic tuning is a contradiction. Burleson would have you apply certain tuning rules and see what happens however yourself and kite understanding the detail and would apply solutions to each problem.
I will try case in where clauses again having eradicating that simplistic rule from my mind.
May 8th, 2009 at 8:24 am
@Tim
As you suggest there no “rules that always work” in tuning; what works well with the specifics of your data set may be dreadful for someone else. If there were rules then they would have been built into the database and tuning would not be needed :-)
The CASE was a trick to force the use of a bitmap index access on a single value and then to select the rows not selected… it does not always work :-)
The way I approach tuning is no more (or less) valid than that of any other consultant – I only work with Oracle data warehouses these days so I tend to concentrate on techniques that are more relevant to the world I see
July 2nd, 2009 at 7:54 am
Hi Peter,
nice example! I am just wondering why you say ‘We just expose this view to OWB’? All of the operations (analytical functions, full outer join) can be done within an OWB mapping. (At least when using OWB 10gR2 or above – OWB 10gR1 and earlier versions do not support full outer joins)