Cube, Rollup, Grouping Sets and Grouping IDs

September 10th, 2008 by Mark Rittman

If you’ve kept up with the various new data warehousing features in Oracle over the past few years, you might have noticed extensions to the GROUP BY clause such as ROLLUP, CUBE, GROUPING SETS and so on. If like me you’ve glanced at these features and mentally noted them as being potentially useful, you might have been put off by the fact that they look fiendishly complicated to use. Like me you might also have questioned whether SQL query tools such as OBIEE would ever be able to use them, given that not many query tools generate SQL that uses these extensions. So what are they actually used for, can they ever benefit “regular” query tools such as OBIEE, and in what circumstances can they be used to make aggregation in your data warehouse more efficient?

If you think of the data in your relational star schema as being a “logical cube”, with its measures potentially being available across all levels of aggregation, the actual data stored in your relational star schema is often a small fraction of this, being based on the detail-level data extracted from your source systems. If you wanted to report on data at a higher level of aggregation than this stored, detail-level view, you would create a SELECT statement that listed out the columns to aggregate by, the measures to aggregate along with aggregation functions, an optional filter to restrict down the column values that aggregates are calculated for, and a GROUP BY clause to tell Oracle what columns to group (aggregate) the data on. This would then tell Oracle to display these specific aggregations in your report, with the aggregated data being dynamically calculated at run-time.

To take an example run against the SH schema on an Oracle 11g (11.1) database, the following SELECT statement aggregates amount_sold by channel, month and country ISO code and returns results for a selection of these column values:

SQL> SELECT
  2          ch.channel_desc,
  3          t.calendar_month_desc,
  4          co.country_iso_code,
  5          SUM(s.amount_sold) sum_amount_sold
  6  FROM
  7          sales s,
  8          customers cu,
  9          times t,
 10          channels ch,
 11          countries co
 12  WHERE
 13          s.time_id=t.time_id AND
 14          s.cust_id=cu.cust_id AND
 15          cu.country_id = co.country_id AND
 16          s.channel_id = ch.channel_id AND
 17          ch.channel_desc IN ('Direct Sales', 'Internet') AND
 18          t.calendar_month_desc IN ('2001-09', '2000-10') AND
 19          co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21          ch.channel_desc,
 22          t.calendar_month_desc,
 23          co.country_iso_code;

CHANNEL_DESC         CALENDAR CO SUM_AMOUNT_SOLD
-------------------- -------- -- ---------------
Direct Sales         2000-10  US       682296.59
Internet             2000-10  US       137054.29
Internet             2001-09  GB        36806.73
Direct Sales         2000-10  GB        91925.43
Internet             2001-09  US       299621.96
Direct Sales         2001-09  US       621197.94
Internet             2000-10  GB        14539.14
Direct Sales         2001-09  GB        92865.04

8 rows selected.

So far so good. But what if you wanted to generate subtotals for the columns that you are filtering on, plus perhaps a grand total for all columns? You could put the results of the first GROUP BY into a temporary table and then aggregate the results again, or you could potentially base one SELECT statement on the results of another; or, you could use the ROLLUP extension to GROUP BY to calculate these subtotals and grand totals automatically.

SQL> SELECT
  2          ch.channel_desc,
  3          t.calendar_month_desc,
  4          co.country_iso_code,
  5          SUM(s.amount_sold) sum_amount_sold
  6  FROM
  7          sales s,
  8          customers cu,
  9          times t,
 10          channels ch,
 11          countries co
 12  WHERE
 13          s.time_id=t.time_id AND
 14          s.cust_id=cu.cust_id AND
 15          cu.country_id = co.country_id AND
 16          s.channel_id = ch.channel_id AND
 17          ch.channel_desc IN ('Direct Sales', 'Internet') AND
 18          t.calendar_month_desc IN ('2001-09', '2000-10') AND
 19          co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21          ROLLUP(
 22                  ch.channel_desc,
 23                  t.calendar_month_desc,
 24                  co.country_iso_code);

CHANNEL_DESC         CALENDAR CO SUM_AMOUNT_SOLD
-------------------- -------- -- ---------------
Internet             2000-10  GB        14539.14
Internet             2000-10  US       137054.29
Internet             2000-10           151593.43
Internet             2001-09  GB        36806.73
Internet             2001-09  US       299621.96
Internet             2001-09           336428.69
Internet                               488022.12
Direct Sales         2000-10  GB        91925.43
Direct Sales         2000-10  US       682296.59
Direct Sales         2000-10           774222.02
Direct Sales         2001-09  GB        92865.04
Direct Sales         2001-09  US       621197.94
Direct Sales         2001-09           714062.98
Direct Sales                             1488285
                                      1976307.12

15 rows selected.

This is called “full rollup” and generates subtotals for all the columns listed in the ROLLUP() clause. You can also use the ROLLUP() clause to perform partial rollups, with columns being added to the GROUP BY clause and excluded from the ROLLUP() clause to stop them from being subtotalled.

SQL> SELECT
  2          ch.channel_desc,
  3          t.calendar_month_desc,
  4          co.country_iso_code,
  5          SUM(s.amount_sold) sum_amount_sold
  6  FROM
  7          sales s,
  8          customers cu,
  9          times t,
 10          channels ch,
 11          countries co
 12  WHERE
 13          s.time_id=t.time_id AND
 14          s.cust_id=cu.cust_id AND
 15          cu.country_id = co.country_id AND
 16          s.channel_id = ch.channel_id AND
 17          ch.channel_desc IN ('Direct Sales', 'Internet') AND
 18          t.calendar_month_desc IN ('2001-09', '2000-10') AND
 19          co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21          ch.channel_desc,
 22          ROLLUP(
 23                  t.calendar_month_desc,
 24                  co.country_iso_code);

CHANNEL_DESC         CALENDAR CO SUM_AMOUNT_SOLD
-------------------- -------- -- ---------------
Internet             2000-10  GB        14539.14
Internet             2000-10  US       137054.29
Internet             2000-10           151593.43
Internet             2001-09  GB        36806.73
Internet             2001-09  US       299621.96
Internet             2001-09           336428.69
Internet                               488022.12
Direct Sales         2000-10  GB        91925.43
Direct Sales         2000-10  US       682296.59
Direct Sales         2000-10           774222.02
Direct Sales         2001-09  GB        92865.04
Direct Sales         2001-09  US       621197.94
Direct Sales         2001-09           714062.98
Direct Sales                             1488285

14 rows selected.

Notice how this partial rollup returns one less row, as it excludes the subtotal of amount sold by the internet channel and the direct sales channel combined , i.e. it excludes the channel subtotal.

A more typical use of ROLLUP would be to produce subtotals and grand totals across a single dimension hierarchy, rolling up sales for example by month, quarter and year within a single SELECT statement. Note how in the results below the amount sold gets aggregated by month first, then by all the months in each quarter, and then across all months and quarters in the year.

SQL> SELECT
  2     t.calendar_month_desc "MONTH",
  3     t.calendar_quarter_desc "QUARTER",
  4          t.calendar_year "YEAR",
  5     SUM(s.amount_sold) sum_amount_sold
  6  FROM
  7     sales s,
  8     customers cu,
  9     times t,
 10     channels ch,
 11     countries co
 12  WHERE
 13     s.time_id=t.time_id AND
 14     s.cust_id=cu.cust_id AND
 15     cu.country_id = co.country_id AND
 16     s.channel_id = ch.channel_id AND
 17     ch.channel_desc IN ('Direct Sales', 'Internet') AND
 18     t.calendar_year = 1999 AND
 19     co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21     ROLLUP(
 22             t.calendar_year,
 23             t.calendar_quarter_desc,
 24             t.calendar_month_desc);

MONTH    QUARTER       YEAR SUM_AMOUNT_SOLD
-------- ------- ---------- ---------------
1999-01  1999-01       1999       974627.95
1999-02  1999-01       1999      1089255.92
1999-03  1999-01       1999        754026.7
         1999-01       1999      2817910.57
1999-04  1999-02       1999       708060.57
1999-05  1999-02       1999       818055.52
1999-06  1999-02       1999       729677.52
         1999-02       1999      2255793.61
1999-07  1999-03       1999       893452.47
1999-08  1999-03       1999       883460.92
1999-09  1999-03       1999       923577.01
         1999-03       1999       2700490.4
1999-10  1999-04       1999       715831.36
1999-11  1999-04       1999       742248.42
1999-12  1999-04       1999       841572.17
         1999-04       1999      2299651.95
                       1999      10073846.5
                                 10073846.5

18 rows selected.

To describe the ROLLUP feature in more detail then, the full rollup returns rows representing the regular aggregations you’d get from a GROUP BY, subtotals for each column in the ROLLUP column list with the last column listed repeating most frequently, and a grand total for all columns in the ROLLUP list. Partial ROLLUP does the same but excludes the colum listed in the GROUP BY clause but excluded from the ROLLUP clause.

So ROLLUP is useful for creating subtotals and totals for the columns that your aggregating by in a SELECT statement, with NULLs in the listed column values indicating that the aggregated value is a subtotal or grand total. Not too difficult a concept to understand, but where then does the CUBE extension to GROUP BY come in?

What the CUBE extension to GROUP BY does it to calculate every possible aggregate permutation of the columns listed in the CUBE clause. This is particularly useful when you want to get hold of a whole set of aggregations in a single go, perhaps to pre-populate a summary table or as the input into a further reporting process. To take the example I used earlier on where I first GROUPed by channel, country and month, and then selected the same data and ran it through the ROLLUP extension, doing the same but with the CUBE extension to GROUP BY brings back the following results set:

SQL> SELECT
  2          ch.channel_desc,
  3          t.calendar_month_desc,
  4          co.country_iso_code,
  5          SUM(s.amount_sold) sum_amount_sold
  6  FROM
  7          sales s,
  8          customers cu,
  9          times t,
 10          channels ch,
 11          countries co
 12  WHERE
 13          s.time_id=t.time_id AND
 14          s.cust_id=cu.cust_id AND
 15          cu.country_id = co.country_id AND
 16          s.channel_id = ch.channel_id AND
 17          ch.channel_desc IN ('Direct Sales', 'Internet') AND
 18          t.calendar_month_desc IN ('2001-09', '2001-10') AND
 19          co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21          CUBE(
 22                  ch.channel_desc,
 23                  t.calendar_month_desc,
 24                  co.country_iso_code);

CHANNEL_DESC         CALENDAR CO SUM_AMOUNT_SOLD
-------------------- -------- -- ---------------
                                      2117845.22
                              GB       243978.97
                              US      1873866.25
                     2001-09          1050491.67
                     2001-09  GB       129671.77
                     2001-09  US        920819.9
                     2001-10          1067353.55
                     2001-10  GB        114307.2
                     2001-10  US       953046.35
Internet                                  761766
Internet                      GB        75817.49
Internet                      US       685948.51
Internet             2001-09           336428.69
Internet             2001-09  GB        36806.73
Internet             2001-09  US       299621.96
Internet             2001-10           425337.31
Internet             2001-10  GB        39010.76
Internet             2001-10  US       386326.55
Direct Sales                          1356079.22
Direct Sales                  GB       168161.48
Direct Sales                  US      1187917.74
Direct Sales         2001-09           714062.98
Direct Sales         2001-09  GB        92865.04
Direct Sales         2001-09  US       621197.94
Direct Sales         2001-10           642016.24
Direct Sales         2001-10  GB        75296.44
Direct Sales         2001-10  US        566719.8

27 rows selected.

So you can see in the SQL output that there’s a lot more NULL values in the dimension columns, indicating the totals and subtotals generated by adding CUBE to the GROUP BY clause. So ROLLUP will create subtotals and totals for dimension values individually, whereas CUBE will create further aggregate permutations by grouping in this case two of the columns together and then calculating the subtotal against just those.

Where we are then is that ROLLUP effectively creates a certain set of subtotals, CUBE goes one further and generates even more permutations. ROLLUP is typically used to aggregate up a hierarchy whilst CUBE is used to generate all the aggregations for a selection of columns. If you’re particularly keen on this and want to generate multiple ROLLUPs per SELECT statement, you can even create concatenated ROLLUPs in Oracle 10g and higher.

SQL> SELECT
  2          ch.channel_total,
  3          ch.channel_class,
  4          co.country_region,
  5          co.country_iso_code,
  6          SUM(s.amount_sold) sum_amount_sold
  7  FROM
  8          sales s,
  9          customers cu,
 10          times t,
 11          channels ch,
 12          countries co
 13  WHERE
 14          s.time_id=t.time_id AND
 15          s.cust_id=cu.cust_id AND
 16          cu.country_id = co.country_id AND
 17          s.channel_id = ch.channel_id AND
 18          t.calendar_month_desc IN ('2001-09', '2001-10') AND
 19          co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21          ROLLUP(
 22                  ch.channel_total,
 23                  ch.channel_class),
 24          ROLLUP(
 25                  co.country_region,
 26                  co.country_iso_code);

CHANNEL_TOTAL CHANNEL_CLASS        COUNTRY_REGION       CO SUM_AMOUNT_SOLD
------------- -------------------- -------------------- -- ---------------
                                   Europe               GB       321244.43
                                   Europe                        321244.43
                                   Americas             US      2603472.57
                                   Americas                     2603472.57
                                                                   2924717
Channel total                      Europe               GB       321244.43
Channel total                      Europe                        321244.43
Channel total                      Americas             US      2603472.57
Channel total                      Americas                     2603472.57
Channel total                                                      2924717
Channel total Direct               Europe               GB       168161.48
Channel total Direct               Europe                        168161.48
Channel total Direct               Americas             US      1187917.74
Channel total Direct               Americas                     1187917.74
Channel total Direct                                            1356079.22
Channel total Others               Europe               GB        77265.46
Channel total Others               Europe                         77265.46
Channel total Others               Americas             US       729606.32
Channel total Others               Americas                      729606.32
Channel total Others                                             806871.78
Channel total Indirect             Europe               GB        75817.49
Channel total Indirect             Europe                         75817.49
Channel total Indirect             Americas             US       685948.51
Channel total Indirect             Americas                      685948.51
Channel total Indirect                                              761766

25 rows selected.

In the same vein, you can include all dimensions and all levels within a CUBE clause and calculate every possible aggregation in a logical dataset, although the output from this is likely to get pretty large if you cube has several dimension with complex hierearchies.

One way that you can make the output from ROLLUP and CUBE queries a bit more readable, is to use the GROUPING function to return a “1″, which you can then transform using a further DECODE function, to replace the NULLs that CUBE and ROLLUP return.

SQL> SELECT
  2     DECODE(GROUPING(ch.channel_desc),1,'All Channels',ch.channel_desc),
  3     DECODE(GROUPING(t.calendar_month_desc),1,'All Months',t.calendar_month_desc),
  4     DECODE(GROUPING(co.country_iso_code),1,'All Countries',co.country_iso_code),
  5     SUM(s.amount_sold) sum_amount_sold
  6  FROM
  7     sales s,
  8     customers cu,
  9     times t,
 10     channels ch,
 11     countries co
 12  WHERE
 13     s.time_id=t.time_id AND
 14     s.cust_id=cu.cust_id AND
 15     cu.country_id = co.country_id AND
 16     s.channel_id = ch.channel_id AND
 17     ch.channel_desc IN ('Direct Sales', 'Internet') AND
 18     t.calendar_month_desc IN ('2001-09', '2000-10') AND
 19     co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21     ROLLUP(
 22             ch.channel_desc,
 23             t.calendar_month_desc,
 24             co.country_iso_code);

DECODE(GROUPING(CH.C DECODE(GRO DECODE(GROUPI SUM_AMOUNT_SOLD
-------------------- ---------- ------------- ---------------
Internet             2000-10    GB                   14539.14
Internet             2000-10    US                  137054.29
Internet             2000-10    All Countries       151593.43
Internet             2001-09    GB                   36806.73
Internet             2001-09    US                  299621.96
Internet             2001-09    All Countries       336428.69
Internet             All Months All Countries       488022.12
Direct Sales         2000-10    GB                   91925.43
Direct Sales         2000-10    US                  682296.59
Direct Sales         2000-10    All Countries       774222.02
Direct Sales         2001-09    GB                   92865.04
Direct Sales         2001-09    US                  621197.94
Direct Sales         2001-09    All Countries       714062.98
Direct Sales         All Months All Countries         1488285
All Channels         All Months All Countries      1976307.12

15 rows selected.

Still with me? Good. The final part of this puzzle is around three more extensions called GROUPING_ID, GROUPING SETS and GROUP_ID.

GROUPING_ID is something you can attach to a SELECT column list to generate a predictable “bit vector ID” for each aggregation produced by a ROLLUP or CUBE, like this:

SQL> SELECT
  2          ch.channel_desc,
  3          t.calendar_month_desc,
  4          co.country_iso_code,
  5          SUM(s.amount_sold) sum_amount_sold,
  6          GROUPING_ID(
  7                  ch.channel_desc,
  8                  t.calendar_month_desc,
  9                  co.country_iso_code) grouping_id
 10  FROM
 11          sales s,
 12          customers cu,
 13          times t,
 14          channels ch,
 15          countries co
 16  WHERE
 17          s.time_id=t.time_id AND
 18          s.cust_id=cu.cust_id AND
 19          cu.country_id = co.country_id AND
 20          s.channel_id = ch.channel_id AND
 21          ch.channel_desc IN ('Direct Sales', 'Internet') AND
 22          t.calendar_month_desc IN ('2001-09', '2001-10') AND
 23          co.country_iso_code IN ('GB', 'US')
 24  GROUP BY
 25          ROLLUP(
 26                  ch.channel_desc,
 27                  t.calendar_month_desc,
 28                  co.country_iso_code);

CHANNEL_DESC         CALENDAR CO SUM_AMOUNT_SOLD GROUPING_ID
-------------------- -------- -- --------------- -----------
Internet             2001-09  GB        36806.73           0
Internet             2001-09  US       299621.96           0
Internet             2001-09           336428.69           1
Internet             2001-10  GB        39010.76           0
Internet             2001-10  US       386326.55           0
Internet             2001-10           425337.31           1
Internet                                  761766           3
Direct Sales         2001-09  GB        92865.04           0
Direct Sales         2001-09  US       621197.94           0
Direct Sales         2001-09           714062.98           1
Direct Sales         2001-10  GB        75296.44           0
Direct Sales         2001-10  US        566719.8           0
Direct Sales         2001-10           642016.24           1
Direct Sales                          1356079.22           3
                                      2117845.22           7

15 rows selected.

The GROUPING ID is based on a bit vector algorithm, that means you can predict what value GROUPING_ID will be generated for each aggregation. To take the example above, where we are rolling up channel, month and country, the values of GROUPING_ID would come out as follows:

Aggregation                               Bit Vector    GROUPING_ID
-----------------                         ------------- -----------
Channel, Month, Country                   0 0 0         0
Channel, Month, All Countries             0 0 1         1
Channel, All Months, All Countries        0 1 1         3
All Channels, All Months, All Countries   1 1 1         7

Now once you know what values will get generated for GROUPING_ID, you can then add a HAVING clause to your GROUP BY and just retrieve the aggregation you want from your ROLLUP, like this:

SQL> SELECT
  2          ch.channel_desc,
  3          t.calendar_month_desc,
  4          co.country_iso_code,
  5          SUM(s.amount_sold) sum_amount_sold,
  6          GROUPING_ID(
  7                  ch.channel_desc,
  8                  t.calendar_month_desc,
  9                  co.country_iso_code) grouping_id
 10  FROM
 11          sales s,
 12          customers cu,
 13          times t,
 14          channels ch,
 15          countries co
 16  WHERE
 17          s.time_id=t.time_id AND
 18          s.cust_id=cu.cust_id AND
 19          cu.country_id = co.country_id AND
 20          s.channel_id = ch.channel_id AND
 21          ch.channel_desc IN ('Direct Sales', 'Internet') AND
 22          t.calendar_month_desc IN ('2001-09', '2001-10') AND
 23          co.country_iso_code IN ('GB', 'US')
 24  GROUP BY
 25          ROLLUP(
 26                  ch.channel_desc,
 27                  t.calendar_month_desc,
 28                  co.country_iso_code)
 29  HAVING
 30          GROUPING_ID(
 31                  ch.channel_desc,
 32                  t.calendar_month_desc,
 33                  co.country_iso_code) = 7
 34  ;

CHANNEL_DESC         CALENDAR CO SUM_AMOUNT_SOLD GROUPING_ID
-------------------- -------- -- --------------- -----------
                                      2117845.22           7

Do you see where this is going now? It’s all well and good being able to generate multiple levels of aggregation in a query, but typically you only want one of them, and using CUBE, ROLLUP and GROUPING_ID allows you to zero-in on just the aggregation you are looking for, which is interesting if you consider the fact that with this vectoring now possible, you could potentially store a whole bunch of aggregates in one materialized view, retrieving the one you want at runtime by the use of this vector..

GROUPING SETS is another way of generating multiple aggregates, this time allowing you to generate sets of arbitrary aggregations (including aggregations created using ROLLUP, for example) with each set of aggregations then combined into a single result set using UNION ALL.

SQL> SELECT
  2     ch.channel_desc,
  3     t.calendar_month_desc,
  4     co.country_iso_code,
  5     SUM(s.amount_sold) sum_amount_sold
  6  FROM
  7     sales s,
  8     customers cu,
  9     times t,
 10     channels ch,
 11     countries co
 12  WHERE
 13     s.time_id=t.time_id AND
 14     s.cust_id=cu.cust_id AND
 15     cu.country_id = co.country_id AND
 16     s.channel_id = ch.channel_id AND
 17     ch.channel_desc IN ('Direct Sales', 'Internet') AND
 18     t.calendar_month_desc IN ('2001-09', '2001-10') AND
 19     co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21     GROUPING SETS(
 22             (t.calendar_month_desc, co.country_iso_code),
 23             (ch.channel_desc, co.country_iso_code))
 24  ;

CHANNEL_DESC         CALENDAR CO SUM_AMOUNT_SOLD
-------------------- -------- -- ---------------
                     2001-09  GB       129671.77
                     2001-09  US        920819.9
                     2001-10  US       953046.35
                     2001-10  GB        114307.2
Direct Sales                  GB       168161.48
Internet                      GB        75817.49
Internet                      US       685948.51
Direct Sales                  US      1187917.74

8 rows selected.

So I think we’ve established that these extensions to the GROUP BY clause can be pretty nifty in terms of generating lots of aggregates, and lots of hierarchical aggregations, from a single SELECT statement. This is all well and good though if your query tool can generate SQL in this format, and apart from Discover which I seem to recall generating SQL using GROUPING SETS in the past, the vast majority of query tools are going to carry on generating their SQL using the good old GROUP BY clause. Is there any way, then, that you can use query rewrite to get these GROUP BYs to rewrite to use a materialized view created using some of these extensions? Well you can actually, as long as you use the GROUPING_ID function I mentioned earlier so that query rewrite can vector in on the required aggregate.

To show how this works, let’s create a materialized view that uses the ROLLUP clause:

SQL> CREATE MATERIALIZED VIEW mv_sales_cube1
  2  ENABLE QUERY REWRITE AS
  3  SELECT
  4          ch.channel_total,
  5          ch.channel_class,
  6          co.country_region,
  7          co.country_iso_code,
  8          SUM(s.amount_sold) sum_amount_sold,
  9          GROUPING_ID(
 10                  ch.channel_total,
 11                  ch.channel_class,
 12                  co.country_region,
 13                  co.country_iso_code) grouping_id
 14  FROM
 15          sales s,
 16          customers cu,
 17          times t,
 18          channels ch,
 19          countries co
 20  WHERE
 21          s.time_id=t.time_id AND
 22          s.cust_id=cu.cust_id AND
 23          cu.country_id = co.country_id AND
 24          s.channel_id = ch.channel_id AND
 25          t.calendar_month_desc IN ('2001-09', '2001-10') AND
 26          co.country_iso_code IN ('GB', 'US')
 27  GROUP BY
 28          ROLLUP(
 29                  ch.channel_total,
 30                  ch.channel_class),
 31          ROLLUP(
 32                  co.country_region,
 33                  co.country_iso_code);

Materialized view created.

So this materialized view contains two sets of rolled-up aggregations that’ll give me totals and subtotals across channels, channel classes, regions and countries, the benefit of this being that we’ve stored multiple levels of aggregation in a relational cube, similar to the way that multi-dimensional servers like Essbase store their data in a pre-aggregated form.

If we run a SELECT that uses a corresponding ROLLUP clause, the query rewrites as you would expect.

SQL> SELECT
  2          ch.channel_total,
  3          ch.channel_class,
  4          co.country_region,
  5          co.country_iso_code,
  6          SUM(s.amount_sold) sum_amount_sold
  7  FROM
  8          sales s,
  9          customers cu,
 10          times t,
 11          channels ch,
 12          countries co
 13  WHERE
 14          s.time_id=t.time_id AND
 15          s.cust_id=cu.cust_id AND
 16          cu.country_id = co.country_id AND
 17          s.channel_id = ch.channel_id AND
 18          t.calendar_month_desc IN ('2001-09', '2001-10') AND
 19          co.country_iso_code IN ('GB', 'US')
 20  GROUP BY
 21          ROLLUP(
 22                  ch.channel_total,
 23                  ch.channel_class),
 24          ROLLUP(
 25                  co.country_region,
 26                  co.country_iso_code);

CHANNEL_TOTAL CHANNEL_CLASS        COUNTRY_REGION       CO SUM_AMOUNT_SOLD
------------- -------------------- -------------------- -- ---------------
                                   Europe               GB       321244.43
                                   Europe                        321244.43
                                   Americas             US      2603472.57
                                   Americas                     2603472.57
                                                                   2924717
Channel total                      Europe               GB       321244.43
Channel total                      Europe                        321244.43
Channel total                      Americas             US      2603472.57
Channel total                      Americas                     2603472.57
Channel total                                                      2924717
Channel total Direct               Europe               GB       168161.48
Channel total Direct               Europe                        168161.48
Channel total Direct               Americas             US      1187917.74
Channel total Direct               Americas                     1187917.74
Channel total Direct                                            1356079.22
Channel total Others               Europe               GB        77265.46
Channel total Others               Europe                         77265.46
Channel total Others               Americas             US       729606.32
Channel total Others               Americas                      729606.32
Channel total Others                                             806871.78
Channel total Indirect             Europe               GB        75817.49
Channel total Indirect             Europe                         75817.49
Channel total Indirect             Americas             US       685948.51
Channel total Indirect             Americas                      685948.51
Channel total Indirect                                              761766

25 rows selected.

Elapsed: 00:00:00.12
SQL>
SQL> SELECT * FROM TABLE(dbms_xplan.display_cursor);

...

Plan hash value: 2249608591

-----------------------------------------------------------------------------------------------
| Id  | Operation                    | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                |       |       |     3 (100)|          |
|   1 |  MAT_VIEW REWRITE ACCESS FULL| MV_SALES_CUBE1 |    25 |  1225 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

What’s cool though is that if we then put together a regular SQL statement using just GROUP BY, this rewrites as well, as the GROUPING_ID I’ve stored in the materialized view allows query rewrite to retrieve the correct aggregation from the stored summary.

SQL> SELECT
  2          ch.channel_class,
  3          SUM(s.amount_sold) sum_amount_sold
  4  FROM
  5          sales s,
  6          customers cu,
  7          times t,
  8          channels ch,
  9          countries co
 10  WHERE
 11          s.time_id=t.time_id AND
 12          s.cust_id=cu.cust_id AND
 13          cu.country_id = co.country_id AND
 14          s.channel_id = ch.channel_id AND
 15          t.calendar_month_desc IN ('2001-09', '2001-10') AND
 16          co.country_iso_code IN ('GB', 'US')
 17  GROUP BY
 18          ch.channel_class;

CHANNEL_CLASS        SUM_AMOUNT_SOLD
-------------------- ---------------
Direct                    1356079.22
Others                     806871.78
Indirect                      761766

Elapsed: 00:00:00.23
SQL>
SQL> SELECT * FROM TABLE(dbms_xplan.display_cursor);

...

Plan hash value: 1714973320

------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                |       |       |     4 (100)|          |
|   1 |  HASH GROUP BY                |                |     3 |   114 |     4  (25)| 00:00:01 |
|*  2 |   MAT_VIEW REWRITE ACCESS FULL| MV_SALES_CUBE1 |     3 |   114 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MV_SALES_CUBE1"."GROUPING_ID"=3)

Not bad, and it answers the question that I’ve always had as to when we’d ever get to use these extensions to the GROUP BY clause in a real-world situation. If your query tool doesn’t support these extensions to GROUP BY, you can add the GROUPING_ID function to your materialized view and regular GROUP BYs should rewrite to use the summary. If you’re lucky enough to have a query tool that generates SQL using these extensions, then again through the use of GROUPING_ID and HAVING you can potentially extract all different levels of aggregation using these extensions. Again, as usual, add feedback to this posting if you’ve got any further tips to add to this.

Live from Denmark

September 7th, 2008 by Mark Rittman

I’m currently sitting in my hotel room in Fredericia, Denmark, as tomorrow I’m running a customized version of our OBIEE course for a new customer over here. I flew in to Copenhagen Airport late this afternoon and then got the train from the airport to where I am now; the journey on the train was around two and a half hours and as I wasn’t quite sure that I got on the right train, I spent most of the journey (a) not really sure if the sign above me said my seat was already reserved, and (b) whether I was in fact going in the right direction. In the end I got here around 8pm and now I’ve been spending most of the evening answering emails and getting things set up for tomorrow.

The past few weeks have been fairly hectic what with the preparation for my Open World sessions, adding some new material to the OBIEE course and pulling the seminar material together for my new Oracle 11g Data Warehousing seminar that’s running for the first time in Denmark next week. Apart from that I’ve been teaching and consulting in OBIEE over in Galway (a very nice place), doing some work with Borkur on a Discoverer project in London, and appearing on the panel for a BI Thought Leaders’ forum that Imperial College kindly arranged last week. Between now and Open World I’ve got the three days course delivery this week, then back to London for more Discoverer work, then I’m doing some technical consulting for an Oracle BI Applications project back in the UK for two days, then off to Denmark for the DW seminar, then back to the UK for a day before flying off to San Francisco. In all it’s turning into one of those months where we’re working flat out and often away from home, in my experience it’s often now (autumn) and the early part of the year (February through to May) when we’re the most busiest and this year Open World falls right into the middle of it. Hopefully come the end of September things will have calmed down a bit at least in terms of R&D work, however I suspect the first OBIEE 11g beta will arrive around then so I doubt I’ll actually get much spare time.

Apart from consulting and training, the other thing we’ve been working on is pulling an agenda together for the first of our “Oracle BI Training Days” events, where we’ll be hiring out a venue in London for three days and offering intensive, expert-level training on OBIEE and the Oracle BI Applications for a small, public audience, plus a social event on one of the evenings where we’ll all get a chance to go out for a meal and shoot the breeze on Oracle BI development. There’ll be more news on this during the next week or so, but if you’re interested the dates are likely to be the 22nd through to the 24th of October, the venue will be Canary Wharf London, and we’ll post more details soon.

Oracle 11g Partitioning

September 5th, 2008 by Peter Scott

Last time I mentioned that I have been helping Mark with some demonstrations of the Data Warehouse features of modern (notably 11g) Oracle databases. Having a few minutes spare I decided to take a look at partitioning.

I have long been a fan of partitioning or at least Oracle’s implementation of it - the first time I used it was back in Oracle 8i with a 4.5 TB data warehouse running on a large chunk of iron with an massive number of disks, 4.5 TB is a large number of disks if the biggest disk drive you could get was 18GB!

I digress, if I was told that I could use only one feature in a data warehouse I would think long and hard about partitioning and bitmap indexes and probably side with partitioning; you may ask about parallel query, but that can be tricky to exploit well and there are too many pitfalls for the unwary.

Being an old hand with partitioning you get to know some of the quirks and develop techniques to handle them and to perform the routine tasks that a DBA should not have to do by hand; for example maintaining a rolling window of partitions (range on date) by dropping the oldest partition and creating a new one. You learn that it is sensible to encode the date into the partition name as that might be easier to process in your partition manipulation scripts than using the high_value stored in the xxx_TAB_PARTITIONS as that is a long and thus a bit more tricky to parse in a block of PL/SQL. You learn that for a partitioned index organized table you need to to look in xxx_IND_PARTITIONS

But when a new release of Oracle comes out with something called interval partitioning you sometimes dismiss it with a on off-the-cuff “oh, but that won’t work for me as I need to know the name of the partition I am dropping and a system generated name won’t cut it”. Which is really another way of saying that you have not looked at the Oracle 11g SQL manual to see that there are new extensions to the partition maintenance ALTER TABLE commands and good old DELETE, UPDATE and INSERT. These extensions  now permit you to specify the partition by stating a value of the partition key rather than having to know the partition name: PARTITION FOR (TO_DATE(’17-JAN-2007′, ‘dd-mon-rrrr’)) is a much more elegant solution to storing the partition range value in it’s name.

So what happens when you insert into a non existing partition on an interval partitioned table? Just what should happen; a new partition is created. And, interestingly, if there is a gap in your data it only creates the partitions needed for the data being inserted. And if more data arrives later to fill the gaps, the required missing partitions are created. This looks a pretty good feature to use, unless you use partition exchange; in that case you still need to create the partition before exchanging it.

OBIEE Repository Variables, Filtering and Conditional Formatting

September 3rd, 2008 by Mark Rittman

I’m currently sitting in Galway airport coming back from a few days work, and as I’ve got an hour or so free I thought I’d take a look at server variables in OBIEE. Repository and Session server variables are something you tend to encounter first when working with LDAP security, but they can also be used to create dynamic row-level filters, provide the filter values for conditional formatting and many other tasks, and it’s worth taking a few minutes to understand how they work.

Variables in the Oracle BI Server can be either repository or session variables, and are defined using the Variable Manager within the Oracle BI Administrator (Manage > Variables).

Variable1

This brings up the Variable Manager, that divides variables out into repository and session variables, with a further subdivision into static and dynamic repository ones and system and non-system session ones. You can also define variables at the Presentation Server level, these are scoped to individual dashboards and are used amongst other things for setting dashboard prompt values and integrating the dashboard with BI Publisher parameters. Here’s a screenshot of a typical Variable Manager screen.

Variable2

So what are server (repository) variables used for? Well if you’ve done any programming, either in languages like Visual Basic or on the web, variables these platforms use are the same as the ones that the BI Server uses. As with web development, you can either have server variables that hold the same value across all sessions (in BI Server terms, a “repository” variable), or variables that hold a value specific to each user session (”session” variables in the BI Server). Furthermore, repository variables can either be constants (”static repository” variables) or can have values that vary over time (”dynamic repository” variables), with the dynamic ones refreshed to a schedule via SQL calls and session variables usually set at the time of logon by reference to an LDAP server or an SQL statement. So now that’s all clear, in what sort of situation would they be used? Here’s a scenario that makes use of all these types of BI Server variable.

We have a reporting requirement where salespeople are responsible for a number of states, and these states change on a constant basis. Some of these states are shared with other salespeople and it’s not really practical to define set groupings of these states, hence we have a lookup table on our database that lists out the states each salesperson is responsible for, together with a marker against one particular state that is their home state. We want to use these lists of states as dynamic filters on the reports our salespeople run, and have reports default to their home state when they view their data in a dashboard. We also want to only show data for the current month in their reports, and hold the state in which our headquarters is based in a constant so that we can highlight the sales that are made near to it. In other words, we’ve got a good example of where variables in various shapes and forms can be used to make this process a whole lot easier to implement.

To start off, I use the Variable Manager to define a new static repository variable that I call HEAD_OFFICE_STATE and set to the value ‘NY’. This initial value, that stays constant for static repository variables, is called the “default initializer” and can either be typed in or set via an expression. The only way this variable value can change is if I go back into the Variable Manager and change it there.

Variable3

Next I create the second repository variable, this time a dynamic one, that is used to return the current month name and year to any query that requires it. Unlike static repository variables this variable’s values change over time, using an SQL statement executed to a schedule to update its values.

Variable4

As well as having a default initializer, dynamic variables get their values from initialization blocks that provide the mechanism to execute the SQL via a schedule. To define the initialization block I press “New” and enter the SQL, refresh frequency and variable target for the block. Notice in this example that I’m running the SQL against an Oracle database connection pool, and the truncated month and year name is generated through some Oracle function names and a reference to sysdate.

Variable5

So now I’ve got my two repository variables defined. Before I create my my session variable, I first have to locate a table in my database that lists out the selection of states that each salesperson is working with. Notice how there’s sometimes more than one row per salesperson.

Variable6

I now move on to creating the session variable. To use the data from the table above to provide the values for my dynamic filter, I first of all make sure that this table is accessible through a connection pool in the BI Server physical layer, and then go back to the Variable Manager to define my session variable. Now if this variable was going to hold a single, scalar value, I could define it as a regular non-system session variable, but as it need to hold data from more than one database row, I instead define it using an Initialization Block and row-wise variable initialization, which defines one or more variables based on a select statement, dynamically creating the variables as needed and concatenating the results of multiple rows into the variable.

To perform this task I first define a the new Initialization Block and call it STATE_SECURITY. I then define my SELECT statement as the following:

select 'STATE_FILTER',state from variable_example.salesperson
where salesperson_name = ':USER'

This will return one or more rows with STATE_FILTER, the new dynamically-created variable name, as the first column, and the set of allowed salesman states as the second value, filtered on the logged-in user ID. My initialization block now looks like this:

Variable7

To assign the results of this SELECT statement to a variable, the STATE_FILTER variable, I then click on the “Edit Data Target” button and select Row-Wise Initialization, rather than try and assign the results to individual variables. The initialization block will then take all the rows that have STATE_FILTER as the first column and create a single concatenated, comma-separated list out of the second column, so that I can then apply this value to a filter.

Variable8

Now that the session variable definition is complete, I move over to the Security Manager application, create a group for all of my report users and then define a filter for that group against the SALES table, the one I want to restrict access to. The filter references the session variable I just created, using an equality (”=”) operator rather than the “in” operator you’d have expected, this is a requirement for row-wise variables and OBIEE handles the translation properly in the background.

Variable9

Now if one of the affected users logs in an runs a report against that table, the results are filtered down without any intervention on their part.

Variable10

Conditionally formatting the State column based on whether the value each row contains is also fairly straightforward. When you create a conditional format condition you can specify that that value tested against is a variable; to access one of the repository variables you put biServer.variables[''] around the variable name so that, in my example, the variable name becomes biServer.variables['HEAD_OFFICE_STATE'].

Variable11

Displaying the report now shows all instances of “NY” highlighted in red, based on my conditional formatting rule and the value of the HOME_OFFICE_STATE static repository variable.

Variable12

So there you have it. Moving on from here, filtering the report again based on the dynamic repository variable is just a case of referencing the CURRENT_MONTH variable in my filter, and adding another session variable to hold the salesperson’s home state involves creating another initialization block that this time provides a value for a regular (i.e. not row-wise) HOME_STATE session variable.

Testing the OBIEE 10.1.3.x Multi-User Development Environment

September 2nd, 2008 by Mark Rittman

Adrian Ward’s recent post on multi-user development in OBIEE, together with some questions I’ve received about the feature, prompted me to dig out the manuals and start working through this relatively new part of OBIEE on my laptop. The basic proposition with the Multiuser Development Environment is that you can have a group of developers all checking out, and checking in, various elements of a central BI Server repository project without tripping up over each other; Adrian’s post indicated that it more or less worked but that it took a bit of setting up. To work through this example then I’ve got a repository with a number of fact tables that share some conformed dimensions, I’m going to simulate a number of developers working on the repository and see how it handles the concurrent development. (UPDATE: Note that I’ve made a few corrections and clarifications to this article based on some subsequent testing, and based on some userful feedback from Adrian in the article comments. Thanks Adrian.)

The first step in getting multi-user development up and running is to open up a copy of the repository in question, in offline mode, making sure that it’s not also being used in online mode at the same time (you’ll find you can only open it read-only otherwise). Once you open the repository you can select Manage > Projects from the BI Administrator and start selecting presentation layer fact tables to make up your project.

Mde1

Each project can contain multiple presentation layer fact tables, and selecting a fact table brings in the dimension tables that are associated with them (you don’t see these in the Project Manager folder though. You can also bring in users (a user needs to be listed in the project to be able to check the project out, these users need to be in the administrator group to be able to log in to the BI Administrator tool), initialization blocks and variables (for security, dynamic filters and so on). I ended up creating three projects, two of which were for separate fact tables that shared conformed dimensions (to see what happened when one project overwrote the dimension table design used by another project), one of which included two fact tables of which one was also listed in another project. The point of all this was to see how the Multiuser Development Environment dealt with presentation folders “owned” by more than one project and logical dimension tables that were been updated by separate projects. I also created some separate users to see how security works and how audit information was added to the project audit trail.

Mde2

The next step is to create a directory on a shared network drive, and to copy this repository file to this shared location. The MDE process apparently makes changes to the file such that it’s not safe to leave it in the usual repository location ($ORACLEBI_HOME/server/repository) and so I copied it, for the purposes of this exercise, to a directory called “rpd_multiuser” on the root of my C drive. You can see a copy of the Paint repository that I’ve also copied there and worked with, as you can see the MDE process creates versions of the RPD file to support different checked-out releases of the file.

Mde3

Once the file is copied to the shared drive, each individual developer then needs to go through the steps to enable their copy of BI Administrator to work in multi-user mode. This involves logging in to BI Administrator and then selecting Tools > Options, then entering the location of the shared directory into a dialog along with their name, that will appear in the audit trails for the various projects.

Mde4

As this information gets stored in the Windows registry on the PC that’s running BI Administrator, it’s going to be tricky to simulate lots of different users accessing different projects as BI Administrator will only have one “Full Name” to work with, the one that’s stored in the registry. I’ll give it a go though and bear in mind that the full name is going to be the same for each user, at least in this simulation.

Once you’ve set up your projects, copied the repository to a share drive and configured each copy of BI Administrator to work in multi-user mode, you can start BI Administrator now and select File > Multiuser > Checkout Project to check your project out of the repository.

Mde5

If there are multiple repository files in the shared directory, as there are in my case, you are presented with a list of available ones.

Mde6

In my case I select the one that I just created, and then log in as “User1″, which should have access to the Orders Fact project. In fact when the user connects to the multi-user repository they are presented with a list of all projects, however if they select on that they don’t have permissions on then they can’t save the required local copy.

Mde8

I pick a project that the user does have access to, and then am prompted to save a local copy of the repository file extract. I’m then presented with a repository that is just the presentation, business model and mapping and physical folders and table that I require to work on my project.

Mde9

User2 has access to this project as well, and so I open up another BI Administrator session and log in as this user, again selecting the same project. The BI Administrator lets me open the project, which is already checked out by User1, and save a local copy of the project repository extract to my hard drive.

I then open up another BI Administrator session and connect as User1 again, this time opening the Orders and Appointments folder that has elements of the other project within it (the Product dimension) and some items that are unique to it. I finally open another session as User3 and open the Orders Fact project, which again shares dimensions with other projects, and save a local copy of the repository extract.

I now make an amendment, as User1, to the Order Returns fact project, and then select File > Multiuser > Compare with Original to see if it recognizes the change I made. This comparison is done using a copy of the original repository when I first checked the project out.

Mde10

Looking at the comparison report it’s picked up the change I made.

Mde11

I then make another changes as the User2 user, to the same project, adding another column to the Return Reasons Dim table. Running the comparison with original test again just highlights this user’s change, not the one carried out by the first user, as their change hasn’t been checked into the repository yet.

I switch back to User1 and select File > Multiuser > Merge Local Changes, to apply my project changes back to the main repository within the shared area.

Mde12

The Multiuser feature then checks as to whether this project is locked, and if it’s not, it prompts me for some descriptive text to describe the update I’m committing, plus I can confirm or amend the Full Name that gets logged alongside the change. I use this opportunity to add a “User1″ prefix to my name, so that’s its clear which of my simulated users is making the change.

Mde13

I then get presented with a Merge Repositories dialog. Now this dialog is prompting me to merge the changes in my locally stored repository with the master repository on the shared drive.

Mde14

Once I press the Merge button, my local changes are applied to the master repository but not yet committed, and the Administration tool then displays the full multi-user project within the admin tool.

Now if I switch to User2 and try and merge their changes in, I get an error saying that “I am already trying to check this file in.” Switching to User3 I try and merge their changes in, to a completely different project, and I get the same error, and I’m not sure if this is because I’ve got several BI Administrator sessions open at one time all of which are trying to work in multiuser mode, or whether it’s because one has merged their changes into the master repository but not yet committed, or “published”, those changes to the master repository.

I therefore switch back to User1 which now has the multi-user repository opened (this happened after I merged that users own local changes in) and attempt to close the repository. I am then prompted to publish the changes I’ve made after performing the merge or discard them, this presumably releases the lock on the project and thereafter allows my other users to start merging their own changes back into the master repository.

Mde15

Now one thing I’ve not been able to deduce is whether one project being locked then excludes any other project using that multi-user from merging and then publishing its changes; from looking at the screenshot above and the tests I’ve carried out, I suspect that (a) assuming each developer is working on their own PC, they can all independently merge changes they make on projects into the master repository, but that (b) if a single developer, working on any project using a multi-user developer takes a lock on that repository, in preparation for publishing their merged changes to it, then the others have to wait until that developer either publishes or discards the changes. Based on my test case I can’t be sure of this as I couldn’t merge any changes once one other user had done a merge, I’ll have to try and set up a multiple PC environment and see which one of these findings is actually correct.

Having said that, and no doubt there’s a few subtleties in the process and some niggles in the initial release of this feature, but on this superficial test all the bits you’d expect to see in a basic multi-developer environment are available, albeit with a bit of a complicated arrangement around the number of copies of the RPD file that you need and the multiple-step change committing process. To be honest, if Siebel had done the sensible thing and stored the repository in a relational database no doubt none of this faffing about would be needed, but then again OWB has just that and the multi-user features it has aren’t as complete as this. So all we need now is a repository that’s stored in a proper database, plus at some point in the future the ability to do proper source-control steps such as splitting and branching repositories, versioning them and so on, together with some sort of ability to include the web catalog and the BI Publisher catalog in the process, and I guess we’ll be getting there. I’ll update the posting once I know more about the master repository locking issue, but I suspect this is more down to me trying to run a test on a single PC rather than an issue with the process. Of course any feedback is welcome.

UKOUG BIRT SIG Meeting, London 8th October 2008

August 30th, 2008 by Mark Rittman

The next meeting of the UKOUG Business Intelligence & Reporting Tools Special Interest Group is on October 8th, 2008 at the Hilton Hotel Paddington. We’ve got a special line up this meeting with an extended session by Michelle Bird from Oracle on the new features coming in OWB 11gR2, plus we’ve got sessions by myself on the technology behind the Oracle BI Applications, one on Oracle Business Intelligence for PeopleSoft, a session by Inatech on Oracle BI and Oracle Application Express, and a session by IBM on driving green initiatives using Oracle Business Intelligence Enterprise Edition.

2360830563_601ba5cc1d

This SIG meeting is also a special one for me, as I’m standing down as the BIRT SIG Chair. I’ve chaired the SIG now for around five years and my feeling was that it was time for someone else to have a go, as it’s always good to shake things up once in a while. I’ll still be coming along to the SIG meetings and if I think I’ve got enough time free next year, I’d be interested in standing for the board so that I can get involved in a more directional role. Anyway, here’s the meeting agenda.

09:30 - Registration and Coffee
10:00 - Welcome and Introduction
10:10 - What’s Coming with Oracle Warehouse Builder 11gR2 - Michelle Bird, Oracle
11:40 - Coffee
12:00 - Driving Green Initiatives using OBIEE - Lin Gardner, IBM
12:45 - Oracle Business Intelligence for PeopleSoft - Oracle Open World Preview
13:15 - Lunch
14:10 - SIG Committee Elections
14:20 - An Introduction to the Technology Behind the Oracle BI Applications - Mark Rittman, Rittman Mead
15:05 - Integrating Oracle Business Intelligence with Oracle Application Express - Inatech
15:50 - AOB and Close

Rittman Mead are once again one of the event sponsors, so stop by our booth during one of the breaks if you’re coming along. We can talk to you about the BI Publisher, OBIEE and Oracle BI Applications courses we’ve recently put together, and we’ll be able to answer any of your questions about OBIEE, Discoverer, OWB, Oracle BI Applications and any of the other Oracle BI & DW products.

2360827887_2f6fd4c611

More details on the meeting can be found here, and if you’re interesting in standing for either the SIG chair or deputy chairs positions, you can fill in your details in this online survey.

Dimension Hierarchy Tables and Slowly Changing Dimensions in the Oracle Business Analytics Warehouse

August 30th, 2008 by Mark Rittman

The blog posts I wrote earlier in the week about ragged hierarchies and bridge tables got me thinking about how these things are represented in the Oracle Business Analytics Warehouse (OBAW), the pre-defined data warehouse that comes with the Oracle BI Applications. The OBAW comes with support for ragged hierarchies in the form of dimension hierarchy tables, I also thought I’d spotted some bridge tables in there, and I’ve been meaning to look at how it supports slowly changing dimensions, and so I fired up my virtual machine and took a look.

The OBAW comes with a number of tables with the _DH suffix, which complement the usual set of tables with _D (dimension), _F (fact), _A (aggregate), _DS (dimension staging) and so on. Listing out the table names in SQL*Plus, the full set in BI Apps 7.9.5 is:

SQL> select table_name
  2  from   user_tables
  3  where  table_name like '%_DH'
  4  /

TABLE_NAME
------------------------------
W_ALIGNVER_DH
W_AGREEMENT_DH
W_ALIGNMT_DH
W_INDUSTRY_DH
W_INT_ORG_DH
W_PRODUCT_DH
W_POSITION_DH
W_MED_PLAN_DH
W_OPTY_DH
W_FUND_DH
W_GEO_REGION_DH
W_PRODCAT_DH
W_PERIOD_DH
W_ORG_DH
W_REGN_DH

So what are these hierarchy tables for, given that there’s already a bunch of dimensions defined in the business model and mapping layer of the repository, with their own hierarchies, levels and so on?

Dim 1

If we take one example, the W_INT_ORG_DH table, you’ll find that it’s listed in the physical model of the BI Apps repository and is aliased several times, as for example Dim_W_INT_ORG_DH_Employee_Org and Dim_W_INT_ORG_DH_Operating_Unit_Org. Taking a look a the Dim - Operating Unit Org logical table within the Core business model, you can see that this logical table maps to two logical table sources, one of which is called Dim_W_INT_ORG_D_Operating_Unit_Org and maps to a single corresponding physical alias table with the names of operating units in it, and another called Dim_W_INT_ORG_DH_Operating_Unit_Org that maps to the alias above the W_INT_ORG_DH table.

Dim 2

So why does a logical dimension table map to both a dimension table and a dimension hierarchy table? How does this relate to the regular, level-based hierarchies in the business model and mapping layer?

The clue is in the level-based bit. What these _DH dimension hierarchy tables are for is to flatten-out the ragged hierarchies that might be in your source system, in this case Peoplesoft, so that your internal parent-child organizational hierarchy relationships can be represented in the level-based way that OBIEE requires them. These are then used for reports that drill-down through the hierarchy (making use of regular dimension and hierarchy objects in the repository), and other dimension history tables such as W_POSITION_DH are used within the repository to secure fact table information to just staff members and their superiors.

Let’s take a look at the structure of the W_INT_ORG_DH table, and the W_INT_ORG_D table that’s the primary source for the other logical table source that this logical dimension table uses, to see how this works.

Looking at the W_INT_ORG_D table first, you can see that it contains the list of organizational units within the business

SQL> SELECT row_wid
  2  ,      org_num
  3  ,      org_name
  4  ,      effective_from_dt
  5  ,      effective_to_dt
  6  ,      current_flg
  7  from   w_int_org_d
  8  where  rownum < 10
  9  /

   ROW_WID ORG_NUM ORG_NAME                       EFFECTIVE EFFECTIVE C
---------- ------- ------------------------------ --------- --------- -
         0                                        01-JAN-01           Y
      6001 0       Setup Business Group           01-JAN-99 01-JAN-14 Y
      6002 202     Vision Corporation             01-JAN-99 01-JAN-14 Y
      6003 204     Vision Operations              01-JAN-99 01-JAN-14 Y
      6004 205     Manufacturing                  01-JAN-99 01-JAN-14 Y
      6005 206     Widgets Product Line           01-JAN-99 01-JAN-14 Y
      6006 207     Seattle Manufacturing          01-JAN-99 01-JAN-14 Y
      6007 208     Chicago Subassembly Plant      01-JAN-99 01-JAN-14 Y
      6008 209     Boston Manufacturing           01-JAN-99 01-JAN-14 Y

9 rows selected.

Looking at the names of the organization units, you can probably guess that some are higher in the organizational hierarchy than the others. For many organizations their organization hierarchy is stored in their source systems in a parent-child fashion and can often be unbalanced, and the ETL routine that prepares the internal organization hierarchy in the BI Apps anticipates this, initially storing this parent-child hierarchy in the W_INT_ORG_DH_TMP temporary table, with one row per organization unit per source system hierarchy, like this:

SQL> SELECT org_id
  2  ,      org_num
  3  ,      org_name
  4  ,      par_org_id
  5  ,      par_org_num
  6  ,      par_org_name
  7  ,      hierarchy_name
  8  FROM   w_int_org_dh_tmp
  9  WHERE  org_num between 0 and 209
 10  ORDER BY org_id
 11  /

ORG_ID  ORG_NUM ORG_NAME                       PAR_ORG PAR_ORG PAR_ORG_NAME           HIERARCHY_NAME
------- ------- ------------------------------ ------- ------- ---------------------- ------------------------------
0       0       Setup Business Group
202     202     Vision Corporation
204     204     Vision Operations              202     202     Vision Corporation     Vision Global Reporting
204     204     Vision Operations              202     202     Vision Corporation     US Commercial Sales (DBI)
204     204     Vision Operations              202     202     Vision Corporation     Global
204     204     Vision Operations              202     202     Vision Corporation     Primary Reporting Hierarchy
204     204     Vision Operations              205     205     Manufacturing          Vision Corp EBI Hierarchy
204     204     Vision Operations              202     202     Vision Corporation     Assets Hierarchy
204     204     Vision Operations              202     202     Vision Corporation     Single Establishment Reporting
204     204     Vision Operations              260     260     Vision Administration  PJI
204     204     Vision Operations              202     202     Vision Corporation     Vision Corp - Global
204     204     Vision Operations              458     458     Vision Services        Burden Schedule Hierarchy
204     204     Vision Operations              202     202     Vision Corporation     US DBI Hierarchy
205     205     Manufacturing                  202     202     Vision Corporation     Vision Global Reporting
205     205     Manufacturing                  204     204     Vision Operations      Single Establishment Reporting
205     205     Manufacturing                  204     204     Vision Operations      Primary Reporting Hierarchy
205     205     Manufacturing                  202     202     Vision Corporation     Vision Corp EBI Hierarchy
205     205     Manufacturing                  204     204     Vision Operations      US Commercial Sales (DBI)
205     205     Manufacturing                  204     204     Vision Operations      US DBI Hierarchy
205     205     Manufacturing                  202     202     Vision Corporation     Vision Corp - Global
206     206     Widgets Product Line           1615    1615    Widget Companies       Vision Global Reporting
206     206     Widgets Product Line           1615    1615    Widget Companies       Vision Corp - Global
206     206     Widgets Product Line           205     205     Manufacturing          Primary Reporting Hierarchy
206     206     Widgets Product Line           205     205     Manufacturing          Single Establishment Reporting
207     207     Seattle Manufacturing          204     204     Vision Operations      Global
207     207     Seattle Manufacturing          205     205     Manufacturing          Vision Corp - Global
207     207     Seattle Manufacturing          205     205     Manufacturing          Vision Global Reporting
207     207     Seattle Manufacturing          206     206     Widgets Product Line   Primary Reporting Hierarchy
207     207     Seattle Manufacturing          206     206     Widgets Product Line   Single Establishment Reporting
207     207     Seattle Manufacturing          204     204     Vision Operations      Vision Corp EBI Hierarchy
207     207     Seattle Manufacturing          204     204     Vision Operations      US DBI Hierarchy
207     207     Seattle Manufacturing          204     204     Vision Operations      US Commercial Sales (DBI)
207     207     Seattle Manufacturing          204     204     Vision Operations      DBI - Vision Operations
208     208     Chicago Subassembly Plant      205     205     Manufacturing          Vision Corp - Global
208     208     Chicago Subassembly Plant      206     206     Widgets Product Line   Single Establishment Reporting
208     208     Chicago Subassembly Plant      206     206     Widgets Product Line   Primary Reporting Hierarchy
208     208     Chicago Subassembly Plant      205     205     Manufacturing          Vision Global Reporting
209     209     Boston Manufacturing           205     205     Manufacturing          Vision Global Reporting
209     209     Boston Manufacturing           205     205     Manufacturing          Vision Corp - Global
209     209     Boston Manufacturing           207     207     Seattle Manufacturing  SEATTLE ECO HIERARCHY
209     209     Boston Manufacturing           206     206     Widgets Product Line   Single Establishment Reporting
209     209     Boston Manufacturing           204     204     Vision Operations      DBI - Vision Operations
209     209     Boston Manufacturing           206     206     Widgets Product Line   Primary Reporting Hierarchy

43 rows selected.

Now if you read my posting the other day about ragged hierarchies and OBIEE, you’ll know that the semantic model used by OBIEE doesn’t (currently, as of the 10gR3 release) support ragged hierarchies, requiring them instead to be “flattened out” into level-based hierarchies. This is exactly the route that the OBAW load routine takes when processing ragged hierarchies, flattening them out into tables such as W_INT_ORG_DH table, so that you now have a row per organization unit, per hierarchy, that shows their path up the organization chart. There’s a number of SDE (Source-Dependent Mappings) that carry out this task, flattening the hierarchy through joining the temporary table about several times to itself using a Source Qualifier transformation, with the rest of the mapping now working with a set of flattened out level columns.

Dim 3-1

Once the data is in the W_ORG_INT_DH table, it looks like this:

SQL> SELECT row_wid
  2  ,      org_wid
  3  ,      org_hier1_num
  4  ,      org_hier2_num
  5  ,      org_hier3_num
  6  ,      org_hier8_num
  7  ,      org_top_num
  8  ,      effective_from_dt
  9  ,      effective_to_dt
 10  ,      current_flg
 11  FROM   w_int_org_dh
 12  WHERE  org_hier1_num between 0 and 209
 13  /

   ROW_WID    ORG_WID ORG_HIER1_NUM ORG_HIER2_NUM ORG_HIER3_NUM ORG_HIER8_NUM ORG_TOP_NUM   EFFECTIVE EFFECTIVE C
---------- ---------- ------------- ------------- ------------- ------------- ------------- --------- --------- -
        54       6004 205           205           205                         202           01-JAN-99 01-JAN-14 Y
       183       6002 202           202           202           202           202           01-JAN-99 01-JAN-14 Y
       193       6001 0             0             0             0             0             01-JAN-99 01-JAN-14 Y
        91       6008 209           209           209           204           202           01-JAN-99 01-JAN-14 Y
        92       6006 207           207           207           204           202           01-JAN-99 01-JAN-14 Y
        93       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
       102       6007 208           208           208           204           202           01-JAN-99 01-JAN-14 Y
      1230       6001 0             0             0             0             0             01-JAN-99 01-JAN-14 Y
      1411       6008 209           209           209           204           202           01-JAN-99 01-JAN-14 Y
      1412       6006 207           207           207           204           202           01-JAN-99 01-JAN-14 Y
      1413       6005 206           206           206           204           202           01-JAN-99 01-JAN-14 Y
      1414       6004 205           205           205           204           202           01-JAN-99 01-JAN-14 Y
      1415       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
      1416       6002 202           202           202           202           202           01-JAN-99 01-JAN-14 Y
      1527       6007 208           208           208           204           202           01-JAN-99 01-JAN-14 Y
      3045       6004 205           205           205           205           202           01-JAN-99 01-JAN-14 Y
      3047       6008 209           209           209           205           202           01-JAN-99 01-JAN-14 Y
      2090       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
      2618       6007 208           208           208           205           202           01-JAN-99 01-JAN-14 Y
      2093       6004 205           205           205           204           202           01-JAN-99 01-JAN-14 Y
      2095       6005 206           206           206           204           202           01-JAN-99 01-JAN-14 Y
      2096       6006 207           207           207           204           202           01-JAN-99 01-JAN-14 Y
      2097       6007 208           208           208           204           202           01-JAN-99 01-JAN-14 Y
      2098       6008 209           209           209           204           202           01-JAN-99 01-JAN-14 Y
      2902       6003 204           204           204           458           202           01-JAN-99 01-JAN-14 Y
      2428       6004 205           205           205           205           202           01-JAN-99 01-JAN-14 Y
      2679       6005 206           206           206           1615          202           01-JAN-99 01-JAN-14 Y
      2430       6008 209           209           209           205           202           01-JAN-99 01-JAN-14 Y
      2436       6006 207           207           207           205           202           01-JAN-99 01-JAN-14 Y
      2441       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
      2696       6004 205           205           205           205           202           01-JAN-99 01-JAN-14 Y
      2701       6003 204           204           204           205           202           01-JAN-99 01-JAN-14 Y
      2703       6006 207           207           207           205           202           01-JAN-99 01-JAN-14 Y
      2208       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
      3490       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
      4068       6008 209           209           209           207           202           01-JAN-99 01-JAN-14 Y
      3053       6006 207           207           207           205           202           01-JAN-99 01-JAN-14 Y
      3058       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
      3880       6006 207           207           207           204           202           01-JAN-99 01-JAN-14 Y
      3882       6008 209           209           209           204           202           01-JAN-99 01-JAN-14 Y
      3651       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
      3666       6006 207           207           207           204           202           01-JAN-99 01-JAN-14 Y
      3183       6007 208           208           208           205           202           01-JAN-99 01-JAN-14 Y
      3979       6003 204           204           204           260           202           01-JAN-99 01-JAN-14 Y
      3465       6003 204           204           204           204           202           01-JAN-99 01-JAN-14 Y
      3236       6005 206           206           206           1615          202           01-JAN-99 01-JAN-14 Y
      3478       6004 205           205           205           204           202           01-JAN-99 01-JAN-14 Y
      3479       6006 207           207           207           204           202           01-JAN-99 01-JAN-14 Y
      3480       6004 205           205           205           204           202           01-JAN-99 01-JAN-14 Y
      3481       6006 207           207           207           204           202           01-JAN-99 01-JAN-14 Y

50 rows selected.

This table is then made available in the OBAW, and added to the Internal Organization logical table as an additional logical table source (not sure why its separate logical table source, I would have thought it’d be more efficient to just add it to the existing logical table source that gets the main dimension member information), so that reports can drill up and down the dimension hierarchy. One slight drawback to this approach that’s common to all hierarchy flattening techniques is that you’ll need to amend the _DH warehouse table, plus all the dependent staging and temporary tables if your source parent-child hierarchy has more than the set number (nine I think) of flattened levels, this will be particularly interesting as you’ll need to amend the tables themselves plus the SDE and SIL (source-independent load) Informatica mapping that load them, quite a big task when you think about it especially as there are SDE mappings for each source and for each hierarchy category.

The eagle-eyed amongst you would also have spotted effective from, and effective to, dates in the above tables, together with a current flag column that you usually see in dimensions configured to store history (known in Kimball terms as a slowly changing dimension). That’s precisely what they are, organization dimension entries having start and end dates and a flag to show whether the row you’re looking at is the current (i.e. active) row, with the same columns also being present in the dimension hierarchy tables and in fact every seeded dimension table within the OBAW.

The fact that every dimension table, and in fact every dimension SIL mapping, is set up in order to support slowly changing dimensions means that you can turn on this feature just by setting the $$TYPE2_FLG mapping parameter to “Y”, though you’ll need to start customizing the mapping if you want to go with anything other than the default set of Type 2 trigger columns. If you compare this to SCD handling in OWB it approaches things from a slightly different way, with OWB setting SCD types at the dimension object level and the BI Apps setting it at the mapping level, for the standard set of BI Apps seeded mappings their arrangement works well but it gets a bit messy when you want to alter the set of trigger columns and of course you have to reproduce all of this mapping logic in your own mappings if you want your own custom dimensions to be handled in the same way.

Anyway, that’s it for dimension history tables and slowly changing dimensions. In terms of what I want to cover next in this series, there’s the integration of the BI Apps with E-Business Suite security (probably the number 1 thing I’m asked) and a low-down on the universal adapter and how it can be used to bring in your own, unsupported data sources. But for now though that’s it, let me know if you’ve got any feedback on the above.

The Mystery of OBIEE Bridge Tables

August 28th, 2008 by Mark Rittman

If you’ve played around with the Oracle BI Administrator tool for a while, you may have noticed a box you can tick in the Logical Table Properties dialog, called “Bridge Table”. If you’re familiar with Ralph Kimball and some of this dimensional modeling ideas, you’ve probably heard of this concept before, but it’s not all that clear how you use bridge tables in OBIEE and a quick search around the internet and the OTN forums doesn’t really come up with an examples on how it’s used.

Bridge Table

Bridge tables are a solution to what’s called the “multi-valued dimension” problem. For most dimensional models you generally want to link one sale, for example, to one product, one customer, one time period and so on, and this translates into a simple dimensional model where your fact table contains a single key value per dimension for each row that’s been stored. In some circumstances though, say where you are recording the diagnoses for a patient or the claim elements in a claim, you might need to record more than one key value for a particular dimension in each fact table row. In entity relationship modeling terms, you’ve got a many-to-many relationship between patient admissions and diagnoses, like this:

200808281839

and the usual way you resolve these many to many relationships is to use an intersection table, with the key from the patient admissions table and the key from the diagnoses table copied across to form the intersection, and usually with a weighting column that adds up to 1, so that you can properly add up all the diagnoses and not over-count them.

200808281842

As I said, this is not exactly new stuff and bridge tables, of which diagnosis group above is one of them, are a fairly common dimensional modeling construct. The problem you hit though when starting to use this feature is that the documentation on it is pretty minimal, and only really talks about setting this feature on the bridge table itself and doesn’t really mention what to do with the dimension table that hangs off of it. What I’ll do in this posting then is set out how I use it, explain my rationale and thereafter invite some feedback, so if someone else has come up with a better idea then we can work with that instead.

Going in to BI Administrator and looking at the physical model for the data set above, it looks like this:

Physical Model

with the key thing here being that the fact table we’ve got the bridge table joining to both the fact and diagnosis dim tables to form an intersection. If you imported this model into the logical business layer as is, the BI Administrator would think the intersection table is the fact table as the other tables join to it.

Logical No Bridge

So what you do now is go into the properties for the diagnosis table and indicate that it’s a bridge table, like this:

Set Bridge

Now when you look at the logical model the fact table is identified correctly.

Logical Bridge

If you try and validate the model now though, you get a warning because the diagnoses dimension table doesn’t link through to the fact table, as it goes through the bridge table instead.

Bridge Warning

Now not linking through to the fact table is sometimes allowed, basically in situations where you’ve snowflaked your logical model and the dimension table is actually a higher level in the same dimension, but in this case if we try and use this model and bring in the diagnoses information into a query, we’ll get a metadata consistency error.

To solve this, what I would do is remove the diagnosis dimension from the logical model, and instead add it to the logical table source for the bridge table, like this:

Add To Bridge Lts

Then I’d add any columns that I needed from the diagnosis dimension physical table into the bridge table, which I can do now as I’ve added the dimension table to the bridge table LTS, so that this bridge table now becomes my diagnosis dimension, like this:

Lts Diag Name

So what I’ve done here is take the dimension table, which before linked to the bridge table and thereafter caused the metadata inconsistency, and instead added it to the bridge table logical table source and added it’s columns to the bridge table logical table.

If I run a report now, the data comes out as you would expect, with multiple diagnoses per patient and the weighting applied correctly.

Bridge Report

However I’m conscious that to me, getting rid of the logical dimension table seems a bit wrong, and others have said that they don’t use the bridge table feature at all, and instead just combine the various tables into a single logical table source. If anyone’s got any other way of using bridge tables, add a comment to this post and we’ll see what the consensus is.

Bitmap Index Examples

August 28th, 2008 by Peter Scott

Between assignments (mostly, OBIEE and BI Applications work) and a family holiday in the Austrian Alps I have been helping Mark with some of the examples he will be demonstrating as part of his 11g Data Warehousing Seminar. I must confess that I, too, will be using the examples in some of my data warehouse courses.

One of the demonstrations I have worked on for is the use of BITMAP indexes in simple (single table) queries. For this demonstration I used the standard SH schema and a couple of the provided bitmap indexes on the CUSTOMERS table; that is the GENDER (CUSTOMERS_GENDER_BIX)  and YEAR OF BIRTH (CUSTOMERS_YOB_BIX) indexes.

Firstly, let’s look at the data distributions:

select  cust_gender, count(*) from customers
group by cust_gender;

C   COUNT(*)
- ----------
F      18325
M      37175

There are about 66% M customers and 33% F, and no customers of any other (or NULL) gender.
A similar query shows us that there are 75 distinct years of birth (between 1913 and 1990) and for 1917 there are 37 customers

So what happens when we query the customers table on gender = M

SQL> explain plan for select * from customers where cust_gender = 'M';
select * from table(dbms_xplan.display);

Explained.
SQL>
PLAN_TABLE_OUTPUT
-------------------------------Plan hash value: 2008213504
-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           | 27750 |  4877K|   407   (1)
|*  1 |  TABLE ACCESS FULL| CUSTOMERS | 27750 |  4877K|   407   (1)
---------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("CUST_GENDER"='M')

We do a full table scan and do not use the index.

But when we choose a more selective query - the people born in 1917 (which hits about 0.07% of the table) we see this:

SQL> explain plan for select * from customers where
CUST_YEAR_OF_BIRTH = 1917;

select * from table(dbms_xplan.display);

Explained.

SQL>
PLAN_TABLE_OUTPUT
--------------------------Plan hash value: 3460183038
---------------------------------------------------
| Id  | Operation                    | Name              | Rows  | Bytes | Cost (%CPU)
---------------------------------------------------
|   0 | SELECT STATEMENT             |                   |   740 |   130K|   146   (0)
|   1 |  TABLE ACCESS BY INDEX ROWID | CUSTOMERS         |   740 |   130K|   146   (0)
|   2 |   BITMAP CONVERSION TO ROWIDS|                   |       |       |
|*  3 |    BITMAP INDEX SINGLE VALUE | CUSTOMERS_YOB_BIX |       |       |
---------------------------------------------------

Predicate Information (identified by operation id):
—————————————————

3 - access(”CUST_YEAR_OF_BIRTH”=1917)

This time we are using the bitmap index.

So, is there any merit to using a bitmap index on a column with just two distinct values? After all, we do less work by table scanning the whole table.

The answer to this is yes. One of the great things about bitmap indexes is that they can be combined with other bitmap indexes on the same table by a bitwise ‘AND’ operation; that is we effectively create a multi-column index on-the-fly

SQL> explain plan for select * from customers where
CUST_YEAR_OF_BIRTH = 1917 and cust_gender = 'M';
select * from table(dbms_xplan.display);
Explained.
SQL>
PLAN_TABLE_OUTPUT
---------------------------------------------------Plan hash value: 4226934075
---------------------------------------------------
| Id  | Operation                    | Name                 | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------
|   0 | SELECT STATEMENT             |                      |   370 | 66600 |    82   (2)
|   1 |  TABLE ACCESS BY INDEX ROWID | CUSTOMERS            |   370 | 66600 |    82   (2)
|   2 |   BITMAP CONVERSION TO ROWIDS|                      |       |
|   3 |    BITMAP AND                |                      |       |
|*  4 |     BITMAP INDEX SINGLE VALUE| CUSTOMERS_YOB_BIX    |       |
|*  5 |     BITMAP INDEX SINGLE VALUE| CUSTOMERS_GENDER_BIX |       |       |           ---------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("CUST_YEAR_OF_BIRTH"=1917)
5 - access("CUST_GENDER"='M')

Here you can see both indexes being combined  to access a small subset of the customer table.

Hopefully, this has been a clear example of why a low cardinality column may not be a good bitmap index choice in isolation but when combined with other bitmap indexes can be very beneficial.

Ragged Hierarchy Handling in OBIEE

August 27th, 2008 by Mark Rittman

Another area I’m looking to cover in my Oracle Open World session on OBIEE data modeling is support for ragged, skip-level and unbalanced hierarchies. If you’ve worked with OBIEE for a while and you’re from an OLAP background, one of the first things you would have noticed was that OBIEE, like most relational query tools, only really supports level-based hierarchies, where for example a product hierarchy breaks down into product classes, product families and items, and every item has a family, every family has a class and so on. This works well in your typical sales-type reporting environment, but when you are looking to report on organizations, account hierarchies and so on, this can be a bit limiting, as you may well end up working with a hierarchy that looks like the following:

Org Chart

Unfortunately in this current release of OBIEE there is no out-of-the-box support for these types of hierarchies, but if you have to report on hierarchies like this there are three main ways in which you can do this.

The first and most common way of handling this sort of data is to “flatten” it out so that all routes down the hierarchy have the same number of levels. If we took the organization chart above, this would mean that you’d need to create a dimension table, that you’ll then load into OBIEE, that looks something like this (note the fact that you need to make up names for the various dimension levels that you’ve created, and that you need to “copy-down” the lowest level dimension member down to the bottom level, to ensure that each route down the hierarchy has a bottom level to roll up from).

Flattened Table

Depending on whether your hierarchy is ragged (has varying depths down the hierarchy) or skip-level (has levels missing on various routes up the hierarchy, sometimes referred to as unbalanced) you end up filling in values at various points in the table, your measures can only be recorded at the lowest level in the flattened hierarchy, and OBIEE ends up repeating data where levels are missing when you drill down the hierarchy. This route does however have the virtue of simplicity and if your hierarchy doesn’t change much (most importantly, doesn’t increase the maximum depth of the longest route down the hierarchy) you can usually get away with this.

Another method that I’ve seen used, and in fact I documented in this blog post a couple of years ago, is to an approach put together by Joe Celko where you generate a table that sets out, for each member in the dimension, the member IDs that are it’s descendants.

Descendent

The idea here is that you take the parent-child relationship in your source table and break it out into this “stack” table, where each dimension member has its “children” listed out and you can then navigate through it, in the case of the example using the report navigation feature in Oracle BI Answers, to list out the results. This approach has the advantage of allowing you to record facts against any member in the dimension, not just those at the lowest (flattened hierarchy) level, but it’s a bit tricky to set up and maintaining the stack table is quite cumbersome. If you look around the internet you’ll see other variations on this, others I’ve seen include descendent tables and a variation where each route down the hierarchy has a guaranteed lowest level but then varying levels of aggregation up the hierarchy.

The final way that I’ve seen ragged and unbalanced hierarchies handled in OBIEE is to use Essbase as a data source. Essbase (like Oracle OLAP) inherently stores its dimension hierarchies in parent-child format and I was intrigued to notice that, when you bring in an Essbase dimension into the physical layer of your semantic model, there’s an option to set the dimension to balanced, unbalanced, ragged balanced and network. Now what can this be for?

Essbase Ragged 4

Normally, if you try and import a ragged hierarchy into the OBIEE seminar model using the BI Administrator tool, the hierarchy gets rejected and only the balanced ones import correctly. This was certainly the case for SAP B/W and Microsoft Analysis Services, but since the BI Administrator tool started using the Essbase C API rather than XML/A to read in it’s OLAP metadata (from the 10.1.3.3.2 release onwards, i.e. the first production release to support Essbase) you can now actually import most kinds of Essbase hierarchies in, and you use this setting after the import to tell OBIEE what sort of hierarchy it is (this is then used by the MDX generator within the BI Server). So what happens when you import a ragged hierarchy in, such as the one in this Essbase outline?

Essbase Ragged 1

Well, what actually happens is that the import routine scans the Essbase Outline and works out the maximum depth of the hierarchy, and then generates “columns” for each of the hiearchy levels, so that you end up with a physical model that looks like this for my organization chart:

Essbase Ragged 2

and a logical model that looks like this:

Essbase Ragged 3

So what’s happened here is that, like the conversion I did earlier on with the flattened table I set up, the import routine has generated a flattened relational representation of the data for the hierarchy as it stands at the moment. If I run a report in Oracle BI Answers and take a look at the resulting hierarchy, it looks like this:

200808272250

Which isn’t a bad representation, when you think about it, although this method does have one major drawback in that you’d need to re-import your Essbase dimension into the physical layer, and regenerate the logical business model of it, if the maximum depth of your ragged hierarchy increased.

So, that’s where things are in terms of ragged hierarchy support in OBIEE. At present, it’s not something that’s fully supported and you have to jump through various hoops to create a flattened, relational version of your hierarchy. The Essbase support is interesting (especially as with other multi-dimensional data sources, you can’t actually import ragged hierarchies in at all), and it’ll be interesting to see what functionality Oracle add to OBIEE 11g around this area, especially as Answers+ is being positioned as a replacement for the ragged-hierarchy capable Hyperion Web Analysis.