Thinking About the Aggregation Testing Methodology

The first step in my Aggregation Techniques Comparison project is to put together a methodology. As a recap, I'm looking to compare the rollup and query performance of regular materialized views, materialized views using grouping sets, and analytic workspaces, when used in the context of a relational star schema. The project was prompted by an Oracle white paper published a couple of years ago, which claimed that, for queries that are more ad-hoc and random (as opposed to fairly predictable), analytic workspaces are faster to aggregate, and faster to query, than regular materialized views.

The justifications for this claim were that, when you have a large number of materialized views to create (in order to support a wide range of ad-hoc star schema queries), creating them takes longer than building the equivalent summary using an analytic workspace; and that once they are created, ad-hoc dimensional queries take longer to run than queries against analytic workspaces, because there are more objects (materialized views) to consider, more possible SQL execution plans, and SQL against relational tables is inherently slower to run than SQL that gets translated into analytic workspace lookups. Now I've got no reason to doubt the author of the white paper (Bud Endress), but I thought it'd be interesting to run my own tests, and at the same time introduce another type of materialized view, one created using the DBMS_ODM package and the grouping sets feature, which appears to offer some of the benefits of both approaches - relational storage, but with all aggregates stored in the one database object.
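Just to make the grouping sets idea concrete before I get into the detail, here's a purely illustrative sketch of the general shape of such a materialized view - in practice I'll be generating it with DBMS_ODM rather than hand-writing it, and the MV name and the particular levels chosen here are just examples:

create materialized view sales_gs_mv
build immediate
enable query rewrite
as
select t.calendar_month_desc
,      p.prod_category
,      c.cust_state_province
,      grouping_id(t.calendar_month_desc, p.prod_category, c.cust_state_province) as gid
,      sum(s.amount_sold)   as amount_sold
,      sum(s.quantity_sold) as quantity_sold
from   sales s
,      times t
,      products p
,      customers c
where  s.time_id = t.time_id
and    s.prod_id = p.prod_id
and    s.cust_id = c.cust_id
group by grouping sets
(      (t.calendar_month_desc, p.prod_category)
,      (t.calendar_month_desc, c.cust_state_province)
,      (p.prod_category)
)
/

The attraction is that a single database object holds several levels of aggregation at once, with the GROUPING_ID column identifying which grouping set each row belongs to, so query rewrite can still pick out the rows it needs.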

With this in mind, I've started thinking about a methodology. The presentation and paper are aimed at an Oracle DBA audience, the idea being that this might offer them a couple of alternatives to creating a number of materialized views to support queries in a data warehouse. Given the audience, I'll need to explain how analytic workspaces and materialized views using grouping sets work, and create the objects using scripts and the command line rather than GUI tools such as Analytic Workspace Manager. As the focus is going to be on alternative aggregation techniques to support star schemas, I won't bother with OLAP query tool-specific features such as the OLAP API, and I'll create SQL views over the analytic workspace so that it can be queried using regular SQL, giving it the same SQL interface as a regular star schema - the idea being that if the claims turn out to be correct, it'll be a "slot-in" replacement for the star schema and materialized views, with no changes needed to any other applications.
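As a rough illustration of what those relational views over the analytic workspace might look like, here's a hypothetical sketch using the OLAP_TABLE table function - the workspace name (salesaw), the AW measure and dimension objects (aw_sales_amount_sold, aw_times_parentrel and so on) are all placeholders, and the exact limit map syntax will depend on how the AW objects end up being named and structured:

create or replace view sales_aw_view as
select *
from   table(olap_table(
         'salesaw duration session',  -- attach the (as yet hypothetical) analytic workspace
         '',
         '',
         'measure amount_sold   from aw_sales_amount_sold
          dimension time_id     from aw_times      with hierarchy aw_times_parentrel
          dimension prod_id     from aw_products   with hierarchy aw_products_parentrel
          dimension cust_id     from aw_customers  with hierarchy aw_customers_parentrel
          dimension channel_id  from aw_channels   with hierarchy aw_channels_parentrel
          dimension promo_id    from aw_promotions with hierarchy aw_promotions_parentrel'))
/

With a view like this in place, anything that currently runs SQL against the SALES star schema could, in theory, be pointed at the analytic workspace instead.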

The first bit of work I did was to come up with a sample dataset. Ideally, I'd use one that's readily available to everyone, and so the fairly obvious candidate was the SH ("Sales History") sample schema that comes with the Oracle 9i and Oracle Database 10g databases. The other consideration was that the sample dataset had to be big enough to (a) take enough time to aggregate and query to give some scope for improvement, and (b) compare reasonably in size to the sorts of star schemas DBAs normally look after. So how did it size up?

The first thing I did was connect to the SH schema and run some table counts. The database version is Oracle 10.2.0.2.0 running on Microsoft Windows XP 32-bit, within a Parallels Desktop virtual machine under OS X.

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Sep 7 08:43:11 2006

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options

SQL> select count(*)
  2  from   sales
  3  /

  COUNT(*)
----------
    918843

SQL> select count(*)
  2  from   customers
  3  /

  COUNT(*)
----------
     55500

SQL> select count(*)
  2  from   times
  3  /

  COUNT(*)
----------
      1826

SQL> select count(*)
  2  from   promotions
  3  /

  COUNT(*)
----------
       503

SQL> select count(*)
  2  from   channels
  3  /

  COUNT(*)
----------
         5

OK, so the main fact table has around 918,000 rows, there are 55,000 customers, 72 products, 1826 time periods, 503 promotions and 5 channels. I wonder how big the tables are currently?

SQL> select segment_name, segment_type, sum(bytes)/1024000 "SIZE_IN_MB"
  2  from   user_segments
  3  where  segment_name in ('SALES','CUSTOMERS','PRODUCTS','CHANNELS','TIMES','PROMOTIONS')
  4  group by segment_name, segment_type
  5  /

SEGMENT_NAME         SEGMENT_TYPE       SIZE_IN_MB
-------------------- ------------------ ----------
PRODUCTS             TABLE                    .064
CUSTOMERS            TABLE                  12.288
SALES                TABLE PARTITION        16.256
TIMES                TABLE                    .512
CHANNELS             TABLE                    .064
PROMOTIONS           TABLE                    .512

6 rows selected.

OK, that's not so promising. The main SALES fact table is only 16MB in size, and most fact tables I come across are much bigger than that. Having said that, what we're going to compare is the time it takes to aggregate this fact table using its dimension tables, so the raw size of the fact table probably won't be the major factor here; it'll be the number of dimensions, the levels within them and the total number of possible aggregations that matter most. At the back of my mind, though, I'll have to keep the source table size in mind, in case it starts to undermine the credibility of the tests.

The next thing to look at is the structure of the supporting dimension tables. What I'm looking for here are two things:

  • A large number of dimension members, mostly dependent on the number of base, leaf level members, but also influenced by the number of members at higher levels in the dimension hierarchy - a dimension with 50,000 leaf level members that rolls up into 40,000 members above it, 30,000 above that and so on will affect performance far more than 50,000 that roll up to 100 and then 1.
  • A large number of levels in the dimension hierarchy, or even better, multiple hierarchies - these influence the total number of possible aggregations that can be performed using the dimension.

Basically, a large number of dimensions, with lots of base level members, lots of levels and multiple hierarchies is going to affect rollup and aggregate navigation more than just a couple of dimensions with only single hierarchies and just a couple of levels.

To work out the impact of this, I need to run some count distincts on the table columns that will form the level IDs in the dimensions. I happen to know which columns these are, so I ran some SQL to check out the numbers:

SQL> desc products
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 PROD_ID                                   NOT NULL NUMBER(6)
 PROD_NAME                                 NOT NULL VARCHAR2(50)
 PROD_DESC                                 NOT NULL VARCHAR2(4000)
 PROD_SUBCATEGORY                          NOT NULL VARCHAR2(50)
 PROD_SUBCATEGORY_ID                       NOT NULL NUMBER
 PROD_SUBCATEGORY_DESC                     NOT NULL VARCHAR2(2000)
 PROD_CATEGORY                             NOT NULL VARCHAR2(50)
 PROD_CATEGORY_ID                          NOT NULL NUMBER
 PROD_CATEGORY_DESC                        NOT NULL VARCHAR2(2000)
 PROD_WEIGHT_CLASS                         NOT NULL NUMBER(3)
 PROD_UNIT_OF_MEASURE                               VARCHAR2(20)
 PROD_PACK_SIZE                            NOT NULL VARCHAR2(30)
 SUPPLIER_ID                               NOT NULL NUMBER(6)
 PROD_STATUS                               NOT NULL VARCHAR2(20)
 PROD_LIST_PRICE                           NOT NULL NUMBER(8,2)
 PROD_MIN_PRICE                            NOT NULL NUMBER(8,2)
 PROD_TOTAL                                NOT NULL VARCHAR2(13)
 PROD_TOTAL_ID                             NOT NULL NUMBER
 PROD_SRC_ID                                        NUMBER
 PROD_EFF_FROM                                      DATE
 PROD_EFF_TO                                        DATE
 PROD_VALID                                         VARCHAR2(1)

SQL> set linesize 800

SQL> select count(distinct(prod_id)) as product_count
  2  ,      count(distinct(prod_subcategory_id)) as product_subcategory_count
  3  ,      count(distinct(prod_category_id)) as product_category_count
  4  ,      count(distinct(prod_total_id)) as product_total_count
  5  from   products
  6  /

PRODUCT_COUNT PRODUCT_SUBCATEGORY_COUNT PRODUCT_CATEGORY_COUNT PRODUCT_TOTAL_COUNT
------------- ------------------------- ---------------------- -------------------
           72                        22                      5                   1

SQL> desc customers
 Name
 --------------------------------------------------------------------------------------------
 CUST_ID
 CUST_FIRST_NAME
 CUST_LAST_NAME
 CUST_GENDER
 CUST_YEAR_OF_BIRTH
 CUST_MARITAL_STATUS
 CUST_STREET_ADDRESS
 CUST_POSTAL_CODE
 CUST_CITY
 CUST_CITY_ID
 CUST_STATE_PROVINCE
 CUST_STATE_PROVINCE_ID
 COUNTRY_ID
 CUST_MAIN_PHONE_NUMBER
 CUST_INCOME_LEVEL
 CUST_CREDIT_LIMIT
 CUST_EMAIL
 CUST_TOTAL
 CUST_TOTAL_ID
 CUST_SRC_ID
 CUST_EFF_FROM
 CUST_EFF_TO
 CUST_VALID

SQL> select count(distinct(cust_id)) as customer
  2  ,      count(distinct(cust_city_id)) as cust_city
  3  ,      count(distinct(cust_state_province_id)) as state_province
  4  ,      count(distinct(country_id)) as country
  5  ,      count(distinct(cust_total_id)) as customer_total
  6  from   customers
  7  /

  CUSTOMER  CUST_CITY STATE_PROVINCE    COUNTRY CUSTOMER_TOTAL
---------- ---------- -------------- ---------- --------------
     55500        620            145         19              1

SQL> desc countries
 Name
 --------------------------------------------------------------------------------------------
 COUNTRY_ID
 COUNTRY_ISO_CODE
 COUNTRY_NAME
 COUNTRY_SUBREGION
 COUNTRY_SUBREGION_ID
 COUNTRY_REGION
 COUNTRY_REGION_ID
 COUNTRY_TOTAL
 COUNTRY_TOTAL_ID
 COUNTRY_NAME_HIST

SQL> select count(distinct(country_id)) as country
  2  ,      count(distinct(country_subregion_id)) as subregion
  3  ,      count(distinct(country_region_id)) as region
  4  ,      count(distinct(country_total_id)) as country_total
  5  from   countries
  6  /

   COUNTRY  SUBREGION     REGION COUNTRY_TOTAL
---------- ---------- ---------- -------------
        23          8          6             1

SQL> desc times
 Name
 --------------------------------------------------------------------------------------------
 TIME_ID
 DAY_NAME
 DAY_NUMBER_IN_WEEK
 DAY_NUMBER_IN_MONTH
 CALENDAR_WEEK_NUMBER
 FISCAL_WEEK_NUMBER
 WEEK_ENDING_DAY
 WEEK_ENDING_DAY_ID
 CALENDAR_MONTH_NUMBER
 FISCAL_MONTH_NUMBER
 CALENDAR_MONTH_DESC
 CALENDAR_MONTH_ID
 FISCAL_MONTH_DESC
 FISCAL_MONTH_ID
 DAYS_IN_CAL_MONTH
 DAYS_IN_FIS_MONTH
 END_OF_CAL_MONTH
 END_OF_FIS_MONTH
 CALENDAR_MONTH_NAME
 FISCAL_MONTH_NAME
 CALENDAR_QUARTER_DESC
 CALENDAR_QUARTER_ID
 FISCAL_QUARTER_DESC
 FISCAL_QUARTER_ID
 DAYS_IN_CAL_QUARTER
 DAYS_IN_FIS_QUARTER
 END_OF_CAL_QUARTER
 END_OF_FIS_QUARTER
 CALENDAR_QUARTER_NUMBER
 FISCAL_QUARTER_NUMBER
 CALENDAR_YEAR
 CALENDAR_YEAR_ID
 FISCAL_YEAR
 FISCAL_YEAR_ID
 DAYS_IN_CAL_YEAR
 DAYS_IN_FIS_YEAR
 END_OF_CAL_YEAR
 END_OF_FIS_YEAR

SQL> select count(distinct(time_id)) as day
  2  ,      count(distinct(calendar_month_id)) as month
  3  ,      count(distinct(calendar_quarter_id)) as quarter
  4  ,      count(distinct(calendar_year_id)) as year
  5  from   times
  6  /

       DAY      MONTH    QUARTER       YEAR
---------- ---------- ---------- ----------
      1826         60         20          5

SQL> select count(distinct(time_id)) as day
  2  ,      count(distinct(fiscal_month_id)) as fiscal_month
  3  ,      count(distinct(fiscal_quarter_id)) as fiscal_quarter
  4  ,      count(distinct(fiscal_year_id)) as fiscal_year
  5  from   times
  6  /

       DAY FISCAL_MONTH FISCAL_QUARTER FISCAL_YEAR
---------- ------------ -------------- -----------
      1826           61             21           6

SQL> desc channels
 Name
 --------------------------------------------------------------------------------------------
 CHANNEL_ID
 CHANNEL_DESC
 CHANNEL_CLASS
 CHANNEL_CLASS_ID
 CHANNEL_TOTAL
 CHANNEL_TOTAL_ID

SQL> select count(distinct(channel_id)) as channel
  2  ,      count(distinct(channel_class_id)) as channel_class
  3  ,      count(distinct(channel_total_id)) as channel_total
  4  from   channels
  5  /

   CHANNEL CHANNEL_CLASS CHANNEL_TOTAL
---------- ------------- -------------
         5             3             1

SQL> desc promotions
 Name
 --------------------------------------------------------------------------------------------
 PROMO_ID
 PROMO_NAME
 PROMO_SUBCATEGORY
 PROMO_SUBCATEGORY_ID
 PROMO_CATEGORY
 PROMO_CATEGORY_ID
 PROMO_COST
 PROMO_BEGIN_DATE
 PROMO_END_DATE
 PROMO_TOTAL
 PROMO_TOTAL_ID

SQL> select count(distinct(promo_id)) as promo
  2  ,      count(distinct(promo_subcategory_id)) as subcategory
  3  ,      count(distinct(promo_category_id)) as category
  4  ,      count(distinct(promo_total_id)) as promo_total
  5  from   promotions
  6  /

     PROMO SUBCATEGORY   CATEGORY PROMO_TOTAL
---------- ----------- ---------- -----------
       503          22          9           1

The COUNTRIES table will be joined to the CUSTOMERS table to form a single dimension - in dimensional modelling terms this is known as a "snowflake" dimension.
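Just to illustrate that join (using the column names from the table descriptions above), the snowflaked customer-to-geography lookup looks something like this:

select cu.cust_id
,      cu.cust_city
,      cu.cust_state_province
,      co.country_name
,      co.country_subregion
,      co.country_region
from   customers cu
,      countries co
where  co.country_id = cu.country_id
and    rownum <= 5
/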

Before I use these figures though, the first thing I want to do is think about how many dimension level combinations there are. Looking back at my SQL:

  • Customers had 7 levels, if I count the Countries table as well (customer, city, state/province, country, subregion, region, total customer)
  • Products had 4 levels (products, subcategory, category, total)
  • Times had 7 levels over 2 hierarchies that shared the same bottom day level (day, calendar month, calendar quarter, calendar year, then day, fiscal month, fiscal quarter and fiscal year)
  • Channels had 3 levels (channel, channel class and total channel), and
  • Promotions had 4 levels (promotion, subcategory, category and total promotions)

If I multiply these figures together (7 x 4 x 7 x 3 x 4) I get 2352 combinations of dimension levels by which users can aggregate the data. That looks promising, as there's plenty of scope for random, ad-hoc queries against the star schema - a requirement that our sample data set needs to fulfil.
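To put that in context, each of those 2352 combinations corresponds to a different level of aggregation a user might ask for. For example, sales summed by calendar quarter, product category, customer country, channel class and total promotions is one such combination, which against the star schema would look something like this (a representative query I've made up for illustration, not one of the actual test queries):

select t.calendar_quarter_desc
,      p.prod_category
,      co.country_name
,      ch.channel_class
,      sum(s.amount_sold) as amount_sold
from   sales s
,      times t
,      products p
,      customers cu
,      countries co
,      channels ch
where  s.time_id     = t.time_id
and    s.prod_id     = p.prod_id
and    s.cust_id     = cu.cust_id
and    cu.country_id = co.country_id
and    s.channel_id  = ch.channel_id
group by t.calendar_quarter_desc
,      p.prod_category
,      co.country_name
,      ch.channel_class
/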

From an OLAP perspective though, I need to calculate the total number of cells that each fact table column will require, which is worked out by adding together the dimension members across all the levels in each dimension, then multiplying these totals together to give the total cell requirement for the OLAP measure. When I store my star schema data in an OLAP analytic workspace, it normally reserves one cell - usually equating to 8 bytes of storage - for every possible combination of dimension members, even if there isn't a value held for that combination. I'll come on in a moment to how we deal with this issue - basically we only actually store values for valid dimension member combinations - but for the time being, how much space might an analytic workspace version of this star schema potentially take up?

Taking the figures produced by the queries, I end up with the following count of dimension members:

  • Customers: 55000 customers (rounding down from the 55500 in the table) + 620 cities + 145 state/provinces + 23 countries + 8 sub-regions + 6 regions + 1 total level = 55803 members
  • Products : 72 products + 22 product subcategories +  5 product categories + 1 product total = 100 members
  • Times : 1826 days + 60 calendar months + 61 fiscal months + 20 calendar quarters + 21 fiscal quarters + 5 calendar years + 6 fiscal years = 1999 members
  • Channels : 5 channels + 3 channel classes + 1 total channels = 9 members, and
  • Promotions : 503 promotions + 22 promotion subcategories + 9 promotion categories + 1 promotion total = 535 members.

Now, if I multiply the dimension member totals together, I get 55803 x 100 x 1999 x 9 x 535 = 53,711,419,855,500 cell combinations per measure, which if you multiply up by the storage requirement (8 bytes) per cell, and then by the number of measures in the SALES fact table (3), means that our storage requirement, if every dimension member combination held a valid value, would be well over a petabyte (around 1,290TB) of storage. Which might cause us a bit of an issue.
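For anyone who wants to check that arithmetic, here are the same sums in SQL (the 9 being the channel member count from the list above, and the trailing 8 x 3 being 8 bytes per cell times the three measures):

select 55803 * 100 * 1999 * 9 * 535         as cells_per_measure
,      55803 * 100 * 1999 * 9 * 535 * 8 * 3 as total_bytes_if_fully_dense
from   dual
/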

Of course, in reality, not every possible dimension member combination has a valid value, and in most cases our data is in fact what we term "sparse" - not every customer buys every product via every channel using every promotion each day. Sparsity isn't something that's ever really discussed when working with relational star schemas, as that type of storage only stores data for actual, real combinations of dimension values, but it's one of the most important factors in designing analytic workspaces because there are steps we can take to reduce the space taken up by the empty cells, and very sparse measures can reduce down in size significantly.

To see to what extent our fact table measures are sparse, I can get a rough approximation by first seeing how many distinct combinations of dimension values are present in the fact table (which, remember, has 918,843 rows):

SQL> select count(*) from (
  2     select distinct time_id
  3  ,                  prod_id
  4  ,                  cust_id
  5  ,                  promo_id
  6  ,                  channel_id
  7     from   sales )
  8  /

  COUNT(*)
----------
    700890

So there are in fact just over 700,000 actual, valid combinations of lowest-level dimension members. Compare this to the potential number of lowest-level combinations (55,000 customers x 72 products x 1826 days x 5 channels x 503 promotions = 18,185,864,400,000) and it works out that only around 0.0000039% of the possible cells ((700,890 / 18,185,864,400,000) x 100) actually hold data - in other words, each measure is extremely sparse. If I then apply this ratio to that petabyte-plus of potential storage, the amount of disk space required comes back down to tens of megabytes, although the figures are getting a bit tenuous at this point as there are a few more factors to take into account. However, what it does mean is that this star schema, with its many dimensions, levels and dimension members, certainly has the potential to create a large analytic workspace if the technology doesn't hold up well.
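And the density calculation itself, in case anyone wants to reproduce it (this is just the same arithmetic expressed in SQL):

select round(700890 / (55000 * 72 * 1826 * 5 * 503) * 100, 9) as pct_of_cells_populated
from   dual
/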

So, does this make the SH schema a good sample dataset? Well, to be honest, the size of the star schema is still quite small, and once I come up with some results it'd be wise to test the findings against a more realistic dataset, if only to keep the exercise credible. However, there are obviously lots of potential MV aggregations we can create, and there's plenty of potential for database explosion, pre-aggregation and sparsity handling at the analytic workspace end, so it's probably a good starting point.

Going on from here then, how will I go about doing the tests? Here's the methodology I've come up with so far.

  1. Create three tablespaces with the default options for a 10gR2 database (locally managed etc), and three users called SH_MV, SH_ODM and SH_AW
  2. Export the SH sample schema and import it into the SH_MV and SH_ODM schemas, leaving the SH_AW schema empty for the time being.
  3. Truncate the tables in the SH_MV and SH_ODM schemas, so we keep the table structures, dimensions, CWM2 metadata, constraints and so on.
  4. Create three views over the SALES table in the SH schema: one to retrieve the first 800,000 rows based on TIME_ID, a second to retrieve the next 60,000 rows, and a third to retrieve the rest (there's a rough sketch of these views just after this list). The first view we'll use to do an initial load and aggregation, and the second two we'll use for incremental loads and aggregations (my suspicion is that MVs will be quicker to re-aggregate for incremental loads, but the partitioning/compression benefits in OLAP 10.2 might surprise us all)
  5. Put on a 10046 trace, and then do a direct-path load from the SH schema into the SH_MV schema. Time it and see how long it takes to load. Do the same for the SH_ODM schema (these should obviously be the same)
  6. Generate a set of materialized views (more on how I'll do this tomorrow) over the SH_MV fact table, run and time the process.
  7. Generate a materialized view using grouping sets over the fact table in the SH_ODM schema, run and time the process.
  8. In the SH_AW schema, create the AW/XML documents to define the equivalent analytic workspace and an equivalent set of aggregations, use DBMS_AW_XML to generate the analytic workspace (leaving aggregation to later), after again turning on trace and timing it.
  9. Use the DBMS_AW_XML utility to aggregate the analytic workspace, again timing and tracing.
  10. Compare the results of the three approaches for the initial load and aggregation
  11. Repeat the process with the second, 60000 row view, to see how well each handles incremental loads and aggregation.
  12. Repeat the process with the third view (the remaining rows), but this time inserting row-by-row, to see whether incremental refresh performance changes when we load on a drip-feed rather than in one go.
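As a rough sketch of steps 4 and 5, something like the following - the TIME_ID cut-off dates are placeholders that I'll need to adjust so they actually give roughly 800,000 / 60,000 / remainder splits, and the view names are just working names:

create or replace view sales_initial_load as
select * from sh.sales where time_id <  date '2001-07-01';

create or replace view sales_incr_load_1 as
select * from sh.sales where time_id >= date '2001-07-01' and time_id < date '2001-10-01';

create or replace view sales_incr_load_2 as
select * from sh.sales where time_id >= date '2001-10-01';

-- then, connected as SH_MV, trace and time the initial direct-path load
alter session set events '10046 trace name context forever, level 8';
insert /*+ APPEND */ into sales select * from sh.sales_initial_load;
commit;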

So that should handle the summary creation testing. At the moment, there's still a few questions that I need to address:

  • How do we go about choosing the right set of materialized views to create? There are lots of potential dimension level combinations we could aggregate by - how do we choose which ones to use, and when do we stop? (One possible starting point is sketched just after this list.)
  • Is an event 10046 trace appropriate for an analytic workspace load?
  • Is it realistic to test on a single-processor laptop (i.e. are there optimisations in aggregate creation that only kick in with more than one CPU)?
  • How are we going to generate the AW/XML documents to create the AW?
  • How are we going to create the materialized view using grouping sets?
  • For the analytic workspace and the grouping sets materialized view, what levels should we aggregate on, and how can we marry this up with the set of materialized view aggregations we've also created?
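On that first question, one option I'm considering (not necessarily the one I'll end up using) is to let the SQL Access Advisor suggest a starting set of materialized views from a representative workload. Something along these lines - the task name, the sample statement and the schema prefix are of course just placeholders, and a real run would use a proper workload rather than a single query:

begin
  dbms_advisor.quick_tune(
    dbms_advisor.sqlaccess_advisor,
    'SH_MV_RECOMMEND',
    'select t.calendar_quarter_desc, p.prod_category, sum(s.amount_sold)
     from   sh_mv.sales s, sh_mv.times t, sh_mv.products p
     where  s.time_id = t.time_id
     and    s.prod_id = p.prod_id
     group by t.calendar_quarter_desc, p.prod_category');
end;
/

-- review the recommended materialized views before deciding what to build
select dbms_advisor.get_task_script('SH_MV_RECOMMEND') from dual;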

Questions, questions. Anyway, I'll make a start on the project setup tomorrow, then start thinking about how to create the materialized views.