Thinking About the Aggregation Testing Methodology
The first step in my Aggregation Techniques Comparison project is to put together a methodology. As a recap, I'm looking to compare the rollup and query performance of regular materialized views, materialized views using grouping sets, and analytic workspaces, when used in the context of a relational star schema. The project was prompted by an Oracle white paper published a couple of years ago, that claimed that, for queries that were more ad-hoc and random (as opposed to fairly predictable), analytic workspaces were faster to aggregate, and faster to query, than regular materialized views.
The justifications for this claim were twofold. First, when you had a large number of materialized views to create (in order to support a wide range of ad-hoc star schema queries), creating these took longer than building the equivalent summary using an analytic workspace. Second, once they were created, ad-hoc dimensional queries took longer to run than queries against analytic workspaces, because there were more objects (materialized views) to consider, more possible SQL execution plans, and SQL against relational tables was inherently slower to run than SQL that gets translated into analytic workspace lookups. Now I've got no reason to doubt the author of the white paper (Bud Endress), but I thought it'd be interesting to run my own tests, and at the same time introduce another type of materialized view - one created using the DBMS_ODM package and the grouping sets feature - that appears to offer some of the benefits of both approaches: relational storage, but with all aggregates stored in the one database object.
With this in mind, I've started thinking about a methodology. The presentation and paper are aimed at the Oracle DBA audience, the idea being that this might offer them a couple of alternatives to creating a number of materialized views to support queries in a data warehouse. Given the audience, I'll need to explain how analytic workspaces and materialized views using grouping sets work, and create the objects using scripts and the command line rather than GUI tools such as Analytic Workspace Manager. As the focus is going to be on alternative aggregation techniques to support star schemas, we won't bother with OLAP query tool-specific features like the OLAP API; instead, we'll create SQL views over the analytic workspace so that it can be queried using regular SQL, giving it the same SQL interface as a regular star schema. The idea is that if the claims turn out to be correct, it'll be a "slot in" replacement for the star schema and materialized views, with no changes needing to be made to any other applications.
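To make that "slot in" idea a bit more concrete, the relational interface to the analytic workspace would come from views built over the OLAP_TABLE table function. The sketch below is purely illustrative - the workspace, dimension, parent-relation and variable names (SALESAW, TIME_DIM, TIME_PARENTREL, SALES_VAR and so on) are hypothetical placeholders rather than the objects I'll actually end up creating - but it shows the general shape of such a view:

-- Hypothetical sketch: expose an analytic workspace measure through plain SQL
-- using the OLAP_TABLE table function. All AW object names are placeholders.
create or replace view sales_aw_v as
select *
from   table(olap_table(
         'SH_AW.SALESAW DURATION SESSION',  -- attach string for the (hypothetical) AW
         '',                                -- let OLAP_TABLE derive the row type itself
         '',                                -- no OLAP DML to run at attach time
         'MEASURE sales_amount AS NUMBER FROM sales_var
          DIMENSION time_id    AS VARCHAR2(30) FROM time_dim  WITH HIERARCHY time_parentrel
          DIMENSION prod_id    AS VARCHAR2(30) FROM prod_dim  WITH HIERARCHY prod_parentrel
          DIMENSION cust_id    AS VARCHAR2(30) FROM cust_dim  WITH HIERARCHY cust_parentrel
          DIMENSION channel_id AS VARCHAR2(30) FROM chan_dim  WITH HIERARCHY chan_parentrel
          DIMENSION promo_id   AS VARCHAR2(30) FROM promo_dim WITH HIERARCHY promo_parentrel'));

A query tool could then join this view to the dimension tables (or to similar AW-sourced views) in exactly the same way it would query the original star schema.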
The first bit of work I did was to come up with a sample dataset. Ideally, I'd use one that's readily available to everyone, so the fairly obvious candidate was the SH ("Sales History") sample schema that comes with the Oracle 9i and Oracle Database 10g databases. The other consideration was that the sample dataset had to be big enough to (a) take enough time to aggregate and query to give some scope for improvement, and (b) compare reasonably in size to the sorts of star schemas DBAs normally look after. So how did it size up?
The first thing I did was connect to the SH schema and run some table counts. The database version is Oracle 10.2.0.2.0 running on Microsoft Windows XP 32-bit, within a Parallels Desktop virtual machine under OS X.
SQL*Plus: Release 10.2.0.1.0 - Production on Thu Sep 7 08:43:11 2006

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options

SQL> select count(*)
  2  from sales
  3  /

  COUNT(*)
----------
    918843

SQL> select count(*)
  2  from customers
  3  /

  COUNT(*)
----------
     55500

SQL> select count(*)
  2  from times
  3  /

  COUNT(*)
----------
      1826

SQL> select count(*)
  2  from promotions
  3  /

  COUNT(*)
----------
       503

SQL> select count(*)
  2  from channels
  3  /

  COUNT(*)
----------
         5
OK, so the main fact table has around 918,000 rows, and there are 55,500 customers, 72 products, 1,826 time periods, 503 promotions and 5 channels. I wonder how big the tables currently are?
SQL> select segment_name, segment_type, sum(bytes)/1024000 "SIZE_IN_MB"
  2  from user_segments
  3  where segment_name in ('SALES','CUSTOMERS','PRODUCTS','CHANNELS','TIMES','PROMOTIONS')
  4  group by segment_name, segment_type
  5  /

SEGMENT_NAME     SEGMENT_TYPE       SIZE_IN_MB
---------------- ----------------- -----------
PRODUCTS         TABLE                    .064
CUSTOMERS        TABLE                  12.288
SALES            TABLE PARTITION        16.256
TIMES            TABLE                    .512
CHANNELS         TABLE                    .064
PROMOTIONS       TABLE                    .512

6 rows selected.
OK, that's not so promising. The main SALES fact table is only 16MB in size, and most fact tables I come across are much bigger than that. Having said that, what we're going to compare is the time it takes to aggregate this fact table using its dimension tables, so the basic size of the fact table probably won't be the major factor here; it'll be the number of dimensions, the levels within them, and the overall number of possible aggregation combinations that matter most. At the back of my mind, though, I'll have to keep the source table size in mind, in case it starts to undermine the credibility of the tests.
The next thing to look at is the structure of the supporting dimension tables. What I'm looking for here are two things:
- A large number of dimension members, mostly dependent on the number of base, leaf level members, but also influenced by the number of members at higher levels in the dimension hierarchy - a dimension with 50,000 leaf level members that rolls up into 40,000 members above it, 30,000 above that and so on will affect performance far more than 50,000 that roll up to 100 and then 1.
- A large number of levels in the dimension hierarchy, or even better, multiple hierarchies - these influence the total number of possible aggregations that can be performed using the dimension.
Basically, a large number of dimensions, with lots of base level members, lots of levels and multiple hierarchies is going to affect rollup and aggregate navigation more than just a couple of dimensions with only single hierarchies and just a couple of levels.
To work out the impact of this, I first need to run some count distincts on the table columns that will go to form the level IDs in the dimensions. I happen to know which columns in the tables go on to form the levels, and I therefore run some SQL to check out the numbers:
SQL> desc products
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 PROD_ID                                   NOT NULL NUMBER(6)
 PROD_NAME                                 NOT NULL VARCHAR2(50)
 PROD_DESC                                 NOT NULL VARCHAR2(4000)
 PROD_SUBCATEGORY                          NOT NULL VARCHAR2(50)
 PROD_SUBCATEGORY_ID                       NOT NULL NUMBER
 PROD_SUBCATEGORY_DESC                     NOT NULL VARCHAR2(2000)
 PROD_CATEGORY                             NOT NULL VARCHAR2(50)
 PROD_CATEGORY_ID                          NOT NULL NUMBER
 PROD_CATEGORY_DESC                        NOT NULL VARCHAR2(2000)
 PROD_WEIGHT_CLASS                         NOT NULL NUMBER(3)
 PROD_UNIT_OF_MEASURE                               VARCHAR2(20)
 PROD_PACK_SIZE                            NOT NULL VARCHAR2(30)
 SUPPLIER_ID                               NOT NULL NUMBER(6)
 PROD_STATUS                               NOT NULL VARCHAR2(20)
 PROD_LIST_PRICE                           NOT NULL NUMBER(8,2)
 PROD_MIN_PRICE                            NOT NULL NUMBER(8,2)
 PROD_TOTAL                                NOT NULL VARCHAR2(13)
 PROD_TOTAL_ID                             NOT NULL NUMBER
 PROD_SRC_ID                                        NUMBER
 PROD_EFF_FROM                                      DATE
 PROD_EFF_TO                                        DATE
 PROD_VALID                                         VARCHAR2(1)

SQL> set linesize 800
SQL> select count(distinct(prod_id)) as product_count
  2  ,      count(distinct(prod_subcategory_id)) as product_subcategory_count
  3  ,      count(distinct(prod_category_id)) as product_category_count
  4  ,      count(distinct(prod_total_id)) as product_total_count
  5  from   products
  6  /

PRODUCT_COUNT PRODUCT_SUBCATEGORY_COUNT PRODUCT_CATEGORY_COUNT PRODUCT_TOTAL_COUNT
------------- ------------------------- ---------------------- -------------------
           72                        22                      5                   1

SQL> desc customers
 Name
 --------------------------------------------------------------------------------
 CUST_ID
 CUST_FIRST_NAME
 CUST_LAST_NAME
 CUST_GENDER
 CUST_YEAR_OF_BIRTH
 CUST_MARITAL_STATUS
 CUST_STREET_ADDRESS
 CUST_POSTAL_CODE
 CUST_CITY
 CUST_CITY_ID
 CUST_STATE_PROVINCE
 CUST_STATE_PROVINCE_ID
 COUNTRY_ID
 CUST_MAIN_PHONE_NUMBER
 CUST_INCOME_LEVEL
 CUST_CREDIT_LIMIT
 CUST_EMAIL
 CUST_TOTAL
 CUST_TOTAL_ID
 CUST_SRC_ID
 CUST_EFF_FROM
 CUST_EFF_TO
 CUST_VALID

SQL> select count(distinct(cust_id)) as customer
  2  ,      count(distinct(cust_city_id)) as cust_city
  3  ,      count(distinct(cust_state_province_id)) as state_province
  4  ,      count(distinct(country_id)) as country
  5  ,      count(distinct(cust_total_id)) as customer_total
  6  from   customers
  7  /

  CUSTOMER  CUST_CITY STATE_PROVINCE    COUNTRY CUSTOMER_TOTAL
---------- ---------- -------------- ---------- --------------
     55500        620            145         19              1

SQL> desc countries
 Name
 --------------------------------------------------------------------------------
 COUNTRY_ID
 COUNTRY_ISO_CODE
 COUNTRY_NAME
 COUNTRY_SUBREGION
 COUNTRY_SUBREGION_ID
 COUNTRY_REGION
 COUNTRY_REGION_ID
 COUNTRY_TOTAL
 COUNTRY_TOTAL_ID
 COUNTRY_NAME_HIST

SQL> select count(distinct(country_id)) as country
  2  ,      count(distinct(country_subregion_id)) as subregion
  3  ,      count(distinct(country_region_id)) as region
  4  ,      count(distinct(country_total_id)) as country_total
  5  from   countries
  6  /

   COUNTRY  SUBREGION     REGION COUNTRY_TOTAL
---------- ---------- ---------- -------------
        23          8          6             1

SQL> desc times
 Name
 --------------------------------------------------------------------------------
 TIME_ID
 DAY_NAME
 DAY_NUMBER_IN_WEEK
 DAY_NUMBER_IN_MONTH
 CALENDAR_WEEK_NUMBER
 FISCAL_WEEK_NUMBER
 WEEK_ENDING_DAY
 WEEK_ENDING_DAY_ID
 CALENDAR_MONTH_NUMBER
 FISCAL_MONTH_NUMBER
 CALENDAR_MONTH_DESC
 CALENDAR_MONTH_ID
 FISCAL_MONTH_DESC
 FISCAL_MONTH_ID
 DAYS_IN_CAL_MONTH
 DAYS_IN_FIS_MONTH
 END_OF_CAL_MONTH
 END_OF_FIS_MONTH
 CALENDAR_MONTH_NAME
 FISCAL_MONTH_NAME
 CALENDAR_QUARTER_DESC
 CALENDAR_QUARTER_ID
 FISCAL_QUARTER_DESC
 FISCAL_QUARTER_ID
 DAYS_IN_CAL_QUARTER
 DAYS_IN_FIS_QUARTER
 END_OF_CAL_QUARTER
 END_OF_FIS_QUARTER
 CALENDAR_QUARTER_NUMBER
 FISCAL_QUARTER_NUMBER
 CALENDAR_YEAR
 CALENDAR_YEAR_ID
 FISCAL_YEAR
 FISCAL_YEAR_ID
 DAYS_IN_CAL_YEAR
 DAYS_IN_FIS_YEAR
 END_OF_CAL_YEAR
 END_OF_FIS_YEAR

SQL> select count(distinct(time_id)) as day
  2  ,      count(distinct(calendar_month_id)) as month
  3  ,      count(distinct(calendar_quarter_id)) as quarter
  4  ,      count(distinct(calendar_year_id)) as year
  5  from   times
  6  /

       DAY      MONTH    QUARTER       YEAR
---------- ---------- ---------- ----------
      1826         60         20          5

SQL> select count(distinct(time_id)) as day
  2  ,      count(distinct(fiscal_month_id)) as fiscal_month
  3  ,      count(distinct(fiscal_quarter_id)) as fiscal_quarter
  4  ,      count(distinct(fiscal_year_id)) as fiscal_year
  5  from   times
  6  /

       DAY FISCAL_MONTH FISCAL_QUARTER FISCAL_YEAR
---------- ------------ -------------- -----------
      1826           61             21           6

SQL> desc channels
 Name
 --------------------------------------------------------------------------------
 CHANNEL_ID
 CHANNEL_DESC
 CHANNEL_CLASS
 CHANNEL_CLASS_ID
 CHANNEL_TOTAL
 CHANNEL_TOTAL_ID

SQL> select count(distinct(channel_id)) as channel
  2  ,      count(distinct(channel_class_id)) as channel_class
  3  ,      count(distinct(channel_total_id)) as channel_total
  4  from   channels
  5  /

   CHANNEL CHANNEL_CLASS CHANNEL_TOTAL
---------- ------------- -------------
         5             3             1

SQL> desc promotions
 Name
 --------------------------------------------------------------------------------
 PROMO_ID
 PROMO_NAME
 PROMO_SUBCATEGORY
 PROMO_SUBCATEGORY_ID
 PROMO_CATEGORY
 PROMO_CATEGORY_ID
 PROMO_COST
 PROMO_BEGIN_DATE
 PROMO_END_DATE
 PROMO_TOTAL
 PROMO_TOTAL_ID

SQL> select count(distinct(promo_id)) as promo
  2  ,      count(distinct(promo_subcategory_id)) as subcategory
  3  ,      count(distinct(promo_category_id)) as category
  4  ,      count(distinct(promo_total_id)) as promo_total
  5  from   promotions
  6  /

     PROMO SUBCATEGORY   CATEGORY PROMO_TOTAL
---------- ----------- ---------- -----------
       503          22          9           1
The COUNTRIES table will be joined to the customer table to form a single dimension - in dimensional modelling terms this is known as a "snowflake" dimension.
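If it helps to picture that, here's a quick sketch of the sort of view that flattens the snowflake out into a single customer dimension source; it's purely illustrative, and the view name is one I've made up:

-- Illustrative only: flatten the snowflaked CUSTOMERS and COUNTRIES tables
-- into a single customer dimension source, using the level ID columns
-- identified in the queries above.
create or replace view customer_dim_src_v as
select cu.cust_id
,      cu.cust_city_id
,      cu.cust_state_province_id
,      co.country_id
,      co.country_subregion_id
,      co.country_region_id
,      cu.cust_total_id
from   customers cu
join   countries co
on     co.country_id = cu.country_id;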
Before I use these figures, though, the first thing I want to do is think about how many dimension level combinations there are. Looking back at my SQL:
- Customers had 7 levels, if I count the Countries table as well (customer, city, state/province, country, subregion, region, total customer)
- Products had 4 levels (products, subcategory, category, total)
- Times had 7 levels over 2 hierarchies that shared the same bottom day level (day, calendar month, calendar quarter, calendar year, then day, fiscal month, fiscal quarter and fiscal year)
- Channels had 3 levels (channel, channel class and total channel), and
- Promotions had 4 levels (promotion, subcategory, category and total promotions)
If I multiply these figures together (7 x 4 x 7 x 3 x 4) I get 2,352 combinations of dimension levels by which users can aggregate the data. That looks promising, as there's plenty of scope for random, ad-hoc queries against the star schema - a requirement that our sample data set needs to fulfil.
From an OLAP perspective, though, I need to calculate the total number of cells that each fact table column will require. This is worked out by adding together the dimension members across all the levels in each dimension, then multiplying these per-dimension totals together to give the total cell requirement for the OLAP measure. When I store my star schema data in an OLAP analytic workspace, it normally reserves one cell - usually equating to 8 bytes of storage - for every possible combination of dimension members, even if there isn't a value held for that combination. I'll come on in a moment to how we deal with this issue - basically we only actually store values for valid dimension member combinations - but for the time being, how much space might an analytic workspace version of this star schema potentially take up?
Taking the figures produced by the queries, I end up with the following counts of dimension members:
- Customers: 55,500 customers + 620 cities + 145 states/provinces + 23 countries + 8 sub-regions + 6 regions + 1 total level = 56,303 members
- Products : 72 products + 22 product subcategories + 5 product categories + 1 product total = 100 members
- Times : 1826 days + 60 calendar months + 61 fiscal months + 20 calendar quarters + 21 fiscal quarters + 5 calendar years + 6 fiscal years = 1999 members
- Channels : 5 channels + 3 channel classes + 1 total channels = 9 members, and
- Promotions : 503 promotions + 22 promotion subcategories + 9 promotion categories + 1 promotion total = 535 members.
Now, if I multiply the dimension member totals together, I get 56,303 x 100 x 1,999 x 9 x 535 = 54,192,679,105,500 cell combinations per measure, which, if you multiply up by the storage requirement per cell (8 bytes) and then the number of measures in the SALES fact table (3), means that our storage requirement, if every dimension member combination held a valid value, would be around 1,300TB (1.3 petabytes) of storage. Which might cause us a bit of an issue.
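Incidentally, rather than working those figures out by hand, the same "fully dense" estimate can be derived straight from the dimension tables with a query along these lines. It's just a sketch - the 8 bytes per cell and 3 measures are plugged in as literals:

-- Rough sketch: derive the "every cell populated" storage estimate (in TB)
-- from the dimension tables themselves, assuming 8 bytes per cell and 3 measures.
select round(cust.members * prod.members * times.members
             * chan.members * promo.members * 8 * 3
             / power(10,12), 1) as fully_dense_tb
from  (select count(distinct cust_id) + count(distinct cust_city_id)
            + count(distinct cust_state_province_id) + count(distinct cust_total_id)
            + (select count(distinct country_id) + count(distinct country_subregion_id)
                    + count(distinct country_region_id) from countries) as members
       from   customers) cust
,     (select count(distinct prod_id) + count(distinct prod_subcategory_id)
            + count(distinct prod_category_id) + count(distinct prod_total_id) as members
       from   products) prod
,     (select count(distinct time_id) + count(distinct calendar_month_id)
            + count(distinct fiscal_month_id) + count(distinct calendar_quarter_id)
            + count(distinct fiscal_quarter_id) + count(distinct calendar_year_id)
            + count(distinct fiscal_year_id) as members
       from   times) times
,     (select count(distinct channel_id) + count(distinct channel_class_id)
            + count(distinct channel_total_id) as members
       from   channels) chan
,     (select count(distinct promo_id) + count(distinct promo_subcategory_id)
            + count(distinct promo_category_id) + count(distinct promo_total_id) as members
       from   promotions) promo;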
Of course, in reality, not every possible dimension member combination has a valid value, and in most cases our data is in fact what we term "sparse" - not every customer buys every product via every channel using every promotion each day. Sparsity isn't something that's ever really discussed when working with relational star schemas, as that type of storage only stores data for actual, real combinations of dimension values, but it's one of the most important factors in designing analytic workspaces because there are steps we can take to reduce the space taken up by the empty cells, and very sparse measures can reduce down in size significantly.
To see to what extent our fact table measures are sparse, I can get a rough approximation by first seeing how many distinct combinations of dimension values are actually present in the fact table (which, remember, has 918,843 rows):
SQL> select count(*) from (
  2  select distinct time_id
  3  ,      prod_id
  4  ,      cust_id
  5  ,      promo_id
  6  ,      cust_id
  7  from   sales )
  8  /

  COUNT(*)
----------
    700890
So there are in fact just over 700,000 actual valid combinations of lowest-level dimension members. Compare this to the potential number of lowest-level combinations (55,500 customers x 72 products x 1,826 days x 5 channels x 503 promotions = 18,351,190,440,000), and it turns out that only around 0.0000038% of the possible cells actually hold a value - in other words, the data is extremely sparse. If I then apply this ratio to our 1,300TB of required storage, the amount of disk space comes back down into the tens of megabytes, although the figures are getting a bit tenuous at this point as there are a few more factors to take into account. What it does mean, though, is that this star schema, with its many dimensions, levels and dimension members, certainly has the potential to create a large analytic workspace if the technology doesn't hold up well.
So, does it make the SH schema a good sample dataset? Well, being honest, the size of the star schema is still quite small, and once I come up with some results it'd be wise to test the findings out by running against a more realistic dataset, if only to keep the exercise credible. However, there are obviously lots of potential MV aggregations we can make, and there's lots of potential for database explosion, pre-aggregation and sparsity handling from the analytic workspace end, so it's probably a good starting point.
Going on from here then, how will I go about doing the tests? Here's the methodology I've come up with so far.
- Create three tablespaces with the default options for a 10gR2 database (locally managed etc), and three users called SH_MV, SH_ODM and SH_AW
- Export the SH sample schema into the SH_MV and SH_ODM schemas, and leave the SH_AW schema empty for the time being.
- Truncate the tables in the SH_MV and SH_ODM schemas, so we keep the table structures, dimensions, CWM2 metadata, constraints and so on.
- Create three views over the SALES table in the SH schema: one to retrieve the first 800,000 or so rows based on the TIME_ID, a second to retrieve the next 60,000 rows, and a third to retrieve the rest. The first view we'll use to do an initial load and aggregation; the second two we'll use for incremental loads and aggregations (my suspicion is that MVs will be quicker to re-aggregate for incremental loads, but the partitioning/compression benefits in OLAP 10.2 might surprise us all). There's a rough sketch of these views after this list.
- Turn on a 10046 trace, and then do a direct-path load from the SH schema into the SH_MV schema. Time it and see how long it takes to load, then do the same for the SH_ODM schema (these should obviously be much the same). Again, see the sketch after the list.
- Generate a set of materialized views (more on how I'll do this tomorrow) over the SH_MV fact table, run and time the process.
- Generate a materialized view using grouping sets over the fact table in the SH_ODM schema, then run and time the process (there's a cut-down example of this sort of MV after the list).
- In the SH_AW schema, create the AW/XML documents that define the equivalent analytic workspace and an equivalent set of aggregations, then use DBMS_AW_XML to generate the analytic workspace (leaving aggregation until later), again turning on trace and timing the process.
- Use the DBMS_AW_XML utility to aggregate the analytic workspace, again timing and tracing it (a sketch of the underlying OLAP DML aggregation step follows the list).
- Compare the results of the initial load and aggregation across the three approaches.
- Repeat the process with the second, 60,000-row view, to see how well each approach handles incremental loads and aggregation.
- Repeat the process with the third view, but this time inserting row-by-row, to see whether incremental refresh performance changes when we drip-feed the load rather than doing it in one go.
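As promised above, here's roughly what I have in mind for the staged-load views and the traced, direct-path load. Treat it as a sketch only: the view names, and the TIME_ID cut-off dates used to split SALES into roughly 800,000 / 60,000 / remaining rows, are placeholders I'd need to work out properly against the real data.

-- Sketch only: split SALES into an initial load plus two incremental slices
-- using TIME_ID cut-offs (the actual dates below are placeholders).
create or replace view sales_initial_load_v as
select * from sales where time_id <  date '2001-06-01';

create or replace view sales_increment1_v as
select * from sales where time_id >= date '2001-06-01'
                    and   time_id <  date '2001-10-01';

create or replace view sales_increment2_v as
select * from sales where time_id >= date '2001-10-01';

-- Turn on extended SQL trace (event 10046, level 8) and do a timed,
-- direct-path load from the SH schema into the SH_MV copy of the fact table.
alter session set tracefile_identifier = 'sh_mv_initial_load';
alter session set events '10046 trace name context forever, level 8';

set timing on

insert /*+ append */ into sh_mv.sales
select * from sh.sales_initial_load_v;

commit;

alter session set events '10046 trace name context off';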
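For the grouping sets materialized view in the SH_ODM schema, the plan is to have DBMS_ODM generate it, but to give an idea of what the end result looks like, here's a hand-written, heavily cut-down sketch of a single materialized view that stores several aggregation levels in one object using GROUP BY GROUPING SETS. The MV name and the particular level combinations are just examples, not the set I'll eventually use:

-- Illustrative, cut-down example of one MV holding multiple aggregation
-- levels via GROUPING SETS; the real one would cover far more level combinations.
create materialized view sales_gs_mv
build immediate
enable query rewrite
as
select t.calendar_month_id
,      t.calendar_quarter_id
,      p.prod_category_id
,      p.prod_subcategory_id
,      c.channel_class_id
,      grouping_id(t.calendar_month_id, t.calendar_quarter_id,
                   p.prod_category_id, p.prod_subcategory_id,
                   c.channel_class_id) as gid
,      sum(s.amount_sold)   as amount_sold
,      sum(s.quantity_sold) as quantity_sold
from   sales s
,      times t
,      products p
,      channels c
where  s.time_id    = t.time_id
and    s.prod_id    = p.prod_id
and    s.channel_id = c.channel_id
group  by grouping sets
(      (t.calendar_month_id,   p.prod_subcategory_id, c.channel_class_id)
,      (t.calendar_quarter_id, p.prod_category_id,    c.channel_class_id)
,      (t.calendar_quarter_id, p.prod_category_id)
);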
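Finally, although I plan to drive the analytic workspace build and aggregation through AW/XML and DBMS_AW_XML, the underlying step is essentially the OLAP DML AGGREGATE command, and the same thing can be scripted through DBMS_AW.EXECUTE if the XML route needs debugging. Again, the workspace, variable and aggmap names below (SALESAW, SALES_VAR, SALES_AGGMAP) are placeholders for whatever the AW/XML definition ends up creating:

-- Placeholder names throughout; this just shows the shape of an OLAP DML
-- aggregation driven from PL/SQL, as an alternative to the AW/XML route.
begin
  dbms_aw.execute('aw attach sh_aw.salesaw rw');
  dbms_aw.execute('aggregate sales_var using sales_aggmap');
  dbms_aw.execute('update');
  dbms_aw.execute('commit');
  dbms_aw.execute('aw detach sh_aw.salesaw');
end;
/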
So that should handle the summary creation testing. At the moment, there are still a few questions that I need to address:
- How do we go about choosing the right set of materialized views to create? There are lots and lots of potential dimension level combinations we could aggregate by; how do we choose which to use, and when do we stop?
- Is an event 10046 trace appropriate for an analytic workspace load?
- Is it realistic to test with a single-processor laptop (i.e. are there optimisations in aggregate creation that only kick in with more than one CPU)?
- How are we going to generate the AW/XML documents to create the AW?
- How are we going to create the materialized view using grouping sets?
- For the analytic workspace and the grouping sets materialized view, what levels should we aggregate on, and how can we marry this up with the set of materialized view aggregations we've also created?
Questions, questions. Anyway, I'll make a start on the project setup tomorrow, then start thinking about how we create the materialized views.