Back From the Hotel of Misery, Progress on the Aggregation Setup

September 15th, 2006 by Mark Rittman

Well, I’m back home in my own house tonight after my week of misery and despair
in my hotel in
Newbury
. I guess I’ve been spoilt recently working up in Soho, London, but a
week of nights on my own in a bleak single room, with only a sandwich to look
forward to the next day from the Texaco petrol station, and unrelenting rain and
overcast skies makes coming home all that much sweeter. In fact the only
highlight (apart from the selection of stale custard creams and jammy dodgers
that were left in my room each day by the cleaner) was making a bit of progress
with my aggregation testing project for the UKOUG conference later this year. I
ended up taking David Aldridge up on an offer for some advice on the
materialized view side of things, and as I all I otherwise had to do each
evening when I got to my room at 5.30 was eat the custard creams, I got a fair
bit done.

I started off by amending the script I wrote the other day to create the
schemas, to instead create them all with the same default tablespace - the
EXAMPLE one that the SH schema normally uses. This was because I was spending
more and more time just trying to get the export from the SH schema to go into
these new schemas, I tried all the normal steps of setting a quote of zero on
the EXAMPLE schema for the SH_AW, SH_MV and SH_ODM users, removing the UNLIMITED
TABLESPACE and RESOURCE priviledges, and in the end I gave up and put them all
in the same tablespace. Maybe later on I’ll try again but I was getting nowhere
fast and I needed to get some sort of result to cheer myself up. The revised
script to set up the test environment can now be downloaded here.

Next I wanted to put a script together to load data into the empty schemas
(the export from the SH schema was with ROWS=N, so I could pick up the table
structures, indexes, dimensions and so on, but not the data). First of all
though, I created a new user called SH_TEST_RESULTS in which I created two
tables, one to hold the results of the tests I ran, the other just a lookup
table of run types, the idea being that the scripts would write the start and
end times of each run in the table, and I could run and run the scripts again
and again to check that the results I get aren’t just a one-off - infact the
whole set of tests will be scriptable so that I can quickly run them again if I
want to try something new out. Anyway, the script to create the two tables in
the test results schema is as follow (the SQL and DDL was auto-generated from
SQL Developer, hence the formatting and the trigger code for the table sequence)

conn sh_test_results/password@ora10g

drop table results;

  CREATE TABLE "SH_TEST_RESULTS"."RESULTS"
   (	"RUN_SEQUENCE_ID" NUMBER NOT NULL ENABLE,
	"SCHEMA_NAME" VARCHAR2(6 BYTE) NOT NULL ENABLE,
	"RUN_TYPE_ID" NUMBER(2,0) NOT NULL ENABLE,
	"START_TIME" DATE,
	"RUN_TIME_SECS" NUMBER (10,2),
	 CONSTRAINT "TABLE1_PK" PRIMARY KEY ("RUN_SEQUENCE_ID") ENABLE
   ) ;

  CREATE OR REPLACE TRIGGER "SH_TEST_RESULTS"."SEQ_TRIG"
	before insert on "RESULTS"
	for each ro
	begin
	 if inserting then
	  if :NEW."RUN_SEQUENCE_ID" is null then
	   select RESULTS_SEQ.nextval
	   into :NEW."RUN_SEQUENCE_ID"
	   from dual;
	  end if;
         end if;
	end;
/
ALTER TRIGGER "SH_TEST_RESULTS"."SEQ_TRIG" ENABLE;

CREATE TABLE "SH_TEST_RESULTS"."RUN_TYPE"
   (	"RUN_TYPE_ID" NUMBER(2,0) NOT NULL ENABLE,
	"DESCRIPTION" VARCHAR2(30 BYTE) NOT NULL ENABLE,
	 CONSTRAINT "RUN_TYPE_PK" PRIMARY KEY ("RUN_TYPE_ID") ENABLE
   ) ;

Insert into "RUN_TYPE" ("RUN_TYPE_ID","DESCRIPTION") values (1,'INITIAL LOAD');
Insert into "RUN_TYPE" ("RUN_TYPE_ID","DESCRIPTION") values (2,'INITIAL AGGREGATION');
Insert into "RUN_TYPE" ("RUN_TYPE_ID","DESCRIPTION") values (4,'INCREMENTAL AGGR 1');
Insert into "RUN_TYPE" ("RUN_TYPE_ID","DESCRIPTION") values (6,'INCREMENTAL AGGR 2');
Insert into "RUN_TYPE" ("RUN_TYPE_ID","DESCRIPTION") values (5,'INCREMENTAL LOAD 2');
Insert into "RUN_TYPE" ("RUN_TYPE_ID","DESCRIPTION") values (3,'INCREMENTAL LOAD 1');

Now that the users and the control table were set up, I started writing the
code to load the first of the test users, SH_MV. This script would firstly
INSERT /*+ APPEND */ data from the SH schema dimension tables into the SH_MV
schema, then drop the bitmap indexes on the SALES table, load the table and then
recreate the bitmap indexes. At the start and the end of the script I’d put code
in to record the start time, then store it, the end time and the elapsed time in
the control table, like this:

spool c:\agg_test_files\load_aw_mv.log

connect sh_mv/password@ora10g

variable start_time varchar2(20)
variable end_time varchar2(20)

begin
select to_char(sysdate, 'DD:MM:YYYY HH:MI:SS') into :start_time from dual;
end;
/

drop index sales_time_bix;
drop index sales_prod_bix;
drop index sales_cust_bix;
drop index sales_promo_bix;
drop index sales_channel_bix;

insert /*+ APPEND */ into countries
select *
from   sh.countries
/

insert /*+ APPEND */ into customers
select *
from   sh.customers
/

insert /*+ APPEND */ into channels
select *
from   sh.channels
/

insert /*+ APPEND */ into times
select *
from   sh.times
/

insert /*+ APPEND */ into promotions
select *
from   sh.promotions
/

insert /*+ APPEND */ into products
select *
from   sh.products
/

insert /*+ APPEND */ into sales
select *
from   sh.sales
/

commit;

CREATE BITMAP INDEX "SALES_TIME_BIX" ON "SALES" ("TIME_ID")
   LOCAL
 (PARTITION "SALES_1995" ,
 PARTITION "SALES_1996" ,
 PARTITION "SALES_H1_1997" ,
 PARTITION "SALES_H2_1997" ,
 PARTITION "SALES_Q1_1998" ,
 PARTITION "SALES_Q2_1998" ,
 PARTITION "SALES_Q3_1998" ,
 PARTITION "SALES_Q4_1998" ,
 PARTITION "SALES_Q1_1999" ,
 PARTITION "SALES_Q2_1999" ,
 PARTITION "SALES_Q3_1999" ,
 PARTITION "SALES_Q4_1999" ,
 PARTITION "SALES_Q1_2000" ,
 PARTITION "SALES_Q2_2000" ,
 PARTITION "SALES_Q3_2000" ,
 PARTITION "SALES_Q4_2000" ,
 PARTITION "SALES_Q1_2001" ,
 PARTITION "SALES_Q2_2001" ,
 PARTITION "SALES_Q3_2001" ,
 PARTITION "SALES_Q4_2001" ,
 PARTITION "SALES_Q1_2002" ,
 PARTITION "SALES_Q2_2002" ,
 PARTITION "SALES_Q3_2002" ,
 PARTITION "SALES_Q4_2002" ,
 PARTITION "SALES_Q1_2003" ,
 PARTITION "SALES_Q2_2003" ,
 PARTITION "SALES_Q3_2003" ,
 PARTITION "SALES_Q4_2003" )
/

CREATE BITMAP INDEX "SALES_CHANNEL_BIX" ON "SALES" ("CHANNEL_ID")
   LOCAL
 (PARTITION "SALES_1995" ,
 PARTITION "SALES_1996" ,
 PARTITION "SALES_H1_1997" ,
 PARTITION "SALES_H2_1997" ,
 PARTITION "SALES_Q1_1998" ,
 PARTITION "SALES_Q2_1998" ,
 PARTITION "SALES_Q3_1998" ,
 PARTITION "SALES_Q4_1998" ,
 PARTITION "SALES_Q1_1999" ,
 PARTITION "SALES_Q2_1999" ,
 PARTITION "SALES_Q3_1999" ,
 PARTITION "SALES_Q4_1999" ,
 PARTITION "SALES_Q1_2000" ,
 PARTITION "SALES_Q2_2000" ,
 PARTITION "SALES_Q3_2000" ,
 PARTITION "SALES_Q4_2000" ,
 PARTITION "SALES_Q1_2001" ,
 PARTITION "SALES_Q2_2001" ,
 PARTITION "SALES_Q3_2001" ,
 PARTITION "SALES_Q4_2001" ,
 PARTITION "SALES_Q1_2002" ,
 PARTITION "SALES_Q2_2002" ,
 PARTITION "SALES_Q3_2002" ,
 PARTITION "SALES_Q4_2002" ,
 PARTITION "SALES_Q1_2003" ,
 PARTITION "SALES_Q2_2003" ,
 PARTITION "SALES_Q3_2003" ,
 PARTITION "SALES_Q4_2003" )
/

CREATE BITMAP INDEX "SALES_PROMO_BIX" ON "SALES" ("PROMO_ID")
   LOCAL
 (PARTITION "SALES_1995" ,
 PARTITION "SALES_1996" ,
 PARTITION "SALES_H1_1997" ,
 PARTITION "SALES_H2_1997" ,
 PARTITION "SALES_Q1_1998" ,
 PARTITION "SALES_Q2_1998" ,
 PARTITION "SALES_Q3_1998" ,
 PARTITION "SALES_Q4_1998" ,
 PARTITION "SALES_Q1_1999" ,
 PARTITION "SALES_Q2_1999" ,
 PARTITION "SALES_Q3_1999" ,
 PARTITION "SALES_Q4_1999" ,
 PARTITION "SALES_Q1_2000" ,
 PARTITION "SALES_Q2_2000" ,
 PARTITION "SALES_Q3_2000" ,
 PARTITION "SALES_Q4_2000" ,
 PARTITION "SALES_Q1_2001" ,
 PARTITION "SALES_Q2_2001" ,
 PARTITION "SALES_Q3_2001" ,
 PARTITION "SALES_Q4_2001" ,
 PARTITION "SALES_Q1_2002" ,
 PARTITION "SALES_Q2_2002" ,
 PARTITION "SALES_Q3_2002" ,
 PARTITION "SALES_Q4_2002" ,
 PARTITION "SALES_Q1_2003" ,
 PARTITION "SALES_Q2_2003" ,
 PARTITION "SALES_Q3_2003" ,
 PARTITION "SALES_Q4_2003" )
/

 CREATE BITMAP INDEX "SALES_PROD_BIX" ON "SALES" ("PROD_ID")
   LOCAL
 (PARTITION "SALES_1995" ,
 PARTITION "SALES_1996" ,
 PARTITION "SALES_H1_1997" ,
 PARTITION "SALES_H2_1997" ,
 PARTITION "SALES_Q1_1998" ,
 PARTITION "SALES_Q2_1998" ,
 PARTITION "SALES_Q3_1998" ,
 PARTITION "SALES_Q4_1998" ,
 PARTITION "SALES_Q1_1999" ,
 PARTITION "SALES_Q2_1999" ,
 PARTITION "SALES_Q3_1999" ,
 PARTITION "SALES_Q4_1999" ,
 PARTITION "SALES_Q1_2000" ,
 PARTITION "SALES_Q2_2000" ,
 PARTITION "SALES_Q3_2000" ,
 PARTITION "SALES_Q4_2000" ,
 PARTITION "SALES_Q1_2001" ,
 PARTITION "SALES_Q2_2001" ,
 PARTITION "SALES_Q3_2001" ,
 PARTITION "SALES_Q4_2001" ,
 PARTITION "SALES_Q1_2002" ,
 PARTITION "SALES_Q2_2002" ,
 PARTITION "SALES_Q3_2002" ,
 PARTITION "SALES_Q4_2002" ,
 PARTITION "SALES_Q1_2003" ,
 PARTITION "SALES_Q2_2003" ,
 PARTITION "SALES_Q3_2003" ,
 PARTITION "SALES_Q4_2003" ) ;

CREATE BITMAP INDEX "SALES_CUST_BIX" ON "SALES" ("CUST_ID")
   LOCAL
 (PARTITION "SALES_1995" ,
 PARTITION "SALES_1996" ,
 PARTITION "SALES_H1_1997" ,
 PARTITION "SALES_H2_1997" ,
 PARTITION "SALES_Q1_1998" ,
 PARTITION "SALES_Q2_1998" ,
 PARTITION "SALES_Q3_1998" ,
 PARTITION "SALES_Q4_1998" ,
 PARTITION "SALES_Q1_1999" ,
 PARTITION "SALES_Q2_1999" ,
 PARTITION "SALES_Q3_1999" ,
 PARTITION "SALES_Q4_1999" ,
 PARTITION "SALES_Q1_2000" ,
 PARTITION "SALES_Q2_2000" ,
 PARTITION "SALES_Q3_2000" ,
 PARTITION "SALES_Q4_2000" ,
 PARTITION "SALES_Q1_2001" ,
 PARTITION "SALES_Q2_2001" ,
 PARTITION "SALES_Q3_2001" ,
 PARTITION "SALES_Q4_2001" ,
 PARTITION "SALES_Q1_2002" ,
 PARTITION "SALES_Q2_2002" ,
 PARTITION "SALES_Q3_2002" ,
 PARTITION "SALES_Q4_2002" ,
 PARTITION "SALES_Q1_2003" ,
 PARTITION "SALES_Q2_2003" ,
 PARTITION "SALES_Q3_2003" ,
 PARTITION "SALES_Q4_2003" ) ;

begin
select to_char(sysdate, 'DD:MM:YYYY HH:MI:SS') into :end_time from dual;
end;
/

connect sh_test_results/password@ora10g

insert into sh_test_results.results (schema_name
,                            run_type_id
,                            start_time
,                            end_time
,                            run_time_secs)
values 			    ('SH_MV'
,                            1
,                            to_date(:start_time,'DD:MM:YYYY HH:MI:SS')
,                            to_date(:end_time,'DD:MM:YYYY HH:MI:SS')
,        	             (to_date(:end_time,'DD:MM:YYYY HH:MI:SS') - to_date(:start_time,'DD:MM:YYYY HH:MI:SS')) * 86400
)
/

commit;

spool off

Next, I do the same for the SH_ODM schema, changing the schema names as
appropriate. Now I’ve got two schemas with data from the SH schema copied into
them, and timings for how long the process took.

Now I come the the bit where I need to define the materialized views for the
SH_MV schema. Now the idea I had for this before I started the tests, given that
it’s fairly tricky to come up with a correct set of summaries for a schema
you’ve not queried before, is to use the SQL Access Advisor that you get with
Oracle Database 10g. By logging on to Database Control and navigating to Advisor
Central, you can get the SQL Access advisor to recommend a set of materialized
views based on either a stored workload, or more appropriately for me, the
dimensional metadata that exists for these tables in the form of CREATE
DIMENSION statements.

Working through the SQL Access Advisor, I created an advisor job against the
tables in the SH schema (this was before I set up the users, but it should hold
the same for the SH_MV user) and let the job run; it came back with the
encouraging message "Potential for Large Improvement" (sounds like my
old P.E. report)


Looking at the results, it had 28 recommendations ordered by
effectiveness.

Looking at the recommendations individually, they were all for materialized
views (they could have been indexes) against the SALES and dimension tables,
together with the required materialized view logs.

Now my thinking now is that this selection - 28 of them - is
pretty good seeing as we were looking to check the performance of an analytic
workspace compared to a number of materialized views, and to a single
hierarchical materialized view using GROUPING SETS. I therefore turned the
recommendations into a script which you can download here
(there’s a lot of MV creation commands in there).

Now comes the next bit - where I have to create a summary for
the SH_ODM schema, but using a materialized view using GROUPING SETS to put all
the aggregations into a single summary table. I did a bit of too-ing and fro-ing
with David Aldridge on this one, and I did originally plan to have the DBMS_ODM
procedure create it for me, but it was so much work creating the required CWM2
OLAP metadata that I couldn’t imagine a real-life DBA doing this, so I created
it manually myself. The script is here,
but the interesting part, where we create the single materialized view using
GROUPING SETS, is here (I’ve edited it to remove all but a couple of the 28
grouping sets)

CREATE MATERIALIZED VIEW "SH_ODM"."MV$SALES_HMV"
BUILD IMMEDIATE
REFRESH FORCE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT  CU.CUST_TOTAL_ID C1
,       CU.CUST_STATE_PROVINCE_ID C2
,       CU.CUST_CITY_ID C3
,	CU.COUNTRY_ID C4
,	CT.COUNTRY_TOTAL_ID C5
,	 CT.COUNTRY_REGION_ID C6
, 	 CT.COUNTRY_SUBREGION_ID C7
,       S.CUST_ID C8
, 	PM.PROMO_TOTAL_ID C9
,       PM.PROMO_CATEGORY_ID C10
,       PM.PROMO_SUBCATEGORY_ID C11
,       S.PROMO_ID C12
, 	CH.CHANNEL_TOTAL_ID C13
, 	CH.CHANNEL_CLASS_ID C14
, 	PD.PROD_CATEGORY_ID C15
, 	PD.PROD_SUBCATEGORY_ID C16
, 	S.PROD_ID C17
,       T.CALENDAR_YEAR_ID C18
,       T.CALENDAR_QUARTER_ID C19
,       T.CALENDAR_MONTH_ID C20
,       S.TIME_ID C21
, 	COUNT(*) M1
,       SUM(S.QUANTITY_SOLD) QUANTITY_SOLD
,       SUM(S.AMOUNT_SOLD) AMOUNT_SOLD
,       COUNT(S.QUANTITY_SOLD) C_QUANTITY_SOLD
,       COUNT(S.AMOUNT_SOLD) C_AMOUNT_SOLD
FROM 	SH_ODM.CUSTOMERS CU
, 	SH_ODM.PROMOTIONS PM
, 	SH_ODM.CHANNELS CH
, 	SH_ODM.PRODUCTS PD
,       SH_ODM.TIMES T
, 	SH_ODM.SALES S
,       SH_ODM.COUNTRIES CT
WHERE 	S.PROMO_ID = PM.PROMO_ID
AND 	S.CHANNEL_ID = CH.CHANNEL_ID
AND 	S.CUST_ID = CU.CUST_ID
AND	S.PROD_ID = PD.PROD_ID
AND     S.TIME_ID = T.TIME_ID
AND	CU.COUNTRY_ID = CT.COUNTRY_ID
GROUP BY
 GROUPING SETS
(
(	CU.CUST_TOTAL_ID
,	 PM.PROMO_TOTAL_ID
,	 CH.CHANNEL_TOTAL_ID
,	 CH.CHANNEL_CLASS_ID
,	 PD.PROD_TOTAL_ID
, 	 PD.PROD_CATEGORY_ID
, 	 PD.PROD_SUBCATEGORY_ID
,	 S.CHANNEL_ID
, 	 S.PROD_ID)
,
(	 CT.COUNTRY_TOTAL_ID
,	 CT.COUNTRY_REGION_ID
, 	 CT.COUNTRY_SUBREGION_ID
, 	 CU.COUNTRY_ID
,	 PM.PROMO_TOTAL_ID
, 	 CH.CHANNEL_TOTAL_ID
, 	 CH.CHANNEL_CLASS_ID
, 	 PD.PROD_TOTAL_ID
,	 PD.PROD_CATEGORY_ID
, 	 S.CHANNEL_ID)
,
... edited to remove 25 other grouping sets

(	 CT.COUNTRY_TOTAL_ID
,	PM.PROMO_TOTAL_ID
, 	CH.CHANNEL_TOTAL_ID
, 	CH.CHANNEL_CLASS_ID
,	PD.PROD_TOTAL_ID
, 	PD.PROD_CATEGORY_ID
, 	S.CHANNEL_ID)
);

begin
dbms_stats.gather_table_stats('"SH_ODM"','"MV$SALES_HMV"',NULL,dbms_stats.auto_sample_size);
	end;
/

The idea being that I created a materialized view containing a
superset of all the columns in the 28 individual materialized views, and put a
grouping set at the end of the statement for each of the GROUP BYs in the
individual materialized views - which should, in theory, create me a
materialized view containing every aggregation performed by the individual MVs.

Unfortunately, I couldn’t get the MV to fast refresh - when I
create it with REFRESH FAST I get the error "cannot create a fast refresh
materialized view from a complex query" which I guess is down to the
grouping sets - I’ll have to check this out later.

So, the stage I’m at now is that I’ve written the code to do the
initial load - although I’ll need to amend this to load just the first 100k rows
from the SH.SALES table - and the initial aggregation, on the one hand using 28
materialized views recommended by the SQL Access Advisor, on the other using a
single materialized view created using GROUP BY … GROUPING SETS, but
containing the same set of aggregates. The next stage now is to create the
script to generate the analytic workspace. If you want to grab the scripts,
there here.

Comments

  1. Pete_s Says:

    Mark, you should also include the grouping_id in the grouping set example - well if you want query rewrite to work you should! I think there may even be a warning in 10gR2 dbm_mview.explain_rewrite.
    It will also help you avoid overcounting if more that onecrouping set has a dimension at the same level as another.
    And of course has Dave has mentioned in his blog (an I did elsewhere) the grouping_id could form a usefule subpartition key for partition elimination

  2. David Aldridge Says:

    Yes, the grouping id ought to be slapped into the select, and I’d personally regard it as essential that it be a list partitioning key, or subpartitioning key, on the table.
    Here’s the grouping_id docs …
    http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14200/functions063.htm#SQLRF00648
    So your column definition would be something like …
    GROUPING_ID(CU.CUST_TOTAL_ID, CU.CUST_STATE_PROVINCE_ID, CU.CUST_CITY_ID, CU.COUNTRY_ID,
    …
    ) GID
    ie. include the group by columns.
    Getting the list of values for the grouping_id for the purpose of defining the list (sub) partitions is easy … run your MV select statement but select only the grouping_id expression and replace the FROM clause with …
    FROM SH_ODM.CUSTOMERS CU
    , SH_ODM.PROMOTIONS PM
    , SH_ODM.CHANNELS CH
    , SH_ODM.PRODUCTS PD
    , SH_ODM.TIMES T
    , (SELECT * FROM SH_ODM.SALES WHERE ROWNUM = 1) S
    , SH_ODM.COUNTRIES CT
    I’d also range partition the table by some date value — time_id, I expect — and some other commonly applied predicate column.
    Another hint … the partitions of the local indexes will automatically inherit the table partition names, so you can strip that part out of your script.
    Also it’s interesting that the advisor included all of those ID columns in the MV definition. I would imagine that if you are going to go to the trouble of including a join to the dimension tables then you might also include natural values such as FISCAL_MONTH_NUMBER. I wonder whether it avoids doing this because they might be subject to change (as CUST_LAST_NAME might be, for instance) which makes for a tricky refresh scenario? Not so tricky if you also had the dimension PK column included in the MV, but trickier than just including immutable ID’s. This wouldn’t apply for the date values of course, but it would be worth thinking about what other dimension columns would be fixed, and include them also.
    Lastly, make sure that you migrate the DIMENSION definitions from the original schema. Get them from the original SH script, I guess, or extract them with a script like …
    set long 10000 pagesize 0
    select dbms_metadata.get_ddl(’DIMENSION’,dimension_name)
    from user_dimensions;
    Output from that is a little messy with it’s line breaks, I find.
    This ought to be enough to make anyone swear of the database stuff for life, I’d think.

  3. Mark Says:

    Hmm, thanks Pete, Dave. So with the grouping_id in the select statement, would I need one grouping ID for each grouping set defined in the group by section of the statement, aliased appropriately, i.e. i’ll have 28 columns that are based on grouping_id, one for each grouping set I define later on?
    thanks
    Mark

  4. Pete_s Says:

    Just the one - but the argument to grouping_id function would be a tad long! The grouping_id has a unique value for each grouping in the aggregation.

  5. Mark Says:

    I get it - I’ve just read this again as well - http://oraclesponge.wordpress.com/tag/oracle/oracle-materialized-views/page/2/ - which I think explains it. Thanks again. If you don’t mind, I’ll rework the MV definition and get to you review it, make sure I’ve got the correct end of the stick. Thanks for your help.
    Mark