Using a PARTITION BY JOIN to Fill In The Gaps

As a quick introduction, I'm Stewart Bryson, Technical Director for North America. It seems strange to be posting an entry on this blog after so many years of reading it, and learning from it. The momentum it has in the community is something I do not take lightly, so hopefully, this post and others from me will find a receptive audience.

Recently, I had a design decision to make with some ETL code I was writing for a client. The fact table I was loading was at a transactional grain, recording activity of particular business processes, and storing the cumulative amount of time for certain tasks. The source and target tables were reasonably complicated, so for simplicity sake, I’ll use the PROMOTIONS table in the SH schema (an optional sample schema installed by the DBCA) to demonstrate my method.

desc sh.promotions
Name                          Null?    Type


PROMOID NOT NULL NUMBER(6) PROMONAME NOT NULL VARCHAR2(30) PROMOSUBCATEGORY NOT NULL VARCHAR2(30) PROMOSUBCATEGORYID NOT NULL NUMBER PROMOCATEGORY NOT NULL VARCHAR2(30) PROMOCATEGORYID NOT NULL NUMBER PROMOCOST NOT NULL NUMBER(10,2) PROMOBEGINDATE NOT NULL DATE PROMOENDDATE NOT NULL DATE PROMOTOTAL NOT NULL VARCHAR2(15) PROMOTOTALID NOT NULL NUMBER

For simplicity sake, I’ll be loading a very simple fact table without surrogate keys and only a simple measure: DURATION_DAYS.

desc dim.promotions_fact
Name                          Null?    Type


PROMOID NUMBER(6) PROMONAME VARCHAR2(30) PROMOSUBCATEGORY VARCHAR2(30) PROMOBEGINDATE DATE PROMOENDDATE DATE DURATIONDAYS NUMBER

Now ordinarily, finding the difference between two dates is easy in Oracle… it’s simple subtraction. We could easily load this fact table with the following statement:

INSERT INTO dim.promotionsfact
SELECT promoid,
       promoname,
       promosubcategory,
       promobegindate,
       promoenddate,
       promoenddate-promobegindate
FROM sh.promotions;

503 rows created.

select promobegindate, promoenddate, durationdays from dim.promotionsfact where promo_id=108;

PROMOBEGINDATE | PROMOENDDATE | DURATION_DAYS ---------------------- | ---------------------- | ------------- 12/23/2000 12:00:00 AM | 01/23/2001 12:00:00 AM | 31

1 row selected.

Problem solved, right? Well, maybe not. What if one of the requirements for the DURATION_DAYS measure was to exclude weekends? My client had a process for calculating these values in the OLTP reporting database using some cursor-based PL/SQL code that was terribly inefficient, doing multiple table lookups for every single row processed. I concluded that I should be able to do this with set-based processing and a single join. All I really need is a table that is loaded with a row for every day in the year, and also distinguishes weekdays from non-weekdays. Does that sound like any table we know?

Sure, it’s the DATE dimension table… or what some models refer to as the TIME or PERIOD dimension. We have such a table in the SH schema, called TIMES, and it’s likely that you have such a dimension table as well. I hadn’t looked at the TIMES table in a while, and I was hoping it also had an attribute distinguishing holidays from non-holidays. Since it doesn’t, I’ll remove that requirement from my list, but just keep in mind: if your date dimension table also has a holiday flag column, you could eliminate holidays from the calculation as well.

First, I’ll need a single date column to use to join in the TIMES table, but currently, I have two date columns: PROMOBEGINDATE and PROMOENDDATE. Sounds like I need to pivot those two dates into a single one called STATUSDATE. I did this using the following Warehouse Builder mapping to load a staging table called PROMOTIONSSTG:

200904220651

200904220651-1

200904220652

The natural key of the PROMOTIONS table is PROMO_ID, and you can see that I now have two rows for every natural key: a row representing the beginning of the promotion, and one representing the end:

select promoid, statusdate, status
  from stage.promotionsstg
 where promoid=108;

PROMOID | STATUSDATE | STATUS ---------- | ---------------------- | ---------- 108 | 12/23/2000 12:00:00 AM | begin 108 | 01/23/2001 12:00:00 AM | end

2 rows selected.

Now my idea is to join in the TIMES table to this staging table using an outer join so the TIMES table will effectively fill in the gaps between start dates and end dates, and then I can use a simple COUNT aggregate to calculate DURATIONDAYS. But there is only one problem with this solution… a simple outer join will only fill in the missing days once FOR THE ENTIRE ROW SET, while I need the rows filled in ONCE FOR EACH PROMOID.

The Oracle database offers an extension to the ANSI join syntax called the PARTITION BY JOIN. I can use the the PARTITION BY syntax, just as I would with an analytic function, to “partition” or “divide up” the rows in my staging table according to the PROMOID. This allows me to use a single join to the TIMES table, which actually acts as if I were performing multiple joins; one for each PROMOID.

Warehouse Builder does not support the PARTITION BY JOIN syntax… which is a real shame. However, I can create a view to perform this logic, and then use the view as a source in my Warehouse Builder mapping:

CREATE OR REPLACE VIEW stage.promotionsvw
AS
SELECT promoid,
       promoname,
       promosubcategory,
       promobegindate,
       promoenddate,
       timeid statusdate,
       status,
       weekdayflag
  FROM ( SELECT timeid,
                CASE
                WHEN dayname IN ('Saturday','Sunday') THEN 'N'
                ELSE 'Y'
                END weekdayflag,
                promoid,
                MAX(promoname) OVER (partition BY promoid) promoname,
                MAX(promosubcategory) OVER (partition BY promoid) promosubcategory,
                MAX(promobegindate) OVER (partition BY promoid) promobegindate,
                MAX(promoenddate) OVER (partition BY promoid) promoenddate,
                nvl(status, 'ongoing') status
           FROM stage.promotionsstg partition BY (promoid)
          right outer JOIN sh.times
                ON timeid=statusdate)
 WHERE timeid BETWEEN promobegindate AND promoenddate
/

View created.

select statusdate, status, weekdayflag from stage.promotionsvw where promoid=108 order by status_date;

STATUS_DATE | STATUS | W ---------------------- | ---------- | - 12/23/2000 12:00:00 AM | begin | N 12/24/2000 12:00:00 AM | ongoing | N 12/25/2000 12:00:00 AM | ongoing | Y 12/26/2000 12:00:00 AM | ongoing | Y 12/27/2000 12:00:00 AM | ongoing | Y 12/28/2000 12:00:00 AM | ongoing | Y 12/29/2000 12:00:00 AM | ongoing | Y 12/30/2000 12:00:00 AM | ongoing | N 12/31/2000 12:00:00 AM | ongoing | N 01/01/2001 12:00:00 AM | ongoing | Y 01/02/2001 12:00:00 AM | ongoing | Y 01/03/2001 12:00:00 AM | ongoing | Y 01/04/2001 12:00:00 AM | ongoing | Y 01/05/2001 12:00:00 AM | ongoing | Y 01/06/2001 12:00:00 AM | ongoing | N 01/07/2001 12:00:00 AM | ongoing | N 01/08/2001 12:00:00 AM | ongoing | Y 01/09/2001 12:00:00 AM | ongoing | Y 01/10/2001 12:00:00 AM | ongoing | Y 01/11/2001 12:00:00 AM | ongoing | Y 01/12/2001 12:00:00 AM | ongoing | Y 01/13/2001 12:00:00 AM | ongoing | N 01/14/2001 12:00:00 AM | ongoing | N 01/15/2001 12:00:00 AM | ongoing | Y 01/16/2001 12:00:00 AM | ongoing | Y 01/17/2001 12:00:00 AM | ongoing | Y 01/18/2001 12:00:00 AM | ongoing | Y 01/19/2001 12:00:00 AM | ongoing | Y 01/20/2001 12:00:00 AM | ongoing | N 01/21/2001 12:00:00 AM | ongoing | N 01/22/2001 12:00:00 AM | ongoing | Y 01/23/2001 12:00:00 AM | end | Y

32 rows selected.

Now I can use this view as a source in a Warehouse Builder mapping, and load the fact table using an aggregate function:

200904220652-1

200904220653

200904220653-1

200904220653-2

And now, you can see the results:

select promobegindate,
       promoenddate,
       durationdays
  from dim.promotionsfact
 where promo_id=108;

PROMOBEGINDATE | PROMOENDDATE | DURATION_DAYS ---------------------- | ---------------------- | ------------- 12/23/2000 12:00:00 AM | 01/23/2001 12:00:00 AM | 22

1 row selected.

If OWB supported the PARTITION BY JOIN syntax, the entire process could be done in a single mapping. Because it doesn't, I was forced to to first write to a staging table, and then use a view on top of that table to fill in the PARTITION BY JOIN logic.

Subscribe to Rittman Mead

Get the latest posts delivered right to your inbox.

or subscribe via RSS with Feedly!

Subscribe to Our
Monthly Newsletter!

* indicates required