Using a PARTITION BY JOIN to Fill In The Gaps

April 22nd, 2009 by

As a quick introduction, I’m Stewart Bryson, Technical Director for North America. It seems strange to be posting an entry on this blog after so many years of reading it, and learning from it. The momentum it has in the community is something I do not take lightly, so hopefully, this post and others from me will find a receptive audience.

Recently, I had a design decision to make with some ETL code I was writing for a client. The fact table I was loading was at a transactional grain, recording activity of particular business processes, and storing the cumulative amount of time for certain tasks. The source and target tables were reasonably complicated, so for simplicity sake, I’ll use the PROMOTIONS table in the SH schema (an optional sample schema installed by the DBCA) to demonstrate my method.

For simplicity sake, I’ll be loading a very simple fact table without surrogate keys and only a simple measure: DURATION_DAYS.

Now ordinarily, finding the difference between two dates is easy in Oracle… it’s simple subtraction. We could easily load this fact table with the following statement:

Problem solved, right? Well, maybe not. What if one of the requirements for the DURATION_DAYS measure was to exclude weekends? My client had a process for calculating these values in the OLTP reporting database using some cursor-based PL/SQL code that was terribly inefficient, doing multiple table lookups for every single row processed. I concluded that I should be able to do this with set-based processing and a single join. All I really need is a table that is loaded with a row for every day in the year, and also distinguishes weekdays from non-weekdays. Does that sound like any table we know?

Sure, it’s the DATE dimension table… or what some models refer to as the TIME or PERIOD dimension. We have such a table in the SH schema, called TIMES, and it’s likely that you have such a dimension table as well. I hadn’t looked at the TIMES table in a while, and I was hoping it also had an attribute distinguishing holidays from non-holidays. Since it doesn’t, I’ll remove that requirement from my list, but just keep in mind: if your date dimension table also has a holiday flag column, you could eliminate holidays from the calculation as well.

First, I’ll need a single date column to use to join in the TIMES table, but currently, I have two date columns: PROMO_BEGIN_DATE and PROMO_END_DATE. Sounds like I need to pivot those two dates into a single one called STATUS_DATE. I did this using the following Warehouse Builder mapping to load a staging table called PROMOTIONS_STG:

200904220651

200904220651-1

200904220652

The natural key of the PROMOTIONS table is PROMO_ID, and you can see that I now have two rows for every natural key: a row representing the beginning of the promotion, and one representing the end:

Now my idea is to join in the TIMES table to this staging table using an outer join so the TIMES table will effectively fill in the gaps between start dates and end dates, and then I can use a simple COUNT aggregate to calculate DURATION_DAYS. But there is only one problem with this solution… a simple outer join will only fill in the missing days once FOR THE ENTIRE ROW SET, while I need the rows filled in ONCE FOR EACH PROMO_ID.

The Oracle database offers an extension to the ANSI join syntax called the PARTITION BY JOIN. I can use the the PARTITION BY syntax, just as I would with an analytic function, to “partition” or “divide up” the rows in my staging table according to the PROMO_ID. This allows me to use a single join to the TIMES table, which actually acts as if I were performing multiple joins; one for each PROMO_ID.

Warehouse Builder does not support the PARTITION BY JOIN syntax… which is a real shame. However, I can create a view to perform this logic, and then use the view as a source in my Warehouse Builder mapping:

Now I can use this view as a source in a Warehouse Builder mapping, and load the fact table using an aggregate function:

200904220652-1

200904220653

200904220653-1

200904220653-2

And now, you can see the results:

If OWB supported the PARTITION BY JOIN syntax, the entire process could be done in a single mapping. Because it doesn’t, I was forced to to first write to a staging table, and then use a view on top of that table to fill in the PARTITION BY JOIN logic.

Comments

  1. Scott Powell Says:

    Hi Stewart – apologies in advance that I haven’t had time to digest the “partition by join” piece (it’s late…) – but I did want to add a comment about using calendars to count “business days” in general. When I build a time dimension (and assuming it has weekend and holiday indicators), I also build a column called business_day_num – which is simply a sequential count of all business days from whatever the first day in the time dim is. i.e. if the calendar starts 1/1/1970, then 1/2/1970 would be business_day_num = 1, 1/3/1970 would be business_day_num = 2 (assuming they are weekdays, of course), up to whatever end date you go through (i.e. assuming 250 business days per year, you’d have something like business_day_num = 7500 after 30 years…).

    For every day that is NOT a business day, it simply gets the business_day_num of the previous day.

    Then, calculating the # of business days between two given dates becomes a simple subtraction.

    Hope this helps,
    Scott

  2. Jeff Moss Says:

    Welcome Stewart.

    It’s interesting that you raise the point of OWB not being able to do something and then coding around it inside a view…I spend a lot of time on a Data Warehouse system that is largely based on that way of doing things, i.e. views with all the “work” in them and simple OWB mappings that take those views and push the generated transactions to targets.

    The design approach to use code inside views, was taken for that system, at a time when OWB was on version 9 (way old now) and when it was quite immature in terms of functionality – perhaps the right choice then as we kept hitting similar issues to the one you describe, where we couldn’t do what we wanted with OWB.

    Whether it’s the right choice now? Well, I think there are still some advantages to choosing the views to do all the “complex” coding rather than drawing pretty pictures in OWB…but this client is actively looking at whether they should be using more of the OWB product going forwards.

    Might be nice to see a comparison piece about why one should choose one approach over the other…

    Cheers
    Jeff

  3. Stewart Bryson Says:

    @Scott
    That’s a great solution. It’s good to see developers step away from the cookie-cutter date dimension and actually USE it for their purposes. The original ETL I wrote didn’t count days… that was used for simplicity in this example. I was actually calculating multiple business metrics using seconds, doing some complicated expressions on the BEGIN and END status days. I’m not certain if I would have been able to do that with additions to the date dimension or not.

    @Jeff
    It depends on your requirements. With some shops, the key is getting to market quickly, and with exceptional developers doing all the complex logic, that model has tremendous upside. Other organizations will use OWB to allow a mix of talents and skill levels in a development team to all contribute to the same code in a modular fashion. In those scenarios, having the “pretty pictures” as you say serves to better document the source-to-target mappings inherent in the process. That’s one of the real benefits of OWB… it can adjust to your skill level to provide different ways of accomplishing the same task.

    Personally, I choose to put the logic in OWB whenever possible. However, when I take that position, and then find myself having to do logic in a view, I am possibly watering down the rest of the OWB logic because I have essentially placed the logic in two places. Your approach always maintains the logic in one place, so it certainly gets points in that regard.

  4. Jeff Moss Says:

    Well, I’m glad you think I and the team I work with are “exceptional developers”!

    ;-)

    The documentation reason is perhaps the only reason our team is considering the “do it inside OWB” approach now…but of course we have all these exceptional developers who are used to writing views to do the “work” and not particularly used to writing OWB mappings using the “inside the tool” techniques and a whole codebase that uses the views approach, so it would be a bit radical if we were to change our strategy now…the old adage, if it ain’t broke, don’t fix it, springs to mind.

    Cheers
    Jeff

  5. Peter Scott Says:

    Jeff raises a good question about whether to use a view or code in the ETL tool – we (one of the Rittman Mead team) should write on this; of course there is not going to be an “always true” answer.

    The data warehouse project I am working on at the moment gives me a choice of an analytic view to get previous states of an item or “writing” some complex map in OWB – the measures for this fact are based on the nature of the transition and not the state itself. In terms of simplicity (and probably performance) the view is highly attractive; but of course it obscures the transformation.

  6. Jeff Moss Says:

    Hi Peter

    Completely agree that there is no “always true” answer…I guess the article someone needs to write is one that debates the reasoning behind either choice…I’ve thought about doing that myself, but I haven’t got the time at the minute.

    Yes, that’s why we’re considering the “inside OWB” route going forwards…the documentation reasoning….but I’m not necessarily sure that the documentation is meaningful per se.

    Yes, I’d have a “pretty picture” but unless I train the end users, they wouldn’t be able to understand it any more than they can looking inside the view – for the more complex mappings – easy ones perhaps. The developers would be able to use it, but I can’t see anyone developing on a warehouse without being a proficient SQL developer…and, in my experience at least, they’d much rather write a view than paint pretty pictures.

    I know from experience on the current implementation I’m working on, that the code falls into 30% simple views and 70% complicated views buckets and consequently, for the majority of mappings, only the “exceptional developers” will be able to make use of that documentation – not that that’s a bad thing of course.

    The more complex it is, the more likely I’d want to write it in SQL, rather than build it inside OWB…adding that to the fact that it’s easier to develop using one approach rather than two, that then leads me to writing everything in views rather than OWB. Your mileage, as they say, may vary.

    If you then start talking automatically generating code, is it easier to generate the complex SQL for views and simple OMB mappings using OMBPlus, or use OMBPlus to generate all of it? I’d suggest the former since use of the less widely known OMBPlus, is minimal then.

    What about tuning? You build a complex mapping and deploy it…it doesn’t run right due to a bad plan, so you need to tune it…there are, of course, a number of ways to tackle that…but if your method of choice is adding hints, then that, from my understanding, gets a bit tricky if it’s all inside OWB…but very simple if it’s a view. Of course using hints is a debatable point in it’s own right, but that’s another debate altogether!

    What about debugging? Our debugging process with views is simply debugging a piece of SQL…with the OWB approach I guess that’s a bit more complicated?

    It’s a tricky one really and one that the architects team at my current client have debated numerous times over the past few years. We’re using the complex views approach currently but we’re open to moving if and when we see compelling reasons to do so.

    Cheers
    Jeff

  7. Petr Schmidt Says:

    Interesting discussion re OWB viability…we are in the process of replacing OWB with custom PL/SQL transform modules at one of our client’s DW implementations. Aside from experiencing similar constraints as Jeff’s team we are also finding that maintenance of complex OWB maps can be quite time-consuming (relative to hand-written SQL) especially if significant map restructuring is required. I should note that we have a very experienced Oracle development group and only small portion of the DW was developed using OWB. I can appreciate OWB attractiveness as a pretty complete ETL platform for installations looking to get an Oracle DW initiative started.

    Petr

  8. Stewart Bryson Says:

    @Peter
    be my guest :)

    @Jeff
    >…I can’t see anyone developing on a warehouse without being a proficient SQL developer…

    I wish that were the case… but not always. When an organization dives into creating their first data warehouse, the developers familiar with the source systems quite often get the call… and they may be OLTP-style developers without advanced analytic and aggregate SQL ability. Some of these developers may not even have experience writing complex joins, as it’s not done in the source systems they are used to. I find that OWB forces some of these developers to write complex set-based processes, where, left to their own devices, they may revert to cursor-based PL/SQL.

    > …hints… gets a bit tricky if it’s all inside OWB…
    Actually, using hints in OWB (if you are so inclined) is quite easy. Just Configure the mapping, and there you go.

    > What about debugging?
    OWB has a debugger that’s not too bad. And, in versions 10.2.0.4 and 11g, showing incoming and outgoing SQL from a particular operator is easy… just as easy as if the code were in a view. And you can always trace it as well.

  9. Jeff Moss Says:

    My experience is both extensive and limited, in different senses…I’ve been dealing with DW type systems since 1998, but only two major systems during that period and each has involved only “savvy” SQL developers…who ensure that any new arrival on the team is aware of the “set based” approaches, design patterns and other key features of the “way we do things” on the DW, before they are let loose!

    ;-)

    We tried, admittedly with v9 of OWB, to use the hints capabilities inside OWB…and it was a nightmare…which was another nail in the coffin of it’s usage at the time. I don’t know if that part of the product has “moved on” but if it hasn’t then I’m not sure I’d agree as to it’s ease of use…whatever the case, it simply can’t be as easy as adding them to the SQL in a view.

    As I said, we’ve reached a point now where we’re just upgrading to 10gR2 of OWB and therefore perhaps, being on a recent release, we’ll be able to reassess our position over the coming period. I’m looking forward to that.

    Cheers
    Jeff

  10. Dhaval Says:

    Hi Stevart,

    We also had same requirment to determine weekend between given 2 dates, we did it something like below.
    OWB expersion can be used to write CASE statement. It uses standard weekend ( sat and Sun)

    SELECT promo_id,
    promo_name,
    promo_subcategory,
    promo_begin_date,
    promo_end_date,
    promo_end_date-promo_begin_date
    +
    (CASE WHEN TO_NUMBER(TO_CHAR(promo_begin_date,’D’)) = 7 THEN 0 ELSE TO_NUMBER(TO_CHAR(promo_begin_date,’D’)) enD
    – CASE WHEN TO_NUMBER(TO_CHAR(promo_end_date,’D’)) = 7 THEN 0 ELSE TO_NUMBER(TO_CHAR(promo_end_date,’D’)) enD )/7 * 2
    + CASE WHEN TO_NUMBER(TO_CHAR(promo_begin_date,’D’)) = 7 THEN 2 ELSE 0 enD
    + CASE WHEN TO_NUMBER(TO_CHAR(promo_begin_date,’D’)) = 1 THEN 1 ELSE 0 enD
    + CASE WHEN TO_NUMBER(TO_CHAR(promo_end_date,’D’)) = 7 THEN -1 ELSE 0 enD ) diff_excluding_WEEKENDs
    FROM sh.promotions;

Website Design & Build: tymedia.co.uk