Partitioning Fact Tables, Part 1

August 13th, 2010

I’m dogmatic about certain aspects of data warehousing. For instance, fact tables should be range partitioned by DATE. I tell my clients all the time: you will have a very difficult time persuading me otherwise. But they always try: they argue about all the attributes that are more pervasive than DATE: customer classes, transaction types, etc., etc. But I’m just not buying it. We are building data warehouses, and the third rail of the Soul of the Data Warehouse is how it handles time.
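To make the precept concrete, here is a minimal sketch of a fact table range partitioned by a DATE column (the table and column names are illustrative, not from any particular client system):

```sql
-- Illustrative fact table, range partitioned on a DATE key
CREATE TABLE sales_fact (
  date_key     DATE         NOT NULL,
  customer_key NUMBER       NOT NULL,
  product_key  NUMBER       NOT NULL,
  sales_amt    NUMBER(12,2)
)
PARTITION BY RANGE (date_key) (
  PARTITION p2010q1 VALUES LESS THAN (DATE '2010-04-01'),
  PARTITION p2010q2 VALUES LESS THAN (DATE '2010-07-01'),
  PARTITION p2010q3 VALUES LESS THAN (DATE '2010-10-01')
);
```

Because the partitioning key is a real DATE, partition pruning works directly against date predicates with no conversion functions in the way.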

If you agree with me about this precept (and I really think you should), this is still not the end of the story. We must charge ahead into the lion’s den of a debate that has been raging in the Oracle data warehousing world for years: do we make the surrogate key of our date dimension a NUMBER, or do we make it a DATE? It’s funny… I remember this being the first question I ever posed to Mark years and years ago, and he did a blog entry that evolved out of our email communication. I don’t see the entry on the blog any more… it must have been lost in The Great Blog Disaster. Pity.

The choice between NUMBER and DATE bubbles up from the two streams at work in the Oracle Data Warehousing community: the data warehousing folks, and the Oracle folks. Ralph Kimball argues that the surrogate key of the date dimension should be numeric. In the Data Warehouse Lifecycle Toolkit book (or at least, in my edition of it), Kimball basically makes the argument that numbers require less space than dates. That one never did too much for me. However, in his Latest Thinking on Time Dimension Tables design tip, he makes a better argument: if our surrogate key is a DATE, then how do we handle “Not Applicable” type rows? This one has teeth, and I think that most designers who struggle with this decision point to this issue. If we use an actual DATE as our surrogate key, then what value can we use that actually means “no date at all”?

Oracle experts like Tom Kyte argue that “dates belong in DATES”. (If you look really hard at this post, you can see a younger and more naive version of myself weighing in on the debate… and also, apparently, not knowing how to gather histograms with DBMS_STATS. Oh well.) As Tom demonstrates on that post, the optimizer just plain works better when dates are stored in DATE datatypes.

I’ve typically been on Kyte’s side in this debate, both from a performance and a maintenance perspective. I’ve parted ways with Kimball on this point and urged my clients to build date dimensions with DATE surrogate keys, calling the column something like DATE_KEY. For the ‘NA’ types of dimension records, I use a wacky DATE value for DATE_KEY, such as '12/31/9999' or '01/01/0001'. Think of this as the equivalent of -1 if the surrogate key were actually numeric. Being a surrogate key, it really doesn’t matter what value it contains: we just need to know the column name so we can construct the correct JOIN syntax. Then, I’ll build another DATE column in the table called SQL_DATE, and this is the one that I expose to the reporting layer. Since SQL_DATE does not have to serve as the primary key, it’s fine for it to be NULL if desired.
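A sketch of what this design looks like in practice (column list trimmed for brevity; the names beyond DATE_KEY and SQL_DATE are illustrative):

```sql
-- Date dimension keyed on a real DATE, with a separate SQL_DATE column
-- that is the one exposed to the reporting layer
CREATE TABLE date_dim (
  date_key   DATE NOT NULL,   -- surrogate key; sentinel value for 'NA' rows
  sql_date   DATE,            -- NULL for 'NA' rows; safe to expose to reports
  day_name   VARCHAR2(9),
  month_name VARCHAR2(9),
  CONSTRAINT date_dim_pk PRIMARY KEY (date_key)
);

-- The 'NA' row: a sentinel key, the moral equivalent of -1 in a numeric scheme
INSERT INTO date_dim (date_key, sql_date, day_name, month_name)
VALUES (DATE '9999-12-31', NULL, 'NA', 'NA');
```

The fact table joins on DATE_KEY as usual, so the sentinel behaves like any other surrogate key value, while report users only ever see SQL_DATE.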

In subsequent posts, I’ll examine new partitioning features in 11g, including interval partitioning (which Pete Scott recently blogged about) and reference partitioning, and whether these enhancements provide more options to this historically binary choice.

Comments

  1. Bence Says:

    Stewart, I do remember this topic as one of your favourites, as we discussed exactly this when we first met :-)

    I understand Tom’s cost-related concerns about using numbers to store dates and their effect on query plans, but
    could you shed some light on the maintenance benefits of using real DATE primary keys?

    Bence

  2. David Aldridge Says:

    I’ve come across the “Date is Not Applicable” situation before, as well as receiving data from source systems in the (numeric) form YYYYDDD in which the DDD value was “000” (i.e. “January 0th”) to indicate “this is an annual number, not a daily one”.

    If this represents a fundamental difference in the granularity of the fact, which it often does, then there is a strong argument for moving such facts out of the day-granularity table into their own fact table; modern BI systems should have no problem generating and processing the resulting queries. In some senses it is the correct approach.

  3. Scott Powell Says:

    I couldn’t agree more. Kimball argues using numeric surrogate keys for two reasons:

    1) The meaning of the “business key” can change over time. Not applicable to dates: 1/1/2001 always means EXACTLY the same thing.

    2) Size – numbers being smaller than text strings and/or date fields. But in using dates to partition – if the date is a surrogate key, generally you end up having to add the actual date field anyways so that you can “intelligently” partition groups of dates likely to be queried together.

    Having said that, I still do struggle with “not known” dates and “not required” dates. I end up picking dummy date values such as 12/31/9999 also… but I’ve always wished for a cleaner answer. Even so, the cost of handling that situation is a LOT less than the cost of using a numeric surrogate.

    p.s. I’ve also seen where the surrogate key for a date is numeric, but is not a sequence – the actual date is embedded, i.e. number = 20010101 for Jan 1st 2001. I don’t really like that solution – for example, if I want to format the output to show “January 1st 2001” I can’t use simple date transformations.
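    The formatting pain Scott describes is easy to demonstrate. A short sketch (DATE_DIM / DATE_DIM_NUMERIC and their columns are hypothetical names for the two designs):

    ```sql
    -- With a real DATE key, formatting is direct:
    SELECT TO_CHAR(date_key, 'fmMonth DD, YYYY')
    FROM   date_dim;

    -- With a "smart" numeric key such as 20010101, every format operation
    -- first needs a conversion back to a DATE:
    SELECT TO_CHAR(TO_DATE(TO_CHAR(date_key), 'YYYYMMDD'), 'fmMonth DD, YYYY')
    FROM   date_dim_numeric;
    ```

    Beyond the extra typing, the nested conversions wrap the key in functions, which is exactly the kind of thing that defeats partition pruning and index use.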

    Scott

  4. Greg Rahn Says:

    I’ve seen the N/A value take the form of a very high or very low value in the set. As a result, this can cause issues with the cardinality estimates for that dimension. The reason for this is the low value, high value and number of distinct values are the input for that calculation. If you “stretch” the range by adding a very high or low value, it can drastically change the cardinality estimate. To work around this I recommend:

    CTAS the low & high values from the dimension
    DELETE the low & high values from the dimension
    Gather stats on the dimension
    Lock the stats on the dimension
    INSERT the two rows back into the dimension

    These steps have produced accurate cardinality estimates even with low/high N/A values.
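    One possible rendering of the steps above in SQL (the table name and sentinel values are illustrative; the DBMS_STATS calls are the standard gather/lock procedures):

    ```sql
    -- 1. CTAS the low & high (sentinel) rows out of the dimension
    CREATE TABLE date_dim_na AS
      SELECT * FROM date_dim
      WHERE  date_key IN (DATE '0001-01-01', DATE '9999-12-31');

    -- 2. Remove them so they cannot stretch the low/high value range
    DELETE FROM date_dim
    WHERE  date_key IN (DATE '0001-01-01', DATE '9999-12-31');

    -- 3. Gather stats on the "clean" range, then 4. lock them in place
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(USER, 'DATE_DIM');
      DBMS_STATS.LOCK_TABLE_STATS(USER, 'DATE_DIM');
    END;
    /

    -- 5. Put the sentinel rows back; the locked stats are unaffected
    INSERT INTO date_dim SELECT * FROM date_dim_na;
    COMMIT;
    ```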

  5. Stewart Bryson Says:

    @Bence: I’m thinking mostly about partition maintenance operations (PMOPs)… specifically adding new partitions, etc. I’ll be drilling into this side of things in future postings, so hopefully I’ll answer your question then.

    @David: Are the annual numbers just subtotals… or are they new or different bits of information?

    Depending on your answer, there are a few additional ways this can be solved, but ultimately I think you made the right choice. One is to split the dimension into two dimensions, and have a ‘NA’ row in the lower-level dimension, which basically means “does not apply at this level.” This doesn’t seem so radical when done with a dimension other than the date dimension; splitting the date dimension itself into a Date Dimension and a Year Dimension just doesn’t seem right.

    You can also use “solved” dimension rows… though I am not a big fan of this approach either. These are basically rows that exist for the different hierarchies in the date dimension. For instance… you would have day hierarchy rows, month hierarchy rows, year hierarchy rows, etc. I tend to avoid this… but it is an option.

    @Scott: regarding using the “smart” number key… as I said in my posting on AskTom all those years ago… this is just logical promiscuity. I see this proposed all the time as well, and I think Tom handles the arguments against it better than I do… as he got hit with the “smart” key response over and over again in that post.

    @Greg: I have not seen the optimizer behave in this way myself… but thanks for the tip. With automatic histogram collection on highly selected columns beginning in 10g, this doesn’t seem like it would be a big issue, as histograms would paint the true picture of what’s going on to the optimizer.
