Keeping Warehouse Dimensions To A Minimum

September 25th, 2003 by Mark Rittman

Data warehouse fact tables contain one or more measures (such as ‘volume sold’, ‘cost price’ and ‘retail price’) and one or more dimensions. When the data model is first put together, the number of dimensions is usually quite small, and centre around common business areas like organization, time and geography. As the data model is further refined, and the requirements for reporting are incorporated into the fact table design, a number of additional items often get added to meet someone’s specific requirements for a report. Examples I’ve come across include ‘Invoice Status’, ‘Payment Type’, ‘Matched/Not Matched’, and so on, and the question then becomes - do we treat these items as proper dimensions in the data model, or do we just leave them in as a special type of measure in the fact table?

Ralph Kimball came up with the concept of ‘Degenerate Dimensions‘, a situation where a value is placed in the fact table to aid reporting and selection, but where it isn’t truly a measure. An example of this is ‘Invoice number’; it may well be worth putting this in each row of the fact table, to help tie the row to a particular invoice or sale, but it isn’t really a dimension as you wouldn’t use it to roll up values in the fact table. It isn’t really a measure either, as you certainly wouldn’t want to sum up or average the values held in the invoice number field, and so we class these as degenerate dimensions, displaying their values in reports and in some circumstances letting the user limit down the query by using their values.

I often think the difference between a dimension and a degenerate dimension is whether or not you’d want to see the range of values available as a ‘drop-down list’ when analyzing your data. You probably wouldn’t want all of the company’s historic invoice numbers available on a drop-down list (there’d be too many for a start) but, taking our example, you’d probably want to have ‘Invoiced/Not Invoiced’, ‘Matched/Not Matched’ and so on available as page items when working with a tool like Oracle Discoverer.

Multidimensional databases, such as Oracle 9i OLAP and Oracle Express Server, experience rapid growth in their database sizes if a variable (the 9i OLAP equivalent of a measure in a fact table) is dimensioned by too many dimensions, as it has to precompute the aggregate of each variable at each intersection of the different dimensions, along each hierarchy. Along with issue of database explosion, having too many dimensions associated with a variable also makes analysis difficult as each dimension has to be referred to when retrieving the value of a variable.

Relational OLAP, as provided by Oracle 9i through the use of fact tables, dimensions and hierarchies doesn’t suffer from these problem for a number of reasons;

  • It doesn’t insist that you calculate aggregates for every combination of dimension and hierarchy in advance, therefore negating the need to store lots of summarized data
  • Dimensions can be left out of queries against the fact table if they are not needed for analysis
  • Materialized views can be created that address a small cross-section of possible aggregates, with query rewrite using these ’seeded’ summaries to satisfy the majority of queries with a limited amount of further aggregation

To be fair, Express Server and 9i OLAP don’t absolutely insist that all aggregations are calculated in advance, and it can also selectively aggregate values up a selection of dimensions and hierarchies; however it’s true to say that Express is designed to work when all cubes are ‘fully-solved’ and performance can degrade considerably if it has to carry out aggregations on the fly. What you can’t get away from with Express however is the need to specify a value (or range of values) for each dimension when working with data in variables.

It doesn’t seem then that there would be any problems with having our main fact table in the warehouse dimensioned by twenty different business areas, as it won’t neccessarily lead to an explosion in database size, and we can leave out joins to dimensions that we don’t require as part of a query.

The temptation therefore, is to create individual dimensions for each of these items rather than just leaving them as degenerate dimensions in the fact table. These dimensions can then be populated with synthetic keys and attributes, and load routines within Oracle Warehouse Builder can then be created to build and populate the tables and metadata. However, having worked on a few projects where the number of dimensions has risen has risen to over twenty, there’s a few issues that can crop up when working with this many dimensions that may make it worth looking at the data model again.

Data Warehouse Build Time Increases Significantly

Compared to the time needed to add an extra measure in to the fact table for our degenerate dimension, the time taken to create a dimension, define a hierarchy and create a synthetic key is more or less comparable. It takes less than a minute to add a new measure, but only around five minutes more to put together a proper dimension.

The factor that is often overlooked in the rush to build a dimension for every descriptive item is the amount of time it subsequently takes to create a dimension source file, build, test and debug the load routine, deal with missing values, create new dimension values when the business adds a new type of invoice status, and track changes to dimension attributes over time. The routine that loads the fact table, instead of carrying out a key lookup on just a few dimension tables, now needs to carry out many more key lookups, adding to the time it takes to build the OWB mapping and adding to the overall complexity of the load process.

Data Warehouse Loads Take Longer

A mapping that populates the fact table that previously only joined (through key lookups) to a small number of dimensions, now potentially has to join to twenty tables to retrieve synthetic keys from each of the dimension tables. If (as recommended) you use the key lookup function within Warehouse Builder to retrieve synthetic keys from dimensions, each join becomes an outer join, as the mapping has to be able to deal with values not being returned by the key lookup - if the mapping generated normal equi-joins, one lookup failing could result in a row being left out of the fact table insertion.

Apart from the outer joins being generated, the actual number of tables that have to be joined to populate the fact table can cause issues. Joins that involve more than ten tables confuse the cost-based optimizer, potentially leading to a situation where the wrong driving table is chosen or indexes are not used when present, and as you shouldn’t really alter the code OWB generates, you cannot guarantee that the best access plan is chosen.

Documentation And Testing Takes Longer

Each dimension that you put in place needs be documented, agreed with the business, tested and maintained. It’s usually the case that a dimension with just two possible values takes just as long to document and get signed off as one that contains thousands of values, and the more of them you create the longer it’s going to take to get documented. And none of us likes doing documentation.

Synthetic Keys Can Actually Make Querying Harder Work

Replacing values in the fact table such as ‘Invoiced’/'Not Invoiced’ with integers such as ‘1′ and ‘2′ appears to make sense as it takes up less space in the table and any indexes you create. However, if you’re running a quick ad-hoc query against the fact table it’s not often easy to remember what ‘1′ and ‘2′ stand for, requiring a join to the dimension that otherwise wouldn’t be neccessary. Although it’s true to say that with Oracle features such as star transformations the cost of these joins isn’t as much as it could be, you’re still making more work for users when it’s not really needed.

The Conclusion: Even When Working With ROLAP, Keep The Dimension Count Small

It might not be the case for every database platform and every ETL tool, but the conclusion i’ve come to when working with Oracle and OWB is keep the dimension count small. Whilst Oracle doesn’t suffer from the performance problems experienced by multidimensional databases such as Express and 9i OLAP when dimension count increases, the amount of time taken to build, test, populate and document the data warehouse increases exponentially when dimensions increase beyond half a dozen per fact table. Just because it can be done, doesn’t mean it’s required, and your project will be much simpler and will take much less time if the number of dimensions is kept to the minimum.

Comments

  1. Vinayak Pai Says:

    The article is really useful in listing down the problem faced with the increasing number of dimensions. But I would be interested to know what is the best approach when we need to add these additional ‘measures’or ‘dimensions’ due to business requirements?
    Some thoughts?

  2. Mark Rittman Says:

    Well, if you’ve got to add them, then you’ve got to add them. However, it’s worth checking the following;
    1. Some of the additional ‘measures’ will infact be descriptive fields (i.e. degenerate dimensions, such as invoice number). Make sure these additional measures aren’t infact levels from existing dimensions (for example, having customer_group in the fact table, as well as customer_id) - this is an easy mistake to make, and although this might seem to make summing up by customer_group easier, it stops you from cleanly aggregating up by customer - group - whatever is the next level within one query.
    2. Also, as the article says, if you can avoid making a full OWB dimension out of any extra candidate dimensions, try and do so. The overhead of building and maintaining a dimension is quite high, so if you can avoid this, and just have the dimension business ID in the fact table column (instead of a surrogate key link to the dimension table) it’ll be less work for you.
    Hope this helps
    Mark

  3. Lisa Says:

    We are warehousing information from a contact centre in a health insurance company. Some of our core dimensions are providers, brokers, companies, members, and beneficiaries. Any one of these entities can place a call to the contact centre (optionally making a reference to any other entity).

    A member must belong to a company (though can change companies obviously), and a beneficiary must belong to a member. A company may have one or more brokers (brokers and companies may chop and change obviously).
    They all seem like dimensions in their own right. However, they are not independent. Should they all be separate dimensions? Should beneficiaries and members be rolled into one dimension? I assume members and companies are separate dimensions, however company attributes are of interest at the member level, e.g. how many calls did we get from members belonging to companies of more than 50 members? Should the relevant company information be duplicated on the member dimension? Or should the company and member dimensions be modelled with a parent-child relationship?

  4. Mark Says:

    Lisa,
    Good question. I’m going to get a few opinions, and get back to you in a day or so.
    Mark

  5. Mark Says:

    Lisa,
    I’ve had a think about this, and here’s my first take on it.
    Firstly, if beneficiaries are a superset of members (i.e. beneficiaries are members, plus their dependents) it makes sense to have a single ‘persons’ dimension, and use this twice in your fact table, once for a ‘member’ key and once for a ‘beneficiary’ key.
    In those cases where the beneficiary is a member, they’ll appear twice in the fact table row. Otherwise, you’ll have two different IDs from the dimension table. You could also have some additional attributes in the ‘person’ dimension, to hold details of their membership, if they’re a member.
    The fact that a beneficiary must either be a member, or belong to a member, is a business rule that should be enforced by your ETL tool. If a case comes along that breaks these rules (i.e. the beneficiary’s member is not *their* member) you can either reject the record during the ETL process, or allow it, and run subsequent reports against the fact table to list out ‘illegal’ cases like this.
    Secondly, you ask whether attributes from other dimensions (company, for example) should be included in the member dimension, when you’d want to retrieve this information when querying members. It’s my opinion that you should not do this - you should leave all company attributes in the company table, and do the same for other dimensions. If you want to put together a complex query, such as the ‘how many calls did we get from members belonging to companies of more than 50 members?’ example, any good OLAP tool (such as Oracle Express, or the Oracle OLAP Option + BI Beans) should allow you to answer this question - queries like this (complex inter-row queries where the dimension members retrieved are dependent on other calculations in the query) are common practice in the OLAP world, and you shouldn’t need to compromise the integrity of your dimensional model to answer questions like this.
    Hope this helps.