Keeping Warehouse Dimensions To A Minimum
Data warehouse fact tables contain one or more measures (such as 'volume sold', 'cost price' and 'retail price') and one or more dimensions. When the data model is first put together, the number of dimensions is usually quite small, and they centre around common business areas like organization, time and geography. As the data model is refined and the reporting requirements are incorporated into the fact table design, additional items often get added to meet someone's specific requirements for a report. Examples I've come across include 'Invoice Status', 'Payment Type', 'Matched/Not Matched', and so on, and the question then becomes: do we treat these items as proper dimensions in the data model, or do we just leave them in the fact table as a special type of measure?
Ralph Kimball came up with the concept of 'Degenerate Dimensions': a situation where a value is placed in the fact table to aid reporting and selection, but which isn't truly a measure and has no dimension table of its own. An example of this is 'Invoice Number'; it may well be worth putting this in each row of the fact table, to help tie the row back to a particular invoice or sale, but it isn't really a dimension, as you wouldn't use it to roll up values in the fact table. It isn't really a measure either, as you certainly wouldn't want to sum or average the values held in the invoice number column, and so we class these items as degenerate dimensions, displaying their values in reports and, in some circumstances, letting the user restrict the query using their values.
I often think the difference between a dimension and a degenerate dimension comes down to whether you'd want to see the range of values available as a drop-down list when analyzing your data. You probably wouldn't want every historic invoice number the company has ever issued in a drop-down list (there'd be too many, for a start), but taking our earlier examples, you'd probably want 'Invoiced/Not Invoiced', 'Matched/Not Matched' and so on available as page items when working with a tool like Oracle Discoverer.
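As a minimal sketch of what this looks like in practice (the table and column names here are hypothetical, not taken from any particular project), a sales fact table might carry its real dimensions as foreign keys, its measures as numeric columns, and 'Invoice Number' plus a couple of low-cardinality flags as degenerate dimensions held directly in the row:

    -- Hypothetical sales fact table: the first three columns are keys to real
    -- dimension tables, the middle three are measures, and the last three are
    -- degenerate dimensions with no dimension table of their own.
    CREATE TABLE sales_fact (
      time_key        NUMBER        NOT NULL,  -- joins to time_dim
      product_key     NUMBER        NOT NULL,  -- joins to product_dim
      org_key         NUMBER        NOT NULL,  -- joins to organization_dim
      volume_sold     NUMBER,                  -- measure
      cost_price      NUMBER(10,2),            -- measure
      retail_price    NUMBER(10,2),            -- measure
      invoice_number  VARCHAR2(20),            -- degenerate dimension
      invoice_status  VARCHAR2(12),            -- e.g. 'INVOICED' / 'NOT INVOICED'
      matched_flag    VARCHAR2(11)             -- e.g. 'MATCHED' / 'NOT MATCHED'
    );

Held like this, the flags stay readable in ad-hoc queries and can still be offered as page items in Discoverer, without any extra joins.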
Multidimensional databases, such as Oracle 9i OLAP and Oracle Express Server, experience rapid growth in database size if a variable (the 9i OLAP equivalent of a measure in a fact table) is dimensioned by too many dimensions, as the database has to precompute the aggregate of each variable at every intersection of the different dimensions, along each hierarchy; the number of potential cells grows multiplicatively with every dimension you add. Along with the issue of database explosion, having too many dimensions associated with a variable also makes analysis harder, as every dimension has to be referenced when retrieving the value of a variable.
Relational OLAP, as provided by Oracle 9i through fact tables, dimensions and hierarchies, doesn't suffer from these problems, for a number of reasons:
- It doesn't insist that you calculate aggregates for every combination of dimension and hierarchy in advance, thereby removing the need to store large amounts of summarized data
- Dimensions can be left out of queries against the fact table if they are not needed for analysis
- Materialized views can be created that address a small cross-section of possible aggregates, with query rewrite using these 'seeded' summaries to satisfy the majority of queries with a limited amount of further aggregation
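As an illustration of the third point, a single 'seeded' summary over the hypothetical sales_fact table sketched earlier might look something like the following (time_dim, product_dim and their columns are also assumptions made for the example); with query rewrite enabled, the optimizer can transparently redirect suitable queries against the detail table to this smaller, pre-aggregated table:

    -- One materialized view covering a commonly-used cross-section of
    -- aggregates; ENABLE QUERY REWRITE lets the optimizer use it where it can.
    CREATE MATERIALIZED VIEW sales_by_month_and_category
      BUILD IMMEDIATE
      REFRESH FORCE ON DEMAND
      ENABLE QUERY REWRITE
    AS
    SELECT t.calendar_month,
           p.product_category,
           SUM(s.volume_sold)  AS total_volume,
           SUM(s.retail_price) AS total_retail
    FROM   sales_fact  s,
           time_dim    t,
           product_dim p
    WHERE  s.time_key    = t.time_key
    AND    s.product_key = p.product_key
    GROUP  BY t.calendar_month, p.product_category;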
To be fair, Express Server and 9i OLAP don't absolutely insist that all aggregations are calculated in advance, and they can selectively aggregate values along a subset of dimensions and hierarchies; however, Express is designed to work with cubes that are 'fully solved', and performance can degrade considerably if it has to carry out aggregations on the fly. What you can't get away from with Express, however, is the need to specify a value (or range of values) for every dimension when working with data in variables.
On the face of it, then, there shouldn't be any problem with having our main fact table in the warehouse dimensioned by twenty different business areas: it won't necessarily lead to an explosion in database size, and we can leave out joins to dimensions we don't require as part of a query.
The temptation, therefore, is to create an individual dimension for each of these items rather than just leaving them as degenerate dimensions in the fact table. These dimensions can then be populated with synthetic keys and attributes, and load routines can be created within Oracle Warehouse Builder to build and populate the tables and metadata. However, having worked on a few projects where the number of dimensions has risen to over twenty, there are a few issues that crop up when working with this many dimensions that may make it worth looking at the data model again.
Data Warehouse Build Time Increases Significantly
The time taken to create a dimension, define a hierarchy and create a synthetic key is more or less comparable to the time needed to add an extra column to the fact table for our degenerate dimension: it takes less than a minute to add the new column, and only around five minutes more to put together a proper dimension.
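To give a feel for what those extra five minutes produce, here's a sketch, with hypothetical names, of the objects behind a proper dimension for a simple two-value 'Invoice Status' flag: a sequence for the synthetic key, a dimension table, and the CREATE DIMENSION metadata that query rewrite can make use of (a flag like this barely justifies a hierarchy, so none is defined).

    -- Synthetic key generator for the new dimension
    CREATE SEQUENCE invoice_status_seq;

    -- Dimension table that will only ever hold a couple of rows
    CREATE TABLE invoice_status_dim (
      invoice_status_key   NUMBER        PRIMARY KEY,  -- synthetic key
      invoice_status_code  VARCHAR2(12)  NOT NULL,     -- 'INVOICED' / 'NOT INVOICED'
      invoice_status_desc  VARCHAR2(40)
    );

    -- Dimension metadata: a single level, with the code and description
    -- declared as attributes determined by the key
    CREATE DIMENSION invoice_status
      LEVEL status IS (invoice_status_dim.invoice_status_key)
      ATTRIBUTE status DETERMINES
        (invoice_status_dim.invoice_status_code, invoice_status_dim.invoice_status_desc);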
The factor that often gets overlooked in the rush to build a dimension for every descriptive item is the time it subsequently takes to create a dimension source file; build, test and debug the load routine; deal with missing values; create new dimension values when the business adds a new type of invoice status; and track changes to dimension attributes over time. The routine that loads the fact table, instead of carrying out key lookups against just a few dimension tables, now needs to carry out many more, adding to the time it takes to build the OWB mapping and to the overall complexity of the load process.
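Even for a trivial dimension like this, the load routine has to cope with codes it hasn't seen before and with descriptions that change. A minimal sketch of such a routine, assuming a hypothetical staging table and handling changes in a simple overwrite (type 1) fashion, might be a MERGE like the one below, and that's before testing, error handling and the OWB mapping around it are taken into account.

    -- Minimal 'type 1' dimension load: add rows for new status codes,
    -- overwrite the description if it has changed for an existing code.
    MERGE INTO invoice_status_dim d
    USING (SELECT DISTINCT invoice_status_code, invoice_status_desc
           FROM   staging_invoices) s
    ON    (d.invoice_status_code = s.invoice_status_code)
    WHEN MATCHED THEN
      UPDATE SET d.invoice_status_desc = s.invoice_status_desc
    WHEN NOT MATCHED THEN
      INSERT (invoice_status_key, invoice_status_code, invoice_status_desc)
      VALUES (invoice_status_seq.NEXTVAL, s.invoice_status_code, s.invoice_status_desc);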
Data Warehouse Loads Take Longer
A mapping that populates the fact table, which previously joined (through key lookups) to only a small number of dimensions, now potentially has to join to twenty tables to retrieve the synthetic key from each dimension. If (as recommended) you use the key lookup function within Warehouse Builder to retrieve synthetic keys from dimensions, each join becomes an outer join, as the mapping has to be able to deal with values not being returned by the key lookup; if the mapping generated normal equi-joins, a single failed lookup would result in that row being left out of the fact table insert.
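The SQL behind a mapping like that ends up looking something like the sketch below (hand-written here for illustration rather than actual OWB output, using the same hypothetical staging table as before); each lookup is an outer join, and NVL() substitutes a default 'unknown' key of -1 when no dimension row is found, on the assumption that each dimension carries a matching 'unknown' row. With twenty dimensions there are twenty of these joins.

    -- Fact load with outer-joined key lookups: the (+) marks each dimension
    -- as the optional side of the join, and NVL() maps a failed lookup to -1.
    INSERT INTO sales_fact
      (time_key, product_key, org_key, volume_sold, invoice_number)
    SELECT NVL(t.time_key, -1),
           NVL(p.product_key, -1),
           NVL(o.org_key, -1),
           src.volume_sold,
           src.invoice_number
    FROM   staging_invoices src,
           time_dim         t,
           product_dim      p,
           organization_dim o
    WHERE  src.sale_date    = t.calendar_date (+)
    AND    src.product_code = p.product_code  (+)
    AND    src.org_code     = o.org_code      (+);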
Apart from the outer joins being generated, the sheer number of tables that have to be joined to populate the fact table can cause issues. Joins involving more than ten or so tables can confuse the cost-based optimizer, potentially leading to the wrong driving table being chosen, or to indexes not being used when they are present; and as you shouldn't really alter the code OWB generates, you cannot guarantee that the best execution plan is chosen.
Documentation And Testing Take Longer
Each dimension that you put in place needs to be documented, agreed with the business, tested and maintained. A dimension with just two possible values usually takes just as long to document and get signed off as one that contains thousands of values, and the more of them you create, the longer the documentation takes. And none of us likes doing documentation.
Synthetic Keys Can Actually Make Querying Harder Work
Replacing values in the fact table such as 'Invoiced'/'Not Invoiced' with integers such as '1' and '2' appears to make sense, as it takes up less space in the table and in any indexes you create. However, if you're running a quick ad-hoc query against the fact table, it's not always easy to remember what '1' and '2' stand for, requiring a join to the dimension table that otherwise wouldn't be necessary. Although it's true that with Oracle features such as star transformations the cost of these joins isn't as high as it might otherwise be, you're still making more work for users when it's not really needed.
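To make that concrete, compare the two hypothetical queries below: in the first, the flag is held as a degenerate dimension on the fact table and the filter is self-explanatory; in the second, the fact table carries a synthetic invoice_status_key instead, so you either have to remember what the codes mean or add a join whose only purpose is to decode them.

    -- Flag held as a degenerate dimension: no join needed
    SELECT SUM(volume_sold)
    FROM   sales_fact
    WHERE  invoice_status = 'NOT INVOICED';

    -- Flag replaced by a synthetic key: an extra join just to decode it
    SELECT SUM(s.volume_sold)
    FROM   sales_fact         s,
           invoice_status_dim d
    WHERE  s.invoice_status_key  = d.invoice_status_key
    AND    d.invoice_status_code = 'NOT INVOICED';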
The Conclusion: Even When Working With ROLAP, Keep The Dimension Count Small
It might not be the case for every database platform and every ETL tool, but the conclusion I've come to when working with Oracle and OWB is to keep the dimension count small. Whilst Oracle doesn't suffer from the performance problems experienced by multidimensional databases such as Express and 9i OLAP when the dimension count increases, the time taken to build, test, populate and document the data warehouse rises sharply once you go beyond half a dozen dimensions per fact table. Just because it can be done doesn't mean it needs to be done, and your project will be much simpler, and take much less time, if the number of dimensions is kept to a minimum.