Some Thoughts on Dimensional Modeling

If you're wondering who the "Mead" in Rittman Mead Consulting is, now's probably a good time to introduce myself. I'm Jon Mead; I've worked with Mark for a number of years on projects around the UK, and together we started up Rittman Mead Consulting earlier this year to try and help Oracle customers around the UK and Europe get the most out of their investment in Oracle business intelligence & data warehousing technology. I've actually had my head down on a number of projects since we started and haven't had much of a chance to work on the blog, but I've got a bit of time free this evening so I thought I'd put my first posting together, on the always interesting subject of data warehouse data modelling.

I have recently been involved in a review of an organisation's Data Warehouse, and some of the time was spent reviewing the data model. During the review I ended up proposing that the Data Warehouse should have a dimensional model, and the question then arose as to why we use this modelling technique, what qualities it brings to the Data Warehouse, and so on. In my view there are a few different answers to this question, which I'll now attempt to summarise.

It makes the data model, and hence the data in the Warehouse, easy to understand. This has a two-fold benefit. First, business users are able to intuitively query, and hence analyse, data in the Warehouse. The objects in the Warehouse are defined in business terms: dimensions may be time, product or organisation, and measures may be revenue and cost. The model is designed to represent business entities clearly and simply, so users do not have to understand the intricacies of Third Normal Form to interrogate the Data Warehouse. Second, reporting tools are easy to integrate with the Warehouse, as the underlying schema is effectively an industry standard: the dimensional model provides a common structure which a wide range of tools can utilise.
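
As a concrete sketch of what this looks like, a simple star schema for sales might be as follows; the table and column names here are purely illustrative, not taken from any particular project:

    -- Dimension tables carry the descriptive business attributes.
    CREATE TABLE dim_time (
      time_key      NUMBER PRIMARY KEY,  -- surrogate key
      calendar_date DATE,
      month_name    VARCHAR2(20),
      year_number   NUMBER
    );

    CREATE TABLE dim_product (
      product_key   NUMBER PRIMARY KEY,  -- surrogate key
      product_name  VARCHAR2(100),
      category_name VARCHAR2(50)
    );

    -- The fact table holds the numeric measures, plus a foreign key
    -- to each dimension that describes the business event.
    CREATE TABLE fact_sales (
      time_key    NUMBER REFERENCES dim_time,
      product_key NUMBER REFERENCES dim_product,
      revenue     NUMBER(12,2),
      cost        NUMBER(12,2)
    );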

In the same way that reporting tools can predict the layout of the schema and hence easily integrate with the Data Warehouse, the database can also make use of the expected design. The Oracle database has added many features over the years that enable Data Warehouses to be loaded more quickly, or to answer queries faster, for example Materialized Views, Partitioning, Query Rewrite and Star Transformations. The latter is a join access path between a fact table and a number of dimension tables: each dimension table is joined to the fact table via its numeric primary (and surrogate) key and the corresponding foreign key, and the dimension tables are not joined to each other. Bitmap indexes are created on each of the foreign keys in the fact table, and the query optimiser recognises this structure and creates an efficient execution plan for the query.
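
To sketch how this might look in practice, assuming the illustrative fact_sales schema above:

    -- Bitmap indexes on each fact table foreign key, as the star
    -- transformation expects.
    CREATE BITMAP INDEX fact_sales_time_bix    ON fact_sales (time_key);
    CREATE BITMAP INDEX fact_sales_product_bix ON fact_sales (product_key);

    -- Enable star transformations (can also be set at instance level).
    ALTER SESSION SET star_transformation_enabled = TRUE;

    -- A typical star query: constrain on dimension attributes, then
    -- group and aggregate the measures. The optimiser can rewrite this
    -- to probe the bitmap indexes rather than joining row by row.
    SELECT t.month_name, p.category_name, SUM(f.revenue)
    FROM   fact_sales f, dim_time t, dim_product p
    WHERE  f.time_key    = t.time_key
    AND    f.product_key = p.product_key
    AND    t.year_number = 2007
    GROUP  BY t.month_name, p.category_name;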

The Warehouse will be able to adapt to new requirements. Part of the rationale for building a Data Warehouse is to move away from the paradigm of the development team creating a table or view to answer each individual business query. The Warehouse should provide a platform that allows analysts freeform access to data, both in terms of definition and responsiveness/performance. When the business does provide new requirements, or another phase of the Warehouse is developed, the dimensional model provides an easy mechanism to add new facts, new dimension attributes, or even completely new dimensions.
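
For example, again using the hypothetical sales schema above, adding a new measure or a new dimension attribute is a simple, additive change that leaves existing queries untouched:

    -- Add a new measure to the fact table.
    ALTER TABLE fact_sales ADD (tax_paid NUMBER(12,2));

    -- Add a new descriptive attribute to a dimension.
    ALTER TABLE dim_product ADD (brand_name VARCHAR2(50));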

In terms of an approach to dimensional modelling, the following steps can provide a basis:

  1. Identify a business process or event to model. For example, the business event could be a delivery performed by an organisation.
  2. Declare the grain of this event or process, and hence of the fact table. In the above example this could be all deliveries to a location on a day, all deliveries to a location for a week, each individual delivery to a location, or even each product delivered to a location for each delivery. Generally, within the constraints of space and performance, the lowest grain should be selected. This gives more flexibility in the future, as data can always be aggregated to find a higher-level answer; however, it cannot easily be allocated back down to give an accurate lower-level answer.
  3. Select the dimensions that constrain or measure the business event, and hence are associated with the fact table. In the above example time, location and product would all be candidates; however, you should also consider more abstract ideas, for example the marketing campaign that led to the sale.
  4. Identify what to measure, for example the revenue gained for each delivery, the cost of the delivery, the tax payable on the delivery, and the number of deliveries made. A sketch of the resulting schema follows this list.
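
Putting the four steps together for the delivery example, a sketch of the resulting star schema (hypothetical names again, reusing the dim_time and dim_product tables from earlier) could be:

    -- Dimensions selected in step 3; these follow the same pattern
    -- as dim_time and dim_product above.
    CREATE TABLE dim_location (
      location_key  NUMBER PRIMARY KEY,
      location_name VARCHAR2(100),
      region        VARCHAR2(50)
    );

    CREATE TABLE dim_campaign (
      campaign_key  NUMBER PRIMARY KEY,
      campaign_name VARCHAR2(100)
    );

    -- Grain declared in step 2: one row per product delivered, per
    -- individual delivery, to a location, on a day (the lowest grain).
    CREATE TABLE fact_delivery (
      time_key       NUMBER REFERENCES dim_time,
      location_key   NUMBER REFERENCES dim_location,
      product_key    NUMBER REFERENCES dim_product,
      campaign_key   NUMBER REFERENCES dim_campaign,
      delivery_id    NUMBER,         -- identifies the individual delivery
      revenue        NUMBER(12,2),   -- measures from step 4
      delivery_cost  NUMBER(12,2),
      tax_payable    NUMBER(12,2),
      delivery_count NUMBER          -- additive count of deliveries
    );
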
One question that comes up a lot in dimensional modelling is whether to use star or snowflake schemas. Snowflake schemas are a more normalised approach to dimensional modelling, splitting out higher hierarchical levels into separate tables. A snowflake schema will reduce the amount of data stored by taking out repeating attributes, or attributes that relate only to a subset of your dimension members. The problem with this is that fact tables will typically account for 80% to 90% of the storage in your Warehouse, so the saving may only equate to a very small percentage of the overall size. Coupled with this, the database now has to do more joins to answer queries, so performance will be affected. The data model will also become more complex, which may preclude non-specialist users from performing effective analysis.
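
To make the trade-off concrete (hypothetical names once more): in a star schema the product dimension is a single denormalised table, while in a snowflake the category level is normalised out into its own table:

    -- Star: one denormalised dimension; the category attributes are
    -- repeated on every product row.
    CREATE TABLE dim_product_star (
      product_key      NUMBER PRIMARY KEY,
      product_name     VARCHAR2(100),
      category_name    VARCHAR2(50),
      category_manager VARCHAR2(100)
    );

    -- Snowflake: the category level gets its own table, saving the
    -- repeated attributes but costing an extra join on every query.
    CREATE TABLE dim_category (
      category_key     NUMBER PRIMARY KEY,
      category_name    VARCHAR2(50),
      category_manager VARCHAR2(100)
    );

    CREATE TABLE dim_product_snowflake (
      product_key  NUMBER PRIMARY KEY,
      product_name VARCHAR2(100),
      category_key NUMBER REFERENCES dim_category
    );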

One danger of modelling every business process/event separately is that you could end up with many disparate, unrelated data marts. Each data mart will function effectively and answer specific business questions, but the value of the Data Warehouse to the organisation will be restricted, as questions cannot be answered across different functions of the business; for example, revenue and cost may not be able to be compared. The answer is to use conformed dimensions, which allow users to answer questions across the entire enterprise, not just within silos of data.

Conformed dimensions are dimensions that are common to more than one data mart, or business process, across the organisation. Common examples are time, organisation structure and product. Conformed dimensions should be modelled at the lowest level of granularity used; this way they can service any fact table that needs to use them, either at their natural granularity or at a higher level. They provide a consistent view of the business, including attributes and keys, to all the data marts in the organisation. A conformed dimension can be implemented either as a single physical table or as a replicated table used by each data mart.
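
As a sketch of what this buys you, assuming the hypothetical fact_sales and fact_delivery tables from earlier both share the conformed dim_time dimension, revenue and delivery cost can now be compared by aggregating each fact table to the common grain and joining on the conformed attribute:

    -- "Drill across": aggregate each fact table separately at the
    -- conformed grain (month), then join the two result sets on the
    -- conformed attribute so the figures line up.
    SELECT s.month_name, s.total_revenue, d.total_delivery_cost
    FROM  (SELECT t.month_name, SUM(f.revenue) AS total_revenue
           FROM   fact_sales f, dim_time t
           WHERE  f.time_key = t.time_key
           GROUP  BY t.month_name) s,
          (SELECT t.month_name, SUM(f.delivery_cost) AS total_delivery_cost
           FROM   fact_delivery f, dim_time t
           WHERE  f.time_key = t.time_key
           GROUP  BY t.month_name) d
    WHERE  s.month_name = d.month_name;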

Dimensional modelling is a broad and complex subject, and this posting does not attempt to address all aspects of it, just some core principles: why it is used, and what benefit it can provide to an organisation. Future postings will discuss capturing history using slowly changing dimensions (SCDs), junk and degenerate dimensions, how to model the customer dimension, and the use of surrogate keys.

Finally, a word from the founder of dimensional modelling, Ralph Kimball. Below is a list of mistakes commonly associated with dimensional modelling (the full article can be found here); the question is, how many of them do you recognise?

Mistake 12: Place text attributes in a fact table if you mean to use them as the basis of constraining and grouping.

Mistake 11: Limit the use of verbose descriptive attributes in dimensions to save space.

Mistake 10: Split hierarchies and hierarchy levels into multiple dimensions.

Mistake 9: Delay dealing with a slowly changing dimension (SCD).

Mistake 8: Use smart keys to join a dimension table to a fact table.

Mistake 7: Add dimensions to a fact table before declaring its grain.

Mistake 6: Declare that a dimensional model is "based on a specific report."

Mistake 5: Mix facts of differing grain in the same fact table.

Mistake 4: Leave lowest-level atomic data in E/R format.

Mistake 3: Eschew aggregate fact tables and shrunken dimension tables when faced with query performance concerns.

Mistake 2: Fail to conform facts across separate fact tables.

AND THE BIGGEST MISTAKE ...

Mistake 1: Fail to conform dimensions across separate fact tables.