Some Thoughts on Dimensional Modeling

April 25th, 2007 by Jon Mead

If you’re wondering who the “Mead” in Rittman Mead Consulting is, now’s probably a good time to introduce myself. I’m Jon Mead, I’ve worked with Mark for a number of years on projects around the UK, and together we started up Rittman Mead Consulting earlier this year to try and help Oracle customers around the UK and Europe get the most out of their investment in Oracle business intelligence & data warehousing technology. I’ve actually had my head down on a number of project since we started and now had much of a chance to work on the blog, but I’ve got a bit of time free this evening and I thought I’d put my first posting together, on the always interesting subject of data warehouse data modeling.

I have recently been involved in a review of an organisation’s Data Warehouse, and some of the time was spent reviewing the data model. During the review I ended up proposing that the Data Warehouse should have a dimensional model and the question then arose as to why we use this modelling technique, what qualities does it bring to the Data Warehouse and so on. In my view there are a few different answers to this question, which I’ll now attempt to summarize.

It makes the data model and hence the data in the Warehouse easy to understand. This has a two-fold benefit. First, business users are able to intuitively query and hence analyse data in the Warehouse. The objects in the Warehouse are defined in business terms, for example dimensions may be time, product or organisation,
measures may be revenue and cost. The model is design to clearly and simply represent business entities, users do not have to understand the intricacies of Third Normal Form to interrogate the Data Warehouse. Second, reporting tools are easy to integrate with the Warehouse as the underlying schema is effectively an industry standard,
the dimensional model provides a common structure which a wide range of tools can utilise.

In the same way reporting tools can predict the layout of the schema and hence easily integrate with the Data Warehouse, the database can also make use of the expected design. The Oracle database has added many features over the years that enable Data Warehouses to either be loaded quicker, or to query data faster, for example Materialized Views, Partitioning, Query Rewrite and Star Transformations. The latter is a join access path between a fact table and a number of dimension tables, each dimension is joined to the fact table using its numeric primary (and surrogate) key to foreign key, the dimension tables are not joined to each other. Bitmap indexes are created on each of the foreign keys in the fact table. The query optimiser recognises this structure and creates efficient execution plans for the
query.

The Warehouse will be able to adapt to the addition of new requirements. Part of the rationale of building a Data Warehouse is to move away from the paradigm of the development team creating a table or view to answer individual business queries. The Warehouse should provide a platform to allow analysts freeform access to data, both in terms of definition and responsiveness/performance. When the business do provide new requirements, or another phase of the Warehouse is developed the dimensional model provides an easy mechanism to add new facts, add new dimension attributes, or even adding completely new dimensions.

In terms of an approach to dimension modelling, the following steps can provide a basis:

  1. Identify a business process or event to model. For example the business event could be a delivery performed by an organization.
  2. Declare the grain of this event or process and hence the fact table. In the above example this could be all deliveries to a location on a day, all deliveries to a location for a week, each individual delivery to a location, or even each product delivered to a location for each delivery. Generally, within the constraints or space and performance, the lowest grain should be selected. This will be more flexibility in the future as data can always be aggregated to find a higher level answer, however it cannot easily be allocated to find an accurate lower level answer.
  3. Select the dimensions that constrain or measure the business event, and hence are associated with the fact table. In the above example time, location, product would all be candidates, however you should also consider more abstract ideas, for example the marketing campaign that led to the sale.
  4. Identify what to measure, for example the revenue gained for each delivery, the cost of the delivery, tax payable on the delivery, the number of deliveries made.

One question that comes up a lot in dimensional modelling is whether to use star or snowflake schemas. Snowflake schemas are a more normalized approach to dimensional modelling, splitting out higher hierarchical levels into separate tables. Snowflake schema will reduce the amount of data stored by taking out repeating attributes, or
taking out attributes that are only related to a subset of your dimension members. The problem with this is that fact tables will typically account for 80% to 90% of the storage in your Warehouse, this means that your storage saving may only equate to a very small percentage of the overall size. Coupled with this is the fact that the database now has to do more joins to answer queries, so performance will be affected. The data model will also become more complex which may preclude non-specialist users from performing effective analysis.

One danger of modelling every business process/event is that you could end up with many disparate, unrelated data marts. Each data mart will function effectively and answer specific business questions, however the value to the organisation of the Data Warehouse will be restricted as questions won’t be able to be answered across different functions of
the business, for example revenue and cost may not be able to be compared. The answer is to use conformed dimensions, these allow users to answer questions across the entire enterprise, not just in silos of data.

Conformed dimensions are dimensions that are common to more that one data mart, or business process across the organisation. Common examples may be time, organisation structure and product. Conformed dimensions should be modelled at the lowest level of granularity used, this way they can service any fact table that needs to use them, either at their natural granularity, or at a higher level. They provide a consistent view of the business, including
attributes and keys to all the data marts in the organisation. Conformed dimensions can either be implemented as a single physical table, or maybe be a replicated table used by each data mart.

Dimensional modelling is a broad and complex subject and this posting does not look to address all aspects of it, just some core principles and to provide an answer to why it is used, what benefit it can provide for an organisation. Future postings will discuss capturing history using slow changing dimensions (SCD), junk and degenerate dimensions, how to model the customer dimension and the use of surrogate keys.

Finally a word from the founder of Dimensional Modelling, Ralph Kimball, below is a list of mistakes commonly associated with Dimensional Modelling, the full article can be found here, the question is how many of them do you recognise?

Mistake 12: Place text attributes in a fact table if you mean to use them as the basis of constraining and grouping.

Mistake 11: Limit the use of verbose descriptive attributes in dimensions to save space.

Mistake 10: Split hierarchies and hierarchy levels into multiple dimensions.

Mistake 9: Delay dealing with a slowly changing dimension (SCD).

Mistake 8: Use smart keys to join a dimension table to a fact table.

Mistake 7: Add dimensions to a fact table before declaring its grain.

Mistake 6: Declare that a dimensional model is “based on a specific report.”

Mistake 5: Mix facts of differing grain in the same fact table.

Mistake 4: Leave lowest-level atomic data in E/R format.

Mistake 3: Eschew aggregate fact tables and shrunken dimension tables when faced with query performance concerns

Mistake 2: Fail to conform facts across separate fact tables.

AND THE BIGGEST MISTAKE …

Mistake 1: Fail to conform dimensions across separate fact tables.

Comments

  1. Jeff Moss Says:

    Welcome Jon.

    Hmmm…the word that comes to my mind when I read your post is “sometimes”.

    Sometimes users will be able to understand a dimensional model better – sometimes they can’t understand any model and should only be given access to precanned tabular reporting tables – it depends on their requirements and abilities.

    Sometimes the primary key of the dimension is a numeric surrogate value – but it doesn’t have to be.

    Sometimes the optimizer will create efficient execution plans for a star schema and sometimes it won’t without ensuring that a number of prerequisites are in place – like accurate stats and appropriate indexes being available.

    Sometimes data fits into a dimensional model easily – the company that makes sales of products at different times in different locations for example to use the sterotypical example. Sometimes the business or it’s source system data and processes is much more complicated than that and one will find difficulty in applying a dimensional model – I’ve worked in both environments and each have their challenges.

    Sometimes a dimensional model – a la Raph Kimball – is the way to go – and sometimes people follow the Bill Inmon approach of a normalised design – sometimes a combination of both is more appropriate – and that seems to be the approach Oracle themselves advocate recently after focusing on the dimensional model in times gone by.

    A case of your mileage may vary really is what I’m saying I guess.

    Welcome aboard the good ship HMS Dee Dubya.

  2. admin Says:

    Thank Jeff, I agree, sometimes the world is a great place, other times not. However, I think the idea of design patterns/industry standard designs is to give people a starting point. Part of the aim of design is to simplify or to abstract, represent something in terms that people can understand – that’s why I like using this approach as a model.

    Design is also about trade-offs and decisions, for every simplification you make, you may be allowing more people to understand the data, but you may be hiding crucial data from key users and preventing them making critical business decisions, or you may be making creating huge volumes of data that are crippling the performance of the Warehouse.

    I guess the only thing you can be sure is the next Warehouse I look at will be way more complex than anything described above :)

  3. Diego Arenas Says:

    Hi Jon, thanks for your post, i hope you continue writing and i hope learn or remember this contents very interesting for me.

    Greetings from Chile,
    Diego Arenas C.

  4. Peter Scott Says:

    Jon

    Nice to see another DW author!

    Mistakes 10 & 5 are interesting ones – perhaps it is because I’m not a dogmatic Kimball man!
    Sometimes the choice between a hierarchy member for one dimension or part of a dimension of its own is really a business analysis one rather than for pure DW design; take as an example a product sold by a store. We can recognise attributes of a product (type of product, maker etc) that could be used as hierarchies of a PRODUCT dimension, but there would be a valid business case to express maker as a dimension of its own if we need to report on both maker and product type in the same query. After all we don’t want to mix levels in a fact table!

    Mixing levels in a “table” is probably true for conventional tables – but the use of OLAP structures (cubes or GROUP BY ROLLUP tables) sometimes makes this distinction cloudy – perhaps it better expressed as not exposing to the user multiple levels in a single query.

  5. Jon Mead Says:

    Peter,

    Thanks for your comments – I think modelling of the relationships that exist between entities (for want of a better word) related to your fact is pretty difficult – business requirement can mean that one way works for one requirement, and another way works for another. Out of the mistakes, 10 seems a more abstact, where 5 is maybe more clear cut for me.

    Jon

  6. Tim Berry Says:

    Nothing wrong with number 5 as long as the differing granularity is clearly understood. However being a data migration rather than data warehousing major I see highly normalised targets more often than dimensional ones. But for my perspective the complexity is in the ETL which always differs so the fact tables being 1 more stage prior to reporting or loading of cubes should be flexible to adapt to the ETL requirements not the dimensional model. That said the overall system needs taking into account and thus if the ETL is okay but the system performs poorly than the ETL changes. But again this is an example of compromises that put kimballs rules into grey areas.

    I would add that I have never worked on a project that has been able to utilise purist rules for normalising or not but following the methodology gets you further through the hurdles than not following it. Problem is it is easy to hide the difficult detail if simplistic modelling of common dimensions is completed suggesting that 80% of the work is complete. When the usual problem occurs that the 20% left takes 80% of the time.

    So Jon I agree with you, from a consulting perspective, however we musn’t loose focus on the complexity underneath. Say model the common dimensions and explore the detail of some of the complex areas to expose unknown difficulties early on. This is difficult when under constant pressure to show progress but aids understanding when reporting to management.

    I think more companies than not prefer to follow more formal methods however I would suggest that they usually hide more problems than they uncover!

  7. Jon Mead Says:

    Tim,

    I think the problem with mixing the grain can come from reporting tools and how they summarize/aggregate data – for example if the grain of the fact table is day, and a time dimension with a Y-M-D hierarchy then when the reporting tool displays data at month level it groups by the month identifier and sums the values – the problem comes when some of the values are in the fact table already at month level, these could get doubly aggregrated (sorry, terrible phrase – double counted). This is one of the main reasons the grain should be consistent.

    Regarding design and modelling – I think it is important to try and follow common ‘design patterns’ and standards. In a different area, look at the Model-View-Controller design pattern. This is a highly effective way of implement a common system – the good thing is that it saves us re-inventing the wheel, and it is used over and over again. There is scope to be to model complex situations – however there are only so many problems we ever need to solve, so once the industry comes up with a pretty standard approach to one of them, that is often because that is the best way to solve that problem.

    Jon

  8. Simi Says:

    Can you please explain how to implement conformed dimensions using OWB?

    Thanks,
    Simi.

  9. nall Says:

    Thank you all for your ideas and valuable comments on DW designing. Kimball’s mistake 10 is my concern. I have a situation here where i am dealing with two hierarchies within the same Retail system. For example: 1st hierarchy (organization product): A-B-C-D-E-F (A being the top level and F the lowest level) and 2nd hierarchy (Geography): A-X-C-D-E-F (Both the hierarchies are almost same except for there are two levels in 1st hierarchy (B-C) which have no relation with the level ‘X’ in the 2nd hierarchy. Any suggestions on designing this? I understand the it depends on the requirements and what i have given here is vague but anyone who has done it before and would like throw some light on it will be appreciated. Also my requirement is forcing me to design the star schema’s from the above hierarchy in this fashion: my 1st fact table is ending up with dimensions A, B, X and the 2nd fact table with B, C, X 3rd fact table with A,C,X,D,E,F and all other combinations. So, if taken Kimball’s mistake 10 into consideration then i have to put all the levels of hierarchies in one dimension that is not allowing me more flexibility for reporting purposes. Please help.

  10. Jon Mead Says:

    Simi,

    Conformed dimension refer to dimension that are shared across many different fact tables within your data warehouse, so in terms of OWB you would implement them just like any other dimension, but you would find them used in multiple cubes/fact tables,

    Regards,

    Jon

  11. Jon Mead Says:

    nall,

    Could you give me a few more details, when you say the 1st and 2nd hiearchies you describe hierarchies within the same dimension, or different dimensions.

    Mistake 10 refers to the modelling of one dimension, for example if you had a geography dimension that had the following levels: city->country->region->market then Kimball thinks that they should all be modelled to one dimension and that you *shouldn’t* create two dimensions: city->country and region->market,

    Regards,

    Jon

  12. nall Says:

    Hello Jon,

    Thank you for responding back. This is what I have. I am working for a retail management system where we have hierarchies in this fashion. The 1st hierarchy from top to bottom is: Command (a company; ex: Sears) has many Sites (ex:Virginia Sears), Sites have many Departments (ex: ladies clothes), Each Department has many Classes (ex: Tops, Bottoms etc). Each Class may have many Subclasses (ex: Sweaters or shirts) and Subclass can have Styles (ex: Men’s Shirts, or Women shirts). So the hierarchy from top to bottom is Command

  13. nall Says:

    Hello Jon,
    Thank you for responding back. This is what I have. I am working for a retail management system where we have hierarchies in this fashion. The 1st hierarchy from top to bottom is: Command (a company; ex: Sears) has many Sites (ex:Virginia Sears), Sites have many Departments (ex: ladies clothes), Each Department has many Classes (ex: Tops, Bottoms etc). Each Class may have many Subclasses (ex: Sweaters or shirts) and Subclass can have Styles (ex: Men’s Shirts, or Women shirts). So the hierarchy from top to bottom is Command

  14. nall1 Says:

    I think there is a byte limitation on posting and here is the continuation:
    So the hierarchy from top to bottom is Command -> Sites->Departments -> Class-> Sub class-> Style. The 2nd hierarchy is almost the same except for Command -> DIVISION (Softlines, Hardlines etc) -> LOB (ex: Auto accessories, cosmetics etc) -> Departments-> Class-> Sub class-> Styles.There is no ‘SITE’ in the 2nd hierarchy and Command is derived from site ID’s. I am not sure if this information is good enough and please let me know if I can be more detail and thank you in advance for any of your suggestions.

  15. nall2 Says:

    Hello Jon,
    Thank you for responding back. This is what I have. I am working for a retail management system where we have hierarchies in this fashion. The 1st hierarchy from top to bottom is: Command (a company; ex: Sears) has many Sites (ex:Virginia Sears), Sites have many Departments (ex: ladies clothes), Each Department has many Classes (ex: Tops, Bottoms etc). Each Class may have many Subclasses (ex: Sweaters or shirts) and Subclass can have Styles (ex: Men’s Shirts, or Women shirts). So the hierarchy from top to bottom is Command -> Sites->Departments -> Class-> Sub class-> Style. The 2nd hierarchy is almost the same except for Command -> DIVISION (Softlines, Hardlines etc) -> LOB (ex: Auto accessories, cosmetics etc) -> Departments-> Class-> Sub class-> Styles.There is no ‘SITE’ in the 2nd hierarchy and Command is derived from site ID’s. I am not sure if this information is good enough and please let me know if I can be more detail and thank you in advance for any of your suggestions.

  16. Tim Berry Says:

    I can see how the model view controller manages the requirements and gathering of data for OLTP type requests. I can also envisage how you might utilise a similar process regarding higher data volume collections but need to consider it in more detail to see if I can put it into practice. I’m off from this data warehousing project to a data migration in a couple of weeks. I’ll let you know how I butcher the process to achieve a consolidated view of data for the new project. I think the first thing to change maybe the order in which the process is named i.e. controller – define the data to be collected, model – detemine the sources and limits (clean up/ filters) and view – build the denormalised table that collects defined data. It’s starting to take shape in my brain now!
    Cheers Jon

  17. One man's meat is another man's butchery product « Pete-s random notes Says:

    [...] writing this piece a while back and since I started Beth wrote on data quality, Jon Mead touched on conformed dimensions in a follow up to his posting on dimensions, and Andy Whitehurst of Xansa spoke about Master [...]

  18. Jon Mead Says:

    Nall,

    The key thing is the grain of your dimension, in your case Style. Your fact table would join to this dimension at this level and at this level only.

    Each row dimension table would contain at least one column for each level: Style, Sub Class, Class, Departments, LOB, Division, Command (aside from surrogate keys).

    You would then define two hierarchies using the levels, as you have identified.

    Queries would then be able to aggregate to any level in either hierarchy – grouping by that level.

    Hope this helps,

    Regards,

    Jon

  19. Jon Mead Says:

    Tim,

    Adapting MVC for data migration – sounds interesting. The model (target/source/both?) and the controller I can see – does that leave the view as the audit and reconciliation data :)

    Jon

  20. Tim Berry Says:

    You had to mention it (Audit & Reconciliation) you need balls for that.

    Following on from the blog regarding data clean up…
    There is always problems when integrating two differing ERD’s together.
    My personal preference is to create hybrid staging tables that resolve source relationships from the targets requirements.
    Its’ a bit difficult to express my mechanism for this within a blog but the process, currently unnamed, is…

    1. Get the business to describe their data
    2. Determine the cardinality and data content of the source
    3. Review the target requirements
    4. Build a hybrid that considers 2 & 3 that addresses ETL effciency.

    Another thing I like to do is introduce decision tables that generate too many rows in the hybrid.
    Decision tables being the key fields that drive source and target joined together
    with their counterpart, where more than 1 row is required from source 2 target keys are allocated
    to the same source keys. Where merges are required the differing source values have the same target key.
    Then when we join the source to the decision table to give both views of the data.
    Sequence numbers and other metadata can be attached to the decision table to simplify rules.

    Lets name the process, ETL was a good one, but in this case we have…
    Define,
    Assess,
    Review,
    Define decision tables (add metadata),
    Denormalise into a hybrid
    Audit & reconcile
    Map to the target system

    DARMHAM not very catchy! However I think controller, model and view is a little better where
    controller is step 1
    model is steps 2 & 3
    view is step 4 (with a smattering of decision tables)

    And this is how the specifications would look where some of the decision tables (under the control of the BA)
    would be defined. It helps with getting the BA’s involved with understanding the complexity and also gives
    them the control they always crave on ETL projects. It is quite common for development to take over in ETL
    technical issues so anyway we can mitigate against such is a good thing. It aids off site development aswell
    where changes in the decision tables either rows or metadata to affect the path of the data without the need
    to change the processing/ mapping.

    However under existing client/ contractor relationships it is still difficult to move away from mapping
    through normalised form to the above nirvana I try to employ. It maybe that simplifying the above
    description through a formalised process would be better received by clients!

    Tim

Website Design & Build: tymedia.co.uk