Data Warehousing

Re-summarisation

Peter Scott

Jul 23, 2006 • 2 min read

Re-summarization, re-statement or re-summarisation, call it (or spell it) as you will, is a periodic maintenance task in some data warehouses.

Most fact tables contain just the measures and keys to the dimension tables. This design is flexible in that it supports both non-changing and slowly changing dimensions, and for slowly changing dimensions it works with all Kimball types. The design protects the fact data and it summary from changes in attribute for dimensional values, if an item that had an "orange" attribute became "tangerine" we do not need to rebuild summaries as the attribute is not stored in the summary but is referenced through the key on the dimension table.

So why the need to re-summarise? Supposing the product changed its parent item, this may sound unlikely but it happens with surprising frequency in data warehouses especially where the dimension implements some business rule based hierarchy. It's even not unknown in hierarchies such as Geography either - countries can split (Czechoslovakia became the Czech and Slovak republics) and join (East and West Germany) and and at a lower level operating units can move between reporting regions, for example I worked on the design of a national crime reporting database were we expected that during the lifetime of the project some police forces would merge or otherwise changes their territory. But dimensions such as PRODUCT are more likely to have change. The most common change is where a new reporting grouping is introduced.

I have a customer that runs many bars and restaurants. They report drink sales by the product but also summarise by drink type, so they can report draught and bottled beer sales as separate totals. A while back they started to sell 'Alchopops', that is 5% alcohol 'sodas'. At first when there were only one or two available they classified them as beer or spirits depending on whether the drink was brewed as a fruit beer or mixed with a spirit such as rum. But as the sector became popular they made it a grouping in its own right. We could model this quite easily using type 2 SCD and have a date when sales starts to report as a Alchopop and before that as a beer or spirit. But often businesses want their data to reflect current reporting hierarchies and to do this we need to rebuild summary tables where data may report against the original hierarchy, the changes in reporting structures is especially noticeable in SCD type 1 dimensions where some summaries exist using old reporting structures but some queries aggregate from base without using the summaries, in this case the totals won't match.

The simplest approach to re-summarisation is to rebuild the summaries for all dates at the levels that are susceptible to change. This usually demands a big chunk of down time. Fast view refresh is also an option but this is fraught with potential problems...

Sign up for more like this.