Time series

June 21st, 2005 by Peter Scott

Of all dimensions, time is perhaps the most common. In fact I would be hard pressed to think of a data warehouse that did not encompass time. Inventory, sales, phone calls, web click-throughs, energy usage or whatever, they all hold measures (quantities) along a time line. So you’d think that time series aggregates (such as year to date or moving quarterly total) would come naturally to a data warehouse, sadly this is not always the case.

Calculating total sales for the current year (year-to-date, YTD) is (conceptually) not particularly hard – just add the total sales for each day.

There are two fundamental approaches to providing such aggregates over time: pre-calculate the aggregate or do the sum on-the-fly. On the face of it pre-calculation looks very attractive; expensive summations are done just once, in the batch window. For high levels summaries (all products, all customers) this approach works well, but there is a problem when you try to calculate year-to-date by customer, by product. At low levels of aggregation, data warehouse data is sparse: if you have 200 stores, 60,000 products and a million customers you will find a) most customers only visit one or two stores, b) they don’t shop every day and c) they only buy a few of the products. Clearly, it would be nonsensical to store an YTD figure for each available combination (that is 200 X 60,000 X 1 million values) as most values would be zero. Calculating the values for the combinations that occur looks a better bet but sparsity again causes us a problem. Suppose that on January 10 I sold item A to customer X for $10 (my first sale of the year!) – so, my YTD for customer X and product A is $10. The next day customer X does not shop but I still need to include that sale in my YTD calculation. That is once a sale is made I need to maintain an YTD value for the reminder of the year. To make maters worse we don’t need just a current YTD figure but one for each time slice in the data warehouse; without this it would be impossible to provide year on year comparison reports. And even worse, what happens when a store has a communications failure and sends it data to the data warehouse 10 days late and we need to revisit previous YTD values?

Calculating YTD on-the-fly has the problem that (potentially) a large amount of data needs aggregating – worst case you need to total a whole year. But you do not have problems with sparsity loss and storing masses of seldom used data. The other downside here is that many of the common reporting tools expect pre-calculated values since time series aggregate functionality is lacking in the products. This lack can be circumvented by joining queries to ‘list of dates’ style tables or using functions to return date lists. Oracle OLAP and Oracle Express both have the functionality to calculate time series aggregates, this is an option worth considering if you need time such aggregates and find the brute-force method of reading most of the database unappealing

Comments

  1. Anonymous Says:

    Ok, nothing is perfect. I understand that. What is your preferred approach?
    When do you do pre-calculate, when do you do on-the-fly?

  2. Scot Says:

    Interesting post. As background on myself, I don’t really know much about data warehouses compared to someone in your position. And as I don’t work with them in my current projects, I won’t be investigating them full time. But I am finding your blog interesting and educational.

    I mention all that as sort of a request (if it suits you and/or the other readers) to sprinkle a few introductory postings in with more advanced ones. If nothing else, you could use them as “documentation” to hand to any future junior developers and DBA’s that you find yourself managing now or in the future.

    With regard to the time dimension, I’ve always thought that a warehouse would have a time table, with a date primary key, with one entry for each day that matters, and several columns for each date such as calendar_quarter_number, fiscal_quarter_number, calendar_week, fiscal_week, calendar_month, fiscal_month, and so and and so forth.

    A time table like this would be relatively small (30 years times 365 days is trivial).

    Then all of the time (date) relationships would be made in the various fact tables, so that you could relate each date with the various months, weeks, years, whatever that it went with.

    And I suppose you could break this down into smaller units than date if needed, such as hour. My question, at last, is whether or not this concept is actually used?

  3. Pete_S Says:

    Well… it depends on what the customer wants, I have one who wants micro-detail (I be they don’t really, they just think they do) so that would way too expensive in terms of disk real-estate. So we do it all on the fly - The real low level stuff takes a while to come back but only a few people look at that sort of thing. A lot of my other clients go forhigh level stuff like sales by store by category (say children’s clothes)

  4. Pete_S Says:

    Scot

    I had always intended to mix up technical with more basic stuff and completely non DW stuff. So point noted – just got to think of something that is good to read and interesting to my readers…

    Yep, a time table is common in a denormailised dw (partially denormalised or normalised designs may be spread time across more that one table with FK relationships) In addition to columns you suggest common additions are day of the week, public holidays, season, and day_number in year (this could be used for moving totals, YTD and same period last year type queries) typically, I would bitmap index the commonly used columns.

  5. Scott Swank Says:

    Cardinality if of course huge here. How many customers do we have and how many products to we have? Can we create a materialized view of customer/product/week or customer/product/month so that year-to-date calculations can make a “query rewrite” hop over to the mv?

  6. Pete_S Says:

    Scott

    An MV aggregate was considered - the two big problems for us was that this company’s customers mainly shop once a week (7 days at day level is only marginally more rows than 1 week in week level table) and that the company’s business analyst team expect to be able slice and dice by day.
    I guess if we could modify their expectations there would be scope for month and 4-weekly summaries to be used