Surrogate keys

June 15th, 2005 by Peter Scott

June 17 - I was not too happy about the way this post was written - I will be revising parts of this - I will use italics to highlight changes.

I must admit my view on surrogate keys in a data warehouse is far from the orthodox. For me a surrogate key on a dimension is the exception, not the norm. If an item has an immutable natural single column key then why not use it? For most businesses, the renumbering all customers or all products is a big change; it runs across the whole business and all systems – logistics, ERP, finance, and data warehouse will all be affected. This is the type of change that does not happen often; it may, therefore, be pragmatic to consider customer and product to immutable keys.

I certainly would not consider a surrogate numeric key to replace a natural date key on the time dimension – date has ‘real’ meaning, it can easily be used in a partition key, it also facilitates ‘between’ type queries; numeric surrogates on the other hand can make partitioning schemes more complex (perhaps list instead of range partitioning) and ‘between dates’ queries become ‘in list’ ones.

In a comment to this post, Tom Kyte referred me to a recent thread on AskTom where an improvement to query plan through the use of a surrogate key joining to the dimension table over the plan obtained for a direct select from a fact table using a natural key was discussed. To me, the key point here is the join between fact and dimension tables (on the dimension surrogate key) however I feel that similar (if not identical) plans would result by joining on natural single column keys.

It could be argued that surrogate keys ‘force’ a casual DW user to create queries that join the fact tables to its dimension tables and hence improve query optimisation; however, if you consider that most casual users will use query tools and do not write SQL directly against the database that point becomes less compelling; the same joins to the dimension tables will be required to resolve the key codes into more friendly descriptions.

Comments

  1. Thomas Kyte Says:

    recent discussion on this (dates in particular) was had here

    Seems sometimes it makes sense to surrogate them, sometimes it does not (it all comes back to how you use them)

  2. Pete_S Says:

    Thanks for the link, Tom - I see an edit/clarification coming on!

  3. Thomas Kyte Says:

    and don’t get me wrong, I am all for using natural keys.

    I’ve see systems that assign a number to a state code to look up the state code to get to the state name (and so on)

    If they had their way, every fact table would be a bunch of numbers, it would be a mess.

    I don’t like turning the date into a number - but offered that up as a counter point (that I was not totally clued into until I read it).

  4. Pete_S Says:

    Knew exactly what you were saying, Tom. It’s good to look at what you do and re-analyse it from time to time. The ask Tom thread made a very true point on date time-series queries – I shall write on this another day.

    Perhaps the biggest negative to date as a surrogate number key is that it makes date range partitioning schemes less intuitive (and if you go too far and disassociate date order from surrogate key order… just don’t go there).

    Natural keys, surrogate keys, whatever the important thing is to keep the fact ‘normalised’ – just the measures and foreign keys to the dimensions.

    I’ve see systems that assign a number to a state code to look up the state code to get to the state name (and so on)
    Of the systems I manage for various customers, the one I hate most is an ODS/data switch/data entry system (it’s not a data warehouse – so why me?) It was developed about 8 years ago in a ‘new’ 4GL OO language – it took about 12 months to write (as an aside, I needed to change some functionality so I got one of my team to re-write it in Java – it took less than a week, the original team took 3 months!) . The designers went to great lengths to apply OO design principles – even to creating a single generic database table to hold all of the reference data – four columns: surrogate key, type of data (as a numeric code) “code” (visible to users) and a description – Yuk is a great word for it.

  5. Pete_S Says:

    Knew exactly what you were saying, Tom. It’s good to look at what you do and re-analyse it from time to time

    perhaps it would be clearer if I split that into a two paragraphs - it’s my work that needs review … ;-)

  6. Howard J. Rogers Says:

    A timely Blog. The topic has resurfaced at the Dizwell Forum lately, too, and I seem to be the only boy there batting for good natural keys. Good to know I’m not completely off my trolley!

    I consider myself generally pro-natural, but ultimately agnostic: whatever works best. But some of the synthetic/surrogate guys take it very religiously indeed!

    Regards
    HJR

  7. Anonymous Says:

    I thought that surrogate keys were only beneficial for taking care of changes in dimensions. (SCD type II). Having a surrogate key allows you to track the changes through time.

  8. Pete_S Says:

    True, surrogate keys do have a use in tracking slowly changing dimensions but you can do the same thing with immuatble single column keys from external systems. Say your DW gets its data from SAP - the product interface includes the SAP for a product as well as the more friendly descriptions and other attibutes. We can make the assumption the SAP code will not change (descriptions, etc might!), therefore we use it for change tracking, but to our DW it is a natural key.