Geography Hierarchies

May 24th, 2011 by

I have been thinking about address a lot recently, in part it was moving house and all of the 1001 people that need to be notified. In the main, though, it was thoughts inspired by a data warehouse project I am working on. For this DWH people are geo-located by their street address, but for most reporting we are only concerned with a grain of city. This all sounds so simple but how do we build a hierarchy from address, the line between the street where you live and the planet you live on. I remember as a child thinking it cool to address a letter to 23 Railway Cuttings, East Cheam then adding Surrey, England, Europe, The World, and however far I could get through navigating the solar system and the universe. To a child the hierarchy of address is relatively straightforward. But in data warehouse modelling things are not quite so simple.

Take the postal code (or zip code) where does it fit in the hierarchy? Well the answer is might not fit at all. Postal codes were developed to help post offices deliver mail – and each postal authority did their own thing. The UK and the Netherlands have postal code systems that can identify a single street or even a cluster of houses of within a street. Other countries work on a code per town or group of nearby towns – so straight away we have a difference in grain; a few houses in the UK a few towns in France. In Germany postal codes relate to geographic areas but those areas are not aligned to the Bundesländer; on the other hand, France ties postal code to Department but there are anomalies notably where a river runs through a village and opposite banks share a postal code but are different Departments (and in one case, different regions). Some national postal codes are numeric, some area alphanumeric (like Canadian and UK ones), the length of the postcode varies between countries too.

Perhaps the sensible thing, especially if you are dealing with addresses from multiple countries, is to not use postal code as a level in geographical hierarchies. If you use them at all just make them as an attribute of the address and remember that they don’t always have geographical parents.

I think the key point about modelling geography is that just because you know how addresses work in your own country you can’t assume that they work like that in the country next door. If you have a requirement to report, for example, the efficiency of the postal service in delivering your goods by postal region you need to ensure that your reporting handles the anomalies and exceptions. As always, knowing your data is key to creating a correct model.

Comments

  1. Tim Berry Says:

    Funnily enough I have just encountered this problem whereby the existing model was based upon the requirements given by the business analysts. However when we looked again at the existing solution and the business requirements we found your illustrated problem.
    Personally the best mechanism to discover these problems is to target the cardinality of the data and normalise based on that. Concentrating on the cardinality tends to tease out the issue you have illustrated sooner rather than later. As we all know an invalid data model means a costly rewrite.
    Tim

Website Design & Build: tymedia.co.uk