GIGO (or part 2)

As an acronym, GIGO, for Garbage In, Garbage Out, seems to have a better ring to it than the less common (or non-existent) terms TITO, RIRO or CICO. But the principle is the same: if you want good results, don't store rubbish.

As I mentioned earlier, we must guarantee the quality of our data, and to do that we must design in the checking needed to confirm that quality. Bad data simply must not make it to the main storage of the data warehouse. Partly this is to avoid unnecessary work in the ETL "publish" phase, but more importantly it is to avoid the major task of backing out the bad data and the effect that removal might have on some of the database structures we so carefully crafted to maximise efficiency, such as compressed tables.

But what is quality data? Is it just data that is referentially valid, or do we actually range-check values for business sense? And what do we do with the exceptions we find?

My usual form of DW design has three layers: a staging layer to receive data from the source systems; a data store holding clean but unaggregated data, often in a normalised form as a homage to Inmon, which serves the raw-data analysts and data miners and acts as the source for the presentation layer; and the presentation layer itself, typically a star schema with fully denormalised reference data and enough aggregate tables or OLAP cubes to do a good reporting job.
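To make that concrete, here is a minimal sketch of the three layers in Oracle-flavoured DDL. The table and column names are invented for illustration, not taken from any real project.

```sql
-- 1. Staging: data lands here exactly as the source system sends it (no constraints)
CREATE TABLE stg_product_feed (
  product_code   VARCHAR2(30),
  category_code  VARCHAR2(30),
  product_name   VARCHAR2(200),
  load_batch_id  NUMBER
);

-- 2. Data store: clean, unaggregated and roughly normalised
CREATE TABLE ds_product_category (
  category_id    NUMBER        PRIMARY KEY,
  category_code  VARCHAR2(30)  NOT NULL UNIQUE,
  category_name  VARCHAR2(100)
);

CREATE TABLE ds_product (
  product_id     NUMBER        PRIMARY KEY,
  product_code   VARCHAR2(30)  NOT NULL UNIQUE,
  product_name   VARCHAR2(200),
  category_id    NUMBER        NOT NULL REFERENCES ds_product_category
);

-- 3. Presentation: a star-schema dimension with the hierarchy flattened in
CREATE TABLE dw_dim_product (
  product_key    NUMBER PRIMARY KEY,
  product_code   VARCHAR2(30),
  product_name   VARCHAR2(200),
  category_code  VARCHAR2(30),
  category_name  VARCHAR2(100)
);
```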

If possible, I like to receive self-contained reference data sets; if I am loading product data I would expect the whole product hierarchy to be contained in the load batch. This lets me ensure that the data being loaded is self-consistent. In its simplest form I check primary keys and foreign key relationships and sideline data that does not conform. With Oracle databases I often do this with the good old EXCEPTIONS INTO clause when enabling database constraints, then move all of the rows (identified by ROWID) that fail the tests; there are a few sophisticated twists that can be added to improve supportability. In some cases we extend the checking to fix iffy business data, or at least to present it in a consistent fashion; this is where data cleansing tools help. Once clean, the data can be published.
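To show the shape of the EXCEPTIONS INTO route, here is a minimal sketch. It reuses the made-up tables from the DDL above; the constraint and reject table names are equally hypothetical, and for simplicity the staged products are checked against the data store's category table rather than a category feed in the same batch.

```sql
-- Default exceptions table, as created by $ORACLE_HOME/rdbms/admin/utlexcpt.sql
CREATE TABLE exceptions (
  row_id      ROWID,
  owner       VARCHAR2(30),
  table_name  VARCHAR2(30),
  constraint  VARCHAR2(30)
);

-- A reject table with the same shape as the staging table
CREATE TABLE stg_product_feed_rejects AS
  SELECT * FROM stg_product_feed WHERE 1 = 0;

-- Declare the relationship the load must satisfy, disabled while the data arrives
ALTER TABLE stg_product_feed ADD CONSTRAINT fk_feed_category
  FOREIGN KEY (category_code) REFERENCES ds_product_category (category_code)
  DISABLE;

-- Validation fails if any rows break the constraint,
-- but the offending ROWIDs are still written to EXCEPTIONS
ALTER TABLE stg_product_feed
  ENABLE VALIDATE CONSTRAINT fk_feed_category
  EXCEPTIONS INTO exceptions;

-- Sideline the offenders, then remove them from staging;
-- the constraint will now validate cleanly when re-enabled
INSERT INTO stg_product_feed_rejects
  SELECT f.*
  FROM   stg_product_feed f
  WHERE  f.rowid IN (SELECT row_id FROM exceptions);

DELETE FROM stg_product_feed
WHERE  rowid IN (SELECT row_id FROM exceptions);
```

In practice the ENABLE step sits inside PL/SQL that catches the expected validation error and then runs the sidelining steps; that is one of the supportability twists mentioned above.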

I use a different approach for fact data, that is, a set of measures and their supporting reference keys. Here I check that the keys exist in the mid, or raw, data layer. I could use constraints for this, but often use outer joins to the reference tables instead; this allows me to use Oracle multi-table inserts to split the data between to-be-published and rejected tables, as sketched below. From this you can tell that I do not often attempt to isolate exceptional (numeric) data values; after all, what is an exceptional transaction?
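A minimal sketch of that multi-table insert pattern, again with invented fact and dimension names, might look like this:

```sql
-- Hypothetical fact load: outer-join the staged facts to the reference data,
-- then let a multi-table insert route each row to the right target
INSERT FIRST
  -- any fact whose reference keys failed to resolve goes to the reject table
  WHEN product_id IS NULL OR store_id IS NULL THEN
    INTO rejected_sales (sale_date, product_code, store_code, quantity, amount)
    VALUES (sale_date, product_code, store_code, quantity, amount)
  -- everything else is fit to publish
  ELSE
    INTO publish_sales (sale_date, product_id, store_id, quantity, amount)
    VALUES (sale_date, product_id, store_id, quantity, amount)
SELECT s.sale_date,
       s.product_code,
       s.store_code,
       s.quantity,
       s.amount,
       p.product_id,        -- NULL when the product code is unknown
       st.store_id          -- NULL when the store code is unknown
FROM   stg_sales s
       LEFT OUTER JOIN ds_product p  ON p.product_code = s.product_code
       LEFT OUTER JOIN ds_store  st  ON st.store_code  = s.store_code;
```

The rejected rows keep their natural keys so they can be reported back to the source system, while the published rows carry the resolved reference keys.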