End-to-end data quality

One of our customers is about to embark on a significant BI project; but being in the "public sector" they have to (by EU law) publish tender documents so that qualified suppliers throughout the EU can bid to do the work. This means there is a gap of almost a year before the yet-to-be-selected BI infrastructure can be implemented and work on building the solution can start.

In the interim, the customer can work on data quality; they know what they need to report on (it's in the project mandate!) and they know the sources of information (their operational systems), so they can start to verify that all of the required facts can be found in the source systems and, more importantly, look at the data content and assess its "fitness for purpose". If data defects are found, it may be possible to get them fixed before serious construction of the ETL layer starts. Besides, knowing both source and target gives a good head start in specifying the ETL interfaces.
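
A first pass at that fitness-for-purpose check can be as simple as profiling each candidate source column. Here is a minimal sketch using pandas; the file name and the idea of extracting to CSV are placeholders, not the customer's actual systems:

```python
import pandas as pd

# Placeholder extract from one operational source; in practice this would
# be pulled straight from the operational database, not a CSV file.
customers = pd.read_csv("crm_customer_extract.csv")

# Basic profile of each column the BI project will depend on: how much is
# missing, and how many distinct values there are (a quick smell test for
# free-text fields that ought to be coded lists).
profile = pd.DataFrame({
    "missing": customers.isna().sum(),
    "missing_pct": (customers.isna().mean() * 100).round(1),
    "distinct": customers.nunique(),
})
print(profile)
```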

One particular issue they might meet, and one that is sadly far too common across many business sectors, is the use of operational systems that do not enforce data integrity. For whatever reason there is just too much freedom in data entry, and although it may not affect the operational system much, it really can cause problems when you try to aggregate information on the BI system.

But how do we deal with this? Recently I joined a thread on one of the LinkedIn BI groups where a "receive garbage, store garbage" strategy was proposed. In my opinion this might be OK for a mature BI system where users understand that the reporting accurately reflects the source, but for a new venture into BI? To me this seems too much of a risk; the new BI users may not have sufficient exposure to the source systems to realise that the data is at fault in the source. We could prevent data that fails a quality threshold from loading into the BI system, but then we would show incomplete results which, although correctly aggregated, are misleading because of omission; at the end of the day the load policy is a business choice. If we go with the "reject poor data" route we should seriously think about providing a data quality dashboard on the reporting system to show the number of records that failed to load, with drill-down to the reasons why they failed.
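
A minimal sketch of the figures such a dashboard could surface, assuming the ETL load step writes rejected rows to some form of audit store with a reason recorded against each (the sources, tables and reasons below are invented for illustration):

```python
from collections import Counter

# Hypothetical reject-audit records written by the ETL load step;
# in a real system these would come from an audit table, not a literal.
rejected_rows = [
    {"source": "CRM", "table": "customer", "reason": "missing postal code"},
    {"source": "CRM", "table": "customer", "reason": "invalid country name"},
    {"source": "Billing", "table": "invoice", "reason": "orphaned customer key"},
    {"source": "CRM", "table": "customer", "reason": "missing postal code"},
]

# Headline figure for the dashboard: how many rows failed to load, per source.
per_source = Counter(r["source"] for r in rejected_rows)

# Drill-down: the reasons behind the failures for a chosen source.
per_reason = Counter(r["reason"] for r in rejected_rows if r["source"] == "CRM")

print("Rejected rows by source:", dict(per_source))
print("CRM rejection reasons:", dict(per_reason))
```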

So what do we do with data that fails the quality standard? Ideally, we should get it fixed at source. Auto-fixing on load is possible, but then we need to think about data governance and the possible 'trust' problems of the data not being aligned with the source. Maybe you could 'standardise' country names and other columns on loading; I've seen systems with 'USA', 'U.S.A', 'U S A', 'US of A', 'America', and 'US' in the country data feed, and that's before we get to the mis-keying of 'United' to get 'Untied'! But maybe that sort of improvement in quality should also be made available to the operational systems' users.
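
As a rough illustration of that kind of on-load standardisation (a sketch only; the variant spellings and the canonical value are taken from the example above, not from any real feed):

```python
# Map the variant spellings seen in the feed onto one canonical value.
# A real mapping would be built from profiling the actual source data.
COUNTRY_SYNONYMS = {
    "usa": "United States",
    "u.s.a": "United States",
    "u s a": "United States",
    "us of a": "United States",
    "america": "United States",
    "us": "United States",
    "untied states": "United States",  # the 'United' mis-keying
}

def standardise_country(raw: str) -> str:
    """Return the canonical country name, or the original value if unknown."""
    key = raw.strip().lower().rstrip(".")
    return COUNTRY_SYNONYMS.get(key, raw.strip())

print(standardise_country(" U.S.A "))  # -> United States
print(standardise_country("France"))   # -> France (passed through unchanged)
```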

For this customer, I have suggested that they construct a source-to-BI-target matrix and include some basic traffic light measures on the source data (a rough sketch of such a matrix follows the list):

  • How good is it?
  • What sort of errors are present: missing items, typographical errors, missing or incorrect parents, inconsistent use of names, even data entered in the wrong fields.
  • How important is it to be correct in the BI system? For example, a street address cannot be aggregated in reporting and we may not be going to use BI to create mailing lists, but a postal code (or a substring of it) can be used to aggregate people by location.
  • How important is it to be correct in the operational source? Do we need to apply the corrections at source to improve the operational use of the system?
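
As a rough sketch only (the sources, columns and ratings below are made up, not the customer's actual assessment), the matrix might be captured as something like:

```python
# Hypothetical source-to-BI-target matrix with traffic-light measures.
# Each entry answers the four questions above for one source column.
quality_matrix = {
    ("CRM.customer", "postal_code"): {
        "maps_to": "dim_customer.postcode_area",
        "quality": "amber",                    # how good is it?
        "errors": ["missing items", "typos"],  # what sort of errors are present?
        "bi_importance": "high",               # needed for location aggregation
        "source_importance": "medium",         # worth fixing operationally too?
    },
    ("CRM.customer", "street_address"): {
        "maps_to": "dim_customer.street_address",
        "quality": "red",
        "errors": ["data entered in wrong fields"],
        "bi_importance": "low",                # not aggregated, no mailing lists
        "source_importance": "high",
    },
}

# A simple way to pull out the columns that matter most to the BI build.
priorities = [
    (src, col) for (src, col), m in quality_matrix.items()
    if m["quality"] != "green" and m["bi_importance"] == "high"
]
print(priorities)
```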

But this type of quality review may not tackle the data problem that is probably the hardest to deal with: what is a correct fact? How do I know whether a house value of £20,000 is reasonable (it could be in a shared ownership scheme), or £2,000,000, or £20,000,000? We could set a validation range, but where is the point at which one penny more is obviously wrong, yet the current value is OK?
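
One pragmatic option, sketched below under assumed bounds (the thresholds are illustrative, not a recommendation), is to flag suspect values for review rather than reject them outright:

```python
# Assumed plausibility bounds for illustration only; the real bounds would be
# a business decision, and a value just inside them is not automatically correct.
LOWER, UPPER = 20_000, 2_000_000

def classify_house_value(value: float) -> str:
    """Flag values outside the assumed plausible range as 'suspect'
    so they still load but are visible on the quality dashboard."""
    return "ok" if LOWER <= value <= UPPER else "suspect"

print(classify_house_value(20_000))      # ok - could be shared ownership
print(classify_house_value(20_000_000))  # suspect - refer back to the source
```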