End-to-end data quality

October 25th, 2008 by Peter Scott

One of our customers is about to embark on a significant BI project; but being in the “public sector” they have to (by EU law) publish tender documents so that qualified suppliers throughout the EU can bid to do the work. This means they have a gap of almost a year before the, yet to be selected, BI infrastructure can be implemented and work on building the solution can start.

In the interim, the customer can work on data quality; they know what they need to report on (it’s in the project mandate!) and they know the sources of information (their operational systems) so they can start to verify that all of the required facts can be found in the source systems and more importantly look at the data content and assess “fitness for purpose”. If data defects are found then it may be possible to get them fixed before the serious construction of the ETL layer starts. Besides, the knowledge of source and target gives a good head start in the specification of ETL interfaces.

One particular issue they might meet, and one that is sadly far too common across many business sectors, is the use of operational systems that do not enforce data integrity. For whatever reasons there is just too much freedom in data entry and although it may not affect the operational system much it really can cause problems when you try to aggregate information on the BI system.

But how do we deal with this? Recently I joined in on a thread on one of LinkedIn BI groups where it was proposed that a “receive garbage, store garbage strategy was adopted” – in my opinion this might be OK for a mature BI system where users can understand that the reporting accurately reflects the source, but for a new venture into BI? To me, this seems to be too much a risk; it might be that the new BI users do not have sufficient exposure to the source systems to realise that the data is at fault on the source. We could prevent data that fails a quality threshold from loading on the BI system, but then we would show incomplete results which although correctly aggregated are misleading because of omission; at the end of the day load policy is a business choice. If we go with the “reject poor data” route we should seriously  think about providing a data quality dashboard on the reporting system to indicate the numbers of records that failed to be loaded and drill-down to the reasons why they failed.

So what do we do with data that fails the quality standard? Ideally, we should get it fixed at source. Auto-fixing on load is possible, but then we need to think about data governance and the possible ‘trust’ problems of the data being not aligned with the source. Maybe you could ‘standardise’ country names and other columns on loading; I’ve seen systems with ‘USA’, ‘U.S.A’, ‘U S A’, ‘US of A’, ‘America’,  and ‘US’ in the country data feed and that’s before we get to the mis-keying of ‘United’ to get ‘Untied’!  But maybe that sort of improvement in quality should also be available to operational systems users.

For this customer, I have suggested that they construct a source to BI target matrix and include some basic traffic light measures on the source data:

  • How good is it?
  • What sort of errors are present; missing items, typographical errors, missing or incorrect parents, inconsistent use of names, even data entered in the wrong fields.
  • How important is it to be correct in the BI system; for example street address can not be aggregated in reporting and we may not be going to use BI to create mailing lists, but postal code (or a sub string of it) can be used to aggregate people by location areas.
  • How important is it to be correct on the operational source – do we need to apply the corrections at source to improve the operational use of the system

But this type of quality review may not tackle the data problem that is probably hardest to deal with what is a correct fact? How do I know if house value of £20,000 is reasonable (it  could be in a shared ownership scheme) or £2,0000,000 or £20,000,000? We could set a validation range, but where is there that point that one penny more is obviously wrong, but the current value OK?

Comments

  1. illiyaz Says:

    Data Quality has been one of the oldest problems and i think if the users are well versed with the source system, they should be able to correct the data on the fly(using write back features) if it is possible.Also if a log of corrections is maintained(may be in a table), the inexperienced users can use the logtable to some extent in ascertaining the right values.

  2. Emil Says:

    Hi Mark,

    From my experience in dealing with data quality matters there is no 100% solution. But in my projects what I have done I will gladly share it with you. Extending the Ralph Kimball idea to include a control dimention , I have come to a solution to log all errors encounterred in a dimentional objects ( facts and respective dimensions) then showing those error reports to the BI administartor or to respective business users responsible to check and correct errors. The error reports are not just reports but forms like where using writeback option of BI , procedures can be run to correct the errors and even to re-run the etl again. Automation is not possible but giving a nice user interface where to correct some/most of the errors is acceptable to most end-users. This is the practical solution of OBIEE , I have implemented …of course all the analysis must be comleted and there lots of nice data cleansing methodolgies but at the end of the day , smth simple and understandable must be presented to the end-user and in most of the cases I had , this approach will be accepted…

    Best ragards,
    Emil.

    PS Hope to attend your training days once again … if closer to EEC

  3. Emil Says:

    Appologies Peter…

    Just missed to look at the author of the post..

  4. Peter Scott Says:

    @Emil – no need to apologise!

    You are right, there is no 100% solution and certainly not a one-size-fits-all one.
    The idea of write back is interesting, I can see it working for some cases. In the example I was thinking of we should really get the data fixed up in the sources – for various reasons of data governance (and third parties keeping their intellectual property) we do not have access direct (SQL) access to the application data structure structures. The other problem is that people who want the reports do not know much about the source.. I guess this is quite common where business users are interested in highly aggregated information and not raw fact

    The best we can do is report back the problems and hope they get fixed.

  5. Dave Katz Says:

    If the people who want the reports don’t know much about the source, then you have to show them the ‘garbage out’. I have found that showing the business users a few examples of reports based on dirty data tends to make the problem more concrete for them. Otherwise, discussions of data quality can seem very abstract.

  6. Tim Berry Says:

    Looking at the time they have they should instigate a central store and report MI fashion as to the spread of the content of the systems against what is valid, invalid and grey (potential for rule based cleanup).
    Usually there are placeholders in warehouses for non conformant data but these are usually filled larger than expected and the clean up investigation left until the warehouse is populated.
    I think that it is better to assess the source data prior to loading any target and that asessment needs to be planned so that volumes and actions can be considered and reassessed against one another.
    Therefore a central store that takes comparison and rule based checks is the way to go. Usually there isn’t enough time but as these systems save more time after the extent of corruption is ascertained they are in the good position to instigate the action outside of the target system.

  7. Peter Scott Says:

    @Tim – broadly the same as my suggestion to the customer.
    @Dave – I understand where you are coming from, and for some organizational cultures you can do this (it is the best way!), but in organisations that currently view data through spreadsheets backed by an extensive suite of macros to ‘fix’ the data it can be a trickier prospect to sell reports of bad data… data users are sometimes protected from problems for too long

  8. Emil Says:

    There are two things that are always contradict – from one side the customer to be happy with nice and correct reports and from the other side those correct reports to come from dirty data. The goal here is to find the intersection point. How do I do this. Some time I use the IFRS concept of materiality.. is the error material (Does the error lead to taking wrong decisions). If the answer is yes then the question is how did you corrected it up to now , the same way will be with the new system or some automation may be proposed to ease this process(e.g. OBIEE writeback option –see my previous post). If the answer is no then they can live happy with this error. But supplying erroneous reports even the customer accepted it is not good from professional point of view .Also it is a good practice to request the last IT Audit reports for the source systems and if somewhere audit controls findings and conclusions are mentioned this is a good starting point to discuss this issue. In all cases a very well prepared UAT (user acceptance criteria ) must be signed off even before project kicks off.

Write a comment





Website Design & Build: tymedia.co.uk