Data quality is not a one-off

November 12th, 2008 by

In my Blog post on End-to-end data quality I mentioned the desirability of fixing bad data at source. This certainly attracted comment both here and on other blogs for example

One point to keep in mind about fixing bad data on the source; it is just that, fixing DATA. We are not fixing bad applications or bad processes. Often fixing source applications is just not going to happen, especially with the case of legacy and third-party packaged applications. Likewise process can be hard to fix; if we rely on humans to key information in and the application is incapable of enforcing data rules then data entry problems will occur. For example take a university admissions system; here data entry is highly seasonal, first off a lot of prospective candidates (say about 10x the number who will start as students in the new year) then a reduced number of place offers which get further reduced by the candidates choosing another university or not reaching the entrance grades. This work flow often requires extra temporary staff just to key in details, they are not full-time employees, they do not know the significance of the data, they do not know about the reporting systems that give insight to the data, they are just there to get the data in as quickly as possible and with enough accuracy to let the process as whole work. It really does not matter to the process if the country name is not standardised providing the funding code is correct; even the course code may not matter that much providing the faculty is correct, especially if all faculty members follow a common scheme of study in the first year.
So, fixing data at source is most unlikely to be a one off exercise. True, a one-off data fix will improve the quality of our historic reporting, but what we are not doing is preventing data errors from occurring in the future. We must build into our ETL processes methods to continually monitor source data quality and build processes so that any errors detected can be corrected at source.

Comments

  1. Gints Plivna Says:

    Hmmm, it is not clear for me, what are you suggesting? Or you are not suggesting anything just establishing facts?
    Otherwise fixing data without fixing app, adding integrity constraints, and probably redesigning data entry process is like work of Sisyphus. Already ancient Greeks knew that such work is pointless and cursed :)
    Of course it may be different for each case but I’d say for most cases at least in long term fixing just data is much more expensive than just either upgrading existing app or throwing it away and creating new keeping in mind the mistakes of the old one.
    Postponing the decision of leaving Sisyphus work behind most of the time would make things just worse. More data to migrate, more expenses for pointless fixing old data, less people knowing the old app/data etc.

  2. Peter Scott Says:

    @Gints
    In the ideal world, source applications prevent problems, processes are perfect. But whilst we wait for the replaced systems and re-trained staff we need to have defensive mechanisms in place in the ETL processes of the data warehouse to trap poor data and deal with according to the business requirement – which would either be reject it or apply some form of documented fix.

  3. Tim Berry Says:

    One of my pet subjects in the nature of my work (data migration).

    Peter I can see what you are pointing to and I agree. However it is difficult, at first, to illustrate that data corruptions will continue to occur. To this end the ETL needs to have flexibility built into its process. At first this may be as simple as error tables for detail and aggregates with some columns for the ETL to complete – from which system, at which time, during which ETL process, which column etc. Then aggregate reporting on an iterative basis can go to build decision tables akin to to type 2 dimensions whereby common mistakes can be joined and the correct value brought from the decision table.
    As time goes on the process improves but metrics need to be trapped so that the solution can be proven to be working.

    It is not an exact science but flexibility in the ETL is the key. To this end it is always best to populate a staging area prior to taking source data through to the target system. With this interuption to processing in place we should have the flexibility required for automated corrections. Otherwise place holders in a warehouse and subsequent corrections taking place prior to aggregation would be required.

    Tim Berry

  4. Greg Partenach Says:

    I fully agree with you Peter. I’m in the midst of lots of data quality issues, and the best approach always tends to be a multi-pronged attack. I’ve tried to manage what I could in the source. Tried to handle it in the ETL. Developed triage areas to handle problem data. Even developing special lookup tables that can be edited through forms that increase the chances for proper matches. There is definitely not a ‘magic bullet’ 1-shot fix all solution to this problem. And what good is all of our DW and BI work when the data that sits in front of executives is flawed? Thanks for sharing.

  5. antonio romero Says:

    Great post… Definitely food for thought. I’m thinking about these issues a lot lately.

    Oracle Warehouse Builder does a pretty good job from the ETL process onwards, and helps you with building the rules based on profiling, but doesn’t provide an optimal way to push rules back to the steps before the ETL… which means, I guess, that my job’s not done yet :)

  6. Steve Sarsfield Says:

    Great Point, Peter. When it comes down to it, there are a LOT of other places where you can impact data quality besides ETL. Also, bringing a business user into the ETL process can be difficult for the user. Other technologies (such as those listed here http://data-governance.blogspot.com/2008/09/data-intelligence-gap-part-ii.html) can have more of an impact on the process.

Write a comment





Website Design & Build: tymedia.co.uk