November 12th, 2008 by Peter Scott
One point to keep in mind about fixing bad data at source: it is just that, fixing DATA. We are not fixing bad applications or bad processes. Often, fixing source applications is just not going to happen, especially in the case of legacy and third-party packaged applications. Likewise, process can be hard to fix; if we rely on humans to key information in and the application is incapable of enforcing data rules, then data entry problems will occur.

For example, take a university admissions system. Here data entry is highly seasonal: first comes a large number of prospective candidates (say about 10x the number who will start as students in the new year), then a reduced number of place offers, which is reduced further as candidates choose another university or fail to reach the entrance grades. This workflow often requires extra temporary staff just to key in details. They are not full-time employees, they do not know the significance of the data, and they do not know about the reporting systems that give insight into the data; they are just there to get the data in as quickly as possible and with enough accuracy to let the process as a whole work. It really does not matter to the process if the country name is not standardised, provided the funding code is correct; even the course code may not matter that much provided the faculty is correct, especially if all faculty members follow a common scheme of study in the first year.
So, fixing data at source is most unlikely to be a one-off exercise. True, a one-off data fix will improve the quality of our historic reporting, but it does nothing to prevent data errors from occurring in the future. We must build into our ETL processes methods to continually monitor source data quality, and build processes so that any errors detected can be corrected at source.
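
As a rough illustration of what such monitoring might look like, here is a minimal Python sketch of a check that could run inside an ETL load. Everything in it is an assumption made for the example: the field names (`id`, `country`, `funding_code`), the reference value sets, and the `find_quality_issues` function are hypothetical and not taken from any particular admissions system or ETL tool. The point is only that each failure is recorded against the source record's key, so it can be handed back to whoever owns the source system for correction there.

```python
from dataclasses import dataclass

# Hypothetical reference data; in practice this would come from
# a maintained lookup table rather than being hard-coded.
VALID_COUNTRIES = {"United Kingdom", "France", "Germany"}
VALID_FUNDING_CODES = {"H", "O", "S"}


@dataclass(frozen=True)
class QualityIssue:
    record_id: int   # key of the offending row in the source system
    field: str       # which column failed the check
    value: str       # the value as found (after trimming whitespace)
    reason: str      # short description for the person fixing it


def find_quality_issues(rows):
    """Return one QualityIssue per failed check, without altering the rows."""
    issues = []
    for row in rows:
        country = (row.get("country") or "").strip()
        if country not in VALID_COUNTRIES:
            issues.append(QualityIssue(
                row["id"], "country", country, "not a standard country name"))

        funding = (row.get("funding_code") or "").strip()
        if funding not in VALID_FUNDING_CODES:
            issues.append(QualityIssue(
                row["id"], "funding_code", funding, "unknown funding code"))
    return issues


if __name__ == "__main__":
    sample = [
        {"id": 1, "country": "United Kingdom", "funding_code": "H"},
        {"id": 2, "country": "UK", "funding_code": "H"},
        {"id": 3, "country": "France", "funding_code": ""},
    ]
    for issue in find_quality_issues(sample):
        print(issue)
```

The sketch deliberately reports problems rather than silently repairing them in the warehouse: the list of issues is what feeds the correction process back at source, and counting issues per load gives a simple measure of source data quality over time.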