Data quality thoughts

My e-friend, Beth, has been busy posting from the Information on Demand conference in Las Vegas. She holds dear the belief that data is key, and quality data is the master key. Saying it like that is probably putting words into her mouth, but hopefully I am not distorting her viewpoint too much. In fact we are probably kindred spirits on this; everything I do professionally is underpinned by data quality. Anyway, Beth pondered a question about data quality and unstructured data, an intriguing idea that I thought I would mull over too.

Quality in structured data is simple (from a rules viewpoint) for me: if it is not clean I won't store it. That is, upstream of my data warehouse I have processes to clean data; it is de-duplicated, referentially checked, sense checked or whatever-else-is-appropriate checked before it is loaded, and nonconforming data is parked for investigation or rule-based fix-up. Dirty data is bad for reporting accuracy and user confidence, and it is also difficult to remove once it has been propagated through a BI system. Tools exist to profile data, and database features help to clean up some of the problems with data hierarchy mismatches; for example, Oracle dimension objects can be examined by a call to DBMS_DIMENSION.VALIDATE_DIMENSION (or, before 10g, the similar DBMS_OLAP.VALIDATE_DIMENSION) and then inspecting the exceptions table for 'bad' rows. This may not be the technique of choice for a data load process, but it is a good way to validate that already loaded data complies with your sense of dimensionality.
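To make that dimension check concrete, here is a minimal sketch of the validate-and-inspect pattern, driven from Python with cx_Oracle. The connection string, the dimension name (SALES_DIM) and the statement id are placeholders, and it assumes the DIMENSION_EXCEPTIONS table has already been created by running utldim.sql.

import cx_Oracle

# Placeholder connection details and dimension name -- substitute your own.
conn = cx_Oracle.connect("dwh_user/secret@dwhdb")
cur = conn.cursor()

# Run a full (non-incremental) validation of the dimension; rows that break
# the declared hierarchy and attribute relationships are written to the
# DIMENSION_EXCEPTIONS table (created beforehand by utldim.sql).
cur.execute("""
    BEGIN
        DBMS_DIMENSION.VALIDATE_DIMENSION(:dim, FALSE, FALSE, :stmt);
    END;""",
    dim="SALES_DIM", stmt="SALES_DIM_CHECK_01")

# Pull back the offending rows for investigation or rule-based fix-up.
cur.execute(
    "SELECT * FROM dimension_exceptions WHERE statement_id = :stmt",
    stmt="SALES_DIM_CHECK_01")
for row in cur:
    print(row)

cur.close()
conn.close()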

But when we move to unstructured data we come across a fundamental problem... what is quality? By definition there are no clearly defined dimension hierarchies; there probably aren't even tags to identify whether data is 'reference' (unstructured reference data sounds a bizarre concept!) or 'fact', let alone customer or product related. We are probably not concerned with duplicates, as each unstructured item should be considered unique, but do we need to consider near duplicates to be versions (revisions) of the same item or new items? In practice the only sensible thing to do today is to store the data 'as is', maybe correcting spelling to some standard form (though that could well contravene governance rules), and rely on smart contextual indexing techniques to identify the data items that possess a given attribute.
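As a rough illustration of the near-duplicate question, the sketch below treats an incoming item as a revision if it is sufficiently similar to something already stored, and as a new item otherwise. The 0.9 similarity threshold and the sample text are invented for the example; this is one possible policy, not a recommendation.

import difflib

# Arbitrary similarity threshold above which an incoming item is treated as
# a revision of an existing item rather than a brand new item.
REVISION_THRESHOLD = 0.90

def classify_item(new_text, existing_items):
    """Return ('revision', key) if new_text is a near duplicate of a stored
    item, otherwise ('new', None). existing_items maps item keys to text."""
    best_key, best_ratio = None, 0.0
    for key, text in existing_items.items():
        ratio = difflib.SequenceMatcher(None, new_text, text).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = key, ratio
    if best_ratio >= REVISION_THRESHOLD:
        return "revision", best_key
    return "new", None

store = {"doc-1": "The quarterly sales figures show strong growth in EMEA."}
print(classify_item("The quarterly sales figures show strong growth in EMEA and APAC.", store))
print(classify_item("Minutes of the data governance board meeting.", store))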

Establishing context is hard

A long while back I worked on a project for a publisher of online classified ads that took copy submitted electronically and assessed it for compliance with standards of legality and decency. Apart from a complex rule base that required certain professions to supply validation of their right to practice, the bulk of the system was about scanning for rude words, their phonetic equivalents, and the IM and web variants where numb3rs appear in place of letters and vowels are dropped. But this approach really only worked with words in isolation; moving to spans of text and scanning for expressions becomes harder - just think about electronically trying to establish meaning from poetry, where metaphors and non-standard word meanings abound. And it gets harder still when the feed is not keyed text but machine-recognised speech or scanned handwriting.
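A toy sketch of that word-in-isolation style of check is below. The substitution map and the tiny blocklist are purely illustrative; the real system's rule base and word lists were of course far richer and also covered phonetic equivalents.

import re

# Illustrative substitution map for common digit-for-letter swaps.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

# Toy blocklist -- a real system would use a large, curated list plus
# phonetic keys (e.g. Soundex or Metaphone) rather than two sample words.
BLOCKLIST = {"scam", "fraud"}

def normalise(word):
    """Lower-case, undo digit-for-letter swaps, collapse repeated letters."""
    word = word.lower().translate(LEET_MAP)
    return re.sub(r"(.)\1+", r"\1", word)          # "scaam" -> "scam"

def flag_words(copy_text):
    """Return the words in a piece of ad copy that match the blocklist,
    checking each word in isolation (the approach described above)."""
    vowelless_blocklist = {re.sub(r"[aeiou]", "", w) for w in BLOCKLIST}
    hits = []
    for word in re.findall(r"[\w@$]+", copy_text):
        candidate = normalise(word)
        vowelless = re.sub(r"[aeiou]", "", candidate)   # catch dropped vowels
        if candidate in BLOCKLIST or vowelless in vowelless_blocklist:
            hits.append(word)
    return hits

print(flag_words("Totally legit offer, no scm, definitely not a fr4ud"))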