A flat world
December 27th, 2007 by Peter Scott
A lot of data warehouses get some (if not all) of their data feeds through data files; in fact I can’t recall a recent project where at least one feed from an external system did not arrive as some form of flat file. I have had customers whose corporate data movement strategy is to go moving everything through a piece of heavy-weight middleware and use a sophisticated son-of-FTP technology to ensure data delivery, and even if the data existed in a database on the same network not use inter-database links; I have also had customers that try do everything as database to database movements using change data capture, remote tables, materialized views, transportable tablespaces or whatever the latest method might be; but somewhere there is always that little nugget to be loaded that exists only in an accountant’s spreadsheet or as a piece of XML from a web server.
Ignoring mock-database techniques such as reading spreadsheets using an ODBC driver and querying directly against the data the most common way of loading pure text files into Oracle is to describe the data in some form of control structure and then use SQL Loader or external tables to bring the data into a structure that can be queried. We may well need to do some work making sure that the data types are recognised correctly; this is often a case of making sure the that the file is appropriate for the NLS settings involved (or is that the NLS setting appropriate for the file) how many times do you see the 6-MAY-1955 failing to load because the database wants mai or even 05, or the wrong symbol is used for the decimal point. Some of these NLS induced difficulties may only affect a subset of the data for example some month abbreviations might be common across languages or perhaps a thousand separator is causing problems with numbers bigger than 999
But loading the data in only part of the battle, we need to check it for sense, to profile it for anomalies, and to try to impose some referential (or dimensional) integrity rules on it; if we load data relating to products and product type we need to verify that we already know about the product types (the parent keys) or come up with a mechanism that allows to load both parents and children; and what constitutes a duplicate record? Where we find problems we need to feedback to the data provider and workout how we can fix the source data if it is lacking or work with them on rules to fix-up the data once it is loaded. The key thing is to get this right before we start to propagate suspect data into a the data warehouse.
December 28th, 2007 at 1:03 pm
Like the snow.
I had a similar conversation recently with a project who were going to accept some human produced flat file input to a data load. At the time no thought had gone into whether the provided data would be valid, or how this could be checked. It’s a similar problem to that which you describe. I suggested that they clearly define the rules for valid data and get agreement of these with the customer, along with how any validation exceptions will be handled. I suggested that once defined these validations should then be automated.
I’ve found external tables and SQL validation the most efficient. Basically, two queries, one the inverse of the other. The first pulls out the good data and the second reports on the bad data. It’s amazing how much can be done in SQL with a few carefully applied analytic functions. But, you know all this already.
I’ve been quiet for a while on the web, but I’m still visiting here….and still enjoying what I find.
All the best for 2008.
Mathew Butler
December 28th, 2007 at 1:59 pm
Thanks Matthew
The snow is an wordpress.com option that expires on the 2nd Jan - so make the most of it!
Do you ever find the first time the data owner becomes aware of problems with their data is the time you share the profiling results with them
?
Of course auto correction is only an option if you are allowed to make changes to the incoming data - compliance regulators can frown on this sort of thing
December 28th, 2007 at 2:16 pm
Cool snow! It even builds up on the bottom of my screen.
Ah, flat data. My favorite is manually generated csv, which never seems to truly adhere to the csv spec (embedded commas and quote marks, anyone?).
I agree that these data sources need to be assessed before moving their data into a warehouse, master data management system, ERP system — any of those big, data-driven systems upon which the organization relies.
Key, to my mind, is a set of staging tables upon which a standard set of data profiling and data quality jobs are run. In a perfect world, we do test runs and establish the data profiling jobs early in the design phase of a project. This gives us time to find examplar errant data points, discuss/negotiate with users, postulate cleansing rules, and in some unfortunate ERP cases even change our target configuration. All well before the normal ‘load to prep for testing’ that occurs very close to go-live.
If you’re lucky, compliance regulation can be leveraged to actually help — if used to drive a data governance approach.
December 28th, 2007 at 2:37 pm
we do test runs and establish the data profiling jobs early in the design phase of a project
Couldn’t agree more, Beth. As soon as I get a feed that represents real data and not some dummy feed that matches the specification
I start to profile the data, either with proper profiling tools if the client has one or just plain old SQL - GROUP BY, COUNT, and the analytic functions can be your friends here.
December 30th, 2007 at 6:32 pm
@Peter
“Do you ever find the first time the data owner becomes aware of problems with their data is the time you share the profiling results with them ?”
Yes. I’m always surprised at how much bad data can be found, even in systems that have been functioning correctly with no reported issues.
“Of course auto correction is only an option if you are allowed to make changes to the incoming data - compliance regulators can frown on this sort of thing”
I didn’t mean to imply auto-correction. Rather that the “how to fix” is a decision point which the customer must address.
What is your prefered data validation/profiling approach?
Do you use an intermediate data staging area, as Beth describes in her comment above? Some of the features of Oracle mean that the staging can be sidestepped (pipelined functions etc). I haven’t seen this implemented though. My view is that there is value in having the staging area, even at the expense of the additional processing required to populate it. I’m curious of your thoughts and experiences.
Regards.
December 30th, 2007 at 8:26 pm
@Matthew
I didn’t mean to imply auto-correction. And nor did I mean to imply that you suggested auto correction - it was more of a passing comment [to other readers] that in these days of heightened monitoring for compliance we can’t take such tasks on without full consideration of the implications will parties concerned;
I still think that there should be a layer to validate data and manage its quality, historically this has been an integral part of the DW, but with people investing in corporate Master Data Management systems it is sensible to validate content just once (in the MDM system) and only perform referential checking in data warehouse to ensure that parent records exist for newly loaded child records and that fact matches known dimensional keys.
But in reality very few organisations have a comprehensive MDM programme in place for all DW dimensions, maybe they have tackled customer and perhaps geography but “clean” product is a bit more elusive. So for now I tend to do it all myself in the stage layer and report back exceptions to the data owner
December 31st, 2007 at 10:38 am
Thanks Peter.
January 17th, 2008 at 3:42 pm
Cool first name.
“Peter”
I have been working with data my entire career and agree about the comment where the client is typically made aware for the first time after seeing a report/profle, etc…
Now, I see companies quickly trying to establish and define data governance councils/business processes in order to monitor, report, and clean the defective data.
I am looking for authors to post articles or share related news about industry solutions, related news, or insights about enterprise data quality.
http://www.enterprisedataquality.com
Regards,
“Peter”