Technical

A flat world

Peter Scott

Dec 27, 2007 • 2 min read

A lot of data warehouses get some (if not all) of their data feeds through data files; in fact I can't recall a recent project where at least one feed from an external system did not arrive as some form of flat file. I have had customers whose corporate data movement strategy is to go moving everything through a piece of heavy-weight middleware and use a sophisticated son-of-FTP technology to ensure data delivery, and even if the data existed in a database on the same network not use inter-database links; I have also had customers that try do everything as database to database movements using change data capture, remote tables, materialized views, transportable tablespaces or whatever the latest method might be; but somewhere there is always that little nugget to be loaded that exists only in an accountant's spreadsheet or as a piece of XML from a web server.

Ignoring mock-database techniques such as reading spreadsheets using an ODBC driver and querying directly against the data the most common way of loading pure text files into Oracle is to describe the data in some form of control structure and then use SQL Loader or external tables to bring the data into a structure that can be queried. We may well need to do some work making sure that the data types are recognised correctly; this is often a case of making sure the that the file is appropriate for the NLS settings involved (or is that the NLS setting appropriate for the file) how many times do you see the 6-MAY-1955 failing to load because the database wants mai or even 05, or the wrong symbol is used for the decimal point. Some of these NLS induced difficulties may only affect a subset of the data for example some month abbreviations might be common across languages or perhaps a thousand separator is causing problems with numbers bigger than 999

But loading the data in only part of the battle, we need to check it for sense, to profile it for anomalies, and to try to impose some referential (or dimensional) integrity rules on it; if we load data relating to products and product type we need to verify that we already know about the product types (the parent keys) or come up with a mechanism that allows to load both parents and children; and what constitutes a duplicate record? Where we find problems we need to feedback to the data provider and workout how we can fix the source data if it is lacking or work with them on rules to fix-up the data once it is loaded. The key thing is to get this right before we start to propagate suspect data into a the data warehouse.

Sign up for more like this.