August 24th, 2007 by Peter Scott
This year holiday activities have been low key. My elder daughter had various school examination results to receive (for those who know about the English system, GCSEs and AS levels) which basically meant that we needed to be home two Thursdays in August. Coupled with the need to work on the garden, meet the mayor (yesterday) and various other activities that I need not go into meant that we only could fit in a spur-of-the-moment trip to the North Wales coast. Despite having a customer (or two) on the fringes of North Wales, this is an area that I have not really visited much in recent years – I learnt to climb in Wales (but that was as a schoolboy). The whole trip was a pleasant surprise.
In the past I have blogged about a third party application that my team provides a little support for. Most of my blogs about this application tend to move towards ‘rant’ – the application is poorly designed (no foreign keys, few indexes, inappropriate data types) – in short the kind of app where the developers have written their own RI code in the application and ignored the capabilities of the database. In a more charitable moment I would say that the database components were poorly translated into Oracle, but for now I can’t think of a variant of SQL where they would have been well written. Recently our customer has asked that some new account codes are loaded into the application, and as the original vendor can’t help or indeed has any staff with access to the design documentation our customer has asked us to investigate reverse engineering the database to locate the table that contains this data. Hence my dip into the database and headache of trying to decipher other people’s code.
There is not much in the way of PL/SQL in this application; just two procedures, one to load data and one to extract it. I have already talked about the extraction code so this time I looked at the data load code. I am firmly convinced that if I can not see all of the logic in a single screenful of code editor then I am going to despise the code (and coder responsible) – perhaps my attention span is waning in old age. This code is massive, hundreds and hundreds of lines in one single block inserts, updates, and deletes galore over at least 8 tables. In data warehouse terms the process is a classic; load to staging tables adding a surrogate key, compare data with production then update or insert and log the results (updates, inserts, rejects, and batch duration) to a log table, but the way the developers to do it!
- the emptying of the 6 source stage tables by using DELETE (why, oh why, not TRUNCATE?)
- The insert into the log table of the run number and run start time (a novel use of SELECT 1 + COUNT(*) from logging_table as the source of the run number) – which of course means we can’t purge records from the log table without reducing the run number
- The numerous updates to the logging table (at least 10 of them) using WHERE RUN_NUMBER=(SELECT MAX(RUN_NUMBER) from LOGGING_TABLE) as a filter (haven’t these guys heard of storing the run_number in a variable… it won’t change during a run)
- using those 10 updates where one would have done
- inserting into the stage tables and then updating rows to fix up nulls. Then updating updating again to provide derived values, then updating again to build concatenated text keys. Each update having its own commit. To make matters worse they are flitting between all 6 stage tables. I would have tackled one table at a time and tried to write the code as a single insert with no updates (CASE and NVL() are your friends in ETL coding)
- We then get to the removal of duplicates – I would prefer to copy out the duplicates so that I could report them back to the source data owner. Still, at least they count the duplicates before deleting them.
- We then copy out the surrogate keys from the denormalised target table and insert them into the individual staging tables. Then check for changes and process as updates before finally inserting any new records into the denormalised table.
- And to finish we fix up any Nulls in the denormalised table (as 20 separate update statements with differing where clauses) – and because of a quirk of how we process duplicates we can get nulls in to the denormalised table. And finally record the batch run time (in minutes) using our final update of the logging table and the most odd looking time calculation I have seen in an age.
- Oh, and there is no exception handling code
Today, I aged five years wading through that tosh – Tuesday, I’ll rewrite it so that anyone can understand it, or perhaps I won’t bother…