Capturing Change (Part 1)

Shortly after I joined Rittman Mead I wrote a small article on real-time Business Intelligence , there is also a link to it on our "Articles" tab. One of the techniques I mentioned in passing was change data capture, CDC.

Although many people believe that change capture is a technique for real-time or near real-time data acquisition it also plays a role in batch-orientated ETL processes where for whatever reason you can't directly query the source or where it is hard to identify new or changed data for loading into the data warehouse. Recently, we had to do just that; extract changes from a e-retailers system where there was no scope to modify or query the data source to generate conventional extracts.

Oracle have recently acquired GoldenGate, who had framework for CDC, I will be writing about GoldenGate later in December, but for this project we had a requirement to use asynchronous CDC with Oracle Warehouse Builder 11.1 and an Oracle 11gR1 (Exadata 1) target system. A further restriction we had was that we could not modify the change logging on the source system.

With asynchronous CDC we access data changes through subscriber views that effectively use system change numbers as a filter condition in the view definition. There are two calls to an Oracle package that are used in our ETL workflow: DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW which in essence sets the upper bound of the SCN range selector to the current SCN and DBMS_CDC_SUBSCRIBE.PURGE_WINDOW that sets the lower bound of the SCN filter to one above the current upper bound (i.e. returns no rows). The view itself contains, amongst others, an operation column to describe the type of change captured, 'I' for insert, 'D' for delete and 'UO' and 'UN' for the old and new values of updated rows; columns to identify the order that the changes occurred - such as timestamps, SCN values and update numbers and, of course, the data changes. For this data warehouse we had no interest in deletes, but needed to know about new rows (type I) and changes (UO and UN). Because of the nature of the customer's business it was very likely that many changes to a row could occur in a single CDC window - the simple option of using CDC to identify the changed rows and then fetch them from source was not available to us.

CDC can be configured to log the whole of the source row (supplementary logging), in which case we only need to look at the 'I' records and the 'UN' records and apply the whole rows in order. But as we could not change the logging to track the whole row we ended up with a source that contained a primary key and data for columns that have changed or NULLs for the case where no change has occurred, this makes things a little harder for us as we need to synthesise the whole row before processing updates. A further complication was that in some cases data could become NULL and those nulls need to processed

If we reduce the information in CDC subscriber view to operation$, the primary key of the source table (PK) and just one source column that might change we get a possibility matrix for updates:

operation$...	PK...	C1
UO		1	NULL
UN		1	'X'	-- NULL value becomes 'X'
UO		99	'Y'
UN		99	'Z'	-- 'Y' becomes 'Z'
UO		42	'A'
UN		42	NULL -- 'A' becomes NULL
UO		39	NULL
UN		39	NULL -- data value remains unchanged (not necessarily NULL)

So to process updates we need to retrieve the previous version of the row, apply changes to the updated columns and store a new (versioned) copy of the row. Where data does not change we need to "copy down" the previous value stored in our versioned data store or the next earliest version in our CDC view. We also wanted to keep this a set based operation for performance reasons

In part two I will describe the approach we adopted