Capturing Change (Part 2)

In the previous part I outlined the business need for writing our own CDC routines and started to describe some of the issues we would need to resolve. Here I shall outline the approach we took and describe how to build the SQL needed to synthesise the rows that need to be added to our data warehouse.
First, a recap of the types of records we will need to process:

operation$	PK	...	C1
I		87		'J' -- a newly inserted row
UO		1		NULL
UN		1		'X' -- NULL value becomes 'X'
UO		99		'Y'
UN		99		'Z' -- 'Y' becomes 'Z'
UO		42		'A'
UN		42		NULL -- 'A' becomes NULL
UO		39		NULL
UN		39		NULL -- data value remains unchanged (not necessarily NULL)

As I mentioned last time, we have no real interest in delete records (in fact my source system does not generate deletes); we are using CDC to populate a data warehouse, and the fact that a record once existed remains of interest to us. As I also mentioned, we are not using supplemental logging, so we can't just apply the I and UN records in chronological order.

A point I did not emphasise last time is that change data capture does just that: it captures changes, in fact every committed change. This means that we may see changes that are of no interest to our data warehouse. It also means we may see 'non-changes', where a record is "updated" and committed without any data values actually changing. I would suggest that the best way to filter out these records (if filtering is appropriate for your business) is as part of the downstream ETL and not as part of the change capture process.

So how do we go about building a change capture view? The first thing to realise is that we can do this with analytic functions; the second is that we will need a lot of them (several for each column being processed), and this can have a major negative impact on query performance as we carry out many window sort operations on the same dataset.

Let's start with the update row type. For an update, a row must already exist, so any change should be built on a prior version of the row. If we consider the first update for a row in the subscriber view, it must be either for a newly inserted row or for one we already hold in the target table. Subsequent updates in the same CDC subscriber view need to be applied in the order they occurred. As CDC identifies changes by the row key, an SCN and an RSID value, we already have the bones of ordering the changes to apply; we only need to add a way of finding the original value in the target table if it exists, and again analytics come to our rescue.

Whenever I write a query using analytics I try to work out how to partition and order the data to achieve my goal. With the updates I need to process pairs of UO and UN records for the same key value and change number, ordering them so that the UN record comes last. We then need to look for changes between the UO and UN records. To my thinking this is a simple LEAD() or LAG() to bring the before and after versions onto the same row, and a CASE expression to determine whether the column has become NULL in the captured change.

select * from (
SELECT OPERATION$,
  CSCN$,
  COMMIT_TIMESTAMP$,
  RSID$,
  ORDER_ID,

/* Now for the changing columns */
  ORDER_STATUS,
  CASE
    WHEN lag(ORDER_STATUS) over (partition BY ORDER_ID, rsid$ order by OPERATION$ DESC ) IS NULL
    AND ORDER_STATUS                                                            IS NULL
    THEN NULL -- No change to value
    ELSE
      CASE
        WHEN ORDER_STATUS IS NULL
        THEN 2  -- 2 = change from NOT NULL to NULL
        ELSE 1  -- 1 = change from NULL to NOT NULL
      END
  END c_ORDER_STATUS
/* repeat similar logic for each of the other columns of interest. */

  FROM CDC_ORDERS
  WHERE operation$ <> 'I' -- only looking at UO and UN operations
  )
WHERE operation$ = 'UN' -- we only need to process the final state of each change
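If you want to experiment with this UO/UN pairing logic outside Oracle, the same LAG() and CASE approach can be sketched with SQLite's window functions. The table, column names and data below are simplified, hypothetical stand-ins for the real CDC subscriber view (it needs SQLite 3.25+ for window-function support):

```python
import sqlite3

# Toy stand-in for the CDC subscriber view; names are simplified
# versions of the real OPERATION$/RSID$/ORDER_ID/ORDER_STATUS columns.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE cdc_orders
               (operation TEXT, rsid INTEGER, order_id INTEGER,
                order_status TEXT)""")
con.executemany("INSERT INTO cdc_orders VALUES (?,?,?,?)", [
    ("UO", 10,  1, None), ("UN", 10,  1, "X"),   # NULL becomes 'X'
    ("UO", 11, 42, "A"),  ("UN", 11, 42, None),  # 'A' becomes NULL
])

rows = con.execute("""
SELECT operation, order_id, order_status,
       CASE
         WHEN LAG(order_status) OVER w IS NULL
          AND order_status IS NULL
         THEN NULL                          -- no change to value
         ELSE CASE WHEN order_status IS NULL
                   THEN 2                   -- NOT NULL became NULL
                   ELSE 1                   -- a NOT NULL value is present
              END
       END AS c_order_status
FROM cdc_orders
WHERE operation <> 'I'
WINDOW w AS (PARTITION BY order_id, rsid ORDER BY operation DESC)
""").fetchall()

# As in the outer query above, keep only the final (UN) state of each change
final = [r for r in rows if r[0] == "UN"]
```

Ordering by `operation DESC` puts UO before UN within each pair, so the LAG() on the UN row sees the before-image, exactly as in the Oracle version.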

We now need to union this set of rows with the most recent stored version of the row; this is either the version in our data store with the most recent timestamp or the insert record in our CDC view. Picking the most recent value from the data store can be achieved by using the ROW_NUMBER function:

	SELECT OPERATION$, CSCN$, COMMIT_TIMESTAMP$, RSID$, ORDER_ID, ORDER_STATUS, c_ORDER_STATUS ... from (
	SELECT 'X' OPERATION$, -- set a constant
	  -1 CSCN$, -- set to a constant so we can simply filter it out later -- we don't want to re-insert this record!
	  COMMIT_TIMESTAMP$,
	  RSID$,
	  ORDER_ID,

/* For each column of interest */
	  ORDER_STATUS,
	  1 c_ORDER_STATUS, -- set a constant
/* Repeat the above block */

	row_number() over (partition by order_id order by COMMIT_TIMESTAMP$ DESC, RSID$ DESC) RN  -- most recent version will be 1
	from  ODS_ORDERS
	where ORDER_ID in (select order_id from CDC_ORDERS) -- we only want order_id values that are in the CDC view
	) where RN = 1 -- only select the most recent
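The same most-recent-row trick can be tried in SQLite with a hypothetical miniature of the data store table (3.25+ for ROW_NUMBER()):

```python
import sqlite3

# Hypothetical miniature of ODS_ORDERS with two stored versions of one row
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ods_orders
               (order_id INTEGER, commit_ts TEXT, rsid INTEGER,
                order_status TEXT)""")
con.executemany("INSERT INTO ods_orders VALUES (?,?,?,?)", [
    (1, "2010-06-01 10:00:00", 5, "OLD"),
    (1, "2010-06-02 09:30:00", 7, "NEW"),   # the most recent stored version
])

latest = con.execute("""
SELECT order_id, order_status FROM (
  SELECT order_id, order_status,
         ROW_NUMBER() OVER (PARTITION BY order_id
                            ORDER BY commit_ts DESC, rsid DESC) AS rn
  FROM ods_orders
) WHERE rn = 1
""").fetchall()
# only the 'NEW' row survives for order_id 1
```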

or by taking the 'I' record in our CDC subscriber view window:

	SELECT OPERATION$,
	  CSCN$,
	  COMMIT_TIMESTAMP$,
	  RSID$,
	  ORDER_ID,

	/* Now for the changing columns */
	  ORDER_STATUS,
	  1 c_ORDER_STATUS -- set a constant
	/* repeat similar logic for each of the other columns of interest. */

	  FROM CDC_ORDERS
	  WHERE operation$ = 'I' -- only looking at I operations

As the previous value for a given PK will be in one of two places (but not both), a simple UNION ALL suffices to provide all of the rows we need to build the history of changes.
The next stage of the processing is to take the three UNION ALL sources and then use analytics to "copy down" previous values to fill in the blanks. Here I use the LAST_VALUE function to look back over an ordered window.

SELECT
	OPERATION$,
    CSCN$,
    COMMIT_TIMESTAMP$,
    RSID$,
    ORDER_ID,
    CASE
      WHEN c_ORDER_STATUS IS NOT NULL  -- there is a new value of ORDER_STATUS
      THEN ORDER_STATUS
      ELSE -- look at the last change for this column
        CASE LAST_VALUE(c_ORDER_STATUS ignore nulls) over (partition BY ORDER_ID order by CSCN$, RSID$)
          WHEN 1 -- changed to a non null at the last change
          THEN LAST_VALUE(ORDER_STATUS ignore nulls) over (partition BY ORDER_ID order by CSCN$, RSID$)
          WHEN 2 -- became NULL at the last change
          THEN NULL
/*
we could use LAST_VALUE(ORDER_STATUS ) over (partition BY ORDER_ID order by CSCN$, RSID$) but this would add a sort to query plan
*/
        END
    END ORDER_STATUS,
/* similar code for the remaining columns, all selected from my UNION ALL view of ODS_ORDERS, CDC_ORDERS where operation$ = 'I', and CDC_ORDERS where operation$ <> 'I' */
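SQLite has no IGNORE NULLS, so to test the "copy down" behaviour outside Oracle the same logic can be written as a plain Python walk over one key's ordered changes. The function name and tuple layout are my own invention; the flag semantics (NULL = no change, 1 = a NOT NULL value is present, 2 = became NULL) follow the CASE expression above:

```python
def fill_order_status(changes):
    """changes: list of (order_status, c_order_status) tuples for ONE key,
    already sorted by (CSCN$, RSID$). Returns the filled-in status values."""
    filled = []
    last_flag = None    # mimics LAST_VALUE(c_order_status IGNORE NULLS)
    last_status = None  # mimics LAST_VALUE(order_status IGNORE NULLS)
    for status, flag in changes:
        if flag is not None:
            last_flag = flag
        if status is not None:
            last_status = status
        if flag is not None:
            filled.append(status)        # a new value was captured here
        elif last_flag == 1:
            filled.append(last_status)   # repeat the last NOT NULL value
        else:
            filled.append(None)          # became NULL at the last change
    return filled
```

For example, a value of 'X', a non-change, a change to NULL and another non-change fill in as 'X', 'X', NULL, NULL.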

The final things to deal with are: first, the COMMIT_TIMESTAMP$ column is a DATE, so we may get multiple rows for a given key and date if multiple commits occur in the same second. As far as I am concerned here, multiple commits in the same second are (in spirit) the same change, so we can take the last row in any given second; again we use the ROW_NUMBER function for this:

ROW_NUMBER() OVER (PARTITION BY ORDER_ID, COMMIT_TIMESTAMP$ ORDER BY RSID$ DESC) RN

and second, not reinserting the "seed" rows we took from ODS_ORDERS ... but as we set the CSCN$ to -1 we just filter on CSCN$ values greater than zero in our insert.
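The same-second deduplication can also be tried in SQLite with toy data (table and column names are simplified stand-ins):

```python
import sqlite3

# Three changes for one key, two of them committed in the same second
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE changes
               (order_id INTEGER, commit_ts TEXT, rsid INTEGER,
                order_status TEXT)""")
con.executemany("INSERT INTO changes VALUES (?,?,?,?)", [
    (1, "2010-01-01 10:00:00", 5, "A"),
    (1, "2010-01-01 10:00:00", 8, "B"),   # same second: keep this one
    (1, "2010-01-01 10:00:01", 9, "C"),
])

kept = con.execute("""
SELECT order_id, commit_ts, order_status FROM (
  SELECT order_id, commit_ts, order_status,
         ROW_NUMBER() OVER (PARTITION BY order_id, commit_ts
                            ORDER BY rsid DESC) AS rn
  FROM changes
) WHERE rn = 1
ORDER BY commit_ts
""").fetchall()
# the highest RSID in each second wins, so 'A' is dropped
```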

That's basically it - a huge view with many, many analytics - but it performs quite well provided you are not processing too large a window.