Thoughts on Change Data Capture

In little over a month I will be in Las Vegas speaking at Collaborate 10. There is a lot of BI / DW talks this year and for the first time with BIWA Training Days branding. Rittman Mead will be there at the conference giving talks on each of the conference days. If you are at the conference (or even just on vacation there) then come and say 'Hi' to Stewart, Venkat, Mark and myself.

My talk will be about Realtime Data Warehousing - it is an overview of reasons, techniques and pitfalls, but I do cover a lot of material in that hour. Of course, Change Data Capture (CDC) will be a major part of the talk; Oracle has so many options here including their recently acquired GoldenGate product set. As always, the slides will be here on the Rittman Mead site soon after I speak.

My colleague, Stewart Bryson has also had some recent thoughts about change data capture over on the TDWI group at LinkedIn.com (group membership needed); he was quite preceptive (and on the money, in my opinion) with his comment "I would hesitate to let technical limitations dictate user requirements. In today's BI/DW market, there are very few technical limitations that cannot be solved one way or another."

One of points I will make in my Realtime DW talk, and perhaps I need a few more slides to do it justice, is the need to profile the change you capture on the source system. Often there is a lot of "noise" that looks like change but you have no real interest in it at the data warehouse. Not all systems are "well behaved"; I have seen systems that always update a record even if nothing has changed and even systems that update each column as separate statement with its own commit.  Of course, even systems that don't have those vices can still have columns that have no DW significance being updated and see those changes being filtered out on the data warehouse after we had already done a lot of work (processing, network bandwidth and the like) to get the data there.

The more I do this kind of work I feel there is a need to switch CDC on on the live source for a while and see the typical patten of change that occurs in a day, week, period whatever and then make decisions on how to handle this defensively downstream. Do we need to exclude certain columns that are just "noise"? What will be the impact of multiple, rapidly-occurring commits on how we handle SCD-2 dimensions? Of course we can predict what will see and come up with a proposed solution but the real source often has a few surprises up its sleeve - once a customer gave me a sequence of order statuses that an order passed through in its life-cycle except that on the actual source system the order sequence was not the same as their documentation and that would impact our reporting.