GoldenGate and Oracle Data Integrator – A Perfect Match… Part 3: Real-time ETL Challenges

July 24th, 2013 by Michael Rainey

I’ve finally had the chance to sit down and properly close out this series on GoldenGate and Oracle Data Integrator. Since the last post, I’ve presented this topic on three different occasions: once at each of the Rittman Mead BI Forums, in Brighton and Atlanta, as part of the ODI Masterclass, and once at KScope13 in New Orleans. Hopefully you were able to attend one of these sessions and learn a few new things about data integration.

In the previous post, “GoldenGate and Oracle Data Integrator – A Perfect Match… Part 2: Replicate to Staging and Foundation”, I described the setup and configuration of GoldenGate via ODI 11g at a high level. Now we have the GoldenGate parameter files configured, replication active from the source to both the Staging and Foundation layer schemas, and the ODI Change Data Capture (CDC) framework in place. In this, the final post in the series, I will switch the focus to ODI CDC, walking through several options for using change data in downstream ETL processes.

[Figure: ODI journalizing]

Real-Time ETL Using Oracle Data Integrator CDC

ETL mappings, built as Interfaces in Oracle Data Integrator 11g, are developed to populate the Access and Performance layer. Remember from previous posts that this layer is where the traditional star schemas are built and accessed by reporting and analytical tools, such as OBIEE. The Interfaces will use the ODI Change Data Capture (CDC) framework to flow data captured in the change tables through to the target. To recap, ODI CDC is set up to identify, capture, and deliver changes made to data in the source database. When Journalizing is started on a Model, not only are the GoldenGate parameter files created, but the ODI CDC framework is also generated. The CDC framework includes the following:

  • Journals – tables (prefixed with J$) that hold references to the change records and the change type (insert/update/delete)
  • Journalizing views – (prefixed with JV$, JV$D) provide access to the change data by joining the journal table to the fully replicated table, and are used by IKMs and LKMs to access the change rows
  • Capture processes – captures changed data from source datastores (in this example, Oracle GoldenGate)
  • Subscribers – entities that consume the changed data as a consistent set
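To make these pieces concrete, here is a minimal sketch of the framework objects, using SQLite through Python as a stand-in for Oracle. The table and column names (EMPLOYEE, J$EMPLOYEE, WINDOW_ID, JRN_FLAG) are hypothetical simplifications and not ODI’s actual generated DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Fully replicated staging table, kept in sync by GoldenGate
cur.execute("CREATE TABLE EMPLOYEE (EMP_ID INTEGER PRIMARY KEY, NAME TEXT, DEPT_ID INTEGER)")

# J$ journal table: a reference to each changed row plus the change type
cur.execute("""CREATE TABLE J$EMPLOYEE (
    WINDOW_ID INTEGER,   -- equivalent to the source SCN
    JRN_FLAG  TEXT,      -- 'I' for insert/update, 'D' for delete
    EMP_ID    INTEGER)""")

# JV$-style journalizing view: journal references joined back to the
# replicated table so consumers see the full change row
cur.execute("""CREATE VIEW JV$EMPLOYEE AS
    SELECT j.WINDOW_ID, j.JRN_FLAG, e.EMP_ID, e.NAME, e.DEPT_ID
    FROM J$EMPLOYEE j
    JOIN EMPLOYEE e ON e.EMP_ID = j.EMP_ID""")

# Simulate a replicated insert: the row lands in EMPLOYEE and a
# reference lands in J$EMPLOYEE
cur.execute("INSERT INTO EMPLOYEE VALUES (1, 'Jones', 10)")
cur.execute("INSERT INTO J$EMPLOYEE VALUES (1001, 'I', 1)")

change_rows = cur.execute("SELECT JRN_FLAG, NAME FROM JV$EMPLOYEE").fetchall()
print(change_rows)
```

The key idea is that the journal only stores row references and flags; the journalizing view supplies the full record by joining back to the replicated table.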

When the change data is consumed for a specific set of tables, the “window” is extended for that change set. This essentially freezes the set of change data by only allowing transactions between the minimum and maximum WINDOW_ID (equivalent to the SCN) to be consumed, ensuring that no transactions are missed and allowing for consistency across the dataset. Once the data has been processed, the set of change data for that given window is purged from the change tables. More detailed information on ODI Change Data Capture can be found in the Oracle Data Integrator documentation.
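The extend/consume/purge cycle can be sketched the same way; again the journal layout is a hypothetical simplification and SQLite stands in for Oracle. The point is that freezing the min/max WINDOW_ID protects the run from changes that arrive mid-load:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE J$EMPLOYEE (WINDOW_ID INTEGER, JRN_FLAG TEXT, EMP_ID INTEGER)")
cur.executemany("INSERT INTO J$EMPLOYEE VALUES (?,?,?)",
                [(1001, 'I', 1), (1002, 'I', 2), (1005, 'I', 3)])

# "Extend window": lock in the min/max WINDOW_ID currently in the journal
lo, hi = cur.execute("SELECT MIN(WINDOW_ID), MAX(WINDOW_ID) FROM J$EMPLOYEE").fetchone()

# A change arriving after the window is extended is not part of this run
cur.execute("INSERT INTO J$EMPLOYEE VALUES (1008, 'I', 4)")

# Consume only the frozen window
changes = cur.execute(
    "SELECT EMP_ID FROM J$EMPLOYEE WHERE WINDOW_ID BETWEEN ? AND ?", (lo, hi)).fetchall()

# Purge the consumed window; the late-arriving row survives for the next run
cur.execute("DELETE FROM J$EMPLOYEE WHERE WINDOW_ID BETWEEN ? AND ?", (lo, hi))
remaining = cur.execute("SELECT WINDOW_ID FROM J$EMPLOYEE").fetchall()
print(changes, remaining)
```

Row 1008 arrives after the window was extended, so it is excluded from this run and survives the purge for the next one.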

Using Journalized Data

Once journalizing is set up and changes are being captured, building an Interface for real-time data warehousing is quite simple. Well, sort of simple. Add the journalized Datastore to the Interface as a source and check the “Journalized data only” checkbox. This changes the source of the Interface in the generated code to the JV$ change view rather than the actual table. The JV$ view contains only the change rows available within the window after the “extend window” process is called. The Interface is executed and the journal is purged, setting up the process for the next run.

[Figure: Journalizing interface]

Now, I did say this process is sort of simple. There is a restriction that only one journalized Datastore can be used per Interface. So if both the EMPLOYEE and DEPARTMENT tables are journalized, only one may have the “Journalized data only” checkbox selected when used as a source in the Interface. Fortunately, there are several ways to work around this limitation.

Parent-Child Relationship

Often a join between two source tables is due to a parent-child relationship: Department->Employee, Order->Order Line, etc. In this situation, and with any join really, we want to capture the change data from each individual table and only lookup additional data from the other table via the join when necessary to complete the full record. Unfortunately, this setup is not possible in a single Interface as was just described. So how do we work around it?

We must create two Interfaces, both with exactly the same logic, source Datastores, and target Datastore. The only difference is which source Datastore uses the “Journalized data only” option. This approach ensures that no transaction is left behind and that all changes flow through to the target with the appropriate lookup data filled in for each row.
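As a sketch of why the mirrored pair works, here are the two driving queries side by side, using a hypothetical EMPLOYEE/DEPARTMENT schema with the J$ tables reduced to bare row references (SQLite via Python stands in for Oracle):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE DEPARTMENT (DEPT_ID INTEGER PRIMARY KEY, DEPT_NAME TEXT)")
cur.execute("CREATE TABLE EMPLOYEE (EMP_ID INTEGER PRIMARY KEY, NAME TEXT, DEPT_ID INTEGER)")
cur.execute("CREATE TABLE J$EMPLOYEE (JRN_FLAG TEXT, EMP_ID INTEGER)")
cur.execute("CREATE TABLE J$DEPARTMENT (JRN_FLAG TEXT, DEPT_ID INTEGER)")

cur.executemany("INSERT INTO DEPARTMENT VALUES (?,?)", [(10, 'Sales'), (20, 'IT')])
cur.executemany("INSERT INTO EMPLOYEE VALUES (?,?,?)", [(1, 'Jones', 10), (2, 'Smith', 20)])
# One change captured on each side of the parent-child relationship
cur.execute("INSERT INTO J$EMPLOYEE VALUES ('I', 1)")     # changed employee
cur.execute("INSERT INTO J$DEPARTMENT VALUES ('I', 20)")  # changed department

# Interface 1: journalized EMPLOYEE drives, DEPARTMENT is a plain lookup
q1 = """SELECT e.NAME, d.DEPT_NAME
        FROM J$EMPLOYEE j JOIN EMPLOYEE e ON e.EMP_ID = j.EMP_ID
        JOIN DEPARTMENT d ON d.DEPT_ID = e.DEPT_ID"""

# Interface 2: identical logic, but journalized DEPARTMENT drives
q2 = """SELECT e.NAME, d.DEPT_NAME
        FROM J$DEPARTMENT j JOIN DEPARTMENT d ON d.DEPT_ID = j.DEPT_ID
        JOIN EMPLOYEE e ON e.DEPT_ID = d.DEPT_ID"""

rows = sorted(cur.execute(q1).fetchall() + cur.execute(q2).fetchall())
print(rows)
```

Each query picks up the change captured on its own side, so between the two Interfaces every transaction reaches the target with its lookup data filled in.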

[Figure: Parent-child interfaces]

But this might not be the best approach if there are three or more Datastores to join. Imagine if there were 12 source Datastores and some logic in the where clause changed, forcing an update to the ETL code. That makes 12 Interfaces to modify for a single change!

ODI Cookbook Example

There is another workaround, outlined in the new book “Oracle Data Integrator 11g Cookbook”, that takes a similar approach but with potentially fewer maintenance headaches. This method separates the final Interface, and all of its logic, from the activity of ensuring that each table has its changes processed along with the additional lookup data from the other supporting tables. I won’t go into details here (the book is well worth the purchase), but I can say this approach does look promising.

Subscription Views

A final option for moving data from the change tables through to the target facts and dimensions as quickly as possible is what we call subscription views. The goal here is always to return a consistent set of data and to let the ETL developer choose how this is accomplished. In this solution, the “Journalized data only” checkbox is never checked for any of the source Datastores in the Interface. Instead, we create a view for each of the Staging schema tables that includes a join to the ODI J$ change table.

[Figure: Subscription view]

The ETL developer can then reverse engineer the views into an ODI Model and use them as the source Datastores in the transformation interfaces. There are two flags that are added to the view to allow the developer a choice of data source.

  • STAGE_IND – Indicates the rows that are ready to be consumed for the first time. In other words, these rows have not yet been “seen” by the process.
  • CURRENT_IND – The most recent version of that row by natural key, which, timing aside, matches the record in the source.

Using a combination of these indicators, the developer can choose to return only change rows (STAGE_IND = ‘Y’) or return all current rows (CURRENT_IND = ‘Y’). This works well for incremental processing of data, but let’s go one step further. What if there is a need to reload the fact and dimension tables using the historical data in the Foundation layer schema? A view can be created which combines all three sets of data: Change rows (J$ table), Current replicated rows (Staging table), and Full transactional history (Foundation table).
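A simplified sketch of the two-indicator subscription view follows; the names and the CASE logic are illustrative only (SQLite via Python stands in for Oracle), and a production view would also need to handle rows deleted from the staging table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Staging table replicated by GoldenGate, plus its J$ journal
cur.execute("CREATE TABLE STG_EMPLOYEE (EMP_ID INTEGER PRIMARY KEY, NAME TEXT)")
cur.execute("CREATE TABLE J$EMPLOYEE (WINDOW_ID INTEGER, JRN_FLAG TEXT, EMP_ID INTEGER)")
cur.executemany("INSERT INTO STG_EMPLOYEE VALUES (?,?)", [(1, 'Jones'), (2, 'Smith')])
cur.execute("INSERT INTO J$EMPLOYEE VALUES (1001, 'I', 2)")  # only Smith changed

# Subscription view: every current staging row, flagged with whether an
# unconsumed change exists for it in the journal
cur.execute("""CREATE VIEW V_EMPLOYEE AS
    SELECT s.EMP_ID, s.NAME,
           CASE WHEN j.EMP_ID IS NOT NULL THEN 'Y' ELSE 'N' END AS STAGE_IND,
           'Y' AS CURRENT_IND   -- staging always holds the latest version
    FROM STG_EMPLOYEE s
    LEFT JOIN J$EMPLOYEE j ON j.EMP_ID = s.EMP_ID""")

# The developer's choice: change rows only, or all current rows
incremental = cur.execute("SELECT NAME FROM V_EMPLOYEE WHERE STAGE_IND = 'Y'").fetchall()
full = sorted(cur.execute("SELECT NAME FROM V_EMPLOYEE WHERE CURRENT_IND = 'Y'").fetchall())
print(incremental, full)
```

Filtering on STAGE_IND returns only the unconsumed change rows, while CURRENT_IND returns the full current dataset, all from the same view.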

[Figure: Foundation subscription view]

In this case, we need to perform a bit of SQL analytics using the SCN and natural key to determine the CURRENT_IND from the set of historical data in the Foundation layer, but beyond that the view is effectively the same. Now we can use a combination of the indicators to return various datasets, such as all historical rows, current rows, or just the latest change rows.
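A sketch of that analytic step, using ROW_NUMBER() over the natural key ordered by SCN against a hypothetical, simplified foundation table (SQLite via Python standing in for Oracle):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Foundation table: full transactional history, one row per change with its SCN
cur.execute("CREATE TABLE FND_EMPLOYEE (SCN INTEGER, EMP_ID INTEGER, NAME TEXT)")
cur.executemany("INSERT INTO FND_EMPLOYEE VALUES (?,?,?)",
                [(100, 1, 'Jones'),        # original row for employee 1
                 (200, 1, 'Jones-Smith'),  # later update to employee 1
                 (150, 2, 'Brown')])       # only version of employee 2

# CURRENT_IND derived analytically: the latest SCN per natural key wins
cur.execute("""CREATE VIEW V_FND_EMPLOYEE AS
    SELECT SCN, EMP_ID, NAME,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY EMP_ID ORDER BY SCN DESC) = 1
                THEN 'Y' ELSE 'N' END AS CURRENT_IND
    FROM FND_EMPLOYEE""")

# All historical rows remain available...
history_count = cur.execute("SELECT COUNT(*) FROM V_FND_EMPLOYEE").fetchone()[0]
# ...while CURRENT_IND isolates the latest version of each natural key
current = cur.execute(
    "SELECT NAME FROM V_FND_EMPLOYEE WHERE CURRENT_IND = 'Y' ORDER BY EMP_ID").fetchall()
print(history_count, current)
```

The full history stays queryable, and the same CURRENT_IND filter used against staging now works against the foundation history as well.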

GoldenGate and Oracle Data Integrator – A Perfect Match…

1. Introduction
2. Replicate to Staging and Foundation
3. Real-time ETL Challenges

If you’re interested in learning more about GoldenGate and ODI, please drop us a line at info@rittmanmead.com. We offer our Oracle Data Integrator 11g “Bootcamp” course as a publicly available class, coming up in Atlanta (Oct 28 – Nov 1) and Brighton (Aug 12 – 16 & Oct 21 – 25), or we can always schedule a visit to your company site.


Comments

  1. Nicholas Hurt Says:

    Succinctly put and an interesting read. Thanks Michael!

  2. David Mann Says:

    Great stuff, making my way from homegrown DW solutions it is nice to be aware of the Reference Architecture and how it fits into the bigger picture.

  3. OracleSolutions Says:

    Thanks Michael for such an interesting blog.. it has increased the level of our knowledge..
    keep it up

  4. Michael Rainey Says:

    I’m glad you have all enjoyed the read!

    Regards,
    Michael

  5. Brian Says:

    A very insightful series of articles – thank you!

    I do have a question about the Subscription Views:

    Shouldn’t the join between the two in-line views be connected using a FULL OUTER JOIN? If I understand correctly, the table in EDW_STG is an exact replicate of the source table. So, in your example, if you delete a player from a team, that row is physically deleted from EDW_STG.OFFENSE_PLAYMAKERS, and is added to EDW_STG.J$OFFENSE_PLAYMAKERS with a JNL_FLAG = ‘D’. If EDW_STG.OFFENSE_PLAYMAKERS drives the join, then the row in EDW_STG.J$OFFENSE_PLAYMAKERS is excluded from the join, right? You would need a FULL OUTER JOIN to capture the staged “delete” transactions.

    Brian

  6. Michael Rainey Says:

    @Brian – you bring up a good point about the full outer join when using the fully replicated table in the subscription views. When using the foundation table, these rows exist in both places as all transactional history is loaded into foundation, so the full outer join would not be necessary.

    Thank you for catching that! I’m glad you enjoyed the read.

    Regards,
    Michael

  7. Tanveer Says:

    Hi Michael,
    Great Article. I have a question regarding ODI as I am new in ODI world.

    Can we just use ODI solely as a replication tool to synch CDC data from Source to staging. And then again use ODI to use the staging data to load into Data warehouse? Is there any example like this?

    Regards,
    Tanveer.

  8. Michael Rainey Says:

    @Tanveer – you can use ODI for the entire process, yes. If you don’t have an external replication or CDC tool, such as Oracle GoldenGate or Oracle Streams, ODI can implement change data capture by adding triggers to the source tables. These triggers would capture any change to the source table and load this change data into the J$ table on the source. ODI Interfaces would then need to be built to gather the source change data and load into Staging / Foundation in the data warehouse.

    Using GoldenGate, you already have the data replicated from source to target, so you can eliminate the need for those additional Interfaces – as well as limit the impact on the source tables.

    Regarding examples – I would start with the ODI documentation “Working with Change Data Capture”. Glad you enjoyed the post!

    Regards,
    Michael

  9. Tanveer Says:

    Hi Michael,
    really appreciate your time for responding to my question. You mentioned adding triggers to the source tables but DBA may not allow that. Is there any other option to do CDC using ODI e.g. using database logs? Is ODI widely being used for large volume CDC?

    Thanks for your help again.

    Regards,
    Tanveer.

  10. Michael Rainey Says:

    @Tanveer – Take a look at the ODI documentation I mentioned in the previous comment. The other option for reading change data from DB logs, besides GoldenGate, is Oracle Streams – but beware, this technology has been deprecated.

    Regards,
    Michael

  11. Krishna Says:

    Hi Michael,

    Thanks for the post.

    I am new to ODI and I have a different question.

    1) What makes it call ODI as Integration tool? If it can integrate data from multiple source and load to any source you need, can this not be done by any other ETL tool?

    2) What is the clear distinction between ODI vs Informatica in terms of data integration?

    Thanks in advance.
