The One Mapping Paradigm

April 1st, 2011 by

Here at Rittman Mead we have been working on some new methodology and design patterns for ETL. We have long realised that the bottleneck in Business Intelligence and Data Warehousing projects is ETL, so we have been prototyping new techniques for approaching this and trialling them at client sites.

Taking a step back and looking at the ETL process we felt there was a lot of complexity unnecessarily created by decomposing the process into a number of program units or mappings. In our view this process creates the following problems:

  • A large amount of processing time is wasted on communication between these mappings.
  • Unnecessary temporary storage objects are created and populated in the database.
  • A separate technology is required to orchestrate all the mappings.
  • It encourages multiple developers to work on the ETL process, thereby increasing the risk of miscommunication and misaligned interfaces.

In response to this, Rittman Mead have developed the One Mapping Paradigm (OMP). We believe that you should put all your ETL code into one mapping: the goal of this approach is to encapsulate your entire ETL routine in a single mapping or program unit.

We feel this approach adheres to some of the fundamental tenets of software development: encapsulation (everything is in the one mapping) and decoupling (there are no external dependencies). Further, it completely negates the need for that old bugbear, re-usability: you now don’t even need to re-use code, just use it once, all in the same mapping. Most importantly, OMP also provides a reduction in development costs: you now only need one developer.

Our extensive research has also produced a series of steps you can follow to deliver your One Mapping. You should note that the One Mapping that OMP generates will be extremely complex; only by following these steps can you address that complexity.

OMP follows a black hole development approach, where it is crucial for the developer to do as much development as possible without any outside interference from either peers or the business. This allows the developer to focus solely on the development task in hand, which is a must when developing extremely complex code. It is also essential that the developer is allowed to proceed as far through the process as possible without stopping for other distracting activities like testing. To illustrate the OMP, I have built the following example using Oracle Warehouse Builder.

  • Step 1: source objects – create a new mapping and drag all your source objects onto the canvas – it is important to arrange these in a straight line on the left hand side of the canvas.
  • Step 2: add all your join operators to combine the data. A couple of tips here: (1) add predicates into the join conditions to avoid using filter operators; (2) keep the data transition lines as straight as possible for performance reasons.
  • Step 3: add any expression or transformational operators required – these should really be added to the middle of the canvas.
  • Step 4: add all your target tables – these are added to the right hand side of your canvas. You are in the home straight now, but you may find this the trickiest part and we recommend using at least a 29″ monitor to complete this process.
  • Step 5: unit test – note there is no orchestration or integration required, as you only have One Mapping.
  • Step 6: release to production – you can just release your mapping straight into production, overwriting whatever was there before. There is no system or integration testing required as there is only one piece of code. UAT is likewise bypassed, as your unit testing verifies whether the entire ETL process works or not.

We are looking for beta testers for this concept, so if you want to try the OMP for your ETL code, please contact me at omp@rittmanmead.com.

Comments

  1. Mark Rittman Says:

    Jon,

    Thanks for taking the time to write our new methodology down. I’ll be following up next week with more advanced topics that take this into the BI world – “One Reporting Table” (ideally with one big index) together with “One Analysis”. If you really want to cut down on testing time and tuning, to me this is really the only way to approach things.

  2. Damian Arnold Says:

    I may be along later to discuss our One User approach to reporting, as enterprises are suffering from the ‘too many cooks’ scenario, resulting in too many versions of the truth.

  3. Peter Scott Says:

    I have been doing some performance work on the methodology.
    One area to consider is the use of gravity to boost data flow. Larger source tables should be on the top left – larger targets on the bottom right. I will post some metrics next week examining the effect of monitor size on performance… and size does matter

  4. AndreML Says:

    I’m currently working on a migration project to bring 4 source systems into one single target system. We spent so much time planning and preparing ourselves in order to be able to manage the high complexity of this task. But now finally everything seems to become so simple. I will immediately throw all the paperwork overboard and make a brand new start according to your methodology. Thank you so much for sharing these genius ideas with the rest of the world. This April 1st will go into the history of science and computing.
    I’m sure. Sweden is waiting for you.
    Best Regards
    Andre

  5. hickup Says:

    I’ve been involved in optimizing our OWB mappings with a simple solution – using BAD_NEWS transitions between source and target operators, instead of normal transitions. In my experience bad news travels faster between operators and we’ve managed to speed up our ETL up to 15 times. The only trick is to create a BAD_NEWS UDP and not give a damn about data quality.

  6. John Says:

    I am looking forward to applying this technique but I currently do not possess a 29″ monitor. I was thinking of using 2 or 3 monitors instead (yes I know, schoolboy error) but then realised that splitting this across a number of monitors goes against the fundamental principles of the One Mapping Paradigm.

  7. Mark Rittman Says:

    Jon, Pete

    Pete covers a basic performance optimization, but it’s also worth sharing further work our crack “performance optimization” team has been doing in this area. Taking Pete’s idea of putting source tables on the top left-hand side of the page and targets on the bottom-right, we’ve found that (a) using smaller font sizes for the data helps significantly boost throughput through the mapping “pipes”, and (b) physically placing the source server higher than the server holding the target data, perhaps by placing the source server on a small filing cabinet and the target one on the floor, helps with the “gravity” approach Pete mentions.

    We also find that OWB can respond faster to our deployment requests by having the operator + workstation also physically higher than the source and target servers, so the commands flow “downhill” to the server faster. Our ideal setup, and one that we advise customers to adopt, is to have the operator sitting, with his workstation, on top of a wardrobe, connected by thick cables to a source server on a filing cabinet, itself connected to the target server on the floor. I would also disagree with Jon’s comment about large monitors – having too large a monitor makes the distance between the mapping operators too large and hence your mappings too long; I prefer to use a 14″ monitor because of this.

    regards

    Mark

  8. christian Says:

    Two more optimization methodologies are obviously

    a) the utilization of a “flow-oriented” language. Therefore I do all mappings in hiragana since it implicitly contains an up-to-down orientation and hence eases the flow of the process. Granted this may seem a bit excessive but everything in the name of performance optimization!

    b) always paint all your machine casing in bright RED (#FF0000 if you please), cause da red wunz go fasta!

    Cheers!

  9. John Says:

    Mark, good point about using a 14″ monitor. This got me thinking, and I have optimised this further by using a 14″ monitor running at 1024 × 768 resolution, then using 1 pixel per table and placing these next to each other, giving me a maximum number of tables in a mapping of 786,432 (I admit this is a bit of a limitation as far as scalability is concerned). The beauty of this optimisation is that it eliminates all data flows and I can fit all of the condensed mapping into my SGA.

  10. Peter Scott Says:

    @Mark the point about large monitors is that you can get a bigger “drop” between source and target. I find that I get the best ETL speed using a large plasma screen mounted in portrait format – that layout maximises the data gradient

  11. Niall Litchfield Says:

    I’m sorry to see that you don’t seem to have considered the physical layout on modern multi-platter disk technology. I see many environments where the on-screen layout is as Peter describes, but this is negated by storing the destination objects on the upper platters of disks on the upper disk trays, whereas the source data is stored on the lower media. Rumour has it that a large Exadata customer has had to install their production rack upside down in the data centre due to this issue.

  12. Stewart Bryson Says:

    One Love! Peace!
