Agile Data Warehousing with Exadata and OBIEE: Model-Driven Iteration

January 16th, 2012 by Stewart Bryson

After laying the groundwork with an introduction, and following up with a high-level description of the required puzzle pieces, it’s time to get down to business and describe how Extreme BI works. At Rittman Mead, we have several projects delivering with this methodology right now, and more in the pipeline.

I’ll gradually introduce the different types of generic iterations that we engage in, focusing in this post on what I call the “model-driven” iteration. Our first few iterations are always model-driven. We begin when a user opens a user story requesting new content. For any request for new content, we require that all of the following elements are included in the story:

  1. A narrative about the data they are looking for, and how they want to see it. We are not looking for requirements documents here, but we do want the user to give a complete picture of what they need.
  2. An indication of how they report on this content today. In a new data warehouse environment, this would include some sort of report that they are currently running against the source system and, in a perfect world, the SQL used to pull that report.
  3. An indication of data sets that are “nice to haves”. This might include data that isn’t available to them in the current paradigm of the report, or that was simply too complicated to pull in that paradigm. After an initial inspection of these nice-to-haves and the complexity involved in including them in this story, the project manager may decide to pull these elements out and put them in a separate user story. This, of course, depends on the Agile methodology used, and on the individual implementation of that methodology.

First, we assign the story to an RPD developer, who uses the modeling capabilities in the OBIEE Admin Tool to “discover” the logical dimensional model tucked inside the user story, and to develop that logical model inside the Business Model and Mapping (BMM) layer. Unlike a “pure” dimensional modeling exercise, where we focus only on user requirements and pay very little attention to source systems, in model-driven development we constantly shift between the source of the data and how best the user story can be solved dimensionally. Instead of working directly against the source system, though, we work against the foundation layer in the Oracle Next-Generation Reference Data Warehouse Architecture. We take a top-down approach, first creating empty facts and dimensions in the BMM, and then mapping them to the foundation-layer tables in the physical layer.

To take a simple example, we can see how a series of foundation layer tables developed in 3NF could be mapped to a logical dimension table as our Customer dimension:

Model-Driven Development of Dimension Table
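
In SQL terms, the mapping shown above boils down to a join across the foundation-layer tables. As a rough, purely illustrative sketch (the table and column names below are invented, not taken from the actual model), the logical Customer dimension might resolve to something like this:

    -- Illustrative sketch only: a 3NF foundation layer presented as a single
    -- logical Customer dimension. All table and column names are hypothetical.
    SELECT cust.customer_id                          AS customer_key,
           cust.first_name || ' ' || cust.last_name  AS customer_name,
           seg.segment_name                          AS customer_segment,
           addr.city                                 AS customer_city,
           addr.country                              AS customer_country
    FROM   customers cust
    JOIN   customer_segments seg
           ON seg.segment_id = cust.segment_id
    LEFT JOIN customer_addresses ca
           ON ca.customer_id = cust.customer_id
          AND ca.address_type = 'PRIMARY'
    LEFT JOIN addresses addr
           ON addr.address_id = ca.address_id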

I rearranged the layout from the Admin Tool to provide an “ETL-friendly” view of the mapping. All the way to the right, we can see the logical, dimensional version of our Customer table and how it maps back to the source tables. This mapping could be quite complicated, perhaps involving dozens of tables. The important thing to keep in mind is that this complexity is hidden not only from the consumer of the reports, but also from the developers. We can generate a similar example of what our Sales fact table would look like:

Another way of making the same point is to look at the complex, transactional model:

We can then compare this to the simplified, dimensional model:

And finally, when we view the subject area during development of an analysis, all we see are facts and dimensions. The front-end developer can be blissfully ignorant that he or she is developing against a complex transactional schema, because all that is visible is the abstracted logical model:

When mapping the BMM to complex 3NF schemas, the BI Server is very, very smart, and understands how to do more with less. Using the metadata capabilities of OBIEE is superior to other metadata products, or to a “roll-your-own” metadata approach using database views, for the following reasons:

  1. The generated SQL usually won’t involve self-joins, even when the same physical table is mapped into both the logical fact table and the logical dimension table.
  2. The BI Server will only include the tables required to satisfy the logical request, either because they have columns mapped to the attributes being requested, or because they are reference tables needed to join disparate tables together. Any tables not required to satisfy the request will be excluded (a sketch of this pruning follows below).

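To illustrate the second point with a hedged example: suppose an analysis asks only for revenue by customer segment. Against the invented foundation tables used in the earlier sketch, the physical SQL generated by the BI Server would touch only the tables needed to answer that question, along the lines of:

    -- Illustrative sketch only: tables not needed for this request
    -- (for example the address tables) are simply not referenced.
    SELECT seg.segment_name       AS customer_segment,
           SUM(ord.order_amount)  AS revenue
    FROM   orders ord
    JOIN   customers cust
           ON cust.customer_id = ord.customer_id
    JOIN   customer_segments seg
           ON seg.segment_id = cust.segment_id
    GROUP BY seg.segment_name
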
Since the entire user story needs to be closed in a single iteration, the user who opened the story needs to be able to see the actual content. This means that the development of the analysis (or report) and the dashboard are also required to complete the story. It’s important to get something in front of the end user immediately, but it doesn’t have to be perfect. We should focus on a clear, concise analysis in the first iteration, so it’s easy for the end user to verify that the data is correct. In future iterations, we can deliver high-impact, eye-catching dashboards. Equally important to closing the story is being able to prove that it’s complete. In Agile methodologies, this is usually referred to as the “Validation Step” or “Showcase”. Since we have already produced the content, it’s easy to prove to the user that the story is complete. But suppose we believed we couldn’t deliver new content in a single iteration. That would imply an iteration in our project that didn’t include actual end-user content. How would we go about validating or showcasing that iteration? How would we showcase a completed ETL mapping, for instance, if we haven’t delivered any content that consumes it?

What we have at the end of the iteration is a completely abstracted view of our model: a complex, transactional 3NF schema presented as a star schema. We are able to deliver portions of a subject area, which is important for time-boxed iterations. The Extreme Metadata of OBIEE 11g allows us to remove this complexity in a single iteration, but it’s the performance of the Exadata Database Machine that allows us to build real analyses and dashboards and present them to the general user community.

In the next post, we’ll examine the ETL Iteration, and explore how we can gradually materialize our logical business model as a physical model over time. As you will see, the ETL iteration is an optional one… it will be absolutely necessary in some environments, and completely superfluous in others.

Comments

  1. Agile Management Says:

    Hi,

    Thanks for this important article; everything in it is clear.

    Thanks a lot.

  2. Dan Says:

    Hi again. Some thoughts… I think the principles here are good. However, directly modelling on 3NF is not always possible IMHO. When it isn’t possible, I mock up a star in Excel and work with it via ODBC in the RPD. Then, once the first-iteration story is agreed with the user, I rewire the Excel source to the Oracle warehouse. Thus the dependency between ETL and dashboards is broken, which allows parallel development.

    Another option to consider is using opaque views in the physical layer over the 3NF source to mimic facts and dimensions, and building your star there; this is especially useful when you are dealing with multiple sources.
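
    For example, an opaque view for a fact might be little more than a SELECT pasted into a physical table object; the table and column names here are invented for illustration:

        -- Sketch only: flattening a 3NF order/order-line pair into a fact shape.
        -- All table and column names are invented for illustration.
        SELECT ord.order_id,
               ord.customer_id,
               ord.order_date,
               SUM(ol.quantity * ol.unit_price) AS order_amount
        FROM   orders ord
        JOIN   order_lines ol ON ol.order_id = ord.order_id
        GROUP BY ord.order_id, ord.customer_id, ord.order_date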

    For simpler sources, the approach you suggest is the right one.

    Perhaps a hybrid opaque-view & 3NF approach is an option too.

  3. Stewart Bryson Says:

    @Dan

    Thanks a lot for taking the time to comment… we sincerely appreciate that.

    Keep in mind that we are modeling against a foundation layer, not the explicit source system. Typically, the foundation layer is exactly like the source system, except that it is “insert only”, so we have history. It’s much easier to map against 3NF in this way, as we can replicate things like slowly-changing dimensions, etc.
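
    As a rough sketch (with invented table and column names), an insert-only customer table in the foundation layer can be presented as a Type 2 style dimension using nothing more than analytic SQL:

        -- Sketch only: deriving effective dates from an insert-only foundation
        -- table. Table and column names are invented for illustration.
        SELECT customer_id,
               customer_name,
               load_date AS effective_from,
               LEAD(load_date) OVER (PARTITION BY customer_id
                                     ORDER BY load_date) AS effective_to
        FROM   customers_history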

    I have found that I can model most scenarios in the RPD. Perhaps I haven’t seen some of the systems that you have. But watch out for the next posting, where I discuss the ETL iteration. You will see that when we come up against issues that are difficult to model, or that require extremely complex approaches, it is time for us to start considering ETL. I haven’t ruled out ETL completely; I’m just arguing that it doesn’t have to be started right away. And as you will see, the process of doing the RPD modeling first will actually make the ETL iteration easier.

    If we start by replicating the current reports that the user is generating against the source, we can then slowly add additional, more complex content, either with further RPD development or with ETL. Also… watch out for the hybrid iteration that I describe last, which hopefully will clear up the ETL-versus-RPD decision that has to be made in later iterations.

  4. Kent Graziano Says:

    Nice example, Stewart. It makes it easy to follow your logic. It does bear repeating that for Extreme BI to work, you have to lay the foundation first by having the right infrastructure (like Exadata) and having at least part of your foundation layer (3NF or Data Vault) in place, in order to have a source for the dimensional model.

    So a question: where does building the foundation layer fit in the methodology? Is it really just an iteration (or several) earlier in the project?

  5. Stewart Bryson Says:

    @Kent: Nice to hear from you again.

    This depends a bit on the environment. In a situation where the foundation layer would look nearly identical to the source system, I look at this a bit like plumbing, and don’t consider it to be part of the development cycle at all. Whether we use Oracle CDC (based on Streams) or GoldenGate, both options provide simple ways to produce “insert only” versions of the source system on the Exadata Database Machine.

    In a more complicated scenario, where a more robust model (such as Data Vault) is used for the foundation layer instead, I would recommend a prior iteration to get the needed entities into the Data Vault model. Unlike CIF versions of the “data warehouse” (to use Inmon terminology), Data Vault is fast and reactive, and lends itself well to the Agile methodology. This might, perhaps, violate the concept of “new content in a single iteration”, but since we are clearly separating the pieces that require user requirements, namely the access and performance layer, from the “process neutral” foundation layer, this still seems to me to fit into the Agile methodology.

    Depending on how quickly the Data Vault entities could be produced and populated in the foundation layer, the foundation and logical access layers could conceivably be generated in a single iteration, but I think that would be pushing it. Also, keep in mind that a new user story might request content that has already been addressed in the foundation layer. In these cases, the normal methodology can be used.

    Thoughts?

  6. Larry Dooley Says:

    Very nice example. However, the infrastructure is really expensive. Take out the Exadata and this becomes a POC throwaway. Nothing wrong with that. Once you get all the kinks worked out with the user, you can then put the backend stuff in place.

  7. Stewart Bryson Says:

    @Larry said “…this becomes a POC throwaway.”

    I have to disagree, Larry. If you read the next post on the ETL iteration, I explain that we have already done a lot of work in the model-driven iteration that would otherwise have to be done in a standard waterfall methodology:
    1) The BMM layer will contain the logical model we have worked out. This will be the first draft of the physical model needed, but the majority of the work around granularity, separation of entities, etc., is already done.
    2) The RPD will contain the source-to-target mappings for the ETL developer. The Admin Tool has several options for generating source-to-target documentation, so the standard spreadsheet document for this wouldn’t have to be completed.
    3) The model-driven iteration will generate SQL statements that can be used as prototypes for ETL development. Additionally, ODI, OWB and Informatica all provide “SQL override” functionality in one way or another, so these SQL prototypes could be used in a first iteration of the mapping to get quick-win development (see the sketch below).
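
    To make that third point concrete, here is a rough sketch (with invented target and source names) of a quick-win first-cut mapping: little more than one of those generated statements wrapped in an insert, or pasted into the tool’s SQL override:

        -- Sketch only: a BI-Server-style SELECT reused as an ETL prototype.
        -- All table and column names are invented for illustration.
        INSERT INTO dw_customer_d (customer_key, customer_name, customer_segment)
        SELECT cust.customer_id,
               cust.first_name || ' ' || cust.last_name,
               seg.segment_name
        FROM   customers cust
        JOIN   customer_segments seg
               ON seg.segment_id = cust.segment_id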

    Also… ask your end user. Would she rather have a first version of the report, which perhaps runs very poorly, while the backend is being optimized to deliver that report more efficiently? Or would she rather wait and have nothing until everything is delivered completely? I know that I would choose the former.
