Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture Pt2. - Delivering the Data Factory

In my previous post on our updated Oracle Information Management Reference Architecture, jointly developed with Oracle’s Enterprise Architecture team, we went through the conceptual and logical views of the information architecture, introducing new concepts such as the Raw Data Reservoir, the Data Factory and the Discovery Lab. I said that the Data Factory, Data Reservoir, Enterprise Information Store and Reporting, together with the Discovery Lab, comprise the Information Platform, which can be used to create data applications (raw data plus the data factory) or more wide-ranging Information Solutions. The Discovery Lab often acts as the place where new ideas and models are prototyped, and where the algorithms developed there can then be productionized in the Execution area of the overall architecture.

And this is typically where most information architectures leave you: with a good starting point and an overall schematic, but without any real guidance on how to deliver it. For example, how do you develop small pieces of functionality that may often be throwaway, and how do you take on larger projects? Is it possible to develop this type of architecture in an agile way, and if so, can the data reservoir part be delivered in an agile way too? This is where we made our major contribution to the collaboration with Oracle’s Enterprise Architecture team, taking the ideas behind our ExtremeBI agile development approach and using them to deliver this updated Information Management architecture. Let’s go back a few steps, though, and think about why an agile, or at least a different, way of developing BI content is needed, and which part of the overall architecture we’re developing for.

Within this overall architecture, the part where developers such as ourselves typically add value is the “data factory”: the part of the architecture that acts as the conduit, or interface, between the schema-on-read raw data reservoir and the schema-on-write enterprise information store.

There’s other code to write too: interface and streaming code to get data into the event engine, for example, and more traditional ETL code to load data in from database and application stores. But what we’re most concerned with are the data models and reports that users request in their user stories and epics. If you take a look back at the logical view of our new information architecture, you can see that information can come in at any level of the architecture (raw data reservoir, foundation, or access and performance), with different pros and cons for landing and accessing data at each layer, and with the ability to take data from one layer and use it to load layers higher up in the hierarchy.
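
As a simple, hypothetical illustration of that layering (the table and column names below are made up for the example), this is the kind of load the data factory runs to take foundation-layer data and use it to populate an aggregate in the access and performance layer above it:

    -- Hypothetical load from the foundation layer up into an
    -- access and performance layer aggregate; names are illustrative only
    INSERT INTO perf_sales_day_agg (order_day, product_id, total_revenue, order_count)
    SELECT TRUNC(o.order_date),
           oi.product_id,
           SUM(oi.line_amount),
           COUNT(DISTINCT o.order_id)
    FROM   fnd_orders o
    JOIN   fnd_order_items oi
           ON oi.order_id = o.order_id
    GROUP  BY TRUNC(o.order_date), oi.product_id;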

And this is where our agile development methodology, “ExtremeBI”, comes in. For structured relational sources, ExtremeBI typically uses Oracle GoldenGate to trickle-feed data of interest directly into the foundation layer of this information management architecture, with developers then using that layer, plus the access and performance layer if needed, to close off user stories as quickly as possible, as shown in the diagram below.

[Diagram: GoldenGate trickle-feeding source data into the foundation layer, with user stories closed off against the foundation and access and performance layers]

In some cases it’s possible to model against just the foundation layer in OBIEE, where that foundation layer contains all the metadata and history columns that you need. One of the main parts of our ExtremeBI methodology is a technical process that makes this modelling possible and, more importantly, allows the underlying physical model to evolve in response to user stories about performance and analysis needs, without breaking existing reports or creating an unmanageable system.
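
As one rough sketch of how this can be supported physically (all table, view and column names here are hypothetical), the foundation-layer table carries the history and load-metadata columns maintained by the trickle-feed, while OBIEE maps to a thin “current rows” view over it, so the table underneath can be refactored without the business model or existing reports needing to change:

    -- Hypothetical foundation-layer table, trickle-fed from source,
    -- carrying history and load-metadata columns
    CREATE TABLE fnd_customers (
      customer_id     NUMBER        NOT NULL,
      customer_name   VARCHAR2(200),
      customer_status VARCHAR2(20),
      effective_from  TIMESTAMP     NOT NULL,  -- when this version became current
      effective_to    TIMESTAMP,               -- NULL = still the current version
      src_system      VARCHAR2(30),            -- lineage / audit metadata
      load_timestamp  TIMESTAMP
    );

    -- Thin view that the OBIEE physical layer maps to; the table behind it
    -- can evolve without existing reports needing to change
    CREATE OR REPLACE VIEW fnd_customers_current_v AS
    SELECT customer_id, customer_name, customer_status
    FROM   fnd_customers
    WHERE  effective_to IS NULL;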

We’ll be documenting this process as part of the series of white papers we’ll be co-authoring with Oracle, and we also bring utilities and accelerators to ExtremeBI customer projects, along with continuous integration and regression testing tools, to help get usable software in the hands of end-users as soon as possible.
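
As an illustration of the kind of regression check involved (the report and baseline tables below are hypothetical), comparing a saved baseline of a report’s result set with the same query run against the evolved model flags any differences before end-users see them:

    -- Hypothetical regression check: differences in either direction between
    -- the baseline extract and the current model indicate a regression
    (SELECT order_day, product_id, total_revenue FROM baseline_sales_report
     MINUS
     SELECT order_day, product_id, total_revenue FROM v_sales_report_current)
    UNION ALL
    (SELECT order_day, product_id, total_revenue FROM v_sales_report_current
     MINUS
     SELECT order_day, product_id, total_revenue FROM baseline_sales_report);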

But what about the parts of the data factory that source data from the raw data reservoir? Can we use similar techniques, and tools like ODI and OBIEE, to deliver working software to users within a short time span? Well, in practice it’s not quite so easy; typically, getting information out of a schema-on-read store involves writing code, and it’s easy to fall back into bad habits where everything is scripted, code is brittle and fragile, and no one feels confident in evolving the model over time.

We think it’s possible to use ExtremeBI techniques with schema-on-read sources though, when you combine them with tools like Flume or GoldenGate for trickle-feed extraction and loading, and with the new capabilities in ODI and OBIEE that let Hadoop data be accessed via Hive. Flume is the industry-standard mechanism for transporting log entries from their source to the Hadoop cluster, and GoldenGate can, through its Java API, land data in HDFS or Hive when the source is a relational database (see MOS Doc.IDs 1586188.1 (Hive) and 1586210.1 (HDFS) for example code).

Hive, though slow for ad-hoc query access against Hadoop data, has a range of Serializer/Deserializer (SerDe) utilities that can translate semi-structured and NoSQL sources into more regular columns and tables, with Cloudera Impala removing the response-time issue and rapidly matching Hive in terms of plugins and SerDes. ODI, as I demonstrated in a five-part series on the blog last week, has a number of Hadoop-native knowledge modules and also leverages Oracle Loader for Hadoop to move data out of the raw data reservoir and into the foundation and access and performance layers.
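
To make the SerDe idea concrete, here’s a minimal, hypothetical HiveQL sketch (the table name, HDFS path and regular expression are assumptions for illustration only): an external table that uses the RegexSerDe to project columns onto raw Apache-style web log files landed in HDFS, for example by Flume, which can then be queried in place or used as the source for an ODI mapping into the foundation layer:

    -- Hypothetical Hive external table over raw web log files landed in HDFS;
    -- the RegexSerDe imposes columns on the raw text at read time
    CREATE EXTERNAL TABLE access_logs (
      host        STRING,
      log_time    STRING,
      request     STRING,
      status      STRING,
      bytes_sent  STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) \\S+ \\S+ \\[([^\\]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*)"
    )
    STORED AS TEXTFILE
    LOCATION '/user/flume/weblogs';

    -- Schema-on-read: the files themselves are untouched, but the data
    -- can now be queried and profiled with ordinary SQL
    SELECT status, COUNT(*) FROM access_logs GROUP BY status;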

The important thing to understand, though, when working with data in the updated Information Management architecture and reporting against both schema-on-read and schema-on-write sources, is that the most effective way to deliver real value for end-users is to split your development work into two phases:

  • A discovery phase, typically done in the Discovery Lab sandbox or Rapid Development Sandboxes, that focuses on understanding the data and helping the users see where the value is, with less emphasis on governance and corporate standards, and
  • A commercial exploitation phase, when our understanding of the data has significantly increased, we start to add structure to it, and it becomes appropriate to add governance and address data quality issues. Typically this would be done in the Execution part of our architecture, taking our prototype design and adding it to the data factory as one of its interface routines.

We’ll be developing these ideas further over the summer, with some white papers and a joint presentation at Oracle OpenWorld 2014 to formally launch the approach. In the meantime, keep an eye on the blog as we work through some of the key concepts.