In this five part series over the Christmas break, I'm taking a look at some of the lesser-known features in Oracle Data Integrator 11g, focusing on "enterprise" features such as build automation, scripting, data quality and middleware integration. If you've arrived at this post via a Google search, here's the links to the other posts in this short series:
- ODI11g in the Enterprise Part 1: Beyond Data Warehouse Table Loading
- ODI11g in the Enterprise Part 2 : Data Integration using Essbase, Messaging, and Big Data Sources and Targets
- ODI 11g in the Enterprise Part 3: Data Quality and Data Profiling using Oracle EDQ
- ODI 11g in the Enterprise Part 4: Build Automation and Devops using the ODI SDK, Groovy and ODI Tools
- ODI 11g in the Enterprise Part 5: ETL Resilience and High-Availability
So the first post in this series was titled "Beyond Data Warehouse Table Loading", and that's where we'll start this week's look into ODI 11g in the Enterprise. If you've had any exposure to ODI you'll probably notice that it's primarily a relational database data integration tool, using concepts such as joins, filters and expressions to transform data together using table-like data objects called "data stores". But ODI can extract from, transform and load into much more than relational databases, supporting sources and targets such as flat file and XML files, web services, JMS queues and pretty-much anything that provides a JDBC driver. In the context of Oracle projects though, there are three types of non-relational sources and targets that are particularly relevant; Oracle Essbase and Hyperion Planning multidimensional databases; message and service bus integration in a Service-Orientated Architecture (SOA) environment; and big data sources and targets such as Hadoop and NoSQL database types.
Prior to the Oracle acquisition, there were a number of different routes available to Essbase and Hyperion Planning customers looking to load one of their databases or applications. At the most basic level, "rules files" can still be used to define SQL queries to load data into Essbase outlines and databases, either written manually or auto-generated using tools such as Essbase Studio. But rules files aren't the best place to embed ETL logic and transformations (see this presentation by Cameron Lackpour for some good background on the topic), and over the past few years Oracle have transitioned from an Informatica-based solution for Essbase ETL towards Oracle Data Integrator, which is now the "strategic" ETL option for loading Essbase databases and planning applications. As shown in the diagram below, ODI 11g can be used to load Essbase databases, Hyperion Planning applications along with HFM applications through their own individual APIs, with ODI providing a set of code template "knowledge modules" that allows developers to write to, and read from, abstracted data objects and with ODI under the covers issuing the correct API calls to load the various applications.
Loading an Essbase database using ODI isn't the most intuitive process for Essbase developers as ODI represents the Essbase database as a set of table/column views, but some good groundwork has been done in the developer community to establish best practices and standard workflows, and it'll be interesting to see how Essbase support evolves over time within ODI. In the screenshots below, you can see and Essbase database outline translated into a generic ODI "Model", and an Essbase account dimension transformed into a generic datastore and columns. At the bottom is a set of target loading properties setting out how rules files and logs are used, whilst on the right you can see a typical Hyperion Planning interface "flow" moving data from a staging data into the Planning Essbase database and metadata tables.
Some good sources of information on Essbase and Planning ETL using ODI are, for example:
- "Slay the Evil of Bad Data in Essbase with ODI" - Oracle Openworld 2012 presentation by Cameron Lackpour on data quality and ETL in the context of Essbase, using ODI 11g
- "ODI and Essbase - There and Back Again" - Links to a series of blog posts by John Goodwin on ODI and Essbase - essential reading, very comprehensive
- "ETL Loads on Essbase & Planning with Oracle Data Integrator 11g" - Oracle Openworld 2010 presentation by Venkatakrishnan J on the various ETL options for Essbase products
Another area we're seeing lots of uptake of ODI within is Fusion Middleware projects, particularly ones that use messaging and other SOA-type features to move data around the enterprise. Most SOA projects only have a requirement to move small, message-size bits of information around from service to service, and would typically use products such as BPEL and Oracle Enterprise Service Bus to provide communication between services, but when large sets of data need to be moved then Oracle Data Integrator comes in handy.
To take a typical example, in the diagram below data from an Orders database (1) can take two routes through a BPEL process depending on the size of the message payload. For regular, small messages the standard approach of handling the message through BPEL steps is used (2), whereas when the payload is large, it's passed by reference to ODI (3), which then loads it (4), transforms (5) and sends it where it's required (6), notifying the calling BPEL/ESB routine when the job is complete (7), allowing the wider BPEL routine to complete (8). ODI has a full web service interface that allows packages and other ETL processes to be called from BPEL invoke activities, and can read from queues, web services and other sources as part of a wider SOA/Fusion Middleware deployment.
There's actually a couple of other data integration products within Fusion Middleware 11g that you might come across when looking at SOA environments; ODI 11g is mainly for bulk-data movements as we've seen already, whereas another product called "Oracle Data Service Integrator", brought across from the BEA acquisition, provides a means to perform data service federation within a SOA project. For anyone familiar with OBIEE, data federation is a familiar concept and is all about integrating data "in-place", in real-time, in the case of OSDI providing calling applications with a federated, transformed and combined view of data without actually staging it into a database en-route. OSDI is quite an interesting product and is due for an update shortly, and is typically called from products like BPEL or ESB (or as in the diagram below, potentially from a BI tool like OBIEE) to provide a current, real-time, updatable view of the enterprise's data, typically based on object-orientated data sources that don't readily translate into the table-and-column structured favoured by ODI.
What this then leads to in more complex Fusion Middleware 11g / SOA deployments is ODI being used for bulk-data movement, called and communicated with using web service calls, and Oracle Data Service Integrator then used to create abstracted, updatable services over live data for more general messaging and data integration. Note also the use of Oracle Enterprise Data Quality, which we'll cover in tomorrow's post in this series.
The other area that has seen a lot of innovation around ODI is around big data; specifically, the ability to work with Hadoop clusters to leverage their ability to process huge unstructured datasets in parallel, as a kind of supercharged "staging area" for a more conventional Oracle-based data warehouse. I'll cover Oracle and big data in more detail early in 2013, but at a high-level Oracle has had a lot of activity in the big data space over the past couple of years, partnering with Cloudera to resell their Cloudera Hadoop distribution and management tools, creating a new engineered system called Oracle Big Data Appliance, and creating new tools and products such as their own NoSQL database, connectors from Hadoop and HDFS to the Oracle database, extending open-source R to create Oracle R Enterprise, and investing in new products such as Oracle Endeca Information Discovery that major on unstructured, textual sources.
The role that ODI plays in this big data architecture is as a means to get data and insights out of Hadoop and NoSQL databases into a regular Oracle data warehouse, and as a way to load Hadoop clusters with additional data from an Oracle system to complement big data sources such as log files, meter readings, tracking data and other high-volume, low-granularity non-traditional data sets. The graphic below from Oracle sets out where ODI sits in this type of arrangement, with ODI typically using a set of Hadoop-specific knowledge modules along with the Oracle Connector for Hadoop to gain access to these new sources and targets.
So when most people talk about "big data", what they are generally referring to is Hadoop, an open-source project for co-ordinating data processing jobs across large clusters of commodity servers, and MapReduce, a programming approach that provides a framework and API for creating massively-parallel data analysis routines. We'll cover Hadoop and MapReduce in a lot more detail in some postings early in 2013, but suffice to say that whilst Hadoop and MapReduce have greatly simplified the process of creating parallel-processing Java applications, creating MapReduce routines is still typically beyond most end-users and developers and so over the past few years, sub-projects within Apache Hadoop have come about to provide a SQL-style interface over Hadoop/MapReduce (Apache Hive), data loading routines (Apache Sqoop), scripting languages to create MapReduce programs (Apache Pig) and so forth. Oracle also partner with Cloudera to provide the Hadoop elements of their Big Data Applicance, and there are other vendors such as Hortonworks and MapR who are trying to make Hadoop more accessible to end-users, and more suited to deployment in the enterprise - see this blog post by Curt Monash on the different "enterprise Hadoop" vendors in the market earlier in 2012 for more details on who's who and what products you should look at.
The way that ODI 11g accesses Hadoop clusters and MapReduce routines then, is to use the Apache Hive SQL (HiveQL) and metadata layer to provide a relational-like view over data in the Hadoop cluster, with Hive in turn generating MapReduce routines to transform its data and return data back to ODI for further processing and integration. Together with a set of Hadoop/Hive-specific knowledge modules, a JDBC driver for Hive and Oracle Loader for Hadoop, a utility for getting data out of Hadoop and into and Oracle database, ODI can generate SQL-like data extraction and transformation code, and have Hive do the work of creating and running the MapReduce routines used to filter, aggregate and transform data in the Hadoop cluster, as shown in the diagram below.
To take the canonical example of web log analysis (the first major Hadoop use case, developed by Yahoo and Google to analyze thousands of web search logs in parallel to determine most popular sites and search terms at a particular point in time), ODI11g along with the Oracle Data Integration Adaptor for Hadoop can work with Hadoop sources in four main ways. After creating a connection to the Hadoop cluster and the Hive server using the Hive JDBC driver and reverse-engineering the Hive table data model into the ODI repository, you can use the IKM File to Hive knowledge module to load data into a previously-defined Hive table, the first step that you need to perform before analyzing data using Hive's SQL-like interface.
Similar features exist for transforming data within Hive using either regular SQL expressions, or custom transformation code, with ODI generating the HiveQL query language understood by Hive, and Hive then turning this into MapReduce routines that sort, sift, aggregate and process data across servers in the Hadoop cluster, typically prior to loading the results into the main part of your data warehouse.
ODI also makes use of the Oracle Loader for Hadoop, which uses Hadoop and MapReduce to do the "heavy lifting" and then loads the results into an Oracle database, using direct path loads and other optimisations. Other connectors and integrators also exist including Oracle Direct Connector for Hadoop Distributed File System (HDFS), typically used to provide file sources for Oracle external tables, and Oracle R Connector for Hadoop, for linking R statistical and modelling routines to data stored in Hadoop, to allow larger sets of data to be process and modelled within R routines. We'll cover all of these new products early in 2013, so keep an eye on the blog for more details and use cases, and in the meantime Oracle's David Allan has been putting together a series of articles on Hive and ODI over on the Oracle Data Integration blog:
- ODI - Reverse Engineering Hive Tables
- ODI - Basic Hive Queries
- ODI - Hive External Tables, reverse engineering
- ODI - Hive and Complex JSON
So there's definitely more to ODI than loading and transforming relational data. But what if there's issues with the quality of the data you're working with? We'll look at how ODI integrates with Oracle's newest data integration product, Oracle Enterprise Data Quality, in the next post in this series.