June 13th, 2014 by Mark Rittman
All week I’ve been looking at what’s involved in moving data around Hadoop on the Oracle Big Data Appliance, using ODI12c to orchestrate the end-to-end process. Using web log data from the Rittman Mead website, I first landed the log data on HDFS using Apache Flume, then transformed the data in several stages using Hive transformations built using ODI mappings and knowledge modules. The links below give the previous articles in this series, where you can read how I transformed data using Hive, Sqoop and Python streaming:
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.1 : Flume to Initial Hive Table
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.2 : Hive Table Joins, Aggregation and Loading
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.3 : Enhance with Oracle Reference Data via Sqoop, and CKMs
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.4 : Transforming Data using Python & Hive Streaming
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.5 : Bulk Unload to Oracle
At the end of all these ODI12c transformations, I’m left with a Hive table containing summarised web page accesses, enhanced with reference data on the page title and author, and with the entries geocoded with the name of the country associated with the page access. In many cases, this is where I’d leave the data, but quite often it’d be useful to then export the data out of Hadoop and into an Oracle database, so that I could load it into a data warehouse, or analyse it using a richer SQL dialect than I can get with HiveQL. I could use Sqoop to get the data out of Hive and into Oracle, but for larger exports in particular I’d get better unload performance by using Oracle Loader for Hadoop, one of the Oracle Big Data Connectors that typically come with Oracle Big Data Appliance.
There’s actually two Oracle Big Data Connectors that deal with getting data out of Hadoop and into Oracle; Oracle Direct Connector for HDFS (ODCH) gives you the ability to define an Oracle Database External Table over a HDFS file or Hive table, and is useful if you just want to access the data in Hadoop without actually loading it into an Oracle Database.
The way ODCH works is that it adds a special type of adapter to the external table feature in Oracle Database, that gives the database the ability to stream HDFS files in a similar way to how external tables stream regular file data into the database. This makes ODCH very convenient if you just want access to HDFS or Hive data from an Oracle database, but it suffers from the same performance penalty as regular external flat file access, in that you can’t index the table and so forth; so if you want high-performance access to data held in a Hadoop system, and you need this access more than just occasionally, you’re better off copying the data right into the Oracle database, and this is where Oracle Loader for Hadoop comes in.
Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that uses MapReduce technology to sort, partition and organise the data extraction at the Hadoop end, and then fast-unload the data into the Oracle Database typically much faster than a Sqoop data transfer.
OLH is, however, a bit fiddly to setup, so what’s useful is that ODI has a knowledge module, IKM File/Hive to Oracle, that can set up both OLH and ODCH jobs. This knowledge module does have its own setup process that itself is a bit involved – mostly around environment variables on your Linux system – but once its done, each OLH data transfer is then very easy to setup, and as long as you don’t then go on to do any more transformation on the Oracle data post-unload, it’s covered by the ODI limited-use license you get when you buy it in conjunction with the BDA.
So the key things to get OLH working with ODI are firstly, to install OLH on the BDA node that’s running the ODI12c agent (or in my case, full ODI Studio), and then set up the Oracle user’s environment variables so OLH works, and ODI knows where to find any additional Hadoop / Hive libraries and configuration files to work with OLH. In my case, this meant adding the following entries to my .bashrc file:
where OLH was installed to /home/oracle/oracle/product/oraloader-3.0.0-h2, Hadoop was installed to /usr/lib/hadoop and Hive to /usr/lib/hive. ODI_HIVE_SESSION_JARS is optional, but if you do anything fancy with SerDes in your mappings you’re best adding this entry in, to the hive-contrib.jar file (or technically the symbolic link to it). Getting this right was the hardest part about setting this KM up, but once it’s done, it’s done.
Then, to set the ODI parts up, I first create a table in my Oracle 22.214.171.124 database with matching datatypes for the Hive table i’m going to unload; varchar2 for string, number for int, etc, and then create entries in the ODI topology and Designer navigator. Then, I create a new mapping in ODI12c, where I copy the columns from my Hive table into the Oracle table, like this (note that I could map from an HDFS file instead, if I wanted to):
Now it’s a case of configuring the Physical part of the mapping. There’s two steps to this; first, you have to change the default LKM setting for the access table in the physical mapping, the ACCESS_F object in the TARGET_GROUP below:
By default this sets itself to LKM SQL to Oracle (as you’ve got an Oracle target), but we need to change this to a special one, LKM SQL Multi-Connect (below), that delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect IKM such as IKM File/Hive to Oracle.
Then, and only once you’ve selected this LKM, you can select IKM File/Hive to Oracle for the target table KM.
There’s not much you need to enable with this KM to at least get started. Set OLH_OUTPUT_MODE to JDBC (you can enable OCI later on instead, for potentially faster unloading) and set MAPRED_OUTPUT_BASE_DIR to an HDFS directory that the agent’s OS user (typically “oracle”) can read and write to – in my case, I created one specifically for this KM, using Hue’s File System browser tool. Then, if all’s set up correctly you should be able to execute the mapping and view the results in the Operator navigator – note that in my case, it took several attempts before I finally got all the environment variable settings right. Once done, you should then be able to view the Hive table output in your Oracle table, using SQL Developer.
So now, we’ve got each step for our end-to-end process. The only thing left to do is to pull all of these steps, including the ODI Procedure to run the sqoop transfer, into an ODI package or load plan, so that we can run the steps in sequence:
Then, finally, I can run the whole process as a single execution, moving data from Flume capture through Hadoop ETL steps, to the unload into an Oracle table, all from within ODI12c.
So there you have it – Hadoop-native ETL processing, with the metadata-driven design, error handling and heterogenous connectivity provided by ODI12c.