End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.3 : Enhance with Oracle Reference Data via Sqoop, and CKMs

June 11th, 2014

In the first two posts in this series, I used the software on the Oracle Big Data Appliance 3.0 to ingest web log data from the Rittman Mead blog server, parse and load that data into a Hive table, and then join that table to another to add details on post and author. Links to the posts in this series are below, and I’ll complete them as the series is posted this week:

In this next step, I’ve got some data sitting in an Oracle database that I’d like to use to enhance the data in the Hadoop cluster; as a reminder, the diagram below shows how I’m looking to move data through the system, and we’re currently at step number three:


The data in the Oracle tables contains categories that I can join to the page entries that have come through the ETL process, and the join itself is pretty simple: POST_ID = POST_ID. But how do I get access to the Oracle data? I need to register the Oracle source in the ODI topology and create a model to represent the table – but can I then just join that table to the Hive table and load the results into another Hive table? Let’s try.

The screenshot below shows the mapping that would use this approach, with two Oracle tables joining to the Hive table and another Hive table as the target; on the right-hand side of the screen is the join condition, across both sources:


The Execution view shows how ODI intends to run the transformation, with LKM SQL to SQL KMs used to load the Oracle tables and stage their data in Hive, and then IKM Hive Control Append used to load the joined dataset into another Hive table.


If you try and execute the mapping though, it fails – because the LKM SQL to SQL KM can’t work with Hive tables yet.


In fact, the various IKMs that come with the Oracle Data Integrator Application Adapter for Hadoop (IKM File to Hive, IKM Hive Control Append and so on) are a bit of a mix of IKM and LKM, in that they contain both extraction and integration code in the same KM, and none of the regular LKMs and IKMs will otherwise work with Hadoop sources. What we need to do with ODI at this point is land the Oracle data in Hive first and then do the join, which raises the second question – how do we do that?

Currently, the only way to get Oracle data into Hadoop is via IKM File to Hive (LOAD DATA), which involves an unnecessary extra step of exporting the Oracle data to a file, and then loading that file into HDFS and Hive. What we can do though is use Sqoop, a tool within Hadoop for extracting from, and loading into, relational databases, something I covered on the blog a few weeks ago. Sqoop creates data loading and unloading jobs that run in parallel on the Hadoop cluster, and can use native JDBC drivers or even additional plugins such as OraOop to make the process run faster (though Oracle Loader for Hadoop is considered the fastest way to unload to Oracle, if you’ve licensed the Big Data Adapters).

The only problem is that there’s no official support for Sqoop in ODI, so no KMs that make use of it. What you can do though is create a command-line script to run Sqoop and include that in an ODI procedure, which is what I’ll now do to bring my Oracle data into Hive. To do this, I create an ODI procedure and add a single task, using the Operating System (command-shell) technology type, and use it to tell Sqoop to create me a Hive table based on an SQL SELECT statement against my Oracle database:

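The exact command will depend on your environment; a sqoop invocation along the following lines would do the job, though the JDBC connect string, credentials and the table and column names below are illustrative examples rather than the real ones from my setup:

    # Illustrative sqoop import: run a SELECT against the Oracle reference data
    # and land the joined result directly in a new Hive table. The connection
    # details, credentials and table/column names here are hypothetical.
    sqoop import \
      --connect jdbc:oracle:thin:@//oradb.example.com:1521/orcl \
      --username blog_refdata \
      --password mypassword \
      --query 'SELECT p.post_id, c.cat_name FROM posts p JOIN post_categories c ON p.post_id = c.post_id WHERE $CONDITIONS' \
      --target-dir /user/oracle/post_categories \
      --hive-import \
      --hive-table post_categories \
      -m 1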

(Note that, in practice, I had to format the sqoop command as a single line to get it to run – the listing above is split over multiple lines just so you can see all of the code.)

Executing this procedure works OK, and thereafter I’ve got a single Hive table containing the joined dataset from Oracle.

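If you’d rather check it from the command line than Hue, a quick query does the same job (the table name here is whatever you passed to sqoop’s --hive-table option):

    # Confirm the sqoop-created Hive table exists and contains rows
    hive -e "DESCRIBE post_categories; SELECT COUNT(*) FROM post_categories;"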

Joining this new Hive table to the previous one containing the distinct set of page views is then fairly straightforward, but something I’d also like to do is stop any entries going into the rest of the ETL process where the calling IP address is a test one we use, "63.73.199.69". The way I can do this is to use the CKM Hive knowledge module and put a constraint on the hostname in the table I’ll be loading from the join, so that I can then use ODI’s flow control feature to divert those rows to an error table.

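The constraint itself is just a condition expression on the target datastore. To give an idea of what flow control then does with it, the effect is roughly the HiveQL below; the staging and error table names here are made up for illustration, as in practice the CKM generates and manages its own staging and error tables:

    # Rough sketch of the flow-control check for a condition defined as
    # host <> '63.73.199.69' - table names are illustrative only.
    hive -e "
      INSERT OVERWRITE TABLE e_posts_pageviews
        SELECT * FROM i_posts_pageviews WHERE NOT (host <> '63.73.199.69');
      INSERT INTO TABLE posts_pageviews
        SELECT * FROM i_posts_pageviews WHERE host <> '63.73.199.69';
    "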

I also need to define a primary key for this table, something that’s mandatory when flow control is used. So let’s put the mapping together, joining the table I just brought in from Sqoop with the latest version of the weblog entries Hive table, loading into the Hive table I’ve just enabled the constraint for:


and then I enable flow control, and the CKM Hive check knowledge module, in the Physical mapping settings.


This is of course one of the benefits of using ODI to do your Hadoop data loading – you’ve got access to the data quality and error handling features that come with the tool. Then, when I execute the mapping and check with the Operator navigator, I can see the error handling process running, and afterwards in Hue I can see the contents of the new error table, which now contains those log entries where my test IP address was used, removing them from the target Hive table where they’d ordinarily have gone.


So that’s the third step in the ODI BDA ETL process complete. The next one’s a bit trickier though – I need to geocode the entries in the log table, assigning country names to each row based on where the IP address is located. More tomorrow.

The BI Survey 14 – Have Your Voice Heard!

June 11th, 2014

Long-term readers of this blog will know that we’ve supported the BI Survey, an independent survey of BI tool customers and implementors, for many years. Rittman Mead have no (financial or other) interest in the BI Survey or its organisers, but we like the way it gathers detailed data on which tools work best and when, and it’s been a useful set of data for companies such as Oracle when they prioritise their investment in tools such as OBIEE, Essbase and the BI Applications.

Here’s the invite text and link to the survey:

“We would like to invite you to participate in The BI Survey 14, the world’s largest annual survey of business intelligence (BI) users.

To take part in this year’s survey, visit: https://digiumenterprise.com/answer/?link=1906-PHB5RT7V

As a participant, you will:

  • Receive a summary of the results from the full survey
  • Be entered into a draw to win one of ten $50 Amazon vouchers
  • Ensure that your experiences are included in the final analyses

BARC’s annual survey gathers input from thousands of organizations to analyze their buying decisions, implementation cycles and the benefits they achieve from using BI software.

The BI Survey 14 is strictly vendor-independent: It is not sponsored by any vendor and the results are analyzed and published independently.

You will be asked to answer questions on your usage of a BI product from any vendor. Your answers will be used anonymously and your personal details will not be passed on to software vendors or other third parties.

Business and technical users, as well as vendors and consultants, are all encouraged to participate.

The BI Survey 14 should take about 20 minutes to complete. For further information, please contact Jevgeni Vitsenko at BARC (jvitsenko@barc.de). 

Click below to take part in The BI Survey 14: https://digiumenterprise.com/answer/?link=1906-PHB5RT7V

Rittman Mead Featured in Oracle In-Memory Option Launch

June 11th, 2014

Today saw the official launch of the Oracle Database In-Memory Option, with Larry Ellison going through the product features and then reading out some quotes and testimonials from beta testers. Rittman Mead were part of the beta testing program, with several of our team testing out various scenarios where we ETL’d into it, used it with OBIEE and worked out what would be involved in “in-memory-enabling” some of our customers’ BI systems.

In fact, as we said in our quote for the launch, enabling Oracle Database for in-memory analysis was almost “boringly simple” – just enable the option, choose your tables, drop any OLTP indexes and you’re ready to go.
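
To give a feel for just how simple, the SQL involved boils down to something like the statements below – the column store size, schema and table name are examples only, and inmemory_size needs a database restart before it takes effect:

    # Indicative 12c in-memory syntax only; the size and object names are examples.
    sqlplus / as sysdba <<'EOF'
    ALTER SYSTEM SET inmemory_size = 4G SCOPE=SPFILE;
    ALTER TABLE sh.sales INMEMORY;
    EOF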


Of course, in practice you’ll need to think about which tables you’ll put into memory if RAM is limited; in some scenarios TimesTen might be a better option; and you’ll need to test your particular system and carefully consider whether to keep particular indexes or materialised views. But we’re really excited about the In-Memory Option for Oracle Database, as it has the potential to significantly improve query response times for users – and from what we’ve seen so far, it “just works”.

We’re still in the NDA period whilst beta testing goes on, but you can read more on the In-Memory Option on the Oracle website, and in the blog post I wrote when the feature was announced at last year’s OpenWorld. Once it goes GA, look out for some in-depth articles on the blog around how it works, and details on how we’ll be able to help customers take advantage of this significant new Oracle Database feature.

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.2 : Hive Table Joins, Aggregation and Loading

June 10th, 2014

In this series of posts, I’m going to be looking at an end-to-end ETL process on Oracle Big Data Appliance, using Hadoop technologies to do the data manipulation and Oracle Data Integrator 12c to orchestrate the process. Over the five posts, I’ll be landing web server log data on the Hadoop cluster using Flume, loading that data into a Hive table, transforming and enhancing the dataset, and then loading the final dataset into an Oracle database using one of the Oracle Big Data Connectors. The first post in the list below has a schematic for the overall process, and over the rest of the week I’ll be adding the links in for the remaining posts in the series.

So in today’s step, I want to take the initial Hive table containing the full set of log data, and then do three things: first, extract the page details out of the “request” column in the incoming Hive table; then join that to some reference data, also in the BDA, that’ll allow me to add details of the post author and the title of the post; and finally, aggregate and project just certain columns from the dataset, so I’ve got a Hive table of just page accesses by IP address, with the author and title details.

In fact, this is the easiest of the five steps to set up and bring together, as we’re just using basic ODI functionality, albeit with Hive tables rather than regular database tables. In the screenshot below you can see the incoming weblog Hive table, with the individual columns split out in the previous step via the RegEx SerDe, and I just join it to the table containing post and author details, apply a distinct on the join output and then load the results into a third Hive table.


The join between the two tables is a bit complex, because I need to extract just the URL details out of the request field (which also contains the transport method and other metadata), but it’s no different than joining two regular database tables. Also, I did need to check the Generate ANSI Syntax checkbox, as Hive expects table joins to be in this format rather than the “Oracle” format.
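
To give a flavour of what gets generated, the join, URL extraction and distinct amount to HiveQL along these lines – the table and column names are illustrative, and the real code produced by the KM routes the data through ODI’s staging tables:

    # Illustrative HiveQL only: extract the URL from the request field, join to
    # the posts reference table and de-duplicate. Table/column names are made up.
    hive -e "
      INSERT OVERWRITE TABLE acc_pageviews
      SELECT DISTINCT a.host,
             regexp_extract(a.request, '^[^ ]+ ([^ ]+)', 1) AS page_url,
             p.post_author,
             p.post_title
      FROM   apache_access_log a
      JOIN   posts_ref p
        ON   regexp_extract(a.request, '^[^ ]+ ([^ ]+)', 1) = p.post_url;
    "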


Going over to the Physical part of the mapping, I set the KM to IKM Hive Control Append, turn on the feature to enable table truncation prior to load, and I’m ready to go.


Then, when I execute the mapping, I can see its progress in the Operator navigator, and the code view shows me the HiveQL commands that the KM has generated to move data around the BDA.


So far so good, and as I said, this was the easy bit. For the next transformation I want to bring in some additional reference data from an Oracle database, so the question for me is whether I need to land that data in the BDA before doing the transformation, or whether I can create a join directly between the Oracle table and my Hive table. Check back tomorrow for the next installment…

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.1 : Flume to Initial Hive Table

June 9th, 2014

A few months ago I posted an article on the ODI12c examples in the new Oracle Big Data Appliance, and over the past few weeks I’ve been writing about the various components within the Cloudera CDH Hadoop stack, including Hive, Pig, Spark and Flume. Since then I’ve built out a virtualized example of an Oracle Big Data Appliance using the Oracle Big Data 3.0 software set, and I thought it’d be interesting to create an example ETL flow through the system showing how ODI could be used to initiate and control the process. As with any situation where you actually build a demo to do something specific, as opposed to just playing around with the technology, you end up uncovering a few quirks and techniques that you wouldn’t otherwise have been aware of, so I’ll spend this week going through the various steps and calling out things others in the same situation might find useful. The steps I’ll go through are below, and I’ll add the links as the articles get published during the week:

As an overview, what I’ll be working with is a six-node Hadoop cluster running Cloudera CDH5, Oracle Big Data Connectors 3.0 and Oracle Data Integrator 12c, as you’d get with the current version of Oracle Big Data Appliance. Obviously BDA goes up to eighteen nodes as a maximum, and under the covers there’s lots more memory and much better networking than I was able to set up on VMware ESXi, but what I’ve got is enough to prove the process. The diagram below shows the six nodes, and where the software is installed.


I took a couple of short-cuts with the setup; obviously each node has a lot less RAM than BDA’s 64GB per node, but the main node in the cluster (bdanode1), running the HDFS NameNode and all the ODI software, got bumped up to 28GB, with the next two allocated 12GB and the others 8GB – enough to work through a demo at least. I also ran ODI Studio on bdanode1, instead of setting it up on a separate client VM, mainly to avoid having to set up all the Hadoop and Hive libraries on another machine. Other than that though, it’s the same CDH distribution you get on BDA, the same version of ODI and the Big Data Connectors, and so on, with the following software versions and downloads being used:

Setup of the Hadoop cluster is really out of scope for these articles, except to say that with CDH5 I find it easier to select the (non-default, deprecated) Packages install type rather than the new Parcels type, as the new method installs all of the Hadoop software on each node in a new place – /opt/cloudera – rather than the usual /usr/lib/hadoop, /usr/lib/hive and so forth, meaning that most configuration examples you’ll read point to the wrong place for your install. Parcels are Cloudera’s way forward for delivering software components (there are advantages in terms of patching across the cluster), but if you’re getting started with the Hadoop platform, installing in the old location usually makes things easier to follow. Other than that, there are two bits of configuration you need to do: the first is to tell ODI Studio where to find the various Hadoop libraries and configuration files; because I’ve installed Studio directly on the main Hadoop node, I could just add that node’s file locations to the Oracle user’s $HOME/.odi/oracledi/userlib/additional_path.txt file, so that it looked like this:
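
The exact entries depend on where CDH has put its libraries; on a Packages-based CDH5 install like this one, the file would contain something along these lines (the paths are indicative, so adjust them for your own install):

    # Indicative additional_path.txt entries for ODI Studio on a Packages-based
    # CDH5 node - the paths below are examples, not a definitive list.
    cat >> $HOME/.odi/oracledi/userlib/additional_path.txt <<'EOF'
    /usr/lib/hadoop/*.jar
    /usr/lib/hadoop/lib/*.jar
    /usr/lib/hadoop/client/*.jar
    /usr/lib/hive/lib/*.jar
    /etc/hadoop/conf
    /etc/hive/conf
    EOF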

Also, to make use of Oracle Loader for Hadoop (one of the Oracle Big Data Connectors, and something we’ll use at the end to bulk-unload data from Hadoop to an Oracle database), the second thing I need to do is set a couple of environment variables in the “oracle” user’s .bashrc profile file, pointing to where OLH is installed and where the Hadoop and Hive libraries and configuration files are:
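
Again, the exact values depend on where you’ve unpacked OLH and which CDH layout you’re using, but the variables are along these lines (the OLH install directory and version shown are examples):

    # Indicative .bashrc additions for the "oracle" user - adjust the OLH
    # install path, version and library locations for your own environment.
    export OLH_HOME=/home/oracle/oraloader-3.0.0-h2
    export HIVE_HOME=/usr/lib/hive
    export HIVE_CONF_DIR=/etc/hive/conf
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export HADOOP_CLASSPATH=$OLH_HOME/jlib/*:$HIVE_HOME/lib/*:$HIVE_CONF_DIR:$HADOOP_CLASSPATH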

The scenario I’ll be working with is similar to the ones I covered on the blog recently, where I landed data from the Rittman Mead blog webserver into HDFS files, and then processed the files using Hive, Pig, Spark and so on. In this example though, I’ll use ODI to do the data manipulation where possible, but still use Hive and so forth under the covers to do the work. The diagram below shows the data flow that I’ll be looking to set up in the example:


So in this scenario, incoming data is being landed in the Hadoop cluster by Flume, using the process outlined in this previous blog post. Flume is just a transport mechanism; it doesn’t in itself have any processing ability (making it analogous to GoldenGate), and all it does is transport log file entries from one place to another, via Flume agents on either end. All we as developers need to be aware of is (a) where the log files will be landed, and (b) that Flume will keep continuously writing to these files until the source log file gets rotated – therefore I can’t assume a log file is completely written to when I read from it.

What this means in practice is that if I want to do incremental loads, I need to consider the fact that a file I’m reading from might have more data in it later on. There are various solutions to this – principally having Flume write to HBase rather than raw HDFS files, and then reading from the HBase database, noting the last extraction point each time – but to keep things simple I’ll just do a full load each time the ETL run takes place, meaning that I don’t need to think about incremental loads throughout the system.

So the first thing I need to do is have ODI take the incoming files and load them into a Hive table. To do this I set up Topology entries for the incoming HDFS files, and here’s the first “gotcha” – to create a connection to HDFS files, you use the regular File technology, but you leave the JDBC driver type blank and put the address of the HDFS NameNode into the JDBC URL. This is of course technically invalid and won’t allow you to either test the connection or reverse-engineer the file names in, but it’s a “hack” used by the IKM File to Hive KM to get the address of your HDFS NameNode (if you choose to source the incoming file from HDFS rather than the local filesystem).


Then, when you come to register the physical schemas to go with the new File technology data server, you’ll need to be aware that ODI just appends the final file name to the directory name when retrieving the file data – so if you want the connection to point to a directory, rather than just a file, you’ll need to set up the physical schema to be the directory “above” the one you’re interested in, and then set the file name later on to be that directory. In this example, I want the final file reference to point to hdfs://bdanode1.rittmandev.com/user/oracle/weblog_incoming_files/apache_access_combined, a whole directory (HDFS aggregates all the files in a directory if you reference just the directory in an operation) rather than just a single log file. You can see the directory and the log files it contains in the Hue screenshot below:

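If you prefer the command line to Hue, the same check is just an HDFS directory listing:

    # List the log files Flume has landed in the target HDFS directory
    hadoop fs -ls /user/oracle/weblog_incoming_files/apache_access_combined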

I therefore set the physical schema to be hdfs://bdanode1.rittmandev.com/user/oracle/weblog_incoming_files, and the file reference in the datastore model is set to the final directory name, like this:


If it seems a bit confusing, it’ll become clearer in a moment.

Now I need to go back to the Topology navigator and create a connection through to the Hive server on the Big Data Appliance VMs – in fact, recent versions of CDH (CDH4, CDH5) swap out the old Hive Server for HiveServer2, so you’ll need to use the correct JDBC drivers (as supplied with CDH4/5) to connect to it, and also create the JDBC URL in the format jdbc:hive2://[machine name:port], as shown in the screenshot below:

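A useful sanity check before going back into ODI is to try the same JDBC URL from beeline, the HiveServer2 command-line client – the hostname, port (10000 is the HiveServer2 default) and credentials below are examples:

    # Quick HiveServer2 connection test from the command line; the username and
    # password here are examples for an unsecured cluster.
    beeline -u jdbc:hive2://bdanode1.rittmandev.com:10000 -n oracle -p welcome1 \
      -e "SHOW TABLES;"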

A quick note about security at this point: by default, most CDH4/5 clusters are set up as what’s called “unsecured”, which means that whilst you use a username and login to connect to Cloudera Manager, for example, Hive and Impala only request user credentials when you connect – they don’t actually check the password against anything, for example your LDAP server. You can connect these tools to LDAP or Active Directory, and typically you’d combine this with Kerberos authentication between the various components and when you connect via Impala and Hive, and typically you’d also use Apache Sentry to provide role-based access control to the data within individual HDFS files. But by default, Hive will accept more or less anything as a password when you connect, and you may then hit issues later on when your HDFS files are owned by a different user from the one you connect as.

Where this manifests itself is when a Hive table has underlying HDFS files owned by, say, the “admin” user in CDH5 (because that’s how you connected to Hue to upload them), but you then connect as the “oracle” user through Hive to manipulate the Hive table contents. In practice, what I try to do is create any Hive tables (using Hue) as the same user I’m then going to connect to Hive as, which means you’ll most probably need to go into Hue and create a new user called “oracle” (if that’s what you’re running ODI as, and connecting through Hive as) before creating the Hive tables you’ll then import into the ODI topology.

So once you’ve done all that, you can go into the Designer navigator and reverse-engineer the definition of your Hive tables into datastore models. In my case, I’ve got a bunch of tables that I’ll be using throughout the whole ETL process.


Now it’s time for the first ODI mapping, to take the raw log file directory and load it into a Hive table. As it stands though, these raw log files are rows of just a single “column” of data – the log file entry in Apache CombinedLogFormat format. To make them useful to the rest of the ETL process I’ll somehow need to parse them into the individual log file elements, so I create a target Hive table that contains an entry for the raw log entry, and then columns for the various log file elements:


The way that I parse the log file is to use a feature within the IKM File to Hive (LOAD DATA) KM that allows me to specify a regular-expression SerDe (Serializer-Deserializer) to parse the log file entry into its individual columns, like this (note that you’ll need to ensure the hive-contrib-* JAR file is available to all of your Hadoop nodes before using this SerDe):

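In plain HiveQL terms, what ends up being created is roughly the DDL below; the table and column names and the HDFS location are illustrative, but the regular expression is the standard one for Apache combined-format logs:

    # Roughly the sort of table definition the KM builds when a RegexSerDe row
    # format is specified - table/column names and the location are illustrative.
    cat > /tmp/create_access_log.hql <<'EOF'
    CREATE EXTERNAL TABLE apache_access_raw (
      host STRING, identity STRING, request_user STRING, request_time STRING,
      request STRING, status STRING, size STRING, referer STRING, agent STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
      "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    )
    LOCATION '/user/oracle/weblog_incoming_files/apache_access_combined';
    EOF
    hive -f /tmp/create_access_log.hql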

In this instance, I want the KM to leave the source files in place when doing the data load (Hive by default moves incoming source files into the /user/hive/warehouse directory area), as these files most probably haven’t finished being written to by Flume yet; so I leave the EXTERNAL_TABLE option set to true (Hive external table, not Oracle external table) and make sure FILE_IS_LOCAL is set to FALSE, so that the KM knows to use the HDFS file location “hack” I set up in the topology earlier. Then I just run the mapping and check that it’s worked OK:


and I can check from the Model pane in the Designer navigator that I’ve now got a Hive table of individually split-up log entry columns to work with, for the rest of the downstream ETL process:


So that’s the first stage done – next, I’ll be combining this Hive table with data from another one, using the IKM Hive Control Append KM.
