Using Oracle Big Data SQL to Add Dimensions and Attributes to Hadoop Reporting

In a previous post I looked at using Oracle’s new Big Data SQL product with ODI12c, where I used Big Data SQL to expose two Hive tables as Oracle external tables, and then join them using the BETWEEN operator, something that’s not possible with regular HiveQL. In this post I’m going to look at using Oracle Big Data SQL with OBIEE11g, to enable reporting against Hive tables without the need to use Hive ODBC drivers and to bring in reference data without having to stage it in Hive tables in the Hadoop cluster.

In this example I’ve got some webserver log activity from the Rittman Mead Blog stored as a Hive table in Hadoop, which in its raw form only has a limited amount of descriptive data and wouldn’t be all that useful to users reporting against it using OBIEE. Here’s the contents of the Hive table as displayed via SQL*Developer:

When I bring this table into OBIEE, I really want to add details of the country that each user is visiting from, and also details of the category that each post referenced in the webserver logs belongs to. Tables for these reference data items can be found in an accompanying Oracle database, like this:

The idea then is to create an ORACLE_HIVE external table over the Hive table containing the log activity, and then import all of these tables into the OBIEE RPD as regular Oracle tables. Back in SQL*Developer, connected to the database that has the link setup to the Hadoop cluster via Big Data SQL, I create the external table using the new ORACLE_HIVE external table access driver:

And now with the Hive table exposed as the Oracle external table BDA_OUTPUT.ACCESS_PER_POST_EXTTAB, I can import all four tables into the OBIEE repository.

I can now create joins across the two Oracle schemas and four tables:

and then create a business model and presentation model to define a simple star schema against the combined dataset:

Once the RPD is saved and made available to the Presentation layer, I can now go and create some simple reports against the Hive and Oracle tables, with the Big Data SQL feature retrieving the Hive data using SmartScan technology running directly on the Hadoop cluster - bypassing MapReduce and filtering, projecting and just returning the results dataset back to the Exadata server running the Oracle SQL query.

In the previous ODI12c and Big Data SQL posting, I used the Big Data SQL feature to enable a join between the Hive table and a table containing IP address range lookups using the BETWEEN operator, so that I could return the country name for each visitor to the website. I can do a similar thing with OBIEE, by first recreating the main incoming fact table source as a view over the ORACLE_HIVE external table and adding an IP integer calculation that I can then use for the join to the IP range lookup table (and also take the opportunity to convert the log-format date string into a proper Oracle DATE datatype):

and then using that to join to a new table I’ve imported from the BLOG_REFDATA Oracle schema that contains the IP range lookups:

Now I can add country as a dimension, and create reports that break down site visits by country of access.

Similarly, I can break the date column in the view over the Hive external table out into its own logical dimension table, and then create some reports to show site access over time.

and with the final RPD looking like this:

If you’re interested in reading more about Oracle Big Data SQL I also covered it earlier on the blog around the launch date, with this post introducing the feature and another looking at how it extends Oracle security over your Hadoop cluster.