Accelerating Hadoop/Hive OBIEE Queries Using Exalytics and the Summary Advisor

July 6th, 2013 by

A few weeks ago I blogged about the new Hadoop/Hive support in OBIEE 11.1.1.7, and how it enables users to query “big data” sources with no need to know Hadoop or MapReduce. As this feature is currently based around Apache Hive, one of its drawbacks is that it can be a bit slow, at least in terms of latency and initial response time. This is because each call to Hive spins up a Java VM and creates MapReduce jobs on the fly, whereas more recent implementations of SQL-like access over Hadoop (Apache Drill, Cloudera Impala etc) bypass MapReduce and use their own in-memory engines instead.

One workaround thats worth considering though if you’ve got access to an Exalytics Server is to use TimesTen for Exalytics, and the Summary Advisor, to cache aggregations from these Hadoop data sources into memory, giving you the lightning-fast response times you’re used to with Exalytics against this new data source. I’m presenting on OBIEE, Exalytics and Hadoop and next week’s ilOUG Tech Days event in Jerusalem, so I thought I’d knock-up a quick demo as a “proof of concept”.

In this example then, I’ve got an OBIEE 11.1.1.7 repository that’s connecting to a Hadoop source via HiveODBC. OBIEE’s support for Hive access is fairly primitive at this point, so I’m sourcing all the data for my business model from a single Hive table, and modelling it into a simple two-dimensional star schema and subject area, like this:

Hive0

I can then run a series of queries against this subject area, to create some entries in the S_NQ_ACCT, S_NQ_DB_ACCT and S_NQ_SUMMARY_ADVISOR tables. 

Hive5

Note also the logical dimension I created in the BMM layer, which has levels for all customers, salesperson and customer. The Summary Advisor will need these to be in place in order to make recommendations based on measure aggregation by level.

Hive6

Response time on queries such as the one above is typically around 20-30 seconds, which generally wouldn’t be acceptable to most users running queries against a BI dashboard. Taking a look at the query in the BI Server query log, I can see the Hive query that the BI Server is sending across to the Hadoop cluster.

So let’s move over to the Exalytics Summary Advisor, which lets me focus on the Hive-related logical fact table sources for the aggregation recommendations, just like any other data source.

Hive7

Running through the steps, I eventually end-up with a recommendation for a single aggregate, aggregating the various measures by the salesperson level in the logical dimension.

Hive1

Running this through nqcmd, the Aggregate Persistence feature within the BI Server creates the TimesTen aggregate tables as expected.

Hive2

Then, moving back to the Repository, I can see the new in-memory aggregates plugged-into the Hive-related business model.

Hive3

And, running the query again and checking the query log file shows TimesTen for Exalytics being queried instead, and with data now being returned instantaneously.

The other thing to bear in mind is that, if I’d been running Hadoop on Oracle’s Big Data Appliance connected to Exalytics via Infiniband, I’d have also benefited from the extra bandwidth between the two boxes, something that might also help with the aggregation-into-TimesTen part too. Not bad and a viable solution to the latency issue with Hive and OBIEE.

Comments

  1. Raghu Says:

    Nice article

Website Design & Build: tymedia.co.uk