Accelerating Hadoop/Hive OBIEE Queries Using Exalytics and the Summary Advisor

A few weeks ago I blogged about the new Hadoop/Hive support in OBIEE 11.1.1.7, and how it enables users to query "big data" sources with no need to know Hadoop or MapReduce. As this feature is currently based around Apache Hive, one of its drawbacks is that it can be a bit slow, at least in terms of latency and initial response time. This is because each call to Hive spins up a Java VM and creates MapReduce jobs on the fly, whereas more recent implementations of SQL-like access over Hadoop (Apache Drill, Cloudera Impala etc) bypass MapReduce and use their own in-memory engines instead.

One workaround thats worth considering though if you've got access to an Exalytics Server is to use TimesTen for Exalytics, and the Summary Advisor, to cache aggregations from these Hadoop data sources into memory, giving you the lightning-fast response times you're used to with Exalytics against this new data source. I'm presenting on OBIEE, Exalytics and Hadoop and next week's ilOUG Tech Days event in Jerusalem, so I thought I'd knock-up a quick demo as a "proof of concept".

In this example then, I've got an OBIEE 11.1.1.7 repository that's connecting to a Hadoop source via HiveODBC. OBIEE's support for Hive access is fairly primitive at this point, so I'm sourcing all the data for my business model from a single Hive table, and modelling it into a simple two-dimensional star schema and subject area, like this:

Hive0

I can then run a series of queries against this subject area, to create some entries in the S_NQ_ACCT, S_NQ_DB_ACCT and S_NQ_SUMMARY_ADVISOR tables. 

Hive5

Note also the logical dimension I created in the BMM layer, which has levels for all customers, salesperson and customer. The Summary Advisor will need these to be in place in order to make recommendations based on measure aggregation by level.

Hive6

Response time on queries such as the one above is typically around 20-30 seconds, which generally wouldn't be acceptable to most users running queries against a BI dashboard. Taking a look at the query in the BI Server query log, I can see the Hive query that the BI Server is sending across to the Hadoop cluster.

select avg(T44678.age) as c1,
T44678.sales_pers as c2,
sum(T44678.age) as c3,
count(T44678.age) as c4
from
dwh_customer T44678
group by T44678.sales_pers

So let's move over to the Exalytics Summary Advisor, which lets me focus on the Hive-related logical fact table sources for the aggregation recommendations, just like any other data source.

Hive7

Running through the steps, I eventually end-up with a recommendation for a single aggregate, aggregating the various measures by the salesperson level in the logical dimension.

Hive1

Running this through nqcmd, the Aggregate Persistence feature within the BI Server creates the TimesTen aggregate tables as expected.

Hive2

Then, moving back to the Repository, I can see the new in-memory aggregates plugged-into the Hive-related business model.

Hive3

And, running the query again and checking the query log file shows TimesTen for Exalytics being queried instead, and with data now being returned instantaneously.

select distinct T44969.Total_Age0000AF82 as c1,
T44969.Cust_Count0000AF88 as c2,
T44969.Avg_Age0000AF2C as c3,
T44963.sales_pers0000AF56 as c4
from
EXALYTICS.SA_Salespe0000AF78 T44963,
EXALYTICS.ag_1567321688 T44969
where ( T44963.sales_pers0000AF56 = T44969.sales_pers0000AF56 )

The other thing to bear in mind is that, if I'd been running Hadoop on Oracle's Big Data Appliance connected to Exalytics via Infiniband, I'd have also benefited from the extra bandwidth between the two boxes, something that might also help with the aggregation-into-TimesTen part too. Not bad and a viable solution to the latency issue with Hive and OBIEE.