So now that Oracle Visual Analyzer is out as part of Oracle BI Cloud Service, and Visual Analyzer (VA) is due to ship on-premise as part of OBIEE12c sometime in the next twelve months, several of our customers have asked us if they need both VA and Oracle Big Data Discovery if they’re looking to analyse Hadoop data as part of a BI project. It’s an interesting question so I thought it’d be useful to go through my thoughts on how the two tools work together, when to use one, and when to use the other.
Taking our standard “big data” dataset of website log activity, Twitter mentions and page details from our Wordpress blogging software, before Visual Analyzer came along the two usual ways we’d want to analyze these datasets is either a traditional BI metrics analysis-type scenario, and a data discovery/visualization scenario where we’re more interested in the content of the data rather than precise metrics. My half of the recent BI Forum 2015 Masterclass goes through these two scenarios in detail (presentation slides in PDF format here), and it’s Big Data Discovery that provides the more “Tableau”-type experience with fast point-and-click access to both datasets joined together on their common website page URL details.
Now we have Visual Analyzer though, things get interesting; in my article on Visual Analyzer within BICS I showed a number of data visualisations that look pretty similar to what you’d get with Big Data Discovery, and when we have VA available on-site as part of OBIEE12c we’ll be able to connect it directly to Hadoop via Cloudera Impala, potentially analyzing the whole dataset rather than the (representative) sample that Big Data Discovery loads into its Endeca Server-based engine.
So if the customer is looking to analyze data held in Hadoop and Visual Analyzer is available, where’s the value in Big Data Discovery (BDD)? To my mind there’s three areas where BDD goes beyond what VA can do, or helps you perform tasks that you’ll need to do before you can work with your data in VA:
- The initial data discovery, preparation and cleansing that you’d otherwise have to do using HiveQL or an ETL tool such as ODI12c
- Providing you with a high-level overview and landscape of your data, when that’s more important to you at the time than precise counts and drill-down analysis
- Understanding how data joins together, and how best to use your datasets in terms of metrics, facts, dimensions and so forth
Taking the data preparation and cleansing part first, I’ve covered in several blogs over the past couple of years how tools such as ODI can be used to create formal, industrialized data pipelines to ingest, prepare and then summarise data coming into your Hadoop system, and how you can drop-down to languages such as HiveQL, Pig and Spark to code these data transformations yourself. In the case of my webserver log, twitter and page details datasets this work would include standardising URL formats across the three sources, geocoding the IP addresses in the access logs to derive the country and city for site visitors, turning dates and times in different formats into ones that work as Hive timestamps, and so forth. Doing this all using ODI and/or HiveQL can be a pretty technical task, so where BDD comes in useful even - if VA and an OBIEE RPD is the final destination for the data.
Datasets that you transform and enrich in Big Data Discovery can be saved back to Hive as new Hive tables, or exported out as files for you to load into Oracle using SQL*Developer, or upload into BICS to use in Visual Analyzer. Where BDD then becomes useful is giving you a quick, easy to use overview of your dataset before you get into the serious business of defining facts, dimensions and aliases against these three Hive tables. The screenshots below show a couple of typical Big Data Discovery Studio data visualisation pages against the webserver logs dataset, and you can see how easy it is to create simple charts, tag clouds and maps against the data you’re working with - the aim being to give you an overview of the data you’re working with, help you understand its contents and “shape”, before moving further down the curation process and applying formal structures to the data.
Where things get harder to do within Big Data Discovery is when more-and-more formatting, complex joining and “arranging” of the data is required; for example, BDD gives you a lot of flexibility in how you join datasets, but this flexibility can be confusing for end-users if they’re then presented with every possible variation of a three-table join rather than having the data presented to them as simple facts and dimensions. But this is how we’d really expect it - if you go back to the logical data architecture I went through in the blog post a while ago about the updated Oracle Information Management Reference Architecture, the trade-off in using schema-on-read data reservoirs is that this data, although quick and cheap to store, requires a lot more work to be done each time you access the data to get “value” from it.
OBIEE, in contrast, makes you define your data structures in-full before you present data to end-users, dividing data in the three datasets into measures (for the fact tables) and attributes (for dimensions) and making it possible to add more dimension lookups (for a date dimension, for Twitter users in this case) and separate the overall set of data into more focused subject areas. Working with the dataset on the on-premise version of OBIEE first, the RPD that I created to present this data in a more formal, dimensional and hierarchical way to users looked like this:
I can leave this RPD connected directly to the underlying Hive and Impala tables if I want to use just Answers and Dashboards, but for the time being I either need to export the underlying Hive tables into CSV files or into an Oracle Database before uploading into Visual Analyzer, but come OBIEE12c this should all be seamless. What users are then presented with when they go into Visual Analyzer is then something like this:
Notice how the various attributes of interest are grouped into fact and dimension table folders, and there’s a simple means to add calculations, change the visualisation type and swap chart settings around. Note also that the count on the screen is the actual count of records in the full dataset, not the sample that BDD takes in order to provide an overview of values and distribution in the full dataset. Whilst it’s relatively easy to create a line chart, for example, to show tweets per user within BDD, using Visual Analyzer it’s just a case of double-clicking on the relevant measures and attributes on one side of the page, selecting and arranging the visualisation and applying any filters using dialog boxes and value-selectors - all much more familiar and obvious to BI users.
Enrichment to the data that I’ve done in Big Data Discovery should in most cases be able to come through to Visual Analyzer; for example, I used Big Data Discovery’s text enrichment features to determine the sentiment of blog post titles, tweets and other commentary data, I could use the latitude and longitude values derived during the visitor IP address geocoding to plot site visitors on a map. Using the sentiment value derived from the post title, tweet contents and other textual data, I can create a chart of our most popular posts mentioned on Twitter and colour bars to show how positive, or negative, the comments about the post were.
The only thing that Visual Analyzer can’t yet do that would be useful, is to be able to include more than one subject area in a project. To analyze the number of tweets and the number of page views for posts in a scatter chart, for example, I currently have to create a separate subject area that includes both sets of facts and dimensions, though I understand BICS on VA will have the ability to include multiple subject areas in a forthcoming release.
So in summary, I’d say that Big Data Discovery, and Visual Analyzer as part of BI Cloud Service, are complementary tools rather than one being able to replace the other in a big data context. I find that Big Data Discovery is a great tool to initially understand, catalog and view at a high-level data sources going into VA, and then to do some user-driven cleaning-up of the data, enhancing it and enriching it before committing it to the formal dimensional model that Visual Analyzer requires.
In its BICS guise there’s the additional step of having to export the Hadoop data out of your Big Data Appliance or other Hadoop cluster and upload it in the form of files using BICS’s data load or the new Data Sync utility, but when VA comes as part of OBIEE12c in the next twelve months you’ll be able to directly connect to the Hadoop cluster using Impala ODBC and analyse the data directly, in-place.
I’ll be covering more on BICS over the next few weeks, including how I got data from Hadoop into BICS using the new Data Sync utility.