Big Data

Taking a Look at Oracle Big Data Preparation Cloud Service - Spark-Based Data Transformation in the Cloud

Mark Rittman

Sep 27, 2015 • 6 min read

One of the sessions I’m delivering at the upcoming Oracle Openworld 2015 in San Francisco is entitled "Oracle Business Intelligence Cloud Service—Moving Your Complete BI Platform to the Cloud [UGF4906]”, and looks at how you can now migrate your entire OBIEE11g platform into Oracle Public Cloud including data warehouse and data integration routines. Using Oracle BI Cloud Services’ new on-premise RPD upload feature you can upload an existing RPD into BICS and run it from there, with the accompanying data warehouse database moving into Oracle’s Database Cloud Service (and with the restriction that you can’t then edit the repository within BICS, you need to do that on-premise and upload again). For ETL and data integration you can carry on using ODI12c which now has the ability to load data into Oracle Storage Cloud (for file sources) and BICS (via a REST API) as well as the full Oracle DBaaS, but another Oracle option for doing purely cloud-based data processing enrichment has recent become available - Oracle Big Data Preparation Cloud Service. So what is it, how does it work and how is it different to ODI12c?

Oracle Big Data Preparation Cloud Service (“BDP”) is a thin-client application within Oracle Cloud for ingesting, preparing and enriching datasets that don’t have a predefined schema and may well need certain fields obfuscated or cleansed. Being integrated with Oracle Storage Cloud and other infrastructure and platform services within Oracle cloud it’s obviously aimed mainly at data transformation tasks within the Oracle Cloud enviroment, but you can upload and download datasets from your browser for use with on-premise applications. Unlike the more general-purpose Oracle Data Integrator it’s aimed instead at a particular use-case - non-technical information analysts who need to get data transformed, wrangled and enriched before they can make use of it in an environment like Hadoop. In fact the product name is a bit misleading - it runs on a big data platform within Oracle Cloud and like Oracle Big Data Discovery uses Apache Spark for its data processing - but it could potentially be useful for a business analyst to prepare data for loading into Oracle BI Cloud Service, and I’ll cover this angle when I talk about data loading options in by Oracle Openworld session.

Within a logical architecture for a typical big data DW and BI system, BDP sits alongside ODI within the Data Factory and provides self-service, agile transformation capabilities to more business-orientated users.

Oracle Big Data Cloud Preparation Service shares quite a bit of functionality and underlying technology, with Oracle Big Data Discovery - both run on Hadoop, they both use Apache Spark for data processing and transformation, and both offer data transformation and “wrangling” features aimed at non-technical users. Oracle are positioning Big Data Preparation Service as something you’d use in the execution layer of the Oracle Information Management Reference Architecture whereas Big Data Discovery is associated more with the discovery layer - I’d mostly agree but I can see a role for BDD even within the execution layer, as a front-end to the data reservoir that typically now runs alongside relationally-stored data warehouses.

Looking back at the slides from one of the recent Strata conferences, for example, sees Oracle positioning BDP as the “operational data preparation” tool for structured and unstructured data - with no defined schema - coming into your information platform, with the enriched output then being used BI tools, enterprise reporting and data discovery tools.

Apart from the scalability benefits of running BDP on Apache Spark, the other interesting feature in BDP is how it uses Spark’s machine learning capabilities to try to automate as much of the data preparation process as possible, for example detecting credit card numbers in data fields and recommending you obfuscate that column. Similar to BICS and how Oracle have tried to simplify the process of creating reports and dashboards for a small team, BDP runs in the cloud tries to automate and simplify as much of the data preparation and enrichment process as possible, with ODI12c still available for ETL developers to develop more complex transformations.

The development lifecycle for BDP (from the Oracle Big Data Preparation Cloud Service e-book on Oracle’s website) uses a cycle of ingesting, cleaning, enriching and then publishing data using scripts authored using the tool and run on the Apache Spark platform. The diagram below shows the BDP development lifecycle from Oracle’s Big Data Preparation Cloud Service Handbook, and shows how ingestion, enrichment, publishing and governance go in a cycle with the common foundation of the transformation scripts that you build using BDP’s web interface.

So let’s walk through an example data preparation exercise using a file of data stored initially in Oracle Storage Cloud Service. After logging into BDP via Oracle Cloud you’re first presented with the Catalog view, listing out all your previous transformations and showing you when they were last used to process some data.

To create a transformation you first give it a name, then select the data source and then the file you’re interested in. In my environment I’ve got Oracle Storage Cloud and HDFS available as my main data sources, or I could upload a file from my desktop and start from there.

BDP then ingests the file and then uses its machine learning features to process and classify data in each column, recommending column names such as “gender”, “city” and cc_number based on (presumably) some sort of classification model. In the screenshot below you can see a set of these recommendations on the left-hand side of the screen, with the columns themselves listed centre and a summary of the file profiling on the right.

Taking a closer look at the profile results panel you can see two of the columns have alerts raised, in red. Clicking on the alert shows that the two columns have credit card data stored in clear text, with the recommendation being to obfuscate or otherwise secure these fields. Clicking on a field then shows the various transformation options, with the obvious choice here being to automatically obfuscate the data in those fields.

Once you’ve worked through all the recommendations and added any transformations you choose to add yourself, the final step is to publish your transformation to one of the available targets. In the example below we’ve got Oracle Storage Cloud and HDFS again as potential targets; I’d imagine Oracle will add a connector to BICS soon, for example, so that you can use BDP as a data prep tool for file data that will then be added to your dataset in BICS.

So … it’ll be interesting to see where this one goes. Its interesting that Oracle have split out data preparation and data discovery into two tools whilst others are saying theirs can do both, and you’ll still need ODI for the more complex integration jobs. But I like the innovative use of machine learning to do away with much of the manual work required for classification of incoming data fields, and running the whole thing on Spark certainly gives it the potential of scale. A couple of years ago I was worried Oracle didn’t really have a strategy for data integration and ETL in the cloud, but we’re starting to see something happen now.

There’s a big push from the Oracle field at the moment to move customers into the cloud, and I can see BDP getting bundled in with Big Data Cloud Service and BICS as the accompanying cloud data preparation tool. The danger then of course is that Big Data Discovery starts to look less useful, especially with Visual Analyzer already available within BICS and coming soon on-premise with OBIEE12c. My guess is that what we’re seeing now with these initial releases of BDP and BDD is just the start, with BDP adding more automatic enrichment “smarts” and starting to cover integration use-cases too, whilst BDD will put more focus on data visualization and analytics on the data reservoir.

Sign up for more like this.