August 11th, 2014 by Mark Rittman
Over the past couple of years Rittman Mead have been broadening our skills and competencies out from core OBIEE, ODI and Oracle data warehousing into the new “emerging” analytic platforms: R and database advanced analytics, Hadoop, cloud and clustered/distributed systems. As we talked about in the recent series of updated Oracle Information Management Reference Architecture blog posts and my initial look at the Oracle Big Data SQL product, our customers are increasingly looking to complement their core Oracle analytics platform with ones to handle unstructured and big data, and as technologists we’re always interesting in what else we can use to help our customers get more insight out of their (total) dataset.
An area we’ve particularly focused on over the past year has been Hadoop and R analysis, with the recent announcement of our partnering with Cloudera and the recruitment of a big data and advanced analytics team operating our of our Brighton, UK office. We’ve also started to work on a number of projects and proof of concepts with customers in the UK and Europe, working mainly with core Oracle BI, DW and ETL customers looking to make their first move into Hadoop and big data. The usual pattern of engagement is for us to engage with some business users looking to analyse a dataset hitherto too large or too unstructured to load into their Oracle data warehouse, or where they recognise the need for more advanced analytics tools such as R, MapReduce and Spark but need some help getting started. Most often we put together a PoC Hadoop cluster for them using virtualization technology on existing hardware they own, allowing them to get started quickly and with no initial licensing outlay, with our preferred Hadoop distribution being Cloudera CDH, the same Hadoop distribution that comes on the Oracle Big Data Appliance. Projects then typically move on to Hadoop running directly on physical hardware, in a couple of cases Oracle’s Big Data Appliance, usually in conjunction with Oracle Database, Oracle Exadata and Oracle Exalytics for reporting.
One such project started off by the customer wanting to analyse a dataset that was too large for the space available in their Oracle database and that they couldn’t easily process or analyse using the SQL-based tools they usually used; in addition, like most large organisations, database and hardware provisioning took a long time and they needed to get the project moving quickly. We came in and quickly put together a virtualised Hadoop cluster together for them, on re-purposed hardware and using the free (Standard) edition of Cloudera CDH4, and then used the trial version of Oracle Big Data Connectors along with SFTP transfers to get data into the cluster and then analysed.
The PoC itself then ran for just over a month with the bulk of the analysis being done using Oracle R Advanced Analytics for Hadoop, an extension to R that allows you to use Hive tables as a data source and create MapReduce jobs from within R itself; the output from the exercise was a series of specific-answer-to-specific-question R graphs that solved an immediate problem for the client, and showed the value of further investment in the technology and our services – the screenshot below shows a typical ORAAH session, in this case analyzing the flight delays dataset that you can also find on the Exalytics server and in smaller form in OBIEE 11g’s SampleApp dataset.
That project has now moved onto a larger phase of work with Oracle Big Data Appliance used as the Hadoop platform rather than VMs, and Cloudera Hadoop upgraded from the free, unsupported Standard version to Cloudera Enterprise. The VMs in fact worked pretty well and had the advantage that they could be quickly spun-up and housed temporarily on an existing server, but were restricted by the RAM that we could assign to each VM – 2GB initially, quickly upgraded to 8GB per VM, and the fact that they were sharing CPU and IO resources. Big Data Appliance, by contrast, has 64GB or RAM per node – something that’s increasingly important now in-memory tools like Impala are begin used – and has InfiniBand networking between the nodes as well as fast network connections out to the wider network, something thats often overlooked when speccing up a Hadoop system.
The support setup for the BDA is pretty good as well; from a sysadmin perspective there’s a lights-out ILOM console for low-level administration, as well as plugins for Oracle Enterprise Manager 12c (screenshot below), and Oracle support the whole package, typically handling the hardware support themselves and delegating to Cloudera for more Hadoop-specific queries. I’ve raised several SRs on client support contracts since starting work on BDAs, and I’ve not had any problem with questions not being answered or buck-passing between Oracle and Cloudera.
One thing that’s been interesting is the amount of actual work that you need to do with the Big Data Appliance beyond the actual installation and initial configuration by Oracle to “on-board” it into the typical enterprise environment. BDAs are left with customers in a fully-working state, but like Exalytics and Exadata though, initial install and configuration is just the start, and you’ve then got to integrate the platform in with your corporate systems and get developers on-boarded onto the platform. Tasks we’ve typically provided assistance with on projects like these include:
- Configuring Cloudera Manager and Hue to connect to the corporate LDAP directory, and working with their security team to create LDAP groups for developer and administrative access that we then used to restrict and control access to these tools
- Configuring other tools such as RStudio Server so that developers can be more productive on the platform
- Putting in place an HDFS directory structure to support incoming data loads and data archiving, as well as directories to hold the output datasets from the analysis work we’re doing – all within the POSIX security setup that HDFS currently uses which limits us to just granting owner, group and world permissions on directories
- Working with the client’s infrastructure team on things like alerting, troubleshooting and setting up backup and recovery – something that’s surprisingly tricky in the Hadoop world as Cloudera’s backup tools only backup from Hadoop-to-Hadoop, and by definition your Hadoop system is going to hold a lot of data, the volume of which your current backup tools aren’t going to easily handle
Once things are set up though you’ve got a pretty comprehensive platform that can be expanded up from the initial six nodes our customers’ systems typically start with to the full eighteen node cluster, and can use tools such as ODI to do data loading and movement, Spark and MapReduce to process and analyse data, and Hive, Impala and Pig to provide end-user access. The diagram below shows a typical future-state architecture we propose for clients on this initial BDA “starter config” where we’ve moved up to CDH5.x, with Spark and YARN generally used as the processing framework and with additional products such as MongoDB used for document-type storage and analysis:
Something that’s turned out to be more of an issue on projects than I’d originally anticipated is complying with corporate security policies. By definition, most customers who buy an Oracle Big Data Appliance and going to be large customers with an existing Oracle database estate, and if they deal with the public they’re going to have pretty strict security and privacy rules you’ll need to adhere to. Something that’s surprising therefore to most customers new to Hadoop is how insecure or at least easily compromised the average Hadoop cluster is, with Hadoop FS shell security relying on trusted networks and incoming user connections and interfaces such as ODBC not checking passwords at all.
Hadoop and the BDA only becomes what’s termed “secure” when you link it to a Kerebos server, but not every customer has Kerebos set up and unless you enable this feature right at the start when you set up the BDA, it’s a fairly involved task to add retrospectively. Moreover, customers are used to fine-grained access control to their data, a single security model over their data and a good understanding in their heads as to how security works on their database, whereas Hadoop is still a collection of fairly-loosely coupled components with pretty primitive access controls, and no easy way to delete or redact data, for example, when a particular country’s privacy laws in-theory mandate this.
Like everything there’s a solution if you’re creative enough, with tools such as Apache Sentry providing role-based access control over Hive and Impala tables, alternative storage tools like HBase that permit read, write, update and delete operations on data rather than just HDFS’s insert and (table or partition-level) delete, and tools like Cloudera Navigator and BDA features like Oracle Audit Vault that provide administrators with some sort of oversight as to who’s accessing what data and when. As I mentioned in my blog post a couple of weeks ago, Oracle’s Big Data SQL product addresses this requirement pretty well, potentially allowing us to apply Oracle security over both relational, and Hadoop, datasets, but for now we’re working within current CDH4 capabilities and planning on introducing Apache Sentry for role-based access control to Hive and Impala in the coming weeks. We’re also looking at implementing Cloudera’s “secure gateway” cluster topology with all access restricted to just a single gateway Hadoop node, and the cluster itself firewalled-off with external access to just that gateway node and HTTP / REST API access to the various cluster services, for example as shown in the diagram below:
My main focus on Hadoop projects has been on the overall Hadoop system architecture, and interacting with the client’s infrastructure and security teams to help them adopt the BDA and take over its maintenance. From the analysis side, it’s been equally as interesting, with a number of projects using tools such as R, Oracle R Advanced Analytics for Hadoop and core Hive/MapReduce for data analysis, Flume, Java and Python for data ingestion and processing, and most recently OBIEE11g for publishing the results out to a wider audience. Following the development model that we outlined in the second post in our updated Information Management Reference Architecture blog series, we typically split delivery of each project’s output into two distinct phases; a discovery phase, typically done using RStudio and Oracle R Advanced Analytics for Hadoop, where we explore and start understanding the dataset, presenting initial findings to the business and using their feedback and direction to inform the second phase; and a second, commercial exploitation phase where we use the discovery phases’ outputs and models to drive a more structured dimensional model with output begin in the form of OBIEE analyses and dashboards.
We looked at several options for providing the datasets for OBIEE to query, with our initial idea being to connect OBIEE directly to Hive and Impala and let the users query the data in-place, directly on the Hadoop cluster, with an architecture like the one in the diagram below:
In fact this turned out to not be possible, as whilst OBIEE 220.127.116.11 can access Apache Hive datasources, it currently only ships with HiveServer1 ODBC support, and no support for Cloudera Impala, which means we need to wait for a subsequent release of OBIEE11g to be able to report against the ODBC interfaces provided by CDH4 and CDH5 on the BDA (although ironically, you can get HiveServer2 and Impala working on OBIEE 18.104.22.168 on Windows, though this platform isn’t officially supported by Oracle for Hadoop access, only Linux). Whichever way though, it soon became apparent that even if we could get Hive and Impala access working, in reality it made more sense to use Hadoop as the data ingestion and processing platform – providing access to data analysts at this point if they wanted access to the raw datasets – but with the output of this then being loaded into an Oracle Exadata database, either via Sqoop or via Oracle Loader for Hadoop and ideally orchestrated by Oracle Data Integrator 12c, and users then querying these Oracle tables rather than the Hive and Impala ones on the BDA, as shown in the diagram below.
In-practice, Oracle SQL is far more complete and expressive than HiveQL and Impala SQL and it makes more sense to use Oracle as the query platform for the vast majority of users, with data analysts and data scientists still able to access the raw data on Hadoop using tools like Hive, R and (when we move to CDH5) Spark.
The final thing that’s been interesting about working on Hadoop and Big Data Appliance projects is that 80% of it, in my opinion, is just the same as working on large enterprise data warehouse projects, with 20% being “the magic”. A large portion of your time is spent on analysing and setting up feeds into the system, just in this case you use tools like Flume instead of GoldenGate (though GoldenGate can also load into HDFS and Hive, something that’s useful for transactional database data sources vs. Flume’s focus on file and server log data sources). Another big part of the work is data processing, ingestion, reformatting and combining, again skills an ETL developer would have (though there’s much more reliance, at this point, on command-line tools and Unix utilities, albeit with a place for tools like ODI once you get to the set-based filtering, joining and aggregating phase). In most cases, the output of your analysis and processing will be Hive and Impala tables so that results can be analysed using tools such as OBIEE, and you therefore need skills in areas such as dimensional modelling, business analysis and dashboard prototyping as well as tool-specific skills such as OBIEE RPD development.
Where the “magic” happens, of course, is the data preparation and analysis that you do once the data is loaded, quite intensively and interactively in the discovery phase and then in the form of MapReduce and Spark jobs, Sqoop loads and Oozie workflows once you know what you’re after and need to process the data into something more tabular for tools like OBIEE to access. We’re building up a team competent in techniques such as large-scale data analysis, data visualisation, statistical analysis, text classification and sentiment analysis, and use of NoSQL and JSON-type data sources, which combined with our core BI, DW and ETL teams allows us to cover the project from end-to-end. It’s still relatively early days but we’re encouraged by the response from our project customers so far, and – to be honest – the quality of the Oracle big data products and the Cloudera platform they’re based around – and we’re looking forward to helping other Oracle customers get the most out of their adoption of these new technologies.
If you’re an Oracle customer looking to make their first move into the worlds of Hadoop, big data and advanced analytics techniques, feel free to drop me an email at email@example.com for some initial advice and guidance – the fact we come from an Oracle-centric background as well typically makes it easier for us to relate these new concepts to the ones you’re typically more familiar with. Similarly, if you’re about to bring on-board an Oracle Big Data Appliance system and want to know how best to integrate it in with your existing Oracle BI, DW, data integration and systems management estate, get in contact and I’d be happy to share experiences and our delivery approach.