Rittman Mead in the OTN TOUR Latin America 2014

August 26th, 2014

Another OTN Tour Latin America has come and gone. This is the most important technical event in the region, visiting 12 countries and drawing more than 2,500 attendees over two weeks.

This year Rittman Mead was part of the OTN Tour in Buenos Aires (Argentina) and Montevideo (Uruguay) presenting about ODI and OGG.

We started in Buenos Aires on August 11 for the first day of the OTN Tour in Argentina. I talked about the integration of ODI and OGG 12c, explaining all the technical details of how to configure and implement it. Most of the attendees hadn’t worked with these tools (but were curious about them), so I personalised the presentation a little, giving them an introduction to ODI and OGG first.

As vice-president of the UYOUG (Uruguayan Oracle User Group) I’m part of the organisation of the OTN Tour in my country, so we needed to come back that same Monday to sort out some final details and have everything ready for the event in Uruguay.

Most of the speakers came on Wednesday, and we spent a great day with Michelle Malcher, Kamran Agayev, Hans Forbrich and Mike Dietrich. First, we went to lunch at the Mercado del Puerto, an emblematic place with lots of “parrillas” (a kind of barbecue), and then we gave them a little city tour which included a visit to El Cerro de Montevideo. Finally we visited one of the most important wineries in Uruguay, Bodega Bouza, where we had a wine tour followed by an amazing tasting of a variety of wines including Tannat, our signature grape. You know… it is important to be relaxed before a conference :-)



The first day of the event in Uruguay was dedicated exclusively to technical sessions, and on the second day we had the hands-on labs. The conference covered a wide range of topics, from BI Mobile and E-Business Suite to upgrading to Oracle Database 12c, Oracle Virtualization and Oracle RAC. All the sessions were packed with attendees.


The next day, we had labs with PCs with the software already installed, but attendees could also come with their own laptops and install everything needed for the hands-on. We had the famous RAC Attack!, led by Kamran with the help of the ninjas Michelle, Hans and Nelson Calero, and an Oracle Virtualization lab by Hernan Petitti that ran for 7 hours!


It was a great event. You can see more pictures here and download the presentations here. The attendees as well as all the speakers were really happy with the result. And so were we.

This is only the beginning for Rittman Mead in Latin America. There are a lot of things to come, so stay tuned!

Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations

August 26th, 2014

If you’ve been following our postings on the blog over the past year, you’ll probably have seen quite a lot of activity around big data and Hadoop and in particular, what these technologies bring to the world of Oracle Business Intelligence, Oracle Data Warehousing and Oracle Data Integration. For anyone who’s not had a chance to read the posts and articles, the three links below are a great introduction to what we’ve been up to:

In addition, we recently took part in an OTN ArchBeat podcast with Stewart Bryson and Andrew Bond on the updated Oracle Information Management Reference Architecture we co-developed with Oracle’s Enterprise Architecture team, where you can hear me talk with Stewart and Andrew about how the updated architecture came about, the thinking behind it, and how concepts like the data reservoir and data factory can be delivered in an agile way.

I’m also pleased to be delivering a number of presentations and seminars over the next few months, on Oracle and Cloudera’s Hadoop technology and how it applies to Oracle BI, DW and DI developers – if you’re part of a local Oracle user group and you’d like me to deliver one of them for your group, drop me an email at mark.rittman@rittmanmead.com.

Slovenian Oracle User Group / Croatian Oracle User Group Conferences, October 2014

These two events run over consecutive days in Slovenia and Croatia, and I’m delivering the keynote at each on Analytics and Big Data, and a one-day seminar running on the Tuesday in Slovenia, and over the Wednesday and Thursday in Croatia. The theme of the seminar is around applying Hadoop and big data technologies to Oracle BI, DW and data integration, and is made up of four sessions:

Part 1 : Introduction to Hadoop and Big Data Technologies for Oracle BI & DW Developers

“In this session we’ll introduce some key Hadoop concepts including HDFS, MapReduce, Hive and NoSQL/HBase, with the focus on Oracle Big Data Appliance and Cloudera Distribution including Hadoop. We’ll explain how data is stored on a Hadoop system and the high-level ways it is accessed and analysed, and outline Oracle’s products in this area including the Big Data Connectors, Oracle Big Data SQL, and Oracle Business Intelligence (OBI) and Oracle Data Integrator (ODI).”

Part 2 : Hadoop and NoSQL Data Ingestion using Oracle Data Integrator 12c and Hadoop Technologies

“There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.”

Part 3 : Big Data Analysis using Hive, Pig, Spark and Oracle R Enterprise / Oracle R Advanced Analytics for Hadoop

“Data within a Hadoop cluster is typically analysed and processed using technologies such as Pig, Hive and Spark before being made available for wider use using products like Oracle Big Data SQL and Oracle Business Intelligence. In this session, we’ll introduce Pig and Hive as key analysis tools for working with Hadoop data using MapReduce, and then move on to Spark as the next-generation analysis platform typically being used on Hadoop clusters today. We’ll also look at the role of Oracle’s R technologies in this scenario, using Oracle R Enterprise and Oracle R Advanced Analytics for Hadoop to analyse and understand larger datasets than we could normally accommodate with desktop analysis environments.”

Part 4 : Visualizing Hadoop Datasets using Oracle Business Intelligence, Oracle BI Publisher and Oracle Endeca Information Discovery

“Once insights and analysis have been produced within your Hadoop cluster by analysts and technical staff, it’s usually the case that you want to share the output with a wider audience in the organisation. Oracle Business Intelligence has connectivity to Hadoop through Apache Hive compatibility, and other Oracle tools such as Oracle BI Publisher and Oracle Endeca Information Discovery can be used to visualise and publish Hadoop data. In this final session we’ll look at what’s involved in connecting these tools to your Hadoop environment, and also consider where data is optimally located when large amounts of Hadoop data need to be analysed alongside more traditional data warehouse datasets.”

Oracle OpenWorld 2014 (ODTUG Sunday Symposium), September 2014

Along with another session later in the week on the upcoming Oracle BI Cloud Services, I’m doing a session on the User Group Sunday for ODTUG on ODI12c and the Big Data Connectors for ETL on Hadoop:

Deep Dive into Big Data ETL with Oracle Data Integrator 12c and Oracle Big Data Connectors [UGF9481]

“Much of the time required to work with big data sources is spent in the data acquisition, preparation, and transformation stages of a project before your data reaches a state suitable for analysis by your users. Oracle Data Integrator, together with Oracle Big Data Connectors, provides a means to efficiently load and unload data to and from Oracle Database into a Hadoop cluster and perform transformations on the data, either in raw form or in technologies such as Apache Hive or R. This presentation looks at how Oracle Data Integrator can form the centerpiece of your big data ETL strategy, within either a custom-built big data environment or one based on Oracle Big Data Appliance.”

UK Oracle User Group Tech’14 Conference, December 2014

I’m delivering an extended version of my OOW presentation at the UKOUG Tech’14 “Super Sunday” event, this time over 90 minutes rather than the 45 at OOW, giving me a bit more time for demos and discussion:

Deep-Dive into Big Data ETL using ODI12c and Oracle Big Data Connectors

“Much of the time required to work with Big Data sources is spent in the data acquisition, preparation and transformation stages of a project, before your data is in a state suitable for analysis by your users. Oracle Data Integrator, together with Oracle Big Data Connectors, provides a means to efficiently load and unload data from Oracle Database into a Hadoop cluster, and perform transformations on the data either in raw form or in technologies such as Apache Hive or R. In this presentation, we will look at how ODI can form the centrepiece of your Big Data ETL strategy, either within a custom-built Big Data environment or one based on Oracle Big Data Appliance.”

Oracle DW Global Leaders’ Meeting, Dubai, December 2014

The Oracle DW Global Leaders forum is an invite-only group organised by Oracle and attended by select customers and associate partners, one of which is Rittman Mead. I’ll be delivering the technical seminar at the end of the second day, which will run over two sessions and will be based on the main points from the one-day seminars I’m running in Croatia and Slovenia.

From Hadoop to dashboards, via ODI and the BDA – the complete trail : Part 1 and Part 2

“Join Rittman Mead for this afternoon workshop, taking you through data acquisition and transformation in Hadoop using ODI, Cloudera CDH and Oracle Big Data Appliance, through to reporting on that data using OBIEE, Endeca and Oracle Big Data SQL. Hear our project experiences, and tips and techniques based on real-world implementations.”

Keep an eye out for more Hadoop and big data content over the next few weeks, including a look at MongoDB and NoSQL-type databases, and how they can be used alongside Oracle BI, DW and data integration tools.


Smoothing the Transition – The New Smart View for Microsoft Office and OBIEE

August 15th, 2014


There’s a good chance that, if you’re reading this, you perform some reporting, analytics or data stewardship role, or some combination of all three. And be it for a large corporation or a small company, there are likely standards and practices that govern how those jobs are performed on a day-to-day basis; not easily changed, and perpetually validated by big budgets and long careers. It is equally likely that deeply ingrained within these reporting practices lies some moderate to heavy use of Excel. It wasn’t long ago that I found myself using the spreadsheet program daily, for hours upon hours at a time.

What this essentially amounted to:

  • Pulling down large amounts of data from our department’s data model using large SQL queries that could themselves take most of the day to put together, let alone waiting on the query to yield results, which could easily warrant a bathroom break, a phone call or, if you were feeling adventurous, catching up on email
  • Validating your results
  • Exporting to Excel (key step here!)
  • Massaging and formatting your data by implementing innumerable and often unwieldy functions that deserved their own slot on your schedule for the day to figure out
  • Proofing your analysis so that it got to management in ship shape
  • Hoping that the figures from an analyst in another department, who used the same metric on their report and would be at the same meeting, actually coincided with yours


Fast forward a bit and I’m sitting here, writing this blog as a sort of proverbial white flag in the great battle between Excel and the behemoth that is OBIEE. And just what is this white flag? Why, it’s Oracle’s most recent iteration of Smart View, which provides expanded functionality and support for the Microsoft Office suite of programs. Namely, its golden boy, Excel. That’s right, Excel, the darling of office staff everywhere, the program upon which empires rise and fall. To paraphrase a figure from www.cfo.com, some 64% of public and private companies still use Excel and other “manual” solutions to perform their finance functions. So, in the world of the spreadsheet, when does it make sense to cross that blurry line from cell to subject area? Smart View now makes answering that question much easier. It seems that Oracle has really gotten a grasp on the formatting shortcomings of the previous version and made up for them in spades. Or so, at least, they claim.

The Test Run – OBIEE to Excel 

The example below illustrates a simple import via Smart View. I generated a dashboard in Answers which mimics an Excel design I found online. Thank the good folks over at www.chandoo.org for their excellent skills in Excel dashboarding and for providing plenty of great examples. The dashboard contains a table with a selection of KPIs that the user may then choose to sort on via a View Selector (each view has been sorted on a different KPI and is on a different Compound Layout). Upon selecting a KPI, the analysis will then display the Top 10 products by the KPI selected. In addition, the table contains conditional formatting which simply alerts users to the variance between different KPIs and their targets. Lastly, there is a scatter plot view which displays our Product dimension as seen through the lens of Revenue and Quantity. Per the most recent Oracle documentation, we shouldn’t have any trouble including the current selections of a dashboard prompt either. Let’s see how it performs when we move it over to Excel.

 “OBIEE report and page prompts are fully supported as part of the import process. Dashboards can be imported through Oracle Smart View on a per page basis or the entire dashboard. Prompts are applied at the current state of the logged in user. Future releases of the product will support dashboard prompts directly through Microsoft Office.”


The Results

And there you have it! Excel displays our table and graph views as per the most recent selection from the dashboard prompt. But wait! Our conditional formatting seems to be missing; to prove the point, this is even the case when exporting directly from the analysis view as an Excel workbook.


Conditional Formatting

For our second scenario, let’s see how Excel handles a simpler, heat-map style of conditional formatting. I’ve made a simple table on our dashboard that measures Revenue, Quantity Sold, and the Average Order in $. I set up conditional formatting around the Average Order measure to see how Excel handles importing the color scheme for the currently selected Time parameters on the dashboard.

By contrast, we see that the simpler, heat-map style of conditional formatting has been preserved when imported from OBIEE through Smart View. So perhaps it is Excel’s lack of a corresponding graphic in the previous example that caused the migration snafu? OBIEE doesn’t even seem to render our arrow graphics as per the documentation.

“Oracle BI Customizations and View Standards – The Import of Oracle BI content can leverage the customizations and view standards used within an OBIEE environment. All view designed modifications such as conditional formatting, background colors or data configuration is automatically translated to the Microsoft Office environment.”




Excel to OBIEE

Let’s see what the latest edition of Smart View offers when moving an analysis from Excel to OBIEE.
Because we weren’t able to import our full table view, why don’t we construct it using the View Designer? The interface looks clean and provides an intuitive approach to producing basic Answers views. Accessing our subject area, I simply selected the columns that matched those on our Answers analysis. After clicking ‘OK’, sorting on our Revenue column from largest to smallest and doing a little deleting, we have a pseudo ‘Top 10’ analysis by Revenue. Given the aesthetic attributes of our Answers analysis, let’s see how we’re going to replicate this in Excel.





After selecting the table, we can navigate to the Design tab under ‘Table Tools’ and select an alternating grey scheme, which gives us the equivalent of the ‘Enable Alternate Styling’ design option. Now let’s add some formulas and conditional formatting that will give us our calculated column equivalents. We can insert two columns, one between Revenue and Target and one between Qty and Target, to make room for conditional formatting and Excel’s Icon Sets feature. We then create a simple formula that subtracts Revenue and Quantity from their respective targets in the column between the two, assign conditional formatting and voila! Excel even has a check box that lets you show the arrow only.




From Excel, we can select Publish View to deposit our analysis into our shared folder. The results indicate a sort of ‘two-way street’ between Smart View and Excel. Neither totally supports the formatting capabilities of the other, although Smart View seems to be gaining ground with every new release. In this blog, we’ve taken a look at how Smart View handles some mildly complex conditional formatting and what it takes to replicate this feature in native Excel. In a user environment where reports are flying back and forth between the two platforms, Smart View definitely makes sense; however, it might be advisable to simply deliver the minimum of what is needed and let the end user make any formatting-based modifications. After all, who would want to do all that work only to have it lost in translation?

The Business Value In Training

August 11th, 2014

One of the main things I get asked to do here at Rittman Mead is deliver the OBIEE front-end training course (TRN 202). This is a great course that has served both us and our clients well over the years. It has always been in high demand and always delivered with great feedback from those in attendance. However, as with all things in life and business, there is always room for improvement and opportunities to provide even more value to our clients. Of all the feedback I receive from delivering the course, my favorite is that we do an incredible job of delivering the content and providing real business scenarios for how we have used the tool in the consulting field. Attendees will, 100% of the time, ask me how a feature works and how I have used it with current and former clients.

This year at KScope ’14 in Seattle, we were asked to deliver a 2-hour front-end training course. Our normal front-end course runs over two days and covers just about every feature you can use, all the way from Answers and Dashboards to BI Publisher. Before the invitation to KScope ’14, we had been toying with the idea of delivering a course that not only teaches attendees how to navigate OBIEE and use its features, but also emphasizes the business value behind why those features exist in the first place. We felt that too often users are given a quick overview of what the tool includes, but are left to figure out on their own how to extract the most value. It is one thing to create a graph in Answers, and another to know what the best graph to use might be. So in preparation for the KScope session, we decided to build the content around not only how to develop in OBIEE, but also why, as a business user, you would choose one layout/graph/feature over another. As you would expect, the turnout for the session was fantastic: over 70 people pre-registered, with another 10 on the waiting list. This was proof that there is as pressing a need to pull business value out of the tool as there is to simply learn how to use it. We were so encouraged by the attendance and feedback from this event that we spent the next several weeks developing what we call the “Business Enablement Bootcamp”. It is a 3-day course that will cover Answers, Dashboards, Action Framework, BI Publisher, and the new Mobile App Designer. This is an exciting time for us in that we not only get to show people how to use all of the great features built into the tool, but also to incorporate years of consulting experience and hundreds of client engagements right into the content. Below I have listed a breakdown of the material and the value it will provide.


Answers

Whenever we deliver our OBIEE 5-day bootcamp, which covers everything from infrastructure to the front end, Answers is one of the key components that we teach. Answers is the building block for analysis in OBIEE. While this portion of the tool is relatively intuitive to get started with, there are so many valuable nuances and settings that can get overlooked without proper instruction. In order to get the most out of the tool, a business user needs to be able not only to create basic analyses, but to use many of the advanced features such as hierarchical columns, master-detail linking, and selection steps. Knowing how and why to use these features is a key component to gaining valuable insight for your business users.


Dashboards

This one in particular is dear to my heart. To create an analysis and share it on a dashboard is one thing, but to tell a particular story with a series of visualizations strategically placed on a dashboard is something entirely different. Like anything else in business intelligence, optimal visualization and best practices are learned skills that take time and practice. Valuable skills like making the most of your white space, choosing the correct visualizations, and formatting will be covered. When you provide your user base with the knowledge and skills to tell the best story, there will be no time wasted on clumsy iterations and guesswork as to the best way to present your data. This training will provide some simple parameters to work within, so that users can quickly gather requirements and develop dashboards with more polish and relevance than ever before.


Action Framework

Whenever I deliver any form of front-end training, I always feel like this piece of OBIEE is either overlooked, undervalued, or both. This is because most users are either unaware of it, or don’t have a clear idea of its value and functionality. It’s as if it is viewed as an add-on, in the sense that it is simply a nice extra feature. The Action Framework is something that, once users are properly taught how to navigate it and shown its value, becomes an invaluable piece of the stack. In order to get the most out of your catalog, users need to be shown how to strategically place action links to give the ability to drill across to other analyses and add more context for discovery. These are just a few of the capabilities within the Action Framework that, when users are shown how and when to use them, can add valuable insight (not to mention convenience) to an organization.

BI Publisher/Mobile App Designer

Along with the Action Framework, this particular piece of the tool has a tendency to get overlooked, or simply to give users cold feet about implementing it to complement Answers. I actually would have agreed with these feelings before the latest release. Before this release, a user would need a pretty advanced knowledge of data modeling. However, users can now simply pick any subject area and use the report creation wizard to be off and running, creating pixel-perfect reports in no time. Also, the new Mobile App Designer on top of the Publisher platform is another welcome addition to the tool. Being the visual person that I am, I think this is where this pixel-perfect tool really shines. Objects just look a lot more polished right out of the box, without having to spend a lot of time formatting the way you would have to in Answers. During training, attendees will be exposed to many of the new features within BIP and MAD, as well as how to use them to complement Answers and dashboards.

Third Party Visualizations

While having the ability to implement third-party visualizations like D3 and Flot in OBIEE is more of an advanced skill, the market and the need for this seem to be growing. While Oracle has done some good things in past releases with new visualizations like performance tiles and waterfall charts, we all know that business requirements can be demanding at times and may require going elsewhere to appease the masses. You can visit https://github.com/mbostock/d3/wiki/Gallery to see some of the visualizations available beyond what ships with OBIEE. During training, attendees will learn when and why external visualizations might be useful, as well as getting a high-level view of how they can be implemented.


Users often make the mistake of viewing each piece of the front-end stack as a separate entity, and without proper training this is very understandable. Even though they are separate pieces of the product, they are all meant to work together and enhance the “business intelligence” of an organization. Without training the business on how each piece complements the others, it will always be viewed as just another frustrating tool that they don’t have enough time to learn on their own. This tool is meant to empower your organization with everything it needs to make the most informed and timely decisions; let us use our experience to enable your business.

Rittman Mead and Oracle Big Data Appliance

August 11th, 2014

Over the past couple of years Rittman Mead have been broadening our skills and competencies out from core OBIEE, ODI and Oracle data warehousing into the new “emerging” analytic platforms: R and database advanced analytics, Hadoop, cloud and clustered/distributed systems. As we talked about in the recent series of updated Oracle Information Management Reference Architecture blog posts and my initial look at the Oracle Big Data SQL product, our customers are increasingly looking to complement their core Oracle analytics platform with ones that can handle unstructured and big data, and as technologists we’re always interested in what else we can use to help our customers get more insight out of their (total) dataset.

An area we’ve particularly focused on over the past year has been Hadoop and R analysis, with the recent announcement of our partnering with Cloudera and the recruitment of a big data and advanced analytics team operating out of our Brighton, UK office. We’ve also started to work on a number of projects and proofs of concept with customers in the UK and Europe, working mainly with core Oracle BI, DW and ETL customers looking to make their first move into Hadoop and big data. The usual pattern of engagement is for us to work with a group of business users looking to analyse a dataset hitherto too large or too unstructured to load into their Oracle data warehouse, or who recognise the need for more advanced analytics tools such as R, MapReduce and Spark but need some help getting started. Most often we put together a PoC Hadoop cluster for them using virtualization technology on existing hardware they own, allowing them to get started quickly and with no initial licensing outlay, with our preferred Hadoop distribution being Cloudera CDH, the same Hadoop distribution that comes on the Oracle Big Data Appliance. Projects then typically move on to Hadoop running directly on physical hardware, in a couple of cases Oracle’s Big Data Appliance, usually in conjunction with Oracle Database, Oracle Exadata and Oracle Exalytics for reporting.

One such project started with the customer wanting to analyse a dataset that was too large for the space available in their Oracle database, and that they couldn’t easily process or analyse using the SQL-based tools they usually used; in addition, like most large organisations, database and hardware provisioning took a long time and they needed to get the project moving quickly. We came in and quickly put together a virtualised Hadoop cluster for them, on re-purposed hardware and using the free (Standard) edition of Cloudera CDH4, and then used the trial version of Oracle Big Data Connectors along with SFTP transfers to get data into the cluster ready for analysis.
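The exact landing mechanism varies from project to project, but as a rough sketch of that “SFTP transfers to get data into the cluster” step (the landing directory, HDFS target and file pattern below are all hypothetical), files arriving over SFTP can be pushed into HDFS using the standard Hadoop FS shell, wrapped here in a small Python script:

    import glob
    import subprocess

    SFTP_LANDING = "/data/sftp_landing"         # hypothetical directory the SFTP server writes to
    HDFS_INCOMING = "/user/etl/incoming/sales"  # hypothetical HDFS target directory

    def run(cmd):
        """Run a Hadoop FS shell command, failing loudly if it errors."""
        print(" ".join(cmd))
        subprocess.check_call(cmd)

    # Create the target HDFS directory if it doesn't already exist
    run(["hadoop", "fs", "-mkdir", "-p", HDFS_INCOMING])

    # Copy each new extract file from the local landing area into HDFS
    for path in glob.glob(SFTP_LANDING + "/*.csv"):
        run(["hadoop", "fs", "-put", path, HDFS_INCOMING])

    # Quick sanity check of what landed
    run(["hadoop", "fs", "-ls", HDFS_INCOMING])

In a real engagement this step would usually end up owned by ODI or Flume rather than a hand-rolled script, but it shows the shape of the file-based route into the cluster.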


The PoC itself then ran for just over a month with the bulk of the analysis being done using Oracle R Advanced Analytics for Hadoop, an extension to R that allows you to use Hive tables as a data source and create MapReduce jobs from within R itself; the output from the exercise was a series of specific-answer-to-specific-question R graphs that solved an immediate problem for the client, and showed the value of further investment in the technology and our services – the screenshot below shows a typical ORAAH session, in this case analyzing the flight delays dataset that you can also find on the Exalytics server and in smaller form in OBIEE 11g’s SampleApp dataset.
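To give a feel for the “specific answer to a specific question” style of output, here is a loose analogue of that kind of analysis. The real work was done in R with ORAAH against Hive tables, so this plain-Python version, which assumes a small local extract of the flight delays data with hypothetical column names, is only meant to illustrate the shape of the question being asked:

    import pandas as pd
    import matplotlib
    matplotlib.use("Agg")              # render to a file; no display needed on a server
    import matplotlib.pyplot as plt

    # Hypothetical local extract of the flight delays dataset
    flights = pd.read_csv("flight_delays_extract.csv")

    # One specific question: which carriers have the worst average departure delay?
    avg_delay = (flights.groupby("CARRIER")["DEP_DELAY"]
                        .mean()
                        .sort_values(ascending=False))

    avg_delay.plot(kind="bar", title="Average departure delay by carrier (minutes)")
    plt.tight_layout()
    plt.savefig("avg_delay_by_carrier.png")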


That project has now moved on to a larger phase of work, with Oracle Big Data Appliance used as the Hadoop platform rather than VMs, and Cloudera Hadoop upgraded from the free, unsupported Standard version to Cloudera Enterprise. The VMs in fact worked pretty well and had the advantage that they could be quickly spun up and housed temporarily on an existing server, but they were restricted by the RAM that we could assign to each VM – 2GB initially, quickly upgraded to 8GB per VM – and by the fact that they were sharing CPU and IO resources. Big Data Appliance, by contrast, has 64GB of RAM per node – something that’s increasingly important now that in-memory tools like Impala are being used – and has InfiniBand networking between the nodes as well as fast network connections out to the wider network, something that’s often overlooked when speccing up a Hadoop system.

The support setup for the BDA is pretty good as well; from a sysadmin perspective there’s a lights-out ILOM console for low-level administration, as well as plugins for Oracle Enterprise Manager 12c (screenshot below), and Oracle support the whole package, typically handling the hardware support themselves and delegating to Cloudera for more Hadoop-specific queries. I’ve raised several SRs on client support contracts since starting work on BDAs, and I’ve not had any problem with questions not being answered or buck-passing between Oracle and Cloudera.

One thing that’s been interesting is the amount of actual work that you need to do with the Big Data Appliance, beyond the installation and initial configuration by Oracle, to “on-board” it into the typical enterprise environment. BDAs are left with customers in a fully-working state, but, like Exalytics and Exadata, the initial install and configuration is just the start, and you’ve then got to integrate the platform with your corporate systems and get developers on-boarded onto the platform. Tasks we’ve typically provided assistance with on projects like these include:

  • Configuring Cloudera Manager and Hue to connect to the corporate LDAP directory, and working with their security team to create LDAP groups for developer and administrative access that we then used to restrict and control access to these tools
  • Configuring other tools such as RStudio Server so that developers can be more productive on the platform
  • Putting in place an HDFS directory structure to support incoming data loads and data archiving, as well as directories to hold the output datasets from the analysis work we’re doing – all within the POSIX security setup that HDFS currently uses, which limits us to just granting owner, group and world permissions on directories (a minimal sketch of this kind of setup follows this list)
  • Working with the client’s infrastructure team on things like alerting, troubleshooting and setting up backup and recovery – something that’s surprisingly tricky in the Hadoop world, as Cloudera’s backup tools only back up from Hadoop to Hadoop, and by definition your Hadoop system is going to hold a lot of data, the volume of which your current backup tools aren’t going to easily handle
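As a minimal sketch of the directory structure and permissions work mentioned in the list above (the paths, users and groups are hypothetical, and everything sits within the plain owner/group/world model HDFS currently gives us):

    import subprocess

    def hdfs(*args):
        """Run a single Hadoop FS shell command."""
        subprocess.check_call(["hadoop", "fs"] + list(args))

    # Hypothetical layout: a landing area for incoming loads, an archive area,
    # and an output area for the results of the analysis work
    layout = {
        "/data/incoming": ("etl",     "dataeng",  "770"),  # writable by the load processes only
        "/data/archive":  ("etl",     "dataeng",  "750"),  # read-only for the group
        "/data/output":   ("analyst", "analysts", "775"),  # wider read access for downstream tools
    }

    for path, (owner, group, mode) in layout.items():
        hdfs("-mkdir", "-p", path)
        hdfs("-chown", owner + ":" + group, path)
        hdfs("-chmod", mode, path)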

Once things are set up, though, you’ve got a pretty comprehensive platform that can be expanded from the initial six nodes our customers’ systems typically start with to the full eighteen-node cluster, and that can use tools such as ODI to do data loading and movement, Spark and MapReduce to process and analyse data, and Hive, Impala and Pig to provide end-user access. The diagram below shows a typical future-state architecture we propose for clients on this initial BDA “starter config”, where we’ve moved up to CDH5.x, with Spark and YARN generally used as the processing framework and with additional products such as MongoDB used for document-type storage and analysis:



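To make the Spark part of that processing layer a little more concrete, here is a hedged sketch of the kind of summarisation job that sits in the middle of such an architecture; the HDFS paths and the web log layout are assumptions, and on CDH5 the script would normally be submitted to YARN via spark-submit:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("weblog_page_counts")
    sc = SparkContext(conf=conf)

    # Hypothetical raw web log files landed in HDFS by Flume or the FS shell
    lines = sc.textFile("hdfs:///data/incoming/weblogs/*.log")

    # Keep successful GET requests and count hits per requested page
    hits = (lines.map(lambda line: line.split(" "))
                 .filter(lambda f: len(f) > 8 and f[5] == '"GET' and f[8] == "200")
                 .map(lambda f: (f[6], 1))
                 .reduceByKey(lambda a, b: a + b))

    # Write the summary back to HDFS as simple CSV, ready to expose via a Hive
    # table or to export onwards to Oracle
    (hits.map(lambda kv: kv[0] + "," + str(kv[1]))
         .saveAsTextFile("hdfs:///data/output/weblog_page_counts"))

The summarised output can then be exposed as a Hive or Impala table for end-user access, or exported onwards into Oracle, as discussed later in this post.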
Something that’s turned out to be more of an issue on projects than I’d originally anticipated is complying with corporate security policies. By definition, most customers who buy an Oracle Big Data Appliance are going to be large customers with an existing Oracle database estate, and if they deal with the public they’re going to have pretty strict security and privacy rules you’ll need to adhere to. What therefore surprises most customers new to Hadoop is how insecure, or at least how easily compromised, the average Hadoop cluster is, with Hadoop FS shell security relying on trusted networks, and incoming user connections over interfaces such as ODBC not checking passwords at all.

Hadoop and the BDA only become what’s termed “secure” when you link them to a Kerberos server, but not every customer has Kerberos set up, and unless you enable this feature right at the start when you set up the BDA, it’s a fairly involved task to add retrospectively. Moreover, customers are used to fine-grained access control to their data, a single security model over their data and a good understanding in their heads of how security works on their database, whereas Hadoop is still a collection of fairly loosely-coupled components with pretty primitive access controls, and no easy way to delete or redact data when, for example, a particular country’s privacy laws in theory mandate this.

Like everything, there’s a solution if you’re creative enough, with tools such as Apache Sentry providing role-based access control over Hive and Impala tables, alternative storage tools like HBase permitting read, write, update and delete operations on data rather than just HDFS’s insert and (table or partition-level) delete, and tools like Cloudera Navigator and BDA features like Oracle Audit Vault giving administrators some sort of oversight as to who’s accessing what data and when. As I mentioned in my blog post a couple of weeks ago, Oracle’s Big Data SQL product addresses this requirement pretty well, potentially allowing us to apply Oracle security over both relational and Hadoop datasets, but for now we’re working within current CDH4 capabilities and planning on introducing Apache Sentry for role-based access control to Hive and Impala in the coming weeks. We’re also looking at implementing Cloudera’s “secure gateway” cluster topology, with all access restricted to a single gateway Hadoop node, and the cluster itself firewalled off with external access limited to that gateway node plus HTTP / REST API access to the various cluster services, for example as shown in the diagram below:


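For the Sentry piece mentioned above, the setup largely boils down to defining roles and granting them to existing LDAP or OS groups. A minimal sketch is below; the role, group and database names are hypothetical, and I’m using the impyla client purely as a convenient way to send the statements to Impala; the same statements can equally be run from impala-shell or Beeline:

    from impala.dbapi import connect   # impyla, a Python DB-API client for Impala

    # Hypothetical gateway node running an Impala daemon; the connected user
    # needs Sentry admin rights for these statements to succeed
    conn = connect(host="bda-gateway01", port=21050)
    cur = conn.cursor()

    # Define a role, map it to an existing LDAP/OS group, and grant read-only
    # access to the reporting database
    for stmt in [
        "CREATE ROLE analyst_role",
        "GRANT ROLE analyst_role TO GROUP analysts",
        "GRANT SELECT ON DATABASE reporting TO ROLE analyst_role",
    ]:
        cur.execute(stmt)

    cur.close()
    conn.close()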
My main focus on Hadoop projects has been on the overall Hadoop system architecture, and on interacting with the client’s infrastructure and security teams to help them adopt the BDA and take over its maintenance. From the analysis side, it’s been equally interesting, with a number of projects using tools such as R, Oracle R Advanced Analytics for Hadoop and core Hive/MapReduce for data analysis, Flume, Java and Python for data ingestion and processing, and most recently OBIEE11g for publishing the results out to a wider audience. Following the development model that we outlined in the second post in our updated Information Management Reference Architecture blog series, we typically split delivery of each project’s output into two distinct phases: a discovery phase, typically done using RStudio and Oracle R Advanced Analytics for Hadoop, where we explore and start understanding the dataset, presenting initial findings to the business and using their feedback and direction to inform the second phase; and a commercial exploitation phase, where we use the discovery phase’s outputs and models to drive a more structured dimensional model, with output being in the form of OBIEE analyses and dashboards.


We looked at several options for providing the datasets for OBIEE to query, with our initial idea being to connect OBIEE directly to Hive and Impala and let the users query the data in-place, directly on the Hadoop cluster, with an architecture like the one in the diagram below:


In fact this turned out not to be possible, as whilst OBIEE can access Apache Hive datasources, it currently only ships with HiveServer1 ODBC support, and no support for Cloudera Impala, which means we need to wait for a subsequent release of OBIEE11g to be able to report against the ODBC interfaces provided by CDH4 and CDH5 on the BDA (although ironically, you can get HiveServer2 and Impala working on OBIEE on Windows, though this platform isn’t officially supported by Oracle for Hadoop access, only Linux). Either way though, it soon became apparent that even if we could get Hive and Impala access working, in reality it made more sense to use Hadoop as the data ingestion and processing platform – providing access to data analysts at this point if they wanted access to the raw datasets – with the output of this then being loaded into an Oracle Exadata database, either via Sqoop or via Oracle Loader for Hadoop and ideally orchestrated by Oracle Data Integrator 12c, and users then querying these Oracle tables rather than the Hive and Impala ones on the BDA, as shown in the diagram below.


In practice, Oracle SQL is far more complete and expressive than HiveQL and Impala SQL, and it makes more sense to use Oracle as the query platform for the vast majority of users, with data analysts and data scientists still able to access the raw data on Hadoop using tools like Hive, R and (when we move to CDH5) Spark.
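As a rough illustration of that “load the processed output into Oracle” step (the JDBC connect string, credentials, table name and HDFS export directory are all hypothetical, and in practice the call would be orchestrated by ODI or an Oozie workflow rather than run as a standalone script):

    import subprocess

    # Hypothetical details: the HDFS directory holding the processed output,
    # and the target table in the Exadata-hosted warehouse schema
    sqoop_export = [
        "sqoop", "export",
        "--connect", "jdbc:oracle:thin:@exadata-scan:1521/DWPROD",
        "--username", "DW_ETL",
        "--password-file", "/user/etl/.oracle_pwd",     # keeps the password off the command line
        "--table", "WEBLOG_PAGE_COUNTS",
        "--export-dir", "/data/output/weblog_page_counts",
        "--input-fields-terminated-by", ",",
        "--num-mappers", "4",
    ]

    subprocess.check_call(sqoop_export)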

The final thing that’s been interesting about working on Hadoop and Big Data Appliance projects is that 80% of it, in my opinion, is just the same as working on large enterprise data warehouse projects, with the other 20% being “the magic”. A large portion of your time is spent on analysing and setting up feeds into the system; it’s just that in this case you use tools like Flume instead of GoldenGate (though GoldenGate can also load into HDFS and Hive, something that’s useful for transactional database data sources vs. Flume’s focus on file and server log data sources). Another big part of the work is data processing, ingestion, reformatting and combining – again, skills an ETL developer would have (though there’s much more reliance, at this point, on command-line tools and Unix utilities, albeit with a place for tools like ODI once you get to the set-based filtering, joining and aggregating phase). In most cases, the output of your analysis and processing will be Hive and Impala tables so that results can be analysed using tools such as OBIEE, and you therefore need skills in areas such as dimensional modelling, business analysis and dashboard prototyping, as well as tool-specific skills such as OBIEE RPD development.

Where the “magic” happens, of course, is in the data preparation and analysis that you do once the data is loaded: quite intensive and interactive in the discovery phase, and then in the form of MapReduce and Spark jobs, Sqoop loads and Oozie workflows once you know what you’re after and need to process the data into something more tabular for tools like OBIEE to access. We’re building up a team competent in techniques such as large-scale data analysis, data visualisation, statistical analysis, text classification and sentiment analysis, and the use of NoSQL and JSON-type data sources, which, combined with our core BI, DW and ETL teams, allows us to cover a project from end to end. It’s still relatively early days, but we’re encouraged by the response from our project customers so far and – to be honest – by the quality of the Oracle big data products and the Cloudera platform they’re based around, and we’re looking forward to helping other Oracle customers get the most out of their adoption of these new technologies.

If you’re an Oracle customer looking to make your first move into the worlds of Hadoop, big data and advanced analytics techniques, feel free to drop me an email at mark.rittman@rittmanmead.com for some initial advice and guidance – the fact that we come from an Oracle-centric background typically makes it easier for us to relate these new concepts to the ones you’re already familiar with. Similarly, if you’re about to bring on board an Oracle Big Data Appliance system and want to know how best to integrate it with your existing Oracle BI, DW, data integration and systems management estate, get in contact and I’d be happy to share experiences and our delivery approach.
