In this week of postings we're going to look at Oracle Exalytics and how it enables "big data" and unstructured data analytics, using Oracle Endeca, Oracle Exadata, Oracle Big Data Appliance and the Oracle Database Advanced Analytics option. In case you've arrived via a Google search and you're interested in the rest of the postings in this series, here's the links to the articles (to be completed as postings are published).
- Oracle Exalytics, Oracle R Enterprise and Endeca Part 1 : Oracle's Analytics, Engineered Systems, and Big Data Strategy
- Oracle Exalytics, Oracle R Enterprise and Endeca Part 2 : Oracle Endeca, the Advanced Analytics Option and Oracle Exalytics
- Oracle Exalytics, Oracle R Enterprise and Endeca Part 3 : Flight Delays Analysis using OBIEE, Endeca and Oracle R Enterprise on Exalytics
So in the first post in this series we looked at Exalytics as part of the Oracle database tech stack, and how Oracle's analytics strategy is to handle all types of data, using a number of optimised analytic tools and analysis engines, with packaged applications where appropriate and delivered via the web, via mobile devices, in the cloud and embedded in business application and processes. We closed the post with a mention of a new database option called the Advanced Analytics Option, and in this second posting we'll look at just what this new option contains and how it related to Oracle's engineered systems strategy.
The Advanced Analytics Option is an option to Oracle Database Enterprise Edition, available from version 11.2 of the database onwards. It includes two major components:
- Oracle Data Mining, which prior to the Advanced Analytics Option was an option in itself (typically bought along with the OLAP Option, which is still a separate option)
- Oracle R Enterprise, Oracle's take on R, the statistical language used widely in academia and rapidly replacing Base SAS and SPSS within commercial organisations
For both data mining and R, the key premise with the Advanced Analytics Option is to bring the algorithms to the data; instead of having to extract data from a database, along with files and other sources, and then load this into a statistics engine such as SAS, you can instead embed R scripts and data mining algorithms directly within the database, making it easy to score and classify data in real-time, such as in a call-centre application or as part of an ETL routine.
Oracle Data Mining has been around for a number of years now, but R is a new addition to Oracle's analytic toolset, and is probably new to most Oracle BI & DW developers. So just what is R, and Oracle's version of it, Oracle R Enterprise?
In fact, there are actually two R packages that Oracle have put together; one is free, the other is a database option. The free one is Oracle's own distribution of open-source R, the same as you'd get from downloading it from the R Project's website, but with additional libraries to make it run faster on x86 hardware. Open-source R can also be downloaded from the Oracle website along with licensable Oracle R Enterprise, installing it direct onto Oracle Linux or other Unix OS's using Oracle's Yum repository. Oracle R Enterprise, however, is basic R extended to work closer with the Oracle database by adding the following (licensed) elements:
- R packages to add to and extend the standard packages provided with open-source R
- A database library for connecting to Oracle and running R scripts within the database
- SQL extensions to allow R functionality to be called from SQL and PL/SQL
These elements then provide four main Oracle Enterprise R features:
- A "Transparency" layer that intercepts standard R functions and extends them to allow certain R functions and datatypes to reside in the Oracle database
- A Statistics Engine providing a set of statistical functions and procedures for commonly-used statistical libraries, which then execute in the Oracle database
- SQL extensions, which allow database server execution of R code, and support parallelism, SQL access to R and XML output
- A Hadoop connector, for running R scripts and functions against an Oracle Hadoop cluster with its files held in either HDFS, an Oracle database, or local files.
When you work with R, you typically have the R client installed on your laptop or workstation which communicates with the R server, typically delivered as a single executable for Windows, Linux or Unix. Whilst this has the virtue of simplicity it also means that you are limited by the amount of RAM and CPU on your local machine, which can quickly become an issue when you try to spin up multiple R engines to process a model in parallel, as each engine loads up the full data set into memory before starting work. Even on a 2-4 core laptop with 16GB RAM you can quickly run out of memory, which is where Oracle R Enterprise comes in - the basic data structure that you work with in R called a "frame", analogous to a relational table, can with Oracle R Enterprise be actually stored in a database giving you the ability to process much larger sets of data, with many more R engines running, than if you were running standalone. Typically this would be a large, multi-core Oracle database, though you can also connect R and ORE to the TimesTen in-memory database using the new ROracle R interface, detailed in this blog post by Jason Feldhaus on the Oracle R Enterprise blog.
Oracle R Enterprise also has the ability to spin-up (or "spawn") it's own R engines within the database server, providing a lights-out environment that allows R computations to be carried out even when you're not at your workstation, and with these database-resident R engines having full access to the database, SQL and PL/SQL. Coupled with the Oracle R Connector for Hadoop, a typical ORE (as we'll shorten Oracle R Enterprise to now) topology looks like the diagram below.
So where does Exalytics come in to this? If you've followed-along so far you may well have spotted that, as ORE is in fact a database option and therefore runs as part of the Oracle Database, it shouldn't really be installed (along with an Oracle database) on the Exalytics server - apart from having to license 20 processors of Oracle Database Enterprise Edition plus the Advanced Analytics Option, Exalytics is really meant for just OBIEE, WebLogic, TimesTen and Essbase, with ORE really supposed to reside on Exadata, or at least a separate database server. What Exalytics does do well though is play the role of a supercharged client for ORE, with Open-source R running on Exalytics then connecting to ORE on Exadata; The R client can then spin-up multiple R engines to process models in parallel making use of Exalytics 40 cores, whilst the 1TB of RAM allows multiple copies of the models' data to be held in memory without the machine breaking a sweat. Coupled with ORE's ability to spin-up it's own R engines on the Exalytics server, and the InfiniBand connection between the two servers, and Oracle Big Data Appliance if you've also got this, and your R topology now looks like the diagram below.
The question you're probably asking at this point now, seeing as we've established where R fits into the Oracle BI and big data architecture, is just what is R? And what can it do for Oracle BI, if it's just a statistical programming language? Well if you've got the latest OBIEE 11g SampleApp (v207) downloadable from OTN, it's actually got R, and Oracle R Enterprise, already installed and set up, ready to go. So assuming you've got SampleApp v207 installed and all of the OBIEE and other servers running, you can start your first R session by selecting Applications > Accessories > Terminal from the Linux desktop menu bar, then type in "R" to start the R console, part of the standard R client, like this:
[oracle@obieesampleapp ~]$ R R version 2.13.1 (2011-07-08) Copyright (C) 2011 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: i686-redhat-linux-gnu (32-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. Loading Oracle R Enterprise Packages Connecting to Oracle RDBMS User: rquser SID : orcl Host: localhost Port: 1521 Done. >
Note how starting the R console first displays the open-source license messages, then displays details about ORE, finishing up by displaying the connection details of the database account it's now going to connect us to, which is the standard ORE account on a database that's been configured to use the Advanced Analytics Option.
Obviously explaining the full syntax and capabilities of R is outside the scope of this blog post (try the online ORE docs, and the R Project's online manuals), but ORE comes with a number of sample R scripts that are part of a base ORE package, that you can look at to get a flavour of the language and its syntax. Whilst connected to the R console you can list out the demo ORE scripts like this:
> demo (package = "ORE") basic Basic connectivity to database binning Binning logic columnfns Column functions cor Correlation matrix crosstab Frequency cross tabulations derived Handling of derived columns distributions Distribution, density, and quantile functions do_eval Embedded R processing freqanalysis Frequency cross tabulations graphics Demonstrates visual analysis group_apply Embedded R processing by group hypothesis Hyphothesis testing functions matrix Matrix related operations nulls Handling of NULL in SQL vs. NA in R push_pull RDBMS <-> R data transfer rank Attributed-based ranking of observations reg Ordinary least squares linear regression row_apply Embedded R processing by row chunks sql_like Mapping of R to SQL commands stepwise Stepwise OLS linear regression summary Summary functionality table_apply Embedded R processing of entire table
To run one of these, for example the correlation matrix one, type in the command:
> demo ("cor", package = "ORE")
R also ships with a number of graphics demos, that show off some of the graphs and other visualisations that R can produce. To run these, from the R console type in:
> demo (graphics)
The R console will then step you through a number of graph demos, displaying each graph when you press the enter key.
Compared to the basic statistical functions provided by Oracle SQL, R provides a wider variety of statistical and graphical techniques including:
- Linear and non-linear modelling
- Classical statistical tests and time-series analysis
- Classification, clustering and other capabilities
- Matrix arithmetic, with scalar, vector, list and data frame (analogous to relational tables)
In addition, R is extensible through community-contributed packages at the Comprehensive R Archive Network (CRAN), which is probably the main attraction for users of R, and it can connect to "big data" sources such as Hadoop through Oracle's R Connector for Hadoop. So now that we've seen the basics of R and how it might benefit from Exalytics and Oracle's Enterprise R features, how might R, OBIEE and Endeca work together if used on Exalytics? In the final posting in this series we'll look at a case study that takes the publically-available Flight Delays dataset and analyses it using OBIEE, Endeca and Oracle R Enterprise, to see what each tool can contribute and how they might look to a typical end-user.