Using Oracle R Enterprise to Analyze Large In-Database Datasets

March 23rd, 2014 by

The other week I posted an article on the blog about Oracle R Advanced Analytics for Hadoop, part of Oracle’s Big Data Connectors and used for running certain types of R analysis over a Hadoop cluster. ORAAH lets you move data in and out of HDFS and Hive and into in-memory R data frames, and gives you the ability to create Hadoop MapReduce jobs but using R commands and syntax. If you’re looking to use R to analyse, prepare and explore your data, and you’ve got access to a large Hadoop cluster, ORAAH is a useful way to go beyond the normal memory constraints of R running on your laptop.

But what if the data you want to analyse is currently in an Oracle database? You can export the relevant tables to flat files and then import them into HDFS, or you can use a tool such as Sqoop to copy the data directly into HDFS and Hive tables. Another option you could consider, though, is to run your R analysis directly on the database tables, avoiding the need to move data around and taking advantage of the scalability of your Oracle database – which is where Oracle R Enterprise comes in.

Oracle R Enterprise is part of the Oracle Database Enterprise Edition “Advanced Analytics Option”, so it’s licensed separately to ORAAH and the Big Data Connectors. What it gives you is three things:

  • Some client packages to install locally on your desktop, into regular R (or ideally, Oracle’s R distribution)
  • Some database server-side R packages to provide a “transparency layer”, converting R commands into SQL ones, along with extra SQL stats functions to support R
  • The ability to spawn off R engines within the Oracle Database using the extproc mechanism, for performing R analysis directly on the data rather than through the client on your laptop

Where this gets interesting for us is that the ORE transparency layer makes it simple to move data in and out of the Oracle Database, but more importantly it allows us to use database tables and views as R “ore.frames” – proxies for “data frames”, the equivalent to database tables in R and the basic data set that R commands work on. Going down this route avoids the need to export the data we’re interested in out of the Oracle Database, with the ORE transparency layer converting most R function calls to Oracle Database SQL ones – meaning that we can use the data analyst-friendly R language whilst using Oracle under the covers for the heavy lifting.

There’s more to ORE than just the transparency layer, but let’s take a look at how you might use ORE and this feature, using the same “flight delays” dataset I used in my post a couple of months ago on Hadoop, Hive and Impala. We’ll use the OBIEE 11.1.1.7.1 SampleApp v309R2 that you can download from OTN as it’s got Oracle R Enterprise already installed, although you’ll need to follow step 10 in the accompanying deployment guide to install the R packages that Oracle couldn’t distribute along with SampleApp.

In the following examples, we’ll:

  • Connect to the main PERFORMANCE fact table in the BI_AIRLINES schema, read in its metadata (columns), and then set it up as a “virtual” R data frame that actually points through to the database table
  • Then we’ll perform some basic analysis, binning and totalling for that table, to give us a sense of what’s in it
  • And then we’ll run some more R analysis on the table, outputting the results in the form of graphs and answering questions such as “which days of the week are best to fly out on?” and “how has airlines’ relative on-time performance changed over time?”

Let’s start off then by starting the R console and connecting to the database schema containing the flight delays data.

[oracle@obieesample ~]$ R
 
Oracle Distribution of R version 2.15.1  (--) -- "Roasted Marshmallows"
Copyright (C)  The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.
 
[Previously saved workspace restored]
 
> library(ORE)
Loading required package: OREbase
 
Attaching package: ‘OREbase’
 
The following object(s) are masked from ‘package:base’:
 
    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table
 
Loading required package: OREstats
Loading required package: MASS
Loading required package: OREgraphics
Loading required package: OREeda
Loading required package: OREdm
Loading required package: lattice
Loading required package: OREpredict
Loading required package: ORExml
> ore.connect("bi_airlines","orcl","localhost","BI_AIRLINES",all=TRUE)
Loading required package: ROracle
Loading required package: DBI
> 

Note that “library(ORE)” loads up the Oracle R Enterprise R libraries, and “ore.connect” connects the R session to the relevant Oracle database.

I then synchronise R’s view of the objects in this database schema with its own metadata views, list out what tables are available to us in that schema, and attach that schema to my R session so I can manipulate them from there.

> ore.sync()
> ore.ls()
 [1] "AIRCRAFT_GROUP"           "AIRCRAFT_TYPE"           
 [3] "AIRLINE_ID"               "AIRLINES_USER_DATA"      
 [5] "CANCELLATION"             "CARRIER_GROUP_NEW"       
 [7] "CARRIER_REGION"           "DEPARBLK"                
 [9] "DISTANCE_GROUP_250"       "DOMESTIC_SEGMENT"        
[11] "OBIEE_COUNTY_HIER"        "OBIEE_GEO_AIRPORT_BRIDGE"
[13] "OBIEE_GEO_ORIG"           "OBIEE_ROUTE"             
[15] "OBIEE_TIME_DAY_D"         "OBIEE_TIME_MTH_D"        
[17] "ONTIME_DELAY_GROUPS"      "PERFORMANCE"             
[19] "PERFORMANCE_ENDECA_MV"    "ROUTES_FOR_LINKS"        
[21] "SCHEDULES"                "SERVICE_CLASS"           
[23] "UNIQUE_CARRIERS"         
> ore.attach("bi_airlines")
> 

Now although we know these objects as database tables, what ORE does is present them to R as “data frames”, the fundamental data structure in R that looks just like a table in the relational database world. Behind the scenes, though, ORE maps these data frames to the underlying Oracle structures using the ore.frame proxy, and turns R commands into SQL function calls, including a bunch of new ones added specifically for ORE. Note that this is conceptually different to Oracle R Advanced Analytics for Hadoop, which doesn’t map (or overload) standard R functions to their Hadoop (MapReduce or Hive) equivalents. Instead, ORAAH gives you a set of new R functions for creating MapReduce jobs in a more R-native way, which you can then submit to a Hadoop cluster for processing. ORE, in contrast, tries to map as much R functionality as it can to Oracle Database functions, allowing you to run normal R sessions while Oracle Database processes bigger R queries closer to the data.
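To make the proxy idea a bit more concrete, here is a toy sketch in plain R of an object that holds only a table name and a column list, and turns a column selection into SQL text rather than copying any rows. Everything in it (the "toy_frame" class and the to_sql helper) is a hypothetical name invented for illustration; it is not part of ORE and is not a description of how ORE is actually implemented.

```r
# Toy illustration only: "toy_frame" and "to_sql" are hypothetical names,
# not ORE functions, and this is not how ORE works internally.

# The proxy stores metadata (table and column names) but no rows.
toy_frame <- function(table, columns) {
  structure(list(table = table, columns = columns), class = "toy_frame")
}

# Overload "[" so that x[, cols] narrows the column list instead of
# fetching data. Row selection (i) is ignored in this sketch.
`[.toy_frame` <- function(x, i, j) {
  if (!missing(j)) {
    x$columns <- j[j %in% x$columns]
  }
  x
}

# Render the query the proxy currently stands for.
to_sql <- function(x) {
  paste0("SELECT ", paste(x$columns, collapse = ", "), " FROM ", x$table)
}

perf   <- toy_frame("PERFORMANCE", c("YEAR", "DEST", "ARRDELAY", "DAYOFWEEK"))
narrow <- perf[, c("YEAR", "DEST")]

cat(to_sql(narrow), "\n")
```

The point of the sketch is just that ordinary R syntax can be overloaded to build up a query, with the database only doing work when a result is actually needed.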

Let’s use another two R commands to see how it views the PERFORMANCE table in the flight delays data set, and get some basic sizing metrics.

> class(PERFORMANCE)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> dim(PERFORMANCE)
[1] 6362422     112

Now at this point I could pull the data from one of those tables directly into an in-memory R data frame, like this:

> carriers <- ore.pull(UNIQUE_CARRIERS)
Warning message:
ORE object has no unique key - using random order 
> class(UNIQUE_CARRIERS)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> class(carriers)
[1] "data.frame"
> 

As you see, R sees the UNIQUE_CARRIERS object as an ore.frame, whilst carriers (into which data from UNIQUE_CARRIERS was loaded) is a regular data.frame object. In some cases you might want to load data from Oracle tables into a regular data.frame, but what’s interesting here is that we can work directly with ore.frame objects and let the Oracle database do the hard work. So let’s get to work on the PERFORMANCE ore.frame object and do some initial analysis and investigation.

> df <- PERFORMANCE[,c("YEAR","DEST","ARRDELAY")]
> class(df)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> head(df)
  YEAR DEST ARRDELAY
1 2010  BOI      -13
2 2010  BUF       44
3 2010  BUF      -14
4 2010  BUR       -6
5 2010  BUR       -2
6 2010  BUR       -9
Warning messages:
1: ORE object has no unique key - using random order 
2: ORE object has no unique key - using random order 
> options(ore.warn.order = FALSE)
> head(PERFORMANCE[,c(1,4,23)])
  YEAR DAYOFMONTH DESTWAC
1 2010         16      83
2 2010         16      22
3 2010         16      22
4 2010         16      91
5 2010         16      91
6 2010         16      91
>

In the above script, the first command creates a temporary ore.frame object made up of just three of the columns from the PERFORMANCE table / ore.frame. Then I switch off the warning about these tables not having unique keys (“options(ore.warn.order = FALSE)”), and then I select three more columns directly from the PERFORMANCE table / ore.frame.
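The two selections in that script differ only in how the columns are identified: by name in the first case, by position in the second. A small sketch with an ordinary in-memory data.frame shows the equivalence; the rows below are made up for illustration and the column set is much smaller than the real PERFORMANCE table.

```r
# Made-up rows for illustration; column names echo a few from the article.
flights <- data.frame(
  YEAR     = c(2010, 2010, 2011),
  MONTH    = c(1, 2, 3),
  DEST     = c("BOI", "BUF", "BUR"),
  ARRDELAY = c(-13, 44, -6),
  stringsAsFactors = FALSE
)

# Select the same three columns by name and by position.
by_name  <- flights[, c("YEAR", "DEST", "ARRDELAY")]
by_index <- flights[, c(1, 3, 4)]

# TRUE when both selections hold the same columns and values.
print(identical(by_name, by_index))

# Looking positions up from names avoids hard-coding column numbers.
pos <- match(c("YEAR", "DEST", "ARRDELAY"), names(flights))
print(pos)
```

Selecting by name is usually the safer habit on a 112-column table, since positions shift if the table definition changes.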

> aggdata <- aggregate(PERFORMANCE$DEST,
+                      by = list(PERFORMANCE$DEST),
+                      FUN = length)
> class(aggdata)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
 
> head(aggdata)
    Group.1     x
ABE     ABE  4104
ABI     ABI  2497
ABQ     ABQ 33298
ABR     ABR     5
ABY     ABY  1028
ACK     ACK   346
 
> (t <- table(PERFORMANCE$DAYOFWEEK))
 
     1      2      3      4      5      6      7 
943305 924442 932113 942066 956123 777203 887170
 
> dat = PERFORMANCE[PERFORMANCE$ARRDELAY<100 & PERFORMANCE$ARRDELAY>-100,]
> ad = with(dat, split(ARRDELAY,UNIQUECARRIER))
> boxplot(ad,col = "blue", notch = TRUE, cex = 0.5, varwidth = TRUE)

In the above set of scripts, I first aggregate flights by destination airports, then count flights by day of week. In the final set of commands I get a bit more advanced and create a box plot graph showing the range of flight delays by airline, which produces the following graph from the R console:

[Figure: box plot of arrival delays by airline]

whereas in the next one I create a histogram of flight delays (minutes), showing the vast majority of delays are just a few minutes.

> ad = PERFORMANCE$ARRDELAY
> ad = subset(ad, ad>-200&ad<200)
> hist(ad, breaks = 100, main = "Histogram of Arrival Delay")

[Figure: histogram of arrival delay]
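The aggregate(), table() and split() calls earlier in this section are all variations on grouping rows and summarising each group. Here is a minimal sketch of the same pattern on a tiny in-memory data.frame, so the results can be checked by eye; the six rows are invented for illustration and the ±100 minute filter simply mirrors the one used for the airline box plot.

```r
# Invented sample rows; only the column names come from the article.
flights <- data.frame(
  DEST      = c("ABQ", "ABE", "ABQ", "ABI", "ABQ", "ABE"),
  DAYOFWEEK = c(1, 2, 2, 5, 7, 1),
  ARRDELAY  = c(-5, 12, 150, -3, 30, 0),
  stringsAsFactors = FALSE
)

# Count flights per destination, in the same style as the aggregate() call.
counts <- aggregate(flights$DEST, by = list(flights$DEST), FUN = length)

# table() gives the same counts in a more compact form.
by_dest <- table(flights$DEST)

# Keep delays between -100 and 100 minutes, then split them by destination.
in_range <- flights[flights$ARRDELAY < 100 & flights$ARRDELAY > -100, ]
delays   <- split(in_range$ARRDELAY, in_range$DEST)

# Median delay per destination, the centre line a box plot would draw.
medians <- sapply(delays, median)

print(counts)
print(medians)
```

One base R detail to watch: if ARRDELAY contained NA values, the logical filter above would return all-NA rows for them, whereas subset() drops such rows.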

All of this so far, to be fair, you could do just as easily in SQL or in a tool like Excel, but they’re the sort of commands an R analyst would want to run before getting onto the interesting stuff, and it’s great that they could now do this on the full dataset in an Oracle database, not just on what they can pull into memory on their laptop. Let’s do something more interesting now, and answer the question “which day of the week is best for flying out, in terms of not hitting delays?”

> ontime <- PERFORMANCE
> delay <- ontime$ARRDELAY
> dayofweek <- ontime$DAYOFWEEK
> bd <- split(delay, dayofweek)
> boxplot(bd, notch = TRUE, col = "red", cex = 0.5,
+         outline = FALSE, axes = FALSE,
+         main = "Airline Flight Delay by Day of Week",
+         ylab = "Delay (minutes)", xlab = "Day of Week")

[Figure: box plot, “Airline Flight Delay by Day of Week”]

Looks like Tuesday’s the best. So how has a selection of airlines performed over the past few years?

> ontimeSubset <- subset(PERFORMANCE, UNIQUECARRIER %in% c("AA", "AS", "CO", "DL","WN","NW")) 
> res22 <- with(ontimeSubset, tapply(ARRDELAY, list(UNIQUECARRIER, YEAR), mean, na.rm = TRUE))
> g_range <- range(0, res22, na.rm = TRUE)
> rindex <- seq_len(nrow(res22))
> cindex <- seq_len(ncol(res22))
> par(mfrow = c(2,3))
> for(i in rindex) {
+   temp <- data.frame(index = cindex, avg_delay = res22[i,])
+   plot(avg_delay ~ index, data = temp, col = "black",
+        axes = FALSE, ylim = g_range, xlab = "", ylab = "",
+        main = attr(res22, "dimnames")[[1]][i])
+   axis(1, at = cindex, labels = attr(res22, "dimnames")[[2]])
+   axis(2, at = 0:ceiling(g_range[2]))
+   abline(lm(avg_delay ~ index, data = temp), col = "green")
+   lines(lowess(temp$index, temp$avg_delay), col="red")
+ } 
>

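[Figure: average arrival delay by year for each selected airline, one panel per carrier, with linear and lowess trend lines]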
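If you want numbers to go alongside panels like these, the same tapply() result can be reduced to one trend slope per carrier. The sketch below does this in plain R on a small made-up data set (two carriers, three years); the values are invented so the arithmetic is easy to follow, and fitting a straight line per carrier is just one simple way to summarise a trend, mirroring the lm() line drawn in each panel.

```r
# Invented data: two carriers, three years, two flights per year each.
flights <- data.frame(
  UNIQUECARRIER = rep(c("AA", "DL"), each = 6),
  YEAR          = rep(rep(2008:2010, each = 2), times = 2),
  ARRDELAY      = c(10, 14, 8, 10, 5, 7,   # AA yearly means: 12, 9, 6
                    4, 6, 6, NA, 9, 11),   # DL yearly means: 5, 6, 10
  stringsAsFactors = FALSE
)

# Carrier-by-year matrix of mean arrival delay, as in the article's tapply().
avg <- with(flights,
            tapply(ARRDELAY, list(UNIQUECARRIER, YEAR), mean, na.rm = TRUE))

# Fit a straight line through each carrier's yearly means and keep the slope
# (minutes of average delay gained or lost per year).
years  <- as.numeric(colnames(avg))
slopes <- apply(avg, 1, function(delays) unname(coef(lm(delays ~ years))[2]))

print(avg)
print(slopes)
```

A negative slope means average delays fell over the period, a positive one that they grew.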

See this presentation from the BIWA SIG for more examples of ORE queries against the flight delays dataset, which you can adapt from the ONTIME_S dataset that ships with ORE as part of the install.

Now where R and ORE get really interesting, in the context of BI and OBIEE, is when you embed R scripts directly in the Oracle Database and use them to provide forecasting, modelling and other “advanced analytics” features, using an R engine that the database spins up on demand through the extproc mechanism mentioned earlier. Once you’ve done this, you can expose the calculations through an OBIEE RPD, as Oracle have done in the OBIEE 11.1.1.7.1 SampleApp, shown below:

[Figure: R-based calculations exposed through the OBIEE 11.1.1.7.1 SampleApp]

But that’s really an article in itself – so I’ll cover this process and how you surface it all through OBIEE in a follow-up post soon.
