Rittman Mead BI Forum 2014 Registration Now Open – Don’t Miss Out!

March 25th, 2014 by

Just a quick reminder to say that registration for the Rittman Mead BI Forum 2014 is now open, with the speaker and presentation list, dates and venues now up on the event website. As with previous years, the BI Forum runs in Brighton on the first week and then moves over to Atlanta on the second.

We’ve got a fantastic line-up of sessions and speakers, including:

  • Oracle ACE and past BI Forum best speaker winner Kevin McGinley, on adding third-party visualisations to OBIEE
  • Sessions from TimesTen PMs Chris Jenkins and Susan Cheung on what’s coming with TimesTen
  • Wayne Van Sluys from InterRel, on Essbase optimisation
  • Oracle’s Andrew Bond, and our own Stewart Bryson (Oracle ACE) with an update to Oracle’s reference BI, DW and Big Data Architecture
  • Dan Vlamis on using Oracle Database analytics with the Oracle BI Applications
  • Sessions from Oracle’s Jack Berkowitz, Adam Bloom and Matt Bedin on what’s coming with OBIEE and Oracle BI Applications
  • Peak Indicators’ Alastair Burgess on tuning TimesTen with Aggregate Persistence
  • Endeca sessions from Chris Lynskey (PM) and Omri Traub (Development Manager), along with ones from Branchbird’s Patrick Rafferty and Truls Bergersen
  • And sessions from Rittman Mead’s Robin Moffatt (OBIEE performance), Gianni Ceresa (Essbase) and Michael Rainey (ODI, with Nick Hurt from IFPI)


We’ve also got some excellent keynote sessions including one in the US from Maria Colgan on the new in-memory database option, and another in Brighton from Matt Bedin and Adam Bloom on BI in the Cloud – along with the opening-night Oracle product development keynote in both Brighton and Atlanta.

We’re also very excited to welcome Lars George from Cloudera to deliver this year’s optional one-day masterclass, this time on Hadoop, big data, and how Oracle BI&DW developers can get started with this technology. Lars is Cloudera’s Chief Architect in EMEA and an HBase committer, and he’ll be covering topics such as:

  • What is Hadoop, what’s in the Hadoop ecosystem and how do you design a Hadoop cluster
  • Using tools such as Flume and Sqoop to import data into Hadoop, and then analyse it using Hive, Pig, Impala and Cloudera Search
  • Introduction to NoSQL and HBase
  • Connecting Hadoop to tools such as OBIEE and ODI using JDBC, ODBC, Impala and Hive

If you’ve been meaning to take a look at Hadoop, or if you’ve made a start but would like a chance to discuss techniques with someone who’s out in the field every week designing and building Hadoop systems, this session is aimed at you – it’s on the Wednesday before each event and you can book at the same time as registering for the main BI Forum days.


Attendance is limited to around seventy at each event, and we’re running the Brighton BI Forum back at the Hotel Seattle, whilst the US one is running at the Renaissance Midtown Hotel, Atlanta. We encourage attendees to stay at the hotel as well so as to maximise networking opportunities, and this year you can book US accommodation directly with the hotel so you can collect any Marriott points, corporate discounts etc. As usual, we’ll take good care of you over the two or three days, with meals each night, drinks receptions and lots of opportunities to meet colleagues and friends in the industry.

Full details are on the BI Forum 2014 web page, including links to the registration sites. Book now so you don’t miss out – each year we sell out in advance, so don’t leave it to the last minute if you’re thinking of coming. Hopefully see you all in Brighton and Atlanta in May 2014!

Using Oracle R Enterprise to Analyze Large In-Database Datasets

March 23rd, 2014 by

The other week I posted an article on the blog about Oracle R Advanced Analytics for Hadoop, part of Oracle’s Big Data Connectors and used for running certain types of R analysis over a Hadoop cluster. ORAAH lets you move data in and out of HDFS and Hive and into in-memory R data frames, and gives you the ability to create Hadoop MapReduce jobs using R commands and syntax. If you’re looking to use R to analyse, prepare and explore your data, and you’ve got access to a large Hadoop cluster, ORAAH is a useful way to go beyond the normal memory constraints of R running on your laptop.

But what if the data you want to analyse is currently in an Oracle database? You can export the relevant tables to flat files and then import them into HDFS, or you can use a tool such as Sqoop to copy the data directly into HDFS and Hive tables. Another option you could consider, though, is to run your R analysis directly on the database tables, avoiding the need to move data around and taking advantage of the scalability of your Oracle database – which is where Oracle R Enterprise comes in.

Oracle R Enterprise is part of the Oracle Database Enterprise Edition “Advanced Analytics Option”, so it’s licensed separately to ORAAH and the Big Data Connectors. What it gives you is three things:


  • Some client packages to install locally on your desktop, installed into regular R (or ideally, Oracle’s R distribution)
  • Some database server-side R packages to provide a “transparency layer”, converting R commands into SQL ones, along with extra SQL stats functions to support R
  • The ability to spawn off R engines within the Oracle Database using the extproc mechanism, for performing R analysis directly on the data rather than through the client on your laptop

Where this gets interesting for us is that the ORE transparency layer makes it simple to move data in and out of the Oracle Database, but more importantly it allows us to use database tables and views as R “ore.frames” – proxies for “data frames”, the R equivalent of database tables and the basic data set that R commands work on. Going down this route avoids the need to export the data we’re interested in out of the Oracle Database, with the ORE transparency layer converting most R function calls to Oracle Database SQL ones – meaning that we can use the data analyst-friendly R language whilst using Oracle under the covers for the heavy lifting.


There’s more to ORE than just the transparency layer, but let’s take a look at how you might use ORE and this feature, using the same “flight delays” dataset I used in my post a couple of months ago on Hadoop, Hive and Impala. We’ll use the OBIEE 11.1.1.7.1 SampleApp v309R2 that you can download from OTN as it’s got Oracle R Enterprise already installed, although you’ll need to follow step 10 in the accompanying deployment guide to install the R packages that Oracle couldn’t distribute along with SampleApp.

In the following examples, we’ll:

  • Connect to the main PERFORMANCE fact table in the BI_AIRLINES schema, read in its metadata (columns), and then set it up as a “virtual” R data frame that actually points through to the database table
  • Then we’ll perform some basic analysis, binning and totalling for that table, to give us a sense of what’s in it
  • And then we’ll run some more R analysis on the table, outputting the results in the form of graphs and answering questions such as “which days of the week are best to fly out on?” and “how has airlines’ relative on-time performance changed over time?”

Let’s start off, then, by starting the R console and connecting to the database schema containing the flight delays data.

[oracle@obieesample ~]$ R
 
Oracle Distribution of R version 2.15.1  (--) -- "Roasted Marshmallows"
Copyright (C)  The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.
 
[Previously saved workspace restored]
 
> library(ORE)
Loading required package: OREbase
 
Attaching package: ‘OREbase’
 
The following object(s) are masked from ‘package:base’:
 
    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table
 
Loading required package: OREstats
Loading required package: MASS
Loading required package: OREgraphics
Loading required package: OREeda
Loading required package: OREdm
Loading required package: lattice
Loading required package: OREpredict
Loading required package: ORExml
> ore.connect("bi_airlines","orcl","localhost","BI_AIRLINES",all=TRUE)
Loading required package: ROracle
Loading required package: DBI
> 

Note that “library(ORE)” loads up the Oracle R Enterprise R libraries, and “ore.connect” connects the R session to the relevant Oracle database.

I then synchronise R’s view of the objects in this database schema with its own metadata views, list out what tables are available to us in that schema, and attach that schema to my R session so I can manipulate them from there.

> ore.sync()
> ore.ls()
 [1] "AIRCRAFT_GROUP"           "AIRCRAFT_TYPE"           
 [3] "AIRLINE_ID"               "AIRLINES_USER_DATA"      
 [5] "CANCELLATION"             "CARRIER_GROUP_NEW"       
 [7] "CARRIER_REGION"           "DEPARBLK"                
 [9] "DISTANCE_GROUP_250"       "DOMESTIC_SEGMENT"        
[11] "OBIEE_COUNTY_HIER"        "OBIEE_GEO_AIRPORT_BRIDGE"
[13] "OBIEE_GEO_ORIG"           "OBIEE_ROUTE"             
[15] "OBIEE_TIME_DAY_D"         "OBIEE_TIME_MTH_D"        
[17] "ONTIME_DELAY_GROUPS"      "PERFORMANCE"             
[19] "PERFORMANCE_ENDECA_MV"    "ROUTES_FOR_LINKS"        
[21] "SCHEDULES"                "SERVICE_CLASS"           
[23] "UNIQUE_CARRIERS"         
> ore.attach("bi_airlines")
> 

Now although we know these objects as database tables, ORE presents them to R as “data frames”, using ore.frame as a proxy for the fundamental R data structure that looks just like a table in the relational database world. Behind the scenes, ORE maps these data frames to the underlying Oracle structures through that proxy and turns R commands into SQL function calls, including a bunch of new ones added specifically for ORE. Note that this is conceptually different to Oracle R Advanced Analytics for Hadoop, which doesn’t map (or overload) standard R functions onto their Hadoop (MapReduce or Hive) equivalents; instead it gives you a set of new R functions for creating MapReduce jobs, which you then submit to a Hadoop cluster for processing – a more R-native way of writing MapReduce. ORE, in contrast, maps as much of R’s functionality as it can onto Oracle Database functions, so you can run normal R sessions whilst the database processes the bigger queries closer to the data.

Let’s use another two R commands to see how it views the PERFORMANCE table in the flight delays data set, and get some basic sizing metrics.

> class(PERFORMANCE)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> dim(PERFORMANCE)
[1] 6362422     112

Now at this point I could pull the data from one of those tables directly into an in-memory R data frame, like this:

> carriers <- ore.pull(UNIQUE_CARRIERS)
Warning message:
ORE object has no unique key - using random order 
> class(UNIQUE_CARRIERS)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> class(carriers)
[1] "data.frame"
> 

As you see, R sees the UNIQUE_CARRIERS object as an ore.frame, whilst carriers (into which data from UNIQUE_CARRIERS was loaded) is a regular data.frame object. In some cases you might want to load data from Oracle tables into a regular data.frame, but what’s interesting here is that we can work directly with ore.frame objects and let the Oracle database do the hard work. So let’s get to work on the PERFORMANCE ore.frame object and do some initial analysis and investigation.

> df <- PERFORMANCE[,c("YEAR","DEST","ARRDELAY")]
> class(df)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> head(df)
  YEAR DEST ARRDELAY
1 2010  BOI      -13
2 2010  BUF       44
3 2010  BUF      -14
4 2010  BUR       -6
5 2010  BUR       -2
6 2010  BUR       -9
Warning messages:
1: ORE object has no unique key - using random order 
2: ORE object has no unique key - using random order 
> options(ore.warn.order = FALSE)
> head(PERFORMANCE[,c(1,4,23)])
  YEAR DAYOFMONTH DESTWAC
1 2010         16      83
2 2010         16      22
3 2010         16      22
4 2010         16      91
5 2010         16      91
6 2010         16      91
>

In the above script, the first command creates a temporary ore.frame object made up of just three of the columns from the PERFORMANCE table / ore.frame. Then I switch off the warning about these tables not having unique keys (“options(ore.warn.order = FALSE)”), and then I select three more columns directly from the PERFORMANCE table / ore.frame.

> aggdata <- aggregate(PERFORMANCE$DEST,
+                      by = list(PERFORMANCE$DEST),
+                      FUN = length)
> class(aggdata)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
 
> head(aggdata)
    Group.1     x
ABE     ABE  4104
ABI     ABI  2497
ABQ     ABQ 33298
ABR     ABR     5
ABY     ABY  1028
ACK     ACK   346
 
> (t <- table(PERFORMANCE$DAYOFWEEK))
 
     1      2      3      4      5      6      7 
943305 924442 932113 942066 956123 777203 887170
 
> dat = PERFORMANCE[PERFORMANCE$ARRDELAY<100 & PERFORMANCE$ARRDELAY>-100,]
> ad = with(dat, split(ARRDELAY,UNIQUECARRIER))
> boxplot(ad,col = "blue", notch = TRUE, cex = 0.5, varwidth = TRUE)

In the above set of scripts, I first aggregate flights by destination airports, then count flights by day of week. In the final set of commands I get a bit more advanced and create a box plot graph showing the range of flight delays by airline, which produces the following graph from the R console:


In the next set of commands, I create a histogram of flight delays (minutes), showing that the vast majority of delays are just a few minutes:

> ad = PERFORMANCE$ARRDELAY
> ad = subset(ad, ad>-200&ad<200)
> hist(ad, breaks = 100, main = "Histogram of Arrival Delay")


All of this so far, to be fair, you could do just as easily in SQL or in a tool like Excel, but these are the sort of commands an R analyst would want to run before getting onto the interesting stuff, and it’s great that they can now do this on the full dataset in an Oracle database, not just on what they can pull into memory on their laptop. Let’s do something more interesting now, and answer the question “which day of the week is best for flying out, in terms of not hitting delays?”

> ontime <- PERFORMANCE
> delay <- ontime$ARRDELAY
> dayofweek <- ontime$DAYOFWEEK
> bd <- split(delay, dayofweek)
> boxplot(bd, notch = TRUE, col = "red", cex = 0.5,
+         outline = FALSE, axes = FALSE,
+         main = "Airline Flight Delay by Day of Week",
+         ylab = "Delay (minutes)", xlab = "Day of Week")


Looks like Tuesday’s the best. So how has a selection of airlines performed over the past few years?

> ontimeSubset <- subset(PERFORMANCE, UNIQUECARRIER %in% c("AA", "AS", "CO", "DL","WN","NW")) 
> res22 <- with(ontimeSubset, tapply(ARRDELAY, list(UNIQUECARRIER, YEAR), mean, na.rm = TRUE))
> g_range <- range(0, res22, na.rm = TRUE)
> rindex <- seq_len(nrow(res22))
> cindex <- seq_len(ncol(res22))
> par(mfrow = c(2,3))
> for(i in rindex) {
+   temp <- data.frame(index = cindex, avg_delay = res22[i,])
+   plot(avg_delay ~ index, data = temp, col = "black",
+        axes = FALSE, ylim = g_range, xlab = "", ylab = "",
+        main = attr(res22, "dimnames")[[1]][i])
+        axis(1, at = cindex, labels = attr(res22, "dimnames")[[2]]) 
+        axis(2, at = 0:ceiling(g_range[2]))
+        abline(lm(avg_delay ~ index, data = temp), col = "green") 
+        lines(lowess(temp$index, temp$avg_delay), col="red")
+ } 
>


See this presentation from the BIWA SIG for more examples of ORE queries against the flight delays dataset, which you can adapt from the ONTIME_S dataset that ships with ORE as part of the install.

Now where R and ORE get really interesting, in the context of BI and OBIEE, is when you embed R scripts directly in the Oracle Database and use them to provide forecasting, modelling and other “advanced analytics” features, using the database’s internal JVM and an R engine that gets spun up on demand. Once you’ve done this, you can expose the calculations through an OBIEE RPD, as Oracle have done in the OBIEE 11.1.1.7.1 SampleApp.

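If you want a taster before that post: embedded R execution is driven from the R client through functions such as ore.tableApply(), which ships an R function to the database and runs it inside one of those spawned R engines. The snippet below is just my own minimal sketch of the idea – my choice of columns and aggregation, not the SampleApp example:

# Run an R function inside a database-spawned R engine; ORE hands the (projected)
# PERFORMANCE table to the function as an ordinary data.frame
res <- ore.tableApply(PERFORMANCE[,c("UNIQUECARRIER","ARRDELAY")],
                      function(dat) {
                        aggregate(dat$ARRDELAY,
                                  by = list(carrier = dat$UNIQUECARRIER),
                                  FUN = mean, na.rm = TRUE)
                      })
res   # an ore.object wrapping the data frame returned by the embedded function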

But that’s really an article in itself – so I’ll cover this process and how you surface it all through OBIEE in a follow-up post soon.

Using Sqoop for Loading Oracle Data into Hadoop on the BigDataLite VM

March 22nd, 2014 by

This is old-hat for most Hadoop veterans, but I’ve been meaning to note it on the blog for a while, for anyone whose first encounter with Hadoop is Oracle’s BigDataLite VM.

Most people looking to bring external data into Hadoop do so through flat-file exports that they then import into HDFS, using the “hadoop fs” command-line tool or Hue, the web-based developer tool in BigDataLite, Cloudera CDH, Hortonworks and so on. They then often create Hive tables over them, either from the Hive / Beeswax shell or through Hue, which can create a table for you out of a file you upload from your browser. ODI, through the ODI Application Adapter for Hadoop, also gives you a knowledge module (IKM File to Hive) that’s used in the ODI demos on BigDataLite to load data into Hive tables from an Apache Avro-format log file.


What a lot of people who’re new to Hadoop don’t know is that you can skip the “dump to file” step completely and load data into HDFS directly from the Oracle database, without any intermediate file export. The tool you use for this comes as part of the Cloudera CDH4 Hadoop distribution that’s on BigDataLite, and it’s called “Sqoop”.

“Sqoop”, short for “SQL to Hadoop”, gives you the ability to do the following Oracle data transfer tasks amongst other ones:

  • Import whole tables, or whole schemas, from Oracle and other relational databases into Hadoop’s file system, HDFS
  • Export data from HDFS back out to these databases – with the export and import being performed through MapReduce jobs
  • Import using an arbitrary SQL SELECT statement, rather than grabbing whole tables
  • Perform incremental loads, specifying a key column to determine which rows are new since the last import (there’s a sketch of this, and of the free-form SELECT import, after this list)
  • Load directly into Hive tables, creating HDFS files in the background and the Hive metadata automatically
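
As promised above, here’s a sketch of a free-form query import and an incremental append – the query, columns and target directory are just illustrative examples, and the flags are the standard ones from the Sqoop documentation:

# Free-form query import: Sqoop needs the $CONDITIONS token so it can split the
# query across mappers, plus --split-by and an explicit --target-dir
[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
  --username MOVIEDEMO --password welcome1 \
  --query 'SELECT a.* FROM ACTIVITY a WHERE a.ACTIVITY_ID < 5 AND $CONDITIONS' \
  --split-by a.ACTIVITY_ID --target-dir /user/oracle/activity_subset

# Incremental append: only import rows whose key is higher than the last value loaded
[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
  --username MOVIEDEMO --password welcome1 --table ACTIVITY \
  --incremental append --check-column ACTIVITY_ID --last-value 11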

Documentation for Sqoop as shipped with CDH4 can be found on the Cloudera website here, and there are even optimisations and plugins for databases such as Oracle to enable faster, direct loads – for example OraOOP.

Normally, you’d need to download and install JDBC drivers for sqoop before you can use it, but BigDataLite comes with the required Oracle JDBC drivers, so let’s just have a play around and see some examples of Sqoop in action. I’ll start by importing the ACTIVITY table from the MOVIEDEMO schema that comes with the Oracle 12c database also on BigDataLite (make sure the database is running, first though):

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY

You should then see sqoop process your command in its console output, and then run the MapReduce jobs to bring in the data via the Oracle JDBC driver:

14/03/21 18:21:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.5.0
14/03/21 18:21:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/03/21 18:21:37 INFO manager.SqlManager: Using default fetchSize of 1000
14/03/21 18:21:37 INFO tool.CodeGenTool: Beginning code generation
14/03/21 18:21:38 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM ACTIVITY t WHERE 1=0
14/03/21 18:21:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
14/03/21 18:21:38 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/03/21 18:21:41 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.jar
14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:42 INFO mapreduce.ImportJobBase: Beginning import of ACTIVITY
14/03/21 18:21:42 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/03/21 18:21:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ACTIVITY_ID), MAX(ACTIVITY_ID) FROM ACTIVITY
14/03/21 18:21:45 INFO mapred.JobClient: Running job: job_201403111406_0015
14/03/21 18:21:47 INFO mapred.JobClient: map 0% reduce 0%
14/03/21 18:22:19 INFO mapred.JobClient: map 25% reduce 0%
14/03/21 18:22:27 INFO mapred.JobClient: map 50% reduce 0%
14/03/21 18:22:29 INFO mapred.JobClient: map 75% reduce 0%
14/03/21 18:22:37 INFO mapred.JobClient: map 100% reduce 0%
...
14/03/21 18:22:39 INFO mapred.JobClient: Map input records=11
14/03/21 18:22:39 INFO mapred.JobClient: Map output records=11
14/03/21 18:22:39 INFO mapred.JobClient: Input split bytes=464
14/03/21 18:22:39 INFO mapred.JobClient: Spilled Records=0
14/03/21 18:22:39 INFO mapred.JobClient: CPU time spent (ms)=3430
14/03/21 18:22:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=506802176
14/03/21 18:22:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2714157056
14/03/21 18:22:39 INFO mapred.JobClient: Total committed heap usage (bytes)=506724352
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Transferred 103 bytes in 56.4649 seconds (1.8241 bytes/sec)
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Retrieved 11 records.

By default, Sqoop will put the resulting files in a directory named after the table, under your user’s home directory in HDFS. Let’s take a look and see what’s there:

[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/ACTIVITY
Found 6 items
-rw-r--r-- 1 oracle supergroup 0 2014-03-21 18:22 /user/oracle/ACTIVITY/_SUCCESS
drwxr-xr-x - oracle supergroup 0 2014-03-21 18:21 /user/oracle/ACTIVITY/_logs
-rw-r--r-- 1 oracle supergroup 27 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00000
-rw-r--r-- 1 oracle supergroup 17 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00001
-rw-r--r-- 1 oracle supergroup 24 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00002
-rw-r--r-- 1 oracle supergroup 35 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00003
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00000
1,Rate
2,Completed
3,Pause
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00001
4,Start
5,Browse

What you can see there is that Sqoop has imported the data as a series of “part-m” files – CSV files, one per MapReduce mapper (the import runs as a map-only job). There are various options in the docs for specifying compression and other performance features for Sqoop imports, but the basic format is a series of CSV files, one per mapper.
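
As an example of those options, here’s a sketch of an import that compresses its output, writes it to an explicit HDFS directory and uses two parallel mappers – the directory name is my own, and the flags are the standard documented Sqoop ones:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
  --username MOVIEDEMO --password welcome1 --table ACTIVITY \
  --target-dir /user/oracle/activity_compressed \
  --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --num-mappers 2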

You can also import Oracle and other RDBMS data directly into Hive, with Sqoop creating equivalent datatypes for the data coming in (basic datatypes only, none of the advanced spatial and other Oracle ones). For example, I could import the CREW table in the MOVIEDEMO schema like this, directly into an equivalent Hive table:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import

Taking a look at Hive, I can then see the table that’s been created, describe it and count the number of rows it contains:

[oracle@bigdatalite ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/oracle/hive_job_log_effb9cb5-6617-49f4-97b5-b09cd56c5661_1747866494.txt

hive> desc CREW;
OK
crew_id double 
name string 
Time taken: 2.211 seconds

hive> select count(*) from CREW;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403111406_0017, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201403111406_0017
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201403111406_0017
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-03-21 18:33:40,756 Stage-1 map = 0%, reduce = 0%
2014-03-21 18:33:46,797 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:47,812 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:48,821 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:49,907 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:50,916 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:51,929 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:52,942 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:53,951 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:54,961 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_201403111406_0017
MapReduce Jobs Launched: 
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 1.75 sec HDFS Read: 135111 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
6860
Time taken: 20.095 seconds

I can even do an incremental import to bring in new rows, appending their contents to the existing ones in Hive/HDFS:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import --incremental append --check-column CREW_ID

The data I’ve now loaded can be processed by a tool such as ODI, or you can use a tool such as Pig to do some further analysis or number crunching, like this:

grunt> RAW_DATA = LOAD 'ACTIVITY' USING PigStorage(',') AS  
grunt> (act_type: int, act_desc: chararray);  
grunt> B = FILTER RAW_DATA by act_type < 5; 
grunt> STORE B into 'FILTERED_ACTIVITIES' USING PigStorage(','); 

the output of which is another file on HDFS. Finally, I can export this data back to my Oracle database using the sqoop export feature:

[oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY_FILTERED --export-dir FILTERED_ACTIVITIES

So there’s a lot more to Sqoop than just this, including features and topics such as compression, transforming data and so on, but it’s a useful tool and also something you could call from the command line, using an ODI Tool if you want to. Recent versions of Hue also come with a GUI for Sqoop, giving you the ability to create jobs graphically and also schedule them using Oozie, the Hadoop scheduler in CDH4.

OBIEE Dashboard prompt: at least one mandatory

March 17th, 2014 by

It’s almost two years now since I joined the Rittman Mead team, and I’ve never written on this blog before, so it’s time to start. For this first post, nothing long: I wanted to share something I had to set up a few weeks ago on a project, and which could be interesting for others as well.

My need was related to dashboard prompts: it’s easy to have all of them optional (the default behavior), or to make a prompt (or many of them) mandatory. But what if you need at least one mandatory field among many in a dashboard prompt? For example, say you have a prompt to select a customer and it contains two fields: “customer name” and “customer number”. A user can select a customer based on its name or its number, and I want the user to always select a customer, so at least one of these two fields needs to be filled in.

There is no magic checkbox in OBIEE to achieve this behavior, and setting both fields as mandatory means the user needs to enter the customer details twice every time, which is not acceptable from a usability point of view.

There is, however, a way to simulate this behavior that has no impact (or a negligible one) on performance and doesn’t generate unneeded queries on the database.

First I create my dashboard prompt and, for my two fields, I set presentation variables: pv_CustomerName and pv_CustomerNumber in my example.

dbprompt_prompt_customer_name

Then I create an analysis where I perform the check to detect whether at least one of my prompt fields is set. I add a single column – it doesn’t matter which one from your subject area, because I edit the formula to set a fixed value – a value I’m absolutely sure doesn’t exist in my prompted fields, for example something like ‘x0xx_I_can_not_exist_xx0x’. Thanks to this single column with a fixed value, the analysis will never send a query to the database; the BI Server will handle it on its own.

dbprompt_condition_analysis_column_formula

Time to add some filters to this analysis. The filters are set on this single column and use the presentation variables I defined in the dashboard prompt. I set the condition to ‘is equal to / is in’, choose “Presentation Variable” as the value, enter the name of the first of my variables and, really importantly, set a default value: x0xx_I_can_not_exist_xx0x. Does it remind you of something? Yes, it’s the same strange value I manually set as the value of my single column (really important: it must be exactly the same value).

dbprompt_condition_analysis_add_filter

What does this filter do? If no variable is set, the filter behaves like a “1=1”; if a variable is set from the prompt, it behaves like a “1=2”. Now add all the other presentation variables in the same way, combined with an “AND” condition.
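
Written out, the logic of the condition analysis is roughly the following – a sketch using OBIEE’s presentation variable syntax with a default value and my example variable names, not something to copy in verbatim:

-- Column formula: a fixed literal, so the BI Server never needs to touch the database
'x0xx_I_can_not_exist_xx0x'

-- Filters, combined with AND; the {...} part is the default used when the variable is not set
'x0xx_I_can_not_exist_xx0x' = '@{pv_CustomerName}{x0xx_I_can_not_exist_xx0x}'
AND 'x0xx_I_can_not_exist_xx0x' = '@{pv_CustomerNumber}{x0xx_I_can_not_exist_xx0x}'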

dbprompt_condition_analysis_filters

This analysis will return either exactly one row, with the value “x0xx_I_can_not_exist_xx0x”, or no rows at all; I will use it as a condition to know whether at least one dashboard prompt field is set.

dbprompt_condition_analysis_result

Save this analysis, ideally with the word “condition” in the name, as we will use it as a condition to know whether the prompt is filled in (mine is called “condition_no_prompt_set”).

Time now to finish the work, as we have all the elements. Edit the dashboard where you want to add the “at least one mandatory field” functionality, add the dashboard prompt and two additional sections: in one of these sections add the analysis you want to run when at least one field is set, and in the other one add a text object with a polite message asking the user to use at least one of the dashboard prompt fields.

dbprompt_dashboard_2_sections

The last step is to add conditions deciding when to display the first or the second section.

dbprompt_dashboard_add_condition

For the section with the message, set a condition to display it when the “condition_no_prompt_set” analysis returns 1 row; for the other section (the one with the analysis to be executed when the prompt is correctly set), set a condition on the same “condition_no_prompt_set” analysis for when it returns 0 rows.

dbprompt_dashboard_condition_no_prompt

dbprompt_dashboard_condition_prompt_ok

Done! I have my “at least one mandatory dashboard prompt” as you can see in the next screenshots:

dbprompt_final_output_no_prompt

dbprompt_final_output_customer_name

dbprompt_final_output_customer_number

And the nice thing is that the condition managing the display doesn’t generate any query on the database (as you can see in the logs below); only the BI Server handles it, and it’s a really simple query, similar to a “SELECT ‘xxxx’ FROM dual WHERE 1=1”.

dbprompt_condition_query_log1
dbprompt_condition_query_log2

Oracle Data Integrator 12c Patch – Missing Components

March 12th, 2014 by

While working with a client this week, I came across an interesting issue, and ultimately a workaround, when patching Oracle Data Integrator 12c. But first, a little background…

The client was working through the OWB to ODI 12c migration process, detailed nicely by Stewart Bryson in his OTN article, “Making the Move from Oracle Warehouse Builder to Oracle Data Integrator 12c“. After patching OWB to enable the OWB to ODI Migration Tool, they performed the migration and assessed the migration report for errors. While most of the mappings were generated successfully in ODI, albeit with a few minor issues, there was one that was skipped completely. The report entry stated:

“No valid component type found with name: PIVOT“. That line indicates that the patch to ODI 12c had not yet been completed. Easy enough to resolve, right? Following the patch guidelines that Stewart provided, we installed patch 17053768, reopened ODI Studio, and …

Components - No PIVOT!

Still no PIVOT component!

After verifying that indeed the correct patch was applied, we dug deeper. A query against the ODI repository metadata showed that the PIVOT/UNPIVOT components did indeed exist, but for some reason, they were not being displayed in ODI Studio.

Query component metadata

After a call for help to several ODI experts, I received the same response from both Jérôme Françoisse and David Allan (sorry David, Jérôme beat you to it!). While ODI Studio is closed, delete the following folder:

Once ODI Studio is reopened, this folder will recreate itself and the components will magically appear.

Components - with PIVOT

It seems this has something to do with caching of the ODI Studio components, specifically for OSGi – according to David Allan, it’s done in an attempt to speed things up for the developer. Hopefully, subsequent patches will remove the folder for us so we don’t have to think so hard next time!

Big thanks to Jérôme and David for their help!

Edit: As Jeff pointed out in the comments (and my colleague Pete Scott did after reading this post!), the ReadMe.txt file for the patch mentions the need to delete the  folder. Thanks Jeff and Pete! 

