The Rittman Mead BI Forum 2015 is running in Brighton from May 6th-8th 2015, and Atlanta from May 13th – 15th 2015. At this year’s events we’re introducing our first “data visualization challenge”, open to all attendees and with the dataset and scenario open from now until the start of each event. Using Oracle Business Intelligence 11g and any plugins or graphics libraries that embed and interact with OBIEE (full details and rules below), we challenge you to create the most effective dashboard or visualisation and bring it along to demo on the Friday of each event.
Help DonorsChoose.org Donors Use their Funds Most Effectively
This year’s inaugural data visualisation challenge is based around the project and dataset from DonorsChoose.org, an online charity that makes it easy for anyone to help public school classroom projects that need funding (Rittman Mead will be making donations on behalf of the Brighton and Atlanta BI Forums to show our support for this great initiative). The DonorsChoose.org project and dataset have been used in several hackathons and data crunching contests around the world, with analysis and visualisations helping to answer questions such as:
Why do some projects get funded, while others don’t?
Who donates to projects from different subjects?
Does proximity to schools change donation behavior?
What types of materials are teachers lacking the most? (e.g. chalk, paper, markers, etc.)
Do poorer schools ask for more or less money from their donors?
If I need product x, what is the difference between projects asking for x that were successful and those that weren’t?
Your challenge is to import this dataset into your analytical database of choice, and then create the best visualisation or dashboard in OBIEE to answer the following question: “Which project can I donate to, where my donation will have most impact?”
How Do I Take Part?
For more on the BI Forum 2015 Data Visualization Challenge including how to download the dataset and the rules of the challenge, take a look at the Rittman Mead BI Forum 2015 Data Visualisation Challenge web page where we’ve provided full details. You can either enter as an individual or as part of a team, but you must be registered for either the Brighton or Atlanta BI Forum events and come along in-person to demonstrate your solution – numbers at each event are strictly limited though, so make sure you register soon at the Rittman Mead BI Forum 2015 home page.
Nearly four years ago I wrote a blog post entitled “Instrumenting OBIEE – The Final Chapter”. With hindsight, that title suffix (“The Final Chapter”) may have been a tad presumptuous and naïve of me (or perhaps I can just pretend to be ironic now and go for a five-part-trilogy style approach…). Back then OBIEE 11g had only just been released (who remembers 11.1.1.3 in all its buggy-glory?), and in the subsequent years we’ve had significant patchset releases of OBIEE 11g bringing us up to 11.1.1.7.150120 now and with talk of OBIEE 12c around the corner.
As a fanboi of Cary Millsap and his approach to measuring and improving performance, instrumenting code in general – and OBIEE specifically – is something that’s interested me for a long time. The article was the final one that I wrote on my personal blog before joining Rittman Mead and it’s one that I’ve been meaning to re-publish here for a while. A recent client engagement gave me cause to revisit the instrumentation approach and refine it slightly as well as update it for a significant change made in OBIEE 11.1.1.7.1.
What do I mean by instrumentation? Instrumentation is making your program expose information about what is being done, as well as actually doing it. Crudely put, it’s something like this:
10 PRINT "THE TIME IS " NOW()
20 REM ... DO THE THING ...
30 PRINT "I'VE DONE THAT THING, IT TOOK " X " SECONDS"
Rather than just firing some SQL at the database, instead we associate with that SQL information about what program sent it, and what that program was doing, who was using it, and so on. Instrumentation enables you to start analysing performance metrics against tangible actions rather than just amorphous clumps of SQL. It enables you to understand the workload profile on your system and how that’s affecting end users.
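To make the idea concrete, here is a minimal sketch in Python (not OBIEE or Oracle code) of what instrumented code can look like: each unit of work is wrapped so that who ran it, what it was, and how long it took are recorded alongside the work itself. The names `instrumented` and `EVENTS`, and the sample module, action and user values, are hypothetical and chosen only for illustration.

```python
import time
from contextlib import contextmanager

# Collected instrumentation records; a real system might write to a log or table.
EVENTS = []


@contextmanager
def instrumented(module, action, user):
    """Record who did what, and how long it took, around a block of work."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        EVENTS.append(
            {"module": module, "action": action, "user": user, "seconds": elapsed}
        )


# The work itself is unchanged; the context manager adds the metadata.
with instrumented(module="Sales Dashboard", action="Revenue by Region", user="jsmith"):
    total = sum(range(1000))  # stand-in for the real work

for event in EVENTS:
    print(
        f"{event['user']} ran {event['module']} / {event['action']} "
        f"in {event['seconds']:.6f}s"
    )
```

The point of the design is that the metadata travels with the work, so a slow run can be tied back to a named action and user rather than to an anonymous piece of code.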
Pop quiz: which of these is going to be easier to work with for building up an understanding of a system’s behaviour and workload?
The second one gives us the same information as before, plus the analysis being run by OBIEE, and the dashboard and page.
The benefits of instrumentation work both ways. It makes DBAs happy because they can look at resource usage on the database and trace it back easily to the originating OBIEE dashboard and user. Instrumentation also makes life much easier for troubleshooting OBIEE performance because it’s easy to trace a user’s entire session through from browser, through the BI Stack, and down into the database.
Instrumentation for OBIEE – Step By Step
If you want the ‘tl;dr’ version, the “how” rather than the “why”, here we go. For full details of why it works, see later in the article.
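In your RPD create three session variables, named after the request variables they stand in for: SAW_SRC_PATH, SAW_DASHBOARD and SAW_DASHBOARD_PG. These are going to be the default values for variables that we’re going to send to the database. Make sure you set “Enable any user to set the value”.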
Set up a session variable initialization block to populate these variables. It is just a “dummy” init block as all you’re doing is setting them to empty/default values, so a ‘SELECT … FROM DUAL’ is just fine:
For each Connection Pool you want to instrument, go to the Connection Scripts tab and add these three scripts to the Execute before query section:
-- Pass the OBIEE user's name to CLIENT_IDENTIFIER
You can leave the comments in there, and in fact I’d recommend doing so to make it clear for future RPD developers what these scripts are for.
Your connection pool should look like this:
An important point to note is that you generally should not be adding these scripts to connection pools that are used for executing initialisation blocks. Initialisation block queries won’t have these request variables so if you did want to instrument them you’d need to find something else to include in the instrumentation.
Once you’ve made the above changes you should see MODULE, CLIENT_IDENTIFIER and CLIENT_INFO being populated in the Oracle system views:
You’ll note that we don’t set ACTION – that’s because OBIEE now sends a hash of the physical query text across in this column, meaning we can’t use it ourselves. Unfortunately the current version of OBIEE doesn’t store the physical query hash anywhere other than in nqquery.log, meaning that you can’t take advantage of it (i.e. link it back to data from Usage Tracking) within the database alone.
That’s all there is to it – easy! If you want to understand exactly how and why it works, read on…
Instrumentation for OBIEE – How Does it Work?
When OBIEE runs a dashboard, it does so by taking each analysis on that dashboard and sending a Logical Request for that analysis to the BI Server (nqsserver). The BI Server parses and compiles that Logical request into one or more Physical requests which it then sends to the source database(s).
OBIEE connects to the database via a Connection Pool, which specifies the database-specific connection information including credentials and the data source name (such as the TNS name for Oracle). The Connection Pool, as the name suggests, pools connections so that OBIEE is not going through the overhead of connecting and disconnecting for every single query that it needs to run. Instead it will open one or more connections as needed, and share those connections between queries.
As well as the obvious configuration options in a connection pool such as database credentials, OBIEE also supports the option to send additional SQL to the database when it opens a connection and/or sends a new query. It’s this nice functionality that we piggy-back to enable our instrumentation.
The information that OBIEE can send back through its database connection is limited by what we can expose in variables. From the BI Server’s point of view there are three types of variables: repository variables, session variables, and request variables.
The first two are fairly simple concepts; they’re defined within the RPD and populated with Initialisation Blocks (often known as “init blocks”) that are run by the BI Server either on a schedule (repository variables) or per user (session variables). There’s a special type of session variables known as System Session Variables, of which USER is a nice obvious example. These variables are pre-defined in OBIEE and are generally populated automatically when the user session begins (although some, like LOGLEVEL, still need an init block to set them explicitly).
The third type of variable, request variable, is slightly less obvious in function. In a nutshell, they are variables that are specified in the logical request sent to the BI Server, and are passed through to the internals of the BI Server. They’re often used for activating or disabling certain functionality. For example, you can tell OBIEE to specifically not use its cache for a request (even if it finds a match) by setting the request variable DISABLE_CACHE_HIT.
Request variables can be set manually inline in an analysis from the Advanced tab:
And they can also be set from Variable Prompts either within a report prompt or as a standalone dashboard prompt object. The point about request variables is that they are freeform; if they specify the name of an existing session variable then they will override it (if permitted), but they do not require the session variable to exist. We can see this easily enough – and see a variable request prompt in action at the same time. From the Prompts tab of an analysis I’ve added a Variable Prompt (rather than the usual Column Prompt) and given it a made up name, FOO:
Now when I run the analysis I specify a value for it:
and in the query log there’s the request variable:
which brings me on very nicely to the key point here. When Presentation Services sends a query to the BI Server it does so with a bunch of request variables set, including QUERY_SRC_CD and SAW_SRC_PATH. If you’ve worked with OBIEE for a while then you’ll recognise these names – they’re present in the Usage Tracking table S_NQ_ACCT. Ever wondered how OBIEE knows what values to store in Usage Tracking? Now you know. It’s whatever Presentation Services tells it to. You can easily test this yourself by playing around in nqcmd:
2015-03-21 11:55:10    /users/prodney/request variable example    Report
Key takeaway here: Presentation Services defines a bunch of useful request variables when it sends Logical SQL to the BI Server:
Embedding Variables in Connection Script Calls
There are four options that we can configure when connecting to the database from OBIEE. These are MODULE, ACTION, CLIENT_INFO and CLIENT_IDENTIFIER.
As of OBIEE version 11.1.1.7.1 (i.e. OBIEE >= 11.1.1.7.131017) OBIEE automatically sets the ACTION field to a hash of the physical query – for more information see Doc ID 1941378.1. That leaves us with three remaining fields (since OBIEE sets ACTION after anything we do with the Connection Pool): MODULE, CLIENT_INFO and CLIENT_IDENTIFIER.
The syntax of the command in a Connection Script is physical SQL and the VALUEOF function to extract the OBIEE variable:
As a simple example here is passing the userid of the OBIEE user, using the Execute before query connection script:
-- Pass the OBIEE user's name to CLIENT_IDENTIFIER
This would be set for every Connection Pool – but only those used for query execution – not init blocks. Run a query that is routed through the Connection Pool you defined the script against and check out V$SESSION:
The USER session variable is always present, so this is a safe thing to do. But, what about SAW_SRC_PATH? This is the path in the Presentation Catalog of the analysis being executed. Let’s add this into the Connection Pool script, passing it through as the CLIENT_INFO:
This works just fine for analyses within a dashboard, or standalone analyses that have been saved. But what about a new analysis that hasn’t been saved yet? Unfortunately the result is not pretty:
[State: S1000] [NQODBC] [SQL_STATE: S1000] [nQSError: 10058] A general error has occurred.
[nQSError: 43113] Message returned from OBIS.
[nQSError: 23006] The session variable, NQ_SESSION.SAW_SRC_PATH, has no value definition.
Statement execute failed
That’s because SAW_SRC_PATH is a request variable and since the analysis has not been saved Presentation Services does not pass it to BI Server as a request variable. The same holds true for SAW_DASHBOARD and SAW_DASHBOARD_PG if you run an analysis outside of a dashboard – the respective request variables are not set and hence the connection pool script causes the query itself to fail.
The way around this is we cheat, slightly. If you create session variables with the same names as the request variables that we want to use in the connection pool scripts, then we avoid the above nasty failures. If the request variables are set then all is well, and if they are not then we fall back on whatever value we initialise the session variable with.
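The fallback behaviour can be modelled in a few lines. This is only a Python sketch of the lookup logic, not how the BI Server is implemented: `resolve_variables` is a hypothetical helper, the catalog paths are invented, and it assumes the session variables allow users to override them.

```python
def resolve_variables(session_defaults, request_variables):
    """Return the values a connection script would see.

    A request variable overrides the session variable of the same name;
    where no request variable was sent, the session default is used.
    """
    resolved = dict(session_defaults)
    resolved.update(request_variables)
    return resolved


# Defaults as set by the dummy init block (empty strings here).
SESSION_DEFAULTS = {
    "SAW_SRC_PATH": "",
    "SAW_DASHBOARD": "",
    "SAW_DASHBOARD_PG": "",
}

# A saved analysis run from a dashboard: all three request variables are sent.
on_dashboard = resolve_variables(
    SESSION_DEFAULTS,
    {
        "SAW_SRC_PATH": "/shared/Example/Analyses/Revenue",
        "SAW_DASHBOARD": "/shared/Example/_portal/Sales",
        "SAW_DASHBOARD_PG": "Overview",
    },
)

# An unsaved analysis: no request variables are sent, so the defaults
# are used and the lookup no longer fails.
unsaved = resolve_variables(SESSION_DEFAULTS, {})
```

Without the session-variable defaults, the second case corresponds to the "has no value definition" error described earlier.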
The final icing on the cake of the solution given above is a bit of string munging with INSTR and SUBSTR to convert and concatenate the dashboard path and page into a single string, so instead of :
Which is much easier on the eye when looking at dashboard names. Similarly with the analysis path we strip all but the last section of it.
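The connection scripts do this munging in physical SQL with INSTR and SUBSTR. As an illustration of the same logic only, here is a Python sketch; the function names, the example paths and the "name / page" output format are assumptions, not the article's actual output.

```python
def last_segment(path):
    """Return the text after the final '/' (the whole string if there is none)."""
    return path.rsplit("/", 1)[-1]


def dashboard_label(dashboard_path, page):
    """Combine the dashboard name and page into one short string."""
    name = last_segment(dashboard_path)
    return f"{name} / {page}" if page else name


# Hypothetical catalog paths.
print(last_segment("/shared/Example/Analyses/Revenue by Region"))
# -> Revenue by Region
print(dashboard_label("/shared/Example/_portal/Sales", "Overview"))
# -> Sales / Overview
```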
Granular monitoring of OBIEE on the database
Once OBIEE has been configured to be more articulate in its connection to the database, it enables the use of DBMS_MONITOR to understand more about the performance of given dashboards, analyses, or queries for a given user. Through DBMS_MONITOR the collection of statistics such as DB time, DB CPU, and so on can be triggered, as well as trace-file generation for queries matching the criteria specified.
As an example, here is switching on system statistics collection for just one dashboard in OBIEE, using SERV_MOD_ACT_STAT_ENABLE:
Now Oracle starts to collect information whenever that particular dashboard is run, which we can use to understand more about how it is performing from a database point of view:
Similarly the CLIENT_IDENTIFIER field can be used to collect statistics with CLIENT_ID_STAT_ENABLE or trigger trace file generation with CLIENT_ID_TRACE_ENABLE. What you populate CLIENT_IDENTIFIER with is up to you – by default the script I’ve detailed at the top of this article inserts the OBIEE username in it, but you may want to put the analysis here if that’s of more use from a diagnostics point of view on the database side. The CLIENT_INFO field is still available for the other item, but cannot be used with DBMS_MONITOR for identifying queries.
Each year at the Rittman Mead BI Forum we host an optional one-day masterclass before the event opens properly on Wednesday evening, with guest speakers over the years including Kurt Wolff, Kevin McGinley and last year, Cloudera’s Lars George. This year I’m particularly excited that together with Jordan Meyer, our Head of R&D, I’ll be presenting the masterclass on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture”.
Last year we launched at the Brighton BI Forum event a new reference architecture that Rittman Mead had collaborated with Oracle on, that incorporated big data and schema-on-read databases into the Oracle data warehouse and BI reference architecture. In two subsequent blogposts, and in a white paper published on the Oracle website a few weeks after, concepts such as the “Discovery Lab”, “Data Reservoirs” and the “Data Factory” were introduced as a way of incorporating the latest thinking, and product capabilities, into the reference architecture for Oracle-based BI, data warehousing and big data systems.
One of the problems I always feel with reference architectures though is that they tell you what you should create, but they don’t tell you how. Just how do you go from a set of example files and a vague requirement from the client to do something interesting with Hadoop and data science, and how do you turn the insights produced by that process into a production-ready, enterprise Big Data system? How do you implement the data factory, and how do you use new tools such as Oracle Big Data Discovery and Oracle Big Data SQL as part of this architecture? In this masterclass we’re looking to explain the “how” and “why” to go with this new reference architecture, based on experiences working with clients over the past couple of years.
The masterclass will be divided into two sections; the first, led by Jordan Meyer, will focus on the data discovery and “data science” parts of the Information Management architecture, going through initial analysis and discovery of datasets using R and Oracle R Enterprise. Jordan will share techniques he uses from both his work at Rittman Mead and his work with Slacker Radio, a Silicon Valley startup, and will introduce the R and Oracle R Enterprise toolset for uncovering insights, correlations and patterns in sample datasets and productionizing them as database routines. Over his three hours he’ll cover topics including:
Session #1 – Data exploration and discovery with R (2 hours)
1.1 Introduction to R
1.2 Tidy Data
1.3 Data transformations
1.4 Data Visualization
Session #2 – Predictive Modeling in the enterprise (1 hr)
2.3 Deploying models to the data warehouse with ORE
After lunch, I’ll take the insights and analysis patterns identified in the Discovery Lab and turn them into production big data pipelines and datasets using Oracle Data Integrator 12c, Oracle Big Data Discovery and Oracle Big Data SQL. For a flavour of the topics I’ll be covering take a look at this Slideshare presentation from a recent Oracle event, and in the masterclass itself I’ll concentrate on techniques and approaches for ingesting and transforming streaming and semi-structured data, storing it in Hadoop-based data stores, and presenting it out to users using BI tools like OBIEE, and Oracle’s new Big Data Discovery.
Session #3 – Building the Data Reservoir and Data Factory (2 hr)
3.1 Designing and Building the Data Reservoir using Cloudera CDH5 / Hortonworks HDP, Oracle BDA and Oracle Database 12c
3.2 Building the Data Factory using ODI12c & new component Hadoop KM modules, real-time loading using Apache Kafka, Spark and Spark Streaming
Session #4 – Accessing and visualising the data (1 hr)
4.1 Discovering and Analyzing the Data Reservoir using Oracle Big Data Discovery
4.2 Reporting and Dashboards across the Data Reservoir using Oracle Big Data SQL + OBIEE 11g
You can register for a place at the two masterclasses when booking your BI Forum 2015 place, but you’ll need to hurry as we limit the number of attendees at each event in order to maximise interaction and networking within each group. Registration is open now and the two events take place in May – hopefully we’ll see you there!
An important part of working with ODI is analysing the performance when it runs, and identifying steps that might be inefficient as well as variations in runtime against a baseline trend. The Operator tool in ODI itself is great for digging down into individual sessions and load plan executions, but for broader analysis we need a different approach. We also need to make sure we keep the data available for trend analysis, as the tables behind Operator are often purged for performance reasons.
In this article I’m going to show how we can make use of a generic method of pulling information out of an RDBMS such as Oracle and storing it in Elasticsearch, from where it can be explored and analysed through Kibana. It’s standalone, it’s easy to do, it’s free open source – and it looks and works great! Here I’m going to use it for supporting the analysis of ODI runtime information, but it is equally applicable to any time-based data you’ve got in an RDBMS (e.g. OBIEE Usage Tracking data).
Kibana is an open-source data visualisation and analysis tool, working with data stored in Elasticsearch. These tools work really well for very rapid analysis of any kind of data that you want to chuck at them quickly and work with. By skipping the process of schema definition and data modelling the time taken to the first results is drastically reduced. It enables you to quickly start “chucking about” data and getting meaning out of it before you commit full-scale to how you want to analyse it, which is what the traditional modelling route can sometimes force you to do prematurely.
ODI writes runtime information to the database, about sessions run, steps executed, time taken and rows processed. This data is important for analysing things like performance issues, and batch run times. Whilst with the equivalent runtime data (Usage Tracking) from OBIEE there is the superb RPD/Dashboard content that Oracle ship in SampleApp v406, for ODI the options aren’t as vast, ultimately being based on home-brew SQL against the repository tables using the repository schema documentation from Oracle. Building an OBIEE metadata model against the ODI schema is one option, but then requires an OBIEE server on which to run it – or merging into an existing OBIEE deployment – which means that it can become more hassle than it’s worth. It also means a bunch of up-front modelling before you get any kind of visualisations and data out. By copying the data across into Elasticsearch it’s easy to quickly build analyses against it, and has the additional benefit of retaining the data as long as you’d like meaning that it’s still available for long-term trend analysis once the data’s been purged from the ODI repository itself.
Let’s take a bit of a walk through the ODI dashboard that I’ve put together. First up is a view on the number of sessions that have run over time, along with their duration. For duration I’ve shown 50th (median), 75th and 95th percentiles to get an idea of the spread of session runtimes. At the moment we’re looking at all sessions, so it’s not surprising that there is a wide range since there’ll always be small sessions and longer ones:
Next up on the dashboard comes a summary of top sessions by runtime, both cumulative and per-session. The longest running sessions are an obvious point of interest, but cumulative runtime is also important; something may only take a short while to run when compared to some singular long-running sessions, but if it runs hundreds of times then it all adds up and can give a big performance boost if time is shaved off it.
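The difference between the two rankings is easy to see with a small worked example. This Python sketch uses invented session names and durations (they are not taken from the dashboard) to show a short session that runs often overtaking a long one on cumulative runtime.

```python
from collections import defaultdict

# Hypothetical (session_name, duration_seconds) rows, one per execution.
executions = [("LOAD_SALES_FACT", 1800), ("LOAD_SALES_FACT", 2100)]
executions += [("REFRESH_LOOKUP", 20)] * 300


def summarise(rows):
    """Return {name: (run_count, cumulative_seconds, longest_single_run)}."""
    stats = defaultdict(lambda: [0, 0, 0])
    for name, seconds in rows:
        entry = stats[name]
        entry[0] += 1
        entry[1] += seconds
        entry[2] = max(entry[2], seconds)
    return {name: tuple(values) for name, values in stats.items()}


summary = summarise(executions)

# Rank the same sessions two ways.
by_cumulative = sorted(summary, key=lambda n: summary[n][1], reverse=True)
by_longest = sorted(summary, key=lambda n: summary[n][2], reverse=True)

print(by_cumulative[0])  # REFRESH_LOOKUP: 300 runs x 20s = 6000s in total
print(by_longest[0])     # LOAD_SALES_FACT: longest single run is 2100s
```

Here shaving even a few seconds off the 20-second session would save more total time than a sizeable improvement to the long-running one.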
Plotting out session execution times is useful for seeing when the longest running sessions ran:
The final element on this first dashboard is one giving the detail for each of the top x long-running session executions, including the session number so that it can be examined in further detail through the Operator tool.
Kibana dashboards are interactive, so you can click on a point in a graph to zoom in on that time period, as well as click & drag to select an arbitrary range. The latter technique is sometimes known as “Brushing”, and if I’m not describing it very well have a look at this example here and you’ll see in an instant what I mean.
As you focus on a time period in one graph the whole dashboard’s time filter changes, so where you have a table of detail data it then just shows it for the range you’ve selected. Notice also that the granularity of the aggregation changes as well, from a summary of every three hours in the first of the screenshots through to 30 seconds in the last. This is a nice way of presenting a summary of data, but isn’t always desirable (it can mask extremes and abnormalities) so can be configured to be fixed as well.
Time isn’t the only interaction on the dashboard – anything that’s not a metric can be clicked on to apply a filter. So in the above example where the top sessions by cumulative time are listed out, we might want to find out more about the one with several thousand executions.
Simply clicking on it then filters the dashboard and now the session details table and graph show information just for that session, including duration, and rows processed:
Session performance analysis
As an example of the benefit of using a spread of percentiles, here is a particular session that had an erratic runtime with great variation, which then stabilised. The purple line is the 95th percentile response time; the green and blue are 50th and 75th respectively. It’s clear that whilst up to 75% of the sessions completed in about the same kind of time each time they ran, the remaining quarter took anything up to five times as long.
One of the most important things in performance is ensuring consistent performance, and that is what happens here from about half way along the horizontal axis at c.February:
But what was causing the variation? By digging a notch deeper and looking at the runtime of the individual steps within the given session it can be seen that the inconsistent runtime was caused by a single step (the green line in this graph) within the execution. When this step’s runtime stabilises, so does the overall performance of the session:
This is performing a post-mortem on a resolved performance problem to illustrate how useful the data is – obviously if there were still a performance problem we’d have a clear path of investigation to pursue thanks to this data.
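To show why a spread of percentiles reveals this kind of inconsistency, here is a small Python sketch using the nearest-rank method on invented runtimes. It is an illustration only: Kibana and Elasticsearch compute their percentiles in their own way, and the figures below are not taken from the session shown.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical runtimes in seconds: 15 runs take a minute, 5 take five times as long.
runtimes = [60] * 15 + [300] * 5

p50, p75, p95 = (percentile(runtimes, p) for p in (50, 75, 95))
mean = sum(runtimes) / len(runtimes)

print(p50, p75, p95)  # 60 60 300
print(mean)           # 120.0
```

The median and 75th percentile sit together while the 95th is far higher, which is the pattern described for the erratic session; the mean alone (120 seconds) matches none of the actual runs.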
Data’s pulled from the ODI repository tables using Elasticsearch JDBC river, from where it’s stored and indexed in Elasticsearch, and presented through Kibana 4 dashboards.
The data load from the repository tables into Elasticsearch is incremental, meaning that the solution works for both historical analysis and more immediate monitoring too. Because the data’s actually stored in Elasticsearch for analysis it means the ODI repository tables can be purged if required and you can still work with a full history of runtime data in Kibana.
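One common way to make a load like this incremental is a high-water mark: remember the latest timestamp already copied and fetch only newer rows on each run. The Python sketch below illustrates that idea with an in-memory SQLite table standing in for the repository and a list standing in for the index. The table, column and function names are hypothetical (this is not the ODI schema or the JDBC river configuration), and the article does not state that its load works exactly this way.

```python
import sqlite3

# Stand-in for the source repository: an in-memory table of session runs.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE session_runs (id INTEGER, name TEXT, ended_at TEXT)")
source.executemany(
    "INSERT INTO session_runs VALUES (?, ?, ?)",
    [
        (1, "LOAD_A", "2015-03-01 01:00:00"),
        (2, "LOAD_B", "2015-03-01 02:00:00"),
    ],
)

target = []  # stand-in for the search index


def load_increment(conn, index, high_water_mark):
    """Copy rows newer than the high-water mark and return the new mark."""
    rows = conn.execute(
        "SELECT id, name, ended_at FROM session_runs "
        "WHERE ended_at > ? ORDER BY ended_at",
        (high_water_mark,),
    ).fetchall()
    index.extend(rows)
    return rows[-1][2] if rows else high_water_mark


mark = load_increment(source, target, "")    # first run copies both rows
source.execute(
    "INSERT INTO session_runs VALUES (3, 'LOAD_C', '2015-03-02 01:00:00')"
)
mark = load_increment(source, target, mark)  # second run copies only the new row

# Purging the source does not affect what has already been copied.
source.execute("DELETE FROM session_runs")
mark = load_increment(source, target, mark)  # nothing new to copy

print(len(target), mark)
```

After the purge the target still holds all three rows, which is the property that keeps long-term trend analysis possible.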
If you’re interested in finding out more about this solution and how Rittman Mead can help you with your ODI and OBIEE implementation and monitoring needs, please do get in touch.
I first came across John Foreman when somebody recommended his book to me, “Data Smart”, a year or so ago. At that time Rittman Mead were getting more and more requests from our customers asking us to help with their advanced analytics and predictive modelling needs, and I was looking around for resources to help myself and the team get to grips with some of the more advanced modelling and statistical techniques Oracle’s tools now support – techniques such as clustering and pattern matching, linear regression and genetic algorithms.
One of the challenges when learning these sorts of techniques is not getting too caught up in the tools and technology – R was our favoured technology at the time, and there’s lots to it – so John’s book was particularly well-timed as it goes through these types of “data science” techniques but focuses on Microsoft Excel as the analysis tool, with simple examples and a very readable style.
Back in his day job, John is Chief Data Scientist at MailChimp and has become a particularly in-demand speaker following the success of his book, and I was very excited to hear from Charles Elliott, our Practice Manager for Rittman Mead America, that he lived near John in Atlanta and had arranged for him to keynote at our Atlanta BI Forum event. His Keynote will be entitled “How Mailchimp used qualitative and quantitative analysis to build their next product” and we’re very much looking forward to meeting him at our event in Atlanta on May 13th-15th 2015.
Our second keynote speaker at the Brighton Rittman Mead BI Forum 2015 event is none other than Reiner Zimmerman, best known in EMEA for organising the Oracle DW Global Leaders Program. We’ve known Reiner for several years now as Rittman Mead are one of the associate sponsors for the program, which aims to bring together the leading organizations building data warehouse and big data systems on the Oracle Engineered Systems platform.
A bit like the BI Forum (but even more exclusive), the DW Global Leaders program holds meetings in the US, EMEA and AsiaPac over the year and is a fantastic networking and knowledge-sharing group for an exclusive set of customers putting together the most cutting-edge DW and big data systems on the latest Oracle technology. Reiner’s also an excellent speaker and a past visitor to the BI Forum, and his session entitled “Hadoop and Oracle BDA customer cases from around the world” will be a look at what customers are really doing, and the value they’re getting, from building big data systems on the Oracle platform.
Registration is now open for both the Brighton and Atlanta BI Forum 2015 events, with full details including the speaker line-up and how to register on the event website. Keep an eye on the blog for more details of both events later this week including more on the masterclass by myself and Jordan Meyer, and a data visualisation “bake-off” we’re going to run on the second day of each event. Watch this space…!