Statistical Analysis in the Database

My first ever presentation at the UKOUG conference (long before I started working with Mark Rittman and Jon Mead) was an introduction to analytic functions. Since then analytics have remained a key part of my work as a DW developer; many of the processes that I have implemented would be so much less efficient without the use of  functions such as SUM, AVG, LAG, LEAD, FIRST_VALUE and the RANK functions. Analytics, however, are not just for reporting they are also a major feature in performant ETL; for example we have put lag, lead and last_value to good use in change data capture processes for one of our clients.

Recently, I have been writing a new presentation, "BI is More Than Slice and Dice", for April's Collaborate 12 conference in Las Vegas. In it I will be looking at some BI techniques beyond simple aggregation and dicing up the data based on the dimensions. As part of the talk I will be considering some of the things you can do inside the Oracle 11gR2 database - either for free or by paying out some extra license cash.

One of the "for free" things is the use of the statistical SQL database functions, which neatly ties in with my love of analytic functions. Although most of these statistical SQL functions can be used in the analytic function form (with an OVER (PARTITION BY…) clause), some are "aggregate" only (that is, with an optional GROUP BY). In addition to the familiar AVG and the lesser used MEDIAN and STATS_MODE there are functions to calculate things such as: standard deviations, correlations, linear regressions and variance. Other functions (such as standard error) can be calculated by combining multiple statistical functions. We can carry out some very sophisticated analysis directly on our data warehouse data using SQL. Of course to make this analysis valuable we need to apply the correct analysis to our data - and that requires an understanding of the data and an appreciation of how to use statistical techniques; we should no more think of using a socket wrench to drive a nail than use the wrong statistical technique on our data. Recently, Kim Berg Hansen blogged a good worked example of the use of the analytic Linear Regression slope function to create a sales forecast based on previous sales. Of course we can get a lot more sophisticated in our analysis and take other factors into account but it is a good introductory example of what can be done using in-database functionality. More examples can be found in that must read Oracle manual, the Data Warehousing Guide.

As you may have seen elsewhere on the Rittman Mead Blog, it is a season of product launches, acquisitions and new versions  at Oracle - Mark has recently blogged mini series on Endeca, Exalytics and OBI, I have written a little about ODI, but we have not really mentioned Oracle's involvement with R - the open source programming language and environment for statistical computing and graphics. If you have not come across R before it has it's origins in Auckland University's Statistics Department and has, with community contribution, evolved into a powerful environment that compares well with commercial packages such as SAS and SPSS. Oracle recently released its own open source R Distribution, contributed the ROracle package which provides robust OCI database connectivity to R distributions, and introduced two licensable products: Oracle R Enterprise (ORE) and the R connector for Hadoop (part of the Big Data Connector license) which allows the use of R functionality on data stored in a Hadoop file system.

ORE is installed directly into the database (currently 64-bit Linux is supported) and allows us to execute R code and use R packages where the the data is, that is in the Oracle database (whereas conventional R distributions extract the data to the R client's memory space). This is a great advantage for data warehouse size tables which could be too big to process in client memory. ORE is licensed as part of the Advanced Analytics database option (along with Oracle Data Mining, ODM) - this is not a "use either A or B" bundling of products since ORE can give direct R-Language access to the advanced algorithms in ODM, on the other hand R Enterprise permits some forms of analysis that are not possible within ODM. With ORE we can develop very sophisticated statistical analyses within the database (such as forecasting, relationship discovery and what-if analysis) and visualize the results either through the R client application or OBIEE; something which my colleague, Jordan Meyer, has been doing recently. Hopefully Jordan will be able to share some of his experiences with both ORE and Hadoop soon.