Basket Analysis in the Oracle Database
Last time I gave a taster of my forthcoming Collaborate 12 talk on techniques beyond the traditional “slice and dice” that we know so well. I wrote mainly about the ways in which we can perform statistical analysis on our data either through SQL or by using R. This time I am going to talk about another of the often overlooked pieces of functionality supplied with the Oracle database, the PL/SQL DBMS_FREQUENT_ITEMSET package. This package contains two public functions, one for row-based (or HORIZONTAL) item sets and the other for TRANSACTIONAL (that is one item per row) item sets. Unlike the statistical SQL functions I mentioned last time this time we are using an Oracle PL/SQL package.
So what are frequent itemsets? If somebody asked what is our top selling item then it is a simple matter to look for the item with the highest sum of units shipped (or price, if we are analysing by value) for a time period – a simple SELECT ITEM_NAME, SUM(UNITS_SHIPPED) as SHIPPED FROM FACT_SALES GROUP BY ITEM_NAME will do nicely, add an analytic RANK or ROW_NUMBER around it (or just an order by) and we can easily do a top 5 best sellers list. With frequent item sets on the other hand we are looking at items that are associated together in a single ‘transaction’ or ‘basket’. In effect we combining multiple items and counting the instances of those combinations. Sometimes these associations are entirely predictable such as a grocery store finding that many people who buy hot dog buns also buy hot dog sausages as part of the same purchase; other times things are less than predictable (but are explainable) like the data mining folklore analysis of diapers and beer. We can use itemsets whenever we have a list of items for an event, my examples have been about sales – where the event is taking the shopping cart to the checkout (traditional basket analysis), but the same techniques can also be used to analyse bills of material in engineering repair shops and website access patterns (which pages are hit in a single browser session). Knowing our frequent item sets allows us to, for example: make cross selling suggestions to the customer (we have all seen websites that show beneath a product the line “customers that bought this also bought xxxx”); improve customer experience by optimizing website link layouts; measure the success of multi-product promotions. Recently reported use cases have even used the itemset to segment the customer based on items purchased.
Using the PL/SQL packages is not quite as simple as using a SQL function in a query, however as both functions return a TABLE type we are able to use the functions in a SQL select query. As I mentioned in the first paragraph, there are two functions: DBMS_FREQUENT_ITEMSET.FI_HORIZONTAL and DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL both functions take the same arguments, and both return a table of the same structure, the only difference is the format of the SYSREFCURSOR passed as the first argument. For FI_HORIZONTAL we pass a cursor with a column for each item in the basket, each row representing a new basket. For FI_TRANSACTIONAL we pass a two column cursor with the first column the basket identifier and the second the item. Items can either be NUMBER or VARCHAR2 data types. The other parameters we pass to the function are:
- SUPPORT_THRESHOLD – the fraction of the total dataset in which the itemset must occur, so setting a support threshold of 0.1 means that for a itemset to be recognised it must occur in more than 10% of the rows selected by the data cursor.
- ITEMSET_LENGTH_MIN – the minimum number of items in an itemset – (between 1 and 20).
- ITEMSET_LENGTH_MAX – the maximum number of items to be considered an itemset (no more than 20 but at least the minimum value).
- INCLUDED_ITEMS – a cursor selecting a list of items where at least one of those listed must be present in the itemset (can be NULL).
- EXCLUDED_ITEMS – a cursor selecting a list of items not to be included in the itemset (can be NULL).
Included and excluded items are very useful in reducing the combinations being evaluated and may improve itemset evaluation time.
The return type from the function is a table with the following columns:
- SUPPORT – the number of times the itemset occurs.
- ITEMSET – a nested table of items that form the itemset.
- LENGTH – the number of items in the itemset.
- TOTAL_TRANX – the number of transaction being analysed in the whole dataset.
The Oracle Database PL/SQL Packages and Types Reference and the Data Warehouse Guide both give examples of how to use the functions within a SQL query – the key thing that you need to do before executing the query is to define a database TYPE for the itemset collection. For example, if our items are VARCHAR2(20) then we should create a type for a table of VARCHAR2(20), or as in this example and assuming we have a numeric item_id and want the most frequent two item combinations from the baskets.
CREATE TYPE my_frequent_item AS TABLE OF NUMBER;
SELECT CAST(itemset as my_frequent_item) itemset, support, length, total_tranx FROM table(DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL( cursor(SELECT basket_id, item_name
FROM fact_sales_line),0.05, 2,2, NULL, NULL));
You can also access the frequent itemset functionality through Oracle Data Mining, but using ODM will require the Advanced Analytics option.