More Notes on Right-Time BI

Over the past couple of years Stewart Bryson and I have been looking into things "right-time" (or is that real-time?). It is great to have him around to trade ideas (and graphics for presentations!). Most of what we have been discussing has been about "traditional reporting", either with or without a data warehouse, and definitely in the realms of "how well have we done?". However, that is not the sole use case for right-time BI.

I have long felt that BI is only done for one of three reasons - the law says we must report things, it saves us money, or it makes us money; so if knowing something sooner gives us a competitive advantage, then surely that is a good thing. Knowing sooner is not enough though; it is also about being able to act on the information to facilitate a change in the organization that enhances return (or lowers costs). To my mind we are moving from the traditional "let's look at this in aggregate" stance to a world where we ask "what is the significance of this newly observed fact?". This type of analysis requires a body of data to create a reference model and access to smart statistical tools that allow us to make judgments based on probabilities. Making such decisions based on dynamic events is not just for stock markets and bankers; the same principles apply in many sectors. I know of some restaurant chains that have investigated using centrally monitored sales across all outlets to dynamically adjust staff levels based on likely demand - staff are sent home, brought in, or moved between outlets based on a predictive model that uses past trading patterns across many outlets.
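
To make the "reference model plus probability judgment" idea concrete, here is a minimal sketch in Python (the z-score test, the three-deviation threshold and the sample figures are my own assumptions for illustration, not a prescription): a newly observed value is scored against a body of historical observations and flagged if it looks improbable.

```python
import statistics

def significance_of(new_value, reference_values, threshold=3.0):
    """Score a newly observed fact against a reference model built from
    historical observations - here a simple z-score against the mean."""
    mean = statistics.mean(reference_values)
    stdev = statistics.stdev(reference_values)
    if stdev == 0:
        return 0.0, False
    z = abs(new_value - mean) / stdev
    return z, z >= threshold   # flag anything more than `threshold` deviations out

# e.g. recent hourly sales for one outlet versus the figure just observed
history = [1200, 1150, 1320, 1280, 1190, 1250, 1310]
score, is_significant = significance_of(2600, history)
```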

As usual, most of the building blocks we need to do this are available to us; we just need a bit of creativity to join them together into an architecture. For this kind of use I feel that messaging should be core to the data capture - we want to look at single items of "fact" and do some statistical analysis on them before adding them to the data warehouse (or whatever form our data repository takes), so that the new fact can become part of the base data set we use to analyze the next fact to arrive (a rough sketch of this message-at-a-time approach follows the list below). Micro-batch loading of log-based change data is probably less suited here as we are:

  1. adding to the latency by using discrete loads at fixed intervals, and
  2. complicating the statistical analysis and alerting phases by processing many items at a time (after all, if we get 2,453 credit card transactions in a batch, only a few will be potentially fraudulent).
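
A rough sketch of what that message-at-a-time capture might look like (the ReferenceModel class, the in-process queue standing in for the transport, and the alert callback are placeholders of my own, not a specific product API): each fact is scored the moment it arrives, so the alerting step never has to sift a handful of interesting rows out of a batch of thousands.

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class ReferenceModel:
    """Stand-in for whatever statistical model scores a single fact."""
    mean: float
    stdev: float
    alert_threshold: float = 3.0

    def score(self, amount: float) -> float:
        return abs(amount - self.mean) / self.stdev if self.stdev else 0.0

def consume(transactions: Queue, model: ReferenceModel, alert) -> None:
    """Examine each fact the moment it arrives - no waiting for the next
    scheduled load, and no batch of thousands to sift through."""
    while True:
        txn = transactions.get()     # the transport could equally be JMS, AQ, Kafka, ...
        if txn is None:              # sentinel to stop the consumer
            break
        z = model.score(txn["amount"])
        if z >= model.alert_threshold:
            alert(txn, z)            # act on this one transaction immediately
```

The same loop is also the natural place to hand each row on to the loading stage discussed below.
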
After we capture a message from the source system we can pass it through a chain of processes that analyze the information, propagate alerts based on the statistical significance of the item, and add the data to the data store so that it becomes part of the knowledge. This last stage of adding the message to the data will probably need to be in a micro-batch mode rather than one-row-at-a-time-as-it-arrives - the latencies of adding facts to a conformed OLAP system (database, cubes, whatever) are such that single-row additions will just take too much time, even if our target is an in-memory system. Here the art of the designer is to balance the availability of data, the time to reprocess the OLAP structures, and the desire to keep the system up to date. It is always worth noting that for many data domains having to-the-second data is not that important, as any new rows are unlikely to change the statistical results; however, some subject domains will need access to all of the recent information, including data that has not yet made it to the data warehouse, and this is where the creativity comes in.
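
For that last stage, a minimal sketch of the micro-batch trade-off (the buffer size, the flush interval and the load_fn routine are illustrative assumptions of mine): rows that have already been scored and alerted on individually are only written to the OLAP store once enough of them have accumulated or enough time has passed.

```python
import time

class MicroBatchLoader:
    """Buffer individually processed facts and load them in small batches,
    trading data availability against the cost of reprocessing OLAP structures."""

    def __init__(self, load_fn, max_rows=500, max_age_seconds=60):
        self.load_fn = load_fn              # e.g. a bulk insert / cube refresh routine
        self.max_rows = max_rows
        self.max_age_seconds = max_age_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, row):
        """Called once per incoming fact, after scoring and alerting."""
        self.buffer.append(row)
        age = time.monotonic() - self.last_flush
        if len(self.buffer) >= self.max_rows or age >= self.max_age_seconds:
            self.flush()

    def flush(self):
        if self.buffer:
            self.load_fn(self.buffer)       # one load call per micro-batch, not per row
            self.buffer = []
        self.last_flush = time.monotonic()
```

Tuning max_rows and max_age_seconds is exactly the balancing act described above: smaller values keep the warehouse fresher, larger ones reduce the reprocessing overhead.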