Lag, lead and sparse data

Jonathan Lewis posts a nice example (with a subsequent comment from Jeremy expanding it to work well with RAC) of an use of the analytic LEAD function.

Which reminds me of a gotcha with LAG or LEAD with sparse data. Sparse data is commonly found in business intelligence databases at low levels of aggregation. Suppose we are looking at day-on-day changes in sales amount for "blue splurge widgets", but on some days we don't sell any at all.

As a simple example let's make a table of date and number of blue splurge widgets sold:
CREATE TABLE "WAREHOUSE"."PS_LAG_1" ( "SALES_DATE" DATE, "QUANTITY" NUMBER ) ;
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('01-APR-07', 'DD-MON-RR'), '2');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('02-APR-07', 'DD-MON-RR'), '3');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('04-APR-07', 'DD-MON-RR'), '5');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('05-APR-07', 'DD-MON-RR'), '5');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('06-APR-07', 'DD-MON-RR'), '3');
COMMIT;

NB we made no sales on the April 3.
Using the LAG function to calculate change in sales from the previous day we get:
select sales_date, quantity, quantity - lag(quantity) over (order by sales_date) delta
from ps_lag_1;

SALES_DATE QUANTITY DELTA --------- --------- ----- 01-APR-07 2 02-APR-07 3 1 04-APR-07 5 2 05-APR-07 5 0 06-APR-07 3 -2

5 rows selected


But did we really sell two units more on April 4 than we did on April 3? The problem here is that lag and lead always measure offsets in terms of ROWS and we need to measure our offset in terms of RANGE. But all is not lost we can use an analytic function over a range window consisting of a single row; we just need to use the BETWEEN construction and have the same value for the window start and end. In this example I will use the SUM() analytic function.
select sales_date, quantity, quantity - SUM(quantity) over (order by sales_date range between interval '1' day preceding and interval '1' day preceding ) delta
from ps_lag_1;

SALES_DATE QUANTITY DELTA ----------- ------------- ------------- 01-APR-07 2 02-APR-07 3 1 04-APR-07 5 <= Look, a NULL! 05-APR-07 5 0 06-APR-07 3 -2

5 rows selected


Now we get a NULL for the calculation that would include the missing day. Of course how we deal with the NULL is up to us, we just need a NVL (or CASE) to wrap up the correct business logic.