Lag, lead and sparse data
April 10th, 2007 by Peter Scott
Jonathan Lewis posts a nice example (with a subsequent comment from Jeremy expanding it to work well with RAC) of an use of the analytic LEAD function.
Which reminds me of a gotcha with LAG or LEAD with sparse data. Sparse data is commonly found in business intelligence databases at low levels of aggregation. Suppose we are looking at day-on-day changes in sales amount for “blue splurge widgets”, but on some days we don’t sell any at all.
As a simple example let’s make a table of date and number of blue splurge widgets sold:
CREATE TABLE “WAREHOUSE”.”PS_LAG_1″ ( “SALES_DATE” DATE, “QUANTITY” NUMBER ) ;
INSERT INTO “PS_LAG_1″ (SALES_DATE, QUANTITY)
VALUES (TO_DATE(’01-APR-07′, ‘DD-MON-RR’), ‘2′);
INSERT INTO “PS_LAG_1″ (SALES_DATE, QUANTITY)
VALUES (TO_DATE(’02-APR-07′, ‘DD-MON-RR’), ‘3′);
INSERT INTO “PS_LAG_1″ (SALES_DATE, QUANTITY)
VALUES (TO_DATE(’04-APR-07′, ‘DD-MON-RR’), ‘5′);
INSERT INTO “PS_LAG_1″ (SALES_DATE, QUANTITY)
VALUES (TO_DATE(’05-APR-07′, ‘DD-MON-RR’), ‘5′);
INSERT INTO “PS_LAG_1″ (SALES_DATE, QUANTITY)
VALUES (TO_DATE(’06-APR-07′, ‘DD-MON-RR’), ‘3′);
COMMIT;
NB we made no sales on the April 3.
Using the LAG function to calculate change in sales from the previous day we get:
select sales_date, quantity, quantity - lag(quantity) over (order by sales_date) delta
from ps_lag_1;
SALES_DATE QUANTITY DELTA
——— ——— —–
01-APR-07 2
02-APR-07 3 1
04-APR-07 5 2
05-APR-07 5 0
06-APR-07 3 -25 rows selected
But did we really sell two units more on April 4 than we did on April 3? The problem here is that lag and lead always measure offsets in terms of ROWS and we need to measure our offset in terms of RANGE. But all is not lost we can use an analytic function over a range window consisting of a single row; we just need to use the BETWEEN construction and have the same value for the window start and end. In this example I will use the SUM() analytic function.
select sales_date, quantity, quantity - SUM(quantity) over (order by sales_date range between interval '1' day preceding and interval '1' day preceding ) delta
from ps_lag_1;
SALES_DATE QUANTITY DELTA
———– ————- ————-
01-APR-07 2
02-APR-07 3 1
04-APR-07 5 <= Look, a NULL!
05-APR-07 5 0
06-APR-07 3 -25 rows selected
Now we get a NULL for the calculation that would include the missing day. Of course how we deal with the NULL is up to us, we just need a NVL (or CASE) to wrap up the correct business logic.

April 10th, 2007 at 8:26 pm
Well, but still on 3rd April sales was 3 less than on second. I would rather use row generator to fill in gaps.
select sal_dt, quantity, nvl(quantity,0) - nvl(SUM(quantity) over (order by sal_dt range between interval '1' day preceding and interval '1' day preceding ),0) delta
from ps_lag_1,
(select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level that gives:
07/04/01 2 2
07/04/02 3 1
07/04/03 -3
07/04/04 5 5
07/04/05 5 0
07/04/06 3 -2
07/04/07 -3
Regards,
Paweł
April 10th, 2007 at 8:28 pm
It seems that code was broken in above comment. It should be:
select sal_dt, quantity, nvl(quantity,0) - nvl(SUM(quantity) over (order by sal_dt range between interval ‘1′ day preceding and interval ‘1′ day preceding ),0) delta
from ps_lag_1,
(select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level
April 10th, 2007 at 8:29 pm
Hope this time it fill work:
select sal_dt, quantity, nvl(quantity,0) - nvl(SUM(quantity) over (order by sal_dt range between interval '1' day preceding and interval '1' day preceding ),0) delta
from ps_lag_1, (select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level <= (select max(sales_date)-min(sales_date)+2 from PS_LAG_1))
where sal_dt = sales_date(+);
April 10th, 2007 at 8:47 pm
Thanks Paweł
The original post was spurred by a notion that lag and lead are row based and not range based.
Filling in the gaps to “put in the missing zeros” is indeed probably the way most people would go for this case - row generators are one one way (as you rightly show) but another approach (Oracle 10g) is to use a partition outer join.
April 11th, 2007 at 1:56 am
but another approach (Oracle 10g) is to use a partition outer join.
Wouldn’t you still use some kind row-generator source to join with?
April 11th, 2007 at 6:21 am
Alex
Thanks for the correction - I should have been clearer in that I was referring to a synthetic row generator source such as those cited by Paweł on his blog; you are quite right I would have to have a list of available date for the join - often I use a calendar table; guess that’s sticking to my Star Schema roots.