Query for count of distinct values in a rolling date range

Test case:

CREATE TABLE tbl (date date, email text);
INSERT INTO tbl VALUES
  ('2012-01-01', '[email protected]')
, ('2012-01-01', '[email protected]')
, ('2012-01-01', '[email protected]')
, ('2012-01-02', '[email protected]')
, ('2012-01-02', '[email protected]')
, ('2012-01-03', '[email protected]')
, ('2012-01-04', '[email protected]')
, ('2012-01-05', '[email protected]')
, ('2012-01-05', '[email protected]')
, ('2012-01-06', '[email protected]')
, ('2012-01-06', '[email protected]')
, ('2012-01-06', '[email protected]`')
;

Query – returns only days where an entry exists in tbl:

SELECT date
     ,(SELECT count(DISTINCT email)
       FROM   tbl
       WHERE  date BETWEEN t.date - 2 AND t.date -- period of 3 days
      ) AS dist_emails
FROM   tbl t
WHERE  date BETWEEN '2012-01-01' AND '2012-01-06'  
GROUP  BY 1
ORDER  BY 1;

Or – return all days in the specified range, even if there are no rows for the day:

SELECT date
     ,(SELECT count(DISTINCT email)
       FROM   tbl
       WHERE  date BETWEEN g.date - 2 AND g.date
      ) AS dist_emails
FROM  (SELECT generate_series(timestamp '2012-01-01'
                            , timestamp '2012-01-06'
                            , interval  '1 day')::date) AS g(date);

db<>fiddle here

Result:

day        | dist_emails
-----------+------------
2012-01-01 | 3
2012-01-02 | 3
2012-01-03 | 3
2012-01-04 | 3
2012-01-05 | 1
2012-01-06 | 2

This sounded like a job for window functions at first, but I did not find a way to define the suitable window frame. Also, per documentation:

Aggregate window functions, unlike normal aggregate functions, do not
allow DISTINCT or ORDER BY to be used within the function argument list.

So I solved it with correlated subqueries instead. I guess that’s the smartest way.

BTW, “between said date and 3 days ago” would be a period of 4 days. Your definition is contradictory there.

Slightly shorter, but slower for few days:

SELECT g.date, count(DISTINCT email) AS dist_emails
FROM  (SELECT generate_series(timestamp '2012-01-01'
                            , timestamp '2012-01-06'
                            , interval  '1 day')::date) AS g(date)
LEFT   JOIN tbl t ON t.date BETWEEN g.date - 2 AND g.date
GROUP  BY 1
ORDER  BY 1;

Related:

Leave a Comment