Window Functions or Common Table Expressions: count previous rows within range

I don’t think you can do this cheaply with a plain query, CTEs and window functions – their frame definition is static, but you need a dynamic frame (depending on column values).

Generally, you’ll have to define lower and upper bound of your window carefully: The following queries exclude the current row and include the lower border.
There is still a minor difference: the function includes previous peers of the current row, while the correlated subquery excludes them …

Test case

Using ts instead of reserved word date as column name.

CREATE TABLE test (
  id  bigint
, ts  timestamp
);

ROM – Roman’s query

Use CTEs, aggregate timestamps into an array, unnest, count …
While correct, performance deteriorates drastically with more than a hand full of rows. There are a couple of performance killers here. See below.

ARR – count array elements

I took Roman’s query and tried to streamline it a bit:

  • Remove 2nd CTE which is not necessary.
  • Transform 1st CTE into subquery, which is faster.
  • Direct count() instead of re-aggregating into an array and counting with array_length().

But array handling is expensive, and performance still deteriorates badly with more rows.

SELECT id, ts
     , (SELECT count(*)::int - 1
        FROM   unnest(dates) x
        WHERE  x >= sub.ts - interval '1h') AS ct
FROM (
   SELECT id, ts
        , array_agg(ts) OVER(ORDER BY ts) AS dates
   FROM   test
   ) sub;

COR – correlated subquery

You could solve it with a simple correlated subquery. A lot faster, but still …

SELECT id, ts
     , (SELECT count(*)
        FROM   test t1
        WHERE  t1.ts >= t.ts - interval '1h'
        AND    t1.ts < t.ts) AS ct
FROM   test t
ORDER  BY ts;

FNC – Function

Loop over rows in chronological order with a row_number() in plpgsql function and combine that with a cursor over the same query, spanning the desired time frame. Then we can just subtract row numbers:

CREATE OR REPLACE FUNCTION running_window_ct(_intv interval="1 hour")
  RETURNS TABLE (id bigint, ts timestamp, ct int)
  LANGUAGE plpgsql AS
$func$
DECLARE
   cur   CURSOR FOR
         SELECT t.ts + _intv AS ts1, row_number() OVER (ORDER BY t.ts) AS rn
         FROM   test t ORDER BY t.ts;
   rec   record;
   rn    int;

BEGIN
   OPEN cur;
   FETCH cur INTO rec;
   ct := -1;  -- init

   FOR id, ts, rn IN
      SELECT t.id, t.ts, row_number() OVER (ORDER BY t.ts)
      FROM   test t ORDER BY t.ts
   LOOP
      IF rec.ts1 >= ts THEN
         ct := ct + 1;
      ELSE
         LOOP
            FETCH cur INTO rec;
            EXIT WHEN rec.ts1 >= ts;
         END LOOP;
         ct := rn - rec.rn;
      END IF;

      RETURN NEXT;
   END LOOP;
END
$func$;

Call with default interval of one hour:

SELECT * FROM running_window_ct();

Or with any interval:

SELECT * FROM running_window_ct('2 hour - 3 second');

db<>fiddle here
Old sqlfiddle

Benchmark

With the table from above I ran a quick benchmark on my old test server: (PostgreSQL 9.1.9 on Debian).

-- TRUNCATE test;
INSERT INTO test
SELECT g, '2013-08-08'::timestamp
         + g * interval '5 min'
         + random() * 300 * interval '1 min' -- halfway realistic values
FROM   generate_series(1, 10000) g;

CREATE INDEX test_ts_idx ON test (ts);
ANALYZE test;  -- temp table needs manual analyze

I varied the bold part for each run and took the best of 5 with EXPLAIN ANALYZE.

100 rows
ROM: 27.656 ms
ARR: 7.834 ms
COR: 5.488 ms
FNC: 1.115 ms

1000 rows
ROM: 2116.029 ms
ARR: 189.679 ms
COR: 65.802 ms
FNC: 8.466 ms

5000 rows
ROM: 51347 ms !!
ARR: 3167 ms
COR: 333 ms
FNC: 42 ms

100000 rows
ROM: DNF
ARR: DNF
COR: 6760 ms
FNC: 828 ms

The function is the clear victor. It is fastest by an order of magnitude and scales best.
Array handling cannot compete.

Leave a Comment