I don’t think you can do this cheaply with a plain query, CTEs and window functions – their frame definition is static, but you need a dynamic frame (depending on column values).
Generally, you’ll have to define lower and upper bound of your window carefully: The following queries exclude the current row and include the lower border.
There is still a minor difference: the function includes previous peers of the current row, while the correlated subquery excludes them …
Test case
Using ts
instead of reserved word date
as column name.
CREATE TABLE test (
id bigint
, ts timestamp
);
ROM – Roman’s query
Use CTEs, aggregate timestamps into an array, unnest, count …
While correct, performance deteriorates drastically with more than a hand full of rows. There are a couple of performance killers here. See below.
ARR – count array elements
I took Roman’s query and tried to streamline it a bit:
- Remove 2nd CTE which is not necessary.
- Transform 1st CTE into subquery, which is faster.
- Direct
count()
instead of re-aggregating into an array and counting witharray_length()
.
But array handling is expensive, and performance still deteriorates badly with more rows.
SELECT id, ts
, (SELECT count(*)::int - 1
FROM unnest(dates) x
WHERE x >= sub.ts - interval '1h') AS ct
FROM (
SELECT id, ts
, array_agg(ts) OVER(ORDER BY ts) AS dates
FROM test
) sub;
COR – correlated subquery
You could solve it with a simple correlated subquery. A lot faster, but still …
SELECT id, ts
, (SELECT count(*)
FROM test t1
WHERE t1.ts >= t.ts - interval '1h'
AND t1.ts < t.ts) AS ct
FROM test t
ORDER BY ts;
FNC – Function
Loop over rows in chronological order with a row_number()
in plpgsql function and combine that with a cursor over the same query, spanning the desired time frame. Then we can just subtract row numbers:
CREATE OR REPLACE FUNCTION running_window_ct(_intv interval="1 hour")
RETURNS TABLE (id bigint, ts timestamp, ct int)
LANGUAGE plpgsql AS
$func$
DECLARE
cur CURSOR FOR
SELECT t.ts + _intv AS ts1, row_number() OVER (ORDER BY t.ts) AS rn
FROM test t ORDER BY t.ts;
rec record;
rn int;
BEGIN
OPEN cur;
FETCH cur INTO rec;
ct := -1; -- init
FOR id, ts, rn IN
SELECT t.id, t.ts, row_number() OVER (ORDER BY t.ts)
FROM test t ORDER BY t.ts
LOOP
IF rec.ts1 >= ts THEN
ct := ct + 1;
ELSE
LOOP
FETCH cur INTO rec;
EXIT WHEN rec.ts1 >= ts;
END LOOP;
ct := rn - rec.rn;
END IF;
RETURN NEXT;
END LOOP;
END
$func$;
Call with default interval of one hour:
SELECT * FROM running_window_ct();
Or with any interval:
SELECT * FROM running_window_ct('2 hour - 3 second');
Benchmark
With the table from above I ran a quick benchmark on my old test server: (PostgreSQL 9.1.9 on Debian).
-- TRUNCATE test;
INSERT INTO test
SELECT g, '2013-08-08'::timestamp
+ g * interval '5 min'
+ random() * 300 * interval '1 min' -- halfway realistic values
FROM generate_series(1, 10000) g;
CREATE INDEX test_ts_idx ON test (ts);
ANALYZE test; -- temp table needs manual analyze
I varied the bold part for each run and took the best of 5 with EXPLAIN ANALYZE
.
100 rows
ROM: 27.656 ms
ARR: 7.834 ms
COR: 5.488 ms
FNC: 1.115 ms
1000 rows
ROM: 2116.029 ms
ARR: 189.679 ms
COR: 65.802 ms
FNC: 8.466 ms
5000 rows
ROM: 51347 ms !!
ARR: 3167 ms
COR: 333 ms
FNC: 42 ms
100000 rows
ROM: DNF
ARR: DNF
COR: 6760 ms
FNC: 828 ms
The function is the clear victor. It is fastest by an order of magnitude and scales best.
Array handling cannot compete.